# Automatic Redaction/Sanitization of Documents Using spaCy
Sanitization is the process of removing sensitive information from a document or other message (or sometimes encrypting it) so that the document may be distributed to a broader audience.

#### Purpose of Sanitization/Redaction of Document

In [1]:
# Load NLP Pkg
import spacy

In [2]:
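# Create NLP object
# Note: spaCy v3 removed the 'en' shortcut, so the full model name is used here
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')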

In [3]:
ex1 = "The reporter said that it was John Mark that gave him the news in London last year"

In [4]:
docx1 = nlp(ex1)

In [5]:
# Find Entities
for ent in docx1.ents:
    print(ent.text,ent.label_)

John Mark PERSON
London GPE
last year DATE


In [6]:
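# Function to Sanitize/Redact Names
def sanitize_names(text):
    docx = nlp(text)
    redacted_sentences = []
    # Merge each multi-token entity into a single token
    # (Span.merge() was removed in spaCy v3; the retokenizer replaces it)
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent)
    for token in docx:
        if token.ent_type_ == 'PERSON':
            # Keep the trailing whitespace so the next word is not glued to the mask
            redacted_sentences.append("[REDACTED]" + token.whitespace_)
        else:
            # text_with_ws replaces the removed token.string attribute
            redacted_sentences.append(token.text_with_ws)
    return "".join(redacted_sentences)

In [7]:
ex1

'The reporter said that it was John Mark that gave him the news in London last year'

## Redact the Names

In [8]:
sanitize_names(ex1)

'The reporter said that it was [REDACTED] that gave him the news in London last year'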

In [9]:
# Visualization of Entities
from spacy import displacy
displacy.render(nlp(ex1),style='ent',jupyter=True)

In [10]:
# Apply the function and visualize it
docx2 = sanitize_names(ex1)
displacy.render(nlp(docx2),style='ent',jupyter=True)
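
Re-running the pipeline on the redacted text shows which entities remain, but the model is not guaranteed to highlight the `[REDACTED]` placeholders themselves. As a rough sketch, the placeholders can be marked explicitly using displaCy's manual mode, which accepts a plain dict of text and character offsets. The helper name `mask_display_data` and the `REDACTED` label are made up for this illustration.

```python
import re

def mask_display_data(redacted_text, mask="[REDACTED]", label="REDACTED"):
    """Build a displaCy 'manual' entity dict that marks every mask occurrence.

    `mask_display_data` and the `REDACTED` label are hypothetical names
    chosen for this sketch; they are not part of spaCy.
    """
    ents = [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in re.finditer(re.escape(mask), redacted_text)
    ]
    return {"text": redacted_text, "ents": ents, "title": None}

# Possible use in the notebook (needs spaCy installed):
# displacy.render(mask_display_data(docx2), style='ent', manual=True, jupyter=True)
```

Because the offsets come from a plain string search, this view highlights only the masks and does not depend on what the model predicts for the redacted sentence.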

## Redaction/Sanitization of Location/GPE

In [11]:
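def sanitize_locations(text):
    docx = nlp(text)
    redacted_sentences = []
    # Merge each multi-token entity into a single token
    with docx.retokenize() as retokenizer:
        for ent in docx.ents:
            retokenizer.merge(ent)
    for token in docx:
        if token.ent_type_ == 'GPE':
            # Keep the trailing whitespace so the next word is not glued to the mask
            redacted_sentences.append("[REDACTED]" + token.whitespace_)
        else:
            redacted_sentences.append(token.text_with_ws)
    return "".join(redacted_sentences)

In [12]:
sanitize_locations(ex1)

'The reporter said that it was John Mark that gave him the news in [REDACTED] last year'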
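
`sanitize_names` and `sanitize_locations` differ only in the entity label they check, so the two could be folded into one function that takes a set of labels. The following is a minimal sketch under that assumption, not the notebook's own method: it works on character offsets (`ent.start_char`, `ent.end_char`) instead of merging tokens, and the names `redact_spans` and `redact_entities` are hypothetical.

```python
def redact_spans(text, spans, labels, mask="[REDACTED]"):
    """Replace each (start, end, label) span whose label is in `labels`.

    Spans are assumed not to overlap, which holds for spaCy's `doc.ents`.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in labels:
            continue
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(mask)                # the entity itself is masked
        cursor = end
    pieces.append(text[cursor:])           # remaining tail of the text
    return "".join(pieces)


def redact_entities(text, nlp, labels, mask="[REDACTED]"):
    """Run a spaCy pipeline (`nlp`) and redact entities with the given labels."""
    doc = nlp(text)
    spans = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    return redact_spans(text, spans, labels, mask)

# Possible use in the notebook:
# redact_entities(ex1, nlp, {'PERSON', 'GPE'})
```

Slicing the original string by offsets leaves all whitespace outside the entity untouched, so the spacing around each mask is preserved without any per-token handling.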