In [16]:
import spacy

In [17]:
model = spacy.load("en_core_web_sm")

doc = "Sumit is an adjunct faculty at Upgrad."

processed_doc = model(doc)  # process the input and run the NLP pipeline

Since named entities are typically nouns (usually proper nouns), let us see what information we get from the POS tags.

In [18]:
for token in processed_doc:
    print(token.text, " -- ", token.pos_)

Sumit  --  NOUN
is  --  AUX
an  --  DET
adjunct  --  ADJ
faculty  --  NOUN
at  --  ADP
Upgrad  --  PROPN
.  --  PUNCT


So we see that the POS tags mark both candidate entities as nouns: Upgrad as a proper noun (PROPN) and Sumit as a common noun (NOUN). Let us look at the output of the NER system in spaCy to understand the differences.

In [19]:
for ent in processed_doc.ents:
    print(ent.text, " -- ", ent.start_char, " -- ", ent.end_char, " -- ", ent.label_)

Upgrad  --  31  --  37  --  ORG


Okay, so we did find a named entity, but clearly we missed the faculty member's name, maybe because the model does not recognize Sumit as a person's name.


In [25]:
doc2  = "Dr. Sumit is an adjunct faculty at Upgrad"
processed_doc2 = model(doc2)  # process the input and run the NLP pipeline

In [26]:
for token in processed_doc2:
    print(token.text, " -- ", token.pos_)

Dr.  --  PROPN
Sumit  --  PROPN
is  --  AUX
an  --  DET
adjunct  --  ADJ
faculty  --  NOUN
at  --  ADP
Upgrad  --  PROPN


In [27]:
for ent in processed_doc2.ents:
    print(ent.text, " -- ", ent.start_char, " -- ", ent.end_char, " -- ", ent.label_)

Sumit  --  4  --  9  --  PERSON
Upgrad  --  35  --  41  --  ORG


With the title "Dr." in front of the name, the model was able to tag Sumit correctly as a PERSON.

In [28]:
doc3 = "Statue of Liberty is situated in New York, USA."
processed_doc3 = model(doc3)  # process the input and run the NLP pipeline

In [30]:
for token in processed_doc3:
    print(token.text, " -- ", token.pos_)

Statue  --  PROPN
of  --  ADP
Liberty  --  PROPN
is  --  AUX
situated  --  VERB
in  --  ADP
New  --  PROPN
York  --  PROPN
,  --  PUNCT
USA  --  PROPN
.  --  PUNCT


In [31]:
for ent in processed_doc3.ents:
    print(ent.text, " -- ", ent.start_char, " -- ", ent.end_char, " -- ", ent.label_)

New York  --  33  --  41  --  GPE
USA  --  43  --  46  --  GPE


The system did not recognize "Statue of Liberty" as a named entity.

Let us see the output of NER at the token level, which illustrates the IOB format discussed in the lectures.

In [33]:
for token in processed_doc3:
    print(token.text, " -- ", token.ent_iob_, " -- ", token.ent_type_)

Statue  --  O  --  
of  --  O  --  
Liberty  --  O  --  
is  --  O  --  
situated  --  O  --  
in  --  O  --  
New  --  B  --  GPE
York  --  I  --  GPE
,  --  O  --  
USA  --  B  --  GPE
.  --  O  --  
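To make the IOB scheme concrete, here is a minimal sketch in plain Python that groups token-level tags back into entity spans. The function name `iob_to_spans` is hypothetical and not part of spaCy; the input triples are copied from the output above. Joining tokens with a single space is a simplification that ignores the original whitespace.

```python
def iob_to_spans(tagged_tokens):
    """Group (text, iob, ent_type) triples into (entity_text, label) pairs.

    Hypothetical helper for illustration; spaCy already exposes the
    grouped entities through doc.ents.
    """
    spans = []
    current_tokens, current_label = [], None
    for text, iob, ent_type in tagged_tokens:
        if iob == "B":
            # A new entity begins; close the previous one if it is still open
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], ent_type
        elif iob == "I" and current_tokens:
            # Continuation of the entity that is currently open
            current_tokens.append(text)
        else:
            # "O" (outside any entity): close the open entity, if any
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Token-level tags copied from the output above
tagged = [
    ("Statue", "O", ""), ("of", "O", ""), ("Liberty", "O", ""),
    ("is", "O", ""), ("situated", "O", ""), ("in", "O", ""),
    ("New", "B", "GPE"), ("York", "I", "GPE"), (",", "O", ""),
    ("USA", "B", "GPE"), (".", "O", ""),
]
entity_spans = iob_to_spans(tagged)
print(entity_spans)  # [('New York', 'GPE'), ('USA', 'GPE')]
```

The B tag opens an entity, I tags extend it, and an O tag (or the next B) closes it, which is why "New York" comes back as one two-token entity while "USA" is a separate one.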


- You can use spaCy's NER model to identify named entities in input text.
- You also studied some cases where the model is not able to correctly identify all the entities involved.
- There are various situations in which NER systems make errors. Depending on the application and on the severity and types of errors, follow-up corrective measures can be employed (manual validation, discarding erroneous outputs, using heuristics, etc.).
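As one possible heuristic, a small lookup list (gazetteer) can add entities the model missed, such as "Statue of Liberty" in the earlier sentence. The sketch below is plain Python and makes several assumptions: the function `add_gazetteer_entities`, the gazetteer contents, and the choice of the `FAC` label are all illustrative, and the model entities are written out by hand from the character offsets printed earlier.

```python
def add_gazetteer_entities(text, entities, gazetteer):
    """Add gazetteer matches that do not overlap an existing entity.

    entities:  list of (start_char, end_char, label) tuples from the model.
    gazetteer: dict mapping a surface string to a label (hypothetical data).
    """
    merged = list(entities)
    for phrase, label in gazetteer.items():
        start = text.find(phrase)
        while start != -1:
            end = start + len(phrase)
            # Keep the model's decision wherever the two sources overlap
            overlaps = any(start < e_end and e_start < end
                           for e_start, e_end, _ in merged)
            if not overlaps:
                merged.append((start, end, label))
            start = text.find(phrase, end)
    return sorted(merged)


text = "Statue of Liberty is situated in New York, USA."
# Offsets and labels copied from the NER output shown earlier
model_entities = [(33, 41, "GPE"), (43, 46, "GPE")]
# Hypothetical gazetteer; the FAC label is an assumption for illustration
gazetteer = {"Statue of Liberty": "FAC"}

result = add_gazetteer_entities(text, model_entities, gazetteer)
for start, end, label in result:
    print(text[start:end], " -- ", start, " -- ", end, " -- ", label)
```

Exact string matching is brittle (it is case-sensitive and ignores word boundaries), so a heuristic like this is best treated as a supplement to the model's output rather than a replacement for it.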

Let us now consider one practical application of NER systems: anonymizing data and redacting personally identifying information.

- In many scenarios, we want to withhold sensitive information, such as the names of persons, from confidential documents.
- We can use NER methods to automatically identify PERSON entities in text and remove their names.

Let us see how this can be done with what we have learnt so far. For illustration, this demo uses an example e-mail from the Enron e-mail dataset.

- E-mail source: http://www.enron-mail.com/email/lay-k/elizabeth/Christmas_in_Aspen_4.html

- Complete Enron data: http://www.enron-mail.com/

In [34]:
email = ('Dear Family, Jose Luis and I have changed our dates, we are '
         'going to come to Aspen on the 23rd of December and leave on the '
         '30th of December. We would like to stay in the front bedroom of '
         'the Aspen Cottage so that Mark, Natalie and Zachary can stay in '
         'the guest cottage. Please let me know if there are any problems '
         'with this. If I do not hear anything, I will assume this is all '
         'o.k. with you.'
         'Love, Liz')

In [35]:
processed_email = model(email)  # process the input and run the NLP pipeline

In [37]:
anonymized_email = list(email)  # list of characters, so individual positions can be overwritten

for ent in processed_email.ents:
    if ent.label_ == "PERSON":
        # Mask every character of the PERSON entity with "*"
        for char_pos in range(ent.start_char, ent.end_char):
            anonymized_email[char_pos] = "*"

print("\n\n-- After Anonymization --\n")
"".join(anonymized_email)



-- After Anonymization --



'Dear Family, ********* and I have changed our dates, we are going to come to Aspen on the 23rd of December and leave on the 30th of December. We would like to stay in the front bedroom of the Aspen Cottage so that ****, ******* and ******* can stay in the guest cottage. Please let me know if there are any problems with this. If I do not hear anything, I will assume this is all o.k. with you.Love, ***'
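Masking with asterisks keeps the length of each name visible. A variation, sketched below under stated assumptions, replaces each span with a fixed placeholder such as `[PERSON]`. The helper `redact_spans` is hypothetical, and the spans in the demo are derived with `str.find` on a short sample sentence rather than taken from the model, so that the sketch runs without spaCy.

```python
def redact_spans(text, spans, placeholder="[{label}]"):
    """Replace each (start_char, end_char, label) span with a placeholder.

    Hypothetical helper for illustration; assumes the spans do not overlap.
    """
    redacted = text
    # Work from the end of the text so that earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        redacted = redacted[:start] + placeholder.format(label=label) + redacted[end:]
    return redacted


sample = "Mark, Natalie and Zachary can stay in the guest cottage."
names = ["Mark", "Natalie", "Zachary"]
# Character offsets computed by hand here; with spaCy they would come from doc.ents
sample_spans = [(sample.find(n), sample.find(n) + len(n), "PERSON") for n in names]

redacted_sample = redact_spans(sample, sample_spans)
print(redacted_sample)
# [PERSON], [PERSON] and [PERSON] can stay in the guest cottage.
```

With the processed e-mail, the spans could be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in processed_email.ents if ent.label_ == "PERSON"]`. Replacing from the last span backwards matters because each replacement changes the length of the string, which would otherwise shift the offsets of the spans that follow it.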