# Customized Named Entity Recognition

- Using a Kaggle dataset
- https://www.kaggle.com/davidg089/all-djtrum-tweets

# 1)- Importing key modules

In [1]:
from __future__ import unicode_literals

In [2]:
import spacy
from spacy.lang.en import English
from spacy.pipeline import EntityRuler
import re
import string
import pdftotext  # for PDF-to-text conversion
import docx2txt  # for converting .docx files to plain text
from collections import Counter
import sys
import pandas as pd
from collections import defaultdict
import codecs # for encoding scheme of text files
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

In [3]:
nlp = spacy.load('en_core_web_md')

# 2)- Loading Data

In [4]:
tweets = pd.read_csv("all_djt_tweets.csv")



In [5]:
tweets.head()

| | Unnamed: 0 | source | text | created_at | retweet_count | favorite_count | is_retweet | id_str |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Twitter for iPhone | Over 90% approval rating for your all time fav... | Mon Aug 27 00:39:38 +0000 2018 | 27040 | 106838.0 | False | 1.033877e+18 |
| 1 | 1 | Twitter for iPhone | “Mainstream Media tries to rewrite history to ... | Sun Aug 26 22:01:33 +0000 2018 | 21346 | 76682.0 | False | 1.033837e+18 |
| 2 | 2 | Twitter for iPhone | Fantastic numbers on consumer spending release... | Sun Aug 26 14:31:06 +0000 2018 | 18960 | 87334.0 | False | 1.033724e+18 |
| 3 | 3 | Twitter for iPhone | ...And it will get, as I have always said, muc... | Sun Aug 26 14:27:16 +0000 2018 | 14963 | 62956.0 | False | 1.033723e+18 |
| 4 | 4 | Twitter for iPhone | RT @realDonaldTrump: Social Media Giants are s... | Sun Aug 26 14:25:47 +0000 2018 | 50142 | 0.0 | True | 1.033722e+18 |


Our feature of interest is the "text" column.

In [6]:
tweets.shape

(328053, 8)

In [7]:
tweets.text[0]

'Over 90% approval rating for your all time favorite (I hope) President within the Republican Party and 52% overall. This despite all of the made up stories by the Fake News Media trying endlessly to make me look as bad and evil as possible. Look at the real villains please!'

In [15]:
tweets.dtypes

Unnamed: 0         object
source             object
text               object
created_at         object
retweet_count      object
favorite_count    float64
is_retweet         object
id_str            float64
dtype: object

In [16]:
tweets.isnull().sum()

Unnamed: 0        163840
source            277677
text              277677
created_at        293213
retweet_count     293213
favorite_count    293213
is_retweet        293216
id_str            293216
dtype: int64

Most rows contain missing values: "text" is null in 277,677 of the 328,053 rows. Let's concentrate on "text" for our analysis.
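One way to act on the null counts above is to drop the rows that have no tweet text before passing anything to spaCy. The snippet below is a minimal sketch on a hypothetical four-row frame (`sample` and `clean_text` are illustrative names, not part of the notebook); the same two calls could be applied to `tweets['text']`.

```python
import pandas as pd

# Hypothetical miniature of the tweets frame; the real one comes from pd.read_csv
sample = pd.DataFrame({
    "text": ["First tweet", None, "Second tweet", float("nan")],
    "is_retweet": [False, None, True, None],
})

# Keep only rows that actually contain tweet text, then renumber them from 0
clean_text = sample["text"].dropna().reset_index(drop=True)

print(len(sample), "rows before,", len(clean_text), "rows after")
```

Note that `reset_index(drop=True)` renumbers the rows, so positions such as `[15]` or `[20]` used later in the notebook would point at different tweets after cleaning.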

# 3)- Feature of Analysis

In [17]:
tweet_text=tweets['text']

In [18]:
tweet_text[0]

'Over 90% approval rating for your all time favorite (I hope) President within the Republican Party and 52% overall. This despite all of the made up stories by the Fake News Media trying endlessly to make me look as bad and evil as possible. Look at the real villains please!'

In [23]:
type(tweet_text)

pandas.core.series.Series

In [24]:
len(tweet_text)

328053

In [20]:
doc = nlp(tweet_text[0])

In [21]:
type(doc)

spacy.tokens.doc.Doc

In [22]:
doc

Over 90% approval rating for your all time favorite (I hope) President within the Republican Party and 52% overall. This despite all of the made up stories by the Fake News Media trying endlessly to make me look as bad and evil as possible. Look at the real villains please!

In [25]:
len(doc)

57

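Note the difference between the two lengths: `len(doc)` is 57, the number of spaCy tokens in the first tweet, whereas `len(tweet_text)` is 328053, the number of rows in the pandas Series.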

In [26]:
for ent in doc.ents:
    print(f'Entity: {ent}, Label: {ent.label_}, {spacy.explain(ent.label_)}')

Entity: 90%, Label: PERCENT, Percentage, including "%"
Entity: the Republican Party, Label: ORG, Companies, agencies, institutions, etc.
Entity: 52%, Label: PERCENT, Percentage, including "%"
Entity: the Fake News Media, Label: ORG, Companies, agencies, institutions, etc.
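
The same loop over `doc.ents` can be extended from one tweet to many by tallying labels with a `Counter`. This is a rough sketch, not the notebook's own code: `count_entity_labels` is a hypothetical helper, and the two example docs use hand-set entities on a blank pipeline so that no model has to be loaded.

```python
from collections import Counter

import spacy
from spacy.tokens import Span


def count_entity_labels(docs):
    """Tally entity labels over an iterable of spaCy Doc objects."""
    counts = Counter()
    for doc in docs:
        counts.update(ent.label_ for ent in doc.ents)
    return counts


# Stand-in docs with hand-set entities (assumption: avoids loading a model here)
blank = spacy.blank("en")
doc_a = blank("Paris is lovely.")
doc_a.ents = [Span(doc_a, 0, 1, label="GPE")]
doc_b = blank("Apple opened an office in Berlin.")
doc_b.ents = [Span(doc_b, 0, 1, label="ORG"), Span(doc_b, 5, 6, label="GPE")]

label_counts = count_entity_labels([doc_a, doc_b])
print(label_counts.most_common())
```

With the notebook's trained pipeline, the helper could presumably be fed `nlp.pipe(tweet_text.dropna())`, which processes the texts as a stream instead of one call per tweet.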


### Visualize results

In [27]:
displacy.render(doc,style='ent',jupyter=True)

In [31]:
doc2 = nlp(tweet_text[15])

In [32]:
displacy.render(doc2,style='ent',jupyter=True)

# 4)- Redacting Entities

Suppose we want to hide the names in the tweets above, i.e. to redact them automatically.

We shall follow these steps:

- 1. Find all PERSON entities.
- 2. Replace each of them with a filler such as "[REDACTED]".

In [34]:
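def redact_names(text):
    doc = nlp(text)
    # Merge each multi-token entity into one token so it is replaced only once.
    # doc.retokenize() replaces the older Span.merge(), which newer spaCy releases removed.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    redacted_sentence = []
    for token in doc:
        if token.ent_type_ == "PERSON":
            # Keep the trailing whitespace so the next word is not glued to the filler
            redacted_sentence.append("[REDACTED]" + token.whitespace_)
        else:
            # text_with_ws is the current name for the older token.string
            redacted_sentence.append(token.text_with_ws)
    return "".join(redacted_sentence)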

Thanks to https://www.kaggle.com/nirant/hitchhiker-s-guide-to-nlp-in-spacy/data

In [35]:
one_tweet = redact_names(tweets['text'][15])
doc = nlp(one_tweet)

In [36]:
spacy.displacy.render(doc, style='ent',jupyter=True)

So we have hidden the person's name. The same can be done with other entity types as well.
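
As a sketch of how redaction might be generalised to other entity types, the function below takes a set of labels and replaces matching entities by character offsets instead of merging tokens. `redact_entities` and its parameters are hypothetical names, and the example sentence uses hand-set entities on a blank pipeline (an assumption made so the snippet runs without a trained model).

```python
import spacy
from spacy.tokens import Span


def redact_entities(doc, labels=("PERSON",), filler="[REDACTED]"):
    """Replace every entity whose label is in `labels` with `filler`."""
    pieces = []
    cursor = 0
    # doc.ents is ordered by position and its spans do not overlap
    for ent in doc.ents:
        if ent.label_ in labels:
            pieces.append(doc.text[cursor:ent.start_char])
            pieces.append(filler)
            cursor = ent.end_char
    pieces.append(doc.text[cursor:])
    return "".join(pieces)


# Hypothetical example with hand-set entities instead of model predictions
blank = spacy.blank("en")
example = blank("Angela Merkel met Emmanuel Macron in Paris.")
example.ents = [
    Span(example, 0, 2, label="PERSON"),
    Span(example, 3, 5, label="PERSON"),
    Span(example, 6, 7, label="GPE"),
]

names_only = redact_entities(example)
names_and_places = redact_entities(example, labels={"PERSON", "GPE"})
print(names_only)
print(names_and_places)
```

Working on character offsets leaves the original spacing and punctuation untouched, and the `Doc` itself is not modified.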

# 5)- Showing Specific Entities

In [51]:
doc3=nlp(tweet_text[20])

In [52]:
spacy.displacy.render(doc3, style='ent',jupyter=True)

Suppose that, in the case above, we want to see only the "GPE" entities.

### 5a)- Using options in displacy

In [53]:
options={'ents':['GPE']}

In [54]:
displacy.render(doc3,style='ent',jupyter=True, options=options)

What if we want to view several entity types of our choice, e.g. GPE and PERSON?

In [55]:
options={'ents':['GPE','PERSON']}

In [56]:
displacy.render(doc3,style='ent',jupyter=True, options=options)
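
Besides `ents`, the `options` dictionary of the entity visualizer also accepts a `colors` mapping from label to colour. The sketch below combines both and renders to an HTML string with `jupyter=False`, so the effect of the filter can be checked outside a notebook. The sentence, its hand-set entities, and the colour value are illustrative assumptions.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Hypothetical doc with hand-set entities, so no trained model is required
blank = spacy.blank("en")
demo = blank("Angela Merkel flew to Paris.")
demo.ents = [Span(demo, 0, 2, label="PERSON"), Span(demo, 4, 5, label="GPE")]

# Show only GPE entities and give them a custom background colour
highlight_options = {"ents": ["GPE"], "colors": {"GPE": "#ffd966"}}

# jupyter=False returns the markup as a string instead of displaying it
html = displacy.render(demo, style="ent", options=highlight_options, jupyter=False)
print(html.count("<mark"), "entity highlighted")
```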

### 5b)- Using filtering