<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Packages" data-toc-modified-id="Load-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Packages</a></span></li><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Detect-The-Named-Entities-Using-SpaCy" data-toc-modified-id="Detect-The-Named-Entities-Using-SpaCy-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Detect The Named Entities Using SpaCy</a></span><ul class="toc-item"><li><span><a href="#How-spacy-pre-trained-named-eintity-tagger-work?" data-toc-modified-id="How-spacy-pre-trained-named-eintity-tagger-work?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>How spacy pre-trained named eintity tagger work?</a></span></li><li><span><a href="#Apply-to-all-sentences" data-toc-modified-id="Apply-to-all-sentences-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Apply to all sentences</a></span></li><li><span><a href="#Find-The-Most-Common-Named-Entity-Types-Overall-And-By-Target-Class." data-toc-modified-id="Find-The-Most-Common-Named-Entity-Types-Overall-And-By-Target-Class.-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Find The Most Common Named Entity Types Overall And By Target Class.</a></span></li><li><span><a href="#Find-The-Most-Common-Named-Entities-Overall-And-By-Class" data-toc-modified-id="Find-The-Most-Common-Named-Entities-Overall-And-By-Class-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Find The Most Common Named Entities Overall And By Class</a></span></li></ul></li><li><span><a href="#EOF" data-toc-modified-id="EOF-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>EOF</a></span></li></ul></div>

***
<br>
<span style="font-size:40pt; color:skyblue;"><b>
Text Analysis  
<br>
With Named Entity Recognition  
</b></span>

<img src="spacy.jpg" alt="Drawing" style="width: 500px;" align="left"/>

In this analysis,  
we use the “Quora Insincere Questions Classification” dataset from Kaggle.  
  
https://www.kaggle.com/c/quora-insincere-questions-classification/data
***

# Load Packages

In [30]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import gc
import os
import psutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(font_scale=1.2)

import warnings

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

plt.style.use('ggplot')

from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

import spacy
from spacy import displacy

from collections import Counter

# Load Data

In [3]:
train = pd.read_csv('train_s.zip')

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
qid              10000 non-null object
question_text    10000 non-null object
target           10000 non-null int64
dtypes: int64(1), object(2)
memory usage: 234.5+ KB


In [4]:
train.head()

|   | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


# Detect The Named Entities Using SpaCy

We use the pre-trained named entity tagger of the popular spaCy library.  
The models are trained on the OntoNotes 5 corpus and support 18 named entity types.  
The list of supported types can be found here: https://spacy.io/usage/linguistic-features#entity-types

## How does spaCy's pre-trained named entity tagger work?

In [19]:
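# Older spaCy typically linked 'en' to en_core_web_sm; spaCy 3 removed such shortcuts, so load the model by name
nlp = spacy.load('en_core_web_sm', disable=['parser'])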

In [20]:
sentence = str(train.iloc[0].question_text)
print(sentence)

How did Quebec nationalists see their province as a nation in the 1960s?


In [21]:
sentence_nlp = nlp(sentence)

Check which tokens are tagged as part of a named entity:

In [22]:
print([(word, word.ent_type_) for word in sentence_nlp if word.ent_type_])

[(Quebec, 'GPE'), (the, 'DATE'), (1960s, 'DATE')]
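
Note that the output above is token-level: "the" and "1960s" are listed separately although they belong to one DATE entity. spaCy also exposes complete entity spans through `doc.ents` (each span has `.text` and `.label_`). The sketch below shows the same grouping idea in plain Python, using each token's `ent_iob_` code. The `Tok` tuple and the hand-written B/I/O values are stand-ins assumed for illustration, so the block runs without a model; joining tokens with a space is also a simplification.

```python
from collections import namedtuple

# Stand-in for a spaCy token (illustration only, not a spaCy class).
Tok = namedtuple("Tok", ["text", "ent_type_", "ent_iob_"])

def group_entity_tokens(tokens):
    """Merge consecutive entity tokens into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for tok in tokens:
        # An "I" token continues the entity that is currently open.
        inside = bool(tok.ent_iob_ == "I" and current
                      and tok.ent_type_ == current_type)
        if inside:
            current.append(tok.text)
            continue
        # Anything else closes the open entity, if there is one.
        if current:
            spans.append((" ".join(current), current_type))
            current, current_type = [], None
        # A tagged token that is not a continuation starts a new entity.
        if tok.ent_type_:
            current, current_type = [tok.text], tok.ent_type_
    if current:
        spans.append((" ".join(current), current_type))
    return spans

# Hand-written tags mirroring the token-level output above (assumed IOB codes).
tokens = [
    Tok("Quebec", "GPE", "B"),
    Tok("nationalists", "", "O"),
    Tok("in", "", "O"),
    Tok("the", "DATE", "B"),
    Tok("1960s", "DATE", "I"),
    Tok("?", "", "O"),
]
print(group_entity_tokens(tokens))
# [('Quebec', 'GPE'), ('the 1960s', 'DATE')]
```

With a real parse, `group_entity_tokens(sentence_nlp)` should work the same way, since spaCy tokens carry `text`, `ent_type_` and `ent_iob_` attributes.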


Visualization

In [24]:
displacy.render(sentence_nlp, style='ent', jupyter=True)

## Apply to all sentences

In [23]:
%%time
ner_questions = [(nlp(s), t) for s, t in zip(train.question_text, train.target)]

Wall time: 1min 28s
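
Calling `nlp(s)` once per question processes the texts one at a time. spaCy's `nlp.pipe` processes an iterable of texts in batches and yields the parsed documents in input order, which is usually faster. The helper below is a minimal sketch of that variant; the name `parse_questions` and the `batch_size` of 256 are assumptions, not values from the original run, and no timing is claimed for it.

```python
def parse_questions(nlp, texts, targets, batch_size=256):
    """Pair each parsed document with its target label, parsing in batches.

    `nlp` is expected to be a loaded spaCy pipeline; only its `pipe`
    method is used here.
    """
    docs = nlp.pipe(texts, batch_size=batch_size)
    # nlp.pipe yields documents in input order, so zip keeps labels aligned.
    return list(zip(docs, targets))
```

Usage would mirror the cell above: `ner_questions = parse_questions(nlp, train.question_text, train.target)`.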


## Find The Most Common Named Entity Types Overall And By Target Class.

In [25]:
net_question = [([w.ent_type_ for w in s if w.ent_type_], t) for s, t in ner_questions]

# if no named entity found, add "None"
net_question = [(s, t) if s else (["None"], t) for s, t in net_question]

In [28]:
net_question[:5]

[(['GPE', 'DATE', 'DATE'], 0),
 (['None'], 0),
 (['None'], 0),
 (['PERSON', 'PERSON', 'PERSON', 'PERSON'], 0),
 (['None'], 0)]

Flatten the named entity types list

In [29]:
ners = [ne for n in net_question for ne in n[0]]

In [32]:
ner_cnt = Counter(ners)
ner_cnt.most_common(10)

[('None', 4819),
 ('ORG', 3248),
 ('PERSON', 2044),
 ('GPE', 2002),
 ('DATE', 1655),
 ('NORP', 1118),
 ('CARDINAL', 711),
 ('LOC', 243),
 ('WORK_OF_ART', 210),
 ('PRODUCT', 167)]

Group the named entity types by target class

- Named entity types in sincere questions (`target == 0`)

In [33]:
sincere_ners = [ne for n in net_question for ne in n[0] if not n[1]]
ner_cnt = Counter(sincere_ners)
ner_cnt.most_common(10)

[('None', 4646),
 ('ORG', 3013),
 ('PERSON', 1805),
 ('GPE', 1797),
 ('DATE', 1589),
 ('NORP', 731),
 ('CARDINAL', 668),
 ('LOC', 213),
 ('WORK_OF_ART', 192),
 ('PRODUCT', 163)]

- Named entity types in insincere questions (`target == 1`)

In [34]:
insincere_ners = [ne for n in net_question for ne in n[0] if n[1]]
ner_cnt = Counter(insincere_ners)
ner_cnt.most_common(10)

[('NORP', 387),
 ('PERSON', 239),
 ('ORG', 235),
 ('GPE', 205),
 ('None', 173),
 ('DATE', 66),
 ('CARDINAL', 43),
 ('LOC', 30),
 ('WORK_OF_ART', 18),
 ('EVENT', 13)]
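
The raw counts above are hard to compare across classes for two reasons: the classes differ greatly in size, and `'None'` is counted once per question while every other type is counted once per token. One possible way to put both classes on the same scale is to compute, per class, the share of questions that contain each entity type at least once. The sketch below assumes the `net_question` structure built above; the function name and the tiny sample are illustrative, not real results.

```python
from collections import Counter, defaultdict

def entity_type_share_by_class(net_question):
    """Return {target: {entity_type: share of questions containing it}}."""
    question_totals = Counter()
    type_counts = defaultdict(Counter)
    for types, target in net_question:
        question_totals[target] += 1
        # set() counts each entity type at most once per question.
        type_counts[target].update(set(types))
    return {
        target: {t: n / question_totals[target] for t, n in counts.items()}
        for target, counts in type_counts.items()
    }

# Hand-made sample in the same shape as net_question (not real data).
sample = [
    (['GPE', 'DATE', 'DATE'], 0),
    (['None'], 0),
    (['NORP'], 1),
    (['NORP', 'PERSON'], 1),
]
shares = entity_type_share_by_class(sample)
print(shares[0]['DATE'])  # 0.5
print(shares[1]['NORP'])  # 1.0
```

Applied to the real data, `entity_type_share_by_class(net_question)` would give per-class shares that can be compared directly.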

## Find The Most Common Named Entities Overall And By Class

In [35]:
net_question = [([w.text for w in s if w.ent_type_], t) for s, t in ner_questions]

# if no named entity found, add "None"
net_question = [(s, t) if s else (["None"], t) for s, t in net_question]

In [36]:
ners = [ne for n in net_question for ne in n[0]]
ner_cnt = Counter(ners)
ner_cnt.most_common(10)

[('None', 4819),
 ('the', 262),
 ('India', 255),
 ('year', 112),
 ('Indian', 107),
 ('Quora', 106),
 ('one', 104),
 ('years', 104),
 ('Trump', 97),
 ('US', 92)]
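
Because the counts above are token-level, fragments of multi-token entities such as "the", "year" and "one" rank among the most common "entities". A possible alternative is to count whole entity spans from `doc.ents` and, optionally, keep only selected entity types. The helper below is a sketch under those assumptions: `count_entity_spans`, `keep_labels` and `target` are names introduced here, and the function relies only on the `ents`, `text` and `label_` attributes that spaCy documents and spans provide.

```python
from collections import Counter

def count_entity_spans(ner_questions, keep_labels=None, target=None):
    """Count whole entity spans, optionally filtered by label and class.

    ner_questions: iterable of (doc, target) pairs, as built earlier.
    keep_labels:   set of entity labels to keep, or None to keep all.
    target:        class label to restrict to, or None for all questions.
    """
    counts = Counter()
    for doc, t in ner_questions:
        if target is not None and t != target:
            continue
        for ent in doc.ents:
            if keep_labels is None or ent.label_ in keep_labels:
                counts[ent.text] += 1
    return counts
```

For example, `count_entity_spans(ner_questions, keep_labels={'PERSON', 'NORP', 'GPE', 'ORG'}, target=1).most_common(10)` would list the most frequent people, groups, places and organizations in insincere questions, leaving out DATE and CARDINAL spans.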

- Named entities in sincere questions (`target == 0`)

In [37]:
sincere_ners = [ne for n in net_question for ne in n[0] if not n[1]]
ner_cnt = Counter(sincere_ners)
ner_cnt.most_common(10)

[('None', 4646),
 ('the', 238),
 ('India', 222),
 ('year', 109),
 ('one', 99),
 ('years', 97),
 ('Indian', 85),
 ('-', 85),
 ('Quora', 83),
 ("'s", 79)]

- Named entities in insincere questions (`target == 1`)

In [38]:
insincere_ners = [ne for n in net_question for ne in n[0] if n[1]]
ner_cnt = Counter(insincere_ners)
ner_cnt.most_common(10)

[('None', 173),
 ('Trump', 55),
 ('India', 33),
 ('Muslims', 28),
 ('the', 24),
 ('Quora', 23),
 ('Indian', 22),
 ('Americans', 20),
 ('Indians', 18),
 ('Chinese', 16)]

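Many of the insincere questions mention “Trump”, “Muslims”, “Indians” and other national or religious groups, and NORP (nationalities, religious or political groups) is the most frequent entity type in this class.  
This suggests that insincere questions often target people by nationality, religion or politics, and some of them may contain racist content. Entity counts alone cannot confirm this, because they do not show how these groups are discussed.  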

# EOF