<a href="https://colab.research.google.com/github/Suhit97/Text-analysis-using-Name-Entity-Recognition/blob/master/Text_analysis_with_Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [29]:
# Colab's file access feature
from google.colab import files

# Prompt for a file upload (kaggle.json) and keep the result
uploaded = files.upload()

# Print the name and size of each uploaded file
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 67 bytes


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [31]:
!kaggle competitions download -c quora-insincere-questions-classification

Downloading quora-insincere-questions-classification.zip to /content
100% 6.02G/6.03G [01:27<00:00, 71.5MB/s]
100% 6.03G/6.03G [01:27<00:00, 74.1MB/s]


In [33]:
!unzip quora-insincere-questions-classification.zip -d /content

Archive:  quora-insincere-questions-classification.zip
  inflating: /content/embeddings.zip  
  inflating: /content/sample_submission.csv  
  inflating: /content/test.csv       
  inflating: /content/train.csv      


In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("ggplot")

In [37]:
df = pd.read_csv("/content/train.csv")

print(f"Loaded {df.shape[0]} samples.")


Loaded 1306122 samples.


In [38]:
print("\n".join(list(df.sample(15).question_text.values)))

What exams I can give 2 years after class 12 CBSE boards?
How about the syllabus of IAS?
Which book I shud start reading to improve my English?
Trump's there to suppress the Muslim minority in USA and destroy Islam in the world .he's anti Muslim?
Which floor exercise at home is better for women with bad knee?
Does alcatel a3 10 support stylus?
How did you develop your writing skills in English especially the vocabulary?
What's the remainder of {2^300×3^320/2^320×3^300}?
What are your (as living in India) opinions on shopping from club factory?
Does WhatsApp call consume more data or vonage?
Is it possible to have an honest gold standard without a full reserve banking requirement to prevent the banks from changing the ratio of money to gold?
Yogi needs AC sofa to cool his ass, who cares for 63 kids who die without oxygen in Gorakhpur?
How can I do my homework without falling asleep?
Why does my ex want to be my friend but act like we're together?
How many brand accounts can one have wit

In [39]:
n_samples = 10000

df_sample = df.sample(n_samples)

In [47]:
df_sample.head()

| | qid | question_text | target |
|---|---|---|---|
| 404312 | 4f3afc68e4aec87d8117 | What would Shakespeare think about BJP governm... | 0 |
| 327861 | 40435efc246f96570fca | How will the Hebrides Islands be affected by c... | 0 |
| 408432 | 50079cf605167d82497c | How do Indian undergrads relay student loans? | 0 |
| 87692 | 112fa4647ce7267062a0 | How does Lahav Harkov’s opinion on “Polish Dea... | 0 |
| 156709 | 1ea825a7c02f886a4e74 | Can I switch to student visa after my one year... | 0 |


# Detect the named entities using spaCy

In [42]:
import spacy
from spacy import displacy

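# Load an English pipeline and keep only what NER needs.
# The 'en' shortcut and the parse/tag/entity keywords only work on old spaCy releases;
# on spaCy 3.x use a full model name (en_core_web_sm is assumed here) and `disable`.
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'attribute_ruler', 'lemmatizer'])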

In [43]:
sentence = str(df_sample.iloc[0].question_text)
sentence_nlp = nlp(sentence)

# print each token that is part of a named entity, with its entity type
print([(word, word.ent_type_) for word in sentence_nlp if word.ent_type_])

# visualize named entities
displacy.render(sentence_nlp, style='ent', jupyter=True)

[(Shakespeare, 'PERSON'), (BJP, 'ORG'), (Anti, 'ORG'), (-, 'ORG'), (Romeo, 'ORG')]


In [46]:
ner_questions = [(nlp(s), t) for s, t in zip(df_sample.question_text, df_sample.target)]

## Find the most common named entity types overall and by target class.

In [None]:
net_question = [([w.ent_type_ for w in s if w.ent_type_], t) for s, t in ner_questions]

# if no named entity found, add "None"
net_question = [(s, t) if s else (["None"], t) for s, t in net_question]

# Flatten the named entity types into a single list.
ners = [ne for n in net_question for ne in n[0]]


from collections import Counter


In [49]:
ner_cnt = Counter(ners)
ner_cnt.most_common(10)

[('None', 4915),
 ('ORG', 2949),
 ('PERSON', 2052),
 ('GPE', 1973),
 ('DATE', 1545),
 ('NORP', 1089),
 ('CARDINAL', 814),
 ('WORK_OF_ART', 394),
 ('LOC', 304),
 ('PRODUCT', 150)]

Group the named entity types by target class.

In [50]:
sincere_ners = [ne for n in net_question for ne in n[0] if not n[1]]
ner_cnt = Counter(sincere_ners)
ner_cnt.most_common(10)

[('None', 4715),
 ('ORG', 2810),
 ('PERSON', 1787),
 ('GPE', 1764),
 ('DATE', 1445),
 ('CARDINAL', 772),
 ('NORP', 708),
 ('WORK_OF_ART', 384),
 ('LOC', 260),
 ('TIME', 143)]

In [51]:
insincere_ners = [ne for n in net_question for ne in n[0] if n[1]]
ner_cnt = Counter(insincere_ners)
ner_cnt.most_common(10)

[('NORP', 381),
 ('PERSON', 265),
 ('GPE', 209),
 ('None', 200),
 ('ORG', 139),
 ('DATE', 100),
 ('LOC', 44),
 ('CARDINAL', 42),
 ('PRODUCT', 18),
 ('WORK_OF_ART', 10)]

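In the insincere questions, the most common named entity type is “NORP”, which stands for “Nationalities or religious or political groups”. In the sincere questions it only ranks sixth, behind “ORG”, “PERSON”, “GPE”, “DATE” and “CARDINAL”.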
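Because the sample contains far more sincere than insincere questions, raw counts are hard to compare between the two classes. A minimal sketch of one way to normalize them is shown here: it turns the counts into per-class shares. The helper `label_shares` and the toy data are assumptions made for illustration; they are not part of the original notebook, and the toy values are made up.

```python
from collections import Counter, defaultdict


def label_shares(pairs):
    """Return {target: {label: share}} for a list of (labels, target) pairs.

    `pairs` has the same shape as `net_question`: a list of entity labels
    per question plus the question's target class.
    """
    counts = defaultdict(Counter)
    for labels, target in pairs:
        counts[target].update(labels)
    return {
        target: {label: n / sum(cnt.values()) for label, n in cnt.items()}
        for target, cnt in counts.items()
    }


# Toy stand-in for `net_question`; these values are invented.
toy = [
    (["ORG", "PERSON"], 0),
    (["None"], 0),
    (["GPE"], 0),
    (["NORP", "NORP"], 1),
    (["PERSON", "NORP"], 1),
]

shares = label_shares(toy)
print(shares[1]["NORP"])  # 3 of the 4 labels in class 1
```

With the real data, `label_shares(net_question)` would give the fraction of entity labels that each type accounts for within each class, which makes the two rankings directly comparable.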

## Find the most common named entities overall and by class.

In [52]:
net_question = [([w.text for w in s if w.ent_type_], t) for s, t in ner_questions]

# if no named entity found, add "None"
net_question = [(s, t) if s else (["None"], t) for s, t in net_question]

In [53]:
ners = [ne for n in net_question for ne in n[0]]
ner_cnt = Counter(ners)
ner_cnt.most_common(10)

[('None', 4915),
 ('the', 301),
 ('India', 247),
 ('one', 133),
 ('Quora', 129),
 ('years', 111),
 ('Indian', 110),
 ('Trump', 107),
 ('-', 101),
 ('US', 94)]
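The counts above are per token, so an entity that spans several tokens contributes each of its words separately. That is presumably why 'the', 'one', 'years' and '-' rank so high. A sketch of counting whole entity spans instead is given here. It relies on the spaCy attributes `Doc.ents` and `Span.text`; the `Ent` and `FakeDoc` stand-ins and the toy questions are hypothetical and only exist so the example runs without a model.

```python
from collections import Counter, namedtuple

# Stand-ins that mimic the spaCy attributes used below (`Doc.ents`, `Span.text`).
# They are hypothetical and only keep the example self-contained.
Ent = namedtuple("Ent", ["text", "label_"])
FakeDoc = namedtuple("FakeDoc", ["ents"])


def entity_span_counts(docs_with_targets):
    """Count whole entity spans rather than individual entity tokens."""
    counts = Counter()
    for doc, _target in docs_with_targets:
        counts.update(ent.text for ent in doc.ents)
    return counts


# Invented data in the same (doc, target) shape as `ner_questions`.
toy_questions = [
    (FakeDoc(ents=[Ent("the United States", "GPE")]), 0),
    (FakeDoc(ents=[Ent("the United States", "GPE"), Ent("Trump", "PERSON")]), 1),
]

span_counts = entity_span_counts(toy_questions)
print(span_counts.most_common(2))
```

Passing the real `ner_questions` list to `entity_span_counts` should work the same way, since spaCy `Doc` objects expose `ents` and each entity span exposes `text`.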

Group the named entities by target class.

In [54]:
sincere_ners = [ne for n in net_question for ne in n[0] if not n[1]]
ner_cnt = Counter(sincere_ners)
ner_cnt.most_common(10)

[('None', 4715),
 ('the', 281),
 ('India', 224),
 ('one', 126),
 ('years', 99),
 ('Quora', 91),
 ('-', 88),
 ('of', 86),
 ('Indian', 80),
 ('US', 79)]

In [55]:
insincere_ners = [ne for n in net_question for ne in n[0] if n[1]]
ner_cnt = Counter(insincere_ners)
ner_cnt.most_common(10)

[('None', 200),
 ('Muslims', 50),
 ('Trump', 41),
 ('Quora', 38),
 ('Americans', 31),
 ('Indian', 30),
 ('India', 23),
 ('the', 20),
 ('Indians', 17),
 ('Chinese', 17)]

Many of the insincere questions focus on “Trump”, “Muslims” and “Indians”, while the sincere list is led by generic tokens such as “the”, “India” and “one”. So insincere questions in this sample seem to often have racist content.
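To put a number on this, the outputs of cells 53 and 55 can be combined into the share of each entity's mentions that come from insincere questions. The counts below are copied from those outputs; the variable names are made up for this sketch, and only entities that appear in both top-10 lists are included.

```python
# Token counts copied from the outputs above (overall and insincere top-10 lists).
overall = {"None": 4915, "India": 247, "Quora": 129, "Indian": 110, "Trump": 107}
insincere = {"None": 200, "India": 23, "Quora": 38, "Indian": 30, "Trump": 41}

# Share of each entity's mentions that come from insincere questions.
insincere_share = {name: insincere[name] / overall[name] for name in overall}

for name, share in sorted(insincere_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>7}: {share:.1%}")
```

In this sample roughly 38% of the “Trump” mentions come from insincere questions, compared with about 4% of the questions that contain no named entity at all, so the entity a question mentions looks like a useful signal for the target class.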