<a href="https://www.kaggle.com/kamaljp/tweets-nlp-modeling-inprogress?scriptVersionId=89487448" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### <a id="cont"></a> Game Plan

Classifying sentences by hand becomes tedious when it has to be done many times.

A computer can recognise the word "tedious" in the sentence above, yet it has no emotional or even computational response to it. In this notebook we try to create such a response from the computer, using the NLP library spaCy together with ML libraries.

[Is the dataset balanced?](#vis_1)
    
[Which country or locality has had many tweets?](#vis_2)
    
[How are sentences represented in spaCy under the hood?](#vis_dis)

[Which keywords are used in tweets to communicate disasters?](#vis_3)

[Which keywords communicate correctly that a disaster has occurred?](#vis_4)

What next?

The roots used in the tweets have been identified. Some of these roots create false positives and some create false negatives, so further analysis and understanding are required before predictions can be made.
    
[Understanding the results](#results)


In [20]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import spacy
import plotly.express as px

from spacy import displacy
from spacy.matcher import Matcher

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


In [21]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
train.head(2)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1


In [22]:
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
test.head(2)

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."


In [23]:
#Loading the spacy library with the small corpus model
nlp = spacy.load('en_core_web_sm')

### <a id="vis_1"></a> Is the dataset balanced?

In [24]:
# Let's warm up with some visuals on the dataset

# Is the training dataset balanced?

balance = train.groupby('target')['id'].count().reset_index()

# Plotly treats a string column as categorical, which gives discrete bar colours
balance.target = balance.target.astype(str)

fig = px.bar(data_frame=balance, x='target',y='id',color='target')
fig.show()
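
The bar chart can be backed by numbers. The following is a minimal sketch, assuming a made-up `toy` frame in place of `train`; only the `target` column name comes from the notebook, and `imbalance_ratio` is a hypothetical summary measure chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the `target` column of train.csv (values are made up).
toy = pd.DataFrame({"target": [0, 0, 0, 1, 1]})

# Share of each class; normalize=True turns counts into proportions.
shares = toy["target"].value_counts(normalize=True).sort_index()
print(shares)

# Ratio of the largest to the smallest class share: 1.0 means perfectly balanced.
imbalance_ratio = shares.max() / shares.min()
print(round(imbalance_ratio, 2))
```

Running the same two statements on `train` would give the real class shares for this dataset.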

[back to top](#cont)

### <a id="vis_2"></a> Which country or locality has had many tweets?

In [25]:
lokale = train.groupby('location')['id'].count().reset_index()
lokale.sort_values('id',ascending=False,inplace=True)

fig = px.bar(data_frame=lokale[:50], y='location',x='id',color='location')
fig.update_layout(yaxis={'categoryorder':'total descending'})
fig.show()

[back to top](#cont)

In [26]:
#Let us sample some tweets
for x in train.text[:10]:
    print(x)

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
I'm on top of the hill and I can see a fire in the woods...
There's an emergency evacuation happening now in the building across the street
I'm afraid that the tornado is coming to our area...


In [27]:
# Replace the URL-encoded space (%20) in the keywords with a real space; missing keywords stay missing
train['keyword'] = train['keyword'].str.replace('%20', ' ', regex=False)
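
The cell above handles only `%20`. If other percent-escapes ever show up in the keywords (an assumption, not something observed in this notebook), the standard library's `urllib.parse.unquote` decodes all of them. A small sketch on a made-up `keywords` Series:

```python
from urllib.parse import unquote

import pandas as pd

# Made-up sample shaped like the `keyword` column, including a missing value.
keywords = pd.Series(["airplane%20accident", "ablaze", None])

# unquote decodes every percent-escape, not only %20.
# na_action="ignore" skips missing values instead of passing them to unquote.
decoded = keywords.map(unquote, na_action="ignore")
print(decoded.tolist())
```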

[back to top](#cont)

### <a id="vis_dis"></a> How are sentences represented in spaCy under the hood?

In [28]:
text = nlp(train.text[136])
displacy.render(text.sents, style="dep")

[back to top](#cont)

In [29]:
# The keywords can be more informative, so let us use the power of Spacy objects and lemmatize
key = nlp(train.keyword[136])

#Keyword of interest has to be a Root
for token in key:
    print(token.lemma_,token.pos_,token.dep_)

airplane NOUN compound
accident NOUN ROOT


In [30]:
keys =(_ for _ in train.loc[~train.keyword.isna(),'keyword'])

In [31]:
#Checking how the conditions work. Even though there is more than one token, it returns only the root 
for i in range(20):
    doc = nlp(next(keys))
    print(doc)
    for token in doc:
        if token.dep_ == 'ROOT':
            print(token.lemma_)

ablaze
ablaze
... (the same two lines repeat for each of the 20 keywords)


In [32]:
#helper function to return the root keyword as a lemma. That will greatly reduce the different keywords
def get_root(key):
    doc = nlp(key)
    for token in doc:
        if token.dep_ == 'ROOT':
            return token.lemma_

In [33]:
# Create the roots column for the rows that have a keyword
has_keyword = train.keyword.notna()
train.loc[has_keyword, 'roots'] = train.loc[has_keyword, 'keyword'].apply(get_root)
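
The cell above calls `get_root` once per row, although the same keyword appears on many rows (the run of "ablaze" earlier shows this). One possible speed-up is to parse each distinct keyword once and map the result back. The sketch below uses `get_root_stub`, a hypothetical stand-in that simply returns the last word, so that it runs without a spaCy model; in the notebook the real `get_root` would take its place.

```python
import pandas as pd

calls = 0

def get_root_stub(keyword):
    """Hypothetical stand-in for the spaCy-based get_root: returns the last word."""
    global calls
    calls += 1
    return keyword.split()[-1]

# Made-up rows with repeated keywords and one missing value.
toy = pd.DataFrame(
    {"keyword": ["ablaze", "ablaze", "airplane accident", None, "airplane accident"]}
)

# Parse each distinct keyword once, then broadcast the result to every row.
unique_keywords = toy["keyword"].dropna().unique()
root_by_keyword = {kw: get_root_stub(kw) for kw in unique_keywords}
toy["roots"] = toy["keyword"].map(root_by_keyword)

print(toy)
print(calls)  # number of parser calls: one per distinct keyword
```

Rows without a keyword have no entry in the dictionary, so `map` leaves their `roots` value missing, which matches the behaviour of the original cell.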

[back to top](#cont)

### <a id="vis_3"></a> Which keyword has been used in tweets to communicate the disasters?

In [34]:
key_root = train.groupby('roots')['id'].count().reset_index()
key_root.sort_values('id',ascending=False,inplace=True)

fig = px.bar(data_frame=key_root[:50], y='roots',x='id',color='roots')
fig.update_layout(yaxis={'categoryorder':'total descending'})
fig.show()

[back to top](#cont)

### <a id="vis_4"></a> Which keywords communicate correctly that a disaster has occurred?

In [None]:
target_root = train.groupby(['roots','target'])['id'].count().reset_index()
target_root.sort_values('id',ascending=False,inplace=True)
target_root.target = target_root.target.apply(lambda x: str(x))

In [37]:
fig = px.bar(data_frame=target_root[50:100], y='roots',x='id',color='target')
fig.update_layout(yaxis={'categoryorder':'total descending'},height=1000)
fig.show()
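
Counts per root and target can also be condensed into one number per root: the share of its tweets that are labelled as disasters. This is a sketch on made-up rows; the `roots` and `target` column names follow the notebook, while `tweets` and `disaster_share` are names introduced here for illustration.

```python
import pandas as pd

# Made-up rows; in the notebook these columns come from `train`.
toy = pd.DataFrame({
    "roots":  ["fire", "fire", "fire", "derailment", "derailment", "ablaze"],
    "target": [1, 0, 0, 1, 1, 0],
})

# Because target is 0/1, its mean per root is the share of disaster tweets.
summary = (
    toy.groupby("roots")["target"]
       .agg(tweets="count", disaster_share="mean")
       .sort_values("disaster_share", ascending=False)
)
print(summary)
```

A root with a share near 1 or near 0 separates the classes well, while a share near 0.5 suggests the root alone says little; a low `tweets` count makes either reading less reliable.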

[back to top](#cont)

### Keywords alone are insufficient

The occurrence of a keyword in a tweet does not by itself link the tweet to a disaster. Take "fire" as an example. Next we look at what spaCy has to offer.

In [46]:
# Tweet that is talking about the disaster
tweet_1 = train.loc[(train.roots == 'fire') & (train.target == 1),'text'].values[1]
tweet_1

'@POTUS Would you please explain what you are going to do about the volcanoes &amp; bush fires spouting all that CO2 into the air?'

In [47]:
# Tweet that is talking about a debate between politicians
tweet_2 = train.loc[(train.roots == 'fire') & (train.target == 0),'text'].values[1]
tweet_2

'Ted Cruz fires back at Jeb &amp; Bush: \x89ÛÏWe lose because of Republicans like Jeb &amp; Mitt.\x89Û\x9d [Video] -  http://t.co/bFtiaPF35F'

In [36]:
# The nlp object runs several pipeline components over each text. Let's see what NER can offer.
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fcf478ff600>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7fcf478ff6e0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fcf47924d50>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7fcf474c0a00>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fcf474c0870>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fcf47924cd0>)]

In [64]:
# Let's check which entities the nlp object returns for the first tweet.
doc1 = nlp(tweet_1)

for ent in doc1.ents:
    print(ent.text,ent.label_)
print('_______')
for token in doc1:
    print(token.text,token.pos_,token.dep_)

# The small spaCy model labels "bush" (the plant) as a PERSON.

bush PERSON
_______
@POTUS PROPN npadvmod
Would AUX aux
you PRON nsubj
please INTJ intj
explain VERB ccomp
what PRON dobj
you PRON nsubj
are AUX aux
going VERB ccomp
to PART aux
do VERB xcomp
about ADP prep
the DET det
volcanoes NOUN pobj
& CCONJ cc
amp NOUN conj
; PUNCT punct
bush PROPN compound
fires NOUN ROOT
spouting VERB acl
all DET predet
that PRON det
CO2 PROPN dobj
into ADP prep
the DET det
air NOUN pobj
? PUNCT punct


In [63]:
# Let's check which entities the nlp object returns for the second tweet.
doc2 = nlp(tweet_2)

for ent in doc2.ents:
    print(ent.text,ent.label_)
print('_______')
for token in doc2:
    print(token.text,token.pos_,token.dep_)

# The 2nd tweet yields more entities.

Ted Cruz PERSON
Jeb &amp ORG
Bush PERSON
Republicans NORP
Jeb &amp ORG
_______
Ted PROPN compound
Cruz PROPN nsubj
fires VERB ROOT
back ADV advmod
at ADP prep
Jeb PROPN pobj
& CCONJ cc
amp NOUN conj
; PUNCT punct
Bush PROPN appos
: PUNCT punct
ÛÏWe PROPN appos
lose VERB dep
because SCONJ prep
of ADP pcomp
Republicans PROPN pobj
like ADP prep
Jeb PROPN pobj
& CCONJ cc
amp NOUN conj
; PUNCT punct
Mitt.Û PROPN compound
[ X compound
Video PROPN dep
] PUNCT punct
- PUNCT punct
  SPACE dep
http://t.co/bFtiaPF35F NOUN conj


### The regular NER does not recognize disasters

The entities found in the tweets miss the keywords of interest completely. The part-of-speech and dependency components give some direction.

The word "fires" is tagged as a NOUN in the disaster tweet and as a VERB in the other one. In both tweets its dependency label is ROOT, so the dependency label alone does not separate the two cases.
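
The NOUN versus VERB contrast can be checked with a few lines of plain Python. The triples below are copied from the token listings printed for the two tweets; `keyword_pos` is a hypothetical helper written for this illustration, and two tweets are of course too few to establish a rule.

```python
# (text, pos, dep) triples copied from the token listings of the two "fire" tweets.
tweet_1_tokens = [
    ("bush", "PROPN", "compound"),
    ("fires", "NOUN", "ROOT"),
    ("spouting", "VERB", "acl"),
]
tweet_2_tokens = [
    ("Cruz", "PROPN", "nsubj"),
    ("fires", "VERB", "ROOT"),
    ("back", "ADV", "advmod"),
]

def keyword_pos(tokens, keyword="fires"):
    """Return the POS tags under which `keyword` appears in a token listing."""
    return [pos for text, pos, _ in tokens if text.lower() == keyword]

print(keyword_pos(tweet_1_tokens))  # ['NOUN']
print(keyword_pos(tweet_2_tokens))  # ['VERB']
```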

### Tweets without a clear keyword

There are tweets for which there is no clear keyword that indicates whether it is a disaster or not. Can we reliably extract keywords from these tweets?

In [75]:
def has_dis_token(doc):
    doc = nlp(doc)
    for t in doc:
        if t.lower_ in train.keyword:
            print(t)
            if t.pos_ == 'NOUN':
                return True
    # If no token is a noun that matches a keyword, treat the tweet as a general one
    return False

In [76]:
# Let's test with the first 10 training tweets
# .copy() makes trial an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
trial = train.iloc[:10, :].copy()
trial['result'] = trial.text.apply(has_dis_token)
trial


Unnamed: 0,id,keyword,location,text,target,roots,result
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,,False
1,4,,,Forest fire near La Ronge Sask. Canada,1,,False
2,5,,,All residents asked to 'shelter in place' are ...,1,,False
3,6,,,"13,000 people receive #wildfires evacuation or...",1,,False
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,,False
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1,,False
6,10,,,#flood #disaster Heavy rain causes flash flood...,1,,False
7,13,,,I'm on top of the hill and I can see a fire in...,1,,False
8,14,,,There's an emergency evacuation happening now ...,1,,False
9,15,,,I'm afraid that the tornado is coming to our a...,1,,False
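
Every row comes back `False`, even for tweets that contain words such as "fire" or "tornado". This is consistent with how pandas defines membership: `x in series` tests the index labels, not the values, so `t.lower_ in train.keyword` compares a string with the integer labels and never matches. The sketch below shows the behaviour on a made-up `keyword` Series and one possible alternative, a set built from the values; `keyword_set` is a name introduced here.

```python
import pandas as pd

# Made-up keyword column with a default integer index, like train.keyword.
keyword = pd.Series(["ablaze", "fire", None, "tornado"])

print("fire" in keyword)  # False: the index labels are 0, 1, 2, 3
print(2 in keyword)       # True: 2 is an index label

# Build the lookup once from the values instead of testing the Series itself.
keyword_set = set(keyword.dropna().str.lower())
print("fire" in keyword_set)  # True
```

Inside `has_dis_token`, the condition could then read `t.lower_ in keyword_set`. Surface forms such as "wildfires" would still miss a keyword stored in the singular, so comparing `t.lemma_` may be worth trying; that is an assumption to verify against the data.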


In [79]:
trial.text.values

array(['Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all',
       'Forest fire near La Ronge Sask. Canada',
       "All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected",
       '13,000 people receive #wildfires evacuation orders in California ',
       'Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school ',
       '#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires',
       '#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas',
       "I'm on top of the hill and I can see a fire in the woods...",
       "There's an emergency evacuation happening now in the building across the street",
       "I'm afraid that the tornado is coming to our area..."],
      dtype=object)