### <a id="cont"> Game Plan

Classifying sentences by hand becomes tedious when it has to be done many times.

A computer can recognise the word "tedious" in the sentence above, yet it has no emotional or even computational response to it. In this notebook we try to create such a response from the computer, using the NLP library spaCy together with ML libraries.

[Is the dataset balanced?](#vis_1)
    
[Which countries or localities have the most tweets?](#vis_2)

[How are sentences represented in spaCy under the hood?](#vis_dis)

[Which keywords have been used in tweets to communicate disasters?](#vis_3)

[Which keywords communicated correctly when a disaster occurred?](#vis_4)

What next?

The roots of the keywords used in the tweets are identified. Some of these roots create false positives and some create false negatives, so further analysis and understanding are required. Predictions will be built on that analysis.
    
[Understanding the results](#results)


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import spacy
import plotly.express as px

from spacy import displacy
from spacy.matcher import Matcher

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


In [2]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
train.head(2)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1


In [3]:
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
test.head(2)

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."


In [4]:
# Load spaCy's small English pipeline (en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

### <a id="vis_1"> Is the dataset balanced?

In [5]:
# Let's warm up with some visuals of the dataset

# Is the training dataset balanced?

balance = train.groupby('target')['id'].count().reset_index()

# Cast the label to string so plotly treats it as a discrete category
balance['target'] = balance['target'].astype(str)

fig = px.bar(data_frame=balance, x='target',y='id',color='target')
fig.show()
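
The bar chart answers the balance question visually; the same question can also be answered with numbers. The following is a minimal sketch on a hypothetical toy frame (`toy` stands in for `train`, and the label values are made up for illustration), showing class shares and a simple imbalance ratio.

```python
import pandas as pd

# Hypothetical toy labels standing in for train['target'] (1 = disaster, 0 = not).
toy = pd.DataFrame({"target": [0, 0, 0, 1, 1]})

# Share of each class; normalize=True turns counts into fractions that sum to 1.
shares = toy["target"].value_counts(normalize=True).sort_index()
print(shares)

# Ratio of the largest class share to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = shares.max() / shares.min()
print(round(imbalance_ratio, 2))
```

On the real data, replacing `toy` with `train` should give the corresponding shares for the competition labels.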

[back to top](#cont)

### <a id="vis_2"> Which countries or localities have the most tweets?

In [6]:
lokale = train.groupby('location')['id'].count().reset_index()
lokale.sort_values('id',ascending=False,inplace=True)

fig = px.bar(data_frame=lokale[:50], y='location',x='id',color='location')
fig.update_layout(yaxis={'categoryorder':'total descending'})
fig.show()

[back to top](#cont)

In [7]:
#Let us sample some tweets
for x in train.text[:10]:
    print(x)

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
I'm on top of the hill and I can see a fire in the woods...
There's an emergency evacuation happening now in the building across the street
I'm afraid that the tornado is coming to our area...


In [8]:
# Replace the URL-encoded space (%20) in keywords with a real space; NaN keywords are left untouched
train['keyword'] = train['keyword'].str.replace('%20', ' ', regex=False)
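
`%20` is the percent-encoded form of a space. If the keywords turn out to contain other percent-encoded characters as well (an assumption, not something verified here), a more general option is the standard library's `urllib.parse.unquote`. The helper name `decode_keyword` below is hypothetical.

```python
from urllib.parse import unquote

def decode_keyword(value):
    """Percent-decode a keyword; pass non-strings (e.g. NaN) through unchanged."""
    if isinstance(value, str):
        return unquote(value)
    return value

print(decode_keyword("airplane%20accident"))
print(decode_keyword("ablaze"))
```

It could be applied to the column with `train['keyword'].map(decode_keyword)`.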

[back to top](#cont)

### <a id="vis_dis"> How are sentences represented in spaCy under the hood?

In [9]:
text = nlp(train.text[136])
displacy.render(text.sents, style="dep")

[back to top](#cont)

In [10]:
# The keywords can be more informative, so let us use the power of spaCy objects and lemmatize them
key = nlp(train.keyword[136])

# The token of interest is the one whose dependency label is ROOT
for token in key:
    print(token.lemma_,token.pos_,token.dep_)

airplane NOUN compound
accident NOUN ROOT


In [11]:
keys = iter(train['keyword'].dropna())

In [12]:
# Check how the ROOT condition works: each keyword is printed, followed by the lemma of its root token only (even when a keyword has more than one token)
for i in range(20):
    doc = nlp(next(keys))
    print(doc)
    for token in doc:
        if token.dep_ == 'ROOT':
            print(token.lemma_)

ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze
ablaze


In [13]:
# Helper that returns the lemma of the keyword's ROOT token (None if no ROOT is found). This greatly reduces the number of distinct keywords
def get_root(key):
    doc = nlp(key)
    for token in doc:
        if token.dep_ == 'ROOT':
            return token.lemma_
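
Because the keyword column repeats the same values many times (the first 20 non-null keywords are all `ablaze`), parsing each distinct keyword only once can save work. The following is a minimal sketch of that idea using `functools.lru_cache`. To keep it runnable without a spaCy model, `slow_root` is a hypothetical stand-in that simply takes the last word; it is not a dependency parse and would be replaced by the spaCy-based helper in practice.

```python
from functools import lru_cache

calls = {"count": 0}

def slow_root(keyword):
    # Stand-in for the spaCy-based helper: it just takes the last word.
    # This is NOT a dependency parse, only a placeholder so the sketch runs.
    calls["count"] += 1
    return keyword.split()[-1]

@lru_cache(maxsize=None)
def cached_root(keyword):
    # Each distinct keyword is processed once; repeats are served from the cache.
    return slow_root(keyword)

keywords = ["ablaze"] * 20 + ["airplane accident"] * 5
roots = [cached_root(k) for k in keywords]

print(calls["count"])      # how many times the slow function actually ran
print(sorted(set(roots)))  # distinct roots produced by the stand-in
```

The cache only helps when keywords repeat, which is the case for this column.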

In [14]:
# Create the 'roots' column; rows without a keyword stay NaN
has_keyword = train['keyword'].notna()
train.loc[has_keyword, 'roots'] = train.loc[has_keyword, 'keyword'].apply(get_root)

[back to top](#cont)

### <a id="vis_3"> Which keywords have been used in tweets to communicate disasters?

In [15]:
key_root = train.groupby('roots')['id'].count().reset_index()
key_root.sort_values('id',ascending=False,inplace=True)

fig = px.bar(data_frame=key_root[:50], y='roots',x='id',color='roots')
fig.update_layout(yaxis={'categoryorder':'total descending'})
fig.show()

[back to top](#cont)

### <a id="vis_4"> Which keywords communicated correctly when a disaster occurred?

In [16]:
target_root = train.groupby(['roots','target'])['id'].count().reset_index()
target_root.sort_values('id',ascending=False,inplace=True)
target_root['target'] = target_root['target'].astype(str)

fig = px.bar(data_frame=target_root[:50], y='roots',x='id',color='target')
fig.update_layout(yaxis={'categoryorder':'total descending'},height=1000)
fig.show()
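
The stacked counts show which roots appear with both labels. One way to quantify how reliably a root signals a disaster is its share of tweets labelled 1. The following is a minimal sketch on a hypothetical toy frame (`toy` stands in for `train[['roots', 'target']]`; the rows are made up for illustration). A root with a low rate is a likely source of false positives if used alone as a signal.

```python
import pandas as pd

# Hypothetical toy rows standing in for train[['roots', 'target']].
toy = pd.DataFrame({
    "roots":  ["fire", "fire", "fire", "crash", "crash", "ablaze"],
    "target": [1, 1, 0, 1, 0, 0],
})

# For each root: how many tweets use it, and what fraction are labelled as disasters.
rates = (
    toy.groupby("roots")["target"]
       .agg(tweets="count", disaster_rate="mean")
       .sort_values("disaster_rate", ascending=False)
)
print(rates)
```

Roots with very few tweets give noisy rates, so the `tweets` column is kept alongside the rate.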

[back to top](#cont)