<a href="https://www.kaggle.com/kamaljp/tweets-nlp-classification?scriptVersionId=89963613" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### <a id="cont"> Game Plan

Classifying short pieces of text by hand becomes tedious when done many times.

A computer can recognise the word "tedious" in the sentence above, yet it has no emotional or even computational response to it. In this notebook we try to create such a response from the computer, using the NLP library spaCy and ML libraries.

[Is the dataset balanced?](#vis_1)
    
[Which country or locality has had many tweets?](#vis_2)
    
[How are the sentences represented in spaCy under the hood?](#vis_dis)
    
[Which keyword has been used in tweets to communicate the disasters?](#vis_3)
    
[Which keywords have communicated correctly when a disaster has occurred?](#vis_4)

What Next?
    
    The roots used in the tweets are identified. Some of these roots create false positives and some create false negatives. Further analysis and understanding are required before predictions are made.
    
[Keywords alone are insufficient](#kw)

[Seems the regular NER doesn't recognize disasters](#ner)

[What is the matcher predicting?](#mat)
    
[Evaluating the predictions of matcher](#mat_eval)    
    
[So what Next???](#next)
    
        The roots alone are not enough. The entities need to be recognized and used for classification. NER pip
    
[Training the New NER in the pipeline](#t_ner)
    
[Training the text categorizer pipe](#tex)
    
[Output & Submission](#sub)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import spacy
import plotly.express as px

from spacy import displacy
from spacy.matcher import Matcher

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


In [2]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
train.head(2)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |


In [3]:
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
test.head(2)

| | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash |
| 1 | 2 | | | Heard about #earthquake is different cities, s... |


In [4]:
#Loading the spacy library with the small corpus model
nlp = spacy.load('en_core_web_sm')

### <a id="vis_1"> Is the dataset balanced?

In [5]:
#Let's warm up with some visuals of the dataset

#Is the training dataset balanced?

balance = train.groupby('target')['id'].count().reset_index()

#Cast the target to string so plotly treats it as a discrete category
balance.target = balance.target.astype(str)

fig = px.bar(data_frame=balance, x='target',y='id',color='target')
fig.show()
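
The bar chart answers the balance question visually. As a rough numeric companion, here is a minimal sketch that turns the class counts into shares and a majority-to-minority ratio. The counts are the `support` values printed by the classification reports later in this notebook; the variable names are illustrative.

```python
# Class counts taken from the "support" column of the classification reports in this notebook.
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"target={label}: {counts[label]} tweets ({share:.1%})")

# Ratio of majority to minority class; values near 1 mean a balanced dataset.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio this close to 1 suggests the classes are only mildly imbalanced, so plain accuracy is still a usable (if incomplete) metric here.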

[back to top](#cont)

### <a id="vis_2"> Which country or locality has had many tweets?

In [6]:
lokale = train.groupby('location')['id'].count().reset_index()
lokale.sort_values('id',ascending=False,inplace=True)

fig = px.bar(data_frame=lokale[:50], y='location',x='id',color='location')
fig.update_layout(yaxis={'categoryorder':'total descending'})
fig.show()

[back to top](#cont)

In [7]:
#Let us sample some tweets
for x in train.text[:10]:
    print(x)

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
#flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
I'm on top of the hill and I can see a fire in the woods...
There's an emergency evacuation happening now in the building across the street
I'm afraid that the tornado is coming to our area...


In [8]:
#Replacing the %20 with a space; the .str accessor leaves missing keywords (NaN) untouched
train['keyword'] = train['keyword'].str.replace('%20', ' ', regex=False)
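
The `%20` in the keywords is a URL percent-escape for a space. If other escapes ever show up in the column, a more general option is `urllib.parse.unquote` from the standard library. This is a minimal sketch on a plain list; `decode_keyword` and the sample values are illustrative, and non-string values (such as NaN for missing keywords) are passed through unchanged.

```python
from urllib.parse import unquote

def decode_keyword(keyword):
    """Decode percent-escapes such as %20; pass non-string values through unchanged."""
    if not isinstance(keyword, str):
        return keyword
    return unquote(keyword)

# Sample values in the style of the keyword column (None stands in for a missing keyword).
sample_keywords = ["airplane%20accident", "forest%20fire", "ablaze", None]
decoded = [decode_keyword(k) for k in sample_keywords]
print(decoded)
```

On the DataFrame the same helper could be passed to `train.keyword.apply(...)`.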

[back to top](#cont)

### <a id="vis_dis"> How are the sentences represented in spaCy under the hood?

In [9]:
text = nlp(train.text[136])
displacy.render(text.sents, style="dep")

[back to top](#cont)

In [10]:
# The keywords can be more informative, so let us use the power of Spacy objects and lemmatize
key = nlp(train.keyword[136])

#Keyword of interest has to be a Root
for token in key:
    print(token.lemma_,token.pos_,token.dep_)

airplane NOUN compound
accident NOUN ROOT


In [11]:
keys =(_ for _ in train.loc[~train.keyword.isna(),'keyword'])

In [12]:
#helper function to return the root keyword as a lemma. That will greatly reduce the different keywords
def get_root(key):
    doc = nlp(key)
    for token in doc:
        if token.dep_ == 'ROOT':
            return token.lemma_

In [13]:
#creating ROOT column 
train.loc[~train.keyword.isna(),'roots'] = train.loc[~train.keyword.isna(),'keyword'].apply(lambda x: get_root(x))

[back to top](#cont)

### <a id="vis_3"> Which keyword has been used in tweets to communicate the disasters?

In [14]:
key_root = train.groupby('roots')['id'].count().reset_index()
key_root.sort_values('id',ascending=False,inplace=True)

fig = px.bar(data_frame=key_root[:50], y='roots',x='id',color='roots')
fig.update_layout(yaxis={'categoryorder':'total descending'})
fig.show()

[back to top](#cont)

### <a id="vis_4"> Which keywords have communicated correctly when a disaster has occurred?

In [15]:
target_root = train.groupby(['roots','target'])['id'].count().reset_index()
target_root.sort_values('id',ascending=False,inplace=True)
target_root.target = target_root.target.apply(lambda x: str(x))

In [16]:
fig = px.bar(data_frame=target_root[50:100], y='roots',x='id',color='target')
fig.update_layout(yaxis={'categoryorder':'total descending'},height=1000)
fig.show()

[back to top](#cont)

### <a id="kw"> Keywords alone are insufficient

The occurrence of a keyword in a tweet does not, by itself, link the tweet to a disaster. Take the example of "fire". We will see what spaCy has to offer next...

In [17]:
# Tweet that is talking about a disaster
tweet_1 = train.loc[(train.roots == 'fire') & (train.target == 1),'text'].values[1]
tweet_1

'@POTUS Would you please explain what you are going to do about the volcanoes &amp; bush fires spouting all that CO2 into the air?'

In [18]:
# Tweet that is talking about a disaster
tweet_3 = train.loc[(train.roots == 'ablaze') & (train.target == 1),'text'].values[2]
tweet_3

'INEC Office in Abia Set Ablaze - http://t.co/3ImaomknnA'

In [19]:
# Tweet that is talking about a debate between politicians
tweet_2 = train.loc[(train.roots == 'fire') & (train.target == 0),'text'].values[1]
tweet_2

'Ted Cruz fires back at Jeb &amp; Bush: \x89ÛÏWe lose because of Republicans like Jeb &amp; Mitt.\x89Û\x9d [Video] -  http://t.co/bFtiaPF35F'

In [20]:
#There are multiple pipes in the nlp object that process the sentences. Let's use the NER pipe
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f2189f4e7c0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f2189f4e6e0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f2189c79850>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f2189b668c0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f2189b62820>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f2189c79f50>)]

In [21]:
#Let's check what entities the nlp object returns for the tweets.
doc1 = nlp(tweet_1)

for ent in doc1.ents:
    print(ent.text,ent.label_)
print('_______')
for token in doc1:
    print(token.text,token.pos_,token.dep_)

#It seems the small spaCy model labels the "bush" in "bush fires" as a PERSON.

bush PERSON
_______
@POTUS PROPN npadvmod
Would AUX aux
you PRON nsubj
please INTJ intj
explain VERB ccomp
what PRON dobj
you PRON nsubj
are AUX aux
going VERB ccomp
to PART aux
do VERB xcomp
about ADP prep
the DET det
volcanoes NOUN pobj
& CCONJ cc
amp PROPN conj
; PUNCT punct
bush PROPN compound
fires VERB ROOT
spouting VERB xcomp
all PRON dobj
that PRON nsubj
CO2 PROPN relcl
into ADP prep
the DET det
air NOUN pobj
? PUNCT punct


In [22]:
#Let's check what entities the nlp object returns for the tweets.
doc2 = nlp(tweet_2)

for ent in doc2.ents:
    print(ent.text,ent.label_)
print('_______')
for token in doc2:
    print(token.text,token.pos_,token.dep_)

#We can see the 2nd tweet has more entities.

Ted Cruz PERSON
Jeb & ORG
Bush PERSON
Republicans NORP
Jeb & ORG
_______
Ted PROPN compound
Cruz PROPN nsubj
fires VERB ROOT
back ADV advmod
at ADP prep
Jeb PROPN pobj
& CCONJ cc
amp PROPN conj
; PUNCT punct
Bush PROPN conj
: PUNCT punct
ÛÏWe ADJ amod
lose NOUN appos
because SCONJ prep
of ADP pcomp
Republicans PROPN pobj
like ADP prep
Jeb PROPN pobj
& CCONJ cc
amp PROPN conj
; PUNCT punct
Mitt.Û PROPN compound
[ X punct
Video X nmod
] X dep
- PUNCT punct
  SPACE dep
http://t.co/bFtiaPF35F NUM appos


[back to top](#cont)

### <a id="ner"> Seems the regular NER doesn't recognize disasters

The entities found in the tweets miss the keywords of interest completely. The part-of-speech and dependency pipes give some direction.

The word "fire" is tagged as a "VERB" whether or not the tweet is about a disaster, and in both cases its dependency label is "ROOT". The POS tag does not help in this case.

### Just using a matcher, not NER

Matchers can be created from the keywords, following the process that the "Algo guy" at Explosion has shared... Let me first try this...

### There are items 

There are tweets for which there is no clear keyword that indicates whether it is a disaster or not. Can we reliably extract keywords from these tweets?

In [23]:
#the pattern required is as below...let us try adding one matcher pattern
pattern = [{'LOWER': 'evacuation', 'POS': {'NOT_IN': ['VERB']}}]
type(pattern)

list

In [24]:
#create patterns using the function, by giving the keywords
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab, validate=True)

matcher.add("evac_ptn", [pattern])

In [25]:
#Adding all the roots to the matcher

for keyword in train.roots.unique()[1:]:
    pattern = [{'LOWER': keyword, 'POS': {'NOT_IN': ['VERB']}}]
    pattern_name = keyword+'pattern'
    matcher.add(pattern_name,[pattern])
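
The loop above drops the first unique root with `[1:]`, which works here because the first rows have no keyword and NaN therefore comes first. A minimal sketch that skips missing values explicitly, wherever they appear, is shown below. `build_patterns` and the `_pattern` naming are hypothetical; the token pattern itself is the one used in the cell above.

```python
import math

def build_patterns(keywords):
    """Return {pattern_name: token_pattern} for every usable keyword.

    Missing values (None or NaN) are skipped explicitly instead of relying
    on their position in the input. Duplicate keywords collapse to one entry.
    """
    patterns = {}
    for keyword in keywords:
        if keyword is None or (isinstance(keyword, float) and math.isnan(keyword)):
            continue
        # Same token pattern as the cell above: case-insensitive match, verbs excluded.
        patterns[f"{keyword}_pattern"] = [{"LOWER": keyword, "POS": {"NOT_IN": ["VERB"]}}]
    return patterns

# Illustrative input with missing values in different positions.
patterns = build_patterns([float("nan"), "fire", "accident", None, "fire"])
for name, pattern in patterns.items():
    print(name, pattern)
```

Each entry could then be registered with `matcher.add(name, [pattern])`, as in the loop above.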

In [26]:
fire = (_ for _ in train['text'] if 'fire' in _.lower())
#the loop below has to be run in a separate cell for iteration
for i in range(20):
    print(next(fire))

Forest fire near La Ronge Sask. Canada
13,000 people receive #wildfires evacuation orders in California 
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
I'm on top of the hill and I can see a fire in the woods...
How the West was burned: Thousands of wildfires ablaze in California alone http://t.co/vl5TBR3wbr
How the West was burned: Thousands of wildfires ablaze in #California alone http://t.co/iCSjGZ9tE1 #climate #energy http://t.co/9FxmN0l0Bd
@Navista7 Steve these fires out here are something else! California is a tinderbox - and this clown was setting my 'hood ablaze @News24680
@nxwestmidlands huge fire at Wholesale markets ablaze http://t.co/rwzbFVNXER
@OnFireAnders I love you bb
Los Angeles Times: Arson suspect linked to 30 fires caught in Northern ... - http://t.co/xwMs1AWW8m #NewsInTweets http://t.co/TE2YeRugsi
Arson suspect lin

In [27]:
titles = (_ for _ in train['text'])

In [28]:
for i in range(20):
    doc = nlp(next(titles))
    if len(matcher(doc)) == 1:
        print(doc)

Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Forest fire near La Ronge Sask. Canada
Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
I'm on top of the hill and I can see a fire in the woods...
I'm afraid that the tornado is coming to our area...
Three people died from the heat wave so far
#Flood in Bago Myanmar #We arrived Bago


[back to top](#cont)

### <a id="mat"> What is the matcher predicting?

In [29]:
#I learnt about the ways to extend the Jupyter Notebook from 
#the https://www.youtube.com/watch?v=4V0JDdohxAk&list=PLBmcuObd5An559HbDr_alBnwVsGq-7uTF&index=3

from IPython.display import HTML as html_print

#The two functions below highlight the parts of the sentence that the matcher matches.
def style(s, bold=False):
    blob = f"<text>{s}</text>"
    if bold:
        blob = f"<b style='background-color: #fff59d'>{blob}</b>"
    return blob

def html_generator(g, n=10):
    blob = ""
    for _ in range(n):
        doc = next(g)

        #state holds [token, is_matched] for every token in the doc
        state = [[t, False] for t in doc]
        for match_id, start, end in matcher(doc):
            for i in range(start, end):
                state[i][1] = True
        blob += style(' '.join([style(str(t[0]), bold=t[1]) for t in state]) + '<br>') 
    return blob

In [30]:
g = (d for d in nlp.pipe(titles) if matcher(d))
html_print(html_generator(g, n=10))

In [31]:
train['Pred']= train.text.apply(lambda d: 1 if len(matcher(nlp(d))) > 0 else 0)

[back to top](#cont)

### <a id="mat_eval"> Evaluating the predictions of matcher

In [32]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
import plotly.express as px
import plotly.graph_objects as go

In [33]:
def plot_confusion_matrix(y_true, y_pred, class_names):
    confusion_ma = confusion_matrix(y_true, y_pred)
    confusion_ma = confusion_ma.astype(int)

    layout = {
        "title": "Confusion Matrix", 
        "xaxis": {"title": "Predicted value"}, 
        "yaxis": {"title": "Real value"}
    }

    fig = go.Figure(data=go.Heatmap(z=confusion_ma,
                                    x=class_names,
                                    y=class_names,
                                    hoverongaps=False),
                    layout=layout)
    fig.show()

In [34]:
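#confusion_matrix orders the classes as 0, 1, so the names must follow that order
plot_confusion_matrix(train['target'], train['Pred'],['0 (not disaster)','1 (disaster)'])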

In [35]:
print(classification_report(train['target'], train['Pred']))

              precision    recall  f1-score   support

           0       0.71      0.46      0.56      4342
           1       0.51      0.75      0.61      3271

    accuracy                           0.59      7613
   macro avg       0.61      0.61      0.58      7613
weighted avg       0.62      0.59      0.58      7613
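
The precision, recall and f1-score columns in the report above all come from three counts per class: true positives, false positives and false negatives. A minimal pure-Python sketch of that arithmetic follows; `binary_report` is a hypothetical helper and the labels are toy values, not taken from the dataset.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}

# Toy labels (made up for illustration, not taken from the dataset).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]
report = binary_report(y_true, y_pred)
print(report)
```

Read this way, a low precision for class 1 means many non-disaster tweets are flagged, and a low recall means many disaster tweets are missed.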



[back to top](#cont)

### <a id="next"> So what Next???

The matcher still produces false negatives: disaster tweets that are not detected as disasters. The report above gives a recall of 0.75 for class 1, so about a quarter of the disaster tweets are missed, while the precision of 0.51 shows that false positives are frequent as well. What could be causing this anomaly? Checking the matcher, we can observe that in many disaster tweets the words indicating the disaster do not match. So let's improve by adding additional language detection.

The mistakes made by the matcher can be seen highlighted using the HTML extension of Jupyter. The matcher misses these words because it only accepts keywords that are NOT tagged as verbs.

In [36]:
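#Note: Pred == 1 together with target == 1 selects the tweets the matcher classified correctly (true positives)
mistakes = (train.loc[lambda d: d['Pred'] == 1].loc[lambda d: d['target'] == 1]['text'])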

html_print(html_generator((nlp(i) for i in mistakes), n=10))

In [37]:
#We can see there are no matches in the disaster tweets that were predicted as 0
mistakes = (train.loc[lambda d: d['Pred'] == 0].loc[lambda d: d['target'] == 1]['text'])

html_print(html_generator((nlp(i) for i in mistakes), n=10))

In [38]:
#Extracting the words with the hashtag. The matcher has to choose the token following the "#"
pattern = [{"TEXT": "#"}, {"IS_ASCII": True}]
matcher = Matcher(nlp.vocab)
matcher.add("hashTag", [pattern])
#The doc has to be created before the matcher is called on it
doc = nlp(train.text[0])
matches = matcher(doc)

In [39]:
for mid, start, end in matches:
    print(start, end, doc[start:end])

In [40]:
def parse_train_data(doc):
    detections = [(doc[start:end].start_char, doc[start:end].end_char, 'disaster') for idx, start, end in matcher(doc)]
    return (doc.text, {'entities': detections})

parse_train_data(nlp("I reject the laws of the misguided false prophets imprison nations fueling self annihilation"))

('I reject the laws of the misguided false prophets imprison nations fueling self annihilation',
 {'entities': []})
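
`parse_train_data` converts token-level matches into character offsets, which is the `(start_char, end_char, label)` format the NER training data uses. To make the offsets concrete, here is a minimal sketch that produces the same structure with a regular expression instead of the spaCy matcher. The regex `#\w+` is an assumption that only approximates the token pattern above, and `hashtag_entities` is a hypothetical helper.

```python
import re

# Approximation of the hashtag pattern: "#" followed by word characters.
HASHTAG = re.compile(r"#\w+")

def hashtag_entities(text, label="disaster"):
    """Return (text, {'entities': [(start_char, end_char, label), ...]})."""
    spans = [(m.start(), m.end(), label) for m in HASHTAG.finditer(text)]
    return (text, {"entities": spans})

example = hashtag_entities(
    "Accident on I-24 W #NashvilleTraffic. Traffic moving 8m slower than usual."
)
print(example[1])
```

Note that this labels every hashtag as `disaster`, including place names, just as the hashtag pattern does.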

[back to top](#cont)

### <a id="t_ner"> Training the New NER in the pipeline

In [41]:
titles = train.loc[lambda d: d['target'] == 1]['text']

In [42]:
#creating the training data
TRAIN_DATA = [parse_train_data(d) for d in nlp.pipe(titles) if len(matcher(d)) == 1]
TRAIN_DATA[5:8]

[('Barbados #Bridgetown JAMAICA \x89ÛÒ Two cars set ablaze: SANTA CRUZ \x89ÛÓ Head of the St Elizabeth Police Superintende...  http://t.co/wDUEaj8Q4J',
  {'entities': [(9, 20, 'disaster')]}),
 ('Accident on I-24 W #NashvilleTraffic. Traffic moving 8m slower than usual. https://t.co/0GHk693EgJ',
  {'entities': [(19, 36, 'disaster')]}),
 ('.@NorwayMFA #Bahrain police had previously died in a road accident they were not killed by explosion https://t.co/gFJfgTodad',
  {'entities': [(12, 20, 'disaster')]})]

In [43]:
def create_blank_nlp(train_data):
    #Start from a blank English pipeline that contains only a new NER pipe
    nlp = spacy.blank("en")
    ner = nlp.add_pipe('ner')
    #Register every entity label that appears in the training annotations
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    return nlp

In [44]:
import random 
import datetime as dt
from spacy.training import Example

nlp = create_blank_nlp(TRAIN_DATA)
optimizer = nlp.initialize()  #begin_training() is the deprecated older name for this call in spaCy v3
for i in range(20): #The losses drop to single digits well within 20 iterations, so this count is retained
    random.shuffle(TRAIN_DATA)
    losses = {}

    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example],sgd=optimizer, losses=losses)
    print(f"Losses at iteration {i} - {dt.datetime.now()}", losses)

[2022-03-13 05:53:04,042] [INFO] Created vocabulary
[2022-03-13 05:53:04,045] [INFO] Finished initializing nlp object


Losses at iteration 0 - 2022-03-13 05:53:14.067345 {'ner': 188.5272182110496}
Losses at iteration 1 - 2022-03-13 05:53:23.876987 {'ner': 4.507919099329292}
Losses at iteration 2 - 2022-03-13 05:53:33.458791 {'ner': 2.183839125687786}
Losses at iteration 3 - 2022-03-13 05:53:43.074705 {'ner': 5.2960800635272565e-05}
Losses at iteration 4 - 2022-03-13 05:53:52.633217 {'ner': 2.4957640966023046e-10}
Losses at iteration 5 - 2022-03-13 05:54:02.126704 {'ner': 1.4031867391390734e-08}
Losses at iteration 6 - 2022-03-13 05:54:11.796219 {'ner': 4.615233982733577e-10}
Losses at iteration 7 - 2022-03-13 05:54:21.471781 {'ner': 1.197597693732191e-10}
Losses at iteration 8 - 2022-03-13 05:54:31.129150 {'ner': 2.666532935883055e-10}
Losses at iteration 9 - 2022-03-13 05:54:40.848390 {'ner': 4.775850908621417e-10}
Losses at iteration 10 - 2022-03-13 05:54:50.351093 {'ner': 2.4646305400963827e-10}
Losses at iteration 11 - 2022-03-13 05:54:59.895355 {'ner': 1.0683402400195507e-10}
Losses at iteration 1

In [45]:
doc = nlp("I'm afraid that the tornado is coming to our area...")
displacy.render(doc, style="ent")

In [46]:
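#Note: matcher is now the hashtag matcher from cell 38 and nlp is the newly trained NER pipeline, so Pred is overwritten here
train['Pred']= train.text.apply(lambda d: 1 if len(matcher(nlp(d))) > 0 else 0)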

In [47]:
#Let us observe how the ner model predicts the disasters in the mistakes done by the matcher 
html_print(html_generator((nlp(i) for i in mistakes), n=21))

In [48]:
from spacy.training import Example

example = Example.from_dict(nlp.make_doc(text), annotations)
nlp.update([example])

{'ner': 1.0287979242492208e-16}

In [49]:
#Helper to check whether the tweet has at least one entity, which here signals a disaster
def has_disas(doc):
    doc = nlp(doc)
    return len(doc.ents) >= 1

In [50]:
#Let's apply the helper to every tweet in the training data
train['result'] = train.text.apply(lambda x: has_disas(x))
train

| | id | keyword | location | text | target | roots | Pred | result |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 | | 1 | True |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 | | 0 | False |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 | | 0 | False |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 | | 1 | True |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 | | 1 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7608 | 10869 | | | Two giant cranes holding a bridge collapse int... | 1 | | 0 | False |
| 7609 | 10870 | | | @aria_ahrary @TheTawniest The out of control w... | 1 | | 0 | False |
| 7610 | 10871 | | | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 | | 0 | False |
| 7611 | 10872 | | | Police investigating after an e-bike collided ... | 1 | | 0 | False |


The classification reports below compare the output of the NER pipeline (`result`) with the output of the simple hashtag matcher (`Pred`).

In [51]:
print(classification_report(train['target'], train['result']))

              precision    recall  f1-score   support

           0       0.59      0.80      0.68      4342
           1       0.50      0.26      0.34      3271

    accuracy                           0.57      7613
   macro avg       0.54      0.53      0.51      7613
weighted avg       0.55      0.57      0.53      7613



In [52]:
print(classification_report(train['target'], train['Pred']))

              precision    recall  f1-score   support

           0       0.59      0.80      0.68      4342
           1       0.49      0.26      0.34      3271

    accuracy                           0.57      7613
   macro avg       0.54      0.53      0.51      7613
weighted avg       0.55      0.57      0.53      7613



[back to top](#cont)

The new NER has not provided any better output than what the stock NER has, and its report is nearly identical to that of the matcher it was trained on. The next step is to train the text categorizer in spaCy itself.

### <a id="tex"> Training the text categorizer pipe

In [53]:
from spacy.pipeline.textcat import DEFAULT_SINGLE_TEXTCAT_MODEL
from spacy.pipeline.textcat_multilabel import DEFAULT_MULTI_TEXTCAT_MODEL
from spacy.training import Example
import random

In [54]:
#creating the new language pipeline for text categorizer
nlp = spacy.load("en_core_web_sm")
config = {
"threshold": 0.5,
"model": DEFAULT_MULTI_TEXTCAT_MODEL
}
textcat = nlp.add_pipe("textcat_multilabel",config=config)

In [55]:
train_examples = []

# Out of the 7613 training rows, the first 5000 (about 66%) are used to train the categorizer
for index, row in train[:5000].iterrows():
    text = row["text"]
    rating = row["target"]
    label = {"POS": True, "NEG": False} if rating == 1 else {"NEG": True, "POS": False}
    train_examples.append(Example.from_dict(nlp.make_doc(text), {"cats": label}))
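
The slice `train[:5000]` takes the rows in file order, and the validation preview in this notebook starts at the keyword `military`, which suggests the file is sorted by keyword. If that holds, the held-out rows contain keywords the categorizer never saw during training. One possible alternative is a shuffled split, sketched minimally below. The helper `shuffled_split`, the 0.75 fraction and the seed are illustrative assumptions, not part of the original notebook.

```python
import random

def shuffled_split(rows, train_fraction=0.75, seed=42):
    """Shuffle row indices with a fixed seed and split them into train and validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_fraction)
    train_rows = [rows[i] for i in indices[:cut]]
    valid_rows = [rows[i] for i in indices[cut:]]
    return train_rows, valid_rows

# Stand-in rows shaped as (id, text, target); the real rows would come from train.csv.
rows = [(i, f"tweet {i}", i % 2) for i in range(20)]
train_rows, valid_rows = shuffled_split(rows)
print(len(train_rows), len(valid_rows))
```

With a DataFrame, `train.sample(frac=1, random_state=42)` before slicing would achieve a similar effect.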

In [56]:
textcat.add_label("POS")
textcat.add_label("NEG")
textcat.initialize(lambda: train_examples, nlp=nlp)

In [57]:
#training the text categorizer on the training dataset
epochs = 5
with nlp.select_pipes(enable="textcat_multilabel"):
    optimizer = nlp.resume_training()
    for i in range(epochs):
        random.shuffle(train_examples)
        for example in train_examples:
            nlp.update([example], sgd=optimizer)

In [58]:
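#Copy the slice so that adding a column later does not trigger a SettingWithCopyWarning
validation = train[5000:].copy()

# Use the validation tweets to predict the targets and compare them with the actual values. 
# The helper function takes the tweet and returns the predicted label (1 = disaster, 0 = not a disaster)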

def sent_id(tweet):
    doc = nlp(tweet)
    if doc.cats['POS'] > 0.5:
        return 1
    else:
        return 0
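
`sent_id` compares the `POS` score against a fixed 0.5. With `textcat_multilabel` the scores for the labels are produced independently and do not have to sum to 1, so another option is to compare the two scores directly. A small sketch with made-up score dictionaries shaped like `doc.cats` follows; `predict_label` is a hypothetical helper.

```python
def predict_label(cats, positive="POS", negative="NEG"):
    """Return 1 when the positive score is at least the negative score, else 0."""
    return 1 if cats[positive] >= cats[negative] else 0

# Made-up score dictionaries shaped like doc.cats for the two labels used here.
examples = [
    {"POS": 0.91, "NEG": 0.07},
    {"POS": 0.45, "NEG": 0.30},  # below 0.5, but still the stronger of the two labels
    {"POS": 0.20, "NEG": 0.85},
]
labels = [predict_label(c) for c in examples]
print(labels)
```

Whether this beats the fixed threshold would have to be checked on the validation set; the sketch only shows the decision rule.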
    

In [59]:
validation['pred'] = validation.text.apply(lambda x: sent_id(x))
validation.head()

| | id | keyword | location | text | target | roots | Pred | result | pred |
|---|---|---|---|---|---|---|---|---|---|
| 5000 | 7132 | military | NY | 13 reasons why we love women in the military ... | 0 | military | 0 | False | 0 |
| 5001 | 7134 | military | 302 | 13 reasons why we love women in the military ... | 0 | military | 0 | False | 0 |
| 5002 | 7135 | military | | @UniversityofLaw For the people who died in Hu... | 1 | military | 0 | False | 1 |
| 5003 | 7136 | military | | Lot of 20 Tom Clancy Military Mystery Novels -... | 0 | military | 1 | True | 0 |
| 5004 | 7137 | military | | @CochiseCollege For the people who died in Hum... | 1 | military | 0 | False | 1 |



### <a id="sub"> Output and Submission
    
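How is the text categorizer performing on the validation set?
    
    Compared to both the simple matcher and the custom NER, the text categorizer has higher accuracy (0.71 versus 0.59 and 0.57) and higher macro-averaged precision, recall and f1-scores. Note that the matcher and the NER were scored on the full training set, while the text categorizer is scored on the 2613 held-out rows.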

In [60]:
print(classification_report(validation['target'], validation['pred']))

              precision    recall  f1-score   support

           0       0.71      0.78      0.75      1436
           1       0.70      0.62      0.65      1177

    accuracy                           0.71      2613
   macro avg       0.71      0.70      0.70      2613
weighted avg       0.71      0.71      0.71      2613



In [61]:
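#confusion_matrix orders the classes as 0, 1, so the names must follow that order
plot_confusion_matrix(validation['target'], validation['pred'],['0 (not disaster)','1 (disaster)'])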

[back to top](#cont)

In [62]:
#Applying the text categorizer to the testing data.
test['pred'] = test.text.apply(lambda x: sent_id(x))
test.head()

| | id | keyword | location | text | pred |
|---|---|---|---|---|---|
| 0 | 0 | | | Just happened a terrible car crash | 1 |
| 1 | 2 | | | Heard about #earthquake is different cities, s... | 1 |
| 2 | 3 | | | there is a forest fire at spot pond, geese are... | 0 |
| 3 | 9 | | | Apocalypse lighting. #Spokane #wildfires | 1 |
| 4 | 11 | | | Typhoon Soudelor kills 28 in China and Taiwan | 1 |


In [63]:
#creating the dataframe with id and targets predicted.
my_submission = pd.DataFrame({'id': test.id, 'target': test.pred})
# you could use any filename. We choose submission here
my_submission.to_csv('submission.csv', index=False)

In [64]:
my_submission.head()

| | id | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 2 | 1 |
| 2 | 3 | 0 |
| 3 | 9 | 1 |
| 4 | 11 | 1 |
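
Before uploading, it can help to sanity-check the file that `to_csv` wrote. The sketch below validates the header and the target values with the standard library only. `check_submission` is a hypothetical helper, and the rule that the columns are exactly `id,target` with targets of 0 or 1 is an assumption based on the DataFrame built above.

```python
import csv
import io

def check_submission(csv_text):
    """Return the number of data rows after validating header and target values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        int(row["id"])  # raises ValueError if the id is not an integer
        if row["target"] not in {"0", "1"}:
            raise ValueError(f"target must be 0 or 1, got {row['target']!r}")
        n_rows += 1
    return n_rows

# First rows of the submission preview shown in this notebook.
sample = "id,target\n0,1\n2,1\n3,0\n9,1\n11,1\n"
n_rows = check_submission(sample)
print(n_rows)
```

For the real file, the text could be read with `open('submission.csv').read()` and passed to the same helper.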
