# Real or Not? NLP with Disaster Tweets

- Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.


- But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:
The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.


- In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.


- Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

## 1. Importing Required Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer,TfidfVectorizer
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.naive_bayes import MultinomialNB,BernoulliNB,GaussianNB

import re
import nltk
from nltk.stem.snowball import SnowballStemmer
import nltk.corpus as corpus
from sklearn.metrics import accuracy_score
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import gc

## 2. EDA (Exploratory Data Analysis) & Data Preparation

### Reading CSV Data

In [2]:
#reading train data
train_df = pd.read_csv(r'C:\Users\Tejas\Downloads\Real_or_Not_NLP\nlp-getting-started\train.csv',encoding='utf-8')
pd.set_option('display.max_columns', 500) #setting up pandas to view max columns
pd.set_option('display.max_colwidth', 1000) #setting up pandas to view max text
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1
3,6,,,"13,000 people receive #wildfires evacuation orders in California",1
4,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1


### Setting up Helper/Utility Functions

In [3]:
words = set(corpus.words.words())
def check_words(text):
    # The earlier filter (`w.lower() is not w.isalpha()`) compared a string with a bool,
    # which is always true, so in effect this function only tokenizes and re-joins the text.
    text_re = " ".join(nltk.wordpunct_tokenize(text))
    return text_re

In [4]:
#Extract keywords related to disasters from given text
def get_tags(word):
    disaster_list = ['die','death','accident','kill','typhoon','crash','evacuation','flood','wildfire','forest fire','fire','earthquake','famine','disease','avalanches','landslides','tsunamis','volcan','plague','accident','hazardous','pollution','epidemic','armed','conflict','loss of life','violence','military','infection']
    word = re.sub('#', '', word.lower())  # lowercase and strip '#' once, before matching
    temp = []
    for text in disaster_list:
        matches = re.findall(text, word)  # substring match, so 'fire' also hits 'wildfire'
        if matches:
            temp.append(",".join(matches))  # repeated hits of one keyword become a single entry
    # Returns a list of matches, or the string 'NONE' when nothing matches
    return temp if temp else 'NONE'

In [5]:
[get_tags(word) for word in train_df.text[:5]]

[['earthquake'],
 ['forest fire', 'fire'],
 ['evacuation'],
 ['evacuation', 'wildfire', 'fire'],
 ['wildfire', 'fire']]
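
Because `get_tags` matches substrings, a short keyword such as `die` also fires on unrelated words like `soldier`. A minimal sketch of a stricter variant follows, assuming whole-word matching with an optional plural `s` is what is wanted. The name `get_tags_strict` and the shortened keyword list are hypothetical and not part of this notebook.

```python
import re

# Hypothetical, shortened keyword list for illustration only
KEYWORDS = ['die', 'fire', 'wildfire', 'flood', 'earthquake']

# One compiled pattern per keyword: whole word, optionally followed by "s"
PATTERNS = {kw: re.compile(r'\b' + re.escape(kw) + r's?\b') for kw in KEYWORDS}

def get_tags_strict(text):
    """Return the keywords that occur as whole words in text."""
    cleaned = text.lower().replace('#', '')
    return [kw for kw, pattern in PATTERNS.items() if pattern.search(cleaned)]

print(get_tags_strict("The soldier fired a warning shot"))           # []
print(get_tags_strict("Smoke from #wildfires pours into a school"))  # ['wildfire']
```

The trade-off is that inflected forms such as `fired` no longer match, so the keyword list would need to spell out the variants that matter.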

In [6]:
#Tokenize and stem words
def stemmer(text):
    # split into words
    tokens = word_tokenize(text)
    # stemming of words
    stemmer = SnowballStemmer('english')
    stemmed = [stemmer.stem(word) for word in tokens]
    return stemmed

### Creating New Data Columns From Extracted Data

In [7]:
train_df['hashtags'] = [re.findall(r"#(\w+)", text) for text in train_df.text] #Extracting hashtags
train_df['reference'] = [get_tags(word) for word in train_df.text] #finding key words
train_df['text_re'] = [re.sub(r'\bhttp[s]?://[a-zA-Z]*[0-9]*.*\b\s','',text) for text in train_df.text] #removing any hyperlinks if any
train_df['text_re'] = [check_words(text) for text in train_df.text_re] #tokenizing and re-joining text with single spaces
train_df['text_stemmed'] = [stemmer(text) for text in train_df.text] #stemmed text
train_df.head()

Unnamed: 0,id,keyword,location,text,target,hashtags,reference,text_re,text_stemmed
0,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1,[earthquake],[earthquake],Our Deeds are the Reason of this # earthquake May ALLAH Forgive us all,"[our, deed, are, the, reason, of, this, #, earthquak, may, allah, forgiv, us, all]"
1,4,,,Forest fire near La Ronge Sask. Canada,1,[],"[forest fire, fire]",Forest fire near La Ronge Sask . Canada,"[forest, fire, near, la, rong, sask, ., canada]"
2,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1,[],[evacuation],All residents asked to ' shelter in place ' are being notified by officers . No other evacuation or shelter in place orders are expected,"[all, resid, ask, to, shelter, in, place, ', are, be, notifi, by, offic, ., no, other, evacu, or, shelter, in, place, order, are, expect]"
3,6,,,"13,000 people receive #wildfires evacuation orders in California",1,[wildfires],"[evacuation, wildfire, fire]","13 , 000 people receive # wildfires evacuation orders in California","[13,000, peopl, receiv, #, wildfir, evacu, order, in, california]"
4,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1,"[Alaska, wildfires]","[wildfire, fire]",Just got sent this photo from Ruby # Alaska as smoke from # wildfires pours into a school,"[just, got, sent, this, photo, from, rubi, #, alaska, as, smoke, from, #, wildfir, pour, into, a, school]"
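
The link-removal pattern used above ends in a greedy `.*`, so it can also delete words that follow a link, and it requires trailing whitespace, so a link at the very end of a tweet is left in place (which is probably why fragments such as `acfi2rhz4n` appear among the feature names later on). A hedged alternative is sketched below; `strip_urls` is a hypothetical helper and the sample tweets are made up.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r'https?://\S+')

def strip_urls(text):
    """Remove links, then collapse the whitespace they leave behind."""
    without_links = URL_PATTERN.sub('', text)
    return ' '.join(without_links.split())

print(strip_urls("Forest fire http://t.co/abc123 near La Ronge"))
print(strip_urls("Evacuation orders issued https://t.co/xyz789"))
```

Swapping this in would change the vocabulary, so the feature counts and scores reported in this notebook would no longer match exactly.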


In [8]:
train_df['disaster_lvl'] = [1 if len(text)==1 else 2 if len(text)==2 else 3 for text in train_df.reference] #custom priority scale
train_df['hashtags'] = [",".join(text) for text in train_df.hashtags] #clean up
train_df['reference'] = [",".join(text) for text in train_df.reference] #clean up
train_df['text_stemmed'] = [" ".join(text) for text in train_df.text_stemmed] #clean up
train_df.head()

Unnamed: 0,id,keyword,location,text,target,hashtags,reference,text_re,text_stemmed,disaster_lvl
0,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1,earthquake,earthquake,Our Deeds are the Reason of this # earthquake May ALLAH Forgive us all,our deed are the reason of this # earthquak may allah forgiv us all,1
1,4,,,Forest fire near La Ronge Sask. Canada,1,,"forest fire,fire",Forest fire near La Ronge Sask . Canada,forest fire near la rong sask . canada,2
2,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1,,evacuation,All residents asked to ' shelter in place ' are being notified by officers . No other evacuation or shelter in place orders are expected,all resid ask to shelter in place ' are be notifi by offic . no other evacu or shelter in place order are expect,1
3,6,,,"13,000 people receive #wildfires evacuation orders in California",1,wildfires,"evacuation,wildfire,fire","13 , 000 people receive # wildfires evacuation orders in California","13,000 peopl receiv # wildfir evacu order in california",3
4,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1,"Alaska,wildfires","wildfire,fire",Just got sent this photo from Ruby # Alaska as smoke from # wildfires pours into a school,just got sent this photo from rubi # alaska as smoke from # wildfir pour into a school,2
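
One side effect worth knowing: `get_tags` returns the string `'NONE'` when nothing matches, so `len(...)` sees four characters and the expression above assigns level 3 to tweets without any keyword, and `",".join('NONE')` produces `N,O,N,E`. If a separate "no keyword" level is intended (an assumption, the notebook does not say), a sketch could look like this. `disaster_level` is a hypothetical helper.

```python
def disaster_level(tags):
    """Map a get_tags-style result to a level: 0 = no keyword, capped at 3."""
    if isinstance(tags, str):  # the 'NONE' sentinel is a string, not a list of matches
        return 0
    return min(len(tags), 3)

print(disaster_level('NONE'))                   # 0
print(disaster_level(['forest fire', 'fire']))  # 2
print(disaster_level(['a', 'b', 'c', 'd']))     # 3
```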


### Setting up Inputs(x) and Labels(y)

In [9]:
#setting up X & Y for training
x = train_df.text_re
y = train_df.target

In [10]:
x.head()

0                                                                      Our Deeds are the Reason of this # earthquake May ALLAH Forgive us all
1                                                                                                     Forest fire near La Ronge Sask . Canada
2    All residents asked to ' shelter in place ' are being notified by officers . No other evacuation or shelter in place orders are expected
3                                                                         13 , 000 people receive # wildfires evacuation orders in California
4                                                   Just got sent this photo from Ruby # Alaska as smoke from # wildfires pours into a school
Name: text_re, dtype: object

In [11]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

### Splitting Data into Training and Testing Sets and Vectorizing

In [12]:
#Splitting X & Y into Train & Test splits
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=0,stratify=y) 

In [13]:
vectorizer = TfidfVectorizer() #initializing TfidfVectorizer

In [14]:
#fitting & transforming data into vectorized form
x_train_transformed = vectorizer.fit_transform(x_train)
x_test_transformed  = vectorizer.transform(x_test)
feature_names = vectorizer.get_feature_names() #on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2

In [15]:
len(feature_names) #viewing len of features

16355

In [16]:
feature_names[1000:1500] #sample viewing feature names to ensure no undesired elements are mixed-in

['accept',
 'accepte',
 'accepts',
 'access',
 'accident',
 'accidentally',
 'accidents',
 'accionempresa',
 'accompanying',
 'according',
 'accordingly',
 'account',
 'accountable',
 'accuracy',
 'accused',
 'accuses',
 'accustomed',
 'acdelco',
 'ace',
 'acebreakingnews',
 'acenewsdesk',
 'acesse',
 'acfi2rhz4n',
 'achieve',
 'achievement',
 'achieving',
 'achimota',
 'aching',
 'acid',
 'acmrm833zq',
 'acoustic',
 'acousticmaloley',
 'acquiesce',
 'acquire',
 'acquired',
 'acquisitions',
 'acres',
 'acronym',
 'across',
 'acrylic',
 'act',
 'actavis',
 'acted',
 'actin',
 'acting',
 'action',
 'actionmoviestaughtus',
 'actions',
 'activate',
 'activated',
 'activates',
 'active',
 'actively',
 'activision',
 'activist',
 'activities',
 'activity',
 'actor',
 'actress',
 'acts',
 'actual',
 'actually',
 'acura',
 'acute',
 'aczruorytd',
 'ad',
 'adam',
 'adamantly',
 'adamnibloe',
 'adamrubinespn',
 'adamtuss',
 'adani',
 'adanne___',
 'add',
 'added',
 'addiction',
 'adding',
 'addi

In [17]:
x_train_transformed.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [18]:
#confirming shapes of transformed data

In [19]:
x_train_transformed.shape

(5709, 16355)

In [20]:
x_train_transformed.toarray().shape

(5709, 16355)

In [21]:
y_train.shape,y.shape

((5709,), (7613,))

## 3. Training Models

### Setup Models

In [22]:
#initialize models
bernoulli_nb = BernoulliNB(alpha=0.95)
multi_nb = MultinomialNB(alpha=0.95)
gaussian_nb = GaussianNB()

In [23]:
#Training models
bernoulli_nb.fit(x_train_transformed.toarray(),y_train)
multi_nb.fit(x_train_transformed.toarray(),y_train)
gaussian_nb.fit(x_train_transformed.toarray(),y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
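
`BernoulliNB` and `MultinomialNB` accept SciPy sparse matrices directly, so the `.toarray()` conversion is only needed for `GaussianNB`. Keeping the matrix sparse saves a good deal of memory at roughly 16,000 features. A small sketch on invented toy sentences (not the competition data):

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus, invented for illustration (1 = disaster, 0 = not)
docs = ["forest fire near the town", "flood warning issued",
        "i love this song", "great day at the beach"]
labels = [1, 1, 0, 0]

X_sparse = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix
print(issparse(X_sparse), X_sparse.shape)

predictions = {}
for model in (BernoulliNB(alpha=0.95), MultinomialNB(alpha=0.95)):
    model.fit(X_sparse, labels)  # no .toarray() needed
    predictions[type(model).__name__] = model.predict(X_sparse)

print(predictions)
```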

## 4. Testing & Cross-Validating Trained Models

In [24]:
#Predicting on Trained bernoulli_nb Model with Testing data
bernoulli_nb_y_predict = bernoulli_nb.predict(x_test_transformed.toarray())
print("Bernoulli_NB accuracy_score :",accuracy_score(y_test, bernoulli_nb_y_predict))

Bernoulli_NB accuracy_score : 0.7998949579831933


In [25]:
#Predicting on Trained multi_nb Model with Testing data
multi_nb_y_predict = multi_nb.predict(x_test_transformed.toarray())
print("Multi_NB accuracy_score :",accuracy_score(y_test, multi_nb_y_predict))

Multi_NB accuracy_score : 0.789390756302521


In [26]:
#Predicting on Trained gaussian_nb Model with Testing data
gaussian_nb_y_predict = gaussian_nb.predict(x_test_transformed.toarray())
print("Gaussian_NB accuracy_score :",accuracy_score(y_test, gaussian_nb_y_predict))

Gaussian_NB accuracy_score : 0.6108193277310925


In [27]:
#Cross Validating bernoulli_nb Model on Training data
bernoulli_nb_scores = cross_val_score(bernoulli_nb, x_train_transformed.toarray(), y_train, cv=5, scoring="f1")
print(bernoulli_nb_scores)
bernoulli_nb_scores.mean() 

[0.77483444 0.752      0.7217695  0.7010551  0.75056689]


0.7400451859152714

In [28]:
#Cross Validating multi_nb Model on Training data
multi_nb_scores = cross_val_score(multi_nb, x_train_transformed.toarray(), y_train, cv=5, scoring="f1")
print(multi_nb_scores)
multi_nb_scores.mean() 

[0.75261324 0.72182254 0.70517449 0.67396594 0.72791519]


0.7162982804076969

In [29]:
#Cross Validating gaussian_nb Model on Training data
gaussian_nb_scores = cross_val_score(gaussian_nb, x_train_transformed.toarray(), y_train, cv=5, scoring="f1")
print(gaussian_nb_scores)
gaussian_nb_scores.mean() 

[0.63043478 0.65046102 0.62778731 0.62328767 0.62969283]


0.6323327232541376
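
The cross-validation scores above are computed on a matrix that was vectorized once on the whole training split, so every validation fold has already contributed to the vocabulary and IDF weights. A sketch of wrapping the vectorizer and classifier in a pipeline, so that both are refit inside each fold, is shown below. The toy sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration (1 = disaster, 0 = not)
texts = ["forest fire spreading fast", "flood waters rising downtown",
         "earthquake shakes the city", "wildfire evacuation ordered",
         "my mixtape is fire", "flood of emails today",
         "this party is a disaster lol", "i am dying of laughter"]
targets = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is fit on the training part of each fold only
pipe = make_pipeline(TfidfVectorizer(), BernoulliNB(alpha=0.95))
fold_scores = cross_val_score(pipe, texts, targets, cv=4, scoring="f1")
print(fold_scores, fold_scores.mean())
```

On eight sentences the scores themselves mean little; the point is only the structure of the call.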

#### Proceeding with Bernoulli_NB

In [30]:
del multi_nb,gaussian_nb #deleting unused models to free up memory
gc.collect() #collecting garbage

158

In [31]:
x_transformed = vectorizer.fit_transform(x) #Vectorizing complete data

In [32]:
#using bernoulli_nb as it has the highest cross_val_score
bernoulli_nb.fit(x_transformed,y) #Training model with complete data

BernoulliNB(alpha=0.95, binarize=0.0, class_prior=None, fit_prior=True)

In [33]:
#Predicting on Trained Model with Complete data
y_predict = bernoulli_nb.predict(x_transformed.toarray())
print(y_predict.shape)
print("Bernoulli_NB accuracy_score :",accuracy_score(y, y_predict))

(7613,)
Bernoulli_NB accuracy_score : 0.8897937738079601


In [34]:
#Cross Validating Trained Model with Complete data
scores = cross_val_score(bernoulli_nb, x_transformed.toarray(), y, cv=5, scoring="f1")
print(scores)
scores.mean()

[0.61048689 0.62052117 0.68825911 0.64347826 0.76052028]


0.6646531419290899

In [35]:
train_df['model_output'] = y_predict #using Bernoulli Model output as a data column
train_df.head()

Unnamed: 0,id,keyword,location,text,target,hashtags,reference,text_re,text_stemmed,disaster_lvl,model_output
0,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1,earthquake,earthquake,Our Deeds are the Reason of this # earthquake May ALLAH Forgive us all,our deed are the reason of this # earthquak may allah forgiv us all,1,0
1,4,,,Forest fire near La Ronge Sask. Canada,1,,"forest fire,fire",Forest fire near La Ronge Sask . Canada,forest fire near la rong sask . canada,2,1
2,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1,,evacuation,All residents asked to ' shelter in place ' are being notified by officers . No other evacuation or shelter in place orders are expected,all resid ask to shelter in place ' are be notifi by offic . no other evacu or shelter in place order are expect,1,1
3,6,,,"13,000 people receive #wildfires evacuation orders in California",1,wildfires,"evacuation,wildfire,fire","13 , 000 people receive # wildfires evacuation orders in California","13,000 peopl receiv # wildfir evacu order in california",3,1
4,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1,"Alaska,wildfires","wildfire,fire",Just got sent this photo from Ruby # Alaska as smoke from # wildfires pours into a school,just got sent this photo from rubi # alaska as smoke from # wildfir pour into a school,2,1
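
Note that `model_output` is predicted on the same rows the model was trained on, so it is more accurate here than it will be on unseen tweets, and a second-stage model may learn to trust it too much. One common alternative is out-of-fold prediction with `cross_val_predict`. The sketch below uses invented toy data and is not the approach taken in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration (1 = disaster, 0 = not)
texts = ["forest fire spreading fast", "flood waters rising downtown",
         "earthquake shakes the city", "wildfire evacuation ordered",
         "my mixtape is fire", "flood of emails today",
         "this party is a disaster lol", "i am dying of laughter"]
targets = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = make_pipeline(TfidfVectorizer(), BernoulliNB(alpha=0.95))

# Each row is predicted by a model that never saw that row during fitting
oof_output = cross_val_predict(pipe, texts, targets, cv=4)
print(oof_output)
```

The resulting array has one prediction per row and could be stored as a feature column in the same way as `model_output`.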


In [36]:
#creating new DataFrame
df = train_df.copy()
df.fillna('NONE',inplace=True)
df.head()

Unnamed: 0,id,keyword,location,text,target,hashtags,reference,text_re,text_stemmed,disaster_lvl,model_output
0,1,NONE,NONE,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1,earthquake,earthquake,Our Deeds are the Reason of this # earthquake May ALLAH Forgive us all,our deed are the reason of this # earthquak may allah forgiv us all,1,0
1,4,NONE,NONE,Forest fire near La Ronge Sask. Canada,1,,"forest fire,fire",Forest fire near La Ronge Sask . Canada,forest fire near la rong sask . canada,2,1
2,5,NONE,NONE,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1,,evacuation,All residents asked to ' shelter in place ' are being notified by officers . No other evacuation or shelter in place orders are expected,all resid ask to shelter in place ' are be notifi by offic . no other evacu or shelter in place order are expect,1,1
3,6,NONE,NONE,"13,000 people receive #wildfires evacuation orders in California",1,wildfires,"evacuation,wildfire,fire","13 , 000 people receive # wildfires evacuation orders in California","13,000 peopl receiv # wildfir evacu order in california",3,1
4,7,NONE,NONE,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1,"Alaska,wildfires","wildfire,fire",Just got sent this photo from Ruby # Alaska as smoke from # wildfires pours into a school,just got sent this photo from rubi # alaska as smoke from # wildfir pour into a school,2,1


In [37]:
del train_df #deleting unused DataFrame to free up memory

In [38]:
enc = LabelEncoder() #initializing LabelEncoder

In [39]:
#LabelEncoding required non-numeric columns
df.keyword = enc.fit_transform(df.keyword)
df.location = enc.fit_transform(df.location)
df.reference = enc.fit_transform(df.reference)
df.hashtags = enc.fit_transform(df.hashtags)
df.head()

Unnamed: 0,id,keyword,location,text,target,hashtags,reference,text_re,text_stemmed,disaster_lvl,model_output
0,1,0,1753,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1,942,56,Our Deeds are the Reason of this # earthquake May ALLAH Forgive us all,our deed are the reason of this # earthquak may allah forgiv us all,1,0
1,4,0,1753,Forest fire near La Ronge Sask. Canada,1,0,87,Forest fire near La Ronge Sask . Canada,forest fire near la rong sask . canada,2,1
2,5,0,1753,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1,0,63,All residents asked to ' shelter in place ' are being notified by officers . No other evacuation or shelter in place orders are expected,all resid ask to shelter in place ' are be notifi by offic . no other evacu or shelter in place order are expect,1,1
3,6,0,1753,"13,000 people receive #wildfires evacuation orders in California",1,1330,66,"13 , 000 people receive # wildfires evacuation orders in California","13,000 peopl receiv # wildfir evacu order in california",3,1
4,7,0,1753,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1,52,112,Just got sent this photo from Ruby # Alaska as smoke from # wildfires pours into a school,just got sent this photo from rubi # alaska as smoke from # wildfir pour into a school,2,1


In [40]:
#setting up X & Y for training rf_clf
X = df.loc[:,['keyword','location','model_output','reference','hashtags','disaster_lvl']]
Y = df.target

In [41]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,random_state=0,stratify=Y) #Spliting X & Y into Train & Test splits

In [42]:
#Setup RandomForestClassifier Model 
#arguments that only restated defaults are omitted; max_features='auto' and min_impurity_split no longer exist in recent scikit-learn
rf_clf = RandomForestClassifier(n_estimators=1000,
    min_samples_split=5,
    random_state=10,
    verbose=1)

In [43]:
rf_clf.fit(X_train,Y_train) #Training RandomForestClassifier Model

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    8.7s finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=10, verbose=1,
                       warm_start=False)

In [44]:
#Predicting on Trained rf_clf Model with Testing data
Y_predict = rf_clf.predict(X_test)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.7s finished


In [45]:
print("Accuracy :",accuracy_score(Y_test, Y_predict))

Accuracy : 0.8771008403361344


In [46]:
#Cross Validating rf_clf Model on Training data
scores = cross_val_score(rf_clf, X_train, Y_train, cv=5, scoring="f1")
print(scores)
scores.mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    6.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    6.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    8.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.4s finished

[0.87096774 0.85032538 0.82102908 0.84640884 0.84768212]

0.8472826326606763

In [47]:
#Cross Validating rf_clf Model on Complete data
scores = cross_val_score(rf_clf, X, Y, cv=5, scoring="f1")
print(scores)
print(scores.mean())

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    9.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    8.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    8.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.3s finished

[0.58364312 0.56743003 0.60044643 0.69097889 0.60642757]
0.6097852073615427
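
The mean F1 drops from about 0.85 on the training split to about 0.61 on the complete data. One possible contributor (an assumption, not verified in this notebook) is that `cv=5` uses unshuffled folds, whereas `train_test_split` shuffles by default. If the rows of the CSV are grouped, for instance by keyword, each unshuffled fold sees a different slice of the data. Passing a shuffled `StratifiedKFold` is a cheap way to check. The sketch uses synthetic data from `make_classification` as a stand-in for the feature table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the six-column feature table
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=10)

# shuffle=True mixes the rows before splitting; random_state keeps it repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1")
print(shuffled_scores, shuffled_scores.mean())
```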


In [48]:
#Predicting with Trained rf_clf Model on Complete Data
df['rf_clf'] = rf_clf.predict(X)
df.head()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    2.4s finished


Unnamed: 0,id,keyword,location,text,target,hashtags,reference,text_re,text_stemmed,disaster_lvl,model_output,rf_clf
0,1,0,1753,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1,942,56,Our Deeds are the Reason of this # earthquake May ALLAH Forgive us all,our deed are the reason of this # earthquak may allah forgiv us all,1,0,1
1,4,0,1753,Forest fire near La Ronge Sask. Canada,1,0,87,Forest fire near La Ronge Sask . Canada,forest fire near la rong sask . canada,2,1,1
2,5,0,1753,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1,0,63,All residents asked to ' shelter in place ' are being notified by officers . No other evacuation or shelter in place orders are expected,all resid ask to shelter in place ' are be notifi by offic . no other evacu or shelter in place order are expect,1,1,1
3,6,0,1753,"13,000 people receive #wildfires evacuation orders in California",1,1330,66,"13 , 000 people receive # wildfires evacuation orders in California","13,000 peopl receiv # wildfir evacu order in california",3,1,1
4,7,0,1753,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1,52,112,Just got sent this photo from Ruby # Alaska as smoke from # wildfires pours into a school,just got sent this photo from rubi # alaska as smoke from # wildfir pour into a school,2,1,1


## 5. Validation

### Reading CSV Data

In [49]:
#reading test/validation data
test_df = pd.read_csv(r'C:\Users\Tejas\Downloads\Real_or_Not_NLP\nlp-getting-started\test.csv',encoding='utf-8')
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, stay safe everyone."
2,3,,,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all"
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


### Data Preparation

In [50]:
test_df['hashtags'] = [re.findall(r"#(\w+)", text) for text in test_df.text] #Extracting hashtags
test_df['hashtags'] = [",".join(text) for text in test_df.hashtags] #cleanup
test_df['reference'] = [get_tags(word) for word in test_df.text] #Extracting key words
test_df['text_re'] = [re.sub(r'\bhttp[s]?://[a-zA-Z]*[0-9]*.*\b\s','',text) for text in test_df.text] #removing any hyperlinks if any
test_df['disaster_lvl'] = [1 if len(text)==1 else 2 if len(text)==2 else 3 for text in test_df.reference] #custom priority scale
test_df['reference'] = [",".join(text) for text in test_df.reference] #cleanup
test_df.head()

Unnamed: 0,id,keyword,location,text,hashtags,reference,text_re,disaster_lvl
0,0,,,Just happened a terrible car crash,,crash,Just happened a terrible car crash,1
1,2,,,"Heard about #earthquake is different cities, stay safe everyone.",earthquake,earthquake,"Heard about #earthquake is different cities, stay safe everyone.",1
2,3,,,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",,"forest fire,fire","there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",2
3,9,,,Apocalypse lighting. #Spokane #wildfires,"Spokane,wildfires","wildfire,fire",Apocalypse lighting. #Spokane #wildfires,2
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,,"kill,typhoon",Typhoon Soudelor kills 28 in China and Taiwan,2


### Setting up Inputs(X)

In [51]:
x = test_df.text_re #setting up inputs(X) for bernoulli_nb Model
x_transformed = vectorizer.transform(x) #Vectorizing Validation Data

### Predicting on Validation Data

In [52]:
#Predicting with Trained bernoulli_nb Model on Validation Data
y_predict = bernoulli_nb.predict(x_transformed.toarray())

In [53]:
#Assigning new column for model output
test_df['model_output'] = y_predict
test_df.head()

Unnamed: 0,id,keyword,location,text,hashtags,reference,text_re,disaster_lvl,model_output
0,0,,,Just happened a terrible car crash,,crash,Just happened a terrible car crash,1,0
1,2,,,"Heard about #earthquake is different cities, stay safe everyone.",earthquake,earthquake,"Heard about #earthquake is different cities, stay safe everyone.",1,0
2,3,,,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",,"forest fire,fire","there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",2,1
3,9,,,Apocalypse lighting. #Spokane #wildfires,"Spokane,wildfires","wildfire,fire",Apocalypse lighting. #Spokane #wildfires,2,1
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,,"kill,typhoon",Typhoon Soudelor kills 28 in China and Taiwan,2,1


In [54]:
df = test_df.copy() #creating new DataFrame
df.fillna('NONE',inplace=True) #Filling NaN's in Data with NONE

In [55]:
del test_df #deleting unused DataFrame to free up memory

In [56]:
#LabelEncoding required non-numeric columns
df.keyword = enc.fit_transform(df.keyword)
df.location = enc.fit_transform(df.location)
df.reference = enc.fit_transform(df.reference)
df.hashtags = enc.fit_transform(df.hashtags)
df.head()

Unnamed: 0,id,keyword,location,text,hashtags,reference,text_re,disaster_lvl,model_output
0,0,0,829,Just happened a terrible car crash,0,5,Just happened a terrible car crash,1,0
1,2,0,829,"Heard about #earthquake is different cities, stay safe everyone.",515,20,"Heard about #earthquake is different cities, stay safe everyone.",1,0
2,3,0,829,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",0,39,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",2,1
3,9,0,829,Apocalypse lighting. #Spokane #wildfires,380,55,Apocalypse lighting. #Spokane #wildfires,2,1
4,11,0,829,Typhoon Soudelor kills 28 in China and Taiwan,0,48,Typhoon Soudelor kills 28 in China and Taiwan,2,1
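
Calling `fit_transform` again on this frame assigns codes from scratch, so the same value can receive a different integer than it had during training: `NONE` in `location` is 1753 in the training table but 829 here. A sketch of fitting the encoder once on training data and reusing the mapping follows, assuming scikit-learn 0.24 or newer for `handle_unknown='use_encoded_value'`. The miniature frames are invented.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented miniature frames standing in for the train and test tables
train_demo = pd.DataFrame({'keyword': ['fire', 'flood', 'NONE'],
                           'location': ['Canada', 'NONE', 'Texas']})
test_demo = pd.DataFrame({'keyword': ['flood', 'quake'],
                          'location': ['NONE', 'Canada']})

# Categories unseen during fit are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = encoder.fit_transform(train_demo)
test_codes = encoder.transform(test_demo)  # transform only, no refit

print(train_codes)
print(test_codes)
```

With this setup `flood` and `NONE` keep their training codes in the test frame, and the unseen keyword `quake` becomes -1.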


In [57]:
#setting up inputs(X) for rf_clf
X = df.loc[:,['keyword','location','model_output','reference','hashtags','disaster_lvl']] 

In [58]:
#Predicting with Trained rf_clf Model on Validation Data
df['rf_clf'] = rf_clf.predict(X)
df.head()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.9s finished


Unnamed: 0,id,keyword,location,text,hashtags,reference,text_re,disaster_lvl,model_output,rf_clf
0,0,0,829,Just happened a terrible car crash,0,5,Just happened a terrible car crash,1,0,0
1,2,0,829,"Heard about #earthquake is different cities, stay safe everyone.",515,20,"Heard about #earthquake is different cities, stay safe everyone.",1,0,0
2,3,0,829,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",0,39,"there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all",2,1,1
3,9,0,829,Apocalypse lighting. #Spokane #wildfires,380,55,Apocalypse lighting. #Spokane #wildfires,2,1,1
4,11,0,829,Typhoon Soudelor kills 28 in China and Taiwan,0,48,Typhoon Soudelor kills 28 in China and Taiwan,2,1,1


### Outputting Results

In [59]:
#Creating new DataFrame for exporting output
output = pd.DataFrame(columns=['id','target'])
output.id = df.id #id Column
output.target = df.rf_clf #target column
output.head() #viewing output DataFrame

Unnamed: 0,id,target
0,0,0
1,2,0
2,3,1
3,9,1
4,11,1


In [60]:
#Writing output into csv file
output.to_csv(r'C:\Users\Tejas\Downloads\Real_or_Not_NLP\nlp-getting-started\Real_or_Not_NLP_v1_output.csv',index=False) 
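
The absolute Windows path above only works on the author's machine. A sketch of writing the submission relative to a configurable directory and reading it back as a sanity check is shown below. `OUTPUT_DIR` and the sample rows are placeholders, not values from this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder directory; point this at the real competition folder
OUTPUT_DIR = Path(tempfile.mkdtemp())

# Sample rows standing in for the real predictions
submission = pd.DataFrame({'id': [0, 2, 3], 'target': [0, 0, 1]})

out_path = OUTPUT_DIR / 'Real_or_Not_NLP_v1_output.csv'
submission.to_csv(out_path, index=False)

# Read the file back to confirm the expected two-column layout
check = pd.read_csv(out_path)
print(list(check.columns), len(check))
```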