### Algorithm applied : 
+ NLP (Natural Language Processing)

### Project Objective : 
+ To build a text classifier that predicts whether a given English tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0. The notebook compares several classical scikit-learn classifiers on bag-of-words features and keeps the most accurate one.

### Importing Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing datasets

+ Each sample in the train and test sets contains the following information:
- The text of a tweet
- A keyword from that tweet (may be blank)
- The location the tweet was sent from (may also be blank)

+ The task is to predict whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

+ Files
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format
+ Columns
- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
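
Since `keyword` and `location` may be blank, it can help to count the blanks and check the class balance before modelling. The following is a minimal sketch on a made-up three-row sample in the same column layout as `train.csv`; the second and third rows are hypothetical and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample with the same columns as train.csv
sample_csv = io.StringIO(
    "id,keyword,location,text,target\n"
    "1,,,Forest fire near La Ronge Sask. Canada,1\n"
    "2,ablaze,London,Smoke alarm went off again,0\n"
    "3,flood,,Heavy rain causes flash flooding,1\n"
)
sample = pd.read_csv(sample_csv)

# Empty CSV fields are read as NaN, so isna() counts the blanks per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of tweets per class in the label column
class_counts = sample["target"].value_counts()
print(class_counts)
```

The same two calls can be run on the real `dataset` once it has been loaded.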

In [5]:
dataset = pd.read_csv('train.csv')
dataset

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [6]:
dataset.keys()

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

### Cleaning the text
+ We are going to remove punctuation, digits and other special characters from the text and convert every letter to lower case, because these variations add noise when the data is processed.
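
The cleaning steps can be sketched on one made-up tweet using only the standard library. The tweet and the tiny `toy_stopwords` set are assumptions for illustration; the notebook itself uses NLTK's English stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's list)
toy_stopwords = {"the", "a", "an", "is", "of", "are", "this"}

def basic_clean(tweet):
    # Replace every character that is not a letter with a space
    letters_only = re.sub("[^a-zA-Z]", " ", tweet)
    # Lower-case the text and split it on whitespace
    words = letters_only.lower().split()
    # Drop the stopwords
    return [w for w in words if w not in toy_stopwords]

cleaned = basic_clean("The #Flood of 2015 is REAL!!")
print(cleaned)  # ['flood', 'real']
```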

In [7]:
import re
# Importing nltk to remove stopwords (articles such as 'the', 'a', 'an', ...)
# from our text, as they do not help to predict whether a tweet is about a disaster
import nltk

# Downloading all the stopwords from the nltk library
nltk.download('stopwords')

# Importing stopwords
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to C:\Users\Ravi
[nltk_data]     Kumar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Stemming of Text : It reduces each word to its root form.

+ Example:
- loved as love
- helped as help
- hopes as hope

+ Reason: 
- The bag-of-words model built after cleaning is a sparse matrix with one column per distinct word. Without stemming, different forms of the same word (for example the present and the past tense of a verb) would each get their own column even though they mean the same thing. Stemming merges these redundant columns, which reduces the dimension of the sparse matrix.
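
As a small illustration of this effect, the sketch below stems a hypothetical token list with the same `PorterStemmer` class the notebook imports and compares the number of distinct words before and after.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Hypothetical tokens containing inflected forms of three words
tokens = ["loved", "love", "helped", "help", "hopes", "hope"]
stems = [ps.stem(t) for t in tokens]

vocab_before = sorted(set(tokens))  # one column per distinct raw token
vocab_after = sorted(set(stems))    # one column per distinct stem

print(len(vocab_before), vocab_before)
print(len(vocab_after), vocab_after)
```

Six raw tokens collapse to three stems here, so a bag-of-words matrix built on the stems would need half as many columns for these words.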

In [8]:
from nltk.stem.porter import PorterStemmer

# Cleaning the texts
# Cleaned_Text will hold one cleaned string per tweet.
# We loop over every tweet in the dataset, apply the cleaning steps
# and append the result to the list.

# Creating the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')

# Remove 'not' from the stopwords as it can reverse the meaning of a sentence
all_stopwords.remove('not')
all_stopwords = set(all_stopwords)

Cleaned_Text = []
for i in range(len(dataset)):
    # Replacing every character other than a-z or A-Z with a space
    Text = re.sub('[^a-zA-Z]', ' ', dataset['text'][i])

    # Transforming all the capital letters into lower case letters
    Text = Text.lower()

    # Splitting the text into words so that we can apply stemming
    Text = Text.split()

    # Dropping the stopwords and stemming the remaining words
    Text = [ps.stem(word) for word in Text if word not in all_stopwords]

    # Joining the words back together, separated by spaces
    Text = ' '.join(Text)

    # Adding the cleaned text to the list
    Cleaned_Text.append(Text)

### Cleaned_Text

In [9]:
Cleaned_Text

['deed reason earthquak may allah forgiv us',
 'forest fire near la rong sask canada',
 'resid ask shelter place notifi offic evacu shelter place order expect',
 'peopl receiv wildfir evacu order california',
 'got sent photo rubi alaska smoke wildfir pour school',
 'rockyfir updat california hwi close direct due lake counti fire cafir wildfir',
 'flood disast heavi rain caus flash flood street manit colorado spring area',
 'top hill see fire wood',
 'emerg evacu happen build across street',
 'afraid tornado come area',
 'three peopl die heat wave far',
 'haha south tampa get flood hah wait second live south tampa gonna gonna fvck flood',
 'rain flood florida tampabay tampa day lost count',
 'flood bago myanmar arriv bago',
 'damag school bu multi car crash break',
 'man',
 'love fruit',
 'summer love',
 'car fast',
 'goooooooaaaaaal',
 'ridicul',
 'london cool',
 'love ski',
 'wonder day',
 'looooool',
 'way eat shit',
 'nyc last week',
 'love girlfriend',
 'cooool',
 'like pasta',
 ...]

### Creating the Bag of Words model

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Creating instance of Count Vectorizer class
cv = CountVectorizer(max_features = 2000)

# fit learns the vocabulary (here the 2000 most frequent words)
# and transform counts how often each of those words occurs in each tweet
X = cv.fit_transform(Cleaned_Text).toarray()

# Creating dependent variable
y = dataset.iloc[:, -1].values

In [12]:
y

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
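
To make the shape of `X` concrete, here is a minimal sketch of the same `CountVectorizer` call on a hypothetical three-line corpus of already cleaned text. The names `toy_corpus`, `toy_cv` and `toy_X` are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned corpus: one string per tweet
toy_corpus = ["forest fire near", "love fruit", "fire fire flood"]

toy_cv = CountVectorizer(max_features=2000)
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# vocabulary_ maps each word to its column index
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(columns)
print(toy_X)
```

Each row is one tweet and each column is one word, so the third row holds a 2 in the `fire` column and a 1 in the `flood` column.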

### Splitting the datasets into training sets and the test sets

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
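
The effect of `test_size` and `random_state` can be checked on a small made-up array. This is only a sketch with hypothetical data, not the tweet matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# test_size = 0.20 holds out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```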

# Creating Classification model

### Checking performance of different classification models

### 1. Naive Bayes Classification Model

In [25]:
# Importing library for naive_bayes classification
from sklearn.naive_bayes import GaussianNB

# Creating instance
classifier1 = GaussianNB()

# Training the Naive Bayes classification model
classifier1.fit(X_train, y_train)

# Predicting test sets results using Model
y_pred = classifier1.predict(X_test)

# Comparing the predicted y and actual y to ensure accuracy of model
Df_1 = pd.DataFrame(y_pred)
Df_1.columns = ['Predicted_Disaster']
Df_1['Actual_Disaster'] = y_test
Df_1

Unnamed: 0,Predicted_Disaster,Actual_Disaster
0,0,1
1,1,0
2,1,1
3,0,0
4,0,0
...,...,...
1518,0,0
1519,1,1
1520,1,1
1521,1,1


In [26]:
# Importing library for creating Confusion matrix and accuracy
from sklearn.metrics import confusion_matrix, accuracy_score

In [27]:
# Creating Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n\n',cm)

# Computing the accuracy (fraction of correctly classified tweets)
accuracy = accuracy_score(y_test, y_pred)
print('\nAccuracy = ',accuracy)

Confusion Matrix :

 [[727 147]
 [209 440]]

Accuracy =  0.7662508207485227
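
The score printed above is the classification accuracy, and it can be recomputed by hand from the confusion matrix. The sketch below does this for the Naive Bayes matrix and also derives precision, recall and F1, which the notebook does not report; they are included here only as an illustration of how to read the matrix.

```python
# Confusion matrix reported above for the Naive Bayes model
# (scikit-learn layout: rows are actual classes, columns are predicted classes)
cm = [[727, 147],
      [209, 440]]

tn, fp = cm[0]  # actual 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the tweets flagged as disasters, how many really were
recall = tp / (tp + fn)     # of the real disaster tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```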


### 2. The Decision Tree Classification model

In [28]:
# Importing library for Decision Tree Classification
from sklearn.tree import DecisionTreeClassifier

# Creating instance
classifier2 = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)

# Training the Decision Tree classification model
classifier2.fit(X_train, y_train)

# Predicting test sets results using Model
y_pred = classifier2.predict(X_test)

# Comparing the predicted y and actual y to ensure accuracy of model
Df_2 = pd.DataFrame(y_pred)
Df_2.columns = ['Predicted_Disaster']
Df_2['Actual_Disaster'] = y_test
Df_2

Unnamed: 0,Predicted_Disaster,Actual_Disaster
0,0,1
1,0,0
2,1,1
3,0,0
4,0,0
...,...,...
1518,0,0
1519,1,1
1520,1,1
1521,1,1


In [29]:
# Creating Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n\n',cm)

# Computing the accuracy (fraction of correctly classified tweets)
accuracy = accuracy_score(y_test, y_pred)
print('\nAccuracy = ',accuracy)

Confusion Matrix :

 [[689 185]
 [223 426]]

Accuracy =  0.7321076822061721


### 3. The KNN Classification model

In [30]:
# Importing library for K Neighbors Classification
from sklearn.neighbors import KNeighborsClassifier

# Creating instance
classifier3 = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)

# Training the KNN classification model
classifier3.fit(X_train, y_train)

# Predicting test sets results using Model
y_pred = classifier3.predict(X_test)

# Comparing the predicted y and actual y to ensure accuracy of model
Df_3 = pd.DataFrame(y_pred)
Df_3.columns = ['Predicted_Disaster']
Df_3['Actual_Disaster'] = y_test
Df_3

Unnamed: 0,Predicted_Disaster,Actual_Disaster
0,0,1
1,0,0
2,0,1
3,0,0
4,0,0
...,...,...
1518,0,0
1519,1,1
1520,0,1
1521,1,1


In [31]:
# Creating Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n\n',cm)

# Computing the accuracy (fraction of correctly classified tweets)
accuracy = accuracy_score(y_test, y_pred)
print('\nAccuracy = ',accuracy)

Confusion Matrix :

 [[831  43]
 [386 263]]

Accuracy =  0.7183191070256073


### 4. The Kernel SVM classification model

In [32]:
# Importing library for Kernel SVM model
from sklearn.svm import SVC

# Creating instance
classifier4 = SVC(kernel = 'rbf', random_state = 0)

# Training the Kernel SVM classification model
classifier4.fit(X_train, y_train)

# Predicting test sets results using Model
y_pred = classifier4.predict(X_test)

# Comparing the predicted y and actual y to ensure accuracy of model
Df_4 = pd.DataFrame(y_pred)
Df_4.columns = ['Predicted_Disaster']
Df_4['Actual_Disaster'] = y_test
Df_4

Unnamed: 0,Predicted_Disaster,Actual_Disaster
0,0,1
1,0,0
2,1,1
3,0,0
4,0,0
...,...,...
1518,0,0
1519,1,1
1520,1,1
1521,1,1


In [33]:
# Creating Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n\n',cm)

# Computing the accuracy (fraction of correctly classified tweets)
accuracy = accuracy_score(y_test, y_pred)
print('\nAccuracy = ',accuracy)

Confusion Matrix :

 [[791  83]
 [199 450]]

Accuracy =  0.81483913328956


### 5. The Logistic Regression model

In [34]:
# Importing library for the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Creating instance
classifier5 = LogisticRegression(random_state = 0)

# Training the Logistic Regression model
classifier5.fit(X_train, y_train)

# Predicting test sets results using Model
y_pred = classifier5.predict(X_test)

# Comparing the predicted y and actual y to ensure accuracy of model
Df_5 = pd.DataFrame(y_pred)
Df_5.columns = ['Predicted_Disaster']
Df_5['Actual_Disaster'] = y_test
Df_5

Unnamed: 0,Predicted_Disaster,Actual_Disaster
0,0,1
1,0,0
2,1,1
3,0,0
4,0,0
...,...,...
1518,0,0
1519,1,1
1520,1,1
1521,1,1


In [35]:
# Creating Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n\n',cm)

# Computing the accuracy (fraction of correctly classified tweets)
accuracy = accuracy_score(y_test, y_pred)
print('\nAccuracy = ',accuracy)

Confusion Matrix :

 [[749 125]
 [195 454]]

Accuracy =  0.7898883782009193


### 6. The Random Forest Classification model

In [36]:
# Importing library for the Random Forest Classification model
from sklearn.ensemble import RandomForestClassifier

# Creating instance
classifier6 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)

# Training the Random Forest classification model
classifier6.fit(X_train, y_train)

# Predicting test sets results using Model
y_pred = classifier6.predict(X_test)

# Comparing the predicted y and actual y to ensure accuracy of model
Df_6 = pd.DataFrame(y_pred)
Df_6.columns = ['Predicted_Disaster']
Df_6['Actual_Disaster'] = y_test
Df_6

Unnamed: 0,Predicted_Disaster,Actual_Disaster
0,0,1
1,0,0
2,0,1
3,0,0
4,0,0
...,...,...
1518,0,0
1519,1,1
1520,1,1
1521,1,1


In [37]:
# Creating Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n\n',cm)

# Computing the accuracy (fraction of correctly classified tweets)
accuracy = accuracy_score(y_test, y_pred)
print('\nAccuracy = ',accuracy)

Confusion Matrix :

 [[756 118]
 [247 402]]

Accuracy =  0.7603414313854235


### 7. The SVM Classification model

In [38]:
# Importing library for the SVM model
from sklearn.svm import SVC

# Creating instance
classifier7 = SVC(kernel = 'linear', random_state = 0)

# Training the linear SVM classification model
classifier7.fit(X_train, y_train)

# Predicting test sets results using Model
y_pred = classifier7.predict(X_test)

# Comparing the predicted y and actual y to ensure accuracy of model
Df_7 = pd.DataFrame(y_pred)
Df_7.columns = ['Predicted_Disaster']
Df_7['Actual_Disaster'] = y_test
Df_7

Unnamed: 0,Predicted_Disaster,Actual_Disaster
0,0,1
1,0,0
2,1,1
3,0,0
4,1,0
...,...,...
1518,0,0
1519,1,1
1520,1,1
1521,1,1


In [39]:
# Creating Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n\n',cm)

# Computing the accuracy (fraction of correctly classified tweets)
accuracy = accuracy_score(y_test, y_pred)
print('\nAccuracy = ',accuracy)

Confusion Matrix :

 [[738 136]
 [188 461]]

Accuracy =  0.7872619829284307
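
To compare the seven models side by side, the accuracies can be recomputed from the confusion matrices printed above and sorted. This sketch only re-uses the numbers already reported; the names `results`, `accuracy_from_cm` and `ranked` are made up for this example.

```python
# Confusion matrices copied from the seven result cells above
# (rows are actual classes, columns are predicted classes)
results = {
    "Naive Bayes": [[727, 147], [209, 440]],
    "Decision Tree": [[689, 185], [223, 426]],
    "KNN": [[831, 43], [386, 263]],
    "Kernel SVM": [[791, 83], [199, 450]],
    "Logistic Regression": [[749, 125], [195, 454]],
    "Random Forest": [[756, 118], [247, 402]],
    "Linear SVM": [[738, 136], [188, 461]],
}

def accuracy_from_cm(cm):
    # Correct predictions sit on the diagonal of the matrix
    correct = cm[0][0] + cm[1][1]
    total = sum(cm[0]) + sum(cm[1])
    return correct / total

# Model names ordered from highest to lowest accuracy
ranked = sorted(results, key=lambda name: accuracy_from_cm(results[name]), reverse=True)
for name in ranked:
    print(f"{name:20s} {accuracy_from_cm(results[name]):.4f}")
```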


# Finalising Model : Kernel SVM with 81.5 % Accuracy

+ As we saw above, the Kernel SVM achieved the highest accuracy, so we use it to predict whether a single piece of text describes a real disaster.
+ We repeat the same text preprocessing as before, but this time on a single text.

In [41]:
def Prediction_Single_Text(Text):
    # Replacing every character other than a-z or A-Z with a space
    Text = re.sub('[^a-zA-Z]', ' ', Text)

    # Transforming all the capital letters into lower case letters
    Text = Text.lower()

    # Splitting the text into words so that we can apply stemming
    Text = Text.split()

    # Creating the stemmer and the stopword set
    ps = PorterStemmer()
    all_stopwords = stopwords.words('english')

    # Remove 'not' from the stopwords as it can reverse the meaning of a sentence
    all_stopwords.remove('not')
    all_stopwords = set(all_stopwords)

    # Dropping the stopwords and stemming the remaining words
    Text = [ps.stem(word) for word in Text if word not in all_stopwords]

    # Joining the words back together, separated by spaces
    Text = ' '.join(Text)

    # Vectorizing with the already fitted CountVectorizer and predicting with the Kernel SVM
    Cleaned_Text = [Text]
    new_X_test = cv.transform(Cleaned_Text).toarray()
    new_y_pred4 = classifier4.predict(new_X_test)
    print("Predicted value by Kernel SVM Classification Model: ",new_y_pred4)

### Predicting a disaster tweet

In [42]:
Text1 = 'Forest caught fire near New Delhi.'
Prediction_Single_Text(Text1)

Predicted value by Kernel SVM Classification Model:  [1]


### Predicting a non-disaster tweet

In [43]:
Text2 = 'Road is clear no traffic there.'
Prediction_Single_Text(Text2)

Predicted value by Kernel SVM Classification Model:  [0]
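
Note that new text is passed through `cv.transform`, not `cv.fit_transform`, so the columns stay the same as during training. Words that were not in the fitted vocabulary have no column and are ignored. A minimal sketch on a hypothetical two-line corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already cleaned training corpus
train_corpus = ["forest fire", "flood rain"]
toy_cv = CountVectorizer()
toy_cv.fit(train_corpus)

# 'volcano' never appeared during fitting, so it is dropped by transform
new_vector = toy_cv.transform(["fire volcano"]).toarray()

print(sorted(toy_cv.vocabulary_))
print(new_vector)
```

Only the `fire` column is set for the new text, which is why a tweet made up entirely of unseen words ends up as an all-zero row.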


# Testing on data

In [56]:
dataset2 = pd.read_csv('test.csv')
dataset2

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


In [57]:
x_test = dataset2['text']

In [58]:
# Creating the stemmer and the stopword set once, outside the loop
ps = PorterStemmer()
all_stopwords = stopwords.words('english')

# Remove 'not' from the stopwords as it can reverse the meaning of a sentence
all_stopwords.remove('not')
all_stopwords = set(all_stopwords)

Cleaned_Text1 = []
for i in range(len(x_test)):
    # Replacing every character other than a-z or A-Z with a space
    Text = re.sub('[^a-zA-Z]', ' ', x_test[i])

    # Transforming all the capital letters into lower case letters
    Text = Text.lower()

    # Splitting the text into words so that we can apply stemming
    Text = Text.split()

    # Dropping the stopwords and stemming the remaining words
    Text = [ps.stem(word) for word in Text if word not in all_stopwords]

    # Joining the words back together, separated by spaces
    Text = ' '.join(Text)

    # Adding the cleaned text to the list
    Cleaned_Text1.append(Text)
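
The same cleaning steps are written out three times in this notebook (training tweets, single text, test tweets). One possible way to avoid the repetition is a small helper, sketched below. The function name `clean_text` and the tiny `toy_stop_words` set are assumptions made for this example; in the notebook the stopword argument would be NLTK's English list with 'not' removed.

```python
import re
from nltk.stem.porter import PorterStemmer

_stemmer = PorterStemmer()

def clean_text(text, stop_words):
    """Keep letters only, lower-case, drop stopwords and stem the rest."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(_stemmer.stem(w) for w in words if w not in stop_words)

# Hypothetical small stopword set, used here to avoid downloading NLTK data
toy_stop_words = {"the", "a", "an", "is", "are"}

example = clean_text("Forest fire near La Ronge Sask. Canada", toy_stop_words)
print(example)  # forest fire near la rong sask canada
```

With such a helper, each of the three loops would reduce to a single list comprehension over the texts.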

In [59]:
Cleaned_Text1

['happen terribl car crash',
 'heard earthquak differ citi stay safe everyon',
 'forest fire spot pond gees flee across street cannot save',
 'apocalyps light spokan wildfir',
 'typhoon soudelor kill china taiwan',
 'shake earthquak',
 'probabl still show life arsen yesterday eh eh',
 'hey',
 'nice hat',
 'fuck',
 'like cold',
 'nooooooooo',
 'tell',
 '',
 'awesom',
 'birmingham wholesal market ablaz bbc news fire break birmingham wholesal market http co irwqcezweu',
 'sunkxssedharri wear short race ablaz',
 'previouslyondoyintv toke makinwa marriag crisi set nigerian twitter ablaz http co cmghxba xi',
 'check http co roi nsmejj http co tj zjin http co yduixefip http co lxtjc kl nsfw',
 'psa split person techi follow ablaz co burner follow ablaz',
 'bewar world ablaz sierra leon amp guap',
 'burn man ablaz turban diva http co hodwosamw via etsi',
 'not diss song peopl take thing run smh eye open though set game ablaz cyhitheprync',
 'rape victim die set ablaz year old girl die burn inj

In [60]:
new_X_test = cv.transform(Cleaned_Text1).toarray()
new_y_pred4 = classifier4.predict(new_X_test)

In [62]:
dataset2['Prediction'] = new_y_pred4
dataset2

Unnamed: 0,id,keyword,location,text,Prediction
0,0,,,Just happened a terrible car crash,1
1,2,,,"Heard about #earthquake is different cities, s...",1
2,3,,,"there is a forest fire at spot pond, geese are...",1
3,9,,,Apocalypse lighting. #Spokane #wildfires,0
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,1
...,...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...,1
3259,10865,,,Storm in RI worse than last hurricane. My city...,1
3260,10868,,,Green Line derailment in Chicago http://t.co/U...,1
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...,1


In [63]:
dataset2.to_csv('predicted_results.csv', index = False)

In [64]:
dataset3 = pd.read_csv('predicted_results.csv')
dataset3

Unnamed: 0,id,keyword,location,text,Prediction
0,0,,,Just happened a terrible car crash,1
1,2,,,"Heard about #earthquake is different cities, s...",1
2,3,,,"there is a forest fire at spot pond, geese are...",1
3,9,,,Apocalypse lighting. #Spokane #wildfires,0
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,1
...,...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...,1
3259,10865,,,Storm in RI worse than last hurricane. My city...,1
3260,10868,,,Green Line derailment in Chicago http://t.co/U...,1
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...,1
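
The file written above keeps all the original columns. If a two-column submission file is needed, it could be built as in the sketch below. The column names `id` and `target` are an assumption about the layout of `sample_submission.csv`, which is not shown in this notebook, and the three rows are hypothetical.

```python
import io
import pandas as pd

# Hypothetical predictions in the same layout as dataset2 above
predictions = pd.DataFrame({
    "id": [0, 2, 3],
    "text": ["tweet a", "tweet b", "tweet c"],
    "Prediction": [1, 1, 0],
})

# Keep only the identifier and the predicted label, renamed to the assumed 'target' column
submission = predictions[["id", "Prediction"]].rename(columns={"Prediction": "target"})

# Written to an in-memory buffer here; pass a file path instead to create a real CSV
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```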
