#**Real or Not? Binary Classification of Disaster Tweets**
Kaggle Competition

##Loading all necessary liraries

In [0]:
import pandas as pd
from nltk.tokenize import word_tokenize
import re
import string
import nltk
from nltk.corpus import stopwords

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier
# sklearn 
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from sklearn.model_selection import GridSearchCV,StratifiedKFold,RandomizedSearchCV
  

##Reading Training and Test Dataset:


In [0]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

##Basic EDA - Exploratory Data Analysis

In [0]:
#Looking at first few rows of dataset
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [0]:
print("Training Dataset Size:",train_df.shape)
print("Test Dataset Size:",test_df.shape)

Training Dataset Size: (7613, 5)
Test Dataset Size: (3263, 4)


In [0]:
#missing values in training dataset
train_df.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

###Target column Distribution

In [0]:
train_df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [0]:
#exlporing location column
train_df['location'].value_counts()

USA                          104
New York                      71
United States                 50
London                        45
Canada                        29
                            ... 
IDN                            1
Winnipeg, MB, Canada           1
Cleveland, Ohio                1
Yadkinville, NC                1
Wolverhampton/Brum/Jersey      1
Name: location, Length: 3341, dtype: int64

##**Data Cleaning**


*   Making all uppercase to lowercase
*   Removing noise from tweets

  *   URLs
  *   HTML tags
  *   emogis
  *   Punctuation
  *   New-Line
  *   Removing Digits
*   Tokenization: Converting normal text string into a list of tokens/words
*   Stopwords removal(optional)











In [0]:
#function to remove noise from text
def clean_text(text):
    text = text.lower() #convert to lowercase to maintain standard flow between text
    text = re.sub('\[.*?\]', '', text) #removing text in square brackets
    text = re.sub('https?://\S+|www\.\S+', '', text) #removing url
    text = re.sub('<.*?>+', '', text) #removing html tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) #removing puntuations
    text = re.sub('\n', '', text)#removing new line from the text field
    text = re.sub('\w*\d\w*', '', text) #removing digits from the string
    return text

train_df['text'] = train_df['text'].apply(lambda x : clean_text(x))
test_df['text'] = test_df['text'].apply(lambda x : clean_text(x))

In [0]:
#function to remove emoji's
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                            "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

train_df['text']=train_df['text'].apply(lambda x: remove_emoji(x))
test_df['text']=test_df['text'].apply(lambda x: remove_emoji(x))

In [0]:
#removing alpha numeric values from text
train_df['text'] = train_df['text'].str.replace('[^a-z A-Z]','')

In [0]:
#adding new column with the count of words in a single row
train_df['word_count'] = train_df['text'].str.split().map(len)

#only using the rows with word count more than 0
train_df = train_df[train_df['word_count'] > 0]

#moving forward with 2 column from dataset , i.e., text and target
train_df = train_df[["text","target"]]
test_df = test_df[["text"]]

### Bag of Words- Countvectorizer Features
Countvectorizer convert a collection of text documents to a matrix of token counts

In [0]:
count_vectorizer = CountVectorizer()
train_vectors = count_vectorizer.fit_transform(train_df['text'])
test_vectors = count_vectorizer.transform(test_df["text"])

###TF-IDF Features
(Term Frequency-Inverse Document Frequency) - Rescaling the frequency of words hows often they appear in all the documents

**Term Frequency: is a scoring of the frequency of the word in the current document or text**

TF = (Number of times term t appears in a document)/(Number of terms in a document)

**Inverse Document Frequency: is a scoring of how rare the word is across documents**

IDF = 1+log(N/n), where, N is the number of documents and n is the number of documents a term t appeared in.

In [0]:
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1,2))
train_dfidf = tfidf.fit_transform(train_df['text'])
test_dfidf = tfidf.transform(test_df['text'])

##Building Text Classification Model


###Logistic Regression Classifier

In [0]:
# Fitting a simple Logistic Regression on Count-vectors
clf = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=5, scoring="f1")


In [0]:
scores

array([0.62962963, 0.55050505, 0.61184211, 0.57837838, 0.71981058])

Logistic Regression with **Bag of words** features results, 

After 5th cross validation training F1 score is 71.9% 

In [0]:
# Fitting a simple Logistic Regression on TFIDF
clf_tfidf = LogisticRegression(C=1.0)
scores = model_selection.cross_val_score(clf_tfidf, train_dfidf, train_df["target"], cv=5, scoring="f1")
scores

array([0.61538462, 0.56688494, 0.60249554, 0.58574181, 0.71885522])

Logistic Regression with **TF-IDF** Features results, 

after 5th cross validation training F1 score is 71.8%

###Naive Bayes Classifier

In [0]:
#Fitting a simple Naive Bayes on Count-Vectors
clf_NB = MultinomialNB()
scores = model_selection.cross_val_score(clf_NB, train_vectors, train_df["target"], cv=5, scoring="f1")
scores

array([0.65368332, 0.63370787, 0.68676471, 0.64526485, 0.73986014])

Naive Bayes Classifier with **Bag of words** features results, 

After 5th cross validation training F1 score is 73.9% 

In [0]:
# Fitting a simple Naive Bayes on TFIDF
clf_NB_TFIDF = MultinomialNB()
scores = model_selection.cross_val_score(clf_NB_TFIDF, train_dfidf, train_df["target"], cv=5, scoring="f1")
scores

array([0.59486166, 0.58962693, 0.6356453 , 0.60170293, 0.74196208])

Naive Bayes Classifier with **TF-IDF** features results,

After 5th Cross Validation taining F1 Score is 74.1%

##Neural Network - Deep Learning

Why not a standard network fit for this data challenge?


*   Inputs, outputs can be of different lengths in different examples
*   Doesn't share features learned across different position of text

To overcome this we use Recurrent Neural Network, 

as it scans words from left to right, one drawback is that it only knows features from its left



##Word Level - Vanilla RNN (RNN from Scratch) (Recurrent Neural Network)

The idea is to create a word level RNN in python/numpy that will provide a baseline model for more complex Neural Networks architecture, and also to gain low level understanding of the working of RNN (Sequence model)

Credits: 

Andrej Karpathy https://gist.github.com/karpathy/d4dee566867f8291f086: Minimal character-level language model with a Vanilla Recurrent Neural Network, in Python/numpy. And the blog http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

The deep learning book by Michael Nielsen particularly http://neuralnetworksanddeeplearning.com/chap6.html

Andrew ng Deep learning course (Course 5) on Coursera

**Steps Taken:**



1.   Creating a vocabulary list of unique words from data to be later used in encoding each words into a one-hot vector using 1-k encoding. (k = len(vocab_list)
2.   Initialize the RNN model parameters
3.   Feedforward the training data(tweets) vectorized form into the network and calculate loss for that training example
4.   Backpropagate through time and obtain the gradient parameters.
5.   Gradient clipping to avoid exploding gradient problem
6.   Select/iterating over different learning rate and calculating new model paramters
7.   Repeating steps from 3-6 for some number of iterations that covers all training examples atleast 3 times



In [0]:
#creating vocabulary list which will have uniqe word list from training data
vocab_list = list(train_df['text'].str.split(' ',expand=True).stack().unique())
total_words = list(train_df['text'].str.split(' ',expand=True).stack())

vocab_list_size = len(vocab_list)
total_words_len = len(total_words)

print("Vocab size : ",vocab_list_size)
print("Total words in data : ", total_words_len)

#creating a dictionary that has an index for each of the unique words
words_idx = { word:i for i, word in enumerate(vocab_list) }


Vocab size :  16512
Total words in data :  113521


In [0]:
#converting a single training sample to retreive from dictionary
temp = train_df['text'].str.split().values[1]
inputs = [words_idx[i] for i in temp]
output = train_df['target'].values[1]
print("original text:",temp)
print("Wordlist index:",inputs)
print("Target label:",output)

original text: ['forest', 'fire', 'near', 'la', 'ronge', 'sask', 'canada']
Wordlist index: [13, 14, 15, 16, 17, 18, 19]
Target label: 1


In [0]:
#hyperparameters
learning_rate = 0.005
n_a = hidden_size = 100
n_x = vocab_list_size 
n_y = 2

#model_parameters
Waa = np.random.randn(hidden_size,hidden_size)*0.1
Wax = np.random.randn(hidden_size,vocab_list_size)*0.1
Wya = np.random.randn(2,hidden_size)*0.1
ba = np.zeros((n_a,1))
by = np.zeros((n_y,1))

#spliting into training and validation
temp_df = train_df
train_df = train_df.iloc[:7000]
validation_df = temp_df.iloc[7000:]

print('The training set examples: %d' %(len(train_df)))
print('The validation set examples: %d' %(len(validation_df)))


The training set examples: 7000
The validation set examples: 612


In [0]:
#activation function
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)


In [0]:
#feed-forward -- takes in the index of words in a example tweet and return the prediction
def rnn_feedforward(input_data):
  #initializing
  xt,at = [], np.zeros((n_a,1))
  for t in range(len(input_data)):
    xt.append(np.zeros((n_x,1)))#encode in 1-k one hot-representation
    xt[t][input_data[t]] = 1
    at = np.tanh(np.dot(Waa,at)+np.dot(Wax,xt[t])+ba) #hidden state activation function
    
  yt = np.dot(Wya,at) + by 
  pred = softmax(yt) #softmax function for getting probability for binary classification
  prediction = np.argmax(pred)
  return prediction
  

In [0]:

num_iterations = 5

#memory variables for Adagrad
mWaa, mWya, mWax = np.zeros_like(Waa), np.zeros_like(Wya), np.zeros_like(Wax)
mby, mba = np.zeros_like(by), np.zeros_like(ba)

for i in range(num_iterations):
  
  idx = i%len(train_df)
  example = train_df['text'].str.split().values[idx]
  inputs = [words_idx[i] for i in example]
  print(inputs)
  targets = int(train_df['target'].values[idx])
  
  prediction = rnn_feedforward(inputs)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
[[0.56772971]
 [0.43227029]]
prediction: 0
[13, 14, 15, 16, 17, 18, 19]
[[0.46812068]
 [0.53187932]]
prediction: 1
[12, 20, 21, 22, 23, 24, 25, 2, 26, 27, 28, 29, 30, 31, 32, 33, 23, 24, 25, 34, 2, 35]
[[0.36159748]
 [0.63840252]]
prediction: 1
[37, 38, 39, 32, 34, 24, 40]
[[0.43414214]
 [0.56585786]]
prediction: 1
[41, 42, 43, 6, 44, 45, 46, 47, 48, 49, 45, 39, 50, 51, 52, 53]
[[0.6215141]
 [0.3784859]]
prediction: 0


Function that takes input as one row with text and target value, feedforward the network, calculates the cost which is loss of predicted vs actual Y/target.

Then performs backpropagation and updates the gradient of all the parameters and returns it.

In [0]:
#feedforward and backpropagation
def rnn_model(input_data,targets):
  
  xt,at = [],[]
  at.append(np.zeros((n_a,1)))
  loss = 0
  
  #feed forward
  for t in range(len(input_data)):
    xt.append(np.zeros((n_x,1)))
    xt[t][input_data[t]] = 1
    
    at.append(np.tanh(np.dot(Waa,at[t])+np.dot(Wax,xt[t])+ba))
    
  yt = np.dot(Wya,at[-1]) + by
  pred = np.exp(yt) / np.sum(np.exp(yt),axis=0)
  
  prediction = np.argmax(pred)
  
  #loss-cost
  
  y = np.zeros((2,1))
  y[targets] = 1
  loss = -np.log(np.sum(pred*y,axis = 0)) #cross-entropy loss
  
  #backpropagation through time
  dWaa, dWya, dWax = np.zeros_like(Waa), np.zeros_like(Wya), np.zeros_like(Wax)
  dby, dba = np.zeros_like(by), np.zeros_like(ba)
  
  dy = pred - y
  dWya = np.dot(dy,at[-1].transpose())
  dby = np.copy(dy)
  
  dat = np.dot(Wya.transpose(),dy)
  dtanh = (1-at[-1]*at[-1]) * dat
  
  dWax = np.dot(dtanh, xt[-1].transpose())
  dWaa = np.dot(dtanh, at[-2].transpose())
  dba = np.copy(dtanh)
  da_next = np.dot(Waa.transpose(),dtanh)
  for t in reversed(range(len(input_data)-2)):
    
    dat = np.copy(da_next)
    dtanh = (1- at[t+1] * at[t+1]) * dat
    
    dWax += np.dot(dtanh, xt[t].transpose())
    dWaa += np.dot(dtanh, at[t].transpose())
    dba += np.copy(dtanh)
    da_next = np.dot(Waa.transpose(),dtanh)
    
  for dparams in [dWaa,dWax,dba,dby,dWya]:
    np.clip(dparams,-5,5,out=dparams)
    
  return loss, dWaa, dWax, dba, dby, dWya

 

Feeding training data into the network to retrieve the gradient and using Adagrad optimizer to perform the gradient descent. 

And we repeat this process for all the training examples and for n epochs

**Adagrad Optimizer** : Adaptively scales the learning rate with respect to the accumulated square gradient at each iteration in each dimention

In [0]:

#select any number of iteration for start
num_iterations = 21000

#memory variables for Adagrad optimizer - backprop
mWaa, mWya, mWax = np.zeros_like(Waa), np.zeros_like(Wya), np.zeros_like(Wax)
mby, mba = np.zeros_like(by), np.zeros_like(ba)

for i in range(num_iterations):
  idx = i%len(train_df)
  example = train_df['text'].str.split().values[idx]
  inputs = [words_idx[i] for i in example]
  targets = int(train_df['target'].values[idx])
  
  #prediction = rnn_feedforward(inputs,targets)
  loss,dWaa, dWax, dba, dby, dWya = rnn_model(inputs,targets)
  
  #Adagrad optimizer
  #performing paramete update with Adagrad
  for param, dparam, mem in zip([Waa, Wax, Wya, ba, by],
                                [dWaa, dWax, dWya, dba, dby],
                                [mWaa, mWax, mWya, mba, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem+1e-8) #adagrad update
  
  # validation accuracy
  # using for loop instead of vectorization
  if i % 700 == 0:
    predictions = []
    count=0
    actual_targets= validation_df['target'].tolist()
    for j in range(len(validation_df)):
        example = validation_df['text'].str.split().values[j]
        inputs = [words_idx[l] for l in example]
        predictions.append(rnn_feedforward(inputs))
          
    for y, y_hat in zip(actual_targets, predictions):
        if y==y_hat:
            count+=1
    print('The validation_accuracy after iterations:%d is %d'%(i,(count/len(validation_df))*100))
    

The validation_accuracy after iterations:0 is 51
The validation_accuracy after iterations:700 is 47
The validation_accuracy after iterations:1400 is 50
The validation_accuracy after iterations:2100 is 48
The validation_accuracy after iterations:2800 is 62
The validation_accuracy after iterations:3500 is 69
The validation_accuracy after iterations:4200 is 69
The validation_accuracy after iterations:4900 is 69
The validation_accuracy after iterations:5600 is 73
The validation_accuracy after iterations:6300 is 72
The validation_accuracy after iterations:7000 is 72
The validation_accuracy after iterations:7700 is 70
The validation_accuracy after iterations:8400 is 67
The validation_accuracy after iterations:9100 is 72
The validation_accuracy after iterations:9800 is 74
The validation_accuracy after iterations:10500 is 76
The validation_accuracy after iterations:11200 is 75
The validation_accuracy after iterations:11900 is 73
The validation_accuracy after iterations:12600 is 76
The validati

##Results Summary:
The validation accuracy arrives closely to 76% (better than randomly guessing). Tried 2 sets of learning rate values with 0.001 accuracy was 70% whereas with learning rate 0.005 accuracy boosted to 76%. 

I have not experimented with other hyperparameters or tried any high level/complex data cleaning. Still, it seems that my RNN is learning association between words that helps to classify the tweets.

Future scope:
we can try different network architectures 
training model for longer iterations

In [0]:
#saving matrix values for parameters which includes weights and biases for later use to predict test data
# saving the model
import pickle
filename = 'rnn_model_v1.pkl'
with open(filename, "wb") as f:
    pickle.dump((Waa, Wax, ba, by, Wya ), f)

In [0]:
#loading the saved model
import pickle
filename = 'rnn_model_v1.pkl'
with open(filename, "rb") as f:
    Waa, Wax, ba, by, Wya  = pickle.load(f)

In [0]:
#checking if parameters loaded in correctly
#using loaded model
count = 0
for j in range(len(validation_df)):
  
  example = validation_df['text'].str.split().values[j]
  inputs = [words_idx[l] for l in example]
  predictions.append(rnn_feedforward(inputs))
          
for y, y_hat in zip(actual_targets, predictions):
  if y==y_hat:
    count+=1
print("Correctly Predicted:",count)
print("Total Data in validation set:",len(validation_df))
print('The validation_accuracy is',(count/len(validation_df))*100)

Correctly Predicted: 453
Total Data in validation set: 612
The validation_accuracy is 74.01960784313727


##Pushing Files into GitHub from google colab

In [0]:
!git init

Initialized empty Git repository in /content/.git/


In [0]:
!git config -- globaluser.email "sagardaswani401@gmail.com"
!git config -- globaluser.name "Sagar401"

In [0]:
!git add -A

In [0]:
!git commit -m "first commit"