<a href="https://colab.research.google.com/github/Mukilan-Krishnakumar/NLP_With_Disaster_Tweets/blob/main/NLP_with_Disaster_Tweets_Part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to NLP with Disaster Tweets. We are going to create a baseline NLP model with TensorFlow. It is similar to the model I learnt from the [Coursera Course on Natural Language Processing](https://www.coursera.org/learn/natural-language-processing-tensorflow?specialization=tensorflow-in-practice&utm_source=gg&utm_medium=sem&utm_campaign=33-DeepLearningAI-TensorFlow-IN&utm_content=B2C&campaignid=12462557662&adgroupid=120411989496&device=c&keyword=&matchtype=&network=g&devicemodel=&adpostion=&creativeid=510017701427&hide_mobile_promo&gclid=Cj0KCQiA2ZCOBhDiARIsAMRfv9J7y-xaQNirNg9EReDTgRS6rxEsTUV3U7qwa8TiGi2rZk_grBWsjgwaAujhEALw_wcB) taught by [Laurence Moroney](https://www.linkedin.com/in/laurence-moroney/).

I am going to implement the model and improve upon it. This is part 1 of the NLP with Disaster Tweets series and we will gradually improve the model. 

📌 **Note** : Because we want to get straight to the juice of model building, this part doesn't cover EDA (exploratory data analysis). EDA will be done in the subsequent parts.

Let's get started.

## Downloading Dataset From Kaggle

To download the dataset directly from Kaggle, we need to install the `kaggle` package on this machine. We also need an API token file called **kaggle.json**. This can be downloaded from Your Account -> Account -> API -> Generate Token.

We need to upload this file to our Colab runtime. Keep in mind that on a standard Colab runtime, uploaded files are deleted when the runtime is recycled, so the file has to be uploaded again in a new session.

We create a folder called `~/.kaggle` and copy our JSON file into that folder.

We run `chmod 600` on the file, which means only the owner of the file can read and write it.
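
As a rough illustration of what mode `600` means, the sketch below applies the same permission bits to a throwaway temporary file from Python and reads them back. The temporary file is only a stand-in for `kaggle.json`, and the check assumes a Unix-like system such as the Colab runtime.

```python
import os
import stat
import tempfile

# Create a throwaway file standing in for kaggle.json (hypothetical stand-in).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    token_path = tmp.name

# 0o600 = read (4) + write (2) for the owner, nothing for group or others.
os.chmod(token_path, 0o600)

mode = stat.S_IMODE(os.stat(token_path).st_mode)
owner_can_read = bool(mode & stat.S_IRUSR)
owner_can_write = bool(mode & stat.S_IWUSR)
others_have_access = bool(mode & (stat.S_IRWXG | stat.S_IRWXO))

print(oct(mode), owner_can_read, owner_can_write, others_have_access)

# Clean up the throwaway file.
os.remove(token_path)
```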

We can download the kaggle dataset using `kaggle competitions download nlp-getting-started`. 

📌 **Note** : If you didn't click **Join Competition** on Kaggle, you won't be able to download the dataset.

😂 I did make that mistake so please be careful.


In [1]:
! pip install kaggle



In [2]:
! mkdir -p ~/.kaggle

In [3]:
! cp kaggle.json ~/.kaggle/

In [4]:
! chmod 600 ~/.kaggle/kaggle.json

In [5]:
! kaggle competitions download nlp-getting-started

Downloading train.csv to /content
  0% 0.00/965k [00:00<?, ?B/s]
100% 965k/965k [00:00<00:00, 11.0MB/s]
Downloading test.csv to /content
  0% 0.00/411k [00:00<?, ?B/s]
100% 411k/411k [00:00<00:00, 21.0MB/s]
Downloading sample_submission.csv to /content
  0% 0.00/22.2k [00:00<?, ?B/s]
100% 22.2k/22.2k [00:00<00:00, 23.0MB/s]


In [None]:
! unzip nlp-getting-started.zip

unzip:  cannot find or open nlp-getting-started.zip, nlp-getting-started.zip.zip or nlp-getting-started.zip.ZIP.


## Importing Necessary Modules

We are going to import a few Python modules needed to create our model.

We will be importing the following modules:
* Pandas - For data manipulation and analysis
* NumPy - For array manipulation
* Matplotlib.pyplot - For plotting graphs and visualizing data
* Seaborn - For high-level visualization
* Re - For using regular expressions (RegEx)
* TensorFlow - For building our neural network

We will also import `Tokenizer` and `pad_sequences`. As the official documentation states:
1. `Tokenizer` : This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf. 
2. `pad_sequences` : This function transforms a list (of length num_samples) of sequences (lists of integers) into a 2D Numpy array of shape (num_samples, num_timesteps). **It essentially adds padding to sentences to make them of equal length.** 
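
To make the `Tokenizer` description more concrete, here is a minimal pure-Python sketch of the idea: build a word index ordered by frequency, then turn sentences into lists of integers. It is a simplified stand-in, not the Keras implementation. It skips punctuation filtering and the `num_words` limit, and the helper names and sample sentences are made up for this example.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Give every word an integer id, most frequent first; the OOV token gets 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    # most_common() sorts by count; ties keep the order of first appearance
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace each word by its id; words missing from the index map to OOV."""
    oov_id = word_index[oov_token]
    return [[word_index.get(word, oov_id) for word in text.lower().split()]
            for text in texts]

# Made-up sentences for illustration
train_texts = ["forest fire near the lake", "the fire is spreading"]
word_index = build_word_index(train_texts)
new_sequences = texts_to_sequences(["the fire is huge"], word_index)

print(word_index)
print(new_sequences)  # [[3, 2, 7, 1]] -> "huge" was never seen, so it becomes 1
```

The most frequent words get the smallest ids, which is the same pattern visible in the real `word_index` printed later in this notebook.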


In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import tensorflow as tf


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

We need to load our CSV file into a **Pandas DataFrame**.

After loading it, we look at the first 5 rows.


In [7]:
df = pd.read_csv('/content/train.csv')
df_test = pd.read_csv('/content/test.csv')
df.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [8]:
df['text']

0       Our Deeds are the Reason of this #earthquake M...
1                  Forest fire near La Ronge Sask. Canada
2       All residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       Just got sent this photo from Ruby #Alaska as ...
                              ...                        
7608    Two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @TheTawniest The out of control w...
7610    M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...
7611    Police investigating after an e-bike collided ...
7612    The Latest: More Homes Razed by Northern Calif...
Name: text, Length: 7613, dtype: object

## Cleaning Data

If we look through `df['text']`, we can see many **URLs and uppercase words**. URLs carry little meaning for our model, and keeping both uppercase and lowercase forms of the same word adds redundant entries to our word index.

We handle both with a custom function called `cleaningText`, which removes URLs and lowercases all sentences.
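
Before applying this to the whole DataFrame, it may help to see the same two steps on a single string. The tweet below is made up for this example; the regular expression is the one used in the function.

```python
import re

# Hypothetical tweet, made up for this example
tweet = "Forest Fire near the lake http://t.co/abc123 stay SAFE"

# Drop anything that starts with "http" up to the next whitespace
no_urls = re.sub(r'http\S+', '', tweet)
cleaned = no_urls.lower()

print(repr(cleaned))
```

Note that removing the URL leaves two consecutive spaces behind where it used to be.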

In [9]:
def cleaningText(df):
    '''
    Takes a DataFrame, removes the URLs from its text column and makes every sentence lowercase.
    The DataFrame is modified in place; nothing is returned.
    '''
    # Drop anything that starts with "http" up to the next whitespace
    df['text'] = [re.sub(r'http\S+', '', x) for x in df['text']]
    df['text'] = df['text'].str.lower()

In [10]:
cleaningText(df)
df.head()

|  | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | our deeds are the reason of this #earthquake m... | 1 |
| 1 | 4 |  |  | forest fire near la ronge sask. canada | 1 |
| 2 | 5 |  |  | all residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | just got sent this photo from ruby #alaska as ... | 1 |


After running our custom function, we can store the text and label into individual lists. 

In [11]:
sentences = df['text'].tolist()
labels = df['target'].tolist()
print(sentences)




We make sure our labels are numerical values stored in a NumPy array by using `np.array`.

We split the data into training and testing sets using an 80/20 split.

We have 7613 records, so we take the first 6090 for training and the rest for testing.
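
The split index follows from the length of `df['text']` shown earlier (7613 rows). A small sketch of the arithmetic, with variable names chosen only for this example:

```python
total_records = 7613      # length of df['text'] shown in the output above
train_fraction = 0.8      # the "80" in the 80/20 split

# Round down so the index is a whole number of rows
split_index = int(total_records * train_fraction)
num_testing = total_records - split_index

print(split_index, num_testing)  # 6090 1523
```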

In [12]:
labels = np.array(labels)

training_sentences = sentences[:6090]
training_labels = labels[:6090]

testing_sentences = sentences[6090:]
testing_labels = labels[6090:]

## Model Parameters

We need to specify a few things before we build our very own NLP model. 

First we set `vocab_size`, the maximum number of words we keep in our very own dictionary of sorts. We set it to `10000`.

Next we set `embedding_dim`. An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. We set it to `16`.
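
One way to picture an embedding is as a lookup table with one row per word and `embedding_dim` columns. The NumPy sketch below uses random numbers in place of learned weights and a made-up list of token ids, so it only illustrates the shapes involved, not what Keras does internally.

```python
import numpy as np

vocab_size = 10000
embedding_dim = 16

# Random values stand in for the weights that training would learn.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A made-up sequence of token ids, as produced by a tokenizer.
token_ids = np.array([3, 2, 7, 1])

# The lookup is plain row indexing: one 16-dimensional vector per token.
vectors = embedding_matrix[token_ids]

print(embedding_matrix.shape)  # (10000, 16)
print(vectors.shape)           # (4, 16)
```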

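A tweet can be up to `280` characters long, so it can never contain more than `280` words. We therefore set `max_length`, the length of every padded sequence, to `280`.

If a sequence is longer than `max_length`, we cut tokens off its end. This is called `post` truncation, so we set `trunc_type` to `post`. Note that this setting only controls truncation: `pad_sequences` pads at the start of a sequence (`pre`) by default, and padding at the end would require passing `padding='post'` as well.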
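
The difference between `pre` and `post` can be sketched in plain Python. This is a simplified stand-in that mirrors the `padding` and `truncating` options of `pad_sequences` for a single sequence, not the Keras implementation; the helper name `pad_or_truncate` is made up for this example. In Keras both options default to `pre`.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return `seq` as a list of exactly `maxlen` integers."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    # 'pre' puts the padding before the tokens, 'post' after them
    return fill + list(seq) if padding == "pre" else list(seq) + fill

short_seq = [5, 8, 2]
long_seq = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_seq, 5))                     # [0, 0, 5, 8, 2]
print(pad_or_truncate(short_seq, 5, padding="post"))     # [5, 8, 2, 0, 0]
print(pad_or_truncate(long_seq, 4))                      # [3, 4, 5, 6]
print(pad_or_truncate(long_seq, 4, truncating="post"))   # [1, 2, 3, 4]
```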

If our tokenizer meets a word it has not seen before, it treats it as `Out-Of-Vocabulary`, so we set `oov_tok` to `<OOV>`.

What we are going to do is build a dictionary of sorts (`word_index`) from all the words in our training sentences, which assigns an individual token (an integer) to each word.

Our ML model cannot work on raw text, so we use this tokenizing mechanism to convert our sentences into sequences, which are numerical representations of our sentences. We pad them so that all the sequences have the same length.

We do the same for the testing sentences.


In [13]:
vocab_size = 10000
embedding_dim = 16
max_length = 280
trunc_type = 'post'
oov_tok = "<OOV>"

# Build the word index from the training sentences only
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

# Reuse the same tokenizer and settings for the testing sentences
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, truncating=trunc_type)

If you are curious about our `word_index`, we can print it.

In [14]:
word_index

{'<OOV>': 1,
 'the': 2,
 'a': 3,
 'to': 4,
 'in': 5,
 'of': 6,
 'and': 7,
 'i': 8,
 'is': 9,
 'for': 10,
 'on': 11,
 'you': 12,
 'my': 13,
 'with': 14,
 'it': 15,
 'that': 16,
 'by': 17,
 'at': 18,
 'this': 19,
 'are': 20,
 'from': 21,
 'be': 22,
 'was': 23,
 'have': 24,
 'up': 25,
 'amp': 26,
 'me': 27,
 'like': 28,
 'just': 29,
 'as': 30,
 'so': 31,
 'but': 32,
 'not': 33,
 'your': 34,
 'fire': 35,
 'out': 36,
 'no': 37,
 'will': 38,
 'an': 39,
 'all': 40,
 'after': 41,
 'when': 42,
 'if': 43,
 'get': 44,
 'has': 45,
 '2': 46,
 'we': 47,
 'via': 48,
 "i'm": 49,
 'new': 50,
 'now': 51,
 'more': 52,
 'or': 53,
 'about': 54,
 'people': 55,
 'he': 56,
 'news': 57,
 'over': 58,
 'what': 59,
 'they': 60,
 'emergency': 61,
 'do': 62,
 'how': 63,
 'one': 64,
 'been': 65,
 "it's": 66,
 "don't": 67,
 "'": 68,
 'into': 69,
 'can': 70,
 'there': 71,
 'video': 72,
 'disaster': 73,
 'body': 74,
 'burning': 75,
 'her': 76,
 'than': 77,
 'would': 78,
 'buildings': 79,
 'who': 80,
 'police': 81,
 'u'

## Model Building

Finally, we are getting to the juice of this tutorial. We are building our very own ML model.

We will be building a Sequential model.

We use the following layers:
* Embedding layer - Turns positive integers (indexes) into dense vectors of fixed size. This basically converts our sequences into vectors.
* GlobalAveragePooling1D - Global average pooling operation for temporal data. It computes the average of each embedding dimension across all positions in the sequence, giving one fixed-size vector per tweet.
* Dense layers - One with `relu` activation that acts as a hidden layer, and another with `sigmoid` that outputs a value between 0 and 1 for classifying our tweet as either 1 (Disaster) or 0 (Not a Disaster).
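
What global average pooling and the sigmoid do can be sketched with NumPy. The tiny array below is made up (2 tweets, 4 tokens each, 3 embedding dimensions instead of 16) so the numbers are easy to check by hand; it illustrates the operations, not the Keras layers themselves.

```python
import numpy as np

# Made-up embedded batch: 2 tweets, 4 tokens each, 3 embedding dimensions.
embedded = np.array([
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 0.0, 0.0], [0.0, 4.0, 0.0]],
    [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0], [2.0, 2.0, 2.0]],
])

# Average over the sequence axis (axis=1), keeping one value per dimension.
pooled = embedded.mean(axis=1)

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(pooled.shape)   # (2, 3)
print(pooled)         # rows: [1. 2. 1.] and [2. 2. 2.]
print(sigmoid(0.0))   # 0.5
```

Whatever the sequence length, the pooled output has one row per tweet, which is why the model summary shows `(None, 16)` after this layer.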

We will compile our model with `binary_crossentropy` as our loss because we only have binary classes (1 and 0). 
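
For intuition, binary cross-entropy for one example with label `y` and predicted probability `p` is `-(y*log(p) + (1-y)*log(1-p))`, averaged over all examples. Below is a minimal pure-Python sketch of that formula; the clipping constant `eps` and the sample predictions are assumptions made for this example, and Keras computes the loss with its own implementation.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # made-up predictions close to the labels
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # made-up predictions far from the labels

good_loss = binary_crossentropy(labels, mostly_right)
bad_loss = binary_crossentropy(labels, mostly_wrong)
print(good_loss, bad_loss)
```

Predictions close to the true labels give a small loss, while confident wrong predictions are penalized heavily.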

We will use `Adam` optimizer along with `accuracy` as metrics.

We can visualize the layers of our model with `model.summary()`.

In [15]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 280, 16)           160000    
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 6)                 102       
                                                                 
 dense_1 (Dense)             (None, 1)                 7         
                                                                 
Total params: 160,109
Trainable params: 160,109
Non-trainable params: 0
_________________________________________________________________
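
The parameter counts in the summary can be checked by hand. A short sketch of the arithmetic, using the layer sizes from the model above (the variable names are chosen only for this example):

```python
vocab_size, embedding_dim = 10000, 16
hidden_units, output_units = 6, 1

embedding_params = vocab_size * embedding_dim                 # one vector per word
hidden_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * output_units + output_units    # weights + biases

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
# 160000 102 7 160109
```

Almost all of the parameters sit in the embedding layer; the pooling layer has none because it only averages.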


We will train our model for 10 epochs (`epochs = 10`), that is, 10 full passes over the training data.

We will fit our model on the training data and labels, and validate it on the testing data after each epoch.

In [16]:
np.random.seed(42)
num_epochs = 10
model.fit(padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f12003e8890>

😂 Wow, my model is only able to get 84% accuracy. This is still much better than random guessing. We can improve this score by doing EDA and building a better model, but for now this is good enough.