# Covid-19 Tweets NLP using TensorFlow

For this project we are trying to classify the sentiment of tweets regarding Covid-19. \
We are going to use TensorFlow and Keras to perform NLP; more specifically, we will be using techniques such as:

* Tokenizing
* Padding
* Embedding
* GRU
* LSTM
* Convolutions
* Dropout 

There will also be some data cleaning and reduction of vocabulary to increase the performance of our models.

We will achieve about 88% accuracy on the validation data.

# Library

In [1]:
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Import the data

In [2]:
df_training = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_train.csv', encoding = 'latin_1')
df_validation = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_test.csv', encoding = 'latin_1')


In [13]:
df_training.head()

In [14]:
df_validation.head()

In [15]:
training = df_training.copy()
validation = df_validation.copy()

We will only use the OriginalTweet column to predict the Sentiment.

In [16]:
training = training[['OriginalTweet', 'Sentiment']]
validation = validation[['OriginalTweet', 'Sentiment']]

## Missing Values
Let's check for any missing values.

In [17]:
training.isnull().sum()

In [18]:
validation.isnull().sum()

Take a look at the different sentiment categories.

In [19]:
sns.catplot(x = 'Sentiment', kind = 'count', data = training, height = 5, aspect = 2)

In [20]:
sns.catplot(x = 'Sentiment', kind = 'count', data = validation, height = 5, aspect = 2)

## Mapping the Sentiment Column
Now we recode the Sentiment column into numerical values. We will only use 3 categories: Negative, Neutral and Positive. Extremely Negative is merged into Negative and Extremely Positive into Positive.

In [21]:
l = {'Extremely Negative': 0,
     'Negative': 0,
     'Neutral': 1,
     'Positive': 2,
     'Extremely Positive': 2}


In [22]:
training['Sentiment'] = training['Sentiment'].map(l)
validation['Sentiment'] = validation['Sentiment'].map(l)

In [23]:
training['Sentiment'].value_counts()

In [24]:
validation['Sentiment'].value_counts()

# Cleaning
Now we clean the tweets by removing URLs and tagged users (@mentions). We also remove tweets that are 20 characters or shorter after cleaning from the training set.

In [25]:
training['OriginalTweet'] = training['OriginalTweet'].str.replace(r'http\S+', '', regex = True)
training['OriginalTweet'] = training['OriginalTweet'].str.replace(r'@\S+', '', regex = True)
validation['OriginalTweet'] = validation['OriginalTweet'].str.replace(r'http\S+', '', regex = True)
validation['OriginalTweet'] = validation['OriginalTweet'].str.replace(r'@\S+', '', regex = True)
print(training['OriginalTweet'])

In [26]:
print(training.shape)
training = training[(training['OriginalTweet'].str.len() > 20)]
print(training.shape)
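
To make the effect of the two patterns concrete, here is a minimal sketch using Python's `re` module on a made-up tweet. The sample text and the `clean_tweet` helper are illustrative only and are not part of the dataset or of this notebook; the pandas `str.replace(..., regex = True)` calls above apply the same kind of substitution to every row.

```python
import re

def clean_tweet(text):
    """Remove URLs and @mentions, mirroring the two patterns used above."""
    text = re.sub(r'http\S+', '', text)  # drop "http" and everything up to the next whitespace
    text = re.sub(r'@\S+', '', text)     # drop "@" and the non-whitespace characters after it
    return text

# Hypothetical tweet, invented for this example
sample = '@someone Stock up calmly https://t.co/abc123 #covid19'
cleaned = clean_tweet(sample)
print(repr(cleaned))

# The length filter keeps only tweets longer than 20 characters
print(len(cleaned) > 20)
```

Note that the substitutions leave the surrounding whitespace in place, so a cleaned tweet can start with a space or contain double spaces. That extra whitespace also counts towards the 20-character threshold.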

# Tokenize
The tokenization is quite standard: we tokenize and then pad all the sentences so that they have the same length. The maximum length of a sentence is 120 words, which is sufficient. We have decreased the vocabulary of the Tokenizer from about 50000 to 6000 in order to reduce overfitting, since the vocabulary will then consist of only the 6000 most common words from the training data.

In [27]:
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = '<OOV>'

training_sentences = training['OriginalTweet']
training_labels = training['Sentiment']

validation_sentences = validation['OriginalTweet']
validation_labels = validation['Sentiment']

tokenizer = Tokenizer(oov_token = oov_tok, num_words = 6000)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
vocab_size = len(word_index) +1

sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen = max_length, truncating = trunc_type)

validation_sequences = tokenizer.texts_to_sequences(validation_sentences)
validation_padded = pad_sequences(validation_sequences, maxlen = max_length, truncating = trunc_type)
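
As a rough illustration of what the cell above does, here is a simplified pure-Python stand-in for the vocabulary limit, the OOV token and the padding. The helpers `build_word_index`, `to_sequence` and `pad` are hypothetical names written for this sketch; the real Keras `Tokenizer` also strips punctuation, which is ignored here.

```python
from collections import Counter

OOV = '<OOV>'

def build_word_index(texts):
    """Rank words by frequency; index 1 is reserved for OOV and 0 for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {OOV: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_sequence(text, word_index, num_words):
    """Map words to indices; unseen words and indices >= num_words become OOV."""
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word, word_index[OOV])
        seq.append(idx if idx < num_words else word_index[OOV])
    return seq

def pad(seq, maxlen):
    """Cut tokens off the end ('post' truncation), then left-pad with zeros ('pre' padding)."""
    seq = seq[:maxlen]
    return [0] * (maxlen - len(seq)) + seq

# Toy corpus, invented for this example
texts = ['stay home stay safe', 'stay home']
word_index = build_word_index(texts)
seq = to_sequence('stay safe now', word_index, num_words=4)

print(word_index)   # every word gets an index, regardless of num_words
print(seq)          # 'safe' is outside the limit and 'now' is unseen, so both map to OOV
print(pad(seq, 5))  # zeros are added at the front
print(pad(seq, 2))  # the last token is dropped
```

As in Keras, the word index here contains every word seen during fitting, and the limit is only applied when texts are converted to sequences. This means `len(word_index) + 1` in the cell above is the size of the full vocabulary, not 6000, so the embedding layers are probably larger than the indices that actually occur require.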

# Models
I am going to try two different models, one using GRU and one using LSTM.

I am going to use a callback so that the training stops if the validation accuracy has not improved for 5 consecutive epochs. That should limit the problem of overfitting. The model will also restore the weights from the epoch with the best validation accuracy.

In [28]:
callback = tf.keras.callbacks.EarlyStopping(monitor = 'val_accuracy', patience = 5, restore_best_weights = True)
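
The patience logic can be sketched in plain Python as follows. This is a simplified model of the behaviour described above, not the Keras implementation, and the function name `simulate_early_stopping` and the accuracy values are made up for illustration.

```python
def simulate_early_stopping(val_accuracies, patience):
    """Return (stopped_epoch, best_epoch), both 0-indexed, for per-epoch validation accuracies."""
    best = float('-inf')
    best_epoch = None
    wait = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            # A strict improvement resets the patience counter
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs in a row: stop here
                return epoch, best_epoch
    # All epochs ran without triggering the stop
    return len(val_accuracies) - 1, best_epoch

# Hypothetical validation accuracies, not results from this notebook
history = [0.80, 0.85, 0.84, 0.83, 0.85, 0.84, 0.83, 0.82]
stopped, best = simulate_early_stopping(history, patience=5)
print(stopped, best)
```

In this toy run the best accuracy is reached at epoch index 1, and matching it again later does not count as an improvement. Training stops five epochs after the best one, and with `restore_best_weights = True` the weights from the best epoch are the ones that are kept.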

## GRU

In [29]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),
    tf.keras.layers.Dropout(0.5),
    
    tf.keras.layers.Conv1D(128, 5, activation = 'relu'),
    tf.keras.layers.MaxPooling1D(pool_size = 1),
    tf.keras.layers.Dropout(0.3),
    
    tf.keras.layers.Conv1D(256, 5, activation = 'relu'),
    tf.keras.layers.Dropout(0.3),
    
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation = 'relu'),
    tf.keras.layers.Dense(3, activation = 'softmax')
])
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.summary()
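
The sequence lengths and parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes the Keras defaults used in the model (`padding = 'valid'`, stride 1 for the convolutions, pooling strides equal to `pool_size`); the helper function names are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a Conv1D layer with 'valid' padding and stride 1."""
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of MaxPooling1D with 'valid' padding and strides equal to pool_size."""
    return (length - pool_size) // pool_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Weights of one kernel (kernel_size x in_channels) plus one bias, per filter."""
    return (kernel_size * in_channels + 1) * filters

# Trace the sequence length through the GRU model's convolutional part
length = 120                            # max_length after padding
length = conv1d_out_len(length, 5)      # first Conv1D
after_conv1 = length
length = maxpool1d_out_len(length, 1)   # MaxPooling1D(pool_size = 1)
after_pool = length
length = conv1d_out_len(length, 5)      # second Conv1D
after_conv2 = length

print(after_conv1, after_pool, after_conv2)
print(conv1d_params(5, 16, 128))    # first Conv1D, fed by the 16-dimensional embedding
print(conv1d_params(5, 128, 256))   # second Conv1D, fed by 128 channels
```

One thing the arithmetic makes visible is that `MaxPooling1D(pool_size = 1)` leaves the sequence unchanged, so it has no effect on the GRU model. The same helpers applied to the LSTM model give 120, 116, 112, 108 for the three convolutions and 27 after `MaxPooling1D(pool_size = 4)`.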

In [None]:
history = model.fit(padded, 
                    np.array(training_labels), 
                    epochs = 20, 
                    validation_data = (validation_padded, np.array(validation_labels)),
                    callbacks = [callback])

In [None]:
loss = history.history['loss']
acc = history.history['accuracy']

val_loss = history.history['val_loss']
val_acc = history.history['val_accuracy']

plt.plot(loss, label = 'Training Loss')
plt.plot(val_loss, label = 'Validation Loss')
plt.title('Loss Plot')
plt.ylabel('Loss')
plt.legend()

plt.figure()

plt.plot(acc, label = 'Training Accuracy')
plt.plot(val_acc, label = 'Validation Accuracy')
plt.title('Accuracy Plot')
plt.xlabel('Epochs')
plt.legend()
plt.ylabel('Accuracy')

While our accuracy on the training data is increasing, our accuracy on the validation data is decreasing or stagnant, which is a sign that we are overfitting.
The callback stepped in and stopped the training early.

## LSTM

In [None]:
model_2 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64, input_length = max_length),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Conv1D(128, 5, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(256, 5, activation = 'relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(512, 5, activation = 'relu'),
    tf.keras.layers.MaxPooling1D(pool_size = 4),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(64, activation = 'relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(32, activation = 'relu'),
    tf.keras.layers.Dense(3, activation = 'softmax')
])
model_2.summary()

In [None]:
adam = tf.keras.optimizers.Adam(
    learning_rate=0.0005, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
    name='Adam'
)
model_2.compile(loss = 'sparse_categorical_crossentropy', optimizer = adam, metrics = ['accuracy'])
history_2 = model_2.fit(padded,
                        np.array(training_labels),  # convert to NumPy, as for the GRU model
                        epochs = 20,
                        validation_data = (validation_padded, np.array(validation_labels)),
                        callbacks = [callback]
                       )

In [None]:

loss = history_2.history['loss']
acc = history_2.history['accuracy']

val_loss = history_2.history['val_loss']
val_acc = history_2.history['val_accuracy']



plt.plot(loss, label = 'Training Loss')
plt.plot(val_loss, label = 'Validation loss')
plt.title('Loss Plot')
plt.ylabel('Loss')
plt.legend()

plt.figure()

plt.plot(acc, label = 'Training Accuracy')
plt.plot(val_acc, label = 'Validation Accuracy')
plt.title('Accuracy Plot')
plt.ylabel('Accuracy')
plt.legend()
plt.xlabel('Epochs')
