Text Classification of COVID Tweets Using World2Vec and LSTMs 

Tweets have been classified as covid-19-related (1) or not covid-19-related (0). All tweets have had the following keywords removed:

-corona

-coronavirus

-covid

-covid19

-covid-19

-sarscov2

-19

**Word Embeddings** Definition Word embeddings are a type of word representation that look at the context of words which allows words with similar meaning to have a similar vector representation.


Each word is represented as  a vector.


Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network.



**Word2Vec**

Word2Vec is a shallow, two-layer neural networks which is trained to reconstruct linguistic contexts of words.
Input: Large Corpus of words


Output: Vector Space, with each unique word in the corpus being assigned a corresponding vector in the space.


Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.


**Deep Network** Deep network takes the sequence of embedding vectors as input and converts them to a compressed representation. The compressed representation effectively captures all the information in the sequence of words in the text. The deep neywrok part is usually an RNN or some forms of it like LSTM/GRU. The dropout is added to overcome the tendency to overfit, a very common problem with RNN based networks.

**Recurrent Neural Networks** Recurrent Neural Networks are used to handle sequential data. One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames/words might inform the understanding of the present frame/words. 


Sometimes, we only need to look at recent information to perform the present task. But Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

Standard RNNs fail to learn in the presence of time lags greater than 5 – 10 discrete time steps between relevant input events and target signals. The vanishing error problem casts doubt on whether standard RNNs can indeed exhibit significant practical advantages over time window-based feedforward networks. A recent model, “Long Short-Term Memory” (LSTM), is not affected by this problem. LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through “constant error carrousels” (CECs) within special units, called cells

 
https://towardsdatascience.com/machine-learning-word-embedding-sentiment-classification-using-keras-b83c28087456

https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
**LSTM** Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

![lstm_gru.png](attachment:lstm_gru.png)

**GRU** GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem which comes with a standard recurrent neural network. GRU can also be considered as a variation on the LSTM because both are designed similarly and, in some cases, produce equally excellent results.

In [1]:
import pandas as pd
import numpy as 

In [2]:
data=pd.read_csv('/content/drive/MyDrive/updated_train.csv')

In [3]:
data.head()

Unnamed: 0,ID,text,target
0,train_0,The bitcoin halving is cancelled due to,1
1,train_1,MercyOfAllah In good times wrapped in its gran...,0
2,train_2,266 Days No Digital India No Murder of e learn...,1
3,train_3,India is likely to run out of the remaining RN...,1
4,train_4,In these tough times the best way to grow is t...,0


In [4]:
df=data[['text','target']]

In [17]:
import nltk
import string
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [15]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [18]:
from nltk.sem.logic import Tokens
lines=df['text'].values.tolist()
review_lines=[]
for line in lines :
  tokens=word_tokenize(line)
  tokens=[word.lower() for word in tokens]
  table=str.maketrans("","",string.punctuation)
  stripped=[w.translate(table) for w in tokens]
  words=[word for word in stripped if word.isalpha()]
  stop_word=set(stopwords.words('english'))
  wordss=[word for word in words if not word in stop_word]
  review_lines.append(wordss)


In [20]:
tokens

['interest',
 'rate',
 'swap',
 'derivative',
 'pricing',
 'in',
 'python',
 'harbourfront',
 'technologies']

In [23]:
stripped

['interest',
 'rate',
 'swap',
 'derivative',
 'pricing',
 'in',
 'python',
 'harbourfront',
 'technologies']

In [24]:
words

['interest',
 'rate',
 'swap',
 'derivative',
 'pricing',
 'in',
 'python',
 'harbourfront',
 'technologies']

In [26]:
wordss

['interest',
 'rate',
 'swap',
 'derivative',
 'pricing',
 'python',
 'harbourfront',
 'technologies']

In [31]:
review_lines[1]

['mercyofallah',
 'good',
 'times',
 'wrapped',
 'granular',
 'detail',
 'challenge',
 'find',
 'meaning',
 'model',
 'humility']

In [42]:
import gensim
from gensim.models import Word2Vec

In [33]:
#Workers:number of cores,  min_count=min lenght of word, embeding dim=100
model=gensim.models.Word2Vec(review_lines,size=100,window=5,min_count=1,workers=4)

In [36]:
#Save Word2Vec model
filename='covid_tweet_embedding_word2vec.txt'
model.wv.save_word2vec_format(filename,binary=False)

In [34]:
#test model
model.wv.most_similar('lockdown')

[('get', 0.9938178062438965),
 ('sports', 0.9936801791191101),
 ('amp', 0.9933863878250122),
 ('people', 0.9933496713638306),
 ('one', 0.9932780265808105),
 ('like', 0.9931713342666626),
 ('ramadan', 0.992913007736206),
 ('today', 0.9928785562515259),
 ('day', 0.9924890398979187),
 ('new', 0.9924325942993164)]

In [37]:
#save model in dictionary
embedding_index={}
f=open('covid_tweet_embedding_word2vec.txt',encoding='utf-8')
for line in f:
  values=line.split()
  word=values[0]
  coefs=np.asarray(values[1:])
  embedding_index[word]=coefs


f.close() 

In [38]:
df.shape

(5287, 2)

In [39]:
X_train=df.loc[:4200,'text'].values
Y_train=df.loc[:4200,'target'].values

X_test=df.loc[4201:,'text'].values
Y_test=df.loc[4201:,'target'].values

- Tokenization
- Padding: we need inputs with same size
- Splitting the data

In [43]:
#trobleshooting
import os
pip install keras
keras.__version__
keras.preprocessing
from keras.preprocessing import image as image_utils

In [57]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [60]:
from typing import Sequence
#split train data to train and vaidation sets 
tokenizer_obj=Tokenizer()
tokenizer_obj.fit_on_texts(review_lines)
sequences=tokenizer_obj.texts_to_sequences(review_lines)
print(sequences)

[[4026, 6110, 968, 82], [13, 31, 66, 3002, 6111, 3003, 418, 195, 3004, 546, 3005], [90, 547, 144, 1335, 313, 19, 133, 196, 4027, 1490, 3, 419], [144, 893, 592, 2409, 6112, 302, 640, 112, 9, 73, 511], [894, 66, 77, 68, 1491, 180, 371, 1336, 24, 4, 180, 3006, 2, 6113], [969, 2012, 1074, 232, 36, 260, 1492, 223, 689, 24, 314, 3007, 333, 641], [4028, 2, 481, 26, 970, 15, 1493, 233, 456, 4029, 2013, 15, 6114, 805, 4030, 6115, 735, 9, 334, 2014, 15, 2410, 512, 52, 895], [3008, 2015, 1709, 970, 4031], [806, 30, 130, 1337, 123, 3009, 1710, 224, 6116], [482, 2411, 111, 4032, 1494, 335, 4033, 1711, 1338, 642, 335, 101, 6117, 1338, 2016, 690, 1495, 3010, 3011, 1338, 73, 270], [354, 9, 6118, 2017, 31, 736, 101, 593, 643, 6119], [77, 594, 271, 1192, 212, 3, 548, 595, 6120, 20, 206], [372, 7, 737, 6121, 30, 6122, 2412, 1712, 481, 315], [2018, 420, 1496, 76, 2413, 82, 971, 32], [89, 2414, 1339, 6123, 6124, 4034, 2415, 160, 3012, 4035], [373, 6125, 1193, 43, 96, 483, 2416, 4036], [147, 4037, 691, 255,

In [63]:
tweet_pad=pad_sequences(sequences,100) #max_length=100
print(tweet_pad)
lables=df['target'].values

[[    0     0     0 ...  6110   968    82]
 [    0     0     0 ...  3004   546  3005]
 [    0     0     0 ...  1490     3   419]
 ...
 [    0     0     0 ...  1472  4467 13425]
 [    0     0     0 ...   900   453    32]
 [    0     0     0 ...    10 13431  2363]]


In [62]:
tweet_pad.shape

(5287, 100)

In [72]:
indices=np.arange(tweet_pad.shape[0])
np.random.shuffle(indices)
tweet_pad=tweet_pad[indices]
labels=lables[indices]

test_split=0.2
num_test_samples=int(test_split*tweet_pad.shape[0])

X_train_pad=tweet_pad[:-num_test_samples]
y_train_pad=lables[:-num_test_samples]

X_test_pad=tweet_pad[-num_test_samples:]
y_test_pad=lables[-num_test_samples:]

In [73]:
X_train_pad.shape, X_test_pad.shape

((4230, 100), (1057, 100))

Make sure that all words in our vocabulary have an embedding vector. when we created the embedding few of them dont have vectors.
whitchever embedding doesn't have a vector, assign a zero vector for them.

In [74]:
embeding_dim=100
word_index=tokenizer_obj.word_index
num_words=len(word_index)+1
embedding_matrix=np.zeros((num_words,embeding_dim))

for word,i in word_index.items():
  embedding_vector=embedding_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i]=embedding_vector

Training Defining Model

In Keras, A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.

1. Load pre-trained word embeddings into an Embedding layer

2. Adding Convolution Layer

3. Pooling Layer

4. Adding Dense Layer Note that we set trainable = False so as to keep the embeddings fixed

In [75]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, GRU
from keras.layers.embeddings import Embedding
from keras.initializers import Constant
from keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from tensorflow.keras.initializers import Constant
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense
# define model

In [79]:
max_length=100
model=Sequential()
embedding_layer=Embedding(num_words,embeding_dim,embeddings_initializer=Constant(embedding_matrix),
                          input_length=max_length,trainable=False)
#trainable=False -->we dont want to train it now because same data we are using for emitting
model.add(embedding_layer)
model.add(LSTM(units=32,dropout=0.2,recurrent_dropout=0.2)) #dropout:avoid over fitting. 20% of neurons in every iteration randomly selected and switched off
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 100, 100)          1343200   
                                                                 
 lstm_1 (LSTM)               (None, 32)                17024     
                                                                 
 dense_1 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1,360,257
Trainable params: 17,057
Non-trainable params: 1,343,200
_________________________________________________________________
None


In [84]:
history=model.fit(X_train_pad,y_train_pad,batch_size=64,epochs=25,validation_data=(X_test_pad,y_test_pad),verbose=2)

Epoch 1/25
67/67 - 11s - loss: 0.6933 - accuracy: 0.5078 - val_loss: 0.6924 - val_accuracy: 0.5298 - 11s/epoch - 162ms/step
Epoch 2/25
67/67 - 7s - loss: 0.6927 - accuracy: 0.5180 - val_loss: 0.6918 - val_accuracy: 0.5307 - 7s/epoch - 109ms/step
Epoch 3/25
67/67 - 8s - loss: 0.6926 - accuracy: 0.5168 - val_loss: 0.6918 - val_accuracy: 0.5307 - 8s/epoch - 114ms/step
Epoch 4/25
67/67 - 9s - loss: 0.6925 - accuracy: 0.5151 - val_loss: 0.6919 - val_accuracy: 0.5307 - 9s/epoch - 133ms/step
Epoch 5/25
67/67 - 8s - loss: 0.6924 - accuracy: 0.5142 - val_loss: 0.6914 - val_accuracy: 0.5317 - 8s/epoch - 113ms/step
Epoch 6/25
67/67 - 8s - loss: 0.6924 - accuracy: 0.5187 - val_loss: 0.6926 - val_accuracy: 0.5251 - 8s/epoch - 118ms/step
Epoch 7/25
67/67 - 8s - loss: 0.6924 - accuracy: 0.5232 - val_loss: 0.6910 - val_accuracy: 0.5336 - 8s/epoch - 112ms/step
Epoch 8/25
67/67 - 8s - loss: 0.6929 - accuracy: 0.5161 - val_loss: 0.6915 - val_accuracy: 0.5307 - 8s/epoch - 124ms/step
Epoch 9/25
67/67 - 11s

Evaluating and testing model:

In [85]:
score,acc=model.evaluate(X_test_pad,y_test_pad,batch_size=64)



In [92]:
test1='The pandemic is causing huge losses in covid.'
test2='I like pizza.'
test_sample=[test1,test2]
test_tokens=tokenizer_obj.texts_to_sequences(test_sample)
test_token_pad=pad_sequences(test_tokens,maxlen=100)

model.predict(test_token_pad)

array([[0.46762234],
       [0.49222568]], dtype=float32)