This kernel demonstrates binary classification of tweets using a sequence model (an LSTM built with Keras).


Let's start by importing all the necessary packages.

In [103]:
import pandas as pd
import numpy as np
import re
import string
from collections import Counter, namedtuple

from nltk.corpus import stopwords
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, accuracy_score

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Embedding
from keras.optimizers import Nadam

np.random.seed(1)

#### Load data

In [58]:
data = pd.read_csv('../input/nlp-getting-started/train.csv')

Let's see what the data looks like.

In [59]:
data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


Check for class imbalance

In [60]:
data.drop(columns = ['id','keyword','location'], inplace=True)
neg, pos = np.bincount(data.target)
print(f'Total: {len(data)} \nPositive: {pos} \nNegative: {neg}')

Total: 7613 
Positive: 3271 
Negative: 4342


The classes are only mildly imbalanced, so no resampling or class weighting is applied here.
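As a quick check on the balance, the class shares can be computed directly from the counts printed above. This is a small sketch; the variable names are illustrative and not taken from the notebook.

```python
# Counts copied from the output of the cell above
pos, neg = 3271, 4342
total = pos + neg

pos_share = pos / total
neg_share = neg / total

print(f"Positive: {pos_share:.1%}, Negative: {neg_share:.1%}")
```

Roughly 43% of the tweets are positive and 57% are negative.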

Check for null values in data.

In [61]:
data.isnull().sum()

text      0
target    0
dtype: int64

### Let's work with tweets

Clean the text by removing URLs, HTML tags, emojis, punctuation and stopwords, and by lowercasing every word.

In [62]:
def clean_text(text):
    
    #remove urls
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    text = url_pattern.sub(r'', text)
    
    #remove html
    html_pattern = re.compile(r'<.*?>')
    text = html_pattern.sub(r'', text)
    
    #remove emojis
    emoji_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    text = emoji_pattern.sub(r'',text)
    
    #remove punctuation
    table = str.maketrans("", "", string.punctuation)
    text = text.translate(table)
    
    #remove stopwords
    stop = set(stopwords.words('english'))
    text = [word.lower() for word in text.split() if word.lower() not in stop]

    return ' '.join(text)

In [63]:
data['text'] = data['text'].apply(clean_text)

In [64]:
data.head()

Unnamed: 0,text,target
0,deeds reason earthquake may allah forgive us,1
1,forest fire near la ronge sask canada,1
2,residents asked shelter place notified officer...,1
3,13000 people receive wildfires evacuation orde...,1
4,got sent photo ruby alaska smoke wildfires pou...,1
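To see what the cleaning steps do to a single tweet, here is a reduced sketch that needs only the standard library. It reuses the URL, HTML and punctuation steps from `clean_text`, but the tiny `STOP_WORDS` set and the sample tweet are assumptions made for illustration; the notebook itself uses NLTK's English stopword list.

```python
import re
import string

# Hypothetical, tiny stopword set standing in for NLTK's English list
STOP_WORDS = {"the", "is", "at", "a", "of", "in"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_PATTERN = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text_sketch(text):
    text = URL_PATTERN.sub("", text)    # drop links
    text = HTML_PATTERN.sub("", text)   # drop html tags
    text = text.translate(PUNCT_TABLE)  # drop punctuation, including '#'
    # lowercase and drop stopwords
    words = [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)

sample = "Fire at the <b>harbour</b>! Details: https://example.com #breaking"
cleaned = clean_text_sketch(sample)
print(cleaned)  # fire harbour details breaking
```

Note that hashtags survive as plain words because only the `#` character is removed.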


We cannot feed raw text directly into a sequence model. Each word in a tweet first has to be mapped to an integer; the Keras Embedding layer then turns each integer into a dense vector.

Let's find the vocabulary size first.

In [65]:
def word_counter(text):  
    
    count = Counter()
    for i in text.values:
        for word in i.split():
            count[word] += 1
    return count    

text = data['text']
counter = word_counter(text)

vocab_size = len(counter)

The model needs fixed-size input, so every tweet is padded or truncated to a maximum length; here I use 20. Try different values to find the best one. A smaller value is usually preferable, since short tweets then need less zero padding and the input is less sparse.

In [66]:
max_len = 20
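To illustrate what a fixed input length means in practice, the sketch below pads or truncates a list of token ids at the end, which is the behaviour requested in the training loop via `padding='post'` and `truncating='post'`. `pad_post` is a hypothetical helper written for illustration; it is not the Keras implementation.

```python
def pad_post(seq, max_len, pad_value=0):
    """Cut the sequence at the end if too long, pad at the end if too short."""
    trimmed = seq[:max_len]
    return trimmed + [pad_value] * (max_len - len(trimmed))

short = pad_post([4368, 716, 152], max_len=5)
long_ = pad_post([1, 2, 3, 4, 5, 6, 7], max_len=5)

print(short)  # [4368, 716, 152, 0, 0]
print(long_)  # [1, 2, 3, 4, 5]
```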

To map the words to unique integer values, we will use the Keras Tokenizer.

The Tokenizer converts each tweet into a sequence of integers. Each integer is the index assigned to that word in the tokenizer's word_index dictionary.

In [67]:
t = Tokenizer(num_words = vocab_size)
t.fit_on_texts(data['text'])

word_index = t.word_index

dict(list(word_index.items())[:10])

{'like': 1,
 'im': 2,
 'amp': 3,
 'fire': 4,
 'get': 5,
 'new': 6,
 'via': 7,
 'people': 8,
 'one': 9,
 'news': 10}
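The indices above follow word frequency: the most frequent word gets index 1, and index 0 is never assigned to a word, which leaves it free for padding. A minimal pure-Python sketch of that idea follows; `build_word_index`, `to_sequence` and the toy tweets are made up for illustration, and the real Tokenizer additionally handles character filtering and lowercasing.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by descending frequency; indices start at 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = counts.most_common()  # ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, index):
    """Map each known word to its index, skipping words not in the index."""
    return [index[w] for w in text.split() if w in index]

toy_tweets = ["fire near forest", "forest fire spreading", "fire alert"]
toy_index = build_word_index(toy_tweets)
seq = to_sequence("forest fire alert", toy_index)

print(toy_index)  # {'fire': 1, 'forest': 2, 'near': 3, 'spreading': 4, 'alert': 5}
print(seq)        # [2, 1, 5]
```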

We will use this tokenizer later on train and test tweets.

Let's take the first 7500 examples for training and validation, and keep the remaining ones for testing.

In [147]:
df = data[:7500]

### Let's build a sequential model using Keras.


In [148]:
model = Sequential()
model.add(Embedding(vocab_size, 200, input_length = max_len))
model.add(LSTM(80))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

nadam = Nadam(learning_rate=0.0001)

model.compile(loss = 'binary_crossentropy', optimizer=nadam, metrics=['accuracy'])

In [149]:
model.summary()

Model: "sequential_17"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_17 (Embedding)     (None, 20, 200)           3594200   
_________________________________________________________________
lstm_17 (LSTM)               (None, 80)                89920     
_________________________________________________________________
dropout_17 (Dropout)         (None, 80)                0         
_________________________________________________________________
dense_17 (Dense)             (None, 1)                 81        
Total params: 3,684,201
Trainable params: 3,684,201
Non-trainable params: 0
_________________________________________________________________
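The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The vocabulary size of 17971 is inferred from the embedding row (3,594,200 / 200); it is not printed anywhere in the notebook.

```python
vocab_size = 17971  # inferred: 3,594,200 embedding params / 200 dimensions
embed_dim = 200
lstm_units = 80

# One embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# 4 gates, each with (input + recurrent) weights plus one bias per unit
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# One weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 3594200 89920 81 3684201
```

Almost all parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.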


In [150]:
skf = StratifiedKFold(n_splits=5)
X = df['text']
y = df['target']

In [151]:
accuracy = []
# train model on 5 folds
for train_index, test_index in skf.split(X, y):
    
    # skf.split returns positional indices, so select rows with .iloc
    train_x, test_x = X.iloc[train_index], X.iloc[test_index]
    train_y, test_y = y.iloc[train_index], y.iloc[test_index]
    print("Tweet before tokenization: ", train_x.iloc[0])
    
    #Tokenize the tweets using tokenizer.
    train_tweets = t.texts_to_sequences(train_x)
    test_tweets = t.texts_to_sequences(test_x)
    print("Tweet after tokenization: ", train_tweets[0])
    
    #pad the tokenized tweet data
    train_tweets_padded = pad_sequences(train_tweets, maxlen=max_len, padding='post', truncating='post')
    test_tweets_padded = pad_sequences(test_tweets, maxlen=max_len, padding='post', truncating='post')
    print('Tweet after padding: ', train_tweets_padded[0])
    
    #train model on processed tweets
    history = model.fit(train_tweets_padded, train_y, epochs=5, validation_data = (test_tweets_padded,test_y))
    
    #make predictions
    pred_y = model.predict_classes(test_tweets_padded)
    print("Validation accuracy : ",accuracy_score(pred_y, test_y))
    
    #store validation accuracy
    accuracy.append(accuracy_score(pred_y, test_y))

Tweet before tokenization:  ika tuning soup diet recipes fat burning soup recipes fat burning soup diet recip
Tweet after tokenization:  [8621, 4975, 2946, 2460, 2947, 1276, 21, 2946, 2947, 1276, 21, 2946, 2460, 8622]
Tweet after padding:  [8621 4975 2946 2460 2947 1276   21 2946 2947 1276   21 2946 2460 8622
    0    0    0    0    0    0]

Train on 6000 samples, validate on 1500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Validation accuracy :  0.6833333333333333
Tweet before tokenization:  deeds reason earthquake may allah forgive us
Tweet after tokenization:  [4368, 716, 152, 54, 1454, 4369, 13]
Tweet after padding:  [4368  716  152   54 1454 4369   13    0    0    0    0    0    0    0
    0    0    0    0    0    0]
Train on 6000 samples, validate on 1500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Validation accuracy :  0.942
Tweet before tokenization:  deeds reason earthquake may allah forgive us
Tweet after tokenization:  [4368, 716, 152, 54, 1454, 4369, 13]
Tweet after padding:  [4368  716  152   54 1454 4369   13    0    0    0    0    0    0    0
    0    0    0    0    0    0]
Train on 6000 samples, validate on 1500 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Validation accuracy :  0.962
Tweet before tokenization:  deeds reason earthquake may allah forgive us
Tweet aft

In [155]:
print("Validation accuracy of the model :", np.mean(accuracy))

Validation accuracy of the model : 0.9118666666666666
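One caveat about the loop above: the same `model` object is fit in every fold, so from the second fold onward the validation tweets were already part of an earlier fold's training data. The toy sketch below illustrates the effect with a hypothetical `MemorizingModel` that simply remembers its training pairs. It is a stand-in for illustration, not the LSTM, and `cross_validate` is likewise an assumed helper.

```python
class MemorizingModel:
    """Toy stand-in for a network: remembers every (x, y) pair it was fit on."""

    def __init__(self):
        self.seen = {}

    def fit(self, xs, ys):
        self.seen.update(zip(xs, ys))

    def predict(self, xs):
        # Inputs never seen during fit get the default label 0
        return [self.seen.get(x, 0) for x in xs]


def cross_validate(xs, ys, k, fresh_model_per_fold):
    """Contiguous k-fold loop; returns the validation accuracy of each fold."""
    fold_size = len(xs) // k
    model = MemorizingModel()
    scores = []
    for fold in range(k):
        val_idx = list(range(fold * fold_size, (fold + 1) * fold_size))
        held_out = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        if fresh_model_per_fold:
            model = MemorizingModel()  # discard what earlier folds learned
        model.fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = model.predict([xs[i] for i in val_idx])
        correct = sum(p == ys[i] for p, i in zip(preds, val_idx))
        scores.append(correct / len(val_idx))
    return scores


xs = list(range(10))
ys = [1] * 10  # every true label is 1, so an unseen input is always misclassified

reused_scores = cross_validate(xs, ys, k=5, fresh_model_per_fold=False)
fresh_scores = cross_validate(xs, ys, k=5, fresh_model_per_fold=True)

print(reused_scores)  # [0.0, 1.0, 1.0, 1.0, 1.0]
print(fresh_scores)   # [0.0, 0.0, 0.0, 0.0, 0.0]
```

The reused model looks perfect after the first fold only because it has already seen the validation items. With the Keras model, the equivalent change would be to build and compile a new model at the start of each fold.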


The mean validation accuracy over the five folds is about 91%. Treat this figure as optimistic, since the model is not reset between folds. Let's see how it performs on the held-out tweets.

In [153]:
# Note: data[7501:] skips row 7500; data[7500:] would use every row not in df
test_df = data[7501:]

tokenized_tweets = t.texts_to_sequences(test_df['text'])
padded_tweets = pad_sequences(tokenized_tweets, maxlen=max_len, padding='post', truncating='post')
test_y = test_df['target']
pred_y = model.predict_classes(padded_tweets)

In [154]:
accuracy_score(pred_y, test_y)

0.9285714285714286

We achieved about 93% accuracy on the 112 held-out tweets!! 🎉

References:

https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

https://www.youtube.com/watch?v=j7EB7yeySDw