# TEXT CLASSIFICATION USING LSTM AND CNN

For classification I am using the OLID dataset. More information and the dataset itself can be found here: https://sites.google.com/site/offensevalsharedtask/offenseval2019

The first task is to classify whether a given tweet is offensive or not.

In [1]:
import keras
import pandas as pd
import emoji
import nltk
import string
import re

Using TensorFlow backend.


Below is a sample of the training data. For this task we are only interested in the tweet column and the subtask_a column, which holds the label: either OFF (offensive) or NOT. The training dataset stores the tweets and labels in a single TSV file.

In [6]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('OLIDv1.0/olid-training-v1.0.tsv', sep='\t')

In [7]:
tweets_list = [tweet.split() for tweet in df['tweet']]

In [8]:
print(tweets_list[:5])

[['@USER', 'She', 'should', 'ask', 'a', 'few', 'native', 'Americans', 'what', 'their', 'take', 'on', 'this', 'is.'], ['@USER', '@USER', 'Go', 'home', 'you’re', 'drunk!!!', '@USER', '#MAGA', '#Trump2020', '👊🇺🇸👊', 'URL'], ['Amazon', 'is', 'investigating', 'Chinese', 'employees', 'who', 'are', 'selling', 'internal', 'data', 'to', 'third-party', 'sellers', 'looking', 'for', 'an', 'edge', 'in', 'the', 'competitive', 'marketplace.', 'URL', '#Amazon', '#MAGA', '#KAG', '#CHINA', '#TCOT'], ['@USER', 'Someone', 'should\'veTaken"', 'this', 'piece', 'of', 'shit', 'to', 'a', 'volcano.', '😂"'], ['@USER', '@USER', 'Obama', 'wanted', 'liberals', '&amp;', 'illegals', 'to', 'move', 'into', 'red', 'states']]


The tweets need to be preprocessed before training. The steps are:
1. Convert all words to lower case.
2. Remove all non-alphanumeric characters (#, :, emojis).
3. Remove the placeholder words 'user' and 'url', since they carry no useful information.
4. Lemmatize the words.

In [9]:
from nltk.stem import WordNetLemmatizer

def preprocess(tweets_list):
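    # WordNetLemmatizer requires the NLTK 'wordnet' corpus to be downloaded
    lemmatizer = WordNetLemmatizer()
    clean_tweet = []
    for tweet in tweets_list:
        new_tweet = []
        for word in tweet:
            word = word.lower()
            # strip everything that is not a letter or digit
            # (raw string avoids an invalid-escape warning for \W)
            word = re.sub(r'[\W_]', '', word)
            word = lemmatizer.lemmatize(word)
            # drop empty strings and the @USER / URL placeholders
            if word != '' and word != "user" and word != "url":
                new_tweet.append(word)
        clean_tweet.append(new_tweet)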
    return clean_tweet
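
To see what the cleaning steps do in isolation, here is a minimal sketch that applies only the lower-casing, character stripping, and placeholder filtering to a few tokens from the sample above. Lemmatization is left out so that the snippet runs without the NLTK WordNet data. The helper name clean_tokens is hypothetical and not part of the notebook.

```python
import re

def clean_tokens(tokens):
    """Lower-case tokens, strip non-alphanumeric characters, drop placeholders.

    Hypothetical helper for illustration; lemmatization is intentionally omitted.
    """
    cleaned = []
    for token in tokens:
        token = token.lower()
        # \W matches any character that is not a letter, digit, or underscore;
        # adding "_" to the class removes underscores as well.
        token = re.sub(r'[\W_]', '', token)
        # skip tokens that became empty and the @USER / URL placeholders
        if token and token not in ('user', 'url'):
            cleaned.append(token)
    return cleaned

sample = ['@USER', 'Go', 'home', 'you’re', 'drunk!!!', '#MAGA', '👊🇺🇸👊', 'URL']
print(clean_tokens(sample))
# ['go', 'home', 'youre', 'drunk', 'maga']
```

Note that the emoji token disappears entirely and that the apostrophe in "you’re" is removed, which merges the contraction into a single word.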

The Keras Tokenizer converts each tweet into a sequence of integers that can be fed into a neural network. We also need to pad the sequences so that all of them have the same length.

In [10]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

clean_tweet = preprocess(tweets_list)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(clean_tweet)
word2idx = tokenizer.word_index
idx2word = tokenizer.index_word
sents_as_ids = tokenizer.texts_to_sequences(clean_tweet)
VOCAB_SIZE = len(word2idx)+1  # 0 saved for padding so we add 1
MAXIMUM_LENGTH = 30
processed_train_data = pad_sequences(sents_as_ids,MAXIMUM_LENGTH,padding ='post',truncating='post')
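
The two steps above can be illustrated without Keras. The following is a simplified stand-in, not the Keras implementation: it assumes ids are assigned from 1 upwards with the most frequent word first, that unknown words are skipped, and that both padding and truncation happen at the end of the sequence ('post'). The function names are hypothetical.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Assign ids starting at 1, most frequent word first (0 is reserved for padding)."""
    counts = Counter(word for text in tokenized_texts for word in text)
    # most_common sorts by descending count; ties keep first-seen order
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(tokenized_texts, word_index):
    """Replace each known word by its id; words missing from the index are skipped."""
    return [[word_index[w] for w in text if w in word_index] for text in tokenized_texts]

def pad_post(sequences, max_len, pad_value=0):
    """Truncate at the end and pad at the end so every row has max_len entries."""
    return [seq[:max_len] + [pad_value] * (max_len - len(seq)) for seq in sequences]

texts = [['go', 'home', 'you', 'are', 'drunk'], ['go', 'away']]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)
print(sequences)               # [[1, 2, 3, 4, 5], [1, 6]]
print(pad_post(sequences, 4))  # [[1, 2, 3, 4], [1, 6, 0, 0]]
```

Because id 0 never refers to a real word, the vocabulary size passed to the embedding layer has to be one larger than the number of distinct words, which is why the notebook adds 1 to len(word2idx).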



Convert the labels to integers (OFF=1, NOT=0).

In [11]:
labels = []
for label in df['subtask_a']:
    if label=='OFF':
        labels.append(1)
    else:
        labels.append(0)
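
An alternative way to write the same conversion is a lookup table. This is only a sketch: the names LABEL_TO_ID and encode_labels are hypothetical. Unlike the loop above, which silently maps every value other than 'OFF' to 0, the lookup raises a KeyError on an unexpected label, which can help catch malformed rows.

```python
# Hypothetical mapping for subtask A labels
LABEL_TO_ID = {'OFF': 1, 'NOT': 0}

def encode_labels(raw_labels):
    """Map each label string to its integer id; unknown labels raise KeyError."""
    return [LABEL_TO_ID[label] for label in raw_labels]

print(encode_labels(['OFF', 'NOT', 'NOT', 'OFF']))
# [1, 0, 0, 1]
```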
        

Perform the same steps on the test data. Here the tweets and labels are stored in two different files.

In [176]:
# read_csv already returns DataFrames; the label file has no header row
test_set_df = pd.read_csv('OLIDv1.0/testset-levela.tsv', sep='\t')
test_labels_df = pd.read_csv('OLIDv1.0/labels-levela.csv', header=None)
test_tweets_list = [tweet.split() for tweet in test_set_df['tweet']]
test_clean_tweets = preprocess(test_tweets_list)
test_sents_as_ids = tokenizer.texts_to_sequences(test_clean_tweets)
processed_test_data = pad_sequences(test_sents_as_ids,MAXIMUM_LENGTH,padding ='post',truncating='post')

test_labels = []
for label in test_labels_df.iloc[:,1]:
    if label == 'OFF':
        test_labels.append(1)
    else:
        test_labels.append(0)




The first model is an LSTM.

In [150]:
#HYPERPARAMETERS
EMBD_SIZE =10
LSTM_dropout_rate = 0.3
HIDDEN_UNITS = 50

model = keras.Sequential()
model.add(keras.layers.Embedding(VOCAB_SIZE,EMBD_SIZE))
model.add(keras.layers.Dropout(rate=LSTM_dropout_rate))
model.add(keras.layers.LSTM(units=HIDDEN_UNITS))
model.add(keras.layers.Dropout(LSTM_dropout_rate))
model.add(keras.layers.Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Model: "sequential_31"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_30 (Embedding)     (None, None, 10)          188720    
_________________________________________________________________
dropout_25 (Dropout)         (None, None, 10)          0         
_________________________________________________________________
lstm_11 (LSTM)               (None, 50)                12200     
_________________________________________________________________
dropout_26 (Dropout)         (None, 50)                0         
_________________________________________________________________
dense_30 (Dense)             (None, 1)                 51        
Total params: 200,971
Trainable params: 200,971
Non-trainable params: 0
_________________________________________________________________
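
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes. The helper functions are hypothetical, and the vocabulary size of 18,872 is inferred from the 188,720 embedding weights rather than stated in the notebook. It assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

EMBD_SIZE = 10
HIDDEN_UNITS = 50
vocab_size = 188720 // EMBD_SIZE  # inferred from the summary, i.e. 18872

emb = embedding_params(vocab_size, EMBD_SIZE)   # 188720
lstm = lstm_params(EMBD_SIZE, HIDDEN_UNITS)     # 4 * (500 + 2500 + 50) = 12200
dense = dense_params(HIDDEN_UNITS, 1)           # 51
print(emb, lstm, dense, emb + lstm + dense)
# 188720 12200 51 200971
```

Almost all of the weights sit in the embedding layer, so the vocabulary size dominates the model size far more than the number of LSTM units does.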


Split the data into training and validation sets.

In [151]:
train_data = processed_train_data[:10000]
train_labels = labels[:10000]
val_set = processed_train_data[10000:]
val_labels = labels[10000:]

In [180]:
BATCH_SIZE = 50
EPOCHS = 4

history = model.fit(train_data,train_labels,batch_size=BATCH_SIZE,epochs=EPOCHS,verbose=1,validation_data=(val_set,val_labels))
print("RESULTS ON TEST DATA")
model.evaluate(processed_test_data,test_labels,batch_size=50)


Train on 10000 samples, validate on 3240 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
RESULTS ON TEST DATA


[0.984401518522307, 0.739534854888916]

The second model is based on a CNN.

In [178]:
#HYPERPARAMETERS
CNN_EMBD_SIZE = 10
num_of_filters = 32
filter_size = 3
dropout_rate = 0.2

cnn_model = keras.Sequential()
cnn_model.add(keras.layers.Embedding(VOCAB_SIZE,CNN_EMBD_SIZE))
cnn_model.add(keras.layers.Conv1D(num_of_filters,filter_size,activation='relu'))
cnn_model.add(keras.layers.GlobalAveragePooling1D())
cnn_model.add(keras.layers.Dropout(dropout_rate))
cnn_model.add(keras.layers.Dense(1,activation='sigmoid'))
cnn_model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
cnn_model.summary()

Model: "sequential_41"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_40 (Embedding)     (None, None, 10)          188720    
_________________________________________________________________
conv1d_32 (Conv1D)           (None, None, 32)          992       
_________________________________________________________________
global_average_pooling1d_24  (None, 32)                0         
_________________________________________________________________
dropout_30 (Dropout)         (None, 32)                0         
_________________________________________________________________
dense_40 (Dense)             (None, 1)                 33        
Total params: 189,745
Trainable params: 189,745
Non-trainable params: 0
_________________________________________________________________
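
As with the first model, the parameter counts can be reproduced from the layer sizes. This is a sketch with hypothetical helper names; the vocabulary size of 18,872 is inferred from the 188,720 embedding weights. The output-length formula assumes the Keras default 'valid' padding and a stride of 1 for Conv1D.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Each filter spans kernel_size positions across all input channels, plus a bias."""
    return kernel_size * in_channels * filters + filters

def conv1d_output_length(input_length, kernel_size):
    """Sequence length after a stride-1 convolution without padding."""
    return input_length - kernel_size + 1

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

CNN_EMBD_SIZE = 10
num_of_filters = 32
filter_size = 3
vocab_size = 188720 // CNN_EMBD_SIZE  # inferred from the summary, i.e. 18872

emb = vocab_size * CNN_EMBD_SIZE                                  # 188720
conv = conv1d_params(filter_size, CNN_EMBD_SIZE, num_of_filters)  # 3 * 10 * 32 + 32 = 992
dense = dense_params(num_of_filters, 1)                           # 33
print(emb, conv, dense, emb + conv + dense)
# 188720 992 33 189745
print(conv1d_output_length(30, filter_size))
# 28
```

With tweets padded to 30 tokens, the convolution produces 28 positions of 32 features each, and global average pooling reduces them to a single 32-dimensional vector per tweet.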


In [181]:
history = cnn_model.fit(train_data,train_labels,batch_size=50,
                    epochs=3,verbose=1,validation_data=(val_set,val_labels))

cnn_model.evaluate(processed_test_data,test_labels,batch_size=50)

Train on 10000 samples, validate on 3240 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


[0.5626795350812203, 0.7930232286453247]