# Hybrid Model

The hybrid model combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory network (BiLSTM). The convolutional layers are stacked in front of the BiLSTM layer to form a single CNN-BiLSTM model.

## 1. Import Libraries
The libraries needed for preprocessing, training and evaluation are imported in this step.

In [1]:
import pandas as pd
import numpy as np
import re
import gensim
import contractions
import string
import nltk
import keras
from cleantext import clean
from gensim.models import word2vec
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import TreebankWordDetokenizer
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from keras import regularizers
from keras.layers import Embedding
from keras.models import Sequential
from keras import layers
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from sklearn.metrics import classification_report

## 2. Read Training Dataset
The training dataset was downloaded from Kaggle and contains around 27k tweets, each labeled with its sentiment.

In [2]:
# read training dataset
train = pd.read_csv("train.csv", encoding = "unicode_escape")
print("Initial: ")
print(train.head())

Initial: 
       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
4  358bd9e861   Sons of ****, why couldn`t they put them on t...   

                         selected_text sentiment Time of Tweet Age of User  \
0  I`d have responded, if I were going   neutral       morning        0-20   
1                             Sooo SAD  negative          noon       21-30   
2                          bullying me  negative         night       31-45   
3                       leave me alone  negative       morning       46-60   
4                        Sons of ****,  negative          noon       60-70   

       Country  Population -2020  Land Area (Km²)  Density (P/Km²)  
0  Afghanistan          38928346         65

## 3. Drop Rows with Missing Values
Rows that contain missing values are dropped to clean the dataset.

In [3]:
# drop rows with null values
train = train.dropna(axis = 0)

## 4. Preprocess Texts
The steps used to preprocess texts are as follows: <br>
a. Remove Links <br>
b. Remove Emails <br>
c. Remove New Line Characters <br>
d. Remove Numbers <br>
e. Remove Emojis <br>
f. Remove Punctuations and Accents <br>
g. Remove Irrelevant Words (length shorter than 2 OR longer than 15) <br>
h. Remove Stopwords <br>
i. Tokenization <br>
j. Part-of-Speech (POS) Tagging <br>
k. Lemmatization <br>
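
Steps a to g can be illustrated with the standard library alone. The sketch below is only an approximation of what the notebook does with `re` and `gensim.utils.simple_preprocess`: the helper name `clean_text` and the ASCII-only token pattern are assumptions made for this example, and accents are not stripped here.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S*@\S*\s?")
TOKEN_RE = re.compile(r"[a-z]+")  # assumption: ASCII letters only


def clean_text(text, min_len=2, max_len=15):
    """Hypothetical helper covering steps a-g without third-party libraries."""
    text = URL_RE.sub("", text)        # a. links
    text = EMAIL_RE.sub("", text)      # b. emails
    text = re.sub(r"\s+", " ", text)   # c. new lines and repeated whitespace
    # d-f. keeping only runs of letters drops digits, emojis and punctuation
    tokens = TOKEN_RE.findall(text.lower())
    # g. keep tokens whose length lies in [min_len, max_len]
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(clean_text("Check https://example.com NOW!!! mail me@site.org\n2 days a"))
# ['check', 'now', 'mail', 'days']
```

Note that the length filter is inclusive at both ends, so a two-letter token such as "go" survives.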

In [4]:
stop_words = set(stopwords.words('english'))
def get_wordnet_pos(treebank_pos):
    if treebank_pos.startswith('J'):
        return wordnet.ADJ
    elif treebank_pos.startswith('V'):
        return wordnet.VERB
    elif treebank_pos.startswith('N'):
        return wordnet.NOUN
    elif treebank_pos.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def depure_data(data):
    # Remove URL
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    data = url_pattern.sub(r'', data)

    # Remove Emails
    data = re.sub(r'\S*@\S*\s?', '', data)

    # Remove new line characters
    data = re.sub(r'\s+', ' ', data)
    
    # Remove distracting single quotes
    data = re.sub("\'", "", data)
        
    return data

def sent_to_words(sentences):
    for sentence in sentences:
        # 1. Converts to lowercase
        # 2. Removes accents and punctuations
        # 3. Removes tokens shorter than 2 or longer than 15 characters (default min_len=2, max_len=15)
        # 4. Removes numerical characters
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
        
def detokenize(text):
    return TreebankWordDetokenizer().detokenize(text)

def stopwords_lemmatize(data):
    # Remove stopwords
    data = " ".join([word for word in str(data).split() if word not in stop_words])
    
    # Apply tokenization
    data = nltk.word_tokenize(data)
    
    # Perform POS tagging
    data = nltk.pos_tag(data)
    
    # Perform lemmatization
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_tokens = [
        lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in data
    ]
    
    data = " ".join(lemmatized_tokens)
    
    return data

temp = []
# Convert the values into a list
data_to_list = train['text'].values.tolist()
for i in range(len(data_to_list)):
    temp.append(depure_data(data_to_list[i]))
data_words = list(sent_to_words(temp))
detokenized = []
for i in range(len(data_words)):
    detokenized.append(detokenize(data_words[i]))
data = []
for i in range(len(detokenized)):
    data.append(stopwords_lemmatize(detokenized[i]))
print(data[:5])

['respond go', 'sooo sad miss san diego', 'bos bully', 'interview leave alone', 'son put release already buy']


## 5. Label Encoding
The labels are mapped to 1 (positive), 0 (neutral) and -1 (negative) and then one-hot encoded. Because the code -1 selects the last column, the resulting column order is: <br>
[neutral, positive, negative]
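
The column order follows from negative indexing: the code -1 lands in the last position. The sketch below reproduces this in plain Python and compares it with an explicit label-to-column lookup. The helper `one_hot` and the constant `LABELS` are names introduced for this illustration only and are not part of the notebook.

```python
LABELS = ["neutral", "positive", "negative"]  # column order shown above


def one_hot(label, labels=LABELS):
    """Return a one-hot list for `label` using an explicit column order."""
    row = [0.0] * len(labels)
    row[labels.index(label)] = 1.0
    return row


# Integer codes as used in the notebook
codes = {"positive": 1, "neutral": 0, "negative": -1}
for name, code in codes.items():
    by_code = [0.0, 0.0, 0.0]
    by_code[code] = 1.0  # index -1 wraps around to the last column
    assert by_code == one_hot(name)

print(one_hot("negative"))  # [0.0, 0.0, 1.0]
```

An explicit lookup like this makes the column order independent of how a particular library treats negative class indices.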

In [5]:
train['sentiment'] = train['sentiment'].map({"positive": 1, "neutral": 0, "negative": -1})
y = train['sentiment'].to_numpy()
y = keras.utils.to_categorical(y, 3)
print(y[:5])

[[1. 0. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]


## 6. Feature Extraction
The feature vectors are obtained by converting each text into a sequence of integer word indices. <br>
The tokenizer vocabulary is capped at 5000 word indices and every sequence has a fixed length of 200. Texts shorter than this length are padded with 0s at the front.
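
To make the indexing and padding concrete, here is a simplified pure-Python stand-in for the two steps. It is not the Keras implementation: `fit_word_index` and `to_padded_sequences` are hypothetical helpers, and they assume whitespace-separated tokens, frequency-ranked indices starting at 1, and left padding with truncation from the front.

```python
from collections import Counter


def fit_word_index(texts, max_words):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common(max_words - 1)]
    return {word: i + 1 for i, word in enumerate(ranked)}


def to_padded_sequences(texts, word_index, max_len):
    """Map words to indices, then left-pad each row with zeros to max_len."""
    rows = []
    for text in texts:
        seq = [word_index[w] for w in text.split() if w in word_index]
        seq = seq[-max_len:]                           # keep the last max_len tokens
        rows.append([0] * (max_len - len(seq)) + seq)  # pad on the left
    return rows


docs = ["sooo sad miss san diego", "sad day", "miss sad"]
index = fit_word_index(docs, max_words=5000)
for row in to_padded_sequences(docs, index, max_len=6):
    print(row)
```

The most frequent word receives index 1, which is why small indices dominate the padded matrix printed by the notebook.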

In [6]:
max_words = 5000
max_len = 200

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)
tweets = pad_sequences(sequences, maxlen=max_len)
print(tweets)

[[   0    0    0 ...    0 1208    2]
 [   0    0    0 ...   19 1209 1898]
 [   0    0    0 ...    0 1162 3767]
 ...
 [   0    0    0 ...  434  668 2348]
 [   0    0    0 ...    0    0  523]
 [   0    0    0 ...  460  143  312]]


## 7. Train Validation Split
The training dataset is split into a training set (80%) and a validation set (20%).

In [7]:
X_train, X_test, y_train, y_test = train_test_split(tweets, y, test_size = 0.2, random_state = 0)

## 8. Model Training
This notebook trains three models: the two individual models (CNN and BiLSTM) and the hybrid CNN-BiLSTM model. After training, each model is saved to a .h5 file so that it can be loaded for prediction without retraining.
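
The output shapes and parameter counts printed by `model.summary()` in the cells below can be checked by hand. The following sketch uses the standard formulas for embedding, 1D convolution, LSTM and dense layers. The function names are made up for this example, and the layer sizes are taken from the model definitions in this section.

```python
def embedding_params(vocab, dim):
    return vocab * dim


def conv1d_params(kernel, in_channels, filters):
    return kernel * in_channels * filters + filters


def bilstm_params(units, in_dim):
    # 4 gates, each with input weights, recurrent weights and a bias;
    # doubled because the layer runs in both directions
    return 2 * 4 * (units * (in_dim + units) + units)


def dense_params(in_dim, units):
    return in_dim * units + units


conv_len = 200 - 3 + 1      # kernel_size=3 without padding -> 198
pooled_len = conv_len // 2  # pool_size=2 -> 99

print(embedding_params(5000, 40))   # 200000
print(conv1d_params(3, 40, 32))     # 3872
print(bilstm_params(20, 40))        # 9760  (BiLSTM model)
print(bilstm_params(30, 32))        # 15120 (hybrid model)
print(dense_params(pooled_len * 32, 10))  # 31690 (CNN model, after Flatten)

bilstm_total = embedding_params(5000, 40) + bilstm_params(20, 40) + dense_params(40, 3)
print(bilstm_total)                 # 209883
```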

### a. BiLSTM

In [8]:
model0 = Sequential()
model0.add(layers.Embedding(max_words, 40, input_length=max_len))
model0.add(layers.Bidirectional(layers.LSTM(20,dropout=0.9)))
model0.add(layers.Dense(3, activation='softmax'))
model0.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model0.summary()

checkpoint0 = ModelCheckpoint("best_model0.hdf5", monitor='val_accuracy', verbose=1,save_best_only=True, mode='auto', period=1,save_weights_only=False)
history = model0.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test),callbacks=[checkpoint0])
model0.save('BiLSTM.h5')

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 200, 40)           200000    
                                                                 
 bidirectional_3 (Bidirectio  (None, 40)               9760      
 nal)                                                            
                                                                 
 dense_6 (Dense)             (None, 3)                 123       
                                                                 
Total params: 209,883
Trainable params: 209,883
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 1: val_accuracy improved from -inf to 0.64793, saving model to best_model0.hdf5
Epoch 2/10
Epoch 2: val_accuracy improved from 0.64793 to 0.68395, saving model to best_model0.hdf5
Epoch 3/10
Epoch 3: val_accuracy improved from 0.

### b. CNN

In [17]:
model1 = Sequential()
model1.add(layers.Embedding(max_words, 40, input_length=max_len))
model1.add(layers.Conv1D(filters=32, kernel_size=3, activation='relu', kernel_regularizer=regularizers.l2(l=0.01)))
model1.add(layers.MaxPooling1D(pool_size=2))
model1.add(layers.Dropout(0.9))
model1.add(layers.Flatten())
model1.add(layers.Dense(10, activation='relu', kernel_regularizer=regularizers.l2(l=0.01)))
model1.add(layers.Dense(3, activation='softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model1.summary()

history = model1.fit(X_train, y_train, epochs=8, batch_size=64, validation_data=(X_test, y_test))
model1.save('CNN.h5')

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 200, 40)           200000    
                                                                 
 conv1d (Conv1D)             (None, 198, 32)           3872      
                                                                 
 max_pooling1d (MaxPooling1D  (None, 99, 32)           0         
 )                                                               
                                                                 
 dropout (Dropout)           (None, 99, 32)            0         
                                                                 
 flatten (Flatten)           (None, 3168)              0         
                                                                 
 dense_6 (Dense)             (None, 10)                31690     
                                                      

### c. CNN-BiLSTM

In [5]:
model2 = Sequential()
model2.add(layers.Embedding(max_words, 40, input_length=max_len)) 
model2.add(layers.Conv1D(filters=32, kernel_size=3, activation='relu', kernel_regularizer=regularizers.l2(l=0.01)))
model2.add(layers.MaxPooling1D(pool_size=2))
model2.add(layers.Dropout(0.6))
model2.add(layers.Bidirectional(layers.LSTM(30,dropout=0.7)))
model2.add(layers.Dense(10, activation='relu', kernel_regularizer=regularizers.l2(l=0.01)))
model2.add(layers.Dense(3, activation='softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()

checkpoint2 = ModelCheckpoint("best_model2.hdf5", monitor='val_accuracy', verbose=1,save_best_only=True, mode='auto', period=1,save_weights_only=False)
history = model2.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test),callbacks=[checkpoint2])
model2.save('hybrid.h5')

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 200, 40)           200000    
                                                                 
 conv1d_2 (Conv1D)           (None, 198, 32)           3872      
                                                                 
 max_pooling1d_2 (MaxPooling  (None, 99, 32)           0         
 1D)                                                             
                                                                 
 dropout_2 (Dropout)         (None, 99, 32)            0         
                                                                 
 bidirectional_2 (Bidirectio  (None, 60)               15120     
 nal)                                                            
                                                                 
 dense_4 (Dense)             (None, 10)               

## 9. Model Evaluation
Each model is evaluated using accuracy, precision, recall and F1-score, all taken from the classification report.
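
One detail of the decoding step in the cells below is worth spelling out: the predicted probabilities are rounded before the argmax is taken. When no class reaches a probability above 0.5, every entry rounds to 0 and the first index (neutral) is returned. The sketch below imitates that behaviour in plain Python so it can run without the trained models. `decode_rounded` and `decode_argmax` are illustrative names, and the probability vectors are made up.

```python
SENTIMENT = ["neutral", "positive", "negative"]


def decode_rounded(probs):
    """Imitates np.around(probs).argmax(): round first, then take the first maximum."""
    rounded = [round(p) for p in probs]
    return SENTIMENT[rounded.index(max(rounded))]


def decode_argmax(probs):
    """Plain argmax over the raw probabilities."""
    return SENTIMENT[probs.index(max(probs))]


confident = [0.10, 0.80, 0.10]
uncertain = [0.40, 0.45, 0.15]

print(decode_rounded(confident), decode_argmax(confident))  # positive positive
print(decode_rounded(uncertain), decode_argmax(uncertain))  # neutral positive
```

The reports in this section were produced with the rounded variant, so low-confidence predictions are counted as neutral. Whether that is intended is not stated in the notebook.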

### a. BiLSTM

In [9]:
model0 = keras.models.load_model('BiLSTM.h5')

sentiment = ['neutral', 'positive', 'negative']

print(" BiLSTM MODEL ")
print("==============")

print(" TRAIN ACCURACY ")
print("================")
y_predTrain = model0.predict(X_train)
y_train_transform = []
for y in y_train:
    y_train_transform.append(sentiment[np.around(y, decimals = 0).argmax()])

predicted_train = []
for y in y_predTrain:
    predicted_train.append(sentiment[np.around(y, decimals=0).argmax()])
    
print(classification_report(y_train_transform, predicted_train))


print(" VALIDATION ACCURACY ")
print("=====================")
y_predTest = model0.predict(X_test)
y_test_transform = []
for y in y_test:
    y_test_transform.append(sentiment[np.around(y, decimals = 0).argmax()])

predicted_test = []
for y in y_predTest:
    predicted_test.append(sentiment[np.around(y, decimals=0).argmax()])
    
print(classification_report(y_test_transform, predicted_test))


print(" REVIEW TEST ACCURACY ")
print("======================")
reviewTest = pd.read_csv("iPhoneTest.csv")
X_reviewTest = reviewTest['text']
y_reviewTest = reviewTest['sentiment'].values.tolist()

X_reviewTest = X_reviewTest.values.tolist()
temp = []
for i in range(len(X_reviewTest)):
    temp.append(depure_data(X_reviewTest[i]))
data_words = list(sent_to_words(temp))
detokenized = []
for i in range(len(data_words)):
    detokenized.append(detokenize(data_words[i]))
data = []
for i in range(len(detokenized)):
    data.append(stopwords_lemmatize(detokenized[i]))
print(data[:5])
X_reviewTest = tokenizer.texts_to_sequences(data)
X_reviewTest = pad_sequences(X_reviewTest, maxlen=max_len)

y_predictReview = model0.predict(X_reviewTest)
predicted_sentiment_review = []
for y in y_predictReview:
    predicted_sentiment_review.append(sentiment[np.around(y, decimals=0).argmax()])
print(classification_report(y_reviewTest, predicted_sentiment_review))

print(" COVID TEST ACCURACY ")
print("=====================")
covidTest = pd.read_csv("covidTest.csv")
X_covidTest = covidTest['text']
y_covidTest = covidTest['sentiment'].values.tolist()

X_covidTest = X_covidTest.values.tolist()
temp = []
for i in range(len(X_covidTest)):
    temp.append(depure_data(X_covidTest[i]))
data_words = list(sent_to_words(temp))
detokenized = []
for i in range(len(data_words)):
    detokenized.append(detokenize(data_words[i]))
data = []
for i in range(len(detokenized)):
    data.append(stopwords_lemmatize(detokenized[i]))
print(data[:5])
X_covidTest = tokenizer.texts_to_sequences(data)
X_covidTest = pad_sequences(X_covidTest, maxlen=max_len)

y_predictCovid = model0.predict(X_covidTest)
predicted_sentiment_covid = []
for y in y_predictCovid:
    predicted_sentiment_covid.append(sentiment[np.around(y, decimals=0).argmax()]) 
print(classification_report(y_covidTest, predicted_sentiment_covid))

 BiLSTM MODEL 
 TRAIN ACCURACY 
              precision    recall  f1-score   support

    negative       0.84      0.65      0.73      6258
     neutral       0.67      0.86      0.76      8842
    positive       0.86      0.73      0.79      6884

    accuracy                           0.76     21984
   macro avg       0.79      0.75      0.76     21984
weighted avg       0.78      0.76      0.76     21984

 VALIDATION ACCURACY 
              precision    recall  f1-score   support

    negative       0.79      0.59      0.67      1523
     neutral       0.65      0.83      0.73      2275
    positive       0.82      0.70      0.75      1698

    accuracy                           0.72      5496
   macro avg       0.75      0.70      0.72      5496
weighted avg       0.74      0.72      0.72      5496

 REVIEW TEST ACCURACY 
['anyone iphone pro iphone', 'remember first time buy smart phone iphone show grandfather proud could first question able make rich try prove right time crypto',

### b. CNN

In [8]:
model1 = keras.models.load_model('CNN.h5')

sentiment = ['neutral', 'positive', 'negative']

print(" CNN MODEL ")
print("===========")

print(" TRAIN ACCURACY ")
print("================")
y_predTrain = model1.predict(X_train)
y_train_transform = []
for y in y_train:
    y_train_transform.append(sentiment[np.around(y, decimals = 0).argmax()])

predicted_train = []
for y in y_predTrain:
    predicted_train.append(sentiment[np.around(y, decimals=0).argmax()])
    
print(classification_report(y_train_transform, predicted_train))


print(" VALIDATION ACCURACY ")
print("=====================")
y_predTest = model1.predict(X_test)
y_test_transform = []
for y in y_test:
    y_test_transform.append(sentiment[np.around(y, decimals = 0).argmax()])

predicted_test = []
for y in y_predTest:
    predicted_test.append(sentiment[np.around(y, decimals=0).argmax()])
    
print(classification_report(y_test_transform, predicted_test))


print(" REVIEW TEST ACCURACY ")
print("======================")
reviewTest = pd.read_csv("iPhoneTest.csv")
X_reviewTest = reviewTest['text']
y_reviewTest = reviewTest['sentiment'].values.tolist()

X_reviewTest = X_reviewTest.values.tolist()
temp = []
for i in range(len(X_reviewTest)):
    temp.append(depure_data(X_reviewTest[i]))
data_words = list(sent_to_words(temp))
detokenized = []
for i in range(len(data_words)):
    detokenized.append(detokenize(data_words[i]))
data = []
for i in range(len(detokenized)):
    data.append(stopwords_lemmatize(detokenized[i]))
print(data[:5])
X_reviewTest = tokenizer.texts_to_sequences(data)
X_reviewTest = pad_sequences(X_reviewTest, maxlen=max_len)

y_predictReview = model1.predict(X_reviewTest)
predicted_sentiment_review = []
for y in y_predictReview:
    predicted_sentiment_review.append(sentiment[np.around(y, decimals=0).argmax()])
print(classification_report(y_reviewTest, predicted_sentiment_review))

print(" COVID TEST ACCURACY ")
print("=====================")
covidTest = pd.read_csv("covidTest.csv")
X_covidTest = covidTest['text']
y_covidTest = covidTest['sentiment'].values.tolist()

X_covidTest = X_covidTest.values.tolist()
temp = []
for i in range(len(X_covidTest)):
    temp.append(depure_data(X_covidTest[i]))
data_words = list(sent_to_words(temp))
detokenized = []
for i in range(len(data_words)):
    detokenized.append(detokenize(data_words[i]))
data = []
for i in range(len(detokenized)):
    data.append(stopwords_lemmatize(detokenized[i]))
print(data[:5])
X_covidTest = tokenizer.texts_to_sequences(data)
X_covidTest = pad_sequences(X_covidTest, maxlen=max_len)

y_predictCovid = model1.predict(X_covidTest)
predicted_sentiment_covid = []
for y in y_predictCovid:
    predicted_sentiment_covid.append(sentiment[np.around(y, decimals=0).argmax()]) 
print(classification_report(y_covidTest, predicted_sentiment_covid))

 CNN MODEL 
 TRAIN ACCURACY 
              precision    recall  f1-score   support

    negative       0.87      0.41      0.56      6258
     neutral       0.60      0.88      0.71      8842
    positive       0.85      0.74      0.79      6884

    accuracy                           0.70     21984
   macro avg       0.77      0.68      0.69     21984
weighted avg       0.75      0.70      0.69     21984

 VALIDATION ACCURACY 
              precision    recall  f1-score   support

    negative       0.82      0.35      0.49      1523
     neutral       0.57      0.84      0.68      2275
    positive       0.78      0.68      0.73      1698

    accuracy                           0.66      5496
   macro avg       0.72      0.63      0.63      5496
weighted avg       0.70      0.66      0.64      5496

 REVIEW TEST ACCURACY 
['anyone iphone pro iphone', 'remember first time buy smart phone iphone show grandfather proud could first question able make rich try prove right time crypto', 'i

### c. CNN-BiLSTM

In [3]:
model2 = keras.models.load_model('hybrid.h5')

sentiment = ['neutral', 'positive', 'negative']

print(" HYBRID MODEL ")
print("==============")

print(" TRAIN ACCURACY ")
print("================")
y_predTrain = model2.predict(X_train)
y_train_transform = []
for y in y_train:
    y_train_transform.append(sentiment[np.around(y, decimals = 0).argmax()])

predicted_train = []
for y in y_predTrain:
    predicted_train.append(sentiment[np.around(y, decimals=0).argmax()])
    
print(classification_report(y_train_transform, predicted_train))


print(" VALIDATION ACCURACY ")
print("=====================")
y_predTest = model2.predict(X_test)
y_test_transform = []
for y in y_test:
    y_test_transform.append(sentiment[np.around(y, decimals = 0).argmax()])

predicted_test = []
for y in y_predTest:
    predicted_test.append(sentiment[np.around(y, decimals=0).argmax()])
    
print(classification_report(y_test_transform, predicted_test))


print(" REVIEW TEST ACCURACY ")
print("======================")
reviewTest = pd.read_csv("iPhoneTest.csv")
X_reviewTest = reviewTest['text']
y_reviewTest = reviewTest['sentiment'].values.tolist()

X_reviewTest = X_reviewTest.values.tolist()
temp = []
for i in range(len(X_reviewTest)):
    temp.append(depure_data(X_reviewTest[i]))
data_words = list(sent_to_words(temp))
detokenized = []
for i in range(len(data_words)):
    detokenized.append(detokenize(data_words[i]))
data = []
for i in range(len(detokenized)):
    data.append(stopwords_lemmatize(detokenized[i]))
print(data[:5])
X_reviewTest = tokenizer.texts_to_sequences(data)
X_reviewTest = pad_sequences(X_reviewTest, maxlen=max_len)

y_predictReview = model2.predict(X_reviewTest)
predicted_sentiment_review = []
for y in y_predictReview:
    predicted_sentiment_review.append(sentiment[np.around(y, decimals=0).argmax()])
print(classification_report(y_reviewTest, predicted_sentiment_review))

print(" COVID TEST ACCURACY ")
print("=====================")
covidTest = pd.read_csv("covidTest.csv")
X_covidTest = covidTest['text']
y_covidTest = covidTest['sentiment'].values.tolist()

X_covidTest = X_covidTest.values.tolist()
temp = []
for i in range(len(X_covidTest)):
    temp.append(depure_data(X_covidTest[i]))
data_words = list(sent_to_words(temp))
detokenized = []
for i in range(len(data_words)):
    detokenized.append(detokenize(data_words[i]))
data = []
for i in range(len(detokenized)):
    data.append(stopwords_lemmatize(detokenized[i]))
print(data[:5])
X_covidTest = tokenizer.texts_to_sequences(data)
X_covidTest = pad_sequences(X_covidTest, maxlen=max_len)

y_predictCovid = model2.predict(X_covidTest)
predicted_sentiment_covid = []
for y in y_predictCovid:
    predicted_sentiment_covid.append(sentiment[np.around(y, decimals=0).argmax()]) 
print(classification_report(y_covidTest, predicted_sentiment_covid))

 HYBRID MODEL 
 TRAIN ACCURACY 
              precision    recall  f1-score   support

    negative       0.83      0.72      0.77      6258
     neutral       0.73      0.83      0.78      8842
    positive       0.87      0.82      0.84      6884

    accuracy                           0.79     21984
   macro avg       0.81      0.79      0.80     21984
weighted avg       0.80      0.79      0.80     21984

 VALIDATION ACCURACY 
              precision    recall  f1-score   support

    negative       0.73      0.62      0.67      1523
     neutral       0.66      0.75      0.70      2275
    positive       0.77      0.73      0.75      1698

    accuracy                           0.71      5496
   macro avg       0.72      0.70      0.71      5496
weighted avg       0.71      0.71      0.71      5496

 REVIEW TEST ACCURACY 
['anyone iphone pro iphone', 'remember first time buy smart phone iphone show grandfather proud could first question able make rich try prove right time crypto',