# Stacked Logistic Regression-UniDirectional LSTM and Ensemble Implementation for Yelp Dataset Multiclass Sentiment Analysis Classification

**Platforms utilised for testing/running this Jupyter notebook: Anaconda Jupyter Notebook and Google Colab (optimal for training deep learning models on a GPU)**

# Table of Contents 

* [Preliminary Step](#Preliminary Step)
    * [Mounting Google Drive](#Mounting Google Drive)
* [1. Introduction](#Introduction)
* [2. Load/Read in all required libraries](#libraries)
* [3. Load/read in all datasets](#Load/read in all datasets)
* [4. Exploratory Data Analysis (EDA)](#EDA)
* [5. Text Preprocessing and Auxiliary/Helper Functions](#Text Preprocessing Auxiliary/Helper Functions)
   * [5.1 Preprocessing Labeled Training Dataset](#Preprocessing Labeled Training Dataset)
* [6. Logistic Regression Model Implementation](#Logistic Regression Model Implementation)
   * [6.1 Fit Logistic Regression Model](#Fit Logistic Regression Model)
* [7. Creating Augmented Dataset (Labeled Data + Predicted Unlabeled Data) total 650k](#Creating Augmented Dataset Labeled Data + Predicted Unlabeled Data total 650k)
   * [7.1 Preprocessing Unlabeled Training Dataset](#Preprocessing Unlabeled Training Dataset)
   * [7.2 Unlabeled Training Dataset Feature Extraction](#Unlabeled Training Dataset Feature Extraction)
   * [7.3 Unlabeled Training Dataset Predictions](#Unlabeled Training Dataset Predictions)
   * [7.4 Creating Augmented Training Dataset Labeled Data + Predicted Unlabeled Data](#Creating Augmented Training Dataset Labeled Data + Predicted Unlabeled Data)
* [8. Unidirectional Long Short Term Memory (LSTM) Implementation](#Unidirectional Long Short Term Memory LSTM Implementation)
   * [8.1 Load/Read in preprocessed Augmented Dataset and Test Dataset](#Load/Read in preprocessed Augmented Dataset and Test Dataset)
   * [8.2 Tokenization for Unidirectional LSTM](#Tokenization for Undirectional LSTM)
   * [8.3 Augmented Training Dataset (Labeled Data + Predicted Unlabeled Data) Feature Extraction for ULSTM](#Augmented Training Dataset Labeled Data + Predicted Unlabeled Data Feature Extraction for ULSTM)
   * [8.4 Fit UniDirectional Long Short Term Memory (LSTM) Model on Augmented Dataset](#Fit UniDirectional LSTM Model on Augmented Dataset)
* [9. Augmented Dataset Recreation from predicting unlabeled dataset with LSTM model](#Augmented Dataset Recreation from predicting unlabeled dataset with LSTM model)
   * [9.1 Augmented Dataset Recreation by utilising Unidirectional LSTM Model to Predict Unlabeled Data](#Augmented Dataset Recreation by utilising Unidirectional LSTM Model to Predict Unlabeled Data)
      * [9.1.1 Export to CSV repredicted Augmented Dataset with original unpreprocessed 650K text column](#Export to CSV repredicted Augmented Dataset with original unpreprocessed 650K text column)
      * [9.1.2 Export to CSV repredicted Augmented Dataset with original preprocessed 650K text column](#Export to CSV repredicted Augmented Dataset with original preprocessed 650K text column)
* [10. Stacked Ensemble Model with implementation variants of LSTM, GRU and CNN Models](#Stacked Ensemble Model with implementation variants of LSTM, GRU and CNN Models)
   * [10.1 Load/Read in all required deep learning libraries and packages](#Load/Read in all required deep learning libraries and packages)
   * [10.2 Load/Read in Augmented Dataset without preprocessed reviews and Testing Dataset](#Load/Read in Augmented Dataset without preprocessed reviews and Testing Dataset)
   * [10.3 Tokenization](#Tokenization)
   * [10.4 FastText Word Embedding Matrix](#FastText Word Embedding Matrix)
      * [10.4.1 Download FastText](#Download FastText)
      * [10.4.2 Unzip FastText Word Embeddings](#Unzip FastText Word Embeddings)
      * [10.4.3 Constructing and Implementing FastText Word Embedding Matrix](#Constructing and Implementing FastText Word Embedding Matrix)
   * [10.5 One Hot Encode Training Dataset](#One Hot Encode Training Dataset)
   * [10.6 Ensemble Models Implementation](#Ensemble Models Implementation)
   * [10.7 Loading Optimal Model Implementations](#Loading Optimal Model Implementations)
* [11. Ensemble Testing Dataset Predictions](#Ensemble Testing Dataset Predictions)
* [12. Create and export output file predict_label.csv](#Create and export output file predict_label.csv)
* [References](#references)

## Preliminary Step 
<a id="Preliminary Step"></a>

### Mounting Google Drive
<a id="Mounting Google Drive"></a>

Because this Jupyter notebook, **Stacked Multi-Class Sentiment Classifier.ipynb**, is executed inside **Google Colab**, all required datasets are loaded from **Google Drive**. The following code block therefore mounts the user's Google Drive into the notebook so that the datasets and files can be read and written directly.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
ls /content/gdrive/'My Drive'/yelp-multiclass-datasets/yelp-multiclass-sentiment-dataset

labeled_data.csv
logistic_200KAugmented_dataset.csv
logistic_650KAugmented_dataset.csv
logistic_GloVeEmbed_PreProcSentences_650KAugmented_dataset.csv
logistic_originalSent_650KAugmented_dataset.csv
logistic_origSent_200KAugmented_dataset.csv
[0m[01;34mmodels[0m/
test_data.csv
ULMFit_model.pkl


## 1. Introduction 

## 2. Load/Read in all required libraries  
<a id="Load/Read in Libraries"></a>

In [0]:
import pandas as pd
import numpy as np
import nltk
import re
import string
import matplotlib.pyplot as plt 
from keras.preprocessing.text import Tokenizer 
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding
from keras.callbacks import EarlyStopping
from keras.layers import SpatialDropout1D
from keras.layers.core import Dense
from keras.layers.recurrent import LSTM
from keras import utils
from keras.models import load_model
from nltk import PorterStemmer

Using TensorFlow backend.


## 3. Load/Read in all datasets
<a id="Load/Read in all datasets"></a>

In [None]:
labeled_data = pd.read_csv("/content/gdrive/My Drive/labeled_data.csv") # training_dataset 50000 x 2 with columns [text, label]
unlabeled_data = pd.read_csv("/content/gdrive/My Drive/unlabeled_data.csv") # unlabeled training data 600000 with column [text]
test_data = pd.read_csv("/content/gdrive/My Drive/test_data.csv") # testing dataset 50000 x 2 with columns [test_id, text]

## 4. Exploratory Data Analysis (EDA)
<a id="Exploratory Data Analysis"></a>

In [0]:
labeled_data.head()

Unnamed: 0,text,label
0,The new rule is - \r\nif you are waiting for a...,4
1,"Flirted with giving this two stars, but that's...",3
2,I was staying at planet Hollywood across the s...,5
3,Food is good but prices are super expensive. ...,2
4,Worse company to deal with they do horrible wo...,1


In [0]:
#polarity
labeled_data.label.value_counts() / labeled_data.shape[0]

1    0.20244
5    0.20036
2    0.19916
4    0.19912
3    0.19892
Name: label, dtype: float64

## 5. Text Preprocessing and Auxiliary/Helper Functions
<a id="Text Preprocessing and Auxiliary/Helper Functions"></a>

In [0]:
ps = PorterStemmer()

def process_text(text):
    """Strip punctuation, stem and lowercase a review; returns a list of tokens."""
    # characters to blank out: all punctuation plus line breaks
    punctuation = string.punctuation + '\n\r'
    punc_replace = ' ' * len(punctuation)
    doco_clean = text.replace('-', ' ')
    # replace a non-word character followed by spaces with a single space
    doco_alphas = re.sub(r'\W +', ' ', doco_clean)
    trans_table = str.maketrans(punctuation, punc_replace)
    doco_clean = ' '.join([word.translate(trans_table) for word in doco_alphas.split(' ')])
    doco_clean = doco_clean.split(' ')
    doco_clean = [ps.stem(word) for word in doco_clean]  # Porter stemming
    doco_clean = [word.lower() for word in doco_clean if len(word) > 0]
    return doco_clean

In [0]:
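def tokenize(text):
    # parameter renamed from `string` so it no longer shadows the imported `string` module
    # keep only non-empty tokens that contain no whitespace
    res = [word for word in nltk.word_tokenize(text) if word and not re.search(pattern=r"\s+", string=word)]
    return res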
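
To make the effect of `process_text` concrete, here is a minimal sketch of its punctuation-stripping and lowercasing steps on a single made-up review. It deliberately leaves out the Porter stemming step so that it runs without NLTK, and `strip_punctuation` is a hypothetical helper name rather than a function used elsewhere in this notebook.

```python
import re
import string

def strip_punctuation(text):
    """Simplified stand-in for process_text: same cleaning rules, no stemming (assumption)."""
    punctuation = string.punctuation + '\n\r'
    # Map every punctuation character and line break to a space
    trans_table = str.maketrans(punctuation, ' ' * len(punctuation))
    text = re.sub(r'\W +', ' ', text.replace('-', ' '))
    tokens = text.translate(trans_table).split(' ')
    # Drop the empty tokens left behind by consecutive spaces, lowercase the rest
    return [tok.lower() for tok in tokens if tok]

sample = "Food is good, but prices are super-expensive!"
print(strip_punctuation(sample))
# ['food', 'is', 'good', 'but', 'prices', 'are', 'super', 'expensive']
```

With stemming added back in, tokens such as `prices` and `expensive` would be reduced further (compare `price` and `expens` in the preprocessed output of Section 5.1).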

### 5.1 Preprocessing Labeled Training Dataset 
<a id="Preprocessing Labeled Training Dataset"></a>

In [0]:
# collect the review text (first column) as a plain Python list
labeled_reviews = labeled_data.iloc[:, 0].tolist()

In [0]:
cleaned_review_text = [process_text(review) for review in labeled_reviews]
cleaned_labeled_sentences = [' '.join(r) for r in cleaned_review_text]

labeled_data['text'] = cleaned_labeled_sentences

In [0]:
labeled_data.head()

Unnamed: 0,text,label
0,the new rule is if you are wait for a tabl whi...,4
1,flirt with give thi two star but that s a pret...,3
2,i wa stay at planet hollywood across the stree...,5
3,food is good but price are super expens 8 buck...,2
4,wors compani to deal with they do horribl work...,1


## 6. Logistic Regression Model Implementation
<a id="Logistic Regression Model Implementation"></a>

In [0]:
from sklearn.model_selection import train_test_split

train_df, valid_df = train_test_split(labeled_data, test_size=0.162, random_state = 1412)


train_df = train_df.reset_index(drop=True)
valid_df = valid_df.reset_index(drop=True)

In [0]:
valid_df.head()

Unnamed: 0,text,label
0,i got hook with the great workout but after th...,1
1,serious terribl custom servic if you need actu...,1
2,i have tri thi place twice i like it veri much...,5
3,some of the worst pizza i ve ever had we use a...,1
4,i don t know how the commiss work at the att s...,1


In [0]:
#dependent variables 
# run this code block if there's split for training dataset and validation dataset
y_train = train_df['label']
y_valid = valid_df['label']

In [0]:
#text features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tfidf = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1,3), min_df=20, sublinear_tf=True)
tfidf_fit = tfidf.fit(labeled_data['text'])
text_train = tfidf_fit.transform(train_df['text'])
text_valid = tfidf_fit.transform(valid_df['text'])
text_train.shape, text_valid.shape

((41900, 66810), (8100, 66810))
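
The vectorizer settings above can be read as follows: `ngram_range=(1,3)` builds features from unigrams, bigrams and trigrams, `min_df=20` drops terms that occur in fewer than 20 reviews, and `sublinear_tf=True` replaces the raw term count `tf` with `1 + log(tf)`. The sketch below illustrates the n-gram expansion and the dampened term frequency in plain Python; `ngrams` and `sublinear_tf` are hypothetical helper names, and the sketch is not how scikit-learn implements these steps internally.

```python
import math

def ngrams(tokens, n_min=1, n_max=3):
    """List all word n-grams from n_min to n_max words long."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

def sublinear_tf(count):
    """Dampened term frequency: 1 + ln(count) for count > 0, otherwise 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

print(ngrams(["food", "is", "good"]))
# ['food', 'is', 'good', 'food is', 'is good', 'food is good']

for count in (1, 2, 10, 100):
    print(count, round(sublinear_tf(count), 3))
```

The dampening means that a term repeated 100 times in one review counts for only a few times more than a term that appears once, rather than 100 times more.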

### 6.1 Fit Logistic Regression Model 
<a id="Fit Logistic Regression Model "></a>

In [0]:
#fit logistic regression models
model = LogisticRegression(C=2., penalty='l2', solver='liblinear', dual=False, multi_class='ovr') # liblinear only supports one-vs-rest (ovr)
model.fit(text_train,y_train)

LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [0]:
# run if there's validation dataset
model.score(text_valid,y_valid)

0.6175308641975309

## 7. Creating Augmented Dataset (Labeled Data + Predicted Unlabeled Data) total 650k 
<a id="Creating Augmented Dataset Labeled Data + Predicted Unlabeled Data total 650k"></a>

### 7.1 Preprocessing Unlabeled Training Dataset 
<a id="Preprocessing Unlabeled Training Dataset "></a>

In [0]:
# subset of the unlabeled dataset that the logistic regression model will pseudo-label
test_df = unlabeled_data[0:500001].copy()  # copy so the 'text' column can be overwritten below

In [0]:
# collect the review text (first column) as a plain Python list
unlabeled_reviews = test_df.iloc[:, 0].tolist()

In [0]:
cleaned_unlabeled_review_text = [process_text(review) for review in unlabeled_reviews]
cleaned_unlabeled_sentences = [' '.join(r) for r in cleaned_unlabeled_review_text]

In [0]:
test_df['text'] = cleaned_unlabeled_sentences

In [0]:
test_df.head()

### 7.2 Unlabeled Training Dataset Feature Extraction 
<a id="Unlabeled Training Dataset Feature Extraction"></a>

In [0]:
#text features
tfidf = TfidfVectorizer(tokenizer=tokenize, ngram_range=(1,3), min_df=20, sublinear_tf=True)
tfidf_fit = tfidf.fit(labeled_data['text'])

In [0]:
unlabeled_data_Xtrain = tfidf_fit.transform(test_df['text'])
unlabeled_data_Xtrain.shape

### 7.3 Unlabeled Training Dataset Predictions
<a id="Unlabeled Training Dataset Predictions"></a>

In [0]:
unlabeled_preds = model.predict(unlabeled_data_Xtrain)

unlabeled_predicted_df = pd.DataFrame({"text" : test_df['text'],
                                       "label" : unlabeled_preds})

### 7.4 Creating Augmented Training Dataset (Labeled Data + Predicted Unlabeled Data)
<a id="Creating Augmented Training Dataset Labeled Data + Predicted Unlabeled Data"></a>

In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows in the same way
augmented_training_data = pd.concat([labeled_data, unlabeled_predicted_df])

In [0]:
# reset dataframe index to start from 0 
augmented_train_df = augmented_training_data.reset_index(drop=True)
augmented_train_df.shape

In [0]:
augmented_train_df.head()
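
Sections 7.1 to 7.4 amount to pseudo-labeling: a model trained on the labeled reviews predicts a label for every unlabeled review, and the two sets are then stacked into one training set. The toy sketch below shows that flow with made-up data. `toy_predict` is a hypothetical keyword rule standing in for the fitted logistic regression model, purely so that the example runs on its own.

```python
# Toy illustration of pseudo-labeling; the reviews and the rule-based "model" are made up.
labeled = [("great food", 5), ("terrible service", 1)]
unlabeled = ["great staff", "terrible wait", "okay place"]

def toy_predict(text):
    """Hypothetical stand-in for model.predict: a keyword rule, not logistic regression."""
    if "great" in text:
        return 5
    if "terrible" in text:
        return 1
    return 3

# Step 1: predict a label for every unlabeled review
pseudo_labeled = [(text, toy_predict(text)) for text in unlabeled]

# Step 2: stack labeled and pseudo-labeled rows into one augmented training set
augmented = labeled + pseudo_labeled

print(len(augmented))   # 5
print(augmented[-1])    # ('okay place', 3)
```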

### 7.5 Exporting Augmented Training Dataset to CSV File

In [None]:
augmented_train_df.to_csv(
    '/content/gdrive/My Drive/yelp-multiclass-datasets/yelp-multiclass-sentiment-dataset/logistic_650KAugmented_dataset.csv', 
    index=False, header=True)

## 8. Unidirectional Long Short Term Memory (LSTM) Implementation 
<a id="Unidirectional Long Short Term Memory (LSTM) Implementation"></a>

### 8.1 Load/Read in preprocessed Augmented Dataset and Test Dataset 
<a id="Load/Read in preprocessed Augmented Dataset and Test Dataset"></a>

In [None]:
train_data = pd.read_csv(
    "/content/gdrive/My Drive/yelp-multiclass-datasets/yelp-multiclass-sentiment-dataset/logistic_650KAugmented_dataset.csv")

test_data = pd.read_csv(
    "/content/gdrive/My Drive/yelp-multiclass-datasets/yelp-multiclass-sentiment-dataset/test_data.csv")

In [None]:
# drop all NaN rows
augmented_training_data = train_data.dropna()

# check that all NaN is dropped
augmented_training_data[augmented_training_data.isna().any(axis=1)]

### 8.2 Tokenization for Unidirectional LSTM
<a id="Tokenization for Undirectional LSTM"></a>

In [0]:
augmented_clean_sentences = list(augmented_training_data.text)

In [0]:
# Use a Keras Tokenizer and fit it on all cleaned sentences of the augmented dataset

tokenizer = Tokenizer()
tokenizer.fit_on_texts(augmented_clean_sentences)
# the sequences have different lengths, so store them as an object array
text_sequences = np.array(tokenizer.texts_to_sequences(augmented_clean_sentences), dtype=object)
sequence_dict = tokenizer.word_index
# reverse lookup: integer index -> word
word_dict = dict((num, val) for (val, num) in sequence_dict.items())

In [0]:
# Vocabulary size for the embedding layer: number of unique tokens + 1 (index 0 is reserved) = 144522
MAX_NB_WORDS = len(tokenizer.word_index) + 1

# Maximum number of tokens kept per review.
MAX_SEQUENCE_LENGTH = 250
# Dimensionality of the embedding vectors.
EMBEDDING_DIM = 300
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='', lower=False)
tokenizer.fit_on_texts(augmented_clean_sentences)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 144521 unique tokens.


### 8.3 Augmented Training Dataset (Labeled Data + Predicted Unlabeled Data) Feature Extraction for ULSTM
<a id="Augmented Training Dataset Labeled Data Predicted Unlabeled Data Feature Extraction for ULSTM"></a>

In [0]:
X = tokenizer.texts_to_sequences(augmented_clean_sentences)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (649988, 250)
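
`pad_sequences` is what turns the variable-length token lists into the fixed-width `(649988, 250)` matrix. With its default arguments, as used above, Keras pads short sequences with zeros at the front and truncates long sequences from the front, keeping the last `maxlen` tokens. The pure-Python function below, with the hypothetical name `pad_pre`, is a small sketch of that default behaviour and not the Keras implementation itself.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of long sequences."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded

print(pad_pre([[7, 8], [1, 2, 3, 4, 5]], maxlen=4))
# [[0, 0, 7, 8], [2, 3, 4, 5]]
```

For reviews longer than 250 tokens this means the end of the review is what the LSTM sees.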


In [0]:
Y = pd.get_dummies(augmented_training_data['label']).values
print('Shape of label tensor:', Y.shape)

Shape of label tensor: (649988, 5)


### 8.4 Fit UniDirectional Long Short Term Memory (LSTM) Model on Augmented Dataset 
<a id="Fit UniDirectional LSTM Model on Augmented Dataset"></a>

In [0]:
from keras.layers import CuDNNLSTM
from keras.layers import CuDNNGRU
from keras.layers import Dropout
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints

In [0]:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, 300, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(CuDNNLSTM(150, return_sequences=True))
model.add(CuDNNLSTM(150))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())

epochs = 3
batch_size = 32

history = model.fit(X, Y, epochs=epochs, 
                    batch_size=batch_size, validation_split=0.02,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=1, min_delta=0.0001)])

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 250, 300)          43356600  
_________________________________________________________________
spatial_dropout1d_5 (Spatial (None, 250, 300)          0         
_________________________________________________________________
cu_dnnlstm_9 (CuDNNLSTM)     (None, 250, 150)          271200    
_________________________________________________________________
cu_dnnlstm_10 (CuDNNLSTM)    (None, 150)               181200    
_________________________________________________________________
dense_5 (Dense)              (None, 5)                 755       
Total params: 43,809,755
Trainable params: 43,809,755
Non-trainable params: 0
_________________________________________________________________
None
Train on 636988 samples, validate on 13000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
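
The parameter counts in the summary above can be checked by hand. The sketch below assumes the usual CuDNN LSTM layout of four gates, each with an input kernel, a recurrent kernel and two bias vectors; the helper function names are hypothetical and exist only for this check.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim

def cudnn_lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

counts = [
    embedding_params(144522, 300),   # 43,356,600
    cudnn_lstm_params(300, 150),     # 271,200
    cudnn_lstm_params(150, 150),     # 181,200
    dense_params(150, 5),            # 755
]
print(counts, sum(counts))           # total: 43,809,755
```

Almost all of the trainable parameters sit in the embedding layer, which is learned from scratch in this model.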


In [0]:
# save model 
model.save('Unidirection LSTM.h5')

In [0]:
# load model 
model = load_model('Unidirection LSTM.h5')

## 9. Augmented Dataset Recreation from predicting unlabeled dataset with LSTM model  
<a id="Augmented Dataset Recreation from predicting unlabeled dataset with LSTM model  "></a>

### 9.1 Augmented Dataset Recreation by utilising Unidirectional LSTM Model to Predict Unlabeled Data
<a id="Augmented Dataset Recreation by utilising Unidirectional LSTM Model to Predict Unlabeled Data"></a>

In [0]:
### Repredict the unlabeled dataset and recreate the augmented dataset with LSTM model which has increased accuracy of ~62%

trainn_data = pd.read_csv(
    "/content/gdrive/My Drive/yelp-multiclass-datasets/yelp-multiclass-sentiment-dataset/logistic_650KAugmented_dataset.csv")

preproc_label_data = trainn_data.loc[0:49999]

predicted_unlabel_data = trainn_data.loc[50000:].fillna('')

predicted_clean_unlabelsentences = list(predicted_unlabel_data.text)

X_test = tokenizer.texts_to_sequences(predicted_clean_unlabelsentences)
padded_X = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', padded_X.shape)

repredict_unlabeled = model.predict(padded_X, verbose=2)

# output label classes (polarity levels)
labels = [1, 2, 3, 4, 5]

predicted_labels = []

for prediction in repredict_unlabeled:
    predicted_labels.append(labels[np.argmax(prediction)])
    
repredict_unlabel_data = pd.DataFrame({'text' : predicted_unlabel_data['text'],
                                       'label' : predicted_labels})

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows in the same way
repredicted_augment_data = pd.concat([preproc_label_data, repredict_unlabel_data])

Shape of data tensor: (600000, 250)


#### 9.1.1 Export to CSV repredicted Augmented Dataset with original unpreprocessed 650K text column
<a id="Export to CSV repredicted Augmented Dataset with original unpreprocessed 650K text column"></a>

In [None]:
original_sent_data = pd.read_csv(
    "/content/gdrive/My Drive/yelp-multiclass-datasets/yelp-multiclass-sentiment-dataset/logistic_originalSent_650KAugmented_dataset.csv")

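# start from the re-predicted augmented data and swap in the original (unpreprocessed) review text
repredicted_augment_df = repredicted_augment_data.reset_index(drop=True)
del repredicted_augment_df['text']

repredicted_augment_df['text'] = list(original_sent_data.text)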

repredicted_augment_df.to_csv(
    '/content/gdrive/My Drive/650K_LSTM_Repredicted_originalSent_Augment_Dataset.csv', index=False, header=True)

#### 9.1.2 Export to CSV repredicted Augmented Dataset with original preprocessed 650K text column
<a id="Export to CSV repredicted Augmented Dataset with original preprocessed 650K text column"></a>

In [None]:
repredicted_augment_df = repredicted_augment_data.reset_index(drop=True) 

repredicted_augment_df.to_csv('/content/gdrive/My Drive/650K LSTM Repredicted Augment Dataset.csv', index=False, header=True)

## 10. Stacked Ensemble Model with implementation variants of LSTM, GRU and CNN Models
<a id="Stacked Ensemble Model with implementation variants of LSTM, GRU and CNN Models"></a>

### 10.1 Load/Read in all required deep learning libraries and packages 
<a id="Load/Read in all required deep learning libraries and packages"></a>

In [None]:
# Deep Learning Libraries 
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Conv1D, GRU, CuDNNGRU, CuDNNLSTM
from keras.layers import BatchNormalization
from keras.layers import Bidirectional, GlobalMaxPool1D, MaxPooling1D, Add, Flatten
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D
from keras.models import Model, load_model
from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
from keras import backend as K
from keras.engine import InputSpec, Layer
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, TensorBoard, Callback, EarlyStopping

### 10.2 Load/Read in Augmented Dataset without preprocessed reviews and Testing Dataset
<a id="Load/Read in Augmented Dataset without preprocessed reviews and Testing Dataset"></a>

In [None]:
# read in augmented dataset with unpreprocessed sentence column for training
train = pd.read_csv(
'/content/gdrive/My Drive/yelp-multiclass-sentiment-dataset/650K_LSTM_Repredicted_originalSent_Augment_Dataset.csv', sep=",")

# read in testing dataset 
test = pd.read_csv(
    '/content/gdrive/My Drive/yelp-multiclass-sentiment-dataset/test_data.csv', sep=",")

In [None]:
# check for NaN values, as some reviews are empty, and remove them because they would affect tokenization
train = train.dropna()

### 10.3 Tokenization 
<a id="Tokenization"></a>

In [None]:
full_text = list(train['text'].values)
y = train['label']

In [None]:
tk = Tokenizer(lower = True, filters='')
tk.fit_on_texts(full_text)

In [None]:
train_tokenized = tk.texts_to_sequences(train['text'])
test_tokenized = tk.texts_to_sequences(test['text'])

In [None]:
max_len = 250
X_train = pad_sequences(train_tokenized, maxlen = max_len)
X_test = pad_sequences(test_tokenized, maxlen = max_len)

### 10.4 FastText Word Embedding Matrix
<a id="FastText Word Embedding Matrix"></a>

#### 10.4.1 Download FastText
<a id="Download FastText"></a>

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip

#### 10.4.2 Unzip FastText Word Embeddings
<a id="Unzip FastText Word Embeddings"></a>

In [None]:
import zipfile

# the context manager closes the archive even if extraction fails
with zipfile.ZipFile('crawl-300d-2M.vec.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/gdrive/My Drive/FastText Embeddings/')

In [None]:
embedding_path = "/content/gdrive/My Drive/FastText Embeddings/crawl-300d-2M.vec"

In [None]:
embed_size = 300
max_features = len(tk.word_index) + 1

#### 10.4.3 Constructing and Implementing FastText Word Embedding Matrix
<a id="Constructing and Implementing FastText Word Embedding Matrix"></a>

In [None]:
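def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# read the FastText vectors into a {word: vector} dictionary and close the file afterwards
with open(embedding_path, encoding='utf-8') as f:
    embedding_index = dict(get_coefs(*o.strip().split(" ")) for o in f)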

word_index = tk.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.zeros((nb_words + 1, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
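
Words that are missing from the pretrained vectors keep their all-zero row in `embedding_matrix`, so it can be useful to check how much of the vocabulary is actually covered. The sketch below computes that share on toy dictionaries standing in for `tk.word_index` and `embedding_index`; `vocab_coverage` is a hypothetical helper and the toy values are made up.

```python
def vocab_coverage(word_index, embedding_index):
    """Return the fraction of vocabulary words that have a pretrained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in embedding_index)
    return found / len(word_index)

# Toy data standing in for tk.word_index and embedding_index
toy_word_index = {"good": 1, "pizza": 2, "yummmy": 3, "service": 4}
toy_embedding_index = {"good": [0.1], "pizza": [0.2], "service": [0.3]}

print(vocab_coverage(toy_word_index, toy_embedding_index))  # 0.75
```

In the notebook the same call would be `vocab_coverage(tk.word_index, embedding_index)`, assuming both objects are still in memory.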

### 10.5 One Hot Encode Training Dataset 
<a id="One Hot Encode Training Dataset"></a>

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
y_ohe = ohe.fit_transform(y.values.reshape(-1, 1))
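
One-hot encoding turns each star rating into a five-element indicator vector, which is the target format that `categorical_crossentropy` expects. The following pure-Python sketch shows the mapping on three made-up labels; `one_hot` is a hypothetical helper used for illustration and is not a replacement for `OneHotEncoder`.

```python
def one_hot(label_values, classes):
    """Encode each label as an indicator row over the sorted list of classes."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in label_values:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return rows

print(one_hot([4, 1, 5], classes=[1, 2, 3, 4, 5]))
# [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0], [0, 0, 0, 0, 1]]
```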

### 10.6 Ensemble Models Implementation
<a id="Ensemble Models Implementation"></a>

In [None]:
def build_model1(lr=0.0, lr_d=0.0, units=0, spatial_dr=0.0, kernel_size1=3, kernel_size2=2, dense_units=128, dr=0.1, conv_size=32):
    file_path = "/content/gdrive/My Drive/best_model.hdf5"
    check_point = ModelCheckpoint(file_path, monitor = "val_loss", verbose = 1,
                                  save_best_only = True, mode = "min")
    early_stop = EarlyStopping(monitor = "val_loss", mode = "min", patience = 3)
    
    inp = Input(shape = (max_len,))
    x = Embedding(embedding_matrix.shape[0], embed_size, weights = [embedding_matrix], trainable = False)(inp)  # vocabulary size taken from the matrix instead of a hard-coded value
    x1 = SpatialDropout1D(spatial_dr)(x)

    x_gru = Bidirectional(CuDNNGRU(units, return_sequences = True))(x1)
    x1 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_gru)
    avg_pool1_gru = GlobalAveragePooling1D()(x1)
    max_pool1_gru = GlobalMaxPooling1D()(x1)
    
    x3 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_gru)
    avg_pool3_gru = GlobalAveragePooling1D()(x3)
    max_pool3_gru = GlobalMaxPooling1D()(x3)
    
    x_lstm = Bidirectional(CuDNNLSTM(units, return_sequences = True))(x1)
    x1 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_lstm)
    avg_pool1_lstm = GlobalAveragePooling1D()(x1)
    max_pool1_lstm = GlobalMaxPooling1D()(x1)
    
    x3 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_lstm)
    avg_pool3_lstm = GlobalAveragePooling1D()(x3)
    max_pool3_lstm = GlobalMaxPooling1D()(x3)
    
    
    x = concatenate([avg_pool1_gru, max_pool1_gru, avg_pool3_gru, max_pool3_gru,
                    avg_pool1_lstm, max_pool1_lstm, avg_pool3_lstm, max_pool3_lstm])
    x = BatchNormalization()(x)
    x = Dropout(dr)(Dense(dense_units, activation='relu') (x))
    x = BatchNormalization()(x)
    x = Dropout(dr)(Dense(int(dense_units / 2), activation='relu') (x))
    x = Dense(5, activation = "softmax")(x)
    model = Model(inputs = inp, outputs = x)
    model.compile(loss = "categorical_crossentropy", optimizer = Adam(lr = lr, decay = lr_d), metrics = ["accuracy"])
    history = model.fit(X_train, y_ohe, batch_size = 128, epochs = 4, validation_split=0.1, 
                        verbose = 1, callbacks = [check_point, early_stop])
    model = load_model(file_path)
    return model

In [None]:
model1 = build_model1(lr = 1e-3, lr_d = 1e-10, units = 100, spatial_dr = 0.2, kernel_size1=3, kernel_size2=2, 
                      dense_units=32, dr=0.1, conv_size=32)
model1.save('/content/gdrive/My Drive/best_model1.hdf5')

In [None]:
model2 = build_model1(lr = 1e-3, lr_d = 1e-10, units = 100, spatial_dr = 0.5, kernel_size1=3, kernel_size2=2, 
                      dense_units=64, dr=0.2, conv_size=32)
model2.save('/content/gdrive/My Drive/best_model2.hdf5')

In [None]:
def build_model2(lr=0.0, lr_d=0.0, units=0, spatial_dr=0.0, kernel_size1=3, kernel_size2=2, dense_units=128, dr=0.1, conv_size=32):
    file_path = "/content/gdrive/My Drive/best_model.hdf5"
    check_point = ModelCheckpoint(file_path, monitor = "val_loss", verbose = 1,
                                  save_best_only = True, mode = "min")
    early_stop = EarlyStopping(monitor = "val_loss", mode = "min", patience = 3)

    inp = Input(shape = (max_len,))
    x = Embedding(embedding_matrix.shape[0], embed_size, weights = [embedding_matrix], trainable = False)(inp)  # vocabulary size taken from the matrix instead of a hard-coded value
    x1 = SpatialDropout1D(spatial_dr)(x)

    x_gru = Bidirectional(CuDNNGRU(units, return_sequences = True))(x1)
    x_lstm = Bidirectional(CuDNNLSTM(units, return_sequences = True))(x1)
    
    x_conv1 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_gru)
    avg_pool1_gru = GlobalAveragePooling1D()(x_conv1)
    max_pool1_gru = GlobalMaxPooling1D()(x_conv1)
    
    x_conv2 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_gru)
    avg_pool2_gru = GlobalAveragePooling1D()(x_conv2)
    max_pool2_gru = GlobalMaxPooling1D()(x_conv2)
    
    
    x_conv3 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_lstm)
    avg_pool1_lstm = GlobalAveragePooling1D()(x_conv3)
    max_pool1_lstm = GlobalMaxPooling1D()(x_conv3)
    
    x_conv4 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_lstm)
    avg_pool2_lstm = GlobalAveragePooling1D()(x_conv4)
    max_pool2_lstm = GlobalMaxPooling1D()(x_conv4)
    
    
    x = concatenate([avg_pool1_gru, max_pool1_gru, avg_pool2_gru, max_pool2_gru,
                    avg_pool1_lstm, max_pool1_lstm, avg_pool2_lstm, max_pool2_lstm])
    x = BatchNormalization()(x)
    x = Dropout(dr)(Dense(dense_units, activation='relu') (x))
    x = BatchNormalization()(x)
    x = Dropout(dr)(Dense(int(dense_units / 2), activation='relu') (x))
    x = Dense(5, activation = "softmax")(x)
    model = Model(inputs = inp, outputs = x)
    model.compile(loss = "categorical_crossentropy", optimizer = Adam(lr = lr, decay = lr_d), metrics = ["accuracy"])
    history = model.fit(X_train, y_ohe, batch_size = 128, epochs = 4, validation_split=0.1, 
                        verbose = 1, callbacks = [check_point, early_stop])
    model = load_model(file_path)
    return model

In [None]:
model3 = build_model2(lr = 1e-4, lr_d = 0, units = 100, spatial_dr = 0.5, kernel_size1=4, kernel_size2=3, 
                      dense_units=32, dr=0.1, conv_size=32)
model3.save('/content/gdrive/My Drive/best_model3.hdf5')

In [None]:
model4 = build_model2(lr = 1e-3, lr_d = 0, units = 100, spatial_dr = 0.5, kernel_size1=3, kernel_size2=3, 
                      dense_units=64, dr=0.3, conv_size=32)
model4.save('/content/gdrive/My Drive/best_model4.hdf5')

In [None]:
model5 = build_model2(lr = 1e-3, lr_d = 1e-7, units = 100, spatial_dr = 0.3, kernel_size1=3, kernel_size2=3, 
                      dense_units=64, dr=0.4, conv_size=64)
model5.save('/content/gdrive/My Drive/best_model5.hdf5')

### 10.7 Loading Optimal Model Implementations
<a id="Loading Optimal Model Implementations"></a>

In [None]:
########################################################## IMPORTANT ########################################################## 

# If the Google Colab Session crashes or runs into a Runtime error, then please run this code block to load in the presaved 
# optimal model for model1 to model5. This'll prevent the hassle of re-building the models, as they're computationally 
# expensive and can raise Out of Memory (OOM) error inside Google Colab

from keras.models import load_model

model1 = load_model('/content/gdrive/My Drive/best_model1.hdf5') # Model 1 Implementation
model2 = load_model('/content/gdrive/My Drive/best_model2.hdf5') # Model 2 Implementation
model3 = load_model('/content/gdrive/My Drive/best_model3.hdf5') # Model 3 Implementation
model4 = load_model('/content/gdrive/My Drive/best_model4.hdf5') # Model 4 Implementation
model5 = load_model('/content/gdrive/My Drive/best_model5.hdf5') # Model 5 Implementation

## 11. Ensemble Testing Dataset Predictions
<a id="Ensemble Testing Dataset Predictions"></a>

In [None]:
pred1 = model1.predict(X_test, batch_size = 1024, verbose = 1)
pred2 = model2.predict(X_test, batch_size = 1024, verbose = 1)
pred3 = model3.predict(X_test, batch_size = 1024, verbose = 1)
pred4 = model4.predict(X_test, batch_size = 1024, verbose = 1)
pred5 = model5.predict(X_test, batch_size = 1024, verbose = 1)

# sum the class probabilities of the five models; building a new array avoids overwriting pred1 in place
pred = pred1 + pred2 + pred3 + pred4 + pred5

In [None]:
predictions = pred

In [None]:
labels = [1,2,3,4,5]

predicted_labels = []

for prediction in predictions:
    # argmax already returns an integer index, so no rounding or casting is needed
    predicted_labels.append(labels[np.argmax(prediction)])
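
Section 11 combines the five models by summing their predicted class probabilities and picking the class with the largest total, which gives the same winner as averaging. The toy sketch below walks through that for a single review and three hypothetical models; the probability values are made up for illustration.

```python
labels = [1, 2, 3, 4, 5]

# Made-up softmax outputs of three models for a single review
model_probs = [
    [0.10, 0.15, 0.40, 0.25, 0.10],
    [0.05, 0.10, 0.30, 0.45, 0.10],
    [0.05, 0.10, 0.35, 0.40, 0.10],
]

# Sum the probabilities class by class
summed = [sum(col) for col in zip(*model_probs)]

# Pick the class with the largest combined probability
best_index = max(range(len(summed)), key=summed.__getitem__)
print(labels[best_index])  # 4
```

The first model on its own would predict 3 stars, but the combined probability mass favours 4 stars, which is the kind of disagreement the ensemble is meant to smooth out.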

## 12. Create and export output file predict_label.csv 
<a id="Create and export output file predict_label.csv "></a>

In [None]:
test_df = pd.DataFrame({'test_id' : test['test_id'],
                       'label' : predicted_labels})

In [None]:
test_df.to_csv('predict_label.csv', index=False, header=True)

In [None]:
# This'll allow exporting/downloading CSV files from Google Colab to the Local Machine 
from google.colab import files
files.download('predict_label.csv')

## References
<a id="references"></a>

- Calin, Timbus. (2019, May 12), "How to fix name 'Embedding is not defined' in Keras" Stack Overflow [Online] Available at: https://stackoverflow.com/questions/56097089/how-to-fix-name-embedding-is-not-defined-in-keras (Accessed: 12/10/2019)
- cjbrog. (2017, April 9), "Deprecation warnings from sklearn" Stack Overflow [Online] Available at: https://stackoverflow.com/questions/43302400/deprecation-warnings-from-sklearn (Accessed: 12/10/2019) 
- Dieter. (2019), "How To: Preprocessing for GloVe Part1: EDA" Kaggle [Online] Available at: https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part1-eda (Accessed: 17/10/2019)
- Dieter. (2019), "How To: Preprocessing for GloVe Part2: Usage" Kaggle [Online] Available at: https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part2-usage (Accessed: 17/10/2019)
- Kashyap. (2017, March 4), "Why can't I use preprocessing module in Keras?" Stack Overflow [Online] Available at: https://stackoverflow.com/questions/42598630/why-cant-i-use-preprocessing-module-in-keras (Accessed: 12/10/2019) 
- Lukyanenko, Andrew. (2019), "Movie Review Sentiment Analysis EDA and models" Kaggle [Online] Available at: https://www.kaggle.com/artgor/movie-review-sentiment-analysis-eda-and-models (Accessed: 25/10/2019)
- maxpumperla. (2019, February 14), "Name 'LSTM' is not defined" GitHub maxpumperla/hyperas [Online] Available at: https://github.com/maxpumperla/hyperas/issues/199 (Accessed: 12/10/2019)
- micts. (2018, August 28), "Do I have to preprocess my new data for a prediction, if I have used preprocessing for building the model?" StackExchange Cross Validated [Online] Available at: https://stats.stackexchange.com/questions/364382/do-i-have-to-preprocess-my-new-data-for-a-prediction-if-i-have-used-preprocessi (Accessed: 11/10/2019) 
- Sonthalia, Akash. (2019, September 30), "name 'Sequential' is not defined" Stack Overflow [Online] Available at: https://stackoverflow.com/questions/57021088/name-sequential-is-not-defined (Accessed: 12/10/2019)
- TensorFlow Core r2.0. (2019), "tf.keras.layers.SpatialDropout1D" TensorFlow [Online] Available at: https://www.tensorflow.org/api_docs/python/tf/keras/layers/SpatialDropout1D (Accessed: 12/10/2019) 
- user4815162342. (2012, November 2), "'str' object has no attribute 'punctuation' [closed]" Stack Overflow [Online] Available at: https://stackoverflow.com/questions/13197913/str-object-has-no-attribute-punctuation (Accessed: 11/10/2019)
- Valdenegro, Matias. (2019, June 20), "Keras EarlyStopping is not recognized" Stack Overflow [Online] Available at: https://stackoverflow.com/questions/56687658/keras-earlystopping-is-not-recognized (Accessed: 12/10/2019) 
