## Stanford Sentiment Treebank - Movie Review Classification Competition
The SST-2 dataset is a benchmark dataset for sentiment analysis. It contains a collection of movie reviews, each with a binary label indicating whether the review is positive or negative. The version used in this competition has 8,741 reviews, split into 6,920 for training and 1,821 for testing.

Building a predictive model using the SST-2 dataset can be practically useful for a variety of applications, such as:

1. Product/Service Performance: Companies and producers can use such models to automatically classify reviews of their products or services as positive or negative, allowing them to identify areas for improvement and respond to customer feedback.

2. Marketing and Investment: Market researchers and directors/producers can use such models to analyze customer sentiment towards specific movies or brands, helping them to identify market trends.

3. Personalization: Companies can combine such models with user information to identify consumers' preferences and sentiment toward a certain movie or category of movies, in order to provide personalized streaming services and high-quality recommendations.


## 1. Get data in and set up X_train, X_test, y_train objects

In [None]:
#install aimodelshare library
! pip install aimodelshare==0.0.189

In [None]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [None]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

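# squeeze("columns") turns each single-column frame into a Series
# (the squeeze argument of read_csv was removed in pandas 2.0)
X_train = pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test = pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels = pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")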

# One-hot encode the labels
y_train = pd.get_dummies(y_train_labels)

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object

In [None]:
y_train.head()

   Negative  Positive
0         0         1
1         0         1
2         0         1
3         0         1
4         0         1


In [None]:
len(X_train), len(X_test)

(6920, 1821)
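The one-hot encoding of the labels can be sketched without pandas. This is a minimal illustration, not the notebook's actual code path: the three labels are made up, and the `one_hot` helper is hypothetical. It shows why the `Negative` column comes first: the columns are the distinct labels in sorted order.

```python
# Illustrative labels only; the real ones come from y_train_labels.csv.
labels = ["Positive", "Negative", "Positive"]

# Columns are the distinct labels in sorted order.
columns = sorted(set(labels))

def one_hot(label, columns):
    """Return a 0/1 row with a single 1 in the column matching `label`."""
    return [1 if label == col else 0 for col in columns]

encoded = [one_hot(label, columns) for label in labels]
print(columns)  # ['Negative', 'Positive']
print(encoded)  # [[0, 1], [1, 0], [0, 1]]
```

The column order matters later, when a predicted column index has to be mapped back to a label name.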

## 2. Preprocess data using keras tokenizer / Write and Save Preprocessor function


In [87]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

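# preprocessor tokenizes words and pads/truncates every document to the same length
def preprocessor(data, maxlen=40, max_words=10000):
    # max_words is unused here: the vocabulary size is fixed by the tokenizer built above
    sequences = tokenizer.texts_to_sequences(data)

    # Shorter documents are zero-padded, longer ones are truncated to maxlen tokens
    X = pad_sequences(sequences, maxlen=maxlen)

    return X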

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
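To make the two preprocessing steps concrete, here is a rough pure-Python sketch of what the tokenizer and `pad_sequences` do with their default settings. The helper names `fit_vocab` and `to_padded` and the two sample sentences are hypothetical. The sketch splits on whitespace only, whereas the Keras tokenizer also strips punctuation, so treat it as an approximation.

```python
from collections import Counter

def fit_vocab(texts):
    """Map each word to an integer rank, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded(texts, vocab, maxlen, num_words):
    """Convert texts to fixed-length integer rows (0 is the padding value)."""
    rows = []
    for text in texts:
        # Keep only known words whose rank is below num_words.
        seq = [vocab[w] for w in text.lower().split()
               if w in vocab and vocab[w] < num_words]
        seq = seq[-maxlen:]                           # truncate from the front
        rows.append([0] * (maxlen - len(seq)) + seq)  # pad at the front
    return rows

docs = ["a fine film", "a dull film that drags on and on"]
vocab = fit_vocab(docs)
padded = to_padded(docs, vocab, maxlen=5, num_words=10000)
print(padded)
```

Both padding and truncation act on the front of the sequence, so for long reviews the last `maxlen` tokens are the ones that survive.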


#### Save preprocessor function to local "preprocessor.zip" file

In [88]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


## 3. Fit model on preprocessed data and save preprocessor function and model 


### Conv1d with Word Embedding

In [89]:
from tensorflow.keras.layers import Dense, Embedding, Flatten, Conv1D, GlobalMaxPooling1D
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
model.summary()

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_25"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_25 (Embedding)    (None, 40, 16)            160000    
                                                                 
 conv1d_8 (Conv1D)           (None, 36, 64)            5184      
                                                                 
 flatten_22 (Flatten)        (None, 2304)              0         
                                                                 
 dense_28 (Dense)            (None, 2)                 4610      
                                                                 
Total params: 169,794
Trainable params: 169,794
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
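The output shapes and parameter counts in the summary above can be reproduced with a few lines of arithmetic. This is a worked check, assuming the layer defaults used in the cell (no padding and stride 1 for the convolution); the variable names are only for illustration.

```python
vocab_size, embed_dim, seq_len = 10000, 16, 40
filters, kernel_size, n_classes = 64, 5, 2

embedding_params = vocab_size * embed_dim              # one vector per token id
conv_steps = seq_len - kernel_size + 1                 # no padding, stride 1
conv_params = (kernel_size * embed_dim + 1) * filters  # weights plus one bias per filter
flat_units = conv_steps * filters                      # Flatten output size
dense_params = (flat_units + 1) * n_classes            # weights plus one bias per class

total = embedding_params + conv_params + dense_params
print(conv_steps, flat_units, total)  # 36 2304 169794
```

Most of the parameters sit in the embedding layer, which is worth keeping in mind when comparing the architectures that follow.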


#### Save model to local ".onnx" file

In [90]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

#### Generate predictions from X_test data and submit model to competition


In [None]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [81]:
#Instantiate Competition

mycompetition= ai.Competition(apiurl)

In [82]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
#Note: Keras predict returns one probability per class; argmax(axis=1) picks the column index of the most likely class
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 257

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
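The step from predicted probabilities to submitted labels can be sketched in isolation. The probabilities below are made up for illustration, and the `argmax` helper is a hypothetical stand-in for the NumPy call used in the cell above.

```python
columns = ["Negative", "Positive"]  # column order of y_train

# Made-up class probabilities, one row per review.
probabilities = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

labels = [columns[argmax(row)] for row in probabilities]
print(labels)  # ['Negative', 'Positive', 'Positive']
```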


#### Model Performance
Accuracy: 0.7782656422

F-1 score: 0.7755062704

Precision: 0.7929332887

Recall: 0.7783882784
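For reference, the four leaderboard metrics can be computed from true and predicted labels as sketched below. How the leaderboard averages precision, recall and F-1 across the two classes is not stated in this notebook; the sketch assumes macro averaging, and the `macro_metrics` helper and the four sample labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

y_true = ["Positive", "Positive", "Negative", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Negative"]
scores = macro_metrics(y_true, y_pred)
print(scores)
```

With balanced classes, macro-averaged recall equals accuracy, as it does in this small example.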

### Conv1d with Embedding v.3
- cut each movie review after 100 words instead of 40
- learn an embedding vector of size 100 instead of 16

This does not help.

#### Model Performance
Accuracy: 0.7771679473

F-1 score: 0.7744502214

Precision: 0.7914783666

Recall: 0.7772893773

### Conv1d with Embedding v.2

In [27]:
from tensorflow.keras.layers import Dense, Embedding, Flatten, Conv1D, GlobalMaxPooling1D
from tensorflow.keras.models import Sequential

model1 = Sequential()
model1.add(Embedding(10000, 16, input_length=40))
model1.add(Conv1D(filters=128, kernel_size=5, activation='relu'))
model1.add(GlobalMaxPooling1D())
model1.add(Flatten())
model1.add(Dense(32, activation='relu'))
model1.add(Dense(2, activation='softmax'))
model1.summary()

model1.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model1.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_8 (Embedding)     (None, 40, 16)            160000    
                                                                 
 conv1d_6 (Conv1D)           (None, 36, 128)           10368     
                                                                 
 global_max_pooling1d_1 (Glo  (None, 128)              0         
 balMaxPooling1D)                                                
                                                                 
 flatten_6 (Flatten)         (None, 128)               0         
                                                                 
 dense_9 (Dense)             (None, 32)                4128      
                                                                 
 dense_10 (Dense)            (None, 2)                 66        
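The main change in this version is `GlobalMaxPooling1D`, which keeps only the largest activation of each filter across all time steps. The sketch below shows that operation on a tiny made-up feature map and re-derives the parameter counts listed in the summary; the helper name is hypothetical.

```python
def global_max_pool_1d(feature_map):
    """feature_map is a list of time steps, each a list of channel values."""
    return [max(channel) for channel in zip(*feature_map)]

# Made-up feature map: 3 time steps, 2 channels.
feature_map = [[0.1, 0.5], [0.7, 0.2], [0.3, 0.9]]
pooled = global_max_pool_1d(feature_map)
print(pooled)  # [0.7, 0.9]

# Parameter counts for the layers in the summary above.
conv_params = (5 * 16 + 1) * 128   # kernel_size * embed_dim weights + bias, per filter
dense_hidden = (128 + 1) * 32      # pooled features + bias, per hidden unit
dense_output = (32 + 1) * 2        # hidden units + bias, per class
print(conv_params, dense_hidden, dense_output)  # 10368 4128 66
```

Because pooling already returns one value per filter, the `Flatten` layer that follows leaves the shape unchanged at 128.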
                                                      

In [28]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model1, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model1.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [29]:
# Predict with model1 (the v.2 network), not the earlier `model`
prediction_column_index=model1.predict(preprocessor(X_test)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]

mycompetition.submit_model(model_filepath = "model1.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 238

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


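#### Model Performance (no obvious improvement)
Note: the values below are identical to those of the first Conv1d model, so they may not reflect `model1`'s own predictions.
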
Accuracy: 0.7782656422

F-1 score: 0.7755062704

Precision: 0.7929332887

Recall: 0.7783882784

### LSTM with Embedding


In [91]:
# Train and submit model 2 using same preprocessor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model2 = Sequential()
model2.add(Embedding(10000, 16, input_length=40))
model2.add(LSTM(32, return_sequences=True, dropout=0.2))
model2.add(LSTM(32, dropout=0.2))
model2.add(Flatten())
model2.add(Dense(2, activation='softmax'))

model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model2.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
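This model pairs a two-unit softmax output with `binary_crossentropy`, while the Conv1d models use `categorical_crossentropy`. For a two-unit softmax with one-hot targets the two losses give the same value, because the two outputs sum to 1. The sketch below checks this by hand. It uses hypothetical helper functions and ignores any numerical safeguards a framework may apply.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean of the per-output binary cross-entropy terms."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

y_true = [0, 1]      # one-hot target for "Positive"
y_pred = [0.3, 0.7]  # made-up softmax output; the two values sum to 1

bce = binary_crossentropy(y_true, y_pred)
cce = categorical_crossentropy(y_true, y_pred)
print(math.isclose(bce, cce))  # True
```

The equivalence depends on the outputs summing to 1, so it does not carry over to an output layer with independent sigmoid units.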


In [92]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [86]:
#Submit Model 2: 

#-- Generate predicted y values (Model 2)
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 260

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model Performance
Accuracy: 0.8166849616

F-1 score: 0.8164859515

Precision: 0.8179831567

Recall: 0.8166497976

The LSTM with a learned embedding (accuracy 0.817) outperforms both the Conv1d models (0.778) and the models with pretrained GloVe embeddings (0.709 and 0.785), so the remaining experiments build on the LSTM architecture.

### LSTM with Embedding v.2
- cut each movie review after 100 words instead of 40
- learn an embedding vector of size 100 instead of 16

This does not help.

#### Model Performance
Accuracy: 0.7947310648

F-1 score: 0.7939524786

Precision: 0.7994057409

Recall: 0.7947994987

### Deeper Stacked LSTM

In [66]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model4 = Sequential()
model4.add(Embedding(10000, 32, input_length=40))
model4.add(LSTM(32, return_sequences=True, dropout=0.2))
model4.add(LSTM(32, return_sequences=True, dropout=0.2))
model4.add(LSTM(32, return_sequences=True, dropout=0.2))
model4.add(LSTM(32, dropout=0.2))
model4.add(Flatten())
model4.add(Dense(2, activation='softmax'))

model4.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model4.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [67]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model4, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model4.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [36]:
prediction_column_index=model4.predict(preprocessor(X_test)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]

mycompetition.submit_model(model_filepath = "model4.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 241

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model Performance (does not help)
Accuracy: 0.7914379802

F-1 score: 0.7906182863

Precision: 0.7962277273

Recall: 0.7915076152

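Stacking four LSTM layers lowers the predictive power of the model: accuracy drops from 0.817 with two layers to 0.791. Note that this model also uses a 32-dimensional embedding instead of 16.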

### Bidirectional LSTM with Embedding

In [68]:
# Train and submit model 5 (bidirectional LSTM) using same preprocessor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten, Bidirectional

model5 = Sequential()
model5.add(Embedding(10000, 32, input_length=40))
model5.add(Bidirectional(LSTM(32, dropout=0.2)))
model5.add(Flatten())
model5.add(Dense(2, activation='softmax'))

model5.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model5.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [69]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model5, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model5.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [42]:
prediction_column_index=model5.predict(preprocessor(X_test)).argmax(axis=1)

prediction_labels = [y_train.columns[i] for i in prediction_column_index]

mycompetition.submit_model(model_filepath = "model5.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 244

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model Performance 
Accuracy: 0.8090010977

F-1 score: 0.8089824544

Precision: 0.8091470177

Recall: 0.8090129169

Switching from the stacked LSTM to a single bidirectional LSTM does not improve the sentiment predictions: accuracy is 0.809, compared with 0.817 for the two-layer LSTM.

### Transfer Learning with GloVe embeddings

In [45]:
# What if we wanted to use a matrix of pretrained embeddings? This works like transfer learning, but the pretrained part is the Embedding matrix.
# Download the GloVe embedding weights (this may take 10 minutes or so)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-04-17 00:41:42--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-17 00:41:42--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-17 00:41:42--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [46]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [47]:
# Extract embedding data
import os
glove_dir = os.getcwd()

embeddings_index = {}
# Each line holds one word followed by its vector components
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [48]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files with larger number of features
max_words = 10000

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
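The two cells above can be illustrated on a toy scale. The sketch below uses plain lists instead of NumPy arrays, made-up 3-dimensional vectors and a hypothetical four-word vocabulary. It only shows which rows get filled and which stay at zero.

```python
# Made-up GloVe-style lines: a word followed by its vector components.
glove_lines = ["film 0.1 0.2 0.3", "good 0.4 0.5 0.6"]

embeddings_index = {}
for line in glove_lines:
    word, *coefs = line.split()
    embeddings_index[word] = [float(c) for c in coefs]

# Hypothetical tokenizer vocabulary (index 0 is reserved for padding).
word_index = {"film": 1, "good": 2, "zzyzx": 3, "rare": 7}
max_words, embedding_dim = 5, 3

embedding_matrix = [[0.0] * embedding_dim for _ in range(max_words)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    # Skip words outside the vocabulary limit or without a pretrained vector.
    if i < max_words and vector is not None:
        embedding_matrix[i] = vector

for row in embedding_matrix:
    print(row)
```

Row 0 (padding), row 3 (a word with no pretrained vector) and row 4 (an unused index) stay all-zero, and the word with index 7 is dropped because it falls outside `max_words`.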

In [49]:
# Set up a simple Flatten + Dense model, then import the GloVe weights into its Embedding layer:
model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=40))
model3.add(Flatten())
model3.add(Dense(32, activation='relu'))
model3.add(Dense(2, activation='sigmoid'))
model3.summary()

Model: "sequential_15"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_15 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 flatten_12 (Flatten)        (None, 4000)              0         
                                                                 
 dense_17 (Dense)            (None, 32)                128032    
                                                                 
 dense_18 (Dense)            (None, 2)                 66        
                                                                 
Total params: 1,128,098
Trainable params: 1,128,098
Non-trainable params: 0
_________________________________________________________________


In [50]:
# Add weights in the same manner as transfer learning and turn off the trainable option before fitting the model to freeze the weights.
model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = False



model3.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model3.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

model3.save_weights('pre_trained_glove_model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [52]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model3, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model3.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [53]:
prediction_column_index=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 246

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model Performance 
Accuracy: 0.7091108672

F-1 score: 0.7074810644

Precision: 0.7140338679

Recall: 0.7091936572

### LSTM with GloVe embeddings

In [70]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model6 = Sequential()
model6.add(Embedding(10000, embedding_dim, input_length=40))
model6.add(LSTM(32, return_sequences=True, dropout=0.2))
model6.add(LSTM(32, dropout=0.2))
model6.add(Flatten())
model6.add(Dense(2, activation='softmax'))
model6.summary()

Model: "sequential_22"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_22 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 lstm_33 (LSTM)              (None, 40, 32)            17024     
                                                                 
 lstm_34 (LSTM)              (None, 32)                8320      
                                                                 
 flatten_19 (Flatten)        (None, 32)                0         
                                                                 
 dense_25 (Dense)            (None, 2)                 66        
                                                                 
Total params: 1,025,410
Trainable params: 1,025,410
Non-trainable params: 0
_________________________________________________________________


In [71]:
model6.layers[0].set_weights([embedding_matrix])
model6.layers[0].trainable = False



model6.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model6.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

model6.save_weights('pre_trained_glove_model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [72]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model6, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model6.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [61]:
prediction_column_index=model6.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 6 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model6.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 249

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


#### Model Performance 
Accuracy: 0.7848518112

F-1 score: 0.7840062322

Precision: 0.7895349065

Recall: 0.7849214382


In [83]:
# Compare two or more models
# data=mycompetition.compare_models([1, 2, 3, 4, 5, 6], verbose=1)
# mycompetition.stylize_compare(data)

### Best Model
The model with two LSTM layers and a learned word embedding outperforms the others, with a leaderboard accuracy of 0.817.

In [73]:
model2.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 40, 16)            160000    
                                                                 
 lstm_2 (LSTM)               (None, 40, 32)            6272      
                                                                 
 lstm_3 (LSTM)               (None, 32)                8320      
                                                                 
 flatten_5 (Flatten)         (None, 32)                0         
                                                                 
 dense_8 (Dense)             (None, 2)                 66        
                                                                 
Total params: 174,658
Trainable params: 174,658
Non-trainable params: 0
_________________________________________________________________
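The parameter counts in this summary follow from the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias. The short check below reproduces them; the helper name `lstm_params` is only for illustration.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units + 1) * units)

embedding = 10000 * 16         # vocabulary size * embedding dimension
lstm_1 = lstm_params(16, 32)   # reads the 16-d embeddings
lstm_2 = lstm_params(32, 32)   # reads the 32-d outputs of the first LSTM
dense = (32 + 1) * 2           # weights plus one bias per class

total = embedding + lstm_1 + lstm_2 + dense
print(lstm_1, lstm_2, total)  # 6272 8320 174658
```

The same formula gives 17,024 parameters for the first LSTM layer when it reads 100-dimensional GloVe vectors.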
