<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---

## Assignment 3: Text Classification Using the Stanford SST Sentiment Dataset



## GitHub link: https://github.com/Bobbie8881/Projects-in-ML

## Get data in and set up X_train, X_test, y_train objects

In [2]:
#install aimodelshare library
! pip install aimodelshare==0.0.189

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aimodelshare==0.0.189
  Downloading aimodelshare-0.0.189-py3-none-any.whl (967 kB)
Collecting tensorflow==2.9.2
  Downloading tensorflow-2.9.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (511.8 MB)
Collecting tf2onnx
  Downloading tf2onnx-1.14.0-py3-none-any.whl (451 kB)
Collecting onnxconverter-common>=1.7.0
  Downloading onnxconverter_common-1.13.0-py2.py3-none-any.whl (83 kB)


Load in Data

In [1]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [2]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

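# read_csv(..., squeeze=True) was removed in pandas 2.0; squeezing the
# single-column DataFrame afterwards returns the same Series.
X_train = pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test = pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels = pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")

# One-hot encode the labels (one column per class)
y_train = pd.get_dummies(y_train_labels)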

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object

## Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.

In [8]:
print(len(X_train))

6920


In [9]:
print(len(X_test))

1821


In [11]:
y_train_labels

0       Positive
1       Positive
2       Positive
3       Positive
4       Positive
          ...   
6915    Negative
6916    Negative
6917    Positive
6918    Negative
6919    Negative
Name: label, Length: 6920, dtype: object

The sst2_competition_data dataset contains sentences from movie reviews, each labeled with a sentiment of either "Positive" or "Negative". It consists of 6,920 training samples and 1,821 test samples.

A predictive model built on this dataset could be practically useful for businesses in the entertainment industry, such as movie production companies, streaming services, and movie review websites. By accurately predicting the sentiment of movie reviews, these businesses could learn how the public perceives their movies, which would help them refine their marketing strategies and make well-informed decisions about production, distribution, and promotion.

For example, a movie production company could apply a model trained on this data to the pre-release reviews of a new film to estimate how well the film is likely to be received. If the model predicts mostly negative sentiment, the studio may decide to alter the film or its marketing plan to improve its chances of success. If the model predicts mostly positive sentiment, the studio can use that information to tailor its marketing strategy and increase awareness and ticket sales.
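A quick look at class balance and sentence length is a common first step when describing a dataset like this one. The sketch below shows one way to do it; the helper name `summarize` and the toy sentences and labels are invented for illustration and are not taken from the competition files.

```python
from collections import Counter


def summarize(texts, labels):
    """Return class counts and the min/mean/max whitespace-token length."""
    lengths = [len(t.split()) for t in texts]
    return {
        "class_counts": dict(Counter(labels)),
        "min_len": min(lengths),
        "mean_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
    }


# Toy stand-ins for X_train and y_train_labels (invented for illustration)
toy_texts = [
    "Yet the act is still charming here .",
    "A dull and lifeless film .",
    "Funny , warm and moving .",
]
toy_labels = ["Positive", "Negative", "Positive"]

stats = summarize(toy_texts, toy_labels)
print(stats)
```

Run against the real `X_train` and `y_train_labels`, the same summary would show whether the two classes are balanced and how many sentences exceed a chosen maximum length.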

## Preprocess data using keras tokenizer / Write and Save Preprocessor function


In [12]:
# This preprocessor function makes use of the tf.keras tokenizer

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

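# preprocessor tokenizes words and pads/truncates every document to the same length
def preprocessor(data, maxlen=40, max_words=10000):
    # max_words is kept in the signature but is unused here; the vocabulary
    # size is fixed by Tokenizer(num_words=10000) above.

    # Map each text to a list of integer word indices
    sequences = tokenizer.texts_to_sequences(data)

    # Pad (or truncate) every sequence to exactly maxlen tokens
    X = pad_sequences(sequences, maxlen=maxlen)

    return X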

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 40)
(1821, 40)
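To make the two preprocessing steps concrete, here is a simplified pure-Python sketch of what tokenizing and padding do. It is not the Keras implementation: the helper names are hypothetical, ties in word frequency are broken alphabetically as an assumption, and punctuation filtering is omitted. Padding and truncating on the left mirror what I understand to be the `pad_sequences` defaults.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}


def to_padded(texts, word_index, maxlen, num_words):
    """Convert texts to fixed-length integer rows, padding and truncating on the left."""
    rows = []
    for text in texts:
        # Keep only known words whose index is below num_words
        seq = [word_index[w] for w in text.lower().split()
               if w in word_index and word_index[w] < num_words]
        seq = seq[-maxlen:]                            # drop tokens from the front
        rows.append([0] * (maxlen - len(seq)) + seq)   # pad with zeros at the front
    return rows


train = ["a good film", "a bad film", "a good good story"]
index = build_word_index(train)
padded = to_padded(["a good film", "an unseen good story about a film"],
                   index, maxlen=4, num_words=10000)
print(index)   # {'a': 1, 'good': 2, 'film': 3, 'bad': 4, 'story': 5}
print(padded)  # [[0, 1, 2, 3], [2, 5, 1, 3]]
```

Words that never appeared in the training text ("an", "unseen", "about") are simply dropped, which is why the vocabulary must be built from the training data before the test data is transformed.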


## Fit model on preprocessed data and save preprocessor function and model 


Model1: An Embedding layer and three LSTM layers (Use an Embedding layer and LSTM layers in at least one model)

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, Flatten

model1 = Sequential()
model1.add(Embedding(10000, 16, input_length=40))
model1.add(LSTM(32, return_sequences=True, dropout=0.2))
model1.add(LSTM(32, return_sequences=True, dropout=0.2))
model1.add(LSTM(32, dropout=0.2))
model1.add(Flatten())
model1.add(Dense(2, activation='softmax'))

model1.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model1.fit(preprocessor(X_train), y_train,
                    epochs=1,
                    batch_size=32,
                    validation_split=0.2)
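The size of Model 1 can be checked by hand with the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias). The sketch below is plain arithmetic, not Keras code, and the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


embedding = 10000 * 16            # vocabulary size x embedding dimension
lstm1 = lstm_params(16, 32)       # reads the 16-dim embeddings
lstm2 = lstm_params(32, 32)       # reads the 32-dim output of the first LSTM
lstm3 = lstm_params(32, 32)
output = dense_params(32, 2)

total = embedding + lstm1 + lstm2 + lstm3 + output
print(lstm1, lstm2, lstm3, output, total)  # 6272 8320 8320 66 182978
```

The 6272 and 8320 figures also appear for the LSTM(32) layers in the `compare_models` output later in this notebook. Because the last LSTM does not return sequences, its output is already two-dimensional, so the `Flatten()` layer that follows it leaves the shape unchanged.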



Save preprocessor function to local "preprocessor.zip" file

In [18]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


Save model to local ".onnx" file

In [19]:
# Save keras model1 to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model1 = model_to_onnx(model1, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model1.onnx", "wb") as f:
    f.write(onnx_model1.SerializeToString())

Generate predictions from X_test data and submit model to competition


In [20]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique rest api that powers this specific Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [21]:
#Instantiate Competition

mycompetition= ai.Competition(apiurl)

Submit Model 1

In [23]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
#Note: Keras predict returns class probabilities; argmax(axis=1) converts them to the predicted column index
prediction_column_index=model1.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model1.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata={"team":"6"})

Insert search tags to help users find your model (optional): Bob model 1
Provide any useful notes about your model (optional): Bob model 1

Your model has been submitted as model version 93

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
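The step from predicted probabilities back to text labels can be sketched in pure Python as follows. This is a stand-in for the pandas and Keras objects above: `classes` plays the role of `y_train.columns` (assuming the one-hot columns are in sorted order, which is my understanding of `pd.get_dummies`), and the probability rows are invented for illustration.

```python
labels = ["Positive", "Negative", "Positive", "Negative"]

# Column order of the one-hot encoding (assumed sorted, like y_train.columns)
classes = sorted(set(labels))

# One-hot rows: a 1 in the column of the true class, 0 elsewhere
one_hot = [[1 if label == c else 0 for c in classes] for label in labels]

# Invented softmax outputs, one row per test sentence
probs = [[0.9, 0.1], [0.2, 0.8], [0.45, 0.55]]

# argmax over each row gives the column index of the most likely class
pred_idx = [max(range(len(row)), key=row.__getitem__) for row in probs]

# Map column indices back to label strings
pred_labels = [classes[i] for i in pred_idx]

print(classes)      # ['Negative', 'Positive']
print(one_hot[0])   # [0, 1]
print(pred_labels)  # ['Negative', 'Positive', 'Positive']
```

The mapping only works if the column order used at prediction time is the same one used to encode the training labels, which is why the notebook reads it from `y_train.columns` rather than hard-coding it.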


Model2: An Embedding layer and two Conv1d layers (Use an Embedding layer and Conv1d layers in at least one model)

In [27]:
from tensorflow.keras.layers import Dense, Embedding, Flatten, Conv1D, MaxPooling1D
from tensorflow.keras.models import Sequential
model2 = Sequential()

model2.add(Embedding(10000, 16, input_length=40))

model2.add(Conv1D(64, kernel_size=2, strides=1))
model2.add(MaxPooling1D(2))

model2.add(Conv1D(256, kernel_size=4, strides=2))
model2.add(MaxPooling1D(2))

model2.add(Flatten())

model2.add(Dense(256, activation='relu'))
model2.add(Dense(2, activation='softmax'))

model2.summary()

model2.compile(loss="categorical_crossentropy", optimizer = "adam", metrics = ["acc"])

history = model2.fit(preprocessor(X_train), y_train,
                    epochs = 10, 
                    batch_size = 32,
                    validation_split=0.2)


Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 40, 16)            160000    
                                                                 
 conv1d_6 (Conv1D)           (None, 39, 64)            2112      
                                                                 
 max_pooling1d_6 (MaxPooling  (None, 19, 64)           0         
 1D)                                                             
                                                                 
 conv1d_7 (Conv1D)           (None, 8, 256)            65792     
                                                                 
 max_pooling1d_7 (MaxPooling  (None, 4, 256)           0         
 1D)                                                             
                                                                 
 flatten_4 (Flatten)         (None, 1024)             
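The output shapes and parameter counts in this summary follow from the usual formulas for unpadded ("valid") convolution and pooling. The sketch below reproduces them with plain arithmetic; the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution or pooling window without padding."""
    return (length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """One kernel of size kernel_size x in_channels per filter, plus one bias each."""
    return kernel_size * in_channels * filters + filters


lengths = [40]                                                       # padded sequence length
lengths.append(conv1d_out_len(lengths[-1], kernel_size=2, stride=1))  # Conv1D(64, 2)
lengths.append(conv1d_out_len(lengths[-1], kernel_size=2, stride=2))  # MaxPooling1D(2)
lengths.append(conv1d_out_len(lengths[-1], kernel_size=4, stride=2))  # Conv1D(256, 4, strides=2)
lengths.append(conv1d_out_len(lengths[-1], kernel_size=2, stride=2))  # MaxPooling1D(2)

flat = lengths[-1] * 256   # Flatten: remaining time steps x filters

print(lengths)                      # [40, 39, 19, 8, 4]
print(flat)                         # 1024
print(conv1d_params(16, 64, 2))     # 2112
print(conv1d_params(64, 256, 4))    # 65792
```

The sequence shrinks from 40 steps to 4 before flattening, so each of the 1,024 flattened features summarizes a fairly wide window of the original sentence.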

Save model to local ".onnx" file

In [56]:
# Save keras model2 to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model2 = model_to_onnx(model2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model2.SerializeToString())

Submit Model 2

In [30]:
#Submit Model 2: 

#-- Generate predicted y values (Model 2)
#Note: Keras predict returns class probabilities; argmax(axis=1) converts them to the predicted column index
prediction_column_index=model2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                          custom_metadata={"team":"6"})

Insert search tags to help users find your model (optional): Bob model 2
Provide any useful notes about your model (optional): Bob model 2

Your model has been submitted as model version 99

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


Model3: Transfer learning with glove embeddings (Use transfer learning with glove embeddings for at least one of these models)

In [32]:
# Download Glove embedding matrix weights (Might take 10 mins or so!)
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

--2023-04-13 22:13:05--  http://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/wordvecs/glove.6B.zip [following]
--2023-04-13 22:13:06--  https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip [following]
--2023-04-13 22:13:06--  https://downloads.cs.stanford.edu/nlp/data/wordvecs/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182753 (822M) [app

In [33]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
  inflating: glove.6B.50d.txt        


In [35]:
import os
# Extract embedding data for 100 feature embedding matrix
glove_dir = os.getcwd()

embeddings_index = {}
# Each line holds a word followed by its 100 embedding coefficients
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [41]:
from tensorflow.keras.preprocessing.text import Tokenizer
word_index = tokenizer.word_index
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features

embedding_matrix = np.zeros((10000, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # Only word indices below 10000 fit in the matrix; words not found in
    # the embedding index keep their all-zeros row.
    if i < 10000 and embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
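The same lookup logic can be tried on a toy vocabulary to see which rows end up filled. The sketch below uses plain lists instead of NumPy arrays; the function name, the toy words, and the 3-dimensional vectors are all invented for illustration.

```python
def build_embedding_matrix(word_index, embeddings_index, num_words, dim):
    """Row i holds the pretrained vector for the word with index i, or zeros."""
    matrix = [[0.0] * dim for _ in range(num_words)]
    hits = 0
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if i < num_words and vector is not None:
            matrix[i] = list(vector)
            hits += 1
    return matrix, hits


# Toy tokenizer vocabulary and toy pretrained vectors (invented values)
toy_word_index = {"good": 1, "film": 2, "zzyzx": 3, "rare": 4}
toy_vectors = {
    "good": [0.1, 0.2, 0.3],
    "film": [0.4, 0.5, 0.6],
    "rare": [0.7, 0.8, 0.9],
}

matrix, hits = build_embedding_matrix(toy_word_index, toy_vectors, num_words=4, dim=3)
print(matrix)  # [[0.0, 0.0, 0.0], [0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.0, 0.0, 0.0]]
print(hits)    # 2
```

Row 0 stays zero because index 0 is reserved for padding, "zzyzx" stays zero because it has no pretrained vector, and "rare" is skipped because its index falls outside the matrix. Counting the hits on the real vocabulary would show how much of it GloVe actually covers.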

In [72]:
# Set up model3 and import Glove weights to Embedding layer:

model3 = Sequential()
model3.add(Embedding(10000, 100, input_length=40))
model3.add(Flatten())
model3.add(Dense(32, activation='relu'))
model3.add(Dense(32, activation='relu'))
model3.add(Dense(2, activation='softmax'))
model3.summary()

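# Load the GloVe weights into the Embedding layer and turn off its trainable option before compiling, so the weights stay frozen during fitting.
# Note: model3.summary() above runs before the freeze, so it reports every parameter as trainable.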
model3.layers[0].set_weights([embedding_matrix])
model3.layers[0].trainable = False

model3.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model3.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)


Model: "sequential_15"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_15 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 flatten_15 (Flatten)        (None, 4000)              0         
                                                                 
 dense_34 (Dense)            (None, 32)                128032    
                                                                 
 dense_35 (Dense)            (None, 32)                1056      
                                                                 
 dense_36 (Dense)            (None, 2)                 66        
                                                                 
Total params: 1,129,154
Trainable params: 1,129,154
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
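The parameter counts in the Model 3 summary can be reproduced by hand. The summary lists all 1,129,154 parameters as trainable because `summary()` is called before the embedding layer is frozen; the sketch below (plain arithmetic with a hypothetical helper) also works out how many parameters remain trainable once the embedding weights are frozen.

```python
def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


embedding = 10000 * 100          # GloVe weights: vocabulary size x embedding dimension
flat = 40 * 100                  # Flatten output: sequence length x embedding dimension
dense1 = dense_params(flat, 32)
dense2 = dense_params(32, 32)
output = dense_params(32, 2)

total = embedding + dense1 + dense2 + output
trainable_after_freeze = total - embedding

print(dense1, dense2, output)    # 128032 1056 66
print(total)                     # 1129154
print(trainable_after_freeze)    # 129154
```

Almost all of the remaining trainable weights sit in the first Dense layer, since it connects all 4,000 flattened embedding values to each of its 32 units.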

Save model to local ".onnx" file

In [73]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model3 = model_to_onnx(model3, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model3.onnx", "wb") as f:
    f.write(onnx_model3.SerializeToString())

Submit Model 3

In [74]:
#Submit Model 3: 

#-- Generate predicted y values (Model 3)
prediction_column_index=model3.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model3.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata={"team":"6"})

Insert search tags to help users find your model (optional): Bob Model 3
Provide any useful notes about your model (optional): Bob Model 3

Your model has been submitted as model version 112

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [75]:
# Compare model 1 2 3
data=mycompetition.compare_models([1, 2, 3], verbose=1)
mycompetition.stylize_compare(data)

| | Model_1_Layer | Model_1_Shape | Model_1_Params | Model_2_Layer | Model_2_Shape | Model_2_Params | Model_3_Layer | Model_3_Shape | Model_3_Params |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Embedding | [None, 40, 16] | 160000 | Embedding | [None, 40, 16] | 160000 | Embedding | [None, 40, 16] | 160000 |
| 1 | Flatten | [None, 640] | 0 | LSTM | [None, 40, 32] | 6272 | LSTM | [None, 40, 256] | 279552 |
| 2 | Dense | [None, 2] | 1282 | LSTM | [None, 32] | 8320 | Flatten | [None, 10240] | 0 |
| 3 | | | | Flatten | [None, 32] | 0 | Dense | [None, 2] | 20482 |
| 4 | | | | Dense | [None, 2] | 66 | | | |


## Discuss which models performed better and point out relevant hyper-parameter values for successful models.

The best of my first three models is Model 3 (transfer learning with GloVe embeddings). Its training accuracy is very high, and its test accuracy is around 0.70. The relevant hyper-parameter values for this model are:

Embedding layer parameters:

*  input_dim: 10000, which is the size of the vocabulary of the input data.
*  output_dim: 100, which is the dimensionality of each word embedding (matching the 100-dimensional GloVe vectors).
*  input_length: 40, which is the length of the input sequences.

First hidden Dense layer parameters:

*  units: 32, which is the number of output units in the layer.
*  activation: 'relu', which is the activation function used by the layer.

Second hidden Dense layer parameters:

*  units: 32, which is the number of output units in the layer.
*  activation: 'relu', which is the activation function used by the layer.

Output Dense layer parameters:

*  units: 2, which is the number of output units in the layer (one per class).
*  activation: 'softmax', which is the activation function used by the layer.

Training parameters:

*  epochs: 10, which is the number of times the model iterates over the entire training dataset.
*  batch_size: 32, which is the number of samples processed before the model weights are updated.

After talking with my teammates, I built a new model with a higher test accuracy: I removed one of the hidden Dense layers and increased the remaining hidden layer to 256 units.

Model4: 

In [76]:
model4 = Sequential()
model4.add(Embedding(10000, 100, input_length=40))
model4.add(Flatten())
model4.add(Dense(256, activation='relu'))
model4.add(Dense(2, activation='softmax'))
model4.summary()

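# Load the GloVe weights into the Embedding layer and turn off its trainable option before compiling, so the weights stay frozen during fitting.
# Note: model4.summary() above runs before the freeze, so it reports every parameter as trainable.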
model4.layers[0].set_weights([embedding_matrix])
model4.layers[0].trainable = False

model4.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model4.fit(preprocessor(X_train), y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Model: "sequential_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_16 (Embedding)    (None, 40, 100)           1000000   
                                                                 
 flatten_16 (Flatten)        (None, 4000)              0         
                                                                 
 dense_37 (Dense)            (None, 256)               1024256   
                                                                 
 dense_38 (Dense)            (None, 2)                 514       
                                                                 
Total params: 2,024,770
Trainable params: 2,024,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Save model to local ".onnx" file

In [77]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model4 = model_to_onnx(model4, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model4.onnx", "wb") as f:
    f.write(onnx_model4.SerializeToString())

Submit Model 4

In [78]:
#Submit Model 4: 

#-- Generate predicted y values (Model 4)
prediction_column_index=model4.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 4 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model4.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels,
                           custom_metadata={"team":"6"})

Insert search tags to help users find your model (optional): Bob Model 4
Provide any useful notes about your model (optional): Bob Model 4

Your model has been submitted as model version 113

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


## Discuss which models you tried and which models performed better and point out relevant hyper-parameter values for successful models.

My best model is now Model 4 (transfer learning with GloVe embeddings). Its training accuracy is also very high, and its test accuracy is around 0.72. The gap between the two suggests that the model overfits the training data. The relevant hyper-parameter values for this model are:

Embedding layer parameters:

*  input_dim: 10000, which is the size of the vocabulary of the input data.
*  output_dim: 100, which is the dimensionality of each word embedding (matching the 100-dimensional GloVe vectors).
*  input_length: 40, which is the length of the input sequences.

Hidden Dense layer parameters:

*  units: 256, which is the number of output units in the layer.
*  activation: 'relu', which is the activation function used by the layer.

Output Dense layer parameters:

*  units: 2, which is the number of output units in the layer (one per class).
*  activation: 'softmax', which is the activation function used by the layer.

Training parameters:

*  epochs: 10, which is the number of times the model iterates over the entire training dataset.
*  batch_size: 32, which is the number of samples processed before the model weights are updated.
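When training accuracy is very high but test accuracy is much lower, it can help to look at the per-epoch validation accuracy to see where the model stops improving. The sketch below shows one way to read that from a dictionary shaped like Keras's `history.history` (with the `'acc'` and `'val_acc'` keys that result from `metrics=['acc']` and a validation split). The helper names are hypothetical and the numbers are invented for illustration; they are not results from this notebook.

```python
def best_epoch(history_dict, metric="val_acc"):
    """Return (1-based epoch, value) where the given metric peaks."""
    values = history_dict[metric]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]


def train_val_gap(history_dict, epoch):
    """Training accuracy minus validation accuracy at a 1-based epoch."""
    return history_dict["acc"][epoch - 1] - history_dict["val_acc"][epoch - 1]


# Invented values shaped like history.history; NOT results from this notebook
toy_history = {
    "acc":     [0.60, 0.75, 0.88, 0.95, 0.99],
    "val_acc": [0.58, 0.66, 0.70, 0.69, 0.68],
}

epoch, value = best_epoch(toy_history)
gap = round(train_val_gap(toy_history, 5), 2)
print(epoch, value)  # 3 0.7
print(gap)           # 0.31
```

In this toy run the validation accuracy peaks at epoch 3 while training accuracy keeps rising, which is the pattern one would look for before deciding whether 10 epochs is more than a model needs.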