<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---

## Stanford Sentiment Treebank - Movie Review Classification Competition
Let's share our models on a centralized leaderboard so that we can collaborate and learn from the model experimentation process.

**Instructions:**
1. Get data in and set up X_train / X_test / y_train
2. Preprocess data using the Sklearn TF-IDF Vectorizer / write and save preprocessor function
3. Fit model on preprocessed data and save preprocessor function and model
4. Generate predictions from X_test data and submit model to competition
5. Repeat submission process to improve place on leaderboard



## 1. Get data in and set up X_train, X_test, y_train objects

In [None]:
#install aimodelshare library
! pip install aimodelshare==0.0.189

In [2]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/sst2_competition_data-repository:latest') 


Data downloaded successfully.


In [25]:
# Set up X_train, X_test, and y_train_labels objects
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=Warning)

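# squeeze("columns") returns each single-column file as a pandas Series
# (the read_csv squeeze=True argument was removed in pandas 2.0)
X_train=pd.read_csv("sst2_competition_data/X_train.csv").squeeze("columns")
X_test=pd.read_csv("sst2_competition_data/X_test.csv").squeeze("columns")

y_train_labels=pd.read_csv("sst2_competition_data/y_train_labels.csv").squeeze("columns")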

X_train.head()

0    The Rock is destined to be the 21st Century 's...
1    The gorgeously elaborate continuation of `` Th...
2    Singer/composer Bryan Adams contributes a slew...
3                 Yet the act is still charming here .
4    Whether or not you 're enlightened by any of D...
Name: text, dtype: object
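On pandas 2.0 and later, `read_csv` no longer accepts `squeeze=True`; calling `.squeeze("columns")` on the one-column DataFrame gives the same kind of Series. The sketch below shows this on an in-memory CSV; the CSV text is a made-up stand-in for the competition files.

```python
import io
import pandas as pd

# Hypothetical stand-in for a single-column file such as X_train.csv
csv_text = "text\nYet the act is still charming here .\nA toy second review .\n"

frame = pd.read_csv(io.StringIO(csv_text))  # one-column DataFrame
series = frame.squeeze("columns")           # collapse the column axis to get a Series

print(type(series).__name__, series.name, len(series))
```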

## 2. Preprocess data using Sklearn Tfidf Vectorizer / Write and Save Preprocessor function


In [9]:
# Build a Document-Term Matrix (DTM) out of words in the training set 
# Remove stop words that occur too frequently to be useful, and 
# Use Term Frequency - Inverse Document Frequency (TF-IDF) formula to weight by how common words are generally

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_simple = tf_idf_vectorizer.fit(X_train)
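To see what fitting the vectorizer on the training set implies, here is a minimal sketch on a toy corpus (the sentences are invented for illustration). The vocabulary is fixed at fit time, so a word that appears only in new data gets no column, and English stop words are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus and one unseen document (both hypothetical)
train_docs = ["the movie was charming", "the plot was dull", "a charming cast"]
new_docs = ["the soundtrack was charming"]  # "soundtrack" never appears in training

vectorizer = TfidfVectorizer(stop_words="english").fit(train_docs)
vocab = sorted(vectorizer.vocabulary_)             # terms learned from training only
row = vectorizer.transform(new_docs).toarray()[0]  # one row, one column per term

print(vocab)
print(dict(zip(vocab, row.round(3))))
```

Only the column for "charming" is nonzero in the new row, because it is the only term of the new document that exists in the training vocabulary.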

In [13]:
# Write function to transform data with preprocessor 
# New samples will be put into a DTM based on the vocabulary from the training set

def preprocessor(data):
  from sklearn.feature_extraction.text import TfidfVectorizer
  import numpy as np
  new_tfidf_df = tfidf_simple.transform(data)
  new_tfidf_df = new_tfidf_df.todense()
  return np.array(new_tfidf_df)

print(preprocessor(X_train).shape)
print(preprocessor(X_test).shape)

(6920, 13504)
(1821, 13504)
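Converting the sparse TF-IDF matrix to a dense array has a memory cost worth estimating before scaling up. The sketch below is plain arithmetic on the shapes printed above, assuming 8-byte float64 values (the default dtype of `TfidfVectorizer`); the helper name is hypothetical.

```python
def dense_megabytes(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense numeric array in megabytes (10**6 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e6

# Shapes taken from the preprocessor output above
train_mb = dense_megabytes(6920, 13504)
test_mb = dense_megabytes(1821, 13504)

print(f"train: {train_mb:.0f} MB, test: {test_mb:.0f} MB")
```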


## 3. Fit model on preprocessed data and save preprocessor function and model


In [15]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
model.fit(preprocessor(X_train), y_train_labels) # Fitting to the training set.
model.score(preprocessor(X_train), y_train_labels) # Accuracy on the training set, 0-1 scale.

0.5393063583815029
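A training accuracy of about 0.54 is easier to judge next to the accuracy of always predicting the most common class. The following sketch computes that baseline; the label list is invented for illustration, and on the real data you could pass `y_train_labels` instead.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical label distribution, not the competition data
toy_labels = ["positive"] * 52 + ["negative"] * 48
baseline = majority_baseline_accuracy(toy_labels)

print(baseline)
```

A model whose score sits close to this baseline has learned little beyond the class balance.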

#### Save preprocessor function to local "preprocessor.zip" file

In [16]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [17]:
# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

# Count the preprocessed input features to declare the ONNX input shape
from skl2onnx.common.data_types import FloatTensorType

feature_count=preprocessor(X_test).shape[1] #Get count of preprocessed features
initial_type = [('float_input', FloatTensorType([None, feature_count]))]  # Insert correct number of preprocessed features

onnx_model = model_to_onnx(model, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
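The `FloatTensorType([None, feature_count])` declaration describes a 32-bit float input with any number of rows and exactly `feature_count` columns. The sketch below is a hypothetical helper (not part of aimodelshare or skl2onnx) that casts a batch to float32 and checks its width before it is passed to an ONNX runtime.

```python
import numpy as np

def to_onnx_batch(features, feature_count):
    """Cast features to float32 and verify the 2-D shape [rows, feature_count]."""
    batch = np.asarray(features, dtype=np.float32)
    if batch.ndim != 2 or batch.shape[1] != feature_count:
        raise ValueError(
            f"expected shape (n, {feature_count}), got {batch.shape}"
        )
    return batch

# Toy float64 input with 3 features per row (hypothetical values)
batch = to_onnx_batch([[0.1, 0.0, 0.9], [0.0, 0.5, 0.5]], feature_count=3)
print(batch.dtype, batch.shape)
```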

## 4. Generate predictions from X_test data and submit model to competition


In [19]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://rlxjxnoql9.execute-api.us-east-1.amazonaws.com/prod/m" #This is the unique REST API endpoint that powers this Movie Review Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [20]:
#Instantiate Competition
import aimodelshare as ai
mycompetition= ai.Competition(apiurl)

In [26]:
#Submit Model 1: 

#-- Generate predicted values (a list of predicted labels "positive" or "negative") (Model 1)
prediction_labels = model.predict(preprocessor(X_test))

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 11

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
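Before submitting, it can help to confirm that the prediction list has one label per test row and contains only the expected label names. This is a minimal sketch with a hypothetical helper; the label names come from the comment in the submission cell, and 1821 is the number of test rows printed earlier.

```python
def check_predictions(predictions, expected_count, allowed=("positive", "negative")):
    """Return predictions as a list after checking length and label names."""
    labels = list(predictions)
    if len(labels) != expected_count:
        raise ValueError(f"expected {expected_count} labels, got {len(labels)}")
    unexpected = sorted(set(labels) - set(allowed))
    if unexpected:
        raise ValueError(f"unexpected labels: {unexpected}")
    return labels

# Toy predictions standing in for model.predict(preprocessor(X_test))
toy_predictions = ["positive", "negative"] * 910 + ["positive"]
checked = check_predictions(toy_predictions, expected_count=1821)
print(len(checked))
```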


In [27]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | deep_learning | model_type | depth | num_params | embedding_layers | flatten_layers | lstm_layers | dense_layers | softmax_act | tanh_act | loss | optimizer | memory_size | username | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76.73% | 76.48% | 77.92% | 76.74% | keras | True | Sequential | 3.0 | 161282.0 | 1.0 | 1.0 | | 1.0 | 1.0 | | str | RMSprop | 645576.0 | AIModelShare | 1 |
| 1 | 71.02% | 71.02% | 71.03% | 71.02% | keras | True | Sequential | 4.0 | 460034.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | str | RMSprop | 1840960.0 | AIModelShare | 3 |
| 2 | 68.83% | 68.55% | 69.52% | 68.84% | keras | True | Sequential | 5.0 | 174658.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | str | RMSprop | 699864.0 | AIModelShare | 2 |
| 3 | 61.91% | 59.27% | 66.14% | 61.94% | sklearn | | GradientBoostingClassifier | | | | | | | | | | | | AIModelShare | 7 |
| 4 | 53.68% | 41.25% | 74.55% | 53.73% | sklearn | | RandomForestClassifier | | | | | | | | | | | | AIModelShare | 5 |
| 5 | 54.67% | 43.97% | 70.09% | 54.71% | sklearn | | RandomForestClassifier | | | | | | | | | | | | AIModelShare | 6 |
| 6 | 51.59% | 37.04% | 72.45% | 51.64% | sklearn | | RandomForestClassifier | | | | | | | | | | | | AIModelShare | 4 |
| 7 | 51.48% | 36.81% | 72.24% | 51.53% | sklearn | | RandomForestClassifier | | | | | | | | | | | | mikedparrott | 10 |
| 8 | 51.48% | 36.81% | 72.24% | 51.53% | sklearn | | RandomForestClassifier | | | | | | | | | | | | mikedparrott | 11 |
| 9 | 50.16% | 50.16% | 50.16% | 50.16% | unknown | | unknown | | | | | | | | | | | | AIModelShare | 8 |
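Since the leaderboard comes back as a pandas data frame, it can be re-ranked locally. The sketch below assumes the `accuracy` column holds percent strings as displayed above (the real column type may differ) and rebuilds three rows by hand rather than calling the API.

```python
import pandas as pd

# Three rows copied from the leaderboard output; assumed to be percent strings
leaderboard = pd.DataFrame({
    "accuracy": ["53.68%", "54.67%", "76.73%"],
    "model_type": ["RandomForestClassifier", "RandomForestClassifier", "Sequential"],
    "version": [5, 6, 1],
})

# Convert "76.73%" to 76.73 so the column sorts numerically
leaderboard["accuracy_pct"] = leaderboard["accuracy"].str.rstrip("%").astype(float)
ranked = leaderboard.sort_values("accuracy_pct", ascending=False).reset_index(drop=True)

print(ranked[["version", "model_type", "accuracy_pct"]])
```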


## 5. Repeat submission process to improve place on leaderboard


In [28]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=500, max_depth=5, random_state=0)
model.fit(preprocessor(X_train), y_train_labels) # Fitting to the training set.
model.score(preprocessor(X_train), y_train_labels) # Accuracy on the training set, 0-1 scale.

0.5570809248554913

In [None]:
# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

feature_count=preprocessor(X_test).shape[1] #Get count of preprocessed features
initial_type = [('float_input', FloatTensorType([None, feature_count]))]  # Insert correct number of preprocessed features

onnx_model = model_to_onnx(model, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
#Submit Model 2: 

#-- Generate predicted values (a list of predicted labels "positive" or "negative") (Model 2)
prediction_labels = model.predict(preprocessor(X_test))

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 5

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [29]:
# Compare two or more models
data=mycompetition.compare_models([4, 5], verbose=1)
mycompetition.stylize_compare(data)

| | param_name | default_value | model_version_4 | model_version_5 |
|---|---|---|---|---|
| 0 | bootstrap | True | True | True |
| 1 | ccp_alpha | 0.000000 | 0.000000 | 0.000000 |
| 2 | class_weight | | | |
| 3 | criterion | gini | gini | gini |
| 4 | max_depth | | 3 | 5 |
| 5 | max_features | auto | auto | auto |
| 6 | max_leaf_nodes | | | |
| 7 | max_samples | | | |
| 8 | min_impurity_decrease | 0.000000 | 0.000000 | 0.000000 |
| 9 | min_impurity_split | | | |
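A similar comparison can be made locally for scikit-learn models before submitting. The sketch below is not how `compare_models` works internally; it is a hypothetical helper that uses the standard `get_params()` method to list only the hyperparameters that differ between two estimators.

```python
from sklearn.ensemble import RandomForestClassifier

def differing_params(model_a, model_b):
    """Map each hyperparameter that differs to its (model_a, model_b) values."""
    params_a, params_b = model_a.get_params(), model_b.get_params()
    return {
        name: (params_a[name], params_b[name])
        for name in params_a
        if name in params_b and params_a[name] != params_b[name]
    }

# The two random forest configurations used in this notebook
first = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
second = RandomForestClassifier(n_estimators=500, max_depth=5, random_state=0)

diff = differing_params(first, second)
print(diff)
```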







In [None]:
# Submit a third model using GridSearchCV

from sklearn.model_selection import GridSearchCV
import numpy as np

# Note: np.arange(start, stop, step) with a step of 500 yields only [100] here,
# so a single n_estimators value is searched; use e.g. [100, 300, 500] to try more values
param_grid = {'n_estimators': np.arange(100, 300, 500),'max_depth':[1, 3, 5]}

gridmodel = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=10)

# Fit the grid search; it cross-validates every parameter combination
gridmodel.fit(preprocessor(X_train), y_train_labels)

# Extract the best score and parameters from the attributes "best_score_" and "best_params_"
print("best mean cross-validation score: {:.3f}".format(gridmodel.best_score_))
print("best parameters: {}".format(gridmodel.best_params_))


best mean cross-validation score: 0.549
best parameters: {'max_depth': 5, 'n_estimators': 100}
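With `np.arange(start, stop, step)`, a step larger than the distance from start to stop produces a single value, so the search above tries only `n_estimators=100`. The sketch below contrasts that call with one possible range that covers three values; the second range is an illustrative choice, not one taken from the notebook.

```python
import numpy as np

single = np.arange(100, 300, 500)  # step exceeds the range: only the start value
spread = np.arange(100, 600, 200)  # illustrative alternative with three values

print(single.tolist())
print(spread.tolist())
```

Widening the grid multiplies the number of fits: three `n_estimators` values times three `max_depth` values times `cv=10` gives 90 model fits.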


In [None]:
# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx


feature_count=preprocessor(X_test).shape[1] #Get count of preprocessed features
initial_type = [('float_input', FloatTensorType([None, feature_count]))]  # Insert correct number of preprocessed features

onnx_model = model_to_onnx(gridmodel, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("gridmodel.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [None]:
#Submit Model 3: 

#-- Generate predicted values (a list of predicted labels "positive" or "negative") (Model 3)
prediction_labels = gridmodel.predict(preprocessor(X_test))

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "gridmodel.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 6

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763


In [None]:
# Get leaderboard

data = mycompetition.get_leaderboard()
mycompetition.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | transfer_learning | deep_learning | model_type | depth | num_params | embedding_layers | flatten_layers | lstm_layers | dense_layers | softmax_act | tanh_act | loss | optimizer | memory_size | username | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76.73% | 76.48% | 77.92% | 76.74% | keras | False | True | Sequential | 3.0 | 161282.0 | 1.0 | 1.0 | | 1.0 | 1.0 | | str | RMSprop | 645576.0 | AIModelShare | 1 |
| 1 | 71.02% | 71.02% | 71.03% | 71.02% | keras | False | True | Sequential | 4.0 | 460034.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | str | RMSprop | 1840960.0 | AIModelShare | 3 |
| 2 | 68.83% | 68.55% | 69.52% | 68.84% | keras | False | True | Sequential | 5.0 | 174658.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | str | RMSprop | 699864.0 | AIModelShare | 2 |
| 3 | 53.68% | 41.25% | 74.55% | 53.73% | sklearn | False | False | RandomForestClassifier | | | | | | | | | | | | AIModelShare | 5 |
| 4 | 54.67% | 43.97% | 70.09% | 54.71% | sklearn | False | False | RandomForestClassifier | | | | | | | | | | | | AIModelShare | 6 |
| 5 | 51.59% | 37.04% | 72.45% | 51.64% | sklearn | False | False | RandomForestClassifier | | | | | | | | | | | | AIModelShare | 4 |


In [None]:
# Compare two or more models
data=mycompetition.compare_models([4, 5, 6], verbose=1)
mycompetition.stylize_compare(data)

| | param_name | default_value | model_version_4 | model_version_5 | model_version_6 |
|---|---|---|---|---|---|
| 0 | bootstrap | True | True | True | True |
| 1 | ccp_alpha | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | class_weight | | | | |
| 3 | criterion | gini | gini | gini | gini |
| 4 | max_depth | | 3 | 5 | 5 |
| 5 | max_features | auto | auto | auto | auto |
| 6 | max_leaf_nodes | | | | |
| 7 | max_samples | | | | |
| 8 | min_impurity_decrease | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 9 | min_impurity_split | | | | |







In [None]:
# Here are several classic ML architectures you can choose from to experiment with next:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier


#Example code to fit model:
model = GradientBoostingClassifier(n_estimators=50, learning_rate=1.0,
    max_depth=1, random_state=0).fit(preprocessor(X_train), y_train_labels)
model.score(preprocessor(X_train), y_train_labels)

# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx


feature_count=preprocessor(X_test).shape[1] #Get count of preprocessed features
initial_type = [('float_input', FloatTensorType([None, feature_count]))]  # Insert correct number of preprocessed features

onnx_model = model_to_onnx(model, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

#-- Generate predicted values (a list of predicted labels "positive" or "negative")
prediction_labels = model.predict(preprocessor(X_test))

# Submit model to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)


Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 7

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:2763
