<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---

## Model Submission Guide: CapIQ-Industry Classification Competition
Let's share our models on a centralized leaderboard so that we can collaborate and learn from each other's model experiments.

**Instructions:**
1. Get data in and set up X_train / X_test / y_train
2. Preprocess data using Sklearn ColumnTransformer / write and save preprocessor function
3. Fit model on preprocessed data and save preprocessor function and model
4. Generate predictions from X_test data and submit model to competition
5. Repeat submission process to improve place on leaderboard



## 1. Get data in and set up X_train, X_test, y_train objects

In [None]:
#install aimodelshare library
! pip install aimodelshare --upgrade

In [2]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/capiq_industry_competition-repository:latest') 


Data downloaded successfully.


In [3]:
# Separate data into X_train, y_train, and X_test
import pandas as pd
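# read_csv no longer accepts squeeze=True (removed in pandas 2.0); squeeze the single column instead
y_train_labels = pd.read_csv("capiq_industry_competition/y_train.csv").squeeze("columns")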
y_train = pd.get_dummies(y_train_labels)

X_train = pd.read_csv("capiq_industry_competition/X_train.csv")
X_test=pd.read_csv("capiq_industry_competition/X_test.csv")

X_train.head()

| | Name | Symbol | Exchange | Rating | MarketCap | EnterpriseValue | Revenue | GrossProfit | EBITDA | EBIT | ... | CurrentAssets | ShortTermDebt | LTD_Cap_Leases | Leases_LongTerm | LongTermDebt | Liabilities | Liabilities_N_Equity | Debt_Current | Debt_NonCurrent | Debt_Net |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3M Company (NYSE:MMM) | NYSE:MMM | New York Stock Exchange (NYSE) | A+ | 99704.48 | 109611.58 | 31841.9 | 15447.6 | 8734.6 | 7966.3 | ... | 13272.0 | 353.7 | 1117.3 | 253.1 | 11248.2 | 31269.35 | 37798.6 | 1471.0 | 11501.3 | 9166.3 |
| 1 | AAR Corp. (NYSE:AIR) | NYSE:AIR | New York Stock Exchange (NYSE) | BB | 1131.61 | 1412.65 | 1816.2 | 272.93 | 129.26 | 103.655 | ... | 1014.52 | 0.06 | 37.37 | 13.08 | 328.11 | 1303.615 | 1760.14 | 37.43 | 341.19 | 276.71 |
| 2 | Abbott Laboratories (NYSE:ABT) | NYSE:ABT | New York Stock Exchange (NYSE) | AA- | 112198.8 | 121025.92 | 27682.23 | 15806.85 | 6757.35 | 5505.365 | ... | 20018.3 | 1243.9 | 608.1 | 261.3 | 15047.1 | 46540.7 | 60436.2 | 1852.0 | 15308.4 | 8373.9 |
| 3 | AbbVie Inc. (NYSE:ABBV) | NYSE:ABBV | New York Stock Exchange (NYSE) | BBB+ | 135936.95 | 166341.05 | 31122.8 | 23163.6 | 13948.2 | 12604.85 | ... | 22157.2 | 678.8 | 3891.5 | 186.2 | 37593.7 | 70162.0 | 71921.5 | 4570.3 | 37779.9 | 30633.7 |
| 4 | Adecoagro S.A. (NYSE:AGRO) | NYSE:AGRO | New York Stock Exchange (NYSE) | BB | 1021.64 | 1760.08 | 817.16 | 260.81 | 255.28 | 193.0 | ... | 623.58 | 7.6627 | 178.79 | 53.7321 | 595.08 | 1513.135 | 1949.59 | 186.449 | 648.81 | 606.12 |
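
The label setup above one-hot encodes the industry names with `pd.get_dummies`. The sketch below uses made-up labels (the industry names are hypothetical, not taken from the competition data) to show two details worth checking. The one-hot columns come out in sorted label order, and pandas 2.0 and later return boolean columns by default, so passing an explicit float dtype may be safer before handing the targets to Keras.

```python
import pandas as pd

# Hypothetical labels standing in for y_train_labels
labels = pd.Series(["Energy", "Banks", "Energy", "Retail"], name="industry")

# dtype="float32" avoids boolean columns on pandas >= 2.0
one_hot = pd.get_dummies(labels, dtype="float32")

print(list(one_hot.columns))  # columns follow sorted label order
print(one_hot.to_numpy())

# idxmax(axis=1) recovers the original label of each row
recovered = one_hot.idxmax(axis=1)
print(recovered.tolist())
```

The column order matters later in the guide, because predicted column indices are mapped back to label names through `y_train.columns`.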


## 2. Preprocess data using Sklearn / Write and Save Preprocessor function


In [4]:
# Simple Preprocessor with sklearn 
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

numeric_features = ['MarketCap', 'EnterpriseValue', 'Revenue', 'GrossProfit', 'EBITDA', 
                    'EBIT', 'NetIncome', 'Cash', 'PPnE', 'Assets', 'Debt', 'Equity', 
                    'Receivables', 'Inventory', 'CurrentAssets', 'ShortTermDebt', 
                    'LTD_Cap_Leases', 'Leases_LongTerm', 'LongTermDebt', 'Liabilities', 
                    'Liabilities_N_Equity', 'Debt_Current', 'Debt_NonCurrent', 'Debt_Net']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), 
    ('scaler', StandardScaler())])

categorical_features = ['Rating']
# Replace missing values with the most frequent (modal) value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Final preprocessor object set up with ColumnTransformer...

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# fit preprocessor to your data
preprocess = preprocess.fit(X_train)

In [5]:
# Here is where we actually write the preprocessor function:
def preprocessor(data):
    preprocessed_data=preprocess.transform(data)
    return preprocessed_data

In [6]:
# check shape of X data after preprocessing it using our new function
preprocessor(X_train).shape

(405, 40)
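
The 40 output columns are consistent with the 24 scaled numeric features plus one column per `Rating` category seen during fitting (which would imply 16 categories). A minimal sketch on a toy frame, with made-up values and only a subset of the column names, shows how the same pipeline layout determines the output width and how `handle_unknown='ignore'` treats a rating that was not seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with made-up values; includes one missing number and one missing rating
toy = pd.DataFrame({
    "Revenue": [10.0, 20.0, np.nan, 40.0],
    "EBIT": [1.0, 2.0, 3.0, 4.0],
    "Rating": ["A", "BB", "A", np.nan],
})

toy_preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["Revenue", "EBIT"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Rating"]),
])

out = toy_preprocess.fit_transform(toy)
print(out.shape)  # 2 numeric columns + 2 rating categories (A, BB)

# A rating that was never seen during fit becomes an all-zero one-hot block
new = pd.DataFrame({"Revenue": [30.0], "EBIT": [2.5], "Rating": ["CCC"]})
new_out = toy_preprocess.transform(new)
print(new_out[0, 2:])
```

Because the width depends on the categories present at fit time, the `input_dim` of any downstream network has to match the shape reported by the fitted preprocessor.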

## 3. Fit model on preprocessed data and save preprocessor function and model


In [7]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(16, input_dim=40, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(32, activation='relu'))

model.add(Dense(10, activation='softmax')) 
                                            
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Fitting the NN to the Training set
model.fit(preprocessor(X_train), y_train, 
               epochs = 15, validation_split=0.25) 

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7ffa3db42e10>

#### Save preprocessor function to local "preprocessor.zip" file

In [8]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [9]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

## 4. Generate predictions from X_test data and submit model to competition


In [10]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials

#This is the unique rest api that powers this Industry Classification Playground -- make sure to update the apiurl for new competition deployments
apiurl="https://o10cku0dzb.execute-api.us-east-1.amazonaws.com/prod/m"

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [11]:
#Instantiate Competition
import aimodelshare as ai
mycompetition= ai.Competition(apiurl)

In [12]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
#Note: Keras predict returns class probabilities; argmax(axis=1) picks the column index of the most probable class
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 5

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1752
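
The prediction step above turns the network's output into label names in two moves: take the column index of the largest probability in each row, then look that index up in the one-hot column names. A small sketch with a made-up probability matrix and hypothetical class names illustrates the mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical class names, in the same order as the one-hot target columns
class_names = pd.Index(["Banks", "Energy", "Retail"])

# Made-up softmax output for three rows (each row sums to 1)
probs = np.array([
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])

idx = probs.argmax(axis=1)          # most probable column per row
labels = class_names[idx].tolist()  # same mapping as the list comprehension
print(labels)  # ['Energy', 'Banks', 'Retail']
```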


In [13]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | transfer_learning | deep_learning | model_type | depth | num_params | dense_layers | softmax_act | relu_act | loss | optimizer | memory_size | username | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43.14% | 36.92% | 39.65% | 35.96% | sklearn | False | False | SVC | | 1800.0 | | | | | | | ML_Risk_Mgmnt | 2 |
| 1 | 39.22% | 34.51% | 38.56% | 33.24% | sklearn | False | False | SVC | | 1800.0 | | | | | | | ML_Risk_Mgmnt | 4 |
| 2 | 45.10% | 29.24% | 27.39% | 32.50% | sklearn | False | False | RandomForestClassifier | | | | | | | | | ML_Risk_Mgmnt | 3 |
| 3 | 35.29% | 21.47% | 22.70% | 24.72% | sklearn | False | False | RandomForestClassifier | | | | | | | | | ML_Risk_Mgmnt | 1 |
| 4 | 25.49% | 4.98% | 3.21% | 11.11% | keras | False | True | Sequential | 4.0 | 1802.0 | 4.0 | 1.0 | 3.0 | str | SGD | 8256.0 | ML_Risk_Mgmnt | 5 |
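
In the leaderboard above, the model with the highest accuracy (version 3) does not have the highest F1 score. The sketch below shows how that can happen on imbalanced classes. It assumes the leaderboard's F1, precision, and recall are macro-averaged, which this guide does not state, and it uses made-up labels rather than competition data.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: class "A" dominates, "B" and "C" are rare
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "A"]  # always predicts the majority class

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without a warning
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%} f1={f1:.2%}")
```

Accuracy rewards getting the common class right, while macro averaging gives every class equal weight, so a model that ignores rare industries can look acceptable on one metric and poor on the others.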


## 5. Repeat submission process to improve place on leaderboard


In [None]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model_2 = Sequential()
model_2.add(Dense(128, input_dim=40, activation='relu'))
model_2.add(Dropout(.3))
model_2.add(Dense(64, activation='relu'))
model_2.add(Dense(64, activation='relu'))
model_2.add(Dropout(.3))
model_2.add(Dense(64, activation='relu'))

model_2.add(Dense(10, activation='softmax')) 
                                            
# Compile model
model_2.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Fitting the NN to the Training set
model_2.fit(preprocessor(X_train), y_train, 
               epochs = 50, validation_split=0.25) 

In [15]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model_2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [16]:
#Submit Model 2: 

#-- Generate predicted y values (Model 2)
#Note: Keras predict returns class probabilities; argmax(axis=1) picks the column index of the most probable class
prediction_column_index=model_2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model2.onnx",
                                 prediction_submission=prediction_labels,
                                 preprocessor_filepath="preprocessor.zip")

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 6

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1752


In [17]:
# Compare two or more models
data=mycompetition.compare_models([5,6], verbose=1)
mycompetition.stylize_compare(data)

| | Model_5_Layer | Model_5_Shape | Model_5_Params | Model_6_Layer | Model_6_Shape | Model_6_Params |
|---|---|---|---|---|---|---|
| 0 | Dense | [None, 16] | 656.0 | Dense | [None, 128] | 5248 |
| 1 | Dense | [None, 16] | 272.0 | Dropout | [None, 128] | 0 |
| 2 | Dense | [None, 32] | 544.0 | Dense | [None, 64] | 8256 |
| 3 | Dense | [None, 10] | 330.0 | Dense | [None, 64] | 4160 |
| 4 | | | | Dropout | [None, 64] | 0 |
| 5 | | | | Dense | [None, 64] | 4160 |
| 6 | | | | Dense | [None, 10] | 650 |
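
The `Params` columns in the comparison above can be reproduced by hand: a Dense layer has one weight per input-output pair plus one bias per output unit, and Dropout has no trainable parameters. The helper names in this sketch are illustrative and not part of aimodelshare or Keras.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


def per_layer_params(input_dim, layer_units):
    """Parameter count of each Dense layer in a stack of Dense layers."""
    counts, n_in = [], input_dim
    for units in layer_units:
        counts.append(dense_params(n_in, units))
        n_in = units  # this layer's output feeds the next layer
    return counts


model_5 = per_layer_params(40, [16, 16, 32, 10])
model_6 = per_layer_params(40, [128, 64, 64, 64, 10])  # Dropout layers add 0

print(model_5, sum(model_5))  # [656, 272, 544, 330] 1802
print(model_6, sum(model_6))  # [5248, 8256, 4160, 4160, 650] 22474
```

The total for the first network, 1802, matches the `num_params` value reported for version 5 on the leaderboard.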


## Optional: Tune model within range of hyperparameters with Keras Tuner

*Simple example shown below. Consult [documentation](https://keras.io/guides/keras_tuner/getting_started/) to see full functionality.*

In [18]:
! pip install keras_tuner

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting keras_tuner
  Downloading keras_tuner-1.1.3-py3-none-any.whl (135 kB)
Collecting kt-legacy
  Downloading kt_legacy-1.0.4-py3-none-any.whl (9.6 kB)
Collecting jedi>=0.10
  Downloading jedi-0.18.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, kt-legacy, keras-tuner
Successfully installed jedi-0.18.1 keras-tuner-1.1.3 kt-legacy-1.0.4


In [19]:
#Separate validation data 
from sklearn.model_selection import train_test_split
x_train_split, x_val, y_train_split, y_val = train_test_split(
     X_train, y_train, test_size=0.2, random_state=42)

In [21]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, BatchNormalization
from keras.regularizers import l1, l2, l1_l2
import keras_tuner as kt


#Define model structure & parameter search space with function
def build_model(hp):
    model = keras.Sequential()
    model.add(Dense(64, input_dim=40, activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
    model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
    model.add(Dense(units=hp.Int("units", min_value=32, max_value=512, step=32), #range 32-512 inclusive, minimum step between tested values is 32
                    activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
    model.add(Dense(10, activation='softmax')) 
    model.compile(
        optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"],  # multi-class softmax output, so categorical rather than binary cross-entropy
    )
    return model

#initialize the tuner (which will search through parameters)
tuner = kt.RandomSearch(
    hypermodel=build_model, 
    objective="val_accuracy", # objective to optimize
    max_trials=3, #max number of trials to run during search
    executions_per_trial=3, #higher number reduces variance of results; gauges model performance more accurately 
    overwrite=True,
    directory="tuning_model",
    project_name="tuning_units",
)

tuner.search(preprocessor(x_train_split), y_train_split, epochs=1, validation_data=(preprocessor(x_val), y_val))

Trial 3 Complete [00h 00m 04s]
val_accuracy: 0.13580246766408285

Best val_accuracy So Far: 0.21810699502627054
Total elapsed time: 00h 00m 13s
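
The search above covers only a small part of the space it defines. The sketch below lists the candidate values implied by `hp.Int("units", min_value=32, max_value=512, step=32)` and counts the training runs implied by `max_trials` and `executions_per_trial`. The sampling line only mimics a random draw with Python's `random` module; it is not Keras Tuner's own sampler.

```python
import random

# Candidate values for the "units" hyperparameter: 32 to 512 inclusive, step 32
candidates = list(range(32, 512 + 1, 32))
print(len(candidates))  # 16

max_trials = 3
executions_per_trial = 3

# Illustrative random draw of three distinct candidates (seed is arbitrary)
rng = random.Random(0)
sampled = rng.sample(candidates, max_trials)
print(sampled)

# Each trial is trained executions_per_trial times and the scores are averaged
total_fits = max_trials * executions_per_trial
print(total_fits)  # 9
```

With three trials out of sixteen candidates and one epoch per run, the reported best `val_accuracy` is a rough signal at most; raising `max_trials` and `epochs` would give the tuner more to work with.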


In [22]:
# Build model with best hyperparameters

# Get the best hyperparameter sets from the search (up to 5, best first).
best_hps = tuner.get_best_hyperparameters(5)
# Build the model with the best hyperparameters.
tuned_model = build_model(best_hps[0])
# Fit on the entire training set.
tuned_model.fit(x=preprocessor(X_train), y=y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7ffa3bcdf950>

In [23]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(tuned_model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("tuned_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [24]:
#Submit Model 3: 

#-- Generate predicted y values (Model 3)
prediction_column_index=tuned_model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit to Competition Leaderboard
mycompetition.submit_model(model_filepath = "tuned_model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 7

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1752


In [25]:
# Get leaderboard

data = mycompetition.get_leaderboard()
mycompetition.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | transfer_learning | deep_learning | model_type | depth | num_params | dense_layers | dropout_layers | softmax_act | relu_act | loss | optimizer | memory_size | username | version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 43.14% | 36.92% | 39.65% | 35.96% | sklearn | False | False | SVC | | 1800.0 | | | | | | | | ML_Risk_Mgmnt | 2 |
| 1 | 39.22% | 34.51% | 38.56% | 33.24% | sklearn | False | False | SVC | | 1800.0 | | | | | | | | ML_Risk_Mgmnt | 4 |
| 2 | 45.10% | 29.24% | 27.39% | 32.50% | sklearn | False | False | RandomForestClassifier | | | | | | | | | | ML_Risk_Mgmnt | 3 |
| 3 | 35.29% | 21.47% | 22.70% | 24.72% | sklearn | False | False | RandomForestClassifier | | | | | | | | | | ML_Risk_Mgmnt | 1 |
| 4 | 35.29% | 17.07% | 17.69% | 21.11% | keras | False | True | Sequential | 7.0 | 22474.0 | 5.0 | 2.0 | 1.0 | 4.0 | str | SGD | 91232.0 | ML_Risk_Mgmnt | 6 |
| 5 | 21.57% | 7.31% | 5.75% | 14.39% | keras | False | True | Sequential | 4.0 | 25994.0 | 4.0 | | 1.0 | 3.0 | str | RMSprop | 105024.0 | ML_Risk_Mgmnt | 7 |
| 6 | 25.49% | 4.98% | 3.21% | 11.11% | keras | False | True | Sequential | 4.0 | 1802.0 | 4.0 | | 1.0 | 3.0 | str | SGD | 8256.0 | ML_Risk_Mgmnt | 5 |


In [26]:
# Compare two or more models
data=mycompetition.compare_models([5, 6, 7], verbose=1)
mycompetition.stylize_compare(data)

| | Model_5_Layer | Model_5_Shape | Model_5_Params | Model_6_Layer | Model_6_Shape | Model_6_Params | Model_7_Layer | Model_7_Shape | Model_7_Params |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Dense | [None, 16] | 656.0 | Dense | [None, 128] | 5248 | Dense | [None, 64] | 2624.0 |
| 1 | Dense | [None, 16] | 272.0 | Dropout | [None, 128] | 0 | Dense | [None, 64] | 4160.0 |
| 2 | Dense | [None, 32] | 544.0 | Dense | [None, 64] | 8256 | Dense | [None, 256] | 16640.0 |
| 3 | Dense | [None, 10] | 330.0 | Dense | [None, 64] | 4160 | Dense | [None, 10] | 2570.0 |
| 4 | | | | Dropout | [None, 64] | 0 | | | |
| 5 | | | | Dense | [None, 64] | 4160 | | | |
| 6 | | | | Dense | [None, 10] | 650 | | | |
