<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---




<p align="center"><h1 align="center">Quick Start: Titanic Tabular Classification Tutorial</h1> 

---

<h3 align="center">(Deploy model to an AI Model Share Model Playground REST API<br> and Web Dashboard in five easy steps...)</h3></p>
<p align="center"><img width="100%" src="https://aimodelsharecontent.s3.amazonaws.com/aimstutorialsteps.gif" /></p>


---



## **Credential Configuration**

In order to deploy an AI Model Share Model Playground, you will need a credentials text file. 

Generating your credentials file requires two sets of information: 
1. Your AI Model Share username and password (create them [HERE](https://www.modelshare.org/login)). 
2. Your AWS (Amazon Web Services) access keys (follow the tutorial [HERE](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html)). 

You only need to generate your credentials file once. After running the configure function below, save the output file for all your future Model Playground deployments and competition submissions. 

*Note: Handle your credentials file with the same level of security as your passwords. Do not share the file with anyone, send it by email, or upload it to GitHub.*
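
One way to follow this advice on Linux or macOS is to restrict the file so that only your own user account can read it. The sketch below is an illustration and not part of the aimodelshare library: `restrict_to_owner` is a hypothetical helper name, and it assumes a POSIX file system (on Windows, `os.chmod` only toggles the read-only flag).

```python
import os
import stat


def restrict_to_owner(path):
    """Make `path` readable and writable by its owner only (mode 0o600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(path).st_mode)


# Apply it once the credentials file written by configure_credentials() exists.
if os.path.exists("credentials.txt"):
    restrict_to_owner("credentials.txt")
```

Listing `credentials.txt` in your repository's `.gitignore` also helps keep it out of version control.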


In [None]:
#install aimodelshare library
! pip install aimodelshare-nightly

In [None]:
# Generate credentials file
import aimodelshare as ai 
from aimodelshare.aws import configure_credentials 

configure_credentials()

AI Modelshare Username:··········
AI Modelshare Password:··········
AWS_ACCESS_KEY_ID:··········
AWS_SECRET_ACCESS_KEY:··········
AWS_REGION:··········
Configuration successful. New credentials file saved as 'credentials.txt'


## **Set up Environment**

Use your credentials file to set your credentials for all aimodelshare functions. 

In [None]:
# Set credentials 
from aimodelshare.aws import set_credentials

set_credentials(credential_file="credentials.txt", type="deploy_model")

AI Model Share login credentials set successfully.
AWS credentials set successfully.


In [None]:
# Get materials for tutorial
import aimodelshare as ai
X_train, X_test, y_train, y_test, example_data, y_test_labels = ai.import_quickstart_data("titanic")

Downloading [===>                                             ]

Data downloaded successfully.

Preparing downloaded files for use...

Success! Your Quick Start materials have been downloaded. 
You are now ready to run the tutorial.


## **(1) Preprocessor Function & Setup**

### **Generate Pyspark Dataframe**

In [None]:
import pyspark

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml.feature import Imputer, VectorAssembler
from pyspark.ml.feature import StandardScaler, IndexToString

from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType
from pyspark.sql.functions import col

# initiate spark session
spark = SparkSession \
    .builder \
    .appName('Titanic Data') \
    .getOrCreate()

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

numeric_columns = X_train.select_dtypes(include=numerics).columns.to_list()
categorical_columns = X_train.select_dtypes(exclude=numerics).columns.to_list()

# load the data
training_data = (
    spark.read \
    .csv("titanic_competition_data/training_data.csv", header=True)
)

# spark.read.csv loads every column as a string by default, so cast the numeric columns to float
for column in numeric_columns:
    training_data = training_data.withColumn(column, col(column).cast(FloatType()))
    
# load the data
test_data = (
    spark.read \
    .csv("titanic_competition_data/test_data.csv", header=True)
)

# spark.read.csv loads every column as a string by default, so cast the numeric columns to float
for column in numeric_columns:
    test_data = test_data.withColumn(column, col(column).cast(FloatType()))

### **Write a Preprocessor Function**


> ###   Preprocessor functions are used to preprocess data into the precise format your model requires to generate predictions.  

*  *Preprocessor functions should always be named "preprocessor".*
*  *You can use any Python library in a preprocessor function, but all libraries should be imported inside your preprocessor function.*  
*  *For tabular prediction models, the function should accept at least one input: an unpreprocessed pandas DataFrame.*  
*  *Any categorical features should be preprocessed to one hot encoded numeric values.* 
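
As a rough illustration of these rules, here is a minimal pandas-only sketch. It is not the preprocessor used in this tutorial (that one is defined further down and wraps the PySpark pipeline). The column names `age`, `fare`, and `sex` and the fixed category levels are assumptions chosen for the example.

```python
import pandas as pd


def preprocessor(data):
    # All imports live inside the function so it stays self-contained when exported.
    import numpy as np
    import pandas as pd

    # Hypothetical columns and category levels, fixed up front so every call
    # produces the same output columns in the same order.
    numeric_cols = ["age", "fare"]
    categories = {"sex": ["female", "male"]}

    out = pd.DataFrame(index=data.index)
    for name in numeric_cols:
        values = pd.to_numeric(data[name], errors="coerce")
        # For brevity the median comes from the incoming data; a real
        # preprocessor would reuse the median learned from the training set.
        out[name] = values.fillna(values.median())
    for name, levels in categories.items():
        for level in levels:
            # One numeric 0/1 column per category level (one-hot encoding).
            out[f"{name}_{level}"] = (data[name] == level).astype("float32")
    return out.to_numpy(dtype=np.float32)


# Small made-up sample with one missing age.
sample = pd.DataFrame({
    "age": [22.0, None, 30.0],
    "fare": [7.25, 8.05, 71.28],
    "sex": ["male", "female", "male"],
})
result = preprocessor(sample)
print(result.shape)
```

The function takes an unpreprocessed pandas DataFrame and returns a purely numeric array, which is the contract the rules above describe.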


In [None]:
# create the preprocessing pipelines for both numeric and categorical data
imputed_numeric_features = ["imputed_" + x for x in numeric_columns]
imputed_categorical_features = ["imputed_" + x for x in categorical_columns]
indexed_categorical_features = ["indexed_" + x for x in categorical_columns]
one_hot_categorical_features = ["one_hot_" + x for x in categorical_columns]
features =  imputed_numeric_features + one_hot_categorical_features

preprocess = Pipeline(stages=[
    Imputer(
        strategy='median',
        inputCols=numeric_columns,
        outputCols=imputed_numeric_features
    ),
    StringIndexer(
        inputCols=categorical_columns, 
        outputCols=indexed_categorical_features, 
        handleInvalid='keep'
    ),
    Imputer(
        strategy='mode',
        inputCols=indexed_categorical_features,
        outputCols=imputed_categorical_features
    ),
    OneHotEncoder(
        inputCols=imputed_categorical_features, 
        outputCols=one_hot_categorical_features,
        dropLast=False
    ),
    VectorAssembler(
        inputCols=features,
        outputCol='features'
    ),
    StandardScaler(
        inputCol='features',
        outputCol='scaled_features',
        withStd=True,
        withMean=False
    ),

])

label_indexer = StringIndexer(
    inputCol='survived', 
    outputCol='indexed_label', 
    handleInvalid='skip'
)

# Main preprocessor
preprocess_model = preprocess.fit(training_data)

# To convert float label into string label
label_indexer_model = label_indexer.fit(training_data)

The current prediction API runtime only accepts pandas DataFrames, so we need two kinds of preprocessor functions: one for pandas DataFrames (used by the prediction API runtime) and one for PySpark DataFrames (used in this notebook).


-- Here is where we actually write the preprocessor function:


In [None]:
# Write function to transform data with preprocessor

# Prediction API runtime preprocessor function for onnx model
# the input is pandas dataframe
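def preprocessor(df_data):
    # import everything the function uses here so the exported function is self-contained
    import numpy as np
    import os
    import tempfile
    from pyspark.sql import SparkSession
    from pyspark.sql.types import FloatType
    from pyspark.sql.functions import col
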
    # initiate spark session
    spark = SparkSession.builder \
            .master("local[1]") \
            .appName("Preprocessor Titanic Data") \
            .getOrCreate()

    # load the data
    # we can only use /tmp in lambda
    # https://aws.amazon.com/blogs/compute/choosing-between-aws-lambda-data-storage-options-in-web-apps/
    temp_dir = tempfile.gettempdir()
    temp_csv_path = temp_dir + "/temp_preprocessor_data.csv"
    df_data.to_csv(temp_csv_path)
    df_data = spark.read.csv(temp_csv_path, header=True)
    
    for i, column in enumerate(numeric_columns):
        df_data = df_data.withColumn(column, col(column).cast(FloatType()))

    preprocessed_data = preprocess_model.transform(df_data)
    
    def to_array(x):
        return x[0].toArray().astype(np.float32)

    input_data = preprocessed_data.select('scaled_features').toPandas().values

    input_data = np.apply_along_axis(to_array, 1, input_data)

    os.remove(temp_csv_path)
    
    return input_data

# For pyspark training and testing code
# the input is spark dataframe
def preprocess_training(data):
    return label_indexer_model.transform(preprocess_model.transform(data))

# test data doesn't contain label
def preprocess_test(data):
    return preprocess_model.transform(data)

In [None]:
# check data after preprocessing it using our new function
preprocess_training(training_data).show()

+------+------+------+-------+--------+--------+--------------+-----------+------------+-----------+----------------+-----------+----------------+-------------+----------------+--------------------+--------------------+-------------+
|pclass|   sex|   age|   fare|embarked|survived|imputed_pclass|imputed_age|imputed_fare|indexed_sex|indexed_embarked|imputed_sex|imputed_embarked|  one_hot_sex|one_hot_embarked|            features|     scaled_features|indexed_label|
+------+------+------+-------+--------+--------+--------------+-----------+------------+-----------+----------------+-----------+----------------+-------------+----------------+--------------------+--------------------+-------------+
|   1.0|  male|  44.0|   90.0|       Q|    died|           1.0|       44.0|        90.0|        0.0|             2.0|        0.0|             2.0|(2,[0],[1.0])|   (4,[2],[1.0])|[1.0,44.0,90.0,1....|[1.19848275660349...|          0.0|
|   3.0|  male|  null|   7.75|       Q|    died|           3.0| 

## **(2) Build Model Using pyspark (or Your Preferred ML Library)**

### **Logistic Regression with L1 Regularization (Lasso)**

In [None]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# instantiate pyspark estimator
lr_1 = Pipeline(stages=[
    LogisticRegression(
        regParam=10, 
        elasticNetParam=1.0, 
        featuresCol='scaled_features', 
        labelCol='indexed_label',
        predictionCol='indexed_prediction'
    ),
    IndexToString(
        inputCol="indexed_prediction", 
        outputCol="prediction", 
        labels=label_indexer_model.labels,
    )
])

# Fit the model
model = lr_1.fit(preprocess_training(training_data))

# Evaluate our model (note: this measures accuracy on the training data)
predictions = model.transform(preprocess_training(training_data))

evaluator = MulticlassClassificationEvaluator(
    labelCol='indexed_label', 
    predictionCol='indexed_prediction', 
    metricName='accuracy')
accuracy = evaluator.evaluate(predictions)
accuracy

0.6198662846227316
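
To make the roles of `regParam` and `elasticNetParam` concrete, the sketch below computes the elastic net penalty in the form given by the Spark MLlib documentation: `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The function name `elastic_net_penalty` and the example weights are made up for illustration. Spark applies this term internally during fitting, so nothing here needs to be called in the tutorial.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss for a given weight vector."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1.0 - elastic_net_param) / 2.0 * l2_norm_squared
    )


example_weights = [1.0, -2.0]  # made-up coefficients

# elasticNetParam=1.0 keeps only the L1 (Lasso) term.
lasso_penalty = elastic_net_penalty(example_weights, reg_param=10, elastic_net_param=1.0)

# elasticNetParam=0.0 keeps only the L2 (Ridge) term.
ridge_penalty = elastic_net_penalty(example_weights, reg_param=10, elastic_net_param=0.0)

print(lasso_penalty, ridge_penalty)
```

With `regParam=10` the penalty is large relative to typical coefficient sizes, so heavily shrunken coefficients are to be expected.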

## **(3) Save Preprocessor**
### Saves preprocessor function to "preprocessor.zip" file

In [None]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

In [None]:
#  Now let's import and test the preprocessor function to see if it is working...

import aimodelshare as ai
prep=ai.import_preprocessor("preprocessor.zip")

# check the data after preprocessing 
prep(X_test).shape

## **(4) Save pyspark model to ONNX File Format**


In [None]:
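temp = preprocess_model.transform(training_data)

# fetch a single row once instead of collecting the whole dataset on every iteration
first_row = temp.first()

one_hot_feature_count = 0
for feature_name in one_hot_categorical_features:
    one_hot_feature_count += len(first_row[feature_name])  # size of the one-hot vector

# total length of the assembled feature vector
feature_count = len(imputed_numeric_features) + one_hot_feature_count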
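
As a sanity check, the same arithmetic can be done by hand from the preprocessing output shown earlier, where `one_hot_sex` prints as `(2,[0],[1.0])` and `one_hot_embarked` as `(4,[2],[1.0])`. The sketch below hard-codes those sizes, so it is only valid for this dataset and this pipeline.

```python
# Vector sizes read off the preprocessing output shown earlier in this tutorial.
one_hot_sizes = {"one_hot_sex": 2, "one_hot_embarked": 4}

# The three numeric columns after imputation.
numeric_features = ["imputed_pclass", "imputed_age", "imputed_fare"]

expected_feature_count = len(numeric_features) + sum(one_hot_sizes.values())
print(expected_feature_count)  # 9
```

The result matches the `num_params` value of 9.0 that the leaderboard reports for the pyspark models later in this tutorial.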

In [None]:
# Save pyspark model to local ONNX file
from onnxmltools.convert.common.data_types import FloatTensorType
from aimodelshare.aimsonnx import model_to_onnx

# specify initial types of predictors
initial_types = [('scaled_features', FloatTensorType([None, feature_count]))]

# transform pyspark model to ONNX
framework = 'pyspark'
onnx_model = model_to_onnx(model, framework, initial_types=initial_types,
                           spark_session=spark, transfer_learning=False, 
                           deep_learning=False)

# Save model to local .onnx file
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

## **(5) Create your Model Playground and Deploy REST API/ Live Web-Application**

In [None]:
#Set up arguments for Model Playground deployment
import pandas as pd 

model_filepath = "model.onnx"
preprocessor_filepath = "preprocessor.zip"
exampledata = example_data

In [None]:
from aimodelshare import ModelPlayground

#Instantiate ModelPlayground() Class

myplayground=ModelPlayground(model_type="tabular", classification=True, private=False)

# Create Model Playground (generates live rest api and web-app for your model/preprocessor)

myplayground.deploy(model_filepath, preprocessor_filepath, y_train, exampledata, pyspark_support=True, timeout=60) 

## **Use your new Model Playground!**

Follow the link in the output above to:
- Generate predictions with your interactive web dashboard
- Access example code in Python, R, and Curl

Or, follow the rest of the tutorial to create a competition for your Model Playground and: 
- Access verified model performance metrics 
- Upload multiple models to a leaderboard 
- Easily compare model performance & structure 

## **Part 2: Create a Competition**

-------

After deploying your Model Playground, you can now create a competition. 

Creating a competition allows you to:
1. Verify the model performance metrics on aimodelshare.org
2. Submit models to a leaderboard
3. Grant access to other users to submit models to the leaderboard
4. Easily compare model performance and structure 

## Define Custom Evaluation Metrics (optional)

In [None]:
# Eval metrics can be defined from scratch or use functions from other libraries (e.g., sklearn)
# If you want multiple custom metrics, please make sure that the function returns a dict
# The keys of the dict will be used as column identifiers in the leaderboard

def custom_eval_metric(y_true, y_pred): 

  from sklearn.metrics import balanced_accuracy_score
  from sklearn.metrics import f1_score

  bal_acc = balanced_accuracy_score(y_true, y_pred)
  f1_weighted = f1_score(y_true, y_pred, average='weighted')

  metrics = {"f1_weighted": f1_weighted ,
             "balanced_accuracy": bal_acc}

  return metrics

In [None]:
# Export custom evaluation function to zip file
from aimodelshare.custom_eval_metrics import export_eval_metric
export_eval_metric(custom_eval_metric, '', 'custom_eval')

Your eval_metric is now saved to 'custom_eval.zip'
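
The comments above note that metrics can also be defined from scratch. The sketch below shows what that might look like without scikit-learn. `scratch_eval_metric` and its dictionary keys are hypothetical names, and the labels are made up for the example.

```python
def scratch_eval_metric(y_true, y_pred):
    """Return several metrics as a dict; the keys would become leaderboard columns."""
    pairs = list(zip(y_true, y_pred))

    # Plain accuracy: share of predictions that match the true label.
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

    # Balanced accuracy: mean of the per-class recall values.
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t, _ in pairs if t == label)
        hits = sum(1 for t, p in pairs if t == label and p == label)
        recalls.append(hits / total)
    balanced = sum(recalls) / len(recalls)

    return {"plain_accuracy": accuracy, "balanced_accuracy_scratch": balanced}


toy_true = ["died", "died", "died", "survived"]
toy_pred = ["died", "died", "survived", "survived"]
toy_metrics = scratch_eval_metric(toy_true, toy_pred)
print(toy_metrics)
```

Here two of the three "died" cases and the single "survived" case are predicted correctly, giving an accuracy of 0.75 and a balanced accuracy of (2/3 + 1) / 2.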


### **Create Competition**

In [None]:
# Create list of authorized participants for competition
# Note that participants should use the same email address when creating their modelshare.org account
emaillist=["pra2118@columbia.edu"]

In [None]:
# Create Competition
myplayground.create_competition(data_directory='titanic_competition_data', 
                                y_test = y_test_labels,
                                eval_metric_filepath = 'custom_eval.zip',
                                email_list=emaillist)

custom_eval.zip

--INPUT COMPETITION DETAILS--

Enter competition name:test
Enter competition description:

--INPUT DATA DETAILS--

Note: (optional) Save an optional LICENSE.txt file in your competition data directory to make users aware of any restrictions on data sharing/usage.

Enter data description (i.e.- filenames denoting training and test data, file types, and any subfolders where files are stored):
Enter optional data license descriptive name (e.g.- 'MIT, Apache 2.0, CC0, Other, etc.'):
Uploading your data. Please wait for a confirmation message.

 Success! Model competition created. 

You may now update your prediction API runtime model and verify evaluation metrics with the update_runtime_model() function.

To upload new models and/or preprocessors to this API, team members should use 
the following credentials:

apiurl='https://xa6ln3rdr8.execute-api.us-east-1.amazonaws.com/prod/m'
from aimodelshare.aws import set_credentials
set_credentials(apiurl=apiurl)

They can then su

In [None]:
#Instantiate Competition
#--Note: If you start a new session, the first argument should be the Model Playground url in quotes. 
#--e.g.- mycompetition= ai.Competition("https://2121212.execute-api.us-east-1.amazonaws.com/prod/m")
#See Model Playground "Compete" tab for example model submission code.
mycompetition= ai.Competition(myplayground.playground_url)

In [None]:

# Add, remove, or completely update authorized participants for competition later
emaillist=["emailaddress4@email.com"]

mycompetition.update_access_list(email_list=emaillist,update_type="Add")

['pra2118@columbia.edu', 'pra2118@columbia.edu', 'emailaddress4@email.com']


'Success: Your competition participant access list is now updated.'

### **Submit Models**

In [None]:
#Authorized users can submit new models after setting credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials

set_credentials(apiurl=myplayground.playground_url) # example url from deployed playground: apiurl= "https://123456.execute-api.us-east-1.amazonaws.com/prod/m"

In [None]:
#Submit Model 1: 

#-- Generate predicted values (a list of predicted labels "survived" or "died") (Model 1)
predictions = model.transform(preprocess_test(test_data))
prediction_index = predictions.select('prediction').toPandas().to_numpy()

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath="model.onnx",
                           preprocessor_filepath="preprocessor.zip",
                           prediction_submission=prediction_index)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 1

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1460
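
The comment in the cell above describes the predictions as a list of labels, while calling `.toPandas().to_numpy()` on a one-column DataFrame yields a two-dimensional array. The sketch below uses a small hand-made pandas DataFrame as a stand-in for `predictions.select('prediction').toPandas()` to show the resulting shape and one way to flatten it. The tutorial submits the two-dimensional array as-is; whether `submit_model` treats both forms identically is not something this sketch verifies.

```python
import pandas as pd

# Hypothetical stand-in for predictions.select('prediction').toPandas()
prediction_df = pd.DataFrame({"prediction": ["died", "survived", "died"]})

prediction_array = prediction_df.to_numpy()
print(prediction_array.shape)  # (3, 1): one row per passenger, a single column

# Flatten to a plain Python list of labels if that form is preferred.
flat_labels = prediction_array.ravel().tolist()
print(flat_labels)  # ['died', 'survived', 'died']
```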


In [None]:
# Create model 2 (L2 Regularization - Ridge)
# instantiate pyspark estimator
lr_2 = Pipeline(stages=[
    LogisticRegression(
        regParam=10, 
        elasticNetParam=0.0, 
        featuresCol='scaled_features', 
        labelCol='indexed_label',
        predictionCol='indexed_prediction'
    ),
    IndexToString(
        inputCol="indexed_prediction", 
        outputCol="prediction", 
        labels=label_indexer_model.labels,
    )
])

# Fit the model
model_2 = lr_2.fit(preprocess_training(training_data))

In [None]:
# Save Model 2 to .onnx file
# transform pyspark model to ONNX
framework = 'pyspark'
onnx_model = model_to_onnx(model_2, framework, initial_types=initial_types, 
                          spark_session=spark, transfer_learning=False,
                          deep_learning=False)

# Save model to local .onnx file
with open("model_2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

{'classlabels_ints': [0, 1],
 'coefficients': [0.015509647857829247,
                  0.0013041722755419913,
                  -0.011493781488800765,
                  0.02496369969047826,
                  -0.024963699690478184,
                  0.007439672349634482,
                  -0.008861733293314019,
                  0.0009338777959797627,
                  -0.0024943580382469116,
                  -0.015509647857829247,
                  -0.0013041722755419913,
                  0.011493781488800765,
                  -0.02496369969047826,
                  0.024963699690478184,
                  -0.007439672349634482,
                  0.008861733293314019,
                  -0.0009338777959797627,
                  0.0024943580382469116],
 'intercepts': [0.386037263192894, -0.386037263192894],
 'multi_class': 1,
 'name': 'LinearClassifier',
 'post_transform': 'LOGISTIC'}


In [None]:
# Submit Model 2

#-- Generate predicted y values (Model 2)
predictions = model_2.transform(preprocess_test(test_data))
prediction_index = predictions.select('prediction').toPandas().to_numpy()

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath="model_2.onnx",
                           preprocessor_filepath="preprocessor.zip",
                           prediction_submission=prediction_index)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 2

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1460


### **Submit Predictions Only**

In [None]:
# Submit predictions to the Competition Leaderboard without submitting an ONNX object
# This option can be used if the predictions were generated with an ML framework that is not currently supported
# The model will be evaluated and shown in the leaderboard, but no other model metadata will be extracted automatically
predictions = model.transform(preprocess_test(test_data))
prediction_index = predictions.select('prediction').toPandas().to_numpy()

mycompetition.submit_model(model_filepath = None,
                           preprocessor_filepath=None,
                           prediction_submission=prediction_index)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 3

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1460


### **Submit Model With Custom Metadata**

In [None]:
# Custom metadata can be added by passing a dict to the custom_metadata argument of the submit_model() method
# This option can be used to fill in missing data points or add new columns to the leaderboard

custom_meta = {'team': 'one',
               'model_type': 'your_model_type',
               'new_column': 'new metadata'}

mycompetition.submit_model(model_filepath=None,
                           preprocessor_filepath=None,
                           prediction_submission=prediction_index,
                           custom_metadata=custom_meta)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 4

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1460


### **Get Leaderboard**

In [None]:
data = mycompetition.get_leaderboard()
mycompetition.stylize_leaderboard(data)

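|   | accuracy | f1_score | precision | recall | f1_weighted | balanced_accuracy | ml_framework | model_type | num_params | model_config | new_column | team | username | version |
|---|----------|----------|-----------|--------|-------------|-------------------|--------------|------------|------------|--------------|------------|------|----------|---------|
| 0 | 65.27% | 39.49% | 32.63% | 50.00% | 0.515505 | 0.5 | pyspark | LogisticRegressionModel | 9.0 | {'aggregationDepth': 2, 'elast... |  |  | raudipra | 1 |
| 1 | 65.27% | 39.49% | 32.63% | 50.00% | 0.515505 | 0.5 | pyspark | LogisticRegressionModel | 9.0 | {'aggregationDepth': 2, 'elast... |  |  | raudipra | 2 |
| 2 | 65.27% | 39.49% | 32.63% | 50.00% | 0.515505 | 0.5 | unknown | unknown |  | None... |  |  | raudipra | 3 |
| 3 | 65.27% | 39.49% | 32.63% | 50.00% | 0.515505 | 0.5 | unknown | your_model_type |  | None... | new metadata | one | raudipra | 4 |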


### **Compare Models**

In [None]:
# Compare two or more models 
data=mycompetition.compare_models([1,2,3,4], verbose=1)
mycompetition.stylize_compare(data)


#### Check structure of y test data 
(This helps users understand how to submit predicted values to leaderboard)

In [None]:
mycompetition.inspect_y_test()

{'class_balance': {'died': 171, 'survived': 91},
 'class_labels': ['died', 'survived'],
 'label_dtypes': {"<class 'str'>": 262},
 'y_length': 262,
 'ytest_example': ['survived', 'died', 'died', 'survived', 'died']}
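
The class balance above also gives a useful reference point for reading the leaderboard. The sketch below works out what a model that always predicts the majority class would score. The counts are copied from the `inspect_y_test()` output. The comparison with the leaderboard is an observation, not something the library reports.

```python
# Counts copied from the inspect_y_test() output above.
class_balance = {"died": 171, "survived": 91}

total = sum(class_balance.values())
majority_label = max(class_balance, key=class_balance.get)

# Accuracy when every passenger is predicted to be in the majority class.
majority_accuracy = class_balance[majority_label] / total

# Balanced accuracy is the mean per-class recall: 1 for the majority class
# and 0 for every other class.
majority_balanced_accuracy = 1.0 / len(class_balance)

print(majority_label, f"{majority_accuracy:.2%}", majority_balanced_accuracy)
# died 65.27% 0.5
```

These values coincide with the 65.27% accuracy and 0.5 balanced accuracy shown for all four leaderboard entries, which suggests the submitted models may be predicting "died" for every passenger, plausibly because of the strong regularization (`regParam=10`).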

## **Part 3: Maintaining your Model Playground**

-------

### **Update Runtime Model**

*Use this function to 1) update the prediction API behind your Model Playground with a new model chosen from the leaderboard, and 2) verify the model performance metrics in your Model Playground.*

In [None]:
myplayground.update_runtime_model(model_version=2)

Runtime model & preprocessor for api: https://xa6ln3rdr8.execute-api.us-east-1.amazonaws.com/prod/m updated to model version 2.

Model metrics are now updated and verified for this model playground.


### **Delete Deployment**

*Use this function to delete the entire Model Playground, including the REST API, web dashboard, competition, and all submitted models*

In [None]:
myplayground.delete_deployment()

Running this function will permanently delete all resources tied to this deployment, 
 including the eval lambda and all models submitted to the model competition.

To confirm, type 'permanently delete':permanently delete


'Deployment deleted successfully.'