<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---


<p align="center"><h1 align="center">California Housing Model Submission Guide</h1></p>

##### <p align="center">*Source: Sklearn [California Housing Dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset)*</p>

---
Let's share our models to a centralized leaderboard, so that we can collaborate and learn from the model experimentation process...

**Instructions:**
1. Get data in and set up X_train / X_test / y_train
2. Preprocess data with Sklearn Column Transformer / Write and save preprocessor function
3. Fit model on preprocessed data and save preprocessor function and model
4. Generate predictions from X_test data and submit model to competition
5. Repeat submission process to improve place on leaderboard



**Objective:** Predict median house value for California districts, expressed in hundreds of thousands of dollars

**Data**: 1990 Census attributes by Block Group. 
(A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data. A block group typically has a population of 600 to 3,000 people.) 

**Features**
* **MedInc** median income in block group
* **HouseAge** median house age in block group
* **AveRooms** average number of rooms per household
* **AveBedrms** average number of bedrooms per household
* **Population** block group population
* **AveOccup** average number of household members
* **Latitude** block group latitude
* **Longitude** block group longitude

**Target**
*   Median house value for California districts, expressed in hundreds of thousands of dollars ($100,000)
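
Because the target is stored in units of $100,000, predictions and error metrics measured in the same units (RMSE, MAE) can be read in dollars by multiplying by 100,000. The helper below is a minimal sketch of that conversion; `to_dollars` and `UNIT_IN_DOLLARS` are illustrative names, not part of the dataset or of aimodelshare.

```python
# Target values are expressed in units of $100,000 (see the Target description above).
UNIT_IN_DOLLARS = 100_000


def to_dollars(value_in_units: float) -> float:
    """Convert a target value, RMSE, or MAE from $100,000 units to dollars.

    MSE is in squared units, so this conversion does not apply to it.
    """
    return value_in_units * UNIT_IN_DOLLARS


median_value = to_dollars(2.5)   # a block group whose target is 2.5
typical_error = to_dollars(0.5)  # an RMSE or MAE of 0.5

print(f"${median_value:,.0f}")   # $250,000
print(f"${typical_error:,.0f}")  # $50,000
```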

## 1. Get data in and set up X_train, X_test, y_train objects

In [None]:
#install aimodelshare library
! pip install aimodelshare --upgrade

In [31]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/ca_housing_competition_data-repository:latest') 


Data downloaded successfully.


In [32]:
# Load data into X_train, y_train, and X_test objects
import pandas as pd

X_train = pd.read_csv("ca_housing_competition_data/X_train.csv")
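# The squeeze= keyword of read_csv was removed in pandas 2.0; squeeze the single column into a Series instead
y_train = pd.read_csv("ca_housing_competition_data/y_train.csv").squeeze("columns")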

X_test=pd.read_csv("ca_housing_competition_data/X_test.csv")

X_train.head()

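|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 3.2596 | 33.0 | 5.017657 | 1.006421 | 2300.0 | 3.691814 | 32.71 | -117.03 |
| 1 | 3.8125 | 49.0 | 4.473545 | 1.041005 | 1314.0 | 1.738095 | 33.77 | -118.16 |
| 2 | 4.1563 | 4.0 | 5.645833 | 0.985119 | 915.0 | 2.723214 | 34.66 | -120.48 |
| 3 | 1.9425 | 36.0 | 4.002817 | 1.033803 | 1418.0 | 3.994366 | 32.69 | -117.11 |
| 4 | 3.5542 | 43.0 | 6.268421 | 1.134211 | 874.0 | 2.3 | 36.78 | -119.8 |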


In [33]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)

(16512, 8)
(4128, 8)
(16512,)
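
The competition data ships without a `y_test`, so any local estimate of out-of-sample error has to come from the training rows. One common option, which is not part of the original notebook, is to hold out a validation slice with scikit-learn's `train_test_split`. The sketch below uses a small synthetic array as a stand-in for the real data; `X_demo`, `y_demo`, and the 80/20 ratio are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data: 100 rows, 8 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.normal(size=100)

# Hold out 20% of the rows for validation; random_state makes the split repeatable.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_fit.shape, X_val.shape)  # (80, 8) (20, 8)
```

With the real data, the same call on `X_train` and `y_train` would give a validation set to score each candidate model on before spending a leaderboard submission.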


## 2. Preprocess data using Sklearn Column Transformer / Write and Save Preprocessor function


In [6]:
# In this case we use Sklearn's Column transformer in our preprocessor function

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Create the preprocessing pipeline for the numeric features (every column in this dataset is numeric).
numeric_features=X_train.columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# final preprocessor object set up with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, numeric_features)])

#Fit your preprocessor object
preprocess=preprocessor.fit(X_train) 

In [7]:
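# Write function to transform data with the fitted preprocessor.
# Note: this definition rebinds the name `preprocessor` from the ColumnTransformer above
# to the function below; the fitted transformer stays available as `preprocess`.

def preprocessor(data):
    preprocessed_data=preprocess.transform(data)
    return preprocessed_data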

In [8]:
# check shape of X data 
preprocessor(X_train).shape

(16512, 8)
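
To make the two pipeline steps concrete, here is a hand-rolled sketch of what median imputation followed by standardization does to a single column. It uses only the standard library and is not how scikit-learn is implemented; `impute_and_scale` is a hypothetical helper. It also computes its statistics from the column it is given, whereas the fitted pipeline learns them from `X_train` and reuses them on `X_test`.

```python
import math
import statistics


def impute_and_scale(column):
    """Fill NaNs with the median of the observed values, then standardize.

    Assumes the column is not constant (otherwise the standard deviation is zero).
    """
    observed = [v for v in column if not math.isnan(v)]
    median = statistics.median(observed)
    filled = [median if math.isnan(v) else v for v in column]

    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)  # population std (ddof=0), the convention StandardScaler uses
    return [(v - mean) / std for v in filled]


house_age = [10.0, 20.0, float("nan"), 30.0]
scaled = impute_and_scale(house_age)

print([round(v, 3) for v in scaled])  # [-1.414, 0.0, 0.0, 1.414]
```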

## 3. Fit model on preprocessed data and save preprocessor function and model


In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

model = RandomForestRegressor(n_estimators = 500, max_depth = 3, random_state=0)
model.fit(preprocessor(X_train), y_train) # Fitting to the training set.
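model_predictions = model.predict(preprocessor(X_train))
# r2_score expects (y_true, y_pred). The value recorded below was produced with the
# arguments in the opposite order, so rerunning this cell prints a different number.
r2_score(y_train, model_predictions)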

0.14724751823359472
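
R² is not symmetric in its two arguments, so the order passed to `r2_score(y_true, y_pred)` matters. The sketch below writes out the standard definition, 1 minus the residual sum of squares over the total sum of squares of the true values, in plain Python to show how far the two orderings can diverge. The function `r2` and the toy numbers are illustrative only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken around mean(y_true)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]  # predictions that vary less than the targets

print(r2(y_true, y_pred))  # 0.6
print(r2(y_pred, y_true))  # -1.0
```

When predictions vary less than the targets, as they tend to with shallow trees, passing them first shrinks the denominator and pulls the score down, sometimes below zero.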

#### Save preprocessor function to local "preprocessor.zip" file

In [14]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [15]:
# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

# Check how many preprocessed input features there are
from skl2onnx.common.data_types import FloatTensorType

feature_count=preprocessor(X_test).shape[1]
initial_type = [('float_input', FloatTensorType([None, feature_count]))]  # Insert correct number of features in preprocessed data

onnx_model = model_to_onnx(model, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)


with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

## 4. Generate predictions from X_test data and submit model to competition


In [16]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://hcvbryu1a3.execute-api.us-east-1.amazonaws.com/prod/m" 
#This is the unique rest api that powers this CA Housing Prediction Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [17]:
#Instantiate Competition
import aimodelshare as ai
mycompetition= ai.Competition(apiurl)

In [18]:
#Submit Model 1: 

#-- Generate predicted values (an array of predicted median house values) (Model 1)
predicted_values = model.predict(preprocessor(X_test))

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=predicted_values)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 2

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1394
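
Before calling `submit_model`, it can be worth confirming that the predictions look like something a leaderboard can score. The sketch below assumes the competition expects exactly one finite number per row of `X_test` (4,128 rows here); that requirement is an assumption, and `check_predictions` is a hypothetical helper rather than part of aimodelshare.

```python
import math


def check_predictions(predictions, expected_rows):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    values = [float(v) for v in predictions]  # also accepts NumPy arrays

    if len(values) != expected_rows:
        problems.append(f"expected {expected_rows} predictions, got {len(values)}")
    if any(not math.isfinite(v) for v in values):
        problems.append("predictions contain NaN or infinite values")
    return problems


print(check_predictions([1.2, 3.4, 2.2], expected_rows=3))      # []
print(check_predictions([1.2, float("nan")], expected_rows=3))
```

In the notebook this would be called as `check_predictions(predicted_values, len(X_test))` just before submitting.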


In [19]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

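|   | mse | rmse | mae | r2 | ml_framework | transfer_learning | deep_learning | model_type | num_params | model_config | username | version |
|---|-----|------|-----|----|--------------|-------------------|---------------|------------|------------|--------------|----------|---------|
| 0 | 0.56 | 0.75 | 0.53 | 0.58 | sklearn | False | False | LinearRegression | 8.0 | `{'copy_X': True, 'fit_intercep...` | gstreett | 1 |
| 1 | 0.6 | 0.78 | 0.58 | 0.54 | sklearn | False | False | RandomForestRegressor |  | `{'bootstrap': True, 'ccp_alpha...` | gstreett | 2 |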
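
The leaderboard reports `mse`, `rmse`, and `mae` alongside `r2`. The source does not spell out how they are computed; assuming they follow the usual definitions, the sketch below shows each one in plain Python. `regression_errors` and the toy values are illustrative names and data.

```python
import math


def regression_errors(y_true, y_pred):
    """Mean squared error, its square root, and mean absolute error."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


errors = regression_errors([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 3.0])
print(errors["mse"])             # 0.5
print(round(errors["rmse"], 4))  # 0.7071
print(errors["mae"])             # 0.5
```

Lower is better for all three, and `rmse` and `mae` are in the same units as the target ($100,000), which makes them the easiest columns to interpret.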


## 5. Repeat submission process to improve place on leaderboard


In [20]:
# Create model 2 
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

model_2 = RandomForestRegressor(n_estimators = 1000, max_depth = 2, random_state=0)
model_2.fit(preprocessor(X_train), y_train) # Fitting to the training set.
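model_predictions = model_2.predict(preprocessor(X_train))
# r2_score expects (y_true, y_pred). The value recorded below was produced with the
# arguments in the opposite order, so rerunning this cell prints a different number.
r2_score(y_train, model_predictions)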

-0.21960584796351568

In [21]:
# Save Model 2 to .onnx file

# Check how many preprocessed input features there are
from skl2onnx.common.data_types import FloatTensorType

feature_count=preprocessor(X_test).shape[1]
initial_type = [('float_input', FloatTensorType([None, feature_count]))]  # Insert correct number of features in preprocessed data

onnx_model = model_to_onnx(model_2, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

# Save model to local .onnx file
with open("model_2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString()) 

In [22]:
# Submit Model 2

#-- Generate predicted y values (Model 2)
prediction_labels = model_2.predict(preprocessor(X_test))

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model_2.onnx",
                                 prediction_submission=prediction_labels,
                                 preprocessor_filepath="preprocessor.zip")

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 3

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1394


In [24]:
# Compare two or more models
data=mycompetition.compare_models([2,3], verbose=1)
mycompetition.stylize_compare(data)

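|   | param_name | default_value | model_version_2 | model_version_3 |
|---|------------|---------------|-----------------|-----------------|
| 0 | bootstrap | True | True | True |
| 1 | ccp_alpha | 0.000000 | 0.000000 | 0.000000 |
| 2 | criterion | mse | mse | mse |
| 3 | max_depth |  | 3 | 2 |
| 4 | max_features | auto | auto | auto |
| 5 | max_leaf_nodes |  |  |  |
| 6 | max_samples |  |  |  |
| 7 | min_impurity_decrease | 0.000000 | 0.000000 | 0.000000 |
| 8 | min_impurity_split |  |  |  |
| 9 | min_samples_leaf | 1 | 1 | 1 |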







In [25]:
# Submit a third model using GridSearchCV

from sklearn.model_selection import GridSearchCV
import numpy as np

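# Note: np.arange(100, 300, 500) yields only [100], because the step (500) overshoots the stop value (300).
# To search several forest sizes, use e.g. np.arange(100, 600, 200), which gives 100, 300, and 500.
param_grid = {'n_estimators': np.arange(100, 300, 500), 'max_depth': [1, 3, 5]}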

gridmodel = GridSearchCV(RandomForestRegressor(), param_grid=param_grid, cv=10)

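# Fit the grid search: it cross-validates every parameter combination, then refits the best one on all of the training data
gridmodel.fit(preprocessor(X_train), y_train)

# Read the best mean cross-validation score (R^2 by default for regressors) and the winning parameters from the "best_score_" and "best_params_" attributes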
print("best mean cross-validation score: {:.3f}".format(gridmodel.best_score_))
print("best parameters: {}".format(gridmodel.best_params_))


best mean cross-validation score: 0.667
best parameters: {'max_depth': 5, 'n_estimators': 100}
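
The search above only ever tries `n_estimators=100`, because `np.arange(100, 300, 500)` stops after its first value. The sketch below uses the built-in `range`, which follows the same start/stop/step rule for integers, to show this, and then counts how many model fits a wider grid would cost with 10-fold cross-validation. The wider grid is a suggestion for illustration, not something the original notebook ran.

```python
from itertools import product

# range() follows the same rule as np.arange for integers: start, stop (exclusive), step.
narrow = list(range(100, 300, 500))  # the step overshoots the stop value
wide = list(range(100, 600, 200))

param_grid = {"n_estimators": wide, "max_depth": [1, 3, 5]}

# Every combination of the listed values is one candidate model.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

cv_folds = 10
print(narrow)  # [100]
print(wide)    # [100, 300, 500]
print(len(combos), "candidates,", len(combos) * cv_folds, "cross-validation fits")
# 9 candidates, 90 cross-validation fits
```

Grid size multiplies quickly, so widening one parameter at a time keeps the run time manageable.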


In [26]:
# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

feature_count=preprocessor(X_test).shape[1] #Get count of preprocessed features
initial_type = [('float_input', FloatTensorType([None, feature_count]))]  # Insert correct number of preprocessed features

onnx_model = model_to_onnx(gridmodel, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("gridmodel.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [27]:
#Submit Model 3: 

#-- Generate predicted values
prediction_labels = gridmodel.predict(preprocessor(X_test))

# Submit Model 3 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "gridmodel.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 4

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1394


In [28]:
# Get leaderboard

data = mycompetition.get_leaderboard()
mycompetition.stylize_leaderboard(data)

|   | mse | rmse | mae | r2 | ml_framework | transfer_learning | deep_learning | model_type | num_params | model_config | username | version |
|---|-----|------|-----|----|--------------|-------------------|---------------|------------|------------|--------------|----------|---------|
| 0 | 0.46 | 0.68 | 0.49 | 0.65 | sklearn | False | False | RandomForestRegressor |  | `{'bootstrap': True, 'ccp_alpha...` | gstreett | 4 |
| 1 | 0.56 | 0.75 | 0.53 | 0.58 | sklearn | False | False | LinearRegression | 8.0 | `{'copy_X': True, 'fit_intercep...` | gstreett | 1 |
| 2 | 0.6 | 0.78 | 0.58 | 0.54 | sklearn | False | False | RandomForestRegressor |  | `{'bootstrap': True, 'ccp_alpha...` | gstreett | 2 |
| 3 | 0.73 | 0.86 | 0.65 | 0.44 | sklearn | False | False | RandomForestRegressor |  | `{'bootstrap': True, 'ccp_alpha...` | gstreett | 3 |


In [29]:
# Compare two or more models
data=mycompetition.compare_models([2, 3, 4], verbose=1)
mycompetition.stylize_compare(data)

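|   | param_name | default_value | model_version_2 | model_version_3 | model_version_4 |
|---|------------|---------------|-----------------|-----------------|-----------------|
| 0 | bootstrap | True | True | True | True |
| 1 | ccp_alpha | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 2 | criterion | mse | mse | mse | mse |
| 3 | max_depth |  | 3 | 2 | 5 |
| 4 | max_features | auto | auto | auto | auto |
| 5 | max_leaf_nodes |  |  |  |  |
| 6 | max_samples |  |  |  |  |
| 7 | min_impurity_decrease | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 8 | min_impurity_split |  |  |  |  |
| 9 | min_samples_leaf | 1 | 1 | 1 | 1 |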







In [30]:
# Here are several classic ML architectures you can choose from to experiment with next:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import GradientBoostingRegressor


#Example code to fit model:
model = GradientBoostingRegressor(n_estimators=50, learning_rate=1.0,
    max_depth=1, random_state=0).fit(preprocessor(X_train), y_train)
model.score(preprocessor(X_train), y_train)

# Save sklearn model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

feature_count=preprocessor(X_test).shape[1] #Get count of preprocessed features
initial_type = [('float_input', FloatTensorType([None, feature_count]))]  # Insert correct number of preprocessed features

onnx_model = model_to_onnx(model, framework='sklearn',
                          initial_types=initial_type,
                          transfer_learning=False,
                          deep_learning=False)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

#-- Generate predicted values (an array of predicted median house values)
prediction_labels = model.predict(preprocessor(X_test))

# Submit model to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)


Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 5

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1394
