<p align="center"><img width="50%" src="https://aimodelsharecontent.s3.amazonaws.com/aimodshare_banner.jpg" /></p>


---


<p align="center"><h1 align="center">Internet Ad Model Submission Guide - Deep Learning

##### <p align="center">*Data Source: Lichman, M. (2013). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml) . Irvine, CA: University of California, School of Information and Computer Science.*

---
Let's share our models to a centralized leaderboard, so that we can collaborate and learn from the model experimentation process...

**Instructions:**
1.   Get data in and set up X_train / X_test / y_train
2.   Preprocess data  with Sklearn Column Transformer/ Write and Save Preprocessor function
3. Fit model on preprocessed data and save preprocessor function and model 
4. Generate predictions from X_test data and submit model to competition
5. Repeat submission process to improve place on leaderboard

# Objective: Predict whether an image is an advertisement (ad.) or not (nonad.)

---

**Data**: This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occuring in the URL, the image's URL and alt text, the anchor text, and words occuring near the anchor text. 

**Features (1558 features)**
* **height** height of image
* **width** width of image
* **aratio** aspect ratio of image
* **URL Terms** 457 features of page urls 
* **orig URL Terms** 495 features from original image urls
* **anc URL Terms** 472 features from anchor urls
* **alt Terms** 111 features from image alt text
* **caption Terms** 19 features from image captions

**Target**
*   Binary variable (ad./nonad)

## 1. Get data in and set up X_train, X_test, y_train objects

In [None]:
#install aimodelshare library
! pip install aimodelshare --upgrade

In [2]:
# Get competition data
from aimodelshare import download_data
download_data('public.ecr.aws/y2e2a1d6/internet_ads_competition_data-repository:latest')


Data downloaded successfully.


In [3]:
# Load data into X_train, y_train, and X_test
import pandas as pd
X_train = pd.read_csv("internet_ads_competition_data/X_train.csv")
y_train = pd.read_csv("internet_ads_competition_data/y_train.csv", squeeze=True)
y_train = pd.get_dummies(y_train)

X_test=pd.read_csv("internet_ads_competition_data/X_test.csv")

X_train.head()

Unnamed: 0,height,width,aratio,local,url*images+buttons,url*likesbooks.com,url*www.slake.com,url*hydrogeologist,url*oso,url*media,url*peace+images,url*blipverts,url*tkaine+kats,url*labyrinth,url*advertising+blipverts,url*images+oso,url*area51+corridor,url*ran+gifs,url*express-scripts.com,url*off,url*cnet,url*time+1998,url*josefina3,url*truluck.com,url*clawnext+gif,url*autopen.com,url*tvgen.com,url*pixs,url*heartland+5309,url*meadows+9196,url*blue,url*ad+gif,url*area51,url*www.internauts.ca,url*afn.org,url*ran.org,url*shareware.com,url*baons+images,url*area51+labyrinth,url*pics,...,alt*site,alt*to+visit,alt*rank+my,alt*from,alt*page,alt*graphic,alt*like+mine,alt*email+me,alt*visit,alt*free,alt*the+kat,alt*award,alt*services,alt*about,alt*for,alt*here+to,alt*network,alt*you,alt*logo,alt*home,alt*kat,caption*and,caption*home+page,caption*click+here,caption*the,caption*pratchett,caption*here+for,caption*site,caption*page,caption*to,caption*of,caption*home,caption*my,caption*your,caption*in,caption*bytes,caption*here,caption*click,caption*for,caption*you
0,60.0,468.0,7.8,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,120.0,120.0,1.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,90.0,128.0,1.4222,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,24.0,120.0,5.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,77.0,108.0,1.4025,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)

(2623, 1558)
(656, 1558)
(2623, 2)


##2.   Preprocess data using Sklearn Column Transformer / Write and Save Preprocessor function


In [5]:
# In this case we use Sklearn's Column transformer in our preprocessor function

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

#Preprocess data using sklearn's Column Transformer approach

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['height', 'width', 'aratio']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), #'imputer' names the step
    ('scaler', StandardScaler())])

binary_features = X_train.columns.tolist()
binary_features = [colname for colname in binary_features if colname not in numeric_features]

# Replacing missing values with Modal value 
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))])

# Final preprocessor object set up with ColumnTransformer...

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, binary_features)])

# fit preprocessor to your data
preprocess = preprocess.fit(X_train)

In [6]:
# Write function to transform data with preprocessor

def preprocessor(data):
    preprocessed_data=preprocess.transform(data)
    return preprocessed_data

In [7]:
# check shape of X data 
preprocessor(X_train).shape

(2623, 1558)

##3. Fit model on preprocessed data and save preprocessor function and model 


In [8]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
import keras

feature_count=preprocessor(X_train).shape[1] #count features in input data

model = Sequential()
model.add(Dense(64, input_dim=feature_count, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))

model.add(Dense(2, activation='softmax')) 
                                            
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Fitting the NN to the Training set
model.fit(preprocessor(X_train), y_train, 
               batch_size = 20, 
               epochs = 3, validation_split=0.25) 

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f8b75ff1750>

#### Save preprocessor function to local "preprocessor.zip" file

In [9]:
import aimodelshare as ai
ai.export_preprocessor(preprocessor,"") 

Your preprocessor is now saved to 'preprocessor.zip'


#### Save model to local ".onnx" file

In [10]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

## 4. Generate predictions from X_test data and submit model to competition


In [11]:
#Set credentials using modelshare.org username/password

from aimodelshare.aws import set_credentials
    
apiurl="https://vy08zh602l.execute-api.us-east-1.amazonaws.com/prod/m"
#This is the unique rest api that powers this Internet Ad Prediction Playground

set_credentials(apiurl=apiurl)

AI Modelshare Username:··········
AI Modelshare Password:··········
AI Model Share login credentials set successfully.


In [12]:
#Instantiate Competition
import aimodelshare as ai
mycompetition= ai.Competition(apiurl)

In [23]:
#Submit Model 1: 

#-- Generate predicted y values (Model 1)
#Note: Keras predict returns the predicted column index location for classification models
prediction_column_index=model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 1 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 8

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1462


In [24]:
# Get leaderboard to explore current best model architectures

# Get raw data in pandas data frame
data = mycompetition.get_leaderboard()

# Stylize leaderboard data
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,deep_learning,model_type,depth,num_params,dense_layers,softmax_act,relu_act,loss,optimizer,model_config,memory_size,username,version
0,96.65%,94.16%,94.16%,94.16%,sklearn,,LogisticRegression,,1558.0,,,,,liblinear,"{'C': 10, 'class_weight': None...",,gstreett,3
1,95.27%,91.11%,95.00%,88.14%,sklearn,,GradientBoostingClassifier,,,,,,,,"{'ccp_alpha': 0.0, 'criterion'...",,gstreett,6
2,91.31%,81.39%,93.07%,76.04%,sklearn,,LogisticRegression,,1558.0,,,,,lbfgs,"{'C': 0.01, 'class_weight': No...",,gstreett,4
3,90.24%,77.93%,93.82%,72.28%,sklearn,,RandomForestClassifier,,,,,,,,"{'bootstrap': True, 'ccp_alpha...",,gstreett,5
4,72.87%,48.75%,48.74%,48.94%,unknown,,unknown,,,,,,,,None...,,gstreett,2
5,82.62%,45.24%,41.31%,50.00%,keras,True,Sequential,4.0,108226.0,4.0,1.0,3.0,str,SGD,"{'name': 'sequential_2', 'laye...",1326440.0,gstreett,8
6,71.65%,46.86%,46.69%,47.17%,sklearn,,LogisticRegression,,1558.0,,,,,liblinear,"{'C': 0.01, 'class_weight': No...",,gstreett,1
7,nan%,nan%,nan%,nan%,keras,True,Sequential,4.0,108226.0,4.0,1.0,3.0,str,SGD,"{'name': 'sequential_2', 'laye...",1326440.0,gstreett,7


## 5. Repeat submission process to improve place on leaderboard


In [27]:
# Train and submit model 2 using same preprocessor (note that you could save a new preprocessor, but we will use the same one for this example).
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

feature_count=preprocessor(X_train).shape[1] #count features in input data

model_2 = Sequential()
model_2.add(Dense(128, input_dim=feature_count, activation='relu'))
model_2.add(Dropout(.3))
model_2.add(Dense(64, activation='relu'))
model_2.add(Dense(64, activation='relu'))
model_2.add(Dropout(.3))
model_2.add(Dense(64, activation='relu'))

model_2.add(Dense(2, activation='softmax')) 
                                            
# Compile model
model_2.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Fitting the NN to the Training set
model_2.fit(preprocessor(X_train), y_train, 
               batch_size = 20, 
               epochs = 5, validation_split=0.25) 

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f5f13a8ea10>

In [28]:
# Save Model 2 to .onnx file

onnx_model = model_to_onnx(model_2, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

# Save model to local .onnx file
with open("model_2.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString()) 

In [29]:
# Submit Model 2

#-- Generate predicted y values (Model 2)
#Note: Keras predict returns the predicted column index location for classification models
prediction_column_index=model_2.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit Model 2 to Competition Leaderboard
mycompetition.submit_model(model_filepath = "model_2.onnx",
                                 prediction_submission=prediction_labels,
                                 preprocessor_filepath="preprocessor.zip")

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 9

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1462


In [30]:
# Compare two or more models
data=mycompetition.compare_models([7, 8], verbose=1)
mycompetition.stylize_compare(data)

Unnamed: 0,Model_8_Layer,Model_8_Shape,Model_8_Params,Model_9_Layer,Model_9_Shape,Model_9_Params
0,Dense,"[None, 64]",99776.0,Dense,"[None, 128]",199552
1,Dense,"[None, 64]",4160.0,Dropout,"[None, 128]",0
2,Dense,"[None, 64]",4160.0,Dense,"[None, 64]",8256
3,Dense,"[None, 2]",130.0,Dense,"[None, 64]",4160
4,,,,Dropout,"[None, 64]",0
5,,,,Dense,"[None, 64]",4160
6,,,,Dense,"[None, 2]",130


## Optional: Tune model within range of hyperparameters with Keras Tuner

*Simple example shown below. Consult [documentation](https://keras.io/guides/keras_tuner/getting_started/) to see full functionality.*

In [None]:
! pip install keras_tuner

In [32]:
#Separate validation data 
from sklearn.model_selection import train_test_split
x_train_split, x_val, y_train_split, y_val = train_test_split(
     X_train, y_train, test_size=0.2, random_state=42)

In [34]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, BatchNormalization
from keras.regularizers import l1, l2, l1_l2
import keras_tuner as kt


#Define model structure & parameter search space with function
feature_count=preprocessor(X_train).shape[1] #count features in input data

def build_model(hp):
    model = keras.Sequential()
    model.add(Dense(64, input_dim=feature_count, activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
    model.add(Dense(64, activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
    model.add(Dense(units=hp.Int("units", min_value=32, max_value=512, step=32), #range 32-512 inclusive, minimum step between tested values is 32
                    activation='relu', kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)))
    model.add(Dense(2, activation='softmax')) 
    model.compile(
        optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"],
    )
    return model

#initialize the tuner (which will search through parameters)
tuner = kt.RandomSearch(
    hypermodel=build_model, 
    objective="val_accuracy", # objective to optimize
    max_trials=3, #max number of trials to run during search
    executions_per_trial=3, #higher number reduces variance of results; guages model performance more accurately 
    overwrite=True,
    directory="tuning_model",
    project_name="tuning_units",
)

tuner.search(preprocessor(x_train_split), y_train_split, epochs=1, validation_data=(preprocessor(x_val), y_val))

Trial 3 Complete [00h 00m 06s]
val_accuracy: 0.9453968207041422

Best val_accuracy So Far: 0.9453968207041422
Total elapsed time: 00h 00m 20s


In [35]:
# Build model with best hyperparameters

# Get the top 2 hyperparameters.
best_hps = tuner.get_best_hyperparameters(5)
# Build the model with the best hp.
tuned_model = build_model(best_hps[0])
# Fit with the entire dataset.
tuned_model.fit(x=preprocessor(X_train), y=y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f6025c78350>

In [36]:
# Save keras model to local ONNX file
from aimodelshare.aimsonnx import model_to_onnx

onnx_model = model_to_onnx(tuned_model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("tuned_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [37]:
#Submit Model 3: 

#-- Generate predicted y values (Model 3)
prediction_column_index=tuned_model.predict(preprocessor(X_test)).argmax(axis=1)

# extract correct prediction labels 
prediction_labels = [y_train.columns[i] for i in prediction_column_index]

# Submit to Competition Leaderboard
mycompetition.submit_model(model_filepath = "tuned_model.onnx",
                                 preprocessor_filepath="preprocessor.zip",
                                 prediction_submission=prediction_labels)

Insert search tags to help users find your model (optional): 
Provide any useful notes about your model (optional): 

Your model has been submitted as model version 10

To submit code used to create this model or to view current leaderboard navigate to Model Playground: 

 https://www.modelshare.org/detail/model:1462


In [38]:
# Get leaderboard

data = mycompetition.get_leaderboard()
mycompetition.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,deep_learning,model_type,depth,num_params,dense_layers,dropout_layers,softmax_act,relu_act,loss,optimizer,model_config,memory_size,username,version
0,96.65%,94.16%,94.16%,94.16%,sklearn,,LogisticRegression,,1558.0,,,,,,liblinear,"{'C': 10, 'class_weight': None...",,gstreett,3
1,95.27%,91.11%,95.00%,88.14%,sklearn,,GradientBoostingClassifier,,,,,,,,,"{'ccp_alpha': 0.0, 'criterion'...",,gstreett,6
2,95.27%,91.44%,93.25%,89.87%,keras,True,Sequential,4.0,127522.0,4.0,,1.0,3.0,str,RMSprop,"{'name': 'sequential_1', 'laye...",2103408.0,gstreett,10
3,91.31%,81.39%,93.07%,76.04%,sklearn,,LogisticRegression,,1558.0,,,,,,lbfgs,"{'C': 0.01, 'class_weight': No...",,gstreett,4
4,90.24%,77.93%,93.82%,72.28%,sklearn,,RandomForestClassifier,,,,,,,,,"{'bootstrap': True, 'ccp_alpha...",,gstreett,5
5,86.59%,67.63%,84.68%,63.83%,keras,True,Sequential,7.0,216258.0,5.0,2.0,1.0,4.0,str,SGD,"{'name': 'sequential_4', 'laye...",1560592.0,gstreett,9
6,72.87%,48.75%,48.74%,48.94%,unknown,,unknown,,,,,,,,,None...,,gstreett,2
7,82.62%,45.24%,41.31%,50.00%,keras,True,Sequential,4.0,108226.0,4.0,,1.0,3.0,str,SGD,"{'name': 'sequential_2', 'laye...",1326440.0,gstreett,8
8,71.65%,46.86%,46.69%,47.17%,sklearn,,LogisticRegression,,1558.0,,,,,,liblinear,"{'C': 0.01, 'class_weight': No...",,gstreett,1
9,nan%,nan%,nan%,nan%,keras,True,Sequential,4.0,108226.0,4.0,,1.0,3.0,str,SGD,"{'name': 'sequential_2', 'laye...",1326440.0,gstreett,7


In [40]:
# Compare two or more models
data=mycompetition.compare_models([7, 8, 9], verbose=1)
mycompetition.stylize_compare(data)

Unnamed: 0,Model_8_Layer,Model_8_Shape,Model_8_Params,Model_9_Layer,Model_9_Shape,Model_9_Params,Model_10_Layer,Model_10_Shape,Model_10_Params
0,Dense,"[None, 64]",99776.0,Dense,"[None, 128]",199552,Dense,"[None, 64]",99776.0
1,Dense,"[None, 64]",4160.0,Dropout,"[None, 128]",0,Dense,"[None, 64]",4160.0
2,Dense,"[None, 64]",4160.0,Dense,"[None, 64]",8256,Dense,"[None, 352]",22880.0
3,Dense,"[None, 2]",130.0,Dense,"[None, 64]",4160,Dense,"[None, 2]",706.0
4,,,,Dropout,"[None, 64]",0,,,
5,,,,Dense,"[None, 64]",4160,,,
6,,,,Dense,"[None, 2]",130,,,
