These notebooks are part of Kaggle’s [Practical Model Evaluation](https://www.kaggle.com/practical-model-evaluation) event, which ran from December 3-5, 2019. You can find the [livestreams for this event here](https://youtu.be/7RdKnACscjA?list=PLqFaTIg4myu-HA1VGJi_7IGFkKRoZeOFt).

* Day 1 Notebook: [Figuring out what matters for you](https://www.kaggle.com/rtatman/practical-model-evaluation-day-1)
* Day 2 Notebook: [Training models with automated machine learning](https://www.kaggle.com/rtatman/practical-model-evaluation-day-2)

***

For today's exercise, we'll be classifying roles into job titles based on information about each role. The data comes from the [2018](https://www.kaggle.com/kaggle/kaggle-survey-2018) and [2019](https://www.kaggle.com/c/kaggle-survey-2019) Kaggle data science surveys.

I've already [done some data cleaning](https://www.kaggle.com/rebeccaturner/data-prep-for-job-title-classification) but if you'd like to do your own or do some additional feature engineering, feel free!

Today we'll be building four different models using four different libraries, including some automated machine learning libraries. 

> Automated machine learning (or AutoML for short) is the task of removing human labor from the process of training machine learning models. Currently most AutoML research is focused on automating model selection and hyperparameter tuning. [This video](https://www.youtube.com/watch?v=Rsg_XzgGqZw&utm_medium=notebook&utm_source=kaggle&utm_campaign=automl-event) goes into more detail.

The libraries we'll be using are:

* [XGBoost](https://xgboost.readthedocs.io/en/latest/) (not automated machine learning: we'll be using this as a baseline)
* [Cloud AutoML](https://cloud.google.com/automl/?utm_medium=notebook&utm_source=kaggle&utm_campaign=automl-event), an enterprise-focused automated machine learning product
* [TPOT](https://epistasislab.github.io/tpot/), an open source automated machine learning library developed at the University of Pennsylvania
* [H2O.ai AutoML](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html), a second open source automated machine learning library developed by researchers at H2O.ai

Let's get started!

## Load in data

First let's load in our pre-cleaned data. I'll be using the 2018 data as an example and then have you work through the 2019 data as your exercise.

In [1]:
# Importing libraries
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, accuracy_score, confusion_matrix
import pandas as pd
import category_encoders as ce

# set a seed for reproducibility
random.seed(42)

# read in our data
df_2018 = pd.read_csv("../input/data-prep-for-job-title-classification/data_jobs_info_2018.csv")
df_2019 = pd.read_csv("../input/data-prep-for-job-title-classification/data_jobs_info_2019.csv")

## Data preparation

We do have a bit more preparation to do. First, we'll split the data into training and testing sets:

In [2]:
# split into predictor & target variables
X = df_2018.drop("job_title", axis=1)
y = df_2018["job_title"]

# Splitting data into training and test set
# (random_state fixes the shuffle; random.seed() on its own doesn't affect scikit-learn)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, test_size=0.20,
                                                    random_state=42)

# save out the split training data to use with Cloud AutoML
pd.concat([X_train, y_train], axis=1).to_csv("train_data.csv", index=False)

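If you're curious what a seeded 80/20 split boils down to, here's a minimal sketch using only the standard library. `seeded_split` is a made-up helper for illustration, not what `train_test_split` does internally. The point is just that the shuffle is driven by its own seeded generator, so you get the same split every time. In scikit-learn, the equivalent knob is the `random_state` argument to `train_test_split`.

```python
import random

def seeded_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of `rows` with a private RNG and cut it at `train_fraction`."""
    rng = random.Random(seed)       # private generator: the global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = seeded_split(range(10))
train_again, test_again = seeded_split(range(10))

print(len(train), len(test))        # sizes of the two pieces
print(train == train_again)         # same seed, same split
```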
For H2O AutoML and Cloud AutoML we don't need to do anything else. (Actually for Cloud AutoML we don't even need to split our data, but we'll look at that in a minute.)

For TPOT and XGBoost, however, we need to make sure that all our input data is numeric. We'll be using [ordinal label encoding](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) for this.

In [3]:
# encode all features using ordinal encoding
encoder_x = ce.OrdinalEncoder()
X_encoded = encoder_x.fit_transform(X)

# you'll need to use a different encoder for each dataframe
encoder_y = ce.OrdinalEncoder()
y_encoded = encoder_y.fit_transform(y)

# split encoded dataset (random_state makes the split reproducible)
X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(X_encoded, y_encoded,
                                                    train_size=0.80, test_size=0.20, random_state=42)
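To make ordinal encoding a little more concrete, here's a rough sketch of the idea in plain Python. It is not the `category_encoders` implementation: `fit_ordinal_mapping` and `transform` are hypothetical helpers, the job titles are made-up example values, and starting the codes at 1 and using -1 for unseen values are assumptions for this sketch.

```python
def fit_ordinal_mapping(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1   # codes start at 1 in this sketch
    return mapping

def transform(values, mapping, unknown=-1):
    """Replace each value with its code; unseen categories get `unknown`."""
    return [mapping.get(value, unknown) for value in values]

titles = ["Data Scientist", "Student", "Data Analyst", "Student"]
mapping = fit_ordinal_mapping(titles)
encoded = transform(titles, mapping)

print(mapping)
print(encoded)
```

Each fitted mapping only knows about the column it was fit on, which is why the features and the target get separate encoders. Keep in mind that the integers imply an order that the categories don't really have.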

# XGBoost Baseline

First, we're going to train a basic XGBoost model using the default arguments. We cover XGBoost in more detail in the [Intro To Machine Learning course](https://www.kaggle.com/learn/intro-to-machine-learning), so I'm not going to talk about it here.

In [4]:
from xgboost import XGBClassifier

# train XGBoost model with default parameters
my_model = XGBClassifier()
my_model.fit(X_train_encoded, y_train_encoded, verbose=False)

# and save our model
my_model.save_model("xgboost_baseline.model")

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


# Cloud AutoML

Now let's train our Cloud AutoML model! We'll be using both the GCP console and notebook code here, so you'll probably want to open those in separate tabs or windows.

### Prepare your account and project


* First you’ll need to [create a GCP account](https://accounts.google.com/signup/v2/webcreateaccount?service=cloudconsole&continue=https%3A%2F%2Fconsole.cloud.google.com%2F&dsh=S-385463669%3A1575309184770524&gmb=exp&biz=false&flowName=GlifWebSignIn&flowEntry=SignUp&nogm=true&utm_medium=notebook&utm_source=kaggle&utm_campaign=automl-event) (if you already have a Google account you can use that one) and [enable billing](https://www.youtube.com/watch?v=uINleRduCWM&utm_medium=notebook&utm_source=kaggle&utm_campaign=automl-event).


> For now, you do need to have a credit card in order to enable billing and you need billing enabled to use Cloud AutoML. If you’re not able to enable billing you can still follow along with the rest of the workshop, just skip the Cloud AutoML parts. 

 

* From there, [create a new project](https://cloud.google.com/appengine/docs/standard/nodejs/building-app/creating-project?utm_medium=notebook&utm_source=kaggle&utm_campaign=automl-event). You should [set the region of your project](https://cloud.google.com/compute/docs/regions-zones/?utm_medium=notebook&utm_source=kaggle&utm_campaign=automl-event) to “us-central1”.

* Go to the [AutoML Tables](https://console.cloud.google.com/automl-tables?utm_medium=notebook&utm_source=kaggle&utm_campaign=automl-event) page in the Google Cloud Console and click *enable API*. This will let you train an AutoML Tables model in your current project. 


### Creating your dataset


For this workshop, we’re going to create our AutoML datasets using the GCP console. The reason for this is that importing datasets can take a while. If you have the code in your notebook to import your dataset right before the code to create your model, when you run your notebook top to bottom it’ll give you an error because the modelling code was run before the dataset was done importing.


* Click on “Datasets” in the list on the left hand side of your screen and then click on the blue **[+] New Dataset** text near the top of your screen.

* Give your dataset a name and make sure the region is **US-CENTRAL1**.  

* Select “Upload files from your computer” and select the file with the dataset you want. 

* Click on **BROWSE** under the “Select Files” button and a side panel will pop up. 

    * If you haven’t created any buckets you’ll see the text “No buckets found”. To create a new bucket, click on the icon that looks like a shopping basket with a plus sign in it.

    * Follow the prompts to create your bucket. **Important:** Make sure in the “Choose where to store your data” step, you pick “Region” and set the location as “us-central1 (Iowa)”.

* Select the bucket where you’d like to store your data.

* Import your dataset. (This may take a while.)

* Once your dataset is done importing, take a close look at your imported data and make sure it looks the way you’d expect.


### Training our model


In order to train an AutoML model from inside Kaggle Notebooks, you’ll need to attach a notebook to your Google Cloud Account. [This video goes into more detail](https://youtu.be/xP99eh6nQN0?utm_medium=notebook&utm_source=kaggle&utm_campaign=automl-event). 


Then you can modify the following code to start your AutoML model training:

In [5]:
from google.cloud import automl_v1beta1 as automl
from kaggle.gcp import KaggleKernelCredentials
from kaggle_secrets import GcpTarget
from google.cloud import storage

# don't change this value: it's the only region that works currently
REGION = 'us-central1'

# these you'll change based on your GCP project/data
PROJECT_ID = 'kaggle-automl-example' # this will come from your specific GCP project
DATASET_DISPLAY_NAME = 'data_jobs_info_2018' # name of your uploaded dataset (from GCP console)
TARGET_COLUMN = 'job_title' # column with feature you're trying to predict

# these can be whatever you like
MODEL_DISPLAY_NAME = 'kaggle_automl_example_model' # what you want to call your model
TRAIN_BUDGET = 1000 # max time to train model in milli node hours (1000 = 1 node hour), from 1000-72000

storage_client = storage.Client(project=PROJECT_ID, credentials=KaggleKernelCredentials(GcpTarget.GCS)) 
tables_gcs_client = automl.GcsClient(client=storage_client, credentials=KaggleKernelCredentials(GcpTarget.GCS)) 
tables_client = automl.TablesClient(project=PROJECT_ID, region=REGION, gcs_client=tables_gcs_client, credentials=KaggleKernelCredentials(GcpTarget.AUTOML))
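One thing that's easy to trip over: `TRAIN_BUDGET` is given in milli node hours, so 1000 means one node hour. If you'd rather think in hours, a tiny helper along these lines does the conversion. `node_hours_to_milli` is a hypothetical function, not part of the AutoML client, and the 1000-72000 range is taken from the comment in the cell above.

```python
def node_hours_to_milli(node_hours):
    """Convert node hours to the milli node hours used for TRAIN_BUDGET."""
    milli = int(round(node_hours * 1000))
    # range taken from the comment next to TRAIN_BUDGET in the notebook
    if not 1000 <= milli <= 72000:
        raise ValueError("budget must be between 1000 and 72000 milli node hours")
    return milli

print(node_hours_to_milli(1))      # one node hour
print(node_hours_to_milli(2.5))    # two and a half node hours
```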

In [6]:
# first you'll need to make sure your model is predicting the right column
tables_client.set_target_column(
    dataset_display_name=DATASET_DISPLAY_NAME,
    column_spec_display_name=TARGET_COLUMN,
)

RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x7f9891d9d840>, parent: "projects/kaggle-automl-example/locations/us-central1"
, metadata=[('x-goog-request-params', 'parent=projects/kaggle-automl-example/locations/us-central1'), ('x-goog-api-client', 'automl-tables-wrapper/0.6.0 gl-python/3.6.6 grpc/1.24.3 gax/1.14.3 gapic/0.6.0')]), last exception: 503 DNS resolution failed

In [7]:
# and then you'll need to kick off your model training
response = tables_client.create_model(MODEL_DISPLAY_NAME, dataset_display_name=DATASET_DISPLAY_NAME, 
                                      train_budget_milli_node_hours=TRAIN_BUDGET, 
                                      exclude_column_spec_names=[TARGET_COLUMN])

# check if it's done yet (it won't be)
response.done()

RetryError: Deadline of 600.0s exceeded while calling functools.partial(<function _wrap_unary_errors.<locals>.error_remapped_callable at 0x7f98904aed90>, parent: "projects/kaggle-automl-example/locations/us-central1"
, metadata=[('x-goog-request-params', 'parent=projects/kaggle-automl-example/locations/us-central1'), ('x-goog-api-client', 'automl-tables-wrapper/0.6.0 gl-python/3.6.6 grpc/1.24.3 gax/1.14.3 gapic/0.6.0')]), last exception: 503 DNS resolution failed

Once our model starts training, we don't need to do anything else: it's already saved in our GCP account and good to go for tomorrow.
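If you do want your notebook to wait until training has finished, one option is to poll the operation every so often. Here's a minimal sketch of that pattern. `wait_until_done` and `FakeOperation` are made up for illustration; the only thing assumed about the real `response` object is the `done()` method used in the cell above.

```python
import time

def wait_until_done(operation, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call `operation.done()` until it returns True; return the number of checks made."""
    for attempt in range(1, max_polls + 1):
        if operation.done():
            return attempt
        sleep(poll_seconds)
    raise TimeoutError(f"operation not finished after {max_polls} polls")

class FakeOperation:
    """Stand-in for a long-running operation that finishes on the third check."""
    def __init__(self, ready_after=3):
        self.checks = 0
        self.ready_after = ready_after

    def done(self):
        self.checks += 1
        return self.checks >= self.ready_after

# poll_seconds=0 so the demo finishes immediately
polls = wait_until_done(FakeOperation(), poll_seconds=0)
print(polls)
```

With a real training job you'd pass `response` instead of `FakeOperation()` and keep a sensible `poll_seconds`, since training can take up to the full budget.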

# TPOT

Alright, now we'll move on to [TPOT](https://epistasislab.github.io/tpot/). This is an academic library built on top of scikit-learn, and my favorite thing about it is that when you export a model you're actually exporting all the Python code you need to train that model.

In [8]:
from tpot import TPOTClassifier

# create & fit a TPOT classifier: 8 generations of 20 pipelines each,
# stopping early after 2 generations without improvement
tpot = TPOTClassifier(generations=8, population_size=20, 
                      verbosity=2, early_stop=2)
tpot.fit(X_train_encoded, y_train_encoded)

# save our model code
tpot.export('tpot_pipeline.py')

# print the model code to see what it says
!cat tpot_pipeline.py

  y = column_or_1d(y, warn=True)


HBox(children=(IntProgress(value=0, description='Optimization Progress', max=180, style=ProgressStyle(descript…

Generation 1 - Current best internal CV score: 0.4733457285129677
Generation 2 - Current best internal CV score: 0.4733457285129677

The optimized pipeline was not improved after evaluating 2 more generations. Will end the optimization process.

TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.4, min_samples_leaf=12, min_samples_split=19, n_estimators=100)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# 
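A quick note on where the `max=180` in the progress bar above comes from. As far as I understand the TPOT documentation, the total number of pipelines it plans to evaluate is `population_size + generations * offspring_size`, with `offspring_size` defaulting to `population_size`. The helper below is just that arithmetic (`tpot_total_evaluations` is a hypothetical name, not a TPOT function):

```python
def tpot_total_evaluations(generations, population_size, offspring_size=None):
    """Planned number of pipeline evaluations for a TPOT run (ignoring early stopping)."""
    if offspring_size is None:
        offspring_size = population_size   # assumed default: same as the population size
    return population_size + generations * offspring_size

total = tpot_total_evaluations(generations=8, population_size=20)
print(total)  # 20 + 8 * 20 = 180
```

Because we set `early_stop=2`, the run above stopped well before reaching that total.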

# H2O.ai AutoML

For our final model we'll be using [H2O.ai's open source AutoML library](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html). One thing that I like about this library is that, as each model is trained, it's evaluated both on its own and as part of a stacked ensemble. Kaggle competitors are very fond of stacking (and H2O is known for hiring a lot of top Kagglers), so it's nice to have that automated for us.

In [9]:
import h2o
from h2o.automl import H2OAutoML

# initialize an H2O instance running locally
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_232"; OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09); OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmptv60xpi4
  JVM stdout: /tmp/tmptv60xpi4/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmptv60xpi4/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O cluster uptime | 02 secs |
| H2O cluster timezone | Etc/UTC |
| H2O data parsing timezone | UTC |
| H2O cluster version | 3.26.0.8 |
| H2O cluster version age | 1 month and 17 days |
| H2O cluster name | H2O_from_python_unknownUser_399j5f |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 3.556 Gb |
| H2O cluster total cores | 4 |
| H2O cluster allowed cores | 4 |


In [10]:
# convert our data to H2OFrames, H2O's alternative to pandas DataFrames
train_data = h2o.H2OFrame(X_train)
train_target = h2o.H2OFrame(list(y_train))

# add the target as a new column (named "C1" by default, since we passed in a bare list)
train_data = train_data.cbind(train_target)

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y="C1", training_frame=train_data)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%


In [11]:
# View the top five models from the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=5)

# The leader model can be accessed with `aml.leader`

| model_id | mean_per_class_error | logloss | rmse | mse |
|---|---|---|---|---|
| GBM_5_AutoML_20191205_073544 | 0.677042 | 1.4732 | 0.691298 | 0.477892 |
| GLM_grid_1_AutoML_20191205_073544_model_1 | 0.682991 | 1.43154 | 0.692876 | 0.480077 |
| DeepLearning_grid_1_AutoML_20191205_073544_model_1 | 0.683952 | 1.89134 | 0.689833 | 0.475869 |
| XGBoost_1_AutoML_20191205_073544 | 0.684796 | 1.44713 | 0.699173 | 0.488843 |
| GBM_2_AutoML_20191205_073544 | 0.686873 | 1.49917 | 0.697304 | 0.486232 |
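The first metric column in the leaderboard is `mean_per_class_error`. My understanding is that this is the error rate computed separately for each class and then averaged, so rare job titles count just as much as common ones. Here's a small sketch of that calculation in plain Python. The function and the toy labels are made up for illustration and this isn't H2O's own code.

```python
from collections import Counter

def mean_per_class_error(actual, predicted):
    """Average of the per-class error rates (each class weighted equally)."""
    totals = Counter(actual)
    wrong = Counter(a for a, p in zip(actual, predicted) if a != p)
    per_class = {label: wrong[label] / totals[label] for label in totals}
    return sum(per_class.values()) / len(per_class), per_class

# toy labels, not from the survey data
actual    = ["analyst", "analyst", "analyst", "analyst", "student", "student"]
predicted = ["analyst", "analyst", "analyst", "student", "analyst", "analyst"]

score, per_class = mean_per_class_error(actual, predicted)
print(per_class)   # error rate for each class
print(score)       # mean of those error rates
```

In this toy example half of all predictions are wrong, but the mean per-class error is higher (0.625) because every "student" row was misclassified.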




In [12]:
# save the model out (we'll need to for tomorrow!)
h2o.save_model(aml.leader)

'/kaggle/working/GBM_5_AutoML_20191205_073544'

# Check that we've saved each of our models

Before we wrap up for the day, we want to make sure we've saved all of our models for tomorrow! The Cloud AutoML model is saved automatically on GCP, but we've saved each of the other models in our current working directory. Let's just double check that that's the case:

In [13]:
# check to see that we've saved all of our models
! ls

GBM_5_AutoML_20191205_073544  __output__.json	train_data.csv
__notebook__.ipynb	      tpot_pipeline.py	xgboost_baseline.model


Alright, we've got three models and the code for the notebook. We're all set!
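If you'd rather not eyeball the `ls` output, you could check for the files in code instead. This is a small sketch with a hypothetical helper, `missing_artifacts`. The file names come from the cells above. The H2O model isn't in the list because its file name includes a timestamp that changes from run to run.

```python
from pathlib import Path

# fixed file names written earlier in this notebook
EXPECTED = ["xgboost_baseline.model", "tpot_pipeline.py", "train_data.csv"]

def missing_artifacts(directory, expected=EXPECTED):
    """Return the expected file names that are not present in `directory`."""
    directory = Path(directory)
    return [name for name in expected if not (directory / name).exists()]

missing = missing_artifacts(".")
print("missing:", missing or "nothing")
```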

# Exercise

Now it's your turn! Following the steps above, use the `df_2019` dataframe and: 

* Prepare your data (split into testing and training, encode variables)
* Train your models: XGBoost, Cloud AutoML, TPOT and H2O AutoML

> Note: if you can't or would prefer not to set up billing in order to use Cloud AutoML, feel free to skip training that model.

* Remember to save your models! You'll need them tomorrow and, since it takes a while to run AutoML code, you don't want to have to run it multiple times.

Have fun training your models and I'll see you all tomorrow for our final model evaluation!