# A Quick Introduction to Automated Machine Learning (AutoML) and the "Mankind Vs Machine" Community Challenge

Geoff Pidcock | 20181103 <br>

<a id='Section0'></a>
## Motivation
The topic of Automated Machine Learning (AutoML) has been buzzing throughout 2018, with big players like Google [making it a focus of their keynotes](https://techcrunch.com/2018/01/17/googles-automl-lets-you-train-custom-machine-learning-models-without-having-to-code/), and automation platform providers like [DataRobot securing crazy amounts of funding](https://www.businesswire.com/news/home/20181025005739/en/DataRobot-Raises-100-Million-Series-Led-Meritech) in order to scale up. <br>

As a data analyst responsible for delivering predictive analytics products to an Australian business, I decided to do some research into the topic. I wanted to find out:
- [What, exactly, is AutoML](#Section2)
- [What the tools can and can't do - and what value they could bring to my trade](#Section3)
- [How to get started using these tools (with some code)](#Section4)
- [How good are these tools really (and should I be worried about being out of a job... :D) ](#Section5)
- [Whether the community was also interested in exploring these tools](#Section1)
- [Where to learn more, and follow developments in AutoML](#Section6)

Here's what I've found out. 

**IMPORTANT NOTE:** <br> 
If you're considering competing in the ["Mankind Vs Machine" Machine Learning Challenge starting Friday Nov 9](https://ga.co/2PtgkLn), and haven't used Kaggle before, [I would encourage skipping through to the code](#Section4), and attempting a Kaggle submission on the [Titanic Challenge](https://www.kaggle.com/c/titanic) before the event. 

[You can learn more about this challenge here.](#Section1)

#### Figure 1: Interest over time, for a number of keywords relating to AutoML tools/platforms
<img src="./images/interest_over_time.png" style="width:450px;height:321px;" align="left"/>

____

<a id='Section2'></a>
## What, exactly, is AutoML?
[Go back to top](#Section0) <br>
In a nutshell: AutoML is **Machine Learning for Machine Learning**! Simple :D <br>
<img src="./images/inception.jpeg" style="width:512px;height:256px;" align="left"/>

Quoting [Sibanjan Das and Umit Mert Cakmak](https://www.packtpub.com/big-data-and-business-intelligence/hands-automated-machine-learning),

>AutoML aims to **ease the process of building ML models by automating commonly-used steps**, such as feature preprocessing, model selection, and hyperparameters tuning.

AutoML isn't any one particular tool or platform ([as argued very eloquently by Rachel Thomas](https://www.fast.ai/2018/07/23/auto-ml-3/)) - it is a field of machine learning that includes research, open-source AutoML libraries, workshops, and competitions (including a [community challenge coming up in Sydney next week](http://bit.ly/2CNsaJN)). <br>

This field has produced a number of tools that can be used in your data science projects, including: <br>

*Free* <br>
- [TPOT (Python)](https://github.com/EpistasisLab/tpot)
- [Auto-SKLearn (Python)](https://automl.github.io/auto-sklearn/stable/)  -warning, not Windows compatible!
- [MLBox (Python)](https://github.com/AxeldeRomblay/MLBox)
- [Featuretools (Python)](https://www.featuretools.com/)

*Open with Paid Support* <br>
- [H2O.AI (Various)](http://docs.h2o.ai/)

*Provided at Cost* <br>
- [DataRobot (Various Languages, GUI)](https://www.datarobot.com/product/)
- [Google Cloud AutoML (Various Languages)](https://cloud.google.com/automl/)
- [Einstein AI (Various Languages, GUI)](https://developer.salesforce.com/einstein/)
<br>

**Note**: There is an absence of R-specific tools in this list! I'd love to hear from R coders who know of, or could recommend, AutoML packages.

These tools have a wide range of capabilities. The next section contains a short dive into what they can and can't do.

___

<a id='Section3'></a>
## What the tools can and can't do - and what value they could bring to your trade
[Go back to contents](#Section0) <br>
Creating a predictive analytics product is a messy process, involving many steps. A popular model for this process is [CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining):

#### Figure 2: Overview of a predictive analytics project process (CRISP-DM)
*Highlighting is added to the stages where AutoML typically provides value.*
<img src="./images/crisp-dm.png" style="width:603px;height:421px;" align="left"/>

A number of steps are presently out of scope of the tools mentioned above, including: 
- "Business Understanding" - automated services do not presently replace the need for consultation and domain expertise.
- **"Data Understanding"** - automated services still do not manage data collection/description/tagging. No matter how much I wish it was otherwise, in 2018, it still is "garbage in, garbage out"!
- "Deployment" - though the tools can assist deployment (e.g. by wrapping everything up with a nice API), there is still all that project management and system implementation goodness that one cannot do away with, yet.

So in my opinion, **what value could they bring to your trade?**
1. **Reduce the burden of data prep.** <br>Based upon the [Kaggle State of ML Survey 2017](https://www.kaggle.com/surveys/2017), the majority of a data scientist's time is spent on data understanding and data preparation. Though there is no silver bullet for data understanding, having a system manage the model-specific needs of your data (like nulls, encoding, or derived features, aka feature engineering) could be a major win. Competitions like the [AutoML ChaLearn](http://automl.chalearn.org/) have demonstrated these capabilities in tools like Auto-SKLearn.
2. **Establish a new benchmark for bespoke data prep/modelling to beat.** <br> This seems to be the opinion of most experienced data scientists I speak to.
3. **Provide a 'good enough' model so that you can focus on the other time-consuming, non-negotiable parts of the project.** <br> I don't imagine this will be a popular opinion, as data science training and competitions tend to emphasize the importance of model selection and hand-tuned hyper-parameters. Those who work on their own (e.g. in a startup) or in small/dislocated teams may agree though, as **the value is not in the model, but in the business outcome**.

___

<a id='Section4'></a>
## How to get started using these tools (with some code)
[Go back to top](#Section0) <br>
Using an Auto-ML tool is a matter of 
- installing it, 
- understanding the AutoML data prep requirements, 
- instantiating AutoML as an estimator,
- and giving the tool plenty of time to train - as *AutoML unfortunately does not mean quick ML!*

The following section demonstrates how to use the AutoML library TPOT in generating a survival prediction on the famous Titanic Dataset. <br>
Some of this code has been adapted from [Sibanjan Das and Umit Mert Cakmak](https://www.packtpub.com/big-data-and-business-intelligence/hands-automated-machine-learning), and from the [TPOT titanic tutorial](https://github.com/EpistasisLab/tpot/blob/master/tutorials/Titanic_Kaggle.ipynb). <br>
The data has been [sourced from the Titanic Demonstration Kaggle Competition](https://www.kaggle.com/c/titanic/data)

### TPOT AutoML Example
#### Installation and Setup

In [None]:
# Installing the AutoML library
import sys
!{sys.executable} -m pip install tpot
# Note - there are optional extras needed to use other elements of TPOT - see the docs: http://epistasislab.github.io/tpot/installing/

In [2]:
# Importing libraries
from tpot import TPOTClassifier
from sklearn.dummy import DummyClassifier # baseline
from sklearn.ensemble import RandomForestClassifier # Random forest estimator - will be used later for comparison
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt

#### Understanding the Data Prep Requirements
For TPOT to work, some basic data prep is needed. This includes
- relabelling of the training data features (i.e. the target is relabelled to 'class')
- management of null values
- categorical feature encoding
- removal of identifiers or other irrelevant features

Note: Each library (or platform) will have its own requirements, so it's important to read the docs. <br> 
[You can find TPOT's docs here](https://epistasislab.github.io/tpot/)

In [3]:
# Importing the kaggle train and test datasets
titanic_train = pd.read_csv("./data/train.csv")
titanic_test = pd.read_csv("./data/test.csv")

In [6]:
# Quickly inspecting training data
titanic_train.info()
"""
Looks like we'll have to:
- DROP Name, Ticket
- ENCODE Sex, Embarked, potentially Cabin (though might be better to drop it given the nulls)
- Fill nulls in Age, Embarked,
""" 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [8]:
# Prep training data
titanic_train_prepped = titanic_train.copy()

# Following the TPOT Titanic tutorial, rename the target column to 'class'
titanic_train_prepped.rename(columns={'Survived': 'class'}, inplace=True)

# encode categorical variables
titanic_train_prepped['Sex'] = titanic_train_prepped['Sex'].map({'male':0,'female':1})
titanic_train_prepped['Embarked'] = titanic_train_prepped['Embarked'].map({'S':0,'C':1,'Q':2})

# fill nulls
titanic_train_prepped = titanic_train_prepped.fillna(-999)

# removal of identifiers and other features
titanic_train_prepped = titanic_train_prepped.drop(['Name','Ticket','Cabin'], axis=1)

In [11]:
# Split titanic data into test and train
X_train, X_test, y_train, y_test = train_test_split(titanic_train_prepped.drop(columns=['class']),
                                                    titanic_train_prepped['class'],
                                                    train_size=0.75,
                                                    test_size=0.25)

In [14]:
# Applying the same processing to the Kaggle test data
titanic_test_prepped = titanic_test.copy()

# encode categorical variables
titanic_test_prepped['Sex'] = titanic_test_prepped['Sex'].map({'male':0,'female':1})
titanic_test_prepped['Embarked'] = titanic_test_prepped['Embarked'].map({'S':0,'C':1,'Q':2})

# fill nulls
titanic_test_prepped = titanic_test_prepped.fillna(-999)

# removal of identifiers and other features
titanic_test_prepped = titanic_test_prepped.drop(['Name','Ticket','Cabin'], axis=1)

In [15]:
# Making sure test dataset is the same shape as train
assert (titanic_train_prepped.drop(columns=['class']).shape[1] == titanic_test_prepped.shape[1]), "Not Equal"

#### Instantiating AutoML as an Estimator
The code below uses some standard hyperparameter settings when instantiating TPOT. <br>
An important TPOT parameter to set is the number of generations. Why? Because training TPOT with lots of generations can take a long time. 
[Quoting this reference:](https://github.com/EpistasisLab/tpot/blob/master/tutorials/Titanic_Kaggle.ipynb) <br>
> On a standard laptop with 4GB RAM, it roughly takes 5 minutes per generation to run. For each added generation, it should take 5 mins more. Thus, for the default value of 100, total run time could be roughly around 8 hours.

In [9]:
tpot = TPOTClassifier(generations=10, population_size=30, verbosity=2)
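
As a rough sanity check on how much work these settings ask for: the TPOT docs state that the total number of pipelines evaluated is `population_size + generations × offspring_size`, with `offspring_size` defaulting to `population_size`. The sketch below is back-of-envelope arithmetic only. The 5 minutes per generation figure is the tutorial's estimate quoted above (not something I measured), and the helper names are my own.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Pipelines TPOT scores in total: the initial population plus the offspring of each generation."""
    if offspring_size is None:
        offspring_size = population_size  # TPOT's documented default
    return population_size + generations * offspring_size


def estimated_hours(generations, minutes_per_generation=5):
    """Run time implied by the tutorial's rough 'minutes per generation' figure (an assumption)."""
    return generations * minutes_per_generation / 60


this_run = pipelines_evaluated(generations=10, population_size=30)
defaults = pipelines_evaluated(generations=100, population_size=100)

print(this_run)                        # 330 pipelines for the settings used here
print(defaults)                        # 10100 pipelines for TPOT's defaults
print(this_run * 5)                    # 1650 model fits, since each pipeline is scored with cv=5
print(round(estimated_hours(100), 1))  # 8.3 hours for 100 generations at 5 minutes each
```

In other words, the small settings used here cut the search to a few percent of the default, which is why it finishes in a coffee break rather than overnight.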

#### ... And giving it plenty of time to train! 

In [12]:
# When you invoke fit method, TPOT will create generations of populations, 
# seeking best set of parameters. Arguments you have used to create
# TPOTClassifier such as generations and population_size will affect the
# search space and resulting pipeline.
tpot.fit(X_train, y_train)

Generation 1 - Current best internal CV score: 0.8100381135259376
Generation 2 - Current best internal CV score: 0.8100381135259376
Generation 3 - Current best internal CV score: 0.8100381135259376
Generation 4 - Current best internal CV score: 0.8175012988524379
Generation 5 - Current best internal CV score: 0.8175012988524379
Generation 6 - Current best internal CV score: 0.8175012988524379
Generation 7 - Current best internal CV score: 0.8175123546843892
Generation 8 - Current best internal CV score: 0.8190048919978221
Generation 9 - Current best internal CV score: 0.8190048919978221
Generation 10 - Current best internal CV score: 0.8279385029738526

Best pipeline: LogisticRegression(RandomForestClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.25, min_samples_leaf=10, min_samples_split=11, n_estimators=100), C=1.0, dual=False, penalty=l1)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=10,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=30,
        random_state=None, scoring=None, subsample=1.0, use_dask=False,
        verbosity=2, warm_start=False)
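
To give a feel for what "generations of populations" means, here is a toy evolutionary search in plain Python. It is emphatically not TPOT's algorithm (TPOT evolves whole scikit-learn pipelines using genetic programming); it evolves a single number against a made-up score function, and every name in it is hypothetical. It does illustrate one reason a "current best" score like the one in the log above only stays flat or improves: the best candidate found so far is never thrown away.

```python
import random


def evolve(fitness, generations=10, population_size=30, seed=0):
    """Toy evolutionary search over single numbers (illustration only, not TPOT)."""
    rng = random.Random(seed)
    population = [rng.uniform(-10, 10) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        # Offspring: copy a randomly chosen parent and mutate it with Gaussian noise
        offspring = [rng.choice(population) + rng.gauss(0, 1)
                     for _ in range(population_size)]
        # Selection: keep the fittest candidates from parents and offspring combined
        population = sorted(population + offspring, key=fitness, reverse=True)[:population_size]
        best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation


def toy_score(x):
    """Made-up objective that peaks at x = 3 (stands in for a cross-validation score)."""
    return -(x - 3.0) ** 2


best, history = evolve(toy_score)
for generation, score in enumerate(history, start=1):
    print(f"Generation {generation} - Current best score: {score:.4f}")
```

More generations or a bigger population means more candidates get tried, which is the same trade-off between search quality and run time that the `generations` and `population_size` arguments control.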

In [21]:
# Thankfully after it's done, you can export the pipeline summarized above. 
tpot.export('example_tpot_pipeline.py')
# It's quite interesting to read the code it generates - take a look!

True

#### So how did TPOT do?
- Does it beat a dummy classifier?
- Does it beat a Random Forest?
- Does it rank well on Kaggle :p?

Let's find out!

In [16]:
# Getting TPOT's accuracy score
tpot.score(X_test,y_test)

0.820627802690583

In [17]:
# Does it beat the dummy?
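# 'stratified' (random guesses matching the class balance) was the default strategy when this was written;
# scikit-learn 0.24+ defaults to 'prior', so it is set explicitly here
dummy = DummyClassifier(strategy='stratified')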
dummy.fit(X_train, y_train)
dummy.score(X_test,y_test)

0.5022421524663677

It beats the dummy!
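
A score of about 0.50 is what I'd expect from a dummy that guesses at random in proportion to the class balance (the "stratified" strategy, which I'm assuming is what was used here). The arithmetic below is a sketch that shows a tougher "no-skill" baseline is simply predicting the majority class. The class counts are the ones I recall for Kaggle's `train.csv`; treat them as an assumption and re-check them with `titanic_train['Survived'].value_counts()`.

```python
# Class counts for the 891-row Kaggle training file (assumed - verify with value_counts())
survived, died = 342, 549
total = survived + died
p = survived / total  # share of passengers who survived

# Random guessing in proportion to the classes is right when guess and truth agree
stratified_accuracy = p * p + (1 - p) * (1 - p)

# Always predicting the majority class ("did not survive")
majority_accuracy = max(survived, died) / total

print(round(stratified_accuracy, 3))  # 0.527
print(round(majority_accuracy, 3))    # 0.616
```

So the fairer bar for any model on this dataset is roughly 62% accuracy rather than 50%. TPOT clears it comfortably in this run.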

In [18]:
# Does it beat the Random Forest? (my personal favourite estimator, and arguably its own flavour of AutoML, given it handles ensembling)
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' has been removed from recent scikit-learn releases
rfclf = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True, random_state=1)
rfclf.fit(X_train, y_train)
rfclf.score(X_test,y_test)

0.8071748878923767

It beats the Random Forest!
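
It's worth translating these accuracies back into passengers. Assuming scikit-learn rounds the 25% hold-out of 891 rows up to 223 rows (which lines up with the scores printed above), the gap between TPOT and the Random Forest is only a handful of predictions, so I wouldn't read too much into this single split.

```python
import math

n_rows = 891
n_test = math.ceil(0.25 * n_rows)  # assumed size of the 25% hold-out split

# Accuracy scores copied from the cell outputs above
scores = {
    "tpot": 0.820627802690583,
    "random_forest": 0.8071748878923767,
    "dummy": 0.5022421524663677,
}

# Accuracy is correct predictions divided by rows, so multiply back out
correct = {name: round(score * n_test) for name, score in scores.items()}
for name, hits in correct.items():
    print(f"{name}: {hits} of {n_test} correct")

gap = correct["tpot"] - correct["random_forest"]
print(f"TPOT got {gap} more passengers right than the Random Forest")
```

A different random split (neither `train_test_split` nor TPOT was given a `random_state` here) could easily flip a difference this small.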

In [19]:
# Does it rank well on Kaggle?
submission = tpot.predict(titanic_test_prepped)

# Create the submission file
final = pd.DataFrame({'PassengerId': titanic_test_prepped['PassengerId'], 'Survived': submission})
final.to_csv('data/submission.csv', index = False)

**It doesn't rank well on Kaggle :p** <br>
The score from this run-through was ~0.73. <br>
An earlier submission did beat the common-sense "Gender Model", which simply predicts that the women survived the Titanic <br>
(i.e. if Sex = female, Survived = 1).
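
For reference, the "Gender Model" benchmark needs no machine learning at all. Here is a minimal sketch in plain Python; the sample rows are made up for illustration (they are not taken from the Kaggle files), and the function names are my own.

```python
def gender_model(passengers):
    """Predict survival (1) for every female passenger and 0 for everyone else."""
    return [1 if p["Sex"] == "female" else 0 for p in passengers]


def accuracy(predictions, passengers):
    """Share of predictions that match the recorded outcome."""
    hits = sum(pred == p["Survived"] for pred, p in zip(predictions, passengers))
    return hits / len(passengers)


# Made-up rows for illustration only
sample = [
    {"PassengerId": 1, "Sex": "male", "Survived": 0},
    {"PassengerId": 2, "Sex": "female", "Survived": 1},
    {"PassengerId": 3, "Sex": "female", "Survived": 1},
    {"PassengerId": 4, "Sex": "male", "Survived": 1},  # the rule gets this one wrong
    {"PassengerId": 5, "Sex": "male", "Survived": 0},
]

predictions = gender_model(sample)
print(predictions)                    # [0, 1, 1, 0, 0]
print(accuracy(predictions, sample))  # 0.8
```

With pandas, the same rule applied to the raw (un-encoded) test file would be something like `(titanic_test['Sex'] == 'female').astype(int)`.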

#### Figure 3: Kaggle rankings of slapdash TPOT model. 
*The ranking shown is the first submission. <br>
The second run through did not beat the first submission, or the common sense benchmark (labelled "Gender Based Model").*

<img src="./images/kaggle_kinda_fail.png" style="width:673px;height:500px;" align="left"/>

<a id='Section5'></a>
## How good are these tools really? (and should I be worried about being out of a job... :D)
[Go back to top](#Section0) <br>

This is actually a really hard question to answer! <br>

I can see the value of TPOT to my trade, in providing an easy new benchmark to beat, and providing some inspiration towards model selection and pipeline construction. It is certainly easier kicking off TPOT and leaving it to brew, than thinking through every possible model and searching through each hyper-parameter! <br>

**As for should I be worried about being out of a job...** <br>

*On the side of **No, I shouldn't**:*
- The rest of the Data Science and Predictive Analytics process is hard and technical, and out of the scope of the tools I've come across so far. (i.e. it's more a productivity tool than a replacement tool).
- TPOT sucked at Kaggle - take that, computers!

*On the side of **Yes, I should**:*
- Some of those other steps in the Data Science process could be executed by a good project or product manager, with a sound understanding of "democratized" AutoML tools. 
- TPOT's poor Kaggle performance was likely due to errors and missed opportunities on my part, including:
>- Thoughtless data pre-processing 
(i.e. I could have imputed the age, rather than filled it with -999!)
>- No Application of Domain Knowledge  
(i.e. I could have introduced a feature "child" based on the value of age, and a model with clear "women" and "child" flags would likely perform much better than a model with sex and age).
<br>
- There are other tools besides TPOT! Tools like Auto-SKLearn and Featuretools can help with more of the data preparation than TPOT. 
- The team behind AlphaGo are actively at work applying reinforcement learning to the problem of AutoML ([see this paper](https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxhdXRvbWwyMDE4aWNtbHxneDo0OTQzOThjNmZmYjYxYjc3)). <br> 
- **I can certainly imagine a future sometime soon of a Kaggler being upstaged by an automatic opponent, much like Lee Sedol at Go**
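
The two pre-processing ideas listed above (imputing Age rather than filling it with -999, and adding a "child" flag) could be sketched as follows. This is plain Python on made-up rows so that it runs without the dataset. The age cut-off of 16 and the extra `AgeMissing` flag are my assumptions, not something tested on the leaderboard.

```python
from statistics import median


def add_age_features(passengers, child_age=16):
    """Fill missing ages with the median known age, then flag children.

    child_age=16 is an arbitrary cut-off chosen for illustration.
    """
    known_ages = [p["Age"] for p in passengers if p["Age"] is not None]
    fill_value = median(known_ages)
    prepped = []
    for p in passengers:
        age = p["Age"] if p["Age"] is not None else fill_value
        prepped.append({
            **p,
            "Age": age,
            "AgeMissing": int(p["Age"] is None),  # keep a record that the age was imputed
            "Child": int(age < child_age),
        })
    return prepped


# Made-up rows for illustration only
sample = [
    {"Sex": "male", "Age": 22.0},
    {"Sex": "female", "Age": 4.0},
    {"Sex": "female", "Age": None},
    {"Sex": "male", "Age": 38.0},
]

for row in add_age_features(sample):
    print(row)
```

In a real run, the fill value should be computed on the training data only and then reused for the Kaggle test file, so that both sets are prepared identically.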

So I dunno! <br>
¯\\_(ツ)_/¯


<a id='Section1'></a>
## HELP US EXPLORE AUTO-ML AT THE MANKIND VS MACHINE COMMUNITY CHALLENGE!
[Go back to top](#Section0) <br>

Join over 40 other people in exploring the capabilities of AutoML, and in trying to beat their predictions :) <br>

It kicks off this Friday at GA Sydney! <br>
[You can register for the kick off here:](https://ga.co/2PtgkLn) <br>
>"Given a dataset and a problem, can your prediction beat an AutoML service?" <br>
>The dataset and prediction problem will be made available to attendees during the event (on Friday morning 9th of November), and submissions will be managed using a Kaggle In Class competition. 
>The competition will close Wednesday the 21st of November, and a winner will be judged on Friday the 23rd of November. <br>
>The judging criteria will depend on the dataset, but it is likely to be the submission with greatest prediction accuracy. <br>
>The prize for winner is glory, a mix of GA swag, $$$, and GA class credit.

Think of it as a free opportunity to get familiar with these tools, and learn about their strengths and weaknesses with the rest of the Sydney Data Science community.

#### Figure 4: Come along and defeat the adorable robots!

<img src="./images/Japanese-Technology-Robotic-Mall-Robot-Japan-1964072.jpg" style="width:512px;height:356px;" align="left"/>


<a id='Section6'></a>
## Where to learn more, and follow developments in AutoML?
[Go back to top](#Section0) <br>

*Learn More*
- [BOOK: Das and Mert Cakmak; Hands-On Automated Machine Learning; April 2018; Packt Publishing](https://www.packtpub.com/big-data-and-business-intelligence/hands-automated-machine-learning)
- [BOOK: Hutter, Kotthoff, Vanschoren et al; AUTOML: Methods, Systems, Challenges; 2018](https://www.automl.org/book/)

*Follow Developments*
- [The Freiburg AutoML Group - makers of Auto-SKLearn, and managers of the ChaLearn AutoML Challenge](https://twitter.com/AutoMLFreiburg)
- [Paper submissions to the AutoML Track of ICML - e.g. 2018](https://sites.google.com/site/automl2018icml/accepted-papers)

If I've missed anything, let me know, and I'll update this list!