This notebook uses the Kaggle "Getting Started" [Titanic competition](https://www.kaggle.com/c/titanic) to showcase how we can use Auto-ML along with datmo and docker to track our work and make the machine learning workflow reproducible and usable. Part of the data analysis is inspired by this [kernel](https://www.kaggle.com/sinakhorami/titanic-best-working-classifier).

The approach can be broken into the following steps:

1. Exploratory Data Analysis (EDA) 
2. Data Cleaning
3. Using Auto-ML to figure out the best algorithm and hyperparameters

During EDA and feature engineering, we will use datmo to version our work by creating snapshots.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import re

train = pd.read_csv('./input/train.csv', header = 0, dtype={'Age': np.float64})
test  = pd.read_csv('./input/test.csv' , header = 0, dtype={'Age': np.float64})
full_data = [train, test]

print (train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
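
The `info()` output above shows that `Age`, `Cabin`, and `Embarked` have fewer than 891 non-null entries. A minimal sketch of counting missing values per column directly, using a small made-up frame (the values are illustrative and not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() marks missing cells; sum() counts them per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Running the same two calls on `train` should give one count per column, which is easier to scan than subtracting the non-null counts from 891.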


#### 1. Exploratory Data Analysis 
###### To understand how each feature contributes to survival

###### a. `Sex`

In [2]:
print (train[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean())

      Sex  Survived
0  female  0.742038
1    male  0.188908


###### b. `Pclass`

In [3]:
print (train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean())

   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363


###### c. `SibSp` and `Parch`

From the number of siblings/spouses and the number of children/parents, we can create a new feature called `FamilySize`.

In [4]:
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
print (train[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean())

   FamilySize  Survived
0           1  0.303538
1           2  0.552795
2           3  0.578431
3           4  0.724138
4           5  0.200000
5           6  0.136364
6           7  0.333333
7           8  0.000000
8          11  0.000000


`FamilySize` seems to have a significant effect on our prediction. The survival rate increases up to a `FamilySize` of 4 and decreases after that. Let's also categorize passengers by whether they are travelling alone.

In [5]:
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
print (train[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean())

   IsAlone  Survived
0        0  0.505650
1        1  0.303538


###### d. `Embarked`

We fill the missing values with the most frequent value, `S`.

In [6]:
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
print (train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean())

  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
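
The cell above hard-codes `S` as the fill value. As a hedged sketch, the most frequent value can also be computed rather than assumed; the toy column below is invented for illustration:

```python
import pandas as pd

# Toy column with one missing port; values are invented for illustration.
embarked = pd.Series(["S", "C", None, "S", "Q"])

# mode() ignores missing values and returns the most frequent entries.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```

Computing the mode keeps the fill value correct if the input data changes, at the cost of one extra line.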


###### e. `Fare`

`Fare` also has some missing values, which we fill with the median of the training set.

In [7]:
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
print (train[['CategoricalFare', 'Survived']].groupby(['CategoricalFare'], as_index=False).mean())

   CategoricalFare  Survived
0   (-0.001, 7.91]  0.197309
1   (7.91, 14.454]  0.303571
2   (14.454, 31.0]  0.454955
3  (31.0, 512.329]  0.581081


This shows that `Fare` has a significant effect on survival: passengers who paid higher fares had higher chances of survival.
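
The fare bins above come from `pd.qcut`, which splits on quantiles so that each bin holds roughly the same number of passengers. A small sketch contrasting it with the equal-width `pd.cut` on skewed, made-up fares (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Skewed toy fares; values are invented for illustration.
fares = pd.Series([7, 7.5, 8, 8.5, 10, 13, 26, 30, 80, 512], dtype=float)

# labels=False returns the integer bin index for each value.
quantile_bins = pd.qcut(fares, 4, labels=False).to_numpy()
width_bins = pd.cut(fares, 4, labels=False).to_numpy()

# Count how many values fall in each of the four bins.
quantile_counts = np.bincount(quantile_bins, minlength=4).tolist()
width_counts = np.bincount(width_bins, minlength=4).tolist()
print(quantile_counts, width_counts)
```

With a long-tailed feature such as `Fare`, equal-width bins put almost every passenger in the first bin, which is why quantile bins are the more informative choice here.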

###### f. `Age`

There are plenty of missing values in this feature. We fill them with random integers between (mean - std) and (mean + std), then categorize age into 5 ranges.

In [8]:
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    
    # Draw random integer ages in [mean - std, mean + std) for the missing entries.
    age_null_random_list = np.random.randint(int(age_avg - age_std), int(age_avg + age_std), size=age_null_count)
    # Assign through .loc to avoid chained indexing on a possible copy.
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
    
train['CategoricalAge'] = pd.cut(train['Age'], 5)

print (train[['CategoricalAge', 'Survived']].groupby(['CategoricalAge'], as_index=False).mean())


  CategoricalAge  Survived
0  (-0.08, 16.0]  0.521368
1   (16.0, 32.0]  0.353468
2   (32.0, 48.0]  0.372470
3   (48.0, 64.0]  0.434783
4   (64.0, 80.0]  0.090909


###### g. `Name`

Let's extract each passenger's title from the name.

In [9]:
def get_title(name):
    # Raw string so that \. is passed to the regex engine as an escaped period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

print("=====Title vs Sex=====")
print(pd.crosstab(train['Title'], train['Sex']))
print("")
print("=====Title vs Survived=====")
print (train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())

=====Title vs Sex=====
Sex       female  male
Title                 
Capt           0     1
Col            0     2
Countess       1     0
Don            0     1
Dr             1     6
Jonkheer       0     1
Lady           1     0
Major          0     2
Master         0    40
Miss         182     0
Mlle           2     0
Mme            1     0
Mr             0   517
Mrs          125     0
Ms             1     0
Rev            0     6
Sir            0     1

=====Title vs Survived=====
       Title  Survived
0       Capt  0.000000
1        Col  0.500000
2   Countess  1.000000
3        Don  0.000000
4         Dr  0.428571
5   Jonkheer  0.000000
6       Lady  1.000000
7      Major  0.500000
8     Master  0.575000
9       Miss  0.697802
10      Mlle  1.000000
11       Mme  1.000000
12        Mr  0.156673
13       Mrs  0.792000
14        Ms  1.000000
15       Rev  0.000000
16       Sir  1.000000
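
To make the title extraction concrete, here is a small standalone sketch of the same regular expression applied to a few names in the `Surname, Title. Given names` format used by the `Name` column. The last sample is a made-up string with no title:

```python
import re

def get_title(name):
    # Match a run of letters preceded by a space and followed by a period.
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else ""

samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "No Title Here",
]
titles = [get_title(n) for n in samples]
print(titles)
```

The pattern returns the first word that ends in a period, so it relies on the title being the first such word in the name.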


Let's group the titles and check their impact on the survival rate. We convert the rare titles to `Rare` and merge spelling variants such as `Mlle` and `Mme` into `Miss` and `Mrs`.

In [10]:
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col',\
    'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

print (train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean())


    Title  Survived
0  Master  0.575000
1    Miss  0.702703
2      Mr  0.156673
3     Mrs  0.793651
4    Rare  0.347826
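
The grouping above can also be expressed as a single lookup function. This is only a sketch with hypothetical names (`VARIANTS`, `COMMON`, `normalize_title`), and it differs from the cell above in one respect: any title outside the four common ones, including an empty string, becomes `Rare`, whereas the notebook only replaces an explicit list.

```python
# Spelling variants mapped to a canonical title, mirroring the replacements above.
VARIANTS = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
# Titles kept as their own category.
COMMON = {"Mr", "Miss", "Mrs", "Master"}

def normalize_title(title):
    # First collapse variants, then bucket everything uncommon as "Rare".
    title = VARIANTS.get(title, title)
    return title if title in COMMON else "Rare"

normalized = [normalize_title(t) for t in ["Mlle", "Dr", "Mr", "Countess"]]
print(normalized)
```

With pandas this could be applied through `dataset['Title'].apply(normalize_title)`.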


In [11]:
import json
config = {"features analyzed": ["Sex", "Pclass", "FamilySize", "IsAlone", "Embarked", "Fare", "Age", "Title"]}

with open('config.json', 'w') as outfile:
    json.dump(config, outfile)

# NOTE: SAVE YOUR JUPYTER NOTEBOOK HERE
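
Since the snapshot listing displays the contents of `config.json`, it can be worth reading the file back before creating the snapshot. A minimal sketch using only the standard library; it writes to a temporary directory so it does not touch the real `config.json`, and the shortened feature list is illustrative:

```python
import json
import os
import tempfile

config = {"features analyzed": ["Sex", "Pclass"]}

# Write to a temporary directory so the sketch does not overwrite a real config.json.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as outfile:
        json.dump(config, outfile)
    # Read the file back to confirm it round-trips.
    with open(path) as infile:
        loaded = json.load(infile)

print(loaded == config)
```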

#### Creating a datmo snapshot to save the current work before proceeding to data cleaning

Let's run this in the terminal:

```bash
home:~/datmo-tutorials/kaggle-titanic$ datmo snapshot create -m "EDA"
Creating a new snapshot
Created snapshot with id: b978fce31d

home:~/datmo-tutorials/kaggle-titanic$ datmo snapshot ls
+------------+---------------------+------------------------------------------+-------+---------+-------+
|     id     |      created at     |                  config                  | stats | message | label |
+------------+---------------------+------------------------------------------+-------+---------+-------+
| b978fce31d | 2018-06-03 19:40:50 |     {u'features analyzed': [u'Sex',      |   {}  |   EDA   |  None |
|            |                     |  u'Pclass', u'FamilySize', u'IsAlone',   |       |         |       |
|            |                     | u'Embarked', u'Fare', u'Age', u'Title']} |       |         |       |
+------------+---------------------+------------------------------------------+-------+---------+-------+
```

#### 2. Data Cleaning
Now let's clean our data and map our features into numerical values.

In [13]:
train_copy  = train.copy()
test_copy = test.copy()
full_data_copy = [train_copy, test_copy]

for dataset in full_data_copy:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    
    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    
    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    # Mapping Fare
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare']                               = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare']                                  = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    
    # Mapping Age
    dataset.loc[ dataset['Age'] <= 16, 'Age']                          = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age']                           = 4
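
The chained `.loc` assignments above encode each range by hand. As a hedged alternative sketch, `pd.cut` with explicit edges and `labels=False` produces the same integer codes in one call; the ages below are made up, and `age_edges` is a name introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Toy ages chosen to hit every bin boundary; values are invented for illustration.
ages = pd.Series([5, 16, 17, 32, 40, 64, 70])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf].
age_edges = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer bin index instead of an interval label.
age_codes = pd.cut(ages, bins=age_edges, labels=False)
print(age_codes.tolist())
```

The same pattern would work for `Fare` with the edges 7.91, 14.454, and 31 used above.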


In [14]:
# Feature Selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp',\
                 'Parch', 'FamilySize']

train_copy = train_copy.drop(drop_elements, axis = 1)
train_copy = train_copy.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)

test_copy  = test_copy.drop(drop_elements, axis = 1)

print (train_copy.head(10))

train_copy = train_copy.values
test_copy  = test_copy.values

   Survived  Pclass  Sex  Age  Fare  Embarked  IsAlone  Title
0         0       3    1    1     0         0        0      1
1         1       1    0    2     3         1        0      3
2         1       3    0    1     1         0        1      2
3         1       1    0    2     3         0        0      3
4         0       3    1    2     1         0        1      1
5         0       3    1    1     1         2        1      1
6         0       1    1    3     3         0        1      1
7         0       3    1    0     2         0        0      4
8         1       3    0    1     1         0        0      3
9         1       2    0    0     2         1        0      3


In [15]:
config = {"selected features": ["Sex", "Pclass", "Age", "Fare", "Embarked", "Fare", "IsAlone", "Title"]}

with open('config.json', 'w') as outfile:
    json.dump(config, outfile)

#### 3. Using Auto-ML to figure out the best algorithm and hyperparameters
##### Now that we have cleaned our data, it's time to use Auto-ML to find the best algorithm for it
![](./images/usage_auto-ml.png)

In [16]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

# The first column is the target (Survived); the remaining columns are features.
X = train_copy[:, 1:]
y = train_copy[:, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_titanic_pipeline.py')

Optimization Progress:  33%|███▎      | 100/300 [00:56<01:30,  2.21pipeline/s]

Generation 1 - Current best internal CV score: 0.826378965656


Optimization Progress:  50%|█████     | 150/300 [01:51<01:53,  1.33pipeline/s]

Generation 2 - Current best internal CV score: 0.835289799956


Optimization Progress:  67%|██████▋   | 200/300 [02:38<01:19,  1.26pipeline/s]

Generation 3 - Current best internal CV score: 0.835289799956


Optimization Progress:  83%|████████▎ | 250/300 [03:35<00:56,  1.13s/pipeline]

Generation 4 - Current best internal CV score: 0.835289799956


                                                                              

Generation 5 - Current best internal CV score: 0.835289799956

Best pipeline: LogisticRegression(ExtraTreesClassifier(Normalizer(GaussianNB(input_matrix), norm=l1), bootstrap=False, criterion=entropy, max_features=0.4, min_samples_leaf=3, min_samples_split=14, n_estimators=100), C=25.0, dual=True, penalty=l2)
0.766816143498


True
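
The split above is random, so the held-out accuracy changes from run to run. A minimal sketch of making the split repeatable with the `random_state` and `stratify` arguments of `train_test_split`. The toy arrays, the helper `split`, and the seed 42 are assumptions for illustration, not part of the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels; values are invented for illustration.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

def split(seed):
    # random_state fixes the shuffle; stratify keeps the class ratio in both parts.
    return train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)

first = split(42)
second = split(42)

# With the same seed, all four returned arrays are identical across calls.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)
```

TPOT's evolutionary search is also randomized, so fixing the split alone does not make the whole run deterministic; it only removes one source of variation when comparing feature sets.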

In [17]:
stats = {"accuracy": (tpot.score(X_test, y_test))} 

with open('stats.json', 'w') as outfile:
    json.dump(stats, outfile)

# NOTE:  SAVE YOUR JUPYTER NOTEBOOK HERE

### Let's create another datmo snapshot to save the current work before changing the feature selection

Let's run this in the terminal:

```bash
home:~/datmo-tutorials/kaggle-titanic$ datmo snapshot create -m "auto-ml-1"
Creating a new snapshot
Created snapshot with id: 6ac8f41754

home:~/datmo-tutorials/kaggle-titanic$ datmo snapshot ls
+------------+----------------+------------------------------------------+---------------+-----------+-------+
|     id     |   created at   |                  config                  |       stats   |  message  | label |
+------------+----------------+------------------------------------------+---------------+-----------+-------+
| 6ac8f41754 |   2018-06-03   |     {u'selected features': [u'Sex',      | {u'accuracy': | auto-ml-1 |  None |
|            |    19:54:22    | u'Pclass', u'Age', u'Fare', u'Embarked', |     0.76}     |           |       |
|            |                |     u'Fare', u'IsAlone', u'Title']}      |               |           |       |
| b978fce31d |   2018-06-03   |     {u'features analyzed': [u'Sex',      |     {}        |    EDA    |  None |
|            |    19:40:50    |  u'Pclass', u'FamilySize', u'IsAlone',   |               |           |       |
|            |                | u'Embarked', u'Fare', u'Age', u'Title']} |               |           |       |
+------------+----------------+------------------------------------------+---------------+-----------+-------+
```

#### Another feature selection
1. Let's keep `FamilySize` rather than just using `IsAlone`
2. Let's use `FarePerPerson` instead of binning `Fare`

In [18]:
train_copy  = train.copy()
test_copy = test.copy()
full_data_copy = [train_copy, test_copy]

for dataset in full_data_copy:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    
    # Mapping titles
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    
    # Mapping Embarked
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    # Fare per person (FamilySize already counts the passenger, so this divides by FamilySize + 1)
    dataset['FarePerPerson'] = dataset['Fare'] / (dataset['FamilySize'] + 1)
    
    # Mapping Age
    dataset.loc[ dataset['Age'] <= 16, 'Age']                          = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age']                           = 4
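
To make the `FarePerPerson` arithmetic concrete, here is a small worked sketch on made-up rows. The `FarePerHead` column is an assumption added only for comparison; it is not what the notebook computes:

```python
import pandas as pd

# Toy rows; values are invented for illustration.
toy = pd.DataFrame({"Fare": [30.0, 8.0], "SibSp": [1, 0], "Parch": [1, 0]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1  # passenger included

# Same formula as the cell above: divides by FamilySize + 1.
toy["FarePerPerson"] = toy["Fare"] / (toy["FamilySize"] + 1)

# Alternative (an assumption, not what the notebook does): divide by FamilySize.
toy["FarePerHead"] = toy["Fare"] / toy["FamilySize"]
print(toy)
```

Because `FamilySize` already includes the passenger, the notebook's formula yields a value lower than the fare split evenly across the family; a passenger travelling alone gets half of the ticket fare.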

In [19]:
# Feature Selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp',\
                 'Parch', 'IsAlone', 'Fare']

train_copy = train_copy.drop(drop_elements, axis = 1)
train_copy = train_copy.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)

test_copy  = test_copy.drop(drop_elements, axis = 1)

print (train_copy.head(10))

train_copy = train_copy.values
test_copy  = test_copy.values

   Survived  Pclass  Sex  Age  Embarked  FamilySize  Title  FarePerPerson
0         0       3    1    1         0           2      1       2.416667
1         1       1    0    2         1           2      3      23.761100
2         1       3    0    1         0           1      2       3.962500
3         1       1    0    2         0           2      3      17.700000
4         0       3    1    2         0           1      1       4.025000
5         0       3    1    1         2           1      1       4.229150
6         0       1    1    3         0           1      1      25.931250
7         0       3    1    0         0           5      4       3.512500
8         1       3    0    1         0           3      3       2.783325
9         1       2    0    0         1           2      3      10.023600


In [20]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

# The first column is the target (Survived); the remaining columns are features.
X = train_copy[:, 1:]
y = train_copy[:, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_titanic_pipeline.py')

Optimization Progress:  33%|███▎      | 100/300 [00:48<00:58,  3.41pipeline/s]

Generation 1 - Current best internal CV score: 0.835423966217


Optimization Progress:  50%|█████     | 150/300 [01:13<01:26,  1.73pipeline/s]

Generation 2 - Current best internal CV score: 0.836949671027


Optimization Progress:  67%|██████▋   | 200/300 [01:53<02:41,  1.62s/pipeline]

Generation 3 - Current best internal CV score: 0.839912800243


Optimization Progress:  83%|████████▎ | 250/300 [02:46<01:02,  1.25s/pipeline]

Generation 4 - Current best internal CV score: 0.839912800243


                                                                              

Generation 5 - Current best internal CV score: 0.839912800243

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=entropy, max_features=0.7, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
0.838565022422


True

In [21]:
config = {"selected features": ["Sex", "Pclass", "Age", "Fare", "Embarked", "FarePerPerson", "FamilySize", "Title"]}

with open('config.json', 'w') as outfile:
    json.dump(config, outfile)

stats = {"accuracy": (tpot.score(X_test, y_test))} 

with open('stats.json', 'w') as outfile:
    json.dump(stats, outfile)

# NOTE:  SAVE YOUR JUPYTER NOTEBOOK HERE

### Let's create one more datmo snapshot to save the final work

Let's run this in the terminal:

```bash
home:~/datmo-tutorials/kaggle-titanic$ datmo snapshot create -m "auto-ml-2"
Creating a new snapshot
Created snapshot with id: 2bcedce966

home:~/datmo-tutorials/kaggle-titanic$ datmo snapshot ls
+------------+----------------+------------------------------------------+---------------+-----------+-------+
|     id     |   created at   |                  config                  |       stats   |  message  | label |
+------------+----------------+------------------------------------------+---------------+-----------+-------+
| 2bcedce966 |   2018-06-03   |     {u'selected features': [u'Sex',      | {u'accuracy': | auto-ml-2 |  None |
|            |    20:31:35    | u'Pclass', u'Age', u'Fare', u'Embarked', |     0.83}     |           |       |
|            |                |      u'FarePerPerson', u'FamilySize',    |               |           |       |
|            |                |                u'Title']}                |               |           |       |
| 6ac8f41754 |   2018-06-03   |     {u'selected features': [u'Sex',      | {u'accuracy': | auto-ml-1 |  None |
|            |    19:54:22    | u'Pclass', u'Age', u'Fare', u'Embarked', |     0.76}     |           |       |
|            |                |     u'Fare', u'IsAlone', u'Title']}      |               |           |       |
| b978fce31d |   2018-06-03   |     {u'features analyzed': [u'Sex',      |     {}        |    EDA    |  None |
|            |    19:40:50    |  u'Pclass', u'FamilySize', u'IsAlone',   |               |           |       |
|            |                | u'Embarked', u'Fare', u'Age', u'Title']} |               |           |       |
+------------+----------------+------------------------------------------+---------------+-----------+-------+
```

#### Moving to a different snapshot

Let's now move to a different snapshot in order to get `experimentation.ipynb`, `submission.csv`, `tpot_titanic_pipeline.py`, or any other file from that version. Since this will change the code as well, we should run it outside of the Jupyter notebook. Save your Jupyter notebook here and run the commands below in a new terminal. You should see your Jupyter notebook change to the previous version.

We use the `checkout` command to do this:

```bash
home:~/datmo-tutorials/auto-ml$ # Run this command: datmo snapshot checkout --id <snapshot-id>
home:~/datmo-tutorials/auto-ml$ datmo snapshot checkout --id 30803662
```
