# Documentation

1) http://auto-ml.readthedocs.io/en/latest/api_docs_for_geeks.html  
2) https://github.com/ClimbsRocks/auto_ml

# Code

In [1]:
import pandas as pd
import numpy as np

df = pd.read_excel("bank.xlsx")

In [2]:
# Encode the 'y' column as a binary target (1 for 'yes', 0 otherwise), then drop the original
df['target'] = df['y'].apply(lambda x: 1 if x == 'yes' else 0)
df.drop('y', axis=1, inplace=True)

In [3]:
df.head()

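|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | 0 |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | 0 |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | 0 |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | 0 |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | 0 |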


In [4]:
df.isnull().mean().sort_values(ascending=False)*100

target       0.0
loan         0.0
job          0.0
marital      0.0
education    0.0
default      0.0
balance      0.0
housing      0.0
contact      0.0
poutcome     0.0
day          0.0
month        0.0
duration     0.0
campaign     0.0
pdays        0.0
previous     0.0
age          0.0
dtype: float64

In [5]:
df.select_dtypes(include=['object']).columns

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome'],
      dtype='object')

In [6]:
# Data transformation
# Convert categorical values to numeric using label encoder
from sklearn import preprocessing
from collections import defaultdict
d = defaultdict(preprocessing.LabelEncoder)

# Fit one LabelEncoder per categorical column (missing values become the string 'NA')
fit = df.select_dtypes(include=['object']).fillna('NA').apply(lambda x: d[x.name].fit_transform(x))

# Replace each categorical column in df with its integer codes from the fitted encoders
for i in list(d.keys()):
    df[i] = d[i].transform(df[i].fillna('NA'))
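
As a rough illustration of what the encoders in this cell produce: `LabelEncoder` assigns integer codes according to the sorted order of the distinct values it saw while fitting, so the codes carry no meaningful ranking. The stdlib-only sketch below mimics that mapping for the `marital` values shown in the `df.head()` output. `fit_label_mapping` and `encode` are hypothetical helper names, not part of scikit-learn.

```python
def fit_label_mapping(values):
    """Map each distinct value to its position in sorted order (hypothetical helper)."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


def encode(values, mapping):
    """Replace each value with its integer code; a value not seen in fitting raises KeyError."""
    return [mapping[v] for v in values]


# 'marital' values of the first five rows shown by df.head()
marital = ["married", "married", "single", "married", "married"]
mapping = fit_label_mapping(marital)
codes = encode(marital, mapping)

print(mapping)
print(codes)
```

Because the codes depend only on sort order, a model that treats them as numbers would see an ordering that does not exist in the data, which is one reason to keep declaring these columns as `categorical` later on.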

In [7]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size = 0.4)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

features_train = train[train.columns.difference(['target'])]
label_train = train['target']
features_test = test[test.columns.difference(['target'])]
label_test = test['target']
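
The split in this cell passes no `random_state`, so every run selects different rows and the scores further down will vary between runs. The sketch below is a stdlib-only illustration of a seeded 60/40 split over row indices, not scikit-learn's implementation. The row count of 4521 is inferred from the confusion-matrix totals later in this notebook (2712 + 1809), and `seeded_split` is a hypothetical helper.

```python
import math
import random


def seeded_split(n_rows, test_size, seed):
    """Return (train_idx, test_idx) from a reproducible shuffle (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_rows)  # round the test share up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = seeded_split(n_rows=4521, test_size=0.4, seed=42)
same_train, same_test = seeded_split(n_rows=4521, test_size=0.4, seed=42)

print(len(train_idx), len(test_idx))
print(train_idx == same_train and test_idx == same_test)
```

The resulting sizes are 2712 training rows and 1809 test rows, and repeating the call with the same seed returns the same indices. With `train_test_split` itself, passing an integer `random_state` gives the same kind of repeatability.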



In [8]:
from auto_ml import Predictor

In [9]:
col_desc_dictionary = {'target': 'output',
                       'job': 'categorical', 
                       'marital': 'categorical', 
                       'education': 'categorical', 
                       'default': 'categorical', 
                       'housing': 'categorical', 
                       'loan': 'categorical', 
                       'contact': 'categorical',
                       'month': 'categorical', 
                       'poutcome': 'categorical'}

In [22]:
ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=col_desc_dictionary)

In [23]:
ml_predictor.train(train, ml_for_analytics=True)

Welcome to auto_ml! We're about to go through and make sense of your data using machine learning, and give you a production-ready pipeline to get predictions with.

If you have any issues, or new feature ideas, let us know at http://auto.ml
You are running on version 2.9.10
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'learning_rate': 0.1, 'presort': False, 'warm_start': True}
Running basic data cleaning
Fitting DataFrameVectorizer
Now using the model training_params that you passed in:
{}
After overwriting our defaults with your values, here are the final params that will be used to initialize the model:
{'learning_rate': 0.1, 'presort': False, 'warm_start': True}


********************************************************************************************
About to fit the pipeline for the model GradientBoostingClassifier to predict target
Started at:
2

<auto_ml.predictor.Predictor at 0x10ec8d898>

In [26]:
ml_predictor.score(features_train,label_train)

Here is our brier-score-loss, which is the default value we optimized for while training, and is the value returned from .score() unless you requested a custom scoring metric
It is a measure of how close the PROBABILITY predictions are.
0.0576

Here is the trained estimator's overall accuracy (when it predicts a label, how frequently is that the correct label?)
92.5%

Here is a confusion matrix showing predictions vs. actuals by label:
Predicted >     0    1   All
v Actual v                  
0            2370   32  2402
1             172  138   310
All          2542  170  2712

Here is predictive value by class:
Class:  0 = 0.932336742722266
Class:  1 = 0.8117647058823529
+--------------------------------+-----------------------------------+--------------------------------+
| Bucket Edges                   |   Predicted Probability Of Bucket |   Actual Probability of Bucket |
|--------------------------------+-----------------------------------+--------------------------------|
| (0.0

-0.05755529819521337
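
The accuracy and per-class predictive values printed for the training data can be re-derived from the confusion matrix counts. This is a minimal stdlib sketch using the four cell counts copied from that output. The function names are hypothetical, and "predictive value" is taken here to mean correct predictions of a class divided by all predictions of that class, which is consistent with the printed figures.

```python
# Keys are (actual label, predicted label); counts come from the training confusion matrix.
confusion = {
    (0, 0): 2370, (0, 1): 32,
    (1, 0): 172, (1, 1): 138,
}


def accuracy(cm):
    """Share of all rows whose predicted label equals the actual label."""
    correct = sum(n for (actual, predicted), n in cm.items() if actual == predicted)
    return correct / sum(cm.values())


def predictive_value(cm, label):
    """Of all rows predicted as `label`, the share that actually are `label`."""
    predicted_as_label = sum(n for (_, predicted), n in cm.items() if predicted == label)
    return cm[(label, label)] / predicted_as_label


print(f"accuracy: {accuracy(confusion):.1%}")
print(f"class 0: {predictive_value(confusion, 0):.4f}")
print(f"class 1: {predictive_value(confusion, 1):.4f}")
```

Swapping in the test-set counts (1557, 41, 149, 62) gives 89.5% accuracy and predictive values of about 0.913 and 0.602, matching the output of the second `.score()` call.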

In [27]:
ml_predictor.score(features_test,label_test)

Here is our brier-score-loss, which is the default value we optimized for while training, and is the value returned from .score() unless you requested a custom scoring metric
It is a measure of how close the PROBABILITY predictions are.
0.0754

Here is the trained estimator's overall accuracy (when it predicts a label, how frequently is that the correct label?)
89.5%

Here is a confusion matrix showing predictions vs. actuals by label:
Predicted >     0    1   All
v Actual v                  
0            1557   41  1598
1             149   62   211
All          1706  103  1809

Here is predictive value by class:
Class:  0 = 0.9126611957796014
Class:  1 = 0.6019417475728155
+--------------------------------+-----------------------------------+--------------------------------+
| Bucket Edges                   |   Predicted Probability Of Bucket |   Actual Probability of Bucket |
|--------------------------------+-----------------------------------+--------------------------------|
| (0.

-0.07544554098010467
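
The Brier score loss named in the output is the mean squared difference between the predicted probability of the positive class and the actual 0/1 label, so lower is better. The values returned by the two `.score()` calls (about -0.0576 and -0.0754) are the negatives of the printed losses, which presumably follows the convention that a higher score is better. The sketch below uses made-up probabilities, not predictions from this model.

```python
def brier_score_loss(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Illustrative values only; these are not predictions from the trained pipeline.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.8, 0.3]

loss = brier_score_loss(y_true, y_prob)
score = -loss  # negated so that "higher is better", like the returned values above

print(round(loss, 4), round(score, 4))
```

On this reading, the move from roughly 0.058 on the training rows to 0.075 on the held-out rows means the probability estimates are somewhat less accurate on unseen data.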

In [28]:
file_name = ml_predictor.save()



We have saved the trained pipeline to a filed called "auto_ml_saved_pipeline.dill"
It is saved in the directory: 
/Users/mbagav200/Desktop/Medium
To use it to get predictions, please follow the following flow (adjusting for your own uses as necessary:


`from auto_ml.utils_models import load_ml_model
`trained_ml_pipeline = load_ml_model("auto_ml_saved_pipeline.dill")
`trained_ml_pipeline.predict(data)`


Note that this pickle/dill file can only be loaded in an environment with the same modules installed, and running the same Python version.
This version of Python is:
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)


When passing in new data to get predictions on, columns that were not present (or were not found to be useful) in the training data will be silently ignored.
It is worthwhile to make sure that you feed in all the most useful data points though, to make sure you can get the highest quality predictions.
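
Because the saved file can only be loaded under the same Python version, it can help to record the interpreter version at save time and compare it before loading. The sketch below is a stdlib-only assumption about how one might do that. `write_version_stamp`, `check_version_stamp`, and the `.pyversion.json` sidecar file are hypothetical and not part of auto_ml, and comparing only major and minor versions is a design choice rather than a documented requirement.

```python
import json
import sys
import tempfile
from pathlib import Path


def write_version_stamp(model_path):
    """Write the current major, minor, micro version next to the model file."""
    stamp = Path(str(model_path) + ".pyversion.json")
    stamp.write_text(json.dumps(list(sys.version_info[:3])))
    return stamp


def check_version_stamp(model_path, current=None):
    """Return True if the recorded major.minor matches the given (or running) version."""
    current = tuple(current or sys.version_info[:3])
    stamp = Path(str(model_path) + ".pyversion.json")
    saved = tuple(json.loads(stamp.read_text()))
    return saved[:2] == current[:2]


# Demonstrate in a temporary directory; no real model file is needed for the check.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "auto_ml_saved_pipeline.dill"
    write_version_stamp(model_path)
    same_version = check_version_stamp(model_path)
    other_version = check_version_stamp(model_path, current=(2, 7, 0))

print(same_version, other_version)
```

In practice the stamp would be written right after `ml_predictor.save()` and checked right before `load_ml_model(...)`, so a mismatch surfaces as a clear message instead of an unpickling error.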


In [11]:
from auto_ml.utils_models import load_ml_model

trained_model = load_ml_model('auto_ml_saved_pipeline.dill')

In [13]:
predictions = trained_model.predict(train)

In [15]:
trained_model.score(features_train,label_train)

-0.0627740411585786