<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Creating-models" data-toc-modified-id="Creating-models-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Creating models</a></span><ul class="toc-item"><li><span><a href="#Define-a-pipeline" data-toc-modified-id="Define-a-pipeline-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Define a pipeline</a></span></li><li><span><a href="#Create-a-benchmark-model" data-toc-modified-id="Create-a-benchmark-model-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Create a benchmark model</a></span></li><li><span><a href="#Tune-Model/Try-different-models" data-toc-modified-id="Tune-Model/Try-different-models-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Tune Model/Try different models</a></span></li><li><span><a href="#XGBoost" data-toc-modified-id="XGBoost-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>XGBoost</a></span></li><li><span><a href="#Data-Leakage" data-toc-modified-id="Data-Leakage-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Data Leakage</a></span></li></ul></li></ul></div>

# Creating models

## Define a pipeline

Defining a pipeline has several advantages:

- <b>Cleaner Code:</b> Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
- <b>Fewer Bugs:</b> There are fewer opportunities to misapply a step or forget a preprocessing step.
- <b>Easier to Productionize:</b> It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
- <b>More Options for Model Validation:</b> A pipeline can be passed as a single estimator to validation tools such as cross-validation, which the next tutorial covers.

In [1]:
import pandas as pd
data_path = 'data/melb_data.csv'
data = pd.read_csv(data_path)
data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


For this dataset, the workflow has several steps:
- Imputing null values
- Transforming categorical values
- Splitting the data into training and evaluation sets
- Defining a basic model
- Evaluating that model
- Improving the model
- Evaluating the improved model
- Testing the final model on test data
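The first five steps above can be sketched end to end with scikit-learn's built-in transformers. This is a minimal sketch rather than the notebook's final implementation: the `toy` dataframe is a hypothetical stand-in for the housing file, the column choices are illustrative, and `RandomForestRegressor` is only an assumed placeholder model. `OrdinalEncoder` is used here as the closest built-in equivalent of label encoding for feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for the housing file.
toy = pd.DataFrame({
    'Rooms': [2, 2, 3, 3, 4, 2, 3, 4, 3, 2],
    'Landsize': [202.0, 156.0, 134.0, np.nan, 120.0,
                 181.0, np.nan, 400.0, 245.0, 98.0],
    'Type': ['h', 'h', 'h', 'u', 'h', 'u', 't', 'h', 't', 'u'],
    'Price': [1480000.0, 1035000.0, 1465000.0, 850000.0, 1600000.0,
              700000.0, 940000.0, 1900000.0, 1200000.0, 640000.0],
})
X = toy.drop(columns='Price')
y = toy['Price']

# Split into training and evaluation sets (80% / 20%).
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, train_size=0.8, random_state=0)

numeric_cols = ['Rooms', 'Landsize']
categorical_cols = ['Type']

# Impute numeric nulls with the median; encode categories as integers.
# Categories unseen during fitting are mapped to -1.
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value',
                           unknown_value=-1), categorical_cols),
])

# Bundle preprocessing and the (assumed) basic model into one estimator.
model_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Fit on the training rows only, then evaluate on the held-out rows.
model_pipeline.fit(X_train, y_train)
predictions = model_pipeline.predict(X_eval)
mae = mean_absolute_error(y_eval, predictions)
print('Evaluation MAE:', mae)
```

In this sketch the split happens before the pipeline is fitted, so the imputation median and the category mapping are learned from the training rows alone and then reused unchanged on the evaluation rows.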

In [3]:
from sklearn.pipeline import Pipeline

In [6]:
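from sklearn.ensemble import RandomForestRegressor

def impute_null_values(df):
    '''Fill null values in the passed dataframe (modified in place) and return it.
       Assumed strategy: median for numeric columns, most frequent value otherwise.'''
    for col in df.columns[df.isnull().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
    return df

def transform_categorical_values(df):
    '''Label-encode every non-numeric column and return the dataframe.
       The labels are added as new columns named '<Column>_category'.'''
    for col in df.select_dtypes(exclude='number').columns:
        df[col + '_category'] = df[col].astype('category').cat.codes
    return df

def split_data(df, train_percent):
    '''Randomly split the dataframe; train_percent is a fraction between 0 and 1.
       Returns 2 dataframes: train_df and eval_df.'''
    train_df = df.sample(frac=train_percent, random_state=0)
    eval_df = df.drop(train_df.index)
    return train_df, eval_df

def model_creation(model_params):
    '''Create a model from the parameters passed.
       Assumption: a random forest regressor is a placeholder choice.'''
    return RandomForestRegressor(**model_params)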

In [None]:
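from sklearn.preprocessing import FunctionTransformer

# Pipeline steps must implement fit/transform, so plain functions are
# wrapped in FunctionTransformer. Splitting returns two dataframes and
# cannot be a pipeline step, so split_data is called separately.
pipeline = Pipeline(steps=[
    ('imputer', FunctionTransformer(impute_null_values)),
    ('label_encoder', FunctionTransformer(transform_categorical_values)),
])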
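scikit-learn expects every pipeline step to expose `fit` and `transform`, so a plain function needs a wrapper such as `FunctionTransformer` before it can serve as a step. The sketch below shows the pattern running on a small frame. The helpers `fill_missing` and `add_category_codes` and the `toy` data are hypothetical stand-ins, not the notebook's own functions, and the median fill is an assumed strategy.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def fill_missing(df):
    # Hypothetical helper: fill numeric nulls with the column median.
    df = df.copy()
    numeric_cols = df.select_dtypes(include='number').columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    return df

def add_category_codes(df):
    # Hypothetical helper: add '<Column>_category' integer codes
    # for every non-numeric column.
    df = df.copy()
    for col in df.select_dtypes(exclude='number').columns:
        df[col + '_category'] = df[col].astype('category').cat.codes
    return df

# Toy data with one missing numeric value.
toy = pd.DataFrame({
    'Rooms': [2, 3, 4, 3],
    'BuildingArea': [79.0, np.nan, 142.0, 150.0],
    'Type': ['h', 'u', 'h', 't'],
})

# Each function is wrapped so the pipeline can call fit/transform on it.
demo_pipeline = Pipeline(steps=[
    ('imputer', FunctionTransformer(fill_missing)),
    ('label_encoder', FunctionTransformer(add_category_codes)),
])
prepared = demo_pipeline.fit_transform(toy)

# Splitting yields two objects, so it happens outside the pipeline.
train_df = prepared.sample(frac=0.75, random_state=0)
eval_df = prepared.drop(train_df.index)
print(prepared)
```

Because both helpers work on a copy, the original `toy` frame keeps its missing value. Note that function-based steps like these recompute their statistics on whatever data they receive, whereas fitted transformers such as `SimpleImputer` remember values learned during `fit`.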

## Create a benchmark model

## Tune Model/Try different models

## XGBoost

## Data Leakage