Refactoring
----------------

Kaggle lets you run scripts directly on its platform. Below is an example of a script that achieves a very good score on the leaderboard. It is simple and easy to follow.

https://www.kaggle.com/tunguz/homesite-quote-conversion/xgboost-benchmark-1/run/102736/code

```python
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn import preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

seed = 666

train = pd.read_csv('data/homesite/train.csv.zip')
test = pd.read_csv('data/homesite/test.csv.zip')

y = train.QuoteConversion_Flag.values
train = train.drop(['QuoteNumber', 'QuoteConversion_Flag'], axis=1)
test = test.drop('QuoteNumber', axis=1)

# Let's play with some dates
train['Date'] = pd.to_datetime(pd.Series(train['Original_Quote_Date']))
train = train.drop('Original_Quote_Date', axis=1)

test['Date'] = pd.to_datetime(pd.Series(test['Original_Quote_Date']))
test = test.drop('Original_Quote_Date', axis=1)

train['Year'] = train['Date'].apply(lambda x: int(str(x)[:4]))
train['Month'] = train['Date'].apply(lambda x: int(str(x)[5:7]))
train['weekday'] = train['Date'].dt.dayofweek


test['Year'] = test['Date'].apply(lambda x: int(str(x)[:4]))
test['Month'] = test['Date'].apply(lambda x: int(str(x)[5:7]))
test['weekday'] = test['Date'].dt.dayofweek

train = train.drop('Date', axis=1)
test = test.drop('Date', axis=1)

train = train.fillna(-1)
test = test.fillna(-1)

for f in train.columns:
    if train[f].dtype=='object':
        print(f)
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[f].values) + list(test[f].values))
        train[f] = lbl.transform(list(train[f].values))
        test[f] = lbl.transform(list(test[f].values))

clf = xgb.XGBClassifier(n_estimators=800,
                        nthread=-1,
                        max_depth=8,
                        learning_rate=0.03,
                        silent=True,
                        subsample=0.8,
                        colsample_bytree=0.8)
                        
xgb_model = clf.fit(train, y, eval_metric="auc")

preds = clf.predict_proba(test)[:,1]
sample = pd.read_csv('../input/sample_submission.csv')
sample.QuoteConversion_Flag = preds
sample.to_csv('xgb_benchmark.csv', index=False)
```

## Exercise
1. What are the building blocks of this pipeline? What transformations must you apply to go from the raw data to preprocessed data that you can feed into a machine learning algorithm?

2. Do you need to separately process training and testing data? What if you didn't have testing data when preprocessing the data? Do you see some duplicated code?

3. Rewrite this code as a chain of transformers in the spirit of the previous examples.

4. Process the training data, save the serialized pipeline (using `joblib.dump`), load the pipeline (using `joblib.load`), and predict on the testing data.

Hint: use `PandasSelector` from the previous exercise.
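
As a starting point, here is a minimal sketch of the general pattern on toy data, not a solution to the exercise. The `DateFeatures` and `FillMissing` classes, the `SomeField` column, and the toy values are all hypothetical names introduced for illustration; only `Original_Quote_Date` comes from the script above. The sketch covers just two of the preprocessing steps and leaves out label encoding and the model.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DateFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replace a date column with Year, Month, weekday."""

    def __init__(self, column="Original_Quote_Date"):
        self.column = column

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the training data.
        return self

    def transform(self, X):
        X = X.copy()
        dates = pd.to_datetime(X[self.column])
        X["Year"] = dates.dt.year
        X["Month"] = dates.dt.month
        X["weekday"] = dates.dt.dayofweek
        return X.drop(self.column, axis=1)


class FillMissing(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fill missing values with a constant."""

    def __init__(self, value=-1):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.fillna(self.value)


# Toy frames standing in for train.csv and test.csv (values are made up).
toy_train = pd.DataFrame({"Original_Quote_Date": ["2013-08-16", "2014-04-22"],
                          "SomeField": [23.0, float("nan")]})
toy_test = pd.DataFrame({"Original_Quote_Date": ["2015-01-05"],
                         "SomeField": [float("nan")]})

# The same chain of steps is applied to both frames; no duplicated code.
preprocess = Pipeline([("dates", DateFeatures()), ("fill", FillMissing())])
train_features = preprocess.fit_transform(toy_train)

# Round trip through disk, then apply the loaded pipeline to unseen data.
path = os.path.join(tempfile.mkdtemp(), "preprocess.joblib")
joblib.dump(preprocess, path)
loaded = joblib.load(path)
test_features = loaded.transform(toy_test)
print(test_features)
```

Note that `joblib.load` only restores the pipeline if the transformer classes are importable (or already defined) in the session that loads it, so in practice custom transformers usually live in a module rather than in a notebook cell.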

In [None]:
# templat