<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

## 0. python imports & setup

for learning purposes, libraries will be imported inside its corresponding usage section...

## 1. data loading

In [3]:
import pandas as pd
import numpy as np

* diamonds: labeled data we can use for training and testing
* diamonds_predict: diamonds to predict its price and upload result to Kaggle

In [4]:
diamonds = pd.read_csv('../data/raw/diamonds_train.csv')
diamonds_predict = pd.read_csv('../data/raw/diamonds_predict.csv')

In [5]:
diamonds.head().T

Unnamed: 0,0,1,2,3,4
carat,1.21,0.32,0.71,0.41,1.02
cut,Premium,Very Good,Fair,Good,Ideal
color,J,H,G,D,G
clarity,VS2,VS2,VS1,SI1,SI1
depth,62.4,63,65.5,63.8,60.5
table,58,57,55,56,59
price,4268,505,2686,738,4882
x,6.83,4.35,5.62,4.68,6.55
y,6.79,4.38,5.53,4.72,6.51
z,4.25,2.75,3.65,3,3.95


as you can see, there are both categorical and numerical columns...

## 2. eda

In [6]:
def removeoutliers(df, listvars, z):
    from scipy import stats
    for var in listvars:
        df1 = df[np.abs(stats.zscore(df[var])) < z]
    return df1

In [7]:
linear_vars = diamonds.select_dtypes(include=[np.number]).columns

In [8]:
diamonds = removeoutliers(diamonds, linear_vars,2)

this section is up to you! this guided lesson is about a machine learning pipeline...

## 3. ml preprocessing

in this section I will teach how to use scikit-learn's Pipiline and ColumnTransformer, one of the best practices for composing preprocessing and modeling in a single and elegand class... pay attention as it is hard to understand...

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

* https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
* https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

let's identify numerical and categorical features...

In [10]:
NUM_FEATS = ['carat', 'depth', 'table', 'x', 'y', 'z']
CAT_FEATS = ['cut', 'color', 'clarity']
FEATS = NUM_FEATS + CAT_FEATS
TARGET = 'price'

let's define a preprocessing transformer for numerical columns...

In [11]:
numeric_transformer = \
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), 
                ('scaler', RobustScaler())])

let's define a preprocessing transformer for categorical columns...

In [12]:
categorical_transformer = \
Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                ('onehot', OneHotEncoder(handle_unknown='ignore'))])

let's join these transformers using a `ColumnTransformer`:

In [13]:
preprocessor = \
ColumnTransformer(transformers=[('num', numeric_transformer, NUM_FEATS),
                                ('cat', categorical_transformer, CAT_FEATS)])

inspecting the full preprocessor:

In [14]:
preprocessor

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('num',
                                 Pipeline(memory=None,
                                          steps=[('imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='median',
                                                                verbose=0)),
                                                 ('scaler',
                                                  RobustScaler(copy=True,
                                                               quantile_range=(25.0,
                         

how does this preprocessing looks like?

at least in this case, it is at the cost of interpretability of transformed DataFrame...

In [15]:
pd.DataFrame(data=preprocessor.fit_transform(diamonds)).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,0.809524,0.4,0.333333,0.655556,0.644068,0.678571,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,-0.603175,0.8,0.0,-0.722222,-0.717514,-0.660714,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.015873,2.466667,-0.666667,-0.016667,-0.067797,0.142857,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,-0.460317,1.333333,-0.333333,-0.538889,-0.525424,-0.4375,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.507937,-0.866667,0.666667,0.5,0.485876,0.410714,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


## 4. train a simple model

first, lets train a simple model using holdout, train - test split...

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
diamonds_train, diamonds_test = train_test_split(diamonds)

In [18]:
print(diamonds_train.shape)
print(diamonds_test.shape)

(29245, 10)
(9749, 10)


let's choose a model from scikit-learn cheatsheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [19]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

model= Pipeline(steps=[('preprocessor', preprocessor),
                       ('regressor', GradientBoostingRegressor(warm_start=True))])

In [20]:
model.fit(diamonds_train[FEATS], diamonds_train[TARGET]);

## 5. check model performance on test and train data

In [21]:
from sklearn.metrics import mean_squared_error

In [22]:
y_test = model.predict(diamonds_test[FEATS])
y_train = model.predict(diamonds_train[FEATS])

In [23]:
print(f"test error: {mean_squared_error(y_pred=y_test, y_true=diamonds_test[TARGET], squared=False)}")
print(f"train error: {mean_squared_error(y_pred=y_train, y_true=diamonds_train[TARGET], squared=False)}")

test error: 628.5379857657118
train error: 612.8153521232724


## 6. check model performance using cross validation

In [24]:
from sklearn.model_selection import cross_val_score

In [25]:
scores = cross_val_score(model, 
                         diamonds[FEATS], 
                         diamonds[TARGET], 
                         scoring='neg_root_mean_squared_error', 
                         cv=5, n_jobs=-1)

In [26]:
import numpy as np
np.mean(-scores)

637.8959960871698

## 7. optimize model using grid search

In [27]:
from sklearn.model_selection import RandomizedSearchCV

In [28]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'regressor__n_estimators': [16, 32, 64, 128, 256, 512],
    'regressor__max_depth': [2, 4, 8, 16],
}

grid_search = RandomizedSearchCV(model, 
                                 param_grid, 
                                 cv=5, 
                                 verbose=10, 
                                 scoring='neg_root_mean_squared_error', 
                                 n_jobs=6,
                                 n_iter=32)

grid_search.fit(diamonds[FEATS], diamonds[TARGET])

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done   1 tasks      | elapsed:    3.0s
[Parallel(n_jobs=6)]: Done   6 tasks      | elapsed:   20.2s
[Parallel(n_jobs=6)]: Done  13 tasks      | elapsed:   26.2s
[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed:   31.1s
[Parallel(n_jobs=6)]: Done  29 tasks      | elapsed:  1.7min
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:  1.8min
[Parallel(n_jobs=6)]: Done  49 tasks      | elapsed:  2.4min
[Parallel(n_jobs=6)]: Done  60 tasks      | elapsed:  3.1min
[Parallel(n_jobs=6)]: Done  73 tasks      | elapsed:  4.1min
[Parallel(n_jobs=6)]: Done  86 tasks      | elapsed:  4.8min
[Parallel(n_jobs=6)]: Done 101 tasks      | elapsed:  5.4min
[Parallel(n_jobs=6)]: Done 116 tasks      | elapsed:  5.8min
[Parallel(n_jobs=6)]: Done 133 tasks      | elapsed:  6.4min
[Parallel(n_jobs=6)]: Done 160 out of 160 | elapsed:  7.2min finished


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('preprocessor',
                                              ColumnTransformer(n_jobs=None,
                                                                remainder='drop',
                                                                sparse_threshold=0.3,
                                                                transformer_weights=None,
                                                                transformers=[('num',
                                                                               Pipeline(memory=None,
                                                                                        steps=[('imputer',
                                                                                                SimpleImputer(add_indicator=False,
                                                                                     

In [29]:
grid_search.best_params_

{'regressor__n_estimators': 128,
 'regressor__max_depth': 8,
 'preprocessor__num__imputer__strategy': 'mean'}

In [30]:
grid_search.best_estimator_

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='mean',
                                                               

In [31]:
grid_search.best_score_

-452.3075339327912

## 8. prepare submission

In [32]:
y_pred = grid_search.predict(diamonds_predict[FEATS])

In [33]:
submission_df = pd.DataFrame({'id': diamonds_predict['id'], 'price': y_pred})

In [34]:
submission_df.head()

Unnamed: 0,id,price
0,0,2898.859256
1,1,5544.693232
2,2,9500.93723
3,3,3998.863505
4,4,1615.672524


In [35]:
submission_df.describe()

Unnamed: 0,id,price
count,13485.0,13485.0
mean,6742.0,3911.108909
std,3892.928525,3849.015649
min,0.0,297.599935
25%,3371.0,935.649862
50%,6742.0,2452.083783
75%,10113.0,5330.489651
max,13484.0,18500.875847


In [70]:
submission_df.price.clip(0, 20000, inplace=True)

In [36]:
submission_df.to_csv('../submissions/diamonds_GBR_15-21.csv', index=False)

## 9. let's try more models...

<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>