
## 0. Import basic libraries

In [38]:
import pandas as pd
import numpy as np

## 1. Explanation

We will be using:

* diamonds_train.csv: labeled dataset for training and testing purposes
* diamonds_predict.csv: unlabeled dataset (no price column) used to generate the predictions; the submission has two columns (id and price)

Finally, we will upload our submission.csv file to Kaggle to compete for the most accurate price prediction.

## 2. Data load

In [39]:
diamonds = pd.read_csv('../data/diamonds_train.csv')
diamonds_predict = pd.read_csv('../data/diamonds_predict.csv')

In [40]:
diamonds.head()
diamonds.shape

(40455, 10)

In [41]:
diamonds_predict.head()
diamonds_predict.shape

(13485, 10)

As the tables show, the data contains both categorical and numerical variables. We will therefore try different transformations of the features to understand their weight in the model and their correlation with the dependent variable (price), which is the variable we'd like to predict.

## Adjusting the data for Machine Learning

Let's check for null values, because many machine learning models cannot handle null or NA values:

In [42]:
diamonds.isnull().sum()

carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

There are no null values, so we can proceed to build the model.

## 3. Building the model

### 3.1 Import Machine Learning modules

Next we will split our data with scikit-learn, so we need to import its splitting helper and the metrics we will use to evaluate the model.


In [43]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

### 3.2 Preprocessing

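Since our dataset has some categorical features (cut, color and clarity), we need to transform them into numerical ones, because most machine learning algorithms cannot work with raw string categories. The mappings below assign each grade an integer from lowest to highest quality, so the order of the grades is preserved.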

In [44]:
# Apply label encoding, then create synthetic variables (including a weighted sum of several features)

In [45]:
clarity={'I1':0, 'SI2':1, 'SI1':2, 'VS2':3, 'VS1':4, 'VVS2':5, 'VVS1':6, 'IF':7}
cut={'Fair':0, 'Good':1, 'Very Good':2, 'Premium':3, 'Ideal':4}
color={'J':0, 'I':1, 'H':2, 'G':3, 'F':4, 'E':5, 'D':6}

In [46]:
def labeling(s, dic):
    return dic[s]

In [47]:
diamonds.clarity=diamonds.clarity.apply(lambda x: labeling(x, clarity))
diamonds.cut=diamonds.cut.apply(lambda x: labeling(x, cut))
diamonds.color=diamonds.color.apply(lambda x: labeling(x, color))
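
As a hedged alternative to the `apply`/`labeling` pattern above, `Series.map` accepts the dictionary directly. The sketch below uses a small made-up frame (`toy` and the `'Unknown'` label are not part of the notebook) and shows that a category missing from the mapping becomes `NaN` instead of raising a `KeyError`, which makes unmapped rows easy to find.

```python
import pandas as pd

# Same ordinal mapping as in the notebook (lowest to highest cut grade).
cut = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}

# Hypothetical toy data; 'Unknown' is deliberately absent from the mapping.
toy = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium', 'Unknown']})

# map() looks each value up in the dict; keys that are missing become NaN.
toy['cut_code'] = toy['cut'].map(cut)

# Rows whose category was not covered by the mapping.
unmapped = toy[toy['cut_code'].isna()]
print(toy)
print(f"unmapped rows: {len(unmapped)}")
```

Whether a hard failure (`KeyError`) or a detectable `NaN` is preferable depends on how the data is validated afterwards.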

In [48]:
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1.21,3,0,3,62.4,58.0,4268,6.83,6.79,4.25
1,0.32,2,2,3,63.0,57.0,505,4.35,4.38,2.75
2,0.71,0,3,4,65.5,55.0,2686,5.62,5.53,3.65
3,0.41,1,6,2,63.8,56.0,738,4.68,4.72,3.0
4,1.02,4,3,2,60.5,59.0,4882,6.55,6.51,3.95


In [49]:
# Now let's create synthetic variables

In [50]:
diamonds['Vol']=diamonds.x*diamonds.y*diamonds.z
diamonds['Sum']=diamonds.carat**2+2*diamonds.clarity+diamonds.color
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,Vol,Sum
0,1.21,3,0,3,62.4,58.0,4268,6.83,6.79,4.25,197.096725,7.4641
1,0.32,2,2,3,63.0,57.0,505,4.35,4.38,2.75,52.39575,8.1024
2,0.71,0,3,4,65.5,55.0,2686,5.62,5.53,3.65,113.43689,11.5041
3,0.41,1,6,2,63.8,56.0,738,4.68,4.72,3.0,66.2688,10.1681
4,1.02,4,3,2,60.5,59.0,4882,6.55,6.51,3.95,168.429975,8.0404
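
To make the two formulas concrete, the following sketch recomputes `Vol` and `Sum` by hand for row 0 of the table above. The variable names are illustrative; the input values are copied from that row.

```python
import math

# Values copied from row 0 of the table above (after label encoding).
carat, color, clarity = 1.21, 0, 3
x, y, z = 6.83, 6.79, 4.25

vol = x * y * z                                  # product of the three dimensions
weighted_sum = carat**2 + 2 * clarity + color    # same formula as the 'Sum' column

print(round(vol, 6), round(weighted_sum, 4))
```

Both results agree with the `Vol` and `Sum` values shown for that row.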


### 3.3 Doing the split

In [51]:
diamonds_train, diamonds_test = train_test_split(diamonds, test_size=0.2, random_state=42)

In [52]:
print(diamonds_train.shape)
print(diamonds_test.shape)

(32364, 12)
(8091, 12)
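
The row counts printed above can be reproduced by hand. This sketch assumes the test set size is the row count times `test_size`, rounded up, with the remainder going to the training set; that assumption matches the output above. The 12 columns are the original 10 plus `Vol` and `Sum`.

```python
import math

n_rows = 40455       # rows in diamonds, from diamonds.shape earlier
test_size = 0.2      # same fraction passed to train_test_split

n_test = math.ceil(n_rows * test_size)   # assumed rounding: up
n_train = n_rows - n_test                # everything else is used for training

print(n_train, n_test)
```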


In [53]:
diamonds_test

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,Vol,Sum
17775,0.70,1,4,4,63.8,58.0,2970,5.58,5.61,3.57,111.754566,12.4900
13506,0.61,4,5,4,61.3,54.0,3004,5.53,5.50,3.38,102.802700,13.3721
4325,0.33,4,3,6,61.6,55.0,838,4.46,4.47,2.75,54.824550,15.1089
37870,1.00,3,4,3,58.6,61.0,6468,6.53,6.50,3.82,162.139900,11.0000
21321,0.29,1,5,5,62.7,61.0,633,4.20,4.22,2.64,46.791360,15.0841
...,...,...,...,...,...,...,...,...,...,...,...,...
3781,1.03,2,3,2,63.1,57.0,4764,6.45,6.41,4.06,167.858670,8.0609
26959,0.32,4,6,2,62.7,54.0,756,4.39,4.35,2.74,52.324410,10.1024
15529,0.71,4,6,2,61.3,57.0,2690,5.74,5.78,3.53,117.115516,10.5041
36333,0.90,2,4,2,59.6,63.0,3992,6.24,6.17,3.70,142.452960,8.8100


In [54]:
NUM_FEATS = ['carat', 'depth', 'table', 'x', 'y', 'z', 'Vol', 'Sum']
CAT_FEATS = ['cut', 'color', 'clarity']
FEATS = NUM_FEATS + CAT_FEATS
TARGET = 'price'

### 3.4 Applying the LightGBM model

In [55]:
# Instead of using the lightgbm classifier, we are going to use the lightgbm regressor because we'd like to predict
# a numerical value

In [56]:
import lightgbm as lgb

In [57]:
lgbm = lgb.LGBMRegressor(boosting_type='gbdt',
                        num_leaves=50,
                        max_depth=-1,
                        learning_rate=0.05,
                        n_estimators=512)

lgbm.fit(diamonds_train[FEATS], diamonds_train[TARGET], eval_metric='l2')

LGBMRegressor(learning_rate=0.05, n_estimators=512, num_leaves=50)

In [58]:
y_train = lgbm.predict(diamonds_train[FEATS])
y_test = lgbm.predict(diamonds_test[FEATS])

## 4. Evaluating accuracy of our model

In [59]:
from sklearn.metrics import mean_squared_error

In [60]:
# RMSE computed as sqrt(MSE); the squared=False argument is deprecated in recent scikit-learn releases
print(f"test error: {np.sqrt(mean_squared_error(y_pred=y_test, y_true=diamonds_test[TARGET]))}")
print(f"train error: {np.sqrt(mean_squared_error(y_pred=y_train, y_true=diamonds_train[TARGET]))}")

test error: 517.4539271099432
train error: 344.31613882859153
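
The two errors printed above are root mean squared errors (RMSE), so they are expressed in the same units as price. As a minimal sketch of what the metric computes, here is the calculation done by hand on three made-up prices (the numbers are hypothetical and unrelated to the dataset):

```python
import math

# Hypothetical true prices and predictions, for illustration only.
y_true = [500.0, 1000.0, 4000.0]
y_hat = [450.0, 1100.0, 4200.0]

# Mean of the squared residuals, then a square root to return to price units.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)
rmse = math.sqrt(mse)

print(f"RMSE: {rmse:.2f}")
```

Because the residuals are squared before averaging, a few large misses weigh more heavily than many small ones.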


## 5. Optimizing our model with hyperparameters search

In [61]:
# Let's define the hyperparameters that can have an impact on our model

In [62]:
param_test ={'num_leaves': [10,30,50,100],
             'learning_rate': [0.01, 0.05, 0.1, 0.2],
             'max_depth': [-1, 3, 5],
             'n_estimators': [16, 32, 64, 128, 256, 512]}

In [63]:
from sklearn.model_selection import RandomizedSearchCV

In [65]:
grid_search = RandomizedSearchCV(lgbm, 
                                 param_test, 
                                 cv=5, 
                                 verbose=10, 
                                 n_jobs=-1,
                                 n_iter=16)

grid_search.fit(diamonds[FEATS], diamonds[TARGET])

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:   13.9s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:   16.1s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   22.5s finished


RandomizedSearchCV(cv=5,
                   estimator=LGBMRegressor(learning_rate=0.05, n_estimators=200,
                                           num_leaves=200),
                   n_iter=16, n_jobs=-1,
                   param_distributions={'learning_rate': [0.01, 0.05, 0.1, 0.2],
                                        'max_depth': [-1, 3, 5],
                                        'n_estimators': [16, 32, 64, 128, 256,
                                                         512],
                                        'num_leaves': [10, 30, 50, 100]},
                   verbose=10)
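
The 80 fits reported in the log follow from simple arithmetic: the randomized search samples `n_iter` candidates from the grid and fits each one once per fold. The sketch below also counts the full grid, to show how small a share of it is explored (the names `grid_size` and `total_fits` are illustrative).

```python
import math

# Same search space as param_test in the notebook.
param_test = {'num_leaves': [10, 30, 50, 100],
              'learning_rate': [0.01, 0.05, 0.1, 0.2],
              'max_depth': [-1, 3, 5],
              'n_estimators': [16, 32, 64, 128, 256, 512]}

# Size of the full grid: product of the number of options per parameter.
grid_size = math.prod(len(values) for values in param_test.values())

n_iter, cv = 16, 5          # values passed to RandomizedSearchCV
total_fits = n_iter * cv    # one fit per sampled candidate per fold

print(grid_size, total_fits, f"{n_iter / grid_size:.1%}")
```

An exhaustive grid search over the same space would need `grid_size * cv` fits instead.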

In [66]:
grid_search.best_params_

{'num_leaves': 50, 'n_estimators': 512, 'max_depth': -1, 'learning_rate': 0.05}

In [67]:
grid_search.best_score_

0.9823690383774603
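
No `scoring` argument was passed to the search, so `best_score_` is the estimator's default score, which for scikit-learn-compatible regressors is R² averaged over the folds, not an RMSE. As a rough sketch of what R² measures, here it is computed by hand on made-up values (hypothetical numbers, unrelated to the dataset):

```python
# Hypothetical true prices and predictions, for illustration only.
y_true = [500.0, 1000.0, 4000.0]
y_hat = [450.0, 1100.0, 4200.0]

mean_true = sum(y_true) / len(y_true)

# Residual sum of squares: error left over after the model's predictions.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
# Total sum of squares: error of always predicting the mean price.
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(f"R^2: {r2:.4f}")
```

A value close to 1 means the model explains most of the variance in price; it says nothing directly about the size of the error in price units.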

## 6. Preparing data for submission

Before submitting, we need to apply the same transformations (the label encoding and the synthetic variables) to the diamonds_predict dataset.

In [64]:
diamonds_predict

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z
0,0,0.79,Very Good,F,SI1,62.7,60.0,5.82,5.89,3.67
1,1,1.20,Ideal,J,VS1,61.0,57.0,6.81,6.89,4.18
2,2,1.57,Premium,H,SI1,62.2,61.0,7.38,7.32,4.57
3,3,0.90,Very Good,F,SI1,63.8,54.0,6.09,6.13,3.90
4,4,0.50,Very Good,F,VS1,62.9,58.0,5.05,5.09,3.19
...,...,...,...,...,...,...,...,...,...,...
13480,13480,0.57,Ideal,E,SI1,61.9,56.0,5.35,5.32,3.30
13481,13481,0.71,Ideal,I,VS2,62.2,55.0,5.71,5.73,3.56
13482,13482,0.70,Ideal,F,VS1,61.6,55.0,5.75,5.71,3.53
13483,13483,0.70,Very Good,F,SI2,58.8,57.0,5.85,5.89,3.45


In [65]:
clarity={'I1':0, 'SI2':1, 'SI1':2, 'VS2':3, 'VS1':4, 'VVS2':5, 'VVS1':6, 'IF':7}
cut={'Fair':0, 'Good':1, 'Very Good':2, 'Premium':3, 'Ideal':4}
color={'J':0, 'I':1, 'H':2, 'G':3, 'F':4, 'E':5, 'D':6}

In [66]:
def labeling(s, dic):
    return dic[s]

In [67]:
diamonds_predict.clarity=diamonds_predict.clarity.apply(lambda x: labeling(x, clarity))
diamonds_predict.cut=diamonds_predict.cut.apply(lambda x: labeling(x, cut))
diamonds_predict.color=diamonds_predict.color.apply(lambda x: labeling(x, color))

In [68]:
diamonds_predict['Vol']=diamonds_predict.x*diamonds_predict.y*diamonds_predict.z
diamonds_predict['Sum']=diamonds_predict.carat**2+2*diamonds_predict.clarity+diamonds_predict.color
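
Since the same steps are applied to both datasets, one option is to wrap them in a single function so that the two cannot drift apart. This is only a sketch: `preprocess` and the upper-case mapping names are hypothetical and not part of the notebook, and `map` is used in place of the `labeling` helper. The sample row is the first row of `diamonds_predict` shown earlier.

```python
import pandas as pd

CLARITY = {'I1': 0, 'SI2': 1, 'SI1': 2, 'VS2': 3, 'VS1': 4, 'VVS2': 5, 'VVS1': 6, 'IF': 7}
CUT = {'Fair': 0, 'Good': 1, 'Very Good': 2, 'Premium': 3, 'Ideal': 4}
COLOR = {'J': 0, 'I': 1, 'H': 2, 'G': 3, 'F': 4, 'E': 5, 'D': 6}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with encoded categories and the Vol/Sum features added."""
    out = df.copy()  # leave the caller's frame untouched
    out['clarity'] = out['clarity'].map(CLARITY)
    out['cut'] = out['cut'].map(CUT)
    out['color'] = out['color'].map(COLOR)
    out['Vol'] = out['x'] * out['y'] * out['z']
    out['Sum'] = out['carat'] ** 2 + 2 * out['clarity'] + out['color']
    return out

# First row of diamonds_predict, copied from the table above.
sample = pd.DataFrame({'carat': [0.79], 'cut': ['Very Good'], 'color': ['F'],
                       'clarity': ['SI1'], 'x': [5.82], 'y': [5.89], 'z': [3.67]})
result = preprocess(sample)
print(result)
```

Under these assumptions the same call would be used for both frames, for example `preprocess(diamonds)` and `preprocess(diamonds_predict)`, applied to the raw data before any encoding.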

In [69]:
diamonds_predict

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,Vol,Sum
0,0,0.79,2,4,2,62.7,60.0,5.82,5.89,3.67,125.806866,8.6241
1,1,1.20,4,0,4,61.0,57.0,6.81,6.89,4.18,196.129362,9.4400
2,2,1.57,3,2,2,62.2,61.0,7.38,7.32,4.57,246.878712,8.4649
3,3,0.90,2,4,2,63.8,54.0,6.09,6.13,3.90,145.593630,8.8100
4,4,0.50,2,4,4,62.9,58.0,5.05,5.09,3.19,81.997355,12.2500
...,...,...,...,...,...,...,...,...,...,...,...,...
13480,13480,0.57,4,5,2,61.9,56.0,5.35,5.32,3.30,93.924600,9.3249
13481,13481,0.71,4,1,3,62.2,55.0,5.71,5.73,3.56,116.477148,7.5041
13482,13482,0.70,4,4,4,61.6,55.0,5.75,5.71,3.53,115.898725,12.4900
13483,13483,0.70,2,4,1,58.8,57.0,5.85,5.89,3.45,118.874925,6.4900


## 7. Applying the optimized model to predict dataset

In [70]:
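# The search returned the same hyperparameters already used for lgbm,
# so we predict with that model (fitted on the training split)
y_pred = lgbm.predict(diamonds_predict[FEATS])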

In [71]:
lgbm_submission = pd.DataFrame({'id': diamonds_predict['id'], 'price': y_pred})

In [72]:
lgbm_submission.head()

Unnamed: 0,id,price
0,0,2867.749354
1,1,5482.045691
2,2,9470.866115
3,3,3959.814112
4,4,1646.337979


In [73]:
lgbm_submission.to_csv('lgbm_without_grid_search2.csv', index=False)
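
Before uploading, it can be worth running a few sanity checks on the submission frame. The sketch below is an assumption about what a valid file looks like (exactly the columns `id` and `price`, one row per id, positive prices), not a rule taken from the competition; `check_submission` is a hypothetical helper.

```python
import pandas as pd

def check_submission(sub, expected_rows):
    """Return a list of problems; an empty list means every check passed."""
    if list(sub.columns) != ['id', 'price']:
        return ['columns are not exactly id, price']
    problems = []
    if len(sub) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(sub)}')
    if sub['id'].duplicated().any():
        problems.append('duplicate ids')
    if sub['price'].isna().any():
        problems.append('missing prices')
    if (sub['price'] <= 0).any():
        problems.append('non-positive prices')
    return problems

# First rows of the submission shown above, plus one deliberately broken copy.
good = pd.DataFrame({'id': [0, 1, 2],
                     'price': [2867.749354, 5482.045691, 9470.866115]})
bad = good.assign(price=[2867.749354, None, -5.0])

print(check_submission(good, expected_rows=3))
print(check_submission(bad, expected_rows=3))
```

In the notebook this could be called as `check_submission(lgbm_submission, expected_rows=len(diamonds_predict))`.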