## 0. Review
The following decisions were taken for this model:
- The model chosen is XGBRegressor.
- All features are considered in the analysis (XGB performs well even when the dataset contains dependent or weakly correlated features).
- Data preparation is not needed because:
    - There are no null values.
    - There are few zero values (they are present in both train and test, and XGB can predict with zero values).
    - There are few outliers (8% in train and 1% in test). Since we cannot exclude them, it is better to keep them in train so the model can learn from them.
- Feature scaling does not improve the results (the data ranges are not very large).
- Target encoding using the mean price seems to reduce error variance.
- Target encoding using the standard deviation of the price does not seem to improve the results.
- Cross target encoding does not seem to improve the results (it overfits more).
- Category encoding performs better than one-hot encoding with decision-tree models.

## 1. Libraries import

In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Imports 
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 50)
#train test split
from sklearn.model_selection import train_test_split
# model
from xgboost import XGBRegressor
# error
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error

## 2. Data import

### 2.1 Import train data

In [3]:
df_diamonds_train=pd.read_csv('../data/diamonds_train.csv')
df_diamonds_train.drop(['Unnamed: 0', 'index_id'], axis=1,inplace=True) #dropped unnecessary columns
df_diamonds_train.head()

Unnamed: 0,depth,table,x,y,z,price,carat,cut,color,clarity,city
0,62.4,58.0,6.83,6.79,4.25,4268,1.21,Premium,J,VS2,Dubai
1,63.0,57.0,4.35,4.38,2.75,505,0.32,Very Good,H,VS2,Kimberly
2,65.5,55.0,5.62,5.53,3.65,2686,0.71,Fair,G,VS1,Las Vegas
3,63.8,56.0,4.68,4.72,3.0,738,0.41,Good,D,SI1,Kimberly
4,60.5,59.0,6.55,6.51,3.95,4882,1.02,Ideal,G,SI1,Dubai


### 2.2 Import test data

In [4]:
df_diamonds_test=pd.read_csv('../data/diamonds_test.csv')
df_diamonds_test.drop(['id'], axis=1,inplace=True) #dropped unnecessary columns
df_diamonds_test.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,city
0,0.79,Very Good,F,SI1,62.7,60.0,5.82,5.89,3.67,Amsterdam
1,1.2,Ideal,J,VS1,61.0,57.0,6.81,6.89,4.18,Surat
2,1.57,Premium,H,SI1,62.2,61.0,7.38,7.32,4.57,Kimberly
3,0.9,Very Good,F,SI1,63.8,54.0,6.09,6.13,3.9,Kimberly
4,0.5,Very Good,F,VS1,62.9,58.0,5.05,5.09,3.19,Amsterdam


## 3. Features to be considered in the analysis

All features are considered in the analysis (the chosen model, XGB, performs well even when the dataset contains dependent or weakly correlated features):

In [5]:
num_features_list=['x','y','z','depth','table','carat']
cat_features_list=['cut','color','clarity','city']
features_list=['x','y','z','depth','table','carat','cut','color','clarity','city']
target='price'

## 4. Data preparation 

Data preparation is not needed because:
- There are no null values.
- There are few zero values (they are present in both train and test, and XGB can predict with zero values).
- There are few outliers (8% in train and 1% in test). Since we cannot exclude them, it is better to keep them in train so the model can learn from them.
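
Counts like these can be reproduced with a short check along the following lines. This is only a sketch: the small frame holds made-up values standing in for `df_diamonds_train`, the helper name `quality_report` is hypothetical, and the 1.5 × IQR rule is an assumption, since the notebook does not state which outlier criterion was used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_diamonds_train (hypothetical values, illustration only)
df = pd.DataFrame({
    "x": [6.83, 4.35, 0.0, 4.68, 6.55, 5.10],
    "carat": [1.21, 0.32, 0.71, 0.41, 1.02, 5.00],
    "price": [4268, 505, 2686, 738, 4882, 18000],
})

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count nulls, zeros and 1.5*IQR outliers for each numeric column."""
    num = frame.select_dtypes(include=np.number)
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Comparing a DataFrame with a Series aligns the Series on the column names
    outliers = ((num < low) | (num > high)).sum()
    return pd.DataFrame({
        "nulls": num.isna().sum(),
        "zeros": (num == 0).sum(),
        "outliers": outliers,
    })

report = quality_report(df)
print(report)
```

Dividing the `outliers` column by `len(df)` gives the share of flagged rows per feature, which is the kind of figure quoted in the list above.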

## 5. Feature engineering (training set)

- Feature scaling does not improve the results (the data ranges are not very large).
- Target encoding using the mean price seems to reduce error variance.
- Target encoding using the standard deviation of the price does not seem to improve the results.
- Cross target encoding does not seem to improve the results (it overfits more).
- Category encoding performs better than one-hot encoding with decision-tree models.
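
To make the last point concrete, the sketch below contrasts the two encodings on a few made-up rows. It only shows how the feature matrix differs in shape; it does not measure model quality, and the frame is hypothetical illustration data.

```python
import pandas as pd

# Toy categorical data (hypothetical rows, illustration only)
df = pd.DataFrame({
    "cut": ["Premium", "Very Good", "Fair", "Good", "Ideal"],
    "color": ["J", "H", "G", "D", "G"],
})

# Category encoding: one integer column per feature
# (codes follow the alphabetical order of the category labels)
codes = df.apply(lambda s: s.astype("category").cat.codes)

# One-hot encoding: one indicator column per distinct category value
one_hot = pd.get_dummies(df, columns=["cut", "color"])

print(codes.shape, one_hot.shape)
```

With category codes a tree can split a feature on a single column, whereas one-hot encoding spreads the same information over as many columns as there are categories.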

### 5.1 Target encoding 

In [6]:
# Target encoding for categorical variables using price's mean
#mean
cut_encoding = df_diamonds_train.groupby(['cut'])['price'].mean().to_dict()
df_diamonds_train['cut_encoding'] = df_diamonds_train['cut'].map(cut_encoding).astype(float)
color_encoding = df_diamonds_train.groupby(['color'])['price'].mean().to_dict()
df_diamonds_train['color_encoding'] = df_diamonds_train['color'].map(color_encoding).astype(float)
clarity_encoding = df_diamonds_train.groupby(['clarity'])['price'].mean().to_dict()
df_diamonds_train['clarity_encoding'] = df_diamonds_train['clarity'].map(clarity_encoding).astype(float)
#std
cut_encoding_std = df_diamonds_train.groupby(['cut'])['price'].std().to_dict()
df_diamonds_train['cut_encoding_std'] = df_diamonds_train['cut'].map(cut_encoding_std).astype(float)
color_encoding_std = df_diamonds_train.groupby(['color'])['price'].std().to_dict()
df_diamonds_train['color_encoding_std'] = df_diamonds_train['color'].map(color_encoding_std).astype(float)
clarity_encoding_std = df_diamonds_train.groupby(['clarity'])['price'].std().to_dict()
df_diamonds_train['clarity_encoding_std'] = df_diamonds_train['clarity'].map(clarity_encoding_std).astype(float)
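
The repeated group-by and map steps above could also be wrapped in two small helpers, so that the mapping is learned once on the training data and reused for any other frame. This is a sketch on made-up rows: `fit_target_encoding` and `apply_target_encoding` are hypothetical names, and falling back to the global mean for unseen categories is an assumption, not something the notebook does.

```python
import pandas as pd

def fit_target_encoding(train, column, target="price", stat="mean"):
    """Return a {category: statistic} mapping learned from the training frame."""
    return train.groupby(column)[target].agg(stat).to_dict()

def apply_target_encoding(frame, column, mapping, default):
    """Map categories to the learned statistic; unseen categories get `default`."""
    return frame[column].map(mapping).fillna(default).astype(float)

# Toy frames (hypothetical values, illustration only)
train = pd.DataFrame({"cut": ["Ideal", "Ideal", "Fair", "Good"],
                      "price": [1000, 3000, 500, 700]})
test = pd.DataFrame({"cut": ["Ideal", "Fair", "Premium"]})  # "Premium" is unseen

cut_mean = fit_target_encoding(train, "cut")   # learned on train only
global_mean = train["price"].mean()            # assumed fallback value

train["cut_encoding"] = apply_target_encoding(train, "cut", cut_mean, global_mean)
test["cut_encoding"] = apply_target_encoding(test, "cut", cut_mean, global_mean)
print(test)
```

Passing `stat="std"` to `fit_target_encoding` gives the standard-deviation variant of the same encoding.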

In [7]:
features_list_encoding=['x','y','z','depth','table','carat','cut','color','clarity','city',
                        'cut_encoding','color_encoding','clarity_encoding',
                        'cut_encoding_std','color_encoding_std','clarity_encoding_std']

### 5.2 Category encoding

In [8]:
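# Defining features and target
# .copy() makes X an independent frame, so the column assignments below do not act on a slice
X=df_diamonds_train[features_list_encoding].copy()
y=df_diamonds_train['price']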

In [9]:
for column in cat_features_list:
    X[column]=X[column].astype('category')
    X[column]=X[column].cat.codes

In [10]:
X.head()

Unnamed: 0,x,y,z,depth,table,carat,cut,color,clarity,city,cut_encoding,color_encoding,clarity_encoding,cut_encoding_std,color_encoding_std,clarity_encoding_std
0,6.83,6.79,4.25,62.4,58.0,1.21,3,6,5,2,4617.322612,5346.234112,3913.590182,4380.357286,4437.967123,4029.640798
1,4.35,4.38,2.75,63.0,57.0,0.32,4,4,5,3,3994.44442,4476.469014,3913.590182,3955.185677,4204.035086,4029.640798
2,5.62,5.53,3.65,65.5,55.0,0.71,0,3,4,4,4333.27198,4023.214902,3796.813551,3496.467642,4063.947046,4001.986722
3,4.68,4.72,3.0,63.8,56.0,0.41,1,0,2,3,3880.611794,3134.943157,3999.856908,3647.03984,3315.698012,3821.246565
4,6.55,6.51,3.95,60.5,59.0,1.02,2,3,2,2,3436.112577,4023.214902,3999.856908,3790.911135,4063.947046,3821.246565


### 5.3 Define train and validation

In [11]:
# Splitting train and validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## 6. Feature engineering (test set) 

In [12]:
# Selecting the test-set features
# .copy() makes X_test an independent frame, so new columns can be added without acting on a slice
X_test=df_diamonds_test[features_list].copy()

### 6.1 Target encoding (mean and std)

In [13]:
# Target encoding for the test set, using the price statistics computed on the training set
#mean
cut_encoding = df_diamonds_train.groupby(['cut'])['price'].mean().to_dict()
X_test['cut_encoding'] = X_test['cut'].map(cut_encoding).astype(float)
color_encoding = df_diamonds_train.groupby(['color'])['price'].mean().to_dict()
X_test['color_encoding'] = X_test['color'].map(color_encoding).astype(float)
clarity_encoding = df_diamonds_train.groupby(['clarity'])['price'].mean().to_dict()
X_test['clarity_encoding'] = X_test['clarity'].map(clarity_encoding).astype(float)
#std
cut_encoding_std = df_diamonds_train.groupby(['cut'])['price'].std().to_dict()
X_test['cut_encoding_std'] = X_test['cut'].map(cut_encoding_std).astype(float)
color_encoding_std = df_diamonds_train.groupby(['color'])['price'].std().to_dict()
X_test['color_encoding_std'] = X_test['color'].map(color_encoding_std).astype(float)
clarity_encoding_std = df_diamonds_train.groupby(['clarity'])['price'].std().to_dict()
X_test['clarity_encoding_std'] = X_test['clarity'].map(clarity_encoding_std).astype(float)

In [14]:
X_train.shape

(32364, 16)

In [15]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32364 entries, 32121 to 15795
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   x                     32364 non-null  float64
 1   y                     32364 non-null  float64
 2   z                     32364 non-null  float64
 3   depth                 32364 non-null  float64
 4   table                 32364 non-null  float64
 5   carat                 32364 non-null  float64
 6   cut                   32364 non-null  int8   
 7   color                 32364 non-null  int8   
 8   clarity               32364 non-null  int8   
 9   city                  32364 non-null  int8   
 10  cut_encoding          32364 non-null  float64
 11  color_encoding        32364 non-null  float64
 12  clarity_encoding      32364 non-null  float64
 13  cut_encoding_std      32364 non-null  float64
 14  color_encoding_std    32364 non-null  float64
 15  clarity_encodin

### 6.2 Change categorical variables to code

In [16]:
for column in cat_features_list:
    X_test[column]=X_test[column].astype('category')
    X_test[column]=X_test[column].cat.codes
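
Because `cat.codes` is derived separately for the train and test frames, the integer assigned to a label only matches across the two when both frames contain exactly the same set of categories. A minimal sketch of one way to guarantee this is shown below; the toy frames and the name `city_dtype` are hypothetical, and fixing the categories from the training data is an assumption rather than the notebook's method.

```python
import pandas as pd

# Toy frames (hypothetical values, illustration only)
train = pd.DataFrame({"city": ["Dubai", "Kimberly", "Las Vegas", "Dubai"]})
test = pd.DataFrame({"city": ["Kimberly", "Las Vegas"]})  # "Dubai" is absent here

# Independent encoding: codes depend on which categories each frame contains
independent = test["city"].astype("category").cat.codes

# Shared encoding: fix the category list once, from the training data
city_dtype = pd.CategoricalDtype(categories=sorted(train["city"].unique()))
train_codes = train["city"].astype(city_dtype).cat.codes
test_codes = test["city"].astype(city_dtype).cat.codes  # unseen labels would get -1

print(independent.tolist(), test_codes.tolist())
```

In the independent version "Kimberly" becomes 0 in the test frame, while in the shared version it keeps the code it has in the training frame.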

In [17]:
X_test.head()

Unnamed: 0,x,y,z,depth,table,carat,cut,color,clarity,city,cut_encoding,color_encoding,clarity_encoding,cut_encoding_std,color_encoding_std,clarity_encoding_std
0,5.82,5.89,3.67,62.7,60.0,0.79,4,2,2,0,3994.44442,3677.35572,3999.856908,3955.185677,3771.406126,3821.246565
1,6.81,6.89,4.18,61.0,57.0,1.2,2,6,4,10,3436.112577,5346.234112,3796.813551,3790.911135,4437.967123,4001.986722
2,7.38,7.32,4.57,62.2,61.0,1.57,3,4,2,3,4617.322612,4476.469014,3999.856908,4380.357286,4204.035086,3821.246565
3,6.09,6.13,3.9,63.8,54.0,0.9,4,2,2,3,3994.44442,3677.35572,3999.856908,3955.185677,3771.406126,3821.246565
4,5.05,5.09,3.19,62.9,58.0,0.5,4,2,4,0,3994.44442,3677.35572,3796.813551,3955.185677,3771.406126,4001.986722


In [18]:
X_test.shape

(13485, 16)

In [19]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13485 entries, 0 to 13484
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   x                     13485 non-null  float64
 1   y                     13485 non-null  float64
 2   z                     13485 non-null  float64
 3   depth                 13485 non-null  float64
 4   table                 13485 non-null  float64
 5   carat                 13485 non-null  float64
 6   cut                   13485 non-null  int8   
 7   color                 13485 non-null  int8   
 8   clarity               13485 non-null  int8   
 9   city                  13485 non-null  int8   
 10  cut_encoding          13485 non-null  float64
 11  color_encoding        13485 non-null  float64
 12  clarity_encoding      13485 non-null  float64
 13  cut_encoding_std      13485 non-null  float64
 14  color_encoding_std    13485 non-null  float64
 15  clarity_encoding_st

## 7. Model definition - XGBRegressor

RandomForestRegressor: multiple trees built in parallel on different samples, combining different features (it overfits when the trees are large, and it is good at reducing error variance).

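Main parameters:
   - bootstrap -> whether each tree is built on a bootstrap sample (True: sampling with replacement; False: the whole dataset is used for each tree)
   - n_estimators -> number of trees in the forest
   - max_depth -> maximum number of levels in each decision tree
   - max_features -> maximum number of features considered when splitting a node
   - ccp_alpha -> complexity parameter for minimal cost-complexity pruning
   - criterion -> function used to measure the quality of a split
   - max_leaf_nodes -> maximum number of leaf nodes per tree
   - max_samples -> number of samples drawn to train each tree when bootstrap is enabled
   - min_impurity_decrease -> a node is split only if the split decreases the impurity by at least this value
   - min_samples_leaf -> minimum number of data points allowed in a leaf node
   - min_samples_split -> minimum number of data points a node must hold before it can be split
   - min_weight_fraction_leaf -> minimum weighted fraction of the total sample weight required at a leaf node
   - n_jobs -> number of jobs to run in parallel
   - oob_score -> whether to use out-of-bag samples to estimate the generalization score
   - random_state -> seed controlling the randomness of the sampling
   - verbose -> verbosity when fitting and predicting
   - warm_start -> whether to reuse the previous fit and add more estimators to the ensemble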

In [20]:
# 1. XGBRegressor 
model = XGBRegressor(n_estimators=200,colsample_bylevel=1,colsample_bynode=1,
                     colsample_bytree=0.8,reg_alpha=1, reg_lambda=1,gamma=0,learning_rate=0.1, random_state=42)
hyperparameters = model.get_params()
print(type(model), '\n')

<class 'xgboost.sklearn.XGBRegressor'> 



## 8. Model training with validation

In [21]:
%%time
# Model training
model.fit(X_train, y_train,eval_set=[(X_train,y_train),(X_val,y_val)],early_stopping_rounds=40)

[0]	validation_0-rmse:5049.43262	validation_1-rmse:5090.20605
[1]	validation_0-rmse:4561.66943	validation_1-rmse:4601.57275
[2]	validation_0-rmse:4123.17432	validation_1-rmse:4161.91260
[3]	validation_0-rmse:3729.97925	validation_1-rmse:3767.03613
[4]	validation_0-rmse:3374.57471	validation_1-rmse:3410.81763
[5]	validation_0-rmse:3056.49268	validation_1-rmse:3092.52441
[6]	validation_0-rmse:2773.00781	validation_1-rmse:2808.54492
[7]	validation_0-rmse:2516.19653	validation_1-rmse:2549.88599
[8]	validation_0-rmse:2286.21973	validation_1-rmse:2319.63794
[9]	validation_0-rmse:2079.98828	validation_1-rmse:2113.38574
[10]	validation_0-rmse:1894.95496	validation_1-rmse:1927.09985
[11]	validation_0-rmse:1730.08118	validation_1-rmse:1763.08301
[12]	validation_0-rmse:1582.69898	validation_1-rmse:1614.79004
[13]	validation_0-rmse:1450.82617	validation_1-rmse:1483.48950
[14]	validation_0-rmse:1333.63123	validation_1-rmse:1366.36145
[15]	validation_0-rmse:1231.67981	validation_1-rmse:1265.17737
[1

[134]	validation_0-rmse:411.72043	validation_1-rmse:529.23926
[135]	validation_0-rmse:410.92343	validation_1-rmse:529.00342
[136]	validation_0-rmse:410.70425	validation_1-rmse:529.02191
[137]	validation_0-rmse:410.63678	validation_1-rmse:528.98181
[138]	validation_0-rmse:409.64673	validation_1-rmse:528.70306
[139]	validation_0-rmse:408.64368	validation_1-rmse:528.60754
[140]	validation_0-rmse:407.87918	validation_1-rmse:528.77643
[141]	validation_0-rmse:407.65222	validation_1-rmse:528.79614
[142]	validation_0-rmse:407.55701	validation_1-rmse:528.77850
[143]	validation_0-rmse:407.45706	validation_1-rmse:528.64130
[144]	validation_0-rmse:407.08591	validation_1-rmse:528.66596
[145]	validation_0-rmse:406.82190	validation_1-rmse:528.57983
[146]	validation_0-rmse:406.73496	validation_1-rmse:528.54010
[147]	validation_0-rmse:405.56232	validation_1-rmse:528.76141
[148]	validation_0-rmse:404.99353	validation_1-rmse:528.99402
[149]	validation_0-rmse:404.39078	validation_1-rmse:529.02295
[150]	va

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.1, max_delta_step=0,
             max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=200, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=42,
             reg_alpha=1, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [22]:
%%time
# Model predictions
y_pred_val = model.predict(X_val)

CPU times: user 79.7 ms, sys: 6.58 ms, total: 86.3 ms
Wall time: 18.7 ms


In [23]:
%%time
# Model predictions
y_pred_train = model.predict(X_train)

CPU times: user 196 ms, sys: 4.59 ms, total: 200 ms
Wall time: 36.4 ms


## 9. Training set error

In [24]:
# Training RMSE
rmse_train = mean_squared_error(y_train, y_pred_train)**0.5
rmse_train

406.7349402669031

In [25]:
mae_train=mean_absolute_error(y_train, y_pred_train)
mae_train

228.50397469955857

In [26]:
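# Note: this call uses the validation data (y_val, y_pred_val), so the value below is the validation R2;
# the training-set R2 would be r2_score(y_train, y_pred_train)
r2r = r2_score(y_val, y_pred_val)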
r2r

0.9828456740423707

## 9. Validation set error

In [27]:
#432
rmse_val = mean_squared_error(y_val, y_pred_val)**0.5
rmse_val

528.5401536791891

In [28]:
mae_val=mean_absolute_error(y_val, y_pred_val)
mae_val

271.8974370809085

In [29]:
r2r = r2_score(y_val, y_pred_val)
r2r

0.9828456740423707
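
For reference, the three metrics reported in the two error sections can be written out directly with NumPy. The helper name `regression_report` and the four sample values are hypothetical; the formulas are the standard definitions of RMSE, MAE and R².

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return RMSE, MAE and R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error
    mae = np.mean(np.abs(residuals))                 # mean absolute error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {"rmse": rmse, "mae": mae, "r2": 1.0 - ss_res / ss_tot}

# Made-up prices and predictions, illustration only
report = regression_report([100, 200, 300, 400], [110, 190, 310, 390])
print(report)
```

Calling the same helper once with the training pair and once with the validation pair keeps the two sets of figures directly comparable.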

## 10. Model training without validation

In [30]:
%%time
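# Model training on the full labelled data (X, y)
# The eval_set rows are part of X here, so the RMSE values logged below are in-sample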
model.fit(X, y,eval_set=[(X_train,y_train),(X_val,y_val)],early_stopping_rounds=40)

[0]	validation_0-rmse:5048.78076	validation_1-rmse:5087.30908
[1]	validation_0-rmse:4560.51123	validation_1-rmse:4597.12061
[2]	validation_0-rmse:4121.58301	validation_1-rmse:4156.11377
[3]	validation_0-rmse:3728.11719	validation_1-rmse:3760.80078
[4]	validation_0-rmse:3372.50244	validation_1-rmse:3403.88306
[5]	validation_0-rmse:3053.63525	validation_1-rmse:3083.90088
[6]	validation_0-rmse:2770.06714	validation_1-rmse:2798.31665
[7]	validation_0-rmse:2513.57324	validation_1-rmse:2539.69287
[8]	validation_0-rmse:2283.63696	validation_1-rmse:2310.04468
[9]	validation_0-rmse:2077.72949	validation_1-rmse:2103.17798
[10]	validation_0-rmse:1893.39062	validation_1-rmse:1916.80481
[11]	validation_0-rmse:1728.79297	validation_1-rmse:1749.74353
[12]	validation_0-rmse:1581.59228	validation_1-rmse:1600.14465
[13]	validation_0-rmse:1449.89734	validation_1-rmse:1467.38135
[14]	validation_0-rmse:1333.08093	validation_1-rmse:1349.81543
[15]	validation_0-rmse:1231.29407	validation_1-rmse:1247.80481
[1

[134]	validation_0-rmse:429.03751	validation_1-rmse:426.82028
[135]	validation_0-rmse:428.63422	validation_1-rmse:426.47473
[136]	validation_0-rmse:428.25781	validation_1-rmse:426.15961
[137]	validation_0-rmse:427.41715	validation_1-rmse:425.36350
[138]	validation_0-rmse:426.39966	validation_1-rmse:424.44168
[139]	validation_0-rmse:426.14548	validation_1-rmse:424.18866
[140]	validation_0-rmse:425.54807	validation_1-rmse:423.40637
[141]	validation_0-rmse:424.74588	validation_1-rmse:422.52145
[142]	validation_0-rmse:424.53650	validation_1-rmse:422.32080
[143]	validation_0-rmse:424.28876	validation_1-rmse:421.94772
[144]	validation_0-rmse:423.73218	validation_1-rmse:421.35541
[145]	validation_0-rmse:423.68399	validation_1-rmse:421.28751
[146]	validation_0-rmse:422.76389	validation_1-rmse:420.49701
[147]	validation_0-rmse:422.14563	validation_1-rmse:420.19379
[148]	validation_0-rmse:421.83908	validation_1-rmse:419.95749
[149]	validation_0-rmse:421.37454	validation_1-rmse:419.61243
[150]	va

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.1, max_delta_step=0,
             max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=200, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=42,
             reg_alpha=1, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

## 11. Test Predictions

In [31]:
predictions = model.predict(X_test)

In [32]:
predictions=pd.DataFrame(predictions)

In [33]:
predictions.reset_index(inplace=True)

In [34]:
predictions=predictions.rename({0: 'price','index': 'id'}, axis=1)

## 12. Save Predictions

In [35]:
predictions.to_csv('../data/diamonds_predictions_2.csv',index=False)