## 0. Review
In this model the following decisions were taken:
- The model chosen is XGBRegressor.
- All features are considered in the analysis (XGB performs well even when the dataset contains dependent or weakly correlated features).
- Data preparation is not needed because:
    - There are no null values.
    - There are few zero values (they are present in both train and test, and XGB can predict with zero values).
    - There are few outliers (8% in train and 1% in test). As we cannot exclude them, it is better to keep them in the training set so the model can learn from them.
- Feature scaling does not provide better results (the data ranges are not very large).
- Target encoding using the price mean seems to reduce error variance.
- Target encoding using the price standard deviation does not seem to provide better results.
- Cross target encoding does not seem to provide better results (there is more overfitting).
- Category encoding performs better than one-hot encoding when using decision-tree models.

## 1. Libraries import

In [63]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [64]:
# Imports 
import numpy as np
import pandas as pd
pd.set_option("display.max_columns", 50)
#scaling
from sklearn.preprocessing import StandardScaler
#train test split
from sklearn.model_selection import train_test_split
# model
from xgboost import XGBRegressor
# error
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error

## 2. Data import

### 2.1 Import train data

In [65]:
df_diamonds_train=pd.read_csv('../data/diamonds_train.csv')
df_diamonds_train.drop(['Unnamed: 0', 'index_id'], axis=1,inplace=True) #dropped unnecessary columns
df_diamonds_train.head()

Unnamed: 0,depth,table,x,y,z,price,carat,cut,color,clarity,city
0,62.4,58.0,6.83,6.79,4.25,4268,1.21,Premium,J,VS2,Dubai
1,63.0,57.0,4.35,4.38,2.75,505,0.32,Very Good,H,VS2,Kimberly
2,65.5,55.0,5.62,5.53,3.65,2686,0.71,Fair,G,VS1,Las Vegas
3,63.8,56.0,4.68,4.72,3.0,738,0.41,Good,D,SI1,Kimberly
4,60.5,59.0,6.55,6.51,3.95,4882,1.02,Ideal,G,SI1,Dubai


### 2.2 Import test data

In [66]:
df_diamonds_test=pd.read_csv('../data/diamonds_test.csv')
df_diamonds_test.drop(['id'], axis=1,inplace=True) #dropped unnecessary columns
df_diamonds_test.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z,city
0,0.79,Very Good,F,SI1,62.7,60.0,5.82,5.89,3.67,Amsterdam
1,1.2,Ideal,J,VS1,61.0,57.0,6.81,6.89,4.18,Surat
2,1.57,Premium,H,SI1,62.2,61.0,7.38,7.32,4.57,Kimberly
3,0.9,Very Good,F,SI1,63.8,54.0,6.09,6.13,3.9,Kimberly
4,0.5,Very Good,F,VS1,62.9,58.0,5.05,5.09,3.19,Amsterdam


## 3. Features to be considered in the analysis

All features are considered in the analysis (the chosen model, XGB, performs well even when the dataset contains dependent or weakly correlated features):

In [67]:
num_features_list=['x','y','z','depth','table','carat']
cat_features_list=['cut','color','clarity','city']
features_list=['x','y','z','depth','table','carat','cut','color','clarity','city']
target='price'

## 4. Data preparation 

Data preparation is not needed because:
- There are no null values.
- There are few zero values (they are present in both train and test, and XGB can predict with zero values).
- There are few outliers (8% in train and 1% in test). As we cannot exclude them, it is better to keep them in the training set so the model can learn from them.
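
The checks behind these statements can be sketched as follows. This is a minimal illustration on a small made-up frame, not the notebook's own exploration: `quality_report` is a hypothetical helper, and the 1.5 × IQR rule is an assumed outlier definition, since the notebook does not state which one was used.

```python
import pandas as pd

def quality_report(df, num_cols, iqr_factor=1.5):
    """Count nulls, zeros and IQR outliers per numeric column (hypothetical helper)."""
    rows = []
    for col in num_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
        rows.append({
            "feature": col,
            "nulls": int(df[col].isna().sum()),
            "zeros": int((df[col] == 0).sum()),
            "outliers": int(((df[col] < lower) | (df[col] > upper)).sum()),
        })
    return pd.DataFrame(rows).set_index("feature")

# Toy data standing in for df_diamonds_train (values are made up)
toy = pd.DataFrame({
    "x": [4.3, 4.4, 4.5, 4.6, 0.0],
    "carat": [0.30, 0.31, 0.32, 0.33, 3.00],
})
report = quality_report(toy, ["x", "carat"])
print(report)
```

Running the same helper on the train and test frames with `num_features_list` would give per-feature counts that can be compared between the two sets.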

## 5. Feature engineering (training set)

- Feature scaling does not provide better results (the data ranges are not very large).
- Target encoding using the price mean seems to reduce error variance.
- Target encoding using the price standard deviation does not seem to provide better results.
- Cross target encoding does not seem to provide better results (there is more overfitting).
- Category encoding performs better than one-hot encoding when using decision-tree models.

### 5.1 Target encoding 

In [68]:
# Target encoding for categorical variables using price's mean
cut_encoding = df_diamonds_train.groupby(['cut'])['price'].mean().to_dict()
df_diamonds_train['cut_encoding'] = df_diamonds_train['cut'].map(cut_encoding).astype(float)
color_encoding = df_diamonds_train.groupby(['color'])['price'].mean().to_dict()
df_diamonds_train['color_encoding'] = df_diamonds_train['color'].map(color_encoding).astype(float)
clarity_encoding = df_diamonds_train.groupby(['clarity'])['price'].mean().to_dict()
df_diamonds_train['clarity_encoding'] = df_diamonds_train['clarity'].map(clarity_encoding).astype(float)
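
The three blocks above repeat the same pattern, so it could be wrapped in a small function. The sketch below uses a hypothetical helper, `mean_target_encode`, on made-up data; the fallback to the global training mean for categories not seen in training is an assumption added for illustration and is not part of the notebook.

```python
import pandas as pd

def mean_target_encode(train, other, column, target="price"):
    """Map each category to the mean target observed in `train` (hypothetical helper).

    Categories that never appear in `train` fall back to the global training mean.
    """
    means = train.groupby(column)[target].mean()
    global_mean = train[target].mean()
    return other[column].map(means).fillna(global_mean).astype(float)

# Toy frames standing in for the train and test sets (values are made up)
train = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Fair", "Good"],
    "price": [1000, 3000, 500, 700],
})
test = pd.DataFrame({"cut": ["Ideal", "Fair", "Premium"]})

train["cut_encoding"] = mean_target_encode(train, train, "cut")
test["cut_encoding"] = mean_target_encode(train, test, "cut")
print(test)
```

The means are always computed from the training frame, so the same mapping is applied to both sets.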

In [69]:
features_list_encoding=['x','y','z','depth','table','carat','cut','color','clarity','city',
                        'cut_encoding','color_encoding','clarity_encoding']

## Scaling

In [70]:
num_features_list=['x','y','z','depth','table','carat','cut_encoding','color_encoding','clarity_encoding']

In [71]:
df_diamonds_train_num=df_diamonds_train[num_features_list]
df_diamonds_train_carat=df_diamonds_train[cat_features_list]
df_diamonds_train_target=df_diamonds_train['price']

In [72]:
# Perform the feature scaling on the numeric attributes of the dataset
num_scaler = StandardScaler()
df_diamonds_train_num_scaled = num_scaler.fit_transform(df_diamonds_train_num)
df_diamonds_train_num_scaled = pd.DataFrame(df_diamonds_train_num_scaled, columns=num_features_list)
df_diamonds_train_num_scaled


Unnamed: 0,x,y,z,depth,table,carat,cut_encoding,color_encoding,clarity_encoding
0,0.978807,0.921985,1.022657,0.452019,0.247981,0.867006,1.456645,1.991090,-0.022040
1,-1.226738,-1.179816,-1.129259,0.871099,-0.199745,-1.004557,0.139558,0.769625,-0.022040
2,-0.097286,-0.176882,0.161891,2.617265,-1.095198,-0.184434,0.856015,0.133092,-0.195311
3,-0.933258,-0.883296,-0.770607,1.429872,-0.647472,-0.815298,-0.101143,-1.114363,0.105960
4,0.729794,0.677793,0.592274,-0.875068,0.695707,0.467458,-1.041044,0.133092,0.105960
...,...,...,...,...,...,...,...,...,...
40450,1.218927,1.140014,1.280887,0.661559,-0.199745,1.140380,-1.041044,0.133092,-0.195311
40451,2.295019,2.195276,1.711271,-3.249854,1.143433,2.570338,-0.101143,-0.352620,1.739881
40452,0.569714,0.599302,0.678351,0.661559,-0.647472,0.446430,-1.041044,0.769625,0.105960
40453,-1.137805,-1.101325,-1.114913,0.102785,-1.408606,-0.983529,-1.041044,1.991090,-0.195311


In [73]:
df_diamonds_train=pd.merge(df_diamonds_train_num_scaled, df_diamonds_train_carat, left_index=True, right_index=True)
df_diamonds_train=pd.merge(df_diamonds_train, df_diamonds_train_target, left_index=True, right_index=True)
df_diamonds_train.head()

Unnamed: 0,x,y,z,depth,table,carat,cut_encoding,color_encoding,clarity_encoding,cut,color,clarity,city,price
0,0.978807,0.921985,1.022657,0.452019,0.247981,0.867006,1.456645,1.99109,-0.02204,Premium,J,VS2,Dubai,4268
1,-1.226738,-1.179816,-1.129259,0.871099,-0.199745,-1.004557,0.139558,0.769625,-0.02204,Very Good,H,VS2,Kimberly,505
2,-0.097286,-0.176882,0.161891,2.617265,-1.095198,-0.184434,0.856015,0.133092,-0.195311,Fair,G,VS1,Las Vegas,2686
3,-0.933258,-0.883296,-0.770607,1.429872,-0.647472,-0.815298,-0.101143,-1.114363,0.10596,Good,D,SI1,Kimberly,738
4,0.729794,0.677793,0.592274,-0.875068,0.695707,0.467458,-1.041044,0.133092,0.10596,Ideal,G,SI1,Dubai,4882


### 5.2 Category encoding

In [74]:
# Defining features and target (copy, so the encoding below does not write into a slice of df_diamonds_train)
X=df_diamonds_train[features_list_encoding].copy()
y=df_diamonds_train['price']

In [75]:
for column in cat_features_list:
    X[column]=X[column].astype('category')
    X[column]=X[column].cat.codes

In [77]:
X = X[features_list_encoding]

In [78]:
X.head()

Unnamed: 0,x,y,z,depth,table,carat,cut,color,clarity,city,cut_encoding,color_encoding,clarity_encoding
0,0.978807,0.921985,1.022657,0.452019,0.247981,0.867006,3,6,5,2,1.456645,1.99109,-0.02204
1,-1.226738,-1.179816,-1.129259,0.871099,-0.199745,-1.004557,4,4,5,3,0.139558,0.769625,-0.02204
2,-0.097286,-0.176882,0.161891,2.617265,-1.095198,-0.184434,0,3,4,4,0.856015,0.133092,-0.195311
3,-0.933258,-0.883296,-0.770607,1.429872,-0.647472,-0.815298,1,0,2,3,-0.101143,-1.114363,0.10596
4,0.729794,0.677793,0.592274,-0.875068,0.695707,0.467458,2,3,2,2,-1.041044,0.133092,0.10596


### 5.3 Define train and validation

In [79]:
# Splitting train and validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## 6. Feature engineering (test set) 

In [80]:
# Selecting the features of the test set (copy, so new columns can be added safely)
X_test=df_diamonds_test[features_list].copy()

### 6.1 Target encoding (mean)

In [81]:
# Target encoding for categorical variables using price's mean
cut_encoding = df_diamonds_train.groupby(['cut'])['price'].mean().to_dict()
X_test['cut_encoding'] = X_test['cut'].map(cut_encoding).astype(float)
color_encoding = df_diamonds_train.groupby(['color'])['price'].mean().to_dict()
X_test['color_encoding'] = X_test['color'].map(color_encoding).astype(float)
clarity_encoding = df_diamonds_train.groupby(['clarity'])['price'].mean().to_dict()
X_test['clarity_encoding'] = X_test['clarity'].map(clarity_encoding).astype(float)

## Scaling

In [82]:
X_test_num=X_test[num_features_list]
X_test_carat=X_test[cat_features_list]

In [83]:
# Perform the feature scaling on the numeric attributes of the dataset
num_scaler = StandardScaler()
X_test_num_scaled = num_scaler.fit_transform(X_test_num)
X_test_num_scaled = pd.DataFrame(X_test_num_scaled, columns=num_features_list)
X_test_num_scaled.head()

Unnamed: 0,x,y,z,depth,table,carat,cut_encoding,color_encoding,clarity_encoding
0,0.075022,0.133236,0.173091,0.6695,1.121874,-0.018412,0.122524,-0.342517,0.109405
1,0.964007,1.019395,0.870787,-0.514957,-0.219192,0.855078,-1.051999,2.001599,-0.194731
2,1.475847,1.400444,1.404319,0.321131,1.568896,1.643349,1.432828,0.779922,0.109405
3,0.317472,0.345914,0.487738,1.435914,-1.560258,0.215939,0.122524,-0.342517,0.109405
4,-0.616411,-0.575691,-0.483564,0.808848,0.22783,-0.636246,0.122524,-0.342517,-0.194731
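
The cell above fits a second `StandardScaler` on the test set, so train and test are standardized with different means and standard deviations. A common alternative is to fit the scaler once on the training data and only call `transform` on the test data. The sketch below shows that variant on made-up numbers; it is an illustration of the alternative, not what the notebook ran, and the outputs shown in this notebook would differ under it.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric features standing in for the train and test sets (values are made up)
train_num = pd.DataFrame({"carat": [0.3, 0.7, 1.1], "depth": [60.0, 62.0, 64.0]})
test_num = pd.DataFrame({"carat": [0.7, 1.5], "depth": [62.0, 61.0]})

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only
train_scaled = pd.DataFrame(scaler.fit_transform(train_num), columns=train_num.columns)
# Reuse the training statistics for the test data
test_scaled = pd.DataFrame(scaler.transform(test_num), columns=test_num.columns)

print(test_scaled)
```

With this approach a test value equal to the training mean maps to zero, and a value beyond the training range maps beyond the scaled training range, so both sets share one scale.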


In [84]:
X_test=pd.merge(X_test_num_scaled, X_test_carat, left_index=True, right_index=True)
X_test.head()

Unnamed: 0,x,y,z,depth,table,carat,cut_encoding,color_encoding,clarity_encoding,cut,color,clarity,city
0,0.075022,0.133236,0.173091,0.6695,1.121874,-0.018412,0.122524,-0.342517,0.109405,Very Good,F,SI1,Amsterdam
1,0.964007,1.019395,0.870787,-0.514957,-0.219192,0.855078,-1.051999,2.001599,-0.194731,Ideal,J,VS1,Surat
2,1.475847,1.400444,1.404319,0.321131,1.568896,1.643349,1.432828,0.779922,0.109405,Premium,H,SI1,Kimberly
3,0.317472,0.345914,0.487738,1.435914,-1.560258,0.215939,0.122524,-0.342517,0.109405,Very Good,F,SI1,Kimberly
4,-0.616411,-0.575691,-0.483564,0.808848,0.22783,-0.636246,0.122524,-0.342517,-0.194731,Very Good,F,VS1,Amsterdam


In [85]:
X_train.shape

(32364, 13)

In [86]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32364 entries, 32121 to 15795
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   x                 32364 non-null  float64
 1   y                 32364 non-null  float64
 2   z                 32364 non-null  float64
 3   depth             32364 non-null  float64
 4   table             32364 non-null  float64
 5   carat             32364 non-null  float64
 6   cut               32364 non-null  int8   
 7   color             32364 non-null  int8   
 8   clarity           32364 non-null  int8   
 9   city              32364 non-null  int8   
 10  cut_encoding      32364 non-null  float64
 11  color_encoding    32364 non-null  float64
 12  clarity_encoding  32364 non-null  float64
dtypes: float64(9), int8(4)
memory usage: 2.6 MB


### 6.2 Change categorical variables to code

In [87]:
for column in cat_features_list:
    X_test[column]=X_test[column].astype('category')
    X_test[column]=X_test[column].cat.codes
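
`cat.codes` numbers the categories found in each frame independently, so train and test only get matching codes when both contain exactly the same set of values. The notebook does not show whether that holds for every column. A minimal sketch of a safer variant, assuming the categories are learned from the training data and reused for the test data (toy values, hypothetical variable names):

```python
import pandas as pd

# Toy frames standing in for the train and test sets (values are made up)
train = pd.DataFrame({"city": ["Dubai", "Kimberly", "Surat"]})
test = pd.DataFrame({"city": ["Surat", "Dubai"]})

# Categories learned from train, sorted so the codes are deterministic
city_dtype = pd.CategoricalDtype(categories=sorted(train["city"].unique()))
train_codes = train["city"].astype(city_dtype).cat.codes
test_codes = test["city"].astype(city_dtype).cat.codes

# Encoding the test set on its own gives different codes for the same cities
independent_codes = test["city"].astype("category").cat.codes

print(test_codes.tolist(), independent_codes.tolist())
```

With a shared dtype, a value that appears only in the test set is encoded as -1 instead of silently shifting the other codes.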

In [88]:
X_test = X_test[features_list_encoding]

In [89]:
X_test.head()

Unnamed: 0,x,y,z,depth,table,carat,cut,color,clarity,city,cut_encoding,color_encoding,clarity_encoding
0,0.075022,0.133236,0.173091,0.6695,1.121874,-0.018412,4,2,2,0,0.122524,-0.342517,0.109405
1,0.964007,1.019395,0.870787,-0.514957,-0.219192,0.855078,2,6,4,10,-1.051999,2.001599,-0.194731
2,1.475847,1.400444,1.404319,0.321131,1.568896,1.643349,3,4,2,3,1.432828,0.779922,0.109405
3,0.317472,0.345914,0.487738,1.435914,-1.560258,0.215939,4,2,2,3,0.122524,-0.342517,0.109405
4,-0.616411,-0.575691,-0.483564,0.808848,0.22783,-0.636246,4,2,4,0,0.122524,-0.342517,-0.194731


In [90]:
X_test.shape

(13485, 13)

In [91]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13485 entries, 0 to 13484
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   x                 13485 non-null  float64
 1   y                 13485 non-null  float64
 2   z                 13485 non-null  float64
 3   depth             13485 non-null  float64
 4   table             13485 non-null  float64
 5   carat             13485 non-null  float64
 6   cut               13485 non-null  int8   
 7   color             13485 non-null  int8   
 8   clarity           13485 non-null  int8   
 9   city              13485 non-null  int8   
 10  cut_encoding      13485 non-null  float64
 11  color_encoding    13485 non-null  float64
 12  clarity_encoding  13485 non-null  float64
dtypes: float64(9), int8(4)
memory usage: 1001.0 KB


## 7. Model definition - XGBRegressor

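RandomForestRegressor: multiple trees built in parallel, each on a different sample and combining different features (it overfits when the trees are large, and it is good at reducing error variance).

Main parameters (these belong to scikit-learn's RandomForestRegressor, not to XGBRegressor):
   - bootstrap -> whether each tree is built on a bootstrap sample (sampling with replacement); if False, the whole dataset is used for each tree
   - n_estimators -> number of trees in the forest
   - max_depth -> max number of levels in each decision tree
   - max_features -> max number of features considered for splitting a node
   - ccp_alpha -> complexity parameter for minimal cost-complexity pruning
   - criterion -> function used to measure the quality of a split
   - max_leaf_nodes -> max number of leaf nodes
   - max_samples -> number of samples drawn to train each tree when bootstrap is True
   - min_impurity_decrease -> a node is split only if the split decreases the impurity by at least this value
   - min_samples_leaf -> min number of data points allowed in a leaf node
   - min_samples_split -> min number of data points placed in a node before the node is split
   - min_weight_fraction_leaf -> min weighted fraction of the total sample weights required in a leaf node
   - n_jobs -> number of jobs to run in parallel
   - oob_score -> whether to use out-of-bag samples to estimate the generalization score
   - random_state -> seed controlling the randomness of the sampling
   - verbose -> verbosity when fitting and predicting
   - warm_start -> reuse the previous fit and add more estimators to the ensemble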

In [92]:
# 1. XGBRegressor 
model = XGBRegressor(n_estimators=200,colsample_bylevel=1,colsample_bynode=1,
                     colsample_bytree=0.8,reg_alpha=1, reg_lambda=1,gamma=0,learning_rate=0.1, random_state=42)
hyperparameters = model.get_params()
print(type(model), '\n')

<class 'xgboost.sklearn.XGBRegressor'> 



## 8. Model training with validation

In [93]:
%%time
# Model training
model.fit(X_train, y_train,eval_set=[(X_train,y_train),(X_val,y_val)],early_stopping_rounds=40)

[0]	validation_0-rmse:5050.20361	validation_1-rmse:5089.46191
[1]	validation_0-rmse:4561.85986	validation_1-rmse:4599.20605
[2]	validation_0-rmse:4125.37353	validation_1-rmse:4160.49902
[3]	validation_0-rmse:3731.65332	validation_1-rmse:3765.33496
[4]	validation_0-rmse:3378.73560	validation_1-rmse:3411.42163
[5]	validation_0-rmse:3059.69751	validation_1-rmse:3091.21118
[6]	validation_0-rmse:2781.38159	validation_1-rmse:2810.80444
[7]	validation_0-rmse:2523.39648	validation_1-rmse:2552.75732
[8]	validation_0-rmse:2295.53052	validation_1-rmse:2323.95093
[9]	validation_0-rmse:2090.69482	validation_1-rmse:2118.79736
[10]	validation_0-rmse:1904.92310	validation_1-rmse:1932.43872
[11]	validation_0-rmse:1738.95666	validation_1-rmse:1766.02600
[12]	validation_0-rmse:1590.44519	validation_1-rmse:1617.73425
[13]	validation_0-rmse:1458.02942	validation_1-rmse:1486.18897
[14]	validation_0-rmse:1340.17981	validation_1-rmse:1368.77698
[15]	validation_0-rmse:1235.43701	validation_1-rmse:1263.81348
[1

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.1, max_delta_step=0,
             max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=200, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=42,
             reg_alpha=1, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [94]:
%%time
# Model predictions
y_pred_val = model.predict(X_val)

CPU times: user 44.6 ms, sys: 3.23 ms, total: 47.9 ms
Wall time: 9.09 ms


In [95]:
%%time
# Model predictions
y_pred_train = model.predict(X_train)

CPU times: user 116 ms, sys: 3.16 ms, total: 119 ms
Wall time: 22.9 ms


## 9. Training set error

In [96]:
# Training RMSE
rmse_train = mean_squared_error(y_train, y_pred_train)**0.5
rmse_train

447.4857359017463

In [97]:
mae_train=mean_absolute_error(y_train, y_pred_train)
mae_train

247.18925707379432

In [98]:
# Note: this scores the validation set; the training R2 would be r2_score(y_train, y_pred_train)
r2r = r2_score(y_val, y_pred_val)
r2r

0.9828380390491122

## 9. Validation set error

In [99]:
#432
rmse_val = mean_squared_error(y_val, y_pred_val)**0.5
rmse_val

528.6577610863301

In [100]:
mae_val=mean_absolute_error(y_val, y_pred_val)
mae_val

276.8615789982757

In [101]:
r2r = r2_score(y_val, y_pred_val)
r2r

0.9828380390491122
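
The three metrics used in the two sections above can be gathered in one helper, so each dataset is scored with a single call and the arguments cannot be mixed up. This is a minimal NumPy sketch with a hypothetical function name, `regression_report`; it follows the standard definitions of RMSE, MAE and R2, which are the quantities the scikit-learn functions used in this notebook compute.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return RMSE, MAE and R2 for one set of predictions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))                 # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {"rmse": rmse, "mae": mae, "r2": 1.0 - ss_res / ss_tot}

# Toy prices and predictions (values are made up)
report = regression_report([100, 200, 300, 400], [110, 190, 310, 390])
print(report)
```

In the notebook this would be called once with `(y_train, y_pred_train)` and once with `(y_val, y_pred_val)`.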

## 10. Model training without validation

In [102]:
%%time
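# Model training on the full dataset (X, y); the eval sets are subsets of it, so the logged RMSE is no longer a held-out estimate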
model.fit(X, y,eval_set=[(X_train,y_train),(X_val,y_val)],early_stopping_rounds=40)

[0]	validation_0-rmse:5050.02051	validation_1-rmse:5086.54346
[1]	validation_0-rmse:4560.79883	validation_1-rmse:4596.78564
[2]	validation_0-rmse:4124.03711	validation_1-rmse:4157.51758
[3]	validation_0-rmse:3729.29858	validation_1-rmse:3761.69580
[4]	validation_0-rmse:3374.62305	validation_1-rmse:3406.22314
[5]	validation_0-rmse:3055.26245	validation_1-rmse:3086.02002
[6]	validation_0-rmse:2776.91675	validation_1-rmse:2804.64111
[7]	validation_0-rmse:2518.89331	validation_1-rmse:2545.64551
[8]	validation_0-rmse:2289.88916	validation_1-rmse:2313.61377
[9]	validation_0-rmse:2086.11768	validation_1-rmse:2108.48804
[10]	validation_0-rmse:1900.31494	validation_1-rmse:1921.19141
[11]	validation_0-rmse:1734.93152	validation_1-rmse:1753.57800
[12]	validation_0-rmse:1587.10584	validation_1-rmse:1605.68176
[13]	validation_0-rmse:1454.63806	validation_1-rmse:1473.25513
[14]	validation_0-rmse:1337.58752	validation_1-rmse:1355.24036
[15]	validation_0-rmse:1233.05481	validation_1-rmse:1250.19080
[1

[134]	validation_0-rmse:422.11838	validation_1-rmse:422.50534
[135]	validation_0-rmse:422.06284	validation_1-rmse:422.40808
[136]	validation_0-rmse:421.78479	validation_1-rmse:422.28607
[137]	validation_0-rmse:420.96139	validation_1-rmse:422.02811
[138]	validation_0-rmse:420.36871	validation_1-rmse:421.57956
[139]	validation_0-rmse:420.28583	validation_1-rmse:421.47900
[140]	validation_0-rmse:419.96643	validation_1-rmse:421.20959
[141]	validation_0-rmse:419.70294	validation_1-rmse:420.93365
[142]	validation_0-rmse:419.43716	validation_1-rmse:420.83466
[143]	validation_0-rmse:418.68454	validation_1-rmse:420.27155
[144]	validation_0-rmse:418.22919	validation_1-rmse:420.05045
[145]	validation_0-rmse:417.44641	validation_1-rmse:419.29245
[146]	validation_0-rmse:416.70544	validation_1-rmse:418.66391
[147]	validation_0-rmse:416.65277	validation_1-rmse:418.61050
[148]	validation_0-rmse:415.80582	validation_1-rmse:417.93521
[149]	validation_0-rmse:415.26651	validation_1-rmse:417.40268
[150]	va

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.1, max_delta_step=0,
             max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=200, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=42,
             reg_alpha=1, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

## 11. Test Predictions

In [103]:
predictions = model.predict(X_test)

In [104]:
predictions=pd.DataFrame(predictions)

In [105]:
predictions.reset_index(inplace=True)

In [106]:
predictions=predictions.rename({0: 'price','index': 'id'}, axis=1)
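
The wrap, reset-index and rename steps above can also be written as a single constructor call. The sketch below assumes, as those cells imply, that `id` is simply the row position of each prediction; the array is a made-up stand-in for the output of `model.predict(X_test)`.

```python
import numpy as np
import pandas as pd

# Stand-in for model.predict(X_test) (values are made up)
raw_predictions = np.array([4268.5, 505.2, 2686.9])

# Build the id and price columns in one step
submission = pd.DataFrame({
    "id": np.arange(len(raw_predictions)),
    "price": raw_predictions,
})
print(submission)
```

The resulting frame has the same two columns, `id` and `price`, as the one built step by step.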

## 12. Save Predictions

In [107]:
predictions.to_csv('../data/diamonds_predictions_3.csv',index=False)