#  **PREDICTING DIAMONDS PRICES**

## Modelos:  
- LinearRegression: **modelo_1**( *Diamonds_train sub X_train/X_test* ), **modelo_2**(*Diamonds_train completo*), **model_7, model_8**
- Randomforest Regressor: **modelo_3**( Diamonds_train sub train/test ), **modelo_4**(*Diamonds_train completo*, **model_9, model_11, model_12, model_13**.
- GradientBoosting Regressor: **modelo_5**( Diamonds_train sub train/test ), **modelo_6**(*Diamonds_train completo*, **model_10** 
- KNeighborsRegressor: **model_14**.

In [70]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import ensemble

## **Features**
- **id**: only for test & sample submission files, id for prediction sample identification.  
- **price**: price in USD.  
- **carat**: weight of the diamond.  
- **cut**: quality of the cut (Fair, Good, Very Good, Premium, Ideal).  
- **color**: diamond colour, from J (worst) to D (best).  
- **clarity**: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).  
- **x**: length in mm.  
- **y**: width in mm.  
- **z**: depth in mm.  
- **depth**: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79).  
- **table**: width of top of diamond relative to widest point (43--95).  

## Descargamos el csv  diamonds_train 

In [3]:
diamonds = pd.read_csv('../Inputs/diamonds_train.csv')
diamonds_copy = diamonds.copy

#### Visualizamos el df

In [4]:
display(diamonds.head())
print('Shape: ', diamonds.shape)
display(diamonds.describe())

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,price
0,0,0.78,Premium,F,VS1,61.5,58.0,5.93,5.98,3.66,3446
1,1,0.31,Ideal,D,SI1,60.8,56.0,4.37,4.32,2.64,732
2,2,0.3,Ideal,F,SI1,62.3,54.0,4.3,4.34,2.69,475
3,3,1.04,Ideal,E,VVS2,62.0,58.0,6.54,6.46,4.03,9552
4,4,0.65,Ideal,J,SI1,61.4,55.0,5.58,5.62,3.44,1276


Shape:  (40345, 11)


Unnamed: 0,id,carat,depth,table,x,y,z,price
count,40345.0,40345.0,40345.0,40345.0,40345.0,40345.0,40345.0,40345.0
mean,20172.0,0.797699,61.749721,57.449421,5.730824,5.734808,3.537958,3931.902243
std,11646.742642,0.474309,1.427728,2.234087,1.121767,1.151536,0.695039,3994.318832
min,0.0,0.2,43.0,43.0,0.0,0.0,0.0,326.0
25%,10086.0,0.4,61.0,56.0,4.71,4.72,2.91,952.0
50%,20172.0,0.7,61.8,57.0,5.69,5.71,3.52,2397.0
75%,30258.0,1.04,62.5,59.0,6.54,6.54,4.03,5317.0
max,40344.0,5.01,79.0,95.0,10.74,58.9,8.06,18823.0


In [5]:
list(diamonds['cut'].unique())

['Premium', 'Ideal', 'Very Good', 'Fair', 'Good']

### Empezamos la limpieza de datos 

In [6]:
diamonds['cut'] = diamonds['cut'].replace({'Fair':0, 'Good':1, 'Very Good':2, 'Premium':3, 'Ideal':4})

In [7]:
clar_val = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
for n,val in enumerate(clar_val):
    diamonds['clarity'] = diamonds['clarity'].replace(val,n)

In [8]:
diamonds = pd.get_dummies(diamonds)

In [9]:
diamonds


Unnamed: 0,id,carat,cut,clarity,depth,table,x,y,z,price,color_D,color_E,color_F,color_G,color_H,color_I,color_J
0,0,0.78,3,4,61.5,58.0,5.93,5.98,3.66,3446,0,0,1,0,0,0,0
1,1,0.31,4,2,60.8,56.0,4.37,4.32,2.64,732,1,0,0,0,0,0,0
2,2,0.30,4,2,62.3,54.0,4.30,4.34,2.69,475,0,0,1,0,0,0,0
3,3,1.04,4,5,62.0,58.0,6.54,6.46,4.03,9552,0,1,0,0,0,0,0
4,4,0.65,4,2,61.4,55.0,5.58,5.62,3.44,1276,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40340,40340,0.50,4,4,61.9,56.0,5.09,5.12,3.16,1716,0,1,0,0,0,0,0
40341,40341,0.33,4,3,61.8,56.0,4.40,4.44,2.73,781,1,0,0,0,0,0,0
40342,40342,0.40,3,6,61.5,58.0,4.69,4.74,2.90,1123,0,0,0,1,0,0,0
40343,40343,1.06,4,4,61.2,57.0,6.59,6.56,4.03,5651,0,0,0,0,0,1,0


#### Creamos dos df para hacer nuestro modelo.   
- **X**: nos quedamos con todas las columnas del df menos 'id' y 'price' para entrenar a nuestro modelo
- **y**: esta será la columna de los precios, la realidad de lo que queremos predecir 

In [38]:
X = diamonds.drop(columns = ['id', 'price'])
y = diamonds['price']

### Normalizamos y estandarizamos nuestros datos de X1 y lo nombramos Xpr

In [39]:
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.pipeline import make_pipeline
pipeline = [
    StandardScaler(),
    Normalizer(),
]

tr = make_pipeline(*pipeline)

X_pr = tr.fit_transform(X)
X_pr = pd.DataFrame(X_pr,columns=X.columns)

In [40]:
X_pr.head()

Unnamed: 0,carat,cut,clarity,depth,table,x,y,z,color_D,color_E,color_F,color_G,color_H,color_I,color_J
0,-0.015023,0.033951,0.230242,-0.070417,0.099217,0.071483,0.085723,0.070692,-0.153041,-0.189095,0.86871,-0.207405,-0.171552,-0.135294,-0.093625
1,-0.257773,0.245897,-0.160316,-0.166762,-0.162646,-0.304121,-0.308012,-0.323888,0.659475,-0.11775,-0.11618,-0.129151,-0.106825,-0.084247,-0.0583
2,-0.268947,0.251402,-0.163904,0.098787,-0.395738,-0.326923,-0.310455,-0.312699,-0.097432,-0.120386,0.553055,-0.132042,-0.109217,-0.086133,-0.059605
3,0.16466,0.316155,0.379565,0.056503,0.079435,0.232506,0.202988,0.228185,-0.122527,0.686232,-0.149375,-0.166053,-0.137348,-0.108319,-0.074958
4,-0.065784,0.207209,-0.135092,-0.051746,-0.231614,-0.028403,-0.021062,-0.029774,-0.080305,-0.099224,-0.097901,-0.108831,-0.090018,-0.070992,0.908384


#### Separamos los datos de X_pr e y en el 80% para entrenar el modelo y el 20% para realizar el test 

In [41]:
#normalizado y estandarizado
X_pr_train, X_pr_test, y_pr_train, y_pr_test = train_test_split(X_pr, y, test_size=0.2)

## LinearRegression

#### Entrenamos el modelo LinearRegression con X1_train, y1_train

In [14]:
model_1 = LinearRegression()
model_1.fit(X_pr_train, y_pr_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#### Hacemos la predicción del precio (y1), para los datos de X1_test

In [43]:
y_pr_pred = model_1.predict(X_pr_test)
y_pr_pred

array([ 2378.21378776, 11560.95497144,  1702.76785111, ...,
        5108.92323612, 11242.89743418,  1185.68444325])

#### Comparamos la prediccion con la realidad 

In [45]:
result = pd.DataFrame({
    "y_pred":y_pr_pred,
    "gt":y_pr_test
})
result.head()

Unnamed: 0,y_pred,gt
20847,2378.213788,2453
5259,11560.954971,8709
37711,1702.767851,1142
17871,1767.629649,2348
37066,2184.720841,2789


In [46]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y1_test, y1_pred)

30321051.616682608

### Volvemos a entrenar el modelo pero con el 100% del csv diamonds_train (X_pr,  y)

In [47]:
model_2 = LinearRegression()
model_2.fit(X_pr, y)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Descargamos los datos de diamonds_test y los tratamos igual que los datos de diamonds_train. En este caso no tenemos la columna 'price'

In [48]:
diamonds_test = pd.read_csv('../Inputs/diamonds_test.csv')
diamonds_test_copy = diamonds_test.copy
diamonds_test['cut'] = diamonds_test['cut'].replace({'Fair':0, 'Good':1, 'Very Good':2, 'Premium':3, 'Ideal':4})

In [49]:

clar_val = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
for n,val in enumerate(clar_val):
    diamonds_test['clarity'] = diamonds_test['clarity'].replace(val,n)
diamonds_test = pd.get_dummies(diamonds_test)

diamonds_test.shape

(13449, 16)

In [50]:
X2 = diamonds_test.drop(columns='id')
X2_pr = tr.fit_transform(X2)
X2_pr = pd.DataFrame(X2_pr,columns=X2.columns)
X2_pr.head()

Unnamed: 0,carat,cut,clarity,depth,table,x,y,z,color_D,color_E,color_F,color_G,color_H,color_I,color_J
0,0.196873,0.027789,-0.383049,0.097582,0.0709,0.262929,0.239398,0.246147,-0.115173,-0.146035,-0.142526,-0.157329,0.718958,-0.10092,-0.073612
1,-0.167669,0.269837,-0.174395,0.144352,-0.059423,-0.162395,-0.156578,-0.134698,-0.10284,-0.130398,-0.127265,-0.140483,-0.116918,0.832922,-0.06573
2,0.516568,0.017873,-0.125612,0.021547,0.134,0.425156,0.418939,0.397514,-0.074074,-0.093922,-0.091666,0.384834,-0.084213,-0.064907,-0.047344
3,0.246851,0.025541,-0.179502,-0.342208,0.317815,0.309785,0.298803,0.23775,-0.105852,-0.134217,0.607053,-0.144597,-0.120342,-0.092754,-0.067655
4,-0.175092,0.327228,-0.211486,0.013148,-0.369731,-0.137473,-0.153951,-0.136196,-0.124713,-0.158132,0.715217,-0.170362,-0.141784,-0.10928,-0.07971


#### Hacemos a predicción del precio para X2pr (los datos de los diamantes del test )

In [51]:
y2_pr_pred = model_2.predict(X2_pr)

#### Añadimos nuestra predicción al df para que cuadren los ids. Nos quedamos solo con las dos columnas que nos interesan y lo pasamos a csv para subirlo a kaggle y ver el resultado 

In [52]:
diamonds_test['price']=y2_pr_pred

In [53]:
price_pred_1  = diamonds_test[['id','price']].set_index('id')

In [54]:
price_pred_1.to_csv('3_predict.csv', index=True)

## Randomforest Regressor

In [30]:
from sklearn.ensemble import RandomForestRegressor

In [55]:
# diamonds_train limpio sin normalizar ni escalar. Dividido en X1(features) e y1(ground truth).
X1 = diamonds.drop(columns = ['id', 'price'])
y1 = diamonds['price']

In [56]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2)

## Modelo_3 entrenado con el 80% de diamonds_train

In [57]:
model_3 =  RandomForestRegressor(n_estimators=50)
model_3.fit(X1_train, y1_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=50, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [58]:
y3_pred = model_3.predict(X1_test)
y3_pred

array([ 1035.44,  4426.48,  1075.82, ...,   982.38,  5160.36, 11528.78])

In [61]:
result = pd.DataFrame({
    "y_pred":y3_pred,
    "gt":y1_test
})
result.head()

Unnamed: 0,y_pred,gt
12415,1035.44,1050
39825,4426.48,4390
22327,1075.82,1115
16042,4933.38,4872
35745,1777.36,1723


## Modelo_4 entrenado con el 100% de diamonds_train

In [62]:
model_4 = RandomForestRegressor(n_estimators=50)

In [63]:
model_4.fit(X1, y1)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=50, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

### Aplicamos modelo_4 a diamonds_test sin normallizar ni estandarizar

In [65]:
X2.head()

Unnamed: 0,carat,cut,clarity,depth,table,x,y,z,color_D,color_E,color_F,color_G,color_H,color_I,color_J
0,1.1,3,1,62.2,58.0,6.69,6.6,4.13,0,0,0,0,1,0,0
1,0.51,4,2,62.5,57.0,5.07,5.1,3.18,0,0,0,0,0,1,0
2,2.03,3,2,61.9,59.0,8.14,8.09,5.02,0,0,0,1,0,0,0
3,1.21,3,2,60.0,60.0,6.96,6.91,4.16,0,0,1,0,0,0,0
4,0.55,4,2,61.8,55.0,5.27,5.22,3.24,0,0,1,0,0,0,0


In [66]:
y_pred_mod4 = model_4.predict(X2)

In [69]:
diamonds_test['price']=y_pred_mod4
price_pred_2  = diamonds_test[['id','price']].set_index('id')
price_pred_2.to_csv('4_predict.csv', index=True)

# GradientBoostingRegressor

In [71]:
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'ls'}
model_5 = ensemble.GradientBoostingRegressor(**params)

model_5.fit(X1_train, y1_train)
y5_pred = model_5.predict(X1_test)


In [72]:
from sklearn.metrics import mean_squared_error

In [74]:
mse = mean_squared_error(y1_test, y5_pred)
print("MSE: %.4f" % mse)

MSE: 476369.9052


### Entrenamos el modelo_5 con el 100% de d_train

In [77]:
model_6 = ensemble.GradientBoostingRegressor(**params)
model_6.fit(X1, y1)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.01, loss='ls', max_depth=4,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=500,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [78]:
y_pred_mod6 = model_6.predict(X2)

In [79]:
diamonds_test['price']=y_pred_mod6
price_pred_3  = diamonds_test[['id','price']].set_index('id')
price_pred_3.to_csv('6_predict.csv', index=True)

# Probamos a tratar los datos de otra forma
### Probamos eliminando algunas columnas

In [82]:
X_pr.corr()

Unnamed: 0,carat,cut,clarity,depth,table,x,y,z,color_D,color_E,color_F,color_G,color_H,color_I,color_J
carat,1.0,-0.131246,-0.299308,0.026642,0.174722,0.979011,0.977812,0.975435,-0.085474,-0.124199,-0.035231,-0.009006,0.108798,0.13789,0.141862
cut,-0.131246,1.0,0.158073,-0.219491,-0.422856,-0.113188,-0.115529,-0.139107,0.004457,-0.00017,-0.008127,0.04315,-0.016825,-0.0171,-0.042756
clarity,-0.299308,0.158073,1.0,-0.056163,-0.14596,-0.320383,-0.315191,-0.32189,-0.081518,-0.047616,0.005973,0.122585,-0.022093,0.002139,-0.018233
depth,0.026642,-0.219491,-0.056163,1.0,-0.257089,-0.023352,-0.025598,0.093842,-0.010419,-0.02263,-0.01343,0.00392,0.023663,0.021352,0.014872
table,0.174722,-0.422856,-0.14596,-0.257089,1.0,0.178694,0.17206,0.144652,-0.006916,0.009768,-0.000455,-0.036469,0.013021,0.015924,0.036619
x,0.979011,-0.113188,-0.320383,-0.023352,0.178694,1.0,0.997929,0.989919,-0.083326,-0.115471,-0.022409,-0.002458,0.098987,0.117512,0.11934
y,0.977812,-0.115529,-0.315191,-0.025598,0.17206,0.997929,1.0,0.989306,-0.083071,-0.115333,-0.021972,-0.002309,0.097418,0.118049,0.119589
z,0.975435,-0.139107,-0.32189,0.093842,0.144652,0.989919,0.989306,1.0,-0.083906,-0.117114,-0.023288,-0.001883,0.100008,0.11919,0.120367
color_D,-0.085474,0.004457,-0.081518,-0.010419,-0.006916,-0.083326,-0.083071,-0.083906,1.0,-0.184645,-0.179505,-0.208632,-0.148385,-0.097733,-0.049481
color_E,-0.124199,-0.00017,-0.047616,-0.02263,0.009768,-0.115471,-0.115333,-0.117114,-0.184645,1.0,-0.246606,-0.284178,-0.208808,-0.144855,-0.083196


In [87]:
X_pr_drop = X_pr.drop(columns=['x','y'])

In [88]:
X_prdrop_train, X_prdrop_test, y_prdrop_train, y_prdrop_test = train_test_split(X_pr_drop, y, test_size=0.2)

In [90]:
model_7 = LinearRegression()
model_7.fit(X_prdrop_train, y_prdrop_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [91]:
y_prdrop_pred = model_7.predict(X_prdrop_test)
y_prdrop_pred

array([ 2042.04679129,   737.7502436 ,  1107.16378953, ...,
        5809.07563361, 14580.25151472,  1548.45324385])

In [93]:
mse7 = mean_squared_error(y_prdrop_test, y_prdrop_pred)
print("MSE: %.4f" % mse7)

MSE: 1572142.1735


In [94]:
model_8 = LinearRegression()
model_8.fit(X_pr_drop, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [96]:
y_pred_mod8 = model_8.predict(X2_pr.drop(columns=['x','y']))

In [97]:
diamonds_test['price']=y_pred_mod8
price_pred_4  = diamonds_test[['id','price']].set_index('id')
price_pred_4.to_csv('8_predict.csv', index=True)

## Eliminamos las columnas 'x' e 'y' del dataset original sin norm ni stand
### { de momento el que mejor funciona}

In [98]:
X3 = diamonds.drop(columns = ['id', 'price','x','y'])
y3 = diamonds['price']

In [99]:
model_9 = model_4.fit(X3, y3)

In [100]:
y9_pred = model_9.predict(X2.drop(columns=['x','y']))

In [101]:
diamonds_test['price']=y9_pred
price_pred_5  = diamonds_test[['id','price']].set_index('id')
price_pred_5.to_csv('9_predict.csv', index=True)

### probamos lo mismo pero con gradientboost

In [102]:
model_10 = model_6.fit(X3, y3)

In [103]:
y10_pred = model_10.predict(X2.drop(columns=['x','y']))

In [104]:
diamonds_test['price']=y10_pred
price_pred_6  = diamonds_test[['id','price']].set_index('id')
price_pred_6.to_csv('10_predict.csv', index=True)

## probamos Random Forest con otros parametros y mismo train y test
### {funciona un poco mejor}


In [105]:
model_11 = RandomForestRegressor(n_estimators = 1000, random_state = 42)

In [106]:
model_11.fit(X3, y3)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=1000, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

In [107]:
y11_pred = model_11.predict(X2.drop(columns=['x','y']))

In [108]:
diamonds_test['price']=y11_pred
price_pred_7  = diamonds_test[['id','price']].set_index('id')
price_pred_7.to_csv('11_predict.csv', index=True)

## Probamoos RFR eliminando las columnas 'table' y 'cut'
### {no funciona mejor }

In [109]:
importances = list(model_11.feature_importances_)

In [113]:
feature_list = list(X3.columns)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances]

Variable: carat                Importance: 0.88
Variable: clarity              Importance: 0.07
Variable: depth                Importance: 0.01
Variable: z                    Importance: 0.01
Variable: color_H              Importance: 0.01
Variable: color_I              Importance: 0.01
Variable: color_J              Importance: 0.01
Variable: cut                  Importance: 0.0
Variable: table                Importance: 0.0
Variable: color_D              Importance: 0.0
Variable: color_E              Importance: 0.0
Variable: color_F              Importance: 0.0
Variable: color_G              Importance: 0.0


[None, None, None, None, None, None, None, None, None, None, None, None, None]

In [115]:
X4= X3.drop(columns=['table','cut'])

In [117]:
model_12=model_11.fit(X4, y3)

In [118]:
y12_pred = model_12.predict(X2.drop(columns=['x','y','table','cut']))

In [119]:
diamonds_test['price']=y12_pred
price_pred_8  = diamonds_test[['id','price']].set_index('id')
price_pred_8.to_csv('12_predict.csv', index=True)

## Probamos otra vez con RFR esta vez sin la columna 'z'
### {Tampoco funciona mejor}


In [121]:
X5=X3.drop(columns='z')

In [122]:
model_13 = model_11.fit(X5, y3)

In [123]:
y13_pred = model_13.predict(X2.drop(columns=['x','y','z']))

In [124]:
diamonds_test['price']=y13_pred
price_pred_9  = diamonds_test[['id','price']].set_index('id')
price_pred_9.to_csv('13_predict.csv', index=True)

## Modelo KNeighbors Regressor
### {Funciona muy mal }

In [125]:
from sklearn.neighbors import KNeighborsRegressor
model_14 = KNeighborsRegressor(n_neighbors=2)
model_14.fit(X3, y3)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=2, p=2,
                    weights='uniform')

In [126]:
y14_pred = model_14.predict(X2.drop(columns=['x','y']))

In [127]:
diamonds_test['price']=y14_pred
price_pred_10  = diamonds_test[['id','price']].set_index('id')
price_pred_10.to_csv('14_predict.csv', index=True)