![image info](https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/images/banner_1.png)

# Proyecto 1 - Predicción de precios de vehículos usados

En este proyecto podrán poner en práctica sus conocimientos sobre modelos predictivos basados en árboles y ensambles, y sobre la disponibilización de modelos. Para su desasrrollo tengan en cuenta las instrucciones dadas en la "Guía del proyecto 1: Predicción de precios de vehículos usados".

**Entrega**: La entrega del proyecto deberán realizarla durante la semana 4. Sin embargo, es importante que avancen en la semana 3 en el modelado del problema y en parte del informe, tal y como se les indicó en la guía.

Para hacer la entrega, deberán adjuntar el informe autocontenido en PDF a la actividad de entrega del proyecto que encontrarán en la semana 4, y subir el archivo de predicciones a la [competencia de Kaggle](https://www.kaggle.com/t/b8be43cf89c540bfaf3831f2c8506614).

## Datos para la predicción de precios de vehículos usados

En este proyecto se usará el conjunto de datos de Car Listings de Kaggle, donde cada observación representa el precio de un automóvil teniendo en cuenta distintas variables como: año, marca, modelo, entre otras. El objetivo es predecir el precio del automóvil. Para más detalles puede visitar el siguiente enlace: [datos](https://www.kaggle.com/jpayne/852k-used-car-listings).

## Ejemplo predicción conjunto de test para envío a Kaggle

En esta sección encontrarán el formato en el que deben guardar los resultados de la predicción para que puedan subirlos a la competencia en Kaggle.

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Importación librerías
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [57]:
# Carga de datos de archivo .csv
dataTraining = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTrain_carListings.zip')
dataTesting = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTest_carListings.zip', index_col=0)

In [58]:
# Visualización datos de entrenamiento
dataTraining.head(20)


Unnamed: 0,Price,Year,Mileage,State,Make,Model
0,34995,2017,9913,FL,Jeep,Wrangler
1,37895,2015,20578,OH,Chevrolet,Tahoe4WD
2,18430,2012,83716,TX,BMW,X5AWD
3,24681,2014,28729,OH,Cadillac,SRXLuxury
4,26998,2013,64032,CO,Jeep,Wrangler
5,40495,2016,9987,ME,BMW,3
6,33995,2015,12961,WA,Mercedes-Benz,C-ClassC300
7,21995,2014,6480,CT,Toyota,CamryL
8,28591,2014,29165,CA,Toyota,TacomaPreRunner
9,10895,2008,70005,CA,Buick,LaCrosse4dr


In [7]:
dataTraining.describe()

Unnamed: 0,Price,Year,Mileage
count,400000.0,400000.0,400000.0
mean,21146.919312,2013.198125,55072.96
std,10753.66494,3.292326,40881.02
min,5001.0,1997.0,5.0
25%,13499.0,2012.0,25841.0
50%,18450.0,2014.0,42955.0
75%,26999.0,2016.0,77433.0
max,79999.0,2018.0,2457832.0


In [8]:
# counting the duplicates
dups = dataTraining.pivot_table(index = ['Year','Mileage','State','Make','Model'], aggfunc ='size')
# displaying the duplicate Series
print(dups)



Year  Mileage  State  Make       Model      
1997  400       NV    Porsche    Boxster2dr     1
      3821      WA    Subaru     Impreza        1
      7145      WI    Chevrolet  Camaro2dr      1
      23746     WA    Porsche    911            1
      29430     MO    Chevrolet  Corvette2dr    1
                                               ..
2018  11553     AL    Chevrolet  EquinoxFWD     1
      11592     MS    Chevrolet  EquinoxFWD     1
      11634     MN    Chevrolet  EquinoxAWD     1
      14526     OH    Chevrolet  EquinoxFWD     1
      14645     MS    Chevrolet  EquinoxAWD     1
Length: 399523, dtype: int64


In [9]:
# Filtrar duplicados que aparecen más de dos veces
dups_mask = dups > 2
dups_filtered = dups[dups_mask]

# Mostrar los duplicados filtrados
print(dups_filtered)


Year  Mileage  State  Make           Model          
2016  5         CA    Chevrolet      Silverado           6
                NC    Kia            OptimaLX            5
      10        UT    Ford           Super               5
      11        CA    Kia            Forte               4
                FL    Mercedes-Benz  Sprinter            3
      12        IL    Ford           TaurusSEL           4
                WA    Mercedes-Benz  C-ClassC300         3
      15        CA    Chevrolet      Silverado           5
      36        AZ    Volvo          S60T5               3
      150       NJ    Ford           Expedition4WD       3
      5000      SC    Chrysler       200Limited          4
2017  5         CA    Chevrolet      Silverado           4
      6         VA    Toyota         Tacoma2WD           3
      7         IL    Toyota         SequoiaPlatinum     4
      8         FL    Toyota         CamrySE             9
      9         TX    Jeep           Wrangler            4
   

In [10]:
# Visualización datos de test
dataTesting.head()

Unnamed: 0_level_0,Year,Mileage,State,Make,Model
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,2014,31909,MD,Nissan,MuranoAWD
1,2017,5362,FL,Jeep,Wrangler
2,2014,50300,OH,Ford,FlexLimited
3,2004,132160,WA,BMW,5
4,2015,25226,MA,Jeep,Grand


In [11]:
# Predicción del conjunto de test - acá se genera un número aleatorio como ejemplo
np.random.seed(42)
y_pred = pd.DataFrame(np.random.rand(dataTesting.shape[0]) * 75000 + 5000, index=dataTesting.index, columns=['Price'])

In [12]:
# Guardar predicciones en formato exigido en la competencia de kaggle
y_pred.to_csv('test_submission.csv', index_label='ID')
y_pred.head()

Unnamed: 0_level_0,Price
ID,Unnamed: 1_level_1
0,33090.508914
1,76303.572981
2,59899.545636
3,49899.386315
4,16701.398033


## Preprocesamiento de datos

In [13]:
dataTraining['Make'].unique()

array(['Jeep', 'Chevrolet', 'BMW', 'Cadillac', 'Mercedes-Benz', 'Toyota',
       'Buick', 'Dodge', 'Volkswagen', 'GMC', 'Ford', 'Hyundai',
       'Mitsubishi', 'Honda', 'Nissan', 'Mazda', 'Volvo', 'Kia', 'Subaru',
       'Chrysler', 'INFINITI', 'Land', 'Porsche', 'Lexus', 'MINI',
       'Lincoln', 'Audi', 'Ram', 'Mercury', 'Tesla', 'FIAT', 'Acura',
       'Scion', 'Pontiac', 'Jaguar', 'Bentley', 'Suzuki', 'Freightliner'],
      dtype=object)

In [14]:
dataTraining['Model'].unique()

array(['Wrangler', 'Tahoe4WD', 'X5AWD', 'SRXLuxury', '3', 'C-ClassC300',
       'CamryL', 'TacomaPreRunner', 'LaCrosse4dr', 'ChargerSXT',
       'CamryLE', 'Jetta', 'AcadiaFWD', 'EscapeSE', 'SonataLimited',
       'Santa', 'Outlander', 'CruzeSedan', 'Civic', 'CorollaL', '350Z2dr',
       'EdgeSEL', 'F-1502WD', 'FocusSE', 'PatriotSport', 'Accord',
       'MustangGT', 'FusionHybrid', 'ColoradoCrew', 'Wrangler4WD',
       'CR-VEX-L', 'CTS', 'CherokeeLimited', 'Yukon', 'Elantra', 'New',
       'CorollaLE', 'Canyon4WD', 'Golf', 'Sonata4dr', 'Elantra4dr',
       'PatriotLatitude', 'Mazda35dr', 'Tacoma2WD', 'Corolla4dr',
       'Silverado', 'TerrainFWD', 'EscapeFWD', 'Grand', 'RAV4FWD',
       'Liberty4WD', 'FocusTitanium', 'DurangoAWD', 'S60T5', 'CivicLX',
       'MuranoAWD', 'ForteEX', 'TraverseAWD', 'CamaroConvertible',
       'Sportage2WD', 'Pathfinder4WD', 'Highlander4dr', 'WRXSTI', 'Ram',
       'F-150XLT', 'SiennaXLE', 'LaCrosseFWD', 'RogueFWD', 'CamaroCoupe',
       'JourneySXT', 'Acc

In [15]:
dataTraining['State'].unique()

array([' FL', ' OH', ' TX', ' CO', ' ME', ' WA', ' CT', ' CA', ' LA',
       ' NY', ' PA', ' SC', ' ND', ' NC', ' GA', ' AZ', ' TN', ' KY',
       ' NJ', ' UT', ' IA', ' AL', ' NE', ' IL', ' OK', ' MD', ' NV',
       ' WV', ' MI', ' VA', ' WI', ' MA', ' OR', ' IN', ' NM', ' MO',
       ' HI', ' KS', ' AR', ' MN', ' MS', ' MT', ' AK', ' VT', ' SD',
       ' NH', ' DE', ' ID', ' RI', ' WY', ' DC'], dtype=object)

In [16]:
print(dataTraining.describe)

<bound method NDFrame.describe of         Price  Year  Mileage State        Make           Model
0       34995  2017     9913    FL        Jeep        Wrangler
1       37895  2015    20578    OH   Chevrolet        Tahoe4WD
2       18430  2012    83716    TX         BMW           X5AWD
3       24681  2014    28729    OH    Cadillac       SRXLuxury
4       26998  2013    64032    CO        Jeep        Wrangler
...       ...   ...      ...   ...         ...             ...
399995  29900  2015    25287    TX       Lexus            RXRX
399996  17688  2015    17677    MI   Chevrolet      EquinoxFWD
399997  24907  2014    66688    NC       Buick  EnclaveLeather
399998  11498  2014    37872    IN  Volkswagen           Jetta
399999  16900  2014    78606    CO      Nissan     PathfinderS

[400000 rows x 6 columns]>


In [39]:
dataTraining.isnull().values.any()

False

In [52]:
dataTraining['Model'].value_counts()

Silverado          18085
Grand              12344
Sierra              8409
Accord              7357
F-1504WD            6684
                   ...  
PathfinderSE          53
Galant4dr             53
SLK-ClassSLK350       52
Monte                 48
RX-84dr               48
Name: Model, Length: 525, dtype: int64

In [53]:
# Create arrary of categorial variables to be encoded
categorical_cols = ['State', 'Make', 'Model']
le = LabelEncoder()
# apply label encoder on categorical feature columns
dataTraining[categorical_cols] = dataTraining[categorical_cols].apply(lambda col: le.fit_transform(col))

In [54]:
Y = (dataTraining.Price)
X = dataTraining.drop(['Price'], axis=1)

In [55]:
X

Unnamed: 0,Year,Mileage,State,Make,Model
0,2017,9913,9,17,489
1,2015,20578,35,6,448
2,2012,83716,43,2,499
3,2014,28729,35,5,398
4,2013,64032,5,17,489
...,...,...,...,...,...
399995,2015,25287,43,20,377
399996,2015,17677,22,6,158
399997,2014,66688,27,4,154
399998,2014,37872,15,36,264


In [47]:
# Separación de datos en set de entrenamiento y test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.95, random_state=42)

## Calibracion del modelo XGBoost

In [48]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

In [24]:
#Find optimal number of trees
learning_range = [0.01, 0.05, 0.1, 0.15, 0.20, 0.25, 0.3]
for val in learning_range:
    score = cross_val_score(XGBRegressor(learning_rate= val, random_state= 42), X, Y, scoring="neg_root_mean_squared_error")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')


KeyboardInterrupt: 

In [None]:
#Find optimal gamma
gamma_range = [0, 0.25, 0.5, 0.75, 1]
for val in gamma_range:
    score = cross_val_score(XGBRegressor(gamma= val, random_state= 42), X, Y,scoring="neg_root_mean_squared_error")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')

In [None]:
#Find optimal gamma
col_range = [0, 0.25, 0.5, 0.75, 1,1.25,1.5]
for val in col_range:
    score = cross_val_score(XGBRegressor(colsample_bytree= val, random_state= 42), X_train, y_train,scoring="neg_root_mean_squared_error")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')

In [None]:
learning_range = [0.1,0.2,0.3,0.40, 0.50, 0.60, 0.70,0.8,0.9,1]
for val in learning_range:
    score = cross_val_score(XGBRegressor(learning_rate= val, random_state= 42), X_train, y_train, scoring="neg_root_mean_squared_error")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')

In [None]:
gamma_range = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
for val in gamma_range:
    score = cross_val_score(XGBRegressor(learning_rate= 0.3, colsample_bytree= 0.75,gamma= val, random_state= 42), X_test, y_test,scoring="neg_root_mean_squared_error")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')

In [None]:
n_estimator = [100,200,300,400,500,600,700,800,900]
for val in n_estimator:
    score = cross_val_score(XGBRegressor(learning_rate= 0.3, colsample_bytree= 0.75,n_estimators= val, random_state= 42), X_test, y_test,scoring="neg_root_mean_squared_error")
    print(f'Average score({val}): {"{:.3f}".format(score.mean())}')

In [49]:
cross_val_score(XGBRegressor(learning_rate= 0.3, colsample_bytree= 0.75,n_estimators= 2000,random_state= 42,), X_train, y_train, scoring="neg_root_mean_squared_error").mean()

nan

## Entrenamiento del modelo

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [27]:
# Parametros
params = {"booster":"gbtree", "max_depth": 4, "eta": .3, "objective": "binary:logistic", "nthread":10}

In [28]:
xgb = XGBRegressor( learning_rate= 0.3, colsample_bytree= 0.75,n_estimators= 600, random_state= 42).fit(X_train,y_train)
# Evaluar modelo con los datos de prueba

In [51]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, accuracy_score , f1_score,r2_score
y_pred_xgb = xgb.predict(X_test)
mse = mean_squared_error(y_test, y_pred_xgb)
maeXGB= mean_absolute_error(y_test, y_pred_xgb)
r2scorexgb = r2_score(y_test, y_pred_xgb)
#accuracyXGB = accuracy_score(y_test, y_pred_xgb)
#f1ScoreXGB = f1_score(y_test, y_pred_xgb)

# Imprimir resultado
print(f'El MSE para el modelo Random Forest es: {mse}')
print(f'El MAE para el modelo Random Forest es: {maeXGB}')
print(f'El R2-Score para el modelo XGBoost es: {r2scorexgb}')
#print(f'El Accuracy para el modelo XGBoost es: {accuracyXGB}')
#print(f'El F1-Score para el modelo XGBoost es: {f1ScoreXGB}')


AttributeError: module 'xgboost' has no attribute 'predict'

In [31]:
# Guardar el modelo en un archivo pickle
import joblib
joblib.dump(xgb,"model_deployment_Proyecto1/VehiclePricePrediction.pkl")


['model_deployment_Proyecto1/VehiclePricePrediction.pkl']

In [36]:
X_test

Unnamed: 0,Year,Mileage,State,Make,Model
23218,2014,42534,32,14,150
20731,2010,99603,47,12,416
39555,2009,42352,37,8,248
147506,2006,86386,47,27,297
314215,2014,39154,37,2,29
...,...,...,...,...,...
162603,2015,25789,20,13,321
327859,2016,28500,18,35,444
1519,2015,36641,48,7,467
398253,2016,43401,3,35,89


In [None]:
from sklearn import metrics
import matplotlib.pyplot as plt

fpr, tpr, thresholds = metrics.roc_curve(y_test,y_pred_xgb ,pos_label=1)
AUC= metrics.auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % AUC)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1])
plt.ylim([0, 1.05])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Guardar predicciones en formato exigido en la competencia de kaggle
y_pred_xgb

In [None]:
dataTesting[categorical_cols] = dataTesting[categorical_cols].apply(lambda col: le.fit_transform(col))

In [None]:
y_pred_uno = xgb.predict(dataTesting)

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Carga de datos de archivo .csv
dataTraining = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTrain_carListings.zip')
dataTesting = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTest_carListings.zip', index_col=0)

# Selección de características relevantes
features = ['Year','Mileage','State','Make','Model']

# Preprocesamiento de los datos
X = pd.get_dummies(dataTraining[features])
y = dataTraining['Price']
X = X.fillna(X.mean())

# Separación del conjunto de entrenamiento y validación
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

# Entrenamiento del modelo de regresión lineal
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluación del modelo en el conjunto de validación
y_pred = model.predict(X_val)
mse = mean_squared_error(y_val, y_pred)
r2 = model.score(X_val, y_val)
print('MSE:', mse)
print('R^2:', r2)

# Predicción del conjunto de prueba

X_test = pd.get_dummies(dataTesting[features])
X_test = X_test.fillna(X_test.mean())
y_test = model.predict(X_test)

# Guardar predicciones en formato exigido en la competencia de kaggle
submission = pd.DataFrame({'ID': dataTesting.index, 'Price': y_test})
submission.to_csv('test_submission.csv', index=False)

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb

# Carga de datos de archivo .csv
data = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTrain_carListings.zip')

# Preprocesamiento de los datos
X = data.drop('Price', axis=1)
# Selección de características relevantes
features = ['Year','Mileage','State','Make','Model']

# Preprocesamiento de los datos
X = pd.get_dummies(dataTraining[features])
y = dataTraining['Price']
X = X.fillna(X.mean())

# Separación de datos en entrenamiento y prueba
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Definición de la cuadrícula de hiperparámetros a buscar
param_grid = {
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'n_estimators': [50, 100, 200],
    'subsample': [0.5, 0.75, 1.0],
    'colsample_bytree': [0.5, 0.75, 1.0]
}

# Definición del modelo XGBoost
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

# Búsqueda de los mejores hiperparámetros utilizando validación cruzada
grid_search = GridSearchCV(xgb_model, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Predicción del conjunto de prueba con el mejor modelo encontrado
y_pred = grid_search.predict(X_test)

# Evaluación del modelo
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print('RMSE:', rmse)

from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
print('R2:', r2)
print('mae:', mae)
print('mape:', mape)


In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Carga de datos de archivo .csv
data = pd.read_csv('https://raw.githubusercontent.com/davidzarruk/MIAD_ML_NLP_2023/main/datasets/dataTrain_carListings.zip')

# Preprocesamiento de los datos
X = data.drop('Price', axis=1)
y = data['Price']
le = LabelEncoder()
X['Brand'] = le.fit_transform(X['Brand'])

# Separación de datos en entrenamiento y prueba
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Definición del modelo y entrenamiento
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

# Predicción del conjunto de prueba
y_pred = tree.predict(X_test)

# Evaluación del modelo
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print('RMSE:', rmse)
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)