Hello **Nelson**!

I'm **Patricio Requena** 👋. It's a pleasure to be your project reviewer today!

I will review your project carefully to help you improve and refine your skills. During the review, I will identify areas where your code can be improved, pointing out specifically what you could adjust, and how, to optimize the performance and clarity of your project. It is also important to me to highlight what you have handled exceptionally well. Recognizing your strengths will help you understand which techniques and methods are working in your favor and how you can apply them to future tasks.

_**Remember that you will find a general comment from me at the end of this notebook**_. Let's get started!

You will find my comments in green, yellow, or red boxes. ⚠️ **Please do not move, modify, or delete my comments** ⚠️:


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class=“tocSkip”></a>
If everything is correct.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class=“tocSkip”></a>
If your code works but can be improved, or a detail is missing.
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class=“tocSkip”></a>
If something is missing or there is a problem with your code or conclusions.
</div>

You can reply to me like this:
<div class="alert alert-block alert-info">
<b>Student's reply</b> <a class=“tocSkip”></a>
</div>

# Introduction

The used car sales service **Rusty Bargain** is developing an application to attract new customers. With this app, users can quickly determine the market value of their car. The main objective is to create a model that determines that market value. The KPIs of the model will be:
- Prediction quality  
- Prediction speed  
- Training time

### Dataset Description
The dataset is stored in the /datasets/car_data.csv file and contains information about used cars, including their technical specifications, history, and selling price.

**Dataset Features:**
- DateCrawled – Date the profile was downloaded from the database.
- VehicleType – Type of vehicle body.
- RegistrationYear – Year the vehicle was registered.
- Gearbox – Type of transmission.
- Power – Vehicle power (in horsepower, HP).
- Model – Vehicle model.
- Mileage – Mileage (measured in km according to the dataset's regional specifics).
- RegistrationMonth – Month the vehicle was registered.
- FuelType – Type of fuel.
- Brand – Vehicle brand.
- NotRepaired – Indicates whether the vehicle has been repaired or not.
- DateCreated – Date the profile was created.
- NumberOfPictures – Number of vehicle photos.
- PostalCode – Postal code of the profile owner (user).
- LastSeen – Date the user was last active.

**Target Variable:**
- Price – Vehicle price (in euros).

This dataset will be used to train a model to predict the market value of used cars, optimizing prediction quality, prediction speed, and training time.

## 1. Data exploration and preprocessing

### 1.1. Library imports

In [1]:
import numpy as np
import pandas as pd
import time

import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

### 1.2. Data loading

In [2]:
df = pd.read_csv('/Users/tomaster/Documents/GitHub/TT_DataScience/Data-Science-TT-Bootcamp/sprint_13_NumericMethods/Project/car_data.csv')

df.columns = df.columns.str.lower()

df.sample(10)


Unnamed: 0,datecrawled,price,vehicletype,registrationyear,gearbox,power,model,mileage,registrationmonth,fueltype,brand,notrepaired,datecreated,numberofpictures,postalcode,lastseen
241009,29/03/2016 13:44,7800,wagon,2005,manual,143,c_klasse,80000,1,petrol,mercedes_benz,no,29/03/2016 00:00,0,55237,05/04/2016 22:46
267941,09/03/2016 12:39,7350,coupe,2011,auto,110,megane,150000,1,gasoline,renault,no,09/03/2016 00:00,0,47799,29/03/2016 10:46
47528,15/03/2016 10:52,1750,sedan,1992,manual,90,80,150000,3,petrol,audi,no,15/03/2016 00:00,0,36269,01/04/2016 22:47
69207,09/03/2016 19:46,6200,small,2007,auto,4,,70000,1,gasoline,sonstige_autos,no,09/03/2016 00:00,0,83735,10/03/2016 16:16
300173,14/03/2016 21:44,990,wagon,2000,manual,0,focus,150000,6,petrol,ford,no,14/03/2016 00:00,0,49176,21/03/2016 15:15
59245,27/03/2016 05:36,12500,coupe,2005,auto,0,other,150000,8,petrol,mercedes_benz,no,27/03/2016 00:00,0,46236,07/04/2016 04:16
82715,31/03/2016 21:58,8500,wagon,2009,manual,102,golf,50000,6,petrol,volkswagen,no,31/03/2016 00:00,0,35510,06/04/2016 15:17
152169,07/03/2016 12:48,6250,sedan,2004,manual,0,a4,150000,6,gasoline,audi,,07/03/2016 00:00,0,21073,08/03/2016 13:49
342134,15/03/2016 03:02,13750,sedan,2011,manual,122,passat,70000,9,petrol,volkswagen,no,15/03/2016 00:00,0,58791,05/04/2016 15:47
87449,04/04/2016 18:50,5200,convertible,1984,manual,86,other,150000,4,petrol,toyota,no,04/04/2016 00:00,0,24797,04/04/2016 18:50


<div class="alert alert-block alert-warning">
<b>Reviewer's comment (1st iteration)</b> <a class=“tocSkip”></a>

Be careful with absolute file paths. When you work on a team with other data scientists, your notebooks need to be reproducible, and an absolute path on **your computer** will raise an execution error for everyone else, since not everyone keeps the files in the same place. One option is to use the `os` library to build the path to the file, or to use something simpler such as `/datasets/file.csv`.
</div>
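
One way to follow this suggestion is to build the path from a configurable base directory instead of a user-specific absolute path. The sketch below is only illustrative: the `DATA_DIR` environment variable and the `resolve_dataset_path` helper are hypothetical names, and the `/datasets` default is taken from the dataset description.

```python
import os
from pathlib import Path


def resolve_dataset_path(filename, default_dir="/datasets"):
    """Return the dataset path, preferring a directory set via DATA_DIR.

    DATA_DIR is a hypothetical environment variable used for illustration;
    the default directory comes from the dataset description.
    """
    base_dir = Path(os.environ.get("DATA_DIR", default_dir))
    return base_dir / filename


csv_path = resolve_dataset_path("car_data.csv")
print(csv_path)
# The notebook could then load the data with: pd.read_csv(csv_path)
```

With this approach each collaborator only sets the base directory once, and the notebook code itself stays unchanged.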

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   datecrawled        354369 non-null  object
 1   price              354369 non-null  int64 
 2   vehicletype        316879 non-null  object
 3   registrationyear   354369 non-null  int64 
 4   gearbox            334536 non-null  object
 5   power              354369 non-null  int64 
 6   model              334664 non-null  object
 7   mileage            354369 non-null  int64 
 8   registrationmonth  354369 non-null  int64 
 9   fueltype           321474 non-null  object
 10  brand              354369 non-null  object
 11  notrepaired        283215 non-null  object
 12  datecreated        354369 non-null  object
 13  numberofpictures   354369 non-null  int64 
 14  postalcode         354369 non-null  int64 
 15  lastseen           354369 non-null  object
dtypes: int64(7), object(9)

In [4]:

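# Parse the three timestamp columns (day/month/year hour:minute)
df["datecrawled"] = pd.to_datetime(df["datecrawled"], format="%d/%m/%Y %H:%M")
df["datecreated"] = pd.to_datetime(df["datecreated"], format="%d/%m/%Y %H:%M")
df["lastseen"] = pd.to_datetime(df["lastseen"], format="%d/%m/%Y %H:%M")

# Cast the categorical columns to str
# (in this run, missing values end up counted as non-null; see df.info() below)
df['vehicletype'] = df['vehicletype'].astype(str)
df['gearbox'] = df['gearbox'].astype(str)
df['model'] = df['model'].astype(str)

df['fueltype'] = df['fueltype'].astype(str)
df['brand'] = df['brand'].astype(str)
# Map "yes"/"no" to booleans (note: astype(bool) converts missing values to True)
df["notrepaired"] = df["notrepaired"].map({"yes": True, "no": False}).astype(bool)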



df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   datecrawled        354369 non-null  datetime64[ns]
 1   price              354369 non-null  int64         
 2   vehicletype        354369 non-null  object        
 3   registrationyear   354369 non-null  int64         
 4   gearbox            354369 non-null  object        
 5   power              354369 non-null  int64         
 6   model              354369 non-null  object        
 7   mileage            354369 non-null  int64         
 8   registrationmonth  354369 non-null  int64         
 9   fueltype           354369 non-null  object        
 10  brand              354369 non-null  object        
 11  notrepaired        354369 non-null  bool          
 12  datecreated        354369 non-null  datetime64[ns]
 13  numberofpictures   354369 non-null  int64   

Although there are duplicate values, for now, the columns will remain as they are.
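
The second `df.info()` output reports every column as fully non-null, although the first one showed missing entries in `vehicletype`, `gearbox`, `model`, `fueltype` and `notrepaired`. The casts hide those gaps rather than resolve them. The sketch below uses toy data (not rows from `car_data.csv`) to show what happens to a missing `notrepaired` value, and one possible alternative; the `"unknown"` label is an assumed placeholder, not something the project prescribes.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (illustrative data only)
raw = pd.Series(["yes", "no", np.nan], dtype=object)
print(raw.isna().sum())  # number of missing entries before any cast

# Same pattern as the notebook cell: the unmapped NaN silently becomes True
as_bool = raw.map({"yes": True, "no": False}).astype(bool)
print(as_bool.tolist())

# One possible alternative: keep missing entries as an explicit category
# ("unknown" is an assumed placeholder label)
explicit = raw.fillna("unknown")
print(explicit.value_counts().to_dict())
```

Keeping an explicit category lets the model treat "not reported" separately from "yes" and "no", instead of merging it with one of them.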

### 1.3. Data descriptive exploration

In [5]:
df.describe().round()

Unnamed: 0,datecrawled,price,registrationyear,power,mileage,registrationmonth,datecreated,numberofpictures,postalcode,lastseen
count,354369,354369.0,354369.0,354369.0,354369.0,354369.0,354369,354369.0,354369.0,354369
mean,2016-03-21 12:57:41.165057280,4417.0,2004.0,110.0,128211.0,6.0,2016-03-20 19:12:07.753274112,0.0,50509.0,2016-03-29 23:50:30.593703680
min,2016-03-05 14:06:00,0.0,1000.0,0.0,5000.0,0.0,2014-03-10 00:00:00,0.0,1067.0,2016-03-05 14:15:00
25%,2016-03-13 11:52:00,1050.0,1999.0,69.0,125000.0,3.0,2016-03-13 00:00:00,0.0,30165.0,2016-03-23 02:50:00
50%,2016-03-21 17:50:00,2700.0,2003.0,105.0,150000.0,6.0,2016-03-21 00:00:00,0.0,49413.0,2016-04-03 15:15:00
75%,2016-03-29 14:37:00,6400.0,2008.0,143.0,150000.0,9.0,2016-03-29 00:00:00,0.0,71083.0,2016-04-06 10:15:00
max,2016-04-07 14:36:00,20000.0,9999.0,20000.0,150000.0,12.0,2016-04-07 00:00:00,0.0,99998.0,2016-04-07 14:58:00
std,,4514.0,90.0,190.0,37905.0,4.0,,0.0,25783.0,


<div class="alert alert-block alert-success">
<b>Reviewer's comment (1st iteration)</b> <a class=“tocSkip”></a>

Very good work on the data processing and analysis. In any project, the important thing is to understand the data you will be working with before moving on to modeling.
</div>

## 2. Model Training

For this section, we are going to try different models in order to choose the best model configuration. The models we are testing are:

- Linear Regression 
- Decision Tree
- Random Forest 

The selection criteria are:

- Prediction quality  
- Prediction speed  
- Training time

In [6]:
features = [
    "datecrawled", "vehicletype", "registrationyear", "gearbox",
    "power", "model", "mileage", "registrationmonth", "fueltype",
    "brand", "notrepaired", "datecreated", "numberofpictures",
    "postalcode", "lastseen"
]

target = "price"  # a single column name yields a 1-D Series, the shape scikit-learn expects for y




# Split dataset
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for col in ["datecrawled", "datecreated", "lastseen"]:
    X_train[col + "_year"] = X_train[col].dt.year
    X_train[col + "_month"] = X_train[col].dt.month
    X_train[col + "_day"] = X_train[col].dt.day

    X_test[col + "_year"] = X_test[col].dt.year
    X_test[col + "_month"] = X_test[col].dt.month
    X_test[col + "_day"] = X_test[col].dt.day

X_train = X_train.drop(columns=["datecrawled", "datecreated", "lastseen"], errors="ignore")
X_test = X_test.drop(columns=["datecrawled", "datecreated", "lastseen"], errors="ignore")


encoder = OneHotEncoder()
X_train_encoded = encoder.fit_transform(X_train.select_dtypes(include=["object"]))
X_test_encoded = encoder.transform(X_test.select_dtypes(include=["object"]))

# Convert to DataFrame
X_train_encoded = pd.DataFrame(X_train_encoded, index=X_train.index)
X_test_encoded = pd.DataFrame(X_test_encoded, index=X_test.index)



# Join with the numeric columns
X_train = pd.concat([X_train.select_dtypes(exclude=["object"]).reset_index(drop=True), X_train_encoded.reset_index(drop=True)], axis=1)
X_test = pd.concat([X_test.select_dtypes(exclude=["object"]).reset_index(drop=True), X_test_encoded.reset_index(drop=True)], axis=1)

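X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)
# NOTE: dropping column '0' removes the encoder output; the info() output
# below confirms that no one-hot columns reach the models.
X_train = X_train.drop(columns=['0'])
X_test = X_test.drop(columns=['0'])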
X_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 283495 entries, 0 to 283494
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   registrationyear   283495 non-null  int64
 1   power              283495 non-null  int64
 2   mileage            283495 non-null  int64
 3   registrationmonth  283495 non-null  int64
 4   notrepaired        283495 non-null  bool 
 5   numberofpictures   283495 non-null  int64
 6   postalcode         283495 non-null  int64
 7   datecrawled_year   283495 non-null  int32
 8   datecrawled_month  283495 non-null  int32
 9   datecrawled_day    283495 non-null  int32
 10  datecreated_year   283495 non-null  int32
 11  datecreated_month  283495 non-null  int32
 12  datecreated_day    283495 non-null  int32
 13  lastseen_year      283495 non-null  int32
 14  lastseen_month     283495 non-null  int32
 15  lastseen_day       283495 non-null  int32
dtypes: bool(1), int32(9), int64(6)
memory 

<div class="alert alert-block alert-success">
<b>Reviewer's comment (1st iteration)</b> <a class=“tocSkip”></a>

Very good! Your data is now ready for training.
</div>

In [7]:
def evaluate_model(model, X_train, X_test, y_train, y_test):
    start_train = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start_train

    start_pred = time.time()
    y_pred = model.predict(X_test)
    pred_time = time.time() - start_pred

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    return rmse, train_time, pred_time

# 1. Linear Regression
lr = LinearRegression()
lr_rmse, lr_train_time, lr_pred_time = evaluate_model(lr, X_train, X_test, y_train, y_test)
print(f"Linear Regression -> RMSE: {lr_rmse:.4f}, Train Time: {lr_train_time:.4f} sec, Pred Time: {lr_pred_time:.4f} sec")

# 2. Decision Tree
dt = DecisionTreeRegressor(max_depth=10, random_state=42)
dt_rmse, dt_train_time, dt_pred_time = evaluate_model(dt, X_train, X_test, y_train, y_test)
print(f"Decision Tree -> RMSE: {dt_rmse:.4f}, Train Time: {dt_train_time:.4f} sec, Pred Time: {dt_pred_time:.4f} sec")

# 3. Random Forest
rf = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=42, n_jobs=-1)
rf_rmse, rf_train_time, rf_pred_time = evaluate_model(rf, X_train, X_test, y_train, y_test)
print(f"Random Forest -> RMSE: {rf_rmse:.4f}, Train Time: {rf_train_time:.4f} sec, Pred Time: {rf_pred_time:.4f} sec")


Linear Regression -> RMSE: 4024.3450, Train Time: 0.2191 sec, Pred Time: 0.0126 sec
Decision Tree -> RMSE: 2249.5911, Train Time: 1.2130 sec, Pred Time: 0.0113 sec


Random Forest -> RMSE: 2190.8211, Train Time: 7.9950 sec, Pred Time: 0.0743 sec


In [11]:
df_results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest'],
    'RMSE': [lr_rmse, dt_rmse, rf_rmse],
    'Training Time (s)': [lr_train_time, dt_train_time, rf_train_time],
    'Prediction Time (s)': [lr_pred_time, dt_pred_time, rf_pred_time]
})

df_results['Total Time (s)'] = df_results['Training Time (s)'] + df_results['Prediction Time (s)']

print(df_results.sort_values(by='RMSE'))



               Model         RMSE  Training Time (s)  Prediction Time (s)  \
2      Random Forest  2190.821130           7.995027             0.074296   
1      Decision Tree  2249.591051           1.213011             0.011265   
0  Linear Regression  4024.345013           0.219097             0.012588   

   Total Time (s)  
2        8.069323  
1        1.224276  
0        0.231685  


<div class="alert alert-block alert-danger">
<b>Reviewer's comment (1st iteration)</b> <a class=“tocSkip”></a>

Great progress on the project, Nelson! Your training runs are set up properly and you measured the training time. However, the project description also says that the company cares about prediction speed, which can likewise be measured by the time it takes. You should therefore use the `time` library both when you call `.fit()` and when you call `.predict()`; that is, you should report two timing results for each model.
</div>
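
Single wall-clock measurements can vary from run to run, so one possible refinement of the timing advice is to time a call several times and keep the best result. The sketch below assumes a hypothetical helper `time_call` and a stand-in workload instead of a real model. It uses `time.perf_counter`, the standard-library clock intended for measuring short durations.

```python
import time


def time_call(func, *args, repeats=3):
    """Return (result, best_seconds) for func(*args) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in for model.predict (illustrative workload, not a real model)
def slow_sum(values):
    return sum(v * v for v in values)


result, seconds = time_call(slow_sum, range(100_000))
print(f"best of 3 runs: {seconds:.4f} s")
```

Repeating is cheap for `.predict()`. For `.fit()`, each repeat retrains the model, so a single run may be the more practical choice for slow models.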

## 3. Conclusion

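The best model for this dataset is the Decision Tree. Its RMSE (2249.59) is close to that of the Random Forest (2190.82), it trains in about 1.2 s instead of 8.0 s, and it has the fastest prediction time of the three models (about 0.011 s).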

# Checklist

Type 'x' to check an item. Then press Shift+Enter.

- [x]  Jupyter Notebook is open
- [ ]  The code has no errors
- [ ]  The code cells are arranged in execution order
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  The speed and quality of the models have been analyzed