# Introduction

The used car sales service **Rusty Bargain** is developing an application to attract new customers. With this app, users can quickly determine the market value of their car. The main objective is to create a model that determines that market value. The model will be evaluated on the following KPIs:
- Prediction quality  
- Prediction speed  
- Training time
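
As a rough illustration of how the three KPIs could be measured for any scikit-learn style regressor, the sketch below times `fit` and `predict` and computes an error score. It makes two assumptions that are not in the project description: RMSE is used as the quality metric, and the data is a synthetic stand-in rather than the Rusty Bargain dataset. The helper name `evaluate_kpis` is hypothetical.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_kpis(model, X_train, y_train, X_test, y_test):
    """Return RMSE, training time, and prediction time (in seconds)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    # RMSE is an assumed quality metric; the project text does not name one.
    rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
    return {
        "rmse": rmse,
        "train_seconds": train_seconds,
        "predict_seconds": predict_seconds,
    }


# Synthetic stand-in data (not the car dataset): a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

kpis = evaluate_kpis(LinearRegression(), X_train, y_train, X_test, y_test)
print(kpis)
```

Calling the same helper for each candidate model would give directly comparable numbers, although wall-clock timings vary between runs and machines.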

### Dataset Description
The dataset is stored in the /datasets/car_data.csv file and contains information about used cars, including their technical specifications, history, and selling price.

**Dataset Features:**
- DateCrawled – Date the profile was downloaded from the database.
- VehicleType – Type of vehicle body.
- RegistrationYear – Year the vehicle was registered.
- Gearbox – Type of transmission.
- Power – Vehicle power (in horsepower, HP).
- Model – Vehicle model.
- Mileage – Mileage (measured in km according to the dataset's regional specifics).
- RegistrationMonth – Month the vehicle was registered.
- FuelType – Type of fuel.
- Brand – Vehicle brand.
- NotRepaired – Indicates whether the vehicle has been repaired or not.
- DateCreated – Date the profile was created.
- NumberOfPictures – Number of vehicle photos.
- PostalCode – Postal code of the profile owner (user).
- LastSeen – Date the user was last active.

**Target Variable:**
- Price – Vehicle price (in euros).

This dataset will be used to train a model to predict the market value of used cars, optimizing prediction quality, prediction speed, and training time.

## 1. Data exploration and preprocessing

### 1.1. Libraries initialization

In [1]:
import numpy as np
import pandas as pd
import math

import seaborn as sns

import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing

from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier

from IPython.display import display

### 1.2. Data loading

In [18]:
df = pd.read_csv('/Users/tomaster/Documents/GitHub/TT_DataScience/Data-Science-TT-Bootcamp/sprint_13_NumericMethods/Project/car_data.csv')

df.columns = df.columns.str.lower()

df.sample(10)


Unnamed: 0,datecrawled,price,vehicletype,registrationyear,gearbox,power,model,mileage,registrationmonth,fueltype,brand,notrepaired,datecreated,numberofpictures,postalcode,lastseen
125392,19/03/2016 14:37,1950,small,1999,manual,50,lupo,150000,3,petrol,volkswagen,no,19/03/2016 00:00,0,33154,19/03/2016 14:37
66137,14/03/2016 16:45,888,,2017,auto,0,corsa,150000,0,petrol,opel,,14/03/2016 00:00,0,45307,18/03/2016 03:46
139119,08/03/2016 11:57,6444,sedan,2003,auto,122,e_klasse,150000,4,gasoline,mercedes_benz,no,08/03/2016 00:00,0,19370,29/03/2016 05:15
94275,15/03/2016 20:37,2200,small,2003,auto,102,golf,150000,2,petrol,volkswagen,no,15/03/2016 00:00,0,31547,07/04/2016 01:46
264722,28/03/2016 00:57,3950,small,2009,manual,97,rio,150000,7,lpg,kia,no,27/03/2016 00:00,0,26639,06/04/2016 06:17
203294,15/03/2016 22:38,2650,sedan,1997,manual,150,5er,150000,2,petrol,bmw,no,15/03/2016 00:00,0,19055,07/04/2016 10:17
272672,22/03/2016 14:46,0,,2017,manual,179,passat,150000,3,gasoline,volkswagen,,22/03/2016 00:00,0,44147,22/03/2016 14:46
46890,03/04/2016 15:46,2500,small,2006,manual,60,panda,125000,9,petrol,fiat,no,03/04/2016 00:00,0,12163,05/04/2016 14:47
297182,23/03/2016 21:55,390,wagon,1998,manual,78,nubira,125000,10,,daewoo,yes,23/03/2016 00:00,0,91352,04/04/2016 03:45
34712,31/03/2016 10:58,2500,small,2005,manual,75,2_reihe,150000,11,petrol,peugeot,no,31/03/2016 00:00,0,52134,06/04/2016 04:15


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   datecrawled        354369 non-null  object
 1   price              354369 non-null  int64 
 2   vehicletype        316879 non-null  object
 3   registrationyear   354369 non-null  int64 
 4   gearbox            334536 non-null  object
 5   power              354369 non-null  int64 
 6   model              334664 non-null  object
 7   mileage            354369 non-null  int64 
 8   registrationmonth  354369 non-null  int64 
 9   fueltype           321474 non-null  object
 10  brand              354369 non-null  object
 11  notrepaired        283215 non-null  object
 12  datecreated        354369 non-null  object
 13  numberofpictures   354369 non-null  int64 
 14  postalcode         354369 non-null  int64 
 15  lastseen           354369 non-null  object
dtypes: int64(7), object(9)

In [20]:

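# Parse the date columns (format: day/month/year hour:minute).
df["datecrawled"] = pd.to_datetime(df["datecrawled"], format="%d/%m/%Y %H:%M")
df["datecreated"] = pd.to_datetime(df["datecreated"], format="%d/%m/%Y %H:%M")
df["lastseen"] = pd.to_datetime(df["lastseen"], format="%d/%m/%Y %H:%M")

# Cast the categorical columns to str.
# Note: missing values become the string 'nan', so these columns
# report no nulls in the df.info() output below.
df['vehicletype'] = df['vehicletype'].astype(str)
df['gearbox'] = df['gearbox'].astype(str)
df['model'] = df['model'].astype(str)

df['fueltype'] = df['fueltype'].astype(str)
df['brand'] = df['brand'].astype(str)
# Map yes/no to booleans.
# Note: astype(bool) turns unmapped (missing) values into True.
df["notrepaired"] = df["notrepaired"].map({"yes": True, "no": False}).astype(bool)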



df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   datecrawled        354369 non-null  datetime64[ns]
 1   price              354369 non-null  int64         
 2   vehicletype        354369 non-null  object        
 3   registrationyear   354369 non-null  int64         
 4   gearbox            354369 non-null  object        
 5   power              354369 non-null  int64         
 6   model              354369 non-null  object        
 7   mileage            354369 non-null  int64         
 8   registrationmonth  354369 non-null  int64         
 9   fueltype           354369 non-null  object        
 10  brand              354369 non-null  object        
 11  notrepaired        354369 non-null  bool          
 12  datecreated        354369 non-null  datetime64[ns]
 13  numberofpictures   354369 non-null  int64   

Although the dataset contains duplicate values, the columns are left as they are for now.
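
One side effect of the type conversion above is worth seeing on a toy example: casting the mapped `notrepaired` column with `astype(bool)` turns missing entries into `True`. The sketch below contrasts that with two alternatives that keep missing entries visible, the nullable `"boolean"` dtype and the `"category"` dtype. Both alternatives are suggestions and not what the notebook does; the toy series `raw` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column mimicking `notrepaired`, including one missing entry.
raw = pd.Series(["yes", "no", np.nan], dtype=object)
mapped = raw.map({"yes": True, "no": False})

# Same pattern as the notebook cell: the missing entry silently becomes True.
as_plain_bool = mapped.astype(bool)

# Alternative (an assumption, not the notebook's method): the nullable
# "boolean" dtype keeps the missing entry as <NA>.
as_nullable_bool = mapped.astype("boolean")

# For text columns, the "category" dtype also keeps missing entries missing.
as_category = raw.astype("category")

print(as_plain_bool.tolist())
print(int(as_nullable_bool.isna().sum()))
print(int(as_category.isna().sum()))
```

Keeping the missing entries explicit leaves the choice of how to handle them (imputing, dropping, or encoding them as a separate category) to a later preprocessing step.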

### 1.3. Data descriptive exploration

In [24]:
df.describe().round()

Unnamed: 0,datecrawled,price,registrationyear,power,mileage,registrationmonth,datecreated,numberofpictures,postalcode,lastseen
count,354369,354369.0,354369.0,354369.0,354369.0,354369.0,354369,354369.0,354369.0,354369
mean,2016-03-21 12:57:41.165057280,4417.0,2004.0,110.0,128211.0,6.0,2016-03-20 19:12:07.753274112,0.0,50509.0,2016-03-29 23:50:30.593703680
min,2016-03-05 14:06:00,0.0,1000.0,0.0,5000.0,0.0,2014-03-10 00:00:00,0.0,1067.0,2016-03-05 14:15:00
25%,2016-03-13 11:52:00,1050.0,1999.0,69.0,125000.0,3.0,2016-03-13 00:00:00,0.0,30165.0,2016-03-23 02:50:00
50%,2016-03-21 17:50:00,2700.0,2003.0,105.0,150000.0,6.0,2016-03-21 00:00:00,0.0,49413.0,2016-04-03 15:15:00
75%,2016-03-29 14:37:00,6400.0,2008.0,143.0,150000.0,9.0,2016-03-29 00:00:00,0.0,71083.0,2016-04-06 10:15:00
max,2016-04-07 14:36:00,20000.0,9999.0,20000.0,150000.0,12.0,2016-04-07 00:00:00,0.0,99998.0,2016-04-07 14:58:00
std,,4514.0,90.0,190.0,37905.0,4.0,,0.0,25783.0,
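
The summary above shows several implausible extremes: a minimum price of 0, registration years of 1000 and 9999, and a maximum power of 20000 HP. A minimal sketch of range-based filtering follows. The bounds are hypothetical placeholders chosen only for illustration (the upper year bound reflects that the profiles were crawled in 2016), and the toy frame `sample` stands in for the real data.

```python
import pandas as pd

# Toy rows echoing the extremes in the summary (not real records).
sample = pd.DataFrame({
    "price": [0, 1950, 6444, 20000],
    "registrationyear": [1000, 1999, 2003, 9999],
    "power": [0, 50, 122, 20000],
})

# Hypothetical plausibility bounds (inclusive); pick real ones after
# inspecting the actual distributions.
bounds = {
    "price": (1, 20000),
    "registrationyear": (1900, 2016),
    "power": (1, 1000),
}

# Keep only rows where every bounded column falls inside its range.
mask = pd.Series(True, index=sample.index)
for column, (low, high) in bounds.items():
    mask &= sample[column].between(low, high)

cleaned = sample[mask]
print(cleaned)
```

Whether such rows should be dropped or treated as missing and imputed is a separate decision that this sketch does not settle.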


## 2. Model Training

## Model analysis

# Checklist

Type 'x' to check. Then press Shift+Enter

- [x]  Jupyter Notebook is open
- [ ]  The code has no errors
- [ ]  The cells with the code are arranged in execution order
- [ ]  The data has been downloaded and prepared
- [ ]  The models have been trained
- [ ]  An analysis of the speed and quality of the models was performed