Problem taken from: https://github.com/fferegrino/cf-ml/blob/main/car-prices/car-price.ipynb

# Used Car Price Prediction

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import os
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
cars = pd.read_csv('cars.csv')

In [3]:
cars.head()

Unnamed: 0,maker,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,price
0,cclass,C Class,2020,Automatic,1200,Diesel,,,2.0,30495
1,cclass,C Class,2020,Automatic,1000,Petrol,,,1.5,29989
2,cclass,C Class,2020,Automatic,500,Diesel,,,2.0,37899
3,cclass,C Class,2019,Automatic,5000,Diesel,,,2.0,30399
4,cclass,C Class,2019,Automatic,4500,Diesel,,,2.0,29899


## Exploratory Data Analysis

In [4]:
profile = ProfileReport(cars, title="Raw Car Dataset Analysis", explorative=True)
profile.to_file("cars-report.html")

Summarize dataset: 100%|██████████| 61/61 [00:11<00:00,  5.43it/s, Completed]                     
Generate report structure: 100%|██████████| 1/1 [00:02<00:00,  2.71s/it]
Render HTML: 100%|██████████| 1/1 [00:01<00:00,  1.49s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 99.36it/s]


#### Remove duplicate rows

In [6]:
print(len(cars))
cars = cars.drop_duplicates(keep='first')
print(len(cars))

108540
106267
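
The two counts above differ by 2,273 rows (108540 - 106267), which is the number of exact duplicates that were dropped. As a minimal sketch of how that number could be checked before dropping anything, the example below uses a small toy frame (its values are made up for illustration, not taken from `cars.csv`) and counts duplicates with `DataFrame.duplicated`:

```python
import pandas as pd

# Toy stand-in for `cars`; the values are illustrative only.
toy = pd.DataFrame({
    "maker": ["audi", "audi", "bmw", "audi"],
    "price": [10000, 10000, 12000, 10000],
})

# duplicated(keep="first") flags every repeat of a row except its first occurrence.
n_dupes = int(toy.duplicated(keep="first").sum())

# drop_duplicates(keep="first") removes exactly the rows flagged above.
deduped = toy.drop_duplicates(keep="first")

print(n_dupes)
print(len(toy) - len(deduped))
```

Both printed values should agree, since `drop_duplicates` removes the same rows that `duplicated` flags.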


### Split the Dataset

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
rest, test = train_test_split(cars, test_size=0.2, shuffle=True) # hold out 20% of all rows as the test set
train, val = train_test_split(rest, test_size=0.25, shuffle=True) # 25% of the remaining 80% = 20% of all rows for validation
distributions = np.array([len(train), len(val), len(test)])

print(distributions)
print(distributions / len(cars))

[63759 21254 21254]
[0.59998871 0.20000565 0.20000565]
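
The split above is not seeded, so each run assigns different rows to each subset. A possible variant, sketched here on a 100-row toy frame, passes `random_state` to make the 60/20/20 split reproducible. The `SEED` value and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for `cars`: 100 rows so the percentages are easy to read.
toy = pd.DataFrame({"x": np.arange(100)})

SEED = 42  # hypothetical seed; any fixed integer gives a repeatable split

# First split: 20% of all rows go to the test set.
rest, test = train_test_split(toy, test_size=0.2, shuffle=True, random_state=SEED)

# Second split: 25% of the remaining 80% (20% of the total) becomes validation.
train, val = train_test_split(rest, test_size=0.25, shuffle=True, random_state=SEED)

print(len(train), len(val), len(test))
```

With a fixed seed, rerunning the cell yields the same three subsets, which keeps later comparisons between models consistent.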


### One-hot encoding of categorical variables

In [11]:
from sklearn.preprocessing import OneHotEncoder
maker_encoder = OneHotEncoder()

In [30]:
maker_encoder.fit(train[["maker"]])
mkr = maker_encoder.transform(train[["maker"]]).todense()

print(mkr.shape)

(63759, 11)


In [31]:
maker_encoder.categories_

[array(['audi', 'bmw', 'cclass', 'focus', 'ford', 'hyundi', 'merc',
        'skoda', 'toyota', 'vauxhall', 'vw'], dtype=object)]
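
Note that `categories_` is a list holding one array per encoded column, which is why the output above is an array wrapped in a list. The sketch below illustrates this on a toy frame with two categorical columns (the data is invented for the example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with two categorical columns; values are illustrative only.
toy = pd.DataFrame({
    "maker": ["audi", "bmw", "audi"],
    "fuelType": ["Diesel", "Petrol", "Petrol"],
})

enc = OneHotEncoder().fit(toy[["maker", "fuelType"]])

# One array of sorted categories per input column, in column order.
cats = enc.categories_
print(len(cats))
print(list(cats[0]))
print(list(cats[1]))
```

For a single encoded column such as `maker`, the category labels are therefore found at `categories_[0]`.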

In [13]:
# categories_ is a list with one array per encoded column; take the first (and only) one
df = pd.DataFrame(mkr, columns=maker_encoder.categories_[0], index=train.index)
df["actual"] = train["maker"]
df.sample(5)

Unnamed: 0,audi,bmw,cclass,focus,ford,hyundi,merc,skoda,toyota,vauxhall,vw,actual
35997,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
44599,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
16260,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,audi
86156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,vw
62336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,vauxhall


In [25]:
test_maker = "audi"
pd.get_dummies([test_maker])

Unnamed: 0,audi
0,1


In [26]:
maker_encoder.transform([[test_maker]]).todense()

matrix([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
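
The last two cells contrast `pd.get_dummies`, which only creates columns for the values it is given, with the fitted encoder, which always emits all 11 columns learned from the training set. One thing the default encoder does not handle is a category that never appeared during fitting. A minimal sketch of one way to deal with that, assuming `handle_unknown="ignore"` is acceptable for the model, is shown below. The toy data and the unseen maker `"tesla"` are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training column; values are illustrative only.
train_toy = pd.DataFrame({"maker": ["audi", "bmw", "ford", "audi"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_toy[["maker"]])

# "tesla" was not present when the encoder was fitted.
new_rows = pd.DataFrame({"maker": ["bmw", "tesla"]})
encoded = encoder.transform(new_rows).toarray()

print(encoded)
```

The output keeps the same number of columns as the training encoding, so the validation and test sets can be transformed with the encoder fitted on `train`.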

#### Feature Scaling