# Project 01 - Car Price Prediction

### Objective
The goal is to estimate a selling price for new (previously unseen) vehicles.

### Dataset
The dataset was taken from the CarDekho website and contains 5,689 cars.




In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import random
random.seed(202101)

In [None]:
!pip3 install catboost

## Data analysis

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
df_train = pd.read_csv("/content/drive/MyDrive/Aprendizado de Máquina/Bases/train_car_details.csv")

The data consists of the following variables:

* Quantitative variables:
 * Year of manufacture (year)
 * Kilometres driven (km_driven)
 * Maximum engine power (max_power), in bhp
 * Number of seats (seats)
 * Fuel economy (mileage), in km per litre or km per kg
 * Engine displacement (engine), in CC
 * Torque (torque): the engine's capacity to produce rotational force
 * Selling price (selling_price) **Target value to be predicted**

* Qualitative variables:
 * Car name (name)
 * Fuel type (fuel)
 * Seller type (seller_type)
 * Transmission (transmission): automatic or manual
 * Number of previous owners (owner)

### Missing-value analysis

In [5]:
df_train.head()

Unnamed: 0,Id,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,1,Hyundai Santro GLS I - Euro I,1999,80000,110000,Petrol,Individual,Manual,Second Owner,,,,,
1,2,Maruti Ertiga VDI,2012,459999,87000,Diesel,Individual,Manual,First Owner,20.77 kmpl,1248 CC,88.76 bhp,200Nm@ 1750rpm,7.0
2,3,BMW 3 Series 320d Luxury Line,2010,1100000,102000,Diesel,Dealer,Automatic,First Owner,19.62 kmpl,1995 CC,187.74 bhp,400Nm@ 1750-2500rpm,5.0
3,4,Tata New Safari DICOR 2.2 EX 4x2,2009,229999,212000,Diesel,Individual,Manual,Third Owner,11.57 kmpl,2179 CC,138.1 bhp,320Nm@ 1700-2700rpm,7.0
4,5,Toyota Fortuner 3.0 Diesel,2010,800000,125000,Diesel,Individual,Manual,Second Owner,11.5 kmpl,2982 CC,171 bhp,343Nm@ 1400-3400rpm,7.0


In [6]:
# Percentage of NaN per attribute
print(round(100*df_train.isna().sum()/len(df_train), 2))

Id               0.00
name             0.00
year             0.00
selling_price    0.00
km_driven        0.00
fuel             0.00
seller_type      0.00
transmission     0.00
owner            0.00
mileage          2.76
engine           2.76
max_power        2.65
torque           2.78
seats            2.76
dtype: float64


In [7]:
pd.set_option('display.max_rows', 200)
df_train[df_train['mileage'].isnull()]
#pd.reset_option('display.max_rows')

Unnamed: 0,Id,name,year,selling_price,km_driven,fuel,seller_type,transmission,owner,mileage,engine,max_power,torque,seats
0,1,Hyundai Santro GLS I - Euro I,1999,80000,110000,Petrol,Individual,Manual,Second Owner,,,,,
11,12,Fiat Punto 1.3 Emotion,2010,190000,120000,Diesel,Individual,Manual,Second Owner,,,,,
37,38,Hyundai Santro Xing XG,2006,72000,110000,Petrol,Individual,Manual,Second Owner,,,,,
48,49,Toyota Etios Liva GD,2012,400000,107500,Diesel,Individual,Manual,Second Owner,,,,,
92,93,Maruti Zen Estilo VXI BSIV W ABS,2011,170000,55113,Petrol,Individual,Manual,First Owner,,,,,
94,95,Maruti Swift Dzire VDI Optional,2017,589000,41232,Diesel,Dealer,Manual,First Owner,,,0.0,,
106,107,Maruti Esteem Vxi - BSII,2005,50000,60000,Petrol,Individual,Manual,Third Owner,,,,,
154,155,Maruti Estilo LXI,2010,135000,132000,Petrol,Individual,Manual,Second Owner,,,,,
157,158,Maruti Esteem Vxi - BSII,2005,93000,120000,Petrol,Individual,Manual,Third Owner,,,,,
173,174,Ford Fiesta 1.6 SXI ABS Duratec,2011,325000,53287,Petrol,Dealer,Manual,First Owner,,,,,


Because the NaNs occur mostly in the same rows and account for a small share of the total (less than 3%), these rows will be dropped.
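The claim that the missing values are concentrated in the same rows can be checked by counting NaNs per row. The sketch below is a minimal illustration on a toy frame: `toy`, `spec_cols` and the sample values are assumptions made for the example, not the notebook's data. On the real data the same logic would be applied to `df_train` before the drop.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (illustrative values only)
toy = pd.DataFrame({
    "mileage":   ["20.77 kmpl", np.nan, "19.62 kmpl", np.nan],
    "engine":    ["1248 CC", np.nan, "1995 CC", np.nan],
    "max_power": ["88.76 bhp", np.nan, "187.74 bhp", "0"],
    "torque":    ["200Nm@ 1750rpm", np.nan, "400Nm@ 1750-2500rpm", np.nan],
    "seats":     [7.0, np.nan, 5.0, np.nan],
})

# Columns that showed missing values in the percentage table
spec_cols = ["mileage", "engine", "max_power", "torque", "seats"]

# Number of missing values in each row, restricted to those columns
missing_per_row = toy[spec_cols].isna().sum(axis=1)

rows_any_missing = int((missing_per_row > 0).sum())
rows_all_missing = int((missing_per_row == len(spec_cols)).sum())

print(f"Rows with at least one NaN: {rows_any_missing}")
print(f"Rows with every listed column NaN: {rows_all_missing}")
```

If the two counts are close on the real data, dropping the rows loses little beyond what a column-wise imputation would have had to invent anyway.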

In [8]:
print(f'Total number of rows: {df_train.shape[0]}')
# Drop the rows that contain NaN
df_train = df_train.dropna(axis=0)
print(f'Number of rows after dropping NaNs: {df_train.shape[0]}')
# 158 rows removed (about 2.8% of the total)

Total number of rows: 5689
Number of rows after dropping NaNs: 5531


## Data preprocessing

As observed, some variables carry their unit of measurement in the value (for example, 20.77 kmpl or 1248 CC). The goal is to convert them to numeric values.

Before that, however, the categorical variables are inspected to determine whether they need any treatment.

In [9]:
df_train.columns

Index(['Id', 'name', 'year', 'selling_price', 'km_driven', 'fuel',
       'seller_type', 'transmission', 'owner', 'mileage', 'engine',
       'max_power', 'torque', 'seats'],
      dtype='object')

In [10]:
print(df_train.fuel.unique())
print(df_train.seller_type.unique())
print(df_train.transmission.unique())
print(df_train.owner.unique())

['Diesel' 'Petrol' 'CNG' 'LPG']
['Individual' 'Dealer' 'Trustmark Dealer']
['Manual' 'Automatic']
['First Owner' 'Third Owner' 'Second Owner' 'Fourth & Above Owner'
 'Test Drive Car']


In [11]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5531 entries, 1 to 5688
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             5531 non-null   int64  
 1   name           5531 non-null   object 
 2   year           5531 non-null   int64  
 3   selling_price  5531 non-null   int64  
 4   km_driven      5531 non-null   int64  
 5   fuel           5531 non-null   object 
 6   seller_type    5531 non-null   object 
 7   transmission   5531 non-null   object 
 8   owner          5531 non-null   object 
 9   mileage        5531 non-null   object 
 10  engine         5531 non-null   object 
 11  max_power      5531 non-null   object 
 12  torque         5531 non-null   object 
 13  seats          5531 non-null   float64
dtypes: float64(1), int64(4), object(9)
memory usage: 648.2+ KB


### Mileage

In [12]:
df_train.mileage.unique()

array(['20.77 kmpl', '19.62 kmpl', '11.57 kmpl', '11.5 kmpl', '19.7 kmpl',
       '15.6 kmpl', '18.6 kmpl', '13.58 kmpl', '25.8 kmpl', '19.33 kmpl',
       '23.01 kmpl', '23.9 kmpl', '15.96 kmpl', '14.0 kmpl', '21.04 kmpl',
       '23.65 kmpl', '12.8 kmpl', '22.0 kmpl', '20.0 kmpl', '19.01 kmpl',
       '16.2 kmpl', '21.5 kmpl', '19.0 kmpl', '23.59 kmpl', '20.46 kmpl',
       '13.93 kmpl', '22.32 kmpl', '22.74 kmpl', '15.64 kmpl',
       '24.4 kmpl', '17.5 kmpl', '23.95 kmpl', '20.4 kmpl', '25.2 kmpl',
       '22.3 kmpl', '15.8 kmpl', '12.83 kmpl', '18.8 kmpl', '15.5 kmpl',
       '20.63 kmpl', '16.5 kmpl', '18.9 kmpl', '20.5 kmpl', '25.83 kmpl',
       '11.79 kmpl', '21.01 kmpl', '17.0 kmpl', '17.01 kmpl', '23.5 kmpl',
       '18.5 kmpl', '23.1 kmpl', '21.38 kmpl', '18.0 kmpl', '12.99 kmpl',
       '26.0 kmpl', '15.1 kmpl', '25.17 kmpl', '23.0 kmpl', '23.08 kmpl',
       '25.4 kmpl', '21.12 kmpl', '21.1 kmpl', '15.3 kmpl', '22.54 kmpl',
       '23.57 kmpl', '26.6 km/kg', '19.2 km/kg',

The values use two different units: kmpl (km per litre) and km/kg.
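One way to understand where each unit comes from is to cross-tabulate the unit against the fuel type. The sketch below uses a toy frame in which the pairing of km/kg with CNG and LPG is an assumption made for illustration (gaseous fuels are commonly measured by weight); it should be verified on `df_train` rather than taken as given.

```python
import pandas as pd

# Toy rows; the fuel/unit pairing is assumed for illustration
toy = pd.DataFrame({
    "fuel":    ["Diesel", "Petrol", "CNG", "LPG"],
    "mileage": ["20.77 kmpl", "19.7 kmpl", "26.6 km/kg", "19.2 km/kg"],
})

# The second whitespace-separated token is the unit
unit = toy["mileage"].str.split().str[1]

# Count how many rows of each fuel type use each unit
unit_by_fuel = pd.crosstab(toy["fuel"], unit)
print(unit_by_fuel)
```

If the units do split cleanly by fuel type on the real data, the two scales are not directly comparable and may need to be handled separately.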

In [13]:
# Store the unit (second token of each mileage string) in a helper column
df_train["teste"] = df_train["mileage"].str.split().str[1]