# Regression - Feature preparation
---
**Practical Class 02**: Feature preparation


**Objective**: Prepare features so they can be used to train a model.


Dataset:


Used car prices


[Available on Kaggle](https://www.kaggle.com/datasets/rishabhkarn/used-car-dataset/data)


[Available for download](https://drive.google.com/file/d/1Ny6GypPH4AtJi6CJHmEUEI3KN11hDuGG/view?usp=drive_link)


Data description:
* car_name: car name
* registration_year: registration year
* insurance_validity: type of insurance
* fuel_type: fuel type
* seats: number of seats
* kms_driven: total kilometers driven
* ownsership: number of owners (the column name is spelled this way in the dataset)
* transmission: transmission type
* manufacturing_year: manufacturing year
* mileage(kmpl): kilometers per liter
* engine(cc): engine size
* max_power(bhp): car power
* torque(Nm): engine torque
* price(in lakhs): price in lakhs (Indian unit equal to 100,000)

## Importing the main functions and reading the data


---


In [None]:
import pandas as pd  # package for reading the data
import numpy as np

In [None]:
# option 1 -> mount Google Drive in Colab and access the file
#from google.colab import drive
#drive.mount('/content/drive')


# option 2 -> download the file and upload it here
from google.colab import files
uploaded = files.upload()

Saving Used Car Dataset.csv to Used Car Dataset.csv


In [None]:
path = 'Used Car Dataset.csv'
df = pd.read_csv(path)

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,car_name,registration_year,insurance_validity,fuel_type,seats,kms_driven,ownsership,transmission,manufacturing_year,mileage(kmpl),engine(cc),max_power(bhp),torque(Nm),price(in lakhs)
0,0,2017 Mercedes-Benz S-Class S400,Jul-17,Comprehensive,Petrol,5,56000,First Owner,Automatic,2017,7.81,2996.0,2996.0,333.0,63.75
1,1,2020 Nissan Magnite Turbo CVT XV Premium Opt BSVI,Jan-21,Comprehensive,Petrol,5,30615,First Owner,Automatic,2020,17.4,999.0,999.0,9863.0,8.99
2,2,2018 BMW X1 sDrive 20d xLine,Sep-18,Comprehensive,Diesel,5,24000,First Owner,Automatic,2018,20.68,1995.0,1995.0,188.0,23.75
3,3,2019 Kia Seltos GTX Plus,Dec-19,Comprehensive,Petrol,5,18378,First Owner,Manual,2019,16.5,1353.0,1353.0,13808.0,13.56
4,4,2019 Skoda Superb LK 1.8 TSI AT,Aug-19,Comprehensive,Petrol,5,44900,First Owner,Automatic,2019,14.67,1798.0,1798.0,17746.0,24.0


In [None]:
df.dtypes

Unnamed: 0              int64
car_name               object
registration_year      object
insurance_validity     object
fuel_type              object
seats                   int64
kms_driven              int64
ownsership             object
transmission           object
manufacturing_year     object
mileage(kmpl)         float64
engine(cc)            float64
max_power(bhp)        float64
torque(Nm)            float64
price(in lakhs)       float64
dtype: object

In [None]:
df = df.rename(columns = {'mileage(kmpl)':'mileage',
                          'engine(cc)':'engine',
                          'max_power(bhp)': 'max_power',
                          'torque(Nm)': 'torque',
                          'price(in lakhs)': 'price'})

df.drop(columns='Unnamed: 0', inplace=True)
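
The explicit mapping above can also be written generically. The following is a minimal sketch, assuming every unit suffix is wrapped in parentheses at the end of the column name; the empty frame `exemplo` is hypothetical and only mirrors the column names of the dataset.

```python
import pandas as pd

# Hypothetical empty frame that only mirrors the original column names.
exemplo = pd.DataFrame(columns=['car_name', 'mileage(kmpl)', 'engine(cc)',
                                'max_power(bhp)', 'torque(Nm)', 'price(in lakhs)'])

# Remove a trailing "(...)" suffix from every column name.
exemplo.columns = exemplo.columns.str.replace(r'\(.*\)$', '', regex=True)

print(list(exemplo.columns))
```

The explicit `rename` dictionary is still easier to audit when only a handful of columns are involved.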

## Preparação de dados numéricos
---
To process the numeric variables, we will do the following:


1. Imputation/removal of missing data

```python
# Fill in missing data ('coluna' is the column name, 'valor' the fill value)
df['coluna'] = df['coluna'].fillna(valor)
```

Exercises:

Fill the variables that have missing data with the median of the observed values.

In [None]:
df.dtypes

car_name               object
registration_year      object
insurance_validity     object
fuel_type              object
seats                   int64
kms_driven              int64
ownsership             object
transmission           object
manufacturing_year     object
mileage               float64
engine                float64
max_power             float64
torque                float64
price                 float64
dtype: object

In [None]:
var_numerica = ['seats', 'kms_driven', 'mileage', 'engine', 'max_power', 'torque']

#### Solution

In [None]:
df[var_numerica].isnull().sum()

seats         0
kms_driven    0
mileage       3
engine        3
max_power     3
torque        4
dtype: int64

In [None]:
for var in var_numerica:
  mediana = df[var].median()
  df[var] = df[var].fillna(mediana)

In [None]:
df[var_numerica].isnull().sum()

seats         0
kms_driven    0
mileage       0
engine        0
max_power     0
torque        0
dtype: int64
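
The loop above can also be expressed without iterating over columns, since `fillna` accepts a Series of per-column values. This is a small sketch on a hypothetical toy frame (`toy` is not the course data), shown only to illustrate the vectorized form.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both columns.
toy = pd.DataFrame({'mileage': [10.0, np.nan, 20.0, 30.0],
                    'torque': [100.0, 200.0, np.nan, np.nan]})

# median() skips NaN by default and returns one value per column.
medianas = toy.median()

# fillna matches the Series index against the column names.
toy_preenchido = toy.fillna(medianas)

print(toy_preenchido)
```

On the course data, the equivalent call would be `df[var_numerica] = df[var_numerica].fillna(df[var_numerica].median())`.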

## Preparação de dados categóricos


---
To process the categorical variables, we will do the following:


1. Transform each categorical variable into numeric variables that represent its categories:
  a. Dummy
  b. One-hot encoding




---
### Dummy variables


Dummy variables represent a categorical variable with k categories as k-1 binary variables. The omitted category is the baseline and is encoded as all zeros.


Example:
Age range variable


|Original variable|18-30|30-50|50+|
|---|---|---|---|
|0-18|0|0|0|
|18-30|1|0|0|
|30-50|0|1|0|
|50+|0|0|1|


To create dummy variables, we will use the `get_dummies` function from the pandas package.


Example:
``` python
import pandas as pd
df_dummy = pd.get_dummies(df, drop_first=True)
```
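
To connect the function with the age range table, the sketch below encodes a hypothetical one-column frame (`faixa`) holding the four age ranges. `dtype=int` is passed so the result prints as 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical frame with one row per age range from the table above.
faixa = pd.DataFrame({'faixa': ['0-18', '18-30', '30-50', '50+']})

# drop_first=True drops the first category ('0-18'), which becomes the baseline.
dummies = pd.get_dummies(faixa, drop_first=True, dtype=int)

print(dummies)
```

The row for `0-18` is all zeros, matching the first row of the table.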
Exercises:


Create dummy variables for the dataset. Did you notice any problem when creating the dummies?

In [None]:
df.dtypes

car_name               object
registration_year      object
insurance_validity     object
fuel_type              object
seats                   int64
kms_driven              int64
ownsership             object
transmission           object
manufacturing_year     object
mileage               float64
engine                float64
max_power             float64
torque                float64
price                 float64
dtype: object

In [None]:
var_cat = ['registration_year',
           'fuel_type',
           'ownsership',
           'transmission']

In [None]:
var_cat + var_numerica

['registration_year',
 'fuel_type',
 'ownsership',
 'transmission',
 'seats',
 'kms_driven',
 'mileage',
 'engine',
 'max_power',
 'torque']

In [None]:
df[var_cat + var_numerica]

Unnamed: 0,registration_year,fuel_type,ownsership,transmission,seats,kms_driven,mileage,engine,max_power,torque
0,Jul-17,Petrol,First Owner,Automatic,5,56000,7.81,2996.0,2996.0,333.0
1,Jan-21,Petrol,First Owner,Automatic,5,30615,17.40,999.0,999.0,9863.0
2,Sep-18,Diesel,First Owner,Automatic,5,24000,20.68,1995.0,1995.0,188.0
3,Dec-19,Petrol,First Owner,Manual,5,18378,16.50,1353.0,1353.0,13808.0
4,Aug-19,Petrol,First Owner,Automatic,5,44900,14.67,1798.0,1798.0,17746.0
...,...,...,...,...,...,...,...,...,...,...
1548,Aug-20,Diesel,First Owner,Automatic,5,35000,1493.00,11345.0,11345.0,250.0
1549,2022,Petrol,999 cc,2022,5,10000,999.00,6706.0,6706.0,91.0
1550,Jun-17,Petrol,First Owner,Manual,5,49000,17.50,1199.0,1199.0,887.0
1551,May-18,Petrol,Second Owner,Manual,5,40000,18.78,999.0,999.0,75.0


#### Solution

In [None]:
df_dummy = pd.get_dummies(df[var_cat + var_numerica], drop_first=True)

In [None]:
df_dummy.columns

Index(['seats', 'kms_driven', 'mileage', 'engine', 'max_power', 'torque',
       'registration_year_2007', 'registration_year_2009',
       'registration_year_2010', 'registration_year_2011',
       ...
       'transmission_2016', 'transmission_2017', 'transmission_2018',
       'transmission_2020', 'transmission_2021', 'transmission_2022',
       'transmission_2023', 'transmission_Automatic', 'transmission_Manual',
       'transmission_Power Windows Front'],
      dtype='object', length=223)
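
A result with 223 columns, including names such as `transmission_2016`, suggests that the categorical columns contain unexpected values. One way to catch this before encoding is to count the distinct values in each column. The sketch below uses a hypothetical toy frame rather than the course data.

```python
import pandas as pd

# Hypothetical toy frame with one misplaced value ('2017') in 'transmission'.
toy = pd.DataFrame({
    'transmission': ['Manual', 'Automatic', '2017', 'Manual'],
    'fuel_type': ['Petrol', 'Diesel', 'Petrol', 'Petrol'],
})

# Number of distinct categories per column: each one becomes a dummy column.
cardinalidade = toy.nunique()
print(cardinalidade)

# Inspect the actual values of a suspicious column.
print(toy['transmission'].value_counts())
```

On the course data, `df[var_cat].nunique()` would give the same kind of summary.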

### Cleaning the categories

In [None]:
df[var_cat[0]].value_counts()

2017      40
Jul-18    38
Aug-18    29
Jul-17    29
May-17    26
          ..
Jul-10     1
May-12     1
Feb-13     1
Nov-11     1
Apr-14     1
Name: registration_year, Length: 178, dtype: int64

In [None]:
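def gera_data(registration_year):
  # Values such as 'Jul-17' become '2017'; 4-character values such as '2017' are kept.
  if len(registration_year)!=4:
    try:
      return '20' + registration_year.split('-')[1]
    except IndexError:
      # No '-' in the value: return a sentinel year that stands out as invalid.
      return '200000'
  return registration_year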
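
As an alternative to the row-by-row function, the same idea can be sketched with vectorized string methods. This is only an illustration under the assumption that valid values look like either `Mon-YY` or `YYYY`; the Series `raw` and the variable names are hypothetical. Unlike `gera_data`, invalid values become missing (`<NA>`) instead of the `200000` sentinel.

```python
import pandas as pd

# Hypothetical sample of raw values, including one invalid entry.
raw = pd.Series(['Jul-17', '2017', 'Jan-21', 'Comprehensive'])

# Capture the two-digit year of values like 'Jul-17' (NaN when there is no match).
dois_digitos = raw.str.extract(r'^[A-Za-z]{3}-(\d{2})$', expand=False)

# Capture values that are already a four-digit year.
quatro_digitos = raw.str.extract(r'^(\d{4})$', expand=False)

# Combine both patterns; rows matching neither stay missing.
ano = (pd.to_numeric(dois_digitos) + 2000).fillna(pd.to_numeric(quatro_digitos))
ano = ano.astype('Int64')

print(ano)
```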

In [None]:
df[var_cat[0]].apply(gera_data).value_counts()

2018      235
2017      217
2019      181
2016      137
2015      132
2020      128
2021      111
2014      105
2022      103
2012       51
2013       47
2023       46
2011       29
2010       17
2009       11
200000      2
2007        1
Name: registration_year, dtype: int64

In [None]:
df['registration_tratada'] = df[var_cat[0]].apply(gera_data).astype(int)

In [None]:
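def categoria_idade(registration_year):
  # Group registration years into ranges; labels are kept in Portuguese
  # ('antes' = before, 'depois' = after).
  if registration_year < 2010:
    return 'antes_2010'
  elif registration_year < 2020:
    return 'antes 2020'    # 2010 to 2019
  elif registration_year < 2030:
    return 'depois 2020'   # 2020 to 2029
  else:
    return 'None'          # invalid years, such as the 200000 sentinel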
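
The same binning can be sketched with `pd.cut`, which assigns each value to an interval. The example below uses a hypothetical Series `anos`; the bin edges are chosen so that, for integer years, the intervals match the thresholds in `categoria_idade`. One difference: years outside every interval become missing values rather than the string `'None'`.

```python
import numpy as np
import pandas as pd

# Hypothetical years, including the invalid sentinel 200000.
anos = pd.Series([2007, 2015, 2020, 2023, 200000])

# Intervals are right-closed: (-inf, 2009], (2009, 2019], (2019, 2029].
faixas = pd.cut(anos,
                bins=[-np.inf, 2009, 2019, 2029],
                labels=['antes_2010', 'antes 2020', 'depois 2020'])

print(faixas)
```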

In [None]:
df['registration_tratada'].apply(categoria_idade).value_counts(dropna=False)

antes 2020     1151
depois 2020     388
antes_2010       12
None              2
Name: registration_tratada, dtype: int64

In [None]:
df['registration_tratada'] = df['registration_tratada'].apply(categoria_idade)

In [None]:
df['registration_tratada']

0        antes 2020
1       depois 2020
2        antes 2020
3        antes 2020
4        antes 2020
           ...     
1548    depois 2020
1549    depois 2020
1550     antes 2020
1551     antes 2020
1552     antes 2020
Name: registration_tratada, Length: 1553, dtype: object

In [None]:
df[var_cat[1]].value_counts(dropna=False)

Petrol     1013
Diesel      516
CNG          22
5 Seats       2
Name: fuel_type, dtype: int64

In [None]:
df = df[df[var_cat[1]] != '5 Seats'].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later

In [None]:
df[var_cat[1]].value_counts()

Petrol    1013
Diesel     516
CNG         22
Name: fuel_type, dtype: int64

In [None]:
df['registration_tratada'].value_counts()

antes 2020     1151
depois 2020     388
antes_2010       12
Name: registration_tratada, dtype: int64

In [None]:
df[var_cat[2]].value_counts()

First Owner     1240
Second Owner     240
1995 cc           24
Third Owner       21
1498 cc            3
999 cc             2
Fifth Owner        2
1497 cc            2
1451 cc            2
998 cc             2
1461 cc            2
2993 cc            2
1998 cc            1
1996 cc            1
1950 cc            1
1199 cc            1
1248 cc            1
1197 cc            1
1984 cc            1
2999 cc            1
1968 cc            1
Name: ownsership, dtype: int64

In [None]:
def tratamento_ownsership(ownsership):
  if ownsership in ['First Owner', 'Second Owner', 'Third Owner', 'Fifth Owner']:
    return ownsership
  else:
    return 'None'

In [None]:
df['ownsership_tratado'] = df.ownsership.apply(tratamento_ownsership)

In [None]:
df['ownsership_tratado'].value_counts()

First Owner     1240
Second Owner     240
None              48
Third Owner       21
Fifth Owner        2
Name: ownsership_tratado, dtype: int64
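
The whitelist logic can also be written without `apply`, using `isin` together with `where`. This is a minimal sketch on a hypothetical Series; the list of valid labels is the one used in `tratamento_ownsership`.

```python
import pandas as pd

# Hypothetical sample mixing valid labels with misplaced engine sizes.
ownsership = pd.Series(['First Owner', '1995 cc', 'Second Owner', '999 cc'])

validos = ['First Owner', 'Second Owner', 'Third Owner', 'Fifth Owner']

# Keep values found in the whitelist; replace everything else with 'None'.
tratado = ownsership.where(ownsership.isin(validos), 'None')

print(tratado.tolist())
```

The same pattern would apply to any column that can be cleaned with a fixed list of valid values, such as `transmission`.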

In [None]:
df[var_cat[3]].value_counts()

Manual       835
Automatic    668
2017          28
2014           5
2011           3
2023           2
2020           2
2021           2
2022           2
2018           2
2015           1
2016           1
Name: transmission, dtype: int64

In [None]:
df['transmission_tratado'] = df[var_cat[3]].apply(lambda x: x if x in ['Manual', 'Automatic'] else 'None')

In [None]:
df['transmission_tratado'].value_counts()

Manual       835
Automatic    668
None          48
Name: transmission_tratado, dtype: int64

In [None]:
var_cat = ['registration_tratada',
           'fuel_type',
           'ownsership_tratado',
           'transmission_tratado']

In [None]:
df_dummy = pd.get_dummies(df[var_cat + var_numerica],drop_first=True)

In [None]:
df_dummy.head()

Unnamed: 0,seats,kms_driven,mileage,engine,max_power,torque,registration_tratada_antes_2010,registration_tratada_depois 2020,fuel_type_Diesel,fuel_type_Petrol,ownsership_tratado_First Owner,ownsership_tratado_None,ownsership_tratado_Second Owner,ownsership_tratado_Third Owner,transmission_tratado_Manual,transmission_tratado_None
0,5,56000,7.81,2996.0,2996.0,333.0,0,0,0,1,1,0,0,0,0,0
1,5,30615,17.4,999.0,999.0,9863.0,0,1,0,1,1,0,0,0,0,0
2,5,24000,20.68,1995.0,1995.0,188.0,0,0,1,0,1,0,0,0,0,0
3,5,18378,16.5,1353.0,1353.0,13808.0,0,0,0,1,1,0,0,0,1,0
4,5,44900,14.67,1798.0,1798.0,17746.0,0,0,0,1,1,0,0,0,0,0


In [None]:
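# Interpreting the dummy coefficients (model sketch, not executable code):
# y = a + b * registration_tratada_antes_2010 + c * registration_tratada_depois 2020 + d * fuel_type_Diesel + e * fuel_type_Petrol
# y = a      -> baseline: registration 'antes 2020' and fuel CNG
# y = a + b  -> registration 'antes_2010' and fuel CNG
# y = a + c  -> registration 'depois 2020' and fuel CNG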
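
To make the role of the baseline concrete, the sketch below fits an ordinary least squares model on a hypothetical toy dataset with a single categorical feature. The prices are invented for illustration only. With `drop_first=True`, the intercept equals the mean of the baseline category (CNG), and each coefficient is the difference between a category mean and the baseline mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two cars per fuel type, invented prices.
toy = pd.DataFrame({
    'fuel': ['CNG', 'CNG', 'Diesel', 'Diesel', 'Petrol', 'Petrol'],
    'price': [2.0, 4.0, 9.0, 11.0, 5.0, 7.0],
})

# k-1 dummies: 'CNG' is dropped and becomes the baseline.
X = pd.get_dummies(toy[['fuel']], drop_first=True, dtype=float)
X.insert(0, 'intercept', 1.0)

# Least squares fit of y = a + d * fuel_Diesel + e * fuel_Petrol.
coef, *_ = np.linalg.lstsq(X.to_numpy(), toy['price'].to_numpy(), rcond=None)
a, d, e = coef

print(dict(zip(X.columns, coef.round(2))))
```

Here `a` is the mean CNG price, `a + d` the mean Diesel price, and `a + e` the mean Petrol price.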

In [None]:
var_numerica.append('price')
df_dummy = pd.get_dummies(df[var_cat + var_numerica],drop_first=True)
df_dummy.to_csv('dado_tratado.csv', index=False)

### One-hot variables

One-hot encoding represents a categorical variable with k categories as k binary variables, one per category.

Example:
Age range variable

|Original variable|0-18|18-30|30-50|50+|
|---|---|---|---|---|
|0-18|1|0|0|0|
|18-30|0|1|0|0|
|30-50|0|0|1|0|
|50+|0|0|0|1|

To create one-hot variables, we will use the `get_dummies` function from the pandas package, without `drop_first=True`.

Example:
``` python
import pandas as pd
df_dummy = pd.get_dummies(df)
```
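
As a quick check against the table, the sketch below one-hot encodes a hypothetical frame (`faixa`) with the four age ranges. `dtype=int` is used only so the output prints as 0/1.

```python
import pandas as pd

# Hypothetical frame with one row per age range from the table above.
faixa = pd.DataFrame({'faixa': ['0-18', '18-30', '30-50', '50+']})

# Without drop_first, every category gets its own column (k columns).
one_hot = pd.get_dummies(faixa, dtype=int)

print(one_hot)

# Each row contains exactly one 1.
print(one_hot.sum(axis=1).tolist())
```

Because the k columns always sum to 1, they are perfectly collinear with an intercept term, which is why the dummy representation drops one category for linear regression.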

Exercises:

Create one-hot encoded variables for the cleaned data.

#### Solution

In [None]:
df_dummy = pd.get_dummies(df[var_cat + var_numerica])

In [None]:
df_dummy.head()

Unnamed: 0,seats,kms_driven,mileage,engine,max_power,torque,price,registration_tratada_antes 2020,registration_tratada_antes_2010,registration_tratada_depois 2020,...,fuel_type_Diesel,fuel_type_Petrol,ownsership_tratado_Fifth Owner,ownsership_tratado_First Owner,ownsership_tratado_None,ownsership_tratado_Second Owner,ownsership_tratado_Third Owner,transmission_tratado_Automatic,transmission_tratado_Manual,transmission_tratado_None
0,5,56000,7.81,2996.0,2996.0,333.0,63.75,1,0,0,...,0,1,0,1,0,0,0,1,0,0
1,5,30615,17.4,999.0,999.0,9863.0,8.99,0,0,1,...,0,1,0,1,0,0,0,1,0,0
2,5,24000,20.68,1995.0,1995.0,188.0,23.75,1,0,0,...,1,0,0,1,0,0,0,1,0,0
3,5,18378,16.5,1353.0,1353.0,13808.0,13.56,1,0,0,...,0,1,0,1,0,0,0,0,1,0
4,5,44900,14.67,1798.0,1798.0,17746.0,24.0,1,0,0,...,0,1,0,1,0,0,0,1,0,0
