**AI & Big Data**

Prof. Miguel Bozer da Silva - miguel.bozer@senaisp.edu.br

---

In [27]:
# Importing the libraries for the models
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# One Hot Encoding

## Task #1: Loading the data

In [28]:
# Loading the data

carros = pd.read_csv('Used_fiat_500_in_Italy_dataset.csv', sep = ',')
carros.head()

Unnamed: 0,model,engine_power,transmission,age_in_days,km,previous_owners,lat,lon,price
0,pop,69,manual,4474,56779,2,45.071079,7.46403,4490
1,lounge,69,manual,2708,160000,1,45.069679,7.70492,4500
2,lounge,69,automatic,3470,170000,2,45.514599,9.28434,4500
3,sport,69,manual,3288,132000,2,41.903221,12.49565,4700
4,sport,69,manual,3712,124490,2,45.532661,9.03892,4790


## Task #2: Cleaning the data

In [29]:
carros.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 380 entries, 0 to 379
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   model            380 non-null    object 
 1   engine_power     380 non-null    int64  
 2   transmission     380 non-null    object 
 3   age_in_days      380 non-null    int64  
 4   km               380 non-null    int64  
 5   previous_owners  380 non-null    int64  
 6   lat              380 non-null    float64
 7   lon              380 non-null    float64
 8   price            380 non-null    int64  
dtypes: float64(2), int64(5), object(2)
memory usage: 26.8+ KB


In [30]:
carros.shape

(380, 9)

Let's explore the columns of type `object` to decide where to apply *One Hot Encoding* or *Label Encoding*:

In [31]:
carros['model'].unique()

array(['pop', 'lounge', 'sport', 'star'], dtype=object)

In [32]:
carros['transmission'].unique()

array(['manual', 'automatic'], dtype=object)

The `model` and `transmission` columns contain text, so they need to be converted to numbers.
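
The inspection above can also be done for all text columns at once. The sketch below is only an illustration: `sample` is a hypothetical frame built from a few rows of the `head()` output, and the rule of thumb in the comments (two values for a 0/1 replacement, more values for One Hot Encoding) is the one this notebook follows, not a general requirement.

```python
import pandas as pd

# Hypothetical sample with three columns copied from the head() output above.
sample = pd.DataFrame({
    "model": ["pop", "lounge", "lounge", "sport", "sport"],
    "transmission": ["manual", "manual", "automatic", "manual", "manual"],
    "km": [56779, 160000, 170000, 132000, 124490],
})

# Keep only the columns that are not numeric, i.e. the text columns.
text_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

# Count the distinct values per text column:
# 2 values -> a simple 0/1 replacement, more values -> One Hot Encoding here.
cardinality = sample[text_columns].nunique()
print(cardinality)
```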

Let's now transform the transmission column, which has only two possible values. To do this, we use the `map` method: manual cars are replaced by 0 and automatic cars by 1:

In [33]:
# Assign the result back to the column instead of calling an inplace method on a column selection
carros['transmission'] = carros['transmission'].map({'manual': 0, 'automatic': 1})

In [34]:
carros['transmission'].unique()

array([0, 1])
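
When a column is encoded with a dictionary, it helps to confirm that every category was covered. The following is a minimal sketch, assuming a hypothetical extra category `'semi-auto'` that does not exist in this dataset: `Series.map` turns values missing from the dictionary into `NaN`, which makes them easy to find, whereas `Series.replace` would leave them unchanged as text.

```python
import pandas as pd

# Hypothetical series; 'semi-auto' is NOT a value in the Fiat 500 dataset.
transmission = pd.Series(["manual", "automatic", "manual", "semi-auto"])
mapping = {"manual": 0, "automatic": 1}

# Values that are absent from the mapping become NaN.
encoded = transmission.map(mapping)

# List the original values that were not covered by the mapping.
unmapped = transmission[encoded.isna()].unique().tolist()
print(unmapped)
```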

Let's apply One Hot Encoding to the *model* column to turn each text value into its own column:

In [35]:
# Applying One Hot Encoding
modelos = pd.get_dummies(carros["model"], prefix = "modelo")

In [36]:
modelos.head()

Unnamed: 0,modelo_lounge,modelo_pop,modelo_sport,modelo_star
0,False,True,False,False
1,True,False,False,False
2,True,False,False,False
3,False,False,True,False
4,False,False,True,False
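
`pd.get_dummies` returns boolean columns by default in recent pandas versions, as in the output above. The sketch below shows two optional parameters on a hypothetical five-row series: `dtype=int` to obtain 0/1 values and `drop_first=True` to drop one redundant column. Whether dropping a column is desirable depends on the model being trained, so treat it as an option rather than a recommendation.

```python
import pandas as pd

# Hypothetical series with the four model names found in the dataset.
model = pd.Series(["pop", "lounge", "lounge", "sport", "star"], name="model")

# dtype=int yields 0/1 columns instead of True/False.
dummies_int = pd.get_dummies(model, prefix="modelo", dtype=int)

# drop_first=True removes the first category in sorted order ("lounge");
# a row of all zeros then stands for that category.
reduced = pd.get_dummies(model, prefix="modelo", dtype=int, drop_first=True)

print(dummies_int.columns.tolist())
print(reduced.columns.tolist())
```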


This creates 4 new binary columns indicating the vehicle's model. Now let's create a new `DataFrame` by joining the `carros` and `modelos` DataFrames:

In [37]:
# Concatenating the data:
carros_corrigidos = pd.concat([carros, modelos], axis=1)

In [38]:
# Displaying the new DataFrame
carros_corrigidos.head()

Unnamed: 0,model,engine_power,transmission,age_in_days,km,previous_owners,lat,lon,price,modelo_lounge,modelo_pop,modelo_sport,modelo_star
0,pop,69,0,4474,56779,2,45.071079,7.46403,4490,False,True,False,False
1,lounge,69,0,2708,160000,1,45.069679,7.70492,4500,True,False,False,False
2,lounge,69,1,3470,170000,2,45.514599,9.28434,4500,True,False,False,False
3,sport,69,0,3288,132000,2,41.903221,12.49565,4700,False,False,True,False
4,sport,69,0,3712,124490,2,45.532661,9.03892,4790,False,False,True,False


For a *Machine Learning* model, the *model* column can be dropped: its information is now carried by the `modelo_*` columns, so the text column would not be used to train the model.

In [39]:
carros_corrigidos.drop(columns=['model'], inplace = True)

In [40]:
carros_corrigidos.head()

Unnamed: 0,engine_power,transmission,age_in_days,km,previous_owners,lat,lon,price,modelo_lounge,modelo_pop,modelo_sport,modelo_star
0,69,0,4474,56779,2,45.071079,7.46403,4490,False,True,False,False
1,69,0,2708,160000,1,45.069679,7.70492,4500,True,False,False,False
2,69,1,3470,170000,2,45.514599,9.28434,4500,True,False,False,False
3,69,0,3288,132000,2,41.903221,12.49565,4700,False,False,True,False
4,69,0,3712,124490,2,45.532661,9.03892,4790,False,False,True,False
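
The three steps used above (`get_dummies`, `concat`, `drop`) can also be written as a single call by passing the whole `DataFrame` with the `columns` parameter. This is a sketch on a hypothetical three-row frame named `carros_demo`, not a replacement for the cells above; the encoded columns are appended at the end and the original text column is removed automatically.

```python
import pandas as pd

# Hypothetical frame with a subset of the columns from the dataset.
carros_demo = pd.DataFrame({
    "model": ["pop", "lounge", "sport"],
    "transmission": [0, 0, 1],
    "price": [4490, 4500, 4700],
})

# Encode only the listed columns; the remaining columns are kept as they are.
encoded = pd.get_dummies(carros_demo, columns=["model"], prefix="modelo")
print(encoded.columns.tolist())
```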


# Label Encoding

## Task #1: Loading the data

In [41]:
# Loading the data
titanic = pd.read_csv('titanic.csv', sep = ';')
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,Dead,Third Class,"Kelly, Mr. James",male,345.0,0,0,330911,78292.0,,Q
1,893,Alive,Third Class,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,Dead,Second Class,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,96875.0,,Q
3,895,Dead,Third Class,"Wirz, Mr. Albert",male,27.0,0,0,315154,86625.0,,S
4,896,Alive,Third Class,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,122875.0,,S


## Task #2: Cleaning the data

In [42]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    object 
 2   Pclass       418 non-null    object 
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(3), object(7)
memory usage: 39.3+ KB


Let's explore the columns of type `object` to apply *Label Encoding*:

In [43]:
titanic['Survived'].unique()

array(['Dead', 'Alive'], dtype=object)

In [44]:
titanic['Pclass'].unique()

array(['Third Class', 'Second Class', 'First Class'], dtype=object)

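Let's now apply Label Encoding to the Pclass column, keeping the natural order of the classes (first = 1, second = 2, third = 3):

In [45]:
# Assign the result back to the column instead of calling an inplace method on a column selection
titanic['Pclass'] = titanic['Pclass'].map({'First Class': 1, 'Second Class': 2, 'Third Class': 3})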

In [46]:
titanic['Pclass'].unique()

array([3, 2, 1])
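
Because the class labels have a natural order, the same encoding can be derived from an ordered categorical instead of a hand-written dictionary. The sketch below is one possible alternative, shown on a hypothetical four-value series; the `+ 1` only shifts the zero-based codes so that they match the 1, 2, 3 values used above.

```python
import pandas as pd

# Hypothetical series with the three class labels from the dataset.
pclass = pd.Series(["Third Class", "Second Class", "First Class", "Third Class"])

# The position in this list defines the code of each label.
order = ["First Class", "Second Class", "Third Class"]
as_category = pd.Categorical(pclass, categories=order, ordered=True)

# Categorical codes start at 0 (and are -1 for labels missing from `order`),
# so adding 1 gives 1, 2, 3 and turns any unknown label into 0.
codes = pd.Series(as_category.codes, index=pclass.index) + 1
print(codes.tolist())
```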

In [47]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,Dead,3,"Kelly, Mr. James",male,345.0,0,0,330911,78292.0,,Q
1,893,Alive,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,Dead,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,96875.0,,Q
3,895,Dead,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,86625.0,,S
4,896,Alive,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,122875.0,,S


# Exercícios

## Exercício 1)

For the Titanic dataset, replace the text in the Pclass and Sex columns using Label Encoding, and use One Hot Encoding for the Embarked column.

In [48]:
# Checking the unique values of the "Embarked" column
titanic["Embarked"].unique()

array(['Q', 'S', 'C'], dtype=object)

In [49]:
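# Pclass was already label-encoded in the cells above (Sex is left as text here), so this cell only applies One Hot Encoding to the Embarked column
# Mapping the ports to integers first is optional; get_dummies also accepts the text values directly
titanic['Embarked'] = titanic['Embarked'].map({'Q': 0, 'S': 1, 'C': 2})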

titanic_embarcados = pd.get_dummies(titanic['Embarked'], prefix = "Embarked")

In [50]:
titanic_embarcados.head()

Unnamed: 0,Embarked_0,Embarked_1,Embarked_2
0,True,False,False
1,False,True,False
2,True,False,False
3,False,True,False
4,False,True,False


In [51]:
titanic_modelo = pd.concat([titanic, titanic_embarcados], axis = 1)

In [52]:
titanic_modelo.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Embarked_0,Embarked_1,Embarked_2
0,892,Dead,3,"Kelly, Mr. James",male,345.0,0,0,330911,78292.0,,0,True,False,False
1,893,Alive,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,1,False,True,False
2,894,Dead,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,96875.0,,0,True,False,False
3,895,Dead,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,86625.0,,1,False,True,False
4,896,Alive,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,122875.0,,1,False,True,False
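
The exercise also asks for the Sex column to be label-encoded, which the output above still shows as text. A minimal sketch of that remaining step follows, on a hypothetical frame named `passengers` with values copied from the first five rows. The assignment male = 0 and female = 1 is an arbitrary assumption, and Embarked is encoded directly from its text values so that the new column names keep the port letters.

```python
import pandas as pd

# Hypothetical frame with two columns copied from the head() output.
passengers = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Embarked": ["Q", "S", "Q", "S", "S"],
})

# Label Encoding for Sex (the choice of 0 and 1 is arbitrary).
passengers["Sex"] = passengers["Sex"].map({"male": 0, "female": 1})

# One Hot Encoding for Embarked, straight from the text values.
embarked = pd.get_dummies(passengers["Embarked"], prefix="Embarked")

# Replace the original Embarked column with its encoded columns.
result = pd.concat([passengers.drop(columns=["Embarked"]), embarked], axis=1)
print(result.columns.tolist())
```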


## Exercício 2)

For the nursery.csv dataset (available on Google Classroom), apply Label Encoding or One Hot Encoding to the following columns:

Label Encoding:
* Child's Nursery (has_nurs): (1) proper, (2) less proper, (3) improper, (4) critical, (5) very critical;

One Hot Encoding:
* Social conditions (social): (1) non-problematic, (2) slightly problematic, (3) problematic;


In [54]:
nursery = pd.read_csv('nursery.csv', sep = ',')
nursery.head()

Unnamed: 0,parents,has_nurs,form,children,housing,finance,social,health,final evaluation
0,usual,proper,complete,1,convenient,convenient,nonprob,recommended,recommend
1,usual,proper,complete,1,convenient,convenient,nonprob,priority,priority
2,usual,proper,complete,1,convenient,convenient,nonprob,not_recom,not_recom
3,usual,proper,complete,1,convenient,convenient,slightly_prob,recommended,recommend
4,usual,proper,complete,1,convenient,convenient,slightly_prob,priority,priority


### Label Encoding

In [55]:
nursery['has_nurs'].unique()

array(['proper', 'less_proper', 'improper', 'critical', 'very_crit'],
      dtype=object)

In [57]:
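# Assign the result back to the column instead of calling an inplace method on a column selection
nursery['has_nurs'] = nursery['has_nurs'].map({'proper': 1, 'less_proper': 2, 'improper': 3, 'critical': 4, 'very_crit': 5})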

In [58]:
nursery['has_nurs'].unique()

array([1, 2, 3, 4, 5])

### One Hot Encoding


In [59]:
nursery['social'].unique()

array(['nonprob', 'slightly_prob', 'problematic'], dtype=object)

In [63]:
# Mapping to integers first is optional; get_dummies also accepts the text values directly
nursery['social'] = nursery['social'].map({'nonprob': 1, 'slightly_prob': 2, 'problematic': 3})

nursery_social = pd.get_dummies(nursery['social'], prefix = 'social')

In [64]:
nursery_applied = pd.concat([nursery, nursery_social], axis = 1)

In [65]:
nursery_applied.head()

Unnamed: 0,parents,has_nurs,form,children,housing,finance,social,health,final evaluation,social_1,social_2,social_3
0,usual,1,complete,1,convenient,convenient,1,recommended,recommend,True,False,False
1,usual,1,complete,1,convenient,convenient,1,priority,priority,True,False,False
2,usual,1,complete,1,convenient,convenient,1,not_recom,not_recom,True,False,False
3,usual,1,complete,1,convenient,convenient,2,recommended,recommend,False,True,False
4,usual,1,complete,1,convenient,convenient,2,priority,priority,False,True,False
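
A quick sanity check for any One Hot Encoding is that every row has exactly one active column, and that the original label can be recovered from it. The sketch below illustrates both checks on a hypothetical four-row series with the three `social` values; it is an optional verification step, not part of the exercise.

```python
import pandas as pd

# Hypothetical series with the three values of the social column.
social = pd.Series(["nonprob", "nonprob", "slightly_prob", "problematic"])
dummies = pd.get_dummies(social, prefix="social", dtype=int)

# Each row of a valid one-hot encoding sums to exactly 1.
row_sums = dummies.sum(axis=1)

# idxmax returns the name of the active column; removing the prefix
# gives back the original label.
recovered = dummies.idxmax(axis=1).str.replace("social_", "", regex=False)
print(recovered.tolist())
```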
