<img src="https://raw.githubusercontent.com/andre-marcos-perez/ebac-course-utils/main/media/logo/newebac_logo_black_half.png" alt="ebac-logo">

---

# **Module** | Data Analysis: Machine Learning Fundamentals
**Exercise** Notebook<br> 
Instructor [André Perez](https://www.linkedin.com/in/andremarcosperez/)

---

# **Topics**

<ol type="1">
  <li>Theory;</li>
  <li>Categorical attributes;</li>
  <li>Numeric attributes;</li>
  <li>Missing data.</li>
</ol>

---

# **Exercises**

## 1\. Penguins

In this exercise, we will use a dataset with information about penguins. The goal is to prepare the dataset for predicting a penguin's species (response variable) from its physical and geographic characteristics (predictor variables).

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
data = sns.load_dataset('penguins')

In [3]:
data.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


### **1.1. Null values**

The dataset has missing values; use the concepts from the lesson to handle them.

In [6]:
data.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64
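
Absolute counts are easier to judge next to the share of rows they represent (in the output above, `sex` is missing in 11 of 344 rows, roughly 3%). A minimal sketch of that calculation, using a small hypothetical frame named `toy` rather than the penguins data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset.
toy = pd.DataFrame({
    'bill_length_mm': [39.1, np.nan, 40.3, 36.7],
    'sex': ['Male', None, None, 'Female'],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_share = toy.isnull().mean()
print(missing_share)
```

The same expression applied to `data` would give the per-column fraction for the real dataset.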

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [9]:
data.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


**Note on question 1.1**:

Ideally, the following transformations could be applied:

```
data['bill_length_mm'] = data['bill_length_mm'].fillna(data['bill_length_mm'].mean())
data['bill_depth_mm'] = data['bill_depth_mm'].fillna(data['bill_depth_mm'].mean())
data['flipper_length_mm'] = data['flipper_length_mm'].fillna(data['flipper_length_mm'].mean())
data['body_mass_g'] = data['body_mass_g'].fillna(data['body_mass_g'].median())  # median, because of the large std
data = data.dropna(subset=['sex'])  # drop the rows where sex is missing
```

However, since training needs complete rows, it may be simpler to drop every row that has a missing value. Doing so leaves 333 of the 344 rows, exactly the number of rows where `sex` is present, so the numeric imputation above would not have preserved any additional rows:


In [27]:
data.dropna(inplace = True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB
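
The trade-off between imputing and dropping can be seen on a small example. The sketch below uses a hypothetical frame named `toy` (not the penguins data) and compares how many rows each strategy keeps:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the penguins dataset.
toy = pd.DataFrame({
    'body_mass_g': [3000.0, 4000.0, np.nan, 5000.0],
    'sex': ['Male', 'Female', 'Male', None],
})

# Strategy A: impute the numeric column, then drop rows still missing 'sex'.
imputed = toy.copy()
imputed['body_mass_g'] = imputed['body_mass_g'].fillna(imputed['body_mass_g'].median())
imputed = imputed.dropna(subset=['sex'])

# Strategy B: drop every row that has any missing value.
dropped = toy.dropna()

print(len(imputed), len(dropped))
```

Imputation keeps more rows only when the numeric gaps fall in rows that are otherwise complete, as in this toy frame.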


### **1.2. Numeric variables**

Identify the numeric variables and create a new column with their **standardized** values. The new column must have the same name as the original column plus the suffix "*_std*".

> **Note**: Do not transform the response variable.
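
For reference, standardization (the z-score) is usually written as below, where $x_i$ is a value of the column, $\bar{x}$ the column mean, and $s$ the column standard deviation. The symbols are notation chosen here, not taken from the exercise:

$$
z_i = \frac{x_i - \bar{x}}{s}
$$

After this transformation the column has mean 0 and standard deviation 1, provided the same definition of $s$ is used for the transformation and for the check.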

In [34]:
# answer to question 1.2
# Vectorized z-score: subtract the column mean and divide by the column std.
for col in ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']:
    data[f'{col}_std'] = (data[col] - data[col].mean()) / data[col].std()

data.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | bill_length_mm_std | bill_depth_mm_std | flipper_length_mm_std | body_mass_g_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | -0.894695 | 0.779559 | -1.424608 | -0.567621 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | -0.821552 | 0.119404 | -1.067867 | -0.505525 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | -0.675264 | 0.424091 | -0.425733 | -1.188572 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | -1.333559 | 1.084246 | -0.568429 | -0.940192 |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male | -0.858123 | 1.7444 | -0.782474 | -0.691811 |


> **Check**

In [35]:
print(f"Mean: {data['bill_length_mm_std'].mean()}")
print(f"Std: {data['bill_length_mm_std'].std()}")

Mean: -2.379811396328639e-15
Std: 1.0000000000000004
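
The check returns a standard deviation of 1 because pandas' `.std()` uses the sample standard deviation (`ddof=1`) both when standardizing and when checking. NumPy's `np.std` defaults to the population version (`ddof=0`), so mixing the two gives a result slightly different from 1. A minimal sketch on hypothetical values (not the penguins data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not the penguins dataset.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # pandas default: ddof=1
population_std = values.std(ddof=0)  # matches np.std's default

z_sample = (values - values.mean()) / sample_std
z_population = (values - values.mean()) / population_std

# Each version has unit spread only under its own definition of std.
print(z_sample.std(), z_population.std(ddof=0))
```

Either convention is acceptable for this exercise as long as it is applied consistently.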


### **1.3. Categorical variables**

Identify the nominal and ordinal categorical variables and create a new column by applying the appropriate encoding technique to their values. The new column must have the same name as the original column plus the suffix "*_nom*" or "*_ord*".

> **Note**: Do not transform the response variable.
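
Nominal categories have no inherent order and are usually one-hot encoded, while ordinal categories carry an order that integer codes can preserve. The sketch below illustrates the ordinal case with a hypothetical `size` column, which does not exist in the penguins dataset:

```python
import pandas as pd

# Hypothetical column; the penguins dataset has no such attribute.
toy = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# The order is stated explicitly so the integer codes respect it.
size_order = {'small': 0, 'medium': 1, 'large': 2}
toy['size_ord'] = toy['size'].map(size_order)

print(toy)
```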

In [41]:
data['island'].drop_duplicates()

0     Torgersen
20       Biscoe
30        Dream
Name: island, dtype: object

In [44]:
# answer to question 1.3
# One 0/1 indicator column per category (one-hot encoding).
data['island_nom_Torgersen'] = (data['island'] == 'Torgersen').astype(int)
data['island_nom_Biscoe'] = (data['island'] == 'Biscoe').astype(int)
data['island_nom_Dream'] = (data['island'] == 'Dream').astype(int)
data['sex_nom_M'] = (data['sex'] == 'Male').astype(int)
data['sex_nom_F'] = (data['sex'] == 'Female').astype(int)

In [50]:
data.head()

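| | species | bill_length_mm_std | bill_depth_mm_std | flipper_length_mm_std | body_mass_g_std | island_nom_Torgersen | island_nom_Biscoe | island_nom_Dream | sex_nom_M | sex_nom_F |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | -0.894695 | 0.779559 | -1.424608 | -0.567621 | 1 | 0 | 0 | 1 | 0 |
| 1 | Adelie | -0.821552 | 0.119404 | -1.067867 | -0.505525 | 1 | 0 | 0 | 0 | 1 |
| 2 | Adelie | -0.675264 | 0.424091 | -0.425733 | -1.188572 | 1 | 0 | 0 | 0 | 1 |
| 4 | Adelie | -1.333559 | 1.084246 | -0.568429 | -0.940192 | 1 | 0 | 0 | 0 | 1 |
| 5 | Adelie | -0.858123 | 1.7444 | -0.782474 | -0.691811 | 1 | 0 | 0 | 1 | 0 |

> **Note**: This cell ran after the cleanup in 1.4 (execution counter 50 versus 49), so the original columns no longer appear in the output.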
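
The indicator columns can also be generated in one call with `pd.get_dummies`. The sketch below is an alternative, not the notebook's answer; it uses a hypothetical frame named `toy` that reuses the dataset's column names and categories, and it names the `sex` columns after the full category (`sex_nom_Male`) rather than `sex_nom_M`:

```python
import pandas as pd

# Hypothetical toy rows that reuse the dataset's column names and categories.
toy = pd.DataFrame({
    'island': ['Torgersen', 'Biscoe', 'Dream'],
    'sex': ['Male', 'Female', 'Female'],
})

# One 0/1 column per category; prefixes follow the exercise's "_nom" naming.
encoded = pd.get_dummies(
    toy,
    columns=['island', 'sex'],
    prefix=['island_nom', 'sex_nom'],
    dtype=int,
)

print(encoded.columns.tolist())
```

The encoded columns replace `island` and `sex` in the result, so the originals would not need to be dropped separately.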


### **1.4. Cleanup**

Drop the original columns and keep only the response variable and the predictor variables with the suffix "*_std*", "*_nom*", or "*_ord*".

In [47]:
data.columns

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'bill_length_mm_std',
       'bill_depth_mm_std', 'flipper_length_mm_std', 'body_mass_g_std',
       'island_nom_Torgersen', 'island_nom_Biscoe', 'island_nom_Dream',
       'sex_nom_M', 'sex_nom_F'],
      dtype='object')

In [49]:
# answer to question 1.4
data = data[['species', 'bill_length_mm_std',
    'bill_depth_mm_std', 'flipper_length_mm_std', 'body_mass_g_std',
    'island_nom_Torgersen', 'island_nom_Biscoe', 'island_nom_Dream',
    'sex_nom_M', 'sex_nom_F']]

data.head()

| | species | bill_length_mm_std | bill_depth_mm_std | flipper_length_mm_std | body_mass_g_std | island_nom_Torgersen | island_nom_Biscoe | island_nom_Dream | sex_nom_M | sex_nom_F |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | -0.894695 | 0.779559 | -1.424608 | -0.567621 | 1 | 0 | 0 | 1 | 0 |
| 1 | Adelie | -0.821552 | 0.119404 | -1.067867 | -0.505525 | 1 | 0 | 0 | 0 | 1 |
| 2 | Adelie | -0.675264 | 0.424091 | -0.425733 | -1.188572 | 1 | 0 | 0 | 0 | 1 |
| 4 | Adelie | -1.333559 | 1.084246 | -0.568429 | -0.940192 | 1 | 0 | 0 | 0 | 1 |
| 5 | Adelie | -0.858123 | 1.7444 | -0.782474 | -0.691811 | 1 | 0 | 0 | 1 | 0 |
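
Listing the columns by hand works for a small frame; selecting them by their naming markers avoids retyping the list when columns change. A minimal sketch on a hypothetical frame named `toy` (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame with a mix of original and derived columns.
toy = pd.DataFrame({
    'species': ['Adelie'],
    'island': ['Torgersen'],
    'body_mass_g': [3750.0],
    'body_mass_g_std': [-0.57],
    'island_nom_Torgersen': [1],
})

markers = ('_std', '_nom', '_ord')
# Keep the response plus any column whose name contains one of the markers.
# "in" is used instead of endswith because one-hot names append the category.
keep = ['species'] + [c for c in toy.columns if any(m in c for m in markers)]
cleaned = toy[keep]

print(cleaned.columns.tolist())
```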


---