<img src="https://raw.githubusercontent.com/andre-marcos-perez/ebac-course-utils/main/media/logo/newebac_logo_black_half.png" alt="ebac-logo">

---

# **Module** | Data Analysis: Machine Learning Fundamentals
**Exercise** Notebook<br>
Professor [André Perez](https://www.linkedin.com/in/andremarcosperez/)

---

# **Topics**

<ol type="1">
  <li>Theory;</li>
  <li>Categorical attributes;</li>
  <li>Numeric attributes;</li>
  <li>Missing data.</li>
</ol>

---

# **Exercises**

## 1\. Penguins

In this exercise, we will use a dataset with information about penguins. The goal is to prepare the dataset to predict the penguin's species (response variable) from its physical and geographic characteristics (predictor variables).

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
data = sns.load_dataset('penguins')

In [3]:
data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


### **1.1. Null values**

The dataset has missing values; use the concepts from the lesson to handle them.

In [5]:
# answer to question 1.1

data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [6]:
data.isna().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [7]:
data = data.dropna(how='any', axis=0)
data.isna().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64
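
Dropping rows is only one way to handle missing values. As a minimal sketch on a small made-up frame (the values are illustrative and not taken from the penguins dataset), the snippet below compares `dropna` with a simple imputation: the median for a numeric column and the most frequent value for a categorical one. Whether imputation suits this exercise is a judgment call, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 3250.0, 3450.0],
    "sex": ["Male", "Female", None, "Female"],
})

# option A: discard every row that has at least one missing value
dropped = toy.dropna(how="any", axis=0)

# option B: keep all rows and fill the gaps instead
imputed = toy.copy()
imputed["body_mass_g"] = imputed["body_mass_g"].fillna(imputed["body_mass_g"].median())
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])

print(len(toy), len(dropped), len(imputed))
print(imputed.isna().sum().sum())
```

Option A shrinks the frame, while option B keeps every row at the cost of introducing estimated values.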

### **1.2. Numeric variables**

Identify the numeric variables and create a new column **standardizing** their values. The new column must have the same name as the original column plus the suffix "*_std*".

> **Note**: You must not transform the response variable.

In [4]:
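# answer to question 1.2

def padronizar_coluna(data, nome_coluna):
    """Add a `<nome_coluna>_std` column holding the z-score of `nome_coluna`."""

    media = data[nome_coluna].mean()
    desvio_padrao = data[nome_coluna].std()  # pandas default: sample standard deviation (ddof=1)

    # vectorized form of (x - media) / desvio_padrao applied to every value
    data[f"{nome_coluna}_std"] = (data[nome_coluna] - media) / desvio_padrao

    return data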

In [9]:
# apply the function to each numeric column
data = padronizar_coluna(data, 'bill_length_mm')
data = padronizar_coluna(data, 'bill_depth_mm')
data = padronizar_coluna(data, 'body_mass_g')
data = padronizar_coluna(data, 'flipper_length_mm')

data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,bill_length_mm_std,bill_depth_mm_std,body_mass_g_std,flipper_length_mm_std
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,-0.894695,0.779559,-0.567621,-1.424608
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,-0.821552,0.119404,-0.505525,-1.067867
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,-0.675264,0.424091,-1.188572,-0.425733
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,-1.333559,1.084246,-0.940192,-0.568429
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,-0.858123,1.7444,-0.691811,-0.782474
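
A quick way to sanity-check a standardization is to confirm that the new column has mean close to 0 and standard deviation close to 1. The sketch below does this on a made-up column (hypothetical values, not the penguins data), using the same formula as `padronizar_coluna`. Note that pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), so the check uses that same default.

```python
import pandas as pd

# toy column (hypothetical values, for illustration only)
toy = pd.DataFrame({"body_mass_g": [3000.0, 3500.0, 4000.0, 4500.0, 5000.0]})

media = toy["body_mass_g"].mean()
desvio_padrao = toy["body_mass_g"].std()  # sample std (ddof=1)
toy["body_mass_g_std"] = (toy["body_mass_g"] - media) / desvio_padrao

# a standardized column should have mean ~0 and standard deviation ~1
mean_std = toy["body_mass_g_std"].mean()
std_std = toy["body_mass_g_std"].std()
print(abs(mean_std) < 1e-9, abs(std_std - 1.0) < 1e-9)
```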


### **1.3. Categorical variables**

Identify the nominal and ordinal categorical variables and create a new column applying the appropriate encoding technique to their values. The new column must have the same name as the original column plus the suffix "*_nom*" or "*_ord*".

> **Note**: You must not transform the response variable.

In [10]:
# answer to question 1.3
# nominal categorical attribute
data[['sex']].head()

# encode with the technique called 'one hot encoding': one binary column per category
data['sex_m_nom'] = data['sex'].apply(lambda sex: 1 if sex == 'Male' else 0)
data['sex_f_nom'] = data['sex'].apply(lambda sex: 1 if sex == 'Female' else 0)

data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,bill_length_mm_std,bill_depth_mm_std,body_mass_g_std,flipper_length_mm_std,sex_m_nom,sex_f_nom
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,-0.894695,0.779559,-0.567621,-1.424608,1,0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,-0.821552,0.119404,-0.505525,-1.067867,0,1
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,-0.675264,0.424091,-1.188572,-0.425733,0,1
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,-1.333559,1.084246,-0.940192,-0.568429,0,1
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,-0.858123,1.7444,-0.691811,-0.782474,1,0
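
The same one-hot encoding can also be produced with `pd.get_dummies`, which creates one binary column per category. The sketch below uses a made-up `sex` column (hypothetical values) and renames the generated columns to follow the notebook's `_nom` naming; the rename step is an assumption added only to match that convention.

```python
import pandas as pd

# toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})

# one binary column per category: sex_Female, sex_Male
dummies = pd.get_dummies(toy["sex"], prefix="sex", dtype=int)

# rename to match the notebook's naming convention
dummies = dummies.rename(columns={"sex_Male": "sex_m_nom", "sex_Female": "sex_f_nom"})

toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

With only two categories the two columns are redundant (each is 1 minus the other), so some models need just one of them.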


In [12]:
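# ordinal categorical attribute
data[["island"]].head()

data['island'].drop_duplicates()

# rank the islands present in the "island" attribute
# note: this order is an arbitrary choice of this answer; the dataset itself defines no ranking among islands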
ilhas_pinguim = {
    "Torgersen": 3,
    "Biscoe": 2,
    "Dream": 1,
}

data["island_ord"] = data["island"].apply(lambda x: ilhas_pinguim[x])

data.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,bill_length_mm_std,bill_depth_mm_std,body_mass_g_std,flipper_length_mm_std,sex_m_nom,sex_f_nom,island_ord
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,-0.894695,0.779559,-0.567621,-1.424608,1,0,3
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,-0.821552,0.119404,-0.505525,-1.067867,0,1,3
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,-0.675264,0.424091,-1.188572,-0.425733,0,1,3
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,-1.333559,1.084246,-0.940192,-0.568429,0,1,3
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male,-0.858123,1.7444,-0.691811,-0.782474,1,0,3
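
An ordinal encoding can also be expressed with an ordered `pd.Categorical`, which makes the assumed order explicit in one place. The sketch below uses made-up rows and mirrors the ranking chosen in the `ilhas_pinguim` dictionary (Dream = 1, Biscoe = 2, Torgersen = 3). That ranking is an assumption of the answer, not a property of the dataset.

```python
import pandas as pd

# toy data (hypothetical rows, for illustration only)
toy = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe"]})

# assumed order, lowest to highest
ordem = ["Dream", "Biscoe", "Torgersen"]
island_cat = pd.Categorical(toy["island"], categories=ordem, ordered=True)

# category codes start at 0, so add 1 to obtain ranks 1..3
toy["island_ord"] = island_cat.codes + 1
print(toy)
```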


### **1.4. Cleanup**

Drop the original columns and keep only the response variable and the predictor variables with the suffixes "*_std*", "*_nom*" and "*_ord*".

In [13]:
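# answer to question 1.4
# assign the result back: drop returns a new DataFrame and leaves `data` unchanged otherwise
data = data.drop(["island", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex"], axis=1)
data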


Unnamed: 0,species,bill_length_mm_std,bill_depth_mm_std,body_mass_g_std,flipper_length_mm_std,sex_m_nom,sex_f_nom,island_ord
0,Adelie,-0.894695,0.779559,-0.567621,-1.424608,1,0,3
1,Adelie,-0.821552,0.119404,-0.505525,-1.067867,0,1,3
2,Adelie,-0.675264,0.424091,-1.188572,-0.425733,0,1,3
4,Adelie,-1.333559,1.084246,-0.940192,-0.568429,0,1,3
5,Adelie,-0.858123,1.744400,-0.691811,-0.782474,1,0,3
...,...,...,...,...,...,...,...,...
338,Gentoo,0.586470,-1.759497,0.891616,0.929884,0,1,2
340,Gentoo,0.513326,-1.454811,0.798473,1.001232,0,1,2
341,Gentoo,1.171621,-0.743875,1.916186,1.500670,1,0,2
342,Gentoo,0.220750,-1.200905,1.233139,0.787187,0,1,2
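
Instead of listing every column to drop, the columns to keep can be selected by name pattern with `DataFrame.filter`. The sketch below runs on a made-up frame (hypothetical values) and keeps `species` plus any column ending in `_std`, `_nom` or `_ord`; the regular expression is one possible way to write that rule.

```python
import pandas as pd

# toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Torgersen", "Biscoe"],
    "body_mass_g": [3750.0, 5000.0],
    "body_mass_g_std": [-0.7, 0.7],
    "sex_m_nom": [1, 0],
    "island_ord": [3, 2],
})

# keep the response variable and every column with one of the three suffixes
limpo = toy.filter(regex=r"^species$|_(std|nom|ord)$")
print(list(limpo.columns))
```

This approach keeps working if more encoded columns are added later, as long as they follow the same suffix convention.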


---