# 1st Machine Learning Assessment - 12/11/2020



   * Name: Itamar de Paiva Rocha Filho
   * Student ID: 20180026510

### Dataset

Go to [https://tmfilho.github.io/akcdata/](https://tmfilho.github.io/akcdata/) and download the dataset on the 277 dog breeds registered with the _American Kennel Club_. The dataset contains 20 attributes of different types, including textual, numeric, and categorical attributes. Height, weight, and life expectancy appear as \[min, max\] intervals. The table below reproduces the table from the site.


|  #  | attribute | description | example |
| --- | ------ | ----------- | ------- |
|  1  | description | 1 to 3 paragraphs describing the breed | "Akitas are burly, heavy-boned spitz-type dogs of imposing stature..." |
|  2  | temperament | the breed's temperament described in keywords | "Confident, Famously Funny, Fearless" |
|  3  | popularity  | popularity ranking of the breed (1-195) | 13 |
|  4  | min_height  | minimum height in cm | 60.96 |
|  5  | max_height  | maximum height in cm | 71.12 |
|  6  | min_weight  | minimum weight in kg | 3.18 |
|  7  | max_weight  | maximum weight in kg | 4.54 |
|  8  | min_expectancy  | minimum life expectancy in years | 10 |
|  9  | max_expectancy  | maximum life expectancy in years | 12 |
|  10  | group  | one of 9 groups designated by the AKC (7 main groups and 2 extra ones for breeds that are new or do not yet fit a classic group) | "Herding Group" |
|  11  | grooming_frequency_value  | a number representing how often the coat needs grooming | 0.4 |
|  12  | grooming_frequency_category  | categorization of grooming frequency | "Weekly Brushing" |
|  13  | shedding_value  | a number representing how often the breed sheds | 0.6 |
|  14  | shedding_category  | categorization of shedding | "Seasonal" |
|  15  | energy_level_value  | a number representing the breed's energy level | 0.8 |
|  16  | energy_level_category  | categorization of energy level | "Energetic" |
|  17  | trainability_value  | a number representing how easy the breed is to train | 1.0 |
|  18  | trainability_category  | categorization of trainability | "Eager to Please" |
|  19  | demeanor_value  | a number representing the breed's reaction to strangers and other animals | 0.6 |
|  20  | demeanor_category  | categorization of the breed's reaction to strangers and other animals | "Alert/Responsive" |


## Exam solution

### Importing the required libraries

In [1]:
import pandas as pd
import numpy as np

### Reading and inspecting the data

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/tmfilho/akcdata/master/data/akc-data-latest.csv", index_col=0)
df.head(3)

Unnamed: 0,description,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,grooming_frequency_category,shedding_value,shedding_category,energy_level_value,energy_level_category,trainability_value,trainability_category,demeanor_value,demeanor_category
Affenpinscher,The Affen’s apish look has been described many...,"Confident, Famously Funny, Fearless",148,22.86,29.21,3.175147,4.535924,12.0,15.0,Toy Group,0.6,2-3 Times a Week Brushing,0.6,Seasonal,0.6,Regular Exercise,0.8,Easy Training,1.0,Outgoing
Afghan Hound,"The Afghan Hound is an ancient breed, his whol...","Dignified, Profoundly Loyal, Aristocratic",113,63.5,68.58,22.679619,27.215542,12.0,15.0,Hound Group,0.8,Daily Brushing,0.2,Infrequent,0.8,Energetic,0.2,May be Stubborn,0.2,Aloof/Wary
Airedale Terrier,The Airedale Terrier is the largest of all ter...,"Friendly, Clever, Courageous",60,58.42,58.42,22.679619,31.751466,11.0,14.0,Terrier Group,0.6,2-3 Times a Week Brushing,0.4,Occasional,0.6,Regular Exercise,1.0,Eager to Please,0.8,Friendly


### Data analysis

* Listing the columns in the dataset

In [3]:
df.columns

Index(['description', 'temperament', 'popularity', 'min_height', 'max_height',
       'min_weight', 'max_weight', 'min_expectancy', 'max_expectancy', 'group',
       'grooming_frequency_value', 'grooming_frequency_category',
       'shedding_value', 'shedding_category', 'energy_level_value',
       'energy_level_category', 'trainability_value', 'trainability_category',
       'demeanor_value', 'demeanor_category'],
      dtype='object')

* Checking data types and missing values

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277 entries, Affenpinscher to Yorkshire Terrier
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   description                  277 non-null    object 
 1   temperament                  276 non-null    object 
 2   popularity                   198 non-null    object 
 3   min_height                   277 non-null    float64
 4   max_height                   277 non-null    float64
 5   min_weight                   275 non-null    float64
 6   max_weight                   275 non-null    float64
 7   min_expectancy               274 non-null    float64
 8   max_expectancy               274 non-null    float64
 9   group                        277 non-null    object 
 10  grooming_frequency_value     270 non-null    float64
 11  grooming_frequency_category  270 non-null    object 
 12  shedding_value               257 non-null    float64
 13 

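The output above shows that some breeds have missing values. The `popularity` column also has dtype `object` because a few entries hold the string "of" instead of a number, so those entries are replaced with NaN and the column is cast to float.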

In [5]:
df.loc[df.popularity == "of", "popularity"] = np.nan

In [6]:
df.popularity = df.popularity.astype("float64")
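The two cells above can also be written as a single coercion step. The sketch below uses a small made-up series, not the real column, and assumes that any non-numeric entry should become NaN rather than only the literal "of".

```python
import pandas as pd

# Toy stand-in for the raw `popularity` column: numeric strings plus the stray "of".
raw = pd.Series(["148", "113", "of", None, "60"], name="popularity")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
popularity = pd.to_numeric(raw, errors="coerce")

print(popularity.isna().sum())  # how many entries became NaN
print(popularity.dtype)
```

This is broader than matching one known bad value, which is convenient for messy columns but can hide unexpected strings, so checking the distinct non-numeric values first is still worthwhile.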

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 277 entries, Affenpinscher to Yorkshire Terrier
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   description                  277 non-null    object 
 1   temperament                  276 non-null    object 
 2   popularity                   195 non-null    float64
 3   min_height                   277 non-null    float64
 4   max_height                   277 non-null    float64
 5   min_weight                   275 non-null    float64
 6   max_weight                   275 non-null    float64
 7   min_expectancy               274 non-null    float64
 8   max_expectancy               274 non-null    float64
 9   group                        277 non-null    object 
 10  grooming_frequency_value     270 non-null    float64
 11  grooming_frequency_category  270 non-null    object 
 12  shedding_value               257 non-null    float64
 13 

* Correlation between columns

In [8]:
corr = df.select_dtypes(include="number").corr()  # numeric columns only; recent pandas raises on text columns
corr.style.background_gradient(cmap="coolwarm")

Unnamed: 0,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,grooming_frequency_value,shedding_value,energy_level_value,trainability_value,demeanor_value
popularity,1.0,0.048636,0.064344,-0.020344,-0.054358,0.139584,0.051369,-0.175239,-0.120208,0.064634,-0.250075,-0.252602
min_height,0.048636,1.0,0.935146,0.815263,0.792409,-0.460569,-0.463709,-0.186637,0.285123,0.174194,-0.049236,-0.239074
max_height,0.064344,0.935146,1.0,0.79301,0.806405,-0.441227,-0.467058,-0.244199,0.320306,0.19856,-0.051842,-0.294494
min_weight,-0.020344,0.815263,0.79301,1.0,0.947519,-0.541745,-0.551445,-0.138691,0.284724,-0.072179,-0.132576,-0.276253
max_weight,-0.054358,0.792409,0.806405,0.947519,1.0,-0.513515,-0.522455,-0.11624,0.281833,-0.076534,-0.075793,-0.27365
min_expectancy,0.139584,-0.460569,-0.441227,-0.541745,-0.513515,1.0,0.765434,0.028934,-0.251795,-0.000183,0.091815,0.181565
max_expectancy,0.051369,-0.463709,-0.467058,-0.551445,-0.522455,0.765434,1.0,0.127438,-0.356471,0.012556,0.094136,0.168534
grooming_frequency_value,-0.175239,-0.186637,-0.244199,-0.138691,-0.11624,0.028934,0.127438,1.0,-0.070256,-0.151441,0.098834,0.112278
shedding_value,-0.120208,0.285123,0.320306,0.284724,0.281833,-0.251795,-0.356471,-0.070256,1.0,0.173437,0.043577,-0.179746
energy_level_value,0.064634,0.174194,0.19856,-0.072179,-0.076534,-0.000183,0.012556,-0.151441,0.173437,1.0,0.119501,-0.003133


* Dropping columns that will not be used

In [9]:
df1 = df.copy()

In [10]:
df1 = df1.drop(columns=["description", "grooming_frequency_category", "shedding_category", "energy_level_category", "trainability_category", "demeanor_category"])

* Checking for missing data

In [11]:
print("Porcentagem de dados faltantes")
np.round(df1.isnull().sum()/df1.shape[0] * 100,2)

Porcentagem de dados faltantes


temperament                  0.36
popularity                  29.60
min_height                   0.00
max_height                   0.00
min_weight                   0.72
max_weight                   0.72
min_expectancy               1.08
max_expectancy               1.08
group                        0.00
grooming_frequency_value     2.53
shedding_value               7.22
energy_level_value           2.17
trainability_value           8.66
demeanor_value               9.03
dtype: float64

* Imputing missing numeric values with KNNImputer

In [12]:
from sklearn.impute import KNNImputer

In [13]:
float_columns = ["popularity", "min_height", "max_height", "min_weight", "max_weight", "min_expectancy", "max_expectancy", "grooming_frequency_value", "shedding_value", "energy_level_value", "trainability_value", "demeanor_value"]

In [14]:
imputer = KNNImputer(n_neighbors=3, weights="uniform")
transformed = pd.DataFrame(imputer.fit_transform(df1[float_columns]), columns=float_columns, index=df1.index)

In [15]:
df1[float_columns] = transformed
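To make the imputation step concrete, here is a minimal sketch on a made-up 4x2 matrix (the values are hypothetical, not taken from the dataset). It shows how `KNNImputer` fills one gap with the mean of the nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix (hypothetical values): the second row is missing its second feature.
toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [50.0, 500.0],
])

# With 2 neighbours, the gap is filled with the mean of that feature over the
# two rows closest to the incomplete row, measured on the features it does have.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
filled = imputer.fit_transform(toy)

print(filled[1, 1])
```

Because the neighbours are chosen by distance, columns on a large scale (such as `popularity`) weigh more than columns in the 0 to 1 range. Scaling before imputing is one way to balance this, though the notebook does not do so.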

In [16]:
df1

Unnamed: 0,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,shedding_value,energy_level_value,trainability_value,demeanor_value
Affenpinscher,"Confident, Famously Funny, Fearless",148.000000,22.86,29.21,3.175147,4.535924,12.0,15.0,Toy Group,0.6,0.6,0.6,0.8,1.0
Afghan Hound,"Dignified, Profoundly Loyal, Aristocratic",113.000000,63.50,68.58,22.679619,27.215542,12.0,15.0,Hound Group,0.8,0.2,0.8,0.2,0.2
Airedale Terrier,"Friendly, Clever, Courageous",60.000000,58.42,58.42,22.679619,31.751466,11.0,14.0,Terrier Group,0.6,0.4,0.6,1.0,0.8
Akita,"Courageous, Dignified, Profoundly Loyal",47.000000,60.96,71.12,31.751466,58.967008,10.0,13.0,Working Group,0.8,0.6,0.8,1.0,0.6
Alaskan Malamute,"Affectionate, Loyal, Playful",58.000000,58.42,63.50,34.019428,38.555351,10.0,14.0,Working Group,0.6,0.6,0.8,0.4,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wirehaired Vizsla,"Gentle, Loyal, Trainable",167.000000,54.61,63.50,20.411657,29.483504,12.0,14.0,Sporting Group,0.2,0.6,0.8,0.6,0.6
Working Kelpie,"Alert, Eager, Intelligent",84.000000,48.26,63.50,12.700586,27.215542,12.0,15.0,Foundation Stock Service,0.2,0.6,0.8,0.4,0.6
Xoloitzcuintli,"Loyal, Alert, Calm",140.000000,25.40,58.42,4.535924,24.947580,13.0,18.0,Non-Sporting Group,0.2,0.2,0.8,0.6,0.6
Yakutian Laika,"Affectionate, Intelligent, Active",57.333333,53.34,58.42,18.143695,24.947580,10.0,12.0,Foundation Stock Service,0.4,0.6,0.8,0.2,0.4


In [17]:
print("Porcentagem de dados faltantes")
np.round(df1.isnull().sum()/df1.shape[0] * 100,2)

Porcentagem de dados faltantes


temperament                 0.36
popularity                  0.00
min_height                  0.00
max_height                  0.00
min_weight                  0.00
max_weight                  0.00
min_expectancy              0.00
max_expectancy              0.00
group                       0.00
grooming_frequency_value    0.00
shedding_value              0.00
energy_level_value          0.00
trainability_value          0.00
demeanor_value              0.00
dtype: float64

In [18]:
categorical_missing_features = ["temperament"]

In [19]:
df1[categorical_missing_features].mode()

Unnamed: 0,temperament
0,"Friendly, Smart, Willing to Please"


In [20]:
df2 = df1.copy()

In [21]:
for column in categorical_missing_features:
    df2[column] = df1[column].fillna(df1[column].mode()[0])

In [22]:
print("Porcentagem de dados faltantes")
np.round(df2.isnull().sum()/df2.shape[0] * 100,2)

Porcentagem de dados faltantes


temperament                 0.0
popularity                  0.0
min_height                  0.0
max_height                  0.0
min_weight                  0.0
max_weight                  0.0
min_expectancy              0.0
max_expectancy              0.0
group                       0.0
grooming_frequency_value    0.0
shedding_value              0.0
energy_level_value          0.0
trainability_value          0.0
demeanor_value              0.0
dtype: float64

The `temperament` column holds a different value for almost every breed.

In [23]:
df2.temperament

Affenpinscher              Confident, Famously Funny, Fearless
Afghan Hound         Dignified, Profoundly Loyal, Aristocratic
Airedale Terrier                  Friendly, Clever, Courageous
Akita                  Courageous, Dignified, Profoundly Loyal
Alaskan Malamute                  Affectionate, Loyal, Playful
                                       ...                    
Wirehaired Vizsla                     Gentle, Loyal, Trainable
Working Kelpie                       Alert, Eager, Intelligent
Xoloitzcuintli                              Loyal, Alert, Calm
Yakutian Laika               Affectionate, Intelligent, Active
Yorkshire Terrier           Affectionate, Sprightly, Tomboyish
Name: temperament, Length: 277, dtype: object

In [24]:
df2

Unnamed: 0,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,shedding_value,energy_level_value,trainability_value,demeanor_value
Affenpinscher,"Confident, Famously Funny, Fearless",148.000000,22.86,29.21,3.175147,4.535924,12.0,15.0,Toy Group,0.6,0.6,0.6,0.8,1.0
Afghan Hound,"Dignified, Profoundly Loyal, Aristocratic",113.000000,63.50,68.58,22.679619,27.215542,12.0,15.0,Hound Group,0.8,0.2,0.8,0.2,0.2
Airedale Terrier,"Friendly, Clever, Courageous",60.000000,58.42,58.42,22.679619,31.751466,11.0,14.0,Terrier Group,0.6,0.4,0.6,1.0,0.8
Akita,"Courageous, Dignified, Profoundly Loyal",47.000000,60.96,71.12,31.751466,58.967008,10.0,13.0,Working Group,0.8,0.6,0.8,1.0,0.6
Alaskan Malamute,"Affectionate, Loyal, Playful",58.000000,58.42,63.50,34.019428,38.555351,10.0,14.0,Working Group,0.6,0.6,0.8,0.4,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wirehaired Vizsla,"Gentle, Loyal, Trainable",167.000000,54.61,63.50,20.411657,29.483504,12.0,14.0,Sporting Group,0.2,0.6,0.8,0.6,0.6
Working Kelpie,"Alert, Eager, Intelligent",84.000000,48.26,63.50,12.700586,27.215542,12.0,15.0,Foundation Stock Service,0.2,0.6,0.8,0.4,0.6
Xoloitzcuintli,"Loyal, Alert, Calm",140.000000,25.40,58.42,4.535924,24.947580,13.0,18.0,Non-Sporting Group,0.2,0.2,0.8,0.6,0.6
Yakutian Laika,"Affectionate, Intelligent, Active",57.333333,53.34,58.42,18.143695,24.947580,10.0,12.0,Foundation Stock Service,0.4,0.6,0.8,0.2,0.4


However, each value is a list of comma-separated keywords that follows a regular pattern. It can therefore be split into one column per keyword, indicating whether or not the breed has that trait:

In [25]:
df2.temperament["Affenpinscher"].split(", ")

['Confident', 'Famously Funny', 'Fearless']

In [26]:
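dict_charac = []  # temperament keywords, in order of first appearance
for characteristics, indexation in zip(df2.temperament, df2.index):
    try:
        for splits in characteristics.split(", "):
            # create the indicator column the first time a keyword is seen
            if splits not in df2.columns:
                dict_charac.append(splits)
                df2[splits] = 0
            df2.loc[indexation, splits] = 1
    except AttributeError:
        # non-string entries (e.g. NaN) have no .split(); report and skip them
        print(f"value : {characteristics} , index : {indexation}")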

In [27]:
df2.drop(columns=["temperament"], inplace=True)
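The same keyword-to-indicator expansion can be sketched with pandas' built-in `Series.str.get_dummies`. The example uses only three temperament strings copied from the rows shown earlier, so it illustrates the idea rather than replacing the loop above.

```python
import pandas as pd

# Three rows copied from the temperament column shown earlier.
temperament = pd.Series(
    {
        "Affenpinscher": "Confident, Famously Funny, Fearless",
        "Akita": "Courageous, Dignified, Profoundly Loyal",
        "Afghan Hound": "Dignified, Profoundly Loyal, Aristocratic",
    },
    name="temperament",
)

# One 0/1 column per distinct keyword found after splitting on ", ".
indicators = temperament.str.get_dummies(sep=", ")

print(indicators)
```

The result could then be joined back to the main frame by index. Unlike the loop, this version adds no columns one at a time, which avoids repeated frame growth.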

In [28]:
df2

Unnamed: 0,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,shedding_value,...,Keenly Alert,Good-Tempered,Canny,Deeply Devoted,Playful but also Work-Oriented. Very Active and Upbeat.,Courteous,Entertaining,Gregarious,Sprightly,Tomboyish
Affenpinscher,148.000000,22.86,29.21,3.175147,4.535924,12.0,15.0,Toy Group,0.6,0.6,...,0,0,0,0,0,0,0,0,0,0
Afghan Hound,113.000000,63.50,68.58,22.679619,27.215542,12.0,15.0,Hound Group,0.8,0.2,...,0,0,0,0,0,0,0,0,0,0
Airedale Terrier,60.000000,58.42,58.42,22.679619,31.751466,11.0,14.0,Terrier Group,0.6,0.4,...,0,0,0,0,0,0,0,0,0,0
Akita,47.000000,60.96,71.12,31.751466,58.967008,10.0,13.0,Working Group,0.8,0.6,...,0,0,0,0,0,0,0,0,0,0
Alaskan Malamute,58.000000,58.42,63.50,34.019428,38.555351,10.0,14.0,Working Group,0.6,0.6,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wirehaired Vizsla,167.000000,54.61,63.50,20.411657,29.483504,12.0,14.0,Sporting Group,0.2,0.6,...,0,0,0,0,0,0,0,0,0,0
Working Kelpie,84.000000,48.26,63.50,12.700586,27.215542,12.0,15.0,Foundation Stock Service,0.2,0.6,...,0,0,0,0,0,0,0,0,0,0
Xoloitzcuintli,140.000000,25.40,58.42,4.535924,24.947580,13.0,18.0,Non-Sporting Group,0.2,0.2,...,0,0,0,0,0,0,0,0,0,0
Yakutian Laika,57.333333,53.34,58.42,18.143695,24.947580,10.0,12.0,Foundation Stock Service,0.4,0.6,...,0,0,0,0,0,0,0,0,0,0


## Answers to the questions

<hr>

**Question 1 (1.0 point)** Is it possible to extract classification or regression tasks from this dataset? Which variables could be the targets in those tasks, and which would be the attributes/independent variables? **We have not covered language processing yet, so you may disregard the description variable**.

In [29]:
df1.head(3)

Unnamed: 0,temperament,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,shedding_value,energy_level_value,trainability_value,demeanor_value
Affenpinscher,"Confident, Famously Funny, Fearless",148.0,22.86,29.21,3.175147,4.535924,12.0,15.0,Toy Group,0.6,0.6,0.6,0.8,1.0
Afghan Hound,"Dignified, Profoundly Loyal, Aristocratic",113.0,63.5,68.58,22.679619,27.215542,12.0,15.0,Hound Group,0.8,0.2,0.8,0.2,0.2
Airedale Terrier,"Friendly, Clever, Courageous",60.0,58.42,58.42,22.679619,31.751466,11.0,14.0,Terrier Group,0.6,0.4,0.6,1.0,0.8


Yes, both classification and regression tasks can be extracted from this data.

**Regression**
For a regression task, predicting popularity makes the most sense, using the remaining columns as features. A regression model could also be built to predict any of the other numeric columns, although that is less meaningful for this dataset.

**Classification**
For a classification task, we could predict `group` or the temperament (the latter would need further refinement and possibly more data), using the remaining variables as features.

<hr>

**Question 2 (1.0 point)** How would you handle the categorical variables in these tasks? Apply that treatment and show the resulting DataFrame.

<hr>

One-hot encoding should be used for variables whose values have no inherent **order**, only a categorical split, as is the case for the *group* column. An integer or fractional numeric representation can be used for columns whose values are ordered, such as **energy_level_category** (which was dropped but is still represented by **energy_level_value**).
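As an illustration of the ordered case, the sketch below maps a category column to numbers with a dictionary. The three category/value pairs are the ones visible in the first three rows of the dataset (`trainability_category` and `trainability_value`). The full set of categories is not shown here, so the mapping is deliberately partial.

```python
import pandas as pd

# Category/value pairs taken from the first three rows of the dataset shown earlier.
trainability_order = {
    "May be Stubborn": 0.2,
    "Easy Training": 0.8,
    "Eager to Please": 1.0,
}

categories = pd.Series(
    ["Easy Training", "May be Stubborn", "Eager to Please"],
    index=["Affenpinscher", "Afghan Hound", "Airedale Terrier"],
    name="trainability_category",
)

# map() preserves the ordering information that one-hot encoding would discard.
encoded = categories.map(trainability_order)

print(encoded)
```

Any category missing from the dictionary would map to NaN, which makes gaps in the mapping easy to spot.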

In [30]:
from sklearn.preprocessing import OneHotEncoder

In [31]:
df2.head()

Unnamed: 0,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,group,grooming_frequency_value,shedding_value,...,Keenly Alert,Good-Tempered,Canny,Deeply Devoted,Playful but also Work-Oriented. Very Active and Upbeat.,Courteous,Entertaining,Gregarious,Sprightly,Tomboyish
Affenpinscher,148.0,22.86,29.21,3.175147,4.535924,12.0,15.0,Toy Group,0.6,0.6,...,0,0,0,0,0,0,0,0,0,0
Afghan Hound,113.0,63.5,68.58,22.679619,27.215542,12.0,15.0,Hound Group,0.8,0.2,...,0,0,0,0,0,0,0,0,0,0
Airedale Terrier,60.0,58.42,58.42,22.679619,31.751466,11.0,14.0,Terrier Group,0.6,0.4,...,0,0,0,0,0,0,0,0,0,0
Akita,47.0,60.96,71.12,31.751466,58.967008,10.0,13.0,Working Group,0.8,0.6,...,0,0,0,0,0,0,0,0,0,0
Alaskan Malamute,58.0,58.42,63.5,34.019428,38.555351,10.0,14.0,Working Group,0.6,0.6,...,0,0,0,0,0,0,0,0,0,0


In [32]:
ohe = OneHotEncoder()  # default sparse output, densified below; the `sparse` argument was removed in recent scikit-learn
grupos = pd.DataFrame(ohe.fit_transform(df2['group'].values.reshape(-1, 1)).toarray(), columns=list(ohe.categories_[0]), index=df2.index)

In [33]:
df3 = df2.drop(columns=["group"])

In [34]:
df3 = df3.join(grupos, how="outer")

In [35]:
df3

Unnamed: 0,popularity,min_height,max_height,min_weight,max_weight,min_expectancy,max_expectancy,grooming_frequency_value,shedding_value,energy_level_value,...,Tomboyish,Foundation Stock Service,Herding Group,Hound Group,Miscellaneous Class,Non-Sporting Group,Sporting Group,Terrier Group,Toy Group,Working Group
Affenpinscher,148.000000,22.86,29.21,3.175147,4.535924,12.0,15.0,0.6,0.6,0.6,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
Afghan Hound,113.000000,63.50,68.58,22.679619,27.215542,12.0,15.0,0.8,0.2,0.8,...,0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
Airedale Terrier,60.000000,58.42,58.42,22.679619,31.751466,11.0,14.0,0.6,0.4,0.6,...,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Akita,47.000000,60.96,71.12,31.751466,58.967008,10.0,13.0,0.8,0.6,0.8,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
Alaskan Malamute,58.000000,58.42,63.50,34.019428,38.555351,10.0,14.0,0.6,0.6,0.8,...,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wirehaired Vizsla,167.000000,54.61,63.50,20.411657,29.483504,12.0,14.0,0.2,0.6,0.8,...,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Working Kelpie,84.000000,48.26,63.50,12.700586,27.215542,12.0,15.0,0.2,0.6,0.8,...,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Xoloitzcuintli,140.000000,25.40,58.42,4.535924,24.947580,13.0,18.0,0.2,0.2,0.8,...,0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Yakutian Laika,57.333333,53.34,58.42,18.143695,24.947580,10.0,12.0,0.4,0.6,0.8,...,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<hr>

**Question 3 (1.0 point)** The variables are on different scales. How would you handle this so that it does not hurt the performance of models that are sensitive to scale? Apply that treatment and show the resulting DataFrame.

<hr>

Several preprocessing techniques could handle this. Here we use MinMaxScaler, which by default rescales each variable to the range 0 to 1.

In [36]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df4 = pd.DataFrame(scaler.fit_transform(df3), columns=df3.columns, index=df3.index)  # keep labels so the result is a DataFrame

In [37]:
df4.shape

(277, 161)
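For reference, min-max scaling replaces each value x in a column with (x - min) / (max - min). The short check below applies that formula by hand to a few height and popularity values copied from rows shown earlier and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# min_height (cm) and popularity values copied from rows shown earlier in the notebook.
values = np.array([
    [22.86, 148.0],
    [63.50, 113.0],
    [58.42, 60.0],
    [60.96, 47.0],
])

# Manual min-max scaling, column by column: (x - min) / (max - min).
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```

Each column ends up with its smallest value at 0 and its largest at 1, regardless of the original units.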

<hr>

**Question 4 (1.5 points)** Consider a task that involves predicting the popularity of a breed. Is this a regression or a classification task? Select attributes to use as independent variables in this task. Analyze variable importance by testing the coefficients of a linear fit and by using the importance values given by a decision tree model. Note that some values may be missing for certain breeds. **You may use relevant libraries. There is no need to implement anything from scratch**.

<hr>

Here we need to restore the popularity values as they were before imputation. Since we cannot be sure the imputed values are correct, they could negatively affect the model. We therefore discard the rows for which popularity was not provided.

In [50]:
df5 = df3.copy()

In [51]:
df5["popularity"] = df["popularity"]

In [52]:
df5.isnull().sum()

popularity            82
min_height             0
max_height             0
min_weight             0
max_weight             0
                      ..
Non-Sporting Group     0
Sporting Group         0
Terrier Group          0
Toy Group              0
Working Group          0
Length: 161, dtype: int64

In [53]:
df5 = df5.dropna()

In [63]:
y = df5["popularity"]

In [55]:
X = df5.drop(columns=["popularity"])

In [56]:
# scaling again
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X = scaler.fit_transform(X)

In [64]:
X.shape, y.shape

((195, 160), (195,))

The task can be framed as either regression or classification. With classification, however, predictions would be **restricted** to the classes present in the data, even though the ranking is defined from 1 to 195. Regression allows extrapolation beyond the observed values. The cells below show the maximum, the minimum, and the number of distinct values.

In [70]:
max(y), min(y)

(192.0, 1.0)

In [74]:
y.unique().shape[0]

190
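The importance analysis the question asks for can be sketched as follows. This is only an outline on synthetic data (195 rows, 5 features, with a target built from features 0 and 3), not a result for the real `X` and `y`. It compares coefficient magnitudes from a linear fit with `feature_importances_` from a decision tree. Formal hypothesis tests on the coefficients would need a statistics library and are not shown.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (X, y): the target depends only on features 0 and 3.
rng = np.random.default_rng(0)
X_demo = rng.random((195, 5))
y_demo = 100 * X_demo[:, 0] - 60 * X_demo[:, 3] + rng.normal(0, 1, 195)

linear = LinearRegression().fit(X_demo, y_demo)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)

# With inputs on a common 0-1 scale, coefficient magnitudes are roughly comparable.
coef_rank = np.argsort(-np.abs(linear.coef_))
# Tree importances are non-negative and sum to 1.
tree_rank = np.argsort(-tree.feature_importances_)

print("features ranked by |coefficient|:", coef_rank)
print("features ranked by tree importance:", tree_rank)
```

On the real data the same two rankings could be labelled with the column names of `df5.drop(columns=["popularity"])`, since `X` becomes a plain array after scaling.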

<hr>

**Question 5 (2.0 points)** Evaluate the performance of the models we have seen so far in the course on the task of predicting breed popularity, using cross-validation.

<hr>

In [None]:
pass
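One possible shape for this evaluation is sketched below on synthetic data. The three regressors are an assumption, since the notebook does not list which models the course covered, and mean absolute error is just one reasonable metric for a ranking target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (X, y); replace with the real arrays to evaluate the task.
rng = np.random.default_rng(0)
X_demo = rng.random((195, 5))
y_demo = 100 * X_demo[:, 0] - 60 * X_demo[:, 3] + rng.normal(0, 1, 195)

# Hypothetical model list: the course's actual models are not named in the notebook.
models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(random_state=0),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
    )
    results[name] = -scores.mean()  # flip the sign back to a plain MAE
    print(f"{name}: mean MAE = {results[name]:.2f}")
```

Using the same `KFold` object for every model keeps the folds identical, so the scores are directly comparable.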

<hr>

**Question 6 (1.5 points)** Consider a task that involves predicting the group of a breed. Is this a regression or a classification task? Select variables to use as attributes in this task. Analyze variable importance by testing the coefficients of a linear fit and by using the importance values given by a decision tree model. **You may use relevant libraries. There is no need to implement anything from scratch**. Note: the classes _Foundation Stock Service_ and _Miscellaneous Class_ are non-descriptive and may be disregarded.

<hr>

In [None]:
pass

<hr>

**Question 7 (2.0 points)** Evaluate the performance of the models we have seen so far in the course on the task of predicting the breed's group, using cross-validation. Evaluate accuracy, and precision and recall per class. Show the confusion matrix of the best of all models fitted during cross-validation.

<hr>

In [None]:
pass
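A minimal sketch of the requested metrics follows, again on synthetic data. It assumes 7 classes (the 9 AKC groups minus the two non-descriptive ones) and uses a single decision tree as a placeholder model. It also pools out-of-fold predictions with `cross_val_predict`, which is a simplification of "the best model fitted during cross-validation".

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic 7-class problem standing in for the group-prediction task.
X_demo, y_demo = make_classification(
    n_samples=210,
    n_features=8,
    n_informative=5,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv
)

accuracy = accuracy_score(y_demo, y_pred)
cm = confusion_matrix(y_demo, y_pred)

print(f"accuracy: {accuracy:.3f}")
print(classification_report(y_demo, y_pred))  # precision and recall per class
print(cm)
```

Rows of the confusion matrix are true classes and columns are predicted classes, so off-diagonal cells show which groups get confused with each other.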