# Challenge 6

In this challenge we practice _feature engineering_, one of the most important and labor-intensive steps in ML. We use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which contains data on 227 countries, including population size, area, immigration, and production sectors.

> Note: Please do not change the names of the answer functions.

## General _setup_

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)
#import sklearn as sk

In [4]:
# # Some settings for matplotlib.
# #%matplotlib inline
# from IPython.core.pylabtools import figsize
# figsize(12, 8)
# sns.set()

In [5]:
countries = pd.read_csv("countries.csv")

In [6]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


In [7]:
countries['Pop_density'].str.replace(',','.')

0       48.0
1      124.6
2       13.8
3      290.4
4      152.1
       ...  
222    419.9
223      1.0
224     40.6
225     15.3
226     31.3
Name: Pop_density, Length: 227, dtype: object

In [8]:
cols_transf_Num = ['Birthrate','Deathrate','Agriculture',
                   'Industry','Service','Literacy','Phones_per_1000',
                   'Arable','Crops','Other','Pop_density','Coastline_ratio',
                   'Net_migration','Infant_mortality']

for coluna in cols_transf_Num:
    countries[coluna] = countries[coluna].str.replace(',','.').astype(float)

In [9]:
countries.dtypes

Country              object
Region               object
Population            int64
Area                  int64
Pop_density         float64
Coastline_ratio     float64
Net_migration       float64
Infant_mortality    float64
GDP                 float64
Literacy            float64
Phones_per_1000     float64
Arable              float64
Crops               float64
Other               float64
Climate              object
Birthrate           float64
Deathrate           float64
Agriculture         float64
Industry            float64
Service             float64
dtype: object

In [10]:
cols_strip = ['Country','Region']

for coluna in cols_strip:
    countries[coluna] = countries[coluna].str.strip()

In [11]:
countries.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3,8.71,6.25,,,


In [12]:
countries['Region'].unique()

array(['ASIA (EX. NEAR EAST)', 'EASTERN EUROPE', 'NORTHERN AFRICA',
       'OCEANIA', 'WESTERN EUROPE', 'SUB-SAHARAN AFRICA',
       'LATIN AMER. & CARIB', 'C.W. OF IND. STATES', 'NEAR EAST',
       'NORTHERN AMERICA', 'BALTICS'], dtype=object)

## Notes

This _data set_ still needs a few initial adjustments. First, note that the numeric variables use a comma as the decimal separator and are stored as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra whitespace at the start and end of each string. You can use the `str.strip()` method to remove it.
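
The two adjustments can be sketched on a small frame. This is only an illustration: the `raw` frame and its values are made up for the example and merely imitate the format of the real file.

```python
import pandas as pd

# Hypothetical toy frame imitating the raw file:
# comma decimals stored as strings, and padded text values.
raw = pd.DataFrame({
    "Country": ["Afghanistan ", "Albania "],
    "Pop_density": ["48,0", "124,6"],
})

# Replace the decimal comma with a dot, then cast the column to float.
raw["Pop_density"] = (
    raw["Pop_density"].str.replace(",", ".", regex=False).astype(float)
)

# Remove leading and trailing whitespace from the text column.
raw["Country"] = raw["Country"].str.strip()

print(raw.dtypes)
print(raw)
```

`pd.read_csv` also accepts a `decimal=","` argument, which may make the manual conversion unnecessary. It would likely parse `Climate` as a number as well, so it is not a drop-in replacement for the cells above.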

## Start your analysis here

In [13]:
# Your analysis starts here.


## Question 1

Which regions (variable `Region`) are present in the _data set_? Return a list of the unique regions, with leading and trailing whitespace removed (but keep punctuation: periods, hyphens, etc.), sorted alphabetically.

In [14]:
def q1():
    return list(np.sort(countries['Region'].unique()))

In [15]:
q1()

['ASIA (EX. NEAR EAST)',
 'BALTICS',
 'C.W. OF IND. STATES',
 'EASTERN EUROPE',
 'LATIN AMER. & CARIB',
 'NEAR EAST',
 'NORTHERN AFRICA',
 'NORTHERN AMERICA',
 'OCEANIA',
 'SUB-SAHARAN AFRICA',
 'WESTERN EUROPE']

## Question 2

Discretize the `Pop_density` variable into 10 intervals with `KBinsDiscretizer`, using the `ordinal` encoding and the `quantile` strategy. How many countries lie above the 90th percentile? Answer with a single integer scalar.

In [16]:
countries.head()

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3,8.71,6.25,,,


In [17]:
from sklearn.preprocessing import KBinsDiscretizer

discretizador = KBinsDiscretizer(n_bins=10,
                                 encode='ordinal',
                                 strategy='quantile')

discretizador.fit(countries[['Pop_density']])

popDensity_disc = discretizador.transform(countries[['Pop_density']])

popDensity_disc

array([[3.],
       [6.],
       [1.],
       [8.],
       [7.],
       [0.],
       [6.],
       [7.],
       [1.],
       [5.],
       [8.],
       [0.],
       [5.],
       [5.],
       [2.],
       [9.],
       [9.],
       [9.],
       [3.],
       [8.],
       [1.],
       [4.],
       [9.],
       [3.],
       [0.],
       [5.],
       [0.],
       [2.],
       [7.],
       [4.],
       [4.],
       [3.],
       [4.],
       [8.],
       [4.],
       [2.],
       [0.],
       [5.],
       [7.],
       [0.],
       [0.],
       [2.],
       [6.],
       [2.],
       [8.],
       [2.],
       [1.],
       [5.],
       [5.],
       [3.],
       [5.],
       [5.],
       [5.],
       [6.],
       [6.],
       [1.],
       [5.],
       [7.],
       [4.],
       [3.],
       [5.],
       [8.],
       [1.],
       [3.],
       [2.],
       [4.],
       [2.],
       [3.],
       [1.],
       [6.],
       [0.],
       [4.],
       [0.],
       [6.],
       [9.],
       [4.],
       [7.],

In [18]:
np.unique(popDensity_disc.flatten())

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [19]:
pd.Series(popDensity_disc.flatten()).value_counts()[9.0]

23

In [20]:
def q2():
    # Return the answer to question 2 here.
    from sklearn.preprocessing import KBinsDiscretizer

    discretizador = KBinsDiscretizer(n_bins=10,
                                     encode='ordinal',
                                     strategy='quantile')

    discretizador.fit(countries[['Pop_density']])

    popDensity_disc = discretizador.transform(countries[['Pop_density']])

    return int(pd.Series(popDensity_disc.flatten()).value_counts()[9.0])

In [21]:
q2()

23
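
As a sanity check on what the last bin means, the sketch below uses made-up data (the integers 1 to 100, not the countries data) and compares the size of the top quantile bin with a direct count of values at or above the 90th percentile. The names `values`, `top_bin_count`, and `direct_count` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: the integers 1..100 as a single column.
values = np.arange(1, 101, dtype=float).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
bins = discretizer.fit_transform(values).ravel()

# With 10 ordinal bins, the last bin has index 9.
top_bin_count = int((bins == 9).sum())

# The same count taken directly from the 90th percentile.
p90 = np.quantile(values, 0.9)
direct_count = int((values >= p90).sum())

print(top_bin_count, direct_count)
```

Both counts agree on this toy input, which suggests that counting the members of bin 9 is a reasonable way to answer the question.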

## Question 3

If we encode the `Region` and `Climate` variables with _one-hot encoding_, how many new attributes are created? Answer with a single scalar.

In [23]:
def q3():
    # One-hot encode Region and Climate and count the resulting columns.
    from sklearn.preprocessing import OneHotEncoder
    one_hot_encoder = OneHotEncoder()

    # Missing values are treated as a category of their own ('0').
    one_hot_encoder.fit(countries[['Region', 'Climate']].fillna('0').astype('str'))

    # One new attribute per category learned for each variable.
    return int(sum(len(categorias) for categorias in one_hot_encoder.categories_))

In [24]:
q3()

18
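
The count can also be reasoned about without fitting an encoder: one-hot encoding creates one column per distinct category of each encoded variable. A minimal sketch on made-up data follows (the `toy` frame is hypothetical; missing values are replaced by `'0'`, as in `q3`).

```python
import pandas as pd

# Hypothetical toy frame with one missing Climate value.
toy = pd.DataFrame({
    "Region": ["OCEANIA", "BALTICS", "OCEANIA", "NEAR EAST"],
    "Climate": ["1", "2", "2", None],
})

# Treat missing values as their own category, as q3 does.
filled = toy.fillna("0").astype(str)

# One new column per distinct category in each variable.
n_by_nunique = int(filled.nunique().sum())

# The same number, obtained by actually building the dummy columns.
n_by_dummies = pd.get_dummies(filled).shape[1]

print(n_by_nunique, n_by_dummies)
```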

## Question 4

Apply the following _pipeline_:

1. Fill missing values in the `int64` and `float64` variables with their respective medians.
2. Standardize these variables.

After fitting the _pipeline_ above on the data (only on variables of the specified types), apply the same _pipeline_ (or `ColumnTransformer`) to the record below. What is the value of the `Arable` variable after the _pipeline_? Answer with a single float rounded to three decimal places.

In [25]:
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]

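# Build the row from a list so that the numeric values keep their float dtype
# (np.array on mixed str/float data would turn every value into a string).
test_df = pd.DataFrame([test_country], columns=countries.columns)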

In [26]:
cols_num = (countries.dtypes=='int64') | (countries.dtypes=='float64')
cols_num = countries.dtypes[cols_num].index.tolist()
countries[cols_num]

Unnamed: 0,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Birthrate,Deathrate,Agriculture,Industry,Service
0,31056997,647500,48.0,0.00,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,46.60,20.34,0.380,0.240,0.380
1,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,15.11,5.22,0.232,0.188,0.579
2,32930091,2381740,13.8,0.04,-0.39,31.00,6000.0,70.0,78.1,3.22,0.25,96.53,17.14,4.61,0.101,0.600,0.298
3,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.00,15.00,75.00,22.46,3.27,,,
4,71201,468,152.1,0.00,6.60,4.05,19000.0,100.0,497.2,2.22,0.00,97.78,8.71,6.25,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222,2460492,5860,419.9,0.00,2.98,19.62,800.0,,145.2,16.90,18.97,64.13,31.67,3.92,0.090,0.280,0.630
223,273008,266000,1.0,0.42,,,,,,0.02,0.00,99.98,,,,,0.400
224,21456188,527970,40.6,0.36,0.00,61.50,800.0,50.2,37.2,2.78,0.24,96.98,42.89,8.30,0.135,0.472,0.393
225,11502010,752614,15.3,0.00,0.00,88.29,800.0,80.6,8.2,7.08,0.03,92.90,41.00,19.93,0.220,0.290,0.489


In [27]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline(steps=[
    ('imputacao', SimpleImputer(strategy='median')),
    ('padronizacao', StandardScaler())
])

num_pipeline = num_pipeline.fit(countries[cols_num])

test_country_transf = num_pipeline.transform(test_df[cols_num])

test_country_transf = pd.DataFrame(test_country_transf, columns=cols_num)
# Take the single scalar explicitly instead of calling float() on a Series.
valor = test_country_transf.loc[0, 'Arable']
round(float(valor), 3)

-1.047

In [29]:
def q4():
    # Return the answer to question 4 here.
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.impute import SimpleImputer

    # Median imputation followed by standardization.
    num_pipeline = Pipeline(steps=[
        ('imputacao', SimpleImputer(strategy='median')),
        ('padronizacao', StandardScaler())
    ])

    # Fit on the numeric columns of the full data set.
    num_pipeline = num_pipeline.fit(countries[cols_num])

    # Apply the fitted pipeline to the single test record.
    test_country_transf = num_pipeline.transform(test_df[cols_num])

    test_country_transf = pd.DataFrame(test_country_transf, columns=cols_num)
    # Take the single scalar explicitly instead of calling float() on a Series.
    valor = test_country_transf.loc[0, 'Arable']
    return round(float(valor), 3)

In [30]:
q4()

-1.047
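
The question also mentions `ColumnTransformer` as an option. A minimal sketch of that variant on made-up data is shown below; the `train` and `new_row` frames and the step names are assumptions for the example, not part of the challenge.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: one text column and one numeric column with a gap.
train = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, np.nan, 3.0],
})

numeric = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply the numeric pipeline to numeric columns only; other columns are dropped.
preprocess = ColumnTransformer(
    transformers=[("num", numeric, make_column_selector(dtype_include=np.number))],
    remainder="drop",
)
preprocess.fit(train)

# A new record is transformed with the statistics learned from `train`.
new_row = pd.DataFrame({"name": ["e"], "x": [2.0]})
scaled = preprocess.transform(new_row)
print(scaled)
```

Here the median of `x` is 2, so the imputed column is `[1, 2, 2, 3]` with mean 2, and the new record with `x = 2` is mapped to 0 after standardization.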

## Question 5

Find the number of _outliers_ in the `Net_migration` variable according to the _boxplot_ method, that is, using the rule:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

Count separately the outliers in the lower group and those in the upper group.

Should you remove from the analysis the observations that this method flags as _outliers_? Answer with a tuple of three elements `(outliers_abaixo, outliers_acima, removeria?)`, that is, (outliers below, outliers above, would you remove them?), typed as (int, int, bool).

In [31]:
def q5():
    # Quartiles and interquartile range of Net_migration (NaN values are ignored).
    quartil_01 = countries['Net_migration'].quantile(.25)
    quartil_03 = countries['Net_migration'].quantile(.75)

    iqr = quartil_03 - quartil_01
    lim_inf, lim_sup = (quartil_01 - 1.5*iqr), (quartil_03 + 1.5*iqr)

    # Count the observations below and above the boxplot fences.
    outl_inf = int((countries['Net_migration'] < lim_inf).sum())
    outl_sup = int((countries['Net_migration'] > lim_sup).sum())

    # The answer to "remove?" is fixed to False: the flagged observations are kept.
    return (outl_inf, outl_sup, False)



In [32]:
q5()

(24, 26, False)
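
The boxplot rule can be followed step by step on a short series. The values below are made up for the illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical series with one low and one high extreme value.
s = pd.Series([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 60.0])

# Quartiles and interquartile range.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Boxplot fences: anything outside [lower, upper] is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

n_below = int((s < lower).sum())
n_above = int((s > upper).sum())

print(lower, upper, n_below, n_above)
```

For this series Q1 is 2 and Q3 is 6, so the fences are -4 and 12, and one value falls outside on each side.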

## Question 6

For questions 6 and 7, use `fetch_20newsgroups`, one of the data sets available through `sklearn.datasets`.

Load the `newsgroups` data set with the following categories:

```
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroups` _data set_ and find the number of times the word _phone_ appears in the corpus. Answer with a single scalar.

In [40]:
def q6():
    # Return the answer to question 6 here.
    categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
    newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)

    count_vectorizer = CountVectorizer()
    newsgroups_counts = count_vectorizer.fit_transform(newsgroup.data)

    # Column index of 'phone' in the sparse document-term matrix.
    phone_idx = count_vectorizer.vocabulary_['phone']

    # Sum that column over all documents, without building a dense array.
    return int(newsgroups_counts[:, phone_idx].sum())

In [41]:
q6()

213
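
To see what this count represents, here is a minimal sketch on a made-up three-document corpus (the sentences are invented for the example). It looks up the column of a word through `vocabulary_` and sums it over all documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; CountVectorizer lowercases text by default.
corpus = [
    "My Phone rang and the phone fell",
    "no phone here",
    "just a bike",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# Column index assigned to the token 'phone'.
col = vectorizer.vocabulary_["phone"]

# Total occurrences of 'phone' across the corpus.
total = int(counts[:, col].sum())
print(total)
```

Working on the sparse column avoids converting the whole matrix to a dense array, which can matter for a corpus the size of `newsgroups`.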

## Question 7

Apply `TfidfVectorizer` to the `newsgroups` _data set_ and find the TF-IDF of the word _phone_. Answer with a single scalar rounded to three decimal places.

In [42]:
def q7():
    # Return the answer to question 7 here.
    categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
    newsgroup = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)

    # TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.
    tfidf_vectorizer = TfidfVectorizer()
    newsgroups_tfidf = tfidf_vectorizer.fit_transform(newsgroup.data)

    # Column index of 'phone' in the sparse TF-IDF matrix.
    phone_idx = tfidf_vectorizer.vocabulary_['phone']

    # Sum of the TF-IDF weights of 'phone' over all documents.
    return float(round(newsgroups_tfidf[:, phone_idx].sum(), 3))


In [43]:
q7()

8.888
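
The value above is the sum of the TF-IDF weights of _phone_ over all documents. The sketch below, on a made-up corpus, checks that `TfidfVectorizer` gives the same matrix as `CountVectorizer` followed by `TfidfTransformer` with default settings; the corpus and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer
)

# Hypothetical corpus for the comparison.
corpus = [
    "my phone rang and the phone fell",
    "no phone here",
    "just a bike",
]

# One step: TF-IDF directly from raw text.
direct_vectorizer = TfidfVectorizer()
direct = direct_vectorizer.fit_transform(corpus)

# Two steps: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# Both routes should produce the same weights.
same = np.allclose(direct.toarray(), two_step.toarray())

# Summed TF-IDF weight of 'phone' over the toy corpus.
phone_weight = float(direct[:, direct_vectorizer.vocabulary_["phone"]].sum())
print(same, round(phone_weight, 3))
```

With the default L2 normalization each document row has unit length, so the summed weight of a word depends on how much of each document it accounts for, not only on its raw count.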