# Challenge 6

In this challenge, we will practice _feature engineering_, one of the most important and labor-intensive steps in ML. We will use the [Countries of the world](https://www.kaggle.com/fernandol/countries-of-the-world) _data set_, which contains data on 227 countries, including population size, area, immigration, and production sectors.

> Note: Please do not rename the answer functions.

## General _setup_

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

import sklearn as sk
from sklearn.preprocessing import (KBinsDiscretizer, OneHotEncoder, StandardScaler)
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import (CountVectorizer, TfidfVectorizer)

In [2]:
countries = pd.read_csv("countries.csv")

In [3]:
new_column_names = [
    "Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"
]

countries.columns = new_column_names

countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,480,0,2306,16307,700.0,360,32,1213,22,8765,1,466,2034,38.0,24.0,38.0
1,Albania,EASTERN EUROPE,3581655,28748,1246,126,-493,2152,4500.0,865,712,2109,442,7449,3,1511,522,232.0,188.0,579.0
2,Algeria,NORTHERN AFRICA,32930091,2381740,138,4,-39,31,6000.0,700,781,322,25,9653,1,1714,461,101.0,6.0,298.0
3,American Samoa,OCEANIA,57794,199,2904,5829,-2071,927,8000.0,970,2595,10,15,75,2,2246,327,,,
4,Andorra,WESTERN EUROPE,71201,468,1521,0,66,405,19000.0,1000,4972,222,0,9778,3,871,625,,,


## Notes

This _data set_ still needs a few initial adjustments. First, note that the numeric variables use a comma as the decimal separator and are encoded as strings. Fix this before continuing: convert these variables to proper numeric types.

In addition, the `Country` and `Region` variables have extra spaces at the beginning and end of each string. You can use the `str.strip()` method to remove them.

### Preparing the data set

In [4]:
countries.shape

(227, 20)

First we fill the NaN values with a placeholder string so that the commas can be replaced. Once that is done, we restore the NaN values and convert the columns from str to float. The placeholder is needed because NaN is a float and has no `replace` method, so calling `x.replace(',', '.')` on it fails.

In [5]:
countries = countries.fillna('Unknow')
for column in ['Pop_density','Coastline_ratio','Net_migration','Infant_mortality','Literacy','Phones_per_1000','Arable',
            'Crops','Other','Birthrate','Deathrate','Agriculture','Industry','Service']:
    try:
        countries[column] = countries[column].apply(lambda x: x.replace(',','.'))
    except:
        continue
        
countries = countries.replace('Unknow',np.nan)
for column in ['Pop_density','Coastline_ratio','Net_migration','Infant_mortality','Literacy','Phones_per_1000','Arable',
            'Crops','Other','Birthrate','Deathrate','Agriculture','Industry','Service']:
    try:
        countries[column] = countries[column].apply(lambda x: float(x))
    except:
        continue
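
As a side note, the same conversion can be sketched without the placeholder round trip, because the pandas `.str` accessor skips NaN on its own. The two-row sample below is hypothetical (it only imitates the format of the raw file), and `sample` is an illustrative name rather than part of the challenge:

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the raw file: decimal commas,
# padded text, and one missing value.
raw = io.StringIO(
    'Country,Pop_density,Literacy\n'
    '"Afghanistan ","48,0","36,0"\n'
    '"Andorra ","152,1",\n'
)
sample = pd.read_csv(raw)

for column in ["Pop_density", "Literacy"]:
    # .str.replace leaves NaN untouched; to_numeric then yields floats.
    sample[column] = pd.to_numeric(sample[column].str.replace(",", ".", regex=False))

# Remove the padding around the text column.
sample["Country"] = sample["Country"].str.strip()
print(sample.dtypes)
```

Another option, assuming the file can be re-read, is to pass `decimal=","` to `pd.read_csv` so the columns are parsed as numbers from the start.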

In [6]:
# Strip leading and trailing whitespace from the two text columns named in the notes.
for column in ['Country', 'Region']:
    countries[column] = countries[column].str.strip()

In [7]:
countries.head(5)

Unnamed: 0,Country,Region,Population,Area,Pop_density,Coastline_ratio,Net_migration,Infant_mortality,GDP,Literacy,Phones_per_1000,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,ASIA (EX. NEAR EAST),31056997,647500,48.0,0.0,23.06,163.07,700.0,36.0,3.2,12.13,0.22,87.65,1,46.6,20.34,0.38,0.24,0.38
1,Albania,EASTERN EUROPE,3581655,28748,124.6,1.26,-4.93,21.52,4500.0,86.5,71.2,21.09,4.42,74.49,3,15.11,5.22,0.232,0.188,0.579
2,Algeria,NORTHERN AFRICA,32930091,2381740,13.8,0.04,-0.39,31.0,6000.0,70.0,78.1,3.22,0.25,96.53,1,17.14,4.61,0.101,0.6,0.298
3,American Samoa,OCEANIA,57794,199,290.4,58.29,-20.71,9.27,8000.0,97.0,259.5,10.0,15.0,75.0,2,22.46,3.27,,,
4,Andorra,WESTERN EUROPE,71201,468,152.1,0.0,6.6,4.05,19000.0,100.0,497.2,2.22,0.0,97.78,3,8.71,6.25,,,


## Question 1

Which regions (variable `Region`) are present in the _data set_? Return a list of the unique regions, with leading and trailing whitespace removed (but keep punctuation: periods, hyphens, etc.), sorted alphabetically.

In [8]:
sorted(countries['Region'].unique())

['ASIA (EX. NEAR EAST)',
 'BALTICS',
 'C.W. OF IND. STATES',
 'EASTERN EUROPE',
 'LATIN AMER. & CARIB',
 'NEAR EAST',
 'NORTHERN AFRICA',
 'NORTHERN AMERICA',
 'OCEANIA',
 'SUB-SAHARAN AFRICA',
 'WESTERN EUROPE']

In [9]:
def q1():
    return sorted(countries['Region'].unique())

## Question 2

Discretize the variable `Pop_density` into 10 intervals with `KBinsDiscretizer`, using the `ordinal` encoding and the `quantile` strategy. How many countries lie above the 90th percentile? Answer with a single integer scalar.

Note that the `quantile` strategy splits the values into bins of approximately equal size. Since the first bin is labeled 0, the tenth bin (label 9) covers the range from the 90th to the 100th percentile.

In [10]:
Discretizer = KBinsDiscretizer(n_bins= 10, encode= 'ordinal', strategy= 'quantile')
aux = Discretizer.fit_transform(countries[['Pop_density']])
Discretizer.bin_edges_
len(aux[aux==9])

23

In [11]:
def q2():
    return len(aux[aux==9])
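
To make the bin labeling concrete, here is a minimal sketch on synthetic data (the integers 0 to 99, not the countries data). It checks that the bin labeled 9 holds the values at or above the 90th percentile; the variable names are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic column: the integers 0..99 as a single feature.
values = np.arange(100, dtype=float).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
codes = discretizer.fit_transform(values).ravel()

# Bins are labeled 0..9, so label 9 is the top decile.
top_bin_count = int((codes == 9).sum())

# Cross-check against the 90th percentile computed directly.
threshold = np.quantile(values, 0.9)
above_threshold = int((values.ravel() >= threshold).sum())

print(top_bin_count, above_threshold)
```

With real data the bins are only approximately equal in size, which is why the notebook counts the members of bin 9 instead of assuming one tenth of the rows.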

## Question 3

If we encode the variables `Region` and `Climate` using _one-hot encoding_, how many new attributes would be created? Answer with a single scalar.

Starting from the hypothesis that each categorical variable with n possible values generates n new attributes, we test:

In [12]:
len(countries['Region'].unique())

11

In [13]:
len(pd.get_dummies(countries['Region']).columns)

11

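These cells confirm the hypothesis that _one-hot encoding_ a categorical variable with n possible values generates n new attributes.

Summing the number of possible values of the two variables (note that `unique()` also counts NaN as a value, so any missing entries in `Climate` contribute one category to this sum):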

In [14]:
len(countries['Region'].unique()) + len(countries['Climate'].unique())

18

In [15]:
def q3():
    return len(countries['Region'].unique()) + len(countries['Climate'].unique())
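
How missing values enter the count depends on the tool used. The sketch below uses a small hypothetical series (not the real `Climate` column) to compare `unique()`, `nunique()`, and `pd.get_dummies` with and without `dummy_na`:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
climate = pd.Series(["1", "2", "1,5", np.nan, "2"])

n_unique_with_nan = len(climate.unique())   # NaN counted as a value
n_unique_without_nan = climate.nunique()    # NaN ignored by default

# get_dummies drops NaN unless dummy_na=True is passed.
n_dummies = pd.get_dummies(climate).shape[1]
n_dummies_with_nan = pd.get_dummies(climate, dummy_na=True).shape[1]

print(n_unique_with_nan, n_unique_without_nan, n_dummies, n_dummies_with_nan)
```

Counting with `unique()` therefore corresponds to treating a missing value as a category of its own.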

## Question 4

Apply the following _pipeline_:

1. Fill the variables of type `int64` and `float64` with their respective medians.
2. Standardize these variables.

After applying the _pipeline_ described above to the data (only to the variables of the specified types), apply the same _pipeline_ (or `ColumnTransformer`) to the data below. What is the value of the variable `Arable` after the _pipeline_? Answer with a single float rounded to three decimal places.

In [16]:
countries.dtypes

Country              object
Region               object
Population            int64
Area                  int64
Pop_density         float64
Coastline_ratio     float64
Net_migration       float64
Infant_mortality    float64
GDP                 float64
Literacy            float64
Phones_per_1000     float64
Arable              float64
Crops               float64
Other               float64
Climate              object
Birthrate           float64
Deathrate           float64
Agriculture         float64
Industry            float64
Service             float64
dtype: object

First we define the pipeline:

In [40]:
pipe = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),('standart', StandardScaler())])

Next, we fit the pipeline on the `Arable` column of the `countries` data set.

In [66]:
pipe.fit(countries[['Arable']])
test_country = [
    'Test Country', 'NEAR EAST', -0.19032480757326514,
    -0.3232636124824411, -0.04421734470810142, -0.27528113360605316,
    0.13255850810281325, -0.8054845935643491, 1.0119784924248225,
    0.6189182532646624, 1.0074863283776458, 0.20239896852403538,
    -0.043678728558593366, -0.13929748680369286, 1.3163604645710438,
    -0.3699637766938669, -0.6149300604558857, -0.854369594993175,
    0.263445277972641, 0.5712416961268142
]
test_country = pd.DataFrame(test_country, columns= ['Info'] ,index= ["Country", "Region", "Population", "Area", "Pop_density", "Coastline_ratio",
    "Net_migration", "Infant_mortality", "GDP", "Literacy", "Phones_per_1000",
    "Arable", "Crops", "Other", "Climate", "Birthrate", "Deathrate", "Agriculture",
    "Industry", "Service"])

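With the pipeline fitted, we can replace the categorical values with NaN and then transform the `test_country` data set. Because the pipeline was fitted on `Arable` alone, every entry is scaled with the `Arable` statistics, so only the `Arable` row of the result is meaningful.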

In [67]:
# Use np.nan (the np.NaN alias was removed in NumPy 2.0); .loc avoids chained assignment.
test_country.loc[['Country', 'Region'], 'Info'] = np.nan
# Pass a plain float array so the column name 'Info' is not checked against 'Arable'.
transformed_test_country = pd.DataFrame(pipe.transform(test_country.to_numpy(dtype=float)),
                                        columns=['Info'], index=test_country.index)

In [68]:
def q4():
    return round(float(transformed_test_country['Info']['Arable']),3)
q4()

-1.047
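
For comparison, a more conventional arrangement fits the pipeline on all numeric columns and transforms the new observation as a one-row frame. The sketch below runs on a small hypothetical frame (`train` and `new_row` are illustrative names, not the countries data), so its numbers are unrelated to the answer above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data with one text column and missing values.
train = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Arable": [10.0, np.nan, 30.0, 20.0],
    "Crops": [1.0, 2.0, 3.0, np.nan],
})

# Keep only the int64/float64 columns, as the question asks.
numeric_columns = train.select_dtypes(include=["int64", "float64"]).columns

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
pipe.fit(train[numeric_columns])

# New observation laid out as one row with the same column names.
new_row = pd.DataFrame([{"Country": "Test", "Arable": 20.0, "Crops": 4.0}])
transformed = pd.DataFrame(
    pipe.transform(new_row[numeric_columns]), columns=numeric_columns
)
arable_value = round(float(transformed.loc[0, "Arable"]), 3)
crops_value = round(float(transformed.loc[0, "Crops"]), 3)
print(arable_value, crops_value)
```

Each column is scaled with its own median, mean, and standard deviation here, so every entry of the transformed row is meaningful and not just `Arable`.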

## Question 5

Find the number of _outliers_ of the variable `Net_migration` according to the _boxplot_ method, that is, using the rule:

$$x \notin [Q1 - 1.5 \times \text{IQR}, Q3 + 1.5 \times \text{IQR}] \Rightarrow x \text{ is an outlier}$$

Count separately those that fall in the lower group and those in the upper group.

Should you remove from the analysis the observations flagged as _outliers_ by this method? Answer with a tuple of three elements `(outliers_abaixo, outliers_acima, removeria?)` ((int, int, bool)), that is, outliers below, outliers above, and whether you would remove them.

In [20]:
quantiles = countries['Net_migration'].quantile([0.25, 0.5, 0.75])
IQR = quantiles[0.75] - quantiles[0.25]  # interquartile range
lower_limit = quantiles[0.25] - 1.5 * IQR
upper_limit = quantiles[0.75] + 1.5 * IQR
lower_limit, upper_limit

(-3.8149999999999995, 3.885)

In [21]:
lower_outliers = countries[(countries['Net_migration'] < lower_limit)]
upper_outliers = countries[(countries['Net_migration'] > upper_limit)]

In [22]:
def q5():
    return (lower_outliers.shape[0], upper_outliers.shape[0], False)
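
The boxplot rule can be checked by hand on a small hypothetical series (ten values chosen for illustration, not the `Net_migration` column):

```python
import pandas as pd

# Hypothetical values with one low and two high extremes.
values = pd.Series([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 40.0, 60.0])

first_quartile = values.quantile(0.25)   # 2.25
third_quartile = values.quantile(0.75)   # 6.75
iqr = third_quartile - first_quartile    # 4.5

lower_limit = first_quartile - 1.5 * iqr   # -4.5
upper_limit = third_quartile + 1.5 * iqr   # 13.5

n_below = int((values < lower_limit).sum())
n_above = int((values > upper_limit).sum())
print(n_below, n_above)  # 1 2
```

Comparisons against NaN evaluate to False, so missing entries are never counted as outliers by this rule.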

## Question 6

For questions 6 and 7, use the `fetch_20newsgroups` loader from the `sklearn` data sets.

Load the `newsgroups` data set with the following categories:

```
categories = ['sci.electronics', 'comp.graphics', 'rec.motorcycles']
newsgroups = fetch_20newsgroups(subset="train", categories=categories, shuffle=True, random_state=42)
```


Apply `CountVectorizer` to the `newsgroups` _data set_ and find the number of times the word _phone_ appears in the corpus. Answer with a single scalar.

In [23]:
vectorizer = CountVectorizer()

In [24]:
newsgroups = fetch_20newsgroups(subset='train', categories= ['sci.electronics', 'comp.graphics', 'rec.motorcycles'], 
                                shuffle= True, random_state=42)

The categories were used to retrieve only part of the document collection (the corpus).

In [25]:
def q6():
    # Fit once on the whole corpus; each column of the matrix counts one term.
    counts = vectorizer.fit_transform(newsgroups.data)
    # vocabulary_ maps each term to its column index
    # (get_feature_names() was removed in scikit-learn 1.2).
    word_position = vectorizer.vocabulary_['phone']
    return int(counts[:, word_position].sum())
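
One detail worth seeing on a toy corpus is that `CountVectorizer`, with its default settings, lowercases the text and counts whole tokens only, so _telephone_ does not add to the count for _phone_. The three sentences below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "phone" occurs twice, zero times, and once.
corpus = [
    "My phone rang. Phone me back!",
    "No telephone here.",
    "The phone is off.",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(corpus)

# Column index of the token "phone" in the document-term matrix.
phone_column = toy_vectorizer.vocabulary_["phone"]
phone_total = int(toy_counts[:, phone_column].sum())
print(phone_total)  # 3
```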

## Question 7

Apply `TfidfVectorizer` to the `newsgroups` _data set_ and find the TF-IDF of the word _phone_. Answer with a single scalar rounded to three decimal places.

In [26]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(newsgroups.data)
newsgroups_tfidf = tfidf_vectorizer.transform(newsgroups.data)
newsgroups_tfidf.toarray().shape

(1773, 27335)

In [27]:
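# vocabulary_ maps each term to its column, so only the 'phone' column is densified.
phone_position = tfidf_vectorizer.vocabulary_['phone']
tfidf_dataframe = pd.Series(newsgroups_tfidf[:, phone_position].toarray().ravel(), name='phone')
tfidf_dataframe.sum()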

8.88774594667355

Viewing the TF-IDF matrix:

In [28]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
pd.DataFrame(newsgroups_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out()).head(5)

Unnamed: 0,00,000,0000,0000000004,0000000005,0000000667,0000001200,000005102000,0001,000100255pixel,...,zyeh,zygot,zyxel,zz,zzr11,zzr1100,zzzzzz,ªl,³ation,ýé
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.038791,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Note that the matrix is sparse: each column corresponds to a term and each row to a document.

In [29]:
def q7():
    return round(float(tfidf_dataframe.sum()),3)
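
To see where these weights come from, the sketch below rebuilds the default `TfidfVectorizer` output by hand on a made-up three-document corpus. It assumes the default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the variable names are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus for illustration.
corpus = ["phone phone call", "call me", "phone off"]

# Raw term counts, one row per document and one column per term.
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each term.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the default smooth_idf=True).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every row to unit L2 norm.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

library_tfidf = TfidfVectorizer().fit_transform(corpus).toarray()
matches = bool(np.allclose(manual_tfidf, library_tfidf))
print(matches)
```

Because of the row normalization, the TF-IDF of a term depends on the other terms in each document, and the figure reported in `q7` is the sum of these per-document weights over the corpus.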