## 2. Limpeza, Normalização e Padronização de dados

O pré-processamento de dados é uma etapa importante para o processo de análise de dados, pois a qualidade do resultado do seu modelo começa com a qualidade dos dados que você está “inputando”. Assim, parte considerável do tempo do cientista de dados é gasto no esforço que envolve a limpeza de dados e a engenharia de recursos (transformar dados brutos em atributos que melhor representem seus dados). Independentemente de o cientista de dados receber dados coletados ou ter que realizar a coleta, os dados estarão em formato bruto, que precisarão ser convertidos e filtrados.

### 2.1 Import

In [1]:
import pandas as pd
from scipy import stats
import datetime
import numpy as np 
from sklearn import preprocessing


pd.set_option('display.max_columns', 30)


### 2.2 Carregando o dataset

O dataset da atividade anterior foi aumentado com novas colunas para simular caracteristicas de dados brutos.

In [2]:
meu_data_frame = pd.read_pickle("../data/ugly_cereal.pkl")
display(meu_data_frame.head(20))

Unnamed: 0,uuid_samanthaserver,uuid_scorpionbase,name,mfr,sales_week,type,calories,data_cre_scorp,protein,fat,price,data_cre_saman,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,f4e57ef5-aeee-41a6-99dc-5822e6bb851f,d6d24c40-f84c-11ea-b117-000000000019,100% Bran,N,75898 un.,C,70,1995-04-30,4,1,$ 9.00,Sunday 30. April 1995,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,59c1f03b-6d4e-401a-9b38-577486f3a551,d6d24c6e-f84c-11ea-9823-000000000047,100% Natural Bran,Q,35839 un.,C,120,1998-07-12,42,5,$ 10.00,Sunday 12. July 1998,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,a10fe3a7-57c6-4f1d-961a-3537e97017ac,d6d24c60-f84c-11ea-a76a-000000000039,All-Bran,K,,C,70,1993-11-15,4,1,$ 3.00,Monday 15. November 1993,260,9.0,7.0,64,320,25,3,1.0,0.33,59.425505
3,578659e8-aa31-4283-bb2a-8cf73582148e,d6d24c50-f84c-11ea-ae4e-000000000029,All-Bran with Extra Fiber,K,27346 un.,C,50,1995-10-07,4,0,$ 4.00,Saturday 07. October 1995,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,2f5a2bbf-a5cd-41e9-93e4-b2aaf18b1138,d6d22567-f84c-11ea-854c-000000000017,Almond Delight,R,,C,110,1993-06-30,2,2,$ 8.00,Wednesday 30. June 1993,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
5,e441c0e6-18c4-424e-9060-fc464f432331,d6d22568-f84c-11ea-b587-000000000018,Apple Cinnamon Cheerios,G,23146 un.,C,110,1995-08-11,2,2,$ 7.00,Friday 11. August 1995,180,1.5,10.5,10,70,25,1,1.0,0.75,29.509541
6,3694839c-72ba-4258-ab05-4addbf80f959,d6d24c66-f84c-11ea-ba03-00000000003f,Apple Jacks,K,,C,12600,1996-06-08,2,0,$ 4.00,Saturday 08. June 1996,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094
7,cf267755-f453-4ff4-96d6-520962c6613e,d6d22553-f84c-11ea-b929-000000000003,Basic 4,G,7210 un.,C,130,1996-01-25,78,2,$ 10.00,Thursday 25. January 1996,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562
8,38ba8725-9a83-4350-8b12-25e408600dba,d6d2255b-f84c-11ea-a1d0-00000000000b,Bran Chex,R,,C,90,2000-05-13,2,1,$ 3.00,Saturday 13. May 2000,200,4.0,15.0,80,125,25,1,1.0,0.67,49.120253
9,07b09eb4-ef1b-449b-a506-2247b6d4bb08,d6d24c59-f84c-11ea-9b66-000000000032,Bran Flakes,P,,C,90,1996-08-23,3,0,$ 7.00,Friday 23. August 1996,210,5.0,13.0,5,190,25,3,1.0,0.67,53.313813


## 2.3 Limpeza de dados

### 2.3.1 Eliminando atributos redundantes

Neste exemplo, percebe-se que há redundancia entre as colunas "data_cre_scorp" e "data_cre_seman". Aparentemente o valor deste atributo em cada amostra é o mesmo, mas em formatos diferentes. Vamos checar.

In [3]:

#flag que marca se as colunas são iguais
sao_iguais = True

#iterando sobre a coluna "data_cre_saman"
for i in range(len(meu_data_frame["data_cre_saman"].values)):
    #convertendo os valores de "data_cre_saman" para "data_cre_saman", ex: Sunday 30. April 1995 -> 1195-4-30
    alter_data = str(datetime.datetime.strptime(meu_data_frame["data_cre_saman"].values[i], "%A %d. %B %Y")).replace(" 00:00:00","")
    comp_data = str(meu_data_frame["data_cre_scorp"].values[i])
    
    #caso um único exemplo seja diferente, a flag é mercada como false
    if comp_data != alter_data:
            sao_iguais = False
            break

if sao_iguais:
    print("Há redundância de informação, vou deletar a coluna 'data_cre_saman'")
    meu_data_frame.drop(columns=['data_cre_saman'], axis=1, inplace=True) 
else:
    print("Não são iguais, é melhor pesquisar um pouco mais sobre a natureza desses atributos")
    
display(meu_data_frame.head(10))

Há redundância de informação, vou deletar a coluna 'data_cre_saman'


Unnamed: 0,uuid_samanthaserver,uuid_scorpionbase,name,mfr,sales_week,type,calories,data_cre_scorp,protein,fat,price,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,f4e57ef5-aeee-41a6-99dc-5822e6bb851f,d6d24c40-f84c-11ea-b117-000000000019,100% Bran,N,75898 un.,C,70,1995-04-30,4,1,$ 9.00,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,59c1f03b-6d4e-401a-9b38-577486f3a551,d6d24c6e-f84c-11ea-9823-000000000047,100% Natural Bran,Q,35839 un.,C,120,1998-07-12,42,5,$ 10.00,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,a10fe3a7-57c6-4f1d-961a-3537e97017ac,d6d24c60-f84c-11ea-a76a-000000000039,All-Bran,K,,C,70,1993-11-15,4,1,$ 3.00,260,9.0,7.0,64,320,25,3,1.0,0.33,59.425505
3,578659e8-aa31-4283-bb2a-8cf73582148e,d6d24c50-f84c-11ea-ae4e-000000000029,All-Bran with Extra Fiber,K,27346 un.,C,50,1995-10-07,4,0,$ 4.00,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,2f5a2bbf-a5cd-41e9-93e4-b2aaf18b1138,d6d22567-f84c-11ea-854c-000000000017,Almond Delight,R,,C,110,1993-06-30,2,2,$ 8.00,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
5,e441c0e6-18c4-424e-9060-fc464f432331,d6d22568-f84c-11ea-b587-000000000018,Apple Cinnamon Cheerios,G,23146 un.,C,110,1995-08-11,2,2,$ 7.00,180,1.5,10.5,10,70,25,1,1.0,0.75,29.509541
6,3694839c-72ba-4258-ab05-4addbf80f959,d6d24c66-f84c-11ea-ba03-00000000003f,Apple Jacks,K,,C,12600,1996-06-08,2,0,$ 4.00,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094
7,cf267755-f453-4ff4-96d6-520962c6613e,d6d22553-f84c-11ea-b929-000000000003,Basic 4,G,7210 un.,C,130,1996-01-25,78,2,$ 10.00,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562
8,38ba8725-9a83-4350-8b12-25e408600dba,d6d2255b-f84c-11ea-a1d0-00000000000b,Bran Chex,R,,C,90,2000-05-13,2,1,$ 3.00,200,4.0,15.0,80,125,25,1,1.0,0.67,49.120253
9,07b09eb4-ef1b-449b-a506-2247b6d4bb08,d6d24c59-f84c-11ea-9b66-000000000032,Bran Flakes,P,,C,90,1996-08-23,3,0,$ 7.00,210,5.0,13.0,5,190,25,3,1.0,0.67,53.313813


Além das colunas redundantes, colunas com grande quantidade de dados nulos também devem ser removidas antes da filtragem por amostra.

### 2.3.2 Removendo amostras com valores de atributos nulos

In [4]:
#um boa prática que antecede a remoção de valores nulos é a conversão de valores inválidos para nulos
#por exemplo, campos com espaço em branco, caracteres especiais sem significado (?,*,.), etc.
#Um regex pode ser usado para converter valores para NaN
meu_data_frame = meu_data_frame.replace(r'^\s*$', float("NaN"), regex=True)

#guardar as amostras irregulares é uma boa prática, lembre-se que esses dados podem ser revisados
#e podem ser úteis no futuro.
removed_data_frame = meu_data_frame[meu_data_frame.isnull().any(axis=1)].copy()

#deletando as amostram que possuem algum valor NaN em qualquer atributo 
meu_data_frame = meu_data_frame.dropna()

display(removed_data_frame)

Unnamed: 0,uuid_samanthaserver,uuid_scorpionbase,name,mfr,sales_week,type,calories,data_cre_scorp,protein,fat,price,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
2,a10fe3a7-57c6-4f1d-961a-3537e97017ac,d6d24c60-f84c-11ea-a76a-000000000039,All-Bran,K,,C,70,1993-11-15,4,1,$ 3.00,260,9.0,7.0,64,320,25,3,1.0,0.33,59.425505
4,2f5a2bbf-a5cd-41e9-93e4-b2aaf18b1138,d6d22567-f84c-11ea-854c-000000000017,Almond Delight,R,,C,110,1993-06-30,2,2,$ 8.00,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
6,3694839c-72ba-4258-ab05-4addbf80f959,d6d24c66-f84c-11ea-ba03-00000000003f,Apple Jacks,K,,C,12600,1996-06-08,2,0,$ 4.00,125,1.0,11.0,14,30,25,2,1.0,1.0,33.174094
8,38ba8725-9a83-4350-8b12-25e408600dba,d6d2255b-f84c-11ea-a1d0-00000000000b,Bran Chex,R,,C,90,2000-05-13,2,1,$ 3.00,200,4.0,15.0,80,125,25,1,1.0,0.67,49.120253
9,07b09eb4-ef1b-449b-a506-2247b6d4bb08,d6d24c59-f84c-11ea-9b66-000000000032,Bran Flakes,P,,C,90,1996-08-23,3,0,$ 7.00,210,5.0,13.0,5,190,25,3,1.0,0.67,53.313813
16,62edcb57-8da5-47b0-8432-78c15ff7d83b,d6d24c4f-f84c-11ea-9640-000000000028,Corn Flakes,K,,C,100,1999-03-20,2,0,$ 9.00,290,1.0,21.0,2,35,25,1,1.0,1.0,45.863324


### 2.3.3 Removendo amostras duplicadas

In [5]:
print("Dimensões do dataset (linha,coluna):",  meu_data_frame.shape) 
meu_data_frame = meu_data_frame.drop_duplicates()
print("Dimensões do dataset (linha,coluna) após eliminar duplicatas:",  meu_data_frame.shape) 

Dimensões do dataset (linha,coluna): (91, 21)
Dimensões do dataset (linha,coluna) após eliminar duplicatas: (80, 21)


### 2.3.4 Remoção de símbolos especiais, escalas de medidas e grandezas numericas

In [6]:
#removendo o "un." dos valores da coluna "sales_week" e convertendo de "object" para "int64"
meu_data_frame["sales_week"] = meu_data_frame["sales_week"].str.replace("un.","")
meu_data_frame["sales_week"] = meu_data_frame["sales_week"].astype('int64')
display(meu_data_frame.head(10))

Unnamed: 0,uuid_samanthaserver,uuid_scorpionbase,name,mfr,sales_week,type,calories,data_cre_scorp,protein,fat,price,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,f4e57ef5-aeee-41a6-99dc-5822e6bb851f,d6d24c40-f84c-11ea-b117-000000000019,100% Bran,N,75898,C,70,1995-04-30,4,1,$ 9.00,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,59c1f03b-6d4e-401a-9b38-577486f3a551,d6d24c6e-f84c-11ea-9823-000000000047,100% Natural Bran,Q,35839,C,120,1998-07-12,42,5,$ 10.00,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
3,578659e8-aa31-4283-bb2a-8cf73582148e,d6d24c50-f84c-11ea-ae4e-000000000029,All-Bran with Extra Fiber,K,27346,C,50,1995-10-07,4,0,$ 4.00,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
5,e441c0e6-18c4-424e-9060-fc464f432331,d6d22568-f84c-11ea-b587-000000000018,Apple Cinnamon Cheerios,G,23146,C,110,1995-08-11,2,2,$ 7.00,180,1.5,10.5,10,70,25,1,1.0,0.75,29.509541
7,cf267755-f453-4ff4-96d6-520962c6613e,d6d22553-f84c-11ea-b929-000000000003,Basic 4,G,7210,C,130,1996-01-25,78,2,$ 10.00,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562
10,161d40c6-6bfe-4498-81a4-12b06df58756,d6d24c4d-f84c-11ea-90fa-000000000026,Cap'n'Crunch,Q,88370,C,1400,1997-07-04,1,2,$ 4.00,220,0.0,12.0,12,35,25,2,1.0,0.75,18.042851
11,4c69040a-d182-4851-bde9-7f4915b3ae04,d6d24c41-f84c-11ea-8e90-00000000001a,Cheerios,G,99093,C,110,1990-06-23,6,2,$ 5.00,290,2.0,17.0,1,105,25,1,1.0,1.25,50.764999
12,489175e9-5419-4fff-a35b-f69b79fe4dfb,d6d22552-f84c-11ea-a64e-000000000002,Cinnamon Toast Crunch,G,33487,C,120,1993-10-18,1,3,$ 10.00,210,0.0,13.0,9,45,25,2,1.0,0.75,19.823573
13,1694a14c-218d-4019-b8d6-48f12ca7934f,d6d22558-f84c-11ea-a468-000000000008,Clusters,G,64210,C,110,1993-01-06,3,2,$ 8.00,140,2.0,13.0,7,105,25,3,1.0,0.5,40.400208
14,13655434-9748-467f-a25d-d9cfbfe1a927,d6d24c71-f84c-11ea-9ce2-00000000004a,Cocoa Puffs,G,71837,C,110,1990-05-20,1,1,$ 8.00,180,0.0,12.0,13,55,25,2,1.0,1.0,22.736446


In [7]:
#removendo o cifrão dos valores da coluna "price" e convertendo de "object" para "float64"
meu_data_frame["price"] = meu_data_frame["price"].str.replace("$","")
meu_data_frame["price"] = meu_data_frame["price"].astype('float64')
display(meu_data_frame.head(10))

Unnamed: 0,uuid_samanthaserver,uuid_scorpionbase,name,mfr,sales_week,type,calories,data_cre_scorp,protein,fat,price,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,f4e57ef5-aeee-41a6-99dc-5822e6bb851f,d6d24c40-f84c-11ea-b117-000000000019,100% Bran,N,75898,C,70,1995-04-30,4,1,9.0,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,59c1f03b-6d4e-401a-9b38-577486f3a551,d6d24c6e-f84c-11ea-9823-000000000047,100% Natural Bran,Q,35839,C,120,1998-07-12,42,5,10.0,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
3,578659e8-aa31-4283-bb2a-8cf73582148e,d6d24c50-f84c-11ea-ae4e-000000000029,All-Bran with Extra Fiber,K,27346,C,50,1995-10-07,4,0,4.0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
5,e441c0e6-18c4-424e-9060-fc464f432331,d6d22568-f84c-11ea-b587-000000000018,Apple Cinnamon Cheerios,G,23146,C,110,1995-08-11,2,2,7.0,180,1.5,10.5,10,70,25,1,1.0,0.75,29.509541
7,cf267755-f453-4ff4-96d6-520962c6613e,d6d22553-f84c-11ea-b929-000000000003,Basic 4,G,7210,C,130,1996-01-25,78,2,10.0,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562
10,161d40c6-6bfe-4498-81a4-12b06df58756,d6d24c4d-f84c-11ea-90fa-000000000026,Cap'n'Crunch,Q,88370,C,1400,1997-07-04,1,2,4.0,220,0.0,12.0,12,35,25,2,1.0,0.75,18.042851
11,4c69040a-d182-4851-bde9-7f4915b3ae04,d6d24c41-f84c-11ea-8e90-00000000001a,Cheerios,G,99093,C,110,1990-06-23,6,2,5.0,290,2.0,17.0,1,105,25,1,1.0,1.25,50.764999
12,489175e9-5419-4fff-a35b-f69b79fe4dfb,d6d22552-f84c-11ea-a64e-000000000002,Cinnamon Toast Crunch,G,33487,C,120,1993-10-18,1,3,10.0,210,0.0,13.0,9,45,25,2,1.0,0.75,19.823573
13,1694a14c-218d-4019-b8d6-48f12ca7934f,d6d22558-f84c-11ea-a468-000000000008,Clusters,G,64210,C,110,1993-01-06,3,2,8.0,140,2.0,13.0,7,105,25,3,1.0,0.5,40.400208
14,13655434-9748-467f-a25d-d9cfbfe1a927,d6d24c71-f84c-11ea-9ce2-00000000004a,Cocoa Puffs,G,71837,C,110,1990-05-20,1,1,8.0,180,0.0,12.0,13,55,25,2,1.0,1.0,22.736446


### 2.3.5 Filtrando valores invalidos

In [8]:
#obtendo as colunas do dataframe
columns = meu_data_frame.columns

#iterando sobre cada coluna
for column in columns:
    #verificando se a coluna é numérica
    if(meu_data_frame[column].dtype == "int64" or meu_data_frame[column].dtype == "float64"):
        #verificando se existe valor menor que zero em uma coluna
        print("Antes")
        display(meu_data_frame[column][meu_data_frame[column] < 0])
        meu_data_frame[column].values[meu_data_frame[column] < 0] = 0
        print("Depois")
        display(meu_data_frame[column][meu_data_frame[column] < 0])



Antes


Series([], Name: sales_week, dtype: int64)

Depois


Series([], Name: sales_week, dtype: int64)

Antes


Series([], Name: calories, dtype: int64)

Depois


Series([], Name: calories, dtype: int64)

Antes


Series([], Name: protein, dtype: int64)

Depois


Series([], Name: protein, dtype: int64)

Antes


Series([], Name: fat, dtype: int64)

Depois


Series([], Name: fat, dtype: int64)

Antes


Series([], Name: price, dtype: float64)

Depois


Series([], Name: price, dtype: float64)

Antes


Series([], Name: sodium, dtype: int64)

Depois


Series([], Name: sodium, dtype: int64)

Antes


Series([], Name: fiber, dtype: float64)

Depois


Series([], Name: fiber, dtype: float64)

Antes


57   -1.0
Name: carbo, dtype: float64

Depois


Series([], Name: carbo, dtype: float64)

Antes


57   -1
Name: sugars, dtype: int64

Depois


Series([], Name: sugars, dtype: int64)

Antes


20   -1
4    -1
Name: potass, dtype: int64

Depois


Series([], Name: potass, dtype: int64)

Antes


Series([], Name: vitamins, dtype: int64)

Depois


Series([], Name: vitamins, dtype: int64)

Antes


Series([], Name: shelf, dtype: int64)

Depois


Series([], Name: shelf, dtype: int64)

Antes


Series([], Name: weight, dtype: float64)

Depois


Series([], Name: weight, dtype: float64)

Antes


Series([], Name: cups, dtype: float64)

Depois


Series([], Name: cups, dtype: float64)

Antes


Series([], Name: rating, dtype: float64)

Depois


Series([], Name: rating, dtype: float64)

### 2.3.6 Codificação de categorias

In [9]:
#dados categoricos são convertidos para representação numérica (escala nominal)
meu_data_frame["mfr"] = meu_data_frame["mfr"].cat.codes
meu_data_frame["type"] = meu_data_frame["type"].cat.codes
display(meu_data_frame.head(10))

Unnamed: 0,uuid_samanthaserver,uuid_scorpionbase,name,mfr,sales_week,type,calories,data_cre_scorp,protein,fat,price,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,f4e57ef5-aeee-41a6-99dc-5822e6bb851f,d6d24c40-f84c-11ea-b117-000000000019,100% Bran,3,75898,0,70,1995-04-30,4,1,9.0,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,59c1f03b-6d4e-401a-9b38-577486f3a551,d6d24c6e-f84c-11ea-9823-000000000047,100% Natural Bran,5,35839,0,120,1998-07-12,42,5,10.0,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
3,578659e8-aa31-4283-bb2a-8cf73582148e,d6d24c50-f84c-11ea-ae4e-000000000029,All-Bran with Extra Fiber,2,27346,0,50,1995-10-07,4,0,4.0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
5,e441c0e6-18c4-424e-9060-fc464f432331,d6d22568-f84c-11ea-b587-000000000018,Apple Cinnamon Cheerios,1,23146,0,110,1995-08-11,2,2,7.0,180,1.5,10.5,10,70,25,1,1.0,0.75,29.509541
7,cf267755-f453-4ff4-96d6-520962c6613e,d6d22553-f84c-11ea-b929-000000000003,Basic 4,1,7210,0,130,1996-01-25,78,2,10.0,210,2.0,18.0,8,100,25,3,1.33,0.75,37.038562
10,161d40c6-6bfe-4498-81a4-12b06df58756,d6d24c4d-f84c-11ea-90fa-000000000026,Cap'n'Crunch,5,88370,0,1400,1997-07-04,1,2,4.0,220,0.0,12.0,12,35,25,2,1.0,0.75,18.042851
11,4c69040a-d182-4851-bde9-7f4915b3ae04,d6d24c41-f84c-11ea-8e90-00000000001a,Cheerios,1,99093,0,110,1990-06-23,6,2,5.0,290,2.0,17.0,1,105,25,1,1.0,1.25,50.764999
12,489175e9-5419-4fff-a35b-f69b79fe4dfb,d6d22552-f84c-11ea-a64e-000000000002,Cinnamon Toast Crunch,1,33487,0,120,1993-10-18,1,3,10.0,210,0.0,13.0,9,45,25,2,1.0,0.75,19.823573
13,1694a14c-218d-4019-b8d6-48f12ca7934f,d6d22558-f84c-11ea-a468-000000000008,Clusters,1,64210,0,110,1993-01-06,3,2,8.0,140,2.0,13.0,7,105,25,3,1.0,0.5,40.400208
14,13655434-9748-467f-a25d-d9cfbfe1a927,d6d24c71-f84c-11ea-9ce2-00000000004a,Cocoa Puffs,1,71837,0,110,1990-05-20,1,1,8.0,180,0.0,12.0,13,55,25,2,1.0,1.0,22.736446


## 2.4 Normalização e padronização

A transformação dos seus dados, que já estão tratados, é uma pratica que tem vários impactos positivos na área de Ciência de Dados. Além de facilitar a visualizalção dos dados, a normalização e a padronização evitar que seu algoritmo  de aprendizado de máquina fique enviesado para as variáveis com maior ordem de grandeza.

* A normalização tem como objetivo converter a distribuição original para uma distribuição dentro de um intervalo, por exemplo: [0,1] ou [-1,1]

* A padronização tem como objetivo converter a distribuição original para uma distribuição com média 0 e desvio padrão 1.


In [10]:
#instanciando o normalizador MinMAx
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))

#obtendo as colunas do dataframe
columns = meu_data_frame.columns

#fazendo uma copia do dataframe, iremos normalizar a copia
normalized_data_frame = meu_data_frame.copy()

#iterando sobre cada coluna
for column in columns:
    #verificando se a coluna é numérica
    if(meu_data_frame[column].dtype == "int64" or meu_data_frame[column].dtype == "float64"):
        x = meu_data_frame[column].values
        x_norm = min_max_scaler.fit_transform(x.reshape(-1, 1))
        normalized_data_frame[column] = pd.DataFrame(x_norm)
        normalized_data_frame.rename(columns={column:column+"_norm"}, inplace=True)

display(normalized_data_frame.head(10))

Unnamed: 0,uuid_samanthaserver,uuid_scorpionbase,name,mfr,sales_week_norm,type,calories_norm,data_cre_scorp,protein_norm,fat_norm,price_norm,sodium_norm,fiber_norm,carbo_norm,sugars_norm,potass_norm,vitamins_norm,shelf_norm,weight_norm,cups_norm,rating_norm
0,f4e57ef5-aeee-41a6-99dc-5822e6bb851f,d6d24c40-f84c-11ea-b117-000000000019,100% Bran,3,0.747559,0,0.014815,1995-04-30,0.038961,0.2,0.857143,0.40625,0.714286,0.217391,0.4,0.848485,0.25,1.0,0.5,0.064,0.665593
1,59c1f03b-6d4e-401a-9b38-577486f3a551,d6d24c6e-f84c-11ea-9823-000000000047,100% Natural Bran,5,0.311581,0,0.051852,1998-07-12,0.532468,1.0,1.0,0.046875,0.142857,0.347826,0.533333,0.409091,0.0,1.0,0.5,0.6,0.210685
3,578659e8-aa31-4283-bb2a-8cf73582148e,d6d24c50-f84c-11ea-ae4e-000000000029,All-Bran with Extra Fiber,2,0.173438,0,0.044444,1995-10-07,0.012987,0.4,0.571429,0.5625,0.107143,0.456522,0.666667,0.212121,0.25,0.0,0.5,0.4,0.151551
5,e441c0e6-18c4-424e-9060-fc464f432331,d6d22568-f84c-11ea-b587-000000000018,Apple Cinnamon Cheerios,1,0.883297,0,1.0,1995-08-11,0.0,0.4,0.142857,0.6875,0.0,0.521739,0.8,0.106061,0.25,0.5,0.5,0.4,0.0
7,cf267755-f453-4ff4-96d6-520962c6613e,d6d22553-f84c-11ea-b929-000000000003,Basic 4,1,0.285983,0,0.051852,1996-01-25,0.0,0.6,1.0,0.65625,0.0,0.565217,0.6,0.136364,0.25,0.5,0.5,0.4,0.023535
10,161d40c6-6bfe-4498-81a4-12b06df58756,d6d24c4d-f84c-11ea-90fa-000000000026,Cap'n'Crunch,5,0.153402,0,0.044444,1997-07-04,0.012987,0.0,0.285714,0.875,0.0,0.956522,0.2,0.075758,0.25,0.0,0.5,0.6,0.309299
11,4c69040a-d182-4851-bde9-7f4915b3ae04,d6d24c41-f84c-11ea-8e90-00000000001a,Cheerios,1,0.314977,0,0.044444,1990-06-23,0.0,0.0,1.0,0.28125,0.071429,0.565217,0.8,0.060606,0.25,0.5,0.5,0.6,0.234463
12,489175e9-5419-4fff-a35b-f69b79fe4dfb,d6d22552-f84c-11ea-a64e-000000000002,Cinnamon Toast Crunch,1,0.631281,0,0.044444,1993-10-18,0.0,0.2,0.571429,0.5625,0.0,0.521739,0.866667,0.19697,0.25,0.5,0.5,0.6,0.057541
13,1694a14c-218d-4019-b8d6-48f12ca7934f,d6d22558-f84c-11ea-a468-000000000008,Clusters,1,0.390301,0,0.044444,1993-01-06,0.025974,0.6,0.428571,0.4375,0.285714,0.434783,0.466667,0.484848,0.25,1.0,0.5,0.2,0.296132
14,13655434-9748-467f-a25d-d9cfbfe1a927,d6d24c71-f84c-11ea-9ce2-00000000004a,Cocoa Puffs,1,0.335274,0,0.037037,1990-05-20,0.025974,0.0,1.0,0.25,0.071429,0.913043,0.0,0.0,0.0,0.5,0.5,0.6,0.614455


In [11]:

standard_scaler = preprocessing.StandardScaler()
columns = meu_data_frame.columns

standarlized_data_frame = meu_data_frame.copy()

for column in columns:
    if(meu_data_frame[column].dtype == "int64" or meu_data_frame[column].dtype == "float64"):
        x = meu_data_frame[column].values
        x_norm = standard_scaler.fit_transform(x.reshape(-1, 1))
        standarlized_data_frame[column] = pd.DataFrame(x_norm)
        standarlized_data_frame.rename(columns={column:column+"_stda"}, inplace=True)
display(standarlized_data_frame.head(10))

Unnamed: 0,uuid_samanthaserver,uuid_scorpionbase,name,mfr,sales_week_stda,type,calories_stda,data_cre_scorp,protein_stda,fat_stda,price_stda,sodium_stda,fiber_stda,carbo_stda,sugars_stda,potass_stda,vitamins_stda,shelf_stda,weight_stda,cups_stda,rating_stda
0,f4e57ef5-aeee-41a6-99dc-5822e6bb851f,d6d24c40-f84c-11ea-b117-000000000019,100% Bran,3,0.800189,0,-0.369227,1995-04-30,0.003958,-0.080556,1.093355,-0.349229,3.373672,-2.261488,-0.235621,2.63342,-0.127804,0.943737,-0.21677,-2.155883,1.87247
1,59c1f03b-6d4e-401a-9b38-577486f3a551,d6d24c6e-f84c-11ea-9823-000000000047,100% Natural Bran,5,-0.666055,0,-0.024155,1998-07-12,4.014278,3.601987,1.496435,-1.722265,-0.051919,-1.550142,0.224127,0.559579,-1.263842,0.943737,-0.21677,0.782961,-0.585175
3,578659e8-aa31-4283-bb2a-8cf73582148e,d6d24c50-f84c-11ea-ae4e-000000000029,All-Bran with Extra Fiber,2,-1.130646,0,-0.093169,1995-10-07,-0.207112,0.84008,0.287195,0.247744,-0.266019,-0.957353,0.683874,-0.370073,-0.127804,-1.491713,-0.21677,-0.313623,-0.904643
5,e441c0e6-18c4-424e-9060-fc464f432331,d6d22568-f84c-11ea-b587-000000000018,Apple Cinnamon Cheerios,1,1.256691,0,8.809683,1995-08-11,-0.312647,0.84008,-0.922046,0.725321,-0.908317,-0.60168,1.143622,-0.870655,-0.127804,-0.273988,-0.21677,-0.313623,-1.7234
7,cf267755-f453-4ff4-96d6-520962c6613e,d6d22553-f84c-11ea-b929-000000000003,Basic 4,1,-0.752144,0,-0.024155,1996-01-25,-0.312647,1.760716,1.496435,0.605927,-0.908317,-0.364565,0.454,-0.727632,-0.127804,-0.273988,-0.21677,-0.313623,-1.596251
10,161d40c6-6bfe-4498-81a4-12b06df58756,d6d24c4d-f84c-11ea-90fa-000000000026,Cap'n'Crunch,5,-1.198031,0,-0.093169,1997-07-04,-0.207112,-1.001191,-0.518966,1.441688,-0.908317,1.769473,-0.925242,-1.013679,-0.127804,-1.491713,-0.21677,0.782961,-0.052412
11,4c69040a-d182-4851-bde9-7f4915b3ae04,d6d24c41-f84c-11ea-8e90-00000000001a,Cheerios,1,-0.654636,0,-0.093169,1990-06-23,-0.312647,-1.001191,1.496435,-0.826807,-0.480118,-0.364565,1.143622,-1.085191,-0.127804,-0.273988,-0.21677,0.782961,-0.456713
12,489175e9-5419-4fff-a35b-f69b79fe4dfb,d6d22552-f84c-11ea-a64e-000000000002,Cinnamon Toast Crunch,1,0.409132,0,-0.093169,1993-10-18,-0.312647,-0.080556,0.287195,0.247744,-0.908317,-0.60168,1.373495,-0.441585,-0.127804,-0.273988,-0.21677,0.782961,-1.412535
13,1694a14c-218d-4019-b8d6-48f12ca7934f,d6d22558-f84c-11ea-a468-000000000008,Clusters,1,-0.401312,0,-0.093169,1993-01-06,-0.101577,1.760716,-0.115886,-0.229834,0.804479,-1.075911,-0.005747,0.917138,-0.127804,0.943737,-0.21677,-1.410206,-0.123548
14,13655434-9748-467f-a25d-d9cfbfe1a927,d6d24c71-f84c-11ea-9ce2-00000000004a,Cocoa Puffs,1,-0.586373,0,-0.162184,1990-05-20,-0.101577,-1.001191,1.496435,-0.946201,-0.480118,1.532358,-1.614863,-1.371238,-1.263842,-0.273988,-0.21677,0.782961,1.5962


## 2.5 Correlação de atributos

O termo correlação representa, sob o ponto de vista da estatística, uma medida de associação
entre duas ou mais variáveis. Por definição, se forem considerados numa população, os pares de valores de duas variáveis (xi;yi), a correlação pode ser definida pela equação de Pearson abaixo:

<img src="imgs/corr.png" width=35% />

O valor da correção, conhecido como coeficiente de correlação, assume valores no intervalo de -1 a 1, de acordo com o grau de associação entre as variáveis em questão.

In [12]:
#calculando a tabela de correlação
corr = meu_data_frame.corr()

p = 0.75 # correlação mínima
var = []

#iterando sobre a tabela
for i in corr.columns:
    for j in corr.columns:
        if(i != j):
            if np.abs(corr[i][j]) > p: # se maior do que |p|
                var.append([i,j])
print('Variáveis mais correlacionadas:\n', var)

Variáveis mais correlacionadas:
 [['fiber', 'potass'], ['sugars', 'rating'], ['potass', 'fiber'], ['rating', 'sugars']]
