# GUIDED PRACTICE - Data cleaning

#### Pandas provides a set of methods for working with missing data. These methods recognize as missing both the values that come from NumPy (`np.nan`) and the native Python one (`None`).

In [1]:
import pandas as pd
import numpy as np

## Detecting missing data

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

#### The [.isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html) method returns a boolean mask for the Series that flags the missing data. It also recognizes Python's native missing value, `None`.

In [3]:
string_data = pd.Series([None, 'artichoke', np.nan, 'avocado'])
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
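
Because the mask is boolean, it can also be used to count the gaps. The snippet below is a small sketch on the same kind of Series; the variable names (`mask`, `n_missing`, `share_missing`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same kind of Series as above: one Python None and one NumPy NaN.
s = pd.Series([None, 'artichoke', np.nan, 'avocado'])

mask = s.isnull()              # boolean mask, True where the value is missing
n_missing = int(mask.sum())    # True counts as 1, so the sum is the number of gaps
share_missing = mask.mean()    # fraction of entries that are missing

# .isna() is an alias of .isnull(); both return the same mask.
same_mask = s.isna().equals(mask)

print(n_missing, share_missing, same_mask)
```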

#### To find the entries with missing data, we can filter the Series using boolean indexing.

In [4]:
# Filter the null values
print(string_data[string_data.isnull()])

0    None
2     NaN
dtype: object


In [5]:
# Filter the non-null values
print(string_data[string_data.notnull()])

1    artichoke
3      avocado
dtype: object
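
For the common case of keeping only the valid entries, `Series.dropna()` is a shortcut for the boolean filter above. A minimal sketch, with illustrative variable names:

```python
import numpy as np
import pandas as pd

s = pd.Series([None, 'artichoke', np.nan, 'avocado'])

kept_by_mask = s[s.notnull()]   # boolean indexing, as in the cell above
kept_by_dropna = s.dropna()     # shortcut that keeps the same entries

print(kept_by_dropna)
```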


#### When working with DataFrames, we can keep only the rows or columns that contain no missing values.

In [6]:
df = pd.DataFrame(np.random.randn(7, 3))
df

Unnamed: 0,0,1,2
0,-0.752353,-1.370861,1.394807
1,-0.174794,-0.438135,-0.944785
2,2.74045,0.44751,-0.298952
3,-0.055555,-0.6439,0.968234
4,-0.926365,-0.702047,0.480154
5,1.596814,0.155727,-2.36099
6,-1.198358,0.985941,-0.489588


In [7]:
# Now we generate some missing data
df.iloc[:4, 1] = np.nan
df.iloc[2, 2] = None
df

Unnamed: 0,0,1,2
0,-0.752353,,1.394807
1,-0.174794,,-0.944785
2,2.74045,,
3,-0.055555,,0.968234
4,-0.926365,-0.702047,0.480154
5,1.596814,0.155727,-2.36099
6,-1.198358,0.985941,-0.489588


#### We can drop the rows that contain `NaN`.

In [8]:
df.dropna(axis = 0)

Unnamed: 0,0,1,2
4,-0.926365,-0.702047,0.480154
5,1.596814,0.155727,-2.36099
6,-1.198358,0.985941,-0.489588


#### We can also drop the columns that contain `NaN`.

In [9]:
df.dropna(axis = 1)

Unnamed: 0,0
0,-0.752353
1,-0.174794
2,2.74045
3,-0.055555
4,-0.926365
5,1.596814
6,-1.198358


* The drop can be restricted by a criterion. For example, drop only the rows that have NaN in column 2.

In [10]:
df.dropna(axis = 0, subset=[2])

Unnamed: 0,0,1,2
0,-0.752353,,1.394807
1,-0.174794,,-0.944785
3,-0.055555,,0.968234
4,-0.926365,-0.702047,0.480154
5,1.596814,0.155727,-2.36099
6,-1.198358,0.985941,-0.489588


In [11]:
# With axis = 1, subset lists row labels: drop the columns that have NaN in rows 3 to 6
df.dropna(axis = 1, subset=[3,4,5,6])

Unnamed: 0,0,2
0,-0.752353,1.394807
1,-0.174794,-0.944785
2,2.74045,
3,-0.055555,0.968234
4,-0.926365,0.480154
5,1.596814,-2.36099
6,-1.198358,-0.489588
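
Besides `axis` and `subset`, `dropna` also accepts `how` and `thresh` to control how many missing values a row may have before it is dropped. The sketch below uses a small hand-made DataFrame (`df_demo` is a hypothetical name, not the `df` from the cells above) so that the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [np.nan, np.nan, np.nan, 8.0],
    'c': [9.0, 10.0, np.nan, 12.0],
})

any_missing = df_demo.dropna()             # default how='any': drop rows with at least one NaN
all_missing = df_demo.dropna(how='all')    # drop only the rows where every value is NaN
at_least_two = df_demo.dropna(thresh=2)    # keep rows with at least 2 non-missing values

print(list(any_missing.index), list(all_missing.index), list(at_least_two.index))
```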


## Imputation methods: filling in missing data

#### To discuss filling in missing data, we first rename the columns by assigning to the [`.columns`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html) attribute.

In [12]:
df.columns = ['col1','col2','col3']
df

Unnamed: 0,col1,col2,col3
0,-0.752353,,1.394807
1,-0.174794,,-0.944785
2,2.74045,,
3,-0.055555,,0.968234
4,-0.926365,-0.702047,0.480154
5,1.596814,0.155727,-2.36099
6,-1.198358,0.985941,-0.489588


#### We can fill the missing values with a scalar. [`.fillna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) returns a new object; to modify the DataFrame directly, pass the parameter `inplace = True`.

In [13]:
df.fillna(0)

Unnamed: 0,col1,col2,col3
0,-0.752353,0.0,1.394807
1,-0.174794,0.0,-0.944785
2,2.74045,0.0,0.0
3,-0.055555,0.0,0.968234
4,-0.926365,-0.702047,0.480154
5,1.596814,0.155727,-2.36099
6,-1.198358,0.985941,-0.489588


In [14]:
df

Unnamed: 0,col1,col2,col3
0,-0.752353,,1.394807
1,-0.174794,,-0.944785
2,2.74045,,
3,-0.055555,,0.968234
4,-0.926365,-0.702047,0.480154
5,1.596814,0.155727,-2.36099
6,-1.198358,0.985941,-0.489588


In [15]:
df.fillna(0, inplace = True)
df

Unnamed: 0,col1,col2,col3
0,-0.752353,0.0,1.394807
1,-0.174794,0.0,-0.944785
2,2.74045,0.0,0.0
3,-0.055555,0.0,0.968234
4,-0.926365,-0.702047,0.480154
5,1.596814,0.155727,-2.36099
6,-1.198358,0.985941,-0.489588


#### Missing data can also be filled using a dictionary that maps each column to its fill value.

In [16]:
df = pd.DataFrame(np.random.randn(7, 3), columns = ['col1','col2','col3'])
df.iloc[1:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
df

Unnamed: 0,col1,col2,col3
0,0.195822,-2.313764,
1,0.5883,,
2,0.897625,,0.627021
3,0.022954,,-0.151825
4,-1.870631,0.472543,-0.218509
5,0.248425,0.513274,-0.554663
6,0.29278,-0.506811,-0.396594


In [17]:
df.fillna({'col2': 0.5, 'col3': -1})

Unnamed: 0,col1,col2,col3
0,0.195822,-2.313764,-1.0
1,0.5883,0.5,-1.0
2,0.897625,0.5,0.627021
3,0.022954,0.5,-0.151825
4,-1.870631,0.472543,-0.218509
5,0.248425,0.513274,-0.554663
6,0.29278,-0.506811,-0.396594


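#### To fill each gap with the last valid value, use [`.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) with the parameter `method = 'ffill'` (forward fill). Gaps at the start of a column, which have no previous valid value, remain `NaN`. Recent pandas releases deprecate the `method` parameter in favor of the equivalent `df.ffill()`.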

In [18]:
df.fillna(method ='ffill') 

Unnamed: 0,col1,col2,col3
0,0.195822,-2.313764,
1,0.5883,-2.313764,
2,0.897625,-2.313764,0.627021
3,0.022954,-2.313764,-0.151825
4,-1.870631,0.472543,-0.218509
5,0.248425,0.513274,-0.554663
6,0.29278,-0.506811,-0.396594


#### [`.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) also accepts `method = 'bfill'` (backward fill), which fills each gap with the next valid value. The equivalent in recent pandas releases is `df.bfill()`.


In [19]:
df.fillna(method ='bfill') 

Unnamed: 0,col1,col2,col3
0,0.195822,-2.313764,0.627021
1,0.5883,0.472543,0.627021
2,0.897625,0.472543,0.627021
3,0.022954,0.472543,-0.151825
4,-1.870631,0.472543,-0.218509
5,0.248425,0.513274,-0.554663
6,0.29278,-0.506811,-0.396594
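
The forward and backward fills can also be written with the dedicated `ffill()` and `bfill()` methods, which additionally take a `limit` on how many consecutive gaps are filled. A minimal sketch on a hand-made Series (the data and names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()                  # propagate the last valid value forward
backward = s.bfill()                 # propagate the next valid value backward
forward_limited = s.ffill(limit=1)   # fill at most one consecutive gap

print(forward.tolist())
print(backward.tolist())
print(forward_limited.tolist())
```

With `limit=1`, only the first gap after the value `1.0` is filled; the remaining two stay missing.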


### Filling with the mean and the conditional mean

#### The `.fillna()` method also accepts a Series or a DataFrame: each missing value is filled with the value found at the matching index (and column) label.

In [20]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),'data2': np.random.rand(6)},
                  columns = ['key', 'data1','data2'])
df

Unnamed: 0,key,data1,data2
0,A,0,0.226138
1,B,1,0.906723
2,C,2,0.76395
3,A,3,0.468661
4,B,4,0.863666
5,C,5,0.975536


In [21]:
df.iloc[2:3, 1] = np.nan
df.iloc[3:4, 2] = np.nan
df

Unnamed: 0,key,data1,data2
0,A,0.0,0.226138
1,B,1.0,0.906723
2,C,,0.76395
3,A,3.0,
4,B,4.0,0.863666
5,C,5.0,0.975536


In [22]:
print(df.data1.mean())
df.data2.mean()

2.6


0.7472026790409811

In [23]:
# numeric_only = True skips the text column "key", which has no mean
df.fillna(df.mean(numeric_only = True))

Unnamed: 0,key,data1,data2
0,A,0.0,0.226138
1,B,1.0,0.906723
2,C,2.6,0.76395
3,A,3.0,0.747203
4,B,4.0,0.863666
5,C,5.0,0.975536


#### We can compute per-group means of the DataFrame with the [`.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby) and `.transform()` methods, grouping the DataFrame by the `"key"` column.

In [24]:
df.groupby(by = 'key').transform('mean')

Unnamed: 0,data1,data2
0,1.5,0.226138
1,2.5,0.885195
2,5.0,0.869743
3,1.5,0.226138
4,2.5,0.885195
5,5.0,0.869743


#### These group means can then be passed to `.fillna()`, so that each missing value is filled with the mean of its own group.

In [25]:
df.fillna(df.groupby('key').transform('mean'))

Unnamed: 0,key,data1,data2
0,A,0.0,0.226138
1,B,1.0,0.906723
2,C,5.0,0.76395
3,A,3.0,0.226138
4,B,4.0,0.863666
5,C,5.0,0.975536
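
The difference between the global mean and the group mean is easier to see on data without random numbers. The following sketch uses a small hypothetical DataFrame (`df_demo`) with one gap per group:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'key': ['A', 'B', 'A', 'B'],
    'value': [1.0, 10.0, np.nan, np.nan],
})

# Global mean: both gaps receive the same number, (1 + 10) / 2 = 5.5.
global_fill = df_demo['value'].fillna(df_demo['value'].mean())

# Group mean: transform('mean') returns one value per row, aligned with the
# original index, so each gap receives the mean of its own group.
group_means = df_demo.groupby('key')['value'].transform('mean')
group_fill = df_demo['value'].fillna(group_means)

print(global_fill.tolist())
print(group_fill.tolist())
```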


## Tidy Data

We will work with some examples of messy data from Wickham's original paper.
The idea is to face datasets as they might exist in the real world and bring them into a format that standard data mining and visualization tools can work with more easily, following the rules of "tidy data".

We will work with a few types of messy datasets:

#### Column names are values, not variables

In [80]:
import pandas as pd
import datetime
from os import listdir
from os.path import isfile, join
import glob
import re


df = pd.read_csv("pew-raw.csv")
df

Unnamed: 0,religion,<$10k,$10-20k,$20-30k,$30-40,$40-50,$50-75
0,Agnostic,27,34,60,81,76,137
1,Atheist,12,27,37,52,35,70
2,Buddhist,27,21,30,34,33,58
3,Catholic,418,617,732,670,638,1116
4,Dont Know / refused,15,14,15,11,10,35
5,Evangelical Prot,575,869,1064,982,881,1486
6,Hindu,1,9,7,9,11,34
7,Historically Black Prot,228,224,236,238,197,223
8,Jehovahs Witness,20,27,24,24,21,30
9,Jewish,19,19,25,25,30,95


#### To reorganize the dataset, we use the [`.melt()`](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) method. In the parameters, we indicate that the variable to keep is `"religion"` (more than one variable can be kept), and that the remaining columns are gathered into a new variable in which each former column name becomes a category.

In [82]:
df_ordenado = pd.melt(df, ["religion"], var_name = "income", value_name = "freq")
df_ordenado.head(10)

Unnamed: 0,religion,income,freq
0,Agnostic,<$10k,27
1,Atheist,<$10k,12
2,Buddhist,<$10k,27
3,Catholic,<$10k,418
4,Dont Know / refused,<$10k,15
5,Evangelical Prot,<$10k,575
6,Hindu,<$10k,1
7,Historically Black Prot,<$10k,228
8,Jehovahs Witness,<$10k,20
9,Jewish,<$10k,19


* To undo the operation, use the `pivot_table` function.
* Because `pivot_table` uses the `index` variable as the row index of the output DataFrame, the `reset_index` function can be used to move it back into a regular column.

In [28]:
df_ordenado.pivot_table(index='religion',columns='income',values='freq').reset_index()

income,religion,$10-20k,$20-30k,$30-40,$40-50,$50-75,<$10k
0,Agnostic,34,60,81,76,137,27
1,Atheist,27,37,52,35,70,12
2,Buddhist,21,30,34,33,58,27
3,Catholic,617,732,670,638,1116,418
4,Dont Know / refused,14,15,11,10,35,15
5,Evangelical Prot,869,1064,982,881,1486,575
6,Hindu,9,7,9,11,34,1
7,Historically Black Prot,224,236,238,197,223,228
8,Jehovahs Witness,27,24,24,21,30,20
9,Jewish,19,25,25,30,95,19
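
A small sketch of the way back, on a hand-made long table. Two details are worth noticing: the resulting columns come out sorted alphabetically rather than in their original order, and the column axis keeps the name `income` unless it is cleared. Passing `aggfunc='sum'` is a choice made for this sketch (the default is the mean); with one row per combination both give the same numbers.

```python
import pandas as pd

long = pd.DataFrame({
    'religion': ['Agnostic', 'Atheist', 'Agnostic', 'Atheist'],
    'income': ['<$10k', '<$10k', '$10-20k', '$10-20k'],
    'freq': [27, 12, 34, 27],
})

wide = (
    long.pivot_table(index='religion', columns='income', values='freq', aggfunc='sum')
        .reset_index()                  # move "religion" from the index to a column
)
wide.columns.name = None                # drop the leftover "income" label on the column axis

print(wide)
```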


#### More than one value in the same column

Next, we use data from the WHO. The dataset contains the number of tuberculosis cases observed by country, year, sex, and age.

In [146]:
df = pd.read_csv("tb-raw.csv")
df

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014
0,AD,2000,0.0,0.0,1.0,0.0,0,0,0.0,,
1,AE,2000,2.0,4.0,4.0,6.0,5,12,10.0,,3.0
2,AF,2000,52.0,228.0,183.0,149.0,129,94,80.0,,93.0
3,AG,2000,0.0,0.0,0.0,0.0,0,0,1.0,,1.0
4,AL,2000,2.0,19.0,21.0,14.0,24,19,16.0,,3.0
5,AM,2000,2.0,152.0,130.0,131.0,63,26,21.0,,1.0
6,AN,2000,0.0,0.0,1.0,2.0,0,0,0.0,,0.0
7,AO,2000,186.0,999.0,1003.0,912.0,482,312,194.0,,247.0
8,AR,2000,97.0,278.0,594.0,402.0,419,368,330.0,,121.0
9,AS,2000,,,,,1,1,,,


#### To tidy this dataset, we first gather the sex and age values into a single column. Then we create three columns from its content: sex, lower age bound, and upper age bound.

In [147]:
df2 = df.copy()
df2['10_20'] = [5,1,3,4,4,5,6,10,10,20]
df2['20_30'] = [5,1,3,4,4,5,6,10,10,20]

In [148]:
df2

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014,10_20,20_30
0,AD,2000,0.0,0.0,1.0,0.0,0,0,0.0,,,5,5
1,AE,2000,2.0,4.0,4.0,6.0,5,12,10.0,,3.0,1,1
2,AF,2000,52.0,228.0,183.0,149.0,129,94,80.0,,93.0,3,3
3,AG,2000,0.0,0.0,0.0,0.0,0,0,1.0,,1.0,4,4
4,AL,2000,2.0,19.0,21.0,14.0,24,19,16.0,,3.0,4,4
5,AM,2000,2.0,152.0,130.0,131.0,63,26,21.0,,1.0,5,5
6,AN,2000,0.0,0.0,1.0,2.0,0,0,0.0,,0.0,6,6
7,AO,2000,186.0,999.0,1003.0,912.0,482,312,194.0,,247.0,10,10
8,AR,2000,97.0,278.0,594.0,402.0,419,368,330.0,,121.0,10,10
9,AS,2000,,,,,1,1,,,,20,20


In [149]:
df2 = pd.melt(df2, id_vars = ["country", "year","10_20",'20_30'], value_name = "cases", var_name = "sex_and_age")
df2

Unnamed: 0,country,year,10_20,20_30,sex_and_age,cases
0,AD,2000,5,5,m014,0.0
1,AE,2000,1,1,m014,2.0
2,AF,2000,3,3,m014,52.0
3,AG,2000,4,4,m014,0.0
4,AL,2000,4,4,m014,2.0
...,...,...,...,...,...,...
85,AM,2000,5,5,f014,1.0
86,AN,2000,6,6,f014,0.0
87,AO,2000,10,10,f014,247.0
88,AR,2000,10,10,f014,121.0


In [150]:
df2 = pd.melt(df2, id_vars = ["country", "year", "sex_and_age", "cases"], value_name = "total", var_name = "income")
df2

Unnamed: 0,country,year,sex_and_age,cases,income,total
0,AD,2000,m014,0.0,10_20,5
1,AE,2000,m014,2.0,10_20,1
2,AF,2000,m014,52.0,10_20,3
3,AG,2000,m014,0.0,10_20,4
4,AL,2000,m014,2.0,10_20,4
...,...,...,...,...,...,...
175,AM,2000,f014,1.0,20_30,5
176,AN,2000,f014,0.0,20_30,6
177,AO,2000,f014,247.0,20_30,10
178,AR,2000,f014,121.0,20_30,10


In [151]:
df = pd.melt(df, id_vars = ["country", "year"], value_name = "cases", var_name = "sex_and_age")
df.head(20)

Unnamed: 0,country,year,sex_and_age,cases
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0
5,AM,2000,m014,2.0
6,AN,2000,m014,0.0
7,AO,2000,m014,186.0
8,AR,2000,m014,97.0
9,AS,2000,m014,


#### We extract the variables with the help of [regular expressions](https://docs.python.org/3/library/re.html) and the [`.str.extract()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html) function. We ask the function to split each value it receives into three parts:
* `(\D)`: a single letter or other non-digit character (the sex)
* `(\d+)`: one or more digits (the lower age bound)
* `(\d{2})`: exactly two digits (the upper age bound)

In [152]:
# regular expressions
# The r prefix makes this a raw string, so \D and \d reach the regex engine unchanged
tmp_df = df["sex_and_age"].str.extract(r"(\D)(\d+)(\d{2})", expand = False)
tmp_df

Unnamed: 0,0,1,2
0,m,0,14
1,m,0,14
2,m,0,14
3,m,0,14
4,m,0,14
...,...,...,...
85,f,0,14
86,f,0,14
87,f,0,14
88,f,0,14


In [153]:
pd.Series(['m014']).str.extract(r"(\D)(\d+)(\d{2})", expand = False)

Unnamed: 0,0,1,2
0,m,0,14
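
To see how the three groups split different codes, the same pattern can be tried with Python's `re` module directly. This is only an illustrative sketch; the list of codes is typed by hand from the column names of the dataset.

```python
import re

pattern = re.compile(r"(\D)(\d+)(\d{2})")

codes = ["m014", "m1524", "m5564", "f014", "m65", "mu"]
results = {}
for code in codes:
    match = pattern.search(code)
    # groups() returns the three captured parts; None means the pattern did not match
    results[code] = match.groups() if match else None
    print(code, "->", results[code])
```

The last group always takes the final two digits and the middle group takes whatever digits come before them, which is why a code needs at least three digits to match. Codes such as `m65` and `mu` do not match and yield no groups.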


#### We assign the names `"sex"`, `"age_lower"`, and `"age_upper"` to the columns of the DataFrame `tmp_df`.

In [154]:
tmp_df.columns = ["sex", "age_lower", "age_upper"]
tmp_df

Unnamed: 0,sex,age_lower,age_upper
0,m,0,14
1,m,0,14
2,m,0,14
3,m,0,14
4,m,0,14
...,...,...,...
85,f,0,14
86,f,0,14
87,f,0,14
88,f,0,14


#### We create the `"age"` column from `"age_lower"` and `"age_upper"`.

In [155]:
tmp_df["age"] = tmp_df["age_lower"] + "-" + tmp_df["age_upper"]
tmp_df

Unnamed: 0,sex,age_lower,age_upper,age
0,m,0,14,0-14
1,m,0,14,0-14
2,m,0,14,0-14
3,m,0,14,0-14
4,m,0,14,0-14
...,...,...,...,...
85,f,0,14,0-14
86,f,0,14,0-14
87,f,0,14,0-14
88,f,0,14,0-14


In [156]:
df

Unnamed: 0,country,year,sex_and_age,cases
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0
...,...,...,...,...
85,AM,2000,f014,1.0
86,AN,2000,f014,0.0
87,AO,2000,f014,247.0
88,AR,2000,f014,121.0


#### We join the two datasets with the help of the [`pd.concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function.

In [157]:
df_concat = pd.concat([df, tmp_df], axis = 1)
df_concat.head()

Unnamed: 0,country,year,sex_and_age,cases,sex,age_lower,age_upper,age
0,AD,2000,m014,0.0,m,0,14,0-14
1,AE,2000,m014,2.0,m,0,14,0-14
2,AF,2000,m014,52.0,m,0,14,0-14
3,AG,2000,m014,0.0,m,0,14,0-14
4,AL,2000,m014,2.0,m,0,14,0-14
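
With `axis = 1`, `pd.concat` places the frames side by side by matching index labels, not row positions. In the notebook both frames share the same index, so the rows line up one to one. The sketch below uses two hypothetical frames (`left`, `right`) with different indexes to show what happens otherwise:

```python
import pandas as pd

left = pd.DataFrame({'cases': [0.0, 2.0, 52.0]}, index=[0, 1, 2])
right = pd.DataFrame({'sex': ['m', 'f']}, index=[0, 2])   # no row for index 1

side_by_side = pd.concat([left, right], axis=1)

print(side_by_side)
```

The row with index 1 exists only in `left`, so its `sex` value comes out as missing.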


In [158]:
df_concat["age"].value_counts()

0-14     20
15-24    10
55-64    10
25-34    10
45-54    10
35-44    10
Name: age, dtype: int64

#### We check for missing values.

In [159]:
df_concat.isnull().sum()

country         0
year            0
sex_and_age     0
cases          17
sex            20
age_lower      20
age_upper      20
age            20
dtype: int64

#### Looking at the missing entries, we see that the regular expression did not match for men aged 65 or older (`m65`) or of unspecified age (`mu`).

In [161]:
df_concat.loc[df_concat['age'].isnull(),:]

Unnamed: 0,country,year,sex_and_age,cases,sex,age_lower,age_upper,age
60,AD,2000,m65,0.0,,,,
61,AE,2000,m65,10.0,,,,
62,AF,2000,m65,80.0,,,,
63,AG,2000,m65,1.0,,,,
64,AL,2000,m65,16.0,,,,
65,AM,2000,m65,21.0,,,,
66,AN,2000,m65,0.0,,,,
67,AO,2000,m65,194.0,,,,
68,AR,2000,m65,330.0,,,,
69,AS,2000,m65,,,,,


In [162]:
df_concat.loc[df_concat['sex_and_age'] == 'm65', 'age'] = '65 or more'
df_concat.loc[df_concat['sex_and_age'] == 'm65', 'sex'] = 'm'
df_concat.loc[df_concat['age'].isnull(), ]
#df.tail()

Unnamed: 0,country,year,sex_and_age,cases,sex,age_lower,age_upper,age
70,AD,2000,mu,,,,,
71,AE,2000,mu,,,,,
72,AF,2000,mu,,,,,
73,AG,2000,mu,,,,,
74,AL,2000,mu,,,,,
75,AM,2000,mu,,,,,
76,AN,2000,mu,,,,,
77,AO,2000,mu,,,,,
78,AR,2000,mu,,,,,
79,AS,2000,mu,,,,,
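
As an alternative to patching the unmatched codes by hand, a different pattern can capture every code in one pass and a lookup table can turn the age code into a label. This is a sketch of a swapped-in technique, not the approach used in the notebook; the pattern, the `age_labels` dictionary, and the `'unknown'` label are all assumptions made for illustration.

```python
import pandas as pd

codes = pd.Series(['m014', 'm1524', 'm65', 'mu', 'f014'])

# Named groups become column names: one letter for the sex, then digits or "u".
parts = codes.str.extract(r"(?P<sex>[mf])(?P<age_code>\d+|u)")

# Hypothetical lookup table from age code to a readable label.
age_labels = {'014': '0-14', '1524': '15-24', '65': '65 or more', 'u': 'unknown'}
parts['age'] = parts['age_code'].map(age_labels)

print(parts)
```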


#### We check whether the number of nulls has decreased.

In [163]:
df_concat.isnull().sum()

country         0
year            0
sex_and_age     0
cases          17
sex            10
age_lower      20
age_upper      20
age            10
dtype: int64

#### We drop the columns that are no longer needed.

In [165]:
df_concat = df_concat.drop(['sex_and_age',"age_lower","age_upper"], axis = 1)
df_concat.head()

Unnamed: 0,country,year,cases,sex,age
0,AD,2000,0.0,m,0-14
1,AE,2000,2.0,m,0-14
2,AF,2000,52.0,m,0-14
3,AG,2000,0.0,m,0-14
4,AL,2000,2.0,m,0-14


#### Since people of unspecified age have no recorded cases, it is safe to remove those rows with `dropna`. Note that `dropna()` without arguments also removes every other row whose `cases` value is missing.

In [167]:
df_concat = df_concat.dropna()
df_concat = df_concat.sort_values(["country", "year", "sex", "age"])
df_concat.head(10)

Unnamed: 0,country,year,cases,sex,age
0,AD,2000,0.0,m,0-14
10,AD,2000,0.0,m,15-24
20,AD,2000,1.0,m,25-34
30,AD,2000,0.0,m,35-44
40,AD,2000,0.0,m,45-54
50,AD,2000,0.0,m,55-64
60,AD,2000,0.0,m,65 or more
81,AE,2000,3.0,f,0-14
1,AE,2000,2.0,m,0-14
11,AE,2000,4.0,m,15-24


In [169]:
df_concat.pivot_table(index='age',columns='sex',values='cases', aggfunc='sum').reset_index()

sex,age,f,m
0,0-14,469.0,341.0
1,15-24,,1680.0
2,25-34,,1937.0
3,35-44,,1616.0
4,45-54,,1123.0
5,55-64,,832.0
6,65 or more,,652.0


## Tools for cleaning and manipulating data

#### Pandas has a set of methods for operating on the elements of a DataFrame or a Series. To apply the desired logic, we can define named functions or use lambda expressions, which are anonymous and cannot be reused afterwards.

* [pd.DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html): Operates on whole rows or columns.
* [pd.DataFrame.applymap](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html): Operates on each element of the DataFrame. Recent pandas releases deprecate it in favor of `DataFrame.map`.
* [pd.Series.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html): Operates on each element of the Series.
* [pd.Series.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html): Operates on each element of the Series, much like Series.apply.
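
The four methods listed above can be compared side by side on a tiny DataFrame. The frame and the lambdas are illustrative. Because the element-wise DataFrame method has a different name depending on the pandas version, the sketch picks whichever one the installed version offers.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# DataFrame.apply: the function receives a whole column (axis=0) or row (axis=1).
col_range = df_demo.apply(lambda col: col.max() - col.min())   # one value per column

# Element-wise over a DataFrame: DataFrame.map where available, else applymap.
elementwise = getattr(df_demo, 'map', None) or df_demo.applymap
doubled = elementwise(lambda x: x * 2)

# Series.apply and Series.map: the function receives one element at a time.
squared = df_demo['a'].apply(lambda x: x ** 2)
labels = df_demo['a'].map({1: 'one', 2: 'two', 3: 'three'})    # map also accepts a dict

print(col_range.to_dict(), squared.tolist(), labels.tolist())
```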

### The [`.apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html) function.

#### The Pandas apply function applies a function along an axis of the dataset, either row by row or column by column.

In [2]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 4), columns = ['a', 'b', 'c', 'd'])
df

Unnamed: 0,a,b,c,d
0,-1.861923,2.233081,1.743397,0.191558
1,0.543267,1.013022,-0.186358,0.066285
2,1.194406,-1.727876,-0.637592,1.063092
3,-0.59207,0.264621,-0.779915,-0.028528
4,-0.672697,0.638193,-2.672151,0.340108


#### We use `df.apply` to compute the square root of the elements of each column. `NaN` stands for "Not a Number" and is the value assigned to invalid operations, such as the square root of a negative number.

In [3]:
df.apply(np.sqrt)

Unnamed: 0,a,b,c,d
0,,1.49435,1.320378,0.437674
1,0.737067,1.00649,,0.257458
2,1.092889,,,1.031064
3,,0.514413,,
4,,0.79887,,0.583188


#### With `axis = 0` the function receives each column and collapses it along the rows, so axis 0 is the axis being reduced.

In [10]:
df.apply(np.mean,axis=0) # axis = 0 applies the function to each column; axis = 1 applies it to each row

a   -0.277803
b    0.484208
c   -0.506524
d    0.326503
dtype: float64

#### Now we compute the mean of each row; the parameter `axis = 1` indicates that the function is applied to each row. Note that the previous `apply` did not modify the dataset: it returned a new object, and the original dataset keeps the same values.

In [47]:
df.apply(np.mean, axis = 1)

0    0.792786
1    0.751723
2   -0.659844
3   -0.133671
4    0.249730
dtype: float64
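
Since the DataFrame above is random, the effect of `axis` is easier to verify on fixed numbers. A minimal sketch with a hypothetical two-by-two frame:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'a': [1.0, 3.0], 'b': [5.0, 7.0]})

col_means = df_demo.apply(np.mean, axis=0)   # one result per column: a, b
row_means = df_demo.apply(np.mean, axis=1)   # one result per row: 0, 1

print(col_means.to_dict())
print(row_means.tolist())
```

Column `a` averages 1 and 3, while row 0 averages 1 and 5. The built-in `df_demo.mean(axis=0)` and `df_demo.mean(axis=1)` give the same numbers.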

#### [`np.mean()`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) is a function defined in NumPy, but we may want to apply a function of our own, for example to create a new column that is the sum of the series `a` and `d`. This can be done with lambda expressions.
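
A sketch of the sum of `a` and `d` with a lambda expression, on a small hand-made DataFrame with the same column names (the values and the new column name `a_plus_d` are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'a': [1.0, 2.0],
    'b': [0.0, 0.0],
    'c': [0.0, 0.0],
    'd': [10.0, 20.0],
})

# axis=1 passes each row to the lambda as a Series indexed by column name.
df_demo['a_plus_d'] = df_demo.apply(lambda row: row['a'] + row['d'], axis=1)

# For a simple sum, plain column arithmetic gives the same result.
vectorized = df_demo['a'] + df_demo['d']

print(df_demo['a_plus_d'].tolist())
```

For simple arithmetic like this, the column expression is usually the better choice; `apply` with a lambda is most useful when the row logic cannot be written with column operations.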

#### The `map()`, `apply()`, and `applymap()` functions are very convenient for data cleaning.

#### For example, suppose we want to remove all accents and other language-specific characters from every string in a DataFrame.

#### In addition, we want to convert all letters to lowercase.

In [211]:
data = pd.DataFrame({'nome': ['Tomás','Carla','Paula'], 'cidade': ['Rio','São Paulo','sao  paulo']}, 
columns =['nome','cidade'])
data

Unnamed: 0,nome,cidade
0,Tomás,Rio
1,Carla,São Paulo
2,Paula,sao paulo


In [212]:
data.cidade.unique()

array(['Rio', 'São Paulo', 'sao  paulo'], dtype=object)

In [213]:
import unidecode
texto = 'São Paulo'
texto_sem_acento = unidecode.unidecode(texto)
texto_minusculo = str.lower(texto_sem_acento)
texto_sem_espaco = str.replace(texto_minusculo," ","_")

print(texto, " > ", texto_sem_acento, " > ",texto_minusculo," > ", texto_sem_espaco)

São Paulo  >  Sao Paulo  >  sao paulo  >  sao_paulo


In [216]:
#! conda install unidecode 
def quitar_caracteres(entrada):
    # Strip accents, lowercase, and replace spaces with underscores
    texto_sem_acento = unidecode.unidecode(entrada)             # 'São' -> 'Sao'
    texto_minusculo = str.lower(texto_sem_acento)
    texto_espaco_duplo = str.replace(texto_minusculo,"  "," ")  # collapse double spaces
    texto_sem_espaco = str.replace(texto_espaco_duplo," ","_")
    return texto_sem_espaco

In [217]:
data['cidade'].apply(quitar_caracteres)

0          rio
1    sao_paulo
2    sao_paulo
Name: cidade, dtype: object

In [205]:
data['cidade'].transform(quitar_caracteres)

0          rio
1    sao_paulo
2    sao_paulo
Name: cidade, dtype: object

In [206]:
data['cidade'].map(quitar_caracteres)

0          rio
1    sao_paulo
2    sao_paulo
Name: cidade, dtype: object

In [209]:
data = data.applymap(quitar_caracteres)
data

Unnamed: 0,nome,cidade
0,tomas,rio
1,carla,sao_paulo
2,paula,sao_paulo


In [210]:
data.cidade.unique()

array(['rio', 'sao_paulo'], dtype=object)
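
For a single text column, the same cleanup can also be sketched with the vectorized `.str` accessor, without the `unidecode` package. This is an alternative technique, not the one used above: it relies on Unicode normalization to separate accents from letters, so characters that have no ASCII base letter are simply dropped.

```python
import pandas as pd

cities = pd.Series(['Rio', 'São Paulo', 'sao  paulo'])

cleaned = (
    cities.str.normalize('NFKD')                  # split "ã" into "a" plus a combining tilde
          .str.encode('ascii', errors='ignore')   # drop everything that is not ASCII
          .str.decode('ascii')                    # bytes back to str
          .str.lower()
          .str.replace(r'\s+', '_', regex=True)   # any run of whitespace becomes one underscore
)

print(cleaned.tolist())
```

Using the `\s+` pattern handles any number of consecutive spaces, not only double ones.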