Data cleaning and preparation

1. Handling missing data

In [1]:
import pandas as pd
import numpy as np

In [3]:
string_data = pd.Series(['fusrodah', 'midvrshaan', np.nan, 'dragon shouts'])
string_data

0         fusrodah
1       midvrshaan
2              NaN
3    dragon shouts
dtype: object

In [4]:
string_data.isnull()    

0    False
1    False
2     True
3    False
dtype: bool

In [5]:
# Python's built-in None is also treated as NA in object arrays
string_data[0] = None

string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

In [6]:
string_data.notnull() # the negation of isnull()

0    False
1     True
2    False
3     True
dtype: bool
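
A common next step is to count how many entries are missing. The sketch below rebuilds a Series like the one above (the names `s`, `n_missing` and `n_present` are illustrative, not part of the original notebook) and relies on True counting as 1 when a boolean Series is summed.

```python
import numpy as np
import pandas as pd

# Same kind of data as above: one None and one NaN among strings
s = pd.Series([None, 'midvrshaan', np.nan, 'dragon shouts'])

n_missing = int(s.isnull().sum())    # True counts as 1, False as 0
n_present = int(s.notnull().sum())
print(n_missing, n_present)
```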

In [7]:
# dropna() removes missing values

# NaT is the missing value for datetime data
df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'], "toy": [np.nan, 'Batmobile', 'Bullwhip'], 
                   "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT]})

df

Unnamed: 0,name,toy,born
0,Alfred,,NaT
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [8]:
df.dropna() # axis='index' is the default: drops rows that contain any missing value

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25


In [9]:
df.dropna(axis='columns') # drops columns that contain missing values

Unnamed: 0,name
0,Alfred
1,Batman
2,Catwoman


In [10]:
# df.dropna(how='all') # drops only rows in which every element is missing
df.dropna(thresh=2) # keeps rows with at least 2 non-missing values

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT


In [11]:
df.dropna(subset=['name', 'toy']) # restricts the search for missing values to these columns

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1940-04-25
2,Catwoman,Bullwhip,NaT
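
Before choosing among `how`, `thresh` and `subset`, it can help to see where the gaps are. This is a minimal sketch on the same frame; the names `missing_per_column` and `non_missing_per_row` are illustrative. The per-row count of non-missing values is the quantity that `thresh` is compared against.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ['Alfred', 'Batman', 'Catwoman'],
    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
    "born": [pd.NaT, pd.Timestamp("1940-04-25"), pd.NaT],
})

missing_per_column = df.isna().sum()           # NaN and NaT both count as missing
non_missing_per_row = df.notna().sum(axis=1)   # compared against thresh by dropna
print(missing_per_column)
print(non_missing_per_row)
```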


Filtering out missing data

We can use pandas.isnull() with boolean indexing, but the dropna method is often more convenient.

In [12]:
from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [21]:
data_bool = data.notnull()
data[data_bool]
# or simply: data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [13]:
# or using dropna()
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

Using dropna with DataFrames

In [22]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [23]:
# default: drop every row that contains a missing value
data.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [24]:
data.dropna(axis='columns') # every column has at least one NaN, so all columns are dropped and only the index remains

0
1
2
3
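
Dropping a column for a single NaN is often too aggressive. As a sketch, combining `axis='columns'` with `how='all'` removes only the columns that are entirely missing. Column `4` below is a hypothetical extra column added for illustration; it is not part of the original data.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data[4] = NA  # hypothetical column in which every value is missing

cleaned = data.dropna(axis='columns', how='all')  # drops only column 4
print(cleaned.columns.tolist())
```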


In [25]:
# drops only rows in which every element is NaN
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [26]:
df = pd.DataFrame(np.random.randn(7, 3))
df

Unnamed: 0,0,1,2
0,-0.342089,-0.21971,0.027679
1,0.097919,-1.362805,0.347169
2,0.338667,1.347761,0.080502
3,-0.843304,-0.087323,-0.311521
4,-0.461455,-2.088353,-0.401039
5,0.131544,1.207047,0.528728
6,1.513388,0.77592,0.04949


In [31]:
# NA is numpy's nan, imported above; assigning it keeps the columns as float64
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.342089,,
1,0.097919,,
2,0.338667,,0.080502
3,-0.843304,,-0.311521
4,-0.461455,-2.088353,-0.401039
5,0.131544,1.207047,0.528728
6,1.513388,0.77592,0.04949


In [32]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.461455,-2.088353,-0.401039
5,0.131544,1.207047,0.528728
6,1.513388,0.77592,0.04949


In [33]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
2,0.338667,,0.080502
3,-0.843304,,-0.311521
4,-0.461455,-2.088353,-0.401039
5,0.131544,1.207047,0.528728
6,1.513388,0.77592,0.04949
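
To make the meaning of `thresh` concrete, the sketch below checks that `dropna(thresh=2)` selects the same rows as an explicit boolean mask on the per-row count of non-missing values. The seeded generator is an assumption made only for reproducibility, so the numbers differ from the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame(rng.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan

by_thresh = df.dropna(thresh=2)
by_mask = df[df.notna().sum(axis=1) >= 2]   # rows with at least 2 non-missing values
print(by_thresh.equals(by_mask))
```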


Filling in missing data with fillna

In [34]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.342089,0.0,0.0
1,0.097919,0.0,0.0
2,0.338667,0.0,0.080502
3,-0.843304,0.0,-0.311521
4,-0.461455,-2.088353,-0.401039
5,0.131544,1.207047,0.528728
6,1.513388,0.77592,0.04949


In [35]:
df.fillna({1: 0.5, 2: 0}) # each key is a column label; its value replaces the missing entries in that column

Unnamed: 0,0,1,2
0,-0.342089,0.5,0.0
1,0.097919,0.5,0.0
2,0.338667,0.5,0.080502
3,-0.843304,0.5,-0.311521
4,-0.461455,-2.088353,-0.401039
5,0.131544,1.207047,0.528728
6,1.513388,0.77592,0.04949


fillna returns a new object by default, but the existing object can also be modified in place.

In [36]:
_ = df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,-0.342089,0.0,0.0
1,0.097919,0.0,0.0
2,0.338667,0.0,0.080502
3,-0.843304,0.0,-0.311521
4,-0.461455,-2.088353,-0.401039
5,0.131544,1.207047,0.528728
6,1.513388,0.77592,0.04949
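
As an alternative to `inplace=True`, the result can simply be assigned back to the same name. The sketch below uses a small hypothetical frame (`frame` is not from the original notebook) to show that a plain `fillna` call leaves the original object untouched until it is reassigned.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 2.0]})

filled = frame.fillna(0)                        # new object; frame is unchanged
still_missing = int(frame.isna().sum().sum())   # total NaN count left in frame

frame = frame.fillna(0)                         # reassignment instead of inplace=True
print(still_missing, int(frame.isna().sum().sum()))
```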


In [37]:
data = pd.Series([1., NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [38]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
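
A constant or the mean is not the only option. For ordered data, the last valid observation can be carried forward instead. This is a sketch using `Series.ffill`, a method that does not appear in the original notebook; the Series here has an extra gap so that the effect of `limit` is visible.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., np.nan, 3.5, np.nan, np.nan, 7])

forward = data.ffill()           # carry the last valid value forward
limited = data.ffill(limit=1)    # fill at most one consecutive missing value
print(forward.tolist())
print(limited.tolist())
```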

Data transformation

2. Removing duplicates: a DataFrame may contain duplicate rows, and pandas can find them

In [39]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [40]:
# DataFrame.duplicated returns a boolean Series indicating whether each row duplicates an earlier one
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
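
Which copy gets flagged depends on the `keep` parameter. The sketch below rebuilds the same frame and compares the three settings; the names `first`, `last` and `every` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

first = data.duplicated()               # default keep='first': later copies are flagged
last = data.duplicated(keep='last')     # the earlier copy is flagged instead
every = data.duplicated(keep=False)     # every member of a duplicate group is flagged
print(first.tolist(), last.tolist(), every.tolist(), sep='\n')
```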

In [41]:
# drop_duplicates returns a DataFrame with the rows for which duplicated() is False
data.drop_duplicates() # or data[~data.duplicated()]

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
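
By default every column is compared. As a sketch, `subset` restricts the comparison to chosen columns and `keep='last'` retains the final occurrence rather than the first; the names `by_k1` and `by_k1_last` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

by_k1 = data.drop_duplicates(subset=['k1'])                    # first row per k1 value
by_k1_last = data.drop_duplicates(subset=['k1'], keep='last')  # last row per k1 value
print(by_k1)
print(by_k1_last)
```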
