In [1]:
import pandas as pd

# Handling null and/or missing data
- Let's take a look at how to deal with missing data

In [2]:
nba = pd.read_csv("../data/nba.csv")
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,


- We can already see `NaN` values in the `College` column
- In addition, row 457 is entirely `NaN`
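
Spotting `NaN` by eye only works for the rows that happen to be displayed. A minimal sketch of counting missing values instead, using a small made-up DataFrame (`toy` and its values are illustrative assumptions, not the NBA file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the NBA table
toy = pd.DataFrame(
    {
        "Name": ["A", "B", np.nan],
        "College": ["Butler", np.nan, np.nan],
        "Salary": [100.0, 200.0, np.nan],
    }
)

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Boolean mask marking the rows in which every value is missing
all_missing_rows = toy.isna().all(axis=1)

print(missing_per_column)
print(all_missing_rows)
```

The same two calls should work on `nba` to confirm what the `tail()` output suggests.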

## Removing missing data
- One method for removing this kind of data is `dropna()`

In [3]:
nba.dropna().tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
449,Rodney Hood,Utah Jazz,5.0,SG,23.0,6-8,206.0,Duke,1348440.0
451,Chris Johnson,Utah Jazz,23.0,SF,26.0,6-6,206.0,Dayton,981348.0
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


- Note that the method removed the rows that contain `NaN`
- However, it removed every row that has at least one `NaN`, such as 454 and 455
- The `dropna()` method has a few parameters to control this (see the documentation)
    - `how`: can be `"any"` (default) or `"all"`. With `"all"`, a row is dropped only if all of its values are null
    - `axis`: 0 drops rows, 1 drops columns
    - `subset`: accepts a list of columns (or index labels, when dropping columns) that restricts where `NaN` is looked for
    - `inplace`: as we have already seen, applies the change to the current DF instead of returning a new one
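
The parameters above can be compared side by side on a tiny example. This is only a sketch on made-up data (`toy` is an illustrative assumption, not the NBA file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 4 is entirely NaN, rows 1-3 are partially filled
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D", np.nan],
        "College": ["Butler", np.nan, "Kansas", np.nan, np.nan],
        "Salary": [100.0, 200.0, np.nan, np.nan, np.nan],
    }
)

# how="any" (default): only the fully populated row 0 survives
any_rows = toy.dropna()

# how="all": only the all-NaN row 4 is dropped
all_rows = toy.dropna(how="all")

# axis=1 drops columns instead of rows; after removing row 4, only "Name" has no NaN
complete_cols = toy.dropna(how="all").dropna(axis=1)

# subset + how="all": drop a row only when both listed columns are missing
both_missing = toy.dropna(subset=["College", "Salary"], how="all")

print(any_rows.index.tolist(), all_rows.index.tolist())
print(complete_cols.columns.tolist(), both_missing.index.tolist())
```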

In [4]:
nba.dropna(how="all").tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


- Note that the last row was dropped, but 454 and 455 were not, because each has only one `NaN` rather than all values missing

In [5]:
nba.dropna(subset=["Salary", "College"]).tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
449,Rodney Hood,Utah Jazz,5.0,SG,23.0,6-8,206.0,Duke,1348440.0
451,Chris Johnson,Utah Jazz,23.0,SF,26.0,6-6,206.0,Dayton,981348.0
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,2239800.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


- Only `Salary` and `College` are checked here: a row is dropped if either of them is `NaN`

## Filling in missing data
- The `fillna()` method is used to fill in missing data (provided you know what to do with it)

In [6]:
nba.fillna(value=0).tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,0,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,0,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,0,0,0.0,0,0.0,0,0.0,0,0.0


- Note that every `NaN` was replaced by 0
- That may make sense for the salary, but not for `College`
- In this case, it makes more sense to fill column by column:

In [7]:
nba["Salary"] = nba["Salary"].fillna(value=0)  # assign back instead of chained inplace=True
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,,0.0


In [8]:
nba["College"] = nba["College"].fillna(value="No college")  # assign back instead of chained inplace=True
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,No college,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,No college,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
457,,,,,,,,No college,0.0
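
The per-column fills can also be written as a single call by passing a dict that maps each column to its fill value. A minimal sketch on made-up data (`toy` is an illustrative assumption, not the NBA file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column
toy = pd.DataFrame(
    {
        "Name": ["A", "B", np.nan],
        "College": ["Butler", np.nan, np.nan],
        "Salary": [100.0, np.nan, np.nan],
    }
)

# One fill value per column; columns not listed (here "Name") are left untouched.
# fillna returns a new DataFrame, so `toy` itself is not modified.
filled = toy.fillna({"College": "No college", "Salary": 0})

print(filled)
```

Assigning the result (`nba = nba.fillna({...})`) avoids calling `inplace=True` on a column selection.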


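## Adjusting column types after handling missing data
- As mentioned before, whenever a numeric column contains a `NaN`, pandas stores it as `float64`
- This is not desirable: whole numbers such as `Age` show up as floats, and in very large tables choosing the right dtypes matters for memory (note that `float64` and `int64` both use 8 bytes per value, so the savings come from narrower or categorical types)
- The idea is to fix this, but it is only possible after the missing data has been handled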
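
A small sketch of why the order matters, using made-up Series (the names `ages_complete`, `ages_with_gap` and `ages_fixed` are illustrative assumptions): one `NaN` is enough to turn an integer column into `float64`, and the conversion back to integer only succeeds once the gap has been filled.

```python
import numpy as np
import pandas as pd

ages_complete = pd.Series([25, 27, 22])
ages_with_gap = pd.Series([25, np.nan, 22])

print(ages_complete.dtype)   # int64
print(ages_with_gap.dtype)   # float64

# Converting straight to an integer dtype fails while a NaN is still present
try:
    ages_with_gap.astype("int64")
except ValueError as err:
    print("cannot convert:", err)

# After filling the gap, the conversion works
ages_fixed = ages_with_gap.fillna(0).astype("int64")
print(ages_fixed.dtype)      # int64
```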

In [13]:
nba.dropna(how="all", inplace=True)
# Assign each result back instead of using chained inplace=True
nba["College"] = nba["College"].fillna(value="No college")
nba["Salary"] = nba["Salary"].fillna(value=0)
nba["Age"] = nba["Age"].fillna(value=0)
nba.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,No college,900000
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,No college,2900000
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276
457,,,,,0.0,,,No college,0


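- Row 457 is still present: its `College` and `Salary` had already been filled in the cells above, so it was no longer entirely `NaN` and `dropna(how="all")` kept it
- Let's take a look at the types: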

In [14]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      457 non-null    object 
 1   Team      457 non-null    object 
 2   Number    457 non-null    float64
 3   Position  457 non-null    object 
 4   Age       458 non-null    float64
 5   Height    457 non-null    object 
 6   Weight    457 non-null    float64
 7   College   458 non-null    object 
 8   Salary    458 non-null    int64  
dtypes: float64(3), int64(1), object(5)
memory usage: 32.3+ KB


- Now let's adjust the types using the `astype()` method
- We can do this for a Series (column) as follows:

In [11]:
nba["Salary"].astype("int")

0      7730337
1      6796117
2            0
3      1148640
4      5000000
        ...   
453    2433333
454     900000
455    2900000
456     947276
457          0
Name: Salary, Length: 458, dtype: int64

- The method has no `inplace` parameter, so the result has to be assigned back
- Let's also apply it to other columns

In [15]:
nba['Salary'] = nba["Salary"].astype("int")
nba['Age'] = nba["Age"].astype("int")
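
As an alternative to one assignment per column, `DataFrame.astype()` also accepts a dict mapping column names to dtypes. A minimal sketch on made-up data (`toy` is an illustrative assumption, not the NBA file):

```python
import pandas as pd

# Hypothetical toy data whose numeric columns are stored as floats
toy = pd.DataFrame(
    {"Name": ["A", "B"], "Age": [25.0, 27.0], "Salary": [100.0, 0.0]}
)

# Map each column to its target dtype; columns not listed keep their dtype
toy = toy.astype({"Age": "int64", "Salary": "int64"})

print(toy.dtypes)
```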

In [16]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27,6-5,205.0,Boston University,0
3,R.J. Hunter,Boston Celtics,28.0,SG,22,6-5,185.0,Georgia State,1148640
4,Jonas Jerebko,Boston Celtics,8.0,PF,29,6-10,231.0,No college,5000000


- We can use categorical data (`category`)
- This helps reduce the size of the data by creating categories for columns whose values repeat
    - Instead of storing a string in every row, each row refers to a category
    - This applies to positions and team names, for example, which repeat many times
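
To make the idea concrete, here is a small sketch of what a categorical column holds internally, using a made-up Series of positions (`positions` and `as_category` are illustrative names):

```python
import pandas as pd

positions = pd.Series(["PG", "SF", "PG", "C", "SF", "PG"])
as_category = positions.astype("category")

# The distinct labels are stored only once...
print(as_category.cat.categories.tolist())   # ['C', 'PG', 'SF']

# ...and each row holds a small integer code pointing at its label
print(as_category.cat.codes.tolist())        # [1, 2, 1, 0, 2, 1]
```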

In [17]:
nba["Position"] = nba["Position"].astype("category")
nba["Team"] = nba["Team"].astype("category")

In [18]:
nba.head()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25,6-2,180.0,Texas,7730337
1,Jae Crowder,Boston Celtics,99.0,SF,25,6-6,235.0,Marquette,6796117
2,John Holland,Boston Celtics,30.0,SG,27,6-5,205.0,Boston University,0
3,R.J. Hunter,Boston Celtics,28.0,SG,22,6-5,185.0,Georgia State,1148640
4,Jonas Jerebko,Boston Celtics,8.0,PF,29,6-10,231.0,No college,5000000


- Now let's look at the info again

In [19]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Name      457 non-null    object  
 1   Team      457 non-null    category
 2   Number    457 non-null    float64 
 3   Position  457 non-null    category
 4   Age       458 non-null    int64   
 5   Height    457 non-null    object  
 6   Weight    457 non-null    float64 
 7   College   458 non-null    object  
 8   Salary    458 non-null    int64   
dtypes: category(2), float64(2), int64(2), object(3)
memory usage: 27.6+ KB


- Note that the memory required went down, from 32.3+ KB to 27.6+ KB
- Of course this DF is small and we only changed 4 columns, but when this scales up, it makes a difference
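
The `+` in the `memory usage` line means the figure is a lower bound, because the contents of `object` columns are not measured by default; `nba.info(memory_usage="deep")` reports the full amount. A rough sketch of measuring the effect of `category` directly, on a made-up column of repeated team names (`teams` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical column: three team names repeated many times
teams = pd.Series(["Boston Celtics", "Utah Jazz", "Brooklyn Nets"] * 1000)

# deep=True also counts the memory held by the strings themselves
bytes_as_text = teams.memory_usage(deep=True)
bytes_as_category = teams.astype("category").memory_usage(deep=True)

print(bytes_as_text, bytes_as_category)
```

The exact numbers depend on the pandas version and string storage, so the comparison matters more than the absolute values.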