## Handling Missing Data with Pandas

Pandas builds on numpy's boolean selection and adds a number of convenient methods for handling missing values. Let's look at them one at a time:

<hr style="border: 3px solid SlateGray"> </hr>

In [1]:
import numpy as np
import pandas as pd

## Pandas utility functions

Like `numpy`, pandas also has a few utility functions to detect null values:

In [2]:
pd.isnull(np.nan)

True

In [3]:
pd.isnull(None)

True

In [4]:
pd.isna(np.nan)

True

In [5]:
pd.isna(None)

True

The opposite ones also exist:

In [6]:
pd.notnull(None)

False

In [7]:
pd.notnull(np.nan)

False

In [8]:
pd.notnull(3)

True
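The checks above use both the `isnull`/`isna` and `notnull` spellings. To my knowledge these are simply two names for the same functions, and they also accept plain Python lists. The short sketch below checks both points; the `values` list is made up for illustration.

```python
import numpy as np
import pandas as pd

# isnull/isna and notnull/notna are expected to be aliases of each other
print(pd.isnull is pd.isna)
print(pd.notnull is pd.notna)

# On array-like input the check is applied element by element
values = [1, None, np.nan, "text"]   # illustrative data, not from the notebook
mask = pd.isna(values)
print(mask)
```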

The functions also work with `Series` and `DataFrame`s:

In [9]:
pd.isnull(pd.Series([1, np.nan, 7]))

0    False
1     True
2    False
dtype: bool

In [10]:
pd.notnull(pd.Series([1, np.nan, 7]))

0     True
1    False
2     True
dtype: bool

In [11]:
pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True


<hr style='border:3px solid SlateGray'></hr>

### Pandas Operations with Missing Values

Pandas handles missing values more gracefully than numpy. `nan`s no longer propagate through the result: aggregation methods such as `count`, `sum`, and `mean` simply skip them:

In [12]:
pd.Series([1, 2, np.nan]).count()

2

In [13]:
pd.Series([1, 2, np.nan]).sum()

3.0

In [14]:
pd.Series([1, 2, np.nan]).mean()

1.5
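To make the contrast with numpy concrete, here is a small sketch that runs the same aggregation both ways. It also uses the `skipna` parameter of `Series.sum`, which is not shown in the cells above, to switch the skipping off.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan]

# numpy propagates nan through the aggregation
numpy_total = np.array(values).sum()

# pandas skips nan by default (skipna=True)
pandas_total = pd.Series(values).sum()

# Passing skipna=False restores the numpy-like behaviour
strict_total = pd.Series(values).sum(skipna=False)

print(numpy_total, pandas_total, strict_total)
```

Leaving the default in place is usually what you want for quick summaries, but `skipna=False` can be useful when a missing value should invalidate the whole result.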

---

### Filtering missing data

As we saw with numpy, we can combine boolean selection with `pd.notnull` to filter out `nan`s and other null values:

In [15]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

In [16]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [17]:
pd.notnull(s).sum()

4

In [18]:
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

Both `notnull` and `isnull` are also methods of `Series` and `DataFrame`s, so we can call them directly on the object:

In [19]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [20]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [21]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

<hr style='border: 3px solid SlateGray'></hr>

### Dropping null values

Boolean selection + `notnull()` is a bit verbose and repetitive. As we said before, any repetitive task probably has a better, more DRY way. In this case we can use the `dropna` method:

In [22]:
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
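One detail worth checking is that `dropna` returns a new `Series` and leaves the original untouched, and that the surviving rows keep their original index labels. The sketch below illustrates this; the names `cleaned` and `renumbered` are only for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

cleaned = s.dropna()          # new Series; `s` itself is not modified
print(len(s), len(cleaned))   # the original still contains its nan entries

# The surviving rows keep their original labels (0, 1, 2, 5);
# reset_index(drop=True) renumbers them if that is preferable
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```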

---

### Dropping null values on DataFrames

You saw how simple it is to drop `na`s with a `Series`. With `DataFrame`s there are a few more things to consider, because you can't drop single values: you can only drop entire rows or columns. Let's start with a sample `DataFrame`:

In [25]:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110]
})

In [26]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [27]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [28]:
df.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64
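The count above works because `True` is treated as 1 and `False` as 0 when summing a boolean mask. By the same reasoning, taking the mean of the mask gives the share of missing values per column. A minimal sketch, with `null_fraction` as an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110]
})

null_mask = df.isnull()
null_counts = null_mask.sum()      # True counts as 1, False as 0
null_fraction = null_mask.mean()   # share of missing entries per column

print(null_fraction)
```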

The default `dropna` behavior will drop all the rows in which any null value is present:

In [29]:
df.dropna()

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


In [38]:
df.dropna(axis=1)

   Column D
0         5
1         8
2        34
3       110


In both cases, any row or column that contains **at least** one null value is dropped (`axis=1` applies the check to columns instead of rows). Depending on the case, that can be too extreme. You can control this behavior with the `how` parameter, which can be either `'any'` or `'all'`:

In [30]:
df2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})

In [31]:
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       NaN       NaN
2      30.0      31.0     100.0


In [32]:
df2.dropna(how='all')

   Column A  Column B  Column C
0       1.0       2.0       NaN
2      30.0      31.0     100.0


In [34]:
df2.dropna(how='any') # default behaviour

   Column A  Column B  Column C
2      30.0      31.0     100.0
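One way to read the two `how` options is as boolean masks built from `isnull()`. The sketch below rebuilds both results by hand for the same `df2`; it is only meant as an illustration of the semantics, not a description of how `dropna` works internally.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})

# how='all': drop a row only when every value in it is null
all_null_rows = df2.isnull().all(axis=1)
manual_all = df2[~all_null_rows]

# how='any': drop a row as soon as one value in it is null
any_null_rows = df2.isnull().any(axis=1)
manual_any = df2[~any_null_rows]

print(manual_all.equals(df2.dropna(how='all')))
print(manual_any.equals(df2.dropna(how='any')))
```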


You can also use the `thresh` parameter to indicate a threshold (a minimum number) of non-null values for the row/column to be kept:

In [35]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [36]:
df.dropna(thresh=3)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34


In [37]:
df.dropna(thresh=3, axis='columns')

   Column B  Column C  Column D
0       2.0       NaN         5
1       8.0       9.0         8
2      31.0      32.0        34
3       NaN     100.0       110
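The `thresh` results can be verified by counting the non-null values yourself and keeping only the rows (or columns) that reach the threshold. This is a sketch of the arithmetic with illustrative variable names, not the library's internal implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110]
})

# Rows: count non-null values across each row, keep rows with at least 3
non_null_per_row = df.notnull().sum(axis=1)
kept_rows = df[non_null_per_row >= 3]

# Columns: count non-null values down each column, keep columns with at least 3
non_null_per_column = df.notnull().sum(axis=0)
kept_columns = df.loc[:, non_null_per_column >= 3]

print(non_null_per_row.tolist())
print(non_null_per_column.tolist())
```

Row 3 has only two non-null values and `Column A` has only two as well, which is why each is dropped when `thresh=3`.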


---

### Filling null values

Sometimes, instead of dropping the null values, we might need to replace them with some other value. This highly depends on your context and the dataset you're working with. Sometimes a `nan` can be replaced with `0`, sometimes with the `mean` of the sample, and other times with the closest value. Again, it depends on the context.

In [39]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

#### **Filling nulls with an arbitrary value**

In [40]:
s.fillna(0)

0    1.0
1    2.0
2    3.0
3    0.0
4    0.0
5    4.0
dtype: float64

In [42]:
s.fillna(s.mean())

0    1.0
1    2.0
2    3.0
3    2.5
4    2.5
5    4.0
dtype: float64

#### **Filling nulls with contiguous (close) values**

In [50]:
s.ffill() # forward fill; fillna(method='ffill') is deprecated in recent pandas versions

0    1.0
1    2.0
2    3.0
3    3.0
4    3.0
5    4.0
dtype: float64

In [51]:
s.bfill() # backward fill; fillna(method='bfill') is deprecated in recent pandas versions

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    4.0
dtype: float64

This can still leave null values at the extremes of the Series/DataFrame:

In [47]:
pd.Series([np.nan, 3, np.nan, 9]).ffill() # nothing before index 0 to copy forward

0    NaN
1    3.0
2    3.0
3    9.0
dtype: float64

In [49]:
pd.Series([1, np.nan, 3, np.nan, np.nan]).bfill() # nothing after index 4 to copy backward

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64
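If the leftover nulls at the edges are a problem, one possible approach is to chain a forward fill with a backward fill, so that each pass covers the gap the other cannot reach. The sketch below uses the `ffill()` and `bfill()` methods on a made-up Series, `s_edges`; whether this is appropriate depends on your data.

```python
import numpy as np
import pandas as pd

# Illustrative Series with nulls at both ends
s_edges = pd.Series([np.nan, 3, np.nan, 9, np.nan])

# Forward fill first (leaves index 0 empty), then backward fill to cover it
filled = s_edges.ffill().bfill()

print(filled.tolist())
```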

### Filling null values on DataFrames