
# Handling Missing Data


---


## Pandas Utility Functions
Like NumPy, pandas has utility functions to detect null values:

In [3]:
import numpy as np
import pandas as pd

In [4]:
pd.isnull(np.nan)

True

In [5]:
pd.isnull(None)

True

In [8]:
pd.isna(np.nan)

True

In [9]:
pd.isna(None)

True

The opposite functions also exist:

In [10]:
pd.notnull(np.nan)

False

In [12]:
pd.notnull(None)

False

In [13]:
pd.notna(np.nan)

False

In [14]:
pd.notnull(3)

True
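To see how these helpers treat values beyond `np.nan` and `None`, the sketch below runs `pd.isna` over a handful of scalars. It assumes a pandas version that provides `pd.NA` (1.0 or later); the `samples` dictionary and its labels are purely illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative values: the first four are treated as missing by pandas,
# the rest are ordinary values even though they may look "empty".
samples = {
    "np.nan": np.nan,
    "None": None,
    "pd.NaT": pd.NaT,
    "pd.NA": pd.NA,
    "zero": 0,
    "empty string": "",
    "False": False,
}

results = {label: bool(pd.isna(value)) for label, value in samples.items()}
for label, is_missing in results.items():
    print(f"{label:>12}: {is_missing}")
```

Only the dedicated missing-value markers are reported as null; `0`, `""`, and `False` are regular values.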

These functions also work with `Series` and `DataFrame`s.

In [17]:
pd.isna(pd.Series([1, 2, np.nan, 5]))

0    False
1    False
2     True
3    False
dtype: bool


In [18]:
pd.notnull(pd.Series([1, np.nan, 5]))

0     True
1    False
2     True
dtype: bool


In [22]:
pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True
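A boolean mask like this is often summarized rather than read cell by cell. The following is a minimal sketch that counts the nulls per column for the same data; the names `df`, `missing_per_column`, and `total_missing` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan],
})

# True counts as 1 when summed, so this gives the number of nulls per column.
missing_per_column = df.isnull().sum()

# Summing again collapses the per-column counts into a single total.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```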


## Pandas Operations with Missing Values
Pandas handles missing values more gracefully than NumPy. `nan` values no longer behave like "viruses" that contaminate every result; aggregations such as `count`, `sum`, and `mean` skip them by default:

In [23]:
pd.Series([1, 2, np.nan]).count()

np.int64(2)

In [24]:
pd.Series([1, 2, np.nan]).sum()

np.float64(3.0)

In [25]:
pd.Series([1, 2, np.nan]).mean()

np.float64(1.5)
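The contrast with NumPy can be sketched side by side. The example below also uses the `skipna` parameter of `Series.sum`, which is not shown above, to opt back into NaN propagation; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan]

# Plain NumPy: a single NaN makes the whole sum NaN.
numpy_sum = np.array(values).sum()

# pandas: NaN is skipped because skipna=True is the default.
pandas_sum = pd.Series(values).sum()

# pandas with skipna=False: NaN propagates, as in NumPy.
strict_sum = pd.Series(values).sum(skipna=False)

print(numpy_sum, pandas_sum, strict_sum)
```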

## Filtering Missing Data
As we saw with NumPy, we can combine boolean selection with `pd.notnull` to filter out `nan` and null values:

In [26]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

In [27]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool


In [29]:
pd.notnull(s).count()

np.int64(6)
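Note that `count()` on the mask returns 6 rather than 4: the boolean mask itself contains no nulls, so every entry is counted. A minimal sketch of counting the non-null values instead, using the same `s` (the other variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
mask = pd.notnull(s)

# The mask holds only True/False, so count() sees six non-null entries.
entries_in_mask = mask.count()

# Summing the mask counts only the True entries, i.e. the non-null values.
non_null_values = mask.sum()

# Series.count() on the original data already skips nulls.
direct_count = s.count()

print(entries_in_mask, non_null_values, direct_count)
```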

In [30]:
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64


But `notnull`, `isnull`, and `isna` are also methods of `Series` and `DataFrame`s, so we can call them directly:

In [31]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool


In [32]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool


In [33]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64


## Dropping Null Values
Boolean selection with `notnull()` is a little verbose and repetitive. And as we said before: any repetitive task will probably have a better, more DRY way. In this case, we can use the `dropna` method:

In [34]:
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
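Two details are worth checking: `dropna` gives the same result as the boolean-selection version, and it returns a new `Series` rather than changing `s`. A small sketch under those assumptions, with illustrative variable names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

cleaned = s.dropna()

# Same values and same index labels as the boolean-selection version.
same_as_mask = cleaned.equals(s[s.notnull()])

# dropna returns a new Series; the original still has its two NaN entries.
nulls_left_in_original = int(s.isnull().sum())

print(same_as_mask, nulls_left_in_original)
```

The original index labels (0, 1, 2, 5) are kept, so the result is not renumbered.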
