# Handling Missing Data with Pandas

pandas inherits NumPy's tools for selecting and detecting null values and adds a number of convenient methods for handling missing data. Let's look at them one at a time.

## Hands on!

In [5]:
import numpy as np 
import pandas as pd 

### Pandas utility functions

Similarly to `numpy`, pandas has a few utility functions to identify and detect null values:

In [6]:
pd.isnull(np.nan)

True

In [7]:
pd.isnull(None)

True

In [8]:
pd.isna(np.nan)

True

In [9]:
pd.isna(None)

True

In [10]:
pd.notnull(None)

False

In [11]:
pd.notnull(np.nan)

False

In [12]:
pd.notnull(3)

True
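
As a small illustration of what these helpers treat as missing, the sketch below runs `pd.isna` over a handful of scalars. The `candidates` dictionary and its labels are made up for this example. Note that `0` and an empty string are ordinary values, not nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of scalars to probe; the labels are only for readability.
candidates = {
    "np.nan": np.nan,
    "None": None,
    "pd.NaT": pd.NaT,
    "zero": 0,
    "empty string": "",
}

# pd.isna and pd.isnull give the same answer; pd.notna / pd.notnull are the negation.
results = {label: bool(pd.isna(value)) for label, value in candidates.items()}

for label, is_missing in results.items():
    print(f"{label:>12}: {is_missing}")
```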

These functions also work with `Series` and `DataFrame`s:

In [13]:
pd.isnull(pd.Series([1, np.nan, 7]))

0    False
1     True
2    False
dtype: bool

In [14]:
pd.notnull(pd.Series([1, np.nan, 7]))

0     True
1    False
2     True
dtype: bool

In [15]:
pd.isnull(pd.DataFrame(
    data={
        'Column A': [1,np.nan,7],
        'Column B': [np.nan,2,3],
        'Column C': [np.nan,2,np.nan]
    }))
        

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True


### Pandas Operations with Missing Values

In [16]:
pd.Series([1,2, np.nan]).count()

2

In [17]:
pd.Series([1,2,np.nan]).sum()

3.0

In [18]:
pd.Series([2,2,np.nan]).mean()

2.0
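
The three results above come from pandas skipping `NaN` when it aggregates. The sketch below contrasts that default with the propagating behavior, using the `skipna` parameter of `Series.sum` and NumPy's `np.sum` and `np.nansum`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, np.nan])

# Default pandas behavior: NaN is ignored by the aggregation.
pandas_sum = values.sum()

# Opting out of the default: a single NaN makes the result NaN.
strict_sum = values.sum(skipna=False)

# Plain NumPy propagates NaN unless the nan-aware variant is used.
numpy_sum = np.sum(values.to_numpy())
numpy_nansum = np.nansum(values.to_numpy())

print(pandas_sum, strict_sum, numpy_sum, numpy_nansum)
```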

### Filtering missing data

In [19]:
s = pd.Series([1,2,3,np.nan, np.nan, 4])

In [20]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [21]:
pd.notnull(s).sum()

4

In [22]:
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

In [23]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [24]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [25]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

### Dropping null values

In [26]:
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64
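
Two details of `dropna` are easy to miss: it returns a new object rather than modifying the original, and the surviving rows keep their original index labels. A minimal sketch, where `cleaned` and `renumbered` are names chosen for this example:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# dropna returns a new Series; `s` itself still holds its two NaN values.
cleaned = s.dropna()

# The original labels survive, so there is a gap where the nulls used to be.
print(list(cleaned.index))

# reset_index(drop=True) renumbers the rows if a contiguous index is needed.
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```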

### Dropping null values on data frames

In [27]:
df = pd.DataFrame(data={
    'Column A': [1,np.nan,30,np.nan],
    'Column B': [2,8,31,np.nan],
    'Column C': [np.nan,9,32,100],
    'Column D': [5, 8, 34, 110],
})

In [28]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [29]:
df.shape

(4, 4)

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  2 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  3 non-null      float64
 3   Column D  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 260.0 bytes


In [31]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [32]:
df.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64
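
Because `True` counts as 1 and `False` as 0, the same Boolean mask can also give the share of missing values per column. The sketch below rebuilds the `df` used above; `null_share` is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# Summing the mask counts the nulls per column ...
null_counts = df.isnull().sum()

# ... and averaging it gives the fraction of rows that are null per column.
null_share = df.isnull().mean()

print(null_counts)
print(null_share)
```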

The default `dropna` behavior drops every row that contains at least one null value:

In [33]:
df.dropna()

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


In [39]:
df.dropna(axis=1) # drop columns (instead of rows) that contain any null value

   Column D
0         5
1         8
2        34
3       110


In [40]:
df2 = pd.DataFrame(data={
    'Column A': [1,np.nan,30],
    'Column B': [2,np.nan,31],
    'Column C': [np.nan,np.nan,100]
})

In [41]:
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       NaN       NaN
2      30.0      31.0     100.0


In [42]:
df2.dropna(how='all') # drop only the rows in which every value is null

   Column A  Column B  Column C
0       1.0       2.0       NaN
2      30.0      31.0     100.0


In [43]:
df.dropna(how='any') # default behavior

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


In [44]:
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       NaN       NaN
2      30.0      31.0     100.0


You can also use the `thresh` parameter to set a threshold: the minimum number of non-null values a row (or column) must contain to be kept:

In [39]:
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       NaN       NaN
2      30.0      31.0     100.0


In [40]:
df2.dropna(thresh=1)

   Column A  Column B  Column C
0       1.0       2.0       NaN
2      30.0      31.0     100.0


In [41]:
df.dropna(thresh=3,axis=1)

   Column B  Column C  Column D
0       2.0       NaN         5
1       8.0       9.0         8
2      31.0      32.0        34
3       NaN     100.0       110
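
One way to read `thresh` is as a comparison against the number of non-null values in each row (or column). The sketch below spells that comparison out by hand for `df2` and checks it against `dropna`. The threshold value and the names `non_null_per_row`, `min_non_null` and `manual` are illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(data={
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100],
})

# Number of non-null values in each row: this is what `thresh` is compared against.
non_null_per_row = df2.notnull().sum(axis=1)
print(non_null_per_row.tolist())

# Hand-written filter that should keep the same rows as df2.dropna(thresh=2).
min_non_null = 2  # illustrative threshold
manual = df2[non_null_per_row >= min_non_null]
print(manual.equals(df2.dropna(thresh=min_non_null)))
```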


### Filling null values

In [42]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

__Filling nulls with an arbitrary value__

In [43]:
s.fillna(0)

0    1.0
1    2.0
2    3.0
3    0.0
4    0.0
5    4.0
dtype: float64

In [44]:
s.fillna(s.mean())

0    1.0
1    2.0
2    3.0
3    2.5
4    2.5
5    4.0
dtype: float64

In [45]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

__Filling nulls with contiguous (close) values__

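The `method` argument fills each null value with a neighboring value: `'ffill'` propagates the last valid value forward and `'bfill'` propagates the next valid value backward. (Recent pandas releases deprecate this argument in favor of the equivalent `ffill()` and `bfill()` methods.)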

In [46]:
s.fillna(method='ffill')

0    1.0
1    2.0
2    3.0
3    3.0
4    3.0
5    4.0
dtype: float64

In [47]:
s.fillna(method='bfill')

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    4.0
dtype: float64

This can still leave null values at the extremes of the Series/DataFrame:

In [51]:
pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill')

0    NaN
1    3.0
2    3.0
3    9.0
dtype: float64

In [52]:
pd.Series([1,np.nan, 3, np.nan, np.nan]).fillna(method='bfill')

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64
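
If those leftover edge values matter, one possible approach is to chain a forward fill with a backward fill. The sketch below uses the dedicated `Series.ffill` and `Series.bfill` methods, which behave like `fillna(method='ffill')` and `fillna(method='bfill')`. The Series `gappy` is made up for this example.

```python
import numpy as np
import pandas as pd

gappy = pd.Series([np.nan, 3, np.nan, 9])

# A forward fill has nothing to copy into the first position, so it stays NaN.
forward = gappy.ffill()
print(forward.tolist())

# A backward fill afterwards takes care of the leading gap.
both = gappy.ffill().bfill()
print(both.tolist())
```

Whether copying values across the edges is appropriate depends on the data. For a time series, for instance, it means inventing observations before the first real one.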

### Filling null values on data frames

The `fillna` method works on `DataFrame`s in much the same way. The main differences are that you can specify the `axis` along which to fill (rows or columns, which matters mostly for the `ffill` and `bfill` methods) and that you have more control over the values passed, for example a different fill value for each column:

In [53]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [56]:
df.fillna({'Column A': 0, 'Column B': 99, 'Column C': df['Column C'].mean()})

   Column A  Column B  Column C  Column D
0       1.0       2.0      47.0         5
1       0.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3       0.0      99.0     100.0       110


In [57]:
df.fillna(method='ffill', axis=0)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       1.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3      30.0      31.0     100.0       110


In [58]:
df.fillna(method='ffill', axis=1)

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       5.0
1       NaN       8.0       9.0       8.0
2      30.0      31.0      32.0      34.0
3       NaN       NaN     100.0     110.0
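
The per-column dictionary shown earlier can be generalized: `fillna` also accepts a `Series` whose index holds column names, so passing `df.mean()` fills every column with its own mean. This is a sketch of one possible imputation choice, not a recommendation. `column_means` and `filled` are names introduced for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# df.mean() skips NaN and returns one value per column, indexed by column name.
column_means = df.mean()

# Each null is replaced by the mean of the column it belongs to.
filled = df.fillna(column_means)
print(filled)
```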


### Checking if there are NAs

The question here is: does the Series or DataFrame contain any missing values? The answer should be a simple yes or no, and we can verify it in a few ways:

In [50]:
s.dropna().count()

4

In [51]:
missing_values = len(s.dropna()) != len(s)
missing_values

True

You can also use the `count` method, which excludes `NaN` from its result:

In [52]:
len(s)

6

In [53]:
df.count()

Column A    2
Column B    3
Column C    3
Column D    4
dtype: int64

In [54]:
s.count()

4

So we could just do:

In [55]:
missing_values = s.count() != len(s)
missing_values

True

In [56]:
for column in df.columns:
    missing_value = df[column].count() != len(df[column])
    print(missing_value)

True
True
True
False


__More Pythonic solution: `any`__

In [57]:
pd.Series([True, False, False]).any()

True

In [58]:
pd.Series([True,False,False]).all()

False

In [59]:
pd.Series([True, True, True]).all()

True

The `isnull()` method returns a Boolean `Series` with `True` wherever there is a `NaN`:

In [60]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [61]:
df.isnull().any()

Column A     True
Column B     True
Column C     True
Column D    False
dtype: bool

So we can just call the `any` method on the Boolean `Series` returned:

In [62]:
pd.Series([1, np.nan]).isnull().any()

True

In [63]:
pd.Series([1,2]).isnull().any()

False

In [64]:
s.isnull().any()

True

A stricter version checks only the underlying `values` (a NumPy array) of the Series:

In [65]:
s.isnull().values

array([False, False, False,  True,  True, False])

In [66]:
s.isnull().values.any()

True
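
Reducing over the raw array is also handy for DataFrames, where `isnull().any()` returns one answer per column rather than a single yes or no. The helper below is a hypothetical convenience wrapper, not a pandas function. It uses `to_numpy()`, which returns the same array as `values`.

```python
import numpy as np
import pandas as pd


def has_missing(obj):
    """Return one Python bool: True if the Series/DataFrame holds any null."""
    # isnull() gives a Boolean mask; to_numpy() exposes it as a NumPy array,
    # so any() reduces over every element regardless of the number of axes.
    return bool(obj.isnull().to_numpy().any())


s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
df = pd.DataFrame(data={
    'Column A': [1, np.nan, 30, np.nan],
    'Column D': [5, 8, 34, 110],
})

print(has_missing(s))
print(has_missing(df))
print(has_missing(df[['Column D']]))
```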