In [1]:
import numpy as np
import pandas as pd

### **Pandas utility functions**

### Checking for null

In [2]:
pd.isnull(np.nan)

True

In [3]:
pd.isnull(None)

True

In [4]:
pd.isna(None)

True

In [5]:
pd.isna(np.nan)

True

### Checking for not null

In [6]:
pd.notnull(None)

False

In [7]:
pd.notnull(np.nan)

False

In [8]:
pd.notna(None)

False

In [9]:
pd.notna(np.nan)

False

In [10]:
pd.notnull(3)

True

In [11]:
pd.notnull(0)

True
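
The two spellings used above are interchangeable: `isnull`/`notnull` are aliases of `isna`/`notna`. The sketch below is only an illustration; the `candidates` list is a made-up sample that also includes `pd.NaT`, `0` and an empty string to show which values count as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: three missing markers followed by three ordinary values.
candidates = [np.nan, None, pd.NaT, 0, "", 3]

# For each value, record isna, isnull and notna so they can be compared.
results = [
    (value, pd.isna(value), pd.isnull(value), pd.notna(value))
    for value in candidates
]

for value, is_na, is_null, not_na in results:
    print(repr(value), is_na, is_null, not_na)
```

Note that `0` and `""` are regular values, not nulls, so only the first three entries are reported as missing.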

### **Working with Series and DataFrames**

In [12]:
pd.isnull(pd.Series([1, np.nan, 7]))

0    False
1     True
2    False
dtype: bool

In [14]:
pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 7, 9],
    'Column C': [np.nan, 7, np.nan],
}))

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True


### **Operating with Missing values**

In pandas, `nan`s no longer behave as "viruses" in aggregations: reductions such as `count`, `sum` and `mean` skip them by default.

In [15]:
pd.Series([1, 7, np.nan]).count()

2

In [16]:
pd.Series([1, 7, np.nan]).sum()

8.0

In [17]:
pd.Series([1, 7, np.nan]).mean()

4.0
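
For contrast, here is a small sketch of where `nan` does and does not propagate. It assumes the default `skipna=True` of pandas reductions; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = [1, 7, np.nan]

# NumPy propagates nan through the reduction.
numpy_sum = np.array(values).sum()

# pandas skips nan by default (skipna=True)...
pandas_sum = pd.Series(values).sum()

# ...but propagates it when asked to.
pandas_sum_strict = pd.Series(values).sum(skipna=False)

# Element-wise arithmetic still propagates nan in pandas.
shifted = pd.Series(values) + 1

print(numpy_sum, pandas_sum, pandas_sum_strict)
print(shifted)
```

So "ignoring" applies to reductions; element-wise operations keep the `nan` in place.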

### **Filtering Missing Data**

Combine boolean selection with `pd.notnull` (or the negation of `pd.isnull`) to filter out `nan`s and other null values.

In [18]:
s = pd.Series([1, 7, np.nan, np.nan, 9])

In [19]:
pd.notnull(s)

0     True
1     True
2    False
3    False
4     True
dtype: bool

In [20]:
pd.isnull(s)

0    False
1    False
2     True
3     True
4    False
dtype: bool

In [21]:
# count of not null values
pd.notnull(s).sum()

3

In [22]:
# count of null values
pd.isnull(s).sum()

2
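
Summing works here because `True` counts as 1 and `False` as 0. By the same reasoning, the mean of the mask gives the share of missing entries. A minimal sketch (the names `mask`, `null_count` and `null_fraction` are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 7, np.nan, np.nan, 9])

mask = pd.isnull(s)

null_count = mask.sum()      # True counts as 1, False as 0
null_fraction = mask.mean()  # share of entries that are null

print(null_count, null_fraction)
```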

In [23]:
s[pd.notnull(s)]

0    1.0
1    7.0
4    9.0
dtype: float64

Both `notnull` and `isnull` are also methods of `Series` and `DataFrame`, so the same checks can be written this way:

In [24]:
s.isnull()

0    False
1    False
2     True
3     True
4    False
dtype: bool

In [25]:
s.notnull()

0     True
1     True
2    False
3    False
4     True
dtype: bool

In [26]:
s[s.notnull()]

0    1.0
1    7.0
4    9.0
dtype: float64

### **Dropping null values**

Boolean selection with `notnull()` is a bit verbose and repetitive. The `dropna` method does the same thing in a single call.

In [27]:
s

0    1.0
1    7.0
2    NaN
3    NaN
4    9.0
dtype: float64

In [28]:
s.dropna()

0    1.0
1    7.0
4    9.0
dtype: float64

### **Dropping null values on DataFrames**

In `DataFrame`s, you can't drop single values. You can only drop entire columns or rows. 

In [29]:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

In [30]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [31]:
df.shape

(4, 4)

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  2 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  3 non-null      float64
 3   Column D  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes


In [33]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [34]:
df.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64

The default `dropna` behavior will drop all the rows in which any null value is present.

In [35]:
df.dropna()

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34
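
Keep in mind that `dropna` returns a new object and leaves the original untouched. The sketch below rebuilds the same `df` so it runs on its own; `cleaned` is just an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# dropna builds a new DataFrame; df itself keeps all four rows.
cleaned = df.dropna()

print(df.shape)       # shape of the original
print(cleaned.shape)  # shape after dropping rows with any null
```

To keep the result, assign it to a name as done here with `cleaned`.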


The `axis` parameter can also be used to drop columns containing null values.

In [36]:
df.dropna(axis=1)  # axis='columns' also works

   Column D
0         5
1         8
2        34
3       110


By default, any row or column that contains at least one null value is dropped, which can be too extreme depending on the case. You can control this behavior with the `how` parameter, which can be either `'any'` (the default) or `'all'` (drop only when every value is null).

In [37]:
df2 = pd.DataFrame({
    'Column A': [1, np.nan, 30],
    'Column B': [2, np.nan, 31],
    'Column C': [np.nan, np.nan, 100]
})

In [38]:
df2

   Column A  Column B  Column C
0       1.0       2.0       NaN
1       NaN       NaN       NaN
2      30.0      31.0     100.0


In [39]:
df2.dropna(how='all')

   Column A  Column B  Column C
0       1.0       2.0       NaN
2      30.0      31.0     100.0


In [40]:
df2.dropna(how='any')

   Column A  Column B  Column C
2      30.0      31.0     100.0


The `thresh` parameter sets a *threshold*: the minimum number of non-null values a row or column must have in order to be kept.

In [41]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [42]:
df.dropna(thresh=3)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34


In [43]:
df.dropna(thresh=3, axis=1)

   Column B  Column C  Column D
0       2.0       NaN         5
1       8.0       9.0         8
2      31.0      32.0        34
3       NaN     100.0       110
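
One way to read `thresh` is as a filter on the count of non-null values. The sketch below, which rebuilds `df` so it runs on its own, counts non-null values per row and compares a manual filter with `dropna(thresh=3)`; `non_null_per_row`, `manual` and `builtin` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# Number of non-null values in each row.
non_null_per_row = df.notnull().sum(axis=1)

# Keep rows with at least 3 non-null values, by hand and with thresh.
manual = df[non_null_per_row >= 3]
builtin = df.dropna(thresh=3)

print(non_null_per_row)
print(manual.equals(builtin))
```

The last row has only two non-null values, which is why it is the one dropped.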


### **Filling null values**

Sometimes, instead of dropping the null values, replacing them with some other value is a better idea. This depends heavily on the context and on the dataset you're working with. Sometimes a `nan` can be replaced with `0`, sometimes with the `mean` of the sample, and sometimes with the closest value.

In [44]:
s

0    1.0
1    7.0
2    NaN
3    NaN
4    9.0
dtype: float64

### Filling nulls with an arbitrary value

In [45]:
s.fillna(0)

0    1.0
1    7.0
2    0.0
3    0.0
4    9.0
dtype: float64

In [46]:
s.fillna(s.mean())

0    1.000000
1    7.000000
2    5.666667
3    5.666667
4    9.000000
dtype: float64

In [47]:
s

0    1.0
1    7.0
2    NaN
3    NaN
4    9.0
dtype: float64

### Filling nulls with contiguous (close) values

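The `method` argument fills null values with neighboring values. Note that recent pandas releases (2.1 and later) deprecate `fillna(method=...)` in favor of the `ffill()` and `bfill()` methods.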

In [48]:
# forward fill
s.fillna(method='ffill')

0    1.0
1    7.0
2    7.0
3    7.0
4    9.0
dtype: float64

In [49]:
# backward fill
s.fillna(method='bfill')

0    1.0
1    7.0
2    9.0
3    9.0
4    9.0
dtype: float64

This can still leave null values at the extremes of the Series/DataFrame.

In [50]:
pd.Series([np.nan, 3, np.nan, 7]).fillna(method='ffill')

0    NaN
1    3.0
2    3.0
3    7.0
dtype: float64

In [51]:
pd.Series([1, np.nan, 3, np.nan, np.nan]).fillna(method='bfill')

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64

### **Filling null values on DataFrames**

The `fillna` method also works on `DataFrame`s, and it works similarly. The main differences are that you can specify the `axis` (as usual, rows or columns) along which to fill the values (especially useful with fill methods), and that you have more control over the values passed, for example one fill value per column.

In [52]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [53]:
df.fillna({'Column A': 0, 'Column B': 22, 'Column C': df['Column C'].mean()})

   Column A  Column B  Column C  Column D
0       1.0       2.0      47.0         5
1       0.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3       0.0      22.0     100.0       110


In [54]:
df.fillna(method='ffill', axis=0)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       1.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3      30.0      31.0     100.0       110


In [55]:
df.fillna(method='ffill', axis=1)

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       5.0
1       NaN       8.0       9.0       8.0
2      30.0      31.0      32.0      34.0
3       NaN       NaN     100.0     110.0


### **Checking if there are Null values**

### Checking the length

In [56]:
s

0    1.0
1    7.0
2    NaN
3    NaN
4    9.0
dtype: float64

If there are missing values, `s.dropna()` will have fewer elements than `s`.

In [57]:
s.dropna().count()

3

In [58]:
missing_values = len(s.dropna()) != len(s)
missing_values

True

The `count` method excludes `nan`s from its result, so it can be compared directly with `len(s)`.

In [59]:
len(s)

5

In [60]:
s.count()

3

In [61]:
missing_values = s.count() != len(s)
missing_values

True

The `any` and `all` methods check whether a Series contains at least one `True` value, or whether all of its values are `True`. They work the same way as Python's built-in `any` and `all`:

In [62]:
pd.Series([True, False, False]).any()

True

In [63]:
pd.Series([True, False, False]).all()

False

In [64]:
pd.Series([True, True, True]).all()

True
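
For comparison, the same checks with Python's built-in functions on a plain list give matching answers. A small sketch with illustrative variable names:

```python
import pandas as pd

flags = [True, False, False]

# Python built-ins on a plain list.
builtin_any, builtin_all = any(flags), all(flags)

# Equivalent Series methods.
series_any, series_all = pd.Series(flags).any(), pd.Series(flags).all()

print(builtin_any, series_any)
print(builtin_all, series_all)
```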

The `isnull()` method returns a Boolean `Series` with `True` wherever there is a `nan`, so it can be combined with `any()` to check for missing values.

In [65]:
s.isnull()

0    False
1    False
2     True
3     True
4    False
dtype: bool

In [66]:
pd.Series([1, np.nan]).isnull().any()

True

In [67]:
pd.Series([1, 2]).isnull().any()

False

In [68]:
s.isnull().any()

True

In [69]:
s.isnull().values

array([False, False,  True,  True, False])

In [70]:
s.isnull().values.any()

True
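
The same idea carries over to a `DataFrame`. The sketch below rebuilds the earlier `df` and `s` so it runs on its own, then checks for nulls per column and for the frame as a whole; it also shows the `hasnans` attribute of a `Series` as a shortcut. The names `per_column` and `anywhere` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})
s = pd.Series([1, 7, np.nan, np.nan, 9])

# One flag per column: does the column contain any null?
per_column = df.isnull().any()

# A single flag for the whole DataFrame.
anywhere = df.isnull().values.any()

print(per_column)
print(anywhere)
print(s.hasnans)  # Series attribute: True if any value is null
```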