# Handling missing values

Handling missing values is an essential part of the data cleaning and preparation process because almost all real-world data comes with some missing values.

In [23]:
# Let’s create a dataframe with missing values first.

import pandas as pd
import numpy as np

df = pd.DataFrame({'column_a':[1,2,4,4,np.nan,np.nan,6],
                  'column_b':[1.2,1.4,np.nan,6.2,None,1.1,4.3],
                  'column_c':['a','?','c','d','--',np.nan,'d'],
                  'column_d':[True,True,np.nan,None,False,True,False]})
df

Unnamed: 0,column_a,column_b,column_c,column_d
0,1.0,1.2,a,True
1,2.0,1.4,?,True
2,4.0,,c,
3,4.0,6.2,d,
4,,,--,False
5,,1.1,,True
6,6.0,4.3,d,False


np.nan, None and NaT (for datetime64[ns] types) are the standard missing values in Pandas.
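
As a minimal sketch of the statement above, the snippet below builds one small Series per marker and checks that isna() flags each of them. The variable names (floats, objects, dates) are made up for this illustration.

```python
import numpy as np
import pandas as pd

# One small Series for each of the three standard missing-value markers.
floats = pd.Series([1.0, np.nan])
objects = pd.Series(["a", None])
dates = pd.Series(pd.to_datetime(["2021-01-01", None]))  # None is stored as NaT

# isna() treats all three markers as missing.
print(floats.isna().tolist())   # [False, True]
print(objects.isna().tolist())  # [False, True]
print(dates.isna().tolist())    # [False, True]
```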

Finding Missing Values

Pandas provides the isnull() and isna() functions to detect missing values. Both do the same thing.

df.isna() returns a dataframe of boolean values that indicate which cells are missing.

In [24]:
df.isna()

Unnamed: 0,column_a,column_b,column_c,column_d
0,False,False,False,False
1,False,False,False,False
2,False,True,False,True
3,False,False,False,True
4,True,True,False,False
5,True,False,True,False
6,False,False,False,False


You can also choose to use notna() which is just the opposite of isna().
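
A short sketch of that relationship, using a small hypothetical dataframe (sample) rather than the one from this notebook: notna() gives the element-wise negation of isna(), and summing it counts the values that are present.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame, not the tutorial's df.
sample = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", None, None]})

present = sample.notna()

# notna() is the element-wise opposite of isna().
print(present.equals(~sample.isna()))  # True

# Summing the boolean mask counts the non-missing values per column.
present_counts = present.sum()
print(present_counts["x"], present_counts["y"])  # 2 1
```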

df.isna().any() returns a boolean value for each column. If there is at least one missing value in that column, the result is True.

df.isna().sum() returns the number of missing values in each column.

In [25]:
df.isna().any()

column_a    True
column_b    True
column_c    True
column_d    True
dtype: bool

In [26]:
df.isna().sum()

column_a    2
column_b    2
column_c    1
column_d    2
dtype: int64
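
The same boolean mask can be summarized in other ways. The sketch below, on a hypothetical frame named sample, takes the mean of the mask to get the fraction of missing values per column and sums along axis=1 to count missing values per row.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame, not the tutorial's df.
sample = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": ["a", "b", None, "d"],
})

# Mean of a boolean mask = fraction of True values, i.e. share of missing cells.
missing_share = sample.isna().mean()
print(missing_share["x"], missing_share["y"])  # 0.5 0.25

# Summing across columns (axis=1) counts the missing values in each row.
missing_per_row = sample.isna().sum(axis=1)
print(missing_per_row.tolist())  # [0, 1, 1, 1]
```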

Handling Missing Values

Not all missing values come in the clean np.nan or None format. For example, the "?" and "--" strings in column_c of our dataframe do not give us any valuable information or insight, so they are essentially missing values. However, Pandas does not detect these strings as missing values.

If we know which strings are used as missing values in the dataset, we can handle them while creating the dataframe by passing them to the na_values parameter.

missing_values = ["?","--"]
df_test = pd.read_csv("dataset.csv",na_values=missing_values)
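
Since "dataset.csv" is not included here, the following sketch reads a small made-up CSV from an in-memory string to show the effect of na_values. The column names (id, grade) and the csv_text contents are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file on disk.
csv_text = "id,grade\n1,a\n2,?\n3,--\n4,d\n"

missing_values = ["?", "--"]
df_csv = pd.read_csv(io.StringIO(csv_text), na_values=missing_values)

# The "?" and "--" entries are parsed as missing values.
print(df_csv["grade"].isna().tolist())  # [False, True, True, False]
```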

In [27]:
# We can use pandas replace() function to handle these values after a dataframe is created:

df.replace({"?":np.nan,"--":np.nan},inplace=True)
df

Unnamed: 0,column_a,column_b,column_c,column_d
0,1.0,1.2,a,True
1,2.0,1.4,,True
2,4.0,,c,
3,4.0,6.2,d,
4,,,,False
5,,1.1,,True
6,6.0,4.3,d,False


We have replaced the non-informative cells with NaN values. The inplace parameter controls whether the changes are saved in the dataframe itself. The default value for inplace is False, so unless it is set to True, replace() returns a new dataframe and the original is left unchanged.

There is no single optimal way to handle missing values. Depending on the characteristics of the dataset and the task, we can choose to:
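
To make the default behavior concrete, here is a small sketch on a hypothetical one-column frame (sample). It calls replace() without inplace and shows that only the returned dataframe contains the NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame, not the tutorial's df.
sample = pd.DataFrame({"c": ["a", "?", "--"]})

# Default (inplace=False): a new dataframe is returned.
cleaned = sample.replace({"?": np.nan, "--": np.nan})

print(int(sample["c"].isna().sum()))   # 0, the original is untouched
print(int(cleaned["c"].isna().sum()))  # 2, the returned copy has the NaNs
```

Assigning the result back (sample = sample.replace(...)) is a common alternative to inplace=True.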

    1. Drop missing values
    2. Replace missing values

In [28]:
# Drop missing values

df

Unnamed: 0,column_a,column_b,column_c,column_d
0,1.0,1.2,a,True
1,2.0,1.4,,True
2,4.0,,c,
3,4.0,6.2,d,
4,,,,False
5,,1.1,,True
6,6.0,4.3,d,False


We can drop a row or column with missing values using the dropna() function. The how parameter sets the condition for dropping.

    how='any' : drop if there is any missing value
    how='all' : drop if all values are missing

Furthermore, using the thresh parameter, we can set the minimum number of non-missing values that a row/column must have in order to be kept. Rows/columns with fewer non-missing values are dropped.
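
The three options can be compared side by side. This is a minimal sketch on a hypothetical frame (sample) whose rows have 3, 2, 1 and 0 non-missing values, so each setting keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Rows 0..3 contain 3, 2, 1 and 0 non-missing values respectively.
sample = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan, np.nan],
    "c": [3.0, 6.0, 9.0, np.nan],
})

any_dropped = sample.dropna(how="any")  # drop rows with at least one NaN
all_dropped = sample.dropna(how="all")  # drop rows where every value is NaN
thresh_kept = sample.dropna(thresh=2)   # keep rows with at least 2 non-missing values

print(any_dropped.index.tolist())  # [0]
print(all_dropped.index.tolist())  # [0, 1, 2]
print(thresh_kept.index.tolist())  # [0, 1]
```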

In [19]:
df.dropna(axis=0,how='all',inplace=True)
df

Unnamed: 0,column_a,column_b,column_c,column_d
0,1.0,1.2,a,True
1,2.0,1.4,,True
2,4.0,,c,
3,4.0,6.2,d,
4,,,,False
5,,1.1,,True
6,6.0,4.3,d,False


The axis parameter selects whether rows (0) or columns (1) are dropped.

Our dataframe does not have a row that consists entirely of missing values, so setting how='all' did not drop any rows. The default value is 'any', so we don't need to specify it if we want to use how='any'.
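
For the column direction, a small sketch with a hypothetical frame (sample) whose column names describe their contents:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete column, one partly missing, one fully missing.
sample = pd.DataFrame({
    "full": [1, 2, 3],
    "partial": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
})

# axis=1 drops columns; how defaults to 'any'.
cols_any = sample.dropna(axis=1)
# With how='all', only the fully missing column is dropped.
cols_all = sample.dropna(axis=1, how="all")

print(cols_any.columns.tolist())  # ['full']
print(cols_all.columns.tolist())  # ['full', 'partial']
```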

Replacing missing values

The fillna() function of Pandas conveniently handles missing values. Using fillna(), missing values can be replaced by a special value or an aggregate value such as the mean or median. Furthermore, missing values can be replaced with the value before or after them, which is pretty useful for time-series datasets.
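
As a sketch of the aggregate case, the snippet below fills each numeric column of a hypothetical frame (sample) with that column's mean, and fills a single column with its median.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame, not the tutorial's df.
sample = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

# Passing a Series of column means fills each column with its own mean.
filled_mean = sample.fillna(sample.mean())
print(filled_mean["a"].tolist())  # [1.0, 2.0, 3.0]
print(filled_mean["b"].tolist())  # [10.0, 20.0, 15.0]

# A single column can be filled with its median in the same way.
filled_a_median = sample["a"].fillna(sample["a"].median())
print(filled_a_median.tolist())  # [1.0, 2.0, 3.0]
```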

In [29]:
# Replace missing values with a scalar:

df.fillna(25)

Unnamed: 0,column_a,column_b,column_c,column_d
0,1.0,1.2,a,True
1,2.0,1.4,25,True
2,4.0,25.0,c,25
3,4.0,6.2,d,25
4,25.0,25.0,25,False
5,25.0,1.1,25,True
6,6.0,4.3,d,False


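Using the method parameter, missing values can be replaced with the values before or after them. ffill stands for "forward fill" and replaces missing values with the value in the previous row. You can also choose bfill, which stands for "backward fill" and uses the value in the next row. Note that Pandas 2.1 and later deprecate the method parameter of fillna() in favor of the equivalent df.ffill() and df.bfill() functions.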

In [30]:
# df.ffill() is equivalent to df.fillna(method='ffill'), which newer Pandas versions deprecate.
df.ffill(axis=0)

Unnamed: 0,column_a,column_b,column_c,column_d
0,1.0,1.2,a,True
1,2.0,1.4,a,True
2,4.0,1.4,c,True
3,4.0,6.2,d,True
4,4.0,6.2,d,False
5,4.0,1.1,d,True
6,6.0,4.3,d,False
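
For comparison, here is a minimal sketch of forward and backward fill on a hypothetical Series (s). It uses the ffill() and bfill() functions directly. Note that a forward fill cannot fill a missing value in the first position, because there is no previous value to copy.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading NaN and a gap in the middle.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # copy the previous valid value downward
backward = s.bfill()  # copy the next valid value upward

print(forward.isna().tolist())  # [True, False, False, False, False]
print(forward.iloc[1:].tolist())  # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())          # [1.0, 1.0, 4.0, 4.0, 4.0]
```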


Reference: https://towardsdatascience.com/handling-missing-values-with-pandas-b876bf6f008f