# HANDLING MISSING VALUES 

In [1]:
# Import relevant Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Importing Dataset

df = pd.read_csv("Churn_Modelling.csv")

In [3]:
# Inspect DataFrame 

df.shape
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           9944 non-null   object 
 6   Age              9914 non-null   float64
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42.0,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,,41.0,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42.0,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39.0,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43.0,2,125510.82,1,1,1,79084.1,0


In [4]:
# Counting Null Values in our DataFrame

print(df.isnull().sum())

RowNumber           0
CustomerId          0
Surname             0
CreditScore         0
Geography           0
Gender             56
Age                86
Tenure              0
Balance             0
NumOfProducts       0
HasCrCard           0
IsActiveMember      0
EstimatedSalary     0
Exited              0
dtype: int64
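
Raw counts are easier to judge as a share of the rows. In the output above, Gender is missing in 56 of 10000 rows (0.56%) and Age in 86 of 10000 rows (0.86%). A minimal sketch of that calculation follows; the `toy` frame and its values are made up for illustration and are not the churn dataset.

```python
import numpy as np
import pandas as pd

# Illustrative frame with made-up values (not the churn dataset)
toy = pd.DataFrame({
    "Gender": ["Female", None, "Male", "Female"],
    "Age": [42.0, 41.0, np.nan, np.nan],
    "Tenure": [2, 1, 8, 1],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

The same expression, `df.isnull().mean() * 100`, can be applied to the full DataFrame.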


<b>TECHNIQUES FOR HANDLING MISSING VALUES</b>

<b>1. Deleting Columns with Null Values</b>

In [5]:
# Drop every column that contains at least one null value
updated_df1 = df.dropna(axis=1)

updated_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Tenure           10000 non-null  int64  
 6   Balance          10000 non-null  float64
 7   NumOfProducts    10000 non-null  int64  
 8   HasCrCard        10000 non-null  int64  
 9   IsActiveMember   10000 non-null  int64  
 10  EstimatedSalary  10000 non-null  float64
 11  Exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB


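We can see that the two columns containing null values (Gender and Age) have been dropped, so no missing values remain. However, dropping a column is wasteful when only a small share of its values is missing, because we lose valuable information: here Gender and Age are each missing in fewer than 1% of the rows. As a rule of thumb, this method works best when about 75% or more of a column's values are null.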
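
The rule of thumb above can be applied mechanically by dropping only the columns whose missing share reaches a chosen cut-off. The sketch below assumes a cut-off of 0.75; the `toy` frame, its column names, and the `max_missing_fraction` variable are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with made-up values (not the churn dataset)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "few_missing": [1.0, np.nan, 3.0, 4.0, 5.0],              # 20% missing
    "complete": [1, 2, 3, 4, 5],                              # 0% missing
})

max_missing_fraction = 0.75  # assumed cut-off, taken from the rule of thumb

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Keep only the columns that stay below the cut-off
keep = missing_fraction[missing_fraction < max_missing_fraction].index
reduced = toy[keep]
print(list(reduced.columns))
```

Unlike `dropna(axis=1)`, which removes a column for a single null, this keeps columns such as `few_missing` that can still be imputed.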

<b>2. Deleting Rows with Missing Values</b>

In [6]:
# Dropping Rows with missing values
updated_df2 = df.dropna(axis=0)

updated_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9861 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        9861 non-null   int64  
 1   CustomerId       9861 non-null   int64  
 2   Surname          9861 non-null   object 
 3   CreditScore      9861 non-null   int64  
 4   Geography        9861 non-null   object 
 5   Gender           9861 non-null   object 
 6   Age              9861 non-null   float64
 7   Tenure           9861 non-null   int64  
 8   Balance          9861 non-null   float64
 9   NumOfProducts    9861 non-null   int64  
 10  HasCrCard        9861 non-null   int64  
 11  IsActiveMember   9861 non-null   int64  
 12  EstimatedSalary  9861 non-null   float64
 13  Exited           9861 non-null   int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


<b>3. Filling the Missing Values (Imputation)</b>

Here we replace each missing entry with a substitute value. Common ways to impute missing values are:

    ~ Filling missing numerical data with the mean or the median of the column.
    ~ Filling missing categorical data with the mode (the most frequent value).
    ~ Filling missing numerical data with 0, -999, or some other number that does not occur in the data, so that the model can recognize these entries as different from real observations.
    ~ Filling missing categorical data with a new category that stands for "missing".
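
The cells that follow cover the mean and the median. The other three options in the list can be sketched as below. The `toy` frame, the `"Unknown"` label, and the choice of -999 as sentinel are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with made-up values (not the churn dataset)
toy = pd.DataFrame({
    "Gender": ["Female", None, "Male", "Female", None],
    "Age": [42.0, np.nan, 39.0, 43.0, 41.0],
})

# Mode imputation: mode() returns a Series because ties are possible,
# so take its first entry
gender_mode = toy["Gender"].mode().iloc[0]
filled_mode = toy["Gender"].fillna(gender_mode)

# New category: mark missing entries with an explicit label
filled_unknown = toy["Gender"].fillna("Unknown")

# Sentinel value: a number that cannot occur as a real age
filled_sentinel = toy["Age"].fillna(-999)

print(filled_mode.tolist())
print(filled_unknown.tolist())
print(filled_sentinel.tolist())
```

A sentinel such as -999 distorts the mean and the variance of the column, so it tends to suit tree-based models better than models that depend on the scale of the feature.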

In [7]:
#Use 'fillna' to fill missing values with the mean of the column

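# Work on a copy: plain assignment would only alias df and overwrite its missing values
updated_df3 = df.copy()
updated_df3['Age'] = updated_df3['Age'].fillna(updated_df3['Age'].mean())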

updated_df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           9944 non-null   object 
 6   Age              10000 non-null  float64
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [8]:
#Use 'fillna' to fill missing values with the median of the column

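# Work on a copy: plain assignment would only alias df and overwrite its missing values
updated_df3 = df.copy()
updated_df3['Age'] = updated_df3['Age'].fillna(updated_df3['Age'].median())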

updated_df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           9944 non-null   object 
 6   Age              10000 non-null  float64
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


<b>NOTE: It's better to impute missing values with the mean when the column has few or no outliers. When there are many outliers, it's better to use the median, because it is much less affected by extreme values.</b>
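
A small numeric illustration of the note above, using two made-up series of ages that differ in one value (390, imagined as a data-entry error). The numbers are hypothetical and not taken from the churn dataset.

```python
import numpy as np
import pandas as pd

# Same ages, except that one value is replaced by an extreme outlier
ages = pd.Series([38.0, 40.0, 41.0, 42.0, 39.0, np.nan])
ages_with_outlier = pd.Series([38.0, 40.0, 41.0, 42.0, 390.0, np.nan])

# mean() and median() skip NaN by default
print(ages.mean(), ages.median())                            # 40.0 40.0
print(ages_with_outlier.mean(), ages_with_outlier.median())  # mean is about 110.2, median is 41.0

# Imputing with the median keeps the filled value close to typical ages
filled = ages_with_outlier.fillna(ages_with_outlier.median())
```

One outlier moves the mean from 40.0 to about 110.2, while the median only moves from 40.0 to 41.0.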

<b>4. Forward and Backward Filling to Impute Missing Values</b>

In [9]:
# Imputing using Backward fill 'bfill'

# Work on a copy: plain assignment would only alias df and overwrite its missing values
updated_df4 = df.copy()
updated_df4['Age'] = updated_df4['Age'].bfill()

updated_df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           9944 non-null   object 
 6   Age              10000 non-null  float64
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [10]:
# Imputing using Forward fill 'ffill'

# Work on a copy: plain assignment would only alias df and overwrite its missing values
updated_df4 = df.copy()
updated_df4['Age'] = updated_df4['Age'].ffill()

updated_df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           9944 non-null   object 
 6   Age              10000 non-null  float64
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
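
To see what the two fills actually do, here is a minimal sketch on a short made-up series. Forward fill copies the last valid value downward, backward fill copies the next valid value upward, and each leaves a gap at one end of the series when there is no neighbor to copy from.

```python
import numpy as np
import pandas as pd

# Made-up series with gaps at the start, in the middle, and at the end
s = pd.Series([np.nan, 42.0, np.nan, np.nan, 39.0, np.nan])

forward = s.ffill()        # leading NaN stays: nothing precedes it
backward = s.bfill()       # trailing NaN stays: nothing follows it
both = s.ffill().bfill()   # chaining the two closes both ends

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

Both fills depend on row order, so they are most meaningful for ordered data such as time series. For a table where row order carries no meaning, the filled value is effectively taken from an arbitrary neighboring row.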
