## Handling Missing Values

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv("./Churn_Modelling.csv")

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


## The second way to find null values is the isnull() function.

In [4]:
data.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
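
Because the churn dataset above has no nulls, every count is zero. The following is a minimal sketch on a hypothetical toy DataFrame (the `toy` frame and its hand-placed `NaN` entries are assumptions for illustration, not part of the dataset) showing what `isnull().sum()` reports when values are actually missing, and how `isnull().mean()` can give the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the missing entries are placed by hand for illustration.
toy = pd.DataFrame({
    "Age": [42.0, np.nan, 39.0, np.nan],
    "Geography": ["France", "Spain", None, "France"],
    "Balance": [0.0, 83807.86, 159660.80, 0.0],
})

# Number of missing entries in each column.
missing_count = toy.isnull().sum()

# Share of missing entries in each column, as a percentage.
missing_pct = toy.isnull().mean() * 100

print(missing_count)
print(missing_pct)
```

The percentage view is often more useful than the raw count when deciding between dropping and imputing a column.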

## Handling Missing Values

### 1. Deleting the columns with missing data

In [5]:
updated_df = data.dropna(axis=1)

In [6]:
updated_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


#### The problem with this method is that we may lose valuable information from that feature, because the whole column is deleted even if only a few of its values are null.
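
One way to soften this, sketched below on a hypothetical toy DataFrame (the `toy` frame and the threshold of 3 are assumptions for illustration), is the `thresh` parameter of `dropna`, which keeps a column as long as it has at least that many non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missingness per column.
toy = pd.DataFrame({
    "Age": [42.0, np.nan, 39.0, 41.0],           # 1 of 4 values missing
    "Tenure": [np.nan, np.nan, np.nan, 2.0],     # 3 of 4 values missing
    "Balance": [0.0, 83807.86, 159660.80, 0.0],  # complete
})

# Without thresh, every column containing at least one NaN is dropped.
strict = toy.dropna(axis=1)

# With thresh=3, a column is kept if it has at least 3 non-null values.
lenient = toy.dropna(axis=1, thresh=3)

print(list(strict.columns))
print(list(lenient.columns))
```

Here the strict call keeps only `Balance`, while the lenient call also keeps `Age`, which can then be imputed instead of discarded.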

### 2. Deleting the rows with missing data

In [7]:
updated_df = data.dropna(axis=0)
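
Row deletion shrinks the sample, so it can help to restrict the check to the columns that matter. The sketch below uses a hypothetical toy DataFrame (an assumption for illustration) to compare dropping rows with any null against dropping only rows where `Age` is null, using the `subset` parameter of `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the NaN entries are placed by hand for illustration.
toy = pd.DataFrame({
    "Age": [42.0, np.nan, 39.0, 41.0],
    "Tenure": [2.0, 1.0, np.nan, 8.0],
    "Exited": [1, 0, 1, 0],
})

# Drop every row that has a null in any column.
all_complete = toy.dropna(axis=0)

# Drop only the rows where Age is null; nulls in other columns are tolerated.
age_known = toy.dropna(axis=0, subset=["Age"])

print(len(toy), len(all_complete), len(age_known))
```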

### 3. Filling the Missing Values - Imputation

In [8]:
data['Age'].mean()

38.9218

In [9]:
data['Age'].median()

37.0

In [10]:
# fillna : fills the null records
# dropna : drops the null records

In [11]:
data['Age'] = data['Age'].fillna(data['Age'].mean())
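
The mean and median of `Age` differ above (38.9218 versus 37.0), which hints at a skewed distribution. The sketch below uses a small hypothetical Series (an assumption for illustration) to show how a single outlier pulls the mean, and therefore the mean-imputed value, upward, while the median stays closer to the typical values.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one outlier (95).
ages = pd.Series([30.0, 32.0, np.nan, 35.0, 95.0], name="Age")

# Mean of the known values: (30 + 32 + 35 + 95) / 4 = 48.0
mean_filled = ages.fillna(ages.mean())

# Median of the known values: (32 + 35) / 2 = 33.5
median_filled = ages.fillna(ages.median())

print(mean_filled[2], median_filled[2])
```

For skewed columns the median is often the more conservative choice, though which statistic is appropriate depends on the data.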

### 4. Forward & Backward Filling - Imputation

In [12]:
df = pd.read_csv("./Churn_Modelling.csv")

In [13]:
# backward fill
df['Age'] = df['Age'].bfill(axis=0)

In [14]:
# forward fill
df['Age'] = df['Age'].ffill(axis=0)
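
Since `Age` has no nulls in this dataset, the two calls above leave it unchanged. The sketch below uses a hypothetical Series (an assumption for illustration) to show what each direction does: `ffill` copies the last known value forward and cannot fill a leading gap, `bfill` copies the next known value backward and cannot fill a trailing gap, and chaining them covers both ends.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, middle, and end.
s = pd.Series([np.nan, 40.0, np.nan, np.nan, 37.0, np.nan])

# Forward fill: each gap takes the previous known value; the leading NaN remains.
forward = s.ffill()

# Backward fill: each gap takes the next known value; the trailing NaN remains.
backward = s.bfill()

# Forward fill first, then backward fill whatever is still missing.
both = s.ffill().bfill()

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

Both methods assume that neighbouring rows are related, which usually holds for ordered data such as time series and may not hold for an unordered customer table.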

### 5. Finding Only the Object-Type Data

In [16]:
df.select_dtypes(include=['object']).isnull().sum()

Surname      0
Geography    0
Gender       0
dtype: int64

In [18]:
for i in df.select_dtypes(include=['object']).columns:
    df[i] = df[i].fillna(df[i].mode()[0])
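
The loop above fills each object column with its most frequent value (the mode). As the object columns here contain no nulls, the sketch below repeats the same idea on a hypothetical toy DataFrame (an assumption for illustration) where the effect is visible. The columns are created with an explicit `object` dtype so that `select_dtypes(include=['object'])` picks them up.

```python
import pandas as pd

# Hypothetical toy data; None marks the missing categorical entries.
toy = pd.DataFrame({
    "Geography": pd.Series(["France", "Spain", None, "France"], dtype="object"),
    "Gender": pd.Series(["Male", None, "Female", "Female"], dtype="object"),
    "Age": [42, 41, 39, 40],
})

# Select only the object-typed (categorical/text) columns.
object_cols = toy.select_dtypes(include=["object"]).columns

for col in object_cols:
    # mode() ignores nulls and returns a Series; [0] takes the first mode.
    most_frequent = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy)
```

When a column has several equally frequent values, `mode()` returns all of them in sorted order, so `[0]` picks the first one.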

        Balance  EstimatedSalary
0          0.00        101348.88
1      83807.86        112542.58
2     159660.80        113931.57
3          0.00         93826.63
4     125510.82         79084.10
...         ...              ...
9995       0.00         96270.64
9996   57369.61        101699.77
9997       0.00         42085.58
9998   75075.31         92888.52
