1. Delete Rows / Columns (Small Proportion of Missing Data)

Steps:
Check the missing values in the dataset with df.isnull().sum()
Decide whether to drop rows or columns:
Drop rows: if only a few rows have missing values (df_rows_dropped = df.dropna())
Drop columns: if a column has too many missing values (df_cols_dropped = df.dropna(axis=1)). Note that dropna(axis=1) drops every column that contains at least one missing value.

When to Use: small proportion of missing data, irrelevant columns
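
The difference between dropping rows and dropping columns is easier to see on a small example. The sketch below uses a hypothetical toy DataFrame (the column names and values are invented for illustration, not taken from the churn dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'age' has one gap, 'notes' is mostly empty.
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Paris", "Lyon", "Nice", "Lille"],
    "notes": [np.nan, np.nan, np.nan, "vip"],
})

# Step 1: count missing values per column.
missing_counts = toy.isnull().sum()

# Drop rows: keeps only rows with no missing value in ANY column.
rows_dropped = toy.dropna()

# Drop rows, but only look at the 'age' column when deciding.
rows_dropped_age = toy.dropna(subset=["age"])

# Drop columns: removes every column that has at least one missing value.
cols_dropped = toy.dropna(axis=1)

print(missing_counts)
print(rows_dropped.shape, rows_dropped_age.shape, cols_dropped.shape)
```

Here a plain dropna() keeps a single row because the sparse 'notes' column disqualifies the rest, which is why it is worth checking the per-column counts before choosing rows or columns.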

2. Large Proportion of Missing Data

Definition: Many values missing in a column (e.g., >30–50%)

Recommended Approaches:

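Drop the column → df.drop(columns=[col]), where col is the column's name, if the column is irrelevant (df.dropna(axis=1) would instead drop every column that has any missing value)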

Imputation / Prediction → If column is important, fill missing values:

Mean / Median (for numeric columns) or Mode (for categorical columns)

Forward / Backward Fill (for time series)

Algorithm-based Imputation / ML prediction (for complex datasets)
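
The simpler imputation options listed above can be sketched with plain pandas. The DataFrame and its column names here are hypothetical, and algorithm-based imputation is left out because it depends on the model chosen:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in numeric, categorical and ordered columns.
toy = pd.DataFrame({
    "income": [30.0, np.nan, 50.0, 100.0],
    "segment": ["a", "b", None, "a"],
    "reading": [1.0, np.nan, np.nan, 4.0],
})

imputed = toy.copy()

# Numeric column: fill with the median (less sensitive to outliers than the mean).
imputed["income"] = toy["income"].fillna(toy["income"].median())

# Categorical column: fill with the most frequent value (the mode).
imputed["segment"] = toy["segment"].fillna(toy["segment"].mode()[0])

# Ordered / time-series column: carry the last known value forward,
# or the next known value backward.
imputed["reading_ffill"] = toy["reading"].ffill()
imputed["reading_bfill"] = toy["reading"].bfill()

print(imputed)
```

Forward fill assumes the rows are already sorted in time order; on unsorted data it would copy values from unrelated rows.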

Tip:

Always analyze the importance of the column before dropping it.

Use visualization libraries like missingno to see missing data patterns:

import missingno as msno
msno.matrix(df)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
df = pd.read_csv('Churn_Modelling.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [4]:
missing_values = df.isnull().sum()

In [6]:
missing_percent = (missing_values / len(df)) * 100

In [8]:
missing_percent

RowNumber          0.0
CustomerId         0.0
Surname            0.0
CreditScore        0.0
Geography          0.0
Gender             0.0
Age                0.0
Tenure             0.0
Balance            0.0
NumOfProducts      0.0
HasCrCard          0.0
IsActiveMember     0.0
EstimatedSalary    0.0
Exited             0.0
dtype: float64

In [7]:
missing_values

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [9]:
# Removing the columns with more than 50% missing values
threshold = 50
cols_to_drop = missing_percent[missing_percent > threshold].index
df_cleaned = df.drop(columns=cols_to_drop)
print(f"Dropped columns: {cols_to_drop.tolist()}")

Dropped columns: []


In [10]:
# Drop any remaining rows that contain at least one missing value
df_cleaned = df_cleaned.dropna(axis=0)

In [11]:
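# Drop any columns that still contain missing values
# (has no effect after the row drop in the previous cell, since no missing values remain)
df_cleaned = df_cleaned.dropna(axis=1)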

In [12]:
missing_values_after = df_cleaned.isnull().sum()
print("Missing Values After Cleaning:\n", missing_values_after)

Missing Values After Cleaning:
 RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
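
Because Churn_Modelling.csv has no missing values, the cells above leave the DataFrame unchanged. To see the same threshold logic actually remove something, here is a minimal sketch on a hypothetical toy DataFrame (its columns and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'score' is 25% missing, 'comment' is 75% missing.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [0.5, np.nan, 0.7, 0.9],
    "comment": [np.nan, np.nan, np.nan, "ok"],
})

# Percentage of missing values per column.
missing_percent = toy.isnull().sum() / len(toy) * 100

# Drop columns whose missing percentage exceeds the threshold.
threshold = 50
cols_to_drop = missing_percent[missing_percent > threshold].index
toy_cleaned = toy.drop(columns=cols_to_drop)

# Then drop the remaining rows that still contain a missing value.
toy_cleaned = toy_cleaned.dropna(axis=0)

print(f"Dropped columns: {cols_to_drop.tolist()}")
print(toy_cleaned.shape)
print(toy_cleaned.isnull().sum())
```

Dropping the sparse column first and the incomplete rows second keeps three of the four rows; running dropna(axis=0) first would have kept only one.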
