## MTN Nigeria Customer Churn Prediction – Exploratory Data Analysis (Customer-Level)

**Business Objective**  
Predict which customers are at high risk of churning so that MTN Nigeria can design proactive retention interventions (better bundles, network quality fixes, loyalty rewards, targeted offers) and protect and grow revenue in a highly competitive telecom market.

**Notebook Goal**  
- Understand customer behavior patterns  
- Identify strongest churn drivers  
- Surface actionable business insights  
- Prepare high-signal features for modeling

**Dataset Source**  
Kaggle: [MTN Nigeria Customer Churn](https://www.kaggle.com/datasets/oluwademiladeadeniyi/mtn-nigeria-customer-churn)  
Original data: transaction-level (multiple purchases per customer)

### Setup & Imports

In [16]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import os

### Load Raw Data

In [17]:
data = pd.read_csv(r'C:\Users\KOLADE\OneDrive\Documents\AkoladeDSJourney\MTN-Nigeria-Customer-Churn\data\raw\mtn_customer_churn.csv')
df = data.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 974 entries, 0 to 973
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Customer ID                974 non-null    object 
 1   Full Name                  974 non-null    object 
 2   Date of Purchase           974 non-null    object 
 3   Age                        974 non-null    int64  
 4   State                      974 non-null    object 
 5   MTN Device                 974 non-null    object 
 6   Gender                     974 non-null    object 
 7   Satisfaction Rate          974 non-null    int64  
 8   Customer Review            974 non-null    object 
 9   Customer Tenure in months  974 non-null    int64  
 10  Subscription Plan          974 non-null    object 
 11  Unit Price                 974 non-null    int64  
 12  Number of Times Purchased  974 non-null    int64  
 13  Total Revenue              974 non-null    int64  
 14  Data Usage                 974 non-null    float64
 15  Customer Churn Status      974 non-null    object 
 16  Reasons for Churn          284 non-null    object 
dtypes: float64(1), int64(6), object(10)

**Quick Data Quality Check**

In [18]:
print(f"Dataframe shape: {df.shape}")
df.head()

Dataframe shape: (974, 17)


| | Customer ID | Full Name | Date of Purchase | Age | State | MTN Device | Gender | Satisfaction Rate | Customer Review | Customer Tenure in months | Subscription Plan | Unit Price | Number of Times Purchased | Total Revenue | Data Usage | Customer Churn Status | Reasons for Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST0001 | Ngozi Berry | Jan-25 | 27 | Kwara | 4G Router | Male | 2 | Fair | 2 | 165GB Monthly Plan | 35000 | 19 | 665000 | 44.48 | Yes | Relocation |
| 1 | CUST0002 | Zainab Baker | Mar-25 | 16 | Abuja (FCT) | Mobile SIM Card | Female | 2 | Fair | 22 | 12.5GB Monthly Plan | 5500 | 12 | 66000 | 19.79 | Yes | Better Offers from Competitors |
| 2 | CUST0003 | Saidu Evans | Mar-25 | 21 | Sokoto | 5G Broadband Router | Male | 1 | Poor | 60 | 150GB FUP Monthly Unlimited | 20000 | 8 | 160000 | 9.64 | No | |
| 3 | CUST0003 | Saidu Evans | Mar-25 | 21 | Sokoto | Mobile SIM Card | Male | 1 | Poor | 60 | 1GB+1.5mins Daily Plan | 500 | 8 | 4000 | 197.05 | No | |
| 4 | CUST0003 | Saidu Evans | Mar-25 | 21 | Sokoto | Broadband MiFi | Male | 1 | Poor | 60 | 30GB Monthly Broadband Plan | 9000 | 15 | 135000 | 76.34 | No | |


In [19]:
print(f"Duplicate rows: {df.duplicated().sum()}\n")
print(f"Missing values:\n{df.isnull().sum()}")

Duplicate rows: 0

Missing values:
Customer ID                    0
Full Name                      0
Date of Purchase               0
Age                            0
State                          0
MTN Device                     0
Gender                         0
Satisfaction Rate              0
Customer Review                0
Customer Tenure in months      0
Subscription Plan              0
Unit Price                     0
Number of Times Purchased      0
Total Revenue                  0
Data Usage                     0
Customer Churn Status          0
Reasons for Churn            690
dtype: int64


In [20]:
df[df["Reasons for Churn"].isnull()]['Customer Churn Status'].value_counts()

Customer Churn Status
No    690
Name: count, dtype: int64

- There are no fully duplicated rows.
- The 690 missing values in `Reasons for Churn` are expected: every one of them belongs to a row where the customer did not churn, so there is no reason to record.

### Key Descriptive Statistics

**Numerical Features**

In [21]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 974.0 | 48.043121 | 17.764307 | 16.0 | 32.0 | 49.0 | 63.75 | 80.0 |
| Satisfaction Rate | 974.0 | 2.947639 | 1.384219 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Customer Tenure in months | 974.0 | 31.422998 | 17.191256 | 1.0 | 17.0 | 31.0 | 47.0 | 60.0 |
| Unit Price | 974.0 | 19196.663244 | 25586.726985 | 350.0 | 5500.0 | 14500.0 | 24000.0 | 150000.0 |
| Number of Times Purchased | 974.0 | 10.564682 | 5.709427 | 1.0 | 5.0 | 11.0 | 15.0 | 20.0 |
| Total Revenue | 974.0 | 204669.609856 | 324785.499316 | 350.0 | 33000.0 | 108000.0 | 261000.0 | 3000000.0 |
| Data Usage | 974.0 | 99.304764 | 57.739511 | 0.82 | 47.6375 | 103.33 | 149.6975 | 200.0 |


**Categorical Features**

In [22]:
df.describe(include='O')

| | Customer ID | Full Name | Date of Purchase | State | MTN Device | Gender | Customer Review | Subscription Plan | Customer Churn Status | Reasons for Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 974 | 974 | 974 | 974 | 974 | 974 | 974 | 974 | 974 | 284 |
| unique | 496 | 484 | 3 | 35 | 4 | 2 | 5 | 21 | 2 | 7 |
| top | CUST0003 | Halima Walker | Feb-25 | Osun | Mobile SIM Card | Female | Very Good | 60GB Monthly Broadband Plan | No | High Call Tarriffs |
| freq | 3 | 5 | 450 | 43 | 301 | 495 | 212 | 81 | 690 | 54 |


### Aggregating Transaction Data to Customer Level

**Why Aggregate?**  
The original dataset is **transaction-level**: each row represents one purchase/device/subscription for a customer, so a single `Customer ID` can appear on multiple rows.  

Churn, however, is a **customer-level outcome** (a customer either churns or doesn't).  

To build a meaningful predictive model and derive business insights:  
- We need **one row per unique customer**  
- Stable attributes (age, gender, state) → take once  
- Behavioral signals (revenue, data usage, purchases, devices) → aggregate meaningfully  
- Churn label & reasons → preserve at customer level for diagnosis  

This aggregation transforms the data from **transactional** → **customer-centric**, which matches how MTN would actually intervene on churn (targeting individual customers, not individual purchases).
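
Taking stable attributes once is only safe if they really are constant within a customer. A minimal sketch of that check follows, using a small made-up frame; the rows and the `inconsistent_columns` helper are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy transaction-level data (made up for illustration; not from the MTN dataset).
toy = pd.DataFrame({
    "Customer ID": ["C1", "C1", "C2", "C2", "C3"],
    "Age": [30, 30, 45, 45, 52],
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "Customer Churn Status": ["No", "No", "Yes", "No", "Yes"],
})


def inconsistent_columns(frame, key, columns):
    """Hypothetical helper: per column, count the keys whose rows disagree."""
    distinct = frame.groupby(key)[columns].nunique(dropna=False)
    return (distinct > 1).sum()


conflicts = inconsistent_columns(
    toy, "Customer ID", ["Age", "Gender", "Customer Churn Status"]
)
print(conflicts)
```

On the real data the same call could be made with `df` and the columns `Age`, `State`, `Gender` and `Customer Churn Status`. Any non-zero count would mean that `first` (or `max`) is silently choosing one of several conflicting values.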

**Key Aggregation Decisions**

| Feature                     | Aggregation Method                     | Rationale / Business Meaning |
|-----------------------------|----------------------------------------|------------------------------|
| `Age`, `State`, `Gender`    | `first`                                | Demographic attributes, assumed to be the same on every row of a customer |
| `MTN Device`                | `nunique` + `mode`                     | `Device_Count`: number of distinct devices. `Primary_Device`: most common device type (on a tie, `mode().iloc[0]` returns the alphabetically first value) |
| `Date of Purchase`          | `min` + `max` + `nunique`              | First/last purchase + number of active months → recency & engagement signals. `min`/`max` are chronological only if the `Mon-YY` strings are parsed as dates before aggregating |
| `Satisfaction Rate`         | `mean`                                 | Average satisfaction across all of the customer's rows |
| `Customer Review`           | `mode`                                 | Most frequent review label, e.g. "Fair" or "Very Good" (ties resolved alphabetically) |
| `Customer Tenure in months` | `max`                                  | Longest recorded tenure → overall loyalty |
| `Total Revenue`             | `sum`                                  | Total revenue across the customer's recorded purchases → proxy for customer value |
| `Data Usage`                | `mean`                                 | Average data usage (GB) across the customer's rows → engagement & potential overuse signal |
| `Number of Times Purchased` | `sum`                                  | Total purchases → proxy for loyalty & activity |
| `Unit Price`                | `mean`                                 | Average unit price across the customer's plans |
| `Customer Churn Status`     | `max`                                  | If any row says "Yes", the customer is treated as churned (conservative assumption). This works on strings because "Yes" sorts after "No" |
| `Reasons for Churn`         | Custom join of unique non-null reasons | Preserve all reported reasons for business diagnosis (not for modeling) |
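
Two of the choices in the table depend on how pandas orders strings. The sketch below uses made-up values to illustrate both effects: `min` on raw `Mon-YY` text compares alphabetically rather than chronologically, and `mode().iloc[0]` picks the alphabetically first value when every value occurs equally often.

```python
import pandas as pd

# Made-up month labels in the same 'Mon-YY' style as `Date of Purchase`.
months = pd.Series(["Jan-25", "Feb-25", "Mar-25"])

# String comparison is alphabetical, so 'Feb-25' sorts before 'Jan-25'.
earliest_as_text = months.min()

# Parsing first gives a chronological comparison.
parsed = pd.to_datetime(months, format="%b-%y")
earliest_as_date = parsed.min()

# Series.mode() returns all tied values in sorted order, so .iloc[0]
# picks the alphabetically first one when every device appears once.
devices = pd.Series(["Mobile SIM Card", "Broadband MiFi", "5G Broadband Router"])
tie_winner = devices.mode().iloc[0]

print(earliest_as_text, earliest_as_date.date(), tie_winner)
```

The tie-break matters for customers who hold several devices once each: their `Primary_Device` is then decided by spelling, not by usage.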

In [23]:
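# Parse 'Mon-YY' strings up front so that min/max below compare chronologically
# (as plain text, 'Feb-25' sorts before 'Jan-25')
df['Date of Purchase'] = pd.to_datetime(df['Date of Purchase'], format='%b-%y', errors='coerce')

customer_df = (
    df.groupby('Customer ID')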
      .agg(
          # === Demographics (stable) ===
          Age=('Age', 'first'),
          State=('State', 'first'),
          Gender=('Gender', 'first'),
          
          # === Device & product behaviour ===
          Device_Count=('MTN Device', 'nunique'),
          Primary_Device=('MTN Device', lambda x: x.mode().iloc[0] if not x.mode().empty else None),
          
          # === Purchase behaviour (time & engagement) ===
          First_Purchase_Date=('Date of Purchase', 'min'),
          Last_Purchase_Date=('Date of Purchase', 'max'),
          Active_Months=('Date of Purchase', 'nunique'),
          
          # === Experience & satisfaction ===
          Avg_Satisfaction_Rate=('Satisfaction Rate', 'mean'),
          Primary_Review=('Customer Review', lambda x: x.mode().iloc[0] if not x.mode().empty else None),
          
          # === Tenure & value ===
          Customer_Tenure_Months=('Customer Tenure in months', 'max'),
          Total_Revenue=('Total Revenue', 'sum'),
          
          # === Usage & intensity ===
          Avg_Data_Usage_GB=('Data Usage', 'mean'),
          Total_Purchases=('Number of Times Purchased', 'sum'),
          Avg_Unit_Price=('Unit Price', 'mean'),
          
          # === Target & business explanation ===
          Customer_Churn_Status=('Customer Churn Status', 'max'),
          Reasons_for_Churn=(
              'Reasons for Churn',
              lambda x: (
                  ', '.join(pd.Series(x.dropna().str.strip()).unique())
                  if x.notna().any()
                  else None
              )
          )
      )
      .reset_index()
)


# Convert dates
date_cols = ['First_Purchase_Date', 'Last_Purchase_Date']
for col in date_cols:
    customer_df[col] = pd.to_datetime(customer_df[col], format='%b-%y', errors='coerce')

# Recency & span features
reference_date = customer_df['Last_Purchase_Date'].max() + pd.DateOffset(months=1)
customer_df['Months_Since_Last_Purchase'] = (
    (reference_date - customer_df['Last_Purchase_Date']).dt.days / 30.4375
).round(1).clip(lower=0)

customer_df['Purchase_Span_Months'] = (
    (customer_df['Last_Purchase_Date'] - customer_df['First_Purchase_Date']).dt.days / 30.4375
).round(1).fillna(0)

# Clean churn target
customer_df['Churn'] = (customer_df['Customer_Churn_Status'] == 'Yes').astype(int)

print("Customer-level dataset created!")
print("Shape:", customer_df.shape)

Customer-level dataset created!
Shape: (496, 21)
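
The recency features turn a day difference into months by dividing by 30.4375 (365.25 / 12). Here is a worked sketch of that arithmetic on three hypothetical last-purchase months; the constant name `AVG_DAYS_PER_MONTH` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical last-purchase months for three customers.
last_purchase = pd.to_datetime(
    pd.Series(["Jan-25", "Feb-25", "Mar-25"]), format="%b-%y"
)

# Reference point: one month after the latest purchase in the data.
reference_date = last_purchase.max() + pd.DateOffset(months=1)

AVG_DAYS_PER_MONTH = 365.25 / 12  # 30.4375

# Day gap to the reference date, expressed in average-length months.
months_since = (
    (reference_date - last_purchase).dt.days / AVG_DAYS_PER_MONTH
).round(1)

print(reference_date.date(), months_since.tolist())
```

Because the divisor is an average month length, a February purchase comes out at 1.9 months rather than exactly 2. If whole calendar months are preferred, a difference of year and month numbers would be an alternative.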


In [24]:
customer_df.head()

| | Customer ID | Age | State | Gender | Device_Count | Primary_Device | First_Purchase_Date | Last_Purchase_Date | Active_Months | Avg_Satisfaction_Rate | ... | Customer_Tenure_Months | Total_Revenue | Avg_Data_Usage_GB | Total_Purchases | Avg_Unit_Price | Customer_Churn_Status | Reasons_for_Churn | Months_Since_Last_Purchase | Purchase_Span_Months | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST0001 | 27 | Kwara | Male | 1 | 4G Router | 2025-01-01 | 2025-01-01 | 1 | 2.0 | ... | 2 | 665000 | 44.48 | 19 | 35000.0 | Yes | Relocation | 3.0 | 0.0 | 1 |
| 1 | CUST0002 | 16 | Abuja (FCT) | Female | 1 | Mobile SIM Card | 2025-03-01 | 2025-03-01 | 1 | 2.0 | ... | 22 | 66000 | 19.79 | 12 | 5500.0 | Yes | Better Offers from Competitors | 1.0 | 0.0 | 1 |
| 2 | CUST0003 | 21 | Sokoto | Male | 3 | 5G Broadband Router | 2025-03-01 | 2025-03-01 | 1 | 1.0 | ... | 60 | 299000 | 94.343333 | 31 | 9833.333333 | No | | 1.0 | 0.0 | 0 |
| 3 | CUST0004 | 36 | Gombe | Female | 1 | 4G Router | 2025-03-01 | 2025-03-01 | 1 | 1.0 | ... | 14 | 40500 | 92.72 | 9 | 4500.0 | No | | 1.0 | 0.0 | 0 |
| 4 | CUST0005 | 57 | Oyo | Male | 1 | 4G Router | 2025-01-01 | 2025-01-01 | 1 | 3.0 | ... | 53 | 144000 | 42.92 | 16 | 9000.0 | No | | 3.0 | 0.0 | 0 |


**Quick Data Quality Checks After Aggregation**

In [25]:
print(f"Missing values:\n{customer_df.isna().sum()}")

Missing values:
Customer ID                     0
Age                             0
State                           0
Gender                          0
Device_Count                    0
Primary_Device                  0
First_Purchase_Date             0
Last_Purchase_Date              0
Active_Months                   0
Avg_Satisfaction_Rate           0
Primary_Review                  0
Customer_Tenure_Months          0
Total_Revenue                   0
Avg_Data_Usage_GB               0
Total_Purchases                 0
Avg_Unit_Price                  0
Customer_Churn_Status           0
Reasons_for_Churn             350
Months_Since_Last_Purchase      0
Purchase_Span_Months            0
Churn                           0
dtype: int64


In [26]:
# Any negative recency?
print(f"Min months since last purchase: {customer_df['Months_Since_Last_Purchase'].min()}")

Min months since last purchase: 1.0


In [27]:
# Sample of multi-device customers
print("\nSample of customers with >1 device:\n")
display(customer_df[customer_df['Device_Count'] > 1][['Customer ID', 'Device_Count', 'Primary_Device', 'Churn']].head(5))


Sample of customers with >1 device:



| | Customer ID | Device_Count | Primary_Device | Churn |
|---|---|---|---|---|
| 2 | CUST0003 | 3 | 5G Broadband Router | 0 |
| 5 | CUST0006 | 3 | 4G Router | 0 |
| 9 | CUST0010 | 3 | 4G Router | 0 |
| 10 | CUST0011 | 3 | 4G Router | 1 |
| 11 | CUST0012 | 2 | 4G Router | 0 |


In [28]:
# Churn rate
print(f"\nOverall churn rate: {customer_df['Churn'].mean():.1%}")


Overall churn rate: 29.4%


In [29]:
# Sample of customers who churned with reasons
print("\nSample churned customers with reasons:\n")
display(customer_df[customer_df['Churn'] == 1][['Customer ID', 'Reasons_for_Churn', 'Total_Revenue', 'Avg_Satisfaction_Rate']].head(10))


Sample churned customers with reasons:



| | Customer ID | Reasons_for_Churn | Total_Revenue | Avg_Satisfaction_Rate |
|---|---|---|---|---|
| 0 | CUST0001 | Relocation | 665000 | 2.0 |
| 1 | CUST0002 | Better Offers from Competitors | 66000 | 2.0 |
| 6 | CUST0007 | Relocation | 264000 | 5.0 |
| 10 | CUST0011 | Poor Network | 301000 | 2.0 |
| 12 | CUST0013 | Relocation | 54000 | 2.0 |
| 16 | CUST0017 | Costly Data Plans | 135000 | 5.0 |
| 19 | CUST0020 | Poor Network | 855000 | 5.0 |
| 29 | CUST0030 | Poor Network | 767000 | 2.0 |
| 30 | CUST0031 | Better Offers from Competitors | 917100 | 4.0 |
| 34 | CUST0035 | Poor Network | 1345000 | 2.0 |
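
Because `Reasons_for_Churn` can hold several reasons joined with `', '`, counting reasons for diagnosis means splitting them apart again. The sketch below shows one way to do this on made-up rows. It assumes that no individual reason text contains `', '` itself.

```python
import pandas as pd

# Made-up customer-level rows; reasons use the same ', ' join as the aggregation.
toy_customers = pd.DataFrame({
    "Customer ID": ["C1", "C2", "C3", "C4"],
    "Reasons_for_Churn": [
        "Relocation",
        "Poor Network, Costly Data Plans",
        None,  # non-churner: no reason recorded
        "Poor Network",
    ],
})

# Split each joined string into a list, then give every reason its own row.
reason_counts = (
    toy_customers["Reasons_for_Churn"]
    .dropna()
    .str.split(", ")
    .explode()
    .value_counts()
)
print(reason_counts)
```

Counted this way, a customer with two reasons contributes to both totals, so the counts can add up to more than the number of churned customers.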


**Save Processed Customer-Level Dataset**

In [30]:
processed_folder = r"C:\Users\KOLADE\OneDrive\Documents\AkoladeDSJourney\MTN-Nigeria-Customer-Churn\data\processed"

os.makedirs(processed_folder, exist_ok=True)

save_path = os.path.join(processed_folder, "mtn_customer_level_churn.csv")

customer_df.to_csv(save_path, index=False)
print(f"Saved to: {save_path}")

Saved to: C:\Users\KOLADE\OneDrive\Documents\AkoladeDSJourney\MTN-Nigeria-Customer-Churn\data\processed\mtn_customer_level_churn.csv
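
CSV does not store column types, so the date columns come back as plain text unless they are parsed on load. The sketch below shows a round trip on a tiny made-up stand-in for `customer_df`, written to a temporary folder (the file name is hypothetical).

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for customer_df (values are made up).
toy = pd.DataFrame({
    "Customer ID": ["C1", "C2"],
    "Last_Purchase_Date": pd.to_datetime(["2025-01-01", "2025-03-01"]),
    "Reasons_for_Churn": ["Relocation", None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "customer_level.csv")
    toy.to_csv(path, index=False)

    # Without parse_dates the column is read back as text.
    plain = pd.read_csv(path)
    # With parse_dates the datetime dtype is restored.
    typed = pd.read_csv(path, parse_dates=["Last_Purchase_Date"])

print(plain["Last_Purchase_Date"].dtype, typed["Last_Purchase_Date"].dtype)
```

When the processed file is loaded in a later notebook, passing `parse_dates=['First_Purchase_Date', 'Last_Purchase_Date']` to `pd.read_csv` would keep the recency features reproducible.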
