
## **Business Understanding**

### Business Problem
SyriaTel loses revenue to customer churn, a significant challenge because acquiring new customers is more expensive than retaining existing ones. A data-driven approach is therefore needed to identify customers at high risk of churning so that retention strategies can be targeted more effectively.

### Objectives
- Primary Objective

Develop a classification model that maximizes recall, so that as few churning customers as possible go undetected.
- Secondary Objectives

1. Identify key features associated with customer churn to inform retention strategies.
2. Compare multiple classification models to evaluate performance trade-offs.
3. Assess model performance using business-relevant evaluation metrics.
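
Recall, the metric named in the primary objective, is the share of actual churners that a model flags: TP / (TP + FN). The sketch below illustrates the calculation on made-up labels; the arrays are hypothetical and are not taken from the SyriaTel data.

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = churned, 0 = stayed (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Recall = TP / (TP + FN): 3 of the 4 actual churners are caught
recall = recall_score(y_true, y_pred)
print(recall)  # 0.75
```

The one false positive (a loyal customer flagged as a churner) does not lower recall, which is why recall is usually read alongside precision or F1.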


## **Data Understanding**
The dataset used in this analysis is the SyriaTel Customer Churn dataset, sourced from [Kaggle](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset?resource=download).

### Import the necessary libraries

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve
)


### Checking the data

In [2]:
df = pd.read_csv("/content/SyriaTel Customer Churn.csv")
df.head()


Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [3]:
print('This dataset has ' + str(df.shape[0]) + ' rows, and ' + str(df.shape[1]) + ' columns')

This dataset has 3333 rows, and 21 columns


In [4]:
print("Data types for SyriaTel:")
print(df.dtypes)

Data types for SyriaTel:
state                      object
account length              int64
area code                   int64
phone number               object
international plan         object
voice mail plan            object
number vmail messages       int64
total day minutes         float64
total day calls             int64
total day charge          float64
total eve minutes         float64
total eve calls             int64
total eve charge          float64
total night minutes       float64
total night calls           int64
total night charge        float64
total intl minutes        float64
total intl calls            int64
total intl charge         float64
customer service calls      int64
churn                        bool
dtype: object


In [5]:
print("Numerical columns in SyriaTel")
num_features = df.select_dtypes(include='number').shape[1]
print(num_features)
print("Categorical columns in SyriaTel")
cat_features = df.select_dtypes(exclude='number').shape[1]
print(cat_features)

Numerical columns in SyriaTel
16
Categorical columns in SyriaTel
5


In [6]:
print("Data info for SyriaTel")
df.info()  # info is a method: call it directly; print(df.info) only shows the bound-method repr

In [7]:
print(df['churn'].value_counts())
print(df['churn'].value_counts(normalize=True))

churn
False    2850
True      483
Name: count, dtype: int64
churn
False    0.855086
True     0.144914
Name: proportion, dtype: float64


In [8]:
categorical_cols = [
    'churn',
    'international plan',
    'voice mail plan'
]

for col in categorical_cols:
    print(f"{col}: {df[col].unique()}")


churn: [False  True]
international plan: ['no' 'yes']
voice mail plan: ['yes' 'no']


In [9]:
print(df.duplicated().sum())
print(df.isna().sum())

0
state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64


- Dataset Overview

Number of observations: 3,333 customers

Number of columns: 21 (20 features plus the target)

Target variable: churn

Target type: Binary (True / False)

No missing values or duplicate rows

1. Data Structure and Type Identification: The dataset's shape (number of observations and features) and the types of variables (numerical vs. categorical) were inspected. This confirmed the data's composition and directly informed the need for scaling numerical features and encoding categorical features in the subsequent preprocessing phase.

2. Target Variable Analysis (Class Imbalance): The target variable (churn) was analyzed using value counts. This revealed a class imbalance (about 85.5% non-churners versus 14.5% churners), which calls for imbalance-aware techniques (such as class-weighted models) and for evaluation metrics other than accuracy (such as recall and F1-score).

3. Feature Quality Check: Unique values within key categorical features were reviewed. The labels are consistent and clean (only `yes`/`no` and `True`/`False`), so inconsistent labels in these columns should not distort the model.
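
To make the imbalance point concrete, the sketch below scores a naive baseline that predicts "no churn" for every customer. It rebuilds the label vector from the class counts printed above (2,850 and 483); the baseline itself is an illustration, not one of the models in this analysis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Rebuild the labels from the class counts printed earlier (1 = churned)
y_true = np.array([0] * 2850 + [1] * 483)

# Naive baseline: predict "no churn" for every customer
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.3f}")
```

The baseline reaches roughly 85.5% accuracy while catching none of the churners, which is why accuracy alone is a poor guide here.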

## **Data Preparation**

In [10]:
# Create a copy so the original dataset stays unaltered
data = df.copy(deep=True)

### Defining X and y

In [11]:
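X = data.drop(columns='churn')
y = data['churn']

# Note: X_train/X_test are created here, before the column drops and feature
# engineering below, which modify X only. Repeat the split afterwards (or apply
# the same steps to both sets) so the model sees the engineered features.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)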
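
The `stratify=y` argument is meant to keep the churn rate similar in the training and test sets. A minimal check of that behaviour, using a synthetic label vector with the same class counts as the churn column and a placeholder feature (both `X_demo` and `y_demo` are stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the churn column
y_demo = np.array([0] * 2850 + [1] * 483)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both rates should sit close to the overall churn rate of about 14.5%
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```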


### Dropping irrelevant columns

In [12]:
irrelevant_cols = [
    'account length',
    'phone number',
    'area code',
    'state',
    'total day calls',
    'total eve calls',
    'total night calls',
    'total intl calls'
]
X = X.drop(columns=irrelevant_cols)
print(X.columns)

Index(['international plan', 'voice mail plan', 'number vmail messages',
       'total day minutes', 'total day charge', 'total eve minutes',
       'total eve charge', 'total night minutes', 'total night charge',
       'total intl minutes', 'total intl charge', 'customer service calls'],
      dtype='object')


Identifier and routing columns (phone number, area code, state), along with account length, were dropped on the assumption that they add little predictive value. The call counts were also dropped because they were judged to carry a weaker signal than minutes and charges, which are kept as the main usage metrics.
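
Minutes and charges are both kept at this stage, but the preview rows hint that each charge is close to a fixed multiple of the matching minutes. The sketch below checks this for the day columns using only the five rows shown by `df.head()`; five rows are a hint rather than proof, so the same ratio (or the VIF imported earlier) would need to be checked on the full columns.

```python
import pandas as pd

# Day minutes and day charges copied from the five preview rows above
preview = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "total day charge": [45.07, 27.47, 41.38, 50.90, 28.34],
})

# Charge per minute is almost constant across these rows
rate = preview["total day charge"] / preview["total day minutes"]
print(rate.round(3).tolist())
```

If the pattern holds across the dataset, each minutes/charge pair carries nearly the same information, which matters for linear models such as logistic regression.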

### Feature Engineering

In [13]:
# Aggregate usage into two columns: total minutes and total charges
X['total_minutes'] = (
    X['total day minutes'] +
    X['total eve minutes'] +
    X['total night minutes'] +
    X['total intl minutes']
)

X['total_charges'] = (
    X['total day charge'] +
    X['total eve charge'] +
    X['total night charge'] +
    X['total intl charge']
)


In [14]:
# Capture the usage category with the most minutes (day, evening, night, or international)
X['peak_period'] = X[['total day minutes','total eve minutes','total night minutes','total intl minutes']].idxmax(axis=1)


In [15]:
# Capture imbalance in usage patterns
X['day_night_ratio'] = (X['total day minutes']) / (X['total night minutes'] + 1)  # +1 avoids division by zero
# Binary flag: 1 if the customer used more than 10 international minutes
X['intl_usage_flag'] = (X['total intl minutes'] > 10).astype(int)
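
As a worked example of the two features above, the sketch below applies the same formulas to row 0 of the preview (265.1 day minutes, 244.7 night minutes, 10.0 international minutes). The one-row frame `row` is built by hand for illustration.

```python
import pandas as pd

# Row 0 of the preview; only the columns needed here
row = pd.DataFrame({
    "total day minutes": [265.1],
    "total night minutes": [244.7],
    "total intl minutes": [10.0],
})

day_night_ratio = row["total day minutes"] / (row["total night minutes"] + 1)
intl_usage_flag = (row["total intl minutes"] > 10).astype(int)

print(round(day_night_ratio.iloc[0], 3))  # 1.079
print(intl_usage_flag.iloc[0])            # 0, because 10.0 is not > 10
```

Two details are worth noting: the `+ 1` slightly shrinks every ratio, and the strict `>` means a customer with exactly 10 international minutes is not flagged.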


In [16]:
# Drop the raw minute and charge columns now that the engineered features exist
minute_charge_cols = [
    'total day minutes', 'total eve minutes', 'total night minutes', 'total intl minutes',
    'total day charge', 'total eve charge', 'total night charge', 'total intl charge'
]

X = X.drop(columns=minute_charge_cols)
