SYRIATEL CUSTOMER CHURN PREDICTION

AUTHOR : KELVIN MWAURA NJUNG'E

COHORT : DSFT08 (REMOTE)


BUSINESS UNDERSTANDING

Project Overview


Syriatel, a telecommunications company, is experiencing a significant rate of customer churn. Customer churn, or the rate at which customers stop doing business with a company, is a critical issue in the telecom industry. Retaining customers is not only more cost-effective than acquiring new ones but also crucial for sustaining profitability and market position. This project aims to build a predictive model to identify customers who are likely to churn, enabling Syriatel to implement proactive retention strategies to mitigate revenue loss.


STAKEHOLDERS

Primary Stakeholder:

Syriatel Mobile Telecom: The telecom company that will use the model to reduce customer churn and improve profitability.


Other Stakeholders:

Shareholders: Will benefit from increased profitability and market stability.

Employees: Will benefit from a more stable business environment and potentially better compensation.

Customers: Will benefit from improved services and customer support.

BUSINESS PROBLEM

The main business problem is the high rate of customer churn at Syriatel, leading to substantial revenue loss and increased costs associated with acquiring new customers. The goal is to develop a predictive model that can identify customers who are likely to churn, allowing Syriatel to take targeted actions to retain these customers and reduce overall churn rates.

Project Scope

In-Scope:


Identifying key features that predict customer churn.

Developing a robust classification model to predict churn.

Providing actionable insights and recommendations for retaining more customers.


Out-of-Scope:

Directly implementing the retention strategies.

Long-term monitoring and adjustments of the model post-deployment.

RESEARCH OBJECTIVES

- Identify key features that determine customer churn. 

    We analyze the relative importance of the features to uncover the key factors behind customer churn in the telecommunications sector. By pinpointing the strongest contributors to churn, Syriatel gains insights that guide strategic decisions, resource allocation, and targeted initiatives to reduce customer attrition.


- Develop the most suitable model to predict customer churn.

 We compare the performance of several machine learning models to identify the one that predicts churn most effectively. Selecting the best-performing model lets Syriatel allocate resources to churn-reduction strategies with confidence.

- Establish customer retention strategies to reduce churn.

 By accurately identifying customers at risk of churning, Syriatel can proactively implement targeted retention strategies. These may include offering incentives, improving the customer service experience, and tailoring marketing to the preferences of at-risk customer segments.



SUCCESS METRICS


- Achieve a high recall score to ensure most at-risk customers are correctly identified.

- Identify key features that significantly contribute to customer churn.

- Provide clear, actionable recommendations for reducing churn and improving customer retention.

- Show the effectiveness of the churn prediction model in enabling proactive retention strategies and minimizing revenue losses.
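
The recall target in the first metric above can be made concrete with a small worked example. The sketch below is illustrative only: the labels are made up (1 = churned, 0 = stayed) and the helper name `recall_and_precision` is hypothetical. It shows that recall is the share of actual churners the model catches, while precision is the share of flagged customers who really churn.

```python
def recall_and_precision(y_true, y_pred):
    """Compute recall and precision for the positive (churn = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners the model missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # loyal customers flagged by mistake
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up labels for ten customers (assumption, not project data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

recall, precision = recall_and_precision(y_true, y_pred)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Here three of the four churners are caught, so recall is 0.75, while two loyal customers are flagged unnecessarily. Prioritizing recall accepts some of these false alarms in exchange for missing fewer at-risk customers. scikit-learn's `recall_score` computes the same quantity.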







DATA UNDERSTANDING


To investigate the business problem stated above, this project uses the Churn in Telecom's dataset from Kaggle (https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset).



The dataset contains 3333 rows and 21 columns, including information about the state, account length, area code, phone number, international plan, voice mail plan, number of voice mail messages, total day minutes, total day calls, total day charge, total evening minutes, total evening calls, total evening charge, total night minutes, total night calls, total night charge, total international minutes, total international calls, total international charge, customer service calls and churn.


DATASET FEATURES:


State: the state the customer lives in.

Account Length: the number of days the customer has had an account.

Area Code: the area code of the customer.

Phone Number: the phone number of the customer.

International Plan: true if the customer has the international plan, otherwise false.

Voice Mail Plan: true if the customer has the voice mail plan, otherwise false.

Number Vmail Messages: the number of voicemail messages the customer has sent.

Total Day Minutes: total number of minutes the customer has been in calls during the day.

Total Day Calls: total number of calls the customer has made during the day.

Total Day Charge: total amount of money the customer was charged by the telecom company for calls during the day.

Total Eve Minutes: total number of minutes the customer has been in calls during the evening.

Total Eve Calls: total number of calls the customer has made during the evening.

Total Eve Charge: total amount of money the customer was charged by the telecom company for calls during the evening.

Total Night Minutes: total number of minutes the customer has been in calls during the night.

Total Night Calls: total number of calls the customer has made during the night.

Total Night Charge: total amount of money the customer was charged by the telecom company for calls during the night.

Total Intl Minutes: total number of minutes the customer has been in international calls.

Total Intl Calls: total number of international calls the customer has made.

Total Intl Charge: total amount of money the customer was charged by the telecom company for international calls.

Customer Service Calls: number of calls the customer has made to customer service.

Churn: true if the customer terminated their contract, otherwise false.
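
Before analysis, it can help to confirm that a file actually contains the 21 features listed above. The sketch below is a minimal, standard-library check. The lowercase, space-separated column names are an assumption derived from the feature list (the exact spelling in the CSV may differ), and `check_header` is a hypothetical helper, not part of the project code.

```python
import csv
import io

# Assumed column names, derived from the feature list above (21 in total).
EXPECTED_COLUMNS = [
    "state", "account length", "area code", "phone number",
    "international plan", "voice mail plan", "number vmail messages",
    "total day minutes", "total day calls", "total day charge",
    "total eve minutes", "total eve calls", "total eve charge",
    "total night minutes", "total night calls", "total night charge",
    "total intl minutes", "total intl calls", "total intl charge",
    "customer service calls", "churn",
]


def check_header(csv_text, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names for the first row of csv_text."""
    header = next(csv.reader(io.StringIO(csv_text)))
    found = [name.strip().lower() for name in header]
    missing = [name for name in expected if name not in found]
    unexpected = [name for name in found if name not in expected]
    return missing, unexpected


# Synthetic header with one deliberately misspelled column (not the real file).
sample_header = ",".join(EXPECTED_COLUMNS[:-1]) + ",Churn?\n"
missing, unexpected = check_header(sample_header)
print("missing:", missing)
print("unexpected:", unexpected)
```

An empty `missing` list and an empty `unexpected` list would indicate that the header matches the documented features.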

IMPORTING THE LIBRARIES NEEDED FOR CLEANING, MODELLING AND OPTIMIZING THROUGHOUT THE PROJECT





In [14]:
# Data manipulation 
import pandas as pd 
import numpy as np 
import os 

# Data visualization
import seaborn as sns 
import matplotlib.pyplot as plt 
import plotly.graph_objs as go
import plotly.express as px

# Modeling utilities: data splitting, resampling, metrics and preprocessing
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
)
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.metrics import (  # performance metrics
    f1_score, recall_score, precision_score, confusion_matrix,
    roc_curve, roc_auc_score, classification_report
)
from scipy import stats
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

# Algorithms for supervised learning methods
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb
from xgboost import XGBClassifier


# Suppress all warnings (not only FutureWarning) to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [21]:

# Define a class that loads the data and reports its basic attributes in preparation for EDA.
# The class stores the file name and the loaded DataFrame as instance attributes and provides instance methods to inspect them.

class DataFrameLoader:
    def __init__(self, file_name):
        self.file_name = file_name
        self.data = None

    def load_data(self):
        try:
            current_dir = os.getcwd()
            file_path = os.path.join(current_dir, self.file_name)
            self.data = pd.read_csv(file_path)
            print(f"Data loaded successfully from {self.file_name}")
        except FileNotFoundError:
            print(f"Error: File {self.file_name} not found in the current directory.")
        except Exception as e:
            print(f"Error: {e}")

    def get_shape(self):
        if self.data is not None:
            rows, columns = self.data.shape
            print(f"The DataFrame has {rows} rows and {columns} columns.")
        else:
            print("Data not loaded yet. Please call the load_data() method first.")

    def get_info(self):
        if self.data is not None:
            print(self.data.info())
        else:
            print("Data not loaded yet. Please call the load_data() method first.")

    def describe_data(self):
        if self.data is not None:
            print(self.data.describe())
        else:
            print("Data not loaded yet. Please call the load_data() method first.")

# Let's use the class
file_name = 'bigml_59c28831336c6604c800002a.csv'
df_loader = DataFrameLoader(file_name)

# Load data
df_loader.load_data()







Data loaded successfully from bigml_59c28831336c6604c800002a.csv
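
The loader above prints a message when the file is missing, so later cells would still run with `data` set to `None`. As a possible alternative, the sketch below shows a function that returns the DataFrame and raises if the file is absent. The name `load_csv` and its `base_dir` parameter are hypothetical, and the demonstration uses a throwaway two-row file rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(file_name, base_dir=None):
    """Read a CSV from base_dir (default: current directory) and return a DataFrame.

    Raises FileNotFoundError instead of printing, so a missing file stops the run.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = base / file_name
    if not path.is_file():
        raise FileNotFoundError(f"{path} does not exist")
    return pd.read_csv(path)


# Demonstration on a tiny synthetic file (not the project data).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample.csv").write_text("state,churn\nKS,False\nOH,True\n")
    sample = load_csv("sample.csv", base_dir=tmp)

rows, columns = sample.shape
print(f"The DataFrame has {rows} rows and {columns} columns.")
```

Returning the DataFrame also makes it straightforward to pass the data to later preparation steps without reaching into the loader object.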


In [20]:
# Get shape of the data frame
df_loader.get_shape()

The DataFrame has 3333 rows and 21 columns.


In [19]:
# Check the data types of the columns in our DataFrame
df_loader.get_info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

We can see that there are no missing values in our dataset: the non-null count of each column equals the number of rows (3333).
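
The same conclusion can be checked directly rather than read off the `info()` output. The sketch below is illustrative only: it builds a small synthetic DataFrame with two deliberate gaps (the values are made up) and counts missing entries per column with `isna().sum()`. On the project data, the equivalent call would presumably be made on `df_loader.data`.

```python
import pandas as pd

# Synthetic sample with two deliberate gaps (assumption, not project data)
sample = pd.DataFrame({
    "state": ["KS", "OH", None],
    "total day minutes": [120.0, None, 200.5],
    "churn": [False, False, True],
})

# Number of missing entries in each column
missing_per_column = sample.isna().sum()

# Names of the columns that contain at least one missing value
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print("Columns with missing values:", columns_with_gaps)
```

A dataset with no missing values would yield an all-zero `missing_per_column` and an empty `columns_with_gaps` list.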

In [24]:

# Get the descriptive (summary) statistics of the numeric columns in our dataset
df_loader.describe_data()

       account length    area code  number vmail messages  total day minutes  \
count     3333.000000  3333.000000            3333.000000        3333.000000   
mean       101.064806   437.182418               8.099010         179.775098   
std         39.822106    42.371290              13.688365          54.467389   
min          1.000000   408.000000               0.000000           0.000000   
25%         74.000000   408.000000               0.000000         143.700000   
50%        101.000000   415.000000               0.000000         179.400000   
75%        127.000000   510.000000              20.000000         216.400000   
max        243.000000   510.000000              51.000000         350.800000   

       total day calls  total day charge  total eve minutes  total eve calls  \
count      3333.000000       3333.000000        3333.000000      3333.000000   
mean        100.435644         30.562307         200.980348       100.114311   
std          20.069084          9.25943

DATA PREPARATION
