## SyriaTel Customer Churn Prediction

### Authors: Musi Calori, Jessica Gichimu, Vicker Ivy, Bob Lewis


## 1. Business Understanding

### 1.1 Business Overview

The telecommunications industry has become highly competitive over the years, and customer retention has emerged as a critical challenge. One of the major issues facing telecom providers is customer churn, where users discontinue their service either because they are dissatisfied with the provider or because better alternatives are available. High churn rates can significantly reduce a company's revenue and limit its ability to scale.



### 1.2 Problem Statement

SyriaTel, a leading telecom provider, is experiencing a significant loss of customers. To address this challenge, the company seeks to build a robust predictive model capable of identifying customers who are at risk of churning. Using data-driven insights and predictive modeling, SyriaTel aims to understand the key drivers of customer churn, determine ways to improve long-term customer retention, and strengthen customer loyalty.

### 1.3 Business Objective

## 2. Data Understanding
In this step, we explore the dataset to understand what kind of information it contains. We look at the different features, their data types, and check for things like missing values or unusual patterns. This helps us get a clear picture of the data before moving on to cleaning and modeling.

### 2.1. Import Libraries
For this project, we will use the following tools and libraries:

- NumPy: for numerical computations
- Pandas: for data loading, cleaning and manipulation
- Seaborn: for data visualization and EDA
- Matplotlib: for data visualization and EDA
- Scikit-learn: for data preprocessing, predictive modeling and model evaluation
- Imblearn: for dealing with class imbalance

In [42]:
# import required libraries

# data loading and manipulation
import pandas as pd
import numpy as np

# data visualization 
import seaborn as sns
import matplotlib.pyplot as plt

# data preprocessing and modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, LabelEncoder, StandardScaler
from scipy import stats
import random
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, roc_curve, classification_report, confusion_matrix, ConfusionMatrixDisplay, auc, RocCurveDisplay

# algorithms for supervised learning
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# suppress warnings for better readability
import warnings
warnings.filterwarnings('ignore')

# Set the seaborn plot size
sns.set_theme(rc={'figure.figsize':(11.7,8.27)})

### 2.2 Load the Dataset

We will load the dataset, then check its structure (info) and summary statistics.

In [43]:
# load datasets
churn_df = pd.read_csv('churn.csv')
churn_df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [44]:
# check the info of the dataset
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   


From the info() function, we can see the following:

- The dataset contains a total of 3333 records and 21 columns/features.
- There are 16 numerical features and 4 categorical features, excluding the target variable, which is churn.

Next, we want to check the descriptive statistics of the dataset. In this section, we will use the describe() function to check the following for each numerical column:

- count: The number of non-null values
- mean: The average value
- std: The standard deviation
- min: The minimum value
- 25%: The 25th percentile value
- 50%: The 50th percentile value (median)
- 75%: The 75th percentile value
- max: The maximum value

In [45]:
# check the summary statistics
churn_df.describe()

Unnamed: 0,account length,area code,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,437.182418,8.09901,179.775098,100.435644,30.562307,200.980348,100.114311,17.08354,200.872037,100.107711,9.039325,10.237294,4.479448,2.764581,1.562856
std,39.822106,42.37129,13.688365,54.467389,20.069084,9.259435,50.713844,19.922625,4.310668,50.573847,19.568609,2.275873,2.79184,2.461214,0.753773,1.315491
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0
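By default, describe() summarizes only the numerical columns, so state, international plan and voice mail plan are missing from the table above. The following is a minimal sketch of how the non-numerical columns could be summarized as well. It uses a small made-up frame (`toy_df`, a hypothetical stand-in for `churn_df`), so the values are illustrative only.

```python
import pandas as pd

# hypothetical stand-in for churn_df, for illustration only
toy_df = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH"],
    "international plan": ["no", "no", "no", "yes"],
    "account length": [128, 107, 137, 84],
})

# exclude='number' drops the numerical columns, so the summary reports
# count, unique, top (most frequent value) and freq (how often it occurs)
cat_summary = toy_df.describe(exclude="number")
print(cat_summary)
```

On the real data, the same call on `churn_df` would show how many distinct states and plan values there are and which ones dominate.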



To get a better view of the shape of the dataset, as well as the numerical and categorical columns, we can do as follows:

In [46]:

# check the shape of the dataset
print(f"Number of rows: {churn_df.shape[0]}")
print(f"Number of columns: {churn_df.shape[1]}\n")

# display the numerical and categorical columns
print(f"Numerical columns: {churn_df.select_dtypes(include='number').columns}\n")
print(f"Categorical columns: {churn_df.select_dtypes(include='object').columns}\n")

Number of rows: 3333
Number of columns: 21

Numerical columns: Index(['account length', 'area code', 'number vmail messages',
       'total day minutes', 'total day calls', 'total day charge',
       'total eve minutes', 'total eve calls', 'total eve charge',
       'total night minutes', 'total night calls', 'total night charge',
       'total intl minutes', 'total intl calls', 'total intl charge',
       'customer service calls'],
      dtype='object')

Categorical columns: Index(['state', 'phone number', 'international plan', 'voice mail plan'], dtype='object')
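Note that churn appears in neither list. The most likely reason is that it is stored as a boolean (True/False), which select_dtypes() treats as neither 'number' nor 'object'. The sketch below shows one way to confirm this and to get a first look at the class balance. It runs on a small made-up frame (`toy_df`, hypothetical), not on the real data, so the printed proportions say nothing about SyriaTel.

```python
import pandas as pd

# hypothetical stand-in for churn_df, for illustration only
toy_df = pd.DataFrame({
    "account length": [128, 107, 137, 84, 75],
    "state": ["KS", "OH", "NJ", "OH", "OK"],
    "churn": [False, False, True, False, True],
})

# boolean columns are selected with include='bool'
bool_cols = toy_df.select_dtypes(include="bool").columns.tolist()
print(f"Boolean columns: {bool_cols}")

# share of each class in the target
print(toy_df["churn"].value_counts(normalize=True))

# the mean of a boolean column is the share of True values (the churn rate)
churn_rate = toy_df["churn"].mean()
print(f"Churn rate: {churn_rate:.2f}")
```

A churn rate far from 0.5 on the real data would indicate class imbalance, which is the problem Imblearn is meant to address.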



### 2.3. Feature Understanding

Below is a description of all the numerical and categorical features in the dataset.

Numerical Features:

- account length: The number of days the customer has been an account holder.
- area code: The area code associated with the customer's phone number.
- number vmail messages: The number of voicemail messages received by the customer.
- total day minutes: The total number of minutes used by the customer during the day.
- total day calls: The total number of calls made by the customer during the day.
- total day charge: The total charges incurred by the customer during the day.
- total eve minutes: The total number of minutes used by the customer in the evening.
- total eve calls: The total number of calls made by the customer in the evening.
- total eve charge: The total charges incurred by the customer in the evening.
- total night minutes: The total number of minutes used by the customer at night.
- total night calls: The total number of calls made by the customer at night.
- total night charge: The total charges incurred by the customer at night.
- total intl minutes: The total number of minutes used by the customer on international calls.
- total intl calls: The total number of international calls made by the customer.
- total intl charge: The total charges incurred by the customer on international calls.
- customer service calls: The number of calls the customer made to customer service.

Categorical Features:

- state: The customer's state of residence.
- phone number: The customer's mobile number.
- international plan: Indicates if the customer has subscribed to an international plan (yes/no).
- voice mail plan: Indicates if the customer has a voice mail plan (yes/no).

Now that we have a basic understanding of the data, we can proceed to data preparation.
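One detail worth keeping in mind from the feature list: area code is stored as an integer, but it acts as a label rather than a quantity, so statistics such as its mean carry no real meaning. The sketch below shows how the number of distinct codes could be checked and how the column could be recast as a categorical. Whether to recast it is a modeling choice and an assumption here, not a step taken in this notebook, and `toy_df` is a hypothetical stand-in for `churn_df`.

```python
import pandas as pd

# hypothetical stand-in for churn_df, for illustration only
toy_df = pd.DataFrame({"area code": [415, 415, 408, 510, 415]})

# a low number of distinct values suggests a label, not a measurement
n_codes = toy_df["area code"].nunique()
print(f"Distinct area codes: {n_codes}")

# optional: store the column as a categorical so it is not treated as numeric
toy_df["area code"] = toy_df["area code"].astype("category")
print(toy_df["area code"].dtype)
```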

## 3. Data Preparation
In this section, we cover data cleaning, Exploratory Data Analysis (EDA) and data preprocessing (data wrangling) for our dataset. Clean, well-prepared data contributes significantly to the performance of the prediction model.

### 3.1 Data Cleaning
In this section, we perform some data cleaning techniques on the dataset. These techniques include:

- Checking for null values and handling them.
- Checking for duplicate rows and dropping them.
- Standardizing the column names by replacing the spaces between words with underscores and capitalizing the first letter of each word.
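The three steps above can be sketched on a small made-up frame as follows. The frame (`toy_df`) and the choice to drop rows with null values are assumptions made for illustration; how nulls should be handled depends on what the real data contains.

```python
import pandas as pd

# hypothetical frame with one duplicate row and one null value
toy_df = pd.DataFrame({
    "account length": [128, 107, 107, None],
    "total day minutes": [265.1, 161.6, 161.6, 243.4],
})

# steps 1 and 2: drop rows with nulls (an assumption), then drop duplicate rows
clean_df = toy_df.dropna().drop_duplicates().reset_index(drop=True)

# step 3: capitalize each word and replace spaces with underscores
clean_df.columns = clean_df.columns.str.title().str.replace(" ", "_")

print(clean_df.columns.tolist())
print(clean_df.shape)
```

With this naming scheme, a column such as phone number becomes Phone_Number.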

In [47]:
churn_df.isnull().sum()

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

In [48]:
churn_df.duplicated().sum()

0

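Both checks come back clean: the dataset has no null values and no duplicate rows. Next, looking closely at the features, the phone number column acts as a customer identifier rather than a description of customer behavior, so it is unlikely to help predict churn. Therefore, we will drop the column.

In [49]:

# drop the phone number column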
churn_df = churn_df.drop('phone number', axis=1)

# check the remaining columns
churn_df.columns

Index(['state', 'account length', 'area code', 'international plan',
       'voice mail plan', 'number vmail messages', 'total day minutes',
       'total day calls', 'total day charge', 'total eve minutes',
       'total eve calls', 'total eve charge', 'total night minutes',
       'total night calls', 'total night charge', 'total intl minutes',
       'total intl calls', 'total intl charge', 'customer service calls',
       'churn'],
      dtype='object')