## PREDICTING CUSTOMER CHURN AT SYRIATEL

### INTRODUCTION

Syriatel is a telecommunications company providing mobile network services in Syria.  

### OBJECTIVES

1. 

### DATA UNDERSTANDING

We use the data in the file ```syriatel_customer_churn.csv``` in the ```data``` folder. It contains features that we analyze to predict whether a customer will keep using Syriatel services or will churn. These features, listed below, include the number and length of calls a customer made at different times of day and the plans they subscribed to. The target is the ```churn``` column, which tells us whether or not the customer churned.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/syriatel_customer_churn.csv")
df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [2]:
df.shape

(3333, 21)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

We see that the data has 3333 rows and 21 columns, and that no column contains null values.
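Missing values can also be counted directly rather than read off the ```df.info()``` listing. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions, not the Syriatel data); on the real data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed values) with one deliberately missing entry.
toy = pd.DataFrame({
    "total day minutes": [265.1, np.nan, 243.4],
    "churn": [False, False, True],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A single number for the whole frame; 0 would mean no nulls anywhere.
total_missing = int(missing_per_column.sum())
print(total_missing)
```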

### DATA PREPARATION

To avoid data leakage, we first split the data into train and test sets. We then preprocess the train set and use it to train our models. Once we have a final model, we'll apply the same transformations, fitted on the train set, to the test set.

In [4]:
from sklearn.model_selection import train_test_split

y = df["churn"]
X = df.drop(columns=["churn"])  # `columns=` already implies axis=1

seed = 2024
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)

Next we convert ```y_train``` to ```int``` values.

In [5]:
y_train.value_counts()

False    1986
True      347
Name: churn, dtype: int64

In [6]:
y_train_int = y_train.astype(int)

In [7]:
y_train_int.value_counts()

0    1986
1     347
Name: churn, dtype: int64
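The counts above show that the classes are imbalanced: 347 of the 2333 training customers churned. One option, not used in this notebook, is to pass `stratify` to `train_test_split` so that both subsets keep the same class ratio. A minimal sketch on made-up labels (`X_toy` and `y_toy` are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels (assumed, not the Syriatel data): 15% positives.
y_toy = pd.Series([True] * 150 + [False] * 850, name="churn")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

# stratify=y_toy asks the splitter to preserve the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2024, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train churn rate: {train_rate:.3f}, test churn rate: {test_rate:.3f}")

# Share of churners in the training split reported above (347 of 2333).
observed_rate = 347 / (1986 + 347)
print(f"observed training churn rate: {observed_rate:.3f}")
```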

Next, we separate the continuous and categorical variables in ```X_train```.

In [8]:
X_train_cat = ["international plan", "voice mail plan"]
X_train_cont = df.columns[4:-1].difference(X_train_cat)

We have chosen to drop the columns ```state```, ```account length```, ```area code```, and ```phone number```, as we presume they would not have a major impact on our target. The slice ```df.columns[4:-1]``` above excludes these first four columns, along with the ```churn``` target.

Transforming categorical variables.

In [9]:
X_train_cat_ohe = pd.get_dummies(X_train[X_train_cat], prefix=X_train_cat, drop_first=True)
X_train_cat_ohe.head()

Unnamed: 0,international plan_yes,voice mail plan_yes
2403,0,0
87,0,0
3180,1,0
3266,0,1
3055,0,0
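Because each plan column only takes the values `yes` and `no`, `drop_first=True` leaves a single indicator per plan; the dropped `_no` column would carry no extra information. The sketch below shows the difference on a made-up frame (`toy` and its values are assumptions). Depending on the pandas version, the indicators may print as `True`/`False` rather than `1`/`0`.

```python
import pandas as pd

# Toy frame (assumed values) with the same two yes/no columns.
toy = pd.DataFrame({
    "international plan": ["no", "yes", "no"],
    "voice mail plan": ["yes", "no", "no"],
})
cols = ["international plan", "voice mail plan"]

# Without drop_first: one column per category (_no and _yes).
full = pd.get_dummies(toy[cols], prefix=cols)

# With drop_first: the first category (_no) of each column is dropped.
reduced = pd.get_dummies(toy[cols], prefix=cols, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```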


Let's check for strong correlations in continuous variables.

In [10]:
X_train_cont_df = X_train.loc[:, X_train_cont]
X_train_cont_df.head()

Unnamed: 0,customer service calls,number vmail messages,total day calls,total day charge,total day minutes,total eve calls,total eve charge,total eve minutes,total intl calls,total intl charge,total intl minutes,total night calls,total night charge,total night minutes
2403,2,0,109,15.62,91.9,111,16.86,198.4,7,3.51,13.0,125,7.73,171.7
87,1,0,118,36.43,214.3,76,17.72,208.5,2,3.24,12.0,98,8.21,182.4
3180,1,0,115,25.81,151.8,116,8.81,103.6,4,3.29,12.2,86,7.03,156.3
3266,3,33,139,26.38,155.2,79,22.81,268.3,4,2.62,9.7,71,8.39,186.4
3055,2,0,80,29.82,175.4,127,16.78,197.4,2,2.62,9.7,102,8.47,188.2


In [11]:
abs(X_train_cont_df.corr()) > 0.75

Unnamed: 0,customer service calls,number vmail messages,total day calls,total day charge,total day minutes,total eve calls,total eve charge,total eve minutes,total intl calls,total intl charge,total intl minutes,total night calls,total night charge,total night minutes
customer service calls,True,False,False,False,False,False,False,False,False,False,False,False,False,False
number vmail messages,False,True,False,False,False,False,False,False,False,False,False,False,False,False
total day calls,False,False,True,False,False,False,False,False,False,False,False,False,False,False
total day charge,False,False,False,True,True,False,False,False,False,False,False,False,False,False
total day minutes,False,False,False,True,True,False,False,False,False,False,False,False,False,False
total eve calls,False,False,False,False,False,True,False,False,False,False,False,False,False,False
total eve charge,False,False,False,False,False,False,True,True,False,False,False,False,False,False
total eve minutes,False,False,False,False,False,False,True,True,False,False,False,False,False,False
total intl calls,False,False,False,False,False,False,False,False,True,False,False,False,False,False
total intl charge,False,False,False,False,False,False,False,False,False,True,True,False,False,False


In [12]:
df_corr = X_train_cont_df.corr().abs().stack().reset_index().sort_values(0, ascending=False)

df_corr['pairs'] = list(zip(df_corr.level_0, df_corr.level_1))

df_corr.set_index(['pairs'], inplace = True)

df_corr.drop(columns=['level_1', 'level_0'], inplace = True)

# cc for correlation coefficient
df_corr.columns = ['cc']

df_corr.drop_duplicates(inplace=True)

df_corr[(df_corr.cc>.75) & (df_corr.cc<1)]

pairs,cc
"(total day charge, total day minutes)",1.0
"(total eve minutes, total eve charge)",1.0
"(total night minutes, total night charge)",0.999999
"(total intl minutes, total intl charge)",0.999993


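We drop the ```charge``` features because each is almost perfectly correlated with its matching ```minutes``` feature, so keeping both would add redundant information. (The first two coefficients display as 1.0 only because of rounding; the filter keeps values strictly below 1.)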
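The near-perfect correlation suggests that each charge is simply the minutes multiplied by a fixed rate. A quick check on the five daytime rows printed by ```df.head()``` earlier supports this; the variable names below are chosen for illustration only.

```python
import numpy as np

# Day-time values copied from the five rows of df.head() shown earlier.
day_minutes = np.array([265.1, 161.6, 243.4, 299.4, 166.7])
day_charge = np.array([45.07, 27.47, 41.38, 50.9, 28.34])

# If charge is a fixed rate times minutes, this ratio is the same for every row.
rate_per_minute = day_charge / day_minutes
print(rate_per_minute.round(4))

# Pearson correlation between the two columns on these five rows.
corr = np.corrcoef(day_minutes, day_charge)[0, 1]
print(corr)
```

On these rows the ratio is roughly constant at about 0.17 per minute, which is why the two columns carry the same information.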

In [13]:
to_drop = ['total day charge', 'total eve charge', 'total night charge', 'total intl charge']
X_train_cont = df.columns[4:-1].difference(X_train_cat + to_drop)
X_train_cont

Index(['customer service calls', 'number vmail messages', 'total day calls',
       'total day minutes', 'total eve calls', 'total eve minutes',
       'total intl calls', 'total intl minutes', 'total night calls',
       'total night minutes'],
      dtype='object')

We then scale the continuous variables.

In [14]:
X_train_cont_df = X_train.loc[:, X_train_cont]
X_train_cont_df.head()

Unnamed: 0,customer service calls,number vmail messages,total day calls,total day minutes,total eve calls,total eve minutes,total intl calls,total intl minutes,total night calls,total night minutes
2403,2,0,109,91.9,111,198.4,7,13.0,125,171.7
87,1,0,118,214.3,76,208.5,2,12.0,98,182.4
3180,1,0,115,151.8,116,103.6,4,12.2,86,156.3
3266,3,33,139,155.2,79,268.3,4,9.7,71,186.4
3055,2,0,80,175.4,127,197.4,2,9.7,102,188.2


In [15]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train_cont_df)
X_train_cont_scaled = scaler.transform(X_train_cont_df)
X_train_cont_scaled

array([[ 0.34443674, -0.59004295,  0.43843642, ...,  0.9741809 ,
         1.25617566, -0.57860761],
       [-0.42379549, -0.59004295,  0.88816037, ...,  0.62048184,
        -0.11402575, -0.36640043],
       [-0.42379549, -0.59004295,  0.73825239, ...,  0.69122165,
        -0.72300415, -0.8840273 ],
       ...,
       [ 1.11266898, -0.59004295,  0.63831373, ..., -0.47598524,
         0.69794546, -0.14427702],
       [-0.42379549, -0.59004295,  0.83819104, ..., -0.08691628,
         1.05318286, -0.80271427],
       [ 0.34443674, -0.59004295,  1.08803769, ..., -0.47598524,
        -0.67225595, -0.61827251]])

In [16]:
X_train_cont_scaled_df = pd.DataFrame(
    X_train_cont_scaled, index=X_train_cont_df.index, columns=X_train_cont_df.columns)
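`StandardScaler` standardizes each column by subtracting its mean and dividing by its standard deviation, where the standard deviation is the population one (`ddof=0`). The sketch below checks this against a manual computation on a small made-up matrix (`X_toy` is an assumption, not the training data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix (assumed values): four rows, two features.
X_toy = np.array([
    [91.9, 2.0],
    [214.3, 1.0],
    [151.8, 1.0],
    [155.2, 3.0],
])

# Scaling with scikit-learn.
scaled = StandardScaler().fit_transform(X_toy)

# Same computation by hand; np.std uses ddof=0 by default.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
```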

Combining the preprocessed continuous and categorical data into one data frame.

In [17]:
X_train_preprocessed = pd.concat([X_train_cat_ohe, X_train_cont_scaled_df], axis=1)
X_train_preprocessed.head()

Unnamed: 0,international plan_yes,voice mail plan_yes,customer service calls,number vmail messages,total day calls,total day minutes,total eve calls,total eve minutes,total intl calls,total intl minutes,total night calls,total night minutes
2403,0,0,0.344437,-0.590043,0.438436,-1.559802,0.539219,-0.064116,1.008057,0.974181,1.256176,-0.578608
87,0,0,-0.423795,-0.590043,0.88816,0.636415,-1.220078,0.134515,-1.014124,0.620482,-0.114026,-0.3664
3180,1,0,-0.423795,-0.590043,0.738252,-0.485019,0.790547,-1.928493,-0.205252,0.691222,-0.723004,-0.884027
3266,0,1,1.112669,1.823838,1.937516,-0.424013,-1.069281,1.310567,-0.205252,-0.193026,-1.484227,-0.287071
3055,0,0,0.344437,-0.590043,-1.010674,-0.061566,1.343469,-0.083783,-1.014124,-0.193026,0.088967,-0.251372
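`pd.concat(..., axis=1)` matches rows by index label, which is why the scaled array was wrapped in a data frame that reuses the training index. The sketch below, on made-up values, shows what happens when the index is not carried over (all names here are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy encoded column (assumed values) with shuffled training-style index labels.
cat = pd.DataFrame({"international plan_yes": [0, 1]}, index=[2403, 87])
scaled_values = np.array([[0.34], [-0.42]])

# Reusing the original index keeps the rows aligned.
aligned = pd.DataFrame(scaled_values, index=cat.index, columns=["customer service calls"])

# Without it, the new frame gets the default index 0, 1.
misaligned = pd.DataFrame(scaled_values, columns=["customer service calls"])

good = pd.concat([cat, aligned], axis=1)
bad = pd.concat([cat, misaligned], axis=1)

print(good.shape, bad.shape)
print(bad)
```

In the misaligned case the result has extra rows filled with `NaN`, because no index labels are shared between the two frames.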


In [18]:
X_train_preprocessed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2333 entries, 2403 to 2656
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   international plan_yes  2333 non-null   uint8  
 1   voice mail plan_yes     2333 non-null   uint8  
 2   customer service calls  2333 non-null   float64
 3   number vmail messages   2333 non-null   float64
 4   total day calls         2333 non-null   float64
 5   total day minutes       2333 non-null   float64
 6   total eve calls         2333 non-null   float64
 7   total eve minutes       2333 non-null   float64
 8   total intl calls        2333 non-null   float64
 9   total intl minutes      2333 non-null   float64
 10  total night calls       2333 non-null   float64
 11  total night minutes     2333 non-null   float64
dtypes: float64(10), uint8(2)
memory usage: 205.0 KB
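When the test set is eventually transformed, the encoder categories and the scaler statistics should come from the train set only. The sketch below shows one way to do that on made-up frames; `train_toy`, `test_toy`, `categories`, and `preprocess` are hypothetical names, not part of this notebook. Fixing the categories up front guards against a test set in which one category never appears, a case where `pd.get_dummies` with `drop_first=True` would otherwise produce different columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

cat_cols = ["international plan"]
cont_cols = ["total day minutes", "customer service calls"]

# Toy train and test frames (assumed values, not the Syriatel data).
train_toy = pd.DataFrame({
    "international plan": ["no", "yes", "no", "no"],
    "total day minutes": [91.9, 214.3, 151.8, 155.2],
    "customer service calls": [2, 1, 1, 3],
})
test_toy = pd.DataFrame({
    "international plan": ["yes", "yes"],  # "no" never appears here
    "total day minutes": [175.4, 120.0],
    "customer service calls": [2, 0],
})

# Learn categories and scaling statistics from the training split only.
categories = {col: sorted(train_toy[col].unique()) for col in cat_cols}
scaler = StandardScaler().fit(train_toy[cont_cols])


def preprocess(frame):
    """Encode and scale `frame` using parameters learned from the train set."""
    # Fixed categories make get_dummies emit the same columns for any frame.
    fixed = pd.DataFrame(
        {col: pd.Categorical(frame[col], categories=categories[col]) for col in cat_cols},
        index=frame.index,
    )
    ohe = pd.get_dummies(fixed, prefix=cat_cols, drop_first=True)
    scaled = pd.DataFrame(
        scaler.transform(frame[cont_cols]), index=frame.index, columns=cont_cols
    )
    return pd.concat([ohe, scaled], axis=1)


train_pre = preprocess(train_toy)
test_pre = preprocess(test_toy)
print(test_pre)
```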
