# **SyriaTel Customer Churn Prediction**

## **Business Understanding**

### **Overview**

SyriaTel is a telecommunications company experiencing customer churn, where customers stop doing business with the company. Churn leads to significant revenue loss and reduced long-term profitability. Identifying customers who are likely to churn early gives the company time to intervene with retention strategies.

This project aims to build a model that predicts churn while minimizing the chance of missing a customer who will churn.

### **Objectives**

1. Develop a predictive model that classifies customers as likely to churn or not, allowing early churn prevention.
2. Identify the factors (features) that most strongly influence a customer to churn.  
3. Minimize the chance of missing a customer who will churn, i.e. false negatives, targeting a recall score of 81%.  
4. Create a highly accurate model, targeting an accuracy score of 94%.  

## **Data Understanding**

The dataset contains customer details, factors that may affect churn, and the churn label. It has the following columns:

*Customer identification and location*
* `state` -> state where the customer lives.  
* `area code` -> telephone area code of the customer's phone number.  
* `phone number` -> the customer's phone number.  

`account length` -> number of days the customer has been with the company.  

*Service plans*
* `international plan` -> whether the customer subscribes to an international calling plan.  
* `voice mail plan` -> whether the customer has an active voicemail service.  
* `number vmail messages` -> number of voicemail messages the customer currently has.  

*Daytime usage*, *evening usage*, *night usage* and *international usage*, each having (for example `total day minutes`):
* `total minutes` -> total minutes of calls in that period.  
* `total calls` -> total number of calls made in that period.  
* `total charge` -> total amount charged for the calls in that period.  

`customer service calls` -> number of times the customer contacted customer service.  

`churn` -> whether the customer has churned (`True`) or not (`False`).  

In [398]:
# Necessary imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import log_loss, recall_score, accuracy_score, precision_score, f1_score
from imblearn.over_sampling import SMOTE

In [399]:
# Loading the dataset
df = pd.read_csv('data/bigml_59c28831336c6604c800002a.csv')
df.head()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


In [400]:
# Displaying the column names
df.columns

Index(['state', 'account length', 'area code', 'phone number',
       'international plan', 'voice mail plan', 'number vmail messages',
       'total day minutes', 'total day calls', 'total day charge',
       'total eve minutes', 'total eve calls', 'total eve charge',
       'total night minutes', 'total night calls', 'total night charge',
       'total intl minutes', 'total intl calls', 'total intl charge',
       'customer service calls', 'churn'],
      dtype='object')

In [401]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

The data has 3333 entries and 21 columns.

### **Selecting Target and Features**

* `churn` is the target since it's what I'm predicting.  
* `phone number` is a unique identifier and `area code` carries only weak geographic information, so I drop both on the assumption that they don't help predict churn.  
* Each ***total charge*** column is derived from the matching ***total minutes*** column, so the two are almost perfectly correlated and carry the same information. Charge rates can change over time while minutes reflect actual usage, so I keep ***total minutes*** and drop ***total charge***.  
* The rest of the columns will be the features I'll use.  
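
The link between minutes and charge can be checked on the five rows printed by `df.head()` above. The sketch below copies those daytime values by hand (the variable names are my own) and computes the charge per minute. It only illustrates the redundancy on this small sample and is not a check of the full dataset.

```python
import numpy as np

# Daytime values copied from the first five rows of df.head() above
day_minutes = np.array([265.1, 161.6, 243.4, 299.4, 166.7])
day_charge = np.array([45.07, 27.47, 41.38, 50.90, 28.34])

# Charge per minute for each row
rate = day_charge / day_minutes
print(rate.round(3))

# Pearson correlation between the two columns on this sample
corr = np.corrcoef(day_minutes, day_charge)[0, 1]
print(round(corr, 4))
```

On these rows the rate comes out at roughly 0.17 per minute every time, which is why one of the two columns adds no new information.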

In [402]:
# Selecting target and features
y = df['churn']
cols_to_drop = ['churn', 'area code', 'phone number', 'total day charge',
                'total eve charge', 'total night charge', 'total intl charge']
X = df.drop(columns=cols_to_drop)
X.columns

Index(['state', 'account length', 'international plan', 'voice mail plan',
       'number vmail messages', 'total day minutes', 'total day calls',
       'total eve minutes', 'total eve calls', 'total night minutes',
       'total night calls', 'total intl minutes', 'total intl calls',
       'customer service calls'],
      dtype='object')

## **Data Preprocessing**

#### Arranging the data (categorical, numeric, train, test)

Noting the categorical features and the numerical features.

In [403]:
categorical_features = X.select_dtypes(include=['object']).columns
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features, numerical_features

(Index(['state', 'international plan', 'voice mail plan'], dtype='object'),
 Index(['account length', 'number vmail messages', 'total day minutes',
        'total day calls', 'total eve minutes', 'total eve calls',
        'total night minutes', 'total night calls', 'total intl minutes',
        'total intl calls', 'customer service calls'],
       dtype='object'))

In [404]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#### Duplicated and Missing Values

Checking for duplicated values in both training and test sets

In [405]:
X_train.duplicated().sum(), X_test.duplicated().sum()

(0, 0)

Checking for missing values in both training and test sets

In [406]:
X_train.isnull().sum(), X_test.isnull().sum()

(state                     0
 account length            0
 international plan        0
 voice mail plan           0
 number vmail messages     0
 total day minutes         0
 total day calls           0
 total eve minutes         0
 total eve calls           0
 total night minutes       0
 total night calls         0
 total intl minutes        0
 total intl calls          0
 customer service calls    0
 dtype: int64,
 state                     0
 account length            0
 international plan        0
 voice mail plan           0
 number vmail messages     0
 total day minutes         0
 total day calls           0
 total eve minutes         0
 total eve calls           0
 total night minutes       0
 total night calls         0
 total intl minutes        0
 total intl calls          0
 customer service calls    0
 dtype: int64)

There are no duplicated or missing values in either the training or the test set.

#### Preprocessing Data

In [407]:
# Handling numeric data, i.e. standardization (zero mean, unit variance)
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train[numerical_features]), columns=numerical_features, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test[numerical_features]), columns=numerical_features, index=X_test.index)

# Handling categorical data, i.e. one-hot encoding
# Note: on scikit-learn >= 1.2, use sparse_output=False and get_feature_names_out() instead
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
X_train_ohe = pd.DataFrame(ohe.fit_transform(X_train[categorical_features]), columns=ohe.get_feature_names(categorical_features), index=X_train.index)
X_test_ohe = pd.DataFrame(ohe.transform(X_test[categorical_features]), columns=ohe.get_feature_names(categorical_features), index=X_test.index)

# Combining handled categoric and numeric data
X_train_full = pd.concat([X_train_scaled, X_train_ohe], axis=1)
X_test_full = pd.concat([X_test_scaled, X_test_ohe], axis=1)
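
The same scale-then-encode steps can also be bundled into a single object. The sketch below is an alternative layout, not what this notebook runs: it uses scikit-learn's `ColumnTransformer` on a tiny made-up frame that borrows three of the column names above, and the names `toy_train`, `toy_test` and `preprocessor` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frames using a few of the notebook's column names
toy_train = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4],
    "customer service calls": [1, 1, 0, 2],
    "international plan": ["no", "no", "no", "yes"],
})
toy_test = pd.DataFrame({
    "total day minutes": [166.7],
    "customer service calls": [3],
    "international plan": ["yes"],
})

num_cols = ["total day minutes", "customer service calls"]
cat_cols = ["international plan"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on training rows only, then reuse the fitted statistics on test rows
train_arr = preprocessor.fit_transform(toy_train)
test_arr = preprocessor.transform(toy_test)
print(train_arr.shape, test_arr.shape)
```

Fitting on the training rows and only transforming the test rows mirrors the `fit_transform` / `transform` split used in the cell above, which keeps test-set statistics out of the scaler and encoder.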

In [408]:
y_train.value_counts()

False    1993
True      340
Name: churn, dtype: int64

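From the `value_counts()` output, the target (`y`) is imbalanced, with `False` (no churn) as the majority class: 1993 customers against 340. I address this with SMOTE, which oversamples the minority class by creating synthetic examples. It is applied to the training set only, so the test set keeps its original class distribution.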

In [409]:
#SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_full, y_train)
y_train_resampled.value_counts()

True     1993
False    1993
Name: churn, dtype: int64

Now the data is balanced in terms of class.
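
To give an intuition for where the extra `True` rows come from, here is a simplified sketch of the interpolation idea behind SMOTE. It is not imblearn's implementation: it uses only the single nearest neighbour (the library picks among several), and the data and the `synthesize` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in two dimensions
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])

def synthesize(points, n_new, rng):
    """Create n_new points by interpolating towards the nearest minority neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from the chosen point to every minority point
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf  # exclude the point itself
        j = int(np.argmin(dists))  # nearest neighbour (k = 1 for simplicity)
        u = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + u * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=3, rng=rng)
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority points, so the new rows stay inside the region the minority class already occupies.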

## **Modeling and Evaluation**

### **Objective 1 : Building the Classifier**

#### **Logistic Regression Model**

##### **Baseline Logistic Regression Model**

In [410]:
# Training the model - Baseline Logistic Regression
baseline_logreg = LogisticRegression(random_state=42)
baseline_logreg.fit(X_train_resampled, y_train_resampled)

LogisticRegression(random_state=42)

Comparing the model's log loss on the training and test sets. This also helps check for overfitting and underfitting.

In [411]:
# probabilities for positive class
y_train_proba = baseline_logreg.predict_proba(X_train_resampled)[:, 1]
y_test_proba = baseline_logreg.predict_proba(X_test_full)[:, 1]

train_logloss = log_loss(y_train_resampled, y_train_proba)
test_logloss = log_loss(y_test, y_test_proba)

print("Train Log Loss:", train_logloss)
print("Test Log Loss:", test_logloss)

Train Log Loss: 0.4809064990037247
Test Log Loss: 0.5192234427336707


The training log loss is about `0.48` and the test log loss about `0.52`. Both are below `ln 2 ≈ 0.693`, the log loss of a model that always predicts a probability of 0.5, so the model is picking up useful signal, though there is room to improve. The gap between the two is `0.0383`, which is small, so there is no sign of overfitting.  
For this reason I left `C` (the inverse of the regularization strength) at its default value when tuning this model.
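
For readers who want to see what the metric measures, the sketch below computes binary log loss by hand on four made-up predictions and compares it with the `ln 2` reference point. The helper `binary_log_loss` and the toy values are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def binary_log_loss(y_true, p_pred):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for predictions of exactly 0 or 1
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred)))

# Made-up labels and predicted churn probabilities
y_toy = [1, 0, 0, 1]
p_toy = [0.9, 0.2, 0.4, 0.6]

print(binary_log_loss(y_toy, p_toy))

# Reference point: always predicting 0.5 gives ln(2), about 0.693
print(binary_log_loss(y_toy, [0.5] * 4), np.log(2))
```

Confident correct predictions push the loss towards 0, while confident wrong ones are penalized heavily, which is why the metric is sensitive to poorly calibrated probabilities.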

Predicting on the test set and evaluating with recall and accuracy scores.

In [412]:
y_logreg_test_pred = baseline_logreg.predict(X_test_full)
print("Recall Score on test set:", recall_score(y_test, y_logreg_test_pred))
print("Accuracy Score on test set:", accuracy_score(y_test, y_logreg_test_pred))

Recall Score on test set: 0.7622377622377622
Accuracy Score on test set: 0.765


##### **Tuned Logistic Regression Model**

Tuning the logistic regression model above to balance L1 and L2 regularization, using the `saga` solver, the `elasticnet` penalty and an `l1_ratio` of 0.5.

In [413]:
# Training the model - Tuned Logistic Regression
tuned_logreg = LogisticRegression(random_state=42, solver='saga',penalty='elasticnet',l1_ratio=0.5)
tuned_logreg.fit(X_train_resampled, y_train_resampled)

LogisticRegression(l1_ratio=0.5, penalty='elasticnet', random_state=42,
                   solver='saga')
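
For reference, the elastic-net penalty that `l1_ratio` controls can be written as follows, where $w$ is the coefficient vector and $\rho$ stands for `l1_ratio`. The notation follows the scikit-learn user guide as I recall it, so treat the symbols as an assumption.

$$
r(w) = \frac{1 - \rho}{2}\, w^{T} w + \rho\, \lVert w \rVert_{1}
$$

With $\rho = 0.5$ this becomes $r(w) = 0.25\, w^{T} w + 0.5\, \lVert w \rVert_{1}$. Setting $\rho = 0$ recovers a pure L2 penalty and $\rho = 1$ a pure L1 penalty.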

Checking the logloss for the model between the training and testing sets to ensure there's still no overfitting and underfitting after tuning.

In [414]:
# probabilities for positive class
y_train_proba2 = tuned_logreg.predict_proba(X_train_resampled)[:, 1]
y_test_proba2 = tuned_logreg.predict_proba(X_test_full)[:, 1]

train_logloss2 = log_loss(y_train_resampled, y_train_proba2)
test_logloss2 = log_loss(y_test, y_test_proba2)

print("Train Log Loss:", train_logloss2)
print("Test Log Loss:", test_logloss2)

Train Log Loss: 0.48097515204614544
Test Log Loss: 0.5186809141749676


In [415]:
y_logreg2_test_pred = tuned_logreg.predict(X_test_full)
print("Recall Score on test set:", recall_score(y_test, y_logreg2_test_pred))
print("Accuracy Score on test set:", accuracy_score(y_test, y_logreg2_test_pred))

Recall Score on test set: 0.7692307692307693
Accuracy Score on test set: 0.768


#### **Decision Tree Classifier**

##### **Baseline Decision Tree Classifier**

In [416]:
# Training the model - Baseline Decision Tree Classifier
baseline_clf = DecisionTreeClassifier(random_state=42)
baseline_clf.fit(X_train_resampled, y_train_resampled)

DecisionTreeClassifier(random_state=42)

In [417]:
y_tree_test_pred = baseline_clf.predict(X_test_full)
print("Recall Score on test set:", recall_score(y_test, y_tree_test_pred))
print("Accuracy Score on test set:", accuracy_score(y_test, y_tree_test_pred))

Recall Score on test set: 0.7692307692307693
Accuracy Score on test set: 0.914


##### **Tuned Decision Tree Classifier**

Tuning the decision tree classifier using the best hyperparameters found after a grid search cross-validation on the training data. 

In [418]:
tuned_clf = DecisionTreeClassifier(random_state=42)
param_grid = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],}
grid_search = GridSearchCV(estimator=tuned_clf, param_grid=param_grid)
grid_search.fit(X_train_resampled, y_train_resampled)

GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
             param_grid={'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 2, 4],
                         'min_samples_split': [2, 5, 10]})

In [419]:
grid_search.best_params_

{'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 2}

In [420]:
best_tree = grid_search.best_estimator_
y_tree2_test_pred = best_tree.predict(X_test_full)
print("Recall Score on test set:", recall_score(y_test, y_tree2_test_pred))
print("Accuracy Score on test set:", accuracy_score(y_test, y_tree2_test_pred))

Recall Score on test set: 0.8111888111888111
Accuracy Score on test set: 0.946
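
The grid search above relies on `GridSearchCV`'s defaults, which for a classifier means 5-fold cross-validation ranked by accuracy. Since the project's priority is recall, one possible variation is to rank candidates by recall and to pass an explicit `StratifiedKFold`. The sketch below shows that variation on synthetic data from `make_classification`; the data and the names `X_toy`, `y_toy` and `search` are stand-ins, and the results it prints say nothing about the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the churn training set
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=42
)

param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 2, 4],
}

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    scoring="recall",  # rank candidates by recall on the positive class
    cv=cv,
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

Ranking by recall alone can favour models that over-predict churn, so in practice it is worth checking precision or F1 on the chosen candidate as well.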


> From the model training and testing above, the tuned decision tree classifier is the best model so far on both recall and accuracy.

### **Objective 2 : Features that Highly Influence Churn**

In [421]:
# The best model is the tuned decision tree classifier
best_model = best_tree

Checking which features most strongly influence a customer's churn, i.e. feature importance.
> Higher feature importance = stronger influence on churn

In [422]:
feature_importance = pd.DataFrame({
    "Feature": X_train_full.columns,
    "Importance": best_model.feature_importances_
}).sort_values(by="Importance", ascending=False)

feature_importance.head(10)

| | Feature | Importance |
|---|---|---|
| 2 | total day minutes | 0.261601 |
| 10 | customer service calls | 0.24975 |
| 63 | international plan_yes | 0.191317 |
| 4 | total eve minutes | 0.097848 |
| 64 | voice mail plan_no | 0.048033 |
| 37 | state_MT | 0.031674 |
| 8 | total intl minutes | 0.031629 |
| 18 | state_DC | 0.021449 |
| 9 | total intl calls | 0.01886 |
| 6 | total night minutes | 0.013657 |


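From the above results, the most influential features are total day minutes, customer service calls and having an international plan, followed by total evening minutes. Feature importance measures how much a feature contributes to the tree's splits rather than the direction of its effect, but the ranking suggests that a customer with a high number of customer service calls, high call minutes (daytime minutes in particular) and an international plan is more likely to churn than others.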
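
Feature importances do not show which side of a split leads to churn. One way to read the direction is `export_text`, which is already imported at the top of the notebook. The sketch below applies it to a one-split tree trained on made-up data, where the label is `True` whenever a single feature exceeds 3; the data and the name `toy_tree` are hypothetical, and the same call could be tried on the fitted model with the real column names.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Made-up data: the label is True when the single feature exceeds 3
calls = rng.integers(0, 8, size=200).reshape(-1, 1)
churned = calls.ravel() > 3

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=42)
toy_tree.fit(calls, churned)

# Text rendering of the fitted splits, with the feature name supplied
rules = export_text(toy_tree, feature_names=["customer service calls"])
print(rules)
```

The printed rules list each threshold together with the class predicted on either side, which makes it possible to see whether high or low values of a feature lead towards the churn class.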

### **Objective 3 : Best Recall Score i.e Minimize False Negatives(Missing a churning customer)**

Evaluating all the models using accuracy, recall, precision and F1 scores, and collecting the results in a dataframe.

In [423]:
def evaluate_model(model_name, y_true, y_pred):
    return {
        "Model": model_name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "F1 Score": f1_score(y_true, y_pred)
    }

In [424]:
results = []
results.append(evaluate_model("Baseline Logistic Regression", y_test, y_logreg_test_pred))
results.append(evaluate_model("Tuned Logistic Regression", y_test, y_logreg2_test_pred))
results.append(evaluate_model("Baseline Decision Tree", y_test, y_tree_test_pred))
results.append(evaluate_model("Tuned Decision Tree", y_test, y_tree2_test_pred))

In [425]:
results_df = pd.DataFrame(results)
results_df

| | Model | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|---|
| 0 | Baseline Logistic Regression | 0.765 | 0.762238 | 0.351613 | 0.481236 |
| 1 | Tuned Logistic Regression | 0.768 | 0.769231 | 0.355987 | 0.486726 |
| 2 | Baseline Decision Tree | 0.914 | 0.769231 | 0.674847 | 0.718954 |
| 3 | Tuned Decision Tree | 0.946 | 0.811189 | 0.811189 | 0.811189 |


From the table above, recall never decreases from one model to the next (the tuned logistic regression and the baseline decision tree tie at about 77%), and the model with the fewest false negatives is the tuned decision tree, with a recall of 81%. 
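
The scores in the table can be tied back to counts. The test set has 1,000 rows (30% of 3,333), and the recall values are consistent with 143 churners among them, for example 116/143 ≈ 0.811189. The sketch below uses confusion-matrix counts for the tuned decision tree that reproduce its row of the table. These counts are inferred from the reported metrics and are an assumption, not something read from the notebook's output.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output)
tp, fn, fp, tn = 116, 27, 27, 830

recall = tp / (tp + fn)                     # share of churners that were caught
precision = tp / (tp + fp)                  # share of churn flags that were right
accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all predictions that were right
f1 = 2 * precision * recall / (precision + recall)

print(round(recall, 6), round(precision, 6), accuracy, round(f1, 6))
```

Under these counts, a recall of 81% corresponds to 27 churners out of 143 being missed, which is the quantity this objective tries to minimize.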

### **Objective 4 : Best Accuracy Score i.e Highly Accurate Model**

In [426]:
results_df

| | Model | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|---|
| 0 | Baseline Logistic Regression | 0.765 | 0.762238 | 0.351613 | 0.481236 |
| 1 | Tuned Logistic Regression | 0.768 | 0.769231 | 0.355987 | 0.486726 |
| 2 | Baseline Decision Tree | 0.914 | 0.769231 | 0.674847 | 0.718954 |
| 3 | Tuned Decision Tree | 0.946 | 0.811189 | 0.811189 | 0.811189 |


From the table above, accuracy increases with each successive model, and the most accurate model is the tuned decision tree, with an accuracy of 94.6%. 

## **Recommendations**

1. A classifier that predicts churn well was created and can be used for early identification of churn cases. I recommend deploying this model in real-time systems so that likely churners are flagged immediately.
2. The features that most strongly influence churn include customer service calls, call minutes (daytime minutes in particular) and the international plan. I recommend focusing prevention strategies on these high-impact factors. 
3. Since high recall reduces false negatives, i.e. the chance of missing a churning customer, I recommend prioritizing recall as a key performance metric when tuning future models.

## **Conclusion**

This project successfully addressed the business problem of customer churn at SyriaTel by developing an accurate predictive model. Through model building and evaluation, the study identified key churn drivers and achieved strong performance, particularly in recall and accuracy.  
These insights give SyriaTel the ability to identify likely churners early and to apply prevention strategies before they leave. This can reduce revenue loss and support the business's growth in the telecommunications industry.