  PREDICTION OF CUSTOMER CHURN

# Project Overview

1. Business problem

Customer churn is one of the major problems facing many companies because of its direct impact on revenue. Telecommunications is one of the fields where customer churn is rampant due to ever-increasing competition. SyriaTel is one of the companies facing this challenge and is seeking to identify predictive patterns and develop a robust classifier to forecast whether a customer is likely to churn in the near future. Our main goal is to develop a prediction model that will help SyriaTel identify customers who are likely to churn, so that it can take the necessary measures to reduce churn.

2. Data loading and Data Exploration
We shall use the SyriaTel dataset to address the stated business problem.

# Data Understanding

In [2]:
# Importing relevant libraries
import multiprocessing  # for reducing the runtime of grid search
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             ConfusionMatrixDisplay, f1_score, make_scorer, precision_score,
                             recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Ignore warnings
warnings.filterwarnings("ignore")



In [None]:

#Reading from a CSV File
df=pd.read_csv("syriatel.csv")

In [None]:
# Dataset preview of the first 5 rows
df.head(5)

In [None]:
# The shape of the data
df.shape

In [None]:
#checking missing values
df.isnull().sum()

In [None]:
#checking datatypes
df.info()

# Data Preparation

Below are the areas explored to prepare data for modeling.
   1. Handling missing values
   2. Duplicated rows
   3. Identification and removal of Features that won't impact the analysis.
   

In [None]:
#checking for duplicates 
df.duplicated().sum()

In [None]:
# check missing values
df.isnull().sum()

Dropping irrelevant Features.
Columns dropped :
   1. State
   2. Account length
   3. Phone number


Reasons For Dropping
1. Account length - It is not applicable in this context since it doesn't give any information about the customer's behaviour. In most cases these are random or sequential numbers assigned to new customers.
2. State -


In [None]:
#dropping irrelevant features
irrelevant_features = ['account length','state','phone number']
df.drop(columns = irrelevant_features, inplace =  True )

In [None]:
#Separating categorical and continuous Features
# Categorical columns
categorical_columns = ['international plan', 'voice mail plan','churn']

# Continuous columns
continuous_columns = ['number vmail messages', 'total day minutes', 'total day calls',
                'total day charge', 'total eve minutes', 'total eve calls',
                'total eve charge', 'total night minutes', 'total night calls',
                'total night charge', 'total intl minutes', 'total intl calls',
                'total intl charge', 'customer service calls']

The analysis will have 'churn' as the target/dependent variable.
Churn has two classes:
1. False : The customer has not terminated their association with the telco.
2. True  : The customer has terminated their association with the telco.


# Distribution of Data

In [None]:
#Dependent variable
df.churn.value_counts(normalize = True)

In [None]:
# To get a pie chart to analyze 'Churn' 
df['churn'].value_counts().plot.pie(explode=[0.1,0.1], autopct='%1.1f%%', startangle=90, shadow=True, figsize=(8,8))
plt.title('Pie Chart for Churn')
plt.show()

The pie chart shows an imbalanced class distribution.
Only 14.5% of customers churned; this imbalance can bias a model toward the majority class.

Correlation Heatmap for Numeric Features

In [None]:

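# Restrict to numeric columns; the yes/no plan columns are still strings at this point
correlation_matrix = df.corr(numeric_only=True)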
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

Most of the features show little or no correlation with one another, but four pairs are perfectly positively correlated:
total day charge and total day minutes
total eve charge and total eve minutes
total night charge and total night minutes
total intl charge and total intl minutes

This perfect correlation is logical, as each charge is computed directly from the minutes used.
A correlation coefficient of 1 signifies perfect multicollinearity, which mainly affects linear models such as logistic regression (unstable coefficient estimates); tree-based models are far less sensitive to it.
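The pairs read off the heatmap can also be listed programmatically. The following is a minimal sketch on a toy frame, not the SyriaTel data: the helper name `highly_correlated_pairs`, the 0.99 threshold, and the flat rate of 0.17 per minute are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): charge is an exact multiple of minutes,
# calls is unrelated noise.
rng = np.random.default_rng(0)
minutes = rng.uniform(0, 350, size=200)
toy = pd.DataFrame({
    "total day minutes": minutes,
    "total day charge": minutes * 0.17,  # assumed flat per-minute rate
    "total day calls": rng.integers(50, 150, size=200),
})


def highly_correlated_pairs(frame, threshold=0.99):
    """Return feature pairs whose absolute correlation exceeds `threshold`."""
    corr = frame.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so each pair is reported once
    # and the diagonal (self-correlation) is ignored.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(upper_mask).stack()
    return stacked[stacked > threshold]


pairs = highly_correlated_pairs(toy)
print(pairs)
```

For a linear model, one column from each reported pair could be dropped before fitting; whether that helps on the real data would need to be checked.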

# Data Pre-processing

# Encoding Categorical Columns

In [None]:
#transforming categorical data using OHE

# Select the categorical columns to be one-hot encoded
categorical_columns = ['international plan', 'voice mail plan']

# Create an instance of the OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the categorical columns
encoded_categorical = encoder.fit_transform(df[categorical_columns])

# Convert the encoded data to a DataFrame
encoded_df = pd.DataFrame(encoded_categorical.toarray(), columns=encoder.get_feature_names_out(categorical_columns))

# Concatenate the encoded DataFrame with the remaining columns from the original DataFrame
df = pd.concat([df.drop(categorical_columns, axis=1), encoded_df], axis=1)

df

In [None]:
# Transform our dependent variable - 'churn' using LabelEncoder
# Define the function to encode the column
def encode(column):
    Lencod = LabelEncoder()
    df[column] = Lencod.fit_transform(df[column])

# Call the function to encode the 'churn' column
encode('churn')

In [None]:
# Check the value counts for the encoded churn column
print(df['churn'].value_counts())

# Data Split into Train and Test

In [None]:
# Split the data into (X) and (y)
y = df['churn']
X = df.drop('churn',axis = 1)

#Train-test Split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline Logistic Regression Model Before Data Transformation

In [None]:
#Instantiating regression model
logRegBeforeTansf = LogisticRegression(solver='liblinear', random_state=42)
# Fit the logistic regression model to the training data
logRegBeforeTansf.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_predBeforeTansf = logRegBeforeTansf.predict(X_test)

# Scores before Data Standardization and Removal of Imbalance

# Function for Checking Metrics  

In [None]:
#creating a function for checking for metrics 
def get_model_metrics(model, X_train, y_train, X_test, y_test):
    # Train the model
    model.fit(X_train, y_train)

    # Predict on the training and testing data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculate evaluation metrics
    # Note: ROC-AUC is computed here from hard class predictions, not probabilities
    roc_auc_train = roc_auc_score(y_train, y_train_pred)
    roc_auc_test = roc_auc_score(y_test, y_test_pred)
    # The confusion matrix (and its display) is built from the test data
    cm_test = confusion_matrix(y_test, y_test_pred)
    cm_display_test = ConfusionMatrixDisplay(confusion_matrix=cm_test).plot()
    accuracy_train = accuracy_score(y_train, y_train_pred)
    accuracy_test = accuracy_score(y_test, y_test_pred)

    # Return results
    results = {
        'ROC-AUC For Train Data': roc_auc_train,
        'ROC-AUC For Test Data': roc_auc_test,
        'Accuracy Train': accuracy_train,
        'Accuracy Test': accuracy_test,
        'confusion_matrix_test': cm_display_test
    }
    return results

In [None]:
get_model_metrics(logRegBeforeTansf,X_train,y_train,X_test,y_test)


The baseline logistic regression shows a gap in discrimination between the training and testing datasets. The ROC AUC is about 0.592 on the training data and 0.554 on the testing data, so the model separates the classes slightly better on the data it was trained on, and only marginally better than chance (0.5) in either case.

The confusion matrix for the test set shows 13 true positives, 88 false negatives, 554 true negatives, and 12 false positives.

In summary, the model attains a training accuracy of around 89.4% and a testing accuracy of approximately 85%. However, the confusion matrix shows that it identifies only 13 of the 101 actual churners, so the high accuracy largely reflects the majority class, and the lower test scores suggest some degree of overfitting.
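To put the 85% test accuracy in context, it helps to compare it with a trivial classifier that predicts "no churn" for every customer. The sketch below uses only the four confusion-matrix counts quoted above; the variable names are illustrative.

```python
# Confusion-matrix counts reported for the baseline model on the test set
tp, fn, tn, fp = 13, 88, 554, 12

actual_churners = tp + fn       # 101 customers who churned
actual_non_churners = tn + fp   # 566 customers who stayed
total = actual_churners + actual_non_churners

# Accuracy of the fitted baseline model
model_accuracy = (tp + tn) / total

# Accuracy of a classifier that always predicts the majority class (no churn)
majority_accuracy = actual_non_churners / total

# Share of actual churners that the model catches
recall_churn = tp / actual_churners

print(f"model accuracy:    {model_accuracy:.3f}")
print(f"majority baseline: {majority_accuracy:.3f}")
print(f"recall (churn):    {recall_churn:.3f}")
```

The two accuracies are almost identical, which is why recall and ROC AUC are more informative than accuracy for this imbalanced problem.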

# Classification Report

In [None]:
print(classification_report(y_test, y_predBeforeTansf))

# Data Standardization

StandardScaler() will be used to transform each feature to a mean of 0 and a standard deviation of 1.
This helps the selected features contribute equally to the analysis, regardless of their original units.

In [None]:
#instantiating standard scaler
scaler = StandardScaler()
# Fit the scaler on the training data only
scaler.fit(X_train)

# Transform both the train and test data
X_train_Scaled = scaler.transform(X_train)
X_test_Scaled = scaler.transform(X_test)

# Removal of Imbalance

SMOTE will be used to address the underrepresentation of the minority class.
It generates synthetic samples so that the minority class is better represented in the training data.
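The idea behind SMOTE is that each synthetic sample is placed on the line segment between a minority-class point and one of its nearest minority-class neighbours. The following NumPy-only sketch illustrates that interpolation step; it is not the imbalanced-learn implementation, and the function name `smote_like_oversample` and its parameters are assumptions made for illustration.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=5, seed=42):
    """Create `n_new` synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (illustrative only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a point is never its own neighbour
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base point, one of its neighbours, and a random position between them
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Hypothetical minority-class sample: 20 points with 3 features
X_minority = np.random.default_rng(0).normal(size=(20, 3))
synthetic = smote_like_oversample(X_minority, n_new=30)
print(synthetic.shape)
```

Because every synthetic point is an interpolation of two real minority points, it always stays within the range already covered by the minority class.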

In [None]:
# Creating an instance of SMOTE
smote = SMOTE(random_state=42)

# Perform SMOTE on the training data only (the test set is left untouched).
# Note: the unscaled X_train is resampled here; X_train_Scaled from the previous cell is not used.
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [None]:
y_train_resampled.value_counts()

# Baseline Logistic Regression Model After Data Transformation

In [None]:
#Instantiating regression model
logRegres = LogisticRegression(solver='liblinear', random_state=42)
# Fit the logistic regression model to the training data
logRegres.fit(X_train_resampled, y_train_resampled)

# Predicting with the new Data

In [None]:
# Train data prediction
y_pred_train = logRegres.predict(X_train_resampled)
# Test data prediction
y_pred_test = logRegres.predict(X_test)

# Scores After Data Standardization and Removal of Imbalance

# Metrics

In [None]:
get_model_metrics(logRegres,X_train_resampled,y_train_resampled,X_test,y_test)

Following transformation, the baseline logistic regression has mitigated the gap in discrimination levels between the training and testing datasets. The ROC AUC value is 0.7672942206654991 for the training data and 0.7442448308435083 for the testing data.

The confusion matrix displays the logistic regression model's predicted and true labels, indicating 72 true positives, 29 false negatives, 439 true negatives, and 127 false positives. This represents an enhancement over the previous model.

# Classification Report

In [None]:
#Train Data Accuracy
print(classification_report(y_train_resampled, y_pred_train))

In [None]:
#Test Data Accuracy
print(classification_report(y_test, y_pred_test))

Precision: The proportion of true positives among all cases predicted as positive (true positives + false positives). For class 1, the precision is 0.36: of all the customers predicted to churn, only 36% actually churned.

Recall: The proportion of true positives among all actual positive cases (true positives + false negatives). For class 1, the recall is 0.71: of all the customers who actually churned, 71% were correctly identified by the model, a relatively high recall rate.

F1-score: The harmonic mean of precision and recall, giving a single score that balances both. For class 1, the F1-score is 0.48, which indicates a moderate balance between precision and recall.

Support: The number of actual occurrences of each class in the test dataset. There are 566 instances of class 0 and 101 instances of class 1.

Accuracy: The proportion of correctly classified instances out of all instances. The accuracy is 0.77, meaning the model correctly classified 77% of the instances.
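The class 1 figures in the report can be reproduced by hand from the confusion matrix quoted earlier for this model (72 true positives, 29 false negatives, 439 true negatives, 127 false positives). A minimal worked example, with illustrative variable names:

```python
# Confusion-matrix counts for the model trained on SMOTE-resampled data (test set)
tp, fn, tn, fp = 72, 29, 439, 127

precision = tp / (tp + fp)                          # predicted churners that really churned
recall = tp / (tp + fn)                             # actual churners that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + tn + fp)

print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
print(f"accuracy:  {accuracy:.2f}")
```

The rounded values agree with the 0.36, 0.71, 0.48 and 0.77 discussed above.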

# Cross Validation (Performance Improvement)

In [None]:
# Create an instance of Logistic Regression with cross-validation
logreg_cv = LogisticRegressionCV(Cs=10, cv=5, solver='liblinear')

# Fit the model on the resampled training data
logreg_cv.fit(X_train_resampled, y_train_resampled)

# Predict on the resampled training and testing data
y_train_pred_cv = logreg_cv.predict(X_train_resampled)
y_test_pred_cv = logreg_cv.predict(X_test)


# Scores After Cross Validation

# Metrics

In [None]:
get_model_metrics(logreg_cv,X_train_resampled,y_train_resampled,X_test,y_test)

In this case, the ROC AUC values are 0.766 for the training data and 0.745 for the testing data. Both values suggest that the model has improved reasonably well in distinguishing between the classes, with slightly higher discrimination observed in the training data compared to the testing data.

The confusion matrix shows 72 true positives, 29 false negatives, 440 true negatives, and 126 false positives (the test set contains 566 class 0 instances). This is a slight improvement over the previous model.

In [None]:
print(classification_report(y_test, y_test_pred_cv, target_names=['0', '1']))

# Comparison Between Cross Validation Model (logreg_cv) and Non-cross Validation Model (logRegres)

In [None]:
# Cross Validation Model (logreg_cv)     = Set1 Results
# Non-cross Validation Model (logRegres) = Set2 Results

Comparison:

Precision & Recall: The first set (logreg_cv) of results shows consistent precision and recall values for both classes (0.77 for both), while the second set shows a significant difference between precision and recall for class 1 (0.36 precision, 0.71 recall).

F1-score: The F1-score in the first set(logreg_cv) is consistent across both classes (0.77 for both), while in the second set, there's a notable difference between the F1-score for class 0 (0.85) and class 1 (0.48).

Conclusion
The first (logreg_cv) set of results shows more balanced performance across classes, with consistent precision, recall, and F1-score, while the second set indicates imbalanced performance, particularly for class 1, with much lower precision and F1-score compared to class 0.






In [None]:
#drawing ROC curve for the above three models 

# Compute ROC curves and AUC scores for each model
models = [logRegBeforeTansf, logRegres, logreg_cv]
labels = ['Model Before Data Transf', 'Model After Data Transf', 'CV Model']

plt.figure(figsize=(8, 6))

for model, label in zip(models, labels):
    if hasattr(model, "predict_proba"):
        y_probs = model.predict_proba(X_test)[:, 1]
    else:
        y_probs = model.predict(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_probs)
    auc_score = roc_auc_score(y_test, y_probs)

    plt.plot(fpr, tpr, label='{} (AUC = {:.2f})'.format(label, auc_score))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend()
plt.show()
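The ROC curves above are built from predicted probabilities, whereas `get_model_metrics` passes hard class labels to `roc_auc_score`. The two approaches generally give different numbers: with hard labels the curve has a single operating point, and the area reduces to the average of the true positive rate and the true negative rate (the balanced accuracy). The sketch below illustrates this on synthetic data generated with `make_classification`; the dataset and its parameters are assumptions and have nothing to do with the SyriaTel file.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 15% positives), for illustration only
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: a single threshold, as in get_model_metrics
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# AUC from predicted probabilities: all thresholds, as in the ROC plot
auc_from_probs = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# With hard labels, the AUC equals the balanced accuracy
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

print(f"AUC from labels:        {auc_from_labels:.3f}")
print(f"AUC from probabilities: {auc_from_probs:.3f}")
print(f"balanced accuracy:      {bal_acc:.3f}")
```

This is why the AUC values shown in the plot legend need not match the ROC-AUC values returned by `get_model_metrics` for the same model.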

# Random Forest Classifier

### Model Creation, Training, and Prediction

In [None]:
# Object creation, fitting the data & getting predictions
from sklearn.ensemble import RandomForestClassifier
rf_model_final = RandomForestClassifier() 
rf_model_final.fit(X_train_resampled,y_train_resampled) 
y_pred_rf = rf_model_final.predict(X_test)

# Feature Importance Visualization

In [None]:
Importance =pd.DataFrame({"Importance": rf_model_final.feature_importances_*100},index = X_train_resampled.columns)
Importance.sort_values(by = "Importance", axis = 0, ascending = True).tail(15).plot(kind ="barh", color = "r",figsize=(9, 5))
plt.title("Feature Importance Levels");
plt.show()

# Classification Report

In [None]:
print(classification_report(y_test, y_pred_rf, target_names=['0', '1']))

# Model Performance Summary Before Cross Validation

In [None]:
print("**************** RANDOM FOREST MODEL RESULTS **************** ")
print('Accuracy score for testing set: ',round(accuracy_score(y_test,y_pred_rf),5))
print('F1 score for testing set: ',round(f1_score(y_test,y_pred_rf),5))
print('Recall score for testing set: ',round(recall_score(y_test,y_pred_rf),5))
print('Precision score for testing set: ',round(precision_score(y_test,y_pred_rf),5))
cm_rf = confusion_matrix(y_test, y_pred_rf)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_rf, annot=True, cmap='Reds', fmt='g', ax=ax)
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

# Cross Validation

In [None]:
# Create a random forest classifier object
rf_model = RandomForestClassifier()

# Define the scoring metrics for cross-validation
scoring_metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score, zero_division=0),
    'f1_score': make_scorer(f1_score, zero_division=0)
}

# Perform 5-fold cross-validation
cv_results = cross_validate(rf_model, X_train_resampled, y_train_resampled, cv=5, scoring=scoring_metrics)

# Convert the cross-validation results to a readable format
cv_results_readable = {metric: np.mean(scores) for metric, scores in cv_results.items() if metric.startswith('test_')}

# Output the results
for metric, score in cv_results_readable.items():
    print(f"{metric}: {score}")

# Model Performance Summary After Cross Validation

In [None]:
get_model_metrics(rf_model,X_train_resampled,y_train_resampled,X_test,y_test)

# Decision Tree

# Model Creation, Training, and Prediction

In [None]:

# Object creation, fitting the data & getting predictions
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_resampled,y_train_resampled)
y_pred_dt = decision_tree.predict(X_test)


### Feature Importance Visualization

In [None]:
feature_names = list(X_train_resampled.columns)
importances = decision_tree.feature_importances_
# Indices of the 15 most important features, in ascending order of importance
indices = np.argsort(importances)[-15:]

plt.figure(figsize=(8,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='green', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

### Classification Report


In [None]:

print(classification_report(y_test, y_pred_dt, target_names=['0', '1']))

### Model Performance Summary Before Cross Validation

In [None]:
print("**************** DECISION TREE CLASSIFIER MODEL RESULTS **************** ")
print('Accuracy score for testing set: ',round(accuracy_score(y_test,y_pred_dt),5))
print('F1 score for testing set: ',round(f1_score(y_test,y_pred_dt),5))
print('Recall score for testing set: ',round(recall_score(y_test,y_pred_dt),5))
print('Precision score for testing set: ',round(precision_score(y_test,y_pred_dt),5))
cm_dt = confusion_matrix(y_test, y_pred_dt)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_dt, annot=True, cmap='Greens', fmt='g', ax=ax)
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

### Cross-Validation

In [None]:

from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Create a decision tree classifier object
decision_tree = DecisionTreeClassifier()
# Define the scoring metrics for cross-validation
scoring_metrics = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score, zero_division=0),
    'recall': make_scorer(recall_score, zero_division=0),
    'f1_score': make_scorer(f1_score, zero_division=0)
}

# Perform 5-fold cross-validation
cv_results = cross_validate(decision_tree, X_train_resampled, y_train_resampled, cv=5, scoring=scoring_metrics)

# Convert the cross-validation results to a readable format
cv_results_readable = {metric: np.mean(scores) for metric, scores in cv_results.items() if metric.startswith('test_')}

# Output the results
for metric, score in cv_results_readable.items():
    print(f"{metric}: {score}")

### Model Performance Summary After Cross Validation

In [None]:
get_model_metrics(decision_tree,X_train_resampled,y_train_resampled,X_test,y_test)

# Performance Summary of the Three Models

The Random Forest classifier demonstrates strong performance, with a reported accuracy of approximately 95.95%. It distinguishes between the positive and negative classes effectively, with an area under the ROC curve (AUC) of 1.0 on the training data and about 0.858 on the testing data. The perfect training score alongside a lower test score indicates that the forest fits the training data more closely than it generalises.

The confusion matrix on the test set reveals 76 true positives (TP), 545 true negatives (TN), 21 false positives (FP), and 25 false negatives (FN), that is, 621 of 667 test instances classified correctly (about 93.1%).

Comparing the three models, logistic regression clearly underperforms in predicting customer churn. In contrast, both the Random Forest classifier and the Decision Tree perform strongly, with reported accuracies of 95.95% and 88.0%, respectively.

Given the superior predictability of the Random Forest classifier and Decision Trees, it's pertinent to enhance these models further using hyperparameters to optimize accuracy. Hyperparameters serve as a valuable tool for improving efficiency and performance across models.

# Conclusion

The Decision Tree and Random Forest perform considerably better than logistic regression; therefore, we can drop the latter and tune the hyperparameters of the other two.

# Hyperparameter Tuning

This is the process of selecting optimal hyperparameters for our models. Hyperparameters are set before the model begins to learn.

# Random Forest HyperTuning.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],  # 'auto' was an alias of 'sqrt' and is rejected by recent scikit-learn releases
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
}
# Create the Random Forest classifier
rf_clf = RandomForestClassifier()

# Instantiate the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train_resampled,y_train_resampled)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Create a new Random Forest classifier with the best hyperparameters
best_RForest_model = RandomForestClassifier(**best_params, random_state=42)

# Fit the best model to the resampled training data
best_RForest_model.fit(X_train_resampled, y_train_resampled)

# Predict on the training data
y_train_pred = best_RForest_model.predict(X_train_resampled)

# Predict on the test data
y_test_pred = best_RForest_model.predict(X_test)
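An exhaustive grid search fits one model per parameter combination per fold, so the cost grows multiplicatively with every option added. The sketch below counts the fits for a grid with the same dimensions as the Random Forest grid above, using scikit-learn's `ParameterGrid`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# A grid with the same dimensions as the Random Forest grid above
rf_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
}

n_folds = 5
n_candidates = len(ParameterGrid(rf_grid))  # one candidate per combination
n_fits = n_candidates * n_folds             # each candidate is fitted once per fold

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

If this is too slow, one option is a randomized search over the same space (for example `RandomizedSearchCV` with a fixed `n_iter`), at the cost of not trying every combination.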

# Performance Summary After Adding Hyperparameters

In [None]:
get_model_metrics(best_RForest_model,X_train_resampled,y_train_resampled,X_test,y_test)

# Decision Tree HyperTuning.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Define the hyperparameters grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']  # 'auto' was an alias of 'sqrt' and is rejected by recent scikit-learn releases
}

# Create a Decision Tree classifier
dt_classifier = DecisionTreeClassifier()

# Instantiate GridSearchCV
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the GridSearchCV object to the data
grid_search.fit(X_train_resampled, y_train_resampled)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Instantiate the DecisionTreeClassifier with the best hyperparameters
best_decTree_classifier = DecisionTreeClassifier(**best_params)

# Fit the model with the best hyperparameters on the resampled training data,
# the same data that the grid search used
best_decTree_classifier.fit(X_train_resampled, y_train_resampled)


# Performance Summary After Adding Hyperparameters


In [None]:

get_model_metrics(best_decTree_classifier,X_train_resampled,y_train_resampled,X_test,y_test)

# Evaluation

In [None]:
# Compute ROC curves and AUC scores for each model
models = [ logreg_cv,best_decTree_classifier,best_RForest_model]
labels = ['Logistic Model', 'Decision Tree', 'Random Forest']

plt.figure(figsize=(8, 6))

for model, label in zip(models, labels):
    if hasattr(model, "predict_proba"):
        y_probs = model.predict_proba(X_test)[:, 1]
    else:
        y_probs = model.predict(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_probs)
    auc_score = roc_auc_score(y_test, y_probs)

    plt.plot(fpr, tpr, label='{} (AUC = {:.2f})'.format(label, auc_score))

plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curves')
plt.legend()
plt.show()