<a href="https://colab.research.google.com/github/SoremiKayode/COMP5000-2022-labs/blob/main/Customer_Churn_Complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Customer Churn Prediction For a Telecommunication Network**
>Telecommunication customer churn prediction plays a crucial role in reducing customer attrition rates and improving customer retention strategies. This study focuses on utilizing machine learning algorithms, namely Logistic Regression, K-Nearest Neighbours (KNN), Support Vector Machine (SVM), and Deep Learning, to predict customer churn in the telecommunication industry. The research incorporates feature engineering techniques, such as removing irrelevant data and upsampling the dataset to achieve class balance. The main objective is to evaluate the performance of these algorithms and conduct parameter tuning for enhanced results.

>The experimental results demonstrate that SVM achieves an accuracy of 89.9%, precision of 89.9%, and recall of 89.9% in predicting customer churn. The KNN model achieves an accuracy of 78%, precision of 80%, and recall of 80%. The optimized version of Logistic Regression yields an accuracy of 77.1%, precision of 77.2%, and recall of 77.1%. The Deep Learning model achieves an accuracy, precision, and recall of 85%.

>The parameter tuning process involves optimizing the hyperparameters of each algorithm. For SVM, parameters such as the kernel type, regularization parameter, and gamma value are fine-tuned. In KNN, the number of neighbours is optimized. Logistic Regression is optimized by tuning the regularization strength (C) and the penalty type. Deep Learning involves tuning parameters such as the number of layers, activation functions, and learning rate.

>The findings of this research suggest that SVM outperforms other algorithms in terms of accuracy, precision, and recall. However, KNN, Logistic Regression, and Deep Learning also provide reasonably good results. The study highlights the significance of feature engineering and parameter tuning in improving the performance of machine learning algorithms for customer churn prediction in the telecommunication sector. These predictive models can assist telecommunication companies in identifying at-risk customers and implementing targeted retention strategies to reduce churn rates, enhance customer satisfaction, and ultimately improve business profitability.


##**Outline of building the machine learning prediction model**

###**Data Analysis:**
* Checking data summary and info
* Removing unwanted columns
* Handling missing data
* Plotting the distribution of categorical features as histograms
* Checking for correlation among features of the dataset

###**Data Preprocessing**
* One-hot encoding categorical data using pandas pd.get_dummies
* Scaling the inputs to the range zero to one using scikit-learn's MinMaxScaler
* Splitting the dataset into training and testing sets
* Searching for the best parameters to use

##**Importing the needed libraries**

In [None]:
pip install scikeras[tensorflow]
pip install scikit-plot==0.3.7

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import importlib
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import seaborn as sns
import tensorflow as tf
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, PrecisionRecallDisplay, classification_report, confusion_matrix
from scikeras.wrappers import KerasClassifier
import scikitplot as skplt
# from tensorflow.keras.utils.vis_utils import plot_model
from tensorflow.keras.utils import plot_model
%matplotlib inline



###**Loading the Dataset**

In [None]:
# loading the dataset
data = pd.read_excel("drive/MyDrive/Colab Notebooks/Telco_customer_churn.xlsx")
data

###**Getting the list of all the Features**

In [None]:
column = data.columns
print(len(column))
column
# The length of the columns is 33

###**Insight into the Dataset**

In [None]:
print("The Data Summary")
print(data.describe())

In [None]:
print("The data info is")
print(data.info())

####**Handling Missing Data and missing column**

The dataset has no missing values except in the Churn Reason column. Including Churn Reason would not be appropriate: the model is meant to estimate the churn risk of customers who have not been with the company long and have not churned, so no reason exists for them. We therefore drop it, along with 'CustomerID', 'Count', 'Country', 'State', 'City', 'Zip Code', 'Lat Long', 'Latitude', 'Longitude', 'Churn Score' and 'CLTV', so that we focus on features that are likely to affect our predictions.

In [None]:
wanted_columns = column[9:-3]
data = data[wanted_columns]
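
The missing-value claim and the positional column slice can be checked with a few lines of pandas. The sketch below uses a small made-up frame rather than the real Telco file; the toy column subset and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Telco data (values are made up).
toy = pd.DataFrame({
    "CustomerID": ["a1", "b2", "c3"],
    "Gender": ["Male", "Female", "Male"],
    "Tenure Months": [2, 40, 13],
    "Churn Value": [1, 0, 1],
    "Churn Reason": ["Moved", np.nan, "Price too high"],
})

# Count missing entries per column; here only "Churn Reason" has a gap,
# and it belongs to the customer who has not churned.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Positional slicing on the column index: drop the first and the last column,
# mirroring the idea behind column[9:-3] on the real 33-column dataset.
kept_columns = toy.columns[1:-1]
toy_kept = toy[kept_columns]
print(list(toy_kept.columns))
```

Slicing by position is compact but depends on the column order of the file; selecting by name is a more explicit alternative when the order is not guaranteed.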

We will create a function that returns the columns that are of object dtype or have fewer than 20 distinct values.

In [None]:
def get_categorical_columns(data):
  cat_columns = []
  for col in data.columns:
    if data[col].dtype == "O" or data[col].value_counts().shape[0] < 20 :
      cat_columns.append(col)
  return cat_columns

In [None]:
categorical_columns = get_categorical_columns(data)

Next, we write a function to plot the distribution (count plot) of each categorical column.

In [None]:
def plot_all_categorical_columns(data, columns, arrangement):
    len_columns = len(columns)
    fig, axes = plt.subplots(2, 3, figsize=(12, 7), sharey=True)

    for y in range(len_columns):
        sns.countplot(x=columns[y], data=data, ax=axes[arrangement[y][0], arrangement[y][1]])

    plt.tight_layout()
    plt.show()

# Example usage
first_columns = categorical_columns[:6]
second_columns = categorical_columns[6:12]
third_columns = categorical_columns[12:18]
arrangement = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

plot_all_categorical_columns(data, first_columns, arrangement)
plot_all_categorical_columns(data, second_columns, arrangement)
plot_all_categorical_columns(data, third_columns, arrangement)




###**Plotting Continuous Data**

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 7))
# Draw each feature on its own subplot by passing the target axes explicitly
sns.lineplot(y="Monthly Charges", x="Churn Value", data=data, ax=axes[0])
sns.lineplot(y="Tenure Months", x="Churn Value", data=data, ax=axes[1])

###**Data Processing**

####**Removing unwanted Column**

In [None]:
data = data[['Gender', 'Senior Citizen', 'Partner', 'Dependents', 'Tenure Months',
       'Phone Service', 'Multiple Lines', 'Internet Service',
       'Online Security', 'Online Backup', 'Device Protection', 'Tech Support',
       'Streaming TV', 'Streaming Movies', 'Contract', 'Paperless Billing',
       'Payment Method', 'Monthly Charges', 'Total Charges',
       'Churn Value']]

####**Separating Categorical Columns**

In [None]:
cat_column = [x for x in data.columns if data[x].dtype=="O"]

In [None]:
cat_column = ['Gender',
 'Senior Citizen',
 'Partner',
 'Dependents',
 'Phone Service',
 'Multiple Lines',
 'Internet Service',
 'Online Security',
 'Online Backup',
 'Device Protection',
 'Tech Support',
 'Streaming TV',
 'Streaming Movies',
 'Contract',
 'Paperless Billing',
 'Payment Method']

###**One Hot Encoding**

In [None]:
X = pd.get_dummies(data=data, columns=cat_column, drop_first=True)
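
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up frame. The toy values are assumptions; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Tenure Months": [1, 24, 60, 12],
})

# One indicator column is created per category; drop_first=True removes the
# first category (alphabetically "Month-to-month"), which becomes the baseline
# encoded as all-zero indicators.
encoded = pd.get_dummies(data=toy, columns=["Contract"], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters most for linear models such as logistic regression.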

###**Scaling Non Categorical Data**

In [None]:
# scaling the continuous (non-categorical) columns to the range [0, 1]
sc = MinMaxScaler()
a = sc.fit_transform(X[['Tenure Months']])
b = sc.fit_transform(X[['Monthly Charges']])

In [None]:
X['Tenure Months'] = a
X['Monthly Charges'] = b
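
Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below checks that formula against `MinMaxScaler` on a made-up tenure column; the numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical tenure values in months, shaped (n_samples, 1) as the scaler expects.
tenure = np.array([[1.0], [13.0], [25.0], [73.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (tenure - tenure.min()) / (tenure.max() - tenure.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(tenure)

print(np.allclose(manual, scaled))
```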

In [None]:
X.drop("Total Charges", axis=1, inplace=True)

###**Upsampling the Data to Create a Balance between the two output Classes**

In [None]:
# setting variable y to the churn value
y = data["Churn Value"]
y

# Put the churn label back in X so the two classes can be split and upsampled
X["Churn Value"] = y

###**Splitting the Dataset into yes or no**

In [None]:
# splitting the dataset into yes or no
x_no = X[X["Churn Value"] == 0]
x_yes = X[X["Churn Value"] == 1]

In [None]:
print(len(x_no))
print(len(x_yes))

###**Upsampling the yes dataset to be the same size as the no dataset**

In [None]:
# upsampling the yes dataset to be the same size as the no dataset
x_yes_upsampled = x_yes.sample(n=len(x_no), replace=True, random_state=42)

###**Joining the Dataset**

In [None]:
# Joining both datasets (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
x_upsampled = pd.concat([x_no, x_yes_upsampled]).reset_index(drop=True)
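
The split, resample and join steps can be seen end to end on a tiny made-up frame. This is a sketch under the assumption of a 4-to-2 class imbalance; the column values are invented.

```python
import pandas as pd

# Hypothetical imbalanced toy data: four non-churners, two churners.
toy = pd.DataFrame({
    "Monthly Charges": [20.0, 35.5, 50.0, 70.0, 89.9, 99.0],
    "Churn Value": [0, 0, 0, 0, 1, 1],
})

majority = toy[toy["Churn Value"] == 0]
minority = toy[toy["Churn Value"] == 1]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Stack both parts and rebuild a clean 0..n-1 index.
balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
print(balanced["Churn Value"].value_counts())
```

Because sampling is done with replacement, the upsampled minority rows are duplicates of existing rows rather than new observations.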

In [None]:
y = x_upsampled["Churn Value"]
x_upsampled.drop("Churn Value", axis=1, inplace = True)

In [None]:
x_upsampled

In [None]:
y

###**Converting y label to Categorical**

In [None]:
# we need to convert the y label to categorical (one-hot) form
y = tf.keras.utils.to_categorical(y)

In [None]:
y
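
For binary labels, one-hot encoding turns each label into a two-element row with a 1 in the column of its class. The sketch below is a NumPy stand-in for that layout, not the Keras implementation itself (the dtype returned by `to_categorical` may differ); the label values are made up.

```python
import numpy as np

# Hypothetical binary churn labels.
labels = np.array([0, 1, 1, 0])

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(2)[labels]
print(one_hot)

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The one-hot form suits a network with one output unit per class, while scikit-learn classifiers such as LogisticRegression and SVC expect the one-dimensional integer labels.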

###**Splitting the dataset into training and testing**

In [None]:
# splitting the dataset into training and testing
x_train, x_test, y_train, y_test = train_test_split(x_upsampled, y, test_size=0.3, random_state=42)

###**Highlight of what to do at the next stage**

* Use logistic regression and plot the result
* Use k-nearest neighbours and plot the accuracy, precision and recall
* Use a support vector machine (SVM) and plot the result
* Use deep learning and plot the result


##**Building The training Model**

###**We have put all the Data preprocessing in one file, let's import it**

In [None]:
# importing the Python file that contains the data preprocessing function
from drive.MyDrive import customer_churnfuction

In [None]:
# loading the function
x_train, x_test, y_train, y_test = customer_churnfuction.perform_data_preprocessing("drive/MyDrive/Colab Notebooks/Telco_customer_churn.xlsx")



##**Logistic Regression**

Logistic regression is a statistical algorithm used for binary classification tasks. It predicts the probability of an instance belonging to a certain class based on input features. Despite its name, it is primarily used for classification, not regression.

In logistic regression, we start with a linear model that combines the input features linearly. The linear model can be represented as:

z = b₀ + b₁x₁ + b₂x₂ + ... + bₚxₚ

Here, z is the linear combination of the input features (x₁, x₂, ..., xₚ), b₀ is the bias term, and b₁, b₂, ..., bₚ are the coefficients (weights) associated with each feature.

To map the linear output (z) to a probability value between 0 and 1, logistic regression uses the sigmoid function (also known as the logistic function). The sigmoid function is defined as:

σ(z) = 1 / (1 + e^(-z))

In the equation, e represents the base of the natural logarithm, and -z is the input to the exponential function. The sigmoid function takes the linear output (z) and squashes it into a value between 0 and 1, representing the estimated probability of the positive class.
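
As a small worked example of the two formulas above, the sketch below computes z for one instance and passes it through the sigmoid. The coefficient and feature values are hypothetical, chosen so the arithmetic is easy to follow.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical bias, coefficients and feature vector.
b0 = -1.0
b = np.array([2.0, -0.5])
x = np.array([1.5, 2.0])

# Linear part: z = b0 + b1*x1 + b2*x2 = -1 + 3 - 1 = 1
z = b0 + b @ x

# Estimated probability of the positive class: 1 / (1 + e^-1), about 0.731
p = sigmoid(z)
print(z, p)
```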

During the training process, logistic regression aims to find the set of coefficients (b₀, b₁, ..., bₚ) that maximize the likelihood of the observed data. This is equivalent to minimizing a cost function called log loss (also known as cross-entropy loss). The log loss penalizes large differences between the predicted probabilities and the true labels.
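
For binary labels, the log loss averages -[y·log(p) + (1 - y)·log(1 - p)] over all instances. The sketch below, using made-up labels and probabilities, shows how confident wrong predictions are penalized far more than confident correct ones. The helper name `log_loss` and the clipping constant are choices made for this illustration.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Hypothetical true labels and two sets of predicted probabilities.
y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_good = log_loss(y_true, confident_right)
loss_bad = log_loss(y_true, confident_wrong)
print(loss_good, loss_bad)
```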

Optimization algorithms, such as gradient descent or its variants, are used to minimize the cost function and find the optimal coefficients. These algorithms iteratively update the coefficients based on the gradient (derivative) of the cost function with respect to the coefficients. The process continues until convergence is reached, indicating that the coefficients have reached their optimal values.
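
The iterative update can be sketched in a few lines of NumPy. This is a minimal batch gradient descent for illustration only, not what scikit-learn runs internally; the function name, learning rate, iteration count and toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, learning_rate=0.1, n_iter=1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ weights + bias)   # current predicted probabilities
        error = p - y                     # gradient of the log loss w.r.t. z
        grad_w = X.T @ error / n_samples  # gradient w.r.t. the weights
        grad_b = error.mean()             # gradient w.r.t. the bias
        weights -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weights, bias

# Hypothetical one-feature data, centered on zero: positive values are class 1.
X = np.array([[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

weights, bias = fit_logistic_gd(X, y)
predictions = (sigmoid(X @ weights + bias) >= 0.5).astype(int)
print(weights, bias, predictions)
```

On perfectly separable data like this toy set the weights keep growing with more iterations, which is one reason practical implementations add regularization.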


###**Function to Train, Evaluate and Plot Result of Logistic Regression**

In [None]:
def evaluate_logistic_regression(x_train, y_train, x_test, y_test):
    # Create an instance of the logistic regression model
    logistic_regression = LogisticRegression()

    # Fit the model
    logistic_regression.fit(x_train, y_train)

    # Make predictions on the test set
    y_pred = logistic_regression.predict(x_test)
    y = tf.keras.utils.to_categorical(y_pred)

    # Calculate accuracy, precision, and recall
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')

    # Print the computed metrics
    print(f"The accuracy is {accuracy}")
    print(f"The precision is {precision}")
    print(f"The recall is {recall}")

    # Create a dictionary for the graph
    datapoint = {"accuracy" : accuracy, "precision" : precision, "recall" : recall}

    # Create a figure with three subplots
    fig, axs = plt.subplots(1, 3, figsize=(20, 5))

    # Plot the accuracy, precision, and recall
    bars = axs[0].bar(list(datapoint.keys()), list(datapoint.values()))
    axs[0].set_title("Accuracy, Precision and Recall score for Logistic Regression")
    # Color each bar individually (Axes.bar returns a container of bar patches)
    bars[0].set_color('green')
    bars[1].set_color('black')
    bars[2].set_color('red')

    # Plot the confusion matrix
    skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True, title='Confusion Matrix for Customer Churn prediction Logistic Regression', ax=axs[1])

    # Plot the precision-recall curve
    skplt.metrics.plot_precision_recall(y_test, y, title='PR Curve for customer churn prediction Logistic Regression', ax=axs[2])

    plt.show()

# Call the function
evaluate_logistic_regression(x_train, y_train, x_test, y_test)
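
To show what the three reported scores measure, the sketch below derives them by hand from the confusion-matrix counts of a made-up prediction vector. The labels are invented; the weighted precision follows the support-weighted averaging that `average='weighted'` refers to.

```python
import numpy as np

# Hypothetical true and predicted churn labels (1 = churned).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # churners correctly flagged
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # non-churners correctly passed
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed churners

accuracy = (tp + tn) / len(y_true)
precision_1 = tp / (tp + fp)   # precision for the churn class
recall_1 = tp / (tp + fn)      # recall for the churn class
precision_0 = tn / (tn + fn)   # precision for the non-churn class

# Weighted precision: per-class precision averaged by class support.
support_0 = int(np.sum(y_true == 0))
support_1 = int(np.sum(y_true == 1))
weighted_precision = (precision_0 * support_0 + precision_1 * support_1) / len(y_true)

print(accuracy, precision_1, recall_1, weighted_precision)
```

With weighted averaging, recall always equals accuracy, which helps explain why those two figures coincide in the reported results.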

###**Optimizing Logistic Regression and Plotting the Result**

In [None]:
def tune_and_evaluate_logistic_regression(x_train, y_train, x_test, y_test):
    # Define the parameter grid for GridSearchCV
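    # penalty=None means no regularization (scikit-learn >= 1.2; the string "none" was removed in later releases)
    param_grid = {'C': [0.1, 1, 10], 'penalty': ['l2', None]}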

    # Create an instance of LogisticRegression
    logistic_regression = LogisticRegression()

    # Create an instance of GridSearchCV
    grid_search = GridSearchCV(logistic_regression, param_grid, cv=5)

    # Fit the model on the training data
    grid_search.fit(x_train, y_train)

    # Print the best parameters found by GridSearchCV
    print("Best Parameters: ", grid_search.best_params_)

    # Predict on the test set using the best model
    y_pred = grid_search.predict(x_test)

    # Print classification report
    print("Classification Report:\n", classification_report(y_test, y_pred))

    # One-hot encode the predicted labels for the precision-recall plot
    y = tf.keras.utils.to_categorical(y_pred)

    # Calculate accuracy, precision, and recall
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')

    print(f"The accuracy is {accuracy}")
    print(f"The precision is {precision}")
    print(f"The recall is {recall}")

    # Create a dictionary for the graph
    datapoint = {"accuracy" : accuracy, "precision" : precision, "recall" : recall}

    # Create a figure with three subplots
    fig, axs = plt.subplots(1, 3, figsize=(20, 5))

    # Plot the accuracy, precision, and recall
    bars = axs[0].bar(list(datapoint.keys()), list(datapoint.values()))
    axs[0].set_title("Accuracy, Precision and Recall score for Logistic Regression")
    # Color each bar individually (Axes.bar returns a container of bar patches)
    bars[0].set_color('green')
    bars[1].set_color('black')
    bars[2].set_color('red')

    # Plot the confusion matrix
    skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True, title='Confusion Matrix for Customer Churn prediction Logistic Regression optimized', ax=axs[1])

    # Plot the precision-recall curve
    skplt.metrics.plot_precision_recall(y_test, y, title='PR Curve for customer churn prediction Logistic Regression optimized', ax=axs[2])

    plt.show()

# Call the function
tune_and_evaluate_logistic_regression(x_train, y_train, x_test, y_test)

##**Training, Testing and Plotting Support Vector Machine**

In [None]:
def train_and_evaluate_svm(x_train, y_train, x_test, y_test):
    # Create an SVM classifier (SVC is imported directly from sklearn.svm above)
    clf = SVC(kernel='linear')

    # Train the classifier
    clf.fit(x_train, y_train)

    # Make predictions on the test set
    y_pred = clf.predict(x_test)
    y = tf.keras.utils.to_categorical(y_pred)

    # Calculate accuracy, precision, and recall
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')

    print(f"The accuracy is {accuracy}")
    print(f"The precision is {precision}")
    print(f"The recall is {recall}")

    # Create a dictionary for the graph
    datapoint = {"accuracy" : accuracy, "precision" : precision, "recall" : recall}

    # Create a figure with three subplots
    fig, axs = plt.subplots(1, 3, figsize=(20, 5))

    # Plot the accuracy, precision, and recall
    bars = axs[0].bar(list(datapoint.keys()), list(datapoint.values()))
    axs[0].set_title("Accuracy, Precision and Recall score for SVM")
    # Color each bar individually (Axes.bar returns a container of bar patches)
    bars[0].set_color('green')
    bars[1].set_color('black')
    bars[2].set_color('red')

    # Plot the confusion matrix
    skplt.metrics.plot_confusion_matrix(y_test, y_pred, normalize=True, title='Confusion Matrix for Customer Churn prediction SVM', ax=axs[1])

    # Plot the precision-recall curve
    skplt.metrics.plot_precision_recall(y_test, y, title='PR Curve for customer churn prediction SVM', ax=axs[2])

    plt.show()

# Call the function
train_and_evaluate_svm(x_train, y_train, x_test, y_test)
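
A precision-recall curve is usually traced from continuous scores rather than hard 0/1 predictions. One possible way to get such scores from a linear SVM is `decision_function`, sketched below on synthetic data from `make_classification`. The data, split and variable names are assumptions, and this is an alternative for illustration, not the procedure used in the function above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the Telco dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

clf = SVC(kernel="linear")
clf.fit(X_tr, y_tr)

# Signed distance to the separating hyperplane: larger means "more positive".
scores = clf.decision_function(X_te)

# Precision and recall at every threshold over the scores.
precision, recall, thresholds = precision_recall_curve(y_te, scores)
print(len(thresholds), "thresholds evaluated")
```

The resulting `precision` and `recall` arrays can be passed to `plt.plot(recall, precision)` to draw the curve.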