### A CLASSIFICATION PROJECT - CUSTOMER CHURN ANALYSIS

#### PROJECT SCENARIO
You are a data scientist at Vodafone Corporation, a large telecommunication company.
* Vodafone wants to find the likelihood of a customer leaving the organization, the key indicators of churn, and the retention strategies that can be implemented to avert this problem.
* To do this, the business development unit has provided you with data to build a series of machine learning models to predict customer churn.
* The marketing and sales teams have also provided you with some data to aid this endeavor.


#### PROJECT DESCRIPTION
 Telecommunication companies face the ongoing challenge of customer churn, where subscribers discontinue services and switch to competitors. 
 To address this issue and proactively retain customers, we are undertaking a customer churn analysis project utilizing machine learning techniques. 
 In this project, we explore how machine learning techniques can be leveraged for customer churn analysis in telecommunication networks, following the well-established CRISP-DM (Cross-Industry Standard Process for Data Mining) framework. 


#### BUSINESS UNDERSTANDING
In today's highly competitive telecommunication industry, customer churn, or the loss of customers to competitors, poses a significant challenge for companies striving to maintain market share and profitability. 
Identifying customers at risk of churn and implementing proactive retention strategies is crucial for sustaining business growth.

##### HYPOTHESIS
NULL HYPOTHESIS: There is no relationship between the tenure and the churn of customers.

ALTERNATE HYPOTHESIS: There is a relationship between the tenure and the churn of customers.

##### ANALYTICAL QUESTIONS
1. What is the overall churn rate of the telecommunication company?
2. Does churn rate differ based on the payment method?
3. What is the churn rate of customers based on their seniority?
4. What is the churn rate of customers based on their monthlycharges?
5. What is the churn rate of customers based on their contract type?
6. What is the churn rate of customers based on their gender?

#### DATA UNDERSTANDING

##### Loading the necessary libraries

In [None]:
import numpy as np
import pandas as pd
import os
import pyodbc
from dotenv import load_dotenv
from dotenv import dotenv_values
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from scipy import stats
#Machine Learning Packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import warnings
warnings.filterwarnings('ignore')

: 

##### Load the datasets

In [None]:
#Load the first dataset from the database
#Load environment variables from the .env file into a dictionary
environment_variables = dotenv_values('.env')

#Access database credentials from the environment variables dictionary
server = environment_variables.get("SERVER")
database = environment_variables.get("DATABASE")
username = environment_variables.get("USERNAME")
password = environment_variables.get("PASSWORD")

#Construct the connection string
connection_string = f"DRIVER=ODBC Driver 17 for SQL Server;SERVER={server};DATABASE={database};UID={username};PWD={password};"

#Connect to the database; re-raise on failure so later cells do not run without a connection
try:
    connection = pyodbc.connect(connection_string)
    print("Connection successful!")
except Exception as e:
    print("Error:", e)
    raise

#Specify the SQL query to extract data from the table
query = "SELECT * FROM dbo.LP2_Telco_churn_first_3000"

#Execute the query and fetch the data into a pandas DataFrame
Dataset1 = pd.read_sql_query(query, connection)

: 

In [None]:
#Preview the first dataset
Dataset1.head()

: 

In [None]:
#Load the second dataset
Dataset2 = pd.read_csv("./Dataset/LP2_Telco-churn-second-2000.csv")
Dataset2.head()

: 

In [None]:
#Check the columns
column_names = Dataset1.columns
print(column_names)

: 

In [None]:
#Check the columns
column_names = Dataset2.columns
print(column_names)

: 

In [None]:
#Check the number of rows and columns
Dataset1.shape

: 

In [None]:
#Check the number of rows and columns
Dataset2.shape

: 

##### Observations
1. The outputs show both datasets have the same column names and number of columns so they can be merged easily.
2. However, some of the column names are in upper case so they will be converted to lower case.
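
The column comparison can also be done programmatically rather than by reading the printed output. A minimal sketch, assuming two small stand-in frames (`first` and `second` are hypothetical names, not the project data):

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; only the column names matter here
first = pd.DataFrame(columns=['customerID', 'gender', 'Churn'])
second = pd.DataFrame(columns=['customerID', 'gender', 'Churn'])

# Compare the lower-cased column names as sets and as ordered lists
first_cols = first.columns.str.lower()
second_cols = second.columns.str.lower()
same_names = set(first_cols) == set(second_cols)
same_order = list(first_cols) == list(second_cols)

print("Same column names:", same_names)
print("Same column order:", same_order)
```

In the notebook, the same comparison could be run on `Dataset1.columns` and `Dataset2.columns` before merging.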

In [None]:
#Convert column names to lower case
Dataset1.columns = Dataset1.columns.str.lower()

#Check the columns to confirm
column_names = Dataset1.columns
print(column_names)

: 

In [None]:
#Convert column names to lower case
Dataset2.columns = Dataset2.columns.str.lower()

#Check the columns to confirm
column_names = Dataset2.columns
print(column_names)

: 

In [None]:
#Check cell values
Dataset1.info()

: 

##### This shows there are missing values in these columns: multiplelines, onlinesecurity, onlinebackup, deviceprotection, techsupport, streamingtv, streamingmovies, totalcharges and churn. They will be treated accordingly.

In [None]:
#Check cell values
Dataset2.info()

: 

##### This shows there are no empty cells in any of the columns but some of the columns have the wrong datatype. This will be taken care of accordingly.

In [None]:
#Check for the unique values of each column
def check_unique_values(df):
    for column in df.columns:
        unique_values = df[column].unique()
        print(f"Unique values in column '{column}': {unique_values}")

#Check for Dataset1
check_unique_values(Dataset1)

: 

In [None]:
#Check for Dataset2 (reuses check_unique_values defined in the previous cell)
check_unique_values(Dataset2)

: 

##### The 'True' and 'False' values in the first dataset will be replaced with 'Yes' and 'No' to ensure both datasets have the same values before they are merged.

In [None]:
#Replace the "True" and "False" values in Dataset1
def replace_true_false(df):
    df.replace({True: 'Yes', False: 'No'}, inplace=True)

replace_true_false(Dataset1)

: 

In [None]:
#Check Dataset1 to confirm
Dataset1.head()

: 

In [None]:
#Replace the values in seniorcitizen column
def replace_yes_no_with_1_0(df):
    df['seniorcitizen'] = df['seniorcitizen'].replace({'Yes': 1, 'No': 0})

replace_yes_no_with_1_0(Dataset1)

: 

In [None]:
#Check Dataset1 to confirm
Dataset1.head()

: 

##### Since the columns and values for both datasets are similar now, we will merge both datasets.

In [None]:
#Merge both datasets; ignore_index=True gives the merged dataframe a unique row index
df = pd.concat([Dataset1, Dataset2], axis=0, ignore_index=True)

: 
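
When two frames are stacked with `pd.concat`, the row index is worth checking, because by default each frame keeps its own labels. A minimal sketch with hypothetical two-row frames (`part_one` and `part_two` are illustrative names only):

```python
import pandas as pd

# Hypothetical two-row frames standing in for Dataset1 and Dataset2
part_one = pd.DataFrame({'tenure': [1, 34]})
part_two = pd.DataFrame({'tenure': [2, 45]})

# Default concat keeps each frame's own index, so labels 0 and 1 appear twice
stacked = pd.concat([part_one, part_two], axis=0)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([part_one, part_two], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```

Duplicate row labels can cause problems in later steps that align data on the index, so a unique index is usually the safer choice.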

In [None]:
#Check merged dataframe
df.head()

: 

In [None]:
#Check rows and columns
df.shape

: 

In [None]:
#Check cell values
df.info()

: 

##### The 'totalcharges' column has the wrong datatype. It will be converted into a float. 

In [None]:
#Convert 'totalcharges' column to numeric (float)
df['totalcharges'] = pd.to_numeric(df['totalcharges'], errors='coerce')

: 
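
With `errors='coerce'`, any entry that cannot be parsed as a number becomes NaN, so the conversion can introduce new missing values. A minimal sketch on a hypothetical sample (the values in `raw` are invented for illustration):

```python
import pandas as pd

# Hypothetical sample: two valid amounts, one blank string, one already-missing value
raw = pd.Series(['29.85', ' ', '1889.5', None])

converted = pd.to_numeric(raw, errors='coerce')

# Entries that were present before conversion but are NaN afterwards
newly_missing = converted.isna() & raw.notna()
print(converted.dtype)
print("Values coerced to NaN:", int(newly_missing.sum()))
```

Running the same comparison on the 'totalcharges' column before and after conversion would show how many entries were coerced.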

In [None]:
#Check cell values
df.info()

: 

In [None]:
#Check for duplicates
df.duplicated().sum()

: 

In [None]:
#Check missing values
df.isna().sum()

: 

##### Check the columns

In [None]:
#Check for the merged dataframe (reuses check_unique_values defined earlier)
check_unique_values(df)

: 

##### Univariate Analysis

##### Distribution of Categorical Variables

In [None]:

#List of categorical columns
categorical_columns = ['gender', 'partner', 'dependents', 'phoneservice', 'multiplelines', 
                       'internetservice', 'onlinesecurity', 'onlinebackup', 'deviceprotection', 
                       'techsupport', 'streamingtv', 'streamingmovies', 'contract', 
                       'paperlessbilling', 'paymentmethod', 'churn']

#Set up the figure and axes for plotting
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(18, 18))
axes = axes.flatten()

#Loop through each categorical column and plot the frequency of each category
for i, column in enumerate(categorical_columns):
    #Plot counts on the full dataframe; resetting the index avoids duplicate row labels left over from the merge
    sns.countplot(x=column, data=df.reset_index(drop=True), ax=axes[i])
    axes[i].set_title(f'Distribution of {column}')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')

#Adjust layout
plt.tight_layout()
plt.show()

: 

##### Distribution of Numerical Variables

In [None]:
#List of numerical columns
numerical_columns = ['tenure', 'monthlycharges', 'totalcharges']

#Set up the figure and axes for plotting
fig, axes = plt.subplots(nrows=len(numerical_columns), ncols=1, figsize=(8, 5*len(numerical_columns)))

#Loop through each numerical column and plot its histogram
for i, column in enumerate(numerical_columns):
    df[column].hist(ax=axes[i], bins=20, color='skyblue', edgecolor='black')
    axes[i].set_title(f'Distribution of {column}')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')

#Adjust layout
plt.tight_layout()
plt.show()

: 

In [None]:
#Boolean variable to analyze
boolean_variable = 'seniorcitizen'

#Calculate the proportion of 'True' and 'False' values
proportion_true = df[boolean_variable].sum() / len(df)
proportion_false = 1 - proportion_true

#Plot the proportions
sns.barplot(x=['True', 'False'], y=[proportion_true, proportion_false])
plt.title(f'Proportion of True and False values in {boolean_variable}')
plt.xlabel('Value')
plt.ylabel('Proportion')
plt.show()

: 

##### Check for Outliers

In [None]:
#Check summary statistics
df.describe().T

: 

##### Observation
1. The 'seniorcitizen', 'tenure' and 'monthlycharges' columns do not have any missing values but the 'totalcharges' column has missing values.
2. The average monthlycharge is approximately 65.09, the minimum monthlycharge is approximately 18.4 and the maximum monthlycharge is approximately 118.65. 
3. In the 'tenure' column, the standard deviation is approximately 24.53, indicating that the values are spread out over a wide range. 

In [None]:
#Check 'tenure' 
numerical_column = ['tenure']
plt.figure(figsize=(10, 6))
sns.boxplot(data=df[numerical_column])
plt.title('Box Plot of Tenure')
plt.xlabel('Variable')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.show()

: 

In [None]:
#Check 'monthlycharges' 
numerical_column = ['monthlycharges']
plt.figure(figsize=(10, 6))
sns.boxplot(data=df[numerical_column])
plt.title('Box Plot of MonthlyCharges')
plt.xlabel('Variable')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.show()

: 

In [None]:
#Check 'totalcharges' 
numerical_column = ['totalcharges']
plt.figure(figsize=(10, 6))
sns.boxplot(data=df[numerical_column])
plt.title('Box Plot of TotalCharges')
plt.xlabel('Variable')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.show()

: 

##### Observation
1. The box plots show that none of the three columns has outliers.
2. The 'tenure' boxplot suggests a fairly even distribution of tenure values across the dataset.
3. The 'monthlycharges' boxplot suggests that most customers have monthly charges clustered around the median, with a fairly consistent spread across the quartiles.
4. The 'totalcharges' values are concentrated between approximately 2000 and 4000, with the median closer to Q3, suggesting a skew towards higher charges.
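
The visual reading of the box plots can be backed up numerically with the 1.5 × IQR rule that a standard box plot uses for its whiskers. A minimal sketch, assuming a hypothetical helper `count_iqr_outliers` and invented sample values:

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count values outside the 1.5 * IQR whiskers used by a standard box plot."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical sample values, not taken from the project data
sample = pd.Series([18.4, 35.0, 65.1, 70.4, 89.9, 118.65, 900.0])
print(count_iqr_outliers(sample))
```

Applied to `df['tenure']`, `df['monthlycharges']` and `df['totalcharges']`, a count of zero would agree with the observation that no outliers are present. Missing values are skipped by `quantile` and never satisfy either comparison, so they are not counted.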

##### Bivariate Analysis

##### Gender vs. Churn

In [None]:
#Get the count of churn for each gender
gender_churn_counts = df.groupby(['gender', 'churn']).size().unstack(fill_value=0)

print(gender_churn_counts)

: 

In [None]:
#Plot Gender vs. Churn
plt.figure(figsize=(8, 6))
sns.countplot(x='gender', hue='churn', data=df)
plt.title('Gender vs. Churn')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.legend(title='Churn')  #Keep seaborn's own labels so the legend always matches the hue order
plt.show()

: 

##### This shows that among females, 1823 customers did not churn and 661 customers churned. Similarly, among males, 1883 customers did not churn and 675 customers churned.

##### Correlation of Numerical Variables

In [None]:
#Select the relevant columns
selected_columns = ['tenure', 'monthlycharges', 'totalcharges']
selected_corr_matrix = df[selected_columns].corr()

#Plot the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(selected_corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Heatmap of Numerical Variables')
plt.show()

: 

##### Observations
1. Tenure vs. Monthly Charges: There’s a weak positive correlation (0.24), suggesting that as tenure increases, monthly charges tend to increase slightly.
2. Tenure vs. Total Charges: A strong positive correlation (0.83) is observed here, indicating that longer tenure is strongly associated with higher total charges.
3. Monthly Charges vs. Total Charges: This pair shows a moderate positive correlation (0.65), meaning that as monthly charges increase, total charges also tend to increase.

##### Contract vs. Churn

In [None]:
contract_churn_counts = df.groupby(['contract', 'churn']).size()
print(contract_churn_counts)

: 

In [None]:
#Create a bar plot
plt.figure(figsize=(10, 6))
sns.countplot(x='contract', hue='churn', data=df)
plt.title('Contract vs. Churn')
plt.xlabel('Contract')
plt.ylabel('Count')
plt.legend(title='Churn', loc='upper right')
plt.show()


: 

##### Observations
1. For customers with a 'Month-to-month' contract, there are 1560 customers who did not churn (No) and 1184 customers who did churn (Yes).
2. For customers with a 'One year' contract, there are 933 customers who did not churn and 122 customers who did churn.
3. For customers with a 'Two year' contract, there are 1213 customers who did not churn and 30 customers who did churn.
4. It can be concluded that customers with shorter-term contracts (like 'Month-to-month') tend to churn more compared to those with longer-term contracts.
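
The counts listed above can be turned into churn rates per contract type, which makes the comparison independent of group size. A minimal sketch that re-enters the reported counts by hand (the frame `contract_counts` is built only for illustration):

```python
import pandas as pd

# Counts copied from the observations above
contract_counts = pd.DataFrame(
    {'No': [1560, 933, 1213], 'Yes': [1184, 122, 30]},
    index=['Month-to-month', 'One year', 'Two year'],
)

# Churn rate = churned customers / all customers on that contract type
contract_churn_pct = (contract_counts['Yes'] / contract_counts.sum(axis=1) * 100).round(2)
print(contract_churn_pct)
```

Dividing by the row totals shows the gap between contract types more clearly than the raw counts do.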

##### Multivariate Analysis

##### The 'churn' column contains string values ('Yes' and 'No'), which cannot be converted to float for correlation calculation. To perform correlation analysis, we need to encode these categorical values into numerical values first. One common approach is to use label encoding, where 'Yes' is replaced with 1 and 'No' is replaced with 0.

In [None]:
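#Encode the 'churn' column: 'No' -> 0, 'Yes' -> 1
#A direct mapping leaves missing churn values as NaN, which corr() ignores;
#LabelEncoder may fail on them or treat them as a class of their own, depending on the scikit-learn version
df['churn_encoded'] = df['churn'].map({'No': 0, 'Yes': 1})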

#Calculate correlation matrix
correlation_matrix = df[['tenure', 'monthlycharges', 'totalcharges', 'churn_encoded']].corr()

#Display correlation matrix
print(correlation_matrix)

: 

In [None]:
#Plotting the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

: 

##### Observations
1. tenure vs. monthlycharges: There is a positive correlation of approximately 0.24, indicating that as tenure increases, monthly charges also tend to increase, but the correlation is not very strong.
2. tenure vs. totalcharges: There is a strong positive correlation of approximately 0.83, suggesting that as tenure increases, total charges also increase.
3. tenure vs. churn_encoded: There is a negative correlation of approximately -0.35, indicating that as tenure increases, the likelihood of churn decreases.

#### HYPOTHESIS TESTING


Null Hypothesis: There is no relationship between the tenure and the churn of customers.

Alternate Hypothesis: There is a relationship between the tenure and the churn of customers.

In [None]:
#Create a contingency table
contingency_table = pd.crosstab(df['tenure'], df['churn'])

#Perform chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

#Set significance level
alpha = 0.05

print("Chi-square statistic:", chi2)
print("P-value:", p)
print("Degrees of freedom:", dof)

#Compare p-value with alpha to make a decision
if p < alpha:
    print("Reject the null hypothesis: There is a relationship between tenure and churn.")
else:
    print("Fail to reject the null hypothesis: There is not enough evidence of a relationship between tenure and churn.")


: 

##### Based on the p-value (which is far below the significance level of 0.05), we reject the null hypothesis. This means that there is sufficient evidence to conclude that there is a statistically significant relationship between tenure and churn.
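
A small p-value shows that an association exists but not how strong it is. One common companion measure is Cramér's V, which rescales the chi-square statistic to the range 0 to 1. A minimal sketch, assuming a hypothetical helper `cramers_v` and an invented contingency table (not the project's results):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V effect size (0 = no association, 1 = perfect association)."""
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Hypothetical contingency table: rows are tenure bands, columns are churn outcomes
toy_table = pd.DataFrame(
    {'No': [40, 70, 90], 'Yes': [60, 30, 10]},
    index=['short', 'medium', 'long'],
)
print(round(cramers_v(toy_table), 3))
```

In the notebook, the helper could be called on `contingency_table` to report an effect size next to the p-value.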

#### ANSWERING THE ANALYTICAL QUESTIONS

1. What is the overall churn rate against retained customers?
2. Does churn rate differ based on the payment method?
3. What is the churn rate of customers based on their seniority?
4. What is the churn rate of customers based on their monthlycharges?
5. What is the churn rate of customers based on their contract type?
6. What is the churn rate of customers based on their gender?

##### Question One: What is the overall churn rate against retained customers?

In [None]:
#Count the number of customers who churned
churned_count = df['churn'].value_counts()['Yes']

#Calculate the total number of customers with a recorded churn value
total_customers = df['churn'].notna().sum()

#Calculate the overall churn rate
overall_churn_rate = (churned_count / total_customers) * 100

print("Overall churn rate:", overall_churn_rate)

: 

In [None]:
#Calculate the churn rates as percentages
churn_rate_percentage = (churned_count / total_customers) * 100
retention_rate_percentage = 100 - churn_rate_percentage

#Create a bar plot
plt.figure(figsize=(8, 6))
bars = plt.bar(["Churned", "Retained"], [churn_rate_percentage, retention_rate_percentage], color=['orange', 'skyblue'])
plt.title("Overall Churn Rate")
plt.ylabel("Percentage")
plt.ylim(0, 100)

#Show the percentages on the bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval, f"{yval:.2f}%", va='bottom')

plt.show()

: 

##### This shows a bar plot with two bars: one for the churned customers and one for the retained customers. It shows there are more retained customers than customers that churned. 

##### Question Two: Does churn rate differ based on the payment method?

In [None]:
#Calculate churn rate for each payment method
payment_churn_rates = df.groupby('paymentmethod')['churn'].value_counts(normalize=True).loc[:, 'Yes'] * 100

# Plot churn rate based on payment method
plt.figure(figsize=(10, 6))
ax = payment_churn_rates.plot(kind='bar', color='skyblue')
plt.title('Churn Rate by Payment Method')
plt.xlabel('Payment Method')
plt.ylabel('Churn Rate (%)')
plt.xticks(rotation=45, ha='right')

#Add labels on top of each bar
for i, rate in enumerate(payment_churn_rates):
    plt.text(i, rate, f'{rate:.2f}%', ha='center', va='bottom')

plt.tight_layout()
plt.show()

: 

##### This shows that customers using electronic check have the highest churn rate, which could suggest issues with this payment method. This may reflect user dissatisfaction and warrants further investigation and improvement of the payment process.

##### Question Three: What is the churn rate of customers based on their Seniority?

In [None]:
# Calculate churn rate based on seniority
seniority_churn_rate = df.groupby('seniorcitizen')['churn'].value_counts(normalize=True)[:, 'Yes'] * 100

# Plot churn rate based on seniority using a bar plot
# (a pie chart would show each group's share of the summed rates, not the churn rates themselves)
plt.figure(figsize=(8, 6))
seniority_labels = ['Senior Citizen' if seniority == 1 else 'Non-Senior Citizen' for seniority in seniority_churn_rate.index]
plt.bar(seniority_labels, seniority_churn_rate.values, color='skyblue')
plt.title('Churn Rate by Seniority')
plt.ylabel('Churn Rate (%)')
plt.show()


: 

##### Question Four: What is the churn rate of customers based on their monthlycharges?

In [None]:
#Check unique values in the monthlycharges column
unique_monthlycharges = df['monthlycharges'].unique()

#Print unique values
print(unique_monthlycharges)

: 

In [None]:
#Define bins for monthly charges
bins = [0, 30, 60, 90, 120]

#Create labels for the bins
labels = ['0-30', '30-60', '60-90', '90-120']

#Assign each monthly charge to a bin
df['monthly_charges_bin'] = pd.cut(df['monthlycharges'], bins=bins, labels=labels, right=False)

#Calculate churn rate based on monthly charges
monthly_charges_churn_rate = df.groupby('monthly_charges_bin')['churn'].value_counts(normalize=True)[:, 'Yes'] * 100

#Plot churn rate based on monthly charges using a bar plot
plt.figure(figsize=(10, 6))
bars = monthly_charges_churn_rate.plot(kind='bar', color='skyblue')
plt.title('Churn Rate by Monthly Charges')
plt.xlabel('Monthly Charges')
plt.ylabel('Churn Rate (%)')
plt.xticks(rotation=45)

# Add labels on the bars
for bar in bars.patches:
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.1, bar.get_height() + 0.5, f'{bar.get_height():.2f}%', ha='center', color='black')

plt.show()

: 

##### This illustrates the relationship between the amount customers are charged monthly and the rate at which they stop using the service (churn rate). 
0-30 Range: Represents the lowest monthly charges and corresponds to the lowest churn rate, suggesting customers are satisfied with the service or find it affordable.

30-60 Range: Shows a significant increase in churn rate, indicating a threshold where customers may begin to consider the service too expensive or not worth the cost.

60-90 & 90-120 Ranges: Both have the highest churn rates, at a similar level, suggesting that beyond a certain price point the churn rate stabilizes, possibly due to a segment of customers who are less price-sensitive.
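
The binning step uses `right=False`, so each bin includes its lower edge and excludes its upper edge. A minimal sketch with a few sample charges placed on or near the edges (18.4 and 118.65 are the minimum and maximum reported earlier; the other two values are invented):

```python
import pandas as pd

bins = [0, 30, 60, 90, 120]
labels = ['0-30', '30-60', '60-90', '90-120']

# Sample charges chosen to sit on or near the bin edges
charges = pd.Series([18.4, 30.0, 59.99, 118.65])

# right=False makes each bin closed on the left: [0, 30), [30, 60), ...
binned = pd.cut(charges, bins=bins, labels=labels, right=False)
print(binned.astype(str).tolist())
```

A charge of exactly 30 lands in '30-60' rather than '0-30'. A charge of exactly 120 would fall outside the last bin and become NaN, which does not matter here because the reported maximum is 118.65.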

##### Question Five: What is the churn rate of customers based on their contract type?

In [None]:
#Calculate churn rate based on contract type
contract_churn_rate = df.groupby('contract')['churn'].value_counts(normalize=True)[:, 'Yes'] * 100

#Plot churn rate based on contract type using a bar plot
plt.figure(figsize=(10, 6))
bars = contract_churn_rate.plot(kind='bar', color='skyblue')
plt.title('Churn Rate by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Churn Rate (%)')
plt.xticks(rotation=45)

#Add labels on the bars
for bar in bars.patches:
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.1, bar.get_height() + 0.5, f'{bar.get_height():.2f}%', ha='center', color='black')

plt.show()


: 

##### This compares the churn rates across different contract durations. 
Month-to-Month: This category has the highest churn rate, around 40%, indicating that customers with no long-term commitments are more likely to discontinue the service.

One Year: Shows a significantly lower churn rate of about 12%, suggesting increased customer retention with longer contract terms.

Two Years: Has the lowest churn rate, which implies that the longest commitment contracts result in the best customer retention.

##### Question Six: What is the churn rate of customers based on their gender?

In [None]:
#Calculate churn rate based on gender
gender_churn_rate = df.groupby('gender')['churn'].value_counts(normalize=True)[:, 'Yes'] * 100

#Plot churn rate based on gender using a bar plot
plt.figure(figsize=(8, 6))
bars = gender_churn_rate.plot(kind='bar', color='skyblue')
plt.title('Churn Rate by Gender')
plt.xlabel('Gender')
plt.ylabel('Churn Rate (%)')
plt.xticks(rotation=0)

#Add labels on the bars
for bar in bars.patches:
    plt.text(bar.get_x() + bar.get_width() / 2 - 0.1, bar.get_height() + 0.5, f'{bar.get_height():.2f}%', ha='center', color='black')

plt.show()

: 

##### The churn rates are nearly identical for both genders, suggesting that gender does not play a significant role in the likelihood of customers discontinuing the service.

In [None]:
#Converting merged dataset to csv
df.to_csv('merged_dataset.csv', index=False)

: 

#### DATA PREPARATION

##### Check if dataset is balanced

In [None]:
#Check dataframe
df.head()

: 

In [None]:
#Drop the 'customerid', 'churn_encoded' and 'monthly_charges_bin' columns
df = df.drop(['customerid', 'churn_encoded', 'monthly_charges_bin'], axis=1)

: 

In [None]:
#Check dataframe
df.head(20)

: 

In [None]:
#Check for missing values in the churn column
missing_values = df['churn'].isnull().sum()

if missing_values == 0:
    print("There are no missing values in the churn column.")
else:
    print(f"There is {missing_values} missing value in the churn column.")

: 

In [None]:
#Drop rows with missing values in the churn column
df.dropna(subset=['churn'], inplace=True)

: 

In [None]:
#Check to confirm the missing values in the churn column
missing_values = df['churn'].isnull().sum()

if missing_values == 0:
    print("There are no missing values in the churn column.")
else:
    print(f"There are {missing_values} missing values in the churn column.")

: 

In [None]:
df.info()

: 

In [None]:
#Change the data type of the seniorcitizen column to object
df['seniorcitizen'] = df['seniorcitizen'].astype(str)

: 

##### To check if the dataset is balanced, we set a threshold of 5% of the total number of rows. If the absolute difference between the counts of the two classes is less than this threshold, the dataset is considered balanced; otherwise, it is considered imbalanced.

In [None]:
#The target variable is 'churn', a binary column (Yes/No)
#Count the occurrences of each class
class_counts = df['churn'].value_counts()

#Set the threshold for imbalance (5% of the total number of rows)
threshold = len(df) * 0.05

#Check if the dataset is balanced (iloc selects by position, since the index holds the labels 'No' and 'Yes')
is_balanced = abs(class_counts.iloc[0] - class_counts.iloc[1]) < threshold

if is_balanced:
    print("The dataset is balanced.")
else:
    print("The dataset is imbalanced.")

: 

In [None]:
#Count the occurrences of each class
class_counts = df['churn'].value_counts()

#Plot the distribution of the target variable
plt.figure(figsize=(6, 4))
bars = class_counts.plot(kind='bar', color=['skyblue', 'orange'])
plt.title('Distribution of Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.xticks(rotation=0)

#Annotate the bars with churn counts
for i, count in enumerate(class_counts):
    plt.text(i, count + 10, str(count), ha='center', va='bottom')

plt.show()

: 

##### The visual also confirms the dataset is not balanced since it has more 'No' values than 'Yes' values.
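
The size of the imbalance can be worked out from the gender and churn counts reported earlier (1823 and 1883 customers who did not churn, 661 and 675 who did). A minimal sketch that re-enters those counts by hand, so it covers only rows with a recorded churn value:

```python
import pandas as pd

# Class counts obtained by summing the gender-by-churn counts reported earlier
class_counts = pd.Series({'No': 1823 + 1883, 'Yes': 661 + 675})

total = class_counts.sum()
threshold = total * 0.05                      # 5% of all rows, as in the check above
difference = abs(class_counts['No'] - class_counts['Yes'])
minority_share = class_counts.min() / total * 100

print("Difference between classes:", difference)
print("Threshold:", threshold)
print(f"Minority class share: {minority_share:.1f}%")
```

The difference between the classes is far above the 5% threshold, which agrees with the result of the balance check.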

##### TRAINING ON THE IMBALANCED DATASET

##### Split the Dataset into Training and Evaluation Set

##### The data is split as follows:
X contains all the features except the target variable (churn).

y contains only the target variable (churn).

We use train_test_split to split the data into training and evaluation sets. Setting test_size to 0.3 reserves 30% of the data for evaluation and uses the remaining 70% for training; stratify=y keeps the churn proportions the same in both sets.

X_train and y_train contain the training features and target variable respectively. X_eval and y_eval contain the evaluation features and target variable respectively.


In [None]:
#Define features (X) and target variable (y)
X = df.drop('churn', axis=1) 
y = df['churn']  

#Split the dataset into training and evaluation sets
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

: 
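
The effect of `stratify` can be checked by comparing the share of churned customers in each split. A minimal sketch on a hypothetical target with roughly the same imbalance as the churn column (`X_toy` and `y_toy` are invented stand-ins, not the project data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical target with roughly the same imbalance as the churn column
y_toy = pd.Series(['No'] * 3706 + ['Yes'] * 1336)
X_toy = pd.DataFrame({'feature': np.arange(len(y_toy))})

X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# Share of 'Yes' in each split; with stratify these stay close to the overall share
overall_share = (y_toy == 'Yes').mean()
train_share = (y_tr == 'Yes').mean()
eval_share = (y_ev == 'Yes').mean()
print(round(overall_share, 3), round(train_share, 3), round(eval_share, 3))
```

Without `stratify`, the shares in the two splits would vary with the random seed, which matters more when the minority class is small.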

##### Encode the y train and evaluation labels

In [None]:
#Initialize LabelEncoder
label_encoder = LabelEncoder()

#Encode the target variable 'churn' for training set
y_train_encoded = label_encoder.fit_transform(y_train)

#Encode the target variable 'churn' for evaluation set
y_eval_encoded = label_encoder.transform(y_eval)

: 

##### Prepare Pipelines

In [None]:
#Identify the categorical columns
X.select_dtypes('object').columns

: 

In [None]:
#Identify the numerical columns
X.select_dtypes('number').columns

: 

In [None]:
#Define numerical and categorical features
numerical_features = ['tenure', 'monthlycharges', 'totalcharges']
categorical_features = ['gender', 'seniorcitizen', 'partner', 'dependents', 'phoneservice',
                        'multiplelines', 'internetservice', 'onlinesecurity', 'onlinebackup',
                        'deviceprotection', 'techsupport', 'streamingtv', 'streamingmovies',
                        'contract', 'paperlessbilling', 'paymentmethod']

#Create preprocessing steps for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),  #Fill missing values with the median
    ('scaler', StandardScaler())  #Scale the numerical features
])

categorical_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  #One-hot encode categorical features
])

#Combine preprocessing steps for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[ 
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

: 
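
To see what the preprocessing steps produce, the same kind of transformer can be fitted on a tiny frame. A minimal sketch, assuming a hypothetical four-row frame `toy` with only three of the columns, which rebuilds a smaller version of the preprocessor so that it runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical four-row frame with one missing value per column type
toy = pd.DataFrame({
    'tenure': [1, 34, np.nan, 45],
    'monthlycharges': [29.85, 56.95, 53.85, 42.30],
    'contract': ['Month-to-month', 'Two year', np.nan, 'Month-to-month'],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('num_imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['tenure', 'monthlycharges']),
    ('cat', Pipeline([('cat_imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['contract']),
])

transformed = toy_preprocessor.fit_transform(toy)
# Two scaled numeric columns plus one indicator column per contract category
print(transformed.shape)
```

The missing contract value is filled with the most frequent category before encoding, so no extra indicator column is created for it.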

#### MODELLING AND EVALUATION

In [None]:
#Define the models
models = [
    ('K-Nearest_Neighbors', KNeighborsClassifier(n_neighbors=5)),  
    ('Logistic_Regression', LogisticRegression(random_state=42)),  
    ('Support_Vector_Machine', SVC(random_state=42)),  
    ('Decision_Tree', DecisionTreeClassifier(random_state=42)),  
    ('Random_Forest', RandomForestClassifier(random_state=42)),  
    ('Gradient_Boosting', GradientBoostingClassifier(random_state=42)),  
]

#Creating dictionary for the models
all_pipelines = {}

#Create a DataFrame for the metrics
metrics_output = pd.DataFrame(columns=['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])

#Train and evaluate each model
for model_name, classifier in models:
    #Create a pipeline for the model
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', classifier)])  
    
    #Train the model
    pipeline.fit(X_train, y_train_encoded)

    #Add the fitted pipeline to the all_pipelines dictionary
    all_pipelines[model_name] = pipeline
    
    #Make predictions on the evaluation set
    y_pred = pipeline.predict(X_eval)

    #Generate classification report for each model
    metrics = classification_report(y_eval_encoded, y_pred, output_dict=True)
    
    #Evaluate the model
    accuracy = metrics['accuracy']
    precision = metrics['weighted avg']['precision']
    recall = metrics['weighted avg']['recall']
    f1 = metrics['weighted avg']['f1-score']  #Named f1 to avoid shadowing the imported f1_score function

    #Add metrics to metrics_output
    metrics_output.loc[len(metrics_output)] = [model_name, accuracy, precision, recall, f1]

: 

In [None]:
#Display the metrics_output
metrics_output.sort_values(ascending=False, by='f1_score')

: 

* Support Vector Machine (SVM): It has the highest weighted-average F1 score of all the models, indicating the best overall balance between precision (how many predictions for a class are correct) and recall (how many members of a class are found).

* Gradient Boosting: Its F1 score is very close to that of SVM, indicating a similar balance between precision and recall.

* Logistic Regression and Random Forest: Their F1 scores are slightly lower than those of SVM and Gradient Boosting, but they still maintain a reasonable balance between precision and recall and provide decent overall performance.

* K-Nearest Neighbors (KNN): Its F1 score is lower than that of the models above, indicating a weaker balance between precision and recall.

* Decision Tree: It has the lowest F1 score of all the models, indicating the weakest balance between precision and recall.

In summary, based on the weighted-average F1 score, SVM and Gradient Boosting are the top-performing models, followed by Logistic Regression and Random Forest, while KNN and Decision Tree lag behind. Because these scores are averaged over both classes and weighted by class size, they do not by themselves show how well each model identifies churners.
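To see how a weighted-average F1 can differ from the F1 of the churn class alone, here is a minimal pure-Python sketch. The labels are made up for illustration and are not taken from the project data; the helper `per_class_scores` is a hypothetical name.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

# Made-up labels for illustration only: 8 non-churners (0) and 2 churners (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

scores = {label: per_class_scores(y_true, y_pred, label) for label in (0, 1)}
total = sum(s[3] for s in scores.values())

# Weighted average: each class F1 is weighted by its number of true instances
weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total
churn_f1 = scores[1][2]

print(f"weighted F1 = {weighted_f1:.3f}, churn-class F1 = {churn_f1:.3f}")
```

In the dictionary returned by `classification_report(..., output_dict=True)`, the per-class figures sit under the class label as a string key (for example `metrics['1']`), next to the `'weighted avg'` entry used above.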

##### TRAINING THE BALANCED DATASET 

In [None]:
# Creating dictionary for the models
all_balanced_pipelines = {}

# Create a DataFrame for the metrics
balanced_metrics_output = pd.DataFrame(columns=['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])

# Train and evaluate each model
for model_name, classifier in models:
    # Create a pipeline for the model
    balanced_pipeline = imbPipeline(steps=[('preprocessor', preprocessor),
                                           ('smote-sampler', SMOTE(random_state=42)), 
                                           ('classifier', classifier)])  
    
    # Train the model
    balanced_pipeline.fit(X_train, y_train_encoded)
    
    # Add all pipeline to the all_pipeline dictionary
    all_balanced_pipelines[model_name] = balanced_pipeline
    
    # Make predictions on the test set
    y_pred = balanced_pipeline.predict(X_eval)

    # Generate classification report for each model
    balanced_metrics = classification_report(y_eval_encoded, y_pred, output_dict=True)
    
    # Evaluate the model
    accuracy = balanced_metrics['accuracy']
    precision = balanced_metrics['weighted avg']['precision']
    recall = balanced_metrics['weighted avg']['recall']
    f1 = balanced_metrics['weighted avg']['f1-score']  # named f1 so sklearn's f1_score function is not shadowed

    # Add metrics to balanced_metrics_output
    balanced_metrics_output.loc[len(balanced_metrics_output)] = [model_name, accuracy, precision, recall, f1]


: 

In [None]:
#Display the metrics_output
balanced_metrics_output.sort_values(ascending=False, by='f1_score')

: 

##### These are the results of the models' performance after balancing the dataset.
* Random Forest achieved the highest accuracy and F1 score among the models, indicating that it performed well overall in terms of correctly classifying instances and achieving a balance between precision and recall. It also had relatively high precision and recall.

* Gradient Boosting had slightly lower accuracy and F1 score compared to Random Forest but still performed well overall. It had similar precision and recall to Random Forest.

* Support Vector Machine (SVM) had the highest precision among the models, suggesting that it had the fewest false positive predictions. However, its accuracy and F1 score were slightly lower than those of Random Forest and Gradient Boosting.

* Logistic Regression had moderate performance, with accuracy, precision, recall, and F1 score falling in the mid-range among the models.

* Decision Tree showed lower performance compared to Random Forest and Gradient Boosting, with lower accuracy, precision, recall, and F1 score.

* K-Nearest Neighbors (KNN) had the lowest performance overall, with the lowest accuracy, precision, recall, and F1 score among the models.
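
For context on what the balancing step does: SMOTE creates synthetic minority-class rows by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step in pure Python; it skips the neighbour search, and the two sample rows (tenure in months, monthly charge) are hypothetical values, not project data.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between a minority sample and a neighbour."""
    u = rng.random()  # interpolation weight in [0, 1)
    return [p + u * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Hypothetical minority-class (churn) samples: [tenure in months, monthly charge]
churner_a = [2.0, 70.0]
churner_b = [6.0, 90.0]

synthetic = smote_like_sample(churner_a, churner_b, rng)
print(synthetic)
```

The imbalanced-learn pipeline applies the sampler only while fitting, so `X_eval` is never resampled when `predict` is called.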

##### Applying feature selection to improve performance of the models

In [None]:
# Creating dictionary for the models
all_bf_pipelines = {}

# Create a DataFrame for the metrics
bf_metrics_output = pd.DataFrame(columns=['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])

# Train and evaluate each model
for model_name, classifier in models:
    # Create a pipeline for the model
    bf_pipeline = imbPipeline(steps=[('preprocessor', preprocessor),
                                     ('smote-sampler', SMOTE(random_state=42)),
                                     # k='all' keeps every feature; pass an integer k to actually drop features
                                     ('feature_selection', SelectKBest(mutual_info_classif, k='all')),
                                     ('classifier', classifier)])
    
    # Train the model
    bf_pipeline.fit(X_train, y_train_encoded)
    
    # Add all pipeline to the all_pipeline dictionary
    all_bf_pipelines[model_name] = bf_pipeline
    
    # Make predictions on the test set
    y_pred = bf_pipeline.predict(X_eval)

    # Generate classification report for each model
    bf_metrics = classification_report(y_eval_encoded, y_pred, output_dict=True)
    
    # Evaluate the model
    accuracy = bf_metrics['accuracy']
    precision = bf_metrics['weighted avg']['precision']
    recall = bf_metrics['weighted avg']['recall']
    f1 = bf_metrics['weighted avg']['f1-score']  # named f1 so sklearn's f1_score function is not shadowed

    # Add metrics to bf_metrics_output
    bf_metrics_output.loc[len(bf_metrics_output)] = [model_name, accuracy, precision, recall, f1]

: 

In [None]:
#Display the metrics_output
bf_metrics_output.sort_values(ascending=False, by='f1_score')

: 

##### The models' performance is still not satisfactory, so we will consider further methods to improve it.
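
Note that `SelectKBest` with `k='all'` keeps every feature, so the feature selection step in the pipeline does not remove anything. The sketch below illustrates, in pure Python, what selecting the top `k` features by score means. The helper `select_k_best` is a hypothetical stand-in, and the feature names and scores are invented for illustration rather than computed from the project data.

```python
def select_k_best(scores, k):
    """Return the names of the k highest-scoring features (all of them if k == 'all')."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked if k == "all" else ranked[:k]

# Hypothetical relevance scores, for illustration only
feature_scores = {
    "tenure": 0.07,
    "PaymentMethod": 0.04,
    "MonthlyCharges": 0.05,
    "SeniorCitizen": 0.01,
}

kept_all = select_k_best(feature_scores, "all")  # nothing is dropped
kept_two = select_k_best(feature_scores, 2)      # only the two strongest features remain

print(kept_all)
print(kept_two)
```

In the pipeline, the equivalent change would be to pass an integer `k` to `SelectKBest` and compare the resulting metrics against the `k='all'` run.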

##### Adding a confusion matrix to the evaluation loop

In [None]:
from sklearn.metrics import confusion_matrix

# Creating dictionary for the models
all_bf_pipelines = {}

# Create confusion matrix dictionary
all_confusion_matrix = {}

# Create a DataFrame for the metrics
bf_metrics_output = pd.DataFrame(columns=['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])

# Train and evaluate each model
for model_name, classifier in models:
    # Create a pipeline for the model
    bf_pipeline = imbPipeline(steps=[('preprocessor', preprocessor),
                                     ('smote-sampler', SMOTE(random_state=42)),
                                     # k='all' keeps every feature; pass an integer k to actually drop features
                                     ('feature_selection', SelectKBest(mutual_info_classif, k='all')),
                                     ('classifier', classifier)])
    
    # Train the model
    bf_pipeline.fit(X_train, y_train_encoded)
    
    # Add all pipeline to the all_pipeline dictionary
    all_bf_pipelines[model_name] = bf_pipeline

    # Make predictions on the test set
    y_pred = bf_pipeline.predict(X_eval)

    # Create confusion matrix
    conf_matrix = confusion_matrix(y_eval_encoded, y_pred)

    # Add confusion matrix to the dictionary
    all_confusion_matrix[model_name] = conf_matrix

    # Generate classification report for each model
    bf_metrics = classification_report(y_eval_encoded, y_pred, output_dict=True)
    
    # Evaluate the model
    accuracy = bf_metrics['accuracy']
    precision = bf_metrics['weighted avg']['precision']
    recall = bf_metrics['weighted avg']['recall']
    f1 = bf_metrics['weighted avg']['f1-score']  # named f1 so sklearn's f1_score function is not shadowed

    # Add metrics to bf_metrics_output
    bf_metrics_output.loc[len(bf_metrics_output)] = [model_name, accuracy, precision, recall, f1]


: 

In [None]:
#Iterate over the model names and matrices in the all_confusion_matrix dictionary
#The loop variable is named conf_matrix so the imported confusion_matrix function is not overwritten
for model_name, conf_matrix in all_confusion_matrix.items():
    print(f"Confusion Matrix for {model_name}:")
    print(conf_matrix)

: 

##### These confusion matrices provide information about the performance of each model. They show how many instances were correctly or incorrectly classified by the model. For example, in the case of K-Nearest Neighbors, out of 1513 instances:
* 297 were correctly classified as positive (churn).

* 736 were correctly classified as negative (not churn).

* 376 were incorrectly classified as positive (predicted as churn but actually not churn).

* 104 were incorrectly classified as negative (predicted as not churn but actually churn).
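
The four counts above can be read directly off the matrix. This minimal sketch uses the K-Nearest Neighbors matrix from the output and assumes scikit-learn's layout for a binary problem (rows are true labels, columns are predicted labels) with churn encoded as 1.

```python
# KNN confusion matrix: rows are true labels, columns are predicted labels
knn_matrix = [[736, 376],
              [104, 297]]

# For labels 0 (not churn) and 1 (churn) the layout is [[TN, FP], [FN, TP]]
(tn, fp), (fn, tp) = knn_matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}, total={total}, accuracy={accuracy:.3f}")
```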

In [None]:
# Define model names and confusion matrices (values copied from the printed output)
# Named model_names so the earlier `models` list of (name, classifier) pairs is not overwritten
model_names = ['K-Nearest_Neighbors', 'Logistic_Regression', 'Support_Vector_Machine', 'Decision_Tree', 'Random_Forest', 'Gradient_Boosting']
confusion_matrices = [
    [[736, 376], [104, 297]],
    [[800, 312], [85, 316]],
    [[858, 254], [112, 289]],
    [[857, 255], [181, 220]],
    [[947, 165], [182, 219]],
    [[905, 207], [146, 255]]
]

# Plot confusion matrices using heatmaps
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for i, (model, cm) in enumerate(zip(model_names, confusion_matrices)):
    ax = axes[i // 3, i % 3]
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_title(f"Confusion Matrix for {model}")
    ax.set_xlabel('Predicted label')
    ax.set_ylabel('True label')

plt.tight_layout()
plt.show()

: 

##### Considering both the false positive rate (FPR) and the false negative rate (FNR) for each model:
* Logistic Regression and Support Vector Machine (SVM) offer the best balance between the two error rates. Logistic Regression has the lowest FNR (about 21%) with an FPR of about 28%, while SVM keeps both rates below 30% (FPR about 23%, FNR about 28%).

* K-Nearest Neighbors (KNN) also has a relatively low FNR (about 26%) but the highest FPR of all the models (about 34%).

* Decision Tree, Random Forest, and Gradient Boosting have the highest FNRs (about 45%, 45%, and 36% respectively), so they miss more churners, although Random Forest and Gradient Boosting have the lowest FPRs (about 15% and 19%).

* Therefore, Logistic Regression and Support Vector Machine (SVM) appear to be the best models when false positives and false negatives are weighed together.
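
The two rates can be computed from the confusion matrices as FPR = FP / (FP + TN) and FNR = FN / (FN + TP). A minimal sketch using the matrices from the notebook output, assuming the `[[TN, FP], [FN, TP]]` layout with churn as the positive class:

```python
# Confusion matrices copied from the notebook output, laid out as [[TN, FP], [FN, TP]]
confusion_matrices = {
    "K-Nearest_Neighbors": [[736, 376], [104, 297]],
    "Logistic_Regression": [[800, 312], [85, 316]],
    "Support_Vector_Machine": [[858, 254], [112, 289]],
    "Decision_Tree": [[857, 255], [181, 220]],
    "Random_Forest": [[947, 165], [182, 219]],
    "Gradient_Boosting": [[905, 207], [146, 255]],
}

rates = {}
for name, ((tn, fp), (fn, tp)) in confusion_matrices.items():
    rates[name] = {
        "fpr": fp / (fp + tn),  # share of non-churners flagged as churn
        "fnr": fn / (fn + tp),  # share of churners that were missed
    }

# List models from the lowest to the highest combined error rate
for name, r in sorted(rates.items(), key=lambda kv: kv[1]["fpr"] + kv[1]["fnr"]):
    print(f"{name:<24} FPR={r['fpr']:.3f}  FNR={r['fnr']:.3f}")
```

Sorting by the sum of the two rates is only one way to combine them; if missing a churner costs more than a wasted retention offer, the FNR should carry more weight.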

##### Check the sensitivity and specificity of the models

* The goal here is to reduce the false positives and the false negatives.
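
Moving the decision threshold applied to predicted churn probabilities trades one kind of error for the other: a lower threshold catches more churners (higher sensitivity) at the cost of more false alarms (lower specificity). The sketch below illustrates this in pure Python. The labels and probabilities are made up for illustration, and `sensitivity_specificity` is a hypothetical helper, not part of the notebook.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Predict churn when probability >= threshold; return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels and predicted churn probabilities, for illustration only
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]

results = {t: sensitivity_specificity(y_true, y_prob, t) for t in (0.3, 0.5, 0.7)}
for threshold, (sens, spec) in results.items():
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```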

In [None]:
# Define model names and confusion matrices (values copied from the printed output)
# Named model_names so the earlier `models` list of (name, classifier) pairs is not overwritten
model_names = ['K-Nearest_Neighbors', 'Logistic_Regression', 'Support_Vector_Machine', 'Decision_Tree', 'Random_Forest', 'Gradient_Boosting']
confusion_matrices = [
    [[736, 376], [104, 297]],
    [[800, 312], [85, 316]],
    [[858, 254], [112, 289]],
    [[857, 255], [181, 220]],
    [[947, 165], [182, 219]],
    [[905, 207], [146, 255]]
]

# Iterate over models and confusion matrices
for model_name, conf_matrix in zip(model_names, confusion_matrices):
    # Sensitivity (recall, true positive rate): share of actual churners that were caught
    true_positives = conf_matrix[1][1]
    false_negatives = conf_matrix[1][0]
    total_positives = true_positives + false_negatives

    # Sensitivity = TP / (TP + FN); this value is reused as a threshold in the next cell
    sensitivity_threshold = true_positives / total_positives
    print(f"{model_name}: Sensitivity Threshold = {sensitivity_threshold}")

    # Specificity (true negative rate): share of actual non-churners correctly identified
    true_negatives = conf_matrix[0][0]
    false_positives = conf_matrix[0][1]
    total_negatives = true_negatives + false_positives

    # Specificity = TN / (TN + FP); note that this is not the same as precision
    specificity_threshold = true_negatives / total_negatives
    print(f"{model_name}: Specificity Threshold = {specificity_threshold}")

    print()


: 

In [None]:
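# Import the metric functions explicitly so that accuracy_score, precision_score,
# recall_score and f1_score refer to the sklearn functions in this cell
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Sensitivity and specificity of each model, computed from the confusion matrices above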
thresholds = {
    'K-Nearest_Neighbors': {'sensitivity': 0.7406483790523691, 'specificity': 0.6618705035971223},
    'Logistic_Regression': {'sensitivity': 0.7880299251870324, 'specificity': 0.7194244604316546},
    'Support_Vector_Machine': {'sensitivity': 0.7206982543640897, 'specificity': 0.7715827338129496},
    'Decision_Tree': {'sensitivity': 0.5486284289276808, 'specificity': 0.77068345323741},
    'Random_Forest': {'sensitivity': 0.5461346633416458, 'specificity': 0.8516187050359713},
    'Gradient_Boosting': {'sensitivity': 0.6359102244389028, 'specificity': 0.8138489208633094}
}

# Iterate over the models and adjust thresholds
adjusted_models = {}
for model_name, pipeline in all_bf_pipelines.items():
    sensitivity_threshold = thresholds[model_name]['sensitivity']
    specificity_threshold = thresholds[model_name]['specificity']

    # SVC only provides predict_proba when built with probability=True, so swap in
    # such a classifier (same random_state as before) and refit the pipeline
    if model_name == 'Support_Vector_Machine':
        pipeline.set_params(classifier=SVC(probability=True, random_state=42))
        pipeline.fit(X_train, y_train_encoded)

    # Get predicted probabilities for positive class (churn)
    y_pred_prob = pipeline.predict_proba(X_eval)[:, 1]

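    # Predict churn when the predicted probability reaches the model's sensitivity value
    adjusted_y_pred = (y_pred_prob >= sensitivity_threshold).astype(int)

    # Note: this line only assigns 0 to entries that are already 0, so it has no effect;
    # the predictions are determined by the sensitivity threshold alone
    adjusted_y_pred[(y_pred_prob >= specificity_threshold) & (adjusted_y_pred == 0)] = 0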

    # Store adjusted predictions in dictionary
    adjusted_models[model_name] = adjusted_y_pred

# Evaluate adjusted models
adjusted_metrics_output = pd.DataFrame(columns=['model_name', 'accuracy', 'precision', 'recall', 'f1_score'])
for model_name, adjusted_y_pred in adjusted_models.items():
    # Calculate metrics using adjusted predictions
    accuracy = accuracy_score(y_eval_encoded, adjusted_y_pred)
    precision = precision_score(y_eval_encoded, adjusted_y_pred)
    recall = recall_score(y_eval_encoded, adjusted_y_pred)
    f1 = f1_score(y_eval_encoded, adjusted_y_pred)

    # Add metrics to adjusted_metrics_output
    adjusted_metrics_output.loc[len(adjusted_metrics_output)] = [model_name, accuracy, precision, recall, f1]

# Display adjusted metrics
print(adjusted_metrics_output)


: 