<a href="https://colab.research.google.com/github/KwameSegbe/customer_churn/blob/master/Customer_Churn_Projects_Telecommunication_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NAME: FRANCIS KWAME SEGBE
# PROJECT 4
# CUSTOMER ANALYTICS


# Project Overview and Understanding of the Problem
This project aims to learn the patterns that exist among various customer groups and how these patterns affect the company's growth.

Customer churn analysis is key for boosting retention and revenue. By studying behaviors and attributes of struggling vs loyal customers, companies gain insight to improve product-market fit, manage competition, and nurture loyalty by addressing dissatisfaction before churn. This project harnesses churn analytics to uncover issues driving cancellations and build targeted solutions recapturing lost revenue through strategies like price evaluation, underperforming feature refinement, and realigning customer service. Our analysis will quantify satisfaction, lifetime value, retention drivers, and simulate retention improvements under proposed enhancements. Preemptive actions guided by analytics have achieved 30-50% customer retention boosts – that is the potential we can unlock with a data-driven, customer-centric approach to churn. By determining predictive indicators of abandonment and contrasting struggling and loyal user trends, we can construct personalized interventions to showcase the value of our service.

In [None]:
# !pip install pandas-profiling

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# from pandas_profiling import ProfileReport
import plotly.offline as py

In [None]:
# Loading the dataset.
data = pd.read_csv("Customer_Churn_Dataset.csv")
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


# Exploratory Data Analysis and Data Cleaning

In [None]:
# #Pandas Profiling
# profile = ProfileReport(data)
# profile
# # profile.to_file(output_file='report.html')

We prepared a profiling report so that anyone who wants some prior knowledge of the dataset we are about to work with can get a quick overview. It gave us a first glance at the data and at how we could go about the rest of the process.

In [None]:
# Print the column names, non-null counts, and dtypes for a brief summary of the dataset.
# This is a lighter-weight alternative to the pandas_profiling report.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [None]:
# Print the shape of the dataset
# to understand its structure and size (rows, columns).
data.shape

(7043, 21)

In [None]:
# Convert Churn column. This is to help the machine learning model understand our data.
data['Churn'] = data['Churn'].replace({'Yes': 1, 'No': 0})

print(data['Churn'].head())

0    0
1    0
2    1
3    0
4    1
Name: Churn, dtype: int64


Above, we binarized the Churn column so that we can work with it: every "Yes" was converted to 1 and every "No" was converted to 0. This puts the target in a numeric form that a machine learning model can understand and interpret.
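
As a hedged alternative to `replace`, the same conversion can be sketched with `Series.map`, which turns any unexpected label into NaN so that it can be caught early. The toy series below is illustrative and is not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the Churn column (illustrative values only).
churn = pd.Series(["No", "No", "Yes", "No", "Yes"], name="Churn")

# Series.map returns NaN for any label missing from the dictionary,
# so unexpected values surface instead of passing through unchanged.
churn_binary = churn.map({"Yes": 1, "No": 0})

if churn_binary.isna().any():
    raise ValueError("Churn contains labels other than 'Yes'/'No'")

churn_binary = churn_binary.astype("int64")
print(churn_binary.tolist())
```

The explicit check is a design choice: it fails loudly if the column ever contains a label such as "yes" or a blank, rather than leaving mixed types behind.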

In [None]:
# Convert "No internet service" to "No" in the service add-on columns listed below.
# MultipleLines uses "No phone service" instead, so it is handled in the next cell.
cols = ['OnlineSecurity','StreamingTV','DeviceProtection','TechSupport','OnlineBackup','StreamingMovies','MultipleLines']

for col in cols:
    data.loc[data[col] == 'No internet service', col] = 'No'

print(data[cols].head())

  OnlineSecurity StreamingTV DeviceProtection TechSupport OnlineBackup  \
0             No          No               No          No          Yes   
1            Yes          No              Yes          No           No   
2            Yes          No               No          No          Yes   
3            Yes          No              Yes         Yes           No   
4             No          No               No          No           No   

  StreamingMovies     MultipleLines  
0              No  No phone service  
1              No                No  
2              No                No  
3              No  No phone service  
4              No                No  


In [None]:
data.loc[data['MultipleLines'] == 'No phone service', 'MultipleLines'] = 'No'

print(data['MultipleLines'].unique())

['No' 'Yes']


We cleaned the dataset by making the values in the selected columns consistent. These columns recorded the absence of a service in two different ways ("No" and "No internet service"), so we converted "No internet service" to "No". Subsequently, we did the same for the "MultipleLines" column, converting "No phone service" to "No".
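
A minimal sketch of the same clean-up done in one step, assuming a toy frame with two of the affected columns. Passing a value-to-value dictionary to `DataFrame.replace` handles both placeholder labels at once.

```python
import pandas as pd

# Toy frame with the two kinds of placeholder labels (illustrative values).
toy = pd.DataFrame({
    "OnlineSecurity": ["No", "Yes", "No internet service"],
    "MultipleLines": ["No phone service", "No", "Yes"],
})

service_cols = ["OnlineSecurity", "MultipleLines"]

# Map both placeholder labels to a plain "No" across the selected columns.
toy[service_cols] = toy[service_cols].replace(
    {"No internet service": "No", "No phone service": "No"}
)

print(toy)
```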

In [None]:
# Replace spaces with NaN
data['TotalCharges'] = data['TotalCharges'].replace(' ', np.nan)

# Drop rows with NaN in TotalCharges
data.dropna(subset=['TotalCharges'], inplace=True)

# Converting Total charge to float
data['TotalCharges'] = data['TotalCharges'].astype('float32')

Further investigation showed that the "TotalCharges" column contained blank entries (a single space). To be able to work with the column, we replaced those blanks with NaN and dropped the affected rows, which reduced the dataset from 7,043 to 7,032 rows.
Additionally, we converted the data type of TotalCharges to floating-point numbers.
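
The three steps above can also be sketched with `pd.to_numeric`, which coerces anything unparseable (including a blank string) to NaN in one call. The values below are a toy sample, not the full column.

```python
import pandas as pd

# Toy column stored as strings, with one blank entry (illustrative values).
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "108.15"]})

# errors="coerce" converts unparseable entries to NaN instead of raising.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Count the rows that will be lost before dropping them.
n_missing = int(toy["TotalCharges"].isna().sum())
toy = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)

print(n_missing, toy["TotalCharges"].dtype)
```

Counting the missing rows before dropping them makes it easy to report how much data the cleaning step removed.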

In [None]:
churn_counts = data['Churn'].value_counts()
churn_counts

0    5163
1    1869
Name: Churn, dtype: int64

# Data Visualization & Exploration
Over the next few sections we explore the data in several ways.
We look at churn across various categories to find out which categories have the highest churn rates.

In [None]:
# visualizing our data
import plotly.express as px

churn_counts = data['Churn'].value_counts()

fig = px.pie(values=churn_counts.values,
             names=churn_counts.index,
             title='Percentage of Churn Customers',
             color_discrete_sequence=['lightgrey', 'red'])

fig.show()

From the above we can see that about 26.6% of the customers in our dataset churned (1,869 of 7,032).
This gives us a sneak peek at the scale of the problem: losing roughly a quarter of our customers is alarming.
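
The percentage can be checked directly from the class counts printed earlier (5,163 non-churners and 1,869 churners). This is a small sketch that rebuilds those counts by hand rather than reading the CSV.

```python
import pandas as pd

# Class counts as printed by data['Churn'].value_counts() earlier in the notebook.
churn_counts = pd.Series({0: 5163, 1: 1869}, name="Churn")

# Share of each class in the cleaned dataset.
churn_share = churn_counts / churn_counts.sum()

print(churn_share.round(4))
```

On the real DataFrame, `data['Churn'].value_counts(normalize=True)` gives the same shares in one call.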

In [None]:
import plotly.express as px

fig = px.histogram(data, x='gender', color='Churn',
                   labels={'gender': 'Gender'},
                   color_discrete_sequence=['seagreen', 'lightsalmon'],
                   title='Gender vs Churn')

fig.update_layout(xaxis_title='Gender',
                  yaxis_title='Count',
                  legend_title='Churn',
                  legend_orientation='h')

for i in range(len(fig.data)):
    fig.data[i].name = 'No' if i==0 else 'Yes'

fig.show()

Here we can see that men were slightly more likely to churn than women.

In [None]:
import plotly.express as px

# The mean of the 0/1 Churn column per group is that group's churn rate.
tech_churn = data.groupby('TechSupport', as_index=False)['Churn'].mean()

# Colors must be CSS color strings.
bright_colors = ['red', 'blue', 'yellow']

fig = px.bar(tech_churn, x='TechSupport', y='Churn',
             title='Churn Rate by Tech Support Service',
             color_discrete_sequence=bright_colors,
             text='Churn')

# Show each bar's churn rate as a percentage and outline the bars.
fig.update_traces(texttemplate='%{text:.2%}', textposition='outside',
                  marker_line_color='rgb(8,48,107)', marker_line_width=1.5)

fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',
                  xaxis={'categoryorder': 'total descending'})

fig.show()

We can see that customers who had tech support were less likely to churn than those who had no tech support.

In [None]:
import plotly.express as px

# The mean of the 0/1 Churn column per group is that group's churn rate.
internet_churn = data.groupby('InternetService', as_index=False)['Churn'].mean()

# Display names for the x-axis categories.
categories = {'DSL': 'DSL', 'Fiber optic': 'Fiber', 'No': 'No Internet Service'}
internet_churn['InternetService'] = internet_churn['InternetService'].map(categories)

# Colors must be CSS color strings.
bright_colors = ['red', 'blue', 'yellow']

fig = px.bar(internet_churn, x='InternetService', y='Churn',
             title='Churn Rate by Internet Service',
             color_discrete_sequence=bright_colors,
             text='Churn')

# Show each bar's churn rate as a percentage and outline the bars.
fig.update_traces(texttemplate='%{text:.2%}', textposition='outside',
                  marker_line_color='rgb(8,48,107)', marker_line_width=1.5)

fig.update_layout(
        xaxis_title=None,
        yaxis_title=None,
        uniformtext_minsize=8, uniformtext_mode='hide',
        xaxis={'categoryorder': 'total descending'})

fig.show()

We further explored churn rate by internet service type. Of the three groups (DSL, fiber optic, and no internet service), customers with no internet service were the least likely to churn, while customers with fiber optic were the most likely to churn.

In [None]:
import plotly.express as px

colors = ['lightblue', 'red', 'seagreen']

# The mean of the 0/1 Churn column per group is that group's churn rate.
payment_churn = data.groupby('PaymentMethod', as_index=False)['Churn'].mean()

fig = px.bar(payment_churn, x='PaymentMethod', y='Churn',
             title='Churn Rate by Payment Method',
             color_discrete_sequence=colors,
             text='Churn')

# Show each bar's churn rate as a percentage and outline the bars.
fig.update_traces(texttemplate='%{text:.2%}', textposition='outside',
                  marker_line_color='rgb(8,48,107)', marker_line_width=1.5)

fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide',
                  xaxis={'categoryorder': 'total descending'})

fig.show()

We also studied the churn rate for the various payment methods. Customers who paid by electronic check were the most likely to churn, while customers in the credit card category were less likely to churn. One goal could therefore be to encourage more customers to move to credit card payment.
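
Per-category churn rates like the ones discussed above can also be read as a table. The sketch below uses `pd.crosstab` with `normalize="index"` on a small toy frame; the rows are invented for illustration and do not reflect the real rates.

```python
import pandas as pd

# Toy rows (illustrative only): payment method and 0/1 churn flag.
toy = pd.DataFrame({
    "PaymentMethod": ["Electronic check", "Electronic check", "Mailed check",
                      "Credit card (automatic)", "Credit card (automatic)",
                      "Electronic check"],
    "Churn": [1, 0, 0, 0, 0, 1],
})

# normalize="index" makes each row sum to 1, so column 1 is the churn rate
# within each payment method.
rate_table = pd.crosstab(toy["PaymentMethod"], toy["Churn"], normalize="index")

print(rate_table.round(3))
```

A table of within-group rates avoids confusing a category's size with its churn rate, which is easy to do when bars show raw counts.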

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cat_cols = ["gender","Partner","Dependents","PhoneService",
            "MultipleLines","InternetService","OnlineSecurity",
            "OnlineBackup","DeviceProtection","TechSupport",
            "StreamingTV","StreamingMovies","Contract",
            "PaperlessBilling","PaymentMethod","Churn"]

# Encode each categorical column as integer codes (classes are sorted alphabetically).
for col in cat_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])

print(data[cat_cols].head())

   gender  Partner  Dependents  PhoneService  MultipleLines  InternetService  \
0       0        1           0             0              0                0   
1       1        0           0             1              0                0   
2       1        0           0             1              0                0   
3       1        0           0             0              0                0   
4       0        0           0             1              0                1   

   OnlineSecurity  OnlineBackup  DeviceProtection  TechSupport  StreamingTV  \
0               0             1                 0            0            0   
1               1             0                 1            0            0   
2               1             1                 0            0            0   
3               1             0                 1            1            0   
4               0             0                 0            0            0   

   StreamingMovies  Contract  PaperlessBilli

We encoded the categorical columns as integers to simplify their representation and make them easier for machine learning algorithms to process. This is a common preprocessing step to ensure that the data is in a format suitable for predictive modeling. We used label encoding, which converts "Yes" and "No" to 1 and 0. Columns with more than two categories (InternetService, Contract, PaymentMethod) receive codes 0, 1, 2, and so on, so strictly speaking they are not binarized, and the codes imply an ordering that the categories do not really have. These encoding steps are a crucial part of machine learning preprocessing.
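
To illustrate the difference between label encoding and a true binarization, here is a hedged sketch on a toy Contract-like column. The values are illustrative, and `pd.get_dummies` is shown as one possible alternative, not as the method used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy multi-category column (illustrative values).
contract = pd.Series(
    ["Month-to-month", "One year", "Two year", "One year"], name="Contract"
)

# Label encoding: one integer per category, assigned in sorted order.
codes = LabelEncoder().fit_transform(contract)

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(contract, prefix="Contract", dtype=int)

print(list(codes))
print(one_hot)
```

Tree-based models are usually indifferent to the choice, while linear models such as logistic regression treat label codes as magnitudes, which is why one-hot encoding is often preferred for them.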

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load data
# df = pd.read_csv('telco_churn.csv')

# Define columns to scale
cols_to_scale = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Create scaler
scaler = MinMaxScaler()

# Fit and transform selected columns
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])

# Verify scaled values
print(data[cols_to_scale].head())

     tenure  MonthlyCharges  TotalCharges
0  0.000000        0.115423      0.001275
1  0.464789        0.385075      0.215867
2  0.014085        0.354229      0.010310
3  0.619718        0.239303      0.210241
4  0.014085        0.521891      0.015330


Above, we normalized the selected features (tenure, MonthlyCharges, and TotalCharges) by rescaling each of them to the range 0 to 1. Putting the features on a common scale can improve model performance, particularly for models that are sensitive to feature magnitude.
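
Min-max scaling maps each value x to (x - min) / (max - min). The notebook fits the scaler on the full dataset; a common variant, sketched below on synthetic data, fits it on the training rows only so that no information from the test set leaks into preprocessing. All array names here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and labels (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(20, 3))
y_toy = rng.integers(0, 2, size=20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

scaler = MinMaxScaler()
# Learn each column's min and max from the training rows only.
X_tr_scaled = scaler.fit_transform(X_tr)
# Reuse those bounds on the test rows; values may fall slightly outside [0, 1].
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```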

In [None]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,0,0,1,0,0.0,0,0,0,0,...,0,0,0,0,0,1,2,0.115423,0.001275,0
1,5575-GNVDE,1,0,0,0,0.464789,1,0,0,1,...,1,0,0,0,1,0,3,0.385075,0.215867,0
2,3668-QPYBK,1,0,0,0,0.014085,1,0,0,1,...,0,0,0,0,0,1,3,0.354229,0.01031,1
3,7795-CFOCW,1,0,0,0,0.619718,0,0,0,1,...,1,1,0,0,1,0,0,0.239303,0.210241,0
4,9237-HQITU,0,0,0,0,0.014085,1,0,1,0,...,0,0,0,0,0,1,2,0.521891,0.01533,1


In [None]:
data.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

# Training our Model

In [None]:
# Target variable (y) and feature matrix (X)
y = data['Churn']
X = data.drop(columns=['Churn','customerID'])

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [None]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(4711, 19) (4711,)
(2321, 19) (2321,)


In [None]:
# Import metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

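# Split data (70/30). This re-split replaces the 67/33 split from the previous cell,
# so the reports below are computed on 2,110 test rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)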

# Decision Tree Model
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

y_dt_pred = dt_model.predict(X_test)

print("Decision Tree Model")
print(classification_report(y_test, y_dt_pred))
print(confusion_matrix(y_test, y_dt_pred))
print(accuracy_score(y_test, y_dt_pred))

# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

y_lr_pred = lr_model.predict(X_test)

print("\nLogistic Regression Model")
print(classification_report(y_test, y_lr_pred))
print(confusion_matrix(y_test, y_lr_pred))
print(accuracy_score(y_test, y_lr_pred))

# Random Forest
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

y_rf_pred = rf_model.predict(X_test)

print("\nRandom Forest Model")
print(classification_report(y_test, y_rf_pred))
print(confusion_matrix(y_test, y_rf_pred))
print(accuracy_score(y_test, y_rf_pred))

# Naive Bayes Model
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

y_nb_pred = nb_model.predict(X_test)

print("\nNaive Bayes Model")
print(classification_report(y_test, y_nb_pred))
print(confusion_matrix(y_test, y_nb_pred))
print(accuracy_score(y_test, y_nb_pred))

Decision Tree Model
              precision    recall  f1-score   support

           0       0.82      0.80      0.81      1549
           1       0.48      0.50      0.49       561

    accuracy                           0.72      2110
   macro avg       0.65      0.65      0.65      2110
weighted avg       0.73      0.72      0.73      2110

[[1244  305]
 [ 278  283]]
0.723696682464455

Logistic Regression Model
              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1549
           1       0.64      0.52      0.57       561

    accuracy                           0.80      2110
   macro avg       0.74      0.71      0.72      2110
weighted avg       0.79      0.80      0.79      2110

[[1388  161]
 [ 270  291]]
0.795734597156398

Random Forest Model
              precision    recall  f1-score   support

           0       0.82      0.89      0.86      1549
           1       0.61      0.47      0.53       561

    accuracy            

# Model Performance Evaluation & Deployment Suggestions
From the evaluation metrics above, the Logistic Regression model performs best overall for this dataset and prediction task.

It has the highest overall accuracy at 80%, compared to 72-78% for the other models. Higher accuracy means a larger share of correct predictions.
Its precision and recall are well balanced across both classes, indicating a good ability to predict both churn and non-churn correctly.
Its macro F1 score, the unweighted average of the per-class F1 scores (each being the harmonic mean of precision and recall), is the highest at 0.72.
For the key minority positive class (churn, class 1), Logistic Regression has higher recall (0.52) than Decision Tree (0.50) and Random Forest (0.47). Correctly identifying churn is critical.
While precision is slightly higher for Naive Bayes, its very low recall for class 1 shows that it struggles to detect churn. Decision Tree's low precision (0.48) indicates many false positives.

In conclusion, logistic regression strikes the best balance of churn detection ability, precision, recall, and F1 score, making it the top-performing model for the goal of identifying customers likely to churn.
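
The churn-class figures quoted for Logistic Regression can be re-derived from its confusion matrix printed above (`[[1388, 161], [270, 291]]`). The sketch below does the arithmetic by hand; the variable names are our own.

```python
# Confusion matrix for Logistic Regression as printed above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
tn, fp = 1388, 161
fn, tp = 270, 291

# Precision: of the customers predicted to churn, how many actually churned.
precision = tp / (tp + fp)
# Recall: of the customers who actually churned, how many were detected.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall for the churn class.
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
```

The recall of about 0.52 means that roughly half of the actual churners are missed at the default decision threshold, which is worth keeping in mind when the predictions are used for retention campaigns.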

We will use a robust data pipeline to deploy the top-performing logistic regression model on scalable infrastructure, integrate its churn-likelihood forecasts into business systems to enable timely retention campaigns, and continuously evaluate it against the latest customer data to ensure sustained effectiveness. The model's predictions will then serve as our predictive indicators of churn.
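
As one possible starting point for such forecasts, the sketch below ranks customers by predicted churn probability using `predict_proba`. It uses synthetic data and hypothetical names; it is not the notebook's trained model and says nothing about the eventual infrastructure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features and labels standing in for customer data (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of class 1 (churn).
churn_prob = model.predict_proba(X_toy)[:, 1]

# Indices of the 10 customers with the highest estimated churn probability.
top_idx = np.argsort(churn_prob)[::-1][:10]

print(top_idx, churn_prob[top_idx].round(3))
```

Ranking by probability, rather than using the hard 0/1 prediction, lets a retention team choose how many customers to contact based on its budget.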

# Summary
Our analysis revealed that about 26.6% of the customers in the dataset churned, which points to a critical challenge for customer retention. Men showed a slightly higher likelihood to churn than women, and customers with TechSupport were less likely to churn. Further exploration highlighted that people with no internet service were the least likely to churn, while those with fiber optic were the most likely. We also examined payment methods, which indicated that customers paying by electronic check were more likely to churn. This provides an opportunity to encourage credit card payments. In terms of model performance, the Logistic Regression model stands out with the highest accuracy at 80%, well-balanced precision and recall, and the top macro F1 score of 0.72. In conclusion, Logistic Regression demonstrates the best balance for identifying customers likely to churn, making it the top-performing model for the project's goal.