# Customer Churn

# Table of Contents

- [Problem Statement](#Problem-Statement)
- [Imports](#Imports)
- [Data Loading](#Data-Loading)
- [Data Merging](#Data-Merging)
- [Data Cleaning](#Data-Cleaning)
- [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-\(EDA\))
  - [Data Overview](#Data-Overview)
  - [Null/NaN Values](#Null/NaN-Values)
  - [Data Visualization](#Data-Visualization)
- [Data Preprocessing](#Data-Preprocessing)
  - [Categorical to Numerical Conversion](#Categorical-to-Numerical-Conversion)
  - [Visualization of Converted Features](#Visualization-of-Converted-Features)
    - [Cluster Map](#Cluster-Map)
    - [t-SNE Visualization](#t-SNE-Visualization)
- [Preliminary Baseline Model](#Preliminary-Baseline-Model)
- [Advanced Model Development](#Advanced-Model-Development)
  - [Feature Engineering and Selection](#Feature-Engineering-and-Selection)
    - [Feature Importance Analysis](#Feature-Importance-Analysis)
    - [Feature Selection](#Feature-Selection)
    - [Dimensionality Reduction](#Dimensionality-Reduction)
  - [Handling Class Imbalance](#Handling-Class-Imbalance)
  - [Model Comparisons](#Model-Comparisons)
  - [Model Tuning](#Model-Tuning)
- [Model Evaluation](#Model-Evaluation)
- [Conclusion](#Conclusion)

# Problem Statement

Customer churn is a significant concern for any business, especially in the telecommunications industry, because the churn rate directly affects a company's revenue and market share. In this project, we work with a dataset from Telco, a fictional telecommunications company. The data covers 7043 customers in California for Q3, including those who left, stayed, or signed up for the service.

The aim of this project is to understand and predict customer churn: identify the factors that lead customers to churn, then use that understanding to predict whether a given customer is likely to churn in the future. These insights will help the company develop effective customer retention strategies and improve customer satisfaction.

To conduct this analysis, we are provided with multiple data points for each customer, including but not limited to demographics, location, services used, customer satisfaction score, churn score, and Customer Lifetime Value (CLTV).

The Telco data module comprises five main categories:

1. Demographics - Customer's unique ID, gender, age, marital status, dependents, etc.
2. Location - Details about the customer's primary residence.
3. Population - Estimated population of the customer's zip code area.
4. Services - Details about the services a customer is using and their billing information.
5. Status - This includes customer satisfaction score, churn label, churn value, churn score, CLTV, and reason for churn if applicable.

# Imports

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.manifold import TSNE

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

# Data Loading

In [None]:
os.chdir('..')

In [None]:
# Each workbook holds one of the Telco tables described in the problem statement
cust_churn = pd.read_excel('data/Telco_customer_churn.xlsx')
demo = pd.read_excel('data/Telco_customer_churn_demographics.xlsx')
location = pd.read_excel('data/Telco_customer_churn_location.xlsx')
population = pd.read_excel('data/Telco_customer_churn_population.xlsx')
services = pd.read_excel('data/Telco_customer_churn_services.xlsx')
status = pd.read_excel('data/Telco_customer_churn_status.xlsx')

# Data Merging

In [None]:
# Combine on Customer ID
churn_all = demo.merge(location, on='Customer ID')
churn_all = churn_all.merge(services, on='Customer ID')

# Include the 'Churn' column and any other relevant columns from the 'status' dataframe.
churn_all = churn_all.merge(status, on='Customer ID')
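
Because several tables share column names such as `Count`, a chain of merges on `Customer ID` can produce ambiguous `_x`/`_y` suffixes. The following is a minimal sketch, on made-up toy tables rather than the real workbooks, of how a merge could be made more defensive. The names `demo_toy`, `status_toy`, and the suffix labels are hypothetical.

```python
import pandas as pd

# Toy stand-ins for two Telco tables (values are invented for illustration)
demo_toy = pd.DataFrame({'Customer ID': ['A', 'B', 'C'], 'Count': [1, 1, 1], 'Age': [34, 51, 27]})
status_toy = pd.DataFrame({'Customer ID': ['A', 'B', 'C'], 'Count': [1, 1, 1], 'Churn Value': [0, 1, 0]})

merged_toy = demo_toy.merge(
    status_toy,
    on='Customer ID',
    how='inner',
    validate='one_to_one',          # raises MergeError if a Customer ID repeats on either side
    suffixes=('_demo', '_status'),  # explicit suffixes instead of the default _x / _y
)

# An inner join silently drops unmatched customers, so compare row counts
assert len(merged_toy) == len(demo_toy)
print(merged_toy.columns.tolist())
```

Explicit suffixes make it easier to see which table a duplicated column came from when deciding what to drop during cleaning.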

# Data Cleaning

In [None]:
# Drop duplicated counts
churn_all.drop(['Count_x', 'Count_y'], axis=1, inplace=True)

# Drop all uniform columns
churn_all.drop(['Country', 'State', 'Quarter_x', 'Quarter_y'], axis=1, inplace=True)

# Drop columns that are redundant or unable to be converted to numerics
churn_all.drop(['Location ID', 'Service ID', 'Customer ID', 'Lat Long', 'City', 'Status ID'], axis=1, inplace=True)
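
The uniform columns above are listed by hand. As a hedged alternative, constant columns can also be detected programmatically. This sketch uses a small invented frame (`toy`) rather than `churn_all`, and the column values are assumptions for illustration only.

```python
import pandas as pd

# Invented example frame; 'Country' and 'State' hold a single value each
toy = pd.DataFrame({
    'Country': ['United States'] * 4,
    'State': ['California'] * 4,
    'Zip Code': [90001, 90002, 90003, 90004],
    'Age': [34, 51, 27, 51],
})

# A column with exactly one distinct value carries no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
toy_clean = toy.drop(columns=constant_cols)

print('Dropped:', constant_cols)
```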

# Exploratory Data Analysis (EDA)

## Data Overview

In [None]:
# Print the shape of the dataframe
print('Number of rows: ', churn_all.shape[0])
print('Number of columns: ', churn_all.shape[1])

In [None]:
# Look at first 5 rows
churn_all.head()

In [None]:
# Summary of the dataframe
churn_all.info()

In [None]:
# Summary statistics for numerical columns
churn_all.describe()

## Null/NaN Values

In [None]:
# Check for missing values
missing_values = churn_all.isnull().sum()
print('Missing values per column:\n', missing_values[missing_values > 0])

Looking into Churn Category & Churn Reason.

***Assumption***: The 5174 NaN values correspond to the cases of no churn (the customer has not left), so no churn category or reason exists for them.

In [None]:
# Subset the data where 'Churn Category' and 'Churn Reason' are null
missing_churn_info = churn_all[churn_all['Churn Category'].isnull() & churn_all['Churn Reason'].isnull()]

# Check the 'Churn Label' column of this subset
churn_labels_in_missing = missing_churn_info['Churn Label'].value_counts()

print(churn_labels_in_missing)

***Assumption is confirmed.***

All Null/NaN values occur in the Churn Category and Churn Reason columns, and only in rows where the Churn Label is 'No', meaning the customer has not churned.

To handle this, we impute a new category for the NaN values in each column:

***Churn Category*** will be called ***'Retention'***

***Churn Reason*** will be called ***'Customer Retained'***

In [None]:
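# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify churn_all
churn_all['Churn Category'] = churn_all['Churn Category'].fillna('Retention')
churn_all['Churn Reason'] = churn_all['Churn Reason'].fillna('Customer Retained')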
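
The check-then-impute logic can be illustrated end to end on a tiny invented frame. This is only a sketch under the assumption that missing categories should always belong to non-churned customers; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented rows: NaN categories appear only where the label is 'No'
toy = pd.DataFrame({
    'Churn Label': ['No', 'Yes', 'No', 'Yes'],
    'Churn Category': [np.nan, 'Competitor', np.nan, 'Price'],
})

# Guard: every missing category must belong to a customer who did not churn
missing = toy['Churn Category'].isna()
assert (toy.loc[missing, 'Churn Label'] == 'No').all()

# Impute the placeholder category and assign the result back to the column
toy['Churn Category'] = toy['Churn Category'].fillna('Retention')
print(toy)
```

Keeping the assertion next to the imputation means the fill would fail loudly if a churned customer ever had a missing category.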

## Data Visualization

In [None]:
# Histogram of the target variable
sns.histplot(churn_all['Churn Label'])

In [None]:
# Bar plots for categorical variables
categorical_cols = churn_all.select_dtypes(include=['object']).columns
for col in categorical_cols:
    plt.figure(figsize=(10,4))
    sns.countplot(x=col, data=churn_all)
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# Correlation heatmap for numerical variables
numerical_cols = churn_all.select_dtypes(include=['int64', 'float64']).columns
plt.figure(figsize=(12,10))
sns.heatmap(churn_all[numerical_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation heatmap')
plt.show()
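
A heatmap can be hard to read when the goal is simply to find the feature most correlated with the target. The sketch below ranks features by absolute correlation with `Churn Value` on an invented frame; the numbers are made up and do not come from the Telco data.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    'Satisfaction Score': [1, 2, 5, 4, 1, 3, 5, 2],
    'Monthly Charge': [90, 40, 30, 45, 70, 60, 25, 95],
    'Age': [40, 35, 52, 29, 61, 45, 33, 38],
    'Churn Value': [1, 1, 0, 0, 1, 0, 0, 1],
})

# Correlation of every numeric column with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)['Churn Value']
    .drop('Churn Value')
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.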

# Preliminary Baseline Model

## Selecting 'Satisfaction Score' as a Single Feature for Baseline Model

In this section, we will build a preliminary baseline model using the feature that's most highly correlated with our target variable, `Churn Value`. The feature we'll use is `Satisfaction Score`. 

By starting with a simple model, we can establish a baseline level of performance to compare with more complex models that we'll build later. 

In [None]:
# Modeling
predictive_features = ['Satisfaction Score']
X = churn_all[predictive_features]
y = churn_all['Churn Value']

# Divide data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 'Satisfaction Score' is already numeric, so no column selection is needed.
# Standardize the feature before fitting the logistic regression.

scaler = StandardScaler()

# Fit on training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit the model 
logreg = LogisticRegression(max_iter=500)
logreg.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = logreg.predict(X_test)
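
The scale-then-fit steps above can also be bundled so that the scaler is always fitted on the training split only. The following is a sketch using scikit-learn's `make_pipeline` on synthetic data. The data-generating rule (low scores are more likely to churn) is an assumption made for illustration, not a property taken from the Telco dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: scores from 1 to 5, with low scores more likely to churn
scores = rng.integers(1, 6, size=500)
churned = (scores + rng.normal(0, 0.8, size=500) < 2.5).astype(int)

X_toy = scores.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, churned, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data only, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_tr, y_tr)

print('Test accuracy:', pipe.score(X_te, y_te))
```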

After fitting the model, we'll look at the classification report, confusion matrix, and accuracy score to evaluate how well our model performed.

## Model Evaluation

In [None]:
# Print classification report
print(classification_report(y_test, y_pred))

In [None]:
# Print confusion matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# Print accuracy score
print('Accuracy score: ', accuracy_score(y_test, y_pred))

### Classification Report

- Precision: When our model predicts a customer will churn (1) or not churn (0), it's correct 100% and 92% of the time, respectively. This indicates our model is quite reliable when it predicts churn.
  
- Recall: For the actual churn cases, our model is able to correctly identify 77% of them. However, for the non-churn cases, our model identifies them perfectly (100%). This suggests that our model might be better at identifying non-churn cases than churn cases.

- F1-score: The F1-score is the harmonic mean of precision and recall. The closer to 1, the better. Our model has an F1-score of 0.87 for churn cases (1) and 0.96 for non-churn cases (0). This reinforces the idea that our model is slightly better at predicting non-churn cases.

- Accuracy: The accuracy of our model is 94%, which suggests that it correctly predicted the churn status of customers 94% of the time. This is a good sign of our model's overall effectiveness.

### Confusion Matrix

The confusion matrix gives us a more granular view of the model's performance:

- True negatives (top-left square): The model correctly predicted that 1036 customers would not churn.
- False negatives (bottom-left square): The model incorrectly predicted that 87 customers would not churn, but they actually did.
- True positives (bottom-right square): The model correctly predicted that 286 customers would churn.
- False positives (top-right square): The model never predicted churn for a customer who actually stayed, so there are 0 false positives.

Our model seems to have some difficulty in detecting customers who are going to churn, as indicated by the number of false negatives.
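
The figures in the classification report follow directly from these four counts. The short calculation below reproduces them, using only the confusion matrix values reported above.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 1036, 0, 87, 286

# Churn class (1)
precision_churn = tp / (tp + fp)
recall_churn = tp / (tp + fn)
f1_churn = 2 * precision_churn * recall_churn / (precision_churn + recall_churn)

# Non-churn class (0)
precision_stay = tn / (tn + fn)
recall_stay = tn / (tn + fp)
f1_stay = 2 * precision_stay * recall_stay / (precision_stay + recall_stay)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'Churn:     precision={precision_churn:.2f} recall={recall_churn:.2f} f1={f1_churn:.2f}')
print(f'Non-churn: precision={precision_stay:.2f} recall={recall_stay:.2f} f1={f1_stay:.2f}')
print(f'Accuracy:  {accuracy:.3f}')
```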

### Accuracy Score

Finally, the overall accuracy of our model is about 93.8%. This score is consistent with the accuracy metric from the classification report and confirms that our model performs well.

However, despite its good overall accuracy, we might want to try to improve its performance on predicting churn customers, which currently stands at 77% (recall for churn). This could involve including more features in our model, tweaking the model's parameters, or trying different algorithms. 
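
One possible way to trade precision for churn recall, not used in this notebook, is to lower the decision threshold applied to the predicted probabilities. The sketch below demonstrates the idea on synthetic data. The data-generating rule and the 0.3 threshold are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic single-feature data with a minority positive class
x_toy = rng.normal(size=(1000, 1))
y_toy = (x_toy[:, 0] + rng.normal(scale=1.0, size=1000) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=500).fit(X_tr, y_tr)

# Probability of the positive class for each test row
churn_prob = model.predict_proba(X_te)[:, 1]

# predict() uses a 0.5 cut-off; a lower cut-off flags more rows as positive
recall_default = recall_score(y_te, (churn_prob >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (churn_prob >= 0.3).astype(int))

print('Recall at 0.5:', recall_default)
print('Recall at 0.3:', recall_lowered)
```

Lowering the threshold can never reduce recall, but it usually introduces false positives, so the choice depends on the relative cost of missing a churner versus contacting a loyal customer.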

# Data Preprocessing

## Categorical to Numerical Conversion

In [None]:
# Convert No/Yes to 0/1
yes_no_cols = [
    'Under 30', 'Senior Citizen', 'Married', 'Dependents', 'Referred a Friend',
    'Multiple Lines', 'Internet Service', 'Online Security', 'Online Backup',
    'Device Protection Plan', 'Premium Tech Support', 'Streaming TV',
    'Streaming Movies', 'Streaming Music', 'Unlimited Data', 'Paperless Billing',
    'Churn Label',
]
for col in yes_no_cols:
    churn_all[col] = churn_all[col].map({'Yes': 1, 'No': 0})

# Convert Gender to a Binary
churn_all['Gender'] = churn_all['Gender'].map(dict(Female=1, Male=0))

In [None]:
# One-hot encode the remaining categoricals (their values have no natural numeric order)
churn_all = pd.get_dummies(churn_all, columns=['Offer','Phone Service', 'Internet Type', 'Contract', 'Payment Method', 'Customer Status', 'Churn Category', 'Churn Reason'])

Check that all object columns have been converted to numeric types.

In [None]:
if churn_all.select_dtypes(include=['object']).shape[1] == 0:
    print("All object columns successfully converted to numerics!")
else:
    print("Some object columns are still not converted.")
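
The two conversion steps can be seen in miniature on an invented frame. This sketch assumes a binary Yes/No column and one multi-valued categorical column; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Invented example with one binary and one multi-valued categorical column
toy = pd.DataFrame({
    'Married': ['Yes', 'No', 'Yes'],
    'Contract': ['Month-to-Month', 'One Year', 'Two Year'],
})

# Binary mapping: any value outside the dictionary would become NaN
toy['Married'] = toy['Married'].map({'Yes': 1, 'No': 0})
assert toy['Married'].notna().all()

# One-hot encoding: one indicator column per category value
encoded = pd.get_dummies(toy, columns=['Contract'], dtype=int)

print(encoded.columns.tolist())
```

Checking for NaN after `map` is a cheap way to catch unexpected values such as a third category in a column assumed to be Yes/No.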

In [None]:
# Drop columns that are derived from or identical to the target variable (Churn Value)
churn_all = churn_all.drop(['Churn Label', 'Customer Status_Churned', 'Churn Reason_Customer Retained', 'Churn Category_Retention'], axis=1)

In [None]:
churn_all.info()

## Visualization of Converted Features

### Cluster Map

The cluster map below visualizes how the newly converted numeric columns correlate with the other columns in the dataset.

In [None]:
sns.clustermap(churn_all.corr(), annot=False, cmap='coolwarm', figsize=(15,15))

The cluster map visually represents correlation values amongst the various columns in the dataset. On observing this representation, we notice a few interesting aspects:

1. **Lack of Individual Strong Indicators:** It's noteworthy that none of the other features form a close cluster with the churn-related features. This observation suggests that no single feature is a strong determinant of churn. It likely reflects the multifaceted nature of customer behavior, where a combination of various features might influence a customer's decision to churn or not.

2. **Hierarchical Clustering (Dendrogram):** The tree-like structure seen at the top and left of the cluster map is known as a dendrogram. This structure shows a hierarchical clustering of features based on their correlation. The features with higher correlation are grouped together.

3. **Churn Related Features:** The features 'Churn Reason_Competitor made better offer' and 'Churn Reason_Competitor had better devices' are much more closely related to churn value than the other churn reasons.

This exploration provides valuable insights about the correlation and potential relationships amongst different features. However, the predictive importance of these features will be better understood when we fit and evaluate a machine learning model. This process will provide a more quantitative measure of feature importance in the context of predicting customer churn.
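
To make the dendrogram less of a black box, the sketch below reproduces the underlying idea with SciPy on synthetic columns. It assumes the average-linkage, Euclidean-distance settings that `sns.clustermap` uses by default, applied to the rows of the correlation matrix. The columns `a` to `d` are invented.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Synthetic features with known relationships
toy = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.1, size=200),   # nearly a copy of 'a'
    'c': rng.normal(size=200),                     # independent noise
    'd': -base + rng.normal(scale=0.1, size=200),  # strongly anti-correlated with 'a'
})
corr = toy.corr()

# Each row of the correlation matrix is treated as a point;
# features with similar correlation profiles are merged first
link = linkage(corr.values, method='average', metric='euclidean')

# Leaf order is the order in which the heatmap rows would be drawn
order = corr.index[leaves_list(link)].tolist()
print(order)
```

Note that features are grouped by how similar their whole correlation profiles are, so two features end up adjacent when they correlate with everything else in a similar way.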


### t-SNE Visualization

In [None]:
X = churn_all.drop(['Churn Value'], axis=1) 
y = churn_all['Churn Value']

In [None]:
# Fit and transform X to visualizable lower dimensions
transformed = TSNE(n_components=2, random_state=0, perplexity=200, learning_rate=700).fit_transform(X)

In [None]:
# Plot
plt.figure(figsize=(12,10))

scatter = plt.scatter(transformed[:,0], transformed[:,1], c=y, alpha=0.2, cmap='bwr') # 'bwr' is blue-white-red palette
plt.title('t-SNE visualization of churn data')

# Create a colorbar
cbar = plt.colorbar(scatter, ticks=[0,1])
cbar.set_label('Churn Value')
cbar.set_ticklabels(['Not Churn', 'Churn']) # 0 is "Not Churn", 1 is "Churn"

plt.show()

The t-SNE visualization above illustrates a high-dimensional churn dataset projected into a two-dimensional space. This graphical representation allows us to examine the relationship and potential clusters among customers based on their churn status. Our dataset contains 7043 records, out of which 5174 customers have not churned ('Not Churn').

While there are discernible patterns in the data, it is important to note that the visualization does not exhibit a clear and distinct separation between the 'Churn' and 'Not Churn' classes. 'Not Churn' instances are widely scattered throughout the plot, signifying a diverse set of attributes among customers who did not churn.

In comparison, the 'Churn' points seem to display a slightly more coherent structure, as most of them are concentrated at negative values on the x-axis (x < 0). This could suggest that churned customers share some attributes, which would explain the denser clustering in that region.

Despite this, it is evident that the 'Churn' instances are not exclusively confined to that area and are still dispersed across the plot. This highlights the complex and multi-faceted nature of customer churn, hinting at the presence of multiple factors that could lead to a customer's decision to leave the service.

This data visualization underscores the potential feasibility of predicting customer churn using machine learning algorithms. Nonetheless, the overlap between 'Churn' and 'Not Churn' instances suggests that achieving high predictive accuracy may be challenging due to the inherent intricacies in customer behavior. Hence, it is essential to further explore the data and refine our prediction model for improved outcomes.
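
t-SNE works from pairwise distances, so columns with large numeric ranges can dominate the embedding. One variation worth trying, which is not what the cell above does, is to standardize the features first. The sketch below shows the idea on synthetic data; the column scales and the perplexity value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales
X_toy = np.column_stack([
    rng.normal(0, 1, 120),
    rng.normal(50, 20, 120),
    rng.normal(5000, 1500, 120),
])

# Put every column on a comparable scale before computing distances
X_scaled = StandardScaler().fit_transform(X_toy)

# Perplexity must be smaller than the number of samples
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)

print(embedding.shape)
```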

# Advanced Model Development

## Feature Engineering and Selection

### Feature Importance Analysis

### Feature Selection

### Dimensionality Reduction

## Handling Class Imbalance

## Model Comparisons

## Model Tuning

# Model Evaluation

# Conclusion