<a href="https://colab.research.google.com/github/RoshiniAish1999/customer-churn-analysis-project/blob/main/customer_churn_analysis_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Problem Definition**

The goal is to predict customer churn for a subscription service.
Churn refers to customers leaving the service.

Why is this important? For example:
- Churn prediction helps businesses retain customers.
- Reducing churn saves costs on acquiring new customers.

## DATA COLLECTION AND EXPLORATION

In [None]:
import numpy as np   # numerical arrays
import pandas as pd  # tabular data handling

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
df=pd.read_csv("/content/Churn_Modelling.csv")

In [None]:
df.head()

In [None]:
df.describe().T.round(2)

In [None]:
df.shape

In [None]:
df.info()

In [None]:
missing_df = df.isnull().sum().to_frame().rename(columns={0: "Total No. of Missing Values"})
missing_df["% of Missing Values"] = round((missing_df["Total No. of Missing Values"] / len(df)) * 100, 2)
missing_df

In [None]:
df.drop(["RowNumber", "CustomerId", "Surname"], axis=1, inplace=True)

In [None]:
df.describe()

In [None]:
df.head()

## Exploratory Data Analysis (EDA)

In [None]:
df["CreditCard"] = df["HasCrCard"].apply(lambda x: "credit card present" if x == 1 else "no credit card")
df["IsActive"] = df["IsActiveMember"].apply(lambda x: "active" if x == 1 else "not active")
df["outcome"] = df["Exited"].apply(lambda x: "quit" if x == 1 else "did not quit")

The code above transforms the binary numerical columns into more readable categorical labels using the `.apply(lambda x: ...)` function.
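The same relabeling can also be written with `Series.map` and a dictionary, which some readers find easier to scan. The sketch below is only an illustration: `toy` is a hypothetical three-row frame standing in for the real dataset, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn dataset
toy = pd.DataFrame({"HasCrCard": [1, 0, 1], "Exited": [0, 1, 1]})

# Map each binary flag to a readable label via a dictionary lookup
toy["CreditCard"] = toy["HasCrCard"].map({1: "credit card present", 0: "no credit card"})
toy["outcome"] = toy["Exited"].map({1: "quit", 0: "did not quit"})

print(toy)
```

Unlike the `lambda` version, `map` yields `NaN` for any value missing from the dictionary, which can help surface unexpected codes.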

In [None]:
df["gender_quit"] = df["Gender"] + '-' + df["outcome"]
df["geography_quit"] = df["Geography"] + '-' + df["outcome"]
df["card_quit"] = df["CreditCard"] + '-' + df["outcome"]
df["active_quit"] = df["IsActive"] + '-' + df["outcome"]

The code above creates new categorical columns by combining existing columns with the churn outcome. These combined labels make it easier to compare groups during exploration. Because they embed the target (`outcome`), they should not be used as model inputs.
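A related way to compare groups is to compute the churn rate per category directly. The following is a minimal sketch on a hypothetical six-row frame (`toy`); the column names mirror the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: three female and three male customers
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "Exited": [1, 0, 0, 0, 1, 1],
})

# Mean of a 0/1 column is the share of customers who quit; scale to percent
churn_rate = toy.groupby("Gender")["Exited"].mean().mul(100).round(1)
print(churn_rate)
```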

In [None]:
pies = ["Geography", "Gender", "CreditCard", "IsActive", "outcome"]
fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(16, 8))
for i in range(len(pies)):
    counts = df[pies[i]].value_counts()
    axes[i].pie(counts, autopct="%0.2f%%", labels=counts)
    axes[i].legend(counts.index)

plt.tight_layout()
plt.show()

In [None]:
churn_dist = df['Exited'].value_counts(normalize=True) * 100
churn_dist.plot(kind='bar', color=['green', 'red'])
plt.title("Churn Distribution")
plt.ylabel("Percentage")
plt.show()


In [None]:
# Select only numerical features for correlation calculation
numerical_df = df.select_dtypes(include=np.number)

plt.figure(figsize=(10, 8))
sns.heatmap(numerical_df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

In [None]:
sns.countplot(data=df, x='Gender', hue='Exited')
plt.title('Churn by Gender')
plt.show()


## MODEL BUILDING


In [None]:
# Select the numerical features used for training; other numerical columns could be added, excluding the target
X = df[['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']]
y = df['Exited']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Adjust test_size and random_state as needed


The dataset was split into features (X) and a target (y):
- Features: Six numerical columns (CreditScore, Age, Tenure, Balance, NumOfProducts, EstimatedSalary).
- Target: The Exited column, which indicates churn (1 = Quit, 0 = Did Not Quit).

The data was then split into:
- Training set (80%): Used to train the machine learning models.
- Testing set (20%): Used to evaluate model performance.

A fixed random_state (42) makes the split reproducible.
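Because churners are usually the minority class, it can help to keep the class ratio identical in both subsets. The sketch below shows this with the `stratify` argument of `train_test_split` on synthetic data. Stratification is an assumption added here for illustration (the cell above does not use it), and `X_demo` and `y_demo` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positives (churners)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# test_size=0.2 matches the split in the cell above; stratify keeps the 20% churn share
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))      # sizes of the two subsets
print(y_tr.mean(), y_te.mean())  # churn share in each subset
```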

## LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the Logistic Regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred_log_reg = log_reg.predict(X_test)

# Evaluate the model
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
print(f"Logistic Regression Accuracy: {accuracy_log_reg}")
print(classification_report(y_test, y_pred_log_reg))
print(confusion_matrix(y_test, y_pred_log_reg))


Overview: Logistic Regression is a linear model that predicts the probability of an event occurring (in this case, customer churn). It assumes a linear relationship between the independent variables (features) and the log-odds of the dependent variable (target).
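For reference, the relationship described above can be written as follows. The symbols are chosen here for illustration: $p$ is the probability of churn, $x_1, \dots, x_k$ are the features, and $\beta_0, \dots, \beta_k$ are the fitted coefficients.

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
$$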

Why Use Logistic Regression?

- Simplicity: It's easy to implement and interpret, making it suitable for smaller datasets or when the relationships between variables are straightforward.
- Linear Separability: Works well if the dataset has a clear boundary between churned and non-churned customers in terms of feature values.
- Probability Outputs: Logistic Regression outputs a probability (0 to 1), which is useful for ranking and thresholding decisions.

Performance:
Accuracy: 80%
Logistic Regression provides a good baseline model but may not perform well if the dataset contains non-linear relationships or complex feature interactions.
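To illustrate the probability outputs and thresholding mentioned above, here is a minimal sketch on synthetic data. The scaling step, `max_iter=1000`, and the 0.3 threshold are all assumptions made for this example; they are not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 20% positives
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Standardize features, then fit logistic regression (scaling is an assumption here)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

# Probability of the positive class for every row
churn_prob = model.predict_proba(X_demo)[:, 1]

# Hypothetical lower threshold: flag more customers than the default 0.5 cut-off
threshold = 0.3
flagged = (churn_prob >= threshold).astype(int)
default_pred = model.predict(X_demo)

print(flagged.sum(), default_pred.sum())
```

Lowering the threshold trades precision for recall: more true churners are flagged, along with more false alarms.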



## RANDOM FOREST

In [None]:
# Initialize and train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42) #n_estimators is the number of trees in the forest
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_classifier.predict(X_test)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf}")
print(classification_report(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))


Overview: Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. It aggregates the results of individual trees (via majority voting for classification or averaging for regression) to improve accuracy and reduce overfitting.
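In scikit-learn, the aggregation step is a soft variant of voting: `RandomForestClassifier` averages the class probabilities predicted by its trees. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `forest` are hypothetical names used only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Collect each tree's class probabilities and average them by hand
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
manual_avg = tree_probs.mean(axis=0)

# The manual average should agree with the forest's own predict_proba
matches = np.allclose(manual_avg, forest.predict_proba(X_demo))
print(matches)
```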

Why Use Random Forest?

Handles Non-Linearity: Captures complex and non-linear relationships in the data that simpler models like Logistic Regression cannot handle.
Reduces Overfitting: By averaging multiple trees, Random Forest minimizes the risk of overfitting, which is common with single decision trees.
Feature Importance: Random Forest provides insights into the relative importance of features, helping identify key drivers of churn.
Robustness: Tends to be robust to noisy features and outliers. Missing values may still need to be imputed first, depending on the scikit-learn version.

Performance:
Accuracy: 84%
The higher accuracy compared to Logistic Regression suggests that Random Forest is better suited for this dataset due to its ability to model complex feature interactions.
Feature Insights: Feature importance can be extracted using rf_classifier.feature_importances_ to analyze which variables contribute most to churn predictions.

## Performance Metrics

In [None]:
models = ["Logistic Regression", "Random Forest"]
accuracies = [accuracy_score(y_test, y_pred_log_reg), accuracy_score(y_test, y_pred_rf)]
plt.bar(models, accuracies, color=['blue', 'orange'])
plt.title("Model Accuracy Comparison")
plt.show()


Accuracy tells us the overall performance of the model in predicting customer churn (whether a customer quits or not).
For example, if the Random Forest model achieves 84% accuracy, this means 84 out of 100 customers are correctly classified as churners or non-churners.
Business Impact: While accuracy is important, it doesn't capture the cost of misclassifying churners (False Negatives) or overestimating churn (False Positives), which can lead to missed revenue opportunities or unnecessary retention efforts.
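One way to make the misclassification costs concrete is to read the four cells of the confusion matrix and weight the two error types. The sketch below uses ten made-up labels, and the per-error costs (`cost_fn`, `cost_fp`) are purely hypothetical numbers.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = churner, 0 = non-churner
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = precision_score(y_true, y_hat)  # flagged customers who really churn
recall = recall_score(y_true, y_hat)        # churners the model catches

# Hypothetical costs: a missed churner is assumed far more expensive than a wasted offer
cost_fn, cost_fp = 500, 50
total_cost = int(fn) * cost_fn + int(fp) * cost_fp

print(tn, fp, fn, tp, precision, recall, total_cost)
```

Here accuracy is 70%, yet half of the churners are missed, which is what drives the cost.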


In [None]:
importances = rf_classifier.feature_importances_
features = X_train.columns
sns.barplot(x=importances, y=features)
plt.title("Feature Importance")
plt.show()


In [None]:
from sklearn.metrics import roc_curve, auc
y_probs = rf_classifier.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_probs)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], 'r--')
plt.title("ROC Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()


The ROC curve visualizes the trade-off between True Positive Rate (Recall) and False Positive Rate.
The AUC score summarizes this trade-off in a single number, where 0.5 corresponds to random guessing and 1.0 to perfect ranking. An AUC of 0.85 for Random Forest indicates that the model distinguishes churners from non-churners well.
Business Impact:
A high AUC score supports confidence in the model's ability to rank churners higher than non-churners, improving the prioritization of customers for retention campaigns.
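The ranking interpretation can be checked directly: AUC equals the fraction of (churner, non-churner) pairs in which the churner receives the higher score, with ties counted as half. The sketch below compares this pairwise calculation with `roc_auc_score` on six made-up scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and model scores
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = scores[y_true == 1]  # scores of churners
neg = scores[y_true == 0]  # scores of non-churners

# Compare every churner score with every non-churner score
wins = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
pairwise_auc = wins + 0.5 * ties

library_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, library_auc)
```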


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred = rf_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues")
plt.title("Confusion Matrix")
plt.show()

## INSIGHT USING METRICS

Logistic Regression:
Accuracy: 80%
Precision: High for non-churners but moderate for churners.
Recall: Moderate; some churners are missed.
Implication: Logistic Regression is better for baseline understanding but might miss complex patterns in data.




Random Forest:
Accuracy: 84%
Precision: More balanced for churners and non-churners.
Recall: Higher, capturing more churners.
AUC: 0.85, suggesting stronger differentiation capability.
Implication: Random Forest is better suited for this project due to its ability to model non-linear relationships and interactions, improving recall and precision.
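A side-by-side table makes comparisons like the one above easier to verify. The sketch below builds such a table from made-up predictions for two hypothetical models ("Model A" and "Model B"); the numbers are not results from this notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth and predictions for two hypothetical models
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
preds = {
    "Model A": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    "Model B": [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
}

# One row per model; precision, recall and F1 refer to the churn class (label 1)
summary = pd.DataFrame({
    name: {
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision_score(y_true, y_hat),
        "recall": recall_score(y_true, y_hat),
        "f1": f1_score(y_true, y_hat),
    }
    for name, y_hat in preds.items()
}).T

print(summary.round(2))
```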


## RECOMMENDATIONS


1. Targeted Promotions & Loyalty Programs
Why? Customers with a low balance or inactive accounts have a higher likelihood of churning.
Action Plan:
Offer personalized discounts or cashback offers to encourage engagement.
Implement tiered loyalty programs rewarding long-term customers.
Introduce time-sensitive promotional offers to reactivate dormant users.


2. Segment-Specific Engagement Strategies
Why? Certain customer demographics (e.g., specific gender, age group, or geographic region) may have a higher churn rate.
Action Plan:
Conduct customer surveys to understand pain points in at-risk regions.
Develop region-specific customer support initiatives.
Offer incentives for referrals and word-of-mouth marketing to retain customers within high-churn segments.

3. Personalized Customer Experience & Support
Why? Poor service, unresolved complaints, or lack of engagement contribute to customer churn.
Action Plan:
Implement AI-driven chatbots and proactive customer support to address issues before they escalate.
Provide dedicated account managers for high-value customers to ensure personalized attention.
Improve onboarding processes to make new customers feel valued and comfortable with the services.


4. Proactive Retention Strategies for High-Risk Customers
Why? Customers predicted to churn should be engaged before they decide to leave.
Action Plan:
Create an early warning system using predictive analytics to flag at-risk customers.
Implement retention call campaigns where customer service representatives reach out to high-risk customers.
Offer flexible payment options or account management features to customers showing signs of disengagement.
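An early warning list like the one described above could be as simple as ranking customers by predicted churn probability and taking the top few. The sketch below assumes a hypothetical `scored` table with made-up probabilities; in practice these would come from a model's `predict_proba` output.

```python
import pandas as pd

# Hypothetical scored customers (probabilities are made up)
scored = pd.DataFrame({
    "customer": ["A", "B", "C", "D", "E"],
    "churn_prob": [0.12, 0.81, 0.47, 0.66, 0.05],
})

# Flag the top_k highest-risk customers for a retention call (top_k is an assumption)
top_k = 2
at_risk = scored.nlargest(top_k, "churn_prob")

print(at_risk)
```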


5. Data-Driven Decision Making
Why? The model provides key insights into the factors that influence churn, which should guide business strategies.
Action Plan:
Continuously monitor customer data and update models to improve prediction accuracy.
A/B test different retention strategies and measure effectiveness using KPIs.
Integrate predictive analytics into the CRM system to automate customer segmentation and outreach.
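As one possible way to evaluate an A/B test of two retention strategies, the sketch below runs a two-proportion z-test on churn counts. The function name `two_proportion_z_test` and the counts are hypothetical, and the test assumes large, independent groups.

```python
from math import erf, sqrt


def two_proportion_z_test(churned_a, n_a, churned_b, n_b):
    """Return (z, two-sided p-value) for the difference in churn rates."""
    p_a, p_b = churned_a / n_a, churned_b / n_b
    # Pooled churn rate under the null hypothesis of no difference
    pooled = (churned_a + churned_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Made-up example: 200 of 1000 churned in control, 160 of 1000 with the new offer
z, p_value = two_proportion_z_test(200, 1000, 160, 1000)
print(round(z, 2), round(p_value, 3))
```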


6. Product & Service Enhancements
Why? Product dissatisfaction is a major reason for customer churn.
Action Plan:
Analyze feedback from churned customers to improve products and services.
Introduce innovative features that align with customer needs.
Ensure a seamless user experience through mobile-friendly services and intuitive interfaces.