# Enhanced Customer Churn Prediction and CLV Analysis

## Business Scenario: Online Retailer

An online retailer specialising in consumer electronics is facing increasing competition. The marketing team wants to improve customer retention and maximise customer lifetime value. They've tasked us with analysing their customer data to provide actionable insights.

Our goals are to:
1. Predict which customers are likely to churn
2. Estimate customer lifetime value (CLV)
3. Develop targeted retention strategies

Let's dive in!

## Step 1: Data Loading and Preprocessing

We'll start by importing the necessary libraries and loading our dataset. Then, we'll preprocess the data by calculating RFM (Recency, Frequency, Monetary) metrics, which are crucial for our analysis.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Load and preprocess the data
df = pd.read_csv('ecommerce_data_week2.csv')
df['order_date'] = pd.to_datetime(df['order_date'])

# Calculate RFM metrics
latest_date = df['order_date'].max()
rfm = df.groupby('customer_id').agg({
    'order_date': lambda x: (latest_date - x.max()).days,
    'order_id': 'count',
    'revenue': 'sum',
    'customer_age_group': 'first',
    'customer_region': 'first',
    'web_traffic_source': 'first'
})
rfm.columns = ['recency', 'frequency', 'monetary', 'age_group', 'region', 'traffic_source']
rfm['churned'] = rfm['recency'] > 90

print(rfm.head())
print(f"Churn rate: {rfm['churned'].mean():.2%}")

             recency  frequency     monetary age_group      region  \
customer_id                                                          
1                 46         16  5713.503346     36-45      London   
2                 32         13  5070.118301     26-35      London   
3                 26         12  4401.409671     36-45      London   
4                 91         11  3258.138934     46-55      London   
5                 68         11  4366.886948       56+  South East   

            traffic_source  churned  
customer_id                          
1                    Email    False  
2                Instagram    False  
3                    Email    False  
4                 Facebook     True  
5                   Direct    False  
Churn rate: 37.00%


We've successfully loaded our dataset and calculated the RFM metrics for each customer. The 'rfm' DataFrame now contains:
- Recency: number of days since the customer's last purchase
- Frequency: total number of purchases
- Monetary: total amount spent
- Other customer attributes like age group, region, and traffic source

We've also defined churn as customers who haven't made a purchase in the last 90 days. The output shows the first few rows of our RFM dataframe and an overall churn rate of 37%. This gives us an initial understanding of our customer base and the extent of our churn problem.
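
To make the RFM and churn definitions concrete, here is a minimal sketch on a toy order log. The order values are invented for illustration; only the column names (`customer_id`, `order_id`, `order_date`, `revenue`) and the 90-day rule come from the cell above.

```python
import pandas as pd

# Toy order log (hypothetical values, same column names as the dataset above)
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'order_id': [101, 102, 103, 104, 105, 106],
    'order_date': pd.to_datetime([
        '2024-01-05', '2024-06-20',
        '2024-02-01', '2024-03-15', '2024-06-30',
        '2024-01-10',
    ]),
    'revenue': [120.0, 80.0, 50.0, 70.0, 30.0, 200.0],
})

# Reference date: the most recent order in the log
latest = orders['order_date'].max()

toy_rfm = orders.groupby('customer_id').agg(
    recency=('order_date', lambda d: (latest - d.max()).days),  # days since last order
    frequency=('order_id', 'count'),                            # number of orders
    monetary=('revenue', 'sum'),                                # total spend
)

# Same rule as above: no purchase in the last 90 days counts as churned
toy_rfm['churned'] = toy_rfm['recency'] > 90

print(toy_rfm)
```

Customer 3 last ordered in January, so their recency is well above 90 days and they are the only one flagged as churned.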

## Step 2: Customer Segmentation Visualisation

Now that we have our RFM metrics, let's create an interactive 2D scatter plot to visualise our customer segments. This will help us understand the relationship between recency, frequency, and monetary value of our customers.

In [2]:
fig = px.scatter(rfm, x='recency', y='frequency',
                 size='monetary', color='churned',
                 hover_name=rfm.index,
                 custom_data=['monetary', 'churned'],  # exposed to the hover template below
                 labels={'churned': 'Churned',
                         'recency': 'Recency (days)',
                         'frequency': 'Frequency',
                         'monetary': 'Monetary Value'},
                 title='Customer Segmentation: Recency vs Frequency')

# 'churned' is boolean, so Plotly draws a discrete legend rather than a colour bar
fig.update_layout(
    xaxis_title='Recency (days)',
    yaxis_title='Frequency',
    legend_title='Churned'
)

# Read monetary value and churn flag from customdata so the hover shows the actual values
fig.update_traces(
    hovertemplate="<br>".join([
        "Customer ID: %{hovertext}",
        "Recency: %{x} days",
        "Frequency: %{y}",
        "Monetary Value: £%{customdata[0]:.2f}",
        "Churned: %{customdata[1]}"
    ])
)

fig.show()

This scatter plot provides several insights:
1. The x-axis shows recency (days since last purchase), and the y-axis shows frequency (number of purchases).
2. The size of each point represents the monetary value of the customer.
3. The colour indicates whether a customer has churned (as per our definition).

Key observations:
- Churned customers (likely in red) sit to the right of the 90-day mark on the x-axis. This follows directly from our definition of churn as a recency above 90 days.
- Larger bubbles (high-value customers) seem to be more concentrated in the lower-right quadrant, suggesting that some of our highest-value customers may be at risk of churning.
- There's a visible relationship between frequency and churn - customers who purchase more frequently are less likely to churn.

This visualisation helps us identify different customer segments and understand their behaviour, which will be crucial for developing targeted retention strategies.

## Step 3: Building a Churn Prediction Model

Now that we have visualised our customer segments, let's build a more sophisticated churn prediction model using Random Forest. We'll include additional features like age group, region, and traffic source to improve our predictions.

In [3]:
# Prepare features
X = pd.get_dummies(rfm.drop(['churned'], axis=1), columns=['age_group', 'region', 'traffic_source'])
y = rfm['churned']

# Scale numerical features (optional for tree-based models such as Random Forest, which split on thresholds)
scaler = StandardScaler()
X[['recency', 'frequency', 'monetary']] = scaler.fit_transform(X[['recency', 'frequency', 'monetary']])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions
y_pred = rf_model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

# Plot feature importances
importances = pd.DataFrame({'feature': X.columns, 'importance': rf_model.feature_importances_})
importances = importances.sort_values('importance', ascending=False).head(10)

fig = px.bar(importances, x='importance', y='feature', orientation='h',
             title='Top 10 Feature Importances for Churn Prediction')
fig.show()

              precision    recall  f1-score   support

       False       1.00      1.00      1.00        13
        True       1.00      1.00      1.00         7

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20



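#### Interpretation of Model Results:

1. Model Performance:
   - The classification report shows precision, recall, and F1-score of 1.00 for both churned and non-churned customers, on a test set of 20 customers (13 retained, 7 churned).
   - Read these perfect scores with caution: `churned` is defined as `recency > 90`, and `recency` is also one of the model's features, so the model only has to rediscover that threshold. The report confirms that the pipeline runs, but it does not show how well the model would predict future churn.
   - The 'weighted avg' row gives an overall view of performance, and a high F1-score normally indicates a good balance between precision and recall.

2. Feature Importance:
   - The bar chart shows the top 10 most important features for predicting churn.
   - Recency is expected to dominate, because the churn label is derived directly from it. Frequency and monetary value are likely to follow.
   - Other features like age group, region, or traffic source may also contribute, but their importance is hard to judge while recency fully determines the label.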

This model provides us with a tool to predict which customers are likely to churn, allowing for proactive retention efforts. The feature importances give us insight into what factors most strongly influence churn, which can guide our retention strategies.
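
Because `churned` is defined from `recency` and `recency` is also a model feature, the evaluation above cannot tell us how well the model anticipates future churn. One possible alternative, sketched here under assumptions, is to build features only from orders before a cutoff date and to label a customer as churned if they place no order in the 90 days after it. The function name `build_churn_dataset`, the cutoff date, and the toy orders are hypothetical and not part of the original analysis.

```python
import pandas as pd

def build_churn_dataset(orders, cutoff, horizon_days=90):
    """Features from orders before `cutoff`; label from the `horizon_days` after it."""
    cutoff = pd.Timestamp(cutoff)
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)

    # Split the order log into what was known at the cutoff and what happened next
    history = orders[orders['order_date'] < cutoff]
    future = orders[(orders['order_date'] >= cutoff) &
                    (orders['order_date'] < horizon_end)]

    features = history.groupby('customer_id').agg(
        recency=('order_date', lambda d: (cutoff - d.max()).days),
        frequency=('order_id', 'count'),
        monetary=('revenue', 'sum'),
    )
    # Churned = no purchase in the window after the cutoff
    features['churned'] = ~features.index.isin(future['customer_id'])
    return features

# Toy order log (hypothetical values)
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'order_id': [1, 2, 3, 4, 5],
    'order_date': pd.to_datetime(['2024-01-10', '2024-05-02',
                                  '2024-02-20', '2024-03-01',
                                  '2024-04-15']),
    'revenue': [100.0, 40.0, 60.0, 90.0, 25.0],
})

dataset = build_churn_dataset(orders, cutoff='2024-04-01')
print(dataset)
```

With this framing the label depends on behaviour the features have not seen, so recency becomes a genuine predictor rather than a restatement of the target. Customer 3 has no orders before the cutoff and therefore does not appear in the feature table.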

## Step 4: Customer Lifetime Value (CLV) Prediction

Now that we can predict churn, let's estimate the Customer Lifetime Value (CLV) for each customer. This will help us identify high-value customers and prioritise our retention efforts.

In [4]:
# Calculate components for CLV
rfm['avg_order_value'] = rfm['monetary'] / rfm['frequency']
rfm['purchase_frequency'] = rfm['frequency'] / ((latest_date - df['order_date'].min()).days / 365)
rfm['customer_lifespan'] = 1 / (1 + np.exp(-(-3 + 10 * rf_model.predict_proba(X)[:, 0])))  # Sigmoid of the predicted retention probability (column 0 = P(not churned))

# Calculate CLV
rfm['clv'] = rfm['avg_order_value'] * rfm['purchase_frequency'] * rfm['customer_lifespan'] * 3  # Assuming 3 years

# Visualise CLV distribution
fig = px.histogram(rfm, x='clv', color='churned', marginal='box', 
                   title='Distribution of Customer Lifetime Value')
fig.show()

# Print average CLV for churned and non-churned customers
print(rfm.groupby('churned')['clv'].mean())

churned
False    4570.359258
True      487.503819
Name: clv, dtype: float64


#### Interpretation of CLV Analysis:

1. CLV Distribution:
   - The histogram shows the distribution of Customer Lifetime Values, separated by churned and non-churned customers.
   - The box plot on the side provides a summary of the CLV distribution for each group.

2. Average CLV Comparison:
   - The printed values show the average CLV for churned and non-churned customers.
   - Non-churned customers average about £4,570, roughly nine times the £488 average for churned customers.

Key insights:
- There's likely a wide range of CLV among customers, indicating varying levels of customer value.
- Churned customers have much lower CLV. Part of this gap is built into the formula: the lifespan factor is derived from the model's retention probability, so customers predicted to churn are assigned a lower CLV by construction.
- This analysis helps identify high-value customers who should be prioritised in retention efforts.

Understanding CLV allows us to make more informed decisions about customer acquisition costs, retention strategies, and overall customer relationship management.
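
The CLV formula used above can be written as a small standalone function, which makes the role of each component easier to see. This is a sketch that mirrors the cell above; the function names and the example customer are hypothetical.

```python
import numpy as np

def lifespan_factor(p_retained):
    """Sigmoid of (-3 + 10 * p): maps a retention probability in [0, 1] to roughly 0.05..1.0."""
    return 1 / (1 + np.exp(-(-3 + 10 * p_retained)))

def estimate_clv(monetary, frequency, period_years, p_retained, horizon_years=3):
    """Average order value x yearly purchase frequency x lifespan factor x horizon."""
    avg_order_value = monetary / frequency
    purchase_frequency = frequency / period_years
    return avg_order_value * purchase_frequency * lifespan_factor(p_retained) * horizon_years

# Hypothetical customer: £1,200 spent over 8 orders in a 2-year observation window
loyal = estimate_clv(monetary=1200, frequency=8, period_years=2, p_retained=0.9)
at_risk = estimate_clv(monetary=1200, frequency=8, period_years=2, p_retained=0.1)

print(f"Loyal customer CLV:   £{loyal:,.2f}")
print(f"At-risk customer CLV: £{at_risk:,.2f}")
```

Two customers with identical spending end up with very different CLV values purely because of their predicted retention probability. Note also that average order value times purchase frequency simplifies to total spend per year, so the formula is effectively yearly revenue scaled by the lifespan factor and the 3-year horizon.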

## Step 5: Customer Insights Dashboard

Finally, let's create a comprehensive dashboard that summarises our key findings. This will provide a quick overview of important metrics and insights for the marketing team.

In [5]:
# Churn Rate by Age Group
churn_by_age = rfm.groupby('age_group')['churned'].mean().sort_values(ascending=False)
fig1 = go.Figure(go.Bar(x=churn_by_age.index, y=churn_by_age.values))
fig1.update_layout(title='Churn Rate by Age Group', xaxis_title='Age Group', yaxis_title='Churn Rate')

# Average CLV by Region
clv_by_region = rfm.groupby('region')['clv'].mean().sort_values(ascending=False)
fig2 = go.Figure(go.Bar(x=clv_by_region.index, y=clv_by_region.values))
fig2.update_layout(title='Average CLV by Region', xaxis_title='Region', yaxis_title='Average CLV')

# Top Traffic Sources
top_sources = rfm['traffic_source'].value_counts().head(5)
fig3 = go.Figure(go.Pie(labels=top_sources.index, values=top_sources.values))
fig3.update_layout(title='Top Traffic Sources')

# Churn Probability vs CLV
fig4 = go.Figure(go.Scatter(x=rf_model.predict_proba(X)[:, 1], y=rfm['clv'], mode='markers'))
fig4.update_layout(title='Churn Probability vs CLV', xaxis_title='Churn Probability', yaxis_title='CLV')

# Combine plots into a single figure
fig = make_subplots(rows=2, cols=2, specs=[[{'type': 'xy'}, {'type': 'xy'}],
                                           [{'type': 'domain'}, {'type': 'xy'}]],
                    subplot_titles=('Churn Rate by Age Group', 'Average CLV by Region', 
                                    'Top Traffic Sources', 'Churn Probability vs CLV'))

# Add traces to the subplots
fig.add_trace(fig1.data[0], row=1, col=1)
fig.add_trace(fig2.data[0], row=1, col=2)
fig.add_trace(fig3.data[0], row=2, col=1)
fig.add_trace(fig4.data[0], row=2, col=2)

# Update layout
fig.update_layout(height=800, title_text="Customer Insights Dashboard")

# Show the figure
fig.show()

#### Interpretation of Customer Insights Dashboard:

This dashboard provides a comprehensive overview of our customer base and churn analysis. Let's break down each component:

1. Churn Rate by Age Group:
   - This bar chart shows how churn rate varies across different age groups.
   - Look for age groups with notably higher churn rates. These may require targeted retention strategies.
   - Consider whether the churn rate aligns with the business's target demographics.

2. Average CLV by Region:
   - This bar chart displays the average Customer Lifetime Value across different regions.
   - Regions with higher CLV are particularly valuable and may warrant increased marketing or retention efforts.
   - Lower CLV regions might need strategies to increase customer engagement and spending.

3. Top Traffic Sources:
   - The pie chart shows the distribution of the top traffic sources for customers.
   - Identify which sources are bringing in the most customers.
   - Consider allocating more resources to high-performing sources and investigating ways to improve lower-performing ones.

4. Churn Probability vs CLV:
   - This scatter plot shows the relationship between a customer's probability of churning and their CLV.
   - Points in the upper-right quadrant represent high-value customers at high risk of churning. These should be a priority for retention efforts.
   - A negative correlation (if present) would indicate that higher-value customers are generally less likely to churn.

Key Takeaways:
1. Identify the most vulnerable age groups and regions for targeted retention campaigns.
2. Focus on the most effective traffic sources for customer acquisition.
3. Prioritise retention efforts on high-CLV customers with high churn probability.
4. Use these insights to develop data-driven, targeted marketing and retention strategies.

This dashboard enables the marketing team to quickly grasp key customer metrics and make informed decisions about where to focus their efforts for maximum impact on customer retention and lifetime value.
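
The third takeaway, prioritising high-CLV customers with high churn probability, could be turned into a ranked contact list along the following lines. This is a sketch on invented scores: the 0.5 probability threshold, the median CLV cut-off, and the `value_at_risk` column are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

# Hypothetical per-customer scores (in the notebook these would come from
# rf_model.predict_proba(X)[:, 1] and rfm['clv'])
scored = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'churn_prob': [0.10, 0.85, 0.60, 0.95, 0.30],
    'clv': [5200.0, 4100.0, 2600.0, 300.0, 2500.0],
}).set_index('customer_id')

# Expected value lost if the customer churns
scored['value_at_risk'] = scored['churn_prob'] * scored['clv']

# Keep customers who are both likely to churn and at least median value
high_risk = scored['churn_prob'] >= 0.5
high_value = scored['clv'] >= scored['clv'].median()

priority = scored[high_risk & high_value].sort_values('value_at_risk', ascending=False)
print(priority)
```

Customer 4 is very likely to churn but has low CLV, and customer 1 is valuable but unlikely to churn, so neither makes the list. Customers 2 and 3 remain, ordered by the value at stake.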

## Actionable Insights and Recommendations

Based on our analysis, here are some actionable insights and recommendations for the online retailer:

1. **Target High-Risk, High-Value Customers**: Focus retention efforts on customers with high churn probability but high CLV.
   - Action: Develop a VIP programme with exclusive benefits for these customers.

2. **Age-Specific Strategies**: Tailor marketing approaches to different age groups, focusing on those with higher churn rates.
   - Action: Create age-specific product bundles and marketing campaigns.

3. **Regional Focus**: Invest more in regions with higher average CLV.
   - Action: Increase marketing budget and customise offerings for high-CLV regions.

4. **Traffic Source Optimisation**: Allocate more resources to the top-performing traffic sources.
   - Action: Increase ad spend on the most effective channels.

5. **Proactive Churn Prevention**: Use the churn prediction model to identify at-risk customers early.
   - Action: Implement an early warning system and trigger personalised retention campaigns.

6. **CLV-Based Segmentation**: Create customer segments based on predicted CLV for more targeted marketing.
   - Action: Develop tiered loyalty programmes based on CLV segments.
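
As a starting point for the CLV-based segmentation in recommendation 6, customers could be split into equal-sized tiers by CLV quantile. The sketch below uses `pd.qcut` on invented CLV values; the number of tiers and the tier names are assumptions.

```python
import pandas as pd

# Hypothetical CLV values (in the notebook this would be rfm['clv'])
clv = pd.Series(
    [300.0, 900.0, 1500.0, 2500.0, 4100.0, 5200.0],
    index=pd.Index([1, 2, 3, 4, 5, 6], name='customer_id'),
    name='clv',
)

# Three equal-sized tiers based on CLV terciles
tiers = pd.qcut(clv, q=3, labels=['Bronze', 'Silver', 'Gold'])

segments = pd.DataFrame({'clv': clv, 'tier': tiers})
print(segments)
print(segments.groupby('tier', observed=True)['clv'].mean())
```

Quantile-based tiers keep each segment the same size. Fixed monetary thresholds are an alternative if the loyalty programme needs stable entry criteria over time.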

By implementing these strategies, the online retailer can work towards reducing churn, increasing customer lifetime value, and ultimately driving sustainable growth in their e-commerce business.