
In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Data Preprocessing

In [None]:
filepath = '/kaggle/input/ecommerce-data/data.csv'
data = pd.read_csv(filepath, encoding = 'ISO-8859-1')
copy_data = data.copy()

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()

In [None]:
data.describe().T

In [None]:
data.isnull().sum()

In [None]:
### Converting the Date Column to Date Type
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

In [None]:
data.info()

In [None]:
#Handling the missing values and negative values appropriately
data.loc[data['UnitPrice'] < 0]

In [None]:
data.loc[data['Quantity']  < 0]

In [None]:
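# Drop rows without a CustomerID; .copy() avoids SettingWithCopyWarning on later column assignments
cleaned_data = data.dropna(subset=['CustomerID']).copy()
# Build each mask from cleaned_data itself rather than from the original data frame
cleaned_data.loc[cleaned_data['UnitPrice'] < 0]
cleaned_data.loc[cleaned_data['Quantity'] < 0]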

In [None]:
cleaned_data.isnull().sum()

In [None]:
cleaned_data.info()

In [None]:
cleaned_data.loc[cleaned_data['Quantity']  < 0]

In [None]:
cleaned_data.describe().T

## Exploratory Data Visualization

#### Customer Analysis

In [None]:
unique_customers = cleaned_data['CustomerID'].nunique()
print(f"There are {unique_customers} unique customers.")

In [None]:
min_date, max_date = cleaned_data['InvoiceDate'].min(), cleaned_data['InvoiceDate'].max()
print(f"The dataset covers the period from {min_date} to {max_date}.")

In [None]:
# Report the shape of the cleaned dataset
print(f"The dataset contains {cleaned_data.shape[0]} rows and {cleaned_data.shape[1]} columns.")

#### Product Analysis

In [None]:
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'iframe'

customer_order_counts = cleaned_data.groupby('CustomerID')['InvoiceNo'].nunique()
fig = px.histogram(customer_order_counts, nbins=30, title="Distribution of Orders per Customer")
fig.show()

In [None]:
top_customers = customer_order_counts.nlargest(5).reset_index()
print(top_customers)

fig = px.bar(top_customers, x='CustomerID', y='InvoiceNo', title="Top 5 Customers by Order Count")
fig.show()


In [None]:
top_products = cleaned_data['Description'].value_counts().head(10).reset_index()
top_products.columns = ['Description', 'Count']

fig = px.bar(top_products, x='Description', y='Count', title="Top 10 Most Frequently Purchased Products")
fig.show()

In [None]:
average_price = cleaned_data['UnitPrice'].mean()
print(f"The average product price is {average_price:.2f}.")


In [None]:
cleaned_data.loc[:,'Revenue'] = cleaned_data.loc[:,'Quantity'] * cleaned_data.loc[:,'UnitPrice']
revenue_per_product = cleaned_data.groupby('Description')['Revenue'].sum().reset_index()
top_revenue_product = revenue_per_product.nlargest(1, 'Revenue')
print(top_revenue_product)

fig = px.bar(revenue_per_product.nlargest(10, 'Revenue'), x='Description', y='Revenue', title="Top Revenue-Generating Products")
fig.show()


#### Time Analysis

In [None]:
cleaned_data.loc[:,'InvoiceDate'] = pd.to_datetime(cleaned_data['InvoiceDate'])
cleaned_data.loc[:,'DayOfWeek'] = cleaned_data['InvoiceDate'].dt.day_name()
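# Count unique invoices (orders) per weekday rather than line items
day_order_counts = cleaned_data.groupby('DayOfWeek')['InvoiceNo'].nunique().sort_values(ascending=False)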

fig = px.bar(day_order_counts, x=day_order_counts.index, y=day_order_counts.values, title="Orders by Day of the Week")
fig.show()


In [None]:
cleaned_data.loc[:,'Month'] = cleaned_data['InvoiceDate'].dt.month
# Count unique invoices (orders) per month rather than line items
month_order_counts = cleaned_data.groupby('Month')['InvoiceNo'].nunique().sort_index()

fig = px.line(x=month_order_counts.index, y=month_order_counts.values, title="Orders by Month")
fig.update_xaxes(title="Month")
fig.update_yaxes(title="Order Count")
fig.show()


#### Geographical Analysis

In [None]:
# Count unique invoices (orders) per country rather than line items
top_countries = cleaned_data.groupby('Country')['InvoiceNo'].nunique().nlargest(5).reset_index()
top_countries.columns = ['Country', 'Count']

fig = px.bar(top_countries, x='Country', y='Count', title="Top 5 Countries by Number of Orders")
fig.show()


In [None]:
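# Sum line-item revenue per invoice first, then average those order totals per country
order_totals = cleaned_data.groupby(['Country', 'InvoiceNo'])['Revenue'].sum().reset_index()
average_order_value = order_totals.groupby('Country')['Revenue'].mean().reset_index()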

fig = px.bar(average_order_value, x='Country', y='Revenue', title="Average Order Value by Country")
fig.show()


#### Payment Analysis

The dataset provided does not include any information about payment methods. To analyze the most common payment methods used by customers, you would need a dataset that includes a column like PaymentMethod, TransactionType, or something similar.
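
If such a column were available, the analysis could look roughly like the sketch below. The helper name `payment_method_counts` and the toy data are hypothetical; the notebook's dataset has no payment column, so this only illustrates the shape of the analysis.

```python
import pandas as pd

def payment_method_counts(df, column='PaymentMethod'):
    """Count rows per payment method, or return None if the column is missing."""
    if column not in df.columns:
        return None
    return df[column].value_counts()

# Hypothetical data: the notebook's dataset has no such column
toy_orders = pd.DataFrame({'PaymentMethod': ['card', 'card', 'cash']})
print(payment_method_counts(toy_orders))
```

Returning `None` for a missing column keeps the helper safe to call on the current dataset without raising a `KeyError`.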

#### Customer Behavior

In [None]:
customer_activity = cleaned_data.groupby('CustomerID').agg(FirstPurchase=('InvoiceDate', 'min'), LastPurchase=('InvoiceDate', 'max')).reset_index()

customer_activity['ActiveDuration'] = (customer_activity['LastPurchase'] - customer_activity['FirstPurchase']).dt.days

average_active_duration = customer_activity['ActiveDuration'].mean()

print(f"The average active duration for customers is {average_active_duration:.2f} days.")

fig = px.histogram(customer_activity, x='ActiveDuration', nbins=30, title='Distribution of Customer Active Durations')
fig.update_layout(xaxis_title='Active Duration (days)', yaxis_title='Number of Customers')
fig.show()

#### Are there any customer segments based on their purchase behavior?

Yes. Customers can be segmented by their purchase behavior, and those segments are derived in the RFM sections below.

#### Returns and Refunds 

In [None]:
# Identify orders with returns or refunds

# Check for negative quantities
returns_or_refunds = cleaned_data[(cleaned_data['Quantity'] < 0) | (cleaned_data['InvoiceNo'].str.startswith('C'))]

total_orders = cleaned_data['InvoiceNo'].nunique()

refund_orders = returns_or_refunds['InvoiceNo'].nunique()

refund_percentage = (refund_orders / total_orders) * 100

print(f"Percentage of orders with returns or refunds: {refund_percentage:.2f}%")

In [None]:
import plotly.express as px
import pandas as pd

# Create a true copy of the filtered DataFrame
returns_or_refunds = cleaned_data[cleaned_data['Quantity'] < 0].copy()

# Ensure that the InvoiceDate column is of datetime type
returns_or_refunds.loc[:,'InvoiceDate'] = pd.to_datetime(returns_or_refunds['InvoiceDate'])

# Extract month and year for grouping using a string method to avoid serialization issues
returns_or_refunds.loc[:,'Month_Year'] = returns_or_refunds['InvoiceDate'].dt.strftime('%Y-%m')

# Count the number of returns/refunds for each month
monthly_returns = returns_or_refunds.groupby('Month_Year').size().reset_index(name='Refund_Count')

# Sort the monthly returns chronologically
monthly_returns = monthly_returns.sort_values('Month_Year')

# Create a bar plot for returns/refunds
fig = px.bar(monthly_returns, x='Month_Year', y='Refund_Count', 
             title='Returns/Refunds Count Over Time', 
             labels={'Month_Year': 'Month/Year', 'Refund_Count': 'Number of Returns/Refunds'})

# Customize the plot
fig.update_xaxes(
    tickangle=45,  # Rotate x-axis labels for better readability
    title_text='Month/Year'
)
fig.update_yaxes(
    title_text='Number of Returns/Refunds'
)
fig.update_layout(
    height=600,  # Adjust height if needed
    width=800,   # Adjust width if needed
    title_x=0.5  # Center the title
)

# Show the plot
fig.show()


In [None]:
# Filter out categories with zero total orders to avoid division by zero
category_orders = cleaned_data.groupby('Description').size().reset_index(name='Total_Orders')
category_orders = category_orders[category_orders['Total_Orders'] > 0]

# Count returns for each category
returns_or_refunds = cleaned_data[cleaned_data['Quantity'] < 0]
category_returns = returns_or_refunds.groupby('Description').size().reset_index(name='Return_Count')

# Left-merge so that products without any returns are kept with a return count of 0
category_data = pd.merge(category_orders, category_returns, on='Description', how='left').fillna(0)

# Calculate return rate
category_data['Return_Rate'] = category_data['Return_Count'] / category_data['Total_Orders']

# Filter for categories with a significant number of total orders and varied return rates
significant_categories = category_data[
    (category_data['Total_Orders'] > 10) &  # Ensure enough total orders
    (category_data['Return_Rate'] > 0) &    # Ensure some returns
    (category_data['Return_Rate'] < 1)      # Exclude 100% return rate
]

# Sort and select top 5 with diverse return rates
category_data_sorted = significant_categories.sort_values(by='Return_Rate', ascending=False)
top_5_categories = category_data_sorted.head()

# Print detailed information
print("Top 5 Product Categories by Return Rate:")
print(top_5_categories.to_string(index=False))

# Create visualization
import plotly.express as px

fig = px.bar(top_5_categories, x='Description', y='Return_Rate', 
             title='Return Rate for Top 5 Product Categories', 
             labels={'Description':'Product Category','Return_Rate' : 'Return Rate'},
             color='Return_Rate', 
             color_continuous_scale='Viridis')
fig.update_xaxes(tickangle=90)
fig.update_layout(height=600, width=800)
fig.show()

# Additional insights
print("\nAdditional Insights:")
print(f"Total Unique Product Categories: {len(category_data)}")
print(f"Categories with Significant Orders and Returns: {len(significant_categories)}")

#### Profitability Analysis

In [None]:
cleaned_data.loc[:,'Sales'] = cleaned_data['Quantity'] * cleaned_data['UnitPrice']

total_sales = cleaned_data['Sales'].sum()

#Estimate cost assuming a 30% profit margin
profit_margin = 0.30

cleaned_data.loc[:,'Cost'] = cleaned_data['Sales'] * (1 - profit_margin)

total_cost = cleaned_data['Cost'].sum()
total_profit = total_sales - total_cost

print(f"Total Cost: ${total_cost:,.2f}")
print(f"Total Profit: ${total_profit:,.2f}")

print()

In [None]:
# Calculate sales and profit
cleaned_data.loc[:,'Sales'] = cleaned_data['Quantity'] * cleaned_data['UnitPrice']
profit_margin = 0.30

# Group by product and calculate metrics
product_profit = cleaned_data.groupby('Description').agg(
    Total_Sales=('Sales', 'sum'), 
    Total_Quantity=('Quantity', 'sum'),
    Total_Profit=('Sales', lambda x: x.sum() * profit_margin)
).reset_index()

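# Calculate profit margin. Because Total_Profit is defined as Total_Sales * profit_margin,
# this ratio equals the assumed 0.30 for every product (up to floating-point rounding),
# so the ranking below does not reflect real differences in profitability.
product_profit.loc[:,'Profit_Margin'] = (product_profit['Total_Profit'] / product_profit['Total_Sales'])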

# Filter for meaningful products
significant_products = product_profit[
    (product_profit['Total_Sales'] > 0) &  # Exclude zero sales
    (product_profit['Total_Quantity'] > 10) &  # Minimum quantity sold
    (product_profit['Profit_Margin'] > 0) &  # Positive profit margin
    (product_profit['Profit_Margin'] < 1)  # Exclude 100% margin
]

# Sort and select top 5 diverse products
top_products = significant_products.sort_values(by='Profit_Margin', ascending=False).head(5)

# Print detailed information
print("Top 5 Products by Profit Margin:")
print(top_products.to_string(index=False))

# Visualization
import plotly.express as px

fig = px.bar(
    top_products, 
    x='Description', 
    y='Profit_Margin', 
    title='Top 5 Products by Profit Margin', 
    labels={'Description': 'Product', 'Profit_Margin': 'Profit Margin'},
    text=[f'{x:.2%}' for x in top_products['Profit_Margin']],
    color='Total_Sales',  # Color by total sales for additional insight
    color_continuous_scale='Viridis'
)

# Customize visualization
fig.update_traces(
    texttemplate='%{text}',  # Display percentage
    textposition='outside'
)
fig.update_layout(
    height=600,
    width=800,
    title_x=0.5
)
fig.show()

# Additional insights
print("\nAdditional Insights:")
print(f"Total Unique Products: {len(product_profit)}")
print(f"Products with Significant Sales and Margin: {len(significant_products)}")

#### Customer Satisfaction

#### Customer Feedback Proxy Metrics

#### Dissatisfaction (Proxy):
* Use return/refund rates as a proxy for dissatisfaction. Products with higher return rates or more frequent refunds may indicate issues, such as poor quality or misrepresentation.

#### Customer Satisfaction (Proxy):
* High purchase quantities can indicate customer satisfaction. Customers who buy products in large quantities might be more satisfied or have strong preferences for those products.




In [None]:
import pandas as pd
import numpy as np

# Step 1: Calculate satisfaction and dissatisfaction scores

# Create Sales column
cleaned_data.loc[:,'Sales'] = cleaned_data['Quantity'] * cleaned_data['UnitPrice']

# Satisfaction Rate (proxy: high purchase quantities, threshold = 50)
satisfaction_threshold = 50  # Quantity threshold for high satisfaction
satisfaction_data = cleaned_data[cleaned_data['Quantity'] >= satisfaction_threshold]
product_satisfaction = satisfaction_data.groupby('Description').agg(
    Total_Quantity=('Quantity', 'sum'),
    Total_Sales=('Sales', 'sum')
).reset_index()
product_satisfaction['Avg_Satisfaction_Score'] = np.clip(
    product_satisfaction['Total_Quantity'] / product_satisfaction['Total_Quantity'].max() * 5, 0, 5)

# Dissatisfaction Rate (proxy: refund/return rates, negative quantities)
dissatisfaction_data = cleaned_data[cleaned_data['Quantity'] < 0]
product_dissatisfaction = dissatisfaction_data.groupby('Description').agg(
    Total_Returns=('Quantity', 'sum'),
    Total_Refunds=('Sales', 'sum')
).reset_index()
product_dissatisfaction['Avg_Dissatisfaction_Score'] = np.clip(
    abs(product_dissatisfaction['Total_Returns']) / abs(product_dissatisfaction['Total_Returns']).max() * 5, 0, 5)

# Merge satisfaction and dissatisfaction data
product_feedback = pd.merge(
    product_satisfaction[['Description', 'Avg_Satisfaction_Score']],
    product_dissatisfaction[['Description', 'Avg_Dissatisfaction_Score']],
    on='Description', how='outer'
).fillna(0)  # Fill missing scores with 0


In [None]:
import plotly.express as px

# Select the 20 products with the highest dissatisfaction scores
top_feedback_products = product_feedback.nlargest(20, 'Avg_Dissatisfaction_Score')

# Create stacked bar chart for satisfaction and dissatisfaction
fig = px.bar(top_feedback_products, 
             x='Description', 
             y=['Avg_Satisfaction_Score', 'Avg_Dissatisfaction_Score'], 
             title='Satisfaction vs Dissatisfaction by Product (Top 20)',
             labels={'value': 'Score (Out of 5)', 'variable': 'Metric', 'Description': 'Product Name'},
             barmode='stack',  # Stacked bar chart
             height=600,
             color_discrete_map={'Avg_Satisfaction_Score': 'green', 'Avg_Dissatisfaction_Score': 'red'})

# Update x-axis for readability
fig.update_xaxes(tickangle=45)
fig.show()


## RFM Calculation

#### To calculate RFM (Recency, Frequency, and Monetary) metrics, follow these steps:

#### 1. Prepare the Dataset
Ensure your dataset has the following columns:

- CustomerID: Unique identifier for each customer.
- InvoiceDate: Date of each transaction.
- InvoiceNo: Unique identifier for each order.
- Quantity: Number of items purchased.
- UnitPrice: Price per unit of the product.

#### 2. Calculate RFM Metrics
- Recency (R): Days since the last purchase.
- Frequency (F): Total number of orders per customer.
- Monetary (M): Total monetary value of purchases per customer.

In [None]:
import pandas as pd

# Add a total price column
cleaned_data.loc[:,'TotalPrice'] = cleaned_data['Quantity'] * cleaned_data['UnitPrice']

# Define a reference date for recency calculation (e.g., last transaction date in the dataset)
reference_date = cleaned_data['InvoiceDate'].max()

# Group by CustomerID to calculate RFM
rfm = cleaned_data.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('TotalPrice', 'sum')
).reset_index()

# Display the RFM metrics
print(rfm.head())

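### Explanation of the Code
#### Recency:
(reference_date - x.max()).days: Calculates the number of whole days between each customer's last purchase and the latest invoice date in the dataset.
#### Frequency:
'nunique' on InvoiceNo counts the number of unique invoices per customer. Invoices for returns (negative quantities) are still present in cleaned_data, so they are counted as well.
#### Monetary:
The sum of the TotalPrice column calculates the total revenue generated by each customer. Rows with negative quantities are included, so the figure is net of returns.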
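
To make the three aggregations concrete, here is a minimal sketch on a hypothetical five-row table. The customers, invoice numbers, and prices are invented for illustration; only the column names and the aggregation logic mirror the cell above.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror the notebook's dataset
toy = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'InvoiceNo': ['A1', 'A1', 'A2', 'B1', 'CB2'],
    'InvoiceDate': pd.to_datetime(
        ['2011-01-01', '2011-01-01', '2011-01-10', '2011-01-05', '2011-01-06']),
    'Quantity': [2, 1, 3, 4, -1],  # the last row is a return
    'UnitPrice': [5.0, 10.0, 2.0, 1.5, 1.5],
})
toy['TotalPrice'] = toy['Quantity'] * toy['UnitPrice']

# Reference date is the latest invoice date in the toy data (2011-01-10)
toy_reference_date = toy['InvoiceDate'].max()

toy_rfm = toy.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (toy_reference_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),
    Monetary=('TotalPrice', 'sum'),
).reset_index()

# Customer 1: Recency 0, Frequency 2, Monetary 26.0
# Customer 2: Recency 4, Frequency 2, Monetary 4.5 (6.0 purchased minus 1.5 returned)
print(toy_rfm)
```

Customer 2 shows how a return affects the metrics: the return invoice raises Frequency by one while lowering Monetary.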

In [None]:
import plotly.express as px

# Recency Distribution
fig_recency = px.histogram(rfm, x='Recency', nbins=20, title='Distribution of Recency (R)')
fig_recency.update_xaxes(title='Days Since Last Purchase')
fig_recency.update_yaxes(title='Count of Customers')
fig_recency.show()


In [None]:

# Frequency Distribution
fig_frequency = px.histogram(rfm, x='Frequency', nbins=20, title='Distribution of Frequency (F)')
fig_frequency.update_xaxes(title='Number of Orders')
fig_frequency.update_yaxes(title='Count of Customers')
fig_frequency.show()


In [None]:
# Monetary Distribution
fig_monetary = px.histogram(rfm, x='Monetary', nbins=20, title='Distribution of Monetary Value (M)')
fig_monetary.update_xaxes(title='Total Monetary Value ($)')
fig_monetary.update_yaxes(title='Count of Customers')
fig_monetary.show()

## RFM Segmentation

#### Step 1: Assign RFM Scores
##### We will divide the Recency, Frequency, and Monetary metrics into four equal-width bins and score each from 1 to 4:

- Recency: Higher scores (closer to 4) indicate more recent purchases.
- Frequency: Higher scores (closer to 4) indicate frequent purchases.
- Monetary: Higher scores (closer to 4) indicate higher spending.

In [None]:
def assign_rfm_scores_with_cut(data, column, higher_is_better=True):
    # pd.cut orders bins from the lowest values to the highest, and labels follow that order
    labels = [1, 2, 3, 4] if higher_is_better else [4, 3, 2, 1]
    return pd.cut(data[column], bins=4, labels=labels)

# Assign scores: a low Recency (recent purchase) should score high, so its labels are reversed
rfm['R_Score'] = assign_rfm_scores_with_cut(rfm, 'Recency', higher_is_better=False)
rfm['F_Score'] = assign_rfm_scores_with_cut(rfm, 'Frequency', higher_is_better=True)
rfm['M_Score'] = assign_rfm_scores_with_cut(rfm, 'Monetary', higher_is_better=True)
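
Note that `pd.cut` with `bins=4` splits the value range into four equal-width intervals, which is not the same as splitting customers into four equally sized groups. The sketch below contrasts the two on hypothetical, skewed order counts; the `pd.qcut` variant is shown only as a possible alternative, not as what the notebook does.

```python
import pandas as pd

# Hypothetical, heavily skewed order counts: most customers order a few times
frequency = pd.Series([1, 1, 1, 2, 2, 3, 4, 100])

# Equal-width bins over the range 1..100: the single outlier stretches the range
equal_width = pd.cut(frequency, bins=4, labels=[1, 2, 3, 4])

# Quantile bins: rank first so that tied values do not create duplicate bin edges
quartiles = pd.qcut(frequency.rank(method='first'), q=4, labels=[1, 2, 3, 4])

equal_width_counts = equal_width.astype(int).value_counts().sort_index().to_dict()
quartile_counts = quartiles.astype(int).value_counts().sort_index().to_dict()

print(equal_width_counts)  # {1: 7, 4: 1}
print(quartile_counts)     # {1: 2, 2: 2, 3: 2, 4: 2}
```

With equal-width bins, seven of the eight toy customers land in score 1, so skewed Frequency and Monetary values can leave the higher scores almost empty.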


In [None]:
# Combine RFM scores into a single score
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)

# Display the first few rows
print(rfm[['CustomerID', 'R_Score', 'F_Score', 'M_Score', 'RFM_Score']].head())

#### Step 2: Analyze RFM Segments
##### After calculating the scores, you can group customers into meaningful segments. The code below uses these rules:

- Champions (RFM Score: 444): Recent, frequent, and high spenders.
- Loyal (RFM Score: 44x): Recent and frequent, but not in the top spending bin.
- Big Spender (RFM Score: x44): Frequent and high spending, but not in the most recent bin.
- Lost (RFM Score: 11x): Old purchases and infrequent.
- At Risk: All remaining score combinations. This is a fallback, so it is broader than the usual meaning of customers who used to buy often but haven’t made recent purchases.

In [None]:
# Define RFM segments based on RFM scores
def segment_rfm(score):
    if score == '444':
        return 'Champion'
    elif score.startswith('44'):
        return 'Loyal'
    elif score.endswith('44'):
        return 'Big Spender'
    elif score.startswith('11'):
        return 'Lost'
    else:
        return 'At Risk'

rfm['Segment'] = rfm['RFM_Score'].apply(segment_rfm)

# Display segment counts
print(rfm['Segment'].value_counts())
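
Because `At Risk` is the fallback branch of `segment_rfm`, it covers most of the possible score combinations. As a rough check, the sketch below enumerates all 64 three-digit scores and counts how many map to each segment; the function body is repeated from the cell above so the block runs on its own.

```python
from collections import Counter
from itertools import product

def segment_rfm(score):
    # Same rules as the notebook's segment_rfm
    if score == '444':
        return 'Champion'
    elif score.startswith('44'):
        return 'Loyal'
    elif score.endswith('44'):
        return 'Big Spender'
    elif score.startswith('11'):
        return 'Lost'
    else:
        return 'At Risk'

# Every combination of R, F, and M scores from 1 to 4
all_scores = [''.join(digits) for digits in product('1234', repeat=3)]
combos_per_segment = Counter(segment_rfm(score) for score in all_scores)

print(combos_per_segment)
```

Of the 64 combinations, 53 fall through to `At Risk`, so a large `At Risk` count in the output above partly reflects the rule set rather than customer behavior.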

### Visualizations

In [None]:
import plotly.express as px

fig = px.pie(rfm, names='Segment', title='RFM Segment Distribution', hole=0.4)
fig.show()

In [None]:
import plotly.express as px

# Count customers in each segment
segment_count = rfm['Segment'].value_counts().reset_index()
segment_count.columns = ['Segment', 'Customer Count']

# Plot the segments
fig = px.bar(segment_count, x='Segment', y='Customer Count', color='Segment', 
             title='Customer Segmentation Based on RFM Scores')
fig.show()

In [None]:
# Calculate mean Recency, Frequency, and Monetary values by segment
segment_metrics = rfm.groupby('Segment')[['Recency', 'Frequency', 'Monetary']].mean().reset_index()

# Plot the metrics
fig = px.bar(segment_metrics, x='Segment', y=['Recency', 'Frequency', 'Monetary'], 
             title='Average RFM Metrics by Segment', barmode='group')
fig.show()

In [None]:
rfm_scores = rfm.groupby(['R_Score', 'F_Score'], observed=False).size().reset_index(name='Count')

fig = px.density_heatmap(rfm_scores, x='R_Score', y='F_Score', z='Count', 
                         title='Heatmap of RFM Scores', color_continuous_scale='Viridis')
fig.show()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

# Select RFM metrics for clustering
rfm_clustering = rfm[['Recency', 'Frequency', 'Monetary']]

# Standardize the data
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_clustering)

# Check the scaled data
print(f"Scaled RFM Data: \n{rfm_scaled[:5]}")
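
Standardizing matters here because K-Means relies on Euclidean distances, so an unscaled Monetary column would dominate Recency and Frequency. The sketch below reproduces the z-score that `StandardScaler` applies by default (subtract the column mean, divide by the population standard deviation) with plain NumPy on hypothetical values.

```python
import numpy as np

# Hypothetical RFM rows: columns are Recency (days), Frequency (orders), Monetary
toy_rfm_values = np.array([
    [10.0, 1.0, 100.0],
    [20.0, 2.0, 300.0],
    [30.0, 6.0, 5000.0],
])

# Z-score each column: subtract the column mean, divide by the population std (ddof=0)
column_means = toy_rfm_values.mean(axis=0)
column_stds = toy_rfm_values.std(axis=0)
toy_scaled = (toy_rfm_values - column_means) / column_stds

print(toy_scaled.mean(axis=0).round(6))  # each column mean is approximately 0
print(toy_scaled.std(axis=0).round(6))   # each column standard deviation is approximately 1
```

After scaling, all three columns contribute on a comparable scale, although strongly skewed columns can still be driven by a few extreme customers.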


In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow method to find the optimal number of clusters
inertia = []
K = range(1, 11)  # Testing 1 to 10 clusters
for k in K:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # Explicitly set n_init
    kmeans.fit(rfm_scaled)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K, inertia, marker='o')
plt.title('Elbow Method for Optimal Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
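
Reading the elbow off the plot is a judgment call. One simple heuristic, sketched below under the assumption that the curve has a single clear bend, is to pick the k where the second difference of the inertia values is largest. The inertia numbers are hypothetical and not taken from this dataset.

```python
import numpy as np

# Hypothetical inertia values for k = 1..7 (not taken from the notebook's data)
candidate_k = list(range(1, 8))
toy_inertia = np.array([1000.0, 800.0, 600.0, 400.0, 380.0, 365.0, 355.0])

# Second difference: how sharply the curve bends at each interior k
curvature = np.diff(toy_inertia, n=2)

# curvature[i] is centred on candidate_k[i + 1]
elbow_k = candidate_k[int(np.argmax(curvature)) + 1]

print(elbow_k)  # 4 for these toy values
```

This is only a rough aid: on smoothly decaying curves the heuristic tends to favor small k, so it is worth checking the result against the plot.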

In [None]:
# Apply K-Means with the chosen number of clusters
optimal_k = 4  # Replace with the number from the elbow method
kmeans = KMeans(n_clusters=optimal_k,n_init=10, random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)

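# Map clusters to meaningful names (optional).
# K-Means cluster IDs are arbitrary, so treat this mapping as a placeholder: check each cluster's
# mean Recency, Frequency, and Monetary values and adjust the names before relying on them.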
rfm['Cluster'] = rfm['Cluster'].map({
    0: 'Low Value', 
    1: 'High Value', 
    2: 'Medium Value', 
    3: 'New Customers'
})
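
Since K-Means assigns cluster IDs arbitrarily, one way to avoid a hard-coded mapping is to derive the names from the cluster averages. The sketch below ranks three hypothetical clusters by mean Monetary value; the toy data and the three-name list are assumptions for illustration, and a label such as New Customers would also need a check on Recency and Frequency.

```python
import pandas as pd

# Hypothetical clustered customers; Cluster holds raw K-Means labels
toy = pd.DataFrame({
    'Cluster': [0, 0, 1, 1, 2, 2],
    'Recency': [5, 9, 200, 250, 40, 60],
    'Frequency': [20, 30, 1, 2, 5, 7],
    'Monetary': [5000.0, 7000.0, 50.0, 150.0, 800.0, 1200.0],
})

cluster_means = toy.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean()

# Rank clusters by average spend and name them from lowest to highest
ordered_ids = cluster_means['Monetary'].sort_values().index
value_names = ['Low Value', 'Medium Value', 'High Value']  # assumes exactly three clusters
name_by_cluster = dict(zip(ordered_ids, value_names))

toy['ClusterName'] = toy['Cluster'].map(name_by_cluster)
print(name_by_cluster)
```

Here cluster 0 has the highest average spend and is named High Value, regardless of the ID that K-Means happened to assign.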


In [None]:
import plotly.express as px

# Create a scatter plot for visualizing clusters
fig = px.scatter(rfm, x='Recency', y='Monetary', color='Cluster', size='Frequency',
                 title='Customer Segmentation Using K-Means Clustering',
                 labels={'Cluster': 'Customer Cluster'})
fig.show()


In [None]:
# Count customers in each cluster
cluster_count = rfm['Cluster'].value_counts().reset_index()
cluster_count.columns = ['Cluster', 'Customer Count']

# Bar plot of cluster distribution
fig = px.bar(cluster_count, x='Cluster', y='Customer Count', color='Cluster',
             title='Distribution of Customers Across Clusters')
fig.show()


#### Analyze and Interpret Results
##### Each cluster can be interpreted as follows:

- High Value: Frequent purchasers with high monetary value.
- Medium Value: Moderate frequency and spending.
- Low Value: Rare purchasers with low spending.
- New Customers: Recently joined but may show potential.

### Segment Profiling

In [None]:
# Grouping customers by segments and calculating descriptive statistics
segment_profile = rfm.groupby('Cluster').agg({
    'Recency': ['mean', 'median', 'min', 'max'],
    'Frequency': ['mean', 'median', 'min', 'max'],
    'Monetary': ['mean', 'median', 'min', 'max'],
    'CustomerID': 'count'  # Number of customers in each segment
}).reset_index()

# Rename columns for better readability
segment_profile.columns = ['Cluster', 'R_mean', 'R_median', 'R_min', 'R_max', 
                           'F_mean', 'F_median', 'F_min', 'F_max', 
                           'M_mean', 'M_median', 'M_min', 'M_max', 
                           'Customer_Count']

print(segment_profile)


#### Based on the descriptive statistics, you can describe each segment's characteristics. For example:

##### High Value Segment:
- Low Recency (recent purchases), High Frequency, High Monetary values.
- Loyal customers who spend frequently and recently.
##### Low Value Segment:
- High Recency (last purchase was long ago), Low Frequency, Low Monetary values.
- Customers with minimal or no engagement.
##### Medium Value Segment:
- Mid-range values for Recency, Frequency, and Monetary metrics.
- Casual customers who shop occasionally.
##### New Customers Segment:
- Low Frequency and Low Monetary values, but Low Recency (recent purchases).
- Recent acquisitions who have yet to demonstrate loyalty.

In [None]:
# Separate box plots for Recency, Frequency, and Monetary metrics

# Recency
fig_recency = px.box(
    rfm,
    x='Cluster',
    y='Recency',
    title='Recency Distribution by Segment',
    labels={'Recency': 'Recency (Days)', 'Cluster': 'Segment'}
)
fig_recency.update_layout(
    yaxis=dict(title='Recency (Days)', range=[0, rfm['Recency'].max() * 1.1])
)
fig_recency.show()

In [None]:

# Frequency
fig_frequency = px.box(
    rfm,
    x='Cluster',
    y='Frequency',
    title='Frequency Distribution by Segment',
    labels={'Frequency': 'Frequency (Orders)', 'Cluster': 'Segment'}
)
fig_frequency.update_layout(
    yaxis=dict(title='Frequency (Orders)', range=[0, rfm['Frequency'].max() * 1.1])
)
fig_frequency.show()

In [None]:
# Monetary
fig_monetary = px.box(
    rfm,
    x='Cluster',
    y='Monetary',
    title='Monetary Distribution by Segment',
    labels={'Monetary': 'Monetary Value ($)', 'Cluster': 'Segment'}
)
fig_monetary.update_layout(
    yaxis=dict(title='Monetary Value ($)', range=[0, rfm['Monetary'].max() * 1.1])
)
fig_monetary.show()


#### Recency Box Plot:

Focuses on the distribution of Recency values by cluster.
Adjusts the y-axis to fit the maximum Recency values.
#### Frequency Box Plot:

Highlights the distribution of Frequency (order count) per cluster.
Adjusts the y-axis range specifically for Frequency.
#### Monetary Box Plot:

Displays the Monetary Value (spending) distribution across segments.
Sets the y-axis to match the scale of the Monetary values.

### Marketing Recommendation

#### 1. Champions (Low R, High F, High M)
Characteristics:

- These are the most valuable customers who purchase frequently, spend a lot, and have made recent purchases.
##### Strategies:

- Loyalty Rewards: Offer exclusive discounts, loyalty points, or early access to new products to retain these customers.
- Upselling and Cross-Selling: Suggest premium or complementary products.
- Personalized Communication: Send thank-you emails or personalized recommendations based on their purchase history.
- Exclusive Access: Provide invitations to VIP events or access to beta testing of new products.

#### 2. Loyal Customers (Moderate R, High F, High M)
Characteristics:

- These customers are loyal and make frequent purchases, but their last purchase might not be recent.
##### Strategies:

- Win-Back Campaigns: Use email or SMS reminders with discounts to re-engage them.
- Subscription Programs: Introduce memberships or subscription services to keep them engaged.
- Feedback Collection: Ask for feedback on their experience to show their opinion matters.
- Bundled Offers: Encourage repeat purchases through bundled deals.

#### 3. Potential Loyalists (Low R, Moderate F, Moderate M)
Characteristics:

- Recently acquired customers who show potential to become loyal.
##### Strategies:

- Welcome Offers: Send a “thank you for joining” discount or special offer.
- Onboarding Campaigns: Provide educational content about your products/services to build trust.
- Engagement Campaigns: Use email drip campaigns to keep them interested and engaged.
- Incentives for Repeat Purchases: Offer discounts on their next purchase to establish a habit.
#### 4. At-Risk Customers (High R, Moderate F, Moderate M)
Characteristics:

- Previously active customers who haven’t purchased recently.
##### Strategies:

- Reactivation Campaigns: Send personalized win-back offers, such as “We Miss You!” discounts.
- Product Recommendations: Use past purchase history to recommend new or related items.
- Limited-Time Offers: Create urgency with flash sales or time-bound discounts.
- Survey Campaigns: Understand why they stopped buying and address those issues.
#### 5. Hibernating Customers (High R, Low F, Low M)
Characteristics:

- Customers who purchased a long time ago, infrequently, and with low spending.
##### Strategies:

- Low-Cost Engagement: Use cost-effective channels like social media ads to re-engage.
- Occasional Deals: Send seasonal promotions or holiday-specific offers.
- Highlight New Offerings: Show what’s new since they last purchased.
- Focus on Budget-Friendly Options: Market products at lower price points to encourage purchases.
#### 6. New Customers (Low R, Low F, Moderate M)
Characteristics:

- Recently acquired customers with low purchase frequency but moderate spending.
##### Strategies:

- Onboarding Program: Send a welcome email series introducing your brand and offerings.
- First Purchase Incentives: Encourage another purchase with “first repeat discount” offers.
- Nurturing Campaigns: Use storytelling in emails to build an emotional connection with the brand.
- Value Proposition: Highlight quality, service, or benefits to encourage trust.
#### 7. Low-Value Customers (Moderate R, Low F, Low M)
Characteristics:

- Customers with minimal engagement and spending.
##### Strategies:

- Cost-Effective Retention: Focus marketing efforts on automated email campaigns or SMS offers.
- Educational Content: Share content that shows how your products solve problems or improve their lives.
- Focus on Volume: Offer bulk discounts or promotions for buying multiple items.
- Gradual Upselling: Introduce moderately priced products to increase spending.
##### General Recommendations
- Leverage Data-Driven Personalization: Use RFM scores to send personalized offers and recommendations.
- Test Campaigns: Use A/B testing to determine which offers and communication styles resonate with each segment.
- Monitor Engagement: Continuously analyze customer behavior to adjust segmentation and strategies.
- Reward Loyalty: Implement a tier-based loyalty program rewarding consistent engagement.