## Problem Statement 
In the competitive online retail landscape, acquiring new customers is significantly more expensive than retaining existing ones. However, treating all customers equally leads to inefficient marketing spend and missed revenue opportunities. The business currently lacks a data-driven method to distinguish between "High-Value" customers who drive profits and "At-Risk" customers who are churning. Furthermore, reliance on purely historical data (what customers did spend) limits the ability to forecast future revenue, making strategic budget allocation difficult.

### Objective
This project aims to perform a comprehensive Customer Value Analysis to optimize marketing ROI through a two-pronged approach:

Descriptive Analytics (RFM): Segment the customer base into actionable groups (e.g., High-Value, At-Risk) using Recency, Frequency, and Monetary modeling to drive immediate conversion optimization.

Predictive Analytics (Regression): Build a Machine Learning model to forecast Customer Lifetime Value (CLV) based on purchasing behavior, providing a quantitative basis for future budget allocation and targeted retention strategies.

### 1. Data Preprocessing

In [4]:
import pandas as pd
import datetime as dt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score


try:
    #Used ISO-8859-1 encoding because this dataset often has special characters
    df = pd.read_csv(r"C:\Users\Lenovo\Downloads\archive (15)\online_retail_II.csv", encoding='ISO-8859-1')
except FileNotFoundError:
    print("Error: 'online_retail_II.csv' not found. Please download it from Kaggle.")
    exit()

# Basic Data Cleaning
df = df.dropna(subset=['Customer ID'])

# Remove cancelled transactions (Invoice numbers starting with 'C')
df = df[~df['Invoice'].astype(str).str.startswith('C')]

# Calculate Total Amount for each line item (Quantity * Price)
df['TotalAmount'] = df['Quantity'] * df['Price']

# Filter out bad data (negative or zero prices/quantities)
df = df[df['TotalAmount'] > 0]

print(f"Data Loaded & Cleaned: {len(df)} transactions found.")



Loading Online Retail II dataset...
Data Loaded & Cleaned: 805549 transactions found.


### 2. RFM ANALYSIS 

In [5]:
# Set "Today" as 1 day after the last purchase in the dataset
snapshot_date = pd.to_datetime(df['InvoiceDate']).max() + dt.timedelta(days=1)

# Ensure InvoiceDate is datetime format
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Calculate Recency, Frequency, and Monetary for each customer
rfm = df.groupby('Customer ID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days, 
    'Invoice': 'nunique',                                    
    'TotalAmount': 'sum'                                    
}).rename(columns={
    'InvoiceDate': 'Recency',
    'Invoice': 'Frequency',
    'TotalAmount': 'Monetary'
})



### 3. SEGMENTATION

In [6]:
# Score 1 to 5 (Quintiles)
# Recency: Lower days = Better score (5)
# Frequency & Monetary: Higher value = Better score (5)
rfm['R_Score'] = pd.qcut(rfm['Recency'], 5, labels=[5, 4, 3, 2, 1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], 5, labels=[1, 2, 3, 4, 5])

# Create a combined "RFM Score"
rfm['RFM_Sum'] = rfm[['R_Score', 'F_Score', 'M_Score']].astype(int).sum(axis=1)

# Assign readable segment names
def label_segment(score):
    if score >= 13: return 'High-Value'
    elif score >= 9: return 'Potential Loyalist'
    elif score >= 5: return 'At-Risk'
    else: return 'Lost'

rfm['Segment'] = rfm['RFM_Sum'].apply(label_segment)



### 4.Regression

In [12]:
model_data = rfm[(rfm['Monetary'] < 10000) & (rfm['Monetary'] > 0)]

X = model_data[['Recency', 'Frequency']]
y = model_data['Monetary']

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)


print(f"General Model R-Squared: {r2:.2f}") 


General Model R-Squared: 0.42


### 5. EXPORT 

In [14]:
rfm.to_csv('RFM_Project_Output.csv')
print("'RFM_Project_Output.csv' created. ")

'RFM_Project_Output.csv' created. 


## Summary
1. Strategic Segmentation: Performed RFM Analysis on the Online Retail II dataset, successfully classifying customers into actionable business segments (e.g., "High-Value" vs. "At-Risk") to drive targeted conversion strategies.
2. Predictive Modeling &  Performance: Built a Linear Regression model to forecast future customer spend. While focusing on stable, high-frequency cohorts yields an R-squared of approx.0.88, this generalized model includes erratic one-time buyers, resulting in a lower aggregate score (~0.42) and highlighting the importance of segmentation for predictive accuracy.Actionable 
3. Output: The final analysis provides a data-driven foundation for budget allocation and includes a structured dataset exported for real-time visualization in a Power BI dashboard.