# Customer Segmentation for Online Retail

## 1. Introduction
This notebook segments customers in the Online Retail dataset. We first build RFM features for each customer: Recency (days since the last purchase), Frequency (number of distinct invoices), and Monetary value (total spend). We then apply K-Means clustering to these features to group customers into segments.
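
To make the three RFM features concrete, here is a minimal sketch on a made-up transaction table. The `toy` frame, its values, and the `snapshot` name are assumptions for illustration only; the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Made-up transactions for three hypothetical customers (illustrative only).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-05", "2011-03-01", "2011-02-10",
        "2011-01-20", "2011-03-09", "2011-03-09",
    ]),
    "TotalPrice": [20.0, 35.0, 12.5, 8.0, 15.0, 7.0],
})

# Reference date: one day after the last transaction in the data.
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerID").agg(
    # Days between the reference date and the customer's last purchase.
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Distinct invoices, so two lines on one invoice count once.
    Frequency=("InvoiceNo", "nunique"),
    # Total spend across all invoice lines.
    MonetaryValue=("TotalPrice", "sum"),
)
print(toy_rfm)
```

Customer 3 has three invoice lines but only two distinct invoices, which is why Frequency counts unique `InvoiceNo` values rather than rows.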

## 2. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

# Load the dataset
df = pd.read_csv('online_retail.csv', encoding='ISO-8859-1')

df.head()

## 3. Data Cleaning and Preprocessing

In [None]:
# Drop rows with missing CustomerID
df.dropna(subset=['CustomerID'], inplace=True)

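# Keep only rows with a positive quantity (drops returns).
# .copy() avoids SettingWithCopyWarning when columns are assigned below.
df = df[df['Quantity'] > 0].copy()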

# Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# Calculate total price
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
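
The quantity filter above catches returns, but it may be worth checking for cancellations and zero-price rows as well. The sketch below runs on a made-up frame; it assumes that cancelled invoices carry a leading "C" in `InvoiceNo`, as documented for the UCI Online Retail dataset, and that the file used here follows the same convention. The `raw`, `is_cancelled`, and `clean` names are hypothetical.

```python
import pandas as pd

# Made-up rows (illustrative only): one normal sale, one cancellation,
# and one zero-price adjustment.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536414"],
    "Quantity": [6, -1, 5],
    "UnitPrice": [2.55, 27.50, 0.0],
    "CustomerID": [17850.0, 14527.0, 12345.0],
})

# Assumption: cancelled invoices are marked with a leading "C".
is_cancelled = raw["InvoiceNo"].astype(str).str.startswith("C")

# Keep only completed sales with a positive quantity and a positive price.
clean = raw[~is_cancelled & (raw["Quantity"] > 0) & (raw["UnitPrice"] > 0)].copy()
clean["TotalPrice"] = clean["Quantity"] * clean["UnitPrice"]
print(clean)
```

Whether zero-price rows should be dropped depends on what they represent in the data, so treat that condition as optional.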

## 4. RFM Analysis

In [None]:
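# Calculate Recency, Frequency, and Monetary values.
# The snapshot date is one day after the last invoice, so the most recent
# customer gets Recency = 1 rather than 0.
snapshot_date = df['InvoiceDate'].max() + dt.timedelta(days=1)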
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda date: (snapshot_date - date.max()).days,
    'InvoiceNo': 'nunique',
    'TotalPrice': 'sum'
})

# Rename columns
rfm.rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency', 'TotalPrice': 'MonetaryValue'}, inplace=True)

rfm.head()
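
RFM features are often strongly right-skewed, and K-Means is sensitive to a few very large values. One common option, not applied in this notebook, is a log transform before scaling. The sketch below uses a made-up table (`toy_rfm` is an assumption for illustration) to show the effect on skewness.

```python
import numpy as np
import pandas as pd

# Made-up RFM table (illustrative only), with one very large spender.
toy_rfm = pd.DataFrame(
    {
        "Recency": [1, 9, 28, 120, 300],
        "Frequency": [40, 5, 2, 1, 1],
        "MonetaryValue": [25000.0, 600.0, 150.0, 40.0, 15.0],
    },
    index=pd.Index([101, 102, 103, 104, 105], name="CustomerID"),
)

# Skewness well above 0 indicates a long right tail.
print(toy_rfm.skew())

# log1p(x) = log(1 + x) compresses large values and is defined at x = 0.
# It assumes non-negative inputs, which holds once returns are removed.
toy_rfm_log = np.log1p(toy_rfm)
print(toy_rfm_log.skew())
```

If the transformed features are used for clustering, cluster summaries should still be reported on the original columns so they stay interpretable.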

## 5. K-Means Clustering

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

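# Scale the three RFM features so each contributes equally to distances.
# Selecting the columns explicitly keeps this cell safe to re-run after
# the Cluster column has been added to rfm.
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'MonetaryValue']])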

# Use the Elbow method to find the optimal number of clusters
wcss = {}
for k in range(1, 11):
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(rfm_scaled)
    wcss[k] = kmeans.inertia_

plt.figure(figsize=(10,6))
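# Convert the dict views to lists so matplotlib receives plain sequences
plt.plot(list(wcss.keys()), list(wcss.values()), 'gs-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS (inertia)')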
plt.title('The Elbow Method')
plt.show()
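
The elbow in a WCSS curve can be hard to read by eye. A silhouette score is one complementary check; it is not part of the original analysis. The sketch below runs on synthetic blobs rather than the retail data, so `X`, `centers`, `scores`, and `best_k` are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix: three well-separated blobs.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

# The silhouette score needs at least two clusters, so k starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher is better; values lie between -1 and 1.
best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

On real RFM data the peak is usually less pronounced than on synthetic blobs, so the score is best read alongside the elbow plot rather than as a replacement for it.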

In [None]:
# Apply K-Means with the number of clusters read off the elbow plot.
# k=4 is used here as an example; adjust it if the plot suggests otherwise.
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)

rfm.head()
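
The fitted model stores its centroids in standardized units, which are hard to interpret directly. As a minimal sketch, assuming a `StandardScaler` was used as above, `inverse_transform` maps them back to days, invoices, and currency. The `toy_rfm` table and its values are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up RFM table (illustrative only): two obvious groups of customers.
toy_rfm = pd.DataFrame({
    "Recency": [2, 5, 3, 200, 220, 210],
    "Frequency": [20, 18, 22, 1, 2, 1],
    "MonetaryValue": [5000.0, 4500.0, 5200.0, 50.0, 80.0, 60.0],
})
features = ["Recency", "Frequency", "MonetaryValue"]

scaler = StandardScaler()
X = scaler.fit_transform(toy_rfm[features])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# cluster_centers_ is in standardized units; undo the scaling to read the
# centroids in the original units of each feature.
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=features,
).round(1)
print(centroids)
```

Cluster labels are arbitrary integers, so the row order of `centroids` carries no meaning on its own.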

## 6. Segment Analysis and Visualization

In [None]:
# Analyze the segments
rfm.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
}).round(1)

In [None]:
# Visualize the segments
sns.pairplot(rfm, hue='Cluster', vars=['Recency', 'Frequency', 'MonetaryValue'])
plt.show()

## 7. Conclusion
Using RFM analysis and K-Means clustering, we grouped customers into four clusters based on how recently, how often, and how much they purchase. The per-cluster means from Section 6 indicate what kind of customer each cluster contains (e.g., loyal high-spenders, new customers, at-risk customers); these labels should be assigned only after inspecting those means. The resulting segments can then inform targeted marketing strategies aimed at improving customer retention and increasing sales.