# 🎯 Customer Segmentation: Online Retail Dataset
## RFM Clustering Project with CRISP-DM

**Business Objective:** Identify 3 customer segments to support differentiated marketing strategies.

---

## Phase 1: Business & Data Understanding 📊

### 1.1 Business Context

An online store needs to tailor its marketing campaigns to customers' purchasing behavior. The target segments are:

- **PREMIUM (Whales):** High monetary value customers → VIP campaigns, exclusive offers
- **RETENTION (Core):** Regular, consistent customers → Loyalty programs
- **REACTIVATION (Swing):** Sporadic customers → Re-engagement incentives

### 1.2 Research Questions

**What are the characteristics of our customers?**
- How many unique customers do we have?
- How are purchases distributed (recency, frequency, spend)?
- Are there anomalous patterns in the data?


### 1.3 Dataset: Online Retail II (2010-2011)

**Source:** UCI Machine Learning Repository  
**Format:** E-commerce transactions  
**Period:** December 2010 - December 2011  

**Dataset columns:**
- `invoice`: Invoice number, shared by all line items of one transaction (6 digits; cancellations are prefixed with 'C', adjustments with 'A')
- `stockcode`: Product code (5 digits; some carry letter suffixes or are special codes)
- `description`: Product name
- `quantity`: Quantity sold (some negative values = returns)
- `invoicedate`: Date and time of the transaction
- `price`: Unit price (a few negative values = adjustments)
- `customer_id`: Customer identifier (some missing)
- `country`: Customer's country

**Initial volume:** 541,910 rows (invoice line items)
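
Before exploring the data, it can help to confirm that the loaded frame actually exposes the columns listed above. The following is a minimal sketch under stated assumptions: `check_schema` is a hypothetical helper, and the raw header spellings in the toy frame (for example `Customer ID`) are assumed for illustration, not read from the real file.

```python
import pandas as pd

# Expected column names after normalization (taken from the column list above).
EXPECTED_COLUMNS = [
    "invoice", "stockcode", "description", "quantity",
    "invoicedate", "price", "customer_id", "country",
]

def check_schema(frame: pd.DataFrame) -> list[str]:
    """Return the expected columns missing from `frame` (hypothetical helper)."""
    # Same normalization rule the notebook applies: lowercase, spaces to underscores.
    normalized = [str(col).lower().replace(" ", "_") for col in frame.columns]
    return [col for col in EXPECTED_COLUMNS if col not in normalized]

# Empty toy frame with assumed raw-style headers; no real data involved.
toy = pd.DataFrame(columns=["Invoice", "StockCode", "Description", "Quantity",
                            "InvoiceDate", "Price", "Customer ID", "Country"])
missing = check_schema(toy)
print(missing)
```

An empty result means every expected column is present; any names returned would point to a renamed or missing field in the source file.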

### 1.4 Data Loading and Initial Exploration

In [1]:
# Import Required Libraries
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Load dataset
df = pd.read_pickle('df_2010-2011.pkl')

# Normalize column names
df.columns = [col.lower().replace(' ', '_') for col in df.columns]

print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)
print(f"Shape: {df.shape}")
print(f"\nColumn Data Types:\n{df.dtypes}")
print(f"\nFirst rows:")
display(df.head())
print(f"\nBasic Statistics:")
display(df.describe())

DATASET OVERVIEW
Shape: (541910, 8)

Column Data Types:
invoice                object
stockcode              object
description            object
quantity                int64
invoicedate    datetime64[ns]
price                 float64
customer_id           float64
country                object
dtype: object

First rows:


| | invoice | stockcode | description | quantity | invoicedate | price | customer_id | country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |



Basic Statistics:


| | quantity | invoicedate | price | customer_id |
|---|---|---|---|---|
| count | 541910.0 | 541910 | 541910.0 | 406830.0 |
| mean | 9.552234 | 2011-07-04 13:35:22.342307584 | 4.611138 | 15287.68416 |
| min | -80995.0 | 2010-12-01 08:26:00 | -11062.06 | 12346.0 |
| 25% | 1.0 | 2011-03-28 11:34:00 | 1.25 | 13953.0 |
| 50% | 3.0 | 2011-07-19 17:17:00 | 2.08 | 15152.0 |
| 75% | 10.0 | 2011-10-19 11:27:00 | 4.13 | 16791.0 |
| max | 80995.0 | 2011-12-09 12:50:00 | 38970.0 | 18287.0 |
| std | 218.080957 | | 96.759765 | 1713.603074 |


### 1.5 Data Quality Assessment

**What anomalies are present in the data?**

We will look for:
- Missing values
- Negative values in quantity and price
- Abnormal codes in invoice and stockcode


In [3]:
# Analyze missing values
missing_data = {
    'Customer ID': df['customer_id'].isna().sum(),
    'Description': df['description'].isna().sum(),
    'Quantity': df['quantity'].isna().sum(),
    'Price': df['price'].isna().sum()
}

# Analyze negative values
negative_data = {
    'Negative Quantity': (df['quantity'] < 0).sum(),
    'Negative Price': (df['price'] < 0).sum()
}

# Analyze unique values
unique_data = {
    'Unique Invoices': df['invoice'].nunique(),
    'Unique StockCodes': df['stockcode'].nunique(),
    'Unique Descriptions': df['description'].nunique(),
    'Unique Customers': df['customer_id'].nunique(),
    'Unique Countries': df['country'].nunique()
}

total_entries = len(df)

print("\n" + "=" * 80)
print("DATA QUALITY ISSUES")
print("=" * 80)
print("\n📌 MISSING VALUES:")
for col, count in missing_data.items():
    pct = (count / total_entries) * 100
    print(f"  {col:20} : {count:8,} ({pct:6.2f}%)")

print("\n📌 NEGATIVE VALUES (Returns/Adjustments):")
for issue, count in negative_data.items():
    pct = (count / total_entries) * 100
    print(f"  {issue:20} : {count:8,} ({pct:6.2f}%)")

print("\n📌 UNIQUE VALUES:")
for col, count in unique_data.items():
    print(f"  {col:20} : {count:8,}")


DATA QUALITY ISSUES

📌 MISSING VALUES:
  Customer ID          :  135,080 ( 24.93%)
  Description          :    1,454 (  0.27%)
  Quantity             :        0 (  0.00%)
  Price                :        0 (  0.00%)

📌 NEGATIVE VALUES (Returns/Adjustments):
  Negative Quantity    :   10,624 (  1.96%)
  Negative Price       :        2 (  0.00%)

📌 UNIQUE VALUES:
  Unique Invoices      :   25,900
  Unique StockCodes    :    4,070
  Unique Descriptions  :    4,223
  Unique Customers     :    4,372
  Unique Countries     :       38


In [4]:
# Create comprehensive data quality dashboard
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=("Missing Values", "Negative Values", "Unique Values",
                    "Invoice Types Distribution", "Quantity & Price Ranges", "Temporal Coverage"),
    specs=[[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}],
           [{"type": "pie"}, {"type": "box"}, {"type": "scatter"}]]
)

# 1. Missing values
missing_cols = list(missing_data.keys())
missing_vals = list(missing_data.values())
fig.add_trace(
    go.Bar(x=missing_cols, y=missing_vals, name='Missing', marker_color='#EF553B', text=missing_vals, textposition='outside'),
    row=1, col=1
)

# 2. Negative values
neg_issues = list(negative_data.keys())
neg_vals = list(negative_data.values())
fig.add_trace(
    go.Bar(x=neg_issues, y=neg_vals, name='Negative', marker_color='#AB63FA', text=neg_vals, textposition='outside'),
    row=1, col=2
)

# 3. Unique values
unique_cols = list(unique_data.keys())
unique_vals = list(unique_data.values())
fig.add_trace(
    go.Bar(x=unique_cols, y=unique_vals, name='Unique', marker_color='#00CC96', text=unique_vals, textposition='outside'),
    row=1, col=3
)

# 4. Invoice types
df_invoice_str = df['invoice'].astype('str')
invoice_normal = df_invoice_str.str.match(r"^\d{6}$").sum()  # raw string: \d is a regex escape, not a Python one
invoice_cancellation = df_invoice_str.str.startswith('C').sum()
invoice_adjustment = df_invoice_str.str.startswith('A').sum()
invoice_other = len(df) - invoice_normal - invoice_cancellation - invoice_adjustment

invoice_types = ['Normal', 'Cancellation', 'Adjustment', 'Other']
invoice_counts = [invoice_normal, invoice_cancellation, invoice_adjustment, invoice_other]

fig.add_trace(
    go.Pie(labels=invoice_types, values=invoice_counts, 
            marker=dict(colors=['#636EFA', '#EF553B', '#00CC96', '#AB63FA']),
            textinfo='percent'),
    row=2, col=1
)

# 5. Box plots for quantity and price
fig.add_trace(go.Box(y=df['quantity'], name='Quantity', marker_color='#636EFA'), row=2, col=2)
fig.add_trace(go.Box(y=df['price'], name='Price', marker_color='#EF553B'), row=2, col=2)

# 6. Temporal coverage (records per month)
df['month'] = pd.to_datetime(df['invoicedate']).dt.to_period('M')
monthly_counts = df.groupby('month').size()
months_str = [str(m) for m in monthly_counts.index]
fig.add_trace(
    go.Scatter(x=months_str, y=monthly_counts.values, mode='lines+markers', name='Records/Month',
               line=dict(color='#636EFA'), marker=dict(size=8)),
    row=2, col=3
)

fig.update_xaxes(title_text="Columns", row=1, col=1)
fig.update_xaxes(title_text="Issues", row=1, col=2)
fig.update_xaxes(title_text="Entity", row=1, col=3)
fig.update_xaxes(title_text="Month", row=2, col=3)

fig.update_layout(height=800, width=1400, title_text="📊 Data Quality Dashboard", showlegend=False)
fig.show()

### 1.6 Detailed Analysis of Key Columns

### Invoice Column - Transaction Types

What types of invoices do we have?
- **Normal (6 digits):** Regular purchases
- **Cancellation (starts with 'C'):** Cancelled or returned orders
- **Adjustment (starts with 'A'):** Price or inventory adjustments
- **Other:** Any other transaction type


In [5]:
# Invoice Column Analysis with Interactive Visualization
df_invoice_str = df['invoice'].astype('str')

# Count invoice types
invoice_normal = df_invoice_str.str.match(r"^\d{6}$").sum()  # raw string: \d is a regex escape, not a Python one
invoice_cancellation = df_invoice_str.str.startswith('C').sum()
invoice_adjustment = df_invoice_str.str.startswith('A').sum()
invoice_other = len(df) - invoice_normal - invoice_cancellation - invoice_adjustment

# Create labels and values
invoice_types = ['Normal', 'Cancellation', 'Adjustment', 'Other']
invoice_counts = [invoice_normal, invoice_cancellation, invoice_adjustment, invoice_other]
invoice_percentages = [count / len(df) * 100 for count in invoice_counts]

# Create 2x2 dashboard with different chart types
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "pie"}, {"type": "bar"}], [{"type": "bar"}, {"type": "pie"}]],
    subplot_titles=("Distribution", "Count by Type", "Percentage (%)", "Proportion")
)

# 1. Pie chart - Distribution
fig.add_trace(
    go.Pie(labels=invoice_types, values=invoice_counts, name='Distribution',
            marker=dict(colors=['#636EFA', '#EF553B', '#00CC96', '#AB63FA'])),
    row=1, col=1
)

# 2. Bar chart - Counts
fig.add_trace(
    go.Bar(x=invoice_types, y=invoice_counts, name='Count',
           marker=dict(color=['#636EFA', '#EF553B', '#00CC96', '#AB63FA']),
           text=invoice_counts, textposition='outside'),
    row=1, col=2
)

# 3. Horizontal bar chart - Percentages
fig.add_trace(
    go.Bar(y=invoice_types, x=invoice_percentages, orientation='h', name='Percentage',
           marker=dict(color=['#636EFA', '#EF553B', '#00CC96', '#AB63FA']),
           text=[f'{p:.2f}%' for p in invoice_percentages], textposition='outside'),
    row=2, col=1
)

# 4. Donut chart - Proportion
fig.add_trace(
    go.Pie(labels=invoice_types, values=invoice_counts, hole=0.3, name='Proportion',
           marker=dict(colors=['#636EFA', '#EF553B', '#00CC96', '#AB63FA'])),
    row=2, col=2
)

fig.update_layout(height=800, width=1200, title_text="Invoice Column Analysis", showlegend=False)
fig.show()

# Print summary
print("\n=== INVOICE COLUMN SUMMARY ===")
print(f"Total Records: {len(df):,}")
print(f"\nInvoice Type Breakdown:")
for invoice_type, count, percentage in zip(invoice_types, invoice_counts, invoice_percentages):
    print(f"  {invoice_type:20} : {count:8,} ({percentage:6.2f}%)")


=== INVOICE COLUMN SUMMARY ===
Total Records: 541,910

Invoice Type Breakdown:
  Normal               :  532,619 ( 98.29%)
  Cancellation         :    9,288 (  1.71%)
  Adjustment           :        3 (  0.00%)
  Other                :        0 (  0.00%)


### StockCode Column - Product Code Patterns

What patterns do the stock codes follow?
- **Normal (5 digits):** Standard product codes (e.g., '23166')
- **With Letters:** Five digits followed by letters (e.g., '23203E')
- **Abnormal:** Special codes such as 'POST', 'DOT', 'M' (adjustments, shipping, charges)


In [6]:
# StockCode Column Analysis with Interactive Visualization
df_stockcode_str = df['stockcode'].astype('str')

# Classify stockcode patterns (raw strings avoid invalid-escape warnings for \d)
stockcode_normal = df_stockcode_str.str.match(r"^\d{5}$").sum()
stockcode_with_letters = df_stockcode_str.str.match(r"^\d{5}[a-zA-Z]+$").sum()
stockcode_abnormal = len(df) - stockcode_normal - stockcode_with_letters

# Get details of abnormal stockcodes
abnormal_mask = ~(df_stockcode_str.str.match(r"^\d{5}$") | df_stockcode_str.str.match(r"^\d{5}[a-zA-Z]+$"))
abnormal_details = df[abnormal_mask]['stockcode'].value_counts().head(20)

# Create labels and values for patterns
pattern_types = ['Normal (5 digits)', 'With Letters', 'Abnormal']
pattern_counts = [stockcode_normal, stockcode_with_letters, stockcode_abnormal]
pattern_percentages = [count / len(df) * 100 for count in pattern_counts]

# Create 2x2 dashboard
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "pie"}, {"type": "bar"}], [{"type": "bar"}, {"type": "pie"}]],
    subplot_titles=("Pattern Distribution", "Count by Pattern", "Top 20 Abnormal Codes", "Proportion")
)

# 1. Pie chart - Pattern Distribution
fig.add_trace(
    go.Pie(labels=pattern_types, values=pattern_counts, name='Distribution',
            marker=dict(colors=['#636EFA', '#EF553B', '#00CC96'])),
    row=1, col=1
)

# 2. Bar chart - Pattern Counts
fig.add_trace(
    go.Bar(x=pattern_types, y=pattern_counts, name='Count',
           marker=dict(color=['#636EFA', '#EF553B', '#00CC96']),
           text=pattern_counts, textposition='outside'),
    row=1, col=2
)

# 3. Horizontal bar chart - Top 20 Abnormal
abnormal_codes = list(abnormal_details.index[::-1])
abnormal_counts = list(abnormal_details.values[::-1])

fig.add_trace(
    go.Bar(y=abnormal_codes, x=abnormal_counts, orientation='h', name='Abnormal',
           marker=dict(color='#AB63FA'),
           text=abnormal_counts, textposition='outside'),
    row=2, col=1
)

# 4. Donut chart - Proportion
fig.add_trace(
    go.Pie(labels=pattern_types, values=pattern_counts, hole=0.3, name='Proportion',
           marker=dict(colors=['#636EFA', '#EF553B', '#00CC96'])),
    row=2, col=2
)

fig.update_xaxes(title_text="", row=2, col=1)
fig.update_yaxes(title_text="StockCode", row=2, col=1)

fig.update_layout(height=900, width=1200, title_text="StockCode Column Analysis", showlegend=False)
fig.show()

# Print summary
print("\n=== STOCKCODE COLUMN SUMMARY ===")
print(f"Total Records: {len(df):,}")
print(f"\nPattern Breakdown:")
for pattern_type, count, percentage in zip(pattern_types, pattern_counts, pattern_percentages):
    print(f"  {pattern_type:25} : {count:8,} ({percentage:6.2f}%)")

print(f"\nTop 20 Abnormal StockCodes:")
for code, count in abnormal_details.items():
    print(f"  {str(code):20} : {count:6,} records")


=== STOCKCODE COLUMN SUMMARY ===
Total Records: 541,910

Pattern Breakdown:
  Normal (5 digits)         :  487,036 ( 89.87%)
  With Letters              :   51,878 (  9.57%)
  Abnormal                  :    2,996 (  0.55%)

Top 20 Abnormal StockCodes:
  POST                 :  1,257 records
  DOT                  :    710 records
  M                    :    571 records
  C2                   :    144 records
  D                    :     77 records
  S                    :     63 records
  BANK CHARGES         :     37 records
  AMAZONFEE            :     34 records
  CRUK                 :     16 records
  DCGSSGIRL            :     13 records
  DCGSSBOY             :     11 records
  gift_0001_20         :     10 records
  gift_0001_10         :      9 records
  gift_0001_30         :      8 records
  DCGS0003             :      5 records
  gift_0001_50         :      4 records
  PADS                 :      4 records
  gift_0001_40         :      3 records
  B                    :   

---

## Phase 2: Data Preparation 🧹

### 2.1 Cleaning Decisions

**What do we do with the problematic data?**

1. **Customers without an ID:** 135,080 records with no customer_id → REMOVE (an ID is required for clustering)
2. **Negative quantities:** 10,624 records in the full dataset → KEEP (they are valid returns)
3. **Negative prices:** 2 records in the full dataset → KEEP (they are valid adjustments)
4. **Abnormal codes:** POST, DOT, M, etc. → REVIEW case by case

**Strategy:** 
- Remove records without customer_id
- Keep all remaining rows; customers whose net spend is not positive are filtered out later, when the RFM table is built
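
Item 4 above leaves abnormal codes for case-by-case review. One way to support that review is to flag rows rather than delete them, so each rule can be inspected separately. The sketch below is illustrative only: the toy rows are made up to mimic the formats seen in the summaries above, and the flag column names (`is_cancellation`, `is_adjustment`, `is_product_code`) are hypothetical, not part of the notebook.

```python
import pandas as pd

# Toy rows for illustration only; values mimic the invoice/stockcode formats above.
toy = pd.DataFrame({
    "invoice": ["536365", "C536379", "A563185", "536370"],
    "stockcode": ["85123A", "D", "B", "POST"],
    "quantity": [6, -1, 1, 3],
})

inv = toy["invoice"].astype(str)
code = toy["stockcode"].astype(str)

# Flag rows instead of dropping them, so every rule stays auditable.
toy["is_cancellation"] = inv.str.startswith("C")
toy["is_adjustment"] = inv.str.startswith("A")
# Product-like codes: five digits, optionally followed by letters.
toy["is_product_code"] = code.str.match(r"^\d{5}[a-zA-Z]*$")

# Candidate "regular sales" subset: normal invoices with product-like codes.
sales = toy[~toy["is_cancellation"] & ~toy["is_adjustment"] & toy["is_product_code"]]
print(sales[["invoice", "stockcode", "quantity"]])
```

Whether such a subset should replace the full data for clustering is a modeling decision; the notebook itself keeps these rows.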


In [7]:
# Data Cleaning Strategy
print("=" * 80)
print("DATA CLEANING")
print("=" * 80)

print(f"\nOriginal dataset: {len(df):,} rows")

# Remove records without customer_id (required for clustering)
df_clean = df[df['customer_id'].notna()].copy()
removed_null_customers = len(df) - len(df_clean)

print(f"Removed NULL customer_id: {removed_null_customers:,} rows")
print(f"Clean dataset: {len(df_clean):,} rows")
print(f"Retention rate: {(len(df_clean)/len(df))*100:.2f}%")

# Convert invoicedate to datetime
df_clean['invoicedate'] = pd.to_datetime(df_clean['invoicedate'])

# Reference date for recency: the latest invoice date in the cleaned data
max_date = df_clean['invoicedate'].max()
print(f"\nData period: {df_clean['invoicedate'].min().date()} to {max_date.date()}")
print(f"Analysis base date: {max_date.date()}")

DATA CLEANING

Original dataset: 541,910 rows
Removed NULL customer_id: 135,080 rows
Clean dataset: 406,830 rows
Retention rate: 75.07%

Data period: 2010-12-01 to 2011-12-09
Analysis base date: 2011-12-09


### 2.2 Feature Engineering: RFM Metrics

Why RFM?
- **Recency (R):** When was the last purchase? (recently active customers)
- **Frequency (F):** How many times did the customer buy? (loyal customers)
- **Monetary (M):** How much did the customer spend? (valuable customers)

These three indicators are widely used as proxies for customer value, and they give the clustering step a compact, comparable description of each customer.
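
To make the three definitions concrete, here is a small worked example on made-up transactions (not taken from the dataset). It follows the same aggregation idea as the notebook, using pandas named aggregation; the customer IDs, invoice numbers, and amounts are all invented for illustration.

```python
import pandas as pd

# Made-up line items: customer 1 has two invoices, customer 2 has one
# purchase plus one cancellation invoice (prefix 'C', negative quantity).
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "invoice": ["1001", "1001", "1002", "1003", "C1004"],
    "invoicedate": pd.to_datetime([
        "2011-01-05", "2011-01-05", "2011-03-01", "2011-02-10", "2011-02-12",
    ]),
    "quantity": [2, 1, 5, 4, -1],
    "price": [10.0, 5.0, 2.0, 3.0, 3.0],
})
toy["sale_total"] = toy["quantity"] * toy["price"]

# Recency is measured against the latest date present in the data.
snapshot = toy["invoicedate"].max()

rfm = toy.groupby("customer_id").agg(
    monetary=("sale_total", "sum"),        # M: net spend
    frequency=("invoice", "nunique"),      # F: distinct invoices
    last_purchase=("invoicedate", "max"),  # latest activity date
)
rfm["recency"] = (snapshot - rfm["last_purchase"]).dt.days  # R: days since last activity
print(rfm)
```

In this toy data, customer 2's cancellation lowers the monetary total but also counts as an extra invoice and as the latest activity date. The same effect applies whenever cancellation rows are kept in the aggregated data.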


In [8]:
# Calculate RFM features
print("\n" + "=" * 80)
print("RFM FEATURE ENGINEERING")
print("=" * 80)

# Calculate the sale total per line item (quantity x unit price)
df_clean['sale_total'] = df_clean['quantity'] * df_clean['price']

# Aggregate by customer_id
df_rfm = df_clean.groupby('customer_id').agg({
    'sale_total': 'sum',        # Monetary (M)
    'invoice': 'nunique',       # Frequency (F)
    'invoicedate': 'max'        # Last purchase date
}).rename(columns={'sale_total': 'monetary', 'invoice': 'frequency', 'invoicedate': 'last_purchase'})

# Calculate Recency in days
df_rfm['recency'] = (max_date - df_rfm['last_purchase']).dt.days

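# Keep only customers with positive net spend (drops zero and negative totals, e.g. refund-only customers)
# Note: tiny positive floating-point residue (such as 1.8e-15) still passes this filter
df_rfm = df_rfm[df_rfm['monetary'] > 0]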

print(f"\nUnique customers after RFM calculation: {len(df_rfm):,}")
print(f"\nRFM STATISTICS:")
print(df_rfm[['recency', 'frequency', 'monetary']].describe())

# Show sample
print(f"\nSample RFM data:")
display(df_rfm.head(10))


RFM FEATURE ENGINEERING

Unique customers after RFM calculation: 4,322

RFM STATISTICS:
           recency    frequency      monetary
count  4322.000000  4322.000000  4.322000e+03
mean     89.343591     5.115687  1.923487e+03
std      99.133565     9.384459  8.263127e+03
min       0.000000     1.000000  1.776357e-15
25%      16.000000     1.000000  3.022925e+02
50%      48.500000     3.000000  6.575500e+02
75%     137.000000     6.000000  1.625740e+03
max     373.000000   248.000000  2.794890e+05

Sample RFM data:


| customer_id | monetary | frequency | last_purchase | recency |
|---|---|---|---|---|
| 12347.0 | 4310.0 | 7 | 2011-12-07 15:52:00 | 1 |
| 12348.0 | 1797.24 | 4 | 2011-09-25 13:13:00 | 74 |
| 12349.0 | 1757.55 | 1 | 2011-11-21 09:51:00 | 18 |
| 12350.0 | 334.4 | 1 | 2011-02-02 16:01:00 | 309 |
| 12352.0 | 1545.41 | 11 | 2011-11-03 14:37:00 | 35 |
| 12353.0 | 89.0 | 1 | 2011-05-19 17:47:00 | 203 |
| 12354.0 | 1079.4 | 1 | 2011-04-21 13:11:00 | 231 |
| 12355.0 | 459.4 | 1 | 2011-05-09 13:49:00 | 213 |
| 12356.0 | 2811.43 | 3 | 2011-11-17 08:40:00 | 22 |
| 12357.0 | 6207.67 | 1 | 2011-11-06 16:07:00 | 32 |


### 2.3 RFM Feature Distributions

How are our metrics distributed?
- Are there customers who bought very recently or very long ago?
- Does frequency vary widely across customers?
- Are there large differences in monetary spend (outliers)?
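
One rough way to approach the outlier question numerically is the 1.5 × IQR rule. The sketch below applies it to made-up spend values (not taken from the dataset). The 1.5 multiplier is a common convention, not something the notebook prescribes.

```python
import pandas as pd

# Made-up spend values with one very large customer, for illustration only.
monetary = pd.Series([120.0, 300.0, 450.0, 650.0, 900.0, 1600.0, 28000.0])

# Interquartile range and the conventional upper fence.
q1, q3 = monetary.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are flagged as candidate outliers.
outliers = monetary[monetary > upper_fence]
print(f"Upper fence: {upper_fence:.1f}")
print(outliers.tolist())
```

Flagged values are candidates for inspection rather than automatic removal; high spenders may be exactly the PREMIUM segment the project is looking for.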


In [9]:
# Create RFM distribution visualizations
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=("Recency Distribution", "Frequency Distribution", "Monetary Distribution",
                    "Recency (Box)", "Frequency (Box)", "Monetary (Box)"),
    specs=[[{"type": "histogram"}, {"type": "histogram"}, {"type": "histogram"}],
           [{"type": "box"}, {"type": "box"}, {"type": "box"}]]
)

# Histograms
fig.add_trace(
    go.Histogram(x=df_rfm['recency'], nbinsx=50, name='Recency', marker_color='#636EFA'),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=df_rfm['frequency'], nbinsx=50, name='Frequency', marker_color='#EF553B'),
    row=1, col=2
)

fig.add_trace(
    go.Histogram(x=df_rfm['monetary'], nbinsx=50, name='Monetary', marker_color='#00CC96'),
    row=1, col=3
)

# Box plots
fig.add_trace(go.Box(y=df_rfm['recency'], name='Recency', marker_color='#636EFA'), row=2, col=1)
fig.add_trace(go.Box(y=df_rfm['frequency'], name='Frequency', marker_color='#EF553B'), row=2, col=2)
fig.add_trace(go.Box(y=df_rfm['monetary'], name='Monetary', marker_color='#00CC96'), row=2, col=3)

fig.update_xaxes(title_text="Days", row=1, col=1)
fig.update_xaxes(title_text="Invoices", row=1, col=2)
fig.update_xaxes(title_text="£", row=1, col=3)

fig.update_layout(height=700, width=1400, title_text="📈 RFM Features Distribution", showlegend=False)
fig.show()

In [10]:
# Normalize RFM features (0-1 scale) for clustering
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_rfm_scaled = scaler.fit_transform(df_rfm[['recency', 'frequency', 'monetary']])
df_rfm_scaled_df = pd.DataFrame(
    df_rfm_scaled,
    columns=['recency_scaled', 'frequency_scaled', 'monetary_scaled'],
    index=df_rfm.index
)

print("\nSCALED RFM STATISTICS (0-1 range):")
print(df_rfm_scaled_df.describe())

# Calculate correlation matrix
correlation_matrix = df_rfm[['recency', 'frequency', 'monetary']].corr()
print("\nCORRELATION MATRIX:")
print(correlation_matrix)

# Visualize correlation heatmap
fig = go.Figure(data=go.Heatmap(
    z=correlation_matrix.values,
    x=['Recency', 'Frequency', 'Monetary'],
    y=['Recency', 'Frequency', 'Monetary'],
    colorscale='RdBu',
    zmid=0,
    text=correlation_matrix.values,
    texttemplate='%{text:.2f}',
    textfont={"size": 14}
))

fig.update_layout(title='RFM Correlation Heatmap', width=600, height=600)
fig.show()


SCALED RFM STATISTICS (0-1 range):
       recency_scaled  frequency_scaled  monetary_scaled
count     4322.000000       4322.000000      4322.000000
mean         0.239527          0.016663         0.006882
std          0.265774          0.037994         0.029565
min          0.000000          0.000000         0.000000
25%          0.042895          0.000000         0.001082
50%          0.130027          0.008097         0.002353
75%          0.367292          0.020243         0.005817
max          1.000000          1.000000         1.000000

CORRELATION MATRIX:
            recency  frequency  monetary
recency    1.000000  -0.258376 -0.130575
frequency -0.258376   1.000000  0.565728
monetary  -0.130575   0.565728  1.000000
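
The scaled statistics above show medians of about 0.008 for frequency and 0.002 for monetary, which suggests that a few extreme customers compress everyone else toward zero under min-max scaling. A common mitigation, shown here only as an optional sketch and not as the notebook's method, is to apply `np.log1p` before scaling. The toy values below are invented to resemble the ranges reported above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented values with one extreme customer, loosely resembling the reported ranges.
toy = pd.DataFrame({
    "frequency": [1, 1, 3, 6, 248],
    "monetary": [300.0, 650.0, 1600.0, 4000.0, 280000.0],
})

# Plain min-max scaling versus log1p followed by min-max scaling.
plain = MinMaxScaler().fit_transform(toy)
logged = MinMaxScaler().fit_transform(np.log1p(toy))

# Compare where the median customer lands on the 0-1 scale in each case.
plain_median = np.median(plain, axis=0)
logged_median = np.median(logged, axis=0)
print("Median after plain scaling:", plain_median.round(4))
print("Median after log1p + scaling:", logged_median.round(4))
```

With the log transform, the median customer sits noticeably further from zero, so distance-based clustering is less dominated by the largest accounts. Whether that is desirable depends on how much weight the business wants to give to its biggest spenders.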
