# Business Insights and Recommendations

This notebook applies advanced analysis techniques to extract actionable insights. It covers customer segmentation with RFM and clustering, product performance analysis, seasonal pattern detection, and geographic analysis, and uses the results to produce data-driven strategic recommendations.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from pathlib import Path

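sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Figures and exports below are written to ../results; create it if it is missing
Path('../results').mkdir(parents=True, exist_ok=True)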

## 1. Data Loading and Preparation

We load the clean dataset and prepare the data structures needed for the analyses that follow. Converting ORDERDATE to datetime is essential for time-based metrics such as purchase recency in the RFM segmentation.

In [None]:
df = pd.read_csv('../data/processed/sales_data_clean.csv')
df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'])
print(f"Data loaded: {df.shape}")
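
Because every time-based metric depends on ORDERDATE parsing correctly, it can help to check for unparseable dates right after loading. The sketch below is only an illustration: it uses a hypothetical three-row extract held in memory (not the real file) and `errors='coerce'`, which turns invalid values into `NaT` so they can be counted instead of raising.

```python
import io

import pandas as pd

# Hypothetical three-row extract standing in for sales_data_clean.csv
raw_csv = io.StringIO(
    "ORDERNUMBER,ORDERDATE,SALES\n"
    "10100,2003-01-06,2871.00\n"
    "10101,2003-01-09,2765.90\n"
    "10102,not-a-date,3884.34\n"
)
sample = pd.read_csv(raw_csv)

# errors='coerce' converts unparseable values to NaT instead of raising
sample["ORDERDATE"] = pd.to_datetime(sample["ORDERDATE"], errors="coerce")

# Count the rows whose date could not be parsed
n_bad_dates = int(sample["ORDERDATE"].isna().sum())
print(f"Rows with unparseable dates: {n_bad_dates}")
```

A non-zero count would suggest revisiting the cleaning step before computing recency.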

## 2. Customer Segmentation

Segmentation identifies groups of customers with similar behavior so that each group can be addressed with a differentiated strategy.

### 2.1 RFM Analysis (Recency, Frequency, Monetary)

RFM is a marketing technique that segments customers along three dimensions: Recency (how recently they bought), Frequency (how often they buy), and Monetary (how much they spend). We compute each metric per customer and assign a score from 1 to 4 by quartile, where 4 is best; the Recency labels are reversed so that the most recent buyers receive the highest score. Customers with high scores on all three dimensions are the most valuable.

In [None]:
snapshot_date = df['ORDERDATE'].max() + pd.Timedelta(days=1)

rfm = df.groupby('CUSTOMERNAME').agg({
    'ORDERDATE': lambda x: (snapshot_date - x.max()).days,
    'ORDERNUMBER': 'nunique',
    'SALES': 'sum'
})

rfm.columns = ['Recency', 'Frequency', 'Monetary']
rfm['Recency_Score'] = pd.qcut(rfm['Recency'], 4, labels=[4, 3, 2, 1])
rfm['Frequency_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['Monetary_Score'] = pd.qcut(rfm['Monetary'], 4, labels=[1, 2, 3, 4])
rfm['RFM_Score'] = rfm['Recency_Score'].astype(str) + rfm['Frequency_Score'].astype(str) + rfm['Monetary_Score'].astype(str)

print("RFM Analysis Sample:")
print(rfm.head(10))
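
The Frequency score above is computed on ranks rather than on raw counts. A small sketch of why, using hypothetical order counts: when many customers share the same value, several quartile edges coincide and `pd.qcut` raises, whereas `rank(method='first')` breaks ties by position so the edges are unique.

```python
import pandas as pd

# Hypothetical order counts for eight customers; most placed exactly 2 orders
frequency = pd.Series([1, 2, 2, 2, 2, 2, 3, 5], name="Frequency")

# Direct quartile binning fails because several quantile edges equal 2
try:
    pd.qcut(frequency, 4, labels=[1, 2, 3, 4])
    direct_binning_failed = False
except ValueError:
    direct_binning_failed = True

# Ranking with method='first' breaks ties by position, giving unique edges
scores = pd.qcut(frequency.rank(method="first"), 4, labels=[1, 2, 3, 4])

print(direct_binning_failed)
print(scores.tolist())
```

The trade-off is that customers with identical frequency can land in different score buckets, depending only on their row order.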

In [None]:
def segment_customer(row):
    r, f, m = int(row['Recency_Score']), int(row['Frequency_Score']), int(row['Monetary_Score'])
    
    if r >= 3 and f >= 3 and m >= 3:
        return 'Champions'
    elif r >= 3 and f >= 2:
        return 'Loyal Customers'
    elif r >= 3:
        return 'Potential Loyalists'
    elif f >= 3 and m >= 3:
        return 'At Risk'
    elif r <= 2 and f <= 2:
        return 'Lost'
    else:
        return 'Others'

rfm['Segment'] = rfm.apply(segment_customer, axis=1)
segment_counts = rfm['Segment'].value_counts()
print("\nCustomer Segments:")
print(segment_counts)
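
Because the rules are evaluated in order, each score combination falls into the first branch it satisfies. As a quick sanity check, the sketch below enumerates all 64 possible (R, F, M) combinations and counts how many map to each segment. `segment_from_scores` is a hypothetical helper that mirrors the rules of `segment_customer` but takes plain integers.

```python
from collections import Counter
from itertools import product


def segment_from_scores(r, f, m):
    """Apply the same ordered rules as segment_customer to integer scores."""
    if r >= 3 and f >= 3 and m >= 3:
        return 'Champions'
    elif r >= 3 and f >= 2:
        return 'Loyal Customers'
    elif r >= 3:
        return 'Potential Loyalists'
    elif f >= 3 and m >= 3:
        return 'At Risk'
    elif r <= 2 and f <= 2:
        return 'Lost'
    else:
        return 'Others'


# Count how many of the 4 x 4 x 4 score combinations reach each segment
coverage = Counter(
    segment_from_scores(r, f, m) for r, f, m in product(range(1, 5), repeat=3)
)
for name, n_combinations in coverage.most_common():
    print(f"{name}: {n_combinations}")
```

The enumeration shows that 'Others' only captures customers with low recency, high frequency and low monetary scores, which may deserve a more descriptive label.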

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

axes[0].pie(segment_counts.values, labels=segment_counts.index, autopct='%1.1f%%', startangle=90)
axes[0].set_title('Customer Segmentation Distribution')

segment_value = rfm.groupby('Segment')['Monetary'].sum().sort_values(ascending=False)
axes[1].barh(segment_value.index, segment_value.values)
axes[1].set_xlabel('Total Monetary Value ($)')
axes[1].set_title('Monetary Value by Customer Segment')
axes[1].invert_yaxis()

plt.tight_layout()
plt.savefig('../results/customer_segmentation.png', dpi=300, bbox_inches='tight')
plt.show()

### 2.2 K-Means Clustering

We apply the K-Means machine learning algorithm for unsupervised segmentation. First we standardize the RFM variables with StandardScaler so that they share the same scale; otherwise Monetary, which has the largest magnitude, would dominate the distance computation. We then group customers into 4 clusters based on multidimensional similarity. This method complements the rule-based RFM segments by discovering groupings that arise from the data itself.

In [None]:
rfm_normalized = rfm[['Recency', 'Frequency', 'Monetary']].copy()
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_normalized)

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled)

cluster_summary = rfm.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean().round(2)
print("\nCluster Characteristics:")
print(cluster_summary)
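
The choice of 4 clusters is fixed in the cell above. One common way to check such a choice is to compare the silhouette score across several values of k. The sketch below does this on synthetic data (four well-separated blobs in three dimensions, generated here purely for illustration), not on the real RFM table, so the result says nothing about the actual customers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RFM matrix: four tight blobs in 3 dimensions
rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 3)) for c in centers])
scaled = StandardScaler().fit_transform(points)

# Fit K-Means for each candidate k and record the mean silhouette score
silhouette_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(scaled)
    silhouette_by_k[k] = silhouette_score(scaled, labels)

# The k with the highest score is the best-separated partition
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(f"Best k by silhouette: {best_k}")
```

Running the same loop on `rfm_scaled` would indicate whether 4 is a reasonable choice for this dataset.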

In [None]:
fig = plt.figure(figsize=(14, 6))
ax = fig.add_subplot(121, projection='3d')

scatter = ax.scatter(rfm['Recency'], rfm['Frequency'], rfm['Monetary'], 
                    c=rfm['Cluster'], cmap='viridis', alpha=0.6, s=50)
ax.set_xlabel('Recency')
ax.set_ylabel('Frequency')
ax.set_zlabel('Monetary')
ax.set_title('Customer Clusters (3D)')
plt.colorbar(scatter, ax=ax, label='Cluster')

ax2 = fig.add_subplot(122)
cluster_counts = rfm['Cluster'].value_counts().sort_index()
ax2.bar(cluster_counts.index, cluster_counts.values)
ax2.set_xlabel('Cluster')
ax2.set_ylabel('Number of Customers')
ax2.set_title('Customers per Cluster')

plt.tight_layout()
plt.savefig('../results/kmeans_clustering.png', dpi=300, bbox_inches='tight')
plt.show()

## 3. Product Performance Analysis

We evaluate each product individually to identify top performers and underperformers. We aggregate by PRODUCTCODE to compute total sales, order count, quantity sold, and average price. The Revenue_per_Order metric (revenue per order) indicates how efficiently each product generates value per transaction.

In [None]:
# Named aggregation; orders are counted as distinct ORDERNUMBER values, not rows
product_performance = df.groupby('PRODUCTCODE').agg(
    Total_Sales=('SALES', 'sum'),
    Order_Count=('ORDERNUMBER', 'nunique'),
    Total_Quantity=('QUANTITYORDERED', 'sum'),
    Avg_Price=('PRICEEACH', 'mean')
)
product_performance['Revenue_per_Order'] = product_performance['Total_Sales'] / product_performance['Order_Count']
product_performance = product_performance.sort_values('Total_Sales', ascending=False)

print("Top 10 Products by Sales:")
print(product_performance.head(10))
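
Since the data is at order-line level, a row count and a distinct-order count can differ whenever a product appears on more than one line of the same order. The toy frame below (hypothetical values) computes both side by side to show how this affects a per-order revenue figure.

```python
import pandas as pd

# Hypothetical order lines: product A appears twice on order 1
lines = pd.DataFrame({
    "ORDERNUMBER": [1, 1, 2, 3],
    "PRODUCTCODE": ["A", "A", "A", "B"],
    "SALES": [100.0, 50.0, 150.0, 80.0],
})

summary = lines.groupby("PRODUCTCODE").agg(
    Total_Sales=("SALES", "sum"),
    Line_Count=("SALES", "count"),            # number of rows (order lines)
    Order_Count=("ORDERNUMBER", "nunique"),   # number of distinct orders
)
summary["Revenue_per_Order"] = summary["Total_Sales"] / summary["Order_Count"]
print(summary)
```

For product A, dividing by lines would give 100 per line, while dividing by distinct orders gives 150 per order.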

In [None]:
top_products = product_performance.head(20)

fig, axes = plt.subplots(2, 1, figsize=(14, 10))

axes[0].barh(range(len(top_products)), top_products['Total_Sales'])
axes[0].set_yticks(range(len(top_products)))
axes[0].set_yticklabels(top_products.index)
axes[0].set_xlabel('Total Sales ($)')
axes[0].set_title('Top 20 Products by Revenue')
axes[0].invert_yaxis()

axes[1].scatter(top_products['Order_Count'], top_products['Total_Sales'], 
               s=top_products['Total_Quantity'], alpha=0.6)
axes[1].set_xlabel('Number of Orders')
axes[1].set_ylabel('Total Sales ($)')
axes[1].set_title('Product Performance: Orders vs Sales (bubble size = quantity)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('../results/product_performance.png', dpi=300, bbox_inches='tight')
plt.show()

## 4. Identifying Seasonal Patterns

We analyze how sales vary over time to detect seasonality. We aggregate by calendar month to identify peak and low months, which is critical input for inventory management and campaign planning. We also aggregate by day of the week to inform daily operations and promotions.

In [None]:
df['Month'] = df['ORDERDATE'].dt.month
df['DayOfWeek'] = df['ORDERDATE'].dt.dayofweek

monthly_pattern = df.groupby('Month')['SALES'].agg(['sum', 'count', 'mean'])
monthly_pattern.columns = ['Total_Sales', 'Order_Count', 'Avg_Sales']

print("Monthly Sales Pattern:")
print(monthly_pattern)
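
Pooling all years into one figure per calendar month can mislead when the years do not cover the same months, for example if the final year ends mid-year. A hedged sketch with hypothetical orders: pivoting by month and year and averaging only over the years in which a month was observed can change which month looks strongest.

```python
import pandas as pd

# Hypothetical orders: March is observed in three years, November in only two
orders = pd.DataFrame({
    "ORDERDATE": pd.to_datetime([
        "2003-03-15", "2004-03-20", "2005-03-01", "2003-11-05", "2004-11-12",
    ]),
    "SALES": [450.0, 450.0, 450.0, 600.0, 700.0],
})
orders["Year"] = orders["ORDERDATE"].dt.year
orders["Month"] = orders["ORDERDATE"].dt.month

# One row per month, one column per year; missing month-year cells are NaN
by_year_month = orders.pivot_table(
    index="Month", columns="Year", values="SALES", aggfunc="sum"
)

pooled_total = by_year_month.sum(axis=1)
years_observed = by_year_month.notna().sum(axis=1)
avg_per_observed_year = pooled_total / years_observed

print("Best month by pooled total:", pooled_total.idxmax())
print("Best month by yearly average:", avg_per_observed_year.idxmax())
```

Here the pooled total favors March only because it was observed in one more year than November.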

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[0].plot(monthly_pattern.index, monthly_pattern['Total_Sales'], marker='o', linewidth=2)
axes[0].set_xticks(range(1, 13))
axes[0].set_xticklabels(month_names)
axes[0].set_ylabel('Total Sales ($)')
axes[0].set_title('Monthly Sales Pattern')
axes[0].grid(True, alpha=0.3)

dayofweek_pattern = df.groupby('DayOfWeek')['SALES'].sum()
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
axes[1].bar(dayofweek_pattern.index, dayofweek_pattern.values)
axes[1].set_xticks(range(7))
axes[1].set_xticklabels(day_names)
axes[1].set_ylabel('Total Sales ($)')
axes[1].set_title('Sales by Day of Week')

plt.tight_layout()
plt.savefig('../results/seasonal_patterns.png', dpi=300, bbox_inches='tight')
plt.show()

## 5. Detailed Geographic Analysis

We take a closer look at performance by location. We group by COUNTRY and TERRITORY to understand regional dynamics and compute Sales_per_Customer (sales per customer) as a measure of how much value each market generates per account. Markets with high sales per customer but few customers may represent expansion opportunities.

In [None]:
geo_analysis = df.groupby(['COUNTRY', 'TERRITORY']).agg({
    'SALES': 'sum',
    'ORDERNUMBER': 'nunique',
    'CUSTOMERNAME': 'nunique'
}).reset_index()
geo_analysis.columns = ['Country', 'Territory', 'Total_Sales', 'Orders', 'Customers']
geo_analysis['Sales_per_Customer'] = geo_analysis['Total_Sales'] / geo_analysis['Customers']
geo_analysis = geo_analysis.sort_values('Total_Sales', ascending=False)

print("Top 15 Countries by Sales:")
print(geo_analysis.head(15))

In [None]:
territory_summary = df.groupby('TERRITORY').agg({
    'SALES': 'sum',
    'CUSTOMERNAME': 'nunique',
    'PRODUCTLINE': 'nunique'
}).reset_index()
territory_summary.columns = ['Territory', 'Total_Sales', 'Customers', 'Product_Lines']
territory_summary['Sales_per_Customer'] = territory_summary['Total_Sales'] / territory_summary['Customers']

print("\nTerritory Performance:")
print(territory_summary)
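
The idea of "high sales per customer but few customers" can be turned into a simple filter. The sketch below uses made-up markets and the median as the cut-off for both conditions; the threshold is an assumption chosen for illustration, not something derived from the dataset.

```python
import pandas as pd

# Hypothetical markets; names and figures are invented for the example
markets = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Total_Sales": [900_000.0, 300_000.0, 200_000.0, 120_000.0],
    "Customers": [30, 3, 10, 2],
})
markets["Sales_per_Customer"] = markets["Total_Sales"] / markets["Customers"]

# Flag markets above the median value per customer and below the median base
high_value = markets["Sales_per_Customer"] > markets["Sales_per_Customer"].median()
small_base = markets["Customers"] < markets["Customers"].median()
opportunities = markets.loc[high_value & small_base, "Country"].tolist()

print("Candidate expansion markets:", opportunities)
```

With very few customers, Sales_per_Customer is driven by one or two accounts, so flagged markets are candidates to investigate rather than firm conclusions.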

## 6. Compiling Key Insights

We condense the most important findings into actionable insights: the top product line, the best sales month, the leading territory, the average order value, customer concentration, and the most common deal size. These insights form the basis for strategic business decisions.

In [None]:
insights = []

# Top performing product line
top_productline = df.groupby('PRODUCTLINE')['SALES'].sum().idxmax()
insights.append(f"Top Product Line: {top_productline}")

# Best month
best_month = monthly_pattern['Total_Sales'].idxmax()
insights.append(f"Best Sales Month: {month_names[best_month-1]}")

# Top territory
top_territory = df.groupby('TERRITORY')['SALES'].sum().idxmax()
insights.append(f"Top Territory: {top_territory}")

# Average order value
avg_order_value = df.groupby('ORDERNUMBER')['SALES'].sum().mean()
insights.append(f"Average Order Value: ${avg_order_value:.2f}")

# Customer concentration
top_20_customers_pct = (rfm.nlargest(20, 'Monetary')['Monetary'].sum() / rfm['Monetary'].sum()) * 100
insights.append(f"Top 20 Customers Contribution: {top_20_customers_pct:.1f}%")

# Most common deal size
common_dealsize = df['DEALSIZE'].value_counts().idxmax()
insights.append(f"Most Common Deal Size: {common_dealsize}")

print("\n" + "="*60)
print("KEY BUSINESS INSIGHTS")
print("="*60)
for i, insight in enumerate(insights, 1):
    print(f"{i}. {insight}")
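
Customer concentration can also be read the other way around: instead of asking what share the top 20 customers contribute, ask how many customers are needed to reach a given share of revenue. A minimal sketch with hypothetical monetary values and an assumed 80% target:

```python
import pandas as pd

# Hypothetical total spend per customer
monetary = pd.Series(
    {"c1": 500.0, "c2": 250.0, "c3": 150.0, "c4": 60.0, "c5": 40.0}
)

# Cumulative revenue share after sorting customers from largest to smallest
cumulative_share = monetary.sort_values(ascending=False).cumsum() / monetary.sum()

# Customers still below the target, plus the one that crosses it
customers_for_80pct = int((cumulative_share < 0.80).sum()) + 1

print(cumulative_share.round(2).to_dict())
print(f"Customers needed for 80% of revenue: {customers_for_80pct}")
```

Applied to `rfm['Monetary']`, the same calculation would give a concentration figure that does not depend on the arbitrary cut-off of 20 customers.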

## 7. Generating Strategic Recommendations

Building on the insights above, we formulate actionable recommendations. Each one is tied to a specific finding and proposes concrete actions, from focusing marketing efforts on particular segments to optimizing seasonal inventory and replicating successful strategies across territories.

In [None]:
recommendations = [
    "Focus marketing efforts on Champions and Loyal Customers for upselling opportunities",
    "Develop re-engagement campaigns for 'At Risk' and 'Lost' customer segments",
    f"Increase inventory and promotions for {top_productline} during peak seasons",
    f"Replicate successful strategies from {top_territory} to other territories",
    "Implement targeted promotions during low-sales months to smooth seasonal variations",
    "Analyze and address why certain product lines underperform in specific regions",
    "Consider loyalty programs for high-frequency, high-monetary value customers",
    "Investigate factors contributing to 'Disputed' and 'Cancelled' orders to reduce losses"
]

print("\n" + "="*60)
print("STRATEGIC RECOMMENDATIONS")
print("="*60)
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec}")

## 8. Exporting Results

We save the analysis tables as CSV files for external use: the RFM analysis, product performance, and geographic analysis. The insights and recommendations are written to a plain-text file so stakeholders can read them easily. These outputs can feed dashboards, executive reports, and BI systems.

In [None]:
rfm.to_csv('../results/customer_rfm_analysis.csv')
product_performance.to_csv('../results/product_performance.csv')
geo_analysis.to_csv('../results/geographic_analysis.csv', index=False)

with open('../results/business_insights.txt', 'w') as f:
    f.write("KEY BUSINESS INSIGHTS\n")
    f.write("="*60 + "\n")
    for i, insight in enumerate(insights, 1):
        f.write(f"{i}. {insight}\n")
    
    f.write("\n" + "="*60 + "\n")
    f.write("STRATEGIC RECOMMENDATIONS\n")
    f.write("="*60 + "\n")
    for i, rec in enumerate(recommendations, 1):
        f.write(f"{i}. {rec}\n")

print("\nAll results exported to results/ directory")
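
File writes such as `to_csv` and `open(..., 'w')` fail when the target directory is missing. The sketch below shows one way to guard against that and to verify an export by reading it back. It writes to a temporary directory with a hypothetical table and file name, so it does not touch the real results folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Create the output directory (and any parents) if it does not exist yet
    results_dir = Path(tmp) / "results"
    results_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical one-row table standing in for an analysis result
    table = pd.DataFrame({"Segment": ["Champions"], "Customers": [1]})
    csv_path = results_dir / "example.csv"
    table.to_csv(csv_path, index=False)

    # Read the file back to confirm the export is complete
    round_trip = pd.read_csv(csv_path)
    exported_rows = len(round_trip)

print(f"Rows written and read back: {exported_rows}")
```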