# 📌 Grocery Customer Churn

## About the dataset

This dataset contains detailed information on the transactions, behavior, and demographics of a grocery store's customers. It combines customer information, transactional data, and behavioral metrics, which makes it well suited to building predictive models for customer churn and to customer lifetime value (CLV) analysis.

The target column for churn prediction is `churn`, which indicates whether a customer has churned (1) or remained active (0). The dataset is designed for modeling and predicting customer retention, analyzing customer behavior, and forecasting lifetime value.

Main features:

* Customer information: demographic details such as age, gender, income level, marital status, education level, and occupation.

* Transactional data: information about each transaction, such as the transaction date, quantity, price, product category, and payment method.

* Customer behavior metrics: features such as average purchase value, purchase frequency, and last purchase date.

* Promotional data: details about promotions, including type, effectiveness, and target audience.

* Churn: the target column indicates whether the customer has churned (1) or remained active (0).

* Dataset size: 35,843 rows and 26 attributes (including the churn column) according to the dataset description; the file loaded in this notebook has 28 columns (see the `df.shape` output).

This study covers the Data Engineer tasks:

* Obtaining and processing data through an external API.

* Implementing the ETL process (Extract, Transform, Load) to clean and organize the information.

* Developing key visualizations to detect relevant patterns and behaviors.

* Exploring the data (technical EDA) and preparing a report of significant findings, as input for a later in-depth EDA.

## Extraction

In [138]:
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')
from google.colab import files

# Open a file-upload dialog
uploaded = files.upload()

for nombre_archivo in uploaded.keys():
    df = pd.read_csv(nombre_archivo)


Saving Grocery_Customer_Churn_Data.csv to Grocery_Customer_Churn_Data.csv


## General inspection of the dataset

In [139]:
# pandas display settings

pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", "{:.2f}".format)


In [140]:
# Show rows of the DataFrame (a notebook cell only renders the last expression, df.tail())

df.head()
df.tail()



Unnamed: 0,customer_id,age,gender,income_bracket,loyalty_program,membership_years,marital_status,number_of_children,education_level,occupation,transaction_id,transaction_date,product_category,quantity,unit_price,avg_purchase_value,purchase_frequency,last_purchase_date,avg_discount_used,online_purchases,in_store_purchases,total_sales,total_transactions,total_items_purchased,promotion_type,promotion_effectiveness,days_since_last_purchase,churn
35838,C1338,48,Female,Medium,Yes,1,Married,0,PhD,Employed,T1338,2023-12-05,Clothing,4,51.07,100.74,Daily,2023-12-05,5.68,10,5,247.51,8,63,Seasonal Discount,Low,-157,0
35839,C1339,26,Female,Low,Yes,5,Divorced,1,Bachelor's,Student,T1339,2023-12-06,Groceries,4,148.23,46.29,Yearly,2023-12-06,8.76,5,0,1119.07,16,32,Seasonal Discount,Low,-158,0
35840,C1340,57,Male,High,No,2,Married,1,Master's,Unemployed,T1340,2023-12-07,Home Goods,3,107.4,79.63,,2023-12-07,6.11,6,9,724.01,21,28,20% Off,Medium,-159,1
35841,C1341,43,Other,Low,No,12,Divorced,4,Master's,Student,T1341,2023-12-08,Clothing,3,61.33,6.36,Weekly,2023-12-08,7.41,0,4,5141.65,17,15,20% Off,High,-160,0
35842,C1342,53,Other,Low,No,8,Divorced,3,Master's,Employed,T1342,2023-12-09,Toys,3,17.98,134.11,Yearly,2023-12-09,13.52,10,3,47.96,48,56,Buy One Get One Free,Low,-161,1


In [141]:
# DataFrame dimensions (rows, columns)
df.shape

(35843, 28)

In [142]:
# General information about the DataFrame

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35843 entries, 0 to 35842
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               35843 non-null  object 
 1   age                       35843 non-null  int64  
 2   gender                    35843 non-null  object 
 3   income_bracket            35843 non-null  object 
 4   loyalty_program           35843 non-null  object 
 5   membership_years          35843 non-null  int64  
 6   marital_status            35843 non-null  object 
 7   number_of_children        35843 non-null  int64  
 8   education_level           35843 non-null  object 
 9   occupation                35843 non-null  object 
 10  transaction_id            35843 non-null  object 
 11  transaction_date          35843 non-null  object 
 12  product_category          35843 non-null  object 
 13  quantity                  35843 non-null  int64  
 14  unit_p

**Note:**

The dataset contains 35,843 records and 28 columns, combining customer and transaction information.

It mixes several kinds of variables:

* Identifiers (customer_id, transaction_id)

* Customer demographic variables

* Transactional variables

* Historical aggregate variables

This indicates that the dataset is not normalized and has multiple levels of granularity.

Type issues to address:

* transaction_date, last_purchase_date → object ❌ (should be datetime)

* purchase_frequency → should be categorical (object is fine, but the values need validating)

* promotion_effectiveness → categorical, not numeric
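The date columns flagged above can be converted right after loading. The following is a minimal sketch, assuming a small in-memory CSV (hypothetical sample rows, including one deliberately invalid date) in place of the uploaded file:

```python
import io
import pandas as pd

# Hypothetical sample with the same date columns as the dataset
csv_text = """customer_id,transaction_date,last_purchase_date
C1338,2023-12-05,2023-12-05
C1339,2023-12-06,not-a-date
"""

date_cols = ["transaction_date", "last_purchase_date"]
sample = pd.read_csv(io.StringIO(csv_text))

# errors="coerce" turns unparseable values into NaT instead of raising
for col in date_cols:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

print(sample.dtypes)
print("unparseable last_purchase_date values:", sample["last_purchase_date"].isna().sum())
```

Counting the `NaT` values after conversion shows how many raw strings could not be parsed, which is worth checking before any date arithmetic.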

## Classifying variables

In [143]:
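# Note: in the raw file, purchase_frequency and promotion_effectiveness hold
# text labels; they are listed here because they are recomputed as numeric
# metrics later in the notebook.
num_cols = [
    "age", "membership_years", "number_of_children",
    "quantity", "unit_price", "avg_purchase_value",
    "purchase_frequency", "avg_discount_used",
    "online_purchases", "in_store_purchases",
    "total_sales", "total_transactions",
    "total_items_purchased", "promotion_effectiveness",
    "days_since_last_purchase"
]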

categorical_cols = [
    "gender", "income_bracket", "loyalty_program",
    "marital_status", "education_level", "occupation",
    "product_category", "promotion_type", "churn"
]

date_cols = [
    "transaction_date", "last_purchase_date"
]

id_cols = [
    "customer_id", "transaction_id"
]


The columns were explicitly classified by logical type (numeric, categorical, dates, and identifiers) in order to:

* Make data quality validations easier

* Apply type-specific transformations

* Improve the maintainability of the ETL pipeline
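Hand-written column lists drift easily as a pipeline evolves. As a hedged sketch, a small helper (the name `check_column_groups` and the toy frame are hypothetical, not part of the notebook) can report columns that are unassigned, assigned to more than one group, or absent from the DataFrame:

```python
import pandas as pd

def check_column_groups(df, groups):
    """Return columns that are unassigned, assigned twice, or not in df."""
    assigned = [c for cols in groups.values() for c in cols]
    duplicated = sorted({c for c in assigned if assigned.count(c) > 1})
    unassigned = sorted(set(df.columns) - set(assigned))
    unknown = sorted(set(assigned) - set(df.columns))
    return {"unassigned": unassigned, "duplicated": duplicated, "unknown": unknown}

# Hypothetical toy frame and deliberately imperfect groups
toy = pd.DataFrame({"customer_id": ["C1"], "age": [30], "gender": ["Male"], "churn": [0]})
groups = {"id": ["customer_id"], "num": ["age"], "cat": ["gender", "age"]}

report = check_column_groups(toy, groups)
print(report)
```

Run against the real DataFrame with the four lists above, all three entries should come back empty if the classification is complete and consistent.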

## Detecting null values

In [144]:
nulls = df.isna().sum().sort_values(ascending=False)
nulls[nulls > 0]


Unnamed: 0,0
promotion_type,5213
avg_purchase_value,5206
purchase_frequency,5179
total_sales,3584


**Note:**

Missing values are concentrated in variables related to promotions, purchase frequency, and aggregate metrics.

Columns with nulls:

* promotion_type: 5,213
* avg_purchase_value: 5,206
* purchase_frequency: 5,179
* total_sales: 3,584

Likely explanations:

* promotion_type → not every transaction was associated with a promotion

* purchase_frequency and avg_purchase_value → these aggregate metrics may not have been calculated for every record

* total_sales → an inconsistency; the value may be missing for customers without a consolidated history

These nulls are not automatically treated as errors; each requires a business decision (imputation, an explicit category, or exclusion, depending on the case).
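Raw counts are easier to judge next to their share of the rows. A minimal sketch, assuming a hypothetical helper `null_report` and toy data rather than the real file:

```python
import numpy as np
import pandas as pd

def null_report(df):
    """Count and percentage of nulls, only for columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    return report[report["null_count"] > 0].sort_values("null_count", ascending=False)

# Hypothetical toy data
toy = pd.DataFrame({
    "promotion_type": ["20% Off", None, None, "Seasonal Discount"],
    "total_sales": [10.0, np.nan, 5.0, 7.5],
    "age": [30, 41, 52, 63],
})

report = null_report(toy)
print(report)
```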


## Duplicates and uniqueness

In [145]:
df["customer_id"].duplicated().sum()


np.int64(35343)

In [146]:
df["transaction_id"].duplicated().sum()


np.int64(35343)

In [147]:
df.groupby("customer_id")["transaction_id"].nunique().describe()


Unnamed: 0,transaction_id
count,500.0
mean,1.0
std,0.0
min,1.0
25%,1.0
50%,1.0
75%,1.0
max,1.0


In [148]:
df["transaction_id"].nunique()

500

In [149]:
df["customer_id"].value_counts().head()

Unnamed: 0_level_0,count
customer_id,Unnamed: 1_level_1
C1342,72
C1341,72
C1340,72
C1339,72
C1338,72


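**Note:**

* Duplicated customer_id values: 35,343 (35,843 rows minus 500 unique customers)

* Duplicated transaction_id values: 35,343, with only 500 unique transaction_id values

* Distinct transaction_id values per customer_id: count = 500, mean = 1, std = 0

Structural conclusion:

* Each customer_id appears about 72 times, and so does each transaction_id

* Every customer_id maps to exactly one transaction_id

* transaction_id therefore does NOT identify a transaction: it identifies a CUSTOMER, not an event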
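The relationship between the two identifiers can be summarized in one call. This is a sketch under stated assumptions: `key_cardinality` is a hypothetical helper and the toy frame only mimics the pattern found above.

```python
import pandas as pd

def key_cardinality(df, parent, child):
    """Summarize how many distinct child keys each parent key has."""
    per_parent = df.groupby(parent)[child].nunique()
    n_parent = int(df[parent].nunique())
    n_child = int(df[child].nunique())
    return {
        "n_parent": n_parent,
        "n_child": n_child,
        "one_to_one": bool(per_parent.eq(1).all()) and n_parent == n_child,
        "rows_per_parent": len(df) / n_parent,
    }

# Hypothetical toy data: every customer repeats a single transaction_id
toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1", "C2", "C2", "C2"],
    "transaction_id": ["T1", "T1", "T1", "T2", "T2", "T2"],
})

summary = key_cardinality(toy, "customer_id", "transaction_id")
print(summary)
```

A `one_to_one` result of `True` combined with `rows_per_parent` well above 1 is the signature of an identifier that labels the customer rather than the event.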




## Technical EDA

### Numeric and date variables

In [150]:
df[num_cols].describe()

Unnamed: 0,age,membership_years,number_of_children,quantity,unit_price,avg_purchase_value,avg_discount_used,online_purchases,in_store_purchases,total_sales,total_transactions,total_items_purchased,days_since_last_purchase
count,35843.0,35843.0,35843.0,35843.0,35843.0,30637.0,35843.0,35843.0,35843.0,32259.0,35843.0,35843.0,35843.0
mean,44.22,7.68,2.07,2.99,103.78,81.95,10.03,5.03,4.74,2634.98,28.62,55.26,-67.75
std,14.94,4.4,1.42,1.4,56.24,42.54,2.87,3.16,3.15,1507.68,13.42,26.36,144.18
min,18.0,1.0,0.0,1.0,5.16,-7.21,5.01,0.0,0.0,-1514.41,5.0,10.0,-318.0
25%,32.0,4.0,1.0,2.0,54.21,43.36,7.52,2.0,2.0,1401.95,18.0,31.0,-192.0
50%,45.0,8.0,2.0,3.0,107.23,84.71,10.35,5.0,4.0,2700.31,29.0,57.0,-67.0
75%,57.0,11.0,3.0,4.0,151.9,118.64,12.47,8.0,8.0,3869.3,41.0,77.0,57.0
max,70.0,15.0,4.0,5.0,199.75,164.19,14.96,10.0,10.0,6600.61,50.0,100.0,181.0


In [151]:
# Detect anomalies per numeric column: missing, negative, and zero counts,
# plus outliers under the 3-sigma rule (mean ± 3 standard deviations)
def detect_anomalies(df, cols):
    anomalies = {}

    for col in cols:
        if col not in df.columns:
            continue

        # Force numeric conversion (non-numeric values become NaN)
        series_raw = df[col]
        series = pd.to_numeric(series_raw, errors="coerce")

        n_missing = series.isnull().sum()
        n_negative = (series < 0).sum()
        n_zero = (series == 0).sum()

        min_val = series.min()
        max_val = series.max()
        mean_val = series.mean()
        std_val = series.std()

        if pd.notna(std_val) and std_val > 0:
            lower_limit = mean_val - 3 * std_val
            upper_limit = mean_val + 3 * std_val
            n_outliers = ((series < lower_limit) | (series > upper_limit)).sum()
        else:
            n_outliers = 0

        anomalies[col] = {
            "missing": int(n_missing),
            "negative": int(n_negative),
            "zero": int(n_zero),
            "min": min_val,
            "max": max_val,
            "mean": mean_val,
            "std": std_val,
            "outliers_3sigma": int(n_outliers)
        }

    return pd.DataFrame(anomalies).T


In [152]:
# Numeric evaluation
num_anomalies = detect_anomalies(df, num_cols)
display(num_anomalies.sort_values(by="outliers_3sigma", ascending=False))

# Dates: detect future or inconsistent dates
today = pd.Timestamp.today()

date_cols = ["transaction_date", "last_purchase_date"]

for col in date_cols:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors="coerce")


for col in date_cols:
    if col in df.columns:
        n_missing = df[col].isnull().sum()
        n_future = (df[col] > today).sum()
        min_date = df[col].min()
        max_date = df[col].max()

        print(
            f"{col}: "
            f"missing={n_missing}, "
            f"future_dates={n_future}, "
            f"min={min_date}, "
            f"max={max_date}"
        )


Unnamed: 0,missing,negative,zero,min,max,mean,std,outliers_3sigma
age,0.0,0.0,0.0,18.0,70.0,44.22,14.94,0.0
membership_years,0.0,0.0,0.0,1.0,15.0,7.68,4.4,0.0
number_of_children,0.0,0.0,6387.0,0.0,4.0,2.07,1.42,0.0
quantity,0.0,0.0,0.0,1.0,5.0,2.99,1.4,0.0
unit_price,0.0,0.0,0.0,5.16,199.75,103.78,56.24,0.0
avg_purchase_value,5206.0,11.0,0.0,-7.21,164.19,81.95,42.54,0.0
purchase_frequency,35843.0,0.0,0.0,,,,,0.0
avg_discount_used,0.0,0.0,0.0,5.01,14.96,10.03,2.87,0.0
online_purchases,0.0,0.0,3295.0,0.0,10.0,5.03,3.16,0.0
in_store_purchases,0.0,0.0,3440.0,0.0,10.0,4.74,3.15,0.0


transaction_date: missing=0, future_dates=0, min=2023-01-01 00:00:00, max=2024-05-14 00:00:00
last_purchase_date: missing=0, future_dates=0, min=2023-01-01 00:00:00, max=2024-05-14 00:00:00


**Note:**

1️⃣ Numeric columns

* age, membership_years, number_of_children, quantity, unit_price

No negatives and no outliers under the 3σ rule → ✅ clean

* avg_purchase_value

5,206 nulls → already identified

Min = -7.21 (11 negative values) → small negatives, probably transaction errors

3σ outliers = 0 → statistically within range

* purchase_frequency

All 35,843 values become NaN after `pd.to_numeric(..., errors="coerce")` because the column holds text labels (e.g. Daily, Weekly, Yearly), not because the data is missing; only 5,179 values are null in the raw column → ❌ must be recomputed as a numeric metric or treated as categorical

* total_sales

3,584 nulls → already identified

973 negatives → ❌ problematic values

Min = -1,514.41 → clearly errors or incorrect calculations

* online_purchases, in_store_purchases

Many zeros (≈ 3,300-3,400) → may be normal depending on business rules

* promotion_effectiveness

Reported as all null for the same reason: the rows shown by `df.tail()` contain text labels (Low, Medium, High) that are coerced to NaN → ❌ discard or recompute

* days_since_last_purchase

22,739 negatives → ❌ systemic problem confirmed

2️⃣ Date columns

* transaction_date and last_purchase_date

No nulls and no future dates → ✅ consistent

Observation: the negative day counts are not caused by invalid dates but by how days_since_last_purchase was calculated → it should be recomputed as (today - last_purchase_date)

3️⃣ Partial conclusions

The dates are reliable: transaction_date and last_purchase_date need no imputation or correction.

days_since_last_purchase must be recomputed to eliminate the negatives.

purchase_frequency and promotion_effectiveness cannot be used as numeric variables as they stand; they need recomputing or encoding.

avg_purchase_value and total_sales contain some negative values. Review how they were calculated (total_sales is probably quantity * unit_price).

The remaining numeric variables are clean and ready for analysis or feature derivation.
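To tell genuinely missing values apart from text that merely fails numeric conversion, the two cases can be counted separately. A minimal sketch, where `coercion_report` is a hypothetical helper and the labels are illustrative:

```python
import numpy as np
import pandas as pd

def coercion_report(series):
    """Split NaN after numeric coercion into 'was already missing' and 'was non-numeric text'."""
    coerced = pd.to_numeric(series, errors="coerce")
    originally_missing = int(series.isna().sum())
    non_numeric = int((coerced.isna() & series.notna()).sum())
    return {"originally_missing": originally_missing, "non_numeric": non_numeric}

# Hypothetical column mixing labels, one missing value, and one numeric string
toy = pd.Series(["Daily", "Weekly", np.nan, "Yearly", "3"])

result = coercion_report(toy)
print(result)
```

A large `non_numeric` count is a sign that the column belongs in the categorical list rather than the numeric one.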

In [153]:
(df["age"] < 0).sum(), (df["age"] > 100).sum()


(np.int64(0), np.int64(0))

In [154]:
(df["unit_price"] < 0).sum()


np.int64(0)

In [155]:
(df["quantity"] <= 0).sum()


np.int64(0)

In [156]:
(df["transaction_date"] > pd.Timestamp.today()).sum()


np.int64(0)

In [157]:
# Detect logical anomalies

anomalies = {}

# Negative sales
anomalies["negative_total_sales"] = df[df["total_sales"] < 0].shape[0]

# Invalid quantities
anomalies["invalid_quantity"] = df[df["quantity"] <= 0].shape[0]

# Invalid prices
anomalies["invalid_unit_price"] = df[df["unit_price"] <= 0].shape[0]

# Ages outside the logical range
anomalies["invalid_age"] = df[(df["age"] < 18) | (df["age"] > 100)].shape[0]

# Inconsistent temporal metrics
if "days_since_last_purchase" in df.columns:
    anomalies["negative_days_since_last_purchase"] = (
        df[df["days_since_last_purchase"] < 0].shape[0]
    )

# Anomaly summary
anomaly_df = pd.DataFrame.from_dict(
    anomalies, orient="index", columns=["count"]
)

print("\nSummary of detected anomalies:")
display(anomaly_df)


Summary of detected anomalies:


Unnamed: 0,count
negative_total_sales,973
invalid_quantity,0
invalid_unit_price,0
invalid_age,0
negative_days_since_last_purchase,22739


**Notes:**

The data quality analysis identified two main types of anomalies:

* Negative total sales (negative_total_sales):
973 records have negative values, which may be linked to returns, cancellations, or accounting adjustments. These cases need business validation to decide whether to keep, correct, or exclude them.

* Negative days since last purchase (negative_days_since_last_purchase):
22,739 records have negative values, which points to temporal inconsistencies (for example, a poorly defined reference date or events that occurred after the analysis "snapshot"). The values are consistent with a fixed reference date of 2023-07-01: the row dated 2023-12-05 has -157, and the column ranges from -318 to 181 while the dates run from 2023-01-01 to 2024-05-14. This is a temporal alignment problem rather than a problem with the numeric values.

* No anomalies were detected in:

1. Quantities (invalid_quantity)

2. Unit prices (invalid_unit_price)

3. Customer age (invalid_age)
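Until the business decision is made, suspicious rows can be flagged instead of dropped or overwritten. A hedged sketch, with hypothetical flag column names and toy data:

```python
import pandas as pd

def flag_anomalies(df):
    """Add boolean flag columns so that anomalous rows stay traceable."""
    out = df.copy()
    out["flag_negative_sales"] = out["total_sales"] < 0
    out["flag_negative_days"] = out["days_since_last_purchase"] < 0
    return out

# Hypothetical toy data; the missing total_sales value is not flagged as negative
toy = pd.DataFrame({
    "total_sales": [120.5, -30.0, None],
    "days_since_last_purchase": [12, -157, 40],
})

flagged = flag_anomalies(toy)
print(flagged)
```

Keeping the flags as columns lets later steps filter, correct, or report on these rows without losing the original values.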

### Categorical variables

In [158]:
for col in categorical_cols:
    print("\n", col)
    display(df[col].value_counts(dropna=False).head(10))



 gender


Unnamed: 0_level_0,count
gender,Unnamed: 1_level_1
Male,12116
Other,12112
Female,11615



 income_bracket


Unnamed: 0_level_0,count
income_bracket,Unnamed: 1_level_1
High,13769
Low,11746
Medium,10328



 loyalty_program


Unnamed: 0_level_0,count
loyalty_program,Unnamed: 1_level_1
Yes,18844
No,16999



 marital_status


Unnamed: 0_level_0,count
marital_status,Unnamed: 1_level_1
Single,13182
Divorced,11333
Married,11328



 education_level


Unnamed: 0_level_0,count
education_level,Unnamed: 1_level_1
Master's,9825
PhD,9108
Bachelor's,8810
High School,8100



 occupation


Unnamed: 0_level_0,count
occupation,Unnamed: 1_level_1
Employed,7812
Self-Employed,7750
Unemployed,7239
Retired,6879
Student,6163



 product_category


Unnamed: 0_level_0,count
product_category,Unnamed: 1_level_1
Home Goods,8022
Clothing,7608
Toys,7308
Electronics,6885
Groceries,6020



 promotion_type


Unnamed: 0_level_0,count
promotion_type,Unnamed: 1_level_1
Seasonal Discount,11036
20% Off,9961
Buy One Get One Free,9633
,5213



 churn


Unnamed: 0_level_0,count
churn,Unnamed: 1_level_1
0,25187
1,10656


**Notes:**

1️⃣ Gender

Male: 12,116

Other: 12,112

Female: 11,615

✅ The distribution across genders is fairly balanced, with "Male" and "Other" slightly ahead of "Female". This is useful because later analyses will not carry a strong bias toward one gender.

2️⃣ Income Bracket

High: 13,769

Low: 11,746

Medium: 10,328

✅ High is the most common level, followed by Low and Medium. The customer base has a relatively large share of high incomes, which may influence consumption patterns and promotions.

3️⃣ Loyalty Program

Yes: 18,844

No: 16,999

✅ Most customers take part in a loyalty program, but a significant share do not. This is key for retention and churn analysis.

4️⃣ Marital Status

Single: 13,182

Divorced: 11,333

Married: 11,328

✅ Single customers are the largest group, while divorced and married customers are almost evenly matched. This may affect purchase patterns by life stage.

5️⃣ Education Level

Master's: 9,825

PhD: 9,108

Bachelor's: 8,810

High School: 8,100

✅ A slight majority has advanced education (Master's and PhD together). This could correlate with high incomes and certain consumption habits.

6️⃣ Occupation

Employed: 7,812

Self-Employed: 7,750

Unemployed: 7,239

Retired: 6,879

Student: 6,163

✅ Occupations are diverse, with employed and self-employed customers being the largest groups. This may affect purchase frequency and product preferences.

7️⃣ Product Category

Home Goods: 8,022

Clothing: 7,608

Toys: 7,308

Electronics: 6,885

Groceries: 6,020

✅ The most purchased categories are Home Goods and Clothing, while Groceries is the least frequent category in this dataset, which may reflect a focus on durable or discretionary goods.

8️⃣ Promotion Type

Seasonal Discount: 11,036

20% Off: 9,961

Buy One Get One Free: 9,633

NaN: 5,213

✅ Most transactions have some type of promotion applied, although about 14.5% of the values are missing (5,213 of 35,843). Promotions are therefore an important factor when analyzing their effect on sales and churn.

General conclusion

* The counts above are rows, not unique customers; each of the 500 customers contributes about 72 rows.

* The customer base is fairly balanced in gender and marital status, leaning slightly toward single customers and toward Male/Other.

* High incomes and advanced education predominate, which could affect spending patterns.

* Loyalty program participation is high, which makes it possible to analyze retention and promotion effectiveness.

* The most popular products are Home Goods and Clothing, and most purchases include some promotion.

* The data is balanced overall, but promotion_type, avg_purchase_value, and other variables have missing values that should be handled before modeling.
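Because each customer appears on many rows, row-level shares can differ from customer-level shares. The sketch below compares the two on toy data; `customer_level_share` is a hypothetical helper and it assumes the attribute is constant within a customer (it keeps the first row per customer).

```python
import pandas as pd

def customer_level_share(df, col, key="customer_id"):
    """Share of each category counting every customer once (first row per customer)."""
    per_customer = df.drop_duplicates(subset=key)
    return per_customer[col].value_counts(normalize=True)

# Hypothetical toy data: C1 has three rows, C2 and C3 one each
toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1", "C2", "C3"],
    "gender": ["Male", "Male", "Male", "Female", "Other"],
})

row_share = toy["gender"].value_counts(normalize=True)
cust_share = customer_level_share(toy, "gender")
print(row_share["Male"], cust_share["Male"])
```

When customers contribute similar numbers of rows, as they appear to here, the two views stay close; the gap grows when row counts per customer are uneven.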

## Numeric cleaning and nulls

In [159]:
df_clean = df.copy()

In [160]:
df_clean = df.copy()

df_clean["gender"] = df_clean["gender"].fillna("unknown")

df_clean["age"] = df_clean["age"].clip(0, 100)
df_clean["age"] = df_clean["age"].fillna(df_clean["age"].median())

df_clean["unit_price"] = df_clean["unit_price"].clip(lower=0)


In [161]:
# Normalize column names
def clean_column_names(df):
    """Return a copy of df with normalized column names."""
    out = df.copy()
    out.columns = (
        out.columns
        .str.strip()                # Remove leading/trailing whitespace
        .str.lower()                # Lowercase everything
        .str.replace(" ", "_")      # Replace spaces with _
        .str.replace("-", "_")      # Replace hyphens with _
        .str.replace(r"[^\w]", "", regex=True)  # Remove special characters
    )
    return out

## Converting dates

In [162]:
# Convert dates
date_cols = ["transaction_date", "last_purchase_date"]
for col in date_cols:
    df_clean[col] = pd.to_datetime(df_clean[col], errors="coerce")

today = pd.Timestamp.today().normalize()

# Detect and report inconsistent dates (on df_clean, the frame being cleaned)
for col in date_cols:
    n_missing = df_clean[col].isnull().sum()
    n_future = (df_clean[col] > today).sum()
    print(f"{col}: missing={n_missing}, future_dates={n_future}")

# Recompute the metric relative to the run date
df_clean["days_since_last_purchase"] = (
    today - df_clean["last_purchase_date"]
).dt.days

# Final safeguard: clamp any remaining negatives to 0
df_clean.loc[
    df_clean["days_since_last_purchase"] < 0,
    "days_since_last_purchase"
] = 0

transaction_date: missing=0, future_dates=0
last_purchase_date: missing=0, future_dates=0
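Recomputing against `pd.Timestamp.today()` makes the metric change on every run. One alternative, shown here only as a sketch, is to anchor it to a fixed snapshot date; using the latest date in the data as the snapshot is an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "last_purchase_date": pd.to_datetime(["2023-01-01", "2023-12-09", "2024-05-14"]),
})

# Assumption: the snapshot is the latest date present in the data
snapshot = toy["last_purchase_date"].max()

# Non-negative by construction, and identical on every run
toy["days_since_last_purchase"] = (snapshot - toy["last_purchase_date"]).dt.days
print(toy)
```

With a fixed snapshot the values are reproducible and no clamping step is needed, since no date can fall after the snapshot.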


## Category normalization and cardinality reduction

In [163]:
for col in categorical_cols:

    # Group rare categories (<1% frequency) under "other"
    freq = df_clean[col].value_counts(normalize=True)
    rare_labels = freq[freq < 0.01].index
    df_clean[col] = df_clean[col].replace(rare_labels, "other")

# Promotion type: fill NaN with an explicit category
df_clean["promotion_type"] = df_clean["promotion_type"].fillna("no_promotion")
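The rare-label step can be isolated in a reusable function and checked on data where a rare label actually exists. This is a minimal sketch; `group_rare_labels` and the toy series are hypothetical.

```python
import pandas as pd

def group_rare_labels(series, min_freq=0.01, other_label="other"):
    """Replace labels whose relative frequency is below min_freq; NaN is left untouched."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < min_freq].index
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy data: "C" appears in 1% of rows
toy = pd.Series(["A"] * 60 + ["B"] * 39 + ["C"] * 1)

grouped = group_rare_labels(toy, min_freq=0.05)
print(grouped.value_counts())
```

In the dataset above every category exceeds the 1% threshold, so the step leaves the columns unchanged; the function mainly protects against rare labels in future loads.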


In [164]:
df_clean["loyalty_program"] = df_clean["loyalty_program"].astype(str).str.strip().str.lower()

# Map Yes/No to 1/0
df_clean["loyalty_program"] = df_clean["loyalty_program"].map({"yes": 1, "no": 0})


## Derived metrics

In [165]:
# Row-level sales, then the per-customer average
df_clean["total_sales"] = df_clean["quantity"] * df_clean["unit_price"]
df_clean["avg_purchase_value"] = df_clean.groupby("customer_id")["total_sales"].transform("mean")
# Rows per customer divided by membership years (0 years is treated as 1)
df_clean["purchase_frequency"] = df_clean.groupby("customer_id")["transaction_id"].transform("count") / df_clean["membership_years"].replace(0,1)

# Promotion effectiveness: share of each customer's rows that used a promotion
df_clean["promo_flag"] = (df_clean["promotion_type"] != "no_promotion").astype(int)
df_clean["promotion_effectiveness"] = df_clean.groupby("customer_id")["promo_flag"].transform("mean")

# Online vs in-store ratio (a total of 0 is replaced by 1 to avoid division by zero)
df_clean["total_purchases"] = df_clean["online_purchases"] + df_clean["in_store_purchases"]
df_clean["online_ratio"] = df_clean["online_purchases"] / df_clean["total_purchases"].replace(0,1)
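Replacing a zero denominator with 1 yields a ratio of 0 for customers with no purchases at all. If "undefined" should stay distinguishable from "0% online", a NaN-preserving variant is one option. A sketch with a hypothetical `safe_ratio` helper:

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio that yields NaN where the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)

# Hypothetical toy data; the second row has no purchases in either channel
toy = pd.DataFrame({"online_purchases": [10, 0, 5], "in_store_purchases": [5, 0, 0]})
total = toy["online_purchases"] + toy["in_store_purchases"]

toy["online_ratio"] = safe_ratio(toy["online_purchases"], total)
print(toy)
```

The trade-off is that the NaN values would then need explicit handling before a "no nulls" validation or a model that cannot accept missing values.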


## Post-transformation validation

In [166]:
for col in num_cols:
    if col in df_clean.columns:
        print(f"{col}: min={df_clean[col].min()}, max={df_clean[col].max()}, mean={df_clean[col].mean():.2f}, nulls={df_clean[col].isnull().sum()}")

age: min=18, max=70, mean=44.22, nulls=0
membership_years: min=1, max=15, mean=7.68, nulls=0
number_of_children: min=0, max=4, mean=2.07, nulls=0
quantity: min=1, max=5, mean=2.99, nulls=0
unit_price: min=5.159564146735754, max=199.7514692043265, mean=103.78, nulls=0
avg_purchase_value: min=6.0056146087811895, max=990.1493097576406, mean=310.11, nulls=0
purchase_frequency: min=4.733333333333333, max=72.0, mean=18.12, nulls=0
avg_discount_used: min=5.012745373771642, max=14.96257824333469, mean=10.03, nulls=0
online_purchases: min=0, max=10, mean=5.03, nulls=0
in_store_purchases: min=0, max=10, mean=4.74, nulls=0
total_sales: min=6.0056146087811895, max=990.1493097576405, mean=310.11, nulls=0
total_transactions: min=5, max=50, mean=28.62, nulls=0
total_items_purchased: min=10, max=100, mean=55.26, nulls=0
promotion_effectiveness: min=0.7183098591549296, max=0.9859154929577465, mean=0.85, nulls=0
days_since_last_purchase: min=594, max=1093, mean=844.25, nulls=0


In [167]:
# Final dataset with all processed columns
df_final = df_clean
df_final = df_final.reset_index(drop=True)
df_final.to_csv("dataset_final_all_columns.csv", index=False)

## Data Analyst dataset

In [168]:
# Dataset summarized by customer

df_agg = df_final.groupby("customer_id").agg({
    # Numeric: typical aggregations
    "age": "first",
    "membership_years": "first",
    "number_of_children": "first",
    "quantity": "mean",
    "unit_price": "mean",
    "avg_purchase_value": "mean",
    "purchase_frequency": "mean",
    "avg_discount_used": "mean",
    "online_purchases": "sum",
    "in_store_purchases": "sum",
    "total_sales": "sum",
    "total_transactions": "sum",
    "total_items_purchased": "sum",
    "promotion_effectiveness": "mean",
    "days_since_last_purchase": "min",

    # Categorical: most frequent value
    "gender": lambda x: x.mode()[0],
    "income_bracket": lambda x: x.mode()[0],
    "marital_status": lambda x: x.mode()[0],
    "education_level": lambda x: x.mode()[0],
    "occupation": lambda x: x.mode()[0],
    "product_category": lambda x: x.mode()[0],
    "promotion_type": lambda x: x.mode()[0],


    # Dates
    "transaction_date": "max",
    "last_purchase_date": "max",

    # Binary flags
    "loyalty_program": "max",
    "churn": "max",
    "promo_flag": "max",

}).reset_index()


# Export CSV for the Data Analyst

df_agg.to_csv("dataset_analyst_by_customer.csv", index=False)
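The same per-customer summary can be expressed with named aggregation, which keeps output column names explicit. This is a sketch on toy data; `first_mode` is a hypothetical helper that also avoids the `IndexError` that `x.mode()[0]` raises when a group contains only nulls.

```python
import pandas as pd

def first_mode(s):
    """Most frequent value (first in sorted order on ties); None if the group is all null."""
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else None

# Hypothetical toy data
toy = pd.DataFrame({
    "customer_id": ["C1", "C1", "C1", "C2", "C2"],
    "age": [48, 48, 48, 26, 26],
    "total_sales": [10.0, 20.0, 30.0, 5.0, 5.0],
    "product_category": ["Toys", "Toys", "Clothing", "Groceries", "Groceries"],
})

agg = toy.groupby("customer_id").agg(
    age=("age", "first"),
    total_sales=("total_sales", "sum"),
    top_category=("product_category", first_mode),
).reset_index()
print(agg)
```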


**Documentation for the Data Analyst:**


* Column dictionary with descriptions and types.

* Range and key statistics for each variable.

* Notes on imputations, clipping, and derived metrics.

## Data Scientist dataset

### Encoding categorical variables: one-hot encoding

In [187]:
categorical_cols = [
    "gender", "income_bracket", "marital_status",
    "education_level", "occupation",
    "product_category", "promotion_type"
]

df_final_encoded = pd.get_dummies(df_final, columns=categorical_cols, drop_first=False)

In [188]:
# Final dataset for the Data Scientist
df_final_encoded.to_csv("customer_dataset_for_ml.csv", index=False)
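One-hot encoding only creates columns for categories present in the data being encoded, so a later batch may produce a different column set. A minimal sketch of aligning a new batch to the training columns (both frames are hypothetical toy data):

```python
import pandas as pd

# Hypothetical training data and a later batch that lacks some categories
train = pd.DataFrame({"gender": ["Male", "Female", "Other"], "age": [30, 41, 52]})
new = pd.DataFrame({"gender": ["Female", "Female"], "age": [25, 37]})

# dtype=int gives 0/1 columns instead of booleans
train_enc = pd.get_dummies(train, columns=["gender"], dtype=int)
new_enc = pd.get_dummies(new, columns=["gender"], dtype=int)

# Align to the training layout: add missing dummy columns as 0, keep the same order
new_enc = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(new_enc)
```

Saving the list of training columns next to the exported CSV makes this alignment repeatable at prediction time.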

**Documentation for the Data Scientist:**


* Dictionary of encoded columns (what each 0/1 column represents).

* Notes on cleaning and derived metrics.

* Summary statistics of the numeric features.

## Final validation

### 1️⃣ Quick null / NaN check

In [171]:
def check_nulls(df, name="dataset"):
    print(f"\n📌 Null check → {name}")
    nulls = df.isna().sum()
    nulls = nulls[nulls > 0]

    if nulls.empty:
        print("✅ No null or NaN values")
    else:
        print("❌ Columns with nulls:")
        print(nulls)


In [172]:
check_nulls(df_final, "df_final")
check_nulls(df_agg, "df_agg")



📌 Null check → df_final
✅ No null or NaN values

📌 Null check → df_agg
✅ No null or NaN values


### 2️⃣ Checking binary columns (0/1)

In [173]:
def check_binary_cols(df, binary_cols):
    print("\n📌 Binary columns check")
    for col in binary_cols:
        if col in df.columns:
            invalid = df[~df[col].isin([0,1]) & df[col].notna()]
            if len(invalid) == 0:
                print(f"✅ {col}: OK (only 0/1)")
            else:
                print(f"❌ {col}: invalid values →")
                print(df[col].value_counts())


In [174]:
binary_cols = ["loyalty_program", "churn", "promo_flag" ]
check_binary_cols(df_final, binary_cols)



📌 Binary columns check
✅ loyalty_program: OK (only 0/1)
✅ churn: OK (only 0/1)
✅ promo_flag: OK (only 0/1)


### 3️⃣ Checking numeric ranges

In [175]:
def check_numeric_ranges(df, num_cols):
    print("\n📌 Numeric ranges check")
    for col in num_cols:
        if col in df.columns:
            min_v = df[col].min()
            max_v = df[col].max()
            if min_v < 0:
                print(f"⚠️ {col}: negative value detected (min={min_v})")
            else:
                print(f"✅ {col}: min={min_v}, max={max_v}")


In [176]:
check_numeric_ranges(df_final, num_cols)



📌 Numeric ranges check
✅ age: min=18, max=70
✅ membership_years: min=1, max=15
✅ number_of_children: min=0, max=4
✅ quantity: min=1, max=5
✅ unit_price: min=5.159564146735754, max=199.7514692043265
✅ avg_purchase_value: min=6.0056146087811895, max=990.1493097576406
✅ purchase_frequency: min=4.733333333333333, max=72.0
✅ avg_discount_used: min=5.012745373771642, max=14.96257824333469
✅ online_purchases: min=0, max=10
✅ in_store_purchases: min=0, max=10
✅ total_sales: min=6.0056146087811895, max=990.1493097576405
✅ total_transactions: min=5, max=50
✅ total_items_purchased: min=10, max=100
✅ promotion_effectiveness: min=0.7183098591549296, max=0.9859154929577465
✅ days_since_last_purchase: min=594, max=1093


### 4️⃣ Date check (temporal consistency)

In [177]:
def check_dates(df, date_cols):
    print("\n📌 Date columns check")
    for col in date_cols:
        if col in df.columns:
            if not pd.api.types.is_datetime64_any_dtype(df[col]):
                print(f"❌ {col}: is not datetime")
            else:
                print(f"✅ {col}: OK ({df[col].min()} → {df[col].max()})")


In [178]:
check_dates(df_final, date_cols)



📌 Date columns check
✅ transaction_date: OK (2023-01-01 00:00:00 → 2024-05-14 00:00:00)
✅ last_purchase_date: OK (2023-01-01 00:00:00 → 2024-05-14 00:00:00)


### 5️⃣ Rare categorical values (cardinality)

In [179]:
def check_categorical_cols(df, cat_cols, max_unique=20):
    print("\n📌 Categorical consistency check")
    for col in cat_cols:
        if col in df.columns:
            uniq = df[col].nunique(dropna=False)
            print(f"🔹 {col}: {uniq} unique values")
            if uniq <= max_unique:
                print(df[col].value_counts())


In [180]:
check_categorical_cols(df_agg, categorical_cols)



📌 Categorical consistency check
🔹 gender: 3 unique values
gender
Other     169
Male      169
Female    162
Name: count, dtype: int64
🔹 income_bracket: 3 unique values
income_bracket
High      192
Low       164
Medium    144
Name: count, dtype: int64
🔹 marital_status: 3 unique values
marital_status
Single      184
Divorced    158
Married     158
Name: count, dtype: int64
🔹 education_level: 4 unique values
education_level
Master's       137
PhD            127
Bachelor's     123
High School    113
Name: count, dtype: int64
🔹 occupation: 5 unique values
occupation
Employed         109
Self-Employed    108
Unemployed       101
Retired           96
Student           86
Name: count, dtype: int64
🔹 product_category: 5 unique values
product_category
Home Goods     112
Clothing       106
Toys           102
Electronics     96
Groceries       84
Name: count, dtype: int64
🔹 promotion_type: 3 unique values
promotion_type
Seasonal Discount       180
20% Off     

### 6️⃣ Duplicate check

In [181]:
def check_duplicates(df, subset=None, name="dataset"):
    print(f"\n📌 Duplicate check → {name}")
    dups = df.duplicated(subset=subset).sum()
    if dups == 0:
        print("✅ No duplicates")
    else:
        print(f"❌ {dups} duplicates found")


In [182]:
check_duplicates(df_agg, subset=["customer_id"], name="df_agg")



📌 Duplicate check → df_agg
✅ No duplicates


### 7️⃣ FINAL "traffic light" check

In [183]:
def final_dataset_check(df, name):
    print(f"\n🚦 FINAL CHECK → {name}")
    print("Rows:", df.shape[0])
    print("Columns:", df.shape[1])
    print("Total NaN:", df.isna().sum().sum())


In [184]:
final_dataset_check(df_final, "Dataset Final (All Columns)")
final_dataset_check(df_agg, "Dataset Analyst by Customer")



🚦 FINAL CHECK → Dataset Final (All Columns)
Rows: 35843
Columns: 31
Total NaN: 0

🚦 FINAL CHECK → Dataset Analyst by Customer
Rows: 500
Columns: 28
Total NaN: 0
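The printed checks above are meant to be read by a person. For an automated pipeline run, the same ideas can return pass/fail values that a script can act on. This is a sketch under stated assumptions: `validate_dataset` and its parameters are hypothetical and cover only a subset of the checks.

```python
import pandas as pd

def validate_dataset(df, key=None, binary_cols=(), non_negative_cols=()):
    """Return a dict mapping check name to True (passed) or False (failed)."""
    results = {"no_nulls": int(df.isna().sum().sum()) == 0}
    if key is not None:
        results["unique_key"] = not df.duplicated(subset=[key]).any()
    for col in binary_cols:
        results[f"binary:{col}"] = bool(df[col].dropna().isin([0, 1]).all())
    for col in non_negative_cols:
        results[f"non_negative:{col}"] = bool((df[col].dropna() >= 0).all())
    return results

# Hypothetical toy data with one deliberate violation (negative total_sales)
toy = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "churn": [0, 1, 0],
    "total_sales": [10.0, -5.0, 7.5],
})

results = validate_dataset(
    toy, key="customer_id", binary_cols=["churn"], non_negative_cols=["total_sales"]
)
failed = [name for name, ok in results.items() if not ok]
print(failed)
```

A pipeline could raise an error or skip the export whenever `failed` is not empty.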

