# **COFFEE SURVEY DATA ANALYSIS**

#### **INITIAL SETUP AND IMPORTS**

In [84]:
# Autoreload: automatically reload modified modules.
# %reload_ext avoids the "already loaded" notice when this cell is re-run.
%reload_ext autoreload
%autoreload 2


#### **STANDARD LIBRARIES**

In [85]:
import pandas as pd
import numpy as np
import plotly.express as px
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

#### **IMPORTS FROM THE coffee_data_utils MODULE**

In [86]:
from coffee_data_utils import *

#### **PANDAS CONFIGURATION**

In [87]:
# Show more rows and columns
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Number format
pd.options.display.float_format = '{:.2f}'.format

#### **FILE PATHS**

In [88]:
# Input and output paths for the dataset
DATA_PATH = 'coffee-survey-full-dataset.csv'
OUTPUT_PATH = 'coffee_data_cleaned.csv'

In [89]:
# Run the full cleaning pipeline
df_clean = full_cleaning_pipeline(DATA_PATH, OUTPUT_PATH)

STARTING CLEANING PIPELINE

[1/7] Loading data...
Dataset loaded: 4042 rows | 113 columns

[2/7] Dropping columns with >95% missing...
Dropping 14 columns with >95.0% missing
  Dropped columns: ['Where else do you purchase coffee?', 'Please specify what your favorite coffee drink is', 'What else do you add to your coffee?', 'What kind of flavorings do you add?', 'What kind of flavorings do you add? (Vanilla Syrup)']...

[3/7] Standardizing column names...
Demographic columns standardized

[4/7] Encoding ordinal variables...
Encoding ordinal variables
 age -> age_encoded
 cups_per_day -> cups_per_day_encoded
 education -> education_encoded
Unmapped values in 'employment': ['Homemaker']
 employment -> employment_encoded
 children -> children_encoded

[5/7] Creating derived variables...
Consumption segments created: Light, Moderate, Heavy
Age groups created: Gen Z, Millenials, Gen z, Boomers+

[6/7] Imputing missing values...
  gender: 5
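
Two of the steps logged above, dropping columns that are almost entirely missing and encoding ordered categories, can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `coffee_data_utils` implementation: the helper names `drop_sparse_columns` and `encode_ordinal`, the toy data, and the `employment_map` values are all hypothetical.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    sparse_cols = missing_share[missing_share > threshold].index
    return df.drop(columns=sparse_cols)


def encode_ordinal(series: pd.Series, mapping: dict) -> pd.Series:
    """Map ordered categories to numbers; values absent from `mapping` become NaN."""
    unmapped = sorted(set(series.dropna().unique()) - set(mapping))
    if unmapped:
        print(f"Unmapped values: {unmapped}")
    return series.map(mapping)


# Toy data, invented for illustration only (not the survey)
toy = pd.DataFrame({
    "employment": ["Employed full-time", "Homemaker", "Student", None],
    "mostly_empty": [np.nan] * 4,
})
# Hypothetical mapping; the real one lives in coffee_data_utils and is not shown
employment_map = {"Student": 0, "Employed full-time": 1}

toy = drop_sparse_columns(toy)
toy["employment_encoded"] = encode_ordinal(toy["employment"], employment_map)
print(toy)
```

In this sketch an unmapped category such as 'Homemaker' ends up as NaN in the encoded column, which would be one possible reason for the warning about `employment` in the log.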

In [90]:
# View summary
quick_summary(df_clean)

QUICK SUMMARY

Dimensions: 4,042 rows × 106 columns
Memory: 9.81 MB

Missing values: 40,003 (9.34%)

Data types:
bool       47
object     41
float64    18
Name: count, dtype: int64

Consumption segments:
consumption_segment
Moderate (1-2 cups)    2940
Heavy (3+ cups)         661
Light (<1 cup)          348
Unkown                   93
Name: count, dtype: int64

Age groups:
age_group
Young Millennials (25-34)    1986
Older Millennials (35-44)     960
Gen X (45-64)                 489
Gen Z (<25)                   481
Boomers+ (65+)                 95
Unknown                        31
Name: count, dtype: int64


In [91]:
# Create thematic subsets
print("\n" + "="*80)
print("CREATING THEMATIC SUBSETS")
print("="*80 + "\n")

subset_1_consumption = create_consumption_subset(df_clean)
subset_2_places = create_place_subset(df_clean)
subset_3_brewing = create_home_brewing_subset(df_clean)
subset_4_onthego = create_onthego_subset(df_clean)
subset_5_dairy = create_dairy_subset(df_clean)
subset_6_sweetener = create_sweetener_subset(df_clean)

print("\n All subsets created successfully")


CREATING THEMATIC SUBSETS

 Consumption subset created: (4042, 15)
 Places subset created: (4042, 20)
 Home brewing methods subset created: (4042, 25)
 On-the-go purchases subset created: (4042, 21)
 Dairy subset created: (4042, 24)
 Sweeteners subset created: (4042, 23)

 All subsets created successfully


### **DEMOGRAPHIC ANALYSIS (Subset 1: Consumption)**

In [92]:
subset_1_consumption.head(10)

Unnamed: 0,submission_id,age,age_encoded,age_group,gender,education,education_encoded,employment,employment_encoded,children,children_encoded,political_affiliation,cups_per_day,cups_per_day_encoded,consumption_segment
0,gMR29l,18-24 years old,1.0,Gen Z (<25),Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown
1,BkPN0e,25-34 years old,2.0,Young Millennials (25-34),Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown
2,W5G8jj,25-34 years old,2.0,Young Millennials (25-34),Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown
3,4xWgGr,35-44 years old,3.0,Older Millennials (35-44),Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown
4,QD27Q8,25-34 years old,2.0,Young Millennials (25-34),Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown
5,V0LPeM,55-64 years old,5.0,Gen X (45-64),Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown
6,V0Gaxg,18-24 years old,1.0,Gen Z (<25),Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown
7,AdzRL0,,,Unknown,Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown
8,LbWda2,25-34 years old,2.0,Young Millennials (25-34),Unknown,Unknown,,Unknown,,0,0.0,Unknown,Less than 1,0.0,Light (<1 cup)
9,EXQLWN,,,Unknown,Unknown,Unknown,,Unknown,,0,0.0,Unknown,,,Unkown


In [112]:

# Mean encoded consumption and sample size per generation
avg_consumption = subset_1_consumption.groupby('age_group')['cups_per_day_encoded'].agg(['mean', 'count']).reset_index()
avg_consumption.columns = ['Generation', 'Avg_Cups', 'Sample_Size']

# Order generations from youngest to oldest
age_order = ['Gen Z (<25)', 'Young Millennials (25-34)',
             'Older Millennials (35-44)', 'Gen X (45-64)', 'Boomers+ (65+)', 'Unknown']
avg_consumption['Generation'] = pd.Categorical(
    avg_consumption['Generation'],
    categories=age_order,
    ordered=True
)
avg_consumption = avg_consumption.sort_values('Generation')

# Drop the 'Unknown' group by label rather than by row position
avg_consumption = avg_consumption[avg_consumption['Generation'] != 'Unknown'].reset_index(drop=True)
avg_consumption

Unnamed: 0,Generation,Avg_Cups,Sample_Size
0,Gen Z (<25),1.48,466
1,Young Millennials (25-34),1.64,1956
2,Older Millennials (35-44),1.83,948
3,Gen X (45-64),2.06,478
4,Boomers+ (65+),2.22,94


In [117]:
# Visualize average consumption by generation

fig = px.bar(
    avg_consumption,
    x='Generation',
    y='Avg_Cups',
    title='Average Coffee Consumption by Generation',
    text='Avg_Cups',
    color='Avg_Cups',
    color_continuous_scale='YlOrBr',
    hover_data=['Sample_Size']
)

fig.update_traces(texttemplate='%{text:.2f} cups', textposition='outside')
fig.update_layout(
    xaxis_title="",
    yaxis_title="Cups per Day (average)",
    showlegend=False,
    height=600
)

fig.show()

In [121]:

fig = px.violin(
    subset_1_consumption.dropna(subset=['age_group', 'cups_per_day_encoded']),
    x='age_group',
    y='cups_per_day_encoded',
    box=True,
    title='Consumption Distribution by Generation',
    color='age_group'
)
fig.show()

In [127]:
# pandas, plotly.express, and scipy.stats are already imported above
import plotly.graph_objects as go

# Keep Male/Female respondents with a known age group
df_analysis = subset_1_consumption[
    (subset_1_consumption['gender'].isin(['Male', 'Female'])) &
    (subset_1_consumption['age_group'] != 'Unknown')
].copy()

# ============================================================================
# ANALYSIS 1: Full Cross-Tabulation
# ============================================================================

# Mean consumption by gender and age group
consumption_by_gender_age = df_analysis.groupby(['age_group', 'gender']).agg({
    'cups_per_day_encoded': ['mean', 'std', 'count']
}).reset_index()

# Flatten the MultiIndex columns
consumption_by_gender_age.columns = ['Age_Group', 'Gender', 'Mean', 'Std', 'N']
consumption_by_gender_age['Mean'] = consumption_by_gender_age['Mean'].round(3)
consumption_by_gender_age['Std'] = consumption_by_gender_age['Std'].round(3)

print("="*70)
print("MEAN CONSUMPTION BY GENDER AND AGE")
print("="*70)
print(consumption_by_gender_age)

# Pivot for easier reading
pivot_table = consumption_by_gender_age.pivot(
    index='Age_Group',
    columns='Gender',
    values='Mean'
)
print("\n" + "="*70)
print("PIVOT TABLE: Mean Consumption")
print("="*70)
print(pivot_table)

# Male - Female difference by age group
pivot_table['Difference (M-F)'] = pivot_table['Male'] - pivot_table['Female']
print("\n" + "="*70)
print("DIFFERENCE (Male - Female) BY AGE")
print("="*70)
print(pivot_table['Difference (M-F)'])

MEAN CONSUMPTION BY GENDER AND AGE
                   Age_Group  Gender  Mean  Std     N
0             Boomers+ (65+)  Female  2.00 1.26    21
1             Boomers+ (65+)    Male  2.23 1.23    52
2              Gen X (45-64)  Female  1.65 0.99   138
3              Gen X (45-64)    Male  2.25 1.13   270
4                Gen Z (<25)  Female  1.09 0.76    78
5                Gen Z (<25)    Male  1.56 1.01   297
6  Older Millennials (35-44)  Female  1.40 0.94   197
7  Older Millennials (35-44)    Male  1.97 0.95   621
8  Young Millennials (25-34)  Female  1.30 0.86   419
9  Young Millennials (25-34)    Male  1.76 0.91  1284

PIVOT TABLE: Mean Consumption
Gender                     Female  Male
Age_Group
Boomers+ (65+)               2.00  2.23
Gen X (45-64)                1.65  2.25
Gen Z (<25)                  1.09  1.56
Older Millennials (35-44)    1.40  1.97
Young Millennials (25-34)    1.30  1.76

DIFFERENCE (Male - Female) BY AGE
Age_Group
Boomers+ (65+

In [129]:
fig = px.bar(
    consumption_by_gender_age,
    x='Age_Group',
    y='Mean',
    color='Gender',
    barmode='group',
    title='Coffee Consumption by Gender and Age (Stratified)',
    text='Mean',
    color_discrete_map={'Male': '#8B4513', 'Female': '#DEB887'},
    error_y='Std',  # Error bars show ±1 standard deviation
    hover_data=['N']
)

fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_layout(
    xaxis_title="Age Group",
    yaxis_title="Cups per Day (average)",
    legend_title="Gender",
    height=600,
    hovermode='x unified'
)

fig.show()

In [134]:
# Build the heatmap matrix from pivot_table instead of retyping the means
heatmap_data = pivot_table[['Female', 'Male']].round(2).values
age_labels = pivot_table.index.tolist()

fig = go.Figure(data=go.Heatmap(
    z=heatmap_data,
    x=['Female', 'Male'],
    y=age_labels,
    colorscale='YlOrBr',
    text=heatmap_data,
    texttemplate='%{text:.2f}',
    textfont={"size": 14},
    colorbar=dict(title="Cups/day")
))

fig.update_layout(
    title='Heatmap: Coffee Consumption by Gender and Age',
    xaxis_title="Gender",
    yaxis_title="Age Group",
    height=600
)

fig.show()

## Analysis: Consumption Difference by Gender (Stratified by Age)
Is there a significant difference in coffee consumption between men and women,
and is that difference consistent across age groups?

### Main Finding
**Men report higher average consumption than women in every age group, but the
size of the gap varies considerably: from +0.23 cups in Boomers+ to +0.60 cups
in Gen X (a 2.6x spread). All figures are group means of `cups_per_day_encoded`.**

### Detailed Results
| Age Group                 | Female | Male |
|----------------------------|:-------:|:----:|
| Boomers+ (65+)             | 2.00    | 2.23 |
| Gen X (45-64)              | 1.65    | 2.25 |
| Gen Z (<25)                | 1.09    | 1.56 |
| Older Millennials (35-44)  | 1.40    | 1.97 |
| Young Millennials (25-34)  | 1.30    | 1.76 |
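
The absolute and relative gaps quoted in this analysis follow directly from the table. The sketch below recomputes them from the rounded means shown above; the column names `Gap (M-F)` and `Gap (%)` are introduced here for illustration, and results based on the unrounded means may differ slightly in the last digit.

```python
import pandas as pd

# Means copied from the table above (already rounded to two decimals)
means = pd.DataFrame(
    {
        "Female": [2.00, 1.65, 1.09, 1.40, 1.30],
        "Male": [2.23, 2.25, 1.56, 1.97, 1.76],
    },
    index=[
        "Boomers+ (65+)",
        "Gen X (45-64)",
        "Gen Z (<25)",
        "Older Millennials (35-44)",
        "Young Millennials (25-34)",
    ],
)

# Absolute gap in cups/day and gap relative to the female mean
means["Gap (M-F)"] = means["Male"] - means["Female"]
means["Gap (%)"] = 100 * means["Gap (M-F)"] / means["Female"]

print(means.round(2).sort_values("Gap (M-F)", ascending=False))
```

For Gen X this gives 0.60 / 1.65, roughly 36%, and for Boomers+ 0.23 / 2.00, roughly 11 to 12%.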


**Gen X (45-64) shows the largest gap:**
- Men: 2.25 cups/day
- Women: 1.65 cups/day
- Difference: **+0.60 cups (36% more)**

**Boomers+ (65+) shows the smallest gap:**
- Men: 2.23 cups/day
- Women: 2.00 cups/day
- Difference: **+0.23 cups (11% more)**
- Sample sizes here are the smallest in the table (21 women, 52 men)
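
The opening question asks whether the difference is significant, but the cells above only compare means. One way to check, sketched here under assumptions, is a two-sided Mann-Whitney U test within each age group, which suits an ordinal variable such as `cups_per_day_encoded`. The helper `gender_gap_tests` is hypothetical, and the data below is synthetic stand-in data, not the survey.

```python
import numpy as np
import pandas as pd
from scipy import stats


def gender_gap_tests(df, group_col="age_group", gender_col="gender",
                     value_col="cups_per_day_encoded"):
    """Run a two-sided Mann-Whitney U test (Male vs Female) within each group."""
    rows = []
    for group, part in df.groupby(group_col):
        male = part.loc[part[gender_col] == "Male", value_col].dropna()
        female = part.loc[part[gender_col] == "Female", value_col].dropna()
        if len(male) == 0 or len(female) == 0:
            continue  # nothing to compare in this group
        u_stat, p_value = stats.mannwhitneyu(male, female, alternative="two-sided")
        rows.append({"group": group, "n_male": len(male), "n_female": len(female),
                     "U": u_stat, "p_value": p_value})
    return pd.DataFrame(rows)


# Synthetic stand-in data (NOT the survey): ordinal codes 0-4 in two groups
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age_group": np.repeat(["A", "B"], 200),
    "gender": np.tile(["Male", "Female"], 200),
    "cups_per_day_encoded": rng.integers(0, 5, size=400).astype(float),
})
# Shift male values up in group "A" so a gap exists there by construction
mask = (demo["age_group"] == "A") & (demo["gender"] == "Male")
demo.loc[mask, "cups_per_day_encoded"] += 1

result = gender_gap_tests(demo)
print(result)
```

On the notebook's data the call would be `gender_gap_tests(df_analysis)`. Because that runs five tests, one per age group, a multiple-comparison adjustment such as Bonferroni may be appropriate before reading individual p-values.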

### Implications
1. **Marketing**: Segment campaigns by age × gender rather than by a single demographic variable
2. **Product**: Gen X men have the highest average consumption (2.25 cups/day) and are the top-consumption target
3. **Cultural**: The narrower gap among Boomers+ may point to a generational effect, although that group has the smallest samples and the survey is cross-sectional