## Hypotheses

- H0: There are no significant differences between the education-level groups
- H1: There are significant differences between at least some of the education-level groups

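# ANSWER
I ran two kinds of tests: Mann-Whitney and Kruskal-Wallis. I also used two different DataFrames: one contains all customers affiliated with the airline, and the other only those who have at least one flight (`flights booked > 0`). With all customers, Kruskal-Wallis rejects H0 (p ≈ 0.0014); restricted to customers with at least one flight, it does not (p ≈ 0.43).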

# PROCESS

In [1]:
from src import analisis_soporte as ans
import pandas as pd
import numpy as np

from scipy import stats
from scipy.stats import ttest_ind, norm, chi2_contingency, f_oneway
from sklearn.linear_model import LinearRegression



import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv('fligh-data-clean.csv', index_col=0)

In [3]:
df_booked = df[df['flights booked'] > 0]

## AB testing

In [4]:
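# .copy() makes this an independent DataFrame, so columns can be added later without touching df
df_testA = df[['loyalty number', 'flights booked', 'education']].copy()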
df_testA

Unnamed: 0,loyalty number,flights booked,education
0,100018,3,bachelor
1,100102,10,college
2,100140,6,college
3,100214,0,bachelor
4,100272,0,bachelor
...,...,...,...
405619,999902,0,college
405620,999911,0,doctor
405621,999940,3,bachelor
405622,999982,0,college


In [5]:
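# .copy() makes this an independent DataFrame, so columns can be added later without touching df_booked
df_testB = df_booked[['loyalty number', 'flights booked', 'education']].copy()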
df_testB

Unnamed: 0,loyalty number,flights booked,education
0,100018,3,bachelor
1,100102,10,college
2,100140,6,college
8,100428,6,bachelor
10,100550,3,bachelor
...,...,...,...
405612,999550,15,doctor
405613,999589,14,college
405614,999631,11,bachelor
405616,999758,1,college


In [6]:
# Mean, standard deviation and variance of 'flights booked' per education level
tabla_statsA = df_testA.groupby('education')['flights booked'].agg(['mean','std','var'])
tabla_statsB = df_testB.groupby('education')['flights booked'].agg(['mean','std','var'])

display(tabla_statsA)
display(tabla_statsB)

education,mean,std,var
bachelor,4.110288,5.221671,27.265843
college,4.169744,5.24604,27.520938
doctor,4.175512,5.256971,27.635748
high school or below,4.178445,5.240179,27.459474
master,4.2007,5.213956,27.185339


education,mean,std,var
bachelor,8.028607,4.669,21.799564
college,8.070523,4.667465,21.785233
doctor,8.053519,4.697898,22.070242
high school or below,8.016147,4.681404,21.915543
master,8.005894,4.620198,21.346227
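
The tests used later in this notebook are rank-based, so a per-group median can be a useful companion to the means above. The following is only a sketch on made-up data (the `toy` DataFrame and its values are invented for illustration and are not taken from the airline dataset):

```python
import pandas as pd

# Invented toy data; column names mirror the notebook's, values do not.
toy = pd.DataFrame({
    'education': ['bachelor', 'bachelor', 'bachelor', 'master', 'master', 'master'],
    'flights booked': [0, 3, 9, 1, 4, 4],
})

# Same groupby/agg pattern as above, with 'median' added to the summary.
summary = toy.groupby('education')['flights booked'].agg(['mean', 'median', 'std', 'var'])
print(summary)
```

With skewed or zero-heavy counts, the mean and median can diverge noticeably, which is worth checking before interpreting a rank-based test.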


### Tests: normality and homogeneity

In [7]:
ans.normalidad(df_testA, 'flights booked')

For the column flights booked, the data do not follow a normal distribution.


In [8]:
ans.normalidad(df_testB, 'flights booked')

For the column flights booked, the data do not follow a normal distribution.


In [9]:
ans.homogeneidad(df_testA, 'education', 'flights booked')

For the metric flights booked, the variances are not homogeneous across groups.


In [10]:
ans.homogeneidad(df_testB, 'education', 'flights booked')

For the metric flights booked, the variances are homogeneous across groups.
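
The source of `ans.normalidad` and `ans.homogeneidad` is not shown in this notebook, so the following is only a guess at what such helpers might do. It assumes a Kolmogorov-Smirnov test against a normal distribution fitted to the data and a Levene test across groups; the function names `looks_normal` and `equal_variances` are hypothetical and are not part of `analisis_soporte`.

```python
import numpy as np
import pandas as pd
from scipy import stats


def looks_normal(df, column, alpha=0.05):
    """Hypothetical stand-in for ans.normalidad (assumption: KS test).

    Compares the column against a normal distribution whose mean and
    standard deviation are estimated from the same data.
    """
    values = df[column].to_numpy(dtype=float)
    _, p_value = stats.kstest(values, 'norm', args=(values.mean(), values.std(ddof=1)))
    return bool(p_value > alpha)


def equal_variances(df, group_col, metric_col, alpha=0.05):
    """Hypothetical stand-in for ans.homogeneidad (assumption: Levene test)."""
    groups = [g[metric_col].to_numpy(dtype=float) for _, g in df.groupby(group_col)]
    _, p_value = stats.levene(*groups)
    return bool(p_value > alpha)
```

Because the normal parameters are estimated from the sample, the KS p-value in this sketch is conservative. With hundreds of thousands of rows, even small departures from normality or from equal variances tend to be flagged, which is one reason the notebook moves on to non-parametric tests.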


### testing - 1


In [11]:
df_testA['grupo'] = df_testA['education'].apply(lambda x: 'A' if x in ['bachelor', 'college', 'master', 'doctor'] else 'B')
df_testA

Unnamed: 0,loyalty number,flights booked,education,grupo
0,100018,3,bachelor,A
1,100102,10,college,A
2,100140,6,college,A
3,100214,0,bachelor,A
4,100272,0,bachelor,A
...,...,...,...,...
405619,999902,0,college,A
405620,999911,0,doctor,A
405621,999940,3,bachelor,A
405622,999982,0,college,A


In [12]:
ans.test_man_whitney(df_testA, ['flights booked'], 'A', 'B', 'grupo')

For the metric flights booked, the medians are equal.


In [13]:
df_testA['grupo2'] = df_testA['education'].apply(lambda x: 'C' if x in ['master', 'doctor'] else 'D')

In [14]:
ans.test_man_whitney(df_testA, ['flights booked'], 'C', 'D', 'grupo2')

For the metric flights booked, the medians are different.
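
The implementation of `ans.test_man_whitney` is not included here. As a rough sketch of the kind of comparison it presumably performs, the function below splits a DataFrame on a grouping column and applies `scipy.stats.mannwhitneyu`. The name `mann_whitney_by_group`, its return value, and the two-sided alternative are assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def mann_whitney_by_group(df, metric, group_a, group_b, group_col, alpha=0.05):
    """Hypothetical helper: two-sided Mann-Whitney U test between two groups.

    Returns (statistic, p_value, significant), where `significant` is True
    when the p-value is below alpha.
    """
    sample_a = df.loc[df[group_col] == group_a, metric]
    sample_b = df.loc[df[group_col] == group_b, metric]
    statistic, p_value = mannwhitneyu(sample_a, sample_b, alternative='two-sided')
    return statistic, p_value, bool(p_value < alpha)
```

Strictly speaking, the Mann-Whitney test compares the two distributions; reading the result as a difference in medians relies on both groups having similarly shaped distributions.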


### testing - 2

In [15]:
df_testB['grupo'] = df_testB['education'].apply(lambda x: 'A' if x in ['bachelor', 'college', 'master', 'doctor'] else 'B')
df_testB

Unnamed: 0,loyalty number,flights booked,education,grupo
0,100018,3,bachelor,A
1,100102,10,college,A
2,100140,6,college,A
8,100428,6,bachelor,A
10,100550,3,bachelor,A
...,...,...,...,...
405612,999550,15,doctor,A
405613,999589,14,college,A
405614,999631,11,bachelor,A
405616,999758,1,college,A


In [16]:
ans.test_man_whitney(df_testB, ['flights booked'], 'A', 'B', 'grupo')

For the metric flights booked, the medians are equal.


In [17]:
df_testB['grupo2'] = df_testB['education'].apply(lambda x: 'C' if x in ['master', 'doctor'] else 'D')
df_testB

Unnamed: 0,loyalty number,flights booked,education,grupo,grupo2
0,100018,3,bachelor,A,D
1,100102,10,college,A,D
2,100140,6,college,A,D
8,100428,6,bachelor,A,D
10,100550,3,bachelor,A,D
...,...,...,...,...,...
405612,999550,15,doctor,A,C
405613,999589,14,college,A,D
405614,999631,11,bachelor,A,D
405616,999758,1,college,A,D


In [18]:
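# Note: this call passes df_testA, not df_testB, so it repeats the comparison from In [14];
# pass df_testB to test only the customers with at least one flight.
ans.test_man_whitney(df_testA, ['flights booked'], 'C', 'D', 'grupo2')

For the metric flights booked, the medians are different.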


## Kruskal-Wallis

### testing - 1

In [19]:
df['education'].unique()


array(['bachelor', 'college', 'master', 'high school or below', 'doctor'],
      dtype=object)

In [20]:
from scipy.stats import kruskal

# Build one sample of 'flights booked' per education level
levels = ['bachelor', 'college', 'master', 'doctor', 'high school or below']
samples = [df_testA.loc[df_testA['education'] == level, 'flights booked'] for level in levels]
statistic, p_value = kruskal(*samples)

# Print the results
print("Kruskal-Wallis statistic:", statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There are statistically significant differences between the groups.")
else:
    print("There are no statistically significant differences between the groups.")

Kruskal-Wallis statistic: 17.726425291000844
p-value: 0.001395642070733225
There are statistically significant differences between the groups.
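
A significant Kruskal-Wallis result only says that at least one group differs, not which one. One common follow-up, sketched here as an assumption rather than as part of the original analysis, is to run pairwise Mann-Whitney tests with a Bonferroni correction. The function name `pairwise_mann_whitney` and the output column names are hypothetical.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import mannwhitneyu


def pairwise_mann_whitney(df, group_col, metric, alpha=0.05):
    """Hypothetical post hoc helper: all pairwise two-sided Mann-Whitney tests.

    Each p-value is multiplied by the number of pairs (Bonferroni) and
    capped at 1.0 before being compared with alpha.
    """
    levels = sorted(df[group_col].unique())
    pairs = list(combinations(levels, 2))
    rows = []
    for a, b in pairs:
        sample_a = df.loc[df[group_col] == a, metric]
        sample_b = df.loc[df[group_col] == b, metric]
        _, p_value = mannwhitneyu(sample_a, sample_b, alternative='two-sided')
        p_adjusted = min(float(p_value) * len(pairs), 1.0)
        rows.append({
            'group_a': a,
            'group_b': b,
            'p_value': float(p_value),
            'p_adjusted': p_adjusted,
            'significant': p_adjusted < alpha,
        })
    return pd.DataFrame(rows)
```

On the notebook's data this could be called as `pairwise_mann_whitney(df_testA, 'education', 'flights booked')`, which would produce ten comparisons for the five education levels.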


### testing - 2

In [None]:
# Build one sample of 'flights booked' per education level (customers with at least one flight)
levels = ['bachelor', 'college', 'master', 'doctor', 'high school or below']
samples = [df_testB.loc[df_testB['education'] == level, 'flights booked'] for level in levels]
statistic, p_value = kruskal(*samples)

# Print the results
print("Kruskal-Wallis statistic:", statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There are statistically significant differences between the groups.")
else:
    print("There are no statistically significant differences between the groups.")

Kruskal-Wallis statistic: 3.8067713809098156
p-value: 0.4327876205737683
There are no statistically significant differences between the groups.
