# Exploratory Analysis of ENIGH

The first part of our exploration of these datasets is to understand their dimensions and run a <i>sanity check</i>. Since executive presentations exist for these datasets, we can compare some of our results against them purely for validation.

In [17]:

import numpy as np
import pandas as pd
import pingouin as pg
import scipy.stats as stats
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from funcs import *


## Household Summary (Concentrado del Hogar)

In [18]:
df2020 = pd.read_csv(
    "D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh_ns_2020_csv/conjunto_de_datos_concentradohogar_enigh_2020_ns/conjunto_de_datos/conjunto_de_datos_concentradohogar_enigh_2020_ns.csv"
)
df2018 = pd.read_csv(
    "D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh_2018_ns_csv/conjunto_de_datos_concentradohogar_enigh_2018_ns/conjunto_de_datos/conjunto_de_datos_concentradohogar_enigh_2018_ns.csv"
)
df2016 = pd.read_csv(
    "D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh2016_nueva_serie_csv/conjunto_de_datos_concentradohogar_enigh_2016_ns/conjunto_de_datos/conjunto_de_datos_concentradohogar_enigh_2016_ns.csv"
)

### Sanity check

### 2020

In [19]:
np.sum(df2020['salud']*df2020['factor'])/df2020['factor'].sum()

1265.6349782541984

This matches the average reported by INEGI.
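The check above is a weighted mean in which each household's `salud` value counts `factor` times. The following is a minimal sketch of that arithmetic in plain Python; the function name `weighted_mean` and the toy numbers are illustrative and not taken from the survey.

```python
def weighted_mean(values, weights):
    """Return sum(v * w) / sum(w), the same ratio computed in the cell above."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Toy data: two households, the second one represents three times as many homes.
spending = [100.0, 300.0]
factors = [1, 3]
print(weighted_mean(spending, factors))  # (100*1 + 300*3) / 4 = 250.0
```

With NumPy, `np.average(values, weights=weights)` computes the same quantity.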

### 2018

In [20]:
np.sum(df2018['salud']*df2018['factor'])/df2018['factor'].sum()


840.8746496324256

The 2018 mean computed above differs from the figure in the 2018 presentation by 3 pesos.

To express the 2018 value in 2020 pesos, we use the following formula:
$$
\text{Value}_f = \text{Value}_i \cdot \frac{\text{CPI}_f}{\text{CPI}_i}
$$
where $\text{Value}_i$ is the initial value and $\text{Value}_f$ is the final value, with $\text{CPI}_f = 122.56$ (2020) and $\text{CPI}_i = 114.38$ (2018).
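As a small sketch of the price-index adjustment described above, the helper below rescales a value by the ratio of two index levels. The name `adjust_for_inflation` is hypothetical; the index values 114.38 (2018) and 122.56 (2020) are the ones used in this notebook's code cells.

```python
def adjust_for_inflation(value, cpi_from, cpi_to):
    """Re-express `value` from the period with index `cpi_from`
    in prices of the period with index `cpi_to`."""
    if cpi_from <= 0:
        raise ValueError("cpi_from must be positive")
    return value * (cpi_to / cpi_from)


# Weighted 2018 mean of `salud` from the sanity check, moved to 2020 prices.
mean_2018 = 840.8746496324256
print(round(adjust_for_inflation(mean_2018, 114.38, 122.56), 2))  # 901.01
```

Passing the same index twice returns the value unchanged, which is a quick way to confirm that the ratio is oriented correctly.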

In [21]:
val = np.sum(df2018['salud']*df2018['factor'])/df2018['factor'].sum()
val * (122.56/114.38)

901.0106404874113


Expressed in 2020 pesos, the 2018 mean is about 901 pesos, which matches the figure given in the 2020 report.

### 2016

In [22]:
np.sum(df2016['salud']*df2016['factor'])/df2016['factor'].sum()

763.0811599115455

This differs by 3 pesos from the results reported in the 2018 presentation.

In [23]:
val = np.sum(df2016['salud']*df2016['factor'])/df2016['factor'].sum()
val * (122.56/102.82)

909.5820556191308

In 2020 pesos the value is 909.58, which matches the figure reported in the 2020 report.

# Hypothesis Tests

## Comparison of Mean Quarterly Out-of-Pocket Spending between 2018 and 2020

$H_0$: Mean out-of-pocket spending in 2020 is equal to that of 2018

$H_1$: Mean out-of-pocket spending in 2020 is <b>different</b> from that of 2018

$\alpha$: 0.05
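The non-integer `dof` values in the `pg.ttest` output suggest that an unequal-variance (Welch) t-test is being applied, although that is an inference from the output rather than something the notebook states. As a hedged sketch of that statistic in plain Python, with a hypothetical helper `welch_t` and toy samples rather than survey data:

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t = diff / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof


# Toy samples, not survey data.
t_stat, dof = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
print(t_stat, dof)
```

The two-sided p-value then comes from the Student t distribution with `dof` degrees of freedom, for example `2 * scipy.stats.t.sf(abs(t_stat), dof)`, and $H_0$ is rejected when it falls below $\alpha$.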

In [24]:
df2020['saludf'] = df2020['salud'] * df2020['factor']
df2018['saludf'] = df2018['salud'] * df2018['factor']
pg.ttest(df2020['saludf'], df2018['saludf'], alternative='two-sided', paired=False)

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power
T-test,8.457898,159862.618596,two-sided,2.74488e-17,"[92835.11, 148839.15]",0.041895,18940000000000.0,1.0


In [25]:
1265.63/901.01-1

0.4046791933496854

The hypothesis test shows that the 2020 difference is statistically significant. Mean quarterly out-of-pocket spending grew by about 40.47%.

## Comparison of means across other spending categories (2018 vs 2020)

In [26]:
def aggregate(df, year):
    rubros = [
        "alimentos",
        "bebidas",
        "tabaco",
        "adqui_vehi",
        "ali_dentro",
        "ali_fuera",
        "transporte",
        "vivienda",
        "educacion",
        "educa_espa",
        "esparci",
        "vesti_calz",
        "pago_tarje",
        "salud",
        "hospital",
        "atenc_ambu",
        "medicinas",
        "mantenim",
        "comunica",
        "refaccion",
        "servicio",
        "prestamos",
        "limpieza",
        "energia",
    ]
    labels = [
        "Alimentos",
        "Bebidas",
        "Tabaco",
        "Adquisición de vehículos",
        "Alimentos dentro del hogar",
        "Alimentos fuera del hogar",
        "Transporte",
        "Vivienda",
        "Educación",
        "Educación y Esparcimiento",
        "Esparcimiento",
        "Vestido y calzado",
        "Pagos con tarjeta",
        "Salud (Medicinas, Atenc. Amb. y Hospitalizaciones)",
        "Hospitalizaciones",
        "Atención ambulatoria",
        "Medicinas",
        "Mantenimiento de vehículos",
        "Comunicaciones",
        "Refacciones de vehículos",
        "Servicios de reparación",
        "Préstamos",
        "Limpieza",
        "Electricidad y Combustibles",
    ]

    results = []
    for rubro in rubros:
        results.append(
            np.round(np.sum(df[rubro] * df["factor"]) / df["factor"].sum(), 1)
        )
    df = pd.DataFrame({"Item": rubros, "Mean": results, "year": year, "labels": labels})
    return df


In [27]:
r2020 = aggregate(df2020, 2020)
r2018 = aggregate(df2018, 2018)
r2018['Mean'] = r2018['Mean'] * (122.56/114.38)
results = pd.concat([r2020, r2018]).pivot(index=['Item', 'labels'], columns='year', )
results.columns = ['2018', '2020']
results = results.reset_index()
results['PCT_DIFF'] = np.round(((results['2020']/results['2018']) - 1)*100, 1)
results['2018_PCT'] = np.round((results['2018']/np.sum(results['2018']))*100, 1)
results['2020_PCT'] = np.round((results['2020']/np.sum(results['2020']))*100, 1)

def color(x):
    if x >= 0:
        return GRADIENT_PALETTE[-1]
    else:
        return GRADIENT_PALETTE[0]
    
labels = lambda x: f'{x}%' 

adjustx = lambda x: x + 8.5 if x >= 0 else x - 8.8

def adjustxline(x):
    if x >= 0:
        return x - 2.9 if x - 2.9 > 0 else 0
    else:
        return x + 2.9 if x + 2.9 < 0 else 0
    
results.sort_values(by='PCT_DIFF', ascending=True, inplace=True)

In [28]:
rubros = [
        "alimentos","bebidas","tabaco","adqui_vehi","ali_dentro","ali_fuera","transporte","vivienda","educacion",
        'educa_espa','esparci',"vesti_calz","pago_tarje","salud",'hospital','atenc_ambu','medicinas',"mantenim",
        "comunica","refaccion","servicio",'prestamos','limpieza','energia'
    ]
tresults = []
for rubro in rubros:
    df2020[f'{rubro}f'] = df2020[rubro] * df2020['factor']
    df2018[f'{rubro}f'] = df2018[rubro] * df2018['factor']
    tresults.append(pg.ttest(df2020[f'{rubro}f'], df2018[f'{rubro}f'], alternative='two-sided', paired=False))
tresults = pd.concat(tresults, axis=0)
tresults['Rubro'] = rubros
tresults['Relevant'] = tresults['p-val'] < 0.05
tresults

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power,Rubro,Relevant
T-test,-14.822862,140374.09525,two-sided,1.136835e-49,"[-711898.02, -545620.29]",0.075111,2.648e+45,1.0,alimentos,True
T-test,-1.86379,145017.649484,two-sided,0.06235322,"[-15579.19, 391.91]",0.009402,0.032,0.473939,bebidas,False
T-test,-2.633952,133177.328851,two-sided,0.008440711,"[-6111.05, -896.55]",0.013435,0.18,0.772473,tabaco,True
T-test,-2.773235,148219.59514,two-sided,0.005550895,"[-137298.67, -23590.49]",0.013945,0.262,0.802268,adqui_vehi,True
T-test,-1.607106,155215.071563,two-sided,0.1080332,"[-109118.48, 10794.53]",0.008016,0.02,0.365319,ali_dentro,False
T-test,-28.738426,117570.776704,two-sided,5.372385e-181,"[-615383.42, -536803.34]",0.148641,4.191e+176,1.0,ali_fuera,True
T-test,-17.693043,137941.809713,two-sided,5.6711909999999995e-70,"[-795066.2, -636483.51]",0.089857,4.505e+65,1.0,transporte,True
T-test,-3.972147,145040.641636,two-sided,7.126211e-05,"[-132190.94, -44838.92]",0.020038,14.925,0.981119,vivienda,True
T-test,-16.261939,133284.303789,two-sided,2.097637e-59,"[-614835.14, -482569.65]",0.082938,1.322e+55,1.0,educacion,True
T-test,-22.565256,124174.262634,two-sided,1.604166e-112,"[-940874.89, -790491.12]",0.116024,1.367e+108,1.0,educa_espa,True


In [29]:
results = results.join(tresults[['Relevant', 'Rubro']].set_index('Rubro'), on='Item')

In [30]:
lines = results.Relevant.apply(lambda x: 2 if x else 0)

fig = go.Figure(
    go.Scatter(
        x=results["PCT_DIFF"],
        y=results['labels'],
        mode="markers",
        name="",
        marker=dict(color=results.PCT_DIFF.apply(color), size=25),
    )
)
fig.update_traces(marker_line_width=lines)

fig.add_vline(x=0, name="", fillcolor="#e0e0e0", line=dict(width=5))
annotations = []
for index, row in results.iterrows():
    annotations.append(
        dict(
            x=adjustx(row['PCT_DIFF']),
            y=row.labels,
            text=labels(row['PCT_DIFF']),
            showarrow=False,
            font=dict(family="Arial", size=14),
        )
    )
    fig.add_shape(
        type="line",
        x0=0,
        y0=row.labels,
        x1=adjustxline(row['PCT_DIFF']),
        y1=row.labels,
        line=dict(color="black", width=2),
    )

fig.add_annotation(xref='paper', yref='paper', x=-0.4, y=-0.06, showarrow=False, 
                   text='Fuente: Elaboración propia con datos del INEGI, ENIGH')

fig.update_layout(
    title="Figura 2. Porcentaje de variación en el gasto de rubros de 2018 vs 2020",
    annotations=annotations, template="plotly_white",
)
fig.update_xaxes(showgrid=True)
fig.update_yaxes(showgrid=False)
fig.write_image("images/variacion_rubros.png", width=1000, height=1000)
fig.show()

In [31]:
results.Relevant.value_counts(normalize=True)

True     0.791667
False    0.208333
Name: Relevant, dtype: float64

The observed change is significant in about 79% of the categories (19 of 24).

## Which part of the population, by age group, was most affected in out-of-pocket spending?

In [32]:
hogar2020 = pd.read_csv(
    "D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh_ns_2020_csv/conjunto_de_datos_poblacion_enigh_2020_ns/conjunto_de_datos/conjunto_de_datos_poblacion_enigh_2020_ns.csv"
)
hogar2018 = pd.read_csv(
    "D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh_2018_ns_csv/conjunto_de_datos_poblacion_enigh_2018_ns/conjunto_de_datos/conjunto_de_datos_poblacion_enigh_2018_ns.csv"
)


Columns (10,11,12,13,14,15,16,17,82,83,166) have mixed types. Specify dtype option on import or set low_memory=False.


Columns (10,80,81,162) have mixed types. Specify dtype option on import or set low_memory=False.



Our approach to answering this question is to define a household's age group by the majority of its members. That is, if a household has 3 adults over 50 and 1 young person aged 20, the household is placed in the over-50 category.
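The majority rule described above can be sketched by counting members per age category within one household. This is an illustration only: `age_category` is a hypothetical stand-in for `defineEdadCat` (which lives in `funcs` and is not shown here), and its bin edges are assumed from the category labels that appear in the outputs.

```python
from collections import Counter


def age_category(age):
    """Hypothetical stand-in for `defineEdadCat`; the bin edges are assumed."""
    if age <= 18:
        return "0-18"
    if age <= 35:
        return "19-35"
    if age <= 50:
        return "36-50"
    if age <= 65:
        return "51-65"
    return "65+"


def household_age_group(ages):
    """Return the age category held by the most members of one household."""
    counts = Counter(age_category(a) for a in ages)
    # most_common(1) yields the top category; ties go to the one seen first.
    return counts.most_common(1)[0][0]


# Three adults over 50 and one 20-year-old, as in the example above.
print(household_age_group([55, 60, 52, 20]))  # 51-65
```

Counting members directly keeps the result independent of the order in which members are listed, except for ties, which this sketch resolves arbitrarily in favor of the first category encountered.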

In [33]:
hogar2018['edad_cat'] = hogar2018['edad'].apply(defineEdadCat)
hogar2020['edad_cat'] = hogar2020['edad'].apply(defineEdadCat)
hg2018 = hogar2018.groupby(['folioviv', 'foliohog', 'edad_cat'])['numren'].agg(['sum', 'count']).reset_index()
hg2020 = hogar2020.groupby(['folioviv', 'foliohog', 'edad_cat'])['numren'].agg(['sum', 'count']).reset_index()
hg2018['pct'] = np.round(hg2018['count'] / hg2018['sum'], 1)
hg2020['pct'] = np.round(hg2020['count'] / hg2020['sum'], 1)
hg2018 = hg2018.sort_values(by=['folioviv', 'foliohog', 'pct'], ascending=[True,True,False]).groupby(['folioviv', 'foliohog']).first().reset_index()
hg2020 = hg2020.sort_values(by=['folioviv', 'foliohog', 'pct'], ascending=[True,True,False]).groupby(['folioviv', 'foliohog']).first().reset_index()
hg2018 = hg2018[['folioviv', 'foliohog', 'edad_cat']]
hg2020 = hg2020[['folioviv', 'foliohog', 'edad_cat']]
df2018 = pd.merge(df2018, hg2018, on=['folioviv', 'foliohog'], how='left')
df2020 = pd.merge(df2020, hg2020, on=['folioviv', 'foliohog'], how='left')

In [34]:
fig = px.histogram(hg2018, x='edad_cat')
fig.update_traces(marker_color=CATEGORY_PALETTE[0])
fig.show()

In [35]:
hg2018['edad_cat'].value_counts(normalize=True)

36-50    0.341300
51-65    0.269455
19-35    0.206197
65+      0.178078
0-18     0.004970
Name: edad_cat, dtype: float64

In [36]:
def weightMean(x, old=False):
    r = np.round(np.sum(x.saludf) / np.sum(x.factor), 2)
    return np.round(r*((122.56/114.38)), 2) if old else r

r18 = df2018.groupby('edad_cat').apply(lambda x: weightMean(x, True)).reset_index()
r18

Unnamed: 0,edad_cat,0
0,0-18,919.71
1,19-35,663.61
2,36-50,784.61
3,51-65,998.6
4,65+,1249.89


In [37]:
r20 = df2020.groupby('edad_cat').apply(weightMean).reset_index()
r20

Unnamed: 0,edad_cat,0
0,0-18,741.23
1,19-35,921.01
2,36-50,1100.54
3,51-65,1499.22
4,65+,1521.88


In [38]:
r1820 = pd.merge(r18, r20, on='edad_cat', how='left')
r1820.columns = ['edad_cat', '2018', '2020']
r1820['PCT_DIFF'] = np.round(r1820['2020']/r1820['2018']-1, 2)

In [39]:
r1820

Unnamed: 0,edad_cat,2018,2020,PCT_DIFF
0,0-18,919.71,741.23,-0.19
1,19-35,663.61,921.01,0.39
2,36-50,784.61,1100.54,0.4
3,51-65,998.6,1499.22,0.5
4,65+,1249.89,1521.88,0.22


In [40]:
def compareMeansEdad():
  grupos = ['0-18',
  '19-35',
  '36-50',
  '51-65',
  '65+']
  results = []
  for grupo in grupos:
    results.append(pg.ttest(df2020.loc[df2020.edad_cat == grupo,'saludf'], df2018.loc[df2018.edad_cat == grupo, 'saludf'], alternative='two-sided', paired=False))
  result = pd.concat(results, axis=0)
  result['grupo'] = grupos
  result['alpha'] = 0.05
  result['relevant'] = result['p-val'] < result['alpha']
  return result

In [41]:
resultDF = compareMeansEdad()
resultDF

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power,grupo,alpha,relevant
T-test,-0.485582,444.28665,two-sided,0.627503,"[-418505.78, 252673.75]",0.035387,0.093,0.076297,0-18,0.05,False
T-test,3.411633,30907.64992,two-sided,0.0006465763,"[33752.78, 124903.49]",0.038037,4.251,0.923332,19-35,0.05,True
T-test,5.206652,54800.660135,two-sided,1.929707e-07,"[64500.13, 142377.89]",0.044235,7386.399,0.999334,36-50,0.05,True
T-test,6.455143,44468.818254,two-sided,1.092274e-10,"[123080.09, 230413.75]",0.058401,11710000.0,1.0,51-65,0.05,True
T-test,1.63536,22000.239843,two-sided,0.1019879,"[-15874.27, 175770.96]",0.019815,0.049,0.406046,65+,0.05,False


In [42]:
r1820 = pd.merge(r1820, resultDF[['grupo', 'relevant']], left_on='edad_cat', right_on='grupo', how='left')

In [16]:
fig = go.Figure()
def color(x, rel):
    opacity = '1' if rel else '0.3'
    if x == '0-18':
        return CATEGORY_PALETTE_OPACITY[0].replace('x', opacity)
    elif x == '19-35':
        return CATEGORY_PALETTE_OPACITY[1].replace('x', opacity)
    elif x == '36-50':
        return CATEGORY_PALETTE_OPACITY[2].replace('x', opacity)
    elif x == '51-65':
        return CATEGORY_PALETTE_OPACITY[3].replace('x', opacity)
    else:
        return CATEGORY_PALETTE_OPACITY[4].replace('x', opacity)
    
for i in r18.edad_cat:
    dats = r1820[r1820.edad_cat == i][['2018', '2020', 'relevant']]
    colors = color(i, dats.relevant.values[0])
    fig.add_trace(
        go.Scatter(
            x=[2018, 2020],
            y=[dats['2018'].values[0], dats['2020'].values[0]],
            mode="lines+markers+text",
            text=[dats['2018'].values[0], dats['2020'].values[0]],
            textposition=["middle left", "middle right"],
            marker=dict(size=12, color=colors),
            name=i,
        )
    )

fig.add_annotation(yref='paper', x=2018, y=-0.05, text='2018', showarrow=False, font=dict(size=16))
fig.add_annotation(yref='paper', x=2020, y=-0.05, text='2020', showarrow=False, font=dict(size=16))
fig.add_shape(
    type="line",
    x0=2018,
    x1=2018,
    y0=0,
    y1=1,
    xref="x",
    yref="paper",
    layer="below",
)
fig.add_shape(
    type="line",
    x0=2020,
    x1=2020,
    y0=0,
    y1=1,
    xref="x",
    yref="paper",
    layer="below",
)
fig.update_xaxes(range=[2017.5, 2020.5], visible=False)
fig.update_layout(title_text="Cambio en el gasto de bolsillo por grupos de edad (2018 vs 2020)", 
                  yaxis_title="Media de Gasto de bolsillo", xaxis_title="",
                  legend=dict(orientation="h", yanchor="bottom", y=1.002, xanchor="right", x=1),
                  template="plotly_white",uniformtext_minsize=12, uniformtext_mode='hide')
fig.add_annotation(xref='paper', yref='paper', x=-0.084, y=-0.1, 
                   text='Fuente: Elaboración propia con datos del INEGI, ENIGH', 
                   showarrow=False)
fig.write_image("images/fig2.png", width=1000, height=600)
fig.show()



The results show that the age groups whose difference is statistically significant are:
- 19-35
- 36-50
- 51-65

## Which part of the population, in terms of number of household members, was most affected in out-of-pocket spending?

In [62]:
cnt2018 = hogar2018.groupby(['folioviv', 'foliohog']).agg({'numren':'size'}).reset_index()
cnt2020 = hogar2020.groupby(['folioviv', 'foliohog']).agg({'numren':'size'}).reset_index()

In [63]:
fig = make_subplots(rows=2, cols=1)
fig.add_trace(go.Histogram(x=cnt2018.numren, name='2018', 
                           marker=dict(color=CATEGORY_PALETTE[0])), row=1, col=1)
fig.add_trace(go.Histogram(x=cnt2020.numren, name='2020',
                           marker=dict(color=CATEGORY_PALETTE[0])), row=2, col=1)
fig.update_layout(title_text="Figura 4. Distribución número de habitantes por hogar (2018 vs 2020)",
                  xaxis_title="", yaxis_title="Frecuencia",
                  showlegend=False,)
fig.update_xaxes(row=2, col=1, title_text="Número de habitantes")
fig.update_yaxes(row=2, col=1, title_text="Frecuencia")
fig.add_annotation(xref='paper', yref='paper', x=-0.084, y=-0.14, text='Fuente: Elaboración propia con datos del INEGI, ENIGH', showarrow=False)
fig.write_image("images/histogramas.png", width=1000, height=600)
fig.show()

In [64]:
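def categorizeMembers(x):
  # Bucket household size. Note that '0-2' holds households with 1-2 members
  # and '6+' holds 7 or more (any non-positive value also falls into '6+').
  if 0 < x <= 2:
    return '0-2'
  elif 2 < x <= 4:
    return '3-4'
  elif 4 < x <= 6:
    return '5-6'
  else:
    return '6+'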

cnt2018['member_group'] = cnt2018['numren'].apply(categorizeMembers)
cnt2020['member_group'] = cnt2020['numren'].apply(categorizeMembers)

In [65]:
cnt2018 = cnt2018[['folioviv', 'foliohog', 'member_group']]
cnt2020 = cnt2020[['folioviv', 'foliohog', 'member_group']]

In [66]:
cnt2020['member_group'].value_counts(normalize=True)

3-4    0.415039
0-2    0.317597
5-6    0.210177
6+     0.057187
Name: member_group, dtype: float64

In [67]:
df2018 = pd.merge(df2018, cnt2018, on=['folioviv', 'foliohog'], how='left')
df2020 = pd.merge(df2020, cnt2020, on=['folioviv', 'foliohog'], how='left')

In [68]:
r18 = df2018.groupby('member_group').apply(lambda x: weightMean(x, True)).reset_index()
r18

Unnamed: 0,member_group,0
0,0-2,836.04
1,3-4,864.8
2,5-6,911.07
3,6+,1445.89


In [69]:
r20 = df2020.groupby('member_group').apply(weightMean).reset_index()
r20

Unnamed: 0,member_group,0
0,0-2,1112.67
1,3-4,1313.78
2,5-6,1322.4
3,6+,1545.46


In [70]:
r1820 = pd.merge(r18, r20, on='member_group', how='left')
r1820.columns = ['member_group', '2018', '2020']
r1820['PCT_DIFF'] = np.round(r1820['2020']/r1820['2018']-1, 2)
r1820

Unnamed: 0,member_group,2018,2020,PCT_DIFF
0,0-2,836.04,1112.67,0.33
1,3-4,864.8,1313.78,0.52
2,5-6,911.07,1322.4,0.45
3,6+,1445.89,1545.46,0.07


In [71]:
def compareMeans(vars, column):
  results = []
  for grupo in vars:
    results.append(pg.ttest(df2020.loc[df2020[column] == grupo, 'saludf'], df2018.loc[df2018[column] == grupo, 'saludf'], alternative='two-sided', paired=False))
  result = pd.concat(results, axis=0)
  result['grupo'] = vars
  result['alpha'] = 0.05
  result['relevant'] = result['p-val'] < result['alpha']
  return result

resultDF = compareMeans(['0-2', '3-4', '5-6', '6+'], 'member_group')
resultDF

Unnamed: 0,T,dof,alternative,p-val,CI95%,cohen-d,BF10,power,grupo,alpha,relevant
T-test,3.303081,50097.410112,two-sided,0.0009569563,"[34363.0, 134658.38]",0.029164,2.344,0.906089,0-2,0.05,True
T-test,8.040092,65770.017809,two-sided,9.125176e-16,"[121988.88, 200638.41]",0.06004,924400000000.0,1.0,3-4,0.05,True
T-test,5.262168,34599.027397,two-sided,1.432137e-07,"[83206.5, 181983.46]",0.055651,12360.0,0.999347,5-6,0.05,True
T-test,0.08702,5490.515942,two-sided,0.9306592,"[-198427.97, 216862.17]",0.001841,0.023,0.05094,6+,0.05,False


In [72]:
r1820 = pd.merge(r1820, resultDF[['grupo', 'relevant']], left_on='member_group', right_on='grupo', how='left')
r1820

Unnamed: 0,member_group,2018,2020,PCT_DIFF,grupo,relevant
0,0-2,836.04,1112.67,0.33,0-2,True
1,3-4,864.8,1313.78,0.52,3-4,True
2,5-6,911.07,1322.4,0.45,5-6,True
3,6+,1445.89,1545.46,0.07,6+,False


In [73]:
r1820.to_clipboard(index=False, sep=',')

In [74]:
def color(x, rel):
    opacity = '1' if rel else '0.3'
    if x == '0-2':
        return CATEGORY_PALETTE_OPACITY[0].replace('x', opacity)
    elif x == '3-4':
        return CATEGORY_PALETTE_OPACITY[1].replace('x', opacity)
    elif x == '5-6':
        return CATEGORY_PALETTE_OPACITY[2].replace('x', opacity)
    elif x == '6+':
        return CATEGORY_PALETTE_OPACITY[3].replace('x', opacity)
    else:
        return CATEGORY_PALETTE_OPACITY[4].replace('x', opacity)

fig = go.Figure()
for i in r18.member_group:
    dats = r1820[r1820.member_group == i][['2018', '2020', 'relevant']]
    if i == '5-6':
        text = [dats['2018'].values[0], '']
    else:
        text = [dats['2018'].values[0], dats['2020'].values[0]]
    colors = color(i, dats.relevant.values[0])
    fig.add_trace(
        go.Scatter(
            x=[2018, 2020],
            y=[dats['2018'].values[0], dats['2020'].values[0]],
            mode="lines+markers+text",
            text=text,
            textposition=["middle left", "middle right"],
            marker=dict(size=12, color=colors),
            name=i,
        )
    )

fig.add_annotation(yref='paper', x=2018, y=-0.05, text='2018', showarrow=False, font=dict(size=16))
fig.add_annotation(yref='paper', x=2020, y=-0.05, text='2020', showarrow=False, font=dict(size=16))
fig.add_shape(
    type="line",
    x0=2018,
    x1=2018,
    y0=0,
    y1=1,
    xref="x",
    yref="paper",
    layer="below",
)
fig.add_shape(
    type="line",
    x0=2020,
    x1=2020,
    y0=0,
    y1=1,
    xref="x",
    yref="paper",
    layer="below",
)
fig.update_xaxes(range=[2017.5, 2020.5], visible=False)
fig.update_layout(title_text="Figura 4. Cambio en el gasto de bolsillo por cantidad de miembros del hogar (2018 vs 2020)", 
                  yaxis_title="Media de Gasto de bolsillo", xaxis_title="",
                  legend=dict(orientation="h", yanchor="bottom", y=1.002, xanchor="right", x=1),
                  template="plotly_white",)
fig.add_annotation(xref='paper', yref='paper', x=-0.04, y=-0.09, text='Fuente: Elaboración propia con datos del INEGI, ENIGH', showarrow=False)
fig.write_image("images/fig3.png", width=1000, height=600)
fig.show()


The results show that the differences observed in the 0-2, 3-4 and 5-6 groups are significant. Ordered from largest to smallest percentage difference:
1. 3-4
2. 5-6
3. 0-2
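The ordering above can be reproduced from the `r1820` table by keeping only the significant groups and sorting by `PCT_DIFF`. The sketch below copies the four rows from that table into a plain list; the names `rows` and `ranking` are illustrative.

```python
# Values copied from the r1820 table above: (group, PCT_DIFF, relevant).
rows = [
    ("0-2", 0.33, True),
    ("3-4", 0.52, True),
    ("5-6", 0.45, True),
    ("6+", 0.07, False),
]

# Sort by percentage change, largest first, then keep significant groups only.
ordered = sorted(rows, key=lambda row: row[1], reverse=True)
ranking = [group for group, pct_diff, relevant in ordered if relevant]
print(ranking)  # ['3-4', '5-6', '0-2']
```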

## Does out-of-pocket spending show a larger impact when broken down by gender?

In [45]:
gp2018 = pd.read_csv("D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh_2018_ns_csv/conjunto_de_datos_gastospersona_enigh_2018_ns/conjunto_de_datos/conjunto_de_datos_gastospersona_enigh_2018_ns.csv")
pob2018 = pd.read_csv("D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh_2018_ns_csv/conjunto_de_datos_poblacion_enigh_2018_ns/conjunto_de_datos/conjunto_de_datos_poblacion_enigh_2018_ns.csv", usecols=['folioviv' , 'foliohog', 'numren', 'sexo'])
gp2020 = pd.read_csv("D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh_ns_2020_csv/conjunto_de_datos_gastospersona_enigh_2020_ns/conjunto_de_datos/conjunto_de_datos_gastospersona_enigh_2020_ns.csv")
pob2020 = pd.read_csv("D:/Dropbox/Dropbox/Dany/Desafio/conjunto_de_datos_enigh_ns_2020_csv/conjunto_de_datos_poblacion_enigh_2020_ns/conjunto_de_datos/conjunto_de_datos_poblacion_enigh_2020_ns.csv", usecols=['folioviv', 'foliohog', 'numren', 'sexo'])


Columns (6,15,16,17,18,19) have mixed types. Specify dtype option on import or set low_memory=False.


Columns (6,15,16,17,18,19) have mixed types. Specify dtype option on import or set low_memory=False.



In [46]:
gp2018 = pd.merge(gp2018, pob2018, on=['folioviv', 'foliohog', 'numren'], how='left')
gp2020 = pd.merge(gp2020, pob2020, on=['folioviv', 'foliohog', 'numren'], how='left')


In [47]:
claves = gp2018.clave[gp2018.clave.str.contains(r'^J')].unique()

In [48]:
gp2018 = gp2018[gp2018.clave.isin(claves)]
gp2020 = gp2020[gp2020.clave.isin(claves)]

In [49]:
gp2018.groupby(['folioviv', 'sexo']).agg({'gasto_tri': 'sum'})

folioviv,sexo,gasto_tri
100027202,2,
100067406,1,
100068902,1,
100068902,2,
100074002,1,
...,...,...
3260321621,2,
3260481204,1,
3260550019,1,
3260550412,2,


After checking the tables, this analysis does not seem feasible: very few records in the GastosPersona table have data in the `gasto_tri` column.
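One way to quantify how sparse a column is would be to measure the share of entries that parse as numbers. The sketch below assumes, without having verified it, that `gasto_tri` may be read as text with blank entries (the mixed-type warning on load hints at this for some columns). The helper `share_numeric` and the toy values are hypothetical.

```python
import math


def share_numeric(values):
    """Fraction of entries that parse as a finite-or-infinite number.

    Blank strings, non-numeric text and NaN all count as missing.
    """
    def is_number(v):
        try:
            return not math.isnan(float(str(v).strip()))
        except ValueError:
            return False

    values = list(values)
    if not values:
        return 0.0
    return sum(is_number(v) for v in values) / len(values)


# Toy column mimicking a mostly blank `gasto_tri`; not real survey data.
sample = [" ", "", "150.0", " ", "NA"]
print(share_numeric(sample))  # 0.2
```

In pandas, `pd.to_numeric(gp2018['gasto_tri'], errors='coerce').notna().mean()` would give a comparable share for the real column.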