# Variables of households and population of Mexican States in 2020

This notebook uses the household and population dataframe for Mexican states (admin1), derived from the 2020 Mexican census published by [INEGI](https://inegi.org.mx/programas/ccpv/2020/#Datos_abiertos).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from string import ascii_letters
import numpy as np

%matplotlib inline
%reload_ext autoreload
%autoreload 2

Read the household and population variables for Mexico in 2020.

In [2]:
df = pd.read_parquet('../data/conjunto_de_datos_iter_00CSV20.parquet')

This query keeps only the rows that hold the totals of each variable for each state ("Total de la Entidad"), while the rest of the dataframe is discarded.

In [3]:
df.query("NOM_LOC == 'Total de la Entidad'", inplace = True)
df

Unnamed: 0,ENTIDAD,NOM_ENT,MUN,NOM_MUN,LOC,NOM_LOC,LONGITUD,LATITUD,ALTITUD,POBTOT,...,VPH_CEL,VPH_INTER,VPH_STVP,VPH_SPMVPI,VPH_CVJ,VPH_SINRTV,VPH_SINLTC,VPH_SINCINT,VPH_SINTIC,TAMLOC
3,1,Aguascalientes,0,Total de la entidad Aguascalientes,0,Total de la Entidad,,,,1425607,...,359895,236003,174089,98724,70126,6021,15323,128996,1711,*
2061,2,Baja California,0,Total de la entidad Baja California,0,Total de la Entidad,,,,3769020,...,1080169,800189,618175,384011,216865,41223,38772,293529,9582,*
7627,3,Baja California Sur,0,Total de la entidad Baja California Sur,0,Total de la Entidad,,,,798447,...,226517,148723,136538,67961,36197,14508,8675,77223,2608,*
10188,4,Campeche,0,Total de la entidad Campeche,0,Total de la Entidad,,,,928363,...,218322,114020,151613,38508,17976,23627,36397,130361,12028,*
12988,5,Coahuila de Zaragoza,0,Total de la entidad Coahuila de Zaragoza,0,Total de la Entidad,,,,3146771,...,824291,519599,443659,195883,124077,17020,46420,332298,5754,*
17137,6,Colima,0,Total de la entidad Colima,0,Total de la Entidad,,,,731391,...,206736,132395,114164,43881,22695,9173,12085,82366,2698,*
18396,7,Chiapas,0,Total de la entidad Chiapas,0,Total de la Entidad,,,,5543828,...,944695,292189,433400,61298,32460,214333,379915,993929,151655,*
39883,8,Chihuahua,0,Total de la entidad Chihuahua,0,Total de la Entidad,,,,3741869,...,1051045,650546,481871,282049,179964,42113,64419,432032,20158,*
52272,9,Ciudad de México,0,Total de la entidad Ciudad de México,0,Total de la Entidad,,,,9209944,...,2536523,2084156,1290811,957162,568827,46172,77272,561128,10528,*
52938,10,Durango,0,Total de la entidad Durango,0,Total de la Entidad,,,,1832650,...,434450,215108,222785,73099,54664,22071,41801,243750,12446,*
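The filtering step can be tried on a small toy dataframe. The sketch below is only an illustration: the rows and values are made up, and it uses the non-inplace form of `query`, which returns a new dataframe instead of modifying the original.

```python
import pandas as pd

# Toy rows imitating the layout of the census file; values are invented.
toy = pd.DataFrame({
    "ENTIDAD": [1, 1, 2, 2],
    "NOM_LOC": ["Total de la Entidad", "Total del Municipio",
                "Total de la Entidad", "Mexicali"],
    "POBTOT": [1000, 400, 2000, 900],
})

# Same filter as in the notebook, but returning a new dataframe.
state_totals = toy.query("NOM_LOC == 'Total de la Entidad'")

# After filtering, each state code should appear exactly once.
assert state_totals["ENTIDAD"].is_unique
print(state_totals)
```

Checking that the state code is unique after the filter is a cheap way to confirm that only one total row per state survived.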


Using the data dictionary that accompanies the dataset, the columns of interest are selected.

In [4]:
df=df[['ENTIDAD','NOM_ENT','PCON_DISC','PCON_LIMI','PCLIM_PMEN','PSIND_LIM','GRAPROES','PSINDER','PDER_SS','PROM_OCUP',
       'TVIVPARHAB','VPH_SINTIC','POB0_14','POB15_64','POB65_MAS']].copy()

Based on the dictionary, the columns are renamed to clearer names.

In [5]:
df.rename(columns = {'ENTIDAD':'cve_ent','NOM_ENT':'states','PCON_DISC': 'population_disability','PCON_LIMI': 'population_limitation',
                     'PCLIM_PMEN': 'population_mental_problem','PSIND_LIM':'population_no_problems','GRAPROES': 'average_years_finish', 'PSINDER': 'no_med_insurance', 
                     'PDER_SS': 'med_insurance', 'PROM_OCUP': 'average_household_size','TVIVPARHAB': 'total_households','VPH_SINTIC': 'household_no_tics',
                    'POB0_14':'population_0_14_years_old','POB15_64':'population_15_64_years_old','POB65_MAS':'population_65_more_years_old'}, inplace=True)

The columns of interest must also be converted to int and float data types, since these values will be normalized for further study.

In [6]:
# Columns holding counts are cast to int; columns holding averages are cast to float.
int_columns = ['population_disability', 'population_limitation', 'population_mental_problem',
               'population_no_problems', 'no_med_insurance', 'med_insurance', 'total_households',
               'household_no_tics', 'population_0_14_years_old', 'population_15_64_years_old',
               'population_65_more_years_old']
float_columns = ['average_years_finish', 'average_household_size']
df = df.astype({**dict.fromkeys(int_columns, int), **dict.fromkeys(float_columns, float)})
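A direct cast with `astype` raises an error if a cell holds a non-numeric placeholder (the `TAMLOC` column in the preview above shows a `*`, for instance). As a hedged alternative, and assuming such placeholders could also appear in the columns of interest, `pd.to_numeric` with `errors="coerce"` turns them into `NaN` so they can be counted before deciding how to handle them. The toy values below are invented.

```python
import pandas as pd

# Toy columns stored as text; "*" stands in for a non-numeric placeholder.
toy = pd.DataFrame({
    "PCON_DISC": ["71294", "*"],
    "GRAPROES": ["10.35", "9.63"],
})

# Convert every column; cells that cannot be parsed become NaN.
converted = toy.apply(pd.to_numeric, errors="coerce")

# Count how many cells per column could not be parsed.
missing_per_column = converted.isna().sum()
print(missing_per_column)
```

If every count is zero, the plain `astype` conversion used in the notebook is safe for those columns.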

To obtain the number of households that have ICT (TIC in Spanish), the households without ICT are subtracted from the total number of households.

In [7]:
df['household_tics'] = df['total_households']-df['household_no_tics']
df.dtypes

cve_ent                           int64
states                           object
population_disability             int32
population_limitation             int32
population_mental_problem         int32
population_no_problems            int32
average_years_finish            float64
no_med_insurance                  int32
med_insurance                     int32
average_household_size          float64
total_households                  int32
household_no_tics                 int32
population_0_14_years_old         int32
population_15_64_years_old        int32
population_65_more_years_old      int32
household_tics                    int32
dtype: object

The week 1 analysis is read.

In [8]:
dfWeek1 = pd.read_csv('../data/week1analyzesStates.csv')

The `cve_ent` column of the week 1 analysis is converted to `int64` so that it has the same data type as `cve_ent` in the census dataframe, which keeps the two compatible for the merge that follows.

In [9]:
dfWeek1['cve_ent'] = dfWeek1['cve_ent'].astype('int64')
dfWeek1.head()

Unnamed: 0,cve_ent,state,population,total_cases,case_rate,total_cases_last_60_days,case_rate_last_60_days,total_deaths,death_rate,total_deaths_last_60_days,death_rate_last_60_days
0,1,AGUASCALIENTES,1434635,28746,2003.715231,2046,142.614672,2516,175.375618,43,2.997278
1,2,BAJA CALIFORNIA,3634868,53646,1475.872026,3460,95.189151,8950,246.226273,195,5.364707
2,3,BAJA CALIFORNIA SUR,804708,51019,6340.063725,16314,2027.319226,2069,257.111896,580,72.075834
3,4,CAMPECHE,1000617,15865,1585.521733,5095,509.185832,1516,151.50652,247,24.684769
4,7,CHIAPAS,5730367,16679,291.063382,4806,83.868974,1839,32.092185,178,3.106258
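Mismatched key types (for example, integers on one side and strings on the other) can make a merge fail or match nothing, so it helps to compare the dtypes explicitly before merging. The following is a minimal sketch with invented frames named `left` and `right`; only the column name `cve_ent` comes from the notebook.

```python
import pandas as pd

# Left side: state codes already stored as int64.
left = pd.DataFrame({
    "cve_ent": pd.Series([1, 2], dtype="int64"),
    "states": ["Aguascalientes", "Baja California"],
})

# Right side: the same codes, but read as text.
right = pd.DataFrame({
    "cve_ent": ["1", "2"],
    "total_cases": [28746, 53646],
})

# Align the key dtype, then verify both sides agree before merging.
right["cve_ent"] = right["cve_ent"].astype("int64")
keys_match = left["cve_ent"].dtype == right["cve_ent"].dtype
print(keys_match)
```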


The week 1 analysis and the census dataframe are merged on the state code (`cve_ent`).

In [10]:
dfAll = pd.merge(df,dfWeek1,on=['cve_ent'])
dfAll.head()

Unnamed: 0,cve_ent,states,population_disability,population_limitation,population_mental_problem,population_no_problems,average_years_finish,no_med_insurance,med_insurance,average_household_size,...,state,population,total_cases,case_rate,total_cases_last_60_days,case_rate_last_60_days,total_deaths,death_rate,total_deaths_last_60_days,death_rate_last_60_days
0,1,Aguascalientes,71294,165482,20169,1177938,10.35,262088,1161139,3.68,...,AGUASCALIENTES,1434635,28746,2003.715231,2046,142.614672,2516,175.375618,43,2.997278
1,2,Baja California,151945,361269,52519,3213665,10.2,836317,2905265,3.26,...,BAJA CALIFORNIA,3634868,53646,1475.872026,3460,95.189151,8950,246.226273,195,5.364707
2,3,Baja California Sur,35383,90233,10423,663217,10.34,129270,664122,3.3,...,BAJA CALIFORNIA SUR,804708,51019,6340.063725,16314,2027.319226,2069,257.111896,580,72.075834
3,4,Campeche,52259,112956,12314,753271,9.63,203304,719677,3.55,...,CAMPECHE,1000617,15865,1585.521733,5095,509.185832,1516,151.50652,247,24.684769
4,5,Coahuila de Zaragoza,134816,302543,35073,2683785,10.43,597373,2540708,3.48,...,COAHUILA,3218720,75379,2341.893672,6121,190.168763,6592,204.801909,117,3.634985
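The merge above uses the default inner join, which silently drops any state present in only one of the two dataframes. One way to check for that, sketched here with invented frames, is to merge with `how="left"`, `indicator=True`, and `validate="one_to_one"`; this is a diagnostic variant, not the merge the notebook performs.

```python
import pandas as pd

# Toy census side with three states and toy week 1 side with two.
census = pd.DataFrame({"cve_ent": [1, 2, 3], "states": ["A", "B", "C"]})
week1 = pd.DataFrame({"cve_ent": [1, 2], "total_cases": [10, 20]})

# validate raises if a state code is duplicated on either side;
# indicator adds a _merge column telling where each row was found.
merged = pd.merge(census, week1, on="cve_ent", how="left",
                  validate="one_to_one", indicator=True)

# State codes that had no partner in the week 1 data.
unmatched = merged.loc[merged["_merge"] != "both", "cve_ent"].tolist()
print(unmatched)
```

An empty `unmatched` list means the inner join loses no states.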


Once the dataframes are merged, only the columns that can be normalized are selected. Each variable is then normalized by the total population of the state (the `population` column from the week 1 dataframe) or by its total number of households, giving the percentage of people or households that have the variable of interest.

In [11]:
dfAll = dfAll[['cve_ent','state','population','population_disability', 'population_limitation',
       'population_mental_problem','population_no_problems', 'average_years_finish', 'no_med_insurance',
       'med_insurance', 'average_household_size', 'case_rate', 
       'case_rate_last_60_days', 'death_rate',
       'death_rate_last_60_days','total_households','household_tics','household_no_tics',
        'population_0_14_years_old','population_15_64_years_old','population_65_more_years_old']].copy()
dfAll['pct_disability']=dfAll['population_disability']/dfAll['population']*100
dfAll['pct_limitation']=dfAll['population_limitation']/dfAll['population']*100
dfAll['pct_mental_problem']=dfAll['population_mental_problem']/dfAll['population']*100
dfAll['pct_no_problems']=dfAll['population_no_problems']/dfAll['population']*100
dfAll['pct_no_med_insurance']=dfAll['no_med_insurance']/dfAll['population']*100
dfAll['pct_med_insurance']=dfAll['med_insurance']/dfAll['population']*100
dfAll['pct_household_tics']=dfAll['household_tics']/dfAll['total_households']*100
dfAll['pct_household_no_tics']=dfAll['household_no_tics']/dfAll['total_households']*100
dfAll['pct_pop_0_14_years_old']=dfAll['population_0_14_years_old']/dfAll['population']*100
dfAll['pct_pop_15_64_years_old']=dfAll['population_15_64_years_old']/dfAll['population']*100
dfAll['pct_pop_65_more_years_old']=dfAll['population_65_more_years_old']/dfAll['population']*100
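Because each percentage follows the same pattern (count divided by a denominator, times 100), the computation can also be written as a loop over a mapping from count column to denominator column. This is a sketch on invented numbers; the `denominators` mapping and the naming rule for the `pct_` columns are illustrative choices, not part of the notebook.

```python
import pandas as pd

# Toy state-level counts; values are invented.
toy = pd.DataFrame({
    "population": [1000, 2000],
    "total_households": [250, 400],
    "population_disability": [50, 80],
    "household_no_tics": [5, 40],
})

# Each count column is paired with the column it is normalized by.
denominators = {
    "population_disability": "population",
    "household_no_tics": "total_households",
}

for column, denominator in denominators.items():
    # Drop the "population_" prefix so names resemble pct_disability.
    short_name = column.replace("population_", "")
    toy[f"pct_{short_name}"] = toy[column] / toy[denominator] * 100

print(toy[["pct_disability", "pct_household_no_tics"]])
```

Keeping the denominator next to each variable makes it harder to divide a household count by a population total by mistake.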

Finally, the state codes and the variables of interest are selected from the dataframe for storage.

In [12]:
dfFinal = dfAll[['cve_ent','state','case_rate','case_rate_last_60_days', 'death_rate',
        'death_rate_last_60_days','population','pct_disability',
        'pct_limitation','pct_mental_problem', 'pct_no_problems' ,'average_years_finish',
        'pct_no_med_insurance','pct_med_insurance', 'average_household_size',
        'pct_household_tics','pct_household_no_tics','pct_pop_0_14_years_old',
        'pct_pop_15_64_years_old','pct_pop_65_more_years_old']].copy()
dfFinal

Unnamed: 0,cve_ent,state,case_rate,case_rate_last_60_days,death_rate,death_rate_last_60_days,population,pct_disability,pct_limitation,pct_mental_problem,pct_no_problems,average_years_finish,pct_no_med_insurance,pct_med_insurance,average_household_size,pct_household_tics,pct_household_no_tics,pct_pop_0_14_years_old,pct_pop_15_64_years_old,pct_pop_65_more_years_old
0,1,AGUASCALIENTES,2003.715231,142.614672,175.375618,2.997278,1434635,4.969487,11.534781,1.405863,82.107156,10.35,18.268619,80.936196,3.68,99.557246,0.442754,26.849687,65.64973,6.766181
1,2,BAJA CALIFORNIA,1475.872026,95.189151,246.226273,5.364707,3634868,4.180207,9.938985,1.444867,88.412151,10.2,23.008181,79.927662,3.26,99.165994,0.834006,24.163271,72.553336,6.747975
2,3,BAJA CALIFORNIA SUR,6340.063725,2027.319226,257.111896,72.075834,804708,4.396999,11.213136,1.295252,82.4171,10.34,16.064212,82.529564,3.3,98.915448,1.084552,24.487516,68.228351,6.028646
3,4,CAMPECHE,1585.521733,509.185832,151.50652,24.684769,1000617,5.222678,11.288635,1.230641,75.280652,9.63,20.317864,71.923323,3.55,95.38671,4.61329,24.096932,61.273494,6.979494
4,5,COAHUILA,2341.893672,190.168763,204.801909,3.634985,3218720,4.188497,9.399482,1.089657,83.380505,10.43,18.559334,78.935353,3.48,99.361293,0.638707,25.608938,64.831517,7.132866
5,6,COLIMA,2408.320417,860.214506,168.247463,14.137372,785153,5.079265,11.122036,1.277713,76.114337,10.05,15.675161,77.175659,3.21,98.810684,1.189316,22.054682,62.982629,7.866238
6,7,CHIAPAS,291.063382,83.868974,32.092185,3.106258,5730367,3.976674,7.333928,0.929417,84.528565,7.78,31.66956,64.544958,4.09,88.774803,11.225197,30.8894,59.278524,6.090901
7,8,CHIHUAHUA,1608.791507,93.121455,202.262957,3.20927,3801487,4.413747,10.381148,1.296835,82.710476,10.0,15.102196,83.027878,3.25,98.241618,1.758382,24.788274,66.104317,7.338181
8,9,DISTRITO FEDERAL,9118.176844,1678.711159,402.388607,17.563614,9018645,5.472984,12.653985,1.612526,83.044837,11.48,27.751275,74.168703,3.32,99.618041,0.381959,18.326179,72.354993,11.333244
9,10,DURANGO,2113.969211,266.827751,137.988524,3.638317,1868996,5.454961,11.655991,1.202624,80.213173,9.75,24.68673,73.122949,3.69,97.479026,2.520974,27.49626,62.741119,7.648063


The dataframe is saved to a CSV file.

In [13]:
dfFinal.to_csv('../data/week3_variables_states.csv',index=False)
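A quick round trip can confirm that the saved file reads back with the same columns and that `index=False` kept the row index out of the file. The sketch below writes to an in-memory buffer instead of a path so it has no side effects; the toy values are invented.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "cve_ent": [1, 2],
    "state": ["AGUASCALIENTES", "BAJA CALIFORNIA"],
    "pct_disability": [4.97, 4.18],
})

# Write without the index, then read the same text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The column list should be unchanged, with no extra index column.
same_columns = list(restored.columns) == list(toy.columns)
print(same_columns)
```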