# MEX INEGI 2020 Population and Housing Census
This dataset contains the results of the Population and Housing Census conducted in Mexico in 2020.

### Goal:
* Fetch selected columns from the CSV file, preprocess the data, and export the result as a new CSV file to be inserted into a SQL Server table (population).

In [1]:
# Import the relevant libraries
import pandas as pd
import numpy as np
import unicodedata
import re

In [2]:
"""
Read the CSV file downloaded from the INEGI website and display its first rows
* URL: https://www.inegi.org.mx/datosabiertos/

The route to access the data is the following:

Informacion Demografica y Social > Censos y Conteos > Censos y Conteos de Poblacion y vivienda >
2020 > Principales resultados por localidad (ITER) > Estados Unidos Mexicanos
"""
raw_data = pd.read_csv('conjunto_de_datos_iter_00CSV20.csv', low_memory=False)
df = raw_data.copy()
df.head()

| | ENTIDAD | NOM_ENT | MUN | NOM_MUN | LOC | NOM_LOC | LONGITUD | LATITUD | ALTITUD | POBTOT | ... | VPH_CEL | VPH_INTER | VPH_STVP | VPH_SPMVPI | VPH_CVJ | VPH_SINRTV | VPH_SINLTC | VPH_SINCINT | VPH_SINTIC | TAMLOC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Total nacional | 0 | Total nacional | 0 | Total nacional | | | | 126014024 | ... | 30775898 | 18307193 | 15211306 | 6616141 | 4047100 | 1788552 | 3170894 | 15108204 | 852871 | * |
| 1 | 0 | Total nacional | 0 | Total nacional | 9998 | Localidades de una vivienda | | | | 250354 | ... | 47005 | 8385 | 18981 | 1732 | 1113 | 12775 | 14143 | 51293 | 7154 | * |
| 2 | 0 | Total nacional | 0 | Total nacional | 9999 | Localidades de dos viviendas | | | | 147125 | ... | 25581 | 5027 | 11306 | 971 | 708 | 8247 | 10065 | 29741 | 5283 | * |
| 3 | 1 | Aguascalientes | 0 | Total de la entidad Aguascalientes | 0 | Total de la Entidad | | | | 1425607 | ... | 359895 | 236003 | 174089 | 98724 | 70126 | 6021 | 15323 | 128996 | 1711 | * |
| 4 | 1 | Aguascalientes | 0 | Total de la entidad Aguascalientes | 9998 | Localidades de una vivienda | | | | 3697 | ... | 732 | 205 | 212 | 48 | 41 | 39 | 62 | 530 | 20 | * |


In [3]:
# Display the number of rows and columns of the dataframe
nr_rows = df.shape[0]
nr_col = df.shape[1]
print(f'There are {nr_rows} rows and {nr_col} columns in the dataframe.')

There are 195662 rows and 286 columns in the dataframe.
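
Because only a small share of the 286 columns is needed, one option is to restrict parsing at read time instead of loading everything. The sketch below is an illustration rather than the approach used in this notebook: it relies on the `usecols` and `na_values` parameters of `pd.read_csv` and runs on a small in-memory sample. The sample's reduced column set and its third row are made up for the example.

```python
import io

import pandas as pd

# Tiny stand-in for the INEGI file (the real one has 286 columns).
# The third row is invented to show a placeholder in a numeric column.
sample_csv = io.StringIO(
    "ENTIDAD,NOM_ENT,MUN,LOC,POBTOT,GRAPROES,TAMLOC\n"
    "0,Total nacional,0,0,126014024,9.74,*\n"
    "0,Total nacional,0,9998,250354,6.5,*\n"
    "1,Aguascalientes,1,1,100,*,1\n"
)

wanted = ["ENTIDAD", "MUN", "LOC", "POBTOT", "GRAPROES"]

# usecols limits parsing to the listed columns;
# na_values converts the '*' and 'N/D' placeholders to NaN while reading.
subset = pd.read_csv(sample_csv, usecols=wanted, na_values=["*", "N/D"])

print(subset.dtypes)
```

With the placeholders handled at read time, numeric columns are parsed as numbers directly, which would make a later `replace` step unnecessary for the selected columns.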


### Create Fourth DataFrame ("education" table)

In this section we select the columns that make up the "education" dataframe. The dataframe is exported as a separate CSV file and later loaded into a SQL Server table.
It contains data on the education level of the Mexican population.

In [4]:
# Select the relevant columns
edu_columns = (
    'ENTIDAD', 'MUN', 'LOC',
    'P15YM_AN', 'P15YM_SE', 'P15PRI_IN', 'P15PRI_CO',
    'P15SEC_IN', 'P15SEC_CO', 'P18YM_PB', 'GRAPROES',
    'GRAPROES_F', 'GRAPROES_M'
)
    

# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df_edu = df[list(edu_columns)].copy()
df_edu.head()

| | ENTIDAD | MUN | LOC | P15YM_AN | P15YM_SE | P15PRI_IN | P15PRI_CO | P15SEC_IN | P15SEC_CO | P18YM_PB | GRAPROES | GRAPROES_F | GRAPROES_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 4456431 | 4841952 | 7731820 | 12325433 | 2913915 | 22833912 | 39977750 | 9.74 | 9.64 | 9.84 |
| 1 | 0 | 0 | 9998 | 24331 | 28014 | 38545 | 38809 | 14833 | 40180 | 33907 | 6.5 | 6.51 | 6.5 |
| 2 | 0 | 0 | 9999 | 15092 | 17660 | 21620 | 22033 | 5040 | 24056 | 19102 | 6.45 | 6.48 | 6.43 |
| 3 | 1 | 0 | 0 | 21908 | 25567 | 65609 | 122405 | 30347 | 288036 | 467249 | 10.35 | 10.32 | 10.38 |
| 4 | 1 | 0 | 9998 | 180 | 209 | 378 | 470 | 157 | 832 | 722 | 8.14 | 8.2 | 8.11 |


In [5]:
# Replace placeholders with NaN (interpreted as NULL later)
df_edu = df_edu.replace(['*', 'N/D'], np.nan)

int_columns = ['ENTIDAD', 'MUN', 'LOC', 'P15YM_AN', 
               'P15YM_SE', 'P15PRI_IN', 'P15PRI_CO',
               'P15SEC_IN', 'P15SEC_CO', 'P18YM_PB']

for col in int_columns:
    df_edu[col] = df_edu[col].astype(pd.Int64Dtype())

float_columns = ['GRAPROES', 'GRAPROES_F', 'GRAPROES_M']

for col in float_columns:
    df_edu[col] = df_edu[col].astype(float)
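
Columns that contained placeholders are read as text, and how `astype` handles numeric strings can differ between pandas versions. A more defensive variant, shown here only as a sketch on a hypothetical three-row sample, parses the text with `pd.to_numeric` before casting to the nullable `Int64` dtype. Note that `errors="coerce"` turns every unparseable value into a missing value, not just `*` and `N/D`, so it can hide unexpected tokens.

```python
import pandas as pd

# Hypothetical sample mimicking raw columns: numbers stored as text plus placeholders.
sample = pd.DataFrame({
    "P15YM_AN": ["180", "*", "24331"],
    "GRAPROES": ["8.14", "N/D", "6.5"],
})

# to_numeric parses the strings; anything unparseable becomes NaN.
# The float result is then cast to the nullable integer dtype.
sample["P15YM_AN"] = pd.to_numeric(sample["P15YM_AN"], errors="coerce").astype("Int64")

# Averages stay as ordinary floats, where NaN marks the missing value.
sample["GRAPROES"] = pd.to_numeric(sample["GRAPROES"], errors="coerce").astype(float)

print(sample.dtypes)
```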

In [6]:
# Double check the datatypes
df_edu.dtypes

ENTIDAD         Int64
MUN             Int64
LOC             Int64
P15YM_AN        Int64
P15YM_SE        Int64
P15PRI_IN       Int64
P15PRI_CO       Int64
P15SEC_IN       Int64
P15SEC_CO       Int64
P18YM_PB        Int64
GRAPROES      float64
GRAPROES_F    float64
GRAPROES_M    float64
dtype: object

In [7]:
# Export the dataframe as a new .csv file
df_edu.to_csv('education.csv', index=False, na_rep='', encoding='utf-8')
print("Successfully exported the dataframe as 'education.csv'")

Successfully exported the dataframe as 'education.csv'
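
Before loading the file into SQL Server, it can be worth checking that missing values are written as empty fields and that the integer columns survive a round trip. The following is a minimal sketch using a hypothetical two-row frame and a temporary file (`education_check.csv` is an invented name); it mirrors the `to_csv` arguments used above but does not touch `education.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature of df_edu with a missing value in each kind of column.
mini = pd.DataFrame({
    "ENTIDAD": pd.array([0, 1], dtype="Int64"),
    "P15YM_AN": pd.array([4456431, None], dtype="Int64"),
    "GRAPROES": [9.74, float("nan")],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "education_check.csv")
    mini.to_csv(path, index=False, na_rep="", encoding="utf-8")

    # Inspect the raw text: missing values should show up as empty fields.
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

    # Read the file back, asking for nullable integers again.
    restored = pd.read_csv(path, dtype={"ENTIDAD": "Int64", "P15YM_AN": "Int64"})

print(lines)
print(restored.dtypes)
```

Empty fields are a common way to represent NULL in a CSV file, but whether they are loaded as NULL depends on the import tool and its settings.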
