# MEX INEGI 2020 Population and Housing Census
This dataset contains the results of the population and housing census conducted in Mexico in 2020.

### Goal:
* Fetch selected columns from the CSV file, preprocess the data, and export the result as a new CSV file to be loaded into a SQL Server table (population).

In [1]:
# Import the relevant libraries
import pandas as pd
import numpy as np
import unicodedata
import re

In [2]:
"""
Read the CSV file downloaded from the INEGI website and display its first rows
* URL: https://www.inegi.org.mx/datosabiertos/

The path to access the data is the following:

Informacion Demografica y Social > Censos y Conteos > Censos y Conteos de Poblacion y vivienda >
2020 > Principales resultados por localidad (ITER) > Estados Unidos Mexicanos
"""

raw_data = pd.read_csv('conjunto_de_datos_iter_00CSV20.csv', low_memory=False)
df = raw_data.copy()
df.head()

Unnamed: 0,ENTIDAD,NOM_ENT,MUN,NOM_MUN,LOC,NOM_LOC,LONGITUD,LATITUD,ALTITUD,POBTOT,...,VPH_CEL,VPH_INTER,VPH_STVP,VPH_SPMVPI,VPH_CVJ,VPH_SINRTV,VPH_SINLTC,VPH_SINCINT,VPH_SINTIC,TAMLOC
0,0,Total nacional,0,Total nacional,0,Total nacional,,,,126014024,...,30775898,18307193,15211306,6616141,4047100,1788552,3170894,15108204,852871,*
1,0,Total nacional,0,Total nacional,9998,Localidades de una vivienda,,,,250354,...,47005,8385,18981,1732,1113,12775,14143,51293,7154,*
2,0,Total nacional,0,Total nacional,9999,Localidades de dos viviendas,,,,147125,...,25581,5027,11306,971,708,8247,10065,29741,5283,*
3,1,Aguascalientes,0,Total de la entidad Aguascalientes,0,Total de la Entidad,,,,1425607,...,359895,236003,174089,98724,70126,6021,15323,128996,1711,*
4,1,Aguascalientes,0,Total de la entidad Aguascalientes,9998,Localidades de una vivienda,,,,3697,...,732,205,212,48,41,39,62,530,20,*


In [3]:
# Display the number of rows and columns in the dataframe
nr_rows, nr_col = df.shape
print(f'There are {nr_rows} rows and {nr_col} columns in the dataframe.')

There are 195662 rows and 286 columns in the dataframe.


In [4]:
# Get a list of the column names to select the ones we need. 
# list(df.columns)

### Create Third DataFrame ("ind_population" table)

In this section we select the columns that make up the "ind_population" dataframe. This dataframe will be exported as a separate CSV file and later loaded into a SQL Server table.
It will contain data about the distribution of the indigenous and Afro-Mexican population in Mexico.

In [5]:
# Select the relevant columns
df_ind_population = df[['ENTIDAD','MUN','LOC', 'PHOG_IND',
                        'P3YM_HLI', 'P3HLINHE', 'P3HLI_HE',
                        'P5_HLI','P5_HLI_NHE','P5_HLI_HE']]

df_ind_population.head()

Unnamed: 0,ENTIDAD,MUN,LOC,PHOG_IND,P3YM_HLI,P3HLINHE,P3HLI_HE,P5_HLI,P5_HLI_NHE,P5_HLI_HE
0,0,0,0,11800247,7364645,865972,6423548,7177185,785361,6317027
1,0,0,9998,27252,26486,2712,22906,25743,2412,22464
2,0,0,9999,21531,18640,3098,14758,17992,2742,14467
3,1,0,0,5552,2539,25,2461,2508,22,2437
4,1,0,9998,5,20,0,15,20,0,15
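
Because only a few of the file's columns end up in this dataframe, the selection could also be made while reading the file. The sketch below illustrates the idea with `pd.read_csv` and its `usecols` and `na_values` parameters. It runs on a tiny in-memory stand-in with made-up values, not the real ITER file, and the shortened `wanted` list is only for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for the ITER file:
# real column names, made-up values.
sample_csv = io.StringIO(
    "ENTIDAD,NOM_ENT,MUN,LOC,PHOG_IND,P3YM_HLI,TAMLOC\n"
    "1,Aguascalientes,1,1,120,45,5\n"
    "1,Aguascalientes,1,2,*,*,1\n"
)

# Shortened illustration of the column list used in the notebook.
wanted = ["ENTIDAD", "MUN", "LOC", "PHOG_IND", "P3YM_HLI"]

# usecols restricts parsing to the listed columns; na_values turns the
# '*' and 'N/D' placeholders into NaN while the file is being read.
subset = pd.read_csv(sample_csv, usecols=wanted, na_values=["*", "N/D"])

print(subset.columns.tolist())
print(subset.isna().sum())
```

Reading only the needed columns avoids holding the full raw dataframe in memory, at the cost of no longer having `raw_data` available for other tables.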


In [6]:
# Replace placeholders with NaN (interpreted as NULL later)
df_ind_population = df_ind_population.replace(['*', 'N/D'], np.nan)

# Convert relevant population columns to nullable Int64
int_columns = ['ENTIDAD','MUN','LOC', 'PHOG_IND',
               'P3YM_HLI', 'P3HLINHE', 'P3HLI_HE',
               'P5_HLI','P5_HLI_NHE','P5_HLI_HE']

for col in int_columns:
    df_ind_population[col] = df_ind_population[col].astype(pd.Int64Dtype())
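
The conversion above relies on `astype` accepting every remaining value once `*` and `N/D` have been replaced. A more defensive variant, sketched here on made-up data, parses each column with `pd.to_numeric(errors="coerce")` before casting to `Int64`. The helper name `to_nullable_int` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Made-up values that mimic object columns read from the raw file.
raw = pd.DataFrame({
    "PHOG_IND": ["5552", "*", "5", "N/D"],
    "P5_HLI": ["2508", "20", "*", "7"],
})

def to_nullable_int(series: pd.Series) -> pd.Series:
    """Hypothetical helper: parse to numbers; unparseable entries become <NA>."""
    return pd.to_numeric(series, errors="coerce").astype("Int64")

# apply() calls the helper once per column and reassembles a dataframe.
clean = raw.apply(to_nullable_int)

print(clean.dtypes)
print(clean)
```

Note that `errors="coerce"` turns any unparseable string into a missing value, not only the two known placeholders, so an unexpected marker would be silently nulled instead of raising an error.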

In [7]:
# View column data types
df_ind_population.dtypes

ENTIDAD       Int64
MUN           Int64
LOC           Int64
PHOG_IND      Int64
P3YM_HLI      Int64
P3HLINHE      Int64
P3HLI_HE      Int64
P5_HLI        Int64
P5_HLI_NHE    Int64
P5_HLI_HE     Int64
dtype: object
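
Before exporting, it can help to check how many values became missing and whether the identifying columns are unique. The sketch below uses a small made-up dataframe and assumes that the combination (`ENTIDAD`, `MUN`, `LOC`) identifies a row. That assumption is not stated in the notebook and should be confirmed against the INEGI data dictionary.

```python
import pandas as pd

# Made-up rows with the same nullable integer dtype as the notebook's dataframe.
demo = pd.DataFrame({
    "ENTIDAD": pd.array([1, 1, 1], dtype="Int64"),
    "MUN": pd.array([0, 0, 1], dtype="Int64"),
    "LOC": pd.array([0, 9998, 1], dtype="Int64"),
    "PHOG_IND": pd.array([5552, 5, None], dtype="Int64"),
})

# Number of missing values per column (these would become NULL in the table).
null_counts = demo.isna().sum()

# Number of rows that repeat an earlier (ENTIDAD, MUN, LOC) combination.
# Assumption: these three columns together form the row key.
duplicate_keys = int(demo.duplicated(subset=["ENTIDAD", "MUN", "LOC"]).sum())

print(null_counts)
print(f"Duplicated keys: {duplicate_keys}")
```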

In [8]:
# Export the dataframe as a new .csv file
df_ind_population.to_csv('ind_population.csv', index=False, na_rep='', encoding='utf-8')
print("Successfully exported the dataframe as 'ind_population.csv'")

Successfully exported the dataframe as 'ind_population.csv'
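
A quick way to confirm that the export preserves missing values is to read the output back and compare. The sketch below does this round trip through an in-memory buffer with made-up data, so it does not touch `ind_population.csv`. Whether the empty fields end up as NULL in SQL Server depends on the import tool and its settings, which this example does not cover.

```python
import io
import pandas as pd

# Made-up data with one missing value, using the nullable Int64 dtype.
demo = pd.DataFrame({
    "ENTIDAD": pd.array([1, 1], dtype="Int64"),
    "PHOG_IND": pd.array([5552, None], dtype="Int64"),
})

# Write with the same options as the notebook's export, but to a buffer.
buffer = io.StringIO()
demo.to_csv(buffer, index=False, na_rep="")
buffer.seek(0)

# Read back, asking pandas to use the nullable integer dtype for every column.
restored = pd.read_csv(buffer, dtype="Int64")

print(restored.dtypes)
print(restored)
```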
