# MEX INEGI 2020 Population and Housing Census
This dataset contains the results of the population and housing census conducted in Mexico in 2020.

### Goal:
* Fetch selected columns from the CSV file, preprocess the data, and export the result as a new CSV file to be loaded into a SQL Server table (population).

In [1]:
# Import the relevant libraries
import pandas as pd
import numpy as np
import unicodedata
import re

In [2]:
"""
Read the CSV file downloaded from the INEGI website and display its first rows
* URL: https://www.inegi.org.mx/datosabiertos/

The path to access the data on the site is the following:

Informacion Demografica y Social > Censos y Conteos > Censos y Conteos de Poblacion y vivienda >
2020 > Principales resultados por localidad (ITER) > Estados Unidos Mexicanos
"""

raw_data = pd.read_csv('conjunto_de_datos_iter_00CSV20.csv', low_memory=False)
df = raw_data.copy()
df.head()

Unnamed: 0,ENTIDAD,NOM_ENT,MUN,NOM_MUN,LOC,NOM_LOC,LONGITUD,LATITUD,ALTITUD,POBTOT,...,VPH_CEL,VPH_INTER,VPH_STVP,VPH_SPMVPI,VPH_CVJ,VPH_SINRTV,VPH_SINLTC,VPH_SINCINT,VPH_SINTIC,TAMLOC
0,0,Total nacional,0,Total nacional,0,Total nacional,,,,126014024,...,30775898,18307193,15211306,6616141,4047100,1788552,3170894,15108204,852871,*
1,0,Total nacional,0,Total nacional,9998,Localidades de una vivienda,,,,250354,...,47005,8385,18981,1732,1113,12775,14143,51293,7154,*
2,0,Total nacional,0,Total nacional,9999,Localidades de dos viviendas,,,,147125,...,25581,5027,11306,971,708,8247,10065,29741,5283,*
3,1,Aguascalientes,0,Total de la entidad Aguascalientes,0,Total de la Entidad,,,,1425607,...,359895,236003,174089,98724,70126,6021,15323,128996,1711,*
4,1,Aguascalientes,0,Total de la entidad Aguascalientes,9998,Localidades de una vivienda,,,,3697,...,732,205,212,48,41,39,62,530,20,*


In [3]:
# Display the number of rows and columns of the dataframe
nr_rows, nr_col = df.shape
print(f'There are {nr_rows} rows and {nr_col} columns in the dataframe.')

There are 195662 rows and 286 columns in the dataframe.


In [4]:
# Get a list of the column names to select the ones we need. 
# list(df.columns)

### Create Second DataFrame ("population" table)

In this section we select the columns that make up the "population" dataframe, which holds data about the general population distribution in Mexico. This dataframe is exported as a separate CSV file and later loaded into a SQL Server table.

In [5]:
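# Select the relevant columns; .copy() gives an independent dataframe so that
# later column assignments do not trigger pandas' SettingWithCopyWarning
df_population = df[['ENTIDAD','MUN','LOC','POBTOT','POBFEM','POBMAS', 
                    'P_0A2', 'P_0A4', 'P_3YMAS', 'P_5YMAS', 'P_15YMAS', 'P_18YMAS']].copy()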

df_population.head()

Unnamed: 0,ENTIDAD,MUN,LOC,POBTOT,POBFEM,POBMAS,P_0A2,P_0A4,P_3YMAS,P_5YMAS,P_15YMAS,P_18YMAS
0,0,0,0,126014024,64540634,61473390,5764054,10047365,119976584,115693273,93985354,87492680
1,0,0,9998,250354,96869,153485,10493,17848,239441,232086,197411,186968
2,0,0,9999,147125,61324,85801,6798,11527,139757,135028,111530,104612
3,1,0,0,1425607,728924,696683,71864,124430,1352235,1299669,1038904,960764
4,1,0,9998,3697,1510,2187,165,277,3532,3420,2836,2609
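
Since only a handful of the 286 columns are needed, one possible alternative is to let `pd.read_csv` load just those columns and mark the placeholders as missing while parsing. The sketch below is not part of the original workflow: it runs on a small in-memory stand-in for the INEGI file, the third data row is hypothetical, and the name `POPULATION_COLUMNS` is introduced here only for illustration.

```python
import io
import pandas as pd

# Subset of the columns selected above (shortened to keep the example small)
POPULATION_COLUMNS = ['ENTIDAD', 'MUN', 'LOC', 'POBTOT', 'POBFEM', 'POBMAS']

# In-memory stand-in for the INEGI file. The first two rows reuse values
# shown in the outputs above; the third row is hypothetical.
csv_text = (
    "ENTIDAD,NOM_ENT,MUN,LOC,POBTOT,POBFEM,POBMAS,TAMLOC\n"
    "0,Total nacional,0,0,126014024,64540634,61473390,*\n"
    "1,Aguascalientes,0,9998,3697,1510,2187,*\n"
    "1,Aguascalientes,1,42,5,*,*,1\n"
)

# usecols skips every column not listed; na_values turns the
# '*' and 'N/D' placeholders into NaN at parse time.
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=POPULATION_COLUMNS,
    na_values=['*', 'N/D'],
)

print(subset.shape)
print(subset.isna().sum())
```

With the real file, the `io.StringIO(...)` argument would be replaced by the CSV path. Columns that contain missing values come back as `float64`, so a conversion to a nullable integer type is still needed afterwards.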


In [6]:
# Replace placeholders with NaN (interpreted as NULL later)
df_population = df_population.replace(['*', 'N/D'], np.nan)

# Convert relevant population columns to nullable Int64
int_columns = ['ENTIDAD', 'MUN', 'LOC', 'POBTOT', 'POBFEM', 'POBMAS',
               'P_0A2', 'P_0A4', 'P_3YMAS', 'P_5YMAS', 'P_15YMAS', 'P_18YMAS']

for col in int_columns:
    df_population[col] = df_population[col].astype('Int64')

In [7]:
# View column data types
df_population.dtypes

ENTIDAD     Int64
MUN         Int64
LOC         Int64
POBTOT      Int64
POBFEM      Int64
POBMAS      Int64
P_0A2       Int64
P_0A4       Int64
P_3YMAS     Int64
P_5YMAS     Int64
P_15YMAS    Int64
P_18YMAS    Int64
dtype: object
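
Columns that held placeholders are read as text, and casting text straight to `Int64` may behave differently across pandas versions. A more defensive variant, sketched here under that assumption, goes through `pd.to_numeric` first. The helper name `to_nullable_int` and the demo values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def to_nullable_int(frame, columns):
    """Return a copy of `frame` with `columns` coerced to nullable Int64."""
    result = frame.copy()
    for col in columns:
        # errors='coerce' maps anything non-numeric ('*', 'N/D', ...) to NaN
        numeric = pd.to_numeric(result[col], errors='coerce')
        # Int64 (capital I) is pandas' nullable integer dtype; NaN becomes <NA>
        result[col] = numeric.astype('Int64')
    return result

# Hypothetical demo data: counts stored as text with placeholders mixed in
demo = pd.DataFrame({
    'POBTOT': ['3697', '12', '7'],
    'POBFEM': ['1510', '*', 'N/D'],
})

clean = to_nullable_int(demo, ['POBTOT', 'POBFEM'])
print(clean.dtypes)
print(clean)
```

Note that `errors='coerce'` silently turns every unparseable value into a missing one, so it is worth comparing the count of missing values before and after the conversion.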

In [8]:
# Export the dataframe as a new .csv file
df_population.to_csv('population.csv', index=False, na_rep='', encoding='utf-8')
print("Successfully exported the dataframe as 'population.csv'")

Successfully exported the dataframe as 'population.csv'
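
Before loading the file into SQL Server, it can help to confirm that the export round-trips cleanly and that the totals are internally consistent. The following is a minimal sketch of such a check, not part of the original notebook: it uses an in-memory buffer instead of `population.csv`, three rows taken from the `df_population.head()` output, and one hypothetical row with missing values.

```python
import io
import pandas as pd

# First three rows reuse values from the head() output above;
# the last row is hypothetical and simulates masked counts.
sample = pd.DataFrame({
    'ENTIDAD': pd.array([0, 0, 1, 1], dtype='Int64'),
    'LOC':     pd.array([0, 9998, 0, 42], dtype='Int64'),
    'POBTOT':  pd.array([126014024, 250354, 1425607, 5], dtype='Int64'),
    'POBFEM':  pd.array([64540634, 96869, 728924, None], dtype='Int64'),
    'POBMAS':  pd.array([61473390, 153485, 696683, None], dtype='Int64'),
})

# Export with the same options as the notebook, then read the result back
buffer = io.StringIO()
sample.to_csv(buffer, index=False, na_rep='')
buffer.seek(0)
restored = pd.read_csv(buffer).astype('Int64')

# Raises AssertionError if values or dtypes changed in the round trip
pd.testing.assert_frame_equal(restored, sample)

# Where all three counts are present, POBFEM + POBMAS should equal POBTOT
complete = restored.dropna(subset=['POBTOT', 'POBFEM', 'POBMAS'])
mismatch = complete[complete['POBFEM'] + complete['POBMAS'] != complete['POBTOT']]
print(f'{len(mismatch)} inconsistent rows out of {len(complete)} complete rows')
```

Rows with missing values are skipped rather than flagged, since the placeholders in the source file mean the counts are unavailable, not zero.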
