# Data Types

In this notebook we correct the data types and write new .csv and Parquet files to data\interim. We then run a YData Profiling analysis to look for missing data and anomalies.

We import the libraries that help us in this phase.

In [1]:
import pandas as pd
import os
import datetime
import numpy as np
from ydata_profiling import ProfileReport

We start by loading the dataset that has the concatenated data and a column with the state name.

In [26]:
path = os.path.join('..', 'data', 'interim', 'Ensanut-data-p.csv')  # Path to the input file

df = pd.read_csv(path, low_memory=False)  # Load the DataFrame

Because the file is large, pandas may not have inferred the correct data type for every column, so we check the column dtypes:

In [27]:
print(df.dtypes)

Folio                           object
Edad                             int64
Sexo                             int64
C_Entidad                        int64
Entidad                         object
Fecha                           object
Atentar_contras_si              object
Depresion                       object
Tristeza                        object
Cuantos cigarrillos (numero)    object
Frecuencia emborrachar          object
dtype: object
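
To see why a numeric-looking column ends up as `object`, it can help to list the entries that are not plain digits. The sketch below is only an illustration: it uses a small made-up frame rather than the ENSANUT file, and the blank-string entries are an assumption about what such a column may contain.

```python
import pandas as pd

# Hypothetical sample, not the real data: a survey column stored as text,
# with blank strings and one truly missing entry
sample = pd.DataFrame({"Tristeza": ["1", " ", "2", None, "3", " "]})
col = sample["Tristeza"]

# True where the entry consists only of digits once surrounding spaces are removed
is_digit = col.str.strip().str.fullmatch(r"\d+", na=False)

# Entries that are present but are not plain integers
offending = col[col.notna() & ~is_digit]
counts = offending.value_counts()
print(counts)
```

Running the same check on a real column gives a quick list of the placeholder values that need to be replaced before converting.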


We proceed to set the correct data type:

In [28]:
# Treat empty and blank strings as missing values
df.replace(["", " "], np.nan, inplace=True)

# Set the data type of each column
df['Edad'] = df['Edad'].astype('int64')    # Make sure it is an integer
df['Sexo'] = df['Sexo'].astype('category')  # Convert to category
df['C_Entidad'] = df['C_Entidad'].astype('category')  # Convert to category
df['Fecha'] = pd.to_datetime(df['Fecha'], errors='coerce')  # Convert to datetime; unparseable values become NaT
df['Atentar_contras_si'] = df['Atentar_contras_si'].astype('category')  # Convert to category
df['Depresion'] = df['Depresion'].astype('category')  # Convert to category
df['Tristeza'] = df['Tristeza'].astype('category')  # Convert to category
df['Cuantos cigarrillos (numero)'] = pd.to_numeric(df['Cuantos cigarrillos (numero)'], errors='coerce').astype('Int64')  # Convert to nullable integer (missing values allowed)
df['Frecuencia emborrachar'] = df['Frecuencia emborrachar'].astype('category')  # Convert to category

# Check the data types after the conversion
print(df.dtypes)

Folio                                   object
Edad                                     int64
Sexo                                  category
C_Entidad                             category
Entidad                                 object
Fecha                           datetime64[ns]
Atentar_contras_si                    category
Depresion                             category
Tristeza                              category
Cuantos cigarrillos (numero)             Int64
Frecuencia emborrachar                category
dtype: object
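
Because `errors='coerce'` silently turns unparseable entries into missing values, it may be worth counting how many entries each conversion discards. This is a minimal sketch on made-up values (the strings `"not a date"` and `"ninguno"` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical raw values, for illustration only
sample = pd.DataFrame({
    "Fecha": ["2022-08-15", "2022-09-01", "not a date", None],
    "Cuantos cigarrillos (numero)": ["3", "ninguno", "10", None],
})

fecha = pd.to_datetime(sample["Fecha"], errors="coerce")
cigs = pd.to_numeric(
    sample["Cuantos cigarrillos (numero)"], errors="coerce"
).astype("Int64")

# Entries that were present in the raw column but are missing after coercion
lost_fecha = int((fecha.isna() & sample["Fecha"].notna()).sum())
lost_cigs = int((cigs.isna() & sample["Cuantos cigarrillos (numero)"].notna()).sum())
print(f"Fecha: {lost_fecha} value(s) coerced, cigarrillos: {lost_cigs} value(s) coerced")
```

A nonzero count is not necessarily a problem, but it shows which raw values were dropped rather than converted.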


We save the file as both CSV and Parquet. Parquet keeps the column dtypes (category, datetime, nullable integer), whereas CSV stores only text:

In [29]:
path_save = os.path.join('..','data', 'interim', 'Ensanut-data.csv')
path_save_parquet = os.path.join('..','data', 'interim', 'Ensanut-data.parquet')
df.to_csv(path_save, index=False)
df.to_parquet(path_save_parquet)
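
If the CSV copy is read back later, the dtypes have to be stated again. The following sketch illustrates this with a made-up two-row frame and an in-memory buffer instead of the real file; the `dtype` and `parse_dates` arguments are one possible way to restore the types, not necessarily what the rest of the project does.

```python
import io
import pandas as pd

# Hypothetical frame with the same kinds of dtypes as the converted data
sample = pd.DataFrame({
    "Sexo": pd.Series([1, 2], dtype="category"),
    "Fecha": pd.to_datetime(["2022-08-15", "2022-09-01"]),
    "Cuantos cigarrillos (numero)": pd.array([3, None], dtype="Int64"),
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Plain read: pandas re-infers the types from the text
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain.dtypes)

# Read again, stating the intended types explicitly
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    dtype={"Sexo": "category", "Cuantos cigarrillos (numero)": "Int64"},
    parse_dates=["Fecha"],
)
print(restored.dtypes)
```

Reading the Parquet file avoids this step, which is why the report below is built from it.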

We create a YData Profiling report for our file; the new report is saved in docs\docs.

In [30]:
ruta_output_y = os.path.join('..','docs', 'docs', 'interim-MH-data.html')
title = "ENSANUT YData Profiling Report"
# Build the YData Profiling report from the Parquet file, which keeps the dtypes
df = pd.read_parquet(path_save_parquet)
profile_ensa_ydata = ProfileReport(df, title=title, explorative=True, minimal=True)
profile_ensa_ydata.to_file(ruta_output_y)
print(f"YData report save: {ruta_output_y}")

YData report save: ..\docs\docs\interim-MH-data.html
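
As a quick cross-check of the missing-data figures in the HTML report, the counts can also be computed directly in pandas. This is a sketch on a made-up frame; the values are hypothetical and only the column names are reused from above.

```python
import pandas as pd

# Hypothetical sample with some missing entries
sample = pd.DataFrame({
    "Depresion": pd.Series(["1", None, "3", None], dtype="category"),
    "Cuantos cigarrillos (numero)": pd.array([3, None, 10, 5], dtype="Int64"),
})

# Number and percentage of missing values per column
summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
})
print(summary)
```

Applied to `df`, the same two lines give a per-column table that can be compared with the report.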
