# GENERAL DATA CLEANING
This notebook performs the first cleaning pass on the dataset, removing data that will not be useful for any of our data science tasks. The operations include deleting unneeded fields and rows and changing how some fields represent their values.

In [12]:
# Libraries
import numpy as np
import pandas as pd
import os

# 1 - Loading Data

In [9]:
COVID_URL = os.path.join('..', 'datasets', '211123COVID19MEXICO.csv')

..\datasets\211123COVID19MEXICO.csv


In [32]:
# Defining the type of each column reduces the memory used by the DataFrame
types = {
    'FECHA_ACTUALIZACION': 'object',
    'ID_REGISTRO': 'object',
    'ORIGEN': np.int8,
    'SECTOR': np.int8,
    'ENTIDAD_UM': np.int8,
    'SEXO': np.int8,
    'ENTIDAD_NAC': np.int8,
    'ENTIDAD_RES': np.int8,
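    'MUNICIPIO_RES': np.int8,  # NOTE: codes above 127 do not fit in int8 (see the -25 values in the tail output); the column is dropped later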
    'TIPO_PACIENTE': np.int8,
    'FECHA_INGRESO': 'object',
    'FECHA_SINTOMAS': 'object',
    'FECHA_DEF': 'object',
    'INTUBADO': np.int8,
    'NEUMONIA': np.int8,
    'EDAD': np.int8,
    'NACIONALIDAD': np.int8,
    'EMBARAZO': np.int8,
    'HABLA_LENGUA_INDIG': np.int8,
    'INDIGENA': np.int8,
    'DIABETES': np.int8,
    'EPOC': np.int8,
    'ASMA': np.int8,
    'INMUSUPR': np.int8,
    'HIPERTENSION':np.int8,
    'OTRA_COM': np.int8,
    'CARDIOVASCULAR': np.int8,
    'OBESIDAD': np.int8,
    'RENAL_CRONICA': np.int8,
    'TABAQUISMO': np.int8,
    'OTRO_CASO': np.int8,
    'TOMA_MUESTRA_LAB': np.int8,
    'RESULTADO_LAB': np.int8,
    'TOMA_MUESTRA_ANTIGENO': np.int8,
    'RESULTADO_ANTIGENO': np.int8,
    'CLASIFICACION_FINAL': np.int8,
    'MIGRANTE': np.int8,
    'PAIS_NACIONALIDAD': 'object',
    'PAIS_ORIGEN': 'object',
    'UCI': np.int8
}

In [33]:
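# encoding='latin' is used because the data has accents and other special characters.
# NOTE: values such as 'MÃ©xico' in the output below suggest the file may actually be UTF-8.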
df = pd.read_csv(COVID_URL, encoding='latin', dtype=types)
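To illustrate why the explicit `dtype` mapping matters, the following minimal sketch compares the memory footprint of the same small-integer values stored as `int64` (the type pandas usually infers for integer columns) and as `int8`. The toy data is hypothetical and not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one million catalog codes that all fit in int8
codes = np.random.default_rng(0).integers(1, 100, size=1_000_000)

as_int64 = pd.Series(codes, dtype='int64')
as_int8 = pd.Series(codes, dtype='int8')

# memory_usage(index=False) reports the bytes used by the values only
bytes_int64 = as_int64.memory_usage(index=False)
bytes_int8 = as_int8.memory_usage(index=False)
print(f'int64: {bytes_int64} bytes --- int8: {bytes_int8} bytes')

# int8 only holds values from -128 to 127; larger codes need a wider type
int8_min, int8_max = np.iinfo(np.int8).min, np.iinfo(np.int8).max
print(int8_min, int8_max)
```

The trade-off is range: any field whose codes can exceed 127 would need `int16` or wider.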

### General Information

#### Dataset Size

In [34]:
print(f'No. of rows: {df.shape[0]} --- No. of columns: {df.shape[1]}')

No. of rows: 11734369 --- No. of columns: 40


#### What does the DataFrame look like?

In [35]:
df.head(3)

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,...,OTRO_CASO,TOMA_MUESTRA_LAB,RESULTADO_LAB,TOMA_MUESTRA_ANTIGENO,RESULTADO_ANTIGENO,CLASIFICACION_FINAL,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI
0,2021-11-23,z54912,1,12,31,1,31,31,79,1,...,2,1,1,2,97,3,99,MÃ©xico,97,97
1,2021-11-23,z1e370,1,12,14,1,14,14,85,1,...,2,1,2,2,97,7,99,MÃ©xico,97,97
2,2021-11-23,z35a05,1,12,31,1,31,31,102,1,...,2,1,2,2,97,7,99,MÃ©xico,97,97


In [36]:
df.tail(3)

Unnamed: 0,FECHA_ACTUALIZACION,ID_REGISTRO,ORIGEN,SECTOR,ENTIDAD_UM,SEXO,ENTIDAD_NAC,ENTIDAD_RES,MUNICIPIO_RES,TIPO_PACIENTE,...,OTRO_CASO,TOMA_MUESTRA_LAB,RESULTADO_LAB,TOMA_MUESTRA_ANTIGENO,RESULTADO_ANTIGENO,CLASIFICACION_FINAL,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI
11734366,2021-11-23,m148c71,2,12,15,1,15,15,-25,1,...,99,2,97,1,2,7,99,MÃ©xico,97,97
11734367,2021-11-23,m0944b9,2,12,15,1,15,15,-25,1,...,99,2,97,1,2,7,99,MÃ©xico,97,97
11734368,2021-11-23,m1bb0d5,2,12,15,2,15,15,-25,1,...,99,2,97,1,1,3,99,MÃ©xico,97,97


#### Datatypes for fields and memory usage

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11734369 entries, 0 to 11734368
Data columns (total 40 columns):
 #   Column                 Dtype 
---  ------                 ----- 
 0   FECHA_ACTUALIZACION    object
 1   ID_REGISTRO            object
 2   ORIGEN                 int8  
 3   SECTOR                 int8  
 4   ENTIDAD_UM             int8  
 5   SEXO                   int8  
 6   ENTIDAD_NAC            int8  
 7   ENTIDAD_RES            int8  
 8   MUNICIPIO_RES          int8  
 9   TIPO_PACIENTE          int8  
 10  FECHA_INGRESO          object
 11  FECHA_SINTOMAS         object
 12  FECHA_DEF              object
 13  INTUBADO               int8  
 14  NEUMONIA               int8  
 15  EDAD                   int8  
 16  NACIONALIDAD           int8  
 17  EMBARAZO               int8  
 18  HABLA_LENGUA_INDIG     int8  
 19  INDIGENA               int8  
 20  DIABETES               int8  
 21  EPOC                   int8  
 22  ASMA                   int8  
 23  INMUS

# 2 - Data Cleaning

### Goals for the project

* Identify key aspects of COVID-19's impact in Mexico:
    * How many men and women were infected with or died of COVID-19?
    * Which states have suffered the most from COVID-19?
    * How has the pandemic progressed in Mexico?
    * What has been the general situation of infected and deceased COVID-19 cases in Mexico since early 2020?
    * How have the different health systems performed in controlling the pandemic?
    * What influence do pre-existing conditions have on the outcome of people infected with COVID-19?
    * What influence do health complications have?
    * How are infected and deceased COVID-19 cases distributed across age ranges?
    * What relationship exists between life expectancy and age range?

### Strategy for data cleaning

In order to accomplish the objectives above, several actions must be taken:
* Some fields can be safely deleted because they don't contain useful information for our future data science tasks:
    * Record ID, municipality of origin, origin, Entidad_UM, symptom onset date, country of origin, country of nationality, migrant status, other case, patient type, indigenous language speaker, update date.
* Datetime fields hold useful information, but some data transformation is needed to facilitate certain data science tasks:
    * Update date, date of death, admission date.
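The notebook does not show the datetime transformation itself. One possible approach, sketched here on hypothetical sample rows, is to parse the string columns with `pd.to_datetime`. The `'9999-99-99'` placeholder for a missing date of death and the `DECEASED` column name are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample; '9999-99-99' stands in for "no date" (an assumption)
sample = pd.DataFrame({
    'FECHA_INGRESO': ['2021-01-05', '2021-02-10'],
    'FECHA_DEF': ['2021-01-20', '9999-99-99'],
})

for col in ['FECHA_INGRESO', 'FECHA_DEF']:
    # errors='coerce' turns unparseable strings into NaT instead of raising
    sample[col] = pd.to_datetime(sample[col], format='%Y-%m-%d', errors='coerce')

# A death flag can then be derived from the presence of a valid date
sample['DECEASED'] = sample['FECHA_DEF'].notna().astype('int8')
print(sample)
```

Parsed dates take 8 bytes per value and support date arithmetic, such as the number of days between admission and death.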

## 2.1 - Field/Row Selection


In [43]:
df['CLASIFICACION_FINAL'].value_counts()

7    7310676
3    3610640
6     462464
1     242652
5      81812
2      14684
4      11441
Name: CLASIFICACION_FINAL, dtype: int64

### Deleting unnecessary rows
These rows do not have conclusive test results for COVID-19 detection, so they cannot be used to reliably identify COVID cases.

In [45]:
# The CLASIFICACION_FINAL field represents case classifications with numbers.
# Only the number "3" represents COVID-19 cases with positive antigen test results.
# All the others are cases with inconclusive or unreliable results.
rows_to_delete = df[df.CLASIFICACION_FINAL != 3].index
rows_to_delete

Int64Index([       1,        2,        3,        5,        7,        8,
                  11,       12,       13,       14,
            ...
            11734358, 11734359, 11734360, 11734361, 11734362, 11734363,
            11734364, 11734365, 11734366, 11734367],
           dtype='int64', length=8123729)

In [46]:
df.drop(rows_to_delete, inplace=True)
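An equivalent way to express the index-based drop above is to keep the wanted rows with a boolean mask. The sketch below uses a hypothetical toy frame rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy frame using the same classification codes
toy = pd.DataFrame({
    'CLASIFICACION_FINAL': [3, 7, 3, 6, 1],
    'EDAD': [30, 41, 55, 62, 19],
})

# Keep only rows classified as 3; the original index labels are preserved
confirmed = toy[toy['CLASIFICACION_FINAL'] == 3].copy()
print(confirmed)
```

The `.copy()` call makes the result an independent DataFrame, which avoids chained-assignment warnings when columns are modified later.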

In [50]:
print(f'No. of rows: {df.shape[0]} --- No. of columns: {df.shape[1]}')

No. of rows: 3610640 --- No. of columns: 21


### Deleting unnecessary fields

As stated before, some fields do not have useful information for our data science tasks. They can be safely deleted.

In [49]:
cols = ['FECHA_ACTUALIZACION', 'ID_REGISTRO', 'ORIGEN', 'ENTIDAD_NAC', 'ENTIDAD_RES',
        'MUNICIPIO_RES', 'NACIONALIDAD', 'HABLA_LENGUA_INDIG', 'INDIGENA', 'OTRA_COM',
        'OTRO_CASO', 'TOMA_MUESTRA_LAB', 'RESULTADO_LAB', 'TOMA_MUESTRA_ANTIGENO', 'RESULTADO_ANTIGENO',
        'CLASIFICACION_FINAL', 'MIGRANTE', 'PAIS_NACIONALIDAD', 'PAIS_ORIGEN']
df.drop(cols, axis='columns', inplace=True)

In [51]:
print(f'No. of rows: {df.shape[0]} --- No. of columns: {df.shape[1]}')

No. of rows: 3610640 --- No. of columns: 21


In [11]:
# Memory usage has dropped drastically
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3867976 entries, 0 to 11734368
Data columns (total 21 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SECTOR          int8  
 1   ENTIDAD_UM      int8  
 2   SEXO            int8  
 3   TIPO_PACIENTE   int8  
 4   FECHA_INGRESO   object
 5   FECHA_SINTOMAS  object
 6   FECHA_DEF       object
 7   INTUBADO        int8  
 8   NEUMONIA        int8  
 9   EDAD            int8  
 10  EMBARAZO        int8  
 11  DIABETES        int8  
 12  EPOC            int8  
 13  ASMA            int8  
 14  INMUSUPR        int8  
 15  HIPERTENSION    int8  
 16  CARDIOVASCULAR  int8  
 17  OBESIDAD        int8  
 18  RENAL_CRONICA   int8  
 19  TABAQUISMO      int8  
 20  UCI             int8  
dtypes: int8(18), object(3)
memory usage: 184.4+ MB


## 2.2 - Reformatting field values

Several peculiarities exist in how information is represented in our dataset. Specifically, there are three aspects of the representation that we need to address:
- The field "SEXO" takes 3 values: 1 represents women, 2 represents men and 99 represents unspecified cases. Cases with the number 99 will be deleted and the field will be converted to a binary variable (where 1 represents women and 0 represents men).
- The same representation issue appears in the fields for pre-existing conditions and health complications (e.g. EMBARAZO, INTUBADO, etc.). These fields use a data catalog of 5 values: 1 represents YES, 2 NO, 97 DOES NOT APPLY, 98 UNKNOWN and 99 UNSPECIFIED. Cases with numbers 98 and 99 will be deleted, 97 values are handled as described in the observations below, and these fields will be transformed to binary variables (where 1 indicates YES and 0 NO).
- The field "TIPO_PACIENTE" takes 3 values. The value 1 represents ambulatory cases of COVID-19, the value 2 represents hospitalized cases, and the value 99 is for unspecified cases. Those last cases will be deleted and the "TIPO_PACIENTE" field will be renamed to "PAC_HOSPITALIZADO", where a value of 1 represents hospitalized patients.

### Field "SEXO"

In [53]:
# Deleting entries with values 99
rows_to_delete = df[df.SEXO == 99].index
df.drop(rows_to_delete, inplace=True)

In [54]:
# Conversion function
clean_sex = lambda x : 1 if x == 1 else 0

In [55]:
# The conversion function is applied. Results are stored in the same column
df['SEXO'] = df['SEXO'].apply(clean_sex).astype('int8') # the field is defined as int8 to preserve memory
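On millions of rows, a vectorized comparison is usually faster than `apply` with a Python function. The following sketch shows the same SEXO recoding on a hypothetical toy frame.

```python
import pandas as pd

toy = pd.DataFrame({'SEXO': [1, 2, 99, 1, 2]}, dtype='int8')

# Drop unspecified cases (99), then encode women (1) as 1 and men (2) as 0
toy = toy[toy['SEXO'] != 99].copy()
toy['SEXO'] = (toy['SEXO'] == 1).astype('int8')
print(toy['SEXO'].tolist())
```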

### Fields representing Pre-existing Conditions (Comorbidities)

#### Counting cases for each case type

In [156]:
cols = ['NEUMONIA', 'EMBARAZO', 'DIABETES', 'EPOC', 'ASMA',
        'INMUSUPR', 'HIPERTENSION', 'CARDIOVASCULAR', 'OBESIDAD', 'RENAL_CRONICA',
        'TABAQUISMO', 'INTUBADO', 'UCI']

Before modifying these fields, it's useful to look at the distribution of values that each field contains.

In [164]:
def calc_dropped_props(row):
    # Share of cases coded 97, 98 or 99 among all cases counted in the row
    dropped = row[97] + row[98] + row[99]
    total = row.sum()
    return dropped / total

In [179]:
def count_case_types(df, cols):
    # One row per field, one column per catalog code
    aux_df = pd.DataFrame(columns = [1, 2, 97, 98, 99])
    for col in cols:
        aux_df.loc[col] = df[col].value_counts()
    # Codes that never appear in a field are counted as 0
    aux_df.fillna(0, inplace=True)
    aux_df = aux_df.astype('int64')
    # aux_df['Proportion'] = aux_df.apply(calc_dropped_props, axis='columns')
    aux_df.columns = ['YES (1)', 'NO (2)', 'DOES NOT APPLY (97)', 'UNKNOWN (98)', 'UNSPECIFIED (99)']
    return aux_df

In [180]:
case_types = count_case_types(df, cols)
case_types

Unnamed: 0,YES (1),NO (2),DOES NOT APPLY (97),UNKNOWN (98),UNSPECIFIED (99)
NEUMONIA,422949,0,0,0,4260
EMBARAZO,27092,0,0,10873,3
DIABETES,406991,0,0,7714,0
EPOC,32714,0,0,7021,0
ASMA,71752,0,0,6881,0
INMUSUPR,24304,0,0,7083,0
HIPERTENSION,527975,0,0,7190,0
CARDIOVASCULAR,45151,0,0,7037,0
OBESIDAD,451169,0,0,6792,0
RENAL_CRONICA,44701,0,0,6975,0
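The commented-out `Proportion` column can also be computed without a row-wise `apply`. The sketch below is one possible vectorized version on hypothetical toy data. Unlike `calc_dropped_props`, it counts only the codes 98 and 99 as dropped, which is an assumption about the intended definition.

```python
import pandas as pd

# Hypothetical toy data using the five catalog codes
toy = pd.DataFrame({
    'DIABETES': [1, 2, 2, 98, 2],
    'UCI': [97, 97, 1, 2, 99],
})
codes = [1, 2, 97, 98, 99]

# One row per field, one column per code; absent codes are counted as 0
counts = pd.DataFrame({
    col: toy[col].value_counts().reindex(codes, fill_value=0)
    for col in toy.columns
}).T

# Share of rows carrying UNKNOWN (98) or UNSPECIFIED (99)
counts['Proportion'] = counts[[98, 99]].sum(axis=1) / counts[codes].sum(axis=1)
print(counts)
```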


#### Observations
* The DOES NOT APPLY values in the "EMBARAZO" field clearly refer to male patients, so it is safe to set those values to NO (represented with the number 2).
* For the fields representing complications (i.e. "INTUBADO" and "UCI"), the DOES NOT APPLY values are presumed to refer to individuals who did not die of COVID-19: counts this exact suggest they apply to a specific segment of the population. These values can also be set to NO.
* All other values will either be deleted (numbers 98 and 99) or rewritten ("NO" values will be set to 0).

#### Converting case types

In [169]:
# Converts NO and DOES NOT APPLY cases to 0
def convert_values_for_cases(x):
    if x==1: return 1 # YES cases stay as they are (represented with the number 1)
    if x==2: return 0 # NO cases will be represented with the number 0
    if x==97: return 0 # DOES NOT APPLY cases will be set to the number 0
    else: return x # Other cases (98, 99) are left unchanged; they are deleted later

In [170]:
for col in cols:
    df[col] = df[col].apply(convert_values_for_cases).astype('int8')
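The same recoding can be written with `Series.replace` and a small mapping, which avoids calling a Python function for every value. This is a sketch on hypothetical toy data, and the `recode` name is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({'DIABETES': [1, 2, 98], 'UCI': [97, 1, 99]}, dtype='int8')

# NO (2) and DOES NOT APPLY (97) become 0; 1, 98 and 99 are left untouched
recode = {2: 0, 97: 0}
for col in toy.columns:
    toy[col] = toy[col].replace(recode).astype('int8')
print(toy)
```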

#### Deleting cases

In [24]:
values_to_delete = [98, 99]
for col in cols:
    for value in values_to_delete:
        rows_to_delete = (df[df[col]==value]).index
        df.drop(rows_to_delete, inplace=True)

In [183]:
# Same clean-up as the previous cell, written as one comparison per column
for col in cols:
    rows_to_delete = (df[df[col] >= 98]).index
    df.drop(rows_to_delete, inplace=True)
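Instead of dropping rows column by column, the rows to discard can be identified in one pass with a combined boolean mask. The sketch below uses a hypothetical toy frame, and `cols_to_check` is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({'DIABETES': [1, 0, 98, 0], 'UCI': [0, 1, 0, 99]}, dtype='int8')
cols_to_check = ['DIABETES', 'UCI']

# True for every row holding 98 or 99 in at least one of the checked columns
has_unknown = toy[cols_to_check].isin([98, 99]).any(axis=1)
toy = toy[~has_unknown].copy()
print(toy)
```

This filters the DataFrame once, whereas the loops above rebuild it after every drop.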

In [184]:
df.shape

(3579784, 21)

### Field "TIPO\_PACIENTE"

Here we need to delete the UNSPECIFIED cases, represented with the number 99.

In [185]:
rows_to_delete = df[df['TIPO_PACIENTE'] == 99].index
df.drop(rows_to_delete, axis='rows', inplace=True)

After that, the field will be converted to a binary variable, mapping the current values (1 for ambulatory patients and 2 for hospitalized patients) to new values (1 for hospitalized patients and 0 for ambulatory ones). The column must also be renamed to reflect its changed nature.

In [187]:
# conversion function
def convert_pacient_to_hospitalized(x):
    if x == 1: return 0 # Ambulatory cases (1) will be set to 0
    else: return 1 # Hospitalized cases (2) will be set to 1

In [188]:
# Apply conversion function to "TIPO_PACIENTE" field
df['TIPO_PACIENTE'] = df['TIPO_PACIENTE'].apply(convert_pacient_to_hospitalized).astype('int8')

In [189]:
# Rename column
df.rename(columns={'TIPO_PACIENTE': 'PAC_HOSPITALIZADO'}, inplace=True)

# 3 - Save Data

Once the general data cleaning has finished, the data can be saved for future data science tasks.

#### Dimensionality of the cleaned dataset

In [196]:
# Dimensionality of the cleaned dataset
print(f'No. of rows: {df.shape[0]} --- No. of columns: {df.shape[1]}')

No. of rows: 3579784 --- No. of columns: 21


In [193]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3579784 entries, 0 to 11703510
Data columns (total 21 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   SECTOR             int8  
 1   ENTIDAD_UM         int8  
 2   SEXO               int8  
 3   PAC_HOSPITALIZADO  int8  
 4   FECHA_INGRESO      object
 5   FECHA_SINTOMAS     object
 6   FECHA_DEF          object
 7   INTUBADO           int8  
 8   NEUMONIA           int8  
 9   EDAD               int8  
 10  EMBARAZO           int8  
 11  DIABETES           int8  
 12  EPOC               int8  
 13  ASMA               int8  
 14  INMUSUPR           int8  
 15  HIPERTENSION       int8  
 16  CARDIOVASCULAR     int8  
 17  OBESIDAD           int8  
 18  RENAL_CRONICA      int8  
 19  TABAQUISMO         int8  
 20  UCI                int8  
dtypes: int8(18), object(3)
memory usage: 170.7+ MB


## 3.1 - Saving Dataset

In [192]:
clean_dataset_path = os.path.join('..', 'datasets', 'clean_covid_dataset.csv')
df.to_csv(clean_dataset_path, index=False)
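CSV files do not store column types, so the `int8` columns will be inferred as `int64` when the cleaned file is read back unless the types are passed again. The following sketch demonstrates this with an in-memory round trip on hypothetical toy data.

```python
import io
import pandas as pd

toy = pd.DataFrame({'SEXO': [1, 0], 'EDAD': [34, 71]}, dtype='int8')

# Round-trip through CSV in memory (a file path would work the same way)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without dtype=..., read_csv would infer int64 for these columns
restored = pd.read_csv(buffer, dtype={'SEXO': 'int8', 'EDAD': 'int8'})
print(restored.dtypes)
```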