# Analyzing COVID-19 data in Mexico

In this notebook I will analyze a data set with COVID-19 data in Mexico using the CRISP-DM process. This data set is updated daily and can be found on the Mexico City government [webpage](https://datos.cdmx.gob.mx/explore/dataset/casos-asociados-a-covid-19/). I downloaded the data from this page on May 1st, 2020, and compressed it with gzip using pandas.
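As a rough illustration of the compression step, the sketch below writes a small table to a gzip-compressed CSV with pandas and reads it back. The sample rows, the temporary path, and `index=False` are assumptions for the example and may differ from how the real file was saved. Because the file name ends in `.csv` rather than `.gz`, pandas cannot infer the compression from the extension, so `compression='gzip'` is passed explicitly on both sides.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the downloaded table (the real one has 38 columns).
raw = pd.DataFrame({'sex': ['MUJER', 'HOMBRE'], 'age': [45, 12]})

# Write a gzip-compressed CSV. The '.csv' name hides the compression,
# so it has to be stated explicitly when writing and when reading.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'covid-sample.csv')
raw.to_csv(path, index=False, compression='gzip')

# Read the compressed file back into a DataFrame.
restored = pd.read_csv(path, compression='gzip')

# gzip files start with the two magic bytes 0x1f 0x8b.
with open(path, 'rb') as handle:
    magic = handle.read(2)
```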

## Business and Data Understanding

Let's begin by taking a look at the data.

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [23]:
english_cols = ['date_update','origin','sector','medical_unit_entity','sex','birth_entity',
'residency_entity','municipality_of_residence','patient_type','admission_date',
'date_symptoms','death_date','intubed','pneumonia','age','nationality',
'pregnancy','speaks_indigenous_language','diabetes','copd','asthma',
'immunosuppression','hypertension','other_complication','cardiovascular',
'obesity','chronic_kidney','tobacco','another_case','outcome','migrant',
'country_nationality','country_of_origin','intensive_care_unit','age_range',
'register_id','death','hospitalized']
# Skip the file's own header row and use the English column names instead
covid_data = pd.read_csv('covid-19-mex.csv', skiprows=1, names=english_cols, compression='gzip')

In [24]:
covid_data.head()

Unnamed: 0,date_update,origin,sector,medical_unit_entity,sex,birth_entity,residency_entity,municipality_of_residence,patient_type,admission_date,...,another_case,outcome,migrant,country_nationality,country_of_origin,intensive_care_unit,age_range,register_id,death,hospitalized
0,2020-04-30,FUERA DE USMER,SSA,CIUDAD DE MÉXICO,MUJER,CIUDAD DE MÉXICO,CIUDAD DE MÉXICO,Iztapalapa,AMBULATORIO,2020-04-07,...,SI,No positivo SARS-CoV-2,NO ESPECIFICADO,México,99,NO APLICA,41-50,091c2e,,
1,2020-04-30,FUERA DE USMER,IMSS,CIUDAD DE MÉXICO,MUJER,CIUDAD DE MÉXICO,CIUDAD DE MÉXICO,Iztapalapa,AMBULATORIO,2020-04-12,...,NO ESPECIFICADO,No positivo SARS-CoV-2,NO ESPECIFICADO,México,99,NO APLICA,06-15,14bbc1,,
2,2020-04-30,USMER,SSA,MORELOS,MUJER,MORELOS,MORELOS,Tláhuac,AMBULATORIO,2020-04-02,...,NO,No positivo SARS-CoV-2,NO ESPECIFICADO,México,99,NO APLICA,21-30,1840b3,,
3,2020-04-30,USMER,IMSS,CAMPECHE,MUJER,TABASCO,TABASCO,Gustavo A. Madero,HOSPITALIZADO,2020-04-15,...,NO ESPECIFICADO,No positivo SARS-CoV-2,NO ESPECIFICADO,México,99,NO,21-30,118206,,1.0
4,2020-04-30,FUERA DE USMER,SSA,CIUDAD DE MÉXICO,MUJER,CIUDAD DE MÉXICO,CIUDAD DE MÉXICO,Venustiano Carranza,AMBULATORIO,2020-04-28,...,NO,No positivo SARS-CoV-2,NO ESPECIFICADO,México,99,NO APLICA,41-50,1221b8,,


In [25]:
covid_data.shape

(87372, 38)

In [28]:
covid_data.describe(include = 'all')

Unnamed: 0,date_update,origin,sector,medical_unit_entity,sex,birth_entity,residency_entity,municipality_of_residence,patient_type,admission_date,...,another_case,outcome,migrant,country_nationality,country_of_origin,intensive_care_unit,age_range,register_id,death,hospitalized
count,87372,87372,87372,87372,87372,87372,87372,38402,87372,87372,...,87372,87372,87372,87372,87372.0,87372,87363,87372,3077.0,23165.0
unique,1,2,13,32,2,33,32,16,2,121,...,3,3,3,71,29.0,4,11,87372,,
top,2020-04-30,FUERA DE USMER,SSA,CIUDAD DE MÉXICO,HOMBRE,CIUDAD DE MÉXICO,CIUDAD DE MÉXICO,Gustavo A. Madero,AMBULATORIO,2020-04-28,...,NO ESPECIFICADO,No positivo SARS-CoV-2,NO ESPECIFICADO,México,99.0,NO APLICA,31-40,06a3e9,,
freq,87372,54009,47450,21795,44124,19600,18703,4332,64207,4120,...,32806,52628,86920,86193,87187.0,64207,21072,1,,
mean,,,,,,,,,,,...,,,,,,,,,1.0,1.0
std,,,,,,,,,,,...,,,,,,,,,0.0,0.0
min,,,,,,,,,,,...,,,,,,,,,1.0,1.0
25%,,,,,,,,,,,...,,,,,,,,,1.0,1.0
50%,,,,,,,,,,,...,,,,,,,,,1.0,1.0
75%,,,,,,,,,,,...,,,,,,,,,1.0,1.0


In [29]:
#explore values of data
print('columns unique values')
for col in covid_data.columns:
    print(col, covid_data[col].unique())

columns unique values
date_update ['2020-04-30']
origin ['FUERA DE USMER' 'USMER']
sector ['SSA' 'IMSS' 'ISSSTE' 'ESTATAL' 'PRIVADA' 'NO ESPECIFICADO' 'PEMEX'
 'SEDEMA' 'UNIVERSITARIO' 'SEMAR' 'MUNICIPAL' 'DIF' 'CRUZ ROJA']
medical_unit_entity ['CIUDAD DE MÉXICO' 'MORELOS' 'CAMPECHE' 'MÉXICO' 'GUERRERO' 'CHIHUAHUA'
 'JALISCO' 'VERACRUZ DE IGNACIO DE LA LLAVE' 'SONORA'
 'MICHOACÁN DE OCAMPO' 'BAJA CALIFORNIA' 'COAHUILA DE ZARAGOZA' 'YUCATÁN'
 'SAN LUIS POTOSÍ' 'TAMAULIPAS' 'BAJA CALIFORNIA SUR' 'OAXACA' 'TABASCO'
 'SINALOA' 'TLAXCALA' 'HIDALGO' 'GUANAJUATO' 'QUINTANA ROO' 'PUEBLA'
 'NAYARIT' 'AGUASCALIENTES' 'NUEVO LEÓN' 'QUERÉTARO' 'CHIAPAS' 'ZACATECAS'
 'COLIMA' 'DURANGO']
sex ['MUJER' 'HOMBRE']
birth_entity ['CIUDAD DE MÉXICO' 'MORELOS' 'TABASCO' 'MÉXICO' 'GUERRERO' 'CHIHUAHUA'
 'JALISCO' 'VERACRUZ DE IGNACIO DE LA LLAVE' 'NAYARIT' 'HIDALGO'
 'NO ESPECIFICADO' 'COAHUILA DE ZARAGOZA' 'YUCATÁN' 'SAN LUIS POTOSÍ'
 'TAMAULIPAS' 'BAJA CALIFORNIA SUR' 'BAJA CALIFORNIA'
 'MICHOACÁN DE OCAMP

register_id ['091c2e' '14bbc1' '1840b3' ... '026b61' '1a58a7' '1b1bfb']
death [nan  1.]
hospitalized [nan  1.]


From the above, we can see that this data contains medical information about all the people who have been tested for COVID-19 in Mexico. Most of the columns in this data set are categorical. There are only three numerical columns: age, death and hospitalized. Some columns and rows contain NaN values. However, other columns encode missing values as strings such as 'NO APLICA', 'NO ESPECIFICADO', and 'SE IGNORA', which in English mean 'NOT APPLICABLE', 'NOT SPECIFIED', and 'UNKNOWN'.
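One possible way to handle these placeholder strings is to turn them into real NaN values so that pandas counts them as missing. The snippet below is a minimal sketch on a made-up three-row sample. The helper name `placeholders_to_nan` is hypothetical, and treating all three strings as missing in every column is an assumption that may not suit every question.

```python
import pandas as pd

# Strings the data set uses in place of a real missing value.
PLACEHOLDERS = ['NO APLICA', 'NO ESPECIFICADO', 'SE IGNORA']

def placeholders_to_nan(df, placeholders=PLACEHOLDERS):
    """Return a copy of df where placeholder strings are replaced by NaN."""
    # isin() flags the placeholder cells; mask() blanks the flagged cells.
    return df.mask(df.isin(placeholders))

# Made-up sample using values that appear in the output above.
sample = pd.DataFrame({
    'intensive_care_unit': ['NO APLICA', 'NO', 'NO APLICA'],
    'another_case': ['SI', 'NO ESPECIFICADO', 'NO'],
    'age': [45, 12, 30],
})

cleaned = placeholders_to_nan(sample)

# Number of missing values per column after the conversion.
missing_per_column = cleaned.isna().sum()
```

After this step, `covid_data.isna().sum()` would report the placeholder entries together with the values that were already NaN.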

It is clear that we will have to do some cleaning, but first let's formulate some interesting questions that could be answered with the data.

### Business Questions

I would like to answer the following questions with this data:

* 1. __What kind of people are infected with COVID-19 in Mexico?__ To answer this question we will have to look at columns such as sex, age or municipality_of_residence.
* 2. __What is related to the severity of COVID-19? Specifically, can we predict whether a person is likely to be intubated, or whether a person will survive the disease?__ To answer this question we will have to look at columns such as intubed, diabetes, asthma, etc.
* 3. __Can we make a prediction for the end of the pandemic in Mexico City from this data?__ To answer this question we will use the daily cumulative cases in Mexico City and an epidemic model.
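
For the third question, a daily cumulative case series could be derived from the table roughly as sketched below. This is only an illustration. The helper `cumulative_cases` is hypothetical, the positive label `'Positivo SARS-CoV-2'` is an assumption (only the negative label appears in the output above), and using `admission_date` as the case date is one choice among several.

```python
import pandas as pd

def cumulative_cases(df, date_col='admission_date',
                     positive_label='Positivo SARS-CoV-2'):
    """Return the running total of positive cases indexed by date.

    positive_label is an assumed value of the 'outcome' column.
    """
    positives = df[df['outcome'] == positive_label]
    dates = pd.to_datetime(positives[date_col])
    # Count cases per day, order the days, then accumulate.
    daily = dates.value_counts().sort_index()
    return daily.cumsum()

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    'admission_date': ['2020-04-02', '2020-04-02', '2020-04-07', '2020-04-12'],
    'outcome': ['Positivo SARS-CoV-2', 'Positivo SARS-CoV-2',
                'No positivo SARS-CoV-2', 'Positivo SARS-CoV-2'],
})

cumulative = cumulative_cases(sample)
```

Days without any positive case do not appear in the result, so the series would need to be reindexed to a full date range before fitting an epidemic model to it.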

### Data Preparation

Let's clean the data.