# COVID cases in the United States by county

We are given data on COVID cases and deaths in the United States by county, and we must run an ETL process over 3 separate CSV files.

## Importing the required libraries

In [2]:
import pandas as pd
from numpy import nan
import sqlalchemy

## 1. Extract

We extract the data from the CSV files we were given, which for convenience are kept in the same directory as the notebook.
We also need to examine the structure of the extracted data.

Note that the files use different delimiters: "/" for the counties, "&" for the daily cases, and "," for the states. Each file is loaded into its own data frame.

In [3]:
df_counties = pd.read_csv(filepath_or_buffer="counties_us.csv", sep="/", encoding='utf-8')
df_daily = pd.read_csv(filepath_or_buffer="daily_cases_us.csv", sep="&", encoding='utf-8')
df_states = pd.read_csv(filepath_or_buffer="states_us.csv", sep=",", encoding='utf-8')
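
Since each file uses a different delimiter, one option is to guess it from the header line instead of hardcoding `sep`. The following is only a sketch: `guess_separator` is a hypothetical helper (not part of pandas), and it assumes the delimiter is the candidate character that appears most often in the header. The header strings are modeled on the column names shown in this notebook and are illustrative only.

```python
def guess_separator(header_line, candidates=("/", "&", ",")):
    """Return the candidate delimiter that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


# Illustrative header lines, not read from the real files.
print(guess_separator("fIps/cOUnty/statE/statE_cOdE"))
print(guess_separator("datE&cOUnty&statE&fIps&casEs&dEaths"))
print(guess_separator("cOUnty,statE,statE_cOdE,pOpUlatIOn"))
```

The returned character could then be passed as `sep` to `pd.read_csv`. For files whose fields may contain the delimiter characters themselves, this simple count would not be reliable.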

We display each data frame to see which columns and data types it contains.

In [4]:
df_counties

Unnamed: 0,fIps,cOUnty,statE,statE_cOdE,malE,fEmalE,mEdIan_agE,lat,lOng
0,1001,AUtaUga CoUnty,Alabama,AL,26874,28326,37.8,32.534923,-86.642730
1,1003,BaldwIn CoUnty,Alabama,AL,101188,106919,42.8,30.727479,-87.722564
2,1005,BarboUr CoUnty,Alabama,AL,13697,12085,39.9,31.869581,-85.393210
3,1007,BIbb CoUnty,Alabama,AL,12152,10375,39.9,32.998628,-87.126475
4,1009,BloUnt CoUnty,Alabama,AL,28434,29211,40.8,33.980869,-86.567380
...,...,...,...,...,...,...,...,...,...
3215,72145,Vega Baja MUnIcIpIo,PuertO RicO,,25580,27791,40.7,18.428461,-66.397926
3216,72147,VIeqUes MUnIcIpIo,PuertO RicO,,4332,4439,43.6,18.122662,-65.439095
3217,72149,VIllalba MUnIcIpIo,PuertO RicO,,11169,11824,38.8,18.128155,-66.472816
3218,72151,YabUcoa MUnIcIpIo,PuertO RicO,,16541,17608,42.5,18.070468,-65.896311


In [5]:
df_daily

Unnamed: 0,datE,cOUnty,statE,fIps,casEs,dEaths
0,2020!%01!%21,SnohomIsh,WashingtOn,53061.0,1,0.0
1,2020!%01!%22,SnohomIsh,WashingtOn,53061.0,1,0.0
2,2020!%01!%23,SnohomIsh,WashingtOn,53061.0,1,0.0
3,2020!%01!%24,Cook,IllinOis,17031.0,1,0.0
4,2020!%01!%24,SnohomIsh,WashingtOn,53061.0,1,0.0
...,...,...,...,...,...,...
2502827,2022!%05!%13,Sweetwater,WyOming,56037.0,11088,126.0
2502828,2022!%05!%13,Teton,WyOming,56039.0,10074,16.0
2502829,2022!%05!%13,UInta,WyOming,56041.0,5643,39.0
2502830,2022!%05!%13,WashakIe,WyOming,56043.0,2358,44.0


In [6]:
df_states

Unnamed: 0,cOUnty,statE,statE_cOdE,pOpUlatIOn
0,Autauga County,Alabama,AL,55200
1,Baldwin County,Alabama,AL,208107
2,Barbour County,Alabama,AL,25782
3,Bibb County,Alabama,AL,22527
4,Blount County,Alabama,AL,57645
...,...,...,...,...
3215,Vega Baja Municipio,Puerto Rico,,53371
3216,Vieques Municipio,Puerto Rico,,8771
3217,Villalba Municipio,Puerto Rico,,22993
3218,Yabucoa Municipio,Puerto Rico,,34149


The column names are messy, and some values are missing or misspelled, so we move on to the transformation step.

## 2. Transform

### Data Separation

We rename the columns so that we can index the data correctly.

In [7]:
colsCounty = ['Fips', 'County', 'State', 'State_code', 'Male', 'Female', 'Median_age', 'Lat', 'Long']
colsDays = ['Date', 'County', 'State', 'Fips', 'Cases', 'Deaths']
colsStates = ['County', 'State', 'State_code', 'Population']
df_counties.columns = colsCounty
df_daily.columns = colsDays
df_states.columns = colsStates
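
As an alternative to typing each list by hand, the mangled names could be normalized programmatically. This is a minimal sketch, assuming the only problem with the names is inconsistent letter case; `normalize_columns` is a hypothetical helper, and it relies on `str.capitalize`, which uppercases the first character and lowercases the rest.

```python
import pandas as pd


def normalize_columns(df):
    """Return a copy whose column names have only the first letter capitalized."""
    return df.rename(columns=str.capitalize)


# Empty frame that only carries the original (mangled) county column names.
raw = pd.DataFrame(columns=["fIps", "cOUnty", "statE", "statE_cOdE", "malE",
                            "fEmalE", "mEdIan_agE", "lat", "lOng"])
print(list(normalize_columns(raw).columns))
```

Unlike assigning to `.columns`, this does not depend on the number or order of the columns.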

We check the data types of each data frame.

In [8]:
df_counties.dtypes  # data types

Fips            int64
County         object
State          object
State_code     object
Male            int64
Female          int64
Median_age    float64
Lat           float64
Long          float64
dtype: object

In [9]:
df_daily.dtypes

Date       object
County     object
State      object
Fips      float64
Cases       int64
Deaths    float64
dtype: object

In [10]:
df_states.dtypes

County        object
State         object
State_code    object
Population     int64
dtype: object

We inspect each data frame to see how many null values it has.

In [11]:
df_counties.info()  # gives more detailed information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Fips        3220 non-null   int64  
 1   County      3220 non-null   object 
 2   State       3220 non-null   object 
 3   State_code  3141 non-null   object 
 4   Male        3220 non-null   int64  
 5   Female      3220 non-null   int64  
 6   Median_age  3220 non-null   float64
 7   Lat         3220 non-null   float64
 8   Long        3220 non-null   float64
dtypes: float64(3), int64(3), object(3)
memory usage: 226.5+ KB


For the counties, the State_code column has only 3141 non-null values out of 3220.

In [12]:
df_counties["State_code"].unique()  # see which distinct values it has

array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', nan, 'FL', 'GA',
       'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
       'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
       'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)

In [13]:
df_counties.isnull().sum()  # counts the null values in each column

Fips           0
County         0
State          0
State_code    79
Male           0
Female         0
Median_age     0
Lat            0
Long           0
dtype: int64
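
To find out which states account for those nulls, the null rows can be filtered and their distinct `State` values listed. A minimal sketch on made-up sample rows (`states_with_null_code` is a hypothetical helper; the sample is not the real data):

```python
import pandas as pd


def states_with_null_code(df):
    """Return the distinct State values on rows where State_code is null."""
    return sorted(df.loc[df["State_code"].isnull(), "State"].unique())


# Small illustrative sample using the same mangled casing as the raw file.
sample = pd.DataFrame({
    "State": ["Alabama", "PuertO RicO", "PuertO RicO", "District Of COlumbia"],
    "State_code": ["AL", None, None, None],
})
print(states_with_null_code(sample))
```

The same call on `df_counties` would show every state that lacks a code.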

We repeat the same procedure for the daily cases and states data frames.

In [14]:
df_daily.info()  # more detailed information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2502832 entries, 0 to 2502831
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Date    object 
 1   County  object 
 2   State   object 
 3   Fips    float64
 4   Cases   int64  
 5   Deaths  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 114.6+ MB


In [15]:
df_daily.isnull().sum()  # count the nulls in each column

Date          0
County        0
State         0
Fips      23678
Cases         0
Deaths    57605
dtype: int64

In [16]:
df_states.info()  # data frame information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   County      3220 non-null   object
 1   State       3220 non-null   object
 2   State_code  3141 non-null   object
 3   Population  3220 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 100.8+ KB


In [17]:
df_states['State_code'].unique()

array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', nan, 'FL', 'GA',
       'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
       'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
       'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)

In [18]:
df_states.isnull().sum()  # count the nulls in each column

County         0
State          0
State_code    79
Population     0
dtype: int64

### Null Data

For null data, we decided to drop as few rows as possible, since every row removed is information lost.

In the counties data frame, State_code is null only when the state is Puerto Rico or the District of Columbia. Since these are the only two cases, we can replace the nulls with the corresponding state code.

In [19]:
# .loc lets us combine several boolean conditions and replace the matching values
df_counties.loc[(df_counties['State_code'].isnull()) & (df_counties['State'] == 'PuertO RicO'), 'State_code'] = 'PR'
df_counties.loc[(df_counties['State_code'].isnull()) & (df_counties['State'] == 'District Of COlumbia'), 'State_code'] = 'DC'
df_counties

Unnamed: 0,Fips,County,State,State_code,Male,Female,Median_age,Lat,Long
0,1001,AUtaUga CoUnty,Alabama,AL,26874,28326,37.8,32.534923,-86.642730
1,1003,BaldwIn CoUnty,Alabama,AL,101188,106919,42.8,30.727479,-87.722564
2,1005,BarboUr CoUnty,Alabama,AL,13697,12085,39.9,31.869581,-85.393210
3,1007,BIbb CoUnty,Alabama,AL,12152,10375,39.9,32.998628,-87.126475
4,1009,BloUnt CoUnty,Alabama,AL,28434,29211,40.8,33.980869,-86.567380
...,...,...,...,...,...,...,...,...,...
3215,72145,Vega Baja MUnIcIpIo,PuertO RicO,PR,25580,27791,40.7,18.428461,-66.397926
3216,72147,VIeqUes MUnIcIpIo,PuertO RicO,PR,4332,4439,43.6,18.122662,-65.439095
3217,72149,VIllalba MUnIcIpIo,PuertO RicO,PR,11169,11824,38.8,18.128155,-66.472816
3218,72151,YabUcoa MUnIcIpIo,PuertO RicO,PR,16541,17608,42.5,18.070468,-65.896311


In [20]:
df_counties.isnull().sum()  # verify that no nulls remain

Fips          0
County        0
State         0
State_code    0
Male          0
Female        0
Median_age    0
Lat           0
Long          0
dtype: int64
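
The two `.loc` assignments above could also be expressed as a single lookup. The sketch below is one possible variant, not the approach used in this notebook: `fill_state_codes` and `STATE_CODES` are hypothetical names, and the state names are compared case-insensitively so the same function works regardless of how the letter case is mangled.

```python
import pandas as pd

# Assumed mapping for the two states without a code in the source files.
STATE_CODES = {"puerto rico": "PR", "district of columbia": "DC"}


def fill_state_codes(df, codes=STATE_CODES):
    """Fill null State_code values from a case-insensitive state-name lookup."""
    out = df.copy()
    key = out["State"].str.casefold()
    out["State_code"] = out["State_code"].fillna(key.map(codes))
    return out


sample = pd.DataFrame({
    "State": ["Alabama", "PuertO RicO", "District Of COlumbia"],
    "State_code": ["AL", None, None],
})
print(fill_state_codes(sample)["State_code"].tolist())
```

Rows that already have a code are left untouched, and states missing from the mapping stay null, so they remain visible in a later `isnull().sum()` check.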

We can do something similar for the states data frame.

In [21]:
# replace the NaN values with the corresponding state code
df_states.loc[(df_states['State_code'].isnull()) & (df_states['State'] == 'Puerto Rico'), 'State_code'] = 'PR'
df_states.loc[(df_states['State_code'].isnull()) & (df_states['State'] == 'District of Columbia'), 'State_code'] = 'DC'
df_states

Unnamed: 0,County,State,State_code,Population
0,Autauga County,Alabama,AL,55200
1,Baldwin County,Alabama,AL,208107
2,Barbour County,Alabama,AL,25782
3,Bibb County,Alabama,AL,22527
4,Blount County,Alabama,AL,57645
...,...,...,...,...
3215,Vega Baja Municipio,Puerto Rico,PR,53371
3216,Vieques Municipio,Puerto Rico,PR,8771
3217,Villalba Municipio,Puerto Rico,PR,22993
3218,Yabucoa Municipio,Puerto Rico,PR,34149


In [22]:
df_states.isnull().sum()  # verify that no nulls remain

County        0
State         0
State_code    0
Population    0
dtype: int64

The daily cases data frame is a different matter.

First, there are 57605 null values in Deaths. From a statistical point of view, knowing how many people were infected is of little use if we cannot know how many of them died.
For that reason we consider it better to drop those rows. Moreover, with more than 2.5 million rows in total, removing these 57605 rows (about 2.3%) has little effect on the final result.

In [23]:
df_daily = df_daily.dropna(subset=['Deaths'])  # drop every row where Deaths is null
df_daily

Unnamed: 0,Date,County,State,Fips,Cases,Deaths
0,2020!%01!%21,SnohomIsh,WashingtOn,53061.0,1,0.0
1,2020!%01!%22,SnohomIsh,WashingtOn,53061.0,1,0.0
2,2020!%01!%23,SnohomIsh,WashingtOn,53061.0,1,0.0
3,2020!%01!%24,Cook,IllinOis,17031.0,1,0.0
4,2020!%01!%24,SnohomIsh,WashingtOn,53061.0,1,0.0
...,...,...,...,...,...,...
2502827,2022!%05!%13,Sweetwater,WyOming,56037.0,11088,126.0
2502828,2022!%05!%13,Teton,WyOming,56039.0,10074,16.0
2502829,2022!%05!%13,UInta,WyOming,56041.0,5643,39.0
2502830,2022!%05!%13,WashakIe,WyOming,56043.0,2358,44.0
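
To put a number on how much data the `dropna` step removes, the share of dropped rows can be computed alongside the filtering. A minimal sketch (`drop_null_rows` is a hypothetical helper; the last line only reuses the counts reported earlier in this notebook):

```python
import pandas as pd


def drop_null_rows(df, column):
    """Drop rows where `column` is null and return (kept rows, share removed)."""
    kept = df.dropna(subset=[column])
    share = 1 - len(kept) / len(df)
    return kept, share


# Tiny illustrative sample: one of four rows has no Deaths value.
sample = pd.DataFrame({"Cases": [1, 2, 3, 4], "Deaths": [0.0, None, 1.0, 2.0]})
kept, share = drop_null_rows(sample, "Deaths")
print(len(kept), share)

# Counts reported above: 57605 null Deaths out of 2502832 rows.
print(f"{57605 / 2502832:.2%}")
```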


Next, when the Fips column of the daily cases data frame is null, there are 2 possible cases:

The first is when the county is recorded as unknown (County = 'unknown'). In that case we cannot recover the Fips value, so the row must also be dropped.
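
This first case could be handled with a boolean mask. The sketch below assumes the county value is the word "unknown" in some letter case (the exact spelling in the raw file is not shown here, so the comparison is case-insensitive); `drop_unrecoverable_fips` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def drop_unrecoverable_fips(df):
    """Drop rows whose Fips is null and whose County is 'unknown' (any case)."""
    unknown = df["County"].str.casefold().eq("unknown")
    return df[~(df["Fips"].isnull() & unknown)]


sample = pd.DataFrame({
    "County": ["Cook", "Unknown", "Some County"],
    "Fips": [17031.0, None, None],
})
print(drop_unrecoverable_fips(sample)["County"].tolist())
```

Rows with a null Fips but a known county name are kept, since their Fips might still be recoverable.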

### Data Formatting

In [24]:
df_counties["County"] = df_counties["County"].str.title()
df_counties["State"] = df_counties["State"].str.title()

In [25]:
df_counties

Unnamed: 0,Fips,County,State,State_code,Male,Female,Median_age,Lat,Long
0,1001,Autauga County,Alabama,AL,26874,28326,37.8,32.534923,-86.642730
1,1003,Baldwin County,Alabama,AL,101188,106919,42.8,30.727479,-87.722564
2,1005,Barbour County,Alabama,AL,13697,12085,39.9,31.869581,-85.393210
3,1007,Bibb County,Alabama,AL,12152,10375,39.9,32.998628,-87.126475
4,1009,Blount County,Alabama,AL,28434,29211,40.8,33.980869,-86.567380
...,...,...,...,...,...,...,...,...,...
3215,72145,Vega Baja Municipio,Puerto Rico,PR,25580,27791,40.7,18.428461,-66.397926
3216,72147,Vieques Municipio,Puerto Rico,PR,4332,4439,43.6,18.122662,-65.439095
3217,72149,Villalba Municipio,Puerto Rico,PR,11169,11824,38.8,18.128155,-66.472816
3218,72151,Yabucoa Municipio,Puerto Rico,PR,16541,17608,42.5,18.070468,-65.896311
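
One caveat worth checking with `str.title()`: it capitalizes every word, so a name such as "District of Columbia" comes out with a capital "Of" and would no longer match the spelling used in the states data frame. A small sketch of the effect, together with a case-insensitive key that could be used for comparisons (the two sample values are taken from the state names seen above):

```python
import pandas as pd

states = pd.Series(["PuertO RicO", "District Of COlumbia"])

# str.title() capitalizes the first letter of every word, including "of".
titled = states.str.title().tolist()
print(titled)

# A casefolded key ignores letter case entirely, which is safer for matching.
keys = states.str.casefold().tolist()
print(keys)
```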


## 3. Load

In [26]:
df_counties.set_index('Fips', inplace = True)
df_counties

Fips,County,State,State_code,Male,Female,Median_age,Lat,Long
1001,Autauga County,Alabama,AL,26874,28326,37.8,32.534923,-86.642730
1003,Baldwin County,Alabama,AL,101188,106919,42.8,30.727479,-87.722564
1005,Barbour County,Alabama,AL,13697,12085,39.9,31.869581,-85.393210
1007,Bibb County,Alabama,AL,12152,10375,39.9,32.998628,-87.126475
1009,Blount County,Alabama,AL,28434,29211,40.8,33.980869,-86.567380
...,...,...,...,...,...,...,...,...
72145,Vega Baja Municipio,Puerto Rico,PR,25580,27791,40.7,18.428461,-66.397926
72147,Vieques Municipio,Puerto Rico,PR,4332,4439,43.6,18.122662,-65.439095
72149,Villalba Municipio,Puerto Rico,PR,11169,11824,38.8,18.128155,-66.472816
72151,Yabucoa Municipio,Puerto Rico,PR,16541,17608,42.5,18.070468,-65.896311


In [27]:
engine = sqlalchemy.create_engine('postgresql://postgres:admin@localhost:5432/casosCovid')
conn = engine.connect()

In [28]:
df_counties.to_sql('condados', engine, if_exists='append')

ProgrammingError: (psycopg2.errors.UndefinedColumn) no existe la columna «Fips» en la relación «condados»
LINE 1: INSERT INTO condados ("Fips", "County", "State", "State_code...
                              ^

[SQL: INSERT INTO condados ("Fips", "County", "State", "State_code", "Male", "Female", "Median_age", "Lat", "Long") VALUES (%(Fips)s, %(County)s, %(State)s, %(State_code)s, %(Male)s, %(Female)s, %(Median_age)s, %(Lat)s, %(Long)s)]
[parameters: ({'Fips': 1001, 'County': 'Autauga County', 'State': 'Alabama', 'State_code': 'AL', 'Male': 26874, 'Female': 28326, 'Median_age': 37.8, 'Lat': 32.53492293292812, 'Long': -86.64273013739468}, {'Fips': 1003, 'County': 'Baldwin County', 'State': 'Alabama', 'State_code': 'AL', 'Male': 101188, 'Female': 106919, 'Median_age': 42.8, 'Lat': 30.72747876693927, 'Long': -87.72256353282204}, {'Fips': 1005, 'County': 'Barbour County', 'State': 'Alabama', 'State_code': 'AL', 'Male': 13697, 'Female': 12085, 'Median_age': 39.9, 'Lat': 31.86958143836232, 'Long': -85.39320978902569}, {'Fips': 1007, 'County': 'Bibb County', 'State': 'Alabama', 'State_code': 'AL', 'Male': 12152, 'Female': 10375, 'Median_age': 39.9, 'Lat': 32.998628228925845, 'Long': -87.12647507182314}, {'Fips': 1009, 'County': 'Blount County', 'State': 'Alabama', 'State_code': 'AL', 'Male': 28434, 'Female': 29211, 'Median_age': 40.8, 'Lat': 33.98086883066284, 'Long': -86.56738039931867}, {'Fips': 1011, 'County': 'Bullock County', 'State': 'Alabama', 'State_code': 'AL', 'Male': 5663, 'Female': 4689, 'Median_age': 39.6, 'Lat': 32.10052546439408, 'Long': -85.71567933028645}, {'Fips': 1013, 'County': 'Butler County', 'State': 'Alabama', 'State_code': 'AL', 'Male': 9368, 'Female': 10657, 'Median_age': 40.7, 'Lat': 31.752429155671948, 'Long': -86.68027562242744}, {'Fips': 1015, 'County': 'Calhoun County', 'State': 'Alabama', 'State_code': 'AL', 'Male': 55315, 'Female': 59783, 'Median_age': 39.7, 'Lat': 33.771415664670144, 'Long': -85.82603921906548}  ... displaying 10 of 3220 total bound parameter sets ...  {'Fips': 72151, 'County': 'Yabucoa Municipio', 'State': 'Puerto Rico', 'State_code': 'PR', 'Male': 16541, 'Female': 17608, 'Median_age': 42.5, 'Lat': 18.07046811622328, 'Long': -65.89631139269119}, {'Fips': 72153, 'County': 'Yauco Municipio', 'State': 'Puerto Rico', 'State_code': 'PR', 'Male': 17475, 'Female': 18964, 'Median_age': 43.0, 'Lat': 18.07972783688101, 'Long': -66.85827635834862})]
(Background on this error at: https://sqlalche.me/e/14/f405)
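
One possible cause of this error, stated as an assumption because the definition of the `condados` table is not shown here: PostgreSQL folds unquoted identifiers to lowercase, so a table created with unquoted column names stores the column as `fips`, while the generated INSERT uses the quoted, case-sensitive name "Fips". If that is the case, lowercasing the data frame's column names and index name before loading may resolve it. `lowercase_identifiers` is a hypothetical helper.

```python
import pandas as pd


def lowercase_identifiers(df):
    """Return a copy with lowercase column names and a lowercase index name."""
    out = df.rename(columns=str.lower)
    if isinstance(out.index.name, str):
        # Index.rename returns a new Index, so the original frame is untouched.
        out.index = out.index.rename(out.index.name.lower())
    return out


sample = pd.DataFrame({
    "Fips": [1001],
    "County": ["Autauga County"],
    "State_code": ["AL"],
}).set_index("Fips")

lowered = lowercase_identifiers(sample)
print(lowered.index.name, list(lowered.columns))

# Hypothetical usage with the engine created above:
# lowercase_identifiers(df_counties).to_sql('condados', engine, if_exists='append')
```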