# Semester Project 2025-2 - Data Management (IN1232C)

Students: Álvaro Molina Jara, Maicol Ramírez Mariño, Vicente Lillo Gallegos

## STAGE 1: Data cleaning and preparation
Time range: 1 month of data (July 2020)


### 1. Load and display the first 5 records of the file 07-01-2020.csv and merge the 31 daily files

In [1]:
import os
import requests
import pandas as pd
from io import StringIO

base_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/"
dfs = []
failed_files = []

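# July has 31 days; the daily report files are named MM-DD-YYYY.csv
for day in range(1, 32):
    day_str = str(day).zfill(2)
    file_name = f"07-{day_str}-2020.csv"
    file_url = f"{base_url}{file_name}"

    try:
        # A timeout keeps the loop from hanging on a stalled request
        response = requests.get(file_url, timeout = 30)
        response.raise_for_status()
        df = pd.read_csv(StringIO(response.text))
        dfs.append(df)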
        print(f"Se cargó exitosamente {file_name}")
    except requests.exceptions.RequestException as e:
        print(f"No se pudo cargar {file_name}: {e}")
        failed_files.append(file_name)
    except Exception as e:
        print(f"Ocurrió un error inesperado al intentar procesar {file_name}: {e}")
        failed_files.append(file_name)

if dfs:
    combined_df = pd.concat(dfs, ignore_index = True)
else:
    print("No se cargaron DataFrames.")

if failed_files:
    print("\nArchivos que no se pudieron cargar:")
    for file in failed_files:
        print(file)

if 'combined_df' in locals() and not combined_df.empty:
    output_filename = 'covid_julio_2020.csv'
    combined_df.to_csv(output_filename, index = False)
    print(f"\nDataFrame guardado como '{output_filename}'")
    file_size_mb = os.path.getsize(output_filename) / (1024 * 1024)
    print(f"Tamaño del archivo: {file_size_mb:.2f} MB")
else:
    print("\nNo hay un DataFrame combinado para guardar.")

Se cargó exitosamente 07-01-2020.csv
Se cargó exitosamente 07-02-2020.csv
Se cargó exitosamente 07-03-2020.csv
Se cargó exitosamente 07-04-2020.csv
Se cargó exitosamente 07-05-2020.csv
Se cargó exitosamente 07-06-2020.csv
Se cargó exitosamente 07-07-2020.csv
Se cargó exitosamente 07-08-2020.csv
Se cargó exitosamente 07-09-2020.csv
Se cargó exitosamente 07-10-2020.csv
Se cargó exitosamente 07-11-2020.csv
Se cargó exitosamente 07-12-2020.csv
Se cargó exitosamente 07-13-2020.csv
Se cargó exitosamente 07-14-2020.csv
Se cargó exitosamente 07-15-2020.csv
Se cargó exitosamente 07-16-2020.csv
Se cargó exitosamente 07-17-2020.csv
Se cargó exitosamente 07-18-2020.csv
Se cargó exitosamente 07-19-2020.csv
Se cargó exitosamente 07-20-2020.csv
Se cargó exitosamente 07-21-2020.csv
Se cargó exitosamente 07-22-2020.csv
Se cargó exitosamente 07-23-2020.csv
Se cargó exitosamente 07-24-2020.csv
Se cargó exitosamente 07-25-2020.csv
Se cargó exitosamente 07-26-2020.csv
Se cargó exitosamente 07-27-2020.csv
S

In [2]:
combined_df.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
0,45001.0,Abbeville,South Carolina,US,2020-07-02 04:33:46,34.223334,-82.461707,113,0,0,113,"Abbeville, South Carolina, US",460.716761,0.0
1,22001.0,Acadia,Louisiana,US,2020-07-02 04:33:46,30.295065,-92.414197,919,37,0,882,"Acadia, Louisiana, US",1481.183012,4.026115
2,51001.0,Accomack,Virginia,US,2020-07-02 04:33:46,37.767072,-75.632346,1043,14,0,1029,"Accomack, Virginia, US",3227.503404,1.342282
3,16001.0,Ada,Idaho,US,2020-07-02 04:33:46,43.452658,-116.241552,2288,23,0,2265,"Ada, Idaho, US",475.095881,1.005245
4,19001.0,Adair,Iowa,US,2020-07-02 04:33:46,41.330756,-94.471059,15,0,0,15,"Adair, Iowa, US",209.731544,0.0
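
The first rows above come from `07-01-2020.csv` but carry a `Last_Update` of 2020-07-02, which suggests that `Last_Update` records when the file was refreshed rather than the day being reported. A minimal sketch, assuming the file name is the more reliable source for the report date: each daily frame is tagged with a date parsed from its name before concatenation. The helper `tag_report_date` and the `report_date` column are hypothetical names, and the two small frames stand in for the downloaded files.

```python
import pandas as pd

def tag_report_date(df: pd.DataFrame, file_name: str) -> pd.DataFrame:
    """Return a copy of df with a report_date column parsed from a name like '07-01-2020.csv'."""
    stem = file_name.replace(".csv", "")
    report_date = pd.to_datetime(stem, format="%m-%d-%Y")
    return df.assign(report_date=report_date)

# Toy stand-ins for two downloaded daily files (values are illustrative)
day1 = pd.DataFrame({"Country_Region": ["US"], "Confirmed": [113]})
day2 = pd.DataFrame({"Country_Region": ["US"], "Confirmed": [120]})

tagged = pd.concat(
    [tag_report_date(day1, "07-01-2020.csv"), tag_report_date(day2, "07-02-2020.csv")],
    ignore_index=True,
)
print(tagged)
```

With a column like this, filtering for July does not depend on the update timestamp, which for the last file falls on 2020-08-01.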


The `Incidence_Rate` column is renamed to `Incident_Rate` for standardization:

In [3]:
combined_df.rename(columns = {'Incidence_Rate': 'Incident_Rate'}, inplace = True)

### 2. Show the total number of rows and columns of the DataFrame


#### Number of rows

In [4]:
combined_df.shape[0]

120885

#### Number of columns

In [5]:
combined_df.shape[1]

14

### 3. Describe the data types (dtypes) and convert the necessary columns (for example, dates)


In [6]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120885 entries, 0 to 120884
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   FIPS                 98491 non-null   float64
 1   Admin2               98629 non-null   object 
 2   Province_State       115413 non-null  object 
 3   Country_Region       120885 non-null  object 
 4   Last_Update          120885 non-null  object 
 5   Lat                  118412 non-null  float64
 6   Long_                118412 non-null  float64
 7   Confirmed            120885 non-null  int64  
 8   Deaths               120885 non-null  int64  
 9   Recovered            120885 non-null  int64  
 10  Active               120885 non-null  int64  
 11  Combined_Key         120885 non-null  object 
 12  Incident_Rate        118412 non-null  float64
 13  Case-Fatality_Ratio  118907 non-null  float64
dtypes: float64(5), int64(4), object(5)
memory usage: 12.9+ MB


In [7]:
combined_df['Last_Update'] = pd.to_datetime(combined_df['Last_Update'])

In [8]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120885 entries, 0 to 120884
Data columns (total 14 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   FIPS                 98491 non-null   float64       
 1   Admin2               98629 non-null   object        
 2   Province_State       115413 non-null  object        
 3   Country_Region       120885 non-null  object        
 4   Last_Update          120885 non-null  datetime64[ns]
 5   Lat                  118412 non-null  float64       
 6   Long_                118412 non-null  float64       
 7   Confirmed            120885 non-null  int64         
 8   Deaths               120885 non-null  int64         
 9   Recovered            120885 non-null  int64         
 10  Active               120885 non-null  int64         
 11  Combined_Key         120885 non-null  object        
 12  Incident_Rate        118412 non-null  float64       
 13  Case-Fatality_

### 4. Detect and show null or missing values per column


In [9]:
combined_df_nan = combined_df[combined_df.isnull().any(axis = 1)]
combined_df_nan

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case-Fatality_Ratio
84,31005.0,Arthur,Nebraska,US,2020-07-02 04:33:46,41.568961,-101.695956,0,0,0,0,"Arthur, Nebraska, US",0.000000,
145,51017.0,Bath,Virginia,US,2020-07-02 04:33:46,38.058526,-79.739121,0,0,0,0,"Bath, Virginia, US",0.000000,
153,,Bear River,Utah,US,2020-07-02 04:33:46,41.521068,-113.083282,1538,2,0,1536,"Bear River, Utah, US",823.261142,0.130039
210,31009.0,Blaine,Nebraska,US,2020-07-02 04:33:46,41.913117,-99.976778,0,0,0,0,"Blaine, Nebraska, US",0.000000,
280,31017.0,Brown,Nebraska,US,2020-07-02 04:33:46,42.430189,-99.929041,0,0,0,0,"Brown, Nebraska, US",0.000000,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120880,,,Unknown,Ukraine,2020-08-01 04:36:27,,,0,0,0,0,"Unknown, Ukraine",0.000000,0.000000
120881,,,,Nauru,2020-08-01 04:36:27,-0.522800,166.931500,0,0,0,0,Nauru,0.000000,0.000000
120882,,,Niue,New Zealand,2020-08-01 04:36:27,-19.054400,-169.867200,0,0,0,0,"Niue, New Zealand",0.000000,0.000000
120883,,,,Tuvalu,2020-08-01 04:36:27,-7.109500,177.649300,0,0,0,0,Tuvalu,0.000000,0.000000
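
The cell above lists the rows that contain at least one null. For a count per column, as the section title asks, a minimal sketch is shown here on a small illustrative frame (the values are made up; on the real data the same calls would be applied to `combined_df`):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with a few missing values
sample = pd.DataFrame({
    "FIPS": [45001.0, np.nan, np.nan],
    "Province_State": ["South Carolina", "Utah", None],
    "Confirmed": [113, 1538, 0],
})

null_counts = sample.isnull().sum()                   # nulls per column
null_share = (sample.isnull().mean() * 100).round(1)  # percentage of rows that are null

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary)
```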


### 5. Drop irrelevant columns (for example, FIPS codes or coordinates if they will not be used)


In [10]:
combined_df.drop(['Admin2', 'Combined_Key', 'FIPS', 'Lat', 'Long_'], axis = 1, inplace = True)
combined_df.head()

Unnamed: 0,Province_State,Country_Region,Last_Update,Confirmed,Deaths,Recovered,Active,Incident_Rate,Case-Fatality_Ratio
0,South Carolina,US,2020-07-02 04:33:46,113,0,0,113,460.716761,0.0
1,Louisiana,US,2020-07-02 04:33:46,919,37,0,882,1481.183012,4.026115
2,Virginia,US,2020-07-02 04:33:46,1043,14,0,1029,3227.503404,1.342282
3,Idaho,US,2020-07-02 04:33:46,2288,23,0,2265,475.095881,1.005245
4,Iowa,US,2020-07-02 04:33:46,15,0,0,15,209.731544,0.0


### 6. Standardize column names (snake_case format)


In [11]:
combined_df.columns = combined_df.columns.str.lower().str.replace(' ', '_').str.replace('-', '_')
combined_df

Unnamed: 0,province_state,country_region,last_update,confirmed,deaths,recovered,active,incident_rate,case_fatality_ratio
0,South Carolina,US,2020-07-02 04:33:46,113,0,0,113,460.716761,0.000000
1,Louisiana,US,2020-07-02 04:33:46,919,37,0,882,1481.183012,4.026115
2,Virginia,US,2020-07-02 04:33:46,1043,14,0,1029,3227.503404,1.342282
3,Idaho,US,2020-07-02 04:33:46,2288,23,0,2265,475.095881,1.005245
4,Iowa,US,2020-07-02 04:33:46,15,0,0,15,209.731544,0.000000
...,...,...,...,...,...,...,...,...,...
120880,Unknown,Ukraine,2020-08-01 04:36:27,0,0,0,0,0.000000,0.000000
120881,,Nauru,2020-08-01 04:36:27,0,0,0,0,0.000000,0.000000
120882,Niue,New Zealand,2020-08-01 04:36:27,0,0,0,0,0.000000,0.000000
120883,,Tuvalu,2020-08-01 04:36:27,0,0,0,0,0.000000,0.000000
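
As an alternative sketch, a single regular expression can replace any run of spaces or hyphens with one underscore. The column list below is illustrative; `Last Update` is included only to show the handling of spaces and is not one of the columns in this dataset.

```python
import pandas as pd

# Illustrative column names covering underscores, spaces, and hyphens
cols = pd.Index(["Province_State", "Last Update", "Case-Fatality_Ratio", "Long_"])

# Lowercase, then collapse runs of whitespace or hyphens into a single underscore
snake = cols.str.strip().str.lower().str.replace(r"[\s\-]+", "_", regex=True)
print(list(snake))
```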


### 7. Homogenize country names (e.g. “US” → “United States”)


In [12]:
combined_df['country_region'] = combined_df['country_region'].replace('US', 'United States')
combined_df

Unnamed: 0,province_state,country_region,last_update,confirmed,deaths,recovered,active,incident_rate,case_fatality_ratio
0,South Carolina,United States,2020-07-02 04:33:46,113,0,0,113,460.716761,0.000000
1,Louisiana,United States,2020-07-02 04:33:46,919,37,0,882,1481.183012,4.026115
2,Virginia,United States,2020-07-02 04:33:46,1043,14,0,1029,3227.503404,1.342282
3,Idaho,United States,2020-07-02 04:33:46,2288,23,0,2265,475.095881,1.005245
4,Iowa,United States,2020-07-02 04:33:46,15,0,0,15,209.731544,0.000000
...,...,...,...,...,...,...,...,...,...
120880,Unknown,Ukraine,2020-08-01 04:36:27,0,0,0,0,0.000000,0.000000
120881,,Nauru,2020-08-01 04:36:27,0,0,0,0,0.000000,0.000000
120882,Niue,New Zealand,2020-08-01 04:36:27,0,0,0,0,0.000000,0.000000
120883,,Tuvalu,2020-08-01 04:36:27,0,0,0,0,0.000000,0.000000
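
If more than one name needs homogenizing, `replace` also accepts a dictionary. The sketch below assumes a hypothetical mapping `country_map`; only the `US` entry comes from this notebook, and the other entries are examples that would need to be checked against the actual values in `country_region`.

```python
import pandas as pd

# Hypothetical mapping; extend it after inspecting country_region.unique()
country_map = {
    "US": "United States",
    "Korea, South": "South Korea",
    "Taiwan*": "Taiwan",
}

countries = pd.Series(["US", "Korea, South", "Chile", "Taiwan*"])
homogenized = countries.replace(country_map)  # values not in the mapping are left unchanged
print(homogenized.tolist())
```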


### 8. Convert the Last_Update column to YYYY-MM-DD format

In [13]:
combined_df['last_update'] = pd.to_datetime(combined_df['last_update'], errors = 'coerce')
combined_df['last_update'] = combined_df['last_update'].dt.strftime('%Y-%m-%d')
combined_df.head()

Unnamed: 0,province_state,country_region,last_update,confirmed,deaths,recovered,active,incident_rate,case_fatality_ratio
0,South Carolina,United States,2020-07-02,113,0,0,113,460.716761,0.0
1,Louisiana,United States,2020-07-02,919,37,0,882,1481.183012,4.026115
2,Virginia,United States,2020-07-02,1043,14,0,1029,3227.503404,1.342282
3,Idaho,United States,2020-07-02,2288,23,0,2265,475.095881,1.005245
4,Iowa,United States,2020-07-02,15,0,0,15,209.731544,0.0
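
Note that `dt.strftime` returns text, so `last_update` stops being a datetime column (the later `info()` output lists it as `object`). If date arithmetic or date-range filtering is needed afterwards, one option, sketched here on two sample timestamps, is `dt.normalize()`, which drops the time of day but keeps the datetime dtype:

```python
import pandas as pd

timestamps = pd.Series(pd.to_datetime(["2020-07-02 04:33:46", "2020-08-01 04:36:27"]))

as_text = timestamps.dt.strftime("%Y-%m-%d")  # strings such as '2020-07-02'
as_date = timestamps.dt.normalize()           # still datetime64, time set to 00:00:00

print(as_text.tolist())
print(as_date.dtype)
```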


### 9. Create a column active_cases = Confirmed - Deaths - Recovered

First, the columns with null values (NaN) are handled:

In [14]:
print("Primeras 5 filas antes de limpiar los valores NaN:")
display(combined_df.head())

print("\nRecuento de valores NaN por columna antes de la limpieza:")
print(combined_df[['country_region', 'province_state', 'confirmed', 'deaths', 'recovered', 'active', 'incident_rate', 'case_fatality_ratio']].isnull().sum())

combined_df['country_region'] = combined_df['country_region'].fillna('Unknown')
combined_df['province_state'] = combined_df['province_state'].fillna('Unknown')
combined_df['confirmed'] = combined_df['confirmed'].fillna(0)
combined_df['deaths'] = combined_df['deaths'].fillna(0)
combined_df['recovered'] = combined_df['recovered'].fillna(0)
combined_df['active'] = combined_df['active'].fillna(0)
combined_df['incident_rate'] = combined_df['incident_rate'].fillna(0)
combined_df['case_fatality_ratio'] = combined_df['case_fatality_ratio'].fillna(0)

print("\nRecuento de valores NaN por columna después de la limpieza:")
print(combined_df[['country_region', 'province_state', 'confirmed', 'deaths', 'recovered', 'active', 'incident_rate', 'case_fatality_ratio']].isnull().sum())

print("\nPrimeras 5 filas después de limpiar los valores NaN:")
display(combined_df.head())

Primeras 5 filas antes de limpiar los valores NaN:


Unnamed: 0,province_state,country_region,last_update,confirmed,deaths,recovered,active,incident_rate,case_fatality_ratio
0,South Carolina,United States,2020-07-02,113,0,0,113,460.716761,0.0
1,Louisiana,United States,2020-07-02,919,37,0,882,1481.183012,4.026115
2,Virginia,United States,2020-07-02,1043,14,0,1029,3227.503404,1.342282
3,Idaho,United States,2020-07-02,2288,23,0,2265,475.095881,1.005245
4,Iowa,United States,2020-07-02,15,0,0,15,209.731544,0.0



Recuento de valores NaN por columna antes de la limpieza:
country_region            0
province_state         5472
confirmed                 0
deaths                    0
recovered                 0
active                    0
incident_rate          2473
case_fatality_ratio    1978
dtype: int64

Recuento de valores NaN por columna después de la limpieza:
country_region         0
province_state         0
confirmed              0
deaths                 0
recovered              0
active                 0
incident_rate          0
case_fatality_ratio    0
dtype: int64

Primeras 5 filas después de limpiar los valores NaN:


Unnamed: 0,province_state,country_region,last_update,confirmed,deaths,recovered,active,incident_rate,case_fatality_ratio
0,South Carolina,United States,2020-07-02,113,0,0,113,460.716761,0.0
1,Louisiana,United States,2020-07-02,919,37,0,882,1481.183012,4.026115
2,Virginia,United States,2020-07-02,1043,14,0,1029,3227.503404,1.342282
3,Idaho,United States,2020-07-02,2288,23,0,2265,475.095881,1.005245
4,Iowa,United States,2020-07-02,15,0,0,15,209.731544,0.0
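
The column-by-column fills above can also be written as a single `fillna` call with a dictionary of defaults. This is a minimal sketch on an illustrative frame; `fill_values` is a hypothetical name. Filling rates with 0 makes a missing rate indistinguishable from a true zero, so whether 0 is the right default depends on the later analysis.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "province_state": ["Utah", None],
    "incident_rate": [823.26, np.nan],
    "case_fatality_ratio": [np.nan, 0.13],
})

# One default per column: a label for text, zero for numeric rates
fill_values = {"province_state": "Unknown", "incident_rate": 0, "case_fatality_ratio": 0}
filled = sample.fillna(value=fill_values)
print(filled)
```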


Next, the `active_cases` column is created:

In [15]:
combined_df['active_cases'] = combined_df['confirmed'] - combined_df['deaths'] - combined_df['recovered']
combined_df

Unnamed: 0,province_state,country_region,last_update,confirmed,deaths,recovered,active,incident_rate,case_fatality_ratio,active_cases
0,South Carolina,United States,2020-07-02,113,0,0,113,460.716761,0.000000,113
1,Louisiana,United States,2020-07-02,919,37,0,882,1481.183012,4.026115,882
2,Virginia,United States,2020-07-02,1043,14,0,1029,3227.503404,1.342282,1029
3,Idaho,United States,2020-07-02,2288,23,0,2265,475.095881,1.005245,2265
4,Iowa,United States,2020-07-02,15,0,0,15,209.731544,0.000000,15
...,...,...,...,...,...,...,...,...,...,...
120880,Unknown,Ukraine,2020-08-01,0,0,0,0,0.000000,0.000000,0
120881,Unknown,Nauru,2020-08-01,0,0,0,0,0.000000,0.000000,0
120882,Niue,New Zealand,2020-08-01,0,0,0,0,0.000000,0.000000,0
120883,Unknown,Tuvalu,2020-08-01,0,0,0,0,0.000000,0.000000,0
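
Before discarding the original `active` column, the two columns can be compared to see where they disagree. A minimal sketch on an illustrative frame (the third row imitates a record whose `active` value is 0 even though the other counts are present):

```python
import pandas as pd

sample = pd.DataFrame({
    "confirmed": [113, 919, 50],
    "deaths": [0, 37, 1],
    "recovered": [0, 0, 9],
    "active": [113, 882, 0],  # last value is inconsistent with the other columns
})

sample["active_cases"] = sample["confirmed"] - sample["deaths"] - sample["recovered"]

# Rows where the reported value differs from the computed one
mismatch = sample[sample["active"] != sample["active_cases"]]
print(mismatch)
```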


The `active` column is dropped: it is now redundant with `active_cases`, and it also contained 0 values in rows that did have confirmed, death, and recovered data:

In [16]:
combined_df.drop(['active'], axis = 1, inplace = True)

In [17]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120885 entries, 0 to 120884
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   province_state       120885 non-null  object 
 1   country_region       120885 non-null  object 
 2   last_update          120885 non-null  object 
 3   confirmed            120885 non-null  int64  
 4   deaths               120885 non-null  int64  
 5   recovered            120885 non-null  int64  
 6   incident_rate        120885 non-null  float64
 7   case_fatality_ratio  120885 non-null  float64
 8   active_cases         120885 non-null  int64  
dtypes: float64(2), int64(4), object(3)
memory usage: 8.3+ MB


Additionally, duplicate rows are removed:

In [18]:
combined_df[combined_df.duplicated(keep = False)].head(30)

Unnamed: 0,province_state,country_region,last_update,confirmed,deaths,recovered,incident_rate,case_fatality_ratio,active_cases
84,Nebraska,United States,2020-07-02,0,0,0,0.0,0.0,0
145,Virginia,United States,2020-07-02,0,0,0,0.0,0.0,0
210,Nebraska,United States,2020-07-02,0,0,0,0.0,0.0,0
280,Nebraska,United States,2020-07-02,0,0,0,0.0,0.0,0
323,South Dakota,United States,2020-07-02,0,0,0,0.0,0.0,0
754,Nebraska,United States,2020-07-02,0,0,0,0.0,0.0,0
801,Nebraska,United States,2020-07-02,0,0,0,0.0,0.0,0
860,Nevada,United States,2020-07-02,0,0,0,0.0,0.0,0
1013,Oregon,United States,2020-07-02,0,0,0,0.0,0.0,0
1054,Nebraska,United States,2020-07-02,0,0,0,0.0,0.0,0


In [20]:
combined_df.drop_duplicates(inplace = True)

In [24]:
combined_df[combined_df.duplicated(keep = False)].head(30)

Unnamed: 0,province_state,country_region,last_update,confirmed,deaths,recovered,incident_rate,case_fatality_ratio,active_cases
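
Rows 84, 210, and 280, flagged as duplicates above, correspond to different Nebraska counties (Arthur, Blaine, Brown) in the section 4 output. They look identical only because `Admin2` was dropped, so `drop_duplicates` can also remove distinct counties that happen to report the same figures. A sketch of an alternative, assuming state-level totals are what is needed: aggregate with `groupby` instead of deduplicating. The frame below is illustrative.

```python
import pandas as pd

# Three county-level rows that became indistinguishable after dropping the county name
rows = pd.DataFrame({
    "province_state": ["Nebraska", "Nebraska", "Nebraska"],
    "country_region": ["United States"] * 3,
    "last_update": ["2020-07-02"] * 3,
    "confirmed": [5, 5, 12],
    "deaths": [0, 0, 1],
})

deduped = rows.drop_duplicates()  # silently drops one of the two 5-case counties

aggregated = rows.groupby(
    ["country_region", "province_state", "last_update"], as_index=False
)[["confirmed", "deaths"]].sum()

print(deduped["confirmed"].sum(), aggregated.loc[0, "confirmed"])
```

In this toy case deduplication loses 5 confirmed cases, while the aggregation keeps the full state total.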


### 10. Save the clean DataFrame as covid_clean_julio2020.csv and report its size in MB

In [23]:
clean_output_filename = 'covid_clean_julio2020.csv'
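# Save the cleaned combined_df. This call originally used `df`, which only holds
# the last daily file read in the loading loop; the size recorded below is from that run.
combined_df.to_csv(clean_output_filename, index = False)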
print(f"DataFrame guardado como '{clean_output_filename}'")

file_size_mb = os.path.getsize(clean_output_filename) / (1024 * 1024)
print(f"Tamaño del archivo: {file_size_mb:.2f} MB")

DataFrame guardado como 'covid_clean_julio2020.csv'
Tamaño del archivo: 0.52 MB


## DataFrame optimization

### Load time and memory usage of the DataFrame before the optimizations

The load time and memory usage of the DataFrame are measured before applying any optimization:

In [25]:
import time

start_time = time.time()
combined_df = pd.read_csv('covid_julio_2020.csv')
end_time = time.time()
loading_time = end_time - start_time
memory_usage_mb = combined_df.memory_usage(deep = True).sum() / (1024 * 1024)
print(f"Tiempo de carga del archivo: {loading_time:.2f} segundos")
print(f"Uso de memoria del DataFrame 'combined_df': {memory_usage_mb:.2f} MB")

Tiempo de carga del archivo: 0.45 segundos
Uso de memoria del DataFrame 'combined_df': 42.58 MB


In [26]:
print("\nUso de memoria por columna (antes de la optimización):")
print(combined_df.memory_usage(deep = True) / (1024 * 1024))


Uso de memoria por columna (antes de la optimización):
Index                  0.000126
FIPS                   0.922279
Admin2                 5.958440
Province_State         6.496505
Country_Region         5.983093
Last_Update            7.839375
Lat                    0.922279
Long_                  0.922279
Confirmed              0.922279
Deaths                 0.922279
Recovered              0.922279
Active                 0.922279
Combined_Key           7.997908
Incidence_Rate         0.922279
Case-Fatality_Ratio    0.922279
dtype: float64
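
For short intervals like these, `time.perf_counter()` is generally a better fit than `time.time()`, since it is a monotonic, high-resolution clock intended for measuring elapsed time. A minimal sketch of the same measurement, using an in-memory CSV as a stand-in for `covid_julio_2020.csv`:

```python
import io
import time

import pandas as pd

# In-memory stand-in for the real CSV file (illustrative content)
csv_text = "Country_Region,Confirmed\nUS,113\nChile,15\n"

start = time.perf_counter()
frame = pd.read_csv(io.StringIO(csv_text))
elapsed = time.perf_counter() - start

# deep=True counts the actual size of the strings, not just the pointers to them
memory_mb = frame.memory_usage(deep=True).sum() / (1024 * 1024)

print(f"Load time: {elapsed:.4f} s, memory: {memory_mb:.6f} MB")
```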


### Optimizing DataFrame loading using Dask instead of Pandas


**Justification**: Dask is designed for large datasets, including ones that exceed available RAM, because it splits the data into partitions and evaluates operations lazily. Pandas, by contrast, loads the whole dataset into memory, so it is best suited to small and medium datasets that fit in the available RAM.



In [30]:
import dask.dataframe as dd

file_path = 'covid_julio_2020.csv'

dtype_dict = {
    'Province_State': 'object',
    'Country_Region': 'object',
    'Last_Update': 'object',
    'Confirmed': 'float64',
    'Deaths': 'float64',
    'Recovered': 'float64',
    'Lat': 'float64',
    'Long_': 'float64',
    'FIPS': 'float64',
    'Admin2': 'object',
    'Active': 'float64',
    'Combined_Key': 'object',
    'Incidence_Rate': 'float64',
    'Case-Fatality_Ratio': 'float64'
}

print(f"Cargando '{file_path}' con Dask...")
start_time = time.time()
dask_df = dd.read_csv(file_path, dtype = dtype_dict, assume_missing = True)
end_time = time.time()
loading_time_dask = end_time - start_time
print(f"Tiempo de carga con Dask (solo metadatos): {loading_time_dask:.2f} segundos")

print("Persistiendo el DataFrame de Dask en memoria...")
start_time_persist = time.time()
dask_df_persisted = dask_df.persist()
_ = dask_df_persisted.head()
end_time_persist = time.time()
persist_time = end_time_persist - start_time_persist
print(f"Tiempo de persistencia de Dask: {persist_time:.2f} segundos")

print("\nEstructura del DataFrame de Dask:")
print(dask_df_persisted)
print(f"Número de particiones: {dask_df_persisted.npartitions}")

Cargando 'covid_julio_2020.csv' con Dask...
Tiempo de carga con Dask (solo metadatos): 0.03 segundos
Persistiendo el DataFrame de Dask en memoria...
Tiempo de persistencia de Dask: 0.51 segundos

Estructura del DataFrame de Dask:
Dask DataFrame Structure:
                  FIPS  Admin2 Province_State Country_Region Last_Update      Lat    Long_ Confirmed   Deaths Recovered   Active Combined_Key Incidence_Rate Case-Fatality_Ratio
npartitions=1                                                                                                                                                                   
               float64  string         string         string      string  float64  float64   float64  float64   float64  float64       string        float64             float64
                   ...     ...            ...            ...         ...      ...      ...       ...      ...       ...      ...          ...            ...                 ...
Dask Name: read, 1 expression
Expr=F

In [31]:
memory_usage_dask_mb = dask_df_persisted.memory_usage(deep = True).sum().compute() / (1024 * 1024)
print(f"Uso de memoria del DataFrame de Dask (persisted): {memory_usage_dask_mb:.2f} MB")

Uso de memoria del DataFrame de Dask (persisted): 19.42 MB


### Summary of the Dask optimization

#### Load time comparison:
*   **Pandas (full load):** `0.45 seconds`
*   **Dask (metadata only):** `0.03 seconds`
*   **Dask (persist and compute):** `0.51 seconds`

#### Memory usage comparison:
*   **Pandas DataFrame:** `42.58 MB`
*   **Dask DataFrame (persisted):** `19.42 MB`

The persisted Dask DataFrame uses significantly less memory than the Pandas DataFrame (19.42 MB versus 42.58 MB). The Dask structure output lists the text columns as `string` rather than `object`, so much of this saving likely comes from more compact string storage rather than from Dask's partitioning. For this file, the total time to load and persist with Dask (0.51 seconds) is slightly higher than the Pandas load (0.45 seconds). Dask's advantage lies in processing data that exceeds available memory, since it operates lazily and in parallel. The initial Dask call takes only 0.03 seconds because it reads metadata without loading the data, which allows the structure of the DataFrame to be inspected quickly.

### Data type optimization

The numeric columns (`Confirmed`, `Deaths`, `Recovered`, `Active`, `Incidence_Rate`, `Case-Fatality_Ratio`, `Lat`, `Long_`, `FIPS`) are converted to the more memory-efficient `float32`, the text columns (`Province_State`, `Country_Region`, `Admin2`, `Combined_Key`) to `category`, and the `last_update` column to datetime:

In [32]:
optimized_dtypes = {
    'Confirmed': 'float32',
    'Deaths': 'float32',
    'Recovered': 'float32',
    'Lat': 'float32',
    'Long_': 'float32',
    'FIPS': 'float32',
    'Active': 'float32',
    'Incidence_Rate': 'float32',
    'Case-Fatality_Ratio': 'float32',
    'Province_State': 'category',
    'Country_Region': 'category',
    'Admin2': 'category',
    'Combined_Key': 'category'
}

dask_df_optimized = dask_df_persisted.astype({k: v for k, v in optimized_dtypes.items() if k in dask_df_persisted.columns})
dask_df_optimized.columns = dask_df_optimized.columns.str.lower().str.replace(' ', '_')
dask_df_optimized.columns = dask_df_optimized.columns.str.lower().str.replace('-', '_')
dask_df_optimized['last_update'] = dd.to_datetime(dask_df_optimized['last_update'], errors = 'coerce')
print("Persistiendo el DataFrame de Dask optimizado en memoria...")
dask_df_optimized = dask_df_optimized.persist()
print("DataFrame de Dask optimizado y persistido.")
print("\nTipos de datos del DataFrame de Dask optimizado:")
print(dask_df_optimized.dtypes)
memory_usage_dask_optimized_mb = dask_df_optimized.memory_usage(deep=True).sum().compute() / (1024 * 1024)
print(f"\nUso de memoria del DataFrame de Dask optimizado: {memory_usage_dask_optimized_mb:.2f} MB")

Persistiendo el DataFrame de Dask optimizado en memoria...
DataFrame de Dask optimizado y persistido.

Tipos de datos del DataFrame de Dask optimizado:
fips                          float32
admin2                       category
province_state               category
country_region               category
last_update            datetime64[ns]
lat                           float32
long_                         float32
confirmed                     float32
deaths                        float32
recovered                     float32
active                        float32
combined_key                 category
incidence_rate                float32
case_fatality_ratio           float32
dtype: object

Uso de memoria del DataFrame de Dask optimizado: 6.35 MB
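
The same effect can be reproduced in plain Pandas. The sketch below uses a synthetic frame (not the project data) with a repetitive text column and a numeric column, and compares memory before and after the conversion. It also checks one caveat: `float32` represents integers exactly only up to 2**24 (16,777,216), so for large counts an integer type such as `int32` may be the safer choice.

```python
import numpy as np
import pandas as pd

n = 10_000
frame = pd.DataFrame({
    "country_region": ["United States", "Chile"] * (n // 2),  # few distinct values
    "confirmed": [113.0, 15.0] * (n // 2),
})

before = frame.memory_usage(deep=True).sum()
optimized = frame.astype({"country_region": "category", "confirmed": "float32"})
after = optimized.memory_usage(deep=True).sum()

print(f"Before: {before / 1024:.1f} KB, after: {after / 1024:.1f} KB")

# float32 cannot distinguish 16,777,217 from 16,777,216
rounded = float(np.float32(16_777_217))
print(rounded)
```

`category` pays off here because the column has very few distinct values compared with its length, which is also the case for country and state names in the daily reports.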


### Using indexes for time-based operations


The `last_update` column is set as the DataFrame index (after its conversion to datetime) to speed up time-based operations such as filtering and grouping by date range:

In [33]:
print("Estableciendo 'last_update' como índice y persistiendo...")
dask_df_optimized = dask_df_optimized.set_index('last_update')
dask_df_optimized = dask_df_optimized.persist()

print("\nEstructura del DataFrame de Dask optimizado con índice de fecha:")
print(dask_df_optimized)

Estableciendo 'last_update' como índice y persistiendo...

Estructura del DataFrame de Dask optimizado con índice de fecha:
Dask DataFrame Structure:
                  fips             admin2     province_state     country_region      lat    long_ confirmed   deaths recovered   active       combined_key incidence_rate case_fatality_ratio
npartitions=1                                                                                                                                                                                
               float32  category[unknown]  category[unknown]  category[unknown]  float32  float32   float32  float32   float32  float32  category[unknown]        float32             float32
                   ...                ...                ...                ...      ...      ...       ...      ...       ...      ...                ...            ...                 ...
Dask Name: operation, 1 expression
Expr=FromGraph(abf3def)
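
To illustrate what a date index allows, here is a minimal sketch in plain Pandas rather than Dask, on an illustrative frame. Once the index is a sorted datetime index, a date range can be selected with label slicing:

```python
import pandas as pd

frame = pd.DataFrame({
    "last_update": pd.to_datetime(["2020-07-03", "2020-07-01", "2020-07-10", "2020-07-05"]),
    "confirmed": [3, 1, 10, 5],
})

# Sorting the index lets Pandas locate the slice boundaries efficiently
indexed = frame.set_index("last_update").sort_index()

# Label slicing on a datetime index includes both endpoints
first_week = indexed.loc["2020-07-01":"2020-07-07"]
print(first_week)
```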


### Conclusions on performance improvements after the optimizations

**1. What were the load performance for "covid_julio_2020.csv" and the memory usage of the `combined_df` DataFrame before the optimizations?**

Initially, loading `covid_julio_2020.csv` into a Pandas DataFrame took 0.45 seconds, and `combined_df` used 42.58 MB of memory. The `Combined_Key`, `Last_Update`, `Province_State`, `Country_Region`, and `Admin2` columns used the most memory.

**2. How did Dask and the data type optimizations improve performance and memory usage?**

With Dask, the initial metadata load was very fast (0.03 seconds). Persisting the Dask DataFrame, which triggers the actual computation, took 0.51 seconds, slightly longer than the Pandas load. Even before the data type optimization, the Dask DataFrame used 19.42 MB instead of 42.58 MB. After optimizing the data types (e.g. `float32`, `category`, and `datetime`), memory usage dropped further, to 6.35 MB. Given Google Colab's memory limits when working with large datasets, this reduction is very relevant.
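
The relative savings implied by the three measurements reported above can be worked out directly. The helper `reduction_percent` is a hypothetical name introduced for this calculation:

```python
# Memory figures reported in this notebook, in MB
pandas_mb = 42.58
dask_mb = 19.42
dask_optimized_mb = 6.35

def reduction_percent(before: float, after: float) -> float:
    """Percentage of memory saved when going from `before` to `after`."""
    return (1 - after / before) * 100

dask_reduction = reduction_percent(pandas_mb, dask_mb)
optimized_reduction = reduction_percent(pandas_mb, dask_optimized_mb)

print(f"Dask vs Pandas: {dask_reduction:.1f}% less memory")
print(f"Optimized Dask vs Pandas: {optimized_reduction:.1f}% less memory")
```

This gives roughly a 54.4% reduction for the persisted Dask DataFrame and roughly 85.1% after the data type optimization, both relative to the Pandas baseline.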
