## CLEANING COVID WEEKLY CASES DATA
This notebook documents the cleaning of the weekly COVID case data. This dataset needed relatively little cleaning. Column descriptions are documented in the data_insights document.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("../data/raw_covid.csv")
df.head()

,Week Start Date,Week End Date,Confirmed deaths,Probable deaths,Confirmed and probable deaths,Confirmed cases,Probable cases,Confirmed and probable cases,Last updated
0,1/8/2023,1/14/2023,132,47,179,7075,1645,8720,11/16/2023
1,10/17/2021,10/23/2021,79,4,83,8163,699,8862,11/16/2023
2,11/22/2020,11/28/2020,153,15,168,18177,1003,19180,11/16/2023
3,10/3/2021,10/9/2021,85,2,87,9316,748,10064,11/16/2023
4,4/5/2020,4/11/2020,654,23,677,12145,30,12175,11/16/2023


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Week Start Date                195 non-null    object
 1   Week End Date                  195 non-null    object
 2   Confirmed deaths               195 non-null    int64 
 3   Probable deaths                195 non-null    int64 
 4   Confirmed and probable deaths  195 non-null    int64 
 5   Confirmed cases                195 non-null    int64 
 6   Probable cases                 195 non-null    int64 
 7   Confirmed and probable cases   195 non-null    int64 
 8   Last updated                   195 non-null    object
dtypes: int64(6), object(3)
memory usage: 13.8+ KB


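`df.info()` reports 195 non-null entries in every column, so there are no null rows to worry about; the `dropna()` call below leaves the data unchanged.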

In [4]:
df = df.dropna()
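
One way to confirm whether `dropna()` removes anything is to count missing values per column and compare row counts before and after. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real data) so it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the raw data; the values are made up.
toy = pd.DataFrame({
    "Week End Date": ["1/14/2023", "1/21/2023", "1/28/2023"],
    "Confirmed cases": [7075, np.nan, 6100],
})

nulls_per_column = toy.isna().sum()   # number of missing values in each column
rows_before = len(toy)
rows_after = len(toy.dropna())        # rows left after dropping any row with a null

print(nulls_per_column)
print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows")
```

Run against the real `df`, the same check should report zero nulls, matching the `df.info()` output above.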

Next we extract only the columns we want. To simplify our work, we keep only the confirmed deaths and confirmed cases, along with a single date column (Week End Date), which we convert to datetime. We also sort by Week End Date to put the data in chronological order.

In [5]:
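# Take an explicit copy so the column assignment below does not trigger SettingWithCopyWarning
df = df[["Week End Date", "Confirmed deaths", "Confirmed cases"]].copy()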
df['Week End Date'] = pd.to_datetime(df['Week End Date'])
df = df.sort_values(by='Week End Date')
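
As a minimal sketch of the date conversion, the snippet below parses three of the dates shown in the `df.head()` output with an explicit format string. The `format="%m/%d/%Y"` argument is an assumption based on how those dates look (month/day/year); the notebook itself lets pandas infer the format.

```python
import pandas as pd

# Three rows copied from the df.head() output above.
sample = pd.DataFrame({
    "Week End Date": ["1/14/2023", "10/23/2021", "11/28/2020"],
    "Confirmed cases": [7075, 8163, 18177],
})

# Assumption: dates are month/day/year, as the head() output suggests.
sample["Week End Date"] = pd.to_datetime(sample["Week End Date"], format="%m/%d/%Y")

# Sort chronologically and renumber the rows.
sample = sample.sort_values(by="Week End Date").reset_index(drop=True)
print(sample)
```

Passing an explicit format makes a malformed date raise an error instead of being silently misread.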

Now we sum the weekly values by month to get our final DataFrame.

In [6]:
# Convert each week's end date to a year-month period (e.g. 2020-03).
# 'Week End Date' does not need to be dropped: the aggregation below keeps only the listed columns.
df['date'] = df['Week End Date'].dt.to_period('M')

# Aggregate confirmed cases and confirmed deaths by year and month
df = df.groupby('date').agg({'Confirmed cases': 'sum', 'Confirmed deaths': 'sum'}).reset_index()

df.head(10)

# Note: originally we also aggregated by year, but we did not end up using that result

,date,Confirmed cases,Confirmed deaths
0,2020-02,2,0
1,2020-03,6902,72
2,2020-04,50483,3189
3,2020-05,39135,3696
4,2020-06,6592,688
5,2020-07,5478,208
6,2020-08,9791,148
7,2020-09,10263,98
8,2020-10,28504,229
9,2020-11,65502,457
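
One consequence of grouping on the week's end date is worth spelling out: a week that straddles a month boundary is counted entirely in the month where it ends. The sketch below illustrates this on made-up numbers (`weekly` and its values are hypothetical), using the same `to_period('M')` and `groupby` steps as the cell above.

```python
import pandas as pd

# Hypothetical weekly counts; the week ending 2020-04-04 begins in March.
weekly = pd.DataFrame({
    "Week End Date": pd.to_datetime(["2020-03-21", "2020-03-28", "2020-04-04"]),
    "Confirmed cases": [10, 20, 30],
    "Confirmed deaths": [1, 2, 3],
})

# Label each week with the year-month of its end date.
weekly["date"] = weekly["Week End Date"].dt.to_period("M")

# Sum cases and deaths within each month.
monthly = (
    weekly.groupby("date")
    .agg({"Confirmed cases": "sum", "Confirmed deaths": "sum"})
    .reset_index()
)
print(monthly)
```

Here the third week's 30 cases all land in April, even though most of that week falls in March. For monthly totals at this level of detail that is usually acceptable, but it is an approximation.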


In [7]:
df.to_csv('../data/cleaned_covid.csv', index=False)
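
When the cleaned file is read back, the `date` column arrives as plain strings such as `2020-02` rather than as monthly periods. The sketch below shows one way to restore it; it writes to an in-memory buffer instead of `../data/cleaned_covid.csv` so it can run anywhere, and the three rows are copied from the output above.

```python
import io

import pandas as pd

# First three rows of the aggregated output shown above.
monthly = pd.DataFrame({
    "date": pd.period_range("2020-02", periods=3, freq="M"),
    "Confirmed cases": [2, 6902, 50483],
    "Confirmed deaths": [0, 72, 3189],
})

# Write to an in-memory buffer (stand-in for the CSV file on disk).
buffer = io.StringIO()
monthly.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
# 'date' is read back as strings; convert it to monthly periods again.
restored["date"] = pd.PeriodIndex(restored["date"], freq="M")
print(restored.dtypes)
```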