# Data Sourcing and Cleaning

## Data Sources:

1. Daily state-wise COVID-19 cases for India: [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19/tree/ef15d99458d44aa9bc03c0726c609643e6f90d3b)

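2. Daily COVID-19 vaccinations for India (national level): [Data on COVID-19 (coronavirus) vaccinations by Our World in Data](https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data/India.csv)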

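3. Daily state-wise COVID-19 cases for USA: [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19/tree/ef15d99458d44aa9bc03c0726c609643e6f90d3b)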

4. Daily state-wise COVID-19 vaccinations for USA: [Data on COVID-19 (coronavirus) vaccinations by Our World in Data](https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations)

In [12]:
import os
from datetime import datetime
import pandas as pd

## 1. Getting daily state-wise cases for India

- We have linked the repository [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19/tree/ef15d99458d44aa9bc03c0726c609643e6f90d3b) as a submodule to our repository
- The data for all countries (aside from USA) are stored under `COVID-19/csse_covid_19_data/csse_covid_19_daily_reports` as `.csv` files, with one for every date
- The files are named in the `mm-dd-yyyy.csv` format
- We keep only the rows for India, using the files from `01-22-2020.csv` to `10-12-2021.csv`
- Some files use the column name `Country/Region` while others use the column name `Country_Region` for the country in which the cases were recorded
- A `Date` column has been added
- The aggregated time-series data from all the files are stored in a single file `india_cases_<last_case_date_in_dd-mm-yyyy>.csv`

In [13]:
# Store the names of all the daily case files and sort by date
daily_reports = filter(lambda x: x.endswith('csv'), os.listdir("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports"))
sorted_files = sorted(daily_reports, key=lambda file: datetime.strptime(file, '%m-%d-%Y.csv'))

india_cases = pd.DataFrame()


# Aggregate only for India and store in india_cases_<last_case_date>.csv
for file in sorted_files:
    filename = os.path.join("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports", file)

    temp_df = pd.read_csv(filename)

    # Add a column for the date
    temp_df['Date'] = datetime.strptime(file, '%m-%d-%Y.csv')
    
    # Make the column name Country_Region for consistency
    if 'Country/Region' in temp_df.columns:
        temp_df.rename(columns = {'Country/Region':'Country_Region'}, inplace = True)
    
    # Append this day's India rows; pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    india_cases = pd.concat([india_cases, temp_df[temp_df['Country_Region'] == 'India']])

india_cases.reset_index(drop=True, inplace=True)

# Find the last date for which cases were recorded from the sorted list of files
last_case_date = datetime.strptime(sorted_files[-1], '%m-%d-%Y.csv').strftime('%d-%m-%Y')

# Store the aggregated time-series data from all the daily case files into a single file
os.makedirs('./raw_data_%s' % last_case_date, exist_ok=True)
india_cases.to_csv('./raw_data_%s/india_cases_%s.csv' % (last_case_date, last_case_date))
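
The aggregation steps above can also be sketched as a reusable function that collects the per-day frames in a list and concatenates them once, which avoids copying the growing DataFrame on every iteration. The name `load_daily_reports` and its parameters are hypothetical and not part of this notebook, and the demonstration uses two tiny made-up files in a temporary directory rather than the real submodule.

```python
import os
import tempfile
from datetime import datetime

import pandas as pd


def load_daily_reports(directory, country=None):
    """Hypothetical helper: stack all mm-dd-yyyy.csv files found in `directory`."""
    files = sorted(
        (f for f in os.listdir(directory) if f.endswith('.csv')),
        key=lambda f: datetime.strptime(f, '%m-%d-%Y.csv'),
    )
    frames = []
    for f in files:
        df = pd.read_csv(os.path.join(directory, f))
        # Add a column for the date encoded in the file name
        df['Date'] = datetime.strptime(f, '%m-%d-%Y.csv')
        # Some files use 'Country/Region', others 'Country_Region'
        df = df.rename(columns={'Country/Region': 'Country_Region'})
        if country is not None:
            df = df[df['Country_Region'] == country]
        frames.append(df)
    # Concatenate once at the end instead of inside the loop
    return pd.concat(frames, ignore_index=True)


# Toy demonstration with made-up numbers
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'Country/Region': ['India', 'Italy'], 'Confirmed': [1, 2]}).to_csv(
        os.path.join(tmp, '01-30-2020.csv'), index=False)
    pd.DataFrame({'Country_Region': ['India', 'Italy'], 'Confirmed': [3, 4]}).to_csv(
        os.path.join(tmp, '03-22-2020.csv'), index=False)
    demo = load_daily_reports(tmp, country='India')

print(demo[['Date', 'Country_Region', 'Confirmed']])
```

With `country=None` the same sketch would stack every row, which matches what the USA cell does for its own folder.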

## 2. Getting daily state-wise vaccinations for India (old)

- We have obtained the data from [COVID-19 India APIs - cowin_vaccine_data_statewise](https://data.covid19india.org) (raw data source [here](http://data.covid19india.org/csv/latest/cowin_vaccine_data_statewise.csv))
- The CSV file contains `NaN` entries for `Total Doses Administered` on future dates, because empty rows for every date of a month are created at the beginning of that month
- The data are stored in a single file `india_vaccines_<last_vaccine_date_in_dd-mm-yyyy>_old.csv`

In [14]:
# Load the csv into memory

# csv_url = 'http://data.covid19india.org/csv/latest/cowin_vaccine_data_statewise.csv'
# india_vaccines = pd.read_csv(csv_url)

# # Drop the rows where no doses have been recorded (includes future dates)
# india_vaccines = india_vaccines[india_vaccines['Total Doses Administered'].notna()]

# # Reformat the date and sort
# india_vaccines['Updated On'] = pd.to_datetime(india_vaccines['Updated On'], format='%d/%m/%Y')
# india_vaccines.sort_values(by=['Updated On', 'State'], inplace=True, ignore_index=True)

# # Store into a csv file
# last_date_india_vaccines = india_vaccines['Updated On'].max().strftime('%d-%m-%Y')

# if not os.path.exists('./raw_data_%s' % last_date_india_vaccines):
#     os.makedirs('./raw_data_%s' % last_date_india_vaccines)

# india_vaccines.to_csv('./raw_data_%s/india_vaccines_%s_old.csv' % (last_date_india_vaccines, last_date_india_vaccines))
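
Since the old source is no longer fetched, the effect of the `NaN` filter can be illustrated on a small made-up frame instead. The column names follow the ones used in the commented-out cell; the state and the dose counts are invented for illustration only.

```python
import pandas as pd

# Made-up rows mimicking the layout described above:
# the last date is in the "future" and has no total yet
toy = pd.DataFrame({
    'Updated On': ['01/10/2021', '02/10/2021', '03/10/2021'],
    'State': ['Kerala', 'Kerala', 'Kerala'],
    'Total Doses Administered': [100.0, 150.0, float('nan')],
})

# Keep only the rows with a recorded total (copy to avoid modifying a view)
toy = toy[toy['Total Doses Administered'].notna()].copy()

# Parse the dd/mm/yyyy dates so that max() returns the last recorded date
toy['Updated On'] = pd.to_datetime(toy['Updated On'], format='%d/%m/%Y')
last_recorded = toy['Updated On'].max().strftime('%d-%m-%Y')
print(last_recorded)
```

Filtering before taking the maximum matters here: without it, the placeholder row would push the date in the output file name into the future.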

## 2. Getting daily vaccinations for India

- We have obtained the data from [Data on COVID-19 (coronavirus) vaccinations by Our World in Data](https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data/India.csv) (raw data source [here](https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/country_data/India.csv))
- There is no statewise data in this source, but we do not require it
- The data are sorted by date and stored in a single file `india_vaccines_<last_vaccine_date_in_dd-mm-yyyy>.csv`

In [19]:
# Fetch the daily India vaccination data
india_vaccines = pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/country_data/India.csv')

# Parse the date column and sort by date
india_vaccines['date'] = pd.to_datetime(india_vaccines['date'], format='%Y-%m-%d')
india_vaccines.sort_values(by=['date'], inplace=True, ignore_index=True)

# Store into a csv file
last_date_india_vaccines = india_vaccines['date'].max().strftime('%d-%m-%Y')

os.makedirs('./raw_data_%s' % last_date_india_vaccines, exist_ok=True)

india_vaccines.to_csv('./raw_data_%s/india_vaccines_%s.csv' % (last_date_india_vaccines, last_date_india_vaccines))
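
Every cell in this notebook ends with the same "create `raw_data_<date>` and write `<name>_<date>.csv`" pattern. A minimal sketch of how that pattern could be factored out is shown here; `save_raw` and its `base_dir` parameter are hypothetical names introduced for illustration, and the demonstration writes made-up values to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def save_raw(df, prefix, last_date, base_dir='.'):
    """Hypothetical helper: write df to <base_dir>/raw_data_<dd-mm-yyyy>/<prefix>_<dd-mm-yyyy>.csv."""
    stamp = last_date.strftime('%d-%m-%Y')
    out_dir = os.path.join(base_dir, 'raw_data_%s' % stamp)
    # exist_ok=True makes a separate os.path.exists check unnecessary
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, '%s_%s.csv' % (prefix, stamp))
    df.to_csv(path)
    return path


# Toy demonstration with made-up values
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame({
        'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
        'total_vaccinations': [10, 25],
    })
    path = save_raw(toy, 'india_vaccines', toy['date'].max(), base_dir=tmp)
    saved_name = os.path.basename(path)
    saved_rows = len(pd.read_csv(path))

print(saved_name, saved_rows)
```

Like the cells in this notebook, the sketch keeps the default `to_csv` behaviour, so the row index is written as the first, unnamed column.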

## 3. Getting daily state-wise cases for USA

- We have linked the repository [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19/tree/ef15d99458d44aa9bc03c0726c609643e6f90d3b) as a submodule to our repository
- The data for USA are stored under `COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us` as `.csv` files, with one for every date
- The files are named in the `mm-dd-yyyy.csv` format
- We have aggregated and sorted data from all dates
- A `Date` column has been added
- The aggregated time-series data from all the files are stored in a single file `usa_cases_<last_case_date_in_dd-mm-yyyy>.csv`

In [16]:
# Store the names of all the daily case files and sort by date
daily_reports_usa = filter(lambda x: x.endswith('csv'), os.listdir("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us"))
sorted_files_usa = sorted(daily_reports_usa, key=lambda file: datetime.strptime(file, '%m-%d-%Y.csv'))

usa_cases = pd.DataFrame()


# Aggregate all US states and store in usa_cases_<last_case_date>.csv
for file in sorted_files_usa:
    filename = os.path.join("./COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us", file)

    temp_df = pd.read_csv(filename)

    # Add a column for the date
    temp_df['Date'] = datetime.strptime(file, '%m-%d-%Y.csv')
    
    # Append this day's rows; pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    usa_cases = pd.concat([usa_cases, temp_df])

usa_cases.reset_index(drop=True, inplace=True)

# Find the last date for which cases were recorded from the sorted list of files
last_case_date_usa = datetime.strptime(sorted_files_usa[-1], '%m-%d-%Y.csv').strftime('%d-%m-%Y')

# Store the aggregated time-series data from all the daily case files into a single file
os.makedirs('./raw_data_%s' % last_case_date_usa, exist_ok=True)

usa_cases.to_csv('./raw_data_%s/usa_cases_%s.csv' % (last_case_date_usa, last_case_date_usa))
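
Because the time series is built from one file per date, a quick check for gaps in the file names may be useful before relying on the aggregated data. The following is a minimal sketch under that assumption; `missing_report_dates` is a hypothetical helper, and the file names in the demonstration are made up to show a gap.

```python
from datetime import datetime

import pandas as pd


def missing_report_dates(filenames):
    """Hypothetical helper: dates between the first and last file that have no mm-dd-yyyy.csv."""
    dates = pd.to_datetime([datetime.strptime(f, '%m-%d-%Y.csv') for f in filenames])
    # Every calendar day between the earliest and the latest file
    expected = pd.date_range(dates.min(), dates.max(), freq='D')
    return expected.difference(dates)


# Made-up file list with one day missing
gaps = missing_report_dates(['04-12-2020.csv', '04-13-2020.csv', '04-15-2020.csv'])
gap_names = list(gaps.strftime('%m-%d-%Y'))
print(gap_names)
```

In the notebook, the function could be called with `sorted_files_usa` (or `sorted_files` for the global reports); an empty result would mean no date is missing.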

## 4. Getting daily state-wise vaccinations for USA

- We have used data from [Data on COVID-19 (coronavirus) vaccinations by Our World in Data](https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations) (raw data source [here](https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/us_state_vaccinations.csv))
- The rows are sorted by date and location and then stored in a single file `usa_vaccines_<last_vaccine_date_in_dd-mm-yyyy>.csv`

In [17]:
# Fetch the daily USA vaccination data
usa_vaccines = pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/us_state_vaccinations.csv')

# Parse the date column and sort by date and location
usa_vaccines['date'] = pd.to_datetime(usa_vaccines['date'], format='%Y-%m-%d')
usa_vaccines.sort_values(by=['date', 'location'], inplace=True, ignore_index=True)

# Store into a csv file
last_date_usa_vaccines = usa_vaccines['date'].max().strftime('%d-%m-%Y')

os.makedirs('./raw_data_%s' % last_date_usa_vaccines, exist_ok=True)

usa_vaccines.to_csv('./raw_data_%s/usa_vaccines_%s.csv' % (last_date_usa_vaccines, last_date_usa_vaccines))