# Covid-19 (Coronavirus) Analytics and Forecasting

***
*WORK IN PROGRESS*
***

## Data Sources

- **Primary Data Source:** Johns Hopkins CSSE Data Repository - https://github.com/CSSEGISandData/COVID-19  
    - Live data:
        - Countries
        - US States
        - US Counties
    - Historic data:
        - Countries (cases, deaths, recoveries)
        - US States (cases, deaths) **PENDING**
        - US Counties (cases, deaths) **PENDING**
- **US State Testing and Hospitalizations:** Covid Tracking Project - https://covidtracking.com/data/  
    - Live data:
        - US State testing, hospitalization, and ICU stats
    - Historic data:
        - US State testing, hospitalization, and ICU stats
- **US County - Alternative:** NY Times - https://github.com/nytimes/covid-19-data/
    - Historic data:
        - US States (cases, deaths)
        - US Counties (cases, deaths)

*Note that since 3/23, Johns Hopkins no longer tracks historic regional data, including US States. A separate dataset (Covid Tracking Project) is therefore required for US State data, so there may be minor differences when comparing the combined individual state data (Covid Tracking Project dataset) against the global US stats (Johns Hopkins dataset).*
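The kind of discrepancy described in the note can be checked with a few lines of pandas. This is only a sketch: the state figures and the national total below are hypothetical placeholders, not values from either dataset, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical per-state positives (stand-in for the Covid Tracking Project data)
state_totals = pd.Series({"NY": 1000, "CA": 400, "MA": 250}, name="positive")

# Hypothetical US total (stand-in for the US row of the Johns Hopkins global data)
national_total = 1700

# Sum the states and compare against the single national figure
combined = int(state_totals.sum())
difference = national_total - combined
pct_difference = 100 * difference / national_total

print(f"States combined: {combined}, national: {national_total}, "
      f"difference: {difference} ({pct_difference:.1f}%)")
```

A small, non-zero difference is expected when the two sources publish at different times or count cases differently.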

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from datetime import datetime
import os
import seaborn as sns
sns.set()

## Load the Data
See the Covid19_Data_ETL notebook for details on the data gathering and wrangling process.  
The original data is unchanged; only its format was changed to make it easier to work with for EDA and data visualization.

### Dataset 1: Johns Hopkins

In [2]:
jh_live_global = pd.read_csv('Datasets/JH/jh_live_global_orig.csv')

In [3]:
jh_live_countries = pd.read_csv('Datasets/JH/jh_live_countries.csv', index_col=0)

In [4]:
jh_live_usstates = pd.read_csv('Datasets/JH/jh_live_usstates.csv', index_col=0)

In [5]:
jh_live_uscounties = pd.read_csv('Datasets/JH/jh_live_uscounties.csv', index_col=[0,1])

In [6]:
jh_hist_countries_cases = pd.read_csv('Datasets/JH/jh_hist_countries_cases.csv', index_col=0)

In [7]:
jh_hist_countries_deaths = pd.read_csv('Datasets/JH/jh_hist_countries_deaths.csv', index_col=0)

In [8]:
jh_hist_countries_recovered = pd.read_csv('Datasets/JH/jh_hist_countries_recovered.csv', index_col=0)

In [9]:
jh_hist_usstates_cases = pd.read_csv('Datasets/JH/jh_hist_usstates_cases.csv', index_col=0)

In [10]:
jh_hist_usstates_deaths = pd.read_csv('Datasets/JH/jh_hist_usstates_deaths.csv', index_col=0)

In [11]:
jh_hist_uscounties_cases = pd.read_csv('Datasets/JH/jh_hist_uscounties_cases.csv', index_col=0, header=[0,1])

In [12]:
jh_hist_uscounties_deaths = pd.read_csv('Datasets/JH/jh_hist_uscounties_deaths.csv', index_col=0, header=[0,1])
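The county files above are read with `header=[0,1]`, which tells pandas to build two-level (MultiIndex) columns from the first two rows of the file. The sketch below shows that round trip on a tiny in-memory table. The layout (state on the first level, county on the second, dates as the row index) and all numbers are assumptions for illustration; the actual files produced by the ETL notebook may differ.

```python
import io
import pandas as pd

# Hypothetical wide table: two-level columns (state, county), one row per date
columns = pd.MultiIndex.from_tuples(
    [("California", "Orange"), ("California", "Los Angeles"), ("Massachusetts", "Suffolk")],
    names=["state", "county"],
)
original = pd.DataFrame(
    [[10, 20, 5], [15, 30, 8]],
    index=["2020-04-01", "2020-04-02"],
    columns=columns,
)

# Write to CSV text, then read it back the same way the notebook reads its files
csv_text = original.to_csv()
counties_wide = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1])

# Selecting on the first column level returns every county in that state
california = counties_wide["California"]
print(california)
```

Selecting a single series then takes a tuple, for example `counties_wide[("Massachusetts", "Suffolk")]`.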

## Start examining the data - Exploratory Data Analysis (EDA)

In [13]:
# Total Global confirmed cases (live)
jh_live_countries['Confirmed'].sum()

1511104

In [14]:
# Total Global deaths (live)
jh_live_countries['Deaths'].sum()

88338

In [15]:
# Total US confirmed cases (live)
jh_live_usstates['Confirmed'].sum()

429052

In [16]:
# Total US Deaths (live)
jh_live_usstates['Deaths'].sum()

14695
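The four totals above can be combined into a few simple ratios. The sketch below copies the numbers from the cell outputs, so it reflects that snapshot rather than live data. Deaths per confirmed case is a crude ratio that depends heavily on how much testing is done, and should not be read as a true fatality rate.

```python
# Totals copied from the cell outputs above (a snapshot, not live values)
global_confirmed = 1511104
global_deaths = 88338
us_confirmed = 429052
us_deaths = 14695

# US share of the global totals
us_share_of_cases = us_confirmed / global_confirmed
us_share_of_deaths = us_deaths / global_deaths

# Crude deaths per confirmed case
global_deaths_per_case = global_deaths / global_confirmed
us_deaths_per_case = us_deaths / us_confirmed

print(f"US share of global cases:  {us_share_of_cases:.1%}")
print(f"US share of global deaths: {us_share_of_deaths:.1%}")
print(f"Deaths per case (global):  {global_deaths_per_case:.2%}")
print(f"Deaths per case (US):      {us_deaths_per_case:.2%}")
```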

# TODO - CONTINUE REVISING FROM HERE

In [17]:
# Top countries by confirmed count as of latest date
# nlargest already returns values in descending order, so no separate sort is needed
top_countries = confirmed_country.loc[jh_date].nlargest(20)
top_countries

NameError: name 'confirmed_country' is not defined
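The `NameError` above occurs because `confirmed_country` and `jh_date` are not defined in this revision of the notebook. The sketch below shows what the cell appears to be aiming for, under the assumption that the historic cases table has one row per date and one column per country. The table, its numbers, and the names `cases` and `latest_date` are hypothetical stand-ins, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical historic cases table: dates as rows, countries as columns
cases = pd.DataFrame(
    {
        "US": [100, 250, 600],
        "Italy": [300, 400, 500],
        "China": [800, 805, 810],
        "Spain": [50, 200, 450],
    },
    index=pd.to_datetime(["2020-03-14", "2020-03-15", "2020-03-16"]),
)

# Use the most recent date in the index instead of a separately tracked variable
latest_date = cases.index.max()

# Largest countries on the latest date, already in descending order
top_countries = cases.loc[latest_date].nlargest(3)

# Same countries without China, from Mar 15 onward
# (slicing by date string works because the index holds real datetimes)
ex_china = cases.loc["2020-03-15":, top_countries.index.drop("China")]

print(top_countries)
print(ex_china)
```

Parsing the index with `pd.to_datetime` is what makes the date-string slice reliable; on a plain string index the same slice would depend on label order.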

In [None]:
# Plot the top 20 countries confirmed infections over time
confirmed_country[top_countries.index].plot(figsize=(15,8), title="Top 20 Countries - Confirmed Cases Over Time")

In [None]:
# Same chart but excluding China and starting on Mar 15 to zoom in
confirmed_country[top_countries.index].loc['03/15/2020':, top_countries.index != 'China'].plot(figsize=(15,8), title="Top 20 Countries ex China - Confirmed Cases Over Time")

In [None]:
# Same chart but US only and starting on Mar 15 to zoom in
confirmed_country[top_countries.index].loc['3/15/2020':, 'US'].plot(figsize=(15,8), title="United States - Confirmed Cases Over Time")

In [None]:
# Top 5 New York counties with the most confirmed cases (JH Dataset)
global_curr[global_curr["Province_State"]=='New York'][['Admin2', 'Confirmed', 'Deaths']].sort_values(by='Confirmed', ascending=False).head(5)

In [None]:
# Top 5 California counties with the most confirmed cases (JH Dataset)
global_curr[global_curr["Province_State"]=='California'][['Admin2', 'Confirmed', 'Deaths']].sort_values(by='Confirmed', ascending=False).head(5)

In [None]:
# Top 10 Massachusetts counties with the most confirmed cases (JH Dataset)
global_curr[global_curr["Province_State"]=='Massachusetts'][['Admin2', 'Confirmed', 'Deaths']].sort_values(by='Confirmed', ascending=False).head(10)

In [None]:
# Chart the top states by confirmed positive cases
states_positive[top_states].loc['3/15/2020':, :].plot(title='Confirmed Cases - Top US States', figsize=(15,8))

In [None]:
# The same chart but this time without NY to zoom in on the others
states_positive[top_states].loc['3/15/2020':, top_states != 'NY'].plot(title='Confirmed Cases - Top US States (ex NY)', figsize=(15,8))

In [None]:
# Chart the top states by number of deaths
states_deaths[top_states].loc['3/15/2020':, :].plot(title='Deaths - Top US States', figsize=(15,8))

In [None]:
counties_CA_filter = counties['state']=='California'
counties_CA_cases = pd.pivot_table(counties[counties_CA_filter], index='date', columns='county', values='cases', aggfunc='sum')

In [None]:
top_CA_counties = counties_CA_cases.loc[nytdate, :].nlargest(10).index
top_CA_counties

In [None]:
counties_CA_cases.loc['3/15/2020':, top_CA_counties].plot(title='Cases - California Counties', figsize=(15,8))
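The California cells above follow a filter, pivot, then pick-the-top pattern. Here is a self-contained sketch of that pattern on a tiny long-format table. The column names match those used in the cells (`date`, `county`, `state`, `cases`), but the rows are made-up numbers, and `latest` is used here in place of `nytdate`, which is not defined in this revision.

```python
import pandas as pd

# Hypothetical long-format rows: one row per (date, county)
counties = pd.DataFrame(
    {
        "date": ["2020-03-15", "2020-03-15", "2020-03-16", "2020-03-16", "2020-03-16"],
        "county": ["Orange", "Los Angeles", "Orange", "Los Angeles", "Suffolk"],
        "state": ["California", "California", "California", "California", "Massachusetts"],
        "cases": [17, 69, 22, 94, 50],
    }
)
counties["date"] = pd.to_datetime(counties["date"])

# Keep one state, then pivot to a wide table: dates as rows, counties as columns
ca = counties[counties["state"] == "California"]
ca_cases = ca.pivot_table(index="date", columns="county", values="cases", aggfunc="sum")

# Rank counties by their value on the most recent date
latest = ca_cases.index.max()
top_ca_counties = ca_cases.loc[latest].nlargest(1).index

print(ca_cases)
print(list(top_ca_counties))
```

The wide table can then be sliced by date and plotted directly, since each column becomes one line on the chart.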

In [None]:
OC_filter = (counties['state']=='California') & (counties['county']=='Orange')
OC = counties[OC_filter]
OC.loc['03/15/2020':,'cases':].plot(title='Cases - Orange County, CA', figsize=(15,8))

In [None]:
counties_MA_filter = counties['state']=='Massachusetts'
counties_MA_cases = pd.pivot_table(counties[counties_MA_filter], index='date', columns='county', values='cases', aggfunc='sum')

In [None]:
top_MA_counties = counties_MA_cases.loc[nytdate, :].nlargest(10).index
top_MA_counties

In [None]:
counties_MA_cases.loc['3/15/2020':, top_MA_counties].plot(title='Cases - Massachusetts Counties', figsize=(15,8))

**TODO NEXT:**  
- Move data wrangling to a separate data prep file
- Export data prep files to csv to be pulled into analysis notebook and for future reference
- More EDA and Data Viz
- Perform forecasting using the historic time series data
- Get population data for each country / state and add to this report (may be easiest to just put in a csv file)
  - Also population density if possible
- Add metrics based on population data (% of population infected, etc)
- Model out different scenarios: 
  - No changes
  - Lockdown
  - Extensive testing
  - Mandatory quarantines
  - Containment effectiveness score for each country (estimated)
  - Cure discovered
- Look into using Unity to model out scenarios, using simulated humans and the Global / US map tool I have
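
One item in the list above calls for population-based metrics such as the percentage of the population infected. A minimal sketch of that calculation follows, assuming confirmed counts and population figures are available as two series keyed by the same region names. The regions and numbers are hypothetical, and confirmed cases only approximate actual infections.

```python
import pandas as pd

# Hypothetical inputs: confirmed cases and population, keyed by region name
confirmed = pd.Series({"Region A": 5000, "Region B": 1200}, name="confirmed")
population = pd.Series({"Region A": 1_000_000, "Region B": 60_000}, name="population")

# Align the two series on region name
metrics = pd.concat([confirmed, population], axis=1)

# Percentage of the population with a confirmed case, and cases per 100,000 people
metrics["pct_confirmed"] = 100 * metrics["confirmed"] / metrics["population"]
metrics["cases_per_100k"] = 100_000 * metrics["confirmed"] / metrics["population"]

print(metrics.sort_values("cases_per_100k", ascending=False))
```

Normalizing by population can reorder the ranking: in this toy example the region with fewer total cases has the higher rate.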