# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2018-CS109A/blob/master/content/styles/iacs.png?raw=true"> CS-S109A Introduction to Data Science 

## Lecture 12: A Case Study: COVID-19

**Harvard University**<br>
**Summer 2020**<br>
**Instructors:** Kevin Rader<br>
**Authors:** Rahul Dave, David Sondak, Pavlos Protopapas, Chris Tanner, Eleni Kaxiras, Kevin Rader

---

In [None]:
## RUN THIS CELL TO GET THE RIGHT FORMATTING 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

# Table of Contents 
<ol start="0">
<li> Learning Goals </li> 
<li> COVID-19 Data </li>
<li> Prediction Modeling </li>
<li> Interpretive Modeling </li>
</ol>

In [None]:
import pandas as pd
import sys
import numpy as np
import scipy as sp
import sklearn as sk
import matplotlib.pyplot as plt
import tensorflow as tf



## Learning Goals

This Jupyter notebook accompanies Lecture 12. By the end of this lecture, you should be more comfortable with:

- wrangling, processing, merging, and exploring data
- following the data science process
- issues with prediction modeling
- COVID-19 Data

## Part 0: Data 

For this notebook we will be using 3 sources of data:

1. `covid_cases_by_county.csv`: confirmed COVID-19 case counts across US counties measured daily.  [Source]()
2. `state_policy_data.csv`: current COVID-related policy data across US states: [Source](https://www.kff.org/coronavirus-covid-19/issue-brief/state-data-and-policy-actions-to-address-coronavirus/#socialdistancing)
3. `election2016_by_county.csv`: we've used these data before (pre-split into train and test in earlier uses).

Let's take a peek at all 3 of these datasets:

In [None]:
covid = pd.read_csv('../data/covid_cases_by_county.csv')
print(covid.shape)
covid.head()

In [None]:
elect = pd.read_csv('../data/election2016_by_county.csv')
print(elect.shape)
elect.head()

In [None]:
policy = pd.read_csv('../data/state_policy_data.csv')
print(policy.shape)
policy.head()

**Q0.1** What is being measured in the 2 new datasets (`covid` and `policy`)?  What interesting questions can be answered with these 3 datasets?  What data pitfalls could we possibly get tripped up on?  What other issues may arise?

*your answer here*

### Merging Data and Train-Test Split

Let's first merge the data so we don't have to deal with 3 different data sources.  The code below walks through the merge steps and then splits the result into `train` and `test` dataframes:

In [None]:

covid['fipscode'] = covid['FIPS']

merged = covid.merge(elect,on="fipscode")
merged2 = merged.merge(policy,on="state",how="left")

print(merged2.shape)

merged2.head()
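
The two merges above use different join types (an inner join on `fipscode`, then a left join on `state`), so it can be worth checking how many rows survive each step and how many end up without policy data. A minimal sketch on toy data, using the `indicator` option of `DataFrame.merge`; the small frames, their values, and the `mask_mandate` column are hypothetical stand-ins, not the course files:

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values and column names).
covid_toy = pd.DataFrame({"fipscode": [1001, 1003, 9999],
                          "7/28/20": [10, 25, 3]})
elect_toy = pd.DataFrame({"fipscode": [1001, 1003, 1005],
                          "state": ["Alabama", "Alabama", "Alabama"]})
policy_toy = pd.DataFrame({"state": ["Alaska"], "mask_mandate": ["Yes"]})

# An outer join with indicator=True labels each row as
# 'both', 'left_only', or 'right_only' in a '_merge' column.
check = covid_toy.merge(elect_toy, on="fipscode", how="outer", indicator=True)
merge_counts = check["_merge"].value_counts()
print(merge_counts)

# Same join types as in the notebook: inner, then left.
inner = covid_toy.merge(elect_toy, on="fipscode")
left = inner.merge(policy_toy, on="state", how="left")

# Rows whose state had no match in the policy table get NaN policy values.
n_missing_policy = int(left["mask_mandate"].isna().sum())
print(n_missing_policy)
```

Counties that appear as `left_only` or `right_only` are silently dropped by the inner join, which is one of the data pitfalls worth noting for Q0.1.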

In [None]:
# split into train and test

from sklearn.model_selection import train_test_split
itrain, itest = train_test_split(range(merged2.shape[0]), train_size=0.80)

# index into merged2 (the fully merged frame the row indices were drawn from)
train = merged2.iloc[itrain, :]
test = merged2.iloc[itest, :]
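
The split above draws a new random partition each time the cell runs, so results can change between runs. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; the seed value 109 and the toy frame are arbitrary choices for illustration. `train_test_split` also accepts a DataFrame directly:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the merged data (hypothetical values).
toy = pd.DataFrame({"county": list("abcdefghij"), "cases": range(10)})

# Passing random_state fixes the shuffle, so repeated calls give the same split.
train_a, test_a = train_test_split(toy, train_size=0.80, random_state=109)
train_b, test_b = train_test_split(toy, train_size=0.80, random_state=109)

print(train_a.shape, test_a.shape)
print(train_a.index.equals(train_b.index))
```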


## Part 1: Data Exploration

There are [at least] 2 perspectives we can take when using this data set:
1. Build a prediction model to predict how many cases there will be tomorrow within each county
2. Look at what factors are influencing the number of cases

No matter the perspective, let's look at the cases for the most recent date:

*Note: always use `train` ONLY when doing any analysis, including explorations.*

In [None]:
y_train = train['7/28/20']
y_test = test['7/28/20']

plt.hist(y_train);

**Q1.1** Describe this distribution.  What issues may occur if using this version of the variable in modeling?  How can this be corrected (there are many possibilities)?

*your answer here*
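
One of the many possible corrections for a heavily right-skewed count variable is a log transformation. A minimal sketch on simulated data, not the course data; the lognormal parameters are arbitrary assumptions chosen only to produce a long right tail:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated right-skewed case counts (hypothetical, for illustration only).
toy_cases = pd.Series(np.floor(rng.lognormal(mean=4.0, sigma=1.5, size=500)))

# log1p(x) = log(1 + x) stays defined when a county has zero cases.
log_cases = np.log1p(toy_cases)

# Sample skewness before and after the transformation.
skew_raw = toy_cases.skew()
skew_log = log_cases.skew()
print(skew_raw, skew_log)
```

Other options include modeling cases per capita or using a count model; which is appropriate depends on the question being asked.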

In [None]:
######
# your code here
######



## Part 2: Prediction modeling


**Q2.1** If we were to build a model to predict the number of new cases on 7/28/20 (with the aim of using it to forecast future days), what factors should we include?  What would be the most obvious predictor(s)?  

*your answer here*


**Q2.2** Build a model to predict the **new cases** on 7/28/20 based on the number of **new cases** on 7/27/20.  Evaluate the model's accuracy on the test set, provide a visual to help interpret the model, and interpret what the model says about the relationship.  Which counties are the outliers?
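
A minimal sketch of one way to set this up, assuming the date columns hold cumulative confirmed counts so that new cases are the difference between consecutive days. That is an assumption to verify against the data, and the toy values below are made up:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy cumulative counts for 5 hypothetical counties.
toy = pd.DataFrame({
    "7/26/20": [100, 40, 900, 10, 300],
    "7/27/20": [110, 42, 960, 10, 330],
    "7/28/20": [121, 45, 1030, 11, 362],
})

# New cases = difference between consecutive cumulative columns.
new_0727 = toy["7/27/20"] - toy["7/26/20"]
new_0728 = toy["7/28/20"] - toy["7/27/20"]

# Simple linear regression of today's new cases on yesterday's.
X = new_0727.to_frame(name="new_0727")
model = LinearRegression().fit(X, new_0728)
r2 = model.score(X, new_0728)

# Large residuals point to counties the model fits poorly.
residuals = new_0728 - model.predict(X)
print(model.coef_[0], model.intercept_, r2)
```

On the real data, the model would be fit on `train` and scored on `test`, and a scatter plot of the two new-case variables with the fitted line is one natural visual.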

In [None]:
######
# your code here
######

*your answer here*

**Q2.3** Build a model to predict the **new cases** on 7/28/20 based on any variables available that day (be selective of what variables you would like to include as predictors).

In [None]:
######
# your code here
######

*your answer here*

## Part 3: Interpretation modeling

**Q3.1** What form(s) of the response variable should we use to answer the broad questions: 
- What factors are related to the rate of spread of the disease across counties?
- What policies have affected the rate of spread of the disease across counties?
- Is mask-wearing effective?

Should new cases from the previous day be used as a predictor in any models to answer these questions?  How would this affect the interpretation?

*your answer here*


**Q3.2** Build a model (or multiple models) and use it to answer the question "What demographic factors are associated with differences in the rate of spread of the disease across counties?"

In [None]:
######
# your code here
######

*your answer here*

**Q3.3** Build a model (or multiple models) and use it to answer the question "How have state re-openings affected the rate of spread of the disease across counties?".

In [None]:
######
# your code here
######

*your answer here*

**Q3.4** Ask your own interesting applied question and use a model(s) to address it.

In [None]:
######
# your code here
######


*your answer here*