# <img style="float: left; padding-right: 10px; width: 45px" src="https://github.com/Harvard-IACS/2021-s109a/blob/master/lectures/crest.png?raw=true"> CS-S109A Introduction to Data Science 

## Final Exam: COVID-19 Modeling

**Harvard University**<br/>
**Summer 2021**<br/>
**Instructors**: Kevin Rader


<hr style='height:2px'>

---



### INSTRUCTIONS

- This final exam is to be completed individually.  Do not consult with your peers when working on it (you can ask the teaching staff clarification questions, including through private messages on Ed).
- To submit your assignment follow the instructions given in Canvas.
- Restart the kernel and run the whole notebook again before you submit. 
- As much as possible, try to stick to the hints and functions we import at the top of the notebook, as those are the ideas and tools the class supports and is aiming to teach. If a problem specifies a particular library, you are required to use that library, and possibly others from the import list.
- Please use .head() when viewing data. Do not submit a notebook that is excessively long because output was not suppressed or otherwise limited. 

**Note: for all problems, it is up to you to decide how to transform the data (standardization, log transformations, etc.).  Be sure you use and interpret these transformations appropriately.**

In [None]:
import numpy as np
import pandas as pd
import scipy as sp
import sklearn as sk
import statsmodels as sm
import matplotlib.pyplot as plt
import seaborn as sns

# You are free to use any functions/methods within these packages (BS4, ELI5, and LIME are fine too)
# if you would like to use any other, please contact the teaching staff

<hr style="height:2pt">

# Analyzing the recent spread of COVID-19 

![](fig/vaccine.jpeg)

You are tasked with using the COVID case and vaccination data across counties presented by the CDC to analyze the recent surge in COVID infections and its association with vaccination (among other predictors).  You are also tasked with building prediction models to forecast how the disease spread will change based on data from the previous week (and demographic and other measures).

The exam is broken into 4 problems:
- Problem 1: Data Wrangling and Explorations
- Problem 2: Interpretive Linear Regression Modeling
- Problem 3: Prediction Modeling
- Problem 4: Further Analysis

You are provided with four raw data files, and a 5th cleaned file is provided to be used for all EDA and modeling tasks.

The variables included in each of the four raw data sets are:

For 'covid_cases_county.csv' (note: counties show up many times in this dataset: once for each date they report the number of cases):
- `date`: the date of the measurement, taken weekly
- `county`: county name
- `state`: the state in which the county lies
- `fips`: the unique Federal Information Processing Standards (FIPS) code for the county
- `cases`: the cumulative number of confirmed positive cases up to and including that date
- `deaths`: the cumulative number of confirmed COVID-related deaths up to and including that date


For 'vaccines_county.csv' (note: counties show up many times in this dataset: once for each date they report vaccination figures):
- `date`: the date of the measurement, taken weekly
- `fips`: the unique FIPS code for the county
- `fully`: the percent of residents that are fully vaccinated in the county on that date
- `dose1`: the percent of residents that have received at least one vaccine dose in the county on that date.

For 'masks_county.csv' (note: this is based on a survey conducted by the New York Times in summer of 2020):
- `fips`: the unique FIPS code for the county
- `never`: the percent of respondents that report they never wore masks in public
- `rarely`: the percent of respondents that report they rarely wore masks in public
- `sometimes`: the percent of respondents that report they sometimes wore masks in public	
- `frequently`: the percent of respondents that report they frequently wore masks in public	
- `always`: the percent of respondents that report they always wore masks in public

For 'demographics_county.csv' (note: these are various measures taken from 2010 to 2020):
- `fips`: the unique FIPS code for the county
- `population`: total number of residents in the county
- `hispanic`: the percentage of residents that self-identify as Hispanic
- `minority`: the percentage of residents that self-identify as a minority group (non-white)
- `female`: the percentage of residents that self-identify as female
- `unemployed`: the percentage of residents that are unemployed
- `income`: the median household income, in thousands of dollars
- `nodegree`: the percentage of residents that report not having graduated high school
- `bachelor`: the percentage of residents that report having a college degree
- `inactivity`: the percentage of residents that get less than 1 hour of vigorous exercise a week
- `obesity`: the percentage of residents that are considered obese based on BMI
- `density`: the population density (residents per square mile)
- `votergap20`: Biden voting percentage minus Trump voting percentage in the 2020 election
- `votergap16`: Clinton voting percentage minus Trump voting percentage in the 2016 election


### Data Sources
- Vaccinations [here](https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-County/8xkx-amqh).
- Cases [here](https://github.com/nytimes/covid-19-data).
- Mask Usage [here](https://github.com/nytimes/covid-19-data/tree/master/mask-use).
- Demographics [here](https://www.ers.usda.gov/data-products/county-level-data-sets/) 
- 2020 Election [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ)



## Question 1 [25pts]: Data Wrangling and Explorations

**1.1** Load the data sets as follows:
- 'covid_cases_county.csv' as `covid_raw` 
- 'vaccines_county.csv' as `vaccines_raw`
- 'masks_county.csv' as `masks`
- 'demographics_county.csv' as `demo` 

**1.2** Create a subset of the `covid_raw` data frame that only contains the measures for 5 dates: June 27 and July 4, 11, 18 and 25.  Do the same for the `vaccines_raw`.  Call these subsets `covid` and `vaccines`, respectively, and print out their dimensions (aka, shapes).

**1.3** Determine and print the number of counties that are measured for each time period in `covid` and `vaccines` (do not print out the list of counties, just the number/count).  Comment on what this implies for presence of missing data.

**1.4** Process both `covid` and `vaccines` so that each county is represented by a single row in each data frame (rather than having 5 separate rows for each county: 1 for each time period in part 1.2).  Call these newly generated Pandas data frames `covid_by_county` and `vaccines_by_county`, respectively.  Print out the dimensions of each resulting data frame, and view the header of `covid_by_county`.  Note: you should use informative names for the columns in the resulting data frames: for example, `cases_w30` for the cumulative number of cases on July 25 (it's the 30th week of the calendar year).

**Hint**: Splitting based on dates and then using `pd.DataFrame.merge` ([source](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)) could be helpful for this task, using the `fips` code as the key to join on (you should drop any counties that are not measured in all time periods; the default argument for `how` in `pd.DataFrame.merge` will behave this way).

**1.5** Merge the 4 data frames (`covid_by_county`, `vaccines_by_county`, `masks`, and `demo`) based on `fips` and save the result as `covid_merged` (you should drop any counties that are not measured in all 4 data frames).  Determine and report how many counties were dropped from `demo` in this process, and view the header of `covid_merged`.

**1.6** Use `covid_merged` to calculate the novel case rate (per 1000 residents) for each of the weeks for all of the counties, and save these as 4 new well-named variables in `covid_merged`.  For example, `rate_w30` can mathematically be represented as `1000*(cases_w30-cases_w29)/population`.  Plot the histogram of the novel case rate in week 29, `rate_w29`, and comment on what you notice.

**1.7** We did the steps above (and some other minimal processing) and saved the results in `covid_clean.csv` for you.  Use this data file to answer some exploratory questions and all future analyses: 

1. Has the overall average case rate increased from week 28 (July 5-11) to week 29 (July 12-18)?  
2. Treating the counties as separate and equal observations: in what states did the case rate increase the most?  In what states did the case rate decrease the most (or increase the least)?  List the top 5 for each.  Do you notice any patterns in these states?
3. Create and interpret separate visuals to display how the county case rate in week 29 relates to each of the following variables. Interpret what you see (be specific to this domain).

    a. The political views in the county (as measured by the votergap in the 2020 election).
    
    b. The vaccination rate in the county (for week 28) (be sure to throw away the zeros as these represent unreported values).
    
    c. The population density of the county.
    
    d. Whether 50% or more of the surveyed residents in the county report that they always wore a mask in public at the time of the survey.

## Answers

**1.1** Load the data sets as follows:
- 'covid_cases_county.csv' as `covid_raw` 
- 'vaccines_county.csv' as `vaccines_raw`
- 'masks_county.csv' as `masks`
- 'demographics_county.csv' as `demo` 

Print out each of their dimensions (aka, shapes).

In [None]:
# Load data
covid_raw = pd.read_csv('data/covid_cases_county.csv')
vaccines_raw = pd.read_csv('data/vaccines_county.csv')
masks = pd.read_csv('data/masks_county.csv')
demo = pd.read_csv('data/demographics_county.csv')

# print shapes of the datasets
print(covid_raw.shape[1],"total columns in covid_raw, and ",covid_raw.shape[0],"rows")
print(vaccines_raw.shape[1],"total columns in vaccines_raw",vaccines_raw.shape[0],"rows")
print(masks.shape[1],"total columns in masks",masks.shape[0],"rows")
print(demo.shape[1],"total columns in demo",demo.shape[0],"rows")

**1.2** Create a subset of the `covid_raw` data frame that only contains the measures for 5 dates: June 27 and July 4, 11, 18 and 25.  Do the same for the `vaccines_raw`.  Call these subsets `covid` and `vaccines`, respectively, and print out their dimensions (aka, shapes).


In [None]:
# take a look at the covid dataset
covid_raw.head()

In [None]:
# subset the covid dataset to selected dates
covid = covid_raw.loc[
    (covid_raw['date']=='2021-06-27') |
    (covid_raw['date']=='2021-07-04') |
    (covid_raw['date']=='2021-07-11') |
    (covid_raw['date']=='2021-07-18') |
    (covid_raw['date']=='2021-07-25')
    , :]
print(covid.shape[1],"total columns in covid, and ",covid.shape[0],"rows")

In [None]:
# take a look at the vaccine dataset
vaccines_raw.head()

In [None]:
# subset the vaccines dataset to selected dates
vaccines = vaccines_raw.loc[
    (vaccines_raw['date']=='2021-06-27') |
    (vaccines_raw['date']=='2021-07-04') |
    (vaccines_raw['date']=='2021-07-11') |
    (vaccines_raw['date']=='2021-07-18') |
    (vaccines_raw['date']=='2021-07-25')
    , :]
print(vaccines.shape[1],"total columns in vaccines, and ",vaccines.shape[0],"rows")
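
The five chained equality tests can also be expressed with `Series.isin`. A minimal sketch on invented rows, assuming `date` is stored as an ISO-formatted string as in the cells above; `toy` and `keep_dates` are hypothetical names, not part of the exam files.

```python
import pandas as pd

# Hypothetical miniature of covid_raw; the real file has many more rows and columns.
toy = pd.DataFrame({
    "date": ["2021-06-20", "2021-06-27", "2021-07-04", "2021-07-25", "2021-08-01"],
    "fips": [1001, 1001, 1001, 1001, 1001],
    "cases": [10, 12, 15, 30, 41],
})

keep_dates = ["2021-06-27", "2021-07-04", "2021-07-11", "2021-07-18", "2021-07-25"]

# isin keeps a row when its date matches any entry in keep_dates
subset = toy.loc[toy["date"].isin(keep_dates)]
print(subset.shape)
```

Keeping the dates in one list also means the same list can be reused for both `covid_raw` and `vaccines_raw`.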

**1.3** Determine and print the number of counties that are measured for each time period in `covid` and `vaccines` (do not print out the list of counties, just the number/count).  Comment on what this implies for presence of missing data.


In [None]:
# count the number of counties per day in the covid dataset
counties_by_day = covid.groupby('date').agg({'county': 'count',})
print("Number of counties per day in covid dataset:")
counties_by_day.head()

**Comment:** Not all days have the same number of counties. It seems some days have no data for some counties (i.e. missing data).

In [None]:
# assess if there's a one-to-one relationship between county names and FIPS codes
fips_by_day_by_county = covid.groupby(['date','county']).agg({'fips': 'count',})
print("Number of FIPS per day by county:")
print(fips_by_day_by_county[fips_by_day_by_county['fips'] == 0].head())
print(fips_by_day_by_county[fips_by_day_by_county['fips'] > 1].head())

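**Comment:** Some county names have no FIPS code, and some county names appear with more than one FIPS code on the same date. The latter is expected to some extent: this grouping ignores `state`, and the same county name can occur in several states.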
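
Because county names are not unique nationwide, a stricter check counts distinct FIPS codes per (`state`, `county`) pair instead of per name. A sketch with invented rows; the FIPS values below are placeholders, not real codes.

```python
import pandas as pd

toy = pd.DataFrame({
    "date":   ["2021-07-18"] * 4,
    "county": ["Washington", "Washington", "Unknown", "Adams"],
    "state":  ["Ohio", "Utah", "Ohio", "Ohio"],
    "fips":   [11111.0, 22222.0, None, 33333.0],  # placeholder codes
})

# distinct non-missing FIPS codes when grouping on the county name only
by_name = toy.groupby("county")["fips"].nunique()
# distinct non-missing FIPS codes per (state, county) pair
by_state_name = toy.groupby(["state", "county"])["fips"].nunique()

print(by_name[by_name > 1])              # name shared across states
print(by_state_name[by_state_name > 1])  # genuine duplicates, if any
print(by_state_name[by_state_name == 0]) # rows with no FIPS code at all
```

In this toy data the name-only grouping flags "Washington", while the (`state`, `county`) grouping shows each county maps to a single code.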

In [None]:
# count the number of FIPS codes per day in the covid dataset
counties_by_day = covid.groupby('date').agg({'fips': 'count',})
print("Number of non-missing FIPS codes per day in covid dataset:")
counties_by_day.head()

In [None]:
# count the number of FIPS per day in the vaccines dataset
vaccines_fips_count_by_day = vaccines.groupby('date').agg({'fips': 'count',})
print("Number of FIPS per day in vaccines dataset:")
vaccines_fips_count_by_day.head()

**Comment:** We can observe missing data for counties and FIPS codes. Not all dates have the same number of counties in the covid dataset. The number of FIPS codes in the covid and vaccines datasets is also not identical, so we will need to be diligent when merging the two. Since some counties have no FIPS code in the covid dataset, and since the vaccines dataset identifies counties only by FIPS code, we will likely not be able to include the counties with missing FIPS codes in a vaccine analysis.

**1.4** Process both `covid` and `vaccines` so that each county is represented by a single row in each data frame (rather than having 5 separate rows for each county: 1 for each time period in part 1.2).  Call these newly generated Pandas data frames `covid_by_county` and `vaccines_by_county`, respectively.  Print out the dimensions of each resulting data frame, and view the header of `covid_by_county`.  Note: you should use informative names for the columns in the resulting data frames: for example, `cases_w30` for the cumulative number of cases on July 25 (it's the 30th week of the calendar year).

**Hint**: Splitting based on dates and then using `pd.DataFrame.merge` ([source](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)) could be helpful for this task, using the `fips` code as the key to join on (you should drop any counties that are not measured in all time periods; the default argument for `how` in `pd.DataFrame.merge` will behave this way).




In [None]:
# turn the long format into a wide format
# reset the index to repivot
covid_by_county = covid.reset_index(level=0)
# one row per FIPS code, one column per (measure, date) pair
covid_by_county = covid_by_county.pivot_table(index=['fips'], columns='date',
                    values=['cases', 'deaths'], aggfunc='sum', margins=False)

In [None]:
# reformat the column names to one level
# create a list of the new column names in the right order
new_cols=[('{1} {0}'.format(*tup)) for tup in covid_by_county.columns]

# assign the flattened names to the dataframe
covid_by_county.columns= new_cols

# resort the index, so you get the columns in the order you specified
covid_by_county = covid_by_county.sort_index(axis='columns')

In [None]:
# rename the columns
covid_by_county = covid_by_county.rename(columns = {'2021-06-27 cases': 'cases_w26', 
                          '2021-07-04 cases': 'cases_w27',
                          '2021-07-11 cases': 'cases_w28',
                          '2021-07-18 cases': 'cases_w29',
                          '2021-07-25 cases': 'cases_w30',
                          '2021-06-27 deaths': 'deaths_w26', 
                          '2021-07-04 deaths': 'deaths_w27',
                          '2021-07-11 deaths': 'deaths_w28',
                          '2021-07-18 deaths': 'deaths_w29',
                          '2021-07-25 deaths': 'deaths_w30',
                          }, inplace = False)

In [None]:
# check out the covid dataset
covid_by_county.head()
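
Note that `pivot_table` leaves `NaN` wherever a county has no row for a given date, so counties with incomplete coverage are still present after the reshape. A sketch of one way to drop them and build the `cases_wNN` names in a single step, on invented numbers; `week_names` is a hypothetical mapping introduced only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "fips":  [1001, 1001, 1003, 1003, 1005],
    "date":  ["2021-07-18", "2021-07-25", "2021-07-18", "2021-07-25", "2021-07-25"],
    "cases": [100, 110, 50, 58, 7],
})

# hypothetical mapping from report date to the week label used in this notebook
week_names = {"2021-07-18": "w29", "2021-07-25": "w30"}

wide = toy.pivot_table(index="fips", columns="date", values="cases", aggfunc="sum")
# county 1005 has no 2021-07-18 row, so that cell is NaN; dropna removes the county
wide = wide.dropna()
wide.columns = [f"cases_{week_names[d]}" for d in wide.columns]
print(wide)
```

Pivoting one measure at a time keeps the columns single-level, which avoids the separate flatten-and-rename cells at the cost of one pivot per measure.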

In [None]:
# retain the number of columns for future use
nr_columns_covid_by_county = covid_by_county.shape[1]

In [None]:
# turn the long format into a wide format
# reset the index to repivot
vaccines_by_county = vaccines.reset_index(level=0)
# one row per FIPS code, one column per (measure, date) pair
vaccines_by_county = vaccines_by_county.pivot_table(index=['fips'], columns='date',
                    values=['fully', 'dose1'], aggfunc='sum', margins=False)

In [None]:
# reformat the column names to one level
# create a list of the new column names in the right order
new_cols=[('{1} {0}'.format(*tup)) for tup in vaccines_by_county.columns]

# assign the flattened names to the dataframe
vaccines_by_county.columns= new_cols

# resort the index, so you get the columns in the order you specified
vaccines_by_county = vaccines_by_county.sort_index(axis='columns')

In [None]:
# rename the columns
vaccines_by_county = vaccines_by_county.rename(columns = {
                          '2021-06-27 fully': 'fully_w26', 
                          '2021-07-04 fully': 'fully_w27',
                          '2021-07-11 fully': 'fully_w28',
                          '2021-07-18 fully': 'fully_w29',
                          '2021-07-25 fully': 'fully_w30',
                          '2021-06-27 dose1': 'dose1_w26', 
                          '2021-07-04 dose1': 'dose1_w27',
                          '2021-07-11 dose1': 'dose1_w28',
                          '2021-07-18 dose1': 'dose1_w29',
                          '2021-07-25 dose1': 'dose1_w30',
                          }, inplace = False) 

In [None]:
# merge the data with an inner join to remove counties with missing data
covid_and_vaccines_by_fips = covid_by_county.merge(vaccines_by_county, on=['fips'])

# post merge, split the data back into separate datasets
# this will allow both datasets to have an identical set of fips
covid_by_county = covid_and_vaccines_by_fips.iloc[:,:nr_columns_covid_by_county]
vaccines_by_county = covid_and_vaccines_by_fips.iloc[:,nr_columns_covid_by_county:]

In [None]:
# check out the shape of the wide datasets
print(covid_by_county.shape[1],"total columns in covid_by_county, and ",covid_by_county.shape[0],"rows")
print(vaccines_by_county.shape[1],"total columns in vaccines_by_county",vaccines_by_county.shape[0],"rows")

In [None]:
# inspect the covid data
covid_by_county.head()

**Comment:** As expected, some records were dropped as part of the merge. There are now 3214 rows, instead of 3218. Both data frames now cover the identical set of counties, because the inner join dropped every record without a matching `fips` in the other data frame.

**1.5** Merge the 4 data frames (`covid_by_county`, `vaccines_by_county`, `masks`, and `demo`) based on `fips` and save the result as `covid_merged` (you should drop any counties that are not measured in all 4 data frames).  Determine and report how many counties were dropped from `demo` in this process, and view the header of `covid_merged`.



In [None]:
# retain the original nr. of rows in demo
demo_fips_cnt = demo.shape[0]

# Merge the datasets
covid_merged = covid_by_county.merge(vaccines_by_county, on=['fips'])
covid_merged = covid_merged.merge(masks, on=['fips'])
covid_merged = covid_merged.merge(demo, on=['fips'])
print("The number of records dropped from demo is:", demo_fips_cnt - covid_merged.shape[0])
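
Subtracting row counts equals the number of dropped counties only if each `fips` appears exactly once in every frame. A direct count that does not rely on that assumption can be sketched as follows; `demo_toy`, `cases_toy`, and `merged_toy` are invented stand-ins for the real frames.

```python
import pandas as pd

demo_toy = pd.DataFrame({"fips": [1001, 1003, 1005, 1007],
                         "population": [55, 218, 25, 22]})
cases_toy = pd.DataFrame({"fips": [1001, 1003, 1009],
                          "cases_w30": [110, 58, 12]})

# merge uses an inner join by default, keeping only FIPS codes present in both
merged_toy = cases_toy.merge(demo_toy, on="fips")

# demo counties whose FIPS code did not survive the inner join
dropped = ~demo_toy["fips"].isin(merged_toy["fips"])
print("dropped from demo:", dropped.sum())
```

The boolean mask can also be used to list the dropped counties themselves, for example `demo_toy[dropped]`.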

In [None]:
covid_merged.head()

**1.6** Use `covid_merged` to calculate the novel case rate (per 1000 residents) for each of the weeks for all of the counties, and save these as 4 new well-named variables in `covid_merged`.  For example, `rate_w30` can mathematically be represented as `1000*(cases_w30-cases_w29)/population`.  Plot the histogram of the novel case rate in week 29, July 12-18, `rate_w29`, and comment on what you notice.



In [None]:
# calculate novel case rates
covid_merged['novel_case_rate_w27'] = 1000*(covid_merged['cases_w27'] - 
                                           covid_merged['cases_w26']) / covid_merged['population']
covid_merged['novel_case_rate_w28'] = 1000*(covid_merged['cases_w28'] - 
                                           covid_merged['cases_w27']) / covid_merged['population']
covid_merged['novel_case_rate_w29'] = 1000*(covid_merged['cases_w29'] - 
                                           covid_merged['cases_w28']) / covid_merged['population']
covid_merged['novel_case_rate_w30'] = 1000*(covid_merged['cases_w30'] - 
                                           covid_merged['cases_w29']) / covid_merged['population']
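
The four assignments follow one pattern, so they can be generated in a loop. A minimal sketch on invented counts, using the `rate_wNN` naming suggested in the question rather than the `novel_case_rate_wNN` names above:

```python
import pandas as pd

toy = pd.DataFrame({
    "population": [10_000, 50_000],
    "cases_w26": [100, 400],
    "cases_w27": [110, 400],
    "cases_w28": [125, 390],
    "cases_w29": [150, 420],
    "cases_w30": [200, 470],
})

for week in range(27, 31):
    # new cases in a week = difference of consecutive cumulative counts
    new_cases = toy[f"cases_w{week}"] - toy[f"cases_w{week - 1}"]
    toy[f"rate_w{week}"] = 1000 * new_cases / toy["population"]

print(toy[["rate_w27", "rate_w28", "rate_w29", "rate_w30"]])
```

The second toy county has a lower cumulative count in week 28 than in week 27, which produces a negative rate. Cumulative counts that are revised downward are one way negative values can appear in a histogram of these rates.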

In [None]:
# show histogram for week 29
novel_case_rate_w29_hist = plt.hist(covid_merged['novel_case_rate_w29'])
plt.xlabel("Novel case rate per 1000 residents")
plt.ylabel("Number of counties")
plt.suptitle("Distribution of new cases per 1000 in week 29, by county")
plt.show()

**Comment:** Most counties showed an increase of 1 or 2 cases per thousand in week 29. A few counties had more than 2, going up to an increase of 7 cases per 1000. Close to 500 counties saw a decrease of 1 or 2 cases per thousand relative to the previous week.

**1.7** We did the steps above (and some other minimal processing) and saved the results in `covid_clean.csv` for you.  Use this data file to answer some exploratory questions and all future analyses: 

1. Has the overall average case rate increased from week 28 (July 5-11) to week 29 (July 12-18)?  
2. Treating the counties as separate and equal observations: in what states did the case rate increase the most?  In what states did the case rate decrease the most (or increase the least)?  List the top 5 for each.  Do you notice any patterns in these states?
3. Create and interpret separate visuals to display how the county case rate in week 29 relates to each of the following variables. Interpret what you see (be specific to this domain).

    a. The political views in the county (as measured by the votergap in the 2020 election).
    
    b. The vaccination rate in the county (for week 28) (be sure to throw away the zeros as these represent unreported values).
    
    c. The population density of the county.
    
    d. Whether 50% or more of the surveyed residents in the county report that they always wore a mask in public at the time of the survey.

In [None]:
# read the cleaned dataset
covid_clean = pd.read_csv('data/covid_clean.csv')

**1. Has the overall average case rate increased from week 28 (July 5-11) to week 29 (July 12-18)?**

In [None]:
# plot novel cases per 1000 for w29
novel_case_rate_w29_hist = plt.hist(covid_clean['rate_w29']*1000)
plt.xlabel("Novel case rate per 1000 residents")
plt.ylabel("Number of counties")
plt.suptitle("Distribution by county of new cases per 1000 in week 29")
plt.show()

In [None]:
average_covid_increase = covid_clean['rate_w29'].mean()
print("Average covid increase per 1000 in week 29:", average_covid_increase *1000)

**Comment:** The mean new-case rate across counties in week 29 is about 0.66 per thousand residents. This figure alone does not establish an increase over week 28; that requires comparing it with the mean of `rate_w28`.
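
Answering whether the rate increased needs both weeks side by side. A sketch of that comparison, assuming `covid_clean` stores `rate_w28` and `rate_w29` as proportions of residents (as the `*1000` scaling above suggests); the numbers here are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "rate_w28": [0.0002, 0.0005, 0.0010, 0.0003],
    "rate_w29": [0.0004, 0.0004, 0.0016, 0.0006],
})

# county-level change in the new-case rate from week 28 to week 29
change = toy["rate_w29"] - toy["rate_w28"]

print("mean rate per 1000, week 28:", 1000 * toy["rate_w28"].mean())
print("mean rate per 1000, week 29:", 1000 * toy["rate_w29"].mean())
print("share of counties with a higher rate in week 29:", (change > 0).mean())
```

Reporting the share of counties with a higher rate alongside the two means guards against a few large counties' changes dominating the average.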

**2. Treating the counties as separate and equal observations: in what states did the case rate increase the most? In what states did the case rate decrease the most (or increase the least)? List the top 5 for each. Do you notice any patterns in these states?**

In [None]:
# show the counties with the highest week-29 rate
covid_clean.sort_values('rate_w29', ascending=False).head()

In [None]:
# show the counties with the lowest week-29 rate
covid_clean.sort_values('rate_w29', ascending=True).head(5)

**Comment:** At first glance it looks like the northern states had the lowest increase in cases, and the mid-country southern states had the highest increase in cases in week 29.
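
The question asks about states and about the change in rate, whereas the cells above rank individual counties by `rate_w29`. A state-level sketch, assuming `covid_clean` has a `state` column (an assumption; the toy frame below is invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "state":    ["A", "A", "B", "B", "C"],
    "rate_w28": [0.0010, 0.0020, 0.0005, 0.0005, 0.0030],
    "rate_w29": [0.0030, 0.0040, 0.0004, 0.0006, 0.0020],
})

toy["rate_change"] = toy["rate_w29"] - toy["rate_w28"]

# each county counts equally: unweighted mean of county-level changes per state
state_change = (toy.groupby("state")["rate_change"]
                   .mean()
                   .sort_values(ascending=False))

print(state_change.head(5))   # largest increases
print(state_change.tail(5))   # largest decreases or smallest increases
```

Using an unweighted mean matches the instruction to treat counties as separate and equal observations; a population-weighted mean would answer a different question.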

**3. Create and interpret separate visuals to display how the county case rate in week 29 relates to each of the following variables. Interpret what you see (be specific to this domain).**

**a. The political views in the county (as measured by the votergap in the 2020 election).**

In [None]:
# plot novel cases per 1000 for w29 as a function of votergap
# show the logarithmic view to distribute the scatter more evenly
plt.scatter(covid_clean['votergap20'],np.log(covid_clean['rate_w29']*1000))
plt.xlabel("Votergap in the 2020 election")
plt.ylabel("Log of novel case rate per 1000 residents")
plt.suptitle("Distribution of novel cases in week 29 by votergap")
plt.show()

**Comment:** Without controlling for confounding factors, there doesn't seem to be a pattern between votergap in the 2020 elections and the covid novel case rate in week 29.
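
The log transform used in these scatterplots is undefined for rates that are zero or negative: `np.log` yields `-inf` or `nan` there, and those counties cannot appear on the plot. A sketch that makes the exclusion explicit and countable, on invented values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "votergap20": [-40.0, -10.0, 5.0, 30.0],
    "rate_w29":   [0.0012, 0.0, -0.0001, 0.0004],
})

# keep only counties with a strictly positive rate before taking logs
positive = toy[toy["rate_w29"] > 0].copy()
positive["log_rate_per_1000"] = np.log(positive["rate_w29"] * 1000)

print("counties excluded before taking logs:", len(toy) - len(positive))
print(positive)
```

Reporting how many counties were excluded helps judge whether the absence of a visible pattern could partly reflect the subset being plotted.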

**b. The vaccination rate in the county (for week 28) (be sure to throw away the zeros as these represent unreported values).**

In [None]:
covid_clean_nonzero_vac = covid_clean[covid_clean['fully_w28']!=0]
print("Nr of counties with unreported vaccination numbers in week 28:",covid_clean.shape[0]-
                                                                      covid_clean_nonzero_vac.shape[0])

In [None]:
# plot novel cases per 1000 for w29 as a function of vaccination in the previous week
# show the logarithmic view to distribute the scatter more evenly
plt.scatter(covid_clean_nonzero_vac['fully_w28'],np.log(covid_clean_nonzero_vac['rate_w29']*1000))
plt.xlabel("Vaccination rates")
plt.ylabel("Log of novel case rate per 1000 residents")
plt.suptitle("Distribution of novel cases in week 29 by vaccination rates in week 28")
plt.show()

**Comment:** Without controlling for confounding factors, there doesn't seem to be a pattern between vaccination rates in week 28 and the covid novel case rate in week 29.

**c. The population density of the county.**

In [None]:
# plot novel cases per 1000 for w29 as a function of population density
# show the logarithmic view to distribute the scatter more evenly
plt.scatter(np.log(covid_clean['density']),np.log(covid_clean['rate_w29']*1000))
plt.xlabel("Log of population density (residents per square mile)")
plt.ylabel("Log of novel case rate per 1000 residents")
plt.suptitle("Distribution of novel cases in week 29 by population density")
plt.show()

**Comment:** Without controlling for confounding factors, there doesn't seem to be a pattern between population density and the covid novel case rate in week 29.

**d. Whether 50% or more of the surveyed residents in the county report that they always wore a mask in public at the time of the survey.**

In [None]:
# subset to mask wearing higher than 50%
covid_clean_mask_adoption_above_50pc = covid_clean[covid_clean['always']>=50]
covid_clean_mask_adoption_below_50pc = covid_clean[covid_clean['always']<50]

In [None]:
# overlay histograms of novel cases per 1000 for w29,
# split by whether at least 50% of respondents always wear a mask
novel_case_rate_w29_hist_above_50_masked = plt.hist(covid_clean_mask_adoption_above_50pc['rate_w29']*1000, alpha=0.5, label='Above 50% always wears mask')
novel_case_rate_w29_hist_below_50_masked = plt.hist(covid_clean_mask_adoption_below_50pc['rate_w29']*1000, alpha=0.5, label='Below 50% always wears mask')
plt.xlabel("Novel case rate per 1000 residents")
plt.ylabel("Number of counties")
plt.legend()
plt.suptitle("Distribution by county of new cases per 1000 in week 29")
plt.show()

**Comment:** Without controlling for confounding factors, there seems to be only a weak pattern between mask adoption above 50 percent and the increase in covid cases per thousand. For about 100 counties, the number of cases seems to be 1 or 2 higher per thousand.
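
Overlaid histograms of groups with different sizes are hard to compare by eye, so a numeric summary per group can complement the plot. A sketch assuming `always` is expressed in percent (0 to 100), as the `>=50` filter above implies; the data and the `mask_majority` column are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "always":   [62.0, 55.0, 41.0, 30.0, 50.0],
    "rate_w29": [0.0003, 0.0005, 0.0011, 0.0009, 0.0004],
})

# True when at least half of respondents report always wearing a mask
toy["mask_majority"] = toy["always"] >= 50
toy["rate_w29_per_1000"] = toy["rate_w29"] * 1000

summary = (toy.groupby("mask_majority")["rate_w29_per_1000"]
              .agg(["count", "mean", "median"]))
print(summary)
```

The `count` column shows how unbalanced the two groups are, which matters when reading raw-count histograms.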

---

## Question 2 [35pts]: Regression modeling 

**2.1** Fit a linear regression model to predict `rate_w29` (which represents the rate of new cases in the week of July 12-18) from `rate_w28` (July 5-11). Report the 95% confidence intervals for the coefficients, and carefully interpret the coefficients (including their statistical significances).  What does this model suggest about whether the rate of COVID infection increased from week 28 to week 29?


**2.2** Fit a linear regression model to predict `rate_w29` from `rate_w28` and `votergap20` along with the interaction between the two.  Interpret the coefficient estimates carefully (no need to mention significances).


**2.3** Create a scatterplot of `rate_w29` vs. `rate_w28`.  Add 3 separate predicted lines from your model in 2.2 to this scatterplot: the predicted line from the model in 2.2 for counties...
    1. where Biden was favored by 50 percentage points.
    2. where Biden and Trump were equal
    3. where Trump was favored by 50 percentage points.
Interpret what you see.


**2.4** Fit a linear regression model to assess the overall association of vaccination rate (`fully_w28`) with `rate_w29`.  Carefully interpret the results (including the statistical significance).  


**2.5** Many counties have the value zero for `fully_w28`, which really represents a missing/unreported value for vaccination rate.  Comment on the effect that ignoring this issue can have on the interpretations and inferences in the model in 2.4.  What would be a better way of handling this issue?


**2.6** What factors could be confounded (whether measured here or not) with the result seen in the model from 2.3 (list up to 3)?  Fit an appropriate linear model that controls for as many of these factors as possible (for those that are measured in this data set). Interpret the coefficient estimates from this model and compare to the results from 2.4.

**2.7** What major issue could arise if you fit a model to predict `rate_w29` from `rate_w28` and `rate_w27` (or from `fully_w28` and `fully_w27`) in a linear regression model?  Suggest and explain the use of two different approaches to account for this: one approach should be based on modeling and one approach should be based on feature engineering/variable transformations (not PCA). 

**2.8** The test set has a response variable that is `rate_w30`.  How would you adapt the models in this section, which predict `rate_w29`, so that they predict `rate_w30` instead?  Explain.  What could go wrong in this modification?

**Hint**: what should be the predictors to predict `rate_w30` instead of `rate_w29`? 


## Answers

**2.1** Fit a linear regression model to predict `rate_w29` (which represents the rate of new cases in the week of July 12-18) from `rate_w28` (July 5-11). Report the 95% confidence intervals for the coefficients, and carefully interpret the coefficients (including their statistical significances).  What does this model suggest about whether the rate of COVID infection increased from week 28 to week 29?


In [None]:
from statsmodels.api import OLS

In [None]:
# plot novel case rate for w29 as a function of w28
plt.scatter(covid_clean['rate_w28'],covid_clean['rate_w29'])
plt.xlabel("Increase in week 28")
plt.ylabel("Increase in week 29")
plt.suptitle("Graph 2.1.1 Correlation between week 28 and week 29 covid case growth rate")
plt.show()

**Comment:** Graph 2.1.1 indicates a positive linear relationship between the rate of increase in week 28 and the rate of increase in week 29. This is quite intuitive, as one would assume a certain level of momentum.

In [None]:
# shape the data for regression with one series
rate_w29 = covid_clean['rate_w29'].to_numpy().reshape(-1,1)
rate_w28 = covid_clean['rate_w28'].to_numpy().reshape(-1,1)

# add intercept
OLS_X = sm.tools.add_constant(rate_w28)

# fit the model on the training data
OLSModel = OLS(rate_w29,OLS_X).fit()
# print("Statmodels results: \n",OLSModel.params,sep="")
OLSModel.summary()

In [None]:
# grab the coefficients confidence intervals from the model
confidence_intervals = OLSModel.conf_int(alpha=0.05, cols=None)
pvalues = OLSModel.pvalues

In [None]:
# printing values from the summary table
print("The 95% confidence interval for the intercept is:", confidence_intervals[0][0], "-", confidence_intervals[0][1])
print("The statistical significance (p-value) for the intercept is:", pvalues[0])
print("The 95% confidence interval for the slope is:", confidence_intervals[1][0], "-", confidence_intervals[1][1])
print("The statistical significances (p-value) for the slope is:", pvalues[1])
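
Fitting from a DataFrame through the statsmodels formula interface keeps coefficient names, so intervals and p-values can be read by label instead of by position. A sketch on simulated data (the generating values 0.0004 and 0.66 are chosen only to resemble this section; nothing here uses the real `covid_clean`):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
toy = pd.DataFrame({"rate_w28": rng.uniform(0, 0.003, size=200)})
toy["rate_w29"] = 0.0004 + 0.66 * toy["rate_w28"] + rng.normal(0, 0.0002, size=200)

fit = smf.ols("rate_w29 ~ rate_w28", data=toy).fit()

ci = fit.conf_int(alpha=0.05)   # DataFrame indexed by term name
ci.columns = ["lower", "upper"]
print(ci)
print(fit.pvalues)
```

With named terms, `ci.loc["rate_w28"]` returns the slope interval directly, which avoids mixing up the intercept and slope rows.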

In [None]:
# plot novel case rate for w29 as a function of w28, with the fitted line
plt.scatter(covid_clean['rate_w28'],covid_clean['rate_w29'])
plt.xlabel("Increase in week 28")
plt.ylabel("Increase in week 29")
i, s = OLSModel.params             # fitted intercept (about 0.0004) and slope (about 0.6603)
x=np.linspace(-0.005,0.03,20)      # 20 evenly spaced x values from -0.005 to 0.03
plt.plot(x, s*x + i, c = 'red')    
plt.suptitle("Graph 2.1.2 Correlation between week 28 and week 29 covid case growth rate")
plt.show()

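**Comment:** The p-values for both coefficients are reported as 0, meaning they are smaller than the displayed precision rather than exactly zero. A p-value is the probability of obtaining an estimate at least this far from zero if the null hypothesis (a true coefficient of zero) were true. As such, both coefficients are considered statistically significant.
The intercept coefficient is very close to 0 (about 0.0004). Most of the predicted value of rate_w29 therefore comes from the rate_w28 term.
The slope coefficient is estimated at 0.6603: for every one-unit increase in rate_w28 for a county, we expect rate_w29 to be 0.6603 higher on average. Because the slope is less than one, the model predicts a week 29 rate below the week 28 rate for counties whose week 28 rate exceeds roughly 0.0004 / (1 - 0.6603) ≈ 0.0012, and a higher rate for counties below that value. Taken at face value, this suggests the rate of COVID infection DEcreased from week 28 to week 29 in the counties with the highest week 28 rates.
When graphing the relationship, however, it seems reasonable to conclude that the slope is underestimated, probably due to outliers on the bottom right.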
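The week 28 rate at which the fitted line predicts no change can be worked out directly from the two coefficients. The sketch below uses the rounded values quoted in the comment above (0.0004 and 0.6603); the function and variable names are illustrative and not part of the notebook.

```python
# Rounded coefficients as quoted in the 2.1 comment (assumed, not re-estimated here)
intercept = 0.0004
slope = 0.6603

def predicted_w29(rate_w28):
    """Predicted week 29 rate from the simple linear model."""
    return intercept + slope * rate_w28

# Solve intercept + slope * r = r for r: the rate at which no change is predicted
break_even = intercept / (1 - slope)

print("Break-even week 28 rate:", break_even)
print("Prediction below break-even:", predicted_w29(0.0005))
print("Prediction above break-even:", predicted_w29(0.005))
```

Counties below the break-even rate are predicted to rise and counties above it to fall, which is the usual regression-toward-the-mean pattern when a slope is below one.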

**2.2** Fit a linear regression model to predict `rate_w29` from `rate_w28` and `votergap20` along with the interaction between the two.  Interpret the coefficient estimates carefully (no need to mention significances).


In [None]:
# add intercept, taking into account both regressors
covid_clean['rate_w28*votergap20'] = covid_clean['rate_w28']*covid_clean['votergap20']
OLS_x_train = sm.tools.add_constant(covid_clean[['rate_w28','votergap20', 'rate_w28*votergap20']])

# target variable: rate of new cases in week 29
y_train = covid_clean['rate_w29']

In [None]:
# fit the model on the training data
OLSModel = OLS(y_train,OLS_x_train).fit()
print("Statsmodels results: \n",OLSModel.params,sep="")

OLSModel.summary()

In [None]:
# plot new cases per 1000 for w29 as a function of w28, with the fitted line at votergap20 = 0
plt.scatter(covid_clean['rate_w28'],covid_clean['rate_w29'])
plt.xlabel("Increase in week 28")
plt.ylabel("Increase in week 29")
i=0.0001       # intercept
s=1.1899       # slope when votergap20 = 0
x=np.linspace(-0.005,0.03,20)      # 20 evenly spaced x values from -0.005 to 0.03
plt.plot(x, s*x + i, c = 'red')    
plt.suptitle("Graph 2.1.3 Correlation between week 28 and week 29 covid case growth rate")
plt.show()

In [None]:
# plot the interaction
votergap20_median = covid_clean.votergap20.median()
covid_clean['gap20_med'] = covid_clean.votergap20 > votergap20_median
plt.scatter(covid_clean['rate_w28'],covid_clean['rate_w29'], c=covid_clean['gap20_med'], alpha = 0.5)
# plot the fitted slope at votergap20 = 0
x=np.linspace(-0.005,0.03,20)      # 20 evenly spaced x values from -0.005 to 0.03
i=0.0001       # intercept
s=1.1899       # slope when votergap20 = 0
plt.plot(x, s*x + i, c = 'yellow', label='Slope at votergap20 = 0')   
# plot the fitted slope at the median votergap20
s= 1.1899 + 0.0102 * votergap20_median
plt.plot(x, s*x + i, c = 'purple', label='Slope at median votergap20')
plt.xlabel("Increase in week 28")
plt.ylabel("Increase in week 29")
plt.suptitle("Graph 2.1.4 Correlation between w28 and w29 covid growth rate by votergap")
plt.legend()

In [None]:
# showing all rows and columns when displaying pandas info 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
# show the outliers
print("Votergap outliers:")
covid_clean.loc[covid_clean['rate_w28'] > 0.01, ['county','votergap20']]

**Comment:** 
Coefficient interpretation:
- At rate_w28 = 0 and votergap20 = 0, we expect rate_w29 to be 0.0001
- For every 1 unit increase in rate_w28 (the increase in covid cases per 1000, relative to the previous week), rate_w29 increases by 1.1899 (holding votergap20 at 0)
- For every 1 unit increase in rate_w28, rate_w29 changes by 1.1899 + votergap20 * 0.0102 (where votergap20 is not 0)
- For every 1 unit increase in votergap20, rate_w29 changes by -4.804e-06 (holding rate_w28 at 0)
- For every 1 unit increase in votergap20, rate_w29 changes by -4.804e-06 + rate_w28 * 0.0102 (where rate_w28 is not 0)

**Comment:** While the coefficient for votergap20 is relatively small, controlling for votergap20 and its interaction with rate_w28 has a material effect. After controlling for votergap20 (and the interaction), the coefficient for rate_w28 is almost 1.19. This indicates that, in a county where votergap20 is 0, every one-unit increase in rate_w28 is associated with an increase of about 1.19 units in rate_w29, roughly 19% more than the increase in rate_w28 itself.
After controlling for the confounding factors, the regression line in graph 2.1.3 also looks more reasonable.
The interaction coefficient indicates how the slope for rate_w28 changes with voter gap: for each one-point increase in votergap20 the slope increases by about 0.01, and for each one-point decrease it falls by the same amount. As the median votergap20 is negative, the slope at the median is lower than 1.1899. This is illustrated in graph 2.1.4 in purple. The graph also demonstrates how the purple line may be pushed down by outliers for counties Franklin, Powder River, and Loving, counties where Trump significantly outperformed Biden.
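With an interaction term, the slope for rate_w28 is no longer a single number but a function of votergap20. The following sketch evaluates that function using the rounded coefficients quoted above; the names are illustrative, and the sign convention (positive values favoring Biden) follows the one used in this notebook's 2.3 code.

```python
# Rounded coefficients as quoted in the 2.2 comment (assumed, not re-estimated here)
B_RATE_W28 = 1.1899      # slope for rate_w28 when votergap20 = 0
B_INTERACTION = 0.0102   # change in that slope per one-point change in votergap20

def slope_for_rate_w28(votergap20):
    """Conditional slope of rate_w29 on rate_w28 at a given votergap20."""
    return B_RATE_W28 + B_INTERACTION * votergap20

# Evaluate at the three scenarios used in 2.3
slopes = {gap: slope_for_rate_w28(gap) for gap in (-50, 0, 50)}
for gap, slope in slopes.items():
    print("votergap20 =", gap, "-> slope", round(slope, 4))
```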

**2.3** Create a scatterplot of `rate_w29` vs. `rate_w28`.  Add 3 separate predicted lines from your model in 2.2 to this scatterplot: the predicted line from the model in 2.2 for counties...
    1. where Biden was favored by 50 percentage points.
    2. where Biden and Trump were equal
    3. where Trump was favored by 50 percentage points.
Interpret what you see.




In [None]:
# plot various predicted lines for three votergap scenarios


# We can see the interaction by cutting one of the terms in the interaction along its median,
# and then plotting the response variable against the other variable in the interacting pair
votergap20_median = covid_clean.votergap20.median()
covid_clean['gap20_med'] = covid_clean.votergap20 > votergap20_median
plt.scatter(covid_clean['rate_w28'],covid_clean['rate_w29'], c=covid_clean['gap20_med'], alpha = 0.5)

# line specs
x=np.linspace(-0.005,0.03,20) 
b0=0.0001          # intercept when votergap20 = 0
b_gap=-4.804e-06   # votergap20 main effect from 2.2 (shifts the intercept)

# plot where Biden was favored by 50 percentage points
votergap20 = 50
i = b0 + b_gap * votergap20            # intercept as a function of votergap20
s = 1.1899 + 0.0102 * votergap20       # slope as a function of votergap20
plt.plot(x, s*x + i, c = 'blue', label='Biden was favored by 50 percentage points')   

# plot where Biden and Trump were equal
votergap20 = 0
i = b0 + b_gap * votergap20            # intercept as a function of votergap20
s = 1.1899 + 0.0102 * votergap20       # slope as a function of votergap20
plt.plot(x, s*x + i, c = 'purple', label='Biden and Trump were equal')

# plot where Trump was favored by 50 percentage points.
votergap20 = -50
i = b0 + b_gap * votergap20            # intercept as a function of votergap20
s = 1.1899 + 0.0102 * votergap20       # slope as a function of votergap20
plt.plot(x, s*x + i, c = 'red', label='Trump was favored by 50 percentage points')   

# Clean up plot
plt.xlabel("Increase in week 28")
plt.ylabel("Increase in week 29")
plt.suptitle("Comparison of predicted slopes by votergap")
plt.legend()

**Comment:** We can observe a flatter predicted line, and so a lower predicted week 29 increase for a given positive week 28 rate, in counties where Trump was favored by 50 percentage points, and a steeper line with a higher predicted increase in counties where Biden was favored by 50 percentage points. Where Biden and Trump were equal, the week 29 increase lies in between.

**2.4** Fit a linear regression model to assess the overall association of vaccination rate (`fully_w28`) on `rate_w29`.  Carefully interpret the results (including the statistical significance).  




In [None]:
# shape the data for regression with one series
rate_w29 = covid_clean['rate_w29'].to_numpy().reshape(-1,1)
fully_w28 = covid_clean['fully_w28'].to_numpy().reshape(-1,1)

# add intercept
OLS_X = sm.tools.add_constant(fully_w28)

# fit the model on the training data
OLSModel = OLS(rate_w29,OLS_X).fit()
# print("Statmodels results: \n",OLSModel.params,sep="")
OLSModel.summary()

In [None]:
# grab the coefficients confidence intervals from the model
confidence_intervals = OLSModel.conf_int(alpha=0.05, cols=None)
pvalues = OLSModel.pvalues

# printing values from the summary table
print("The 95% confidence interval for the intercept is:", confidence_intervals[0][0], "to", confidence_intervals[0][1])
print("The p-value for the intercept is:", pvalues[0])
print("The 95% confidence interval for the slope is:", confidence_intervals[1][0], "to", confidence_intervals[1][1])
print("The p-value for the slope is:", pvalues[1])

**Comment:** The p-values for both coefficients are near 0. A p-value is the probability of obtaining an estimate at least this far from zero if the null hypothesis (a true coefficient of zero) were true, so values this small mean both coefficients are considered statistically significant.
The intercept coefficient is 0.0009. When the full vaccination rate in week 28 is zero, we therefore expect the increase in covid cases in week 29 to be 0.0009.
The slope coefficient is estimated at roughly -7e-06. As such, for every one-unit increase in the vaccination rate for a county, we expect rate_w29 to decrease on average by about 0.000007 (7 in a million).

**2.5** Many counties have the value zero for `fully_w28` which really represents a missing/unreported value for vaccination rate.  Comment on the effect that ignoring this issue can have on the interpretations and inferences in the model in 2.4.  What would be a better way of handling this issue?




**Comment:** Since a value of zero for fully_w28 really represents a missing / unreported vaccination rate, it should not be read as a true rate of zero. First, leaving these zeros in the model in 2.4 treats the affected counties as entirely unvaccinated, which can bias the estimated slope and intercept, and the intercept loses its interpretation as the expected rate for a county with no vaccination. Second, simply dropping these counties loses data, which reduces statistical power (the probability that the test will reject the null hypothesis when it is false) and can still cause bias if the values are not missing at random.
A better way of handling this would be to recode the zeros as missing and impute a reasonable value. One way to impute a reasonable value could be to use the average vaccination rate of neighbouring counties. Alternatively, we could use a modeling technique such as regression to impute the missing value.
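As a rough illustration of the imputation idea, the sketch below recodes zeros as missing and fills them with the median of the reporting counties in the same state. Using the state median as a stand-in for "neighbouring counties" is an assumption made here for simplicity, and the toy values are made up; only the `state` and `fully_w28` column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for covid_clean (values are invented for illustration)
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B", "B"],
    "fully_w28": [40.0, 0.0, 50.0, 30.0, 0.0],
})

# 1. Treat zero as "not reported" rather than a true vaccination rate of 0,
#    and keep an indicator so the model can still see which rows were imputed
toy["fully_w28_missing"] = toy["fully_w28"].eq(0)
toy["fully_w28_imputed"] = toy["fully_w28"].replace(0, np.nan)

# 2. Fill each gap with the median of the reporting counties in the same state
state_median = toy.groupby("state")["fully_w28_imputed"].transform("median")
toy["fully_w28_imputed"] = toy["fully_w28_imputed"].fillna(state_median)

print(toy)
```

Keeping the missingness indicator as an extra predictor is one way to check whether non-reporting counties differ systematically from the rest.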

**2.6** What factors could be confounded (whether measured here or not) with the result seen in the model from 2.3 (list up to 3)?  Fit an appropriate linear model that controls for as many of these factors as possible (for those that are measured in this data set). Interpret the coefficient estimates from this model and compare to the results from 2.4.



**Comment:** Amongst many possible confounding factors, it seems reasonable that the results in 2.3 could be confounded by mask wearing habits, vaccination rates, and population density.

In [None]:
# removing all the non-numeric columns
covid_clean_num_only = covid_clean.drop(['date','county', 'state', 'gap20_med'],axis=1) 

# removing the data for previous weeks as they are likely to be highly collinear with the values of week 28
covid_clean_num_only = covid_clean_num_only.drop(['cases_w26','deaths_w26', 'fully_w26', 'dose1_w26'],axis=1)
covid_clean_num_only = covid_clean_num_only.drop(['cases_w27','deaths_w27', 'fully_w27', 'dose1_w27', 'rate_w27'],axis=1)
covid_clean_num_only = covid_clean_num_only.drop(['cases_w28'],axis=1)
# removing the data from week 29 and week 30 as this wouldn't be known at prediction time
covid_clean_num_only = covid_clean_num_only.drop(['cases_w29','deaths_w29', 'fully_w29', 'dose1_w29'],axis=1)
covid_clean_num_only = covid_clean_num_only.drop(['cases_w30','deaths_w30', 'fully_w30', 'dose1_w30', 'rate_w30'],axis=1)

In [None]:
# cleaning out a couple of records with non-numeric values in votergap16
orig_nr_rows = covid_clean_num_only.shape[0]
covid_clean_num_only = covid_clean_num_only.loc[covid_clean_num_only['votergap16']!='#VALUE!']
print("Number of rows removed by cleaning up non-numeric values:", orig_nr_rows - covid_clean_num_only.shape[0])

#convert votergap16 to dtype float
covid_clean_num_only['votergap16'] = covid_clean_num_only['votergap16'].astype(float)

In [None]:
# drop the previous interaction variable, then
# add interaction variables between rate_w28 and every other predictor

covid_clean_num_only = covid_clean_num_only.drop(['rate_w28*votergap20'],axis=1) 
# for column in ['never','rarely','sometimes','frequently', 'always', 'density', 'votergap20']:
for column in covid_clean_num_only.columns:
    if column != 'rate_w28' and column != 'rate_w29':
        covid_clean_num_only[str(column) + '*' + 'rate_w28'] = covid_clean_num_only[column] * covid_clean_num_only['rate_w28']



In [None]:
# #standardize the features
# from sklearn.preprocessing import MinMaxScaler
# column_names = covid_clean_num_only.columns
# scale_transformer = MinMaxScaler(copy=True).fit(covid_clean_num_only)
# covid_clean_num_only = pd.DataFrame(scale_transformer.transform(covid_clean_num_only))
# covid_clean_num_only.columns = column_names

In [None]:
# manage the target variable
X_train = covid_clean_num_only.loc[:, covid_clean_num_only.columns != 'rate_w29']
y_train = covid_clean_num_only.rate_w29

In [None]:
# use backward selection to prune the predictors
X = pd.DataFrame(X_train)
# maximum p-value allowed for a predictor to be retained
cutoff = 0.05

while X.shape[1] > 0:
    # add the constant as statsmodels doesn't do that for us
    OLS_x_train = sm.tools.add_constant(X)
    # fit the model with the remaining predictors
    OLSModel = OLS(y_train,OLS_x_train).fit()
    # p-values of the predictors, excluding the constant
    predictor_p_values = OLSModel.pvalues.drop('const', errors='ignore')
    highest_non_const_p_value = predictor_p_values.max()
    # stop once every remaining predictor is significant at the cutoff
    if highest_non_const_p_value <= cutoff:
        break
    # remove the predictor with the highest p-value (looked up by column name)
    highest_non_const_p_value_name = predictor_p_values.idxmax()
    print("Predictor:", highest_non_const_p_value_name, 
          "with associated p-value of" ,
          highest_non_const_p_value, 
          "is being removed")
    X = X.drop(highest_non_const_p_value_name,axis=1)

OLSModel.summary()

**Comments:** 
- We added a substantial set of possibly confounding factors. The effect of full vaccination in week 28, however, remains quite similar: in 2.4, the coefficient for fully_w28 is -6.699e-06. After adding additional predictors, the coefficient is -6.058e-06. Full vaccination therefore continues to predict a slightly lower rate for week 29. 
- We can observe that the coefficient of rate_w28 has increased considerably, from 1.1899 in the model from 2.2 (plotted in 2.3) to 2.9137. Because this model interacts rate_w28 with every other predictor, 2.9137 is the slope when all of those predictors equal zero, so the two values are not directly comparable.
- Most predictors in the dataset have been retained after backward selection, including confounders such as population, minority, unemployment rate, education factors, health factors and voter gaps. Notably, the density and "always" mask wearing predictors were eliminated due to high p-values. This is likely the case because of multicollinearity with other predictors.


**2.7** What major issue could arise if you fit a model to predict `rate_w29` from `rate_w28` and `rate_w27` (or from `fully_w28` and `fully_w27`) in a linear regression model?  Suggest and explain the use of two different approaches to account for this: one approach should be based on modeling and one approach should be based on feature engineering/variable transformations (not PCA). 



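**Comment:** rate_w28 and rate_w27 will likely be highly correlated: where there's been material change in cases in one direction, we should expect a similar change the week after. Adding both features therefore introduces a multicollinearity problem, which inflates the standard errors and makes coefficients, confidence intervals and p-values unreliable. One modeling-based solution (as illustrated above) is to use a predictor selection method (e.g., forward or backward selection); a regularized regression such as ridge or lasso is another. 
A feature-engineering solution is to transform the pair so that the new features overlap less, for example by keeping rate_w28 and replacing rate_w27 with the week-over-week change rate_w28 - rate_w27. Standardizing the features puts them on a common scale, which matters for regularized models, but on its own it does not remove the correlation between two distinct predictors.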
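One possible feature-engineering transformation, offered here as an assumption rather than something the notebook fits, is to keep the latest weekly rate and replace the earlier one with the week-over-week change. The sketch below uses synthetic data (not the CDC data) to show how much less the transformed pair overlaps than the raw pair.

```python
import numpy as np

# Synthetic weekly rates: week 28 is week 27 plus a small change (made-up scales)
rng = np.random.default_rng(109)
n = 2000
rate_w27 = rng.normal(loc=0.002, scale=0.001, size=n)
rate_w28 = rate_w27 + rng.normal(loc=0.0, scale=0.0002, size=n)

# Engineered feature: week-over-week change instead of the raw earlier rate
change = rate_w28 - rate_w27

corr_raw = np.corrcoef(rate_w28, rate_w27)[0, 1]
corr_engineered = np.corrcoef(rate_w28, change)[0, 1]

print("corr(rate_w28, rate_w27):", corr_raw)
print("corr(rate_w28, change):  ", corr_engineered)
```

The model with (rate_w28, change) spans the same information as the one with (rate_w28, rate_w27), but the coefficients are estimated from far less correlated columns.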

**2.8** The test set has a response variable that is `rate_w30`.  How would you use your models to predict `rate_w29` in this section in order to predict `rate_w30` instead?  Explain.  What could go wrong in this modification?

**Hint**: what should be the predictors to predict `rate_w30` instead of `rate_w29`? 


**Comment:** Our model is hard coded to predict rate_w29 from previous weeks' data. As such, if we wanted to predict rate_w30, we'd want to use e.g. rate_w29 as a predictor (not rate_w28).
One solution is to keep the column names as-is, but to shift the data by one week. We'd, for example, replace rate_w28 with rate_w29 data, rate_w27 data with rate_w28 data, and rate_w26 data with rate_w27 data. We'd perform a similar operation for the other weekly metrics ('cases_w26','deaths_w26', 'fully_w26', 'dose1_w26').

Covid patterns, however, change continuously, so coefficients estimated on the week 28 to week 29 transition may not carry over to the week 29 to week 30 transition. As such, this approach would not hold over time. A better approach would be to refit the model with the more recent data.
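The one-week shift described above can be sketched as a small helper that copies each week's columns from the following week while keeping the column names unchanged. The helper name and the toy values are hypothetical; only the `<metric>_w<week>` naming pattern is taken from the notebook.

```python
import pandas as pd

def shift_weeks_forward(df, features, weeks):
    """Return a copy where <feature>_w<k> holds the data from <feature>_w<k+1>."""
    shifted = df.copy()
    for week in weeks:
        for feature in features:
            source = f"{feature}_w{week + 1}"
            # always read from the original frame so shifts do not cascade
            if source in df.columns:
                shifted[f"{feature}_w{week}"] = df[source]
    return shifted

# Toy data with made-up values
toy = pd.DataFrame({
    "rate_w27": [1.0, 2.0],
    "rate_w28": [3.0, 4.0],
    "rate_w29": [5.0, 6.0],
    "rate_w30": [7.0, 8.0],
})

shifted = shift_weeks_forward(toy, features=["rate"], weeks=[27, 28])
X_w30 = shifted[["rate_w27", "rate_w28"]]   # now holds week 28 and week 29 data
y_w30 = toy["rate_w30"]                     # new response
print(X_w30)
```

A model trained with rate_w27 and rate_w28 as predictors can then be applied to `X_w30` unchanged, with its predictions compared against `y_w30`.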

---

## Question 3 [30pts]: Prediction modeling 

**3.1** Fit a well-tuned lasso model to predict `rate_w29` from the following set of predictors (along with all 2-way interactions among the main effects and all 2nd and 3rd order polynomial terms):

`['rate_w28','rate_w27','dose1_w28','hispanic','minority','female','unemployed', 'income','nodegree','bachelor','inactivity','obesity','density','cancer','votergap20']`

Report and explain the best choice of $\lambda$ (a visual can help with this), your estimate of out-of-sample $R^2$, along with the number of coefficients that shrunk exactly to zero (or numerically zero) and the number that are non-zero.

**3.2** Plot the trajectory curves of the main effects `['rate_w28','rate_w27','fully_w28','votergap20']` from this model: the estimates of the $\beta$ coefficients as a function of $\lambda$.  Interpret what you notice.

**3.3** Fit a well-tuned random forest model to predict `rate_w29` from the predictors listed in 3.1.  Report your choice of the tuning parameters and briefly justify your choices (a visual or table may be helpful for this).  Provide an estimate of out-of-sample $R^2$.  Note: do not go too crazy with the number of options for the parameters you are tuning...choose a set of values that are reasonable.

**3.4** Interpret the relationship between `rate_w29` and `dose1_w28` from the random forest model in 3.3.  Is there any evidence of interactive effects in this model involving `dose1_w28`?  How do you know?  Provide a reasonable visual (or a few visuals) to help you with these tasks and interpret what you see. 

**3.5** Fit a well-tuned boosting model to predict `rate_w29` from the predictors listed in 3.1.  Report your best choice of the tuning parameters and briefly justify your choice (a visual or table may be helpful for this).  Provide an estimate of out-of-sample $R^2$.  Note: again, do not go too crazy with the number of options for the parameters you are tuning...choose a set of values that are reasonable.

**3.6** Improve upon your favorite/best predictive model from 3.1, 3.3, or 3.5, by including other provided features, by doing feature engineering, or by doing variable removal/selection.  Explain your choices.  Provide an estimate of out-of-sample $R^2$. 

**3.7** Evaluate your models from 3.1, 3.3, 3.5, and 3.6 on the test set (this will take some work...refer back to 2.8) using $R^2$.  How do these models' $R^2$ in test compare to the out-of-sample $R^2$ when tuning?  Explain whether this is surprising or not.



## Answers

**3.1** Fit a well-tuned lasso model to predict `rate_w29` from the following set of predictors (along with all 2-way interactions among the main effects and all 2nd and 3rd order polynomial terms):

`['rate_w28','rate_w27','dose1_w28','hispanic','minority','female','unemployed', 'income','nodegree','bachelor','inactivity','obesity','density','cancer','votergap20']`

Report and explain the best choice of $\lambda$ (a visual can help with this), your estimate of out-of-sample $R^2$, along with the number of coefficients that shrunk exactly to zero (or numerically zero) and the number that are non-zero.


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
model_descriptions = ['Poly, interact and lasso', 'Random forest']
idx = pd.Index(model_descriptions, name='Regression method')

# prepare a dataframe to represent rates for each model
model_comparison_df = pd.DataFrame(
    index=idx,
    columns=['training accuracy', 'test accuracy']
)


In [None]:
# restart from clean data
covid_clean = pd.read_csv('data/covid_clean.csv')

In [None]:
# downselect columns
# not including 'cancer'
columns = ['rate_w28','rate_w27','dose1_w28','hispanic','minority','female','unemployed', 
            'income','nodegree','bachelor','inactivity','obesity','density','votergap20']
X = covid_clean.loc[:,columns]
y = pd.DataFrame(covid_clean.loc[:,['rate_w29']])
X = X.reindex()
y = y.reindex()

# add a week 30 version with an identical split for easy model testing on never seen target data
y_w30_train_lasso, y_w30_test_lasso = train_test_split(pd.DataFrame(covid_clean.loc[:,['rate_w30']]), test_size=0.2, random_state = 109)


In [None]:
# prepare out of time data
covid_clean_w30 = covid_clean.copy()
latest_week = 29
features = []
weeks = [26,27,28]
week_dependent_features = ['cases','deaths', 'fully', 'dose1', 'rate']

for week in weeks:
    for column in week_dependent_features:
        curr_feature = column + '_w' + str(week)
        next_feature = column + '_w' + str(week + 1)
        features.append(curr_feature)
        covid_clean_w30[curr_feature] = covid_clean[next_feature]

# use the shifted copy, so that e.g. rate_w28 now holds the week 29 data
X_w30 = covid_clean_w30.loc[:,columns]
y_w30 = pd.DataFrame(covid_clean_w30.loc[:,['rate_w30']])
X_w30 = X_w30.reindex()
y_w30 = y_w30.reindex()

In [None]:
# split dataset in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 109)

In [None]:
# Note how the number of rows for X_train and X_test is different, but the number of columns is identical
print("X_train shape", X_train.shape)
print("X_test shape", X_test.shape)

In [None]:
def add_poly_features(dataset, degree, columns):
    """
    :param dataset: Your data
    :param degree: Max degree
    :param columns: Columns for which to add polynomial terms
    :return: Augmented DataFrame (a copy; the input is left unchanged)
    """
    # work on a copy to avoid modifying a slice of the original data
    poly_dataset = dataset.copy()
    # walk through the columns for which to add polynomials
    for column in columns:
        # create polynomial terms of degree 2 up to `degree` (exponents 0 and 1 are skipped)
        for polynomial in range(2, degree + 1):
            # create the new columns
            poly_dataset[str(column) + "_" + str(polynomial)] = dataset[column] ** polynomial
    return poly_dataset

In [None]:
# add second and third polynomials
X_train = add_poly_features(X_train, 3, columns)
X_test = add_poly_features(X_test, 3, columns)
# take a quick look at the dataset
print(X_train.shape)
print(X_test.shape)
X_train.describe()

In [None]:
def build_interaction(df, columns):
    # create a copy of the columns and dataframes to avoid unintentionally changing the original set
    interact_left = columns.copy()
    interact_right = columns.copy()
    result_df = df.copy()

    # create interaction features for all the requested columns
    for left in interact_left:
        # avoid multiplying by oneself, or producing the same column twice
        interact_right.remove(left)
        for right in interact_right:
            # create an interaction column by multiplying the numbers
            if left != right:
                result_df[str(left) + '_*_' + str(right)] = df[left] * df[right]
    return result_df

In [None]:
X_train = build_interaction(X_train, columns)
X_test = build_interaction(X_test, columns)
print(X_train.shape)
print(X_test.shape)
# X_train.describe()
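A quick way to sanity-check the printed shapes is to count the columns the expansion should produce: the main effects, the 2nd and 3rd powers of each, and one interaction per unordered pair. The sketch below is an independent check on a toy frame; the function name and toy columns are illustrative, and the column-naming pattern mirrors the one used in the notebook.

```python
from itertools import combinations

import pandas as pd

def expected_feature_count(n_main, degree=3):
    """Main effects + powers 2..degree + all 2-way interactions."""
    powers = n_main * (degree - 1)
    interactions = n_main * (n_main - 1) // 2
    return n_main + powers + interactions

# Toy frame with three made-up predictors
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
expanded = toy.copy()
for col in toy.columns:
    for power in (2, 3):
        expanded[f"{col}_{power}"] = toy[col] ** power
for left, right in combinations(toy.columns, 2):
    expanded[f"{left}_*_{right}"] = toy[left] * toy[right]

print("Toy columns:", expanded.shape[1], "expected:", expected_feature_count(3))
print("Expected columns for 14 main effects:", expected_feature_count(14))
```

With the 14 main effects used here (the list in the prompt minus 'cancer'), the expanded training matrix should therefore have 14 + 28 + 91 = 133 columns.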

In [None]:
# scale the features to the [0, 1] range (min-max scaling, fit on the training data only)
from sklearn.preprocessing import MinMaxScaler
column_names = X_train.columns
scale_transformer = MinMaxScaler(copy=True).fit(X_train)
X_train = pd.DataFrame(scale_transformer.transform(X_train))
X_test = pd.DataFrame(scale_transformer.transform(X_test))
X_train.columns = column_names
X_test.columns = column_names


# scale the response the same way, with a separate scaler fit on y_train only
y_scale_transformer = MinMaxScaler(copy=True).fit(y_train)
y_train = pd.DataFrame(y_scale_transformer.transform(y_train))
y_test = pd.DataFrame(y_scale_transformer.transform(y_test))

In [None]:
# # prepare w_30
# # add a week 30 version with an identical split for easy model testing on never seen target data
# y_w30_test_lasso, y_w30_test_lasso = train_test_split(pd.DataFrame(covid_clean.loc[:,['rate_w30']]), test_size=0.2, random_state = 109)
# # train accuracy
#         y_train_pred = covid_lasso.predict(X_train) 
#         best_train_score = r2_score(y_train, y_train_pred)

In [None]:
# take a quick look at the scaled dataset
# Note that, as expected, all features are scaled between 0 and 1
print(X_train.shape)
print(X_test.shape)
X_train.describe()

In [None]:
# put y_train in the expected format
y_train = y_train.values.ravel()

In [None]:
# get the locations for columns of interest (for later use)
rate_w28_loc = X_train.columns.get_loc('rate_w28')
rate_w27_loc = X_train.columns.get_loc('rate_w27')
dose1_w28_loc = X_train.columns.get_loc('dose1_w28')
votergap20_loc = X_train.columns.get_loc('votergap20')

rate_w28_coefs = []
rate_w27_coefs = []
dose1_w28_coefs = []
votergap20_coefs = []

In [None]:
# import functions for ease of use
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# choose from a range of lambdas (lasso penalties)
lambdas = [0.0001, 0.001, 0.01, 0.1]

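# initialize variables
best_accuracy = -np.inf     # R squared has no lower bound
best_lasso_model = None
accuracies = []
models = []

# experiment with different lambdas
for c in lambdas:
    # note: lambda is selected on the held-out split here; cross-validation on the
    # training split would give a less optimistic estimate of out-of-sample R squared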
    covid_lasso = Lasso(alpha=c, max_iter=100000, fit_intercept=True)
    covid_lasso.fit(X_train, y_train)
    y_hat_test = covid_lasso.predict(X_test)
    cur_accuracy = r2_score(y_test.to_numpy(), y_hat_test)

    # adding accuracy to a list in case we want to show how accuracy changes with lambda
    accuracies.append(cur_accuracy)
    models.append(covid_lasso)

    # track how specific coefficients change as a function of lambda
    rate_w28_coefs.append(covid_lasso.coef_[rate_w28_loc])
    rate_w27_coefs.append(covid_lasso.coef_[rate_w27_loc])
    dose1_w28_coefs.append(covid_lasso.coef_[dose1_w28_loc])
    votergap20_coefs.append(covid_lasso.coef_[votergap20_loc])
    
    # retain the best model
    if cur_accuracy > best_accuracy:
        best_accuracy = cur_accuracy
        best_lasso_model = covid_lasso
        best_lambda = c
            
        # train accuracy
        y_train_pred = covid_lasso.predict(X_train) 
        best_train_score = r2_score(y_train, y_train_pred)
        
        

print("Best lambda is:",best_lambda )
print("Best test accuracy is:",best_accuracy )

In [None]:
plt.plot(lambdas,accuracies)
plt.xscale("log")   # the lambdas span several orders of magnitude
plt.xlabel("Lambda")
plt.ylabel("Test R squared")
plt.title("Test R squared")

**Comment:** Note how only small values of lambda produce a reasonable R squared score. Once lambda reaches 0.01, the test score plummets: the penalty is too high, and all coefficients become zero.
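Because the loop above picks lambda using the same held-out split that is then reported as the out-of-sample score, the reported R squared is likely to be somewhat optimistic. A minimal sketch of the cross-validated alternative, using scikit-learn's `LassoCV` on synthetic data (not the COVID data), could look like this; the data-generating process is an assumption chosen only to make the example run.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic data: only the first two of ten features matter
rng = np.random.default_rng(109)
X = rng.uniform(size=(300, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=109)

# Choose lambda by 5-fold cross-validation on the training split only
candidate_lambdas = [0.0001, 0.001, 0.01, 0.1]
lasso_cv = LassoCV(alphas=candidate_lambdas, cv=5, max_iter=100000).fit(X_tr, y_tr)

chosen_lambda = lasso_cv.alpha_
mean_cv_mse = lasso_cv.mse_path_.mean(axis=1)   # one value per candidate lambda
test_r2 = lasso_cv.score(X_te, y_te)            # test split untouched during tuning

print("Chosen lambda:", chosen_lambda)
print("Test R squared:", test_r2)
```

Plotting `mean_cv_mse` against `lasso_cv.alphas_` on a log axis gives the kind of visual the question asks for, without using the test split for selection.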

In [None]:
# prepare a dataframe with the coefficients
coef_pd = pd.DataFrame(np.transpose([best_lasso_model.coef_]),
            columns = ["best_lasso_model_coeff"], index=X_train.columns)

In [None]:
print("The number of NON-zero coefficients:", coef_pd[:][coef_pd['best_lasso_model_coeff']!=0].shape[0])
print("The number of zero coefficients:", coef_pd[:][coef_pd['best_lasso_model_coeff']==0].shape[0])

**Comment:** Note how only five coefficients are non-zero.
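The question also allows for coefficients that are only numerically zero. A small sketch of counting with a tolerance instead of an exact comparison is shown below; the coefficient values and the tolerance of 1e-8 are made up for illustration.

```python
import numpy as np

# Made-up coefficient vector standing in for best_lasso_model.coef_
coefs = np.array([0.0, 1e-12, -0.3, 0.0, 2.5e-3, -1e-15])

# Assumed threshold below which a coefficient is treated as zero
tolerance = 1e-8
is_zero = np.isclose(coefs, 0.0, atol=tolerance)

n_zero = int(is_zero.sum())
n_nonzero = int((~is_zero).sum())
print("Zero (within tolerance):", n_zero)
print("Non-zero:", n_nonzero)
```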

In [None]:
# print all the non-zero coefficients
coef_pd[:][coef_pd['best_lasso_model_coeff']!=0]

In [None]:
# Add best lasso model to comparison table
model_comparison_df.loc['Poly, interact and lasso','test accuracy'] = best_accuracy
model_comparison_df.loc['Poly, interact and lasso','training accuracy'] = best_train_score
# model_comparison_df.loc['Poly, interact and lasso','w30 accuracy'] = w30_accuracy

# display the rates by model in percentage format
model_comparison_df.style.format({
    'training accuracy': '{:,.1%}'.format,
    'test accuracy': '{:,.1%}'.format,
})

In [None]:
# from sklearn.linear_model import Lasso
# lasso_alpha0001 = Lasso(alpha=0.0001,fit_intercept=True,max_iter=100000).fit(X_train , y_train)
# lasso_alpha001 = Lasso(alpha=0.001,fit_intercept=True,max_iter=100000).fit(X_train , y_train)
# lasso_alpha01 = Lasso(alpha=0.01,fit_intercept=True,max_iter=100000).fit(X_train , y_train)
# lasso_alpha1 = Lasso(alpha=1,fit_intercept=True,max_iter=1000).fit(X_train , y_train)
# lasso_alpha10 = Lasso(alpha=10,fit_intercept=True,max_iter=1000).fit(X_train , y_train)
# lasso_alpha100 = Lasso(alpha=100,fit_intercept=True,max_iter=1000).fit(X_train , y_train)

# # Add everything to a table
# coef_pd = pd.DataFrame(np.transpose([lasso_alpha0001.coef_,lasso_alpha001.coef_,lasso_alpha01.coef_,
#                           lasso_alpha1.coef_,lasso_alpha10.coef_,lasso_alpha100.coef_]),
#             columns = ["lasso_alpha0001","lasso_alpha001","lasso_alpha01","lasso_alpha1",
#                        "lasso_alpha10","lasso_alpha100",], index=X_train.columns)




**3.2** Plot the trajectory curves of the main effects `['rate_w28','rate_w27','fully_w28','votergap20']` from this model: the estimates of the $\beta$ coefficients as a function of $\lambda$.  Interpret what you notice.


In [None]:
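plt.plot(lambdas, rate_w28_coefs, label="rate_w28")
plt.plot(lambdas, rate_w27_coefs, label="rate_w27")
plt.plot(lambdas, dose1_w28_coefs, label="dose1_w28")
plt.plot(lambdas, votergap20_coefs, label="votergap20")
plt.xscale("log")   # the lambdas span several orders of magnitude
plt.xlabel("Lambda")
plt.ylabel("Coefficient estimate")
plt.legend()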

**Comment:** The coefficients for rate_w28, rate_w27, and dose1_w28 have been shrunk to zero for all values of lambda. votergap20 is the only one of these main effects that is retained, and only at the smallest lambda.

**3.3** Fit a well-tuned random forest model to predict `rate_w29` from the predictors listed in 3.1.  Report your choice of the best tuning parameters and briefly justify your choice (a visual or table may be helpful for this).  Provide an estimate of out-of-sample $R^2$.  Note: do not go too crazy with the number of options for the parameters you are tuning...choose a set of values that are reasonable.


In [None]:
# restart from clean data
covid_clean = pd.read_csv('data/covid_clean.csv')
# downselecting columns
# not including 'cancer'
columns = ['rate_w28','rate_w27','dose1_w28','hispanic','minority','female','unemployed', 
            'income','nodegree','bachelor','inactivity','obesity','density','votergap20']
X = covid_clean.loc[:,columns]
y = pd.DataFrame(covid_clean.loc[:,['rate_w29']])
# X_train = X_train.reindex()
# y_train = y_train.reindex()

# split dataset in train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 109)

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

In [None]:
# prepare a dataframe to represent rates for each model
random_forest_comparison_df = pd.DataFrame(
    columns=['training accuracy', 'test accuracy']
)

In [None]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

random_forest_train_score = -np.inf     # R squared has no lower bound
random_forest_test_score = -np.inf
tree_nr_options = [10,20,50,100]
depth_options = [5, 10, 15]
max_features_list = [2,5,10,len(X_train.columns)]

# go through all the depth options we want to explore
for depth_option in depth_options:
    # go through all the options for nr of trees we want to explore
    for ntrees in tree_nr_options:
        # go through all the options for the number of features considered at each split
        for max_features in max_features_list:
            # build a forest of ntrees trees; the forest bootstraps the training rows
            # for each tree internally, so no manual resampling is needed
            randomtree = RandomForestRegressor(n_estimators=ntrees, max_depth=depth_option,
                                               max_features=max_features, random_state=109)
            
            # fit and score the model (R squared)
            randomtree.fit(X_train, y_train)
            R2_train = randomtree.score(X_train, y_train)
            R2_test = randomtree.score(X_test, y_test)
            
            # Add scores to dataframe for clear comparison 
            curr_tree_descr = str(ntrees) + ' bagged trees w/ depth ' + str(depth_option) + " and max_features " + str(max_features)
            random_forest_comparison_df.loc[curr_tree_descr,'training accuracy'] = R2_train
            random_forest_comparison_df.loc[curr_tree_descr,'test accuracy'] = R2_test
            
            # retain the best scores
            if R2_test > random_forest_test_score:
                random_forest_test_score = R2_test
                random_forest_train_score = R2_train
                best_tree_nr = ntrees
                best_depth = depth_option
                best_max_features = max_features
                best_RF_model = randomtree
                

In [None]:
# Print out the best test scores
random_forest_comparison_df.sort_values('test accuracy', ascending=False).head()

**Comment:** We choose the model with the best test $R^2$ (reported as "test accuracy" in the table above):

In [None]:
print("The best tree number is:", best_tree_nr)
print("The best tree depth is:", best_depth)
print("The best max_features is:", best_max_features)
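
Because the search above picks the hyperparameters by their test-set score, the reported test $R^2$ is likely to be somewhat optimistic. A minimal sketch of one alternative is shown here: tune with 5-fold cross-validation on the training set only and score the held-out set once at the end. The data are synthetic stand-ins and the grid values are illustrative assumptions, not the settings tuned in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the county data (assumption: the real X, y are not used here)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=109)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=109)

# Small illustrative grid; the values are placeholders, not the notebook's choices
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [5, 10],
    "max_features": [2, 8],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=109),
    param_grid,
    cv=5,            # 5-fold cross-validation on the training set only
    scoring="r2",
)
search.fit(X_tr, y_tr)

cv_r2 = search.best_score_          # mean cross-validated R^2 of the best setting
test_r2 = search.score(X_te, y_te)  # R^2 of the refit best model on the held-out set
print(search.best_params_, round(cv_r2, 3), round(test_r2, 3))
```

With this layout the test set is touched only once, so `test_r2` can be read as an out-of-sample estimate rather than a tuning criterion.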

In [None]:
# Add best tree to comparison table
model_comparison_df.loc['Random forest','training accuracy'] = random_forest_train_score
model_comparison_df.loc['Random forest','test accuracy'] = random_forest_test_score

# display the rates by model in percentage format
model_comparison_df.style.format({
    'training accuracy': '{:,.1%}'.format,
    'test accuracy': '{:,.1%}'.format,
})

**3.4** Interpret the relationship between `rate_w29` and `dose1_w28` from the random forest model in 3.3.  Is there any evidence of interactive effects in this model involving `dose1_w28`?  How do you know?  Provide a reasonable visual (or a few visuals) to help you with these tasks and interpret what you see. 



In [None]:
# Create the data frame of means to do the prediction
means1 = X_train.mean(axis = 0)
means_df = (means1.to_frame()).transpose()

# Do the prediction over a grid (step 1) spanning the observed range of dose1_w28
doses = np.arange(np.min(X_train['dose1_w28']),np.max(X_train['dose1_w28']))
means_df  = pd.concat([means_df]*doses.size,ignore_index=True)
means_df['dose1_w28'] = doses


In [None]:
means1.to_frame().transpose()

In [None]:
#plots at means
yhat_rf = best_RF_model.predict(means_df)
plt.scatter(X_train['dose1_w28'],y_train)
plt.plot(means_df['dose1_w28'],yhat_rf,color="red")
plt.title("Predicted rate_w29 vs. dose1_w28 from RF in train")
plt.xlabel("Dose1_w28")
plt.ylabel("Rate_w29")

In [None]:
# Plot the prediction curve for every observation, then the average curve
yhat_rfs = []
for i in range(0,X_train.shape[0]):
    obs = X_train.iloc[i,:].to_frame().transpose()
    obs_df  = pd.concat([obs]*doses.size,ignore_index=True)
    obs_df['dose1_w28'] = doses
    yhat_rf = best_RF_model.predict(obs_df)
    yhat_rfs.append(yhat_rf)
    plt.plot(obs_df['dose1_w28'],yhat_rf,color='blue',alpha=0.05)

plt.plot(obs_df['dose1_w28'],np.mean(yhat_rfs, axis=0),color='red',linewidth=2);
    
# plt.ylim(0,1)
plt.xlabel("One vaccination received rate in week 28")
plt.ylabel("Case rate change week 29")
plt.title("Predicted rate_w29 vs. dose1_w28 from RF in train for all observations");


**Comment:** In the random forest model, `dose1_w28` has little effect on the predicted values: the red (average) line is nearly flat. We also plotted predictions across the full range of `dose1_w28` for each observation in the training set. The shape of these curves varies only marginally between observations, i.e. they are close to parallel. If `dose1_w28` interacted strongly with other predictors, the curves would differ in slope or shape from one observation to the next, so we conclude that there is little evidence of interaction effects involving `dose1_w28`.
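
The "parallel curves" argument can also be checked numerically. The sketch below is a hedged illustration, not the analysis run above: `ice_curves`, `interaction_spread`, and `ToyModel` are hypothetical helpers introduced here, and the toy data are synthetic. It computes one prediction curve per observation, anchors each curve at its first grid point, and measures how much the anchored curves disagree.

```python
import numpy as np
import pandas as pd

def ice_curves(model, X, feature, grid):
    """Return an (n_obs, n_grid) array of predictions as `feature` sweeps over `grid`."""
    curves = np.empty((len(X), len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[feature] = value          # set the feature to one grid value for every row
        curves[:, j] = model.predict(X_mod)
    return curves

def interaction_spread(curves):
    """Std. dev. across observations of the anchored curves, averaged over the grid.
    Values near 0 mean the curves are parallel (no interaction with the feature)."""
    centered = curves - curves[:, [0]]  # anchor every curve at its first grid point
    return centered.std(axis=0).mean()

class ToyModel:
    """Hypothetical stand-in for a fitted regressor; only `.predict` is needed."""
    def __init__(self, interact):
        self.interact = interact
    def predict(self, X):
        base = 2.0 * X["a"] + X["b"]
        return (base + self.interact * X["a"] * X["b"]).to_numpy()

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
grid = np.linspace(-2, 2, 9)

additive = interaction_spread(ice_curves(ToyModel(0.0), X_toy, "a", grid))
interacting = interaction_spread(ice_curves(ToyModel(1.5), X_toy, "a", grid))
print(additive, interacting)
```

Looping over grid values instead of over rows needs only one `predict` call per grid point, which is usually much faster than building a separate data frame for every observation.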

**3.5** Fit a well-tuned boosting model to predict `rate_w29` from the predictors listed in 3.1.  Report your best choice of the tuning parameters and briefly justify your choice (a visual or table may be helpful for this).  Provide an estimate of out-of-sample $R^2$.  Note: again, do not go too crazy with the number of options for the parameters you are tuning...choose a set of values that are reasonable.



In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
boosts = []
boostfts = []
depths = [1, 2, 3, 4,5,6,7]
# build boost models with base estimators of different depths
for base_depth in depths:
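    # the base tree is passed positionally: its keyword is `base_estimator`
    # in older scikit-learn releases and `estimator` in newer ones
    boost = AdaBoostRegressor(DecisionTreeRegressor(max_depth = base_depth),
                              n_estimators=100)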
    boosts.append(boost)


    # Fit on the training data
    boostfit = boost.fit(X_train,y_train)
    boostfts.append(boostfit)

In [None]:
# plot accuracy by estimator for different base depths
plt.rcParams["figure.figsize"] = (20,10)
for base_depth in depths:
    plt.plot(list(boosts[base_depth -1].staged_score(X_test,y_test)),
             label="Test accuracy, depth " + str(base_depth), color = "green", alpha = base_depth/8)
    plt.plot(list(boosts[base_depth -1].staged_score(X_train,y_train)),
             label="Train accuracy, depth " + str(base_depth), color = "red", alpha = base_depth/8)
plt.xlabel("Iteration")
plt.ylabel("Accuracy")
plt.title("Accuracy as a function of iterations")
plt.legend()

In [None]:
# list(boosts[3].staged_score(X_train,y_train))

In [None]:
# we've visually assessed initial depth equal to three to be the best option
best_depth = 3
best_boost = boosts[best_depth -1]
train_list = list(best_boost.staged_score(X_train,y_train))
test_list = list(best_boost.staged_score(X_test,y_test))

# assess which iteration is best
index_best_accuracy = test_list.index(max(test_list))
print("The iteration with the best accuracy is:", index_best_accuracy)

In [None]:
# Add best tree to comparison table
model_comparison_df.loc['Adaboost','training accuracy'] = train_list[index_best_accuracy]
model_comparison_df.loc['Adaboost','test accuracy'] = test_list[index_best_accuracy]

# display the rates by model in percentage format
model_comparison_df.style.format({
    'training accuracy': '{:,.1%}'.format,
    'test accuracy': '{:,.1%}'.format,
})

**Comment:** The best AdaBoost model seems to be a simple one with a base-tree depth of three. As we increase the base depth further, the model overfits: train scores keep improving, but test scores do not. We also note that the best model uses very few iterations. This suggests that we lack important variables: with the data available, additional boosting rounds cannot improve the model.
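
The depth and the number of iterations above were chosen by looking at test scores. A hedged sketch of the same `staged_score` idea applied to a separate validation split is given here, so that the test set stays untouched during tuning. The data are synthetic and the depth list is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=109)
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=109)

results = {}
for depth in [1, 3, 5]:
    # Base tree passed positionally because its keyword name differs across scikit-learn versions
    boost = AdaBoostRegressor(DecisionTreeRegressor(max_depth=depth),
                              n_estimators=50, random_state=109)
    boost.fit(X_fit, y_fit)
    # staged_score yields the validation R^2 after each boosting iteration
    val_scores = np.array(list(boost.staged_score(X_val, y_val)))
    best_iter = int(val_scores.argmax())
    # store (number of estimators, validation R^2); the index is 0-based, hence +1
    results[depth] = (best_iter + 1, val_scores[best_iter])

best_depth_demo = max(results, key=lambda d: results[d][1])
print(results, best_depth_demo)
```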

**3.6** Improve upon your favorite/best predictive model from 3.1, 3.3, or 3.5, by including other provided features, by doing feature engineering, or by doing variable removal/selection.  Explain your choices.  Provide an estimate of out-of-sample $R^2$. 



In [None]:
# restart from clean data
covid_clean = pd.read_csv('data/covid_clean.csv')
# downselecting columns
# not including 'cancer'
# columns = ['rate_w28','rate_w27','dose1_w28','hispanic','minority','female','unemployed', 
#             'income','nodegree','bachelor','inactivity','obesity','density','votergap20']

# drop non numeric columns
covid_clean = covid_clean.drop(['date','county', 'fips','state'],axis=1) 


# removing the data from week 29 and week 30 as this wouldn't be known at prediction time
covid_clean = covid_clean.drop(['cases_w29','deaths_w29', 'fully_w29', 'dose1_w29'],axis=1)
covid_clean = covid_clean.drop(['cases_w30','deaths_w30', 'fully_w30', 'dose1_w30', 'rate_w30'],axis=1)


# cleaning out a couple of records with non-numeric values in votergap16
orig_nr_rows = covid_clean.shape[0]
covid_clean = covid_clean[covid_clean['votergap16'] != '#VALUE!'].copy()
print("Number of rows removed by cleaning up non-numeric values:", orig_nr_rows - covid_clean.shape[0])

#convert votergap16 to dtype float
covid_clean['votergap16'] = covid_clean['votergap16'].astype(float)

In [None]:
covid_clean.columns

In [None]:
y = pd.DataFrame(covid_clean.loc[:,['rate_w29']])
X = covid_clean.drop(['rate_w29'], axis=1)

In [None]:

# columns = ['rate_w28','rate_w27', 'minority','female'
#            ,'dose1_w28','hispanic','unemployed', 
#             'income','nodegree','bachelor','inactivity','obesity','density','votergap20'
#             ,'sometimes', 'frequently'
#           ]
# X = X.loc[:,columns]

In [None]:
# columns = pd.Index.tolist(X.columns)
# # add second and third polynomials

# X = add_poly_features(X, 5, columns)

# # take a quick look at the dataset
# print(X.shape)
# X.describe()

In [None]:
# X = build_interaction(X, columns)
# print(X.shape)
# # X_train.describe()

In [None]:
# split dataset in train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 109)

y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

In [None]:
# prepare a dataframe to represent rates for each model
random_forest_comparison_df = pd.DataFrame(
    columns=['training accuracy', 'test accuracy']
)

In [None]:
import warnings
warnings.filterwarnings("ignore")

from sklearn.ensemble import RandomForestRegressor
from sklearn.utils import resample

from sklearn.metrics import r2_score

random_forest_train_score = -1
random_forest_test_score = -1
depth_option = 7
tree_nr_options = [10,20,50,100]
depth_options = [5, 10, 15]
# max_features_list = [2,5,10,len(X_train.columns)]
max_features_list = [2,len(X_train.columns)]

row_number = 0

# go through all the depth options we want to explore
for depth_option in depth_options:
    # go through all the options for nr of trees we want to explore
    for ntrees in tree_nr_options:
        # go through all the max_features options we want to explore
        for max_features in max_features_list:
            estimators = []
            R2s_train = []
            R2s_test = []
            y_hats_test = np.zeros((X_test.shape[0], ntrees))
            # n_estimators must be set explicitly, otherwise ntrees is never used
            randomtree = RandomForestRegressor(n_estimators=ntrees, max_depth=depth_option, max_features = max_features)
            # bootstrap the training set
            boot_x, boot_y = resample(X_train, y_train)
            
            # fit and test the model
            estimators = np.append(estimators,randomtree.fit(boot_x, boot_y))
            R2s_train = np.append(R2s_train,randomtree.score(X_train, y_train))
            R2s_test = np.append(R2s_test,randomtree.score(X_test, y_test))
            
            # Add rates to dataframe for clear comparison 
            curr_tree_descr = str(ntrees) + ' bagged trees w/ depth ' + str(depth_option) + " and max_features " + str(max_features)
            random_forest_comparison_df.loc[curr_tree_descr,'training accuracy'] = np.mean(R2s_train)
            
            # accuracy scores on test set
            random_forest_comparison_df.loc[curr_tree_descr,'test accuracy'] = np.mean(R2s_test)
            
            row_number = row_number + 1
            # retain the best scores
            if np.mean(R2s_test) > random_forest_test_score:
                random_forest_test_score = np.mean(R2s_test)
                random_forest_train_score = np.mean(R2s_train)
                best_tree_nr = ntrees
                best_depth = depth_option
                best_max_features = max_features
                best_RF_all_Feature_model = randomtree
                
                
# Print out the best test scores
random_forest_comparison_df.sort_values('test accuracy', ascending=False).head()

In [None]:
# Add best tree to comparison table
model_comparison_df.loc['RF all features','training accuracy'] = random_forest_train_score
model_comparison_df.loc['RF all features','test accuracy'] = random_forest_test_score

# display the rates by model in percentage format
model_comparison_df.style.format({
    'training accuracy': '{:,.1%}'.format,
    'test accuracy': '{:,.1%}'.format,
})

In [None]:
# install eli5
# !pip install eli5

In [None]:
import eli5
#permutation importance for the random forest
from eli5.sklearn import PermutationImportance

seed = 42

perm = PermutationImportance(best_RF_all_Feature_model,random_state=seed,n_iter=10).fit(X_test, y_test)
eli5.show_weights(perm,feature_names=X.columns.tolist())
#eli5.explain_weights(perm, feature_names = X_train.columns.tolist())


**Comment:** We did not improve upon the random forest model by adding the remaining features.
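
Permutation importance can also be computed without `eli5`, using `sklearn.inspection.permutation_importance`. The sketch below is an illustration on synthetic data, with made-up column names (`signal`, `noise_1`, `noise_2`), and is not the result obtained above. It shows how uninformative columns end up with importances near zero, which is one way to see why extra features may fail to improve the model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: one informative column and two pure-noise columns (assumption)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_1": rng.normal(size=300),
    "noise_2": rng.normal(size=300),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.5, size=300)

# Simple positional split into a fitting part and a held-out part
X_fit, X_hold = X_demo.iloc[:200], X_demo.iloc[200:]
y_fit, y_hold = y_demo.iloc[:200], y_demo.iloc[200:]

rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_fit, y_fit)

# Mean drop in R^2 on held-out data when each column is shuffled, over 10 repeats
result = permutation_importance(rf, X_hold, y_hold, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(importances)
```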

**3.7** Evaluate your models from 3.1, 3.3, 3.5, and 3.6 on the test set (this will take some work...refer back to 2.8) using $R^2$.  How do these models' $R^2$ in test compare to the out-of-sample $R^2$ when tuning?  Explain whether or not this is surprising.

In [None]:
######
# your code here
######

*your answer here*

---

## Question 4 [10pts]: Going further

**4.1** Use all of the useable variables in `demo` and `masks` to create clusters of observations based on the $K$-means clustering approach.  Be sure to carefully select a reasonable choice for $K$.  Explain your choice (a visual may help with this).

**4.2** Use your created clusters and incorporate them as predictor(s) into a linear regression model to assess whether the relationships you measured in the model from 2.6 depend on cluster type.  Comment on what you notice.  Determine whether out-of-sample $R^2$ has improved using this model (in comparison to the model from 2.6) based on 5-fold CV.

**4.3: BONUS** Find data online to improve the prediction accuracy of your best model. Be sure to cite the source of your data and describe the approach you took to incorporate these new data.  Note: this is only worth up to 3 bonus points, so do not spend too much effort on this part over improving earlier parts of the exam.

## Answers

**4.1** Use all of the useable variables in `demo` and `masks` to create clusters of observations based on the $K$-means clustering approach.  Be sure to carefully select a reasonable choice for $K$.  Explain your choice (a visual may help with this).
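
One common way to pick $K$ is to standardize the features, fit $K$-means for a small range of $K$, and compare the inertia (elbow) and silhouette score. The sketch below illustrates that workflow on synthetic blobs with three known centers. It is an assumption-laden example rather than an answer for the `demo` and `masks` variables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups (assumption, not the exam data)
centers = [[0, 0], [10, 10], [-10, 10]]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=109)

# K-means is distance based, so put all features on the same scale first
X_scaled = StandardScaler().fit_transform(X_demo)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=109).fit(X_scaled)
    inertias[k] = km.inertia_                               # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)  # higher means better separated

best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

Plotting `inertias` and `silhouettes` against $K$ gives the kind of visual the question asks for. On real data the elbow is often less clear-cut than in this toy example.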

In [None]:
######
# your code here
######

*your answer here*

**4.2** Use your created clusters and incorporate them as predictor(s) into a linear regression model to assess whether the relationships you measured in the model from 2.6 depend on cluster type.  Comment on what you notice.  Determine whether out-of-sample $R^2$ has improved using this model (in comparison to the model from 2.6) based on 5-fold CV.


In [None]:
######
# your code here
######

*your answer here*

**4.3: BONUS** Find data online to improve the prediction accuracy of your best model. Be sure to cite the source of your data and describe the approach you took to incorporate these new data.  Note: this is only worth up to 3 bonus points, so do not spend too much effort on this part over improving earlier parts of the exam.

In [None]:
######
# your code here
######



*your answer here*