<a href="https://colab.research.google.com/github/nhwang1325/storytelling-with-data/blob/master/data-stories/COVID-19%3FStory4_Amelia_Hannah_Nathan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

This demonstration notebook provides a suggested set of libraries that you might find useful in crafting your data stories.  You should comment out or delete libraries that you don't use in your analysis.

In [None]:
!ls

sample_data


In [None]:
#number crunching
import numpy as np
import pandas as pd


#data visualization
import plotly
import plotly.express as px
from matplotlib import pyplot as plt



# Google authentication

Run the next cell to enable use of your Google credentials in uploading and downloading data via Google Drive.  See tutorial [here](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=P3KX0Sm0E2sF) for interacting with data via Google services.

In [None]:
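from google.colab import auth
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)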


# Project team

List your team members and (as appropriate) each team member's role on this project.

# Background and overview

Introduce your question and motivation here.  Link to other resources or related work as appropriate.

# Approach

Briefly describe (at a high level) the approach you'll be taking to answer or explore your question in this notebook.

# Quick summary

Briefly describe your key findings at a high level.




**Project Team:**
Amelia Ockert, Hannah Utter, and Nathan Hwang

Nathan and Hannah developed the initial story idea, but this idea evolved as we brainstormed together (after I [Amelia] joined) and as a class, and as we found our data sources.

Amelia performed the data extraction, tidying, manipulation, and analysis, and made the plots. She wrote the script for part 2 of the video and completed all text entries for this notebook.

We worked as a team to discuss our story format, and Hannah put this into action by bringing our ideas to life through videography. She wrote the script for parts 1 and 3, and did the video compilation and editing for the story.

Nathan put together and released a mental health survey to assess students' reactions to community accusations. He also helped Hannah with special effects in the video.


**Background and overview:**

Did the spike in Covid cases at Dartmouth this past February lead to a spike in Covid cases in surrounding counties?

It's no secret that Dartmouth has taken the heat from the surrounding community this past year. Prior to students returning this past fall, several articles were released by townspeople saying that Dartmouth students should not be allowed back in Hanover. Furthermore, hundreds of Dartmouth faculty signed a petition to prevent students from returning to campus. Needless to say, Dartmouth students were generally not welcomed back into Hanover, and have instead been a source of blame for Covid cases in the surrounding community.
We aim to investigate the merits of these accusations--we want to dive into the data and see how Covid cases changed in the week(s) following Dartmouth's outbreak. We are curious to see whether there were in fact spikes in cases in the surrounding communities.

**Approach:**

After much trial and error, we learned that it is difficult to isolate Dartmouth-specific data. Thus, we pivoted our approach to instead look at total new cases, per 100K residents, per week in Grafton and the surrounding counties. We figured that since Dartmouth is a part of Grafton County, and we know when Dartmouth cases spiked, we will still be able to see from the Grafton and surrounding county data whether or not their cases spiked as well around the time of Dartmouth's outbreak.
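
As a minimal sketch of the metric described above (not the exact code used later in this notebook), weekly new cases per 100K can be derived from a cumulative case count in three steps: difference the cumulative series to get daily new cases, scale by population, and sum within each week. The `toy` dataframe, its numbers, and the `population` value are made up purely for illustration.

```python
import pandas as pd

# Made-up cumulative case counts for one hypothetical county (illustration only)
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-04", periods=14, freq="D"),
    "cases": [100, 102, 105, 105, 110, 112, 115,
              115, 116, 118, 118, 119, 120, 120],
})
population = 50_000  # hypothetical population

# Daily new cases are the day-over-day change in the cumulative count
toy["new_cases"] = toy["cases"].diff()

# Scale to a rate per 100,000 residents
toy["new_per_100k"] = 100_000 * toy["new_cases"] / population

# ISO week number, then total new cases per 100K within each week
toy["week"] = toy["date"].dt.isocalendar().week.astype("int64")
weekly = toy.groupby("week", as_index=False)["new_per_100k"].sum()
print(weekly)
```

The first day has no previous value to compare against, so its new-case count is missing and is skipped by the weekly sum.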

Covid-case data aside, we will also pull in reports of community accusations to add to our intro section to help the viewer understand the extent of the backlash Dartmouth students have received.

**Quick summary:**

We were pretty surprised by our findings. We of course dove into our project thinking Dartmouth did not cause huge community spikes, so we were expecting to see cases stay pretty consistent, or mildly elevated, in the weeks following Dartmouth's outbreak. We were *not* expecting to see that weekly cases in the surrounding communities actually *decreased* in the weeks following Dartmouth's outbreak.

It seems Dartmouth's outbreak served as a reminder to the community that Covid is still here. People likely took extra precautions that resulted in fewer community transmissions. It seems like Dartmouth students are not to blame after all!


# Data

Briefly describe your dataset(s), including links to original sources.  Provide any relevant background information specific to your data sources.

In [None]:
# Provide code for downloading or importing your data here
# We used county-level covid data from the NY Times' github. The data has been updated by the NY Times since the beginning of Covid, and tracks total cases and deaths across all 
#counties in the United States.

#Link to the NY Times github: https://github.com/nytimes/covid-19-data/blob/master/us-counties.csv
#Link to the raw data source: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv


!wget https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
covid_data = pd.read_csv('us-counties.csv')



--2021-05-10 01:03:27--  https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53424352 (51M) [text/plain]
Saving to: ‘us-counties.csv’


2021-05-10 01:03:27 (139 MB/s) - ‘us-counties.csv’ saved [53424352/53424352]
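
As an alternative sketch to the `wget` step above, `pd.read_csv` can read straight from a URL or a file path, and can parse the dates and keep FIPS codes as text while loading. The helper name `load_county_data` is hypothetical, and the tiny in-memory sample below only mimics the columns of `us-counties.csv` with made-up values so the example runs without a download.

```python
import io
import pandas as pd

def load_county_data(source):
    """Read NYT-style county data; `source` may be a URL, a file path, or a file-like object."""
    return pd.read_csv(
        source,
        parse_dates=["date"],   # real datetimes instead of strings
        dtype={"fips": str},    # keep FIPS codes as text so leading zeros survive
    )

# Tiny in-memory sample with the same columns as us-counties.csv (values are made up)
sample = io.StringIO(
    "date,county,state,fips,cases,deaths\n"
    "2021-01-01,Grafton,New Hampshire,33009,10,0\n"
    "2021-01-02,Grafton,New Hampshire,33009,12,0\n"
)
print(load_county_data(sample).dtypes)
```

In this notebook the same helper could presumably be pointed at the raw data URL listed above instead of the in-memory sample.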



# Analysis

Briefly describe each step of your analysis, followed by the code implementing that part of the analysis and/or producing the relevant figures.  (Copy this text block and the following code block as many times as are needed.)

In [None]:
#we're interested in the covid cases in Grafton and the immediately surrounding counties,
#which includes Sullivan NH, and Orange and Windsor VT.

covid_states = ["New Hampshire", "Vermont"]
covid_counties = ["Grafton","Sullivan","Orange", "Windsor"]

#Create new dataframe that is just the counties of interest
county_data = covid_data[covid_data["state"].isin(covid_states) & covid_data["county"].isin(covid_counties)]


In [None]:
#We're interested in cases per 100K. Thus, we need to add the populations for each county
#These data were found on the US Census Bureau website. The Grafton URL is: https://www.census.gov/quickfacts/fact/table/graftoncountynewhampshire/PST045219


#Define a function to add the population of these counties (2019 Census Bureau estimates)
def define_pop(row):
  if row['county'] == "Grafton" :
    return 89886
  if row['county'] == "Rockingham" :
    return 309769
  if row['county'] == "Sullivan" :
    return 43146
  if row['county'] == "Merrimack" :
    return 151391
  if row['county'] == "Orange" :
    return 28892
  if row['county'] == "Windsor" :
    return 55062
  return np.nan  #unknown county: NaN keeps the column numeric so the per-100K division still works
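
A possible alternative to the row-by-row function above is a dictionary lookup with `Series.map`, sketched here on a made-up dataframe. The name `COUNTY_POP` is hypothetical, and keying on county name alone assumes no two counties in the filtered data share a name.

```python
import pandas as pd

# Populations copied from the function above for our four counties of interest
COUNTY_POP = {"Grafton": 89886, "Sullivan": 43146, "Orange": 28892, "Windsor": 55062}

# Made-up rows (illustration only); "Elsewhere" is not in the lookup table
toy = pd.DataFrame({"county": ["Grafton", "Windsor", "Elsewhere"]})

# map() looks up each county name; names missing from the dictionary become NaN
toy["pop"] = toy["county"].map(COUNTY_POP)
print(toy)
```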

In [None]:
#Check to make sure it worked
county_data.apply(lambda row: define_pop(row), axis = 1)

447        89886
481        89886
516        89886
562        89886
623        89886
           ...  
1296697    55062
1298806    89886
1298811    43146
1299938    28892
1299944    55062
Length: 1686, dtype: int64

In [None]:
#Create a fresh copy of the dataframe to avoid indexing errors when adding columns
df = county_data.copy()

#Subset to look at just data from 2021
df = df[df["date"] > "2020-12-31"]

#Now, add the population as a new column called "pop". 
df['pop'] = df.apply(lambda row: define_pop(row), axis = 1)

In [None]:
#Find cases per 100000 by multiplying cases by 100000 and dividing by the county population
df['per100K'] = 100000 * df['cases'] / df['pop']

In [None]:
df.head() #check


In [None]:
#Add a new column that records the increase in cases per day.
#"shift" moves each county's cumulative count down one row, so a simple subtraction gives the number of new cases per day

df["lagCol"] = df.groupby("county")["cases"].shift(1)
df["newcases"] = df.cases - df.lagCol

In [None]:
#Create lag column for cases per 100K
df["per100KlagCol"] = df.groupby("county")["per100K"].shift(1)

#Find the new cases per 100K
df["newcasesper100K"] = df.per100K - df.per100KlagCol
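
The shift-then-subtract pattern in the cells above can also be written with `groupby(...).diff()`. This is a minimal sketch on made-up counts for two hypothetical counties, only meant to show that the two forms agree.

```python
import pandas as pd

# Made-up cumulative counts for two hypothetical counties (illustration only)
toy = pd.DataFrame({
    "county": ["A", "A", "A", "B", "B", "B"],
    "cases": [10, 12, 15, 7, 7, 9],
})

# shift-then-subtract, as in the cells above
via_shift = toy["cases"] - toy.groupby("county")["cases"].shift(1)

# the same calculation in a single step
via_diff = toy.groupby("county")["cases"].diff()

print(pd.DataFrame({"via_shift": via_shift, "via_diff": via_diff}))
```

In both versions the first row of each county is missing, because there is no earlier day to subtract.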




In [None]:
#Plot daily new cases per county
fig5 = px.line(df, x = "date", y = "newcases", color = "county")
fig5.show()


In [None]:
#There are clearly a lot of daily fluctuations, so we instead want to look at new cases per week. This will give us
#a better picture of what is happening on a weekly basis

df['date'] = pd.to_datetime(df['date'], errors = 'coerce')
#Series.dt.week is deprecated; isocalendar().week gives the same ISO week number.
#The cast to int64 assumes every date parsed successfully (no missing dates).
df['weekNumb'] = df['date'].dt.isocalendar().week.astype('int64')
df.head()




Unnamed: 0,date,county,state,fips,cases,deaths,pop,per100K,lagCol,newcases,per100KlagCol,newcasesper100K,weekNumb
886520,2021-01-01,Grafton,New Hampshire,33009.0,1127,9.0,89886,1253.810382,,,,,53
886525,2021-01-01,Sullivan,New Hampshire,33019.0,430,6.0,43146,996.616141,,,,,53
887650,2021-01-01,Orange,Vermont,50017.0,312,1.0,28892,1079.883705,,,,,53
887656,2021-01-01,Windsor,Vermont,50027.0,404,3.0,55062,733.718354,,,,,53
889766,2021-01-02,Grafton,New Hampshire,33009.0,1177,9.0,89886,1309.436397,1127.0,50.0,1253.810382,55.626015,53


In [None]:
#For simplicity, we're going to eliminate "week 53": January 1-3, 2021 fall in the last ISO week of 2020.
#Keeping only week numbers below 20 drops those rows.
df = df[df["weekNumb"] < 20]
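
To see why the first days of January land in "week 53", and which calendar dates a given week number covers, here is a small sketch using only the standard library. The helper name `iso_week_range` is hypothetical, and `date.fromisocalendar` requires Python 3.8 or later.

```python
import datetime

def iso_week_range(year, week):
    """Return the (Monday, Sunday) dates of an ISO week."""
    monday = datetime.date.fromisocalendar(year, week, 1)
    return monday, monday + datetime.timedelta(days=6)

# January 1, 2021 belongs to ISO week 53 of ISO year 2020
print(tuple(datetime.date(2021, 1, 1).isocalendar()))

# Dates covered by week 8 of 2021
start, end = iso_week_range(2021, 8)
print(start, end)
```

ISO weeks run Monday through Sunday, which is worth keeping in mind when labeling week numbers with date ranges.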

In [None]:
df2 = df.copy() #work on a copy so the daily dataframe (df) stays untouched

#Now, we're grouping by county and week to find the sum total of new cases per 100K per week
df2 = df2.groupby(['county','weekNumb'], as_index = False)['newcasesper100K'].sum()
df2.head()


Unnamed: 0,county,weekNumb,newcasesper100K
0,Grafton,1,176.890728
1,Grafton,2,199.141134
2,Grafton,3,277.017556
3,Grafton,4,171.328127
4,Grafton,5,189.128452


In [None]:
df2.dtypes

county              object
weekNumb             int64
newcasesper100K    float64
dtype: object

In [None]:
#Stacked Bar chart of Weekly Covid Cases per 100K

bar_fig = px.bar(df2, x='weekNumb', y='newcasesper100K', color = "county",
              title = "2021 Weekly Covid Cases per 100K in Four Upper Valley Counties",
              labels = {
                  "weekNumb": "Week",
                  "newcasesper100K": "New Cases per 100K",
                  "county": "county"
              }) #color = 'county') #barmode = 'group')


bar_fig.update_layout(
    title = {
        "text": "2021 Weekly Covid Cases (per 100K) in Four Upper Valley Counties",
        "y": 0.9,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top"
    }
)

bar_fig.update_xaxes(
    # ISO weeks run Monday-Sunday, so week 2 is Jan 11-17, week 4 is Jan 25-31, and so on
    ticktext = ["Jan 11-17", "Jan 25-31", "Feb 8-14", "Feb 22-28", "Mar 8-14", "Mar 22-28", "Apr 5-11", "Apr 19-25"],
    tickvals = [2,4,6,8,10,12,14,16]
)

bar_fig.show()

In [None]:
# Line graph showing trend in weekly covid cases per 100K

line_fig = px.line(df2, x='weekNumb', y='newcasesper100K', color = "county",
              title = "2021 Weekly Covid Cases per 100K in Four Upper Valley Counties",
              labels = {
                  "weekNumb": "Week",
                  "newcasesper100K": "New Cases per 100K",
                  "county": "county"
              }) #color = 'county') #barmode = 'group')



line_fig.update_layout(
    title = {
        "text": "2021 Weekly Covid Cases (per 100K) in Four Upper Valley Counties",
        "y": 0.9,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top"
    }
)

line_fig.update_xaxes(
    # ISO weeks run Monday-Sunday, so week 2 is Jan 11-17, week 4 is Jan 25-31, and so on
    ticktext = ["Jan 11-17", "Jan 25-31", "Feb 8-14", "Feb 22-28", "Mar 8-14", "Mar 22-28", "Apr 5-11", "Apr 19-25"],
    tickvals = [2,4,6,8,10,12,14,16]
)

line_fig.show()

In [None]:
## Adding an annotation to the above line graph to show the Dartmouth outbreak onset

figg2 = px.line(df2, x='weekNumb', y='newcasesper100K', color = "county",
              title = "2021 Weekly Covid Cases per 100K in Four Upper Valley Counties",
              labels = {
                  "weekNumb": "Week",
                  "newcasesper100K": "New Cases per 100K",
                  "county": "county"
              }) #color = 'county') #barmode = 'group')

figg2.add_annotation(x = 8, y = 220,
                    text = "Dartmouth Outbreak, February 21-27",
                    yshift = 25,
                    showarrow=True,
                    arrowhead =2,
                    bordercolor = "#c7c7c7",
                    borderwidth = 2,
                    bgcolor = "#ff7f0e")


figg2.update_layout(
    title = {
        "text": "2021 Weekly Covid Cases (per 100K) in Four Upper Valley Counties",
        "y": 0.9,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top"
    }
)

figg2.update_xaxes(
    # ISO weeks run Monday-Sunday, so week 2 is Jan 11-17, week 4 is Jan 25-31, and so on
    ticktext = ["Jan 11-17", "Jan 25-31", "Feb 8-14", "Feb 22-28", "Mar 8-14", "Mar 22-28", "Apr 5-11", "Apr 19-25"],
    tickvals = [2,4,6,8,10,12,14,16]
)

figg2.show()

In [None]:
#Facet the line graph to show each county individually
figg3 = px.line(df2, x='weekNumb', y='newcasesper100K', 
                color = "county",
                facet_col = "county", facet_col_wrap = 2,
              title = "2021 Weekly Covid Cases per 100K in Four Upper Valley Counties",
              labels = {
                  "weekNumb": "Week",
                  "newcasesper100K": "New Cases per 100K",
                  "county": "county"
              }) #color = 'county') #barmode = 'group')

#figg3.add_annotation(x = 8, y = 220,
 #                   text = "Dartmouth Outbreak, February 21-27",
  #                  yshift = 25,
   #                 showarrow=True,
    #                arrowhead =2,
     #               bordercolor = "#c7c7c7",
      #              borderwidth = 2,
       #             bgcolor = "#ff7f0e")


figg3.update_layout(
    title = {
        "text": "2021 Weekly Covid Cases (per 100K) in Four Upper Valley Counties",
        "y": 0.9,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top"
    }
)

figg3.update_xaxes(
    # ISO weeks run Monday-Sunday, so week 2 is Jan 11-17, week 4 is Jan 25-31, and so on
    ticktext = ["Jan 11-17", "Jan 25-31", "Feb 8-14", "Feb 22-28", "Mar 8-14", "Mar 22-28", "Apr 5-11", "Apr 19-25"],
    tickvals = [2,4,6,8,10,12,14,16]
)
figg3.show()

In [None]:
# Provide code for carrying out the part of your analysis described
# in the previous text block.  Any statistics or figures should be displayed
# in the notebook.

# Interpretations and conclusions

Describe and discuss your findings and say how they answer your question (or how they failed to answer your question).  Also describe the current state of your project-- e.g., is this a "complete" story, or is further exploration needed?

# Future directions

Describe some open questions or tasks that another interested individual or group might be able to pick up from, using your work as a starting point.

**Interpretations and conclusions:**

The blame game gets us nowhere, so it's silly to keep pointing fingers. No data will tell us who brought the virus to the area, or exactly how it spread. It could be that some Dartmouth students infected some community members (and it is also likely that the reverse is true!). But county-level case counts show only correlation; they cannot tell us who infected whom.

What we do see is that Covid cases did NOT spike in Grafton and the surrounding communities following the Covid outbreak at Dartmouth College. In fact, the lowest weekly new cases per 100K came in the weeks following Dartmouth's outbreak. The next peak in cases wasn't until April, over 6 weeks after Dartmouth's peak.
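
One way to check a statement like this against a weekly table shaped like `df2` is sketched below. The helper `weekly_extremes` is hypothetical, the numbers in `toy` are made up and are not our results, and the sketch assumes each county has at least one week on either side of the chosen outbreak week.

```python
import pandas as pd

def weekly_extremes(weekly, outbreak_week):
    """For each county, find the peak week up to the outbreak and the lowest week after it."""
    rows = []
    for county, grp in weekly.groupby("county"):
        before = grp[grp["weekNumb"] <= outbreak_week]
        after = grp[grp["weekNumb"] > outbreak_week]
        rows.append({
            "county": county,
            "peak_week_through_outbreak": int(
                before.loc[before["newcasesper100K"].idxmax(), "weekNumb"]),
            "lowest_week_after_outbreak": int(
                after.loc[after["newcasesper100K"].idxmin(), "weekNumb"]),
        })
    return pd.DataFrame(rows)

# Made-up values for one hypothetical county (illustration only, not our results)
toy = pd.DataFrame({
    "county": ["X"] * 6,
    "weekNumb": [6, 7, 8, 9, 10, 11],
    "newcasesper100K": [150.0, 180.0, 220.0, 120.0, 90.0, 110.0],
})
print(weekly_extremes(toy, outbreak_week=8))
```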

So, our analysis answers our research question: the outbreak at Dartmouth was not followed by a spike in cases in the surrounding community.

**Conclusions and future directions:**

Of course, it is hard to tell when a project is truly "complete". We could extend this analysis further into the spring, or backtrack to the fall, when the accusations began. We could also look at Dartmouth student vaccination rates versus community vaccination rates, and explore any changes in Covid cases or Covid-related deaths. To explore the blame game through a different lens, we could further investigate Dartmouth students' mental health regarding Covid and community accusations. We started this in our current project, but this aspect lends itself to further study. There are many avenues of exploration that could serve as interesting research questions for future work.

Possible future research questions include:
- How have community accusations impacted Dartmouth students' mental health?
- Have Covid outbreaks in Grafton and surrounding counties led to Covid outbreaks at Dartmouth? (essentially the reverse of our research question)
- How do Dartmouth's covid cases/deaths compare to other rural colleges? Other Ivy League schools?

