# COVID-19 Data Analysis Tool

Brett Deaton - Fall 2020

This notebook gathers up-to-date data on COVID-19 infections, as compiled by the Center for Systems Science and Engineering at Johns Hopkins University, for further analysis.

The data come from the https://github.com/datasets/covid-19 dataset, a sanitized version of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU), https://github.com/CSSEGISandData/COVID-19. The data is compiled from state health departments and is widely used, notably for the [JHU COVID-19 Dashboard](https://coronavirus.jhu.edu/map.html).

## Setup

We want to read in and organize the data from its source. We use the pandas
package to fetch and interpret a comma-separated values (CSV) file at the
listed URL. Think of pandas as Excel for Python.

The pandas function `read_csv()` returns a `DataFrame` object, with many
useful methods you can read about in the official
[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [None]:
import pandas
url = "https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-aggregated.csv"
df = pandas.read_csv(url, parse_dates=['Date']) # convert date string to a timestamp object
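To see what `parse_dates` does without touching the network, here is a minimal sketch on a two-row CSV held in memory. The rows and the country name "Exampleland" are made up for illustration, and the column layout is an assumption about the real file.

```python
import io
import pandas

# made-up rows; the column layout is an assumption about the real file
csv_text = """Date,Country,Confirmed,Recovered,Deaths
2020-03-01,Exampleland,10,0,0
2020-03-02,Exampleland,15,1,0
"""

# without parse_dates the Date column stays as plain text
raw = pandas.read_csv(io.StringIO(csv_text))

# with parse_dates the Date column becomes timestamp objects
parsed = pandas.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(raw["Date"].dtype)
print(parsed["Date"].dtype)
print(parsed.at[0, "Date"].year)
```

Timestamps are what make the `.dt` accessor and date-aware plot axes available later in the notebook.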

We also want to plot the data. We use the `matplotlib.pyplot` package, which has
loads of plotting functionality you can read about in the official
[documentation](https://matplotlib.org/api/pyplot_api.html) or explore in a
[tutorial](https://matplotlib.org/tutorials/introductory/pyplot.html).

Below we will use pyplot indirectly through the `plot()` method of the
pandas DataFrame object.

In [None]:
import matplotlib.pyplot as plt

## Inspect Data

This is a large dataset, with one row per country per day, so we start by looking at just a few rows.

In [None]:
# output the first few rows
df.head(3)

In [None]:
# output the last few rows
df.tail(3)

Let's look at a specific row, a specific cell, and the shape (rows, columns) of the DataFrame.

In [None]:
# interact with specific table elements in the DataFrame
print("## First Row", df.loc[0], sep="\n")
print()

print("## First Cell", df.at[0, "Date"], sep="\n")
print()

print("## Shape of the Table", df.shape, sep="\n")
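The same three accessors are easier to follow on a tiny table. This is a sketch on made-up data (the `toy` frame and its values are hypothetical), showing what kind of object each accessor returns.

```python
import pandas

# a made-up three-row table for illustration
toy = pandas.DataFrame({
    "Date": pandas.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "Country": ["Exampleland"] * 3,
    "Confirmed": [10, 15, 23],
})

first_row = toy.loc[0]               # a Series, indexed by column name
first_cell = toy.at[0, "Confirmed"]  # a single scalar value
n_rows, n_cols = toy.shape           # a (rows, columns) tuple

print(first_row["Country"], first_cell, n_rows, n_cols)
```

Note that `loc` and `at` look rows up by index label, not by position. The two coincide here only because the index is the default 0, 1, 2, ...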

Let's make some lists out of the DataFrame.

In [None]:
# create a list of the headers
headers = list( df.columns.values )
print(headers)

In [None]:
# create a list of the countries
countries = list( df["Country"].drop_duplicates() )
print(len(countries), "countries:")
print(countries)

In [None]:
# create a list of the dates
days = list( df["Date"].drop_duplicates() )
print("## Last Three of", len(days), "Days")
for day in days[-3:]:
    print(day.date())
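As an aside, pandas offers more than one way to collect the distinct values of a column. The sketch below, on a made-up `toy` column, compares `drop_duplicates()` with `unique()` and `nunique()`. Both of the first two keep values in order of first appearance.

```python
import pandas

# made-up column with repeated values
toy = pandas.DataFrame({"Country": ["A", "A", "B", "C", "B"]})

via_drop = list(toy["Country"].drop_duplicates())  # approach used above
via_unique = list(toy["Country"].unique())         # same values as an array
count = toy["Country"].nunique()                   # just the number of distinct values

print(via_drop, via_unique, count)
```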

## Analyze Data

We want to examine specific subsets of the data to answer various questions:

* What are the cumulative effects?
* How are new case counts changing?
* Are there any underlying patterns in new case counts?

Let's make some visualizations to answer these questions! We'll use the `plot()` method of the DataFrame
object. For more information on this method see the
[documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html).

In [None]:
# select a country to analyze
country = "US"

#### What are the cumulative effects?

In [None]:
# select subset of data from one country
stencil = ( df["Country"] == country )
df_country = df[stencil].copy()
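The stencil is a boolean mask: a column of True/False values, one per row, that selects which rows to keep. Here is a minimal sketch on made-up data (the `toy` frame is hypothetical) showing the mask itself and why `.copy()` matters.

```python
import pandas

# made-up table with two countries interleaved
toy = pandas.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Confirmed": [1, 5, 3, 9],
})

mask = toy["Country"] == "A"  # True where the row belongs to country A
subset = toy[mask].copy()     # independent copy of the matching rows

subset["Confirmed"] = 0       # changing the copy leaves the original alone

print(mask.tolist())
print(subset.index.tolist())
print(toy["Confirmed"].tolist())
```

The subset keeps the original row labels (0 and 2 here), which is why the index of a single-country table does not start at 0 and count up by one.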

In [None]:
# make a time series plot of the case counts
df_country.plot(title="Cumulative Cases ("+country+")", x="Date")

#### How are new case counts changing?

In [None]:
# create new column for new cases, by taking daily difference of confirmed cases
daily_diff = df_country["Confirmed"].diff()
df_country["Confirmed New"] = daily_diff
df_country.head()
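To check what `diff()` computes, here is a small sketch on a made-up cumulative series. Each entry is the current value minus the previous one, so the first entry has nothing to subtract and comes out as NaN.

```python
import pandas

# made-up cumulative counts for five consecutive days
cumulative = pandas.Series([10, 15, 23, 23, 30])

new_cases = cumulative.diff()  # value[i] - value[i-1]

print(new_cases.tolist())  # [nan, 5.0, 8.0, 0.0, 7.0]
```

That leading NaN is expected in the first row of the new column, and pandas skips it when plotting or summing.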

In [None]:
# make a time series plot of the new case counts
df_country.plot(title="New Cases ("+country+")", x="Date", y="Confirmed New", legend=False)

#### Are there any underlying patterns in new case counts?

From the previous plot, there appears to be a regular wiggle in the data.
New case counts rise and then fall roughly once weekly. Let's combine the
data by day of the week, to search for weekly patterns.

In [None]:
# first create new column for day of the week
day_of_wk = df_country["Date"].dt.dayofweek # Monday=0, Tuesday=1, ..., Sunday=6
df_country["Day of Week"] = day_of_wk

# then sum up new cases for each day
# (selecting the column first avoids summing the non-numeric Date and Country columns)
sum_by_day = df_country.groupby("Day of Week")[["Confirmed New"]].sum()
sum_by_day
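One caveat with summing by day of the week: if the date range does not cover a whole number of weeks, some weekdays occur once more than others and get a larger sum for that reason alone. The sketch below uses made-up data (ten days with exactly one new case each) to show the effect, and compares it with the per-weekday mean as one possible alternative.

```python
import pandas

# ten made-up days starting on Monday 2020-06-01, one new case per day
dates = pandas.date_range("2020-06-01", periods=10, freq="D")
toy = pandas.DataFrame({"Date": dates, "Confirmed New": [1.0] * 10})
toy["Day of Week"] = toy["Date"].dt.dayofweek  # Monday=0, ..., Sunday=6

by_day = toy.groupby("Day of Week")["Confirmed New"]

print(by_day.sum().tolist())   # [2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]
print(by_day.mean().tolist())  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Over many months the extra day matters little, but the mean is the safer choice when comparing short date ranges.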

In [None]:
# make a bar chart plot of the new cases grouped by day of the week

# first generate the plot object
plot_day_sum = sum_by_day.plot.bar(title="New Cases Summed by Day ("+country+")",
                                   y="Confirmed New",
                                   legend=None)

# reset the bar labels
daynames = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
plot_day_sum.set_xticklabels(daynames)

# then display the plot
plt.show()

### Todo

Modifications or repairs to make:

* explore differences in weekly patterns between different countries
* make a pie-chart of cases from largest countries
* explore the time shift between the series of new cases and the series of recoveries