# COVID-19 Data Science Tutorial

COVID-19 data tutorial based on the Code Curious YouTube video https://youtu.be/48kdz3VDjoE, using the Johns Hopkins University Center for Systems Science and Engineering (CSSE) GitHub data at https://github.com/CSSEGISandData/COVID-19.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px

In [None]:
base_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
confirmed_df = pd.read_csv(base_url + "time_series_covid19_confirmed_global.csv" )

Looking at the head of the data, we can see that the confirmed-cases table has over one thousand columns, most of them one per reported date.

In [None]:
confirmed_df.head()

Viewing the shape of the dataframe shows that it has 289 rows and 1016 columns. Some countries are reported as several province/state rows, so the row count is somewhat higher than the number of countries recognized globally, but it is in the expected range.

In [None]:
confirmed_df.shape
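
To see how the row count relates to the number of countries, it can help to compare it with the number of distinct `Country/Region` values. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative assumptions, not real counts) with the same column names as the CSSE file.

```python
import pandas as pd

# Toy stand-in for the CSSE table; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Province/State": ["Ontario", "Quebec", None, None],
    "Country/Region": ["Canada", "Canada", "Germany", "Japan"],
    "1/22/20": [0, 0, 0, 2],
    "1/23/20": [1, 0, 0, 2],
})

n_rows = toy_df.shape[0]                          # one row per province/country entry
n_countries = toy_df["Country/Region"].nunique()  # distinct countries only
print(n_rows, n_countries)
```

Running the same two lines on `confirmed_df` should show how many rows are province-level entries rather than whole countries.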

This step cleans the dataframe by removing unnecessarily precise geographic information, such as the latitude and longitude columns and the province/state column.

In [None]:
confirmed_df = confirmed_df.drop(columns=["Lat", "Long", "Province/State"])
confirmed_df.head()

Next we will sum the rows that belong to the same country and transpose the result, so that each date becomes a row and each country becomes a column. We then label the date column appropriately.

In [None]:
# Sum all rows for each country, then transpose so that dates become the rows.
confirmed_df = confirmed_df.groupby(by="Country/Region").sum().T
confirmed_df.head()

In [None]:
confirmed_df.index.name = "Date"
confirmed_df = confirmed_df.reset_index()
confirmed_df.tail()
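
The group, sum, and transpose steps above can be sketched on a tiny frame to make the reshaping easier to follow. The frame `toy_wide` and its numbers are hypothetical; only the column names mirror the CSSE data.

```python
import pandas as pd

# Hypothetical wide table: two province rows for Canada, one row for Germany.
toy_wide = pd.DataFrame({
    "Country/Region": ["Canada", "Canada", "Germany"],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 6],
})

# Sum province-level rows into one row per country, then transpose
# so that each date becomes a row and each country a column.
toy_by_date = toy_wide.groupby("Country/Region").sum().T

# Name the index "Date" and turn it into a regular column.
toy_by_date.index.name = "Date"
toy_by_date = toy_by_date.reset_index()
print(toy_by_date)
```

The two Canada rows collapse into a single `Canada` column, and the date labels move from the column headers into the `Date` column.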

Having the number of columns match the number of countries is much more manageable than having a thousand-plus date columns, but this can be cleaned up further to give a readable, narrow dataframe.

In [None]:
melt_confirmed_df = confirmed_df.melt(id_vars="Date").copy()
melt_confirmed_df.rename(columns={"value":"Confirmed"}, inplace=True)
melt_confirmed_df.head()
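
As a minimal sketch of what `melt` does here, the example below reshapes a hypothetical two-country frame from wide to long. It passes `var_name` and `value_name` explicitly, which is an assumption made for clarity; the cell above instead relies on the column-axis name and a later `rename`.

```python
import pandas as pd

# Hypothetical wide frame: one row per date, one column per country.
toy_by_date = pd.DataFrame({
    "Date": ["1/22/20", "1/23/20"],
    "Canada": [3, 7],
    "Germany": [5, 6],
})

# Wide to long: every (date, country) pair becomes its own row.
toy_long = toy_by_date.melt(
    id_vars="Date",
    var_name="Country/Region",
    value_name="Confirmed",
)
print(toy_long)
```

The result has one row per date and country, which is the shape that Plotly Express expects for its `x`, `y`, and `color` arguments.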

The date is showing as an "object", which here means a string. Strings are not easy to manipulate as dates, so we should convert the column to a datetime type to make it sortable.

In [None]:
melt_confirmed_df.dtypes

In [None]:
# The CSSE column labels use month/day/two-digit-year, e.g. "1/22/20".
melt_confirmed_df["Date"] = pd.to_datetime(melt_confirmed_df["Date"], format="%m/%d/%y")
melt_confirmed_df.tail()
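
The reason for converting can be shown with a short sketch: date strings sort alphabetically, not chronologically. The three sample strings below are illustrative and follow the month/day/two-digit-year pattern used in the CSSE column labels.

```python
import pandas as pd

date_strings = pd.Series(["1/22/20", "10/1/20", "2/1/20"])

# Sorted as text, "10/1/20" lands before "2/1/20" because "1" < "2".
sorted_as_text = date_strings.sort_values().tolist()

# Parsed as dates, the order is chronological.
parsed = pd.to_datetime(date_strings, format="%m/%d/%y")
sorted_as_dates = parsed.sort_values().dt.strftime("%Y-%m-%d").tolist()

print(sorted_as_text)
print(sorted_as_dates)
```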

Finding the most recent date in the data will allow us to see the latest totals for COVID-19 confirmed cases.

In [None]:
max_date = melt_confirmed_df["Date"].max()
max_date

After finding the most recent date, we can create another copy containing all of the countries and regions with confirmed totals for that date, and format the date as month/day/year, similar to the format originally used by Johns Hopkins University.

In [None]:
total_confirmed_df = melt_confirmed_df[melt_confirmed_df["Date"]==max_date].copy()
# "Date" is already a datetime column, so it can be formatted directly.
total_confirmed_df["Date"] = total_confirmed_df["Date"].dt.strftime("%m/%d/%Y")
total_confirmed_df.head()

Checking the sum for today (10/30/2022) shows a close match between our total of confirmed cases and the Johns Hopkins University dashboard. A close match is the best that we can get, since the dashboard is updated in real time while the .csv files that we're working with are not.

In [None]:
sum_confirmed = total_confirmed_df["Confirmed"].sum()
sum_confirmed
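
The latest-date filter and the sum can be checked by hand on a toy frame. The frame `toy_long` and its counts are made up for illustration; the logic follows the cells above.

```python
import pandas as pd

# Hypothetical long-format data: cumulative counts for two countries on two dates.
toy_long = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-22", "2020-01-23", "2020-01-22", "2020-01-23"]),
    "Country/Region": ["Canada", "Canada", "Germany", "Germany"],
    "Confirmed": [3, 7, 5, 6],
})

# Keep only the rows for the most recent date, then add them up.
latest = toy_long["Date"].max()
latest_rows = toy_long[toy_long["Date"] == latest].copy()
latest_total = latest_rows["Confirmed"].sum()
print(latest, latest_total)
```

Because the counts are cumulative, summing only the latest date gives the running total; summing over all dates would count earlier cases repeatedly.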

# Visualizing the Data
We can start by visualizing all of the data that we have, beginning with the number of confirmed cases for each country.

In [None]:
fig1 = px.bar(total_confirmed_df, x="Country/Region", y="Confirmed")
fig1.show()

Looking at the top 30 countries shows where the highest numbers of confirmed cases were recorded, although these will likely lean toward more developed countries with better infrastructure for measuring the cases present in the country.

In [None]:
# Sort by confirmed cases and keep the 30 largest totals.
top_30_df = total_confirmed_df.sort_values("Confirmed", ascending=False).head(30)
fig2 = px.bar(top_30_df, x="Country/Region", y="Confirmed", text="Confirmed")
fig2.show()
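
As an aside, pandas also offers `DataFrame.nlargest` for this kind of top-N selection. The sketch below, using a made-up frame `toy_totals`, compares it with the sort-then-head approach used above.

```python
import pandas as pd

# Hypothetical per-country totals.
toy_totals = pd.DataFrame({
    "Country/Region": ["A", "B", "C", "D"],
    "Confirmed": [10, 40, 30, 20],
})

# Approach used above: sort descending, then take the first rows.
top_sorted = toy_totals.sort_values("Confirmed", ascending=False).head(2)

# Alternative: ask pandas directly for the rows with the largest values.
top_nlargest = toy_totals.nlargest(2, "Confirmed")

print(top_sorted["Country/Region"].tolist())
print(top_nlargest["Country/Region"].tolist())
```

Both approaches select the same rows here; which one to use is mostly a matter of readability.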

Here we can see that the top countries are similar to those shown on the dashboard and generally match the expectation that they would have the healthcare infrastructure to measure high numbers of cases.

In [None]:
fig3 = px.scatter(melt_confirmed_df, x="Date", y="Confirmed", color="Country/Region")
fig3.show()

In [None]:
fig4 = px.line(melt_confirmed_df[melt_confirmed_df["Country/Region"]=="Germany"], x="Date", y="Confirmed")
fig4.show()