
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# COVID Demo
## Data Analysis with `pandas`

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Import the COVID-19 dataset
   * `pd.read_csv()`
 - Summarize the data
   * `head`, `tail`, `shape`
   * `sum`, `min`, `count`, `mean`, `std`
   * `describe`
 - Slice and munge data
   * Slicing, `loc`, `iloc`
   * `value_counts`
   * `drop`
   * `sort_values`
   * Filtering
 - Group data and perform aggregate functions
   * `groupby`
 - Work with missing data and duplicates
   * `isnull`
   * `unique`, `drop_duplicates`
   * `fillna`
 - Visualization
   * Histograms
   * Scatterplots
   * Lineplots
 
 Check out [this cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for help.  Also see [the `pandas` docs.](https://pandas.pydata.org/docs/)

### Import the COVID-19 dataset

Use `%sh ls` to search the folder structure

In [0]:
%sh ls /dbfs/databricks-datasets/COVID/

In [0]:
%sh ls /dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports

Use `%sh head` to see the first few lines of a CSV file

In [0]:
%sh head /dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv

Import `pandas`.  Alias it as `pd`

In [0]:
import pandas as pd

Read the CSV file.  This creates a `DataFrame`

In [0]:
pd.read_csv("/dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv")
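If you want to try `pd.read_csv()` without access to the Databricks dataset, here is a minimal sketch that reads from an in-memory string instead of a path. The rows are made up for illustration and only mimic the layout of a daily report; `csv_text` and `sample_df` are names chosen for this example.

```python
import io

import pandas as pd

# Made-up rows that only mimic the layout of a daily report (not real figures).
csv_text = """Province_State,Country_Region,Confirmed,Deaths
California,US,100,5
New York,US,250,12
,Italy,300,40
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # column types inferred from the text
```

Note that the empty `Province_State` field in the last row is read as a missing value (`NaN`).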

Now let's combine the lines of code and save the `DataFrame` to a variable so we can reuse it

In [0]:
import pandas as pd

df = pd.read_csv("/dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports/04-11-2020.csv")

df

### Summarize the data

First, let's talk tab completion

In [0]:
# df. # Uncomment this and press 'tab' with your cursor after the "."

Need help?

In [0]:
help(df.head)  # Pass the method itself, not the result of calling it

Take a peek at the first and last few rows of the data

In [0]:
df.head()

In [0]:
df.tail(2)

How many records are in the dataset?

In [0]:
df.shape

Summarize the data

In [0]:
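# df.sum()
# df.min()
# df.max()
# df.count()
df.mean(numeric_only=True)  # Skip text columns; recent pandas versions raise an error on them otherwise
# df.std(numeric_only=True)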

`describe` computes these summary stats for the numeric columns in one call

In [0]:
df.describe()
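To see what these summaries return, here is a small sketch on a made-up frame (`toy_df` and its values are illustrative, not taken from the dataset). It assumes you only want statistics for the numeric columns.

```python
import pandas as pd

# Illustrative values only.
toy_df = pd.DataFrame({
    "Country_Region": ["US", "US", "Italy"],
    "Confirmed": [100, 250, 300],
    "Deaths": [5, 12, 40],
})

# numeric_only=True leaves the text column out of the calculation.
means = toy_df.mean(numeric_only=True)

# describe() reports count, mean, std, min, quartiles, and max
# for the numeric columns by default.
summary = toy_df.describe()

print(means)
print(summary)
```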

### Slice and munge data

Grab just the confirmed cases

In [0]:
df["Confirmed"]

Grab the country and confirmed cases.

In [0]:
df.columns

In [0]:
df[["Country_Region", "Confirmed"]]

Create a new column `Date`

In [0]:
import datetime

df["Date"] = datetime.date(2020, 4, 11)

In [0]:
df["Date"].head()

Slice the DataFrame to get the first 10 rows. `loc` slices by label and includes the end label, so `:9` returns rows 0 through 9

In [0]:
df.loc[:9, ["Country_Region", "Confirmed"]]
# df.loc[0:9, ["Country_Region", "Confirmed"]] # Same thing
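The difference between label-based and position-based slicing is easy to miss, so here is a quick sketch on a made-up frame with the default integer index (`toy_df` is a name chosen for this example).

```python
import pandas as pd

# Twenty rows with the default integer index 0..19.
toy_df = pd.DataFrame({"Confirmed": range(20)})

by_label = toy_df.loc[:10]      # labels 0 through 10, end label included
by_position = toy_df.iloc[:10]  # positions 0 through 9, end excluded

print(len(by_label), len(by_position))
```

With a default index the two look similar, but `loc` returns one more row than `iloc` for the same slice bounds.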

Return the single value in the first row and first column

In [0]:
df.iloc[0, 0]

How many regions do we have per country?

In [0]:
df["Country_Region"].value_counts()
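As a rough illustration of what `value_counts` returns, here is a sketch on a made-up `Series` (the values are illustrative only).

```python
import pandas as pd

# One entry per reported region; values are made up.
regions = pd.Series(
    ["US", "US", "Italy", "US", "China", "China"], name="Country_Region"
)

# Counts how often each value appears, most frequent first.
counts = regions.value_counts()

print(counts)
```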

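What's FIPS? It's the Federal Information Processing Standards code that identifies US counties. We don't need it here, so drop the column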

In [0]:
df = df.drop("FIPS", axis=1)

Sort by confirmed cases

In [0]:
df.sort_values("Confirmed", ascending=False)

Let's just look at what's going on in the US

In [0]:
df[df["Country_Region"] == "US"]

Now let's look at what's going on in my county

In [0]:
df[(df["Country_Region"] == "US") & (df["Province_State"] == "California") & (df["Admin2"] == "San Francisco")]
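Long filter expressions can be easier to read if you name each condition first. Here is a sketch on made-up data (`toy_df`, `is_us`, and `is_ca` are names chosen for this example). The parentheses in the one-line version matter because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Illustrative rows only.
toy_df = pd.DataFrame({
    "Admin2": ["San Francisco", "Alameda", "Kings", None],
    "Province_State": ["California", "California", "New York", None],
    "Country_Region": ["US", "US", "US", "Italy"],
    "Confirmed": [10, 20, 30, 40],
})

# Each comparison yields a boolean Series with one entry per row.
is_us = toy_df["Country_Region"] == "US"
is_ca = toy_df["Province_State"] == "California"

# & combines the masks element-wise; only rows where both are True are kept.
california = toy_df[is_us & is_ca]

print(california)
```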

### Group data and perform aggregate functions

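Which country has the greatest number of confirmed cases? Start by grouping on country. `groupby` on its own returns a `DataFrameGroupBy` object and computes nothing until you aggregate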

In [0]:
df.groupby("Country_Region")

Group and sum the data. **Note that an aggregate function returns a scalar (single) value for each group.**

In [0]:
df.groupby("Country_Region")["Confirmed"].sum().sort_values(ascending=False)
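The group-then-aggregate pattern can be sketched on a small made-up frame, where the result is easy to check by hand (`toy_df` and its numbers are illustrative).

```python
import pandas as pd

# Illustrative values only: several regions per country.
toy_df = pd.DataFrame({
    "Country_Region": ["US", "US", "Italy", "China", "China"],
    "Confirmed": [100, 250, 300, 80, 20],
})

# Grouping alone does no arithmetic yet.
grouped = toy_df.groupby("Country_Region")

# sum() collapses each group to one value; the result is indexed by country.
totals = grouped["Confirmed"].sum().sort_values(ascending=False)

print(totals)
```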

Which US states have the most cases?

In [0]:
df[df["Country_Region"] == "US"].groupby("Province_State")["Confirmed"].sum().sort_values(ascending=False)

### Work with missing data and duplicates

Do we have null values?

In [0]:
df.isnull().tail()

In [0]:
df.isnull().sum()

How many unique countries?

In [0]:
df["Country_Region"].unique().shape

Another way to get the unique values. This returns a `Series` rather than an array

In [0]:
df["Country_Region"].drop_duplicates()

Replace missing values with a placeholder

In [0]:
df.fillna("NO DATA AVAILABLE").tail(3)
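Here is a short sketch that pulls the missing-data and duplicate tools together on made-up data. `nunique` is not used above; it is a pandas shortcut that returns the number of distinct values directly.

```python
import pandas as pd

# Illustrative rows; None marks a missing county name.
toy_df = pd.DataFrame({
    "Admin2": ["San Francisco", None, None],
    "Country_Region": ["US", "Italy", "Italy"],
    "Confirmed": [10, 300, 120],
})

# isnull() gives True/False per cell; summing counts the True values per column.
null_counts = toy_df.isnull().sum()

# Two ways to look at distinct countries.
n_countries = toy_df["Country_Region"].nunique()
distinct = toy_df["Country_Region"].drop_duplicates()

# fillna returns a new DataFrame; toy_df itself is unchanged.
filled = toy_df.fillna("NO DATA AVAILABLE")

print(null_counts)
print(n_countries)
print(filled)
```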

### Visualization
   * Histograms
   * Scatterplots
   * Lineplots

In [0]:
us_subset_df = df[df["Country_Region"] == "US"]

What is the _distribution_ of deaths by US states and territories?

In [0]:
us_subset_df.groupby("Province_State")["Deaths"].sum().hist()

In [0]:
us_subset_df.groupby("Province_State")["Deaths"].sum().hist(bins=30)
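To see what the `bins` argument does without drawing anything, here is a sketch that uses `numpy.histogram` on made-up totals. This is an illustration of equal-width binning, which is what an integer `bins` value requests; it is not a claim about how the plot above is rendered internally.

```python
import numpy as np

# Made-up per-state totals.
deaths_per_state = [1, 2, 3, 10]

# An integer bins value splits the range min..max into that many equal-width bins.
counts, edges = np.histogram(deaths_per_state, bins=3)

print(counts)  # how many values fall into each bin
print(edges)   # the bin boundaries
```

More bins give a finer view of the distribution, at the cost of more empty or sparsely filled bars.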

How do confirmed cases compare to deaths?

In [0]:
us_subset_df.plot.scatter(x="Confirmed", y="Deaths")

In [0]:
us_subset_df[us_subset_df["Deaths"] < 1000].plot.scatter(x="Confirmed", y="Deaths")

Import the data for all available days.

In [0]:
import glob

path = "/dbfs/databricks-datasets/COVID/CSSEGISandData/csse_covid_19_data/csse_covid_19_daily_reports"
all_files = glob.glob(path + "/*.csv")

dfs = []

for filename in all_files:
  temp_df = pd.read_csv(filename)
  temp_df.columns = [c.replace("/", "_") for c in temp_df.columns]
  temp_df.columns = [c.replace(" ", "_") for c in temp_df.columns]
  
  month, day, year = filename.split("/")[-1].replace(".csv", "").split("-")
  d = datetime.date(int(year), int(month), int(day))
  temp_df["Date"] = d

  dfs.append(temp_df)
  
all_days_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
all_days_df = all_days_df.drop(["Latitude", "Longitude", "Lat", "Long_", "FIPS", "Combined_Key", "Last_Update"], axis=1)
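The loop above normalizes column names before concatenating, which suggests that the daily files do not all share the same header. Here is a sketch of why that matters, using two made-up in-memory "files" (`fake_files` and its contents are hypothetical stand-ins, not real report data).

```python
import datetime
import io

import pandas as pd

# Hypothetical file names and contents standing in for daily reports.
fake_files = {
    "/reports/03-01-2020.csv": "Province/State,Country/Region,Confirmed\nHubei,China,100\n",
    "/reports/04-11-2020.csv": "Province_State,Country_Region,Confirmed,Deaths\nCalifornia,US,50,2\n",
}

frames = []
for filename, text in fake_files.items():
    frame = pd.read_csv(io.StringIO(text))
    # Make the headers consistent so matching columns line up in concat.
    frame.columns = [c.replace("/", "_").replace(" ", "_") for c in frame.columns]

    # The date comes from the file name, which has the form MM-DD-YYYY.csv.
    month, day, year = filename.split("/")[-1].replace(".csv", "").split("-")
    frame["Date"] = datetime.date(int(year), int(month), int(day))
    frames.append(frame)

# Columns missing from a file are filled with NaN for that file's rows.
combined = pd.concat(frames, axis=0, ignore_index=True, sort=False)

print(combined)
```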

In [0]:
all_days_df.head()

How has the disease spread over time?

In [0]:
all_days_df.groupby("Date")["Confirmed"].sum().plot(title="Confirmed Cases over Time", rot=45)

Break this down by types of cases.

In [0]:
all_days_df.groupby("Date")[["Confirmed", "Deaths", "Recovered"]].sum().plot(title="Confirmed, Deaths, Recovered over Time", rot=45)

What is the growth in my county?

In [0]:
(all_days_df[(all_days_df["Country_Region"] == "US") & (all_days_df["Province_State"] == "California") & (all_days_df["Admin2"] == "San Francisco")]
  .groupby("Date")[["Confirmed", "Deaths", "Recovered"]]
  .sum()
  .plot(title="Confirmed, Deaths, Recovered over Time", rot=45))

Wrap this up in a function and run it yourself!

In [0]:
def plotMyCountry(Country_Region, Province_State, Admin2):
  (all_days_df[(all_days_df["Country_Region"] == Country_Region) & (all_days_df["Province_State"] == Province_State) & (all_days_df["Admin2"] == Admin2)]
    .groupby("Date")[["Confirmed", "Deaths", "Recovered"]]
    .sum()
    .plot(title="Confirmed, Deaths, Recovered over Time", rot=45))
  
plotMyCountry("US", "New York", "New York City")
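If you want to inspect the numbers behind the plot, one option is a variant that returns the aggregated table instead of drawing it. The sketch below is a hypothetical helper (`county_totals` is not part of this notebook) run on made-up data, and it takes the DataFrame as an argument rather than reading a global.

```python
import datetime

import pandas as pd

# Made-up rows shaped like the combined daily data.
toy_all_days = pd.DataFrame({
    "Admin2": ["San Francisco", "San Francisco", "Alameda"],
    "Province_State": ["California", "California", "California"],
    "Country_Region": ["US", "US", "US"],
    "Date": [
        datetime.date(2020, 4, 10),
        datetime.date(2020, 4, 11),
        datetime.date(2020, 4, 11),
    ],
    "Confirmed": [10, 15, 7],
    "Deaths": [1, 2, 0],
})


def county_totals(frame, country_region, province_state, admin2):
    """Return per-date totals for one county (hypothetical helper)."""
    mask = (
        (frame["Country_Region"] == country_region)
        & (frame["Province_State"] == province_state)
        & (frame["Admin2"] == admin2)
    )
    # A list of column names selects several columns from the grouped data.
    return frame[mask].groupby("Date")[["Confirmed", "Deaths"]].sum()


sf = county_totals(toy_all_days, "US", "California", "San Francisco")
print(sf)
```

Calling `.plot(...)` on the returned table gives the same kind of line plot as above.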

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>