# Pandas Worked Example 09 / 17 / 2020

We start by importing the packages we will be using:
- Matplotlib: creating simple line plots
- Seaborn: generating statistical plots
- NumPy: in case we need to set any dtypes
- Pandas: working with the data

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


Now we read in our dataset (the JHU COVID time series data for confirmed cases in the US). Loading from the URL always fetches the most up-to-date version.

We should now do any data cleaning we want in the same cell.
- This ensures that we don't run into errors from rerunning cells that set indexes etc.
- Check for columns that have no data: `df.isnull().sum().sort_values()`
  - If there are any, drop them: `df = df.drop(columns=['col1', 'col2'])`
- Create a long version of the data: `df_l = df.melt(id_vars=index_cols, var_name='Date', value_name='Count')`
  - Set the dtype for Date: `df_l['Date'] = pd.to_datetime(df_l['Date'])`
- Set the index: `.set_index()`

 

In [None]:
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv'

index_cols = ["UID", "iso2", "iso3", "code3", "FIPS", "Admin2", "Province_State", "Country_Region", "Lat", "Long_", "Combined_Key"]

df = pd.read_csv(url)
df.head()
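
The cleaning steps listed above can be sketched on a small stand-in frame so the cell runs without downloading anything. The values, the reduced column set, and the `Empty` column are made up for illustration, and setting the index to the identifier columns is an assumption, since the steps do not say which columns to index on.

```python
import pandas as pd

# Hypothetical stand-in for the JHU wide table (made-up values, fewer columns).
df = pd.DataFrame(
    {
        "Admin2": ["Autauga", "Baldwin"],
        "Province_State": ["Alabama", "Alabama"],
        "Empty": [None, None],  # a column with no data at all
        "1/22/20": [0, 0],
        "1/23/20": [1, 0],
        "1/24/20": [3, 2],
    }
)
index_cols = ["Admin2", "Province_State"]

# Count missing values per column, then drop the columns that are entirely empty.
null_counts = df.isnull().sum().sort_values()
empty_cols = null_counts[null_counts == len(df)].index.tolist()
df = df.drop(columns=empty_cols)

# Wide -> long: one row per (county, date) pair.
df_l = df.melt(id_vars=index_cols, var_name="Date", value_name="Count")

# The melted column labels are strings, so convert them to real datetimes.
df_l["Date"] = pd.to_datetime(df_l["Date"])

# Assumption: index on the identifier columns and keep Date as a normal column.
df_l = df_l.set_index(index_cols)
```

Keeping all of this in one cell means rerunning it always starts from a fresh `df`, so steps like `set_index` are never applied twice.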

# Working with the wide format data

Next we should look at what our DataFrame is telling us.

- What columns do we have available?
- What does each row represent?

Steps:
- Plot all the data
  - As a line plot
  - As a heatmap
- Use these plots to find our next question
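
One possible way to produce both plots is sketched below on a tiny made-up wide table. Summing to state level first is an assumption to keep the plots readable; the frame `wide` and its values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical wide data: one row per county, one column per date.
wide = pd.DataFrame(
    {
        "Province_State": ["Alabama", "Alabama", "Alaska"],
        "1/22/20": [0, 0, 0],
        "1/23/20": [1, 0, 2],
        "1/24/20": [3, 2, 5],
    }
)

# Sum counties up to state level so each row is one state.
by_state = wide.groupby("Province_State").sum()

fig, (ax_line, ax_heat) = plt.subplots(1, 2, figsize=(12, 4))

# Line plot: transpose so dates run along the x-axis and each state is a line.
by_state.T.plot(ax=ax_line, title="Confirmed cases by state")

# Heatmap: rows are states, columns are dates, colour is the case count.
sns.heatmap(by_state, ax=ax_heat)
ax_heat.set_title("Confirmed cases heatmap")

fig.tight_layout()
```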

Considerations:
- Should we start at a fine-grained level (individual counties) or at a higher level (states)?
- Should we transform the data at all?
  - Normalize, standardise, or log transform
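
As a rough sketch of the three transformations mentioned above, applied per state (row) to a made-up state-level table. The frame `by_state` and its values are hypothetical, and whether to scale per row or per column is a choice this example makes, not something the data dictates.

```python
import numpy as np
import pandas as pd

# Hypothetical state-level wide data: rows are states, columns are dates.
by_state = pd.DataFrame(
    {"1/22/20": [0, 0], "1/23/20": [5, 2], "1/24/20": [10, 4]},
    index=["Alabama", "Alaska"],
)

# Normalize (min-max): rescale each state's series to the range 0..1.
row_min = by_state.min(axis=1)
row_max = by_state.max(axis=1)
normalized = by_state.sub(row_min, axis=0).div(row_max - row_min, axis=0)

# Standardise: subtract each state's mean and divide by its standard deviation.
standardized = by_state.sub(by_state.mean(axis=1), axis=0).div(
    by_state.std(axis=1), axis=0
)

# Log transform: log1p computes log(1 + x), so zero counts stay finite.
logged = np.log1p(by_state)
```

Note that a row with constant values would give a zero range and zero standard deviation, so the first two transformations would produce NaN for that row.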

# Working with the long format data

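One limiting factor of the wide format is that each date is a separate column, so we cannot easily filter on date ranges. In the long format the dates sit in a single `Date` column that we can filter on.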

- Filtering using the Date:
  - `date_filter = ('2020-03-01' <= df_l['Date']) & (df_l['Date'] <= '2020-03-31')`
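
A minimal sketch of applying this filter, using a small made-up long frame in place of the real `df_l` (the values are hypothetical). It assumes `Date` has already been converted with `pd.to_datetime` and is a regular column rather than the index.

```python
import pandas as pd

# Hypothetical long-format data with a datetime Date column.
df_l = pd.DataFrame(
    {
        "Province_State": ["Alabama"] * 4,
        "Date": pd.to_datetime(
            ["2020-02-28", "2020-03-01", "2020-03-31", "2020-04-01"]
        ),
        "Count": [0, 1, 10, 12],
    }
)

# Boolean mask: True for rows whose Date falls in March 2020 (both ends inclusive).
date_filter = ("2020-03-01" <= df_l["Date"]) & (df_l["Date"] <= "2020-03-31")
march = df_l[date_filter]

# The same selection written with Series.between, which is inclusive by default.
march_between = df_l[df_l["Date"].between("2020-03-01", "2020-03-31")]
```

The parentheses around each comparison matter, because `&` binds more tightly than `<=` in Python.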
