# CME538 - Introduction to Data Science
## Tutorial 5 - Exploratory data analysis with COVID cases data

### Learning Objectives
After completing this tutorial, you should be comfortable:

- Using basic data exploration and filtering methods
- Using basic dataframe manipulation techniques in Pandas
- Doing basic time-series analysis such as date-time filtering, resampling and change over time

### Lecture Structure
1. [Data Exploration](#section1)
2. [Dataframe Manipulation](#section2)
3. [Time-series Analysis](#section3)

## Setup Notebook

In [3]:
# 3rd party imports
import pandas as pd
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")
import warnings
warnings.filterwarnings('ignore')

# Overview

In this tutorial we will use COVID-19 case data to conduct a time-series analysis of case growth in several countries. We will use basic techniques for exploring and filtering datasets, and use Pandas features to reshape the dataset into the form we want. Next, we will apply basic time-series techniques, such as resampling and percentage change over time, and learn how to plot the results.

<a id='section1'></a>
# 1.  Data exploration and filtering

First, we need to read the dataset into a dataframe.

In [5]:
# import dataframe of global covid cases
df = pd.read_csv('time_series_covid19_confirmed_global.csv')

Let's have a look at the first and last 5 rows of the dataset, as well as its shape, columns, and indices.
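As a minimal sketch of these inspection calls, the example below uses a small made-up dataframe. The name toy and its values are hypothetical stand-ins for the real file; the same attributes and methods apply to df.

```python
import pandas as pd

# Hypothetical toy data that mimics the layout of the COVID file
toy = pd.DataFrame({
    'Province/State': [None, 'Ontario', 'Quebec', None, None, None],
    'Country/Region': ['Afghanistan', 'Canada', 'Canada', 'France', 'Italy', 'Japan'],
    '1/22/20': [0, 0, 0, 0, 0, 2],
    '1/23/20': [0, 1, 0, 2, 0, 2],
})

first_rows = toy.head()   # first 5 rows by default
last_rows = toy.tail()    # last 5 rows by default
print(first_rows)
print(last_rows)
print(toy.columns)        # column labels
print(toy.index)          # row labels
print(toy.shape)          # (number of rows, number of columns)
```

Both head and tail accept a number, for example toy.head(3), if you want more or fewer rows.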

In [None]:
#Head


In [None]:
#Tail


In [None]:
#Columns


In [None]:
#Indices


In [None]:
#Shape


We can use the .describe method to summarize the columns.
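Here is a small sketch of what .describe returns for text columns. The dataframe toy is hypothetical and its values are made up; on the real data you would call the method on df instead.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    'Province/State': [None, 'Ontario', 'Quebec', None, None, None],
    'Country/Region': ['Afghanistan', 'Canada', 'Canada', 'France', 'Italy', 'Japan'],
})

# For text columns, describe() reports count (non-null values),
# unique (distinct values), top (most frequent value) and freq (its frequency)
province_summary = toy['Province/State'].describe()
country_summary = toy['Country/Region'].describe()
print(province_summary)
print(country_summary)
```

Note that count only includes non-null values, which is a quick way to spot columns with missing data.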

In [None]:
# Describe 'Province/State'


In [None]:
# Describe 'Country/Region'


We can use .unique to list the unique values in each column, and .value_counts to count how often each unique value occurs.
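The following sketch shows both methods on a hypothetical toy dataframe (the name toy and its values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    'Province/State': [None, 'Ontario', 'Quebec', None, None, None],
    'Country/Region': ['Afghanistan', 'Canada', 'Canada', 'France', 'Italy', 'Japan'],
})

# Array of the distinct values in the column
countries = toy['Country/Region'].unique()

# Number of rows per distinct value, most frequent first
country_counts = toy['Country/Region'].value_counts()

# By default missing values are skipped; dropna=False counts them too
province_counts = toy['Province/State'].value_counts(dropna=False)

print(countries)
print(country_counts)
print(province_counts)
```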

In [None]:
# Unique


In [None]:
# Unique


In [None]:
# Value_counts


In [None]:
# Value_counts


Now that we have some idea about the dataset, we can filter the values we are interested in. Let's have a look at Canada's data.

We can add more conditions to our filter.
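A minimal sketch of both filters, using a hypothetical toy dataframe in place of df. The choice of Ontario as the second condition is only an example.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    'Province/State': [None, 'Ontario', 'Quebec', None],
    'Country/Region': ['Afghanistan', 'Canada', 'Canada', 'France'],
    '1/23/20': [0, 1, 0, 2],
})

# A comparison produces a boolean Series that selects the matching rows
is_canada = toy['Country/Region'] == 'Canada'
canada = toy[is_canada]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
ontario = toy[is_canada & (toy['Province/State'] == 'Ontario')]

print(canada)
print(ontario)
```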

Now let's go back to see why the Province/State column has so many null values. We can use the .isnull method.
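The sketch below counts the missing provinces and lists the countries they belong to. It runs on a hypothetical toy dataframe; the pattern it illustrates (countries reported as a single row have no province) is what we would expect to find in the real data.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    'Province/State': [None, 'Ontario', 'Quebec', None],
    'Country/Region': ['Afghanistan', 'Canada', 'Canada', 'France'],
})

# isnull() marks each missing value with True
missing_mask = toy['Province/State'].isnull()

# True counts as 1, so the sum is the number of missing values
n_missing = missing_mask.sum()

# Countries whose row has no province
countries_without_province = toy.loc[missing_mask, 'Country/Region']

print(n_missing)
print(countries_without_province)
```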

<a id='section2'></a>
# 2. Dataframe manipulation

We want to reshape the data into a dataframe where the columns are countries and the rows are dates, so that each row holds the numbers for a given date. We are looking at country-level data, so we can sum the numbers for the states to get a total for each country.

Since we do not need the Lat/Long and province data, we can simply remove them from our dataframe.

Keep in mind that you need to reassign the variable, or the dataframe will remain unchanged.
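A short sketch of this step on a hypothetical toy dataframe. The column names 'Lat' and 'Long' are an assumption about how the coordinates are labelled in the file.

```python
import pandas as pd

# Hypothetical toy data; 'Lat' and 'Long' are assumed column names
toy = pd.DataFrame({
    'Province/State': ['Ontario', 'Quebec'],
    'Country/Region': ['Canada', 'Canada'],
    'Lat': [51.25, 52.94],
    'Long': [-85.32, -73.55],
    '1/22/20': [0, 0],
})

# drop() returns a new dataframe; without assignment, toy is unchanged
toy.drop(columns=['Province/State', 'Lat', 'Long'])
n_cols_before = toy.shape[1]

# Reassigning the variable keeps the result
toy = toy.drop(columns=['Province/State', 'Lat', 'Long'])
print(toy.columns)
```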

Now let's use the groupby method to sum the province data for each country.
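As a sketch, grouping by country and summing looks like this on a hypothetical toy dataframe (the values are made up):

```python
import pandas as pd

# Hypothetical toy data: two Canadian provinces and one country-level row
toy = pd.DataFrame({
    'Country/Region': ['Canada', 'Canada', 'France'],
    '1/22/20': [1, 2, 5],
    '1/23/20': [3, 4, 6],
})

# One row per country; each date column holds the sum over that country's rows
totals = toy.groupby('Country/Region').sum()
print(totals)
```

After grouping, 'Country/Region' becomes the index of the result rather than a regular column.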

You can see that our dataframe now has 192 rows.

Also note that there are different ways to aggregate after grouping. For instance, if you were looking for the count of instances rather than the sum of the numbers, you could use the code below.
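One possible version, assuming count is the intended aggregation, is sketched here on a hypothetical toy dataframe:

```python
import pandas as pd

# Hypothetical toy data: two Canadian provinces and one country-level row
toy = pd.DataFrame({
    'Country/Region': ['Canada', 'Canada', 'France'],
    '1/22/20': [1, 2, 5],
    '1/23/20': [3, 4, 6],
})

# count() reports how many non-null entries each group has per column
counts = toy.groupby('Country/Region').count()
print(counts)
```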

The next step is to swap the columns and rows. This is called transposing the dataframe:

Notice the change in the 'Country/Region' index.
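The transpose can be sketched as follows. The dataframe totals is a hypothetical stand-in for the grouped result, with made-up values.

```python
import pandas as pd

# Hypothetical grouped totals: countries as rows, dates as columns
totals = pd.DataFrame(
    {'1/22/20': [3, 5], '1/23/20': [7, 6]},
    index=pd.Index(['Canada', 'France'], name='Country/Region'),
)

# .T swaps rows and columns: dates become the index, countries the columns
by_date = totals.T
print(by_date)
```

The name 'Country/Region' moves with the labels, so it now names the columns instead of the index.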

<a id='section3'></a>
# 3. Time-series analysis

### DateTime features

Although the index values are dates, they are still stored as strings. If we convert them to datetime values, we can use a range of useful features.

We can use pd.DatetimeIndex to change the index into datetime
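A minimal sketch of the conversion, using a hypothetical dataframe by_date whose index holds date strings in month/day/year form:

```python
import pandas as pd

# Hypothetical transposed data with string dates as the index
by_date = pd.DataFrame({'Canada': [3, 7]}, index=['1/22/20', '1/23/20'])

# Replace the string index with a DatetimeIndex
by_date.index = pd.DatetimeIndex(by_date.index)

print(by_date.index)
print(by_date.index.month)   # datetime attributes are now available
```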

This provides a range of features we can use to filter the dataframe

The code above is similar to a two-condition filter.

We can also select a range of dates.
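The sketch below shows one way these filters can look, assuming that selecting a month with a partial date string is the kind of feature meant here. The dataframe cases is hypothetical.

```python
import pandas as pd

# Hypothetical daily data from 2020-02-27 to 2020-03-05
cases = pd.DataFrame(
    {'Canada': range(8)},
    index=pd.date_range('2020-02-27', periods=8, freq='D'),
)

# Partial-string indexing: every row in March 2020
march = cases.loc['2020-03']

# The same selection written as a two-condition filter
march_two_conditions = cases[(cases.index.year == 2020) & (cases.index.month == 3)]

# A range of dates; both endpoints are included
date_range_rows = cases.loc['2020-02-28':'2020-03-01']

print(march)
print(date_range_rows)
```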

Now that we know the datetime features, we can start analyzing our dataframe.

First, let's sort the data from the highest number of COVID cases to the lowest. We can do this by the last row (the last date we have).

If we wanted to sort the dataframe based on a column value, we would change the axis number to 0.
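Both kinds of sorting can be sketched with sort_values on a hypothetical dataframe cases (values made up):

```python
import pandas as pd

# Hypothetical data: dates as rows, countries as columns
cases = pd.DataFrame(
    {'Canada': [5, 30], 'Italy': [9, 50], 'Japan': [7, 10]},
    index=pd.to_datetime(['2020-03-01', '2020-03-02']),
)

# axis=1 reorders the columns, here by the values in the last row
by_latest = cases.sort_values(by=cases.index[-1], axis=1, ascending=False)

# axis=0 reorders the rows, here by the values in the 'Canada' column
by_canada = cases.sort_values(by='Canada', axis=0, ascending=False)

print(by_latest)
print(by_canada)
```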

Pandas provides plotting features as well

For instance we can plot the columns we want with a simple method: .plot()

Remember, you can always use the help function to see what methods and features are available to you.

The description above tells us we can change the type of the chart by adding a "kind" parameter. Let's try that.

Or we can add a marker and change the figure size.
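Here is a rough sketch of these plotting variations on a hypothetical dataframe cases. The Agg backend line is only there so the code also runs outside a notebook; inside the notebook, %matplotlib inline already handles display and that line can be left out.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside the notebook
import pandas as pd

# Hypothetical cumulative case counts
cases = pd.DataFrame(
    {'Canada': [1, 3, 6, 10], 'Italy': [2, 5, 9, 14]},
    index=pd.date_range('2020-03-01', periods=4, freq='D'),
)

# Default line plot of a single column
ax_line = cases['Canada'].plot()

# The kind parameter changes the chart type
ax_bar = cases.plot(kind='bar')

# Extra keyword arguments set the marker and the figure size (in inches)
ax_marked = cases.plot(marker='o', figsize=(10, 4))
```

Each call returns a matplotlib Axes object, which you can use to adjust titles and labels afterwards.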

As you can see, the markers cannot be seen easily. It may be better to reduce our data points from daily to weekly, which means averaging each week into one row. We can use resampling to do that.

### Resampling Time-Series Dataframe
We can resample a time-series dataframe or series to any time frequency with the resample method:

df.resample()
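For example, a weekly average can be sketched as follows. The dataframe daily is hypothetical; 'W' is the weekly frequency code, which by default closes each week on Sunday.

```python
import pandas as pd

# Hypothetical daily data covering two full weeks, Monday 2020-03-02 onwards
daily = pd.DataFrame(
    {'Canada': range(1, 15)},
    index=pd.date_range('2020-03-02', periods=14, freq='D'),
)

# Group the rows into weeks and average each week into one row
weekly = daily.resample('W').mean()
print(weekly)
```

resample only defines the groups; the aggregation that follows it (mean here, but sum or max would also work) decides how each group is reduced to one row.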

Now we can plot our resampled plot. As you can see this one looks better

Now that we have weekly average data, we can calculate weekly changes.

Pandas provides a method for calculating the percentage change between consecutive rows (or columns): .pct_change

The method takes the difference between one value and the next, and divides it by the first value. The result is expressed as a fraction, so 0.5 means a 50% increase.

If both values are 0, the output will be NaN. If the first value is 0 and the next is not, the output will be infinity (inf).
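The special cases described above can be checked with a small sketch on a hypothetical dataframe weekly:

```python
import pandas as pd

# Hypothetical weekly averages
weekly = pd.DataFrame({'Canada': [0, 0, 10, 15]})

# Row-to-row change: (current - previous) / previous
changes = weekly.pct_change()
print(changes)

# Row 0 has no previous row, so it is NaN
# Row 1 is 0 -> 0, which gives NaN
# Row 2 is 0 -> 10, which gives inf
# Row 3 is 10 -> 15, which gives 0.5
```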

Let's see how many null values there are in each column.

We can use the to_frame method to convert this into a dataframe.

We can also use the dropna method to drop all the rows with NaN values.
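These three steps can be sketched together on a hypothetical dataframe changes. The column name 'null_count' is an arbitrary choice for illustration.

```python
import pandas as pd

nan = float('nan')

# Hypothetical percentage changes with some missing values
changes = pd.DataFrame({
    'Canada': [nan, 0.5, 0.2],
    'Italy': [nan, nan, 1.0],
})

# Number of null values per column (a Series indexed by column name)
null_counts = changes.isnull().sum()

# Convert the Series into a one-column dataframe
null_table = null_counts.to_frame(name='null_count')

# Keep only the rows that contain no NaN at all
clean = changes.dropna()

print(null_table)
print(clean)
```

Keep in mind that inf is not treated as a null value, so isnull and dropna leave infinite values in place.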