# Workbook 12 - Time Series
by [Dr. David Elliott](https://eldave93.netlify.app/)

1. [Workspace Setup](#setup)

The aim of this workbook is to...

# 1. Workspace Setup
Before downloading any data we should think about our workspace. It is assumed that if you have made it this far you have already set up your workspace. There are two ways of using these notebooks. The first is to use Google Colab, a website that allows you to write and execute Python code in the browser. The second is a local workspace (e.g. Anaconda).

## 1.1. Google Colab
If you are using Google Colab then you can follow the instructions below to get set up.

First, let's check whether you are actually using Google Colab.

In [1]:
try:
    import google.colab
    COLAB=True
    
    # set the week code
    WORKSHOP_NAME = "12-time-series"
except ImportError:
    COLAB=False

If using colab you will need to install the dependencies and upload the files associated with this workshop to the temporary file store.

To do this:
1. Download the workbook repository as a .zip file from GitHub (Green "Code" button, "Download ZIP"),
2. On Google Colab click the folder icon on the left panel
3. Click the page icon with the upwards arrow on it
4. From your local computer's file store, upload the .zip file (e.g. `machine-learning-workbooks-main.zip`)

__Note__ 

- Make sure to restart the runtime after installing to ensure everything works correctly.

In [2]:
if COLAB:
    import os

    # check if the environment is already setup to avoid repeating this after 
    # restarting the runtime
    if not os.path.exists("machine-learning-workbooks-main") and os.path.exists("machine-learning-workbooks-main.zip"):
        !unzip machine-learning-workbooks-main.zip
          
    print("Setting working directory to:")
    %cd ./machine-learning-workbooks-main/{WORKSHOP_NAME}

else:
    print("Colab is not being used")

Colab is not being used


As seen above, I automatically set the working directory to be a local version of the workshop repository. This is so that all the data, images, and scripts for displaying the solutions work. The repository is located on the temporary file store associated with this Colab runtime.

## 1.2. Local Workspace

If you're using a local workspace you will need all of the packages below installed to run this notebook.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sys
import os
from datetime import date

sys.path.append('../scripts') # add scripts to the path for use later

If you do not already have them, the cell below will install them for you, provided you set `AUTO_INSTALL = True`.

In [4]:
AUTO_INSTALL = False

if AUTO_INSTALL:
    !{sys.executable} -m pip install -r ../scripts/requirements.txt

## 1.3. Displaying solutions

The solutions are activated using a new .txt file which can be put in the workshop folder (e.g. `01-end_to_end`). Please put in a request for access.

If you have access to the solutions, the following cell will create clickable buttons under each exercise, which will allow you to reveal the solutions.

__Notes__

- This method was created by [Charlotte Desvages](https://charlottedesvages.com/).

In [5]:
%run ../scripts/create_widgets.py 12

Buttons created!


In [6]:
# colours for print()
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

---

# 2. Problem Understanding <a id='problem'></a>

---

### 🚩 Exercise 1

List some examples of real-world time series data.

In [7]:
%run ../scripts/show_solutions.py 12_ex1


---

We'll be using COVID data later, so first let's look at how we can prepare a dataset for use in Python.

In [None]:
UPDATE = True

if UPDATE:
    # get the latest covid-19 UK data 
    covid_eng = pd.read_csv("https://coronavirus.data.gov.uk/api/v1/data?filters=areaType=nation;areaName=England&structure=%7B%22areaType%22:%22areaType%22,%22areaName%22:%22areaName%22,%22areaCode%22:%22areaCode%22,%22date%22:%22date%22,%22newCasesBySpecimenDate%22:%22newCasesBySpecimenDate%22,%22cumCasesBySpecimenDate%22:%22cumCasesBySpecimenDate%22,%22newFirstEpisodesBySpecimenDate%22:%22newFirstEpisodesBySpecimenDate%22,%22cumFirstEpisodesBySpecimenDate%22:%22cumFirstEpisodesBySpecimenDate%22,%22newReinfectionsBySpecimenDate%22:%22newReinfectionsBySpecimenDate%22,%22cumReinfectionsBySpecimenDate%22:%22cumReinfectionsBySpecimenDate%22%7D&format=csv")
    # save to csv
    covid_eng.to_csv("./Data/covid_rates_"+str(date.today())+".csv", index=False)
    
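# get a sorted list of the data files; os.listdir returns them in arbitrary
# order, and the ISO date in each filename makes the newest sort last
onlyfiles = sorted(f for f in os.listdir("Data") if os.path.isfile(os.path.join("Data", f)))
# load the latest data
covid_eng = pd.read_csv("./Data/"+onlyfiles[-1])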
    
covid_eng.head()

This is just the COVID infection data for England, and therefore only has one `areaCode`. For the sake of this workshop we are only going to look at `newCasesBySpecimenDate`, which we will rename to `new_cases`.

In [None]:
covid = covid_eng[["date", "newCasesBySpecimenDate"]].copy()
covid.rename(mapper={"newCasesBySpecimenDate": "new_cases"},axis='columns', inplace=True)
covid.head()

Notice that `date` is currently stored as an "object", meaning each value is a string of text.

In [None]:
covid.info()

This makes it difficult to work with: how do we extract the year, for instance? How does Python know what order the dates go in when plotting them? To let Python know these are dates, we need to turn them into `datetime` objects. To do this we are going to use Python's `datetime` module, so let's take a quick detour away from our data to look at it.

## `datetime`

Python's `datetime` module is great for dealing with time-related data, and pandas builds on it with its own datetime Series and `Timestamp` objects. Below, we import the `datetime` class from the module, which we can use to create a `datetime` object by passing the components of the date as arguments.

In [None]:
from datetime import datetime

Let's just start with an example date.

In [None]:
example_date = datetime(2022, 6, 7) # 7 June 2022
example_date

The components of the date are accessible via the object's attributes.

In [None]:
print("Year", example_date.year)
print("Month",example_date.month)
print("Day", example_date.day)
print("Hour", example_date.hour)
print("Minute", example_date.minute)
print("Second", example_date.second)
print("Microsecond", example_date.microsecond)

Notice that the hour, minute, second and microsecond are all 0. This is because we didn't specify a time, but if we needed this information we could supply it.

In [None]:
example_date = datetime(2022, 6, 7, 8, 30) # 7 June 2022, 08:30

print("Year", example_date.year)
print("Month",example_date.month)
print("Day", example_date.day)
print("Hour", example_date.hour)
print("Minute", example_date.minute)
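
One practical payoff of `datetime` objects over plain strings is ordering. The sketch below uses three made-up dates (they are illustrative assumptions, not taken from the COVID data) and sorts them first as text and then as `datetime` objects; only the latter come out in chronological order.

```python
from datetime import datetime

# Illustrative day/month/year strings (not taken from the covid data)
date_strings = ["07/06/2022", "01/12/2021", "15/01/2022"]

# Sorting as text compares character by character, so the day dominates
print(sorted(date_strings))

# The same three dates as datetime objects sort chronologically
dates = [datetime(2022, 6, 7), datetime(2021, 12, 1), datetime(2022, 1, 15)]
print(sorted(dates))

# datetime objects can also be compared directly
print(datetime(2021, 12, 1) < datetime(2022, 1, 15))
```

This is the same reason plotting libraries need real datetimes to place observations correctly along a time axis.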

---

### 🚩 Exercise 2

Convert the following strings to datetime format:

- 19 October 2021, 10:13 AM

- 31 May 2022, 09:30 AM

- 25 November 2021, 15:30 PM

In [8]:
%run ../scripts/show_solutions.py 12_ex2


---

### 🚩 Exercise 3

Using the datetime objects above, print out their elements (e.g. month, hour).

In [9]:
%run ../scripts/show_solutions.py 12_ex3


---

### `timedelta()`

If we wanted to add or subtract time from a date we could use a `timedelta` object to shift a `datetime` object. An example of when you may want to do this is when time is your index and you want to get everything that happened a week before a specific observation.

In [None]:
# Import timedelta from the datetime module.
from datetime import timedelta

# Timedeltas represent time as an amount rather than as a fixed position.
offset = timedelta(days=1)

# The timedelta() has attributes that allow us to extract values from it.
print('offset days', offset.days)

`datetime`'s `.now()` method returns a `datetime` object for the current moment.

In [None]:
now = datetime.now()
print("right now:", now)

We can use `timedelta()` on the time now.

In [None]:
print("Past: ", now - offset)
print("Future: ", now + offset)

Note: The largest unit a `timedelta` stores is days (the constructor also accepts `weeks`, which it converts to days), so if you want to shift by months or years you will have to convert the shift to days yourself (see [here](https://docs.python.org/2/library/datetime.html)).
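
As a rough sketch of what this means in practice, the example below shifts an arbitrary, made-up start date by a week and by an approximate year. The last lines use pandas' `DateOffset`, which is not covered in this workbook and is shown only as one possible calendar-aware alternative.

```python
from datetime import datetime, timedelta
import pandas as pd

# An arbitrary start date chosen for illustration
start = datetime(2020, 6, 7)

# weeks are accepted by the constructor but stored as days
one_week = timedelta(weeks=1)
print(one_week.days)

# a "year" has to be approximated as a fixed number of days, which drifts
# by a day here because 29 February 2020 falls inside the interval
approx_year_ago = start - timedelta(days=365)
print(approx_year_ago)

# pandas' DateOffset understands calendar units such as years
calendar_year_ago = pd.Timestamp(start) - pd.DateOffset(years=1)
print(calendar_year_ago)
```

Whether the one-day drift matters depends on the analysis; for short shifts such as days or weeks, `timedelta` is usually all you need.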

---

### 🚩 Exercise 4

Using the datetime objects created in Exercise 2, find out the date 30 days before each of them.

In [10]:
%run ../scripts/show_solutions.py 12_ex4


---

Right, let's get back to our COVID data! The first step when working with time information is to convert it to a datetime format. The problem is that dates and times can be written in lots of different ways. For example, the following are all valid ways of writing the same datetime, and there are loads more!

- 6/1/1930 22:00

- 06/01/1930 22:00

- 01/06/1930 22:00

- 1/6/1930 22:00

- 1930-06-01 22:00:00

- 1st June 1930 22:00

- June 1st 1930 10pm

The pandas `to_datetime` function will try to figure out what format it's reading, but this can take a while and it might get it wrong.

In [None]:
%%time
# letting pandas figure out the format for itself - slow and may be wrong
pd.to_datetime(covid.date)

It is better to work out the format yourself and pass it to the function using format codes.

The codes used below are:

code | represents
--|--
%Y| year
%m | month
%d | day

In [None]:
%%time
# telling pandas the format.
# We put a - between %Y, %m and %d because the dates in our data
# are separated by hyphens.
covid.date = pd.to_datetime(covid.date, format='%Y-%m-%d')

covid.date
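
To illustrate why stating the format matters, here is a small sketch using one of the ambiguous strings listed earlier. It also uses two codes not in the table above: `%H` (hour on a 24-hour clock) and `%M` (minute). The variable names are just for illustration.

```python
import pandas as pd

# The same text can mean two different dates depending on the convention
ambiguous = "01/06/1930 22:00"

# %H is the hour (24-hour clock) and %M is the minute
day_first = pd.to_datetime(ambiguous, format="%d/%m/%Y %H:%M")
month_first = pd.to_datetime(ambiguous, format="%m/%d/%Y %H:%M")

print(day_first)    # read as 1 June 1930
print(month_first)  # read as 6 January 1930
```

Neither result raises an error, which is exactly why a wrongly guessed format can go unnoticed.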

---

### 🚩 Exercise 5

Convert the following strings to datetime format using `pd.to_datetime`:

- `"19 October 2021, 10:13 AM"`

- `"Oct 19 2021 - 10h13"`

- `"2021-10-19 10:13:29"`

__Hint__

To open the docs, which link to a web page listing the datetime format codes (under the `format` parameter), run:
```
?pd.to_datetime
```

...or google "pandas strftime format codes"

In [11]:
%run ../scripts/show_solutions.py 12_ex5


---

You can see that the column now has the `datetime64` dtype, so each of its elements has the same attributes as the `datetime` objects we created before.

In [None]:
print("Year", covid.date[0].year)
print("Month",covid.date[0].month)
print("Day", covid.date[0].day)

If we want a particular part it can be useful to define a function to do this...

In [None]:
def extract_day(dt):
    # dt is a single datetime-like value; named dt to avoid shadowing the datetime class
    return dt.day

extract_day(covid.date[0])

...we can then apply this to the `date` column.

In [None]:
covid.date.apply(extract_day)
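
For simple components like this, pandas also offers the `.dt` accessor, which works on the whole column at once without a helper function. The sketch below uses a small made-up Series rather than the COVID data.

```python
import pandas as pd

# A small illustrative Series (not the covid data)
dates = pd.Series(pd.to_datetime(["2022-06-07", "2021-11-25"], format="%Y-%m-%d"))

# the .dt accessor exposes datetime components for every row at once
print(dates.dt.day)

# it also provides helper methods, such as the name of the weekday
print(dates.dt.day_name())
```

On a large column the accessor is typically faster than `.apply`, since it avoids calling a Python function once per row.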

---

### 🚩 Exercise 6

Extract the year of each of the datetimes in the covid dataset.

In [12]:
%run ../scripts/show_solutions.py 12_ex6


---
We need `.apply` when doing something more complicated, e.g. classifying dates as "this year", "last year", or "other".

In [None]:
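def year_classifier(dt):
    # compare against the current calendar year (assumed to be the intended
    # meaning of "this year")
    this_year = date.today().year
    if dt.year == this_year:
        return 'this year'
    elif dt.year == this_year - 1:
        return 'last year'
    else:
        return 'other'

covid.date.apply(year_classifier)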
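
The same kind of labelling can also be done without `.apply`. The sketch below is one possible alternative using NumPy's `np.select`; the sample dates and the fixed `reference_year` are assumptions made so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Illustrative dates and a fixed reference year (both assumptions)
dates = pd.Series(pd.to_datetime(["2022-03-01", "2021-07-15", "2020-12-31"]))
reference_year = 2022

years = dates.dt.year

# each condition is paired with the label at the same position;
# rows matching no condition receive the default
labels = np.select(
    [years == reference_year, years == reference_year - 1],
    ["this year", "last year"],
    default="other",
)
print(labels)
```

Writing each condition as an exact equality makes it harder for one branch to accidentally swallow another, which is easy to do with chained `<` comparisons.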

## Filtering on datetimes