<div>
<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

# Time series analysis

<br/> 
<br/> 
<br/> 
<br/> 
<br/> 
<br/> 
<br/> 
<br/> 
<br/> 
<br/> 
<br/> 
<br/> 

## Review of date and time data types in Python
Full documentation here: https://docs.python.org/3/library/datetime.html

In [None]:
from datetime import datetime

timestamp = datetime(2030, 12, 31, 23, 59, 59)
timestamp

### We can look at individual components of the datetime object, such as the `year`, `month`, `day` etc.

In [None]:
timestamp.year

In [None]:
timestamp.month

In [None]:
timestamp.day

In [None]:
timestamp.hour

<br/>

### We can manually construct datetime objects as well

In [None]:
beginning_of_year = datetime(2030, 1, 1, 0, 0, 0)
end_of_year = datetime(2030, 12, 31, 23, 59, 59)

<br/>

### We can subtract datetime objects and get timedelta objects

In [None]:
delta = end_of_year - beginning_of_year
delta

The difference between two dates is a `timedelta` object.

In [None]:
type(delta)

In [None]:
dir(delta)

In [None]:
delta.seconds
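
Note that `delta.seconds` holds only the leftover seconds within the final partial day, not the length of the whole interval. A minimal sketch of the difference, reusing the two dates defined above:

```python
from datetime import datetime

beginning_of_year = datetime(2030, 1, 1, 0, 0, 0)
end_of_year = datetime(2030, 12, 31, 23, 59, 59)
delta = end_of_year - beginning_of_year

# A timedelta stores whole days and the remaining seconds separately
days_part = delta.days          # whole days in the interval
seconds_part = delta.seconds    # seconds left over after the whole days

# total_seconds() combines both parts into one number
total = delta.total_seconds()

print(days_part, seconds_part, total)
```

If you need the full duration, `total_seconds()` is usually the safer choice.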

<br/>

### Convert between `str` and `datetime`

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

#### `datetime` -> `str`

In [None]:
now = datetime.now()
now.strftime('%B %d, %Y %H:%M')

#### `str` -> `datetime`

In [None]:
valentines_day_string = '02/14/2030 06:45pm'

valentines_day = datetime.strptime(
    valentines_day_string, 
    '%m/%d/%Y %I:%M%p'
)

valentines_day

In many cases, you can auto-parse the date from a string:

In [None]:
from dateutil.parser import parse
parse(valentines_day_string)
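
Auto-parsing has to guess when a string is ambiguous. As a small illustration (the date string below is made up for this example), `dateutil` reads `03/04/2030` month-first by default, and the `dayfirst` argument flips that interpretation:

```python
from dateutil.parser import parse

# Hypothetical ambiguous date string: March 4 or April 3?
ambiguous = '03/04/2030'

month_first = parse(ambiguous)                # default: month comes first
day_first = parse(ambiguous, dayfirst=True)   # treat the first number as the day

print(month_first.date(), day_first.date())
```

When the format is known in advance, `datetime.strptime` with an explicit format string avoids the guesswork.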

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## The pandas `DatetimeIndex`

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

#### We can use datetime to index data by timestamps

In [None]:
# List of Tuesdays in January and February of 2030 (9 of them total)
dates = [
    datetime(2030, 1, 1), 
    datetime(2030, 1, 8), 
    datetime(2030, 1, 15), 
    datetime(2030, 1, 22), 
    datetime(2030, 1, 29), 
    datetime(2030, 2, 5), 
    datetime(2030, 2, 12), 
    datetime(2030, 2, 19), 
    datetime(2030, 2, 26)
]

# Example max daily temperatures (measured in Boston)
data = [34, 42, 28, 41, 29, 45, 19, 26, 33]

ts = pd.Series(data, index=dates)
ts

#### The index generated in this case is of type `DatetimeIndex`

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html

`DatetimeIndex` is a specialized index class that simplifies working with dates.

In [None]:
ts.index

#### CAUTION 

Generally, you should make sure your index is sorted, otherwise some methods might have unexpected behavior.

In [None]:
ts.sort_index(inplace=True)
ts
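
As a rough sketch of how you might check for this, the example below builds a small out-of-order series (the values are made up) and tests whether its index is sorted before and after calling `sort_index()`:

```python
import pandas as pd

# Hypothetical series whose timestamps are deliberately out of order
unsorted = pd.Series(
    [42, 34, 28],
    index=pd.to_datetime(['2030-01-08', '2030-01-01', '2030-01-15'])
)

# is_monotonic_increasing is True only when the index is already sorted
was_sorted = unsorted.index.is_monotonic_increasing

# sort_index() returns a new series ordered by timestamp
sorted_ts = unsorted.sort_index()
is_sorted_now = sorted_ts.index.is_monotonic_increasing

print(was_sorted, is_sorted_now)
```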

<br><br><br>

## How to generate a `DatetimeIndex`

Above, we created a `DatetimeIndex` manually from a list of datetime objects. 

The pandas library provides a very useful function, `date_range`, that makes creating a `DatetimeIndex` easier. 

You can read more about `date_range` here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html

In [None]:
dates = pd.date_range(
    start=datetime(2030, 1, 1), 
    periods=9, 
    freq='7D', 
    inclusive='left'
)
dates

In [None]:
ts = pd.Series(data, index=dates)
ts
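
There is more than one way to describe the same range. The sketch below, which is only an illustration, builds the nine Tuesdays three ways: from a start and a number of periods, from a start and an end, and with the weekly `W-TUE` alias:

```python
import pandas as pd

# Start + number of periods + step
by_periods = pd.date_range(start='2030-01-01', periods=9, freq='7D')

# Start + end + step (both endpoints are included by default)
by_end = pd.date_range(start='2030-01-01', end='2030-02-26', freq='7D')

# Weekly frequency anchored on Tuesdays
tuesdays = pd.date_range(start='2030-01-01', end='2030-02-28', freq='W-TUE')

print(by_periods.equals(by_end), by_periods.equals(tuesdays))
```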

<br/>
<br/>

## Frequencies

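The `freq` parameter of `date_range` is very powerful.

Some examples:
* use `ME` for month end, `h` for hour, `min` for minute, etc.
* use `5ME` for every 5th month end, `2h` for 2 hours, `2h30min` for 2 hours and 30 minutes, etc.

Older pandas versions spelled some of these aliases `M`, `H`, and `T`; those spellings are deprecated as of pandas 2.2.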

http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

In [None]:
more_dates = pd.date_range(
    start='01/01/2030', 
    periods=100, 
    freq='2h30min'
)

more_dates[:7]
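
To see what a frequency string produces, it can help to generate just a few timestamps and inspect them. A small sketch (the start dates are chosen only for illustration), using a combined alias and the business-day alias `B`:

```python
import pandas as pd

# Combined alias: one step is 2 hours and 30 minutes
every_150_min = pd.date_range(start='2030-01-01', periods=3, freq='2h30min')

# Business days only: weekends are skipped (January 4, 2030 is a Friday)
business_days = pd.date_range(start='2030-01-04', periods=3, freq='B')

print(list(every_150_min.strftime('%H:%M')))
print(list(business_days.day_name()))
```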

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## The `DatetimeIndex` and filtering

`DatetimeIndex` allows us to filter data by datetime object components.

In [None]:
import random

index = pd.date_range(
    start='12/28/2030',
    periods=100,
    freq='5h39min40s',
    inclusive='left'
)

data = [random.randint(10, 50) for x in index]

ts = pd.Series(data, index=index)
ts

In [None]:
ts.index

<br>

### Get all the data for a given year

In [None]:
ts

In [None]:
ts['2030']

In [None]:
ts['2031']

<br>

### Get all the data for a given month of a given year

In [None]:
ts['2031-01']

In [None]:
ts['2031-02']

<br>

### Get data using date ranges

In [None]:
ts['2030-12-28':'2030-12-29']

In [None]:
ts['2030-12':'2031-01']

#### Can also use datetime objects

In [None]:
ts[datetime(2030, 12, 28, 17, 0, 5):datetime(2030, 12, 29, 0, 0, 0)]
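
Because the series above is random, here is a small deterministic sketch of the same selections. It uses `.loc`, which accepts the same partial date strings, and shows that a string such as `'2031-01-01'` used as a slice endpoint covers that whole day:

```python
import pandas as pd

# Six timestamps, 12 hours apart, straddling the new year
idx = pd.date_range('2030-12-30', periods=6, freq='12h')
s = pd.Series(range(6), index=idx)

year_2030 = s.loc['2030']                        # everything in 2030
january = s.loc['2031-01']                       # everything in January 2031
two_days = s.loc['2030-12-31':'2031-01-01']      # both end days fully included

print(len(year_2030), len(january), two_days.tolist())
```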

<br>

### Get all the data with a timestamp in a specified time range

In [None]:
ts.between_time('00:00', '00:30')

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## `DatetimeIndex` attributes and methods

In [None]:
ts.head()

In [None]:
ts.index

In [None]:
dir(ts.index)

### Get the week number
* E.g. week 1, week 2, week 50, etc.

In [None]:
# `DatetimeIndex.week` was removed in pandas 2.0;
# use `isocalendar().week` instead
ts.index.isocalendar().week

### Get the month names

In [None]:
ts.index.month_name()

### Get the day of the year

In [None]:
ts.index.dayofyear

### Get the number of days in the month

In [None]:
ts.index.days_in_month

### Get the day name

In [None]:
ts.index.day_name()

### Get the hour, minute, second part of the timestamp

In [None]:
# Hour
ts.index.hour

In [None]:
# Minute
ts.index.minute

In [None]:
# Second
ts.index.second

### Get the quarter

In [None]:
ts.index

In [None]:
ts.index.quarter

**There are many, many more methods and attributes.** You can explore them in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html
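
Since `ts` holds random data, a quick way to sanity-check these attributes is to try them on dates you can verify by hand. A minimal sketch with two fixed dates, chosen only for illustration:

```python
import pandas as pd

idx = pd.to_datetime(['2030-01-01', '2030-07-04'])

day_names = list(idx.day_name())           # name of the weekday
quarters = list(idx.quarter)               # calendar quarter (1-4)
days_of_year = list(idx.dayofyear)         # 1 for January 1st
iso_weeks = list(idx.isocalendar().week)   # ISO 8601 week number

print(day_names, quarters, days_of_year, iso_weeks)
```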

### You can use these methods and attributes to create new DataFrame columns

In [None]:
data = pd.DataFrame(ts, columns=['Values'])
data.head()

In [None]:
data['Year'] = data.index.year
data['Day Name'] = data.index.day_name()
data.head(7)

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Set a `DatetimeIndex` in a DataFrame

### Option 1: parse the timestamp column as a date and designate it as the index column when reading the file

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/random_timeseries.csv', 
    parse_dates=['Timestamp'],
    index_col=['Timestamp']
)

data.head()

In [None]:
data.index

<br><br>

### Option 2: Use `set_index()` to designate a column of type `datetime` as the index

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/random_timeseries.csv', 
    parse_dates=['Timestamp']
)

data.head()

In [None]:
data.info()

In [None]:
data.set_index(['Timestamp'], inplace=True)

In [None]:
data.head()

In [None]:
data.index

<br><br>

### Option 3: use the `to_datetime()` function to convert a column of type `object` to `datetime`, and then set it as the index

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/random_timeseries.csv' 
)

data.head()

In [None]:
data.info()

In [None]:
pd.to_datetime(data['Timestamp'])

In [None]:
data['Timestamp'] = pd.to_datetime(data['Timestamp'])

In [None]:
data.info()

In [None]:
data.set_index('Timestamp', inplace=True)

In [None]:
data.head()

In [None]:
data.index
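
Real files sometimes contain entries that cannot be parsed. The sketch below uses a small made-up DataFrame (the `Value` column and the bad entry are assumptions for this example, not the contents of the CSV above) to show how an explicit `format` and `errors='coerce'` can be combined: unparseable entries become `NaT` and can then be dropped.

```python
import pandas as pd

# Hypothetical raw data with one unparseable timestamp
raw = pd.DataFrame({
    'Timestamp': ['2030-01-10 15:45:00', '2030-01-10 15:46:00', 'not a date'],
    'Value': [5, 6, 0]
})

# An explicit format avoids guessing; errors='coerce' turns bad entries into NaT
raw['Timestamp'] = pd.to_datetime(
    raw['Timestamp'],
    format='%Y-%m-%d %H:%M:%S',
    errors='coerce'
)

# Drop rows without a valid timestamp, then use the column as the index
clean = raw.dropna(subset=['Timestamp']).set_index('Timestamp')
print(clean)
```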

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Timezones

Working with local timezones can be a real pain, especially when dealing with daylight saving time transitions. That is why it is generally easier to convert your timestamps to UTC.

Every timezone is defined as an offset from UTC, and that offset can change over the year (for example, when daylight saving time starts or ends).

In [None]:
import pytz

# Some examples of timezones
pytz.common_timezones[:10]

### Localize your timezone

Let's assume we know that this data was recorded in UTC. We can localize it (as in, add the timezone, which in this case is UTC).

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/random_timeseries.csv', 
    parse_dates=['Timestamp'],
    index_col=['Timestamp']
)

data.head()

In [None]:
data_utc = data.tz_localize('UTC')
data_utc.head()

### Convert to another timezone

In [None]:
data_utc.tz_convert('US/Pacific').head()
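
To make the effect of daylight saving time visible, here is a small sketch with two made-up timestamps, one in winter and one in summer. It assumes the current US daylight saving rules still apply in 2030, which is what the timezone database projects:

```python
import pandas as pd

# Two naive timestamps at the same clock time, in January and in July
naive = pd.Series(
    [1, 2],
    index=pd.to_datetime(['2030-01-10 15:45', '2030-07-10 15:45'])
)

# Attach the UTC timezone, then convert to Pacific time
utc = naive.tz_localize('UTC')
pacific = utc.tz_convert('US/Pacific')

# The same UTC clock time maps to different local hours in winter and summer
local_hours = list(pacific.index.hour)
print(local_hours)
```

Converting changes only how the instant is displayed: `utc.index[0]` and `pacific.index[0]` still compare as equal.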

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Filter data in a DataFrame using the `DatetimeIndex`

In [None]:
data.head()

### Remember that you can always filter using a date, or parts of a date

In [None]:
data.loc['2030-01-10 15:45':'2030-01-10 15:49']

<br>

### You can also filter data by hour / minute / second only

* this is useful if you have events that happen across days, but in a fixed time interval

In [None]:
data.between_time('15:48', '16:00')
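
As an illustration of how `between_time` ignores the date part, the sketch below uses made-up hourly data spanning two days. Both endpoints are included by default, so each day contributes two rows:

```python
import pandas as pd

# Hypothetical hourly data covering two full days
idx = pd.date_range('2030-01-10 00:00', periods=48, freq='h')
df = pd.DataFrame({'Value': range(48)}, index=idx)

# Rows between 09:00 and 10:00 on every day in the index
morning = df.between_time('09:00', '10:00')
print(morning)
```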

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Resampling

What do we do when data is not spaced at regular time intervals and we need to work with fixed frequencies?

We can resample data!

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html

<br/>
<br/>

### downsampling = going from higher frequency to lower frequency

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/downsampling.png" width="250"/>

<br><br>

### upsampling = going from lower frequency to higher frequency

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/upsampling.png?" width="250"/>

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Downsampling

**going from higher frequency to lower frequency**

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/downsampling.png" width="250"/>

<br>

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/random_timeseries.csv', 
    parse_dates=['Timestamp'],
    index_col=['Timestamp']
)

Our data is collected at 1-minute intervals:

In [None]:
data.head()

<br>

Below, we're resampling to 10 min intervals, taking the mean across each period.

In [None]:
data.resample('10min')

In [None]:
data.resample('10min').mean()

In [None]:
# For example, the values for the first period are
# 5, 6, 0, 7, 0, with a mean of 3.6

data['2030-01-10 15:40:00':'2030-01-10 15:49:59']
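
The binning can be easier to follow on a tiny example. The sketch below uses synthetic values (the first five mirror the comment above, the rest are made up) and assumes the series starts at 15:45, so the first 10-minute bin, which begins at 15:40, contains only five rows:

```python
import pandas as pd

# Synthetic 1-minute data starting at 15:45
idx = pd.date_range('2030-01-10 15:45', periods=10, freq='1min')
df = pd.DataFrame({'Value': [5, 6, 0, 7, 0, 1, 2, 3, 4, 5]}, index=idx)

# Bins are aligned to round 10-minute marks: [15:40, 15:50), [15:50, 16:00)
means = df.resample('10min').mean()
counts = df.resample('10min').count()

print(means)
print(counts)
```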

<br>

### We can specify multiple aggregation functions

In [None]:
data.resample('10min')

In [None]:
dir(data.resample('10min'))

In [None]:
data.resample('10min').agg(['sum', 'mean'])

<br><br>

### We can specify where we want the label (left or right)

In [None]:
# Label on the left. 
# The aggregated value has the timestamp
# of the START of the interval.

data.resample('10min', label='left').sum().head()

In [None]:
# Label on the right. 
# The aggregated value has the timestamp
# of the END of the interval.

data.resample('10min', label='right').sum().head()
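
A minimal sketch on synthetic data (the values are made up) shows that `label` changes only the timestamps attached to each bin, not the aggregated values:

```python
import pandas as pd

# Synthetic 1-minute data starting at 15:45
idx = pd.date_range('2030-01-10 15:45', periods=10, freq='1min')
df = pd.DataFrame({'Value': [5, 6, 0, 7, 0, 1, 2, 3, 4, 5]}, index=idx)

left = df.resample('10min', label='left').sum()
right = df.resample('10min', label='right').sum()

# Same sums, different timestamps
print(list(left.index.strftime('%H:%M')), left['Value'].tolist())
print(list(right.index.strftime('%H:%M')), right['Value'].tolist())
```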

<br><br>

### We can use any aggregation function from numpy, or our own function

In [None]:
def positive_product(values):
    result = 1
    
    for v in values:
        if v > 0:
            result *= v
            
    return result

In [None]:
data.head(15)

In [None]:
data.resample('5min').agg(positive_product).head()

<br><br>

### Frequency strings

* remember, we have lots of options for specifying the frequency
<br>
* https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

In [None]:
# Resample to quarter end

data.resample('QE').sum()

<br>

In [None]:
# Resample to 1 hour

data.resample('1h').sum()

<br>

In [None]:
# Resample to calendar month end

data.resample('ME').sum()

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Upsampling

**upsampling = going from lower frequency to higher frequency.**

<img src="https://edlitera-images.s3.us-east-1.amazonaws.com/upsampling.png?" width="250"/>

In [None]:
data.head()

In [None]:
data.resample('s')

In [None]:
# asfreq converts the time series to the specified frequency
# (which is 1 second in this case)

data.resample('s').asfreq()

<br><br>

### Fill in missing data while upsampling

You can:

* forward fill (propagate the last value forward) - `ffill`
* backward fill - `bfill`
* interpolate
* etc.

In [None]:
data.resample('s').asfreq().head()

In [None]:
data.resample('s').ffill().head()

In [None]:
data.resample('s').ffill(limit=2).head()

In [None]:
data.resample('s').bfill().head()

In [None]:
data.head()

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html

In [None]:
data.resample('s').interpolate(method='linear').head(65)
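
The fill strategies are easiest to compare side by side on a tiny series. The sketch below uses two made-up observations one minute apart and upsamples them to 20-second steps:

```python
import pandas as pd

# Two synthetic observations, one minute apart
idx = pd.to_datetime(['2030-01-10 15:45:00', '2030-01-10 15:46:00'])
s = pd.Series([0.0, 6.0], index=idx)

upsampled = s.resample('20s')

gaps = upsampled.asfreq()                            # new rows are NaN
forward = upsampled.ffill()                          # repeat the last known value
forward_limited = upsampled.ffill(limit=1)           # fill at most one row per gap
backward = upsampled.bfill()                         # use the next known value
linear = upsampled.interpolate(method='linear')      # straight line between points

print(pd.DataFrame({
    'asfreq': gaps,
    'ffill': forward,
    'ffill(limit=1)': forward_limited,
    'bfill': backward,
    'interpolate': linear
}))
```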

<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>
<br/>

## Moving window functions

* when doing timeseries analysis, we often have to perform statistics over a sliding (or rolling, or moving) window of data

* useful for 'smoothing out' graphs, for example

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/random_timeseries.csv', 
    parse_dates=['Timestamp'],
    index_col=['Timestamp']
)

data.head(5)

<br>

**We can achieve that using the `pandas` rolling objects**

In [None]:
data.rolling(3, center=True)

In [None]:
dir(data.rolling(3, center=True))

In [None]:
data.rolling(3, center=True).mean()

<br>

### What's going on here?

* we take all the rows and group them in buckets of 3, as follows:
  * rows 1-3 are in bucket 1 (first window)
  * rows 2-4 are in bucket 2 (second window)
  * rows 3-5 are in bucket 3 (third window)
  * ... 
  
**This is a 'rolling window' (or a 'moving window', or a 'sliding window')**

* for each bucket, we compute an aggregation (e.g. **mean**)

* we specify that this **mean** should have the timestamp of the **center** of the window interval (`center=True`)
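
The bucketing described above can be checked on a tiny made-up series. This sketch compares a centered window with the default trailing window, where each result is labeled with the last row of its window:

```python
import pandas as pd

# Five synthetic daily values
s = pd.Series(
    [1, 2, 3, 4, 5],
    index=pd.date_range('2030-01-01', periods=5, freq='D')
)

# Centered: each result is the mean of the previous, current, and next row
centered = s.rolling(3, center=True).mean()

# Trailing (default): each result is the mean of the current row and the two before it
trailing = s.rolling(3).mean()

print(pd.DataFrame({'value': s, 'centered': centered, 'trailing': trailing}))
```

Rows without a full window of 3 values are `NaN`: the first and last row when centered, the first two rows when trailing.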

#### Documentation

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html

<br><br><br>

<br><br>

<br><br><br><br><br><br>

## Another moving window example

* we often want to decompose time series into components that can identify trends
<br><br>
* the goal is to identify patterns, if any exist
<br><br>
* to that end, we want to ignore occasional fluctuations
<br><br>
* this is again a situation where moving window functions (e.g. moving averages) can be helpful

In [None]:
data = pd.read_csv(
    'https://edlitera-datasets.s3.amazonaws.com/s&p500.csv', 
    index_col='Date', 
    parse_dates=['Date']
)

data.head()

In [None]:
data_recent = data['1990-01-01':]
data_recent.head()

<br><br>

Let's plot this data!

In [None]:
import matplotlib

%matplotlib inline
import seaborn; seaborn.set()

In [None]:
data_recent.plot(y='SP500', rot=45)

In [None]:
(
    data_recent
    .rolling(14, center=True)
    .mean()
    .plot(y='SP500', rot=45)
)

### You can also use the time as boundary for the rolling window

* this requires a `DatetimeIndex` to work

In [None]:
(
    data_recent
    .rolling('420D')
    .mean()
    .plot(y='SP500', rot=45)
)
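
A time-based window differs from a fixed-size window when the timestamps are irregular. The sketch below uses made-up data with a gap in the middle: the `'2D'` window looks back two days from each timestamp, however many rows that covers, while `rolling(2)` always uses exactly two rows:

```python
import pandas as pd

# Synthetic data with a gap between January 2nd and January 5th
idx = pd.to_datetime(['2030-01-01', '2030-01-02', '2030-01-05', '2030-01-06'])
s = pd.Series([1, 2, 3, 4], index=idx)

# Window defined by time: everything in the 2 days up to and including each timestamp
by_time = s.rolling('2D').sum()

# Window defined by row count: always the current row and the one before it
by_count = s.rolling(2).sum()

print(pd.DataFrame({'value': s, "rolling('2D')": by_time, 'rolling(2)': by_count}))
```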

<br/>
<br/>

### Now let's do some comparisons

Let's say we want to visualize and compare the S&P 500 growth between 1890 and 1910 with the growth between 1990 and 2010.

In [None]:
data.head()

In [None]:
# S&P 500 between 1890 and 1910

d1 = data.loc['1890-01-01':'1910-01-01', ['SP500']]
d1['Year / Month'] = d1.index.astype('str').str[2:10]
d1.head()

In [None]:
# S&P 500 between 1990 and 2010

d2 = data.loc['1990-01-01':'2010-01-01', ['SP500']]
d2['Year / Month'] = d2.index.astype('str').str[2:10]
d2.head()

In [None]:
d = pd.merge(
    d1, d2, 
    left_on='Year / Month', right_on='Year / Month'
)

d.head()

In [None]:
d.rename(columns={
    'SP500_x': 'SP500 1890-1910',
    'SP500_y': 'SP500 1990-2010'
}, inplace=True)
d.head()

In [None]:
d.plot(
    x='Year / Month', 
    y=['SP500 1890-1910', 'SP500 1990-2010'], 
    rot=45
)

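Not much of a comparison... Plotted on a shared axis, raw index levels from periods a century apart are hard to compare directly.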
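
One possible way to put the two periods on a common scale is to rebase each series so that its first observation equals 100, and then compare relative growth. The sketch below is only an illustration: the numbers are made up, and `rebase` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Made-up values standing in for the two periods
early = pd.Series(
    [5.0, 5.5, 6.0],
    index=pd.to_datetime(['1890-01-01', '1890-02-01', '1890-03-01'])
)
late = pd.Series(
    [340.0, 323.0, 357.0],
    index=pd.to_datetime(['1990-01-01', '1990-02-01', '1990-03-01'])
)

def rebase(series, base=100.0):
    """Hypothetical helper: scale a series so its first value equals `base`."""
    return series / series.iloc[0] * base

early_rebased = rebase(early)
late_rebased = rebase(late)

print(early_rebased.round(1).tolist())
print(late_rebased.round(1).tolist())
```

After rebasing, both series start at 100, so plotting them together shows relative growth rather than absolute price levels.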