# <font color='#eb3483'> Data Wrangling - Time Series </font>
In this module we explore a special type of data: time series (a.k.a. data tied to times or dates). Time series data is pretty ubiquitous. Think of the stock market, weather data, or even your own bank statements - it's all data tied to specific dates and times. Dates and times deserve some extra special attention during your data wrangling process because they have some unique properties. Like numeric data they have a natural ordering (e.g. 3pm is after 2pm), but they also have additional structure (e.g. for a given time we have an hour, a day of the week, a year, a zodiac sign...etc.).

This notebook first looks at how python stores date/time data, and then dives deep into some cool functionality pandas has to play around with this data. 



## <font color='#eb3483'> Datetime</font>
We'll start by looking at python's basic way to deal with dates - the datetime object. We can load datetime functionality from the `datetime` module (part of python's standard library), and create a new datetime variable.

In [2]:
from datetime import datetime, date #importing the datetime type from the datetime package (I know ... it's confusing!)

#Let's make a date for July 1st 2020 (Canada day!)
canada_day = datetime(year=2020, month =7, day=1)
canada_day

datetime.datetime(2020, 7, 1, 0, 0)

Notice that our datetime object has stored our year, month and day. We can retrieve those values as attributes, using the same dot notation we've used in pandas.

In [None]:
canada_day.year

We can also specify a time for our date.

In [None]:
canada_day = datetime(year=2020, month =7, day=1, hour =13, minute = 30)
canada_day.hour

We can use the strftime (STRing Format TIME) method included with our datetime objects to print out our date as a nice string. We specify what we want our output string to look like by using special 'directives' (`%` followed by a letter). Think of directives as instructions (i.e. `%d` says put the day here), you can check out the plethora of formatting options [here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

In [None]:
#This is saying I want the string version of our date with
#%B - the month (full name)
#%d - the day (number)
#%Y - the year
#I've also thrown in a comma to make it look snazzy
canada_day.strftime('%B %d, %Y')

In [None]:
#We can also go classic and do the American slash M/D/Y format
canada_day.strftime('%m/%d/%Y')
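Going the other way (string to datetime) uses the same directives. As a minimal sketch, `datetime.strptime` (STRing Parse TIME) takes a string plus the format it is written in. The variable names below are just for illustration.

```python
from datetime import datetime

# Format a datetime as a string, then parse that string back again
canada_day = datetime(year=2020, month=7, day=1)
as_text = canada_day.strftime('%B %d, %Y')              # 'July 01, 2020'
parsed_back = datetime.strptime(as_text, '%B %d, %Y')   # same directives, other direction

print(as_text)
print(parsed_back == canada_day)                        # True
```

If the string doesn't match the format you give it, `strptime` raises a `ValueError` rather than guessing.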

Datetime also has some handy functions that conveniently grab the current date/time.

In [None]:
print('Right now:', datetime.now())
print("Today's date:", date.today())

Note that we're using the date object from datetime in the last line. Same idea as datetime but no time (can be handy when you only care about the day and not specific time). One of the cool things about working with dates in datetime objects is we can add and subtract dates - let's see how far Canada day is from fourth of july.

In [None]:
#Let's create a new datetime - the fourth of july for the yankees out there
fourth_of_july = datetime(year=2020, month =7, day=4)

#How far is fourth of july from canada day?
holiday_difference = fourth_of_july - canada_day
print('How far after canada day is fourth of july?', holiday_difference)
print('Type of subtracted dates:', type(holiday_difference))

You'll notice that when we add/subtract dates we get a new type of object: timedelta. It's what you would expect - a variable that stores the length of time between datetimes. We can even use dot notation to get the number of days in our timedelta.

In [None]:
holiday_difference.days
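Here is a small sketch of a couple of other things a timedelta can do, assuming the same two holidays as above (with Canada day set to 1:30pm). Note that `.days` only counts whole days, while `total_seconds()` covers the full gap.

```python
from datetime import datetime, timedelta

canada_day = datetime(year=2020, month=7, day=1, hour=13, minute=30)
fourth_of_july = datetime(year=2020, month=7, day=4)

gap = fourth_of_july - canada_day
print(gap.days)              # whole days only: 2
print(gap.total_seconds())   # the full gap in seconds: 210600.0

# Adding a timedelta to a datetime gives us a new datetime
one_week_later = canada_day + timedelta(weeks=1)
print(one_week_later)        # 2020-07-08 13:30:00
```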

### <font color='#eb3483'> Quick Knowledge Check</font>
1. Create a new datetime object for your next birthday. Print it out as a string with your choice of formatting (play around with different options!)

In [3]:
#Here's my birthday (no pressure to send gifts or a card)
connor_bday = datetime(year=2020, month =10, day=6)
connor_bday

datetime.datetime(2020, 10, 6, 0, 0)

2. Calculate how many days away your birthday is from the current date.

In [5]:
#Ugh so far away!
connor_bday - datetime.now()

datetime.timedelta(days=91, seconds=14234, microseconds=989977)

## <font color='#eb3483'> Datetimes in Pandas </font>
Datetime objects are great for individual dates (and provide a lot of flexibility/ease of use), but don't scale well to vectors of dates (i.e. columns in a dataframe). For that it's time to turn to our favorite coding bears - pandas! 

In [7]:
import pandas as pd

### <font color='#eb3483'>Timestamps  </font>

The timestamp is the most basic form of time series data that Pandas has. It does exactly what the name describes: marks the exact moment in which the data was collected. 

While kaggle datasets and other online challenges normally come as clean "hourly" or "daily" datasets, timestamps are how most data is collected in the wild! 

An event happens, and the time of the event is dumped into a database. 

One example of this would be... bitcoin! Now, whatever you may think about bitcoin, it is an excellent source of high-granularity data. Let's dive in! 

In [8]:
data = pd.read_csv('./data/bitcoin.csv')

In [18]:
data.head()

| | Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01 00:00:00 | 973.37 | 973.37 | 973.35 | 973.35 | 2.122048 | 2065.524303 | 973.363509 |
| 1 | 2017-01-01 00:01:00 | 973.37 | 973.37 | 973.35 | 973.35 | 2.122048 | 2065.524303 | 973.363509 |
| 2 | 2017-01-01 00:02:00 | 973.37 | 973.37 | 973.35 | 973.35 | 2.122048 | 2065.524303 | 973.363509 |
| 3 | 2017-01-01 00:03:00 | 973.36 | 973.36 | 973.36 | 973.36 | 0.04 | 38.9344 | 973.36 |
| 4 | 2017-01-01 00:04:00 | 973.36 | 973.4 | 973.36 | 973.39 | 5.4588 | 5313.529708 | 973.387871 |


In [None]:
data.tail()

Interesting. We have this `Timestamp` column, that we can kind of parse by looking at it. 

In [None]:
data.Timestamp.head()

We can kind of understand this. Looks like Year, month, and day, then hours, minutes, then seconds ...  

Let's inspect a random row: 

In [None]:
print('One of the times in our dataset:     %s' % data.Timestamp.iloc[3])
print('Type of the Series (data.Timestamp): %s' % data.Timestamp.dtype)
print('Type of a particular time:           %s' % type(data.Timestamp.iloc[3]))

We can use `pd.to_datetime` to parse the timestamps

In [None]:
time_as_a_timestamp = pd.to_datetime(data.Timestamp, infer_datetime_format=True)

What is it now? 

In [None]:
time_as_a_timestamp.head(2)

Now the column has the `datetime64[ns]` dtype! That means the column holds timestamps (with precision in nanoseconds).

Now we can compute statistics with it!

In [None]:
time_as_a_timestamp.min()

In [None]:
time_as_a_timestamp.max()

Now we can extract days, months etcetera:

In [None]:
time_as_a_timestamp.dt.day.head(5)

Because the column is a timestamp dtype, it has the `.dt` accessor with all of the timestamp related functions. Since pandas was created for stock trading data (which are time series), [there are a lot of timestamp specific properties!](https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties)

Let's make a toy dataset so that we can see some of the results side by side

In [None]:
new = pd.DataFrame()
new['date'] = time_as_a_timestamp
new['day'] = new['date'].dt.day
new['month'] = new['date'].dt.month
new['year'] = new['date'].dt.year
new['hour'] = new['date'].dt.hour
new['minute'] = new['date'].dt.minute
new['second'] = new['date'].dt.second
new['day of the week'] = new['date'].dt.weekday
new['quarter'] = new['date'].dt.quarter
new['is it a leap year?'] = new['date'].dt.is_leap_year

new.head(2)

Pandas... is amazing. 
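If you don't have the bitcoin file handy, the same idea can be sketched on a tiny made-up series (the three timestamps below are invented for illustration). It shows the dtype change after parsing, plus a few `.dt` properties working on the whole column at once.

```python
import pandas as pd

# A tiny made-up series of timestamp strings
raw = pd.Series(['2017-01-01 00:00:00', '2018-02-28 13:45:10', '2020-12-31 23:59:59'])
parsed = pd.to_datetime(raw)

print(raw.dtype)      # plain strings before parsing
print(parsed.dtype)   # a datetime64 dtype after parsing

# The .dt accessor applies to every row at once
print(parsed.dt.year.tolist())           # [2017, 2018, 2020]
print(parsed.dt.day_name().tolist())     # ['Sunday', 'Wednesday', 'Thursday']
print(parsed.dt.is_leap_year.tolist())   # [False, False, True]
```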

### <font color='#eb3483'> Different date formats  </font>

Now you may be thinking _"hang on, was that just because the strings were exactly in the way Pandas likes them?"_

It's a fair question, and the answer is no. Pandas' [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) has an `infer_datetime_format` argument which is amazingly good, and can for the most part figure out what you need from it. (In pandas 2.0 and later this argument is deprecated: format inference is the default behaviour, so passing it has no effect.)

Let's put it to the test: 

In [None]:
# little function to sanity check our dates
def sanity_check(dates):
    # go ahead Pandas, guess my date format! 
    inferred_dates = pd.to_datetime(dates, infer_datetime_format=True)
    
    # Print out the results 
    print('Our first day is   5,    and was inferred as %0.0f' % inferred_dates.iloc[0].day)
    print('Our first month is 4,    and was inferred as %0.0f' % inferred_dates.iloc[0].month)
    print('Our first year is  2007, and was inferred as %0.0f' % inferred_dates.iloc[0].year)

Let's start with an easy one 

In [None]:
american_dates = pd.Series(['04/05/2007',  # <-- April 5th, 2007
                            '04/13/2006', 
                            '12/27/2014'])

sanity_check(american_dates)

Can we separate them with hyphens? 

In [None]:
hyphen_separated_dates = pd.Series(['04-05-2007',  # <-- April 5th, 2007
                            '04-13-2006', 
                            '12-27-2014'])

sanity_check(hyphen_separated_dates)

Let's write the year in a weird way

In [None]:
short_year = pd.Series(['04-05-07',  # <-- April 5th, 2007
                        '04-13-06', 
                        '12-27-14'])

sanity_check(short_year)

Eh... English? 

In [None]:
dates_in_english = pd.Series(['April 5th, 2007',  # <-- April 5th, 2007
                            'April 13th, 2006', 
                            'December 27th, 2014'])

sanity_check(dates_in_english)

Wow! So, European dates should be easy... right? 

In [None]:
european_dates = pd.Series(['05/04/2007',   # <-- April 5th, 2007
                            '13/04/2006', 
                            '27/12/2014'])

sanity_check(european_dates)

Wait... what? It got the day and month mixed up! 

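It turns out Pandas can infer lots of things, but Europe isn't its strength. Even though the second and third lines clearly indicate that the month is in the middle (13 can't be a month), it still gets confused and reads the first line month-first. (Newer pandas versions are stricter about mixed formats and may raise an error here instead of guessing.)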

And here is where line 2 of [The Zen of Python](https://www.python.org/dev/peps/pep-0020/#id3) comes in:
> Explicit is better than implicit 

In [None]:
inferred_dates = pd.to_datetime(european_dates, 
                                dayfirst=True)  # <--- explicit! 

In [None]:
print('Our first day is   5,    and was inferred as %0.0f' % inferred_dates.iloc[0].day)
print('Our first month is 4,    and was inferred as %0.0f' % inferred_dates.iloc[0].month)
print('Our first year is  2007, and was inferred as %0.0f' % inferred_dates.iloc[0].year)
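To see the ambiguity in isolation, here is a minimal sketch with a single string (no series needed). By default pandas reads an ambiguous date month-first; `dayfirst=True` flips that.

```python
import pandas as pd

ambiguous = '05/04/2007'

month_first = pd.to_datetime(ambiguous)                 # default: read as May 4th
day_first = pd.to_datetime(ambiguous, dayfirst=True)    # explicit: read as April 5th

print(month_first.month, month_first.day)   # 5 4
print(day_first.month, day_first.day)       # 4 5
```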

By being explicit, we can parse arbitrarily crazy dates, following python [date string formatting syntax](http://strftime.org/):

In [None]:
dates_in_quackland = pd.Series(['05_quack_2007$04',   # <-- April 5th, 2007, in quack_timesystem
                                '13_quack_2006$04',    
                                '27_quack_2014$12'])

inferred_dates = pd.to_datetime(dates_in_quackland, 
                                format='%d_quack_%Y$%m')  # <--- %d is day, %m is month, %Y is 4 digit year

print('Our first day is   5,    and was inferred as %0.0f' % inferred_dates.iloc[0].day)
print('Our first month is 4,    and was inferred as %0.0f' % inferred_dates.iloc[0].month)
print('Our first year is  2007, and was inferred as %0.0f' % inferred_dates.iloc[0].year)
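Real world columns often contain a few rows that don't match the format at all. One way to handle that (a sketch on made-up data, not the only option) is to combine an explicit `format` with `errors='coerce'`, which turns anything unparseable into `NaT` (pandas' "not a time" missing value) instead of raising an error.

```python
import pandas as pd

# Two valid European dates and one piece of junk (all made up)
messy_dates = pd.Series(['05/04/2007', '13/04/2006', 'not a date'])

parsed = pd.to_datetime(messy_dates, format='%d/%m/%Y', errors='coerce')

print(parsed)
print(parsed.isna().tolist())   # [False, False, True]
```

You can then inspect or drop the `NaT` rows, which is usually safer than letting pandas guess.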

### <font color='#eb3483'> Quick Knowledge Check</font>
1. Time to practice converting some dates! Write code that converts the following series into datetime columns.

In [11]:
canadian_dates = pd.Series(['July 24, 2020 eh!',
                           'October 6, 2020 eh!',
                           'January 3, 2019 eh!',])

#Convert the canadian dates series here 
pd.to_datetime(canadian_dates, format='%B %d, %Y eh!')

0   2020-07-24
1   2020-10-06
2   2019-01-03
dtype: datetime64[ns]

In [16]:
#Note that the first number after the year is the month (you can tell because in the first row the second one is >12)
farm_dates = pd.Series(['Oink 2020 Moo 12 Baa 18 Cluck 14:00',
                        'Oink 2020 Moo 2 Baa 3 Cluck 1:00',
                        'Oink 2004 Moo 7 Baa 9 Cluck 21:00'])
#Convert the farm dates series here 
pd.to_datetime(farm_dates, format='Oink %Y Moo %m Baa %d Cluck %H:%M')

0   2020-12-18 14:00:00
1   2020-02-03 01:00:00
2   2004-07-09 21:00:00
dtype: datetime64[ns]

### <font color='#eb3483'>Datetime Indices </font>
Where pandas really shines is when we set datetime data as our index (it's generally good practice to do this when you have time series data, for reasons that will become apparent soon). So let's start by setting our timestamp column as our index.

In [19]:
data.Timestamp = pd.to_datetime(data.Timestamp, infer_datetime_format=True)

data = data.set_index('Timestamp',    # <---- Set the index to be our timestamp data  
                      drop=True)      # <---- drop the original column

In [None]:
#Let's take a peek to make sure we did this right
data.head()

In [None]:
#We can also sort our dataframe by the time index (good practice for time series data!)
data = data.sort_index()
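As a side note, the parsing and the index setting can also happen while the file is being read. The sketch below uses a small in-memory CSV (made-up rows, read through `io.StringIO` so it runs without the bitcoin file) together with `read_csv`'s `parse_dates` and `index_col` arguments.

```python
import io
import pandas as pd

# A few made-up rows, deliberately out of order
csv_text = """Timestamp,Open,Close
2017-01-01 00:01:00,973.37,973.35
2017-01-01 00:00:00,973.37,973.35
2017-01-01 00:02:00,973.36,973.39
"""

prices = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['Timestamp'],   # parse this column as datetimes
                     index_col='Timestamp')       # and use it as the index
prices = prices.sort_index()

print(type(prices.index).__name__)             # DatetimeIndex
print(prices.index.is_monotonic_increasing)    # True
```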

Now that we have our data with the timeseries index we can do some really cool indexing (pandas is ... amazing!)

In [None]:
#Let's get all the data for Jan 17th
data.loc['Jan 17th 2018'].head()   # <--- wait, you can do that???

In [None]:
#Or how about all the January data?
data.loc['Jan 2018'].head()

In [None]:
#We can even look at data between dates
data.loc['01/15/2018':'01/22/2018']  # <--- remember, American dates are less error prone in Pandas 

Essentially we can slice our data by using dates, and pandas even lets us use different date formats. The beauty of this is that it seems perfectly natural (of course we should be able to just pull all of January's data without fancy index conditions), but anyone coming from a different coding language will realize this is bonkers crazy!
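Here is the same kind of slicing on a small made-up daily series, so you can check the row counts by hand. One thing worth noticing: unlike regular python slices, date slices with `.loc` include both end points.

```python
import pandas as pd

# One made-up value per day, from Dec 30th 2017 to Feb 2nd 2018
idx = pd.date_range('2017-12-30', '2018-02-02', freq='D')
toy = pd.Series(range(len(idx)), index=idx)

january = toy.loc['2018-01']                     # every row in January 2018
one_week = toy.loc['2018-01-15':'2018-01-22']    # a slice between two dates

print(len(january))    # 31
print(len(one_week))   # 8 (both end points are included)
```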

### <font color='#eb3483'>Resampling Data </font>
Sometimes we might get data at a really granular level (e.g. microsecond) and want to take a step back and look at a larger time frequency (e.g. days). Let's think about some of our bitcoin data fields. The price on Jan 17th, at 3h00m00s makes sense (since it's an event, something that happened). But the volume "in that moment"? It's a bit nonsensical (you don't have a number of transactions in a split second, *you have them over a period*). Some datasets (this one probably included) will treat data as being "since the last timestamp", but real world data may not be so forgiving. 

Counting using timestamps is like asking _how many people went into McDonald's at an exact moment_. Probably none. It doesn't tell us much. We'd rather think in people "per minute", or "per hour". To _resample_ our data at a different time frequency, we can use the [resample](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) function.

Let's start by looking at our bitcoin data in 5 minute intervals. All we have to do is call the resample method on our series and specify the interval (5 min).

In [20]:
data['Volume_(Currency)'].resample('5 min')

<pandas.core.resample.DatetimeIndexResampler object at 0x7f84795e0940>

Hmm what are we getting back - that doesn't look like numbers! It's actually a new "resampler" object (very similar to 'groupby' objects in pandas), which is just a series with some extra information about how to apply functions to it (i.e. when we apply sum it'll apply it to 5 minute time intervals of our dataset). Which means, to actually get some numbers we need to specify how we're going to map our 5 minute interval data to one number. Let's use sum!

In [None]:
data['Volume_(Currency)'].resample('5 min').sum()

Boo ya - now we have the total volume (currency) traded in 5 minute time buckets. We could have also chosen other aggregation functions (like max, mean, min...etc.) - try it out yourself!

We can specify our resampling windows using special characters just like our string formatting (check-out the full list of frequency code names [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases)). For example, let's look at the max volume in every 2 week interval.

In [None]:
data['Volume_(Currency)'].resample('2W').max().head()
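To make the bucketing concrete, here is a sketch on made-up data that is small enough to check by hand: 12 minutes of data with a volume of 1 in each minute, resampled into 5 minute buckets.

```python
import pandas as pd

# 12 made-up rows, one per minute, each with a volume of 1
idx = pd.date_range('2018-01-01 00:00', periods=12, freq='min')
volume = pd.Series(1.0, index=idx)

# Two full 5 minute buckets, plus a partial one holding the last 2 minutes
buckets = volume.resample('5min').sum()
print(buckets.tolist())   # [5.0, 5.0, 2.0]

# Just like with groupby, several aggregations can be computed in one go
summary = volume.resample('5min').agg(['sum', 'max', 'count'])
print(summary)
```

Notice that the last bucket is only partially filled, which is worth keeping in mind when comparing totals across buckets.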

### <font color='#eb3483'> Quick Knowledge Check</font>
1. What's the highest bitcoin open price every day in January? (Hint: first get all the January data, and then apply our resampling function for days)

In [28]:
data.loc['Jan 2018', 'Open'].resample('1D').max()

Timestamp
2018-01-01    13888.77
2018-01-02    15275.00
2018-01-03    15400.00
2018-01-04    15400.00
2018-01-05    17178.00
2018-01-06    17174.00
2018-01-07    17115.01
2018-01-08    16275.00
2018-01-09    15384.00
2018-01-10    14848.00
2018-01-11    14970.00
2018-01-12    14080.83
2018-01-13    14499.99
2018-01-14    14332.85
2018-01-15    14253.00
2018-01-16    13642.29
2018-01-17    12358.89
2018-01-18    12122.50
2018-01-19    11973.98
2018-01-20    12984.06
2018-01-21    12762.80
2018-01-22    11839.99
2018-01-23    11345.00
2018-01-24    11409.33
2018-01-25    11690.00
2018-01-26    11569.98
2018-01-27    11488.10
2018-01-28    11694.98
2018-01-29    11570.00
2018-01-30    11150.00
2018-01-31    10299.00
Freq: D, Name: Open, dtype: float64

### <font color='#eb3483'> Timeshifts </font>
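Sometimes we might want to shift our dates by a fixed amount. For example, what if our "timestamp" column was actually when the bitcoin data was reported, not when it happened (i.e. all of our dates are off by 2 weeks)? For that we can use the `tshift` function. (Heads up: `tshift` was deprecated in pandas 1.1 and removed in pandas 2.0. On newer versions, `data.shift(periods=..., freq=...)` does the same job.)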

In [None]:
#let's remember what our dates were originally (starts at jan 1!)
data.head()

Let's shift our data by 2 weeks. We can do this by specifying the frequency (i.e. the time unit we're using for our shift) as weeks, and our periods as 2 (i.e. how many time units we want to move it). 

In [None]:
data.tshift(periods=2, freq = 'W').head() #<--- 2 W = 2 weeks

In [None]:
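#We could also shift it by 14 days. The rows shown match because Jan 1st 2017 is a Sunday,
#but 'W' snaps to Sundays, so other rows can differ from a plain 14 day shift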
data.tshift(periods=14, freq = 'D').head() #<--- 14 D = 14 days

Sweet - our data is shifted! To make these changes stick we'd have to assign the shifted data to our data object - but we'll leave it as is.
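The difference between a fixed shift (`'D'`) and an anchored one (`'W'`, which snaps to Sundays) is easier to see on a tiny made-up series. This sketch uses `shift` with a `freq` argument, which works on both older and newer pandas versions.

```python
import pandas as pd

# Three made-up daily rows: a Sunday, a Monday and a Tuesday
idx = pd.date_range('2017-01-01', periods=3, freq='D')
toy = pd.Series([1, 2, 3], index=idx)

by_days = toy.shift(periods=14, freq='D')   # every date moves exactly 14 days
by_weeks = toy.shift(periods=2, freq='W')   # every date moves to a Sunday

print(by_days.index.tolist())    # Jan 15th, 16th and 17th
print(by_weeks.index.tolist())   # all three rows land on Sunday Jan 15th
```

Only the index moves in both cases. The values stay attached to their rows, so if you need an exact two week shift the fixed `'D'` version is the safer choice.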

### <font color='#eb3483'> Quick Knowledge Check</font>
1. Can you shift our data by three quarter-years?

In [30]:
#The frequency special character for 'quarter year' is Q, and we want to shift 3 of them
#Note that Q means quarter *end*, so Jan 1st 2017 lands on Sep 30th rather than Oct 1st
data.tshift(periods=3, freq = 'Q')

| Timestamp | Open | High | Low | Close | Volume_(BTC) | Volume_(Currency) | Weighted_Price |
|---|---|---|---|---|---|---|---|
| 2017-09-30 00:00:00 | 973.37 | 973.37 | 973.35 | 973.35 | 2.122048 | 2065.524303 | 973.363509 |
| 2017-09-30 00:01:00 | 973.37 | 973.37 | 973.35 | 973.35 | 2.122048 | 2065.524303 | 973.363509 |
| 2017-09-30 00:02:00 | 973.37 | 973.37 | 973.35 | 973.35 | 2.122048 | 2065.524303 | 973.363509 |
| 2017-09-30 00:03:00 | 973.36 | 973.36 | 973.36 | 973.36 | 0.040000 | 38.934400 | 973.360000 |
| 2017-09-30 00:04:00 | 973.36 | 973.40 | 973.36 | 973.39 | 5.458800 | 5313.529708 | 973.387871 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2018-09-30 23:56:00 | 8155.00 | 8155.00 | 8154.99 | 8154.99 | 0.617945 | 5039.342643 | 8154.997667 |
| 2018-09-30 23:57:00 | 8154.99 | 8154.99 | 8154.00 | 8154.01 | 40.655410 | 331543.193980 | 8154.958865 |
| 2018-09-30 23:58:00 | 8154.00 | 8154.01 | 8150.00 | 8150.00 | 9.856911 | 80340.432933 | 8150.670628 |
| 2018-09-30 23:59:00 | 8150.01 | 8150.01 | 8122.82 | 8145.00 | 68.274269 | 555026.852280 | 8129.370847 |


### <font color='#eb3483'> Rolling Windows </font>

Rolling windows do what their name suggests: they aggregate the previous X periods (for instance, by taking the mean). They are very useful for smoothing choppy time series and being less reactive to noise. 

We can choose to center the window (look back and forward), but in general we only want to take into account information from the past, so we should use `center=False` (which is the default)

Let's say it's December 18th 2017, in the early morning, and we are at our terminal. 

##### Midnight and a bit... 

In [None]:
data.loc['Dec 18th 2017 00:08:00':'Dec 18th 2017 00:12:00', 'Weighted_Price'].plot(figsize=(16, 4));

![](https://i.imgflip.com/29iucd.jpg)

##### A few minutes pass... 

In [None]:
data.loc['Dec 18th 2017 00:12:00':'Dec 18th 2017 00:15:00', 'Weighted_Price'].plot(figsize=(16, 4));

![](https://i.redditmedia.com/VE5dgdjQ8FKZ47gdxJdQ07q36bsZVyhvAmllvLdtTnI.jpg?w=534&s=ce869cd0d8630cd420af7fa72b3c296d)

##### A few more minutes... 

In [None]:
data.loc['Dec 18th 2017 00:15:00':'Dec 18th 2017 00:18:00', 'Weighted_Price'].plot(figsize=(16, 4));

![](https://i.imgflip.com/29iucd.jpg)

I think you get the picture. What's going on is that we're being extremely reactive to noise, and missing the underlying process. What is in fact going on is that we are in a free-fall, but it might not be obvious unless we look at the slightly broader picture. 

In other words, assuming there is an underlying process, we can assume the recent past should carry some weight. How much weight? A rolling [window](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html) of weight! 

#### The first hour of Dec 18th 2017, as seen by traders

In [None]:
data.loc['Dec 18th 2017 00:00:00':'Dec 18th 2017 01:00:00', 'Weighted_Price'].plot(figsize=(16, 4));

#### The first hour of Dec 18th 2017, as seen by a rolling window of 10 minutes

In [None]:
# this is just the raw data, so we can apply a rolling window on it  
first_hour = data.loc['Dec 18th 2017 00:00:00':'Dec 18th 2017 01:00:00', 'Weighted_Price']

# notice the window size as a parameter of rolling, feel free to mess around with that parameter 
# and the center set to False. That's because we don't want to use data from the future! 
# Also notice how we use the mean. We can use many others. Try changing it! 
window_size = 10
first_hour_rolling_window = first_hour.rolling(window=window_size, center=False).mean()

What do these look like? A rolling window of 10 basically calculates the average bitcoin price over the last 10 rows, which for our minute-by-minute data means 10 minute intervals (so the average price from 00:00 through 00:09, then from 00:01 through 00:10, then from 00:02 through 00:11, etc). The first 9 values are NaN, because there aren't yet 10 rows to average.

In [None]:
# Let's plot these together 
first_hour_rolling_window.plot(figsize=(16, 8), 
                               color='b',
                               label=f'rolling_window = {window_size}');
first_hour.plot(figsize=(16, 8), label='raw data', alpha=.7, ls='-', color='orange');
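To see exactly which rows go into each window, here is a sketch on five made-up prices with a window of 3. It also shows what `center=True` would do, which is why we avoid it when we don't want to peek into the future.

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Trailing window: each value is the mean of the current row and the 2 before it
trailing = prices.rolling(window=3, center=False).mean()

# Centered window: each value uses the row before, the current row and the row after
centered = prices.rolling(window=3, center=True).mean()

print(trailing.tolist())   # [nan, nan, 20.0, 30.0, 40.0]
print(centered.tolist())   # [nan, 20.0, 30.0, 40.0, nan]
```

The NaN values at the start of the trailing version are the rows where the window isn't full yet.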

### <font color='#eb3483'> Quick Knowledge Check</font>
1. Can you get the maximum close price in a rolling window of 2 weeks?

In [35]:
#The 'easy' answer is that our data is every minute, so our window size is 14 days x 24 hours x 60 minutes
data['Close'].rolling(window=14*24*60).max()

#Can you think of a different way to do it?

Timestamp
2017-01-01 00:00:00        NaN
2017-01-01 00:01:00        NaN
2017-01-01 00:02:00        NaN
2017-01-01 00:03:00        NaN
2017-01-01 00:04:00        NaN
                        ...   
2018-03-26 23:56:00    9472.88
2018-03-26 23:57:00    9472.88
2018-03-26 23:58:00    9472.88
2018-03-26 23:59:00    9472.88
2018-03-27 00:00:00    9472.88
Name: Close, Length: 648001, dtype: float64
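One possible "different way" (a sketch, shown on made-up minute data rather than the bitcoin file): because our index is made of datetimes, `rolling` also accepts a time span such as `'14D'` instead of a number of rows. The example below uses a 3 minute span so the numbers can be checked by hand.

```python
import pandas as pd

# Six made-up close prices, one per minute
idx = pd.date_range('2018-01-01 00:00', periods=6, freq='min')
close = pd.Series([1.0, 5.0, 2.0, 3.0, 0.0, 4.0], index=idx)

by_rows = close.rolling(window=3).max()   # window measured in rows
by_time = close.rolling('3min').max()     # window measured in time

print(by_rows.tolist())   # [nan, nan, 5.0, 5.0, 3.0, 4.0]
print(by_time.tolist())   # [1.0, 5.0, 5.0, 5.0, 3.0, 4.0]
```

The time based version doesn't start with NaN values (it works with whatever data falls inside the span), and it doesn't rely on there being exactly one row per minute. On the bitcoin data the equivalent would be `data['Close'].rolling('14D').max()`.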