We're going to look at time series and date functionality in pandas. Date and time handling in pandas is flexible, which makes analyses such as time series analysis practical. In fact, Wes McKinney originally created pandas to handle date and time data when he worked as a consultant for hedge funds, so it is quite robust in this area.

In [1]:
import pandas as pd
import numpy as np

# Timestamp

Pandas has four main time-related classes: Timestamp, DatetimeIndex, Period, and PeriodIndex.

In [2]:
# Timestamp represents a single point in time and is how pandas associates values with points in time.
# let's create a timestamp from the string 9/1/2019 at 10:05 AM.
# Timestamp is interchangeable with Python's datetime in most cases.

pd.Timestamp('9/1/2019 10:05AM')

Timestamp('2019-09-01 10:05:00')

In [3]:
# we can also create a timestamp by passing multiple parameters such as year, month, day, hour, and minute separately
pd.Timestamp(2019, 12, 20, 0, 0)

Timestamp('2019-12-20 00:00:00')

In [4]:
# timestamp also has some useful methods such as isoweekday(), which returns the weekday of the timestamp
# note that 1 represents Monday and 7 represents Sunday
pd.Timestamp(2019, 12, 20, 0, 0).isoweekday()

5

In [5]:
# we can find and extract the specific year, month, day, hour, minute, second from a timestamp
pd.Timestamp(2019, 12, 20, 5, 2, 23).second

23
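
As a small sketch of the points above, the cell below pulls each component out of one Timestamp and checks that it behaves like a standard library datetime. The variable names (`ts`, `parts`, `same_moment`) are made up for illustration.

```python
import datetime as dt
import pandas as pd

ts = pd.Timestamp(2019, 12, 20, 5, 2, 23)

# each component is available as a plain attribute
parts = (ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second)
print(parts)  # (2019, 12, 20, 5, 2, 23)

# Timestamp subclasses datetime.datetime, so the two compare equal
same_moment = dt.datetime(2019, 12, 20, 5, 2, 23)
print(isinstance(ts, dt.datetime))  # True
print(ts == same_moment)            # True

# day_name() returns the weekday as a string instead of a number
print(ts.day_name())  # Friday
```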

# Period

Suppose we weren't interested in a specific point in time and instead we wanted a span of time. This is where the Period class comes into play. Period represents a single time span, such as a specific day or month. 

In [6]:
# let's create a period that's just for the month January 2016.
pd.Period('1/2016')

Period('2016-01', 'M')

In [7]:
# when we print it out, the granularity of the period is M for month, since that was the finest-grained piece we provided
# here's an example of a period that is March 5th, 2016
pd.Period('3/5/2016')

Period('2016-03-05', 'D')

In [8]:
# a Period object represents the full timespan that we specify. Arithmetic on periods is easy and intuitive
# for instance, if we want to find the month 5 months after January 2016, we simply add 5
pd.Period('1/2016') + 5

Period('2016-06', 'M')

In [3]:
# if we want to find the day two days before March 5th 2016, we simply subtract 2
pd.Period('3/5/2016') - 2

Period('2016-03-03', 'D')

The key here is that the Period object encapsulates the granularity used for arithmetic: adding 1 to a monthly period moves one month, while adding 1 to a daily period moves one day.
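
Here is a minimal sketch of that idea. It adds the same integer to a monthly and a daily Period and then looks at the span a Period covers through its `start_time` and `end_time` attributes. The variable names are chosen only for this example.

```python
import pandas as pd

month = pd.Period('2016-01', freq='M')
day = pd.Period('2016-03-05', freq='D')

# adding an integer moves by whole units of the period's own frequency
next_month = month + 1   # one month later
next_day = day + 1       # one day later
print(next_month, next_day)  # 2016-02 2016-03-06

# a Period covers a span of time, so it has a start and an end
span_start = month.start_time
span_end = month.end_time
print(span_start)       # 2016-01-01 00:00:00
print(span_end.date())  # 2016-01-31
```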

# DatetimeIndex and PeriodIndex

A Series or DataFrame indexed by Timestamps has a DatetimeIndex.

In [4]:
# let's create our example series t1, we'll use the Timestamp of September 1st, 2nd and 3rd of 2016. 
# When we look at the series, each Timestamp is the index and has a value associated with it, in this case, a, b, and c

t1 = pd.Series(list('abc'), [pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-02'), pd.Timestamp('2016-09-03')])
t1

2016-09-01    a
2016-09-02    b
2016-09-03    c
dtype: object

In [5]:
# looking at the type of our series index, we see that it's DatetimeIndex
type(t1.index)

pandas.core.indexes.datetimes.DatetimeIndex

Remember that the dtype shown above refers to the data values in the series, not the index.

In [7]:
# similarly, we can create a period-based index as well
t2 = pd.Series(list('def'), [pd.Period('2016-09'), pd.Period('2016-10'), pd.Period('2016-11')])
t2

2016-09    d
2016-10    e
2016-11    f
Freq: M, dtype: object

Looking at the type of t2.index, we'll see that it's a PeriodIndex.
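
The check can be sketched as follows. The series is rebuilt so the cell runs on its own, and `pd.period_range` is shown as one way to build the same kind of index without listing every Period by hand.

```python
import pandas as pd

t2 = pd.Series(list('def'),
               [pd.Period('2016-09'), pd.Period('2016-10'), pd.Period('2016-11')])

# an index built from Period objects is a PeriodIndex
print(type(t2.index).__name__)  # PeriodIndex

# period_range generates consecutive monthly periods from a starting point
idx = pd.period_range('2016-09', periods=3, freq='M')
print(list(idx) == list(t2.index))
```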

# Converting to Datetime

In [2]:
# Let's look into how to convert to Datetime.
# suppose we have a list of dates as strings and we want to create a new dataframe

# we're going to try a bunch of different formats
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']

# and just some random data
ts3 = pd.DataFrame(np.random.randint(10, 100, (4, 2)), index=d1, columns=list('ab'))
ts3

Unnamed: 0,a,b
2 June 2013,78,33
"Aug 29, 2014",36,42
2015-06-26,35,47
7/12/16,93,47


In [3]:
# using pandas to_datetime, pandas will try to convert these to Datetime and put them in a standard format
# (on pandas 2.0 and later, mixed formats like these also need format='mixed')

ts3.index = pd.to_datetime(ts3.index)
ts3

Unnamed: 0,a,b
2013-06-02,78,33
2014-08-29,36,42
2015-06-26,35,47
2016-07-12,93,47


In [4]:
# to_datetime also has options to change the date parse order
# for instance, we can pass in the argument dayfirst=True to parse the date in the European day-first format

pd.to_datetime('4.7.12', dayfirst=True)

Timestamp('2012-07-04 00:00:00')
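
When the layout of the strings is known ahead of time, an explicit `format` string is an alternative to the `dayfirst` hint. The sketch below parses the same string both ways; the format string `'%d.%m.%y'` is an assumption about how this particular input is laid out (day, month, two-digit year).

```python
import pandas as pd

raw = '4.7.12'

# dayfirst=True is a hint to the parser about the ordering
hinted = pd.to_datetime(raw, dayfirst=True)

# an explicit format leaves nothing to guess: day.month.two-digit-year
explicit = pd.to_datetime(raw, format='%d.%m.%y')

print(explicit)  # 2012-07-04 00:00:00
print(hinted == explicit)
```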

# Timedelta

Timedeltas are differences in time. A Timedelta is not the same as a Period, but the two are conceptually similar.

In [5]:
# for instance, if we want to take the difference between September 3rd and September 1st, we get Timedelta of two days

pd.Timestamp('9/3/2016')-pd.Timestamp('9/1/2016')

Timedelta('2 days 00:00:00')

In [6]:
# we can also find the date and time 12 days and three hours after September 2nd at 8:10 AM
# (lowercase 'h' for hours; uppercase 'H' is deprecated in recent pandas releases)

pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3h')

Timestamp('2016-09-14 11:10:00')
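
The same calculation can be sketched with the keyword form of Timedelta, which avoids string parsing altogether. The variable names here are illustrative only.

```python
import pandas as pd

start = pd.Timestamp('2016-09-02 08:10')
delta = pd.Timedelta(days=12, hours=3)  # keyword form of "12 days, 3 hours"

later = start + delta
print(later)  # 2016-09-14 11:10:00

# a Timedelta is a fixed duration, so it can be expressed in seconds
print(delta.total_seconds())  # 1047600.0

# subtracting the two timestamps gives the same Timedelta back
print(later - start == delta)  # True
```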

# Offset

Offset is similar to Timedelta, but it follows specific calendar duration rules. Offsets allow flexibility in the types of time intervals: besides hour, day, week, and month, there are also offsets such as business day, end of month, and semi-month begin. These are unusual units for a time series, but they come up in business all the time.

In [7]:
# So let's create a timestamp, and see what day it is.
pd.Timestamp('9/4/2016').weekday()

6

In [11]:
# now we can add a week to the timestamp
# note the capitalization: it is pd.offsets.Week(), with a capital W

pd.Timestamp('9/4/2016') + pd.offsets.Week()

Timestamp('2016-09-11 00:00:00')

In [12]:
# let's try to do the month end, then we would have the last day of September
pd.Timestamp('9/4/2016') + pd.offsets.MonthEnd()

Timestamp('2016-09-30 00:00:00')
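
To illustrate how calendar rules differ from a fixed duration, here is a short sketch that applies a business-day offset and a month-end offset to the same Sunday and compares them with a plain 30-day Timedelta. The variable names are invented for this example.

```python
import pandas as pd

sunday = pd.Timestamp('2016-09-04')  # a Sunday

# one business day after a Sunday lands on Monday
next_bday = sunday + pd.offsets.BDay()
print(next_bday)  # 2016-09-05 00:00:00

# MonthEnd rolls forward to the last day of the month
month_end = sunday + pd.offsets.MonthEnd()
print(month_end)  # 2016-09-30 00:00:00

# starting from a month end, MonthEnd moves to the end of the next month
after_end = month_end + pd.offsets.MonthEnd()
print(after_end)  # 2016-10-31 00:00:00

# a Timedelta of 30 days is a fixed length and ignores the calendar
fixed = sunday + pd.Timedelta(days=30)
print(fixed)  # 2016-10-04 00:00:00
```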

# Working with Dates in a Dataframe

Let's look at a few tricks for working with dates in a DataFrame. Suppose we want to look at nine measurements, taken bi-weekly, every Sunday, starting in October 2016. Using date_range, we can create a DatetimeIndex. 

With date_range, we have to specify either a start or an end date. If we pass a single date without naming it, it is treated as the start date. We then give the number of periods and a frequency.

In [14]:
# We're going to set it to '2W-SUN', which means bi-weekly on Sunday. Like regex,
# there's a sort of mini language to describe these frequencies.

dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
dates

DatetimeIndex(['2016-10-02', '2016-10-16', '2016-10-30', '2016-11-13',
               '2016-11-27', '2016-12-11', '2016-12-25', '2017-01-08',
               '2017-01-22'],
              dtype='datetime64[ns]', freq='2W-SUN')

In [15]:
# there are many other frequencies that you can specify.
# for example, we can do business day (Mon.-Fri.)
pd.date_range('10-01-2016', periods=9, freq='B')

DatetimeIndex(['2016-10-03', '2016-10-04', '2016-10-05', '2016-10-06',
               '2016-10-07', '2016-10-10', '2016-10-11', '2016-10-12',
               '2016-10-13'],
              dtype='datetime64[ns]', freq='B')

In [17]:
# or we can do quarterly, with the quarter start in June
# quarter = 1/4 year = 3 months

pd.date_range('04-01-2016', periods=12, freq='QS-JUN')

DatetimeIndex(['2016-06-01', '2016-09-01', '2016-12-01', '2017-03-01',
               '2017-06-01', '2017-09-01', '2017-12-01', '2018-03-01',
               '2018-06-01', '2018-09-01', '2018-12-01', '2019-03-01'],
              dtype='datetime64[ns]', freq='QS-JUN')
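
As a sketch of the start/end choice described earlier, the cell below builds the same three daily dates in three ways: from a start date, from an end date, and from both. The dates are arbitrary and picked only for illustration.

```python
import pandas as pd

# start + periods: count forward from the start date
forward = pd.date_range(start='2016-10-01', periods=3, freq='D')

# end + periods: count backward so the range finishes on the end date
backward = pd.date_range(end='2016-10-03', periods=3, freq='D')

# start + end: every date in between at the given frequency
between = pd.date_range(start='2016-10-01', end='2016-10-03', freq='D')

print(forward.equals(backward) and forward.equals(between))  # True
```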

In [20]:
# Now let's go back to our bi-weekly on Sunday example and create a dataframe using these dates and some random data
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df

Unnamed: 0,Count 1,Count 2
2016-10-02,109,127
2016-10-16,115,119
2016-10-30,112,124
2016-11-13,110,124
2016-11-27,112,126
2016-12-11,118,118
2016-12-25,123,119
2017-01-08,125,122
2017-01-22,126,121


In [21]:
# first, we can check what day of the week a specific date is.
# For example, here we can see that all dates in our index fall on a Sunday, which matches the frequency that we set.
# (day_name() replaces the weekday_name attribute, which was removed in pandas 1.0)

df.index.day_name()

Index(['Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday', 'Sunday',
       'Sunday', 'Sunday'],
      dtype='object')

In [22]:
# we can also use diff() to find the difference between each date's value and the previous one
df.diff()

# each row shows the change from the row before it; the first row is NaN because it has no previous row

Unnamed: 0,Count 1,Count 2
2016-10-02,,
2016-10-16,6.0,-8.0
2016-10-30,-3.0,5.0
2016-11-13,-2.0,0.0
2016-11-27,2.0,2.0
2016-12-11,6.0,-8.0
2016-12-25,5.0,1.0
2017-01-08,2.0,3.0
2017-01-22,1.0,-1.0


Suppose we want to know the mean count for each month in our DataFrame. We can do this using resample. Converting from a higher frequency to a lower frequency is called downsampling; we'll cover it in more detail in another lecture.

In [23]:
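df.resample('M').mean()
# group the rows by calendar month and take the mean of Count 1 and Count 2
# (pandas 2.2 and later prefer the alias 'ME' for month end; 'M' is deprecated there)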

Unnamed: 0,Count 1,Count 2
2016-10-31,112.0,123.333333
2016-11-30,111.0,125.0
2016-12-31,120.5,118.5
2017-01-31,125.5,121.5


Let's talk about datetime indexing and slicing.

In [24]:
# For instance, we can use partial string indexing: pass a string such as a year
# and pandas returns every row whose date falls in that year.
# (row selection by string goes through .loc; plain df['2017'] looks up a column in pandas 2.0 and later)
df.loc['2017']

Unnamed: 0,Count 1,Count 2
2017-01-08,125,122
2017-01-22,126,121


In [25]:
# or we can select the rows from a particular month
df.loc['2016-12']

Unnamed: 0,Count 1,Count 2
2016-12-11,118,118
2016-12-25,123,119


In [26]:
# we can even slice on a range of dates
# for example, here we only want the values from December 2016 onwards
df['2016-12':]

Unnamed: 0,Count 1,Count 2
2016-12-11,118,118
2016-12-25,123,119
2017-01-08,125,122
2017-01-22,126,121
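
A slice can also be bounded on both sides. This sketch again uses the Count 1 values as a fixed Series and selects November through December 2016 with `.loc`; with partial strings, both endpoints are inclusive.

```python
import pandas as pd

dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
counts = pd.Series([109, 115, 112, 110, 112, 118, 123, 125, 126], index=dates)

# both ends of a partial-string slice are inclusive, so all of December is kept
nov_to_dec = counts.loc['2016-11':'2016-12']
print(nov_to_dec)

# a full date string that matches an index entry returns that single value
single = counts.loc['2016-12-25']
print(single)  # 123
```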
