# Time series data in Python

This workbook shows how time can be treated as a data type within Python, using pandas.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

sales=pd.read_csv('store_sales2.csv')
sales.head()

Unnamed: 0,date,family,sales,onpromotion
0,2013-01-01,AUTOMOTIVE,0.0,0
1,2013-01-01,BABY CARE,0.0,0
2,2013-01-01,BEAUTY,0.0,0
3,2013-01-01,BEVERAGES,0.0,0
4,2013-01-01,BOOKS,0.0,0


## Time for pandas

The first step to understanding time series in python is to look at the functions that come with pandas. We start with how to create time as a data type. Depending on the data source, a column of dates may not be recognised as such. For example, check what data type the `date` column has in our stores data:

In [2]:
sales.dtypes

date            object
family          object
sales          float64
onpromotion      int64
dtype: object

In this instance, the `date` column has been listed as an object, i.e. a string. This makes it difficult for us to perform any time series analysis. Our first step is to convert the column into a datetime object, which will be recognised as time series data.

The function we will be using is `pd.to_datetime()`.

There are a few arguments you should be aware of when using this function:

<ul>
    <li> <b>dayfirst</b>- Default is <code>False</code>, set to <code>True</code> if the date puts the day first</li>
    <li> <b>yearfirst</b>- Default is <code>False</code>, set to <code>True</code> if the date puts the year first</li>
    <li> <b>format</b>- Provide the format the data is currently written in</li>
</ul>

Click <a href='https://dataindependent.com/pandas/pandas-to-datetime-string-to-date-pd-to_datetime/'>here</a> for a resource on datetime formats.
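
As a minimal sketch of how these arguments change the result, the example below parses one ambiguous string three ways. The date string is made up for illustration and is not taken from the sales data.

```python
import pandas as pd

# An ambiguous date string (illustrative): 5th June or 6th May?
raw = "05/06/2021"

month_first = pd.to_datetime(raw)                   # default reads the month first
day_first = pd.to_datetime(raw, dayfirst=True)      # read the day first instead
explicit = pd.to_datetime(raw, format="%d/%m/%Y")   # spell the layout out in full

print(month_first.date(), day_first.date(), explicit.date())
```

Giving an explicit `format` is usually the safest choice when you know how the source writes its dates, because nothing is left to guesswork.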

In [4]:
pd.to_datetime(sales.date, format="%Y-%m-%d") # the dates are written year-month-day with dashes

0       2013-01-01
1       2013-01-01
2       2013-01-01
3       2013-01-01
4       2013-01-01
           ...    
55567   2017-08-15
55568   2017-08-15
55569   2017-08-15
55570   2017-08-15
55571   2017-08-15
Name: date, Length: 55572, dtype: datetime64[ns]

Notice the dtype now says `datetime64[ns]`: the column is now recognised specifically as time series data.

We can then overwrite the original `date` column with this transformed data.

In [5]:
sales['date']=pd.to_datetime(sales.date, format="%Y-%m-%d")
sales.head()

Unnamed: 0,date,family,sales,onpromotion
0,2013-01-01,AUTOMOTIVE,0.0,0
1,2013-01-01,BABY CARE,0.0,0
2,2013-01-01,BEAUTY,0.0,0
3,2013-01-01,BEVERAGES,0.0,0
4,2013-01-01,BOOKS,0.0,0


In [6]:
sales.dtypes

date           datetime64[ns]
family                 object
sales                 float64
onpromotion             int64
dtype: object
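
Real data often contains entries that are not valid dates. As a hedged sketch (the values below are invented, not taken from `store_sales2.csv`), passing `errors="coerce"` to `pd.to_datetime()` turns anything that cannot be parsed into `NaT`, the missing value for datetimes, instead of raising an error.

```python
import pandas as pd

# Illustrative values: the middle entry is deliberately not a date
raw = pd.Series(["2013-01-01", "not a date", "2013-01-03"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("Unparseable entries:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to see how many rows failed to convert.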

We can use this function to create timestamps (date and time information), for example:

In [7]:
pd.to_datetime('03/04/1990') # read month-first by default: 4th March 1990 (pass dayfirst=True for 3rd April)

Timestamp('1990-03-04 00:00:00')

In [8]:
pd.to_datetime('03/04/1990 05:15:17') # timestamp for quarter past 5 on the 4th of March 1990 (month-first by default)

Timestamp('1990-03-04 05:15:17')

With our dates now in a datetime format we can start to perform operations with them. Time data, although numerical, does not follow the arithmetic rules we are used to: months differ in length, for example, and some years have an extra day. By converting to datetime, python will now account for these special rules.

For example, let's say we wanted to find the difference in days between two dates. We can simply subtract the two from each other:

In [9]:
d1=pd.to_datetime('03/04/1990')
d2=pd.to_datetime('03/04/2022')

d2-d1

Timedelta('11688 days 00:00:00')
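
The result of a subtraction is a `Timedelta` object rather than a plain number. As a small sketch using the same two dates (built here with `pd.Timestamp` so the example stands alone), the `.days` attribute and the `.total_seconds()` method pull ordinary numbers out of it.

```python
import pandas as pd

d1 = pd.Timestamp(year=1990, month=3, day=4)
d2 = pd.Timestamp(year=2022, month=3, day=4)
gap = d2 - d1

print(gap.days)             # whole days as an integer
print(gap.total_seconds())  # the full duration in seconds, as a float
```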

What about if you wanted to compare the date to today? You could type in today's date (but that will be different every day), or we can use a function:

In [10]:
from datetime import datetime

datetime.now()

datetime.datetime(2023, 7, 27, 15, 35, 59, 325947)

With this function we can now find out how long ago something was in days:

In [11]:
datetime.now()-d1

Timedelta('12198 days 15:36:00.741664')

What happens if we want to look at it in terms of weeks or months or years?

We can use a `Timedelta`, an object that represents a duration of time. We can use it to define how long a time period is (weeks is the largest unit it accepts, and durations are displayed in days):

In [12]:
pd.Timedelta(days=30)

Timedelta('30 days 00:00:00')

In [13]:
pd.Timedelta(weeks=1) # there is an argument for each time period

Timedelta('7 days 00:00:00')
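
Arguments can also be combined, and durations can be added together. The sketch below is illustrative only; the string form passed to `pd.Timedelta` mirrors the way pandas prints a duration.

```python
import pandas as pd

from_kwargs = pd.Timedelta(days=1, hours=6)      # several units in one call
from_string = pd.Timedelta("1 days 06:00:00")    # the same duration written as text
print(from_kwargs == from_string)

total = pd.Timedelta(weeks=2) + pd.Timedelta(hours=12)  # durations add like numbers
print(total)
```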

Combining this with the difference in times calculation, we can find out how far apart two dates are in a unit other than days (but not months or years: their lengths vary, so `Timedelta` has no argument for them):

In [14]:
(datetime.now()-d1)/pd.Timedelta(weeks=1) # time difference in weeks 

1742.6642895718485

(Of course, if you wanted a rough difference in years, you could divide the output for weeks by 52; dividing the number of days by 365.25 is a little closer.)

In [15]:
(datetime.now()-d1)/pd.Timedelta(minutes=1) # time difference in minutes

17566056.055018518
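
For the difference in years mentioned earlier, one option is to divide by a `Timedelta` of 365.25 days. This is only a sketch: 365.25 is an assumed average year length that allows for leap years, so the result is an approximation rather than an exact calendar calculation. Fixed dates are used here so the result is repeatable.

```python
import pandas as pd

start = pd.Timestamp(year=1990, month=3, day=4)
end = pd.Timestamp(year=2022, month=3, day=4)

# 365.25 days is an average year length (an approximation, not a calendar rule)
approx_years = (end - start) / pd.Timedelta(days=365.25)
print(round(approx_years, 2))
```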

### Practice

Find out how old you are in days, hours, minutes and seconds:

In [16]:
my_bday='03/04/1990' # read month-first: 4th March 1990
bday=pd.to_datetime(my_bday)

print(datetime.now()-bday) #days
print((datetime.now()-bday)/pd.Timedelta(hours=1)) #hours
print((datetime.now()-bday)/pd.Timedelta(minutes=1)) #minutes
print((datetime.now()-bday)/pd.Timedelta(seconds=1)) #seconds

12198 days 15:36:04.174816
292767.601160115
17566056.069613084
1053963364.176785


Let's say you want to add a specific time period to a date, or subtract one from it. `pd.offsets.DateOffset()` creates an offset object that can be added to or subtracted from a date. For example, what will the date be in 22 days?

In [17]:
pd.offsets.DateOffset(days=22) + datetime.now()

Timestamp('2023-08-18 15:36:05.655430')

DateOffset allows you to check quickly what the date or time is or was. This can be combined with a data validation function- e.g. you want to automate the order to send a product out for delivery, but want to ensure the target date isn't on the weekend.
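
The delivery check described above could be sketched as follows. The helper `next_weekday` and the order date are hypothetical, chosen only for illustration; the check relies on `weekday()`, which returns 0 for Monday through to 6 for Sunday.

```python
import pandas as pd

def next_weekday(ts):
    """Hypothetical helper: keep weekdays as they are, move weekends to the next Monday."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        ts = ts + pd.Timedelta(days=7 - ts.weekday())
    return ts

order_date = pd.Timestamp("2023-07-27")               # a Thursday (illustrative)
target = order_date + pd.offsets.DateOffset(days=2)   # lands on a Saturday
delivery = next_weekday(target)

print(target.day_name(), "->", delivery.date(), delivery.day_name())
```

A real delivery rule would probably also need to skip public holidays, which this sketch ignores.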

### Checking the time

Speaking of which, python makes it possible for us to extract specific information about dates:

In [18]:
date=sales.iloc[0,0]
print(date)

2013-01-01 00:00:00


In [19]:
print('Day: '+ str(date.day)+', '+'Hour: '+str(date.hour))

Day: 1, Hour: 0


In [20]:
print('Weekday: '+ str(date.weekday())+', '+'Month: '+str(date.month))

Weekday: 1, Month: 1


With datetime, the week starts on Monday (0) and ends on Sunday (6).

We can obtain the same information by adding `.dt` to the end of a pandas series which contains datetime:

In [21]:
sales.date.dt.day_name()

0        Tuesday
1        Tuesday
2        Tuesday
3        Tuesday
4        Tuesday
          ...   
55567    Tuesday
55568    Tuesday
55569    Tuesday
55570    Tuesday
55571    Tuesday
Name: date, Length: 55572, dtype: object
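
Because `.dt` works on a whole series at once, its results can be compared directly to build new columns. A minimal sketch, using three made-up dates rather than the sales data, flags which rows fall on a weekend:

```python
import pandas as pd

# Illustrative dates: a Tuesday, a Saturday and a Sunday
dates = pd.Series(pd.to_datetime(["2013-01-01", "2013-01-05", "2013-01-06"]))

# weekday runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend
is_weekend = dates.dt.weekday >= 5

print(pd.DataFrame({"date": dates, "day": dates.dt.day_name(), "is_weekend": is_weekend}))
```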

### Practice

Try finding other parts of datetime using `.dt` (year, month, month_name, minute, etc).

In [22]:
print('Year: '+ str(sales.date.dt.year[0]))
print('Month: '+ str(sales.date.dt.month[0]))
print('Month Name: '+ str(sales.date.dt.month_name()[0]))
print('Hour: '+ str(sales.date.dt.hour[0]))
print('Minute: '+ str(sales.date.dt.minute[0]))

Year: 2013
Month: 1
Month Name: January
Hour: 0
Minute: 0


### Filtering

Converting our dates into datetime also allows us to create filters:

In [23]:
sales[sales.date<pd.to_datetime('01/01/2015')]

Unnamed: 0,date,family,sales,onpromotion
0,2013-01-01,AUTOMOTIVE,0.000,0
1,2013-01-01,BABY CARE,0.000,0
2,2013-01-01,BEAUTY,0.000,0
3,2013-01-01,BEVERAGES,0.000,0
4,2013-01-01,BOOKS,0.000,0
...,...,...,...,...
24019,2014-12-31,POULTRY,585.882,0
24020,2014-12-31,PREPARED FOODS,132.331,1
24021,2014-12-31,PRODUCE,3837.229,187
24022,2014-12-31,SCHOOL AND OFFICE SUPPLIES,0.000,0
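
Two conditions can be combined to keep only the rows between a start and an end date. The sketch below uses a small made-up dataframe so that it runs on its own; the column names simply mirror the sales data.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-12-31", "2015-01-01", "2015-06-30", "2016-01-01"]),
    "sales": [1.0, 2.0, 3.0, 4.0],
})

start, end = pd.Timestamp("2015-01-01"), pd.Timestamp("2015-12-31")

# Each comparison gives a True/False series; & keeps rows where both are True
in_2015 = df[(df["date"] >= start) & (df["date"] <= end)]
print(in_2015)
```

The brackets around each comparison are needed because `&` binds more tightly than `>=` and `<=`.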


We get even more filtering power if we set the date to be the dataframe's index.

In [25]:
sales_index=sales.copy()

sales_index.set_index('date',inplace=True)
sales_index.head()

date,family,sales,onpromotion
2013-01-01,AUTOMOTIVE,0.0,0
2013-01-01,BABY CARE,0.0,0
2013-01-01,BEAUTY,0.0,0
2013-01-01,BEVERAGES,0.0,0
2013-01-01,BOOKS,0.0,0


In [27]:
sales_index.loc['2015'] # view data only in 2015

date,family,sales,onpromotion
2015-01-01,AUTOMOTIVE,0.0000,0
2015-01-01,BABY CARE,0.0000,0
2015-01-01,BEAUTY,0.0000,0
2015-01-01,BEVERAGES,0.0000,0
2015-01-01,BOOKS,0.0000,0
...,...,...,...
2015-12-31,POULTRY,585.0700,0
2015-12-31,PREPARED FOODS,183.7980,0
2015-12-31,PRODUCE,4109.4062,0
2015-12-31,SCHOOL AND OFFICE SUPPLIES,1.0000,0
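
The same string-based lookup works for a single month or for a range of months. This is a sketch on a made-up daily dataframe (named `toy` here for illustration), assuming a sorted datetime index:

```python
import pandas as pd

# 120 consecutive days starting on 1st January 2015 (illustrative data)
idx = pd.date_range("2015-01-01", periods=120, freq="D")
toy = pd.DataFrame({"sales": range(120)}, index=idx)

march = toy.loc["2015-03"]                  # every row in March 2015
feb_to_mar = toy.loc["2015-02":"2015-03"]   # slice: start of February to end of March

print(len(march), len(feb_to_mar))
```

Unlike ordinary python slices, the end of a date slice is included, so the second lookup covers the whole of March.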


### Practice

Create a new dataframe called sales_bev which shows all the sales in the time period for BEVERAGES. Using this new dataframe, complete the following tasks:

<ol>
    <li> The day of week as a number </li>
    <li> The day of week as its name </li>
    <li> The month as its name </li>
    <li> Filter the data to only show dates after the 16th of June 2016</li>
    <li> [Stretch] The difference in sales from the day before </li>
</ol>

In [28]:
sales_bev=sales[sales.family=='BEVERAGES'].reset_index(drop=True)

In [29]:
sales_bev['day_of_week']=sales_bev.date.dt.weekday
sales_bev['weekday']=sales_bev.date.dt.day_name()
sales_bev['month']=sales_bev.date.dt.month_name()

In [30]:
sales_bev.set_index('date').loc['2016-06-16':] # using date as an index (the slice includes the 16th itself)

date,family,sales,onpromotion,day_of_week,weekday,month
2016-06-16,BEVERAGES,2566.0,35,3,Thursday,June
2016-06-17,BEVERAGES,2808.0,34,4,Friday,June
2016-06-18,BEVERAGES,3714.0,37,5,Saturday,June
2016-06-19,BEVERAGES,3755.0,35,6,Sunday,June
2016-06-20,BEVERAGES,3124.0,37,0,Monday,June
...,...,...,...,...,...,...
2017-08-11,BEVERAGES,2390.0,8,4,Friday,August
2017-08-12,BEVERAGES,2649.0,7,5,Saturday,August
2017-08-13,BEVERAGES,2947.0,10,6,Sunday,August
2017-08-14,BEVERAGES,2559.0,11,0,Monday,August


In [31]:
sales_bev[sales_bev.date>pd.to_datetime('2016-06-16')] # setting a filter on date (strictly after the 16th)

Unnamed: 0,date,family,sales,onpromotion,day_of_week,weekday,month
1260,2016-06-17,BEVERAGES,2808.0,34,4,Friday,June
1261,2016-06-18,BEVERAGES,3714.0,37,5,Saturday,June
1262,2016-06-19,BEVERAGES,3755.0,35,6,Sunday,June
1263,2016-06-20,BEVERAGES,3124.0,37,0,Monday,June
1264,2016-06-21,BEVERAGES,2876.0,35,1,Tuesday,June
...,...,...,...,...,...,...,...
1679,2017-08-11,BEVERAGES,2390.0,8,4,Friday,August
1680,2017-08-12,BEVERAGES,2649.0,7,5,Saturday,August
1681,2017-08-13,BEVERAGES,2947.0,10,6,Sunday,August
1682,2017-08-14,BEVERAGES,2559.0,11,0,Monday,August


In [32]:
import numpy as np

sales_diff=[]
for i in range(len(sales_bev)):
    if i>0:
        # column 2 is 'sales': today's value minus the previous row's value
        sales_today=sales_bev.iloc[i,2]
        sales_yday=sales_bev.iloc[i-1,2]
        sales_diff.append(sales_today-sales_yday)
    else:
        # the first row has no prior entry, so record a missing value
        sales_diff.append(np.nan)

sales_bev['change']=sales_diff

In [33]:
sales_bev.head()

Unnamed: 0,date,family,sales,onpromotion,day_of_week,weekday,month,change
0,2013-01-01,BEVERAGES,0.0,0,1,Tuesday,January,
1,2013-01-02,BEVERAGES,1481.0,0,2,Wednesday,January,1481.0
2,2013-01-03,BEVERAGES,1016.0,0,3,Thursday,January,-465.0
3,2013-01-04,BEVERAGES,1146.0,0,4,Friday,January,130.0
4,2013-01-05,BEVERAGES,1581.0,0,5,Saturday,January,435.0
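
The loop above works, but pandas also provides `Series.diff()`, which subtracts the previous row from each row in one step. As an alternative sketch (not the workbook's own solution), here it is applied to the first five BEVERAGES values shown in the output:

```python
import pandas as pd

# The first five rows of BEVERAGES sales, copied from the output above
bev = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03",
                            "2013-01-04", "2013-01-05"]),
    "sales": [0.0, 1481.0, 1016.0, 1146.0, 1581.0],
})

# diff() leaves NaN in the first row because there is no earlier value
bev["change"] = bev["sales"].diff()
print(bev)
```

Both approaches assume the rows are already sorted by date with one row per day.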


## Grouping dates

A useful feature for inspecting time series data is grouping it by some aggregate period (month, year, etc). A plain `groupby()` on the date column groups only by the exact values it finds, so it gives one group per distinct date. `.resample()`, however, can regroup the data into different datetime periods.

By default, this function works on the dataframe's index, which must be datetime:

In [34]:
sales_index.resample('A').mean(numeric_only=True) # 'A' for annual (pandas 2.2+ prefers 'YE'); numeric_only skips the text column

date,sales,onpromotion
2013-12-31,238.544231,0.0
2014-12-31,368.474537,1.07043
2015-12-31,400.852651,2.025891
2016-12-31,488.668592,5.779328
2017-12-31,475.880919,6.854092


Grouping data in this way is a great method for viewing trends within the data. We can view the change in a variable over time (e.g. average sales). It is likely stakeholders will want to know how the data is trending, which you can report simply with this technique.
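
If the dates sit in an ordinary column rather than the index, `.resample()` can be pointed at that column with its `on` argument. The sketch below uses made-up data; the `"MS"` alias (month start) is chosen here as an assumption for illustration, and labels each group with the first day of its month.

```python
import pandas as pd

# Illustrative data: two sales in January, one in February
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-15", "2013-01-20", "2013-02-03"]),
    "sales": [10.0, 20.0, 5.0],
})

# Group by calendar month using the 'date' column, then total the sales
monthly = toy.resample("MS", on="date")["sales"].sum()
print(monthly)
```

Selecting the `sales` column before aggregating keeps any text columns out of the calculation.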

### Practice

Create dataframes which summarise the data as follows:

<ol>
    <li> Total annual sales </li>
    <li> Median sales per month </li>
    <li> Weekly mean sales </li>
</ol>

In [35]:
sales_index.resample('A').sum(numeric_only=True)

date,sales,onpromotion
2013-12-31,2865393.0,0
2014-12-31,4426116.0,12858
2015-12-31,4815042.0,24335
2016-12-31,5886013.0,69612
2017-12-31,3564824.0,51344


In [36]:
sales_index.resample('M').median(numeric_only=True)

date,sales,onpromotion
2013-01-31,4.0,0.0
2013-02-28,6.8,0.0
2013-03-31,7.0,0.0
2013-04-30,7.4825,0.0
2013-05-31,9.0,0.0
2013-06-30,9.0,0.0
2013-07-31,9.0,0.0
2013-08-31,9.0,0.0
2013-09-30,7.0,0.0
2013-10-31,7.0,0.0
