# Working With Dates in Pandas

In [1]:
import pandas as pd

## Find current time
Finding the current time is useful when we want to calculate the age of something (how much time has passed between then and now) or create dynamic, time-based filters (for example, show all sales that occurred in the last 30 days)

In [6]:
# pd.datetime was removed in pandas 2.0; pd.Timestamp.today() is the equivalent
print(pd.Timestamp.today())

2021-05-29 15:00:32.828219


We can make it more presentable by converting the datetime object into a string

In [5]:
print(pd.Timestamp.today())

2021-05-29 15:00:28.421210
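The two use cases mentioned above (age calculation and a "last 30 days" filter) can be sketched as follows. This is only an illustration: the `sales` frame, its column names and the fixed `now` value are made up so the example is reproducible; in practice `now` would come from `pd.Timestamp.today()`.

```python
import pandas as pd

# Hypothetical sales data, invented for this illustration
sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2021-03-01", "2021-05-10", "2021-05-28"],
                                 format="%Y-%m-%d"),
    "amount": [120, 80, 45],
})

# A fixed "now" keeps the example reproducible; normally: now = pd.Timestamp.today()
now = pd.Timestamp("2021-05-29 15:00:00")

# Dynamic filter: everything ordered in the last 30 days
cutoff = now - pd.Timedelta(days=30)
recent = sales[sales["order_date"] >= cutoff]

# Age of each order in whole days
age_in_days = (now - sales["order_date"]).dt.days

# A more presentable string version of the current time
label = now.strftime("%d %B %Y, %H:%M")
print(recent)
print(age_in_days.tolist())
print(label)
```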


### Pandas Timestamp
**Timestamp** is the standard Pandas datetime object.<br>Pandas is extremely flexible and can convert many string formats to a Timestamp using the `to_datetime` function

In [9]:
pd.to_datetime('2019 - 05 - 07')

Timestamp('2019-05-07 00:00:00')

In [10]:
pd.to_datetime('2018 / 10 / 19')

Timestamp('2018-10-19 00:00:00')

In [11]:
pd.to_datetime('2019, 4, 3')

Timestamp('2019-04-03 00:00:00')

In [12]:
pd.to_datetime('09-14-2015')

Timestamp('2015-09-14 00:00:00')

We can even flip the date. As long as the 'day' value is greater than 12, Pandas will understand what we mean

In [13]:
pd.to_datetime('14-09-2015')

Timestamp('2015-09-14 00:00:00')

If the day value is 12 or less and we use this format, Pandas will default to the 'American' format (MM-DD-YYYY)

In [14]:
pd.to_datetime('08-09-2015')

Timestamp('2015-08-09 00:00:00')
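When the string really is day-first, we do not have to rely on the default. A small sketch of two ways to say so explicitly, using the `dayfirst` and `format` parameters of `to_datetime` (the variable names are only for illustration):

```python
import pandas as pd

ambiguous = "08-09-2015"

# Default reading: month first (August 9th)
month_first = pd.to_datetime(ambiguous)

# Ask for a day-first reading (8th of September)
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# Same result, with the layout spelled out explicitly
explicit = pd.to_datetime(ambiguous, format="%d-%m-%Y")

print(month_first, day_first, explicit)
```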

We can even use month names

In [15]:
pd.to_datetime('23 March 2017')

Timestamp('2017-03-23 00:00:00')

And month abbreviations as well...

In [16]:
pd.to_datetime('jun 15, 2010')

Timestamp('2010-06-15 00:00:00')

In [17]:
pd.to_datetime('apr 23rd 2017')

Timestamp('2017-04-23 00:00:00')

The same goes for time parsing

In [18]:
pd.to_datetime('2000-12-07, 10:5:30')

Timestamp('2000-12-07 10:05:30')

In [19]:
pd.to_datetime('2000-12-07, 10:5:30 PM')

Timestamp('2000-12-07 22:05:30')

### Timestamp Attributes

In [20]:
Independence = pd.to_datetime('4th of July, 2000')
Independence

Timestamp('2000-07-04 00:00:00')

We can call various methods and attributes to extract different elements from our date object

In [22]:
print(Independence.day)
print(Independence.week)
print(Independence.weekday())
print(Independence.day_name())
print(Independence.month)
print(Independence.month_name())
print(Independence.days_in_month)
print(Independence.quarter)
print(Independence.year)
print(Independence.date())

4
27
1
Tuesday
7
July
31
3
2000
2000-07-04


We can also perform some boolean checks

In [25]:
print(Independence.is_leap_year)
print(Independence.is_month_start)
print(Independence.is_month_end)
print(Independence.is_quarter_start)
print(Independence.is_quarter_end)
print(Independence.is_year_start)
print(Independence.is_year_end)

True
False
False
False
False
False
False


### strftime()
We can use the ***strftime*** method to convert the dates back to strings to display them in any format we want.<br>More information at: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [26]:
Independence.strftime('%A, %d/%B/%Y')

'Tuesday, 04/July/2000'
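A few more format codes, as a rough sketch. The timestamp below adds an arbitrary time of day (14:30) so the clock-related codes have something to show, and the month and AM/PM names assume the default English locale.

```python
import pandas as pd

independence = pd.Timestamp(2000, 7, 4, 14, 30)

iso_date = independence.strftime("%Y-%m-%d")   # year-month-day
short = independence.strftime("%d %b %y")      # abbreviated month, 2-digit year
clock = independence.strftime("%I:%M %p")      # 12-hour clock with AM/PM

# The same codes work in the other direction through the format parameter
round_trip = pd.to_datetime(iso_date, format="%Y-%m-%d")

print(iso_date, short, clock, round_trip)
```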

## DatetimeIndex
By running the `to_datetime` function on a collection of dates we can generate a DatetimeIndex object

In [21]:
dates = ['01/01/2000', '2000-06-15', '2000-DEC-24']

Instead of doing this

In [22]:
pandas_dates = []
for date in dates:
    pandas_dates.append(pd.to_datetime(date))

pd.Series(pandas_dates)

0   2000-01-01
1   2000-06-15
2   2000-12-24
dtype: datetime64[ns]

We can pass the entire list to `to_datetime` in a single call

In [27]:
pd.to_datetime(dates)

DatetimeIndex(['2000-01-01', '2000-06-15', '2000-12-24'], dtype='datetime64[ns]', freq=None)

### format
If we are converting a large collection of dates (for example, a file with millions of records), we can specify a `format` to speed up the conversion process

In [4]:
dates = ['20/12/15', '15/06/18', '22/09/20']
pd.to_datetime(dates, format = '%d/%m/%y')

DatetimeIndex(['2015-12-20', '2018-06-15', '2020-09-22'], dtype='datetime64[ns]', freq=None)
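Besides speed, an explicit `format` also acts as a check: values that do not match the layout raise an error instead of being guessed. A minimal sketch of that behaviour (the `mismatch_detected` flag is just an illustrative name):

```python
import pandas as pd

dates = ["20/12/15", "15/06/18", "22/09/20"]

# The format tells pandas exactly how to read each piece: day/month/2-digit year
parsed = pd.to_datetime(dates, format="%d/%m/%y")

# A format that does not match the data is rejected with a ValueError
try:
    pd.to_datetime(dates, format="%Y-%m-%d")
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(parsed)
print(mismatch_detected)
```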

We will usually use a DatetimeIndex as the index of a Series or DataFrame

In [5]:
values = [100, 200, 300]
date_index = pd.to_datetime(dates, format = '%d/%m/%y')

In [6]:
pd.Series(data = values, index = date_index)

2015-12-20    100
2018-06-15    200
2020-09-22    300
dtype: int64
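One practical benefit of a datetime index is that rows can be selected with partial date strings through `.loc`. A short sketch using the same three values (the names `only_2018` and `from_2016` are illustrative):

```python
import pandas as pd

index = pd.to_datetime(["20/12/15", "15/06/18", "22/09/20"], format="%d/%m/%y")
s = pd.Series([100, 200, 300], index=index)

# Every row that falls inside the year 2018
only_2018 = s.loc["2018"]

# Every row from the start of 2016 onwards
from_2016 = s.loc["2016":]

print(only_2018)
print(from_2016)
```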

The `to_datetime` function can also convert an entire column of a DataFrame (a Series) to datetime objects.

In [9]:
dates = ['01/01/2000', '2000-06-15', '2000-DEC-24']
pd.to_datetime(dates)

DatetimeIndex(['2000-01-01', '2000-06-15', '2000-12-24'], dtype='datetime64[ns]', freq=None)

In [8]:
values = ['01/01/2000', '2000', 'abc', True]
ser = pd.Series(values)
ser

0    01/01/2000
1          2000
2           abc
3          True
dtype: object

### coerce
Trying to use ***to_datetime*** on a sequence that contains invalid values usually raises an error.<br>We can overcome this by setting the ***errors*** parameter to 'coerce'.<br>
The invalid values become NaT, which stands for "Not a Time". It is the equivalent of the NaN values we saw in numeric and text columns

In [87]:
pd.to_datetime(ser, errors = 'coerce')

0   2000-01-01
1   2000-01-01
2          NaT
3          NaT
dtype: datetime64[ns]
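After coercing, it is usually worth checking how many values were lost and which inputs caused them. A minimal sketch with a made-up Series (February 30th and 'abc' are deliberately invalid):

```python
import pandas as pd

raw = pd.Series(["2000-01-01", "2000-02-30", "abc"])

# Invalid entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors="coerce")

# Count the failures and look at the original strings behind them
n_bad = int(parsed.isna().sum())
bad_inputs = raw[parsed.isna()]

# Keep only the rows that were parsed successfully
clean = parsed.dropna()

print(n_bad)
print(bad_inputs.tolist())
print(clean)
```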

We can also convert numbers in the "Unix" format (number of seconds since January 1st, 1970) to dates by passing the ***unit*** parameter

In [74]:
pd.to_datetime([1500000000,1510000000,1520000000,1530000000,1540000000, 1550000000], unit='s')

DatetimeIndex(['2017-07-14 02:40:00', '2017-11-06 20:26:40',
               '2018-03-02 14:13:20', '2018-06-26 08:00:00',
               '2018-10-20 01:46:40', '2019-02-12 19:33:20'],
              dtype='datetime64[ns]', freq=None)
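The `unit` parameter also accepts other resolutions such as milliseconds, and a Timestamp can be turned back into epoch seconds by dividing its distance from 1970-01-01 by one second. A small sketch (variable names are illustrative):

```python
import pandas as pd

# The same moment expressed in seconds and in milliseconds
ts = pd.to_datetime(1500000000, unit="s")
same_ts = pd.to_datetime(1500000000000, unit="ms")

# Back to epoch seconds: distance from the Unix epoch, in whole seconds
epoch = pd.Timestamp("1970-01-01")
seconds = (ts - epoch) // pd.Timedelta(seconds=1)

print(ts, same_ts, seconds)
```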

## DateOffset
The ***DateOffset*** allows us to define time periods. Let's look at our currently defined Timestamp

In [27]:
Independence

Timestamp('2000-07-04 00:00:00')

Trying to add or subtract plain numbers from it will result in an error.<br>Instead, we can use ***DateOffset*** to define a period of time and use it in our calculation.<br>
The default is 1 day

In [31]:
# Independence + 1 would raise an error
Independence + pd.DateOffset()

Timestamp('2000-07-05 00:00:00')
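To make the error concrete, here is a short sketch that catches it and shows two equivalent ways of adding one day. It assumes a recent pandas version, where adding an integer to a Timestamp raises a `TypeError`; `pd.Timedelta` is used as an alternative for fixed-length periods.

```python
import pandas as pd

independence = pd.Timestamp(2000, 7, 4)

# Adding a bare integer is ambiguous (days? seconds?) and is rejected
try:
    independence + 1
    integer_addition_works = True
except TypeError:
    integer_addition_works = False

# Both of these state the unit explicitly
next_day = independence + pd.DateOffset(days=1)
same_next_day = independence + pd.Timedelta(days=1)

print(integer_addition_works, next_day, same_next_day)
```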

In [32]:
Independence + pd.DateOffset(days = 10)

Timestamp('2000-07-14 00:00:00')

In [33]:
Independence + pd.DateOffset(weeks = 2)

Timestamp('2000-07-18 00:00:00')

Months, quarters and even years can have different numbers of days, so defining them in this way can be very helpful

In [34]:
Independence + pd.DateOffset(months = 3)

Timestamp('2000-10-04 00:00:00')

In [40]:
Independence + pd.DateOffset(years = 2)

Timestamp('2002-07-04 00:00:00')
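The difference between a calendar-aware offset and a fixed number of days shows up most clearly at month ends and on leap days. A brief sketch (dates chosen only to illustrate the edge cases):

```python
import pandas as pd

jan_31 = pd.Timestamp(2000, 1, 31)

# One calendar month later: clamped to the last day of February
plus_month = jan_31 + pd.DateOffset(months=1)

# Thirty fixed days later: lands in March instead
plus_30_days = jan_31 + pd.Timedelta(days=30)

# One year after a leap day: clamped to February 28th
leap_day = pd.Timestamp(2000, 2, 29)
plus_year = leap_day + pd.DateOffset(years=1)

print(plus_month, plus_30_days, plus_year)
```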

We can give a negative argument to subtract the period

In [41]:
Independence + pd.DateOffset(years = -2)

Timestamp('1998-07-04 00:00:00')

We don't have to settle for a single element, we can pass as many as we want and in any order we want.<br>The following example will add 2 years, 4 months, 1 week, 5 days, 8 hours and 15 minutes.

In [42]:
Independence + pd.DateOffset(minutes = 15, months = 4, years = 2, hours = 8, weeks = 1, days = 5)

Timestamp('2002-11-16 08:15:00')

## Generating Datetime Ranges
We can use the ***date_range*** method to generate custom sequences of timestamps.<br>The simplest way is to define a start date and an end date

In [88]:
pd.date_range('Jan 1st, 2000', 'Jan 20th, 2000')

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14', '2000-01-15', '2000-01-16',
               '2000-01-17', '2000-01-18', '2000-01-19', '2000-01-20'],
              dtype='datetime64[ns]', freq='D')

The default gap between the values is 1 day but we can change that using the ***freq*** (frequency) parameter.<br>For example, 3 days

In [91]:
pd.date_range('Jan 1st, 2000', 'Jan 20th, 2000', freq = '3D')

DatetimeIndex(['2000-01-01', '2000-01-04', '2000-01-07', '2000-01-10',
               '2000-01-13', '2000-01-16', '2000-01-19'],
              dtype='datetime64[ns]', freq='3D')

We can ask to get only business days

In [92]:
pd.date_range('Jan 1st, 2000', 'Jan 20th, 2000', freq = 'B')

DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
               '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14', '2000-01-17', '2000-01-18',
               '2000-01-19', '2000-01-20'],
              dtype='datetime64[ns]', freq='B')

Using "M" will give us the end of month

In [93]:
pd.date_range('Jan 1st, 2000', 'Jan 20th, 2001', freq = 'M')

DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-30', '2000-10-31', '2000-11-30', '2000-12-31'],
              dtype='datetime64[ns]', freq='M')

If we want the first day of the month we'll use "MS" (Month Start)<br>Notice the extra value we get in this example compared to the previous one

In [94]:
pd.date_range('Jan 1st, 2000', 'Jan 20th, 2001', freq = 'MS')

DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01', '2000-04-01',
               '2000-05-01', '2000-06-01', '2000-07-01', '2000-08-01',
               '2000-09-01', '2000-10-01', '2000-11-01', '2000-12-01',
               '2001-01-01'],
              dtype='datetime64[ns]', freq='MS')

We can use "W" to specify week frequency

In [96]:
pd.date_range('Jan 1st, 2000', 'March 21th, 2000', freq = 'W')

DatetimeIndex(['2000-01-02', '2000-01-09', '2000-01-16', '2000-01-23',
               '2000-01-30', '2000-02-06', '2000-02-13', '2000-02-20',
               '2000-02-27', '2000-03-05', '2000-03-12', '2000-03-19'],
              dtype='datetime64[ns]', freq='W-SUN')

By default the weekly frequency is anchored on Sundays (`W-SUN`), but we can change that as well.<br>In this example we specify that we want to see the dates of all Fridays within the selected range

In [97]:
pd.date_range('Jan 1st, 2000', 'March 21th, 2000', freq = 'W-FRI')

DatetimeIndex(['2000-01-07', '2000-01-14', '2000-01-21', '2000-01-28',
               '2000-02-04', '2000-02-11', '2000-02-18', '2000-02-25',
               '2000-03-03', '2000-03-10', '2000-03-17'],
              dtype='datetime64[ns]', freq='W-FRI')
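A quick way to confirm what an anchored weekly frequency produces is to check the day names of the result. A minimal sketch over the same range, written with ISO date strings:

```python
import pandas as pd

fridays = pd.date_range("2000-01-01", "2000-03-21", freq="W-FRI")
sundays = pd.date_range("2000-01-01", "2000-03-21", freq="W")

# day_name() returns the weekday of every element in the index
all_fridays = bool((fridays.day_name() == "Friday").all())
all_sundays = bool((sundays.day_name() == "Sunday").all())

print(len(fridays), all_fridays)
print(len(sundays), all_sundays)
```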

We can use "Y" to get the end of each year

In [104]:
pd.date_range('Jan 1st, 2000', 'March 21th, 2012', freq = 'Y')

DatetimeIndex(['2000-12-31', '2001-12-31', '2002-12-31', '2003-12-31',
               '2004-12-31', '2005-12-31', '2006-12-31', '2007-12-31',
               '2008-12-31', '2009-12-31', '2010-12-31', '2011-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')

And "YS" for the Year Start

In [105]:
pd.date_range('Jan 1st, 2000', 'March 21th, 2012', freq = 'YS')

DatetimeIndex(['2000-01-01', '2001-01-01', '2002-01-01', '2003-01-01',
               '2004-01-01', '2005-01-01', '2006-01-01', '2007-01-01',
               '2008-01-01', '2009-01-01', '2010-01-01', '2011-01-01',
               '2012-01-01'],
              dtype='datetime64[ns]', freq='AS-JAN')

If we are interested in generating a certain number of dates we can define a number of "periods" instead of calculating the desired end date

In [4]:
pd.date_range('Jan 1st, 2000', periods=24, freq = '3 D')

DatetimeIndex(['2000-01-01', '2000-01-04', '2000-01-07', '2000-01-10',
               '2000-01-13', '2000-01-16', '2000-01-19', '2000-01-22',
               '2000-01-25', '2000-01-28', '2000-01-31', '2000-02-03',
               '2000-02-06', '2000-02-09', '2000-02-12', '2000-02-15',
               '2000-02-18', '2000-02-21', '2000-02-24', '2000-02-27',
               '2000-03-01', '2000-03-04', '2000-03-07', '2000-03-10'],
              dtype='datetime64[ns]', freq='3D')

Get a list of 20 business days

In [5]:
pd.date_range('Jan 1st, 2000', periods=20, freq = 'B')

DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05', '2000-01-06',
               '2000-01-07', '2000-01-10', '2000-01-11', '2000-01-12',
               '2000-01-13', '2000-01-14', '2000-01-17', '2000-01-18',
               '2000-01-19', '2000-01-20', '2000-01-21', '2000-01-24',
               '2000-01-25', '2000-01-26', '2000-01-27', '2000-01-28'],
              dtype='datetime64[ns]', freq='B')

We can state the end date and the number of periods we want to get until that date

In [6]:
pd.date_range(end = 'Jan 1st, 2000', periods=24, freq = '3 D')

DatetimeIndex(['1999-10-24', '1999-10-27', '1999-10-30', '1999-11-02',
               '1999-11-05', '1999-11-08', '1999-11-11', '1999-11-14',
               '1999-11-17', '1999-11-20', '1999-11-23', '1999-11-26',
               '1999-11-29', '1999-12-02', '1999-12-05', '1999-12-08',
               '1999-12-11', '1999-12-14', '1999-12-17', '1999-12-20',
               '1999-12-23', '1999-12-26', '1999-12-29', '2000-01-01'],
              dtype='datetime64[ns]', freq='3D')
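`date_range` needs three of its four main arguments (`start`, `end`, `periods`, `freq`). A sketch of the remaining combination, where `start`, `end` and `periods` are given and pandas spaces the values evenly, plus what happens when too little is specified:

```python
import pandas as pd

# start + end + periods: four evenly spaced timestamps between the two dates
evenly = pd.date_range(start="2000-01-01", end="2000-01-10", periods=4)

# Only a start date is not enough information to build a range
try:
    pd.date_range(start="2000-01-01")
    underspecified_ok = True
except ValueError:
    underspecified_ok = False

print(evenly)
print(underspecified_ok)
```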

If we need a datetime sequence that lands on a certain minute of every hour, or on a specific day of every month or year, the built-in freq aliases won't do, and we'll have to define our own custom period using a ***DateOffset*** object.
In the following example I'm trying to create a ***DatetimeIndex*** of my birthdays, starting from the year 2000. Using the 'Y' value makes the dates stick to the last day of every year instead of the day I've specified

In [44]:
birthdays = pd.date_range(start = '2000-03-21', periods = 20, freq = 'Y')
birthdays

DatetimeIndex(['2000-12-31', '2001-12-31', '2002-12-31', '2003-12-31',
               '2004-12-31', '2005-12-31', '2006-12-31', '2007-12-31',
               '2008-12-31', '2009-12-31', '2010-12-31', '2011-12-31',
               '2012-12-31', '2013-12-31', '2014-12-31', '2015-12-31',
               '2016-12-31', '2017-12-31', '2018-12-31', '2019-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')

And this is how we solve it using ***DateOffset***

In [45]:
birthdays = pd.date_range(start = '2000-03-21', periods = 20, freq = pd.DateOffset(years = 1))
birthdays

DatetimeIndex(['2000-03-21', '2001-03-21', '2002-03-21', '2003-03-21',
               '2004-03-21', '2005-03-21', '2006-03-21', '2007-03-21',
               '2008-03-21', '2009-03-21', '2010-03-21', '2011-03-21',
               '2012-03-21', '2013-03-21', '2014-03-21', '2015-03-21',
               '2016-03-21', '2017-03-21', '2018-03-21', '2019-03-21'],
              dtype='datetime64[ns]', freq='<DateOffset: years=1>')
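The same idea covers the other cases mentioned above, a fixed minute in every hour or a fixed day in every month. A minimal sketch (start values are arbitrary examples):

```python
import pandas as pd

# Quarter past every hour, starting at 00:15
quarter_past = pd.date_range(start="2000-01-01 00:15", periods=4,
                             freq=pd.DateOffset(hours=1))

# The 15th of every month, starting in January
mid_month = pd.date_range(start="2000-01-15", periods=3,
                          freq=pd.DateOffset(months=1))

print(quarter_past)
print(mid_month)
```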

### dt Sub Library
The `dt` accessor (the "dt sub library") lets us call the Timestamp methods and attributes on an entire datetime Series.<br>First let's import a file with some sales data

In [58]:
sales = pd.read_csv('JanuarySales2014.csv', index_col='SalesOrderID',usecols = ['SalesOrderID', 'OrderDate', 'DueDate', 'ShipDate', 'Status', 'Freight'])
sales.head(3)

| SalesOrderID | OrderDate | DueDate | ShipDate | Status | Freight |
|---|---|---|---|---|---|
| 63363 | 01/01/2014 00:00 | 13/01/2014 00:00 | 08/01/2014 00:00 | 5 | 0.682 |
| 63364 | 01/01/2014 00:00 | 13/01/2014 00:00 | 08/01/2014 00:00 | 5 | 0.7483 |
| 63365 | 01/01/2014 00:00 | 13/01/2014 00:00 | 08/01/2014 00:00 | 5 | 0.1248 |


The 3 date columns are currently treated as strings (type object).<br>We can change that by converting all of them to datetime using the ***astype*** method

In [61]:
# Recent pandas versions require an explicit unit such as 'datetime64[ns]'
sales[['OrderDate','DueDate','ShipDate']] = sales[['OrderDate','DueDate','ShipDate']].astype('datetime64[ns]')
sales.head(3)

| SalesOrderID | OrderDate | DueDate | ShipDate | Status | Freight |
|---|---|---|---|---|---|
| 63363 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.682 |
| 63364 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.7483 |
| 63365 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.1248 |


We can also do that in one go by using the ***parse_dates*** parameter when reading the file

In [74]:
sales = pd.read_csv('JanuarySales2014.csv', index_col='SalesOrderID',\
                    usecols = ['SalesOrderID', 'OrderDate', 'DueDate', 'ShipDate', 'Status', 'Freight'],\
                    parse_dates = ['OrderDate', 'DueDate', 'ShipDate'])

sales.head(3)

| SalesOrderID | OrderDate | DueDate | ShipDate | Status | Freight |
|---|---|---|---|---|---|
| 63363 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.682 |
| 63364 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.7483 |
| 63365 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.1248 |
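The ShipDate column above shows a pitfall of automatic parsing: the raw value 08/01/2014 sits next to 13/01/2014, so the file is evidently day-first, yet it was read as 2014-08-01. Below is a minimal sketch of one way to avoid this, assuming every date column follows the `DD/MM/YYYY HH:MM` layout seen in the first rows. The inline CSV text is a stand-in for the real file; passing `dayfirst = True` to `read_csv` is another option.

```python
import io
import pandas as pd

# Stand-in for the real file: the first rows shown earlier, dates stored day-first
csv_text = """SalesOrderID,OrderDate,DueDate,ShipDate
63363,01/01/2014 00:00,13/01/2014 00:00,08/01/2014 00:00
63364,01/01/2014 00:00,13/01/2014 00:00,08/01/2014 00:00
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col="SalesOrderID")

# Convert each date column with an explicit day-first format
for col in ["OrderDate", "DueDate", "ShipDate"]:
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y %H:%M")

print(sample)
```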


Now that Pandas understands that we are dealing with dates, we can use the ***dt*** accessor to extract any elements we need from them

In [28]:
# weekday_name and weekofyear were removed from recent pandas versions;
# day_name() and isocalendar().week replace them
sales['OrderDate'].sample(3).to_frame().assign(day = sales['OrderDate'].dt.day,
                                              day_name = sales['OrderDate'].dt.day_name(),
                                              week = sales['OrderDate'].dt.isocalendar().week,
                                              month_name = sales['OrderDate'].dt.month_name(),
                                              quarter = sales['OrderDate'].dt.quarter,
                                              year = sales['OrderDate'].dt.year)

| SalesOrderID | OrderDate | day | day_name | week | month_name | quarter | year |
|---|---|---|---|---|---|---|---|
| 63973 | 2014-01-10 | 10 | Friday | 2 | January | 1 | 2014 |
| 64878 | 2014-01-24 | 24 | Friday | 4 | January | 1 | 2014 |
| 64262 | 2014-01-15 | 15 | Wednesday | 3 | January | 1 | 2014 |


Group by day of the week

In [33]:
sales['weekday'] = sales['OrderDate'].dt.day_name()
sales.groupby('weekday').size().sort_values(ascending = False)

weekday
Wednesday    477
Thursday     339
Sunday       293
Friday       274
Monday       263
Saturday     248
Tuesday      247
dtype: int64
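Sorting by count mixes up the days of the week. If a calendar order is preferred, the counts can be reindexed against an explicit list of day names. A small sketch with made-up order dates (the `weekday_order` list is the only assumption):

```python
import pandas as pd

# Hypothetical order dates: two Wednesdays, one Thursday, one Sunday
orders = pd.Series(pd.to_datetime(
    ["2014-01-01", "2014-01-02", "2014-01-08", "2014-01-05"], format="%Y-%m-%d"))

counts = orders.dt.day_name().value_counts()

# Put the days in calendar order and fill in days without any orders
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
counts_by_day = counts.reindex(weekday_order, fill_value=0)

print(counts_by_day)
```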

### is_...
We have several attributes with the ***is_*** prefix to check various facts about our dates

In [41]:
some_dates = sales['OrderDate'].sample(5)
some_dates

SalesOrderID
64856   2014-01-24
65471   2014-01-31
65261   2014-01-29
63938   2014-01-09
64265   2014-01-15
Name: OrderDate, dtype: datetime64[ns]

In [44]:
some_dates.dt.is_month_end

SalesOrderID
64856    False
65471     True
65261    False
63938    False
64265    False
Name: OrderDate, dtype: bool

We can use these attributes to filter our DataFrame based on those criteria

In [48]:
sales[sales['OrderDate'].dt.is_month_end].head(3)

| SalesOrderID | OrderDate | DueDate | ShipDate | Status | Freight | weekday |
|---|---|---|---|---|---|---|
| 65452 | 2014-01-31 | 2014-02-12 | 2014-02-07 | 5 | 44.4488 | Friday |
| 65453 | 2014-01-31 | 2014-02-12 | 2014-02-07 | 5 | 43.8745 | Friday |
| 65454 | 2014-01-31 | 2014-02-12 | 2014-02-07 | 5 | 44.4868 | Friday |


In [50]:
sales[sales['DueDate'].dt.is_quarter_end].head(3)

| SalesOrderID | OrderDate | DueDate | ShipDate | Status | Freight | weekday |
|---|---|---|---|---|---|---|


### strftime()
And again, just like single Timestamps, we can format entire datetime Series with the ***strftime*** method

In [40]:
sales['OrderDate'].dt.strftime('%d//%m--%Y A.D').sample(3)

SalesOrderID
63597    04//01--2014 A.D
65027    27//01--2014 A.D
64159    13//01--2014 A.D
Name: OrderDate, dtype: object

### Difference Between Dates
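We can subtract dates from each other to get the distance (delta) between them.<br>In the following example we want to see if our shipments got to their destination in time. Keep in mind that the deltas are only as reliable as the parsed dates: the file stores its dates day-first, so an ambiguous value such as 08/01/2014 has been read month-first as 2014-08-01 in the table below.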

In [75]:
sales.head(3)

| SalesOrderID | OrderDate | DueDate | ShipDate | Status | Freight |
|---|---|---|---|---|---|
| 63363 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.682 |
| 63364 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.7483 |
| 63365 | 2014-01-01 | 2014-01-13 | 2014-08-01 | 5 | 0.1248 |


In [81]:
sales['Days Ahead'] = sales['DueDate'] - sales['ShipDate']
sales.sample(3)

| SalesOrderID | OrderDate | DueDate | ShipDate | Status | Freight | Days Ahead |
|---|---|---|---|---|---|---|
| 64961 | 2014-01-26 | 2014-07-02 | 2014-02-02 | 5 | 1.2498 | 150 days |
| 64980 | 2014-01-26 | 2014-07-02 | 2014-02-02 | 5 | 2.2493 | 150 days |
| 64874 | 2014-01-24 | 2014-05-02 | 2014-01-31 | 5 | 0.1248 | 91 days |


Find all shipments that were late (a negative delta means the order shipped after its due date)

In [86]:
late_shipments = sales['Days Ahead'] < pd.Timedelta(days = 0)
sales[late_shipments].sort_values('Days Ahead', ascending=False).head(3)

| SalesOrderID | OrderDate | DueDate | ShipDate | Status | Freight | Days Ahead |
|---|---|---|---|---|---|---|
| 64676 | 2014-01-20 | 2014-01-02 | 2014-01-27 | 5 | 15.224 | -25 days |
| 64643 | 2014-01-20 | 2014-01-02 | 2014-01-27 | 5 | 2.999 | -25 days |
| 64641 | 2014-01-20 | 2014-01-02 | 2014-01-27 | 5 | 0.6248 | -25 days |
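The result of subtracting two datetime columns is a timedelta column, which has its own `dt` attributes. A minimal sketch with made-up shipments (the `shipments` frame and the `days_ahead_int` column name are hypothetical), showing how to get plain integers and how to filter against a `pd.Timedelta`:

```python
import pandas as pd

# Hypothetical shipments: one early, one late, one exactly on time
shipments = pd.DataFrame({
    "DueDate": pd.to_datetime(["2014-01-13", "2014-01-20", "2014-02-07"],
                              format="%Y-%m-%d"),
    "ShipDate": pd.to_datetime(["2014-01-08", "2014-01-22", "2014-02-07"],
                               format="%Y-%m-%d"),
}, index=pd.Index([1, 2, 3], name="OrderID"))

# Timedelta column: positive means shipped before the due date
shipments["Days Ahead"] = shipments["DueDate"] - shipments["ShipDate"]

# Whole days as integers, convenient for arithmetic or plotting
shipments["days_ahead_int"] = shipments["Days Ahead"].dt.days

# Late shipments have a negative delta
late = shipments[shipments["Days Ahead"] < pd.Timedelta(0)]

print(shipments)
print(late)
```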
