***

# Dates and times

Date and time data, like text data, tend to cause particular difficulties for data analysts, but hopefully you will soon agree that `pandas` offers excellent tools for handling dates and times! Base `Python` offers the `datetime` module with three date and time related classes:

* `date`: consists of year, month and day
* `time`: hours, minutes, seconds and microseconds
* `datetime`: combination of `date` and `time`

By way of contrast, `pandas` offers a single `Timestamp` class (built upon `numpy`'s `datetime64` type).
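
As a minimal sketch of how these classes relate (the variable names and the example time of day are made up for illustration), we can build the same instant with base `Python` and with `pandas` and compare them:

```python
import datetime as dt
import pandas as pd

# Base Python: three separate classes
d = dt.date(2020, 4, 11)               # year, month, day
t = dt.time(14, 30, 15)                # hours, minutes, seconds
combined = dt.datetime.combine(d, t)   # date and time together

# pandas: a single class covers the same ground
ts = pd.Timestamp("2020-04-11 14:30:15")

print(isinstance(ts, dt.datetime))   # a Timestamp is also a datetime
print(ts == combined)                # both describe the same instant
print(ts.to_datetime64())            # the underlying numpy datetime64 value
```

Because a `Timestamp` is also a `datetime`, it can generally be passed to code that expects the base `Python` class.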


## Setup

In [1]:
import pandas as pd
import numpy as np

feedback = pd.DataFrame({
    'item_no': pd.Series([2, 2, 3, 4, 5, 1, 9, 5, 7, 10, 8], dtype='Int64'),
    'date': pd.Series(['2020-04-11', '2020-04-12', '2020-05-13', np.nan, '2020-05-28', '2020-05-29',
                       '2020-06-01', '2020-06-07', '2020-06-300', '2020-06-30', '2020-08-01']),
    'rating': pd.Series([5, 1, 3, 5, 4, 3, 2, 5, 1, 4, 5], dtype='Int64'),
    'message': pd.Series(["Ideal for my lunchbox - Dave Smith", "Broke first time I used it, I want a refund! Get back to me at lenore29@gmail.com or 07700 900796",
                        "My name is Tony 07700900829", "Bought another one for my sister", "Works pretty well, but can't handle carrots", 
                        "The concept is great, the execution- not so great, thin handles - Eleanor & dave", "Bit of a cheap version of the real thing",
                        "Arrived on time, as expected", "Customer service terrible - hello anyone there?! DaveAllsop@yahoo.co.uk, 07700 900572 or 0131 9496 0886", 
                        "Workks well, seems solid, good value", "Great finish on it, really decent build quality"], dtype='string')
})


## Converting values to `Timestamp`s

In [2]:
feedback.date

0      2020-04-11
1      2020-04-12
2      2020-05-13
3             NaN
4      2020-05-28
5      2020-05-29
6      2020-06-01
7      2020-06-07
8     2020-06-300
9      2020-06-30
10     2020-08-01
Name: date, dtype: object

This `Series` currently has `object` dtype (it holds strings), but we can try to convert it to `Timestamp`s using the `pandas` function `pd.to_datetime()`. We set the argument `yearfirst=True` to say that, in the strings given, the year comes first, i.e. dates are written '2021-01-01' and not '01-01-2021'.

In [3]:
feedback['date'] = pd.to_datetime(feedback.date, yearfirst=True)

ParserError: Unknown string format: 2020-06-300

An exception is raised: `pandas` can't parse one of the dates, presumably because June doesn't have 300 days. We can either correct this manually, or go ahead with the conversion, passing the argument `errors='coerce'`. This has the effect that any string that can't be converted to a `Timestamp` is replaced with `pd.NaT` ('Not a Time'), the date-time equivalent of a missing value.

In [4]:
feedback['date'] = pd.to_datetime(feedback.date, errors='coerce', yearfirst=True)
feedback.date

0    2020-04-11
1    2020-04-12
2    2020-05-13
3           NaT
4    2020-05-28
5    2020-05-29
6    2020-06-01
7    2020-06-07
8           NaT
9    2020-06-30
10   2020-08-01
Name: date, dtype: datetime64[ns]
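
One drawback of `errors='coerce'` is that it silently hides which values failed. The sketch below, using a small made-up `raw` `Series` rather than the `feedback` data, shows one possible way to list the strings that could not be parsed. It also passes an explicit `format=` string, an alternative to `yearfirst=True` that is not used elsewhere in this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing value and one impossible date
raw = pd.Series(["2020-04-11", np.nan, "2020-06-300", "2020-06-30"])

# An explicit format avoids any guessing about the order of the components
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Values present before parsing but NaT afterwards are the ones that failed
failed = raw[raw.notna() & parsed.isna()]
print(failed)
```

Rows that were already missing are excluded by the `raw.notna()` condition, so only genuinely unparseable strings are reported.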

## `Timestamp` and `Timedelta`

What `type` is a single element in this column after conversion with `pd.to_datetime()`?

In [5]:
feedback.loc[0, 'date']

Timestamp('2020-04-11 00:00:00')

It is a `Timestamp` object. As mentioned above, this is a `pandas` specific class for holding information about an instant in time. OK, what time range does the data span?

In [6]:
date_range = feedback.date.max() - feedback.date.min()
date_range

Timedelta('112 days 00:00:00')

We get a `Timedelta` object returned. 'Delta' is a maths term meaning 'difference' (often written as $\delta$ for a 'small' change or $\Delta$ otherwise), so this is a 'time difference' of 112 days.

What happens if we add a `Timedelta` object to a `Timestamp`? This creates another `Timestamp` object. The logic is as follows:

**point in time(`Timestamp`)** + **time range(`Timedelta`)** = **point in time(`Timestamp`)**

because, as we saw above

**time range(`Timedelta`)** = **point in time(`Timestamp`)** - **point in time(`Timestamp`)**

In [7]:
pd.Timestamp("2020-01-01 12:00") + pd.Timedelta("1 day 1 hour")

Timestamp('2020-01-02 13:00:00')

So the following should be `True`: 

In [8]:
feedback.date.min() + date_range == feedback.date.max()

True
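
A `Timedelta` can also be turned into plain numbers. The following is a small sketch using the same two end dates as above (the name `span` is just for illustration):

```python
import pandas as pd

span = pd.Timestamp("2020-08-01") - pd.Timestamp("2020-04-11")

print(span.days)                       # whole days in the difference
print(span.total_seconds())            # the same span expressed in seconds
print(span / pd.Timedelta("7 days"))   # dividing two Timedeltas gives a plain number
```

Dividing by another `Timedelta` is a convenient way to express a span in whatever unit you like, here weeks.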

## Using the `.dt` accessor

Date time `Series` in `pandas` have a wide variety of `DatetimeProperties` available through the **`.dt` accessor**. This is similar to the `.str` accessor for `StringMethods` we saw above. Let's see an example

***Get the day of the week for each date***

In [9]:
feedback.date.dt.day_name()

0      Saturday
1        Sunday
2     Wednesday
3           NaN
4      Thursday
5        Friday
6        Monday
7        Sunday
8           NaN
9       Tuesday
10     Saturday
Name: date, dtype: object

***Get the week of the year for each date***

You can access `year`, `week` and `day` components of the `Timestamp`s through the `.isocalendar()` method. This method provides components corresponding to the `ISO 8601` standard: an unambiguous calendar that is understood internationally. 

In [10]:
feedback.date.dt.isocalendar()

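| | year | week | day |
| :-: | :-: | :-: | :-: |
| 0 | 2020.0 | 15.0 | 6.0 |
| 1 | 2020.0 | 15.0 | 7.0 |
| 2 | 2020.0 | 20.0 | 3.0 |
| 3 | | | |
| 4 | 2020.0 | 22.0 | 4.0 |
| 5 | 2020.0 | 22.0 | 5.0 |
| 6 | 2020.0 | 23.0 | 1.0 |
| 7 | 2020.0 | 23.0 | 7.0 |
| 8 | | | |
| 9 | 2020.0 | 27.0 | 2.0 |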


In [11]:
feedback.date.dt.isocalendar().week

0       15
1       15
2       20
3     <NA>
4       22
5       22
6       23
7       23
8     <NA>
9       27
10      31
Name: week, dtype: UInt32
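
One thing to watch for: the ISO year is not always the same as the calendar year, because ISO weeks always run Monday to Sunday and belong wholly to one ISO year. The sketch below uses a few made-up dates around New Year (not from the `feedback` data) to illustrate this:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-12-31", "2021-01-01", "2021-01-04"]))

iso = dates.dt.isocalendar()
comparison = pd.DataFrame({
    "date": dates,
    "calendar_year": dates.dt.year,   # year as written in the date
    "iso_year": iso["year"],          # year the ISO week belongs to
    "iso_week": iso["week"],
})
print(comparison)
```

Here 1 January 2021 falls in ISO week 53 of ISO year 2020, so grouping by `week` alone, or pairing the ISO week with the calendar year, can mislabel dates near a year boundary.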

***Get the month for each date***

The month name is available directly from the `.dt` accessor via the `.month_name()` method

In [12]:
feedback.date.dt.month_name()

0      April
1      April
2        May
3        NaN
4        May
5        May
6       June
7       June
8        NaN
9       June
10    August
Name: date, dtype: object
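
The `.dt` accessor also exposes numeric components, which are often handier for sorting or filtering than names. A brief sketch on a made-up `Series` (including one missing date):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-04-11", None, "2020-08-01"]))

print(dates.dt.month)         # month as a number (January = 1)
print(dates.dt.dayofweek)     # day of week as a number (Monday = 0)
print(dates.dt.month_name())  # month as a string, as above
```

As with `.day_name()` and `.month_name()`, missing dates simply produce missing values in the result.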

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 2 mins</u>**

Obtain the quarter of the year each `date` falls within. Add it as an extra column `quarter` to the `feedback` `DataFrame`.

**Solution**

In [13]:
feedback.loc[:, 'quarter'] = feedback.date.dt.quarter
feedback

| | item_no | date | rating | message | quarter |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 2 | 2020-04-11 | 5 | Ideal for my lunchbox - Dave Smith | 2.0 |
| 1 | 2 | 2020-04-12 | 1 | Broke first time I used it, I want a refund! G... | 2.0 |
| 2 | 3 | 2020-05-13 | 3 | My name is Tony 07700900829 | 2.0 |
| 3 | 4 | NaT | 5 | Bought another one for my sister | |
| 4 | 5 | 2020-05-28 | 4 | Works pretty well, but can't handle carrots | 2.0 |
| 5 | 1 | 2020-05-29 | 3 | The concept is great, the execution- not so gr... | 2.0 |
| 6 | 9 | 2020-06-01 | 2 | Bit of a cheap version of the real thing | 2.0 |
| 7 | 5 | 2020-06-07 | 5 | Arrived on time, as expected | 2.0 |
| 8 | 7 | NaT | 1 | Customer service terrible - hello anyone there... | |
| 9 | 10 | 2020-06-30 | 4 | Workks well, seems solid, good value | 2.0 |


***

<hr style="border:8px solid black"> </hr>

## Optional: Methods available with a `DatetimeIndex`

Some date-time manipulations in `pandas` are easier with (or, indeed, are only possible if) the `DataFrame` has a `DatetimeIndex`. This is easy to arrange if we already have a column containing `Timestamp`s

In [14]:
feedback.set_index('date', inplace=True, drop=True)
feedback.index

DatetimeIndex(['2020-04-11', '2020-04-12', '2020-05-13',        'NaT',
               '2020-05-28', '2020-05-29', '2020-06-01', '2020-06-07',
                      'NaT', '2020-06-30', '2020-08-01'],
              dtype='datetime64[ns]', name='date', freq=None)

Now that we have a `DatetimeIndex`, a number of methods become available to us. One of the most useful is `.resample()`. Let's see an example of it in use:

***Get a count of the number of ratings supplied by users each month.***

In [15]:
# M for monthly
feedback.rating.resample(rule='M').count()

date
2020-04-30    2
2020-05-31    3
2020-06-30    3
2020-07-31    0
2020-08-31    1
Name: rating, dtype: int64

So, `pandas` has aggregated all ratings provided each **month** (`rule='M'`) together and we then apply the `.count()` aggregator. Note however that this isn't a conventional aggregation as we have seen them before using `.groupby()`: even though there were no ratings in July 2020, `.resample()` has provided this month anyway, as it understands the concept of providing date-time data at a **fixed frequency** (monthly in this case).

If we would prefer to see months labelled by their start dates, we could do:

In [16]:
# MS for month start
feedback.rating.resample(rule='MS').count()

date
2020-04-01    2
2020-05-01    3
2020-06-01    3
2020-07-01    0
2020-08-01    1
Name: rating, dtype: int64
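
To make the contrast with a conventional aggregation concrete, here is a small sketch on a made-up `ratings` `Series` (not the `feedback` data) that counts ratings per month in two ways:

```python
import pandas as pd

# Hypothetical ratings with no entries in July
ratings = pd.Series(
    [5, 1, 3, 4, 5],
    index=pd.to_datetime(
        ["2020-04-11", "2020-04-12", "2020-05-13", "2020-06-30", "2020-08-01"]
    ),
)

by_groupby = ratings.groupby(ratings.index.month).count()
by_resample = ratings.resample("MS").count()

print(by_groupby)    # only months that actually occur in the data
print(by_resample)   # every month in the range, including empty ones
```

The `.groupby()` result has no entry for July, while the `.resample()` result reports it with a count of zero.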

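Here are the frequency aliases that can be provided for the `rule=` argument. Note that `pandas` 2.2 and later deprecate several of these in favour of new spellings (for example `ME` instead of `M`, `QE` instead of `Q`, `YE` instead of `A`/`Y`, and lowercase `h`, `min` and `s` instead of `H`, `T` and `S`), so check the documentation for the version you have installed.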

| Alias | Description |
| :-: | :-: |
| B | business day frequency |
| C | custom business day frequency |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| SM | semi-month end frequency (15th and end of month) |
| BM | business month end frequency |
| CBM | custom business month end frequency |
| MS | month start frequency |
| SMS | semi-month start frequency (1st and 15th) |
| BMS | business month start frequency |
| CBMS | custom business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A, Y | year end frequency |
| BA, BY | business year end frequency |
| AS, YS | year start frequency |
| BAS, BYS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T, min | minutely frequency |
| S | secondly frequency |
| L, ms | milliseconds |
| U, us | microseconds |
| N | nanoseconds |
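
A quick way to get a feel for what an alias means is to generate a few dates with it. The sketch below uses `pd.date_range()`, which is not used elsewhere in this lesson but accepts the same aliases through its `freq=` argument. The start dates are arbitrary choices for illustration.

```python
import pandas as pd

# 2020-01-03 is a Friday, so business-day steps skip the weekend
business_days = pd.date_range("2020-01-03", periods=4, freq="B")

# Quarter-start frequency
quarter_starts = pd.date_range("2020-01-01", periods=3, freq="QS")

# Aliases can be prefixed with a multiple, e.g. every 2 weeks
fortnights = pd.date_range("2020-01-05", periods=3, freq="2W")

print(business_days)
print(quarter_starts)
print(fortnights)
```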

<hr style="border:8px solid black"> </hr>

***

**<u>Task - 5 mins</u>**

***Get the count of ratings left each week.*** 

* Why do you see lots of zeroes?
* **Extension** - Create this instead as a `DataFrame` called `weekly_rating_count` (**Hint**: `pd.DataFrame()`), and then add a new column `day` containing the `day_name()` of the `DatetimeIndex`. What day does `pandas` report for weekly frequency?

**Solution**

In [17]:
feedback.rating.resample(rule='W').count()

date
2020-04-12    2
2020-04-19    0
2020-04-26    0
2020-05-03    0
2020-05-10    0
2020-05-17    1
2020-05-24    0
2020-05-31    2
2020-06-07    2
2020-06-14    0
2020-06-21    0
2020-06-28    0
2020-07-05    1
2020-07-12    0
2020-07-19    0
2020-07-26    0
2020-08-02    1
Name: rating, dtype: int64

We see lots of zeroes because `.resample()` produces a result whose `DatetimeIndex` has a **fixed frequency** (here, a `Series` with one entry per week). So even though many weeks had no ratings, those weeks are still reported.

In [18]:
# Extension
weekly_rating_count = pd.DataFrame({'rating_count': feedback.rating.resample(rule='W').count()})
weekly_rating_count.loc[:, 'day'] = weekly_rating_count.index.day_name()
weekly_rating_count

| date | rating_count | day |
| :-: | :-: | :-: |
| 2020-04-12 | 2 | Sunday |
| 2020-04-19 | 0 | Sunday |
| 2020-04-26 | 0 | Sunday |
| 2020-05-03 | 0 | Sunday |
| 2020-05-10 | 0 | Sunday |
| 2020-05-17 | 1 | Sunday |
| 2020-05-24 | 0 | Sunday |
| 2020-05-31 | 2 | Sunday |
| 2020-06-07 | 2 | Sunday |
| 2020-06-14 | 0 | Sunday |
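
The weekly labels above all fall on a Sunday because the plain `W` alias means weeks ending on Sunday. If a different week end suits your data better, an anchored alias such as `W-MON` can be used instead. A small sketch on made-up data (the `ratings` `Series` here is hypothetical):

```python
import pandas as pd

# Saturday, Sunday and Monday of one long weekend
ratings = pd.Series(
    [5, 1, 3],
    index=pd.to_datetime(["2020-04-11", "2020-04-12", "2020-04-13"]),
)

weekly_sun = ratings.resample("W").count()       # default: weeks end on Sunday
weekly_mon = ratings.resample("W-MON").count()   # weeks end on Monday instead

print(weekly_sun)
print(weekly_mon)
print(weekly_mon.index.day_name().unique())
```

Note that moving the week end changes which bin each rating falls into, not just the labels.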


***

<hr style="border:8px solid black"> </hr>

The object returned by the `.resample()` method is a `DatetimeIndexResampler` (!), e.g.

***Resample the feedback on a quarterly basis***

In [19]:
type(feedback.resample(rule='Q'))

pandas.core.resample.DatetimeIndexResampler

We can call `.agg()` on this object, just as for a `DataFrameGroupBy` object! Let's see an example applying different aggregators to different columns:

In [20]:
feedback.resample(rule='Q').agg({
    'item_no': 'count', 
    'rating': ['count', 'min', 'max']
})

| date | item_no count | rating count | rating min | rating max |
| :-: | :-: | :-: | :-: | :-: |
| 2020-06-30 | 8 | 8 | 1 | 5 |
| 2020-09-30 | 1 | 1 | 5 | 5 |
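
The same idea works when resampling a single `Series`: passing a list of aggregator names to `.agg()` gives one column per aggregator. This is only a sketch on made-up data, and it uses the quarter-start alias `QS` so that each quarter is labelled by its first day:

```python
import pandas as pd

# Hypothetical ratings: three in Q2 2020 and one in Q3 2020
ratings = pd.Series(
    [5, 1, 3, 5],
    index=pd.to_datetime(["2020-04-11", "2020-05-13", "2020-06-30", "2020-08-01"]),
    name="rating",
)

summary = ratings.resample("QS").agg(["count", "min", "max", "mean"])
print(summary)
```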


What if you want to group by a date-time column **along with another column**? Well, in that case, you will first want the date-time column to be just another column in your `DataFrame` (and not the `index`) 

In [21]:
feedback.reset_index(inplace=True)
feedback

| | date | item_no | rating | message | quarter |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 2020-04-11 | 2 | 5 | Ideal for my lunchbox - Dave Smith | 2.0 |
| 1 | 2020-04-12 | 2 | 1 | Broke first time I used it, I want a refund! G... | 2.0 |
| 2 | 2020-05-13 | 3 | 3 | My name is Tony 07700900829 | 2.0 |
| 3 | NaT | 4 | 5 | Bought another one for my sister | |
| 4 | 2020-05-28 | 5 | 4 | Works pretty well, but can't handle carrots | 2.0 |
| 5 | 2020-05-29 | 1 | 3 | The concept is great, the execution- not so gr... | 2.0 |
| 6 | 2020-06-01 | 9 | 2 | Bit of a cheap version of the real thing | 2.0 |
| 7 | 2020-06-07 | 5 | 5 | Arrived on time, as expected | 2.0 |
| 8 | NaT | 7 | 1 | Customer service terrible - hello anyone there... | |
| 9 | 2020-06-30 | 10 | 4 | Workks well, seems solid, good value | 2.0 |


Now we can group by `date` using a `pd.Grouper` object (this can handle the concept of **frequency** for a date-time aggregation). Let's reproduce the weekly count we obtained above with `.resample()`:

In [23]:
feedback.groupby(pd.Grouper(key='date', freq='W')).rating.count()

date
2020-04-12    2
2020-04-19    0
2020-04-26    0
2020-05-03    0
2020-05-10    0
2020-05-17    1
2020-05-24    0
2020-05-31    2
2020-06-07    2
2020-06-14    0
2020-06-21    0
2020-06-28    0
2020-07-05    1
2020-07-12    0
2020-07-19    0
2020-07-26    0
2020-08-02    1
Name: rating, dtype: int64

Now, though, we are free to also group by additional columns, if we wish

In [24]:
date_item_weekly_count = feedback\
    .groupby([pd.Grouper(key='date', freq='W'), 'item_no'])\
    .rating.count()
date_item_weekly_count

date        item_no
2020-04-12  2          2
2020-05-17  3          1
2020-05-31  1          1
            5          1
2020-06-07  5          1
            9          1
2020-07-05  10         1
2020-08-02  8          1
Name: rating, dtype: int64

This returns a `Series` with a `MultiIndex`, similar to the behaviour of `.groupby()` when you group by two or more variables in a `DataFrame`.

In [25]:
date_item_weekly_count.index

MultiIndex([('2020-04-12',  2),
            ('2020-05-17',  3),
            ('2020-05-31',  1),
            ('2020-05-31',  5),
            ('2020-06-07',  5),
            ('2020-06-07',  9),
            ('2020-07-05', 10),
            ('2020-08-02',  8)],
           names=['date', 'item_no'])
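
A `MultiIndex` result is not always the most convenient shape to work with. The sketch below, on a small made-up `feedback_small` `DataFrame`, shows two possible ways of reshaping such a result: `.reset_index()` to get ordinary columns back, and `.unstack()` to get one column per item.

```python
import pandas as pd

# Hypothetical cut-down version of the feedback data
feedback_small = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-11", "2020-04-12", "2020-05-28", "2020-05-29"]),
    "item_no": [2, 2, 5, 1],
    "rating": [5, 1, 4, 3],
})

counts = (
    feedback_small
    .groupby([pd.Grouper(key="date", freq="W"), "item_no"])
    .rating.count()
)

# Option 1: back to ordinary columns, one row per (week, item) pair
flat = counts.reset_index(name="rating_count")

# Option 2: one row per week, one column per item, zero where no ratings
wide = counts.unstack("item_no", fill_value=0)

print(flat)
print(wide)
```

Which shape is preferable depends on what comes next: the long form suits further grouping or plotting libraries that expect tidy data, while the wide form is easier to read as a table.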