In [1]:
import pandas as pd
import random

In [2]:
dogs = pd.read_csv('https://raw.githubusercontent.com/SimonCarryer/pandas_tutorial/master/data/dog_registrations.csv')

# Working with Dates and Times

Pandas provides a whole lot of clever methods for working with dates and times. They use some different syntax in some cases, but most of this should be fairly familiar from previous tutorials.

## Converting strings to dates 

The easiest way to work with time-series data is to convert it to a `Timestamp` format. Pandas has a function for this, `to_datetime`, which infers the correct value for most regularly structured date and time strings. Confusingly, despite the name, the values it produces are NOT standard `datetime` objects but something else. Because it infers the date from a string, it can be really slow on large columns. For that reason, it's a good idea to convert all the data once, before you do anything else, and to check that it has inferred the dates correctly.

In [3]:
dogs['ValidDate_dt'] = pd.to_datetime(dogs['ValidDate'])

In [4]:
dogs[['ValidDate', 'ValidDate_dt']].head()

,ValidDate,ValidDate_dt
0,5/1/2007 15:15,2007-05-01 15:15:00
1,5/1/2007 15:15,2007-05-01 15:15:00
2,4/11/2007 15:14,2007-04-11 15:14:00
3,4/5/2007 15:00,2007-04-05 15:00:00
4,5/25/2007 12:15,2007-05-25 12:15:00
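
If you already know how the strings are laid out, you can tell `to_datetime` rather than leaving it to guess. The sketch below uses a few made-up strings in the same month/day/year layout as `ValidDate`; the `format` string is my assumption about that layout, so check it against your own data.

```python
import pandas as pd

# Made-up sample in the same layout as the ValidDate column.
raw = pd.Series(['5/1/2007 15:15', '4/11/2007 15:14', '5/25/2007 12:15'])

# Spelling out the format skips the inference step and removes
# any ambiguity between day-first and month-first dates.
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')

# Sanity check: '4/11/2007' should be 11 April, not 4 November.
print(parsed.dt.month.tolist())
```

Printing the months is a cheap way to confirm the dates were read month-first.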


It's worth knowing a little bit more about what's going on in there. Remember when I said it's not a `datetime` format? It's actually a special Pandas class, called `Timestamp`.

In [5]:
dogs['ValidDate_dt'][0]

Timestamp('2007-05-01 15:15:00')

A `Timestamp` is for the most part exactly the same as a `datetime` object, but there are a few differences which we'll get to later. For now you can use the column just like any other - for example using it to group your data.

In [6]:
dogs.groupby('ValidDate_dt')['DogName'].count().head(10)

ValidDate_dt
2006-12-12 09:00:00    1
2006-12-12 09:07:00    1
2006-12-12 09:11:00    1
2006-12-12 09:15:00    1
2006-12-12 09:40:00    4
2006-12-12 09:55:00    1
2006-12-12 10:19:00    6
2006-12-12 10:30:00    1
2006-12-12 10:54:00    1
2006-12-12 10:58:00    9
Name: DogName, dtype: int64
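
On the point that a `Timestamp` is mostly the same as a `datetime`: here is a small sketch, using a hand-written `Timestamp` rather than the `dogs` data, showing that it passes as a `datetime` and can be converted to a plain one if some other library insists on it.

```python
import datetime
import pandas as pd

ts = pd.Timestamp('2007-05-01 15:15:00')

# Timestamp is a subclass of the standard-library datetime,
# so it passes isinstance checks.
is_datetime = isinstance(ts, datetime.datetime)

# Convert to a plain datetime when you need the standard type.
plain = ts.to_pydatetime()

print(is_datetime, type(plain).__name__, ts.year, ts.month)
```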

### Problems

* Convert the `year` column of the `income` DataFrame to `Timestamp`. (Note - this one is trickier than it looks - you might have to Google)

## Using a DatetimeIndex

The above is cool and all, but many of the most useful benefits of the `Timestamp` format come when you set them as the `index` of your DataFrame. We haven't talked much about indexes yet, so I'll go into that a little bit now.

Here's our `dogs` DataFrame.

In [7]:
dogs.head()

,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate,ValidDate_dt
0,Dog Individual Female,AM PIT BULL TERRIER,SPOTTED,BUTTER,15001,2007,5/1/2007 15:15,2007-05-01 15:15:00
1,Dog Individual Female,AM PIT BULL TERRIER,BROWN,SABLE,15001,2007,5/1/2007 15:15,2007-05-01 15:15:00
2,Dog Individual Neutered Male,MIXED,.,YIP,15001,2007,4/11/2007 15:14,2007-04-11 15:14:00
3,Dog Individual Male,DOBERMAN PINSCHER,RED,SABER,15003,2007,4/5/2007 15:00,2007-04-05 15:00:00
4,Dog Individual Spayed Female,MIXED,BLACK,DAISY,15003,2007,5/25/2007 12:15,2007-05-25 12:15:00


See on the left there where there's a column of just numbers (0, 1, 2...)? That's the DataFrame's `index`. You can access it as an attribute of the DataFrame.

In [8]:
dogs.index

RangeIndex(start=0, stop=97462, step=1)

You can see that `dogs` currently has what's called a `RangeIndex`: a sequence of integers that increases by a fixed step. The `loc` attribute looks rows up by their label in the DataFrame's `index`, while `iloc` looks them up by integer position. You can replace the index with any series of values of the same length - for example another column from the DataFrame.

In [9]:
dogs.index = dogs['DogName']
dogs.head()

DogName,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate,ValidDate_dt
BUTTER,Dog Individual Female,AM PIT BULL TERRIER,SPOTTED,BUTTER,15001,2007,5/1/2007 15:15,2007-05-01 15:15:00
SABLE,Dog Individual Female,AM PIT BULL TERRIER,BROWN,SABLE,15001,2007,5/1/2007 15:15,2007-05-01 15:15:00
YIP,Dog Individual Neutered Male,MIXED,.,YIP,15001,2007,4/11/2007 15:14,2007-04-11 15:14:00
SABER,Dog Individual Male,DOBERMAN PINSCHER,RED,SABER,15003,2007,4/5/2007 15:00,2007-04-05 15:00:00
DAISY,Dog Individual Spayed Female,MIXED,BLACK,DAISY,15003,2007,5/25/2007 12:15,2007-05-25 12:15:00


In [10]:
dogs.index

Index(['BUTTER', 'SABLE', 'YIP', 'SABER', 'DAISY', 'SCOOTER', 'TINKY', 'AMICA',
       'TAFFY', 'BELLE',
       ...
       'TISHA', 'TEDDY', 'DOLLY', 'DUKE', 'SNOWBALL', 'HOOCH', 'ROSCO',
       'MAXIMUS', 'WALTER', 'ROXY'],
      dtype='object', name='DogName', length=97462)

Now `dogs` has an index which is the name of each dog. We can still use `loc` to get specific rows where the index matches what we're after. For example, we can get all the dogs called "CHOMPERS".

In [11]:
dogs.loc['CHOMPERS']

DogName,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate,ValidDate_dt
CHOMPERS,Dog Individual Neutered Male,MIXED,WHITE/BLACK,CHOMPERS,15035,2007,3/30/2007 14:37,2007-03-30 14:37:00
CHOMPERS,Dog Individual Male,LAB MIX,BROWN,CHOMPERS,15145,2007,2/20/2007 15:32,2007-02-20 15:32:00
CHOMPERS,Dog Individual Spayed Female,.,WHITE/BLACK,CHOMPERS,15024,2008,1/9/2008 16:01,2008-01-09 16:01:00
CHOMPERS,Dog Individual Male,LAB MIX,BROWN,CHOMPERS,15145,2008,2/22/2008 10:47,2008-02-22 10:47:00
CHOMPERS,Dog Individual Spayed Female,MANCHESTER TERRIER,BLACK/BROWN,CHOMPERS,15037,2009,1/15/2009 14:08,2009-01-15 14:08:00
CHOMPERS,Dog Individual Male,POODLE MIX,BLONDE,CHOMPERS,15044,2009,5/14/2009 15:30,2009-05-14 15:30:00
CHOMPERS,Dog Individual Male,AM PITT BULL MIX,WHITE/BLACK,CHOMPERS,15045,2009,3/19/2009 13:40,2009-03-19 13:40:00
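
To make the difference between label-based and position-based lookup concrete, here is a minimal sketch on a made-up three-row DataFrame (the `toy` frame is hypothetical, not part of the `dogs` data).

```python
import pandas as pd

# Hypothetical three-row frame indexed by dog name.
toy = pd.DataFrame(
    {'Breed': ['MIXED', 'BOXER', 'LAB MIX']},
    index=['YIP', 'DEBO', 'ELVIS'],
)

by_label = toy.loc['DEBO', 'Breed']   # looks up the index label 'DEBO'
by_position = toy.iloc[1]['Breed']    # takes the second row, whatever its label

print(by_label, by_position)
```

Both lookups land on the same row here, but only `loc` cares what the index contains.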


#### IMPORTANT WARNING

Accessing rows like this can be very dangerous! Here's why: Getting dogs called 'Chompers' like we did above works fine, but what happens when we look for a less common name?

In [12]:
dogs.loc['VLADIMORE']

LicenseType     Dog Individual Male
Breed           AM PIT BULL TERRIER
Color                   WHITE/BLACK
DogName                   VLADIMORE
OwnerZip                      15014
ExpYear                        2007
ValidDate            6/4/2007 12:07
ValidDate_dt    2007-06-04 12:07:00
Name: VLADIMORE, dtype: object

Tragically, there's only one entry for "Vladimore", and in its wisdom, Pandas decides that you therefore need to see this row as a `Series`, rather than a `DataFrame`. This can cause _serious_ headaches if you're operating over a list of names, for example, and one of them throws an error because suddenly you're dealing with an entirely different kind of object.

Fortunately, there's a way around this, which is to pretend that you're passing a list of names.

In [13]:
dogs.loc[['VLADIMORE']]

DogName,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate,ValidDate_dt
VLADIMORE,Dog Individual Male,AM PIT BULL TERRIER,WHITE/BLACK,VLADIMORE,15014,2007,6/4/2007 12:07,2007-06-04 12:07:00
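
The same behaviour can be reproduced on a tiny made-up frame, which makes it easy to see which kind of object each lookup returns. The `toy` data below is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Breed': ['MIXED', 'LAB MIX', 'AM PIT BULL TERRIER']},
    index=['CHOMPERS', 'CHOMPERS', 'VLADIMORE'],
)

single = toy.loc['VLADIMORE']      # one match, bare label   -> Series
wrapped = toy.loc[['VLADIMORE']]   # one match, list of labels -> DataFrame
repeated = toy.loc['CHOMPERS']     # two matches, bare label -> DataFrame

print(type(single).__name__, type(wrapped).__name__, type(repeated).__name__)
```

If you are looping over names, always passing a list keeps the result type consistent.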


We can also set the index to be a column containing `Timestamp` objects, in which case our `index` becomes something called a `DatetimeIndex`. A `DatetimeIndex` has some very useful functions.

In [14]:
dogs.index = dogs['ValidDate_dt']
dogs.index

DatetimeIndex(['2007-05-01 15:15:00', '2007-05-01 15:15:00',
               '2007-04-11 15:14:00', '2007-04-05 15:00:00',
               '2007-05-25 12:15:00', '2007-06-19 12:13:00',
               '2007-07-13 13:35:00', '2007-02-27 11:52:00',
               '2007-03-12 15:57:00', '2007-01-26 09:24:00',
               ...
               '2009-05-06 12:16:00', '2009-05-06 12:16:00',
               '2009-05-06 12:16:00', '2009-01-06 14:54:00',
               '2009-01-21 14:28:00', '2009-04-10 11:35:00',
               '2009-03-13 14:28:00', '2008-12-15 11:53:00',
               '2009-03-12 14:23:00', '2009-09-21 11:57:00'],
              dtype='datetime64[ns]', name='ValidDate_dt', length=97462, freq=None)

For example, you can select a date range using the `loc` attribute and strings formatted as dates.

In [15]:
dogs.loc['2008-09-01':'2008-09-02']

ValidDate_dt,LicenseType,Breed,Color,DogName,OwnerZip,ExpYear,ValidDate,ValidDate_dt
2008-09-02 13:12:00,Dog Senior Citizen or Disability Spayed Female,AUS SHEPHERD,BLUE,TANK,15017,2008,9/2/2008 13:12,2008-09-02 13:12:00
2008-09-02 12:27:00,Dog Individual Male,BOXER,WHITE/BLACK/BROWN,DEBO,15017,2008,9/2/2008 12:27,2008-09-02 12:27:00
2008-09-02 11:17:00,Dog Individual Male,GER SHEPHERD,YELLOW,RUJO,15018,2008,9/2/2008 11:17,2008-09-02 11:17:00
2008-09-02 12:29:00,Dog Individual Spayed Female,LAB MIX,BLACK,DIGGER,15037,2008,9/2/2008 12:29,2008-09-02 12:29:00
2008-09-02 12:44:00,Dog Individual Neutered Male,BORD COLLIE,BROWN,WALLACE,15044,2008,9/2/2008 12:44,2008-09-02 12:44:00
2008-09-02 12:44:00,Dog Individual Neutered Male,MIXED,BROWN,LITTLE DOG,15044,2008,9/2/2008 12:44,2008-09-02 12:44:00
2008-09-02 10:03:00,Dog Individual Neutered Male,LAB MIX,BROWN,ELVIS,15044,2008,9/2/2008 10:03,2008-09-02 10:03:00
2008-09-02 13:11:00,Dog Individual Spayed Female,LABRADOR RETRIEVER,BROWN,LUCKY,15057,2008,9/2/2008 13:11,2008-09-02 13:11:00
2008-09-02 13:08:00,Dog Individual Male,SILKY TERRIER,MULTI,YODA,15071,2008,9/2/2008 13:08,2008-09-02 13:08:00
2008-09-02 12:12:00,Dog Individual Spayed Female,PEEKAPOO,BROWN/TAN,PEIKA,15084,2008,9/2/2008 12:12,2008-09-02 12:12:00
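
A note on date-string slicing: it is most reliable when the index is sorted, and some newer pandas versions may refuse to slice an unsorted `DatetimeIndex` this way. The sketch below uses a made-up frame (the rows are invented), sorts it first, and also shows that a partial string such as a year and month selects the whole period.

```python
import pandas as pd

# Made-up rows, deliberately out of order.
toy = pd.DataFrame(
    {'DogName': ['TANK', 'REX', 'DEBO', 'YODA']},
    index=pd.to_datetime([
        '2008-09-02 13:12', '2008-08-30 10:00',
        '2008-09-01 09:30', '2008-09-03 08:00',
    ]),
)

# Sorting first keeps date-string slicing well defined.
toy = toy.sort_index()

# Both ends of the slice are inclusive, and the end date covers the whole day.
two_days = toy.loc['2008-09-01':'2008-09-02']

# A partial string selects everything in that month.
september = toy.loc['2008-09']

print(two_days['DogName'].tolist())
print(september['DogName'].tolist())
```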


And probably the most useful thing is the `resample` method. `resample` is basically `groupby`, but it's got a bunch of convenient notation for time-period calculations. For example, you can get periods of three months by passing the string `'3M'` to the `resample` method.

In [16]:
dogs.resample('3M')['DogName'].count()

ValidDate_dt
2006-12-31     1517
2007-03-31    14419
2007-06-30     9287
2007-09-30     3527
2007-12-31     4814
2008-03-31    18207
2008-06-30     9436
2008-09-30     3854
2008-12-31     2994
2009-03-31    15498
2009-06-30     8931
2009-09-30     3584
2009-12-31      971
2010-03-31        1
Name: DogName, dtype: int64
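
To back up the claim that `resample` is basically a grouping operation, here is a small sketch on invented data. It compares `resample` with a `groupby` on `pd.Grouper`, and shows how a number in front of the unit widens the bins. I've used day-based strings (`'D'`, `'2D'`) rather than `'3M'` purely to keep the toy data short.

```python
import pandas as pd

# Made-up registrations spread over six days.
toy = pd.DataFrame(
    {'DogName': ['BUTTER', 'SABLE', 'YIP', 'SABER']},
    index=pd.to_datetime([
        '2007-01-01 09:00', '2007-01-02 10:00',
        '2007-01-03 11:00', '2007-01-06 12:00',
    ]),
)

# Daily counts, written two ways.
daily = toy.resample('D')['DogName'].count()
daily_grouped = toy.groupby(pd.Grouper(freq='D'))['DogName'].count()

# A leading number multiplies the unit: '2D' means two-day bins.
two_day = toy.resample('2D')['DogName'].count()

print(daily.tolist())
print(daily_grouped.tolist())
print(two_day.tolist())
```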

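'Magic strings' like this `'3M'` are a bit annoying to remember, but they're mostly fairly intuitive. `M` is month, `H` is hour, `W` is week, and so on. (Recent pandas releases have renamed some of these, for example `ME` for month end and `h` for hour, and warn about the old spellings.)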

One of the most useful things about `resample` is that it fills in the gaps - the result has a row for every time period in the range, with a zero count for those periods where there is no data.

In [17]:
dogs.resample('H')['DogName'].count().head(24)

ValidDate_dt
2006-12-12 09:00:00     9
2006-12-12 10:00:00    17
2006-12-12 11:00:00     7
2006-12-12 12:00:00     8
2006-12-12 13:00:00    18
2006-12-12 14:00:00    26
2006-12-12 15:00:00    54
2006-12-12 16:00:00     1
2006-12-12 17:00:00     0
2006-12-12 18:00:00     0
2006-12-12 19:00:00     0
2006-12-12 20:00:00     0
2006-12-12 21:00:00     0
2006-12-12 22:00:00     0
2006-12-12 23:00:00     0
2006-12-13 00:00:00     0
2006-12-13 01:00:00     0
2006-12-13 02:00:00     0
2006-12-13 03:00:00     0
2006-12-13 04:00:00     0
2006-12-13 05:00:00     0
2006-12-13 06:00:00     0
2006-12-13 07:00:00     0
2006-12-13 08:00:00     2
Name: DogName, dtype: int64
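
The empty periods are easiest to appreciate next to an ordinary `groupby`, which only reports groups that actually occur. A minimal sketch on made-up data, where 2 January has no registrations:

```python
import pandas as pd

# Made-up registrations: two on 1 January, none on the 2nd, one on the 3rd.
toy = pd.DataFrame(
    {'DogName': ['BUTTER', 'SABLE', 'YIP']},
    index=pd.to_datetime([
        '2007-01-01 09:00', '2007-01-01 15:00', '2007-01-03 10:00',
    ]),
)

# resample keeps the empty day and counts it as zero.
with_gaps = toy.resample('D')['DogName'].count()

# Grouping on the date alone (time set to midnight) skips the empty day.
without_gaps = toy.groupby(toy.index.normalize())['DogName'].count()

print(with_gaps.tolist())
print(without_gaps.tolist())
```

This matters when you take an average per period: the zeros change the answer.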

### Problems

* How many dogs were registered between 1 April 2007 and 1 April 2009?
* What is the average number of dogs registered in a month?

## Extracting part of a date

There's one case that `resample` doesn't handle well - extracting a portion of the `Timestamp` value. Imagine you want to know about seasonal patterns in dog registrations - you want to see the count of registrations by month, regardless of what year the dog was registered. To do that, we have to use something slightly more esoteric. A `Timestamp` column has an attribute called `dt`, which lets you access a bunch of methods for manipulating and extracting portions of the date or time.

In [18]:
dogs['ValidDate_dt'].dt.month.head()

ValidDate_dt
2007-05-01 15:15:00    5.0
2007-05-01 15:15:00    5.0
2007-04-11 15:14:00    4.0
2007-04-05 15:00:00    4.0
2007-05-25 12:15:00    5.0
Name: ValidDate_dt, dtype: float64

The `dt.month` attribute holds a simple number representing the month of the year of that `Timestamp`. There are all the other attributes you'd expect as well: `weekday`, `hour`, `year`, etc. You can use that column along with the `value_counts` method to get a grouped view of the data (but remember that `value_counts` sorts by values, not by the index, so our old pal `sort_index` will need to make an appearance).

In [19]:
( # here's that "splitting out operations by wrapping them in brackets" thing I talked about in Part Four
dogs['ValidDate_dt']
    .dt
    .month
    .value_counts()
    .sort_index()
)

1.0     16634
2.0     15996
3.0     15495
4.0     10149
5.0     11136
6.0      6369
7.0      4731
8.0      3687
9.0      2547
10.0     1930
11.0      898
12.0     7468
Name: ValidDate_dt, dtype: int64
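
Here is a short sketch of a few of the other `dt` attributes, using a hand-made Series of three timestamps rather than the full `dogs` column. `day_name` is a method rather than an attribute, so it needs brackets.

```python
import pandas as pd

# Three made-up timestamps.
stamps = pd.Series(pd.to_datetime([
    '2007-05-01 15:15', '2007-04-11 15:14', '2008-05-25 12:15',
]))

years = stamps.dt.year.tolist()
hours = stamps.dt.hour.tolist()
weekdays = stamps.dt.weekday.tolist()      # Monday is 0, Sunday is 6
day_names = stamps.dt.day_name().tolist()  # the same information as text

# Count by month regardless of year, ordered by month number.
month_counts = stamps.dt.month.value_counts().sort_index()

print(years, hours, weekdays, day_names)
print(month_counts.to_dict())
```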

### Problems

* What are the opening hours of the dog registration office?
* What's the most popular day of the week to register a dog?
* EXTRA FOR EXPERTS: How many dogs get registered per hour during the usual work day?