# Handling datetime in pandas

In [1]:
# initialization
import numpy as np
import pandas as pd

## Create pandas datetime objects

One reason to use pandas for tabular data is its good support for datetime values. Suppose we have a Python `list` of dates stored as strings:

In [2]:
dates = ["2024-01-15", "2024-02-15", "2024-03-15"]

We can convert the strings to a `DatetimeIndex` using `pd.to_datetime()`:

In [3]:
dates_pd = pd.to_datetime(dates)
display(dates_pd)

DatetimeIndex(['2024-01-15', '2024-02-15', '2024-03-15'], dtype='datetime64[ns]', freq=None)

The same can be done if we have timestamps that also include hour, minute, and (optionally) second information:

In [4]:
timestamps = ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]

In [5]:
timestamps_pd = pd.to_datetime(timestamps)
display(timestamps_pd)

DatetimeIndex(['2024-01-15 14:20:00', '2024-02-15 07:35:00',
               '2024-03-15 18:06:00'],
              dtype='datetime64[ns]', freq=None)

Note that when you use `pd.to_datetime()` on a list of strings, the output is neither a `DataFrame` nor a `Series`; it is a `DatetimeIndex`. To convert the result to a pandas `Series`, just wrap it in a `pd.Series()` call:

In [6]:
time_series = pd.Series(timestamps_pd)
display(time_series)

0   2024-01-15 14:20:00
1   2024-02-15 07:35:00
2   2024-03-15 18:06:00
dtype: datetime64[ns]
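
If the strings are already stored in a `Series`, the wrapping step should not be needed, since `pd.to_datetime()` returns a `Series` when it is given one. The sketch below illustrates this; the variable names `raw` and `parsed` are only illustrative.

```python
import pandas as pd

# Strings held in a Series rather than a plain list
raw = pd.Series(["2024-01-15 14:20", "2024-02-15 07:35", "2024-03-15 18:06"])

# Passing a Series in gives a Series back (not a DatetimeIndex)
parsed = pd.to_datetime(raw)

print(type(parsed).__name__)  # Series
print(pd.api.types.is_datetime64_any_dtype(parsed))  # True
```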

We can also assign the result of `pd.to_datetime()` to a column of a `DataFrame`. For the CalSOFI dataset, we may do:

In [7]:
CalSOFI = pd.read_csv("data/CalSOFI_subset.csv")
CalSOFI["Datetime"] = pd.to_datetime(CalSOFI["Datetime"])

In [8]:
CalSOFI.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81374 entries, 0 to 81373
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Cast_Count  81374 non-null  int64         
 1   Station_ID  81374 non-null  object        
 2   Datetime    81374 non-null  datetime64[ns]
 3   Depth_m     81374 non-null  int64         
 4   T_degC      80973 non-null  float64       
 5   Salinity    80924 non-null  float64       
 6   SigmaTheta  80924 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 4.3+ MB


In [9]:
CalSOFI

Unnamed: 0,Cast_Count,Station_ID,Datetime,Depth_m,T_degC,Salinity,SigmaTheta
0,14172,060.0 060.0,1965-01-11 04:43:00,0,12.12,33.030,25.030
1,14172,060.0 060.0,1965-01-11 04:43:00,10,12.08,33.040,25.050
2,14172,060.0 060.0,1965-01-11 04:43:00,20,12.06,33.040,25.050
3,14172,060.0 060.0,1965-01-11 04:43:00,30,12.06,33.040,25.050
4,14172,060.0 060.0,1965-01-11 04:43:00,50,11.18,33.280,25.400
...,...,...,...,...,...,...,...
81369,25948,090.5 043.0,1988-09-22 18:45:00,250,7.82,34.168,26.651
81370,25948,090.5 043.0,1988-09-22 18:45:00,275,7.66,34.203,26.701
81371,25948,090.5 043.0,1988-09-22 18:45:00,300,7.44,34.225,26.750
81372,25948,090.5 043.0,1988-09-22 18:45:00,350,7.17,34.260,26.817


In addition, when we use `pd.read_csv()`, we can use the `parse_dates` argument to tell pandas to read certain columns as dates:

In [10]:
CalSOFI2 = pd.read_csv("data/CalSOFI_subset.csv", parse_dates=["Datetime"])
CalSOFI2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81374 entries, 0 to 81373
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Cast_Count  81374 non-null  int64         
 1   Station_ID  81374 non-null  object        
 2   Datetime    81374 non-null  datetime64[ns]
 3   Depth_m     81374 non-null  int64         
 4   T_degC      80973 non-null  float64       
 5   Salinity    80924 non-null  float64       
 6   SigmaTheta  80924 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(2), object(1)
memory usage: 4.3+ MB
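
To see what `parse_dates` changes without needing the CalSOFI file, here is a minimal sketch that reads a tiny in-memory CSV twice. The CSV text is a made-up stand-in that only borrows a few column names from the dataset.

```python
import io

import pandas as pd

# A tiny stand-in for a CSV file (illustrative values only)
csv_text = """Station_ID,Datetime,T_degC
A,1965-01-11 04:43:00,12.12
B,1988-09-22 18:45:00,7.82
"""

# Without parse_dates, the Datetime column is read as text
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the same column is converted to datetimes while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Datetime"])

print(pd.api.types.is_datetime64_any_dtype(plain["Datetime"]))   # False
print(pd.api.types.is_datetime64_any_dtype(parsed["Datetime"]))  # True
```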


## Datetime formatting

Of course, textual representations of datetimes come in many different formats. For example, the following all specify the same instant (`date_4` is written day-first and `date_5` month-first):

In [11]:
date_1 = "2024-08-05 3:08 PM"
date_2 = "August 5, 2024, 15:08"
date_3 = "5 Aug 2024 - 3:08:00 pm"
date_4 = "5/8/2024 15:08:00"
date_5 = "8/5/2024 15:08:00"

To deal with the different formats, the `pd.to_datetime()` function accepts a `format` argument, which uses *format codes* to specify how months, days, etc. are encoded. The documentation for the format codes can be found at [https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior). Note that when a format is specified, the literal separator characters (e.g., ` `, `/` or `,`) must be included too.

As examples, to read the above strings into datetime, we'll need:

In [12]:
tstamp_1 = pd.to_datetime(date_1, format="%Y-%m-%d %I:%M %p")
display(tstamp_1)

Timestamp('2024-08-05 15:08:00')

In [13]:
tstamp_2 = pd.to_datetime(date_2, format="%B %d, %Y, %H:%M")
display(tstamp_2)

Timestamp('2024-08-05 15:08:00')

In [14]:
tstamp_3 = pd.to_datetime(date_3, format="%d %b %Y - %I:%M:%S %p")
display(tstamp_3)

Timestamp('2024-08-05 15:08:00')

In [15]:
tstamp_4 = pd.to_datetime(date_4, format="%d/%m/%Y %H:%M:%S")
display(tstamp_4)

Timestamp('2024-08-05 15:08:00')

In [16]:
tstamp_5 = pd.to_datetime(date_5, format="%m/%d/%Y %H:%M:%S")
display(tstamp_5)

Timestamp('2024-08-05 15:08:00')
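
The last two cases show why an explicit `format` matters: the same string can mean two different dates depending on whether the day or the month comes first. The sketch below parses one ambiguous string both ways, and also shows `errors="coerce"`, which turns strings that do not match the format into `NaT` instead of raising an error. The string `"not a date"` is just an illustrative bad value.

```python
import pandas as pd

ambiguous = "5/8/2024 15:08:00"

# The same text read as day/month/year versus month/day/year
as_day_first = pd.to_datetime(ambiguous, format="%d/%m/%Y %H:%M:%S")
as_month_first = pd.to_datetime(ambiguous, format="%m/%d/%Y %H:%M:%S")

print(as_day_first)    # 2024-08-05 15:08:00
print(as_month_first)  # 2024-05-08 15:08:00

# errors="coerce" replaces values that do not match the format with NaT
mixed = ["5/8/2024 15:08:00", "not a date"]
result = pd.to_datetime(mixed, format="%d/%m/%Y %H:%M:%S", errors="coerce")

print(result.isna().tolist())  # [False, True]
```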

Conversely, format codes can also be used to turn datetimes back into strings:

In [17]:
# create a pandas Series of time
time_series = pd.Series(pd.to_datetime(
    ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
))

In [18]:
# use strftime (string-from-time) to create the corresponding strings
time_series.dt.strftime("%b %d, %Y - %I:%M %p")

0    Jan 15, 2024 - 02:20 PM
1    Feb 15, 2024 - 07:35 AM
2    Mar 15, 2024 - 06:06 PM
dtype: object
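
Because formatting and parsing use the same format codes, a string produced by `.dt.strftime()` can be read back with `pd.to_datetime()` and the same `format`. The following is a small round-trip sketch; the format string `fmt` is an arbitrary choice for illustration.

```python
import pandas as pd

fmt = "%d/%m/%Y %H:%M"  # arbitrary format chosen for this example

times = pd.Series(pd.to_datetime(["2024-01-15 14:20", "2024-02-15 07:35"]))

# datetime -> string
as_text = times.dt.strftime(fmt)
print(as_text.tolist())  # ['15/01/2024 14:20', '15/02/2024 07:35']

# string -> datetime, using the same format codes
back = pd.to_datetime(as_text, format=fmt)
print((back == times).all())  # True
```

Note that the round trip is only lossless when the format keeps every component you care about; this one drops seconds, which happen to be zero here.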

## Datetime comparisons

To compare datetimes, we can create a `Timestamp` from a scalar string and use comparison operators in the expected way:

In [19]:
time_series = pd.Series(pd.to_datetime(
    ["2024-01-15 14:20", "2024-02-15 7:35", "2024-03-15 18:06"]
))
instant = pd.to_datetime("2024-02-21")
time_series > instant

0    False
1    False
2     True
dtype: bool

As an application, we can combine datetime comparisons with `.loc[]` to extract the data collected in the 1970s:

In [22]:
CalSOFI.loc[
    (CalSOFI["Datetime"] >= pd.to_datetime("1970-01-01")) & 
    (CalSOFI["Datetime"] < pd.to_datetime("1980-01-01"))
]

Unnamed: 0,Cast_Count,Station_ID,Datetime,Depth_m,T_degC,Salinity,SigmaTheta
25101,17384,050.0 060.0,1970-08-22 16:15:00,0,12.81,32.890,24.797
25102,17384,050.0 060.0,1970-08-22 16:15:00,10,12.81,32.900,24.804
25103,17384,050.0 060.0,1970-08-22 16:15:00,20,12.81,32.900,24.805
25104,17384,050.0 060.0,1970-08-22 16:15:00,30,12.75,32.880,24.801
25105,17384,050.0 060.0,1970-08-22 16:15:00,50,10.46,32.780,25.143
...,...,...,...,...,...,...,...
70259,20850,093.0 039.7,1976-11-11 21:36:00,309,7.77,34.190,26.676
70260,20850,093.0 039.7,1976-11-11 21:36:00,320,7.80,34.224,26.699
70261,20850,093.0 039.7,1976-11-11 21:36:00,330,7.74,34.240,26.719
70262,20850,093.0 039.7,1976-11-11 21:36:00,340,7.71,34.255,26.735
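
The same kind of selection can also be written with `Series.between()`, which some readers find easier to scan than two comparisons joined by `&`. This is a sketch on a small made-up `DataFrame` (the values are illustrative, not taken from CalSOFI); `inclusive="left"` keeps the start of the range and excludes the end.

```python
import pandas as pd

# Toy data with made-up values, standing in for the real dataset
df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "1965-01-11 04:43", "1970-08-22 16:15", "1976-11-11 21:36",
        "1980-01-01 00:00", "1988-09-22 18:45",
    ]),
    "T_degC": [12.12, 12.81, 7.77, 9.50, 7.82],
})

start = pd.Timestamp("1970-01-01")
end = pd.Timestamp("1980-01-01")

# Boolean mask: start <= Datetime < end
in_1970s = df["Datetime"].between(start, end, inclusive="left")
seventies = df.loc[in_1970s]

print(seventies["Datetime"].dt.year.tolist())  # [1970, 1976]
```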


## Extracting parts of a datetime

Sometimes you are interested in parts of a datetime rather than the whole value (e.g., in climatology studies you may be interested in the day of year but not the year itself). You can access these as attributes of the *datetime accessor* `.dt` of a pandas `Series` (again, a column of a `DataFrame` is a `Series`). Examples include:

+ `.dt.year`: year of the datetime
+ `.dt.month`: month of the datetime
+ `.dt.day`: day of the datetime
+ `.dt.hour`: hour of the datetime
+ `.dt.minute`: minute of the datetime
+ `.dt.dayofyear`: day of year of the datetime

For a full list, see [https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components).

As an example, let's extract the day of year from the datetimes in CalSOFI:

In [24]:
CalSOFI["Datetime"].dt.dayofyear

0         11
1         11
2         11
3         11
4         11
        ... 
81369    266
81370    266
81371    266
81372    266
81373    266
Name: Datetime, Length: 81374, dtype: int32
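
Extracted components are often most useful as grouping keys. As a sketch of the climatology use case mentioned above, the example below averages temperature by calendar month across different years. The `DataFrame` `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy data with made-up values: two Januaries and two Augusts
toy = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["1965-01-11", "1966-01-20", "1965-08-22", "1966-08-05"]
    ),
    "T_degC": [12.0, 13.0, 17.0, 19.0],
})

# Use the month (ignoring the year) as the grouping key
toy["month"] = toy["Datetime"].dt.month
monthly_mean = toy.groupby("month")["T_degC"].mean()

print(monthly_mean.to_dict())  # {1: 12.5, 8: 18.0}
```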