# 1.3 Working with time data

## Introduction to `datetime` module

Python has several [built-in types](https://docs.python.org/3/library/stdtypes.html) such as:
- `int` for integers (e.g. `-1` or `2023`),
- `str` for strings like `"hello"`, and
- `list` objects, which represent arrays like `[1, 2, 3]` or `["cow", "pig", "goat", "chicken"]`

We can also use "classes" to create more complex object types. Previously, we used the [`DataFrame` class](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) that came from `pandas`. Now, we are going to use various classes that come from the [`datetime` module](https://docs.python.org/3/library/datetime.html), which is part of the standard Python library.

In order to use the `datetime` module, we first need to import it. We use `dt` as a shorthand for the whole module:

In [7]:
import datetime as dt

### `datetime` class
The first class that we'll look at in the `datetime` module is, in fact, `datetime`. This class of objects represents a Gregorian calendar date (year, month, day) and time (hour, minute, second, microsecond). For example, we can get the `datetime` representing the current moment:

In [8]:
dt.datetime.now()

datetime.datetime(2023, 2, 21, 17, 25, 33, 220960)
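As a small sketch of how the pieces of a `datetime` fit together, we can also build one for a specific moment and read its components back as attributes. The variable name `moment` and the chosen values are just for illustration:

```python
import datetime as dt

# Build a datetime for a specific moment: year, month, day, hour, minute, second.
moment = dt.datetime(2010, 12, 31, 23, 59, 30)

# Each component is available as an attribute.
print(moment.year, moment.month, moment.day)
print(moment.hour, moment.minute, moment.second)

# isoformat() gives a standard string representation of the whole value.
print(moment.isoformat())
```

Any component we leave out after the day (hour, minute, second, microsecond) defaults to zero.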

### `date` class
Sometimes we may not need the level of precision of a `datetime` object, and we just need the date. For this, the module also comes with a `date` class. For example, we can grab a `date` object representing the current date:

In [9]:
dt.date.today()

datetime.date(2023, 2, 21)

We can also pass in the arguments to create a `date` object for a specific date we want:
`dt.date(year, month, day)`

In [10]:
dt.date(2010, 12, 31)

datetime.date(2010, 12, 31)

The library also has some utility functions, so for example, we can convert an ISO format string into a `date` object:

In [11]:
dt.date.fromisoformat("2010-12-31")

datetime.date(2010, 12, 31)

We can also output the `date` in various formats. For custom formats, you can create your own format string and pass it to `strftime()`. For example, `%d` represents the day of the month as a zero-padded number, and `%A` represents the weekday as a full name.

In [15]:
my_date = dt.date(2010, 12, 31)
print("ISO format:   " + my_date.isoformat())
print("%d/%m/%y:     " + my_date.strftime("%d/%m/%y"))
print("%A %d. %B %Y: " + my_date.strftime("%A %d. %B %Y"))

ISO format:    2010-12-31
%d/%m/%y:      31/12/10
%A %d. %B %Y:  Friday 31. December 2010


For all the different types of format strings you can use for datetimes, you can reference [the format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) in the official documentation.
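The same format codes also work in the other direction. As a rough sketch, `strptime()` parses a string into a `datetime` when given a matching format string. The example string here is illustrative, and `%B` assumes month names in the default (English) locale:

```python
import datetime as dt

# Parse a string such as "4 February 1980" using the matching format codes:
# %d = day of month, %B = full month name, %Y = four-digit year.
parsed = dt.datetime.strptime("4 February 1980", "%d %B %Y")

print(parsed)         # a datetime at midnight on that day
print(parsed.date())  # just the date part
```

If the string does not match the format, `strptime()` raises a `ValueError`.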

### Other classes
The `datetime` module also supplies us with a few other classes, such as:
- a `time` class that represents a (local) time of day
- a `timedelta` class that represents a duration, the difference between two dates or times

For more information on all of these classes and more, you can refer to [the official documentation](https://docs.python.org/3/library/datetime.html) on the `datetime` module.
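To give a feel for these two classes, here is a minimal sketch. The dates and variable names are arbitrary examples:

```python
import datetime as dt

start = dt.date(2010, 12, 31)
end = dt.date(2011, 3, 1)

# Subtracting two dates gives a timedelta.
gap = end - start
print(gap.days)

# Adding a timedelta to a date gives another date.
one_week_later = start + dt.timedelta(weeks=1)
print(one_week_later)

# A time object holds a time of day with no date attached.
noon = dt.time(12, 0)
print(noon.isoformat())
```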

## Using `datetime` inside a `DataFrame`
Now that we have some understanding of the `datetime` and `date` classes, let's see how we can use them when we analyze large datasets with `pandas`. Suppose we have a CSV with a row for each president and dates for when they took office and when they left. We can create a `DataFrame` for this:

In [82]:
import pandas as pd
from io import StringIO

In [83]:
csv_data = """name,took_office,left_office
Abolhassan Banisadr,4 February 1980,22 June 1981
Mohammad-Ali Rajai,2 August 1981,30 August 1981
Ali Khamenei,9 October 1981,16 August 1989
Akbar Hashemi Rafsanjani,16 August 1989,3 August 1997
Mohammad Khatami,3 August 1997,3 August 2005
Mahmoud Ahmadinejad,3 August 2005,3 August 2013
Hassan Rouhani,3 August 2013,3 August 2021
Ebrahim Raisi,3 August 2021,"""

In [84]:
presidents = pd.read_csv(StringIO(csv_data))
presidents

Unnamed: 0,name,took_office,left_office
0,Abolhassan Banisadr,4 February 1980,22 June 1981
1,Mohammad-Ali Rajai,2 August 1981,30 August 1981
2,Ali Khamenei,9 October 1981,16 August 1989
3,Akbar Hashemi Rafsanjani,16 August 1989,3 August 1997
4,Mohammad Khatami,3 August 1997,3 August 2005
5,Mahmoud Ahmadinejad,3 August 2005,3 August 2013
6,Hassan Rouhani,3 August 2013,3 August 2021
7,Ebrahim Raisi,3 August 2021,


*Note: row 7 has a `NaN` value for `left_office` since it was left blank. `NaN` stands for "Not a Number," but it essentially means that [data is missing](https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions), whether a number or not.*

Now, what happens if we sort our `DataFrame` by the `took_office` column?

In [85]:
presidents.sort_values("took_office")

Unnamed: 0,name,took_office,left_office
3,Akbar Hashemi Rafsanjani,16 August 1989,3 August 1997
1,Mohammad-Ali Rajai,2 August 1981,30 August 1981
4,Mohammad Khatami,3 August 1997,3 August 2005
5,Mahmoud Ahmadinejad,3 August 2005,3 August 2013
6,Hassan Rouhani,3 August 2013,3 August 2021
7,Ebrahim Raisi,3 August 2021,
0,Abolhassan Banisadr,4 February 1980,22 June 1981
2,Ali Khamenei,9 October 1981,16 August 1989


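We see that this function sorts Akbar Hashemi Rafsanjani first, even though he should come fourth if we are speaking chronologically. This is because the values in that column are strings, so they are compared character by character: `"16 August 1989"` sorts before `"2 August 1981"` because `"1"` comes before `"2"`. We can check the type for each column: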

In [86]:
presidents.dtypes

name           object
took_office    object
left_office    object
dtype: object
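The effect of sorting dates as strings can be reproduced without `pandas`. The following sketch uses three of the `took_office` values in a plain list (the name `raw` is just for illustration) and compares string sorting with sorting by the parsed date:

```python
import datetime as dt

raw = ["4 February 1980", "16 August 1989", "2 August 1981"]

# Plain string sorting compares character by character, so "16..." sorts
# before "2..." because "1" comes before "2".
print(sorted(raw))

# Sorting by the parsed date gives chronological order instead.
by_date = sorted(raw, key=lambda s: dt.datetime.strptime(s, "%d %B %Y"))
print(by_date)
```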

If we wanted to convert the `took_office` and `left_office` columns to `datetime` objects, we could use the `to_datetime` function provided by `pandas`:

In [87]:
presidents["took_office"] = pd.to_datetime(presidents["took_office"].str.strip(), format="%d %B %Y")
presidents["left_office"] = pd.to_datetime(presidents["left_office"].str.strip(), format="%d %B %Y")

In [88]:
presidents

Unnamed: 0,name,took_office,left_office
0,Abolhassan Banisadr,1980-02-04,1981-06-22
1,Mohammad-Ali Rajai,1981-08-02,1981-08-30
2,Ali Khamenei,1981-10-09,1989-08-16
3,Akbar Hashemi Rafsanjani,1989-08-16,1997-08-03
4,Mohammad Khatami,1997-08-03,2005-08-03
5,Mahmoud Ahmadinejad,2005-08-03,2013-08-03
6,Hassan Rouhani,2013-08-03,2021-08-03
7,Ebrahim Raisi,2021-08-03,NaT


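Note that the missing `left_office` value in row 7 is now displayed as `NaT` ("Not a Time"), the datetime counterpart of `NaN`. Now if we check the datatypes of our columns, we see that `took_office` and `left_office` are of type `datetime64[ns]`, which is how `pandas` stores `datetime` values: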

In [89]:
presidents.dtypes

name                   object
took_office    datetime64[ns]
left_office    datetime64[ns]
dtype: object
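Real datasets often contain a few values that do not match the expected format. By default, `to_datetime` raises an error in that case. One option, sketched below on a made-up `Series` with a deliberate typo, is to pass `errors="coerce"` so that unparseable values become `NaT` and can be inspected afterwards:

```python
import pandas as pd

# Hypothetical column: one valid date, one typo, one missing value.
raw = pd.Series(["3 August 2013", "3 Agust 2021", None])

# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(raw, format="%d %B %Y", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # how many values could not be parsed
```

Coercing hides the difference between a blank cell and a malformed one, so it is worth comparing the `NaT` count against the number of values that were blank to begin with.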

And if we sort by our `took_office` column, things are sorted in chronological order rather than alphabetical:

In [90]:
presidents.sort_values("took_office")

Unnamed: 0,name,took_office,left_office
0,Abolhassan Banisadr,1980-02-04,1981-06-22
1,Mohammad-Ali Rajai,1981-08-02,1981-08-30
2,Ali Khamenei,1981-10-09,1989-08-16
3,Akbar Hashemi Rafsanjani,1989-08-16,1997-08-03
4,Mohammad Khatami,1997-08-03,2005-08-03
5,Mahmoud Ahmadinejad,2005-08-03,2013-08-03
6,Hassan Rouhani,2013-08-03,2021-08-03
7,Ebrahim Raisi,2021-08-03,NaT


We can also now do more interesting analysis with these columns as `datetime` objects. For example, we can calculate the duration that each president was in office:

In [92]:
presidents["duration"] = presidents["left_office"] - presidents["took_office"]

In [93]:
presidents

Unnamed: 0,name,took_office,left_office,duration
0,Abolhassan Banisadr,1980-02-04,1981-06-22,504 days
1,Mohammad-Ali Rajai,1981-08-02,1981-08-30,28 days
2,Ali Khamenei,1981-10-09,1989-08-16,2868 days
3,Akbar Hashemi Rafsanjani,1989-08-16,1997-08-03,2909 days
4,Mohammad Khatami,1997-08-03,2005-08-03,2922 days
5,Mahmoud Ahmadinejad,2005-08-03,2013-08-03,2922 days
6,Hassan Rouhani,2013-08-03,2021-08-03,2922 days
7,Ebrahim Raisi,2021-08-03,NaT,NaT


And if we wanted, we could sort our rows to see which president was in office for the shortest amount of time and which one for the longest:

In [94]:
presidents.sort_values("duration")

Unnamed: 0,name,took_office,left_office,duration
1,Mohammad-Ali Rajai,1981-08-02,1981-08-30,28 days
0,Abolhassan Banisadr,1980-02-04,1981-06-22,504 days
2,Ali Khamenei,1981-10-09,1989-08-16,2868 days
3,Akbar Hashemi Rafsanjani,1989-08-16,1997-08-03,2909 days
4,Mohammad Khatami,1997-08-03,2005-08-03,2922 days
5,Mahmoud Ahmadinejad,2005-08-03,2013-08-03,2922 days
6,Hassan Rouhani,2013-08-03,2021-08-03,2922 days
7,Ebrahim Raisi,2021-08-03,NaT,NaT
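The `duration` column holds `timedelta` values, and the open-ended term stays `NaT`. As a sketch of one way to work with this, we can extract plain day counts with `.dt.days` and fill the missing end date with a cutoff date. The small `terms` table and the cutoff `as_of` below are assumptions made for illustration, not part of the dataset:

```python
import pandas as pd

terms = pd.DataFrame({
    "name": ["Mohammad-Ali Rajai", "Hassan Rouhani", "Ebrahim Raisi"],
    "took_office": pd.to_datetime(["1981-08-02", "2013-08-03", "2021-08-03"]),
    "left_office": pd.to_datetime(["1981-08-30", "2021-08-03", None]),
})

# Hypothetical cutoff date, used only for the open-ended term.
as_of = pd.Timestamp("2023-02-21")

# Fill the missing end date, then subtract to get a timedelta column.
duration = terms["left_office"].fillna(as_of) - terms["took_office"]

# .dt.days converts each timedelta into an integer number of days.
terms["days"] = duration.dt.days
print(terms[["name", "days"]])
```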


## Jalali dates
Oftentimes, Iranian datasets may use the Jalali calendar rather than the Gregorian calendar. The `datetime` classes are formatted in terms of the Gregorian calendar. However, for this exact problem, someone has created [another package](https://pypi.org/project/jdatetime/) called `jdatetime`, which is "a Jalali implementation of Python’s datetime module."

In [1]:
import jdatetime as jdt

We can do very similar things with the `jdatetime` package as we did with `datetime`, including getting objects that represent the current `datetime` and `date`, except now according to the Jalali calendar:

In [2]:
jdt.datetime.now()

jdatetime.datetime(1401, 12, 2, 17, 22, 39, 271776)

In [3]:
jdt.date.today()

jdatetime.date(1401, 12, 2)

### Converting between Jalali and Gregorian calendar dates
The `jdatetime` module also offers functions that allow you to convert from Gregorian calendar to Jalali, and vice-versa.

In [27]:
jdt.date.fromgregorian(day=31, month=12, year=2010)

jdatetime.date(1389, 10, 10)

In [28]:
jdt.date(1389,10,10).togregorian()

datetime.date(2010, 12, 31)

### Formatting Jalali date strings
Note that for formatted strings, you may need to add a locale as an argument (`locale="fa_IR"`) when you create your `date` or `datetime` object. Otherwise, the string may be written with Roman characters.

In [24]:
en_date = jdt.date(1397, 4, 23)
fa_date = jdt.date(1397, 4, 23, locale="fa_IR")

print(en_date.strftime("%A, %d %b %Y"))
print(fa_date.strftime("%A, %d %b %Y"))

Saturday, 23 Tir 1397
شنبه, 23 تیر 1397


## Jalali dates inside a DataFrame

Now let's do a brief example of a `DataFrame` with Jalali dates. Here we'll use data on [consumption of energy products by year](https://iranopendata.org/en/dataset/consumption-energy-products-2-1355-1397).

In [98]:
df = pd.read_csv("../../input/niocnre2144-consumption-energy-products-2-1355-1397-en.csv")

In [104]:
df.head()

Unnamed: 0,year,natural gas - one thousand barrels of crude oil per day,liquid petroleum gas - one thousand barrels of crude oil per day,petrol - one thousand barrels of crude oil per day,naphtha - one thousand barrels of crude oil per day,diesel - one thousand barrels of crude oil per day,fuel oil - one thousand barrels of crude oil per day,jet fuel - one thousand barrels of crude oil per day,total - one thousand barrels of crude oil per day
0,1355,45,9,61,86,126,104,19,449
1,1356,57,12,73,96,152,111,23,524
2,1357,43,12,79,96,159,108,19,515
3,1358,63,13,89,118,158,114,10,565
4,1359,71,12,75,92,155,134,10,549


We can see that the years get read in as integers:

In [100]:
df.dtypes

year                                                                int64
natural gas - one thousand barrels of crude oil per day             int64
liquid petroleum gas - one thousand barrels of crude oil per day    int64
petrol - one thousand barrels of crude oil per day                  int64
naphtha - one thousand barrels of crude oil per day                 int64
diesel - one thousand barrels of crude oil per day                  int64
fuel oil - one thousand barrels of crude oil per day                int64
jet fuel - one thousand barrels of crude oil per day                int64
total - one thousand barrels of crude oil per day                   int64
dtype: object

If we want to convert the year into a Jalali `date`, we can use the `apply()` method that comes with the `DataFrame` class. We give it a function to apply to each row (`axis=1`). In this case, we take the `"year"` value of a given row and use it to create a `date` object from the `jdatetime` library, representing the first day of that year:

In [112]:
df["jdt_year"] = df.apply(lambda row: jdt.date(row["year"], 1, 1), axis=1)

In [116]:
df.head()

Unnamed: 0,year,natural gas - one thousand barrels of crude oil per day,liquid petroleum gas - one thousand barrels of crude oil per day,petrol - one thousand barrels of crude oil per day,naphtha - one thousand barrels of crude oil per day,diesel - one thousand barrels of crude oil per day,fuel oil - one thousand barrels of crude oil per day,jet fuel - one thousand barrels of crude oil per day,total - one thousand barrels of crude oil per day,jdt_year
0,1355,45,9,61,86,126,104,19,449,1355-01-01
1,1356,57,12,73,96,152,111,23,524,1356-01-01
2,1357,43,12,79,96,159,108,19,515,1357-01-01
3,1358,63,13,89,118,158,114,10,565,1358-01-01
4,1359,71,12,75,92,155,134,10,549,1359-01-01


Now that we are using a `date` class from the `jdatetime` module, we can use its `togregorian()` method if we want to convert it to the Gregorian calendar:

In [123]:
df["dt_year"] = df.apply(lambda row: row["jdt_year"].togregorian(), axis=1)

In [124]:
df.head()

Unnamed: 0,year,natural gas - one thousand barrels of crude oil per day,liquid petroleum gas - one thousand barrels of crude oil per day,petrol - one thousand barrels of crude oil per day,naphtha - one thousand barrels of crude oil per day,diesel - one thousand barrels of crude oil per day,fuel oil - one thousand barrels of crude oil per day,jet fuel - one thousand barrels of crude oil per day,total - one thousand barrels of crude oil per day,jdt_year,dt_year
0,1355,45,9,61,86,126,104,19,449,1355-01-01,1976-03-21
1,1356,57,12,73,96,152,111,23,524,1356-01-01,1977-03-21
2,1357,43,12,79,96,159,108,19,515,1357-01-01,1978-03-21
3,1358,63,13,89,118,158,114,10,565,1358-01-01,1979-03-21
4,1359,71,12,75,92,155,134,10,549,1359-01-01,1980-03-21
