# Working with time data

## Introduction to the `datetime` module

Python has several [built-in types](https://docs.python.org/3/library/stdtypes.html), such as:
- `int` for integers (e.g. `-1` or `2023`),
- `str` for strings like `"hello"`, and
- `list` objects, which represent arrays like `[1, 2, 3]` or `["cow", "pig", "goat", "chicken"]`.

We can also use "classes" to create more complex object types. Previously, we used the [`DataFrame` class](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) that came from `pandas`. Now, we are going to use various classes that come from the [`datetime` module](https://docs.python.org/3/library/datetime.html), which is part of the standard Python library.

In order to use the `datetime` module, we first need to import it. We use `dt` as a shorthand for the whole module:

In [1]:
import datetime as dt

### `datetime` class
The first class that we'll look at in the `datetime` module is, in fact, `datetime`. This class of objects represents a Gregorian calendar date (year, month, day) and time (hour, minute, second, microsecond). For example, we can get the `datetime` representing the current moment:

In [2]:
dt.datetime.now()

datetime.datetime(2023, 6, 16, 23, 47, 2, 281492)
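
Besides asking for the current moment, we can also build a `datetime` for a specific moment and read its parts back. The sketch below reuses the timestamp from the output above (without the microseconds) purely as an illustration:

```python
import datetime as dt

# Build a datetime for a specific moment: year, month, day, hour, minute, second.
moment = dt.datetime(2023, 6, 16, 23, 47, 2)

# The individual components are available as attributes.
print(moment.year, moment.month, moment.day)  # 2023 6 16
print(moment.hour, moment.minute)             # 23 47

# A datetime can be split into its date part and its time part.
print(moment.date())       # 2023-06-16
print(moment.time())       # 23:47:02
print(moment.isoformat())  # 2023-06-16T23:47:02
```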

### `date` class
Sometimes we may not need the level of precision of a `datetime` object, and we just need the date. For this, our module also comes with a `date` class. For example, we can grab a `date` object representing the current date:

In [None]:
dt.date.today()

We can also pass in the arguments to create a `date` object for a specific date we want:
`dt.date(year, month, day)`

In [None]:
dt.date(2010, 12, 31)

The `date` class also has some utility methods. For example, we can convert an ISO format string into a `date` object:

In [None]:
dt.date.fromisoformat("2010-12-31")

We can also spit out the `date` in various formats. For custom formats, you can create your own format string. For example, `%d` represents the day of the month as a zero-padded number, and `%A` represents the weekday as a full name.

In [None]:
my_date = dt.date(2010, 12, 31)
print("ISO format:   " + my_date.isoformat())
print("%d/%m/%y:     " + my_date.strftime("%d/%m/%y"))
print("%A %d. %B %Y: " + my_date.strftime("%A %d. %B %Y"))

For all the different types of format strings you can use for datetimes, you can reference [the format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes) in the official documentation.
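
The same format codes also work in the other direction: `strptime` parses a string into a `datetime` according to a format string. Here is a minimal sketch, assuming month and weekday names are in English (the `%B` and `%A` codes depend on the locale your Python is running under):

```python
import datetime as dt

text = "4 February 1980"

# Parse the string with a matching format, then keep only the date part.
parsed = dt.datetime.strptime(text, "%d %B %Y").date()
print(parsed)  # 1980-02-04

# Format the parsed date back out with a different format string.
print(parsed.strftime("%A %d. %B %Y"))
```

If the string does not match the format, `strptime` raises a `ValueError`, so it can be worth checking a few sample values before parsing a whole dataset.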

### Other classes
The `datetime` module also supplies us with a few other classes, such as:
- a `time` class that represents a (local) time of day
- a `timedelta` class that represents a duration, the difference between two dates or times

For more information on all of these classes and more, you can refer to [the official documentation](https://docs.python.org/3/library/datetime.html) on the `datetime` module.
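
As a quick illustration of the `timedelta` and `time` classes, here is a small sketch. The two dates are arbitrary examples, chosen only to show the arithmetic:

```python
import datetime as dt

start = dt.date(1981, 8, 2)
end = dt.date(1981, 8, 30)

# Subtracting two dates gives a timedelta.
elapsed = end - start
print(type(elapsed).__name__)  # timedelta
print(elapsed.days)            # 28

# Adding a timedelta to a date gives a new date.
one_week_later = start + dt.timedelta(weeks=1)
print(one_week_later)  # 1981-08-09

# A time object holds a time of day; combine() joins it with a date.
noon = dt.time(12, 0)
print(dt.datetime.combine(start, noon))  # 1981-08-02 12:00:00
```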

## Using `datetime` inside a DataFrame
Now that we have some understanding of the `datetime` and `date` classes, let's see how we can use them when we analyze large datasets with `pandas`. Suppose we have a CSV with a row for each president and dates for when they took office and when they left. We can create a DataFrame for this:

In [None]:
import pandas as pd
from io import StringIO

In [None]:
csv_data = """name,took_office,left_office
Abolhassan Banisadr,4 February 1980,22 June 1981
Mohammad-Ali Rajai,2 August 1981,30 August 1981
Ali Khamenei,9 October 1981,16 August 1989
Akbar Hashemi Rafsanjani,16 August 1989,3 August 1997
Mohammad Khatami,3 August 1997,3 August 2005
Mahmoud Ahmadinejad,3 August 2005,3 August 2013
Hassan Rouhani,3 August 2013,3 August 2021
Ebrahim Raisi,3 August 2021,"""

In [None]:
presidents = pd.read_csv(StringIO(csv_data))
presidents

*Note: the row at index 7 has a `NaN` value for `left_office` since it was left blank. `NaN` stands for "Not a Number," but it essentially means that [data is missing](https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions), whether a number or not.*

Now, what happens if we sort our DataFrame by the `took_office` column?

In [None]:
presidents.sort_values("took_office")

We see that this sorts Akbar Hashemi Rafsanjani first, even though he should come fourth chronologically. This is because the values in that column are strings, so they are compared character by character: `"16 August 1989"` starts with `"1"`, which sorts before `"2"`, `"3"`, `"4"`, and `"9"`. We can check the type for each column:

In [None]:
presidents.dtypes
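
The same effect can be reproduced without `pandas`. The sketch below takes the first four `took_office` values from the CSV above and sorts them twice with plain Python: once as strings, and once using the parsed date as the sort key (again assuming English month names for `%B`):

```python
import datetime as dt

dates = ["4 February 1980", "2 August 1981", "9 October 1981", "16 August 1989"]

# Sorting the raw strings compares them character by character.
as_text = sorted(dates)
print(as_text)
# ['16 August 1989', '2 August 1981', '4 February 1980', '9 October 1981']

# Sorting by the parsed datetime gives chronological order.
chronological = sorted(dates, key=lambda s: dt.datetime.strptime(s, "%d %B %Y"))
print(chronological)
# ['4 February 1980', '2 August 1981', '9 October 1981', '16 August 1989']
```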

If we wanted to convert the `took_office` and `left_office` columns to `datetime` objects, we could use the `to_datetime` function provided by `pandas`:

In [None]:
presidents["took_office"] = pd.to_datetime(presidents["took_office"].str.strip(), format="%d %B %Y")
presidents["left_office"] = pd.to_datetime(presidents["left_office"].str.strip(), format="%d %B %Y")

In [None]:
presidents

Now if we check the datatypes of our columns, we see that `took_office` and `left_office` use the `datetime64` type from `pandas`:

In [None]:
presidents.dtypes

And if we sort by our `took_office` column, things are sorted in chronological order rather than alphabetical:

In [None]:
presidents.sort_values("took_office")

We can also now do more interesting analysis with these columns as `datetime` objects. For example, we can calculate the duration that each president was in office:

In [None]:
presidents["duration"] = presidents["left_office"] - presidents["took_office"]

In [None]:
presidents

And if we wanted, we could sort our rows to see which president was in office for the shortest amount of time and which one for the longest:

In [None]:
presidents.sort_values("duration")
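
If whole days are easier to work with than `timedelta` values, the `.dt.days` accessor can extract them. The following is a minimal sketch on a three-row stand-in for the table above (the column name `days` is just an illustrative choice); note how the missing `left_office` value carries through as a missing duration:

```python
import pandas as pd

# Three rows taken from the presidents data; the last term has no end date.
terms = pd.DataFrame({
    "name": ["Abolhassan Banisadr", "Mohammad-Ali Rajai", "Ebrahim Raisi"],
    "took_office": pd.to_datetime(["1980-02-04", "1981-08-02", "2021-08-03"]),
    "left_office": pd.to_datetime(["1981-06-22", "1981-08-30", None]),
})

terms["duration"] = terms["left_office"] - terms["took_office"]

# .dt.days gives the number of whole days; missing durations stay missing.
terms["days"] = terms["duration"].dt.days

# By default, sort_values puts rows with missing values last.
print(terms.sort_values("duration")[["name", "days"]])
```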

## Jalali dates
Oftentimes, Iranian datasets may use the Jalali calendar rather than the Gregorian calendar. The `datetime` classes are defined in terms of the Gregorian calendar. However, for this exact problem, someone has created [another package](https://pypi.org/project/jdatetime/) called `jdatetime`, which is "a Jalali implementation of Python's datetime module."

In [None]:
import jdatetime as jdt

We can do very similar things with the `jdatetime` package as we did with the `datetime` module, including getting objects that represent the current `datetime` and `date`, except now according to the Jalali calendar:

In [None]:
jdt.datetime.now()

In [None]:
jdt.date.today()

### Converting between Jalali and Gregorian calendar dates
The `jdatetime` module also offers methods that allow you to convert from the Gregorian calendar to the Jalali calendar, and vice versa.

In [None]:
jdt.date.fromgregorian(day=31, month=12, year=2010)

In [None]:
jdt.date(1389, 10, 10).togregorian()

### Formatting Jalali date strings
Note that for formatted strings, you may need to add a locale as an argument (`locale="fa_IR"`) when you create your `date` or `datetime` object. Otherwise, the string may be written with Roman characters.

In [None]:
en_date = jdt.date(1397, 4, 23)
fa_date = jdt.date(1397, 4, 23, locale="fa_IR")

print(en_date.strftime("%A, %d %b %Y"))
print(fa_date.strftime("%A, %d %b %Y"))

## Jalali dates inside a DataFrame

Now let's do a brief example of a DataFrame with Jalali dates. Here we'll use data on [consumption of energy products by year](https://iranopendata.org/en/dataset/consumption-energy-products-2-1355-1397):

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/ICFJ-Computational-Journalism/datasets/main/csv/niocnre2144-consumption-energy-products-2-1355-1397-en.csv")

In [None]:
df.head()

We can see that the years get read in as integers:

In [None]:
df.dtypes

If we wanted to convert the year into a Jalali `date`, we can use the `apply()` method that comes with the `DataFrame` class. We give it a function to apply to each row (`axis=1`). In this case, we take the `"year"` value of a given row and use it to create a `date` object from the `jdatetime` library, set to the first day of that year:

In [None]:
df["jdt_year"] = df.apply(lambda row: jdt.date(int(row["year"]), 1, 1), axis=1)

In [None]:
df.head()

Now that we are using a `date` class from the `jdatetime` module, we can use its `togregorian()` method if we want to convert it to the Gregorian calendar:

In [None]:
df["dt_year"] = df.apply(lambda row: row["jdt_year"].togregorian(), axis=1)

In [None]:
df.head()