# Date times

As part of the data quality process, we often have to check, and sometimes correct, how our software interprets the data.

In the previous notebook, correcting data types helped us to identify errors. It can also help us manipulate data in the correct way: for example, if numbers are stored as strings (the `object` data type), then `"2" + "3"` gives `"23"`, whereas once we convert them to a numerical data type, `2 + 3` gives `5`.
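The difference is easy to see on a tiny example. The sketch below uses two made-up values rather than anything from `orders_cl`, and the variable names are only for illustration.

```python
import pandas as pd

# Toy data (not from orders_cl): numbers that were read in as text
as_text = pd.Series(["2", "3"])

# pd.to_numeric converts the strings to a numerical data type
as_numbers = pd.to_numeric(as_text)

text_sum = as_text.iloc[0] + as_text.iloc[1]          # string concatenation
number_sum = as_numbers.iloc[0] + as_numbers.iloc[1]  # arithmetic

print(text_sum)    # 23
print(number_sum)  # 5
```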

Changing dates to the `datetime` data type allows us to manipulate and check our data in a similar way. In this notebook, we will learn about this special data type.

In [None]:
import pandas as pd

In [None]:
# orders_cl.csv
url = "https://drive.google.com/file/d/1Tla62vfu__kCqvgypZyVt2S9VuC016yH/view?usp=sharing"
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
orders_cl = pd.read_csv(path)

Before we begin, we'll make a copy of the `orders_cl` DataFrame.

In [None]:
orders_df = orders_cl.copy()

Let's have a look at the top 5 rows of the DataFrame.

In [None]:
orders_df.head()

## 1.&nbsp; Converting to datetime

Now, let's take a look at the data types pandas has assigned to each column.

In [None]:
orders_df.info()

We can see that `created_date` currently has the `object` data type. Let's convert it using [pd.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html).

In [None]:
orders_df["created_date"] = pd.to_datetime(orders_df["created_date"])

Let's have a look at `.info()` again to check we now have `datetime`.



In [None]:
orders_df.info()

Let's have another look at the top 5 rows. The values look the same as before; only the underlying data type has changed.

In [None]:
orders_df.head()

Success!!!
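To make the change visible without scrolling through `.info()`, here is a minimal sketch on toy values written in the same layout as `created_date`. The `format` and `errors="coerce"` arguments are options of `pd.to_datetime()` that we did not need above; they are shown here only as an illustration of what is available when a column is less clean.

```python
import pandas as pd

# Toy values (not read from orders_cl) in the "YYYY-MM-DD HH:MM:SS" layout
raw = pd.Series(["2017-01-02 13:35:40", "2018-03-14 13:58:36"])

# Stating the format explicitly makes the expected layout clear
converted = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

print(raw.dtype)        # a text data type
print(converted.dtype)  # a datetime64 data type

# With errors="coerce", values that cannot be parsed become NaT (missing)
messy = pd.Series(["2017-01-02 13:35:40", "not a date"])
coerced = pd.to_datetime(messy, errors="coerce")
print(coerced)
```

Counting the `NaT` values afterwards, for example with `coerced.isna().sum()`, is one way to find out how many entries could not be read as dates.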

## 2.&nbsp; `.dt` accessor
`.dt` is to datetime what `.str` is to strings. If you have a Series that is of the datetime data type, `.dt` is an accessor that allows you to return datetime properties from the values of the Series. These properties will be indexed the same as the original Series.

Remember, if you select a single column from a DataFrame, you get a Series.

In [None]:
single_column = orders_df["created_date"]

type(single_column)

`.dt` can return many properties of a datetime, such as the `.month`.

In [None]:
orders_df.loc[:,"month"] = orders_df["created_date"].dt.month
orders_df.head()

If we'd rather have the month as a name, as opposed to a number, that's also possible using `.month_name()`.

In [None]:
orders_df.loc[:,"month_name"] = orders_df["created_date"].dt.month_name()
orders_df.head()

We can also return the `.year`.

In [None]:
orders_df.loc[:,"year"] = orders_df["created_date"].dt.year
orders_df.head()

Sometimes the properties seem obvious, but don't always return what we think they would. `.day` returns the day of the month, not the day of the week.

In [None]:
orders_df.loc[:,"day"] = orders_df["created_date"].dt.day
orders_df.head()

Day of the week is accessed through either `.weekday`, which returns a number from 0 (Monday) to 6 (Sunday), or `.day_name()`, which returns the name of the day.

In [None]:
orders_df.loc[:,"weekday"] = orders_df["created_date"].dt.weekday
orders_df.loc[:,"day_of_week"] = orders_df["created_date"].dt.day_name()
orders_df.head()
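The numbering used by `.weekday` is easy to forget, so here is a small sketch on three hand-picked dates (a Monday, a Wednesday and a Sunday). The dates are illustrative and are not taken from `orders_cl`.

```python
import pandas as pd

# Monday, Wednesday and Sunday of the same week in March 2018
dates = pd.Series(pd.to_datetime(["2018-03-12", "2018-03-14", "2018-03-18"]))

day_of_month = dates.dt.day.tolist()         # position within the month
weekday_numbers = dates.dt.weekday.tolist()  # Monday is 0, Sunday is 6
weekday_names = dates.dt.day_name().tolist()

print(day_of_month)     # [12, 14, 18]
print(weekday_numbers)  # [0, 2, 6]
print(weekday_names)    # ['Monday', 'Wednesday', 'Sunday']
```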

We even have the option to write the datetime out as a string using `.strftime()`. Inside the brackets we place the format codes describing how we'd like the string to be written.

In [None]:
orders_df["date_as_string"] = orders_df["created_date"].dt.strftime("%A %d %b %y")
orders_df.head()

A full list of the properties and methods that can be returned with `.dt` can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetimelike-properties).

A full list of `strftime()` format codes can be found [here](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).
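As a rough guide to how the format codes behave, the sketch below applies a few of them to one illustrative timestamp. The dictionary `formatted` is just a convenience for this example. Codes that produce names (`%A`, `%b`, `%B`) depend on the locale of the machine, so their output is not written out here.

```python
import pandas as pd

# One illustrative timestamp, not read from orders_cl
stamp = pd.Series(pd.to_datetime(["2018-03-14 13:58:36"]))

# Each format code is replaced by one part of the datetime
formats = ["%Y-%m-%d", "%d/%m/%Y %H:%M", "%A %d %b %y", "%B %Y"]
formatted = {fmt: stamp.dt.strftime(fmt).iloc[0] for fmt in formats}

for fmt, text in formatted.items():
    print(f"{fmt!r:>20} -> {text}")
```

The first two results are `2018-03-14` and `14/03/2018 13:58`. Note that the result of `.strftime()` is text, so it can no longer be used with `.dt`.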

## 3.&nbsp; Datetime aggregates
Datetime is not limited to returning properties. We can also perform aggregations on datetime values.

We can utilise methods such as `.min()` and `.max()`, which will show us the earliest and the latest date in the Series.

In [None]:
orders_df["created_date"].min(), orders_df["created_date"].max()

Subtracting the earliest date from the latest quickly shows the timespan of the Series, here in days.

In [None]:
(orders_df["created_date"].max() - orders_df["created_date"].min()).days

> **Note:** We won't dive deep into the differences between `Timestamp`, `Timedelta`, and the other time data types. For now, just know that a `Timestamp` is a point in time, whereas `Timedelta` represents a span of time. If you have the time, you can learn more [here](https://pandas.pydata.org/docs/user_guide/timeseries.html#overview).
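A minimal sketch of the two types, using two illustrative timestamps rather than the real minimum and maximum of the data:

```python
import pandas as pd

# Two points in time (Timestamps)
start = pd.Timestamp("2017-01-02 13:35:40")
end = pd.Timestamp("2018-03-14 13:58:36")

# Subtracting two Timestamps gives a span of time (a Timedelta)
span = end - start
print(type(span).__name__)  # Timedelta
print(span.days)            # 436

# Adding a Timedelta to a Timestamp gives another Timestamp
one_week_later = start + pd.Timedelta(days=7)
print(one_week_later)       # 2017-01-09 13:35:40
```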

We can also use `.mean()` or `.median()`.

In [None]:
orders_df["created_date"].mean(), orders_df["created_date"].median()
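If it seems odd to average dates, a toy example may help. The three dates below are made up; the mean is the point in time that balances them, and the median is the middle value.

```python
import pandas as pd

# Toy dates: 0, 2 and 7 days after 1st January 2018
toy_dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-03", "2018-01-08"]))

mean_date = toy_dates.mean()      # (0 + 2 + 7) / 3 = 3 days after 1st January
median_date = toy_dates.median()  # the middle of the three dates

print(mean_date)    # 2018-01-04 00:00:00
print(median_date)  # 2018-01-03 00:00:00
```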

Or, even use `.describe()` to get an idea of the whole Series.

In [None]:
orders_df["created_date"].describe()

## 4.&nbsp; Filtering by datetime
Datetime is also useful to find information at, or in between, certain points in time.

For example, we can find all the orders created in 2018.

In [None]:
date_filtering_df = orders_cl.copy()
date_filtering_df["created_date"] = pd.to_datetime(date_filtering_df["created_date"])

In [None]:
date_filtering_df.loc[date_filtering_df["created_date"].dt.year == 2018, :].head()

Or, all the orders in March.

In [None]:
date_filtering_df.loc[date_filtering_df["created_date"].dt.month == 3, :].head()

With the code above, we get all of the orders from March of any year (in this case, both 2017 and 2018). What if we want just the orders from March of one year?

When selecting a particular period of time, we have a few options. The most obvious, based on the code cells above, is to use 2 conditions in `.loc`.

When passing multiple conditions to `.loc`, you have to wrap each one in parentheses, because `&` and `|` bind more tightly than comparisons such as `==`. You then combine the conditions with logical operators (`&` for "and", `|` for "or", `~` for "not").

In [None]:
date_filtering_df.loc[(date_filtering_df["created_date"].dt.month == 3) & (date_filtering_df["created_date"].dt.year == 2018), :].head()
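The same idea can be sketched on a four-row toy DataFrame, where it is easy to check each result by eye. Storing each condition in a named variable (here `is_march` and `is_2018`, names chosen only for this example) is one way to keep longer filters readable.

```python
import pandas as pd

# Toy data, not from orders_cl
toy = pd.DataFrame({
    "created_date": pd.to_datetime(
        ["2017-03-05", "2017-06-01", "2018-03-14", "2018-01-20"]
    )
})

# Each condition is a boolean Series, one True/False per row
is_march = toy["created_date"].dt.month == 3
is_2018 = toy["created_date"].dt.year == 2018

march_2018 = toy.loc[is_march & is_2018]     # both conditions hold
march_or_2018 = toy.loc[is_march | is_2018]  # at least one holds
not_march = toy.loc[~is_march]               # the first does not hold

print(len(march_2018), len(march_or_2018), len(not_march))  # 1 3 2
```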

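This works well for a single month. However, the pandas method `.between()` gives us greater flexibility: with it we can search between any 2 datetimes we need. Keep in mind that both endpoints are included by default, and that a date without a time is read as midnight, so an order stamped exactly `2018-04-01 00:00:00` would also be returned by the code below.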

In [None]:
date_filtering_df.loc[date_filtering_df["created_date"].between("2018-03-01", "2018-04-01")].head()
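How `.between()` treats its endpoints can be checked on a handful of made-up timestamps around the edges of March 2018. The `inclusive` argument shown here accepts `"both"` (the default), `"left"`, `"right"` or `"neither"` in pandas 1.3 and later; this is a sketch of the behaviour, not a change to the filter above.

```python
import pandas as pd

# Toy timestamps on either side of the two endpoints
stamps = pd.Series(pd.to_datetime([
    "2018-02-28 23:59:59",
    "2018-03-01 00:00:00",
    "2018-03-31 18:30:00",
    "2018-04-01 00:00:00",
    "2018-04-01 09:00:00",
]))

# Default: both endpoints are included
both = stamps.between("2018-03-01", "2018-04-01")

# inclusive="left" keeps the start but drops the end
left_only = stamps.between("2018-03-01", "2018-04-01", inclusive="left")

print(both.tolist())       # [False, True, True, True, False]
print(left_only.tolist())  # [False, True, True, False, False]
```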

There are a couple of other ways of filtering periods of time. We won't go through these in class as the 2 methods above are more than enough for the moment. However, we'll leave them commented out below for those with inquisitive minds.

In [None]:
# date_filtering_df.loc[date_filtering_df["created_date"].dt.strftime("%Y-%m") == "2018-03"].head()

In [None]:
# date_filtering_df.loc[date_filtering_df["created_date"].dt.to_period("M") == "2018-03"].head()
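For the curious, here is a small sketch of what those two commented-out lines do, run on toy dates instead of `orders_cl`. In this sketch the period is compared against an explicit `pd.Period` object rather than a string, which is an assumption made for clarity; both approaches reduce each datetime to its year and month before comparing.

```python
import pandas as pd

# Toy dates, not from orders_cl
toy = pd.Series(pd.to_datetime(
    ["2017-03-05", "2018-03-01", "2018-03-14", "2018-04-02"]
))

# Option 1: turn each datetime into "YYYY-MM" text and compare strings
by_string = toy.dt.strftime("%Y-%m") == "2018-03"

# Option 2: turn each datetime into a monthly Period and compare periods
by_period = toy.dt.to_period("M") == pd.Period("2018-03", freq="M")

print(by_string.tolist())  # [False, True, True, False]
print(by_period.tolist())  # [False, True, True, False]
```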

# Challenges

### Challenge 1.

What's the latest order?

In [None]:
# your code here
latest_date = orders_df['created_date'].max()
latest_order = orders_df.loc[orders_df['created_date'] == latest_date]
latest_order

| Unnamed: 0 | order_id | created_date | total_paid | state |
| --- | --- | --- | --- | --- |
| 226903 | 527401 | 2018-03-14 13:58:36 | 18.98 | Place Order |


### Challenge 2.

Use `.strftime()` to print out the date of the latest order as "Wed, 14/03/2018".

In [None]:
# your code here
formatted_date = latest_order['created_date'].dt.strftime('%a, %d/%m/%Y').iloc[0]
print(f"The latest order date is: {formatted_date}")

The latest order date is: Wed, 14/03/2018


### Challenge 3.

What's the order number of the first order sold in June 2017?

In [None]:
# june_mask = orders_df["created_date"].dt.month == 6
# year_mask = orders_df["created_date"].dt.year == 2017
time_mask = orders_df['created_date'].dt.strftime('%B %Y') == 'June 2017'

sold_mask = orders_df['state'].isin(['Place Order', 'Completed'])
# Select from the filtered rows only, so an unsold order sharing the same timestamp cannot slip in
june_2017_sold = orders_df.loc[time_mask & sold_mask]
june_2017_sold.loc[[june_2017_sold["created_date"].idxmin()]]

| Unnamed: 0 | order_id | created_date | total_paid | state |
| --- | --- | --- | --- | --- |
| 60829 | 360369 | 2017-06-01 00:11:27 | 415.99 | Completed |


### Challenge 4.

How many orders, regardless of state, were processed between 15th April 2017 and 6th May 2017?

In [None]:
# your code here
# "2017-05-06" is read as midnight, so orders placed later on 6th May are not counted
date_filtering_df[date_filtering_df["created_date"]
                  .between("2017-04-15", "2017-05-06")
                  ].shape[0]

6854

### Challenge 5.

Using the `.dt` accessor, create an extra column showing the quarter in which each order was sold.

In [None]:
# your code here
orders_df["order_quarter"] = orders_df["created_date"].dt.quarter
orders_df

| Unnamed: 0 | order_id | created_date | total_paid | state | order_quarter |
| --- | --- | --- | --- | --- | --- |
| 0 | 241319 | 2017-01-02 13:35:40 | 44.99 | Cancelled | 1 |
| 1 | 241423 | 2017-11-06 13:10:02 | 136.15 | Completed | 4 |
| 2 | 242832 | 2017-12-31 17:40:03 | 15.76 | Completed | 4 |
| 3 | 243330 | 2017-02-16 10:59:38 | 84.98 | Completed | 1 |
| 4 | 243784 | 2017-11-24 13:35:19 | 157.86 | Cancelled | 4 |
| ... | ... | ... | ... | ... | ... |
| 226899 | 527397 | 2018-03-14 13:56:38 | 42.99 | Place Order | 1 |
| 226900 | 527398 | 2018-03-14 13:57:25 | 42.99 | Shopping Basket | 1 |
| 226901 | 527399 | 2018-03-14 13:57:34 | 141.58 | Shopping Basket | 1 |
| 226902 | 527400 | 2018-03-14 13:57:41 | 19.98 | Shopping Basket | 1 |
