# **Lecture 10A**
# **Reading date and datetime columns from data files**
In this part we will learn how to prepare date and datetime values in Excel and CSV files, and how to read those columns into a DataFrame correctly.

Before you start, run the two cells below to connect to Google Drive and load the pandas module.

In [1]:
# Run the code below to access files in your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# We also need the pandas module in this lecture
import pandas as pd

---
**Example 1:** First, follow the instructions in your lecture notes to create the Excel file **DatetimeDemo.xlsx**. Then upload the file to the folder /MyDrive/Data in Google Drive.

Now read the Excel file using **pd.read_excel()**. You will see that pandas recognizes the column **Submission** as **datetime64** type.

In [5]:
# Read the DatetimeDemo.xlsx data file
# In the worksheet "Sheet1", the Submission column is already a datetime in Excel
data = pd.read_excel("/content/drive/MyDrive/Data/DatetimeDemo.xlsx",sheet_name="Sheet1")
display(data)

# Show the type of the columns in the DataFrame
print(data.dtypes)

Unnamed: 0,ID,Name,Submission
0,1,David,2022-10-03 15:06:12
1,2,Nancy,2022-10-02 23:45:01
2,3,John,2022-09-30 02:05:44
3,4,Susan,2022-10-03 11:23:12


ID                     int64
Name                  object
Submission    datetime64[ns]
dtype: object


---
**Example 2:** In the **date** worksheet of **date_examples.xlsx**, the column **BirthDate** has a *date format (d-mmm-yy)* in Excel. When this worksheet is read, the column will have the type **datetime64** in the resulting DataFrame.

In [7]:
# Read date_examples.xlsx data file
# In the first worksheet "date", the BirthDate is already a date in Excel
data = pd.read_excel("/content/drive/MyDrive/Data/date_examples.xlsx",sheet_name="date")
display(data)

# Show the type of the columns in the DataFrame
print(data.dtypes)

Unnamed: 0,student,BirthDate
0,1,1988-08-22
1,2,1990-07-30
2,3,2002-12-10
3,4,2008-01-15
4,5,1997-06-07


student               int64
BirthDate    datetime64[ns]
dtype: object


---
**Example 3:** In the **datetime** worksheet of **date_examples.xlsx**, the column **SubmissionDT** contains datetime values with fractional seconds.

In [6]:
# Read date_examples.xlsx data file
# In the worksheet "datetime", the SubmissionDT contains datetime values in Excel
data = pd.read_excel("/content/drive/MyDrive/Data/date_examples.xlsx",sheet_name="datetime")
display(data)

# Show the type of the columns in the DataFrame
print(data.dtypes)

Unnamed: 0,student,SubmissionDT
0,1,2022-07-21 14:22:51.510
1,2,2022-12-15 06:27:18.060
2,3,2022-01-01 18:09:11.340
3,4,2022-03-06 11:12:33.100
4,5,2022-10-30 09:45:11.220


student                  int64
SubmissionDT    datetime64[ns]
dtype: object


---
**Example 4:** In the **date_num** worksheet of **date_examples.xlsx**, we have 3 numeric columns, **BirthYear**, **BirthMonth** and **BirthDay**, which together hold the birthdate of each student. We can combine them into a new date column **BirthDate**.
* The conversion can be done using the function **pd.to_datetime(*df*)**.
* The argument ***df*** is a DataFrame with columns **year**, **month** and **day**. You can also include **hour**, **minute** and **second** if needed. Note that the columns must use these recognized names; names such as **BirthYear** will not work, so rename the columns first.
* The function returns a **Series** containing the dates created. You can add the Series back to your DataFrame as a new column.

In [8]:
# Read the DataFrame
data = pd.read_excel("/content/drive/MyDrive/Data/date_examples.xlsx",sheet_name="date_num")
display(data)
print(data.dtypes)

# Extract the 3 integers into a new DataFrame
tmp = data[["BirthYear","BirthMonth","BirthDay"]]

# We have to rename the columns before using pd.to_datetime()
tmp.columns = ["year","month","day"]
display(tmp)

# Create the new column
data["BirthDate"] = pd.to_datetime(tmp)
display(data)
print(data.dtypes)

Unnamed: 0,student,BirthYear,BirthMonth,BirthDay
0,1,1999,3,22
1,2,1987,5,11
2,3,2007,6,9
3,4,2001,1,7
4,5,1986,7,16


student       int64
BirthYear     int64
BirthMonth    int64
BirthDay      int64
dtype: object


Unnamed: 0,year,month,day
0,1999,3,22
1,1987,5,11
2,2007,6,9
3,2001,1,7
4,1986,7,16


Unnamed: 0,student,BirthYear,BirthMonth,BirthDay,BirthDate
0,1,1999,3,22,1999-03-22
1,2,1987,5,11,1987-05-11
2,3,2007,6,9,2007-06-09
3,4,2001,1,7,2001-01-07
4,5,1986,7,16,1986-07-16


student                int64
BirthYear              int64
BirthMonth             int64
BirthDay               int64
BirthDate     datetime64[ns]
dtype: object
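
The same approach extends to times. The sketch below is a minimal illustration, assuming a small in-memory DataFrame with hypothetical column names (**SubYear**, **SubMonth**, and so on) rather than a worksheet from the lecture files. It renames the columns to the names that **pd.to_datetime()** recognizes, including **hour**, **minute** and **second**.

```python
import pandas as pd

# Hypothetical sample data; these column names are illustrative only
parts = pd.DataFrame({
    "SubYear": [2022, 2022],
    "SubMonth": [7, 12],
    "SubDay": [21, 15],
    "SubHour": [14, 6],
    "SubMinute": [22, 27],
    "SubSecond": [51, 18],
})

# rename() returns a new DataFrame, so the original column names are kept in parts
renamed = parts.rename(columns={
    "SubYear": "year",
    "SubMonth": "month",
    "SubDay": "day",
    "SubHour": "hour",
    "SubMinute": "minute",
    "SubSecond": "second",
})

# Combine the six numeric columns into one datetime Series
submitted = pd.to_datetime(renamed)
print(submitted)
```

Using **rename()** instead of assigning to **.columns** is only a stylistic choice here; both give **pd.to_datetime()** the column names it expects.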


---
**Example 5:** In the **date_txt** worksheet of **date_examples.xlsx**, **BirthDate1** is a string containing a date in the format **d/m/y** and **BirthDate2** is a string containing a date in the format **y-m-d**. If a date is stored as a string, we cannot process it as a date until we convert it to **datetime64**.
* We will use the function **pd.to_datetime(*date_string*,format=*date_format*)** again but with different options.
* ***date_string*** is a string column in the DataFrame, which contains dates.
* The **format=** option specifies the layout of the datetimes in the string. **%d** means day (integer), **%m** means month (integer), **%Y** means 4-digit year (integer), **%H** means hour (24-hour clock), **%M** means minutes, **%S** means seconds and **%f** means fractional seconds.
* You can see more format codes at https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.




In [None]:
# Get the Excel worksheet
data = pd.read_excel("/content/drive/MyDrive/Data/date_examples.xlsx",sheet_name="date_txt")
display(data)
print(data.dtypes)

# Convert BirthDate1 (string) to BirthDate1_new (date)
data["BirthDate1_new"] = pd.to_datetime(data["BirthDate1"],format="%d/%m/%Y")

# Convert BirthDate2 (string) to BirthDate2_new (date)
data["BirthDate2_new"] = pd.to_datetime(data["BirthDate2"],format="%Y-%m-%d")
display(data)
print(data.dtypes)

Unnamed: 0,student,BirthDate1,BirthDate2
0,1,22/8/1970,1970-8-22
1,2,12/5/2010,2010-5-12
2,3,7/12/1999,1999-12-7
3,4,9/1/2002,2002-1-9
4,5,30/10/1988,1988-10-30


student        int64
BirthDate1    object
BirthDate2    object
dtype: object


Unnamed: 0,student,BirthDate1,BirthDate2,BirthDate1_new,BirthDate2_new
0,1,22/8/1970,1970-8-22,1970-08-22,1970-08-22
1,2,12/5/2010,2010-5-12,2010-05-12,2010-05-12
2,3,7/12/1999,1999-12-7,1999-12-07,1999-12-07
3,4,9/1/2002,2002-1-9,2002-01-09,2002-01-09
4,5,30/10/1988,1988-10-30,1988-10-30,1988-10-30


student                    int64
BirthDate1                object
BirthDate2                object
BirthDate1_new    datetime64[ns]
BirthDate2_new    datetime64[ns]
dtype: object
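
Why the **format=** option matters can be seen with a string such as **9/1/2002**, which is a valid date in both day-first and month-first layouts. The sketch below is a small illustration on a hypothetical in-memory Series (two values taken from **BirthDate1**), not part of the lecture files.

```python
import pandas as pd

# Two strings that are valid in both d/m/Y and m/d/Y layouts
dates = pd.Series(["9/1/2002", "7/12/1999"])

# Interpret the strings as day/month/year (the layout used in BirthDate1)
day_first = pd.to_datetime(dates, format="%d/%m/%Y")

# Interpret the same strings as month/day/year
month_first = pd.to_datetime(dates, format="%m/%d/%Y")

# The month extracted from each result differs
print(day_first.dt.month.tolist())    # [1, 12]
print(month_first.dt.month.tolist())  # [9, 7]
```

Both conversions succeed without any error, so stating the format explicitly is the only safeguard against silently swapping day and month.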


---
**Example 6:** In the **datetime_txt** worksheet of **date_examples.xlsx**, **SubmissionDT** is a string containing datetime values with fractional-second precision. We can convert it to **datetime64** as in **Example 5**.

In [None]:
# Get the Excel worksheet
data = pd.read_excel("/content/drive/MyDrive/Data/date_examples.xlsx",sheet_name="datetime_txt")

# Convert SubmissionDT (string) to SubmissionDT_new (datetime)
data["SubmissionDT_new"] = pd.to_datetime(data["SubmissionDT"],format="%d-%m-%Y %H:%M:%S.%f")

display(data)
print(data.dtypes)

Unnamed: 0,student,SubmissionDT,SubmissionDT_new
0,1,21-07-2022 14:22:51.51,2022-07-21 14:22:51.510
1,2,15-12-2022 06:27:18.06,2022-12-15 06:27:18.060
2,3,01-01-2022 18:09:11.34,2022-01-01 18:09:11.340
3,4,06-03-2022 11:12:33.10,2022-03-06 11:12:33.100
4,5,30-10-2022 09:45:11.22,2022-10-30 09:45:11.220


student                      int64
SubmissionDT                object
SubmissionDT_new    datetime64[ns]
dtype: object
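
To see how **%f** reads the digits after the decimal point, the sketch below parses two of the strings above from a hypothetical in-memory Series and prints the microsecond part of each result. The digits are treated as a decimal fraction of a second, so **.51** becomes 510000 microseconds, not 51.

```python
import pandas as pd

# Two sample strings in the same layout as SubmissionDT
stamps = pd.Series(["21-07-2022 14:22:51.51", "15-12-2022 06:27:18.06"])

# %f consumes the fractional-second digits that follow the dot
parsed = pd.to_datetime(stamps, format="%d-%m-%Y %H:%M:%S.%f")

# Inspect the fractional part in microseconds
print(parsed.dt.microsecond.tolist())  # [510000, 60000]
```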


---
**Example 7:** If you are using a CSV file, dates and times are read as strings by default. Although **pd.read_csv()** has options to deal with that, we can simply reuse what we learnt in Example 5 and do the conversion after the DataFrame is read.

In **date_examples.csv**, **BirthDate1** and **BirthDate2** are strings containing dates. We can use the same code in Example 5 to convert them to **datetime64**.

In [None]:
# Get the CSV file
data = pd.read_csv("/content/drive/MyDrive/Data/date_examples.csv")
display(data)
print(data.dtypes)

# Convert BirthDate1 (string) to BirthDate1_new (date)
data["BirthDate1_new"] = pd.to_datetime(data["BirthDate1"],format="%d/%m/%Y")

# Convert BirthDate2 (string) to BirthDate2_new (date)
data["BirthDate2_new"] = pd.to_datetime(data["BirthDate2"],format="%Y-%m-%d")
display(data)
print(data.dtypes)

Unnamed: 0,student,BirthDate1,BirthDate2
0,1,22/8/1970,1970-8-22
1,2,12/5/2010,2010-5-12
2,3,7/12/1999,1999-12-7
3,4,9/1/2002,2002-1-9
4,5,30/10/1988,1988-10-30


student        int64
BirthDate1    object
BirthDate2    object
dtype: object


Unnamed: 0,student,BirthDate1,BirthDate2,BirthDate1_new,BirthDate2_new
0,1,22/8/1970,1970-8-22,1970-08-22,1970-08-22
1,2,12/5/2010,2010-5-12,2010-05-12,2010-05-12
2,3,7/12/1999,1999-12-7,1999-12-07,1999-12-07
3,4,9/1/2002,2002-1-9,2002-01-09,2002-01-09
4,5,30/10/1988,1988-10-30,1988-10-30,1988-10-30


student                    int64
BirthDate1                object
BirthDate2                object
BirthDate1_new    datetime64[ns]
BirthDate2_new    datetime64[ns]
dtype: object
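
For comparison, one of the **pd.read_csv()** options mentioned above is **parse_dates=**, which asks pandas to convert the listed columns while reading. The sketch below is a minimal illustration on hypothetical in-memory CSV text with zero-padded **y-m-d** dates; it does not use **date_examples.csv**, and other date layouts may still need the explicit conversion shown in this example.

```python
import io
import pandas as pd

# Hypothetical CSV content with zero-padded ISO-style dates
csv_text = "student,BirthDate\n1,1970-08-22\n2,2010-05-12\n"

# parse_dates lists the columns pandas should try to convert to datetime
data = pd.read_csv(io.StringIO(csv_text), parse_dates=["BirthDate"])

display_dtypes = data.dtypes
print(display_dtypes)
```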
