Dates are commonly written with numbers, sometimes mixed with text for month and day names. Converting dates into a consistent format and extracting features such as month and year into new variables are useful preprocessing steps.

In [1]:
import numpy as np
import pandas as pd


In [5]:
df = pd.read_csv('dates_lesson_16.csv')

In [6]:
df

Unnamed: 0,month_day_year,day_month_year,date_time,year_month_day
0,04/22/96,22-Apr-96,Tue Aug 11 09:50:35 1996,2007-06-22
1,04/23/96,23-Apr-96,Tue May 12 19:50:35 2016,2017-01-09
2,05/14/96,14-May-96,Mon Oct 14 09:50:35 2017,1998-04-12
3,05/15/96,15-May-96,Tue Jan 11 09:50:35 2018,2027-07-22
4,05/16/01,16-May-01,Fri Mar 11 07:30:36 2019,1945-11-15
5,05/17/02,17-May-02,Tue Aug 11 09:50:35 2020,1942-06-22
6,05/18/03,18-May-03,Wed Dec 21 09:50:35 2021,1887-06-13
7,05/19/04,19-May-04,Tue Jan 11 09:50:35 2022,1912-01-25
8,05/20/05,20-May-05,Sun Jul 10 19:40:25 2023,2007-06-22


In [8]:
df.dtypes

month_day_year    object
day_month_year    object
date_time         object
year_month_day    object
dtype: object

In [10]:
df.columns

Index(['month_day_year', 'day_month_year', 'date_time', 'year_month_day'], dtype='object')

In [14]:
for col in df:
    print(type(df[col][1]))

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


The output confirms that all the date data is currently stored as strings. To work with dates, we need to convert them from strings into a data type built for processing dates. The pandas library provides the Timestamp object for storing and working with dates. You can instruct pandas to convert date columns into Timestamps while reading the data by passing the "parse_dates" argument to the reading function, with a list of column indices indicating the columns you wish to convert. Let's re-read the data with parse_dates turned on for each column:

In [15]:
df = pd.read_csv('dates_lesson_16.csv', parse_dates=[0, 1, 2, 3])  # parse all four columns into Timestamps

In [16]:
df

Unnamed: 0,month_day_year,day_month_year,date_time,year_month_day
0,1996-04-22,1996-04-22,1996-08-11 09:50:35,2007-06-22
1,1996-04-23,1996-04-23,2016-05-12 19:50:35,2017-01-09
2,1996-05-14,1996-05-14,2017-10-14 09:50:35,1998-04-12
3,1996-05-15,1996-05-15,2018-01-11 09:50:35,2027-07-22
4,2001-05-16,2001-05-16,2019-03-11 07:30:36,1945-11-15
5,2002-05-17,2002-05-17,2020-08-11 09:50:35,1942-06-22
6,2003-05-18,2003-05-18,2021-12-21 09:50:35,1887-06-13
7,2004-05-19,2004-05-19,2022-01-11 09:50:35,1912-01-25
8,2005-05-20,2005-05-20,2023-07-10 19:40:25,2007-06-22


In [17]:
for i in df:
    print(type(df[i][0]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
<class 'pandas._libs.tslibs.timestamps.Timestamp'>


* Each column has now been converted to the Timestamp format.
* We can also convert date strings to Timestamps with the function pd.to_datetime().

* If we have oddly formatted date-time strings, we might have to specify the exact format to convert them correctly into Timestamps. For instance, consider a format that gives date times in the form hour:minute:second year-day-month:

In [18]:
odd_date = '12:30:15 2022-29-05' 

The default to_datetime parser cannot be relied on to convert this string, because its year-day-month order is not a layout the parser recognizes. In cases like this, pass the format argument to tell pandas exactly how to read the date:

In [19]:
pd.to_datetime(odd_date,
              format = '%H:%M:%S %Y-%d-%m')

Timestamp('2022-05-29 12:30:15')
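The same format argument also works on a whole Series of strings. The following is a minimal sketch, assuming a small made-up Series (odd_dates is a hypothetical name, not part of the lesson data) that uses the same hour:minute:second year-day-month layout. It also passes errors='coerce', so entries that do not match the format become NaT instead of raising an error:

```python
import pandas as pd

# Hypothetical strings in the hour:minute:second year-day-month layout,
# plus one entry that is not a date at all.
odd_dates = pd.Series(['12:30:15 2022-29-05',
                       '08:05:00 2021-14-11',
                       'not a date'])

# Parse with an explicit format; unparseable entries become NaT.
parsed = pd.to_datetime(odd_dates,
                        format='%H:%M:%S %Y-%d-%m',
                        errors='coerce')
print(parsed)
```

Checking parsed.isna() afterwards is a quick way to count how many entries failed to convert.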

Once we have our dates in the Timestamp format, we can extract a variety of properties like the year, month and day. Converting dates into several simpler features can make the data easier to analyze and use in predictive models. Access date properties from a Series of Timestamps with the syntax: Series.dt.property. To illustrate, let's extract some features from the first column of our date data and put them in a new DataFrame:

In [20]:
df.head(3)

Unnamed: 0,month_day_year,day_month_year,date_time,year_month_day
0,1996-04-22,1996-04-22,1996-08-11 09:50:35,2007-06-22
1,1996-04-23,1996-04-23,2016-05-12 19:50:35,2017-01-09
2,1996-05-14,1996-05-14,2017-10-14 09:50:35,1998-04-12


In [26]:
c1 = df.iloc[:,0]

In [27]:
c1

0   1996-04-22
1   1996-04-23
2   1996-05-14
3   1996-05-15
4   2001-05-16
5   2002-05-17
6   2003-05-18
7   2004-05-19
8   2005-05-20
Name: month_day_year, dtype: datetime64[ns]

In [30]:
# create a new DataFrame from the first column of the dataset
pd.DataFrame({'year' : c1.dt.year,
            'month' : c1.dt.month,
            'Day' : c1.dt.day,
             'hour' : c1.dt.hour,
             'minute' : c1.dt.minute,
             'second' : c1.dt.second})

Unnamed: 0,year,month,Day,hour,minute,second
0,1996,4,22,0,0,0
1,1996,4,23,0,0,0
2,1996,5,14,0,0,0
3,1996,5,15,0,0,0
4,2001,5,16,0,0,0
5,2002,5,17,0,0,0
6,2003,5,18,0,0,0
7,2004,5,19,0,0,0
8,2005,5,20,0,0,0
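The .dt accessor exposes more than the basic calendar fields. As a small sketch, assuming a stand-alone Series built from three of the dates above (the names dates and features are illustrative only), we can also pull out the weekday, quarter and day of the year:

```python
import pandas as pd

# Three dates taken from the first column, rebuilt here so the example is self-contained.
dates = pd.Series(pd.to_datetime(['1996-04-22', '1996-04-23', '2001-05-16']))

features = pd.DataFrame({
    'weekday_name': dates.dt.day_name(),   # e.g. 'Monday'
    'day_of_week': dates.dt.dayofweek,     # Monday = 0, Sunday = 6
    'quarter': dates.dt.quarter,           # 1 through 4
    'day_of_year': dates.dt.dayofyear,     # 1 through 365 (366 in leap years)
})
print(features)
```

Features like the weekday or quarter are often more useful to a model than the raw date itself.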


We can use the subtraction operator on Timestamp objects to determine the amount of time between two different dates:

In [31]:
df.head(3)

Unnamed: 0,month_day_year,day_month_year,date_time,year_month_day
0,1996-04-22,1996-04-22,1996-08-11 09:50:35,2007-06-22
1,1996-04-23,1996-04-23,2016-05-12 19:50:35,2017-01-09
2,1996-05-14,1996-05-14,2017-10-14 09:50:35,1998-04-12


In [40]:
d1 = df.iloc[1,0]
d2 = df.iloc[3,0]

In [42]:
d2-d1


Timedelta('22 days 00:00:00')
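A Timedelta can be broken down further, and subtraction also works on whole columns. The sketch below assumes two small hand-built Series (start_dates and end_dates are hypothetical names) rather than the lesson's DataFrame:

```python
import pandas as pd

# The same two dates used above, built directly as Timestamps.
start = pd.Timestamp('1996-04-23')
end = pd.Timestamp('1996-05-15')
gap = end - start

print(gap.days)             # whole days in the gap
print(gap.total_seconds())  # the full gap expressed in seconds

# Subtracting two datetime Series gives a Series of Timedeltas;
# .dt.days converts each one to an integer number of days.
start_dates = pd.Series(pd.to_datetime(['1996-04-22', '1996-05-14']))
end_dates = pd.Series(pd.to_datetime(['1996-04-23', '1996-05-15']))
diff_days = (end_dates - start_dates).dt.days
print(diff_days)
```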

Pandas includes a variety of more advanced date and time functionality beyond the basics, particularly for dealing with time series data (data consisting of many periodic measurements over time).
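As a brief taste of that time series functionality, here is a minimal sketch that assumes a made-up Series of six daily readings (readings is a hypothetical name). It builds a date index with pd.date_range and averages the values in two-day bins with resample:

```python
import pandas as pd

# Six consecutive daily timestamps starting on 2023-01-01.
idx = pd.date_range('2023-01-01', periods=6, freq='D')

# Made-up measurements, one per day, indexed by date.
readings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Group the readings into two-day bins and take the mean of each bin.
two_day_mean = readings.resample('2D').mean()
print(two_day_mean)
```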

* Cleaning and preprocessing numeric, character and date data is sometimes all we need to do before we start a project.