Datetime Objects in Pandas - Strengths and Limitations

In my last post about movie budgets, one of the key feature categories was the release dates of films. Dates are an interesting linguistic structure because different cultures represent them differently. Take 2019-04-21, or YYYY-MM-DD: this is totally intuitive, biggest unit to smallest unit, year then month then day.

Of course, in the USA the standard date format is MM-DD-YYYY, which makes absolutely no sense to me but has to be dealt with in most US source datasets.

Let's see how this shows up in my movie dataset:


In [None]:
# Import libraries 
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

In [None]:
# load data and relabel
df = pd.read_csv('the_numbers_2009.csv')

# cleaning column names, and dropping nans
df = df.drop(columns = ['Unnamed: 0', '0', '6']).drop([0]).dropna(how='all').rename(columns={'1': 'Release Date', '2':'Movie',
                                '3':'Production Budget', '4': 'Domestic Gross', '5':'Worldwide Gross'})

In [None]:
# Check the dtypes before describing our data
df.dtypes

In [None]:
df.head()

Note that Release Date is listed as an 'object'; in fact, all features are. This means that pandas read all of these features as strings: because they contained a mix of characters, numbers, and symbols, the pandas parser could not identify a clear datatype.

The Movie category can stay a string, while the three numerical categories can be easily cast to float.
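
As a rough sketch of that cast, here is one way it could look on a toy frame. The `$1,234` style formatting and the `sample` and `money_cols` names are assumptions for illustration; the real columns may need different cleaning.

```python
import pandas as pd

# Toy frame; the "$1,234" style formatting is an assumption about the raw data.
sample = pd.DataFrame({
    'Production Budget': ['$425,000,000', '$306,000,000'],
    'Worldwide Gross': ['$2,783,918,982', '$2,053,311,220'],
})

money_cols = ['Production Budget', 'Worldwide Gross']
for col in money_cols:
    # Strip dollar signs and thousands separators, then cast to float.
    sample[col] = sample[col].str.replace(r'[$,]', '', regex=True).astype(float)

print(sample.dtypes)
```

If some rows hold values that are not numbers at all, `pd.to_numeric(..., errors='coerce')` is an alternative that turns them into NaN instead of raising.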

What I would like to focus on is how we deal with dates. Pandas has a very sophisticated function, pd.to_datetime, that can parse our dates (even in DD-MMM-YY format) into datetime objects.

In [5]:
# and now I'll cast our dates into datetime objects
# (infer_datetime_format is deprecated as of pandas 2.0, where format inference is the default)
df['Release Date'] = pd.to_datetime(df['Release Date'], infer_datetime_format=True)

In [9]:
df.dtypes

Release Date         datetime64[ns]
Movie                        object
Production Budget            object
Domestic Gross               object
Worldwide Gross              object
dtype: object
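
When the format is known up front, it can also be spelled out instead of inferred. The sketch below assumes the strings look like `18-Dec-09` (format `%d-%b-%y`) and uses a made-up `raw` series; `errors='coerce'` turns anything that does not match into NaT rather than raising.

```python
import pandas as pd

# Hypothetical sample values; the DD-MMM-YY layout is an assumption.
raw = pd.Series(['18-Dec-09', '20-May-11', 'not a date'])

# Explicit format: day, abbreviated month name, two-digit year.
parsed = pd.to_datetime(raw, format='%d-%b-%y', errors='coerce')

print(parsed)
print(parsed.isna().sum(), 'value(s) could not be parsed')
```

Being explicit is usually faster on large columns and makes a silently misread format less likely.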

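In [None]:
# Convert the datetimes to matplotlib's numeric date representation
dates = matplotlib.dates.date2num(df['Release Date'])
# Note: these two columns are still strings at this point;
# cast them to float first for a meaningful y-axis
budgets = df['Production Budget']
gross = df['Worldwide Gross']

plt.plot_date(dates, budgets, c = 'red')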


In [8]:
df['Release Date'].describe()

(count                    5729
 unique                   2396
 top       2015-12-31 00:00:00
 freq                       24
 first     1969-01-01 00:00:00
 last      2068-12-11 00:00:00
 Name: Release Date, dtype: object, dtype('<M8[ns]'))

The single call to pd.to_datetime has turned our strings into datetime objects, inferring the date format automatically. However, when we look at the values, we can see some future dates that don't make sense. No movies have been planned for release in 2068 to my knowledge, so let's explore that.
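
The 1969 to 2068 range is consistent with the usual two-digit-year convention, where years 69 to 99 are read as 1969 to 1999 and years 00 to 68 as 2000 to 2068. One possible repair, sketched here on made-up values, is to push any date past a cutoff back by a century. The `cutoff` value is an assumption: it only works if no real release date in the data falls after it.

```python
import pandas as pd

# Toy dates mirroring the suspicious range seen in describe().
dates = pd.Series(pd.to_datetime(['1969-01-01', '2015-12-31', '2068-12-11']))

# Assumption: nothing in this dataset was really released after the cutoff.
cutoff = pd.Timestamp('2020-01-01')

# Shift only the dates beyond the cutoff back by 100 years.
in_future = dates > cutoff
fixed = dates.mask(in_future, dates - pd.DateOffset(years=100))

print(fixed)
```

A fixed cutoff is a blunt tool; a film from 1915 stored as `15` would still be read as 2015 and left alone, so older titles would need a separate check.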

Actually, matplotlib has a very handy