# Introduction to Time Series with Pandas

Most of our data will have a datetime index, so let's learn how to deal with this sort of data with pandas!

## Python Datetime Review
Here we review Python's built-in <tt>datetime</tt> functionality.

In [1]:
from datetime import datetime

In [2]:
# To illustrate the order of arguments
my_year = 2017
my_month = 1
my_day = 2
my_hour = 13  # hours use 24-hour format
my_minute = 30
my_second = 15

In [3]:
# January 2nd, 2017
my_date = datetime(my_year,my_month,my_day)

In [4]:
# Hour and minute default to 0 when they are omitted
my_date

datetime.datetime(2017, 1, 2, 0, 0)

In [5]:
# January 2nd, 2017 at 13:30:15
my_date_time = datetime(my_year,my_month,my_day,my_hour,my_minute,my_second)

In [6]:
my_date_time

datetime.datetime(2017, 1, 2, 13, 30, 15)

You can grab any part of the datetime object you want:

In [7]:
# Attributes such as day and hour can be read from the datetime object
my_date.day

2

In [8]:
my_date_time.hour

13

In [8]:
type(my_date_time)

datetime.datetime
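As a quick recap of the attribute access shown above, here is a minimal sketch that rebuilds the same timestamp and reads its parts. The names `stamp`, `parts`, and `label` are illustrative only, and the `strftime` pattern is just one possible choice of format.

```python
from datetime import datetime

# Same timestamp as my_date_time in the cells above
stamp = datetime(2017, 1, 2, 13, 30, 15)

# Each component is a plain integer attribute
parts = (stamp.year, stamp.month, stamp.day,
         stamp.hour, stamp.minute, stamp.second)
print(parts)  # (2017, 1, 2, 13, 30, 15)

# strftime renders the object as text; this pattern is only an example
label = stamp.strftime('%Y-%m-%d %H:%M:%S')
print(label)  # 2017-01-02 13:30:15

# weekday() counts from Monday = 0
print(stamp.weekday())  # 0, because January 2nd, 2017 was a Monday
```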

## NumPy Datetime Arrays
NumPy handles dates more efficiently than Python's datetime format.<br>
The NumPy data type is called <em>datetime64</em> to distinguish it from Python's datetime.

In this section we'll show how to set up datetime arrays in NumPy.<br>
For more on NumPy datetimes, visit https://docs.scipy.org/doc/numpy-1.15.4/reference/arrays.datetime.html

In [9]:
import numpy as np

In [10]:
# CREATE AN ARRAY FROM THREE DATES
np.array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64')
# The dtype argument tells NumPy to parse these strings as dates.
# Without dtype='datetime64', the elements would be stored as strings.


array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[D]')

In [15]:
abc=np.array(['2016-03-15', '2017-05-24', '2018-08-09'])
type(abc[0])

numpy.str_

<div class="alert alert-info"><strong>NOTE:</strong> We see the dtype listed as <tt>'datetime64[D]'</tt>. This tells us that NumPy applied a day-level date precision.<br>
    If we want we can pass in a different measurement, such as <TT>[h]</TT> for hour or <TT>[Y]</TT> for year.</div>

In [11]:
np.array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[h]')

array(['2016-03-15T00', '2017-05-24T00', '2018-08-09T00'],
      dtype='datetime64[h]')

In [12]:
np.array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[Y]')
# NumPy does not parse arbitrary date strings; the input must be in yyyy-mm-dd form.
# Dates in any other format have to be converted to yyyy-mm-dd before a precision such as Y, D, or h is applied.

array(['2016', '2017', '2018'], dtype='datetime64[Y]')
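The precision of an existing array can also be changed after it is created. The sketch below uses `astype` for this, which is not shown in the cells above; the variable names are illustrative.

```python
import numpy as np

# Start from day-level precision
dates = np.array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[D]')

# Casting to a coarser unit drops the finer detail
months = dates.astype('datetime64[M]')
print(months)  # ['2016-03' '2017-05' '2018-08']

# Casting to a finer unit fills in zeros (midnight)
hours = dates.astype('datetime64[h]')
print(hours[0])  # 2016-03-15T00
```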

In [16]:
abc=np.array(['2016-03-15', '2017-05-24', '2018-02-31'],dtype='datetime64[D]')

ValueError: Day out of range in datetime string "2018-02-31"
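When the incoming strings are not in yyyy-mm-dd form, one option is to reformat them with Python's `datetime` first. This is a minimal sketch assuming month/day/year input; `raw_dates` is a hypothetical example list.

```python
from datetime import datetime
import numpy as np

# Hypothetical input in month/day/year order
raw_dates = ['03/15/2016', '05/24/2017', '08/09/2018']

# Parse with an explicit format, then write each date back out as yyyy-mm-dd
iso_dates = [datetime.strptime(s, '%m/%d/%Y').strftime('%Y-%m-%d')
             for s in raw_dates]

dates = np.array(iso_dates, dtype='datetime64[D]')
print(dates)  # ['2016-03-15' '2017-05-24' '2018-08-09']
```

As in the failing cell above, an impossible date such as February 31st would also be rejected at the `strptime` step.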

## NumPy Date Ranges
Just as <tt>np.arange(start,stop,step)</tt> can be used to produce an array of evenly-spaced integers, we can pass a <tt>dtype</tt> argument to obtain an array of dates. Remember that the stop date is <em>exclusive</em>.

In [13]:
# AN ARRAY OF DATES FROM 6/1/18 TO 6/22/18 SPACED ONE WEEK APART
np.arange('2018-06-01', '2018-06-23', 7, dtype='datetime64[D]')
# Here 7 is the step size in days. The stop date is exclusive: '2018-06-22' appears because it falls before the stop of '2018-06-23'.
# If the step size is larger than the range (for example 40), only the start date is returned and no error is raised.

array(['2018-06-01', '2018-06-08', '2018-06-15', '2018-06-22'],
      dtype='datetime64[D]')

By omitting the step value we can obtain every value based on the precision.

In [22]:
# AN ARRAY OF DATES FOR EVERY YEAR FROM 1968 TO 1975
np.arange('1968', '1976', dtype='datetime64[Y]')
# The default step size is 1 in units of the precision. With Y (year), we get every year from 1968 up to, but not including, 1976.

array(['1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975'],
      dtype='datetime64[Y]')
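To make the exclusive stop date concrete, here is a small sketch that varies only the stop value and then omits the step at month precision. The variable names are illustrative.

```python
import numpy as np

# The stop date lands exactly on a step boundary, so it is left out
weekly = np.arange('2018-06-01', '2018-06-22', 7, dtype='datetime64[D]')
print(weekly)  # ['2018-06-01' '2018-06-08' '2018-06-15']

# Moving the stop one day later brings June 22nd into the result
weekly_incl = np.arange('2018-06-01', '2018-06-23', 7, dtype='datetime64[D]')
print(len(weekly_incl))  # 4

# With month precision and no step, every month before the stop is returned
months = np.arange('2018-01', '2018-04', dtype='datetime64[M]')
print(months)  # ['2018-01' '2018-02' '2018-03']
```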

## Pandas Datetime Index

We'll usually deal with time series as a datetime index when working with pandas dataframes. Fortunately pandas has a lot of functions and methods to work with time series!<br>
For more on the pandas DatetimeIndex visit https://pandas.pydata.org/pandas-docs/stable/timeseries.html

In [14]:
import pandas as pd

The simplest way to build a DatetimeIndex is with the <tt><strong>pd.date_range()</strong></tt> method:

In [24]:
# THE WEEK OF JULY 8TH, 2018
# Three arguments: the start date, the number of periods requested, and the frequency of the periods
# Here a daily frequency is used
idx = pd.date_range('7/8/2018', periods=7, freq='D')
idx
# Note the dtype in the output: pandas uses nanosecond precision here

DatetimeIndex(['2018-07-08', '2018-07-09', '2018-07-10', '2018-07-11',
               '2018-07-12', '2018-07-13', '2018-07-14'],
              dtype='datetime64[ns]', freq='D')

<div class="alert alert-info"><strong>DatetimeIndex Frequencies:</strong> When we used <tt>pd.date_range()</tt> above, we had to pass in a frequency parameter <tt>'D'</tt>. This created a series of 7 dates spaced one day apart. We'll cover this topic in depth in upcoming lectures, but for now, a list of time series offset aliases like <tt>'D'</tt> can be found <a href='http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases'>here</a>.</div>
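As a small preview of other offset aliases, the sketch below tries `'B'` (business days) and `'W-MON'` (weekly, anchored on Mondays), and also shows the start/end form of `pd.date_range()`. The variable names are illustrative.

```python
import pandas as pd

# July 8th, 2018 is a Sunday, so the business-day range starts on Monday the 9th
bdays = pd.date_range('2018-07-08', periods=5, freq='B')
print(bdays)

# One date per week, each falling on a Monday
mondays = pd.date_range('2018-07-08', periods=3, freq='W-MON')
print(mondays)

# Giving start and end instead of periods; unlike np.arange, the end date is included
week = pd.date_range('2018-07-08', '2018-07-14', freq='D')
print(len(week))  # 7
```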

In [16]:
# pandas understands a variety of date string formats:
idx = pd.date_range('Jan 01, 2018', periods=7, freq='D')  # the start date is given here as 'Jan 01, 2018'
idx

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07'],
              dtype='datetime64[ns]', freq='D')

Another way is to convert incoming text with the <tt><strong>pd.to_datetime()</strong></tt> method:

In [35]:
# to_datetime converts lists, arrays, or single strings to datetimes
# These entries are written in different formats, all of which pandas read as dates in this run
# For a very specific format, pass the format argument (shown in the next cell)
idx = pd.to_datetime(['Jan 01, 2018','1/2/18','03-Jan-2018',None])
idx

DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', 'NaT'], dtype='datetime64[ns]', freq=None)

In [17]:
# Parse a list of strings written in a specific format (day--month--year)
idx = pd.to_datetime(['01--02--2018','01--03--2018'],format='%d--%m--%Y')
idx

DatetimeIndex(['2018-02-01', '2018-03-01'], dtype='datetime64[ns]', freq=None)
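Explicit formats are also useful when some entries may be malformed. The sketch below combines `format` with `errors='coerce'`, which is not used in the cells above; `raw` is a hypothetical input list.

```python
import pandas as pd

# Day-first strings, with a deliberately malformed entry at the end
raw = ['02/01/2018', '03/01/2018', 'not a date']

# errors='coerce' turns anything that does not match the format into NaT
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
print(parsed)

# Because the format puts the day first, '02/01/2018' is January 2nd
print(parsed[0].month)  # 1
```

Without `errors='coerce'`, the malformed entry would raise an error instead of becoming `NaT`.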

A third way is to pass a list or an array of datetime objects into the <tt><strong>pd.DatetimeIndex()</strong></tt> method:

In [36]:
# Create a NumPy datetime array
some_dates = np.array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[D]')
some_dates

array(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[D]')

In [37]:
# Convert to an index
idx = pd.DatetimeIndex(some_dates)
idx

DatetimeIndex(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[ns]', freq=None)

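Notice that even though the dates came into pandas with day-level precision, the output above shows nanosecond-level precision, on the expectation that we might want it later on. The resulting precision can depend on the pandas version, so check <tt>idx.dtype</tt> if it matters.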

To set an existing column as the index, use <tt>.set_index()</tt><br>
><tt>df.set_index('Date',inplace=True)</tt>
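A minimal sketch of that pattern follows, assuming a hypothetical DataFrame whose dates arrive as text in a `'Date'` column. The `'Value'` column and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with dates stored as text in a 'Date' column
df = pd.DataFrame({'Date': ['2018-01-01', '2018-01-02', '2018-01-03'],
                   'Value': [10, 20, 30]})

# Convert the text to datetimes first, then promote the column to the index
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

print(type(df.index).__name__)  # DatetimeIndex

# Rows can now be looked up by date
print(df.loc[pd.Timestamp('2018-01-02'), 'Value'])  # 20
```

Assigning the result back to `df` has the same effect here as passing `inplace=True`.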

## Pandas Datetime Analysis

In [38]:
# Create some random data
data = np.random.randn(3,2)
cols = ['A','B']
print(data)

[[-0.67209504 -1.34011471]
 [ 0.60344757  0.99925486]
 [ 0.20917162  0.72522761]]


In [39]:
# Create a DataFrame with our random data, our date index, and our columns
df = pd.DataFrame(data,idx,cols)
df

| | A | B |
|---|---|---|
| 2016-03-15 | -0.672095 | -1.340115 |
| 2017-05-24 | 0.603448 | 0.999255 |
| 2018-08-09 | 0.209172 | 0.725228 |


Now we can perform a typical analysis of our DataFrame:

In [40]:
df.index

DatetimeIndex(['2016-03-15', '2017-05-24', '2018-08-09'], dtype='datetime64[ns]', freq=None)

In [41]:
# Latest Date Value
df.index.max()

Timestamp('2018-08-09 00:00:00')

In [42]:
# Latest Date Index Location
df.index.argmax()

2

In [43]:
# Earliest Date Value
df.index.min()

Timestamp('2016-03-15 00:00:00')

In [44]:
# Earliest Date Index Location
df.index.argmin()

0

<div class="alert alert-info"><strong>NOTE:</strong> To find where a column's minimum or maximum occurs, we would normally run <tt>.idxmin()</tt> or <tt>.idxmax()</tt> on <tt>df['column']</tt>, which return the index label of that row, since <tt>.argmin()</tt> and <tt>.argmax()</tt> on a Series were deprecated in older pandas releases. However, we still use <tt>.argmin()</tt> and <tt>.argmax()</tt> on the index itself, where they return the integer position.</div>
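The difference between a label and a position can be seen in a short sketch. It uses fixed values rather than random data so the results are predictable; the values themselves are made up.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2016-03-15', '2017-05-24', '2018-08-09'])
# Fixed values (not random) so the outcome is known in advance
df = pd.DataFrame({'A': [0.5, 2.0, -1.0]}, index=idx)

# idxmax returns the index label (a Timestamp) of the largest value in column A
label = df['A'].idxmax()
print(label)  # 2017-05-24 00:00:00

# On the index itself, argmax returns the integer position of the latest date
print(df.index.argmax())  # 2

# np.argmax on the underlying values gives the position of the column maximum
position = np.argmax(df['A'].to_numpy())
print(position)  # 1
```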

## Great, let's move on!