# Chapter 09 -- Panda Time Series and Date Handling--DRAFT

## Topics Covered:

<a href="http://nbviewer.jupyter.org/github/RandyBetancourt/PythonForSASUsers/blob/master/Chapter%2009%20--%20Panda%20Time%20Series%20and%20Date%20Handling#Definitions.ipynb"> Definitions </a>

<a href="http://nbviewer.jupyter.org/github/RandyBetancourt/PythonForSASUsers/blob/master/Chapter%2009%20--%20Panda%20Time%20Series%20and%20Date%20Handling"> Creating and manipulating a fixed-frequency of datetime spans </a>

Convert time series from one frequency to another

Increment 'non-standard' datetimes intervals (e.g. business week)

Time Series Walk-Through

Chapter 8, <a href="http://nbviewer.jupyter.org/github/RandyBetancourt/PythonForSASUsers/blob/master/Chapter%2008%20--%20Python%20Date%2C%20Time%2C%20and%20%20Timedelta%20Objects.ipynb"> Understanding Date Time and TimeDelta objects </a> provided a short introduction to Python's built-in datetime capabilities.  In this chapter we illustrate pandas time series and date handling.  



In [45]:
from datetime import date, time, datetime, timedelta
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index

## Definitions

To begin, we need to distinguish between the object types used to represent datetimes.  While a bit pedantic, it helps to clarify the behaviors of these objects.  pandas time series utilize the NumPy datetime64 and timedelta64 dtypes.

Recall, you can always return an object's type with the built-in type() function:

    type()
    
Examples work better than prose.  Consider the assignments in the cell below.

In [46]:
a_date = date(2016, 10, 24)
a_datetime = datetime(2016, 10, 24)

In [47]:
print(a_date)
print(a_datetime)

2016-10-24
2016-10-24 00:00:00


In [48]:
a_date == a_datetime

False

Not surprisingly, the date and datetime objects are not logically equivalent.  After all, they use different 'counters'.  Further, the cell below illustrates that they are instances of two different classes from the datetime module.

In [49]:
print(type(a_date))
print(type(a_datetime))

<class 'datetime.date'>
<class 'datetime.datetime'>
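
To compare the two on equal footing, one of them has to be converted first.  The cell below is a minimal sketch that reuses the values assigned above; the names same_day and same_instant are illustrative only.

```python
from datetime import date, datetime, time

a_date = date(2016, 10, 24)
a_datetime = datetime(2016, 10, 24)

# Drop the time portion of the datetime and compare as dates
same_day = a_datetime.date() == a_date

# Or promote the date to a datetime at midnight and compare as datetimes
same_instant = datetime.combine(a_date, time.min) == a_datetime

print(same_day, same_instant)
```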


And in case you were wondering about SAS:

````
    data _null_;

    a_date = '24Oct2016'd;
    a_datetime = '24Oct2016:00:00:00'dt;

    if a_date = a_datetime then
       put 'True';
    else
       put 'False';
    run;

    False
````

Python also distinguishes between datetimes and timestamps.  Again, examples work better than prose.  The path.getatime() function returns the last access time for a file.

In [50]:
file = "lines.html"
from os import path

a_time = path.getatime(file)

af_time = datetime.fromtimestamp(a_time)

In [51]:
print('value returned:', a_time)
print('value returned:', af_time)

value returned: 1477433492.8224854
value returned: 2016-10-25 16:11:32.822485


In [52]:
print('Type for a_time is', type(a_time), 'and Type for af_time is', type(af_time))

Type for a_time is <class 'float'> and Type for af_time is <class 'datetime.datetime'>


A timestamp is a time value that represents a count of the number of seconds from the start of an epoch.  This is similar to SAS datetime values, which represent an offset in seconds from an epoch beginning at midnight, January 1, 1960.
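
The relationship between the two epochs can be sketched as simple arithmetic.  The cell below assumes the usual Unix epoch of January 1, 1970 and treats both epochs as UTC for the subtraction; SAS datetime values themselves carry no time zone, so sas_style_value is only a hypothetical illustration of the offset.

```python
from datetime import datetime, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SAS_EPOCH = datetime(1960, 1, 1, tzinfo=timezone.utc)

# Number of seconds separating the two epochs
epoch_gap = (UNIX_EPOCH - SAS_EPOCH).total_seconds()

# The file access time printed above, expressed as a Unix timestamp
unix_ts = 1477433492.8224854

# Hypothetical SAS-style value: seconds counted from 1960 instead of 1970
sas_style_value = unix_ts + epoch_gap

print(epoch_gap)
print(datetime.fromtimestamp(unix_ts, tz=timezone.utc))
```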

In [53]:
pdt = pd.Timestamp('2016-10-24')

In [54]:
type(pdt)

pandas.tslib.Timestamp
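
A pandas Timestamp behaves much like the standard library datetime.  The short sketch below illustrates this with the same value; the variable plain is introduced only for the example.

```python
from datetime import datetime
import pandas as pd

pdt = pd.Timestamp('2016-10-24')

# A pandas Timestamp is a subclass of the standard library datetime
print(isinstance(pdt, datetime))

# Component attributes and a convenience method
print(pdt.year, pdt.month, pdt.day)
print(pdt.day_name())

# Convert back to a plain datetime when needed
plain = pdt.to_pydatetime()
```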

## Creating and manipulating a fixed-frequency of date and time spans

The pd.date_range() function generates a DatetimeIndex, which is applied to a pandas Series or DataFrame to provide datetime interval indexing.  We will see examples of its construction methods.  Later we will utilize indexers that take advantage of the DatetimeIndex.

In [55]:
rng = pd.date_range('1/1/2016', periods=90, freq='D')

Print the first 10 dates in the DatetimeIndex.

In [56]:
rng[:10]

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
               '2016-01-09', '2016-01-10'],
              dtype='datetime64[ns]', freq='D')

In [57]:
ts = pd.Series(np.random.randn(len(rng)), index=rng)

In [58]:
type(ts)

pandas.core.series.Series
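
One benefit of a DatetimeIndex is selection by partial date strings.  The sketch below rebuilds the same 90-day index but, as an assumption made for reproducibility, fills the Series with a simple counter instead of random values.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2016', periods=90, freq='D')
ts = pd.Series(np.arange(len(rng)), index=rng)

# All rows for February 2016, selected with a partial date string
feb = ts.loc['2016-02']

# A slice between two dates; both endpoints are included
first_week = ts.loc['2016-01-01':'2016-01-07']

print(len(feb), len(first_week))
```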

Time-stamped data in pandas represent a point in time.

A Period, by contrast, represents a span of time.  Here the Period's frequency is inferred from the datetime string.

In [59]:
pd.Period('2016-01-01')

Period('2016-01-01', 'D')

Get the type.

In [60]:
type(pd.Period('2016-01-01'))

pandas._period.Period

Here the Period's frequency is set explicitly.

In [61]:
pd.Period('2016-05', freq='D')

Period('2016-05-01', 'D')
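
Because a Period covers a span, it exposes both a start and an end, and arithmetic moves in units of its frequency.  A minimal sketch using a monthly Period (the monthly frequency is chosen here only for illustration):

```python
import pandas as pd

p = pd.Period('2016-01', freq='M')

# A Period has a start and an end timestamp
print(p.start_time)
print(p.end_time)

# Adding an integer shifts the Period by that many frequency units
next_month = p + 1
print(next_month)
```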

Timestamp and Period objects can serve as an index.  Lists of them are coerced into a DatetimeIndex and a PeriodIndex, respectively.

In [62]:
dates = [pd.Timestamp('2012-05-01'), pd.Timestamp('2012-05-02'), pd.Timestamp('2012-05-03')]

In [63]:
dates

[Timestamp('2012-05-01 00:00:00'),
 Timestamp('2012-05-02 00:00:00'),
 Timestamp('2012-05-03 00:00:00')]

In [64]:
ts = pd.Series(np.random.randn(3), dates)

In [65]:
ts

2012-05-01    0.475898
2012-05-02   -0.039289
2012-05-03   -0.909969
dtype: float64

In [66]:
type(ts.index)

pandas.tseries.index.DatetimeIndex

In [67]:
ts.index

DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
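
The same coercion applies to Period objects.  The sketch below uses made-up monthly Periods and a counter for values, and also shows that a DatetimeIndex can be converted with .to_period(); the names ps and as_periods are illustrative.

```python
import numpy as np
import pandas as pd

periods = [pd.Period('2012-05'), pd.Period('2012-06'), pd.Period('2012-07')]
ps = pd.Series(np.arange(3), index=periods)

# A list of Period objects is coerced into a PeriodIndex
print(type(ps.index))

# A DatetimeIndex can also be converted to a PeriodIndex
dates = pd.to_datetime(['2012-05-01', '2012-05-02', '2012-05-03'])
ts = pd.Series(np.arange(3), index=dates)
as_periods = ts.to_period('D')
print(as_periods.index)
```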

Convert date string to datetime

In [68]:
pd.to_datetime('2016/11/30')

Timestamp('2016-11-30 00:00:00')

In [69]:
type(pd.to_datetime('2016/11/30'))

pandas.tslib.Timestamp

Convert date string to Timestamp

In [70]:
pd.Timestamp('2016/11/30')

Timestamp('2016-11-30 00:00:00')

In [71]:
type(pd.Timestamp('2016/11/30'))

pandas.tslib.Timestamp

You can also assemble datetimes from a DataFrame whose columns hold the year, month, and day components as integers or strings.

In [72]:
df = pd.DataFrame({'year': [2014, 2015, 2016],
                   'month': [1, 2, 3],
                   'day': [1, 2, 3]})
# pd.to_datetime() returns a Series of datetimes, one per row
df1 = pd.to_datetime(df)
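
To see what the assembly produces, the same call can be repeated and the result printed.  The name assembled is used here only for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'year': [2014, 2015, 2016],
                   'month': [1, 2, 3],
                   'day': [1, 2, 3]})

# pd.to_datetime() looks for columns named year, month, and day
assembled = pd.to_datetime(df)

print(assembled)
print(type(assembled))
```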

In [73]:
from datetime import datetime, date, time
start = datetime(2016, 1, 1)
end = datetime(2016, 12, 31)
rng = pd.date_range(start,end)

In [74]:
rng

DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
               '2016-01-09', '2016-01-10',
               ...
               '2016-12-22', '2016-12-23', '2016-12-24', '2016-12-25',
               '2016-12-26', '2016-12-27', '2016-12-28', '2016-12-29',
               '2016-12-30', '2016-12-31'],
              dtype='datetime64[ns]', length=366, freq='D')

In [75]:
start = datetime(2016, 1, 1)
end = datetime(2016, 12, 31)
b_rng = pd.bdate_range(start,end)

In [76]:
b_rng

DatetimeIndex(['2016-01-01', '2016-01-04', '2016-01-05', '2016-01-06',
               '2016-01-07', '2016-01-08', '2016-01-11', '2016-01-12',
               '2016-01-13', '2016-01-14',
               ...
               '2016-12-19', '2016-12-20', '2016-12-21', '2016-12-22',
               '2016-12-23', '2016-12-26', '2016-12-27', '2016-12-28',
               '2016-12-29', '2016-12-30'],
              dtype='datetime64[ns]', length=261, freq='B')
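
A quick check of what pd.bdate_range() leaves out: with default settings it drops Saturdays and Sundays but knows nothing about holidays, which is why January 1 still appears above.  The names only_weekdays and weekend_days in this sketch are illustrative.

```python
from datetime import datetime
import pandas as pd

start = datetime(2016, 1, 1)
end = datetime(2016, 12, 31)
rng = pd.date_range(start, end)
b_rng = pd.bdate_range(start, end)

# dayofweek is 0 for Monday through 6 for Sunday
only_weekdays = (b_rng.dayofweek < 5).all()

# Dates present in the daily range but absent from the business-day range
weekend_days = rng.difference(b_rng)

print(only_weekdays, len(b_rng), len(weekend_days))
```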

In [77]:
rng = pd.date_range(start, end, freq='BM')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.index

DatetimeIndex(['2016-01-29', '2016-02-29', '2016-03-31', '2016-04-29',
               '2016-05-31', '2016-06-30', '2016-07-29', '2016-08-31',
               '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-30'],
              dtype='datetime64[ns]', freq='BM')

Return the first 5 index values.

In [78]:
ts[:5].index

DatetimeIndex(['2016-01-29', '2016-02-29', '2016-03-31', '2016-04-29',
               '2016-05-31'],
              dtype='datetime64[ns]', freq='BM')

Return every nth value; a step of 2 returns every other one.

In [79]:
ts[::2]

2016-01-29   -1.416549
2016-03-31    1.688049
2016-05-31    0.650461
2016-07-29   -1.083026
2016-09-30    0.219029
2016-11-30   -0.591642
Freq: 2BM, dtype: float64

In [80]:
ts[::6]

2016-01-29   -1.416549
2016-07-29   -1.083026
Freq: 6BM, dtype: float64
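
The topic list for this chapter also mentions converting a time series from one frequency to another.  One way to do this, offered as a sketch and not as the chapter's own worked example, is the .resample() method.  The Series of ones and the 'MS' (month start) frequency are assumptions chosen so the result is easy to verify.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2016', periods=90, freq='D')
daily = pd.Series(np.ones(len(rng)), index=rng)

# Downsample: aggregate daily values into month-start buckets
monthly_total = daily.resample('MS').sum()

# Upsample: expand back to daily rows, carrying each value forward
back_to_daily = monthly_total.resample('D').ffill()

print(monthly_total)
print(len(back_to_daily))
```

With a Series of ones, each monthly total is simply the number of days of that month that fall inside the range.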

## Time Series Walk-Through

We can begin combining features covered in previous chapters to conduct a walk-through of an actual time-series analysis.  The data is the FHFA House Price Index (HPI).

It is a broad measure of the movement of single-family house prices. The HPI is a weighted, repeat-sales index, meaning that it measures average price changes in repeat sales or refinancings on the same properties. This information is obtained by reviewing repeat mortgage transactions on single-family properties whose mortgages have been purchased or securitized by Fannie Mae or Freddie Mac.  

Details about the data and how it is organized can be found <a href="https://catalog.data.gov/dataset/fhfa-house-price-indexes-hpis"> here </a>. 

This time series begins in January 1991 and ends in August 2016.  Both the seasonally adjusted index 'index_sa' and the non-seasonally adjusted index 'index_nsa' set the index value at 100 for January 1991.

The .csv file has two parts.  Part 1, rows 2 to 3079, contains records for the aggregate market groups at the Census Division level.  The frequency interval is monthly.

Part 2, rows 3080 to 96,243, is more granular, with 4 values for level: 'MSA', 'State', 'USA or Census Division', and 'Puerto Rico'.  The frequency interval is quarterly.

Start with the U.S. portion by reading part 1 of the file.  The pd.read_csv() method uses its one required argument, the input file name, to create the DataFrame 'df_all'.

In [81]:
df_all = pd.read_csv(r"C:\Data\HPI_master.csv")

Check the last 5 rows to determine if the read_csv() method is giving the expected results.

In [82]:
df_all.tail()

Unnamed: 0,hpi_type,hpi_flavor,frequency,level,place_name,place_id,yr,period,index_nsa,index_sa
99320,developmental,purchase-only,quarterly,Puerto Rico,Puerto Rico,PR,2015,2,160.23,158.62
99321,developmental,purchase-only,quarterly,Puerto Rico,Puerto Rico,PR,2015,3,159.54,161.27
99322,developmental,purchase-only,quarterly,Puerto Rico,Puerto Rico,PR,2015,4,155.14,152.81
99323,developmental,purchase-only,quarterly,Puerto Rico,Puerto Rico,PR,2016,1,150.61,154.71
99324,developmental,purchase-only,quarterly,Puerto Rico,Puerto Rico,PR,2016,2,165.45,163.82


We need to combine the year and period fields into a datetime stamp.  The .csv file in cell #xx above is read without any datetime parsing for the fields 'yr' and 'period'.  We could post-process these fields to construct the appropriate datetime values.
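
The post-processing route can be sketched as follows.  The toy DataFrame and its values are made up for illustration, and fixing the day at 1 is an assumption, since the file supplies only a year and a period.

```python
import pandas as pd

# Toy rows standing in for the 'yr' and 'period' fields of the monthly data
toy = pd.DataFrame({'yr': [1991, 1991, 2016],
                    'period': [1, 2, 8],
                    'index_nsa': [100.0, 101.0, 102.0]})

# Assemble a datetime from its components; the day is fixed at 1 by assumption
toy['date'] = pd.to_datetime(pd.DataFrame({'year': toy['yr'],
                                           'month': toy['period'],
                                           'day': 1}))

print(toy[['yr', 'period', 'date']])
```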

A better approach is below.  The parse_dates= argument allows a <a href="http://nbviewer.jupyter.org/github/RandyBetancourt/PythonForSASUsers/blob/master/Chapter%2002%20--%20Data%20Structures.ipynb#dictionary"> dictionary </a> object, with the key being the arbitrary name of the new column and the value listing the positions of the fields to be read from the .csv file.  Recall that Python indexes have a start position of 0, so positions 6 and 7 are the 7th and 8th fields in the .csv file.

Sometimes, you may need to create your own date-parser, analogous to building a user-defined SAS INFORMAT to map field values into a datetime object.  This is particularly true in cases where the date value is stored as component values in multiple fields.   
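
As an illustration of a hand-written parser, the sketch below maps a year and a quarter number to the first day of that quarter.  The function name quarter_start is hypothetical, the toy values are made up, and the sketch assumes 'period' holds a quarter number from 1 to 4, as in the quarterly part of the file.

```python
import pandas as pd

def quarter_start(yr, quarter):
    """Return the first day of the given quarter as a Timestamp."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError('quarter must be 1, 2, 3, or 4')
    # Quarter 1 starts in month 1, quarter 2 in month 4, and so on
    month = 3 * (int(quarter) - 1) + 1
    return pd.Timestamp(year=int(yr), month=month, day=1)

toy = pd.DataFrame({'yr': [2015, 2015, 2016], 'period': [3, 4, 1]})
toy['date'] = [quarter_start(y, q) for y, q in zip(toy['yr'], toy['period'])]

print(toy)
```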

## url for custom date_parser here

In [95]:
df = pd.read_csv(r"C:\Data\HPI_master.csv",
                 parse_dates={'date_idx': [6,7]},
                 nrows=3080)

In [96]:
df.shape

(3080, 9)

Check for missing values.

In [97]:
df.isnull().any()

date_idx      False
hpi_type      False
hpi_flavor    False
frequency     False
level         False
place_name    False
place_id      False
index_nsa     False
index_sa      False
dtype: bool

Map the string 'date_idx' column constructed through the date_parser to a datetime value.  Set the 'date' column as the index on the DataFrame.

In [103]:
df['date'] = pd.to_datetime(df['date_idx'])
df.set_index("date", inplace=True, drop=False)

Indexing on the datetime column 'date' creates a 'time-aware' DatetimeIndex.

In [104]:
df.index

DatetimeIndex(['1991-01-01', '1991-02-01', '1991-03-01', '1991-04-01',
               '1991-05-01', '1991-06-01', '1991-07-01', '1991-08-01',
               '1991-09-01', '1991-10-01',
               ...
               '2015-11-01', '2015-12-01', '2016-01-01', '2016-02-01',
               '2016-03-01', '2016-04-01', '2016-05-01', '2016-06-01',
               '2016-07-01', '2016-08-01'],
              dtype='datetime64[ns]', name='date', length=3080, freq=None)
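
What 'time-aware' buys us can be sketched with a small stand-in DataFrame.  The monthly index and counter values below are made up; only the selection techniques matter.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2007-01-01', '2011-12-01', freq='MS')
toy = pd.DataFrame({'index_nsa': np.arange(len(idx), dtype=float)}, index=idx)

# Select a span of whole years with partial date strings
span = toy.loc['2008':'2010']

# Date components are available directly on the index
december_rows = toy[toy.index.month == 12]

print(len(span), len(december_rows))
```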

Get the first and last date values.

In [109]:
print('Earliest date is:', df.date.min())
print('Latest date is:', df.date.max())

Earliest date is: 1991-01-01 00:00:00
Latest date is: 2016-08-01 00:00:00


We see from cell #xx above that we have several categorical columns, so we need to understand their levels.  Earlier, we saw the .describe() method used for numerical columns.  In this example, specifying the 'include=' argument provides a description of string columns.

In [None]:
df_all.describe(include=['O'])

The 'place_name' column has 10 unique levels or values.  We can examine these values with the .unique() method.

In [None]:
df_all.place_name.unique()
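
The two calls above can be tried on a small stand-in DataFrame.  The rows are made up, and the string columns are created with dtype=object explicitly so that include=['O'] selects them; that choice is an assumption of this sketch, not something the HPI file requires.

```python
import pandas as pd

toy = pd.DataFrame({
    'place_name': pd.Series(['United States', 'Pacific Division',
                             'United States', 'Pacific Division'], dtype=object),
    'frequency': pd.Series(['monthly'] * 4, dtype=object),
    'index_nsa': [100.0, 100.0, 101.0, 102.0],
})

# Summary of the object (string) columns only: count, unique, top, freq
summary = toy.describe(include=['O'])
print(summary)

# Distinct values, and how often each one occurs
print(toy['place_name'].unique())
print(toy['place_name'].value_counts())
```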

By setting an index on the column 'place_name', you can create a sub-set of the DataFrame 'df_all' to include just those rows for the U.S.  The .loc indexer allows row slicing which is covered in detail <a href="http://nbviewer.jupyter.org/github/RandyBetancourt/PythonForSASUsers/blob/master/Chapter%2005%20--%20Understanding%20Indexes.ipynb#.loc-Indexer"> here </a>.

In [None]:
df_all.set_index('place_name', inplace=True, drop=False)
df_us = df_all.loc['United States']
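
Setting an index is one way to subset; a boolean mask is another.  The sketch below compares the two on made-up rows and is meant only to show that they select the same data.

```python
import pandas as pd

toy = pd.DataFrame({'place_name': ['United States', 'Pacific Division', 'United States'],
                    'index_nsa': [100.0, 100.0, 101.0]})

# Approach used above: index on the column, then select by label
by_label = toy.set_index('place_name', drop=False).loc['United States']

# Alternative: a boolean mask, which leaves the original index untouched
by_mask = toy[toy['place_name'] == 'United States']

print(by_label['index_nsa'].tolist() == by_mask['index_nsa'].tolist())
```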

In [None]:
import bokeh.charts
import bokeh.charts.utils
import bokeh.io
import bokeh.models
import bokeh.palettes
import bokeh.plotting

# Display graphics in this notebook
bokeh.io.output_notebook()

In [None]:
p = bokeh.charts.Line(df_us, x='date_idx', y='index_nsa', color='firebrick',  title="Home Price Values in the U.S.")

# Display it
bokeh.io.show(p)

In [None]:
df_us_3 = df_all.loc[['West South Central Division', 'United States', 'Pacific Division']]

During the Great Recession of 2008-2010, home prices across the U.S. declined dramatically.  Home prices in the Pacific region, which includes California, grew significantly more than the U.S. as a whole.  Aggregate U.S. home prices have regained all of their price losses since then and the Pacific and West South Central regions are not too far behind. 

In [None]:
p = bokeh.charts.Line(df_us_3, x='date_idx', y='index_nsa', color='place_name',
                      legend="top_left")

bokeh.io.show(p)

To continue this analysis, we need to read the remainder of the .csv containing state-level data.  In this instance, we use the 'skiprows=' argument to begin reading on row 3084.  We specify the columns with the 'usecols=' and 'names=' arguments.

Beginning on row 3081, there are no values for the field 'index_sa'.  We re-read starting with row 3081 to the end of the file.  And since the default is to key off the column names, we will need to supply the column mappings.  And because we are reading from an arbitrary start point, we supply a  <a href= "http://nbviewer.jupyter.org/github/RandyBetancourt/PythonForSASUsers/blob/master/Chapter%2002%20--%20Data%20Structures.ipynb#tuple"> tuple </a> of names.  header=None is needed in order to prevent the reader from attempting to build column names at row position nrows-1, which in our case contains data values.

There are no values for seasonally adjusted prices beyond row 3081.

In [None]:
# Assumes the same .csv file read earlier in this chapter
file_loc2 = r"C:\Data\HPI_master.csv"
df_states = pd.read_csv(file_loc2, low_memory=False,
                        parse_dates={'date_idx': [6, 7]},
                        skiprows=3083,
                        usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8),
                        names=('hpi_type', 'hpi_flavor', 'frequency', 'level', 'place_name',
                               'place_id', 'yr', 'period', 'index_nsa'),
                        header=None)
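
The interplay of skiprows=, header=, names= and usecols= is easier to see on a few lines of text.  The sketch below reads a made-up five-line file from memory; the values and the variable names text and part2 are illustrative only.

```python
from io import StringIO
import pandas as pd

# Made-up text standing in for a file whose later rows lack the last field
text = """yr,period,index_nsa,index_sa
1991,1,100.0,100.0
1991,2,101.0,101.5
2015,3,150.0
2015,4,151.0
"""

part2 = pd.read_csv(StringIO(text),
                    skiprows=3,      # skip the header line and the first two data rows
                    header=None,     # the first row read is data, not column names
                    names=('yr', 'period', 'index_nsa'),
                    usecols=(0, 1, 2))

print(part2)
```

Without header=None, the first remaining data row would be consumed as column names.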

In [None]:
df_states.head()

In [None]:
df_states.level.unique()

In [None]:
df_states.shape

Check for missing values.

In [None]:
df_states.isnull().sum()

What imputation method should be used to treat missing data?  We can start by finding the range of values.  As in most languages, there are multiple methods for accomplishing a given task.

We can set an index on the 'index_nsa' values and find the maximum and minimum.

In [None]:
df_states.set_index('index_nsa', inplace=True, drop=False)

In [None]:
print('Max value for index_nsa:', df_states['index_nsa'].max())
print('Min value for index_nsa:', df_states['index_nsa'].min())

Alternatively, we can sort the values and use the <a href="http://nbviewer.jupyter.org/github/RandyBetancourt/PythonForSASUsers/blob/master/Chapter%2005%20--%20Understanding%20Indexes.ipynb#.iloc-Indexer"> iloc </a> indexer.  

Recall that the .iloc indexer returns slices by index position, similar to the way \_n\_ in SAS behaves.

## Sort and Sort Sequences

This is a good opportunity to understand the sort behaviors for DataFrames.  We begin by examining what I call the default sort behavior.  In the example below, we provide only the minimum argument to the .sort_values() method, the sort key.

In [None]:
default_srt = df_states.sort_values('index_nsa')

By examining the first two rows of the sorted DataFrame, 'default_srt', we can see the default sort sequence is ascending.  Of course, by reading the doc for <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html"> pandas.DataFrame.sort_values </a> we could see this as well. 

In [None]:
default_srt.iloc[[0, 1], :]

We can also see that the pandas default sort sequence places NaN's last.  So this can be an alternative to using boolean operators and the .loc indexer to detect missing values.

In [None]:
default_srt.iloc[[-1, -2],:]

Since there were 2 NaN's found, we can use the .iloc indexer to return the 3rd and 4th rows from the 'bottom' of the DataFrame, displaying the 2 highest values for the column 'index_nsa'.

In [None]:
default_srt.iloc[[-3, -4], :]

And naturally, we can completely alter the organization of the DataFrame by supplying arguments and values to the .sort_values() method.  In the example below we sort using a descending sort sequence and place the missing values at the beginning.

In [None]:
states_srt2 = df_states.sort_values('index_nsa', ascending=False, na_position='first')

The first two rows in the DataFrame 'states_srt2' contain the 2 NaN's for the column 'index_nsa' values.  The next 2 rows contain the highest values for 'index_nsa'.

In [None]:
states_srt2.iloc[0:4,]
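
The sort behaviors described in this section can be reproduced on a handful of made-up values.  The sketch also shows .nlargest() as one possible alternative for finding the highest values without counting past the NaN's by position.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'index_nsa': [150.0, np.nan, 90.0, 210.0, np.nan]})

# Default: ascending, with NaN's placed last
ascending = toy.sort_values('index_nsa')

# Descending, with NaN's placed first
descending = toy.sort_values('index_nsa', ascending=False, na_position='first')

print(ascending['index_nsa'].tolist())
print(descending['index_nsa'].tolist())

# nlargest() skips NaN's, so no positional offset is needed
top_two = toy['index_nsa'].nlargest(2)
print(top_two.tolist())
```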

#### Start a new dataframe here for the frequency shifting example

In the case of the aggregate U.S. housing index (DataFrame 'df_us') created above, the value for the 'frequency' column is monthly. So we need to find what the frequency value is for the 'states_srt2' DataFrame.    

In [None]:
states_srt2.describe(include=['O'])

In [None]:
states_srt2['level'].unique()

### have not introduced groupby and most of the examples are in chapter 13

In [None]:
states_srt2.groupby('level').count() 
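
Since groupby has not been introduced yet, a small sketch may help in reading the cell above.  On made-up rows, it contrasts .count(), which tallies non-missing values per column within each group, with .size(), which tallies rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'level': ['MSA', 'MSA', 'State', 'State', 'State'],
                    'index_nsa': [120.0, np.nan, 95.0, 101.0, 110.0]})

# count() reports non-missing values per column within each group
non_missing = toy.groupby('level').count()

# size() reports the number of rows in each group, missing or not
rows = toy.groupby('level').size()

print(non_missing)
print(rows)
```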

In [None]:
df_states.sort_values('place_name')


In [None]:
df_states.set_index('place_name', inplace=True)

In [None]:
df_states.columns

In [None]:
la = df_states.loc['Los Angeles-Long Beach-Glendale, CA (MSAD)']

In [None]:
la

## Navigation

<a href="http://nbviewer.jupyter.org/github/RandyBetancourt/PythonForSASUsers/tree/master/"> Return to Chapter List </a>    