## Managing Time Series Data in Python

*This initial section is a basic primer on managing time series data structures in Python.*

In this notebook you will:

* Learn how NumPy and Pandas store datetime data of different frequencies
* Learn how to manipulate dates and times in NumPy and Pandas
* Read time series data into a Pandas DataFrame
* Set up a Pandas DataFrame with a DatetimeIndex.

Before we can produce forecasts we need to learn how to manipulate and manage dates in Python's NumPy and Pandas libraries.

In [None]:
import numpy as np
import pandas as pd

## Datetimes in NumPy

If not done correctly, dates and times can be painful to use in coding!

NumPy's data type to manage datetimes is called `datetime64`. 

### Static arrays of `datetime64`

In [None]:
np.array(['2019-07-11', '2019-07-12', '2019-07-13', '2019-07-14'], dtype='datetime64')

Notice that the np.array has dtype='datetime64[D]'.  The 'D' stands for days, the smallest unit needed to represent these dates.

Consider an alternative where we include hours.  You need to include the letter 'T' (the ISO 8601 separator between the date and the time) in the strings passed to numpy.array

In [None]:
np.array(['2019-07-11T01', '2019-07-12T02', '2019-07-13T03', '2019-07-14T17'], 
         dtype='datetime64')

This time the dtype is 'datetime64[h]', where 'h' stands for hours.  We can go further and try minutes.

In [None]:
np.array(['2019-07-11T00:13', '2019-07-12T00:15', '2019-07-13T00:15', '2019-07-14T00:05'], 
         dtype='datetime64')

And now try seconds

In [None]:
np.array(['2019-07-11T00:13:59', '2019-07-12T00:15:30', '2019-07-13T00:15:20', '2019-07-14T00:05:15'], 
         dtype='datetime64')

and milliseconds

In [None]:
np.array(['2019-07-11T00:13:59.100', '2019-07-12T00:15:30.189'], 
         dtype='datetime64')
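
The cells above show NumPy picking the unit from the precision of the strings. As a minimal sketch (the variable names here are illustrative and not part of the original notebook), you can inspect that unit with `np.datetime_data`, cast to a finer unit with `astype`, and shift dates with `np.timedelta64`:

```python
import numpy as np

days = np.array(['2019-07-11', '2019-07-12'], dtype='datetime64')
millis = np.array(['2019-07-11T00:13:59.100'], dtype='datetime64')

# np.datetime_data returns the (unit, count) pair behind a datetime64 dtype
print(np.datetime_data(days.dtype))    # ('D', 1)
print(np.datetime_data(millis.dtype))  # ('ms', 1)

# Cast days to hours; the time of day defaults to midnight
as_hours = days.astype('datetime64[h]')
print(as_hours)

# Date arithmetic uses timedelta64 values
next_week = days + np.timedelta64(7, 'D')
print(next_week)
```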

### Quick creation of date arrays using `np.arange`

`np.arange(start,stop,step)` (where stop is *exclusive*) is commonly used to produce an np.array of evenly-spaced integers (particularly good for generating synthetic testing data).

`np.arange` can also be used to generate a range of date time stamps.

*Try changing the step argument to a different value*

In [None]:
np.arange('2019-07-01', '2019-07-31', step=3, dtype='datetime64[D]')

In [None]:
foo = np.arange('2019-07-01', '2019-07-31', step=7, dtype='datetime64[m]')
foo.shape
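
To see where the shape in the previous cell comes from, here is a small sketch (the names `stamps` and `expected` are illustrative). With `dtype='datetime64[m]'` the date strings are read as midnight and the integer step counts minutes, so the array length is the number of minutes in the range divided by the step, rounded up:

```python
import numpy as np

# Every 7 minutes from midnight on 1st July up to (not including) midnight on 31st July
stamps = np.arange('2019-07-01', '2019-07-31', step=7, dtype='datetime64[m]')

# 30 days * 24 hours * 60 minutes in the half-open range
total_minutes = 30 * 24 * 60
expected = -(-total_minutes // 7)  # ceiling division

print(stamps[:3])
print(len(stamps), expected)  # 6172 6172
```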

To get all values in between two dates then *omit* the step argument.  The below generates the days from 1st to 9th July (the stop date, 10th July, is excluded)

In [None]:
np.arange('2019-07-01', '2019-07-10', dtype='datetime64[D]')
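
Because the stop value of `np.arange` is exclusive, the last date you ask for is not returned. One possible way to include it, sketched here with illustrative variable names, is to push the stop forward by one unit using `np.timedelta64`:

```python
import numpy as np

start = np.datetime64('2019-07-01')
stop = np.datetime64('2019-07-10')

# Default behaviour: the stop date is left out
exclusive = np.arange(start, stop, dtype='datetime64[D]')

# Add one day to the stop so that 10th July is included
inclusive = np.arange(start, stop + np.timedelta64(1, 'D'), dtype='datetime64[D]')

print(exclusive[-1], len(exclusive))  # 2019-07-09 9
print(inclusive[-1], len(inclusive))  # 2019-07-10 10
```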

## Date Time Index in pandas

pandas `DatetimeIndex` builds on the NumPy `datetime64` data type.  Pandas is one of the most convenient ways to work with time series data in Python.  One of the reasons for this is that pandas can detect and handle many different formats of date strings in input files.  Always watch out for ambiguity between US (month first) and UK (day first) dates.

### Static creation

If you need to create some synthetic data for testing then you can use the `pandas.date_range` function.

In [None]:
#note that by default pandas will assume the below is MM/DD/YYYY
index = pd.date_range('1/3/2019', periods=7, freq='D')
index

* An hourly date range

In [None]:
index = pd.date_range('1/1/2019', periods=7, freq='h')
index

* A 'monthly start' range.

In [None]:
index = pd.date_range('1/1/2019', periods=7, freq='MS')
index
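
The cells above fix a start date and a number of periods. `pd.date_range` also accepts a `start` and an `end`, and unlike `np.arange` it includes the end date by default. A short sketch (the name `week` is illustrative):

```python
import pandas as pd

# Unambiguous ISO dates; both end points are included
week = pd.date_range(start='2019-01-01', end='2019-01-07', freq='D')

print(len(week))          # 7
print(week[0], week[-1])
print(week.freqstr)       # D
```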

**Convert a list to datetime index**

In [None]:
dates = ['1/1/2019', '2/1/2019', '3/1/2019']
index = pd.DatetimeIndex(dates)
index

**US to UK problems**

In [None]:
dates = ['1/1/2019', '2/1/2019', '3/1/2019']
#dayfirst=True tells pandas to read the strings as DD/MM/YYYY
index = pd.DatetimeIndex(dates, dayfirst=True)
index
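
If you already know the layout of your date strings, an alternative to `dayfirst` is to state the format explicitly with `pd.to_datetime`. The sketch below (variable names are illustrative) parses the same three strings both ways so the difference is visible:

```python
import pandas as pd

dates = ['1/1/2019', '2/1/2019', '3/1/2019']

# Same strings, two interpretations
us_style = pd.to_datetime(dates, format='%m/%d/%Y')  # month first
uk_style = pd.to_datetime(dates, format='%d/%m/%Y')  # day first

print(us_style.month.tolist())  # [1, 2, 3] -> 1st of Jan, Feb, Mar
print(uk_style.day.tolist())    # [1, 2, 3] -> 1st, 2nd, 3rd of Jan
```

An explicit format has the advantage that a string which does not match raises an error rather than being parsed the wrong way round.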

**Convert numpy array to datetime index**

For data manipulation and analysis I often find myself moving between NumPy arrays and pandas DataFrames. 

In [None]:
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index

Note that in the example above the frequency is **None**.  That's annoying and there are some forecasting tools in Python that will insist on having a frequency.  There are two ways to sort this out.

In [None]:
#pass in the frequency argument
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates, freq='D')
index

In [None]:
#set the frequency post-hoc
arr_dates = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
index = pd.DatetimeIndex(arr_dates)
index.freq = 'D'
index
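
Before setting a frequency it can help to check what pandas thinks the spacing is. The sketch below (illustrative names, not from the original notebook) uses `pd.infer_freq`, which needs at least three dates, and also shows that pandas raises a `ValueError` if you pass a frequency the dates do not conform to:

```python
import numpy as np
import pandas as pd

arr_regular = np.array(['2019-07-11', '2019-07-12', '2019-07-13'], dtype='datetime64')
regular = pd.DatetimeIndex(arr_regular)

# Ask pandas what frequency the dates follow
inferred = pd.infer_freq(regular)
print(inferred)  # D

# Rebuild the index with the inferred frequency attached
regular = pd.DatetimeIndex(regular, freq=inferred)
print(regular.freqstr)  # D

# A gap (the 13th is missing) means the dates are not strictly daily
arr_gappy = np.array(['2019-07-11', '2019-07-12', '2019-07-14'], dtype='datetime64')
rejected = False
try:
    pd.DatetimeIndex(arr_gappy, freq='D')
except ValueError as err:
    rejected = True
    print('Rejected:', err)
```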

**Finding the min and max in a date time index and accessing a Timestamp**

In [None]:
index.min()

In [None]:
index.max()

In [None]:
print(index.min().year)
print(index.min().month)
print(index.min().days_in_month)
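
The same attributes are available on the whole index, not just on a single Timestamp. A brief sketch with an illustrative index called `stamps`:

```python
import pandas as pd

stamps = pd.DatetimeIndex(['2019-07-11', '2019-07-12', '2019-07-13'])

# Attribute access is vectorised across the index
print(stamps.year.tolist())        # [2019, 2019, 2019]
print(stamps.day_name().tolist())  # ['Thursday', 'Friday', 'Saturday']

# Subtracting two Timestamps gives a Timedelta
span = stamps.max() - stamps.min()
print(span)                        # 2 days 00:00:00
```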

## Importing data from a CSV file

First create a synthetic data set and save to csv

In [None]:
LAMBDA = 30
PERIODS = 365 * 2

idx = pd.date_range('2018-01-01', periods=PERIODS, freq='D')

# representing a count variable of sales of widgets with mean LAMBDA.
sales = np.random.poisson(LAMBDA, size=PERIODS) 
df = pd.DataFrame(sales, index=idx)
df.columns = ['sales']
df.index.name = 'date'
df.head()


In [None]:
df.to_csv('data/example_data1.csv') # save to file (the data/ directory must already exist)

Now read in the data and let pandas know that the index is a date field using the `parse_dates` argument.

In [None]:
df = pd.read_csv('data/example_data1.csv', index_col='date', parse_dates=True)
#read_csv does not set the frequency, so you have to set it manually
df.index.freq = 'D'

In [None]:
df.index

In [None]:
df.head(10)
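
To try the save-and-reload round trip without touching the file system, here is a small sketch that writes to an in-memory buffer instead. The buffer, the seed and the names `original` and `loaded` are assumptions for illustration. It also uses `asfreq` as an alternative way to attach the frequency after loading:

```python
import io

import numpy as np
import pandas as pd

# Small synthetic daily series; seeded so the run is repeatable
rng = np.random.default_rng(42)
idx = pd.date_range('2018-01-01', periods=10, freq='D')
original = pd.DataFrame({'sales': rng.poisson(30, size=10)}, index=idx)
original.index.name = 'date'

# Write the CSV to memory and rewind the buffer for reading
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

loaded = pd.read_csv(buffer, index_col='date', parse_dates=True)
print(type(loaded.index).__name__)  # DatetimeIndex
print(loaded.index.freq)            # None

# asfreq returns a DataFrame whose index carries the requested frequency
loaded = loaded.asfreq('D')
print(loaded.index.freqstr)         # D
```

Note that `asfreq` inserts rows of missing values for any dates absent from the file, whereas assigning `index.freq` directly raises an error when the dates do not conform.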