# 3. Time Series Analysis


## Contents
    1. Introduction
    2. NumPy
        1. Arange
        2. Zeros
        3. Ones
        4. Linspace
        5. Identity Matrix
        6. Random Data
            1. Uniform Distribution
    3. Time Series with Pandas
        1. DateTime Index


## Introduction

These notes cover the following:

 - Using Python to work with time series data
    - Pandas and statsmodels will be used to create predictive models
 - Some NumPy basics
 - The time series tools that are included with Pandas
 - The time series functionality in statsmodels
 - General forecasting models, with a focus on ARIMA models
 - Deep learning and Prophet, one of the more recent forecasting methods

## NumPy

For the majority of this section I will only cover some useful methods that are included with NumPy. I assume that readers of these notes already know how to create and slice NumPy arrays.


### Arange

`np.arange` creates a **NumPy array** from a range of values defined by a start, a stop and a step. The stop value is not included in the result.

While this may seem basic, the same method also works with datetime data, which makes it useful for building date values for an index or other variables.

Let's have a look at how it works:

In [None]:
import numpy as np

np.arange(0,10,2)
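
As a small sketch of how the arguments behave (the variable names here are only illustrative), the stop value is left out of the result, and a single argument counts up from 0:

```python
import numpy as np

# np.arange(start, stop, step): the stop value itself is never included
evens = np.arange(0, 10, 2)
print(evens)         # [0 2 4 6 8]

# With a single argument the range starts at 0 and steps by 1
first_five = np.arange(5)
print(first_five)    # [0 1 2 3 4]
```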

### Zeros:

At times we need an array that contains only zeros. This can be done using the following method:

Note that the value passed in is the size of the array that we are creating.

In [None]:
np.zeros(5)

We can also create a two-dimensional array of zeros by passing the shape as a tuple of (rows, columns):

In [None]:
np.zeros((4,10))
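
A quick check of what comes back, as a minimal sketch (the name `grid` is just for illustration):

```python
import numpy as np

grid = np.zeros((4, 10))   # shape passed as a (rows, columns) tuple

print(grid.shape)          # (4, 10)
print(grid.dtype)          # float64, the default type for np.zeros
```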


### Ones:

Just like the method that creates an array of zeros, we may need an array where every value is 1. This can be done with the `ones` method.

Just like the **Zeros** method, the passed value is the size of the array:

In [None]:
np.ones(4)

### Linspace

We have seen that the **Arange** method creates an array from a starting point, an ending point and an increment. There are also cases where we know how many values we want rather than the increment. This is where the **Linspace** method becomes useful.

With the **Linspace** method, the third argument is the number of values, and the final array has that many values equally spaced between the minimum and maximum values, with both ends included.

In [None]:
np.linspace(0,10,20)
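
To make the difference from **Arange** concrete, here is a small sketch with illustrative values. The `retstep=True` option also returns the spacing that NumPy worked out:

```python
import numpy as np

# 5 points from 0 to 10, with the end point included
points = np.linspace(0, 10, 5)
print(points)        # [ 0.   2.5  5.   7.5 10. ]

# retstep=True also returns the spacing between consecutive points
points, step = np.linspace(0, 10, 5, retstep=True)
print(step)          # 2.5
```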

### Identity Matrix

While this is not used in time series analysis, it is useful to know how to create an identity matrix for any analysis in the future.

In [None]:
np.eye(5)
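
As a reminder of why the identity matrix matters, this sketch (with a made-up 3 x 3 matrix) shows that multiplying by it leaves a matrix unchanged:

```python
import numpy as np

identity = np.eye(3)
matrix = np.arange(9).reshape(3, 3)    # illustrative values 0..8

# Matrix multiplication by the identity returns the same values
product = matrix @ identity
print(np.array_equal(product, matrix))   # True
```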

### Random Data

#### Uniform Distribution:

The first array we can create is drawn from a uniform distribution. The values lie between 0 and 1 (0 is included, 1 is not):

In [None]:
np.random.rand(10)

#### Normal Distribution:

A uniform distribution is not always what we want, so there are options to draw the data from other distributions.

The first of these is the standard normal distribution (mean 0, standard deviation 1), which we can sample using the following method:

In [None]:
np.random.randn(10)

### Random numbers between values:

In some cases we want random integers within a specific range. `randint` takes the lower bound (included), the upper bound (excluded) and the number of values to draw:

In [None]:
np.random.randint(1,100,10)
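
The bounds can be checked with a larger sample. This is only a sketch, and the sample size of 1000 is an arbitrary choice:

```python
import numpy as np

samples = np.random.randint(1, 100, 1000)

print(samples.min() >= 1)     # True: the lower bound is inclusive
print(samples.max() <= 99)    # True: the upper bound (100) is exclusive
print(samples.dtype.kind)     # 'i': the values are integers
```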

### Setting the seed:

For most analysis we need to ensure that the results are reproducible. An important part of this is setting a seed before we create random values.

In [None]:
np.random.seed(42)

np.random.rand(4)

In Jupyter we need to keep the seed in the same cell as the random function, so that the seed is applied every time that cell is run.
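
The effect of the seed can be seen in a short sketch: resetting the same seed repeats the same values, while a further call without resetting carries on to new values.

```python
import numpy as np

np.random.seed(42)
first = np.random.rand(4)

np.random.seed(42)            # reset to the same seed
second = np.random.rand(4)

print(np.array_equal(first, second))   # True: same seed, same values

third = np.random.rand(4)     # no reset: the generator has moved on
print(np.array_equal(first, third))    # False
```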

We can now start with creating a few arrays to work with.

In [None]:
arr = np.arange(25)
ranarr = np.random.randint(0,50,10)

We may want to reshape an array at some point, so let's reshape `arr` into a 5 x 5 array:

In [None]:
arr

In [None]:
arr.reshape(5,5)

The new shape has to hold exactly as many elements as the array contains, otherwise an error is raised.
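
A minimal sketch of both cases follows. The 4 x 6 shape is chosen only because it cannot hold 25 elements:

```python
import numpy as np

arr = np.arange(25)

# 25 elements fit a 5 x 5 grid exactly
square = arr.reshape(5, 5)
print(square.shape)            # (5, 5)

# 25 elements cannot fill a 4 x 6 grid, so NumPy raises a ValueError
try:
    arr.reshape(4, 6)
    failed = False
except ValueError as err:
    failed = True
    print("reshape failed:", err)

# -1 asks NumPy to work out that dimension from the array size
inferred = arr.reshape(5, -1)
print(inferred.shape)          # (5, 5)
```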

We can also check for the maximum and minimum values within an array:

In [None]:
ranarr.max()

In [None]:
ranarr.min()

We can find the index location of the maximum and minimum of the array:

In [None]:
ranarr.argmax()

In [None]:
ranarr.argmin()
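
Since `ranarr` is random, here is a sketch with fixed, made-up values that shows how the index methods relate to `max` and `min`:

```python
import numpy as np

values = np.array([12, 47, 3, 29])    # illustrative values

print(values.argmax())                # 1: position of the largest value
print(values[values.argmax()])        # 47: the same value that max() returns
print(values.argmin())                # 2: position of the smallest value
```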

We can check the datatype of the values within the array with the following:

In [None]:
ranarr.dtype

### Indexing and selection:

Indexing returns a single element, while selection (slicing) returns a range of elements.

Let's get a single element from our array by its index:

In [None]:
arr[8]

And a slice from a range of positions (the end position is not included):

In [None]:
arr[1:8]

We can also broadcast operations across our array, which applies them to every element:

In [None]:
arr + 10


We can slice the array with the following:

In [None]:
slice_1 = arr[0:6]

slice_1

Now we can modify the slice:

In [None]:
slice_1[:] = 99

slice_1

It is important to note that the slice is a view of the original array, so the original array has changed as well:

In [None]:
arr

To make an independent copy of an array we can use the `copy` method:

In [None]:
arr_copy = arr.copy()


arr_copy
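
The difference between a view and a copy can be sketched side by side. The names `view` and `copied` are only for illustration:

```python
import numpy as np

arr = np.arange(10)

view = arr[0:3]            # a slice is a view onto the same data
view[:] = 99
print(arr[:3])             # [99 99 99]: the original changed too

arr = np.arange(10)        # start again from fresh data
copied = arr[0:3].copy()   # an independent copy
copied[:] = 99
print(arr[:3])             # [0 1 2]: the original is untouched
print(np.shares_memory(arr, copied))   # False
```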


## Time Series Data:

### DateTime Index

Standard Python has built-in support for creating datetime values through the `datetime` module.

Let's try creating an object:

In [None]:
from datetime import datetime

my_year = 2020
my_month = 6
my_day = 18
my_hour = 16
my_min = 25
my_sec = 15

my_date = datetime(my_year, my_month, my_day)

my_date

Note that when the time elements are left out, the hour, minute and second default to 0.


But if we define all of them:

In [None]:
my_date_time = datetime(my_year, my_month, my_day, my_hour, my_min, my_sec)

my_date_time

We can now interact with these objects, for example by reading a single attribute:

In [None]:
my_date_time.day
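
A short sketch of the defaults and a few of the available attributes, using the same date as above:

```python
from datetime import datetime

# Only the date is given, so the time parts default to 0
date_only = datetime(2020, 6, 18)
print(date_only.hour, date_only.minute, date_only.second)   # 0 0 0

# All parts given
full = datetime(2020, 6, 18, 16, 25, 15)
print(full.year, full.hour)    # 2020 16
print(full.weekday())          # 3: Monday is 0, so this is a Thursday
```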


We have seen what Python has available, but what about NumPy?

In [None]:
np.array(['2020-03-15','2020-03-16','2020-03-17'])

NumPy will treat these as normal strings, but we can convert them by setting the dtype:

In [None]:
np.array(['2020-03-15','2020-03-16','2020-03-17'],dtype='datetime64')

For extra control we can set the precision after `datetime64` in square brackets, for example `[Y]` for years or `[D]` for days:

In [None]:
np.array(['2020-03-15','2020-03-16','2020-03-17'],dtype='datetime64[Y]')


We can also use ranges to create an array of dates. Here the step of 7 gives weekly dates, and the end date is not included:

In [None]:
np.arange('2018-06-01','2018-06-23',7,dtype='datetime64[D]')
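
Because these are real date values and not strings, arithmetic works on them. A minimal sketch, reusing the dates from the cells above:

```python
import numpy as np

dates = np.array(['2020-03-15', '2020-03-16'], dtype='datetime64[D]')

# Subtracting two dates gives a numpy.timedelta64 of one day
gap = dates[1] - dates[0]
print(gap == np.timedelta64(1, 'D'))   # True

# Weekly steps: the end date ('2018-06-23') is excluded
weekly = np.arange('2018-06-01', '2018-06-23', 7, dtype='datetime64[D]')
print(weekly)   # ['2018-06-01' '2018-06-08' '2018-06-15' '2018-06-22']
```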

This can also be done with Pandas, as it is built on top of NumPy.

Pandas also has some built-in utilities to handle datetime values:

In [None]:
import pandas as pd

pd.date_range('2020-01-01',periods=7,freq='D')

This gives us a DatetimeIndex built from these values, which is aware of the datetime type.

Pandas is also really good at inferring dates from strings:

In [None]:
pd.date_range('Jan 01, 2018',periods=7,freq='D')

The string needs to be in one of the formats that Pandas recognises, but this will generally work.

We can also use `pd.to_datetime` to convert a list of date strings:

In [None]:
pd.to_datetime(['1/2/2018','Jan 03, 2018'])

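This converts the strings into datetime values. Note that recent Pandas versions (2.0 and later) expect all strings in the list to share one format, and may need `format='mixed'` for a list like this one.

By default a date such as 1/2/2018 is read in the American month-first order. If the data is in day-first order we can specify the format: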

In [None]:
pd.to_datetime(['1/2/2018','1/3/2018'],format='%d/%m/%Y')

This now reads the dates in day/month/year order.

Let's have a look at Pandas datetime analysis with random data:
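
To see the effect of the format string, this sketch parses the same two strings both ways. The names `month_first` and `day_first` are only for illustration:

```python
import pandas as pd

raw = ['1/2/2018', '1/3/2018']

month_first = pd.to_datetime(raw, format='%m/%d/%Y')   # American order
day_first = pd.to_datetime(raw, format='%d/%m/%Y')     # day before month

print(month_first[0].month, month_first[0].day)   # 1 2: 2 January
print(day_first[0].month, day_first[0].day)       # 2 1: 1 February
```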

In [None]:
data = np.random.randn(3,2)

data

In [None]:
idx = pd.date_range('2020-01-01',periods=3,freq='D')
cols=['A','B']
df = pd.DataFrame(data,index=idx,columns=cols)

In [None]:
df.index

In [None]:
df.index.max()

In [None]:
df.index.argmax()

### Time Resampling

Resampling is similar to grouping data together: the rows are grouped by a time frequency and a function is then applied to each group.

Let's start by importing the data we will be working with.

We will also set the `Date` column as the index and parse it as datetime values:

In [None]:
df = pd.read_csv('Data\\starbucks.csv',index_col='Date',parse_dates=True)

df.head()

In [None]:
df.index

We can see that the data is daily but has no rows for weekends. We can resample it based on a rule, which is a frequency code such as `'D'` for daily:

In [None]:
df.resample(rule='D').sum()

This is along the same lines as a group by, except that the rule (a time frequency) determines how the rows are grouped before the aggregate function is applied.
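
Since the Starbucks file may not be to hand, here is a sketch on made-up data: 14 daily values resampled to weekly means. The `prices` frame and its `Close` values are hypothetical, and `'W'` is the weekly frequency code (weeks ending on Sunday).

```python
import numpy as np
import pandas as pd

# Hypothetical data: 14 daily values (1..14), starting on a Monday
idx = pd.date_range('2020-01-06', periods=14, freq='D')
prices = pd.DataFrame({'Close': np.arange(1, 15)}, index=idx)

# Group the rows into weeks and take the mean of each week
weekly_mean = prices.resample('W').mean()
print(weekly_mean)
# First week (values 1..7) has mean 4.0, second week (8..14) has mean 11.0
```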

We can combine resampling with plotting:

In [None]:
%matplotlib inline
df['Close'].resample('M').mean().plot.bar(title='Monthly closing price means')

### Time Shifting

We may need to shift the data up and down along the time series index. We will still be using the Starbucks data file.

In [None]:
df.head()

In [None]:
df.tail()

We can shift using the `shift` method:

In [None]:
df.shift(1)

This moves every value down one row, to the next date. The first row is left empty (NaN) and the value from the final row is lost.

A negative number shifts everything back one row instead, which leaves the final row empty.

Shifting can also be done using time series frequency codes. In this case the index is shifted rather than the values, so no data is lost:

In [None]:
df.shift(periods=1,freq='M')
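
The three kinds of shift can be compared on a small made-up series. This sketch uses a daily frequency code instead of the monthly one above, to keep the output easy to follow:

```python
import pandas as pd

# Hypothetical data: four daily values
idx = pd.date_range('2020-01-01', periods=4, freq='D')
s = pd.Series([10, 20, 30, 40], index=idx)

down = s.shift(1)     # values move to the next date, first row becomes NaN
up = s.shift(-1)      # values move to the previous date, last row becomes NaN

# With freq, the index moves by one day and the values stay as they are
moved_index = s.shift(periods=1, freq='D')

print(down)
print(up)
print(moved_index)
```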

### Rolling and Expanding

Rolling groups the data into windows of time and then applies an aggregate function to each window.


We will plot the data first with no rolling mean:

In [None]:
%matplotlib inline
df['Close'].plot(figsize=(12,5))


In [None]:
df.rolling(window=7).mean()

`window` is the size of the section we want to look at, counted in rows (for this data, days). This calculates the moving average over the last 7 rows, so the first 6 rows are empty.

In [None]:
%matplotlib inline
df['Close'].plot(figsize=(12,5))
df.rolling(window=7).mean()['Close'].plot()

We can also use expanding, which considers all prior data at each point in an ever-growing window.

In [None]:
%matplotlib inline
df['Close'].expanding().mean().plot(figsize=(12,5))



The expanding mean should level off eventually, because each new value has less influence as the amount of data included keeps increasing.
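
The two methods are easiest to compare on a tiny made-up series. This sketch uses a window of 3 instead of 7 so the numbers can be checked by hand:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])    # illustrative values

# Rolling: mean of the current row and the two rows before it
rolling_mean = s.rolling(window=3).mean()
print(rolling_mean.tolist())     # first two entries are NaN, then 2.0, 3.0, 4.0

# Expanding: mean of everything from the first row up to the current row
expanding_mean = s.expanding().mean()
print(expanding_mean.tolist())   # 1.0, 1.5, 2.0, 2.5, 3.0
```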
