<figure>
  <IMG SRC="https://www.capgemini.com/nl-nl/wp-content/themes/capgemini-2018-02/assets/images/logo.svg" WIDTH=250 ALIGN="RIGHT"/>
</figure>

# DS Python Workshop 3: Time Series
*By Jeroen Dhondt & Allain Issa*

### How to use this notebook

 
*   Double-click any cell to edit its content
*   Press ```SHIFT+ENTER``` to execute a cell
*   The exercises have a difficulty assigned to them, from \* to \*\*\*

### Resources
*   Official Python documentation (can be a bit hard to understand): [Link official documentation](https://docs.python.org/3/)
*   Official Pandas documentation (search box at the bottom left): [Pandas](https://pandas.pydata.org/pandas-docs/stable/)

But in reality: just search Google or Stack Overflow for `'pandas do x'`


# Chapter 1: Introducing Time Series
A time series is a set of data where the values are represented over time. The index is therefore a time or date unit. Examples are the hourly stock exchange index or the daily number of visitors to a museum.

**Important:** to start, download all the data files by running the cell below.
If that doesn't work, let the Jupyter notebook know where to look for the data files. Make sure to use double backslashes (`\\`) in your path name, as shown in the example.

In [0]:
!curl -O https://raw.githubusercontent.com/JeroenD-BE/PythonDataScienceWorkshops/master/data/climate-belgium.csv
!curl -O https://raw.githubusercontent.com/JeroenD-BE/PythonDataScienceWorkshops/master/data/new-year-resolutions.csv
!curl -O https://raw.githubusercontent.com/JeroenD-BE/PythonDataScienceWorkshops/master/data/time_serie_1.xlsx

# if that doesn't work:
#import os
#data_folder = "C:\\Users\\...\\data"
#os.chdir(data_folder)
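
Why the doubled backslashes? In a normal Python string a single backslash starts an escape sequence, and `\U` (as in `\Users`) is even a syntax error. Below is a small sketch of three equivalent spellings. The folder `C:\Users\me\data` is a made-up placeholder, not a path from this workshop.

```python
from pathlib import PureWindowsPath

# hypothetical folder, only for illustration
escaped = "C:\\Users\\me\\data"   # every backslash doubled
raw = r"C:\Users\me\data"         # raw string: backslashes are taken literally
forward = "C:/Users/me/data"      # forward slashes are also accepted on Windows

print(escaped == raw)
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```

Any of the three spellings can be passed to `os.chdir`.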

## Importing the data and cleaning

We begin by reading our new data file "new-year-resolutions.csv", specifying that the separator character in the file is a comma (,).

The data here is the **relative popularity of search terms on Google** (found via Google Trends). The popularity of three keywords was compared on a monthly basis: diet, gym and finance. We expect to see an increase in the popularity of 'diet' and 'gym' around New Year, because people make their resolutions for the new year. Let us find out together!

In [0]:
import pandas as pd

resolutions = pd.read_csv("new-year-resolutions.csv",sep=",")
print(resolutions.info())
resolutions.head()

With the data loaded, we now reformat our data to have the preferred column names and index. 

PS> Remember, we use `inplace=True` so the operation on the data set is done directly on the object. With `inplace=False` (the default value), the operation would create and return a new copy of the data instead!

In [0]:
resolutions.columns = ["month","diet","gym","finance"]
resolutions.set_index("month", inplace=True)

resolutions.head()
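
To make the difference between the two `inplace` settings concrete, here is a minimal sketch on a two-row toy frame (the numbers are invented, this is not the workshop file):

```python
import pandas as pd

# toy data, not the workshop file
df = pd.DataFrame({"month": ["2004-01", "2004-02"], "diet": [100, 75]})

# default (inplace=False): a new DataFrame is returned, df itself is unchanged
indexed = df.set_index("month")
name_before = df.index.name
print(name_before, indexed.index.name)

# inplace=True: df itself is modified and the call returns None
result = df.set_index("month", inplace=True)
print(result, df.index.name)
```

A common mistake is writing `df = df.set_index("month", inplace=True)`, which overwrites `df` with `None`.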

## Using the DatetimeIndex of Pandas

The data looks much better now. However, we still have a problem with our index:

In [0]:
resolutions.index

Pandas does not recognize the index as dates (see above, `dtype='object'`). Converting the index to the **type `DatetimeIndex`** gives us a lot of powerful tools to work with this data.

A single string can be transformed into a datetime object with `pd.to_datetime(...)`. Here we transform all elements of the index of our data set in one line of code:

In [0]:
resolutions.index = pd.to_datetime(resolutions.index)

print(resolutions.index)
resolutions.index[0]


Now the index is of type `datetime64[ns]` and a single entry is shown as a timestamp. This will allow us to do a lot of neat things with this time series.
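
A small sketch of what `pd.to_datetime` does, using made-up month strings rather than the workshop file (the exact date format inside the CSV is an assumption here):

```python
import pandas as pd

# a single string becomes a Timestamp
single = pd.to_datetime("2004-01")
print(single)

# an Index of strings becomes a DatetimeIndex
idx = pd.Index(["2004-01", "2004-02", "2004-03"])
converted = pd.to_datetime(idx)
print(type(converted).__name__)

# every entry now knows its own year, month, day, ...
print(converted[0].year, converted[0].month)
```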

So what are the benefits of using a `DatetimeIndex`? It allows you to take a subset of your data very easily. For instance, to get all data of 2014:

In [0]:
# .loc is needed here: recent pandas versions no longer accept resolutions["2014"] for row selection
resolutions.loc["2014"]


Or select a range (`start:end`, including both the start and the end element). For example, to select the first 4 months of 2005:

In [0]:
resolutions["2005-01":"2005-04"]
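
The same two selections can be checked on a toy monthly frame where the expected number of rows is easy to count. This is a sketch with invented values, written with `.loc`:

```python
import pandas as pd

# toy monthly data covering 2004-2006, not the workshop data
idx = pd.date_range("2004-01-01", periods=36, freq="MS")
toy = pd.DataFrame({"value": range(36)}, index=idx)

year_2005 = toy.loc["2005"]                 # every row that falls in 2005
first_four = toy.loc["2005-01":"2005-04"]   # start AND end month are included

print(len(year_2005), len(first_four))
```

Note the difference with ordinary Python slices, where the end element is excluded.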

Finally, let us visualize the data with the built-in default plotter. More advanced visualizations are discussed in a later chapter.

For now, just remember that the `.plot()` method gives you a quick graph of your data. `figsize` sets the size of the image. Feel free to adjust it to fit your screen size.

In [0]:
resolutions.plot(figsize=(15,8))

Some interesting recurring patterns can be seen here: at the beginning of each year, diet & gym become more popular. We also see an upward trend for gym and a downward trend for diet. We will explore these findings further in chapter 2. First, we will do some exercises on another data set:

## <a name="ex1_1"></a> Exercise 1.1: Stock exchange prices (*)
Below we load the stock exchange info for Amazon.

- What is the type of the index of the data? Make it into a DatetimeIndex
- Filter out only the data from the years 2015 & 2016
- Visualize the closing price in a simple plot for the month July 2015

In [0]:
# install the pandas_datareader package to fetch the data
!pip install pandas_datareader

from pandas_datareader import data
# note: the 'iex' source needs an IEX Cloud API key in recent pandas_datareader versions
stock = data.DataReader('AMZN', 'iex', start='2014')

In [0]:
# Write your code here

<a href="#ex1_1answer">Answer to Exercise 1.1</a>

## Exercise 1.2:

# Chapter 2: Finding trends and seasonal effects in data

## Resampling

The first technique that we will tackle in this chapter to detect trends is resampling. Resampling simply means converting your data to a different time interval. For instance, as we will see in the following example, from monthly data to quarterly (3-month) data.

Resampling smooths your data and makes trends clearer, and it can also help to remove the impact of seasonal effects.

---
Let's revisit our "resolutions" data frame: 

In [0]:
resolutions.head()

Pandas can resample to any time range, from (nano)seconds to years. A full list can be found here: [http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)


Different techniques to **consolidate** (= merge the values together) the data also exist, including taking the average of the points or taking the first element of the series. Here's the full list of functions to consolidate the data: sum, mean, std, sem, max, min, median, first, last, ohlc, or you can write a custom function yourself.

We will first choose to resample quarterly (`Q`), using the mean to consolidate:

In [0]:
# 'Q' labels each quarter by its last day; pandas 2.2 and later prefer the equivalent alias 'QE'
quarterly_mean = resolutions.resample('Q').mean()
quarterly_mean.head()

Or taking the first value for 2-year intervals can be done as follows:

In [0]:
resolutions.resample('2Y').first().head()
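
To see what the consolidation functions do, it helps to try them on numbers you can check by head. The sketch below uses a toy series with the values 1 to 12. It uses the alias `"QS"` (quarter start), which labels each quarter by its first day and is chosen here only because it is spelled the same in old and new pandas versions:

```python
import pandas as pd

# toy monthly data for one year: the values 1..12 (not the workshop data)
idx = pd.date_range("2020-01-01", periods=12, freq="MS")
toy = pd.Series(range(1, 13), index=idx)

by_quarter = toy.resample("QS")

quarter_mean = by_quarter.mean()    # average of the 3 months in each quarter
quarter_first = by_quarter.first()  # first month of each quarter
quarter_max = by_quarter.max()      # largest value in each quarter

print(quarter_mean.tolist())
print(quarter_first.tolist())
print(quarter_max.tolist())
```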

The interval doesn't necessarily have to become larger: one can also resample from, for instance, weekly to daily data. In that case you choose a technique to fill in the unknown values (for example 'padding', which repeats the last known value). We will not go into further detail here.
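
For the curious, a brief sketch of going from weekly to daily data on three made-up weekly values. It shows three options: leaving the new days empty, padding (forward fill) and linear interpolation. Which one is appropriate depends on the data.

```python
import pandas as pd

# toy weekly series, three Mondays in a row (made-up numbers)
weekly = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.date_range("2021-01-04", periods=3, freq="W-MON"),
)

daily_gaps = weekly.resample("D").asfreq()          # new days are NaN
daily_padded = weekly.resample("D").ffill()         # repeat the last known value
daily_linear = weekly.resample("D").interpolate()   # straight line between known points

print(len(daily_gaps), int(daily_gaps.isna().sum()))
print(daily_padded.iloc[:8].tolist())
print(daily_linear.iloc[:8].round(2).tolist())
```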

## Finding a trend with a rolling window
The next important technique is rolling windows. Rolling windows also merge values over a larger time range. However, they recalculate the value at every point in time instead of consolidating the data into fewer points. This is particularly useful for removing seasonal effects.

In [0]:
resolutions_rolling = resolutions.rolling(12).mean()
resolutions_rolling.tail()

Let's visually check if this had the desired effect. We need to join the data of our original data frame with the newly created one, 'resolutions_rolling'.

In [0]:
resolutions_joined = resolutions.join(resolutions_rolling, rsuffix='_rolling') #suffix is added to the name of columns that are found in both dataframes to avoid names clashing

# focus only on diet
diet = resolutions_joined[['diet','diet_rolling']]
diet.plot(figsize=(15,8))

The 12-month rolling average is indeed a good representation of the underlying trend, with the seasonal effect removed.

Note that the window has to match the yearly cycle: a window that is not a multiple of 12 months (for example 6 or 18) leaves part of the seasonal swing in the result. Try it yourself if you are curious.
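
The effect of the window size can be checked on synthetic data where the answer is known. The sketch below builds an idealised series (a straight-line trend plus a swing that repeats exactly every 12 months, both invented for illustration) and measures how much wiggle is left after smoothing with different windows:

```python
import numpy as np
import pandas as pd

# synthetic monthly series: linear trend + a swing that repeats every 12 months
idx = pd.date_range("2004-01-01", periods=120, freq="MS")
t = np.arange(120)
trend = 50 + 0.1 * t
seasonal = 10 * np.cos(2 * np.pi * t / 12)
series = pd.Series(trend + seasonal, index=idx)

wiggles = {}
for window in (7, 12, 24):
    smooth = series.rolling(window).mean().dropna()
    # a pure linear trend rises by the same amount every month, so the spread
    # of the month-to-month changes measures the leftover seasonal wiggle
    wiggles[window] = smooth.diff().dropna().std()
    print(window, round(wiggles[window], 4))
```

On this idealised series the windows of 12 and 24 months leave practically no wiggle, while the 7-month window clearly does. Real data is noisier, so the result will be less clean there.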

## Seasonal effects

So how do we study the seasonal swing? We make a simple assumption: the data is simply the sum of a seasonal effect on top of a general trend.

`data = trend + seasonal`, and therefore `seasonal = data - trend`.

Let's find out if this is a good assumption.

In [0]:
seasonal_diet = diet['diet'] - diet['diet_rolling']

# let's zoom into the period 2006-2009 to make it clearer
seasonal_diet['2006':'2009'].plot(figsize=(20,10),grid=True)
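
One way to summarise a seasonal component is to average it per calendar month. The sketch below does this on synthetic data with an invented January peak. The `groupby` step is an addition for illustration and is not part of the workshop code:

```python
import numpy as np
import pandas as pd

# synthetic monthly data: slow downward trend + a peak every January
idx = pd.date_range("2004-01-01", periods=96, freq="MS")
t = np.arange(96)
january_peak = np.where(idx.month == 1, 20.0, 0.0)
data = pd.Series(80 - 0.2 * t + january_peak, index=idx)

# same recipe as above: seasonal = data - trend
trend = data.rolling(12).mean()
seasonal = (data - trend).dropna()

# average the seasonal component per calendar month (1 = January)
profile = seasonal.groupby(seasonal.index.month).mean()
print(profile.round(2))
print("peak month:", profile.idxmax())
```

The profile recovers the 20-point gap between January and the other months, which is what was put into the synthetic data.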

## <a name="ex2_1"></a> Exercise 2.1: Belgian Climate
The next exercise uses a data set with the **monthly average temperatures** in Belgium between 1950 and now.

- (\*) Make the `date` column into the main index and change it to the DatetimeIndex format (with `pd.to_datetime()`)
- (\*) Deduce the yearly averages (use resampling) from this data and plot them
- (\*) To describe the current climate, usually the 20-year rolling average is taken. Calculate and plot this rolling average (\*\* optional: together with the yearly averages)
- (\*\*) Calculate the seasonal effect
- (\*\*\*) Can we say that the temperature has become more extreme (colder winters and warmer summers)?

In [0]:
import pandas as pd
climate = pd.read_csv("climate-belgium.csv", delimiter=";")


In [0]:
# Write your code here

<a href="#ex2_1answer">Answer to Exercise 2.1</a>

# Chapter 3: Data Visualisation 

To visualize data you have two possibilities:

1) Look at the data in raw format, which is rarely useful or helpful

2) Plot the data in a chart

=> e.g. a bubble chart or a violin plot / box plot

### Violin Boxplot 

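A violin plot is used to show the spread of the data: it draws how the values are distributed around the median (or mean). This tells you whether the data is homogeneous (the values look alike) or heterogeneous (the values differ a lot). A box plot summarises the same distribution with the median, the quartiles and the outliers.

To create a violin plot and a box plot with Matplotlib you can use the following code: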


In [0]:

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))

# generate some random test data
all_data = [np.random.normal(0, std, 100) for std in range(6, 10)]
# plot violin plot
axes[0].violinplot(all_data,
                   showmeans=False,
                   showmedians=True)
axes[0].set_title('violin plot')

# plot box plot
axes[1].boxplot(all_data)
axes[1].set_title('box plot')

# adding horizontal grid lines
for ax in axes:
    ax.yaxis.grid(True)
    ax.set_xticks([y+1 for y in range(len(all_data))]) # tick positions 1..4 on the x axis, one per data set
    ax.set_xlabel('xlabel')
    ax.set_ylabel('ylabel')

# add x-tick labels
plt.setp(axes, xticks=[y+1 for y in range(len(all_data))], 
         xticklabels=['x1', 'x2', 'x3', 'x4']) # the labels shown at those tick positions
plt.show()

### Bubble chart
Bubble charts are useful for showing three pieces of information at the same time (even four when you also use colour).

<figure>
  <IMG SRC="http://static.klipfolio.com/images/saas/example_bubblechart.png" WIDTH=600 ALIGN="CENTER"/>
</figure>


In this figure we can see the time spent watching TV by men and women, as well as the proportion (how many of them).

It is also useful for detecting outliers: if one bubble lies far away from the others and has a small diameter, only a few data points (or maybe only one) correspond to it.

To create a bubble chart we can use the following code:

In [0]:
import matplotlib.pyplot as plt
import numpy as np
 
# create data
# BE CAREFUL! all arrays must have the same length
x = np.random.rand(40) # data for the x axis
y = np.random.rand(40) # data for the y axis
z = np.random.rand(40) # third variable, shown as the bubble size
 
# use the scatter function
plt.scatter(x, y, s=z*1000, alpha=0.5)
plt.show()
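
The fourth piece of information mentioned above can be added through the `c` argument of `scatter`. A possible sketch with random data follows. The colour map, the fixed seed and the label text are arbitrary choices made for this example:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the picture is repeatable
x = rng.random(40)               # position on the x axis
y = rng.random(40)               # position on the y axis
size = rng.random(40)            # third variable -> bubble area
colour = rng.random(40)          # fourth variable -> bubble colour

fig, ax = plt.subplots(figsize=(8, 6))
bubbles = ax.scatter(x, y, s=size * 1000, c=colour, cmap="viridis", alpha=0.5)
fig.colorbar(bubbles, ax=ax, label="fourth variable")  # legend for the colours
ax.set_xlabel("x")
ax.set_ylabel("y")
plt.show()
```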


## <a name="ex3_1"></a> Exercise 3.1: Violin box plot for a time series (*)

From the "time_serie_1.xlsx":

 1) import the data from the tab "month"
 
 2) Make a violin box plot based on the data 

In [0]:
# Write your code here

<a href="#ex3_1answer">Answer to Exercise 3.1</a>

# Chapter 4: Linear Regression

Linear regression is an algorithm that tries to find a line that models the data.
It does so by finding the line that minimizes the sum of squared residual errors, where a residual is the distance between a data point and the corresponding point on the line (the estimated value).
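
For a single input variable this can be written down as follows. The symbols are introduced here for illustration only: the line is $\hat{y} = a x + b$, and $(x_i, y_i)$ are the $n$ data points.

$$
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2
$$

The algorithm returns the slope $a$ and the intercept $b$ for which this sum is smallest.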


In order to build a linear regression model you will need 

1) A dataset

2) To know which columns of your dataset you will use (e.g. the time and the temperature)

3) To give your data to the algorithm, which does the work


Example:

In [0]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
# The data for the x axis should be a matrix like the following: [[first_value],[second_value], etc]
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_Y = diabetes.target

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model
regr.fit(diabetes_X, diabetes_Y)

# Make predictions 
diabetes_y_pred = regr.predict(diabetes_X) 

# Plot outputs
plt.scatter(diabetes_X, diabetes_Y,  color='black')
plt.plot(diabetes_X, diabetes_y_pred, color='blue', linewidth=3)


plt.show()

The closer the variance score (R², which you can compute with `regr.score(diabetes_X, diabetes_Y)`) is to 1, the better the prediction is.
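
To show where this score comes from, here is a minimal sketch that fits a line with plain NumPy and computes R² by hand. The five points are made up, and `np.polyfit` is used here instead of scikit-learn purely to keep the example short:

```python
import numpy as np

# made-up points that lie roughly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# least-squares fit of a straight line (polynomial of degree 1)
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

ss_res = np.sum((y - y_pred) ** 2)     # squared error left over after the fit
ss_tot = np.sum((y - y.mean()) ** 2)   # squared error when always predicting the mean
r2 = 1 - ss_res / ss_tot

print(round(slope, 3), round(intercept, 3), round(r2, 4))
```

A score of 1 means the line passes through every point. A score near 0 means the line does no better than always predicting the mean.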

## <a name="ex4_1"></a> Exercise 4.1 (*)
Based on the file "time_serie_1.xlsx" 

1) Parse the data from the tab "linear" 

2) Apply the linear regression 

3) Find the temperature for Sunday


In [0]:
# Write your code here

<a href="#ex4_1answer">Answer to Exercise 4.1</a>

## <a name="ex4_2"></a> Exercise 4.2 (**)
Based on the file "time_serie_1.xlsx" 

1) Parse the data from the tab "non_linear" 

2) Apply the linear regression 

3) Find the temperature for the next 4 months (September, October, November, December)

4) What do you see?

Note that if you want to plot the line, you can add the following realistic temperatures:

September : 20

October: 17

November: 13

December: 5

In [0]:
# Write your code here

<a href="#ex4_2answer">Answer to Exercise 4.2</a>

# Chapter 2.Extra: The Correlation Coefficient ***

The correlation coefficient measures the correlation (= do they have a similar shape) between two data series. The coefficient ranges from -1 to 1, where 1 means perfect correlation, 0 no correlation and -1 a perfect inverse correlation.

We can visualize our data of New Year's resolutions once more. Try to guess in advance which correlation coefficients you expect to get between the 3 data series.
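
The three reference values can be reproduced with simple toy series. The sketch below uses invented sine waves and also recomputes one coefficient by hand, as the covariance divided by both standard deviations:

```python
import numpy as np
import pandas as pd

t = np.arange(24)
toy = pd.DataFrame({
    "a": np.sin(2 * np.pi * t / 12),
    "same_shape": 3 * np.sin(2 * np.pi * t / 12) + 5,   # scaled and shifted copy of a
    "mirrored": -np.sin(2 * np.pi * t / 12),            # a flipped upside down
    "shifted": np.cos(2 * np.pi * t / 12),              # a moved by a quarter period
})

corr = toy.corr()
print(corr.round(2))

# the same number by hand: covariance divided by both standard deviations
a, b = toy["a"], toy["same_shape"]
by_hand = ((a - a.mean()) * (b - b.mean())).mean() / (a.std(ddof=0) * b.std(ddof=0))
print(round(by_hand, 2))
```

Note that scaling a series or adding a constant to it does not change the coefficient: only the shape counts.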


In [0]:
resolutions['2010':'2013'].plot(figsize=(15,8))

From the plot above, I would say the orange ('gym') and blue ('diet') lines show a similar shape, rising and dropping at the same points. The green line does not seem to follow this shape.

We can now calculate the actual correlation values. With Pandas, this is extremely easy. Let's see if the values match our expectations:

In [0]:
resolutions['2010':'2013'].corr()

That's close to what we expected, but we had to zoom in to the period 2010-2013. If we take the whole data series, we will find a very strange result:

In [0]:
resolutions.corr()

Diet and gym have a close-to-zero, and even negative, correlation! This can be explained by our earlier assumption: the data is actually the sum of a seasonal fluctuation on top of a general trend. And the general trends of diet and gym are quite opposite (apart from the 2010-2013 period), as can be seen when we plot their rolling averages.

In [0]:
resolutions_joined[['diet_rolling','gym_rolling']].plot(figsize=(15,8))
resolutions_joined[['diet_rolling','gym_rolling']].corr()

And the negative correlation of -0.29 also shows this.
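
How a trend can hide a shared seasonal pattern is easy to demonstrate on synthetic data. In the sketch below both series get the same invented yearly swing, but opposite trends:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2004-01-01", periods=120, freq="MS")
t = np.arange(120)
swing = 5 * np.cos(2 * np.pi * t / 12)     # yearly pattern shared by both series

toy = pd.DataFrame({
    "falling": 80 - 0.3 * t + swing,       # downward trend
    "rising": 30 + 0.3 * t + swing,        # upward trend
}, index=idx)

whole = toy.corr().loc["falling", "rising"]

# remove the trend with a 12-month rolling mean and keep the seasonal part
seasonal = (toy - toy.rolling(12).mean()).dropna()
detrended = seasonal.corr().loc["falling", "rising"]

print(round(whole, 2), round(detrended, 2))
```

Over the whole period the opposite trends dominate and the correlation is negative. After removing the trends the identical swing is all that remains. Whether the real search data behaves the same way is for you to find out.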

Your next exercise will be to prove that the seasonal effect is actually positively correlated for diet and gym!

## Exercise 2.3: Correlation of the seasonal patterns in the New Years Resolutions (***)

- We calculated the seasonal effect on 'diet' before. Start by calculating the seasonal effect on 'gym' in the same manner and then join the data.
- Visualize the 2 seasonal effects together. Does our assumption hold that they are strongly correlated?
- Calculate the correlation to validate

# Answers for the exercises


## <a name="ex1_1answer">Answer to Exercise 1.1</a>

- What is the type of the index of the data? Make it into a DatetimeIndex

In [0]:
# print out the index -> dtype = 'object'
print(stock.index)

# or print out its first element and the type -> a string
print(stock.index[0],'is of type',type(stock.index[0]))

In [0]:
# make the index a DatetimeIndex
stock.index = pd.to_datetime(stock.index)

# print out again
print(stock.index[0],'is of type',type(stock.index[0]))

- Filter out only the data from the years 2015 & 2016


In [0]:
stock1516=stock['2015':'2016'] 
print(stock1516.head()) # first 5 entries = JAN 2015
print(stock1516.tail()) # last 5 entries = DEC 2016

# another option is to create a filter on the index
# (the upper bound is '2017': comparing with '2016' would stop at 1 January 2016)
# stock1516 = stock[(stock.index >= '2015') & (stock.index < '2017')]


- Visualize the closing price in a simple plot for the month July 2015

In [0]:
stock1516.loc['2015-07', 'close'].plot(figsize=(15,8))

<a href="#ex1_1">Back to Exercise 1.1</a>

## <a name="ex2_1answer">Answer to Exercise 2.1</a>

- Make the `date` column into the main index and change it to the DatetimeIndex format (with `pd.to_datetime()`)



In [0]:
ex21 = climate.set_index('date')
ex21.index = pd.to_datetime(ex21.index)
# printing out the info: the index is of DateTimeIndex format 
print(ex21.info()) 
ex21.head()

In [0]:
# PS. plotting at this point has little effect, the data is hard to interpret:
ex21.plot(figsize=(15,8))


- deduce the yearly averages (use resampling) from this data and plot them 

In [0]:
ex21_year = ex21.resample('Y').mean()
ex21_year.plot(figsize=(15,8))

- to describe the current climate, usually the 20-year rolling average is taken. Calculate and plot this rolling average together with the yearly averages

In [0]:
# the data is already yearly, so we simply take a rolling window of 20 and consolidate to the mean value
ex21_year_rolling = ex21_year.rolling(20).mean()

# join our new data together & plot
ex21_year_joined = ex21_year.join(ex21_year_rolling, rsuffix='_20y')
ex21_year_joined['1950':'2012'].plot(figsize=(15,10))
# we see the global warming trend also clearly in the belgian climate

- Calculate the seasonal effect 

In [0]:
ex21_rolling = ex21.rolling(12).mean()
ex21_seasonal = ex21 - ex21_rolling
ex21_seasonal['2000':'2003'].plot(figsize=(15,8))
ex21_seasonal.plot(figsize=(15,8))



In [0]:
ex21['2000':'2018'].plot(figsize=(10,8))
ex21_year['2000':'2018'].plot(figsize=(10,8))


- (\*\*\*) Can we say that the temperature has become more extreme (colder winters and warmer summers)?

In [0]:
# more extreme months means that their temperature is further from the yearly mean = their seasonal (absolute!) values are larger
# we can calculate this in 2 steps: first make the dataset in absolute values:
ex21_abs = ex21_seasonal.abs()
# then resample
ex21_abs_yearly = ex21_abs.resample('Y').sum()

# and finally make a rolling average to see the trend
ex21_abs_yearly_rolling = ex21_abs_yearly.rolling(20).mean()

ex21_extreme = ex21_abs_yearly.join(ex21_abs_yearly_rolling, rsuffix='_20y')

ex21_extreme['1970':'2018'].plot(figsize=(15,8))
# nope, from this data i would say no..


<a href="#ex2_1">Back to Exercise 2.1</a>

## <a name="ex3_1answer">Answer to Exercise 3.1</a>

In [0]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# pandas reads .xlsx files through openpyxl (recent xlrd versions only read .xls)
!pip install -q --upgrade openpyxl

excel_reader = pd.ExcelFile("time_serie_1.xlsx")
month_tab = excel_reader.parse(sheet_name = "month")
temperature = month_tab.get("temperature")

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(12, 9))

# plot violin plot
axes.violinplot(temperature,
                   showmeans=True,
                   showmedians=True)
axes.set_title('violin plot')

# adding horizontal grid lines
axes.yaxis.grid(True)
axes.set_xlabel('Month')
axes.set_ylabel('Temperature')

plt.setp(axes, xticks=[])  # an empty list gives no ticks on the x axis
plt.show()


<a href="#ex3_1">Back to Exercise 3.1</a>

## <a name="ex4_1answer">Answer to Exercise 4.1</a>

In [0]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

excel_reader = pd.ExcelFile("time_serie_1.xlsx")
linear_tab = excel_reader.parse(sheet_name = "linear")
regr = linear_model.LinearRegression()
# the regression needs numeric inputs, so we number the time steps from 0 to n-1
x_train = [[i] for i in range(len(linear_tab.get("time")))]
regr.fit(x_train, linear_tab.get("temperature"))

prediction = regr.predict(x_train)
# plot the current data set and the fitted line
plt.scatter(x_train, linear_tab.get("temperature"),  color='black')
plt.plot(x_train, prediction, color='blue', linewidth=3)
plt.show()

# the first step after the training data has index len(x_train);
# this is Sunday provided the sheet ends on Saturday
regr.predict([[len(x_train)]])


<a href="#ex4_1">Back to Exercise 4.1</a>

## <a name="ex4_2answer">Answer to Exercise 4.2</a>

In [0]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

excel_reader = pd.ExcelFile("time_serie_1.xlsx")
linear_tab = excel_reader.parse(sheet_name = "non_linear")
regr = linear_model.LinearRegression()
length_tab = len(linear_tab.get("time"))
months = [[i] for i in range(length_tab)] 
all_months = months + [[length_tab+i] for i in range(4)]
regr.fit(months, linear_tab.get("temperature"))

prediction = regr.predict(all_months)  # fitted line over the known months plus the 4 extra ones
temperature_to_add = [20,17,13,5]
all_temperature = list(linear_tab.get("temperature")) + temperature_to_add
plt.scatter(all_months, all_temperature,  color='black')
plt.plot(all_months, prediction, color='blue', linewidth=3)


plt.xlabel('Month')
plt.ylabel('Temperature')
plt.show()
print(regr.predict(all_months[length_tab:]))  # the 4 months after the data: September to December
print("these predictions are impossible ... especially in Belgium")
print("as you can see after plotting, linear regression is not a good model for this time series, because time series are rarely linear over long periods")

<a href="#ex4_2">Back to Exercise 4.2</a>