## FINANCIAL DATA
MODULE 6 | LESSON 2


---

<h1><b>CORRELATIONS AND TIME SERIES</b></h1>


|  |  |
|:---|:---|
|**Reading Time** |  30 minutes |
|**Prior Knowledge** | Basic Python  |
|**Keywords** |Simple Moving Average (SMA), Exponential Moving Average (EMA), Pearson correlation, Spearman correlation, <br> Kendall correlation, Single exponential smoothing, Double (Holt's) exponential smoothing, Triple (Holt-Winters) exponential smoothing |

---


*In the last lesson, we downloaded batches of data using an API and analyzed it for the frequency of various sentiments. In this lesson, we learn about time series analysis, especially moving averages, which include simple moving averages and exponential moving averages of various kinds. Then, we review the three most used types of correlation (Pearson, Kendall, and Spearman) in preparation for analyzing the correlations among various cryptocurrencies.*

###### These lesson notes are a fork of [this notebook](https://www.kaggle.com/code/dbarkhorn/crypto-correlation/notebook) by Danbarkhorn, which is released under the Apache 2.0 open source license.

## 1. Time Series Analysis and Smoothing 

Financial engineering often relies on historical data, so a solid understanding of time series analysis is crucial. Most of what you need to know for this lesson about forecast quality metrics and smoothing techniques will be found [in the required reading (section 1: "Introduction" and section 2: "Move, smoothe, evaluate").](https://www.kaggle.com/kashnitsky/topic-9-part-1-time-series-analysis-in-python)
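As a quick illustration of the two basic smoothers covered in that reading, the sketch below computes a simple moving average and an exponential moving average with pandas. The price values, the window of 3, and the smoothing factor `alpha=0.5` are hypothetical choices made only for this example.

```python
import pandas as pd

# Toy price series (hypothetical values, for illustration only)
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Simple moving average: unweighted mean of the last 3 observations.
# The first two entries are NaN because a full window is not yet available.
sma = prices.rolling(window=3).mean()

# Exponential moving average (single exponential smoothing):
# ema_t = alpha * price_t + (1 - alpha) * ema_{t-1}, with ema_0 = price_0
ema = prices.ewm(alpha=0.5, adjust=False).mean()

print(pd.DataFrame({"price": prices, "sma_3": sma, "ema_0.5": ema}))
```

The SMA weights every observation in the window equally, while the EMA gives the most recent observation the largest weight and lets older ones decay geometrically.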


Get further information on Double and Triple Exponential Smoothing from [this required reading.](https://online.stat.psu.edu/stat501/node/1001)
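To make the idea of double exponential smoothing more concrete, here is a minimal sketch of one common formulation of Holt's method. The function name `holt_double_smoothing`, the initialization (first value as the level, first difference as the trend), and the parameter values are assumptions made for this example; the required reading may use a slightly different parameterization.

```python
def holt_double_smoothing(series, alpha, beta):
    """Return the final (level, trend) of Holt's double exponential smoothing.

    Hypothetical helper for illustration:
    level_t = alpha * y_t + (1 - alpha) * (level_{t-1} + trend_{t-1})
    trend_t = beta * (level_t - level_{t-1}) + (1 - beta) * trend_{t-1}
    """
    level = series[0]              # assumed initial level
    trend = series[1] - series[0]  # assumed initial trend
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


# A perfectly linear toy series: y_t = 2t + 1
level, trend = holt_double_smoothing([1.0, 3.0, 5.0, 7.0, 9.0], alpha=0.5, beta=0.5)
forecast = level + trend  # one-step-ahead forecast
print(level, trend, forecast)
```

On a perfectly linear series, the smoothed level follows the data and the trend stays at the slope, so the one-step-ahead forecast simply continues the line. Triple (Holt-Winters) smoothing adds a third, seasonal component in the same spirit.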

As you will see, moving averages (both exponential and simple moving averages) will come into play in many ways in the later lesson about technical analysis.

## 2. Correlations

There are different approaches to assessing the strength of the relationship between two variables, and each has its own strengths and weaknesses. [See the following required reading before we calculate the different correlation metrics between cryptocurrencies in the rest of this notebook.](https://www.kaggle.com/kiyoung1027/correlation-pearson-spearman-and-kendall/report)
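The sketch below illustrates one practical difference among the three measures, using `scipy.stats` on made-up data. Because `y` is a monotonic but nonlinear function of `x`, the rank-based measures (Spearman and Kendall) report a perfect relationship, while Pearson, which measures linear association, comes out below 1.

```python
import numpy as np
from scipy import stats

# Toy data (hypothetical): y increases with x, but not linearly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Each function returns (statistic, p-value); we keep only the statistic
pearson_r = stats.pearsonr(x, y)[0]
spearman_rho = stats.spearmanr(x, y)[0]
kendall_tau = stats.kendalltau(x, y)[0]

print("Pearson: ", pearson_r)
print("Spearman:", spearman_rho)
print("Kendall: ", kendall_tau)
```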

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.signal import correlate
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.stattools import adfuller

## 3. Cryptocurrency Correlations

The data for this lesson comes from Kaggle and has been downloaded to your virtual machine. You can consult the original source [here](https://www.kaggle.com/dbarkhorn/crypto-correlation/data).

In [None]:
crypto = {}

crypto["bitcoin"] = pd.read_csv("./data/bitcoin_price.csv")
crypto["bitcoin_cash"] = pd.read_csv("./data/bitcoin_cash_price.csv")
crypto["dash"] = pd.read_csv("./data/dash_price.csv")
crypto["ethereum"] = pd.read_csv("./data/ethereum_price.csv")
crypto["iota"] = pd.read_csv("./data/iota_price.csv")
crypto["litecoin"] = pd.read_csv("./data/litecoin_price.csv")
crypto["monero"] = pd.read_csv("./data/monero_price.csv")
crypto["nem"] = pd.read_csv("./data/nem_price.csv")
crypto["neo"] = pd.read_csv("./data/neo_price.csv")
crypto["numeraire"] = pd.read_csv("./data/numeraire_price.csv")
crypto["ripple"] = pd.read_csv("./data/ripple_price.csv")
crypto["stratis"] = pd.read_csv("./data/stratis_price.csv")
crypto["waves"] = pd.read_csv("./data/waves_price.csv")

In [None]:
# For this analysis, (*) we will only be looking at closing price to make things more manageable
for coin in crypto:
    for column in crypto[coin].columns:
        if column not in ["Date", "Close"]:
            # Use the keyword form; the positional axis argument was removed in pandas 2.0
            crypto[coin] = crypto[coin].drop(columns=column)
    # Make date the datetime type and reindex
    crypto[coin]["Date"] = pd.to_datetime(crypto[coin]["Date"])
    crypto[coin] = crypto[coin].sort_values("Date")
    crypto[coin] = crypto[coin].set_index(crypto[coin]["Date"])
    crypto[coin] = crypto[coin].drop(columns="Date")

In [None]:
for coin in crypto:
    print(coin, len(crypto[coin]))

**Note:** 
As this is a fixed set of data, the coins numeraire, iota, and bitcoin_cash were all relatively young when it was collected and therefore do not have many data points. (*) So let's omit these currencies and, for the time being, consider only the most recent 350 data points for the remaining currencies.

In [None]:
del crypto["bitcoin_cash"], crypto["numeraire"], crypto["iota"]

In [None]:
cryptoAll = {}  # for later on

for coin in crypto:
    cryptoAll[coin] = crypto[coin]
    # Copy the slice so that new columns can be added later without a SettingWithCopyWarning
    crypto[coin] = crypto[coin][-350:].copy()

As previously stated, the goal of this analysis is to create a correlation matrix for these currencies. One way to find the correlation between time series is to look at their *cross-correlation*. Cross-correlation is computed between two time series at a given lag, so when creating the correlation matrix, (*) we need to specify the correlation as well as the lag.
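To see how cross-correlation picks out a lag, consider the following minimal sketch on synthetic data. The signal `x` is random noise and `y` is the same signal delayed by three steps; both are made up for this illustration. In `"full"` mode, SciPy's `correlate` returns one value per possible lag, and the position `len(x) - 1` corresponds to zero lag.

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(0)
x = rng.standard_normal(200)  # synthetic reference signal
y = np.roll(x, 3)             # the same signal delayed by 3 steps (with wrap-around)

# One cross-correlation value per lag, from -(len(x) - 1) to +(len(x) - 1)
xcorr = correlate(y, x, mode="full")

# Convert the position of the peak into a lag; index len(x) - 1 is zero lag
lag = int(np.argmax(xcorr)) - (len(x) - 1)
print("estimated lag:", lag)

# Shifting y back by the estimated lag lines it up with x again
aligned = np.roll(y, -lag)
```

A positive lag here means that the first argument (`y`) trails the second (`x`).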
 
Before computing the cross-correlation, it is important to have "wide-sense stationary" (often just called stationary) data. There are a few ways to make data stationary, one of which is differencing. Even after differencing, it is famously difficult to avoid spurious correlations between time series, which are often caused by autocorrelation. See this article for an in-depth analysis of how spurious correlations arise and how to avoid them: ["Dangers and Uses of Cross-Correlation in Analyzing Time Series in Perception, Performance, Movement, and Neuroscience: The Importance of Constructing Transfer Function Autoregressive Models."](https://doi.org/10.3758/s13428-015-0611-2) (This is a free, shareable article, but it is not a required reading.)
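The following sketch gives a feel for why stationarity matters. It simulates two random walks that are independent by construction (synthetic data, not the crypto prices) and compares the correlation of their levels with the correlation of their first differences. The correlation of the levels can be far from zero purely by chance, whereas the correlation of the differences should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Two independent random walks: cumulative sums of unrelated noise
walk_a = np.cumsum(rng.standard_normal(n))
walk_b = np.cumsum(rng.standard_normal(n))

# Correlation of the nonstationary levels
corr_levels = np.corrcoef(walk_a, walk_b)[0, 1]

# Correlation of the (stationary) first differences
corr_diffs = np.corrcoef(np.diff(walk_a), np.diff(walk_b))[0, 1]

print("levels:     ", corr_levels)
print("differences:", corr_diffs)
```

Trying different seeds shows how much the correlation of the levels varies from run to run even though the two series are unrelated.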
 
For now, (*) we will employ daily differencing (as it is not seasonal) and test for stationarity to prepare for cross-correlation testing.

In [None]:
# Differencing
for coin in crypto:
    crypto[coin]["CloseDiff"] = crypto[coin]["Close"].diff().fillna(0)

Now, let's take a preliminary look at how our graph appears. Further steps may have to be taken to make the data stationary.

In [None]:
for coin in crypto:
    plt.plot(crypto[coin]["CloseDiff"], label=coin)
plt.legend(loc=2)
plt.title("Daily Differenced Closing Prices")
plt.show()

**Note:**
Here we see that one of the coins (Bitcoin) has much larger spikes than the other coins. While this may still have given us stationarity, it may be useful to also look at the percentage change per day of the time series.
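A small sketch with made-up prices shows why percentage change puts coins on a common scale. The two hypothetical coins below make the same relative moves but trade at very different price levels: their daily differences differ by a factor of 1,000, while their percentage changes match.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: same relative moves, very different price levels
cheap = pd.Series([1.00, 1.10, 0.99])
expensive = cheap * 1000

# Absolute daily differences depend on the price level
diff_cheap = cheap.diff()
diff_expensive = expensive.diff()

# Percentage changes do not
pct_cheap = cheap.pct_change()
pct_expensive = expensive.pct_change()

print(pd.DataFrame({
    "diff_cheap": diff_cheap,
    "diff_expensive": diff_expensive,
    "pct_cheap": pct_cheap,
    "pct_expensive": pct_expensive,
}))
```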

In [None]:
# Percent Change
for coin in crypto:
    crypto[coin]["ClosePctChg"] = crypto[coin]["Close"].pct_change().fillna(0)

In [None]:
for coin in crypto:
    plt.plot(crypto[coin]["ClosePctChg"], label=coin)
plt.legend(loc=2)
plt.title("Daily Percent Change of Closing Price")
plt.show()

**Note:**
As before, we still have some very large peaks, but overall, the data looks more contained than it was previously. Most importantly, we do not have a single coin dominating the others.

Focus on one particular part of the graph to get an idea of any correlation going on.

In [None]:
for coin in crypto:
    plt.plot(crypto[coin]["ClosePctChg"][-30:], label=coin)
plt.legend(loc=2)
plt.title("Daily Percent Change of Closing Price")
plt.show()

**Note:**
Here it seems as if we do in fact have some correlation going on, which is what we were hoping for.

It is also important to note that a number of other types of differencing or normalizations could have been applied. As this is only a preliminary analysis, this may not end up being the best way to prepare the data.

## 4. Stationarity

You will learn all about stationarity in the Econometrics course. In short, the mean and variance of a stationary series do not change over time. You will not be quizzed on stationarity per se in this lesson, but read through this section so that you will have been exposed to the concept by the time you get to Econometrics, and because we should perform these tests on our cryptocurrency data anyway.
<br>

We can test for stationarity by using *unit root tests*, one of which is the Augmented Dickey-Fuller (ADF) test. The ADF test uses the following regression.

$$ Y'_t = \phi Y_{t-1} + b_1 Y'_{t-1} + b_2 Y'_{t-2} + \dots + b_p Y'_{t-p} $$

$$ Y'_t = Y_t - Y_{t-1} $$

Using the Augmented Dickey-Fuller test, we look at the following statistic.

$$ DF_t \space = \space \frac{\hat{\phi}}{SE(\hat{\phi})} $$

This statistic is then compared with the critical values tabulated by Dickey and Fuller. Given the number of samples, we can decide at a chosen significance level whether to reject the hypothesis that our data is nonstationary.

$$ H_{0} : \text{data is nonstationary} $$
$$ H_{A} : \text{data is stationary} $$

To check these hypotheses, we look at the p-value of our given statistic (*) [using a table](https://www.real-statistics.com/statistics-tables/augmented-dickey-fuller-table/). In the table, we look at model 1 (the middle table) with 250 < n < 500. From here, we can see that to reject nonstationarity at the 1% significance level, we compare our $DF_t$ statistic to the critical values -3.457 and -3.443. Since our calculated values are less than (that is, more negative than) these critical values from the table, we have a significant result, i.e., our time series are stationary.
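To connect the regression and the $DF_t$ statistic to something computable, here is a minimal sketch of the simplest case: $p = 0$ (no lagged differences) and no constant term. The helper `dickey_fuller_stat` is a hypothetical function written for this illustration, not part of statsmodels, and because it omits the constant, its values should not be compared with the critical values quoted above. It is applied to synthetic white noise (stationary) and a synthetic random walk (nonstationary).

```python
import numpy as np


def dickey_fuller_stat(y):
    """t-statistic of phi in Y'_t = phi * Y_{t-1} + error (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)   # Y'_t = Y_t - Y_{t-1}
    y_lag = y[:-1]    # Y_{t-1}

    # Ordinary least squares estimate of phi (single regressor, no intercept)
    phi_hat = (y_lag @ dy) / (y_lag @ y_lag)

    # Standard error of phi_hat from the residual variance
    residuals = dy - phi_hat * y_lag
    sigma2 = (residuals @ residuals) / (len(dy) - 1)
    se_phi = np.sqrt(sigma2 / (y_lag @ y_lag))

    return phi_hat / se_phi


rng = np.random.default_rng(1)
noise = rng.standard_normal(500)            # stationary by construction
walk = np.cumsum(rng.standard_normal(500))  # has a unit root by construction

stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
print("white noise:", stat_noise)
print("random walk:", stat_walk)
```

The white noise series produces a strongly negative statistic, which is the kind of value that leads to rejecting nonstationarity. The `adfuller` function from statsmodels used in this lesson additionally handles lagged differences, deterministic terms, and p-values.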


In [None]:
for coin in crypto:
    print("\n", coin)
    adf = adfuller(crypto[coin]["ClosePctChg"][1:])
    print(coin, "ADF Statistic: %f" % adf[0])
    print(coin, "p-value: %f" % adf[1])
    print(coin, "Critical Values", adf[4]["1%"])
    print(adf)

In [None]:
for coin in crypto:
    print("\n", coin)
    adf = adfuller(crypto[coin]["CloseDiff"][1:])
    print(coin, "ADF Statistic: %f" % adf[0])
    print(coin, "p-value: %f" % adf[1])
    print(coin, "Critical Values", adf[4]["1%"])
    print(adf)

**Note:**
Here we see that our data is stationary. This is clear from the extremely low p-values, which let us reject the null hypothesis of nonstationarity.

It is important to note here that there are other ways to detrend besides looking at differenced data or percent change. However, some of these methods would not have proven fruitful for this dataset. Take, for example, using the residuals of this data based on a simple linear regression. This can be easily done using scikit-learn's linear regression tool.

In [None]:
for coin in crypto:
    model = LinearRegression()
    model.fit(np.arange(350).reshape(-1, 1), crypto[coin]["Close"].values)
    trend = model.predict(np.arange(350).reshape(-1, 1))
    plt.subplot(1, 2, 1)
    plt.plot(trend, label="trend")
    plt.plot(crypto[coin]["Close"].values)
    plt.title(coin)

    plt.subplot(1, 2, 2)
    plt.plot(crypto[coin]["Close"].values - trend, label="residuals")
    plt.title(coin)

    plt.show()

**Note:**
Linear detrending gives ineffective results here. Since many of these currencies only started gaining traction recently, a straight line fits their prices poorly, which shows that the original approach (differencing or percent change) is the preferred method.

Finally, we turn to the actual correlation analysis. We use SciPy's `correlate` function. The cross-correlation will tell us if we should lag one of the series. Cross-correlation is often used in signal processing to match signals.

In [None]:
corrBitcoin = {}
corrDF = pd.DataFrame()

for coin in crypto:
    corrBitcoin[coin] = correlate(
        crypto[coin]["ClosePctChg"], crypto["bitcoin"]["ClosePctChg"]
    )
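    # In the full cross-correlation, index len(series) - 1 corresponds to zero lag
    lag = int(np.argmax(corrBitcoin[coin])) - (len(crypto["bitcoin"]) - 1)
    # Shift by -lag so the coin lines up with bitcoin at the best-matching offset
    laggedCoin = np.roll(crypto[coin]["ClosePctChg"].values, shift=-lag)
    corrDF[coin] = laggedCoin

    plt.figure(figsize=(15, 10))
    plt.plot(laggedCoin)
    plt.plot(crypto["bitcoin"]["ClosePctChg"].values)
    title = coin + "/bitcoin PctChg lag: " + str(lag)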
    plt.title(title)

    plt.show()

Now we can look at the correlations among these currencies.
We will compute the correlations using three different methods: Pearson, Spearman, and Kendall.

In [None]:
font = {
    "family": "serif",
    "color": "black",
    "weight": "normal",
    "size": 20,
}

plt.matshow(corrDF.corr(method="pearson"))
plt.xticks(range(10), corrDF.columns.values, rotation="vertical")
plt.yticks(range(10), corrDF.columns.values)
plt.xlabel("Pearson Correlation", fontdict=font)
plt.show()
corrDF.corr(method="pearson")

In [None]:
plt.matshow(corrDF.corr(method="spearman"))
plt.xticks(range(10), corrDF.columns.values, rotation="vertical")
plt.yticks(range(10), corrDF.columns.values)
plt.xlabel("Spearman Correlation", fontdict=font)
plt.show()
corrDF.corr(method="spearman")

In [None]:
plt.matshow(corrDF.corr(method="kendall"))
plt.xticks(range(10), corrDF.columns.values, rotation="vertical")
plt.yticks(range(10), corrDF.columns.values)
plt.xlabel("Kendall Correlation", fontdict=font)
plt.show()
corrDF.corr(method="kendall")

**Note:**
We see here that all of these correlation methods give about the same results, with slightly different magnitudes.
Also, we should note that there are *no* correlations between different currencies greater than 0.5.
This is contrary to what we might find if we were to, for example, take the correlation of the nonstationary datasets. This suggests (*) we have avoided spurious correlations between currencies. Additionally, note that only two of the currencies showed stronger correlations with lagged data. This makes sense, as these currencies have been shown to be very responsive to media coverage in the recent past.



## 5. Conclusion

In this lesson, we reviewed moving averages and correlations in order to analyze the correlations among cryptocurrency time series. In the next lesson, we continue to use moving averages as an input to various technical analysis approaches.

**References**

* Danbarkhorn. "Crypto-correlation." *Kaggle*. 10 Sep 2017. https://www.kaggle.com/code/dbarkhorn/crypto-correlation/notebook.

* Dean, Roger, and William Dunsmuir. "Dangers and Uses of Cross-Correlation in Analyzing Time Series in Perception, Performance, Movement, and Neuroscience: The Importance of Constructing Transfer Function Autoregressive Models." *Behavior Research Methods*, vol. 48, 2016, pp. 783–802. https://doi.org/10.3758/s13428-015-0611-2.

* Kashnitsky, Yury. "Topic 9. Part 1. Time series analysis in Python." *Kaggle*. 3 Jan 2021. https://www.kaggle.com/kashnitsky/topic-9-part-1-time-series-analysis-in-python.

* Rooney. "Correlation (Pearson, Spearman, and Kendall)." *Kaggle*. 31 Dec 2019. https://www.kaggle.com/kiyoung1027/correlation-pearson-spearman-and-kendall/report.

* Simon, Laura, and Derek Young. "14.5.2 - Exponential Smoothing." *Penn State Statistics Online*. https://online.stat.psu.edu/stat501/node/1001.

 **Footnotes**
 
 - In compliance with the Apache License 2.0, the (*) marks the places where changes were made to the original notebook.
 - You can find the [APACHE LICENSE, VERSION 2.0 here.](https://www.apache.org/licenses/LICENSE-2.0) 
 - **NOTE:** The above Apache license notice is copied here to comply with its requirements, but it does **not** apply to the content in these lesson notes. 

---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
