<a href="https://colab.research.google.com/github/HomayounfarM/Timeseries/blob/main/Cross_correlation_Autocorrelation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Cross-correlation

##Cross-Correlation in Python: 3 Different Methods

Cross-correlation is a basic signal processing method used to analyze the similarity between two signals at different lags. It tells you not only how well the two signals match, but also the point in time (or index) at which they are most similar.

Investors use it to check how two stocks or assets perform against each other. In time series analysis, it can be used to find the time delay between two series.
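A minimal sketch of the delay-finding use case, assuming synthetic data: `y` is a copy of a random signal `x` delayed by a made-up `true_delay` of 5 samples. The names and values are illustrative only. The lag with the largest cross-correlation is taken as the delay estimate.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical data: y is x delayed by true_delay samples (zero-padded at the start)
true_delay = 5
x = rng.standard_normal(200)
y = np.concatenate([np.zeros(true_delay), x[:-true_delay]])

# Cross-correlation at every possible lag
corr = np.correlate(y, x, mode="full")

# Output index i corresponds to lag i - (len(x) - 1)
lags = np.arange(-(len(x) - 1), len(y))

estimated_delay = int(lags[np.argmax(corr)])
print(estimated_delay)  # 5
```

With real, noisy data the peak is usually less sharp, so it is worth plotting `corr` against `lags` before trusting the estimate.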

##Cross-correlation definition
Cross-correlation is the correlation between two signals at different delays (lags).

The definition is quite simple: you just overlap the two signals at a given delay.

![picture](https://drive.google.com/uc?export=view&id=1MrmjzS6wl-grAlwLJyT5veNeOxCjI5cq)

It is also called the sliding inner product because, for a given delay, it is essentially the inner product of the two signals.

Note that autocorrelation can be viewed as a special case of cross-correlation, where the cross-correlation is taken with respect to the signal itself.

![picture](https://drive.google.com/uc?export=view&id=1ANVoTIilPkrUnIgAjk8od20Jfzxhr3C7)

In [None]:
# Data set: two test signals of different lengths

import numpy as np

# First signal
sig1 = np.sin(np.r_[-1:1:0.1])

# Second signal with pi/4 phase shift. Half the size of sig1
sig2 = np.sin(np.r_[-1:0:0.1] + np.pi/4)

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.plot(sig1)
ax.plot(sig2)


In [None]:
''' Python only implementation '''

# Pre-allocate correlation array
corr = (len(sig1) - len(sig2) + 1) * [0]

# Go through lag components one-by-one
for l in range(len(corr)):
    corr[l] = sum([sig1[i+l] * sig2[i] for i in range(len(sig2))])

print(corr)

In [None]:
# mode defaults to 'valid': only lags where sig2 fully overlaps sig1
corr = np.correlate(a=sig1, v=sig2)
corr

In [None]:
import scipy.signal

# Default mode is 'full', which also returns lags with only partial overlap
corr = scipy.signal.correlate(sig1, sig2)

# Keep only the lags where sig2 fully overlaps sig1 (same as mode='valid')
corr = corr[len(sig2)-1:len(sig1)]

print(corr)
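
As a quick sanity check (a sketch, not part of the original notebook), the three approaches above can be compared directly. It assumes the same `sig1` and `sig2` as above and passes `mode="valid"` explicitly so that every function returns only the fully overlapping lags. The variable names `corr_py`, `corr_np`, `corr_sp` and `best_lag` are made up for this example.

```python
import numpy as np
import scipy.signal

# Same test signals as above
sig1 = np.sin(np.r_[-1:1:0.1])
sig2 = np.sin(np.r_[-1:0:0.1] + np.pi / 4)

n_lags = len(sig1) - len(sig2) + 1

# Python-only sliding inner product
corr_py = [sum(sig1[i + lag] * sig2[i] for i in range(len(sig2)))
           for lag in range(n_lags)]

# NumPy and SciPy, restricted to fully overlapping lags
corr_np = np.correlate(sig1, sig2, mode="valid")
corr_sp = scipy.signal.correlate(sig1, sig2, mode="valid")

print(np.allclose(corr_py, corr_np), np.allclose(corr_np, corr_sp))

# Lag at which the two signals are most similar
best_lag = int(np.argmax(corr_np))
print(best_lag)
```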

##Summary
As usual, which implementation suits you best is up to you. If performance is not an issue, go with the package you are already using. For performance, NumPy is usually a safe bet, although we have not done any performance comparison here. For the fewest dependencies, go with NumPy or the Python-only implementation.

#Autocorrelation function (ACF)



Autocorrelation, also known as serial correlation, is a statistical concept that refers to the correlation of a signal with a delayed copy of itself as a function of delay. This concept is commonly used in signal processing and time series analysis.

In a time series context, autocorrelation can be thought of as the correlation between a series and its lagged values. For example, in daily stock market returns, autocorrelation might be used to measure how today's return is related to yesterday's return.

If a series is autocorrelated, it means that the values in the series are not independent of each other. For example, if a positive autocorrelation is detected at a lag of 1, it means that high values in the series tend to be followed by high values, and low values tend to be followed by low values.

Mathematically, the autocorrelation function (ACF) is defined as the correlation between the elements of a series and others from the same series separated from them by a given interval. We can write this for real-valued discrete signals as


![picture](https://drive.google.com/uc?export=view&id=1TkLN8wKRfT14x6GPFi2c_ePDS_8nb_si)



The maximum is always at lag l=0, where the signal is compared with an unshifted copy of itself. If the peaks of the ACF occur at even intervals, we can assume that the signal has a periodic component at that interval.

As an example, consider a 10Hz sine wave sampled at a 1000Hz sampling rate. One period spans 100 samples, so the ACF peaks at lags that are multiples of 100.
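
A rough sketch of this sine-wave case, assuming one second of data and the normalized (biased) ACF estimate used later in this notebook. The variable names are illustrative. The first ACF peak after lag 0 gives the period in samples, from which the frequency can be recovered.

```python
import numpy as np

fs = 1000       # sampling rate in Hz
f = 10          # sine frequency in Hz
n = 1000        # one second of samples (an assumption for this sketch)

t = np.arange(n) / fs
x = np.sin(2 * np.pi * f * t)

# Normalized ACF for lags 0..n-1
xc = x - x.mean()
acf = np.correlate(xc, xc, mode="full")[n - 1:] / (n * xc.var())

# Local maxima: points strictly larger than both neighbours (lag 0 is excluded)
peaks = np.where((acf[1:-1] > acf[:-2]) & (acf[1:-1] > acf[2:]))[0] + 1

period_samples = int(peaks[0])
print(period_samples)        # 100
print(fs / period_samples)   # 10.0
```

The peak heights shrink as the lag grows because fewer samples overlap at larger lags.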

##The degree of correlation determined by correlation coefficients

-A correlation coefficient close to 1 indicates a strong positive autocorrelation. That is, a high value in the time series is likely to be followed by another high value, and a low value is likely to be followed by another low value.

-A correlation coefficient close to -1 indicates a strong negative autocorrelation. That is, a high value in the time series is likely to be followed by a low value, and vice versa.

-A correlation coefficient close to 0 indicates no autocorrelation. That is, the values in the time series appear to be random and do not follow a discernible pattern.
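
The three cases above can be illustrated with synthetic series. This is a sketch under stated assumptions: the helper names `lag1_autocorr` and `ar1` are made up for this example, and the series are simple first-order autoregressive processes driven by seeded random noise.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 (mean removed first)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

def ar1(phi, eps):
    """Build a series where each value is phi times the previous one plus noise."""
    out = np.zeros_like(eps)
    for t in range(1, len(eps)):
        out[t] = phi * out[t - 1] + eps[t]
    return out

rng = np.random.default_rng(42)   # fixed seed so the run is repeatable
noise = rng.standard_normal(2000)

r_pos = lag1_autocorr(ar1(0.9, noise))    # high values tend to follow high values
r_neg = lag1_autocorr(ar1(-0.9, noise))   # high values tend to follow low values
r_zero = lag1_autocorr(noise)             # no dependence between neighbours

print(round(r_pos, 2), round(r_neg, 2), round(r_zero, 2))
```

The printed values should land near 0.9, -0.9 and 0 respectively, with some sampling variation.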

##Data set and number of lags to calculate
Before going into the methods of calculating autocorrelation, we need some data. The data set used in our examples is shown below. It is a list of random integers. It could be anything, but we did not want the data to have any specific properties, so we made it random.

In addition to a data set, we need to decide how many lags to calculate. Here we have chosen 10 (lags 0 through 9). We could have calculated the autocorrelation for all possible lags (as many as the data set length). However, for some of the methods shown here, the computational cost grows with the number of lags. We highlight this by considering only a subset of lags.

In [None]:
# Our data set
data = [3, 16, 156, 47, 246, 176, 233, 140, 130,
        101, 166, 201, 200, 116, 118, 247,
        209, 52, 153, 232, 128, 27, 192, 168, 208,
        187, 228, 86, 30, 151, 18, 254,
        76, 112, 67, 244, 179, 150, 89, 49, 83, 147, 90,
        33, 6, 158, 80, 35, 186, 127]

# Delay (lag) range that we are interested in
lags = [0,1,2,3,4,5,6,7,8,9]


In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.plot(data)

In [None]:
''' Python only implementation '''

# Pre-allocate autocorrelation table
acorr = len(lags) * [0]

# Mean
mean = sum(data) / len(data)
# Variance
var = sum([(x - mean)**2 for x in data]) / len(data)

# Mean-centered data
ndata = [x - mean for x in data]


# Go through lag components one-by-one
for l in lags:
    c = 1 # Lag 0: a series is perfectly correlated with itself

    if (l > 0):
        # Products of each value with the value l steps earlier
        tmp = [a * b for a, b in zip(ndata[l:], ndata[:-l])]

        c = sum(tmp) / len(data) / var

    acorr[l] = c

acorr

In [None]:
import statsmodels.api as sm

acorr = sm.tsa.acf(data, nlags = len(lags)-1)
acorr

In [None]:
''' numpy.correlate '''

import numpy as np

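x = np.array(data)

# Mean
mean = np.mean(x)

# Variance
var = np.var(x)

# Mean-centered data
ndata = x - mean

# 'full' returns lags -(N-1)..(N-1); keep the non-negative lags
acorr = np.correlate(ndata, ndata, 'full')[len(ndata)-1:]
# Normalize and keep only the lags of interest
acorr = (acorr / var / len(ndata))[:len(lags)]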

acorr
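
To check that the Python-only and NumPy approaches above agree, here is a small sketch that wraps each in a function and compares the results. The function names `acf_python` and `acf_numpy` and the seeded `sample` series are assumptions made for this example, not part of the original notebook.

```python
import numpy as np

def acf_python(series, n_lags):
    """Normalized autocorrelation for lags 0..n_lags-1 using plain Python."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    var = sum(c * c for c in centered) / n
    return [
        sum(a * b for a, b in zip(centered[lag:], centered[:n - lag])) / n / var
        for lag in range(n_lags)
    ]

def acf_numpy(series, n_lags):
    """Same quantity computed with numpy.correlate."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    full = np.correlate(x, x, mode="full")[len(x) - 1:]
    return full[:n_lags] / (x.var() * len(x))

rng = np.random.default_rng(1)   # fixed seed so the run is repeatable
sample = rng.integers(0, 256, size=50).tolist()

acf_a = acf_python(sample, 10)
acf_b = acf_numpy(sample, 10)
print(np.allclose(acf_a, acf_b))
```

Both functions divide by the full series length at every lag, which matches the normalization used in the cells above.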

##Durbin-Watson test
The Durbin-Watson test is a statistical test used to check for autocorrelation in the residuals from a statistical regression analysis. Specifically, it's often used to detect autocorrelation at lag 1 (first-order autocorrelation). It's named after statisticians James Durbin and Geoffrey Watson.

The test statistic is approximately equal to 2*(1-r), where r is the sample autocorrelation of the residuals at lag 1. Therefore, for r == 0, indicating no autocorrelation, the test statistic equals 2. The statistic ranges from 0 to 4, and a value close to 2 suggests there is no autocorrelation. A value well below 2 is evidence of positive autocorrelation, and a value well above 2 suggests negative autocorrelation.
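
The approximation can be checked numerically. The sketch below assumes the usual definition of the statistic, the sum of squared successive differences of the residuals divided by their sum of squares, and uses made-up, positively autocorrelated "residuals" rather than output from a real regression.

```python
import numpy as np

rng = np.random.default_rng(7)   # fixed seed so the run is repeatable
n = 500

# Hypothetical residuals with positive first-order autocorrelation
eps = rng.standard_normal(n)
resid = np.zeros(n)
for t in range(1, n):
    resid[t] = 0.6 * resid[t - 1] + eps[t]
resid -= resid.mean()   # regression residuals with an intercept have zero mean

# Durbin-Watson statistic computed directly from its definition
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Lag-1 sample autocorrelation and the 2*(1-r) approximation
r1 = np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2)
approx = 2 * (1 - r1)

print(round(dw, 3), round(approx, 3))
```

The two numbers differ only by a small end-effect term, and both fall well below 2 because the residuals were built with positive autocorrelation.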

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# load a sample dataset (downloaded from the Rdatasets repository, needs internet access)
mtcars = sm.datasets.get_rdataset('mtcars').data

# fit a linear regression model
model = smf.ols('mpg ~ cyl + disp + hp', data=mtcars).fit()

# calculate Durbin-Watson statistic
dw = sm.stats.durbin_watson(model.resid)

print('Durbin-Watson statistic:', dw)



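In this example, we load the mtcars dataset through statsmodels (it is fetched from the Rdatasets collection), fit a linear regression model, and then calculate the Durbin-Watson statistic on the residuals of the model. The Durbin-Watson statistic is a single number that you can interpret as described above. Because mtcars is not a time series, the order of its rows is arbitrary, so treat the result here as an illustration of the API only.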

As with any statistical test, you should interpret the Durbin-Watson statistic in the context of your specific analysis and data. The Durbin-Watson test is just one tool to detect autocorrelation, and it might not be appropriate for all cases. For instance, it is most powerful for detecting first-order autocorrelation and may not be as effective at identifying higher-order autocorrelation.

##Summary
As expected, all three methods shown here produce the same autocorrelation values. Which approach is the most convenient is up to you. The performance difference is out of the scope of this post, but as your data set grows you can expect the cost of the direct implementations to grow with the product of the data length and the number of lags (quadratically if you calculate all lags).