
# **DataCamp SKILL TRACK: Time Series with Python**

## **Course 1. Time Series Analysis in Python**

### **Section 1. Correlation and Autocorrelation**

Google Trends allows users to see how often a term is searched for. We downloaded a file from Google Trends containing the frequency over time for the search word "diet", which is pre-loaded in a DataFrame called `diet`. A first step when analyzing a time series is to visualize the data with a plot.   
  
Like many time series datasets, the date index is stored as strings and should be **converted to a datetime index** before plotting.

In [0]:
# Import pandas and plotting modules
import pandas as pd
import matplotlib.pyplot as plt

# Convert the date index to datetime
diet.index = pd.to_datetime(diet.index)

In [0]:
# Plot the entire time series diet and show gridlines
diet.plot(grid=True)
plt.show()

In [0]:
# Slice the dataset to keep only 2012 (partial-string indexing on the datetime index)
diet2012 = diet.loc['2012']

# Plot 2012 data
diet2012.plot(grid=True)
plt.show()
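
The conversion and slicing steps above can be tried without the pre-loaded data. The following is a minimal sketch that assumes a made-up stand-in, `diet_demo`, with invented values; only the pandas calls mirror the exercise.

```python
import pandas as pd

# Hypothetical stand-in for the pre-loaded `diet` DataFrame (values are made up)
date_strings = ["2011-12-25", "2012-01-01", "2012-06-17", "2012-12-30", "2013-01-06"]
diet_demo = pd.DataFrame({"diet": [70, 100, 80, 65, 95]}, index=date_strings)

# Before conversion the index holds plain strings
print(type(diet_demo.index[0]))

# After conversion the index is a DatetimeIndex
diet_demo.index = pd.to_datetime(diet_demo.index)
print(type(diet_demo.index))

# Partial-string indexing: keep every row whose date falls in 2012
diet_demo_2012 = diet_demo.loc["2012"]
print(diet_demo_2012)
```

Selecting a year with a plain string only works once the index is a datetime index, which is why the conversion comes first.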

#### **Merging Time Series With Different Dates**
Stock and bond markets in the U.S. are closed on different days. One way to see the dates that the stock market is open and the bond market is closed is to convert both indexes of dates into sets and take the difference in sets.  
   
Merge the two DataFrames into a new DataFrame, `stocks_and_bonds`, using the `.join()` method, which has the syntax `df1.join(df2)`.
To get the intersection of dates, use the argument `how='inner'`. 
  
Stock prices and 10-year US Government bond yields, which were downloaded from FRED, are pre-loaded in DataFrames `stocks` and `bonds`.

In [0]:
# Import pandas
import pandas as pd

# Convert the stock index and bond index into sets
set_stock_dates = set(stocks.index)
set_bond_dates = set(bonds.index)

# Take the difference between the sets and print
print(set_stock_dates - set_bond_dates)

# Merge stocks and bonds DataFrames using join()
stocks_and_bonds = stocks.join(bonds, how='inner')

# {'2013-11-11', '2007-11-12', '2015-11-11', '2009-11-11', '2007-10-08', 
# '2010-11-11', '2008-11-11', '2012-11-12', '2011-10-10', '2012-10-08', 
# '2016-11-11', '2016-10-10', '2011-11-11', '2014-11-11', '2015-10-12', 
# '2014-10-13', '2017-06-09', '2013-10-14', '2009-10-12', '2010-10-11', '2008-10-13'}
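
To see the set difference and the inner join on something small, here is a sketch with two toy DataFrames. The names `stocks_demo` and `bonds_demo` and all the values are assumptions made for illustration, not the FRED data.

```python
import pandas as pd

# Hypothetical stand-ins for `stocks` and `bonds` (toy values)
stocks_demo = pd.DataFrame(
    {"SP500": [2400.0, 2410.0, 2405.0, 2420.0]},
    index=pd.to_datetime(["2017-06-07", "2017-06-08", "2017-06-09", "2017-06-12"]),
)
bonds_demo = pd.DataFrame(
    {"US10Y": [2.18, 2.19, 2.21]},
    index=pd.to_datetime(["2017-06-07", "2017-06-08", "2017-06-12"]),
)

# Dates present in the stock data but missing from the bond data
only_stocks = set(stocks_demo.index) - set(bonds_demo.index)
print(sorted(only_stocks))

# An inner join keeps only the dates that appear in both DataFrames
joined = stocks_demo.join(bonds_demo, how="inner")
print(joined)
```

The date that shows up in the set difference is exactly the row that the inner join drops.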

#### **Correlation of Stocks and Bonds**
Investors are often interested in the correlation between the returns of two different assets for asset allocation and hedging purposes.  
  
Keep in mind that you should compute the correlations on the percentage changes rather than the levels.  
Compute the percent changes of a DataFrame with the `.pct_change()` method.  
Then use the `.corr()` method for Series, which has the syntax `series1.corr(series2)`.

Stock prices and 10-year bond yields are combined in a DataFrame called `stocks_and_bonds` under the columns `SP500` and `US10Y`.

In [0]:
# Compute percent change using pct_change()
returns = stocks_and_bonds.pct_change()

# Compute correlation using corr()
correlation = returns['SP500'].corr(returns['US10Y'])
print("Correlation of stocks and interest rates: ", correlation)

# Make scatter plot
plt.scatter(x=returns['SP500'], y=returns['US10Y'])
plt.show()

#### **Flying Saucers Aren't Correlated to Flying Markets**
Two trending series may show a strong correlation even if they are completely unrelated. This is referred to as "spurious correlation". That's why when you look at the correlation of, say, two stocks, you should look at the correlation of their *returns* and not their *levels*.

To illustrate this point, calculate the correlation between the levels of the stock market and the annual sightings of UFOs. Both of those time series have trended up over the last several decades, and the correlation of their levels is very high. But the correlation of their percent changes will be close to zero, since there is no relationship between those two series.

The DataFrame `levels` contains the levels of `DJI` and `UFO`. UFO data was downloaded from www.nuforc.org.


In [0]:
# Compute correlation of levels
correlation1 = levels['DJI'].corr(levels['UFO'])
print("Correlation of levels: ", correlation1)

# Compute correlation of percent changes
changes = levels.pct_change()
correlation2 = changes['DJI'].corr(changes['UFO'])
print("Correlation of changes: ", correlation2)

# Correlation of levels:  0.9399762210726432
# Correlation of changes:  0.06026935462405376
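
The same effect can be reproduced with simulated data. The sketch below assumes two independent series, `A` and `B`, that both drift upward; the seed, drift, and volatility are arbitrary choices made for illustration and have nothing to do with the DJI or UFO data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Two independent "price" series that share nothing except an upward drift
levels_demo = pd.DataFrame({
    "A": 100 * np.exp(np.cumsum(rng.normal(loc=0.005, scale=0.005, size=n))),
    "B": 100 * np.exp(np.cumsum(rng.normal(loc=0.005, scale=0.005, size=n))),
})

# Correlation of levels is dominated by the common trend
corr_levels = levels_demo["A"].corr(levels_demo["B"])

# Correlation of percent changes reflects the (absent) true relationship
changes_demo = levels_demo.pct_change()
corr_changes = changes_demo["A"].corr(changes_demo["B"])

print("Correlation of levels:  %4.2f" % corr_levels)
print("Correlation of changes: %4.2f" % corr_changes)
```

Because the two series are generated independently, any correlation in the levels comes purely from the shared trend.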

#### **Regression's R-Squared**
R-squared measures how closely the data fit the regression line. In particular, the magnitude of the correlation is the square root of the R-squared and the sign of the correlation is the sign of the regression coefficient.

In this exercise, you will start using the statistical package `statsmodels`, which performs much of the statistical modeling and testing that is found in R.

You will take two series, `x` and `y`, compute their correlation, and then regress `y` on `x` using the function `OLS(y,x)` in the `statsmodels.api` library (note that the dependent, or left-hand side, variable `y` is the first argument). Most linear regressions contain a constant term, which is the intercept (the `α` in the regression *yt = α + βxt + ϵt*). To include a constant using the function `OLS()`, you need to add a column of 1's to the right-hand side of the regression.

The module `statsmodels.api` has been imported for you as `sm`.

In [0]:
# Import the statsmodels module
import statsmodels.api as sm

# Compute correlation of x and y
correlation = x.corr(y)
print("The correlation between x and y is %4.2f" %(correlation))

# Convert the Series x to a DataFrame and name the column x
# (to_frame names the column explicitly, whatever the Series was called)
dfx = x.to_frame(name='x')

# Add a constant to the DataFrame dfx
dfx1 = sm.add_constant(dfx)

# Regress y on dfx1
result = sm.OLS(y, dfx1).fit()

# Print out the results and look at the relationship between R-squared and the correlation above
print(result.summary())
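
The relationship between R-squared and the correlation can also be checked numerically. The following sketch uses simulated data and NumPy's `polyfit` rather than `statsmodels`, purely to keep the example self-contained; `x_demo`, `y_demo`, and the coefficients are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data with a negative slope: y = 1 - 0.5 * x + noise
x_demo = rng.normal(size=200)
y_demo = 1.0 - 0.5 * x_demo + rng.normal(scale=1.0, size=200)

# Sample correlation between x and y
corr_xy = np.corrcoef(x_demo, y_demo)[0, 1]

# Least-squares fit of y = alpha + beta * x (polyfit returns [beta, alpha])
beta, alpha = np.polyfit(x_demo, y_demo, deg=1)

# R-squared = 1 - (residual variance) / (variance of y)
residuals = y_demo - (alpha + beta * x_demo)
r_squared = 1.0 - residuals.var() / y_demo.var()

print("correlation: %5.3f" % corr_xy)
print("slope:       %5.3f" % beta)
print("R-squared:   %5.3f" % r_squared)
print("corr^2:      %5.3f" % corr_xy**2)
```

The squared correlation matches R-squared, and the sign of the correlation matches the sign of the slope.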

#### **A Popular Strategy Using Autocorrelation**
One puzzling anomaly with stocks is that investors tend to overreact to news. Following large jumps, either up or down, stock prices tend to reverse. This is described as mean reversion in stock prices: prices tend to bounce back, or revert, towards previous levels after large moves, which are observed over time horizons of about a week. A more mathematical way to describe mean reversion is to say that stock returns are negatively autocorrelated.

This simple idea is actually the basis for a popular hedge fund strategy. If you're curious to learn more about this hedge fund strategy (although it's not necessary reading for anything else later in the course), see [here](https://www.quantopian.com/posts/enhancing-short-term-mean-reversion-strategies-1).

You'll look at the autocorrelation of weekly returns of MSFT stock from 2012 to 2017. You'll start with a DataFrame `MSFT` of daily prices. You should use the `.resample()` method to get weekly prices and then compute returns from prices. Use the pandas method `.autocorr()` to get the autocorrelation and show that the autocorrelation is negative. Note that the `.autocorr()` method only works on Series, not DataFrames (even DataFrames with one column), so you will have to select the column in the DataFrame.

Instructions:  
* Use the `.resample()` method with `rule='W'`, followed by `.last()`, to convert daily data to weekly data.
  * The argument `how` in `.resample()` (as in `how='last'`) has been deprecated.
  * Use the syntax `.resample().last()` instead.
* Create a new DataFrame, `returns`, of percent changes in weekly prices using the `.pct_change()` method.
* Compute the autocorrelation using the `.autocorr()` method on the series of closing stock prices, which is the column `'Adj Close'` in the DataFrame `returns`.

In [0]:
# Convert the daily data to weekly data (the deprecated `how` argument is replaced by .last())
MSFT = MSFT.resample(rule='W').last()

# Compute the percentage change of prices
returns = MSFT.pct_change()

# Compute and print the autocorrelation of returns
autocorrelation = returns['Adj Close'].autocorr()
print("The autocorrelation of weekly returns is %4.2f" %(autocorrelation))
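
Here is a small sketch of the same pipeline on simulated daily prices. The DataFrame `daily_demo` is a hypothetical stand-in for `MSFT`, and the random-walk prices are an assumption made for illustration, so the sign of the result says nothing about real stocks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# 60 business days starting on a Monday, i.e. 12 full Monday-to-Friday weeks
days = pd.date_range("2017-01-02", periods=60, freq="B")
daily_demo = pd.DataFrame(
    {"Adj Close": 60 * np.exp(np.cumsum(rng.normal(0, 0.01, len(days))))},
    index=days,
)

# Keep the last available price of each week
weekly_demo = daily_demo.resample(rule="W").last()

# Weekly returns and their lag-one autocorrelation
weekly_returns = weekly_demo.pct_change()
auto = weekly_returns["Adj Close"].autocorr()

# The same number by hand: correlation of the series with itself shifted one week
by_hand = weekly_returns["Adj Close"].corr(weekly_returns["Adj Close"].shift(1))

print("autocorr(): %6.3f   by hand: %6.3f" % (auto, by_hand))
```

Writing the autocorrelation as a correlation with the shifted series makes it clear why `.autocorr()` needs a single Series rather than a DataFrame.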

#### **Are Interest Rates Autocorrelated?**
When you look at daily changes in interest rates, the autocorrelation is close to zero. However, if you resample the data and look at annual changes, the autocorrelation is negative. This implies that while short term changes in interest rates may be uncorrelated, long term changes in interest rates are negatively autocorrelated. A daily move up or down in interest rates is unlikely to tell you anything about interest rates tomorrow, but a move in interest rates over a year can tell you something about where interest rates are going over the next year. And this makes some economic sense: over long horizons, when interest rates go up, the economy tends to slow down, which consequently causes interest rates to fall, and vice versa.

The DataFrame `daily_rates` contains daily data of 10-year interest rates from 1962 to 2017.

* Create a new DataFrame, `daily_diff`, of changes in daily rates using the `.diff()` method.
* Compute the autocorrelation of the column `'US10Y'` in `daily_diff` using the `.autocorr()` method.
* Use the `.resample()` method with the argument `rule='A'`, followed by `.last()`, to convert to annual frequency.
  * The argument `how` in `.resample()` (as in `how='last'`) has been deprecated.
  * Use the syntax `.resample().last()` instead.
* Create a new DataFrame, `yearly_diff` of changes in annual rates and compute the autocorrelation, as above.

In [0]:
# Compute the daily change in interest rates 
daily_diff = daily_rates.diff()

# Compute and print the autocorrelation of daily changes
autocorrelation_daily = daily_diff['US10Y'].autocorr()
print("The autocorrelation of daily interest rate changes is %4.2f" %(autocorrelation_daily))

# Convert the daily data to annual data
yearly_rates = daily_rates.resample(rule='A').last()

# Repeat above for annual data
yearly_diff = yearly_rates.diff()
autocorrelation_yearly = yearly_diff['US10Y'].autocorr()
print("The autocorrelation of annual interest rate changes is %4.2f" %(autocorrelation_yearly))
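
The sketch below repeats the daily versus annual comparison on simulated rates. It assumes a made-up DataFrame, `rates_demo`, and takes year-end values with a `groupby` on the calendar year, which is an alternative to `.resample()` that does not depend on a frequency alias. Since the simulated series is a pure random walk, it is not expected to reproduce the negative annual autocorrelation described for real rates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical stand-in for `daily_rates`: ten years of simulated daily rates
idx = pd.date_range("2000-01-01", "2009-12-31", freq="D")
rates_demo = pd.DataFrame(
    {"US10Y": 5 + np.cumsum(rng.normal(0, 0.05, len(idx)))}, index=idx
)

# Daily changes: rate_t - rate_{t-1}
daily_diff_demo = rates_demo.diff()

# Year-end values: last observation within each calendar year
yearly_demo = rates_demo.groupby(rates_demo.index.year).last()
yearly_diff_demo = yearly_demo.diff()

print("daily:  %5.2f" % daily_diff_demo["US10Y"].autocorr())
print("annual: %5.2f" % yearly_diff_demo["US10Y"].autocorr())
```

Note that `.diff()` is used here instead of `.pct_change()` because interest rates are already expressed in percent.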

### **Section 2. Some Simple Time Series** 

#### **Taxing Exercise: Compute the ACF**
You will compute the array of autocorrelations for the H&R Block quarterly earnings that are pre-loaded in the DataFrame `HRB`. Then, plot the autocorrelation function using the `plot_acf` function. This plot shows what the autocorrelation function looks like for cyclical earnings data. The ACF at `lag=0` is always one, of course. In the next exercise, you will learn about the confidence interval for the ACF, but for now, suppress the confidence interval by setting `alpha=1`.

Instructions:  
* Import the `acf` function and the `plot_acf` function from `statsmodels`.
* Compute the array of autocorrelations of the quarterly earnings data in DataFrame `HRB`.
* Plot the autocorrelation function of the quarterly earnings data in `HRB`, and pass the argument `alpha=1` to suppress the confidence interval.

In [0]:
# Import the acf and plot_acf functions from statsmodels
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf

# Compute the acf array of HRB
acf_array = acf(HRB)
print(acf_array)

# Plot the acf function, suppressing the confidence interval with alpha=1
plot_acf(HRB, alpha=1)
plt.show()

![1](https://drive.google.com/uc?id=1V0tJAmczpaU_BQzmIh3so6oHklfStITh)

![2](https://drive.google.com/uc?id=18n_4D2pGG8xSSLeNWP79Wdf1JZ4ETFJe)
    

Notice the strong positive autocorrelation at lags 4, 8, 12, 16, 20, ...
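
To see where the spikes at multiples of four come from, the sketch below computes a sample ACF by hand on a simulated quarterly series. The helper `sample_acf` is a hypothetical, hand-rolled estimator written for this illustration (it is not the `statsmodels` implementation), and the seasonal pattern is made up.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelations for lags 0..nlags (hand-rolled illustration)."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    denom = np.dot(dev, dev)
    return np.array(
        [np.dot(dev[: len(dev) - k], dev[k:]) / denom for k in range(nlags + 1)]
    )

rng = np.random.default_rng(3)

# Ten years of made-up quarterly earnings: one strong quarter per year plus noise
pattern = np.tile([3.0, -0.5, -0.5, -0.5], 10)
earnings_demo = pattern + rng.normal(scale=0.2, size=pattern.size)

acf_demo = sample_acf(earnings_demo, nlags=8)
for lag, value in enumerate(acf_demo):
    print("lag %d: %6.3f" % (lag, value))
```

Observations four quarters apart fall in the same season, so they move together, while observations two quarters apart do not.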


#### **Are We Confident This Stock is Mean Reverting?**
In the last chapter, you saw that the autocorrelation of MSFT's weekly stock returns was -0.16. That autocorrelation seems large, but is it statistically significant? In other words, can you say that there is less than a 5% chance that we would observe such a large negative autocorrelation if the true autocorrelation were really zero? And are there any autocorrelations at other lags that are significantly different from zero?
  
Even if the true autocorrelations were zero at all lags, in a finite sample of returns the estimated autocorrelations will not be exactly zero. In fact, the standard deviation of the sample autocorrelation is approximately 1/sqrt(N), where N is the number of observations. If N=100, for example, the standard deviation of the ACF is 0.1. Since 95% of a normal curve lies between -1.96 and +1.96 standard deviations from the mean, the 95% confidence interval is ±1.96/sqrt(N). This approximation only holds when the true autocorrelations are all zero.
  
You will compute the actual and approximate confidence interval for the ACF, and compare it to the lag-one autocorrelation of -0.16 from the last chapter. The weekly returns of Microsoft are pre-loaded in a DataFrame called `returns`.

Instructions:  
* Recompute the autocorrelation of weekly returns in the Series `'Adj Close'` in the `returns` DataFrame.
* Find the number of observations in the `returns` DataFrame using the `len()` function.
* Approximate the 95% confidence interval of the estimated autocorrelation. The math function `sqrt()` has been imported and can be used.
* Plot the autocorrelation function of `returns` using `plot_acf` that was imported from `statsmodels`. Set `alpha=0.05` for the confidence intervals (that's the default) and `lags=20`.

In [0]:
# Import the plot_acf function from statsmodels and sqrt from math
from statsmodels.graphics.tsaplots import plot_acf
from math import sqrt

# Compute and print the autocorrelation of MSFT weekly returns
autocorrelation = returns['Adj Close'].autocorr()
print("The autocorrelation of weekly MSFT returns is %4.2f" %(autocorrelation))

# Find the number of observations by taking the length of the returns DataFrame
nobs = len(returns)

# Compute the approximate confidence interval
conf = 1.96/sqrt(nobs)
print("The approximate confidence interval is +/- %4.2f" %(conf))

# Plot the autocorrelation function with 95% confidence intervals and 20 lags using plot_acf
plot_acf(returns, alpha=0.05, lags=20)
plt.show()

![](https://drive.google.com/uc?id=13Dtyx2uxSx6GSXqkM9UPdb8nxR-peii_)
   
Notice that the autocorrelation at lag 1 is significantly negative, but none of the other lags are significantly different from zero.
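
The ±1.96/sqrt(N) band can be sanity-checked by simulation. The sketch below assumes Gaussian white noise with N=100 and an arbitrary number of repetitions, and counts how often the lag-one sample autocorrelation stays inside the band.

```python
import numpy as np

rng = np.random.default_rng(123)
n_obs, n_sims = 100, 2000

# Approximate 95% band under the null of zero autocorrelation
band = 1.96 / np.sqrt(n_obs)

# Each row is one independent white noise sample of length n_obs
noise = rng.normal(size=(n_sims, n_obs))
dev = noise - noise.mean(axis=1, keepdims=True)

# Lag-one sample autocorrelation of every row
lag1 = (dev[:, :-1] * dev[:, 1:]).sum(axis=1) / (dev ** 2).sum(axis=1)

# Share of simulations whose estimate falls inside the band
inside = np.mean(np.abs(lag1) <= band)
print("band: +/- %5.3f, share inside: %5.3f" % (band, inside))
```

A share close to 95% is what the approximation predicts when the true autocorrelation is zero.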

#### **Can't Forecast White Noise**
A white noise time series is simply a sequence of uncorrelated random variables that are identically distributed. Stock returns are often modeled as white noise. Unfortunately, for white noise, we cannot forecast future observations based on the past, because the autocorrelations at all lags are zero.

You will generate a white noise series and plot the autocorrelation function to show that it is zero for all lags. You can use `np.random.normal()` to generate random returns. For a Gaussian white noise process, the mean and standard deviation describe the entire process.

Instructions:  
* Generate 1000 random normal returns using `np.random.normal()` with mean 2% (0.02) and standard deviation 5% (0.05), where the argument for the mean is `loc` and the argument for the standard deviation is `scale`.
* Verify the mean and standard deviation of returns using `np.mean()` and `np.std()`.
* Plot the time series.
* Plot the autocorrelation function using `plot_acf` with `lags=20`.

In [0]:
# Import numpy and the plot_acf function from statsmodels
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf

# Simulate white noise returns
returns = np.random.normal(loc=0.02, scale=0.05, size=1000)

# Print out the mean and standard deviation of returns
mean = np.mean(returns)
std = np.std(returns)
print("The mean is %5.3f and the standard deviation is %5.3f" %(mean,std))

# Plot returns series
plt.plot(returns)
plt.show()

# Plot autocorrelation function of white noise returns
plot_acf(returns, lags=20)
plt.show()

![](https://drive.google.com/uc?id=1sh9D9yrXqmbIuLlgbhNYYI_x6QujVA0c)
![](https://drive.google.com/uc?id=1fAMNQInK2MFlLl5P_b344LY0rMm1ay42)
  
Notice that for a white noise time series, all the autocorrelations are close to zero, so the past will not help you forecast the future.
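
The same conclusion can be checked numerically rather than visually. This sketch assumes a fixed seed and computes the first 20 sample autocorrelations by hand (not with `plot_acf`), then counts how many fall outside the approximate 95% band.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Simulated white noise returns: mean 2%, standard deviation 5%
wn = rng.normal(loc=0.02, scale=0.05, size=1000)
print("mean %5.3f, std %5.3f" % (wn.mean(), wn.std()))

# Sample autocorrelations at lags 0..20, computed by hand
dev = wn - wn.mean()
acf_wn = np.array(
    [np.dot(dev[: len(dev) - k], dev[k:]) for k in range(21)]
) / np.dot(dev, dev)

# Approximate 95% band and the number of lags (1..20) outside it
band = 1.96 / np.sqrt(len(wn))
n_outside = int(np.sum(np.abs(acf_wn[1:]) > band))
print("band: +/- %5.3f, lags outside the band: %d of 20" % (band, n_outside))
```

With a 95% band, roughly one lag in twenty can fall outside by chance alone, so an occasional small spike does not contradict the white noise assumption.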