## Introduction

This in-class example demonstrates how to handle time trends, seasonality, and autoregressive processes.

What you need to know:
- The statsmodels and pandas modules in Python
- Theoretical concepts of time series regression models

See the References section at the end for the detailed concepts and techniques used in this exercise.


***
## Data Description

The data set is contained in a comma-separated values (CSV) file named ```CDN_hprice.csv``` with column headers.

The data is a set of quarterly observations on a housing price index and other relevant variables in Canada for 1976 through 2019.

This data set is obtained from the [Federal Reserve Bank of Dallas's International House Price Database](https://www.dallasfed.org/institute/houseprice).

The variables are described as follows:

| Variable name | Variable description |
| -- | ----------- |
| Year      | Year |
| Quarter   | Quarter |
| RHPI      | House Price Index (real) |
| RPDI      | Personal disposable income (real) |
| logRHPI   | log(RHPI)   |
| logRPDI   | log(RPDI)   |
| UE        | Unemployment rate (in percentage points) |
| CPI       | Consumer Price Index |
| TB10_rt   | 10-year Treasury bond rate (nominal) |
| RTB10_rt  | 10-year Treasury bond rate (real) |


***
## Load the required modules

In [None]:
import numpy as np
import pandas as pd
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot

***
## Import the data set

#### Load the data set into Python
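A minimal sketch of the loading step with `pd.read_csv`. The in-memory text below is a made-up stand-in with only a few of the columns, so that the snippet runs without the file; in class you would pass the path `"CDN_hprice.csv"` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of CDN_hprice.csv (values are made up).
# With the real file: df = pd.read_csv("CDN_hprice.csv")
sample_csv = io.StringIO(
    "Year,Quarter,RHPI,UE\n"
    "1976,2,101.2,7.1\n"
    "1976,1,100.0,6.9\n"
    "1976,3,102.5,7.3\n"
)

# The first row of the file is used as the column headers
df = pd.read_csv(sample_csv)

print(df.head())    # first few rows
print(df.dtypes)    # check that numeric columns were parsed as numbers
```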

***
## Data preparation

#### 1.1 Sort the data in ascending order by year and quarter
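One way to sort on two keys is `DataFrame.sort_values` with a list of column names. The sketch below uses a few made-up rows in shuffled order rather than the real data set.

```python
import pandas as pd

# Made-up rows in shuffled order, standing in for the real data set
df = pd.DataFrame({
    "Year":    [1977, 1976, 1976, 1977, 1976, 1976],
    "Quarter": [1, 3, 1, 2, 4, 2],
})

# Sort by year first, then by quarter within each year (ascending is the default)
df = df.sort_values(["Year", "Quarter"])
print(df)
```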

#### 1.2 Generate time index

Create a new variable ```t``` such that $t=0$ in the first period.

Note that the year and quarter values repeat across rows, so neither column alone identifies a period. You can create the required index by:
1. Sorting the observations in ascending order, as you are asked to do in (1.1)
2. Adding a new column from the ```df.index``` attribute (after resetting the index so that it runs 0, 1, 2, ...), where ```df``` is the name of the pandas dataframe
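The two steps above can be sketched as follows, using made-up rows. Sorting keeps the original row labels, so the sketch resets the index before copying it into ```t```.

```python
import pandas as pd

# Made-up rows in shuffled order, standing in for the real data set
df = pd.DataFrame({
    "Year":    [1977, 1976, 1976, 1976, 1976],
    "Quarter": [1, 4, 2, 1, 3],
})

df = df.sort_values(["Year", "Quarter"])
print(df.index.tolist())   # [3, 2, 4, 1, 0]: the original labels, not 0, 1, 2, ...

# Replace the labels with 0, 1, 2, ... in sorted order, then copy them into t
df = df.reset_index(drop=True)
df["t"] = df.index
print(df)
```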

#### 1.3 Generate lag variable

Here we create lag variables with a lag of **4** periods. That is, at quarterly frequency, the lagged value comes from the *same* quarter of the *previous* year.

Create a new column in the data set named ```logRHPI_4```, such that $\text{logRHPI}\_4 = \log(\textit{RHPI}_{t-4})$ 

Create a new column in the data set named ```logRPDI_4```, such that $\text{logRPDI}\_4 = \log(\textit{RPDI}_{t-4})$ 
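A minimal sketch of the lag step using `Series.shift`, assuming the rows are already sorted by year and quarter with no missing quarters. The numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up quarterly series, already sorted by Year and Quarter
df = pd.DataFrame({
    "Year":    np.repeat([1976, 1977, 1978], 4),
    "Quarter": np.tile([1, 2, 3, 4], 3),
    "RHPI":    np.linspace(100.0, 122.0, 12),
    "RPDI":    np.linspace(50.0, 61.0, 12),
})
df["logRHPI"] = np.log(df["RHPI"])
df["logRPDI"] = np.log(df["RPDI"])

# shift(4) moves every value down four rows, so row t holds the value from t-4.
# The first four rows have no value four quarters earlier and become NaN.
df["logRHPI_4"] = df["logRHPI"].shift(4)
df["logRPDI_4"] = df["logRPDI"].shift(4)

print(df[["Year", "Quarter", "logRHPI", "logRHPI_4"]])
```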

#### 1.4 Generate "*first-differencing*" variables

This is also known as the year-over-year difference, because we take the difference between the current quarter and the same quarter one year (4 quarters) earlier.

E.g., we are calculating the change from 2018 Q1 to 2019 Q1. 

Create a new column named ```gRHPI```, such that $\text{gRHPI} = \Delta \log(\textit{RHPI}_t) = \log(\textit{RHPI}_t) - \log(\textit{RHPI}_{t-4})$
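As a small worked example, assume a made-up index that rises exactly 10% per year. The year-over-year log difference should then equal $\log(1.1) \approx 0.0953$ in every quarter after the first year.

```python
import numpy as np
import pandas as pd

# Made-up index: 10% higher each year, same level in all four quarters
df = pd.DataFrame({
    "Year":    np.repeat([2017, 2018, 2019], 4),
    "Quarter": np.tile([1, 2, 3, 4], 3),
    "RHPI":    np.repeat([100.0, 110.0, 121.0], 4),
})
df["logRHPI"] = np.log(df["RHPI"])

# Difference between the current value and the value four quarters earlier;
# df["logRHPI"].diff(4) gives the same result
df["gRHPI"] = df["logRHPI"] - df["logRHPI"].shift(4)

print(df[["Year", "Quarter", "gRHPI"]])
```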

#### 1.5 Plot the time series for log(RHPI) and growth rate of RHPI
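One possible layout is two stacked panels sharing the time axis. The sketch below uses a made-up series with a linear trend and a four-quarter cycle; the column names follow the ones defined earlier in this exercise.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up series: linear trend plus a cycle that repeats every four quarters
n = 40
df = pd.DataFrame({"t": np.arange(n)})
df["logRHPI"] = 4.0 + 0.01 * df["t"] + 0.02 * np.sin(np.pi * df["t"] / 2)
df["gRHPI"] = df["logRHPI"].diff(4)   # year-over-year log difference

# Two panels on a shared x-axis: level on top, growth rate below
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
axes[0].plot(df["t"], df["logRHPI"])
axes[0].set_ylabel("log(RHPI)")
axes[1].plot(df["t"], df["gRHPI"])
axes[1].set_ylabel("gRHPI")
axes[1].set_xlabel("t")
fig.tight_layout()
# In a notebook the figure renders automatically; in a script, call plt.show()
```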

***
## Time trend and seasonality

Consider the following model for house price growth:
$$\log(\textit{RHPI}_t) = \beta_0 + \beta_1 t + \delta_1 Q2 + \delta_2 Q3 + \delta_3 Q4 + u_t$$

- $t = 0, 1, 2, \ldots$ is the time index you created in (1.2)
- Q2, Q3, Q4 are dummy variables that equal 1 in the second, third, and fourth quarter, respectively, and 0 otherwise.
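The quarter dummies can be built by comparing the ```Quarter``` column with each quarter number. This is a sketch on made-up rows; the column names Q2, Q3, Q4 follow the model above.

```python
import pandas as pd

# Two made-up years of quarterly rows
df = pd.DataFrame({"Quarter": [1, 2, 3, 4, 1, 2, 3, 4]})

# Each comparison gives True/False; astype(int) turns that into 1/0
for q in (2, 3, 4):
    df[f"Q{q}"] = (df["Quarter"] == q).astype(int)

print(df)
```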

#### 2.1 Estimate the model

#### 2.2 Get the estimation results
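Steps 2.1 and 2.2 can be sketched together with the statsmodels formula interface. The data below are synthetic (a made-up trend of 0.01 per quarter and a made-up Q3 effect), so the printed estimates say nothing about the Canadian data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic quarterly data standing in for CDN_hprice.csv (all numbers made up)
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({"Quarter": np.tile([1, 2, 3, 4], n // 4)})
df["t"] = df.index
df["logRHPI"] = (4.0 + 0.01 * df["t"] + 0.05 * (df["Quarter"] == 3)
                 + rng.normal(0, 0.02, n))

# Quarter dummies as defined in the model
for q in (2, 3, 4):
    df[f"Q{q}"] = (df["Quarter"] == q).astype(int)

# 2.1: estimate by OLS; the formula adds the intercept automatically
res = smf.ols("logRHPI ~ t + Q2 + Q3 + Q4", data=df).fit()

# 2.2: coefficient table with standard errors, t statistics and p-values
print(res.summary())
```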

#### 2.3 What is the benchmark (base period) in this model?

#### 2.4 At the 5% significance level, is there a statistically significant time trend?

#### 2.5 At the 5% significance level, would you conclude that house price growth exhibits seasonality at quarterly frequency?
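For the time trend, the relevant number is the p-value on ```t```. For seasonality, one common approach is a joint F-test that the three quarter coefficients are all zero. The sketch below shows both on synthetic data with a built-in trend and Q3 effect; it illustrates the mechanics only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic quarterly data (made-up trend and seasonal effect)
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({"Quarter": np.tile([1, 2, 3, 4], n // 4)})
df["t"] = df.index
df["logRHPI"] = (4.0 + 0.01 * df["t"] + 0.05 * (df["Quarter"] == 3)
                 + rng.normal(0, 0.02, n))
for q in (2, 3, 4):
    df[f"Q{q}"] = (df["Quarter"] == q).astype(int)

res = smf.ols("logRHPI ~ t + Q2 + Q3 + Q4", data=df).fit()

# Time trend: two-sided t-test of H0: beta_1 = 0
p_trend = res.pvalues["t"]
print("p-value for t:", p_trend)

# Seasonality: joint F-test of H0: delta_1 = delta_2 = delta_3 = 0
ftest = res.f_test("Q2 = 0, Q3 = 0, Q4 = 0")
p_season = float(ftest.pvalue)
print("p-value for joint test of Q2, Q3, Q4:", p_season)
```

Reject the null hypothesis at the 5% level when the corresponding p-value is below 0.05.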

***
## Autoregressive process of order 1

Consider a house price model with an AR(1) specification:
$$\log(\textit{RHPI}_t) = \beta_0 + \beta_1 t + \beta_2 \log(\textit{RHPI}_{t-4}) + \delta_1 Q2 + \delta_2 Q3 + \delta_3 Q4 + u_t$$

#### 3.1 Estimate the model

#### 3.2 Get the estimation results
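Steps 3.1 and 3.2 differ from the previous model only by the extra regressor ```logRHPI_4```. In this sketch on synthetic data, note that the formula interface drops the first four rows, where the lag is missing.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic quarterly data standing in for CDN_hprice.csv (all numbers made up)
rng = np.random.default_rng(1)
n = 120
df = pd.DataFrame({"Quarter": np.tile([1, 2, 3, 4], n // 4)})
df["t"] = df.index
df["logRHPI"] = 4.0 + 0.01 * df["t"] + rng.normal(0, 0.02, n)
df["logRHPI_4"] = df["logRHPI"].shift(4)     # NaN in the first four rows
for q in (2, 3, 4):
    df[f"Q{q}"] = (df["Quarter"] == q).astype(int)

# Rows with a missing lag are dropped before estimation
res = smf.ols("logRHPI ~ t + logRHPI_4 + Q2 + Q3 + Q4", data=df).fit()
print(res.summary())

# Point estimate and 95% confidence interval for beta_2
print(res.params["logRHPI_4"])
print(res.conf_int().loc["logRHPI_4"])
print("observations used:", int(res.nobs))
```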

#### 3.3 At the 5% significance level, what would you conclude about $\beta_2$?

#### 3.4 Does the AR(1) process have weak dependence?
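To build intuition for weak dependence, the simulation below generates a process $y_t = \rho\, y_{t-4} + e_t$ with an assumed $\rho = 0.6$ (no trend or dummies; this is a simplification of the model above) and checks that the correlation between $y_t$ and $y_{t-4h}$ decays roughly like $\rho^h$ when $|\rho| < 1$.

```python
import numpy as np

# Simulate y_t = rho * y_{t-4} + e_t with an assumed rho (not an estimate)
rng = np.random.default_rng(0)
rho, n = 0.6, 20000
e = rng.normal(size=n)
y = np.zeros(n)
for i in range(4, n):
    y[i] = rho * y[i - 4] + e[i]

def autocorr(x, k):
    """Sample correlation between x_t and x_{t-k}."""
    return np.corrcoef(x[k:], x[:-k])[0, 1]

# Compare sample autocorrelations at lags 4, 8, 12 with rho, rho^2, rho^3
for k in (4, 8, 12):
    print(k, round(autocorr(y, k), 3), round(rho ** (k // 4), 3))
```

With $|\rho| < 1$ the correlations shrink toward zero as the gap grows, which is the defining feature of weak dependence; with $\rho = 1$ they would not die out.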

#### 3.5 Is the time trend still significant? Why or why not?

***
## Complete model

Consider a house price model with an AR(1) specification and other exogenous regressors:
$$\log(\textit{RHPI}_t) = \beta_0 + \beta_1 t + \beta_2 \log(\textit{RHPI}_{t-4}) + \beta_3 \textit{UE}_t + \beta_4 \log(\textit{RPDI}_t) + \delta_1 Q2 + \delta_2 Q3 + \delta_3 Q4 + u_t$$

#### 4.1 Estimate the model

#### 4.2 Get the estimation results
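Steps 4.1 and 4.2 can be sketched the same way, adding ```UE``` and ```logRPDI``` to the formula. The data are synthetic, and the coefficients used to generate them are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic quarterly data standing in for CDN_hprice.csv (all numbers made up)
rng = np.random.default_rng(2)
n = 120
df = pd.DataFrame({"Quarter": np.tile([1, 2, 3, 4], n // 4)})
df["t"] = df.index
df["UE"] = 7.0 + rng.normal(0, 1.0, n)                 # percentage points
df["logRPDI"] = 10.0 + 0.005 * df["t"] + rng.normal(0, 0.01, n)
df["logRHPI"] = (-1.0 + 0.004 * df["t"] - 0.01 * df["UE"]
                 + 0.5 * df["logRPDI"] + rng.normal(0, 0.02, n))
df["logRHPI_4"] = df["logRHPI"].shift(4)
for q in (2, 3, 4):
    df[f"Q{q}"] = (df["Quarter"] == q).astype(int)

res = smf.ols(
    "logRHPI ~ t + logRHPI_4 + UE + logRPDI + Q2 + Q3 + Q4", data=df
).fit()
print(res.summary())

# The dependent variable is in logs and UE is in percentage points, so
# 100 * coefficient approximates the % change in RHPI per one-point rise in UE
print(100 * res.params["UE"])
```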

#### 4.3 How would you interpret $\beta_3$? (Be careful about how the unemployment rate is reported in the data set)

#### 4.4 At the 5% significance level, what would you conclude about $\beta_3$?

#### 4.5 Explain why the AR(1) dependence ($\beta_2$) is weakened, compared with your results in Question 3.

***
## References
- Wooldridge, Jeffrey M. (2012). "Introductory Econometrics: A Modern Approach, 5e." Chapter 11.

- Seabold, Skipper, and Josef Perktold (2010). "[statsmodels: Econometric and statistical modeling with python](https://www.statsmodels.org/stable/examples/notebooks/generated/ols.html)." Proceedings of the 9th Python in Science Conference.