# Module 10.1: Linear Regression Basics

In [5]:
# Import libraries
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

### LOS 10.a: Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of these coefficients.

The purpose of simple linear regression is to explain the variation in a dependent variable in terms of the variation in a single independent variable. Here, the term variation is interpreted as the degree to which a variable differs from its mean value. Don't confuse variation with variance—they are related, but they are not the same.


$\text{variation in Y} = {\Large\sum_{i=1}^{n}} (Y_i - \bar{Y})^2$
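To make the distinction concrete, here is a minimal sketch in plain Python. The sample values in `y` are hypothetical (they do not come from the reading). Variation is the raw sum of squared deviations from the mean, while the sample variance divides that sum by $n - 1$.

```python
# Hypothetical sample of Y values (illustrative only, not from the text)
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(y)
y_bar = sum(y) / n

# Variation: sum of squared deviations from the mean
variation = sum((y_i - y_bar) ** 2 for y_i in y)

# Sample variance: the same sum divided by n - 1
sample_variance = variation / (n - 1)

print("Variation in Y       =", variation)
print("Sample variance of Y =", sample_variance)
```

The two numbers differ only by the factor $n - 1$, which is why the terms are related but not interchangeable.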

* The **dependent variable** is the variable whose variation is explained by the independent variable. We are interested in answering the question, "What explains fluctuations in the dependent variable?" The dependent variable is also referred to as the explained variable, endogenous variable, or predicted variable.
</br>

* The **independent variable** is the variable used to explain the variation of the dependent variable. The independent variable is also referred to as the explanatory variable, exogenous variable, or predicting variable.
</br>

**Example: Dependent vs. independent variables**

Suppose you want to predict stock returns with GDP growth. Which variable is the independent variable?

&emsp;**Answer:**

Because GDP is going to be used as a predictor of stock returns, stock returns are being *explained* by GDP. Hence, stock returns are the dependent (explained) variable, and GDP is the independent (explanatory) variable.


#### Simple Linear Regression Model

The following linear regression model is used to describe the relationship between two variables, $X$ and $Y$:
<br></br>
$\Large{Y_i = b_0 + b_1X_i + \epsilon_i, \quad i = 1, \ldots, n}$

&emsp; 

<U>where:</U>

$Y_i$ = $i$th observation of the dependent variable, $Y$ 

$X_i$ = $i$th observation of the independent variable, $X$
 
$b_0$ = regression intercept term
 
$b_1$ = regression slope coefficient
 
$\epsilon_i$ = **residual** for the $i$th observation (also referred to as the disturbance term or error term)

Based on this regression model, the regression process estimates an equation for a line through a scatter plot of the data that "best" explains the observed values for $Y$ in terms of the observed values for $X$.


#### Estimated Regression Line

The linear equation, often called the line of best fit or regression line, takes the following form:
<br></br>

$\Large\hat{Y}_{i} = \hat{b}_{0} + \hat{b}_{1}X_i, \quad i = 1, 2, 3, \ldots, n$

&emsp; 

<U>where:</U>

$\hat{Y}_{i}$ = estimated value of $Y_i$ given $X_i$

$\hat{b}_{0}$ = estimated intercept term.

$\hat{b}_{1}$ = estimated slope coefficient.

 
<br>
The hat "^" above a variable or parameter indicates a predicted value.
</br>


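The regression line is the line that minimizes the sum of the squared vertical distances between the actual $Y$-values and the $Y$-values predicted by the regression equation. This sum is called the sum of squared errors (**SSE**). This explains why simple linear regression is frequently referred to as ordinary least squares **(OLS) regression**, and the values determined by the estimated regression equation, $\hat{Y}_i$, are called least squares estimates.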
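The least squares criterion can be illustrated with a brute-force search. The sketch below uses hypothetical data and a hypothetical helper `sse()`. It tries a coarse grid of candidate intercepts and slopes and keeps the pair with the smallest SSE. This only illustrates the criterion; it is not how regression software estimates the coefficients.

```python
# Hypothetical paired observations (illustrative only, not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def sse(b0, b1):
    """Sum of squared errors for the candidate line y_hat = b0 + b1 * x."""
    return sum((y_i - (b0 + b1 * x_i)) ** 2 for x_i, y_i in zip(x, y))


# Candidate intercepts 0.0 to 5.0 and slopes -2.0 to 2.0, in steps of 0.1
candidates = [(i / 10, j / 10) for i in range(0, 51) for j in range(-20, 21)]

# Keep the candidate line with the smallest SSE
best_b0, best_b1 = min(candidates, key=lambda c: sse(*c))

print("Best intercept on grid =", best_b0)
print("Best slope on grid     =", best_b1)
print("SSE at that line       =", round(sse(best_b0, best_b1), 4))
```

Any other line on the grid produces a larger SSE for this data, which is what "best fit" means under the least squares criterion.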

<br>
The estimated slope coefficient $\hat{b}_{1}$ for the regression line describes the change in $Y$ for a one-unit change in $X$. It can be positive, negative, or zero, depending on the relationship between the regression variables. The slope term is calculated as follows:
</br>
&emsp; 

$\Large\hat{b}_{1} = \frac{\text{Cov}_{XY}}{\sigma^2_X}$

The intercept term $\hat{b}_{0}$ is the line's intersection with the $Y$-axis at $X = 0$. It can be positive, negative, or zero. A property of the least squares method is that the intercept term may be expressed as follows:

&emsp; 
$\large\hat{b}_{0}=\bar{Y}-\hat{b}_{1}\bar{X}$

where:

$\bar{Y}$ = mean of $Y$

$\bar{X}$ = mean of $X$

The intercept equation highlights the fact that the regression line passes through a point with coordinates equal to the mean of the independent and dependent variables (i.e., the point $(\bar{X}, \bar{Y})$).
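The slope and intercept formulas can be checked on a small data set. The following is a minimal sketch with hypothetical observations (not from the reading). It computes the slope as covariance over variance, derives the intercept, and confirms that the fitted line passes through the point of means.

```python
# Hypothetical paired observations (illustrative only, not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance of X and Y, and sample variance of X
cov_xy = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y)) / (n - 1)
var_x = sum((x_i - x_bar) ** 2 for x_i in x) / (n - 1)

# Slope = Cov(X, Y) / Var(X); intercept = mean(Y) - slope * mean(X)
b1_hat = cov_xy / var_x
b0_hat = y_bar - b1_hat * x_bar

# Evaluating the fitted line at mean(X) should return mean(Y)
y_hat_at_x_bar = b0_hat + b1_hat * x_bar

print("Slope b^1        =", round(b1_hat, 4))
print("Intercept b^0    =", round(b0_hat, 4))
print("Y-hat at mean X  =", round(y_hat_at_x_bar, 4))
print("Mean of Y        =", round(y_bar, 4))
```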

<hr>

**Example: Computing the slope coefficient and intercept term**

Compute the slope coefficient and intercept term using the following information:

<table>
<thead>
  <tr>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Cov(S&amp;P 500, ABC)</td>
    <td>0.000336</td>
    <td>Mean return, S&amp;P 500</td>
    <td>−2.70%</td>
  </tr>
  <tr>
    <td>Var(S&amp;P 500)</td>
    <td>0.000522</td>
    <td>Mean return, ABC</td>
    <td>−4.05%</td>
  </tr>
</tbody>
</table>

**Answer:**

The slope coefficient is calculated as $\hat{b}_{1} = \frac{0.000336}{0.000522} = 0.64$.

The intercept term is calculated as follows:

$\hat{b}_{0}=\overline{ABC}−\hat{b}_{1}\overline{SP500}=−4.05\% −0.64 (−2.70\%) = −2.3\%$
<br>
</br>
The estimated regression line that minimizes the SSE in our ABC stock return example is shown in **Estimated Regression Equation for ABC vs. S&P 500 Excess Returns**.

<br>
This regression line has an intercept of $-2.3\%$ and a slope of $0.64$. The model predicts that if the S&P 500 excess return is $-7.8\%$ (May 20X4 value), then the ABC excess return would be $-2.3\% + (0.64)(-7.8\%) = -7.3\%$. The residual (error) for the May 20X4 ABC prediction is $8.4\%$, the difference between the actual ABC excess return of $1.1\%$ and the predicted return of $-7.3\%$.
</br>


In [6]:
# Inputs from the table above (mean returns are in percent)
covAB = 0.000336      # Cov(S&P 500, ABC)
SP500_Var = 0.000522  # Var(S&P 500), the variance of the independent variable
SP500_Mu = -2.70
ABC_Mu = -4.05

# Estimated slope and intercept
bhat_1 = covAB / SP500_Var
bhat_0 = ABC_Mu - (bhat_1 * SP500_Mu)

# Predicted ABC excess return when the S&P 500 excess return is -7.8%
predABC = bhat_0 + bhat_1 * -7.8

# Residual = actual ABC excess return minus predicted ABC excess return
actABC = 1.1
residual = actABC - predABC

In [7]:
print("Slope coefficient b^1            =", round(bhat_1, 2))
print("Intercept b^0 (%)                =", round(bhat_0, 1))
print("Predicted ABC excess return (%)  =", round(predABC, 1))
print("Actual ABC excess return (%)     =", round(actABC, 1))
print("Residual for ABC prediction (%)  =", round(residual, 1))

Slope coefficient b^1            = 0.64
Intercept b^0 (%)                = -2.3
Predicted ABC excess return (%)  = -7.3
Actual ABC excess return (%)     = 1.1
Residual for ABC prediction (%)  = 8.4


<hr>


<img src="https://github.com/PachaTech/CFA-Level-1/blob/main/10_1%20graph.jpeg?raw=true">

<br>



A simple linear regression explains the variation in a dependent variable $Y$ in terms of the variation in a single independent variable $X$.

#### Assumptions of a Simple Linear Regression

1. A linear relationship exists between the dependent variable $Y$ and the independent variable $X$.
2. Error terms are normally distributed, with an expected value of zero ($\mu = 0$).
3. The variance $\sigma^2$ of the error terms is constant *(homoskedastic)*.
4. Error terms are independently distributed and uncorrelated with each other *(no serial correlation, also called autocorrelation)*.
5. The independent variable is uncorrelated with the error terms.

<hr></hr>

#### Non-constant error variance can make standard errors and hypothesis tests unreliable
* HOMOSKEDASTICITY - refers to the case where all prediction errors have the same constant variance.  $\sigma^2 = c$
<br>

* HETEROSKEDASTICITY - refers to the case where the variance of the error terms $\epsilon$ is not constant.    $\sigma^2 \neq c$
<br>

* Conditional HETEROSKEDASTICITY - refers to the case where the variance of the error terms is related to the value of the independent variable.  
*e.g., the variance of the error terms grows as the independent variable gets larger, or shrinks as the independent variable gets smaller.*
<br>
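One informal way to spot conditional heteroskedasticity is to compare the size of the residuals at low and high values of the independent variable. The sketch below uses made-up `x` and `residuals` values and a hypothetical helper `mean_square()`. It is an eyeball check under those assumptions, not a formal statistical test.

```python
# Hypothetical residuals ordered by the independent variable
# (illustrative values only, not from the text and not from a fitted model)
x = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 0.8, -0.9, 1.5, -1.4]


def mean_square(values):
    """Average squared value, used here as a rough measure of residual spread."""
    return sum(v ** 2 for v in values) / len(values)


# Split the sample into the low-X half and the high-X half
half = len(x) // 2
spread_low_x = mean_square(residuals[:half])
spread_high_x = mean_square(residuals[half:])

# A ratio far from 1 suggests the error variance changes with X
ratio = spread_high_x / spread_low_x

print("Residual spread, low X  =", round(spread_low_x, 4))
print("Residual spread, high X =", round(spread_high_x, 4))
print("Ratio (high / low)      =", round(ratio, 1))
```

In this made-up sample the residuals fan out as `x` grows, which is the pattern described as conditional heteroskedasticity.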

#### NOTES
* The model **does not** assume that the dependent variable is uncorrelated with the residuals. 
* The model **does assume** that the independent variable is uncorrelated with the residuals.
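These two notes can be checked numerically on a fitted line. The sketch below uses hypothetical data (not from the reading) and a hypothetical helper `sample_cov()`. In an OLS fit with an intercept, the residuals have zero sample covariance with $X$ by construction, but they remain positively related to $Y$.

```python
# Hypothetical paired observations (illustrative only, not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n


def sample_cov(a, b):
    """Sample covariance of two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((p - a_bar) * (q - b_bar) for p, q in zip(a, b)) / (len(a) - 1)


# OLS slope and intercept
b1_hat = sample_cov(x, y) / sample_cov(x, x)
b0_hat = y_bar - b1_hat * x_bar

# Residuals: actual Y minus predicted Y
residuals = [y_i - (b0_hat + b1_hat * x_i) for x_i, y_i in zip(x, y)]

cov_x_resid = sample_cov(x, residuals)
cov_y_resid = sample_cov(y, residuals)

print("Cov(X, residuals) =", round(cov_x_resid, 6))
print("Cov(Y, residuals) =", round(cov_y_resid, 6))
```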


# Module 10.2: Analysis of Variance (ANOVA) and Goodness of Fit

<hr>

### LOS 10.c: Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression.


### LOS 10.d: Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error of estimate in a simple linear regression.

<hr>

**Analysis of variance (ANOVA)** is a statistical procedure for analyzing the total variability of the dependent variable. Let's define some terms before we move on to ANOVA tables:

* The **total sum of squares (SST)** measures the total variation in the dependent variable. SST is equal to the sum of the squared differences between the actual $Y$-values and the mean of $Y$:

$\qquad\large\text{SST}=\displaystyle\sum_{i=1}^n(Y_i−\overline{Y})^2$

* The **sum of squares regression (SSR)** measures the variation in the dependent variable that is explained by the independent variable. SSR is the sum of the squared distances between the predicted $Y$-values and the mean of $Y$:

$\qquad\large\text{SSR}=\displaystyle\sum_{i=1}^n(\hat{Y}_i-\overline{Y})^2$

* The **mean square regression (MSR)** is the SSR divided by the number of independent variables. A simple linear regression has only one independent variable, so in this case, MSR = SSR.

#### Professor's Note

Multiple regression (i.e., with more than one independent variable) is addressed in the Level II CFA curriculum.

* The **sum of squared errors (SSE)** measures the unexplained variation in the dependent variable. It's also known as the sum of squared residuals or the residual sum of squares. SSE is the sum of the squared vertical distances between the actual $Y$-values and the predicted $Y$-values on the regression line:
<br>

$\qquad\large\text{SSE}=\displaystyle\sum_{i=1}^n(Y_i-\hat{Y}_i)^2$

</br>

* The **mean squared error (MSE)** is the SSE divided by the degrees of freedom, which is $n - 1$ minus the number of independent variables. A simple linear regression has only one independent variable, so in this case, degrees of freedom are $n - 2$.

<br>

You probably will not be surprised to learn the following:
<br>

$\qquad\text{total variation = explained variation + unexplained variation}$

<br>
or:

<br>
$\qquad\text{SST = SSR + SSE}$

<br>
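The decomposition can be verified on a small example. This is a minimal sketch with hypothetical data (not from the reading): it fits the least squares line, computes SST, SSR, and SSE from their definitions, and checks that they add up.

```python
# Hypothetical paired observations (illustrative only, not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept
b1_hat = (sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y))
          / sum((x_i - x_bar) ** 2 for x_i in x))
b0_hat = y_bar - b1_hat * x_bar

# Predicted Y-values on the regression line
y_hat = [b0_hat + b1_hat * x_i for x_i in x]

# Total, explained, and unexplained variation
sst = sum((y_i - y_bar) ** 2 for y_i in y)
ssr = sum((yh_i - y_bar) ** 2 for yh_i in y_hat)
sse = sum((y_i - yh_i) ** 2 for y_i, yh_i in zip(y, y_hat))

# Mean squared error uses n - 2 degrees of freedom in simple regression
mse = sse / (n - 2)

print("SST       =", round(sst, 4))
print("SSR       =", round(ssr, 4))
print("SSE       =", round(sse, 4))
print("SSR + SSE =", round(ssr + sse, 4))
print("MSE       =", round(mse, 4))
```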

**Components of Total Variation** illustrates how the total variation in the dependent variable (SST) is composed of SSR and SSE.


<img src="https://github.com/PachaTech/CFA-Level-1/blob/main/10_2%20chart.jpeg?raw=true">

<br>

<hr>

The output of the ANOVA procedure is an ANOVA table, which is a summary of the variation in the dependent variable. ANOVA tables are included in the regression output of many statistical software packages. You can think of the ANOVA table as the source of the data for the computation of many of the regression concepts discussed in this reading. A generic ANOVA table for a simple linear regression (one independent variable) is presented in  **ANOVA Table for a Simple Linear Regression**.

<img src="https://github.com/PachaTech/CFA-Level-1/blob/main/10_2.jpeg?raw=true">
<br>

#### Standard Error of Estimate (SEE)

The SEE for a regression is the standard deviation of its residuals. The lower the SEE, the better the model fit:

$\qquad\text{SEE} = \sqrt{MSE}$

#### Coefficient of Determination ($R^2$)

The coefficient of determination ($R^2$) is defined as the percentage of the total variation in the dependent variable explained by the independent variable. For example, an $R^2$ of 0.63 indicates that the variation of the independent variable explains 63% of the variation in the dependent variable:

$\qquad\large{R^2 = \frac{\text{SSR}}{\text{SST}}}$

#### Professor's Note

For simple linear regression (i.e., with one independent variable), the coefficient of determination, $R^2$, may be computed by simply squaring the correlation coefficient, $r$. In other words, $R^2 = r^2$ for a regression with one independent variable.
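The relationship $R^2 = r^2$ can be confirmed numerically. The sketch below uses hypothetical data (not from the reading). It computes the correlation coefficient directly, then computes $R^2$ as SSR divided by SST from a fitted line, and compares the two.

```python
import math

# Hypothetical paired observations (illustrative only, not from the text)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and cross-products
s_xy = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y))
s_xx = sum((x_i - x_bar) ** 2 for x_i in x)
s_yy = sum((y_i - y_bar) ** 2 for y_i in y)  # this is also SST

# Correlation coefficient between X and Y
r = s_xy / math.sqrt(s_xx * s_yy)

# R^2 from the regression: SSR / SST
b1_hat = s_xy / s_xx
b0_hat = y_bar - b1_hat * x_bar
ssr = sum((b0_hat + b1_hat * x_i - y_bar) ** 2 for x_i in x)
r_squared = ssr / s_yy

print("r         =", round(r, 4))
print("r squared =", round(r ** 2, 4))
print("R^2       =", round(r_squared, 4))
```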

**Example:** Using the ANOVA table

Given the following ANOVA table based on 36 observations, calculate the $R^2$ and the standard error of estimate (SEE).

<hr>


**Completed ANOVA table for ABC regression**

<table>
<thead>
  <tr>
    <th>Source of Variation</th>
    <th>Degrees of Freedom</th>
    <th>Sum of Squares</th>
    <th>Mean Sum of Squares</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Regression (explained)</td>
    <td>1</td>
    <td>0.0076</td>
    <td>0.0076</td>
  </tr>
  <tr>
    <td>Error (unexplained)</td>
    <td>34</td>
    <td>0.0406</td>
    <td>0.0012</td>
  </tr>
  <tr>
    <td>Total</td>
    <td>35</td>
    <td>0.0482</td>
    <td></td>
  </tr>
</tbody>
</table>

<hr>

**Answer**

$\quad\large R^2=\frac{\text{explained variation (SSR)}}{\text{total variation (SST)}} = \frac{0.0076}{0.0482}=0.158$ or 15.8% 

<br>

$\qquad\text{SEE}=\sqrt{MSE}=\sqrt{0.0012}=0.035$






In [26]:
## Input the variables from the ANOVA table
SSR = 0.0076   # explained variation
SSE = 0.0406   # unexplained variation
SST = 0.0482   # total variation
n = 36         # number of observations

## Solve for coefficient of determination (R^2)
R2 = SSR / SST

## Calculate SEE from MSE = SSE / (n - 2)
MSE = SSE / (n - 2)
SEE = math.sqrt(MSE)

In [35]:
## Print variables
print("Explained Variation (SSR)          =", round(SSR, 4))
print("Total Variation     (SST)          =", round(SST, 4))
print("Coefficient of determination (R^2) =", round(R2, 3))
print("Coefficient of determination (R^2) =", f"{R2:.1%}")
print("Standard Error of Estimate (SEE)   =", round(SEE, 3))

Explained Variation (SSR)          = 0.0076
Total Variation     (SST)          = 0.0482
Coefficient of determination (R^2) = 0.158
Coefficient of determination (R^2) = 15.8%
Standard Error of Estimate (SEE)   = 0.035


#### The F-Statistic

An *F*-test assesses how well a set of independent variables, as a group, explains the variation in the dependent variable.

The *F*-statistic is calculated as follows:

<br>
$\quad\large\text{F}=\frac{\text{MSR}}{\text{MSE}}=\frac{\text{SSR}\div k}{\text{SSE}\div(n-k-1)}$

where:

$\text{MSR} =$ mean regression sum of squares

$\text{MSE} =$ mean squared error

$k =$ number of independent variables

**Important**: This is always a one-tailed test.

<br>
For simple linear regression, there is only one independent variable, so the $F$-test is equivalent to a $t$-test of the statistical significance of the slope coefficient:

$\qquad H_0: b_1 = 0$   versus   $H_a: b_1 \neq 0$

To determine whether $b_1$ is statistically significant using the $F$-test, the calculated $F$-statistic is compared with the critical $F$-value, $F_c$, at the appropriate level of significance. The degrees of freedom for the numerator and denominator with one independent variable are as follows:

$\qquad df_{\text{numerator}} = k = 1$

$\qquad df_{\text{denominator}} = n − k − 1 = n − 2$

where:

$n =$ number of observations

The decision rule for the $F$-test is to reject $H_0$ if $F > F_c$.

<br>

Rejecting the null hypothesis that the value of the slope coefficient equals zero at a stated level of significance indicates that the independent variable and the dependent variable have a significant linear relationship.
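As a worked illustration, the $F$-test can be applied to the ABC ANOVA table from the earlier example. The sums of squares and the number of observations come from that table. The critical value `F_CRITICAL_5PCT` is an assumption: an approximate 5% value for 1 and 34 degrees of freedom read from a standard $F$-table, so check it against your own table before relying on it.

```python
import math

# Values from the completed ANOVA table for the ABC regression
SSR = 0.0076   # explained variation
SSE = 0.0406   # unexplained variation
n = 36         # number of observations
k = 1          # number of independent variables

# Mean squares and the F-statistic
MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE

# Assumption: approximate 5% critical value for F with (1, 34) degrees of freedom
F_CRITICAL_5PCT = 4.13

# Decision rule: reject H0 (b1 = 0) if F exceeds the critical value
reject_null = F > F_CRITICAL_5PCT

# With one independent variable, F equals the square of the slope's t-statistic
t_equivalent = math.sqrt(F)

print("F-statistic      =", round(F, 2))
print("Reject H0 at 5%? =", reject_null)
print("Equivalent |t|   =", round(t_equivalent, 2))
```

Under that assumed critical value, the slope coefficient in the ABC regression would be judged statistically significant at the 5% level.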
