# Linear Regression Model

## Terms
  * Model: an equation relating one variable you are trying to estimate (the response) to other variables used to estimate it (the predictors)


## Models

Model: $y = f(x_1,x_2,x_3,...)$

Linear Regression Model: $y = b_0 + b_1x_1 + b_2x_2 +...$

Simple Linear Regression Model: $y= b_0 + b_1x_1$

### Assumptions of Linear Regression Model
  * Linearity: the mean of y is a linear function of the predictors
  * Independence: responses at different values of X are independent of one another
  * Normality: the random noise, and therefore y at any fixed value of the predictors, follows a normal distribution
  * Equal variance: the variance of y is the same at every value of the predictors
 
The assumptions above must be validated if you want to make inferences using linear regression models. If you only want to make predictions, you can, for the most part, be less strict about them.
  

# Applying Linear Regression Model to Data

Population model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma^2)$
  * $\beta_0$: the intercept
  * $\beta_1$: the slope

Mean Equation: $\mu_{y|x_i} = \beta_0 + \beta_1 x_i$
  * use sample pairs ($x_i,y_i$) to estimate the population parameters: $\beta_0, \beta_1, \sigma$
  * the sample pairs on a scatter plot will not line up in a straight line, because many factors besides the one predictor chosen for the simple linear regression model affect the response
  * all of those other factors are absorbed into the "noise term" $\epsilon_i$
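
To make the population model concrete, here is a minimal sketch that simulates sample pairs from it using only the standard library. The parameter values (`beta_0`, `beta_1`, `sigma`) and the sample size are hypothetical, chosen purely for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters (not from the notes)
beta_0, beta_1, sigma = 2.0, 0.5, 1.0

xs = [float(i) for i in range(1, 21)]

# Population model: y_i = beta_0 + beta_1 * x_i + eps_i, with eps_i ~ N(0, sigma^2)
ys = [beta_0 + beta_1 * x + random.gauss(0.0, sigma) for x in xs]

# Mean equation: the noise-free value that each y_i scatters around
mean_ys = [beta_0 + beta_1 * x for x in xs]

# The noise terms are the gaps between the observations and the mean line
noise = [y - m for y, m in zip(ys, mean_ys)]
print("largest absolute noise term:", max(abs(e) for e in noise))
```

Plotting `xs` against `ys` would show points scattered around the straight line given by `mean_ys` rather than lying on it.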

## Ordinary Least Square Estimation: Estimating Population Parameters

Prediction Equation: $\hat{y}_i = b_0 + b_1x_i$
  * $b_0$: estimated value for $\beta_0$
  * $b_1$: estimated value for $\beta_1$
  * $e_i = y_i - \hat{y}_i$: the residual, the vertical distance between the data point $(x_i,y_i)$ and the prediction line for the given estimated values; it is the sample counterpart of the noise term $\epsilon_i$
  * minimization problem: find the $b_0$ and $b_1$ that minimize the sum of squared residuals: $\sum_{i=1}^{N} e_i^2$
  * solving the minimization problem gives us the prediction equation that is **the best fit line**
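
The minimization can be illustrated by brute force. The sketch below evaluates the sum of squared residuals (errors) on a coarse grid of candidate $(b_0, b_1)$ values and keeps the smallest. The toy data, the grid range, and the helper name `sum_squared_residuals` are assumptions made for illustration; a grid search is not how OLS is solved in practice.

```python
# Hypothetical toy data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sum_squared_residuals(b0, b1):
    """Sum of squared vertical distances between the data and the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Try every (b0, b1) on a grid with step 0.1, for b0 in [0, 5] and b1 in [0, 2]
candidates = [(i / 10, j / 10) for i in range(0, 51) for j in range(0, 21)]
best_b0, best_b1 = min(candidates, key=lambda c: sum_squared_residuals(*c))

print(best_b0, best_b1, round(sum_squared_residuals(best_b0, best_b1), 3))
```

For this toy data the minimum happens to fall on a grid point, at $b_0 = 2.2$ and $b_1 = 0.6$, with a sum of squared residuals of 2.4. Any other candidate line on the grid gives a larger value.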
  
  ### Explicit formulas for estimated values
  $ b_1 = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^N (x_i - \bar{x})^2}$
  
  $b_0 = \bar{y} - b_1\bar{x} $
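
As a minimal sketch of the explicit formulas above, the snippet below computes $b_1$ and $b_0$ in plain Python. The five data points and the helper name `ols_fit` are made up for illustration and do not come from these notes.

```python
def ols_fit(xs, ys):
    """Return (b0, b1) for simple linear regression using the closed-form OLS formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator of b1: sum of cross-deviations; denominator: sum of squared x-deviations
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = ols_fit(xs, ys)
print(round(b0, 3), round(b1, 3))
```

Working the same data by hand: $\bar{x} = 3$, $\bar{y} = 4$, the numerator is 6 and the denominator is 10, so $b_1 = 0.6$ and $b_0 = 4 - 0.6 \cdot 3 = 2.2$.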
  
# Applying Two-Tail Testing

  * Goal: to test whether the slope $\beta_1$ is significantly different from zero, i.e. whether the predictor has an effect on the response

$H_0: \beta_1 = 0$ meaning the predictor has no linear effect on the response

$H_a: \beta_1 \neq 0$

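$ \hat{t} = \frac{b_1 - \beta_1}{S_{b_1}}$

  * $S_{b_1}$: the standard error of $b_1$; under $H_0$, $\beta_1 = 0$, so the statistic reduces to $b_1 / S_{b_1}$
  * this statistic follows a t-distribution with $n - 2$ degrees of freedom (two degrees of freedom are used up by estimating $\beta_0$ and $\beta_1$)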
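
The test can be sketched numerically as follows. The notes do not give a formula for $S_{b_1}$, so this sketch assumes the standard textbook one, $S_{b_1} = \sqrt{\frac{SSE/(n-2)}{\sum_i (x_i - \bar{x})^2}}$. The toy data are hypothetical, and the critical value is hard-coded from a standard t-table rather than computed.

```python
import math

# Hypothetical toy data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# OLS estimates from the closed-form formulas
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
b0 = y_bar - b1 * x_bar

# Residual variance estimate uses n - 2 degrees of freedom
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)
df = n - 2

# Assumed standard formula for the standard error of the slope
s_b1 = math.sqrt((sse / df) / s_xx)

# Under H0, beta_1 = 0
t_stat = (b1 - 0.0) / s_b1

# Two-tailed 5% critical value for 3 degrees of freedom, taken from a t-table
T_CRIT_DF3 = 3.182
reject_h0 = abs(t_stat) > T_CRIT_DF3

print(round(t_stat, 3), df, reject_h0)
```

Here $|\hat{t}| \approx 2.121$ is below the critical value 3.182, so with only five points $H_0$ is not rejected at the 5% level. In practice a fitted `smf.ols(...)` model from statsmodels reports the t-values and p-values directly, which avoids the table lookup.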
  
# R-squared

  * an important measure of how well the model fits the data

## Step 1: Compute Variation of y without model (Total Variation)

$ SST = \sum_{i=1}^N (y_i - \bar{y})^2 $

  * $y_i$: observed value of y
  * $\bar{y}$ : mean of y

## Step 2: Compute Variation Explained

$ SSR = \sum_{i=1}^N (\hat{y}_i - \bar{y})^2 $

  * $\hat{y}_i$ : computed by the prediction equation, the point on the best fit line at $x_i$
  * measures how much of the variation can be predicted by the model
  
## Step 3: Compute Variation Unexplained

$ SSE = \sum_{i=1}^N (\hat{y}_i - y_i)^2 $

  * reflects the variation of the response that can't be explained by the model
  
## Step 4: Put it all together

Total Variation = Variation Explained + Variation Unexplained, i.e. $SST = SSR + SSE$

$ R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}$

  * $R^2$ : the proportion of the variation in y that can be explained by the model
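
The four steps can be sketched end to end in plain Python. The toy data are hypothetical, and the coefficients are obtained with the closed-form OLS formulas so that the block stands alone.

```python
# Hypothetical toy data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Fit the best fit line with the closed-form OLS formulas
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar
y_hats = [b0 + b1 * x for x in xs]

# Step 1: total variation of y around its mean
sst = sum((y - y_bar) ** 2 for y in ys)
# Step 2: variation explained by the model
ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
# Step 3: variation left unexplained (sum of squared residuals)
sse = sum((y_hat - y) ** 2 for y_hat, y in zip(y_hats, ys))
# Step 4: proportion of variation explained
r_squared = 1 - sse / sst

print(round(sst, 3), round(ssr, 3), round(sse, 3), round(r_squared, 3))
```

For this data $SST = 6$, $SSR = 3.6$ and $SSE = 2.4$, so $R^2 = 0.6$. The decomposition $SST = SSR + SSE$ holds because the line was fitted by least squares with an intercept; it is not guaranteed for an arbitrary line.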

In [None]:
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt