# 01 | Hypothesis Testing with Linear Regression

Session recording:  https://youtu.be/aAqrkD49nwc

## Terminology

EXAMPLE: Unemployment rate has a significant impact on the crime rate.

**Null hypothesis:** There is no relationship between the unemployment rate and the crime rate.

**Alternative hypothesis:** A higher unemployment rate leads to a higher crime rate.

Goal: to (not) reject the null hypothesis

*We never say that we accept a hypothesis!*


1. We want to generalize the relationship between unemployment and crime --> we fit a straight line through the data.
2. We want to test whether the relationship is statistically significant.


$H_0: β₁ = 0$

We reject the null hypothesis if our estimated β₁ is “far enough from zero”.

## OLS Model: Calculating beta coefficient

Beta is the average amount by which the dependent variable changes when the independent variable increases by one unit and the other independent variables are held constant.

### Linear Regression - Fitted Line

$Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε$

- β₀ = intercept (intersection with the y-axis)
- β₁ = slope of the fitted line
- ε = error term

How do we know which line should be fitted? -->

**Minimization of Sum of Squared Residuals**

To calculate the residual sum of squares (RSS), first compute each residual by subtracting the fitted value from the observed value. Then square the residuals and add them up. OLS picks the coefficients that make this sum as small as possible.
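
A minimal sketch of this idea for a single regressor, assuming made-up numbers for unemployment and crime (the arrays and the helper `rss` are hypothetical, not data from the session). It compares the RSS of the OLS line with the RSS of a slightly steeper line:

```python
import numpy as np

# Hypothetical toy data: unemployment rate (%) and crime rate (per 1,000 people)
unemployment = np.array([3.0, 4.5, 5.0, 6.5, 8.0, 9.5])
crime = np.array([20.0, 24.0, 27.0, 30.0, 36.0, 41.0])


def rss(beta0, beta1, x, y):
    """Residual sum of squares for the line y = beta0 + beta1 * x."""
    residuals = y - (beta0 + beta1 * x)  # observed minus fitted
    return float(np.sum(residuals ** 2))


# Closed-form OLS estimates for one regressor: slope = cov(x, y) / var(x)
beta1_hat = np.cov(unemployment, crime, ddof=1)[0, 1] / np.var(unemployment, ddof=1)
beta0_hat = crime.mean() - beta1_hat * unemployment.mean()

rss_ols = rss(beta0_hat, beta1_hat, unemployment, crime)
rss_other = rss(beta0_hat, beta1_hat + 0.5, unemployment, crime)  # a steeper line

print(f"OLS line: crime = {beta0_hat:.2f} + {beta1_hat:.2f} * unemployment")
print(f"RSS of OLS line: {rss_ols:.3f}, RSS of steeper line: {rss_other:.3f}")
```

Any line other than the OLS line gives a larger RSS on the same data.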

**Fitted Model**

Once you have estimated the coefficients, you can calculate the predicted or fitted values (Ŷ) using the estimated coefficients and the observed values of the independent variables. The formula for calculating the fitted values is:

$Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + ... + β̂ₖXₖ$

Here, β̂₀, β̂₁, β̂₂, ..., β̂ₖ represent the estimated coefficients, and X₁, X₂, ..., Xₖ represent the observed values of the independent variables.
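
As an illustration, the fitted values can be computed as a matrix product. The coefficients and observations below are invented for the example; only the formula itself comes from the notes:

```python
import numpy as np

# Hypothetical estimated coefficients: intercept first, then one slope per regressor
beta_hat = np.array([2.0, 0.5, -1.0])  # β̂₀, β̂₁, β̂₂

# Hypothetical observations: one row per observation, columns X₁ and X₂
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [4.0, 1.0]])
y = np.array([0.0, 4.0, 3.5])  # hypothetical observed values of Y

# Prepend a column of ones so that the intercept is multiplied by 1
X_design = np.column_stack([np.ones(len(X)), X])

y_fitted = X_design @ beta_hat  # Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂
residuals = y - y_fitted        # observed minus fitted

print("fitted:", y_fitted)
print("residuals:", residuals)
```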

**Interpretation**

TBD


**Standard Errors and Confidence Intervals of Beta Estimate**

Standard errors: measure how much a beta estimate would vary from sample to sample. The smaller the standard error, the more precisely beta is estimated → base for the confidence interval.

Confidence intervals provide a range of plausible values within which the true population coefficient is likely to lie with a certain level of confidence. The confidence level is typically chosen beforehand, such as 95% or 99%.
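
A sketch of how standard errors and a 95% confidence interval can be computed by hand, assuming simulated data with true values β₀ = 1 and β₁ = 2 and the classical (homoskedastic) formula σ̂²(XᵀX)⁻¹. The normal quantile is used as a large-sample approximation of the t quantile; statsmodels uses the t distribution, so its intervals are slightly wider in small samples.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)

# Simulated data under assumed true values β₀ = 1 and β₁ = 2
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])  # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
dof = n - X.shape[1]                      # observations minus estimated coefficients
sigma2_hat = residuals @ residuals / dof  # estimated error variance

# Classical covariance matrix of the estimates: σ̂² (XᵀX)⁻¹
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))

# 95% interval: estimate ± quantile × standard error (quantile ≈ 1.96)
z = NormalDist().inv_cdf(0.975)
ci_lower = beta_hat - z * std_err
ci_upper = beta_hat + z * std_err

for name, b, se, lo, hi in zip(["β₀", "β₁"], beta_hat, std_err, ci_lower, ci_upper):
    print(f"{name}: estimate={b:.3f}, std err={se:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```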


**Hypotheses Testing with OLS: Is Beta “far enough from zero”?**

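Statistical significance: Is the p-value small enough? The p-value is the probability of observing an estimate at least this far from zero if the null hypothesis were true.
- if p-value < 0.10 → the variable is significant at the 10% level (90% confidence level)
- if p-value < 0.05 → the variable is significant at the 5% level (95% confidence level)
- if p-value < 0.01 → the variable is significant at the 1% level (99% confidence level)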
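
The test behind these thresholds can be sketched as follows. The data are simulated: `x_relevant` affects `y` by construction and `x_noise` does not (both names are hypothetical). The t-statistic measures how many standard errors the estimate lies from zero, and the two-sided p-value here uses a normal approximation, which is reasonable for large samples.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)

n = 1000
x_relevant = rng.normal(size=n)
x_noise = rng.normal(size=n)  # has no effect on y by construction
y = 0.5 + 0.3 * x_relevant + rng.normal(size=n)

X = np.column_stack([np.ones(n), x_relevant, x_noise])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])
std_err = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

# t-statistic for H0: β = 0 is the estimate divided by its standard error
t_stats = beta_hat / std_err

# Two-sided p-value under the normal approximation
std_normal = NormalDist()
p_values = np.array([2 * (1 - std_normal.cdf(abs(float(t)))) for t in t_stats])

alpha = 0.05  # chosen significance level
for name, t, p in zip(["intercept", "x_relevant", "x_noise"], t_stats, p_values):
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: t={t:.2f}, p-value={p:.4f} -> {decision}")
```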
 



## OLS Model: Validation

- F-test
    - H0: Model with no independent variables fits the data as well as your model  
    - we want to reject H0
- R-squared
    - Indicates the percentage of the variance in the dependent variable that the independent variables explain collectively
    - 0-100% scale (the higher the better)
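
Both quantities can be computed from two sums of squares. The sketch below assumes simulated data with two regressors. TSS is the RSS of a model with no independent variables (intercept only), which is exactly the model the F-test compares against.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: n observations, k independent variables
n, k = 200, 2
X_vars = rng.normal(size=(n, k))
y = 1.0 + X_vars @ np.array([1.5, -0.7]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), X_vars])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

rss = np.sum((y - X @ beta_hat) ** 2)  # variation left unexplained by our model
tss = np.sum((y - y.mean()) ** 2)      # RSS of the intercept-only model

# Share of the variance in y explained by the independent variables
r_squared = 1 - rss / tss

# F-statistic: explained variation per regressor relative to
# unexplained variation per residual degree of freedom
f_stat = ((tss - rss) / k) / (rss / (n - k - 1))

print(f"R-squared: {r_squared:.3f}")
print(f"F-statistic: {f_stat:.1f}")
```

A large F-statistic speaks against H0. The statsmodels summary reports its p-value as `Prob (F-statistic)`.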



## OLS Model: Multiple linear regressions

$Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε$

- Example: Impact of various marketing investments on product sales (youtube ads, facebook ads, newspaper ads...)
- Assumption: Investment in Facebook advertising (β₂) has a positive impact on sales. 
- Null hypothesis: Investment in Facebook advertising **has NO impact** on sales.
- Economic significance: is β₂ large enough for our business?
- Statistical significance: is p-value small enough?



In [1]:
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms

marketing = pd.read_csv('marketing.csv', index_col=0)

# Multiple linear model
model_full = smf.ols(formula='sales ~ facebook + newspaper + youtube', data=marketing).fit()

print(model_full.summary())

# The F-statistic is statistically significant (p-value < 0.05), so the model makes sense overall.
# R-squared is high (we want it as close to 1 as possible), so our variables explain sales well.

# YouTube and Facebook investments are statistically significant because their p-values are nearly zero.
# Newspaper investment is not significant.

# If FB investment increases by 1000 USD → sales increase by 189 USD on average, keeping other variables fixed. (=coef)

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     570.3
Date:                Tue, 18 Jul 2023   Prob (F-statistic):           1.58e-96
Time:                        17:03:49   Log-Likelihood:                -1804.2
No. Observations:                 200   AIC:                             3616.
Df Residuals:                     196   BIC:                             3630.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3526.6672    374.290      9.422      0.0

## OLS Assumptions

The first four assumptions are crucial for obtaining correct betas.
Violating the other assumptions does not invalidate the beta estimates. It invalidates statistical inference (standard errors, p-values, ...).


1. Linear relationship 
    - Relationship should be linear
    - Non-linear relationship may or may not jeopardize our conclusion
2. No multicollinearity
    - occurrence of high correlations among two or more independent variables in a multiple regression model
    - Why?
        - an isolated relationship between each independent variable and the dependent variable is needed
        - stronger multicollinearity == higher standard errors (explodes to infinity for correlation approaching 1)
    - How to test?
        - Correlation matrix: correlation above ~70% may be problematic
        - Variance Inflation Factor (VIF)
            - above 5: multicollinearity might be present
            - above 10: multicollinearity certainly present
    - Solution?
        - remove variable
            - remove one of the two highly correlated variables
            - hypotheses or theory should guide your decision
        - specialized methods
            - ridge regression, LASSO, elastic net, principal component analysis
            - better for large datasets with many variables
3. Random sample
    - individual observations are independent from each other
    - all individuals have the same probability of being sampled
    - examples of violations:
        - analyzing impact of education on income using one individual over her/his lifetime
        - analyzing the impact of education on income when high-income individuals are less willing to share information about their income
        - MSD project example: analyzing the productivity of farms in France only for farms that have good data about productivity
4. No omitted variable
    - required for interpreting beta as a causal impact
        - possible solutions:
            - include all relevant variables
            - panel models (covered next time)
            - randomized experiments
            - regression discontinuity design
    - correlation is not causation
5. Homoskedasticity
    - variance of residuals is the same across all values of the independent variables
    - solution == robust standard errors (White standard errors)
    - how to test?
        - Breusch-Pagan Test – tests simple form of heteroskedasticity
        - White Test - tests various forms of heteroskedasticity
6. Normality
    - Normality holds when:
        - residuals are normally distributed
        - or we have a large sample
    - Asymptotic normality
        - when the number of observations is high, the estimates of betas are approximately normal
        - follows from the central limit theorem (and a few other theorems)
        - in practice you almost always rely on asymptotic normality
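
For the multicollinearity check listed under assumption 2, the VIF of a variable can be sketched as 1 / (1 − R²), where R² comes from regressing that variable on all other independent variables. The helper `vif` and the simulated columns `x1`, `x2`, `x3` are hypothetical. statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.

```python
import numpy as np


def vif(X_vars):
    """VIF per column: 1 / (1 - R²) from regressing it on the other columns."""
    n, k = X_vars.shape
    factors = []
    for j in range(k):
        target = X_vars[:, j]
        others = np.column_stack([np.ones(n), np.delete(X_vars, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        rss = np.sum((target - others @ coef) ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        factors.append(tss / rss)  # algebraically equal to 1 / (1 - R²)
    return np.array(factors)


rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.2 * rng.normal(size=n)  # built to be highly correlated with x1
x3 = rng.normal(size=n)                    # unrelated to x1 and x2

X_vars = np.column_stack([x1, x2, x3])
vifs = vif(X_vars)

print("Correlation matrix:")
print(np.round(np.corrcoef(X_vars, rowvar=False), 2))
print("VIF:", np.round(vifs, 1))
```

In this simulated design the first two variables exceed the threshold of 10, while the third stays close to 1.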
    

When testing with a linear model (OLS), we are interested in:
- Model performance  
- Beta coefficients
- Statistical significance

We want p-value < 0.1, ideally even p-value < 0.05





In [2]:
# OLS model with heteroskedasticity robust standard errors
print(model_full.get_robustcov_results(cov_type = "HC3").summary())

                            OLS Regression Results                            
Dep. Variable:                  sales   R-squared:                       0.897
Model:                            OLS   Adj. R-squared:                  0.896
Method:                 Least Squares   F-statistic:                     577.0
Date:                Tue, 18 Jul 2023   Prob (F-statistic):           5.65e-97
Time:                        17:04:18   Log-Likelihood:                -1804.2
No. Observations:                 200   AIC:                             3616.
Df Residuals:                     196   BIC:                             3630.
Df Model:                           3                                         
Covariance Type:                  HC3                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3526.6672    410.411      8.593      0.0
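
What the HC3 option changes can be sketched by hand. The example below assumes simulated data in which the error spread grows with `x`, so homoskedasticity is violated by construction. It uses the usual HC3 sandwich formula, in which each squared residual is divided by (1 − leverage)². This is an illustration, not a replacement for the statsmodels call above.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 400
x = rng.uniform(1, 10, size=n)
# Error spread grows with x, so the errors are heteroskedastic by construction
y = 2.0 + 0.5 * x + rng.normal(scale=0.05 * x ** 2, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y  # the beta estimates do not depend on the SE type
resid = y - X @ beta_hat

# Classical standard errors: assume one common error variance
sigma2_hat = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# HC3 standard errors: sandwich (XᵀX)⁻¹ Xᵀ diag(w) X (XᵀX)⁻¹
leverage = np.sum((X @ XtX_inv) * X, axis=1)  # diagonal of the hat matrix
weights = (resid / (1 - leverage)) ** 2       # leverage-adjusted squared residuals
meat = X.T @ (X * weights[:, None])
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients:     ", np.round(beta_hat, 3))
print("classical std err:", np.round(se_classical, 3))
print("HC3 std err:      ", np.round(se_hc3, 3))
```

The coefficients are identical under both approaches. Only the standard errors, and therefore the t-statistics and p-values, change.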