# Calculating OLS

This exercise walks you through fitting a multiple linear regression by ordinary least squares (OLS), step by step.

## Housekeeping

In [1]:
using Distributions

To make the results replicable, set the seed of the random number generator.

In [2]:
srand(0);

In this step we define the following parameters:

In [3]:
n = 100; # Number of observations
k = 2; # Number of features
beta1 = 2; # Weights on X1 and X2
beta2 = 1;
IncludeIntercept = true;

## Generating the data

Let us generate two features, $X_{1}$ and $X_{2}$, each consisting of *n* random draws from a standard normal distribution.

In [4]:
X1 = randn(n);
X2 = randn(n);

Let us generate the response (outcome variable) as a linear combination of $X_{1}$ and $X_{2}$ plus noise drawn from a standard normal distribution.

In [5]:
y = beta1 * X1 + beta2 * X2 + randn(n);
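In equation form, the data-generating process in the cell above can be written as follows. The symbol $\varepsilon_{i}$ is a label introduced here for the noise term; the notebook itself does not name it.

$$y_{i} = \beta_{1} X_{1i} + \beta_{2} X_{2i} + \varepsilon_{i}, \qquad \varepsilon_{i} \sim N(0, 1), \qquad i = 1, \dots, n$$

With $\beta_{1} = 2$ and $\beta_{2} = 1$ and no constant term, the true intercept is 0, so a fitted intercept close to zero is what one would expect.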

To verify output with other software one can export the data as a text file.

In [6]:
# writedlm("/Users/jbsc/JuliaData.csv", cat(2,y,X1,X2), ",")

## Fitting the Model

To estimate $\beta$ we can solve the normal equations in closed form.
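As a brief reminder of where the normal equations come from (a standard derivation, not specific to this notebook): OLS chooses $\beta$ to minimize the residual sum of squares, and setting the gradient with respect to $\beta$ to zero gives

$$\min_{\beta}\ (y - X\beta)^{\prime}(y - X\beta) \;\Rightarrow\; -2X^{\prime}(y - X\hat{\beta}) = 0 \;\Rightarrow\; X^{\prime}X\hat{\beta} = X^{\prime}y \;\Rightarrow\; \hat{\beta} = \left(X^{\prime}X\right)^{-1}X^{\prime}y$$

The last step assumes that $X^{\prime}X$ is invertible, that is, that the columns of $X$ are linearly independent.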

Let us create the design matrix with the intercept.

In [7]:
X = cat(2, ones(n,1),X1,X2);

In [8]:
Xt = X';                  # transpose of the design matrix
inv_Xt_X = inv(Xt * X);   # inverse of X'X, reused below for the standard errors

In [9]:
β = inv_Xt_X * Xt * y

3-element Array{Float64,1}:
 -0.0850227
  2.13912  
  0.944262 
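Forming the inverse explicitly mirrors the textbook formula, but a least-squares solve is usually preferred numerically. The following is a minimal sketch, assuming Julia 1.x (where `Random.seed!` and `hcat` replace the `srand` and `cat(2, ...)` calls used in this notebook). The data are regenerated inside the block, so the draws, and therefore the coefficients, will not match the values printed above.

```julia
using LinearAlgebra, Random

Random.seed!(0)                      # any fixed seed; draws differ from the notebook's
n = 100
X1, X2 = randn(n), randn(n)
y = 2 .* X1 .+ 1 .* X2 .+ randn(n)   # same data-generating process as above

X = hcat(ones(n), X1, X2)            # design matrix with an intercept column

β_normal = inv(X' * X) * X' * y      # normal-equation solution, as in the notebook
β_qr = X \ y                         # least-squares solve via a QR factorization

# Largest absolute difference between the two solutions
println(maximum(abs.(β_normal .- β_qr)))
```

For a well-conditioned design matrix like this one the two solutions should agree to within rounding error; the difference matters mainly when the columns of `X` are nearly collinear.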

Calculate the predictions $\hat{y} = X\hat{\beta}$.

In [10]:
yhat = X * β;

Calculate the residuals $e = y - \hat{y}$.

In [11]:
res = y - yhat;

The total sum of squares is decomposed by source: *Model* and *Residual*.

The degrees of freedom are: ESS ($k$), RSS ($n - k - Intercept$), TSS ($n - Intercept$), where $Intercept$ is 1 if the model includes an intercept and 0 otherwise.

The Total Sum of Squares (TSS) is defined as: $TSS = \sum (y_{i} - \bar{y})^2$

In [12]:
TSS = sum((y .- mean(y)) .^ 2)

612.9278300078876

In [13]:
MTSS = TSS / (n - Float64(IncludeIntercept))

6.191190202099874

The Explained Sum of Squares (ESS) is defined as: $ESS = \sum (\hat{y}_{i} - \bar{y})^2$

In [14]:
ESS = sum((yhat .- mean(y)) .^ 2)

545.9217014773542

In [15]:
MESS = ESS / k

272.9608507386771

The Residual Sum of Squares (RSS) is defined as: $RSS = \sum (y_{i} - \hat{y}_{i})^2 = e^{\prime}e$

In [16]:
RSS = res' * res

67.0061285305332

In [17]:
MRSS = RSS / (n - k - Float64(IncludeIntercept))

0.6907848302116825

Calculate the root mean squared error (RMSE), the square root of MRSS.

In [18]:
RMSE = sqrt(MRSS)

0.831134664306382

The coefficient of determination (R-squared) is defined as: $R^{2} = 1 - \frac{RSS}{TSS}$

In [19]:
Rsq = 1 - RSS / TSS

0.890678599910089
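When the model includes an intercept, the three sums of squares satisfy $TSS = ESS + RSS$, so $R^{2}$ can equivalently be computed as $ESS / TSS$. In the output above, 545.9217 + 67.0061 = 612.9278. The sketch below checks the identity on freshly simulated data; it assumes Julia 1.x, and the single-feature model and seed are arbitrary choices for illustration rather than part of the notebook.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n = 50
x = randn(n)
y = 1 .+ 2 .* x .+ randn(n)      # illustrative model with a nonzero intercept

X = hcat(ones(n), x)             # the intercept column is what makes the identity hold
β = X \ y                        # least-squares fit
yhat = X * β
ybar = sum(y) / n

TSS = sum(abs2, y .- ybar)       # total sum of squares
ESS = sum(abs2, yhat .- ybar)    # explained sum of squares
RSS = sum(abs2, y .- yhat)       # residual sum of squares

println(TSS ≈ ESS + RSS)             # decomposition, up to rounding
println(1 - RSS / TSS ≈ ESS / TSS)   # two equivalent forms of R-squared
```

Without the column of ones the cross term between fitted values and residuals need not vanish, and the decomposition generally fails.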

The estimates $\hat{\beta}$ are random variables, since they depend on the sample; under the model assumptions their expected value is $\beta$.

Standard errors measure the uncertainty in these estimates. They are the square roots of the diagonal elements of the estimated covariance matrix:

$Var\left(\hat{\beta}\right) = \hat{\sigma} ^ 2 \left(X^{\prime}X\right)^{-1}$

$\hat{\sigma}^2$ is estimated by the mean squared residual (MRSS).

In [20]:
se = sqrt.(MRSS * diag(inv_Xt_X))

3-element Array{Float64,1}:
 0.0839841
 0.0797778
 0.0792884
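The cell above uses only the diagonal of $\left(X^{\prime}X\right)^{-1}$. As a minimal sketch, assuming Julia 1.x and freshly simulated data (so the numbers differ from those printed above), the full covariance matrix can be formed first and the standard errors and t-ratios read off from it. The names `σ2` and `V` are introduced here for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(0)
n, k = 100, 2
X = hcat(ones(n), randn(n), randn(n))   # intercept plus two features
y = X * [0.0, 2.0, 1.0] .+ randn(n)     # true coefficients: 0, 2, 1

β = X \ y                               # least-squares estimates
res = y .- X * β                        # residuals
σ2 = sum(abs2, res) / (n - k - 1)       # MRSS, the estimate of the error variance
V = σ2 .* inv(X' * X)                   # estimated covariance matrix of the estimates
se = sqrt.(diag(V))                     # standard errors: square roots of the diagonal
t = β ./ se                             # t-ratios for the hypothesis that each coefficient is 0

println(se)
println(t)
```

The off-diagonal entries of `V` are the estimated covariances between coefficient estimates, which are needed when testing hypotheses that involve more than one coefficient.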

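The t-statistic for each coefficient is the estimate divided by its standard error; it tests the null hypothesis that the coefficient is zero.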

In [21]:
t_statistic = β ./ se

3-element Array{Float64,1}:
 -1.01237
 26.8134 
 11.9092 

P-values are calculated using a t-distribution with $n - k - Intercept$ degrees of freedom.

In [22]:
t_dist = TDist(n - k - Float64(IncludeIntercept))

Distributions.TDist{Float64}(ν=97.0)

For a two-tailed test, the p-value is twice the upper-tail area beyond $|t|$.

In [23]:
p_value = 2 * ccdf.(t_dist, abs.(t_statistic))

3-element Array{Float64,1}:
 0.313881   
 1.19341e-46
 1.09355e-20

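To test the overall significance of the model, use the F ratio $F = \frac{MESS}{MRSS}$, which is compared against an F distribution with $k$ and $n - k - Intercept$ degrees of freedom: FDist( k , n - k - Intercept ). The p-value of the test (stored below as `F_value`) is the upper-tail area beyond $F$.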

In [24]:
F = MESS / MRSS

395.14598294671816

In [25]:
F_dist = FDist(k, n - k - Float64(IncludeIntercept))

Distributions.FDist{Float64}(ν1=2.0, ν2=97.0)

In [26]:
F_value = ccdf(F_dist, F)

2.38342493063221e-47
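Because $ESS = R^{2} \cdot TSS$ and $RSS = (1 - R^{2}) \cdot TSS$ when the model has an intercept, the same F ratio can be written in terms of $R^{2}$ alone. As a check against the values computed above (inputs rounded to five decimals):

$$F = \frac{R^{2} / k}{(1 - R^{2}) / (n - k - 1)} = \frac{0.89068 / 2}{0.10932 / 97} \approx 395.1$$

This agrees with the value of $MESS / MRSS$ up to rounding.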

In [27]:
"""
Return the lower and upper bounds of a two-sided confidence interval, μ ± t*·σ,
where t* is the (1 - α/2) quantile of `dist` and α = 1 - conf_level.
For estimates from a sample, pass the estimate as μ and its standard error as σ.
"""
function conf_intervals(dist, μ, σ, conf_level = 0.95)
    α = 1 - conf_level
    tstar = quantile(dist, 1 - α / 2)   # critical value
    return μ - tstar * σ, μ + tstar * σ
end

conf_intervals

In [28]:
CI = Array{Float64}(length(β), 2)
for idx in 1:size(CI,1)
    CI[idx,1], CI[idx,2] = conf_intervals(t_dist, β[idx], se[idx])
end
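As a worked check for the $X_{1}$ coefficient, take the rounded estimate and standard error computed earlier and $t^{*} \approx 1.985$, the 0.975 quantile of the t-distribution with 97 degrees of freedom (a value quoted here; the notebook does not print it):

$$\hat{\beta}_{1} \pm t^{*} \cdot se_{1} = 2.1391 \pm 1.985 \times 0.0798 = 2.1391 \pm 0.1584 \approx (1.981,\ 2.298)$$

This matches the LowerCI and UpperCI values reported for X1 in the results table up to rounding.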

## Output

In [29]:
using DataFrames

In [30]:
Results = DataFrame(cat(2,["Intercept", "X1", "X2"],round.(cat(2,β,se,t_statistic,p_value,CI), 4)))
names!(Results, [:Feature, :Coeff, :Std_Error, :t_Statistic, :p_value, :LowerCI, :UpperCI])
Results

| | Feature | Coeff | Std_Error | t_Statistic | p_value | LowerCI | UpperCI |
|---|---|---|---|---|---|---|---|
| 1 | Intercept | -0.085 | 0.084 | -1.0124 | 0.3139 | -0.2517 | 0.0817 |
| 2 | X1 | 2.1391 | 0.0798 | 26.8134 | 0.0 | 1.9808 | 2.2975 |
| 3 | X2 | 0.9443 | 0.0793 | 11.9092 | 0.0 | 0.7869 | 1.1016 |


In [31]:
Anova = DataFrame(cat(2, ["Model", "Residual", "Total"], round.(cat(2, [ESS, RSS, TSS], [k, n - k - Float64(IncludeIntercept), n - Float64(IncludeIntercept)], [MESS, MRSS, MTSS]), 4)))
names!(Anova, [:Source, :SS, :df, :MS])
Anova

| | Source | SS | df | MS |
|---|---|---|---|---|
| 1 | Model | 545.9217 | 2.0 | 272.9609 |
| 2 | Residual | 67.0061 | 97.0 | 0.6908 |
| 3 | Total | 612.9278 | 99.0 | 6.1912 |


In [32]:
Model = DataFrame(cat(2, ["Sample Size", "R-Squared", "F-Statistic", "F-value", "RMSE"], round.([n, Rsq, F, F_value, RMSE],4)))
names!(Model, [:Statistic, :Value])
Model

| | Statistic | Value |
|---|---|---|
| 1 | Sample Size | 100.0 |
| 2 | R-Squared | 0.8907 |
| 3 | F-Statistic | 395.146 |
| 4 | F-value | 0.0 |
| 5 | RMSE | 0.8311 |
