### Building linear models

This process lets you select the variables for a model and drop those that are not statistically significant.

For example, suppose you have independent variables $ a, b, c $ and want to predict some dependent variable $ y $. These variables are observed in a sample of data, and we want to see whether the relationship between them is significant in the population.

There are four basic ways to build a model to contain useful variables:

- Forward
- Backward
- Stepwise
- Subset

This page covers the subset method.

#### Rationale

We could make the naive assumption that for the variables $ a, b, c $ in the sample, all are useful. Then:

$ y = \beta_0 + \beta_1 a + \beta_2 b + \beta_3 c + \epsilon $

But this is more complex than it needs to be if any of those variables do not have a significant relationship with $ y $. It's better to keep things simple.

The number of possible models equals the cardinality of the power set of the variables, minus one for the empty set, which would not be helpful at all. The Forward and Backward methods test only some of these combinations, while the Subset method tests all of them. The Stepwise method is not guaranteed to test every combination.

$ P(\{a,b,c\}) \setminus \{\emptyset\} = \{\{a\},\{b\},\{c\},\{a,b\},\{a,c\},\{b,c\},\{a,b,c\}\} $

$ \hat y = \hat \beta_0 + \hat \beta_1 a $

$ \hat y = \hat \beta_0 + \hat \beta_2 b $

$ \hat y = \hat \beta_0 + \hat \beta_3 c $

$ \hat y = \hat \beta_0 + \hat \beta_1 a + \hat \beta_2 b $

$ \hat y = \hat \beta_0 + \hat \beta_1 a + \hat \beta_3 c $

$ \hat y = \hat \beta_0 + \hat \beta_2 b + \hat \beta_3 c $

$ \hat y = \hat \beta_0 + \hat \beta_1 a + \hat \beta_2 b + \hat \beta_3 c $, also called the "full model"

This idea extends to a dataset with any number of variables, although the number of candidate models grows quickly: $ 2^k - 1 $ for $ k $ variables.
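As a minimal sketch of that enumeration, the non-empty subsets can be generated with base R's `combn` and turned into model formulas with `reformulate`. The names `variables`, `subsets` and `formulas` are illustrative and not part of the original notebook, which uses the `rje` package instead.

```r
# Enumerate every non-empty subset of a set of variable names using base R.
variables <- c("a", "b", "c")

# combn(variables, k, simplify = FALSE) returns all subsets of size k as a list;
# unlist(..., recursive = FALSE) flattens the per-size lists into one list.
subsets <- unlist(
  lapply(seq_along(variables), function(k) combn(variables, k, simplify = FALSE)),
  recursive = FALSE
)

# Turn each subset into a model formula with y as the response.
formulas <- lapply(subsets, function(s) reformulate(s, response = "y"))

length(subsets)            # number of candidate models
2^length(variables) - 1    # size of the power set minus the empty set
```

Both of the last two expressions should agree, which is a quick check that no combination was missed.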

In [1]:
data = read.csv("multipleRegression.csv")
data

| y | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 271.8 | 777.5556 | 33.53 | 40.55 | 16.66 | 13.2 |
| 264.0 | 790.0 | 36.5 | 36.19 | 16.46 | 14.11 |
| 238.8 | 811.8756 | 34.66 | 37.31 | 17.66 | 15.68 |
| 230.7 | 814.0 | 33.13 | 32.52 | 17.5 | 10.53 |
| 251.6 | 768.96 | 35.75 | 33.71 | 16.4 | 11.0 |
| 257.9 | 765.0384 | 34.46 | 34.14 | 16.28 | 11.31 |
| 263.9 | 757.9236 | 34.6 | 34.85 | 16.06 | 11.96 |
| 266.5 | 750.0 | 35.38 | 35.89 | 15.93 | 12.58 |
| 229.1 | 775.56 | 35.85 | 33.53 | 16.6 | 10.66 |
| 239.3 | 769.2881 | 35.68 | 33.79 | 16.41 | 10.85 |


### Subset method

For each set of independent variables, make a regression with the dependent variable.

There are three statistics to evaluate each model:

For each set of variables, fit a regression against the dependent variable and get its $ R^2 $. Then take the model with the highest $ R^2 $ and run a test to see whether that $ R^2 $ is significant.

In [2]:
library("rje")
independentVariables = c("x2","x3","x4","x5") # y is the dependent variable; x1 is not included here
variableCombinations = powerSet(independentVariables)
variableCombinations = variableCombinations[-1] # first element is always the empty set

In [3]:
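for (combination in variableCombinations) {
    print(combination)
    # Build a formula such as y ~ x2 + x3 from the variable names
    modelFormula = as.formula(paste("y ~", paste(combination, collapse = " + ")))
    model = lm(modelFormula, data = data)
    # Select R^2 by name rather than by its position in the summary list
    print(summary(model)["r.squared"])
}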

[1] "x2"
$r.squared
[1] 0.01047594

[1] "x3"
$r.squared
[1] 0.01256447

[1] "x2" "x3"
$r.squared
[1] 0.03427915

[1] "x4"
$r.squared
[1] 0.7205242

[1] "x2" "x4"
$r.squared
[1] 0.7205321

[1] "x3" "x4"
$r.squared
[1] 0.8587154

[1] "x2" "x3" "x4"
$r.squared
[1] 0.8741268

[1] "x5"
$r.squared
[1] 0.1232863

[1] "x2" "x5"
$r.squared
[1] 0.1296362

[1] "x3" "x5"
$r.squared
[1] 0.3703462

[1] "x2" "x3" "x5"
$r.squared
[1] 0.4608582

[1] "x4" "x5"
$r.squared
[1] 0.8202064

[1] "x2" "x4" "x5"
$r.squared
[1] 0.8202213

[1] "x3" "x4" "x5"
$r.squared
[1] 0.8637623

[1] "x2" "x3" "x4" "x5"
$r.squared
[1] 0.8748834
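The printed results are easier to compare when collected into one table. The sketch below is one way to do that, shown alongside adjusted $ R^2 $, since plain $ R^2 $ never decreases when a variable is added. It assumes the ten rows printed earlier are representative; the full CSV may contain more rows, so the values it produces need not match the output above. The names `predictors`, `subsets` and `results` are illustrative.

```r
# The ten rows printed earlier on this page (x1 omitted, as in the loop above).
# Assumption: the full multipleRegression.csv may contain more rows than these.
data <- data.frame(
  y  = c(271.8, 264.0, 238.8, 230.7, 251.6, 257.9, 263.9, 266.5, 229.1, 239.3),
  x2 = c(33.53, 36.5, 34.66, 33.13, 35.75, 34.46, 34.6, 35.38, 35.85, 35.68),
  x3 = c(40.55, 36.19, 37.31, 32.52, 33.71, 34.14, 34.85, 35.89, 33.53, 33.79),
  x4 = c(16.66, 16.46, 17.66, 17.5, 16.4, 16.28, 16.06, 15.93, 16.6, 16.41),
  x5 = c(13.2, 14.11, 15.68, 10.53, 11.0, 11.31, 11.96, 12.58, 10.66, 10.85)
)

# All non-empty subsets of the predictors, built with base R.
predictors <- c("x2", "x3", "x4", "x5")
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per subset and keep R^2 and adjusted R^2 in a single data frame.
results <- do.call(rbind, lapply(subsets, function(s) {
  fit <- summary(lm(reformulate(s, response = "y"), data = data))
  data.frame(
    variables     = paste(s, collapse = " + "),
    n_vars        = length(s),
    r_squared     = fit$r.squared,
    adj_r_squared = fit$adj.r.squared
  )
}))

# Best models first, ranked by adjusted R^2.
results[order(-results$adj_r_squared), ]
```

Ranking by adjusted $ R^2 $ is only one possible choice; it penalises extra variables, which plain $ R^2 $ does not.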



The combination you end up using is a tradeoff between a high $ R^2 $ and leaving out variables that don't provide a significant improvement to the model. Here the combination of x3 and x4 gives a high $ R^2 $ (0.859), while adding the other variables gives only a marginal improvement (0.875 with all four).
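One way to make "marginal improvement" concrete is a partial F-test between nested models, which `anova` performs when given two `lm` fits. This is a sketch of that idea, not necessarily the test the notebook intends. It again assumes the ten rows printed earlier stand in for the full file, so the resulting p-value is illustrative only.

```r
# The ten rows printed earlier on this page.
# Assumption: the full multipleRegression.csv may contain more rows than these.
data <- data.frame(
  y  = c(271.8, 264.0, 238.8, 230.7, 251.6, 257.9, 263.9, 266.5, 229.1, 239.3),
  x2 = c(33.53, 36.5, 34.66, 33.13, 35.75, 34.46, 34.6, 35.38, 35.85, 35.68),
  x3 = c(40.55, 36.19, 37.31, 32.52, 33.71, 34.14, 34.85, 35.89, 33.53, 33.79),
  x4 = c(16.66, 16.46, 17.66, 17.5, 16.4, 16.28, 16.06, 15.93, 16.6, 16.41),
  x5 = c(13.2, 14.11, 15.68, 10.53, 11.0, 11.31, 11.96, 12.58, 10.66, 10.85)
)

# The reduced model keeps only x3 and x4; the full model adds x2 and x5.
reduced <- lm(y ~ x3 + x4, data = data)
full    <- lm(y ~ x2 + x3 + x4 + x5, data = data)

# anova() on nested lm fits tests whether the extra terms reduce the
# residual sum of squares by more than chance would explain.
comparison <- anova(reduced, full)
comparison

# p-value for adding x2 and x5 to the reduced model.
comparison[2, "Pr(>F)"]
```

A large p-value here would support keeping the smaller model, since the extra variables would not be explaining significantly more of the variation in y.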

TO DO: add forward and backward model building techniques