# Linear Regression in R Using the lm() Function

In [3]:
dataset <- read.csv(text = 'Month,Spend,Sales
1,1000,9914
2,4000,40487
3,5000,54324
4,4500,50044
5,3000,34719
6,4000,42551
7,9000,94871
8,11000,118914
9,15000,158484
10,12000,131348
11,7000,78504
12,3000,36284
', header = TRUE, colClasses = "numeric")

In [4]:
dataset

Month,Spend,Sales
1,1000,9914
2,4000,40487
3,5000,54324
4,4500,50044
5,3000,34719
6,4000,42551
7,9000,94871
8,11000,118914
9,15000,158484
10,12000,131348


In [5]:
simple.fit <- lm(Sales ~ Spend, data = dataset)
summary(simple.fit)


Call:
lm(formula = Sales ~ Spend, data = dataset)

Residuals:
   Min     1Q Median     3Q    Max 
 -3385  -2097    258   1726   3034 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1383.4714  1255.2404   1.102    0.296    
Spend         10.6222     0.1625  65.378 1.71e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2313 on 10 degrees of freedom
Multiple R-squared:  0.9977,	Adjusted R-squared:  0.9974 
F-statistic:  4274 on 1 and 10 DF,  p-value: 1.707e-14


In [6]:
multi.fit <- lm(Sales ~ Spend + Month, data = dataset)
summary(multi.fit)


Call:
lm(formula = Sales ~ Spend + Month, data = dataset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1793.73 -1558.33    -1.73  1374.19  1911.58 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -567.6098  1041.8836  -0.545  0.59913    
Spend         10.3825     0.1328  78.159 4.65e-14 ***
Month        541.3736   158.1660   3.423  0.00759 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1607 on 9 degrees of freedom
Multiple R-squared:  0.999,	Adjusted R-squared:  0.9988 
F-statistic:  4433 on 2 and 9 DF,  p-value: 3.368e-14


## Interpreting R's Regression Output

* Residuals: This section summarizes the residuals, the differences between the actual results and the model's predictions. Smaller residuals are better.
* Coefficients: For the intercept and each variable, a weight is produced, and that weight has other attributes such as the standard error, a t-test value and significance.
    * Estimate: This is the weight given to the variable. In the simple regression case (one variable plus the intercept), for every one-dollar increase in Spend, the model predicts an increase of \$10.6222 in Sales.
    * Std. Error: Tells you how precisely the estimate was measured. It is mainly used for calculating the t-value (and confidence intervals).
    * t-value and Pr(>|t|): The t-value is calculated by dividing the coefficient by its Std. Error. It is then used to test whether or not the coefficient is significantly different from zero. If it isn't significant, then the coefficient isn't adding much to the model and could be dropped or investigated further. Pr(>|t|) is the p-value: the probability of seeing a t-value at least this extreme if the true coefficient were zero.

* Performance Measures: Three sets of measurements are provided.
    * Residual Standard Error: This is the standard deviation of the residuals. Smaller is better.
    * Multiple / Adjusted R-Squared: For one variable, the distinction doesn't really matter. R-squared shows the proportion of variance explained by the model. Adjusted R-Squared takes into account the number of variables and is most useful for multiple regression.
    * F-statistic: The F-test checks if at least one variable's weight is significantly different from zero. This is a global test to help assess a model. If the p-value is not significant (e.g. greater than 0.05), then your model is essentially not doing anything.
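
The quantities described above can also be read from the summary object directly instead of from the printed text. The following is a minimal sketch that rebuilds the same Spend/Sales data so it runs on its own; the names `fit.summary`, `coef.table` and `t.check` are illustrative and do not appear in the original notebook.

```r
# Rebuild the advertising data from the first cell (12 monthly observations).
dataset <- data.frame(
  Month = 1:12,
  Spend = c(1000, 4000, 5000, 4500, 3000, 4000,
            9000, 11000, 15000, 12000, 7000, 3000),
  Sales = c(9914, 40487, 54324, 50044, 34719, 42551,
            94871, 118914, 158484, 131348, 78504, 36284)
)

simple.fit <- lm(Sales ~ Spend, data = dataset)
fit.summary <- summary(simple.fit)

# Coefficient table as a matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef.table <- coef(fit.summary)
print(coef.table)

# The t value is the estimate divided by its standard error.
t.check <- coef.table["Spend", "Estimate"] / coef.table["Spend", "Std. Error"]
print(t.check)

# Performance measures stored on the summary object.
print(fit.summary$sigma)          # Residual Standard Error
print(fit.summary$r.squared)      # Multiple R-squared
print(fit.summary$adj.r.squared)  # Adjusted R-squared
print(fit.summary$fstatistic)     # F value with its two degrees of freedom
```

Working with the matrix and list elements is handy when the numbers need to feed into further calculations rather than just be read off the screen.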

# Explaining the lm() Summary in R

Summary:
* Residual Standard Error: Essentially standard deviation of residuals / errors of your regression model.

* Multiple R-Squared: Proportion of the variance of Y that is left after subtracting the error of the model, i.e. the share of variance the model explains.

* Adjusted R-Squared: Same as multiple R-squared but takes into account the number of samples and variables you're using.

* F-Statistic: Global test to check if your model has at least one significant variable. Takes into account the number of variables and observations used.

R's lm() function is fast, easy and succinct. However, when you're getting started, that brevity can be a bit of a curse. I'm going to explain some of the key components of the summary() output in R for linear regression models. In addition, I'll show you how to calculate these figures for yourself so you have a better intuition of what they mean.


## Getting Started: Build a Model

Before we can examine a model summary, we need to build a model. To follow along with this example, create these three variables.

In [7]:
#Anscombe's Quartet Q1 Data
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)
x1=c(10,8,13,9,11,14,6,4,12,7,5)
#Some fake data, set the seed to be reproducible.
set.seed(15)
x2=sqrt(y)+rnorm(length(y))

Now, we'll create a linear regression model using R's lm() function and we'll get the summary output using the summary() function.

In [8]:
model <- lm(y ~ x1 + x2)
summary(model)


Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69194 -0.61053 -0.08073  0.60553  1.61689 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.8278     1.7063   0.485  0.64058   
x1            0.5299     0.1104   4.802  0.00135 **
x2            0.6443     0.4017   1.604  0.14744   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.141 on 8 degrees of freedom
Multiple R-squared:  0.7477,	Adjusted R-squared:  0.6846 
F-statistic: 11.85 on 2 and 8 DF,  p-value: 0.004054


# Meaning behind Each Section of Summary()

* Call: this is an R feature that shows what function and parameters were used to create the model.

* Residuals: Difference between the actual value of y and what the model predicted. You can calculate the Residuals section like so: ```summary(y - model$fitted.values)```

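* Coefficients: These are the weights that minimize the sum of the squared errors.
    * Std. Error: In a simple (one-variable) regression, the Std. Error of the slope is the Residual Standard Error divided by the square root of the sum of squared deviations of x from its mean. With several variables, it is the Residual Standard Error times the square root of the matching diagonal entry of (X'X)^-1, where X is the design matrix.
    * t value: Estimate divided by Std. Error.
    * Pr(>|t|): The two-sided p-value for your t value, taken from a t distribution with the residual degrees of freedom.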
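
The Coefficients columns can be reproduced by hand. The sketch below assumes the general matrix form of the standard error (Residual Standard Error times the square root of the diagonal of the inverse of X'X), which covers the multi-variable model built earlier; the names `X`, `df.resid`, `rse` and `manual` are illustrative, not from the original notebook.

```r
# Rebuild the model from the "Getting Started" cells so this runs on its own.
y  <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
set.seed(15)
x2 <- sqrt(y) + rnorm(length(y))
model <- lm(y ~ x1 + x2)

X <- model.matrix(model)        # design matrix, including the intercept column
df.resid <- nrow(X) - ncol(X)   # n - (k + 1) residual degrees of freedom
rse <- sqrt(sum(residuals(model)^2) / df.resid)

# Standard error of each coefficient from the diagonal of (X'X)^-1
std.error <- rse * sqrt(diag(solve(crossprod(X))))
t.value <- coef(model) / std.error
# Two-sided p-value from the t distribution
p.value <- 2 * pt(abs(t.value), df = df.resid, lower.tail = FALSE)

manual <- cbind(Estimate = coef(model), "Std. Error" = std.error,
                "t value" = t.value, "Pr(>|t|)" = p.value)
print(manual)
print(coef(summary(model)))     # should agree with the manual table
```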


## Residual Standard Error

Standard deviation is the square root of variance. The Residual Standard Error is very similar. The only difference is that instead of dividing by n - 1, you divide by n minus (1 + the number of variables involved).

In [9]:
model$coefficients

In [10]:
# Residual Standard Error (like a standard deviation)
k <- length(model$coefficients) - 1 # subtract one to ignore the intercept
SSE <- sum(model$residuals ** 2)    # sum of squared residuals
n <- length(model$residuals)        # number of observations
sqrt(SSE/(n-(1+k))) # Residual Standard Error
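
As a quick sanity check, the hand-computed value can be compared with the one stored on the summary object. This sketch rebuilds the model so it runs on its own; `rse.manual` is an illustrative name, not part of the original notebook.

```r
y  <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
set.seed(15)
x2 <- sqrt(y) + rnorm(length(y))
model <- lm(y ~ x1 + x2)

k <- length(coef(model)) - 1    # number of variables, ignoring the intercept
n <- length(residuals(model))   # number of observations
rse.manual <- sqrt(sum(residuals(model)^2) / (n - (1 + k)))

print(rse.manual)
print(summary(model)$sigma)     # value reported by summary()
print(df.residual(model))       # the "degrees of freedom" in the output: n - (k + 1)
```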

## Multiple R-squared

Also called the coefficient of determination, this is an oft-cited measurement of how well your model fits the data. While there are many issues with using it alone, it's a quick and pre-computed check for your model.   
R-Squared subtracts the residual error from the variance in Y. The bigger the error, the worse the remaining variance will appear. 

In [11]:
#Multiple R-Squared (Coefficient of Determination)
SSyy <- sum((y-mean(y)) ** 2)
SSE <- sum(model$residuals ** 2)
(SSyy - SSE) / SSyy

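Note that the numerator doesn't have to be positive in general. For a least-squares fit that includes an intercept, as here, SSE cannot exceed SSyy on the data used for fitting, so R-Squared stays between 0 and 1. If the model is bad enough, though (for example, one fitted without an intercept, or one evaluated on new data), this formula can actually produce a negative R-Squared.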
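
Here is a small illustration of a negative value, under assumptions that go beyond the original notebook: the toy vectors are made up, and the second model is deliberately forced through the origin (`y ~ 0 + x`), which removes the intercept. The helper `r.squared.manual` is a hypothetical name that applies the same formula as the cell above.

```r
# Made-up toy data: y falls as x rises, and sits far from the origin.
x <- c(5, 4, 3, 2, 1)
y <- c(10, 11, 12, 13, 15)

# Same formula as above: (SSyy - SSE) / SSyy
r.squared.manual <- function(fit, y) {
  SSyy <- sum((y - mean(y))^2)
  SSE <- sum(residuals(fit)^2)
  (SSyy - SSE) / SSyy
}

with.intercept <- lm(y ~ x)
through.origin <- lm(y ~ 0 + x)   # no intercept: a poor model for this data

print(r.squared.manual(with.intercept, y))  # between 0 and 1
print(r.squared.manual(through.origin, y))  # negative: SSE exceeds SSyy
```

Note that `summary()` defines R-squared differently for models without an intercept, so its reported value will not match this manual formula in that case.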


## Adjusted R-squared

Multiple R-squared works great for simple linear (one variable) regression. However, in most cases, the model has multiple variables. The more variables you add, the more variance you're going to explain. So you have to control for the extra variables.

Adjusted R-Squared normalizes Multiple R-Squared by taking into account how many samples you have and how many variables you're using.
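
Written out as a formula, using the same symbols as the code in this section (SSE for the sum of squared residuals, SSyy for the total sum of squares around the mean of y, n observations and k variables excluding the intercept), the adjustment is:

$$
R^2_{adj} = 1 - \frac{SSE}{SS_{yy}} \cdot \frac{n - 1}{n - (k + 1)}
$$

With k = 0 the second fraction equals one and the expression reduces to 1 - SSE/SSyy, which is the Multiple R-squared.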

In [12]:
# Adjusted R-Squared
n <- length(y)
k <- length(model$coefficients) - 1 # subtract one to ignore the intercept
SSE <- sum(model$residuals ** 2)
SSyy <- sum((y - mean(y)) ** 2)
1 - (SSE/SSyy) * (n - 1) / (n - (k + 1))

Notice how k is in the denominator. If you have 100 observations (n) and 5 variables, you'll be dividing by 100 - 5 - 1 = 94. If you have 20 variables instead, you're dividing by 100 - 20 - 1 = 79. As the denominator gets smaller, the result gets larger: 99/94 = 1.05; 99/79 = 1.25.

A larger normalizing value is going to make the Adjusted R-squared worse since we're subtracting its product with SSE/SSyy from one.
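
The arithmetic can be checked in a couple of lines. This is only a sketch: the variable counts come from the paragraph above, while the unexplained share `unexplained` (SSE/SSyy = 0.25) is a made-up number used to show the effect.

```r
n <- 100
k <- c(5, 20)                        # the two variable counts from the text
penalty <- (n - 1) / (n - (k + 1))   # multiplier applied to SSE / SSyy
print(round(penalty, 2))             # 1.05 1.25

unexplained <- 0.25                  # hypothetical SSE / SSyy
adjusted <- 1 - unexplained * penalty
print(adjusted)                      # lower for k = 20 than for k = 5
```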


## F-statistic

Finally, the F-statistic. Counting the t-tests, this is the second "test" that the summary function produces for lm models. The F-statistic is a "global" test that checks whether at least one of your coefficients is nonzero.

In [16]:
#F-Statistic
#H0: All coefficients are zero
#Ha: At least one coefficient is nonzero
#Compare test statistic to F Distribution table
n <- length(y)
SSE <- sum(model$residuals ** 2)
SSyy <- sum((y-mean(y)) ** 2)
k <- length(model$coefficients) - 1
((SSyy - SSE) / k) / (SSE / (n - (k + 1)))
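
Instead of looking the statistic up in an F distribution table, the p-value can be computed with `pf()`. The sketch below rebuilds the model so it runs on its own and compares the result with what `summary()` stores; `f.stat` and `p.value` are illustrative names.

```r
y  <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
set.seed(15)
x2 <- sqrt(y) + rnorm(length(y))
model <- lm(y ~ x1 + x2)

n <- length(y)
k <- length(coef(model)) - 1
SSE <- sum(residuals(model)^2)
SSyy <- sum((y - mean(y))^2)

f.stat <- ((SSyy - SSE) / k) / (SSE / (n - (k + 1)))
# Upper-tail probability of the F distribution with k and n - (k + 1) DF
p.value <- pf(f.stat, df1 = k, df2 = n - (k + 1), lower.tail = FALSE)

print(f.stat)
print(p.value)
print(summary(model)$fstatistic)   # named vector: value, numdf, dendf
```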

The reason for this test is that if you run multiple hypothesis tests (namely, one on each of your coefficients), you're likely to find at least one variable that looks significant by chance even though it isn't.
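
To put a rough number on that risk: if each of m coefficient tests is run at a 5% significance level and, as simplifying assumptions not made in the original text, the tests are independent and every true coefficient is zero, the chance of at least one false positive is 1 - 0.95^m. The values of `m` below are hypothetical.

```r
alpha <- 0.05
m <- c(1, 5, 20)                  # hypothetical numbers of coefficients tested
familywise <- 1 - (1 - alpha)^m   # P(at least one false positive)
print(round(familywise, 3))       # 0.050 0.226 0.642
```

With 20 such tests, a spurious "significant" variable is more likely than not, which is why a single global test is a useful first check.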