# The dataset

'50 startups': 50 different startups, each described by these variables:

* Profit
* R&D Spend
* Administration
* Marketing Spend
* State - Categorical variable!

What yields best results? Which companies are more interesting?

# Regressions

* Simple linear: One independent variable
* Multiple linear: Several independent variables

Assumptions:

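* Linearity: The relationship between the predictors and the dependent variable is actually linear.
* Homoscedasticity: The errors have the same finite variance everywhere, also for larger values.
* Multivariate normality: The errors follow a normal distribution. (Joint normality is stronger than each component being normal on its own: normal sub-components do not guarantee a multivariate normal combination.)
* Independence of errors: The errors are not correlated with each other, so there is no systematic pattern in them.
* Lack of multicollinearity: The predictors are not strongly correlated with each other, so none can be predicted from the others. (This is about correlation between predictors, not about interaction effects.)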
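
A minimal sketch of how a few of these assumptions might be checked in R. It uses the built-in `mtcars` data as a stand-in (an assumption, it is not the startup data), and the variable names `fit`, `predictor_cor` and `vif_wt` are made up for the illustration.

```r
# Stand-in data: built-in mtcars, not the 50 startups file
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Lack of multicollinearity: pairwise correlations between the predictors
predictor_cor <- cor(mtcars[, c("wt", "hp", "disp")])
print(round(predictor_cor, 2))

# Variance inflation factor for wt: 1 / (1 - R^2) of wt regressed on the other predictors
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)
print(vif_wt)

# Normality of the residuals: Shapiro-Wilk test (a small p-value suggests non-normality)
print(shapiro.test(residuals(fit)))
```

Linearity and homoscedasticity are usually judged by eye, for example with `plot(fitted(fit), residuals(fit))`: the residuals should show no pattern and an even spread.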


# Dummy variables

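Dummy variable: Replacing a categorical value such as New York / California with a 0/1 column.

If a categorical variable has several levels and you keep a dummy for every level, it messes things up. Always include one dummy fewer than there are levels.

The risk (the dummy variable trap): The dummies add up to the intercept column, which leads to a situation where one variable can be predicted exactly from the others.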

http://www.algosome.com/articles/dummy-variable-trap-regression.html
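
A small sketch of the trap on a made-up data frame (`toy` and its values are hypothetical). It compares the default encoding from `model.matrix` with a design matrix that keeps a dummy for every level next to the intercept.

```r
# Hypothetical toy data with a three-level factor
toy <- data.frame(
  State  = factor(c("New York", "California", "Florida",
                    "California", "New York", "Florida")),
  Profit = c(10, 12, 9, 13, 11, 8)
)

# Default encoding: intercept plus (levels - 1) dummies, California is the baseline
X_ok <- model.matrix(~ State, data = toy)
print(colnames(X_ok))

# Dummy variable trap: intercept plus one dummy per level
X_trap <- cbind(Intercept = 1, model.matrix(~ State - 1, data = toy))
print(colnames(X_trap))

# The three dummies sum to the intercept column, so the second matrix is rank deficient
print(qr(X_ok)$rank)    # 3 columns, rank 3
print(qr(X_trap)$rank)  # 4 columns, rank 3
```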



# What is a p-value

https://www.mathbootcamps.com/what-is-a-p-value/

`P-value tells us how unlikely result is if hypothesis is true`.

In the linked example, p = 0.18 tells us that the probability of getting a mean of 68.7 or less from a sample of that particular size is 0.18 if the peanuts in the population weigh 70 grams or more on average.
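
A sketch of how such a one-sided p-value might be computed in R. The sample is simulated (the sample size, mean and standard deviation are assumptions, not the numbers from the linked article), so the resulting p-value will not be 0.18.

```r
set.seed(1)
# Hypothetical sample: 20 weights in grams (simulated, not the article's data)
weights <- rnorm(20, mean = 69, sd = 4)

# H0: the population mean is 70 g or more; H1: it is less than 70 g
result <- t.test(weights, mu = 70, alternative = "less")
print(result$p.value)

# The same p-value by hand: P(T <= t) under H0, with n - 1 degrees of freedom
t_stat <- (mean(weights) - 70) / (sd(weights) / sqrt(length(weights)))
p_manual <- pt(t_stat, df = length(weights) - 1)
print(p_manual)
```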

# How to build models - Step by step

Framework for several methods. All columns are potential predictors! Why do we want to trim our model?

* Garbage in garbage out
* Intuition - Need to be able to explain what's up

## All-in

Throw in all variables. When:

* We have prior knowledge - Everything is valuable
* Framework saying you have to use everything
* Preparing for backward elimination

## Backward Elimination (stepwise)

Steps:

* Select a significance level (SL) to stay in the model
* Fit the full model with *all* possible predictors
* Consider the predictor with the *highest* p-value. If p > SL, continue; otherwise the model is finished.
* Remove that predictor
* Re-fit the model without the variable and repeat from step 3

## Forward Selection (stepwise)

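More complex than just reversing the previous method.

* Select a significance level to enter the model
* Fit all simple regression models `y ~ x_n`. Select the one with the lowest p-value.
* Keep this variable and fit all possible models with one extra predictor added. Which addition is best?
* Consider the new predictor with the lowest p-value. If it is lower than the threshold, repeat from step 3; otherwise stop and keep the previous model.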
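
The forward selection steps above can be sketched roughly as follows. This is an illustration only: `forwardSelection` is a hypothetical helper, it assumes all predictors are numeric (so each coefficient is named after its column), and it runs on the built-in `mtcars` data instead of the startup data.

```r
# Forward selection by p-value (sketch, numeric predictors only)
forwardSelection <- function(data, response, sl) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  while (length(candidates) > 0) {
    # p-value of each candidate when it is added to the current model
    pvals <- sapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      coef(summary(lm(f, data = data)))[v, "Pr(>|t|)"]
    })
    best <- names(pvals)[which.min(pvals)]
    # No candidate is significant any more: keep the previous model
    if (pvals[best] >= sl) break
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected
}

# Stand-in data: a few columns of mtcars, with mpg as the response
chosen <- forwardSelection(mtcars[, c("mpg", "wt", "hp", "qsec", "drat")], "mpg", sl = 0.05)
print(chosen)
```

The function returns the names of the selected predictors in the order they entered the model.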

## Bidirectional Elimination (stepwise, or Stepwise)

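* Select significance levels to enter (SLENTER) and to stay (SLSTAY)
* Perform the next step of forward selection: to enter, a new variable needs p < SLENTER
* Perform all steps of backward elimination: to stay, old variables need p < SLSTAY. Back to step 2.
* Finished when no new variables can enter and no variables can exit.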

## All possible models

* Select criterion of goodness of fit
* Construct all possible regression models
* Select the one with the best criterion

Can be very big: N predictors give 2^N - 1 possible models!
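
A sketch of the exhaustive approach on stand-in data (built-in `mtcars`, four predictors chosen arbitrarily). Adjusted R-squared is used as the criterion here purely as an example; any other goodness-of-fit criterion could be swapped in.

```r
# Exhaustive search on stand-in data; criterion: adjusted R-squared
response <- "mpg"
predictors <- c("wt", "hp", "qsec", "drat")

# Every non-empty subset of the predictors: 2^4 - 1 = 15 models
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one model per subset and record the criterion
adj_r2 <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = response), data = mtcars)
  summary(fit)$adj.r.squared
})

best <- subsets[[which.max(adj_r2)]]
print(length(subsets))
print(best)
```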

In [1]:
library(caTools)
set.seed(123)

source("~/src/jupyterrutils/VisualizationUtils.R")

[1] "Loading module to 'visutil' and 'vu'"


In [2]:
fp <- "../machine_learning_template_folder/Part 2 - Regression/Section 5 - Multiple Linear Regression/Multiple_Linear_Regression/50_Startups.csv"
df <- read.csv(fp)
head(df)

df$State <- factor(df$State)
str(df$State)

split <- sample.split(df$Profit, SplitRatio = 0.8)
training_set <- subset(df, split==T)
test_set <- subset(df, split==F)

dim(df)
dim(training_set)
dim(test_set)

R.D.Spend,Administration,Marketing.Spend,State,Profit
165349.2,136897.8,471784.1,New York,192261.8
162597.7,151377.59,443898.5,California,191792.1
153441.5,101145.55,407934.5,Florida,191050.4
144372.4,118671.85,383199.6,New York,182902.0
142107.3,91391.77,366168.4,Florida,166187.9
131876.9,99814.71,362861.4,New York,156991.1


 Factor w/ 3 levels "California","Florida",..: 3 1 2 3 2 3 1 2 3 1 ...


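Feature scaling is not needed as a separate step: a least-squares fit with `lm` gives the same predictions whatever the scale of the predictors, only the units of the coefficients change.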
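
A quick sketch to check this on stand-in data (built-in `mtcars`, not the startup data): the same model is fitted on raw and on standardized predictors, and the fitted values are compared.

```r
# Stand-in data: fit once on raw predictors, once on standardized ones
raw <- mtcars[, c("mpg", "wt", "hp")]
scaled <- raw
scaled[, c("wt", "hp")] <- scale(raw[, c("wt", "hp")])

fit_raw <- lm(mpg ~ wt + hp, data = raw)
fit_scaled <- lm(mpg ~ wt + hp, data = scaled)

# The coefficients are in different units ...
print(coef(fit_raw))
print(coef(fit_scaled))

# ... but the fitted values agree
print(all.equal(fitted(fit_raw), fitted(fit_scaled)))
```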

# Applying a regressor

The formula can also be written as `Profit ~ .`, where the dot stands for all other columns.

R creates dummy variables for the factor here and removes one of them (the baseline level). Among the predictors, only R.D.Spend shows a significant effect.

In [4]:
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State, data=training_set)
summary(regressor)


Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + 
    State, data = training_set)

Residuals:
   Min     1Q Median     3Q    Max 
-33128  -4865      5   6098  18065 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.977e+04  7.516e+03   6.622 1.36e-07 ***
R.D.Spend        7.986e-01  5.604e-02  14.251 6.70e-16 ***
Administration  -2.942e-02  5.828e-02  -0.505    0.617    
Marketing.Spend  3.268e-02  2.127e-02   1.537    0.134    
StateFlorida     1.162e+02  4.048e+03   0.029    0.977    
StateNew York   -1.213e+02  3.751e+03  -0.032    0.974    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9908 on 34 degrees of freedom
Multiple R-squared:  0.9499,	Adjusted R-squared:  0.9425 
F-statistic:   129 on 5 and 34 DF,  p-value: < 2.2e-16


# Predicting test set results

In [6]:
y_pred = predict(regressor, newdata=test_set)
y_pred

test_set

Unnamed: 0,R.D.Spend,Administration,Marketing.Spend,State,Profit
4,144372.41,118671.85,383199.62,New York,182901.99
5,142107.34,91391.77,366168.42,Florida,166187.94
8,130298.13,145530.06,323876.68,Florida,155752.6
11,101913.08,110594.11,229160.95,Florida,146121.95
16,114523.61,122616.84,261776.23,New York,129917.04
20,86419.7,153514.11,0.0,New York,122776.86
21,76253.86,113867.3,298664.47,California,118474.03
24,67532.53,105751.03,304768.73,Florida,108733.99
31,61994.48,115641.28,91131.24,Florida,99937.59
32,61136.38,152701.92,88218.23,New York,97483.56


# Backward elimination

We want every predictor in the model to contribute significantly. We use the `summary` function to carry out backward elimination by hand: after each fit, drop the predictor with the highest p-value and re-fit.

In [10]:
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State, data=df)
summary(regressor)


Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + 
    State, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-33504  -4736     90   6672  17338 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.013e+04  6.885e+03   7.281 4.44e-09 ***
R.D.Spend        8.060e-01  4.641e-02  17.369  < 2e-16 ***
Administration  -2.700e-02  5.223e-02  -0.517    0.608    
Marketing.Spend  2.698e-02  1.714e-02   1.574    0.123    
StateFlorida     1.988e+02  3.371e+03   0.059    0.953    
StateNew York   -4.189e+01  3.256e+03  -0.013    0.990    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9439 on 44 degrees of freedom
Multiple R-squared:  0.9508,	Adjusted R-squared:  0.9452 
F-statistic: 169.9 on 5 and 44 DF,  p-value: < 2.2e-16


In [11]:
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, data=df)
summary(regressor)


Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, 
    data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-33534  -4795     63   6606  17275 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.012e+04  6.572e+03   7.626 1.06e-09 ***
R.D.Spend        8.057e-01  4.515e-02  17.846  < 2e-16 ***
Administration  -2.682e-02  5.103e-02  -0.526    0.602    
Marketing.Spend  2.723e-02  1.645e-02   1.655    0.105    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9232 on 46 degrees of freedom
Multiple R-squared:  0.9507,	Adjusted R-squared:  0.9475 
F-statistic:   296 on 3 and 46 DF,  p-value: < 2.2e-16


In [12]:
regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data=df)
summary(regressor)


Call:
lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-33645  -4632   -414   6484  17097 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.698e+04  2.690e+03  17.464   <2e-16 ***
R.D.Spend       7.966e-01  4.135e-02  19.266   <2e-16 ***
Marketing.Spend 2.991e-02  1.552e-02   1.927     0.06 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9161 on 47 degrees of freedom
Multiple R-squared:  0.9505,	Adjusted R-squared:  0.9483 
F-statistic: 450.8 on 2 and 47 DF,  p-value: < 2.2e-16


# Automatic backward elimination code snippet

The same procedure - but automatic!

In [16]:
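backwardElimination <- function(x, sl) {
    # Refit and drop the least significant predictor until all p-values are <= sl
    repeat {
      regressor = lm(formula = Profit ~ ., data = x)
      # p-values of all coefficients except the intercept (row names are kept)
      pvals = coef(summary(regressor))[-1, "Pr(>|t|)", drop = FALSE]
      # Stop when everything is significant or only one predictor is left
      if (max(pvals) <= sl || ncol(x) <= 2) break
      # Map the worst coefficient back to its column; a factor is dropped as a
      # whole when one of its dummies has the highest p-value
      mm = model.matrix(regressor)
      worst = rownames(pvals)[which.max(pvals)]
      term = attr(terms(regressor), "term.labels")[attr(mm, "assign")[colnames(mm) == worst]]
      x = x[, names(x) != term, drop = FALSE]
    }
    return(summary(regressor))
}

SL = 0.05
backwardElimination(training_set, SL)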


Call:
lm(formula = Profit ~ ., data = x)

Residuals:
   Min     1Q Median     3Q    Max 
-34334  -4894   -340   6752  17147 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.902e+04  2.748e+03   17.84   <2e-16 ***
R.D.Spend   8.563e-01  3.357e-02   25.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9836 on 38 degrees of freedom
Multiple R-squared:  0.9448,	Adjusted R-squared:  0.9434 
F-statistic: 650.8 on 1 and 38 DF,  p-value: < 2.2e-16
