# Introduction to Linear Regression

[Chapters 2 & 3 of Introduction to Statistical Learning by Gareth James, et al.](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf)

## History behind Linear Regression
The English statistician Francis Galton studied the relationship between parents and their children. In particular, he investigated the relationship between the heights of fathers and their sons.

What he discovered was that a man's son tended to be roughly as tall as his father. However, Galton's breakthrough was noticing that the son's height tended to be closer than his father's to the overall average height of all people.

A good example would be Shaquille O'Neal. Shaq is really tall: 7 ft 1 in (about 2.16 m). If Shaq had a son, chances are he would be pretty tall too, but there is a very good chance that his son would not be as tall as Shaq.

This turns out to be the case: Shaq's son is pretty tall, 6 ft 7 in, but not nearly as tall as his dad.

Galton called this phenomenon __regression__, as in "a father's son's height tends to _regress_ (or drift) toward the average height."

## Example
Let's take the simplest example possible: calculating a regression with only 2 data points.

In [None]:
library(ggplot2)
# Two points, (1, 4) and (5, 10); matrix() fills its values column by column
df <- as.data.frame(matrix(c(1,5,4,10),nrow=2))
pl <- ggplot(data=df, aes(x=V1, y=V2))
pl + geom_point() + geom_line()

All we're trying to do when we calculate our regression line is to draw a line that's as close to every dot as possible. 

For classic linear regression, or the "Least Squares Method", you only measure the closeness in the vertical ("up and down") direction.
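
To make the vertical distance concrete, here is a minimal sketch that fits a line through the same two points with `lm()` and looks at the residuals, which are the vertical gaps between each point and the line. The names `two_pts` and `fit_two` are made up for this illustration. With only two points the line passes through both of them, so the residuals are essentially zero.

```r
# Same two points as in the plot above: (1, 4) and (5, 10)
two_pts <- data.frame(x = c(1, 5), y = c(4, 10))

# Fit a straight line y = intercept + slope * x by least squares
fit_two <- lm(y ~ x, data = two_pts)

# Intercept and slope of the fitted line
coef(fit_two)

# Vertical distances between each point and the line
residuals(fit_two)
```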

Now wouldn't it be great if we could apply the same concept to a graph with multiple data points? Doing this, we could

In [None]:
df2 <- read.csv("pearson.csv", header = TRUE)
pl <- ggplot(data=df2,aes(x=Father, y=Son))
pl + geom_jitter() + geom_smooth(aes(group=1),method ='lm',formula = y~x,se=FALSE,color='red')

Our goal with linear regression is to __minimize the vertical distance__ between all the data points and our line. 

So in determining the __best line__, we are attempting to minimize the distance from __all__ the points to our line.

There are lots of different ways to measure this distance (sum of squared errors, sum of absolute errors, etc.), but all these methods share the general goal of minimizing it.
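
As a rough illustration of these criteria, the sketch below scores candidate lines on a small made-up data set. The data and the helper names `sse` and `sae` are assumptions for this example only. It shows that the line returned by `lm()` has a smaller sum of squared errors than a slightly different line.

```r
# Small made-up data set (illustrative only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared errors for a line with the given intercept and slope
sse <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

# Sum of absolute errors for the same line
sae <- function(intercept, slope) sum(abs(y - (intercept + slope * x)))

# lm() picks the intercept and slope that minimize the sum of squared errors
fit <- lm(y ~ x)
b <- unname(coef(fit))

sse(b[1], b[2])        # SSE of the least-squares line
sse(b[1], b[2] + 0.1)  # a slightly different slope gives a larger SSE
sae(b[1], b[2])        # the same line scored by absolute error
```

Squared error penalizes large misses more heavily than absolute error does, which is one reason the two criteria can prefer different lines.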

# Using R for Linear Regression
Formulas in R take the form `y ~ x`. To add more predictor variables, just use the `+` sign, e.g. `y ~ x + z`.
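
The formula syntax can be tried out on R's built-in `mtcars` data set. This is only a sketch: `mtcars` and the chosen columns are stand-ins and are not part of this tutorial's data.

```r
# One predictor: miles per gallon as a function of weight
fit_one <- lm(mpg ~ wt, data = mtcars)

# Two predictors: add horsepower with the + sign
fit_two_pred <- lm(mpg ~ wt + hp, data = mtcars)

# A dot on the right-hand side means "every other column in the data frame"
fit_all <- lm(mpg ~ ., data = mtcars)

coef(fit_one)
coef(fit_two_pred)
```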

For this example, we will be using the [Student Performance Data Set from UC Irvine's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Student+Performance).

In [4]:
library(ggplot2)
library(ggthemes)
library(dplyr)

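# Load in the data. Note that this CSV is separated by semicolons.
# stringsAsFactors = TRUE reads text columns as factors (no longer the default since R 4.0)
df <- read.csv("student-mat.csv", sep = ";", stringsAsFactors = TRUE)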

In [5]:
head(df)
summary(df)
str(df)

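|  | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | `<fct>` | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<fct>` | `<fct>` | ... | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 2 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 3 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 4 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 5 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
| 6 | GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | ... | 5 | 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 |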


 school   sex          age       address famsize   Pstatus      Medu      
 GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
 MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
                  Median :17.0                             Median :3.000  
                  Mean   :16.7                             Mean   :2.749  
                  3rd Qu.:18.0                             3rd Qu.:4.000  
                  Max.   :22.0                             Max.   :4.000  
      Fedu             Mjob           Fjob            reason      guardian  
 Min.   :0.000   at_home : 59   at_home : 20   course    :145   father: 90  
 1st Qu.:2.000   health  : 34   health  : 18   home      :109   mother:273  
 Median :2.000   other   :141   other   :217   other     : 36   other : 32  
 Mean   :2.522   services:103   services:111   reputation:105               
 3rd Qu.:3.000   teacher : 58   teacher : 29                                
 Max.   :4.00

'data.frame':	395 obs. of  33 variables:
 $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
 $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
 $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
 $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
 $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
 $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
 $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
 $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
 $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
 $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
 $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
 $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
 $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
 $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
 $ schoolsup : Factor w/

## Attribute Information
Here is the attribute information for our data set:

- 1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
- 2 sex - student's sex (binary: 'F' - female or 'M' - male)
- 3 age - student's age (numeric: from 15 to 22)
- 4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
- 5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
- 6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
- 7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- 8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- 9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- 10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- 11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
- 12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
- 13 traveltime - home to school travel time (numeric: 1 - less than 15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - more than 1 hour)
- 14 studytime - weekly study time (numeric: 1 - less than 2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - more than 10 hours)
- 15 failures - number of past class failures (numeric: n if between 1 and 3, else 4)
- 16 schoolsup - extra educational support (binary: yes or no)
- 17 famsup - family educational support (binary: yes or no)
- 18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- 19 activities - extra-curricular activities (binary: yes or no)
- 20 nursery - attended nursery school (binary: yes or no)
- 21 higher - wants to take higher education (binary: yes or no)
- 22 internet - Internet access at home (binary: yes or no)
- 23 romantic - with a romantic relationship (binary: yes or no)
- 24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- 25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- 26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- 27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- 28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- 29 health - current health status (numeric: from 1 - very bad to 5 - very good)
- 30 absences - number of school absences (numeric: from 0 to 93)

__These grades are related to the course subject:__

- 31 G1 - first period grade (numeric: from 0 to 20)
- 32 G2 - second period grade (numeric: from 0 to 20)
- 33 G3 - final grade (numeric: from 0 to 20, output target)

## Clean the Data
Next we have to clean this data. This data set is actually already cleaned for you, but here are some things you may want to consider doing for other data sets:

#### Check for NA values
Let's see if we have any NA values:

In [6]:
any(is.na(df))

Great! Most real data sets will probably have NA or NULL values, so it's always good to check! It's up to you how to deal with them: either drop them if there aren't too many, or impute other values, such as the mean.
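
On a toy data frame, those two options might look like the following sketch. The data frame `toy` and its columns are made up for illustration and are not part of the student data.

```r
# Toy data frame with one missing score (illustrative only)
toy <- data.frame(id = 1:4, score = c(10, NA, 14, 12))

# Count missing values per column
colSums(is.na(toy))

# Option 1: drop every row that contains an NA
toy_dropped <- na.omit(toy)

# Option 2: replace missing scores with the mean of the observed scores
toy_imputed <- toy
toy_imputed$score[is.na(toy_imputed$score)] <- mean(toy$score, na.rm = TRUE)
```

Dropping rows is simplest but throws away the other columns of those rows; imputing keeps the rows at the cost of inventing a plausible value.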

## Categorical Features
Moving on, let's make sure that categorical variables are stored as factors. For example, the Mjob column refers to categories of job types, not some numeric value from 1 to 5. R is actually really good at detecting these sorts of values and will take care of this work for you a lot of the time, but always keep factor() in mind as an option. Luckily this is basically already done; we can check it in the str() output shown earlier.
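
If a categorical column had been read in as plain text, a conversion with `factor()` could look like this minimal sketch. The vector `jobs` is a made-up stand-in for a column such as `Mjob`.

```r
# Made-up job categories stored as plain character strings
jobs <- c("teacher", "health", "other", "teacher", "at_home")

# Convert to a factor so R treats the values as categories
jobs_f <- factor(jobs)

levels(jobs_f)  # the distinct categories
str(jobs_f)     # compact summary of the factor
```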

In [None]:
# Flag the numeric columns only
num.cols <- sapply(df,is.numeric)
# Correlation matrix of the numeric columns
cor.data <- cor(df[,num.cols])

print(cor.data)

In [None]:
# install.packages("corrgram")
library(corrgram)

# install.packages("corrplot")
library(corrplot)

In [None]:
print(corrplot(cor.data, method = 'color'))

In [None]:
corrgram(df)

In [None]:
corrgram(df, 
         order = TRUE,
         lower.panel = panel.shade,
         upper.panel = panel.pie,
         text.panel = panel.txt) 

In [None]:
ggplot(df,
       aes(x = G3)
      ) + 
geom_histogram(
    bins = 20, 
    alpha = 0.5,
    fill = 'blue'
)

In [None]:
#install.packages("caTools")

In [None]:
library(caTools)
set.seed(101)

In [None]:
# Logical vector marking roughly 70% of the rows as TRUE (training set)
sample <- sample.split(df$G3, SplitRatio = 0.7)

In [None]:
train <- subset(df, sample == TRUE)

In [None]:
test <- subset(df, sample == FALSE)

In [None]:
# Regress the final grade G3 on all other columns
model <- lm(G3 ~ ., data = train)

In [None]:
summary(model)

In [None]:
res <- residuals(model)
class(res)

In [None]:
res <- as.data.frame(res)
head(res)

In [None]:
ggplot(res, aes(res))+ geom_histogram(fill = 'blue', alpha = 0.5)

In [None]:
plot(model)

In [None]:
G3.predictions <- predict(model, test)

In [None]:
results <- cbind(G3.predictions, test$G3)
colnames(results) <- c('predicted','actual')
results <- as.data.frame(results)
print(results)

pl <- ggplot(data = results, aes(x = actual, y = predicted)) +geom_jitter()
pl

min(results$predicted)

In [None]:
# Grades cannot be negative, so map negative predictions to 0
to_zero <- function(x){
    if (x < 0){
        return(0)
    }else{
        return(x)
    }
}

In [None]:
results$predicted <- sapply(results$predicted, to_zero)

In [None]:
pl <- ggplot(data = results, aes(x = actual, y = predicted)) +geom_jitter()
pl

In [None]:
# Mean squared error on the test set
mse <- mean((results$actual - results$predicted)^2)
print(mse)
# Root mean squared error
print(mse^0.5)

In [None]:
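# Sum of squared errors of the predictions
SSE <- sum((results$predicted - results$actual)^2)
# Total sum of squares around the mean of G3 (mean taken over the full data set)
SST <- sum((mean(df$G3)- results$actual)^2)

# R-squared: share of the variation explained by the model
R2 <- 1 - (SSE/SST)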

In [None]:
R2