# **Multiple Linear Regression in R**

## **Importing the dataset**

In [1]:
ds = read.csv('/content/50_Startups.csv')
cat("First three rows of dataset", "\n")
head(ds, 3)

First three rows of dataset 


  R.D.Spend Administration Marketing.Spend      State   Profit
      <dbl>          <dbl>           <dbl>      <chr>    <dbl>
1  165349.2       136897.8        471784.1   New York 192261.8
2  162597.7       151377.6        443898.5 California 191792.1
3  153441.5       101145.6        407934.5    Florida 191050.4


## **Encoding categorical data**

In [2]:
ds$State = factor(ds$State,
                  levels = c('New York', 'California', 'Florida'),
                  labels = c(1, 2, 3))
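
To see what this encoding means for a model, the sketch below uses a small made-up data frame (the name `demo` and its rows are illustrative, not taken from `50_Startups.csv`) and `model.matrix()` to show the dummy columns R derives from a three-level factor. The first level, `1` (New York), acts as the baseline, which is why a fitted model reports coefficients only for `State2` and `State3`.

```r
# Illustrative data only; not rows from 50_Startups.csv
demo <- data.frame(
  State = c('New York', 'California', 'Florida', 'California'),
  stringsAsFactors = FALSE
)

# Same encoding as above: labels 1, 2, 3 in the order given by `levels`
demo$State <- factor(demo$State,
                     levels = c('New York', 'California', 'Florida'),
                     labels = c(1, 2, 3))

# model.matrix() expands the factor into an intercept plus one 0/1 column
# per non-baseline level
dummies <- model.matrix(~ State, data = demo)
print(colnames(dummies))
print(dummies)
```

Because `lm()` performs this expansion itself, the numeric labels are never treated as quantities; they only name the levels.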

## **Splitting the dataset into the Training set and Test set**

In [3]:
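# Install caTools only if it is not already available
if (!requireNamespace('caTools', quietly = TRUE)) install.packages('caTools')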
library(caTools)
set.seed(123)
split = sample.split(ds$Profit, SplitRatio = 4/5)
split

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [4]:
train_set = subset(ds, split == TRUE)
test_set = subset(ds, split == FALSE)
print(head(train_set,3))

  R.D.Spend Administration Marketing.Spend State   Profit
1  165349.2       136897.8        471784.1     1 192261.8
2  162597.7       151377.6        443898.5     2 191792.1
3  153441.5       101145.6        407934.5     3 191050.4


In [5]:
print(head(test_set,3))

  R.D.Spend Administration Marketing.Spend State   Profit
4  144372.4      118671.85        383199.6     1 182902.0
5  142107.3       91391.77        366168.4     3 166187.9
8  130298.1      145530.06        323876.7     3 155752.6
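
`sample.split()` comes from the `caTools` package. Where that package cannot be installed, a rough base-R alternative is sketched below on synthetic data. It draws a plain random 80/20 split with `sample()`, which is a different technique from `sample.split()` and will not reproduce the rows shown above. The data frame `demo` and its columns are made up for illustration.

```r
set.seed(123)

# Synthetic stand-in for the startup data (values are made up)
demo <- data.frame(x = rnorm(50), y = rnorm(50))

# Pick 80% of the row indices at random for training
n_train <- floor(0.8 * nrow(demo))
train_idx <- sample(seq_len(nrow(demo)), size = n_train)

train_demo <- demo[train_idx, ]
test_demo <- demo[-train_idx, ]

# The two parts should cover all 50 rows without overlap
cat("Training rows:", nrow(train_demo), "\n")
cat("Test rows:", nrow(test_demo), "\n")
```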


## **Fitting Multiple Linear Regression to the Training set**

In [6]:
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State, data = train_set)
# Equivalent shorthand: use every remaining column as a predictor
# regressor = lm(formula = Profit ~ ., data = train_set)

## **Predicting the Test set results**

In [7]:
y_pred = predict(regressor, newdata = test_set)
cat("Original data:", "\n")
print(head(test_set$Profit))
cat("Predicted data:", "\n")
print(head(y_pred))

Original data: 
[1] 182902.0 166187.9 155752.6 146122.0 129917.0 122776.9
Predicted data: 
       4        5        8       11       16       20 
173981.1 172655.6 160250.0 135513.9 146059.4 114151.0 
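
Comparing the two vectors by eye only goes so far. As a rough check, the sketch below computes the mean absolute error and the root mean squared error for the six values printed above. The vectors are copied from that output, and the variable names are chosen here for illustration. On the full test set the same expressions would use `test_set$Profit` and `y_pred`.

```r
# First six actual and predicted profits, copied from the output above
actual <- c(182902.0, 166187.9, 155752.6, 146122.0, 129917.0, 122776.9)
predicted <- c(173981.1, 172655.6, 160250.0, 135513.9, 146059.4, 114151.0)

errors <- actual - predicted

mae <- mean(abs(errors))       # mean absolute error
rmse <- sqrt(mean(errors^2))   # root mean squared error

cat("MAE: ", round(mae, 1), "\n")
cat("RMSE:", round(rmse, 1), "\n")
```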


## **Backward Elimination**

In [8]:
summary(regressor)


Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + 
    State, data = train_set)

Residuals:
   Min     1Q Median     3Q    Max 
-33128  -4865      5   6098  18065 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.965e+04  7.637e+03   6.501 1.94e-07 ***
R.D.Spend        7.986e-01  5.604e-02  14.251 6.70e-16 ***
Administration  -2.942e-02  5.828e-02  -0.505    0.617    
Marketing.Spend  3.268e-02  2.127e-02   1.537    0.134    
State2           1.213e+02  3.751e+03   0.032    0.974    
State3           2.376e+02  4.127e+03   0.058    0.954    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9908 on 34 degrees of freedom
Multiple R-squared:  0.9499,	Adjusted R-squared:  0.9425 
F-statistic:   129 on 5 and 34 DF,  p-value: < 2.2e-16


### **Removing the least significant variable (State) from the model**

In [9]:
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, data = train_set)
summary(regressor)


Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, 
    data = train_set)

Residuals:
   Min     1Q Median     3Q    Max 
-33117  -4858    -36   6020  17957 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.970e+04  7.120e+03   6.980 3.48e-08 ***
R.D.Spend        7.983e-01  5.356e-02  14.905  < 2e-16 ***
Administration  -2.895e-02  5.603e-02  -0.517    0.609    
Marketing.Spend  3.283e-02  1.987e-02   1.652    0.107    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9629 on 36 degrees of freedom
Multiple R-squared:  0.9499,	Adjusted R-squared:  0.9457 
F-statistic: 227.6 on 3 and 36 DF,  p-value: < 2.2e-16


In [10]:
# Removing the next least significant variable (Administration) from the model
regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = train_set)
summary(regressor)


Call:
lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = train_set)

Residuals:
   Min     1Q Median     3Q    Max 
-33294  -4763   -354   6351  17693 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.638e+04  3.019e+03  15.364   <2e-16 ***
R.D.Spend       7.879e-01  4.916e-02  16.026   <2e-16 ***
Marketing.Spend 3.538e-02  1.905e-02   1.857   0.0713 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9533 on 37 degrees of freedom
Multiple R-squared:  0.9495,	Adjusted R-squared:  0.9468 
F-statistic: 348.1 on 2 and 37 DF,  p-value: < 2.2e-16


In [11]:
# Removing the next least significant variable (Marketing.Spend) from the model
regressor = lm(formula = Profit ~ R.D.Spend, data = train_set)
summary(regressor)


Call:
lm(formula = Profit ~ R.D.Spend, data = train_set)

Residuals:
   Min     1Q Median     3Q    Max 
-34334  -4894   -340   6752  17147 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.902e+04  2.748e+03   17.84   <2e-16 ***
R.D.Spend   8.563e-01  3.357e-02   25.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9836 on 38 degrees of freedom
Multiple R-squared:  0.9448,	Adjusted R-squared:  0.9434 
F-statistic: 650.8 on 1 and 38 DF,  p-value: < 2.2e-16
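
The adjusted R-squared values in the four summaries can be recomputed from the printed R-squared values with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n = 40 training rows and p is the number of estimated slope coefficients. The sketch below does this as a sanity check. The helper `adj_r2` is a name introduced here, and because the inputs are already rounded to four decimals, the results agree with the printed values only to within rounding.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of slope coefficients p (intercept excluded)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 40  # residual degrees of freedom plus number of coefficients

models <- data.frame(
  model = c("R.D.Spend + Administration + Marketing.Spend + State",
            "R.D.Spend + Administration + Marketing.Spend",
            "R.D.Spend + Marketing.Spend",
            "R.D.Spend"),
  r2 = c(0.9499, 0.9499, 0.9495, 0.9448),  # Multiple R-squared as printed
  p = c(5, 3, 2, 1)                        # State contributes two dummies
)
models$adj_r2 <- round(adj_r2(models$r2, n, models$p), 4)
print(models)
```

The penalty term grows with p, which is how the first two models can share the same R-squared yet differ in adjusted R-squared.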


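After the final step only 'R.D.Spend' remains. Its p-value is below 2e-16, so it is a highly significant predictor of 'Profit'. Note that dropping 'Marketing.Spend' (p-value 0.0713) lowered the adjusted R-squared from 0.9468 to 0.9434, so by that measure the two-variable model fits the training set slightly better.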
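
The manual refits above follow one rule: drop the term with the highest p-value and fit again. A minimal sketch of that loop is given below, assuming a significance level of 0.05 and using synthetic data, since the CSV is not bundled here. The function `backward_eliminate`, its `sl` parameter, and the `demo` data frame are all names introduced for illustration, and this is one possible way to automate the procedure, not necessarily the one the notebook had in mind.

```r
# Hypothetical helper: repeat "drop the term with the highest p-value"
# until every remaining term has p <= sl. drop1() tests each term as a
# whole, so a factor such as State would be kept or dropped as one unit.
backward_eliminate <- function(fit, sl = 0.05) {
  repeat {
    if (length(attr(terms(fit), "term.labels")) == 0) break
    tab <- drop1(fit, test = "F")[-1, , drop = FALSE]  # skip the <none> row
    worst <- which.max(tab[["Pr(>F)"]])
    if (tab[["Pr(>F)"]][worst] <= sl) break
    fit <- update(fit, as.formula(paste(". ~ . -", rownames(tab)[worst])))
  }
  fit
}

# Synthetic data: y depends on x1 only; x2 and x3 are pure noise
set.seed(42)
demo <- data.frame(x1 = rnorm(40), x2 = rnorm(40), x3 = rnorm(40))
demo$y <- 5 + 3 * demo$x1 + rnorm(40)

final <- backward_eliminate(lm(y ~ x1 + x2 + x3, data = demo))
print(formula(final))
```

With a signal this strong, `x1` should survive. Whether a noise column happens to pass the threshold depends on the random draw, which is also a reminder that p-value-based elimination can keep or discard variables by chance.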