# **Multiple Linear Regression**

Multiple linear regression is a statistical technique that predicts the value of one variable from the values of two or more other variables. It is also known simply as multiple regression, and it extends simple linear regression, which uses a single predictor. The variable we want to predict is the dependent variable, while the variables used to predict it are the independent (or explanatory) variables.

### **Let's see the difference in formulae**

<img src="https://miro.medium.com/max/3444/1*uLHXR8LKGDucpwUYHx3VaQ.png">
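
In case the image does not load, the two formulae are commonly written as below. The symbols ($b_0$ for the intercept, $b_i$ for the coefficients, $x_i$ for the independent variables) are a standard convention assumed here and may differ from the notation in the figure.

$$
\text{Simple linear regression:} \quad y = b_0 + b_1 x_1
$$

$$
\text{Multiple linear regression:} \quad y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
$$

The only difference is the number of terms: each additional independent variable contributes its own coefficient.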


### **Let's see the visual of Multiple Linear Regression**

<img src="https://in.mathworks.com/help/stats/categorial_slopes1.png">

The plot shows several x variables and, correspondingly, several regression lines. Treat the red, green, and blue lines as different x variables. The y variable (MPG) depends on all of the independent variables, because each x variable is correlated with y.




## **Getting Started with Multiple Linear Regression**

In [27]:
#importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math

## **DataFraming**

Read .csv data into a DataFrame

In [28]:
df = pd.read_csv("/content/sample_data/50_Startups.csv")

df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [29]:
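# Features: .values converts the selected DataFrame columns into a NumPy array
X = df[["R&D Spend", "Administration", "Marketing Spend", "State"]].values
print(X)
# Target: the double brackets keep y two-dimensional, with shape (n_samples, 1)
y = df[["Profit"]].values
print(y)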

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

### **Preprocessing**

## **Converting Text Variables to Numbers**

In [30]:
# One-hot encode the State column (index 3); the numeric columns pass through unchanged
st = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder = 'passthrough')
X = st.fit_transform(X)
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3
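
The output above puts three new 0/1 columns in front of the numeric ones. A minimal sketch of why, using only three rows copied from the data shown earlier: `OneHotEncoder` sorts the category names, and `ColumnTransformer` places the transformed columns before the passed-through remainder. The names `X_demo`, `ct`, and `X_encoded` are illustrative, and `sparse_threshold=0` is an added assumption that forces a dense array so the result is easy to print.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows in the same layout as X: three numeric columns, then State.
# dtype=object keeps the numbers as floats next to the strings.
X_demo = np.array(
    [
        [165349.2, 136897.8, 471784.1, "New York"],
        [162597.7, 151377.59, 443898.53, "California"],
        [153441.51, 101145.55, 407934.54, "Florida"],
    ],
    dtype=object,
)

# Same transformer as in the notebook, plus sparse_threshold=0 (assumption)
# so that fit_transform always returns a dense array.
ct = ColumnTransformer(
    [("State", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X_demo)

# The fitted encoder stores the categories in sorted order; that order
# determines which 0/1 column belongs to which state.
categories = ct.named_transformers_["State"].categories_[0]
print(list(categories))
print(X_encoded[0])  # first row is New York, so the third indicator is 1
```

Reading the categories from the fitted encoder is safer than guessing the column order from the printed array.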

## **Splitting Dataset**

In [35]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=0)
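
As a quick check of what `test_size=0.2` does, here is a small sketch on placeholder arrays with 50 rows, the row count the file name 50_Startups.csv suggests (an assumption, since the full data is not shown). The arrays `X_demo` and `y_demo` are made up for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows, 2 feature columns, 1 target column.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50).reshape(50, 1)

# 20% of the rows go to the test set; random_state fixes the shuffle
# so the same rows are selected on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

With 50 rows this yields 40 training rows and 10 test rows, which matches the 10 predictions printed later in the notebook.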

## **Training the Model**
We use the LinearRegression model imported from scikit-learn and fit it on the training data: X_train holds the independent variables and y_train holds the dependent variable (Profit).

In [36]:
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
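
To see what fitting actually learns, here is a minimal sketch on synthetic, noise-free data where the true intercept and coefficients are known in advance. The data and the names `X_toy`, `y_toy`, and `toy_model` are invented for illustration; they are not part of the startup dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 5 + 2*x1 - 1*x2 + 0.5*x3 (no noise).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(40, 3))
true_coefficients = np.array([2.0, -1.0, 0.5])
y_toy = 5.0 + X_toy @ true_coefficients

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# intercept_ corresponds to b0 and coef_ to b1..bn in the regression formula.
print(toy_model.intercept_)
print(toy_model.coef_)
```

The same two attributes are available on the `model` fitted above, with one coefficient per column of the encoded X.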

## **Prediction**
We will predict Profit for the test set from its R&D Spend, Administration, Marketing Spend, and State values.

In [37]:
y_pred = model.predict(X_test)

In [43]:
#Comparison

print('- y_pred : ')
print(y_pred)
print('- y_test : ')
print(y_test)

- y_pred : 
[[103015.20159795]
 [132582.27760816]
 [132447.73845175]
 [ 71976.09851258]
 [178537.48221057]
 [116161.24230167]
 [ 67851.69209676]
 [ 98791.73374687]
 [113969.43533014]
 [167921.06569552]]
- y_test : 
[[103282.38]
 [144259.4 ]
 [146121.95]
 [ 77798.83]
 [191050.39]
 [105008.31]
 [ 81229.06]
 [ 97483.56]
 [110352.25]
 [166187.94]]


In [48]:
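# Ratio of total predicted profit to total actual profit. Over- and under-predictions
# cancel in the sums, so this is a rough sanity check, not a true accuracy.
print(y_pred.sum())
print(y_test.sum())

print("So called Accuracy ", y_pred.sum()/y_test.sum())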

1183253.9675519639
1222774.07
So called Accuracy  0.9676799636027315
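
The notebook imports `mean_squared_error` but never calls it. A hedged sketch of a more conventional evaluation follows, assuming RMSE and R² are the metrics of interest (the notebook does not say). The arrays are the ten predictions and ten actual values copied from the comparison output above, with predictions rounded to two decimals; `y_pred_shown` and `y_test_shown` are illustrative names.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Values copied from the y_pred / y_test comparison printed above.
y_pred_shown = np.array([
    103015.20, 132582.28, 132447.74, 71976.10, 178537.48,
    116161.24, 67851.69, 98791.73, 113969.44, 167921.07,
])
y_test_shown = np.array([
    103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
    105008.31, 81229.06, 97483.56, 110352.25, 166187.94,
])

# RMSE: typical size of a prediction error, in the same units as Profit.
rmse = np.sqrt(mean_squared_error(y_test_shown, y_pred_shown))

# R^2: share of the variance in the actual values explained by the predictions.
r2 = r2_score(y_test_shown, y_pred_shown)

print("RMSE:", rmse)
print("R^2 :", r2)
```

Unlike the ratio of sums, both metrics penalize every individual error, so over- and under-predictions cannot cancel each other out.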


In [46]:
y_pred/y_test

array([[0.99741313],
       [0.91905469],
       [0.90641918],
       [0.92515657],
       [0.93450467],
       [1.10621   ],
       [0.83531303],
       [1.01341943],
       [1.03277854],
       [1.01042871]])