## **Data Preprocessing**

### **Importing the Libraries**

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### **Importing the Data**

In [2]:
dataset = pd.read_csv("/content/50_Startups.csv")
X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column

In [None]:
print(X)

### **Encoding Categorical Data**

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the categorical column at index 3; pass the other columns through unchanged.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough',
)
X = np.array(ct.fit_transform(X))

In [5]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

### **Splitting the Data**

In [7]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows as the test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

**Note: Feature scaling is not needed for multiple linear regression, because each fitted coefficient compensates for the scale of its feature.**
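The note can be checked empirically: with ordinary least squares, standardizing the features changes the coefficients but not the predictions. The sketch below uses small synthetic data rather than the startup dataset, and the names `X_demo`, `y_demo`, `raw_model`, and `scaled_model` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.5, 0.01]) + rng.normal(size=50)

# Fit once on the raw features and once on standardized features.
raw_model = LinearRegression().fit(X_demo, y_demo)
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_model = LinearRegression().fit(X_scaled, y_demo)

# The coefficients differ, but the fitted values agree.
same_predictions = np.allclose(raw_model.predict(X_demo),
                               scaled_model.predict(X_scaled))
print("Raw coefficients:   ", raw_model.coef_)
print("Scaled coefficients:", scaled_model.coef_)
print("Same predictions:   ", same_predictions)
```

Scaling can still matter for other estimators (for example regularized or gradient-based ones), so this observation applies to plain least squares only.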

## **Building a Model**

### **Backward Elimination**

Step 1: Select a significance level to stay in the model (e.g. SL = 0.05).

Step 2: Fit the full model with all possible predictors.

Step 3: Consider the predictor with the highest P-value. If P > SL, go to Step 4; otherwise go to FIN.

Step 4: Remove the predictor.

Step 5: Fit the model without this variable, then return to Step 3.

**FIN : Your Model is Ready**
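The Backward Elimination steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: it runs on synthetic data, the helper names `ols_p_values` and `backward_elimination` are hypothetical, and the P-values come from the standard OLS t-test computed with NumPy and SciPy. If applied to the one-hot encoded `X` above, one dummy column would have to be dropped first, since all dummy columns plus an intercept are linearly dependent.

```python
import numpy as np
from scipy import stats

def ols_p_values(design, y):
    """Two-sided t-test P-values for every column of an OLS design matrix."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - k
    sigma2 = residuals @ residuals / dof              # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)

def backward_elimination(X, y, sl=0.05):
    """Return the column indices of X that survive Backward Elimination."""
    kept = list(range(X.shape[1]))                    # Step 2: start with all predictors
    while kept:
        design = np.column_stack([np.ones(len(y)), X[:, kept]])
        p_values = ols_p_values(design, y)[1:]        # skip the intercept
        worst = int(np.argmax(p_values))              # Step 3: highest P-value
        if p_values[worst] <= sl:
            break                                     # FIN: every predictor is significant
        kept.pop(worst)                               # Steps 4-5: remove it and refit
    return kept

# Synthetic data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=200)

selected = backward_elimination(X_demo, y_demo)
print("Predictors kept:", selected)
```

The two informative columns always survive here; a pure-noise column can occasionally survive as well, which is expected at a 5% significance level.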

### **Forward Selection**

Step 1: Select a significance level to enter the model (e.g. SL = 0.05)

Step 2: Fit all simple regression models **y ~ xn**. Select the one with the lowest P-value.

Step 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have.

Step 4: Consider the predictor with the lowest P-value. If P < SL, go to Step 3; otherwise go to FIN and keep the previous model.

**FIN : Your Model is Ready**
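Forward Selection can be sketched the same way. Again this is only an illustration on synthetic data: `last_p_value` and `forward_selection` are hypothetical helper names, and the P-value is the usual OLS t-test for the newly added predictor.

```python
import numpy as np
from scipy import stats

def last_p_value(design, y):
    """Two-sided t-test P-value for the last column of an OLS design matrix."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - k
    sigma2 = residuals @ residuals / dof
    se = np.sqrt(sigma2 * np.linalg.inv(design.T @ design)[-1, -1])
    return 2 * stats.t.sf(abs(beta[-1] / se), dof)

def forward_selection(X, y, sl=0.05):
    """Return the column indices of X in the order they enter the model."""
    selected = []
    remaining = list(range(X.shape[1]))
    intercept = np.ones((len(y), 1))
    while remaining:
        # Steps 2-3: try each remaining predictor on top of the current model.
        p_values = {
            j: last_p_value(np.hstack([intercept, X[:, selected + [j]]]), y)
            for j in remaining
        }
        best = min(p_values, key=p_values.get)   # Step 4: lowest P-value
        if p_values[best] >= sl:
            break                                # FIN: keep the previous model
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=200)

selected = forward_selection(X_demo, y_demo)
print("Predictors added, in order:", selected)
```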


### **Bidirectional Elimination**

Step 1: Select a significance level to enter and to stay in the model, e.g. SLENTER = 0.05, SLSTAY = 0.05.

Step 2: Perform the next step of Forward Selection (new variables must have P < SLENTER to enter).

Step 3: Perform all steps of Backward Elimination (old variables must have P < SLSTAY to stay), then return to Step 2.

Step 4: Stop when no new variables can enter and no old variables can exit.

**FIN : Your Model is Ready**

### **All Possible Models**

Step 1: Select a criterion of goodness of fit (e.g. the Akaike information criterion).

Step 2: Construct all possible regression models: 2^N - 1 total combinations.

Step 3: Select the one with the best criterion.

**FIN : Your Model is Ready**

Example: 10 columns give 2^10 - 1 = 1023 models, so this approach quickly becomes expensive as the number of columns grows.
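The exhaustive search can be sketched with `itertools.combinations`. This is an illustrative sketch on synthetic data with four predictors (15 candidate models); `aic` and `best_subset` are hypothetical helper names, and the AIC used is the Gaussian least-squares form `n * log(RSS / n) + 2k`, which drops additive constants that are the same for every model.

```python
import itertools
import numpy as np

def aic(X, y):
    """AIC of an OLS fit with intercept (Gaussian errors, constants dropped)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    k = design.shape[1] + 1          # coefficients plus the noise variance
    return n * np.log(rss / n) + 2 * k

def best_subset(X, y):
    """Score every non-empty subset of columns and return the lowest-AIC one."""
    n_features = X.shape[1]
    subsets = [
        combo
        for size in range(1, n_features + 1)
        for combo in itertools.combinations(range(n_features), size)
    ]
    scores = {combo: aic(X[:, list(combo)], y) for combo in subsets}
    return min(scores, key=scores.get), len(subsets)

# Synthetic data: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=200)

best, n_models = best_subset(X_demo, y_demo)
print("Models evaluated:", n_models)   # 2^4 - 1
print("Best subset by AIC:", best)
```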

## **Multiple Linear Regression**

### **Training the Multiple Linear Regression Model on the Training Set**

In [8]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

### **Predicting the Test Set Results**

In [9]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Print predictions (left column) next to the actual values (right column).
print(np.concatenate((y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis=1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
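To summarize how close the two columns are, one option is to compute the mean absolute error and R² directly from the printed values. The sketch below copies the ten rows shown above into an array; the names `results`, `mae`, and `r2` are made up for illustration, and in the notebook itself the same quantities could be computed from `y_pred` and `y_test`.

```python
import numpy as np

# Rows copied from the printed output: column 0 = predicted, column 1 = actual.
results = np.array([
    [103015.20, 103282.38],
    [132582.28, 144259.40],
    [132447.74, 146121.95],
    [ 71976.10,  77798.83],
    [178537.48, 191050.39],
    [116161.24, 105008.31],
    [ 67851.69,  81229.06],
    [ 98791.73,  97483.56],
    [113969.44, 110352.25],
    [167921.07, 166187.94],
])
predicted, actual = results[:, 0], results[:, 1]

# Mean absolute error: average size of the prediction error.
mae = np.mean(np.abs(actual - predicted))

# R^2: share of the variance in the actual values explained by the predictions.
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.3f}")
```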
