> # Enough to be Dangerous: Multiple Linear Regression

> ### This is the second notebook in my **"Enough to be Dangerous"** notebook series

See the other notebooks here:

[Simple Linear regression](https://www.kaggle.com/thaddeussegura/enough-to-be-dangeous-simple-linear-regression)

[Polynomial regression](https://www.kaggle.com/thaddeussegura/enough-to-be-dangerous-polynomial-regression)


 

> ### This notebook is separated into two parts:

**1) Conceptual Overview:**  I will introduce the topic in 200 words or less.

**2) Implementation:**  I will implement the algorithm in as few lines as possible.

## Conceptual Overview

Like simple linear regression, multiple regression is a “supervised” “regression” algorithm.

![image.png](attachment:image.png)


Supervised meaning we use labeled data to train the model.

![image.png](attachment:image.png)

Regression meaning we predict a numerical value, instead of a “class”.

![image.png](attachment:image.png)

However, in multiple regression, we have multiple independent variables that impact the dependent variable.

![image.png](attachment:image.png)

Least squares is still used, but instead of fitting a line to our data, we fit an (n-1)-dimensional plane, where n counts the independent variables plus the dependent variable. (Ex: 3D data -> 2D plane.)

![image.png](attachment:image.png)
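
To make the plane-fitting idea concrete, here is a minimal sketch that fits a 2D plane to 3D data with NumPy's least-squares solver. The data is synthetic and noise-free (an assumption made purely for illustration), so the recovered coefficients should come out very close to the ones used to generate it.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3 + 2*x1 - 1*x2, no noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 2))
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Least squares: find the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b1, b2 = coef

print(intercept, b1, b2)  # approximately 3, 2, -1
```

With two independent variables and one dependent variable, the three fitted numbers describe a plane in 3D space.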

Before applying Multiple Regression, we must test 4 specific assumptions, which can be done with any statistical software.

![image.png](attachment:image.png)
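
The four assumptions are not spelled out in the text here, so the following is only a sketch of one check that is commonly made before multiple regression: looking for independent variables that are strongly correlated with each other (multicollinearity). The data, the column names, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic features (hypothetical): "c" is almost an exact multiple of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = 2 * a + rng.normal(scale=0.01, size=100)

features = np.column_stack([a, b, c])
names = ["a", "b", "c"]

# Pairwise correlation between the columns (rowvar=False treats columns as variables).
corr = np.corrcoef(features, rowvar=False)

# Flag every pair whose absolute correlation exceeds the (arbitrary) threshold.
threshold = 0.9
flagged = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

print(flagged)
```

A flagged pair suggests that one of the two variables adds little new information and may be a candidate for removal.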

Next, we must select which independent variables to include, striking a balance between quality of fit and number of variables. This balance is called “parsimony”.

![image.png](attachment:image.png)
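
One common way to put a number on this trade-off is the adjusted R^2, which penalizes each extra variable. This notebook does not use it, so treat the sketch below as a side illustration; the R^2 values and variable counts are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical comparison: two extra variables raise plain R^2 only slightly.
base = adjusted_r2(0.930, n_samples=50, n_features=3)
bigger = adjusted_r2(0.931, n_samples=50, n_features=5)

print(round(base, 4), round(bigger, 4))
```

In this made-up case the larger model has the higher plain R^2 but the lower adjusted R^2, which is the kind of signal that favors the smaller model.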

Once we’ve completed the regression, we evaluate the fit with the “R^2 score”, which tells us how closely our predictions match the data.

![image.png](attachment:image.png)
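
For intuition, R^2 can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up values and a hypothetical helper name, `r2_manual`.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of our predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values for illustration.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

score = r2_manual(y_true, y_pred)
print(round(score, 3))  # 0.948
```

A perfect prediction gives 1.0, and always predicting the mean of the data gives 0.0.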


Overall, multiple regression is very applicable to real-world problems. However, practitioners must test the assumptions and apply parsimony for valid conclusions to be made.

## Implementation

In this section I will implement the code in its simplest version so that it is understandable if you are brand new to machine learning.

Below we will use the attributes of 50 startups to predict their profit.

**The independent variables are**:
* R&D Spend 
* Administrative Spend 
* Marketing Spend 
* State of operation.
    
**The dependent variable is** profit. 
    
The first step is to start with "imports". These are "libraries" of pre-written code that will help us significantly.

In [1]:
#Numpy is used so that we can work with arrays, which are necessary for the linear algebra
# that takes place "under the hood" in these algorithms.

import numpy as np


#Pandas is used so that we can create dataframes, which is particularly useful when
# reading or writing from a CSV.

import pandas as pd


#Matplotlib is used to generate graphs in just a few lines of code.

import matplotlib.pyplot as plt

#Sklearn is a very common library that allows you to implement most basic ML algorithms.
#OneHotEncoder and ColumnTransformer are needed because we have a column of categorical data
# (LabelEncoder is imported as well, but this approach does not require it).

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

#Train_test_split will allow us to quickly split our dataset into a training set and a test set.

from sklearn.model_selection import train_test_split


#LinearRegression is the class of the algorithm we will be using.

from sklearn.linear_model import LinearRegression


#This will allow us to evaluate our fit using the R^2 score. 

from sklearn.metrics import r2_score



With our imports complete, we now read in the data using Pandas.

We will set the independent variables (X) and the dependent variable (y).

In [2]:
#read dataset from csv
dataset = pd.read_csv('../input/50-startups/50_Startups.csv')

#set independent variable using all rows, and all columns except for the last one.
X = dataset.iloc[:, :-1].values

#set the dependent variable using all rows, but only the last column (Profit).
y = dataset.iloc[:, -1].values

#Let's look at our data
dataset


Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
5,131876.9,99814.71,362861.36,New York,156991.12
6,134615.46,147198.87,127716.82,California,156122.51
7,130298.13,145530.06,323876.68,Florida,155752.6
8,120542.52,148718.95,311613.29,New York,152211.77
9,123334.88,108679.17,304981.62,California,149759.96


Next we have to deal with our categorical column (State).

We will do this by "one-hot encoding" it, which replaces the column with one 0/1 column per category.

In [3]:
#One-hot encode the State column (index 3) and pass the other columns through unchanged.
ct = ColumnTransformer([("State", OneHotEncoder(), [3])], remainder = 'passthrough')
X = ct.fit_transform(X)

#We need to omit one of the encoded columns to avoid the dummy variable trap.
#Dropping the first one (California) leaves two columns: Florida, New York.
X = X[:, 1:]

#take a look at X now.
X

array([[0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 1.0, 94657.16, 145077.58, 282574.31],
       [1.0, 0.0, 91749.16, 114175.79, 294919.57],
       [0.0, 1.0, 86419.7
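
To see what the encoding does in isolation, here is a small sketch on a three-row toy frame. It uses `pd.get_dummies` rather than the `OneHotEncoder` from the cell above, purely because it prints readable column names; the toy data is an assumption for illustration.

```python
import pandas as pd

# Toy frame with one row per state, in the same order as the first rows of the dataset.
demo = pd.DataFrame({"State": ["New York", "California", "Florida"]})

# drop_first=True removes the first category (alphabetically, California),
# which plays the same role as X = X[:, 1:] in the notebook.
dummies = pd.get_dummies(demo["State"], drop_first=True, dtype=int)

print(dummies)
```

California ends up as the row with zeros in both remaining columns, which matches the pattern in the first three rows of `X` above.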

With our data loaded, we now need to split the data into training and test sets.

In [4]:
#This will create x and y variables for training and test sets.
#Here we are using 25% of our examples for the test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
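
As a rough picture of what a shuffled split does, the sketch below divides 50 row indices into a training part and a test part using NumPy alone. It is not sklearn's implementation, and rounding the test size up is simply a choice made for this sketch.

```python
import math
import numpy as np

n_samples = 50
test_size = 0.25

# Shuffle the row indices so the split does not depend on the file order.
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_samples)

# Rounding up is an assumption made for this sketch.
n_test = math.ceil(n_samples * test_size)
test_idx = shuffled[:n_test]
train_idx = shuffled[n_test:]

print(len(train_idx), len(test_idx))  # 37 13
```

Every row lands in exactly one of the two sets, so the model is later scored on examples it never saw during training.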


Now it's time to create and train the model.

In [5]:
#this creates regressor, an instance of the LinearRegression class from the Sklearn library.
regressor = LinearRegression()

#this fits the model to our training data.
regressor.fit(X_train, y_train)

LinearRegression()
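
A fitted `LinearRegression` stores what it learned in its `intercept_` and `coef_` attributes. The sketch below fits a separate model on synthetic, noise-free data (an assumption for illustration, with hypothetical names `X_demo`, `y_demo`, and `model`) so the learned values can be compared against known ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 5 + 2*x1 + 0.5*x2, no noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(30, 2))
y_demo = 5 + 2 * X_demo[:, 0] + 0.5 * X_demo[:, 1]

model = LinearRegression()
model.fit(X_demo, y_demo)

# One coefficient per column of X_demo, plus a single intercept.
print(model.intercept_)  # close to 5
print(model.coef_)       # close to [2, 0.5]
```

Each coefficient is the predicted change in y for a one-unit increase in that column, holding the other columns fixed.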

With our model built, we can now use it for generating predictions.

We will use our test set so we can see how well it did.

In [6]:
#Predict on our test set.
y_pred = regressor.predict(X_test)

Because we have 4 independent variables (technically more after one-hot encoding), the fit cannot be shown in a single 2D or 3D plot, so we will skip the visualization step.

Instead we will move on directly to the R^2 score, which tells us how much of the variation in our dependent variable can be explained by our independent variables.

In [7]:
#calculate the R^2 score
score = r2_score(y_test, y_pred)

#print out our score properly formatted as a percent.
print("R^2 score:", "{:.0%}".format(score))

R^2 score: 93%


We can now apply this trained model to novel examples to predict their profit.  

In [8]:
#Prediction for a business in Florida (dummy columns are Florida = 1, New York = 0; California would be 0, 0),
# with R&D of 160,000, Admin of 130,000 and Marketing of 300,000.
print(regressor.predict([[1, 0, 160000, 130000, 300000]]))

[181502.23688748]
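
Because the model only sees numbers, it is easy to mix up the state columns when typing a new example by hand. A small helper can build the row from a state name instead. The sketch below is hypothetical (`STATE_DUMMIES` and `build_row` are names introduced here) and assumes the column order visible in the `X` output earlier: Florida, New York, R&D Spend, Administration, Marketing Spend, with California as the dropped baseline.

```python
# Dummy values per state, assuming California was the dropped (baseline) column.
STATE_DUMMIES = {
    "California": [0, 0],
    "Florida": [1, 0],
    "New York": [0, 1],
}

def build_row(state, rd_spend, admin_spend, marketing_spend):
    """Return one feature row in the order the model was trained on."""
    return STATE_DUMMIES[state] + [rd_spend, admin_spend, marketing_spend]

row = build_row("California", 160000, 130000, 300000)
print(row)  # [0, 0, 160000, 130000, 300000]
```

The result could then be passed to the trained model as `regressor.predict([row])`.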
