# Multiple Linear Regression

### Definition

**Multiple linear regression** is an extension of simple linear regression, used to model the relationship between two or more independent variables (inputs) and a single dependent variable (output) by fitting a linear equation to the observed data.

### Formula:

The formula for multiple linear regression expands on the simple linear regression formula to accommodate multiple independent variables:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon$

Here's what each element in the formula represents:

- `y`: The dependent variable, target variable, or the output that we're trying to predict.
- `β0`: The constant or intercept. It represents the predicted value of $y$ when all the independent variables are zero.
- `β1`, `β2`, ..., `βn`: The coefficients of the independent variables. Each coefficient represents the change in the predicted $y$ value for a one-unit increase in the corresponding independent variable, holding all other variables constant.
- `x1`, `x2`, ..., `xn`: The independent variables or predictors. These are the variables we think have an effect on our dependent variable.
- `ε`: The error term. It represents the part of $y$ not explained by the model, capturing all other factors that influence $y$ but are not included in the model. After fitting, it is estimated by the residuals.

### Key Differences from Simple Linear Regression:

- **Number of Predictors**: Involves multiple independent variables compared to one in simple linear regression.
- **Interpretation of Coefficients**: Each coefficient must be interpreted while holding all other variables constant, which can be more complex than in simple linear regression.
- **Assumptions**: Additional assumptions are required, like no perfect multicollinearity among the independent variables.
- **Model Complexity**: The model is more complex and requires more data to estimate the coefficients reliably.

### Example

Imagine we want to predict a house’s selling price (our dependent variable $y$) based on its size, age, and location. Here, the independent variables ($x_1, x_2, x_3$) would be the size of the house, its age, and a numerical value representing the location.

- `β1`, `β2`, and `β3` would tell us how much we expect the selling price to change with a one-unit change in house size, age, and location value, respectively, holding other factors constant.
- `ε` accounts for the variation in selling price not explained by house size, age, and location.
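
To make the house example concrete, here is a minimal sketch that evaluates the regression equation for one house. The intercept, the coefficients, and the house itself are hypothetical values chosen only for illustration; they are not fitted to any data.

```python
import numpy as np

# Hypothetical intercept and coefficients (illustration only, not fitted)
beta_0 = 50_000.0
betas = np.array([150.0, -1_000.0, 20_000.0])  # size (per sq ft), age (per year), location score

# One hypothetical house: 2,000 sq ft, 10 years old, location score 3
x = np.array([2_000.0, 10.0, 3.0])

# y_hat = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
predicted_price = beta_0 + betas @ x
print(f"Predicted price: {predicted_price:,.2f}")  # Predicted price: 400,000.00
```

With these made-up numbers, each extra year of age lowers the predicted price by 1,000 while size and location stay fixed, which is what "holding other factors constant" means.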

### Calculation

Just like in simple linear regression, the coefficients in multiple linear regression are typically estimated using Ordinary Least Squares (OLS). However, the calculation is more complex due to the involvement of multiple variables. The goal remains to minimize the sum of the squares of the residuals, but the calculation involves solving a set of linear equations or using matrix algebra.
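
In matrix form, minimizing the sum of squared residuals leads to the normal equations $(X^TX)\hat{\beta} = X^Ty$, where $X$ includes a column of ones for the intercept. The sketch below solves them with NumPy on synthetic data (the data and the true coefficients 4, 2, -3 are assumptions made for illustration) and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 4 + 2*x1 - 3*x2 + noise
n_samples = 200
X = rng.normal(size=(n_samples, 2))
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# Prepend a column of ones so that the first coefficient is the intercept
X_design = np.column_stack([np.ones(n_samples), X])

# Normal equations: (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Least-squares solver: same solution, numerically more robust
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
```

Both approaches should recover values close to 4, 2, and -3. In practice a least-squares solver is usually preferred over forming $X^TX$ explicitly, because it behaves better when the predictors are strongly correlated.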

### Multicollinearity Consideration

An important consideration in multiple linear regression is multicollinearity, where two or more independent variables are highly correlated with each other. This can make it difficult to determine the individual effect of each independent variable on the dependent variable and can lead to unreliable coefficient estimates.
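
One common diagnostic for multicollinearity, not covered elsewhere in this notebook, is the variance inflation factor (VIF): regress each predictor on the others and compute $1 / (1 - R^2)$. The sketch below is a minimal NumPy version; the helper `vif` and the synthetic predictors are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (2-D array, no intercept column)."""
    n_samples, n_features = X.shape
    factors = []
    for j in range(n_features):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept
        others = np.column_stack([np.ones(n_samples), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - (residuals @ residuals) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                  # unrelated to x1 and x2

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

Here `x1` and `x2` should get very large VIF values while `x3` stays close to 1. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, but the threshold is a convention rather than a strict rule.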

### Importing the dataset

In [82]:
import numpy as np
import pandas as pd

dataset = pd.read_csv("./filez/50_Startups.csv")

X = dataset.iloc[:, :-1].values  # features
y = dataset.iloc[:, -1].values  # target

dataset.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.20 | 136897.80 | 471784.10 | New York | 192261.83 |
| 1 | 162597.70 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


### Encoding categorical data

In [83]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(drop='first'), [3])], remainder="passthrough"
)
X = np.array(ct.fit_transform(X))

X[:5]


array([[0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [1.0, 0.0, 142107.34, 91391.77, 366168.42]], dtype=object)
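
The encoder above uses `drop='first'`, so the three states become two dummy columns. The sketch below, using only NumPy and the five `State` values shown in `dataset.head()`, illustrates one reason for dropping a column: with an intercept in the model, a full one-hot encoding is perfectly collinear, because the dummy columns always sum to the column of ones.

```python
import numpy as np

# State values of the first five rows of the dataset
states = np.array(["New York", "California", "Florida", "New York", "Florida"])
categories = np.unique(states)  # sorted: California, Florida, New York

# Full one-hot encoding: one column per category
full = (states[:, None] == categories[None, :]).astype(float)

# Drop the first category (California), mirroring drop='first'
reduced = full[:, 1:]

ones = np.ones((len(states), 1))  # intercept column
rank_full = np.linalg.matrix_rank(np.hstack([ones, full]))        # 4 columns
rank_reduced = np.linalg.matrix_rank(np.hstack([ones, reduced]))  # 3 columns

print(reduced)
print(rank_full, rank_reduced)  # 3 3
```

The four-column matrix has rank 3, so one column is redundant, while the three-column matrix has full column rank. The `reduced` rows match the first two columns of `X[:5]`: the first dummy marks Florida, the second marks New York, and California is the baseline with both dummies at 0.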

    ☝🏻 No need to apply feature scaling in Multiple Linear Regression with Ordinary Least Squares (OLS):
- **Effect on Coefficients**: OLS predictions are invariant to rescaling the features. Whether or not the features are scaled, the model produces the same predictions; each coefficient simply changes by the inverse of its feature's scale factor.

- **Intercept Term**: `LinearRegression` in sklearn fits an intercept by default (`fit_intercept=True`). The intercept absorbs any constant shift of the features, such as subtracting the mean, so centering does not change the fit either.

- **Optimization Algorithm**: sklearn's `LinearRegression` does not use gradient descent or any other iterative optimization algorithm that would benefit from feature scaling. It solves the least-squares problem directly.
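
The first point above can be checked numerically. The sketch below fits OLS on raw and on standardized features and compares the predictions. It uses `np.linalg.lstsq` as a stand-in for `LinearRegression`, and the helpers `fit_ols` and `predict` as well as the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Return [intercept, coef_1, ..., coef_n] estimated by least squares."""
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

def predict(beta, X):
    return beta[0] + X @ beta[1:]

rng = np.random.default_rng(0)

# Two synthetic features on very different scales
X = np.column_stack([rng.uniform(0, 1, 100), rng.uniform(0, 100_000, 100)])
y = 10.0 + 5.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Standardize each feature: subtract the mean, divide by the standard deviation
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

beta_raw = fit_ols(X, y)
beta_scaled = fit_ols(X_scaled, y)

pred_raw = predict(beta_raw, X)
pred_scaled = predict(beta_scaled, X_scaled)

print(np.allclose(pred_raw, pred_scaled))                  # same predictions
print(np.allclose(beta_raw[1:], beta_scaled[1:] / std))    # coefficients differ only by the scale factor
```

Both checks should print `True`. Scaling still matters for models that penalize coefficient size or rely on distances, so this shortcut applies to plain OLS only.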

    ☝🏻 Do we need to check the linear regression assumptions beforehand? Nope!
- It's preferable to try several models and compare their performance. If the relationship in our dataset is not linear, the regression model will reveal it through poor performance.

### Splitting dataset into Train/Test set

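    ☝🏻 Do we need to apply backward elimination to remove the features with the highest p-values? Nope!
- `LinearRegression` does not drop any features, but it fits all of them jointly, so features with little explanatory power tend to have little effect on the predictions. Backward elimination is therefore optional when the goal is prediction.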

In [84]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Training the Multiple Linear Regression model with Train set

In [85]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

### Predicting the Test set results

In [86]:
# vector of the predicted profits
y_pred = regressor.predict(X_test)

# show only 2 decimals
np.set_printoptions(precision=2)

# show predicted profits (left) and actual profits (right)
# display vertically -> y_pred.reshape(rows, cols)
print(
    np.concatenate(
        (y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1
    )
)

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]


### Making a new prediction

In [87]:
input_features = [
    0,  # Dummy 1 (1 if State is Florida)
    0,  # Dummy 2 (1 if State is New York); both 0 means California
    170000,  # R&D Spend
    50000,  # Administration
    375000,  # Marketing Spend
]

print(f"New prediction: {regressor.predict([input_features])[0]:,.2f}")

New prediction: 189,416.58


### Evaluating the Model

In [88]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predicting the Test set results
y_pred = regressor.predict(X_test)

# MAE: Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f'MAE: {mae:,.2f}')

# MSE: Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f'MSE: {mse:,.2f}')

# R-squared Score
r2 = r2_score(y_test, y_pred)
print(f'R^2: {r2:,.2f}')



MAE: 7,514.29
MSE: 83,502,864.03
R^2: 0.93


- An MAE of 7,514.29 means that, on average, the predictions of our model are about $7,514.29 away from the actual values.
- An R^2 of 0.93 is quite high, indicating that 93% of the variance in the dependent variable (profit) is explained by the independent variables. A very high R^2 on the training set alone can be a sign of overfitting; here it is computed on the held-out test set, which has only 10 samples, so treat it as a rough estimate.
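
To see what the three metrics measure, the sketch below recomputes them by hand with NumPy from the ten (predicted, actual) pairs printed in the "Predicting the Test set results" output. Because those printed predictions are rounded to two decimals, the results may differ from the sklearn values in the last digits. RMSE is added as an assumption of this sketch, since it expresses the error in the same unit as profit.

```python
import numpy as np

# (predicted, actual) pairs copied from the printed test-set output
y_pred = np.array([103015.20, 132582.28, 132447.74, 71976.10, 178537.48,
                   116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_true = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the unit of the target

ss_res = np.sum(errors ** 2)                      # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"MAE:  {mae:,.2f}")
print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"R^2:  {r2:,.2f}")
```

MSE squares each error before averaging, so the few predictions that miss by more than 13,000 dominate it, while MAE weights all errors equally.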

### Getting the linear regression equation with coefficients

In [89]:
print(f'intercept (B0):        {regressor.intercept_:,.2f}')
print(f"coefficients (B1..BN): {', '.join(f'{coef:,.2f}' for coef in regressor.coef_)}")

intercept (B0):        42,554.17
coefficients (B1..BN): -959.28, 699.37, 0.77, 0.03, 0.04


Linear Regression formula:

$y = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n + \epsilon$

In our context:

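$\mathrm{Profit} = 42{,}554.17 - 959.28 \times \mathrm{dummy\_state\_1} + 699.37 \times \mathrm{dummy\_state\_2} + 0.77 \times \mathrm{rd\_spend} + 0.03 \times \mathrm{administration} + 0.04 \times \mathrm{marketing\_spend}$

Here `dummy_state_1` is 1 for Florida and `dummy_state_2` is 1 for New York; California is the baseline, with both dummies equal to 0.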
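
As a quick check, the sketch below plugs the input from the "Making a new prediction" cell into the equation, using the intercept and coefficients exactly as printed (rounded to two decimals).

```python
# Intercept and coefficients as printed above (rounded to two decimals)
intercept = 42_554.17
coefficients = [-959.28, 699.37, 0.77, 0.03, 0.04]

# Same input as the "Making a new prediction" cell:
# dummy 1, dummy 2, R&D Spend, Administration, Marketing Spend
input_features = [0, 0, 170_000, 50_000, 375_000]

profit = intercept + sum(c * x for c, x in zip(coefficients, input_features))
print(f"Profit from rounded equation: {profit:,.2f}")  # 189,954.17
```

The result differs from the model's own prediction of 189,416.58 by about 538. The gap comes from rounding: a coefficient shown as 0.77 can be off by up to 0.005, which alone shifts the result by up to 850 when multiplied by an R&D spend of 170,000. For exact values, use `regressor.intercept_` and `regressor.coef_` directly.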

# TODO : BACKWARD ELIMINATION