# Part 2.2 - Multiple Linear Regression  

## What is Multiple Linear Regression?  
- **Simple Linear Regression (SLR):** one independent variable (X).  
- **Multiple Linear Regression (MLR):** more than one independent variable (X1, X2, …, Xn).  

💡 Example:  
- Predicting a startup’s **Profit (y)** based on:  
  - R&D Spend (X1)  
  - Administration (X2)  
  - Marketing Spend (X3)  
  - State (X4 → categorical, needs encoding)  

---

## The Equation  

$$
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
$$

- **b0:** intercept (predicted y when all X = 0)  
- **b1…bn:** coefficients (change in y for a one-unit increase in that feature, holding the other features fixed)  
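
To make the equation concrete, here is a minimal sketch that evaluates it for one observation. The coefficient and feature values are made up purely for illustration; they are not fitted from the dataset.

```python
import numpy as np

# Hypothetical coefficients, chosen only to illustrate the arithmetic
b0 = 50000.0                       # intercept
b = np.array([0.8, -0.03, 0.03])   # b1, b2, b3

# One observation: x1, x2, x3 (illustrative values)
x = np.array([100000.0, 120000.0, 200000.0])

# y = b0 + b1*x1 + b2*x2 + b3*x3
y_hat = b0 + np.dot(b, x)
print(y_hat)
```

The dot product is just the sum of each coefficient times its feature, so this is the same computation a fitted model performs when it predicts.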

---


## Assumptions of Linear Regression  

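Linear Regression is most reliable when the data satisfies these assumptions:  
1. Linearity (the relationship between the features and y is linear)  
2. Homoscedasticity (the errors have constant variance)  
3. Multivariate normality (the errors are normally distributed)  
4. Independence of errors (one observation's error tells us nothing about another's)  
5. Lack of multicollinearity (the features are not strongly correlated with each other)  
6. No autocorrelation (the special case of 4 for ordered data such as time series)  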

⚠️ In ML practice: we usually **don’t test all assumptions**.  
If the data violates them badly, this usually shows up as poor performance on the test set.  
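
If you do want a quick check for one of these assumptions, multicollinearity is easy to screen for with a correlation matrix. This is a rough sketch on synthetic data; the 0.9 threshold is an arbitrary assumption, not a standard.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)  # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between feature columns
corr = np.corrcoef(X, rowvar=False)

threshold = 0.9  # assumed cut-off for "strongly correlated"
n_features = corr.shape[0]
flagged = [
    (i, j)
    for i in range(n_features)
    for j in range(i + 1, n_features)
    if abs(corr[i, j]) > threshold
]
print(flagged)  # index pairs of strongly correlated features
```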


<center>
  <img src="../docs/Assumptions of LR.png" width="400"/>
  <img src="../docs/Assumptions of LR-2.png" width="400"/>
</center>

---

## ⭐ Importing the libraries

We use:  
- **NumPy** → arrays  
- **Pandas** → dataset handling  
- **Matplotlib** → optional visualization  
- **scikit-learn** → regression model  


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## ⭐ Importing the Dataset  

Dataset: `50_Startups.csv`  

- **X** = all independent variables (spendings + state).  
- **y** = dependent variable (Profit).  


In [2]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values # all columns except last
y = dataset.iloc[:, -1].values  # last column (Profit)
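
To see what the two `iloc` slices above return, here is a sketch on a toy frame. The column names follow the features listed earlier and the numbers are made up; this is not the real CSV.

```python
import pandas as pd

# Toy frame with an assumed column layout; values are illustrative only
toy = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 3000.0],
    "Administration": [500.0, 600.0, 700.0],
    "Marketing Spend": [300.0, 400.0, 500.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [1500.0, 2500.0, 3500.0],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last
y_toy = toy.iloc[:, -1].values   # only the last column

print(X_toy.shape, y_toy.shape)
print(X_toy.dtype)  # mixed numbers and text are stored as generic objects
```

Because `X_toy` mixes numbers and strings, NumPy stores it with an object dtype, which is why the text column must be encoded before training.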

In [3]:
print(X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

## ⭐ Encoding Categorical Data  

- Some columns are **categorical** (e.g., *State* = California, Florida, New York).  
- Regression models need numeric inputs, so we use **OneHotEncoder** to turn each category into a **dummy variable** (0/1).  

Example:  

| State      | D1 (California) | D2 (Florida) | D3 (New York) |
|------------|-----------------|--------------|---------------|
| California | 1               | 0            | 0             |
| Florida    | 0               | 1            | 0             |
| New York   | 0               | 0            | 1             |

---

⚠️ **The Dummy Variable Trap**  
- If we keep **all** dummy variables, they are **linearly dependent**: D1 + D2 + D3 = 1 for every row.  
- Why? Because if we know two of them, the third one is automatically known.  
  - Example: if D1=0 and D2=0 → then D3 must be 1.  
- This is **perfect multicollinearity** with the intercept → the coefficients are no longer uniquely determined, because many different coefficient combinations produce the same predictions.  

👉 **Solution:**  
- Drop one dummy variable per categorical feature.  
- In the example above, we could remove D3 (New York).  
- The information is still preserved:  
  - If D1=0 and D2=0, then we automatically know it’s New York.  
- ✅ In scikit-learn: `OneHotEncoder()` keeps all dummy columns by default, and `LinearRegression` still fits and predicts correctly because its least-squares solver tolerates redundant columns. To drop one dummy explicitly, pass `drop='first'` to `OneHotEncoder`.  
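
The linear dependence can be checked numerically. The sketch below uses a tiny made-up State column and compares the rank of the design matrix (intercept column plus dummies) with and without dropping one dummy. Note that `drop="first"` removes the first category in sorted order (California here), not New York as in the table.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative State column
states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

# All three dummies plus a column of ones (the intercept)
full = OneHotEncoder().fit_transform(states).toarray()
design_full = np.column_stack([np.ones(len(states)), full])
print(design_full.shape[1], np.linalg.matrix_rank(design_full))        # 4 3

# One dummy dropped: every remaining column carries new information
reduced = OneHotEncoder(drop="first").fit_transform(states).toarray()
design_reduced = np.column_stack([np.ones(len(states)), reduced])
print(design_reduced.shape[1], np.linalg.matrix_rank(design_reduced))  # 3 3
```

Four columns with rank 3 means one column is redundant; after dropping a dummy, the number of columns equals the rank.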


In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Encode 'State' column (index 3)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [5]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

## ⭐ Splitting the dataset into the Training set and Test set

We split into:  
- 80% training set  
- 20% test set  

⚠️ Note: The ratio is flexible. Here we use 80/20, which leaves 40 of the 50 rows for training and 10 for testing.  
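
As a quick check of what the ratio means in row counts, here is a sketch on placeholder arrays with 50 rows. `X_demo` and `y_demo` are stand-ins, not the startup data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 rows, 3 feature columns
X_demo = np.arange(50 * 3).reshape(50, 3)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (40, 3) (10, 3)

# The same random_state gives the same split on every run
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))  # True
```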


In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## ⭐ Training the Multiple Linear Regression model on the Training set

- Class: `LinearRegression`  
- Object: `regressor`  
- Method: `.fit()` → trains the model  


In [7]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)
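
After `.fit()`, the learned b0 and b1…bn are available as `intercept_` and `coef_`. The sketch below shows this on noise-free synthetic data where the true values are known; the numbers are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef  # noise-free, so the fit can be exact

model = LinearRegression().fit(X_demo, y_demo)
print(model.intercept_)  # b0, close to 4.0
print(model.coef_)       # b1..b3, close to [2.0, -1.0, 0.5]
```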

## ⭐ Predicting the Test Set Results  

We predict profit for the test set and compare predictions with actual values.  

Since we have multiple features, we can’t plot a simple 2D line like in SLR.  
Instead, we print side by side:  
- Predicted values  
- Actual values  


In [8]:
y_pred = regressor.predict(X_test)

# print predictions vs actuals side by side
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
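
Adding an error column makes the side-by-side comparison easier to read. A minimal sketch, using the (predicted, actual) pairs copied from the printed output above; the column names are made up for this example.

```python
import pandas as pd

# (predicted, actual) pairs copied from the printed output
pairs = [
    (103015.20, 103282.38),
    (132582.28, 144259.40),
    (132447.74, 146121.95),
    (71976.10, 77798.83),
    (178537.48, 191050.39),
    (116161.24, 105008.31),
    (67851.69, 81229.06),
    (98791.73, 97483.56),
    (113969.44, 110352.25),
    (167921.07, 166187.94),
]
comparison = pd.DataFrame(pairs, columns=["predicted", "actual"])

# Absolute and relative error per row
comparison["abs_error"] = (comparison["predicted"] - comparison["actual"]).abs()
comparison["pct_error"] = 100 * comparison["abs_error"] / comparison["actual"]

print(comparison.round(2))
print("Mean absolute error:", round(comparison["abs_error"].mean(), 2))
```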


## ⭐ R² and Adjusted R²

### R² (Coefficient of Determination)
- Measures the proportion of the variance in y that the model explains.  
- Formula:  

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
$$  

Where:  

- Residual sum of squares:  
  $$
  SS_{res} = \sum (y_i - \hat{y}_i)^2
  $$  

- Total sum of squares:  
  $$
  SS_{tot} = \sum (y_i - \bar{y})^2
  $$  

👉 **Interpretation:**  
- Closer to **1** → better fit.  
- But: on the training data, adding more variables never decreases R² (even irrelevant ones can only push it up slightly).  
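
The formula can be checked by hand against `r2_score`. A small sketch with made-up true and predicted values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values, not taken from the dataset
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_hat))  # should match the manual value
```

Here SS_res = 0.46 and SS_tot = 40, so R² = 1 − 0.46/40 = 0.9885.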

---

### Adjusted R²
- Fixes the problem of R² increasing artificially when adding irrelevant features.  
- Formula:  

$$
Adj \; R^2 = 1 - (1 - R^2) \times \frac{n-1}{n-k-1}
$$  

Where:  
- \(n\) = number of samples  
- \(k\) = number of features  

👉 **Interpretation:**  
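- Adjusted R² increases only if the new variable improves the fit by more than the penalty for adding it.  
- If the variable is irrelevant, the penalty usually outweighs the tiny gain in R², so Adjusted R² decreases.  
- With few samples and many features the penalty is large: on the test set here, n = 10 and k = 6 (three dummy columns plus three spend columns).  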
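
The effect of an irrelevant feature can be seen on synthetic data. This is a sketch under assumptions: the data is random, `adjusted_r2` is a helper written for this example, and both scores are computed on the training data. R² cannot go down when the noise column is added; Adjusted R² may.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, k):
    """Adjusted R² for n samples and k features."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(0)
n = 50
X_base = rng.normal(size=(n, 2))
y = 1.0 + 3.0 * X_base[:, 0] - 2.0 * X_base[:, 1] + rng.normal(size=n)

# Add a column of pure noise that has nothing to do with y
X_more = np.hstack([X_base, rng.normal(size=(n, 1))])

r2_base = LinearRegression().fit(X_base, y).score(X_base, y)
r2_more = LinearRegression().fit(X_more, y).score(X_more, y)

print("R²:         ", r2_base, "->", r2_more)
print("Adjusted R²:", adjusted_r2(r2_base, n, 2), "->", adjusted_r2(r2_more, n, 3))
```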


In [9]:
from sklearn.metrics import r2_score

# R²
r2 = r2_score(y_test, y_pred)
print("R²:", r2)

# Adjusted R²
n = X_test.shape[0]  # number of samples
k = X_test.shape[1]  # number of features
adj_r2 = 1 - (1-r2) * (n-1)/(n-k-1)
print("Adjusted R²:", adj_r2)

R²: 0.9347068473282546
Adjusted R²: 0.8041205419847638


---

## How to Build a Model (Feature Selection Methods)  

When we have many features, not all are useful.  
Here are common strategies:  

1. **All-in**  
   - Use all variables.  
   - Simple, but irrelevant features stay in the model.  

2. **Backward Elimination** (fastest)  
   - Start with all features.  
   - Remove the least significant one step by step.  

   Steps:  
   1. Select a significance level (e.g., 0.05).  
   2. Fit the model with all predictors.  
   3. Find the predictor with the highest p-value.  
   4. If p-value > SL, remove the predictor.  
   5. Refit and repeat until all predictors are significant.  

3. **Forward Selection**  
   - Start with no variables.  
   - Add one at a time (the most significant).  

4. **Bidirectional Elimination**  
   - Combination of forward and backward.  

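5. **Score Comparison**  
   - Fit every possible combination of features (2^n − 1 models) and keep the one with the best score (e.g., Adjusted R²).  
   - Exhaustive, so only practical for a small number of features.  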
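
The Backward Elimination steps above can be sketched with NumPy and SciPy. This is a rough illustration under assumptions: `ols_p_values`, `backward_elimination`, the synthetic data, and the feature names are all made up for this example, and the intercept is never removed.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided p-values for each column of X (X already includes a constant)."""
    n, p = X.shape
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = n - p
    sigma2 = residuals @ residuals / dof       # estimated error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)

def backward_elimination(X, y, names, sl=0.05):
    """Repeatedly drop the predictor with the highest p-value above sl."""
    X = np.column_stack([np.ones(len(y)), X])  # step 2: start with all predictors
    names = ["const"] + list(names)
    while X.shape[1] > 1:
        p = ols_p_values(X, y)
        worst = 1 + int(np.argmax(p[1:]))      # step 3: skip the intercept at index 0
        if p[worst] <= sl:                     # everything left is significant
            break
        X = np.delete(X, worst, axis=1)        # step 4: remove, then refit (step 5)
        del names[worst]
    return names

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 + 2 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

kept = backward_elimination(X_demo, y_demo, ["x0", "x1", "x2", "x3"])
print(kept)
```

Only `x0` and `x1` influence `y_demo`, so they should survive; a pure-noise column can occasionally survive too, since about 5% of irrelevant predictors pass a 0.05 threshold by chance.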

---

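💡 In practice:  
- In pure statistics, we care about the significance of each coefficient (p-values).  
- In ML, we mostly care about predictive performance. `LinearRegression` in scikit-learn does not report p-values or remove features on its own; feature selection is a separate step (for example with the tools in `sklearn.feature_selection`).  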
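
For a score-based alternative to p-values, scikit-learn provides `SequentialFeatureSelector`, which adds (or removes) features according to cross-validated score. A minimal sketch on synthetic data where only the first two columns matter; the data and the choice of two features are assumptions for this example.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only columns 0 and 1 influence the target
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X_demo, y_demo)

# Boolean mask over the columns: True means the feature was selected
print(selector.get_support())
```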
---

# 💡 Wrap-Up  

- Multiple Linear Regression allows multiple features.  
- Creating one dummy per category causes perfect multicollinearity with the intercept; dropping one dummy per categorical feature avoids it.  
- scikit-learn's `LinearRegression` still fits and predicts correctly when all dummy columns are kept, but it does not select features automatically.  
- **No feature scaling needed:**  
  - In algorithms like SVM or KNN, distances between data points matter. If features are on very different scales (e.g., *Age in years* vs. *Salary in thousands*), the larger-scale feature dominates, so scaling is required.  
  - In Linear Regression, the model assigns a **coefficient** to each feature.  
  - Each coefficient absorbs the scale of its feature: multiplying a feature by a constant c divides its coefficient by c and leaves the predictions unchanged.  
  - Example: if *Salary* takes values in the tens of thousands, its coefficient (*b*) will be small, while *Age* (values in the tens) may get a larger one. Together, they balance out without scaling.  
- Visualization is harder in high dimensions → compare predictions vs. actual values.  
- R² & Adjusted R²: measure how well the model explains the variance; Adjusted R² is more reliable with many features.
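
The scaling point can be verified directly. In this sketch (synthetic *Age* and *Salary* values, made up for the illustration), expressing salary in thousands changes its coefficient by a factor of 1000 but not the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=100)              # years
salary = rng.uniform(30_000, 120_000, size=100)  # raw currency units
y = 0.5 * age + 0.0002 * salary + rng.normal(scale=0.1, size=100)

X_raw = np.column_stack([age, salary])
X_rescaled = np.column_stack([age, salary / 1000])  # salary in thousands

m_raw = LinearRegression().fit(X_raw, y)
m_rescaled = LinearRegression().fit(X_rescaled, y)

# Same predictions, different coefficient for the rescaled feature
print(np.allclose(m_raw.predict(X_raw), m_rescaled.predict(X_rescaled)))  # True
print(m_rescaled.coef_[1] / m_raw.coef_[1])  # approximately 1000
```

This holds for plain least squares as used here; it is a property of the fit itself, not of any preprocessing.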
