
# CHAPTER 6: MULTIPLE LINEAR REGRESSION

This chapter introduces linear regression models with a specific focus on predictive analytics. Key topics covered include:

* **Linear Regression Models**
  * Introduction to regression models designed for **prediction**.
  * The distinction between fitting a model for **inference** (classical statistics) and fitting it for **prediction**.

* **Model Evaluation Methodology**
  * The necessity of evaluating model performance on a **validation set**.
  * The use of specific **predictive metrics** (e.g., RMSE, MAE) rather than goodness-of-fit statistics alone (like $R^{2}$).

* **Variable Selection**
  * The challenges associated with using a large number of **predictors** (features).
  * The use of **variable selection algorithms** to identify the most relevant features for the model.

# TABLE OF CONTENTS

6.1. [INTRODUCTION](https://colab.research.google.com/drive/1x2NpXouzqo869hcjUJJ9aW59Fqbhr78k#scrollTo=IB8Z828xSRLb&line=1&uniqifier=1)

6.2. [EXPLANATORY VS. PREDICTIVE MODELING](https://colab.research.google.com/drive/1x2NpXouzqo869hcjUJJ9aW59Fqbhr78k#scrollTo=XBNBFMp6SWuR&line=1&uniqifier=1)

6.3. [ESTIMATING THE REGRESSION EQUATION AND PREDICTION](https://colab.research.google.com/drive/1x2NpXouzqo869hcjUJJ9aW59Fqbhr78k#scrollTo=p143lu77Sco2&line=1&uniqifier=1)

6.4. [VARIABLE SELECTION IN LINEAR REGRESSION](https://colab.research.google.com/drive/1x2NpXouzqo869hcjUJJ9aW59Fqbhr78k#scrollTo=7VFkjt6tTfQ-&line=1&uniqifier=1)

6.5. [APPENDIX: USING STATSMODELS](https://colab.research.google.com/drive/1x2NpXouzqo869hcjUJJ9aW59Fqbhr78k#scrollTo=Hugi2j4RU6Na&line=1&uniqifier=1)

**Python**

In this chapter, we use **pandas** for data handling and **scikit-learn** for building the models and for variable (feature) selection. We also make use of the utility functions from the Python Utilities Functions Appendix. We could use **statsmodels** for the linear regression model; however, **statsmodels** provides more information than is needed for predictive modeling. Use the following import statements for the Python code in this chapter.


In [None]:
!pip install dmba

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, BayesianRidge
import statsmodels.formula.api as sm
import matplotlib.pylab as plt

from dmba import regressionSummary, exhaustive_search
from dmba import backward_elimination, forward_selection, stepwise_selection




# 6.1. INTRODUCTION

#### **6.1.1. Model Definition & Terminology**
* The **multiple linear regression model** is the most popular model for making predictions. It fits a relationship between a numerical outcome and a set of predictors.

* **Outcome  variable ($Y$):** Also called the response, target, or dependent variable.

* **Predictors ($X_1, \dots, X_p$):** Also called independent variables, input variables, regressors, or covariates.

#### **6.1.2. The Model Equation**
The model assumes the relationship is approximated by the function:

$$Y= \beta_0 +\beta_1x_1 +\beta_2x_2 +\dots +\beta_px_p +\epsilon$$

* $\beta_0, \dots, \beta_p$: **Coefficients** estimated from the data.

* $\epsilon$: **Noise** (the unexplained part).

#### **6.1.3. Feature Engineering (Input Forms)**
Regression modeling involves not only estimating coefficients but also selecting **which predictors** to include and in **what form**.

* **Forms:** Predictors can be included "as is", in logarithmic form [$\log(X)$], or in binned form (e.g., age groups).
* **Selection Criteria:** Choices depend on **domain knowledge**, data availability, and needed predictive power.
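
The three input forms listed above can be sketched in pandas as follows. This is only an illustration: the toy data, the column names (`age`, `income`), and the bin edges are assumptions made for the example and do not come from the chapter's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names and values are illustrative only
df = pd.DataFrame({
    "age": [23, 35, 47, 59, 71],
    "income": [20_000, 35_000, 50_000, 80_000, 120_000],
})

# "As is": the raw column is used directly as a predictor (df["age"])

# Logarithmic form: compresses the long right tail of income
df["log_income"] = np.log(df["income"])

# Binned form: age groups; intervals are right-inclusive, i.e. (0, 30], (30, 50], (50, 120]
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 50, 120],
    labels=["up_to_30", "31_to_50", "over_50"],
)

print(df)
```

Which form works best is an empirical question; each candidate form can be compared by its predictive performance on held-out data.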


#### **6.1.4. Applications**

Common predictive modeling situations include:
* Predicting credit card activity based on demographics and historical patterns.
* Predicting vacation travel expenditures based on frequent flyer data.
* Predicting help desk staffing requirements based on sales information.
* Predicting sales from cross-selling products.
* Predicting the impact of discounts on retail sales.

# 6.2. EXPLANATORY VS. PREDICTIVE MODELING

Before using linear regression for prediction, it is crucial to distinguish between two popular but different objectives.

## 6.2.1. The Distinction



1.   **Explanatory Task:** Explaining or quantifying the average effect of inputs on an outcome (an explanatory or a descriptive task, respectively).
2.   **Predictive Task:** Predicting the outcome value for new records, given their input values.



## 6.2.2. The Classical Statistical Approach (Explanatory)

* **Focus:** Objective #1 (explaining).
* **Data view:** Data is treated as a random sample from a larger population. The model attempts to capture the **average relationship** in that population.
* **Interpretation:** Used to generate statements like **"A unit increase in service speed ($X_1$) is associated with an average increase of 5 points in Satisfaction ($Y$), all other factors ($X_2$, $X_3$, ... , $X_p$) being equal."**
* **Explanatory modeling:** If causal structure is known ($X$ causes $Y$), used for actionable policy changes.
* **Descriptive Modeling:** If causal structure is unknown, it quantifies the degree of **association**.

## 6.2.3. The Predictive Analytics Approach (Data Mining)

* **Focus:** Objective #2 (predicting).
* **Target:** Predicting **new individual records**
* **Mindset:** We are not interested in the coefficients themselves or the **average record**, but rather in the **predictions ($\hat{y}$)** the model generates.
* **Usage:** Used for **micro-decision-making** at the record level (e.g., predicting satisfaction for *each* new customer).

## 6.2.4. Modeling Process & Trade-off

* **Optimization:**
  * *Explanatory:* Tries to fit the **best model to the existing data** to learn underlying relationships.
  * *Predictive:* Tries to find a model that best predicts **new individual records**.
* **The Overfitting Paradox:** A regression model that fits the existing data **too well** is not likely to perform well with new data.
* **Solution:** We look for the model with the highest predictive power by evaluating it on a **holdout set** (validation set) using predictive metrics.
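
The holdout idea can be sketched as follows. The data here are synthetic (generated with NumPy, not taken from the chapter), and the 60%/40% split is simply one common choice; the point is only that the model is fitted on the training set and judged on the validation set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 200 records, 3 predictors, known coefficients plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Hold out 40% of the records; the model never sees them during fitting
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.4, random_state=1
)

model = LinearRegression().fit(train_X, train_y)


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


train_rmse = rmse(train_y, model.predict(train_X))
valid_rmse = rmse(valid_y, model.predict(valid_X))
print(f"training RMSE:   {train_rmse:.3f}")
print(f"validation RMSE: {valid_rmse:.3f}")
```

The validation RMSE is the figure to compare across candidate models, since the training RMSE can always be pushed down by adding predictors.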

## 6.2.5. Summary of Key Differences

There are four main differences in using linear regression for these two scenarios:

1. **Definition of "Good":**
   * *Explanatory:* A good model fits the data closely.
   * *Predictive:* A good model predicts new records accurately. (The choice of input variables may therefore differ.)

2. **Data Usage:**
   * *Explanatory:* The **entire dataset** is used to estimate the best-fit model (maximizing information).
   * *Predictive:* Data is split into a **Training Set** (to estimate the model) and a **Validation/Holdout Set** (to assess predictive performance on unobserved data).

3. **Performance Measures:**
   * *Explanatory:* Measured by how well the model approximates the data (**goodness-of-fit**) and by the strength of the average relationship.
   * *Predictive:* Measured by **predictive accuracy**.

4. **Focus:**
   * *Explanatory:* Focus is on the **coefficients ($\beta$)**.
   * *Predictive:* Focus is on the **predictions ($\hat{y}$)**.

For these reasons, it is extremely important to know the goal of the analysis before beginning the modeling process. A good predictive model can have a looser fit to the data on which it is based, and a good explanatory model can have low prediction accuracy. In the remainder of this chapter, we focus on predictive models because these are more popular in data mining and because most statistics textbooks focus on explanatory modeling.

# 6.3. ESTIMATING THE REGRESSION EQUATION AND PREDICTION

## 6.3.1. Estimation method: Ordinary Least Squares (OLS)

Once the predictors and their forms are selected, the coefficients are estimated from the data using a method called **Ordinary Least Squares (OLS)**.

* **Objective:** Find values $\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_p}$ that **minimize the sum of squared deviations** between the actual outcome values ($Y$) and their predicted values ($\hat{Y}$).

* **Prediction Equation:** To predict the value for a new record, we use the equation:

    $$\hat{Y} = \hat{\beta_0} + \hat{\beta_1}x_1 + \hat{\beta_2}x_2 + \dots + \hat{\beta_p}x_p$$
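
As a minimal sketch of what OLS does, the snippet below estimates the coefficients on synthetic data in two ways: directly with NumPy's least-squares solver, and with scikit-learn's `LinearRegression`. It then applies the prediction equation to one new record. The data, the true coefficients (3.0, 1.5, -2.0), and the new record are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two predictors with known coefficients plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Direct least squares: prepend a column of ones so the first coefficient is the intercept
X1 = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The same estimates via scikit-learn
model = LinearRegression().fit(X, y)

# Prediction equation for a new record: beta_0 + beta_1 * x_1 + beta_2 * x_2
new_record = np.array([[0.5, -1.0]])
manual_pred = beta_hat[0] + new_record @ beta_hat[1:]
sk_pred = model.predict(new_record)

print("lstsq coefficients:  ", beta_hat)
print("sklearn coefficients:", model.intercept_, model.coef_)
print("predictions:", manual_pred, sk_pred)
```

Both routes minimize the same sum of squared deviations, so the coefficients and the prediction agree up to floating-point error.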



## 6.3.2. Statistical assumptions

For the OLS estimates to be the **best** available (unbiased, and with the smallest mean squared error among unbiased estimators), the following assumptions are typically made:

1. The noise $\epsilon$ (or equivalently, $Y$) follows a **normal distribution**

2. The choice of predictors and their form is correct (*linearity*)

3. The records are independent of each other.

4. The variability in the outcome values for a given set of predictors is the same regardless of the values of the predictors (*homoskedasticity*)

## 6.3.3. Data mining perspective: Prediction vs. Assumptions

* **Key insight:** For the goal of **prediction**, satisfying the strict statistical assumptions (like the normal distribution of noise) is often of **secondary interest**.

* **Focus:** Even if the assumptions are violated, predictions can still be sufficiently accurate. The priority is to evaluate the model's **predictive performance** on a validation set rather than just checking assumptions.

## 6.3.4. EXAMPLE: PREDICTING THE PRICE OF USED TOYOTA COROLLA CARS

* **Goal:** Predict the price of used cars to ensure dealership profitability.
* **Data Partitioning:** The dataset (1000 records) is partitioned into a **Training Set (60%)** for fitting the model and a **Validation Set (40%)** for evaluating performance.
* **Handling Categorical Predictors (Dummy Variables):**
    * Categorical variables like `Fuel Type` (Petrol, Diesel, CNG) must be converted into **dummy variables** (0/1).
    * **The $N-1$ Rule:** If a variable has $N$ categories, we create **$N-1$ dummy variables**.
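    * *Example:* For `Fuel Type` (3 categories), we create `Fuel_Type_Petrol` and `Fuel_Type_Diesel`. The third category (`CNG`) is redundant: a car that is neither Petrol nor Diesel must be CNG. Adding a third dummy would make the dummies sum to 1 for every record, which duplicates the intercept (perfect multicollinearity), so the coefficients could not be estimated uniquely.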
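
The $N-1$ rule can be sketched with `pandas.get_dummies` and `drop_first=True`, which drops the first category in sorted order (here `CNG`). The four rows below are made up for illustration and are not the actual Toyota Corolla records.

```python
import pandas as pd

# Made-up rows for illustration; only the Fuel_Type categories mirror the example
cars = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "CNG", "Petrol"],
    "Price": [13500, 7500, 9000, 8500],
})

# N-1 dummies: drop_first=True removes the first category (CNG), leaving Diesel and Petrol
dummies = pd.get_dummies(cars, columns=["Fuel_Type"], drop_first=True)
print(dummies)

# With all N dummies, each row sums to exactly 1, duplicating the intercept column
full = pd.get_dummies(cars, columns=["Fuel_Type"])
fuel_cols = ["Fuel_Type_CNG", "Fuel_Type_Diesel", "Fuel_Type_Petrol"]
print(full[fuel_cols].sum(axis=1))
```

A CNG car is then the record for which both remaining dummies are 0, so no information is lost by dropping the column.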


**TABLE 6.1. VARIABLES IN THE TOYOTA COROLLA EXAMPLE**

In [None]:
tb = pd.read_csv("Table_6.1.csv")
tb


| | Variable | Description |
| :--- | :--- | :--- |
| 0 | Price | Offer price in Euros |
| 1 | Age | Age in months as of August 2004 |
| 2 | Kilometers | Accumulated kilometers on odometer |
| 3 | Fuel type | Fuel type (Petrol, Diesel, or CNG) |
| 4 | HP | Horsepower |
| 5 | Metallic | Metallic color? (Yes = 1, No = 0) |
| 6 | Automatic | Automatic (Yes = 1, No = 0) |
| 7 | CC | Cylinder volume in cubic centimeters |
| 8 | Doors | Number of doors |
| 9 | QuarTax | Quarterly road tax in Euros |


**TABLE 6.2. PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS (SELECTED ROWS AND COLUMNS ONLY)**

In [None]:
car_df = pd.read_csv('ToyotaCorolla.csv')

car_df_needed = ['Price', 'Age_08_04', 'KM', 'Fuel_Type', 'HP', 'Met_Color', 'Automatic', 'cc', 'Doors', 'Quarterly_Tax', 'Weight']
car_df = car_df[[c for c in car_df_needed if c in car_df.columns]]
car_df = car_df.rename(columns= {'cc':'CC'})
car_df


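| | Price | Age_08_04 | KM | Fuel_Type | HP | Met_Color | Automatic | CC | Doors | Quarterly_Tax | Weight |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 13500 | 23 | 46986 | Diesel | 90 | 1 | 0 | 2000 | 3 | 210 | 1165 |
| 1 | 13750 | 23 | 72937 | Diesel | 90 | 1 | 0 | 2000 | 3 | 210 | 1165 |
| 2 | 13950 | 24 | 41711 | Diesel | 90 | 1 | 0 | 2000 | 3 | 210 | 1165 |
| 3 | 14950 | 26 | 48000 | Diesel | 90 | 0 | 0 | 2000 | 3 | 210 | 1165 |
| 4 | 13750 | 30 | 38500 | Diesel | 90 | 0 | 0 | 2000 | 3 | 210 | 1170 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1431 | 7500 | 69 | 20544 | Petrol | 86 | 1 | 0 | 1300 | 3 | 69 | 1025 |
| 1432 | 10845 | 72 | 19000 | Petrol | 86 | 0 | 0 | 1300 | 3 | 69 | 1015 |
| 1433 | 8500 | 71 | 17016 | Petrol | 86 | 0 | 0 | 1300 | 3 | 69 | 1015 |
| 1434 | 7250 | 70 | 16916 | Petrol | 86 | 1 | 0 | 1300 | 3 | 69 | 1015 |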


## **Predictive Measures of Error**

In predictive modeling, we typically do not use $R^2$ to assess performance on the validation set. Instead, we use measures based on the **prediction error** ($e_i = y_i - \hat{y}_i$).

### **1. Key Error Metrics**

| Metric | Name | Formula | Key Characteristics & Usage |
| :--- | :--- | :--- | :--- |
| **ME** | **Mean Error** | $$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)$$ | • **Purpose:** Measures **Bias**. Tells if the model is over-forecasting or under-forecasting.<br>• **Interpretation:**<br>  - $ME > 0$: Under-forecast (Actual > Predicted).<br>  - $ME < 0$: Over-forecast (Actual < Predicted).<br>• **Note:** $ME \approx 0$ does not mean the model is good (positive and negative errors cancel out). |
| **MAE** | **Mean Absolute Error** | $$\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$ | • **Meaning:** On average, by how many units is the prediction off?<br>• **Usage:** Easy to explain, so well suited to reporting to management. Use when the cost of an error grows **linearly** with its size.<br>• **Pros/Cons:** Less sensitive to outliers than RMSE, because large errors are not squared. |
| **RMSE** | **Root Mean Squared Error** | $$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$ | • **Meaning:** Typical size of the prediction errors, in the same units as the outcome. It equals the standard deviation of the errors when the mean error is zero.<br>• **Usage:** Use when **large errors** should be penalized heavily.<br>• **Pros/Cons:** Very sensitive to outliers. A few bad predictions can inflate RMSE significantly. |
| **MAPE** | **Mean Absolute Percentage Error** | $$\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$ | • **Meaning:** Average error in percentage terms.<br>• **Usage:** Useful for comparing performance across products/channels with **different scales**.<br>• **Note:** Undefined if Actual ($y$) = 0. Can "explode" if values are very small. |
| **MPE** | **Mean Percentage Error** | $$\frac{100\%}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{y_i}\right)$$ | • **Purpose:** Measures **Percentage Bias**.<br>• **Interpretation:** Similar to ME but in percentage.<br>  - $MPE > 0$: Under-forecast in %.<br>  - $MPE < 0$: Over-forecast in %. |
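
The formulas in the table translate directly into NumPy. The sketch below uses a hypothetical helper, `predictive_measures`, and four made-up actual/predicted pairs; the `dmba` package imported at the top of the chapter provides `regressionSummary`, which reports the same five measures.

```python
import numpy as np


def predictive_measures(actual, predicted):
    """Return ME, MAE, RMSE, MAPE, and MPE, with errors defined as actual - predicted."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    return {
        "ME": float(errors.mean()),
        "MAE": float(np.abs(errors).mean()),
        "RMSE": float(np.sqrt((errors ** 2).mean())),
        # Percentage measures divide by the actual value, so they are undefined when it is 0
        "MAPE": float(100 * np.abs(errors / actual).mean()),
        "MPE": float(100 * (errors / actual).mean()),
    }


# Made-up values for illustration; errors are -10, 10, -10, 20
measures = predictive_measures([100, 200, 300, 400], [110, 190, 310, 380])
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

In this toy case ME is positive (2.5) while MPE is slightly negative: the single largest error is an under-forecast on the largest actual value, which weighs more in absolute units than in percentage terms.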

---

### **2. Strategic Metric Interpretation**
How to read these metrics to make decisions?

**Step 1: Check for Bias (ME & MPE)**
* If $ME \approx 0$ and $MPE \approx 0 \rightarrow$ The model is **unbiased** (predictions are centered around actuals).
* If $ME > 0 \rightarrow$ The model tends to **under-forecast**.
* If $ME < 0 \rightarrow$ The model tends to **over-forecast**.
* *Action:* Use this to adjust the intercept or scaling if necessary.

**Step 2: Detect Large Errors (RMSE vs. MAE)**
* If $RMSE \approx MAE \rightarrow$ The errors are of similar magnitude across records, so the model's accuracy is **stable**. (RMSE is never smaller than MAE.)
* If $RMSE \gg MAE \rightarrow$ **Warning!** A few predictions have **very large errors** (big mistakes).
* *Insight:* Tells you if there are specific "disaster" points causing high error.
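
To see why the gap between RMSE and MAE flags large errors, the sketch below compares two made-up error vectors with the same MAE. The helper names `mae` and `rmse` are defined here for the example only.

```python
import numpy as np


def mae(errors):
    """Mean absolute error of a vector of prediction errors."""
    return float(np.mean(np.abs(errors)))


def rmse(errors):
    """Root mean squared error of a vector of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))


# Made-up errors: both vectors have an average absolute error of 10
even_errors = np.array([10.0, -10.0, 10.0, -10.0])   # every prediction is off by 10
spiky_errors = np.array([1.0, -1.0, 1.0, -37.0])     # three good predictions, one large miss

print("even: ", mae(even_errors), rmse(even_errors))
print("spiky:", mae(spiky_errors), rmse(spiky_errors))
```

With evenly sized errors the two measures coincide; with one large miss the RMSE is almost twice the MAE, because squaring lets that single record dominate.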

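**Step 3: Assess Relative Accuracy (MAPE)**

The thresholds below are rough rules of thumb; what counts as acceptable depends on the domain and the business scale.
* $MAPE < 10\% \rightarrow$ Excellent.
* $10\% - 20\% \rightarrow$ Acceptable.
* $> 40\% \rightarrow$ Problematic.
* *Insight:* Helps determine whether the error level is acceptable for the business scale.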

**TABLE 6.3. LINEAR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES**

**TABLE 6.4. PREDICTED PRICES (AND ERRORS) FOR 20 CARS IN VALIDATION SET AND SUMMARY PREDICTIVE MEASURES FOR ENTIRE VALIDATION SET (CALLED TEST SET IN R)**

**FIGURE 6.1. HISTOGRAM OF MODEL ERRORS (BASED ON VALIDATION SET)**

# 6.4. VARIABLE SELECTION IN LINEAR REGRESSION

## REDUCING THE NUMBER OF PREDICTORS

## HOW TO REDUCE THE NUMBER OF PREDICTORS

**TABLE 6.5. EXHAUSTIVE SEARCH FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE**

**TABLE 6.6. BACKWARD ELIMINATION FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE**

**TABLE 6.7. FORWARD SELECTION FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE**

**TABLE 6.8. STEPWISE REGRESSION FOR REDUCING PREDICTORS IN TOYOTA COROLLA EXAMPLE**

**REGULARIZATION (SHRINKAGE MODELS)**

**TABLE 6.9. LASSO AND RIDGE REGRESSION APPLIED TO THE TOYOTA COROLLA DATA**

**TABLE 6.10. LINEAR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES USING STATSMODELS (COMPARE WITH TABLE 6.3)**

# APPENDIX: USING STATSMODELS