## Python For Machine Learning Fall 2025
---
# Linear Regression




### 1. Introduction
Linear regression is one of the simplest supervised learning algorithms. In fact, it is so simple that it is sometimes not considered machine learning at all! Whatever you believe, linear regression and its extensions continue to be a common and useful way to make predictions when the target is a quantitative value (e.g., home price, age).

#### 1.1 Early Concepts and the Method of Least Squares

The true birth of linear regression is tied to the development of the method of least squares. While several brilliant minds worked on similar problems, two figures are most prominently credited with its creation in the early 19th century.

- Adrien-Marie Legendre: the first to publish the method of least squares in 1805
- Carl Friedrich Gauss: Gauss claimed he had been using the method since 1795 but didn't publish it until 1809

#### 1.2 Regression

In the 1880s, Sir Francis Galton was studying the relationship between the heights of parents and their children. He observed that very tall parents tended to have children who were tall, but slightly shorter than them. Similarly, very short parents tended to have children who were short, but slightly taller than them.

He called this phenomenon "regression towards mediocrity," which was later termed "regression to the mean." He developed a graphical method to describe this relationship, plotting the parents' heights against the children's heights and drawing a line through the data. He used the term "regression line" to describe this relationship. This is where the name of the technique comes from, even though today its primary use is for prediction, not just describing this specific biological phenomenon.

#### 1.3 Problem Statement
You want to train a model that represents a linear relationship between the features and the target vector.

#### 1.4 Solution
Use a linear regression (`LinearRegression` in scikit-learn).

In [4]:
# load libraries
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# generate feature matrix, target vector
features, target = make_regression(n_samples=100, n_features=3, n_informative=3, random_state=1)

# create linear regression
regression = LinearRegression()

# fit the linear regression
model = regression.fit(features, target)

#### 1.5 Code Interpretation

The script uses the `scikit-learn` library to perform three main steps:

- Generate a synthetic dataset suitable for a regression problem.
- Create an instance of a linear regression model.
- Train (or "fit") the model to the generated data.

##### 1.5.1 Generating Data
`features, target = make_regression(n_samples=100, n_features=3, n_informative=3, random_state=1)`

This line calls the make_regression function to create our dataset.

`n_samples=100`: This specifies that we want `100` observations or rows in our dataset.

`n_features=3`: This means each sample will have `3` descriptive features (or columns).

`n_informative=3`: This tells the function that all `3` features should be "informative," meaning they all have a meaningful influence on the target value.

`random_state=1`: This is a seed for the random number generator. Using a specific random_state ensures that every time this code is run, the exact same "random" data is generated. This is crucial for getting reproducible results.

The function returns two things:

`features`: A matrix (like a table) with `100` rows and `3` columns. These are the independent variables.

`target`: A vector (a single column) with `100` values. This is the dependent variable that the model will learn to predict.
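The shapes and the effect of `random_state` described above can be checked directly. The snippet below is a minimal sketch; the names `features_again` and `target_again` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same arguments as in the solution above
features, target = make_regression(
    n_samples=100, n_features=3, n_informative=3, random_state=1)

# Rows x columns of the feature matrix, and length of the target vector
print(features.shape, target.shape)

# A second call with the same seed should reproduce the data exactly
features_again, target_again = make_regression(
    n_samples=100, n_features=3, n_informative=3, random_state=1)
print(np.array_equal(features, features_again))
print(np.array_equal(target, target_again))
```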

##### 1.5.2 Creating the Model
`regression = LinearRegression()`

This line creates an instance of the LinearRegression model. You can think of regression as a blank, untrained model object, ready to learn from data.

##### 1.5.3 Training the model

`model = regression.fit(features, target)`

This is the most critical step. The `.fit()` method is where the model training happens.

`regression.fit(features, target)`: The features and target data are passed to the `fit` method. The linear regression algorithm then analyzes this data to find the coefficients (or weights) for the three features that minimize the squared prediction error on the target. It essentially "learns" the mathematical relationship between the inputs (`features`) and the output (`target`).

The `model` now holds this trained object, which can be used to make predictions on new, unseen data.
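To make the "finding the optimal coefficients" step less of a black box, here is a rough sketch that solves the same least-squares problem with NumPy's `np.linalg.lstsq` and compares the result with the fitted model. The names `design` and `solution` are assumptions made for this illustration; this is not necessarily how scikit-learn computes the fit internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

features, target = make_regression(
    n_samples=100, n_features=3, n_informative=3, random_state=1)
model = LinearRegression().fit(features, target)

# Prepend a column of ones so the first solved value plays the role of the intercept
design = np.column_stack([np.ones(len(features)), features])

# Least-squares solution of design @ solution ~= target
solution, *_ = np.linalg.lstsq(design, target, rcond=None)

print("intercept:", solution[0], model.intercept_)
print("weights:  ", solution[1:], model.coef_)
```

Both approaches should agree up to floating-point error, since both minimize the sum of squared residuals.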

### 2. Linear Regression Model

Linear regression assumes that the relationship between the features and the target vector is approximately linear. That is, the effect (also called coefficient, weight, or parameter) of the features on the target vector is constant. In the solution above, for the sake of explanation, the model was trained using only three features. Thus, the model is as follows:
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon
$$


#### 2.1 Coefficient and Intercept

In `sklearn`, the intercept $\beta_0$ is stored in the model's `intercept_` attribute.

In [6]:
model.intercept_

np.float64(3.552713678800501e-15)

Correspondingly, the coefficients (or weights) are stored in the model's `coef_` attribute.

In [5]:
model.coef_

array([44.19042807, 98.97517077, 58.15774073])
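As a quick check of how these two attributes relate to the model equation, the sketch below rebuilds the predictions by hand from `intercept_` and `coef_`. The variable name `manual` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

features, target = make_regression(
    n_samples=100, n_features=3, n_informative=3, random_state=1)
model = LinearRegression().fit(features, target)

# beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3, evaluated for every row at once
manual = model.intercept_ + features @ model.coef_

# The hand-computed values should match what the model itself predicts
print(np.allclose(manual, model.predict(features)))
```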

#### 2.2 Make Predictions

With the model in hand, we can predict the target for any sample. Here we predict on the training features and compare each prediction with the known target.

In [7]:
predictions = model.predict(features)
for pred, tar in zip(predictions, target):
    print(f"Prediction: {pred:.2f}, Target: {tar:.2f}, Error: {abs(pred-tar):.2f}")

Prediction: -10.38, Target: -10.38, Error: 0.00
Prediction: 25.51, Target: 25.51, Error: 0.00
Prediction: 19.68, Target: 19.68, Error: 0.00
Prediction: 149.50, Target: 149.50, Error: 0.00
Prediction: -121.65, Target: -121.65, Error: 0.00
Prediction: 90.29, Target: 90.29, Error: 0.00
Prediction: 214.01, Target: 214.01, Error: 0.00
Prediction: 224.74, Target: 224.74, Error: 0.00
Prediction: -73.17, Target: -73.17, Error: 0.00
Prediction: -195.63, Target: -195.63, Error: 0.00
Prediction: -52.49, Target: -52.49, Error: 0.00
Prediction: 201.80, Target: 201.80, Error: 0.00
Prediction: 20.27, Target: 20.27, Error: 0.00
Prediction: 89.16, Target: 89.16, Error: 0.00
Prediction: -4.44, Target: -4.44, Error: 0.00
Prediction: -45.48, Target: -45.48, Error: 0.00
Prediction: 56.90, Target: 56.90, Error: 0.00
Prediction: 120.55, Target: 120.55, Error: 0.00
Prediction: -66.22, Target: -66.22, Error: 0.00
Prediction: -43.84, Target: -43.84, Error: 0.00
Prediction: 34.29, Target: 34.29, Error: 0.00
...
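The same `predict` call also works for samples the model has not seen. The sketch below uses a made-up sample (the values in `new_samples` are hypothetical and chosen only for illustration); note that `predict` expects a 2D array with one row per sample.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

features, target = make_regression(
    n_samples=100, n_features=3, n_informative=3, random_state=1)
model = LinearRegression().fit(features, target)

# One hypothetical new observation with three feature values (shape: 1 row x 3 columns)
new_samples = np.array([[0.5, -1.0, 2.0]])

# predict returns one value per input row
new_predictions = model.predict(new_samples)
print(new_predictions)
```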

#### 2.3 Evaluation using `score()`

`sklearn` provides a `score()` method for more convenient evaluation. The `model.score(X, y)` method performs two main steps internally:

- It uses the input data `X` to generate predictions by calling the `model.predict(X)` method.

- It then compares these predictions against the true labels `y` and returns a single score based on a default metric.

The specific metric used depends on whether you are doing classification or regression. For regressors such as `LinearRegression`, `score()` returns the coefficient of determination, $R^2$.

In [None]:
model.score(features, target)
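For a regressor, the two steps above can be reproduced by hand. The sketch below computes $R^2 = 1 - SS_{res}/SS_{tot}$ and compares it with `model.score` and `sklearn.metrics.r2_score`. As an assumption for this example, a little noise is added through the `noise` argument of `make_regression` so the score is not trivially 1; the names `ss_res`, `ss_tot`, and `r2_manual` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

features, target = make_regression(
    n_samples=100, n_features=3, n_informative=3, noise=5, random_state=1)
model = LinearRegression().fit(features, target)

# Step 1: generate predictions
predictions = model.predict(features)

# Step 2: compare them with the true values
ss_res = np.sum((target - predictions) ** 2)    # unexplained variation
ss_tot = np.sum((target - target.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(features, target), r2_score(target, predictions))
```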

#### 2.4 Noise

Data in the real world is rarely perfect and almost always contains noise. As a result, a model's predictions won't align perfectly with observed values. In fact, it's completely normal for a model to have some degree of error.

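The cell below regenerates the dataset with `noise=5` and refits the model to show the effect on the score.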

In [None]:
# generate feature matrix, target vector
features, target = make_regression(n_samples=100, n_features=3, n_informative=3, noise=5, random_state=1)

# create linear regression
regression = LinearRegression()

# fit the linear regression
model = regression.fit(features, target)
model.score(features, target)

The `make_regression` function first generates data based on a perfect linear relationship:
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
$$

Here, $\beta_1, \beta_2, \beta_3$ are the true feature coefficients (returned as a third value when the `coef` parameter is set to `True`), and $\beta_0$ is the intercept, which is set by the `bias` parameter (default `0.0`). If the data were generated this way, all the points would fall perfectly on a straight line (or a hyperplane in higher dimensions).

However, real-world data almost always has noise. The `noise` parameter is used to simulate this imperfection. It adds random numbers to the perfect $y$ values, drawn from a normal distribution (also called a Gaussian distribution) with a mean of `0` and a standard deviation equal to the value you specify.
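The following sketch illustrates this description. It asks `make_regression` for the true coefficients with `coef=True`, removes the exact linear part from the target, and looks at what is left. The names `true_coef` and `residual` are introduced here for illustration, and the example relies on the default `bias=0.0`.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True makes the function return the true feature coefficients as a third value
features, target, true_coef = make_regression(
    n_samples=100, n_features=3, n_informative=3,
    noise=5, coef=True, random_state=1)

# With bias=0.0, subtracting the exact linear part leaves only the added noise
residual = target - features @ true_coef

print(true_coef)
# The mean should be near 0 and the standard deviation near the requested noise level
print(residual.mean(), residual.std())
```

With only 100 samples, the empirical standard deviation will be close to, but not exactly, the requested value.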

### 3. Interaction Effect

An interaction effect (or simply "interaction") occurs when the effect of one independent variable on an outcome depends on the level or value of at least one other independent variable. In simpler terms, the variables don't work in isolation; their combined impact is different from the sum of their individual impacts.

#### 3.1 Simple Example

For example, imagine a simple coffee-based example where we have two binary features (the presence of sugar and whether or not we have stirred) and we want to predict if the coffee tastes sweet. Just putting sugar in the coffee (`sugar=1, stirred=0`) won't make the coffee taste sweet (all the sugar is at the bottom!) and just stirring the coffee without adding sugar (`sugar=0, stirred=1`) won't make it sweet either. Instead it is the interaction of putting sugar in the coffee and stirring the coffee (`sugar=1, stirred=1`) that will make a coffee taste sweet. The effects of sugar and stirred on sweetness are dependent on each other. In this case we say there is an interaction effect between the features sugar and stirred.
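The coffee example can be turned into a tiny toy dataset. This is only a sketch: the four rows and the 0/1 encoding of sweetness are assumptions made for illustration. A purely additive linear model cannot reproduce the pattern, while adding the product `sugar * stirred` as an extra column lets it fit exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: sugar, stirred. One row for every combination of the two binary features.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
# Sweet (1) only when sugar was added AND the coffee was stirred
sweet = np.array([0.0, 0.0, 0.0, 1.0])

# Additive model: sweet ~ b0 + b1*sugar + b2*stirred
score_additive = LinearRegression().fit(X, sweet).score(X, sweet)

# Append the product sugar * stirred as a third column (the interaction term)
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])
model_inter = LinearRegression().fit(X_inter, sweet)
score_interaction = model_inter.score(X_inter, sweet)

print(score_additive, score_interaction)
print(model_inter.coef_)
```

In the second model, essentially all of the weight should land on the interaction column.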

#### 3.2 Model the Interaction Effect

Now, let's generate a group of samples with higher noise. We can expect the model to fit the new dataset less closely than the noise-free data.

In [8]:
# generate feature matrix, target vector
features, target = make_regression(
    n_samples=100, n_features=3, n_informative=3, noise=20, random_state=1)

# create linear regression
regression = LinearRegression()

# fit the linear regression
model = regression.fit(features, target)
model.score(features, target)

0.9734073126511832

Then we create the interaction term, as follows.

In [9]:
from sklearn.preprocessing import PolynomialFeatures
interaction = PolynomialFeatures(
    degree=3, include_bias=False, interaction_only=True)

The parameters of `PolynomialFeatures` are as follows:
- `degree`: This parameter defines the maximum degree of the features to be created. Here, "degree" refers to the total number of features multiplied together in a single term.
- `interaction_only=False` (the default): This generates all feature combinations up to the specified degree, including powers of a single feature. For example, with $[x_1, x_2]$ and `degree=3`, it would generate $x_1$, $x_2$, $x_1^2$, $x_1\cdot x_2$, $x_2^2$, $x_1^3$, $x_1^2\cdot x_2$, $x_1\cdot x_2^2$, $x_2^3$.
- `interaction_only=True`: This generates only the products of different features (interaction terms) and will not generate powers of a single feature (like $x_1^2$ or $x_1^3$). Its goal is purely to capture the "interaction" between features, not the non-linear relationship of a single feature with itself.
- `include_bias=True` (the default): This adds a column of all 1s as the first feature. This 1 corresponds to the constant term (the intercept) in a polynomial equation, equivalent to $x_1^0$.
- `include_bias=False`: This does not add the column of all 1s.

The `.fit_transform(features)` method applies this rule to the original features.

In [None]:
features_interaction = interaction.fit_transform(features)

Finally, a standard `LinearRegression` model is created and trained using the `.fit()` method. Crucially, it is trained on the enhanced feature matrix, `features_interaction`, which includes the interaction terms, allowing the model to find the best coefficients not just for $x_1$, $x_2$, and $x_3$, but also for their products, such as $x_1\cdot x_2$.

In [None]:
regression = LinearRegression()

# Fit the linear regression
model = regression.fit(features_interaction, target)
model.score(features_interaction, target)

### 4. Nonlinear Relationship

Although linear relationships are highly interpretable and practical, many real-world relationships are nonlinear. In those cases, polynomial fitting can yield better results. `scikit-learn` supports polynomial fitting by combining `PolynomialFeatures` with a linear model.

#### 4.1 Creating Polynomial Features

In [None]:
# Create all polynomial terms up to degree 3 (squares, cubes, and cross terms)
polynomial = PolynomialFeatures(degree=3, include_bias=False)
features_polynomial = polynomial.fit_transform(features)

#### 4.2 Model Training and Evaluation

In [None]:
# Create linear regression
regression = LinearRegression()

# Fit the linear regression
model = regression.fit(features_polynomial, target)
model.score(features_polynomial, target)
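The data above were generated from a linear relationship, so the gain from polynomial terms is small there. As a contrasting sketch, the toy example below uses a purely quadratic target ($y = x^2$, an assumption made for illustration) where a straight line explains almost nothing and a degree-2 fit is exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# One feature on a symmetric grid, with a purely quadratic target
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2

# A straight line cannot capture the U shape
linear_score = LinearRegression().fit(x, y).score(x, y)

# Adding x^2 as a feature keeps the model linear in its coefficients
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
poly_score = LinearRegression().fit(x_poly, y).score(x_poly, y)

print(linear_score, poly_score)
```

Note that both scores are computed on the training data; a higher degree never lowers the training score, so held-out data would be needed to judge whether the extra terms generalize.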

### 5. Standardization, Normalization, and Regularization

Standardization, normalization, and regularization are three concepts in machine learning that are easily confused. In this section, we briefly discuss standardization and regularization through the lens of the linear regression problem.

#### 5.1 Standardization

`sklearn` provides strong support for data preprocessing. For standardization, the `StandardScaler` is quite convenient. First, let's generate the dataset.


In [None]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression

features, target = make_regression(n_samples = 100,
                                   n_features = 3,
                                   n_informative = 2,
                                   n_targets = 1,
                                   noise = 0.2,
                                   coef = False,
                                   random_state = 1)

print(np.mean(features, axis=0), np.var(features, axis=0))

Then, we can perform standardization on the generated data.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)
print(np.mean(features_standardized, axis=0), np.var(features_standardized, axis=0))

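You will notice that after standardization, each feature has a mean of (approximately) 0 and a variance of 1. Standardization only shifts and rescales each feature; it does not change the shape of its distribution, so it does not by itself make the data normally distributed.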
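The transformation itself is just a z-score per column. The sketch below computes it by hand and compares it with `StandardScaler`; the name `manual` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

features, target = make_regression(
    n_samples=100, n_features=3, n_informative=2, noise=0.2, random_state=1)

# z-score: subtract each column's mean and divide by its standard deviation
manual = (features - features.mean(axis=0)) / features.std(axis=0)

scaled = StandardScaler().fit_transform(features)

# Both routes should give the same matrix, with mean ~0 and variance ~1 per column
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.var(axis=0))
```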

#### 5.2 Ridge Regression

The idea of regularization is to add a penalty to the original loss function. To see how, we first present the loss function of linear regression. The standard loss function for linear regression is the mean squared error (MSE), which reflects the average squared difference between the model's predictions and the actual data. The version below includes an extra factor of $\frac{1}{2}$, a common convention that simplifies the derivative.

$$
J(\theta) = \frac{1}{2n}\sum_{i=1}^n(y_i -\hat{y}_i)^2
$$

If you drop the $\frac{1}{2n}$ factor instead of averaging, you get the residual sum of squares (RSS).
$$
RSS = \sum_{i=1}^n(y_i -\hat{y}_i)^2
$$

Ridge regression adds a penalty to the loss function: a tuning hyperparameter $\lambda$ multiplied by the sum of the squared coefficients, where $p$ is the number of features. In `sklearn`, the penalty strength is set through the `alpha` parameter.
$$
J(\theta) = \frac{1}{2n}\sum_{i=1}^n(y_i -\hat{y}_i)^2 + \lambda\sum_{j=1}^p\hat{\beta}_j^2
$$

In [None]:
from sklearn.linear_model import Ridge
regression = Ridge(alpha=0.5)
model_ridge = regression.fit(features_standardized, target)
model_ridge.score(features_standardized, target)
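The shrinking effect of the penalty can be sketched numerically. One caveat, stated as an assumption of this example: scikit-learn's `Ridge` documents its objective as the residual sum of squares plus `alpha` times the sum of squared coefficients, without the $\frac{1}{2n}$ factor used in the formula above, so `alpha` and $\lambda$ differ by a scaling. Under that objective the coefficients have a closed form on centered data, which the code compares with the fitted model. The names `closed_form` and `norms` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

features, target = make_regression(
    n_samples=100, n_features=3, n_informative=2, noise=0.2, random_state=1)
X = StandardScaler().fit_transform(features)

alpha = 0.5
model_ridge = Ridge(alpha=alpha).fit(X, target)

# Closed form on centered data: solve (X^T X + alpha * I) w = X^T y
Xc = X - X.mean(axis=0)
yc = target - target.mean()
closed_form = np.linalg.solve(
    Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
print(closed_form, model_ridge.coef_)

# A larger alpha pulls the coefficient vector closer to zero
norms = []
for a in (0.5, 50.0, 5000.0):
    coef = Ridge(alpha=a).fit(X, target).coef_
    norms.append(np.linalg.norm(coef))
    print(a, norms[-1])
```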

#### 5.3 Lasso Regression

Lasso regression, on the other hand, adds L1 regularization to the loss, which leads to the following, where $p$ is the number of features:
$$
J(\theta) = \frac{1}{2n}\sum_{i=1}^n(y_i -\hat{y}_i)^2 + \lambda\sum_{j=1}^p|\hat{\beta}_j|
$$

In [None]:
from sklearn.linear_model import Lasso
regression = Lasso(alpha=0.5)
model_lasso = regression.fit(features_standardized, target)
model_lasso.score(features_standardized, target)

Ridge (L2 regularization) and Lasso (L1 regularization) yield distinct outcomes. Intuitively, ridge regression does not force coefficients to be exactly zero, but rather shrinks the magnitude of each coefficient. In contrast, lasso regression performs a function analogous to feature selection by setting the coefficients of irrelevant features to zero.

In [None]:
print(model_ridge.coef_)
print(model_lasso.coef_)
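The difference becomes more visible when the penalty strength is varied. The sketch below counts the nonzero coefficients of both models over a few values of `alpha` (the particular values and the name `nonzero` are arbitrary choices for illustration). With a very large `alpha`, lasso is expected to set every coefficient to exactly zero, while ridge coefficients become small but stay nonzero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

features, target = make_regression(
    n_samples=100, n_features=3, n_informative=2, noise=0.2, random_state=1)
X = StandardScaler().fit_transform(features)

# Map each alpha to (nonzero lasso coefficients, nonzero ridge coefficients)
nonzero = {}
for a in (0.01, 0.5, 50.0, 1e4):
    lasso = Lasso(alpha=a).fit(X, target)
    ridge = Ridge(alpha=a).fit(X, target)
    nonzero[a] = (int(np.sum(lasso.coef_ != 0)), int(np.sum(ridge.coef_ != 0)))
    print(a, lasso.coef_, ridge.coef_)

print(nonzero)
```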