# 5. Linear Model

### Linear Regression: Explore the (Linear) relationship between Y-X
- Method: **sklearn.linear_model.LinearRegression**

In [10]:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3

model = LinearRegression(fit_intercept = True, copy_X = True, n_jobs = 3).fit(X, y)

prediction = model.predict(np.array([[3, 5]]))

print(f"Model Score: {model.score(X, y)}")  # R^2 on the training data: 1.0 is a perfect fit; it can be negative for a poor fit
print(f"Model Coefficients: {model.coef_}")
print(f"Model Intercept: {model.intercept_}")

Model Score: 1.0
Model Coefficients: [1. 2.]
Model Intercept: 3.0000000000000018
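
As a cross-check on the fitted values above, the same least-squares problem can be solved directly with NumPy. This is a minimal sketch; the names `X_aug` and `beta` are illustrative and not part of the original cell.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = X @ np.array([1, 2]) + 3

# Append a column of ones so the intercept is estimated as one more weight.
X_aug = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_aug @ beta ~= y.
beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

model = LinearRegression().fit(X, y)
print(f"lstsq intercept: {intercept}, coefficients: {coefs}")
print(f"Matches sklearn: {np.allclose(coefs, model.coef_) and np.isclose(intercept, model.intercept_)}")
```

Both routes minimize the same squared error, so they should agree up to floating-point rounding.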


### Logistic Regression: Estimate class labels (binary 0/1 or multiclass)

- Method: sklearn.linear_model.LogisticRegression

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y = True) # (150, 4), (150,)

# multi_class is left at its default ("auto"); passing it explicitly is deprecated in recent scikit-learn releases.
model = LogisticRegression(penalty = "l2", tol = 1e-4, solver = "liblinear").fit(X, y)

print(f"Model Score: {model.score(X, y)}")
print(f"Actual iter number for each class: {model.n_iter_}")
print(f"Intercept/Bias: {model.intercept_}")
print(f"Total Classes: {model.classes_}")

Model Score: 0.96
Actual iter number for each class: [7 7 6]
Intercept/Bias: [ 0.26421853  1.09392467 -1.21470917]
Total Classes: [0 1 2]



### Ridge Regression: Add L2 Regularization to Loss Function
$$\sum_{j=1}^m\left(Y_j-W_0-\sum_{i=1}^n W_i X_{j i}\right)^2+\alpha \sum_{i=1}^n W_i^2=\text{loss function}+\alpha \sum_{i=1}^n W_i^2$$
- Method: sklearn.linear_model.Ridge

In [26]:
from sklearn.linear_model import Ridge

model = Ridge(solver = "sag").fit(X, y)

print(f"Model Score: {model.score(X, y)}")
print(f"Model Coefficients: {model.coef_}")
print(f"Bias: {model.intercept_}")
print(f"Iter: {model.n_iter_}")


Model Score: 0.930087491805933
Model Coefficients: [-0.11347865 -0.03188039  0.25933952  0.53762684]
Bias: 0.14117085854172984
Iter: [28]
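
The penalized loss above has a closed-form minimizer, which can be compared against scikit-learn. The sketch below assumes the default `alpha=1.0` and uses the exact `"cholesky"` solver rather than the iterative `"sag"` solver from the cell above, so the two results can be compared to tight tolerance. `Xc`, `yc`, `w`, and `w0` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge

X, y = load_iris(return_X_y=True)
alpha = 1.0  # Ridge's default penalty strength

# Center X and y so that the intercept W_0 is not penalized.
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Closed-form minimizer: w = (Xc^T Xc + alpha * I)^(-1) Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
w0 = y_mean - X_mean @ w

model = Ridge(alpha=alpha, solver="cholesky").fit(X, y)
print(f"Closed-form weights: {w}")
print(f"Matches sklearn: {np.allclose(w, model.coef_) and np.isclose(w0, model.intercept_)}")
```

The `"sag"` solver is stochastic, so its coefficients may differ slightly from these values from run to run.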




### Bayesian Regression: A natural mechanism to cope with insufficient or poorly distributed data by formulating linear regression with probability distributions rather than point estimates

#### Bayes Theorem

$$ P(A \mid B)=\frac{P(B \mid A) \times P(A)}{P(B)} $$
- $P(A \mid B)$ is the posterior probability: The probability of event $\mathrm{A}$ occurring given that $B$ is true.
- $P(B \mid A)$ is the likelihood: The probability of observing $\mathrm{B}$ given that $\mathrm{A}$ is true.
- $P(A)$ is the prior probability: The initial probability of event $\mathrm{A}$.
- $P(B)$ is the evidence: The overall probability of event $B$.
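
A small worked example may help make the four terms concrete. The numbers below are hypothetical (a test for a condition with 1% prevalence, 95% sensitivity, and a 5% false-positive rate); they are chosen only for illustration.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
prior = 0.01                 # P(A): probability of having the condition
sensitivity = 0.95           # P(B | A): positive test given the condition
false_positive_rate = 0.05   # P(B | not A): positive test without the condition

# Evidence P(B): total probability of a positive test.
evidence = sensitivity * prior + false_positive_rate * (1 - prior)

# Posterior P(A | B): probability of the condition given a positive test.
posterior = sensitivity * prior / evidence

print(f"P(B) = {evidence:.4f}")
print(f"P(A | B) = {posterior:.4f}")
```

Even with a fairly accurate test, the posterior stays well below the sensitivity because the prior is small.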

#### Bayesian Inference
Bayesian inference uses Bayes' theorem to **update the probability estimate for a hypothesis as more evidence or information becomes available**. 
- It's about belief revision: you start with a prior belief (prior probability), and as new data comes in, you update your belief to form a new, revised belief (posterior probability).

**In this process, the main takeaway is that instead of just getting the "best guess" (as in classical regression), Bayesian Ridge Regression gives you a "range of guesses" along with how confident you are in each guess. This is particularly useful when making decisions in the face of uncertainty.**

##### $P(data)$ evidence: the probability of observing the data under all possible values of parameters.

- It serves as a normalization factor, ensuring that the posterior is a valid probability distribution that sums (or integrates) to 1.
- $P(\text { data })=\int P(\text { data } \mid \beta) \cdot P(\beta) d \beta$
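
The integral can be approximated numerically on a grid. The sketch below assumes a hypothetical coin-flip setting (7 heads in 10 flips, with $\beta$ the probability of heads and a uniform prior) that is not part of the original notebook; it uses a simple midpoint rule.

```python
from math import comb

import numpy as np

n_flips, n_heads = 10, 7           # hypothetical data
N = 10_000                          # number of grid cells
beta = (np.arange(N) + 0.5) / N     # midpoints of N equal cells on [0, 1]
d_beta = 1.0 / N

prior = np.ones_like(beta)          # uniform prior density P(beta)
likelihood = comb(n_flips, n_heads) * beta**n_heads * (1 - beta) ** (n_flips - n_heads)

# Evidence: integral of P(data | beta) * P(beta) over beta.
evidence = np.sum(likelihood * prior) * d_beta

# Dividing by the evidence makes the posterior integrate to 1.
posterior = likelihood * prior / evidence

print(f"P(data) ~= {evidence:.6f}")
print(f"Posterior integrates to {np.sum(posterior) * d_beta:.6f}")
```

For regression models the integral is over all weights at once, which is why closed-form priors such as the Gaussian are convenient.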

#### Bayesian & Traditional Linear Regression
- Traditional linear regression gives you point estimates for the output $y$ given inputs $X$ and weights $w$.
- Bayesian regression, on the other hand, treats the output $y$ as a random variable with a probability distribution. This allows the model to express uncertainty in its predictions.

#### Bayesian Ridge Regression
- In Bayesian Ridge regression, we start with a prior belief about the distribution of the weights $w$.
- The prior is assumed to be a **spherical Gaussian**, which means **it's a multivariate normal distribution where all dimensions are independent and have the same variance**.
- The notation $p(w \mid \lambda)$ denotes the probability of the weights given the hyperparameter $\lambda$, which controls the spread of the prior distribution (a larger $\lambda$ means a tighter spread around zero, acting as regularization).
- The formula $w \sim \mathcal{N}\left(0, \lambda^{-1} I_p\right)$ states that the prior distribution of the weights is centered at zero with a covariance matrix $\lambda^{-1} I_p$ where $I_p$ is the identity matrix of size $p$, and $p$ is the number of features (or predictors).
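
To see how $\lambda$ controls the spread of the prior, one can draw samples from $\mathcal{N}(0, \lambda^{-1} I_p)$ for several values of $\lambda$. This is a minimal sketch; `p = 3` and the chosen $\lambda$ values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3  # number of features (arbitrary, for illustration)

spread = {}
for lam in (0.1, 1.0, 10.0):
    # Spherical Gaussian prior: zero mean, covariance (1 / lambda) * I_p.
    samples = rng.multivariate_normal(np.zeros(p), np.eye(p) / lam, size=5000)
    spread[lam] = samples.std()
    print(f"lambda = {lam:5.1f}  empirical std = {spread[lam]:.3f}  theoretical std = {lam ** -0.5:.3f}")
```

Larger $\lambda$ concentrates the sampled weights around zero, which is the regularizing effect described above.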

In [4]:
from sklearn.linear_model import BayesianRidge
x = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 1, 2, 3]

bayes_model = BayesianRidge()
bayes_model.fit(x, y)


print(f"Weight: {bayes_model.coef_}")
print(f"Intercept: {bayes_model.intercept_}")
print(f"Estimated Precision of the noise: {bayes_model.alpha_}")
print(f"Estimated Precision of the weight: {bayes_model.lambda_}")
print(f"Iterations: {bayes_model.n_iter_}")
print(f"Estimated Covariance of the weights: {bayes_model.sigma_}")
print(f"Log marginal likelihood per iteration: {bayes_model.scores_}")  # stays empty unless compute_score=True

Weight: [0.49999993 0.49999993]
Intercept: 1.9999946720972162e-07
Estimated Precision of the noise: 1500000.9999922747
Estimated Precision of the weight: 1.9999962667112887
Iterations: 4
Estimated Covariance of the weights: [[ 0.2500005  -0.25000043]
 [-0.25000043  0.2500005 ]]
Log marginal likelihood per iteration: []
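
The "range of guesses" mentioned earlier is available through `predict(..., return_std=True)`, which returns a standard deviation alongside each predicted mean. The sketch below uses synthetic noisy data (an assumption, since the toy data above is perfectly linear) and compares a query inside the training range with one far outside it.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=1.0, size=30)  # synthetic noisy line

model = BayesianRidge().fit(X, y)

# One query inside the training range and one far outside it.
X_new = np.array([[5.0], [50.0]])
mean, std = model.predict(X_new, return_std=True)

for x_val, m, s in zip(X_new[:, 0], mean, std):
    print(f"x = {x_val:5.1f}  predicted mean = {m:7.2f}  predicted std = {s:.3f}")
```

The predictive standard deviation grows for the query far from the training data, reflecting greater uncertainty about the weights there.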



### LASSO: Add L1 Regularization to Loss Function
$$\sum_{j=1}^m\left(Y_j-W_0-\sum_{i=1}^n W_i X_{j i}\right)^2+\alpha \sum_{i=1}^n\left|W_i\right|=\text{loss function}+\alpha \sum_{i=1}^n\left|W_i\right|$$


In [22]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

x = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]
y = [0, 1, 2, 3, 4]

lasso = Lasso(alpha = 0.03) # alpha sets the strength of the L1 penalty
lasso.fit(x, y)

y_pred = lasso.predict(x) # predict once and reuse the result for every metric
print(f"Predict: {y_pred}")
print(f"R-squared: {r2_score(y, y_pred)}")
print(f"MSE: {mean_squared_error(y, y_pred)}")
print(f"MAE: {mean_absolute_error(y, y_pred)}")

print(f"Weights: {lasso.coef_}")
print(f"Intercept: {lasso.intercept_}")
print(f"Iterations: {lasso.n_iter_}")

Predict: [0.03  1.015 2.    2.985 3.97 ]
R-squared: 0.999775
MSE: 0.00045000000000000216
MAE: 0.018000000000000016
Weights: [9.85000000e-01 2.77555756e-17]
Intercept: 0.030000000000000027
Iterations: 2
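
The sparsity induced by the L1 penalty is easiest to see by sweeping `alpha`. The sketch below uses synthetic data in which only the first two of five features influence the target (an illustrative assumption) and counts the nonzero coefficients at each penalty strength.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features influence y (illustrative ground truth).
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

nonzero_counts = []
for alpha in (0.01, 0.1, 1.0, 5.0):
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    nonzero_counts.append(int(np.count_nonzero(coef)))
    print(f"alpha = {alpha:4.2f}  nonzero = {nonzero_counts[-1]}  coef = {np.round(coef, 3)}")
```

As `alpha` grows, more coefficients are driven exactly to zero, until the model predicts only the intercept.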



### Multi-task Lasso: trained with a mixed L1/L2-norm regularizer, estimates sparse coefficients for multiple regression problems jointly.

- Multi-task is useful when you believe that the tasks are related and can benefit from being learned together because the shared structure among the tasks can be exploited.

#### Example
Imagine you're managing a chain of stores, and you want to predict the daily sales of multiple product categories across your stores. You have features like store location, day of the week, marketing spend, and holidays. While each product category is a separate regression problem, they share common features and potentially some underlying sales patterns.

Here, each "task" is the prediction of sales for a specific product category:

- Task 1: Predicting daily sales of electronics.
- Task 2: Predicting daily sales of clothing.
- Task 3: Predicting daily sales of groceries.

Instead of building three separate models, one for each category, you build a multi-task model with shared feature sets but separate outputs for each category.

In [27]:
from sklearn.linear_model import MultiTaskLasso
import numpy as np

# Feature matrix (each row corresponds to data for one day)
X = np.array([
    [1, 2],  # Features for Day 1
    [2, 3],  # Features for Day 2
    [3, 4],  # Features for Day 3
    # ... and so on for more days
])

# Target matrix (each row corresponds to sales for each product category)
Y = np.array([
    [200, 150, 50],  # Sales for Day 1: [electronics, clothing, groceries]
    [220, 165, 60],  # Sales for Day 2: [electronics, clothing, groceries]
    [240, 180, 70],  # Sales for Day 3: [electronics, clothing, groceries]
    # ... and so on for more days
])

# Initialize the MultiTaskLasso model
multi_task_lasso = MultiTaskLasso(alpha=0.5)

# Fit the model to the data
multi_task_lasso.fit(X, Y)

# Predict sales for a new day with given features
new_X = np.array([[4, 5]])  # Features for the new day
predicted_sales = multi_task_lasso.predict(new_X)

print(f"Electronics, Clothing, Groceries: {predicted_sales}")
print(f"Model Coefficients: {multi_task_lasso.coef_}")
print(f"Model Intercept: {multi_task_lasso.intercept_}")
print(f"Model Iterations: {multi_task_lasso.n_iter_}")
print(f"Model Score: {multi_task_lasso.score(X, Y)}")



Electronics, Clothing, Groceries: [[258.88582797 194.16437098  79.44291399]]
Model Coefficients: [[1.94429140e+01 1.21842475e-14]
 [1.45821855e+01 9.13818561e-15]
 [9.72145699e+00 6.09212374e-15]]
Model Intercept: [181.11417203 135.83562902  40.55708601]
Model Iterations: 2
Model Score: 0.9992241379310345
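
The practical effect of the joint penalty is that all tasks share one sparsity pattern: a feature is either used by every task or dropped from all of them. The sketch below illustrates this on synthetic data in which three tasks depend only on the first two of six features; the data and `W_true` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

# Three tasks that all depend on features 0 and 1 only (illustrative).
W_true = np.zeros((6, 3))
W_true[0] = [3.0, 2.0, 1.0]
W_true[1] = [-2.0, 1.5, 2.5]
Y = X @ W_true + rng.normal(scale=0.1, size=(200, 3))

model = MultiTaskLasso(alpha=0.5).fit(X, Y)

# coef_ has shape (n_tasks, n_features); a feature counts as selected if any task uses it.
selected = np.any(model.coef_ != 0, axis=0)
print(f"Coefficient matrix shape: {model.coef_.shape}")
print(f"Selected features: {np.flatnonzero(selected)}")
```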



### Elastic-Net: linearly combines both L1 and L2 penalty. 

$$\min _w \frac{1}{2 n_{\text{samples}}}\left\|X w-y\right\|_2^2+\alpha \rho\|w\|_1+\frac{\alpha(1-\rho)}{2}\|w\|_2^2$$

Here $\alpha$ corresponds to the `alpha` parameter and $\rho$ to `l1_ratio`.

- Useful when the data has many features, some of which are correlated.
- It combines the strengths of both penalties: Lasso (L1) performs feature selection, while Ridge (L2) handles correlated variables.
- Elastic-Net can be a more stable choice than Lasso or Ridge alone, especially when several features are related or when the number of predictors exceeds the number of observations.

#### Lasso & Elastic-Net
- Lasso can result in **sparse models** where only a subset of all possible predictors is used. 
     - This is particularly useful for **feature selection** when you have a large number of features.
- As for **Dealing With Correlated Features**: 
    - In the case of **highly correlated features**, Lasso tends to keep one of them and shrink the others to 0. Which one survives is somewhat arbitrary: it depends on the data and on the order of the features.
    - Elastic-Net, on the other hand, is likely to include both correlated features in the model but with **their coefficients shrunk towards each other and possibly towards zero**. This is because Elastic-Net includes the L2 penalty, which does not enforce sparsity as strongly as the L1 penalty.
- Combination of L1 and L2 Regularization: 
    1. produce sparse model (L1)
    2. The L2 part of the penalty helps to handle correlated features more effectively than Lasso alone. It tends to **shrink correlated predictors towards each other**, which means that if one feature in a group of correlated features has a non-zero coefficient, the others are likely to as well. This "grouping effect" is beneficial when predictors are correlated.
       - On its own, the L2 penalty keeps all features in the model but shrinks their coefficients toward 0.
       - This can lead to better performance when the number of features (predictors) is large compared with the number of observations (data points), or when several features are correlated.
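
The grouping effect described above can be seen in an extreme case: two identical columns. This is a minimal sketch on synthetic data. The tight `tol` and larger `max_iter` are illustrative choices so that the two Elastic-Net coefficients visibly settle on the same value.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1])  # two perfectly correlated (duplicate) features
y = 2.0 * x1 + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-8, max_iter=10_000).fit(X, y)

print(f"Lasso coefficients:       {lasso.coef_}")
print(f"Elastic-Net coefficients: {enet.coef_}")
```

In this setup Lasso puts essentially all the weight on one column, while Elastic-Net splits it evenly between the two.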

#### Cyclic or Random ?
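- These are the two options of the `selection` parameter, which sets the order in which coordinate descent updates the coefficients.
- Cyclic updating is systematic and deterministic: every coefficient is updated once per pass, which makes it **more stable and reproducible**.
- Random updating picks a random coefficient at each step. The objective is convex, so there are no local minima to escape; the benefit is that it **can converge faster in some cases**, particularly with a looser tolerance, but results vary slightly from run to run unless `random_state` is fixed.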
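
The two update orders can be compared directly through the `selection` parameter. The sketch below uses synthetic data with fixed seeds (illustrative choices) and reports the iteration counts and the largest difference between the fitted coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=20, noise=1.0, random_state=0)

cyclic = ElasticNet(alpha=0.1, l1_ratio=0.5, selection="cyclic").fit(X, y)
random_order = ElasticNet(alpha=0.1, l1_ratio=0.5, selection="random", random_state=0).fit(X, y)

print(f"Iterations (cyclic): {cyclic.n_iter_}")
print(f"Iterations (random): {random_order.n_iter_}")
print(f"Largest coefficient difference: {np.max(np.abs(cyclic.coef_ - random_order.coef_))}")
```

Because the objective is convex, both orders should arrive at nearly the same coefficients; only the path and the iteration count differ.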


In [28]:
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score

# Generate a synthetic regression dataset
X, y = make_regression(n_features=2, noise=0.1)

# Create an ElasticNet regression model instance
# Note: l1_ratio between 0 and 1 controls the balance between L1 and L2 regularization (0.5 is an equal balance)
elastic_net_model = ElasticNet(alpha=0.1, l1_ratio=0.5)

# Fit the model
elastic_net_model.fit(X, y)

# Predict using the trained model
y_pred = elastic_net_model.predict(X)

# Print model properties
print(f"Coefficients: {elastic_net_model.coef_}")
print(f"Intercept: {elastic_net_model.intercept_}")
print(f"Number of iterations: {elastic_net_model.n_iter_}")

# Evaluate the model's performance
r2 = r2_score(y, y_pred)
print(f"R-squared: {r2}")


Coefficients: [57.77248544 13.23491871]
Intercept: -0.5103725681526683
Number of iterations: 3
R-squared: 0.9980894184240998


### MultiTaskElasticNet
Following is the objective function to minimize:
$$
\min _W \frac{1}{2 n_{\text{samples}}}\left\|X W-Y\right\|_{\text{Fro}}^2+\alpha \rho\|W\|_{21}+\frac{\alpha(1-\rho)}{2}\|W\|_{\text{Fro}}^2
$$

As in MultiTaskLasso, Fro denotes the Frobenius norm:
$$
\|A\|_{F r o}=\sqrt{\sum_{i j} a_{i j}^2}
$$

The subscript 21 denotes the mixed L1/L2 norm, which sums the Euclidean norms of the rows:
$$
\|A\|_{21}=\sum_i \sqrt{\sum_j a_{i j}^2}
$$
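
Both norms are straightforward to compute with NumPy. The small matrix `A` below is an arbitrary example chosen so that the row norms are easy to check by hand.

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])

# Frobenius norm: square root of the sum of all squared entries.
fro = np.sqrt(np.sum(A ** 2))

# Mixed L1/L2 norm: sum over rows of each row's Euclidean norm (here 5 + 0 + 13).
l21 = np.sum(np.sqrt(np.sum(A ** 2, axis=1)))

print(f"Frobenius norm: {fro:.4f} (np.linalg.norm gives {np.linalg.norm(A, 'fro'):.4f})")
print(f"L21 norm: {l21}")
```

Because the L21 norm penalizes whole rows, it pushes entire rows to zero together instead of individual entries.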


In [29]:
from sklearn.linear_model import MultiTaskElasticNet
from sklearn.datasets import make_regression

# Generate synthetic data
# Here, n_targets > 1 creates multiple y values for each X, suitable for multi-task learning
X, Y = make_regression(n_samples=100, n_features=10, n_targets=3, noise=0.1)

# Create a MultiTaskElasticNet regression model instance
# alpha: Constant that multiplies the penalty terms and thus determines the level of regularization
# l1_ratio: The ElasticNet mixing parameter, with 0 < l1_ratio <= 1; l1_ratio = 1 gives a pure L1/L2 penalty (as in MultiTaskLasso), and values near 0 approach a pure L2 (Ridge-style) penalty.
multi_task_elastic_net_model = MultiTaskElasticNet(alpha=0.01, l1_ratio=0.5)

# Fit the model
multi_task_elastic_net_model.fit(X, Y) 

# The coefficients
print(f"Coefficients: {multi_task_elastic_net_model.coef_}")

# The intercepts
print(f"Intercepts: {multi_task_elastic_net_model.intercept_}")

# Predict using the trained model
Y_pred = multi_task_elastic_net_model.predict(X)

# The model has a `.score` method that returns the coefficient of determination R^2 of the prediction.
# The score method estimates the performance of the model on the training set
score = multi_task_elastic_net_model.score(X, Y)
print(f"Score: {score}")


Coefficients: [[63.36710479 69.43852685 16.98190953 33.48697788 79.2751291  64.9175761
  61.82069432 98.76607078 98.26524367 74.99887439]
 [17.5410879  44.59688814  5.43184517 89.02715564  9.03561985 35.94878403
  16.94801022 55.66961899 11.11050918 94.5492273 ]
 [57.88935148 90.78696675 40.73576259 44.32712342 40.74797597 12.58649269
   9.41150463 63.18893214 51.3272421  12.73376737]]
Intercepts: [ 0.15607329 -0.03357989  0.11068505]
Score: 0.9999765070436003
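
Since the score above is computed on the training set, a held-out split gives a more honest estimate of performance. The sketch below is one way to do that with `train_test_split`; the split size and seeds are arbitrary choices, not part of the original cell.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import MultiTaskElasticNet
from sklearn.model_selection import train_test_split

X, Y = make_regression(n_samples=100, n_features=10, n_targets=3, noise=0.1, random_state=0)

# Hold out 25% of the samples for evaluation.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

model = MultiTaskElasticNet(alpha=0.01, l1_ratio=0.5).fit(X_train, Y_train)

train_r2 = model.score(X_train, Y_train)
test_r2 = model.score(X_test, Y_test)
print(f"Train R^2: {train_r2}")
print(f"Test R^2:  {test_r2}")
```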
