# Linear Models

Linear Regression (Simple/Multiple)

Ridge Regression (L2 regularization) 

Lasso Regression (L1 regularization) 

ElasticNet Regression (L1 + L2)  

Bayesian Linear Regression

Polynomial Regression  


# Linear Regression 


Linear Regression is a supervised learning algorithm used to predict a continuous value by finding a linear relationship between input features (X) and output (Y).

Equation

Simple Linear Regression:


y = mx + c

y → predicted value

x → input feature

m → slope (weight)

c → intercept (bias)

Multiple Linear Regression:


y = w1x1 + w2x2 + ... + b

Goal of Linear Regression

To find the best-fit line that minimizes prediction error between actual and predicted values.

Cost Function (Loss Function)

Mean Squared Error (MSE):


J(θ) = (1/n) ∑ (y_actual − y_predicted)²

Measures how wrong the model is.

How Model Learns
Gradient Descent

Iteratively updates weights

Moves in the direction of minimum error


w = w − α (∂J/∂w)

α (learning rate) controls step size

Assumptions of Linear Regression

Linear relationship between X and Y

No multicollinearity

Errors are normally distributed

Homoscedasticity (constant variance)

Independence of errors



Types of Linear Regression

Simple Linear Regression (1 feature)

Multiple Linear Regression (many features)

Polynomial Regression (non-linear pattern using linear model)

Evaluation Metrics

MSE

RMSE

MAE

R² Score (goodness of fit)
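
The four metrics above can be computed directly from their definitions. The following is a minimal NumPy sketch; the helper name `regression_metrics` and the toy values are hypothetical and chosen only for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 for two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)                      # same units as y
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

# Hypothetical toy values
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
```

RMSE and MAE are in the units of y, which usually makes them easier to report than MSE.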

Advantages

Simple & fast

Easy to interpret

Works well for linear data

Limitations

Poor for non-linear data

Sensitive to outliers

Assumption-dependent



Why squared error?

Penalizes large errors more strongly.

When not to use Linear Regression?

When data is non-linear or has many outliers.

What is R²?

The proportion of the variance in Y that the model explains using X.

Real-World Use Cases

House price prediction

Salary prediction

Sales forecasting

Risk analysis

1. Cost Function - The "Why" of MSE

MSE vs. MAE: MSE is used because it's differentiable everywhere, which is essential for Gradient Descent. It penalizes large errors quadratically, making the model more sensitive to outliers. MAE is less sensitive to outliers but isn't as smoothly differentiable (gradient issues at 0).

The Normal Equation: An alternative to Gradient Descent. It provides a closed-form solution for the optimal weights: θ = (XᵀX)⁻¹Xᵀy. It's fast for small datasets, but inverting XᵀX costs roughly O(p³) in the number of features p, so it becomes expensive for large feature sets.
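
The closed-form solution can be tried on the same y = 2x toy data used in this notebook. This is a sketch under two assumptions that are not in the notes: a column of ones is prepended so the intercept is part of θ, and `np.linalg.solve` is used instead of an explicit inverse, which is the usual numerically safer choice.

```python
import numpy as np

# Toy data following y = 2x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Prepend a column of ones so the intercept is learned as theta[0]
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve (X^T X) theta = X^T y rather than forming the inverse explicitly
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

intercept, slope = theta
print("Intercept:", intercept, "Slope:", slope)
```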

2. Gradient Descent - Nuances

Learning Rate (α): Critical hyperparameter. Too high → overshoot & diverge. Too low → slow convergence.

Types: Batch GD (uses all data, slow), Mini-batch GD (uses a subset, best of both worlds), Stochastic GD (uses one sample per step, noisy but fast).
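
The three variants differ only in how many samples feed each update. A minimal sketch for one feature, assuming a hypothetical helper `minibatch_gd` and illustrative default values for the learning rate, epoch count and batch size:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, epochs=1000, batch_size=2, seed=0):
    """Fit y ~ w*x + b for 1-D x using mini-batch gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)           # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            error = yb - (w * xb + b)
            # Gradients of the batch MSE with respect to w and b
            dw = (-2 / len(xb)) * np.sum(xb * error)
            db = (-2 / len(xb)) * np.sum(error)
            w -= lr * dw
            b -= lr * db
    return w, b

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

w, b = minibatch_gd(X, y)
print(w, b)
```

Setting `batch_size=len(X)` turns this into Batch GD, and `batch_size=1` into Stochastic GD.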

3. Assumptions - Deeper Dive 

Linearity: Check with scatter plots of y vs each X. If violated, consider transformations or Polynomial Regression.

No Multicollinearity: Check with Variance Inflation Factor (VIF). If high (>5 or 10), it inflates coefficient variance. Fix with feature removal, PCA, or regularization (Ridge Regression).

Normality of Errors: Check with a Q-Q plot of residuals. Violation affects confidence intervals & p-values, but predictions might still be okay. Central Limit Theorem helps with large n.

Homoscedasticity: Check with residuals vs. fitted values plot. If violated (heteroscedasticity), standard errors are unreliable. Consider transformations (log(y)) or weighted least squares.

Independence of Errors: Critical for time series. Violation (autocorrelation) invalidates tests. Use time series models (ARIMA) or check with Durbin-Watson statistic.
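
Two of the checks above are easy to compute by hand. The sketch below assumes hypothetical helpers `vif` and `durbin_watson` and synthetic data. VIF is computed from its definition, 1 / (1 − R²), where R² comes from regressing one feature on the others. The Durbin-Watson statistic is the sum of squared successive residual differences divided by the sum of squared residuals.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor of each column of a 2-D feature matrix."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the remaining features (with an intercept)
        A = np.hstack([np.ones((n, 1)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ coef
        r2 = 1.0 - residuals.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

def durbin_watson(residuals):
    """Values near 2 suggest no first-order autocorrelation; near 0, positive autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)

# Hypothetical features: x2 is almost a copy of x1, x3 is unrelated
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
vif_scores = vif(np.column_stack([x1, x2, x3]))
print("VIF:", vif_scores)

# Hypothetical residuals: independent noise vs. strongly autocorrelated AR(1) noise
white = rng.normal(size=500)
ar = np.zeros(500)
for t in range(1, 500):
    ar[t] = 0.9 * ar[t - 1] + white[t]
dw_white = durbin_watson(white)
dw_ar = durbin_watson(ar)
print("Durbin-Watson:", dw_white, dw_ar)
```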

4. Model Interpretation & Pitfalls

Interpretation of Coefficients: "Holding all other features constant, a one-unit increase in X1 is associated with an average change of w1 units in Y."

The P-Value Trap: A low p-value for a coefficient doesn't mean the predictor is important, only that its effect is statistically distinguishable from zero. Always check the effect size (coefficient magnitude).

Overfitting: Even linear models can overfit with many features. Solution: Regularization (L1/Lasso, L2/Ridge).

5. Regularization Connection 

Ridge Regression (L2): Adds penalty λ * Σ(wᵢ²) to MSE. Shrinks coefficients but doesn't zero them out. Good for multicollinearity.

Lasso Regression (L1): Adds penalty λ * Σ|wᵢ|. Can shrink coefficients to exactly zero, performing feature selection.

ElasticNet: Combines L1 and L2 penalties.

6. Advanced Interview One-Liners


Q: What if residuals are not normally distributed?

A: The model's coefficient estimates are still unbiased, but hypothesis tests (p-values, confidence intervals) become unreliable in small samples. For large sample sizes, the CLT usually makes them approximately valid.

Q: Is Linear Regression a parametric or non-parametric model?

A: Parametric. It makes a strong assumption about the form of the underlying function (linear in parameters).

Q: Can you use Linear Regression for classification?

A: Technically yes (e.g., predict probability), but it's unsuitable as outputs can be outside [0,1] and error distribution is wrong. Use Logistic Regression instead.

Q: How do you handle categorical variables?

A: Use One-Hot Encoding. Remember to drop one category to avoid the dummy variable trap (perfect multicollinearity).
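
Dropping one category can be sketched with scikit-learn's `OneHotEncoder` and its `drop="first"` option. The colour column is hypothetical, and `.toarray()` is used because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with three levels
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# drop="first" removes one dummy column per feature to avoid the dummy variable trap
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_)   # levels are sorted; the first one ("blue") is dropped
print(encoded)               # remaining columns: green, red
```

A row of all zeros then stands for the dropped level, so no information is lost.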

In [None]:
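# Linear Regression from scratch using Gradient Descent

import numpy as np

class LinearRegression:
    """Simple (one-feature) linear regression trained with batch gradient descent."""

    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr          # learning rate (step size)
        self.epochs = epochs  # number of passes over the data
        self.w = 0.0          # slope
        self.b = 0.0          # intercept

    def fit(self, X, y):
        n = len(X)

        for _ in range(self.epochs):
            y_pred = self.w * X + self.b

            # Gradients of the MSE with respect to w and b
            dw = (-2/n) * np.sum(X * (y - y_pred))
            db = (-2/n) * np.sum(y - y_pred)

            # Step against the gradient
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        return self.w * X + self.b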


In [5]:
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
model.fit(X, y)

print(model.predict(np.array([6])))


[11.98848257]


In [6]:
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression()
model.fit(X, y)

prediction = model.predict([[6]])
print(prediction)


[12.]


In [7]:
print("Slope:", model.coef_)
print("Intercept:", model.intercept_)


Slope: [2.]
Intercept: 0.0


In [8]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X)

print("MSE:", mean_squared_error(y, y_pred))
print("R2 Score:", r2_score(y, y_pred))


MSE: 0.0
R2 Score: 1.0


# Polynomial Regression
Polynomial Regression is a type of Linear Regression that models a non-linear relationship by transforming input features into polynomial terms.

It is linear in parameters but non-linear in features: the model stays linear in the weights (w), even though x is raised to powers.
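
A small sketch makes the "linear in parameters" point concrete: the transformation only adds columns, and an ordinary linear model is then fitted on those columns. The two input values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])

# degree=2 expands each x into the columns [1, x, x^2]
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
print(X_poly)
```

Each weight still multiplies exactly one column, which is why the usual linear regression solvers apply unchanged.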

When to Use Polynomial Regression?

Data shows curved / non-linear pattern

Linear Regression underfits

Relationship is smooth and continuous

Degree of Polynomial

Low degree → Underfitting (high bias)

High degree → Overfitting (high variance)


Overfitting Risk 

Polynomial Regression overfits easily, especially with high degree.

Regularization (Ridge/Lasso) is commonly used to control it

Cross-validation is used to choose the degree
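
Choosing the degree by cross-validation can be sketched as a loop over candidate degrees. The synthetic quadratic data, the degree range 1 to 6 and the 5-fold setting are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy quadratic data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.5, size=100)

cv_mse = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn maximizes scores, so MSE is reported as a negative number
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("Best degree:", best_degree)
```

Putting the feature expansion inside the pipeline means it is refitted on each training fold, so the validation folds stay unseen.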


Q: Why not always use high-degree polynomial?

Causes overfitting.

Q: Is Polynomial Regression non-linear?

Non-linear in features, linear in parameters.

Q: How to choose degree?

Cross-validation.

Q: How to reduce overfitting?

Regularization.


In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)


In [None]:
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X_poly, y)



# Ridge Regression
Ridge Regression is a regularized version of Linear / Multiple Linear Regression that adds an L2 penalty to the loss function to reduce overfitting.

Mainly used when multicollinearity exists.

Why Ridge Regression?

Linear Regression problems:

Overfitting

Large coefficients

Unstable weights (multicollinearity)


Ridge solves these by shrinking weights

Loss Function 

Loss = MSE + λ∑w^2

Where:

MSE → prediction error

λ → regularization strength

∑w^2 → L2 penalty

Effect of L2 Penalty

Penalizes large weights:

Shrinks coefficients towards zero

Never makes weights exactly zero



Ridge reduces variance without feature elimination.


Role of λ (Lambda)

λ = 0 → Normal Linear Regression

Small λ → Slight regularization

Large λ → Strong regularization → Underfitting

When to Use Ridge Regression?

Many correlated features

All features are important

Want stable model

#### Loss = MSE + lambda * sum(w^2)
#### Gradient update adds extra term: 2 * lambda * w
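
The extra gradient term can be seen in a from-scratch sketch. The helper name `ridge_gd` and the default values are hypothetical, and leaving the bias unpenalized is an assumption that matches common practice rather than something stated in these notes.

```python
import numpy as np

def ridge_gd(X, y, lam=0.1, lr=0.01, epochs=5000):
    """Gradient descent on Loss = MSE + lam * sum(w^2); the bias is not penalized."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(epochs):
        error = y - (X @ w + b)
        dw = (-2 / n) * (X.T @ error) + 2 * lam * w   # extra term from the L2 penalty
        db = (-2 / n) * np.sum(error)
        w -= lr * dw
        b -= lr * db
    return w, b

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

w_plain, _ = ridge_gd(X, y, lam=0.0)   # ordinary linear regression
w_ridge, _ = ridge_gd(X, y, lam=1.0)   # penalized fit
print(w_plain, w_ridge)
```

With `lam=0.0` the slope recovers the unpenalized value, while a positive `lam` pulls it towards zero without making it exactly zero.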

Q: Why does Ridge handle multicollinearity?
It shrinks the weights of correlated features together, which stabilizes the estimates.

Q: Does Ridge remove features?
No.

Q: What happens if alpha (scikit-learn's name for λ) is too high?
Underfitting.

Q: Is Ridge linear?
Yes, linear in parameters.



In [None]:
# Ridge Regression
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
model.fit(X, y)

print(model.coef_)
print(model.intercept_)


# Lasso Regression

Lasso Regression is a regularized linear regression technique that adds an L1 penalty to the loss function.

Main purpose: Reduce overfitting + perform feature selection

Why Lasso?

Problems with Linear Regression:

Overfitting

Too many irrelevant features

Lasso solves this by forcing some weights to exactly zero

Lasso creates sparse models.

Effect of L1 Penalty

Penalizes absolute value of weights

Pushes some weights to exactly zero

Automatically performs feature selection

When to Use Lasso?

Dataset has many irrelevant features

Need feature selection

Want simpler, interpretable model


Geometric Intuition 

L1 penalty has sharp corners

Optimization often hits an axis → zero coefficients
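
One way to see the exact zeros numerically is the soft-thresholding operator, which is the update the L1 penalty leads to in the idealized case of orthonormal features. That setting is an assumption for illustration, and the helper names `soft_threshold` and `l2_shrink` are hypothetical. The L2 counterpart is shown for contrast.

```python
import numpy as np

def soft_threshold(w, threshold):
    """Move each weight towards zero by `threshold`; smaller weights become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def l2_shrink(w, lam):
    """Ridge-style proportional shrinkage: weights get smaller but stay non-zero."""
    return w / (1.0 + lam)

weights = np.array([3.0, -0.4, 0.05, -2.0])
l1_result = soft_threshold(weights, 0.5)
l2_result = l2_shrink(weights, 0.5)
print(l1_result)   # the two small weights are zeroed out
print(l2_result)   # every weight is scaled down, none is zero
```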

Lasso Limitation

If features are highly correlated, Lasso tends to keep one of them, somewhat arbitrarily, and zero out the others

Solution: ElasticNet

Q: Why does Lasso perform feature selection?
The L1 penalty can push weights to exactly zero, which removes those features from the model.

Q: Can Lasso handle multicollinearity well?
No.

Q: What happens if alpha is too large?
Underfitting.

Q: Ridge or Lasso for interpretability?
Lasso.


In [None]:
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)
model.fit(X, y)

print(model.coef_)
print(model.intercept_)


# ElasticNet Regression (L1 + L2)

What is ElasticNet Regression?

ElasticNet is a regularized linear regression technique that combines L1 (Lasso) and L2 (Ridge) penalties.
Used when:

Many features

Features are highly correlated

Need feature selection + stability


Why ElasticNet?

Problems:

Lasso → unstable with correlated features

Ridge → no feature selection

ElasticNet gets the best of both

ElasticNet balances sparsity and stability

Hyperparameters

α (alpha)

Overall regularization strength

Higher α → more regularization

l1_ratio

Controls balance between L1 and L2

l1_ratio = 1 → Lasso

l1_ratio = 0 → Ridge
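
How the two hyperparameters combine can be written out as a small function. This sketch follows the penalty form documented for scikit-learn's `ElasticNet`, namely `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The helper name `elastic_net_penalty` and the example weights are hypothetical.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term in the form used by scikit-learn's ElasticNet objective."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))     # L1 norm of the weights
    l2 = np.sum(w ** 2)        # squared L2 norm of the weights
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

w = np.array([1.0, -2.0, 0.0])
pure_l1 = elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0)   # Lasso-style penalty only
pure_l2 = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0)   # Ridge-style penalty only
mixed = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5)
print(pure_l1, pure_l2, mixed)
```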


Effect of ElasticNet

Shrinks coefficients (L2 effect)

Sets some coefficients to zero (L1 effect)

Handles correlated features better than Lasso

```text
 Feature              Ridge  Lasso   ElasticNet
 -------------------  -----  ------  ----------
 L1 Penalty           no     yes     yes
 L2 Penalty           yes    no      yes
 Feature Selection    no     yes     yes
 Correlated Features  Best   Poor    Best
 Stability            High   Medium  High
```

When to Use ElasticNet?

High-dimensional data

Correlated features

Want feature selection + stability

Text / Genomics data

Feature Scaling 

Always scale features before fitting ElasticNet

The penalty depends on coefficient magnitude, and coefficient magnitude depends on each feature's units
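
Scaling and fitting can be chained in one pipeline so that the scaler is learned from the training data only. The data below is synthetic: two equally informative features whose raw scales differ by a factor of 1000, an assumption made purely to show the effect.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two informative features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), rng.normal(scale=1000.0, size=200)])
y = 3.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The scaler is fitted inside the pipeline, so it only ever sees training data
model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name
coefs = model.named_steps["elasticnet"].coef_
print(coefs)   # coefficients are expressed per standard deviation of each feature
```

After scaling, both features are penalized on an equal footing, so their coefficients come out comparable.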


Q: Why ElasticNet over Lasso?
Handles correlated features better.

Q: What does l1_ratio control?
Balance between L1 and L2.

Q: What if l1_ratio = 1?
Lasso Regression.

Q: What if l1_ratio = 0?
Ridge Regression.



In [None]:
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)

print(model.coef_)
print(model.intercept_)



# Ridge vs Lasso vs ElasticNet
```text
 Feature                      Ridge (L2)                            Lasso (L1)                                    ElasticNet (L1 + L2)
 ---------------------------  ------------------------------------  --------------------------------------------  ---------------------------------------
 Penalty Type                 L2 → sum of squares of weights        L1 → sum of absolute weights                  L1 + L2
 Effect on Weights            Shrinks coefficients, but never zero  Can shrink some coefficients to exactly zero  Shrinks coefficients + some can be zero
 Feature Selection            No                                    Yes                                           Yes
 Handles Correlated Features  Well                                  Poor                                          Better than Lasso
 Stability                    High                                  Medium                                        High
 Overfitting                  Reduced                               Reduced                                       Reduced
 Best for                     All features important                Sparse/irrelevant features                    Many correlated features & sparsity
```

Ridge → L2 → good for many small correlated features, shrinks weights, no feature selection.

Lasso → L1 → good for feature selection, some weights become zero, unstable with correlated features.

ElasticNet → L1 + L2 → best of both worlds, shrinks weights, performs feature selection, handles correlated features.

How it Fits with Regression Types

Linear Regression → Any of them can be used, mainly for overfitting control

Multiple Regression → Ridge/Lasso/ElasticNet are most used when features > 1

Polynomial Regression → Regularization is essential for high-degree polynomials to prevent overfitting

# Bayesian Linear Regression

Bayesian Linear Regression is a probabilistic approach to linear regression.

Instead of finding single best weights, it computes a distribution over weights.

Incorporates prior knowledge and updates beliefs with data (posterior).

y = Xw + ϵ

In classical Linear Regression, we estimate a single weight vector w.

In Bayesian Regression, we estimate a probability distribution for w:

p(w|X,y) ∝ p(y|X,w) ⋅ p(w)

Where:

p(w) = prior (belief before seeing data)

p(y|X,w) = likelihood (probability of data given weights)

p(w|X,y) = posterior (updated belief after seeing data)
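
For a zero-mean Gaussian prior and Gaussian noise the posterior over w is itself Gaussian and has a closed form. The sketch below assumes both precisions (inverse variances) are known and fixed; `prior_precision`, `noise_precision` and the helper `posterior` are hypothetical names introduced for illustration. scikit-learn's `BayesianRidge`, by contrast, estimates these quantities from the data.

```python
import numpy as np

def posterior(X, y, prior_precision=1.0, noise_precision=1.0):
    """Posterior mean and covariance of w for prior N(0, I / prior_precision)."""
    p = X.shape[1]
    S_inv = prior_precision * np.eye(p) + noise_precision * X.T @ X
    S = np.linalg.inv(S_inv)                 # posterior covariance
    m = noise_precision * S @ X.T @ y        # posterior mean
    return m, S

# Small toy data set; the first column of ones models the intercept
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([1.0, 3.0, 2.0, 3.0, 5.0])

prior_precision, noise_precision = 1.0, 1.0
m, S = posterior(X, y, prior_precision, noise_precision)

# Predictive distribution at x = 6: noise variance plus uncertainty about w
x_new = np.array([1.0, 6.0])
pred_mean = x_new @ m
pred_var = 1.0 / noise_precision + x_new @ S @ x_new
print(pred_mean, np.sqrt(pred_var))
```

The predictive variance has two parts, which is where the uncertainty estimate comes from: irreducible noise and remaining uncertainty about the weights.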

Advantages:

Provides uncertainty estimates for predictions (credible intervals)

Can incorporate prior knowledge

Reduces overfitting via priors

Works well with small datasets


Q: What is the difference between classical and Bayesian regression?
Classical: a single set of weights. Bayesian: a distribution over weights.

Q: Why use Bayesian regression?
Gives uncertainty + reduces overfitting + allows priors.

Q: Can it work with multiple features or polynomial regression?
Yes, same as linear regression, just the prior and posterior are over multiple weights.

Q: What prior is commonly used?
Gaussian prior over weights.

Classical → Best guess

Bayesian → "Belief + data = updated belief"


from sklearn.linear_model import BayesianRidge
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 3, 2, 3, 5])

model = BayesianRidge()
model.fit(X, y)

# return_std=True returns a (mean, std) tuple for the predictive distribution
y_pred, y_std = model.predict([[6]], return_std=True)

print("Prediction:", y_pred)
print("Uncertainty (std):", y_std)
