# **Linear Models**

## Ordinary Least Squares

### Basics

| Theory Term         | Scikit-learn Term                | What It Means                           |
| ------------------- | -------------------------------- | --------------------------------------- |
| Training your model | `model.fit(X, y)`                | Show your model the data and answers    |
| Make a prediction   | `model.predict(X_test)`          | Ask your model to guess something       |
| x values (features) | `X`                              | The inputs (what you know)              |
| y values (targets)  | `y`                              | The outputs (what you want to predict)  |
| Slope & intercept   | `coef_` & `intercept_`           | Parameters learned by linear regression |
| Accuracy checking   | `mean_squared_error`, `r2_score` | How good your guesses are               |

In [1]:
from sklearn import linear_model

In [2]:
# Create an ordinary least squares (OLS) linear regression model.
# It finds the best-fit line by minimizing the sum of squared residuals (errors).
# It has no regularization, so it can overfit,
# especially when features are correlated or the data is noisy.
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])

In [3]:
reg.coef_

array([0.475, 0.475])

In [4]:
reg.intercept_

np.float64(0.050000000000000155)
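
The terms from the Basics table fit together roughly like this. A minimal sketch on made-up, noise-free data (`y = 2x + 1` is an assumption chosen so the learned parameters are easy to check):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Toy data (made up for illustration): y = 2*x + 1 with no noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])   # features, shape (n_samples, n_features)
y = 2.0 * X[:, 0] + 1.0                      # targets

model = LinearRegression()
model.fit(X, y)               # training: show the model the data and answers
y_pred = model.predict(X)     # prediction: ask the model to guess

slope = model.coef_[0]        # learned slope
intercept = model.intercept_  # learned intercept
mse = mean_squared_error(y, y_pred)  # lower is better
r2 = r2_score(y, y_pred)             # 1.0 means a perfect fit
print(slope, intercept, mse, r2)
```

In real use you would score on held-out data (for example `X_test`, `y_test`) rather than on the training set.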

### When your features (inputs) are too similar to each other, your model freaks out.

Think of each feature as a person in a group project. If two people say exactly the same thing, the teacher (your model) can't tell who deserves the credit and ends up giving both of them weird scores: the coefficients become unstable and hard to interpret. That’s multicollinearity.<br><br>
| Strategy                   | How                                                              | Why                                          |
| -------------------------- | ---------------------------------------------------------------- | -------------------------------------------- |
| 🔍 Check correlation       | `df.corr()` or `seaborn.heatmap()`                               | Spot the troublemakers                       |
| 🧼 Drop similar features   | Keep only one                                                    | Simplify the model                           |
| 🔄 Use Regularization      | `Ridge()`, `Lasso()`                                             | Penalize large coefficients                  |
| 📦 Use PCA                 | `sklearn.decomposition.PCA()`                                    | Combine features into “principal components” |
| 🛠️ Design features better | Avoid overlaps (e.g. use area *or* rooms, not both if redundant) | Prevent the mess early                       |

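To see the "weird scores" effect and one of the fixes from the table, here is a small sketch on synthetic data. The data, the noise levels, and `alpha=1.0` are all assumptions picked for illustration. The second feature is almost a copy of the first, so plain OLS has trouble splitting the credit, while Ridge shares it about evenly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 1e-3 * rng.normal(size=50)      # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=50)  # only the shared signal matters

# Step 1: spot the troublemakers (same idea as df.corr())
corr = np.corrcoef(x1, x2)[0, 1]

# Step 2: compare unregularized and regularized fits
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("correlation:", corr)
print("OLS coefficients:  ", ols.coef_)    # can be large and opposite in sign
print("Ridge coefficients:", ridge.coef_)  # roughly equal, summing to about 3
```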
### Hey, no negative weights allowed. Only positive numbers make sense here.

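Setting `positive=True` forces every coefficient to be non-negative (non-negative least squares, NNLS): `LinearRegression(positive=True)`<br><br>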
| Situation                       | Why NNLS?                        |
| ------------------------------- | -------------------------------- |
| 🎵 Music signal analysis        | Amplitude (can't be negative)    |
| 🏠 House price predictors       | More size should not lower price |
| 📊 Count/frequency-based models | Frequencies can't be negative    |
| 🧪 Chemical mixture prediction  | Concentrations must be ≥ 0       |
| 📦 Stock quantities             | You can't sell -3 items          |

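A quick sketch of what `positive=True` changes. The data here is synthetic and deliberately built with one negative true weight (an assumption for illustration), so the unconstrained fit goes negative while the constrained one cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))
# Made-up target: the second feature has a negative true weight
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.05 * rng.normal(size=100)

free = LinearRegression().fit(X, y)               # ordinary least squares
nnls = LinearRegression(positive=True).fit(X, y)  # coefficients constrained to be >= 0

print("unconstrained:", free.coef_)  # second coefficient is negative
print("positive=True:", nnls.coef_)  # no coefficient below zero
```

The constraint applies to the coefficients only; the intercept is still free to take any sign.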

## Ridge regression and classification

### Regression

The larger the alpha, the more shrinkage: the weights are pulled toward zero and become less spread out (and more robust to collinearity).


|Term|Full Name|What it Does|Formula|When to Use|
|---|---|---|---|---|
|**L1**|Lasso/Manhattan|Adds sum of absolute values of coefficients|`λ∑\|wi\|`|Feature selection, sparse models|
|**L2**|Ridge/Euclidean|Adds sum of squared coefficients|`λ∑wi²`|Prevent overfitting, keep all features|
|**Elastic Net**|L1 + L2|Combines both penalties|`λ1∑\|wi\| + λ2∑wi²`|When Lasso alone is too aggressive|

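The penalty terms are plain arithmetic on the weight vector. A small sketch with a hypothetical weight vector `w` and a hypothetical strength `lam` (both made up), taking Elastic Net as the sum of an L1 term and an L2 term with the same strength:

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.0])  # hypothetical model weights
lam = 0.1                            # hypothetical regularization strength

l1_penalty = lam * np.sum(np.abs(w))  # 0.1 * (0.5 + 2 + 0 + 1)  = 0.35
l2_penalty = lam * np.sum(w ** 2)     # 0.1 * (0.25 + 4 + 0 + 1) = 0.525
elastic_net_penalty = l1_penalty + l2_penalty  # lam1 = lam2 = lam here

print(l1_penalty, l2_penalty, elastic_net_penalty)
```

Note how the large weight (-2.0) dominates the L2 term much more than the L1 term, because it gets squared.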
<br><br>

| Model              | Regularization? | Use Case                                                   |
| ------------------ | --------------- | ---------------------------------------------------------- |
| `LinearRegression` | ❌ No            | When data is clean and multicollinearity is not a concern. |
| `Ridge(alpha=...)` | ✅ Yes (L2)      | When features are correlated or model is overfitting.      |


In [5]:
reg = linear_model.Ridge(alpha=.5)
# alpha sets the regularization strength: larger values reduce overfitting and help with multicollinearity
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])

In [6]:
reg.coef_

array([0.34545455, 0.34545455])

In [7]:
reg.intercept_

np.float64(0.13636363636363638)
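
As a sanity check, the same numbers can be reproduced by hand with the closed-form ridge solution. This is a sketch, not how scikit-learn necessarily computes it internally; it assumes the intercept is left unpenalized by centering `X` and `y` first.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.1, 1.0])
alpha = 0.5

# Center the data so the intercept is not penalized
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Ridge solution: w = (Xc^T Xc + alpha * I)^-1 Xc^T yc
w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
b = y_mean - X_mean @ w  # recover the intercept

print(w, b)  # matches reg.coef_ and reg.intercept_ from the Ridge cells
```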

### Classification



##### 1. RidgeClassifier

> A version of Ridge Regression used for **classification** (instead of predicting numbers). It treats class labels like numbers and makes predictions based on the **sign** of the result (e.g., positive = class 1, negative = class 0).

##### 2. {-1, 1} conversion

> Instead of labeling classes as `0` and `1`, the model converts them to `-1` and `+1`. This helps in using mathematical tricks like treating classification as regression.

##### 3. Multi-output regression

> Instead of predicting **one** value (like normal regression), the model predicts **multiple outputs at once** — one for each possible class. Then, whichever class gets the **highest score** is the predicted one.

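Points 1 to 3 can be checked directly on a fitted `RidgeClassifier`. A small sketch with made-up data: in the binary case the prediction follows the sign of `decision_function`, and in the multiclass case there is one score per class and the highest one wins.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Binary case: one score per sample, the sign decides the class
X_bin = np.array([[0.0], [1.0], [2.0], [3.0]])
y_bin = np.array([0, 0, 1, 1])
clf = RidgeClassifier(alpha=1.0).fit(X_bin, y_bin)
scores = clf.decision_function(X_bin)
pred_from_sign = clf.classes_[(scores > 0).astype(int)]

# Multiclass case: one score per class, the highest score wins
X_multi = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 0.0],
                    [2.1, 0.1], [0.0, 2.0], [0.2, 2.1]])
y_multi = np.array([0, 0, 1, 1, 2, 2])
clf_multi = RidgeClassifier(alpha=1.0).fit(X_multi, y_multi)
scores_multi = clf_multi.decision_function(X_multi)  # shape (n_samples, n_classes)
pred_from_argmax = clf_multi.classes_[np.argmax(scores_multi, axis=1)]

print(pred_from_sign, clf.predict(X_bin))
print(pred_from_argmax, clf_multi.predict(X_multi))
```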
##### 4. Penalized Least Squares loss

> This is the regular **squared error loss** with a twist — it adds a **penalty** to keep model weights small (that’s the **regularization** part like in Ridge). It’s a way to stop the model from overfitting.

##### 5. Logistic loss

> Used in **Logistic Regression**, this loss focuses on **how confident the model is** about being right — penalizing it more when it’s confidently wrong. It’s designed specifically for **classification** problems.

##### 6. Hinge loss

> The loss used in **Support Vector Machines (SVMs)**. It only penalizes predictions that are **too close to the decision boundary** or on the wrong side of it. Helps build sharp decision margins.

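The three losses in points 4 to 6 are easiest to compare as functions of the margin `y * f(x)`, with labels in {-1, +1}. A minimal NumPy sketch (the margin values are made up, and the ridge penalty term is left out so only the data-fit part is shown):

```python
import numpy as np

# margin = y * f(x): negative means wrong side, large positive means confidently right
margins = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])

squared = (1.0 - margins) ** 2          # least squares on -1/+1 labels (Ridge, LS-SVM)
logistic = np.log1p(np.exp(-margins))   # logistic regression
hinge = np.maximum(0.0, 1.0 - margins)  # standard SVM

for m, s, l, h in zip(margins, squared, logistic, hinge):
    print(f"margin={m:+.1f}  squared={s:.2f}  logistic={l:.3f}  hinge={h:.2f}")
```

One visible difference: at margin 3.0 the hinge loss is zero, while the squared loss is 4.0, so squared loss also penalizes predictions that are correct but "too confident".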
##### 7. Cross-validation scores

> A way to **test how good your model is** by splitting your data into parts, training on some, and testing on the rest — to make sure your model performs well not just on the training data.

##### 8. Projection matrix

> A special matrix used in linear algebra to **map** input data into another space (like collapsing into fewer dimensions). In RidgeClassifier, it’s used to efficiently compute predictions.

##### 9. Least Squares Support Vector Machine (LS-SVM)

> A version of SVM where the optimization is done using **squared error** (like in regression), instead of hinge loss. It simplifies computation and makes it easier to solve, especially with a **linear kernel**.

##### 10. Linear kernel

> In SVMs, a **kernel** is how we measure similarity between data points. A **linear kernel** just uses the regular dot product — so it doesn’t transform the data at all; it keeps it in the original space.




### Ridge Complexity

Ridge Regression = Linear Regression + L2 regularization penalty

<br>

| Alpha Value        | What the Model Does                        | Resulting Problem                  |
| ------------------ | ------------------------------------------ | ---------------------------------- |
| 🔺 **Large alpha** | Shrinks weights too much → Ignores details | **Underfitting**: Model too simple |
| 🔻 **Small alpha** | Barely shrinks weights → Fits everything   | **Overfitting**: Model too complex |




One way to find the best alpha is to plot validation error against alpha:<br>
x-axis = alpha (usually on a log scale)<br>
y-axis = validation error<br>
You’ll often see a U-shape; pick the alpha where the error is lowest.
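
The numbers behind such a plot can be computed like this. A sketch on synthetic data; the dataset, the alpha grid, and the choice of 5-fold CV are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
true_w = np.zeros(10)
true_w[:3] = [1.5, -2.0, 0.5]  # made-up weights, only 3 features matter
y = X @ true_w + rng.normal(scale=1.0, size=60)

alphas = np.logspace(-3, 3, 13)
val_errors = []
for alpha in alphas:
    # scikit-learn reports negative MSE, so flip the sign to get an error
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    val_errors.append(-scores.mean())

best_alpha = alphas[int(np.argmin(val_errors))]
print(best_alpha)
# To draw the curve: plot alphas (log-scaled x-axis) against val_errors
```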

#### Or the shortcut: `RidgeCV`

In [8]:
from sklearn.linear_model import RidgeCV
import numpy as np
X_train = np.array([
    [1000, 2],
    [1200, 3],
    [1500, 3],
    [1800, 4],
    [2000, 4],
    [2200, 5],
    [2500, 5]
])

y_train = np.array([200, 240, 280, 310, 340, 360, 400])

# This will automatically find the best alpha from the list
ridge = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100])
ridge.fit(X_train, y_train)
print(ridge.alpha_)  # Best alpha selected


100.0


### Setting the regularization parameter: leave-one-out Cross-Validation

| Term                            | Simple Explanation                                                                                                                                   |
| ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **RidgeCV / RidgeClassifierCV** | Ridge models that **auto-pick the best alpha** using cross-validation (so you don’t have to tune manually).                                          |
| **GridSearchCV**                | A brute-force method to test many combinations of hyperparameters and pick the best.                                                                 |
| **Leave-One-Out (LOO) CV**      | A special kind of cross-validation where you leave out **one data point** at a time for validation. Super detailed, but slower on large datasets.    |
| **Alpha ≠ 0 restriction**       | In LOO, you **can’t use alpha = 0**, because mathematically it breaks the formula RidgeCV uses to estimate error.                                    |


RidgeCV and RidgeClassifierCV are Ridge models with alpha tuning built in.
They work much like GridSearchCV over the `alphas` you pass in, but by default they use an efficient form of Leave-One-Out CV, a precise cross-validation method that validates on every single data point individually. Pass `cv=...` to use a different cross-validation scheme.

In [9]:
reg = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
reg.alpha_


np.float64(0.01)
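
For intuition, this is what leave-one-out does when written out explicitly. It is a sketch with made-up data and a fixed `alpha=0.1`; RidgeCV gets its leave-one-out errors more efficiently than refitting the model n times like this.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

# Made-up data, roughly y = x
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

loo = LeaveOneOut()
squared_errors = []
for train_idx, test_idx in loo.split(X):
    # Train on all points except one, then test on the point left out
    model = Ridge(alpha=0.1).fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    squared_errors.append((pred[0] - y[test_idx][0]) ** 2)

n_splits = len(squared_errors)  # one split per data point
loo_mse = float(np.mean(squared_errors))
print(n_splits, loo_mse)
```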

## `linear_model` at a glance

| **Function / Class**                  | **Use Case**                                             | **Penalty / Regularization** | **Goal / Behavior**                        | **Good For**                        |
| ------------------------------------- | -------------------------------------------------------- | ---------------------------- | ------------------------------------------ | ----------------------------------- |
| `LinearRegression()`                  | Basic linear regression (no regularization)              | ❌ None                       | Fits line to minimize squared error        | Clean data, low risk of overfitting |
| `Ridge(alpha=...)`                    | Linear regression with **L2 regularization**             | ✅ L2 (squared weights)       | Shrinks weights but **keeps all** features | When features are **correlated**    |
| `Lasso(alpha=...)`                    | Linear regression with **L1 regularization**             | ✅ L1 (absolute weights)      | Shrinks some weights to **exactly zero**   | **Feature selection**, sparse data  |
| `ElasticNet(alpha=..., l1_ratio=...)` | Combo of Lasso + Ridge                                   | ✅ L1 + L2                    | Balances sparsity and shrinkage            | When Lasso alone is too aggressive  |
| `RidgeCV()`                           | Ridge with **auto-tuning alpha via cross-validation**    | ✅ L2                         | Picks best alpha using CV                  | Efficient tuning of Ridge           |
| `LassoCV()`                           | Lasso with **auto-tuning alpha via cross-validation**    | ✅ L1                         | Picks best alpha using CV                  | Efficient Lasso tuning              |
| `RidgeClassifier()`                   | Classification using Ridge regression idea               | ✅ L2                         | Converts labels to -1/+1 and fits          | Fast alt to logistic regression     |
| `RidgeClassifierCV()`                 | Same as above but picks best alpha                       | ✅ L2                         | Auto alpha for Ridge classification        | Multiclass, fast training           |
| `LogisticRegression()`                | Logistic regression for binary/multiclass classification | ✅ L1, L2, or both            | Probabilistic output, non-linear loss      | Classification tasks                |
| `SGDRegressor()`                      | Linear model using stochastic gradient descent           | ✅ L1, L2, ElasticNet         | Faster on large datasets                   | Online or large-scale training      |
| `SGDClassifier()`                     | Linear classifier trained with SGD (hinge loss by default) | ✅ L1, L2, ElasticNet         | Good for sparse/high-dim data              | Text, online learning               |


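The "keeps all features" vs "exactly zero" difference between Ridge and Lasso shows up clearly on data where some features are pure noise. A sketch on synthetic data; the dataset and `alpha=0.5` are assumptions picked for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Made-up target: only the first two features matter, the other three are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print("Lasso:", lasso.coef_)  # irrelevant features get exactly 0
print("Ridge:", ridge.coef_)  # small but nonzero everywhere
```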
## Polynomial regression: extending linear models with basis functions

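Feed a linear model fixed (non-learned) nonlinear transformations of the inputs, such as polynomial terms or cos(x) and sin(x). The model stays linear in its weights, but it can now fit nonlinear relationships.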
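
The basis functions do not have to be polynomials. A minimal sketch with sin and cos as hand-built features; the target function is made up so that the recovered weights are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0.0, 2.0 * np.pi, 50)
y = 2.0 * np.sin(x) + 0.5 * np.cos(x)  # made-up nonlinear target

# Fixed, non-learned basis functions become the columns of the design matrix
Phi = np.column_stack([np.sin(x), np.cos(x)])
model = LinearRegression().fit(Phi, y)

print(model.coef_, model.intercept_)  # weights close to [2.0, 0.5], intercept close to 0
```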

In [11]:
from sklearn.preprocessing import PolynomialFeatures
X = np.arange(6).reshape(3,2)
poly = PolynomialFeatures(degree=2)
poly.fit_transform(X)
# x1,x2 -> 1 x1 x2 x1^2 x1x2 x2^2
# degree-> 0  1  1  2    2    2

array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept = False))])
# Fit to data generated from a degree-3 polynomial
x = np.arange(5)
y = 3 - 2*x + x**2 + x**3
# x[:, np.newaxis] turns the 1-D array into a 2-D column, as scikit-learn expects
model = model.fit(x[:, np.newaxis], y)
model.named_steps['linear'].coef_

# Pipeline([
#     ('step_name1', transformation1),
#     ('step_name2', model)
# ])


array([ 3., -2.,  1.,  1.])
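
Once fitted, the pipeline runs both steps on new inputs too, so you can predict straight from raw `x`. A small sketch that rebuilds the same pipeline and checks one prediction against the generating polynomial (`x_new = 5` is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("linear", LinearRegression(fit_intercept=False)),
])
x = np.arange(5)
y = 3 - 2 * x + x**2 + x**3
model.fit(x[:, np.newaxis], y)

# predict() expands x_new to [1, x, x^2, x^3] and then applies the linear step
x_new = np.array([[5]])
y_new = model.predict(x_new)
expected = 3 - 2 * 5 + 5**2 + 5**3  # 143, from the generating polynomial
print(y_new, expected)
```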

In [13]:
from sklearn.linear_model import Perceptron
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = X[:, 0] ^ X[:, 1]  # XOR of the two inputs: not linearly separable in x1, x2 alone
X = PolynomialFeatures(interaction_only=True).fit_transform(X).astype(int)


| Input x1 | x2 | Transformed: \[1, x1, x2, x1\*x2] |
| -------- | -- | --------------------------------- |
| 0        | 0  | \[1, 0, 0, 0]                     |
| 0        | 1  | \[1, 0, 1, 0]                     |
| 1        | 0  | \[1, 1, 0, 0]                     |
| 1        | 1  | \[1, 1, 1, 1]                     |


In [14]:
clf = Perceptron(fit_intercept=False, max_iter=10, tol=None,
                 shuffle=False).fit(X, y)

In [15]:
clf.predict(X)
clf.score(X, y)

1.0
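
For contrast, here is the same perceptron with and without the interaction feature. XOR is not linearly separable in the raw inputs, so the raw model cannot classify all four points correctly. This is a sketch that reuses the settings from the cells above; the raw-input model with its default intercept is an addition for comparison.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import PolynomialFeatures

X_raw = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = X_raw[:, 0] ^ X_raw[:, 1]  # XOR targets

# Raw inputs only: no straight line separates the two classes
raw_clf = Perceptron(max_iter=10, tol=None, shuffle=False).fit(X_raw, y)
raw_score = raw_clf.score(X_raw, y)

# Adding the x1*x2 interaction term makes the problem linearly separable
X_inter = PolynomialFeatures(interaction_only=True).fit_transform(X_raw).astype(int)
inter_clf = Perceptron(fit_intercept=False, max_iter=10, tol=None,
                       shuffle=False).fit(X_inter, y)
inter_score = inter_clf.score(X_inter, y)

print(raw_score, inter_score)
```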