# 🌟 **Regression in Machine Learning – Detailed Notes**

---

### 📌 **What is Regression?**

Regression is a **supervised learning technique** used in machine learning to **predict continuous (real-valued) outcomes** based on input data. It involves estimating the relationships among variables. In regression problems, the output variable is numerical (e.g., price, age, temperature), unlike classification where outputs are categorical (e.g., spam or not spam).

---

### 🔍 **Why Use Regression?**

- To understand relationships between variables.
- To forecast or predict future outcomes.
- To identify trends and patterns.
- To perform risk analysis or make informed decisions based on data.

---

## 🧠 **Types of Regression in Machine Learning – In Detail**

---

### **1. Simple Linear Regression**

#### ✅ Definition:
Simple Linear Regression models the relationship between a **single independent variable** and a **dependent variable** by fitting a straight line (linear) to the data.

#### 🧮 Mathematical Model:
$$
y = \beta_0 + \beta_1 x + \epsilon
$$

- $y$ = Target (dependent) variable  
- $x$ = Predictor (independent) variable  
- $\beta_0$ = Intercept (value of $y$ when $x = 0$)  
- $\beta_1$ = Slope (rate of change of $y$ with respect to $x$)  
- $\epsilon$ = Error term (the part of $y$ the line does not explain; its estimate, the residual, is the difference between the actual and predicted value)
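
The model above can be fitted in closed form with ordinary least squares. The following is a minimal sketch, assuming NumPy is available; the function name `fit_simple_linear` and the toy data are hypothetical and only serve as illustration.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Return (beta0, beta1) that minimize the sum of squared residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1

# Hypothetical toy data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]  # generated as 30 + 5 * years, no noise
beta0, beta1 = fit_simple_linear(years, salary)
print(beta0, beta1)  # intercept 30.0, slope 5.0
```

Because the toy data lie exactly on a line, the fit recovers the intercept and slope used to generate them; with real data the residuals would be non-zero.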

#### 🔍 Use Case:
- Predicting salary based on years of experience  
- Estimating house price based on square footage

#### 🧱 Assumptions:
- Linear relationship between $x$ and $y$
- Independence of residuals
- Constant variance of residuals (homoscedasticity)
- Normally distributed residuals

#### ✅ Advantages:
- Easy to implement and interpret
- Works well when the relationship is truly linear

#### ❌ Limitations:
- Only works with one feature
- Not suitable for non-linear relationships

---

### **2. Multiple Linear Regression**

#### ✅ Definition:
Extends simple linear regression to **multiple independent variables**.

#### 🧮 Mathematical Model:
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon
$$

- $x_1, x_2, \dots, x_n$ = Independent variables (features)
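
With several features, the coefficients can be estimated by least squares on a design matrix. The sketch below assumes NumPy; the function name and the house-price numbers are made up for illustration.

```python
import numpy as np

def fit_multiple_linear(X, y):
    """Return [beta0, beta1, ..., beta_n] from a least-squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Hypothetical data: [area, rooms] -> price, generated as 10 + 2*area + 5*rooms.
X = [[50, 1], [80, 2], [100, 3], [120, 3], [150, 4]]
y = [115, 180, 225, 265, 330]
beta = fit_multiple_linear(X, y)
print(np.round(beta, 6))  # approximately [10, 2, 5]
```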

#### 🔍 Use Case:
- Predicting house price based on area, number of rooms, location, etc.
- Predicting student marks based on study time, sleep hours, etc.

#### 🧱 Assumptions:
Same as simple linear regression, with the additional assumption:
- No multicollinearity (independent variables shouldn't be highly correlated)

#### ✅ Advantages:
- Accounts for the combined effect of several features while each coefficient stays interpretable (the model is still linear in every feature)

#### ❌ Limitations:
- Prone to overfitting with irrelevant features
- Sensitive to multicollinearity

---

### **3. Polynomial Regression**

#### ✅ Definition:
A type of regression where the relationship between the independent variable and the dependent variable is modeled as an **nth-degree polynomial**.

#### 🧮 Mathematical Model:
$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \epsilon
$$
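
Although the curve is non-linear in $x$, the model is still linear in the coefficients, so it can be fitted with the same least-squares machinery on the columns $1, x, x^2, \dots, x^n$. A minimal sketch, assuming NumPy; the trajectory data are hypothetical.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit; returns [beta0, beta1, ..., beta_degree]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Columns of the design matrix are x^0, x^1, ..., x^degree.
    design = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

# Hypothetical ball trajectory: height = 2 + 10*t - 5*t^2.
t = [0.0, 0.5, 1.0, 1.5, 2.0]
height = [2.0, 5.75, 7.0, 5.75, 2.0]
beta = fit_polynomial(t, height, degree=2)
print(np.round(beta, 6))  # approximately [2, 10, -5]
```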

#### 🔍 Use Case:
- Modeling population growth
- Predicting trajectories (e.g., ball thrown in the air)

#### 🧱 Assumptions:
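- The relationship between the variables is well approximated by a polynomial of the chosen degree
- The model is still linear in the coefficients, so the remaining linear regression assumptions (independence, homoscedasticity, normality of residuals) apply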

#### ✅ Advantages:
- Can model non-linear relationships

#### ❌ Limitations:
- High-degree polynomials can overfit
- Interpretability decreases with increasing degree

---

### **4. Ridge Regression (L2 Regularization)**

#### ✅ Definition:
Linear regression with **L2 regularization**. Adds a penalty for large coefficients to reduce model complexity and overfitting.

#### 🧮 Loss Function:
$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

- $\lambda$ = Regularization strength  
- Larger $\lambda$ leads to more shrinkage
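
The loss above has a closed-form minimizer. The sketch below assumes NumPy and leaves the intercept unpenalized by centering the data first, which matches the penalty sum starting at $j = 1$; the function name and synthetic data are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Return (beta0, beta) minimizing SSE + lam * sum(beta_j ** 2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering removes the intercept
    p = X.shape[1]
    # Solve (Xc^T Xc + lam * I) beta = Xc^T yc.
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Synthetic data (assumption): three features, known coefficients, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([3.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

_, beta_ols = fit_ridge(X, y, lam=0.0)      # lam = 0 is ordinary least squares
_, beta_ridge = fit_ridge(X, y, lam=100.0)  # a large lam shrinks the coefficients
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

In practice the features are usually standardized before fitting, because the penalty depends on the scale of each coefficient.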

#### 🔍 Use Case:
- Situations with multicollinearity or many features

#### ✅ Advantages:
- Reduces overfitting
- Improves model generalization

#### ❌ Limitations:
- All coefficients are reduced, but **none are eliminated**

---

### **5. Lasso Regression (L1 Regularization)**

#### ✅ Definition:
Linear regression with **L1 regularization**, which can reduce some coefficients to exactly zero, performing **feature selection**.

#### 🧮 Loss Function:
$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$
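
Unlike Ridge, this loss has no closed-form solution. One common approach is coordinate descent with soft-thresholding, sketched below under the assumption that NumPy is available; the function names, iteration count, and synthetic data are illustrative choices, not a reference implementation.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def fit_lasso(X, y, lam, n_iters=200):
    """Return (beta0, beta) approximately minimizing SSE + lam * sum(|beta_j|)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # intercept is not penalized
    beta = np.zeros(X.shape[1])
    col_sq = (Xc ** 2).sum(axis=0)
    for _ in range(n_iters):
        for j in range(len(beta)):
            # Residual with feature j's current contribution removed.
            r_j = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ r_j
            # The threshold is lam / 2 because the squared error is not halved.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Synthetic data (assumption): only the first of three features matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 4.0 * X[:, 0] + 0.1 * rng.normal(size=100)
beta0, beta = fit_lasso(X, y, lam=20.0)
print(beta)  # the coefficients of the irrelevant features are driven to zero
```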

#### 🔍 Use Case:
- Feature selection when you suspect only a few features are important

#### ✅ Advantages:
- Automatic feature selection
- Sparse solutions (some coefficients are zero)

#### ❌ Limitations:
- Can behave erratically when variables are highly correlated

---

### **6. Elastic Net Regression**

#### ✅ Definition:
A hybrid of **Ridge and Lasso**. Uses a linear combination of L1 and L2 penalties.

#### 🧮 Loss Function:
$$
\text{Loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right)
$$

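- $\alpha$ = balance parameter between L1 and L2, with $0 \le \alpha \le 1$ ($\alpha = 1$ gives the Lasso penalty, $\alpha = 0$ gives the Ridge penalty)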
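
To make the role of the two hyperparameters concrete, the sketch below evaluates the loss exactly as written above for a fixed set of coefficients. It assumes NumPy; the function name and numbers are hypothetical, and libraries may scale the individual terms differently.

```python
import numpy as np

def elastic_net_loss(X, y, beta0, beta, lam, alpha):
    """SSE + lam * (alpha * L1 penalty + (1 - alpha) * L2 penalty)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.asarray(beta, dtype=float)
    residuals = y - (beta0 + X @ beta)
    l1 = np.sum(np.abs(beta))
    l2 = np.sum(beta ** 2)
    return float(np.sum(residuals ** 2) + lam * (alpha * l1 + (1 - alpha) * l2))

X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
beta0, beta = 0.0, [1.0, -0.5]  # predictions are 0 and 1, so SSE = 2

lasso_like = elastic_net_loss(X, y, beta0, beta, lam=2.0, alpha=1.0)  # 2 + 2 * 1.5  = 5.0
ridge_like = elastic_net_loss(X, y, beta0, beta, lam=2.0, alpha=0.0)  # 2 + 2 * 1.25 = 4.5
mixed = elastic_net_loss(X, y, beta0, beta, lam=2.0, alpha=0.5)       # 4.75
```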

#### 🔍 Use Case:
- Datasets with many features, especially correlated ones

#### ✅ Advantages:
- Combines strengths of Lasso and Ridge
- Handles correlated variables better

#### ❌ Limitations:
- Requires tuning of two hyperparameters: $\lambda$ and $\alpha$

---

### **7. Logistic Regression** *(for classification)*

#### ✅ Definition:
Used for **binary classification** problems (0/1, True/False). Despite the name, it's a classification algorithm.

#### 🧮 Sigmoid Function:
$$
\hat{y} = \frac{1}{1 + e^{-z}} \quad \text{where} \quad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
$$
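
The sigmoid maps the linear score $z$ to a probability, which is then thresholded to obtain a class label. A minimal sketch, assuming NumPy; the coefficients, the 0.5 threshold, and the helper names are hypothetical, and the coefficients are fixed by hand rather than learned.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta0, beta):
    """Probability of the positive class for each row of X."""
    X = np.asarray(X, dtype=float)
    return sigmoid(beta0 + X @ np.asarray(beta, dtype=float))

def predict_label(X, beta0, beta, threshold=0.5):
    """Class 1 when the predicted probability reaches the threshold."""
    return (predict_proba(X, beta0, beta) >= threshold).astype(int)

# Hypothetical single-feature model with z = -4 + 1 * x.
X = [[2.0], [4.0], [6.0]]  # z = -2, 0, 2
probs = predict_proba(X, beta0=-4.0, beta=[1.0])
labels = predict_label(X, beta0=-4.0, beta=[1.0])
print(np.round(probs, 3), labels)  # [0.119 0.5 0.881] [0 1 1]
```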

#### 🔍 Use Case:
- Spam detection
- Disease diagnosis (e.g., diabetic or not)

#### ✅ Advantages:
- Probabilistic interpretation
- Efficient and interpretable

#### ❌ Limitations:
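- The basic form handles only binary output (multi-class problems need an extension such as multinomial or one-vs-rest logistic regression)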
- Assumes linear boundary in log-odds space

---

### **8. Stepwise Regression**

#### ✅ Definition:
A combination of forward selection and backward elimination to choose features step-by-step.

- **Forward Selection**: Start with no variables, add one at a time.
- **Backward Elimination**: Start with all variables, remove one at a time.
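
Forward selection can be sketched as a greedy loop. The example below assumes NumPy and uses a simple stopping rule chosen for illustration (stop when the best candidate reduces the residual sum of squares by less than 10%); real implementations often use p-values or information criteria instead. All names and data are hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns plus intercept."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))

def forward_selection(X, y, min_improvement=0.1):
    """Greedily add the feature that lowers RSS the most; return chosen columns."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    selected, remaining = [], list(range(X.shape[1]))
    current = rss(X, y, selected)  # intercept-only model
    while remaining:
        scores = {c: rss(X, y, selected + [c]) for c in remaining}
        best = min(scores, key=scores.get)
        # Stop when the relative RSS reduction falls below the threshold.
        if current - scores[best] < min_improvement * current:
            break
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected

# Synthetic data (assumption): only columns 1 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
selected = forward_selection(X, y)
print(selected)
```

Backward elimination follows the same pattern in reverse: start from all columns and repeatedly drop the one whose removal increases the RSS the least.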

#### 🔍 Use Case:
- Feature selection in large datasets

#### ✅ Advantages:
- Systematic feature selection
- Reduces model complexity

#### ❌ Limitations:
- Greedy approach: each step only considers one change at a time, so it may miss the best subset of features

---

### **9. Quantile Regression**

#### ✅ Definition:
Instead of modeling the mean of the target variable, **quantile regression** models specific quantiles (like median, 25th percentile, etc.)

#### 🧮 Loss Function:
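Minimizes an asymmetric loss (often called the pinball or check loss) that weights under-predictions by $q$ and over-predictions by $1 - q$, where $q$ is the target quantile.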
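
The asymmetric loss is commonly written as the pinball loss. The sketch below assumes NumPy; the function name and the numbers are hypothetical and only illustrate how the asymmetry works.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Average pinball loss for quantile q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-predictions (diff > 0) cost q per unit; over-predictions cost 1 - q.
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# For q = 0.9, predicting too low is penalized nine times more than too high.
under = pinball_loss([10.0], [8.0], q=0.9)   # 0.9 * 2 = 1.8
over = pinball_loss([10.0], [12.0], q=0.9)   # 0.1 * 2 = 0.2

# With an outlier, the median (q = 0.5) is a better constant prediction
# than the mean under this loss.
y = [1.0, 2.0, 3.0, 4.0, 100.0]
loss_at_median = pinball_loss(y, 3.0, q=0.5)   # 10.1
loss_at_mean = pinball_loss(y, 22.0, q=0.5)    # 15.6
```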

#### 🔍 Use Case:
- When we want to model medians or other quantiles instead of the mean
- Used in finance for risk modeling

#### ✅ Advantages:
- Robust to outliers
- Models conditional quantiles

#### ❌ Limitations:
- Interpretation can be complex
- Less popular than OLS

---

### **10. Bayesian Regression**

#### ✅ Definition:
Estimates the distribution of model parameters using **Bayesian inference**, instead of fixed point estimates.

#### 🧮 How It Works:
- Combines a prior distribution over the weights with the likelihood of the observed data to obtain a posterior distribution over the weights
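
For one special case the posterior is available in closed form: a zero-mean Gaussian prior on the weights and Gaussian noise with known variance. The sketch below covers only that case and assumes NumPy; `noise_var`, `prior_var`, the function name, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def bayesian_linear_posterior(X, y, noise_var=1.0, prior_var=10.0):
    """Posterior mean and covariance of the weights.

    Assumes prior w ~ N(0, prior_var * I) and y ~ N(X w, noise_var * I).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    precision = X.T @ X / noise_var + np.eye(X.shape[1]) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

# Synthetic data (assumption): y = 0.5 + 2 * x + noise with std 0.3.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])  # first column models the intercept
y = 0.5 + 2.0 * x + rng.normal(scale=0.3, size=100)

mean_few, cov_few = bayesian_linear_posterior(X[:5], y[:5], noise_var=0.09)
mean_all, cov_all = bayesian_linear_posterior(X, y, noise_var=0.09)
print(np.diag(cov_few), np.diag(cov_all))  # uncertainty shrinks with more data
```

The posterior covariance is what distinguishes this from a point estimate: with only five observations the weights are more uncertain than with all one hundred.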

#### 🔍 Use Case:
- Uncertainty modeling
- When data is scarce

#### ✅ Advantages:
- Captures uncertainty
- Robust to small datasets

#### ❌ Limitations:
- Computationally expensive
- Requires choosing a prior distribution, and the result can depend on that choice

---

### ⚙️ **Core Assumptions in Linear Regression**

1. **Linearity**: The relationship between input and output is linear.
2. **Independence of errors**: Residuals are independent of each other.
3. **Homoscedasticity**: Constant variance of errors across all values of independent variables.
4. **Normality of residuals**: Errors are normally distributed (important for inference).
5. **No multicollinearity**: Independent variables are not too highly correlated with each other.

---

### 🎯 **Model Evaluation Metrics**

---

#### ✅ **Mean Absolute Error (MAE)**  
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$  
- Measures the **average absolute difference** between actual and predicted values.  
- **Interpretation**: Lower MAE = better fit  
- **Sensitive to outliers**: Less than MSE, because errors are not squared  

---

#### ✅ **Mean Squared Error (MSE)**  
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$  
- Measures the **average of squared errors**.  
- **Interpretation**: Penalizes larger errors more than MAE  
- **Sensitive to outliers**: Yes  

---

#### ✅ **Root Mean Squared Error (RMSE)**  
$$
RMSE = \sqrt{MSE}
$$  
- **Square root of MSE**, making it interpretable in the same unit as the target.  
- **Interpretation**: Lower RMSE = better model performance  
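
The three error metrics defined so far can be computed directly from their formulas. A minimal sketch, assuming NumPy; the function names and sample values are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mse(y_true, y_pred)))

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]  # errors: 1, 0, -1, -4

print(mae(y_true, y_pred))   # 1.5
print(mse(y_true, y_pred))   # 4.5
print(rmse(y_true, y_pred))  # about 2.121
```

The single large error (4) contributes 16 of the 18 squared-error units but only 4 of the 6 absolute-error units, which is why MSE and RMSE react more strongly to outliers than MAE.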

---

#### ✅ **R-squared ($R^2$)**  
$$
R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
$$  
Where:  
- $SS_{\text{res}} = \sum (y_i - \hat{y}_i)^2$ (Residual Sum of Squares)  
- $SS_{\text{tot}} = \sum (y_i - \bar{y})^2$ (Total Sum of Squares)

- Indicates the **proportion of variance in the target variable** explained by the model.  
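- $R^2$ is at most **1**:
  - 1 = perfect prediction
  - 0 = no better than always predicting the mean $\bar{y}$
  - Negative values are possible (for example, on held-out data) when the model does worse than predicting the mean
- **Limitation**: On training data, $R^2$ never decreases when features are added to an ordinary least-squares model, even if they are not useful.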
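
The formula can be checked on a few hand-made cases. The sketch below assumes NumPy; the function name and values are hypothetical. The last case is worth noting: predictions that are worse than the mean give a negative value.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(y_true, y_true)                      # 1.0
mean_only = r_squared(y_true, [3.0] * 5)                 # 0.0
reversed_ = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0]) # 1 - 40/10 = -3.0
```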

---

#### ✅ **Adjusted R-squared ($\bar{R}^2$)**  
$$
\bar{R}^2 = 1 - \left(1 - R^2\right) \cdot \frac{n - 1}{n - k - 1}
$$

Where:  
- $n$ = number of observations  
- $k$ = number of independent variables (predictors)  
- $R^2$ = regular R-squared

- **Adjusts R-squared** for the number of predictors.  
- **Penalizes** the model for adding features that do not improve predictive power.  
- Can **decrease** if new variables don't add meaningful value.
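
A short worked example shows the penalty in action. The function name and the $R^2$ values below are hypothetical; the sketch only applies the formula above.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (requires n > k + 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical scenario: adding 5 weak features raises R^2 from 0.900 to 0.901.
base = adjusted_r_squared(0.900, n=50, k=5)       # about 0.8886
extended = adjusted_r_squared(0.901, n=50, k=10)  # about 0.8756
print(base, extended)  # adjusted R^2 drops although plain R^2 went up
```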

---

#### 🔁 **Summary of Use Cases:**

| Metric | Best When | Notes |
|--------|-----------|-------|
| MAE | The influence of outliers should be limited | Easy to interpret |
| MSE | Penalizing large errors is important | Sensitive to outliers |
| RMSE | Large errors should be emphasized but the metric must stay interpretable | Same units as target |
| $R^2$ | Quick variance explanation | May mislead with many features |
| Adjusted $R^2$ | Comparing models with different number of features | Better for model selection |

---

### ⚖️ **Overfitting vs. Underfitting**

| Concept       | Description                                                                 | Solution                                  |
|---------------|-----------------------------------------------------------------------------|-------------------------------------------|
| **Overfitting** | Model is too complex and fits noise in training data                         | Use regularization (Ridge, Lasso, Elastic Net), reduce complexity, cross-validation |
| **Underfitting** | Model is too simple to capture underlying pattern                          | Use a more complex model or add features  |
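
The contrast in the table can be illustrated by fitting polynomials of increasing degree to a small noisy sample. This is a sketch under stated assumptions: NumPy is available, the data come from a made-up quadratic with Gaussian noise, and the degrees 1, 2, and 9 are arbitrary choices.

```python
import numpy as np

# Synthetic data (assumption): y = x^2 plus noise, 10 training points.
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 10)
y_train = x_train ** 2 + 0.1 * rng.normal(size=x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = x_test ** 2 + 0.1 * rng.normal(size=x_test.size)

def poly_mse(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    X_tr = np.vander(x_train, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X_tr, y_train, rcond=None)
    X_te = np.vander(x_test, degree + 1, increasing=True)
    train = np.mean((y_train - X_tr @ beta) ** 2)
    test = np.mean((y_test - X_te @ beta) ** 2)
    return float(train), float(test)

errors = {d: poly_mse(d) for d in (1, 2, 9)}
for degree, (train, test) in errors.items():
    print(f"degree {degree}: train MSE {train:.4f}, test MSE {test:.4f}")
```

Training error can only go down as the degree grows: the degree-1 line is too simple for the curve (underfitting), while the degree-9 polynomial passes through all ten training points and therefore also fits their noise (overfitting). Comparing the test errors is what reveals which model generalizes.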

---

### 🛠️ **Model Selection Techniques**

- **Train-Test Split**: Divide data into training and testing sets to evaluate performance.
- **Cross-Validation (e.g., K-Fold)**: Systematically rotate through training/testing sets to ensure robustness.
- **Grid Search**: Try combinations of hyperparameters and pick the best using validation score.
- **Random Search**: Randomly sample hyperparameter combinations; this is usually cheaper than an exhaustive grid and often just as effective.
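
K-fold cross-validation can be sketched in a few lines. The example below assumes NumPy and uses ordinary least squares as the model being evaluated; the function name, the choice of MSE as the score, and the synthetic data are illustrative.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out MSE of an OLS fit across k folds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))   # shuffle once
    folds = np.array_split(indices, k)  # k roughly equal parts
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the k - 1 training folds.
        design_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(design_train, y[train_idx], rcond=None)
        # Score on the fold that was held out.
        design_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        scores.append(np.mean((y[test_idx] - design_test @ beta) ** 2))
    return float(np.mean(scores))

# Synthetic, noise-free data (assumption): a linear model should score near 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([2.0, -1.0])
cv_mse = k_fold_mse(X, y, k=5)
print(cv_mse)
```

Grid search or random search would call a function like this once per hyperparameter setting and keep the setting with the best average score.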

---

### 🧠 **Applications of Regression**

- **Economics**: Predicting housing prices, GDP growth, inflation.
- **Healthcare**: Estimating patient recovery time, disease likelihood.
- **Marketing**: Forecasting sales, estimating ROI.
- **Engineering**: Predicting machine wear, energy efficiency.
- **Finance**: Modeling risk, pricing assets.

---

### ✅ **Summary**

Regression is a fundamental tool in machine learning for modeling and predicting continuous outcomes. By understanding its types (simple, multiple, polynomial, regularized), assumptions, and evaluation techniques, you can build robust models that drive decisions across domains.

---
