
## Introduction to Linear Regression

Linear regression stands as a cornerstone of statistical modeling and predictive analytics, primarily used to understand and quantify the relationship between variables. At its core, it seeks to establish a linear association between a dependent variable (often denoted as 'y'), which is the outcome we aim to predict or explain, and one or more independent variables (often denoted as 'x' or 'xᵢ'), which are the predictors or explanatory factors. The fundamental idea is to identify the "line of best fit" that most accurately represents the trend within a dataset, thereby allowing for predictions of the dependent variable based on given values of the independent variable(s). This method is not only valuable for prediction but also for inferring the strength and direction of the relationships. Linear regression models are broadly categorized into simple linear regression, involving a single independent variable, and multiple linear regression, which incorporates two or more independent variables. The "linear" aspect signifies that the model assumes the relationship between the dependent variable and the *parameters* (the coefficients) is linear, even if the independent variables themselves are transformed (e.g., x²).

---

**Simple Linear Regression (SLR)**

Simple Linear Regression (SLR) is the most basic form of linear regression, focusing on modeling the relationship between a single dependent variable (y) and a single independent variable (x). The objective is to describe this relationship using a straight line. The underlying population model for simple linear regression is expressed by the equation:

`y = β₀ + β₁x + ε`

In this equation:
*   `y` represents the dependent variable, which is the outcome we are trying to predict or understand.
*   `x` represents the independent variable, which is used as the predictor.
*   `β₀` (beta-naught or beta-zero) is the **population y-intercept**. This is the theoretical average value of `y` when the independent variable `x` is equal to zero. Its practical interpretation depends on whether `x=0` is a meaningful and observed value in the context of the data.
*   `β₁` (beta-one) is the **population slope coefficient**. This parameter quantifies the average change in the dependent variable `y` for a one-unit increase in the independent variable `x`. The sign of `β₁` indicates the direction of the relationship (whether `y` tends to increase or decrease as `x` increases), and its magnitude gives the size of that change per unit of `x`. Because the magnitude depends on the units of `x` and `y`, it does not by itself measure how strong the association is.
*   `ε` (epsilon) is the **error term**. It accounts for the random variation or noise in the data and represents the portion of `y` that cannot be explained by the linear relationship with `x`: the influence of all unobserved factors plus inherent randomness. It is the difference between an observed `y` value and the true (unobservable) population line. It should not be confused with a *residual*, which is the difference between an observed `y` value and the value predicted by the fitted line, and which serves as the sample estimate of the error.

When we fit this model to actual sample data, we estimate the population parameters `β₀` and `β₁` with sample statistics, typically denoted `b₀` (or `β̂₀`) and `b₁` (or `β̂₁`), respectively. The most common method for this estimation is Ordinary Least Squares (OLS), which minimizes the sum of the squared differences between the observed `y` values and the values predicted by the fitted line (`ŷ`). The resulting estimated regression equation is `ŷ = b₀ + b₁x`. Key assumptions underpinning SLR include the linearity of the relationship, independence of errors, homoscedasticity (constant variance of errors across all levels of x), and often, the normality of the error term distribution, especially for inferential purposes like hypothesis testing on the coefficients. The R-squared (R²) value is a common metric used to assess how well the model fits the data, indicating the proportion of the variance in `y` that is predictable from `x`.

---

**Multiple Linear Regression (MLR)**

Multiple Linear Regression (MLR) extends the principles of simple linear regression to scenarios where the dependent variable (y) is influenced by two or more independent variables (x₁, x₂, ..., xₚ). This allows for a more comprehensive and often more realistic model of real-world phenomena where outcomes are rarely determined by a single factor. The population model for multiple linear regression is given by the equation:

`y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε`

In this equation:
*   `y` is the dependent variable.
*   `x₁`, `x₂`, ..., `xₚ` are the `p` distinct independent (predictor) variables.
*   `β₀` is the **population y-intercept**. It represents the theoretical average value of `y` when all independent variables (`x₁` through `xₚ`) are simultaneously equal to zero. As with SLR, its practical relevance depends on the context.
*   `β₁`, `β₂`, ..., `βₚ` are the **population partial regression coefficients**. Each coefficient `βᵢ` represents the average change in the dependent variable `y` for a one-unit increase in the corresponding independent variable `xᵢ`, *while holding all other independent variables (`xⱼ` where `j ≠ i`) constant*. This "ceteris paribus" (all else being equal) interpretation is a critical distinction from the slope in SLR, as it isolates the unique contribution of each predictor to `y`, controlling for the influence of the other predictors in the model.
*   `ε` is the **error term**, representing the collective influence of unobserved factors and random variability not accounted for by the included predictors.

Similar to SLR, the population parameters in MLR are estimated from sample data, yielding an estimated regression equation: `ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₚxₚ`, where `b₀`, `b₁`, ..., `bₚ` are the estimated coefficients. Ordinary Least Squares (OLS) is also the standard method for estimating these coefficients. MLR requires the same core assumptions as SLR (linearity in parameters, independence of errors, homoscedasticity, normality of errors for inference), but with an additional crucial consideration: the absence of perfect multicollinearity. Multicollinearity occurs when independent variables are highly correlated with each other, which can make it difficult to disentangle their individual effects on `y` and can lead to unstable or unreliable coefficient estimates. Model evaluation in MLR often involves the Adjusted R-squared, which accounts for the number of predictors in the model (penalizing overfitting), and the F-statistic, which tests the overall significance of the entire model (i.e., whether at least one predictor has a non-zero effect). Individual t-tests are used to assess the statistical significance of each partial regression coefficient (`βᵢ`).
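
To make the Adjusted R-squared penalty concrete, here is a minimal sketch using the standard formula `1 - (1 - R²)(n - 1)/(n - p - 1)` for a model with an intercept and `p` predictors. The function name `adjusted_r_squared` and the numbers passed to it are illustrative assumptions, not values from a real dataset.

```python
def adjusted_r_squared(r_squared: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with an intercept and p predictors fit to n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 so the residual degrees of freedom are positive")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)


# Hypothetical numbers: the same R^2 of 0.50 is worth less as predictors are added.
adj_two_predictors = adjusted_r_squared(0.50, n=11, p=2)   # 1 - 0.5 * 10 / 8
adj_five_predictors = adjusted_r_squared(0.50, n=11, p=5)  # 1 - 0.5 * 10 / 5
print(adj_two_predictors, adj_five_predictors)
```

With the same raw R² and sample size, the adjusted value drops as `p` grows, which is the sense in which it penalizes extra predictors.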

## Assumptions of Linear Regression
The most common set of assumptions is often remembered by the acronym **LINE**:

1.  **L**inearity
2.  **I**ndependence of Errors
3.  **N**ormality of Errors
4.  **E**qual Variance of Errors (Homoscedasticity)

Additionally, for multiple linear regression, we have:
5.  **No Perfect Multicollinearity**

And a general consideration for good modeling:
6.  **No Significant Outliers / Influential Points**

Let's dive into each one:

---

### 1. Linearity

*   **What it means:** The underlying relationship between the independent variable(s) (X) and the expected value of the dependent variable (Y) is linear: a one-unit change in X is associated with the same average change in Y regardless of the starting level of X.
    *   In simple linear regression (one X): The relationship between X and Y can be well-represented by a straight line.
    *   In multiple linear regression (multiple Xs): The relationship between each X and Y, holding other Xs constant, is linear. The dependent variable Y is a linear combination of the independent variables.

*   **Why it's important:** If the true relationship is non-linear, forcing a linear model onto it will result in a poor fit. The model will systematically under-predict or over-predict in certain ranges, leading to biased coefficient estimates and inaccurate predictions.

*   **How to check:**
    *   **Scatter plot of Y vs. X (for simple linear regression):** Visually inspect if the points roughly form a straight line.
    *   **Residuals vs. Fitted values plot:** This is the most common and robust check. If the linearity assumption holds, the residuals should be randomly scattered around the horizontal line at zero, with no discernible pattern (e.g., no curves, U-shapes). A pattern suggests non-linearity.
    *   **Partial Regression Plots (or Added Variable Plots) for multiple regression:** These plots show the relationship between Y and one X variable after accounting for the effects of other X variables. They should show a linear trend.

*   **What to do if violated:**
    *   **Transform variables:** Apply non-linear transformations to the independent variable(s) (e.g., log(X), X², √X) or the dependent variable (e.g., log(Y)).
    *   **Add polynomial terms:** Include X², X³, etc., as new predictors.
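    *   **Use a more flexible model:** Consider splines, generalized additive models (GAMs), or a genuinely non-linear regression model. (Polynomial regression, covered by the previous point, is still linear in its parameters.)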
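
The residual pattern described under "How to check" can be reproduced with a small sketch. It fits a straight line to data that are exactly quadratic and prints the residuals. The helper `fit_line` and the data are made up for illustration; the fitting formulas are the usual least-squares ones for a single predictor.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: the true relationship is y = x^2, so a straight line is misspecified.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(b0, b1)     # -2.0 4.0
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle: the U-shape that signals a violated linearity assumption. Adding `x²` as a predictor would remove the pattern in this case.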

---

### 2. Independence of Errors (No Autocorrelation)

*   **What it means:** The errors are independent of each other: the error for one observation should not provide any information about the error for another. In practice this is assessed through the residuals (the differences between observed and predicted values), which are the observable estimates of the errors.

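*   **Why it's important:** If errors are correlated (autocorrelation), the usual OLS standard errors of the regression coefficients are biased; with positive autocorrelation, the most common case in time series, they are typically underestimated. This leads to: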
    *   Inflated t-statistics and F-statistics.
    *   P-values that are too small.
    *   Confidence intervals that are too narrow.
    You might incorrectly conclude that a variable is statistically significant when it's not. This is a common issue in time-series data where an observation at one point in time is often related to the observation at the previous point.

*   **How to check:**
    *   **Residuals vs. Time/Order plot:** If your data has a natural order (like time series), plot residuals against the order. Look for systematic patterns (e.g., waves, trends).
    *   **Durbin-Watson test:** A formal statistical test for first-order autocorrelation. Values close to 2 suggest no autocorrelation; values towards 0 suggest positive autocorrelation; values towards 4 suggest negative autocorrelation.
    *   **Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots of residuals:** These plots show the correlation of residuals with their lagged values. Significant spikes indicate autocorrelation.

*   **What to do if violated:**
    *   **For time-series data:** Use models designed for correlated errors, such as ARIMA models, Autoregressive (AR) models, or models with lagged dependent variables.
    *   **Use Generalized Least Squares (GLS) or Newey-West standard errors:** These methods can adjust for autocorrelation.
    *   **Check for omitted variables:** Sometimes, autocorrelation is a symptom of a missing predictor that varies systematically.
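
As a rough illustration of the Durbin-Watson statistic mentioned above, the sketch below computes `DW = Σ(eₜ - eₜ₋₁)² / Σeₜ²` from residuals listed in time order. The function name and both residual series are hypothetical, and the sketch returns only the statistic, not the critical values needed for a formal test.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals given in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that drift slowly: neighbours resemble each other.
drifting = [2.0, 2.0, 1.0, 1.0, -1.0, -1.0, -2.0, -2.0]
# Hypothetical residuals that flip sign every step.
alternating = [1.0, -1.0, 1.0, -1.0]

dw_drifting = durbin_watson(drifting)        # well below 2: positive autocorrelation
dw_alternating = durbin_watson(alternating)  # above 2: negative autocorrelation
print(dw_drifting, dw_alternating)
```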

---

### 3. Normality of Errors

*   **What it means:** The errors of the model are normally distributed with a mean of zero. The residuals are used to check this in practice.
    *   Note: This assumption applies to the *errors*, not necessarily to the dependent or independent variables themselves.

*   **Why it's important:** Normality of errors is primarily important for the validity of hypothesis tests (t-tests for coefficients, F-test for overall model significance) and for constructing accurate confidence and prediction intervals, especially with small sample sizes.
    *   The Ordinary Least Squares (OLS) estimators for the coefficients are still BLUE (Best Linear Unbiased Estimators) even if errors are not normal, provided the other assumptions hold (Gauss-Markov theorem). However, exact small-sample inference (p-values, CIs) relies on normality.
    *   For large sample sizes, due to the Central Limit Theorem, the sampling distribution of the coefficients will tend towards normality even if the errors themselves are not perfectly normal.

*   **How to check:**
    *   **Histogram of residuals:** Should look approximately bell-shaped.
    *   **Q-Q (Quantile-Quantile) plot of residuals:** Points should fall roughly along a straight diagonal line if residuals are normally distributed. Deviations (e.g., S-shapes) indicate non-normality.
    *   **Formal statistical tests:** Shapiro-Wilk test or Kolmogorov-Smirnov test. However, these can be overly sensitive with large datasets (detecting trivial deviations) or not sensitive enough with small datasets. Visual inspection is often preferred.

*   **What to do if violated:**
    *   **Transform the dependent variable (Y):** E.g., log(Y), √Y, Box-Cox transformation.
    *   **Check for outliers:** Outliers can distort the distribution of residuals.
    *   **Use robust regression methods:** These methods are less sensitive to violations of normality.
    *   **If sample size is large:** You might rely on the Central Limit Theorem, but it's still good practice to investigate why errors aren't normal.
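
The Q-Q plot check can be sketched without a plotting library by computing the pairs of points that such a plot would display. This is a minimal version under stated assumptions: the plotting position `(i - 0.5)/n` is one common convention among several, the residuals are standardized by their sample mean and standard deviation, and the function name and input values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs for a normal Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    center, spread = mean(ordered), stdev(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, e in enumerate(ordered, start=1):
        probability = (i - 0.5) / n  # plotting position (an assumed convention)
        theoretical = standard_normal.inv_cdf(probability)
        points.append((theoretical, (e - center) / spread))
    return points


# Hypothetical residuals from some fitted model.
example_residuals = [-0.7, 0.6, 0.9, -0.8]
points = qq_points(example_residuals)
for theoretical, observed in points:
    print(f"{theoretical:+.3f}  {observed:+.3f}")
```

If the residuals are roughly normal, the second column tracks the first closely; with only a handful of points, as here, the comparison is not very informative.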

---

### 4. Equal Variance of Errors (Homoscedasticity)

*   **What it means:** The variance of the residuals is constant across all levels of the independent variable(s). In other words, the spread of residuals should be roughly the same for all fitted values. The opposite of this is **heteroscedasticity** (unequal variance).

*   **Why it's important:** If heteroscedasticity is present:
    *   The OLS coefficient estimates are still unbiased and consistent.
    *   However, OLS is no longer the most efficient estimator (not BLUE).
    *   The usual OLS standard errors of the coefficients are biased. The direction of the bias depends on how the error variance relates to the predictors, so they may be too small or too large. This leads to unreliable t-statistics, p-values, and confidence intervals.

*   **How to check:**
    *   **Residuals vs. Fitted values plot:** This is the primary tool. Look for a "fanning out" or "cone" shape (variance increasing with fitted values) or a "funneling in" shape (variance decreasing). The spread of residuals should be roughly constant across the x-axis.
    *   **Residuals vs. Independent variable(s) plot:** Plot residuals against each X.
    *   **Formal statistical tests:** Breusch-Pagan test, White test, Goldfeld-Quandt test.

*   **What to do if violated:**
    *   **Transform the dependent variable (Y):** Often, a log transformation (log(Y)) or square root transformation (√Y) can stabilize the variance. Box-Cox transformation can also help.
    *   **Use Weighted Least Squares (WLS):** If the form of heteroscedasticity is known or can be estimated, WLS gives more weight to observations with smaller error variance.
    *   **Use heteroscedasticity-consistent standard errors (Robust Standard Errors):** E.g., Huber-White standard errors. These correct the standard errors without changing the coefficient estimates.
    *   **Re-specify the model:** Heteroscedasticity can sometimes be a sign of omitted variables or an incorrect functional form.
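
The "fanning out" pattern can also be summarized numerically. The sketch below is a crude check loosely inspired by the idea behind the Goldfeld-Quandt test: it compares the mean squared residual in the upper half of the fitted values with that in the lower half. It is not the formal test (no separate regressions are fit and no p-value is produced), and the function name and data are hypothetical.

```python
def variance_ratio(fitted, residuals):
    """Mean squared residual for the largest fitted values divided by that for the smallest."""
    pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]
    mean_sq_low = sum(e * e for e in low) / len(low)
    mean_sq_high = sum(e * e for e in high) / len(high)
    return mean_sq_high / mean_sq_low


# Hypothetical residuals whose spread grows with the fitted value.
fitted_values = [1, 2, 3, 4, 5, 6]
model_residuals = [0.1, -0.1, 0.2, -0.4, 0.8, -0.9]

ratio = variance_ratio(fitted_values, model_residuals)
print(round(ratio, 2))  # a value far above 1 suggests the variance increases
```

A ratio near 1 is consistent with constant variance; how far from 1 counts as a problem is a judgment call that the formal tests listed above address properly.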

---

### 5. No Perfect Multicollinearity (for Multiple Linear Regression)

*   **What it means:** The independent variables (predictors) should not be perfectly linearly related to each other.
    *   **Perfect multicollinearity:** One predictor is an exact linear combination of one or more other predictors (e.g., X1 = 2*X2 + 3*X3). The model cannot be estimated.
    *   **High multicollinearity (imperfect but strong):** Predictors are highly correlated. This doesn't prevent estimation but causes problems.

*   **Why it's important:**
    *   **Perfect multicollinearity:** Makes it impossible to estimate unique coefficients for the involved variables because there are infinite combinations of coefficients that would fit the data equally well. Most software will drop one of the perfectly collinear variables or return an error.
    *   **High multicollinearity:**
        *   Inflates the standard errors of the affected coefficients, making them appear statistically insignificant even if they are important.
        *   Coefficients can become very unstable and sensitive to small changes in the data or model specification.
        *   Makes it difficult to interpret the individual effect of each collinear predictor on Y, as their effects are confounded.

*   **How to check:**
    *   **Correlation matrix of predictors:** Look for high pairwise correlations (e.g., |r| > 0.7 or 0.8). This only detects pairwise collinearity, not multicollinearity involving three or more variables.
    *   **Variance Inflation Factor (VIF):** VIF for each predictor measures how much the variance of its estimated coefficient is inflated due to collinearity with other predictors.
        *   VIF = 1: No multicollinearity.
        *   VIF > 1: Some multicollinearity.
        *   VIF > 5 or > 10: Often considered problematic (thresholds vary).
    *   **Tolerance:** (1/VIF). Low tolerance (e.g., < 0.1 or < 0.2) indicates high multicollinearity.

*   **What to do if violated:**
    *   **Remove one or more of the highly correlated variables:** Choose based on theoretical importance or ease of measurement.
    *   **Combine correlated variables:** Create a composite variable (e.g., an index or average).
    *   **Use Principal Component Analysis (PCA):** Transform original predictors into a smaller set of uncorrelated principal components, then use these components as predictors (though interpretation becomes harder).
    *   **Use regularization techniques:** Ridge regression or Lasso regression can handle multicollinearity by shrinking coefficients.
    *   **Collect more data:** Sometimes, multicollinearity is a data artifact and can be reduced with a larger, more diverse dataset.
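
For a model with exactly two predictors, the VIF reduces to `1 / (1 - r²)`, where `r` is the Pearson correlation between them, because the R² from regressing one predictor on the other equals `r²`. The sketch below covers only that special case; with more predictors each VIF requires its own auxiliary regression. The function names are hypothetical, and the values are the size and bedroom columns used in the house-price example later in this document.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    s_aa = sum((x - a_bar) ** 2 for x in a)
    s_bb = sum((y - b_bar) ** 2 for y in b)
    return s_ab / (s_aa * s_bb) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)  # raises ZeroDivisionError under perfect collinearity


size = [10, 15, 18]    # in 100 sq ft
bedrooms = [2, 3, 3]

vif = vif_two_predictors(size, bedrooms)
tolerance = 1.0 / vif
print(round(vif, 2), round(tolerance, 3))
```

Here the VIF falls between the two rule-of-thumb thresholds of 5 and 10, so whether it counts as problematic depends on which threshold is adopted.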

---

### 6. No Significant Outliers / Influential Points

*   **What it means:** There are no extreme data points (outliers) that disproportionately influence the estimation of the regression line.
    *   **Outlier:** An observation with a large residual (far from the regression line).
    *   **Leverage point:** An observation with an extreme value for an independent variable (X).
    *   **Influential point:** An observation that, if removed, would markedly change the regression coefficients or model fit. Influential points are often outliers with high leverage.

*   **Why it's important:** Influential points can:
    *   Drastically change the slope and intercept of the regression line.
    *   Inflate the standard errors and reduce the model's statistical power.
    *   Distort measures of model fit like R-squared.
    *   Affect the assumptions of normality and homoscedasticity of errors.

*   **How to check:**
    *   **Residual plots:** Look for points far from the zero line.
    *   **Leverage plots (Hat values):** Identify points with high leverage.
    *   **Cook's Distance:** Measures the overall influence of an observation on the fitted values. Values > 1 are often considered highly influential (or > 4/n, where n is sample size).
    *   **DFBETAS:** Measures the change in each regression coefficient when an observation is removed.
    *   **DFFITS:** Measures the change in the fitted value for an observation when it is removed.

*   **What to do if violated:**
    *   **Investigate the point:** Is it a data entry error? If so, correct it.
    *   **Genuine extreme value:**
        *   Consider if it represents a different underlying process.
        *   You might remove it if you can justify it (e.g., measurement error, clearly not part of the population of interest), but always report such removals.
        *   Transform variables (e.g., log) to reduce the impact of extreme values.
        *   Use robust regression methods that are less sensitive to outliers.
        *   Report results both with and without the influential point to show its impact.
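
Leverage is easy to compute by hand in the single-predictor case, where the hat value of observation `i` is `hᵢ = 1/n + (xᵢ - x̄)² / Sₓₓ`. The sketch below applies that formula and flags points above `2(p + 1)/n`, a common rule of thumb rather than a strict cutoff. The function name and data are hypothetical.

```python
def leverages(xs):
    """Hat values for simple linear regression with an intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return [1.0 / n + (x - x_bar) ** 2 / s_xx for x in xs]


# Hypothetical predictor values; the last one sits far from the rest.
xs = [1, 2, 3, 4, 10]
hat_values = leverages(xs)

n_params = 2  # intercept and slope
threshold = 2 * n_params / len(xs)  # rule-of-thumb cutoff (an assumption, not a law)
flagged = [i for i, h in enumerate(hat_values) if h > threshold]

print([round(h, 2) for h in hat_values])
print(flagged)  # indices of high-leverage observations
```

The hat values always sum to the number of estimated coefficients (here 2), which is a quick sanity check. High leverage alone does not make a point influential; that also depends on its residual, which is what Cook's Distance combines.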

---


**Calculating Beta Coefficients for Simple Linear Regression (SLR)**

The model is: `yᵢ = β₀ + β₁xᵢ + εᵢ`
We want to find estimates `b₀` and `b₁` for `β₀` and `β₁`.

**Calculation Lines:**

1.  `x̄ = (Σxᵢ) / n`
2.  `ȳ = (Σyᵢ) / n`
3.  `Sₓₓ = Σ(xᵢ - x̄)² = Σxᵢ² - (Σxᵢ)²/n`
4.  `Sₓᵧ = Σ[(xᵢ - x̄)(yᵢ - ȳ)] = Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n`
5.  `b₁ = Sₓᵧ / Sₓₓ`
6.  `b₀ = ȳ - b₁x̄`

**Explanation of Each Step:**

*   **Line 1: `x̄ = (Σxᵢ) / n`**
    *   **Calculation:** Calculate the mean (average) of the independent variable `x`. Sum all the observed values of `x` (denoted `xᵢ`) and divide by the total number of observations (`n`).
    *   **Purpose:** `x̄` is a central point for the `x` data and is used in subsequent calculations for the slope and intercept.

*   **Line 2: `ȳ = (Σyᵢ) / n`**
    *   **Calculation:** Calculate the mean (average) of the dependent variable `y`. Sum all the observed values of `y` (denoted `yᵢ`) and divide by the total number of observations (`n`).
    *   **Purpose:** `ȳ` is a central point for the `y` data. The regression line is guaranteed to pass through the point (`x̄`, `ȳ`).

*   **Line 3: `Sₓₓ = Σ(xᵢ - x̄)² = Σxᵢ² - (Σxᵢ)²/n`**
    *   **Calculation:** Calculate the sum of squared deviations of `x` from its mean.
        *   Method 1 (definitional): For each `xᵢ`, subtract `x̄`, square the result, and then sum all these squared differences.
        *   Method 2 (computational, often easier by hand): Square each `xᵢ` and sum them (`Σxᵢ²`). Then, take the sum of all `xᵢ` (which you already have from step 1), square that sum `(Σxᵢ)²`, divide it by `n`, and subtract this from `Σxᵢ²`.
    *   **Purpose:** `Sₓₓ` represents the total variation in the independent variable `x`. It forms the denominator in the slope calculation, essentially scaling the covariance by the variance of `x`.

*   **Line 4: `Sₓᵧ = Σ[(xᵢ - x̄)(yᵢ - ȳ)] = Σxᵢyᵢ - (Σxᵢ)(Σyᵢ)/n`**
    *   **Calculation:** Calculate the sum of the products of deviations of `x` and `y` from their respective means.
        *   Method 1 (definitional): For each pair `(xᵢ, yᵢ)`, calculate `(xᵢ - x̄)` and `(yᵢ - ȳ)`, multiply these two deviations, and then sum all these products.
        *   Method 2 (computational): For each pair, multiply `xᵢyᵢ` and sum them (`Σxᵢyᵢ`). Then, take the sum of `xᵢ` and the sum of `yᵢ`, multiply these two sums `(Σxᵢ)(Σyᵢ)`, divide by `n`, and subtract this from `Σxᵢyᵢ`.
    *   **Purpose:** `Sₓᵧ` represents the covariance between `x` and `y` (multiplied by `n-1` or `n` depending on the covariance formula used, but the ratio for `b₁` remains the same). It measures how `x` and `y` vary together.

*   **Line 5: `b₁ = Sₓᵧ / Sₓₓ`**
    *   **Calculation:** Divide the sum of products of deviations (`Sₓᵧ` from step 4) by the sum of squared deviations of `x` (`Sₓₓ` from step 3).
    *   **Purpose:** This calculates `b₁`, the **estimated slope coefficient** for `β₁`. It indicates the average change in `y` for a one-unit change in `x`. This formula arises directly from minimizing the sum of squared residuals.

*   **Line 6: `b₀ = ȳ - b₁x̄`**
    *   **Calculation:** Multiply the estimated slope `b₁` (from step 5) by the mean of `x` (`x̄` from step 1), and subtract this product from the mean of `y` (`ȳ` from step 2).
    *   **Purpose:** This calculates `b₀`, the **estimated y-intercept** for `β₀`. It represents the predicted value of `y` when `x` is zero. This formula ensures the regression line passes through the point (`x̄`, `ȳ`).
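
The six calculation lines above translate almost directly into code. The sketch below computes `Sₓₓ` and `Sₓᵧ` with both the definitional and the computational forms to show that they agree. The function name `fit_slr` and the five data points are hypothetical.

```python
def fit_slr(xs, ys):
    """Return (b0, b1) following calculation lines 1 to 6."""
    n = len(xs)
    x_bar = sum(xs) / n                                            # line 1
    y_bar = sum(ys) / n                                            # line 2
    s_xx = sum((x - x_bar) ** 2 for x in xs)                       # line 3, definitional
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # line 4, definitional
    b1 = s_xy / s_xx                                               # line 5
    b0 = y_bar - b1 * x_bar                                        # line 6
    return b0, b1


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Computational forms of lines 3 and 4, for comparison.
s_xx_computational = sum(x * x for x in xs) - sum(xs) ** 2 / n
s_xy_computational = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

b0, b1 = fit_slr(xs, ys)
print(s_xx_computational, s_xy_computational)
print(round(b0, 4), round(b1, 4))
```

The computational forms are convenient by hand, but in floating-point arithmetic they subtract two large, similar numbers and can lose precision, so the definitional forms are the safer default in code.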

---

**Calculating Beta Coefficients for Multiple Linear Regression (MLR) using Matrix Algebra**

The model is: `Y = Xβ + ε`
We want to find the vector of estimates `b` (or `β̂`) for the vector of population parameters `β`.
`b = [b₀, b₁, ..., bₚ]ᵀ`
`β = [β₀, β₁, ..., βₚ]ᵀ`

**Calculation Lines (Matrix Operations):**

1.  Define `Y` (vector of `n` dependent variable observations).
2.  Define `X` (design matrix of `n` rows and `p+1` columns, first column is all 1s for `β₀`, subsequent `p` columns are the independent variables `x₁` to `xₚ`).
3.  `Xᵀ` (Calculate the transpose of matrix `X`).
4.  `A = XᵀX` (Calculate the matrix product of `Xᵀ` and `X`).
5.  `A⁻¹ = (XᵀX)⁻¹` (Calculate the inverse of matrix `A`).
6.  `C = XᵀY` (Calculate the matrix product of `Xᵀ` and `Y`).
7.  `b = A⁻¹C = (XᵀX)⁻¹XᵀY` (Calculate the matrix product of `A⁻¹` and `C`).

**Explanation of Each Step:**

*   **Line 1: Define `Y` (vector of `n` dependent variable observations).**
    *   **Calculation:** This is simply listing your observed `y` values in a single column vector.
        `Y = [y₁, y₂, ..., yₙ]ᵀ`
    *   **Purpose:** `Y` represents all the outcome data points you are trying to model.

*   **Line 2: Define `X` (design matrix of `n` rows and `p+1` columns).**
    *   **Calculation:** Construct a matrix where each row corresponds to an observation and each column corresponds to an independent variable. Crucially, the *first column is entirely composed of 1s*. This column is associated with the intercept term `β₀`. Subsequent columns contain the values of your independent variables `x₁ᵢ, x₂ᵢ, ..., xₚᵢ` for each observation `i`.
        `X = [[1, x₁₁, x₂₁, ..., xₚ₁],`
        `     [1, x₁₂, x₂₂, ..., xₚ₂],`
        `     ...,`
        `     [1, x₁ₙ, x₂ₙ, ..., xₚₙ]]`
    *   **Purpose:** The design matrix `X` structures all the predictor information in a way that's amenable to matrix algebra for solving the system of equations.

*   **Line 3: `Xᵀ` (Calculate the transpose of matrix `X`).**
    *   **Calculation:** Swap the rows and columns of matrix `X`. If `X` is `n x (p+1)`, then `Xᵀ` will be `(p+1) x n`.
    *   **Purpose:** Transposition is a standard matrix operation required to form the "normal equations" `(XᵀX)b = XᵀY` which are derived from minimizing the sum of squared residuals.

*   **Line 4: `A = XᵀX` (Calculate the matrix product of `Xᵀ` and `X`).**
    *   **Calculation:** Perform matrix multiplication of `Xᵀ` (from step 3) and `X` (from step 2). The result `A` will be a square `(p+1) x (p+1)` matrix. The diagonal elements of `XᵀX` relate to the sum of squares of each predictor (and `n` for the intercept column), and the off-diagonal elements relate to the sum of cross-products between predictors.
    *   **Purpose:** `XᵀX` is a fundamental component in the OLS solution. It captures the variance and covariance structure of the independent variables.

*   **Line 5: `A⁻¹ = (XᵀX)⁻¹` (Calculate the inverse of matrix `A`).**
    *   **Calculation:** Find the matrix inverse of `A` (which is `XᵀX`). This is a more complex matrix operation. A matrix has an inverse if and only if its determinant is non-zero. If `XᵀX` is singular (determinant is zero), it means there is perfect multicollinearity among the predictors, and unique `β` estimates cannot be found.
    *   **Purpose:** The inverse `(XᵀX)⁻¹` is needed to isolate the vector `b` in the normal equations. It also plays a crucial role in calculating the variance-covariance matrix of the estimated coefficients.

*   **Line 6: `C = XᵀY` (Calculate the matrix product of `Xᵀ` and `Y`).**
    *   **Calculation:** Perform matrix multiplication of `Xᵀ` (from step 3) and `Y` (from step 1). The result `C` will be a `(p+1) x 1` column vector. Each element in `C` represents the sum of products of an independent variable (or the column of 1s for the intercept) with the dependent variable `y`.
    *   **Purpose:** `XᵀY` captures the relationship (covariance structure) between each independent variable and the dependent variable.

*   **Line 7: `b = A⁻¹C = (XᵀX)⁻¹XᵀY` (Calculate the matrix product of `A⁻¹` and `C`).**
    *   **Calculation:** Perform matrix multiplication of `A⁻¹` (the inverse of `XᵀX` from step 5) and `C` (the `XᵀY` vector from step 6). The result `b` is a `(p+1) x 1` column vector.
    *   **Purpose:** This final step yields the vector `b` containing the **estimated OLS coefficients**: `b₀` (intercept), `b₁` (coefficient for `x₁`), `b₂` (coefficient for `x₂`), ..., `bₚ` (coefficient for `xₚ`). These are the values that minimize the sum of squared differences between observed `y` values and the `y` values predicted by the linear model.
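
The matrix steps can be sketched in plain Python. Note one deliberate difference from the lines above: instead of forming `(XᵀX)⁻¹` explicitly, this version solves the normal equations `(XᵀX)b = XᵀY` by Gauss-Jordan elimination, which yields the same `b`. It uses `fractions.Fraction` so the arithmetic is exact. The function name and the data are hypothetical.

```python
from fractions import Fraction


def ols_normal_equations(rows, y):
    """Solve (X^T X) b = X^T Y, where rows hold predictor values without the leading 1."""
    X = [[Fraction(1)] + [Fraction(v) for v in row] for row in rows]  # design matrix
    Y = [Fraction(v) for v in y]
    k = len(X[0])  # p + 1 coefficients

    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]  # X^T X
    C = [sum(r[i] * yi for r, yi in zip(X, Y)) for i in range(k)]            # X^T Y

    M = [A[i] + [C[i]] for i in range(k)]  # augmented matrix [A | C]
    for col in range(k):
        pivot = next((r for r in range(col, k) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("X^T X is singular (perfect multicollinearity)")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]  # scale the pivot row to 1
        for r in range(k):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][k] for i in range(k)]


# Hypothetical data generated exactly by y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]
y = [1, 3, 4, 6]
b = ols_normal_equations(rows, y)
print([str(v) for v in b])
```

Numerical libraries generally avoid the explicit inverse as well, typically relying on matrix factorizations, because inverting `XᵀX` amplifies rounding error when predictors are nearly collinear.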


The following examples illustrate the calculation of the estimated coefficients (`b₀`, `b₁`, etc., as estimates of the population `β₀`, `β₁`, etc.) with concrete numbers.

---

**Example 1: Simple Linear Regression (SLR)**

Suppose we want to predict a student's exam score (`y`) based on the hours they studied (`x`). We have the following data for 4 students:

| Student | Hours Studied (xᵢ) | Exam Score (yᵢ) |
| :------ | :----------------- | :-------------- |
| 1       | 1                  | 2               |
| 2       | 2                  | 4               |
| 3       | 3                  | 5               |
| 4       | 4                  | 4               |
| **n=4** |                    |                 |

The SLR model is `yᵢ = β₀ + β₁xᵢ + εᵢ`. We want to find `b₀` and `b₁`.

**Calculations (Line by Line):**

1.  `x̄ = (Σxᵢ) / n`
    *   `Σxᵢ = 1 + 2 + 3 + 4 = 10`
    *   `n = 4`
    *   `x̄ = 10 / 4 = 2.5`
    *   **Explanation:** We calculate the mean (average) of the hours studied.

2.  `ȳ = (Σyᵢ) / n`
    *   `Σyᵢ = 2 + 4 + 5 + 4 = 15`
    *   `n = 4`
    *   `ȳ = 15 / 4 = 3.75`
    *   **Explanation:** We calculate the mean (average) of the exam scores.

3.  `Sₓₓ = Σ(xᵢ - x̄)²`
    *   To calculate this, let's make a small table:
        | xᵢ | xᵢ - x̄ (xᵢ - 2.5) | (xᵢ - x̄)² |
        |----|-------------------|-----------|
        | 1  | -1.5              | 2.25      |
        | 2  | -0.5              | 0.25      |
        | 3  | 0.5               | 0.25      |
        | 4  | 1.5               | 2.25      |
        |    |                   | **Σ = 5** |
    *   `Sₓₓ = 5`
    *   **Explanation:** We calculate the sum of the squared differences between each `x` value and the mean of `x`. This measures the total variation in `x`.

4.  `Sₓᵧ = Σ[(xᵢ - x̄)(yᵢ - ȳ)]`
    *   To calculate this, let's expand our table:
        | xᵢ | yᵢ | xᵢ - x̄ (xᵢ - 2.5) | yᵢ - ȳ (yᵢ - 3.75) | (xᵢ - x̄)(yᵢ - ȳ) |
        |----|----|-------------------|-------------------|-------------------|
        | 1  | 2  | -1.5              | -1.75             | 2.625             |
        | 2  | 4  | -0.5              | 0.25              | -0.125            |
        | 3  | 5  | 0.5               | 1.25              | 0.625             |
        | 4  | 4  | 1.5               | 0.25              | 0.375             |
        |    |    |                   |                   | **Σ = 3.5**       |
    *   `Sₓᵧ = 3.5`
    *   **Explanation:** We calculate the sum of the products of the differences of each `x` from its mean and each `y` from its mean. This measures how `x` and `y` co-vary.

5.  `b₁ = Sₓᵧ / Sₓₓ`
    *   `b₁ = 3.5 / 5 = 0.7`
    *   **Explanation:** We calculate the estimated slope coefficient (`b₁` for `β₁`). This means for each additional hour studied, the exam score is predicted to increase by 0.7 points.

6.  `b₀ = ȳ - b₁x̄`
    *   `b₀ = 3.75 - (0.7 * 2.5)`
    *   `b₀ = 3.75 - 1.75`
    *   `b₀ = 2`
    *   **Explanation:** We calculate the estimated y-intercept (`b₀` for `β₀`). This means if a student studied 0 hours, their predicted exam score would be 2. (Interpret with caution if x=0 is outside the range of observed data).

**Resulting SLR Equation:**
`ŷ = b₀ + b₁x`
`ŷ = 2 + 0.7x`
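
As a follow-up check on this example, the sketch below takes the estimates `b₀ = 2` and `b₁ = 0.7` from the calculation above, computes the fitted values and residuals, and derives R² as `1 - SSE/SST`. The variable names are illustrative.

```python
hours = [1, 2, 3, 4]
scores = [2, 4, 5, 4]
b0, b1 = 2.0, 0.7  # estimates from the worked example

fitted = [b0 + b1 * x for x in hours]
residuals = [y - f for y, f in zip(scores, fitted)]

y_bar = sum(scores) / len(scores)
sse = sum(e ** 2 for e in residuals)        # unexplained variation
sst = sum((y - y_bar) ** 2 for y in scores)  # total variation in y
r_squared = 1 - sse / sst

print([round(e, 2) for e in residuals])
print(round(r_squared, 4))
```

The residuals sum to zero (up to rounding), as they must for an OLS fit with an intercept, and roughly half of the variation in exam scores is accounted for by hours studied in this tiny sample.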

---

**Example 2: Multiple Linear Regression (MLR)**

Suppose we want to predict house price (`y`, in $10,000s) based on its size (`x₁`, in 100 sq ft) and number of bedrooms (`x₂`). We have data for 3 houses:

| House | Price (yᵢ) | Size (x₁ᵢ) | Bedrooms (x₂ᵢ) |
| :---- | :--------- | :--------- | :------------- |
| 1     | 30         | 10         | 2              |
| 2     | 45         | 15         | 3              |
| 3     | 50         | 18         | 3              |
| **n=3** |            |            |                |

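The MLR model is `yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + εᵢ`. We want to find `b = [b₀, b₁, b₂]ᵀ`. Because there are only as many observations as coefficients (n = 3), the fitted plane will pass through all three points exactly, leaving no residual degrees of freedom; the example is meant to show the matrix mechanics, not a realistic fit.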

**Calculations (Matrix Operations):**

1.  Define `Y` (vector of dependent variable observations).
    *   `Y = [[30], [45], [50]]`
    *   **Explanation:** This vector holds the observed house prices.

2.  Define `X` (design matrix: column of 1s for `β₀`, then columns for `x₁` and `x₂`).
    *   `X = [[1, 10, 2], [1, 15, 3], [1, 18, 3]]`
    *   **Explanation:** The first column of 1s allows us to estimate the intercept `b₀`. The other columns are the values of our independent variables.

3.  `Xᵀ` (Calculate the transpose of matrix `X`).
    *   `Xᵀ = [[1,  1,  1], [10, 15, 18], [2,  3,  3]]`
    *   **Explanation:** We swap rows and columns of `X`.

4.  `A = XᵀX` (Calculate the matrix product of `Xᵀ` and `X`).
    *   `A = [[1,  1,  1], [10, 15, 18], [2,  3,  3]] * [[1, 10, 2], [1, 15, 3], [1, 18, 3]]`
    *   `A = [[(1*1+1*1+1*1), (1*10+1*15+1*18), (1*2+1*3+1*3)],`
        `     [(10*1+15*1+18*1), (10*10+15*15+18*18), (10*2+15*3+18*3)],`
        `     [(2*1+3*1+3*1), (2*10+3*15+3*18), (2*2+3*3+3*3)]]`
    *   `A = [[3,   43,   8], [43,  649,  119], [8,   119,  22]]`
    *   **Explanation:** This matrix represents sums of squares and cross-products of the independent variables (and the constant term).

5.  `A⁻¹ = (XᵀX)⁻¹` (Calculate the inverse of matrix `A`).
    *   This requires calculating the determinant and the adjugate matrix.
    *   `det(A) = 3(649*22 - 119*119) - 43(43*22 - 119*8) + 8(43*119 - 649*8)`
        `det(A) = 3(117) - 43(-6) + 8(-75) = 351 + 258 - 600 = 9`
    *   `adj(A) = [[117,   6,  -75], [6,    2,  -13], [-75, -13,   98]]` (The matrix of cofactors, transposed. `XᵀX` is symmetric, so its adjugate is also symmetric here).
    *   `A⁻¹ = (1/det(A)) * adj(A) = (1/9) * [[117,   6,  -75], [6,    2,  -13], [-75, -13,   98]]`
    *   `A⁻¹ = [[13,    2/3,  -25/3], [2/3,   2/9,  -13/9], [-25/3, -13/9, 98/9]]`
        (Approximately: `[[13, 0.667, -8.333], [0.667, 0.222, -1.444], [-8.333, -1.444, 10.889]]`)
    *   **Explanation:** The inverse matrix is crucial for solving for `b`. Its existence depends on `det(A)` not being zero (i.e., no perfect multicollinearity).

6.  `C = XᵀY` (Calculate the matrix product of `Xᵀ` and `Y`).
    *   `C = [[1,  1,  1], [10, 15, 18], [2,  3,  3]] * [[30], [45], [50]]`
    *   `C = [[(1*30+1*45+1*50)], [(10*30+15*45+18*50)], [(2*30+3*45+3*50)]]`
    *   `C = [[125], [1875], [345]]`
    *   **Explanation:** This vector represents sums of cross-products of each independent variable (and the constant) with the dependent variable.

7.  `b = A⁻¹C = (XᵀX)⁻¹XᵀY` (Calculate the matrix product of `A⁻¹` and `C`).
    *   `b = [[13,    2/3,  -25/3], [2/3,   2/9,  -13/9], [-25/3, -13/9, 98/9]] * [[125], [1875], [345]]`
    *   `b₀ = 13*125 + (2/3)*1875 - (25/3)*345 = 1625 + 1250 - 2875 = 0`
    *   `b₁ = (2/3)*125 + (2/9)*1875 - (13/9)*345 = (250/3) + (3750/9) - (4485/9) = (750 + 3750 - 4485)/9 = 15/9 = 5/3`
    *   `b₂ = (-25/3)*125 - (13/9)*1875 + (98/9)*345 = (-3125/3) - (24375/9) + (33810/9) = (-9375 - 24375 + 33810)/9 = 60/9 = 20/3`
    *   So, `b = [[0], [5/3], [20/3]]`
        (Approximately: `b = [[0], [1.667], [6.667]]`)
    *   **Explanation:** This final vector `b` contains our estimated coefficients:
        *   `b₀ = 0` (intercept for `β₀`)
        *   `b₁ = 5/3 ≈ 1.667` (coefficient for `x₁`, size, for `β₁`)
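        *   `b₂ = 20/3 ≈ 6.667` (coefficient for `x₂`, bedrooms, for `β₂`)
    *   **Note:** These coefficients reproduce all three observed prices exactly (for house 1, `(5/3)(10) + (20/3)(2) = 30`). That happens because three coefficients are fitted to only three observations, leaving no degrees of freedom for error; with more houses than coefficients, the fit would generally leave nonzero residuals.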

**Resulting MLR Equation:**
`ŷ = b₀ + b₁x₁ + b₂x₂`
`ŷ = 0 + (5/3)x₁ + (20/3)x₂`
`ŷ ≈ 1.667x₁ + 6.667x₂`
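
The matrix steps above can be verified exactly with Python's standard `fractions` module. The sketch below is one possible approach, not the only one: it builds `XᵀX` and `XᵀY`, then solves `(XᵀX) b = XᵀY` by Gauss-Jordan elimination instead of forming the inverse explicitly, which yields the same `b`. The helper name `solve_normal_equations` is hypothetical.

```python
from fractions import Fraction

def solve_normal_equations(X, y):
    """Solve (X^T X) b = X^T y by Gauss-Jordan elimination with exact fractions."""
    n_rows, n_cols = len(X), len(X[0])
    # A = X^T X and c = X^T y
    A = [[sum(Fraction(X[k][i]) * X[k][j] for k in range(n_rows))
          for j in range(n_cols)] for i in range(n_cols)]
    c = [sum(Fraction(X[k][i]) * y[k] for k in range(n_rows))
         for i in range(n_cols)]
    # Augmented matrix [A | c]
    M = [row + [ci] for row, ci in zip(A, c)]
    for col in range(n_cols):
        # Find a row with a nonzero pivot; none exists if X^T X is singular.
        pivot = next((r for r in range(col, n_cols) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("X^T X is singular (perfect multicollinearity)")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]   # scale pivot row to 1
        for r in range(n_cols):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]

X = [[1, 10, 2], [1, 15, 3], [1, 18, 3]]   # design matrix from step 2
y = [30, 45, 50]                           # observed prices from step 1
b = solve_normal_equations(X, y)
print(b)   # coefficients b0, b1, b2 as exact fractions
```

Working in exact fractions avoids the rounding that would otherwise obscure the result `b = [0, 5/3, 20/3]`. For larger problems a numerical linear algebra library would be the usual choice.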



**The Problem Regularization Solves: Overfitting and Multicollinearity**

Imagine you're trying to predict house prices.
*   **Standard Linear Regression (Ordinary Least Squares - OLS):** Tries to find the line (or plane, if you have many features) that best fits your existing data by minimizing the sum of squared differences between actual prices and predicted prices.
    *   **Problem 1: Overfitting:** If you have too many features (e.g., house size, number of rooms, age, garden size, color of the front door, number of windows, distance to 10 different landmarks...), the model might learn the existing data *too well*, including its noise. It becomes overly complex and performs poorly on new, unseen houses.
    *   **Problem 2: Multicollinearity:** If some features are highly correlated (e.g., "house size in sq ft" and "number of large rooms"), OLS cannot reliably separate their individual effects. The coefficient estimates become unstable: they can be very large, take implausible signs, and change sharply with small changes in the data.

Lasso, Ridge, and Elastic Net are called "regularization" techniques. They add a **penalty** to the regression equation to discourage overly complex models or very large coefficients.

---

**1. Ridge Regression (L2 Regularization)**

*   **Simple Idea:** "Let's have all the features, but let's make sure none of them become *too* influential on their own, especially if they are similar to other features." It shrinks the coefficients towards zero, but rarely to *exactly* zero.
*   **How it Works:** Ridge regression adds a penalty to the OLS objective function that is equal to the sum of the *squares* of the coefficients.
*   **Equation (What it tries to minimize):**
    `Minimize [ Σ(yᵢ - ŷᵢ)²  +  λ Σ(βⱼ)² ]`
    Let's break this down:
    *   `Σ(yᵢ - ŷᵢ)²`: This is the standard OLS part – the sum of squared differences between actual values (`yᵢ`) and predicted values (`ŷᵢ`). We want this to be small.
        *   `ŷᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ...` (the predicted value based on coefficients `β` and features `x`)
    *   `λ` (lambda): This is a **tuning parameter** (a positive number you choose). Think of it as a knob that controls the strength of the penalty.
    *   `Σ(βⱼ)²`: This is the **L2 penalty**. It's the sum of the squares of all the feature coefficients (`β₁² + β₂² + ...`). (Note: The intercept `β₀` is usually not penalized).

*   **Effect of the Penalty:**
    *   To minimize the whole equation, the model now has to balance two things:
        1.  Fitting the data well (making `Σ(yᵢ - ŷᵢ)²` small).
        2.  Keeping the coefficients small (making `λ Σ(βⱼ)²` small).
    *   If a coefficient `βⱼ` becomes very large, `βⱼ²` becomes even larger, increasing the penalty. So, Ridge prefers models where the "weight" or influence is spread out more evenly among predictors, rather than having a few with massive coefficients.
    *   It shrinks coefficients towards zero, but they generally do not become *exactly* zero for any finite `λ`; they only approach zero as `λ` grows very large. So, all features are typically kept in the model.

*   **Simple Example:**
    Imagine predicting house price (`y`) with:
    *   `x₁`: Size in square feet
    *   `x₂`: Number of large rooms (which is highly correlated with size)
    *   `x₃`: Age of the house
    Standard OLS might give a very large positive coefficient to `x₁` and a large negative coefficient to `x₂` (or vice-versa) to compensate for their correlation, which is hard to interpret.
    Ridge regression would shrink both `β₁` and `β₂` towards zero, making them smaller and more stable. It acknowledges both are important but doesn't let one dominate erratically due to their similarity. `β₃` would also be shrunk based on its contribution.

*   **How is λ chosen?** Usually through a process called cross-validation, where you try different values of λ and see which one gives the best performance on unseen data.
    *   If `λ = 0`, Ridge is just OLS.
    *   As `λ` increases, the shrinkage effect becomes stronger, and coefficients get closer to zero.

---

**2. Lasso Regression (L1 Regularization)**

*   **Simple Idea:** "Let's try to simplify the model by not just shrinking coefficients, but by actually kicking out the least important features altogether by making their coefficients *exactly* zero."
*   **How it Works:** Lasso regression adds a penalty to the OLS objective function that is equal to the sum of the *absolute values* of the coefficients.
*   **Equation (What it tries to minimize):**
    `Minimize [ Σ(yᵢ - ŷᵢ)²  +  λ Σ|βⱼ| ]`
    Let's break this down:
    *   `Σ(yᵢ - ŷᵢ)²`: Same as OLS and Ridge – sum of squared errors.
    *   `λ`: Again, the tuning parameter controlling penalty strength.
    *   `Σ|βⱼ|`: This is the **L1 penalty**. It's the sum of the absolute values of all feature coefficients (`|β₁| + |β₂| + ...`). (Intercept `β₀` is usually not penalized).

*   **Effect of the Penalty:**
    *   Similar to Ridge, Lasso balances fitting the data with keeping coefficients small.
    *   However, the nature of the absolute value penalty (`|βⱼ|`) means that for a strong enough `λ`, some coefficients can be forced to be *exactly zero*. This is a form of automatic **feature selection**.
    *   If a feature is not very predictive, Lasso is more likely to discard it by setting its `βⱼ` to 0.

*   **Simple Example:**
    Imagine predicting house price (`y`) with many features:
    *   `x₁`: Size in square feet (important)
    *   `x₂`: Number of bedrooms (important)
    *   `x₃`: Age of the house (moderately important)
    *   `x₄`: Color of the kitchen tiles (likely unimportant)
    *   `x₅`: Distance to nearest coffee shop (maybe somewhat important)
    Lasso, with an appropriate `λ`, might result in:
    *   `β₁` (for size): A significant value (shrunk from OLS, but non-zero)
    *   `β₂` (for bedrooms): A significant value (shrunk, non-zero)
    *   `β₃` (for age): A smaller value (shrunk, non-zero)
    *   `β₄` (for tile color): **Exactly 0** (feature effectively removed)
    *   `β₅` (for coffee shop): Maybe a small non-zero value, or maybe 0 if `λ` is high enough.
    This makes the model simpler and potentially easier to interpret because you only focus on the non-zero coefficient features.

*   **Limitation of Lasso:** If you have a group of highly correlated features that are all useful, Lasso tends to arbitrarily pick one or a few from the group and shrink the others to zero, rather than sharing the load like Ridge does.
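
To see why the L1 penalty can produce exact zeros, consider again a single feature with no intercept. Under that assumption, minimizing `Σ(yᵢ - βxᵢ)² + λ|β|` leads to a "soft-thresholding" rule: shrink `|Σxᵢyᵢ|` by `λ/2` and clip the result at zero. The sketch below uses hypothetical data and a hypothetical function name, `lasso_slope`.

```python
def lasso_slope(x, y, lam):
    """Minimizer of sum((y_i - beta*x_i)^2) + lam * |beta| (one feature, no intercept)."""
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_xx = sum(xi * xi for xi in x)
    # Soft-thresholding: reduce |s_xy| by lam/2 and clip at zero.
    shrunk = max(abs(s_xy) - lam / 2, 0.0)
    sign = 1.0 if s_xy > 0 else -1.0
    return sign * shrunk / s_xx

x = [1.0, 2.0, 3.0]   # hypothetical toy data lying on y = 2x
y = [2.0, 4.0, 6.0]

for lam in [0, 28, 56, 100]:
    print(f"lambda = {lam:>3}: beta = {lasso_slope(x, y, lam):.4f}")
```

Here the coefficient falls from 2 to 1 at `λ = 28` and is exactly 0 once `λ` reaches 56, where `λ/2` equals `Σxᵢyᵢ = 28`. With several features, practical solvers apply the same thresholding idea one coordinate at a time.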

---

**3. Elastic Net Regression**

*   **Simple Idea:** "Why choose between Ridge and Lasso? Let's get the best of both worlds!"
*   **How it Works:** Elastic Net combines both the L1 penalty (from Lasso) and the L2 penalty (from Ridge).
*   **Equation (What it tries to minimize):**
    `Minimize [ Σ(yᵢ - ŷᵢ)²  +  λ₁ Σ|βⱼ|  +  λ₂ Σ(βⱼ)² ]`
    *   `Σ(yᵢ - ŷᵢ)²`: Sum of squared errors.
    *   `λ₁ Σ|βⱼ|`: The Lasso (L1) penalty part. `λ₁` controls its strength.
    *   `λ₂ Σ(βⱼ)²`: The Ridge (L2) penalty part. `λ₂` controls its strength.

    Often, this is written with a single overall penalty `λ` and a mixing parameter `α` (alpha) between 0 and 1:
    `Minimize [ Σ(yᵢ - ŷᵢ)²  +  λ (α Σ|βⱼ|  +  (1-α) Σ(βⱼ)²) ]`
    *   `α`: Controls the mix.
        *   If `α = 1`, it's pure Lasso.
        *   If `α = 0`, it's pure Ridge.
        *   If `0 < α < 1`, it's a combination. For example, `α = 0.5` gives equal weight to both penalties.
    *   `λ`: Controls the overall strength of the combined penalty.

*   **Effect of the Penalty:**
    *   It can perform feature selection (like Lasso, setting some `βⱼ` to 0).
    *   It can handle correlated predictors better than Lasso alone (like Ridge, it can shrink coefficients of correlated features together rather than arbitrarily picking one). This is often called the "grouping effect" – if predictors are correlated, their coefficients tend to rise or fall together.

*   **Simple Example:**
    Imagine predicting house price (`y`) with:
    *   `x₁`: Size in square feet
    *   `x₂`: Number of large rooms (correlated with `x₁`)
    *   `x₃`: Number of bathrooms (correlated with `x₁` and `x₂`)
    *   `x₄`: Quality of kitchen appliances (somewhat correlated with overall house quality/price)
    *   `x₅`: Color of the curtains (likely irrelevant)
    Elastic Net, with appropriate `λ` and `α`, might:
    *   Set `β₅` (for curtain color) to 0 (feature selection from Lasso part).
    *   Shrink `β₁`, `β₂`, `β₃` (size, large rooms, bathrooms) together, acknowledging their grouped importance (Ridge-like behavior for correlated features). `β₄` would also be appropriately shrunk.
    It tries to select important variables and can shrink groups of correlated variables together.
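
The role of the mixing parameter can be sketched in the same one-feature, no-intercept setting. Under that assumption, the `λ`/`α` form of the objective above has the minimizer `β = S(Σxᵢyᵢ, λα/2) / (Σxᵢ² + λ(1-α))`, where `S` is soft-thresholding. The function name `elastic_net_slope` and the data are hypothetical.

```python
def elastic_net_slope(x, y, lam, alpha):
    """Minimizer of sum((y_i - beta*x_i)^2) + lam*(alpha*|beta| + (1-alpha)*beta^2)."""
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_xx = sum(xi * xi for xi in x)
    # L1 part: soft-threshold the numerator by lam*alpha/2.
    shrunk = max(abs(s_xy) - lam * alpha / 2, 0.0)
    sign = 1.0 if s_xy > 0 else -1.0
    # L2 part: add lam*(1-alpha) to the denominator.
    return sign * shrunk / (s_xx + lam * (1 - alpha))

x = [1.0, 2.0, 3.0]   # hypothetical toy data lying on y = 2x
y = [2.0, 4.0, 6.0]

for alpha in [0.0, 0.5, 1.0]:
    print(f"alpha = {alpha}: beta = {elastic_net_slope(x, y, 14, alpha):.4f}")
```

At `λ = 14` this gives 1.0 for `α = 0` (the pure Ridge value), 1.5 for `α = 1` (the pure Lasso value), and about 1.1667 for `α = 0.5`. Software packages often scale the two penalty terms differently, so `λ` and `α` values are not directly comparable across libraries.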

---

**In Summary (Simple Terms):**

| Feature        | Standard Linear Regression (OLS) | Ridge Regression                                | Lasso Regression                               | Elastic Net Regression                           |
| :------------- | :------------------------------- | :---------------------------------------------- | :--------------------------------------------- | :----------------------------------------------- |
| **Main Goal**  | Minimize prediction errors       | Minimize errors + shrink coefficients           | Minimize errors + shrink some coeffs to ZERO   | Minimize errors + shrink coeffs (some to ZERO) |
| **Penalty**    | None                             | Sum of **squared** coefficients (L2 norm)       | Sum of **absolute** coefficients (L1 norm)   | Combination of L1 and L2 norms                 |
| **Feature Selection?** | No                           | No (coefficients get small, but rarely zero)    | Yes (can set coefficients to exactly zero)     | Yes (can set coefficients to exactly zero)     |
| **Handles Correlated Features?** | Poorly (unstable coeffs)     | Better (shrinks them together)                  | Okay (tends to pick one, zero out others)    | Good (can group them and select/shrink)        |
| **Use When...**| Simple models, few features, no strong correlations | Many features, some correlation, want to keep all features | Many features, want a simpler model by removing some features | Many features, some correlation, want feature selection and good handling of correlated groups |



## Gradient Descent


**The Core Idea: Finding the Bottom of a Valley**

Imagine you're on a foggy mountain range, and you want to get to the lowest point (the valley). You can't see the whole map, but you can feel the slope of the ground beneath your feet.
Gradient Descent works similarly:
1.  You start somewhere on the "mountain" (your initial guess for the model parameters).
2.  You check the "slope" (the gradient) at your current position. The gradient tells you the direction of the steepest *ascent* (uphill).
3.  To go *downhill*, you take a small step in the *opposite* direction of the gradient.
4.  You repeat steps 2 and 3 until you reach a point where the ground is flat (or flat enough), meaning you're at the bottom (or close to it).

In machine learning, the "mountain" is your **cost function** (or loss function). This function measures how "bad" your model's predictions are. The "lowest point" corresponds to the model parameters (like `β₀` and `β₁` in linear regression) that make your model's predictions as accurate as possible (i.e., minimize the cost).

**Key Components:**

1.  **Cost Function (J):** A function that measures the error of your model. For linear regression, a common choice is the Mean Squared Error (MSE), written here with an extra factor of 1/2:
    `J(β₀, β₁) = (1/(2m)) * Σ(ŷᵢ - yᵢ)²`
    Where:
    *   `m` is the number of training examples.
    *   `ŷᵢ` is the prediction for the i-th example (`ŷᵢ = β₀ + β₁xᵢ`).
    *   `yᵢ` is the actual value for the i-th example.
    *   `β₀`, `β₁` are the parameters of our model we want to find.
    *   The `1/(2m)` is a scaling factor: `1/m` averages over the examples, and the `1/2` cancels the factor of 2 that appears when the square is differentiated.

2.  **Parameters (β or θ):** These are the values your model uses to make predictions (e.g., `β₀` and `β₁` in `y = β₀ + β₁x`). Gradient Descent's job is to find the best values for these parameters.

3.  **Learning Rate (α - alpha):** This determines the size of the step you take downhill in each iteration.
    *   If `α` is too small, Gradient Descent will be very slow.
    *   If `α` is too large, you might overshoot the minimum and even diverge (the cost might increase).

4.  **Gradient (∇J):** The gradient is a vector of partial derivatives of the cost function with respect to each parameter. It points in the direction of the steepest increase of the cost function.
    *   `∂J/∂β₀` (partial derivative of J with respect to `β₀`)
    *   `∂J/∂β₁` (partial derivative of J with respect to `β₁`)

**The Gradient Descent Algorithm Steps:**

1.  **Initialize Parameters:** Start with some initial guesses for `β₀` and `β₁` (e.g., `β₀=0`, `β₁=0` or small random values).
2.  **Calculate the Gradient:** Compute the partial derivatives of the cost function `J` with respect to each parameter `β₀` and `β₁` using your current parameter values.
    For our MSE cost function and linear model `ŷᵢ = β₀ + β₁xᵢ`:
    *   `∂J/∂β₀ = (1/m) * Σ( (β₀ + β₁xᵢ) - yᵢ )`
    *   `∂J/∂β₁ = (1/m) * Σ( (β₀ + β₁xᵢ) - yᵢ ) * xᵢ`
3.  **Update Parameters:** Adjust the parameters by moving in the opposite direction of the gradient, scaled by the learning rate `α`.
    *   `β₀ := β₀ - α * (∂J/∂β₀)`
    *   `β₁ := β₁ - α * (∂J/∂β₁)`
    (The `:=` means "is updated to")
4.  **Repeat:** Go back to step 2 and repeat until the cost function converges (i.e., it changes very little between iterations, or you reach a maximum number of iterations).
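
The four steps can be sketched as a short Python function. This is an illustrative implementation under stated assumptions, not a reference one: the function name, the default tolerance `tol`, the iteration cap, and the toy data are all hypothetical choices.

```python
def gradient_descent(x, y, lr=0.1, tol=1e-12, max_iters=100_000):
    """Batch gradient descent for y_hat = b0 + b1*x with J = (1/(2m)) * sum(err^2)."""
    m = len(x)
    b0, b1 = 0.0, 0.0                       # step 1: initialize parameters
    prev_cost = float("inf")
    for iteration in range(1, max_iters + 1):
        errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        cost = sum(e * e for e in errors) / (2 * m)
        if abs(prev_cost - cost) < tol:     # step 4: stop once J barely changes
            break
        prev_cost = cost
        grad_b0 = sum(errors) / m           # step 2: dJ/db0
        grad_b1 = sum(e * xi for e, xi in zip(errors, x)) / m   # dJ/db1
        # step 3: simultaneous update (both gradients used the old values)
        b0, b1 = b0 - lr * grad_b0, b1 - lr * grad_b1
    return b0, b1, cost, iteration

# Hypothetical data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1, cost, iters = gradient_descent(x, y)
print(f"b0={b0:.4f}, b1={b1:.4f}, J={cost:.2e}, iterations={iters}")
```

Both parameters are updated from gradients computed at the same (old) values; updating `b0` first and then using the new `b0` to compute the gradient for `b1` would be a different algorithm.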

**Example: Simple Linear Regression with Gradient Descent**

Let's use a very small dataset to predict `y` from `x`.
Model: `ŷ = β₀ + β₁x`
Cost Function: `J(β₀, β₁) = (1/(2m)) * Σ( (β₀ + β₁xᵢ) - yᵢ )²`

**Data (m=3):**

| xᵢ | yᵢ |
| :-- | :-: |
| 1  | 2  |
| 2  | 4  |
| 3  | 3  |

**Initialization:**
*   `β₀ = 0`
*   `β₁ = 0`
*   Learning rate `α = 0.1`
*   Number of examples `m = 3`

**Calculations (Iteration by Iteration):**

Let's go through the first two iterations.

---

**Iteration 1:**

1.  **Current Parameters:**
    *   `β₀ = 0`
    *   `β₁ = 0`

2.  **Calculate Predictions (ŷᵢ = β₀ + β₁xᵢ):**
    *   For x₁=1: `ŷ₁ = 0 + 0*1 = 0`
    *   For x₂=2: `ŷ₂ = 0 + 0*2 = 0`
    *   For x₃=3: `ŷ₃ = 0 + 0*3 = 0`

3.  **Calculate Errors (Errorᵢ = ŷᵢ - yᵢ):**
    *   Error₁ = `0 - 2 = -2`
    *   Error₂ = `0 - 4 = -4`
    *   Error₃ = `0 - 3 = -3`

4.  **Calculate Cost J(β₀, β₁):**
    *   `J = (1/(2*3)) * [(-2)² + (-4)² + (-3)²]`
    *   `J = (1/6) * [4 + 16 + 9]`
    *   `J = (1/6) * 29 = 4.833`

5.  **Calculate Gradients:**
    *   `∂J/∂β₀ = (1/m) * Σ(Errorᵢ)`
        *   `∂J/∂β₀ = (1/3) * (-2 + -4 + -3) = (1/3) * (-9) = -3`
    *   `∂J/∂β₁ = (1/m) * Σ(Errorᵢ * xᵢ)`
        *   `∂J/∂β₁ = (1/3) * [(-2*1) + (-4*2) + (-3*3)]`
        *   `∂J/∂β₁ = (1/3) * [-2 - 8 - 9] = (1/3) * (-19) = -6.333`

6.  **Update Parameters (using α = 0.1):**
    *   `β₀ := β₀ - α * (∂J/∂β₀)`
        *   `β₀ := 0 - 0.1 * (-3) = 0 + 0.3 = 0.3`
    *   `β₁ := β₁ - α * (∂J/∂β₁)`
        *   `β₁ := 0 - 0.1 * (-6.333) = 0 + 0.6333 = 0.6333`

**End of Iteration 1:** New parameters are `β₀ = 0.3`, `β₁ = 0.6333`. Cost was `4.833`.

---

**Iteration 2:**

1.  **Current Parameters:**
    *   `β₀ = 0.3`
    *   `β₁ = 0.6333`

2.  **Calculate Predictions (ŷᵢ = β₀ + β₁xᵢ):** (Values are carried at full precision, with `β₁ = 19/30`, and shown rounded to four decimals.)
    *   For x₁=1: `ŷ₁ = 0.3 + 0.6333*1 = 0.3 + 0.6333 = 0.9333`
    *   For x₂=2: `ŷ₂ = 0.3 + 0.6333*2 = 0.3 + 1.2667 = 1.5667`
    *   For x₃=3: `ŷ₃ = 0.3 + 0.6333*3 = 0.3 + 1.9000 = 2.2000`

3.  **Calculate Errors (Errorᵢ = ŷᵢ - yᵢ):**
    *   Error₁ = `0.9333 - 2 = -1.0667`
    *   Error₂ = `1.5667 - 4 = -2.4333`
    *   Error₃ = `2.2000 - 3 = -0.8000`

4.  **Calculate Cost J(β₀, β₁):**
    *   `J = (1/(2*3)) * [(-1.0667)² + (-2.4333)² + (-0.8000)²]`
    *   `J = (1/6) * [1.1378 + 5.9211 + 0.6400]`
    *   `J = (1/6) * 7.6989 = 1.2831`
    *   *(Notice the cost has decreased from 4.833 to 1.2831! We are going downhill.)*

5.  **Calculate Gradients:**
    *   `∂J/∂β₀ = (1/m) * Σ(Errorᵢ)`
        *   `∂J/∂β₀ = (1/3) * (-1.0667 + -2.4333 + -0.8000) = (1/3) * (-4.3000) = -1.4333`
    *   `∂J/∂β₁ = (1/m) * Σ(Errorᵢ * xᵢ)`
        *   `∂J/∂β₁ = (1/3) * [(-1.0667*1) + (-2.4333*2) + (-0.8000*3)]`
        *   `∂J/∂β₁ = (1/3) * [-1.0667 - 4.8667 - 2.4000] = (1/3) * (-8.3333) = -2.7778`

6.  **Update Parameters (using α = 0.1):**
    *   `β₀ := β₀ - α * (∂J/∂β₀)`
        *   `β₀ := 0.3 - 0.1 * (-1.4333) = 0.3 + 0.1433 = 0.4433`
    *   `β₁ := β₁ - α * (∂J/∂β₁)`
        *   `β₁ := 0.6333 - 0.1 * (-2.7778) = 0.6333 + 0.2778 = 0.9111`

**End of Iteration 2:** New parameters are `β₀ = 0.4433`, `β₁ = 0.9111`. Cost is `1.2831`.

---

**Summary Table of Iterations:**

| Iter | `β₀` (start) | `β₁` (start) | Cost (J) | `∂J/∂β₀` | `∂J/∂β₁` | `β₀` (end) | `β₁` (end) |
| :--- | :----------- | :----------- | :------- | :------- | :------- | :--------- | :--------- |
| 1    | 0            | 0            | 4.833    | -3.000   | -6.333   | 0.3000     | 0.6333     |
| 2    | 0.3000       | 0.6333       | 1.2831   | -1.433   | -2.778   | 0.4433     | 0.9111     |
| ...  | ...          | ...          | ...      | ...      | ...      | ...        | ...        |

We would continue this process. With each iteration:
*   The cost `J` should generally decrease (if `α` is chosen well).
*   The parameters `β₀` and `β₁` will get closer to the values that minimize the cost function.
*   The gradients `∂J/∂β₀` and `∂J/∂β₁` will get closer to 0 as we approach the minimum.
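
To see where the process ends up, the iterations can be replayed in Python. The sketch below assumes the same three data points, `α = 0.1`, and zero initial parameters as the worked example; the cap of 5000 iterations is an arbitrary illustrative choice. Because it carries full floating-point precision, its printed values can differ from hand-rounded ones in the last displayed digit.

```python
x = [1, 2, 3]
y = [2, 4, 3]
m = len(x)
lr = 0.1                 # learning rate (alpha in the text)
b0, b1 = 0.0, 0.0

history = []
for it in range(1, 5001):
    errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
    cost = sum(e * e for e in errors) / (2 * m)     # J at the start of the iteration
    g0 = sum(errors) / m                            # dJ/db0
    g1 = sum(e * xi for e, xi in zip(errors, x)) / m   # dJ/db1
    b0, b1 = b0 - lr * g0, b1 - lr * g1             # simultaneous update
    history.append((it, cost, g0, g1, b0, b1))

for row in history[:2]:
    print("iter %d: J=%.4f  dJ/db0=%.4f  dJ/db1=%.4f  b0=%.4f  b1=%.4f" % row)
print("after %d iterations: b0=%.4f, b1=%.4f" % (history[-1][0], b0, b1))
```

The parameters settle at `β₀ = 2` and `β₁ = 0.5`, which is the least-squares line for this data (`x̄ = 2`, `ȳ = 3`, `Sₓᵧ = 1`, `Sₓₓ = 2`), so gradient descent and the closed-form formulas agree.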

**When to Stop (Convergence):**

You stop iterating when:
1.  The change in the cost function `J` between iterations is very small (below a threshold).
2.  The magnitude of the gradient vector is very small (meaning you're on flat ground).
3.  A pre-defined maximum number of iterations is reached.

**Important Considerations:**

*   **Learning Rate (α):** Crucial. Too small means slow convergence. Too large means overshooting or divergence. Often requires experimentation.
*   **Feature Scaling:** If your input features (`xᵢ`) are on very different scales (e.g., one feature ranges from 0-1, another from 0-1000), gradient descent can be slow or oscillate. It's good practice to scale features (e.g., normalization or standardization) so they have similar ranges. This helps the "valley" of the cost function be more circular, making it easier for gradient descent to find the bottom.
*   **Local vs. Global Minima:** For a convex cost function (like MSE for linear regression), every local minimum is also a global minimum, so gradient descent with a suitably small learning rate converges to it. For non-convex functions (common in neural networks), gradient descent might get stuck in a local minimum, which isn't the absolute best solution.
*   **Batch vs. Stochastic vs. Mini-Batch Gradient Descent:**
    *   **Batch Gradient Descent (what we did above):** Uses *all* training examples (`m`) to calculate the gradient in each step. Can be slow for very large datasets.
    *   **Stochastic Gradient Descent (SGD):** Uses *one* training example at a time to calculate the gradient and update parameters. Much faster per iteration but the path to the minimum is noisier (it zig-zags more).
    *   **Mini-Batch Gradient Descent:** A compromise. Uses a small batch (e.g., 32, 64, 128) of training examples in each step. Offers a good balance between the stability of Batch GD and the speed of SGD.
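
The three variants differ only in how many examples feed each gradient estimate, which the following sketch makes explicit through a `batch_size` argument. It is a minimal illustration under stated assumptions: the function name `minibatch_gd`, the fixed seed, the learning rate of 0.05, and the noise-free toy data are all hypothetical choices.

```python
import random

def minibatch_gd(x, y, lr=0.05, batch_size=2, epochs=2000, seed=0):
    """Mini-batch gradient descent for y_hat = b0 + b1*x.

    batch_size=len(x) gives batch GD; batch_size=1 gives SGD.
    """
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    indices = list(range(len(x)))
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(indices)           # visit the examples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(b0 + b1 * x[i]) - y[i] for i in batch]
            grad_b0 = sum(errors) / len(batch)
            grad_b1 = sum(e * x[i] for e, i in zip(errors, batch)) / len(batch)
            b0, b1 = b0 - lr * grad_b0, b1 - lr * grad_b1
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0]   # hypothetical, noise-free data: y = 1 + 2x
y = [1.0, 3.0, 5.0, 7.0]
for size in (1, 2, 4):
    b0, b1 = minibatch_gd(x, y, batch_size=size)
    print(f"batch_size={size}: b0={b0:.4f}, b1={b1:.4f}")
```

Because this toy data has no noise, all three settings reach the same line. On noisy data with a constant learning rate, the SGD and mini-batch estimates would instead keep fluctuating around the minimum, which is why the learning rate is often reduced over time in practice.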
