# Chapter 3: Linearity
Linearity is a foundational concept in machine learning. It characterizes models whose output is a weighted sum of the input features, optionally passed through a fixed function such as the sigmoid in Logistic Regression. Linear models are simple and highly interpretable, but they make strong assumptions about the underlying data structure, and their performance is often limited when the data contain complex, non-linear relationships. The principle of linearity nevertheless remains influential as a building block within non-linear architectures such as neural networks and kernel methods. This chapter covers the fundamental concepts by building end-to-end projects that follow the machine learning life cycle.

## Table of Contents
- [Gradient Descent](#gradient-descent)
- Stochastic Gradient Descent
- Linear Regression
- Multiple Linear Regression
- Linear Classification (Logistic Regression)
- [References](#references)

## Gradient Descent
**Optimization** forms the computational core of machine learning, framing model training as the search for parameters that **minimize a loss function**. Gradient descent is the foundational **iterative algorithm** that solves this problem. By calculating the gradient of the loss, it navigates the parameter space, taking steps proportional to the negative gradient to converge toward a **local minimum**. This simple yet powerful principle of following the path of **steepest descent** enables the learning process in models ranging from **simple linear regressions** to the most **complex deep neural networks**.

### Problem Statement
The general optimization problem seeks the parameter vector $w^*$ that minimizes the objective:
$$ w^* = \underset{w \in \mathbb{R}^{n+1}}{\arg\min} \, L(w)$$

Where:
- $w = (w_0, w_1, \dots, w_n)$ is the **decision variable** vector.
- $L$ is the **objective/cost function** (or loss function in machine learning problems).
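
To make the $\arg\min$ notation concrete, the sketch below evaluates a hypothetical one-dimensional objective on a grid of candidate values and picks the candidate with the smallest loss. The function `loss`, the grid range, and the spacing are all assumptions chosen for illustration. Brute-force search like this does not scale to high-dimensional models, which is why iterative methods are used instead.

```python
import numpy as np

def loss(w):
    """Hypothetical 1-D objective whose minimum is at w = 3."""
    return (w - 3.0) ** 2 + 1.0

# Evaluate the loss on a grid of candidate values (spacing of 0.01).
candidates = np.linspace(-10.0, 10.0, 2001)
losses = loss(candidates)

# arg min: the candidate with the smallest loss, not the smallest loss itself.
w_star = candidates[np.argmin(losses)]
print(w_star, loss(w_star))
```

Here `w_star` plays the role of $w^*$ (the minimizer), while `loss(w_star)` is the minimum value $L(w^*)$.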

### Summary of Common Loss Functions
The following table summarizes common loss functions and their applications in machine learning.
| Loss Function | Formula | Type | Use Cases | Key Properties |
| :-: | :-: | :-: | :-: | :-: |
| **L1 Loss (MAE)** | `(1/n)·∑\|y - ŷ\|` | Regression | Robust regression, outliers present | Robust to outliers, non-differentiable where the residual is 0 |
| **L2 Loss (MSE)** | `(1/n)·∑(y - ŷ)²` | Regression | Standard regression, smooth outputs | Sensitive to outliers, differentiable everywhere |
| **RMSE** | `√(∑(y - ŷ)²/n)` | Regression | Regression (interpretable units) | Same units as target, sensitive to outliers |
| **Binary Cross-Entropy** | `-[y·log(ŷ) + (1-y)·log(1-ŷ)]` | Classification | Binary classification | Probabilistic, penalizes wrong confidence |
| **Categorical Cross-Entropy** | `-∑ y·log(ŷ)` | Classification | Multi-class classification | Multi-class generalization, softmax output |

### Key Notes:
- $\hat{y}$ is the model's predicted output.
- $y$ is the target value (the ground truth from the dataset).
- $n$ is the number of samples.
- MSE = Mean Squared Error, MAE = Mean Absolute Error, RMSE = Root Mean Squared Error.
- L1/L2 can refer to either loss functions or regularization terms.
- **Cross-entropy** losses work with probability outputs (sigmoid/softmax)
- Choice depends on problem type, outlier sensitivity, and optimization needs
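
The losses in the table can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation. The function names, the `eps` clipping constant, and the choice to average over samples are assumptions made here.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error (L1 loss), averaged over samples."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """Mean squared error (L2 loss), averaged over samples."""
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    """Root mean squared error, in the same units as the target."""
    return np.sqrt(mse(y, y_hat))

def binary_cross_entropy(y, p, eps=1e-12):
    """y holds 0/1 labels, p holds predicted probabilities of class 1."""
    p = np.clip(p, eps, 1.0 - eps)  # assumed guard against log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def categorical_cross_entropy(Y, P, eps=1e-12):
    """Y is one-hot (n_samples, n_classes), P holds class probabilities."""
    P = np.clip(P, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

# Small made-up example: one prediction is off by 2.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

In this example the single large residual contributes 2 to the absolute-error sum but 4 to the squared-error sum, which illustrates why MSE is more sensitive to outliers than MAE.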

### Formulation
Gradient descent is an iterative algorithm in which the key mathematical tool is the **first-order Taylor approximation**.
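
A brief sketch of why the Taylor approximation matters, assuming $L$ is differentiable and the step size $\eta > 0$ is small. Expanding $L$ to first order around the current iterate $w^t$ for a small step $d$ gives:

$$L(w^t + d) \approx L(w^t) + \nabla L(w^t)^\top d$$

Choosing the step against the gradient, $d = -\eta \nabla L(w^t)$, yields:

$$L\big(w^t - \eta \nabla L(w^t)\big) \approx L(w^t) - \eta \, \|\nabla L(w^t)\|^2 \le L(w^t)$$

So, to first order, a small step in the direction of the negative gradient does not increase the loss, and it strictly decreases it whenever the gradient is non-zero.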

$$w^{t + 1} = w^t - \eta \nabla L(w^t)$$
$$w^0 = w^{\text{initial}}$$

Where:
- $\eta > 0$ is the **learning rate** (step size).
- $\nabla$ is the **gradient operator** (nabla), so $\nabla L(w^t)$ is the vector of partial derivatives of $L$ evaluated at $w^t$.
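
The update rule can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function `gradient_descent`, its parameter names, the fixed number of steps, and the quadratic example loss $L(w) = \|w - c\|^2$ are all hypothetical choices. A practical implementation would usually add a stopping criterion.

```python
import numpy as np

def gradient_descent(grad, w_init, lr=0.1, n_steps=100):
    """Repeat w <- w - lr * grad(w) for a fixed number of steps."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(n_steps):
        w = w - lr * grad(w)
    return w

# Hypothetical loss L(w) = ||w - c||^2, minimized at w = c.
c = np.array([3.0, -1.0])

def grad_L(w):
    """Gradient of the example loss: 2 * (w - c)."""
    return 2.0 * (w - c)

w_star = gradient_descent(grad_L, w_init=[0.0, 0.0], lr=0.1, n_steps=100)
print(w_star)
```

For this particular loss, each step shrinks the distance to `c` by a factor of $1 - 2\eta$, so with $\eta = 0.1$ the iterates approach `c` geometrically. A learning rate above $1$ would make the same iteration diverge, which is one reason the choice of $\eta$ matters.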

## References
[1] "User Guide", Scikit-Learn Documentation, https://scikit-learn.org/stable/user_guide.html, accessed October 2025.

[2] Grus, Joel, "Data Science from Scratch", O'Reilly, 2015.