# 📜 Regression Losses in AI/ML/DL

---

## 1) Core point-estimate losses (predict a single number)

| Loss                           | Formula (ŷ = prediction)                            | When to use                               | Pros                            | Cons                               |
| ------------------------------ | --------------------------------------------------- | ----------------------------------------- | ------------------------------- | ---------------------------------- |
| **MSE / L2**                   | $$\frac{1}{n}\sum (y-\hat{y})^2$$                   | Gaussian noise, penalize large errors     | Convex, smooth, easy            | Outlier-sensitive, scale-dependent |
| **RMSE**                       | $$\sqrt{\text{MSE}}$$                               | Same as MSE but interpretable units       | Easy to interpret               | Same issues as MSE                 |
| **MAE / L1**                   | $$\frac{1}{n}\sum \lvert y-\hat{y}\rvert$$          | Laplace noise, outliers present           | Robust to outliers              | Non-smooth at 0, slower to optimize |
| **Huber**                      | $$\phi_\delta(r)=\begin{cases}\tfrac12 r^2,& \lvert r\rvert\le\delta \\ \delta(\lvert r\rvert-\tfrac12\delta),& \text{else}\end{cases}$$ | Mix of L2 (small r) & L1 (large r) | Robust + smooth | Need to choose $$\delta$$ |
| **Pseudo-Huber / Charbonnier** | $$\delta^2(\sqrt{1+(r/\delta)^2}-1)\quad \text{or}\quad \sqrt{r^2+\epsilon^2}$$ | Smooth robust alternative                 | Differentiable everywhere       | Hyperparam sensitivity             |
| **Log-Cosh**                   | $$\sum \log \cosh(r)$$                              | Gentle robust loss                        | Smooth, easy                    | Slightly costlier than MSE         |
| **RMSLE**                      | $$\sqrt{\tfrac1n\sum(\log(1+y)-\log(1+\hat{y}))^2}$$| Non-negative targets, relative errors matter | Measures relative error; penalizes under-prediction more than over-prediction | Undefined if $$y\le -1$$ or $$\hat{y}\le -1$$ |

> Residual: $$r = y - \hat{y}$$.  
> Scaling targets often improves convergence.
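
The robust losses in the table can be sketched in a few lines of plain Python. This is a minimal, dependency-free illustration rather than a reference implementation: the function names and the `mean_loss` helper are chosen here for the example, and `delta=1.0` is just a placeholder default.

```python
import math

def huber(r, delta=1.0):
    """Huber loss for one residual r = y - y_hat."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r            # quadratic zone (L2-like)
    return delta * (a - 0.5 * delta)  # linear zone (L1-like)

def pseudo_huber(r, delta=1.0):
    """Smooth Huber approximation: delta^2 * (sqrt(1 + (r/delta)^2) - 1)."""
    return delta * delta * (math.sqrt(1.0 + (r / delta) ** 2) - 1.0)

def log_cosh(r):
    """log(cosh(r)), rewritten as |r| + log1p(exp(-2|r|)) - log(2)
    so that large residuals do not overflow cosh()."""
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def mean_loss(loss_fn, y_true, y_pred, **kwargs):
    """Average a per-residual loss over paired targets and predictions."""
    residuals = [y - p for y, p in zip(y_true, y_pred)]
    return sum(loss_fn(r, **kwargs) for r in residuals) / len(residuals)

# One clean point and one outlier: Huber grows linearly on the outlier.
print(mean_loss(huber, [1.0, 2.0], [1.0, 5.0], delta=1.0))
```

Both zones of the Huber loss meet at `|r| = delta` with the same value and slope, which is why it can be optimized with ordinary gradient methods.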

---

## 2) Asymmetric losses (quantiles, tails, risk)

| Loss                   | Formula                                            | Use case                                     | Notes                                    |
| ---------------------- | -------------------------------------------------- | -------------------------------------------- | ---------------------------------------- |
| **Quantile / Pinball** | $$L_\tau(r)=\max(\tau r,( \tau-1) r)$$             | Predict $$\tau$$-quantiles (e.g., P90 latency) | Robust, handles asymmetry                |
| **Expectile**          | $$L_\tau(r) = \lvert\tau - \mathbf{1}_{r<0}\rvert\, r^2$$ | Risk-sensitive regression                     | Asymmetric squared-error analogue of quantile loss |
| **Tilted-Huber**       | Huber on sign-weighted residual                    | Smooth quantile loss                          | More stable than raw quantile loss       |
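
A small sketch of the two asymmetric losses, assuming the residual convention $$r = y - \hat{y}$$. The expectile weight is taken as the absolute value $$\lvert\tau - \mathbf{1}_{r<0}\rvert$$ so the loss stays non-negative. The `best_constant` helper is hypothetical and only illustrates that the constant minimizing the pinball loss is an empirical quantile.

```python
def pinball(y, y_hat, tau):
    """Pinball (quantile) loss for one sample, with r = y - y_hat."""
    r = y - y_hat
    return max(tau * r, (tau - 1.0) * r)

def expectile(y, y_hat, tau):
    """Asymmetric squared loss: |tau - 1[r < 0]| * r^2."""
    r = y - y_hat
    weight = (1.0 - tau) if r < 0 else tau
    return weight * r * r

def best_constant(loss_fn, ys, tau, candidates):
    """Return the candidate constant prediction with the lowest total loss."""
    return min(candidates, key=lambda c: sum(loss_fn(y, c, tau) for y in ys))

ys = [1.0, 2.0, 3.0, 4.0, 100.0]
print(best_constant(pinball, ys, tau=0.5, candidates=ys))  # 3.0 (the median)
print(best_constant(pinball, ys, tau=0.7, candidates=ys))  # 4.0
```

Note that the outlier at 100 does not pull the median estimate, which is the robustness the table refers to.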

---

## 3) Robust M-estimators (heavy tails, outliers)

| Loss                 | Formula (ρ(r))                                            | When / Notes                       |
| -------------------- | --------------------------------------------------------- | ---------------------------------- |
| **Tukey (bisquare)** | $$\tfrac{c^2}{6}\left[1-(1-(r/c)^2)^3\right]\ \text{if } \lvert r\rvert\le c;\ \tfrac{c^2}{6}\ \text{otherwise}$$ | Strong outlier rejection            |
| **Cauchy**           | $$\log(1+(r/c)^2)$$                                       | Heavy-tailed noise                 |
| **Geman–McClure**    | $$\tfrac{r^2}{r^2+c^2}$$                                  | Vision, feature matching           |
| **Welsch / Leclerc** | $$1-\exp(-(r/c)^2)$$                                      | Smooth robust alternative          |
| **Fair**             | $$c^2\Big(\tfrac{\lvert r\rvert}{c}-\log(1+\lvert r\rvert/c)\Big)$$ | Gentle robust loss                 |

> Scale $$c$$ is often set from the MAD (median absolute deviation) of the residuals.
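
As a rough sketch of how the scale might be set from the MAD: the factor 1.4826 makes the MAD consistent with the standard deviation under Gaussian noise, and 4.685 is a commonly used bisquare tuning constant. The residual values below are made up for illustration, and the function names are chosen for this example.

```python
import math
import statistics

def mad_scale(residuals):
    """Robust scale estimate: 1.4826 * median(|r - median(r)|)."""
    med = statistics.median(residuals)
    return 1.4826 * statistics.median([abs(r - med) for r in residuals])

def tukey_bisquare(r, c):
    """Tukey bisquare rho: saturates at c^2/6 once |r| exceeds c."""
    if abs(r) > c:
        return c * c / 6.0
    u = 1.0 - (r / c) ** 2
    return (c * c / 6.0) * (1.0 - u ** 3)

def cauchy(r, c):
    """Cauchy rho: log(1 + (r/c)^2), grows only logarithmically."""
    return math.log1p((r / c) ** 2)

residuals = [-0.3, 0.1, 0.2, -0.1, 0.0, 8.0]  # illustrative data, one outlier
sigma = mad_scale(residuals)
c = 4.685 * sigma  # assumed tuning constant; adjust per application

# The outlier contributes only the saturated constant c^2/6.
print(tukey_bisquare(8.0, c) == c * c / 6.0)  # True
```

Because the bisquare loss is flat beyond `c`, residuals larger than `c` produce zero gradient, so a poor initial fit can leave good points permanently ignored.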

---

## 4) Distributional / probabilistic regression

Instead of point ŷ, predict parameters of $$p(y|x;\theta)$$ and minimize **negative log-likelihood (NLL)**.

| Family                        | NLL (per sample)                                       | Predict              | When / Notes                        |
| ----------------------------- | ------------------------------------------------------ | -------------------- | ----------------------------------- |
| **Gaussian (homosced.)**      | $$\tfrac{(y-\mu)^2}{2\sigma^2}+\tfrac12\log 2\pi\sigma^2$$ (σ fixed) | μ | Classic; reduces to MSE if σ const |
| **Gaussian (heterosced.)**    | $$\tfrac{(y-\mu)^2}{2\sigma^2}+\tfrac12\log\sigma^2$$  | μ, log σ²           | Learn aleatoric uncertainty          |
| **Laplace**                   | $$\tfrac{\lvert y-\mu\rvert}{b} + \log(2b)$$           | μ, log b             | Robust (L1-like)                     |
| **Student-t**                 | $$\tfrac{\nu+1}{2}\log\!\Big(1+\tfrac{(y-\mu)^2}{\nu s^2}\Big)+\log s + C(\nu)$$ | μ, log s, ν | Heavy-tailed noise; $$C(\nu)$$ is constant only if ν is fixed |
| **Mixture Density Net (MDN)** | $$-\log\sum_k \pi_k \,\mathcal{N}(y \mid \mu_k,\sigma_k^2)$$ | mixture params | Multi-modal targets                  |
| **Poisson**                   | $$\lambda - y\log\lambda$$                             | λ>0                  | Count data                           |
| **Neg. Binomial**             | $$-\log \text{NB}(y \mid r,p)$$                        | mean, dispersion     | Over-dispersed counts                |
| **Gamma**                     | $$-\log \text{Gamma}(y \mid k,\theta)$$                | k, θ                 | Positive skewed targets              |
| **Tweedie**                   | Tweedie deviance                                       | mean (power p usually fixed, 1<p<2) | Non-negative targets with many exact zeros |

> For vector targets $$y\in \mathbb{R}^d$$, use multivariate Gaussian:  
> $$\tfrac12[(y-\mu)^T\Sigma^{-1}(y-\mu)+\log|\Sigma|]$$.
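
A minimal sketch of the heteroscedastic Gaussian NLL for a scalar target, assuming the model outputs $$\log\sigma^2$$ rather than $$\sigma^2$$ so the variance is always positive. The clamp bounds (`-10`, `10`) are illustrative placeholders, not recommended values.

```python
import math

def gaussian_nll(y, mu, log_var, min_log_var=-10.0, max_log_var=10.0):
    """Per-sample Gaussian NLL without the constant 0.5*log(2*pi).

    log_var is the predicted log(sigma^2); it is clamped so that
    exp(-log_var) cannot overflow or collapse to zero.
    """
    log_var = min(max(log_var, min_log_var), max_log_var)
    return 0.5 * ((y - mu) ** 2 * math.exp(-log_var) + log_var)

# With sigma^2 = 1 the loss is half the squared error.
print(gaussian_nll(2.0, 0.0, 0.0))  # 2.0

# For a fixed residual r, the loss is smallest when sigma^2 = r^2.
r = 2.0
best = gaussian_nll(r, 0.0, math.log(r * r))
print(best < gaussian_nll(r, 0.0, 0.0), best < gaussian_nll(r, 0.0, 3.0))
```

The second check shows why this loss learns uncertainty: over-confident and under-confident variances are both penalized relative to the variance that matches the actual squared error.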

---

## 5) Forecasting & business losses

| Loss            | Formula                                       | Caveats                      | Use cases                         |
| --------------- | --------------------------------------------- | ---------------------------- | --------------------------------- |
| **MAPE**        | $$\frac1n\sum \left\lvert\frac{y-\hat{y}}{y}\right\rvert$$ | Division by 0, bias to small y | Demand forecasting, KPIs          |
| **sMAPE**       | $$\frac{2}{n}\sum \frac{\lvert y-\hat{y}\rvert}{\lvert y\rvert+\lvert\hat{y}\rvert}$$ | Non-convex, tricky optimization | Seasonality, business forecasting |
| **MASE/RMSSE**  | Error normalized by naive/seasonal baseline    | Needs baseline series        | Forecast competitions (M4, M5)    |
| **CRPS**        | $$\int (F(t)-\mathbf{1}\{t\ge y\})^2 \,dt$$ (F = forecast CDF) | Needs forecast distribution  | Probabilistic forecasts           |
| **Energy Score**| Distance between forecast distribution & obs. | Sampling required            | Multivariate forecasting          |
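
Two of these metrics are easy to sketch directly. The CRPS estimate below uses the sample-based identity $$\text{CRPS} = \mathbb{E}\lvert X-y\rvert - \tfrac12\mathbb{E}\lvert X-X'\rvert$$, where $$X, X'$$ are independent draws from the forecast and $$y$$ is the observation. The MASE function scales by the in-sample error of a seasonal-naive forecast. Function names and the `season` parameter are choices made for this example.

```python
def crps_from_samples(samples, y):
    """Estimate CRPS from forecast samples: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2

def mase(y_true, y_pred, y_train, season=1):
    """Forecast MAE divided by the in-sample MAE of the seasonal-naive forecast."""
    naive_errors = [abs(y_train[t] - y_train[t - season])
                    for t in range(season, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)
    return mae / scale

# A single-sample "distribution" reduces CRPS to the absolute error.
print(crps_from_samples([2.0], 3.0))        # 1.0
print(crps_from_samples([0.0, 2.0], 1.0))   # 0.5
print(mase([5.0, 6.0], [5.0, 8.0], [1.0, 2.0, 3.0, 4.0]))  # 1.0
```

A MASE of 1.0 means the forecast is exactly as accurate as the naive baseline was in-sample; values below 1.0 beat it.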

---

## 6) Geometry & domain-specific regression losses

| Domain               | Loss formula                                               | Notes                          |
| -------------------- | ---------------------------------------------------------- | ------------------------------ |
| **Directions**       | $$1-\frac{y\cdot \hat{y}}{\lVert y\rVert\,\lVert\hat{y}\rVert}$$ | Orientation regression         |
| **Rotations (SO(3))**| Geodesic angle (e.g., quaternion log-map)                  | Pose estimation                |
| **Geographic**       | Haversine distance                                         | Lat/Lon regression             |
| **Images / Signals** | SSIM/MS-SSIM, perceptual (VGG), total variation            | Better perceptual fidelity     |
| **Speech/Audio**     | Log-mel MSE, spectral convergence                          | Perceptual audio quality       |
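
Two of the geometric losses can be sketched without any libraries. The Earth radius constant and the `eps` guard are assumptions of this example, and inputs to `haversine_km` are taken to be in degrees.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding pushing a slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def cosine_distance(u, v, eps=1e-12):
    """1 - cos(angle between u and v); eps avoids division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0 (same direction)
```

The cosine distance ignores magnitude, so a model trained with it alone is free to output vectors of any length.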

---

## 7) Multi-task & masking

- **Masked losses:**  
  $$\frac{\sum m_i \ell_i}{\sum m_i}$$ where $$m_i$$ masks missing labels.  
- **Task balancing (uncertainty weighting):**  
  $$\sum_t \left(\tfrac{1}{2\sigma_t^2}\mathcal{L}_t + \tfrac12 \log \sigma_t^2\right)$$, with one learned $$\sigma_t$$ per task.  
- **Dynamic reweighting:** GradNorm, or learned uncertainty weights (homoscedastic = one weight per task, heteroscedastic = one per sample).  
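
The masked mean and the uncertainty-weighted sum can be sketched as follows. This is an illustration under stated assumptions: each task's weight is parameterized as $$s_t = \log\sigma_t^2$$ (a common choice for numerical stability), and masked-out entries are skipped entirely so that a `NaN` placeholder label cannot leak into the sum.

```python
import math

def masked_mean(losses, mask):
    """sum(m_i * l_i) / sum(m_i), skipping entries where m_i == 0."""
    total = sum(m * l for m, l in zip(mask, losses) if m)
    count = sum(mask)
    return total / count if count > 0 else 0.0

def uncertainty_weighted(task_losses, log_vars):
    """sum_t [ 0.5 * exp(-s_t) * L_t + 0.5 * s_t ], with s_t = log(sigma_t^2)."""
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(task_losses, log_vars))

# The NaN stands in for a missing label and is ignored by the mask.
print(masked_mean([1.0, float("nan"), 3.0], [1, 0, 1]))  # 2.0
print(uncertainty_weighted([2.0, 4.0], [0.0, 0.0]))      # 3.0
```

The `0.5 * s_t` term acts as a regularizer: without it, the optimizer could drive every $$\sigma_t$$ to infinity and zero out all task losses.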

---

## 8) Practical guidance

- Gaussian noise → **MSE/RMSE**.  
- Outliers / heavy tails → **MAE, Huber, Student-t NLL**.  
- Asymmetric costs (P90, P95) → **Quantile (Pinball) Loss**.  
- Count data → **Poisson / NegBin**.  
- Positive skewed data → **Gamma / RMSLE**.  
- Multi-modal targets → **MDN**.  
- Need uncertainty → **Heteroscedastic Gaussian NLL, CRPS**.  
- Perceptual tasks (CV, audio) → **L1 + SSIM/Perceptual**.  
- Forecasting → **MASE, RMSSE, Quantile Loss**.  

---

## 9) Optimization & stability tips

- **Scale targets** (standardize/log) → smoother optimization.  
- **Clamp predicted variances** to avoid NaNs.  
- **Warm-up strategy:** start with MAE (bounded gradients while errors are still large), then switch to MSE.  
- **Handle imbalance:** reweight samples.  
- **Evaluation ≠ training:** the training loss need not be the reported metric, e.g., train with MSE, report MAE/quantile loss/RMSLE.  
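
The target-scaling tip can be sketched as a pair of invertible transforms. The helper names are hypothetical. The key point is that predictions made in the transformed space must be mapped back before metrics are reported in original units.

```python
import math

def fit_standardizer(ys):
    """Return (mean, std) of the training targets; std falls back to 1.0
    when the targets are constant."""
    mean = sum(ys) / len(ys)
    var = sum((y - mean) ** 2 for y in ys) / len(ys)
    std = math.sqrt(var) or 1.0
    return mean, std

def standardize(ys, mean, std):
    return [(y - mean) / std for y in ys]

def unstandardize(zs, mean, std):
    return [z * std + mean for z in zs]

def log_transform(ys):
    """log(1 + y): compresses skewed targets; requires y > -1."""
    return [math.log1p(y) for y in ys]

def inverse_log_transform(zs):
    return [math.expm1(z) for z in zs]

train_targets = [1.0, 2.0, 3.0]
mean, std = fit_standardizer(train_targets)
z = standardize(train_targets, mean, std)
print(unstandardize(z, mean, std))  # back to the original scale
```

Fit the statistics on training targets only and reuse them for validation and test data, otherwise information leaks across the split.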

---

## 🔑 Quick reference formulas

- MSE: $$\tfrac1n\sum (y-\hat{y})^2$$  
- MAE: $$\tfrac1n\sum |y-\hat{y}|$$  
- Huber: piecewise quadratic/linear at threshold δ  
- Quantile: $$L_\tau = \max(\tau(y-\hat{y}), (\tau-1)(y-\hat{y}))$$  
- Gaussian NLL: $$\tfrac{(y-\mu)^2}{2\sigma^2}+\tfrac12\log\sigma^2$$  
- Poisson NLL: $$\lambda - y\log\lambda$$  
- CRPS: $$\mathbb{E}|X-y| - \tfrac12\mathbb{E}|X-X'|$$, where $$X, X'$$ are independent draws from the forecast and $$y$$ is the observation  

---


# 📌 Key Points on Regression Losses in AI/ML/DL

---

## 🔹 Core Losses
- **MSE / RMSE** → best for Gaussian noise, penalizes large errors heavily.  
- **MAE** → robust to outliers, but converges more slowly (constant-magnitude gradients, non-smooth at zero).  
- **Huber / Charbonnier / Log-Cosh** → hybrids that balance robustness with smooth optimization.  

---

## 🔹 Asymmetric & Risk-Sensitive
- **Quantile Loss (Pinball)** → predict quantiles (e.g., P90, P95) → risk-aware forecasting.  
- **Expectile Loss** → squared variant of quantiles → emphasizes tail risks.  

---

## 🔹 Robust M-Estimators
- **Tukey, Cauchy, Welsch, Fair** → down-weight extreme outliers.  
- Useful in **vision, sensor data, and noisy environments**.  

---

## 🔹 Probabilistic Losses
- **Gaussian NLL** → predict mean $$\mu$$ and variance $$\sigma^2$$.  
- **Laplace / Student-t NLL** → handle heavier tails.  
- **Mixture Density Nets (MDN)** → capture **multi-modal targets**.  
- **Poisson / Neg. Binomial / Gamma / Tweedie** → for **count or skewed positive data**.  

---

## 🔹 Domain-Specific & Business Metrics
- **MAPE / sMAPE** → business KPIs (demand, finance), but unstable if $$y\approx 0$$.  
- **MASE / RMSSE / CRPS / Energy Score** → forecasting & probabilistic benchmarks.  
- **Geodesic / Cosine / Perceptual Losses** → geometry (pose, angles), images, and speech/audio quality.  

---

## 🔹 Practical Training Tips
- **Scale targets** (normalize, log-transform skewed data).  
- **Clamp predicted variances** when learning uncertainty to avoid NaNs.  
- **Warm-up strategy:** start with MAE, then switch to MSE for stability.  
- **Combine pixel-level + perceptual losses** for images/audio.  
- **Match training loss to evaluation metric** → e.g., use quantile loss if business cares about P95.  

---


# 📊 Comparative Table of Regression Loss Functions in AI/ML/DL

---

### 🔹 Core Losses

| Loss Function                  | Formula (simplified)                           | Pros                                   | Cons                     | When to Use                                          |
| ------------------------------ | ---------------------------------------------- | -------------------------------------- | ------------------------ | ---------------------------------------------------- |
| **MSE (L2)**                   | $$\frac{1}{n}\sum (y-\hat{y})^2$$              | Smooth, convex, penalizes large errors | Sensitive to outliers    | Standard regression, Gaussian noise, stable datasets |
| **RMSE**                       | $$\sqrt{\text{MSE}}$$                          | Interpretable in original units        | Same issues as MSE       | Reporting/benchmarking errors                        |
| **MAE (L1)**                   | $$\frac{1}{n}\sum \lvert y-\hat{y}\rvert$$     | Robust to outliers                     | Non-smooth, slower conv. | Skewed/noisy data, outlier presence                  |
| **Huber**                      | Quadratic near 0, linear otherwise             | Mix of MSE & MAE                       | Needs $$\delta$$ tuning  | Balanced case: some outliers but not dominant        |
| **Pseudo-Huber / Charbonnier** | $$\sum \delta^2\left(\sqrt{1+(r/\delta)^2}-1\right)$$ | Differentiable everywhere              | Hyperparameter tuning    | Vision, robotics, continuous control                 |
| **Log-Cosh**                   | $$\sum \log(\cosh(y-\hat{y}))$$                | Smooth, robust                         | Slightly slower than MSE | Robust regression with smooth gradients              |
| **RMSLE**                      | $$\sqrt{\tfrac{1}{n}\sum(\log(1+y)-\log(1+\hat{y}))^2}$$ | Handles exponential growth             | Undefined if $$y\le -1$$ | Finance, sales, demand forecasting                   |

---

### 🔹 Asymmetric / Risk-Sensitive

| Loss                        | Pros                          | Cons                                | When to Use                                |
| --------------------------- | ----------------------------- | ----------------------------------- | ------------------------------------------ |
| **Quantile Loss (Pinball)** | Captures quantiles (P90, P95) | Needs one output (or model) per quantile to form intervals | Forecasting risk, service-level guarantees |
| **Expectile Loss**          | Smooth quantile alternative   | Less common                         | Risk-sensitive finance, insurance          |

---

### 🔹 Robust M-Estimators

| Loss                       | Pros                       | Cons                  | When to Use                      |
| -------------------------- | -------------------------- | --------------------- | -------------------------------- |
| **Tukey’s Biweight**       | Ignores extreme outliers   | Non-convex            | Computer vision, sensor data     |
| **Cauchy / Geman–McClure** | Heavy-tail robustness      | Can underfit extremes | Vision, medical data             |
| **Welsch / Fair**          | Smooth robust alternatives | Task-specific         | Image matching, depth regression |

---

### 🔹 Probabilistic Losses

| Loss                               | Pros                          | Cons                  | When to Use                         |
| ---------------------------------- | ----------------------------- | --------------------- | ----------------------------------- |
| **Gaussian NLL**                   | Models uncertainty (variance) | Sensitive to σ errors | General regression with uncertainty |
| **Laplace NLL**                    | Robust to heavy tails         | Less smooth           | NLP & noisy targets                 |
| **Student-t NLL**                  | Handles very heavy tails      | Adds ν parameter      | Finance, extreme-value domains      |
| **Mixture Density Networks (MDN)** | Captures multimodality        | Expensive, unstable   | Multi-modal outputs (pose, speech)  |
| **Poisson NLL**                    | Natural for counts            | Non-negative integer targets only; assumes variance = mean | Event counts, word frequencies      |
| **Negative Binomial**              | Models over-dispersion        | More parameters       | Epidemiology, finance               |
| **Gamma / Tweedie**                | For positive skewed data      | Domain-specific       | Energy, insurance, healthcare       |

---

### 🔹 Forecasting & Business

| Loss             | Pros                               | Cons               | When to Use               |
| ---------------- | ---------------------------------- | ------------------ | ------------------------- |
| **MAPE**         | Scale-free, intuitive              | Division by zero   | Demand, KPIs              |
| **sMAPE**        | Better for seasonal series         | Non-convex         | Business forecasting      |
| **MASE / RMSSE** | Relative to naive baseline         | Needs baseline     | Competitions (M4/M5)      |
| **CRPS**         | Probabilistic, proper scoring rule | Needs distribution | Probabilistic forecasting |
| **Energy Score** | Multivariate version of CRPS       | Expensive          | Multivariate forecasts    |

---

### 🔹 Domain-Specific

| Loss                       | Pros                      | Cons                 | When to Use                                     |
| -------------------------- | ------------------------- | -------------------- | ----------------------------------------------- |
| **Cosine Distance**        | Angle similarity          | No magnitude info    | Directional regression (NLP embeddings, vision) |
| **Geodesic Loss**          | Proper for rotations      | Domain-specific math | Robotics, pose estimation                       |
| **Haversine Loss**         | Handles Earth curvature   | Only for geodata     | Geospatial prediction                           |
| **SSIM / Perceptual Loss** | Better perceptual quality | Non-convex           | Images, audio, GANs                             |
| **Spectral Losses**        | Capture frequency         | Not universal        | Speech/audio regression                         |

---

✅ **Rule of Thumb**  
- **MSE/MAE/Huber** → general regression.  
- **Quantile/Expectile** → risk or service-level forecasting.  
- **Probabilistic NLL** → uncertainty & distributions.  
- **Domain-specific losses** → when geometry, perception, or human-judged quality matters.  
