# **Machine Learning Basics**





## **Introduction to Machine Learning**
- **Deep learning** is a subset of machine learning.
- To understand deep learning, mastering the basic principles of machine learning is essential.
- This chapter provides an overview of fundamental concepts, with references to texts such as Murphy (2012) and Bishop (2006).

### What is **"learning"** (E, T, P) and types of **tasks**

***A machine "learns"*** from *experience E* with respect to a *set of tasks T* and a *performance measure P* if its performance on T, as measured by P, improves with E (Mitchell's classic definition).

## **The Task** $T$
ML tasks include:

- **Classification**: Assign a label to an input.
- **Regression**: Predict a numerical value.
- **Transcription**: Convert relatively unstructured data (e.g., an image of text or an audio recording) into text.
- **Machine translation**: Convert sequences between languages.
- **Structured output**: Produce data with complex relationships (e.g., parse trees).
- **Anomaly detection**: Identify unusual events.
- **Synthesis and sampling**: Generate new examples similar to the training data.
- **Missing value imputation**: Predict absent values.
- **Denoising**: Reconstruct clean data from corrupted versions.
- **Density estimation**: Learn the probability distribution of the data.

## **Performance Measures** $P$
Examples: *accuracy*, *error rate*, *mean log-likelihood*.
Performance is evaluated on a *test set* separate from the *training data*.

## **Experience** $E$
- **Supervised learning**: Labeled examples $(x, y)$.
- **Unsupervised learning**: Data only $x$, no labels.
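- **Semi-supervised learning** mixes labeled and unlabeled examples; **reinforcement learning** learns from interaction with an environment rather than from a fixed dataset.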


## **Model capacity, overfitting/underfitting, and bias-variance trade-off**

- **Capacity**: How well a model can represent complex functions; **too low capacity → underfitting**, **too high capacity → overfitting**.
- **Bias-Variance trade-off**: As capacity increases, bias tends to decrease, variance tends to increase; generalization error often shows a U-shaped curve, with "optimal" capacity in between.
- **Consistency**: An estimator is consistent if it converges (in probability) to the true parameter value as the number of training examples grows.
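The effect of capacity can be illustrated with polynomial regression, where the degree plays the role of capacity. This is a minimal sketch on synthetic data; the sine target, noise level, sample sizes, and the helper name `fit_poly_errors` are assumptions made for illustration.

```python
import numpy as np

def fit_poly_errors(degree, x_train, y_train, x_test, y_test):
    """Least-squares polynomial fit; returns (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

def true_fn(x):
    # Assumed ground-truth function for the synthetic data
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 15)
y_train = true_fn(x_train) + rng.normal(0, 0.2, 15)
x_test = rng.uniform(0, 1, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.2, 200)

# Low, moderate, and high capacity
for degree in (1, 3, 9):
    train_mse, test_mse = fit_poly_errors(degree, x_train, y_train, x_test, y_test)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Training error can only go down as the degree grows, because each polynomial family contains the previous one. Test error typically falls and then rises again, which is the U-shaped curve described above.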

## **Validation Set and Hyperparameters (Key Idea)**

Many algorithms have **hyperparameters** (e.g., regularization strength, learning rate) that are **not learned by the training procedure itself and must be chosen using data held out from the training set** (a validation set or cross-validation). If capacity-controlling hyperparameters were tuned on the training data, they would always drift toward the highest capacity and overfit. Keeping the test set out of this selection also prevents leakage, so the final performance estimate stays unbiased.
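The selection procedure can be sketched with ridge regression, where the regularization strength is the hyperparameter. This is an illustrative sketch, not a prescribed method: the synthetic data, the candidate grid, and the helper names `ridge_fit`, `mse`, and `select_lambda` are all assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

def select_lambda(X_train, y_train, X_val, y_val, candidates):
    """Fit on the training split for each candidate; keep the lowest validation MSE."""
    scores = {lam: mse(X_val, y_val, ridge_fit(X_train, y_train, lam))
              for lam in candidates}
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic regression problem (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
w_true = rng.normal(size=20)
y = X @ w_true + rng.normal(scale=1.0, size=60)

# 40 training examples, 20 validation examples; a test set would be held out separately
X_train, y_train, X_val, y_val = X[:40], y[:40], X[40:], y[40:]
best_lam, scores = select_lambda(X_train, y_train, X_val, y_val,
                                 [0.0, 0.1, 1.0, 10.0, 100.0])
print("chosen lambda:", best_lam)
```

The weights are only ever fitted on the training split, and the validation split is only used to compare candidates.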

## **Estimators, Bias/Variance, and Maximum Likelihood (ML)**

This chapter introduces **statistical terminology (estimators, bias, variance, consistency)**. The maximum likelihood estimator is consistent only under certain conditions, among them that the true data-generating distribution lies within the model family and corresponds to exactly one parameter value.

### **Estimators**
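- An estimator is a rule that computes a guess of the parameters from the data
- **Bias**: Expected estimate minus the true value, $\mathbb{E}[\hat{θ}] - θ$ | **Variance**: How much the estimate changes across different samples of the data
- **Consistent** if the estimate converges to the true value as the amount of data goes to infinity; for maximum likelihood this also requires the true model to be in the family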
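Bias and variance of an estimator can be checked by simulation. The sketch below compares two estimators of a Gaussian's variance, one dividing by $m$ and one dividing by $m - 1$. The true variance, sample size, and number of trials are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0        # variance of the assumed data-generating Gaussian
m = 5                 # a small sample size makes the bias visible
n_trials = 200_000    # number of simulated datasets

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_trials, m))
var_biased = samples.var(axis=1, ddof=0)     # divides by m
var_unbiased = samples.var(axis=1, ddof=1)   # divides by m - 1

# Theory: E[biased] = (m - 1) / m * true_var = 3.2, E[unbiased] = true_var = 4.0
print("mean of biased estimator:  ", var_biased.mean())
print("mean of unbiased estimator:", var_unbiased.mean())
# The biased estimator has the lower variance: a small bias-variance trade
print("variance of biased estimator:  ", var_biased.var())
print("variance of unbiased estimator:", var_unbiased.var())
```

The biased estimator is still consistent here, because the factor $(m-1)/m$ tends to 1 as $m$ grows.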

### **Maximum Likelihood (ML)**
- Choose parameters making data most probable
- Maximize log-likelihood = Minimize NLL/Cross-entropy
- **Linear regression example**: Assume Gaussian noise → ML = Minimize MSE
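The step from Gaussian noise to MSE can be written out. This is a short derivation, assuming i.i.d. examples and a fixed noise variance $\sigma^2$, for the model $p(y \mid x) = \mathcal{N}(y; w^T x, \sigma^2)$:

$$\sum_{i=1}^m \log \mathcal{N}(y_i; w^T x_i, \sigma^2) = -m \log \sigma - \frac{m}{2} \log(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - w^T x_i)^2$$

Only the last term depends on $w$, so maximizing the log-likelihood over $w$ gives the same solution as minimizing $\sum_{i=1}^m (y_i - w^T x_i)^2$, which is the MSE up to the constant factor $1/m$.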

### **ML Properties**
- **With big data**: Consistent (under the conditions above) and asymptotically efficient (for large $m$, no consistent estimator has lower mean squared error)
- **With small data**: Overfits (high variance)
- **Solution**: Regularization (e.g., weight decay)
  - Trade: Small bias increase for large variance reduction
  - Better generalization

- **Estimators** (Bias/Variance) → The "thermometer" that tells us whether the model is generalizing well
- **Likelihood** → The "engine" that drives training toward probabilistically correct predictions

## **Bayesian vs Frequentist Approaches in Machine Learning**

### **Core Concepts**
- **Prior**  $p(θ)$ : Initial beliefs about parameters *before* seeing data
- **Posterior**  $p(θ|x_{1:m})$ : Updated beliefs *after* observing data (via Bayes' rule)
- **Prediction**: Integrate over all possible parameter values, weighted by the posterior → this tends to **protect well against overfitting**

### **MAP (Maximum A Posteriori)**
- Practical compromise: take the **mode of the posterior distribution**
- $θ_{MAP} = \arg\max_θ \left[\log p(x \mid θ) + \log p(θ)\right]$
- **Gaussian Prior** → L2 regularization (weight decay)
- **Key insight**: Many classical regularizations = MAP with specific priors
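The Gaussian-prior case can be derived in two lines. This sketch assumes a zero-mean isotropic prior; the precision symbol $\lambda$ is introduced here for illustration.

$$p(θ) = \mathcal{N}\left(θ; 0, \tfrac{1}{\lambda} I\right) \quad \Rightarrow \quad \log p(θ) = -\frac{\lambda}{2} θ^T θ + \text{const}$$

$$θ_{MAP} = \arg\max_θ \left[\log p(x \mid θ) - \frac{\lambda}{2} \|θ\|^2\right] = \arg\min_θ \left[-\log p(x \mid θ) + \frac{\lambda}{2} \|θ\|^2\right]$$

The second form is the negative log-likelihood plus an L2 penalty, with the prior precision $\lambda$ acting as the weight decay coefficient.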

---

### **PRACTICAL COMPARISON**

| **Bayesian Approach** | **Frequentist (ML) Approach** |
|----------------------|------------------------------|
| ✅ **Better generalization** with limited data | ✅ **Simpler**, computationally efficient |
| ✅ **Incorporates prior knowledge** | ✅ **Consistent & efficient** asymptotically |
| ❌ **Computationally expensive** | ❌ **Tends to overfit** with small datasets |
| ❌ **Prior choice can be subjective** | ❌ **Requires explicit regularization** |

---

### **MODERN PRACTICE**

- **Large datasets**: Often prefer **ML + regularization** (with L2 weight decay this is equivalent to MAP under a Gaussian prior)
- **Limited data**: Full Bayesian approach can provide **better generalization**
- **Modern Deep Learning**: Primarily **Frequentist** for efficiency, but with **heavy regularization** (dropout, weight decay, and the regularizing side effect of batch norm) that recovers some of the protection against overfitting that a Bayesian treatment offers

**The bottom line**: Two perspectives on the same underlying mathematics, with choice depending on data size, computational resources, and need to incorporate existing domain knowledge.

# **Supervised vs Unsupervised Learning: Deep Dive**

## **SUPERVISED LEARNING**

### **Core Concept:**
We model **$p(y | x)$** - the probability of outputs given inputs

### **Key Examples:**

#### **1. Linear Regression (Gaussian)**
$$p(y|x) = \mathcal{N}(y; w^T x, \sigma^2)$$
$$\min_w \sum_{i=1}^m (y_i - w^T x_i)^2$$

- **Output**: Continuous value
- **Assumption**: Gaussian error
- **Solution**: Closed-form (normal equations)
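The closed-form solution can be sketched in a few lines of NumPy. The synthetic data and the helper name `fit_linear_regression` are assumptions; the sketch also assumes $X^T X$ is invertible.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations X^T X w = X^T y for w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data with known weights (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=100)

w_hat = fit_linear_regression(X, y)
print("estimated w:", w_hat)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse. When $X^T X$ is singular or badly conditioned, `np.linalg.lstsq` is a common alternative.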

#### **2. Logistic Regression (Binary Classification)**
$$p(y=1|x) = \sigma(w^T x) \quad \text{where} \quad \sigma(z) = \frac{1}{1+e^{-z}}$$
$$\max_w \sum_{i=1}^m \left[y_i \log(\sigma(w^T x_i)) + (1-y_i)\log(1-\sigma(w^T x_i))\right]$$

- **Output**: Probability in $(0, 1)$
- **No closed form** → numerical optimization (gradient descent)
- **Loss function**: Negative Log-Likelihood (NLL)
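Since there is no closed form, the weights are found iteratively. The following is a minimal sketch of batch gradient descent on the NLL, assuming labels in {0, 1} and no bias term, as in the formula above. The learning rate, step count, and synthetic data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Mean negative log-likelihood (binary cross-entropy)."""
    p = sigmoid(X @ w)
    eps = 1e-12  # guards against log(0)
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def fit_logistic(X, y, lr=0.5, n_steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # Gradient of the mean NLL with respect to w
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w -= lr * grad
    return w

# Synthetic binary labels drawn from a known logistic model (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_logistic(X, y)
print("NLL at w=0:", nll(np.zeros(2), X, y))
print("NLL after training:", nll(w_hat, X, y))
```

The code minimizes the mean NLL rather than maximizing the summed log-likelihood. The two differ only by sign and a constant factor, so they share the same optimum.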

#### **3. Support Vector Machines (SVM)**
$$\min_{w, b} \|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 \;\; \text{for all } i$$

- **Output**: Class label (-1 or +1) **without probabilities**
- **Kernel trick**: Implicit mapping to higher dimensions
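The kernel trick can be verified on a small case: a kernel evaluated in the original space equals an inner product in a higher-dimensional feature space. The sketch uses the homogeneous degree-2 polynomial kernel on 2-D inputs; the choice of kernel and the function names are assumptions for illustration.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2."""
    return float(np.dot(x, z) ** 2)

def poly2_features(x):
    """Explicit 3-D feature map for 2-D inputs with phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

implicit = poly2_kernel(x, z)                              # never builds phi
explicit = float(poly2_features(x) @ poly2_features(z))    # builds phi explicitly
print(implicit, explicit)
```

The two values agree up to floating-point rounding. The kernel computes the higher-dimensional inner product without constructing the feature vectors, which matters when the feature space is very large or infinite-dimensional.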

---

## **UNSUPERVISED LEARNING**

### **Core Concept:**
We model the **intrinsic structure** of $x$ - without labels

### **Main Examples:**

#### **1. PCA (Principal Component Analysis)**
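$$\max_{w} \; \text{Var}(w^T x) = \max_{w} \; w^T \Sigma w \quad \text{subject to} \quad \|w\| = 1$$
$$\min_{w} \; \mathbb{E}\left[\|x - ww^T x\|^2\right] \quad \text{subject to} \quad \|w\| = 1$$

Both forms assume centered $x$ with covariance $\Sigma$, and both have the same solution: maximizing the projected variance is equivalent to minimizing the reconstruction error.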

- **Interpretations**:
  - **Geometric**: Coordinate axis rotation
  - **Probabilistic**: Feature decorrelation
  - **Compression**: Dimensionality reduction

#### **2. k-Means Clustering**
$$\min \sum_{j=1}^k \sum_{x \in C_j} ||x - \mu_j||^2$$

- **Representation**: One-hot encoding of cluster assignments
- **Limitation**: Assumes spherical, similarly-sized clusters
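The objective above is usually minimized by alternating an assignment step and a centroid update step (Lloyd's algorithm). This is a minimal sketch; the farthest-first initialization is one of several common choices and is an assumption here, as are the synthetic blobs and function names.

```python
import numpy as np

def farthest_first_init(X, k, rng):
    """Pick one random point, then repeatedly add the point farthest from those chosen."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        chosen = np.array(centroids)
        dists = np.linalg.norm(X[:, None, :] - chosen[None, :, :], axis=2).min(axis=1)
        centroids.append(X[dists.argmax()])
    return np.array(centroids)

def assign(X, centroids):
    """Index of the nearest centroid for each point."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = farthest_first_init(X, k, rng)
    for _ in range(n_iters):
        labels = assign(X, centroids)            # assignment step
        for j in range(k):                       # update step
            if np.any(labels == j):              # leave an empty cluster's centroid unchanged
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, assign(X, centroids)

# Two well-separated synthetic blobs (assumption for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])

centroids, labels = kmeans(X, k=2)
one_hot = np.eye(2)[labels]   # one-hot representation of the cluster assignments
print(centroids)
```

Each row of `one_hot` has a single 1 in the column of its cluster, which is the representation mentioned above.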

#### **3. Density Estimation**
$$p_{\text{model}}(x) \approx p_{\text{data}}(x)$$

- **Applications**: Generative modeling, anomaly detection
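One simple way to connect density estimation with anomaly detection is to fit a Gaussian by maximum likelihood and flag points with unusually low log-density. This is a sketch under assumptions: the Gaussian model, the synthetic data, the threshold rule, and the function names are all illustrative choices.

```python
import numpy as np

def fit_gaussian(X):
    """ML estimates of the mean and covariance (covariance normalized by 1/m)."""
    mu = X.mean(axis=0)
    centered = X - mu
    cov = centered.T @ centered / len(X)
    return mu, cov

def log_density(X, mu, cov):
    """Log of the multivariate Gaussian density at each row of X."""
    d = len(mu)
    centered = X - mu
    solved = np.linalg.solve(cov, centered.T).T        # cov^{-1} (x - mu) per row
    mahalanobis_sq = np.sum(centered * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + mahalanobis_sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                  # synthetic "normal" data
mu, cov = fit_gaussian(X)

outlier = np.array([[8.0, 8.0]])               # hypothetical anomalous point
threshold = log_density(X, mu, cov).min()      # lowest log-density seen in training
is_anomaly = bool(log_density(outlier, mu, cov)[0] < threshold)
print("outlier flagged:", is_anomaly)
```

In practice the threshold would be chosen on held-out data, and a richer density model would be used when the data are not close to Gaussian.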

---

## **UNIFIED RECIPE (Model + Cost + Optimizer)**

### **Same Framework for Both:**

| **Component** | **Supervised** | **Unsupervised** |
|---------------|----------------|------------------|
| **Model** | $p(y \mid x)$ | $p(x)$ or $f(x)$ |
| **Cost** | NLL, MSE | Reconstruction, Likelihood |
| **Optimizer** | Gradient Descent, Normal Equations | SVD, EM, Gradient Descent |

### **Concrete Example - PCA as Optimization:**
$$\min_W ||X - XWW^T||^2_F \quad \text{subject to} \quad W^T W = I$$
Solution: the columns of $W^*$ are the $k$ eigenvectors of $X^T X$ with the largest eigenvalues (assuming the columns of $X$ are centered).
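The eigenvector solution can be sketched directly with NumPy. The synthetic data and the function name `pca` are assumptions; the sketch centers the data first, which the optimization problem takes for granted.

```python
import numpy as np

def pca(X, k):
    """Return (W, Z, eigvals): top-k directions, projected data, eigenvalues in descending order."""
    X_centered = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(X_centered.T @ X_centered)  # ascending order
    order = np.argsort(eigvals)[::-1]                             # descending order
    W = eigvecs[:, order[:k]]
    return W, X_centered @ W, eigvals[order]

# Synthetic 5-D data that mostly lives on a 2-D subspace (assumption)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

W, Z, eigvals = pca(X, k=2)
X_centered = X - X.mean(axis=0)
recon_error = float(np.sum((X_centered - Z @ W.T) ** 2))   # squared Frobenius norm
print("reconstruction error:", recon_error)
print("sum of discarded eigenvalues:", float(eigvals[2:].sum()))
```

The two printed quantities should agree up to rounding, since the squared reconstruction error of the top-$k$ projection equals the sum of the discarded eigenvalues. The columns of `Z` are also uncorrelated, which matches the decorrelation interpretation of PCA.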

---

## **FUNDAMENTAL DIFFERENCES**

### **Supervised:**
- **Goal**: Accurate prediction of $y$
- **Data**: Labeled $(x_i, y_i)$
- **Evaluation**: Accuracy, MSE, F1-score
- **Risk**: Overfitting on labels

### **Unsupervised:**
- **Goal**: Discover latent structure
- **Data**: Only $x_i$
- **Evaluation**: Less objective (silhouette, likelihood)
- **Risk**: Uninterpretable structures

---

## **IMPORTANT INTERCONNECTIONS**

### **1. Supervised → Unsupervised**
$$p(x,y) = p(y|x)p(x)$$
We often only model $p(y|x)$ and ignore $p(x)$

### **2. Unsupervised → Supervised**
$$X_{\text{original}} \rightarrow \text{PCA} \rightarrow X_{\text{reduced}} \rightarrow \text{Classifier}$$

### **3. Semi-Supervised Learning**
$$L_{\text{total}} = \alpha L_{\text{supervised}} + (1-\alpha) L_{\text{unsupervised}}$$

---

## **PRACTICAL GUIDELINES**

**Use Supervised when:**
- You have labeled data
- You want specific predictions
- Input-output relationship is clear

**Use Unsupervised when:**
- Exploring unknown data
- Finding hidden patterns
- Preparing features for subsequent models

**The beauty is that the same mathematical "recipe" works for both, demonstrating the fundamental unity of machine learning.**

### **Real-world Applications:**

**Supervised:**
- Spam detection (classification)
- House price prediction (regression)
- Medical diagnosis

**Unsupervised:**
- Customer segmentation (clustering)
- Anomaly detection in networks
- Topic modeling in documents

The choice depends mainly on your data availability and problem objectives!

---

## THE UNIVERSAL MACHINE LEARNING RECIPE

### THE 4 FUNDAMENTAL COMPONENTS:
Every ML algorithm is built with:
1. **DATASET** → $(X, y)$ (supervised) or $X$ (unsupervised)
2. **MODEL** → $p_{\text{model}}$ that describes the data
3. **COST FUNCTION** → often Negative Log-Likelihood (NLL) ± regularization
4. **OPTIMIZER** → closed-form **or** gradient descent/SGD

**Why do some algorithms (decision trees, k-means) require special optimizers?**  
Because their cost functions have **flat regions** where the gradient is zero and gives no indication of which direction lowers the cost, so gradient-based optimizers are inappropriate!

---

## OPERATIONAL TAKEAWAYS - WHAT TO DO IN PRACTICE

### 1. CLEARLY DEFINE THE PROBLEM
- **Task (T)**: What should the model do?
- **Performance (P)**: How do you measure success?
- **Experience (E)**: What data does it learn from?
- **METRIC** → must align with real objectives

### 2. COMBAT OVERFITTING
- **Regularization** (L2/L1, weight decay)
- **Early stopping**
- **More data** when possible
- **BALANCE** the bias-variance trade-off

### 3. CHOOSE THE ESTIMATION STRATEGY
- **Default**: Maximum Likelihood (ML) via NLL/cross-entropy
- **Limited data?** → Bayesian or MAP (incorporate prior knowledge)
- **Large datasets?** → ML + regularization

### 4. SCALE TRAINING
- **SGD with minibatches** → essential for large datasets
- **Modern variants** → Adam, RMSProp, etc.
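Minibatch SGD can be sketched on linear regression with an MSE cost. The learning rate, batch size, epoch count, and synthetic data are assumptions chosen for illustration, and the function name `sgd_linear_regression` is hypothetical.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=32, n_epochs=50, seed=0):
    """Minimize mean squared error with minibatch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_epochs):
        perm = rng.permutation(m)                      # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient of the minibatch MSE
            w -= lr * grad
    return w

# Synthetic data with known weights (assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=1000)

w_hat = sgd_linear_regression(X, y)
print("estimated w:", w_hat)
```

Each update touches only `batch_size` examples, so the cost per step does not grow with the size of the dataset. Adam and RMSProp keep this structure and change how the step is scaled.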

### 5. APPLY THE MENTAL RECIPE
- Any ML problem = **Dataset + Model + Cost + Optimizer**
- **Works for supervised AND unsupervised** → unified framework

### 6. WATCH OUT FOR HIGH DIMENSIONS
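- **Curse of dimensionality** → the number of distinct input configurations grows exponentially with the number of dimensions, so a fixed amount of data covers the space ever more sparsely
- **MITIGATION**: Distributed representations and deep models
- **Deep models can learn hierarchical structures** that help mitigate this problem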
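The exponential growth is easy to see by counting grid cells. The sketch below splits each input dimension into a fixed number of bins and reports the largest fraction of cells that a dataset could occupy. The bin count, dataset size, and function names are assumptions for illustration.

```python
def grid_cells(bins_per_dim, n_dims):
    """Number of cells when each of n_dims axes is split into bins_per_dim bins."""
    return bins_per_dim ** n_dims

def max_coverage(n_examples, bins_per_dim, n_dims):
    """Upper bound on the fraction of cells that can contain at least one example."""
    cells = grid_cells(bins_per_dim, n_dims)
    return min(n_examples, cells) / cells

for n_dims in (1, 2, 3, 10):
    print(n_dims, grid_cells(10, n_dims), max_coverage(1000, 10, n_dims))
```

With 1000 examples and 10 bins per axis, every cell can be covered in up to three dimensions, but in ten dimensions at most one cell in ten million contains an example. A method that needs data in every cell cannot generalize there.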

