## 📘 Supervised Learning: Regression – Notation

### 🔢 Mathematical Symbols and Notations

- **ℝ**: Set of **real numbers**  
  ℝ = all numbers on the number line, including decimals, fractions, etc.

- **ℝ₊**: Set of **positive real numbers**  
  Includes only real numbers greater than 0

- **ℝᵈ**: Set of all **d-dimensional vectors** of real numbers  
  For example, x ∈ ℝ³ means x = [x₁, x₂, x₃], where each xⱼ ∈ ℝ

---

### 🧮 Vectors and Coordinates

- **x**: A vector (list of numbers), e.g., `x = [3, 5, 1]`

- **xⱼ**: The **jᵗʰ coordinate** (element) of the vector x  
  For example, if `x = [3, 5, 1]`, then `x₂ = 5`

- **‖x‖**: The **length (magnitude)** of the vector x  
  It is calculated as: ‖x‖ = √(x₁² + x₂² + ... + x_d²)
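The coordinate and length notation can be sketched in Python. This is a minimal illustration, assuming a vector is stored as a plain list; the helper name `norm` is introduced here and is not part of the notes.

```python
import math

x = [3, 5, 1]  # the example vector used above

# Math notation counts coordinates from 1, Python lists count from 0,
# so the coordinate x_j is stored at x[j - 1].
x_2 = x[2 - 1]

def norm(v):
    """Euclidean length: square root of the sum of squared coordinates."""
    return math.sqrt(sum(c ** 2 for c in v))

print(x_2)           # 5
print(norm([3, 4]))  # 5.0
```

The only design point worth noting is the index shift between 1-based math notation and 0-based list indexing.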

---

### 🔁 Collection of Vectors

- **x¹, x², ..., xⁿ**: A **collection of n vectors**  
  Example: If n = 3 and d = 2, then we may have:
  - x¹ = [1, 2]
  - x² = [3, 4]
  - x³ = [5, 6]

- **xⱼⁱ**: The **jᵗʰ coordinate of the iᵗʰ vector** (the superscript is an index, not a power)  
  Example: x₂³ = 6 (the 2nd coordinate of the 3rd vector, x³ = [5, 6])
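As a rough sketch, the double indexing can be mirrored in Python by storing the collection as a list of lists. The helper `coord` is a hypothetical name used only for this illustration.

```python
# The three example vectors (n = 3, d = 2), stored as a list of lists.
vectors = [[1, 2], [3, 4], [5, 6]]

def coord(vectors, i, j):
    """Return the j-th coordinate of the i-th vector, both counted from 1."""
    return vectors[i - 1][j - 1]

print(coord(vectors, 3, 2))  # 6
```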

---

### 📐 Other Notations

- **(x₁)²**: Square of the **first coordinate** of the vector x  
  If x = [4, 3, 2], then (x₁)² = 4² = 16

- **1(condition)**: This is called the **indicator function**.  
  It returns 1 if the condition is **true**, otherwise 0.  
  Example:
    - 1(2 is even) = 1
    - 1(2 is odd) = 0
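The indicator function maps directly onto a one-line Python function. This is a small sketch; the name `indicator` is chosen here for illustration.

```python
def indicator(condition):
    """Return 1 if the condition is true, otherwise 0."""
    return 1 if condition else 0

print(indicator(2 % 2 == 0))  # 1, since 2 is even
print(indicator(2 % 2 == 1))  # 0, since 2 is not odd
```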

## 📘 Supervised Learning – Regression

### 🏠 Example Problem

**Goal:** Predict **house price** from features such as:
- Number of rooms
- Area (sq. ft.)
- Distance to city center

This is a **regression** task because the target output (house price) is a real number.

---

### 📊 Training Data

Training data is a collection of **input-output pairs**:
$$(x^1, y^1), (x^2, y^2), \dots, (x^n, y^n)$$

Where:
- $x^i \in \mathbb{R}^d$: the **i-th input vector**, containing `d` features (e.g., rooms, area, distance)
- $y^i \in \mathbb{R}$: the **i-th output**, a real number (house price)

---

### 🔁 Model and Output

The learning algorithm outputs a **model** function:
$$f: \mathbb{R}^d \rightarrow \mathbb{R}$$

This function maps input vectors to predicted outputs.

---

### 📉 Loss Function

To measure prediction error, we use **Mean Squared Error (MSE)**:
$${\text{Loss}} = \frac{1}{n} \sum_{i=1}^{n} \left( f(x^i) - y^i \right)^2$$

Each term is the squared difference between the predicted value and the actual output, so the loss is never negative and large errors are penalized more heavily than small ones.
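The formula can be written as a short Python function. This is a sketch under stated assumptions: the model is any callable taking one input, and the toy data and the `double` model are invented purely for illustration.

```python
def mse(model, xs, ys):
    """Mean squared error of `model` over paired inputs xs and outputs ys."""
    n = len(xs)
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / n

# Toy data, invented for this example only.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]

def double(x):
    """A hypothetical candidate model f(x) = 2x."""
    return 2 * x

# Errors are 0, 0 and -1, so the loss is (0 + 0 + 1) / 3.
print(mse(double, xs, ys))
```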

---

### 📈 Linear Regression Model

We assume a linear function:
$$f(x) = \mathbf{w}^\top x + b = \sum_{j=1}^{d} w_j x_j + b$$

Where:
- $\mathbf{w}$ = weight vector (learned from the training data)
- $x$ = input vector
- $b$ = bias/intercept term

This is a **linear regression model**, where the prediction is a weighted sum of input features.
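A linear model of this form might be sketched in Python as follows. The weights and bias below are hypothetical values picked for illustration, not parameters learned from any data, and `linear_model` is a helper name introduced here.

```python
def linear_model(w, b):
    """Build f(x) = w . x + b for a fixed weight list w and bias b."""
    def f(x):
        return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return f

# Hypothetical parameters for the three house features
# (rooms, area in sq. ft., distance); not learned from real data.
f = linear_model(w=[10.0, 0.5, -2.0], b=20.0)

# 10*3 + 0.5*1000 - 2*5 + 20 = 540.0
print(f([3, 1000, 5]))  # 540.0
```

Returning a closure keeps the parameters and the prediction function together, which is one convenient way to pass a model to a loss function.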


---

# Notes from transcript


# Machine Learning Foundations: Supervised Learning

Hello, everyone, and welcome to another lecture on **Machine Learning Foundations**.  
In this lecture, we will take a deeper dive into one main paradigm of machine learning, which is **supervised learning**.

There are several supervised learning tasks, but the two most important are **regression** and **classification**. Supervised learning is also the most common paradigm of machine learning: when someone says "machine learning," they typically mean supervised learning unless they specify otherwise (e.g., unsupervised learning, reinforcement learning).

---

## Notation

We will use some mathematical notation which will be needed for the rest of the lecture.

- **\(\mathbb{R}\)**: the set of real numbers.  
  Examples: 2.3, \(\sqrt{\pi}\), -7.6 are real numbers.

- **\(\mathbb{R}^d\)**: the set of all \(d\)-dimensional vectors of real numbers.  
  Example: an element in \(\mathbb{R}^3\) might be \((3.6, 5.2, -1.8)\).

- We denote a vector by \( \mathbf{x} \), and \( x_j \) denotes the \(j\)-th coordinate of the vector.

- The **norm** (or length) of a vector \( \mathbf{x} \) is denoted as \( \|\mathbf{x}\| \), which is the Euclidean length. For example, for \( \mathbf{x} \in \mathbb{R}^3 \):

\[
\|\mathbf{x}\|^2 = x_1^2 + x_2^2 + x_3^2,
\quad
\|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + x_3^2}
\]

- When working with a collection of vectors, we use **superscripts** for indexing different vectors and **subscripts** for elements in a vector. For example, \( \mathbf{x}^i \) is the \(i\)-th vector, and \( x^i_j \) is the \(j\)-th coordinate of the \(i\)-th vector.

Example:  
If

\[
\mathbf{x}^1 = (1, 2, 3), \quad \mathbf{x}^2 = (1, 1, 1), \quad \mathbf{x}^3 = (7, 7, 8)
\]

then \( x^3_2 = 7 \) (the second coordinate of the third vector).

- To denote powers, we use parentheses so that an exponent is not mistaken for a vector index. For example, \( (x_1)^2 \) is the square of the first coordinate of the vector \( \mathbf{x} \).

- We also use **indicator variables** denoted by \( \mathbf{1}\{\text{condition}\} \), which equals 1 if the condition is true, and 0 otherwise.  
For example:  
\[
\mathbf{1}\{2 \text{ is even}\} = 1, \quad \mathbf{1}\{2 \text{ is odd}\} = 0
\]

---

## Supervised Learning Paradigm

At its core, **supervised learning** can be thought of as **curve-fitting**. You have a set of points \((\mathbf{x}^1, y^1), (\mathbf{x}^2, y^2), \ldots, (\mathbf{x}^n, y^n)\), where \(\mathbf{x}^i \in \mathbb{R}^d\) are input vectors and \(y^i\) are labels.

The goal is to find a model \( f: \mathbb{R}^d \to \mathcal{Y} \), where \(\mathcal{Y}\) is the set of possible labels, such that:

\[
f(\mathbf{x}^i) \approx y^i
\]

This means the function \(f\) should produce outputs close to the labels for each input vector.

---

### Example: Regression Problem

Suppose you want to predict the price of a house from features such as number of rooms, area, and distance to metro.  
- Each house is represented by a 3D vector:  
\[
\mathbf{x} = (\text{number of rooms}, \text{area}, \text{distance})
\]  
- The label \(y\) is the house price (a real number).

The goal of the learning algorithm is to find a function \(f : \mathbb{R}^3 \to \mathbb{R}\) such that:

\[
f(\mathbf{x}^i) \approx y^i
\]

---

### Loss Function

To evaluate how well the model \(f\) performs, we use a **loss function** measuring the deviation:

\[
\text{Loss}(f) = \frac{1}{n} \sum_{i=1}^n \left( f(\mathbf{x}^i) - y^i \right)^2
\]

- The loss is always non-negative, since it is an average of squares.
- The smallest possible loss is 0, attained exactly when \( f(\mathbf{x}^i) = y^i \) for all \(i\).

---

### Linear Models

A common model form is a **linear model**:

\[
f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b
\]

- \(w_1, w_2, \ldots, w_d, b\) are parameters learned from data.
- For house price prediction, it might look like:

\[
\text{price} = w_1 \times \text{rooms} + w_2 \times \text{area} + w_3 \times \text{distance} + b
\]

---

### Simple Example with One-Dimensional Input

Suppose the input dimension \(d=1\) and the training data points are:

| \(x\) | 1 | 2 | 3 | 6 | 7 |
|-------|---|---|---|---|---|
| \(y\) | 2.1 | 3.9 | 6.2 | 11.5 | 13.9 |

Plotting these points on the \(xy\)-plane helps visualize the data.

Let's consider two candidate models:

- \( f(x) = 2x \)
- \( g(x) = x + 3 \)

Evaluate the predictions:

| \(x\) | 1 | 2 | 3 | 6 | 7 |
|-------|---|---|---|---|---|
| \(f(x)\) | 2 | 4 | 6 | 12 | 14 |
| \(g(x)\) | 4 | 5 | 6 | 9 | 10 |

Calculate the losses for each and choose the model with the smaller loss.
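The comparison can be worked through with a short Python sketch, using the five training points and the two candidate models from this example. The helper name `loss` is introduced here for illustration; it computes the mean squared error defined earlier.

```python
xs = [1, 2, 3, 6, 7]
ys = [2.1, 3.9, 6.2, 11.5, 13.9]

def f(x):
    return 2 * x

def g(x):
    return x + 3

def loss(model):
    """Mean squared error of `model` on the training points above."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

loss_f = loss(f)  # squared errors: 0.01, 0.01, 0.04, 0.25, 0.01
loss_g = loss(g)  # squared errors: 3.61, 1.21, 0.04, 6.25, 15.21

print(round(loss_f, 3))  # 0.064
print(round(loss_g, 3))  # 5.264
```

Under this loss, \(f(x) = 2x\) fits the data far better than \(g(x) = x + 3\), so \(f\) is the model to choose.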

---

This concludes the foundational overview of supervised learning, regression, and the notation used.
