**3 Distinct Approaches to Classification:**
1. Use a ***Discriminant Function*** to directly assign each input vector $\bf x$ to a specific class
2. Model the posterior class probabilities $p(C_k|\mathbf{x})$ directly with a parametric model whose parameters are fit to training data, then use these probabilities to make optimal classification decisions
    - This is ***Discriminative Probabilistic Modeling***
3. Alternatively, model the class-conditional densities $p(\mathbf{x}|C_k)$ along with the prior class probabilities $p(C_k)$, then compute the posterior probabilities via Bayes' theorem: $$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k) p(C_k)}{p(\mathbf{x})}$$
    - This is ***Generative Probabilistic Modeling***, so called because it offers the opportunity to generate samples from each of the class-conditional densities $p(\mathbf{x}|C_k)$
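
To make the Bayes' theorem step in the third approach concrete, here is a minimal sketch in plain Python. The function name `posterior` and the likelihood and prior values are hypothetical, chosen only for illustration; the evidence $p(\mathbf{x})$ is obtained by summing the joint terms over classes.

```python
def posterior(likelihoods, priors):
    """Return p(C_k | x) for each class k via Bayes' theorem.

    likelihoods[k] is p(x | C_k) evaluated at one fixed input x,
    priors[k] is p(C_k).
    """
    # Joint terms p(x | C_k) p(C_k), one per class.
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    # Evidence p(x) = sum_k p(x | C_k) p(C_k).
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical values for a two-class problem at a single input x.
likelihoods = [0.5, 0.125]  # p(x | C_1), p(x | C_2)
priors = [0.25, 0.75]       # p(C_1), p(C_2)

post = posterior(likelihoods, priors)

# Pick the class with the largest posterior (classes are numbered from 1).
predicted_class = max(range(len(post)), key=post.__getitem__) + 1
print(post, predicted_class)
```

Even though $C_2$ has the larger prior here, the larger likelihood under $C_1$ is enough to make $C_1$ the more probable class for this input.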

## Discriminant Functions

The simplest representation of a linear discriminant function is:
$$y(\mathbf{x}) = \mathbf{w}^\intercal \mathbf{x} + w_0$$
In the two-class case, an input vector $\mathbf{x}$ is assigned to class $C_1$ if $y(\mathbf{x}) \ge 0$ and to class $C_2$ otherwise. The decision boundary, defined by $y(\mathbf{x}) = 0$, is a $(D-1)$-dimensional hyperplane within the $D$-dimensional input space.
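
The decision rule can be sketched in a few lines of plain Python. The weights `w`, bias `w0`, and the sample points below are hypothetical values chosen for illustration, not taken from any fitted model.

```python
def y(x, w, w0):
    """Linear discriminant y(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(x, w, w0):
    """Assign class 1 if y(x) >= 0, otherwise class 2."""
    return 1 if y(x, w, w0) >= 0 else 2


# Hypothetical parameters for a 2-dimensional input space.
w = [1.0, -2.0]
w0 = 0.5

points = [[2.0, 0.0], [0.0, 1.0], [1.5, 1.0]]
labels = [classify(x, w, w0) for x in points]
print(labels)
```

The third point satisfies $y(\mathbf{x}) = 0$, so it lies exactly on the decision boundary and is assigned to $C_1$ under the $\ge$ convention.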

### Orthogonality of $\bf w$

Consider two points $\mathbf{x}_A$ and $\mathbf{x}_B$, both of which lie upon the decision surface. Then, for these points we have:
$$y(\mathbf{x}_A) = y(\mathbf{x}_B) = 0 \ \rightarrow \ \mathbf{w}^\intercal (\mathbf{x}_A - \mathbf{x}_B) = 0$$
Thus, $\mathbf{w}$ is orthogonal to every vector lying within the decision surface. Why?\
Recall that the dot product between two vectors $\mathbf{w}$ and $\mathbf{v}$ equals: $$\|\mathbf{w}\|\|\mathbf{v}\|\cos(\theta)$$ where $\theta$ is the angle between the vectors. In words, this is the cosine similarity between the vectors' directions scaled by the product of their magnitudes. Taking $\mathbf{v} = \mathbf{x}_A - \mathbf{x}_B$, for the dot product to be $0$, one of two things must be true:
1. One of the vectors has zero magnitude. This is not the case for $\mathbf{w}$, since a zero $\mathbf{w}$ would not define a decision boundary, and $\mathbf{x}_A - \mathbf{x}_B$ is nonzero whenever the two points are distinct.
2. The cosine of the angle between the vectors is zero, which means that the vectors are perpendicular: $\theta = 90^\circ$

Because $\mathbf{x}_A - \mathbf{x}_B$ is a vector lying within the decision surface, $\mathbf{w}$ must be perpendicular to the decision surface.

So, $\bf w$ is orthogonal to the hyperplane that is the decision boundary. This means that it is the ***Normal Vector*** that *defines* the hyperplane's orientation.
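
As a quick numerical check of the orthogonality argument, the sketch below picks two points on a hypothetical decision surface and measures the angle between $\mathbf{w}$ and their difference. All numeric values are assumptions chosen so that both points satisfy $y(\mathbf{x}) = 0$.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


# Hypothetical discriminant: y(x) = 3*x1 + 4*x2 - 10.
w = [3.0, 4.0]
w0 = -10.0

# Two points chosen to lie on the decision surface.
x_a = [2.0, 1.0]
x_b = [-2.0, 4.0]
y_a = dot(w, x_a) + w0
y_b = dot(w, x_b) + w0

# Their difference is a vector lying within the surface.
diff = [a - b for a, b in zip(x_a, x_b)]

# Angle between w and that vector, recovered from the dot product.
cos_theta = dot(w, diff) / (norm(w) * norm(diff))
angle_deg = math.degrees(math.acos(cos_theta))
print(y_a, y_b, angle_deg)
```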

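Now, any point $\mathbf{x}$ on the decision surface satisfies $y(\mathbf{x}) = 0$, i.e. $\mathbf{w}^\intercal \mathbf{x} = -w_0$. The normal distance from the origin to the decision surface is therefore: $$\frac{\mathbf{w}^\intercal \mathbf{x}}{\| \mathbf{w} \|} = -\frac{w_0}{\|\mathbf{w}\|}$$ This quantity is signed, measured along the direction of $\mathbf{w}$; its magnitude is $|w_0| / \|\mathbf{w}\|$.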

So, the bias parameter $w_0$ determines the location of the decision surface.
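
The distance formula can be verified numerically. The sketch below assumes hypothetical values for `w` and `w0`, constructs the point on the surface closest to the origin as the scalar multiple $-\frac{w_0}{\|\mathbf{w}\|^2}\mathbf{w}$ of the normal vector, and compares its length with $|w_0| / \|\mathbf{w}\|$.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


# Hypothetical discriminant: y(x) = 3*x1 + 4*x2 - 10.
w = [3.0, 4.0]
w0 = -10.0

# Signed normal distance from the origin to the surface.
signed_distance = -w0 / norm(w)

# Closest point on the surface: a scalar multiple of w.
alpha = -w0 / dot(w, w)
x_star = [alpha * wi for wi in w]

# x_star should satisfy y(x) = 0 up to rounding,
# and its length should match |w0| / ||w||.
y_star = dot(w, x_star) + w0
print(signed_distance, x_star, y_star, norm(x_star))
```

Changing `w0` while holding `w` fixed moves the surface parallel to itself, which is the sense in which the bias determines its location.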

*Claude Explanation:*
> We've just learned that w is perpendicular to the decision surface. Therefore, the shortest path from the origin to the surface must lie along the direction of w (or -w). Let's call the point where this perpendicular line meets the surface x*.\
Here's where we can use a clever trick: we know that x* must be some scalar multiple of w (since it lies along w's direction). Let's call this scalar α:\
x* = αw\
Since x* lies on the decision surface, it must satisfy our original equation:\
w^T(αw) + w₀ = 0\
Since α is a scalar, it factors out of the product:\
αw^Tw + w₀ = 0\
Note that w^Tw is just ||w||², the squared magnitude of w. So:\
α||w||² + w₀ = 0\
Solving for α:\
α = -w₀/||w||²\
Now, x* = αw = (-w₀/||w||²)w is our point on the surface. The distance we're looking for is ||x*||:\
||x*|| = ||(-w₀/||w||²)w|| = |-w₀/||w||²| · ||w|| = |w₀|/||w||

Similarly, the distance from any point $\bf x$ to the decision surface is the distance between $\bf x$ and its orthogonal projection onto the decision surface, $\mathbf{x}_\perp$. We may express $\bf x$ in terms of $\mathbf{x}_\perp$ as: $$\mathbf{x} = \mathbf{x}_\perp + r \frac{\mathbf{w}}{\| \mathbf{w} \|}$$
That is, $\mathbf{x}$ equals its orthogonal projection $\mathbf{x}_\perp$ plus a displacement of signed length $r$ along the unit normal $\mathbf{w} / \|\mathbf{w}\|$. Multiplying both sides by $\mathbf{w}^\intercal$, adding $w_0$, and using $y(\mathbf{x}_\perp) = 0$ gives $y(\mathbf{x}) = r \|\mathbf{w}\|$, so the signed perpendicular distance $r$ is: $$r = \frac{y(\mathbf{x})}{\|\mathbf{w}\|}$$
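
The decomposition can be checked with a short sketch. The parameters and the query point are hypothetical; the code computes $r$, recovers $\mathbf{x}_\perp$ by stepping back along the unit normal, and confirms that the projection lands on the decision surface.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(a, a))


def signed_distance(x, w, w0):
    """Signed perpendicular distance r = y(x) / ||w||."""
    return (dot(w, x) + w0) / norm(w)


def project(x, w, w0):
    """Orthogonal projection of x onto the surface: x - r * w / ||w||."""
    r = signed_distance(x, w, w0)
    w_norm = norm(w)
    return [xi - r * wi / w_norm for xi, wi in zip(x, w)]


# Hypothetical discriminant: y(x) = 3*x1 + 4*x2 - 10.
w = [3.0, 4.0]
w0 = -10.0
x = [4.0, 7.0]

r = signed_distance(x, w, w0)
x_perp = project(x, w, w0)
y_perp = dot(w, x_perp) + w0  # should be 0 up to rounding

# The origin lies on the negative side of this surface, so r is negative there.
r_origin = signed_distance([0.0, 0.0], w, w0)
print(r, x_perp, y_perp, r_origin)
```

The sign of $r$ indicates which side of the boundary a point falls on: positive in the direction of $\mathbf{w}$ (class $C_1$) and negative on the opposite side.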


## Generative Classifiers