# Chapter 4: Linear Models For Classification (Summary)

<hr style="height:2px;"></hr>

<li> Classification problem:
    <ul><li> Given $\mathcal{D}=\{(x_n, t_n)\}_{n=1}^N$
        <li> Assign a $D$-dimensional input $x$ to one of $K$ classes $C_k$, $k \in \{1, 2, \dots, K\}$, by predicting either the class $C_k$ itself or its posterior $P(C_k|x)$.
        <li> The input space is thereby divided into decision regions, whose boundaries are called decision boundaries or decision surfaces.
    </ul>

<li> We have three main approaches:
    <ul><li><b>1) Discriminant function:</b>
        <ul><li> Assign $x$ to $C_k$ directly
            <li> Cannot estimate the class posterior $P(C_k|x)$.
            <li> ex. Perceptron
        </ul>
        <li><b>2) Discriminative Models:</b>
        <ul><li> Model $P(C_k|x)$ directly.
            <li> Cannot generate new samples.
            <li> ex. Logistic Regression
        </ul>
        <li><b>3) Generative Models:</b>
        <ul><li> Model class-conditional densities $P(x|C_k)$ together with the prior $P(C_k)$.
            <li> Then use Bayes' rule to get the posterior $P(C_k|x)$.
            <li> ex. Gaussian Discriminant Analysis.
        </ul>
    </ul>
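
As a rough illustration of the generative route, the sketch below applies Bayes' rule to two one-dimensional Gaussian class-conditional densities. The helper names (`gaussian_pdf`, `posteriors`) and all parameter values are assumptions made for this example only; they are not taken from the chapter.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, class_params, priors):
    """Return [P(C_k | x)] for every class using Bayes' rule.

    class_params: list of (mean, std) pairs, one per class (illustrative values).
    priors: list of P(C_k), assumed to sum to 1.
    """
    # Joint terms P(x | C_k) P(C_k)
    joint = [gaussian_pdf(x, m, s) * p for (m, s), p in zip(class_params, priors)]
    # Evidence P(x) = sum_k P(x | C_k) P(C_k)
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Two classes with equal priors and equal spread, centred at -1 and +1.
params = [(-1.0, 1.0), (1.0, 1.0)]
priors = [0.5, 0.5]
print(posteriors(0.0, params, priors))  # symmetric point: both posteriors are 0.5
print(posteriors(2.0, params, priors))  # closer to the second class mean
```

A discriminative model would instead parameterize $P(C_k|x)$ directly and never form the densities $P(x|C_k)$.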
<li> Linear models are used for classification through a generalized form in which a linear function of $w$ is transformed by a nonlinear function $f(.)$.
    <ul><li> $y(x) = f(W^TX + W_0)$
        <li> The function $f(.)$ is called the activation function in machine learning; its inverse is called the link function in statistics.
        <li> The model is no longer linear in $w$ because $f(.)$ is nonlinear, but the decision boundaries $y(x) = \text{constant}$ are still linear in $x$.
    </ul>
    </ul>
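
The generalized form $y(x) = f(W^TX + W_0)$ can be sketched as follows. The logistic sigmoid is used here only as one possible choice of $f(.)$, and the function names and weight values are hypothetical.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, one common choice of activation function f."""
    return 1.0 / (1.0 + math.exp(-a))

def glm(x, w, w0, f=sigmoid):
    """Generalized linear model: y(x) = f(w^T x + w0)."""
    a = sum(wi * xi for wi, xi in zip(w, x)) + w0  # linear part w^T x + w0
    return f(a)

# Illustrative weights, not taken from the chapter.
w, w0 = [2.0, -1.0], 0.5
print(glm([1.0, 2.5], w, w0))  # linear part is 0, so this prints 0.5
print(glm([3.0, 0.0], w, w0))  # linear part is 6.5, so the output is close to 1
```

Because $f$ is monotonic, the set of points where the output equals 0.5 is exactly the hyperplane $W^TX + W_0 = 0$, which is why the decision boundary stays linear in $x$.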


<hr style="height:2px;"></hr>
<h2>1) Discriminant Function</h2><br>

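<li> Assign $x$ directly to a class: $x \to C_k$
<li> The decision boundaries are linear (hyperplanes), and the decision regions of a single $K$-class linear discriminant are singly connected and convex.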
<h3>For two-class problem:</h3>
    <ul><li> $y(x) = W^TX + W_0$
        <li> $C_k =
                \begin{cases}
                1, & \text{if } y(x) \geq 0\\
                0, & \text{if } y(x) < 0
                \end{cases}$
        <li> $W$ determines the orientation of the D.B.
        <li> $W_0$ determines the location of the D.B.
        <li> $y(x)$ gives a signed measure of the perpendicular distance of $x$ from the D.B.: the signed distance is $y(x)/\|W\|$.
    </ul>
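
A minimal sketch of the two-class rule, together with the signed distance $y(x)/\|W\|$ to the decision boundary. The function names and the weight values are assumptions chosen so that the arithmetic is easy to check by hand.

```python
import math

def discriminant(x, w, w0):
    """Linear discriminant y(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def classify_two_class(x, w, w0):
    """Return class label 1 if y(x) >= 0, otherwise 0."""
    return 1 if discriminant(x, w, w0) >= 0 else 0

def signed_distance(x, w, w0):
    """Signed perpendicular distance of x from the boundary y(x) = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return discriminant(x, w, w0) / norm

# Illustrative parameters: ||w|| = 5.
w, w0 = [3.0, 4.0], -5.0
print(classify_two_class([3.0, 4.0], w, w0), signed_distance([3.0, 4.0], w, w0))  # 1 4.0
print(classify_two_class([0.0, 0.0], w, w0), signed_distance([0.0, 0.0], w, w0))  # 0 -1.0
```

The origin lies at distance $-W_0/\|W\|$ from the boundary, which is how $W_0$ fixes the boundary's location.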
<h3>For multi-class problem:</h3>
    <ul><li> Can be done by two main approaches:
        <li>1) Using 2-class classifiers:
            <ul><li> Leads to ambiguous regions of the input space, where the class assignment is not uniquely determined.
                <li> ex. One-Versus-The-Rest or One-Versus-One.
            </ul>
        <li>2) Using a single $K$-class discriminant comprising $K$ linear functions:
            <ul><li> $y_k(x) = W^T_kX + W_{k0}$, $y(x) = (y_1(x),...,y_K(x))$.
                <li> Assign $X$ to $C_k$ if $y_k(x) > y_j(x)$ for all $j \neq k$.
                <li> The D.B. between $C_k$ and $C_j$ is $y_k(x) = y_j(x)$
            </ul>
    </ul>
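
The single $K$-class discriminant amounts to evaluating all $K$ linear functions and taking the largest. The sketch below assumes classes are indexed from 0 and uses made-up weights; on an exact tie it returns the lowest index, whereas the strict inequality above leaves such boundary points unassigned.

```python
def class_scores(x, W, w0):
    """Return [y_1(x), ..., y_K(x)] with y_k(x) = w_k^T x + w_k0."""
    return [sum(wi * xi for wi, xi in zip(wk, x)) + bk for wk, bk in zip(W, w0)]

def classify_k_class(x, W, w0):
    """Return the index k with the largest y_k(x)."""
    y = class_scores(x, W, w0)
    return max(range(len(y)), key=lambda k: y[k])

# Three illustrative classes in a 2-D input space (one weight vector per class).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
w0 = [0.0, 0.0, 0.0]
print(classify_k_class([2.0, 1.0], W, w0))    # scores [2, 1, -3]  -> 0
print(classify_k_class([0.0, 3.0], W, w0))    # scores [0, 3, -3]  -> 1
print(classify_k_class([-2.0, -2.0], W, w0))  # scores [-2, -2, 4] -> 2
```

Every input receives exactly one label, so the ambiguous regions produced by combining two-class classifiers do not arise.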
<h3>To find the parameters $W$ in discriminant function, we have three methods:</h3>
    <ul><li><b>1) Least-Squares:</b>
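        <ul><li> $y(x) = \tilde{W}^T\tilde{x}$, which groups all $K$ class models $y_k(x) = W_k^Tx + W_{k0}$ into one matrix expression, where the $k^{th}$ column of $\tilde{W}$ is $(W_{k0}, W_k^T)^T$ and $\tilde{x} = (1, x^T)^T$.
            <li> $E_D(\tilde{W}) = \frac{1}{2} \mathrm{Tr}\{(\tilde{X}\tilde{W} - T)^T(\tilde{X}\tilde{W} - T)\}$, where $\tilde{X}$ is the $N \times (D+1)$ matrix whose $n^{th}$ row is $\tilde{x}_n^T$.
            <li> $\tilde{W} = (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^TT = \tilde{X}^{+}T$, where $T$ is the $N \times K$ matrix whose $n^{th}$ row is $t_n^T$.
            <li> Highly sensitive to outliers, because the sum-of-squares error also penalizes points that lie far on the correct side of the decision boundary.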
        </ul>
        <li><b>2) Fisher's Linear Discriminant:</b>
            <ul><li> A way to view a linear classification model is in terms of dimensionality reduction.
                <li> $y = W^TX$ projects the $D$-dimensional input $X$ onto a lower-dimensional $y$-space along the direction(s) defined by $W$.
                <li> We seek the projection that maximizes the between-class variance while minimizing the within-class variance of the projected data.