# Home Assignment No. 1: Theory

In this part of the homework, you are to solve several theoretical problems related to machine learning algorithms.

* For every separate problem you can get **INTERMEDIATE scores**.


* Your solution must be **COMPLETE**, i.e. contain all required formulas/proofs/detailed explanations.


* You must write your solution for each problem right after the words **YOUR SOLUTION**. Attaching pictures of your handwriting is allowed, but **highly discouraged**.

## $\LaTeX$ in Jupyter

Jupyter has constantly improving $\LaTeX$ support. Below are the basic methods to write **neat, tidy, and well typeset** equations in your notebooks:

* to write an **inline** equation use 
```markdown
$ your latex equation here $
```

* to write an equation that is **displayed on a separate line** use 
```markdown
$$ your latex equation here $$
```

* to write **cases of equations** use 
```markdown
$$ left-hand-side = \begin{cases}
                     right-hand-side on line 1, & \text{condition} \\
                     right-hand-side on line 2, & \text{condition} \\
                    \end{cases} $$
```

* to write a **block of equations** use 
```markdown
$$ \begin{align}
    left-hand-side on line 1 &= right-hand-side on line 1 \\
    left-hand-side on line 2 &= right-hand-side on line 2
   \end{align} $$
```

The **ampersand** (`&`) aligns the equations horizontally and the **double backslash**
(`\\`) creates a new line.

## Task 1. Locally Weighted Linear Regression [6 points]

Under the assumption that $\mathbf{X}^\top W(\mathbf{x}_0) \mathbf{X}$ is invertible, derive the closed-form solution for the LWR problem defined in Task 3 of the practical part.

### Your solution:
$$
\theta^*(\mathbf{x}_0) = \arg \min_{\theta(\mathbf{x}_0)} \sum_{i = 1}^m w^{(i)}(\mathbf{x}_0) \left(y_i - \theta(\mathbf{x}_0)^\top \mathbf{x}_i\right)^2
$$

To minimize this, we calculate the gradient of the loss function with respect to $\theta(\mathbf{x}_0)$ and set it to zero.

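We have the following loss function:
$$
L(\theta(\mathbf{x}_0)) = \sum_{i = 1}^m w^{(i)}(\mathbf{x}_0) \left(y_i - \theta(\mathbf{x}_0)^\top \mathbf{x}_i\right)^2 = (\mathbf{X}\theta(\mathbf{x}_0) - y)^\top W(\mathbf{x}_0) (\mathbf{X}\theta(\mathbf{x}_0) - y),
$$

where the rows of $\mathbf{X}$ are $\mathbf{x}_i^\top$ and $W(\mathbf{x}_0) = \operatorname{diag}\left(w^{(1)}(\mathbf{x}_0), \dots, w^{(m)}(\mathbf{x}_0)\right)$ is the diagonal matrix of weights.

For simplicity, we denote $W = W(\mathbf{x}_0)$ and $\theta = \theta(\mathbf{x}_0)$.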

Now we calculate the gradient of the loss function with respect to $\theta$. Expanding the quadratic form and using the fact that $W$ is diagonal, and hence symmetric, we get:

$$
\begin{align}
\nabla_\theta L(\theta) &= \nabla_\theta \left((\mathbf{X}\theta - y)^\top W (\mathbf{X}\theta - y)\right) \\
&= \nabla_\theta \left(\theta^\top \mathbf{X}^\top W \mathbf{X} \theta - 2y^\top W \mathbf{X} \theta + y^\top W y\right) \\
&= 2\mathbf{X}^\top W \mathbf{X} \theta - 2\mathbf{X}^\top W y = 0,
\end{align}
$$

where the final equality is the first-order optimality condition.
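The gradient expression $2\mathbf{X}^\top W \mathbf{X} \theta - 2\mathbf{X}^\top W y$ can be sanity-checked numerically. The sketch below compares it against central finite differences on random data; the data, the weight range, and the helper names `loss` and `grad` are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def loss(theta, X, y, w):
    """Weighted squared error: sum_i w_i * (x_i^T theta - y_i)^2."""
    r = X @ theta - y
    return np.sum(w * r ** 2)

def grad(theta, X, y, w):
    """Analytic gradient 2 X^T W X theta - 2 X^T W y, with W = diag(w)."""
    return 2.0 * X.T @ (w * (X @ theta - y))

# Random illustrative problem (not taken from the assignment).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.uniform(0.1, 1.0, size=20)
theta = rng.normal(size=3)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (loss(theta + eps * e, X, y, w) - loss(theta - eps * e, X, y, w)) / (2 * eps)
    for e in np.eye(3)
])
max_abs_error = np.max(np.abs(numeric - grad(theta, X, y, w)))
print(max_abs_error)
```

A small `max_abs_error` indicates that the analytic and numerical gradients agree up to rounding.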

Now we can solve this equation for $\theta$:

$$
\mathbf{X}^\top W \mathbf{X} \theta = \mathbf{X}^\top W y
$$

Because $\mathbf{X}^\top W \mathbf{X}$ is invertible, we can multiply both sides by its inverse:

$$
\theta = (\mathbf{X}^\top W \mathbf{X})^{-1} \mathbf{X}^\top W y
$$

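For non-negative weights the Hessian $2\mathbf{X}^\top W \mathbf{X}$ is positive semidefinite, and it is invertible by assumption, so this stationary point is the unique minimizer. In the original notation:

$$
\theta^*(\mathbf{x}_0) = (\mathbf{X}^\top W(\mathbf{x}_0) \mathbf{X})^{-1} \mathbf{X}^\top W(\mathbf{x}_0) y
$$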
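As an illustration, the closed-form solution can be evaluated with NumPy. This is a minimal sketch under stated assumptions: the Gaussian kernel in `gaussian_weights` and its bandwidth `tau` are hypothetical stand-ins for the weights defined in the practical part, and the synthetic data are made up for the example.

```python
import numpy as np

def lwr_theta(X, y, w):
    """Closed-form LWR solution theta = (X^T W X)^{-1} X^T W y, with W = diag(w)."""
    XtW = X.T * w  # equals X.T @ np.diag(w) without building the m x m matrix
    # Solve the normal equations instead of forming the inverse explicitly.
    return np.linalg.solve(XtW @ X, XtW @ y)

def gaussian_weights(X, x0, tau=0.5):
    """Hypothetical Gaussian kernel weights (assumption; see the practical part)."""
    sq_dist = np.sum((X - x0) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * tau ** 2))

# Synthetic 1-D regression problem with a bias column (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, size=50)])
y = np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=50)

x0 = np.array([1.0, 0.2])      # query point, including the bias term
w = gaussian_weights(X, x0)
theta = lwr_theta(X, y, w)
prediction = theta @ x0        # local linear prediction at x0
print(theta, prediction)
```

Using `np.linalg.solve` rather than `np.linalg.inv` is a numerical design choice; both correspond to the same formula when $\mathbf{X}^\top W \mathbf{X}$ is invertible.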

## Task 2. Multiclass Naive Bayes Classifier [4 points]

Let us consider a **multiclass classification problem** with classes $C_1, \dots, C_K$.

Assume that all $d$ features $\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}$ are **binary**, i.e. $x_{i} \in \{0, 1\}$ for $i = \overline{1, d}$, **or**, equivalently, the feature vector $\mathbf{x} \in \{0, 1\}^d$.

Show that the decision rule of a **Naive Bayes Classifier** can be represented as $\arg\max$ of linear functions of the input.

&nbsp;

**Hint**: use the **maximum a posteriori** (MAP) decision rule: $\hat{y} = \arg\max\limits_{y \in \overline{1, K}} p(y)p(\mathbf{x}|y)$

### Your solution:

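By Bayes' rule and the naive assumption that the features are conditionally independent given the class, the posterior probability of class $C_i$ is:

$$
\begin{align}
P(C_{i} \mid \mathbf{x}) &= \frac{P(\mathbf{x} \mid C_{i})P(C_{i})}{P(\mathbf{x})} \\
&= \frac{P(x_{1}, x_{2}, \dots, x_{d} \mid C_{i})P(C_{i})}{P(\mathbf{x})} \\
&= \frac{P(C_{i})\prod_{j=1}^{d}P(x_{j} \mid C_{i})}{P(\mathbf{x})}
\end{align}
$$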

Because each feature $x_{j}$ is binary, $P(x_{j} \mid C_{i})$ is a Bernoulli distribution with parameter $p(x_{j} = 1 \mid C_{i})$, so that $p(x_{j} = 0 \mid C_{i}) = 1 - p(x_{j} = 1 \mid C_{i})$:
$$
p(x_j \mid C_{i}) = 
\begin{cases} 
p(x_j = 1 \mid C_{i}), & \text{if } x_j = 1 \\
1 - p(x_j = 1 \mid C_{i}), & \text{if } x_j = 0 
\end{cases}
\; = \; p(x_j = 1 \mid C_{i})^{x_j} \cdot (1 - p(x_j = 1 \mid C_{i}))^{1 - x_j}
$$

The maximum a posteriori (MAP) decision rule is the following:
$$
\begin{align}
\hat{y} &= \arg\max_{y \in \overline{1, K}} P(C_{y} \mid \mathbf{x}) \\
&= \arg\max_{y \in \overline{1, K}} \frac{P(\mathbf{x} \mid C_{y})P(C_{y})}{P(\mathbf{x})} \\
&= \arg\max_{y \in \overline{1, K}} P(\mathbf{x} \mid C_{y})P(C_{y})
\end{align}
$$

Because $P(\mathbf{x})$ does not depend on the class, we can drop it from the $\arg\max$.


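Since the logarithm is strictly increasing, taking it does not change the $\arg\max$ (we assume $0 < p(x_{i} = 1 \mid C_{y}) < 1$ and $P(C_{y}) > 0$, so that all logarithms are finite):

$$
\begin{align}
\hat{y} &= \arg\max_{y \in \overline{1, K}} P(\mathbf{x} \mid C_{y})P(C_{y}) \\
&= \arg\max_{y \in \overline{1, K}} P(C_{y}) \prod_{i=1}^{d} P(x_{i} \mid C_{y}) \\
&= \arg\max_{y \in \overline{1, K}} \left[ \sum_{i=1}^{d} \log P(x_{i} \mid C_{y}) + \log P(C_{y}) \right] \\
&= \arg\max_{y \in \overline{1, K}} \left[ \sum_{i=1}^{d} \log\left(p(x_{i} = 1 \mid C_{y})^{x_i} \cdot (1 - p(x_{i} = 1 \mid C_{y}))^{1 - x_i}\right) + \log P(C_{y}) \right] \\
&= \arg\max_{y \in \overline{1, K}} \left[ \sum_{i=1}^{d} \Big( x_i \log p(x_{i} = 1 \mid C_{y}) + (1 - x_i) \log\left(1 - p(x_{i} = 1 \mid C_{y})\right) \Big) + \log P(C_{y}) \right]
\end{align}
$$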

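Collecting the terms that multiply $x_i$, the expression under the $\arg\max$ becomes

$$
\sum_{i=1}^{d} x_i \log\left(\frac{p(x_{i} = 1 \mid C_{y})}{1 - p(x_{i} = 1 \mid C_{y})}\right) + \sum_{i=1}^{d} \log\left(1 - p(x_{i} = 1 \mid C_{y})\right) + \log P(C_{y}),
$$

which, for each class $y$, is a linear (affine) function of the input $\mathbf{x}$. The decision rule can therefore be written as:
$$
\hat{y} = \arg\max_{y \in \overline{1, K}} \left( \mathbf{w}_{y}^\top \mathbf{x} + b_{y} \right)
$$

with $\mathbf{w}_{y} = \begin{bmatrix} \log\left(\frac{p(x_{1} = 1 \mid C_{y})}{1 - p(x_{1} = 1 \mid C_{y})}\right) \\ \vdots \\ \log\left(\frac{p(x_{d} = 1 \mid C_{y})}{1 - p(x_{d} = 1 \mid C_{y})}\right) \end{bmatrix}$ and $b_{y} = \log P(C_{y}) + \sum_{i=1}^{d} \log\left(1 - p(x_{i} = 1 \mid C_{y})\right)$.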
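The linear form can be checked numerically on a small example. The sketch below draws random class priors and Bernoulli parameters (illustrative values, not estimated from any data), then compares the direct log-joint $\log P(C_y) + \sum_i \log P(x_i \mid C_y)$ with $\mathbf{w}_y^\top \mathbf{x} + b_y$ over all binary inputs. The names `prior`, `p`, `W`, and `b` are assumptions made for this example.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K, d = 3, 4                                  # number of classes and features
prior = rng.dirichlet(np.ones(K))            # P(C_y), shape (K,)
p = rng.uniform(0.1, 0.9, size=(K, d))       # p[y, i] = p(x_i = 1 | C_y)

# Weights and biases of the linear form derived above.
W = np.log(p) - np.log1p(-p)                 # log-odds, shape (K, d)
b = np.log(prior) + np.log1p(-p).sum(axis=1) # shape (K,)

def log_joint_direct(x):
    """log P(C_y) + sum_i log P(x_i | C_y), evaluated for every class y."""
    lik = np.where(x == 1, p, 1 - p)         # per-feature Bernoulli likelihoods
    return np.log(prior) + np.log(lik).sum(axis=1)

# Enumerate all 2^d binary feature vectors.
all_x = np.array(list(product([0, 1], repeat=d)))
direct = np.array([log_joint_direct(x) for x in all_x])  # shape (2^d, K)
linear = all_x @ W.T + b                                 # shape (2^d, K)

predictions = linear.argmax(axis=1)          # MAP class for each input
print(np.max(np.abs(direct - linear)))
```

The two score matrices should coincide up to floating-point error, so taking the $\arg\max$ over either one yields the same decision.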
