### CS229 Problem Set #1

# CS 229, Public Course

## Problem Set #1: Supervised Learning

---

# Naive Bayes

In this problem, we look at **maximum likelihood parameter estimation** using the [***naive Bayes***](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) assumption.

Here, the input features $x_j$, $j = 1, \dots, n$, to our model are ***discrete, binary-valued variables***, so each

>$$x_j \in \{0, 1\}$$

We call $x = [x_1 \ x_2 \ \dots \ x_n]^T$ the input vector.

For each training example, our **output** target is a ***single binary value***

>$$y \in \{0, 1\}$$

Our **model is then parameterized by**

>$$\phi_{j|y=0} = p(x_j = 1|y = 0)$$
>
>$$\phi_{j|y=1} = p(x_j = 1|y = 1)$$
>
>and
>
>$$\phi_{y} = p(y = 1)$$






We model the **joint** distribution of $(x, y)$ according to

>$$p(y) = (\phi_y)^y(1 - \phi_y)^{1-y}$$
>
>$$\begin{aligned}
>p(x | y=0) &= \prod_{j=1}^n p(x_j | y=0) \\
>&= \prod_{j=1}^n (\phi_{j|y=0})^{x_j} (1- \phi_{j|y=0})^{1 - x_j}
>\end{aligned}$$
>
>$$\begin{aligned}
>p(x | y=1) &= \prod_{j=1}^n p(x_j | y=1) \\
>&= \prod_{j=1}^n (\phi_{j|y=1})^{x_j} (1- \phi_{j|y=1})^{1 - x_j}
>\end{aligned}$$
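
To make the factorization concrete, here is a minimal sketch that evaluates $p(x, y) = p(y)\,p(x|y)$ for a single example. The function name `joint_prob` and the parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def joint_prob(x, y, phi_y, phi_j_y0, phi_j_y1):
    """Evaluate p(x, y) = p(y) * prod_j p(x_j | y) for binary x and y."""
    x = np.asarray(x)
    # Pick the conditional parameters that match the class label.
    phi = np.asarray(phi_j_y1 if y == 1 else phi_j_y0, dtype=float)
    p_y = phi_y**y * (1 - phi_y) ** (1 - y)
    p_x_given_y = np.prod(phi**x * (1 - phi) ** (1 - x))
    return p_y * p_x_given_y

# Hypothetical parameters for n = 2 features (not taken from the problem set).
phi_y = 0.5
phi_j_y0 = [0.2, 0.5]
phi_j_y1 = [0.8, 0.5]

p = joint_prob([1, 0], 1, phi_y, phi_j_y0, phi_j_y1)
print(p)
```

Summing `joint_prob` over all $2^n$ inputs and both labels should give 1, which is a quick sanity check that the model defines a valid distribution.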

* **a)** Find the **joint log-likelihood function**

$$\ell ( \varphi) = \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}; \varphi)$$

in terms of the model parameters given above.

Here, $\varphi$ represents the entire **set of parameters**:

>$\phi_y$
>
>$\phi_{j|y=0}, $
>
>$\phi_{j|y=1},$
>
>$j = 1, \dots , n$

* **b)** Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that

>$$\phi_{j|y=0} = \frac {\sum_{i=1}^m \mathbf 1 \{x_j^{(i)} = 1 \land y^{(i)} = 0\}}
{\sum_{i=1}^m \mathbf 1 \{ y^{(i)} = 0\} }$$
>

>$$\phi_{j|y=1} = \frac {\sum_{i=1}^m \mathbf 1 \{x_j^{(i)} = 1 \land y^{(i)} = 1\}}
{\sum_{i=1}^m \mathbf 1 \{ y^{(i)} = 1\} }$$
>

>$$\phi_y = \frac {\sum_{i=1}^m \mathbf 1 \{y^{(i)} = 1\}}
{m}$$
>
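
The estimates above are just empirical frequencies, so they can be computed by counting. The sketch below does this on a small made-up dataset and checks numerically that moving $\phi_y$ away from its estimate lowers the log-likelihood. The helper names `fit_naive_bayes` and `log_likelihood` are assumptions for illustration, and the check is not a substitute for the proof the problem asks for.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum likelihood estimates by counting (no smoothing)."""
    X, y = np.asarray(X), np.asarray(y)
    phi_y = np.mean(y == 1)
    phi_j_y0 = X[y == 0].mean(axis=0)  # fraction of y=0 examples with x_j = 1
    phi_j_y1 = X[y == 1].mean(axis=0)  # fraction of y=1 examples with x_j = 1
    return phi_y, phi_j_y0, phi_j_y1

def log_likelihood(X, y, phi_y, phi_j_y0, phi_j_y1):
    """Joint log-likelihood: sum_i log p(x^(i), y^(i))."""
    X, y = np.asarray(X), np.asarray(y)
    # Row i uses the conditional parameters that match y^(i); shape (m, n).
    phi = np.where(y[:, None] == 1, phi_j_y1, phi_j_y0)
    log_px = np.sum(X * np.log(phi) + (1 - X) * np.log(1 - phi), axis=1)
    log_py = y * np.log(phi_y) + (1 - y) * np.log(1 - phi_y)
    return np.sum(log_px + log_py)

# Made-up toy data: m = 6 examples, n = 2 binary features.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

phi_y, phi_j_y0, phi_j_y1 = fit_naive_bayes(X, y)
best = log_likelihood(X, y, phi_y, phi_j_y0, phi_j_y1)
perturbed = log_likelihood(X, y, phi_y + 0.1, phi_j_y0, phi_j_y1)
print(phi_y, phi_j_y0, phi_j_y1, best > perturbed)
```

The toy data is chosen so that no estimate is exactly 0 or 1; otherwise `np.log` would produce infinities, which is the usual motivation for smoothing.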

* **c)** Consider making a prediction on some **new data point** $x$ using the most likely class estimate generated by the ***naive Bayes algorithm***.

Show that the **hypothesis** returned by naive Bayes is a **linear classifier**, i.e.,

if $p(y = 0|x)$ and $p(y = 1|x)$ are the class probabilities returned by naive Bayes, show that:

>there exists some $\theta \in \mathbb R^{n+1}$ such that
>
>$$p(y = 1|x) \geq p(y = 0|x) \Longleftrightarrow \theta^T
\begin{bmatrix}1\\
x \end{bmatrix}
\geq 0$$
>
>(Assume $\theta_0$ is an intercept term.)
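
One way to see the claim numerically: taking the log of the ratio $p(x, y=1) / p(x, y=0)$ and collecting the terms that multiply each $x_j$ suggests a candidate $\theta$. The sketch below builds that candidate and compares it against the log-odds for every binary input. The function names and parameter values are hypothetical, and this is a check on one example, not the requested proof.

```python
import itertools
import numpy as np

def naive_bayes_theta(phi_y, phi_j_y0, phi_j_y1):
    """Candidate theta = [theta_0, theta_1, ..., theta_n] from the log-odds."""
    p0 = np.asarray(phi_j_y0, dtype=float)
    p1 = np.asarray(phi_j_y1, dtype=float)
    # Intercept: class prior log-odds plus the terms that do not multiply x_j.
    theta_0 = np.log(phi_y / (1 - phi_y)) + np.sum(np.log((1 - p1) / (1 - p0)))
    # Weights: coefficient of x_j after collecting terms.
    theta_rest = np.log(p1 / p0) - np.log((1 - p1) / (1 - p0))
    return np.concatenate(([theta_0], theta_rest))

def log_joint(x, y, phi_y, phi_j_y0, phi_j_y1):
    """log p(x, y) under the naive Bayes model."""
    phi = np.asarray(phi_j_y1 if y == 1 else phi_j_y0, dtype=float)
    prior = phi_y if y == 1 else 1 - phi_y
    return np.log(prior) + np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# Hypothetical parameters for n = 3 features.
phi_y, phi_j_y0, phi_j_y1 = 0.3, [0.2, 0.6, 0.5], [0.7, 0.4, 0.9]
theta = naive_bayes_theta(phi_y, phi_j_y0, phi_j_y1)

checks = []
for bits in itertools.product([0, 1], repeat=3):
    x = np.array(bits)
    # p(x) cancels in the posterior ratio, so joint log-odds suffice.
    log_odds = log_joint(x, 1, phi_y, phi_j_y0, phi_j_y1) - log_joint(
        x, 0, phi_y, phi_j_y0, phi_j_y1
    )
    linear = theta @ np.concatenate(([1], x))
    checks.append(bool(np.isclose(log_odds, linear)))

agree = all(checks)
print(agree)
```

Because the log-odds equal $\theta^T [1, x]^T$ for every input in this example, the sign of the linear function matches the naive Bayes decision.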

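# Multivariate Least Squares

Suppose we have a training set with multiple outputs for each example:

>$$\{(x^{(i)}, \  y^{(i)})\ , \ i=1,\dots,m\}$$
>
>$$x^{(i)} \in \mathbb R ^n$$
>
>$$y^{(i)} \in \mathbb R ^p$$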


Thus for each training example, $y^{(i)}$ is ***vector-valued***, with $p$ **entries**.

We wish to use a **linear model** to predict the outputs, as in least squares, by specifying the parameter matrix $\Theta$ in


> $$y = \Theta ^T x$$
>
> where $\Theta \in \mathbb R ^ {n \times p}$

* **a)** The cost function for this case is
  
  >  $$J(\Theta) = \frac 1 2 \sum_{i=1}^m \sum_{j=1}^p \left( (\Theta^T x^{(i)})_j - y_j ^{(i)}\right)^2$$


  Write $J(\Theta)$ in **matrix-vector notation** (i.e., without using any summations).
  
  > ***Hint:*** Start with the $m \times n$ ***design matrix***:

$$X = \begin{bmatrix}
- \ \ (x^{(1)})^T \ \ - \\
- \ \ (x^{(2)})^T \ \ - \\
\vdots \\
- \ \ (x^{(m)})^T \ \ - \\
\end{bmatrix}$$

>  and the $m \times p$ ***target matrix***:

$$Y = \begin{bmatrix}
- \ \ (y^{(1)})^T \ \ - \\
- \ \ (y^{(2)})^T \ \ - \\
\vdots \\
- \ \ (y^{(m)})^T \ \ - \\
\end{bmatrix}$$

> and then work out how to express $J(\Theta)$ in terms of these matrices.

### Solution:
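| Cost function|
|--|
|$$\huge J(\Theta) = \frac 1 2 \mathrm{tr} \left( (X \Theta - Y)^T(X \Theta - Y) \right)$$|

Row $i$ of $X \Theta - Y$ is $(\Theta^T x^{(i)} - y^{(i)})^T$, so the double sum in $J(\Theta)$ is the sum of the squared entries of $X \Theta - Y$. That sum equals the trace of the $p \times p$ matrix $(X \Theta - Y)^T(X \Theta - Y)$, which makes $J(\Theta)$ a scalar.

> where
>
>|Matrix|$$X$$|$$\Theta$$|$$Y$$|
>|--|--|--|--|
>|Dim|$$m \times n$$|$$n \times p$$|$$m \times p$$|
>
>
>| |$$(X \Theta - Y)^T$$|$$X \Theta - Y$$|
>|--|--|--|
>|Dim|$$p \times m$$|$$m \times p$$|
>
>
>| |$$(X \Theta - Y)^T(X \Theta - Y)$$|$$J(\Theta)$$|
>|--|--|--|
>|Dim|$$p \times p$$|scalar|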
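
As a quick numerical check, the sketch below compares the double-sum definition of $J(\Theta)$ with a summation-free form that takes the trace of the $p \times p$ product, and with the equivalent squared Frobenius norm. The random data and the sizes `m`, `n`, `p` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 5, 3, 2  # arbitrary sizes for illustration
X = rng.normal(size=(m, n))      # design matrix, rows are (x^(i))^T
Y = rng.normal(size=(m, p))      # target matrix, rows are (y^(i))^T
Theta = rng.normal(size=(n, p))  # parameter matrix

# Definition with explicit sums over examples i and outputs j.
J_sum = 0.5 * sum(
    ((Theta.T @ X[i])[j] - Y[i, j]) ** 2
    for i in range(m)
    for j in range(p)
)

# Summation-free forms: trace of the p x p product, or squared Frobenius norm.
R = X @ Theta - Y
J_trace = 0.5 * np.trace(R.T @ R)
J_frob = 0.5 * np.linalg.norm(R, "fro") ** 2

print(J_sum, J_trace, J_frob)
```

The product `R.T @ R` on its own is a $p \times p$ matrix; only its trace is the scalar cost.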

* **b)** Find the **closed form solution** for $\Theta$ which minimizes $J(\Theta)$.
  
  This is the **equivalent** of the ***normal equations*** for the **multivariate case**.

### Solution:

Differentiating the cost from part a) with respect to $\Theta$:

> |Gradient of loss|
> |-|
> |$$\large \nabla_\Theta J(\Theta) = X^T \left( X \Theta - Y \right)$$|

Expanding:

$$\nabla_\Theta J(\Theta) = X^T X \Theta - X^T Y$$

To minimize the loss, set the gradient to zero:

$$\nabla_\Theta J(\Theta) \stackrel{\text{set}}{=} 0$$


$$X^T X \Theta - X^T Y = 0$$


$$X^T X \Theta = X^T Y$$

Assuming $X^T X$ is invertible:

> |Closed form solution for $\Theta$|
> |-|
> |$$\huge \Theta = (X^T X)^{-1} X^T Y$$|
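
A minimal sketch of the closed form on random data, assuming $X^T X$ is invertible. It solves the linear system $X^T X \Theta = X^T Y$ with `np.linalg.solve` rather than forming the inverse explicitly, then checks that the gradient vanishes and that the result agrees with NumPy's least-squares solver. Sizes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 50, 3, 2  # arbitrary sizes for illustration
X = rng.normal(size=(m, n))
Y = rng.normal(size=(m, p))

# Solve X^T X Theta = X^T Y for all p output columns at once.
Theta = np.linalg.solve(X.T @ X, X.T @ Y)

# The gradient X^T (X Theta - Y) should be (numerically) zero at the minimizer.
grad = X.T @ (X @ Theta - Y)

# Cross-check against NumPy's least-squares routine.
Theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.abs(grad).max(), np.allclose(Theta, Theta_lstsq))
```

Solving the system is generally preferred over computing $(X^T X)^{-1}$ directly, since it is cheaper and better conditioned numerically.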


* **c)** Suppose that **instead** of considering the multivariate vectors $y^{(i)}$ **all at once**, we **compute each variable** $y_j^{(i)}$ separately for each $j = 1, \dots , p$.

  In this case, we have $p$ individual linear models, of the form:

  > $$y_j^{(i)} = \theta_j^T x^{(i)}$$
  >
  > with $j = 1, \dots, p$

  So here, each $\theta_j \in \mathbb R ^n$.
  
  How do the parameters from these $p$ **independent least squares** problems **compare** to the **multivariate** solution?

### Solution:

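Each $\theta_j$ corresponds to an **output dimension** $j$, where $j = 1,\dots, p$. Let $y_j \in \mathbb R^m$ denote the $j$-th column of $Y$, i.e. the $j$-th output across all $m$ training examples.

Solving each least squares problem independently gives

$$\theta_j = (X^T X)^{-1} X^T y_j$$

Since $\Theta \in \mathbb R^{n \times p}$, the $\theta_j$ are the **columns** of $\Theta$:

$$\Theta = \begin{bmatrix}
\theta_1 & \theta_2 & \cdots & \theta_p
\end{bmatrix}$$

Concretely, column $j$ of the multivariate solution $(X^T X)^{-1} X^T Y$ is $(X^T X)^{-1} X^T y_j$, so

$$\Theta = \begin{bmatrix}
(X^T X)^{-1} X^T y_1 & (X^T X)^{-1} X^T y_2 & \cdots & (X^T X)^{-1} X^T y_p
\end{bmatrix}$$

The parameters from the $p$ independent least squares problems are therefore **identical** to the multivariate solution.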
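
The comparison can also be checked numerically. The sketch below, on arbitrary random data, solves the $p$ single-output problems one at a time, stacks the resulting $\theta_j$ as columns, and compares the result with the joint multivariate solve.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 40, 4, 3  # arbitrary sizes for illustration
X = rng.normal(size=(m, n))
Y = rng.normal(size=(m, p))

# Joint multivariate solution: one solve with all p target columns.
Theta_joint = np.linalg.solve(X.T @ X, X.T @ Y)

# p independent problems: one solve per output column y_j of Y.
thetas = [np.linalg.solve(X.T @ X, X.T @ Y[:, j]) for j in range(p)]
Theta_cols = np.column_stack(thetas)  # theta_j becomes column j

print(np.allclose(Theta_joint, Theta_cols))
```

Each `thetas[j]` has shape `(n,)`, so stacking them as columns yields the same $n \times p$ shape as $\Theta$.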