## 1. Gaussian Statistics

The density of a multivariate normal random variable with mean $\mu$ and covariance matrix $\Sigma$ is given by

$P_{\mathcal{N}}(x \mid \mu, \Sigma)=(\mathrm{det}(2 \pi \Sigma))^{-1 / 2} \exp (-\frac{1}{2}(x-\mu)^{\top} \Sigma^{-1}(x-\mu)).$

(1) [4pts] Consider the case where the random vector is split as $x=(x_1, x_2)$, with mean $\mu=(\mu_1, \mu_2)$ and (strictly positive definite) covariance matrix $\Sigma=(\begin{array}{ll}\Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2}\end{array})$. Using the formula for conditional densities, show that the conditional distribution $x_2 \mid x_1=z$ is a multivariate normal with mean $\mu_2+\Sigma_{2,1}(\Sigma_{1,1})^{-1}(z-\mu_1)$ and covariance matrix $\Sigma_{2,2}-\Sigma_{2,1} (\Sigma_{1,1})^{-1} \Sigma_{1,2}$ in the special case where $\Sigma$ is $2 \times 2$. Hint: complete the square in the exponential. Use the formula $(\begin{array}{ll}a & b \\ c & d\end{array})^{-1}=\frac{1}{a d-b c}(\begin{array}{cc}d & -b \\ -c & a\end{array})$.



<div style="color:blue">


References:
* [Bishop Section 2.3](https://www.seas.upenn.edu/~cis520/papers/Bishop_2.3.pdf)
* [The Multivariate Gaussian (Berkeley)](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter13.pdf#page139)
* [Wikipedia of Multivariate Normal Distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)
* [this proof](https://statproofbook.github.io/P/mvn-cond.html)
* [StackExchange](https://stats.stackexchange.com/questions/30588/deriving-the-conditional-distributions-of-a-multivariate-normal-distribution)
* [Tutorial on Multivariate Normal (UIUC)](https://education.illinois.edu/docs/default-source/carolyn-anderson/edpsy584/lectures/MultivariateNormal-beamer-online.pdf)

</div>

<div style="color:blue">


A few key steps:

Expand the quadratic form $(x - \mu)^{\top} \Sigma^{-1} (x - \mu)$:

$$
\begin{align*}
&\begin{pmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{pmatrix}
\begin{pmatrix} \frac{\Sigma_{2,2}}{\Delta} & \frac{-\Sigma_{1,2}}{\Delta} \\ \frac{-\Sigma_{2,1}}{\Delta} & \frac{\Sigma_{1,1}}{\Delta} \end{pmatrix}
\begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix} \\
&= \frac{1}{\Delta} ( \Sigma_{2,2}(x_1 - \mu_1)^2 - 2\Sigma_{1,2}(x_1 - \mu_1)(x_2 - \mu_2) + \Sigma_{1,1}(x_2 - \mu_2)^2 ),
\end{align*}
$$

where $\Delta = \Sigma_{1,1} \Sigma_{2,2} - \Sigma_{1,2} \Sigma_{2,1}$
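
The completing-the-square step that the hint refers to can be sketched as follows. The shorthand $a = z - \mu_1$ and $b = x_2 - \mu_2$ is introduced here only for readability, and the symmetry $\Sigma_{1,2} = \Sigma_{2,1}$ is used:

$$
\begin{align*}
\frac{1}{\Delta} ( \Sigma_{2,2} a^2 - 2\Sigma_{1,2} a b + \Sigma_{1,1} b^2 )
&= \frac{\Sigma_{1,1}}{\Delta} ( b - \frac{\Sigma_{2,1}}{\Sigma_{1,1}} a )^2 + \frac{a^2}{\Sigma_{1,1}}.
\end{align*}
$$

The second term does not depend on $x_2$; it is the exponent of the marginal density of $x_1$ and cancels when dividing the joint density by that marginal. The first term is the exponent of a normal density in $x_2$ with mean $\mu_2 + \Sigma_{2,1}(\Sigma_{1,1})^{-1}(z - \mu_1)$ and variance $\Delta / \Sigma_{1,1} = \Sigma_{2,2} - \Sigma_{2,1}(\Sigma_{1,1})^{-1}\Sigma_{1,2}$.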

For part (2), the estimate of $y$ for a new input $x$ is the conditional mean with the estimated parameters plugged in:

$$ \hat{y} = \mu_{2, \text{cond}} = \hat{\mu}_2 + \hat{\Sigma}_{2,1}(\hat{\Sigma}_{1,1})^{-1}(x - \hat{\mu}_1). $$

</div>

(2) [3pts] Assume that data points $(x_i, y_i)_{1 \leq i \leq N}$ are obtained as independent samples from a joint normal distribution with unknown mean and covariance. Derive the maximum likelihood estimates of its mean and covariance. Use part (1) to derive an estimate of $y$ for a previously unseen $x$.


<div style="color:blue">

Writing $w_i = (x_i, y_i)^\top$ for the $i$-th stacked sample, the log-likelihood function is:

$\ln L(\mu, \Sigma) = -\frac{N}{2}\ln \mathrm{det}(2\pi\Sigma) - \frac{1}{2}\sum_{i=1}^{N}(w_i-\mu)^\top \Sigma^{-1}(w_i-\mu)$

Check how the derivatives are calculated using **[The Multivariate Gaussian (Berkeley) (Page 7-9)](https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter13.pdf#page139)**

The MLE for the mean $\hat{\mu}$ is the sample mean:

$$ \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} \begin{pmatrix} x_i \\ y_i \end{pmatrix}. $$

The MLE for the covariance matrix $\hat{\Sigma}$ is calculated using:
     
$$ \hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} ( \begin{pmatrix} x_i \\ y_i \end{pmatrix} - \hat{\mu} ) ( \begin{pmatrix} x_i \\ y_i \end{pmatrix} - \hat{\mu} )^{\top}. $$
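
As a minimal sketch of how these estimates can be used, the following NumPy snippet fits the joint Gaussian for scalar $x$ and $y$ and then predicts $y$ with the conditional mean from part (1). The function names `fit_joint_gaussian` and `predict_y`, and the synthetic data, are illustrative assumptions rather than part of the original solution.

```python
import numpy as np

def fit_joint_gaussian(x, y):
    """MLE of the mean and covariance of the stacked vector (x_i, y_i)."""
    W = np.column_stack([x, y])                   # shape (N, 2)
    mu_hat = W.mean(axis=0)                       # sample mean
    centered = W - mu_hat
    sigma_hat = centered.T @ centered / len(W)    # divides by N, not N - 1
    return mu_hat, sigma_hat

def predict_y(x_new, mu_hat, sigma_hat):
    """Plug-in conditional mean of y given x (scalar x and y)."""
    slope = sigma_hat[1, 0] / sigma_hat[0, 0]     # Sigma_21 * Sigma_11^{-1}
    return mu_hat[1] + slope * (x_new - mu_hat[0])

# Synthetic data (hypothetical): y is roughly linear in x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

mu_hat, sigma_hat = fit_joint_gaussian(x, y)
y_hat = predict_y(0.3, mu_hat, sigma_hat)
```

Note that the MLE divides by $N$; the unbiased sample covariance would divide by $N - 1$ instead.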

</div>

(3) [3pts] Consider the conditional model $y_i=\alpha^{\top} x_i+\beta+\epsilon_i$ for independently distributed $\epsilon_i$ with mean zero and identity covariance matrix. Derive the maximum likelihood estimate of $\alpha, \beta$ and show how to use this model to predict $y$ for a previously unseen $x$. Compare the result to (2) and comment.

<div style="color:blue">

The likelihood function based on the given model is

$$
L(\alpha, \beta)=\prod_{i=1}^N P(\epsilon_i) = \prod_{i=1}^N P (y_i - \alpha^{\top} x_i - \beta)
$$

Assuming the $\epsilon_i$ are normally distributed (the problem states only their mean and covariance), with mean zero and unit variance, the likelihood function becomes

$$
L(\alpha, \beta)=\prod_{i=1}^N P(\epsilon_i) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}} \exp( -\frac{1}{2}(y_i - \alpha^{\top} x_i - \beta)^2)
$$


It's easier to work with the log-likelihood:


$$ \log L(\alpha, \beta) = \sum_{i=1}^{N} \left[ -\frac{1}{2}(y_i - \alpha^{\top} x_i - \beta)^2 \right] - \frac{N}{2}\log(2\pi). $$


The derivative of the log-likelihood w.r.t. $\alpha$ is:

$$ \frac{\partial}{\partial \alpha} \log L(\alpha, \beta) = \sum_{i=1}^{N} x_i(y_i - \alpha^{\top} x_i - \beta). $$

The derivative of the log-likelihood w.r.t. $\beta$ is:

$$ \frac{\partial}{\partial \beta} \log L(\alpha, \beta) = \sum_{i=1}^{N} (y_i - \alpha^{\top} x_i - \beta). $$

Once $\alpha$ and $\beta$ are estimated, predict $y$ for a new observation $x$ using:

$$ \hat{y} = \hat{\alpha}^{\top} x + \hat{\beta}. $$


$$
\begin{align*}
\sum_{i=1}^{N} (y_i - \alpha^{\top} x_i - \beta) &= 0 \\
\sum_{i=1}^{N} y_i - \sum_{i=1}^{N} \alpha^{\top} x_i - N\beta &= 0 \\
N\beta &= \sum_{i=1}^{N} y_i - \alpha^{\top} \sum_{i=1}^{N} x_i \\
\beta &= \frac{1}{N} ( \sum_{i=1}^{N} y_i - \alpha^{\top} \sum_{i=1}^{N} x_i ).
\end{align*}
$$

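Setting the derivative w.r.t. $\alpha$ to zero, and using that $\alpha^{\top} x_i$ is a scalar so that $x_i \alpha^{\top} x_i = x_i x_i^{\top} \alpha$:

$$
\begin{align*}
\sum_{i=1}^{N} x_i(y_i - \alpha^{\top} x_i - \beta) &= 0 \\
\sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i x_i^{\top} \alpha - \sum_{i=1}^{N} x_i \beta &= 0 \\
\sum_{i=1}^{N} x_i y_i - ( \sum_{i=1}^{N} x_i x_i^{\top} ) \alpha - \beta \sum_{i=1}^{N} x_i &= 0.
\end{align*}
$$


Now substitute $\beta$ from the previous step:

$$
\begin{align*}
\sum_{i=1}^{N} x_i y_i - ( \sum_{i=1}^{N} x_i x_i^{\top} ) \alpha - \frac{1}{N} ( \sum_{i=1}^{N} y_i - \alpha^{\top} \sum_{i=1}^{N} x_i ) \sum_{i=1}^{N} x_i &= 0.
\end{align*}
$$

Expanding the last term:

$$\sum_{i=1}^{N} x_i y_i - ( \sum_{i=1}^{N} x_i x_i^{\top} ) \alpha - \frac{1}{N} \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i + \frac{1}{N} ( \sum_{i=1}^{N} x_i ) ( \sum_{i=1}^{N} x_i )^{\top} \alpha = 0.$$

Rearrange terms to collect all terms involving $\alpha$ on one side:

$$
( \sum_{i=1}^{N} x_i x_i^{\top} - \frac{1}{N} ( \sum_{i=1}^{N} x_i ) ( \sum_{i=1}^{N} x_i )^{\top} ) \alpha = \sum_{i=1}^{N} x_i y_i - \frac{1}{N} \sum_{i=1}^{N} y_i \sum_{i=1}^{N} x_i.
$$

We can write this equation in matrix notation. Let $X$ be the matrix with rows $x_i^{\top}$ and $Y$ be the vector of $y_i$. The equation becomes:

$$( X^{\top} X - \frac{1}{N} X^{\top} \mathbf{1} \mathbf{1}^{\top} X ) \alpha = X^{\top} Y - \frac{1}{N} X^{\top} \mathbf{1} \mathbf{1}^{\top} Y$$

where $\mathbf{1}$ is a vector of ones.

Dividing both sides by $N$, the left-hand matrix is the sample covariance of $x$ and the right-hand side is the sample cross-covariance of $x$ and $y$, so $\hat{\alpha} = (\hat{\Sigma}_{1,1})^{-1} \hat{\Sigma}_{1,2}$ and $\hat{\beta} = \hat{\mu}_2 - \hat{\alpha}^{\top} \hat{\mu}_1$. The prediction $\hat{y} = \hat{\alpha}^{\top} x + \hat{\beta} = \hat{\mu}_2 + \hat{\Sigma}_{2,1}(\hat{\Sigma}_{1,1})^{-1}(x - \hat{\mu}_1)$ therefore coincides with the conditional-mean estimate from (2): least-squares regression and conditioning the fitted joint Gaussian give the same predictor.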
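
The comparison with (2) can be checked numerically. The sketch below, on synthetic data chosen only for illustration, solves the normal equations for $\alpha$ and $\beta$ and compares the resulting prediction with the plug-in conditional mean of the joint Gaussian fit. All variable names are assumptions introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 300, 3
X = rng.normal(size=(N, d))                       # rows are x_i^T
Y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + rng.normal(size=N)

# Normal equations for alpha after eliminating beta.
ones = np.ones(N)
A = X.T @ X - np.outer(X.T @ ones, ones @ X) / N
b = X.T @ Y - (X.T @ ones) * (ones @ Y) / N
alpha_hat = np.linalg.solve(A, b)
beta_hat = Y.mean() - alpha_hat @ X.mean(axis=0)

# Plug-in conditional mean from the joint Gaussian MLE of part (2).
S = np.cov(np.column_stack([X, Y]), rowvar=False, bias=True)
S_xx, S_xy = S[:d, :d], S[:d, d]
alpha_gauss = np.linalg.solve(S_xx, S_xy)

x_new = np.array([0.2, -0.1, 1.0])
y_ols = alpha_hat @ x_new + beta_hat
y_gauss = Y.mean() + alpha_gauss @ (x_new - X.mean(axis=0))
```

Under these assumptions the two predictions agree up to floating-point error, which is the point of the comparison asked for in the question.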

</div>


## 2. Dimensionality Reduction

(1) [PCA] There are many ways to "project" data $X \in \mathbb{R}^{n \times d}$ from high dimensions to lower dimensions $\hat{X} \in \mathbb{R}^{n \times p}$. $n$ is the number of data points, $d$ is the dimension of the original data, and $p$ is the dimension of the projected representations. PCA aims to find the best linear projection, i.e., the one that minimizes the reconstruction error $\|X-\hat{X}\|_F$ (the norm here is the Frobenius norm, described below). It does so by computing the covariance matrix of the data $C=\frac{1}{n} X^{\top} X$, and then projecting the data onto the first few eigenvectors of $C$. But why does projecting the data onto the eigenvectors of the covariance matrix with the largest eigenvalues minimize the reconstruction error? Please prove this mathematically. For simplicity, we will consider the case of PCA from $d$ dimensions to 1 dimension. We will also assume that $X$ is "centered," meaning that the mean across all samples of every data dimension is 0. (Hint: establish the connection between minimizing the reconstruction error and maximizing projected variance.)

**The Frobenius norm, sometimes also called the Euclidean norm (a term unfortunately also used for the vector $L^2$-norm), is the matrix norm of an $m \times n$ matrix $A$ defined as the square root of the sum of the absolute squares of its elements, $\|A\|_F=\sqrt{\sum_{i=1}^m \sum_{j=1}^n\left|a_{i j}\right|^2}$.**


<div style="color:blue">
    
See [Pattern Recognition and Machine Learning (Page 562)](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) 
    

**Objective of PCA:**

1. **Maximize Variance**: When reducing the dimensionality of the data, PCA aims to preserve as much variance as possible. 

2. **Minimize Reconstruction Error**: PCA also aims to minimize the difference between the original data and its lower-dimensional representation, quantified by the reconstruction error.

For PCA, maximizing the variance of the projected data is equivalent to minimizing the reconstruction error.

Given data matrix $X \in \mathbb{R}^{n \times d}$, and assuming it is centered (mean of 0 in each dimension), we project it to 1 dimension. Let $v \in \mathbb{R}^d$ be the unit vector representing the direction onto which we project the data. The projected data $\hat{X}$ is then $Xv$.

1. **Covariance Matrix and Variance Maximization:**

   The covariance matrix $C = \frac{1}{n} X^\top X$. The variance of the projected data is given by $v^\top C v$. Our goal is to maximize this variance subject to $\|v\|_2 = 1$ (since $v$ is a unit vector).

2. **Reconstruction Error:**

   The reconstruction error is the Frobenius norm of the difference between $X$ and its reconstruction $\hat{X}v^\top$, which is $\|X - \hat{X}v^\top\|_F$. Substituting $\hat{X}$ with $Xv$, this becomes $\|X - Xvv^\top\|_F$.

3. **Relating Variance and Reconstruction Error:**

   The key step is to relate the variance maximization to the minimization of the reconstruction error. This can be done by showing that maximizing $v^\top C v$ is equivalent to minimizing $\|X - Xvv^\top\|_F$.

**Maximizing Variance:**

Maximizing $v^\top C v$ under the constraint $\|v\|_2 = 1$ leads to an eigenvalue problem: setting the gradient of the Lagrangian $v^\top C v - \lambda (v^\top v - 1)$ to zero gives $Cv = \lambda v$, so every stationary point is an eigenvector of $C$ and attains the value $v^\top C v = \lambda$. The maximizer is therefore the eigenvector corresponding to the largest eigenvalue of $C = \frac{1}{n} X^{\top} X$, which is the direction of maximum variance in the data.

**Minimizing Reconstruction Error:**

Minimizing the Frobenius norm is equivalent to minimizing its square. Expanding $\|X - Xvv^\top\|_F^2$, and using the cyclic property of the trace together with $v^\top v = 1$, we get:

$$
\begin{aligned}
\|X - Xvv^\top\|_F^2 &= \text{tr}((X - Xvv^\top)^\top (X - Xvv^\top)) \\
&= \text{tr}(X^\top X) - 2\text{tr}(vv^\top X^\top X) + \text{tr}(vv^\top X^\top Xvv^\top) \\
&= \text{tr}(X^\top X) - 2 v^\top X^\top Xv + v^\top X^\top Xv \\
&= \text{tr}(X^\top X) - v^\top X^\top Xv \\
&= \text{const} - n \, v^\top C v
\end{aligned}
$$

Minimizing the reconstruction error is therefore equivalent to maximizing $v^\top C v$, which, as we established, is maximized when $v$ is the eigenvector corresponding to the largest eigenvalue of $C$.
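
A quick numerical sanity check of this identity, as a sketch on random centered data (the data and the helper name `recon_error_sq` are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                       # center every column
n = X.shape[0]
C = X.T @ X / n

eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
v_top = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

def recon_error_sq(v):
    """Squared Frobenius reconstruction error ||X - X v v^T||_F^2."""
    return np.linalg.norm(X - np.outer(X @ v, v), "fro") ** 2

# The error should equal tr(X^T X) - n * v^T C v for any unit vector v.
identity_gap = recon_error_sq(v_top) - (np.trace(X.T @ X) - n * v_top @ C @ v_top)

# Random unit directions should never reconstruct better than v_top.
random_dirs = rng.normal(size=(1000, 5))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = min(recon_error_sq(u) for u in random_dirs)
```

Here `identity_gap` should be zero up to rounding, and `best_random` should not fall below `recon_error_sq(v_top)`.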

**Conclusion:**

Hence, projecting the data onto the leading eigenvectors of the covariance matrix $C$ (those with the largest eigenvalues) both maximizes the variance of the projected data and minimizes the reconstruction error, fulfilling the objectives of PCA.


**Connection to Eigenvectors**:

$v^{\top}X^{\top}Xv$ is maximized over unit vectors when $v$ is the eigenvector of $X^{\top}X$ corresponding to its largest eigenvalue. Since $C = \frac{1}{n} X^{\top} X$ differs from $X^{\top}X$ only by the factor $\frac{1}{n}$, both matrices share the same eigenvectors, so this is also the direction of maximum variance.

NOTE: There are 2 interpretations for PCA:

E.g., for the first component.

Maximum Variance Direction: the $1^{\text{st}}$ PC is a unit vector $\mathbf{v}$ such that projection onto this vector captures the maximum variance in the data (out of all possible one-dimensional projections). With the rows of $\mathbf{X}$ being the samples $\mathbf{x}_i^T$:
$$
\frac{1}{n} \sum_{i=1}^n(\mathbf{v}^T \mathbf{x}_i)^2=\frac{1}{n} \mathbf{v}^T \mathbf{X}^T \mathbf{X} \mathbf{v}
$$

Minimum Reconstruction Error: the $1^{\text{st}}$ PC is a unit vector $\mathbf{v}$ such that projection onto this vector yields the minimum MSE reconstruction:
$$
\frac{1}{n} \sum_{i=1}^n\left\|\mathbf{x}_i-(\mathbf{v}^T \mathbf{x}_i) \mathbf{v}\right\|^2
$$

More in [these slides from CMU](https://www.cs.cmu.edu/~mgormley/courses/10701-f16/slides/lecture14-pca.pdf)



</div>

(2) [KernelPCA] Show how to use kernels in PCA, i.e., derive kernelPCA and its projected low-dimensional representation.

<div style="color:blue">

See [PRML (Page 586)](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
    
[Reference]
* Paper of [Kernel PCA](https://arxiv.org/pdf/1207.3538.pdf)
* [Kernel PCA (Wiki)](https://en.wikipedia.org/wiki/Kernel_principal_component_analysis)



* Kernel PCA allows capturing non-linear structures in the data, which standard PCA might miss. It's particularly useful in scenarios where the data forms complex, non-linear manifolds. The choice of the kernel function (like Gaussian, polynomial, etc.) is crucial and can significantly impact the performance and suitability of kernel PCA for a given task.
* Kernel PCA maps the input data into a higher-dimensional feature space $\mathcal{F}$ (where the data that is not linearly separable in the original space might become linearly separable) using a non-linear mapping $\phi$.
* Kernel Trick: The kernel trick is used to compute the inner products in the feature space $\mathcal{F}$ without explicitly computing the mapping $\phi$. A kernel function $k(x_i, x_j)$ computes the inner product $\langle \phi(x_i), \phi(x_j) \rangle$ in $\mathcal{F}$.
* Kernel Matrix: For a dataset $X$, the kernel matrix $K$ is defined as $K_{ij} = k(x_i, x_j)$. This matrix is symmetric and positive semi-definite.

* Centering the Kernel Matrix:
In practice, it’s important to center the kernel matrix $K$ to ensure that the data is centered in the feature space. This is done by transforming $K$ as follows:

   $$K' = K - 1_n K - K 1_n + 1_n K 1_n$$

   where $1_n$ is an $n \times n$ matrix with all elements equal to $1/n$.

* Eigenvalue Problem: Solve the eigenvalue problem for the centered kernel matrix $K'$:
   $K' v = \lambda v$

   Here, $v$ are the eigenvectors and $\lambda$ are the eigenvalues. Unlike standard PCA, we do not need to compute the covariance matrix explicitly.

### Projecting New Data Points:

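   To project a new data point $x$ into this kernel PCA space, we compute one coordinate for each selected eigenvector $v$ with eigenvalue $\lambda$:

   $z = \frac{1}{\sqrt{\lambda}} \sum_{i=1}^n v_i \tilde{k}(x, x_i)$

   where $v_i$ is the $i$-th entry of the unit-norm eigenvector $v$, $\tilde{k}$ is the kernel centered in the same way as $K'$, and the factor $1/\sqrt{\lambda}$ scales the corresponding principal axis in feature space to unit length. The eigenvectors with the largest eigenvalues give the principal components.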

### Choosing Components:

   Select the top $k$ components (eigenvectors corresponding to the largest eigenvalues) for the low-dimensional representation. The choice of $k$ depends on the desired balance between dimensionality reduction and retaining the variance of the original dataset.
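
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the RBF kernel, the value `sigma=1.0`, the function names, and the synthetic two-ring data are all choices made here. New points are centered with the training kernel statistics, and the eigenvectors are rescaled by $1/\sqrt{\lambda}$ so that the feature-space axes have unit norm.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def fit_kernel_pca(X, n_components, sigma):
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(K_centered)     # ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[top], eigvecs[:, top]
    coeffs = V / np.sqrt(lam)       # unit-norm principal axes in feature space
    return K, lam, V, coeffs

def transform(X_new, X_train, K_train, coeffs, sigma):
    """Project rows of X_new using the centered kernel against the training set."""
    n = X_train.shape[0]
    K_new = rbf_kernel(X_new, X_train, sigma)
    one_n = np.full((n, n), 1.0 / n)
    one_mn = np.full((X_new.shape[0], n), 1.0 / n)
    K_new_centered = (K_new - one_mn @ K_train - K_new @ one_n
                      + one_mn @ K_train @ one_n)
    return K_new_centered @ coeffs

# Hypothetical toy data: an inner ring and an outer ring.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=100)
radii = np.concatenate([np.full(50, 0.3), np.full(50, 2.0)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

K, lam, V, coeffs = fit_kernel_pca(X, n_components=2, sigma=1.0)
Z = transform(X, X, K, coeffs, sigma=1.0)    # 2D kernel PCA representation
```

For the training points themselves, the projection reduces to $\sqrt{\lambda}\, v$, which is a convenient consistency check.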
   

</div>

(3) [KernelPCA] We want to apply KernelPCA to the 2D raw data in Figure 1, which is centered (the figure shows that). The projected data should also be 2-dimensional. Which kernel function should be used so that the projected data would be linearly separable? The kernel function doesn't need to be precise. You can use $\{a, b, c, \ldots\}$ to replace the unknown coefficients. Please also draw a sketch of the $2 \mathrm{D}$ representation with KernelPCA applied to the data in Figure 1.

The plot looks something like the figure to the right, but the colors are reversed:

![](https://training.atmosera.com/wp-content/uploads/2021/07/linear-vs-rbf.png)

<div style="color:blue">

In this plot, red dots are clustered at the center and blue dots are circling around these red dots.

A linear kernel might not be sufficient to make the data linearly separable after projection.

A non-linear kernel should be used. Given the structure of the data, a radial basis function (RBF) kernel, also known as the Gaussian kernel, would be an appropriate choice.

$$
K(x_i, x_j) = \exp(-\frac{\|x_i - x_j\|^2}{2\sigma^2})
$$

where $\|x_i - x_j\|^2$ is the squared Euclidean distance between two data points $x_i$ and $x_j$, and $\sigma$ is a free parameter that controls the width of the Gaussian. The RBF kernel is particularly good for cases where the data forms a 'circle'-like structure, as it can map such data into a higher-dimensional space where the different classes become linearly separable.

Intuitively, the kernel should induce a separating boundary that, in the original space, is a circle around the red dots.

#### More on linear vs. non-linear kernels

Linear kernels:

A linear kernel is used when the data is linearly separable, i.e., when it can be separated by a straight line (in 2D), a plane (in 3D), or a hyperplane in higher dimensions.

* $K(x, y) = x^{\top}y$

Non-linear kernels:
* Polynomial Kernels
  * $K(x, y) = (a \cdot x^T y + c)^d$
  * $a$ is a scale factor.
  * $c$ is a constant that allows for adjustment.
  * $d$ is the degree of the polynomial.
  * Good for datasets where relationships between features are polynomial in nature.
* Radial Basis Function (RBF) or Gaussian Kernel
  * $K(x, y) = \exp(-\frac{\|x - y\|^2}{2\sigma^2})$
  * $\sigma$ controls the width of the Gaussian.
  * Particularly effective for cases where the decision boundary is not linear and data points form clusters.
* Sigmoid Kernel
  * $K(x, y) = \tanh(a \cdot x^T y + c)$
  * Mimics the behavior of a neural network’s sigmoid activation function.


</div>



## 3. Maximum Likelihood and Maximum A Posteriori Estimations

Consider a biased coin with an unknown probability $\theta \in[0,1]$ of landing heads after a flip. After conducting multiple coin flips, the resulting sequence is denoted as $\mathbf{x}=\{\mathrm{H}, \mathrm{H}, \mathrm{T}, \mathrm{H}, \mathrm{T}\}$, where '$\mathrm{H}$' represents heads and '$\mathrm{T}$' represents tails.

(1) [3pts] Determine the maximum likelihood estimation (MLE) for the parameter $\theta$ based on the observed flips.

<div style="color:blue">

References
* [Chapter 7. Statistical Estimation](https://web.stanford.edu/class/archive/cs/cs109/cs109.1218/files/student_drive/7.5.pdf)


To determine the Maximum Likelihood Estimation (MLE) for the parameter $\theta$ based on the observed coin flips, we need to follow these steps:

1. **Define the Likelihood Function**: The likelihood function $L(\theta)$ represents the probability of observing the given sequence of coin flips under the assumption that the probability of heads is $\theta$. For a sequence of independent coin flips, the likelihood function is the product of the probabilities for each flip. Given the sequence $\mathbf{x} = \{\mathrm{H}, \mathrm{H}, \mathrm{T}, \mathrm{H}, \mathrm{T}\}$, the likelihood function is:

    $$L(\theta) = P(\mathbf{x}|\theta) = \theta^{\text{number of heads}}(1 - \theta)^{\text{number of tails}}$$

    In our case, there are 3 heads and 2 tails, so:

    $$L(\theta) = \theta^3(1 - \theta)^2$$

2. **Compute the Log-Likelihood Function**: It is often easier to maximize the log of the likelihood function, as this turns products into sums. The log-likelihood function is given by:

    $$\ln L(\theta) = \ln[\theta^3(1 - \theta)^2] = 3\ln(\theta) + 2\ln(1 - \theta)$$

3. **Differentiate the Log-Likelihood Function**: We find the derivative of the log-likelihood function with respect to $\theta$:

    $$\frac{d}{d\theta} \ln L(\theta) = \frac{d}{d\theta}[3\ln(\theta) + 2\ln(1 - \theta)] = \frac{3}{\theta} - \frac{2}{1 - \theta}$$

4. **Set the Derivative to Zero to Find the Maximum**: To find the maximum of the function, we set the derivative equal to zero and solve for $\theta$:
   
   $$\frac{3}{\theta} - \frac{2}{1 - \theta} = 0$$
   
   Rearranging, we get:

   $$\frac{3}{\theta} = \frac{2}{1 - \theta} \implies 3(1 - \theta) = 2\theta \implies 3 - 3\theta = 2\theta \implies 3 = 5\theta$$

5. **Solve for $\theta$**: Finally, solving for $\theta$, we get:
   $\theta = \frac{3}{5}$

So, the MLE for the parameter $\theta$ based on the observed flips is $\frac{3}{5}$.


</div>

(2) [2pts] If we know that the probability $\theta$ of the fixed coin must be one of the following values: $\theta \in\{0.2,0.5,0.8\}$, then what is the MLE for $\theta$ ?

<div style="color:blue">

See PRML (Page 30/441): MAP.

We need to find which of the three acceptable $\theta$ values maximizes the likelihood. Since there are only finitely many, we can plug them all in and compare.


We calculate this likelihood for each possible value of $\theta$:

1. **For $\theta = 0.2$:**
   $L(0.2) = 0.2^3 \times (1 - 0.2)^2 = 0.2^3 \times 0.8^2 = 0.00512$

2. **For $\theta = 0.5$:**
   $L(0.5) = 0.5^3 \times (1 - 0.5)^2 = 0.5^3 \times 0.5^2 = 0.03125$

3. **For $\theta = 0.8$:**
   $L(0.8) = 0.8^3 \times (1 - 0.8)^2 = 0.8^3 \times 0.2^2 = 0.02048$


Since the likelihood is highest for $\theta = 0.5$, the Maximum Likelihood Estimation (MLE) for $\theta$ from the given set $\{0.2, 0.5, 0.8\}$ is $\theta = 0.5$.

</div>

(3) [3pts] Consider the same restricted set of possible $\theta$ values as in (2). Additionally, you have access to prior probabilities for each value: $p(\theta=0.2)=0.1$, $p(\theta=0.5)=0.05$, and $p(\theta=0.8)=0.85$. Determine the maximum a posteriori (MAP) estimation for $\theta$.

<div style="color:blue">

Instead of maximizing just the likelihood, we need to maximize the likelihood times the prior. 

The posterior probability for each $\theta$ is proportional to the product of its likelihood and its prior probability. This is given by Bayes' theorem:
- $p(\theta|\mathbf{x}) \propto p(\mathbf{x} | \theta) \times p(\theta) = L(\theta) \times p(\theta)$

For each $\theta$ value, calculate this product:
- For $\theta = 0.2$: $p(\theta = 0.2|\mathbf{x}) \propto L(0.2) \times p(\theta = 0.2) = 0.00512 \times 0.1 = 0.000512$
- For $\theta = 0.5$: $p(\theta = 0.5|\mathbf{x}) \propto L(0.5) \times p(\theta = 0.5) = 0.03125 \times 0.05 = 0.0015625$
- For $\theta = 0.8$: $p(\theta = 0.8|\mathbf{x}) \propto L(0.8) \times p(\theta = 0.8) = 0.02048 \times 0.85 = 0.017408$

Since the posterior probability is highest for $\theta=0.8$, the Maximum A Posteriori (MAP) estimation for $\theta$ from the given set $\{0.2,0.5,0.8\}$ and the given prior probabilities is $\theta=0.8$.
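
The restricted MLE from (2) and the MAP estimate here can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
thetas = [0.2, 0.5, 0.8]
prior = {0.2: 0.1, 0.5: 0.05, 0.8: 0.85}
heads, tails = 3, 2                      # counts in the sequence H, H, T, H, T

def likelihood(theta):
    """Probability of the observed sequence for a given theta."""
    return theta**heads * (1 - theta)**tails

# MLE over the restricted set: maximize the likelihood alone.
mle = max(thetas, key=likelihood)

# MAP: maximize likelihood times prior; normalizing does not change the argmax.
unnormalized = {t: likelihood(t) * prior[t] for t in thetas}
evidence = sum(unnormalized.values())
posterior = {t: u / evidence for t, u in unnormalized.items()}
map_estimate = max(thetas, key=lambda t: posterior[t])
```

The normalized `posterior` is not needed to find the MAP estimate, but it shows how strongly the prior pulls probability mass toward $\theta = 0.8$.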

</div>


(4) [2pts] Given an infinite number of flips of the biased coin, discuss the relationship between the results obtained from MLE and MAP estimations. Provide a concise explanation.

<div style="color:blue">

1. **MLE in the Limit of Infinite Data**: As the number of coin flips approaches infinity, the MLE becomes increasingly accurate in estimating the true parameter $\theta$. This is due to the Law of Large Numbers, which states that as the number of trials increases, the average of the outcomes converges to the expected value. Therefore, with an infinite number of flips, the MLE will converge to the true value of $\theta$.

2. **MAP and the Influence of Prior**: MAP estimation, unlike MLE, incorporates prior beliefs about the parameter $\theta$. However, as the number of flips goes to infinity, the impact of the prior on the MAP estimation diminishes. This happens because the overwhelming amount of data (evidence) starts to outweigh the initial prior, making the posterior distribution more reflective of the data than the prior.

3. **Convergence of MLE and MAP**: In the case of an infinite number of observations, the results of MLE and MAP estimations converge, assuming that the prior in the MAP estimation is not zero for the true value of $\theta$. This convergence occurs because, with infinite data, both estimations effectively become fully informed by the data, reducing the influence of the prior in MAP to negligible levels. The key difference is that MAP starts with a prior belief and adjusts with evidence, while MLE relies solely on the evidence from the start.
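
The convergence can be illustrated with a small simulation. The sketch below reuses the three candidate values and the prior from (3); the true value `true_theta = 0.5`, the seed, and the sample sizes are assumptions chosen for illustration. Log-likelihoods are used to avoid numerical underflow for long sequences.

```python
import math
import random

thetas = [0.2, 0.5, 0.8]
prior = {0.2: 0.1, 0.5: 0.05, 0.8: 0.85}

def log_likelihood(theta, heads, tails):
    return heads * math.log(theta) + tails * math.log(1 - theta)

def mle_and_map(heads, tails):
    """Return (MLE, MAP) over the restricted set of theta values."""
    mle = max(thetas, key=lambda t: log_likelihood(t, heads, tails))
    map_estimate = max(
        thetas,
        key=lambda t: log_likelihood(t, heads, tails) + math.log(prior[t]),
    )
    return mle, map_estimate

random.seed(0)
true_theta = 0.5                          # assumed ground truth for the simulation
flips = [random.random() < true_theta for _ in range(2000)]

results = {}
for n in (5, 50, 500, 2000):
    heads = sum(flips[:n])
    tails = n - heads
    results[n] = mle_and_map(heads, tails)
```

The log-prior contributes a fixed offset, while the log-likelihood grows linearly in the number of flips, so for large `n` both estimates select the same value.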

</div>


## 4. Neural Networks

Figure 1 shows a feed forward neural network with one hidden layer containing three neurons $h_1, h_2, h_3$ with a sigmoid activation function, with three inputs $x_1, x_2, x_3$ and a linear output layer with two units $y_1, y_2$.

(1) [1pts] How many total parameters are in this model? Suppose we add an additional hidden layer with 4 neurons, how many total parameters do we have now?






<div style="color:blue">

The neural network depicted in Figure 1 consists of an input layer with three neurons ($x_1, x_2, x_3$), one hidden layer with three neurons ($h_1, h_2, h_3$), and an output layer with two neurons ($y_1, y_2$). The connections between these neurons are represented by weights, and each neuron in the hidden and output layers has a bias term.

In the given network:

1. There are $3 \times 3 = 9$ weights connecting the input layer to the hidden layer.
2. There are 3 bias terms for the hidden layer.
3. There are $3 \times 2 = 6$ weights connecting the hidden layer to the output layer.
4. There are 2 bias terms for the output layer.

So, the total number of parameters in the original network is:

$$
\text{Total parameters} = (3 \times 3) + 3 + (3 \times 2) + 2 = 9 + 3 + 6 + 2 = 20
$$

Now, if we add an additional hidden layer with 4 neurons:

1. There are $3 \times 4 = 12$ new weights connecting the first hidden layer to the new hidden layer.
2. There are 4 new bias terms for the new hidden layer.
3. The original 6 weights from the old hidden layer to the output layer are replaced by $4 \times 2 = 8$ weights connecting the new (second) hidden layer to the output layer.
4. The 2 bias terms for the output layer are unchanged.

The new total number of parameters, including the additional hidden layer, is:

$\text{New total parameters} = (3 \times 3) + 3 + (3 \times 4) + 4 + (4 \times 2) + 2 = 9 + 3 + 12 + 4 + 8 + 2 = 38$

So there are 38 parameters with the additional hidden layer.
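
The same count can be expressed as a small helper for any fully connected architecture. The function name `count_parameters` is an illustrative choice.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the number of units per layer, inputs first.
    Each pair of consecutive layers contributes n_in * n_out weights
    and n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

original = count_parameters([3, 3, 2])        # input, hidden, output
extended = count_parameters([3, 3, 4, 2])     # with the extra 4-neuron layer
```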

</div>

(2) [2pts] What is the forward expression to compute $y_1$ ?

<div style="color:blue">


Denote the weights from the input layer to the hidden layer as $w_{ij}^{1}$, where $i$ denotes the input neuron, and $j$ denotes the hidden layer neuron. The biases for the hidden layer neurons are denoted as $b_j^{1}$. Similarly, the weights from the hidden layer to the output layer are $w_{jk}^{2}$, where $j$ denotes the hidden layer neuron, and $k$ denotes the output neuron. The biases for the output layer neurons are $b_k^{2}$.

The sigmoid activation function is denoted as $\sigma$ and is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Using these notations, the forward expression for computing $y_1$ involves the following steps:

Compute the weighted input for each hidden neuron:

$z_j^{1} = w_{1j}^{1} x_1 + w_{2j}^{1} x_2 + w_{3j}^{1} x_3 + b_j^{1}$ for $j = 1, 2, 3$

Apply the sigmoid activation function to each $z_j^{1}$:

$h_j^{1} = \sigma(z_j^{1})$ for $j = 1, 2, 3$

Compute the weighted sum for $y_1$ using the activated values of the hidden layer:

$y_1 = w_{11}^{2} h_1^{1} + w_{21}^{2} h_2^{1} + w_{31}^{2} h_3^{1} + b_1^{2}$

Putting it all together, the forward expression for computing $y_1$ is:
$$y_1 = w_{11}^{2} \sigma(w_{11}^{1} x_1 + w_{21}^{1} x_2 + w_{31}^{1} x_3 + b_1^{1}) + w_{21}^{2} \sigma(w_{12}^{1} x_1 + w_{22}^{1} x_2 + w_{32}^{1} x_3 + b_2^{1}) + w_{31}^{2} \sigma(w_{13}^{1} x_1 + w_{23}^{1} x_2 + w_{33}^{1} x_3 + b_3^{1}) + b_1^{2}$$


This expression gives the forward computation for $y_1$ as a function of the inputs $x_1, x_2, x_3$, the weights $w_{ij}^{1}, w_{jk}^{2}$, and the biases $b_j^{1}, b_k^{2}$.
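
A vectorized version of this forward pass can be sketched in NumPy. The array layout follows the index convention above (`W1[i, j]` connects input $i$ to hidden neuron $j$, `W2[j, k]` connects hidden neuron $j$ to output $k$); the random weights and the input are placeholders, not values from the exercise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z = x @ W1 + b1          # z_j = sum_i W1[i, j] * x_i + b1[j]
    h = sigmoid(z)           # hidden activations
    y = h @ W2 + b2          # linear outputs, y[0] is y_1
    return h, y

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)
x = np.array([0.5, -1.0, 2.0])

h, y = forward(x, W1, b1, W2, b2)

# The explicit sum from the formula for y_1, written out term by term.
y1_explicit = sum(
    W2[j, 0] * sigmoid(sum(W1[i, j] * x[i] for i in range(3)) + b1[j])
    for j in range(3)
) + b2[0]
```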

</div>

(3) [2pts] Suppose we train the model on the squared loss $L=1 / 2(y-y^{\prime})^2$. What is the expression for $\frac{\partial L}{\partial w_{i j}^2}$ ?

<div style="color:blue">

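Let $y'$ denote the network output and $y$ the target, with the loss summed over the two outputs, $L = \frac{1}{2}\sum_{k}(y_k - y_k')^2$. In this part and the next, the first index of a weight denotes the unit it feeds into, so $w_{ij}^2$ connects hidden neuron $h_j$ to output $y_i$ (the reverse of the index order used in (2)):

$y_i' = \sum_{j=1}^{3} w_{ij}^2 h_j + b_{i}^2$,

where $h_j$ is the output of the $j$-th neuron in the hidden layer.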

$\frac{\partial L}{\partial w_{ij}^2} = \frac{\partial L}{\partial y_i'} \cdot \frac{\partial y_i'}{\partial w_{ij}^2}$

* $\frac{\partial L}{\partial y_i'} = -(y_i - y_i')$
* $\frac{\partial y_i'}{\partial w_{ij}^2} = h_j$

Combining the two factors, the derivative of the loss with respect to the weight $w_{ij}^2$ is:

$\frac{\partial L}{\partial w_{ij}^2} = -(y_i - y_i') \cdot h_j$

</div>

(4) [3pts] What is the expression for $\frac{\partial L}{\partial w_{i j}^1}$ ?

<div style="color:blue">

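Since $h_i$ feeds into both outputs, the chain rule sums over $k$, and $h_i$ depends on $w_{ij}^1$ only through $z_i$:

$\frac{\partial L}{\partial w_{ij}^1} = \sum_{k=1}^{2} \frac{\partial L}{\partial y_k'} \cdot \frac{\partial y_k'}{\partial h_i} \cdot \frac{\partial h_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_{ij}^1}$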

* $\frac{\partial L}{\partial y_k'} = -(y_k - y_k')$
* $\frac{\partial y_k'}{\partial h_i} = w_{ki}^2$ since $y_k' = \sum_{i=1}^{3} w_{ki}^2 h_i + b_{k}^2$.
* $\frac{\partial h_i}{\partial z_i} = h_i (1 - h_i)$ since $h_i = \sigma(z_i)$ and the derivative of the sigmoid function $\sigma(z_i)$ is $\sigma(z_i)(1 - \sigma(z_i))$.
* $z_i = \sum_{j=1}^{3} w_{ij}^1 x_j + b_{i}^1$
* $\frac{\partial z_i}{\partial w_{ij}^1} = x_j$

Putting it all together, we get

$\frac{\partial L}{\partial w_{ij}^1} = ( \sum_{k=1}^{2} \frac{\partial L}{\partial y_k'} \cdot w_{ki}^2 ) \cdot h_i (1 - h_i) \cdot x_j$

The final expression:

$\frac{\partial L}{\partial w_{ij}^1} = ( \sum_{k=1}^{2} -(y_k - y_k') \cdot w_{ki}^2 ) \cdot h_i (1 - h_i) \cdot x_j$

This gives us the gradient of the loss with respect to the weights in the first layer, which is needed for backpropagation during the training of the network.
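
Both gradient expressions can be verified against finite differences. The sketch below uses the index convention of this part (`W1[i, j]` connects input $j$ to hidden neuron $i$, `W2[k, i]` connects hidden neuron $i$ to output $k$); the random weights, input, and target are placeholders, and the helper names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)                       # hidden activations
    y_pred = W2 @ h + b2                           # network outputs y'
    loss = 0.5 * np.sum((y - y_pred) ** 2)
    delta_out = -(y - y_pred)                      # dL/dy'_k
    dW2 = np.outer(delta_out, h)                   # -(y_i - y'_i) * h_j
    delta_hidden = (W2.T @ delta_out) * h * (1 - h)
    dW1 = np.outer(delta_hidden, x)                # (sum_k ...) * h_i (1 - h_i) * x_j
    return loss, dW1, dW2

def numerical_grad(f, W, eps=1e-6):
    """Central finite differences of f() w.r.t. every entry of W (edited in place)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x, y = rng.normal(size=3), rng.normal(size=2)

loss, dW1, dW2 = loss_and_grads(x, y, W1, b1, W2, b2)
f = lambda: loss_and_grads(x, y, W1, b1, W2, b2)[0]
num_dW1 = numerical_grad(f, W1)
num_dW2 = numerical_grad(f, W2)
```

Agreement between the analytic and numerical gradients is a standard way to catch index or sign mistakes in a hand-derived backpropagation formula.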

</div>

(5) [2pts] Name any three strategies to reduce overfitting for this model.

<div style="color:blue">

1. **Regularization**: Add regularization terms to the loss function to penalize large weights. For example:

   - **L1 Regularization (Lasso)**: Adds a term proportional to the absolute value of the weights, encouraging sparsity in the weight matrix.
   - **L2 Regularization (Ridge)**: Adds a term proportional to the square of the weights, which tends to spread the weights evenly and keeps any single weight from becoming too dominant.

2. **Dropout**: During training, randomly "drop" units (along with their connections) from the neural network. This prevents units from co-adapting too much and forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
3. **Early Stopping**: Monitor the performance of the model on a validation set and stop training when the performance on the validation set begins to degrade. This prevents the model from continuing to learn the noise in the training set.
4. **Reducing Network Complexity**: Simplify the model by reducing the number of layers or the number of neurons in each layer. A less complex model has fewer parameters and is less likely to overfit.
5. **Batch Normalization**: Normalizing the inputs to each layer so that they have a mean of zero and a variance of one over each mini-batch is mainly an optimization aid, but the noise in the batch statistics can have a mild regularizing effect. It can also make the network less sensitive to the scale of different features.
6. **Learning Rate Schedules**: Use learning rate schedules or adaptive learning rate methods to change the learning rate during training. This mainly affects optimization; any reduction in overfitting is indirect, so it is a weaker answer than the strategies above.

</div>