## 1. Graphical Models
 
In this question you will build a pairwise Markov random field (MRF). Let $x$ denote the hidden binary variables and $y$ the observed continuous variables. Consider an MRF with four hidden variables and four observed variables, whose energy function is

$$
E(x, y)=-\sum_{i, j=1}^4 \omega_{i j} x_i x_j-\sum_{i=1}^4 \theta_i x_i y_i .
$$

Therefore, the joint distribution is

$$
p(x, y) \propto \exp (-E(x, y)) .
$$

(a) Complete the graphical model by drawing the edges.

<div style="color:blue">

* Draw an undirected edge between every pair of distinct hidden nodes $x_i$ and $x_j$ to represent the pairwise interactions $\omega_{ij} x_i x_j$. Because the energy function sums over all pairs of hidden variables, the hidden variables $x_1, x_2, x_3,$ and $x_4$ form a fully connected subgraph with six edges (assuming every $\omega_{ij}$ with $i \neq j$ can be non-zero). The diagonal terms $\omega_{ii} x_i^2$ involve a single variable and add no edges.

* Draw undirected edges between each hidden variable $x_i$ and its corresponding observed variable $y_i$ to represent the dependency $\theta_i x_i y_i$. Each $x_i$ will only be connected to the corresponding $y_i$, meaning $x_1$ to $y_1$, $x_2$ to $y_2$, $x_3$ to $y_3$, and $x_4$ to $y_4$.

* Note that there are no direct edges between observed variables $y_i$ and $y_j$ unless specified by the problem because the given energy function does not include direct interactions between observed variables.

</div>

(b) Mark each of the following statements about conditional independence in the model as TRUE or FALSE.

- $\left(x_1 \perp x_2 \mid x_3\right)$
- $\left(y_1 \perp y_2 \mid y_3\right)$
- $\left(y_1 \perp y_2 \mid x_1, x_2, x_3\right)$


<div style="color:blue">

1. $(x_1 \perp x_2 \mid x_3)$: FALSE. In an MRF, two sets of nodes are conditionally independent given a third set if every path between them passes through the conditioning set. $x_1$ and $x_2$ are joined by a direct edge, and conditioning on $x_3$ cannot block that edge, so they are not conditionally independent.

2. $(y_1 \perp y_2 \mid y_3)$: FALSE. Although there is no direct edge between $y_1$ and $y_2$, they are connected by the path $y_1 - x_1 - x_2 - y_2$. Conditioning on $y_3$ blocks none of the nodes on this path, so $y_1$ and $y_2$ are not separated in the graph and are dependent in general.

3. $(y_1 \perp y_2 \mid x_1, x_2, x_3)$: TRUE. The observed variables $y_1$ and $y_2$ are conditionally independent given their respective hidden variables because $y_1$ is only dependent on $x_1$ and $y_2$ is only dependent on $x_2$. Knowing $x_3$ does not change this independence, and there are no direct interactions between $y_1$ and $y_2$. Therefore, conditioning on $x_1$, $x_2$, and $x_3$ does not create a dependence between $y_1$ and $y_2$.

</div>


(c) Write down the form of the normalizer, $Z(\theta, \omega)$, so that $p(x, y)=\frac{1}{Z(\theta, \omega)} \exp (-E(x, y))$ is a valid distribution.

<div style="color:blue">

For the joint distribution $p(x, y)$ to be a valid probability distribution, it must sum to 1 over all configurations of the discrete variables $x$ and integrate to 1 over all values of the continuous variables $y$.

The normalizer $Z(\theta, \omega)$, usually called the partition function in the context of Markov random fields, therefore sums over all configurations of the hidden variables $x$ and integrates over all values of the observed variables $y$. Even though $y$ is observed, $Z$ normalizes the joint distribution, so $y$ is integrated out and $Z$ depends only on the parameters $\theta$ and $\omega$. (Holding $y$ fixed and summing over $x$ alone would give the normalizer of the conditional $p(x \mid y)$, not of the joint.)

Thus, $Z(\theta, \omega)$ is:

$$
\begin{aligned}
Z(\theta, \omega) &= \sum_{x_1 \in \{0, 1\}} \sum_{x_2 \in \{0, 1\}} \sum_{x_3 \in \{0, 1\}} \sum_{x_4 \in \{0, 1\}} \int \exp \left( \sum_{i, j=1}^4 \omega_{ij} x_i x_j + \sum_{i=1}^4 \theta_i x_i y_i \right) dy \\
&= \sum_{x \in \{0, 1\}^4} \int \exp (-E(x, y)) \, dy
\end{aligned}
$$

Here the sum runs over all states of the binary hidden variables $x = \{x_1, x_2, x_3, x_4\}$, and the integral runs over the domain of the continuous observed variables $y = \{y_1, y_2, y_3, y_4\}$ (assumed to be such that the integral is finite). Since the hidden variables are binary, the summation has $2^4 = 16$ terms, one per configuration of $x$.

The partition function can be very difficult to compute directly for large systems, because the number of configurations of $x$ grows exponentially and $y$ must be integrated out, but it is what ensures that $p(x, y)$ is properly normalized.

</div>


(d) Write down the log-likelihood and its derivative w.r.t. $\omega_{i j}$



<div style="color:blue">

The log-likelihood of the observed variables $y$ under the model parameters $\theta$ and $\omega$ is the logarithm of the marginal $p(y)$, obtained by summing the joint probability $p(x, y)$ over all possible hidden states $x$:

$\mathcal{L}(\theta, \omega; y) = \log \sum_{x} p(x, y) = \log \sum_{x} \frac{1}{Z(\theta, \omega)} \exp (-E(x, y))$

Using the energy function $E(x, y)$, the log-likelihood becomes:

$\mathcal{L}(\theta, \omega; y) = \log \sum_{x} \frac{1}{Z(\theta, \omega)} \exp \left( \sum_{i, j=1}^4 \omega_{ij} x_i x_j + \sum_{i=1}^4 \theta_i x_i y_i \right)$

Since the log-likelihood involves the log of a sum, it does not simplify directly. However, we can write the derivative of the log-likelihood with respect to $\omega_{ij}$ as:

$$
\frac{\partial \mathcal{L}}{\partial \omega_{ij}} = \frac{\partial}{\partial \omega_{ij}} \log \left( \sum_{x} \frac{1}{Z(\theta, \omega)} \exp \left( \sum_{k, l=1}^4 \omega_{kl} x_k x_l + \sum_{k=1}^4 \theta_k x_k y_k \right) \right)
$$

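Because $Z(\theta, \omega)$ does not depend on $x$, the log-likelihood splits as $\mathcal{L} = \log \sum_{x} \exp(-E(x, y)) - \log Z(\theta, \omega)$. Differentiating each term gives: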

$$
\frac{\partial \mathcal{L}}{\partial \omega_{ij}} = \frac{1}{\sum_{x} \exp (-E(x, y))} \cdot \frac{\partial}{\partial \omega_{ij}} \left( \sum_{x} \exp \left( \sum_{k, l=1}^4 \omega_{kl} x_k x_l + \sum_{k=1}^4 \theta_k x_k y_k \right) \right) - \frac{\partial \log Z(\theta, \omega)}{\partial \omega_{ij}}
$$

Now, taking the derivative inside the sum and using the fact that $\frac{\partial}{\partial \omega_{ij}} \exp(\cdot) = x_i x_j \exp(\cdot)$:

$$
\frac{\partial \mathcal{L}}{\partial \omega_{ij}} = \frac{\sum_{x} x_i x_j \exp \left( \sum_{k, l=1}^4 \omega_{kl} x_k x_l + \sum_{k=1}^4 \theta_k x_k y_k \right)}{\sum_{x} \exp \left( \sum_{k, l=1}^4 \omega_{kl} x_k x_l + \sum_{k=1}^4 \theta_k x_k y_k \right)} - \frac{\partial \log Z(\theta, \omega)}{\partial \omega_{ij}}
$$

The second term involves the derivative of $\log Z(\theta, \omega)$ with respect to $\omega_{ij}$, which is:

$$
\frac{\partial \log Z(\theta, \omega)}{\partial \omega_{ij}} = \frac{1}{Z(\theta, \omega)} \frac{\partial Z(\theta, \omega)}{\partial \omega_{ij}}
$$

The derivative of $Z(\theta, \omega)$ with respect to $\omega_{ij}$ involves a similar sum over all $x$, together with the integral over $y$ that appears in the definition of $Z$:

$$
\frac{\partial Z(\theta, \omega)}{\partial \omega_{ij}} = \sum_{x} \int x_i x_j \exp \left( \sum_{k, l=1}^4 \omega_{kl} x_k x_l + \sum_{k=1}^4 \theta_k x_k y_k \right) dy
$$

Putting it all together:

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \omega_{ij}} &= \frac{\sum_{x} x_i x_j \exp \left( \sum_{k, l=1}^4 \omega_{kl} x_k x_l + \sum_{k=1}^4 \theta_k x_k y_k \right)}{\sum_{x} \exp \left( \sum_{k, l=1}^4 \omega_{kl} x_k x_l + \sum_{k=1}^4 \theta_k x_k y_k \right)} \\
&\quad - \frac{\sum_{x} \int x_i x_j \exp \left( \sum_{k, l=1}^4 \omega_{kl} x_k x_l + \sum_{k=1}^4 \theta_k x_k y_k \right) dy}{Z(\theta, \omega)}
\end{aligned}
$$

This simplifies to a difference of expected values:

$$
\frac{\partial \mathcal{L}}{\partial \omega_{ij}} = \langle x_i x_j \rangle_{p(x|y)} - \langle x_i x_j \rangle_{p(x)}
$$

where $\langle x_i x_j \rangle_{p(x|y)}$ is the expected value of $x_i x_j$ under the posterior given the observed $y$, and $\langle x_i x_j \rangle_{p(x)}$ is its expected value under the model's marginal $p(x) = \int p(x, y) \, dy$, that is, with $y$ integrated out. This difference between a data-dependent and a model expectation is the form commonly used for parameter learning in Markov random fields.
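
The data-dependent term $\langle x_i x_j \rangle_{p(x|y)}$ can be evaluated exactly for this small model, since there are only 16 hidden configurations. The following is a minimal sketch of that enumeration, not part of the original solution; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import itertools
import math

def neg_energy(x, y, omega, theta):
    """-E(x, y) = sum_ij omega[i][j] x_i x_j + sum_i theta[i] x_i y_i."""
    n = len(x)
    pair = sum(omega[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    unary = sum(theta[i] * x[i] * y[i] for i in range(n))
    return pair + unary

def posterior_pair_expectation(i, j, y, omega, theta):
    """<x_i x_j> under p(x | y), by enumerating all 2^n binary configurations."""
    n = len(y)
    numerator = 0.0
    denominator = 0.0
    for x in itertools.product((0, 1), repeat=n):
        weight = math.exp(neg_energy(x, y, omega, theta))
        numerator += x[i] * x[j] * weight
        denominator += weight
    return numerator / denominator

# Hypothetical parameters: no pairwise coupling, unit theta.
omega = [[0.0] * 4 for _ in range(4)]
theta = [1.0] * 4
y = [0.3, -1.2, 0.5, 2.0]
print(posterior_pair_expectation(0, 1, y, omega, theta))
```

With all $\omega_{ij} = 0$ the hidden variables are independent given $y$, so the result should equal $\sigma(\theta_1 y_1)\,\sigma(\theta_2 y_2)$, which gives a quick sanity check on the enumeration.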

</div>


## 2. Logistic Regression

Denote training set $D$ as $\{\left(\mathbf{x}_i, y_i\right)\}_1^N$, where $y_i \in\{0,1\}$ is the label and $\mathbf{x}_i \in \mathbf{R}^d$ is the feature vector of the $i$-th data point. 

In logistic regression we have $p\left(y_i=1 \mid \mathbf{x}_i\right)=$ $\sigma\left(\mathbf{w}^T \mathbf{x}_i\right)$, where $\mathbf{w} \in \mathbf{R}^d$ is the learned coefficient vector and $\sigma(t)=\frac{1}{1+e^{-t}}$ is the sigmoid function.

### 1) Batch Gradient Descent

a) Specify the negative log-likelihood for logistic regression

<div style="color:blue">

The likelihood of observing the given dataset $D$ is the product of the probabilities assigned to each individual observation:

$$L(\mathbf{w}) = \prod_{i=1}^{N} p(y_i = 1 \mid \mathbf{x}_i)^{y_i} (1 - p(y_i = 1 \mid \mathbf{x}_i))^{1 - y_i}$$

In the context of logistic regression, this becomes:

$$L(\mathbf{w}) = \prod_{i=1}^{N} \sigma(\mathbf{w}^T \mathbf{x}_i)^{y_i} (1 - \sigma(\mathbf{w}^T \mathbf{x}_i))^{1 - y_i}$$

The negative log-likelihood is then:

$$
\begin{aligned}
-\log L(\mathbf{w}) &= -\sum_{i=1}^{N} \left[ y_i \log(\sigma(\mathbf{w}^T \mathbf{x}_i)) + (1 - y_i) \log(1 - \sigma(\mathbf{w}^T \mathbf{x}_i)) \right] \\
&= -\sum_{i=1}^{N} \left[ y_i \log\left(\frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}_i}}\right) + (1 - y_i) \log\left(1 - \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}_i}}\right) \right]
\end{aligned}
$$

</div>

b) Derive the gradient of the negative log-likelihood in terms of $\mathbf{w}$ for this setting.


<div style="color:blue">

The gradient of the negative log-likelihood with respect to $\mathbf{w}$ is:

$$
\begin{aligned}
\nabla_{\mathbf{w}} (-\log L(\mathbf{w})) &= -\nabla_{\mathbf{w}} \sum_{i=1}^{N} \left[ y_i \log(\sigma(\mathbf{w}^T \mathbf{x}_i)) + (1 - y_i) \log(1 - \sigma(\mathbf{w}^T \mathbf{x}_i)) \right] \\
&= -\sum_{i=1}^{N} \left[ y_i \frac{1}{\sigma(\mathbf{w}^T \mathbf{x}_i)} \nabla_{\mathbf{w}} \sigma(\mathbf{w}^T \mathbf{x}_i) - (1 - y_i) \frac{1}{1 - \sigma(\mathbf{w}^T \mathbf{x}_i)} \nabla_{\mathbf{w}} \sigma(\mathbf{w}^T \mathbf{x}_i) \right]
\end{aligned}
$$

The derivative of the sigmoid function is $\sigma'(t) = \sigma(t)(1 - \sigma(t))$, so by the chain rule we can write:

$\nabla_{\mathbf{w}} \sigma(\mathbf{w}^T \mathbf{x}_i) = \sigma(\mathbf{w}^T \mathbf{x}_i) (1 - \sigma(\mathbf{w}^T \mathbf{x}_i)) \mathbf{x}_i$

Substituting this into the gradient expression gives:

$$
\begin{aligned}
\nabla_{\mathbf{w}} (-\log L(\mathbf{w})) &= -\sum_{i=1}^{N} \left[ y_i \frac{1}{\sigma(\mathbf{w}^T \mathbf{x}_i)} - (1 - y_i) \frac{1}{1 - \sigma(\mathbf{w}^T \mathbf{x}_i)} \right] \sigma(\mathbf{w}^T \mathbf{x}_i) (1 - \sigma(\mathbf{w}^T \mathbf{x}_i)) \mathbf{x}_i \\
&= \sum_{i=1}^{N} \left[ \sigma(\mathbf{w}^T \mathbf{x}_i) - y_i \right] \mathbf{x}_i
\end{aligned}
$$

This is the gradient of the negative log-likelihood with respect to $\mathbf{w}$ for logistic regression, which can be used to update the coefficients in a gradient descent algorithm.
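
As an illustration, the gradient formula can be used directly in a batch gradient descent loop. The sketch below is a plain-Python example under assumed settings: the toy data, the learning rate `eta`, and the number of steps are made up, and the function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def nll(w, X, Y):
    """Negative log-likelihood of logistic regression on the whole dataset."""
    total = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def nll_gradient(w, X, Y):
    """Gradient of the NLL: sum_i (sigma(w^T x_i) - y_i) x_i."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad

def gradient_descent(X, Y, eta=0.1, steps=200):
    """Batch gradient descent: every step uses all N data points."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = nll_gradient(w, X, Y)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Toy, non-separable data (made up); the first feature is a constant bias term.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, 1.0]]
Y = [1, 0, 1, 0]
w = gradient_descent(X, Y)
print(w, nll(w, X, Y))
```

Each step costs $O(Nd)$ because the gradient sums over every data point, which is the cost that motivates the stochastic variant.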

</div>

### 2) Stochastic Gradient Descent

If $N$ and $d$ are very large, it may be prohibitively expensive to consider every patient in $D$ before applying an update to $\mathbf{w}$. One alternative is to consider stochastic gradient descent, in which an update is applied after only considering a single data point.

a) Show the log likelihood, $l$, of a single data point $\left(\mathbf{x}_t, y_t\right)$.

<div style="color:blue">

Given a data point $(\mathbf{x}_t, y_t)$, the probability of observing this point under logistic regression is:

$p(y_t | \mathbf{x}_t) = \sigma(\mathbf{w}^T \mathbf{x}_t)^{y_t} [1 - \sigma(\mathbf{w}^T \mathbf{x}_t)]^{1 - y_t}$

The log likelihood, $l$, of this observation is the logarithm of this probability:

$l = \log p(y_t | \mathbf{x}_t)$

$= y_t \log \sigma(\mathbf{w}^T \mathbf{x}_t) + (1 - y_t) \log [1 - \sigma(\mathbf{w}^T \mathbf{x}_t)]$

$= y_t \log \left(\frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}_t}}\right) + (1 - y_t) \log \left(1 - \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}_t}}\right)$

</div>

b) Show how to update the coefficient vector $\mathbf{w}_t$ when you get a feature vector $\mathbf{x}_t$ and the label $y_t$ at time $t$ using $\mathbf{w}_{t-1}$ (assume learning rate $\eta$ is given).


<div style="color:blue">

In stochastic gradient descent, the update rule for the coefficient vector $\mathbf{w}$ at time $t$ can be derived by taking the gradient of the log likelihood with respect to $\mathbf{w}$.

The gradient of the log likelihood $l$ with respect to $\mathbf{w}$ is:
$\nabla_{\mathbf{w}} l = \left(y_t - \sigma(\mathbf{w}_{t-1}^T \mathbf{x}_t)\right) \mathbf{x}_t$

The update rule for $\mathbf{w}_t$ is then:

$\mathbf{w}_t = \mathbf{w}_{t-1} + \eta \nabla_{\mathbf{w}} l$
$= \mathbf{w}_{t-1} + \eta \left(y_t - \sigma(\mathbf{w}_{t-1}^T \mathbf{x}_t)\right) \mathbf{x}_t$

This rule updates the coefficient vector $\mathbf{w}$ by moving it in the direction that increases the likelihood of the observed data point $(\mathbf{x}_t, y_t)$, scaled by the learning rate $\eta$.

</div>


c) What is the time complexity of the update rule from (b) if $\mathbf{x}_t$ is very sparse?

<div style="color:blue">


The critical parts of the update rule in (b) in terms of time complexity are:

1. **Computation of $\sigma(\mathbf{w}_{t-1}^T \mathbf{x}_t)$**: This involves calculating the dot product of the weight vector $\mathbf{w}_{t-1}$ and the feature vector $\mathbf{x}_t$, followed by applying the sigmoid function. For a sparse vector $\mathbf{x}_t$, most elements are zero. Therefore, the dot product computation only needs to be done for the non-zero elements of $\mathbf{x}_t$. If $\mathbf{x}_t$ has $k$ non-zero elements (where $k \ll d$, with $d$ being the dimensionality of the feature space), then this operation has a time complexity of $O(k)$.

2. **Update of $\mathbf{w}_t$**: This involves adding a scaled version of $\mathbf{x}_t$ to $\mathbf{w}_{t-1}$. Again, since $\mathbf{x}_t$ is sparse, this operation only needs to be performed on the elements corresponding to the non-zero entries of $\mathbf{x}_t$. This also has a time complexity of $O(k)$.

Overall, the time complexity of the update rule when $\mathbf{x}_t$ is very sparse is $O(k)$.
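
The $O(k)$ cost is easiest to see in code. The sketch below assumes the sparse vector $\mathbf{x}_t$ is stored as a dictionary from index to non-zero value; that representation and the function name `sgd_step_sparse` are illustrative choices, not something the problem specifies.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sgd_step_sparse(w, x_sparse, y, eta):
    """In-place SGD step on the log-likelihood of one example.

    w is a dense list of weights; x_sparse maps index -> non-zero value.
    Only the k entries present in x_sparse are read or written.
    """
    z = sum(w[i] * v for i, v in x_sparse.items())   # O(k) dot product
    err = y - sigmoid(z)
    for i, v in x_sparse.items():                    # O(k) update
        w[i] += eta * err * v
    return w

w = [0.0] * 1000
sgd_step_sparse(w, {3: 1.0, 42: 2.0}, y=1, eta=0.1)
print(w[3], w[42])
```

Although `w` has 1000 entries here, the step touches only indices 3 and 42.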

</div>


d) Briefly explain the consequence of using a very large $\eta$ and very small $\eta$.

<div style="color:blue">

Issues with very small learning rates
* Very slow convergence and long training time, because each update barely changes $\mathbf{w}$.
* With a fixed budget of updates, training may stop far from the optimum, giving an underfit model. Note that the logistic regression log-likelihood is concave, so small steps do not trap the optimizer in a spurious local optimum; they only delay reaching the global one.

Issues with very large learning rates
* Failure to converge: large updates overshoot the optimum of the objective, so the weights can oscillate around it or diverge, leading to unstable training.



</div>


e) Show how to update $\mathbf{w}_t$ with L2 regularization. That is to update $\mathbf{w}_t$ according to $l-\mu\|\mathbf{w}\|_2^2$, where $\mu$ is a constant. What's the time complexity? [5 points]


<div style="color:blue">

The regularized log likelihood $l'$ is given by:

$l' = l - \mu \|\mathbf{w}\|_2^2$

The gradient of $l'$ with respect to $\mathbf{w}$ now becomes:

$\nabla_{\mathbf{w}} l' = \nabla_{\mathbf{w}} l - 2\mu \mathbf{w}$

For the update rule, this means:

$\mathbf{w}_t = \mathbf{w}_{t-1} + \eta \left( (y_t - \sigma(\mathbf{w}_{t-1}^T \mathbf{x}_t)) \mathbf{x}_t - 2\mu \mathbf{w}_{t-1} \right)$

The update rule now includes an additional term $-2\mu \mathbf{w}_{t-1}$, which is the gradient of the regularization term. This term penalizes large weights and helps prevent overfitting.

Time Complexity:

1. **Computing $\sigma(\mathbf{w}_{t-1}^T \mathbf{x}_t)$**: As before, this is $O(k)$ for a sparse $\mathbf{x}_t$ with $k$ non-zero elements.

2. **Applying the L2 Regularization Gradient**: The term $-2\mu \mathbf{w}_{t-1}$ requires a component-wise multiplication of $\mathbf{w}_{t-1}$ by a scalar, which is $O(d)$ since it needs to be done for every element of $\mathbf{w}_{t-1}$.

3. **Updating $\mathbf{w}_t$**: This involves adding two vectors, which is $O(d)$.

So, the time complexity of the update rule with L2 regularization is $O(d)$. The presence of the regularization term, which affects all components of $\mathbf{w}$, means that the sparsity of $\mathbf{x}_t$ does not reduce the complexity of the update rule as it did without regularization.
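
For reference, a direct implementation of the regularized update rule above might look like the following sketch. The dictionary representation of $\mathbf{x}_t$ and the name `sgd_step_l2` are assumptions made for illustration.

```python
import math

def sgd_step_l2(w, x_sparse, y, eta, mu):
    """Return w_t after one L2-regularized SGD step (direct version, O(d))."""
    z = sum(w[i] * v for i, v in x_sparse.items())
    err = y - 1.0 / (1.0 + math.exp(-z))
    # The decay factor multiplies every coordinate, even where x_t is zero.
    new_w = [(1.0 - 2.0 * eta * mu) * wi for wi in w]
    # The likelihood gradient only affects the non-zero coordinates of x_t.
    for i, v in x_sparse.items():
        new_w[i] += eta * err * v
    return new_w

w = sgd_step_l2([1.0, 1.0, 1.0], {0: 2.0}, y=1, eta=0.1, mu=0.5)
print(w)
```

Coordinates 1 and 2 change even though $\mathbf{x}_t$ is zero there, which is exactly the $O(d)$ cost described above.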


</div>


f) When you use L2 regularization, you will find each time you get a new $(\mathbf{x}_t, y_t)$ you need to update every element of vector $\mathbf{w}_t$ even if $\mathbf{x}_t$ has very few nonzero elements. Write the pseudo-code on how to update $\mathbf{w}_t$ efficiently with sparse input.



<div style="color:blue">

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_weights(v, scale, x, y, eta, mu):
    """One L2-regularized SGD step in O(k) time for k non-zero features.

    The weight vector is stored implicitly as w = scale * v, so shrinking
    every weight by (1 - 2*eta*mu) only changes the scalar `scale`.

    v:     list of floats, updated in place
    scale: float such that w_{t-1} = scale * v
    x:     dict {index: value} holding the non-zero entries of x_t
    y:     label y_t in {0, 1}
    Returns the new scale, so that w_t = new_scale * v.
    Assumes 2*eta*mu < 1 so that the scale stays positive.
    """
    # Prediction with w_{t-1}: only non-zero features contribute, O(k).
    z = scale * sum(v[i] * xi for i, xi in x.items())
    error = y - sigmoid(z)

    # Weight decay for all d coordinates at once, O(1).
    scale *= 1.0 - 2.0 * eta * mu

    # Gradient step on the non-zero coordinates only, O(k). Dividing by
    # scale compensates for the implicit factor in w = scale * v.
    for i, xi in x.items():
        v[i] += eta * error * xi / scale

    return scale
```

The original update multiplies every element of $\mathbf{w}$ by $(1 - 2\eta\mu)$ at each step, which costs $O(d)$ regardless of how sparse $\mathbf{x}_t$ is. The code above avoids this by storing the weights as $\mathbf{w} = c\,\mathbf{v}$ with a shared scalar $c$ (`scale`): the regularization shrinkage becomes a single multiplication of $c$, and the likelihood gradient is added only at the non-zero indices of $\mathbf{x}_t$, divided by $c$ so that $c\,\mathbf{v}$ equals $(1 - 2\eta\mu)\mathbf{w}_{t-1} + \eta (y_t - \sigma(\mathbf{w}_{t-1}^T \mathbf{x}_t)) \mathbf{x}_t$. Each update therefore costs $O(k)$. After many steps $c$ can become very small; folding it back into $\mathbf{v}$ (an $O(d)$ operation performed only occasionally) avoids numerical underflow.

</div>


## 3. Bayesian Linear Regression and Regularization

Linear regression is a model of the form $P(y \mid \mathbf{x}) \sim N\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}, \sigma^2\right)$ from a probabilistic point of view, where $\mathbf{w}$ is a $d$-dimensional vector. In ridge regression, we add an $\ell_2$ regularization term to our least squares objective function to prevent overfitting. Given data $D=\{\mathbf{x}_i, y_i\}_{i=1}^n$, our objective function for ridge regression is then:
$$
J(\mathbf{w})=\sum_{i=1}^n\left(y_i-\mathbf{w}^{\mathrm{T}} \mathbf{x}_i\right)^2+\lambda \mathbf{w}^{\mathrm{T}} \mathbf{w}
$$

We can arrive at the same objective function in a Bayesian setting if we consider a maximum a posteriori (MAP) estimate and assume $\mathbf{w}$ has the prior distribution $N(0,f(\lambda, \sigma)I)$.

(a) Write down the posterior distribution of $w$ given the data.

<div style="color:blue">

In a Bayesian setting, the posterior distribution of $\mathbf{w}$ given the data $D$ is proportional to the product of the likelihood of the data and the prior distribution of $\mathbf{w}$:

$$
P(\mathbf{w} \mid D) \propto P(D \mid \mathbf{w}) \cdot P(\mathbf{w})
$$

The likelihood of the data given $\mathbf{w}$ under a linear regression model is:
$$
P(D \mid \mathbf{w}) = \prod_{i=1}^n N\left(y_i \mid \mathbf{w}^{\mathrm{T}} \mathbf{x}_i, \sigma^2\right)
$$

The prior distribution of $\mathbf{w}$ is:
$$
P(\mathbf{w}) = N\left(\mathbf{w} \mid 0, f(\lambda, \sigma)I\right)
$$

Combining these, the posterior distribution is:

$$
P(\mathbf{w} \mid D) \propto \prod_{i=1}^n \exp\left(-\frac{1}{2\sigma^2}\left(y_i - \mathbf{w}^{\mathrm{T}}\mathbf{x}_i\right)^2\right) \cdot \exp\left(-\frac{1}{2f(\lambda, \sigma)}\mathbf{w}^{\mathrm{T}}\mathbf{w}\right)
$$

</div>


(b) What $\lambda, \sigma$ makes this MAP estimate the same as the solution to optimizing
$J (w)$?

<div style="color:blue">

To find the relationship between $(\lambda, \sigma)$ that equates the MAP estimate with the optimization of $J(\mathbf{w})$, we need to make the exponent in the posterior distribution match the form of $J(\mathbf{w})$.

The exponent of the posterior is:

$$
-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - \mathbf{w}^{\mathrm{T}}\mathbf{x}_i\right)^2 - \frac{1}{2f(\lambda, \sigma)}\mathbf{w}^{\mathrm{T}}\mathbf{w}
$$

This needs to be equivalent to the form of $J(\mathbf{w})$:
$$
\sum_{i=1}^n \left(y_i - \mathbf{w}^{\mathrm{T}}\mathbf{x}_i\right)^2 + \lambda \mathbf{w}^{\mathrm{T}}\mathbf{w}
$$

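The MAP estimate maximizes this exponent, or equivalently minimizes its negative. Multiplying an objective by a positive constant does not change its minimizer, so we can multiply the negative exponent by $2\sigma^2$ to obtain
$$
\sum_{i=1}^n \left(y_i - \mathbf{w}^{\mathrm{T}}\mathbf{x}_i\right)^2 + \frac{\sigma^2}{f(\lambda, \sigma)}\mathbf{w}^{\mathrm{T}}\mathbf{w}
$$

This matches $J(\mathbf{w})$ exactly when $\frac{\sigma^2}{f(\lambda, \sigma)} = \lambda$, that is, $f(\lambda, \sigma) = \frac{\sigma^2}{\lambda}$, for any $\sigma > 0$. Matching the two coefficients term by term, $\frac{1}{2\sigma^2} = 1$ and $\frac{1}{2f(\lambda, \sigma)} = \lambda$, gives the special case $\sigma^2 = \frac{1}{2}$ and $f(\lambda, \sigma) = \frac{1}{2\lambda}$. With this prior variance the MAP estimate is the minimizer of the ridge regression objective $J(\mathbf{w})$.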
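
A small numerical check can make the equivalence concrete. Because rescaling the exponent by a positive constant leaves the maximizer unchanged, the sketch below uses the prior variance $\sigma^2 / \lambda$, which reduces to $\frac{1}{2\lambda}$ at $\sigma^2 = \frac{1}{2}$. It is restricted to a scalar $w$ to stay short; the data and function names are made up for illustration.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of sum_i (y_i - w x_i)^2 + lam * w^2 for scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def neg_log_posterior(w, xs, ys, sigma2, prior_var):
    """Negative log-posterior of w, up to an additive constant."""
    sq_err = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return sq_err / (2.0 * sigma2) + w * w / (2.0 * prior_var)

def map_1d(xs, ys, sigma2, prior_var):
    """MAP estimate: zero of the derivative of neg_log_posterior in w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (sxy / sigma2) / (sxx / sigma2 + 1.0 / prior_var)

xs, ys, lam = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2], 0.7   # made-up data
for sigma2 in (0.5, 1.0, 4.0):
    w_map = map_1d(xs, ys, sigma2, prior_var=sigma2 / lam)
    print(sigma2, w_map, ridge_1d(xs, ys, lam))
```

For each noise variance the two estimates printed on a line should agree, since only the ratio between the noise variance and the prior variance enters the MAP solution.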

</div>




## 4. Random Forests

(a) Random forests are a modification of bagged decision trees. They improve on the variance reduction of bagging by reducing the correlation among trees. Briefly explain how this correlation reduction ("de-correlation") among trees is achieved when growing the trees.



<div style="color:blue">

### 1. Bootstrapping the data: 

Like bagging, Random Forests create multiple decision trees using bootstrap samples of the training dataset. Each tree is grown on a different bootstrap sample, which is a random selection of data points from the original dataset, with replacement. The bootstrapping process introduces variability among the trees, as each tree sees a slightly different subset of the data. 

Mathematically:

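Let $D$ be the original dataset and $D_i$ be the bootstrap sample for the $i^{th}$ tree. Then $D_i$ is obtained by drawing points from $D$ uniformly at random with replacement (typically $|D|$ of them), so it may contain duplicates and usually omits some points of $D$.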

### 2. Random selection of features

Unlike bagging, Random Forests introduce an additional layer of randomness by randomly selecting a subset of features at each split in the decision tree. This means that even if two trees are grown on the same bootstrap sample, they can still be different because they might use different sets of features to make decisions at the splits. Mathematically, this can be described as follows:

Let $F$ be the set of all features in the dataset. For each split in a tree, a subset $f \subset F$ is randomly chosen, where $|f| < |F|$. The best split is then determined from this subset $f$ instead of the entire set $F$.



More on Decorrelation in Random Forest 

* [Why is tree correlation a problem when working with bagging?](https://stats.stackexchange.com/questions/295868/why-is-tree-correlation-a-problem-when-working-with-bagging)

</div>


(b) Random forests are generally easy to implement and to train. They can be fit in one sequence, with cross-validation performed along the way (almost identical to performing N-fold cross-validation, where N is the number of data instances), through the use of out-of-bag (OOB) samples. Explain why using OOB samples eliminates the need for setting aside a test set for evaluating a random forest, and how this leads to more efficient training.


<div style="color:blue">

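Each tree is trained on a bootstrap sample, which leaves out roughly a third of the data points (about $1/e \approx 37\%$ for large datasets); these are that tree's out-of-bag (OOB) samples. Since a tree never saw its OOB samples during training, they act as a held-out test set for that tree. To estimate the error of the forest, each data point is predicted using only the trees for which it was OOB, and the errors are averaged over all data points. This eliminates the need to set aside a separate test set, so all of the data can be used for training, and the error estimate is obtained during the single fitting pass instead of by refitting the forest once per cross-validation fold.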
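
To see how much data a single tree leaves out, the following sketch draws one bootstrap sample and compares the out-of-bag fraction with $(1 - 1/n)^n$, the probability that a given point is missed by all $n$ draws, which approaches $1/e$ as $n$ grows. The function names, the sample size, and the seed are arbitrary choices for illustration.

```python
import math
import random

def oob_fraction(n, rng):
    """Fraction of n points left out of one bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n

def expected_oob_fraction(n):
    """Probability that a given point is missed by all n draws."""
    return (1.0 - 1.0 / n) ** n

rng = random.Random(0)   # fixed seed, for reproducibility only
print(oob_fraction(10_000, rng), expected_oob_fraction(10_000), math.exp(-1))
```

The three printed values should be close to one another, so on average each data point is out-of-bag for a bit more than a third of the trees.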

</div>

(c) List the model hyperparameters and model parameters of a random forest.



<div style="color:blue">

### Model Hyperparameters

1. **Number of Trees (`n_estimators`)**: The number of trees in the forest. A higher number usually improves performance but increases computational cost.

2. **Maximum Depth of Trees (`max_depth`)**: The maximum depth of each tree. Deeper trees can model more complex patterns but may lead to overfitting.

3. **Minimum Samples Split (`min_samples_split`)**: The minimum number of samples required to split an internal node. Affects the depth of the tree.

4. **Minimum Samples Leaf (`min_samples_leaf`)**: The minimum number of samples required to be at a leaf node. Smaller leaf size leads to capturing finer detail, but may cause overfitting.

5. **Maximum Features (`max_features`)**: The number of features to consider when looking for the best split. Can be a fraction, an integer, or a function like "sqrt" or "log2".

6. **Bootstrap (`bootstrap`)**: Whether bootstrap samples are used when building trees. If false, the whole dataset is used to build each tree.

7. **OOB Score (`oob_score`)**: Whether to use out-of-bag samples to estimate the generalization accuracy.

8. **Criterion (`criterion`)**: The function to measure the quality of a split (e.g., "gini" for Gini Impurity or "entropy" for Information Gain in classification tasks).

### Model Parameters

Model parameters, on the other hand, are learned or estimated from the data during the training process. They are not set manually and are adjusted to best fit the training data:

1. **Split Points at Internal Nodes**: The values at which each node splits the data.

2. **Split Features**: The feature chosen at each internal node to split on. A random forest does not learn explicit feature weights; feature importances are computed from the fitted trees after training.

3. **Leaf Values**: The values at the leaves of the trees, which can be a class label in classification or a numerical value in regression.

</div>


(d) Alice and Bob are data scientists debating whether a random forest is an "interpretable" model. Alice argues that it is interpretable, while Bob argues that its interpretability is limited. Briefly discuss why they may both be correct.

<div style="color:blue">



### Why Random Forests Are Considered Interpretable

* **Interpretability of individual decision trees**: Random forests are built from individual trees, each of which is interpretable. We can trace the decision path that a given tree follows for a specific prediction, which offers insight into the decision-making process of the model.

* **Feature Importance**: Random Forests provide straightforward metrics for feature importance, showing which features are most influential in making predictions. This can be insightful for understanding the driving factors behind the model's decisions.

* **Aggregated Decision Making**: The ensemble nature of Random Forests, where multiple trees vote on the outcome, can be interpreted as a form of collective decision-making, which some might find more interpretable compared to a single, potentially very complex, model.

### Why Random Forests Are Considered Not Interpretable

1. **Complexity of Ensemble**: While individual decision trees are interpretable, a Random Forest combines potentially hundreds of trees, making it difficult to understand the collective decision-making process. The ensemble nature obscures the clarity that individual trees offer.

2. **Lack of Predictive Insights**: Unlike linear models that provide coefficients indicating the direction and magnitude of the effect of each feature, Random Forests do not provide such clear-cut interpretative insights. Understanding how changes in feature values quantitatively affect the output is not straightforward.

3. **Local vs. Global Interpretability**: While one can analyze individual trees to understand specific decisions (local interpretability), getting a global understanding of how the model behaves overall across all features and data points is challenging.

4. **Black Box Nature**: To some extent, Random Forests are considered "black box" models, especially when dealing with large numbers of trees and complex structures. Extracting clear rules or patterns from the entire forest is not as direct as in simpler models.

### Conclusion

The interpretability of Random Forests, therefore, depends on the context and the specific requirements for interpretability. For applications where a high-level understanding of feature importance is sufficient, Random Forests can be considered interpretable. However, for applications requiring detailed insight into the exact decision-making process or the quantitative effect of each feature, Random Forests may fall short in interpretability compared to simpler, more transparent models like linear regression. The debate reflects the broader challenge in machine learning of balancing model complexity and accuracy with the need for transparency and ease of understanding.

</div>

