## 1. Graphical Models

The Restricted Boltzmann Machine (RBM) is an undirected graphical model over binary vectors. It has "visible" variables $v$ and "hidden" variables $h$. The joint distribution is
$$
p(v, h) \propto e^{-E(v, h)}
$$

Note that you need to normalize $e^{-E(v, h)}$ to get the probability. The energy function $E(v, h)$ is defined as

$E(v, h)=-\sum_i a_i v_i-\sum_j b_j h_j-\sum_{i, j} v_i h_j w_{i j}$

where $v_i$ and $h_j$ are the binary states of the visible variable $i$ and hidden variable $j$, respectively. $a_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between them.


(a) Consider the RBM with three visible variables and two hidden variables. Complete the graphical model by drawing the edges. (The graph contains 5 nodes, $h_1, h_2, v_1, v_2, v_3$)


<div style="color:blue">

In an RBM, every visible node is connected to every hidden node, but there are no connections between nodes of the same type. So we connect each of $v_1, v_2, v_3$ to both $h_1$ and $h_2$. Also, there should be no edges connecting $v_1, v_2, v_3$ among themselves, and similarly no edges connecting $h_1$ and $h_2$ among themselves.

Each edge in this model represents a dependency between a visible and a hidden variable, influenced by the weight $w_{ij}$ where $i$ and $j$ correspond to the indices of the connected visible and hidden nodes, respectively.
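
The edge set described above can be enumerated programmatically. This is a minimal sketch; the node names are just string labels chosen for illustration.

```python
from itertools import product

visible = ["v1", "v2", "v3"]
hidden = ["h1", "h2"]

# An RBM is a complete bipartite graph: one edge per (visible, hidden) pair,
# and no visible-visible or hidden-hidden edges.
edges = list(product(visible, hidden))

for v, h in edges:
    print(f"{v} -- {h}")
print(f"{len(edges)} edges in total")
```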

</div>


(b) Mark each of the following statements about conditional independence properties in the model as TRUE or FALSE.

<div style="color:blue">

References: 

* [Boltzmann Machines (UToronto)](https://www.cs.toronto.edu/~rgrosse/courses/csc321_2017/slides/lec19.pdf)

Under the bipartite structure, the hidden units are conditionally independent of one another given *all* of the visible units (and, symmetrically, the visible units are conditionally independent given all of the hidden units).

More generally, in an undirected graphical model $A \perp B \mid C$ holds when every path between $A$ and $B$ passes through a node in $C$. If some path avoids $C$, the two variables are in general dependent. Each statement below is evaluated with this criterion:

1. $(v_1 \perp v_2 \mid h_1)$: FALSE. The hidden node $h_2$ is not observed, so the path $v_1 - h_2 - v_2$ remains open and $v_1$ and $v_2$ are still coupled through $h_2$.

2. $(h_1 \perp h_2 \mid v_1)$: FALSE. Only $v_1$ is observed, so the paths $h_1 - v_2 - h_2$ and $h_1 - v_3 - h_2$ remain open. The hidden nodes are conditionally independent only when all visible nodes are given.

3. $(v_1 \perp v_2 \mid h_1, h_2)$: TRUE. When both hidden nodes $h_1$ and $h_2$ are known, the visible nodes $v_1$ and $v_2$ become independent because their only dependency is through these hidden nodes.

4. $(v_2 \perp v_3 \mid h_1, h_2)$: TRUE. Similar to the previous case, knowing both hidden nodes $h_1$ and $h_2$ renders the visible nodes $v_2$ and $v_3$ independent.

5. $(h_1 \perp h_2 \mid v_1, v_2)$: FALSE. The visible node $v_3$ is not observed, so the path $h_1 - v_3 - h_2$ remains open and $h_1$ and $h_2$ are in general dependent.

6. $(h_1 \perp h_2 \mid v_1, v_2, v_3)$: TRUE. All visible nodes are observed, so every path between $h_1$ and $h_2$ is blocked and the hidden nodes are conditionally independent.
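
The six statements can be checked mechanically with the graph-separation criterion for undirected models: two nodes are conditionally independent given a set if removing that set disconnects them. The sketch below is a plain breadth-first search over the RBM graph; the function name `is_separated` is hypothetical. Separation guarantees independence, while a remaining open path means dependence for generic (not specially chosen) weights.

```python
from collections import deque

VISIBLE = ["v1", "v2", "v3"]
HIDDEN = ["h1", "h2"]

# Adjacency lists of the complete bipartite RBM graph.
NEIGHBORS = {v: list(HIDDEN) for v in VISIBLE}
NEIGHBORS.update({h: list(VISIBLE) for h in HIDDEN})


def is_separated(a, b, given):
    """Return True if every path from a to b goes through a node in `given`."""
    blocked = set(given)
    seen = {a}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return False  # reached b along an unblocked path
        for nbr in NEIGHBORS[node]:
            if nbr not in seen and nbr not in blocked:
                seen.add(nbr)
                queue.append(nbr)
    return True


queries = [
    ("v1", "v2", ["h1"]),
    ("h1", "h2", ["v1"]),
    ("v1", "v2", ["h1", "h2"]),
    ("v2", "v3", ["h1", "h2"]),
    ("h1", "h2", ["v1", "v2"]),
    ("h1", "h2", ["v1", "v2", "v3"]),
]
results = [is_separated(a, b, given) for a, b, given in queries]
for (a, b, given), separated in zip(queries, results):
    print(f"{a} independent of {b} given {given}: {separated}")
```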

</div>

(c) What is the marginal distribution $p(v)$ and the corresponding normalization factor?

<div style="color:blue">

To obtain the marginal distribution $p(v)$, we sum the joint distribution over all possible states of the hidden variables:

$$
p(v) = \sum_h p(v, h) = \frac{1}{Z} \sum_h e^{-E(v, h)}
$$

Here, $Z = \sum_{v, h} e^{-E(v, h)}$ is the normalization factor (also known as the partition function), which ensures that the probabilities sum to 1.

Substituting the energy function, the marginal distribution $p(v)$ can be expressed as:

$$
p(v) = \frac{1}{Z} \sum_h \exp\left(\sum_i a_i v_i + \sum_j b_j h_j + \sum_{i, j} v_i h_j w_{ij}\right)
$$
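
Because the hidden units do not interact with each other, the sum over $h$ factorizes into a product over hidden units: $\sum_h e^{-E(v, h)} = e^{\sum_i a_i v_i} \prod_j \left(1 + e^{b_j + \sum_i v_i w_{ij}}\right)$. The sketch below checks this identity numerically against brute-force enumeration for a 3-visible, 2-hidden RBM. The parameter values are random and purely hypothetical, and the function names are illustrative.

```python
import itertools
import math
import random


def energy(v, h, a, b, w):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    pair_term = sum(v[i] * h[j] * w[i][j]
                    for i in range(len(v)) for j in range(len(h)))
    return -(visible_term + hidden_term + pair_term)


def marginal_brute_force(v, a, b, w):
    """Unnormalized p(v): sum of exp(-E) over all hidden states."""
    return sum(math.exp(-energy(v, h, a, b, w))
               for h in itertools.product([0, 1], repeat=len(b)))


def marginal_factored(v, a, b, w):
    """Unnormalized p(v) using the product over hidden units."""
    result = math.exp(sum(a_i * v_i for a_i, v_i in zip(a, v)))
    for j in range(len(b)):
        activation = b[j] + sum(v[i] * w[i][j] for i in range(len(v)))
        result *= 1.0 + math.exp(activation)
    return result


def partition_function(a, b, w):
    """Z: sum of the unnormalized marginal over all visible states."""
    return sum(marginal_factored(v, a, b, w)
               for v in itertools.product([0, 1], repeat=len(a)))


# Hypothetical random parameters, seeded for reproducibility.
rng = random.Random(0)
a = [rng.uniform(-1, 1) for _ in range(3)]
b = [rng.uniform(-1, 1) for _ in range(2)]
w = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]

Z = partition_function(a, b, w)
for v in itertools.product([0, 1], repeat=3):
    print(v, marginal_factored(v, a, b, w) / Z)
```

The factored form needs work linear in the number of hidden units for each $v$, whereas the brute-force sum grows exponentially with it. Computing $Z$ still requires a sum over all visible states.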


</div>


(d) Assume $a=0, b=0$, and $w_{2,1}=1$, and all other entries of $w$ are zero. Compute the probability $P(v_1=1, v_2=1 \mid h_2=0, v_3=0)$.

<div style="color:blue">

$P(v_1=1, v_2=1 \mid h_2=0, v_3=0) = P(v_1=1, v_2=1, h_2=0, v_3=0) / P(h_2=0, v_3=0)$

### Numerator: $P(v_1=1, v_2=1, h_2=0, v_3=0)$

With $a=0$, $b=0$, and $w_{2,1}=1$ as the only nonzero weight, the energy reduces to $E(v, h) = -v_2 h_1$. Fixing $v_1=1$, $v_2=1$, $v_3=0$, and $h_2=0$, the only free variable is $h_1$, so we sum over its two states (0 or 1):

$P(v_1=1, v_2=1, h_2=0, v_3=0) = \sum_{h_1} \frac{e^{-E(v, h)}}{Z}$

$= \sum_{h_1} \frac{e^{-(-v_2 h_1)}}{Z}$

$= \frac{1}{Z} (e^{-(-1 \cdot 0)} + e^{-(-1 \cdot 1)})$

$= \frac{1}{Z} (1 + e)$

### Denominator: $P(h_2=0, v_3=0)$

Here, we sum over all possible states of $v_1, v_2,$ and $h_1$:

$P(h_2=0, v_3=0) = \sum_{v_1, v_2, h_1} \frac{e^{-E(v, h)}}{Z}$

$= \frac{1}{Z} \sum_{v_1, v_2, h_1} e^{-(-v_2 h_1)}$

$= \frac{1}{Z} \sum_{v_1, v_2, h_1} e^{v_2 h_1}$

Since $v_1$ and $v_2$ can be either 0 or 1, and $h_1$ can be either 0 or 1, there are 2 x 2 x 2 = 8 terms in this sum. The summand does not depend on $v_1$, so each combination of $(v_2, h_1)$ appears twice:

$=  \frac{1}{Z} [2\exp(1 \times 1) + 2\exp(1 \times 0) + 2\exp(0 \times 1) + 2\exp(0 \times 0)] = \frac{1}{Z} (2e + 6)$

So the final result is $(\frac{1}{Z} (1 + e)) / (\frac{1}{Z} (2e + 6)) = 
\frac{e + 1}{2e + 6} \approx 0.325$
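
The conditional probability can be verified by direct enumeration, since $Z$ cancels in the ratio. This is a minimal sketch under the stated parameters ($a=0$, $b=0$, only $w_{2,1}=1$), for which $E(v, h) = -v_2 h_1$; the function names are hypothetical.

```python
import itertools
import math


def unnormalized(v1, v2, v3, h1, h2):
    """exp(-E(v, h)) with E(v, h) = -v2 * h1 (all other terms are zero)."""
    return math.exp(v2 * h1)


def conditional():
    """P(v1=1, v2=1 | h2=0, v3=0) by enumerating the free variables."""
    # Numerator: v1, v2, v3, h2 are fixed; only h1 is summed out.
    numerator = sum(unnormalized(1, 1, 0, h1, 0) for h1 in (0, 1))
    # Denominator: v3 and h2 are fixed; v1, v2, h1 are summed out.
    denominator = sum(unnormalized(v1, v2, 0, h1, 0)
                      for v1, v2, h1 in itertools.product((0, 1), repeat=3))
    return numerator / denominator  # Z cancels in the ratio


print(conditional())
```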

</div>



## 2. SVMs and the Kernel trick

You are given a data set $D$ (see Fig. 1) with data from a single feature $X_1$ in $\mathbb{R}^1$ and corresponding label $Y \in\{+,-\}$. The data set contains three positive examples at $X_1=\{-3,-2,3\}$ and three negative examples at $X_1=\{-1,0,1\}$.



(a) (1 point) Can this data set (in its current feature space) be perfectly separated using a linear separator? Why or why not? (Explain in 1 line)

<div style="color:blue">

No. In $\mathbb{R}^1$ a linear separator is a single threshold on $X_1$, but the positive examples lie on both sides of the negative ones: the positive examples at $X_1 \in \{-3,-2\}$ are to the left of all negative examples, whereas the positive example at $X_1 = 3$ is to the right of all negative examples.

</div>

(b) (2 points) Let's define the simple feature map $\phi(u)=\left(u, u^2\right)$, which transforms points in $\mathbb{R}^1$ to points in $\mathbb{R}^2$. Apply $\phi$ to the data and plot the points in the new $\mathbb{R}^2$ feature space (i.e. just show the plot). Can a linear separator perfectly separate the points in the new $\mathbb{R}^2$ feature space induced by $\phi$? Why or why not? (Again, explain in 1 line)

<div style="color:blue">

To answer this question, we first apply the feature map $\phi(u) = (u, u^2)$ to the data set $D$, and then plot the points in the new $\mathbb{R}^2$ feature space. Finally, we discuss whether a linear separator can perfectly separate the points in this new feature space.

1. **Applying the Feature Map:**
   The feature map $\phi(u) = (u, u^2)$ transforms each data point $X_1$ into a point in $\mathbb{R}^2$. For our data points:

   - Positive examples at $X_1 = \{-3, -2, 3\}$:
     - $\phi(-3) = (-3, 9)$
     - $\phi(-2) = (-2, 4)$
     - $\phi(3) = (3, 9)$
   - Negative examples at $X_1 = \{-1, 0, 1\}$:
     - $\phi(-1) = (-1, 1)$
     - $\phi(0) = (0, 0)$
     - $\phi(1) = (1, 1)$

2. **Separability:**
   Yes. The negative examples all have $u^2 \le 1$, while the positive examples all have $u^2 \ge 4$, so any horizontal line $Y_2 = t$ with $1 < t < 4$ (for example $Y_2 = 2.5$) separates the two classes. Data that were not linearly separable in the original $\mathbb{R}^1$ space become linearly separable after applying the feature map $\phi$.
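
The mapping and the separability claim can be reproduced in a few lines. This is a minimal sketch; the threshold $Y_2 = 2.5$ is just one illustrative separating line chosen here, not necessarily the maximum-margin one.

```python
def phi(u):
    """Feature map phi(u) = (u, u^2)."""
    return (u, u ** 2)


positives = [-3, -2, 3]
negatives = [-1, 0, 1]

mapped_pos = [phi(u) for u in positives]
mapped_neg = [phi(u) for u in negatives]

# Illustrative separator (an assumption): the horizontal line Y2 = 2.5.
threshold = 2.5
separable = (all(y2 > threshold for _, y2 in mapped_pos)
             and all(y2 < threshold for _, y2 in mapped_neg))

print("positive:", mapped_pos)
print("negative:", mapped_neg)
print("separated by Y2 = 2.5:", separable)
```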


</div>

(c) (1 point) Give the analytic form of the kernel that corresponds to the feature map $\phi$ in terms of only $X_1$ and $X_2$. Specifically, define $k\left(X_1, X_2\right)=\langle\phi\left(X_1\right), \phi\left(X_2\right)\rangle$ (where $\langle\cdot, \cdot\rangle$ is the dot product of two vectors), and give the analytical form of $k(\cdot, \cdot)$.

<div style="color:blue">

$k(X_1, X_2) = X_1 \cdot X_2 + X_1^2 \cdot X_2^2$
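
As a quick check, the kernel evaluated directly on the original scalars should agree with the dot product of the explicitly mapped points. The sketch below compares the two on the six data points; the helper names are illustrative.

```python
def phi(u):
    """Explicit feature map phi(u) = (u, u^2)."""
    return (u, u ** 2)


def dot(p, q):
    """Dot product of two equal-length tuples."""
    return sum(p_i * q_i for p_i, q_i in zip(p, q))


def kernel(x1, x2):
    """k(x1, x2) = x1*x2 + x1^2 * x2^2, computed without mapping to R^2."""
    return x1 * x2 + (x1 ** 2) * (x2 ** 2)


data = [-3, -2, -1, 0, 1, 3]

# Gram matrix computed with the kernel only.
gram = [[kernel(x1, x2) for x2 in data] for x1 in data]

# The kernel agrees with the explicit dot product on every pair.
agree = all(kernel(x1, x2) == dot(phi(x1), phi(x2))
            for x1 in data for x2 in data)
print(agree)
```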

</div>

(d) (4 points) Construct a maximum-margin separating hyperplane. This hyperplane will be a line in $\mathbb{R}^2$, which can be parameterized by its normal equation, i.e. $w_1 Y_1+w_2 Y_2+c=0$ for appropriate choices of $w_1, w_2$ and $c$. Here, $\left(Y_1, Y_2\right)=\phi\left(X_1\right)$ is the result of applying the feature map $\phi$ to the original feature $X_1$. Give the values for $w_1, w_2$ and $c$. Also, explicitly compute the margin for your hyperplane. You do not need to solve a quadratic program to find the maximum margin hyperplane. Instead, let your geometric intuition guide you.

<div style="color:blue">

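In the feature space the positive examples are $(-3, 9)$, $(-2, 4)$, $(3, 9)$ and the negative examples are $(-1, 1)$, $(0, 0)$, $(1, 1)$. The closest pair of points from opposite classes is the positive example $(-2, 4)$ and the negative example $(-1, 1)$, at distance $\sqrt{(-2+1)^2 + (4-1)^2} = \sqrt{10}$.

No separating hyperplane can have a total margin width larger than the distance between the closest opposite-class points, so if the perpendicular bisector of this pair separates all the data, it is the maximum-margin hyperplane. Its normal vector is the difference of the two points, $(w_1, w_2) = (-2, 4) - (-1, 1) = (-1, 3)$, and it passes through their midpoint $\left(\frac{-1 - 2}{2}, \frac{1 + 4}{2}\right) = \left(-\frac{3}{2}, \frac{5}{2}\right)$:

- $-\left(-\frac{3}{2}\right) + 3 \cdot \frac{5}{2} + c = 0$, so $c = -9$

This gives $w_1 = -1$, $w_2 = 3$, $c = -9$, i.e. the line $-Y_1 + 3 Y_2 - 9 = 0$.

To confirm that it separates the data, evaluate $-Y_1 + 3 Y_2 - 9$ at each point: the positive examples give $21, 5, 15$ and the negative examples give $-5, -9, -7$. All points are on the correct side, and only $(-2, 4)$ and $(-1, 1)$ attain the smallest absolute value $5$, so they are the support vectors.

The distance from the hyperplane to the nearest example is $\frac{5}{\sqrt{w_1^2 + w_2^2}} = \frac{5}{\sqrt{10}} = \frac{\sqrt{10}}{2} \approx 1.58$, so the total margin width is $\sqrt{10} \approx 3.16$. This matches the formula $\frac{2}{\sqrt{w_1^2 + w_2^2}}$ once the hyperplane is rescaled to the canonical form in which the support vectors satisfy $w_1 Y_1 + w_2 Y_2 + c = \pm 1$, i.e. $(w_1, w_2, c) = \left(-\frac{1}{5}, \frac{3}{5}, -\frac{9}{5}\right)$.

For comparison, the slope-1 line $Y_1 - Y_2 + 4 = 0$ through the midpoints $\left(-\frac{3}{2}, \frac{5}{2}\right)$ and $(1, 5)$ also separates the data, but its nearest examples are only $\sqrt{2} \approx 1.41$ away, and the horizontal line $Y_2 = 2.5$ has its nearest examples $1.5$ away. Both are smaller than $\frac{\sqrt{10}}{2}$.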
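
One way to sanity-check a candidate hyperplane is to compute the smallest signed distance from any mapped example to the line. The sketch below compares three candidate lines; the candidates and the helper name `geometric_margin` are chosen here for illustration, and the value returned is the distance to the nearest example (half of the total margin width).

```python
import math

positives = [(-3, 9), (-2, 4), (3, 9)]
negatives = [(-1, 1), (0, 0), (1, 1)]


def geometric_margin(w1, w2, c):
    """Smallest signed distance of any example to w1*Y1 + w2*Y2 + c = 0.

    Positive examples should score > 0 and negative examples < 0, so a
    negative return value means at least one point is misclassified.
    """
    norm = math.hypot(w1, w2)
    signed = [(w1 * y1 + w2 * y2 + c) / norm for y1, y2 in positives]
    signed += [-(w1 * y1 + w2 * y2 + c) / norm for y1, y2 in negatives]
    return min(signed)


candidates = {
    "horizontal line Y2 = 2.5": (0.0, 1.0, -2.5),
    # Same line as Y1 - Y2 + 4 = 0, with signs flipped so positives score > 0.
    "slope-1 line Y2 = Y1 + 4": (-1.0, 1.0, -4.0),
    "bisector of (-2, 4) and (-1, 1)": (-1.0, 3.0, -9.0),
}
for name, (w1, w2, c) in candidates.items():
    print(f"{name}: distance to nearest example = {geometric_margin(w1, w2, c):.3f}")
```

Among these three candidates, the perpendicular bisector of the closest opposite-class pair gives the largest value.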


</div>

(e) (2 points) Draw the decision boundary of the separating hyperplane in the original $\mathbb{R}^1$ feature space. Also circle the support vectors.

<div style="color:blue">

</div>



## 3. Convolutional Neural Networks

Consider the following convolutional neural network architecture (Figure 2):
In the first layer, we have a one-dimensional convolution with a single filter of size 3 such that $h_i=s\left(\sum_{j=1}^3 v_j x_{i+j-1}\right)$. The second layer is fully connected, such that $z=\sum_{i=1}^4 w_i h_i$. The hidden units' activation function $s(x)$ is the logistic (sigmoid) function. The output unit is linear (no activation function). We perform gradient descent on the loss function $R=(y-z)^2$, where $y$ is the training label for $x$.

(a) [1 pt] What is the total number of parameters in this neural network? Recall that convolutional layers share weights. There are no bias terms.

<div style="color:blue">
    

In the **convolutional layer**, we have a single filter of size 3. Since there are no bias terms and the weights are shared across the entire layer due to the nature of convolutional layers, the total number of parameters for this layer is equal to the size of the filter: 3.

In the **fully connected layer**, each of the 4 hidden units is connected to the single output unit $z$, and each connection has its own weight. Thus, there are 4 weights.

Therefore, the total number of parameters is 7.

</div>




(b) [4 pts] Compute $\partial R / \partial w_i$.

<div style="color:blue">

$\frac{\partial R}{\partial w_i} = \frac{\partial R}{\partial z} \frac{\partial z}{\partial w_i} = 2(z - y) h_i$
    
</div>


(c) [1 pt] Vectorize the previous expression, that is, write $\partial R / \partial w$.

<div style="color:blue">

The vectorized form of the output $z$ can be written as the dot product of $\mathbf{w}$ and $\mathbf{h}$:

$$z = \mathbf{w}^{\top} \mathbf{h}$$

To find the vectorized form of the gradient of $R$ with respect to $\mathbf{w}$, denoted as $\frac{\partial R}{\partial \mathbf{w}}$, we'll take the derivative of $R$ with respect to each component of $\mathbf{w}$:

$$\frac{\partial R}{\partial \mathbf{w}} = -2(y - z) \cdot \mathbf{h}$$

This is a vector whose $i$-th element is the derivative of $R$ with respect to $w_i$. The term $-2(y - z)$ is a scalar that multiplies each component of $\mathbf{h}$, so the gradient is simply a scalar multiple of the vector $\mathbf{h}$.

Substituting $z = \mathbf{w}^{\top} \mathbf{h}$ gives a more compact form:

$$\nabla_{\mathbf{w}} R = -2(y - \mathbf{w}^\top \mathbf{h}) \mathbf{h}$$
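
The vectorized gradient can be sanity-checked against central finite differences. This is a minimal pure-Python sketch using 0-based lists; the input, weights, and label are hypothetical values chosen only for illustration, and an input of length 6 is assumed so that a size-3 filter yields four hidden units.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def forward(x, v, w):
    """Return the hidden activations h (length 4) and the linear output z."""
    h = [sigmoid(sum(v[j] * x[i + j] for j in range(3))) for i in range(4)]
    z = sum(w_i * h_i for w_i, h_i in zip(w, h))
    return h, z


def loss(x, v, w, y):
    _, z = forward(x, v, w)
    return (y - z) ** 2


def grad_w(x, v, w, y):
    """Analytic gradient dR/dw = -2 (y - z) h."""
    h, z = forward(x, v, w)
    return [-2.0 * (y - z) * h_i for h_i in h]


def numerical_grad_w(x, v, w, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grads = []
    for i in range(len(w)):
        w_plus = w[:i] + [w[i] + eps] + w[i + 1:]
        w_minus = w[:i] + [w[i] - eps] + w[i + 1:]
        grads.append((loss(x, v, w_plus, y) - loss(x, v, w_minus, y)) / (2 * eps))
    return grads


# Hypothetical values for illustration only.
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
v = [0.1, -0.2, 0.3]
w = [0.4, -0.3, 0.2, 0.1]
y = 1.0

print(grad_w(x, v, w, y))
print(numerical_grad_w(x, v, w, y))
```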
    
    
</div>


(d) $[5 \mathrm{pts}]$ Compute $\partial R / \partial v_j$.

<div style="color:blue">
    
Given that:
$$R = (y - z)^2$$
$$z = \sum_{i=1}^4 w_i h_i$$
$$h_i = s\left(\sum_{j=1}^3 v_j x_{i+j-1}\right)$$

where $s(x)$ is the sigmoid function. 

We will again apply the chain rule for derivatives, but this time we will have to apply it twice because $v_j$ affects $R$ through $h_i$ and $h_i$ affects $R$ through $z$. 

$$\frac{\partial R}{\partial h_i} = \frac{\partial R}{\partial z} \cdot \frac{\partial z}{\partial h_i}$$

Since $\frac{\partial R}{\partial z} = -2(y - z)$, and $\frac{\partial z}{\partial h_i} = w_i$, we have:

$$\frac{\partial R}{\partial h_i} = -2(y - z) w_i$$

Next, we need to compute $\frac{\partial h_i}{\partial v_j}$, which requires the derivative of the sigmoid function $s$:

$$h_i = s(u_i) = \frac{1}{1 + e^{-u_i}}$$
$$u_i = \sum_{j=1}^3 v_j x_{i+j-1}$$

The derivative of the sigmoid function $s(u_i)$ with respect to $u_i$ is:

$$s'(u_i) = s(u_i)(1 - s(u_i))$$

The derivative of $u_i$ with respect to $v_j$ is simply $x_{i+j-1}$ because it is the coefficient that multiplies $v_j$ in the sum.

Now applying the chain rule to find $\frac{\partial h_i}{\partial v_j}$:

$$\frac{\partial h_i}{\partial v_j} = \frac{\partial h_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial v_j}$$
$$\frac{\partial h_i}{\partial v_j} = s'(u_i) \cdot x_{i+j-1}$$
$$\frac{\partial h_i}{\partial v_j} = h_i (1 - h_i) \cdot x_{i+j-1}$$

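Finally, we combine these results to find $\frac{\partial R}{\partial v_j}$. Because the filter weights are shared, $v_j$ appears in the pre-activation of every hidden unit $h_1, \dots, h_4$, so the contributions of all four hidden units must be summed:

$$\frac{\partial R}{\partial v_j} = \sum_{i=1}^{4} \frac{\partial R}{\partial h_i} \cdot \frac{\partial h_i}{\partial v_j}$$
$$\frac{\partial R}{\partial v_j} = -2(y - z) \sum_{i=1}^{4} w_i h_i (1 - h_i) x_{i+j-1}$$

Here, $\frac{\partial R}{\partial v_j}$ is a sum over the contributions of each hidden unit $h_i$ that the filter weight $v_j$ influences. For $i \in \{1, \dots, 4\}$ and $j \in \{1, 2, 3\}$ the index $i+j-1$ ranges from 1 to 6, so every term refers to a valid entry of the input $x$ (assuming an input of length 6, as implied by four hidden units and a filter of size 3).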
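
The gradient with respect to the shared filter weights can be checked numerically in the same way. This is a minimal pure-Python sketch with 0-based lists, in which each $v_j$ receives a contribution from all four hidden units. The input, weights, and label are hypothetical values, and an input of length 6 is assumed.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def forward(x, v, w):
    """Hidden activations h_1..h_4 and linear output z (0-based lists)."""
    h = [sigmoid(sum(v[j] * x[i + j] for j in range(3))) for i in range(4)]
    z = sum(w[i] * h[i] for i in range(4))
    return h, z


def loss(x, v, w, y):
    _, z = forward(x, v, w)
    return (y - z) ** 2


def grad_v(x, v, w, y):
    """Analytic dR/dv_j, summing over all four hidden units sharing the filter."""
    h, z = forward(x, v, w)
    return [
        -2.0 * (y - z) * sum(w[i] * h[i] * (1.0 - h[i]) * x[i + j] for i in range(4))
        for j in range(3)
    ]


def numerical_grad_v(x, v, w, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grads = []
    for j in range(3):
        v_plus = v[:j] + [v[j] + eps] + v[j + 1:]
        v_minus = v[:j] + [v[j] - eps] + v[j + 1:]
        grads.append((loss(x, v_plus, w, y) - loss(x, v_minus, w, y)) / (2 * eps))
    return grads


# Hypothetical values for illustration only.
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
v = [0.1, -0.2, 0.3]
w = [0.4, -0.3, 0.2, 0.1]
y = 1.0

print(grad_v(x, v, w, y))
print(numerical_grad_v(x, v, w, y))
```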

</div>


## 4. Applied Data Analysis

Carol and Bob are data scientists working at a Fortune 100 company that routinely works with petabyte-scale high-dimensional datasets with a huge number of data points. They are tasked to solve two machine learning problems: a binary classification problem, and a clustering problem. They are debating what methods to try first. Carol believes that it is a good idea to first try some "simpler," well-known methods (e.g., k-nearest neighbors, random forests, $k$-means), then try more sophisticated and possibly better-performing methods. Bob, on the other hand, thinks it is good to first try the latest techniques published at top academic conferences (e.g., KDD, NeurIPS) since many of them report state-of-the-art results.

(a) (2 points) Briefly justify why both of their approaches may be reasonable.

<div style="color:blue">

#### Trying simpler, well-known methods (Carol)

* **Well understood and cheap to run**: Methods like k-nearest neighbors, random forests, and $k$-means are well understood, relatively easy to implement, and often require fewer computational resources. This matters when a quick solution is needed or computational resources are limited.
* **Solid baseline**: Starting with these methods provides a baseline for performance and helps in understanding the structure and distribution of the data.
* **Interpretability**: These methods are often more **interpretable**, which might be important for understanding the model's predictions.
* **Lower risk of overfitting**: Simpler models are often less prone to overfitting.

#### Trying the latest techniques (Bob)

* **State-of-the-art performance**: Techniques presented at top conferences often represent the cutting edge of machine learning research and have been peer-reviewed. They may provide superior performance over traditional methods, and some are designed to handle high-dimensional data efficiently with better feature extraction and selection.
* **Innovation and competitive edge**: By adopting the latest methods, the company could potentially gain a competitive edge in the industry, especially if these techniques can handle high-dimensional data more effectively.
* **Addressing limitations of simpler methods**: Sophisticated methods might overcome the limitations of simpler ones, especially in dealing with complex patterns and interactions within the data that simpler models might miss.
* **Leveraging company resources**: A Fortune 100 company is likely to have the computational resources and expertise needed to implement and benefit from the latest techniques.
* **Academic collaboration**: Using cutting-edge techniques can foster collaboration with academia and help the team stay current with the latest research trends and innovations.

</div>

(b) (2 points) They decide to first try k-nearest neighbors (k-NN) for their classification problem and k-means for their clustering problem. Briefly describe the scalability challenges they may encounter.

<div style="color:blue">

**Computational complexity**

* k-NN: A brute-force k-NN query computes the distance between the query point and every one of the $n$ points in the dataset, which costs $O(nd)$ per query for $d$ features. For a dataset of the size they are working with, this leads to extremely high computational costs, especially as the number of dimensions (features) increases.
* k-means: Each iteration of k-means computes the distance of each data point to every one of the $k$ cluster centroids and reassigns points to the nearest cluster, which costs $O(nkd)$ per iteration. This becomes increasingly demanding as the dataset and the number of clusters grow.

**Memory requirements**

* k-NN: k-NN typically requires the entire dataset to be held in memory to perform distance calculations efficiently. For petabyte-scale data, this is impractical, if not impossible, with standard computing resources.
* k-means: While k-means can be more memory-efficient than k-NN, it still requires substantial memory to store the centroids, the data points, and intermediate computations, especially as the number of dimensions and clusters grows.
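
To make the k-NN trade-off concrete, here is a minimal sketch of a brute-force query that streams over the data and keeps only the $k$ best candidates in a heap. The function name `knn_streaming` and the toy data are hypothetical. Streaming removes the need to hold the dataset in memory, but every point is still visited, so the per-query cost stays linear in the number of points.

```python
import heapq
import math


def knn_streaming(query, point_stream, k):
    """Return the k nearest (distance, label) pairs from an iterable of points.

    Only k candidates are kept in memory at any time, but each point in the
    stream still requires one distance computation.
    """
    heap = []  # min-heap on negated distance, so heap[0] is the worst candidate
    for idx, (point, label) in enumerate(point_stream):
        dist = math.dist(query, point)
        item = (-dist, idx, label)  # idx breaks ties before labels are compared
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # closer than the current worst
    return sorted((-neg_dist, label) for neg_dist, _, label in heap)


# Toy one-dimensional data for illustration only.
stream = [((0.0,), "a"), ((1.0,), "b"), ((5.0,), "c"), ((0.5,), "d")]
nearest = knn_streaming((0.2,), iter(stream), k=2)
print(nearest)
```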


</div>

(c) (2 points) Briefly describe how they may determine the value of $k$ in k-NN, and the value of $k$ in k-means. (The two $k$ values can be different.)

<div style="color:blue">

</div>

(d) (4 points) They believe visualization will play an important role in evaluating and comparing the performance of the machine learning methods that they are going to try. Briefly describe two visualization approaches that can help with such comparison - for each approach:
i. describe a challenge that could arise when the visualization approach is applied to the large datasets that Carol and Bob are working with; and
ii. propose a method to tackle that challenge.
For easier discussion, your discussion and examples may center around evaluation metrics of your choosing. You are welcome to include illustrations to support your answers.

<div style="color:blue">

</div>