
## 1. Recommendation Systems



You have collected the following ratings of popular comedy TV shows from five users:

| Show | Alice | Bob | Charles | David | Eugene |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Friends (F) | 5 | 3 |  | 2 | 4 |
| The Office (TO) | 4 |  |  |  | 4 |
| Arrested Development (AD) | 4 |  |  | 4 |  |
| The Big Bang Theory (BBT) |  | 2 | 1 |  |  |
| The Invincible (IV) | 1 |  | 1 | 1 |  |



(a) To generate recommendations, you adopt the following policy: “if a user U likes show X, then U will also like show Y”. You implement this by maximizing the cosine similarity between the ratings of items X and Y. Your policy also states that you will only make a recommendation to user U if (i) U has not already watched or rated Y and (ii) U’s rating of show X is at least 3.
Using this policy, which TV show would be recommended to Eugene? Show the comparisons that you made.

<div style="color:blue">

Eugene has rated only "Friends" (4) and "The Office" (4). Both ratings are at least 3, so either show can play the role of X, and the candidates Y are the three shows Eugene has not rated. Missing ratings are treated as 0.

Cosine similarity between "Friends" and the candidate shows:

* $\mathrm{sim}(F, AD) = \frac{28}{\sqrt{54} \times \sqrt{32}} = \frac{7}{6 \sqrt{3}} = 0.6736$
* $\mathrm{sim}(F, BBT) = \frac{6}{\sqrt{54} \times \sqrt{5}} = \frac{2}{\sqrt{30}} = 0.3651$
* $\mathrm{sim}(F, IV) = \frac{7}{\sqrt{54} \times \sqrt{3}} = \frac{7}{9\sqrt{2}} = 0.5500$

Cosine similarity between "The Office" and the candidate shows:

* $\mathrm{sim}(TO, AD) = \frac{16}{32} = 0.5$
* $\mathrm{sim}(TO, BBT) = 0$
* $\mathrm{sim}(TO, IV) = \frac{4}{\sqrt{32} \times \sqrt{3}} = \frac{1}{\sqrt{6}} = 0.4082$

The highest similarity is $\mathrm{sim}(F, AD) = 0.6736$, so the show "Arrested Development" will be recommended to Eugene.
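
The comparisons above can be reproduced with a short script. This is a minimal sketch that assumes a missing rating is encoded as 0, as in the hand computation; the names `RATINGS`, `cosine` and `recommend` are illustrative only.

```python
from math import sqrt

# Ratings from the table; a missing rating is encoded as 0 (assumption).
USERS = ["Alice", "Bob", "Charles", "David", "Eugene"]
RATINGS = {
    "F":   [5, 3, 0, 2, 4],
    "TO":  [4, 0, 0, 0, 4],
    "AD":  [4, 0, 0, 4, 0],
    "BBT": [0, 2, 1, 0, 0],
    "IV":  [1, 0, 1, 1, 0],
}

def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def recommend(user):
    """Return (similarity, liked_show, candidate) with the highest similarity."""
    k = USERS.index(user)
    liked = [s for s, r in RATINGS.items() if r[k] >= 3]   # valid X
    unseen = [s for s, r in RATINGS.items() if r[k] == 0]  # valid Y
    pairs = [(cosine(RATINGS[x], RATINGS[y]), x, y) for x in liked for y in unseen]
    return max(pairs)

best = recommend("Eugene")
print(best)  # similarity, then X, then the recommended Y
```

For Eugene the best pair is X = F and Y = AD, which matches the comparison list.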




</div>

(b) Next, you design a recommendation system to rank TV shows to find the ‘Best TV Shows of All Times’, using the following formula: ratings(i) = a + b(i). In this formula, a is a global average rating term and b(i) is show i’s bias score. You first fit this model by calculating a as the mean of all ratings across the dataset, and then calculate b(i) = average rating given to show i – global average rating a.
You rank the shows according to their bias scores (higher bias score is ranked higher). Which show, among the five shows shown in Table 1, would be the Best TV Show and which one would be the Worst TV show? Show your calculations.
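
The formula in (b) can be evaluated directly on the ratings table. The following is a sketch, not an official solution: it assumes blank cells are simply excluded from every average, and the dictionary `ratings` is transcribed by hand from the table.

```python
# Observed ratings per show, read off the table (blank cells are omitted).
ratings = {
    "Friends": [5, 3, 2, 4],
    "The Office": [4, 4],
    "Arrested Development": [4, 4],
    "The Big Bang Theory": [2, 1],
    "The Invincible": [1, 1, 1],
}

all_ratings = [r for rs in ratings.values() for r in rs]
a = sum(all_ratings) / len(all_ratings)  # global average: 36 / 13

# b(i) = average rating of show i minus the global average a
bias = {show: sum(rs) / len(rs) - a for show, rs in ratings.items()}

for show, b in sorted(bias.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{show:22s} b = {b:+.3f}")
```

Under these assumptions there are 13 ratings summing to 36, so a ≈ 2.77. "The Office" and "Arrested Development" tie for the highest bias (≈ +1.23), and "The Invincible" has the lowest (≈ -1.77).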

<div style="color:blue">




</div>


(c) You come up with the idea of training a deep learning-based recommendation system model, namely the Neural Collaborative Filtering (NCF) model, on your large dataset to create better recommendation models. Your large dataset has 10 million ratings given by approximately 100,000 users to approximately 1,000,000 movies.

Your NCF model first generates 8-dimensional user and item embeddings. Then you pass the embeddings through two fully-connected neural CF layers with sizes 8x16 and 16x16 dimensions. Finally, this is passed through a 16x1 output layer with ReLU activation to produce a single prediction value of recommending an item to a user. You train the model for 10 epochs with back-propagation.

After training the model, you find that the model does not perform well. What changes can you make to the model or parameters to potentially improve the performance? Give at least three options. Note that you cannot choose a different model now.

<div style="color:blue">

</div>


(d) Recommender systems are typically trained on a subset of training data, due to the large size of the entire dataset. A popular dataset sampling strategy is to take the interactions between the most active users and the most interacted-with items. Specifically, all users with fewer than k interactions are removed and all items with fewer than k interactions are removed. The recommender model is trained on the remaining dataset. Describe a bias that this sampling strategy can introduce in a recommender model trained on this dataset.


<div style="color:blue">
    
Popularity Bias: Since the model is trained primarily on items with a high number of interactions, it tends to recommend these popular items more frequently. This is because the model views these items as more relevant, given their higher presence in the training data. As a result, less popular items, which might be equally or more relevant to certain users, are underrepresented in recommendations.

</div>




## 2. Learning Theory and VC Dimension

<div style="color:red">Seems like ERM will not be included in Spring 2024</div>


Consider a binary classification problem with the hypothesis class of two-dimensional thresholds, $H=\{h_{a, b}: a \in \mathbb{R} \text{ and } b \in \mathbb{R}\}$ where:

$$h_{a, b}(\mathbf{x})= \begin{cases}1 & \text { if } x_1 \leq a \text { and } x_2 \leq b \\ -1 & \text { otherwise }\end{cases}$$

and $\mathcal{X}=\mathbb{R}^2$ and $\mathcal{Y}=\{-1,1\}$.

(a) [3 pts] Describe an algorithm for computing the ERM for this class in the realizable case, assuming the 0-1 loss is used. State the computational complexity of the algorithm in the context of a training data set of size $m$.

<div style="color:blue">
    
    
References
* [ERM (Wiki)](https://en.wikipedia.org/wiki/Empirical_risk_minimization)
* [this video](https://youtu.be/ARkXG0q3sJU?si=NbcMJMDMEKFpS_I4)
    

A hypothesis class $C$ is learnable if, given a finite i.i.d. sample, we can find a hypothesis $h$ whose generalization error is close to that of the best hypothesis in $C$.

An ERM algorithm chooses a hypothesis in $C$ that minimizes the empirical error, i.e. the average loss on the training sample.



In the realizable case, we assume that there exists a hypothesis in our class that can perfectly classify the training data. 

The hypothesis class consists of two-dimensional thresholds, and the decision is binary based on the values of $x_1$ and $x_2$ relative to the thresholds $a$ and $b$. Given a training set of size $m$, the goal is to find the best thresholds $a$ and $b$ that minimize the empirical risk.

### Algorithm

1. **Collect the Positive Examples:** Let $P = \{i : y^{(i)} = 1\}$ be the indices of the training points labeled $1$.

2. **Handle the Empty Case:** If $P$ is empty, output any $a$ smaller than every $x_1^{(i)}$ (for example $a = \min_i x_1^{(i)} - 1$) with an arbitrary $b$. Every training point is then labeled $-1$.

3. **Take the Tightest Fit:** Otherwise set $a = \max_{i \in P} x_1^{(i)}$ and $b = \max_{i \in P} x_2^{(i)}$.

4. **Output:** Return $h_{a, b}$.

**Why the empirical error is zero:** Every positive point satisfies $x_1 \leq a$ and $x_2 \leq b$ by construction. By realizability there is some $h_{a^*, b^*}$ with zero training error; it must contain all positive points, so $a \leq a^*$ and $b \leq b^*$. A negative point with $x_1 \leq a$ and $x_2 \leq b$ would then also satisfy $x_1 \leq a^*$ and $x_2 \leq b^*$, which contradicts realizability. Hence no negative point is misclassified.

Note that $a$ and $b$ may come from two different training points, so it is not enough to try only the pairs $(x_1^{(i)}, x_2^{(i)})$ taken from a single point. For positives at $(1, 0)$ and $(0, 1)$ the tightest fit is $(a, b) = (1, 1)$, which is not a training point.

### Computational Complexity

- The algorithm makes a single pass over the training set and keeps two running maxima, so its computational complexity is $O(m)$.
- For comparison, a brute-force search that does not use realizability can try every candidate $a$ among the observed $x_1$ values and every candidate $b$ among the observed $x_2$ values (plus one value below the minimum of each). That gives $O(m^2)$ candidate pairs, each evaluated in $O(m)$, for a total of $O(m^3)$.
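
One realizable-case ERM rule is the "tightest fit": take $a$ as the largest $x_1$ and $b$ as the largest $x_2$ among the positive examples. A minimal sketch of that rule follows; the function names and the toy data are illustrative assumptions.

```python
import math

def erm_thresholds(points, labels):
    """Tightest-fit ERM for h_{a,b} in the realizable case; one pass, O(m)."""
    a = b = -math.inf  # with no positive example, every point is labelled -1
    for (x1, x2), y in zip(points, labels):
        if y == 1:
            a = max(a, x1)
            b = max(b, x2)
    return a, b

def predict(a, b, x):
    """h_{a,b}(x): +1 inside the lower-left quadrant with corner (a, b)."""
    return 1 if x[0] <= a and x[1] <= b else -1

# Toy realizable sample (hypothetical data)
points = [(1, 0), (0, 1), (2, 2), (0.5, 3)]
labels = [1, 1, -1, -1]

a, b = erm_thresholds(points, labels)
errors = sum(predict(a, b, x) != y for x, y in zip(points, labels))
print(a, b, errors)
```

In this toy sample the returned thresholds come from two different points, which is why searching only over single training points as candidate $(a, b)$ pairs can miss the zero-error hypothesis.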

</div>

(b) [7 pts] What is the VC dimension of this hypothesis class? Provide a complete proof.

<div style="color:blue">

To show that the VC dimension is 2, we exhibit one set of two points that $H$ shatters and prove that no set of three points is shattered.

In our case, the hypothesis class $H$ consists of two-dimensional thresholds. Each hypothesis $h_{a, b}$ classifies a point $\mathbf{x} = (x_1, x_2)$ as 1 exactly when $x_1 \le a$ and $x_2 \le b$, i.e. when $\mathbf{x}$ lies in the lower-left quadrant with corner $(a, b)$.

**Shattering 2 points**

Not every pair of points can be shattered: for $(0, 0)$ and $(1, 1)$, labeling $(1, 1)$ as 1 forces $(0, 0)$ to be 1 as well. The VC dimension only requires one shattered set, so take $p = (0, 1)$ and $q = (1, 0)$. The labelings $(h(p), h(q))$ are realized as follows:

- $(1, 1)$: $a = 1, b = 1$.
- $(1, -1)$: $a = 0, b = 1$.
- $(-1, 1)$: $a = 1, b = 0$.
- $(-1, -1)$: $a = -1, b = -1$.

All four labelings are realized, so the VC dimension is at least 2.

**Shattering 3 points**

Take any set of three points. Let $p$ be a point with the largest $x_1$ coordinate and $q$ a point with the largest $x_2$ coordinate (break ties arbitrarily; $p = q$ is allowed). The set $\{p, q\}$ contains at most two points, so there is a third point $z$ outside it.

Consider the labeling that assigns 1 to $p$ and $q$ and $-1$ to $z$. Any hypothesis with $h_{a,b}(p) = h_{a,b}(q) = 1$ satisfies $a \ge p_1 \ge z_1$ and $b \ge q_2 \ge z_2$, which forces $h_{a,b}(z) = 1$. This labeling is therefore not realizable, and no set of three points is shattered.

Since the hypothesis class shatters some set of two points but no set of three points, the VC dimension of the hypothesis class is 2.
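
The claims about shattering can be checked by brute force on small point sets. The sketch below relies on one observation: a labeling is realizable by some $h_{a,b}$ if and only if the tightest quadrant around the positive points contains no negative point. The helper names are illustrative.

```python
from itertools import product

def realizable(points, labels):
    """Is there (a, b) such that h_{a,b} produces exactly these labels?"""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # choose a below every x1: all points get -1
    # The tightest quadrant containing the positives is the best candidate.
    a = max(p[0] for p in pos)
    b = max(p[1] for p in pos)
    return all((p[0] <= a and p[1] <= b) == (y == 1)
               for p, y in zip(points, labels))

def shattered(points):
    """True if every one of the 2^n labelings is realizable."""
    return all(realizable(points, lab)
               for lab in product([1, -1], repeat=len(points)))

print(shattered([(0, 1), (1, 0)]))          # a two-point set
print(shattered([(0, 0), (1, 1)]))          # a two-point set on a diagonal
print(shattered([(0, 2), (1, 1), (2, 0)]))  # a three-point set
```

The first set is shattered while the other two are not, which illustrates that shattering depends on the arrangement of the points and that the VC dimension only asks for the existence of one shattered set of a given size.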


</div>




## 3. Gaussian Discriminant Analysis


Consider a two-class classification problem with data in $\mathbb{R}^d \times\{1,2\}$. Gaussian discriminant analysis solves this problem by modeling the class-conditional distributions as a Gaussian:

$$
p(X \mid Y=i) \sim \mathcal{N}\left(\mu_i, \Sigma_i\right), \quad p(Y=i)=\pi_i \quad \text { for } \quad i \in\{1,2\} .
$$

Here, $\mathcal{N}(\mu, \Sigma)$ denotes a multivariate normal density with mean $\mu$ and covariance matrix $\Sigma$.

(a) [2 pts] Given a large set $\left(X_i, Y_i\right)_{1 \leq i \leq N}$ of independent data, derive the maximum likelihood estimates of the parameters $\left(\pi_i, \mu_i, \Sigma_i\right)_{i \in\{1,2\}}$.


* [Analytic Maximum Likelihood for the parameters of the Gaussian and Bernoulli distributions, as well as the parameters of a linear model using Ordinary Least Squares](https://web.pdx.edu/~joel8/resources/ConceptualPresentationResources/AnalyticDerivations.pdf)




<div style="color:blue">

1. **Log-Likelihood Function**: 
   The likelihood of observing a single data point $(X_i, Y_i)$ is given by:

   $$
   p(X_i, Y_i \mid \pi, \mu, \Sigma) = p(X_i \mid Y_i; \mu, \Sigma) p(Y_i \mid \pi).
   $$

   Since $Y_i \in \{1, 2\}$, this can be written as:

   $$
   p(X_i, Y_i \mid \pi, \mu, \Sigma) = \left[\pi_1 \mathcal{N}(X_i \mid \mu_1, \Sigma_1)\right]^{1[Y_i=1]} \cdot \left[\pi_2 \mathcal{N}(X_i \mid \mu_2, \Sigma_2)\right]^{1[Y_i=2]},
   $$

   where $1[\cdot]$ is the indicator function. The log-likelihood for the entire dataset is:

   $$
   \log L(\pi, \mu, \Sigma) = \sum_{i=1}^{N} \log \left( \left[\pi_1 \mathcal{N}(X_i \mid \mu_1, \Sigma_1)\right]^{1[Y_i=1]} \cdot \left[\pi_2 \mathcal{N}(X_i \mid \mu_2, \Sigma_2)\right]^{1[Y_i=2]} \right).
   $$

2. **Deriving MLE for $\pi_i$**: 
   In the formulas below, $j$ indexes the samples so that it is not confused with the class index $i$. The MLE for $\pi_i$ is obtained by maximizing the log-likelihood with respect to $\pi_i$ under the constraint $\pi_1 + \pi_2 = 1$. Only the terms $N_1 \log \pi_1 + N_2 \log \pi_2$ depend on $\pi$, and a Lagrange multiplier for the constraint gives:

   $$
   \hat{\pi}_i = \frac{1}{N} \sum_{j=1}^{N} 1[Y_j = i] = \frac{N_i}{N},
   $$

   where $N_i$ is the number of samples belonging to class $i$, and $N$ is the total number of samples.

3. **Deriving MLE for $\mu_i$**: 
   The MLE for $\mu_i$ is obtained by differentiating the log-likelihood with respect to $\mu_i$ and setting it to zero, which yields the sample mean of class $i$:

   $$
   \hat{\mu}_i = \frac{\sum_{j=1}^{N} 1[Y_j = i] X_j}{\sum_{j=1}^{N} 1[Y_j = i]}.
   $$

4. **Deriving MLE for $\Sigma_i$**: 
   Similarly, the MLE for $\Sigma_i$ is found by differentiating the log-likelihood with respect to $\Sigma_i$ and setting it to zero, which yields the sample covariance of class $i$ (normalized by $N_i$, not $N_i - 1$):

   $$
   \hat{\Sigma}_i = \frac{\sum_{j=1}^{N} 1[Y_j = i] (X_j - \hat{\mu}_i)(X_j - \hat{\mu}_i)^T}{\sum_{j=1}^{N} 1[Y_j = i]}.
   $$
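
The closed-form estimators translate almost line by line into code. The following is a small standard-library sketch (no NumPy) on a hypothetical five-point data set; `gda_mle` and the sample values are assumptions made for illustration.

```python
def gda_mle(X, Y, classes=(1, 2)):
    """Return {class: (pi, mu, Sigma)} using the closed-form MLEs."""
    N, d = len(X), len(X[0])
    params = {}
    for c in classes:
        Xc = [x for x, y in zip(X, Y) if y == c]
        n = len(Xc)
        # Class mean: average of the samples with label c
        mu = [sum(x[k] for x in Xc) / n for k in range(d)]
        # MLE covariance divides by n (not n - 1)
        Sigma = [[sum((x[r] - mu[r]) * (x[s] - mu[s]) for x in Xc) / n
                  for s in range(d)] for r in range(d)]
        params[c] = (n / N, mu, Sigma)
    return params

# Hypothetical data: two points in class 1, three in class 2
X = [(0.0, 0.0), (2.0, 0.0), (4.0, 4.0), (6.0, 4.0), (5.0, 7.0)]
Y = [1, 1, 2, 2, 2]

params = gda_mle(X, Y)
for c, (pi_c, mu_c, Sigma_c) in params.items():
    print(c, pi_c, mu_c, Sigma_c)
```

With only two or three points per class the covariance estimates are of course unreliable; the example is only meant to show which sums enter each estimator.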

</div>

(b) [2 pts] Once the parameters of the model have been obtained, we want to classify a new data point $\tilde{X}$ by maximizing the conditional probability $p(Y \mid X=\tilde{X})$. Formulate the resulting decision rule mathematically.

<div style="color:blue">

1. **Posterior Probability**:
   The posterior probability for a class $i$ given the data point $\tilde{X}$ is computed using Bayes' theorem:

   $$
   p(Y=i \mid X=\tilde{X}) = \frac{p(X=\tilde{X} \mid Y=i) \cdot p(Y=i)}{p(X=\tilde{X})}.
   $$

   * $p(X=\tilde{X} \mid Y=i)$ is the likelihood of $\tilde{X}$ under the Gaussian model for class $i$ with parameters $(\mu_i, \Sigma_i)$
   * $p(Y=i)$ is the prior probability of class $i$, which is $\pi_i$.

2. **Applying the Gaussian Model**:
   We substitute the Gaussian density for $p(X=\tilde{X} \mid Y=i)$:

   $$
   p(Y=i \mid X=\tilde{X}) = \frac{\pi_i \cdot \mathcal{N}(\tilde{X} \mid \mu_i, \Sigma_i)}{\sum_{j=1}^2 \pi_j \cdot \mathcal{N}(\tilde{X} \mid \mu_j, \Sigma_j)}.
   $$

3. **Decision Rule**:
   To classify $\tilde{X}$, we choose the class $i \in \{1, 2\}$ that maximizes this posterior probability. Mathematically, the decision rule is:

   $$
   \text{Classify } \tilde{X} \text{ as class } i^* \text{ where } i^* = \mathrm{argmax}_{i \in \{1, 2\}} p(Y=i \mid X=\tilde{X}).
   $$

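   Because the denominator $p(X=\tilde{X})$ is the same for both classes, this is equivalent to choosing the class with the larger value of $\pi_i \, \mathcal{N}(\tilde{X} \mid \mu_i, \Sigma_i)$, or equivalently of $\log \pi_i + \log \mathcal{N}(\tilde{X} \mid \mu_i, \Sigma_i)$.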
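
As an illustration of the decision rule, here is a univariate ($d = 1$) sketch that compares $\log \pi_i + \log \mathcal{N}(x \mid \mu_i, \sigma_i^2)$ across classes. The parameter values are hypothetical and the function names are not from the notes.

```python
import math

def log_joint(x, prior, mu, var):
    """log(pi_i) + log N(x | mu_i, var_i) for a univariate Gaussian."""
    return (math.log(prior)
            - 0.5 * math.log(2 * math.pi * var)
            - (x - mu) ** 2 / (2 * var))

def classify(x, params):
    """params maps class label -> (prior, mean, variance); returns the argmax."""
    return max(params, key=lambda c: log_joint(x, *params[c]))

# Hypothetical fitted parameters: equal priors, means -1 and +1, unit variance
params = {1: (0.5, -1.0, 1.0), 2: (0.5, 1.0, 1.0)}

labels = [classify(x, params) for x in (-2.0, -0.1, 0.3, 4.0)]
print(labels)
```

Working with log-densities avoids numerical underflow for points far from both means and leaves the argmax unchanged.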

</div>

(c) [1 pt] Consider the special case where we make the restriction that $\Sigma_1=\Sigma_2$. What is the maximum likelihood estimator for $\left(\pi_i, \mu_i, \Sigma_i\right)_{i \in\{1,2\}}$ in this case?

<div style="color:blue">

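The estimators of $\pi_i$ and $\mu_i$ remain the same as in (a): the covariance does not enter the condition for $\pi_i$, and the stationarity condition for $\mu_i$ gives the class mean for any invertible covariance matrix.

For reference, without the constraint the MLE of the covariance matrix of class $i$ is the per-class estimate from (a):

$$
\hat{\Sigma}_i = \frac{\sum_{j=1}^N (X_j - \hat{\mu}_i)(X_j - \hat{\mu}_i)^T \cdot \mathbb{1}[Y_j = i]}{\sum_{j=1}^N \mathbb{1}[Y_j = i]}
$$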

(Need to proofread the following)

With the constraint $\Sigma_1 = \Sigma_2$, the estimator for the covariance matrix changes. Instead of estimating a separate covariance matrix for each class, we estimate a common covariance matrix, $\Sigma$, by pooling the deviations of all samples from their own class means:

$$
\hat{\Sigma} = \frac{\sum_{j=1}^{N} 1[Y_j = 1] (X_j - \hat{\mu}_1)(X_j - \hat{\mu}_1)^T + \sum_{j=1}^{N} 1[Y_j = 2] (X_j - \hat{\mu}_2)(X_j - \hat{\mu}_2)^T}{N}.
$$

Equivalently, $\hat{\Sigma}$ is the weighted average of the per-class estimates, with weights $N_1/N$ and $N_2/N$, where $N_i$ is the number of samples in class $i$. This estimator takes into account the variability in both classes but assumes they share the same underlying covariance structure.

</div>

(d) $[2 \mathrm{pts}]$ Show that in the setting of (c), the decision boundaries are linear. In particular, show that the log-probability-ratio is of the form

$$
\log \left(\frac{p(Y=1 \mid X=x)}{p(Y=2 \mid X=x)}\right)=c+v^{\top} x \tag{1}
$$

where $c$ and $v$ are independent of $x$.

<div style="color:blue">


From Bayes' Theorem, we know:

$$
p(Y=i \mid X=\tilde{X}) = \frac{\mathcal{N}(\tilde{X} \mid \mu_i, \Sigma_i) \cdot \pi_i}{p(X=\tilde{X})}
$$

So, the log-probability-ratio becomes:

$$
\log \left(\frac{\mathcal{N}(\tilde{X} \mid \mu_1, \Sigma_1) \cdot \pi_1}{\mathcal{N}(\tilde{X} \mid \mu_2, \Sigma_2) \cdot \pi_2}\right)
$$

We can expand this using the definition of the Gaussian distribution:

$$
\log \left(\frac{\frac{1}{(2\pi)^{d/2} |\Sigma_1|^{1/2}} \exp\left(-\frac{1}{2}(\tilde{X}-\mu_1)^T \Sigma_1^{-1} (\tilde{X}-\mu_1)\right) \cdot \pi_1}{\frac{1}{(2\pi)^{d/2} |\Sigma_2|^{1/2}} \exp\left(-\frac{1}{2}(\tilde{X}-\mu_2)^T \Sigma_2^{-1} (\tilde{X}-\mu_2)\right) \cdot \pi_2}\right)
$$

This expression simplifies significantly: the $(2\pi)^{d/2}$ factors cancel, and taking the logarithm of the ratio gives

$$
\log \left(\frac{\pi_1}{\pi_2}\right) + \frac{1}{2} \log \left(\frac{|\Sigma_2|}{|\Sigma_1|}\right) - \frac{1}{2}(\tilde{X}-\mu_1)^T \Sigma_1^{-1} (\tilde{X}-\mu_1) + \frac{1}{2}(\tilde{X}-\mu_2)^T \Sigma_2^{-1} (\tilde{X}-\mu_2)
$$

In the setting of (c) we have $\Sigma_1 = \Sigma_2 = \Sigma$, so the log-determinant term is zero. Expand and simplify the last 2 terms with the common $\Sigma$, using the symmetry of $\Sigma^{-1}$:

$$
\begin{aligned}
   &- \frac{1}{2}\tilde{X}^T\Sigma^{-1}\tilde{X} + \tilde{X}^T\Sigma^{-1}\mu_1 - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\tilde{X}^T\Sigma^{-1}\tilde{X} - \tilde{X}^T\Sigma^{-1}\mu_2 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 \\
   &= \tilde{X}^T\Sigma^{-1}\mu_1 - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 - \tilde{X}^T\Sigma^{-1}\mu_2 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2
\end{aligned}
$$

Combining like terms gives us:

$$
\tilde{X}^T\Sigma^{-1}(\mu_1 - \mu_2) - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2
$$

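Finally, since $|\Sigma_1| = |\Sigma_2|$ makes the log-determinant term vanish, the complete log-probability-ratio is:

$$
\log \left(\frac{\pi_1}{\pi_2}\right) + \tilde{X}^T\Sigma^{-1}(\mu_1 - \mu_2) - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2
$$

This has the required form $c + v^{\top} \tilde{X}$ with

$$
v = \Sigma^{-1}(\mu_1 - \mu_2), \qquad c = \log \left(\frac{\pi_1}{\pi_2}\right) - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 .
$$

The decision rule is to classify $\tilde{X}$ as class 1 if this expression is greater than 0, and class 2 otherwise. 

The decision boundary $\{x : c + v^{\top} x = 0\}$ is linear (a hyperplane) because the quadratic terms in $\tilde{X}$ cancel and the only remaining term involving $\tilde{X}$ is linear. The other terms are constants with respect to $\tilde{X}$.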
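
The linear form can be sanity-checked numerically. The sketch below uses the univariate case, where $v = (\mu_1 - \mu_2)/\sigma^2$ and $c = \log(\pi_1/\pi_2) - (\mu_1^2 - \mu_2^2)/(2\sigma^2)$, and compares $c + v x$ with the log-ratio computed directly from the densities. The parameter values are arbitrary assumptions.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_ratio_direct(x, pi1, pi2, mu1, mu2, var):
    """log of pi1 N(x|mu1,var) / (pi2 N(x|mu2,var)), computed from densities."""
    return (math.log(pi1 * normal_pdf(x, mu1, var))
            - math.log(pi2 * normal_pdf(x, mu2, var)))

def linear_coefficients(pi1, pi2, mu1, mu2, var):
    """Intercept c and slope v of the log-ratio for a shared variance."""
    v = (mu1 - mu2) / var
    c = math.log(pi1 / pi2) - (mu1 ** 2 - mu2 ** 2) / (2 * var)
    return c, v

# Arbitrary illustrative parameters
pi1, pi2, mu1, mu2, var = 0.3, 0.7, -1.0, 1.0, 2.0
c, v = linear_coefficients(pi1, pi2, mu1, mu2, var)

gaps = [abs(log_ratio_direct(x, pi1, pi2, mu1, mu2, var) - (c + v * x))
        for x in (-3.0, 0.0, 2.5)]
print(c, v, max(gaps))
```

The largest gap is at the level of floating-point rounding, which is consistent with the quadratic terms cancelling exactly when the two classes share a variance.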


</div>


(e) [2 pts] Consider the univariate case $d=1$ and assume that our data set is such that $\mu_1=-1, \mu_2=1, \Sigma_1=\Sigma_2=1$. Now imagine we add the data points $\left(X_{N+1}, Y_1\right)=$ $(-\lambda, 1)$ and $\left(X_{N+2}, Y_2\right)=(\lambda, 2)$ to our data set and reapply the methodology from (c). What happens, for a given input value $x$, to the left hand side of (1) as $\lambda$ goes to infinity. What does this mean for the classifier?

<div style="color:red">Need to proofread the following</div>

<div style="color:blue">

Reapplying the methodology from (c) means re-estimating all parameters, including the shared variance, so $\Sigma$ does not stay fixed at 1. Let $N_1$ and $N_2$ be the original class counts and $N = N_1 + N_2$.

1. **Priors**: $\hat{\pi}_i = \frac{N_i + 1}{N + 2}$, which does not depend on $\lambda$.

2. **Means**: $\hat{\mu}_1 = -\frac{N_1 + \lambda}{N_1 + 1}$ and $\hat{\mu}_2 = \frac{N_2 + \lambda}{N_2 + 1}$, so both means move outward and $|\hat{\mu}_1 - \hat{\mu}_2|$ grows linearly in $\lambda$.

3. **Shared Variance**: Each new point lies at distance $\lambda - 1$ from the old mean of its class, so the pooled estimate becomes

   $$
   \hat{\sigma}^2 = \frac{1}{N+2}\left[N + \left(\frac{N_1}{N_1+1} + \frac{N_2}{N_2+1}\right)(\lambda - 1)^2\right],
   $$

   which grows like $\lambda^2$.

From (d), in one dimension the left-hand side of (1), i.e. the log-probability-ratio, equals $c + v x$ with

$$
v = \frac{\hat{\mu}_1 - \hat{\mu}_2}{\hat{\sigma}^2}, \qquad c = \log \left(\frac{\hat{\pi}_1}{\hat{\pi}_2}\right) - \frac{\hat{\mu}_1^2 - \hat{\mu}_2^2}{2 \hat{\sigma}^2}.
$$

The numerator of $v$ grows like $\lambda$ while the denominator grows like $\lambda^2$, so $v \to 0$ as $\lambda \to \infty$. In $c$, the ratio $(\hat{\mu}_1^2 - \hat{\mu}_2^2)/\hat{\sigma}^2$ converges to a finite constant that depends only on $N_1$ and $N_2$ (it is zero when $N_1 = N_2$). Hence, for any fixed $x$, the log-probability-ratio converges to a constant that does not depend on $x$.

What this means for the classifier:

1. **The Input Is Ignored**: In the limit the classifier assigns the same class to every $x$. When $N_1 = N_2$ the limit is $0$, so the posterior is $1/2$ everywhere and the classifier is indifferent.

2. **Lack of Robustness to Outliers**: The two added points lie on the correct side of the original boundary, yet they destroy the classifier, because the estimated variance is dominated by the outliers and the two class-conditional densities become nearly indistinguishable at any fixed $x$.

In summary, as $\lambda \to \infty$ the slope $v$ of the log-probability-ratio vanishes and the left-hand side of (1) tends to a constant, so the Gaussian approach with a shared covariance is not robust to such extreme points.
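
The effect of the two added points can be simulated. The sketch below refits a univariate shared-variance model, following the estimators from (c), after appending $(-\lambda, 1)$ and $(\lambda, 2)$ to a hypothetical base data set whose class means are $-1$ and $1$ and whose pooled variance is 1. All names and data values are assumptions for illustration.

```python
import math

def pooled_lda_1d(x1, x2):
    """Fit 1-D GDA with a shared variance; return (c, v) of the log-ratio c + v*x."""
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    mu1, mu2 = sum(x1) / n1, sum(x2) / n2
    # Pooled MLE variance: deviations from each point's own class mean, divided by n
    var = (sum((x - mu1) ** 2 for x in x1) + sum((x - mu2) ** 2 for x in x2)) / n
    v = (mu1 - mu2) / var
    c = math.log(n1 / n2) - (mu1 ** 2 - mu2 ** 2) / (2 * var)
    return c, v

# Hypothetical base data: class means -1 and +1, pooled variance 1
base1 = [-2.0, 0.0, -2.0, 0.0]
base2 = [0.0, 2.0, 0.0, 2.0]

print("no outliers:", pooled_lda_1d(base1, base2))
slopes = {}
for lam in (10.0, 1e3, 1e5):
    c, v = pooled_lda_1d(base1 + [-lam], base2 + [lam])
    slopes[lam] = v
    print(f"lambda={lam:g}: c={c:.4f}, v={v:.6f}")
```

In this balanced example the intercept stays at 0 while the slope shrinks toward 0 as $\lambda$ grows, so the log-ratio flattens out and the fitted classifier stops depending on $x$.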


</div>

(f) [1 pt] What would you expect if in (e), we were using logistic regression instead of the class-conditional Gaussian approach? Explain your prediction.

<div style="color:blue">

The behavior of the classifier in response to the extreme data point will be different.

1. **Model Formulation**: Logistic regression is a linear model for binary classification. It predicts the probability that a given data point belongs to a particular class using a logistic function. The logistic regression model for two classes (1 and 2) is usually given by:

   $$
   P(Y=1 \mid X=x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}},
   $$
   
   where $\beta_0$ and $\beta_1$ are the model parameters.

2. **Impact of the Added Points**: Logistic regression fits $\beta_0, \beta_1$ by maximizing the conditional log-likelihood $\sum_j \log P(Y_j \mid X_j)$, so each point matters only through how well it is classified. With class 1 centered at $-1$ and class 2 centered at $1$, a reasonable fit has $\beta_1 < 0$. The point $(-\lambda, 1)$ then has $P(Y=1 \mid X=-\lambda) \to 1$ and the point $(\lambda, 2)$ has $P(Y=2 \mid X=\lambda) \to 1$ as $\lambda \to \infty$, so their contributions to the log-likelihood and to its gradient tend to zero.

3. **Contrast with Gaussian Discriminant Analysis**: Gaussian discriminant analysis models $p(X \mid Y)$, so its mean and covariance estimates are averages over the $x$ values, and a single extreme value can move them arbitrarily far. Logistic regression models only $P(Y \mid X)$ and makes no Gaussian assumption about $X$, so points that are far from the boundary on the correct side barely affect the fit.

4. **Regularization**: Logistic regression can additionally be regularized (using methods like L1 or L2 regularization) to reduce overfitting, but this is not needed for the argument above.

In summary, if logistic regression were used in scenario (e), adding the two points with $\lambda \to \infty$ would leave the fitted coefficients and the decision boundary essentially unchanged, because both points are classified correctly with probability close to 1. The classifier would therefore keep depending on $x$, unlike the Gaussian discriminant analysis approach.


</div>



## 4. Support Vector Machine

You are provided with $m>1$ data points $\{x_j \in \mathbb{R}^n\}_{j=1}^m$ of which at least d, with $1<d \leq m$ are distinct. Let $X=\left[x_1, \ldots, x_m\right]$ and consider the one class SVM problem:
$$
\begin{aligned}
\min _{R \in \mathbb{R}, a \in \mathbb{R}^n, s \in \mathbb{R}^m} & R^2+C \mathbf{1}^{\top} s \\
\text { s.t. } & \left\|x_i-a\right\|_2^2 \leq R^2+s_i, \quad i=1, \ldots, m, \\
& s \geq \mathbf{0} .
\end{aligned}
$$

(a) [1 pts] Show that this is a feasible convex program and that strong duality holds.
[Hint: let $r=R^2$]

<div style="color:blue">
    
    
References
    
* [Strong Duality (Wiki)](https://en.wikipedia.org/wiki/Strong_duality)
* [Notes on KKT (Princeton)](https://www.cs.princeton.edu/courses/archive/spring16/cos495/slides/ML_basics_lecture5_SVM_II.pdf)

Replacing $R^2$ with $r$ changes the problem to:

$$
\begin{aligned}
\min _{r \in \mathbb{R}, a \in \mathbb{R}^n, s \in \mathbb{R}^m} & r+C \mathbf{1}^{\top} s \\
\text { s.t. } & \left\|x_i-a\right\|_2^2 \leq r+s_i, \quad i=1, \ldots, m, \\
& s \geq \mathbf{0} .
\end{aligned}
$$

The objective function is $r+C \mathbf{1}^{\top} s$. This is a linear function in terms of $r$ and $s$, and thus it is convex.

The constraints can be broken down as:

1. $\left\|x_i-a\right\|_2^2 \leq r+s_i$: The function $\left\|x_i-a\right\|_2^2 - r - s_i$ is convex in $(r, a, s)$, because the squared Euclidean norm is convex in $a$ and $-r - s_i$ is linear. The constraint requires this convex function to be at most 0, so it defines a convex set.

2. $s \geq \mathbf{0}$: This is a set of linear inequalities and is thus convex.


To show that strong duality holds, we need to demonstrate that the Slater's condition is satisfied. Slater's condition states that if a convex optimization problem has a feasible point where all the inequality constraints are strictly satisfied, then strong duality holds.

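In our case, take $a = \mathbf{0}$, $r = 0$ and $s_i = \left\|x_i\right\|_2^2 + 1$ for every $i$. Then $\left\|x_i-a\right\|_2^2 = \left\|x_i\right\|_2^2 < r + s_i$ and $s_i > 0$, so every inequality constraint holds strictly. This point shows that the program is feasible and that Slater's condition is satisfied, implying that strong duality holds for this convex program.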

</div>


(b) $[1 \mathrm{pts}]$ Write down the KKT conditions.

<div style="color:blue">

References
* [Stanford CS229 Notes on SVM](https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf)


The Lagrangian $L(r, a, s, \lambda, \mu)$ with Lagrange multipliers $\lambda_i$ corresponding to the constraint $\|x_i-a\|_2^2 \leq r+s_i$ and $\mu_i$ corresponding to the constraint $s_i \geq 0$,

$$
L(r, a, s, \lambda, \mu) = r + C \mathbf{1}^\top s + \sum_{i=1}^m \lambda_i (\left\|x_i - a\right\|_2^2 - r - s_i) - \sum_{i=1}^m \mu_i s_i
$$



**Stationarity**: The gradients of the Lagrangian with respect to the primal variables must vanish at the optimum:

$$
\begin{aligned}
\frac{\partial L}{\partial r} &= 1 - \sum_{i=1}^m \lambda_i = 0, \\
\frac{\partial L}{\partial a} &= -2 \sum_{i=1}^m \lambda_i (x_i - a) = 0, \\
\frac{\partial L}{\partial s_i} &= C - \lambda_i - \mu_i = 0, \quad i = 1, \ldots, m.
\end{aligned}
$$

**Primal Feasibility**: The original constraints of the optimization problem must hold:

$$
\begin{aligned}
\left\|x_i - a\right\|_2^2 &\leq r + s_i, \quad i = 1, \ldots, m, \\
s_i &\geq 0, \quad i = 1, \ldots, m.
\end{aligned}
$$

**Dual Feasibility**: The Lagrange multipliers must be non-negative:

$$
\lambda_i \geq 0, \quad \mu_i \geq 0, \quad i = 1, \ldots, m.
$$

**Complementary Slackness**: Each Lagrange multiplier must be zero whenever its associated constraint is not active (not binding):

$$
\begin{aligned}
\lambda_i (\left\|x_i - a\right\|_2^2 - r - s_i) &= 0, \quad i = 1, \ldots, m, \\
\mu_i s_i &= 0, \quad i = 1, \ldots, m.
\end{aligned}
$$




More on KKT:

The standard form for KKT is:

Minimize $f(x)$ s.t. 

$g_i(x) - b_i \ge 0, i = 1, \ldots k$

$g_i(x) - b_i = 0, i = k + 1, \ldots m$

(There are $m - k$ equality constraints and $k$ inequality constraints)



</div>

(c) [2 pts] Show that $\alpha^* \neq 0$ and that if $C>1 /(d-1)$ then $\left(R^2\right)^*>0$ (harder).

<div style="color:blue">


Throughout, work with $r = R^2$ as in (a), and let $\alpha_i \geq 0$ be the multipliers of the constraints $\left\|x_i-a\right\|_2^2 \leq r+s_i$ and $\mu_i \geq 0$ the multipliers of $s_i \geq 0$. Since strong duality holds, the KKT conditions are satisfied at the optimum:

- $\sum_{i=1}^m \alpha_i = 1$ (stationarity in $r$),
- $\sum_{i=1}^m \alpha_i (x_i - a) = 0$ (stationarity in $a$),
- $\alpha_i + \mu_i = C$ for all $i$ (stationarity in $s_i$),
- $\alpha_i \left( \left\|x_i-a\right\|_2^2 - r - s_i \right) = 0$ and $\mu_i s_i = 0$ for all $i$ (complementary slackness).

**Showing $\alpha^* \neq 0$**

Stationarity in $r$ gives $\sum_{i=1}^m \alpha_i^* = 1$, so the $\alpha_i^*$ cannot all be zero.

**Showing $\left(R^2\right)^*>0$ if $C > \frac{1}{d-1}$**

Suppose, for contradiction, that $r^* \leq 0$.

* For every data point with $x_i \neq a^*$ we have $\left\|x_i-a^*\right\|_2^2 > 0 \geq r^*$, so primal feasibility forces $s_i^* \geq \left\|x_i-a^*\right\|_2^2 - r^* > 0$.
* Complementary slackness $\mu_i^* s_i^* = 0$ then gives $\mu_i^* = 0$, and stationarity in $s_i$ gives $\alpha_i^* = C$.
* At least $d$ of the data points are distinct, and at most one distinct value can coincide with $a^*$. Hence at least $d-1$ indices have $\alpha_i^* = C$.
* Therefore $\sum_{i=1}^m \alpha_i^* \geq (d-1) C > 1$, which contradicts $\sum_{i=1}^m \alpha_i^* = 1$.

So $r^* = \left(R^2\right)^* > 0$.

In conclusion, stationarity in $r$ ensures $\alpha^* \neq 0$, and the condition $C > \frac{1}{d-1}$ ensures that $\left(R^2\right)^*$ is strictly positive in the optimal solution.

</div>

(d) $[2 \mathrm{pts}]$ What are the support vectors for this problem?

<div style="color:blue">


Support vectors in this context are the data points with a strictly positive multiplier, $\alpha_i > 0$. Because the optimal center is a weighted combination of the $x_i$ with weights $\alpha_i$, only these points determine the sphere. By complementary slackness, $\alpha_i > 0$ implies $\left\|x_i - a\right\|_2^2 = R^2 + s_i$, so a support vector is either on the boundary of the sphere or outside it:

On the Boundary: A data point $x_i$ is on the boundary of the sphere if the corresponding slack variable $s_i = 0$ and $\left\|x_i - a\right\|_2^2 = R^2$. These points lie exactly on the sphere and have $0 \leq \alpha_i \leq C$; every point with $0 < \alpha_i < C$ is of this type.

Outside the Sphere: A data point $x_i$ is outside the sphere if the corresponding slack variable $s_i > 0$. For these points the multiplier of $s_i \geq 0$ is zero, so $\alpha_i = C$. They influence the sphere's size and position through the penalty term in the objective function.

Points strictly inside the sphere have $\left\|x_i - a\right\|_2^2 < R^2$, hence $\alpha_i = 0$, and are not support vectors.

</div>

(e) [2 pts] Derive the dual problem.

<div style="color:blue">

Following the hint in (a), write $r = R^2$. With multipliers $\alpha_i \geq 0$ for the constraints $\left\|x_i - a\right\|_2^2 \leq r + s_i$ and $\mu_i \geq 0$ for the constraints $s_i \geq 0$, the Lagrangian is:

$$
\mathcal{L}(r, a, s, \alpha, \mu) = r + C \sum_{i=1}^m s_i + \sum_{i=1}^m \alpha_i \left( \left\|x_i - a\right\|_2^2 - r - s_i \right) - \sum_{i=1}^m \mu_i s_i
$$

The dual function is the minimum of $\mathcal{L}$ over the primal variables. We find it by taking the partial derivatives of $\mathcal{L}$ with respect to $r$, $a$, and $s_i$ and setting them equal to zero.

For $r$:
$$
\frac{\partial \mathcal{L}}{\partial r} = 1 - \sum_{i=1}^m \alpha_i = 0 \implies \sum_{i=1}^m \alpha_i = 1
$$
For $a$:
$$
\frac{\partial \mathcal{L}}{\partial a} = 2 \sum_{i=1}^m \alpha_i (a - x_i) = 0 \implies a = \sum_{i=1}^m \alpha_i x_i \quad \text{(using } \textstyle\sum_i \alpha_i = 1 \text{)}
$$
For $s_i$:
$$
\frac{\partial \mathcal{L}}{\partial s_i} = C - \alpha_i - \mu_i = 0 \implies \alpha_i = C - \mu_i \leq C
$$

If $\sum_i \alpha_i \neq 1$ or $C - \alpha_i - \mu_i \neq 0$ for some $i$, the Lagrangian is linear in $r$ or $s_i$ with a nonzero slope, so it is unbounded below and the dual function equals $-\infty$. These conditions therefore become constraints of the dual problem.

Substitute these conditions back into the Lagrangian to obtain the dual function. Note that $r$, $a$, and $s_i$ are eliminated.

For the $r$ term:
- The coefficient of $r$ is $1 - \sum_{i=1}^m \alpha_i = 0$, so $r$ drops out.

For the $s_i$ term:
- The coefficient of $s_i$ is $C - \alpha_i - \mu_i = 0$, so every $s_i$ drops out.

For the $\left\|x_i - a\right\|_2^2$ term:
- Using $a = \sum_{i=1}^m \alpha_i x_i$ and $\sum_{i=1}^m \alpha_i = 1$:

$$
\sum_{i=1}^m \alpha_i \left\|x_i - a\right\|_2^2 = \sum_{i=1}^m \alpha_i \left\|x_i\right\|_2^2 - 2 a^\top \sum_{i=1}^m \alpha_i x_i + \left\|a\right\|_2^2 = \sum_{i=1}^m \alpha_i \left\|x_i\right\|_2^2 - \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j x_i^\top x_j
$$

Eliminating $\mu_i \geq 0$ through $0 \leq \alpha_i \leq C$, the dual problem becomes:

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^m \alpha_i \left\|x_i\right\|_2^2 - \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j x_i^\top x_j \\
\text{s.t.} \quad & 0 \leq \alpha_i \leq C, \, \forall i, \\
& \sum_{i=1}^m \alpha_i = 1.
\end{aligned}
$$

The objective only involves $\alpha_i$ and the inner products $x_i^\top x_j$.

This dual problem is a concave quadratic maximization problem with inequality and equality constraints. Solving it gives the optimal values of $\alpha_i$, which are then used to determine the optimal values of $R$ and $a$ in the primal problem.
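
As a hedged illustration of that last step, the sketch below recovers the center as $a = \sum_i \alpha_i x_i$ and, assuming some multiplier lies strictly between 0 and $C$, reads off $R^2$ as the squared distance from $a$ to that point (for such a point the constraint is active and its slack is zero). The three-point data set and the dual solution `alpha` are hand-picked assumptions, not the output of a solver.

```python
def weighted_center(X, alpha):
    """a = sum_i alpha_i x_i (valid when the alphas sum to 1)."""
    d = len(X[0])
    return [sum(al * x[k] for al, x in zip(alpha, X)) for k in range(d)]

def recover_ball(X, alpha, C, tol=1e-9):
    """Return (a, R^2), taking R^2 from a point with 0 < alpha_k < C."""
    a = weighted_center(X, alpha)
    for x, al in zip(X, alpha):
        if tol < al < C - tol:
            return a, sum((xk - ak) ** 2 for xk, ak in zip(x, a))
    raise ValueError("no multiplier strictly between 0 and C")

def dual_objective(X, alpha):
    """sum_i alpha_i ||x_i||^2 - ||sum_i alpha_i x_i||^2."""
    a = weighted_center(X, alpha)
    return (sum(al * sum(v * v for v in x) for al, x in zip(alpha, X))
            - sum(v * v for v in a))

# Hypothetical example: two boundary points and one interior point
X = [(-1.0, 0.0), (1.0, 0.0), (0.0, 0.5)]
alpha = [0.5, 0.5, 0.0]
C = 1.0

a, r2 = recover_ball(X, alpha, C)
print(a, r2, dual_objective(X, alpha))
```

In this example the dual objective equals the primal objective $R^2 + C \sum_i s_i$ with all slacks zero, which is what strong duality predicts for an optimal pair.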



</div>


(f) [2 pts] Assume $C>1 /(d-1)$. Given the dual solution, how should $a$ and $R^2$ be selected?

<div style="color:blue">

</div>


<div style="color:red">


The primal optimization problem:

$$
\begin{array}{cl}
\min _w & f(w) \\
\text { s.t. } & g_i(w) \leq 0, \quad i=1, \ldots, k \\
& h_i(w)=0, \quad i=1, \ldots, l .
\end{array}
$$

To solve it, we start by defining the generalized Lagrangian
$$
\mathcal{L}(w, \alpha, \beta)=f(w)+\sum_{i=1}^k \alpha_i g_i(w)+\sum_{i=1}^l \beta_i h_i(w) .
$$

Here, the $\alpha_i$ 's and $\beta_i$ 's are the Lagrange multipliers. 


$$
\theta_{\mathcal{P}}(w)=\max _{\alpha, \beta: \alpha_i \geq 0} \mathcal{L}(w, \alpha, \beta)
$$

The primal problem:

$$
\min _w \theta_{\mathcal{P}}(w)=\min _w \max _{\alpha, \beta: \alpha_i \geq 0} \mathcal{L}(w, \alpha, \beta)
$$

Since $\theta_{\mathcal{P}}(w)$ equals $f(w)$ when $w$ satisfies the constraints and $\infty$ otherwise, this is the same problem as (and has the same solutions as) our original primal problem.

We also define the optimal value of the objective to be $p^*=\min _w \theta_{\mathcal{P}}(w)$; we call this the value of the primal problem.

Next, define:


$$
\theta_{\mathcal{D}}(\alpha, \beta)=\min _w \mathcal{L}(w, \alpha, \beta) .
$$

Here, the subscripts $\mathcal{P}$ and $\mathcal{D}$ stand for "primal" and "dual," respectively. Note also that whereas in the definition of $\theta_{\mathcal{P}}$ we were optimizing (maximizing) with respect to $\alpha, \beta$, here we are minimizing with respect to $w$.

The dual optimization problem:

$$
\max _{\alpha, \beta: \alpha_i \geq 0} \theta_{\mathcal{D}}(\alpha, \beta)=\max _{\alpha, \beta: \alpha_i \geq 0} \min _w \mathcal{L}(w, \alpha, \beta) .
$$

This is exactly the same as our primal problem shown above, except that the order of the "max" and the "min" are now exchanged. We also define the optimal value of the dual problem's objective to be $d^*=\max _{\alpha, \beta: \alpha_i \geq 0} \theta_{\mathcal{D}}(\alpha, \beta)$.

The primal and the dual problems are related:

$$
d^*=\max _{\alpha, \beta: \alpha_i \geq 0} \min _w \mathcal{L}(w, \alpha, \beta) \leq \min _w \max _{\alpha, \beta: \alpha_i \geq 0} \mathcal{L}(w, \alpha, \beta)=p^* .
$$

Under certain conditions, we have $d^{*}=p^{*}$


The KKT conditions:

* $\frac{\partial}{\partial w_i} (\mathcal{L}(w^*, \alpha^*, \beta^*)) = 0, i = 1, \ldots, n$
* $\frac{\partial}{\partial \beta_i} (\mathcal{L}(w^*, \alpha^*, \beta^*)) = 0, i = 1, \ldots, l$
* $\alpha_i^* g_i(w^*) = 0, i = 1, \ldots, k$
* $g_i(w^*) \le 0, i = 1, \ldots k$
* $\alpha_i^* \ge 0, i = 1, \ldots, k$


The primal optimization problem for finding the optimal margin classifier

$$
\begin{aligned}
\min _{\gamma, w, b} & \frac{1}{2} \|w \|^2 \\
\text { s.t. } & y^{(i)} (w^{\top}x^{(i)} + b) \ge 1, i = 1, \ldots, m.
\end{aligned}
$$

We can write the constraint as:

$$g_i(w)= -y^{(i)}(w^{\top}x^{(i)} + b) + 1 \le 0$$

We have one such constraint for each training example.

The Lagrangian for the problem:

$\mathcal{L}(w, b, \alpha) = \frac{1}{2} \| w \|^2 - \sum_{i=1}^{m} \alpha_i [y^{(i)} (w^{\top}x^{(i)} + b) - 1]$.

Setting the derivatives of $\mathcal{L}$ w.r.t. $w$ to zero:

$\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y^{(i)}x^{(i)} = 0$

which implies:

$w = \sum_{i=1}^m \alpha_i y^{(i)}x^{(i)}$

Setting the derivative of $\mathcal{L}$ w.r.t. $b$ to zero:


$\frac{\partial}{\partial b} \mathcal{L}(w, b, \alpha) = -\sum_{i=1}^m \alpha_i y^{(i)} = 0$

Plugging these back into the Lagrangian gives

$$
\mathcal{L}(w, b, \alpha)=\sum_{i=1}^m \alpha_i-\frac{1}{2} \sum_{i, j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j\left(x^{(i)}\right)^T x^{(j)}-b \sum_{i=1}^m \alpha_i y^{(i)} .
$$

But the last term must be zero, so we obtain

$$
\mathcal{L}(w, b, \alpha)=\sum_{i=1}^m \alpha_i-\frac{1}{2} \sum_{i, j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j\left(x^{(i)}\right)^T x^{(j)} .
$$


Putting this together with the constraints $\alpha_i \ge 0$, we get the dual optimization problem:

$$
\begin{aligned}
    \max_{\alpha} W(\alpha) &= \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y^{(i)}y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle \\
    \text { s.t. } \alpha_i &\ge 0, i = 1, \ldots, m \\
    \sum_{i=1}^{m} \alpha_i y^{(i)} &= 0
\end{aligned}
$$

</div>