## 1. SVMs and the Kernel trick

You are given a data set $D$ with data from a single feature $X_1$ in $\mathbb{R}^1$ and corresponding label $Y \in\{+,-\}$. The data set contains three positive examples at $X_1=\{-3,-2,3\}$ and three negative examples at $X_1=\{-1,0,1\}$.

(a) (1 point) Can this data set (in its current feature space) be perfectly separated using a linear separator? Why or why not? (Explain in 1 line)


<div style="color:blue">

No. In $\mathbb{R}^1$ a linear separator is a single threshold, and the positive and negative examples interleave on the number line (positives at $-3$ and $-2$, negatives at $-1$, $0$ and $1$, then a positive at $3$), so no single threshold separates all positives from all negatives.

For example, with a threshold at $-1.5$ the point $3$ is misclassified, and with a threshold at $2$ the points $-3$ and $-2$ are misclassified.

</div>

(b) (2 points) Let's define the simple feature map $\phi(u)=\left(u, u^2\right)$ which transforms points in $\mathbb{R}^1$ to points in $\mathbb{R}^2$. Apply $\phi$ to the data and plot the points in the new $\mathbb{R}^2$ feature space (i.e. just show the plot). Can a linear separator perfectly separate the points in the new $\mathbb{R}^2$ feature space induced by $\phi$? Why or why not? (Again, explain in 1 line)

<img src="img/fall2021/q1.png">

<div style="color:blue">

The above shows the data points transformed to $\mathbb{R}^2$ using the feature map $\phi(u) = (u, u^2)$.

A linear separator can now perfectly separate the two classes, for example the horizontal line at height $2$ (second coordinate $u^2 = 2$). All points lie on the parabola, but the positive examples have $u^2 \geq 4$ while the negative examples have $u^2 \leq 1$.

</div>

(c) (1 point) Give the analytic form of the kernel that corresponds to the feature map $\phi$ in terms of only $X_1$ and $X_2$. Specifically, define $k\left(X_1, X_2\right)=\langle\phi\left(X_1\right), \phi\left(X_2\right)\rangle$ (where $\langle\cdot, \cdot\rangle$ is the dot product of two vectors), and give the analytic form of $k(\cdot, \cdot)$.

<div style="color:blue">

The kernel function becomes

$k(X_1, X_2) = \langle (X_1, X_1^2), (X_2, X_2^2) \rangle = X_1 X_2 + X_1^2 X_2^2$

</div>
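
As a quick numerical check of parts (b) and (c), the sketch below (plain Python; the helper names `phi` and `kernel` and the threshold value are illustrative choices, not part of the exam) confirms that the closed-form kernel matches the explicit dot product in feature space and that a horizontal line separates the mapped points.

```python
def phi(u):
    """Feature map from part (b): u -> (u, u^2)."""
    return (u, u * u)

def kernel(a, b):
    """Closed form from part (c): k(a, b) = a*b + a^2 * b^2."""
    return a * b + (a * a) * (b * b)

positives = [-3, -2, 3]
negatives = [-1, 0, 1]

# The closed-form kernel should equal the explicit dot product in feature space.
for a in positives + negatives:
    for b in positives + negatives:
        pa, pb = phi(a), phi(b)
        assert kernel(a, b) == pa[0] * pb[0] + pa[1] * pb[1]

# A horizontal line in feature space separates the classes.
threshold = 2  # illustrative; any value strictly between 1 and 4 works
separable = all(phi(u)[1] > threshold for u in positives) and all(
    phi(u)[1] < threshold for u in negatives
)
print(separable)
```

The kernel form matters because an SVM only needs these pairwise values, never the mapped coordinates themselves.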

(d) (4 points) Construct a maximum-margin separating hyperplane. This hyperplane will be a line in $\mathbb{R}^2$, which can be parameterized by its normal equation, i.e. $w_1 Y_1+w_2 Y_2+c=0$ for appropriate choices of $w_1, w_2$ and $c$. Here, $\left(Y_1, Y_2\right)=\phi\left(X_1\right)$ is the result of applying the feature map $\phi$ to the original feature $X_1$. Give the values for $w_1, w_2$ and $c$. Also, explicitly compute the margin for your hyperplane. You do not need to solve a quadratic program to find the maximum margin hyperplane. Instead, let your geometric intuition guide you.


<div style="color:blue">


To find the maximum-margin hyperplane in $\mathbb{R}^2$, we look for the line that maximizes the distance to the closest points (the support vectors) of the two classes. For linearly separable data, this line is the perpendicular bisector of the shortest segment connecting the convex hulls of the two classes.

Transforming the data with $\phi(u) = (u, u^2)$:
- Positive: $\phi(-3) = (-3, 9)$, $\phi(-2) = (-2, 4)$, $\phi(3) = (3, 9)$
- Negative: $\phi(-1) = (-1, 1)$, $\phi(0) = (0, 0)$, $\phi(1) = (1, 1)$


Finding the hyperplane:

The closest pair of points from opposite classes is $\phi(-2) = (-2, 4)$ and $\phi(-1) = (-1, 1)$, at distance $\sqrt{1^2 + 3^2} = \sqrt{10}$. No point on an edge of either convex hull is closer, so these two points are the support vectors. The hyperplane passes through their midpoint $(-1.5, 2.5)$ and is perpendicular to the segment connecting them, so its normal $(w_1, w_2)$ is parallel to $(-2, 4) - (-1, 1) = (-1, 3)$. Taking $(w_1, w_2) = (-1, 3)$ and substituting the midpoint:

$-1 \times (-1.5) + 3 \times 2.5 + c = 0$

so $c = -9$, and the hyperplane is $-Y_1 + 3Y_2 - 9 = 0$ (any positive multiple of $w_1 = -1$, $w_2 = 3$, $c = -9$ describes the same line).

Check: the left-hand side is $21$, $5$, $15$ for the positive points $(-3, 9)$, $(-2, 4)$, $(3, 9)$ and $-5$, $-9$, $-7$ for the negative points $(-1, 1)$, $(0, 0)$, $(1, 1)$, so all points are on the correct side.

The margin (distance from the hyperplane to each support vector) is $\frac{|-(-2) + 3 \cdot 4 - 9|}{\sqrt{(-1)^2 + 3^2}} = \frac{5}{\sqrt{10}} = \frac{\sqrt{10}}{2} \approx 1.58$, so the total width between the two classes is $\sqrt{10} \approx 3.16$.


</div>

(e) (2 points) Draw the decision boundary of the separating hyperplane in the original $\mathbb{R}^1$ feature space. Also circle the support vectors.


<div style="color:blue">

The decision boundary in $\mathbb{R}^1$ is where the hyperplane in $\mathbb{R}^2$ intersects the parabola defined by the transformation $\phi(u) = (u, u^2)$. Given the hyperplane from part (d), we solve for $X_1$ in the equation of the hyperplane.

The hyperplane in $\mathbb{R}^2$ is $-Y_1 + 3Y_2 - 9 = 0$. Substituting $(Y_1, Y_2) = (X_1, X_1^2)$ translates it back to the original space:

$$
-X_1 + 3X_1^2 - 9 = 0
$$

The solutions are $X_1 = \frac{1 \pm \sqrt{109}}{6}$, i.e. $X_1 \approx -1.57$ and $X_1 \approx 1.91$. These are the points in the original feature space where the decision changes from one class to the other: points between them are classified as negative, and points outside them as positive. The support vectors to circle are $X_1 = -2$ and $X_1 = -1$.

</div>
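
A candidate hyperplane for part (d) can be sanity-checked numerically. The sketch below (plain Python; helper names are illustrative) builds the perpendicular bisector of the closest pair of opposite-class points in feature space, asserts that every mapped point lies on its own side at least half that distance away (the largest margin any separating line could achieve), and maps the boundary back to $\mathbb{R}^1$. The vertex-to-vertex search for the closest pair is an assumption that happens to hold for these six points; in general the closest points of two convex hulls can lie on edges.

```python
import math

def phi(u):
    """Feature map phi(u) = (u, u^2)."""
    return (u, u * u)

positives = [-3, -2, 3]
negatives = [-1, 0, 1]

# Closest pair of mapped points from opposite classes.
# Assumption: a vertex-to-vertex search suffices for this data set.
p, n = min(
    ((phi(a), phi(b)) for a in positives for b in negatives),
    key=lambda pair: math.dist(pair[0], pair[1]),
)

# Perpendicular bisector of the segment from n to p: normal w, offset c.
w = (p[0] - n[0], p[1] - n[1])
mid = ((p[0] + n[0]) / 2, (p[1] + n[1]) / 2)
c = -(w[0] * mid[0] + w[1] * mid[1])
norm = math.hypot(w[0], w[1])

def signed_distance(u):
    """Signed distance of phi(u) from the line w1*Y1 + w2*Y2 + c = 0."""
    y1, y2 = phi(u)
    return (w[0] * y1 + w[1] * y2 + c) / norm

half_gap = math.dist(p, n) / 2
# Every point must be on its own side, at least half_gap away.
assert all(signed_distance(u) >= half_gap - 1e-9 for u in positives)
assert all(signed_distance(u) <= -half_gap + 1e-9 for u in negatives)

# Decision boundary in the original space: w2*u^2 + w1*u + c = 0.
disc = math.sqrt(w[0] ** 2 - 4 * w[1] * c)
roots = sorted([(-w[0] - disc) / (2 * w[1]), (-w[0] + disc) / (2 * w[1])])

print("w =", w, "c =", c, "margin =", round(half_gap, 3))
print("boundary at u =", [round(r, 3) for r in roots])
```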

## 2. Recommendation Systems

You have collected the following ratings of popular comedy TV shows from five users. The following matrix indicates whether a user has rated a show or not:

|                                                                   | Alice | Bob | Charles | David | Eugene | 
|-------------------------------------------------------------------|-------|-----|---------|-------|--------|
| Friends                                                           | 1     | 1   | 0       | 1     | 1      | 
| The Office                     | 1     | 0   | 0       | 0     | 1      | 
| Arrested Development (AD)    | 1     | 0   | 0       | 0     | 0      | 
| The Big Bang Theory (BBT)   | 0     | 1   | 0       | 0     | 0      | 
| The Marvelous Mrs. Maisel (MMM) | 1     | 0   | 1       | 1     | 1      |


Rating matrix

| | Alice | Bob | Charles | David | Eugene |
|-------------------------------------------------------------------|-------|----|---------|-------|--------|
| Friends                                                           |  5     | 3   | $?$     | 1     | 4      |
| The Office   | 5     | $?$ | $?$     | $?$   | 4      |
| Arrested Development (AD)  | 4     | $?$ | $?$     | $?$   | $?$    |
| The Big Bang Theory (BBT)   | $?$   | 2   | $?$     | $?$   | $?$    |
| The Marvelous Mrs. Maisel (MMM) | 1     | $?$ | 1       | 2     | 4      |


More in this [lecture about collaborative filtering by Stanford University](https://youtu.be/h9gpufJFF-0?si=qm9-z2zu8S6jr7ey)


(a) (3 points) To generate recommendations, you adopt the following policy: "if a user $\mathrm{U}$ likes item $\mathrm{X}$, then $\mathrm{U}$ will also like item $\mathrm{Y}$ ". You implement this by maximizing the cosine similarity between the ratings of items $\mathrm{X}$ and $\mathrm{Y}$. Your policy also states that you will only make a recommendation to user $U$ if (a) $U$ has not already watched or rated $\mathrm{Y}$ and (b) U's rating of item $\mathrm{X}$ is at least 3.

Using this policy, which TV show would be recommended to Eugene? Show the comparisons that you made.


<div style="color:blue">



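Eugene has rated Friends, The Office and MMM, each with a rating of 4, so condition (b) is satisfied for all three. He has not rated AD or BBT, so these two are the candidates for recommendation.


We calculate the cosine similarity between each candidate (AD and BBT) and each show Eugene has rated, treating missing ratings as 0: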

Friends and AD:

$\mathrm{sim}(F, AD) = \frac{20}{4 \sqrt{51}} = \frac{5}{\sqrt{51}} \approx 0.700$

Friends and BBT:

$\mathrm{sim}(F, BBT) = \frac{6}{2 \sqrt{51}} = \frac{3}{\sqrt{51}} \approx 0.420$

The Office and AD:

$\mathrm{sim}(TO, AD) = \frac{5 \cdot 4 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 4 \cdot 0}{\sqrt{5^2 + 4^2} \sqrt{4^2}} = \frac{20}{\sqrt{41} \cdot 4} = \frac{5}{\sqrt{41}} \approx 0.781$

The Office and BBT:

$\mathrm{sim}(TO, BBT) = \frac{5 \cdot 0 + 0 \cdot 2 + 0 \cdot 0 + 0 \cdot 0 + 4 \cdot 0}{\sqrt{5^2 + 4^2} \sqrt{2^2}} = \frac{0}{\sqrt{41} \cdot 2} = 0$

MMM and AD:

$\mathrm{sim}(MMM, AD) = \frac{1 \cdot 4 + 0 \cdot 0 + 1 \cdot 0 + 2 \cdot 0 + 4 \cdot 0}{\sqrt{1^2 + 1^2 + 2^2 + 4^2} \sqrt{4^2}} = \frac{4}{\sqrt{22} \cdot 4} = \frac{1}{\sqrt{22}} \approx 0.213$

MMM and BBT:

$\mathrm{sim}(MMM, BBT) = \frac{1 \cdot 0 + 0 \cdot 2 + 1 \cdot 0 + 2 \cdot 0 + 4 \cdot 0}{\sqrt{1^2 + 1^2 + 2^2 + 4^2} \sqrt{2^2}} = \frac{0}{\sqrt{22} \cdot 2} = 0$

The highest similarity is between TO and AD, with a value of about 0.781, so we recommend AD (Arrested Development) to Eugene.


</div>
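
The comparisons in part (a) can be reproduced with a short script. This is a sketch in plain Python that follows the same convention as the calculation above, namely treating an unrated entry as 0; the variable names are illustrative.

```python
import math

# Rating vectors over (Alice, Bob, Charles, David, Eugene); unrated entries are 0.
ratings = {
    "Friends": [5, 3, 0, 1, 4],
    "The Office": [5, 0, 0, 0, 4],
    "AD": [4, 0, 0, 0, 0],
    "BBT": [0, 2, 0, 0, 0],
    "MMM": [1, 0, 1, 2, 4],
}
EUGENE = 4  # column index of Eugene

def cosine(x, y):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Shows Eugene rated at least 3 (item X) and shows he has not rated (item Y).
liked = [s for s, r in ratings.items() if r[EUGENE] >= 3]
unseen = [s for s, r in ratings.items() if r[EUGENE] == 0]

sims = {(x, y): cosine(ratings[x], ratings[y]) for x in liked for y in unseen}
best_pair = max(sims, key=sims.get)

for pair, s in sims.items():
    print(pair, round(s, 3))
print("recommend:", best_pair[1])
```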

(b) (3 points) Next, you design a recommendation system to rank TV shows to find the 'Best TV Shows of All Time', using the following formula: ratings(i) $=a+b(i)$. In this formula, you set $a$ as a global term and $b(i)$ as an item's bias score. You first fit this model to calculate $a$ as the mean of all ratings across the dataset, and in the process, you calculate $b(i)$ to be the remainder value per item.

You rank the items according to their bias scores (higher bias score is ranked higher). Which item, among the five shows shown in Table 1, would be the Best TV Show and which one would be the Worst TV show? Show your calculations.


<div style="color:blue">

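There are 12 ratings in total and they sum to 36, so the global mean is

$a = 36 / 12 = 3$

Then, we calculate the average rating of each show and subtract $a$ to get its bias score: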

$\bar{r}_{\text{F}} = 3.25$

$b(\text{F}) = 3.25 - 3 = 0.25$

$\bar{r}_{\text{TO}} = 4.5$

$b(\text{TO}) = 4.5 - 3 = 1.5$

$\bar{r}_{\text{AD}} = 4$

$b(\text{AD}) = 4 - 3 = 1$

$\bar{r}_{\text{BBT}} = 2$

$b(\text{BBT}) = 2 - 3 = -1$

$\bar{r}_{\text{MMM}} = 2$

$b(\text{MMM}) = 2 - 3 = -1$

The shows are ranked as:

TO > AD > F > BBT = MMM

So *The Office* is the best TV show, and BBT and MMM are tied for the worst.


</div>
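
The bias scores in part (b) can be recomputed with a few lines of plain Python. This is a minimal sketch; the dictionary layout and variable names are illustrative, and only the observed ratings are used.

```python
# Observed ratings per show (missing entries are simply omitted).
ratings = {
    "Friends": [5, 3, 1, 4],
    "The Office": [5, 4],
    "AD": [4],
    "BBT": [2],
    "MMM": [1, 1, 2, 4],
}

# Global term a: mean of all observed ratings.
all_ratings = [r for rs in ratings.values() for r in rs]
a = sum(all_ratings) / len(all_ratings)

# Item bias b(i): the item's mean rating minus the global mean.
bias = {show: sum(rs) / len(rs) - a for show, rs in ratings.items()}

# Higher bias ranks higher; ties keep their original order.
ranking = sorted(bias, key=bias.get, reverse=True)
print(a, bias, ranking)
```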


(c) (2 points) An attacker aims to manipulate the above recommendation system trained on the TV Show Rating data. The recommendation algorithm generates recommendations based on bias score of the item (higher rated items are ranked higher). The goal of the attacker is to increase the ranking of the lowest ranked item as you identified in the previous question.

To conduct the attack, the attacker can adopt one of the two strategies: first, the attacker can change the existing ratings in the data, and second, the attacker can create new account(s) and give ratings.

Of the two strategies, which one is:
(i) more likely to succeed?
(ii) more easily detectable by an attacker detection system?
(iii) has lower cost?

<div style="color:blue">


1. **Likelihood of Success**:
   - **Changing Existing Ratings**: This method could be more effective if the attacker has access to the system. By altering existing ratings, especially those of influential users (e.g., users whose ratings highly correlate with others), the attacker can significantly impact the recommendation algorithm. However, this requires unauthorized access or compromising the system, which might be challenging.
   - **Creating New Accounts and Ratings**: This strategy is often easier to implement. By creating new accounts and strategically rating shows, the attacker can influence the system's bias score towards a particular item. This method relies on the idea of "shilling" or "profile injection" attacks, where fake profiles are created to manipulate recommendations.

2. **Detectability by an Attacker Detection System**:
   - **Changing Existing Ratings**: This is more likely to be detected since it involves modifying existing user data, triggering alarms in a system that monitors for unusual activity.
   - **Creating New Accounts and Ratings**: While still detectable, especially with a large number of fake accounts or if the ratings patterns are unnatural, it might initially be less obvious than altering existing ratings.

3. **Cost**:
   - **Changing Existing Ratings**: Higher cost in terms of the required technical skill and access. Illegally accessing and modifying a database is not only technically challenging but also illegal and risky.
   - **Creating New Accounts and Ratings**: Lower cost since it usually only requires creating new user profiles and rating items, which can be done through normal user interfaces. However, if a large number of accounts are needed for a significant effect, this could become labor-intensive or require automation, which then increases complexity and potential detectability.


</div>



(d) (2 points) You come up with the idea of training a deep learning-based recommendation system model, namely the Neural Collaborative Filtering (NCF) model, on your large dataset to create better recommendation models. Your large dataset has 10 million ratings given by approximately 100,000 users to approximately $1,000,000$ movies.

Your NCF model first generates 8-dimensional user and item embeddings. Then you pass the embeddings through two fully-connected neural CF layers with sizes 8x16 and 16x16 dimensions. Finally, this is passed through a $16 \times 1$ output layer with ReLU activation to produce a single prediction value of recommending an item to a user. You train the model for 10 epochs with back-propagation.

After training the model, you find that the model does not perform well. What changes can you make to the model or parameters to potentially improve the performance? Give at least three options. Note that you cannot choose a different model now.

<div style="color:blue">


Several changes to the Neural Collaborative Filtering (NCF) model or its training could improve performance:

1. Adjust the Model Architecture
   - **Increase Model Capacity:** Increase the number of layers (depth) or the number of neurons in each layer (width). However, we should be cautious of overfitting, especially given the sparse nature of many recommendation datasets.
   - **Add Dropout Layers:** To prevent overfitting, we can introduce dropout layers between the fully-connected layers. Dropout randomly deactivates a fraction of neurons during training, which can help the model generalize better.
   - **Experiment with Different Activation Functions:** We can experiment with other activation functions such as Leaky ReLU, ELU, or Sigmoid (especially in the output layer for rating prediction).

2. Tune Hyperparameters
   - **Learning Rate and Optimizer:** Choose a different optimizer (e.g., Adam, RMSprop), adjust the learning rate, or use a learning rate scheduler or adaptive learning rates.
   - **Batch Size:** Experiment with different batch sizes. A larger batch size provides a more accurate estimate of the gradient, but a smaller batch size can offer a regularizing effect and sometimes leads to better generalization.
   - **Regularization Techniques:** Applying L1 or L2 regularization to the layers can help in reducing overfitting and improving model performance.

3. Improve Training Strategy
   - **Increase Training Epochs:** If the model hasn’t converged, more training epochs might be necessary. Monitor validation performance to ensure the model is not overfitting.
   - **Early Stopping:** Implement early stopping to terminate training when the validation performance starts to degrade. This prevents overfitting.
   - **Data Augmentation:** In the context of NCF, this could mean artificially creating new user-item interactions based on existing patterns (although this needs to be done carefully to avoid introducing bias).

4. Improve data

   - Ensure that the data is properly preprocessed and normalized
   - Perform additional feature engineering.


</div>




## 3. Classification: ROC, random forests, interpretation

Holly and Will are cyber analysts working at a large technology company. They are developing binary classification techniques to detect network security threats.


(a) (1 point) In the analysts' context, what would the positive class refer to?

<div style="color:blue">

The positive class would refer to the presence of a network security threat that the algorithm tries to detect.

</div>


(b) (1 point) The analysts are considering using an ROC (receiver operating characteristic) curve to visualize the classifier's performance. What are the two axes in a ROC plot?

<div style="color:blue">

The two axes in a ROC plot are:

* **True Positive Rate (TPR)**, also known as Sensitivity, Recall, or Probability of Detection, on the Y-axis. This rate is calculated as TPR = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives. It measures the proportion of actual positives that are correctly identified as such.

* **False Positive Rate (FPR)**, on the X-axis. This rate is calculated as FPR = FP / (FP + TN), where FP is the number of false positives and TN is the number of true negatives. It represents the proportion of actual negatives that are incorrectly classified as positives.


</div>


(c) (2 points) How is the ROC curve of a classifier generated? In other words, how should the analysts generate the points that they then link together to build the curve? For easier discussion, your answer may center around a binary classifier of your choice. You are welcome to include illustrations to support your answer.

<div style="color:blue">


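Take a classifier that outputs a score for each example, such as the predicted probability of the positive class from logistic regression.

Choose a Range of Thresholds: Sweep a decision threshold over the range of the scores, typically from 0 to 1 for probabilities. An example is predicted positive when its score is at or above the threshold, so the predictions, and with them the TPR and FPR, change as the threshold moves.

Calculate TPR and FPR for Each Threshold: For each threshold value, calculate the TPR and FPR. This is done by classifying the data at that threshold and then comparing the classifier's predictions to the actual labels.

Plot the Points: For each threshold, plot the point (FPR, TPR). Connecting the points in order of threshold gives the ROC curve, which runs from $(0, 0)$ (threshold above every score, nothing predicted positive) to $(1, 1)$ (threshold below every score, everything predicted positive).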


</div>
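
The threshold sweep described above can be sketched in plain Python as follows. The scores and labels are hypothetical values made up for illustration, and `roc_points` is an illustrative helper, not a library function.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    # One threshold above every score, then each distinct score from high to low.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        # Predict positive when the score is at or above the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical scores (e.g. predicted threat probabilities) and labels (1 = threat).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

Using each distinct score as a threshold is enough, because the predictions only change when the threshold crosses a score.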


(d) (1 point) If a classifier performs perfectly (i.e., it makes no mistakes), where will its "point" be located on the ROC plot?

<div style="color:blue">

A perfect classifier's point is at the top-left corner of the plot, $(\text{FPR}, \text{TPR}) = (0, 1)$: it detects every positive (TPR of 1.0) and raises no false alarms (FPR of 0.0). Its ROC curve passes through this corner, so its AUC is 1.0.


</div>

(e) (1 point) On the same plot, the analysts want to draw both the ROC curve of their classifier, and that of a "baseline" classifier that guesses the positive class half of the time, which is a straight line that goes diagonally from the plot’s bottom left corner to the upper right corner. Is it possible for the ROC curve of the analysts’ classifier to lie completely under the baseline curve? If yes, when would that happen? If no, why not?


<div style="color:blue">

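It is possible. A curve that lies entirely below the diagonal means the model does worse than random guessing: its scores are negatively related to the true labels, so it tends to rank negatives above positives. Inverting its predictions would then give a classifier that does better than random. Possible reasons: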

* Data quality issues: e.g. incorrect labels or highly imbalanced classes, which lead to poor model performance.
* Poor feature selection: Choosing features that do not correlate well with the target variable.
* Overfitting to noise or underfitting: The model might have learned the noise in the training data instead of the actual patterns, or it might be too simple to capture the complexity in the data.

</div>
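
A below-diagonal ROC curve corresponds to an AUC below 0.5. The sketch below uses the pairwise-ranking view of AUC (the fraction of positive/negative pairs in which the positive gets the higher score) to illustrate that inverting the scores of such a classifier turns an AUC of $A$ into $1 - A$. The scores and labels are hypothetical, chosen to be an extreme worst case.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores that are anti-correlated with the true labels (1 = threat).
scores = [0.2, 0.1, 0.6, 0.3, 0.8, 0.9]
labels = [1, 1, 0, 1, 0, 0]

# Inverting the scores reverses every pairwise ranking.
inverted = [1 - s for s in scores]

print(auc(scores, labels), auc(inverted, labels))
```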

As Holly and Will are evaluating more classification approaches, they come across random forests.

(f) (2 points) Random forests are a modification of bagged decision trees. Random forests improve variance reduction (over bagging) by reducing the correlation among trees. Briefly explain how this correlation reduction (“de-correlation”) among trees is achieved when growing the trees.

<div style="color:blue">

- **Bagging**: In bagging, multiple decision trees are grown independently. Each tree is trained on a different bootstrap sample, which is a randomly drawn sample with replacement from the training data. While bagging reduces variance by averaging the predictions of these independently grown trees, the trees can still be quite correlated with each other, especially if some features are very strong predictors for the target variable.

### Random Forests: Additional Layer of Randomness

Random Forests introduce an additional layer of randomness compared to regular bagging:

* **Random Selection of Features**:
   - When growing each tree in a Random Forest, at each split in the tree, rather than considering all features, a random subset of features is selected. This number is typically much smaller than the total number of features.
   - This randomness ensures that each tree in the forest is not too similar (i.e., less correlated) to others since different trees will have splits based on different subsets of features.

* **Unique Trees**:
   - Because of the random selection of features at each split, each tree in a Random Forest is unique. Even if two trees happen to pick the same feature at a particular split, the subsequent splits will likely differ due to the randomness in feature selection.

* **Reduced Overfitting**:
   - This method reduces the chance of overfitting to particular features that may be very predictive in the training data but not necessarily in unseen data. By de-correlating the trees, the Random Forest ensemble becomes more robust and generalizes better.

</div>
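
The effect of de-correlation on variance can be made concrete with a standard result, stated here as a sketch under the simplifying assumption that the $B$ tree predictions $T_b(x)$ are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$ (symbols introduced here for illustration):

$$
\operatorname{Var}\left(\frac{1}{B} \sum_{b=1}^{B} T_b(x)\right) = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2
$$

The second term vanishes as $B$ grows, but the first does not, so lowering $\rho$ through random feature selection is what reduces the variance beyond plain bagging.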


(g) (2 points) The analysts debate whether a random forest is an “interpretable” model. Holly argues that it is interpretable, while Will argues that its interpretability is limited. Briefly discuss why they may both be correct.

<div style="color:blue">

Why Random Forests are Interpretable

* **Interpretability of Individual Tree**: Each decision tree within a Random Forest is interpretable. Decision trees make decisions based on clear, logical rules that can be visualized and understood.
* **Feature Importance**: Random Forests provide insights into feature importance — how much each feature contributes to the prediction. This can be interpreted as the model giving a clear indication of which features are more important for the predictions, thus providing a form of interpretability.

Why Random Forests might not be Interpretable

* A Random Forest is an ensemble of many decision trees. Ensembling the trees does not provide a clear, singular decision path or rule set. Understanding how all trees in the forest collectively contribute to a particular prediction is not straightforward.

</div>


## 4. Text Representation Learning and Gradient Descent

We consider the problem of learning word vectors from an unlabeled corpus using the skipgram model. Word embedding techniques learn to represent the words in a large text corpus as $N$ dimensional vectors, with the goal of making similar words close to each other in the vector space. The well-known skip-gram model achieves this goal by predicting the context words for a given center word. Let $\boldsymbol{v}_c$ denote the word vector of a given center word $\boldsymbol{c}$. Skip-gram models the probability of observing a context word $\boldsymbol{o}$ from the vocabulary using the softmax function:

$$
p(\boldsymbol{o} \mid \boldsymbol{c})=\frac{\exp \left(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c\right)}{\sum_{w=1}^W \exp \left(\boldsymbol{u}_w^{\top} \boldsymbol{v}_c\right)}, \tag{1}
$$

where $\boldsymbol{w}$ is the w-th word in the vocabulary and $\boldsymbol{u}_w(w=1, \ldots, W)$ are the context word vectors.
However, instead of directly maximizing the likelihood of groundtruth context words, we often use the negative sampling technique for learning the word vectors. It randomly draws $K$ negative samples (words) from the vocabulary, denoted as $1, \cdots, K$ $(o \notin\{1, \ldots, K\}$ ). The learning objective for negative sampling is to distinguish the groundtruth context word $\boldsymbol{o}$ from the negative samples $1, \cdots, K$. The loss function for this negative sampling model is given by:

$$
J\left(\boldsymbol{o}, \boldsymbol{v}_c, \boldsymbol{U}\right)=-\log \left(\sigma\left(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c\right)\right)-\sum_{k=1}^K \log \left(\sigma\left(-\boldsymbol{u}_k^{\top} \boldsymbol{v}_c\right)\right), \tag{2}
$$

where $\sigma(x)=\frac{1}{1+\exp (-x)}$ is the sigmoid function, and $\boldsymbol{U}$ is the set of embeddings for all the words in the vocabulary.

(a) (5 points) Derive the stochastic gradient descent algorithm for the unknown parameters $\boldsymbol{v}_c, \boldsymbol{u}_o$ and $\boldsymbol{u}_k(k=1,2, \ldots, K)$ for the loss function $J$.


<div style="color:blue">

### Gradient with Respect to $\boldsymbol{v}_c$
    
For the first term $-\log \left(\sigma\left(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c\right)\right)$, using $\sigma'(x) = \sigma(x)(1-\sigma(x))$, the gradient is:

$-\frac{\partial}{\partial \boldsymbol{v}_c} \log \left(\sigma\left(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c\right)\right) = -\frac{1}{\sigma(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)} \cdot \sigma(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)(1-\sigma(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)) \cdot \boldsymbol{u}_o = -(1 - \sigma(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)) \boldsymbol{u}_o$

For the second term $-\sum_{k=1}^K \log \left(\sigma\left(-\boldsymbol{u}_k^{\top} \boldsymbol{v}_c\right)\right)$, the gradient is:

$- \sum_{k=1}^K \frac{1}{\sigma(-\boldsymbol{u}_k^{\top} \boldsymbol{v}_c)} \cdot \sigma(-\boldsymbol{u}_k^{\top} \boldsymbol{v}_c)(1-\sigma(-\boldsymbol{u}_k^{\top} \boldsymbol{v}_c)) \cdot (-\boldsymbol{u}_k) = \sum_{k=1}^K \boldsymbol{u}_k (1 - \sigma(-\boldsymbol{u}_k^{\top} \boldsymbol{v}_c))$

Therefore, the overall gradient with respect to $\boldsymbol{v}_c$ is:

$\frac{\partial J}{\partial \boldsymbol{v}_c} = - (1 - \sigma(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)) \boldsymbol{u}_o + \sum_{k=1}^K  (1 - \sigma(-\boldsymbol{u}_k^{\top} \boldsymbol{v}_c)) \boldsymbol{u}_k$



### Gradient with Respect to $\boldsymbol{u}_o$

   Next, we calculate $\frac{\partial J}{\partial \boldsymbol{u}_o}$:

   $\frac{\partial J}{\partial \boldsymbol{u}_o} = -\frac{1}{\sigma(\boldsymbol{u}_o^\top \boldsymbol{v}_c)} \sigma(\boldsymbol{u}_o^\top \boldsymbol{v}_c) (1 - \sigma(\boldsymbol{u}_o^\top \boldsymbol{v}_c)) \boldsymbol{v}_c = -(1 - \sigma(\boldsymbol{u}_o^\top \boldsymbol{v}_c)) \boldsymbol{v}_c$

### Gradient with Respect to $\boldsymbol{u}_k$

   Finally, for each negative sample $\boldsymbol{u}_k$, we compute $\frac{\partial J}{\partial \boldsymbol{u}_k}$:

   $\frac{\partial J}{\partial \boldsymbol{u}_k} = -\frac{1}{\sigma(-\boldsymbol{u}_k^\top \boldsymbol{v}_c)} \sigma(-\boldsymbol{u}_k^\top \boldsymbol{v}_c) (1 - \sigma(-\boldsymbol{u}_k^\top \boldsymbol{v}_c)) (-\boldsymbol{v}_c) = (1 - \sigma(-\boldsymbol{u}_k^\top \boldsymbol{v}_c)) \boldsymbol{v}_c = \sigma(\boldsymbol{u}_k^\top \boldsymbol{v}_c) \boldsymbol{v}_c$

### Stochastic Gradient Descent Update Rules 

The update rules in the SGD algorithm involve adjusting the parameters in the opposite direction of the gradient, scaled by a learning rate $\eta$. Thus, the update rules are:

* $\boldsymbol{v}_c \leftarrow \boldsymbol{v}_c - \eta \frac{\partial J}{\partial \boldsymbol{v}_c}$
* $\boldsymbol{u}_o \leftarrow \boldsymbol{u}_o - \eta \frac{\partial J}{\partial \boldsymbol{u}_o}$
* $\boldsymbol{u}_k \leftarrow \boldsymbol{u}_k - \eta \frac{\partial J}{\partial \boldsymbol{u}_k}$ for each negative sample $k = 1, \ldots, K$

These update rules are applied iteratively to optimize the word vectors in the skip-gram model with negative sampling.



</div>
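
One way to validate gradients of $J$ is a finite-difference check. The sketch below (plain Python; the vector dimension, number of negatives, random initialization, and learning rate are arbitrary illustrative choices) implements the loss, gradients obtained by differentiating $J$ directly, and one SGD step, then compares the analytic gradient for $\boldsymbol{v}_c$ with a numerical estimate.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loss(v_c, u_o, u_neg):
    """Negative-sampling loss J for one center word, one context word, K negatives."""
    return -math.log(sigmoid(dot(u_o, v_c))) - sum(
        math.log(sigmoid(-dot(u_k, v_c))) for u_k in u_neg
    )

def gradients(v_c, u_o, u_neg):
    """Analytic gradients of J with respect to v_c, u_o and each u_k."""
    g_o = sigmoid(dot(u_o, v_c)) - 1.0                # -(1 - sigma(u_o . v_c))
    g_ks = [sigmoid(dot(u_k, v_c)) for u_k in u_neg]  # 1 - sigma(-u_k . v_c)
    grad_v = [
        g_o * u_o[i] + sum(g * u_k[i] for g, u_k in zip(g_ks, u_neg))
        for i in range(len(v_c))
    ]
    grad_uo = [g_o * x for x in v_c]
    grad_uk = [[g * x for x in v_c] for g in g_ks]
    return grad_v, grad_uo, grad_uk

def sgd_step(v_c, u_o, u_neg, eta=0.1):
    """One SGD update of all parameters with learning rate eta."""
    grad_v, grad_uo, grad_uk = gradients(v_c, u_o, u_neg)
    new_v = [x - eta * g for x, g in zip(v_c, grad_v)]
    new_uo = [x - eta * g for x, g in zip(u_o, grad_uo)]
    new_uk = [
        [x - eta * g for x, g in zip(u_k, gk)] for u_k, gk in zip(u_neg, grad_uk)
    ]
    return new_v, new_uo, new_uk

random.seed(0)
N, K = 4, 3  # illustrative embedding size and number of negative samples
v_c = [random.uniform(-1, 1) for _ in range(N)]
u_o = [random.uniform(-1, 1) for _ in range(N)]
u_neg = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]

# Central finite-difference estimate of dJ/dv_c, compared with the analytic gradient.
eps = 1e-6
grad_v, _, _ = gradients(v_c, u_o, u_neg)
numeric = []
for i in range(N):
    plus = list(v_c)
    minus = list(v_c)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((loss(plus, u_o, u_neg) - loss(minus, u_o, u_neg)) / (2 * eps))
max_err = max(abs(a - b) for a, b in zip(grad_v, numeric))
print("max abs difference:", max_err)
```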




(b) (3 points) From your derived gradient descent algorithm, discuss why using Eq. (2) and the gradient descent procedure can make semantically similar words close to each other in the vector space.

<div style="color:blue">

The loss function used in this setting has two components:

* The first term encourages the model to bring the vector of the center word $\boldsymbol{v}_c$ closer to the vector of the actual context word $\boldsymbol{u}_o$ by maximizing the dot product $\boldsymbol{u}_o^{\top} \boldsymbol{v}_c$.
* The second term penalizes the model if it brings $\boldsymbol{v}_c$ close to vectors of negative sample words $\boldsymbol{u}_k$. These negative samples are random words from the vocabulary that are not in the context of the center word.
    
Through the gradient descent updates:

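* The vector for the center word $\boldsymbol{v}_c$ is adjusted to be more similar to the vectors of its context words and less similar to the negative samples.

Semantic Similarity: Semantically similar words tend to appear with the same context words. Their center vectors are therefore pulled toward the same context vectors $\boldsymbol{u}_o$ and pushed away from random negatives, so over many updates the vectors of semantically similar words end up close to each other in the vector space.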
    
</div>
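
To make the direction of the update explicit, here is a worked step: substituting $\frac{\partial J}{\partial \boldsymbol{v}_c}$ into the SGD rule for $\boldsymbol{v}_c$ and using the identity $1 - \sigma(-x) = \sigma(x)$ gives

$$
\boldsymbol{v}_c \leftarrow \boldsymbol{v}_c + \eta \left(1 - \sigma\left(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c\right)\right) \boldsymbol{u}_o - \eta \sum_{k=1}^K \sigma\left(\boldsymbol{u}_k^{\top} \boldsymbol{v}_c\right) \boldsymbol{u}_k
$$

Both coefficients are positive, so each step moves $\boldsymbol{v}_c$ toward $\boldsymbol{u}_o$ and away from every $\boldsymbol{u}_k$. The pull is strongest when $\sigma(\boldsymbol{u}_o^{\top} \boldsymbol{v}_c)$ is small, and the push is strongest when $\sigma(\boldsymbol{u}_k^{\top} \boldsymbol{v}_c)$ is large, i.e. when the model is most wrong.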

(c) (2 points) Discuss the advantage of using Eq. (2) instead of Eq. (1) for learning word embeddings.

<div style="color:blue">

The key difference lies in the denominator of the softmax function. 
    
For a large vocabulary (where $W$ is large), computing the sum of exponentials over the entire vocabulary for each training example is computationally expensive and impractical. 

In contrast, the negative sampling loss function involves computing sigmoid functions for the target word and a small number of negative samples ($K$), which is much more computationally efficient. This efficiency makes negative sampling particularly useful for training on large datasets with extensive vocabularies.
    
</div>