# CSE 256: Statistical NLP UCSD: Assignment 0

## PyTorch + Paper Reading (60 points)
### <font color='blue'> Due: Friday April 14, 2023 at 10pm </font>


#### IMPORTANT: After copying this notebook to your Google Drive, paste a link to it below. To get a publicly-accessible link, click the *Share* button at the top right, then click "Get shareable link" and copy the link. If you fail to do this, you will receive no credit for this assignment!
#### <font color="red">Link: paste your link here:  </font>


---
 
- All cells should run almost instantly. If a cell takes a long time, you have made a mistake.

- Submission Instructions are located at the bottom of the notebook.

---


## Question 1 (10 points)
In modern NLP, we represent words with low-dimensional vectors, also called *embeddings*. We use these embeddings to compute a vector representation $\boldsymbol{x}$ of a given prefix (a sequence of words). In the cell below, we use [PyTorch](https://pytorch.org), a machine learning framework, to explore this setup. We provide embeddings for the prefix "Alice talked to"; your job is to combine them into a single vector representation $\boldsymbol{x}$ using [element-wise vector addition](https://ml-cheatsheet.readthedocs.io/en/latest/linear_algebra.html#elementwise-operations).

*TIP: if you're finding the PyTorch coding problems difficult, you may want to run through [the 60 minutes blitz tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)!*

---

In [9]:
import torch
torch.set_printoptions(sci_mode=False)
torch.manual_seed(0)

prefix = 'Alice talked to'

# spend some time understanding this code / reading relevant documentation! 
# this is a toy problem with a 5 word vocabulary and 10-d embeddings
embeddings = torch.nn.Embedding(num_embeddings=5, embedding_dim=10)
vocab = {'Alice':0, 'talked':1, 'to':2, 'Bob':3, '.':4}

# we need to encode our prefix as integer indices (not words) that index 
# into the embeddings matrix. the below line accomplishes this.
# note that PyTorch inputs are always Tensor objects, so we need
# to create a LongTensor out of our list of indices first.
indices = torch.LongTensor([vocab[w] for w in prefix.split()])
prefix_embs = embeddings(indices)
print('prefix embedding tensor size: ', prefix_embs.size())

# okay! we now have three embeddings corresponding to each of the three
# words in the prefix. write some code that adds them element-wise to obtain
# a representation of the prefix! store your answer in a variable named "x".

print(prefix_embs)
### YOUR CODE HERE!
# sum over dim 0 (the word dimension): (3, 10) -> (10,)
x = torch.sum(prefix_embs, dim=0)


### DO NOT MODIFY THE LINE BELOW
print('embedding sum: ', x)


prefix embedding tensor size:  torch.Size([3, 10])
tensor([[-1.1258, -1.1524, -0.2506, -0.4339,  0.8487,  0.6920, -0.3160, -2.1152,
          0.3223, -1.2633],
        [ 0.3500,  0.3081,  0.1198,  1.2377,  1.1168, -0.2473, -1.3527, -1.6959,
          0.5667,  0.7935],
        [ 0.5988, -1.5551, -0.3414,  1.8530,  0.7502, -0.5855, -0.1734,  0.1835,
          1.3894,  1.5863]], grad_fn=<EmbeddingBackward0>)
embedding sum:  tensor([-0.1770, -2.3993, -0.4721,  2.6568,  2.7157, -0.1408, -1.8421, -3.6277,
         2.2783,  1.1165], grad_fn=<SumBackward1>)
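
To make the lookup-and-sum step concrete, here is a minimal sketch that rebuilds the same toy setup in isolation. It checks two things: calling the `Embedding` layer on a tensor of indices selects the matching rows of its weight matrix, and summing over `dim=0` gives the same result as adding the three word vectors one by one. The names `looked_up`, `indexed`, `summed`, and `manual` are illustrative and not part of the assignment.

```python
import torch

torch.manual_seed(0)

# Same toy setup as the cell above: 5-word vocabulary, 10-d embeddings.
embeddings = torch.nn.Embedding(num_embeddings=5, embedding_dim=10)
vocab = {'Alice': 0, 'talked': 1, 'to': 2, 'Bob': 3, '.': 4}
indices = torch.LongTensor([vocab[w] for w in 'Alice talked to'.split()])

# Calling the layer selects rows of its weight matrix, one row per index.
looked_up = embeddings(indices)        # shape: (3, 10)
indexed = embeddings.weight[indices]   # plain row indexing, same shape
same_rows = torch.equal(looked_up, indexed)

# Summing over dim=0 collapses the 3 word vectors into one 10-d vector.
summed = looked_up.sum(dim=0)
manual = looked_up[0] + looked_up[1] + looked_up[2]
same_sum = torch.allclose(summed, manual, atol=1e-6)

print(same_rows, same_sum, summed.shape)
```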


## Question 2 (5 points)
In practice, we do not use element-wise addition to combine the different word embeddings in the prefix into a single vector representation (a process called *composition*). What is a major issue with this approach as a composition function?

Answer: Using element-wise addition to combine word embeddings fails to capture the syntax of the prefix, because the sum is identical for any ordering of the words, and it can also distort the semantics. For example, in the prefix "not happy," the embeddings of "not" and "happy" may point in opposite directions, so the composite vector may not accurately represent the meaning of the prefix. Furthermore, element-wise addition can produce vectors with magnitudes that are too high or too low, which can make them difficult to use in downstream tasks.
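
One concrete way to see a limitation of summation is that addition is commutative, so word order is lost. The sketch below reuses the toy vocabulary from Question 1; `sum_compose` is a hypothetical helper introduced only for this illustration. Two sentences with different meanings end up with the same composed vector, up to floating-point rounding.

```python
import torch

torch.manual_seed(0)

embeddings = torch.nn.Embedding(num_embeddings=5, embedding_dim=10)
vocab = {'Alice': 0, 'talked': 1, 'to': 2, 'Bob': 3, '.': 4}


def sum_compose(sentence):
    """Sum the embeddings of the whitespace-separated words in `sentence`."""
    idx = torch.LongTensor([vocab[w] for w in sentence.split()])
    return embeddings(idx).sum(dim=0)


a = sum_compose('Alice talked to Bob')
b = sum_compose('Bob talked to Alice')

# Same words, different order: the summed representations coincide.
order_invariant = torch.allclose(a, b, atol=1e-6)
print(order_invariant)
```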

#### <font color="red">Write your answer here (2-3 sentences) </font>



## Question 3 (10 points)
One very important function in current NLP (and for essentially every task we'll look at) is the [softmax](https://pytorch.org/docs/master/nn.functional.html#softmax), which is defined over an $n$-dimensional vector $\langle x_1, x_2, \dots, x_n \rangle$ as $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{1 \leq j \leq n} e^{x_j}}$. Let's say we have our prefix representation $\boldsymbol{x}$ from before. We can use the softmax function, along with a linear projection using a matrix $W$, to go from $\boldsymbol{x}$ to a probability distribution $p$ over the next word: $p = \text{softmax}(W\boldsymbol{x})$. Let's explore this in the code cell below:


In [20]:
# remember, our goal is to produce a probability distribution over the 
# next word, conditioned on the prefix representation x. This distribution
# is thus over the entire vocabulary (i.e., it is a 5-dimensional vector).
# take a look at the dimensionality of x, and you'll notice that it is a 
# 10-dimensional vector. first, we need to **project** this representation
# down to 5-d. We'll do this using the below matrix:

W = torch.rand(10, 5)

# use this matrix to project x to a 5-d space, and then
# use the softmax function to convert it to a probability distribution.
# this will involve using PyTorch to compute a matrix/vector product.
# look through the documentation if you're confused (torch.nn.functional.softmax)
# please store your final probability distribution in the "probs" variable.

### YOUR CODE HERE
# x is (10,) and W is (10, 5), so x @ W gives the 5-d vector of scores
probs = torch.nn.functional.softmax(torch.matmul(x, W), dim=0)


### DO NOT MODIFY THE BELOW LINE!
print('probability distribution', probs)


probability distribution tensor([0.0102, 0.0526, 0.0056, 0.9069, 0.0247], grad_fn=<SoftmaxBackward0>)
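
As a sanity check on the softmax definition, the sketch below computes it directly from the formula and compares the result with `torch.nn.functional.softmax`. The vector `x` here is a random stand-in for the prefix representation, not the notebook's actual value. Note also that `W` has shape `(10, 5)`, so the code computes `x @ W` (equivalently $W^\top\boldsymbol{x}$) rather than a literal $W\boldsymbol{x}$.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the notebook's x (10-d) and W (10 x 5).
x = torch.randn(10)
W = torch.rand(10, 5)

logits = torch.matmul(x, W)  # shape: (5,), one score per vocabulary word

# Softmax written out from the definition: exp(x_i) / sum_j exp(x_j).
manual = torch.exp(logits) / torch.exp(logits).sum()

# Library version; dim=0 because logits is a 1-d vector.
library = torch.nn.functional.softmax(logits, dim=0)

print(torch.allclose(manual, library), library.sum())
```

The direct formula is fine for small scores like these; for large scores, `torch.exp` can overflow, which is one reason to prefer the library function in practice.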


## Question 4 (15 points)
So far, we have looked at just a single prefix ("Alice talked to"). In practice, it is common to process many prefixes in one computation, as this lets us take advantage of GPU parallelism and also obtain better gradient approximations (we'll talk more about the latter point later). This is called *batching*, where each prefix is an example in a larger batch. Here, you'll redo the computations from the previous cells, but instead of having one prefix, you'll have a batch of two prefixes. The final output of this cell should be a 2x5 matrix that contains two probability distributions, one for each prefix. **NOTE: YOU WILL LOSE POINTS IF YOU USE ANY LOOPS IN YOUR ANSWER!** Your code should be completely vectorized (a few large computations are faster than many smaller ones).

In [32]:

# for this problem, we'll just copy our old prefix twice
# to form a batch. in practice, each example in the batch would be different.
batch_indices = torch.cat(2 * [indices]).reshape((2, 3))
batch_embs = embeddings(batch_indices)

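# sum over dim 1 (the word dimension): (2, 3, 10) -> (2, 10)
x = torch.sum(batch_embs, dim=1)

# project to (2, 5), then normalize each row into a distribution
batch_probs = torch.nn.functional.softmax(torch.matmul(x, W), dim=1)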



### DO NOT MODIFY THE BELOW LINE
print("batch probability distributions:", batch_probs)

batch probability distributions: tensor([[0.0102, 0.0526, 0.0056, 0.9069, 0.0247],
        [0.0102, 0.0526, 0.0056, 0.9069, 0.0247]], grad_fn=<SoftmaxBackward0>)
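
Because the batch above repeats one prefix, both rows are identical. The sketch below uses a hypothetical batch of two different prefixes to show that the vectorized computation matches processing each prefix separately. The loop appears only in the reference check, not in the batched path, and the index values are illustrative assumptions based on the toy vocabulary.

```python
import torch

torch.manual_seed(0)

embeddings = torch.nn.Embedding(num_embeddings=5, embedding_dim=10)
W = torch.rand(10, 5)

# Hypothetical batch of two different 3-word prefixes (toy vocabulary indices).
batch_indices = torch.LongTensor([[0, 1, 2],    # "Alice talked to"
                                  [3, 1, 2]])   # "Bob talked to"

# Vectorized path: no Python loops.
batch_embs = embeddings(batch_indices)            # (2, 3, 10)
batch_x = batch_embs.sum(dim=1)                   # (2, 10)
batch_probs = torch.softmax(batch_x @ W, dim=1)   # (2, 5)

# Reference path: one prefix at a time (for checking only).
single = torch.stack([
    torch.softmax(embeddings(row).sum(dim=0) @ W, dim=0)
    for row in batch_indices
])

matches = torch.allclose(batch_probs, single, atol=1e-6)
print(batch_embs.shape, batch_probs.shape, matches)
```

The key detail is the `dim` argument: with a leading batch dimension, the word axis moves from `dim=0` to `dim=1`, and the softmax must normalize across the vocabulary axis (`dim=1`) so that each row sums to 1.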


## Question 5 (20 points)

Choose one paper from [ACL 2022](https://aclanthology.org/events/acl-2022/#2022acl-long) that you find interesting. A good way to do this is by scanning the titles and abstracts; there are hundreds of papers, so take your time before selecting one! Then, write a summary of the paper in your own words. Your summary should answer the following questions: What is its motivation? Why should anyone care about it? Were there things in the paper that you didn't understand at all? What were they? Fill out the cell below, and make sure to write 2-4 paragraphs for the summary to receive full credit!

#### <font color="red">Write your answer here (2-4 paragraphs) </font>

**Title of paper**:

**Authors**:

**URL**:

**Your summary**:

# <font color="blue"> Submission Instructions</font>

1. Click the Save button at the top of the Jupyter Notebook.
2. Select Edit -> Clear All Outputs. This will clear all the outputs from all cells (but will keep the content of all cells).
3. Select Runtime -> Run All. This will run all the cells in order, and will take several minutes.
4. Once you've rerun everything, select File -> Print -> Save as PDF, or save the webpage as a PDF. <font color='blue'> Make sure all your solutions, especially the coding parts, are displayed in the PDF</font> (it's okay if the provided code gets cut off because lines are not wrapped in code cells).
5. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing your graders will see!
6. Submit your PDF on Gradescope.


####  <font color="blue">  Academic honesty </font>

- We reserve the right to audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your PDF. If you turn in correct answers on your PDF without code that actually generates those answers, we will consider this a serious case of cheating.




#### <font color="blue"> Acknowledgements</font>
This assignment is based on an assignment developed by Mohit Iyyer.