# Assignment 3: Neural networks in natural language processing
### Due Date: Nov 21

### Grade (100 pts, 10%)

*Note: This assignment covers material from the recordings, notes, demos, and suggested readings from Lectures 8, 9, and 11.*

---

## Questions

### 1. Dropout (35 pts)

Dropout is a regularization technique that randomly sets units in each activation layer, $a \in \mathbb{R}^{D}$, to zero and then multiplies the resultant vector elementwise by a constant $\gamma$ according to:

$$a' \leftarrow  \gamma m \odot a$$

where $\odot$ represents the element-wise product operator and $m \in \{0, 1\}^D$ is a mask with entries drawn from a Bernoulli distribution: 

$$m_i = \begin{cases} 0 & \text{with probability } p_{do} \\ 1 & \text{with probability } 1 - p_{do} \end{cases}$$
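The masking step above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `apply_dropout`, the example activations, and the value `gamma=1.0` are assumptions made for the example. In particular, `gamma=1.0` is only a placeholder and is not the constant the question asks you to derive.

```python
import numpy as np

def apply_dropout(a, p_do, gamma, rng):
    """Return (gamma * m * a, m), where m_i is 0 with probability p_do and 1 otherwise."""
    # rng.random draws uniformly from [0, 1), so (u >= p_do) is True with probability 1 - p_do
    m = (rng.random(a.shape) >= p_do).astype(a.dtype)
    return gamma * m * a, m

rng = np.random.default_rng(0)
a = np.array([0.5, 1.2, 0.0, 2.0, 0.7])  # hypothetical activation vector, D = 5

# gamma=1.0 is a placeholder (no rescaling), not the answer to the question
a_prime, m = apply_dropout(a, p_do=0.4, gamma=1.0, rng=rng)
print("mask:", m)
print("a'  :", a_prime)
```

A fresh mask is drawn on every call, so repeated calls with the same `a` generally return different `a_prime` vectors.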

Because dropout is performed at training time and not during inference, blindly applying dropout will lead to a data distribution shift at inference. This is evident in how the next layer's pre-activations are computed:


$$ 
\begin{aligned}
z_{i}^{(l+1)} &= W_{i}^{(l+1)} \cdot a^{'(l)} \quad \text{(training)} \\
z_{i}^{(l+1)} &= W_{i}^{(l+1)} \cdot a^{(l)} \quad \text{(inference)}
\end{aligned}
$$

where $W_{i}^{(l+1)} \in \mathbb{R}^{D}$. Because $z_{i}^{(l+1)}$ is a weighted sum over the activations from the previous layer, then for an identical input the node values will shift in the absence of dropout unless we scale the activation values. (The shift is almost certain; the pathological exception is the case in which the values in $a^{(l)}$ come from a zero-mean, zero-skew distribution, which is unlikely.) To ensure this doesn't happen, derive a training-time scaling constant, $\gamma$, that makes each value $z_{i}^{(l+1)}$ invariant (in expectation) to the dropout operation.
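The shift described above can be observed numerically. The sketch below compares one pre-activation computed without dropout against its average over many sampled masks when no rescaling is applied (that is, $\gamma = 1$). The dimension, dropout probability, number of trials, and the distributions used for the activations and weights are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, p_do, n_trials = 1000, 0.5, 2000  # assumed sizes, for illustration only

a = np.abs(rng.normal(size=D))       # nonnegative, ReLU-like activations (assumed)
w = rng.uniform(0.0, 1.0, size=D)    # a single weight row W_i (assumed values)

# Inference: no dropout
z_inference = w @ a

# Training with gamma = 1: sample one mask per trial, each entry kept with probability 1 - p_do
masks = rng.random((n_trials, D)) >= p_do
z_training = (masks * a) @ w         # shape (n_trials,), one pre-activation per mask

print(f"inference z:                 {z_inference:.2f}")
print(f"mean training z (gamma = 1): {z_training.mean():.2f}")
```

The two printed values differ noticeably for this input, which is the mismatch the scaling constant is meant to remove.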

*Hint: You want to find the $\gamma$ that makes the following true (in expectation). Some search terms that might come in handy: expected value of a Bernoulli RV, the weak law of large numbers.*

$$
\sum_{j=0}^{D-1} W_{i,j}^{(l+1)} a_{j}^{(l)}  = \gamma \sum_{j=0}^{D-1} W_{i,j}^{(l+1)} m_{j} a_{j}^{(l)}
$$

Your answer goes here ...

### 2. Convolutions (30 pts)

Consider a sequence of $T$ token embeddings, $Z \in \mathbb{R}^{T \times D}$, for which $D=3$:

In [None]:
import numpy as np

Z = np.array([
    [1.3,   0.4, -0.2],
    [-3.1,  1.1,  2.1],
    [0.9,   2.8, -1.5],
    [1.3,   2.4,  0.1],
    [1.0,   1.0,  0.5],
    [3.0,  -1.4, -0.2],
    [-0.7,  1.8,  1.3]
])

and a set of convolutional filters, $W=\{ w^{(1)}, w^{(2)} \}$, and corresponding filter widths $S=\{ s^{(1)}, s^{(2)}  \}$:

In [2]:
w1 = np.array([
    [1, 1, 1],
    [1, 1, 1]
])

w2 = np.array([
    [2, 2, 2],
    [2, 2, 2],
    [2, 2, 2]
])

W = [w1, w2]

S = [2, 3]

In Lecture 08 we discussed a set of operations that maps $Z \in \mathbb{R}^{T \times D}$ onto $Z' \in \mathbb{R}^{N_F D}$ (in this problem $N_F = 2$). This involved three steps:

1. **Convolution**: The convolutional operation produces $N_F$ feature maps, $B^{(n)} \in \mathbb{R}^{(T - s^{(n)} + 1) \times D}$, where $n \in \{1, \dots, N_F\}$, according to:

$$
\forall_{t \in \{ 1, \dots, T - s^{(n)} + 1 \} } \; B^{(n)}_{t,j} = \sum_{t'=1}^{s^{(n)}} w^{(n)}_{t',j} \; Z_{t+t'-1, \ j}
$$

2. **Max pooling**: The max pooling operation computes the max over the sequence dimension in each feature map, producing $B_{maxpool}^{(n)} \in \mathbb{R}^D$, according to:

$$
B_{maxpool, j}^{(n)} = \underset{1 \leq t' \leq T - s^{(n)} + 1 }{\max} B^{(n)}_{t', j}
$$

3. **Concatenation**: The resulting set of $N_F$ feature vectors is then concatenated into a single vector $Z'$ according to:

$$
Z' = \big[ B_{maxpool}^{(1)}, \dots, B_{maxpool}^{(n)}, \dots,  B_{maxpool}^{(N_F)}  \big] \in \mathbb{R}^{D \cdot N_F}
$$
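The three steps above can be sketched on a small toy input. This is one possible NumPy reading of the formulas, not the only way to implement them. The helper name `conv_feature_map` and the arrays `Z_toy` and `filters_toy` are hypothetical and deliberately differ from the `Z` and `W` defined earlier.

```python
import numpy as np

def conv_feature_map(Z, w):
    """Per-dimension convolution of Z (T x D) with a filter w (s x D), without padding."""
    T, _ = Z.shape
    s = w.shape[0]
    # Row t of the result sums w * Z[t:t+s] over the window, separately for each column j
    return np.stack([(w * Z[t:t + s]).sum(axis=0) for t in range(T - s + 1)])

# Toy input with T = 4 tokens and D = 2 dimensions (hypothetical values)
Z_toy = np.array([
    [1.0,  0.0],
    [2.0, -1.0],
    [0.0,  3.0],
    [-1.0, 1.0],
])
filters_toy = [np.ones((2, 2)), 2 * np.ones((3, 2))]  # filter widths 2 and 3

# Step 1: one feature map per filter, shape (T - s + 1, D)
feature_maps = [conv_feature_map(Z_toy, w) for w in filters_toy]
# Step 2: max over the sequence dimension (axis 0), giving one vector of length D per filter
pooled = [B.max(axis=0) for B in feature_maps]
# Step 3: concatenate into a single vector of length D * N_F
Z_prime_toy = np.concatenate(pooled)

print(Z_prime_toy)  # [3. 4. 6. 6.]
```

The list comprehension favors readability over speed; a vectorized variant would behave the same on inputs this small.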

In the cell below, perform these three operations to produce $Z' \in \mathbb{R}^6$ and print it.

*Hint: The max pooling operation computes the maximum over each column in $B^{(n)}$*

In [None]:
# Your answer goes here

### 3. Attention (35 pts)

In this problem, you will use the query, key, and value weight matrices of a pretrained language model to compute a simple self-attention layer for a single input sequence, and then plot the resulting attention weights.

The input array $X$ is a $T \times D_x$ matrix, where $T$ is the number of tokens in your input sequence and $D_x$ is the dimension of the token embedding. The query mapping $W_q$ has shape $D_x \times D_q$, where $D_q$ is the query dimension. The key mapping $W_k$ also has shape $D_x \times D_q$, and the value mapping $W_v$ has shape $D_x \times D_v$, where $D_v$ is the value dimension. These mappings act on the input array $X$ to produce the query, key, and value matrices (with associated biases):

$$ Q = XW_q + b_q, \quad K = XW_k + b_k, \quad \text{and} \quad V = XW_v + b_v.$$

With the above computations, we compute the alignments as $E = QK^T / \sqrt{D_q}$. From these unnormalized scores, we obtain the attention weights $A$ by applying a softmax over each row: $A = \textrm{softmax}(E)$. Outputs are obtained by computing $Y = AV$.
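The computation above can be sketched in NumPy with small random matrices. This is an illustration under stated assumptions rather than a solution to the problem: the helper names `softmax_rows` and `self_attention`, the toy dimensions, and the randomly drawn inputs and parameters are all hypothetical, and no DistilBERT weights are involved.

```python
import numpy as np

def softmax_rows(E):
    """Softmax over the last axis, so each row of the result sums to 1."""
    E = E - E.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    expE = np.exp(E)
    return expE / expE.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, b_q, W_k, b_k, W_v, b_v):
    """Return (A, Y) following Q = XW_q + b_q, K = XW_k + b_k, V = XW_v + b_v."""
    Q = X @ W_q + b_q
    K = X @ W_k + b_k
    V = X @ W_v + b_v
    E = Q @ K.T / np.sqrt(Q.shape[-1])  # alignments, shape (T, T)
    A = softmax_rows(E)                 # attention weights, shape (T, T)
    return A, A @ V                     # outputs Y, shape (T, D_v)

# Toy sizes and random parameters (assumptions for illustration only)
rng = np.random.default_rng(0)
T, D_x, D_q, D_v = 4, 8, 6, 5
X = rng.normal(size=(T, D_x))
W_q, b_q = rng.normal(size=(D_x, D_q)), rng.normal(size=D_q)
W_k, b_k = rng.normal(size=(D_x, D_q)), rng.normal(size=D_q)
W_v, b_v = rng.normal(size=(D_x, D_v)), rng.normal(size=D_v)

A, Y = self_attention(X, W_q, b_q, W_k, b_k, W_v, b_v)
print(A.shape, Y.shape)  # (4, 4) (4, 5)
```

Row $t$ of `A` holds the weights that token $t$ places on every token in the sequence, which is the quantity a per-token attention plot would display.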


(A) For the sequence `The quick brown fox jumps over the lazy dog`, compute the self-attention weights using the operations described above for the DistilBERT model (https://huggingface.co/distilbert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France). You will have to tokenize the input the same way the model's training data was tokenized.


(B) For each token in the sequence from (A), visualize its attention weights.


In [None]:
# Starter code for (A)

from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

In [None]:
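# Loop through the named parameters to grab the layer-5 query, key, and value weights and biases.
# Note: torch.nn.Linear stores `weight` with shape (out_features, in_features),
# i.e. transposed relative to the D_x x D_q convention used in the formulas above.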
for name, param in model.named_parameters():                
    if name == "transformer.layer.5.attention.q_lin.weight":
        W_q = param
    if name == "transformer.layer.5.attention.q_lin.bias":
        b_q = param
    if name == "transformer.layer.5.attention.k_lin.weight":
        W_k = param
    if name == "transformer.layer.5.attention.k_lin.bias":
        b_k = param
    if name == "transformer.layer.5.attention.v_lin.weight":
        W_v = param
    if name == "transformer.layer.5.attention.v_lin.bias":
        b_v = param

In [None]:
# Code for (A) here

In [None]:
# Code for (B) here