In [1]:
import math
import torch
import torch.nn as nn

### Exercise 2. (A)


When we anticipate numerous out-of-vocabulary (OOV) words at test time, the preferred tokenization method is **subword tokenization**, because it decomposes unseen words into smaller, more frequent units (subwords) that are in the vocabulary. Representing an OOV word as a sequence of known subwords lets the model still capture meaningful information from it, even though the word itself never appeared during training.


Two widely used subword tokenization algorithms are WordPiece and Byte Pair Encoding (BPE). **WordPiece** learns a subword vocabulary from the training corpus and, at tokenization time, splits each word into the longest matching pieces from that vocabulary. The resulting pieces often correspond to meaningful morphemes.
**Byte Pair Encoding (BPE)** starts from individual characters (or bytes) and iteratively merges the most frequent adjacent pair of symbols in the corpus until the vocabulary reaches a specified size. Both methods learn their vocabulary from data. They differ mainly in the merge criterion: BPE uses raw pair frequency, whereas WordPiece chooses the merge that most increases the likelihood of the training data.
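
The BPE merge loop described above can be sketched in a few lines. This is a toy illustration, not a production tokenizer: the function name `learn_bpe_merges` and the word counts are made up for the example, and real implementations add end-of-word markers and faster pair bookkeeping.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with one merged symbol.
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Hypothetical word frequencies, chosen only for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe_merges(toy_counts, num_merges=2)
print(merges)
print(list(corpus))
```

After two merges the frequent suffix "est" has become a single symbol, so "newest" is stored as `('n', 'e', 'w', 'est')`.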



In situations where OOV words are prevalent, **subword tokenization** offers several advantages:
 1) **Reduced OOV rate**: subwords decompose OOV words into known units, significantly reducing the number of true OOV tokens the model encounters;
 2) **Improved representation**: because OOV words are represented by subwords, the model can still extract contextual information and semantic relationships from words it was never explicitly trained on;
 3) **Vocabulary flexibility**: although the subword vocabulary is fixed once it has been learned, new words can be composed from existing subwords, so the model can handle them without explicit vocabulary updates.

Therefore, **subword tokenization** is the preferred choice for scenarios where OOV words are expected, as it effectively handles these rare words while still preserving meaningful information for model training and prediction.
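
To make the OOV argument concrete, here is a minimal sketch of greedy longest-match-first segmentation, the strategy WordPiece-style tokenizers use at inference time. The function name `segment`, the `##` continuation marker convention, and the tiny vocabulary are assumptions chosen for illustration.

```python
def segment(word, subword_vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink until a match is found.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in subword_vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits, so the word stays unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical subword vocabulary; neither test word appears in it as a whole.
subword_vocab = {"straw", "##berry", "##berries", "pie", "un", "eat", "##eat", "##able"}
print(segment("strawberries", subword_vocab))  # ['straw', '##berries']
print(segment("uneatable", subword_vocab))     # ['un', '##eat', '##able']
```

Both words are out of vocabulary as whole words, yet each is covered entirely by known pieces, which is the behaviour the three advantages above rely on.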

### References:
 - https://www.datacamp.com/blog/what-is-tokenization

### Exercise 2. (B) I.


In [2]:
def bag_of_words(sentence, vocabulary):
    # Initialize a vector with zeros for each word in the vocabulary
    bag_of_words_vector = [0] * len(vocabulary)

    # Tokenize the sentence into words
    words = sentence.split()

    # Count the frequency of each word in the sentence
    for word in words:
        if word in vocabulary:
            index = vocabulary.index(word)
            bag_of_words_vector[index] += 1

    return bag_of_words_vector

vocabulary = ['and', 'apple', 'banana', 'eat', 'hate', 'I', 'pie', 'strawberry', 'the', 'they']

input_sentence = "You and I eat the strawberry pie"

result = bag_of_words(input_sentence, vocabulary)

result

[1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

### Exercise 2. (B) II.

In [3]:
vocabulary = ["and", "apple", "banana", "eat", "hate", "I", "pie", "strawberry", "the", "they"]
# lower case all words in the vocabulary
vocabulary = [word.lower() for word in vocabulary]

document_counts = [90, 30, 15, 40, 10, 60, 20, 5, 85, 30]

total_documents = 100

sentence = "You and I eat the strawberry pie"

# Tokenize the sentence into words
words = sentence.lower().split()

# Compute TF-IDF representation
tf_representation = []
idf_representation = []
tfidf_representation = []

for term in vocabulary:
    # Compute TF (Term Frequency)
    tf = round(words.count(term) / len(words) if len(words) > 0 else 0, 6)
    tf_representation.append(tf)

    # Compute IDF (Inverse Document Frequency)
    idf = round(math.log(total_documents / document_counts[vocabulary.index(term)]), 6)
    idf_representation.append(idf)

    # Compute TF-IDF
    tfidf = round(tf * idf, 6)
    # Append TF-IDF value to the representation
    tfidf_representation.append(tfidf)

for word, tf, idf, tfidf in zip(vocabulary, tf_representation, idf_representation, tfidf_representation):
    print(f"Word: {word}, TF: {tf}, IDF: {idf}, TF-IDF: {tfidf}")

print()
print("TF-IDF representation of vocabulary for sentence:")
print(tfidf_representation)

Word: and, TF: 0.142857, IDF: 0.105361, TF-IDF: 0.015052
Word: apple, TF: 0.0, IDF: 1.203973, TF-IDF: 0.0
Word: banana, TF: 0.0, IDF: 1.89712, TF-IDF: 0.0
Word: eat, TF: 0.142857, IDF: 0.916291, TF-IDF: 0.130899
Word: hate, TF: 0.0, IDF: 2.302585, TF-IDF: 0.0
Word: i, TF: 0.142857, IDF: 0.510826, TF-IDF: 0.072975
Word: pie, TF: 0.142857, IDF: 1.609438, TF-IDF: 0.229919
Word: strawberry, TF: 0.142857, IDF: 2.995732, TF-IDF: 0.427961
Word: the, TF: 0.142857, IDF: 0.162519, TF-IDF: 0.023217
Word: they, TF: 0.0, IDF: 1.203973, TF-IDF: 0.0

TF-IDF representation of vocabulary for sentence:
[0.015052, 0.0, 0.0, 0.130899, 0.0, 0.072975, 0.229919, 0.427961, 0.023217, 0.0]
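
As a worked check of one entry, the code above computes TF as the term count divided by the sentence length, and IDF as the natural logarithm (`math.log`) of the total number of documents over the number of documents containing the term. The symbols $N$ and $\mathrm{df}(t)$ are notation introduced here for `total_documents` and `document_counts`. For "strawberry", which occurs once in the 7-word sentence and in 5 of the 100 documents:

$$
\mathrm{tf} = \frac{1}{7} \approx 0.142857, \qquad
\mathrm{idf} = \ln\frac{N}{\mathrm{df}(t)} = \ln\frac{100}{5} \approx 2.995732, \qquad
\mathrm{tf} \cdot \mathrm{idf} \approx 0.427961
$$

This matches the printed value, and it explains why "strawberry" gets the largest weight: it is the rarest vocabulary word across documents, while frequent words such as "and" and "the" are pushed towards zero.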


### References:
 - https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/#:~:text=The%20TF%2DIDF%20of%20a,multiplying%20TF%20and%20IDF%20scores.&text=Translated%20into%20plain%20English%2C%20importance,between%20documents%20measured%20by%20IDF.

### Exercise 2. (C)

In [4]:
class RNNLM(nn.Module):
    """RNN-based language model.

    Args:
        vocab_size: The size of the vocabulary.
        embedding_dim: The dimension of the word embeddings.
        hidden_dim: The dimension of the hidden state of the RNN.
        rnn_type: The type of RNN cell to use ('lstm' or 'gru').
        num_layers: The number of layers of the RNN.
        dropout: The dropout probability.

    Attributes:
        embeddings: The word embeddings layer.
        rnn: The recurrent neural network.
        output_layer: The output layer.
    """

    def __init__(self, vocab_size, embedding_dim, hidden_dim, rnn_type, num_layers, dropout=0.5):
        super().__init__()

        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        rnn_class = nn.LSTM if rnn_type == 'lstm' else nn.GRU

        # batch_first=True matches the (batch_size, seq_len) inputs documented below.
        # PyTorch applies dropout only between stacked layers, so disable it for a single layer.
        self.rnn = rnn_class(embedding_dim, hidden_dim, num_layers, batch_first=True,
                             dropout=dropout if num_layers > 1 else 0.0)
        self.output_layer = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        """Forward pass through the language model.

        Args:
            x: A tensor of input sequences (with shape (batch_size, seq_len)).

        Returns:
            logits: A tensor of logits for the next word in the sequence (with shape (batch_size, seq_len, vocab_size)).
        """

        embeddings = self.embeddings(x)
        # The second return value is h_n for a GRU and (h_n, c_n) for an LSTM; it is not needed here.
        outputs, _ = self.rnn(embeddings)
        logits = self.output_layer(outputs)
        return logits

    @torch.no_grad()
    def generate(self, x, h0, no):
        """Generate text using the greedy decoding algorithm.

        Args:
            x: A tensor of input tokens (with shape (batch_size, 1)).
            h0: The initial state of the RNN: a tensor of shape (num_layers, batch_size, hidden_dim) for a GRU,
                a tuple (h_0, c_0) of such tensors for an LSTM, or None for an all-zero state.
            no: The desired number of tokens to be generated.

        Returns:
            decoded_text: A tensor of decoded text as a sequence of token indices (with shape (batch_size, no)).
        """

        decoded_text = torch.zeros(x.size(0), no, dtype=torch.long, device=x.device)
        output = x
        state = h0

        for i in range(no):
            embeddings = self.embeddings(output)
            # Carry the full recurrent state forward (including the cell state for an LSTM).
            rnn_output, state = self.rnn(embeddings, state)
            logits = self.output_layer(rnn_output)
            # Greedy choice: the most likely token at the single time step, shape (batch_size, 1).
            output = logits.argmax(dim=-1)
            decoded_text[:, i] = output[:, 0]

        return decoded_text
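
Because `RNNLM` can wrap either an LSTM or a GRU, it helps to check what the two PyTorch layers actually return. The sketch below is a standalone shape check with arbitrary sizes chosen for illustration. It shows that the recurrent state is a single tensor for `nn.GRU` but an `(h_n, c_n)` tuple for `nn.LSTM`, and that the state keeps the layout `(num_layers, batch_size, hidden_dim)` even with `batch_first=True`.

```python
import torch
import torch.nn as nn

# Arbitrary sizes, chosen only for this check.
batch_size, seq_len, embedding_dim, hidden_dim, num_layers = 4, 7, 8, 16, 2
inputs = torch.randn(batch_size, seq_len, embedding_dim)

lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, batch_first=True)
gru = nn.GRU(embedding_dim, hidden_dim, num_layers, batch_first=True)

# LSTM returns (output, (h_n, c_n)); GRU returns (output, h_n).
lstm_out, (lstm_h, lstm_c) = lstm(inputs)
gru_out, gru_h = gru(inputs)

print(lstm_out.shape)  # torch.Size([4, 7, 16])
print(lstm_h.shape)    # torch.Size([2, 4, 16])
print(lstm_c.shape)    # torch.Size([2, 4, 16])
print(gru_out.shape)   # torch.Size([4, 7, 16])
print(gru_h.shape)     # torch.Size([2, 4, 16])
```

Code that treats the returned state as an opaque value and passes it straight back into the layer works unchanged for both cell types.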