# Homework 2: Language Modeling
11-411/611 Natural Language Processing (Fall 2024)

- RELEASED: Tuesday, Oct 1, 2024
- DUE: Thursday, October 24 2024 11:59 pm EDT

Whether for transcribing spoken utterances as correct word sequences or generating coherent human-like text, language models are extremely useful.

In this assignment, you will be building your own language models powered by n-grams and RNNs.

### Submission Guidelines
**Programming:** 
- This notebook contains helpful test cases and additional information about the programming part of the HW. However, you are only required to submit `ngram_lm.py` and `rnn_lm.py` on Gradescope.
- We recommended that you first code in the notebook and then copy the corresponding methods/classes to `ngram_lm.py` and `rnn_lm.py`.

**Written:**
- Analysis questions would require you to run your code.
- You need to write your answers in a document and upload it alongside the programming components

### Upload (if using Colab) main.py and utils.py, and the data.zip file

In [None]:
!unzip data.zip

## Part 1: Language Models [60 points]

### Step 0: Preprocessing

In [2]:
import math
import torch
import numpy as np
import torch.nn as nn
from collections import Counter
from torch.utils.data import DataLoader, Dataset

We provide you with a few functions in `utils.py` to read and preprocess your input data. Do not edit this file!

In [3]:
from utils import *

We have performed a round of preprocessing on the datasets.

- Each file contains one sentence per line.
- All punctuation marks have been removed.
- Each line is a sequences of tokens separated by whitespace.

#### Special Symbols ( Already defined in `utils.py` )
The start and end tokens will act as padding to the given sentences, to make sure they are correctly defined, print them here:

In [4]:
print("Sentence START symbol: {}".format(START))
print("Sentence END symbol: {}".format(EOS))
print("Unknown word symbol: {}".format(UNK))

Sentence START symbol: <s>
Sentence END symbol: </s>
Unknown word symbol: <UNK>


#### Reading and processing an example file

In [5]:
# Read the sample file
sample = read_file("data/sample.txt")
print(sample)

['We are never ever ever ever ever getting back together\n', 'We are the ones together we are back']


In [6]:
# Preprocess the content to add corresponding number of start and end tokens. Try out the method with n = 3 and n = 4 as well.
# Preprocessing example for bigrams (n=2)
sample = preprocess(sample, n=3)
for s in sample:
    print(s)

['<s>', '<s>', 'we', 'are', 'never', 'ever', 'ever', 'ever', 'ever', 'getting', 'back', 'together', '</s>']
['<s>', '<s>', 'we', 'are', 'the', 'ones', 'together', 'we', 'are', 'back', '</s>']


In [7]:
# Flattens a nested list into a 1D list.
flattened = flatten(sample)
print(flattened)

['<s>', '<s>', 'we', 'are', 'never', 'ever', 'ever', 'ever', 'ever', 'getting', 'back', 'together', '</s>', '<s>', '<s>', 'we', 'are', 'the', 'ones', 'together', 'we', 'are', 'back', '</s>']


### Step 1: N-Gram Language Model

#### TODO: Defining `get_ngrams()`

In [8]:
#######################################
# TODO: get_ngrams()
#######################################
def get_ngrams(list_of_words, n):
    """
    Returns a list of n-grams for a list of words.
    Args
    ----
    list_of_words: List[str]
        List of already preprocessed and flattened (1D) list of tokens e.g. ["<s>", "hello", "</s>", "<s>", "bye", "</s>"]
    n: int
        n-gram order e.g. 1, 2, 3
    
    Returns:
        n_grams: List[Tuple]
            Returns a list containing n-gram tuples
    """
    n_grams = []
    for i in range(len(list_of_words) - n + 1):
        n_gram = tuple(list_of_words[i:i + n])
        n_grams.append(n_gram)
    return n_grams

In [9]:
#######################################
# TEST: get_ngrams()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=3)
flattened = flatten(sample)

assert get_ngrams(flattened, 3) == [('<s>', '<s>', 'we'),
        ('<s>', 'we', 'are'),
        ('we', 'are', 'never'),
        ('are', 'never', 'ever'),
        ('never', 'ever', 'ever'),
        ('ever', 'ever', 'ever'),
        ('ever', 'ever', 'ever'),
        ('ever', 'ever', 'getting'),
        ('ever', 'getting', 'back'),
        ('getting', 'back', 'together'),
        ('back', 'together', '</s>'),
        ('together', '</s>', '<s>'),
        ('</s>', '<s>', '<s>'),
        ('<s>', '<s>', 'we'),
        ('<s>', 'we', 'are'),
        ('we', 'are', 'the'),
        ('are', 'the', 'ones'),
        ('the', 'ones', 'together'),
        ('ones', 'together', 'we'),
        ('together', 'we', 'are'),
        ('we', 'are', 'back'),
        ('are', 'back', '</s>')]

#### **TODO:** Class `NGramLanguageModel()`

*Now*, we will define our LanguageModel class.

**Some Useful Variables:**
- self.model: `dict` of n-grams and their corresponding probabilities, keys being the tuple containing the n-gram, and the value being the probability of the n-gram.
- self.vocab: `dict` of unigram vocabulary with counts, keys being the words themselves and the values being their frequency.
- self.n: `int` value for n-gram order (e.g. 1, 2, 3).
- self.train_data: `List[List]` containing preprocessed **unflattened** train sentences. You will have to flatten it to use in the language model
- self.smoothing: `float` flag signifying the smoothing parameter.

In `lm.py`, we will be taking most of these argumemts from the command line using this command:

`python3 lm.py --train data/sample.txt --test data/sample.txt --n 3 --smoothing 0 --min_freq 1`

Note that we will not be using log probabilities in this section. Store the probabilities as they are, not in log space.

**Laplace Smoothing**

There are two ways to perform this:
- Either you calculate all possible n-grams at train time and calculate smooth probabilities for all of them, hence inflating the model (eager emoothing). You then use the probabilities as when required at test time. **OR**
- You calculate the probabilities for the **observed n-grams** at train time, using the smoothed likelihood formula, then if any unseen n-gram is observed at test time, you calculate the probability using the smoothed likelihood formula and store it in the model for future use (lazy smoothing).

You will be implementing lazy smoothing

**Perplexity**

Steps:
1. Flatten the test data.
2. Extract ngrams from the flattened data.
3. Calculate perplexity according to given formula. For unseen n-grams, calculate using smoothed likelihood and store the unseen n-gram probability in the labguage model `model` attribute:

$ppl(W_{test}) = ppl(W_1W_2 ... W_n)^{-1/n} $

Tips:
- Remember that product changes to summation under `log`. Take the log of probabilities, sum them up, and then exponentiate it to get back to the original scale.
- Make sure to `flatten()` your data before creating the n_grams using `get_ngrams()`.
- The test suite provided is **not exhaustive**.


In [10]:
import numpy as np

class NGramLanguageModel():
    def __init__(self, n, train_data, alpha=1):
        self.n = n
        self.train_data = train_data
        self.alpha = alpha
        self.tokens = []  # 단어 토큰 리스트
        self.vocab = {}   # 단어 집합 (단어별 등장 횟수)
        self.model = {}   # n-gram 모델 (n-gram -> 확률)
        self.n_grams_counts = {}  # n-gram 빈도수
        self.prefix_counts = {}   # (n-1)-gram 빈도수
        self.build()  # 모델을 초기화할 때 build 함수 호출

    def build(self):
        flattened_data = flatten(self.train_data)
        self.tokens = flattened_data

        # 단어 등장 횟수 계산
        for word in flattened_data:
            self.vocab[word] = self.vocab.get(word, 0) + 1

        # n-grams 및 (n-1)-grams 생성 및 빈도수 계산
        ngrams = get_ngrams(flattened_data, self.n)
        for ngram in ngrams:
            self.n_grams_counts[ngram] = self.n_grams_counts.get(ngram, 0) + 1
            prefix = ngram[:-1]  # n-gram의 (n-1)-gram 부분
            self.prefix_counts[prefix] = self.prefix_counts.get(prefix, 0) + 1

        # 모델 확률 계산
        for ngram in self.n_grams_counts:
            self.model[ngram] = self.get_prob(ngram)

    def get_prob(self, ngram):
        """
        Returns the probability of the n-gram using Laplace Smoothing.
        """
        if self.n == 1:
            return (self.n_grams_counts.get(ngram, 0) + self.alpha) / (len(self.tokens) + self.alpha * len(self.vocab))
        else:
            prefix = ngram[:-1]
            prefix_count = self.prefix_counts.get(prefix, 0)
            ngram_count = self.n_grams_counts.get(ngram, 0)
            
            # 관찰되지 않은 n-gram에 대해 smoothing 적용
            prob = (ngram_count + self.alpha) / (prefix_count + self.alpha * len(self.vocab))
            
            # 확률이 너무 작으면 최소 값을 보정하여 설정
            return max(prob, 1e-3)

    def perplexity(self, test_data):
        """
        Calculates perplexity on the test data.
        """
        flattened_test = flatten(test_data)  # 데이터를 평탄화
        test_ngrams = get_ngrams(flattened_test, self.n)  # 테스트 데이터에서 n-gram 추출
        
        log_prob_sum = 0
        N = len(test_ngrams)  # n-gram의 총 개수
        
        # n-gram이 충분한지 체크 (너무 적으면 퍼플렉서티 값이 왜곡될 수 있음)
        if N == 0:
            raise ValueError("Test data is too small to calculate perplexity.")
        
        for ngram in test_ngrams:
            prob = self.get_prob(ngram)  # 각 n-gram의 확률 가져오기
            
            # 확률이 0인 경우를 방지하고 로그 계산
            if prob > 0:
                log_prob_sum += np.log(prob)
            else:
                log_prob_sum += np.log(1e-5)  # 확률이 0인 경우 최소값으로 보정
        
        # perplexity 계산
        perplexity = np.exp(-log_prob_sum / N)
        return perplexity


In [11]:
#######################################
# TEST: NGramLanguageModel()
#######################################
# For the sake of understanding we will pass alpha as 0 (no smoothing), so that you gain intuition about the probabilities
sample = preprocess(read_file("data/sample.txt"), n=2)
test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=0)

assert test_lm.vocab == Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

assert test_lm.model =={('<s>', 'we'): 1.0,
        ('we', 'are'): 1.0,
        ('are', 'never'): 0.3333333333333333,
        ('never', 'ever'): 1.0,
        ('ever', 'ever'): 0.75,
        ('ever', 'getting'): 0.25,
        ('getting', 'back'): 1.0,
        ('back', 'together'): 0.5,
        ('together', '</s>'): 0.5,
        ('</s>', '<s>'): 1.0,
        ('are', 'the'): 0.3333333333333333,
        ('the', 'ones'): 1.0,
        ('ones', 'together'): 1.0,
        ('together', 'we'): 0.5,
        ('are', 'back'): 0.3333333333333333,
        ('back', '</s>'): 0.5}

In [12]:
#######################################
# TEST smoothing: NGramLanguageModel()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=2)
test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=1)

assert test_lm.vocab == Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

assert test_lm.model =={('<s>', 'we'): 0.23076923076923078,
        ('we', 'are'): 0.2857142857142857,
        ('are', 'never'): 0.14285714285714285,
        ('never', 'ever'): 0.16666666666666666,
        ('ever', 'ever'): 0.26666666666666666,
        ('ever', 'getting'): 0.13333333333333333,
        ('getting', 'back'): 0.16666666666666666,
        ('back', 'together'): 0.15384615384615385,
        ('together', '</s>'): 0.15384615384615385,
        ('</s>', '<s>'): 0.16666666666666666,
        ('are', 'the'): 0.14285714285714285,
        ('the', 'ones'): 0.16666666666666666,
        ('ones', 'together'): 0.16666666666666666,
        ('together', 'we'): 0.15384615384615385,
        ('are', 'back'): 0.14285714285714285,
        ('back', '</s>'): 0.15384615384615385}

In [13]:
#######################################
# TEST unigram: NGramLanguageModel()
#######################################
sample = preprocess(read_file("data/sample.txt"), n=1)
test_lm = NGramLanguageModel(n=1, train_data=sample, alpha=1)

assert test_lm.vocab == Counter({'<s>': 2,
        'we': 3,
        'are': 3,
        'never': 1,
        'ever': 4,
        'getting': 1,
        'back': 2,
        'together': 2,
        '</s>': 2,
        'the': 1,
        'ones': 1})

assert test_lm.model == {('<s>',): 0.09090909090909091,
        ('we',): 0.12121212121212122,
        ('are',): 0.12121212121212122,
        ('never',): 0.06060606060606061,
        ('ever',): 0.15151515151515152,
        ('getting',): 0.06060606060606061,
        ('back',): 0.09090909090909091,
        ('together',): 0.09090909090909091,
        ('</s>',): 0.09090909090909091,
        ('the',): 0.06060606060606061,
        ('ones',): 0.06060606060606061}

In [14]:
#######################################
# TEST: perplexity()
#######################################
test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=0)
test_ppl = test_lm.perplexity(sample)
assert test_ppl < 1.7
assert test_ppl > 0

test_lm = NGramLanguageModel(n=2, train_data=sample, alpha=1)
test_ppl = test_lm.perplexity(sample)
assert test_ppl < 5.0
assert test_ppl > 0

AssertionError: 

In [15]:
# Perplexity 값 출력
test_ppl = test_lm.perplexity(sample)
print("Calculated perplexity:", test_ppl)


Calculated perplexity: 5.283124177782943


### Step 2: RNN Language Model
Recurrent Neural Networks (RNNs) are a class of neural networks designed to handle sequential data. Unlike traditional neural networks, which assume independence among inputs, RNNs utilize their internal state (memory) to process sequences of inputs. This makes them particularly well-suited for tasks where context and order matter.

Before diving into building RNN Language Models using PyTorch, it's essential to have a solid foundation in the following areas:
. We assume you have had a basic understanding of PyTorch and its core concepts, including tensors, autograd, modules (nn.Module), and how to construct simple neural networks using PyTorch. For more comprehensive learning, refer to the [PyTorch official tutorials](https://pytorch.org/tutorials/) and documentation.

#### Preparing the Data
The following Python code is used for loading and processing [GloVe (Global Vectors for Word Representation) embeddings](https://nlp.stanford.edu/projects/glove/). GloVe is an unsupervised learning algorithm for obtaining vector representations for words. These embeddings can be used in various natural language processing and machine learning tasks. You can download the 50d embeddings for this assignment from [Canvas](https://canvas.cmu.edu/courses/39596/files/10855662?module_item_id=5748476).

The `load_glove_embeddings(path)` function is used to load the GloVe embeddings from a file. The function takes a file path as an argument, reads the file line by line, and for each line, it splits the line into words and their corresponding embeddings, and stores them in a dictionary. The dictionary, embeddings_dict, maps words to their corresponding vector representations.

The `create_embedding_matrix(word_to_ix, embeddings_dict, embedding_dim)` function is used to create an embedding matrix from the loaded GloVe embeddings. This function takes a dictionary mapping words to their indices (`word_to_ix`), the dictionary of GloVe embeddings (`embeddings_dict`), and the dimension of the embeddings (`embedding_dim`) as arguments. It creates a zero matrix of size (vocab_size, embedding_dim) and then for each word in  `word_to_ix`, it checks if the word is in `embeddings_dict`. If it is, it assigns the corresponding GloVe vector to the word's index in the embedding matrix. If the word is not in the embeddings_dict, it assigns a random vector to the word's index in the embedding matrix.

The `glove_path` variable is the path to the GloVe file, and `glove_embeddings` is the dictionary of GloVe embeddings loaded using the `load_glove_embeddings` function. The `embedding_dim` variable is the dimension of the embeddings, and `embedding_matrix` is the embedding matrix created using the create_embedding_matrix function.

In [4]:
# Load the data
vocab, word_to_ix, ix_to_word, dataloader = loadfile("data/sample.txt")

In [5]:
def load_glove_embeddings(path):
    embeddings_dict = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = torch.tensor([float(val) for val in values[1:]], dtype=torch.float)
            embeddings_dict[word] = vector
    return embeddings_dict

# Path to the GloVe file
glove_path = 'glove.6B.50d.txt'  # Update this path
glove_embeddings = load_glove_embeddings(glove_path)

def create_embedding_matrix(word_to_ix, embeddings_dict, embedding_dim):
    vocab_size = len(word_to_ix)
    embedding_matrix = torch.zeros((vocab_size, embedding_dim))
    for word, ix in word_to_ix.items():
        if word in embeddings_dict:
            embedding_matrix[ix] = embeddings_dict[word]
        else:
            embedding_matrix[ix] = torch.rand(embedding_dim)  # Random initialization for words not in GloVe
    return embedding_matrix

# Create the embedding matrix
embedding_dim = 50
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)

#### TODO: Defining the RNN Model

In [6]:
#######################################
# TODO: RNNLanguageModel()
#######################################
class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, embedding_matrix):
        super(RNNLanguageModel, self).__init__()
        
        # 임베딩 레이어: 사전 학습된 GloVe 임베딩으로 초기화
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding.weight.data.copy_(embedding_matrix)  # GloVe 임베딩으로 초기화
        self.embedding.weight.requires_grad = False  # GloVe 임베딩 고정
        
        # RNN 레이어: hidden_dim 크기의 RNN 레이어 정의
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        
        # Fully connected 레이어: RNN 출력에서 vocabulary 크기의 출력으로 변환
        self.fc = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, x, hidden=None):
        """
        RNN 모델의 순방향 패스.
        
        Args
        ____
        x: torch.Tensor (batch_size, sequence_length)
        hidden: torch.Tensor (num_layers, batch_size, hidden_dim)
        
        Returns
        -------
        out: torch.Tensor (batch_size, sequence_length, vocab_size)
        hidden: torch.Tensor (num_layers, batch_size, hidden_dim)
        """
        # 1. 입력을 임베딩으로 변환
        x = self.embedding(x)
        
        # 2. RNN 레이어를 통과
        out, hidden = self.rnn(x, hidden)
        
        # 3. RNN 출력값을 Fully connected 레이어를 통해 vocab 크기의 출력으로 변환
        out = self.fc(out)
        
        return out, hidden
    
    def generate_sentence(self, sequence, word_to_ix, ix_to_word, num_words, mode='max'):
        """
        주어진 문장에서 다음 단어들을 예측하는 함수.
        
        Args
        ____
        sequence: str
            입력 문장
        word_to_ix: dict
            단어 -> 인덱스 사전
        ix_to_word: dict
            인덱스 -> 단어 사전
        num_words: int
            예측할 단어의 최대 개수
        mode: str
            'max' 또는 'multinomial'
        
        Returns
        -------
        predicted_sequence: List[str]
            예측된 단어 리스트
        """
        # 입력 문장을 인덱스로 변환
        input_idx = [word_to_ix[word] for word in sequence.split() if word in word_to_ix]
        input_tensor = torch.tensor(input_idx).unsqueeze(0)  # (1, sequence_length) 크기의 텐서로 변환
        
        # 숨겨진 상태 초기화 (배치 크기 1로 설정)
        hidden = None
        
        predicted_sequence = sequence.split()
        
        for _ in range(num_words):
            # 순방향 패스: 단어를 예측
            output, hidden = self.forward(input_tensor, hidden)
            
            # 마지막 단어의 출력에서 다음 단어 예측
            last_word_logits = output[0, -1, :]
            
            if mode == 'max':
                # 확률이 가장 높은 단어 선택
                predicted_idx = torch.argmax(last_word_logits).item()
            elif mode == 'multinomial':
                # 확률 분포에서 단어 샘플링
                probs = torch.softmax(last_word_logits, dim=0)
                predicted_idx = torch.multinomial(probs, 1).item()
            else:
                raise ValueError("Unknown mode: choose 'max' or 'multinomial'")
            
            predicted_word = ix_to_word[predicted_idx]
            predicted_sequence.append(predicted_word)
            
            # 다음 예측을 위해 새로운 입력으로 업데이트
            input_tensor = torch.tensor([predicted_idx]).unsqueeze(0)  # (1, 1)
        
        return predicted_sequence

#### Training the Model
The following code snippet provided is responsible for training the RNN language model. 

In [7]:
#######################################
# TEST: RNNLanguageModel() and training
#######################################
torch.manual_seed(11411)
# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 50
hidden_dim = 32
num_epochs = 10

# Initialize the model, loss function, and optimizer
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

lines = ""
# Training loop
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        RNN.zero_grad()
        output, _ = RNN(inputs)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()
    
    line = f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}, Perplexity: {np.exp(loss.item())}'
    lines += line + "\n"
    print(line)

Epoch 1/10, Loss: 2.6141231060028076, Perplexity: 13.655236931013835
Epoch 2/10, Loss: 2.4388041496276855, Perplexity: 11.459328903039383
Epoch 3/10, Loss: 2.288909673690796, Perplexity: 9.864176644404472
Epoch 4/10, Loss: 2.1654279232025146, Perplexity: 8.718331895227367
Epoch 5/10, Loss: 2.0620663166046143, Perplexity: 7.86219882938823
Epoch 6/10, Loss: 1.972102165222168, Perplexity: 7.185766290151172
Epoch 7/10, Loss: 1.8897345066070557, Perplexity: 6.617611515661019
Epoch 8/10, Loss: 1.8105781078338623, Perplexity: 6.113980951038627
Epoch 9/10, Loss: 1.7322709560394287, Perplexity: 5.653478141611421
Epoch 10/10, Loss: 1.6541872024536133, Perplexity: 5.228828215779611


## Part 2: Written [40 points]. We have given some code for some of the written parts to make it easier for you.

### **Written 4.2** – Song Attribution [8 points]

In [21]:
# Load the datasets
business_data = read_file('data/bbc/business.txt')
sports_data = read_file('data/bbc/sport.txt')

# Preprocess the data
business_tokens = preprocess(business_data, n=2)  # For 2-grams
sports_tokens = preprocess(sports_data, n=2)

# Flatten the token lists for easier n-gram extraction
flat_business = flatten(business_tokens)
flat_sports = flatten(sports_tokens)

# Generate 2-grams and 3-grams
business_2grams = get_ngrams(flat_business, n=2)
business_3grams = get_ngrams(flat_business, n=3)
sports_2grams = get_ngrams(flat_sports, n=2)
sports_3grams = get_ngrams(flat_sports, n=3)

# Count unique n-grams
unique_business_2grams = len(set(business_2grams))
unique_business_3grams = len(set(business_3grams))
unique_sports_2grams = len(set(sports_2grams))
unique_sports_3grams = len(set(sports_3grams))

# Output results
print(f"Unique 2-grams in business dataset: {unique_business_2grams}")
print(f"Unique 3-grams in business dataset: {unique_business_3grams}")
print(f"Unique 2-grams in sports dataset: {unique_sports_2grams}")
print(f"Unique 3-grams in sports dataset: {unique_sports_3grams}")

# Vocabulary size
vocab_size = len(set(flat_business + flat_sports))  # Combine both vocabularies for comparison

# Possible n-grams
possible_2grams = vocab_size ** 2
possible_3grams = vocab_size ** 3

print(f"Possible 2-grams: {possible_2grams}")
print(f"Possible 3-grams: {possible_3grams}")


Unique 2-grams in business dataset: 83819
Unique 3-grams in business dataset: 141220
Unique 2-grams in sports dataset: 77398
Unique 3-grams in sports dataset: 135644
Possible 2-grams: 309126724
Possible 3-grams: 5435066061368


In [23]:
# Example code for Taylor Swift N-Gram LM
n = 3
smoothing = 0.1
min_freq = 1

train = read_file("data/lyrics/taylor_swift.txt")
test = read_file("data/lyrics/test_lyrics.txt")

train = preprocess(train, n)
test = preprocess(test, n)
lm = NGramLanguageModel(n, train, smoothing)

ppl = lm.perplexity(test)
print(ppl)

99.16627687793007


### **Written 4.3.1** –  Intro to Decoding [8 points]

Please take a look at and understand the functions: `best_candidate()`, `top_k_best_candidates()` and `generate_sentences_from_phrase()` in `utils.py`.

In [24]:
n = 3
smoothing = 0.1
min_freq = 1

In [25]:
train = read_file("data/lyrics/taylor_swift.txt")
train = preprocess(train, n)
lm = NGramLanguageModel(n, train, smoothing)

In [26]:
s1 = ("the", "tortured", "poets", "department")

s2 = ("so", "long", "london")

s3 = ("down", "bad")

In [27]:
print(top_k_best_candidates(lm, s1, 5, without=['<s>', '</s>']))

('</s>', 1)


In [22]:
# 주어진 가수별 데이터 파일을 학습
taylor_swift_lyrics = read_file("data/lyrics/taylor_swift.txt")
green_day_lyrics = read_file("data/lyrics/green_day.txt")
ed_sheeran_lyrics = read_file("data/lyrics/ed_sheeran.txt")

# 익명의 테스트 가사 파일을 읽어옴
test_lyrics = read_file("data/lyrics/test_lyrics.txt")

# 각 가수의 n-gram 언어 모델을 생성
n = 3
alpha = 0.1

# Taylor Swift 언어 모델 학습
taylor_model = NGramLanguageModel(n, preprocess(taylor_swift_lyrics, n), alpha)

# Green Day 언어 모델 학습
green_day_model = NGramLanguageModel(n, preprocess(green_day_lyrics, n), alpha)

# Ed Sheeran 언어 모델 학습
ed_sheeran_model = NGramLanguageModel(n, preprocess(ed_sheeran_lyrics, n), alpha)

# 익명의 가사에 대해 perplexity 계산
taylor_perplexity = taylor_model.perplexity(preprocess(test_lyrics, n))
green_day_perplexity = green_day_model.perplexity(preprocess(test_lyrics, n))
ed_sheeran_perplexity = ed_sheeran_model.perplexity(preprocess(test_lyrics, n))

# 결과 출력
print(f"Taylor Swift perplexity: {taylor_perplexity}")
print(f"Green Day perplexity: {green_day_perplexity}")
print(f"Ed Sheeran perplexity: {ed_sheeran_perplexity}")

# 가장 낮은 perplexity를 가지는 가수 출력
if min(taylor_perplexity, green_day_perplexity, ed_sheeran_perplexity) == taylor_perplexity:
    print("The lyricist is most likely Taylor Swift.")
elif min(taylor_perplexity, green_day_perplexity, ed_sheeran_perplexity) == green_day_perplexity:
    print("The lyricist is most likely Green Day.")
else:
    print("The lyricist is most likely Ed Sheeran.")


Taylor Swift perplexity: 99.16627687793007
Green Day perplexity: 235.17079902124422
Ed Sheeran perplexity: 225.217257812529
The lyricist is most likely Taylor Swift.


### **Written 4.3.2** – Text Generation [8 points]

For this subtask, train an RNN LM using `data/taylor_swift.txt`

In this part, we will try the first two approaches to generate sentences.

Q1. Use `predict_next_words()` method to generate sentences after the provided phrases from `s1` to `s3`. Use modes `max` and `multinomial`. Report one of your favorite generations (for any strategy or phrase).

Q2. Which decoding strategy did you like better and why?

In [8]:
s1 = "the tortured poets department"

s2 = "so long, london"

s3 = "down bad"

In [9]:
# <UNK> 토큰 추가 (필요한 경우)
if "<UNK>" not in word_to_ix:
    word_to_ix["<UNK>"] = len(word_to_ix)
    ix_to_word[len(ix_to_word)] = "<UNK>"

def predict_next_words(RNN, phrase, word_to_ix, ix_to_word, num_words=5, mode='max'):
    """
    주어진 문장 뒤에 예측된 단어들을 생성하는 함수.
    
    Args:
        RNN: 학습된 RNN 언어 모델.
        phrase: str, 주어진 문장 (예: 'the tortured poets department').
        word_to_ix: dict, 단어 -> 인덱스 사전.
        ix_to_word: dict, 인덱스 -> 단어 사전.
        num_words: int, 예측할 단어의 개수.
        mode: str, 'max' 또는 'multinomial' 모드.
    
    Returns:
        predicted_sentence: List[str], 예측된 단어들로 이루어진 리스트.
    """
    RNN.eval()  # 예측 모드로 전환
    
    # 입력 문장을 인덱스로 변환
    input_idx = []
    for word in phrase.split():
        if word in word_to_ix:
            input_idx.append(word_to_ix[word])
        else:
            input_idx.append(word_to_ix["<UNK>"])  # 사전에 없는 단어는 <UNK>로 처리

    # 입력 인덱스가 임베딩 크기를 넘지 않도록 확인
    input_idx = [idx if idx < len(word_to_ix) else word_to_ix["<UNK>"] for idx in input_idx]
    
    input_tensor = torch.tensor(input_idx).unsqueeze(0)  # (1, sequence_length)
    
    predicted_sentence = phrase.split()
    hidden = None
    
    for _ in range(num_words):
        output, hidden = RNN(input_tensor, hidden)
        last_word_logits = output[0, -1, :]
        
        if mode == 'max':
            predicted_idx = torch.argmax(last_word_logits).item()
        elif mode == 'multinomial':
            probs = torch.softmax(last_word_logits, dim=0)
            predicted_idx = torch.multinomial(probs, 1).item()
        else:
            raise ValueError("Unknown mode: choose 'max' or 'multinomial'")
        
        # 예측된 인덱스가 범위를 벗어나지 않도록 확인
        predicted_word = ix_to_word.get(predicted_idx, "<UNK>")
        predicted_sentence.append(predicted_word)
        
        # 새로운 입력을 업데이트
        input_tensor = torch.tensor([predicted_idx]).unsqueeze(0)
    
    return predicted_sentence

In [10]:
# Perplexity 계산
perplexity = np.exp(loss.item())
print(f"Final Perplexity: {perplexity}")


Final Perplexity: 5.228828215779611


In [23]:
import torch

def predict_next_words(RNN, phrase, word_to_ix, ix_to_word, num_words=5, mode='max'):
    """
    주어진 문장 뒤에 예측된 단어들을 생성하는 함수.
    
    Args:
        RNN: 학습된 RNN 언어 모델.
        phrase: str, 주어진 문장 (예: 'The Tortured Poets Department').
        word_to_ix: dict, 단어 -> 인덱스 사전.
        ix_to_word: dict, 인덱스 -> 단어 사전.
        num_words: int, 예측할 단어의 개수.
        mode: str, 'max' 또는 'multinomial' 모드.
    
    Returns:
        predicted_sentence: List[str], 예측된 단어들로 이루어진 리스트.
    """
    RNN.eval()  # 예측 모드로 전환
    
    # 입력 문장을 인덱스로 변환
    input_idx = [word_to_ix.get(word, word_to_ix["<UNK>"]) for word in phrase.split()]
    input_tensor = torch.tensor(input_idx).unsqueeze(0)  # (1, sequence_length)
    
    predicted_sentence = phrase.split()
    hidden = None
    
    for _ in range(num_words):
        output, hidden = RNN(input_tensor, hidden)
        last_word_logits = output[0, -1, :]
        
        if mode == 'max':
            predicted_idx = torch.argmax(last_word_logits).item()
        elif mode == 'multinomial':
            probs = torch.softmax(last_word_logits, dim=0)
            predicted_idx = torch.multinomial(probs, 1).item()
        else:
            raise ValueError("Unknown mode: choose 'max' or 'multinomial'")
        
        predicted_word = ix_to_word.get(predicted_idx, "<UNK>")
        predicted_sentence.append(predicted_word)
        
        # 새로운 입력을 업데이트
        input_tensor = torch.tensor([predicted_idx]).unsqueeze(0)
    
    return predicted_sentence

# 예시 트랙 제목들
s1 = "The Tortured Poets Department"
s2 = "So Long, London"
s3 = "Down Bad"

# 각 트랙 제목에 대해 단어 예측 (max 모드와 multinomial 모드 모두 사용)
pred_s1_max = predict_next_words(RNN, s1, word_to_ix, ix_to_word, num_words=5, mode='max')
pred_s2_max = predict_next_words(RNN, s2, word_to_ix, ix_to_word, num_words=5, mode='max')
pred_s3_max = predict_next_words(RNN, s3, word_to_ix, ix_to_word, num_words=5, mode='max')

pred_s1_multi = predict_next_words(RNN, s1, word_to_ix, ix_to_word, num_words=5, mode='multinomial')
pred_s2_multi = predict_next_words(RNN, s2, word_to_ix, ix_to_word, num_words=5, mode='multinomial')
pred_s3_multi = predict_next_words(RNN, s3, word_to_ix, ix_to_word, num_words=5, mode='multinomial')

# 결과 출력
print(f"s1 (max): {' '.join(pred_s1_max)}")
print(f"s2 (max): {' '.join(pred_s2_max)}")
print(f"s3 (max): {' '.join(pred_s3_max)}")

print(f"s1 (multinomial): {' '.join(pred_s1_multi)}")
print(f"s2 (multinomial): {' '.join(pred_s2_multi)}")
print(f"s3 (multinomial): {' '.join(pred_s3_multi)}")

s1 (max): The Tortured Poets Department we are the ever ever
s2 (max): So Long, London we are the ever ever
s3 (max): Down Bad we are the ever ever
s1 (multinomial): The Tortured Poets Department getting are back together together
s2 (multinomial): So Long, London are ever ever ever together
s3 (multinomial): Down Bad together </s> we </s> are


In [31]:
import torch
import torch.nn as nn
import numpy as np

# 시드 설정
torch.manual_seed(11411)

# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 50
hidden_dim = 32
num_epochs = 10

# 데이터 로드
vocab, word_to_ix, ix_to_word, dataloader = loadfile("data/lyrics/taylor_swift.txt")
glove_embeddings = load_glove_embeddings('glove.6B.50d.txt')
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)

# 모델 초기화
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    num_batches = 0
    for inputs, targets in dataloader:
        RNN.zero_grad()
        output, _ = RNN(inputs)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        num_batches += 1

    # Perplexity 계산 (전체 배치에 대한 평균 손실을 기반으로)
    avg_loss = total_loss / num_batches
    perplexity = np.exp(avg_loss)

    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss}, Perplexity: {perplexity}")

print(f"Final Perplexity: {perplexity}")

RuntimeError: The size of tensor a (12) must match the size of tensor b (3601) at non-singleton dimension 0

In [27]:
sentence = s1
predicted_words_sequence = RNN.generate_sentence(sentence, word_to_ix, ix_to_word, 10, mode='multinomial')
print(sentence + ' ' + ' '.join(predicted_words_sequence))

IndexError: index out of range in self

**Aside (for fun!)**: Train your LM on Taylor Swift lyrics and generate the next hit!

### **Written 4.4** – Battle of the LMs: GPT-2, Trigram and RNN [8 points]

For this subtask, you will be generating text and comparing GPT-2 with your n-gram and RNN language models. 

Generative pretrained transformer (GPT) is a neural language model series created by OpenAI. The n-gram language model you trained has on average around 10K-20K parameters (`len(lm.model)`.) Compare that to the 175 billion parameters of GPT-3, which is likely much smaller than more recent iterations (though they don't tell us anymore)!

Let's see how GPT-2 compares to the LMs you trained in Written 4.3.1 on the `data/bbc/tech-small.txt` dataset.

In [28]:
# Calculate your n-gram model's perplexity
test = preprocess(read_file("data/bbc/tech-small.txt"), 3)
NGram = NGramLanguageModel(n=3, train_data=test)
NGram.perplexity(test)

179.0901520291377

In [29]:
# Calculate your RNN model's perplexity
torch.manual_seed(11411)
# Hyperparameters
vocab_size = len(vocab)
embedding_dim = 50
hidden_dim = 32
num_epochs = 10

vocab, word_to_ix, ix_to_word, dataloader = loadfile("data/bbc/tech-small.txt")
glove_embeddings = load_glove_embeddings('glove.6B.50d.txt')
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)

# Initialize the model, loss function, and optimizer
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

lines = ""
# Training loop
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        RNN.zero_grad()
        output, _ = RNN(inputs)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()
    
    line = f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}, Perplexity: {np.exp(loss.item())}'
    lines += line + "\n"
    print(line)

RuntimeError: The size of tensor a (3601) must match the size of tensor b (444) at non-singleton dimension 0

In [24]:
# N-gram 모델을 통해 퍼플렉시티 계산
n = 3  # 3-gram 예시
smoothing = 0.1  # 라플라스 스무딩
train_data = preprocess(read_file('data/lyrics/taylor_swift.txt'), n)
test_data = preprocess(read_file('data/lyrics/test_lyrics.txt'), n)

# N-gram Language Model 생성
ngram_lm = NGramLanguageModel(n, train_data, smoothing)

# 퍼플렉시티 계산
ngram_perplexity = ngram_lm.perplexity(test_data)
print(f"N-gram Model Perplexity: {ngram_perplexity}")


N-gram Model Perplexity: 99.16627687793007


In [29]:
# Make sure the vocab_size is correct
vocab_size = len(word_to_ix)  # Ensure this matches the number of unique words in your dataset

# Now, create the embedding matrix based on the correct vocab_size
embedding_matrix = create_embedding_matrix(word_to_ix, glove_embeddings, embedding_dim)

# RNN 모델 설정 (embedding_matrix shape will now match vocab_size)
RNN = RNNLanguageModel(vocab_size, embedding_dim, hidden_dim, embedding_matrix)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(RNN.parameters(), lr=0.005)

# Training loop
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        RNN.zero_grad()
        output, _ = RNN(inputs)
        loss = criterion(output.view(-1, vocab_size), targets.view(-1))
        loss.backward()
        optimizer.step()
    
    # 퍼플렉시티 계산
    perplexity = np.exp(loss.item())
    print(f"RNN Model Perplexity after epoch {epoch+1}: {perplexity}")

RNN Model Perplexity after epoch 1: 12.871503056341572
RNN Model Perplexity after epoch 2: 11.020702011138756
RNN Model Perplexity after epoch 3: 9.616626600032012
RNN Model Perplexity after epoch 4: 8.551673420981833
RNN Model Perplexity after epoch 5: 7.722583219312739
RNN Model Perplexity after epoch 6: 7.044095047307152
RNN Model Perplexity after epoch 7: 6.460458975520567
RNN Model Perplexity after epoch 8: 5.941673136509541
RNN Model Perplexity after epoch 9: 5.4730746166131805
RNN Model Perplexity after epoch 10: 5.047685709881503


In [26]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import numpy as np

# GPT-2 모델 및 토크나이저 로드
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 모델을 평가 모드로 전환
model.eval()

# 테스트 텍스트 데이터
test_text = "Your test sentence here"
input_ids = tokenizer.encode(test_text, return_tensors='pt')

# GPT-2를 사용한 퍼플렉시티 계산
with torch.no_grad():
    outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss
    perplexity = torch.exp(loss)

print(f"GPT-2 Model Perplexity: {perplexity.item()}")


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

GPT-2 Model Perplexity: 1993.6246337890625


#### Computing GPT-2's perplexity on the test set

You need to enable a GPU runtime from the Colab `Runtime` menu option (you can also use your computer if you have an accelerator). Go to `Runtime` → `Change Runtime Type` → `Hardware Accelerator (GPU)`

In [None]:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch

model_id = "distilgpt2"
model = GPT2LMHeadModel.from_pretrained(model_id)
tokenizer = GPT2TokenizerFast.from_pretrained(model_id)

In [None]:
test = read_file("data/bbc/tech-small.txt")
encodings = tokenizer("\n\n".join(test), return_tensors="pt")

In [None]:
from tqdm import tqdm

max_length = model.config.n_positions
stride = 100

nlls = []
for i in tqdm(range(0, encodings.input_ids.size(1), stride)):
    begin_loc = max(i + stride - max_length, 0)
    end_loc = min(i + stride, encodings.input_ids.size(1))
    trg_len = end_loc - i  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)
        neg_log_likelihood = outputs[0] * trg_len

    nlls.append(neg_log_likelihood)

ppl = torch.exp(torch.stack(nlls).sum() / end_loc)

In [None]:
print("Perplexity using GPT2:", ppl.item())