# Assignment 2: Theses

---

## Task 2) Theses Inspiration

Imagine you had to write another thesis and just couldn't find a good topic to work on.
Well, n-grams to the rescue!
Download the `theses.txt` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This data set consists of approximately 1,000 thesis topics chosen by students in the past.

In this assignment, you will be sampling from n-grams to generate new potential thesis topics.
Pay extra attention to preprocessing: How would you handle hyphenated words and acronyms/abbreviations?

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [207]:
# Dependencies
import numpy as np
import nltk
from nltk import lm
from nltk.lm import preprocessing as prep
from nltk.lm.api import LanguageModel
from functools import reduce
import random
from typing import Optional, Iterator
import re

### Prepare the Data

1.1 Spend some time on pre-processing. How would you handle hyphenated words and abbreviations/acronyms?
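
One possible way to approach question 1.1 is sketched below. It is only an illustration, not a reference solution: the regular expression, the function name `tokenize_title`, and the `split_hyphens` flag are assumptions made for this example. The sketch keeps hyphenated compounds and dotted abbreviations as single tokens (a trailing period is dropped), preserves case so that acronyms stay recognizable, and can optionally split compounds at the hyphen.

```python
import re

# Hypothetical pattern: a word, optionally continued by "-" or "." plus more
# word characters, so "Cloud-basierten" and "z.B" each match as one token.
TOKEN_RE = re.compile(r"\w+(?:[-.]\w+)*")


def tokenize_title(title: str, split_hyphens: bool = False) -> list[str]:
    """Tokenizes one title; `split_hyphens` is an illustrative option."""
    tokens = TOKEN_RE.findall(title)
    if split_hyphens:
        # Break compounds into their parts: "IoT-Plattform" -> "IoT", "Plattform".
        tokens = [part for tok in tokens for part in tok.split("-")]
    return tokens


example = "Entwicklung einer Cloud-basierten IoT-Plattform (z.B. für KMU)"
print(tokenize_title(example))
print(tokenize_title(example, split_hyphens=True))
```

Keeping compounds whole yields more specific but rarer tokens, while splitting them gives the n-gram model more evidence per token at the cost of less natural titles. Which trade-off works better likely depends on the data.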

In [208]:
THESES_PATH = "data/theses.txt"

def load_theses_titles(filepath: str) -> list[str]:
    """Loads all theses titles and returns them as a list."""
    ### YOUR CODE HERE
    
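    # Assumes the file is UTF-8 encoded (the titles contain umlauts).
    # Trailing newlines are stripped and blank lines are skipped.
    with open(filepath, encoding="utf-8") as fp:
        return [line.strip() for line in fp if line.strip()]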
    
    ### END YOUR CODE

In [209]:
def tokenize(text: str) -> Iterator[str]:
    """Tokenizes a single thesis."""
    ### YOUR CODE HERE
    
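    # Keep hyphenated compounds ("Cloud-basierte") and dotted abbreviations
    # ("z.B") as single tokens. Surrounding punctuation such as brackets,
    # quotes, or a trailing period is dropped instead of discarding the word.
    for m in re.finditer(r"[@#]?\w+(?:[-.]\w+)*", text):
        yield m.group(0)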

    ### END YOUR CODE

def preprocess(data: list[str]) -> list[list[str]]:
    """Preprocesses and tokenizes the given theses titles for further use."""
    ### YOUR CODE HERE

    return [list(tokenize(s)) for s in data]
    
    ### END YOUR CODE

In [210]:
theses_data = preprocess(load_theses_titles(THESES_PATH))

### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5]. What about \<s> and \</s>?
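
To make the role of `<s>` and `</s>` concrete, here is a small pure-Python sketch of boundary padding and n-gram extraction. The helper names `pad_title` and `ngrams_of` are hypothetical and exist only for this illustration; NLTK ships its own utilities for this, which the sketch deliberately does not call.

```python
def pad_title(tokens: list[str], n: int) -> list[str]:
    """Adds n - 1 start markers and n - 1 end markers around a tokenized title."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"] * (n - 1)


def ngrams_of(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Returns all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


padded = pad_title(["Cloud", "Automation"], n=2)
print(padded)
# ['<s>', 'Cloud', 'Automation', '</s>']
print(ngrams_of(padded, 2))
# [('<s>', 'Cloud'), ('Cloud', 'Automation'), ('Automation', '</s>')]
```

With padding, the model sees which words tend to open a title (`('<s>', 'Cloud')`) and which tend to close one (`('Automation', '</s>')`). Without it, generation would have no notion of a natural beginning or end.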

In [211]:
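def build_n_gram_models(n: int, data: list[list[str]]) -> LanguageModel:
    """Trains one Laplace-smoothed model on all n-grams of order 1 to n."""
    ### YOUR CODE HERE

    # The pipeline pads every title with <s> and </s> and yields everygrams
    # (unigrams up to n-grams) together with the flattened vocabulary stream.
    trn, voc = prep.padded_everygram_pipeline(n, data)
    model = lm.Laplace(n)
    model.fit(trn, voc)
    return model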
    
    ### END YOUR CODE
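
The model above uses `lm.Laplace`, which applies add-one smoothing. As a rough illustration of what that estimate looks like for bigrams, the following pure-Python sketch computes it by hand. The counts and the vocabulary are invented for this example, and `laplace_score` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

# Toy bigram counts and vocabulary, invented for this example.
bigram_counts = Counter({("Cloud", "Computing"): 3, ("Cloud", "Automation"): 1})
vocab = ["Cloud", "Computing", "Automation", "Analyse", "</s>"]


def laplace_score(word: str, context: str) -> float:
    """Add-one estimate: (count(context, word) + 1) / (count(context) + |V|)."""
    context_total = sum(c for (ctx, _), c in bigram_counts.items() if ctx == context)
    return (bigram_counts[(context, word)] + 1) / (context_total + len(vocab))


for w in vocab:
    print(w, round(laplace_score(w, "Cloud"), 3))
```

Every word receives some probability, even if it never followed the context. With a large vocabulary and only a few observations per context, the added counts can dominate the observed ones, so sampled titles may drift toward random word sequences. A smaller smoothing constant (for example via `lm.Lidstone`) is one option worth comparing.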

In [212]:
n_gram_models = build_n_gram_models(5, theses_data)

### Generate the Titles

3.1 Write a generator that provides thesis titles of desired length. Please do not use the available `lm.generate` method but write your own.

3.2 How can you incorporate seed words?

3.3 How do you handle `</s>` tokens (w.r.t. the desired length)?

3.4 If you didn't just copy what NLTK's `lm.generate` does: compare the outputs.
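
Questions 3.1 to 3.3 can be explored on a toy example before touching the trained model. The sketch below is self-contained and makes several assumptions: the table `NEXT_WORD` holds invented bigram probabilities, `generate_title` is a hypothetical helper, and every seed word is assumed to appear in the table. It starts from optional seed words, samples with a seeded `random.Random` for reproducibility, and stops early when `</s>` is drawn.

```python
import random

# Toy next-word distributions keyed by the previous word (invented values).
NEXT_WORD = {
    "<s>": {"Entwicklung": 0.6, "Analyse": 0.4},
    "Entwicklung": {"einer": 1.0},
    "Analyse": {"von": 1.0},
    "einer": {"Cloud-Plattform": 0.5, "App": 0.5},
    "von": {"Sensordaten": 1.0},
    "Cloud-Plattform": {"</s>": 1.0},
    "App": {"</s>": 1.0},
    "Sensordaten": {"</s>": 1.0},
}


def generate_title(seed_words: list[str], max_len: int, rng: random.Random) -> str:
    """Extends the seed words until </s> is drawn or max_len is reached."""
    title = list(seed_words)
    prev = title[-1] if title else "<s>"
    while len(title) < max_len:
        dist = NEXT_WORD[prev]
        word = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if word == "</s>":
            break  # the model signals that the title is complete
        title.append(word)
        prev = word
    return " ".join(title)


print(generate_title(["Entwicklung"], max_len=5, rng=random.Random(42)))
print(generate_title([], max_len=2, rng=random.Random(42)))
```

Stopping at `</s>` treats the desired length as an upper bound. If titles of an exact length are required, one alternative is to resample whenever `</s>` is drawn too early, at the cost of less natural endings.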

In [213]:
# Notice: If you fix the random seed (e.g. via random.seed or numpy.random.seed), you get reproducible results.

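def sample_next_token(prev: list[str], n_gram_model: LanguageModel) -> Optional[str]:
    """Samples the next token given the previously generated tokens."""
    ### YOUR CODE HERE
    
    n = n_gram_model.order
    # A model of order n conditions on at most the last n - 1 tokens.
    context = prev[-(n - 1):] if n > 1 else []
    # Never emit the start marker or the unknown-word placeholder.
    excluded = {"<s>", n_gram_model.vocab.unk_label}
    candidates = [w for w in n_gram_model.vocab if w not in excluded]
    if not candidates:
        return None
    weights = [n_gram_model.score(w, context) for w in candidates]
    # random.choices renormalizes the weights over the remaining candidates.
    return random.choices(candidates, weights=weights, k=1)[0]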

    ### END YOUR CODE


def generate(n_gram_models: LanguageModel, seed: Optional[str | list[str]], title_length: int) -> str:
    """Generates a thesis title from the model, optional seed word(s), and a maximum title length."""
    ### YOUR CODE HERE
    
    tokens = ["<s>"]
    if isinstance(seed, str):
        tokens.append(seed)
    elif isinstance(seed, list):
        tokens += seed
    # tokens[0] is the start marker and does not count towards the title length.
    while len(tokens) - 1 < title_length:
        token = sample_next_token(tokens, n_gram_models)
        if token is None or token == "</s>":
            break
        tokens.append(token)
    return " ".join(tokens[1:])
    
    ### END YOUR CODE

In [214]:
title_length = 20
seed_word = "Entwicklung"
thesis_title = generate(n_gram_models, seed_word, title_length)
print(thesis_title)

seed_word = "Cloud"
thesis_title = generate(n_gram_models, seed_word, title_length)
print(thesis_title)

Entwicklung ISO Test Anforderungen Kalibrierung Audiokonferenzanwendungen Greifen Plugins Enable Studiengang RDF vor the gilt mobiles development Wissens Anton interaktive Alumni
Cloud Automation Fotorealistische exemplarischer optimierten TeamBank Texten IOTA Anton Prognosegenauigkeit Fraunhofer basierten Simulationskomponente Kunden Optical Valuation Sprachen First Fertigungsanlagen Studie
