# Assignment 2: Theses

---

## Task 2) Theses Inspiration

Imagine you had to write another thesis and just couldn't find a good topic to work on.
Well, n-grams to the rescue!
Download the `theses.txt` data set from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group.
This dataset consists of approx. 1,000 thesis topics chosen by students in the past.

In this assignment, you will be sampling from n-grams to generate new potential thesis topics.
Pay extra attention to preprocessing: How would you handle hyphenated words and acronyms/abbreviations?

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [2]:
import nltk
import pandas as pd
import regex as re
import sklearn as sk

### Prepare the Data

1.1 Spend some time on pre-processing. How would you handle hyphenated words and abbreviations/acronyms?
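
One possible answer to this question is sketched below. It is only an illustration, not the reference solution: the pattern `TOKEN_PATTERN`, the function `tokenize_title`, and the `keep_acronyms` flag are hypothetical names. The sketch keeps hyphenated compounds together, keeps a trailing hyphen as in "Zeit- und ...", drops free-standing dashes, and leaves all-uppercase tokens such as acronyms in their original case.

```python
import re

# Assumption: a token is a run of word characters, optionally joined by
# internal hyphens, optionally ending in a trailing hyphen ("Zeit-").
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*-?")


def tokenize_title(title, keep_acronyms=True):
    """Splits a title into tokens, lowercasing everything except acronyms."""
    tokens = []
    for token in TOKEN_PATTERN.findall(title):
        # Treat all-uppercase tokens of length > 1 as acronyms (a heuristic).
        if keep_acronyms and len(token) > 1 and token.isupper():
            tokens.append(token)
        else:
            tokens.append(token.lower())
    return tokens


print(tokenize_title("EMail am Beispiel SMTP im Internet"))
# ['email', 'am', 'beispiel', 'SMTP', 'im', 'internet']
```

Whether acronyms should stay uppercase is a design choice: it keeps "SMTP" distinct from ordinary words, but it also splits the counts of a word that appears in both spellings.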

In [10]:
def load_theses_titles(filepath):
    """Loads all theses titles and returns them as a list."""
    ### YOUR CODE HERE
    # The file is tab-separated and has no header row.
    data = pd.read_csv(filepath, sep='\t', encoding='utf-8', header=None)
    # The titles are stored in the column with index 3.
    data = data[3].to_list()
    return data
    ### END YOUR CODE

In [11]:
load_theses_titles("./data/theses.tsv")

['EMail am Beispiel SMTP im Internet',
 'Einführung des Configuration Management-Systems PCMS zur strukturierten Versions-Release- und Änderungskontrolle in Projekten der Abteilung Information-Systems der Firma Motorola',
 'Analyse und Leistungsvergleich von zwei Echtzeitsystemen für eingebettete Anwendungen',
 'Erfassung und automatische Zuordnung von Auftragsdaten für ein Dienstleistungsunternehmen mit Hilfe von Standardsoftware - Konzeption und Realisierung',
 'Organisationskonzept zur Administration von Lehrgangsrechnern für eine DV-Fortbildungsinstitution',
 'Monte Carlo-Simulation für ein gekoppeltes Round-Robin-System',
 'Untersuchung der elektrischen Eigenschaften von Supraleitern mit Hilfe eines Gas-Kryosystems',
 'Prioritätsfreies Scheduling in verteilten Echtzeit-Systemen unter Berücksichtigung von Zeit- und Betriebsmittelanforderungen',
 'Implementierung eines Testüberdeckungsgrad-Analysators für RAS',
 'Eine objektorientierte Fuzzy-Entwicklungsumgebung - Entwurf und Implem

In [16]:
def preprocess(data):
    """Preprocesses and tokenizes the given thesis titles for further use."""
    ### YOUR CODE HERE
    titles = []
    for title in data:
        # Disabled: this pattern would also strip umlauts and hyphens.
        # title = re.sub(r'[^a-zA-Z\s]', '', title)
        # Convert to lowercase
        title = title.lower()
        # Tokenize the text
        tokens = nltk.word_tokenize(title)
        # Add start and end tokens
        tokens = ["<s>"] + tokens + ["</s>"]
        titles.append(tokens)

    return titles
    ### END YOUR CODE

In [17]:
theses_data = preprocess(load_theses_titles("./data/theses.tsv"))

### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5]. What about \<s> and \</s>?
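
One common way to treat the boundary symbols is to pad each title with n-1 start symbols and a single end symbol, so that the first real word already has a full-length context. The sketch below illustrates this under the assumption that the titles are token lists without boundary symbols. The helpers `padded_ngrams` and `count_continuations` are hypothetical names, not part of the notebook or of nltk.

```python
from collections import Counter, defaultdict


def padded_ngrams(tokens, n):
    """Yields the n-grams of a title padded with n-1 "<s>" and one "</s>"."""
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])


def count_continuations(titles, n):
    """Maps each context of n-1 tokens to a Counter of the tokens that follow it."""
    counts = defaultdict(Counter)
    for tokens in titles:
        for gram in padded_ngrams(tokens, n):
            counts[gram[:-1]][gram[-1]] += 1
    return counts


# Toy data, invented for illustration only.
toy_titles = [["entwurf", "und", "implementierung"], ["entwurf", "eines", "systems"]]
bigram_counts = count_continuations(toy_titles, 2)
print(bigram_counts[("entwurf",)])
```

Storing counts instead of raw lists keeps the model compact. Sampling a continuation in proportion to its count is equivalent to drawing uniformly from a list that contains duplicates.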

In [18]:
import random
def build_n_gram_models(n, data):
    """
    To predict the first few words of a title, we need the smaller n-grams as
    well, so this method builds all models up to the given n.
    n_gram_models[i] maps every context of i tokens to the list of tokens
    that followed it in the data.
    """
    ### YOUR CODE HERE

    n_gram_models = {}

    for i in range(1, n+1):
        n_gram_model = {}
        for title in data:
            for j in range(len(title)-i):
                # Context of i tokens, followed by the token at position j+i
                context = tuple(title[j:j+i])
                if context not in n_gram_model:
                    n_gram_model[context] = []
                n_gram_model[context].append(title[j+i])

        # Store the model
        n_gram_models[i] = n_gram_model
    return n_gram_models

    ### END YOUR CODE

In [19]:
n_gram_models = build_n_gram_models(5, theses_data)

### Generate the Titles

3.1 Write a generator that provides thesis titles of desired length. Please do not use the available `lm.generate` method but write your own.

3.2 How can you incorporate seed words?

3.3 How do you handle \</s> tokens (w.r.t. the desired length)?

3.4 If you didn't just copy what nltk's `lm.generate` does: compare the outputs.
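
The questions above can be explored with a small sketch like the following. It is one possible design under stated assumptions, not the expected solution: the model layout (a dict keyed by context length, each mapping context tuples to lists of following tokens) mirrors the structure used in this notebook, while `sample_next`, `sample_title`, `min_length`, and `rng_seed` are hypothetical names. Seed words simply form the initial history, the sampler backs off to shorter contexts when a long one is unseen, and `</s>` is excluded as a candidate until the minimum length is reached.

```python
import random


def sample_next(models, history, max_context, rng, allow_end=True):
    """Tries the longest usable context first and backs off to shorter ones."""
    for size in range(min(max_context, len(history)), 0, -1):
        candidates = models.get(size, {}).get(tuple(history[-size:]), [])
        if not allow_end:
            candidates = [w for w in candidates if w != "</s>"]
        if candidates:
            return rng.choice(candidates)
    return None


def sample_title(models, seed_words, max_context, min_length, max_length, rng_seed=None):
    """Extends the seed words until "</s>", a dead end, or max_length is reached."""
    rng = random.Random(rng_seed)  # a fixed seed gives reproducible titles
    title = list(seed_words)
    while len(title) < max_length:
        word = sample_next(models, title, max_context, rng,
                           allow_end=len(title) >= min_length)
        if word is None or word == "</s>":
            break
        title.append(word)
    return title


# Toy model, invented for illustration only.
toy_models = {1: {("entwurf",): ["und", "</s>"],
                  ("und",): ["implementierung"],
                  ("implementierung",): ["</s>"]}}
print(sample_title(toy_models, ["entwurf"], max_context=1, min_length=3, max_length=10))
# ['entwurf', 'und', 'implementierung']
```

A seed word that never occurs in the data yields a title consisting of the seed alone, because no context matches it. Backing off to an unconditioned draw would be one way around that.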

In [24]:
# Notice: If you fix the seed in numpy.random.choice, you get reproducible results.
def get_suggestion(prev, n_gram_model):
    """
    Gets the next random word for the given n_grams.
    The size of the previous tokens must be exactly one less than the n-value
    of the n-gram, or it will not be able to make a prediction.
    """
    ### YOUR CODE HERE
    
    # Check if the previous tokens are in the n-gram model
    if tuple(prev) in n_gram_model:
        # Get the next word
        next_words = n_gram_model[tuple(prev)]
        # Choose a random word from the observed continuations;
        # duplicates in the list make frequent continuations more likely.
        return random.choice(next_words)
    else:
        return None  # No suggestion available for the given previous tokens

    ### END YOUR CODE


def generate(n, n_gram_models, seed, title_length):
    """Generates a thesis title of at most title_length tokens, starting from the seed word."""
    title = [seed]
    while len(title) < title_length:
        # Context: the last n-1 tokens, or fewer while the title is still short.
        prev = title[-(n-1):]
        # n_gram_models is keyed by context length, so the smaller models
        # are used automatically for the first few words.
        next_word = get_suggestion(prev, n_gram_models[len(prev)])
        if next_word is None or next_word == "</s>":
            break
        title.append(next_word)
    return " ".join(title)


In [32]:
title_length = 20
seed_word = "entwicklung"
thesis_title = generate(4, n_gram_models, seed_word, title_length)
print(thesis_title)

seed_word = "saas"
thesis_title = generate(5, n_gram_models, seed_word, title_length)
print(thesis_title)

entwicklung eines plattformunabhängigen 3964r-treibers für windows-nt
saas im kontext der anforderungen an ein erp-system
