<img src="../css/thro.svg" align="right" width="200">
 
# Introduction to AI (PART II) - Natural Language Processing (NLP)

## Lecture 10

---
## Part 2: Language Models

In this notebook we apply n-gram language models. We start with a file of 2307 titles of bachelor's and master's theses from TH Rosenheim and TH Nürnberg.

In [1]:
!head -5 data/theses.txt

The command "head" is either misspelled or
could not be found.


In a first step, we generated the 3-grams from this file using SRILM (http://www.speech.sri.com/projects/srilm/):

<pre>% ngram-count -lm theses.arpa.gz -order 3 -text theses.txt</pre>
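
To make the counting step more concrete, here is a minimal Python sketch of what collecting 1- to 3-grams from a list of titles could look like. It only counts; SRILM additionally smooths the counts and computes backoff weights before writing the ARPA file. The function name `count_ngrams` and the two toy titles are made up for illustration and are not taken from `theses.txt`; the sentence markers `<s>` and `</s>` are the ones used later in this notebook.

```python
from collections import Counter

def count_ngrams(lines, order=3):
    """Count all n-grams up to `order`, padding each line with <s> and </s>."""
    counts = Counter()
    for line in lines:
        tokens = ['<s>'] + line.split() + ['</s>']
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical titles for illustration only
toy_titles = [
    "Analyse von Daten",
    "Analyse von Texten",
]
counts = count_ngrams(toy_titles)
print(counts[('Analyse', 'von')])
print(counts[('<s>', 'Analyse', 'von')])
print(counts[('von', 'Daten', '</s>')])
```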

In [2]:
!head -15 data/theses.arpa

The command "head" is either misspelled or
could not be found.


In [3]:
!tail -10 data/theses.arpa

The command "tail" is either misspelled or
could not be found.


In the following code, we will be using the PyNLPl library to access these 3-grams.
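
Since the `head` and `tail` calls above did not produce any output, the following sketch may help to picture the ARPA text format: one section per n-gram length, and one line per n-gram holding the log10 probability, the tokens, and an optional backoff weight, separated by tabs. The miniature file `TOY_ARPA` and the helper `parse_arpa` are hypothetical and all values are made up; this is not how PyNLPl reads the file, only an illustration that produces a dict in the same shape as the one explored later in this notebook.

```python
import io

# Hypothetical miniature ARPA file; the values are made up for illustration
TOY_ARPA = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-0.5\t<s>\t-0.3
-0.7\tAnalyse\t-0.2
-0.4\t</s>

\\2-grams:
-0.1\t<s> Analyse
-0.6\tAnalyse </s>

\\end\\
"""

def parse_arpa(stream):
    """Return a dict mapping token tuples to (logprob, backoff-or-None)."""
    ngrams = {}
    in_section = False
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith('\\'):
            # only headers such as "\1-grams:" open an n-gram section
            in_section = line.endswith('-grams:')
            continue
        if not in_section:
            continue  # skips the "ngram N=count" header lines
        fields = line.split('\t')
        logprob = float(fields[0])
        words = tuple(fields[1].split())
        backoff = float(fields[2]) if len(fields) > 2 else None
        ngrams[words] = (logprob, backoff)
    return ngrams

toy = parse_arpa(io.StringIO(TOY_ARPA))
print(toy[('<s>', 'Analyse')])
print(toy[('Analyse',)])
```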

#### Setup

In [4]:
# We use pynlpl ('pineapple') - see https://pypi.org/project/PyNLPl/
from pynlpl.lm.lm import ARPALanguageModel



In [5]:
mdl = ARPALanguageModel('data/theses.arpa')

FileNotFoundError: [Errno 2] No such file or directory: 'data/theses.arpa'

#### Explore the language model

In [None]:
# check how long the n-grams are (i.e. what the "n" is)
mdl.order

In [None]:
# let's have a look at the n-grams and their probabilities (in log-scale)
mdl.ngrams._data
# this is a dict with tuples (1 to 3 tokens) as keys and 2-tuples as values, containing the log-prob and (sometimes) a backoff value,
# which we will not be using

In [None]:
# let's print each n-gram together with its probability (in log-scale)
# items() is a dict method returning (key, value) tuples
for x in mdl.ngrams._data.items():
    print(x[0], '-->', x[1][0])
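
The values printed here are base-10 logarithms, so `10 ** logprob` recovers the probability. Although the backoff value is not used in this notebook, the sketch below shows how it is conventionally applied: if an n-gram is missing, the backoff weight of its history is added to the log-probability obtained with a shortened history. The dict `toy_ngrams` and the function `logprob` are hypothetical stand-ins with made-up values, laid out like `mdl.ngrams._data` as described above.

```python
# Hypothetical toy model in the (logprob, backoff) layout; values are made up
toy_ngrams = {
    ('Analyse',): (-0.7, -0.2),
    ('von',): (-0.9, -0.1),
    ('Daten',): (-1.2, None),
    ('Analyse', 'von'): (-0.3, None),
}

def logprob(word, history, ngrams):
    """Log10 probability of `word` after `history`, backing off to shorter histories."""
    history = tuple(history)
    key = history + (word,)
    if key in ngrams:
        return ngrams[key][0]
    if not history:
        raise KeyError(f'unknown word: {word!r}')
    # backoff weight of the history (0.0 in log space if none is stored)
    backoff = ngrams.get(history, (None, None))[1] or 0.0
    return backoff + logprob(word, history[1:], ngrams)

seen = logprob('von', ('Analyse',), toy_ngrams)      # bigram is stored directly
unseen = logprob('Daten', ('Analyse',), toy_ngrams)  # backs off to the unigram
print(seen, 10 ** seen)
print(unseen, 10 ** unseen)
```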

#### Find next tokens

In [None]:
# let's define a function to return the next most probable words for a text
def findnexts(text, mdl, n=0):
    # split the text into tokens
    if isinstance(text, str):
        hist = tuple(text.split())
    else:
        # accept any sequence of tokens; match() compares against tuples
        hist = tuple(text)
    
    # an n-gram of order n has a history of n-1 tokens, so keep at most the last (order - 1) tokens
    if len(hist) >= mdl.order:
        hist = hist[len(hist) - mdl.order + 1:]
    
    def match(x, h):
        if not h:
            return len(x[0]) == 1
        else:
            # history needs to be "one longer" but needs to match
            return len(x[0]) == len(h) + 1 and x[0][:len(h)] == h
    
    cand = list(filter(lambda x: match(x, hist), mdl.ngrams._data.items()))
    
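    # if no cands, shorten history from the left; stop once the history is empty
    # (otherwise a model without matching unigrams would loop forever)
    while not cand and hist:
        hist = hist[1:]
        cand = list(filter(lambda x: match(x, hist), mdl.ngrams._data.items()))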
        
    cand = list(sorted(cand, key=lambda x: x[1][0], reverse=True))
    
    if n > 0:
        return cand[:n]
    else:
        return cand
    

In [None]:
findnexts("", mdl, 20)

In [None]:
findnexts("Design und Implementierung", mdl, 10)

In [None]:
findnexts("und Implementierung", mdl, 10)

In [None]:
findnexts("Analyse", mdl, 10)
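
`findnexts` scans every n-gram on each call, which is fine for a model of this size. For larger models, one possible alternative is to group the n-grams by history once and then look candidates up directly. The sketch below illustrates that idea on a hypothetical toy dict with made-up values; `build_index` and `next_words` are illustrative names, and unlike `findnexts` they return `(word, logprob)` pairs rather than the full n-gram entries.

```python
from collections import defaultdict

def build_index(ngrams):
    """Map each history tuple to its continuations, sorted by descending log-prob."""
    index = defaultdict(list)
    for key, value in ngrams.items():
        index[key[:-1]].append((key[-1], value[0]))
    for continuations in index.values():
        continuations.sort(key=lambda item: item[1], reverse=True)
    return dict(index)

def next_words(history, index, order=3, n=0):
    """Return (word, logprob) candidates, shortening the history until it is known."""
    history = tuple(history)[-(order - 1):] if order > 1 else ()
    while history and history not in index:
        history = history[1:]
    cands = index.get(history, [])
    return cands[:n] if n > 0 else cands

# Hypothetical toy model; values are made up
toy_ngrams = {
    ('Analyse',): (-0.7, -0.2),
    ('Design',): (-0.8, -0.2),
    ('von',): (-0.9, -0.1),
    ('Analyse', 'von'): (-0.3, None),
    ('Analyse', 'und'): (-0.6, None),
}
index = build_index(toy_ngrams)
print(next_words(['Analyse'], index))
print(next_words(['unbekannt'], index, n=2))
```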

#### Interactive thesis title completion

In [None]:
# Interactive completion: enter one word per line, an empty line ends the loop
hist = []
while True:
    a = input().strip()
    if not a:
        break
    hist.append(a)
    print(' '.join(hist) + str(list(map(lambda x: x[0][-1], findnexts(' '.join(hist), mdl)))))
    

#### Automatic thesis title generation

In [None]:
import random

def generate_titles(max_len=20):
    # start from the sentence-start marker; the trailing comma makes this a 1-tuple, not a string
    hist = ('<s>',)
    title = []
    for i in range(max_len):
        cand = findnexts(hist, mdl)
        if not cand:
            break

        cand = random.choice(cand)[0]

        if cand[-1] == '</s>':
            break
            
        title.append(cand[-1])
        hist = cand
    return title

for i in range(8):
    print('*',' '.join(generate_titles()))
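
Note that `random.choice` picks uniformly among all candidates, so the model's probabilities only decide which continuations are possible, not how likely each one is. A possible variation is to sample proportionally to the probabilities by converting the log10 values into weights for `random.choices`. The helper `sample_next` and the toy candidates below are hypothetical; the candidates mimic the `(ngram, (logprob, backoff))` entries returned by `findnexts`.

```python
import random

def sample_next(cands, rng=random):
    """Pick one (ngram, (logprob, backoff)) candidate proportionally to its probability."""
    weights = [10 ** value[0] for _, value in cands]
    return rng.choices(cands, weights=weights, k=1)[0]

# Hypothetical candidates; values are made up
toy_cands = [
    (('Analyse', 'von'), (-0.3, None)),
    (('Analyse', 'und'), (-1.3, None)),
]
rng = random.Random(0)  # seeded generator for reproducible draws
picks = [sample_next(toy_cands, rng)[0][-1] for _ in range(1000)]
print(picks.count('von'), picks.count('und'))
```

In `generate_titles`, replacing `random.choice(cand)` with such a weighted draw would favour frequent continuations while still producing varied titles.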

In [None]:
# --- EOF ---