<a href="https://colab.research.google.com/github/KatharinaGardens/computational-linguistics.github.io/blob/Week-8/LELA32052_Week_8_Seminar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LELA32052 Computational Linguistics Week 8

This week we will start by looking at the challenge of machine translation (MT).

We'll look at German-to-English MT. Here is a set of sentences, where s stands for source and t for target. Hopefully the translations will be fairly transparent to you. The only thing that might not be obvious is the use of "ja". It means "yes" in some contexts but is also used to mean something like "certainly". So "das haus ist ja groß" could be translated as "the house is certainly big", but because "ja" and "certainly" are not a perfect match, "ja" tends to be omitted in the English translation, as it is here.

In [1]:
s1='klein ist das haus '
t1='the house is small '
s2='das haus ist ja groß '
t2='the house is big '
s3='ja das buch ist klein '
t3='yes the book is small '
s4='das haus '
t4='the house '
s5='ein buch '
t5='a book '

We are going to use the now very familiar re.sub function to perform translation first.
The g2e function takes German as input and should output English.

It performs the translation using a series of re.sub calls.

First let's take a really naive approach.

In [2]:
import re

In [3]:
def g2e(out):
    # Python 3 str patterns are Unicode-aware by default, so ß needs no special flag
    out=re.sub("klein ","small ",out)
    out=re.sub("ist ","is ",out)
    out=re.sub("das ","the ",out)
    out=re.sub("haus ","house ",out)
    out=re.sub("groß ","big ",out)
    out=re.sub("buch ","book ",out)
    out=re.sub("ein ","a ",out)
    out=re.sub("ja ","yes ",out)

    return out


In [4]:
print(g2e(s1) + "\n" + g2e(s2)  + "\n" + g2e(s3)  + "\n" + g2e(s4)  + "\n" + g2e(s5))


small is the house 
the house is yes big 
yes the book is small 
the house 
a book 


That didn't work well. Your job is to change the rules so that the function returns the correct translation.
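One thing to watch when changing the rules is that each pattern matches anywhere in the string, including inside longer words. The sketch below is only an illustration (the variable names are made up, and the function above happens to avoid the problem because its "klein " rule runs first). It shows what an "ein " rule would do to a sentence that still contains "klein", and how a `\b` word boundary avoids that.

```python
import re

sentence = "klein ist das haus "

# Without a boundary, "ein " also matches the tail of "klein "
unanchored = re.sub("ein ", "a ", sentence)
print(repr(unanchored))   # 'kla ist das haus '

# \b only matches at the edge of a word, so "klein" is left alone
anchored = re.sub(r"\bein ", "a ", sentence)
print(repr(anchored))     # 'klein ist das haus '

# The rule still fires on the real article
article = re.sub(r"\bein ", "a ", "ein buch ")
print(repr(article))      # 'a buch '
```

Ordering the rules carefully and anchoring them with `\b` both work here; which to use is a matter of taste.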

To make your job easier I have marked the part of speech of each word using the following tags, based on what an automatic part-of-speech tagger would do (we'll look at these and how they work next week).

ADJ : adjective
AUX : auxiliary verb
ART : article/determiner
N : noun
ADV : adverb

You can make use of the tags by matching them and their associated words like this:

[^ ]+_ART

so if you wrote

re.sub("([^ ]+)_ART","\\1",out)

then it would return an article without its tag.
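As a quick check of how the captured group behaves, here is a minimal sketch on a two-word example. The variable names are just for illustration.

```python
import re

tagged = "das_ART haus_N"

# ([^ ]+) captures the word; _ART is matched but not captured, so it is dropped
untagged_article = re.sub("([^ ]+)_ART", "\\1", tagged)
print(untagged_article)   # das haus_N

# Captured groups can also be put back in a different order
swapped = re.sub("([^ ]+)_ART ([^ ]+)_N", "\\2_N \\1_ART", tagged)
print(swapped)            # haus_N das_ART
```

The second call is the kind of reordering rule you will need for the sentences below.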

In [5]:
s1='klein_ADJ ist_AUX das_ART haus_N'
t1='the house is small'
s2='das_ART haus_N ist_AUX ja_ADV groß_ADJ '
t2='the house is big '
s3='ja_ADV das_ART buch_N ist_AUX klein_ADJ'
t3='yes the book is small '
s4='das_ART haus_N'
t4='the house '
s5='ein_ART buch_N'
t5='a book '

In [24]:
def g2e(out):
    out=re.sub('klein_','small_',out)
    out=re.sub('ist_','is_',out)
    out=re.sub('das_','the_',out)
    out=re.sub('haus_','house_',out)
    out=re.sub('ja_','yes_',out)
    out=re.sub('groß_','big_',out)
    out=re.sub('buch_','book_',out)
    out=re.sub('ein_','a_',out)
    out=re.sub('([^ ]+_ADJ)( is_AUX )(.+)', '\\3\\2\\1',out) # moves a sentence-initial adjective after "is" to give English predicate order
    out=re.sub('(yes_ADV) ([^ ]+_ADJ)', '\\2',out) # drops "ja" directly before an adjective (the "ja groß" case); elsewhere it stays as "yes"


    out = re.sub("_[^ ]+","",out)
    return out


In [25]:
print(g2e(s1) + "\n" + g2e(s2)  + "\n" + g2e(s3)  + "\n" + g2e(s4)  + "\n" + g2e(s5))

the house is small
the house is big 
yes the book is small
the house
a book
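To make the reordering rule easier to follow, the sketch below traces it on the first sentence after the vocabulary rules have run. The intermediate variable names are only for illustration.

```python
import re

step0 = "small_ADJ is_AUX the_ART house_N"

# Group 1 = the adjective, group 2 = " is_AUX ", group 3 = the rest of the sentence
step1 = re.sub('([^ ]+_ADJ)( is_AUX )(.+)', '\\3\\2\\1', step0)
print(step1)   # the_ART house_N is_AUX small_ADJ

# Finally strip every tag
step2 = re.sub("_[^ ]+", "", step1)
print(step2)   # the house is small
```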


### Another sentence set to explore

Update the function below to translate these sentence pairs using as few rules as possible.

In [54]:
s1="der_ART mann_N hat_AUX fußball_N gespielt_V"
t1="the man played football"
s2="der_ART mann_N spielt_V fußball_N"
t2="the man plays football"
s3="der_ART mann_N hat_AUX kartoffeln_N gekocht_V"
t3="the man cooked potatoes"
s4="der_ART mann_N kocht_V kartoffeln_N"
t4="the man cooks potatoes"

In [66]:
def g2e(out):
    # vocabulary rules
    out=re.sub('der_','the_',out)
    out=re.sub('mann_','man_',out)
    out=re.sub(' hat_AUX','',out) # drop the auxiliary hat; the participle is translated as a simple past
    out=re.sub('fußball_','football_',out)
    out=re.sub('gespielt_', 'played_', out) # must run before the spielt_ rule, which would otherwise match inside gespielt_
    out=re.sub('spielt_','plays_',out)
    out=re.sub('gekocht_', 'cooked_', out) # likewise must run before the kocht_ rule
    out=re.sub('kocht_','cooks_',out)
    out=re.sub('kartoffeln_','potatoes_',out)
    out=re.sub('gerne','likes',out) # gerne becomes likes and keeps its ADV tag for the rules below

    # grammatical rules
    out=re.sub('([^ ]+_N) ([^ ]+_N) ([^ ]+_V)', '\\1 \\3 \\2',out) # subject object verb -> subject verb object
    out=re.sub('([^ ]+)s_V ([^ ]+_ADV) ([^ ]+_N)','\\2 \\1ing_V \\3',out) # plays likes football -> likes playing football
    out=re.sub('([^ ]+)s_ADV ([^ ]+_N) ([^ ]+)ed_V','\\1d to_ART \\3_V \\2',out) # likes football played -> liked to play football

    #removing tags
    out = re.sub("_[^ ]+","",out)
    return out

In [67]:
print(g2e(s1) + "\n" + g2e(s2)  + "\n" + g2e(s3)  + "\n" + g2e(s4)  + "\n")

the man played football
the man plays football
the man cooked potatoes
the man cooks potatoes
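Rather than comparing the printed output with the targets by eye, you could check the pairs automatically. The helper below is a hypothetical sketch (`check` and `toy_g2e` are not part of the seminar code). It ignores leading and trailing spaces and returns every pair that does not match.

```python
import re

def check(translate, pairs):
    """Return (source, expected, actual) for every pair that is mistranslated."""
    failures = []
    for source, expected in pairs:
        actual = translate(source).strip()
        if actual != expected.strip():
            failures.append((source, expected.strip(), actual))
    return failures

def toy_g2e(text):
    """Stand-in translator that only knows two vocabulary rules."""
    text = re.sub("das_", "the_", text)
    text = re.sub("haus_", "house_", text)
    return re.sub("_[^ ]+", "", text)

pairs = [("das_ART haus_N", "the house "), ("ein_ART buch_N", "a book ")]
failures = check(toy_g2e, pairs)
print(failures)   # [('ein_ART buch_N', 'a book', 'ein buch')]
```

An empty list means every pair was translated as expected.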



And if you are really feeling brave, try accounting for these too:

In [45]:
s5="der_ART mann_N spielt_V gerne_ADV fußball_N"
t5="the man likes playing football"
s6="der_ART mann_N hat_AUX gerne_ADV fußball_N gespielt_V"
t6="the man liked to play football"

In [68]:
print(g2e(s5) + "\n" + g2e(s6)  + "\n" )

the man likes playing football
the man liked to play football



## Statistical machine translation

We will look next at statistical machine translation. NLTK has some built-in tools for this that we can make use of.

To make sure we have the latest version of NLTK, let's install it and then restart the runtime.

In [69]:
!pip install --user -U nltk



In [70]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
import math
from nltk import AlignedSent
from nltk import IBMModel3

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


### Build a translation table

We start by performing alignment and building a translation table. The second argument to IBMModel3 is the number of training iterations.

In [71]:
s1='klein ist das haus'
t1='the house is small'
s2='das haus ist ja groß'
t2='the house is big'
s3='das buch ist klein'
t3='the book is small'
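s4='das buch'
t4='the book'
# Note: the next two lines reuse the names s4/t4, so the 'das buch' pair above is overwritten
# and never added to the corpus; 'das house' is probably meant to be 'das haus'.
s4='das house'
t4='the house'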

In [72]:
parallel_corpus = []
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s1),nltk.word_tokenize(t1)))
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s2),nltk.word_tokenize(t2)))
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s3),nltk.word_tokenize(t3)))
parallel_corpus.append(AlignedSent(nltk.word_tokenize(s4),nltk.word_tokenize(t4)))

In [73]:
ibm3 = IBMModel3(parallel_corpus, 50)

In [74]:
ibm3.translation_table['haus']

defaultdict(<function nltk.translate.ibm_model.IBMModel.reset_probabilities.<locals>.<lambda>.<locals>.<lambda>()>,
            {'house': 0.6666666655073329,
             'is': 3.280000051417293e-10,
             'the': 2.460000038439765e-10,
             'small': 2.8800000428320153e-10,
             None: 0.1652218257503104,
             'big': 2.700000047108661e-10})
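To get a feel for where probabilities like these come from, here is a minimal sketch of the expectation-maximisation idea behind the IBM models. It is an assumption-laden simplification: it implements only IBM Model 1 in plain Python (no NULL word, no fertility or distortion, all of which IBMModel3 adds), and the toy corpus and the names `train_model1` and `table` are made up for illustration.

```python
from collections import defaultdict

# Toy parallel corpus: (German tokens, English tokens)
pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

def train_model1(pairs, iterations=20):
    """Estimate P(german word | english word) with IBM Model 1 EM."""
    german_vocab = {g for de, _ in pairs for g in de}
    # Start from a uniform distribution over German words
    table = defaultdict(lambda: 1.0 / len(german_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of each (german, english) link
        total = defaultdict(float)   # expected count of each english word
        # E-step: share each German word among the English words in its sentence
        for de, en in pairs:
            for g in de:
                z = sum(table[(g, e)] for e in en)
                for e in en:
                    c = table[(g, e)] / z
                    count[(g, e)] += c
                    total[e] += c
        # M-step: re-estimate the probabilities from the expected counts
        for (g, e), c in count.items():
            table[(g, e)] = c / total[e]
    return table

table = train_model1(pairs)
print(round(table[("haus", "house")], 3))
print(round(table[("haus", "the")], 3))
```

Because "das" also occurs with "book", the model learns that "the" explains "das", which leaves "house" as the best explanation for "haus".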

You can download and train on a larger aligned corpus by running this code (but beware, it will take quite a while):

import nltk <br>
from nltk.corpus import comtrans <br>
nltk.download('comtrans') <br>
ende=comtrans.aligned_sents('alignment-de-en.txt') <br>
ende_subset = ende[1:100] <br>
ibm3 = IBMModel3(ende_subset, 2) <br>

In [75]:
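phrase_table = nltk.translate.PhraseTable()
# translation_table maps each German word to a dict of {English word: probability}
for german_word, english_probs in ibm3.translation_table.items():
    for english_word, prob in english_probs.items():
        # phrases are stored as tuples and scores as natural-log probabilities
        phrase_table.add((german_word,), (english_word,), math.log(prob))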


In [76]:
phrase_table.translations_for(('ist',))

[PhraseTableEntry(trg_phrase=('is',), log_prob=-1.5040005918661971e-09),
 PhraseTableEntry(trg_phrase=(None,), log_prob=-1.4675276464888576),
 PhraseTableEntry(trg_phrase=('small',), log_prob=-21.457234998263182),
 PhraseTableEntry(trg_phrase=('book',), log_prob=-21.680378551304575),
 PhraseTableEntry(trg_phrase=('the',), log_prob=-21.796210364274806),
 PhraseTableEntry(trg_phrase=('house',), log_prob=-21.83800749195073),
 PhraseTableEntry(trg_phrase=('big',), log_prob=-22.03259913948252)]
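The scores in the phrase table are natural logarithms of probabilities, so values close to 0 mean "almost certain" and large negative values mean "almost impossible". The sketch below converts three of the values shown above back into probabilities. The dictionary is copied by hand purely for illustration.

```python
import math

# Log probabilities copied from the 'ist' entries above
log_probs = {"is": -1.5040005918661971e-09,
             None: -1.4675276464888576,
             "small": -21.457234998263182}

# math.exp undoes the natural log used when the table was filled
probs = {word: math.exp(lp) for word, lp in log_probs.items()}
for word, p in probs.items():
    print(word, p)

# Adding log probabilities corresponds to multiplying probabilities
combined = math.exp(log_probs["is"] + log_probs[None])
print(math.isclose(combined, probs["is"] * probs[None]))   # True
```

Working with sums of logs avoids the underflow you would get from multiplying many very small probabilities.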

### Build a probabilistic language model

We will use the collected works of Jane Austen here, but in real systems you would want to use a larger and more representative corpus.

In [77]:
!wget https://www.gutenberg.org/files/31100/31100.txt
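with open('31100.txt',"r",encoding='windows-1252') as f:
    text = f.read()
# Append the target sentences so that the model has seen their n-grams
text = text + "\n" + t1 + "\n" + t2 + "\n" + t3 + "\n" + t4 + "\n"
# Split into sentences, then into lower-cased word tokens
tokenized_text = [list(map(str.lower, nltk.word_tokenize(sent)))
                  for sent in nltk.sent_tokenize(text)]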

--2025-03-17 17:15:28--  https://www.gutenberg.org/files/31100/31100.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4454075 (4.2M) [text/plain]
Saving to: ‘31100.txt’


2025-03-17 17:15:29 (9.17 MB/s) - ‘31100.txt’ saved [4454075/4454075]



In [78]:
import nltk.lm.preprocessing
n = 3
train_data, padded_sents = nltk.lm.preprocessing.padded_everygram_pipeline(n, tokenized_text)

In [79]:
from nltk.lm import MLE
model = MLE(n)

In [80]:
model.fit(train_data, padded_sents)


In [85]:
model.generate(8)

['that', 'moment', 'i', 'can', 'be', 'any', 'longer', '.']
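The maximum likelihood estimate behind a trigram model is just a ratio of counts: how often a word follows a two-word context, divided by how often that context occurs. The sketch below computes it by hand on three toy sentences. It is a plain-Python illustration with made-up names, not the NLTK implementation.

```python
from collections import Counter
import math

sentences = [["the", "house", "is", "small"],
             ["the", "house", "is", "big"],
             ["the", "book", "is", "small"]]

trigram_counts = Counter()
context_counts = Counter()
for sent in sentences:
    # Pad with two boundary symbols on each side, as is usual for trigram models
    padded = ["<s>", "<s>"] + sent + ["</s>", "</s>"]
    for i in range(len(padded) - 2):
        trigram_counts[tuple(padded[i:i + 3])] += 1
        context_counts[tuple(padded[i:i + 2])] += 1

def mle_prob(word, context):
    """P(word | context) = count(context + word) / count(context)."""
    seen = context_counts[tuple(context)]
    return trigram_counts[tuple(context) + (word,)] / seen if seen else 0.0

p_small = mle_prob("small", ["house", "is"])
print(p_small)              # 0.5
print(math.log2(p_small))   # -1.0
print(mle_prob("big", ["book", "is"]))   # 0.0
```

The last line shows the main weakness of an unsmoothed MLE model: anything not seen in training gets probability zero.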

In [86]:
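from collections import defaultdict
# Any key without an explicit entry falls back to a log score of -99
language_prob = defaultdict(lambda: -99.0)
for t in nltk.ngrams(nltk.word_tokenize(t1 + " " + t2 + " " + t3 + " " + t4),3):
    score = model.logscore(t[2],[t[0],t[1]])  # log2 P(t[2] | t[0], t[1])
    if score < 0:
        language_prob[t] = score
    else:
        language_prob[t] = -999  # a non-negative score (probability 1) is replaced by a heavy penalty
# The decoder only needs an object with probability and probability_change methods
language_model = type('',(object,),{'probability_change': lambda self, context, phrase: language_prob[phrase], 'probability': lambda self, phrase: language_prob[phrase]})()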

In [87]:
language_prob.items()

dict_items([(('the', 'house', 'is'), -5.9734582125475075), (('house', 'is', 'small'), -2.8073549220576046), (('is', 'small', 'the'), -999), (('small', 'the', 'house'), -999), (('house', 'is', 'big'), -2.8073549220576046), (('is', 'big', 'the'), -999), (('big', 'the', 'book'), -999), (('the', 'book', 'is'), -4.643856189774724), (('book', 'is', 'small'), -999)])
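The `type(...)` one-liner used to build `language_model` is compact but hard to read. An explicit class with the same two methods might look like the sketch below. The class name and the small stand-in table are hypothetical; only the method names come from the code above.

```python
from collections import defaultdict

class LookupLanguageModel:
    """Explicit version of the anonymous class: both methods look the phrase up."""

    def __init__(self, log_probs):
        self.log_probs = log_probs

    def probability_change(self, context, phrase):
        # The context is ignored, exactly as in the lambda version
        return self.log_probs[phrase]

    def probability(self, phrase):
        return self.log_probs[phrase]

# Small stand-in table; the value is taken from the output above
toy_prob = defaultdict(lambda: -99.0)
toy_prob[("the", "house", "is")] = -5.9734582125475075
toy_model = LookupLanguageModel(toy_prob)

known = toy_model.probability(("the", "house", "is"))
unknown = toy_model.probability(("house",))
print(known)     # -5.9734582125475075
print(unknown)   # -99.0
```

Note that the lookup is by the exact tuple, so a one-word phrase such as `('house',)` falls back to the default score unless it has an entry of its own.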

### Combine with translation model to perform decoding


In [88]:
stack_decoder = nltk.translate.StackDecoder(phrase_table, language_model)

In [89]:
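stack_decoder.distortion_factor = 1  # value between 0 and 1; lower values favour keeping the source word order
stack_decoder.word_penalty = 0  # 0 means translation length is neither rewarded nor penalised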

In [90]:
stack_decoder.translate(nltk.word_tokenize("das haus ist groß"))

['the', 'is', 'big', 'house']

In [91]:
stack_decoder.translate(nltk.word_tokenize("klein ist das haus"))

['small', 'the', 'is', 'house']
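To see how a translation model and a language model are meant to work together, here is a deliberately tiny sketch. It is not the stack decoder: it tries every ordering of the word-by-word translations and keeps the one with the best combined log score. All probabilities and names in it are hypothetical, chosen only for illustration.

```python
import itertools
import math

# Hypothetical word translations with log probabilities
translation = {"das": ("the", math.log(0.9)),
               "haus": ("house", math.log(0.8)),
               "ist": ("is", math.log(0.9)),
               "groß": ("big", math.log(0.8))}

# Hypothetical bigram language model: these bigrams are "seen", all others are not
seen_bigrams = {("<s>", "the"), ("the", "house"), ("house", "is"),
                ("is", "big"), ("big", "</s>")}

def lm_score(words):
    """Sum bigram log scores; unseen bigrams get a large penalty."""
    padded = ["<s>"] + list(words) + ["</s>"]
    return sum(math.log(0.5) if pair in seen_bigrams else -10.0
               for pair in zip(padded, padded[1:]))

def decode(source_words):
    """Score every ordering of the translated words and return the best one."""
    targets = [translation[w][0] for w in source_words]
    tm_score = sum(translation[w][1] for w in source_words)
    return max(itertools.permutations(targets),
               key=lambda order: tm_score + lm_score(order))

best = decode("das haus ist groß".split())
print(best)   # ('the', 'house', 'is', 'big')
```

Exhaustive search like this is only feasible for very short sentences, which is why real decoders prune the search space instead. In this sketch the language model sees the whole candidate sentence, so it can reward fluent word order.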