# <u>Chapter 6</u>: Teaching Machines to Translate

A serious impediment to spreading new information, ideas, and knowledge is the barrier imposed by the different languages spoken worldwide. Despite the cultural richness they bring to our global heritage, these languages can pose significant hurdles to efficient human communication. This exercise focuses on `machine translation` (MT), which aims to alleviate these barriers. MT is the process of automatically converting a piece of text from a source language into a target language without human intervention.

In [86]:
import sys
import subprocess
import pkg_resources

# Find out which packages are missing.
installed_packages = {dist.key for dist in pkg_resources.working_set}
required_packages = {'nltk'}
missing_packages = required_packages - installed_packages

# If there are missing packages install them.
if missing_packages:
    print('Installing the following packages: ' + str(missing_packages))
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing_packages], stdout=subprocess.DEVNULL)

Now, download the datasets.

In [None]:
import os

# Check if the data directory already exists.
if not os.path.exists("data"):
    # URL of the zip data file to download.
    url = "https://github.com/PacktPublishing/Machine-Learning-Techniques-for-Text/raw/main/chapter-06/data.zip"

    # If it doesn't exist, download the zip file.
    !wget {url}

    # Unzip the file into the "data" folder.
    !unzip -q "data.zip"

## Rule-based machine translation

We will begin our journey of MT with the classical approach, known as `rule-based machine translation` (RBMT), which aims to exploit linguistic information about the source and target languages. RBMT techniques fall under the broad category of `knowledge-based systems`, which mainly aim to capture the knowledge of human experts to solve complex problems.

First, we will incorporate _nltk_ to perform `POS tagging`.

In [87]:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Tokenize the input text.
text = nltk.word_tokenize("The sky is blue")

# Parse the input.
nltk.pos_tag(text)

[('The', 'DT'), ('sky', 'NN'), ('is', 'VBZ'), ('blue', 'JJ')]

The following code creates a _CFG_ (named _analysis_grammar_) that consists of six rules signified with the _->_ symbol. Then we parse the input phrase using the analysis grammar.

In [88]:
# Create the grammar that consists of six rules. 
# S:sentence, NP:noun phrase, DT:determiner, NN:noun, 
# VBZ:verb in the third person singular, JJ:adjective.
analysis_grammar = nltk.CFG.fromstring("""
    S -> NP VBZ JJ	
    NP -> DT NN	
    DT -> 'The'	
    NN -> 'sky'	
    VBZ -> 'is'	
    JJ -> 'blue'
    """)
 	
# Create the input.
input = ['The', 'sky', 'is', 'blue']

# Parse the input.
parser = nltk.ChartParser(analysis_grammar)

# Print the parse trees.
for tree in parser.parse(input):
    print(tree)
    #tree.draw()


(S (NP (DT The) (NN sky)) (VBZ is) (JJ blue))


The following code shows the extended grammar.

In [89]:
# The grammar still consists of six rules, but they are more powerful.
analysis_grammar = nltk.CFG.fromstring("""
    S -> NP VBZ JJ	
    NP -> DT NN	
    DT -> 'The' | 'the'	
    NN -> 'sky' | 'sea'	
    VBZ -> 'is'	
    JJ -> 'blue' | 'red'
    """)

To verify that it supports all the necessary phrases, let’s generate up to 10 (_n=10_) possible expansions.

In [90]:
from nltk.parse.generate import generate

# Generate ten examples at most.
for sentence in generate(analysis_grammar, n=10):
    print(' '.join(sentence))

The sky is blue
The sky is red
The sea is blue
The sea is red
the sky is blue
the sky is red
the sea is blue
the sea is red
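
Only eight sentences are printed even though _n=10_ was requested, because the grammar has just 2 × 2 × 1 × 2 = 8 expansions. The following is a minimal pure-Python sketch of that count, independent of _nltk_; the `lexicon` and `order` names are illustrative assumptions and not part of the notebook.

```python
from itertools import product

# Terminal alternatives for each pre-terminal of the extended grammar.
lexicon = {
    'DT': ['The', 'the'],
    'NN': ['sky', 'sea'],
    'VBZ': ['is'],
    'JJ': ['blue', 'red'],
}

# S -> NP VBZ JJ and NP -> DT NN flatten to the sequence DT NN VBZ JJ.
order = ['DT', 'NN', 'VBZ', 'JJ']

# Every combination of one terminal per pre-terminal is one sentence.
sentences = [' '.join(words)
             for words in product(*(lexicon[tag] for tag in order))]

for sentence in sentences:
    print(sentence)
print(len(sentences))
```

Adding one more alternative to any lexical rule multiplies the number of sentences, which is why hand-written grammars grow hard to check by inspection.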


Let’s create a grammar that consists of three rules for word-to-word dependency relations and incorporate it through a dependency parser.

In [91]:
# Create the dependency grammar that includes three rules.
dependency_grammar = nltk.DependencyGrammar.fromstring("""
    'is' -> 'sky' | 'sea' | 'blue' | 'red'
    'sky' -> 'The' | 'the' 
    'sea' -> 'The' | 'the' 
    """)

# Create the dependency parser.
pdp = nltk.ProjectiveDependencyParser(dependency_grammar)

# Create the input.
input = ['The', 'sky', 'is', 'blue']

# Parse the input.
trees = pdp.parse(input)

# Print the parse trees.
for tree in trees:
    print(tree)

(is (sky The) blue)
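
The bracketed output reads as "head first, then its dependents". As a rough illustration of that notation (not of how _nltk_ builds the tree), the sketch below renders a hand-written head-to-dependents mapping in the same style; the `dependents` dictionary and the `bracket` helper are hypothetical names introduced here.

```python
def bracket(head, dependents):
    """Render a head and, recursively, its dependents in bracketed form."""
    children = dependents.get(head, [])
    if not children:
        return head
    parts = [head] + [bracket(child, dependents) for child in children]
    return '(' + ' '.join(parts) + ')'

# Head -> dependents for "The sky is blue": 'is' governs 'sky' and 'blue',
# and 'sky' governs the determiner 'The'.
dependents = {'is': ['sky', 'blue'], 'sky': ['The']}

rendered = bracket('is', dependents)
print(rendered)
```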


Now, let’s learn how to perform `NER` in _nltk_ while using the phrase _The Aston Martin is blue_ as input.

In [92]:
# Download nltk models/corpora.
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Tokenize the input text.
text = nltk.word_tokenize("The Aston Martin is blue")

# Parse the input.
tags = nltk.pos_tag(text)

# Find the named entities.
tree = nltk.ne_chunk(tags)

# Draw the tree.
#tree.draw()

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\tsouraki\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\tsouraki\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Next, we must extract the tagging tokens using the `IOB format` (short for inside, outside, beginning), which is used for tagging tokens in a chunking task.

In [93]:
# Get the IOB tags.
iob_tags = nltk.tree2conlltags(tree)

# Print the IOB tags.
print(iob_tags)

[('The', 'DT', 'O'), ('Aston', 'NNP', 'B-ORGANIZATION'), ('Martin', 'NNP', 'I-ORGANIZATION'), ('is', 'VBZ', 'O'), ('blue', 'JJ', 'O')]


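IOB provides three tags to refer to parts of a chunk (group of words): _B_ marks the token that begins a chunk, _I_ marks a token inside a chunk, and _O_ marks a token outside any chunk. In the previous output, _Aston_ begins an _ORGANIZATION_ chunk and _Martin_ continues it.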
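
To see how the tags delimit chunks, the following sketch groups the triples printed above back into named entities. It is a minimal illustration in plain Python; the `extract_chunks` helper is a hypothetical name, not an _nltk_ function.

```python
# IOB triples as printed above: (token, POS tag, IOB tag).
iob_tags = [('The', 'DT', 'O'), ('Aston', 'NNP', 'B-ORGANIZATION'),
            ('Martin', 'NNP', 'I-ORGANIZATION'), ('is', 'VBZ', 'O'),
            ('blue', 'JJ', 'O')]

def extract_chunks(triples):
    """Group consecutive B-/I- tokens into (text, label) chunks."""
    chunks = []
    tokens, label = [], None
    for token, _pos, tag in triples:
        if tag.startswith('B-'):
            # A new chunk begins; close the previous one if it is open.
            if tokens:
                chunks.append((' '.join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith('I-') and tokens and tag[2:] == label:
            # The token continues the chunk that is currently open.
            tokens.append(token)
        else:
            # 'O' (or a stray 'I-' without a matching 'B-') closes any open chunk.
            if tokens:
                chunks.append((' '.join(tokens), label))
            tokens, label = [], None
    if tokens:
        chunks.append((' '.join(tokens), label))
    return chunks

chunks = extract_chunks(iob_tags)
print(chunks)
```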

In the following code, we start with the definition of our feature grammar and use it to parse an input phrase. The grammar includes several attribute-value pairs; for example, the _SEM_ attribute can have _noun_ as its value. With the aid of the transfer rules, we can parse a source representation (in English) and return a sequence of attributes named _TEXT_ with the target language representations (in French). Observe the hierarchical expansion of the rules: the sentence, _S_, contains a noun phrase, _NP_, which in turn consists of a determiner, _DT_, and a noun, _NN_.

In [94]:
# Create the grammar string.
g = """

# S expansion productions.
S[AGR1=?np, ARG2=?vbz, ARG3=?jj] -> NP[AGR=?np] VBZ[AGR=?vbz] JJ[AGR=?jj]

# NP expansion productions.
NP[AGR=[DT=?dt, NN=?nn]] -> DT[AGR=?dt] NN[AGR=?nn] 

# Lexical productions.
DT[AGR=[TEXT='Le', SEM='determiner']] -> 'The' 
DT[AGR=[TEXT='le', SEM='determiner']] -> 'the' 
NN[AGR=[TEXT='ciel', SEM='noun']] -> 'sky'
NN[AGR=[TEXT='mer', SEM='noun']] -> 'sea'
VBZ[AGR=[TEXT='être', SEM='verb', TENSE='present', NUM='singular']] -> 'is'
JJ[AGR=[TEXT='bleu', SEM='adjective']] -> 'blue'
JJ[AGR=[TEXT='rouge', SEM='adjective']] -> 'red'
"""

# Create the input, transfer grammar, and parser.
input = ['The', 'sky', 'is', 'blue']
transfer_grammar = nltk.grammar.FeatureGrammar.fromstring(g)
parser = nltk.parse.FeatureEarleyChartParser(transfer_grammar)

# Parse the input and print the result.
trees = parser.parse(input)
for tree in trees: print(tree)


(S[AGR1=[DT=[SEM='determiner', TEXT='Le'], NN=[SEM='noun', TEXT='ciel']], ARG2=[NUM='singular', SEM='verb', TENSE='present', TEXT='être'], ARG3=[SEM='adjective', TEXT='bleu']]
  (NP[AGR=[DT=[SEM='determiner', TEXT='Le'], NN=[SEM='noun', TEXT='ciel']]]
    (DT[AGR=[SEM='determiner', TEXT='Le']] The)
    (NN[AGR=[SEM='noun', TEXT='ciel']] sky))
  (VBZ[AGR=[NUM='singular', SEM='verb', TENSE='present', TEXT='être']]
    is)
  (JJ[AGR=[SEM='adjective', TEXT='bleu']] blue))


Based on the previous output, we managed to obtain an internal representation in the target language.
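
To make the "internal representation" concrete, the sketch below transcribes the feature structure of the _S_ node from the output above into plain nested dictionaries and collects its _TEXT_ values. Transcribing by hand is an assumption made to keep the example independent of _nltk_; `collect_text` is a hypothetical helper.

```python
# Feature structure of the S node, copied from the parser output above.
s_features = {
    'AGR1': {'DT': {'SEM': 'determiner', 'TEXT': 'Le'},
             'NN': {'SEM': 'noun', 'TEXT': 'ciel'}},
    'ARG2': {'NUM': 'singular', 'SEM': 'verb', 'TENSE': 'present',
             'TEXT': 'être'},
    'ARG3': {'SEM': 'adjective', 'TEXT': 'bleu'},
}

def collect_text(features):
    """Return every TEXT value, depth-first, in insertion order."""
    texts = []
    for name, value in features.items():
        if isinstance(value, dict):
            texts.extend(collect_text(value))
        elif name == 'TEXT':
            texts.append(value)
    return texts

target_tokens = collect_text(s_features)
print(target_tokens)
```

The verb is still the lemma _être_ at this point; turning it into the inflected form _est_ is the job of the generation step.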

Let's now create the generation grammar, create the parser and test it with the representation we encountered earlier.

In [95]:
# Create the grammar string.
g = """

# S expansion productions.
S[AGR1=?np, ARG2=?vbz, ARG3=?jj] -> NP[AGR=?np] VBZ[AGR=?vbz] JJ[AGR=?jj]

# NP expansion productions.
NP[AGR=[DT=?dt, NN=?nn]] -> DT[AGR=?dt] NN[AGR=?nn] 

# Lexical productions.	
DT[AGR=[TEXT='Le']] -> 'Le' 
DT[AGR=[TEXT='le']] -> 'le' 
NN[AGR=[TEXT='ciel']] -> 'ciel'
NN[AGR=[TEXT='mer']] -> 'mer'
VBZ[AGR=[TEXT='est', SEM='verb', TENSE='present', NUM='singular']] -> 'être'
JJ[AGR=[TEXT='bleu']] -> 'bleu'
JJ[AGR=[TEXT='rouge']] -> 'rouge'
"""

# Create the input, generation grammar, and parser.
input = ['Le', 'ciel', 'être', 'bleu']
generation_grammar = nltk.grammar.FeatureGrammar.fromstring(g)
parser = nltk.parse.FeatureEarleyChartParser(generation_grammar)

# Parse the input and print the result.
trees = parser.parse(input)
for tree in trees: print(tree)

(S[AGR1=[DT=[TEXT='Le'], NN=[TEXT='ciel']], ARG2=[NUM='singular', SEM='verb', TENSE='present', TEXT='est'], ARG3=[TEXT='bleu']]
  (NP[AGR=[DT=[TEXT='Le'], NN=[TEXT='ciel']]]
    (DT[AGR=[TEXT='Le']] Le)
    (NN[AGR=[TEXT='ciel']] ciel))
  (VBZ[AGR=[NUM='singular', SEM='verb', TENSE='present', TEXT='est']]
    être)
  (JJ[AGR=[TEXT='bleu']] bleu))


## Example-based machine translation

The reliance on linguistic rules presents many shortcomings. As we saw previously, using a corpus of already-translated examples could serve as a model to base the translation task on. This is the basic idea behind `example-based machine translation` (EBMT) systems: they keep track of well-translated fragments and use this information to facilitate the translation of new sentences.

Let’s learn how to use the alignments that have been defined for a bilingual pair programmatically. In the following code, we are considering two examples from the English-to-French pair.

In [96]:
from nltk.translate import AlignedSent, Alignment

# Hold the bi-lingual text.
bitext = []

# Create two examples from English to French, along with the alignments.
bitext.append(AlignedSent(['blue', 'is', 'The', 'sky'], 
                            ['Le', 'ciel', 'est', 'bleu'], 
                            Alignment.fromstring('0-3 1-2 2-0 3-1')))
bitext.append(AlignedSent(['yellow', 'is', 'The', 'sun'], 
                            ['Le', 'soleil', 'est', 'jaune'], 
                            Alignment.fromstring('0-3 1-2 2-0 3-1')))

# Print the source words in the second example.
bitext[1].words

['yellow', 'is', 'The', 'sun']

For example, the word _yellow_ at position _0_ is aligned with the word _jaune_ at position _3_. 

We can verify these alignments using the following code.

In [97]:
# Print the target words in the second example.
bitext[1].mots

['Le', 'soleil', 'est', 'jaune']

In [98]:
# Print the alignments in the second example.
bitext[1].alignment

Alignment([(0, 3), (1, 2), (2, 0), (3, 1)])

Let’s load the _comtrans_ corpus and pick the first example from the English-to-French dataset.

In [99]:
# Download nltk corpus.
nltk.download('comtrans')

from nltk.corpus import comtrans

# Get the first example from the english/french corpus.
fe = comtrans.aligned_sents('alignment-en-fr.txt')[0]

# Print the source words.
fe.words

[nltk_data] Downloading package comtrans to
[nltk_data]     C:\Users\tsouraki\AppData\Roaming\nltk_data...
[nltk_data]   Package comtrans is already up-to-date!


['Resumption', 'of', 'the', 'session']

The target words in this case are as follows.

In [100]:
# Print the target words.
fe.mots

['Reprise', 'de', 'la', 'session']

Now, we can extract the alignments between the source and the target.

In [101]:
# Print the alignments.
fe.alignment

Alignment([(0, 0), (1, 1), (2, 2), (3, 3)])

In the previous example, the mapping of the words is one-to-one. Unfortunately, this is not the case in most MT tasks. Consider, for example, the following pair.

In [102]:
# Get the example at index 52 from the English/French corpus.
fe = comtrans.aligned_sents('alignment-en-fr.txt')[52]

# Print the source words.
fe.words

['We', 'do', 'not', 'know', 'what', 'is', 'happening', '.']

In [103]:
# Print the target words.
fe.mots

['Nous', 'ne', 'savons', 'pas', 'ce', 'qui', 'se', 'passe', '.']

The output in this case is as follows.

In [104]:
# Print the alignments.
fe.alignment

Alignment([(0, 0), (1, 1), (2, 3), (3, 2), (4, 4), (4, 5), (5, 6), (6, 7), (7, 8)])
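
The alignment pairs are easier to read once they are resolved back to words. The sketch below does that for the example above using plain lists copied from the outputs (so it does not need the corpus), and reports which source words map to more than one target word; the variable names are illustrative.

```python
from collections import defaultdict

# Source words, target words, and alignment pairs copied from the outputs above.
words = ['We', 'do', 'not', 'know', 'what', 'is', 'happening', '.']
mots = ['Nous', 'ne', 'savons', 'pas', 'ce', 'qui', 'se', 'passe', '.']
alignment = [(0, 0), (1, 1), (2, 3), (3, 2), (4, 4), (4, 5),
             (5, 6), (6, 7), (7, 8)]

# Collect the target words aligned with each source position.
aligned = defaultdict(list)
for source_index, target_index in alignment:
    aligned[source_index].append(mots[target_index])

for index, word in enumerate(words):
    print(word, '->', aligned[index])

# Source words that map to more than one target word.
one_to_many = {words[i]: targets
               for i, targets in aligned.items() if len(targets) > 1}
print(one_to_many)
```

Here _what_ is rendered by the two French words _ce qui_, and _not_ and _know_ swap positions relative to _pas_ and _savons_, so the mapping is neither one-to-one nor monotone.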

The following code demonstrates a few bilingual pairs from French to English that can be used to train a lexical translation model.

In [105]:
import nltk.translate.ibm2
from nltk.translate import AlignedSent, Alignment

# Hold the bi-lingual text.
bitext = []

# Create examples from French to English.
bitext.append(AlignedSent(
    ['petite', 'est', 'la', 'maison'],
    ['the', 'house', 'is', 'small']))
bitext.append(AlignedSent(
    ['la', 'maison', 'est', 'grande'], 
    ['the', 'house', 'is', 'big']))
bitext.append(AlignedSent(
    ['le', 'livre', 'est', 'petit'],
    ['the', 'book', 'is', 'small']))
bitext.append(AlignedSent(
    ['la', 'maison'], ['the', 'house']))
bitext.append(AlignedSent(['le', 'livre'], ['the', 'book']))
bitext.append(AlignedSent(['un', 'livre'], ['a', 'book']))



Based on the previous examples, we can create a model and examine the probability of the word _livre_ being translated as _book_.

In [106]:
# Create the lexical translation model from the examples.
ibm2 = nltk.translate.ibm2.IBMModel2(bitext, 5)

# Get the translation probabilities from the model.
print(round(ibm2.translation_table['livre']['book'], 3))

0.879


Don’t be surprised that the output probability is not equal to 1.0. All models suffer from certain limitations, such as biases, the vagaries of data noise and sampling, and so forth. Comparing _livre_ with any other word in the examples gives a much smaller probability. Finally, we can obtain the alignments for one sample phrase.

In [107]:
# Consider one example from the bi-lingual text.
test_sentence = bitext[2]
test_sentence.words

['le', 'livre', 'est', 'petit']

In [108]:
test_sentence.mots

['the', 'book', 'is', 'small']

In [109]:
test_sentence.alignment

Alignment([(0, 0), (1, 1), (2, 2), (3, 3)])
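
To give an intuition for where such translation probabilities come from, the following sketch runs expectation-maximization for the simpler IBM Model 1 on the same six sentence pairs. This is an illustrative simplification, not the computation _nltk_ performs for `IBMModel2`: it ignores the NULL word and the alignment probabilities, so its numbers will differ from the value printed earlier. All names are assumptions introduced for this example.

```python
from collections import defaultdict

# (French words, English words) pairs copied from the bitext above.
pairs = [
    (['petite', 'est', 'la', 'maison'], ['the', 'house', 'is', 'small']),
    (['la', 'maison', 'est', 'grande'], ['the', 'house', 'is', 'big']),
    (['le', 'livre', 'est', 'petit'], ['the', 'book', 'is', 'small']),
    (['la', 'maison'], ['the', 'house']),
    (['le', 'livre'], ['the', 'book']),
    (['un', 'livre'], ['a', 'book']),
]

french_vocab = {f for french, _ in pairs for f in french}

# t[f][e] approximates P(f | e); every entry starts from a uniform value.
uniform = 1.0 / len(french_vocab)
t = defaultdict(lambda: defaultdict(lambda: uniform))

for _ in range(20):
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    # E-step: share each French word among the English words of its sentence.
    for french, english in pairs:
        for f in french:
            norm = sum(t[f][e] for e in english)
            for e in english:
                fraction = t[f][e] / norm
                count[f][e] += fraction
                total[e] += fraction
    # M-step: re-estimate the probabilities from the expected counts.
    for f, row in count.items():
        for e, value in row.items():
            t[f][e] = value / total[e]

# Compare the English words that co-occur with 'livre' in the examples.
candidates = {e: t['livre'][e] for e in ['the', 'book', 'is', 'small', 'a']}
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 3))
```

The word _book_ wins because it is the only English word present in every sentence that contains _livre_, so each iteration shifts more of the expected counts toward that pair.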

## Statistical machine translation

RBMT techniques follow a top-down approach, and domain experts are required to create models that can replicate the data. Conversely, data-driven approaches are bottom-up, and the data derives the model. This section focuses on `statistical machine translation` (SMT), which involves exploiting models whose parameters are learned from bilingual text corpora. These models work on the assumption that every sentence in one language is a possible translation of any sentence in the other. The overarching goal is to find the most probable translation in each case.
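
The search for the most probable translation is commonly written with Bayes' rule, in what is known as the noisy-channel formulation. The symbols $f$ (source sentence) and $e$ (candidate target sentence) are notation introduced here for illustration:

$$
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)} = \arg\max_{e} P(f \mid e)\, P(e)
$$

The denominator $P(f)$ is dropped because it does not depend on $e$. The factor $P(f \mid e)$ plays the role of the translation model and $P(e)$ the role of the language model, which are the two components built in the cells that follow.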

First, to create the translation model, we use a phrase table that includes sequences of words in the source and target languages, along with their log-probabilities.

In [110]:
from collections import defaultdict
from math import log
from nltk.translate import PhraseTable
from nltk.translate.stack_decoder import StackDecoder

# Create the phrase table.
phrase_table = PhraseTable()

# Populate the table with examples.
phrase_table.add(('das',), ('the', 'it'), log(0.4))
phrase_table.add(('das', 'ist'), ('this', 'is'), log(0.8))
phrase_table.add(('ein',), ('a',), log(0.8))
phrase_table.add(('haus',), ('house',), log(1.0))
phrase_table.add(('!',), ('!',), log(0.8))

Now, let’s create the language model.

In [111]:
# Create the dictionary of probabilities for each n-gram.
language_prob = defaultdict(lambda: -999.0)

# Populate the dictionary with unigrams and bigrams.
language_prob[('this',)] = log(0.8)
language_prob[('is',)] = log(0.6)
language_prob[('a', 'house')] = log(0.2)
language_prob[('!',)] = log(0.1)

# Create the language model: a minimal object that exposes the two
# methods used by the stack decoder.
class LanguageModel:
    def probability_change(self, context, phrase):
        return language_prob[phrase]

    def probability(self, phrase):
        return language_prob[phrase]

language_model = LanguageModel()

A stack decoder utilizes the two models to extract the translation of the German phrase _das ist ein haus_ (formally, nouns in German should be capitalized):

In [112]:
# Create the stack decoder and translate a sentence.
stack_decoder = StackDecoder(phrase_table, language_model)
stack_decoder.translate(['das', 'ist', 'ein', 'haus', '!'])	

['this', 'is', 'a', 'house', '!']
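
The phrase table stores log-probabilities so that the score of a translation can be computed as a sum rather than a product. The sketch below scores only the translation-model part of two candidate derivations by hand; it is a simplified illustration, not the scoring that `StackDecoder` performs, which also accounts for the language model and for reordering. The `translation_score` helper and the `UNSEEN` penalty are assumptions made for this example.

```python
from math import log

# Penalty for phrase pairs that are missing from the table (an assumption,
# mirroring the default used for unseen n-grams in the language model).
UNSEEN = -999.0

# Log-probabilities copied from the phrase table above.
phrase_logprob = {
    (('das',), ('the', 'it')): log(0.4),
    (('das', 'ist'), ('this', 'is')): log(0.8),
    (('ein',), ('a',)): log(0.8),
    (('haus',), ('house',)): log(1.0),
    (('!',), ('!',)): log(0.8),
}

def translation_score(derivation):
    """Sum the log-probabilities of the (source, target) phrase pairs."""
    return sum(phrase_logprob.get(pair, UNSEEN) for pair in derivation)

# Derivation that matches the decoder output.
derivation = [
    (('das', 'ist'), ('this', 'is')),
    (('ein',), ('a',)),
    (('haus',), ('house',)),
    (('!',), ('!',)),
]

# Alternative that splits 'das ist'; ('ist',) -> ('is',) is not in the table.
alternative = [
    (('das',), ('the', 'it')),
    (('ist',), ('is',)),
    (('ein',), ('a',)),
    (('haus',), ('house',)),
    (('!',), ('!',)),
]

score = translation_score(derivation)
translation = [word for _, target in derivation for word in target]

print(translation, score)
print(translation_score(alternative))
```

The first derivation scores far higher because every one of its phrase pairs is covered by the table, whereas the alternative pays the penalty for the missing pair.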

## What we have learned …

| | |
| --- | --- |
| **Text preprocessing**<ul><li>Part-of-speech tagging</li><li>Parse trees</li><li>Named entity recognition</li></ul> | **ML algorithms & models**<ul><li>Rule-based MT</li><li>Example-based MT</li><li>Statistical MT</li></ul> |
| | |