- Import the necessary libraries!

In [0]:
import re
from string import digits
import string
import itertools
import copy

- Skip this step unless you are running the notebook on Google Colab; it only mounts Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### **Load and clean the data**
- We read the data file, strip leading and trailing whitespace from the text, and split it on `\n` into a list of lines.

In [0]:
def readFile(path):
  # Read the whole file as UTF-8 text; the with-block closes the file automatically.
  with open(path, mode='rt', encoding='utf-8') as file:
    text = file.read()
  # Strip leading and trailing whitespace, then split on "\n" into a list of lines.
  return text.strip().split("\n")

- The data is preprocessed by lower-casing the words, removing digits, and removing common punctuation (`!#?,.:";)(`) while keeping the apostrophe (') so that possessive nouns are retained.
- The words are tokenized on whitespace.
- Each sentence becomes a list of words, and these lists are collected in a list of sentences, which is returned.
- A vocabulary of the unique words in the corpus is also returned.

In [0]:
def cleanData(data):
  cleandata=[]
  for line in data:
    # convert to lower case
    line = line.lower()
    # remove digits
    line = line.translate(str.maketrans('', '', digits))
    # remove punctuation except for apostrophe
    line = re.sub('[!#?,.:";)(]', '', line)
    # tokenize on white space
    line = line.split()
    cleandata.append(line)
  flat_list = [item for sublist in cleandata for item in sublist]
  vocab= set(flat_list)
  return cleandata, vocab
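
- As a quick sanity check of the cleaning steps above, the sketch below applies the same lower-casing, digit removal, punctuation removal and whitespace tokenization to a single made-up sentence. The helper `clean_line` and the sample text are hypothetical and only illustrate the per-line behaviour; they are not part of the notebook.

```python
import re
from string import digits

def clean_line(line):
    # Hypothetical helper mirroring the per-line steps of cleanData.
    line = line.lower()
    # Drop every digit character.
    line = line.translate(str.maketrans('', '', digits))
    # Drop the listed punctuation; the apostrophe is not in the set, so it survives.
    line = re.sub('[!#?,.:";)(]', '', line)
    # Tokenize on whitespace.
    return line.split()

# Made-up sample sentence (an assumption, not taken from the corpus).
sample = "On 17 December, Parliament's session (the last) was adjourned."
tokens = clean_line(sample)
print(tokens)
```

- Note that characters outside the listed set, such as hyphens, are left in place, so a token like `millenium-bug` stays intact.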
  

- The unzipped data is stored in the same folder as this notebook.


In [0]:
de_path="/content/drive/My Drive/NLP Assignment 3/de-en.de"
de = readFile(de_path)
clean_de, vocab_de= cleanData(de)

In [0]:
en_path="/content/drive/My Drive/NLP Assignment 3/de-en.en"
en = readFile(en_path)
clean_en, vocab_en= cleanData(en)

## Problem 1

- Word translation probabilities are stored in a dictionary keyed by (English word, German word) pairs.
- The probabilities are initialized uniformly to 1 / (size of the German vocabulary).

In [0]:
def trans_prob():
  dictionary={}
  count = {}
  total = {}
  for de in vocab_de:
    total[de]=0
    for eng in vocab_en:
      count[(eng, de)] = 0
      dictionary[(eng, de)]=1/(len(vocab_de))
  return total, count, dictionary
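
- To make the uniform initialization concrete, here is a minimal sketch on a toy vocabulary. The function `init_uniform` and both toy vocabularies are hypothetical; the initial value follows the notebook's choice of 1 / (size of the German vocabulary).

```python
def init_uniform(vocab_en, vocab_de):
    # Hypothetical helper: one entry per (English, German) pair,
    # all set to the same starting value as in trans_prob.
    table = {}
    for de in vocab_de:
        for eng in vocab_en:
            table[(eng, de)] = 1 / len(vocab_de)
    return table

# Toy vocabularies (assumptions for illustration only).
toy_en = {"the", "house", "book"}
toy_de = {"das", "haus", "buch"}

table = init_uniform(toy_en, toy_de)
print(len(table))                 # number of (English, German) pairs
print(table[("house", "haus")])   # the shared starting probability
```

- The table holds one entry for every pair of an English and a German vocabulary word, so its size grows with the product of the two vocabulary sizes.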

## Problem 2

- IBM Model 1 has been implemented as per the pseudocode.
- The input is taken as a set of sentence pairs.
- The uniformly initialized translation probabilities from Problem 1 are used as the starting point.
- The count and total dictionaries are initialized to 0.
- Normalization is computed for all sentence pairs.
- Counts are collected and probabilities are estimated.
- Before checking for convergence, the probabilities are rounded to 3 decimal places.
- The loop stops when the translation probability table no longer changes, which is checked in the while condition. Since convergence takes time, a second condition stops the loop after a specified number of iterations (15 here).

In [0]:
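def model(total, count, dictionary, iterations):
  c=0
  old_t={}
  # Stop when the table no longer changes or after `iterations` passes.
  while(dictionary!=old_t and c<iterations):
    print("Loop started: ",c)
    old_t=copy.deepcopy(dictionary)
    # Reset the expected counts so each pass uses only the current table.
    for key in count:
      count[key]=0
    for de in total:
      total[de]=0
    for i in range(len(clean_en)): #Sent pairs
      for eng in clean_en[i]:
        # Normalization: total probability of eng over this German sentence.
        add=0
        for de in clean_de[i]:
          add+=dictionary[(eng,de)]
        # Collect fractional counts.
        for de in clean_de[i]:
          count[(eng,de)]+=dictionary[(eng,de)]/add
          total[de]+=dictionary[(eng,de)]/add

    # Re-estimate the probabilities, rounded to 3 decimal places.
    for de in vocab_de:
      for eng in vocab_en:
        dictionary[(eng,de)]=round(count[(eng,de)]/total[de],3)
    c+=1
  return dictionary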
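
- The training loop can be easier to follow on a tiny corpus. The sketch below is one common formulation of the IBM Model 1 EM procedure, not the notebook's exact code: the function `ibm1_em` and the three toy sentence pairs are hypothetical, the counts are reset at the start of every iteration, and no rounding or convergence check is applied.

```python
from collections import defaultdict

def ibm1_em(pairs, iterations=10):
    # pairs: list of (english_tokens, german_tokens) tuples.
    en_vocab = {e for en_sent, _ in pairs for e in en_sent}
    de_vocab = {d for _, de_sent in pairs for d in de_sent}
    # Uniform start; any constant works because each pass renormalizes.
    t = {(e, d): 1 / len(en_vocab) for e in en_vocab for d in de_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (e, d) pairs
        total = defaultdict(float)   # expected count of each German word
        for en_sent, de_sent in pairs:
            for e in en_sent:
                # Normalization over the German words in this sentence.
                norm = sum(t[(e, d)] for d in de_sent)
                for d in de_sent:
                    frac = t[(e, d)] / norm
                    count[(e, d)] += frac
                    total[d] += frac
        # Re-estimate t(e|d) from the collected counts.
        for (e, d) in t:
            t[(e, d)] = count[(e, d)] / total[d]
    return t

# Toy parallel corpus (an assumption for illustration only).
pairs = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
    (["a", "book"], ["ein", "buch"]),
]
t = ibm1_em(pairs)
for d in ["das", "haus", "buch"]:
    best = max({e for e, _ in t}, key=lambda e: t[(e, d)])
    print(d, "->", best, round(t[(best, d)], 3))
```

- On this toy corpus the co-occurrence pattern is enough for the most probable English word of `das`, `haus` and `buch` to become `the`, `house` and `book` respectively.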


- After the final translation probabilities are computed, the dictionary is used to retrieve an aligned English sentence for a given German sentence.
- For each German word, the English word with the highest translation probability among all pairs containing that German word is output.

In [0]:
def alignments(clean_de, dictionary):
  eng_sent=[]
  for sent in clean_de:
    eng=[]
    for word in sent:
      maxVal = 0
      maxKey = ()
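      for key, value in dictionary.items():
        # key is an (English word, German word) pair; match on the German word.
        if word == key[1]:
          if maxVal < value:
            maxVal = value
            maxKey = key
      # Fall back to None when no English word has a positive probability.
      eng.append(maxKey[0] if maxKey else None)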
    eng_sent.append(eng)
  return eng_sent
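
- The lookup above scans the entire probability table once for every German word. As a sketch of an alternative, assuming the table is a dict keyed by (English word, German word) as in this notebook, the best English word per German word can be computed in a single pass and then reused. The function `best_translations` and the toy table are hypothetical.

```python
def best_translations(table):
    # Hypothetical helper: one pass over the table, keeping for each
    # German word the English word with the highest probability.
    best = {}
    for (eng, de), prob in table.items():
        if de not in best or prob > best[de][1]:
            best[de] = (eng, prob)
    return {de: eng for de, (eng, _) in best.items()}

# Toy probability table (made-up values for illustration only).
toy_table = {
    ("the", "das"): 0.9,
    ("house", "das"): 0.1,
    ("house", "haus"): 0.8,
    ("the", "haus"): 0.2,
}

lookup = best_translations(toy_table)
sentence = ["das", "haus"]
# dict.get returns None for German words that are not in the table.
translated = [lookup.get(word) for word in sentence]
print(translated)
```

- Building the index costs one pass over the table, after which each word lookup is a single dictionary access.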

In [0]:
total, count, dictionary = trans_prob()


In [13]:
iterations=15
dictionary = model(total, count, dictionary, iterations)

Loop started:  0
Loop started:  1
Loop started:  2
Loop started:  3
Loop started:  4
Loop started:  5
Loop started:  6
Loop started:  7
Loop started:  8
Loop started:  9
Loop started:  10
Loop started:  11
Loop started:  12
Loop started:  13
Loop started:  14


In [0]:
# Use a separate name so the alignments function is not overwritten.
aligned = alignments(clean_de[:5], dictionary)

- Print English alignments for 5 German sentences.

In [17]:
for i in range(5):
  print(clean_de[i])
  print(aligned[i])
  print()
  print()

['wiederaufnahme', 'der', 'sitzungsperiode']
['resumption', 'the', 'you']


['ich', 'erkläre', 'die', 'am', 'freitag', 'dem', 'dezember', 'unterbrochene', 'sitzungsperiode', 'des', 'europäischen', 'parlaments', 'für', 'wiederaufgenommen', 'wünsche', 'ihnen', 'nochmals', 'alles', 'gute', 'zum', 'jahreswechsel', 'und', 'hoffe', 'daß', 'sie', 'schöne', 'ferien', 'hatten']
['i', 'festive', 'the', 'the', 'friday', 'the', 'december', 'festive', 'you', 'the', 'european', 'parliament', 'the', 'festive', 'like', 'you', 'would', 'everything', 'well', 'the', 'festive', 'and', 'hope', 'that', 'to', 'festive', 'festive', 'i']


['wie', 'sie', 'feststellen', 'konnten', 'ist', 'der', 'gefürchtete', 'millenium-bug', 'nicht', 'eingetreten', 'doch', 'sind', 'bürger', 'einiger', 'unserer', 'mitgliedstaaten', 'opfer', 'von', 'schrecklichen', 'naturkatastrophen', 'geworden']
['as', 'to', "'", 'were', 'is', 'the', "'", "'", 'not', "'", 'the', 'the', 'citizens', 'the', 'our', 'member', 'suffered', 'the', "'"