## Introduction to Natural Language Processing: Assignment 1

### PROBLEM 1. Language Model Creation

### 1. Description

To train probabilistic language models to distinguish between words in different languages

### 2. Import the necessary libraries

In [1]:
import nltk
from nltk.corpus import udhr 
import string
import re
from string import punctuation

### 3. Load corpus and create sets of four languages
* English
* French
* Italian
* Spanish

In [2]:
english = udhr.raw('English-Latin1')
french = udhr.raw('French_Francais-Latin1')
italian = udhr.raw('Italian_Italiano-Latin1')
spanish = udhr.raw('Spanish_Espanol-Latin1') 

### 4. Create Train, Test and Dev sets of the languages

In [3]:
english_train, english_dev = english[0:1000], english[1000:1100]
french_train, french_dev = french[0:1000], french[1000:1100]
italian_train, italian_dev = italian[0:1000], italian[1000:1100]
spanish_train, spanish_dev = spanish[0:1000], spanish[1000:1100] 
english_test = udhr.words('English-Latin1')[0:1000]
french_test = udhr.words('French_Francais-Latin1')[0:1000]
italian_test = udhr.words('Italian_Italiano-Latin1')[0:1000]
spanish_test = udhr.words('Spanish_Espanol-Latin1')[0:1000]

### 5. Train the language model
* The training corpus is first preprocessed to remove case sensitivity
* The corpus is checked for any punctuation characters
* Number of words are calculated
* Frequency distribution of each character in the corpus is calculated using NLTK's FreqDist module. Additionally, while calculating frequency distribution it is also checked that the character is a valid alphabet.
* Bigram and Trigram tuples are calculated by taking into account the begining and end of the word. The beginning of any word is marked as _<w_> and the end of the word is marked as _</w_>
* Conditional frequencies for these Bigram and Trigram pairs have been calculated using NLTK's ConditionalFreqDist module.
* The returned values: character frequencies and frequencies of bigram and trigram models are passed to the test_model()

In [4]:
def train_model(train_set):
    # Preprocess the training corpus
    corpus_train1=train_set.lower()
    corpus_train1=''.join(c for c in corpus_train1 if c not in punctuation)
    words=corpus_train1.split()
    fdist_char = nltk.FreqDist(ch for ch in corpus_train1 if ch.isalpha())
    
    # Calculate Bigram pairs
    bigram_pairs=[]
    for w in words:
        for i in range (len(w)):
            if i ==0:
                bigram_pairs.append(("<w>",w[i]))
            if i<len(w)-1:
                bigram_pairs.append((w[i],w[i+1]))
            if i == len(w)-1:
                bigram_pairs.append((w[i],"</w>"))
                
    # Calculate cdf for bigram pairs
    cfd = nltk.ConditionalFreqDist(bigram_pairs)
    
    # Calculate Trigram pairs
    trigram_pairs=[]
    for w in words:
        for i in range (len(w)-1):
            if i ==0:
                trigram_pairs.append(("<w>",w[i]))
                trigram_pairs.append(("<w>"+w[i],w[i+1]))
        if i<len(w)-2:
                trigram_pairs.append((w[i:i+2],w[i+2]))
        if i==len(w)-2:
                trigram_pairs.append((w[i:i+2],"</w>"))
                trigram_pairs.append((w[i+1],"</w>"))
    
    # Calculate cdf for trigram pairs
    cfd1 = nltk.ConditionalFreqDist(trigram_pairs)
    
    return fdist_char, cfd, cfd1

### 6. Test the language model and calculate probabilities

* The test_data is processed to make the words case insensitive
* While calculating Unigram, Bigram and Trigram probabilities, Laplace Smoothing (Add One) has been used. 
* Laplace Smoothing adds 1 to every count. The denominator is accordingly adjusted by adding V which is the total character vocabulary that we have. 

In [5]:
def test_model(test_data,fdist_char,cfd,cfd1):
    #Preprocessing: making words case insensitive
    corpus_test= [w.lower() for w in test_data]
    uni={}
    
    # Calculate Unigram probailities
    for w in corpus_test:
        prob_u=1
        for i in range(len(w)):
            prob_u*=(1+fdist_char[w[i]])/(len(fdist_char.keys())+sum(fdist_char.values()))
        uni[w]=prob_u
        
    # Calculate Bigram probailities
    bi={}
    for w in corpus_test:
        prob=1
        prob1=1
        prob2=1
        for i in range(len(w)):
            if i==0:
                prob1=(1+cfd['<w>'][w[i]])/(len(fdist_char.keys()) +sum(cfd['<w>'].values()))
            if i<len(w)-1:
                prob=prob*(cfd[w[i]][w[i+1]])/(len(fdist_char.keys())+ sum(cfd[w[i]].values()))
            if i== len(w)-1:
                prob2=(1+cfd[w[i]]['</w>'])/(len(fdist_char.keys())+sum(cfd[w[i]].values()))
        bi[w]=prob*prob1*prob2
        
    # Calculate Trigram probabilities
    tri={}
    for w in corpus_test:
        prob=1
        prob1=1
        prob2=1
        prob3=1
        for i in range(len(w)):
            if i==0:
                prob1=(1+cfd1['<w>'][w[i]])/(len(fdist_char.keys()) +sum(cfd1['<w>'].values()))
            if i==1:
                prob3=(1+cfd1['<w>'+w[i-1]][w[i]])/(len(fdist_char.keys()) +sum(cfd1['<w>'+w[i-1]].values()))
            if i<len(w)-1:
                prob=prob*(1+cfd1[w[i-1]+w[i]][w[i+1]])/(len(fdist_char.keys())+ sum(cfd1[w[i-1]+w[i]].values()))
            if i==len(w)-1:
                prob2=(1+cfd1[w[i-1]+w[i]]['</w>'])/(len(fdist_char.keys())+sum(cfd1[w[i-1]+w[i]].values()))
                prob2=prob2*(1+cfd1[w[i]]['</w>'])/(len(fdist_char.keys()) +sum(cfd1[w[i]].values()))
        tri[w]=prob*prob1*prob2*prob3
        
    return uni, bi, tri

### 7. Train models for all languages

In [6]:
fdist_char_eng,cfd_eng,cfd1_eng=train_model(english_train)
fdist_char_fre,cfd_fre,cfd1_fre=train_model(french_train)
fdist_char_spa,cfd_spa,cfd1_spa=train_model(spanish_train)
fdist_char_ita,cfd_ita,cfd1_ita=train_model(italian_train)

### 8. Comparison of English and French Unigram, Bigram and Trigram Models

#### a. Using English Test set

In [7]:
uni_eng, bi_eng, tri_eng=test_model(english_test,fdist_char_eng,cfd_eng,cfd1_eng)
uni_fre, bi_fre, tri_fre=test_model(english_test,fdist_char_fre,cfd_fre,cfd1_fre)

In [8]:
english_test1=[w.lower() for w in english_test]
correct_count_uni=0
correct_count_bi=0
correct_count_tri=0

for i in english_test1:
    if uni_fre[i]<uni_eng[i]:
        correct_count_uni+=1
    if bi_fre[i]<bi_eng[i]:
        correct_count_bi+=1
    if tri_fre[i]<tri_eng[i]:
        correct_count_tri+=1
print("Accuracy on Unigram models is: ", round(correct_count_uni/len(english_test)*100,2), "%")
print("Accuracy for Bigram model is: ", round(correct_count_bi/len(english_test)*100,2),"%")
print("Accuracy for Trigram model is: ", round(correct_count_tri/len(english_test)*100,2),"%")

Accuracy on Unigram models is:  68.1 %
Accuracy for Bigram model is:  75.9 %
Accuracy for Trigram model is:  90.8 %


#### b. Using French Test set

In [9]:
uni_eng, bi_eng, tri_eng=test_model(french_test,fdist_char_eng,cfd_eng,cfd1_eng)
uni_fre, bi_fre, tri_fre=test_model(french_test,fdist_char_fre,cfd_fre,cfd1_fre)

In [10]:
french_test1=[w.lower() for w in french_test]
correct_count_uni=0
correct_count_bi=0
correct_count_tri=0

for i in french_test1:
    if uni_fre[i]>uni_eng[i]:
        correct_count_uni+=1
    if bi_fre[i]>bi_eng[i]:
        correct_count_bi+=1
    if tri_fre[i]>tri_eng[i]:
        correct_count_tri+=1
print("Accuracy on Unigram models is: ", round(correct_count_uni/len(french_test)*100,2), "%")
print("Accuracy for Bigram model is: ", round(correct_count_bi/len(french_test)*100,2),"%")
print("Accuracy for Trigram model is: ", round(correct_count_tri/len(french_test)*100,2),"%")

Accuracy on Unigram models is:  85.4 %
Accuracy for Bigram model is:  58.3 %
Accuracy for Trigram model is:  64.5 %


### 9. Anaysis

* Ideally the accuracies should be Trigram>Bigram>Unigram where accuracy of Trigram is the highest and that of Unigram is the lowest.
* Here we can see this case when english_test was trained on English and French model. The better accuracies are due to the English test set which aided the English model in assigning higher probability to the words.
* On the other hand, the models did not fare well according to ideal order of accuracies for the french_set. This could be due to the sparsity of the CDF matrix or the choice of Smoothing function used.

### PROBLEM 2. Language Model Comparison


### 1. Comparison of Spanish and Italian Unigram, Bigram and Trigram Models

#### a. Using Spanish Test set

In [11]:
uni_spa, bi_spa, tri_spa=test_model(spanish_test,fdist_char_spa,cfd_spa,cfd1_spa)
uni_ita, bi_ita, tri_ita=test_model(spanish_test,fdist_char_ita,cfd_ita,cfd1_ita)

In [12]:
spanish_test1=[w.lower() for w in spanish_test]
correct_count_uni=0
correct_count_bi=0
correct_count_tri=0

for i in spanish_test1:
    if uni_spa[i]>uni_ita[i]:
        correct_count_uni+=1
    if bi_spa[i]>bi_ita[i]:
        correct_count_bi+=1
    if tri_spa[i]>tri_ita[i]:
        correct_count_tri+=1
print("Accuracy on Unigram models is: ", round(correct_count_uni/len(spanish_test)*100,2), "%")
print("Accuracy for Bigram model is: ", round(correct_count_bi/len(spanish_test)*100,2),"%")
print("Accuracy for Trigram model is: ", round(correct_count_tri/len(spanish_test)*100,2),"%")

Accuracy on Unigram models is:  73.7 %
Accuracy for Bigram model is:  60.2 %
Accuracy for Trigram model is:  57.2 %


#### b. Using Italian Test set

In [13]:
uni_spa, bi_spa, tri_spa=test_model(italian_test,fdist_char_spa,cfd_spa,cfd1_spa)
uni_ita, bi_ita, tri_ita=test_model(italian_test,fdist_char_ita,cfd_ita,cfd1_ita)

In [14]:
italian_test1=[w.lower() for w in italian_test]
correct_count_uni=0
correct_count_bi=0
correct_count_tri=0

for i in italian_test1:
    if uni_spa[i]<uni_ita[i]:
        correct_count_uni+=1
    if bi_spa[i]<bi_ita[i]:
        correct_count_bi+=1
    if tri_spa[i]<tri_ita[i]:
        correct_count_tri+=1
print("Accuracy on Unigram models is: ", round(correct_count_uni/len(italian_test)*100,2), "%")
print("Accuracy for Bigram model is: ", round(correct_count_bi/len(italian_test)*100,2),"%")
print("Accuracy for Trigram model is: ", round(correct_count_tri/len(italian_test)*100,2),"%")

Accuracy on Unigram models is:  56.0 %
Accuracy for Bigram model is:  68.9 %
Accuracy for Trigram model is:  84.6 %


### 2. Analysis

Judging by the frequencies, the Spanish, Italian language pairs were harder to distinguish. This could be due to limited training data set which was not diverse.