# [Assignment #2: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign2.html)

## Words and The Company They Keep

### Author: Dan Kondratyuk

### March 2, 2018

---

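This Python notebook examines word association pairs using pointwise mutual information, and word and tag classes built from the same texts.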

The code and an explanation of the results are fully viewable within this webpage.

## Files

- [index.html](./index.html) - Contains all viewable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-2.ipynb](./nlp-assignment-2.ipynb) - Jupyter notebook where code can be run
- [requirements.txt](./requirements.txt) - Python packages required to run the code

## 1. Best Friends

#### Problem Statement
>  In this task you will do a simple exercise to find out the best word association pairs using the pointwise mutual information method.

> First, you will have to prepare the data: take the same texts as in the previous assignment, i.e.

> `TEXTEN1.txt` and `TEXTCZ1.txt`

> (For this part of Assignment 2, there is no need to split the data in any way.)

> Compute the pointwise mutual information for all the possible word pairs appearing consecutively in the data, **disregarding pairs in which one or both words appear less than 10 times in the corpus**, and sort the results from the best to the worst (did you get any negative values? Why?) Tabulate the results, and show the best 20 pairs for both data sets.

> Do the same now but for distant words, i.e. words which are at least 1 word apart, but not farther than 50 words (both directions). Again, tabulate the results, and show the best 20 pairs for both data sets. 
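
As a minimal sketch of the quantity being ranked, the pointwise mutual information of an adjacent pair is `log2(p(a, b) / (p(a) * p(b)))`. The toy corpus and the helper name `pmi_table` below are illustrative assumptions and are not part of the assignment data or of the notebook code.

```python
import math
from collections import Counter

def pmi_table(words, min_count=1):
    """Return {(a, b): PMI in bits} for adjacent pairs whose words both occur >= min_count times."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_uni = len(words)
    n_bi = len(words) - 1
    table = {}
    for (a, b), count in bigrams.items():
        if unigrams[a] < min_count or unigrams[b] < min_count:
            continue  # skip pairs that contain a rare word
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table

# Hypothetical toy corpus, already tokenized
toy = "new york is big , new york is old , the city is new".split()

scores = pmi_table(toy)
best = max(scores, key=scores.get)             # pair with the highest PMI
filtered = pmi_table(toy, min_count=2)         # same ranking, rare words removed
```

On this toy input the top pair without a threshold is built from two words that each occur once, which is one way to see why the assignment asks for a minimum count of 10.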

### Process Text

In [1]:
# Import Python packages
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import nltk
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import collections as c
from tqdm import tqdm
from scipy.special import comb

# Configure Plots
plt.rcParams['lines.linewidth'] = 4

In [2]:
np.random.seed(200) # Set a seed so that this notebook has the same output each time

In [3]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    preprocess = lambda word: word.strip()
    
    return np.array([preprocess(word) for word in content])

In [4]:
# Read the texts into memory
english = './TEXTEN1.txt'
czech = './TEXTCZ1.txt'

words_en = open_text(english)
words_cz = open_text(czech)

In [28]:
class LanguageModel:
    """Counts words and calculates the probabilities of a language model"""
    
    def __init__(self, words, min_words=10):
        self.min_words = min_words
        
        # Unigrams
        self.unigrams = words
        self.unigram_set = list(set(self.unigrams))
        self.total_unigram_count = len(self.unigrams)
        self.unigram_dist = c.Counter(self.unigrams)
        
        # Bigrams
        self.bigrams = list(nltk.bigrams(words))
        self.bigram_set = list(set(self.bigrams))
        self.total_bigram_count = len(self.bigrams)
        self.bigram_dist = c.Counter(self.bigrams)
    
    def p_unigram(self, w):
        """Calculates the probability a unigram appears in the distribution"""
        return self.unigram_dist[w] / self.total_unigram_count
    
    def p_bigram(self, wprev, w):
        """Calculates the probability a bigram appears in the distribution"""
        return self.bigram_dist[(wprev, w)] / self.total_bigram_count
    
    def pointwise_mi(self, wprev, w, p_bigram_func=None):
        """Calculates the pointwise mutual information in a word pair"""
        p_bigram_func = self.p_bigram if p_bigram_func is None else p_bigram_func
        return np.log2(p_bigram_func(wprev, w) / self.p_unigram(wprev) / self.p_unigram(w))

In [29]:
lm_en = LanguageModel(words_en)
lm_cz = LanguageModel(words_cz)

In [30]:
def mutual_information(lm):
    # Obtain all word pairs in the word list, disregarding pairs in which one or both words appear less than 10 times in the corpus  
    pairs = [pair for pair in lm.bigram_set
             if lm.unigram_dist[pair[0]] >= lm.min_words 
             and lm.unigram_dist[pair[1]] >= lm.min_words]

    mi = [(' '.join(pair), lm.pointwise_mi(*pair)) for pair in pairs]
    return pd.DataFrame(mi, columns=['pair', 'mutual_information'])

In [31]:
mi_en = mutual_information(lm_en).sort_values(by='mutual_information', ascending=False)
mi_cz = mutual_information(lm_cz).sort_values(by='mutual_information', ascending=False)

In [32]:
mi_en[:20]

| index | pair | mutual_information |
|---|---|---|
| 43243 | La Plata | 14.16937 |
| 25638 | Asa Gray | 14.031867 |
| 300 | Fritz Muller | 13.362016 |
| 42970 | worth while | 13.332869 |
| 21972 | faced tumbler | 13.26248 |
| 11360 | lowly organised | 13.216899 |
| 23409 | Malay Archipelago | 13.110477 |
| 13752 | shoulder stripe | 13.053893 |
| 8643 | Great Britain | 12.914557 |
| 15550 | United States | 12.847442 |


In [33]:
mi_cz[:20]

| index | pair | mutual_information |
|---|---|---|
| 6412 | Hamburger SV | 14.28895 |
| 2686 | Los Angeles | 14.062442 |
| 6347 | Johna Newcomba | 13.762882 |
| 28960 | Č. Budějovice | 13.633599 |
| 19030 | série ATP | 13.468968 |
| 17861 | turnajové série | 13.434411 |
| 8878 | Tomáš Ježek | 13.428981 |
| 17581 | Lidové noviny | 13.329922 |
| 35002 | Lidových novin | 13.271028 |
| 23072 | veřejného mínění | 13.062442 |


In [34]:
mi_en[:-5:-1]

| index | pair | mutual_information |
|---|---|---|
| 28149 | the , | -8.790285 |
| 11222 | . the | -8.407455 |
| 33255 | . of | -7.90195 |
| 140 | of . | -7.90195 |
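
One way to read the negative values above: PMI drops below zero whenever a pair co-occurs less often than independence would predict, that is, when `p(a, b) < p(a) * p(b)`. This tends to happen for very frequent tokens that rarely sit next to each other. The counts in the sketch below are hypothetical and are not taken from `TEXTEN1.txt`; the helper `pmi` is also only illustrative.

```python
import math

def pmi(pair_count, count_a, count_b, total):
    """PMI in bits from raw counts, using one shared total for simplicity."""
    p_ab = pair_count / total
    return math.log2(p_ab / ((count_a / total) * (count_b / total)))

# Hypothetical counts for a corpus of 100,000 tokens
total = 100_000

# Two very frequent tokens that are adjacent only once: observed < expected
frequent_but_unrelated = pmi(pair_count=1, count_a=5_000, count_b=4_000, total=total)

# Two tokens that only ever occur together: observed > expected
rare_but_attached = pmi(pair_count=10, count_a=10, count_b=10, total=total)
```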


In [35]:
def mutual_information_dist(lm, max_distance=50):
    """Calculates the pointwise mutual information of word pairs that have 1 to max_distance words between them"""
    def mi_step(distance):
        # Get all pairs in the word list that have `distance` words between them
        pair_list = list(zip(lm.unigrams, lm.unigrams[distance+1:]))
        dist = c.Counter(pair_list)
    
        # Obtain all word pairs in the word list, disregarding pairs in which one or both words appear less than 10 times in the corpus  
        pairs = [pair for pair in list(set(pair_list))
                 if lm.unigram_dist[pair[0]] >= lm.min_words 
                 and lm.unigram_dist[pair[1]] >= lm.min_words]
        
        # Pair probability at this distance, normalized by the adjacent bigram count
        p_bigram = lambda wprev, w: dist[(wprev, w)] / lm.total_bigram_count
        
        return [(distance, wprev, w, lm.pointwise_mi(wprev, w, p_bigram)) for wprev, w in pairs]
    
    results = [m for distance in tqdm(range(1, max_distance+1)) for m in mi_step(distance)]
        
    return pd.DataFrame(results, columns=['distance', 'word0', 'word1', 'mutual_information'])

In [36]:
mi_dist_en = mutual_information_dist(lm_en).sort_values(by='mutual_information', ascending=False)
mi_dist_cz = mutual_information_dist(lm_cz).sort_values(by='mutual_information', ascending=False)

100%|██████████| 50/50 [00:24<00:00,  2.01it/s]
100%|██████████| 50/50 [00:24<00:00,  2.07it/s]
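
The cells above score each distance separately. A different reading of the task pools every offset into one count table and counts each pair in both directions. The sketch below illustrates that alternative only; the function name `distant_pair_counts` and the default offsets (2 to 50 positions apart) are assumptions about how "at least 1 word apart" might be interpreted, not the notebook's method.

```python
from collections import Counter

def distant_pair_counts(words, min_offset=2, max_offset=50):
    """Count pairs whose positions differ by min_offset..max_offset, in both directions."""
    counts = Counter()
    for offset in range(min_offset, max_offset + 1):
        for left, right in zip(words, words[offset:]):
            counts[(left, right)] += 1  # right occurs after left
            counts[(right, left)] += 1  # the same pair seen from the other side
    return counts

# Hypothetical toy input with a small window so the result is easy to check by hand
toy = "a b c a b c".split()
counts = distant_pair_counts(toy, min_offset=2, max_offset=3)
```

With pooled counts of this kind, the pair probability would be the pair count divided by `sum(counts.values())` before applying the same PMI formula.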


In [37]:
mi_dist_en[:20]

| index | distance | word0 | word1 | mutual_information |
|---|---|---|---|---|
| 117857 | 2 | survival | fittest | 13.754333 |
| 42284 | 1 | dimorphic | trimorphic | 13.353454 |
| 127242 | 2 | Alph | Candolle | 13.236485 |
| 152017 | 3 | H | Watson | 13.16937 |
| 4742 | 1 | Alph | de | 13.053893 |
| 125929 | 2 | Old | Worlds | 13.053893 |
| 46878 | 1 | E | Forbes | 12.946978 |
| 93923 | 2 | unimportant | welfare | 12.879864 |
| 181294 | 3 | carrier | faced | 12.695439 |
| 32030 | 1 | rarer | rarer | 12.525514 |


In [38]:
mi_dist_cz[:20]

| index | distance | word0 | word1 | mutual_information |
|---|---|---|---|---|
| 42390 | 1 | ODÚ | VPN | 14.119025 |
| 1122 | 1 | turnajové | ATP | 13.614983 |
| 34912 | 1 | Mistrovství | turnajové | 13.410365 |
| 420187 | 8 | výher | výher | 13.318097 |
| 47620 | 1 | Čechy | Slováky | 13.30345 |
| 125482 | 3 | Mistrovství | ATP | 13.203914 |
| 996021 | 19 | prohraná | dvojchyby | 13.172205 |
| 83549 | 2 | soužití | Slováků | 13.062442 |
| 671230 | 13 | prohraná | esa | 13.051911 |
| 410727 | 8 | III | IV | 13.025916 |


## 2. Word Classes

#### Problem Statement

> **The Data**

> Get `TEXTEN1.ptg`, `TEXTCZ1.ptg`. These are your data. They are almost the same as the .txt data you have used so far, except they now contain the part of speech tags in the following form:

> `rady/NNFS2-----A----`  
`,/Z:-------------`

> where the tag is separated from the word by a slash ('/'). Be careful: the tags might contain everything (including slashes, dollar signs and other weird characters). It is guaranteed however that there is no slash-word.

> Similarly for the English texts (except the tags are shorter of course).

> **The Task**

> Compute a full class hierarchy of **words** using the first 8,000 words of those data, and only for words occurring 10 times or more (use the same setting for both languages). Ignore the other words for building the classes, but keep them in the data for the bigram counts. For details on the algorithm, use the Brown et al. paper distributed in the class; some formulas are wrong, however, so please see the corrections on the web (Class 12, formulas for Trick \#4). Note the history of the merges, and attach it to your homework. Now run the same algorithm again, but stop when reaching 15 classes. Print out all the members of your 15 classes and attach them too.

> **Hints:**

> The initial mutual information is (English, words, limit 8000):

> `4.99726326162518` (if you add one extra word at the beginning of the data)  
> `4.99633675507535` (if you use the data as they are and are carefull at the beginning and end).

> NB: the above numbers are finally confirmed from an independent source :-).

> The first 5 merges you get on the English data should be:

> `case subject`  
> `cannot may`  
> `individuals structure`  
> `It there`  
> `even less`  

> The loss of Mutual Information when merging the words "case" and "subject":

> Minimal loss: `0.00219656653357569` for `case+subject`
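
The greedy procedure described in the task can be sketched on a toy corpus as follows. This is a naive version that recomputes the mutual information for every candidate merge; it does not use the incremental bookkeeping from the Brown et al. paper. The names `class_mi` and `greedy_merge`, the toy sentence, and the lowered `min_count` are all assumptions made for illustration.

```python
import itertools
import math
from collections import Counter

def class_mi(seq):
    """Mutual information (bits) between adjacent labels, with marginals taken over bigram positions."""
    pairs = Counter(zip(seq, seq[1:]))
    left = Counter(seq[:-1])
    right = Counter(seq[1:])
    n = len(seq) - 1
    return sum((k / n) * math.log2(k * n / (left[a] * right[b]))
               for (a, b), k in pairs.items())

def greedy_merge(words, min_count=10, target_classes=1):
    """Merge the least costly pair of classes until target_classes mergeable classes remain."""
    counts = Counter(words)
    word2class = {w: w for w in counts}  # every word starts in its own class
    active = sorted(w for w, k in counts.items() if k >= min_count)
    history = []
    while len(active) > target_classes:
        seq = [word2class[w] for w in words]  # rare words stay in the sequence
        before = class_mi(seq)
        best = None
        for c1, c2 in itertools.combinations(active, 2):
            merged = [c1 if cls == c2 else cls for cls in seq]
            loss = before - class_mi(merged)
            if best is None or loss < best[0]:
                best = (loss, c1, c2)
        loss, c1, c2 = best
        for w, cls in word2class.items():
            if cls == c2:
                word2class[w] = c1  # fold class c2 into class c1
        active.remove(c2)
        history.append((c1, c2, loss))
    return history, word2class

# Hypothetical toy corpus; min_count is lowered so that every word can be merged
toy = "the cat sat . the dog sat . a cat ran . a dog ran .".split()
history, word2class = greedy_merge(toy, min_count=2, target_classes=4)
```

Here `history` records each merge with its loss, which corresponds to the merge history the task asks to attach, and `word2class` gives the class membership once the target number of classes is reached.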

In [115]:
import itertools

class LmCluster:
    """Greedy word clustering: merges the pair of classes that loses the least mutual information"""
    
    def __init__(self, words, min_words=10):
        self.lm = LanguageModel(words, min_words)
        self.text_v = list(self.lm.unigrams)
        
        # Only words seen at least min_words times may be merged; rarer words
        # stay in the text so that they still contribute to the bigram counts
        self.word_set = sorted(w for w, n in self.lm.unigram_dist.items() if n >= min_words)
        
        # Start with each word in its own class, labelled by the word itself
        self.vocab2class = {word: word for word in self.lm.unigram_dist}
    
    def find_best_merge(self, vocab2class):
        """Returns (mi, c1, c2) for the merge that leaves the highest mutual information"""
        text_c = [vocab2class[word] for word in self.text_v]
        classes = sorted({vocab2class[word] for word in self.word_set})
        merges = ((self.mi(text_c, c1, c2), c1, c2)
                  for c1, c2 in itertools.combinations(classes, 2))
        # Minimal loss of mutual information means maximal mutual information after the merge
        return max(merges, key=lambda x: x[0])
        
    def mi(self, text_c, c1, c2):
        """Calculates the mutual information of the class bigrams after merging c2 into c1"""
        merged = LanguageModel([c1 if cls == c2 else cls for cls in text_c])
        return np.sum([merged.p_bigram(*pair) * merged.pointwise_mi(*pair) for pair in merged.bigram_set])

In [116]:
cluster = LmCluster(words_en[:8000])

In [112]:
len(cluster.word_set)

112

In [72]:
def mutual_information_total(lm):
    pairs = lm.bigram_set
    return np.sum([lm.p_bigram(*pair) * lm.pointwise_mi(*pair) for pair in pairs])

mutual_information_total(LanguageModel(words_en[:8000]))

4.996416777233102

In [99]:
def mutual_information_total(words):
    def p_unigram(w):
        """Calculates the probability a unigram appears in the distribution"""
        return unigram_dist[w] / total_unigram_count
    
    def p_bigram(wprev, w):
        """Calculates the probability a bigram appears in the distribution"""
        return bigram_dist[(wprev, w)] / total_bigram_count
    
    def pointwise_mi(wprev, w):
        """Calculates the pointwise mutual information in a word pair"""
        return np.log2(p_bigram(wprev, w) / p_unigram(wprev) / p_unigram(w))
    
    # Unigrams
    unigrams = words
    unigram_set = list(set(unigrams))
    total_unigram_count = len(unigrams)
    unigram_dist = c.Counter(unigrams)

    # Bigrams
    bigrams = list(zip(words, words[1:]))
    bigram_set = list(set(bigrams))
    total_bigram_count = len(bigrams)
    bigram_dist = c.Counter(bigrams)
    
    return np.sum([p_bigram(*pair) * pointwise_mi(*pair) for pair in bigram_set])
    
t = mutual_information_total(list(words_en[:8000]))
# Differences from the two reference values given in the assignment hints
t - 4.99726326162518, t - 4.99633675507535

(-0.0008464843920785725, 8.002215775171351e-05)

The assignment hints give `4.99726326162518` (with one extra word added at the beginning of the data) and `4.99633675507535` (using the data as they are, taking care at the beginning and end). The value computed here lies `8.002215775171351e-05` above the second reference value.
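
The two reference values differ only in how the edges of the data are treated. The sketch below shows one way to make that choice explicit: the left and right marginals are counted over bigram positions rather than over all tokens, and an optional padding token is prepended. The function name `bigram_mi` and the `<s>` token are assumptions, and whether this treatment accounts for the small gap reported above has not been verified here.

```python
import math
from collections import Counter

def bigram_mi(words, pad=None):
    """Mutual information (bits) between adjacent words, optionally with a start token prepended."""
    words = list(words)
    if pad is not None:
        words = [pad] + words
    pairs = Counter(zip(words, words[1:]))
    left = Counter(words[:-1])   # words that start a bigram
    right = Counter(words[1:])   # words that end a bigram
    n = len(words) - 1
    return sum((k / n) * math.log2(k * n / (left[a] * right[b]))
               for (a, b), k in pairs.items())

# Hypothetical toy sequence
toy = "a b a b".split()
mi_plain = bigram_mi(toy)
mi_padded = bigram_mi(toy, pad="<s>")
```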

## 3. Tag Classes

> Use the same original data as above, but this time, you will compute the classes for tags (the strings after slashes). Compute tag classes for all tags appearing 5 times or more in the data. Use as much data as time allows. You will be graded relative to the other student's results. Again, note the full history of merges, and attach it to your homework. Pick three interesting classes as the algorithm goes (English data only; Czech optional), and comment on them (why you think you see those tags there together (or not), etc.). 
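
A possible first step for this task is to separate each token into word and tag and keep only the tags that are frequent enough. The sketch below splits at the first slash, which assumes that words themselves never contain a slash while tags may. The helper names `split_token` and `frequent_tags` and the sample token `x/A/B` are hypothetical.

```python
from collections import Counter

def split_token(token):
    """Split 'word/tag' at the first slash so that slashes inside the tag are preserved."""
    word, _, tag = token.partition("/")
    return word, tag

def frequent_tags(tokens, min_count=5):
    """Return the set of tags that occur at least min_count times."""
    counts = Counter(split_token(token)[1] for token in tokens)
    return {tag for tag, k in counts.items() if k >= min_count}

# The first two tokens come from the problem statement; the third is made up
sample = ["rady/NNFS2-----A----", ",/Z:-------------", "x/A/B"]
parsed = [split_token(token) for token in sample]

# Hypothetical tag sequence to show the frequency filter
kept = frequent_tags(["a/N"] * 5 + ["b/V"] * 4)
```

The resulting tag sequence could then be passed to the same merging procedure used for words, with the threshold set to 5 instead of 10.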