# [Assignment #2: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign2.html)

## Words and The Company They Keep

### Author: Dan Kondratyuk

### March 28, 2018

---

This Python notebook examines the role of mutual information in natural language processing.

The code and an explanation of the results are fully viewable within this webpage.

## Files

- [index.html](./index.html) - Contains all viewable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-2.ipynb](./nlp-assignment-2.ipynb) - Jupyter notebook where code can be run
- [brown_cluster.py](./brown_cluster.py) - Code defining the Brown clustering algorithm
- [requirements.txt](./requirements.txt) - Python packages required to run the code

## 1. Best Friends

#### Problem Statement
>  In this task you will do a simple exercise to find out the best word association pairs using the pointwise mutual information method.

> First, you will have to prepare the data: take the same texts as in the previous assignment, i.e.

> `TEXTEN1.txt` and `TEXTCZ1.txt`

> (For this part of Assignment 2, there is no need to split the data in any way.)

> Compute the pointwise mutual information for all the possible word pairs appearing consecutively in the data, **disregarding pairs in which one or both words appear less than 10 times in the corpus**, and sort the results from the best to the worst (did you get any negative values? Why?) Tabulate the results, and show the best 20 pairs for both data sets.

> Do the same now but for distant words, i.e. words which are at least 1 word apart, but not farther than 50 words (both directions). Again, tabulate the results, and show the best 20 pairs for both data sets. 

### Process Text

The first step is to process the frequency distribution of the unigrams and bigrams and define a function to calculate the pointwise mutual information between two words. The class `LanguageModel` will handle this.

In [1]:
# Import Python packages
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# %load_ext autoreload
# %autoreload 2

from collections import defaultdict, Counter
from collections.abc import Iterable  # Iterable is no longer importable from collections (Python 3.10+)
import itertools
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm, trange  # replaces the deprecated tqdm_notebook / tnrange aliases
from scipy.special import comb

# Configure Plots
plt.rcParams['lines.linewidth'] = 4
pd.set_option('max_colwidth', 150)

np.random.seed(200) # Set a seed so that this notebook has the same output each time

In [2]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    preprocess = lambda word: word.strip()
    
    return np.array([preprocess(word) for word in content])

In [3]:
class LanguageModel:
    """Counts words and calculates the probabilities of a language model"""
    
    def __init__(self, words, min_words=10):
        self.min_words = min_words
        
        # Unigrams
        self.unigrams = words
        self.unigram_set = list(set(self.unigrams))
        self.total_unigram_count = len(self.unigrams)
        self.unigram_dist = Counter(self.unigrams)
        
        self.unigram_pdist = defaultdict(float)
        for w in self.unigram_dist:
            self.unigram_pdist[w] = self.unigram_dist[w] / self.total_unigram_count
        
        # Bigrams
        self.bigrams = list(zip(words, words[1:]))
        self.bigram_set = list(set(self.bigrams))
        self.total_bigram_count = len(self.bigrams)
        self.bigram_dist = Counter(self.bigrams)
        
        self.bigram_pdist = defaultdict(float)
        for w in self.bigram_dist:
            self.bigram_pdist[w] = self.bigram_dist[w] / self.total_bigram_count
    
    def p_unigram(self, w):
        """Calculates the probability a unigram appears in the distribution"""
        return self.unigram_pdist[w]
    
    def p_bigram(self, wprev, w):
        """Calculates the probability a bigram appears in the distribution"""
        return self.bigram_pdist[(wprev, w)]
    
    def pointwise_mi(self, wprev, w, p_bigram_func=None):
        """Calculates the pointwise mutual information in a word pair"""
        p_bigram_func = self.p_bigram if p_bigram_func is None else p_bigram_func
        joint = p_bigram_func(wprev, w)
        independent = self.p_unigram(wprev) * self.p_unigram(w)
        return np.log2(joint / independent) if independent != 0 else 0

In [4]:
# Read the texts into memory
english = './TEXTEN1.txt'
czech = './TEXTCZ1.txt'

words_en = open_text(english)
words_cz = open_text(czech)

In [5]:
lm_en = LanguageModel(words_en)
lm_cz = LanguageModel(words_cz)

Loop over all pairs of bigrams and calculate their pointwise mutual information, collecting them into a table.

In [6]:
def mutual_information(lm):
    # Obtain all word pairs in the word list, disregarding pairs in which one or both words appear less than 10 times in the corpus  
    pairs = [pair for pair in lm.bigram_set
             if lm.unigram_dist[pair[0]] >= lm.min_words 
             and lm.unigram_dist[pair[1]] >= lm.min_words]

    mi = [(' '.join(pair), lm.pointwise_mi(*pair)) for pair in pairs]
    return pd.DataFrame(mi, columns=['pair', 'mutual_information'])

In [7]:
mi_en = mutual_information(lm_en).sort_values(by='mutual_information', ascending=False)
mi_cz = mutual_information(lm_cz).sort_values(by='mutual_information', ascending=False)

### Results - Consecutive Pairs

The two tables below show the pointwise mutual information (sorted descending) between pairs of words appearing consecutively in the English and Czech texts respectively.

We see that proper names like 'Great Britain' and 'Tomáš Ježek' have high pointwise mutual information, since their words are frequently seen together and rarely seen apart from each other. However, some pairs have negative values (see below).

In [8]:
mi_en[:20] # English

| | pair | mutual_information |
|---|---|---|
| 6823 | La Plata | 14.16937 |
| 37943 | Asa Gray | 14.031867 |
| 12973 | Fritz Muller | 13.362016 |
| 35753 | worth while | 13.332869 |
| 35699 | faced tumbler | 13.26248 |
| 10443 | lowly organised | 13.216899 |
| 19938 | Malay Archipelago | 13.110477 |
| 24199 | shoulder stripe | 13.053893 |
| 7445 | Great Britain | 12.914557 |
| 13614 | United States | 12.847442 |


In [9]:
mi_cz[:20] # Czech

| | pair | mutual_information |
|---|---|---|
| 3551 | Hamburger SV | 14.28895 |
| 23868 | Los Angeles | 14.062442 |
| 16757 | Johna Newcomba | 13.762882 |
| 35953 | Č. Budějovice | 13.633599 |
| 18877 | série ATP | 13.468968 |
| 35605 | turnajové série | 13.434411 |
| 15144 | Tomáš Ježek | 13.428981 |
| 6814 | Lidové noviny | 13.329922 |
| 17903 | Lidových novin | 13.271028 |
| 10361 | veřejného mínění | 13.062442 |


Sorting in ascending order reveals pairs of words with negative pointwise mutual information. This can be explained by the definition of pointwise mutual information (PMI):

$$PMI(w_t,w_{t+1}) = \log_2 \frac{p(w_t,w_{t+1})}{p(w_t)p(w_{t+1})}$$

where $w_t,w_{t+1}$ are consecutive words (in this instance). The logarithm is negative when its argument is less than 1, which is to say that

$$p(w_t,w_{t+1}) < p(w_t)p(w_{t+1})$$

i.e., the pair appears consecutively in the text less often than it would if the two words occurred independently of each other.

This can be verified by the data below. For instance, '_the_' and '_,_' both appear very frequently in the text. However, they are unlikely to be seen consecutively, since 'the ,' is ungrammatical. Therefore, their pointwise mutual information must be negative.
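
The sign of the PMI can be checked by hand on a small example. The sketch below uses a made-up toy corpus (not the assignment data) and a standalone `pmi` helper that mirrors the formula above; it is only meant to illustrate how a frequent but rarely adjacent pair ends up with a negative value.

```python
from collections import Counter
from math import log2

# Toy corpus, made up for illustration: 'the' and ',' are both frequent,
# but ',' directly follows 'the' only once.
words = "the cat , the dog , the , bird the cat , the end ,".split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
n_unigrams = len(words)
n_bigrams = n_unigrams - 1

def pmi(w1, w2):
    """Pointwise mutual information (base 2) of the consecutive pair (w1, w2)."""
    joint = bigrams[(w1, w2)] / n_bigrams
    independent = (unigrams[w1] / n_unigrams) * (unigrams[w2] / n_unigrams)
    return log2(joint / independent)

print(pmi("the", "cat"))  # positive: seen together more often than chance predicts
print(pmi("the", ","))    # negative: seen together less often than chance predicts
```

Here 'the' and ',' each make up a third of the tokens, so independence would predict the pair in about one ninth of all bigram positions, yet it occupies only one of fourteen.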

In [10]:
mi_en[:-5:-1]

| | pair | mutual_information |
|---|---|---|
| 5712 | the , | -8.790285 |
| 27781 | . the | -8.407455 |
| 30177 | . of | -7.90195 |
| 16500 | of . | -7.90195 |


Now define a function to calculate pointwise mutual information on all pairs of words a constant distance apart (up to 50) and store the results in a table. 

In [11]:
def mutual_information_dist(lm):
    def mi_step(distance):
        # Get all pairs in the word list a certain distance apart
        pair_list = list(zip(lm.unigrams, lm.unigrams[distance+1:]))
        dist = Counter(pair_list)
    
        # Obtain all word pairs in the word list, disregarding pairs in which one or both words appear less than 10 times in the corpus  
        pairs = [pair for pair in list(set(pair_list))
                 if lm.unigram_dist[pair[0]] >= lm.min_words 
                 and lm.unigram_dist[pair[1]] >= lm.min_words]
        
        p_bigram = lambda wprev, w: dist[(wprev, w)] / lm.total_bigram_count
        
        yield ((distance, wprev, w, lm.pointwise_mi(wprev, w, p_bigram)) for wprev,w in pairs)
    
    max_distance = 50
    results = [m for distance in tqdm(range(1, max_distance+1)) for mi in mi_step(distance) for m in mi]
        
    return pd.DataFrame(results, columns=['distance', 'word_1', 'word_2', 'mutual_information'])

In [12]:
mi_dist_en = mutual_information_dist(lm_en).sort_values(by='mutual_information', ascending=False)
mi_dist_cz = mutual_information_dist(lm_cz).sort_values(by='mutual_information', ascending=False)







### Results - Distant Pairs

As before, the two tables below show the pointwise mutual information (sorted descending) between pairs of words appearing in the English and Czech texts. There is an added column called `distance` which indicates the number of words between the two words of interest.

As expected, pairs of words with high pointwise mutual information tend to appear close together. For example, 'survival \_ \_ fittest' can be filled in as 'survival _of the_ fittest', which is a common phrase in the text. More surprisingly, some words appearing far apart from each other also have high mutual information. It is likely that pairs like 'Nastaseho \_ [x25] Newcomba' are part of text that is repeated several times in the corpus, such that the two words rarely appear outside of it.
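
To make the meaning of the `distance` column concrete, here is a minimal sketch of the offset trick used to collect distant pairs. The helper name `pairs_at_distance` is hypothetical and the word list is a toy example; only the `zip` with a shifted slice reflects the approach in `mutual_information_dist`.

```python
def pairs_at_distance(words, distance):
    """Return pairs (w_i, w_j) with exactly `distance` words between them,
    i.e. j = i + distance + 1."""
    return list(zip(words, words[distance + 1:]))

words = "survival of the fittest".split()

print(pairs_at_distance(words, 1))  # one word in between
print(pairs_at_distance(words, 2))  # two words in between: ('survival', 'fittest')
```

Note that this collects pairs in the forward direction only; a pair found backward at the same distance is the same pair with its two words swapped.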

In [13]:
mi_dist_en[:20] # English

| | distance | word_1 | word_2 | mutual_information |
|---|---|---|---|---|
| 133260 | 2 | survival | fittest | 13.754333 |
| 42200 | 1 | dimorphic | trimorphic | 13.353454 |
| 127985 | 2 | Alph | Candolle | 13.236485 |
| 205015 | 3 | H | Watson | 13.16937 |
| 110007 | 2 | Old | Worlds | 13.053893 |
| 51444 | 1 | Alph | de | 13.053893 |
| 43170 | 1 | E | Forbes | 12.946978 |
| 139913 | 2 | unimportant | welfare | 12.879864 |
| 173270 | 3 | carrier | faced | 12.695439 |
| 64514 | 1 | rarer | rarer | 12.525514 |


In [14]:
mi_dist_cz[:20] # Czech

| | distance | word_1 | word_2 | mutual_information |
|---|---|---|---|---|
| 24426 | 1 | ODÚ | VPN | 14.119025 |
| 7146 | 1 | turnajové | ATP | 13.614983 |
| 21491 | 1 | Mistrovství | turnajové | 13.410365 |
| 410056 | 8 | výher | výher | 13.318097 |
| 25208 | 1 | Čechy | Slováky | 13.30345 |
| 125332 | 3 | Mistrovství | ATP | 13.203914 |
| 1019188 | 19 | prohraná | dvojchyby | 13.172205 |
| 66523 | 2 | soužití | Slováků | 13.062442 |
| 675408 | 13 | prohraná | esa | 13.051911 |
| 377752 | 8 | III | IV | 13.025916 |


## 2. Word Classes

#### Problem Statement

> **The Data**

> Get `TEXTEN1.ptg`, `TEXTCZ1.ptg`. These are your data. They are almost the same as the .txt data you have used so far, except they now contain the part of speech tags in the following form:

> `rady/NNFS2-----A----`  
`,/Z:-------------`

> where the tag is separated from the word by a slash ('/'). Be careful: the tags might contain everything (including slashes, dollar signs and other weird characters). It is guaranteed however that there is no slash-word.

> Similarly for the English texts (except the tags are shorter of course).

> **The Task**

> Compute a full class hierarchy of **words** using the first 8,000 words of those data, and only for words occurring 10 times or more (use the same setting for both languages). Ignore the other words for building the classes, but keep them in the data for the bigram counts. For details on the algorithm, use the Brown et al. paper distributed in the class; some formulas are wrong, however, so please see the corrections on the web (Class 12, formulas for Trick \#4). Note the history of the merges, and attach it to your homework. Now run the same algorithm again, but stop when reaching 15 classes. Print out all the members of your 15 classes and attach them too.

> **Hints:**

> The initial mutual information is (English, words, limit 8000):

> `4.99726326162518` (if you add one extra word at the beginning of the data)  
> `4.99633675507535` (if you use the data as they are and are careful at the beginning and end).

> NB: the above numbers are finally confirmed from an independent source :-).

> The first 5 merges you get on the English data should be:

> `case subject`  
> `cannot may`  
> `individuals structure`  
> `It there`  
> `even less`  

> The loss of Mutual Information when merging the words "case" and "subject":

> Minimal loss: `0.00219656653357569` for `case+subject`

### Process Text

Process the text using the `LmCluster` class defined in `brown_cluster.py`. The code will perform the Brown clustering algorithm on the given texts.

In [15]:
from brown_cluster import LmCluster

In [16]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words and tags"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    # Split at the FIRST slash: tags may contain slashes, words never do
    preprocess = lambda word: word.strip().split('/', 1)
    
    return [preprocess(word) for word in content]
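
The problem statement warns that tags may contain slashes while words never do, so the choice of where to split a `word/tag` line matters. The sketch below uses a hypothetical helper `split_word_tag` and one invented tag (`TAG/X`, not taken from the data) to show how splitting at the first slash differs from splitting at the last one.

```python
def split_word_tag(line):
    """Split 'word/tag' at the first slash so slashes inside the tag survive."""
    word, tag = line.strip().split('/', 1)
    return word, tag

# The two sample lines from the problem statement
print(split_word_tag('rady/NNFS2-----A----\n'))
print(split_word_tag(',/Z:-------------'))

# Invented tag containing a slash, for illustration only
print(split_word_tag('word/TAG/X'))   # ('word', 'TAG/X')
print('word/TAG/X'.rsplit('/', 1))    # ['word/TAG', 'X'], part of the tag ends up in the word
```

On lines with a single slash both variants give the same result, so the distinction only matters if such tags actually occur in the data.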

In [17]:
# Read the texts into memory
english = './TEXTEN1.ptg'
czech = './TEXTCZ1.ptg'

words_en, tags_en = zip(*open_text(english))
words_cz, tags_cz = zip(*open_text(czech))

### Cluster the word classes

In [18]:
text_size = 8000

In [22]:
lm_en = LmCluster(words_en[:text_size])
lm_cz = LmCluster(words_cz[:text_size])

100%|██████████| 6216/6216 [00:32<00:00, 190.30pair/s]
100%|██████████| 1830/1830 [00:21<00:00, 84.36pair/s]


In [23]:
lm_en.cluster()
lm_cz.cluster()

100%|██████████| 111/111 [00:36<00:00,  3.05class/s]
100%|██████████| 60/60 [00:24<00:00,  2.47class/s]


In [24]:
def history(cluster):
    return pd.DataFrame(cluster.merge_history, columns=['class 1', 'class 2', 'cluster id', 'mutual_information_loss'])

### History of Merges

The tables below show the history of merges in the English and Czech texts respectively. The class (cluster) id is displayed by its corresponding word (if the class contains just one word).

According to the Brown clustering algorithm, the two classes whose merge reduces the text's total mutual information the least, i.e. words appearing in the most similar contexts, get merged first. For instance, the modal verbs 'may' and 'cannot' can be interchanged in the text without reducing its mutual information much.
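
The merge criterion can be illustrated with a naive sketch that recomputes the class-bigram mutual information before and after a merge. This is not the code in `brown_cluster.py`, which is not shown here; the function names, the toy sentence, and the full recomputation (instead of the incremental updates described by Brown et al.) are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

def class_mutual_information(words, word2class):
    """Average mutual information between adjacent classes in `words`."""
    classes = [word2class[w] for w in words]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n    # how often c1 occurs as the first element
        right[c2] += n   # how often c2 occurs as the second element
    return sum((n / total) * log2(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in bigrams.items())

def merge_loss(words, word2class, a, b):
    """Mutual information lost by merging class b into class a (naive recomputation)."""
    merged = {w: (a if c == b else c) for w, c in word2class.items()}
    return (class_mutual_information(words, word2class)
            - class_mutual_information(words, merged))

# Toy text in which 'may' and 'cannot' occur in exactly the same contexts
words = ("the cat may sleep . the dog cannot sleep . "
         "the cat cannot run . the dog may run .").split()
word2class = {w: w for w in set(words)}  # every word starts in its own class

print(merge_loss(words, word2class, "may", "cannot"))  # essentially zero
print(merge_loss(words, word2class, "the", "sleep"))   # clearly larger
```

A greedy clustering step would evaluate this loss for every candidate pair of classes and merge the pair with the smallest value, which is why words with interchangeable contexts are merged first.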

In [25]:
history(lm_en) # English

| | class 1 | class 2 | cluster id | mutual_information_loss |
|---|---|---|---|---|
| 0 | subject | case | 1662 | -0.002197 |
| 1 | may | cannot | 1663 | -0.002669 |
| 2 | individuals | structure | 1664 | -0.002675 |
| 3 | It | there | 1665 | -0.003479 |
| 4 | even | less | 1666 | -0.003656 |
| 5 | nature | variation | 1667 | -0.003691 |
| 6 | short | slight | 1668 | -0.003906 |
| 7 | cases | manner | 1669 | -0.004250 |
| 8 | state | 1662 | 1670 | -0.004277 |
| 9 | shall | ) | 1671 | -0.004382 |


In [26]:
history(lm_cz) # Czech

| | class 1 | class 2 | cluster id | mutual_information_loss |
|---|---|---|---|---|
| 0 | listopadu | OKD | 3685 | -0.003083 |
| 1 | které | který | 3686 | -0.003373 |
| 2 | J | státu | 3687 | -0.004025 |
| 3 | bude | musí | 3688 | -0.004422 |
| 4 | ale | aby | 3689 | -0.004604 |
| 5 | mezi | už | 3690 | -0.005 |
| 6 | budou | pouze | 3691 | -0.00547 |
| 7 | zákona | jeho | 3692 | -0.005578 |
| 8 | byl | si | 3693 | -0.005792 |
| 9 | NATO | &slash; | 3694 | -0.006073 |


As before, do the clustering, this time stopping at 15 clusters.

In [27]:
clusters = 15

In [28]:
lm_en_15 = LmCluster(words_en[:text_size])
lm_cz_15 = LmCluster(words_cz[:text_size])

100%|██████████| 6216/6216 [00:33<00:00, 187.28pair/s]
100%|██████████| 1830/1830 [00:21<00:00, 83.84pair/s]


In [29]:
lm_en_15.cluster(clusters)
lm_cz_15.cluster(clusters)

100%|██████████| 97/97 [00:35<00:00,  2.73class/s]
100%|██████████| 46/46 [00:22<00:00,  2.04class/s]


In [30]:
def class_cluster(lm):
    classes = lm.get_classes()
    return pd.DataFrame([(x, [lm.class_name(c) for c in classes[x] if c < len(lm.int2word)]) for x in classes], columns=['class', 'words'])

### Cluster Distribution with 15 Classes

The tables below display the contents of each of the 15 classes merged with the clustering algorithm.

Very frequent words such as 'the' and 'of' would reduce the mutual information a lot if merged with any other class, and so are left in classes of their own. Class 1721 shows that quantifiers like 'several' and 'one' occur in similar contexts and hence form their own cluster. The same holds for determiners such as 'a', 'this', and 'any' in class 1758.

In [31]:
class_cluster(lm_en_15) # English

| | class | words |
|---|---|---|
| 0 | 0 | [,] |
| 1 | 1 | [the] |
| 2 | 2 | [of] |
| 3 | 3 | [and] |
| 4 | 4 | [.] |
| 5 | 5 | [to] |
| 6 | 6 | [in] |
| 7 | 7 | [that] |
| 8 | 1758 | [a, ;, this, any, long, very, my, different, great, short, slight] |
| 9 | 1753 | [I, is, as, from, are, on, by, been, under, The, plants, In, so, when, if, believe, see, nearly,... |


In [32]:
class_cluster(lm_cz_15) # Czech

| | class | words |
|---|---|---|
| 0 | 0 | [.] |
| 1 | 1 | [,] |
| 2 | 2 | [a] |
| 3 | 3 | [v] |
| 4 | 4 | [se] |
| 5 | 5 | [na] |
| 6 | 3730 | [o, by, ve, po, ze, před] |
| 7 | 3727 | [", s, V, Na, listopadu, OKD] |
| 8 | 8 | [-] |
| 9 | 3715 | [že, ale, které, který, aby] |


## 3. Tag Classes

> Use the same original data as above, but this time, you will compute the classes for tags (the strings after slashes). Compute tag classes for all tags appearing 5 times or more in the data. Use as much data as time allows. You will be graded relative to the other student's results. Again, note the full history of merges, and attach it to your homework. Pick three interesting classes as the algorithm goes (English data only; Czech optional), and comment on them (why you think you see those tags there together (or not), etc.). 

In [33]:
cluster_en_tag = LmCluster(tags_en, word_cutoff=5)
cluster_en_tag.cluster()

100%|██████████| 630/630 [00:00<00:00, 4500.17pair/s]
100%|██████████| 35/35 [00:00<00:00, 177.86class/s]


The tables below display the history of merges for the part-of-speech tags in the texts.

Some interesting classes include:

- 'JJ' (adjective) and 'JJR' (comparative adjective). Both tags denote slightly different types of adjectives, so it makes sense that they would get merged into their own cluster.
- 'TO' (to) and 'RBS' (superlative adverb). Likewise, the infinitive 'to' and adverbs like 'best' most frequently appear before a verb, and so get merged due to the similar context.
- 'IN' (preposition), 'WP$' (possessive wh-pronoun), '(', and '"' all appear in a single class, likely because all of these tags frequently appear at the beginning of a clause or phrase and break up sentences into smaller units. For instance, 'the chair _whose_ leg ...' or 'the chair _in_ the ...'.

In [34]:
history(cluster_en_tag) # English

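| | class 1 | class 2 | cluster id | mutual_information_loss |
|---|---|---|---|---|
| 0 | IN | WP$ | 36 | 0.010274 |
| 1 | RBR | 36 | 37 | 0.009367 |
| 2 | ( | 37 | 38 | 0.008524 |
| 3 | RB | WP | 39 | 0.012271 |
| 4 | " | 38 | 40 | 0.013501 |
| 5 | FW | 40 | 41 | 0.007315 |
| 6 | NNPS | 41 | 42 | 0.007394 |
| 7 | SYM | 39 | 43 | 0.012683 |
| 8 | WRB | 42 | 44 | 0.010462 |
| 9 | TO | RBS | 45 | 0.010699 |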
9,TO,RBS,45,0.010699


In [35]:
cluster_cz_tag = LmCluster(tags_cz, word_cutoff=5)
cluster_cz_tag.cluster()

100%|██████████| 228826/228826 [12:56<00:00, 294.85pair/s]
100%|██████████| 676/676 [21:18<00:00,  1.89s/class]


In [36]:
history(cluster_cz_tag) # Czech

| | class 1 | class 2 | cluster id | mutual_information_loss |
|---|---|---|---|---|
| 0 | Z:------------- | PE--4---------- | 1015 | 0.015749 |
| 1 | P1ZS1FS3------- | 1015 | 1016 | 0.015721 |
| 2 | PJFS3---------- | 1016 | 1017 | 0.015698 |
| 3 | NNNXX-----A---8 | 1017 | 1018 | 0.015688 |
| 4 | Dg-------1A---1 | 1018 | 1019 | 0.015649 |
| 5 | VsFS4--XX-AP--- | 1019 | 1020 | 0.015652 |
| 6 | AAIS1----1A---6 | 1020 | 1021 | 0.015649 |
| 7 | P7-X4---------- | Dg-------1A---- | 1022 | 0.020290 |
| 8 | AGMS2-----A---- | 1021 | 1023 | 0.019119 |
| 9 | RR--4---------- | Vf--------A---- | 1024 | 0.021323 |
