# [Assignment #2: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign2.html)

## Words and The Company They Keep

### Author: Dan Kondratyuk

### March 2, 2018

---

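This Python notebook examines word associations using pointwise mutual information and builds word and tag classes for English and Czech texts.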

The code and an explanation of the results are fully viewable within this webpage.

## Files

- [index.html](./index.html) - Contains all viewable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-2.ipynb](./nlp-assignment-2.ipynb) - Jupyter notebook where code can be run
- [requirements.txt](./requirements.txt) - Required Python packages for running the code

## 1. Best Friends

#### Problem Statement
>  In this task you will do a simple exercise to find out the best word association pairs using the pointwise mutual information method.

> First, you will have to prepare the data: take the same texts as in the previous assignment, i.e.

> `TEXTEN1.txt` and `TEXTCZ1.txt`

> (For this part of Assignment 2, there is no need to split the data in any way.)

> Compute the pointwise mutual information for all the possible word pairs appearing consecutively in the data, **disregarding pairs in which one or both words appear less than 10 times in the corpus**, and sort the results from the best to the worst (did you get any negative values? Why?) Tabulate the results, and show the best 20 pairs for both data sets.

> Do the same now but for distant words, i.e. words which are at least 1 word apart, but not farther than 50 words (both directions). Again, tabulate the results, and show the best 20 pairs for both data sets. 

### Process Text

In [1]:
# Import Python packages
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# %load_ext autoreload
# %autoreload 2

from collections import defaultdict, Counter
from collections.abc import Iterable  # `collections.Iterable` was removed in Python 3.10
import itertools
import nltk
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm, trange
from scipy.special import comb

# Configure Plots
plt.rcParams['lines.linewidth'] = 4
pd.set_option('max_colwidth', 100)

In [2]:
np.random.seed(200) # Set a seed so that this notebook has the same output each time

In [3]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    preprocess = lambda word: word.strip()
    
    return np.array([preprocess(word) for word in content])

In [4]:
class LanguageModel:
    """Counts words and calculates the probabilities of a language model"""
    
    def __init__(self, words, min_words=10):
        self.min_words = min_words
        
        # Unigrams
        self.unigrams = words
        self.unigram_set = list(set(self.unigrams))
        self.total_unigram_count = len(self.unigrams)
        self.unigram_dist = Counter(self.unigrams)
        
        self.unigram_pdist = defaultdict(float)
        for w in self.unigram_dist:
            self.unigram_pdist[w] = self.unigram_dist[w] / self.total_unigram_count
        
        # Bigrams
        self.bigrams = list(nltk.bigrams(words))
        self.bigram_set = list(set(self.bigrams))
        self.total_bigram_count = len(self.bigrams)
        self.bigram_dist = Counter(self.bigrams)
        
        self.bigram_pdist = defaultdict(float)
        for w in self.bigram_dist:
            self.bigram_pdist[w] = self.bigram_dist[w] / self.total_bigram_count
    
    def p_unigram(self, w):
        """Calculates the probability a unigram appears in the distribution"""
        return self.unigram_pdist[w]
    
    def p_bigram(self, wprev, w):
        """Calculates the probability a bigram appears in the distribution"""
        return self.bigram_pdist[(wprev, w)]
    
    def pointwise_mi(self, wprev, w, p_bigram_func=None):
        """Calculates the pointwise mutual information in a word pair"""
        p_bigram_func = self.p_bigram if p_bigram_func is None else p_bigram_func
        joint = p_bigram_func(wprev, w)
        independent = self.p_unigram(wprev) * self.p_unigram(w)
        return np.log2(joint / independent) if independent != 0 else 0
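
For reference, `pointwise_mi` computes the usual pointwise mutual information of a word pair. Written out with the maximum-likelihood estimates the class uses (the symbols $c(\cdot)$ for corpus counts and $N$ for the number of tokens are notation introduced here, not taken from the assignment):

$$
I(w_{prev}, w) = \log_2 \frac{P(w_{prev}, w)}{P(w_{prev})\,P(w)},
\qquad
P(w_{prev}, w) = \frac{c(w_{prev}, w)}{N - 1},
\qquad
P(w) = \frac{c(w)}{N}
$$

The bigram estimate divides by $N - 1$ because a text of $N$ tokens contains $N - 1$ adjacent pairs.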

In [5]:
# Read the texts into memory
english = './TEXTEN1.txt'
czech = './TEXTCZ1.txt'

words_en = open_text(english)
words_cz = open_text(czech)

In [6]:
lm_en = LanguageModel(words_en)
lm_cz = LanguageModel(words_cz)

In [7]:
def mutual_information(lm):
    # Obtain all word pairs in the word list, disregarding pairs in which one or both words appear less than 10 times in the corpus  
    pairs = [pair for pair in lm.bigram_set
             if lm.unigram_dist[pair[0]] >= lm.min_words 
             and lm.unigram_dist[pair[1]] >= lm.min_words]

    mi = [(' '.join(pair), lm.pointwise_mi(*pair)) for pair in pairs]
    return pd.DataFrame(mi, columns=['pair', 'mutual_information'])

In [8]:
mi_en = mutual_information(lm_en).sort_values(by='mutual_information', ascending=False)
mi_cz = mutual_information(lm_cz).sort_values(by='mutual_information', ascending=False)

In [9]:
mi_en[:20]

|  | pair | mutual_information |
| --- | --- | --- |
| 41329 | La Plata | 14.16937 |
| 32582 | Asa Gray | 14.031867 |
| 23243 | Fritz Muller | 13.362016 |
| 5687 | worth while | 13.332869 |
| 10378 | faced tumbler | 13.26248 |
| 8793 | lowly organised | 13.216899 |
| 29389 | Malay Archipelago | 13.110477 |
| 21688 | shoulder stripe | 13.053893 |
| 35398 | Great Britain | 12.914557 |
| 13256 | United States | 12.847442 |


In [10]:
mi_cz[:20]

|  | pair | mutual_information |
| --- | --- | --- |
| 21075 | Hamburger SV | 14.28895 |
| 36964 | Los Angeles | 14.062442 |
| 13500 | Johna Newcomba | 13.762882 |
| 1744 | Č. Budějovice | 13.633599 |
| 255 | série ATP | 13.468968 |
| 3901 | turnajové série | 13.434411 |
| 31632 | Tomáš Ježek | 13.428981 |
| 36687 | Lidové noviny | 13.329922 |
| 2895 | Lidových novin | 13.271028 |
| 24063 | veřejného mínění | 13.062442 |


In [11]:
mi_en[:-5:-1]

|  | pair | mutual_information |
| --- | --- | --- |
| 6019 | the , | -8.790285 |
| 36491 | . the | -8.407455 |
| 2320 | of . | -7.90195 |
| 40739 | . of | -7.90195 |
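
The negative scores above are expected: the score drops below zero whenever a pair is observed together less often than its two words would co-occur by chance, that is when $P(a, b) < P(a)P(b)$. Very frequent tokens such as `the` and `,` are rarely adjacent, so their few adjacent occurrences score strongly negative. A minimal sketch on a made-up token list (the helper name `pmi_table` and the toy data are illustrative and not part of the notebook):

```python
from collections import Counter
from math import log2


def pmi_table(tokens):
    """Return {(w1, w2): PMI} for every adjacent pair in a token list."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = n - 1  # a text of n tokens has n - 1 adjacent pairs
    table = {}
    for (w1, w2), count in bigrams.items():
        joint = count / n_bigrams
        independent = (unigrams[w1] / n) * (unigrams[w2] / n)
        table[(w1, w2)] = log2(joint / independent)
    return table


# Toy data (hypothetical): "a" and "b" are equally frequent but meet only once
tokens = ["a", "a", "a", "a", "b", "b", "b", "b"]
scores = pmi_table(tokens)
print(scores[("a", "a")])  # positive: seen more often than chance
print(scores[("a", "b")])  # negative: seen less often than chance
```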


In [12]:
def mutual_information_dist(lm):
    def mi_step(distance):
        # Get all pairs in the word list a certain distance apart
        pair_list = list(zip(lm.unigrams, lm.unigrams[distance+1:]))
        dist = Counter(pair_list)
    
        # Obtain all word pairs in the word list, disregarding pairs in which one or both words appear less than 10 times in the corpus  
        pairs = [pair for pair in list(set(pair_list))
                 if lm.unigram_dist[pair[0]] >= lm.min_words 
                 and lm.unigram_dist[pair[1]] >= lm.min_words]
        
        p_bigram = lambda wprev, w: dist[(wprev, w)] / lm.total_bigram_count
        
        yield ((distance, wprev, w, lm.pointwise_mi(wprev, w, p_bigram)) for wprev,w in pairs)
    
    max_distance = 50
    results = [m for distance in tqdm(range(1, max_distance+1)) for mi in mi_step(distance) for m in mi]
        
    return pd.DataFrame(results, columns=['distance', 'word0', 'word1', 'mutual_information'])
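
The assignment asks for pairs in both directions, while `mi_step` pairs each word only with the words that follow it. One possible reading of the task is to count every ordered pair inside the window, in both orders, in a single pass. A sketch under stated assumptions: the name `distant_pair_counts` and its parameters are hypothetical, a "gap" means the number of words between the two positions, and counts are pooled over all distances rather than kept per distance as in the notebook:

```python
from collections import Counter


def distant_pair_counts(tokens, min_gap=1, max_gap=50):
    """Count ordered pairs with between min_gap and max_gap words between them."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Position j has (j - i - 1) words between it and position i
        last = min(i + max_gap + 2, len(tokens))
        for j in range(i + min_gap + 1, last):
            right = tokens[j]
            counts[(left, right)] += 1  # left word seen before right word
            counts[(right, left)] += 1  # same occurrence, opposite direction
    return counts


counts = distant_pair_counts(["a", "b", "c", "d"], min_gap=1, max_gap=1)
print(counts[("a", "c")], counts[("c", "a")], counts[("a", "b")])  # 1 1 0
```

Pooling the distances gives one score per word pair, whereas the notebook reports the distance at which each pair scored best; which of the two is wanted depends on how the task is read.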

In [13]:
mi_dist_en = mutual_information_dist(lm_en).sort_values(by='mutual_information', ascending=False)
mi_dist_cz = mutual_information_dist(lm_cz).sort_values(by='mutual_information', ascending=False)

In [14]:
mi_dist_en[:20]

|  | distance | word0 | word1 | mutual_information |
| --- | --- | --- | --- | --- |
| 79104 | 2 | survival | fittest | 13.754333 |
| 66376 | 1 | dimorphic | trimorphic | 13.353454 |
| 118307 | 2 | Alph | Candolle | 13.236485 |
| 175623 | 3 | H | Watson | 13.16937 |
| 100355 | 2 | Old | Worlds | 13.053893 |
| 42597 | 1 | Alph | de | 13.053893 |
| 22862 | 1 | E | Forbes | 12.946978 |
| 84121 | 2 | unimportant | welfare | 12.879864 |
| 179318 | 3 | carrier | faced | 12.695439 |
| 45988 | 1 | rarer | rarer | 12.525514 |


In [15]:
mi_dist_cz[:20]

|  | distance | word0 | word1 | mutual_information |
| --- | --- | --- | --- | --- |
| 19527 | 1 | ODÚ | VPN | 14.119025 |
| 20285 | 1 | turnajové | ATP | 13.614983 |
| 29739 | 1 | Mistrovství | turnajové | 13.410365 |
| 401837 | 8 | výher | výher | 13.318097 |
| 599 | 1 | Čechy | Slováky | 13.30345 |
| 138197 | 3 | Mistrovství | ATP | 13.203914 |
| 1008188 | 19 | prohraná | dvojchyby | 13.172205 |
| 96422 | 2 | soužití | Slováků | 13.062442 |
| 691746 | 13 | prohraná | esa | 13.051911 |
| 401011 | 8 | III | IV | 13.025916 |


## 2. Word Classes

#### Problem Statement

> **The Data**

> Get `TEXTEN1.ptg`, `TEXTCZ1.ptg`. These are your data. They are almost the same as the .txt data you have used so far, except they now contain the part of speech tags in the following form:

> `rady/NNFS2-----A----`  
`,/Z:-------------`

> where the tag is separated from the word by a slash ('/'). Be careful: the tags might contain everything (including slashes, dollar signs and other weird characters). It is guaranteed however that there is no slash-word.

> Similarly for the English texts (except the tags are shorter of course).

> **The Task**

> Compute a full class hierarchy of **words** using the first 8,000 words of those data, and only for words occurring 10 times or more (use the same setting for both languages). Ignore the other words for building the classes, but keep them in the data for the bigram counts. For details on the algorithm, use the Brown et al. paper distributed in the class; some formulas are wrong, however, so please see the corrections on the web (Class 12, formulas for Trick \#4). Note the history of the merges, and attach it to your homework. Now run the same algorithm again, but stop when reaching 15 classes. Print out all the members of your 15 classes and attach them too.

> **Hints:**

> The initial mutual information is (English, words, limit 8000):

> `4.99726326162518` (if you add one extra word at the beginning of the data)  
> `4.99633675507535` (if you use the data as they are and are careful at the beginning and end).

> NB: the above numbers are finally confirmed from an independent source :-).

> The first 5 merges you get on the English data should be:

> `case subject`  
> `cannot may`  
> `individuals structure`  
> `It there`  
> `even less`  

> The loss of Mutual Information when merging the words "case" and "subject":

> Minimal loss: `0.00219656653357569` for `case+subject`

In [13]:
from brown_cluster import LmCluster

In [14]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words and tags"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    # Split on the first slash: tags may contain slashes, words never do
    preprocess = lambda word: word.strip().split('/', 1)
    
    return [preprocess(word) for word in content]

In [15]:
# Read the texts into memory
english = './TEXTEN1.ptg'
czech = './TEXTCZ1.ptg'

words_en, tags_en = zip(*open_text(english))
words_cz, tags_cz = zip(*open_text(czech))

In [29]:
class LanguageModel:
    """Counts words and calculates the probabilities of a language model"""
    
    def __init__(self, words, word_cutoff=10):
        self.word_cutoff = word_cutoff
        
        # Unigrams
        self.text_size = len(words)
        self.word2int = {}
        self.unigram_dist = defaultdict(int)
        
        word_counts = Counter(words)
        word_set = sorted(word_counts, key=lambda w: word_counts[w], reverse=True)
        
        for i, w in enumerate(word_set):
            self.word2int[w] = i
            self.unigram_dist[i] = word_counts[w]
        
        self.int2word = sorted(self.word2int, key=lambda word: self.word2int[word])
        self.unigrams = [self.word2int[w] for w in words]
        
        # Bigrams
        self.bigrams = list(zip(self.unigrams, self.unigrams[1:]))
        self.bigram_set = set(self.bigrams)
        self.bigram_dist = defaultdict(lambda: defaultdict(int))
        for wprev, w in self.bigrams:
            self.bigram_dist[wprev][w] += 1
        
        self.classes = [word for word in self.unigram_dist if self.unigram_dist[word] >= self.word_cutoff]
        self.class_counter = len(self.unigram_dist)
        
        self.int2class = list(range(len(self.word2int)))
        
        self.merge_history = []
        
#         self.W = self.build_w(self.classes)
    
#     def build_w(self, classes):
#         W = defaultdict(lambda: defaultdict(float))

#         # Edges between classes
#         for l, r in itertools.combinations(classes, 2):
#             W[l][r] = self.pointwise_mi(l, r) + self.pointwise_mi(r, l)

#         # Edges to and from a single class
#         for c in classes:
#             W[c][c] = self.pointwise_mi(c, c)

#         return W

    def class_name(self, classes):
        if not isinstance(classes, Iterable):
            classes = [classes]

        classes = [self.int2word[c] if c < len(self.int2word) else c for c in classes]
        return classes if len(classes) > 1 else classes[0]

    def cluster(self, class_count):
        merges = len(self.classes) - class_count
        prev_mi = self.mi(self.bigram_set)
        for _ in trange(merges, unit='class'):
#             merges = self.best_merge()
            mi, (l, r, c_new), merge_data = self.best_merge()
            self.bigram_dist, self.unigram_dist, self.int2class, self.classes = merge_data
            
            save = (*self.class_name([l, r]), c_new, prev_mi - mi)
            self.merge_history.append(save)
            
            print(save)
            
            prev_mi = mi
            self.class_counter += 1

    def best_merge(self):
        mi = (self.merge_mi(l, r) for l, r in itertools.combinations(self.classes, 2))
        progress = tqdm(mi, total=comb(len(self.classes), 2, exact=True), leave=False)
        return max(progress, key=lambda x: x[0])
#         return sorted(progress, key=lambda x: x[0], reverse=True)

    def merge_mi(self, l, r):
#         unigram_dist = self.unigram_dist.copy()
#         bigram_dist = defaultdict(lambda: defaultdict(int))
#         for wprev in self.bigram_dist:
#             bigram_dist[wprev] = self.bigram_dist[wprev].copy()
                
        int2class = self.int2class.copy()
        classes = self.classes.copy()
        
        c_new = self.class_counter
        
#         # Add the new class to frequency distributions
#         unigram_dist[c_new] = unigram_dist[l] + unigram_dist[r]
        
#         for c in [l, r]:
#             for d, count in bigram_dist[c].items():
#                 d = c_new if d in [l, r] else d
#                 bigram_dist[c_new][d] += count
#         for c in bigram_dist:
#             for d in [l, r]:
#                 if d in bigram_dist[c] and c != c_new:
#                     bigram_dist[c][c_new] += bigram_dist[c][d]
                
        
#         if c1 >= len(self.word2int) and c1 in s:
#             del self.word_counts[c1]
#         if c2 >= len(self.word2int) and c2 in self.word_counts:
#             del self.word_counts[c2]

#         del self.bigram_counts[c1]
#         del self.bigram_counts[c2]
#         for c in bigram_dist:
#             for d in [l, r]:
#                 if d in bigram_dist[c]:
#                     del bigram_dist[c][d]
        
        # Update mapping between words and classes
        for c in [l, r]:
            int2class[c] = c_new
            classes.remove(c)
        int2class.append(c_new)
        classes.append(c_new)
        
        unigrams = [int2class[w] for w in self.unigrams]
        bigrams = list(zip(unigrams, unigrams[1:]))
        
        unigram_dist = defaultdict(int)
        for w in unigrams:
            unigram_dist[w] += 1
        
        bigram_set = set(bigrams)
        bigram_dist = defaultdict(lambda: defaultdict(int))
        for wprev, w in bigrams:
            bigram_dist[wprev][w] += 1
        
        mi = self.mi(bigram_set, bigram_dist, unigram_dist)
        merge = l, r, c_new
        merge_data = bigram_dist, unigram_dist, int2class, classes
        return mi, merge, merge_data
    
#     def merge(self, l, r):
#         c_new = self.class_counter
#         self.class_counter += 1
        
#         # Add the new class to frequency distributions
#         self.unigram_dist[c_new] = self.unigram_dist[l] + self.unigram_dist[r]
        
#         for c in [l, r]:
#             for d, count in self.bigram_dist[c].items():
#                 d = c_new if d in [l, r] else d
#                 self.bigram_dist[c_new][d] += count
#         for c in self.bigram_dist:
#             for d in [l, r]:
#                 if d in self.bigram_dist[c] and c != c_new:
#                     self.bigram_dist[c][c_new] += self.bigram_dist[c][d]
        
#         # Update mapping between words and classes
#         self.int2class.append(c_new)
#         self.classes.append(c_new)
#         for c in [l, r]:
#             self.int2class[c] = c_new 
#             self.classes.remove(c)
        
#         unigrams = [self.int2class[w] for w in self.unigrams]
#         bigram_set = set(zip(unigrams, unigrams[1:]))
        
#         return c_new, self.mi(bigram_set)
        
    def mi(self, bigram_set, bigram_dist=None, unigram_dist=None):
#         extra = np.log2(self.text_size / self.unigram_dist[self.unigrams[0]]) / self.text_size
        # Built-in sum: passing a generator to np.sum is deprecated
        return sum(self.pointwise_mi(*pair, bigram_dist, unigram_dist) for pair in bigram_set)
    
    def pointwise_mi(self, wprev, w, bigram_dist=None, unigram_dist=None):
        """Calculates the pointwise mutual information in a word pair"""
        bigram_dist = bigram_dist if bigram_dist else self.bigram_dist
        unigram_dist = unigram_dist if unigram_dist else self.unigram_dist
        
        if not bigram_dist[wprev][w]:
            return 0.0
        
        p_bigram = bigram_dist[wprev][w] / self.text_size
        joint = bigram_dist[wprev][w] * self.text_size
        independent = unigram_dist[wprev] * unigram_dist[w]
        return p_bigram * np.log2(joint / independent)
    
#     def div(self, a, b):
#         return 0 if b == 0 else a / b

In [30]:
lm = LanguageModel(words_en[:8000])
lm.mi(lm.bigram_set), lm.merge_mi(lm.word2int['subject'], lm.word2int['case'])[:2]

(4.995611899527346, (4.993415332993771, (84, 104, 1662)))
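
The first number above is the mutual information summed over all observed adjacent pairs, and the merge score is the same sum recomputed after two words share one label. A minimal sketch of that relabel-and-recount idea on a toy sequence (`average_mi` and `merge_loss` are illustrative names; like the notebook's `pointwise_mi`, the sketch divides bigram counts by the text size, and it is not claimed to reproduce the reference values in the hints):

```python
from collections import Counter
from math import log2


def average_mi(labels):
    """Mutual information summed over adjacent label pairs."""
    n = len(labels)
    unigrams = Counter(labels)
    bigrams = Counter(zip(labels, labels[1:]))
    return sum(
        (count / n) * log2(count * n / (unigrams[left] * unigrams[right]))
        for (left, right), count in bigrams.items()
    )


def merge_loss(labels, a, b, merged="a+b"):
    """Mutual information lost when labels a and b are replaced by one class."""
    relabelled = [merged if label in (a, b) else label for label in labels]
    return average_mi(labels) - average_mi(relabelled)


labels = ["a", "b", "a", "b"]  # toy sequence, not assignment data
print(average_mi(labels))  # 0.5
print(merge_loss(labels, "a", "b"))
```

Recounting the whole text for every candidate pair is simple but slow, which is the cost that the table-based bookkeeping in the Brown et al. paper is meant to avoid.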

In [31]:
lm = LanguageModel(words_en[:8000])
best = lm.cluster(15)

('subject', 'case', 1662, 0.0021965665335752504)


('in', 1662, 1663, -3.552713678800501e-15)


KeyboardInterrupt: 

In [23]:
cluster_en = LmCluster(words_en[:8000])
cluster_cz = LmCluster(words_cz[:8000])

2018-03-15 17:17:41,313	8000 word tokens were processed.
2018-03-15 17:17:41,316	Starting classes: 112
2018-03-15 17:17:41,317	initializing tables
100%|██████████| 6216/6216 [00:01<00:00, 3685.15pairs/s]
100%|██████████| 111/111 [00:02<00:00, 41.42class/s]
2018-03-15 17:17:45,708	8000 word tokens were processed.
2018-03-15 17:17:45,709	Starting classes: 61
2018-03-15 17:17:45,710	initializing tables
100%|██████████| 1830/1830 [00:00<00:00, 6654.45pairs/s]
100%|██████████| 60/60 [00:00<00:00, 116.27class/s]


In [24]:
def history(cluster):
    return pd.DataFrame(cluster.cluster_history, columns=['prev word', 'word', 'cluster id'])

In [25]:
history(cluster_en)[:20]

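|  | prev word | word | cluster id |
| --- | --- | --- | --- |
| 0 | may | cannot | 1662 |
| 1 | subject | case | 1663 |
| 2 | ) | 1663 | 1664 |
| 3 | in | between | 1665 |
| 4 | short | slight | 1666 |
| 5 | ( | 1666 | 1667 |
| 6 | an | my | 1668 |
| 7 | 1667 | 1668 | 1669 |
| 8 | individuals | structure | 1670 |
| 9 | 1664 | 1670 | 1671 |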


In [26]:
history(cluster_cz)[:20]

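|  | prev word | word | cluster id |
| --- | --- | --- | --- |
| 0 | při | ? | 3685 |
| 1 | od | mezi | 3686 |
| 2 | 3685 | 3686 | 3687 |
| 3 | po | před | 3688 |
| 4 | 3687 | 3688 | 3689 |
| 5 | za | musí | 3690 |
| 6 | však | bude | 3691 |
| 7 | byl | si | 3692 |
| 8 | ze | 3692 | 3693 |
| 9 | 3691 | 3693 | 3694 |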


In [24]:
cluster_en_15 = LmCluster(words_en[:8000], cluster_cutoff=15)
cluster_cz_15 = LmCluster(words_cz[:8000], cluster_cutoff=15)

2018-03-03 17:02:39,737	8000 word tokens were processed.
2018-03-03 17:02:39,738	Starting classes: 112
2018-03-03 17:02:39,738	initializing tables
100%|██████████| 6216/6216 [00:01<00:00, 4508.62pairs/s]
100%|██████████| 97/97 [00:02<00:00, 44.01class/s]
2018-03-03 17:02:43,341	8000 word tokens were processed.
2018-03-03 17:02:43,342	Starting classes: 61
2018-03-03 17:02:43,342	initializing tables
100%|██████████| 1830/1830 [00:00<00:00, 8011.53pairs/s]
100%|██████████| 46/46 [00:00<00:00, 119.97class/s]


In [25]:
def get_classes(cluster):
    classes = defaultdict(list)

    for c0 in cluster.classes:
        for w in cluster.vocab:
            cur_cluster = cluster.vocab[w]
            while cur_cluster in cluster.cluster_parents:
                cur_cluster = cluster.cluster_parents[cur_cluster]
            if c0 == cur_cluster:
                classes[c0].append(w)

    return pd.DataFrame([(x,classes[x]) for x in classes], columns=['class', 'words'])
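
`get_classes` rescans the whole vocabulary once per class. The same grouping can be built in a single pass by following each word's parent links to its final cluster. A minimal sketch, assuming `vocab` maps each word to its initial cluster id and `parents` maps a merged cluster id to the cluster that absorbed it (these shapes are inferred from how `get_classes` uses `cluster.vocab` and `cluster.cluster_parents`; the `LmCluster` source is not shown here, and `group_by_root` is a hypothetical name):

```python
from collections import defaultdict


def group_by_root(vocab, parents):
    """Group words by the final cluster reached by following parent links."""
    groups = defaultdict(list)
    for word, cluster_id in vocab.items():
        # Climb until the cluster has no parent, i.e. it was never merged away
        while cluster_id in parents:
            cluster_id = parents[cluster_id]
        groups[cluster_id].append(word)
    return dict(groups)


# Toy data (hypothetical ids): clusters 0 and 1 were merged into cluster 3
vocab = {"may": 0, "cannot": 1, "the": 2}
parents = {0: 3, 1: 3}
print(group_by_root(vocab, parents))  # {3: ['may', 'cannot'], 2: ['the']}
```

Unlike `get_classes`, this sketch keeps every word it is given and does not filter the result by `cluster.classes`.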

In [26]:
get_classes(cluster_en_15)

|  | class | words |
| --- | --- | --- |
| 0 | 0 | [,] |
| 1 | 1 | [the] |
| 2 | 2 | [of] |
| 3 | 3 | [and] |
| 4 | 4 | [.] |
| 5 | 8 | [a] |
| 6 | 10 | [I] |
| 7 | 12 | [as] |
| 8 | 13 | [be] |
| 9 | 14 | [have] |


In [27]:
get_classes(cluster_cz_15)

|  | class | words |
| --- | --- | --- |
| 0 | 0 | [.] |
| 1 | 1 | [,] |
| 2 | 2 | [a] |
| 3 | 3 | [v] |
| 4 | 4 | [se] |
| 5 | 6 | [o] |
| 6 | 7 | ["] |
| 7 | 9 | [že] |
| 8 | 14 | [je] |
| 9 | 15 | [i] |


## 3. Tag Classes

> Use the same original data as above, but this time, you will compute the classes for tags (the strings after slashes). Compute tag classes for all tags appearing 5 times or more in the data. Use as much data as time allows. You will be graded relative to the other student's results. Again, note the full history of merges, and attach it to your homework. Pick three interesting classes as the algorithm goes (English data only; Czech optional), and comment on them (why you think you see those tags there together (or not), etc.). 

In [40]:
cluster_en_tag = LmCluster(tags_en, word_cutoff=5)
cluster_cz_tag = LmCluster(tags_cz, word_cutoff=5)

2018-03-03 17:12:23,725	221098 word tokens were processed.
2018-03-03 17:12:23,725	Starting classes: 36
2018-03-03 17:12:23,725	initializing tables
100%|██████████| 630/630 [00:00<00:00, 7894.11pairs/s]
100%|██████████| 35/35 [00:00<00:00, 309.61class/s]
2018-03-03 17:12:24,016	224538 word tokens were processed.
2018-03-03 17:12:24,016	Starting classes: 677
2018-03-03 17:12:24,017	initializing tables
100%|██████████| 228826/228826 [07:17<00:00, 522.90pairs/s]
100%|██████████| 676/676 [10:31<00:00,  1.07class/s]


In [43]:
def history(cluster):
    return pd.DataFrame(cluster.cluster_history, columns=['prev tag', 'tag', 'cluster id'])

In [44]:
history(cluster_en_tag)[:20]

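|  | prev tag | tag | cluster id |
| --- | --- | --- | --- |
| 0 | RBR | WP$ | 36 |
| 1 | JJS | 36 | 37 |
| 2 | SYM | NNPS | 38 |
| 3 | PRP | EX | 39 |
| 4 | NN | 38 | 40 |
| 5 | 37 | 40 | 41 |
| 6 | FW | 41 | 42 |
| 7 | RB | " | 43 |
| 8 | ( | 42 | 44 |
| 9 | NNS | WP | 45 |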
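
Each row of the history merges two items into a new cluster id, and an item may itself be an earlier cluster id. Expanding the ids shows which original tags end up together. A small sketch using the first six rows of the English table above (the helper `expand_history` is hypothetical, and it assumes cluster ids are integers while tags are strings):

```python
def expand_history(history):
    """Map each new cluster id to the flat list of original tags it contains."""
    members = {}
    for left, right, new_id in history:
        # A side that is an earlier cluster id expands to that cluster's members
        members[new_id] = members.get(left, [left]) + members.get(right, [right])
    return members


# First six rows of the English tag merge history
history = [
    ("RBR", "WP$", 36),
    ("JJS", 36, 37),
    ("SYM", "NNPS", 38),
    ("PRP", "EX", 39),
    ("NN", 38, 40),
    (37, 40, 41),
]
members = expand_history(history)
print(members[41])  # ['JJS', 'RBR', 'WP$', 'NN', 'SYM', 'NNPS']
```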


In [45]:
history(cluster_cz_tag)[:20]

|  | prev tag | tag | cluster id |
| --- | --- | --- | --- |
| 0 | AAIP6----3A---- | CrIP6---------- | 1015 |
| 1 | J^------------8 | NNNXX-----A---8 | 1016 |
| 2 | PSFS6-P1------- | AAFS6----2A---- | 1017 |
| 3 | AGFS7-----A---- | PSFS7-P1------- | 1018 |
| 4 | Vi-P---1--N---- | PJXP2---------- | 1019 |
| 5 | AAIP3----3A---- | AGIP3-----A---- | 1020 |
| 6 | PZXP6---------- | PSXP6-P1------- | 1021 |
| 7 | PLXP6---------- | AGFP6-----A---- | 1022 |
| 8 | AAFP7----2A---- | AAFP7----1N---- | 1023 |
| 9 | AAIS6----1N---- | AAIS6----3A---- | 1024 |
