# [Assignment #2: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign2.html)

## Words and The Company They Keep

### Author: Dan Kondratyuk

### March 2, 2018

---

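This Python notebook examines word association pairs using pointwise mutual information, and the word and tag classes produced by the class-based clustering algorithm of Brown et al.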

Code and explanations of results are fully viewable within this webpage.

## Files

- [index.html](./index.html) - Contains all viewable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-2.ipynb](./nlp-assignment-2.ipynb) - Jupyter notebook where code can be run
- [requirements.txt](./requirements.txt) - Python packages required to run the code

## 1. Best Friends

#### Problem Statement
>  In this task you will do a simple exercise to find out the best word association pairs using the pointwise mutual information method.

> First, you will have to prepare the data: take the same texts as in the previous assignment, i.e.

> `TEXTEN1.txt` and `TEXTCZ1.txt`

> (For this part of Assignment 2, there is no need to split the data in any way.)

> Compute the pointwise mutual information for all the possible word pairs appearing consecutively in the data, **disregarding pairs in which one or both words appear less than 10 times in the corpus**, and sort the results from the best to the worst (did you get any negative values? Why?) Tabulate the results, and show the best 20 pairs for both data sets.

> Do the same now but for distant words, i.e. words which are at least 1 word apart, but not farther than 50 words (both directions). Again, tabulate the results, and show the best 20 pairs for both data sets. 
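
As a point of reference for the task above, pointwise mutual information compares how often a pair actually occurs with how often it would occur if the two words were independent: `log2(p(a, b) / (p(a) * p(b)))`. The following is a minimal sketch on a made-up sentence, not the notebook's implementation; the function name `toy_pmi` and the toy corpus are assumptions introduced only for illustration.

```python
from collections import Counter
from math import log2

def toy_pmi(words, min_count=1):
    """PMI of every adjacent pair whose words both occur at least min_count times."""
    unigram_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    n_unigrams = len(words)
    n_bigrams = len(words) - 1

    scores = {}
    for (a, b), count in bigram_counts.items():
        if unigram_counts[a] < min_count or unigram_counts[b] < min_count:
            continue  # frequency cutoff, analogous to the "less than 10 times" rule
        joint = count / n_bigrams
        independent = (unigram_counts[a] / n_unigrams) * (unigram_counts[b] / n_unigrams)
        scores[(a, b)] = log2(joint / independent)
    return scores

# Hypothetical toy corpus (not taken from TEXTEN1.txt)
toy_words = "new york is big , new york is new".split()
toy_scores = toy_pmi(toy_words)

# Print the pairs from the strongest to the weakest association
for pair, score in sorted(toy_scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 3))
```

On a corpus this small the ranking is dominated by rare words, which is one reason the task imposes a minimum frequency.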

### Process Text

In [1]:
# Import Python packages
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# %load_ext autoreload
# %autoreload 2

from collections import defaultdict, Counter  # Iterable was unused and is no longer importable from collections in Python 3.10+
import itertools
import nltk
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook as tqdm, tnrange as trange
from scipy.special import comb

# Configure Plots
plt.rcParams['lines.linewidth'] = 4
pd.set_option('max_colwidth', 100)

np.random.seed(200) # Set a seed so that this notebook has the same output each time

In [2]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    preprocess = lambda word: word.strip()
    
    return np.array([preprocess(word) for word in content])

In [3]:
class LanguageModel:
    """Counts words and calculates the probabilities of a language model"""
    
    def __init__(self, words, min_words=10):
        self.min_words = min_words
        
        # Unigrams
        self.unigrams = words
        self.unigram_set = list(set(self.unigrams))
        self.total_unigram_count = len(self.unigrams)
        self.unigram_dist = Counter(self.unigrams)
        
        self.unigram_pdist = defaultdict(float)
        for w in self.unigram_dist:
            self.unigram_pdist[w] = self.unigram_dist[w] / self.total_unigram_count
        
        # Bigrams
        self.bigrams = list(nltk.bigrams(words))
        self.bigram_set = list(set(self.bigrams))
        self.total_bigram_count = len(self.bigrams)
        self.bigram_dist = Counter(self.bigrams)
        
        self.bigram_pdist = defaultdict(float)
        for w in self.bigram_dist:
            self.bigram_pdist[w] = self.bigram_dist[w] / self.total_bigram_count
    
    def p_unigram(self, w):
        """Calculates the probability a unigram appears in the distribution"""
        return self.unigram_pdist[w]
    
    def p_bigram(self, wprev, w):
        """Calculates the probability a bigram appears in the distribution"""
        return self.bigram_pdist[(wprev, w)]
    
    def pointwise_mi(self, wprev, w, p_bigram_func=None):
        """Calculates the pointwise mutual information in a word pair"""
        p_bigram_func = self.p_bigram if p_bigram_func is None else p_bigram_func
        joint = p_bigram_func(wprev, w)
        independent = self.p_unigram(wprev) * self.p_unigram(w)
        return np.log2(joint / independent) if independent != 0 else 0

In [4]:
# Read the texts into memory
english = './TEXTEN1.txt'
czech = './TEXTCZ1.txt'

words_en = open_text(english)
words_cz = open_text(czech)

In [5]:
lm_en = LanguageModel(words_en)
lm_cz = LanguageModel(words_cz)

In [6]:
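def mutual_information(lm):
    """Calculates the pointwise mutual information of all adjacent word pairs that pass the frequency cutoff"""
    # Obtain all word pairs in the word list, disregarding pairs in which one or both words appear fewer than min_words (10) times in the corpus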
    pairs = [pair for pair in lm.bigram_set
             if lm.unigram_dist[pair[0]] >= lm.min_words 
             and lm.unigram_dist[pair[1]] >= lm.min_words]

    mi = [(' '.join(pair), lm.pointwise_mi(*pair)) for pair in pairs]
    return pd.DataFrame(mi, columns=['pair', 'mutual_information'])

In [7]:
mi_en = mutual_information(lm_en).sort_values(by='mutual_information', ascending=False)
mi_cz = mutual_information(lm_cz).sort_values(by='mutual_information', ascending=False)

In [8]:
mi_en[:20]

| | pair | mutual_information |
|---|---|---|
| 1812 | La Plata | 14.16937 |
| 10985 | Asa Gray | 14.031867 |
| 13301 | Fritz Muller | 13.362016 |
| 25145 | worth while | 13.332869 |
| 39601 | faced tumbler | 13.26248 |
| 13295 | lowly organised | 13.216899 |
| 28544 | Malay Archipelago | 13.110477 |
| 18174 | shoulder stripe | 13.053893 |
| 33993 | Great Britain | 12.914557 |
| 20439 | United States | 12.847442 |


In [9]:
mi_cz[:20]

| | pair | mutual_information |
|---|---|---|
| 14167 | Hamburger SV | 14.28895 |
| 33829 | Los Angeles | 14.062442 |
| 29292 | Johna Newcomba | 13.762882 |
| 15824 | Č. Budějovice | 13.633599 |
| 6077 | série ATP | 13.468968 |
| 9789 | turnajové série | 13.434411 |
| 37677 | Tomáš Ježek | 13.428981 |
| 22711 | Lidové noviny | 13.329922 |
| 4938 | Lidových novin | 13.271028 |
| 11320 | veřejného mínění | 13.062442 |


In [10]:
mi_en[:-5:-1]

| | pair | mutual_information |
|---|---|---|
| 5482 | the , | -8.790285 |
| 5032 | . the | -8.407455 |
| 7911 | of . | -7.90195 |
| 9651 | . of | -7.90195 |
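
Negative values appear whenever a pair occurs adjacently less often than independence would predict, which is typical for two very frequent tokens that rarely sit next to each other. The sketch below shows the arithmetic with hypothetical counts; the function `pmi_from_counts` and all numbers are assumptions chosen for illustration, not values measured on the corpus.

```python
from math import log2

def pmi_from_counts(pair_count, count_a, count_b, n_tokens):
    """PMI from raw counts, with n_tokens - 1 adjacent pairs in the text."""
    joint = pair_count / (n_tokens - 1)
    independent = (count_a / n_tokens) * (count_b / n_tokens)
    return log2(joint / independent)

# Hypothetical: two frequent tokens (1000 occurrences each) adjacent only once
frequent_rarely_adjacent = pmi_from_counts(pair_count=1, count_a=1000, count_b=1000, n_tokens=10001)

# Hypothetical: two rarer tokens (10 occurrences each) that always appear together
rare_always_adjacent = pmi_from_counts(pair_count=10, count_a=10, count_b=10, n_tokens=10001)

print(frequent_rarely_adjacent)  # negative: the joint probability is below the independence baseline
print(rare_always_adjacent)      # positive: the joint probability is far above the baseline
```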


In [11]:
def mutual_information_dist(lm, max_distance=50):
    """Calculates the pointwise mutual information of word pairs separated by 1 to max_distance words"""
    def mi_step(distance):
        # Get all ordered pairs with exactly `distance` words between them
        pair_list = list(zip(lm.unigrams, lm.unigrams[distance+1:]))
        dist = Counter(pair_list)
    
        # Obtain all word pairs in the word list, disregarding pairs in which one or both words appear fewer than min_words (10) times in the corpus
        pairs = [pair for pair in set(pair_list)
                 if lm.unigram_dist[pair[0]] >= lm.min_words 
                 and lm.unigram_dist[pair[1]] >= lm.min_words]
        
        # Joint probability of the distant pair, normalized by the adjacent bigram count as an approximation
        p_bigram = lambda wprev, w: dist[(wprev, w)] / lm.total_bigram_count
        
        return [(distance, wprev, w, lm.pointwise_mi(wprev, w, p_bigram)) for wprev, w in pairs]
    
    results = [m for distance in tqdm(range(1, max_distance+1)) for m in mi_step(distance)]
        
    return pd.DataFrame(results, columns=['distance', 'word0', 'word1', 'mutual_information'])
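
The distance convention used above (a distance of 1 means one word lies between the pair, so the second word is `distance + 1` positions ahead) can be checked on a tiny list. This is a minimal sketch; `pairs_at_distance` and the toy list are hypothetical names introduced here.

```python
from collections import Counter

def pairs_at_distance(words, distance):
    """Count ordered (left, right) pairs with exactly `distance` words between them."""
    offset = distance + 1  # distance 1 -> right word is two positions ahead
    return Counter(zip(words, words[offset:]))

# Hypothetical toy sequence
toy_sequence = ["a", "b", "c", "a", "b", "c"]
gap_one = pairs_at_distance(toy_sequence, 1)

for pair, count in gap_one.items():
    print(pair, count)
```

Because `zip` stops at the shorter input, the number of pairs shrinks by one for every extra step of distance, so the counts at large distances are normalized over slightly fewer positions than adjacent bigrams.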

In [12]:
mi_dist_en = mutual_information_dist(lm_en).sort_values(by='mutual_information', ascending=False)
mi_dist_cz = mutual_information_dist(lm_cz).sort_values(by='mutual_information', ascending=False)







In [13]:
mi_dist_en[:20]

| | distance | word0 | word1 | mutual_information |
|---|---|---|---|---|
| 100747 | 2 | survival | fittest | 13.754333 |
| 34024 | 1 | dimorphic | trimorphic | 13.353454 |
| 109425 | 2 | Alph | Candolle | 13.236485 |
| 171136 | 3 | H | Watson | 13.16937 |
| 84829 | 2 | Old | Worlds | 13.053893 |
| 13541 | 1 | Alph | de | 13.053893 |
| 25956 | 1 | E | Forbes | 12.946978 |
| 134535 | 2 | unimportant | welfare | 12.879864 |
| 220813 | 3 | carrier | faced | 12.695439 |
| 56208 | 1 | rarer | rarer | 12.525514 |


In [14]:
mi_dist_cz[:20]

| | distance | word0 | word1 | mutual_information |
|---|---|---|---|---|
| 32730 | 1 | ODÚ | VPN | 14.119025 |
| 34986 | 1 | turnajové | ATP | 13.614983 |
| 26719 | 1 | Mistrovství | turnajové | 13.410365 |
| 388108 | 8 | výher | výher | 13.318097 |
| 47522 | 1 | Čechy | Slováky | 13.30345 |
| 139690 | 3 | Mistrovství | ATP | 13.203914 |
| 1000244 | 19 | prohraná | dvojchyby | 13.172205 |
| 87893 | 2 | soužití | Slováků | 13.062442 |
| 654498 | 13 | prohraná | esa | 13.051911 |
| 410973 | 8 | III | IV | 13.025916 |


## 2. Word Classes

#### Problem Statement

> **The Data**

> Get `TEXTEN1.ptg`, `TEXTCZ1.ptg`. These are your data. They are almost the same as the .txt data you have used so far, except they now contain the part of speech tags in the following form:

> `rady/NNFS2-----A----`  
`,/Z:-------------`

> where the tag is separated from the word by a slash ('/'). Be careful: the tags might contain everything (including slashes, dollar signs and other weird characters). It is guaranteed however that there is no slash-word.

> Similarly for the English texts (except the tags are shorter of course).

> **The Task**

> Compute a full class hierarchy of **words** using the first 8,000 words of those data, and only for words occurring 10 times or more (use the same setting for both languages). Ignore the other words for building the classes, but keep them in the data for the bigram counts. For details on the algorithm, use the Brown et al. paper distributed in the class; some formulas are wrong, however, so please see the corrections on the web (Class 12, formulas for Trick \#4). Note the history of the merges, and attach it to your homework. Now run the same algorithm again, but stop when reaching 15 classes. Print out all the members of your 15 classes and attach them too.

> **Hints:**

> The initial mutual information is (English, words, limit 8000):

> `4.99726326162518` (if you add one extra word at the beginning of the data)  
> `4.99633675507535` (if you use the data as they are and are careful at the beginning and end).

> NB: the above numbers are finally confirmed from an independent source :-).

> The first 5 merges you get on the English data should be:

> `case subject`  
> `cannot may`  
> `individuals structure`  
> `It there`  
> `even less`  

> The loss of Mutual Information when merging the words "case" and "subject":

> Minimal loss: `0.00219656653357569` for `case+subject`
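
To make the quantities in the hints concrete, the sketch below computes the mutual information of adjacent classes for a toy sequence and the loss caused by merging two classes. It recomputes everything from scratch for each candidate merge, so it is a naive illustration only: it is neither the optimized procedure from the Brown et al. paper nor the `brown_cluster` module used in this notebook, and its boundary handling is an assumption, so it is not expected to reproduce the reference numbers above. The names `class_mutual_information` and `merge_loss` and the toy sentence are hypothetical.

```python
from collections import Counter
from math import log2

def class_mutual_information(class_sequence):
    """Sum over adjacent class pairs of p(l, r) * log2(p(l, r) / (p_left(l) * p_right(r)))."""
    bigrams = Counter(zip(class_sequence, class_sequence[1:]))
    total = sum(bigrams.values())

    # Marginals taken from the bigram table (an assumption of this sketch)
    left, right = Counter(), Counter()
    for (l, r), count in bigrams.items():
        left[l] += count
        right[r] += count

    return sum(
        (count / total) * log2(count * total / (left[l] * right[r]))
        for (l, r), count in bigrams.items()
    )

def merge_loss(class_sequence, a, b):
    """Mutual information lost when every occurrence of class b is relabelled as class a."""
    merged = [a if c == b else c for c in class_sequence]
    return class_mutual_information(class_sequence) - class_mutual_information(merged)

# Hypothetical toy text in which "cat" and "dog" share exactly the same contexts
toy_text = "the cat sat . the dog sat . the cat ran . the dog ran .".split()
loss_cat_dog = merge_loss(toy_text, "cat", "dog")
loss_cat_the = merge_loss(toy_text, "cat", "the")

print(loss_cat_dog)  # close to zero: the two words are distributionally identical here
print(loss_cat_the)  # clearly positive: the merge blurs a real distinction
```

A greedy clustering pass would evaluate such a loss for every pair of classes and merge the pair with the smallest loss, which is why words with similar left and right neighbours are merged first.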

In [None]:
from brown_cluster import LmCluster

In [136]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words and tags"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    # Split on the first slash: words are guaranteed to contain no slash, but tags may contain one
    preprocess = lambda word: word.strip().split('/', 1)
    
    return [preprocess(word) for word in content]
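
The task statement guarantees that words contain no slash while tags may, so where the line is split matters. The sketch below contrasts splitting at the first and at the last slash; the first two lines come from the task statement, while the third line, with a slash inside the tag, is a hypothetical example, as are the function names.

```python
def split_first_slash(line):
    """Split 'word/tag' at the first slash, so any further slashes stay in the tag."""
    word, tag = line.strip().split('/', 1)
    return word, tag

def split_last_slash(line):
    """Split 'word/tag' at the last slash, which misplaces slashes that belong to the tag."""
    word, tag = line.strip().rsplit('/', 1)
    return word, tag

sample_lines = [
    "rady/NNFS2-----A----\n",
    ",/Z:-------------\n",
    "x/A/B\n",  # hypothetical tag "A/B" containing a slash
]

for line in sample_lines:
    print(split_first_slash(line), split_last_slash(line))
```

The two functions agree on the first two lines and differ only when the tag itself contains a slash.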

In [3]:
# Read the texts into memory
english = './TEXTEN1.ptg'
czech = './TEXTCZ1.ptg'

words_en, tags_en = zip(*open_text(english))
words_cz, tags_cz = zip(*open_text(czech))

In [None]:
text_size = 8000

In [108]:
lm_en = LmCluster(words_en[:text_size])
lm_cz = LmCluster(words_cz[:text_size])







In [109]:
lm_en.cluster()
lm_cz.cluster()







In [110]:
def history(cluster):
    return pd.DataFrame(cluster.merge_history, columns=['class 1', 'class 2', 'cluster id', 'mutual_information_loss'])

In [111]:
history(lm_en)[:20]

| | class 1 | class 2 | cluster id | mutual_information_loss |
|---|---|---|---|---|
| 0 | subject | case | 1662 | -0.002197 |
| 1 | may | cannot | 1663 | -0.002669 |
| 2 | individuals | structure | 1664 | -0.002675 |
| 3 | It | there | 1665 | -0.003479 |
| 4 | even | less | 1666 | -0.003656 |
| 5 | nature | variation | 1667 | -0.003691 |
| 6 | short | slight | 1668 | -0.003906 |
| 7 | cases | manner | 1669 | -0.00425 |
| 8 | state | 1662 | 1670 | -0.004277 |
| 9 | shall | ) | 1671 | -0.004382 |


In [112]:
history(lm_cz)[:20]

| | class 1 | class 2 | cluster id | mutual_information_loss |
|---|---|---|---|---|
| 0 | listopadu | OKD | 3685 | -0.003083 |
| 1 | které | který | 3686 | -0.003373 |
| 2 | J | státu | 3687 | -0.004025 |
| 3 | bude | musí | 3688 | -0.004422 |
| 4 | ale | aby | 3689 | -0.004604 |
| 5 | mezi | už | 3690 | -0.005 |
| 6 | budou | pouze | 3691 | -0.00547 |
| 7 | zákona | jeho | 3692 | -0.005578 |
| 8 | byl | si | 3693 | -0.005792 |
| 9 | NATO | &slash; | 3694 | -0.006073 |


In [113]:
clusters = 15

In [114]:
lm_en_15 = LmCluster(words_en[:text_size])
lm_cz_15 = LmCluster(words_cz[:text_size])







In [115]:
lm_en_15.cluster(clusters)
lm_cz_15.cluster(clusters)







In [116]:
def class_cluster(lm):
    classes = lm.get_classes()
    return pd.DataFrame([(x, [lm.class_name(c) for c in classes[x] if c < len(lm.int2word)]) for x in classes], columns=['class', 'words'])

In [117]:
class_cluster(lm_en_15)

| | class | words |
|---|---|---|
| 0 | 0 | [,] |
| 1 | 1 | [the] |
| 2 | 2 | [of] |
| 3 | 3 | [and] |
| 4 | 4 | [.] |
| 5 | 5 | [to] |
| 6 | 6 | [in] |
| 7 | 7 | [that] |
| 8 | 1758 | [a, ;, this, any, long, very, my, different, great, short, slight] |
| 9 | 1753 | [I, is, as, from, are, on, by, been, under, The, plants, In, so, when, if, believe, see, nearly,... |


In [118]:
class_cluster(lm_cz_15)

| | class | words |
|---|---|---|
| 0 | 0 | [.] |
| 1 | 1 | [,] |
| 2 | 2 | [a] |
| 3 | 3 | [v] |
| 4 | 4 | [se] |
| 5 | 5 | [na] |
| 6 | 3730 | [o, by, ve, po, ze, před] |
| 7 | 3727 | [", s, V, Na, listopadu, OKD] |
| 8 | 8 | [-] |
| 9 | 3715 | [že, ale, které, který, aby] |


## 3. Tag Classes

> Use the same original data as above, but this time, you will compute the classes for tags (the strings after slashes). Compute tag classes for all tags appearing 5 times or more in the data. Use as much data as time allows. You will be graded relative to the other students' results. Again, note the full history of merges, and attach it to your homework. Pick three interesting classes as the algorithm goes (English data only; Czech optional), and comment on them (why you think you see those tags there together (or not), etc.).
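
The frequency cutoff in the task decides which tags start out as their own class. A minimal sketch of that selection step is shown below, assuming a plain count threshold; `frequent_items` and the toy tag list are hypothetical, and how the `word_cutoff` argument of `LmCluster` treats the remaining tags is not shown in this notebook.

```python
from collections import Counter

def frequent_items(tokens, cutoff):
    """Return the distinct tokens that occur at least `cutoff` times, in sorted order."""
    counts = Counter(tokens)
    return sorted(token for token, count in counts.items() if count >= cutoff)

# Hypothetical toy tag sequence: three tags seen 5 times each, one tag seen twice
toy_tags = ["DT", "NN", "VBZ"] * 5 + ["FW"] * 2
kept_tags = frequent_items(toy_tags, cutoff=5)

print(kept_tags)
```

Tags below the cutoff would still remain in the sequence for the bigram counts, mirroring the instruction given for words in the previous task.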

In [127]:
cluster_en_tag = LmCluster(tags_en, word_cutoff=5)
cluster_en_tag.cluster()







In [128]:
history(cluster_en_tag)[:20]

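| | class 1 | class 2 | cluster id | mutual_information_loss |
|---|---|---|---|---|
| 0 | IN | WP$ | 36 | 0.010274 |
| 1 | RBR | 36 | 37 | 0.009367 |
| 2 | ( | 37 | 38 | 0.008524 |
| 3 | RB | WP | 39 | 0.012271 |
| 4 | " | 38 | 40 | 0.013501 |
| 5 | FW | 40 | 41 | 0.007315 |
| 6 | NNPS | 41 | 42 | 0.007394 |
| 7 | SYM | 39 | 43 | 0.012683 |
| 8 | WRB | 42 | 44 | 0.010462 |
| 9 | TO | RBS | 45 | 0.010699 |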


In [None]:
cluster_cz_tag = LmCluster(tags_cz, word_cutoff=5)
cluster_cz_tag.cluster()

In [None]:
history(cluster_cz_tag)[:20]