# [Assignment #2: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign2.html)

## Words and The Company They Keep

### Author: Dan Kondratyuk

### March 2, 2018

---

This Python notebook examines word association pairs using pointwise mutual information, as well as word classes.

The code and an explanation of the results are fully viewable within this webpage.

## Files

- [index.html](./index.html) - Contains all viewable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-2.ipynb](./nlp-assignment-2.ipynb) - Jupyter notebook where code can be run
- [requirements.txt](./requirements.txt) - Python packages required to run the code

## 1. Best Friends

#### Problem Statement
>  In this task you will do a simple exercise to find out the best word association pairs using the pointwise mutual information method.

> First, you will have to prepare the data: take the same texts as in the previous assignment, i.e.

> `TEXTEN1.txt` and `TEXTCZ1.txt`

> (For this part of Assignment 2, there is no need to split the data in any way.)

> Compute the pointwise mutual information for all the possible word pairs appearing consecutively in the data, **disregarding pairs in which one or both words appear less than 10 times in the corpus**, and sort the results from the best to the worst (did you get any negative values? Why?) Tabulate the results, and show the best 20 pairs for both data sets.

> Do the same now but for distant words, i.e. words which are at least 1 word apart, but not farther than 50 words (both directions). Again, tabulate the results, and show the best 20 pairs for both data sets. 

### Process Text

In [131]:
# Import Python packages
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import nltk
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import collections as c
from tqdm import tqdm

# Configure Plots
plt.rcParams['lines.linewidth'] = 4

In [3]:
np.random.seed(200) # Set a seed so that this notebook has the same output each time

In [4]:
def open_text(filename):
    """Reads a text file with one token per line and returns an array of whitespace-stripped words"""
    with open(filename, encoding='iso-8859-2') as f:
        # Light preprocessing: strip the newline and surrounding whitespace from each line
        return np.array([line.strip() for line in f])

In [5]:
# Read the texts into memory
english = './TEXTEN1.txt'
czech = './TEXTCZ1.txt'

words_en = open_text(english)
words_cz = open_text(czech)

In [89]:
class LanguageModel:
    """Counts words and calculates the probabilities of a language model"""
    
    def __init__(self, words, min_words=10):
        self.min_words = min_words
        
        # Unigrams
        self.unigrams = words
        self.unigram_set = list(set(self.unigrams))
        self.total_unigram_count = len(self.unigrams)
        self.unigram_dist = c.Counter(self.unigrams)
        
        # Bigrams
        self.bigrams = list(nltk.bigrams(words))
        self.bigram_set = list(set(self.bigrams))
        self.total_bigram_count = len(self.bigrams)
        self.bigram_dist = c.Counter(self.bigrams)
    
    def p_unigram(self, w):
        """Calculates the probability a unigram appears in the distribution"""
        return self.div(self.unigram_dist[w], self.total_unigram_count)
    
    def p_bigram(self, wprev, w):
        """Calculates the probability a bigram appears in the distribution"""
        return self.div(self.bigram_dist[(wprev, w)], self.total_bigram_count)
    
    def div(self, a, b):
        """Divides a and b safely"""
        return a / b if b != 0 else 0
    
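    def pointwise_mi(self, wprev, w, p_bigram_func=None):
        """Calculates the pointwise mutual information of a word pair (natural log, so values are in nats)"""
        # An alternative pair probability (e.g. for distant pairs) can be passed in; default is the adjacent-bigram probability
        p_bigram_func = self.p_bigram if p_bigram_func is None else p_bigram_func
        return np.log(self.div(p_bigram_func(wprev, w), self.p_unigram(wprev) * self.p_unigram(w)))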
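
To make the quantity computed by `pointwise_mi` concrete, here is a minimal standalone sketch on a toy sentence. The function name `toy_pmi` and the sentence are illustrative assumptions, not part of the notebook; the sketch mirrors the class above in using the natural log, the token count as the unigram total, and the number of adjacent pairs as the bigram total.

```python
import math
from collections import Counter

def toy_pmi(tokens):
    """Return {(w1, w2): PMI} for all adjacent pairs, using the natural log."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_unigrams
        p_w2 = unigram_counts[w2] / n_unigrams
        # PMI compares the observed pair probability with the independence baseline
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical toy data, not taken from the assignment corpora
tokens = "new york is big and new york is old".split()
scores = toy_pmi(tokens)
print(scores[("new", "york")])
```

On a corpus this small every observed pair scores well above zero, which is one reason the assignment applies a minimum frequency of 10 before ranking pairs.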

In [90]:
lm_en = LanguageModel(words_en)
lm_cz = LanguageModel(words_cz)

In [91]:
def mutual_information(lm):
    """Calculates the pointwise mutual information of all adjacent word pairs that pass the frequency threshold"""
    # Keep only the distinct pairs in which both words appear at least `lm.min_words` (10) times in the corpus
    pairs = [pair for pair in lm.bigram_set
             if lm.unigram_dist[pair[0]] >= lm.min_words 
             and lm.unigram_dist[pair[1]] >= lm.min_words]

    mi = [(' '.join(pair), lm.pointwise_mi(*pair)) for pair in pairs]
    return pd.DataFrame(mi, columns=['pair', 'mutual_information'])

In [92]:
mi_en = mutual_information(lm_en).sort_values(by='mutual_information', ascending=False)
mi_cz = mutual_information(lm_cz).sort_values(by='mutual_information', ascending=False)

In [75]:
mi_en[:20]

| index | pair | mutual_information |
|---|---|---|
| 27778 | La Plata | 9.821459 |
| 28504 | Asa Gray | 9.726149 |
| 15255 | Fritz Muller | 9.261843 |
| 34382 | worth while | 9.241641 |
| 3948 | faced tumbler | 9.192851 |
| 26899 | lowly organised | 9.161256 |
| 13577 | Malay Archipelago | 9.08749 |
| 38874 | shoulder stripe | 9.048269 |
| 4289 | Great Britain | 8.951688 |
| 12350 | United States | 8.905168 |


In [76]:
mi_cz[:20]

| index | pair | mutual_information |
|---|---|---|
| 38169 | Hamburger SV | 9.904346 |
| 37404 | Los Angeles | 9.747342 |
| 3623 | Johna Newcomba | 9.539703 |
| 24017 | Č. Budějovice | 9.45009 |
| 11934 | série ATP | 9.335977 |
| 7319 | turnajové série | 9.312024 |
| 31450 | Tomáš Ježek | 9.30826 |
| 23679 | Lidové noviny | 9.239598 |
| 11145 | Lidových novin | 9.198776 |
| 26842 | veřejného mínění | 9.054195 |


In [77]:
mi_en[:-5:-1]

| index | pair | mutual_information |
|---|---|---|
| 12142 | the , | -6.092961 |
| 29793 | . the | -5.827604 |
| 28483 | . of | -5.477215 |
| 31817 | of . | -5.477215 |
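
The problem statement asks why negative values appear. Pointwise mutual information is negative whenever a pair occurs together less often than it would if the two words were independent, which is typical for very frequent tokens that rarely sit next to each other. The sketch below illustrates this with made-up counts; the helper `pmi_from_counts` and all numbers are hypothetical and are not taken from `TEXTEN1.txt`.

```python
import math

def pmi_from_counts(pair_count, count_w1, count_w2, n_tokens):
    """PMI (natural log) from raw counts, with n_tokens - 1 adjacent pairs in total."""
    n_pairs = n_tokens - 1
    p_pair = pair_count / n_pairs
    expected = (count_w1 / n_tokens) * (count_w2 / n_tokens)
    return math.log(p_pair / expected)

# Hypothetical counts: two frequent tokens that are adjacent only once
frequent_but_rarely_adjacent = pmi_from_counts(
    pair_count=1, count_w1=1000, count_w2=1000, n_tokens=100_001)

# Hypothetical counts: two rare tokens that always occur together
rare_but_always_adjacent = pmi_from_counts(
    pair_count=10, count_w1=10, count_w2=10, n_tokens=100_001)

print(frequent_but_rarely_adjacent)  # below zero: observed less often than chance
print(rare_but_always_adjacent)      # above zero: observed far more often than chance
```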


In [132]:
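def mutual_information(lm):
    """Calculates the pointwise mutual information of distant word pairs, separately for each distance.

    Here `distance` is the number of words between the two words of a pair,
    and only forward pairs (first word before second) are collected.
    Note: this redefines `mutual_information` from the adjacent-pair cell above.
    """
    results = []
    
    max_distance = 50
    for distance in tqdm(range(1, max_distance+1)):
        # Pair each word with the word `distance + 1` positions later
        pair_list = list(zip(lm.unigrams, lm.unigrams[distance+1:]))
        dist = c.Counter(pair_list)
    
        # Keep only the distinct pairs in which both words appear at least `lm.min_words` (10) times in the corpus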
        pairs = [pair for pair in list(set(pair_list))
                 if lm.unigram_dist[pair[0]] >= lm.min_words 
                 and lm.unigram_dist[pair[1]] >= lm.min_words]
        
        p_bigram = lambda wprev, w: lm.div(dist[(wprev, w)], lm.total_bigram_count)

        mi = [(distance, wprev, w, lm.pointwise_mi(wprev, w, p_bigram)) for wprev,w in pairs]
        results += mi
        
    return pd.DataFrame(results, columns=['distance', 'word0', 'word1', 'mutual_information'])
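
The pairing step in the cell above relies on zipping the word list with a shifted copy of itself. A minimal sketch of that idea follows; the helper name `pairs_with_gap` and the example phrase are assumptions made for illustration.

```python
def pairs_with_gap(tokens, gap):
    """Pair each token with the token `gap + 1` positions later, so `gap` words lie in between."""
    return list(zip(tokens, tokens[gap + 1:]))

# Illustrative phrase, assumed here rather than quoted from the corpus
tokens = "survival of the fittest".split()
one_between = pairs_with_gap(tokens, 1)
two_between = pairs_with_gap(tokens, 2)

print(one_between)
print(two_between)
```

Because `zip` stops at the shorter sequence, no padding is needed at the end of the text. With a gap of 2 the phrase yields the pair (`survival`, `fittest`), which would be consistent with a `distance` value of 2 if the corpus contains that phrase.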

In [133]:
mi_dist_en = mutual_information(lm_en).sort_values(by='mutual_information', ascending=False)
mi_dist_cz = mutual_information(lm_cz).sort_values(by='mutual_information', ascending=False)

100%|██████████| 50/50 [00:26<00:00,  1.90it/s]
100%|██████████| 50/50 [00:26<00:00,  1.88it/s]


In [135]:
mi_dist_en[:20]

| index | distance | word0 | word1 | mutual_information |
|---|---|---|---|---|
| 79729 | 2 | survival | fittest | 9.533777 |
| 10124 | 1 | dimorphic | trimorphic | 9.255909 |
| 127156 | 2 | Alph | Candolle | 9.174832 |
| 215505 | 3 | H | Watson | 9.128312 |
| 27950 | 1 | Alph | de | 9.048269 |
| 110454 | 2 | Old | Worlds | 9.048269 |
| 2510 | 1 | E | Forbes | 8.974161 |
| 94792 | 2 | unimportant | welfare | 8.927641 |
| 211808 | 3 | carrier | faced | 8.799808 |
| 57455 | 1 | rarer | rarer | 8.682025 |


In [136]:
mi_dist_cz[:20]

| index | distance | word0 | word1 | mutual_information |
|---|---|---|---|---|
| 34407 | 1 | ODÚ | VPN | 9.786563 |
| 2022 | 1 | turnajové | ATP | 9.437187 |
| 28687 | 1 | Mistrovství | turnajové | 9.295357 |
| 415513 | 8 | výher | výher | 9.231401 |
| 7404 | 1 | Čechy | Slováky | 9.221249 |
| 106914 | 3 | Mistrovství | ATP | 9.152256 |
| 1003827 | 19 | prohraná | dvojchyby | 9.130277 |
| 94486 | 2 | soužití | Slováků | 9.054195 |
| 662244 | 13 | prohraná | esa | 9.046895 |
| 408698 | 8 | III | IV | 9.028877 |


## 2. Word Classes

#### Problem Statement

> **The Data**

> Get `TEXTEN1.ptg`, `TEXTCZ1.ptg`. These are your data. They are almost the same as the .txt data you have used so far, except they now contain the part of speech tags in the following form:

> `rady/NNFS2-----A----`  
> `,/Z:-------------`

> where the tag is separated from the word by a slash ('/'). Be careful: the tags might contain everything (including slashes, dollar signs and other weird characters). It is guaranteed however that there is no slash-word.

> Similarly for the English texts (except the tags are shorter of course).
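
One way to read the `.ptg` format described above is to split each line at the first slash only, since the tag may itself contain slashes. The sketch below assumes that the "no slash-word" guarantee means the word part never contains a slash; the helper `split_token` and the third example token are hypothetical.

```python
def split_token(token):
    """Split 'word/tag' at the first slash, leaving any later slashes inside the tag."""
    word, _, tag = token.partition('/')
    return word, tag

examples = [
    "rady/NNFS2-----A----",   # from the problem statement
    ",/Z:-------------",      # from the problem statement
    "word/TAG/WITH/SLASH",    # hypothetical token whose tag contains slashes
]
parsed = [split_token(token) for token in examples]

for word, tag in parsed:
    print(word, tag)
```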

> **The Task**

> Compute a full class hierarchy of **words** using the first 8,000 words of those data, and only for words occurring 10 times or more (use the same setting for both languages). Ignore the other words for building the classes, but keep them in the data for the bigram counts. For details on the algorithm, use the Brown et al. paper distributed in the class; some formulas are wrong, however, so please see the corrections on the web (Class 12, formulas for Trick \#4). Note the history of the merges, and attach it to your homework. Now run the same algorithm again, but stop when reaching 15 classes. Print out all the members of your 15 classes and attach them too.

> Hints:

> The initial mutual information is (English, words, limit 8000):

> `4.99726326162518` (if you add one extra word at the beginning of the data)  
> `4.99633675507535` (if you use the data as they are and are careful at the beginning and end).

> NB: the above numbers are finally confirmed from an independent source :-).

> The first 5 merges you get on the English data should be:

> `case subject`  
> `cannot may`  
> `individuals structure`  
> `It there`  
> `even less`  

> The loss of Mutual Information when merging the words "case" and "subject":

> Minimal loss: `0.00219656653357569` for case+subject 
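
The quantities in the hints above (the initial mutual information and the loss from a merge) can be illustrated with a naive sketch. This is not the optimized procedure from the Brown et al. paper and is not claimed to reproduce the hint values: it simply recomputes the mutual information between adjacent classes (in bits) before and after a merge. The function names, the use of left and right marginals taken from the bigram counts, and the toy data are all assumptions.

```python
import math
from collections import Counter

def class_mutual_information(tokens, word_to_class=None):
    """Mutual information (bits) between adjacent classes; unmapped words are their own class."""
    word_to_class = word_to_class or {}
    classes = [word_to_class.get(word, word) for word in tokens]
    bigram_counts = Counter(zip(classes, classes[1:]))
    total = sum(bigram_counts.values())

    # Marginal counts of a class in the left and right position of a bigram
    left_counts, right_counts = Counter(), Counter()
    for (left, right), count in bigram_counts.items():
        left_counts[left] += count
        right_counts[right] += count

    mi = 0.0
    for (left, right), count in bigram_counts.items():
        # p(l, r) * log2( p(l, r) / (p(l) * p(r)) ), written with raw counts
        ratio = count * total / (left_counts[left] * right_counts[right])
        mi += (count / total) * math.log2(ratio)
    return mi

def merge_loss(tokens, word_a, word_b):
    """Loss of mutual information when word_a and word_b are put into one class."""
    merged_name = word_a + "+" + word_b
    merged = {word_a: merged_name, word_b: merged_name}
    return class_mutual_information(tokens) - class_mutual_information(tokens, merged)

# Hypothetical toy data
tokens = "a b a b a b".split()
mi_before = class_mutual_information(tokens)
loss = merge_loss(tokens, "a", "b")
print(mi_before, loss)
```

A greedy clustering pass would evaluate this loss for every candidate pair and merge the pair with the smallest loss, repeating until the desired number of classes remains. Recomputing everything from scratch for each candidate is slow, which is what the tricks in the paper are meant to avoid.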