1. Use SpaCy tokenizer API to tokenize the text from the law corpus.
2. Compute bigram counts of downcased tokens. Given the sentence: "The quick brown fox jumps over the lazy dog.", the bigram counts are as follows:
- "the quick": 1
- "quick brown": 1
- "brown fox": 1
- ...
- "dog .": 1
3. Discard bigrams containing characters other than letters. Make sure that you discard the invalid entries after computing the bigram counts.
4. Use pointwise mutual information to compute the measure for all pairs of words.
5. Sort the word pairs according to that measure in the descending order and determine top 10 entries.
6. Filter bigrams with number of occurrences lower than 5. Determine top 10 entries for the remaining dataset (>=5 occurrences).
7. Use log likelihood ratio (LLR) to compute the measure for all pairs of words.
8. Sort the word pairs according to that measure in the descending order and display top 10 entries.
9. Compute trigram counts for the whole corpus and perform the same filtering.
10. Use PMI (with 5 occurrence threshold) and LLR to compute top 10 results for the trigrams. Devise a method for computing the values, based on the results for bigrams.
11. Create a table comparing the methods (separate table for bigrams and trigrams).
12. Answer the following questions:
- Why do we have to filter the bigrams, rather than the token sequence?
- Which measure (PMI, PMI with filtering, LLR) works better for the bigrams and which for the trigrams?
- What types of expressions are discovered by the methods?
- Can you devise a different type of filtering that would yield better results?
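
The bigram example from step 2 can be reproduced with a few lines of plain Python. This is only a minimal sketch: the regular-expression tokenizer is an assumption standing in for the SpaCy tokenizer, and `bigram_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent pairs of downcased tokens in `text`."""
    # Stand-in tokenizer (assumption): runs of word characters or single punctuation marks.
    tokens = [t.lower() for t in re.findall(r"\w+|[^\w\s]", text)]
    return Counter(zip(tokens, tokens[1:]))

counts = bigram_counts("The quick brown fox jumps over the lazy dog.")
print(counts[("the", "quick")], counts[("dog", ".")])
```

Because the tokens are downcased before pairing, "The quick" and "the lazy" both start with the same token "the".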

In [1]:
import glob
import os
import tarfile
from datetime import datetime
from math import log

import pandas as pd
from spacy.lang.pl import Polish
from spacy.tokenizer import Tokenizer

In [2]:
if len(glob.glob("*.txt")) == 0:
    bills = tarfile.open("ustawy.tar.gz", "r:gz")
    bills.extractall()
    bills.close()

### 1)

In [3]:
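# Note: a Tokenizer built from the vocab alone has no prefix/suffix/infix rules,
# so it splits on whitespace only and punctuation stays attached to words.
tokenizer = Tokenizer(Polish().vocab)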

### 2)

In [4]:
bigrams_counts = {}

for file in glob.glob("*.txt"):
    with open(file, 'r', encoding='utf-8') as f:
        bill = f.read()
    # Downcase each token, then count every pair of adjacent tokens.
    tokens = [token.text.strip().lower() for token in tokenizer(bill)]
    for bigram in zip(tokens, tokens[1:]):
        bigrams_counts[bigram] = bigrams_counts.get(bigram, 0) + 1

In [5]:
bigrams_df = pd.DataFrame.from_dict(bigrams_counts, orient='index').reset_index()

In [6]:
bigrams_df.columns = ['bigram', 'count']

In [7]:
bigrams_df

Unnamed: 0,bigram,count
0,"(, dz.u.)",950
1,"(dz.u., z)",1212
2,"(z, 1993)",511
3,"(1993, r.)",762
4,"(r., nr)",16952
...,...,...
910598,"(wyrazami, ""odpadów)",1
910599,"(""odpadów, opakowaniowych)",1
910600,"(handlowej, tej)",1
910601,"(tej, jednostki"".)",1


### 3)

In [8]:
bigrams_df[['first_word', 'second_word']] = pd.DataFrame(bigrams_df['bigram'].tolist(), index=bigrams_df.index)

In [9]:
bigrams_df

Unnamed: 0,bigram,count,first_word,second_word
0,"(, dz.u.)",950,,dz.u.
1,"(dz.u., z)",1212,dz.u.,z
2,"(z, 1993)",511,z,1993
3,"(1993, r.)",762,1993,r.
4,"(r., nr)",16952,r.,nr
...,...,...,...,...
910598,"(wyrazami, ""odpadów)",1,wyrazami,"""odpadów"
910599,"(""odpadów, opakowaniowych)",1,"""odpadów",opakowaniowych
910600,"(handlowej, tej)",1,handlowej,tej
910601,"(tej, jednostki"".)",1,tej,"jednostki""."


In [10]:
bigrams_df = bigrams_df[(bigrams_df['first_word'].str.isalpha()) & (bigrams_df['second_word'].str.isalpha())]

In [11]:
bigrams_df = bigrams_df.sort_values(by='count', ascending=False)

In [12]:
bigrams_df

Unnamed: 0,bigram,count,first_word,second_word
281,"(mowa, w)",27628,mowa,w
581,"(których, mowa)",12963,których,mowa
580,"(o, których)",12562,o,których
12,"(z, dnia)",8975,z,dnia
1372,"(którym, mowa)",8683,którym,mowa
...,...,...,...,...
438301,"(jest, srebrny)",1,jest,srebrny
438302,"(srebrny, orzeł)",1,srebrny,orzeł
438303,"(orzeł, ze)",1,orzeł,ze
438304,"(znaku, ministerstwa)",1,znaku,ministerstwa
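
The order of operations in step 3 matters: counting first and filtering afterwards gives a different result than filtering the token sequence first. The sketch below illustrates this on a short, made-up token sequence; the helper name `bigrams` and the sequence itself are assumptions for illustration only.

```python
from collections import Counter

def bigrams(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical token sequence with a comma between two clauses.
tokens = ["minister", "właściwy", ",", "o", "którym", "mowa"]

# Order used in the notebook: count on the full sequence, then drop non-alphabetic bigrams.
filter_after = {bg: c for bg, c in bigrams(tokens).items()
                if all(word.isalpha() for word in bg)}

# Reverse order: dropping tokens first makes "właściwy" and "o" adjacent.
filter_before = bigrams([t for t in tokens if t.isalpha()])

print(("właściwy", "o") in filter_after, ("właściwy", "o") in filter_before)
```

The pair ("właściwy", "o") never occurs in the sequence, yet it is counted when the tokens are filtered before the bigrams are built.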


### 4)

In [13]:
# Count every downcased token across the whole corpus.
global_frequencies = {}
for file in glob.glob("*.txt"):
    with open(file, 'r', encoding='utf-8') as f:
        bill = f.read()
    for token in tokenizer(bill):
        word = token.text.strip().lower()
        global_frequencies[word] = global_frequencies.get(word, 0) + 1

freq_df = pd.DataFrame.from_dict(global_frequencies, orient='index').reset_index()
freq_df.columns = ['token', 'count']

# Keep alphabetic tokens of at least two letters. Note: words_sum is later computed from
# this filtered table, while global_frequencies still contains single-letter tokens.
freq_df = freq_df[(freq_df['token'].str.isalpha()) & (freq_df['token'].str.len() >= 2)]
freq_df = freq_df.reset_index(drop=True).sort_values(by='count', ascending=False)

In [14]:
freq_df

Unnamed: 0,token,count
31,do,60160
108,na,50237
54,lub,45325
0,nr,44918
15,się,43967
...,...,...
38299,ratującej,1
38300,asortymencie,1
38301,wstrzymywanych,1
38302,wycofywanych,1


In [15]:
start = datetime.now()
bigrams_df['bigram_first_word_count']=bigrams_df.apply(lambda x: global_frequencies.get(x['first_word']), axis=1)
bigrams_df['bigram_second_word_count']=bigrams_df.apply(lambda x: global_frequencies.get(x['second_word']), axis=1)

bigrams_sum = bigrams_df['count'].sum()
words_sum = freq_df['count'].sum()

bigrams_df['pmi'] = bigrams_df.apply(lambda x: log((x['count']/bigrams_sum)/((x['bigram_first_word_count']/words_sum)*(x['bigram_second_word_count']/words_sum))),axis=1)
time = datetime.now() - start
print("Time of operation: {}".format(time))

Time of operation: 0:00:10.689144


In [16]:
bigrams_df

Unnamed: 0,bigram,count,first_word,second_word,bigram_first_word_count,bigram_second_word_count,pmi
281,"(mowa, w)",27628,mowa,w,28756,200236,2.649098
581,"(których, mowa)",12963,których,mowa,17895,28756,4.307343
580,"(o, których)",12562,o,których,64281,17895,3.471502
12,"(z, dnia)",8975,z,dnia,81806,17617,2.909839
1372,"(którym, mowa)",8683,którym,mowa,11759,28756,4.326513
...,...,...,...,...,...,...,...
438301,"(jest, srebrny)",1,jest,srebrny,13012,4,4.036445
438302,"(srebrny, orzeł)",1,srebrny,orzeł,4,8,11.430631
438303,"(orzeł, ze)",1,orzeł,ze,8,5490,4.206242
438304,"(znaku, ministerstwa)",1,znaku,ministerstwa,159,158,4.764868


### 5)

In [17]:
bigrams_df = bigrams_df.sort_values(by='pmi', ascending=False)
bigrams_df.head(10)

Unnamed: 0,bigram,count,first_word,second_word,bigram_first_word_count,bigram_second_word_count,pmi
613336,"(gruczołów, chłonnych)",1,gruczołów,chłonnych,1,1,14.896367
3678,"(automatyki, grzewczej)",1,automatyki,grzewczej,1,1,14.896367
603553,"(chlamydia, trachomatis)",1,chlamydia,trachomatis,1,1,14.896367
603569,"(enterococcus, faecalis)",1,enterococcus,faecalis,1,1,14.896367
739512,"(zmniejszą, widoczności)",1,zmniejszą,widoczności,1,1,14.896367
393857,"(płyn, alkoholowy)",1,płyn,alkoholowy,1,1,14.896367
372453,"(winianu, potasowego)",1,winianu,potasowego,1,1,14.896367
110553,"(motylkowatych, drobnonasiennych)",1,motylkowatych,drobnonasiennych,1,1,14.896367
902666,"(klifów, nadmorskich)",1,klifów,nadmorskich,1,1,14.896367
633102,"(uprzywilejowanym, wierzytelnościom)",1,uprzywilejowanym,wierzytelnościom,1,1,14.896367
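
All ten entries above share the same PMI because both words of each pair occur exactly once in the corpus; in that case the score depends only on the corpus totals. The sketch below restates the PMI formula used in the notebook as a standalone function. The totals `N_BIGRAMS` and `N_WORDS` are hypothetical values chosen for illustration, not the corpus figures.

```python
from math import log, isclose

def pmi(pair_count, first_count, second_count, n_bigrams, n_words):
    """PMI = ln( p(a, b) / (p(a) * p(b)) ), with the natural log as in the notebook."""
    p_pair = pair_count / n_bigrams
    p_first = first_count / n_words
    p_second = second_count / n_words
    return log(p_pair / (p_first * p_second))

# Hypothetical corpus totals, for illustration only.
N_BIGRAMS, N_WORDS = 2_000_000, 3_000_000

hapax = pmi(1, 1, 1, N_BIGRAMS, N_WORDS)  # both words occur once, always together
five = pmi(5, 5, 5, N_BIGRAMS, N_WORDS)   # both words occur five times, always together

# The gap equals ln(5) regardless of the totals.
print(isclose(hapax - five, log(5)))
```

The same gap of about 1.609 separates the unfiltered maximum (14.896367) from the best score after the threshold of 5 occurrences (13.286929), which is why unfiltered PMI always ranks pairs of one-off words first.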


### 6)

In [18]:
bigrams_pmi_below_5 = bigrams_df[bigrams_df['count'] < 5]

In [19]:
bigrams_pmi_below_5

Unnamed: 0,bigram,count,first_word,second_word,bigram_first_word_count,bigram_second_word_count,pmi
613336,"(gruczołów, chłonnych)",1,gruczołów,chłonnych,1,1,14.896367
3678,"(automatyki, grzewczej)",1,automatyki,grzewczej,1,1,14.896367
603553,"(chlamydia, trachomatis)",1,chlamydia,trachomatis,1,1,14.896367
603569,"(enterococcus, faecalis)",1,enterococcus,faecalis,1,1,14.896367
739512,"(zmniejszą, widoczności)",1,zmniejszą,widoczności,1,1,14.896367
...,...,...,...,...,...,...,...
530684,"(na, w)",1,na,w,50237,200236,-8.135392
684854,"(w, do)",1,w,do,200236,60160,-8.315648
530053,"(w, o)",1,w,o,200236,64281,-8.381905
528372,"(w, z)",1,w,z,200236,81806,-8.622991


In [20]:
bigrams_pmi_above_5 = bigrams_df[bigrams_df['count'] >= 5]
bigrams_pmi_above_5.head(10)

Unnamed: 0,bigram,count,first_word,second_word,bigram_first_word_count,bigram_second_word_count,pmi
123271,"(ręcznego, miotacza)",5,ręcznego,miotacza,5,5,13.286929
421616,"(młyny, kulowe)",5,młyny,kulowe,5,5,13.286929
544574,"(zaszkodzić, wynikom)",5,zaszkodzić,wynikom,5,5,13.286929
93035,"(świeckie, przygotowujące)",5,świeckie,przygotowujące,5,5,13.286929
592464,"(grzegorz, schetyna)",5,grzegorz,schetyna,5,5,13.286929
270522,"(mleczka, makowego)",5,mleczka,makowego,6,5,13.104607
823394,"(adama, mickiewicza)",6,adama,mickiewicza,6,6,13.104607
421458,"(przeponowe, rurowe)",5,przeponowe,rurowe,5,6,13.104607
895445,"(schedę, spadkową)",7,schedę,spadkową,7,7,12.950457
79374,"(papierem, wartościowym)",5,papierem,wartościowym,7,5,12.950457


### 7)

In [21]:
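def H(counts):
    # Unnormalised entropy term: -sum(k * ln(k / total)); the (k == 0) guard makes empty cells contribute 0.
    total = float(sum(counts))
    return -sum([k * log(k / total + (k==0)) for k in counts])


def llr_for_bigrams(row, total_tokens):
    # 2x2 contingency table for the pair (first_word, second_word).
    k11 = row['count']                                    # both words together
    k12 = row['bigram_second_word_count'] - row['count']  # second word without the first
    k21 = row['bigram_first_word_count'] - row['count']   # first word without the second
    k22 = total_tokens - k12 - k21 - k11                  # neither word
    
    return 2 * (H([k11 + k12, k21 + k22]) +
                H([k11 + k21, k12 + k22]) -
                H([k11, k12, k21, k22]))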

In [22]:
token_counts = {}

start = datetime.now()

bigrams_df['llr'] = bigrams_df.apply(lambda row: llr_for_bigrams(row, bigrams_sum), axis=1)

time = datetime.now() - start
print("Time of operation: {}".format(time))

Time of operation: 0:00:19.680340


### 8)

In [23]:
bigrams_df.sort_values(by='llr', ascending=False).head(10)

Unnamed: 0,bigram,count,first_word,second_word,bigram_first_word_count,bigram_second_word_count,pmi,llr
281,"(mowa, w)",27628,mowa,w,28756,200236,2.649098,124224.500981
581,"(których, mowa)",12963,których,mowa,17895,28756,4.307343,97187.006255
580,"(o, których)",12562,o,których,64281,17895,3.471502,68637.189659
1372,"(którym, mowa)",8683,którym,mowa,11759,28756,4.326513,63971.727994
75,"(dodaje, się)",7956,dodaje,się,8422,43967,4.148251,59451.445538
12443,"(do, spraw)",8182,do,spraw,60160,9890,3.702016,50269.517707
1371,"(o, którym)",8499,o,którym,64281,11759,3.500677,46712.250625
798484,"(w, w)",1,w,w,200236,200236,-9.518137,42226.233484
106,"(stosuje, się)",5755,stosuje,się,6669,43967,4.057771,39987.231886
12441,"(minister, właściwy)",3870,minister,właściwy,6859,6049,5.616411,39178.393304
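
The entry (w, w) with a single occurrence may look out of place in this ranking. LLR measures how far the observed counts deviate from independence in either direction, so a pair that is seen far less often than expected also gets a high score. The sketch below shows this with the same contingency-table layout as the notebook; `TOTAL` and all counts are hypothetical numbers, not corpus values.

```python
from math import log

def H(counts):
    """Unnormalised entropy term; empty cells contribute nothing."""
    total = float(sum(counts))
    return -sum(k * log(k / total) for k in counts if k > 0)

def llr(pair_count, first_count, second_count, total):
    """Log likelihood ratio for a 2x2 table built from marginal counts."""
    k11 = pair_count
    k12 = second_count - pair_count
    k21 = first_count - pair_count
    k22 = total - k11 - k12 - k21
    return 2 * (H([k11 + k12, k21 + k22]) +
                H([k11 + k21, k12 + k22]) -
                H([k11, k12, k21, k22]))

TOTAL = 1_000_000  # hypothetical number of bigrams

independent = llr(100, 1_000, 100_000, TOTAL)  # observed equals expected (1000 * 100000 / 1e6)
attracted = llr(900, 1_000, 100_000, TOTAL)    # seen far more often than expected
repelled = llr(1, 200_000, 200_000, TOTAL)     # seen far less often than the expected 40000

print(independent, attracted, repelled)
```

If only positively associated pairs are of interest, one option is to keep the LLR score only for pairs whose observed count exceeds the count expected under independence.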


### 9)

In [24]:
trigrams_counts={}
        
for file in glob.glob("*.txt"):
    with open(os.getcwd() + "/" + file, 'r', encoding='utf-8') as f:
        bill = f.read()
    tokenized_bill = tokenizer(bill)
    for i in range(len(tokenized_bill) - 2):
        trigram = (tokenized_bill[i].text.strip().lower(), tokenized_bill[i+1].text.strip().lower(), tokenized_bill[i+2].text.strip().lower())
        trigrams_counts[trigram] = trigrams_counts.get(trigram, 0) + 1
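
The bigram and trigram loops differ only in the window size, so both could be expressed with one helper. The following is a minimal sketch on a made-up token list; `ngram_counts` is a hypothetical name and not part of the notebook.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all runs of `n` consecutive tokens."""
    # zip over n shifted views of the list; zip stops at the shortest one.
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = ["o", "których", "mowa", "w", "art", "o", "których", "mowa"]
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
print(trigrams[("o", "których", "mowa")])
```

A sequence of `len(tokens)` tokens yields `len(tokens) - n + 1` n-grams, which is why the trigram loop has to stop two positions before the end.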

In [25]:
trigrams_df = pd.DataFrame.from_dict(trigrams_counts, orient='index').reset_index()

In [26]:
trigrams_df.columns=['trigram', 'count']

In [27]:
trigrams_df

Unnamed: 0,trigram,count
0,"(, dz.u., z)",948
1,"(dz.u., z, 1993)",8
2,"(z, 1993, r.)",502
3,"(1993, r., nr)",498
4,"(r., nr, 129,)",21
...,...,...
2052367,"(ofercie, handlowej, tej)",1
2052368,"(handlowej, tej, jednostki"".)",1
2052369,"(tej, jednostki""., )",1
2052370,"(jednostki""., , art.)",1


In [28]:
trigrams_df[['first_word', 'second_word', 'third_word']] = pd.DataFrame(trigrams_df['trigram'].tolist(), index=trigrams_df.index)
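# .copy() makes the filtered frame independent, so adding columns later does not raise SettingWithCopyWarning.
trigrams_df = trigrams_df[(trigrams_df['first_word'].str.isalpha()) & (trigrams_df['second_word'].str.isalpha()) & (trigrams_df['third_word'].str.isalpha())].copy()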

In [29]:
trigrams_df.sort_values(by='count', ascending=False)

Unnamed: 0,trigram,count,first_word,second_word,third_word
682,"(których, mowa, w)",12491,których,mowa,w
681,"(o, których, mowa)",11667,o,których,mowa
1682,"(którym, mowa, w)",8427,którym,mowa,w
1681,"(o, którym, mowa)",8012,o,którym,mowa
306,"(której, mowa, w)",5019,której,mowa,w
...,...,...,...,...,...
840923,"(naliczonego, przyjmuje, się)",1,naliczonego,przyjmuje,się
840922,"(podatku, naliczonego, przyjmuje)",1,podatku,naliczonego,przyjmuje
840910,"(zużytych, w, ramach)",1,zużytych,w,ramach
840909,"(lub, zużytych, w)",1,lub,zużytych,w


### 10)

In [30]:
start = datetime.now()
trigrams_df['trigram_first_word_count']=trigrams_df.apply(lambda x: global_frequencies.get(x['first_word']), axis=1)
trigrams_df['trigram_second_word_count']=trigrams_df.apply(lambda x: global_frequencies.get(x['second_word']), axis=1)
trigrams_df['trigram_third_word_count']=trigrams_df.apply(lambda x: global_frequencies.get(x['third_word']), axis=1)


trigrams_sum = trigrams_df['count'].sum()

trigrams_df['pmi'] = trigrams_df.apply(lambda x: log((x['count']/trigrams_sum)/((x['trigram_first_word_count']/words_sum)*(x['trigram_second_word_count']/words_sum)*(x['trigram_third_word_count']/words_sum))),axis=1)
time = datetime.now() - start
print("Time of operation: {}".format(time))

Time of operation: 0:00:21.966771



In [31]:
trigrams_df = trigrams_df.sort_values(by='pmi', ascending=False)

In [32]:
trigrams_df

Unnamed: 0,trigram,count,first_word,second_word,third_word,trigram_first_word_count,trigram_second_word_count,trigram_third_word_count,pmi
1360916,"(chleb, świętojański, strąkowy)",1,chleb,świętojański,strąkowy,1,1,1,29.942436
1692034,"(krzewiące, etos, społecznikowski)",1,krzewiące,etos,społecznikowski,1,1,1,29.942436
1766062,"(implantacji, stymulatora, nerwu)",1,implantacji,stymulatora,nerwu,1,1,1,29.942436
894014,"(prosimy, uważnie, przeczytać)",1,prosimy,uważnie,przeczytać,1,1,1,29.942436
1392561,"(suthma, eegcou, eok)",1,suthma,eegcou,eok,1,1,1,29.942436
...,...,...,...,...,...,...,...,...,...
1749453,"(oraz, i, w)",1,oraz,i,w,33088,89032,200236,-4.068493
1851705,"(i, oraz, w)",1,i,oraz,w,89032,33088,200236,-4.068493
289194,"(w, i, i)",2,w,i,i,200236,89032,89032,-4.365171
1700664,"(nr, i, w)",1,nr,i,w,44918,89032,200236,-4.374161
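
As with bigrams, the unfiltered trigram ranking is dominated by sequences of one-off words. Step 10 asks for PMI with a threshold of 5 occurrences; a minimal sketch of such a ranking for n-grams of any length is given below. The function name `top_k_by_pmi` and the toy counts are assumptions for illustration, and the probabilities are estimated from the dictionaries passed in rather than from the corpus.

```python
from math import log

def top_k_by_pmi(ngram_counts, word_counts, min_count=5, k=10):
    """Rank n-grams by PMI, keeping only those seen at least `min_count` times."""
    n_ngrams = sum(ngram_counts.values())
    n_words = sum(word_counts.values())
    scored = []
    for ngram, count in ngram_counts.items():
        if count < min_count:
            continue  # drop rare n-grams before ranking
        independent = 1.0
        for word in ngram:
            independent *= word_counts[word] / n_words
        scored.append((log((count / n_ngrams) / independent), ngram))
    return sorted(scored, reverse=True)[:k]

# Hypothetical toy counts.
word_counts = {"o": 50, "których": 20, "mowa": 20, "w": 100,
               "chleb": 1, "świętojański": 1, "strąkowy": 1}
trigram_counts = {
    ("o", "których", "mowa"): 12,
    ("których", "mowa", "w"): 10,
    ("chleb", "świętojański", "strąkowy"): 1,
}
ranking = top_k_by_pmi(trigram_counts, word_counts)
print([ngram for _, ngram in ranking])
```

The one-off trigram would have the highest PMI of the three, but it never reaches the ranking because it falls below the threshold.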


In [33]:
# H() is reused from the bigram LLR cell.
def llr_for_trigrams(row):
    # Treat the trigram (a, b, c) as a pair of overlapping bigrams (a, b) and (b, c)
    # and build the same 2x2 contingency table as for bigrams.
    k11 = row['count']
    ab_counts = bigrams_counts[(row['first_word'], row['second_word'])]
    bc_counts = bigrams_counts[(row['second_word'], row['third_word'])]
    k12 = bc_counts - row['count']
    k21 = ab_counts - row['count']
    k22 = trigrams_sum - k12 - k21 - k11
    
    return 2 * (H([k11 + k12, k21 + k22]) +
                H([k11 + k21, k12 + k22]) -
                H([k11, k12, k21, k22]))

In [34]:
token_counts = {}

start = datetime.now()

trigrams_df['llr'] = trigrams_df.apply(lambda row: llr_for_trigrams(row), axis=1)

time = datetime.now() - start
print("Time of operation: {}".format(time))

Time of operation: 0:00:34.194643


In [35]:
trigrams_df

Unnamed: 0,trigram,count,first_word,second_word,third_word,trigram_first_word_count,trigram_second_word_count,trigram_third_word_count,pmi,llr
1360916,"(chleb, świętojański, strąkowy)",1,chleb,świętojański,strąkowy,1,1,1,29.942436,30.479135
1692034,"(krzewiące, etos, społecznikowski)",1,krzewiące,etos,społecznikowski,1,1,1,29.942436,30.479135
1766062,"(implantacji, stymulatora, nerwu)",1,implantacji,stymulatora,nerwu,1,1,1,29.942436,30.479135
894014,"(prosimy, uważnie, przeczytać)",1,prosimy,uważnie,przeczytać,1,1,1,29.942436,30.479135
1392561,"(suthma, eegcou, eok)",1,suthma,eegcou,eok,1,1,1,29.942436,30.479135
...,...,...,...,...,...,...,...,...,...,...
1749453,"(oraz, i, w)",1,oraz,i,w,33088,89032,200236,-4.068493,9.663045
1851705,"(i, oraz, w)",1,i,oraz,w,89032,33088,200236,-4.068493,9.627750
289194,"(w, i, i)",2,w,i,i,200236,89032,89032,-4.365171,30.797363
1700664,"(nr, i, w)",1,nr,i,w,44918,89032,200236,-4.374161,9.663045


### 11)

#### Bigrams
| Measure | Run time | Qualitative result |
| --- | --- | --- |
| PMI | 10.3 s | Meaning is matched very well |
| LLR | 17.8 s | Highest LLR reserved for obvious sets, e.g. words with prepositions |

#### Trigrams
| Measure | Run time | Qualitative result |
| --- | --- | --- |
| PMI | 21.5 s | Meaning is matched very well |
| LLR | 33.86 s | Highest LLR reserved for obvious sets, e.g. words with prepositions |

### 12)

- i. Why do we have to filter the bigrams, rather than the token sequence?
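If non-alphabetic tokens were removed from the token sequence first, words that were separated by punctuation or numbers would become adjacent and produce bigrams that never occur in the text. Filtering the bigrams leaves the source text unmodified and only discards invalid results.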
- ii. Which measure (PMI, PMI with filtering, LLR) works better for the bigrams and which for the trigrams?
PMI is the best at finding pure meaning matches: for the trigrams, for example, the highest scores go to three-word sequences that only make sense as a unit. In our case it works better for both bigrams and trigrams.
- iii. What types of expressions are discovered by the methods?
LLR finds obvious, very frequent word sets, e.g. single words with prepositions, or several inflected variants of the same phrase ("o których mowa", roughly "referred to in", is the most common one).
PMI finds the sets in which the meaning is matched very well. It gives the highest scores to words that carry a particularly strong meaning together (e.g. a first name and a surname).
- iv. Can you devise a different type of filtering that would yield better results?
Focusing on the most important words, e.g. by filtering out all prepositions and other words that carry no real meaning on their own, could give better results.
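
One way to implement such a filter is to discard n-grams that start or end with a function word. The sketch below assumes a small hand-written stopword set, which is hypothetical and far from complete; the three counts are taken from the bigram table in step 3.

```python
# Hypothetical, incomplete list of Polish function words.
STOPWORDS = {"w", "o", "z", "do", "na", "i", "lub", "się"}

def drop_function_word_edges(ngram_counts, stopwords=STOPWORDS):
    """Keep only n-grams that neither start nor end with a stopword."""
    return {
        ngram: count
        for ngram, count in ngram_counts.items()
        if ngram[0] not in stopwords and ngram[-1] not in stopwords
    }

counts = {("mowa", "w"): 27628, ("z", "dnia"): 8975, ("minister", "właściwy"): 3870}
kept = drop_function_word_edges(counts)
print(kept)
```

Checking only the first and last token keeps multiword expressions with a preposition in the middle, which a blanket removal of all stopwords would lose.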