<h3>Multiword expressions identification and extraction</h3>
<p>Konrad Przewłoka</p>

<h4>Necessary imports</h4>

In [56]:
#!pip install prettytable
from spacy.tokenizer import Tokenizer
from spacy.lang.pl import Polish
import os
import random
import math
import collections
import matplotlib.pyplot as plt
from prettytable import PrettyTable

Collecting prettytable
  Downloading prettytable-3.5.0-py3-none-any.whl (26 kB)
Installing collected packages: prettytable
Successfully installed prettytable-3.5.0


<h4>Load data</h4>

In [2]:
data=[]
files = os.listdir("../ustawy")
for file in files:
    with open("../ustawy" + '/' + file, 'r', encoding='utf8') as f:
        tmp = f.read().lower()
        data.append(tmp)

<h4>SpaCy tokenization</h4>

In [6]:
nlp = Polish()
tokenizer = nlp.tokenizer
tokenized_data = [[t.text for t in tokenizer(r)] for r in data ]
tokens_counter = collections.Counter([element for sublist in tokenized_data for element in sublist])
tmp = []
for key in tokens_counter.keys():
    if not key.isalpha():
        tmp.append(key)
        
for key in tmp:
    del tokens_counter[key]
    
tokens_total = sum(tokens_counter.values())

<h4>Bigram computation</h4>

In [26]:
bigrams=[]
for file in tokenized_data:
    previous_token = None
    for token in file:
        if previous_token != None:
            bigrams.append(" ".join([previous_token,token]))
        previous_token = token
bigrams = collections.Counter(bigrams)

#Filter out bigrams containing characters other than letters
def is_correct(s):
    if len(s.split())!=2:
        return False
    for element in s:
        if not element.isalpha() and element!=' ':
            return False
    return True

tmp = []
for key in bigrams.keys():
    if not is_correct(key):
        tmp.append(key)
        
for key in tmp:
    del bigrams[key]

<h4>Pointwise mutual inforamtion for bigrams</h4>

In [45]:
bigrams_total= sum(bigrams.values())
def pmi(bigram):
    p_w1 = tokens_counter[bigram.split()[0]]/tokens_total
    p_w2 = tokens_counter[bigram.split()[1]]/tokens_total
    p_bigram = bigrams[bigram]/bigrams_total
    return math.log(p_bigram/(p_w1*p_w2))
bigrams_pmi = {bigram: pmi(bigram) for bigram  in bigrams.keys()}
collections.Counter(bigrams_pmi).most_common()[:10]

[('kołowe jednoosiowe', 15.454037668330464),
 ('zbrojeń żelbeto', 15.454037668330464),
 ('prefabrykatów wnętrzowe', 15.454037668330464),
 ('gołe aluminiowe', 15.454037668330464),
 ('polistyrenu spienionego', 15.454037668330464),
 ('objaśnieniem figur', 15.454037668330464),
 ('wkładzie wnoszonym', 15.454037668330464),
 ('doktorem habilitowanym', 15.454037668330464),
 ('losy loteryjne', 15.454037668330464),
 ('ugaszone zapałki', 15.454037668330464)]

<h4>Pointwise mutual inforamtion for filtered bigrams</h4>

In [46]:
filltered_bigrams = {bigram: count for bigram, count  in bigrams.items() if count>=5} 
def pmi(bigram):
    p_w1 = tokens_counter[bigram.split()[0]]/tokens_total
    p_w2 = tokens_counter[bigram.split()[1]]/tokens_total
    p_bigram = filltered_bigrams[bigram]/sum(filltered_bigrams.values())
    return math.log(p_bigram/(p_w1*p_w2))
filltered_bigrams_pmi = {bigram: pmi(bigram) for bigram  in filltered_bigrams.keys()}
collections.Counter(filltered_bigrams_pmi).most_common()[:10]

[('świeckie przygotowujące', 14.136205671251723),
 ('klęskami żywiołowymi', 14.136205671251723),
 ('ręcznego miotacza', 14.136205671251723),
 ('stajnią wyścigową', 14.136205671251723),
 ('otworami wiertniczymi', 14.136205671251723),
 ('obcowania płciowego', 14.136205671251723),
 ('młyny kulowe', 14.136205671251723),
 ('młynki młotkowe', 14.136205671251723),
 ('zaszkodzić wynikom', 14.136205671251723),
 ('grzegorz schetyna', 14.136205671251723)]

<h4>LLR for bigrams</h4>

In [47]:
def h(ks):
    total = float(sum(ks))
    return sum([k/total * math.log(k / total + (k==0)) for k in ks])

def llr(bigram):
    k11= bigrams[bigram]
    k12= tokens_counter[bigram.split()[1]]-bigrams[bigram]
    k21= tokens_counter[bigram.split()[0]]-bigrams[bigram]
    k22= tokens_total-k11-k12-k21
    return 2*sum([k11,k12,k21,k22])*(h([k11+k12,k21+k22])-h([k11+k21,k12+k22]))

bigrams_llr={bigram: llr(bigram) for bigram  in bigrams.keys()}
collections.Counter(bigrams_llr).most_common()[:10] 

[('w alfabecie', 1547749.5438785334),
 ('w oczyszczal', 1547749.5438785334),
 ('w opisach', 1547749.5438785334),
 ('w figurach', 1547749.5438785334),
 ('w uczciwych', 1547749.5438785334),
 ('w niewielkim', 1547749.5438785334),
 ('w tłumaczeniach', 1547749.5438785334),
 ('w koronie', 1547749.5438785334),
 ('w uchylonym', 1547749.5438785334),
 ('w niezmniejszonym', 1547749.5438785334)]

<h4>Trigram computation</h4>

In [52]:
trigrams=[]
for file in tokenized_data:
    previous_token = None
    previous_previous_token = None
    for token in file:
        if previous_token != None and previous_previous_token!=None:
            trigrams.append(" ".join([previous_previous_token,previous_token,token]))
        previous_previous_token = previous_token
        previous_token = token
trigrams = collections.Counter(trigrams)

def is_correct(s):
    if len(s.split())!=3:
        return False
    for element in s:
        if not element.isalpha() and element!=' ':
            return False
    return True

tmp = []
for key in trigrams.keys():
    if not is_correct(key):
        tmp.append(key)
        
for key in tmp:
    del trigrams[key]
    
filltered_trigrams = {trigram: count for trigram, count  in trigrams.items() if count>=5} 

<h4>Pointwise mutual inforamtion for filltered trigrams</h4>

In [54]:
total_filltered_trigrams=sum(filltered_bigrams.values())
def pmi(trigram):
    p_w1 = tokens_counter[trigram.split()[0]]/tokens_total
    p_w2 = tokens_counter[trigram.split()[1]]/tokens_total
    p_w3 = tokens_counter[trigram.split()[2]]/tokens_total
    p_trigram = filltered_trigrams[trigram]/total_filltered_trigrams
    return math.log(p_trigram/(p_w1*p_w2*p_w3))
filltered_trigrams_pmi = {trigram: pmi(trigram) for trigram  in filltered_trigrams.keys()}
collections.Counter(filltered_trigrams_pmi).most_common()[:10]

[('profilem zaufanym epuap', 25.622883236510766),
 ('finałowego turnieju mistrzostw', 25.359154405081476),
 ('przedwczesnego wyrębu drzewostanu', 25.23885705045838),
 ('potwierdzonym profilem zaufanym', 25.15638223781623),
 ('piłce nożnej uefa', 25.115796957701154),
 ('cienką sierścią zwierzęcą', 24.873883303704954),
 ('szybkiemu postępowi technicznemu', 24.828937592000834),
 ('turnieju mistrzostw europy', 24.828526154019308),
 ('grożącą jemu samemu', 24.661541685423558),
 ('wypalonym paliwem jądrowym', 24.610248391036006)]

<h4>LLR for trigrams</h4>

In [55]:
def h(ks):
    total = float(sum(ks))
    return sum([k/total * math.log(k / total + (k==0)) for k in ks])

def llr(trigram):
    k11= trigrams[trigram]
    k12= bigrams[" ".join([trigram.split()[1],trigram.split()[2]])]-trigrams[trigram]
    k21= bigrams[" ".join([trigram.split()[0],trigram.split()[1]])]-trigrams[trigram]
    k22= tokens_total-k11-k12-k21
    return 2*sum([k11,k12,k21,k22])*(h([k11+k12,k21+k22])-h([k11+k21,k12+k22]))

trigrams_llr={trigram: llr(trigram) for trigram  in trigrams.keys()}
collections.Counter(trigrams_llr).most_common()[:10] 

[('mowa w powyższej', 323790.16559965984),
 ('mowa w przywołanym', 323790.16559965984),
 ('mowa w europejskiej', 323732.40855133283),
 ('mowa w sekcji', 323732.40855133283),
 ('mowa w paragrafach', 323704.73287230177),
 ('mowa w artykułach', 323677.562536909),
 ('mowa w artykule', 323650.79491234885),
 ('mowa w przepisie', 323650.79491234885),
 ('mowa w dyrektywie', 323624.36218229897),
 ('mowa w rozdziałach', 323572.32076727395)]

<h4>Bigram table</h4>

In [63]:
t = PrettyTable(['PMI', 'LLR'])
for i in range(10):
    t.add_row([collections.Counter(filltered_bigrams_pmi).most_common()[i],collections.Counter(bigrams_llr).most_common()[i] ])
print(t)

+-------------------------------------------------+-------------------------------------------+
|                       PMI                       |                    LLR                    |
+-------------------------------------------------+-------------------------------------------+
| ('świeckie przygotowujące', 14.136205671251723) |    ('w alfabecie', 1547749.5438785334)    |
|   ('klęskami żywiołowymi', 14.136205671251723)  |    ('w oczyszczal', 1547749.5438785334)   |
|    ('ręcznego miotacza', 14.136205671251723)    |     ('w opisach', 1547749.5438785334)     |
|    ('stajnią wyścigową', 14.136205671251723)    |     ('w figurach', 1547749.5438785334)    |
|  ('otworami wiertniczymi', 14.136205671251723)  |    ('w uczciwych', 1547749.5438785334)    |
|   ('obcowania płciowego', 14.136205671251723)   |    ('w niewielkim', 1547749.5438785334)   |
|       ('młyny kulowe', 14.136205671251723)      |  ('w tłumaczeniach', 1547749.5438785334)  |
|     ('młynki młotkowe', 14.13620567125

<h4>Trigram table</h4>

In [64]:
t = PrettyTable(['PMI', 'LLR'])
for i in range(10):
    t.add_row([collections.Counter(filltered_trigrams_pmi).most_common()[i],collections.Counter(trigrams_llr).most_common()[i] ])
print(t)

+----------------------------------------------------------+---------------------------------------------+
|                           PMI                            |                     LLR                     |
+----------------------------------------------------------+---------------------------------------------+
|     ('profilem zaufanym epuap', 25.622883236510766)      |   ('mowa w powyższej', 323790.16559965984)  |
|  ('finałowego turnieju mistrzostw', 25.359154405081476)  |  ('mowa w przywołanym', 323790.16559965984) |
| ('przedwczesnego wyrębu drzewostanu', 25.23885705045838) | ('mowa w europejskiej', 323732.40855133283) |
|  ('potwierdzonym profilem zaufanym', 25.15638223781623)  |    ('mowa w sekcji', 323732.40855133283)    |
|        ('piłce nożnej uefa', 25.115796957701154)         |  ('mowa w paragrafach', 323704.73287230177) |
|    ('cienką sierścią zwierzęcą', 24.873883303704954)     |   ('mowa w artykułach', 323677.562536909)   |
| ('szybkiemu postępowi technicznemu'

<h4>Answers</h4>
<h5>I</h5>
<p>We have to filter bigrams rather than token sequences in order to preserve real and meaningful connections between tokens. For example if we have a sequence of tokens looking like this: a1 a2 n1 n2 a3 a4, where a&lt;nr&gt; represents a valid token and n&lt;nr&gt; represents an invalid token, if we filter tokens first we will get bigrams: a1a2 a2a3 a3a4 whereas if we first create bigrams and filter then we will get bigrams a1a2 a3a4. As we see the operation order is important as we can end up with different bigrams depending of the order of operations. Filtering after creating bigrams is preferred as it preserves original words collocations.</p>
<h5>II</h5>
<p>In my humble opinion PMI seems to be working better for trigrams since the expressions found look more significant in terms of information. Similarly, for bigrams PMI seems to also be the prefered method. It is hard to select between PMI with and without filtering, but I believe that PMI with filtering should be working better since filtering should reduce outliers.</p>
<h5>III</h5>
<p>Used methods may be used to discover different types of multiword expressions such as multiword named entities and multiword terms.</p>
<h5>IV</h5>
<p>It is definitely possible to devise better solutions especially for certain types of datasets. Depending on the task given and the understanding of the data it may be possible to devise better or additional filters other than non-letter characters. For example it may be a good idea to filter out expressions that are constructed from tokens that all represent the same part of speech (such as 'świeckie przygotowujące') since they are most likely parts of longer expressions and by themselves convey little information.</p>