
# Task 1: Lexical ambiguity in a Universal Dependencies dataset

* There are many ways to approach this exercise
* If your details differ, that's fine
* Let's start by grabbing the data

In [3]:
# Get the data into colab, wget is fine
# URLs I got by clicking around the UD github

!wget -nc --quiet https://github.com/UniversalDependencies/UD_Finnish-TDT/raw/master/fi_tdt-ud-train.conllu
!wget -nc --quiet https://github.com/UniversalDependencies/UD_English-EWT/raw/master/en_ewt-ud-train.conllu

# Finnish verbs



In [4]:
# This is a direct, quite non-pythonic approach

# Let's loop over the lines, skipping empty lines (sentence separators)
# and skipping also metadata lines (start with a #)

counter={} #{key: verb inflection form, value: the count}

with open("fi_tdt-ud-train.conllu") as f:
    for line in f:
        line=line.rstrip("\n") #Remember python leaves the newline in the string, in most cases you want to strip it
        if not line or line.startswith("#"): #skip empty and metadata lines
            continue
        #So here we have a real data line
        cols=line.split("\t") #this is a tab-delimited format, so split into columns
        ID,FORM,LEMMA,UPOS,XPOS,FEATS,HEAD,DEPREL,DEPS,MISC=cols
        if UPOS=="VERB":
            counter[FEATS]=counter.get(FEATS,0)+1 #add 1 to the count of this particular inflection form (FEATS)

#Now we have the counts, maybe we want to print some stats
print(f"Total verb inflection forms: {len(counter)}")
#Maybe we want to print the first few forms and their counts
#there are so many ways to sort a dict, just google or ask ChatGPT, here is one:
forms_sorted=sorted(counter.items(),key=lambda key_val: key_val[1], reverse=True)
for form,count in forms_sorted[:10]:
    print(f"Form {form}     ...seen {count} times")


Total verb inflection forms: 408
Form Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act     ...seen 2801 times
Form Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act     ...seen 2517 times
Form InfForm=1|Number=Sing|VerbForm=Inf|Voice=Act     ...seen 2460 times
Form Case=Nom|Number=Sing|PartForm=Past|VerbForm=Part|Voice=Act     ...seen 1256 times
Form Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass     ...seen 1030 times
Form Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act     ...seen 820 times
Form Case=Ill|InfForm=3|Number=Sing|VerbForm=Inf|Voice=Act     ...seen 727 times
Form Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin|Voice=Act     ...seen 661 times
Form Case=Nom|Number=Sing|PartForm=Past|VerbForm=Part|Voice=Pass     ...seen 648 times
Form Connegative=Yes|Mood=Ind|Tense=Pres|VerbForm=Fin     ...seen 586 times


In [6]:
# This is an alternative approach using collections.Counter
# Counter takes any iterable (here a generator) and counts its items; it behaves like a dictionary
# It also has an easy way to get the top N most common elements
from collections import Counter

# Create a generator which yields the verb inflection forms
def yield_infl_forms(file_name,target_pos="VERB"):
    """
    Yields the FEATS field.
    file_name: name of the file with conllu data
    target_pos: the POS we are interested in, everything else ignored
    """
    with open(file_name) as f:
        for line in f:
            line=line.rstrip("\n") #needed: without it a blank separator line would still be "\n" and slip past the check below
            if not line or line.startswith("#"):
                continue
            #So here we have a real data line
            cols=line.split("\t")
            ID,FORM,LEMMA,UPOS,XPOS,FEATS,HEAD,DEPREL,DEPS,MISC=cols
            if UPOS==target_pos:
                yield FEATS #yield one value at a time -> results in a generator

#Now we can count the forms with a Counter
counter=Counter(yield_infl_forms("fi_tdt-ud-train.conllu"))
print(f"Total verb inflection forms: {len(counter)}")
for form,count in counter.most_common(10):
    print(f"Form {form}     ...seen {count} times")

Total verb inflection forms: 408
Form Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act     ...seen 2801 times
Form Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act     ...seen 2517 times
Form InfForm=1|Number=Sing|VerbForm=Inf|Voice=Act     ...seen 2460 times
Form Case=Nom|Number=Sing|PartForm=Past|VerbForm=Part|Voice=Act     ...seen 1256 times
Form Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass     ...seen 1030 times
Form Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act     ...seen 820 times
Form Case=Ill|InfForm=3|Number=Sing|VerbForm=Inf|Voice=Act     ...seen 727 times
Form Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin|Voice=Act     ...seen 661 times
Form Case=Nom|Number=Sing|PartForm=Past|VerbForm=Part|Voice=Pass     ...seen 648 times
Form Connegative=Yes|Mood=Ind|Tense=Pres|VerbForm=Fin     ...seen 586 times
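
Each FEATS value printed above is a `|`-separated list of `Name=Value` pairs. If you want to look at individual features rather than the whole string, it can be split into a dictionary. A minimal sketch, where the helper name `parse_feats` is made up for illustration and `_` is the CoNLL-U marker for an empty field:

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'Mood=Ind|Tense=Past' into a dict."""
    if feats == "_":  # "_" means no features in CoNLL-U
        return {}
    # split on the first "=" only, in case a value itself contains "="
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# The most common Finnish verb form from the output above
top_form = "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act"
print(parse_feats(top_form))
```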


# English noun-verb

* Here we might want to make two counters, one for the verbs, one for the nouns
* Then we can look at which words (FORM) are shared among these counters
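
As a minimal sketch of this idea on made-up data (the toy word lists below are invented for illustration, not taken from the treebank), two counters can be intersected either as sets of keys or directly as counters:

```python
from collections import Counter

# Toy data, invented for illustration only
toy_nouns = Counter(["work", "call", "dog", "work"])
toy_verbs = Counter(["work", "call", "run"])

# Iterating over a Counter yields its keys, so set() gives the distinct forms
shared_forms = set(toy_nouns) & set(toy_verbs)
print(sorted(shared_forms))

# Counter & Counter keeps the shared keys with the minimum of the two counts
shared_counts = toy_nouns & toy_verbs
print(shared_counts)
```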


In [12]:
# Maybe I can adopt the solution from above
# let's modify it to yield the FORM
def yield_forms(file_name,target_pos="VERB"):
    with open(file_name) as f:
        for line in f:
            line=line.rstrip("\n") #needed: without it a blank separator line would still be "\n" and slip past the check below
            if not line or line.startswith("#"):
                continue
            #So here we have a real data line
            cols=line.split("\t")
            ID,FORM,LEMMA,UPOS,XPOS,FEATS,HEAD,DEPREL,DEPS,MISC=cols
            if UPOS==target_pos:
                yield FORM
file_name="en_ewt-ud-train.conllu"
nouns=Counter(yield_forms(file_name,"NOUN"))
verbs=Counter(yield_forms(file_name,"VERB"))

# The most straightforward answer is:
common= set(nouns) & set(verbs) # & means set intersection
print(f"There are {len(common)} forms seen as both VERB and NOUN in {file_name}")


There are 738 forms seen as both VERB and NOUN in en_ewt-ud-train.conllu


In [16]:
# We can also go all-out fancy and list the most ambiguous ones
# i.e. those which are most balanced between their noun and verb counts
# this is definitely going the extra mile

noun_verb_ambiguity=[] #let's make a list of [(ambiguity,count_as_verb,count_as_noun,word)]
for word,count_as_verb in verbs.items():
    count_as_noun=nouns.get(word,0) #get the count as noun, if seen, or 0 otherwise
    proportion_as_verb=count_as_verb/(count_as_verb+count_as_noun)
    ambiguity=abs(0.5-proportion_as_verb) #the closer to zero, the closer to a 50:50 distribution we are
    if count_as_verb>20: #keep only reasonably common words (seen more than 20 times as a verb)
        noun_verb_ambiguity.append((ambiguity,count_as_verb,count_as_noun,word))

noun_verb_ambiguity.sort() #sorts ascending by the first tuple element, so the most balanced (closest to 50:50) words come first
for x in noun_verb_ambiguity[:20]: #let's print the 20 most balanced words
    print(x)


(0.0, 110, 110, 'work')
(0.01677852348993286, 72, 77, 'call')
(0.051724137931034475, 26, 32, 'review')
(0.05319148936170215, 26, 21, 'return')
(0.06470588235294117, 37, 48, 'change')
(0.07894736842105265, 22, 16, 'charge')
(0.15217391304347827, 30, 16, 'visit')
(0.16666666666666663, 34, 17, 'needs')
(0.19655172413793098, 101, 44, 'help')
(0.24285714285714288, 26, 9, 'move')
(0.25, 54, 18, 'love')
(0.26, 38, 12, 'show')
(0.2692307692307693, 40, 12, 'start')
(0.2777777777777778, 35, 10, 'run')
(0.28260869565217395, 36, 10, 'means')
(0.2954545454545454, 35, 9, 'wait')
(0.30000000000000004, 32, 8, 'set')
(0.3125, 26, 6, 'walk')
(0.3152173913043478, 75, 17, 'look')
(0.31818181818181823, 27, 6, 'living')
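
The first number in each tuple is the distance of the verb proportion from 0.5, so 0.0 means a perfect 50:50 split and 0.5 means the word was seen with only one of the two POS tags. A small sketch that recomputes the score for two rows of the output above (the function name `imbalance` is made up here for illustration):

```python
def imbalance(count_as_verb, count_as_noun):
    """Distance of the verb proportion from 0.5 (0.0 = perfectly balanced)."""
    proportion_as_verb = count_as_verb / (count_as_verb + count_as_noun)
    return abs(0.5 - proportion_as_verb)

# Counts copied from the output above
print("work", imbalance(110, 110))
print("call", imbalance(72, 77))
```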


# Coverage of ambiguous words

This is a little more involved, but not much. We can build a dictionary `d` which, for each word (FORM in the data), maintains a separate counter. This word-specific counter records every unique analysis (a LEMMA-UPOS-FEATS triple) and how many times it was seen.

Once we have `d`, we go over it and sum up the occurrences. If the word-specific counter has length 1, the word was only ever seen with a single analysis and is unambiguous. If the length is >1, the word is ambiguous. The share of unambiguous words is then the number of occurrences of unambiguous words divided by the number of all occurrences.


In [18]:

d={} #{word -> counter of different unique analyses}

#uncomment the one you want
data="en_ewt-ud-train.conllu"
#data="fi_tdt-ud-train.conllu"
with open(data) as f:
    for line in f:
        line=line.rstrip("\n")
        if not line or line[0]=="#":
            continue
        cols=line.split("\t")
        ID,FORM,LEMMA,UPOS,XPOS,FEATS,HEAD,DEPREL,DEPS,MISC=cols
        unique_triple=(LEMMA,UPOS,FEATS) #this counts as analysis
        #Now, get the counter for this FORM from d
        #here as a counter I will use a normal dictionary, not the Counter class above
        #.setdefault is very useful, if you don't know it, read up on it
        this_form_counter=d.setdefault(FORM,{}) #for every form, we want to maintain a counter for all the triplets it can represent
        #+1 for this occurrence:
        this_form_counter[unique_triple]=this_form_counter.get(unique_triple,0)+1

#...and now go over everything we've got
uniques=0
total=0
#We really don't need the "form" for anything, but, well, here is how you get it...
for form, this_form_counter in d.items():
    #sum up the occurrences of this form across all analyses
    times_seen=sum(this_form_counter.values())
    if len(this_form_counter)==1: #unique! this word has only one triple seen, is unambiguous
        uniques+=times_seen
    total+=times_seen

print(f"Percentage of unambiguous words in {data}:", uniques/total*100,"%")


Percentage of unambiguous words in en_ewt-ud-train.conllu: 39.49195809426378 %


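Note how the percentage of unambiguous word occurrences in the English data is much lower than in the Finnish data! (Switch `data` to the Finnish file in the cell above and rerun it to see the Finnish figure.)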
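
To compare the two treebanks without editing the cell, the same computation can be wrapped in a function. This is a minimal sketch, not the original solution: the name `unambiguous_percentage` is made up here, and it mirrors the cell above in treating every non-comment, non-empty line as a token.

```python
import os
from collections import Counter, defaultdict

def unambiguous_percentage(lines):
    """Percentage of occurrences whose FORM was seen with exactly one
    (LEMMA, UPOS, FEATS) triple. `lines` is any iterable of CoNLL-U lines."""
    analyses = defaultdict(Counter)  # {form -> Counter of triples}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip separators and metadata
            continue
        cols = line.split("\t")
        form, lemma, upos, feats = cols[1], cols[2], cols[3], cols[5]
        analyses[form][(lemma, upos, feats)] += 1
    total = sum(sum(c.values()) for c in analyses.values())
    unique = sum(sum(c.values()) for c in analyses.values() if len(c) == 1)
    return unique / total * 100 if total else 0.0  # guard against empty input

# Runs only if the files downloaded at the top of the notebook are present
for name in ("en_ewt-ud-train.conllu", "fi_tdt-ud-train.conllu"):
    if os.path.exists(name):
        with open(name, encoding="utf-8") as f:
            print(name, unambiguous_percentage(f))
```

Because the function accepts any iterable of lines, it can also be tried on a short list of hand-written CoNLL-U lines before running it on the full files.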