# 2.1 Part-of-Speech (POS) tags

In [1]:
import nltk
from collections import defaultdict
from collections import Counter
from nltk.corpus.reader import TaggedCorpusReader

train = TaggedCorpusReader(root="resources", fileids="BAWE_train.retagged.txt")
test  = TaggedCorpusReader(root="resources", fileids="BAWE_test.retagged.txt")

tagged_words_train = train.tagged_words()

## Question 1: Most frequent tags
Based on the tagged training corpus, provide the 3 most frequent tags. Format your answer as a Python dictionary whose keys are the tags and whose values are the corresponding counts (e.g. `{'NN': 10, 'JJ': 10}`). Use the tag taxonomy of the training corpus: for instance, `'NN'` stands for a noun and `'JJ'` for an adjective. Refer to the NLTK documentation for the meaning of each tag.

In [2]:
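# Count how often each tag occurs, ignoring the words themselves.
freq = nltk.FreqDist([x for (_, x) in tagged_words_train])
# Report the 3 most frequent tags as {tag: count}.
print({x: y for (x, y) in freq.most_common(3)})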

{'NN': 906007, 'IN': 634158, 'DT': 538613}
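Since `nltk.FreqDist` is a subclass of `collections.Counter`, the same count can be sketched with the standard library alone. The example below is only an illustration: `toy_tagged_words` is a hypothetical hand-made sample standing in for `train.tagged_words()`, and `top_tags` is a helper name introduced here, not part of NLTK.

```python
from collections import Counter

# Hypothetical toy sample standing in for train.tagged_words().
toy_tagged_words = [
    ("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"),
    ("in", "IN"), ("a", "DT"), ("house", "NN"),
]


def top_tags(tagged_words, n=3):
    """Return the n most frequent tags as a {tag: count} dictionary."""
    tag_counts = Counter(tag for _, tag in tagged_words)
    return dict(tag_counts.most_common(n))


top3 = top_tags(toy_tagged_words)
print(top3)
```

On the real corpus, `top_tags(tagged_words_train)` should give the same counts as the cell above.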


## Question 2: Most frequent POS before nouns
Nouns generally refer to people, places, things, or concepts. They usually appear after determiners and adjectives and they are often followed by a verb.

Analyze the training corpus to provide a dictionary of the 3 POS tags that occur most often immediately before a common singular noun (NN). The keys of your dictionary should be POS tags and the values should be their relative frequencies of occurrence (among all tokens appearing immediately before an NN in the training corpus). Give the frequencies with at least two significant digits.

In [3]:
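# Count the tag of every token that immediately precedes an 'NN' token.
d = defaultdict(int)
for i in range(len(tagged_words_train) - 1):
    if tagged_words_train[i + 1][1] == 'NN':
        d[tagged_words_train[i][1]] += 1

# Normalise by the total number of tokens found before an 'NN'.
total = sum(d.values())
c = Counter(d)
print({x: y / total for (x, y) in c.most_common(3)})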

{'DT': 0.2898840737433596, 'JJ': 0.20038035026219445, 'IN': 0.12694383155980032}
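The loop above looks tokens up by position, which can be slow on a large corpus. As a possible alternative, the sketch below pairs each tag with its successor using `zip` and counts in a single pass. It is a minimal illustration on a hypothetical toy sample; `tags_before` and `toy_tagged_words` are names introduced here for the example only.

```python
from collections import Counter

# Hypothetical toy sample standing in for train.tagged_words().
toy_tagged_words = [
    ("the", "DT"), ("old", "JJ"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"), ("of", "IN"),
    ("wool", "NN"), ("a", "DT"), ("dog", "NN"),
]


def tags_before(tagged_words, target="NN", n=3):
    """Relative frequency of the n tags seen most often right before `target`."""
    tags = [tag for _, tag in tagged_words]
    # zip(tags, tags[1:]) yields every (previous tag, current tag) pair.
    before = Counter(prev for prev, cur in zip(tags, tags[1:]) if cur == target)
    total = sum(before.values())
    if total == 0:
        return {}  # target never occurs after another token
    return {tag: count / total for tag, count in before.most_common(n)}


before_nn = tags_before(toy_tagged_words)
print(before_nn)
```

Materialising the tag list first keeps the pairing logic simple at the cost of holding all tags in memory.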


## Question 3: Most common verbs
Analyze the training corpus to determine the 3 most frequent verbs. Provide your answer as a dictionary where each key is a verb occurrence and the associated value is its relative frequency (among the verbs). Consider all possible verb forms ([various possible POS tags](https://inginious.info.ucl.ac.be/course/LINGI2263/project2a/TAGSET.png)). Give the frequencies with at least two significant digits.

In [4]:
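# Keep every word whose tag starts with 'VB' (i.e. any verb form).
verbs = [x for (x, y) in tagged_words_train if y.startswith('VB')]
verb_freqs = nltk.FreqDist(verbs)
# Relative frequency of the 3 most common verbs among all verb tokens.
print({x: y / len(verbs) for (x, y) in verb_freqs.most_common(3)})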

{'<UNK>': 0.1474337683519249, 'is': 0.10494317215232941, 'be': 0.05552495213715372}
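The top entry in the output above is `'<UNK>'`, which looks like a placeholder token rather than an actual verb. The question does not say whether such tokens should be excluded, so the sketch below only shows how the ranking could be computed both ways. It runs on a hypothetical toy sample, and `top_verbs` and its `exclude` parameter are names introduced here for illustration.

```python
from collections import Counter

# Hypothetical toy sample standing in for train.tagged_words().
toy_tagged_words = [
    ("it", "PRP"), ("is", "VBZ"), ("<UNK>", "VBN"), ("to", "TO"),
    ("be", "VB"), ("is", "VBZ"), ("<UNK>", "VBD"), ("seen", "VBN"),
    ("is", "VBZ"),
]


def top_verbs(tagged_words, n=3, exclude=()):
    """Relative frequency of the n most common verb tokens.

    Words listed in `exclude` are dropped before counting, so the
    frequencies are relative to the remaining verb tokens only.
    """
    verbs = [word for word, tag in tagged_words
             if tag.startswith("VB") and word not in exclude]
    total = len(verbs)
    if total == 0:
        return {}
    return {word: count / total
            for word, count in Counter(verbs).most_common(n)}


with_unk = top_verbs(toy_tagged_words)
without_unk = top_verbs(toy_tagged_words, exclude=("<UNK>",))
print(with_unk)
print(without_unk)
```

Excluding the placeholder changes the denominator, so the remaining frequencies are not directly comparable with the values printed above.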
