<a href="https://colab.research.google.com/github/KatharinaGardens/computational-linguistics.github.io/blob/Week-9/LELA32052_Week_9_Seminar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LELA32052 Computational Linguistics Week 9

This week we are going to take a look at part of speech tagging.

## Tagged corpora
In looking to understand part of speech tagging, it is useful to start by looking at some human (rather than machine) tagged data. NLTK contains a number of corpora. We can import a few of these as follows:

In [1]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
nltk.download('sinica_treebank')
nltk.download('indian')
nltk.download('conll2002')
nltk.download('cess_cat')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package sinica_treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/sinica_treebank.zip.
[nltk_data] Downloading package indian to /root/nltk_data...
[nltk_data]   Unzipping corpora/indian.zip.
[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2002.zip.
[nltk_data] Downloading package cess_cat to /root/nltk_data...
[nltk_data]   Unzipping corpora/cess_cat.zip.


True

In [2]:
brown.tagged_words()[1:25]

[('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN'),
 ("Atlanta's", 'NP$'),
 ('recent', 'JJ'),
 ('primary', 'NN'),
 ('election', 'NN'),
 ('produced', 'VBD'),
 ('``', '``'),
 ('no', 'AT'),
 ('evidence', 'NN'),
 ("''", "''"),
 ('that', 'CS'),
 ('any', 'DTI'),
 ('irregularities', 'NNS'),
 ('took', 'VBD'),
 ('place', 'NN'),
 ('.', '.')]

In [3]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

In [4]:
brown.tagged_words(tagset="universal")[1:25]

[('Fulton', 'NOUN'),
 ('County', 'NOUN'),
 ('Grand', 'ADJ'),
 ('Jury', 'NOUN'),
 ('said', 'VERB'),
 ('Friday', 'NOUN'),
 ('an', 'DET'),
 ('investigation', 'NOUN'),
 ('of', 'ADP'),
 ("Atlanta's", 'NOUN'),
 ('recent', 'ADJ'),
 ('primary', 'NOUN'),
 ('election', 'NOUN'),
 ('produced', 'VERB'),
 ('``', '.'),
 ('no', 'DET'),
 ('evidence', 'NOUN'),
 ("''", '.'),
 ('that', 'ADP'),
 ('any', 'DET'),
 ('irregularities', 'NOUN'),
 ('took', 'VERB'),
 ('place', 'NOUN'),
 ('.', '.')]

In [5]:
nltk.corpus.sinica_treebank.tagged_words() # Chinese

[('一', 'Neu'), ('友情', 'Nad'), ('嘉珍', 'Nba'), ...]

In [6]:
nltk.corpus.indian.tagged_words() # Bangla, Hindi, Marathi, and Telugu language data

[('মহিষের', 'NN'), ('সন্তান', 'NN'), (':', 'SYM'), ...]

In [7]:
nltk.corpus.conll2002.tagged_words() # Spanish

[('Sao', 'NC'), ('Paulo', 'VMI'), ('(', 'Fpa'), ...]

In [8]:
nltk.corpus.cess_cat.tagged_words() # Catalan

[('El', 'da0ms0'), ('Tribunal_Suprem', 'np0000o'), ...]

## Inspecting tagged corpora

Inspecting human-tagged corpora is useful both for linguistic research and for building taggers. We can use NLTK to do this.

Most straightforwardly we can look at the frequency with which particular words are given a tag (we will return to this later when we come to build a tagger).

In [9]:
sent = [("the","DET"),("man","NOUN"),("walked","VERB"),("the","DET"),("dog","NOUN")]

In [10]:
cfd1 = nltk.ConditionalFreqDist(sent)
cfd1['the']

FreqDist({'DET': 2})
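To make clear what `ConditionalFreqDist` is counting, here is a minimal plain-Python sketch of the same idea using only the standard library. The name `tag_counts` is hypothetical and this is not how NLTK implements the class; it just shows that we keep one tag counter per word.

```python
from collections import Counter, defaultdict

sent = [("the", "DET"), ("man", "NOUN"), ("walked", "VERB"),
        ("the", "DET"), ("dog", "NOUN")]

# One Counter of tags per word: the word is the "condition"
tag_counts = defaultdict(Counter)
for word, tag in sent:
    tag_counts[word][tag] += 1

print(tag_counts["the"])                 # how often each tag was given to "the"
print(tag_counts["the"].most_common(1))  # the single most frequent tag for "the"
```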

When we apply this to whole corpora, it becomes useful.

In [11]:
brown_tagged = brown.tagged_words(tagset='universal')
cfd1 = nltk.ConditionalFreqDist(brown_tagged)
cfd1['the']

FreqDist({'DET': 62710, 'X': 3})

In [12]:
cfd1['run']

FreqDist({'VERB': 154, 'NOUN': 52})

And if we additionally use a couple of other NLTK tools (which we don't have time to cover in detail; I just want to give you a sense of what is possible), we can look at how often each word class follows a particular word (here, "car"):

In [13]:
brown_tagged = brown.tagged_words(tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_tagged) if a[0] == 'car']
fd = nltk.FreqDist(tags)
fd.tabulate()

   . VERB  ADP NOUN CONJ PRON  ADV  PRT  DET  ADJ 
  83   64   44   24   19   15   13    4    2    2 


Or the frequency with which particular word classes precede other word classes:

In [14]:
brown_tagged = brown.tagged_words(tagset='universal')
word_tag_pairs = nltk.bigrams(brown_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
noun_preceders_fd = nltk.FreqDist(noun_preceders)
noun_preceders_fd.most_common()

[('DET', 85845),
 ('ADJ', 54653),
 ('NOUN', 41309),
 ('ADP', 37418),
 ('.', 20084),
 ('VERB', 17851),
 ('CONJ', 9294),
 ('NUM', 5668),
 ('ADV', 1851),
 ('PRT', 1068),
 ('PRON', 440),
 ('X', 77)]
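If the bigram-based counts above feel opaque, the following small sketch may help. It uses a made-up toy sequence (not Brown data) and plain `zip` in place of `nltk.bigrams`; the variable names are my own, chosen for illustration.

```python
from collections import Counter

# Toy tagged text, invented for this example
tagged = [("the", "DET"), ("car", "NOUN"), ("stopped", "VERB"),
          ("and", "CONJ"), ("the", "DET"), ("red", "ADJ"),
          ("car", "NOUN"), ("left", "VERB")]

# Pair each token with the token that follows it
pairs = list(zip(tagged, tagged[1:]))

# Tags of the tokens that come directly after the word "car"
after_car = Counter(b[1] for (a, b) in pairs if a[0] == "car")

# Tags of the tokens that come directly before any noun
before_noun = Counter(a[1] for (a, b) in pairs if b[1] == "NOUN")

print(after_car)
print(before_noun)
```

In each pair, `a` is the earlier token and `b` the later one, so filtering on `a` and collecting from `b` looks forwards, while filtering on `b` and collecting from `a` looks backwards.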

And you can even search for particular constructional patterns, for example two nouns joined by "and":

In [17]:
for tagged_sent in brown.tagged_sents(categories="news")[1:75]:
    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(tagged_sent):
        if (t1.startswith('N') and w2 == 'and' and t3.startswith('N')):
            print(w1, w2, w3)

praise and thanks
registration and election
Atlanta and Fulton
guardians and administrators
fees and compensation
night and weekend
administration and operation
Bellwood and Alpharetta
man and wife


## Building an automatic tagger

A very simple approach to automated tagging that actually works quite well is to find the most common tag for each word in a training corpus (as we did above) and just tag all occurrences of each word with its most common tag:

In [18]:
brown_tagged_sents = brown.tagged_sents(tagset='universal')

In [19]:
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

In [20]:
unigram_tagger.tag(["the","cat","sat","on","the","mat"])

[('the', 'DET'),
 ('cat', 'NOUN'),
 ('sat', 'VERB'),
 ('on', 'ADP'),
 ('the', 'DET'),
 ('mat', 'NOUN')]
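To show what "most common tag for each word" amounts to, here is a rough standard-library sketch trained on three invented sentences. It is a simplification for illustration and not NLTK's implementation; `most_common_tag` and `tag` are hypothetical names.

```python
from collections import Counter, defaultdict

# Tiny invented training set in which "run" is tagged both ways
train = [[("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
         [("they", "PRON"), ("run", "VERB")],
         [("we", "PRON"), ("run", "VERB"), ("the", "DET"), ("race", "NOUN")]]

# Count how often each word receives each tag
counts = defaultdict(Counter)
for sentence in train:
    for word, tag_ in sentence:
        counts[word][tag_] += 1

# Keep only the single most frequent tag per word
most_common_tag = {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words):
    # Words never seen in training get None in this sketch
    return [(w, most_common_tag.get(w)) for w in words]

print(tag(["the", "run", "cat"]))
```

Note that "run" always comes out as VERB, whatever the context, and that the unseen word "cat" gets no tag at all. Both limitations are addressed later in this notebook.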

We can formally evaluate this by splitting our data into a training set and a testing set. We obtain the by-word tag frequencies from the training set and evaluate by tagging the test set and comparing our predicted tags to the human tags.

In [21]:
training_set_size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:training_set_size]
test_sents = brown_tagged_sents[training_set_size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.accuracy(test_sents)

0.9156346262651662
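The accuracy figure is simply the proportion of test tokens whose predicted tag matches the human tag. Here is a minimal sketch of that calculation on invented data; `toy_tagger` and `accuracy` are hypothetical helpers written for this example, not NLTK functions.

```python
# Invented gold-standard sentences
gold = [[("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
        [("the", "DET"), ("walk", "NOUN")]]

# A deliberately imperfect lookup tagger
lookup = {"the": "DET", "dog": "NOUN", "walk": "VERB"}

def toy_tagger(words):
    return [(w, lookup.get(w)) for w in words]

def accuracy(tagger, gold_sents):
    """Proportion of tokens whose predicted tag equals the human tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]          # strip the human tags
        predicted = tagger(words)             # tag the bare words
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += (gold_tag == pred_tag)
            total += 1
    return correct / total

score = accuracy(toy_tagger, gold)
print(score)
```

Three of the five tokens are right here: "barked" is unseen and "walk" is given the wrong tag.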

### Regular expression based tagging

As a next step we want to use a more intelligent way to deal with words we haven't seen before, by making use of their orthography and/or morphology. Write regular expressions to classify words in this way and see if you can improve performance. I've added a few example rules to get you started.

In [33]:
patterns = [
    (r'.*ing$', 'VERB'),
    (r'.*ed$', 'VERB'),
    (r'.*ly$','ADV'),
      ]

In [34]:
t0 = nltk.DefaultTagger('NOUN')
t1 = nltk.RegexpTagger(patterns, backoff=t0)
t2 = nltk.UnigramTagger(train_sents, backoff=t1)
t2.accuracy(test_sents)


0.9432954045388823
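As a rough picture of how such rules are applied, the sketch below uses Python's `re` module directly: the rules are tried in order and the first one that matches decides the tag, with NOUN as the fallback. This is my own simplified stand-in rather than NLTK's `RegexpTagger`, and the number rule is an extra assumption added for illustration.

```python
import re

patterns = [
    (r'.*ing$', 'VERB'),
    (r'.*ed$', 'VERB'),
    (r'.*ly$', 'ADV'),
    (r'^-?[0-9]+(\.[0-9]+)?$', 'NUM'),   # extra rule, added for illustration
]

def regexp_tag(word, default='NOUN'):
    # Try the rules in order; the first one that matches wins
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag
    return default

words = ["blorping", "snarfed", "quickly", "3.14", "wug", "king"]
print([(w, regexp_tag(w)) for w in words])
```

Notice that "king" is tagged VERB because it ends in "ing". Rules like these over-generalise, which is why it makes sense to consult them only for words the unigram tagger has never seen.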

As with other classification tasks we can generate a confusion matrix to see where things are going right or wrong.

In [35]:
from sklearn.metrics import confusion_matrix
import pandas as pd
# Tags predicted by our tagger for the editorial section
predicted = [tag for sent in brown.sents(categories='editorial') for (word, tag) in t2.tag(sent)]
# Human tags for the same words
true = [tag for (word, tag) in brown.tagged_words(categories='editorial',tagset="universal")]
# confusion_matrix orders rows and columns by sorted label, so name them in that order
labels = sorted(set(true) | set(predicted))
# The first argument gives the rows: rows are predicted tags, columns are human tags
cm=pd.DataFrame(confusion_matrix(predicted, true, labels=labels),index=labels,columns=labels)
cm

Unnamed: 0,.,ADJ,ADP,ADV,CONJ,DET,NOUN,NUM,PRON,PRT,VERB,X
.,7099,0,0,0,0,0,1,0,0,0,0,0
ADJ,0,4609,4,131,0,0,204,0,0,14,39,1
ADP,0,2,6882,123,0,94,6,0,97,55,19,0
ADV,0,167,32,2673,4,0,14,0,0,2,7,0
CONJ,0,0,16,14,1858,5,0,0,0,0,0,0
DET,0,0,3,14,0,7316,2,0,14,0,0,0
NOUN,0,167,1,4,0,0,14673,0,0,0,488,6
NUM,0,0,0,0,0,0,23,722,0,0,0,0
PRON,0,0,0,0,0,1,0,0,2180,0,0,0
PRT,0,0,672,37,0,0,1,0,0,1462,0,0
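If you just want a ranked list of the most frequent mistakes rather than a full matrix, counting (human tag, predicted tag) pairs is enough. The sketch below does this on two short invented tag sequences; the data and variable names are hypothetical.

```python
from collections import Counter

# Invented human and predicted tags for the same eight tokens
true      = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "PRT", "VERB"]
predicted = ["DET", "NOUN", "NOUN", "ADP", "DET", "NOUN", "ADP", "VERB"]

# Count every (human tag, predicted tag) pair
pair_counts = Counter(zip(true, predicted))

# Keep only the disagreements, most frequent first
errors = [(pair, n) for pair, n in pair_counts.most_common() if pair[0] != pair[1]]

for (human, guess), n in errors:
    print(f"human={human} predicted={guess} count={n}")
```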


### Looking at the context

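We want to improve this, and an obvious next step is to take context into account: give each word the tag that is most frequent for it in the context of what comes before. NLTK's BigramTagger does this using the current word together with the tag of the previous word. The problem is that on its own this doesn't do very well. Any idea why?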

In [36]:
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.accuracy(test_sents)


0.4737746484776094

We can still make use of the bigram information by combining it with the unigram tagger via a process known as backing off. For each word we check whether we have seen that word after the preceding tag in our training data. If we have, we tag it with the most frequent tag for that word in that context. If we haven't seen it in that context, we tag the word with its most frequent tag regardless of context. And if we haven't seen the word at all, we tag it as a noun.

In [37]:
t0 = nltk.DefaultTagger('NOUN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.accuracy(test_sents)


0.9485446658703716
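The three-step backoff chain can be sketched in plain Python as follows. This is a simplified illustration on invented training data, assuming the context is the previous tag plus the current word; it is not NLTK's implementation, and `unigram`, `bigram` and `tag` are hypothetical names.

```python
from collections import Counter, defaultdict

# Tiny invented training set
train = [[("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
         [("they", "PRON"), ("run", "VERB")],
         [("we", "PRON"), ("run", "VERB")]]

unigram = defaultdict(Counter)   # word -> tag counts
bigram = defaultdict(Counter)    # (previous tag, word) -> tag counts
for sent in train:
    prev_tag = None              # None marks the start of a sentence
    for word, t in sent:
        unigram[word][t] += 1
        bigram[(prev_tag, word)][t] += 1
        prev_tag = t

def tag(words):
    tagged, prev_tag = [], None
    for word in words:
        if (prev_tag, word) in bigram:        # 1. context seen in training
            t = bigram[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram:                 # 2. back off to the word alone
            t = unigram[word].most_common(1)[0][0]
        else:                                 # 3. back off to the default tag
            t = "NOUN"
        tagged.append((word, t))
        prev_tag = t
    return tagged

print(tag(["the", "run", "ended"]))   # "run" after DET uses the bigram table
print(tag(["dogs", "run"]))           # unseen "dogs", then unseen context for "run"
```

In the first sentence "run" comes out as NOUN because it follows a determiner, whereas in the second the context (NOUN, "run") was never seen, so the word falls back to its overall most frequent tag, VERB.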

### NLTK's Averaged Perceptron tagger

NLTK's default prebuilt tagger uses a perceptron, just like the one we have been using for other tasks on the module. For more information on this approach see here: https://explosion.ai/blog/part-of-speech-pos-tagger-in-python


In [38]:
nltk.download('punkt')
nltk.download('punkt_tab')

nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

It can be run straightforwardly like this:

In [39]:
text = nltk.word_tokenize("And now for something completely different")
nltk.pos_tag(text, tagset="universal")

[('And', 'CONJ'),
 ('now', 'ADV'),
 ('for', 'ADP'),
 ('something', 'NOUN'),
 ('completely', 'ADV'),
 ('different', 'ADJ')]

### POS tagging in other languages

POS taggers are available for a great many languages. A popular package called spaCy provides taggers for a number of them. Here, as an example, is a German tagger.

In [40]:
!pip install -U spacy



In [41]:
import spacy

In [42]:
!python -m spacy download de_core_news_sm

Collecting de-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.8.0/de_core_news_sm-3.8.0-py3-none-any.whl (14.6 MB)
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [43]:
nlp = spacy.load('de_core_news_sm')

In [44]:
text = "Das ist nicht gut."

In [45]:
s1_t = nlp(text)

In [46]:
for tk in s1_t:
    print(tk.text, tk.tag_, tk.pos_)

Das PDS PRON
ist VAFIN AUX
nicht PTKNEG PART
gut ADJD ADV
. $. PUNCT


### Chunking / Shallow Parsing

Chunking involves grouping words together into elementary phrases. In its most common form it doesn't involve any hierarchical structure.


In [47]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [48]:
text = nltk.word_tokenize("I study Linguistics and Social Anthropology at the University of Manchester")

In [55]:
grammar = r"""
  NP: {<NOUN>+<CONJ><NOUN>+}
      {<DET|ADP>?<ADJ>*<NOUN><ADP>*<NOUN>*}
      {<PRON>}
      {<NOUN>+}
"""
sent=nltk.pos_tag(text,tagset="universal")
cp = nltk.RegexpParser(grammar)
cs = cp.parse(sent)
print(cs)

(S
  (NP I/PRON)
  study/VERB
  (NP Linguistics/NOUN and/CONJ Social/NOUN Anthropology/NOUN)
  at/ADP
  (NP the/DET University/NOUN of/ADP Manchester/NOUN))


Update the grammar so that it produces the following shallow parse: <br> <br>
(S <br>
  (NP I/PRON) <br>
  study/VERB <br>
  (NP Linguistics/NOUN and/CONJ Social/NOUN Anthropology/NOUN) <br>
  at/ADP <br>
  (NP the/DET University/NOUN of/ADP Manchester/NOUN)) <br>