## POS Tagging - Lexicon and Rule Based Taggers

Let's look at the two most basic tagging techniques: lexicon-based (or unigram) and rule-based.

In this guided exercise, you will explore the WSJ (Wall Street Journal) POS-tagged corpus that comes with NLTK and build a lexicon-based and a rule-based tagger using this corpus as the training data.

This exercise is divided into the following sections:
1. Reading and understanding the tagged dataset
2. Exploratory analysis

### 1. Reading and understanding the tagged dataset

In [4]:
# Importing libraries
import nltk
import numpy as np
import pandas as pd
import pprint, time
import random
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
import math
nltk.download('treebank')

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\jaisa05\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.


True

In [5]:
# reading the Treebank tagged sentences
wsj = list(nltk.corpus.treebank.tagged_sents())

In [6]:
# samples: Each sentence is a list of (word, pos) tuples
wsj[:3]

[[('Pierre', 'NNP'),
  ('Vinken', 'NNP'),
  (',', ','),
  ('61', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  (',', ','),
  ('will', 'MD'),
  ('join', 'VB'),
  ('the', 'DT'),
  ('board', 'NN'),
  ('as', 'IN'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('Nov.', 'NNP'),
  ('29', 'CD'),
  ('.', '.')],
 [('Mr.', 'NNP'),
  ('Vinken', 'NNP'),
  ('is', 'VBZ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Elsevier', 'NNP'),
  ('N.V.', 'NNP'),
  (',', ','),
  ('the', 'DT'),
  ('Dutch', 'NNP'),
  ('publishing', 'VBG'),
  ('group', 'NN'),
  ('.', '.')],
 [('Rudolph', 'NNP'),
  ('Agnew', 'NNP'),
  (',', ','),
  ('55', 'CD'),
  ('years', 'NNS'),
  ('old', 'JJ'),
  ('and', 'CC'),
  ('former', 'JJ'),
  ('chairman', 'NN'),
  ('of', 'IN'),
  ('Consolidated', 'NNP'),
  ('Gold', 'NNP'),
  ('Fields', 'NNP'),
  ('PLC', 'NNP'),
  (',', ','),
  ('was', 'VBD'),
  ('named', 'VBN'),
  ('*-1', '-NONE-'),
  ('a', 'DT'),
  ('nonexecutive', 'JJ'),
  ('director', 'NN'),
  ('of', 'IN'),
  ('this'

In the list above, each element is a sentence. Note that each of the sentences shown ends with a full stop '.' whose POS tag is also '.'. Thus, the POS tag '.' marks the end of a sentence.

For this exercise we do not need the corpus to be segmented into sentences; a flat list of (word, tag) tuples is enough. Let's flatten the list of sentences into a single list of (word, tag) tuples.

In [11]:
# converting the list of sents to a list of (word, pos tag) tuples
tagged_words = [tup for sent in wsj for tup in sent]
print(len(tagged_words))
tagged_words[0:10]

100676


[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT')]

We now have a list of 100,676 (word, tag) tuples. Let's now do some exploratory analysis.

### 2. Exploratory Analysis

Let's now conduct some basic exploratory analysis to understand the tagged corpus. To start with, let's ask some simple questions:
1. How many unique tags are there in the corpus? 
2. Which is the most frequent tag in the corpus?
3. Which tag is most commonly assigned to the following words:
    - "bank"
    - "executive"


In [29]:
# question 1: Find the number of unique POS tags in the corpus
# you can use the set() function on the list of tags to get a unique set of tags, 
# and compute its length
tags =  pd.DataFrame(tagged_words)[1]
unique_tags = set(tags)
len(unique_tags)

46

In [32]:
# question 2: Which is the most frequent tag in the corpus
# to count the frequency of elements in a list, the Counter() class from collections
# module is very useful, as shown below

from collections import Counter
tag_counts = Counter(tags)
tag_counts.most_common(1)

[('NN', 13166)]

In [44]:
# the most common tags can be seen using the most_common() method of Counter
tag_counts.most_common(5)

[('NN', 13166), ('IN', 9857), ('NNP', 9410), ('DT', 8165), ('-NONE-', 6592)]

Thus, NN is the most common tag, followed by IN, NNP, DT and -NONE-. You can read the full list of tags in the NLTK documentation, as shown below.

In [34]:
# list of POS tags in NLTK
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [54]:
# put the (word, tag) tuples in a DataFrame: column 0 is the word, column 1 is the tag
df = pd.DataFrame(tagged_words)
df.loc[df[0]=='executive']

Unnamed: 0,0,1
1641,executive,NN
2954,executive,JJ
3366,executive,JJ
5132,executive,NN
5212,executive,NN
5237,executive,NN
5243,executive,NN
5261,executive,NN
7334,executive,NN
7402,executive,NN


In [51]:
# question 3: Which tag is most commonly assigned to the word w?
# Get the tags that appear for the word w and then count them with Counter()
# Try w = 'bank'
bank = Counter(df.loc[df[0]=='bank'][1])
bank

Counter({'NN': 38})

In [56]:
# question 3: Which tag is most commonly assigned to the word w. Try 'executive' 
executive = Counter(df.loc[df[0]== 'executive'][1])
executive

Counter({'JJ': 28, 'NN': 40})
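The per-word tag counts above are the raw material of a lexicon (unigram) tagger: store, for every word, the tag it carries most often, and look it up at tagging time. The following is a minimal sketch of that idea, not the exercise's reference solution. The data in `toy_tagged_words` is made up so the block runs without the corpus, and the helper names `build_lexicon` and `lexicon_tag`, as well as the `NN` fallback for unseen words, are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs standing in for tagged_words; these counts are made up.
toy_tagged_words = [
    ("the", "DT"), ("executive", "NN"), ("said", "VBD"),
    ("the", "DT"), ("executive", "JJ"), ("director", "NN"),
    ("an", "DT"), ("executive", "NN"), ("resigned", "VBD"),
    ("the", "DT"), ("bank", "NN"),
]

def build_lexicon(tagged_words):
    """Map each word to the tag it carries most often in tagged_words."""
    tag_counts_by_word = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts_by_word[word][tag] += 1
    return {word: counts.most_common(1)[0][0]
            for word, counts in tag_counts_by_word.items()}

def lexicon_tag(words, lexicon, default_tag="NN"):
    """Tag each word by lookup; unseen words get default_tag (an assumption)."""
    return [(word, lexicon.get(word, default_tag)) for word in words]

lexicon = build_lexicon(toy_tagged_words)
print(lexicon_tag(["the", "executive", "bank", "merger"], lexicon))
```

On the real corpus you could pass `tagged_words` to `build_lexicon` instead of the toy list. Falling back to `NN` for unknown words is one simple choice, motivated by NN being the most frequent tag in the corpus.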

### 2. Exploratory Analysis Contd.

Until now, we were looking at the frequency of tags assigned to particular words, which is the basic idea used by lexicon or unigram taggers. Let's now try observing some rules which can potentially be used for POS tagging. 

To start with, let's see if the following questions reveal something useful:

4. What fraction of words with the tag 'VBD' (verb, past tense) end with the letters 'ed'?
5. What fraction of words with the tag 'VBG' (verb, present participle/gerund) end with the letters 'ing'?

In [68]:
# 4. what fraction of words with the tag 'VBD' (verb, past tense) end with 'ed'
# first get all the words tagged as VBD
past_tense_verbs = df.loc[df[1]=='VBD'][0]

# subset the past tense verbs with words ending with 'ed'. (Try w.endswith('ed'))
ed_verbs = [x for x in past_tense_verbs if x.endswith('ed')]
print(len(ed_verbs) / len(past_tense_verbs))
ed_verbs[:20]

0.3881038448899113


['reported',
 'stopped',
 'studied',
 'led',
 'worked',
 'explained',
 'imposed',
 'dumped',
 'poured',
 'mixed',
 'described',
 'ventilated',
 'contracted',
 'continued',
 'eased',
 'ended',
 'lengthened',
 'reached',
 'resigned',
 'approved']

In [69]:
# 5. what fraction of words with the tag 'VBG' end with 'ing'
participle_verbs = df.loc[df[1]=='VBG'][0]
ing_verbs = [x for x in participle_verbs if x.endswith('ing')]
print(len(ing_verbs) / len(participle_verbs))
ing_verbs[:20]

0.9972602739726028


['publishing',
 'causing',
 'using',
 'talking',
 'having',
 'making',
 'surviving',
 'including',
 'including',
 'according',
 'remaining',
 'according',
 'declining',
 'rising',
 'yielding',
 'waiving',
 'holding',
 'holding',
 'cutting',
 'manufacturing']
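The two fractions above suggest how suffix rules behave: an 'ing' rule would cover almost every VBG token (about 99.7%), while an 'ed' rule would cover only about 39% of VBD tokens, since many past-tense verbs are irregular. Note that these numbers measure coverage only; they do not say how many words ending in 'ed' or 'ing' carry some other tag. The sketch below shows one way such rules could be turned into a tagger using plain regular expressions. The rule list `SUFFIX_RULES`, the helper `rule_tag` and the `NN` fallback are illustrative assumptions, not part of the exercise.

```python
import re

# Ordered (pattern, tag) rules; the first matching pattern wins.
# The rule set and the NN fallback are illustrative assumptions.
SUFFIX_RULES = [
    (re.compile(r".*ing$"), "VBG"),           # gerund / present participle
    (re.compile(r".*ed$"), "VBD"),            # regular past tense
    (re.compile(r"^-?\d+(\.\d+)?$"), "CD"),   # plain integers and decimals
    (re.compile(r".*"), "NN"),                # fallback: tag everything else NN
]

def rule_tag(words, rules=SUFFIX_RULES):
    """Assign each word the tag of the first rule whose pattern matches it."""
    tagged = []
    for word in words:
        for pattern, tag in rules:
            if pattern.match(word):
                tagged.append((word, tag))
                break
    return tagged

print(rule_tag(["reported", "rising", "61", "board", "said"]))
```

The irregular verb 'said' falls through to the `NN` fallback, which is exactly the kind of miss that the low 'ed' coverage for VBD predicts.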

### 2. Exploratory Analysis Contd.

Let's now try observing some tag patterns using the fact that some tags are more likely to appear after certain other tags. For example, nouns NN are often preceded by determiners DT ("The/DT constitution/NN"), adjectives JJ usually precede a noun NN ("A large/JJ building/NN"), etc.

Try answering the following questions:
1. What fraction of adjectives JJ are followed by a noun NN? 
2. What fraction of determiners DT are followed by a noun NN?
3. What fraction of modals MD are followed by a verb VB?

In [104]:
# question: what fraction of adjectives JJ are followed by a noun NN

# create a list of all tags (without the words)
tags = df[1]

# select the rows whose tag is JJ
jj_tags = df.loc[df[1]=='JJ']

# create a list of (JJ, NN) pairs: one entry for each JJ tag immediately followed by an NN tag
jj_nn_tags = [('JJ','NN') for index, t in enumerate(tags) if tags[index]=='JJ' and tags[index+1]=='NN']

print(len(jj_tags))
print(len(jj_nn_tags))
print(len(jj_nn_tags) / len(jj_tags))

5834
2611
0.4475488515598217


In [115]:
print(len(tags[tags=='DT']))
len(df.loc[df[1]=='DT'])

8165


8165

In [117]:
# question: what fraction of determiners DT are followed by a noun NN
dt_tags = tags[tags=='DT']
dt_nn_tags = [('DT','NN') for index, t in enumerate(tags) 
              if tags[index]=='DT' and tags[index+1]=='NN']

print(len(dt_tags))
print(len(dt_nn_tags))
print(len(dt_nn_tags) / len(dt_tags))

8165
3844
0.470789957134109


In [119]:
# question: what fraction of modals MD are followed by a verb VB?
md_tags = tags[tags=='MD']
md_vb_tags = [('MD','VB') for index, t in enumerate(tags) 
              if tags[index]=='MD' and tags[index+1]=='VB']

print(len(md_tags))
print(len(md_vb_tags))
print(len(md_vb_tags) / len(md_tags))

927
756
0.8155339805825242
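The three cells above repeat the same computation with different tag pairs. As a sketch, it can be written once as a function that counts adjacent tag pairs with `zip` and `Counter`, which also avoids looking up `index+1` past the end of the list. The helper name `followed_by_fraction` and the `toy_tags` list are made up for illustration so that the block runs without the corpus.

```python
from collections import Counter

# Toy tag sequence standing in for the corpus tags; made up for illustration.
toy_tags = ["DT", "JJ", "NN", "MD", "VB", "DT", "NN", "IN",
            "DT", "JJ", "JJ", "NN", "."]

def followed_by_fraction(tags, first, second):
    """Fraction of `first` tags that are immediately followed by `second`."""
    tags = list(tags)
    # Count every adjacent (tag, next_tag) pair in one pass.
    pair_counts = Counter(zip(tags, tags[1:]))
    first_count = tags.count(first)
    if first_count == 0:
        return 0.0
    return pair_counts[(first, second)] / first_count

for first, second in [("JJ", "NN"), ("DT", "NN"), ("MD", "VB")]:
    print(first, second, followed_by_fraction(toy_tags, first, second))
```

With the real data you could call, for example, `followed_by_fraction(tags, 'MD', 'VB')`. As in the cells above, the tags come from a flat list, so pairs that span a sentence boundary are counted too.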
