# 1. Penn Treebank & NLTK

The Penn Treebank Project
https://web.archive.org/web/19970614160127/http://www.cis.upenn.edu:80/~treebank/

Penn Treebank POS tags https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

#Tree structure

<img src="img/tree1.png" alt="Drawing" style="width: 400px;"/>

#Parsed sentence

#Tree structure

<img src="img/tree2.png" alt="Drawing" style="width: 1000px;"/>

#Parsed sentence

#Exercise 1:

We have a sentence:

"I love NLP ."

Could you draw its tree structure and write its parsed sentence in the form introduced above?

### Penn Treebank in nltk

The corpus module in nltk defines the treebank corpus reader, which contains a 10% sample of the Penn Treebank corpus.
If you'd like to see more parsed sentence examples, you can install the nltk module and explore this corpus reader. (More information can be found at http://www.nltk.org/book/ch08.html)

# 2. Function

Reading the Penn Treebank corpus sections 00-18 into a list of tuples

In [2]:
lines = [line.strip() for line in open('wsj00-18.tag') if "\t" in line]

In [3]:
wordtags = [(l.split("\t")[0],l.split("\t")[1]) for l in lines]

In [4]:
wordtags[:10]

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT')]

What if we want to read more than one file into a list? Do we repeat the same lines several times, changing only the file names?
No. Whenever we find ourselves reusing code by copying it and adapting it to a different context, it's time to define a function!

In [5]:
def Text2Tuplelist(filename):
    lines = [line.strip() for line in open(filename) if "\t" in line]
    wordtags = [(l.split("\t")[0],l.split("\t")[1]) for l in lines]
    return wordtags

In [6]:
wordtags00_18 = Text2Tuplelist('wsj00-18.tag')

In [7]:
wordtags00_18 == wordtags

True

In [8]:
wordtags19_21 = Text2Tuplelist('wsj19-21.tag')

In [9]:
wordtags22_24 = Text2Tuplelist('wsj22-24.tag')
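
When there are many files to read, the same function can be called in a loop instead of once per variable. The sketch below is a minimal, self-contained illustration of that idea. It assumes one tab-separated word/tag pair per line (as the code above suggests), writes two tiny hypothetical sample files to a temporary directory instead of using the real corpus files, and introduces a hypothetical helper `read_tagged_files` that is not part of the original notebook.

```python
import os
import tempfile


def text_to_tuple_list(filename):
    """Read a tab-separated word/tag file into a list of (word, tag) tuples."""
    with open(filename, encoding="utf-8") as f:
        # Keep only lines that contain a tab, i.e. actual word/tag pairs.
        lines = [line.strip() for line in f if "\t" in line]
    return [tuple(line.split("\t")[:2]) for line in lines]


def read_tagged_files(filenames):
    """Hypothetical helper: concatenate the tuples of several files, in order."""
    wordtags = []
    for name in filenames:
        wordtags.extend(text_to_tuple_list(name))
    return wordtags


# Hypothetical sample data standing in for the real corpus files.
samples = {
    "sample_a.tag": "Pierre\tNNP\nVinken\tNNP\n\n",
    "sample_b.tag": "will\tMD\njoin\tVB\n",
}

with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for name, content in samples.items():
        path = os.path.join(tmpdir, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
        paths.append(path)
    combined = read_tagged_files(paths)

print(combined)
```

Opening the file in a `with` block closes it as soon as the lines have been read, which is a common habit when a function may be called on many files.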

# 3. Iterables, especially lists, sets, and dictionaries

In [10]:
#Combining lists
wordtags_all = wordtags00_18 + wordtags19_21 + wordtags22_24

In [11]:
#convert a list to a set or vice versa
list2set = set(['cat', 'dog'])
print (list2set)

set2list = list({'cat', 'dog'})
print (set2list)

{'dog', 'cat'}
['dog', 'cat']


In [12]:
#convert a list of tuples of two elements each to a dictionary
list2dt = dict([('cat', 4), ('dog', 3)])
print (list2dt)
#This works only if you want to keep one value for each key

{'dog': 3, 'cat': 4}


In [13]:
list2dt = dict([('cat', 4), ('dog', 3), ('cat', 5)])
print (list2dt)

{'dog': 3, 'cat': 5}


What if we want to keep all the values? For example, what if we want to know, for each token type in the Penn Treebank corpus, which tags it has?

Dictionary = {word:taglist}

In [14]:
#Code one
wordtags_dict = {}
for word, tag in wordtags_all:
    if word not in wordtags_dict:
        wordtags_dict[word] = [tag]
    else:
        wordtags_dict[word].append(tag)

wordtags_dict['Boulder']

['NNP', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP']

In [15]:
#Code two
from collections import defaultdict

d = defaultdict(list)
for word, tag in wordtags_all:
    d[word].append(tag)

In [16]:
d['Boulder']

['NNP', 'NNP', 'NNP', 'NNP', 'NNP', 'NNP']

In [17]:
d == wordtags_dict

True

How do we get the list of token types with exactly 1 tag / 2 tags / 3 tags?

In [18]:
onetagwords = [(word, set(tags)) for word, tags in d.items() if len(set(tags)) == 1]
len(onetagwords)

42586
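
The cell above covers the one-tag case. One possible way to handle the 1 / 2 / 3 tag cases together is to group words by their number of distinct tags. The sketch below uses a small hypothetical dictionary `toy` in the same shape as `d` (word to list of tags), and the name `words_by_tag_count` is likewise an illustration, not part of the original notebook.

```python
from collections import defaultdict

# Hypothetical toy dictionary in the same shape as d: word -> list of tags.
toy = {
    "Boulder": ["NNP", "NNP"],
    "have": ["VBP", "VB", "VBP"],
    "back": ["RB", "NN", "VB", "RB"],
    "the": ["DT"],
}

# Map each number of distinct tags to the words that have exactly that many.
words_by_tag_count = defaultdict(list)
for word, tags in toy.items():
    words_by_tag_count[len(set(tags))].append(word)

print(words_by_tag_count[1])  # words with exactly one distinct tag
print(words_by_tag_count[2])  # words with exactly two distinct tags
print(words_by_tag_count[3])  # words with exactly three distinct tags
```

With the real data, replacing `toy` by `d` should give the same grouping for the whole corpus in a single pass.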

How do we get the count of each tag for each token type?
For example, for 'have' in the Penn Treebank corpus, what are its tags, and how many times is each tag assigned to 'have'?

In [19]:
d['have']

['VBP',
 'VBP',
 'VBP',
 'VB',
 'VBP',
 ...]

In [21]:
set(d['have'])

{'JJ', 'VB', 'VBD', 'VBN', 'VBP'}

In [22]:
#have_tags = d['have']
have_tag_freq = [('have', tag, d['have'].count(tag)) for tag in set(d['have'])]
have_tag_freq

[('have', 'VBN', 1),
 ('have', 'VBD', 1),
 ('have', 'JJ', 1),
 ('have', 'VBP', 2408),
 ('have', 'VB', 1366)]

How to sort a list of tuples by a certain element in the tuple?

In [24]:
have_tag_freq = sorted(have_tag_freq, key = lambda x:x[2]) #from the least common to the most common
have_tag_freq

[('have', 'VBN', 1),
 ('have', 'VBD', 1),
 ('have', 'JJ', 1),
 ('have', 'VB', 1366),
 ('have', 'VBP', 2408)]

In [25]:
have_tag_freq = sorted(have_tag_freq, key = lambda x:x[2], reverse=True) #from the most common to the least common
have_tag_freq

[('have', 'VBP', 2408),
 ('have', 'VB', 1366),
 ('have', 'VBN', 1),
 ('have', 'VBD', 1),
 ('have', 'JJ', 1)]
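
As an alternative to calling `count` once per tag and then sorting, the standard library's `collections.Counter` can do the counting in a single pass, and its `most_common()` method returns the pairs already ordered from most to least frequent. This is a sketch of a swapped-in technique, not the approach used in the cells above, and the short `tags` list is hypothetical data standing in for `d['have']`.

```python
from collections import Counter

# Hypothetical toy tag list standing in for d['have'].
tags = ["VBP", "VB", "VBP", "VBP", "VB", "JJ"]

# Count every tag in one pass over the list.
tag_counts = Counter(tags)

# most_common() yields (tag, count) pairs from most to least frequent.
tag_freq = [("have", tag, count) for tag, count in tag_counts.most_common()]
print(tag_freq)
```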

### lambda function

The lambda operator or lambda function is a way to create small anonymous functions, i.e. functions without a name. These functions are throw-away functions, i.e. they are just needed where they have been created.

The general syntax of a lambda function is quite simple: 

lambda argument_list: expression 
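
To make the syntax concrete, the sketch below places a lambda next to an equivalent named function and uses both as a sort key. The names `get_count`, `get_count_named`, and `rows` are introduced here for illustration only; the three tuples reuse counts from the output above.

```python
# A lambda bound to a name, and the equivalent function written with def.
get_count = lambda item: item[2]


def get_count_named(item):
    return item[2]


rows = [("have", "VB", 1366), ("have", "VBP", 2408), ("have", "JJ", 1)]

# Both callables return the third element of a tuple.
print(get_count(rows[0]))
print(get_count_named(rows[0]))

# Used as a sort key, an inline lambda and the named function give the same order.
by_lambda = sorted(rows, key=lambda item: item[2], reverse=True)
by_named = sorted(rows, key=get_count_named, reverse=True)
print(by_lambda == by_named)
```

In practice a lambda is usually written inline, as in the `sorted` calls above, and a `def` is preferred once the function needs a name or more than one expression.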



Exercise 2&3:

Please define a function that takes a dictionary of token_type:taglist and a token_type, and returns a list of tuples (token_type, tag, number of occurrences of the tag) sorted from the most frequent to the least frequent. The template of the function is as follows:

def WordTagFreq(wordtag_dict, word):
    
    ...
    return word_tag_freq 
    
For example, if we call the function with d and "have":

word_tag_freq = WordTagFreq(d, "have")

what's returned should be:
[('have', 'VBP', 2408),
 ('have', 'VB', 1366),
 ('have', 'JJ', 1),
 ('have', 'VBN', 1),
 ('have', 'VBD', 1)]


In [56]:
#Example Answer:
def WordTagFreq(wordtag_dict, word):
    word_tag_freq = [(word, tag, wordtag_dict[word].count(tag)) for tag in set(wordtag_dict[word])]
    word_tag_freq = sorted(word_tag_freq, key = lambda x:x[2], reverse=True)
    return word_tag_freq

word_tag_freq = WordTagFreq(wordtags_dict, "have")
print (word_tag_freq)

[('have', 'VBP', 2408), ('have', 'VB', 1366), ('have', 'JJ', 1), ('have', 'VBN', 1), ('have', 'VBD', 1)]
