# **Natural Language Processing with Python**
by [CSpanias](https://cspanias.github.io/aboutme/) - 02/2022

Content based on the [NLTK book](https://www.nltk.org/book/). <br>

You can find Chapter 5 [here](https://www.nltk.org/book/ch05.html).

# CONTENT

1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
4. Writing Structured Programs
5. Categorizing and Tagging Words
    1. Using a Tagger
    1. Tagged Corpora
    1. [Mapping Words to Properties Using Python Dictionaries](#mapwithdicts)
        1. [Indexing Lists vs Dictionaries](#indexing)
        1. [Dictionaries in Python](#dicts)
        1. [Default Dictionaries](#default)
        1. [Incrementally Updating a Dictionary](#updating)
        1. [Complex Keys and Values](#complex)
        1. [Inverting a Dictionary](#inverting)

<a name="mapwithdicts"></a>
# 5.3 Mapping Words to Properties Using Python Dictionaries

As we have seen, a tagged word of the form `(word, tag)` is an association between a word and a part-of-speech tag. 

Once we start doing part-of-speech tagging, we will be creating programs that __assign a tag to a word__, the tag which is most likely in a given context. 

We can think of this process as __mapping from words to tags__. 

The most natural way to store mappings in Python uses the so-called __dictionary data type__ (also known as an __associative array__ or __hash array__ in other programming languages).

<a name="indexing"></a>
## 5.3.1 Indexing Lists vs Dictionaries

A text is treated in Python as a list of words. An important property of __lists__ is that we can __"look up" a particular item by giving its index__, e.g. `text1[100]`. Notice how we __specify a number__, and __get back a word__.

Contrast this situation with __frequency distributions__, where we __specify a word__, and __get back a number__, e.g. `fdist['monstrous']`, which tells us the number of times a given word has occurred in a text.
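
The two directions of lookup can be put side by side in a minimal sketch. The toy sentence is made-up data, and `collections.Counter` is used as a standard-library stand-in for `nltk.FreqDist` so the snippet runs without any corpus download.

```python
from collections import Counter

# A toy text, tokenized into a list of words (hypothetical data).
tokens = "the monstrous whale saw the monstrous ship".split()

# List lookup: give a number (a position), get back a word.
word_at_1 = tokens[1]

# Count lookup: give a word, get back a number.
# Counter stands in for nltk.FreqDist here (an assumption of this sketch).
fdist = Counter(tokens)
count_of_monstrous = fdist['monstrous']

print(word_at_1)           # monstrous
print(count_of_monstrous)  # 2
```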

In general, we would like to be able to __map between arbitrary types of information__. 

Most often, we are mapping __from a "word" to some structured object__. The figure below lists a variety of __linguistic objects__, along with what they map. 

![mapping.PNG](attachment:mapping.PNG)

<a name="dicts"></a>
## 5.3.2 Dictionaries in Python

Python provides a dictionary data type that can be used for __mapping between arbitrary types__. It is like a conventional dictionary, in that it gives you an __efficient way to look things up__.

In [1]:
# define empty dict
pos = {}

# add key, value pairs
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'

# show dict
print(pos)

# show value
print(pos['ideas'])

{'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
N


With lists and strings, we can use `len()` to work out which integers are __legal indexes__. How do we work out the __legal keys__ for a dictionary?

If the dictionary is not too big, we can simply __inspect its contents__ by evaluating the variable `pos`. This gives us the key-value pairs.

In Python 3.7 and later the pairs appear in the order they were entered, because dictionaries preserve insertion order; older versions gave no such guarantee. Even so, __dictionaries are not sequences but mappings__: entries are looked up __by key, not by position__.

Alternatively, to just find the keys, we can __convert the dictionary to a list__, or __use the dictionary in a context where a list is expected__, as the parameter of `sorted()`, or in a `for` loop.

In [2]:
# convert dict to list
print(list(pos))

# sort dict
print(sorted(pos))

# search for keys
[w for w in pos if w.endswith('s')]

['colorless', 'ideas', 'sleep', 'furiously']
['colorless', 'furiously', 'ideas', 'sleep']


['colorless', 'ideas']

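Finally, the dictionary methods `keys()`, `values()` and `items()` allow us to __access the keys, values, and key-value pairs separately__. In Python 3 they return view objects rather than lists, so wrap them in `list()` when an actual list is needed.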

We can even __sort tuples__, which orders them according to their first element (and if the first elements are the same, it uses their second elements).

In [3]:
# show dict keys
print(pos.keys(), "\n")

# show dict values
print(pos.values(), "\n")

# show dict key, value pairs
print(pos.items(), "\n")

for key, val in sorted(pos.items()):
    print(key + ": " + val)

dict_keys(['colorless', 'ideas', 'sleep', 'furiously']) 

dict_values(['ADJ', 'N', 'V', 'ADV']) 

dict_items([('colorless', 'ADJ'), ('ideas', 'N'), ('sleep', 'V'), ('furiously', 'ADV')]) 

colorless: ADJ
furiously: ADV
ideas: N
sleep: V
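
In Python 3, `keys()`, `values()` and `items()` return view objects rather than lists, which is why the output above is wrapped in `dict_keys(...)` and friends. A view stays in sync with the dictionary, while `list()` takes a snapshot. The following sketch uses a small made-up dictionary to show the difference.

```python
pos = {'colorless': 'ADJ', 'ideas': 'N'}

keys_view = pos.keys()      # a live view onto the dictionary
keys_snapshot = list(pos)   # an independent list, fixed at this moment

pos['sleep'] = 'V'          # modify the dictionary afterwards

print(list(keys_view))  # ['colorless', 'ideas', 'sleep']
print(keys_snapshot)    # ['colorless', 'ideas']
```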


We want to be sure that when we look something up in a dictionary, we __only get one value for each key__. 

Now suppose we try to use a dictionary to store the fact that the word `sleep` can be used as both a `verb` and a `noun`.

In [4]:
# store sleep as a verb
pos['sleep'] = 'V'

# store sleep as a noun
pos['sleep'] = 'N'

pos['sleep']

'N'

The initial value `'V'` is __immediately overwritten with the new value__ `'N'`. 

In other words, __there can only be one entry in the dictionary for `'sleep'`__. 

However, __there is a way of storing multiple values in that entry: we use a list value__, e.g. `pos['sleep'] = ['N', 'V']`. 

In fact, this is how the __CMU Pronouncing Dictionary__ works; it stores __multiple pronunciations for a single word__.
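
As a minimal sketch of the list-value approach (the words and tags are the toy data used above, and `setdefault()` is just one of several ways to handle a word that has no entry yet):

```python
# Each word maps to a list of tags instead of a single tag.
pos = {'colorless': ['ADJ'], 'sleep': ['V']}

# Add a second tag for an existing word without losing the first one.
pos['sleep'].append('N')

# For a word that is not in the dictionary yet, setdefault() creates
# an empty list first and then returns it, so append() still works.
pos.setdefault('ideas', []).append('N')

print(pos['sleep'])  # ['V', 'N']
print(pos['ideas'])  # ['N']
```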

<a name="default"></a>
## 5.3.3 Default Dictionaries

We can use the same key-value pair format to __create a dictionary__. There are a couple of ways to do this, and we will normally use the first:

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')


In [5]:
# define dict
pos = {
    'colorless': 'ADJ',
    'ideas': 'N',
    'sleep': 'V',
    'furiously': 'ADV'
}

# print dict
print(pos)

# define dict
pos1 = dict(colorless='ADJ',
          ideas='N',
          sleep='V',
          furiously='ADV')

# show dict
pos1

{'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}


{'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

Note that dictionary __keys must be immutable (more precisely, hashable) types__, such as strings and tuples. If we try to define a dictionary using a mutable key, such as a list, we get a `TypeError`.

If we try to access a __key that is not in a dictionary__, we get an error.
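
Both failure modes can be reproduced in a few lines. This is a small sketch on made-up entries; it also shows two standard ways of testing for a key without triggering the error, the `in` operator and `dict.get()`.

```python
pos = {'sleep': 'V'}
errors = []

# A list is mutable (unhashable), so it cannot serve as a key.
try:
    pos[['deep', 'sleep']] = 'N'
except TypeError as err:
    errors.append(type(err).__name__)

# A tuple with the same content is immutable and works fine.
pos[('deep', 'sleep')] = 'N'

# Looking up a key that was never stored raises KeyError.
try:
    pos['ideas']
except KeyError as err:
    errors.append(type(err).__name__)

# Safe alternatives that never raise for a missing key.
has_ideas = 'ideas' in pos
fallback = pos.get('ideas', 'UNK')

print(errors)     # ['TypeError', 'KeyError']
print(has_ideas)  # False
print(fallback)   # UNK
```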

However, it's often useful if a dictionary can __automatically create an entry for this new key and give it a default value__, such as zero or the empty list. 

For this reason, a special kind of dictionary called a `defaultdict` is available. 

In order to use it, we have to supply a parameter which can be used to create the default value, e.g. `int`, `float`, `str`, `list`, `dict`, `tuple`.

In [6]:
from collections import defaultdict

# define dict which stores int values
frequency = defaultdict(int)

# assign key, value pair
frequency['colorless'] = 4

# access non-existing key
print(frequency['ideas'])

# define dict which stores list values
pos = defaultdict(list)

# assign key, value pair
pos['sleep'] = ['NOUN', 'VERB']

# access non-existing key
print(pos['ideas'])

0
[]


These parameters are actually __functions that convert other objects to the specified type__ (e.g. `int("2")`, `list("2")`). 

When they are called with no parameter — `int()`, `list()` — they return `0` and `[]` respectively.

The above examples specified the __default value of a dictionary entry to be the default value of a particular data type__. 

However, we can __specify any default value we like__, simply by providing the name of a function that can be called with no arguments to create the required value.

In [7]:
# create a dict whose default value for any entry is 'NOUN'
pos = defaultdict(lambda: 'NOUN')

# assign a key,value pair
pos['colorless'] = 'ADJ'

# access a non-existing entry
print(pos['blog'])

# convert dict to list
list(pos.items())

NOUN


[('colorless', 'ADJ'), ('blog', 'NOUN')]

The above example used a __lambda expression__. 

This lambda expression specifies __no parameters__, so we call it using parentheses with no arguments. Thus, the definitions of `f` and `g` below are equivalent.

In [8]:
f = lambda: 'NOUN'
print(f())

def g():
    return 'NOUN'
g()

NOUN


'NOUN'
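
One side effect is worth spelling out: in the `list(pos.items())` output earlier, `'blog'` shows up as an entry even though it was only looked up, never assigned. Indexing a `defaultdict` with a missing key calls the default function and stores the result. The sketch below contrasts this with `in` and `get()`, which leave the dictionary unchanged.

```python
from collections import defaultdict

pos = defaultdict(lambda: 'NOUN')
pos['colorless'] = 'ADJ'

# Membership tests and get() do not call the default function.
before_in = 'blog' in pos
before_get = pos.get('blog')
size_before = len(pos)

# Indexing a missing key calls the function and stores the result.
tag = pos['blog']
after_in = 'blog' in pos
size_after = len(pos)

print(before_in, before_get, size_before)  # False None 1
print(tag, after_in, size_after)           # NOUN True 2
```

This matters when a `defaultdict` is used for lookups over a large text: every unseen word that is indexed becomes a permanent entry.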

Let's see how default dictionaries could be used in a more substantial language processing task. 

Many language processing tasks — including tagging — __struggle to correctly process the hapaxes of a text__. They can perform better with a __fixed vocabulary__ and a guarantee that __no new words__ will appear. 

We can preprocess a text to __replace low-frequency words with a special "out of vocabulary" token `UNK`__, with the help of a default dictionary.

In [9]:
from nltk import FreqDist
from nltk.corpus import gutenberg

# define text
alice = gutenberg.words('carroll-alice.txt')

# extract vocab
vocab = FreqDist(alice)

# extract most common words
v1000 = [w for (w, _) in vocab.most_common(1000)]

# define dict
mapping = defaultdict(lambda: 'UNK')

# for the most common 1000 words
for v in v1000:
    # map each word to itself, i.e. ('word': 'word')
    mapping[v] = v
    
# for words not already in dict map default value, i.e. ('word':'UNK')
alice2 = [mapping[v] for v in alice]

# show tokens
print(alice2[:20])

# check vocab length, i.e. 1000 most common unique tokens + UNK = 1001 tokens 
len(set(alice2))

['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'UNK', 'UNK', 'UNK', 'UNK', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit', '-', 'UNK']


1001

<a name="updating"></a>
## 5.3.4 Incrementally Updating a Dictionary

We can employ dictionaries to __count occurrences__. 

In [10]:
from nltk.corpus import brown

# create default dict
counts = defaultdict(int)

# access word, tag pairs in news category
for (word, tag) in brown.tagged_words(tagset='universal',
                                      categories='news'):
    # add 1 to the tag's value
    counts[tag] += 1
    
# check the number of NOUNS
print("The number of the tag NOUN is: {}.".format(counts['NOUN']), "\n")

# sort counts by keys, i.e. alphabetically
print("Sorted counts by keys (alphabetically):\n\n{}".format(sorted(counts)))

The number of the tag NOUN is: 30654. 

Sorted counts by keys (alphabetically):

['.', 'ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM', 'PRON', 'PRT', 'VERB', 'X']


In [11]:
from operator import itemgetter

# sort items by values, i.e. counts
print("Sorted counts by values (counts):\n\n{}\n"
      .format(sorted(counts.items(), key=itemgetter(1), reverse=True)))

# print tags sorted by counts
print("Tags sorted by counts:\n\n{}"
      .format([t for t, c in sorted(counts.items(), key=itemgetter(1), reverse=True)]))

Sorted counts by values (counts):

[('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389), ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264), ('NUM', 2166), ('X', 92)]

Tags sorted by counts:

['NOUN', 'VERB', 'ADP', '.', 'DET', 'ADJ', 'ADV', 'CONJ', 'PRON', 'PRT', 'NUM', 'X']


The above example illustrates an __important idiom for sorting a dictionary by its values__, here to show tags in decreasing order of frequency. 

The first parameter of `sorted()` is the items to sort, a list of tuples consisting of a POS tag and a frequency. The second parameter specifies the sort key using a function `itemgetter()`. 

In general, `itemgetter(n)` returns a function that can be called on some other sequence object to obtain the element at index `n`.

In [12]:
# define a pair
pair = ('NP', 8336)

# check pair second value
print(pair[1])

# use itemgetter
print(itemgetter(1)(pair))

8336
8336


The last parameter of `sorted()` specifies that the items should be returned in reverse order, i.e. decreasing values of frequency.
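
For comparison, the same ordering can be obtained in a few equivalent ways. The sketch below uses a subset of the tag counts printed above; the `lambda` form and `collections.Counter.most_common()` are alternatives brought in for illustration, not something the original example uses.

```python
from collections import Counter
from operator import itemgetter

# A few of the tag counts from the output above.
counts = {'NOUN': 30654, 'VERB': 14399, 'X': 92, 'ADP': 12355}

# 1. The idiom from the text: itemgetter(1) picks the count from each pair.
by_itemgetter = sorted(counts.items(), key=itemgetter(1), reverse=True)

# 2. The same sort key written as a lambda expression.
by_lambda = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# 3. Counter.most_common() returns pairs in decreasing order of count.
by_counter = Counter(counts).most_common()

print(by_itemgetter)
# [('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('X', 92)]
print(by_itemgetter == by_lambda == by_counter)  # True
```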

There's a second __useful programming idiom__ at the beginning, where we initialize a `defaultdict` and then use a `for` loop to update its values.

Here's a schematic version.

In [13]:
# my_dict = defaultdict(function to create default value)
# for item in sequence:
#     my_dict[item_key] is updated with information about item

We can use this idiom to index words according to their last 2 letters.

In [14]:
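import nltk

# create default dict
last_letters = defaultdict(list)

# load the word list; accessing the corpus as nltk.corpus.words means
# the variable `words` does not overwrite an imported name
words = nltk.corpus.words.words('en')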

for word in words:
    # extract last 2 letters
    key = word[-2:]
    # append the word to the list stored under its last 2 letters
    last_letters[key].append(word)
    
print(last_letters['ly'][:20])

['abactinally', 'abandonedly', 'abasedly', 'abashedly', 'abashlessly', 'abbreviately', 'abdominally', 'abhorrently', 'abidingly', 'abiogenetically', 'abiologically', 'abjectly', 'ableptically', 'ably', 'abnormally', 'abominably', 'aborally', 'aboriginally', 'abortively', 'aboundingly']


The following example uses the same pattern to create an __anagram dictionary__.

In [15]:
# create a dict
anagrams = defaultdict(list)

for word in words:
    # sort the characters alphabetically
    key = ''.join(sorted(word))
    # append the word to the list stored under its sorted letters
    anagrams[key].append(word)
    
anagrams['aeilnrt']

['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']

The above program works because `sorted()`, when applied to a word, splits it into a list of its characters in alphabetical order. 

We need `''.join()` to turn that list back into a single string, which can then serve as a dictionary key.

In [16]:
print("Sort a word using just sorted(): {}.\n".format(sorted(words[4])))
print("Sort a word using sorted() and ''.join(): {}.".format(''.join(sorted(words[4]))))

Sort a word using just sorted(): ['a', 'a', 'i', 'i', 'l'].

Sort a word using sorted() and ''.join(): aaiil.


Since __accumulating words like this is such a common task__, NLTK provides a more convenient way of creating a `defaultdict(list)`, in the form of `nltk.Index()`.

In [17]:
from nltk import Index

anagrams = Index((''.join(sorted(w)), w) for w in words)
anagrams['aeilnrt']

['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']

`nltk.Index` is a `defaultdict(list)` with __extra support for initialization__. 

Similarly, `nltk.FreqDist` is essentially a `defaultdict(int)` with __extra support for initialization__ (along with sorting and plotting methods).
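
The relationship between a counting loop and a ready-made counter can be illustrated with the standard library alone. In this sketch `collections.Counter` stands in for `nltk.FreqDist` (an assumption made so that the snippet runs without NLTK), and the tag sequence is made up.

```python
from collections import Counter, defaultdict

tags = ['NOUN', 'VERB', 'NOUN', 'DET', 'NOUN']

# The manual idiom: start from zero and add one per occurrence.
counts = defaultdict(int)
for tag in tags:
    counts[tag] += 1

# The ready-made version: pass the whole sequence at initialization.
counter = Counter(tags)

same_counts = dict(counts) == dict(counter)
missing = counter['ADJ']   # a missing key counts as zero

print(same_counts)             # True
print(missing)                 # 0
print(counter.most_common(1))  # [('NOUN', 3)]
```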

<a name="complex"></a>
## 5.3.5 Complex Keys and Values

We can use default dictionaries with complex keys and values. 

Let's study the range of possible tags for a word, given the word itself, and the tag of the previous word. 

We will see how this information can be used by a __POS tagger__.

In [18]:
from nltk import bigrams

# dict whose default value for an entry is another dict whose default value is 0
pos = defaultdict(lambda: defaultdict(int))

# obtain tagged words
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')

# for each pair of adjacent (word, tag) tokens, count the tag of the second
# word, keyed by the tag of the previous word and the second word itself
for ((w1, t1), (w2, t2)) in bigrams(brown_news_tagged):
    pos[(t1, w2)][t2] += 1

# check the POS of 'right' when it is preceded by 'DET'
pos[('DET', 'right')]

defaultdict(int, {'NOUN': 5, 'ADJ': 11})

This example uses a dictionary whose default value for an entry is a dictionary (whose default value is `int()`, i.e. zero). 

Notice how we iterated over the bigrams of the tagged corpus, processing a pair of word-tag pairs for each iteration. 

Each time through the loop we updated our `pos` dictionary's entry for `(t1, w2)`, a tag and its following word. 

When we look up an item in `pos` we must specify a compound key, and we get back a dictionary object. 

A POS tagger could use such information to decide that __the word `right`, when preceded by a `determiner`, should be tagged as `ADJ`__.
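
How a tagger might turn such counts into a decision can be sketched as follows. The helper `most_likely_tag` and its `'NOUN'` fallback are hypothetical, introduced only for illustration, and the counts are copied from the output above rather than recomputed from the Brown corpus.

```python
from collections import defaultdict

# Rebuild the one entry shown above by hand (counts copied from the output).
pos = defaultdict(lambda: defaultdict(int))
pos[('DET', 'right')]['NOUN'] = 5
pos[('DET', 'right')]['ADJ'] = 11

def most_likely_tag(prev_tag, word, table, default='NOUN'):
    """Hypothetical helper: return the most frequent tag seen for `word`
    after `prev_tag`, or `default` if the pair was never observed."""
    # get() avoids creating an empty entry for unseen keys.
    candidates = table.get((prev_tag, word))
    if not candidates:
        return default
    # Pick the tag whose count is largest.
    return max(candidates, key=candidates.get)

print(most_likely_tag('DET', 'right', pos))  # ADJ
print(most_likely_tag('VERB', 'blog', pos))  # NOUN
```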

<a name="inverting"></a>
## 5.3.6 Inverting a Dictionary

Dictionaries support efficient lookup, so long as you want to get the value for any key. 

If `d` is a dictionary and `k` is a key, we type `d[k]` and immediately obtain the value. 

Finding a key given a value is slower and more cumbersome.

In [20]:
from nltk.corpus import gutenberg

# create a default dict
counts = defaultdict(int)

for word in gutenberg.words('milton-paradise.txt'):
    counts[word] += 1
    
print([key for (key, value) in counts.items() if value == 32])

['mortal', 'Against', 'Him', 'There', 'brought', 'King', 'virtue', 'every', 'been', 'thine']


If we expect to do this kind of __"reverse lookup"__ often, it helps to construct a dictionary that maps values to keys. 

In the case that __no two keys have the same value__, this is an easy thing to do. We just get all the key-value pairs in the dictionary, and create a new dictionary of value-key pairs. 

The next example also illustrates __another way of initializing a dictionary__ `pos` with key-value pairs.

In [21]:
# create dict
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# reverse position of key, value pair
pos2 = dict((value, key) for (key, value) in pos.items())

# find key with value 'N'
pos2['N']

'ideas'

Let's first make our __POS dictionary__ a bit more realistic and add some more words to `pos` using the dictionary `update()` method, to create the situation where __multiple keys have the same value__. 

Then the technique just shown for reverse lookup will no longer work (why not?). Instead, we have to use `append()` to accumulate the words for each POS.

In [29]:
# add key, value pairs to pos
pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})
print("pos dictionary:\n\n{}.\n".format(pos))

# create a default dict
pos2 = defaultdict(list)

# populate pos2
for key, value in pos.items():
    pos2[value].append(key)
print("pos2 dictionary:\n\n{}\n".format(pos2))
    
# find keys with value 'ADV'
print("Values with 'ADV' as key: {}".format(pos2['ADV']))

pos dictionary:

{'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV', 'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'}.

pos2 dictionary:

defaultdict(<class 'list'>, {'ADJ': ['colorless', 'old'], 'N': ['ideas', 'cats'], 'V': ['sleep', 'scratch'], 'ADV': ['furiously', 'peacefully']})

Values with 'ADV' as key: ['furiously', 'peacefully']


Now we have inverted the `pos` dictionary, and can look up any part-of-speech and find all words having that part-of-speech. 

We can do __the same thing even more simply using NLTK's support for indexing__.

In [31]:
from nltk import Index

pos2 = Index((value, key) for (key, value) in pos.items())

pos2['ADV']

['furiously', 'peacefully']
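
To see why the simple value-key swap stops working once several keys share a value, consider this small sketch on a made-up subset of `pos`. Each value can appear only once as a key in the inverted dictionary, so later words silently replace earlier ones.

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'cats': 'N'}

# Naive inversion: swap each (key, value) pair.
naive = {value: key for key, value in pos.items()}

print(naive['N'])            # cats ('ideas' was overwritten)
print(len(pos), len(naive))  # 4 3
```

Collecting the words in a list per part-of-speech, as done with `defaultdict(list)` and `nltk.Index` above, keeps every word.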

![dict_methods.PNG](attachment:dict_methods.PNG)