<a href="https://colab.research.google.com/github/DrAlexSanz/NLP-SPEC-C2/blob/master/W2/POS_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/WSJ_02-21.pos
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/WSJ_24.pos
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/hmm_vocab.txt
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/test.words.txt
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/utils_pos.py

# Parts-of-Speech Tagging - First Steps: Working with text files, Creating a Vocabulary and Handling Unknown Words
In this lecture notebook you will create a vocabulary from a tagged dataset and learn how to deal with words that are not present in this vocabulary when working with other text sources. Aside from this you will also learn how to:

* read text files.
* work with defaultdict.
* work with string data.

In [2]:
import string
from collections import defaultdict

## Read Text Data
A tagged dataset taken from the Wall Street Journal is provided in the file WSJ_02-21.pos.

To read this file, open it inside a with statement, which acts as a context manager, and pass the name of the file you wish to read. To load the contents into memory, call the readlines() method and store its return value in a variable.

Context managers are convenient because you don't need to close the file explicitly; it is closed automatically when the with block exits:

In [3]:
with open("WSJ_02-21.pos") as f:
    lines = f.readlines()
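
As a minimal sketch of what the context manager does, the snippet below writes a tiny file and reads it back, checking the file object's closed attribute inside and outside the with block. The sample content and the file name sample.pos are assumptions made for illustration; they are not part of the dataset.

```python
import os
import tempfile

# Hypothetical two-line sample in the same word<TAB>tag format as the dataset.
sample = "In\tIN\nan\tDT\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pos")  # hypothetical file name
    with open(path, "w") as f:
        f.write(sample)

    with open(path) as f:
        sample_lines = f.readlines()
        print(f.closed)   # False: still inside the with block

    print(f.closed)       # True: the file was closed when the block exited

print(sample_lines)       # ['In\tIN\n', 'an\tDT\n']
```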

In [4]:
# Check the first five lines

for i in range(5):
    print(lines[i])

In	IN

an	DT

Oct.	NNP

19	CD

review	NN



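The output above has been rendered by print(), so the separators are not visible (the blank lines appear because each line already ends with a newline and print() adds another). Display a single element without print() to see the raw \t and \n: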

In [5]:
lines[0]

'In\tIN\n'

Indeed there is a tab between the word and the tag and a newline at the end of each line.
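
A short sketch of how this structure affects parsing, using the first line shown above: splitting on the tab alone leaves the newline attached to the tag, so stripping it first gives a clean pair.

```python
line = "In\tIN\n"  # the first line of the dataset, as displayed above

# Splitting on the tab alone keeps the trailing newline on the tag.
print(line.split("\t"))  # ['In', 'IN\n']

# Removing the newline first yields a clean (word, tag) pair.
word, tag = line.rstrip("\n").split("\t")
print(word, tag)  # In IN
```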

### Creating a vocabulary
Now that you understand how the dataset is structured, you will create a vocabulary out of it. In this notebook, the vocabulary is made up of every word that appears at least 2 times in the dataset. To build it, follow these steps:

1. Get only the words from the dataset
2. Use a defaultdict to count the number of times each word appears
3. Filter the dict to only include words that appeared at least 2 times
4. Create a list out of the filtered dict
5. Sort the list

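For step 1 you can use the fact that each word and its tag are separated by a tab and that the word always comes first. Note that a line containing only a newline yields "\n" as its word; these entries are filtered out later. Using a list comprehension, the words list can be created like this: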

In [13]:
words = [i.split("\t")[0] for i in lines]

Step 2 can be done easily by leveraging defaultdict. In case you aren't familiar with it, a defaultdict is a special kind of dictionary that returns the "zero" value of a type when you access a key that does not exist (and stores that value under the key). Since you want the frequencies of words, you should define the defaultdict with a type of int.

Now you don't need to worry about the case when the word is not present within the dictionary because getting the value for that key will simply return a zero. Isn't that cool?

In [18]:
freq = defaultdict(int)

for word in words:
    freq[word] += 1
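
To make the difference concrete, here is a small sketch contrasting a plain dict with a defaultdict(int). The names plain and counts are only illustrative. It also shows one caveat worth knowing: merely reading a missing key from a defaultdict inserts that key.

```python
from collections import defaultdict

plain = {}
try:
    plain["review"] += 1        # the key does not exist yet
except KeyError as err:
    print("KeyError:", err)     # a plain dict raises instead of defaulting

counts = defaultdict(int)
counts["review"] += 1           # missing key starts from int() == 0
print(counts["review"])         # 1

# Caveat: reading a missing key also inserts it with the default value.
print(counts["missing"])        # 0
print("missing" in counts)      # True
```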

Filtering the freq dictionary can be done using list comprehensions again (aren't they handy?). You should filter out words that appeared only once and also words that are just a newline character:

In [21]:
vocab = [k for k, v in freq.items() if v > 1 and k != "\n"]
vocab.sort()

In [22]:
for i in range(4000, 4005):
    print(vocab[i])

Early
Earnings
Earth
Earthquake
East
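
The steps above can also be gathered into a single helper and tried on toy data. This is a sketch under assumptions: build_vocab, its min_count parameter, and toy_lines are hypothetical names introduced here, not part of the notebook or of utils_pos.py.

```python
from collections import defaultdict

def build_vocab(lines, min_count=2):
    """Return the sorted list of words seen at least min_count times."""
    freq = defaultdict(int)
    for line in lines:
        word = line.split("\t")[0]   # the word always comes before the tab
        freq[word] += 1
    # Drop rare words and the "\n" entries produced by empty lines.
    return sorted(w for w, n in freq.items() if n >= min_count and w != "\n")

# Hypothetical toy corpus in the same word<TAB>tag format.
toy_lines = ["the\tDT\n", "cat\tNN\n", "\n", "the\tDT\n",
             "dog\tNN\n", "\n", "Cat\tNN\n", "Cat\tNN\n"]

toy_vocab = build_vocab(toy_lines)
print(toy_vocab)  # ['Cat', 'the']
```

Python sorts strings by code point, so capitalized words come before lower-case ones. The counts are also case-sensitive: "Cat" and "cat" are treated as different words.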


# Processing new text sources
## Dealing with unknown words
Now that you have a vocabulary, you will use it when processing new text sources. A new text will have words that do not appear in the current vocabulary. To tackle this, you can simply classify each new word as an unknown one, but you can do better by creating a function that tries to classify the type of each unknown word and assign it a corresponding unknown token.

This function will do the following checks and return an appropriate token:

* Check if the unknown word contains any character that is a digit
  * return `<UNK-DIGIT>`
* Check if the unknown word contains any punctuation character
  * return `<UNK-PUNCT>`
* Check if the unknown word contains any upper-case character
  * return `<UNK-UPPER>`
* Check if the unknown word ends with a suffix that could indicate it is a noun, verb, adjective or adverb
  * return `<UNK-NOUN>`, `<UNK-VERB>`, `<UNK-ADJ>`, `<UNK-RB>` respectively
* If a word fails to fall under any condition then its token will be a plain `<UNK>`.

The conditions will be evaluated in the same order as listed here. So if a word contains a punctuation character but does not contain digits, it will fall under the second condition. To achieve this behaviour some if/elif statements can be used along with early returns.

This function is implemented next. Notice that the any() function is being heavily used. It returns True if at least one of the cases it evaluates is True.

In [24]:
def assign_unk(word):

    # Punctuation characters
    punct = set(string.punctuation)

    # Suffixes
    noun_suffix = ["action", "age", "ance", "cy", "dom", "ee", "ence", "er", "hood", "ion", "ism", "ist", "ity", "ling", "ment", "ness", "or", "ry", "scape", "ship", "ty"]
    verb_suffix = ["ate", "ify", "ise", "ize"]
    adj_suffix = ["able", "ese", "ful", "i", "ian", "ible", "ic", "ish", "ive", "less", "ly", "ous"]
    adv_suffix = ["ward", "wards", "wise"]

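    # Check if word contains any digit
    if any(char.isdigit() for char in word):
        return "<UNK-DIGIT>"

    # Check if word contains any punctuation character
    elif any(char in punct for char in word):
        return "<UNK-PUNCT>"

    # Check if word contains any upper-case character
    elif any(char.isupper() for char in word):
        return "<UNK-UPPER>"  # Could be avoided by standardizing case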

    # Check if word ends with any noun suffix
    elif any(word.endswith(suffix) for suffix in noun_suffix):
        return "<UNK-NOUN>"

    # Check if word ends with any verb suffix
    elif any(word.endswith(suffix) for suffix in verb_suffix):
        return "<UNK-VERB>"

    # Check if word ends with any adjective suffix
    elif any(word.endswith(suffix) for suffix in adj_suffix):
        return "<UNK-ADJ>"

    # Check if word ends with any adverb suffix
    elif any(word.endswith(suffix) for suffix in adv_suffix):
        return "<UNK-RB>"
    
    # If none of the previous criteria is met, return plain unknown
    else:
        return "<UNK>"
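
The character-level checks in assign_unk rely on any() over a generator expression. The sketch below isolates two of them; has_digit and has_punct are hypothetical helper names used only for this illustration. any() stops at the first True it finds and returns False for an empty input.

```python
import string

punct = set(string.punctuation)

def has_digit(word):
    # True as soon as one character is a digit
    return any(char.isdigit() for char in word)

def has_punct(word):
    # True as soon as one character is a punctuation mark
    return any(char in punct for char in word)

for word in ["R2-D2", "well-known", "tardigrade", ""]:
    print(repr(word), has_digit(word), has_punct(word))
# 'R2-D2' True True
# 'well-known' False True
# 'tardigrade' False False
# '' False False
```

Because the digit check runs first in assign_unk, a word such as "R2-D2" that matches several conditions receives the digit token rather than the punctuation or upper-case one.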

## Getting the correct tag for a word
All that is left is to implement a function that gets the correct tag for a particular word, with special handling for unknown words. Since the dataset provides each word and tag on the same line, and whether a word is known depends on the vocabulary used, the line and the vocabulary are the two arguments to this function.

The function should check whether a line is empty and, if so, return a placeholder word and tag; the implementation below uses the string "Placeholder" for both.

Otherwise, it should process the line to return the correct word and tag pair. If the word is not in the vocabulary, it is replaced by the token returned by assign_unk().

The function is implemented next. Notice that the split() method can be used without specifying a delimiter, in which case it splits on any whitespace.

In [29]:
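def get_word_tag(line, vocab):

    # Empty line: return a placeholder word and tag
    if not line.split():
        word = "Placeholder"
        tag = "Placeholder"

    else:
        # split() with no argument splits on any whitespace (here, the tab)
        word, tag = line.split()

        # Replace out-of-vocabulary words with an unknown token
        if word not in vocab:
            word = assign_unk(word)

    return word, tag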
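
The behaviour of split() that get_word_tag depends on can be checked in isolation. The last part of this sketch uses a hypothetical line, not taken from the dataset, to show a caveat: a word that itself contains whitespace would produce more than two fields and the unpacking would fail.

```python
pair = "In\tIN\n".split()
blank = "\n".split()

print(pair)        # ['In', 'IN']
print(blank)       # []
print(not blank)   # True: an empty list is falsy, so empty lines are detected

# Hypothetical line whose word contains a space: three fields, not two.
odd_line = "New York\tNNP\n"
try:
    word, tag = odd_line.split()
except ValueError as err:
    print("ValueError:", err)
```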


In [31]:
get_word_tag("\n", vocab)

('Placeholder', 'Placeholder')

In [32]:
get_word_tag('In\tIN\n', vocab)

('In', 'IN')

In [34]:
get_word_tag('tardigrade\tNN\n', vocab) # Unknown word with no matching suffix; the NN tag from the line is kept

('<UNK>', 'NN')

In [35]:
get_word_tag('scrutinize\tVB\n', vocab)

('<UNK-VERB>', 'VB')
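
One optional refinement, offered as a suggestion rather than as part of the original notebook: get_word_tag only uses the vocabulary for membership tests, and testing membership in a list scans it element by element, whereas a set uses hashing. The sketch below uses a hypothetical five-word vocabulary to show that both containers give the same answers, so a set built from the vocabulary could be passed to the function instead.

```python
# Hypothetical small vocabulary standing in for the sorted list built earlier.
vocab_list = ["Early", "Earnings", "Earth", "Earthquake", "East"]
vocab_set = set(vocab_list)

for word in ["Earth", "tardigrade"]:
    # Same result from both containers; only the lookup cost differs.
    print(word, word in vocab_list, word in vocab_set)
# Earth True True
# tardigrade False False
```

Keeping the sorted list is still useful wherever a stable order or an index for each word is needed.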