<a href="https://colab.research.google.com/github/DrAlexSanz/NLP-SPEC-C2/blob/master/W2/Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Parts-of-Speech Tagging (POS)

Welcome to the second assignment of Course 2 in the Natural Language Processing specialization. This assignment develops your skills in part-of-speech (POS) tagging, the process of assigning a part-of-speech tag (noun, verb, adjective, ...) to each word in an input text. Tagging is difficult because some words can represent more than one part of speech depending on context; such words are ambiguous. Consider the word 'well' in the following examples:

* The whole team played well. [adverb]
* You are doing well for yourself. [adjective]
* Well, this assignment took me forever to complete. [interjection]
* The well is dry. [noun]
* Tears were beginning to well in her eyes. [verb]

Knowing the part of speech of each word in a sentence helps you determine the meaning of that sentence. This matters in applications such as search queries: identifying the proper noun, the organization, the stock symbol, or anything similar improves tasks ranging from speech recognition to search. By completing this assignment, you will:

* Learn how parts-of-speech tagging works
* Compute the transition matrix A in a Hidden Markov Model
* Compute the emission matrix B in a Hidden Markov Model
* Implement the Viterbi algorithm
* Compute the accuracy of your own model

In [None]:
# Download everything:
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/WSJ_02-21.pos
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/WSJ_24.pos
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/hmm_vocab.txt
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/test.words.txt
!wget https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/utils_pos.py

--2020-10-08 16:21:33--  https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/WSJ_02-21.pos
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8279089 (7.9M) [text/plain]
Saving to: ‘WSJ_02-21.pos’


2020-10-08 16:21:34 (15.9 MB/s) - ‘WSJ_02-21.pos’ saved [8279089/8279089]

--2020-10-08 16:21:34--  https://raw.githubusercontent.com/DrAlexSanz/NLP-SPEC-C2/master/W2/WSJ_24.pos
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 286063 (279K) [text/plain]
Saving to: ‘WSJ_24.pos’


2020-10-08 16:21:35 (4.51 MB/s) - ‘WSJ_24.pos’ saved [28

In [None]:
import math
import pandas as pd
import numpy as np
from utils_pos import get_word_tag, preprocess
from collections import defaultdict

print("Everything imported correctly")

Everything imported correctly


## Part 0: Data Sources
This assignment will use two tagged data sets collected from the Wall Street Journal (WSJ).

One data set (WSJ_02-21.pos) will be used for training.

The other (WSJ_24.pos) will be used for testing.

The tagged training data has been preprocessed to form a vocabulary (hmm_vocab.txt).

The words in the vocabulary are words from the training set that were used two or more times.

The vocabulary is augmented with a set of 'unknown word tokens', described below.

The training set will be used to create the emission, transition and tag counts.

The test set (WSJ_24.pos) is read in to create y.

This contains both the test text and the true tag.

The test set has also been preprocessed to remove the tags to form test.words.txt.

This is read in and further processed to identify the end of sentences and handle words not in the vocabulary using functions provided in utils_pos.py.

This forms the list prep, the preprocessed text used to test our POS taggers.

A POS tagger will necessarily encounter words that are not in its datasets.

To improve accuracy, these words are further analyzed during preprocessing to extract available hints as to their appropriate tag.

For example, the suffix 'ize' is a hint that the word is a verb, as in 'final-ize' or 'character-ize'.

A set of unknown-word tokens, such as '--unk-verb--' or '--unk-noun--', will replace the unknown words in both the training and test corpora and will appear in the emission, transition and tag data structures.
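
The suffix-based replacement described above can be sketched as follows. This is only an illustration: the helper name `guess_unk_token` and both suffix lists are assumptions made for this example, not the rules actually implemented in utils_pos.py.

```python
# Hypothetical suffix lists, chosen for illustration only.
VERB_SUFFIXES = ("ize", "ise", "ate", "ify")
NOUN_SUFFIXES = ("ness", "ment", "tion", "ship")


def guess_unk_token(word, vocab):
    """Return the word if it is in vocab, otherwise an unknown-word token."""
    if word in vocab:
        return word
    lowered = word.lower()
    # str.endswith accepts a tuple and checks every suffix in it.
    if lowered.endswith(VERB_SUFFIXES):
        return "--unk-verb--"
    if lowered.endswith(NOUN_SUFFIXES):
        return "--unk-noun--"
    # No hint available: fall back to the generic token.
    return "--unk--"


toy_vocab = {"the": 0, "economy": 1}
words = ["the", "characterize", "zaniness", "qwerty"]
print([guess_unk_token(w, toy_vocab) for w in words])
```

Known words pass through unchanged, so the same function could be applied to every token of a corpus.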

Implementation note:

* In Python 3.7 and later, dictionaries are guaranteed to retain insertion order (in CPython 3.6 this was an implementation detail).
* Furthermore, their hash-based lookup makes them suitable for rapid membership tests.
  * If di is a dictionary, `key in di` returns True if di contains the key `key`, and False otherwise.
* The dictionary vocab will utilize these features.
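
A minimal sketch of both dictionary properties, using a three-word toy dictionary rather than the real vocab:

```python
# Build a small dictionary in a deliberately unsorted order.
di = {}
for i, word in enumerate(["zones", "!", "economy"]):
    di[word] = i

print("economy" in di)  # True: membership is tested against the keys
print("market" in di)   # False: this key was never inserted
print(list(di))         # ['zones', '!', 'economy'], the insertion order
```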


In [None]:
# Load training corpus

with open("WSJ_02-21.pos", "r") as f:
    training_corpus = f.readlines()

print("First 5 lines of the training corpus:", training_corpus[0:5])

First 5 lines of the training corpus: ['In\tIN\n', 'an\tDT\n', 'Oct.\tNNP\n', '19\tCD\n', 'review\tNN\n']
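
Each corpus line holds a word and its tag separated by a tab. As a rough sketch, a line can be split as shown below. The helper `split_line` is hypothetical and is not the `get_word_tag` function imported from utils_pos.py; treating a line without exactly two fields as "no pair" is an assumption made here.

```python
# The five lines printed above, copied as a small sample.
sample_lines = ['In\tIN\n', 'an\tDT\n', 'Oct.\tNNP\n', '19\tCD\n', 'review\tNN\n']


def split_line(line):
    """Hypothetical helper: return (word, tag), or None if the line has no pair."""
    parts = line.split()  # splits on any whitespace and drops the newline
    if len(parts) != 2:
        return None
    word, tag = parts
    return word, tag


pairs = [split_line(line) for line in sample_lines]
print(pairs[:2])  # [('In', 'IN'), ('an', 'DT')]
```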


In [None]:
# Read the vocabulary data, split by word (line) and save the vocab

with open("hmm_vocab.txt") as f:
    voc_list = f.read().split("\n")

print("The first 5 words", voc_list[0:5])

print("The last 5 words", voc_list[-5:])

The first 5 words ['!', '#', '$', '%', '&']
The last 5 words ['zones', 'zoning', '{', '}', '']


In [None]:
# Now make a dictionary of the words in vocab

vocab = {}

for i, word in enumerate(sorted(voc_list)):
    vocab[word] = i

print("Vocabulary dictionary. Key is the word and Value is a unique integer (ID), not a count.")

for k, v in vocab.items():
    print(f"{k}:{v}")

    if v > 10:
        break

In [None]:
# Load test corpus

with open("WSJ_24.pos", "r") as f:
    test_corpus = f.readlines()

print("The first 5 lines of the test corpus:", test_corpus[0:5])

The first 5 lines of the test corpus: ['The\tDT\n', 'economy\tNN\n', "'s\tPOS\n", 'temperature\tNN\n', 'will\tMD\n']


In [None]:
# Preprocess the untagged test words: mark sentence ends and replace out-of-vocabulary words

_, prep = preprocess(vocab, "test.words.txt")

In [None]:
print(prep[0:10])

['The', 'economy', "'s", 'temperature', 'will', 'be', 'taken', 'from', 'several', '--unk--']


# Part 1: Parts-of-speech tagging

## Part 1.1 - Training
You will start with the simplest possible parts-of-speech tagger and we will build up to the state of the art.

In this section, you will find the words that are not ambiguous.

* For example, the word 'is' is a verb and it is not ambiguous.
* In the WSJ corpus, $86\%$ of the tokens are unambiguous (meaning they have only one tag).
* About $14\%$ are ambiguous (meaning that they have more than one tag)
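
To make the notion of ambiguity concrete, the sketch below counts unambiguous tokens in a made-up seven-token corpus. The toy data and variable names are assumptions for illustration; the percentages quoted above come from the WSJ corpus, not from this example.

```python
from collections import defaultdict

# Made-up (word, tag) pairs; 'well' appears with two different tags.
toy_tagged = [("The", "DT"), ("well", "NN"), ("is", "VBZ"), ("dry", "JJ"),
              ("They", "PRP"), ("played", "VBD"), ("well", "RB")]

# Collect the set of tags observed for each word.
tags_per_word = defaultdict(set)
for word, tag in toy_tagged:
    tags_per_word[word].add(tag)

# A token is unambiguous if its word was seen with exactly one tag.
unambiguous = sum(1 for word, _ in toy_tagged if len(tags_per_word[word]) == 1)
fraction = unambiguous / len(toy_tagged)
print(f"{unambiguous} of {len(toy_tagged)} tokens are unambiguous ({fraction:.0%})")
```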

Before you start predicting the tags of each word, you will need to compute a few dictionaries that will help you to generate the tables.

### Transition counts
The first dictionary is the transition_counts dictionary, which records the number of times each tag appeared immediately after another tag.

This dictionary will be used to compute:

$$P(t_i | t_{i-1}) \tag{1}$$

This is the probability of a tag at position $i$ given the tag at position $i-1$.

To compute equation 1, you will create a transition_counts dictionary where

* The keys are (prev_tag, tag)
* The values are the number of times those two tags appeared in that order.

### Emission counts
The second dictionary you will compute is the emission_counts dictionary. This dictionary will be used to compute:

$$P(w_i | t_i) \tag{2}$$

In other words, you will use it to compute the probability of a word given its tag.

To compute equation 2, you will create an emission_counts dictionary where

* The keys are (tag, word)
* The values are the number of times that pair showed up in your training set.

### Tag counts
The last dictionary you will compute is the tag_counts dictionary.

* The key is the tag
* The value is the number of times each tag appeared.
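
As a hedged illustration of how the three dictionaries relate to equations 1 and 2, the sketch below divides a pair count by the count of the conditioning tag. The counts are made up, and the plain unsmoothed ratio is an assumption of this example; a real tagger may smooth these estimates.

```python
# Made-up counts for illustration; they do not come from the WSJ corpus.
transition_counts = {("DT", "NN"): 3, ("DT", "JJ"): 1, ("NN", "VBZ"): 2}
emission_counts = {("NN", "well"): 1, ("NN", "economy"): 2, ("DT", "The"): 4}
tag_counts = {"DT": 4, "NN": 3, "JJ": 1, "VBZ": 2}


def transition_prob(prev_tag, tag):
    """Unsmoothed estimate of P(tag | prev_tag), as in equation 1."""
    return transition_counts.get((prev_tag, tag), 0) / tag_counts[prev_tag]


def emission_prob(tag, word):
    """Unsmoothed estimate of P(word | tag), as in equation 2."""
    return emission_counts.get((tag, word), 0) / tag_counts[tag]


print(transition_prob("DT", "NN"))  # 3 / 4 = 0.75
print(emission_prob("NN", "well"))  # 1 / 3
print(transition_prob("NN", "DT"))  # 0.0, the pair was never observed
```

Pairs that were never observed get probability zero under this estimate.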

## Exercise 01
**Instructions:** Write a program that takes in the training_corpus and returns the three dictionaries mentioned above: transition_counts, emission_counts, and tag_counts.

**emission_counts:** maps (tag, word) to the number of times it happened.

**transition_counts:** maps (prev_tag, tag) to the number of times it has appeared.

**tag_counts:** maps (tag) to the number of times it has occurred.

Implementation note: This routine uses defaultdict, which is a subclass of dict.

* A standard Python dictionary raises a KeyError if you try to access an item with a key that is not currently in the dictionary.
* In contrast, a defaultdict creates the missing item by calling the factory passed as its argument, in this case int, which yields an integer with the default value of 0.
* See the Python documentation for collections.defaultdict.
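
The difference between the two dictionary types can be seen in a short sketch; the tag names used as keys are arbitrary examples:

```python
from collections import defaultdict

# A plain dict raises KeyError when a missing key is read during "+=".
plain = {}
try:
    plain["NN"] += 1
except KeyError as err:
    print("plain dict raised KeyError:", err)

# defaultdict(int) creates missing entries with the value int() == 0.
counts = defaultdict(int)
counts["NN"] += 1
counts["NN"] += 1
print(counts["NN"])   # 2
print(counts["VB"])   # 0, and reading the key also inserts it
print(dict(counts))   # {'NN': 2, 'VB': 0}
```

Because merely reading a missing key inserts it, use `key in counts` or `counts.get(key)` when you want to check for a key without adding it.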