# Homework 3: Twitter POS tagging

## General info

<b>Due date</b>: 11pm, Sunday April 15

<b>Submission method</b>: see LMS

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day

<b>Marks</b>: 5% of mark for class

<b>Overview</b>: In this homework, you'll be adapting a POS tagger to Twitter data, starting from a tagger trained on Penn Treebank. You will also use prior information on the Twitter tagset to obtain better performance. Finally, you will also analyse your results in a more fine-grained way. For extra credits, you will implement the Expectation-Maximisation algorithm.

<b>Materials</b>: See the main class LMS page for information on the basic setup required for this class, including an iPython notebook viewer and the Python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks.

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions. You will be marked not only on the correctness of your methods, but also on the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Extra credit</b>: Each homework has a task which is optional with respect to getting full marks on the assignment, but that can be used to offset any points lost on this or any other homework assignment (but not the final project or the exam). We recommend you skip over this step on your first pass, and come back if you have time: the amount of effort required to receive full marks (1 point) on an extra credit question will be substantially more than earning the same amount of credit on other parts of the homework.

<b>Updates</b>: Any major changes to the assignment will be announced via LMS. Minor changes and clarifications will be announced in the forum on LMS; we recommend you check the forum regularly.

<b>Academic Misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.


### Part 1: Preprocessing (2.0)

<b>Instructions</b>: your first task is to preprocess the data. We will use two datasets for training: 1) the Penn Treebank sample used in the workshops and 2) the Twitter samples data you used in Homework 1. In order to adapt the tagger to the Twitter data we need to build a *joint* vocabulary containing all the word types in the PTB and twitter_samples corpora. So, in addition to preprocessing, your code should also build this vocabulary. Finally, you should also store the tagset, for reasons that will be clearer later.

<b>Important</b>: you are allowed to reuse all the code from the workshop notebooks. In fact you are encouraged to do so as much of this homework will be based on these notebooks.

The vocabulary and the tagset should be stored in Python dictionaries, mapping each word (or tag) to an index (integer). This is similar to what is done in the W6/W7 workshop notebooks. The preprocessed corpora should contain indices only, as in the workshop.

Let's start with the PTB data. You should iterate over all sentences and words, and build the vocabulary and the tagset. Important: make sure you <b>lowercase</b> words before they are added to the dictionary. You should also generate the preprocessed corpus. It should be a list where each element is a tagged sentence, represented as another list of (word, tag) indices (which should correspond to the original words/tags). Print the first preprocessed sentence, the index for the word 'electricity' and the length of the full tagset. (0.5)
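The indexing step described above can be sketched as follows. This is only an illustration on a two-sentence toy corpus standing in for the PTB sample; the function name `index_tagged_corpus` and the toy data are assumptions, not part of the workshop code.

```python
def index_tagged_corpus(tagged_sents, vocab, tagset):
    """Convert (word, tag) string pairs into (word_index, tag_index) pairs.

    New words and tags are added to `vocab` and `tagset` in place; each new
    entry receives the next free integer index.
    """
    indexed = []
    for sent in tagged_sents:
        indexed_sent = []
        for word, tag in sent:
            # Lowercase before lookup so 'The' and 'the' share one index.
            word_idx = vocab.setdefault(word.lower(), len(vocab))
            tag_idx = tagset.setdefault(tag, len(tagset))
            indexed_sent.append((word_idx, tag_idx))
        indexed.append(indexed_sent)
    return indexed


# Toy stand-in for the PTB sample (hypothetical data, not the real corpus).
toy_ptb = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN")],
]
vocab, tagset = {}, {}
ptb_corpus = index_tagged_corpus(toy_ptb, vocab, tagset)
print(ptb_corpus[0], vocab["dog"], len(tagset))
```

Using `dict.setdefault` with `len(vocab)` as the default assigns consecutive indices without a separate membership test.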


<b>Instructions</b>: now you should do the same with the twitter_samples dataset. From now on, we will refer to this dataset as the **training** tweets. Since this data is not tagged, the preprocessed corpus should be a list where each element is another list containing indices only (instead of (word, tag) tuples). A tokenised version of twitter_samples is available through the method .tokenized(); use this method to read your corpus. Besides generating the corpus, you should also **update** the vocabulary with the new words from this corpus.

There are two things to keep in mind when doing this process:

1) We will perform a bit more preprocessing on this dataset, besides lowercasing. Specifically, you should replace special tokens with special symbols, as follows:
- Username mentions are tokens that start with '@': replace these tokens with 'USER_TOKEN'
- Hashtags are tokens that start with '#': replace these with 'HASHTAG_TOKEN'
- Retweets are represented as the token 'RT' (or 'rt' if you lowercase first): replace these with 'RETWEET_TOKEN'
- URLs are tokens that start with 'https://' or 'http://': replace these with 'URL_TOKEN'

2) **Do not create a new vocabulary**. Instead, you should update the vocabulary built from PTB with any new words present in this corpus. These should *include* the special tokens defined above but *not* the original un-preprocessed tokens.

The easiest way to do these steps is by doing 3 passes over the data: preprocess the words first, update the vocabulary and finally convert the corpus into the list format described above. However, it is possible to do all of this in one pass only.
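As a rough illustration of the one-pass variant, the sketch below normalises each token, updates an existing vocabulary and emits indices in a single traversal. The helper names `normalise_token` and `index_tweets` and the toy tweets are assumptions; in the homework the token lists would come from `twitter_samples.tokenized()`.

```python
def normalise_token(token):
    """Lowercase a token and map Twitter-specific tokens to special symbols."""
    token = token.lower()
    if token.startswith("@"):
        return "USER_TOKEN"
    if token.startswith("#"):
        return "HASHTAG_TOKEN"
    if token == "rt":
        return "RETWEET_TOKEN"
    if token.startswith(("http://", "https://")):
        return "URL_TOKEN"
    return token


def index_tweets(tokenised_tweets, vocab):
    """Single pass: normalise each token, update `vocab`, and emit indices."""
    return [
        [vocab.setdefault(normalise_token(tok), len(vocab)) for tok in tweet]
        for tweet in tokenised_tweets
    ]


# Hypothetical vocabulary already built from PTB, plus two toy tweets.
vocab = {"the": 0, "dog": 1}
toy_tweets = [["RT", "@alice", "the", "#Dog", "http://t.co/x"], ["rt", "dog"]]
train_tweets = index_tweets(toy_tweets, vocab)
print(train_tweets[0], vocab["HASHTAG_TOKEN"])
```

Only the special symbols enter the vocabulary here; the original tokens such as `@alice` never do, because normalisation happens before the lookup.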

Print the first sentence from your preprocessed corpora, the index for the word 'electricity' and the index for 'HASHTAG_TOKEN'. (0.5)

<b>Instructions:</b> now we will preprocess the tagged twitter corpus used in W7 (Ritter et al.). This dataset will be referred to from now on as the **test** tweets. Before you do that though, you should update the tagset.

You might have noticed this in the workshop, but this dataset has a few extra tags besides the PTB ones. These were added to cover specific phenomena that happen on Twitter:
- "USR": username mentions
- "HT": hashtags
- "RT": retweets
- "URL": URL addresses

Notice that these special tags correspond to the special tokens we preprocessed before. These steps will be important in Part 3 later.

There are a few additional tags which are not specific to Twitter but are not present in the PTB sample:
- "VPP"
- "TD"
- "O"

You should add these seven new tags to the tagset you built when reading the PTB corpus.

Another task is to add an extra type to the vocabulary: `<unk>`. This is in order to account for unknown or out-of-vocabulary words.

Finally, build two "inverted indices" for the vocabulary and the tagset. These should be lists, where the "i"-th element should contain the word (or tag) corresponding to the index "i" in the vocabulary (or tagset).
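A minimal sketch of these three steps, assuming small toy dictionaries in place of the real vocabulary and tagset (the helper name `invert_index` is hypothetical):

```python
def invert_index(mapping):
    """Return a list whose i-th element is the key mapped to index i."""
    inverted = [None] * len(mapping)
    for item, idx in mapping.items():
        inverted[idx] = item
    return inverted


# Toy tagset and vocabulary standing in for the ones built from PTB.
tagset = {"DT": 0, "NN": 1}
for tag in ["USR", "HT", "RT", "URL", "VPP", "TD", "O"]:
    tagset.setdefault(tag, len(tagset))

vocab = {"the": 0, "dog": 1}
vocab.setdefault("<unk>", len(vocab))

inv_tagset = invert_index(tagset)
inv_vocab = invert_index(vocab)
print(vocab["<unk>"], len(tagset))
```

The inversion relies on the indices being exactly `0 .. len(mapping) - 1`, which holds when every entry was added with the next free index.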

After doing these tasks, print the index for `<unk>` and the length of your resulting tagset. (0.5)

<b>Instructions</b>: now we can read the test tweets. Store them in the same format as the PTB corpora (list of lists containing (word, tag) index tuples). Do the same preprocessing steps that you did for the training tweets (lowercasing + replace special tokens). However, **do not** update the vocabulary. Why? Because the test set should simulate a real-world scenario, where out-of-vocabulary words can appear. Instead, after preprocessing each word, you should check if that word is in the vocabulary. If yes, just replace it with its index, otherwise you should replace it with the index for the `<unk>` token. Remember: you can reuse the code from the workshop for this task. Just be mindful that in the workshop we stored words and tags in two separate lists: here you should have a single list, as in the PTB corpus you preprocessed above.

When reading the POS tags for the test tweets you should do some additional preprocessing. There are three tags in this dataset which correspond to PTB tags but are represented with different names:
- "(". In PTB, this is represented as "-LRB-"
- ")". In PTB, this is represented as "-RRB-"
- "NONE". In PTB, this is represented as "-NONE-"

As you build the corpus for the test tweets, you should check if the tag for a word is one of the above. If yes, you should use the PTB equivalent instead. In practice, it is sufficient to ensure you use the correct index for the corresponding tag, using your tagset dictionary. This concept is sometimes referred to as *tag harmonisation*, where two different tagsets are mapped to each other.
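The out-of-vocabulary handling and the tag harmonisation can be combined in one lookup per token, as in the sketch below. The names `PTB_EQUIVALENT` and `index_test_sentence` are assumptions, and `str.lower` stands in for the full token normalisation used on the training tweets.

```python
# Tags in the test tweets that have a differently named PTB equivalent.
PTB_EQUIVALENT = {"(": "-LRB-", ")": "-RRB-", "NONE": "-NONE-"}


def index_test_sentence(tagged_sent, vocab, tagset, normalise=str.lower):
    """Index one tagged test sentence without updating the vocabulary."""
    unk_idx = vocab["<unk>"]
    return [
        # Unknown words fall back to <unk>; harmonised tags use the PTB name.
        (vocab.get(normalise(word), unk_idx), tagset[PTB_EQUIVALENT.get(tag, tag)])
        for word, tag in tagged_sent
    ]


# Toy dictionaries and sentence (hypothetical data).
vocab = {"the": 0, "(": 1, "<unk>": 2}
tagset = {"DT": 0, "-LRB-": 1, "NN": 2}
sentence = [("The", "DT"), ("(", "("), ("Zebra", "NN")]
indexed = index_test_sentence(sentence, vocab, tagset)
print(indexed)
```

Because `dict.get` is used instead of `setdefault`, the vocabulary is left unchanged by the test data.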

After this, print the first sentence of your preprocessed corpus. (0.5)

```python
# Download the annotated Twitter POS data (Ritter et al.) to pos.txt
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

urlretrieve("https://github.com/aritter/twitter_nlp/raw/master/data/annotated/pos.txt", "pos.txt")
```

<b>Hint</b>: if you did these steps correctly you should have 53 tags in your tagset and around 26000 words in your vocabulary.

### Part 2: Running the PTB tagger on the test tweets (1.5)

<b>Instructions</b>: your next task is to train a POS tagger on the PTB data and try it on the test tweets. This is exactly what we did in W7: feel free to reuse code. However, we are also going to modify the code a bit.

Your first task is to encapsulate the HMM training code into a function. You should name your function `count`. This function should take these input parameters:
- A tagged corpus, in the format described above (list of lists containing (word, tag) index tuples).
- The vocabulary (a dict).
- The tagset (a dict).

Output return values should contain:
- The initial tag probabilities (a vector).
- The transition probabilities (a matrix).
- The emission probabilities (a matrix).

Notice that in the workshop code the vocabulary and tagset were built as part of the training process. Here you should pass them explicitly as parameters instead. This is to ensure our tagger can take into account the words in the training tweets and the extra tags. Important: the workshop code initialises the probabilities with an `eps` value, to ensure you end up with non-zero probabilities for unseen events. You should do the same here.

After writing your function, run it on the PTB corpus to obtain the initial, transition and emission probabilities. (0.5)
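One possible shape for such a function is sketched below. The matrix orientation (`transition[i, j]` for moving from tag `i` to tag `j`, `emission[tag, word]`) and the default `eps` value are assumptions made for this example and may differ from the workshop code.

```python
import numpy as np


def count(corpus, vocab, tagset, eps=1e-8):
    """Estimate HMM parameters from an indexed tagged corpus.

    Every count starts at `eps` so unseen events keep a non-zero probability.
    """
    n_tags, n_words = len(tagset), len(vocab)
    initial = np.full(n_tags, eps)
    transition = np.full((n_tags, n_tags), eps)
    emission = np.full((n_tags, n_words), eps)
    for sent in corpus:
        prev_tag = None
        for word, tag in sent:
            emission[tag, word] += 1
            if prev_tag is None:
                initial[tag] += 1
            else:
                transition[prev_tag, tag] += 1
            prev_tag = tag
    # Normalise counts into probability distributions.
    initial /= initial.sum()
    transition /= transition.sum(axis=1, keepdims=True)
    emission /= emission.sum(axis=1, keepdims=True)
    return initial, transition, emission


# Toy indexed corpus (hypothetical data).
vocab = {"the": 0, "dog": 1, "barks": 2}
tagset = {"DT": 0, "NN": 1, "VBZ": 2}
corpus = [[(0, 0), (1, 1), (2, 2)], [(0, 0), (1, 1)]]
initial, transition, emission = count(corpus, vocab, tagset)
```

In this sketch the dictionaries are only used for their sizes, which is what lets the matrices cover words and tags that never occur in the training corpus.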

<b>Instructions</b>: now you should write a function for Viterbi. The input parameters are the same as in the workshop:
- The parameters (probabilities) of your HMM (a tuple (initial, transition, emission)).
- The input words (a list with numbers).

The output is slightly different though:
- A list of (word, tag) indices, containing the original input word and the predicted tag.

Run Viterbi on the test tweets and store the predictions in a list (might take a few seconds). Remember that in the processing part you stored the test tweets as (word, tag) indices lists: make sure your input to Viterbi are word index lists only. Print the first sentence of your predicted list. (0.5)
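The sketch below shows one way the function could look, assuming a non-empty sentence and the same matrix orientation as in the workshop (`transition[i, j]` from tag `i` to tag `j`, `emission[tag, word]`); these conventions are assumptions. The innermost maximisation is written as a NumPy operation for brevity rather than as a third `for` loop.

```python
import numpy as np


def viterbi(params, words):
    """Return [(word, predicted_tag), ...] for a list of word indices."""
    initial, transition, emission = params
    n_tags, n_words = len(initial), len(words)
    scores = np.zeros((n_words, n_tags))
    backptr = np.zeros((n_words, n_tags), dtype=int)
    # Base case: start in each tag and emit the first word.
    scores[0] = initial * emission[:, words[0]]
    for t in range(1, n_words):
        for tag in range(n_tags):
            candidates = scores[t - 1] * transition[:, tag]
            backptr[t, tag] = candidates.argmax()
            scores[t, tag] = candidates.max() * emission[tag, words[t]]
    # Follow the backpointers from the best final tag.
    tags = [int(scores[-1].argmax())]
    for t in range(n_words - 1, 0, -1):
        tags.append(int(backptr[t, tags[-1]]))
    tags.reverse()
    return list(zip(words, tags))


# Toy two-tag, two-word HMM (hypothetical parameters).
toy_params = (
    np.array([0.9, 0.1]),
    np.array([[0.1, 0.9], [0.5, 0.5]]),
    np.array([[0.9, 0.1], [0.1, 0.9]]),
)
prediction = viterbi(toy_params, [0, 1])
print(prediction)
```

To feed the test tweets to a function like this, the word indices would first be extracted from the stored pairs, for example with `[word for word, _ in sentence]`.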

<b>Instructions</b>: you should now evaluate the results. Write a function that takes (word, tag) lists as inputs and outputs the tag sequence using the original tags in the tagset. Your inputs should be a sentence and the tag inverted index you built before.

Run this function on the predictions you obtained above **and** the test tweets, storing them in two separate lists. Finally, flatten your predictions into a single list, do the same for the test tweets, and report accuracy. (0.5)
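A small sketch of the conversion and the accuracy computation, using hypothetical helper names (`to_tag_names`, `accuracy`) and toy data:

```python
def to_tag_names(indexed_sent, inv_tagset):
    """Map a list of (word, tag) index pairs to the original tag strings."""
    return [inv_tagset[tag] for _, tag in indexed_sent]


def accuracy(gold_sents, pred_sents):
    """Token-level accuracy over two parallel lists of tag sequences."""
    gold = [tag for sent in gold_sents for tag in sent]  # flatten
    pred = [tag for sent in pred_sents for tag in sent]
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


# Toy inverted tagset, gold sentences and predictions (hypothetical data).
inv_tagset = ["DT", "NN", "VBZ"]
test_tweets = [[(0, 0), (1, 1)], [(2, 2)]]
predictions = [[(0, 0), (1, 2)], [(2, 2)]]
gold_tags = [to_tag_names(sent, inv_tagset) for sent in test_tweets]
pred_tags = [to_tag_names(sent, inv_tagset) for sent in predictions]
print(accuracy(gold_tags, pred_tags))
```

Flattening before comparison gives token-level accuracy, so long tweets contribute more than short ones.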

### Part 3: Adapting the tagger using prior information (1.5)

<b>Instructions</b>: now your task is to adapt the tagger using prior information. What do we mean by that? Remember from part 1 that the twitter tagset has some extra tags, related to special tokens such as mentions and hashtags. In other words, **we know beforehand** that these special tokens **should** have these tags. However, because these tags never appear in the PTB data, the tagger has no such information. We are going to add this in order to improve the tagger.

To recap, we know these things about the twitter data:
- username mentions should have the tag 'USR'
- hashtags should have the tag 'HT'
- retweet tokens should have the tag 'RT'
- URL tokens should have the tag 'URL'

Remember how we replace these tokens with unique special ones (such as 'USER_TOKEN')? Your task is to adapt the emission probabilities for these tokens. Modify the emission matrix: assign 1.0 probability for the emission P('USER_TOKEN'|'USR') and 0.0 for P(word|'USR') for all other words. Do the same for the other three special tags.

To do that, use the vocabulary and tagset dictionaries to obtain the indices for the corresponding words and tags. Then, use the indices to find the values in the emission matrix and modify them. Print your new emission matrix. (0.5)
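The modification could look roughly like the sketch below, which assumes the emission matrix is indexed as `emission[tag, word]` (an assumption about orientation). The names `SPECIAL_TAGS` and `add_prior` are hypothetical.

```python
import numpy as np

# Special token -> tag that must emit it with probability 1.
SPECIAL_TAGS = {
    "USER_TOKEN": "USR",
    "HASHTAG_TOKEN": "HT",
    "RETWEET_TOKEN": "RT",
    "URL_TOKEN": "URL",
}


def add_prior(emission, vocab, tagset, token_to_tag=SPECIAL_TAGS):
    """Return a copy of `emission` where each special tag emits only its token."""
    adapted = emission.copy()
    for token, tag in token_to_tag.items():
        adapted[tagset[tag], :] = 0.0
        adapted[tagset[tag], vocab[token]] = 1.0
    return adapted


# Toy dictionaries and a uniform emission matrix (hypothetical data).
vocab = {"the": 0, "USER_TOKEN": 1, "HASHTAG_TOKEN": 2, "RETWEET_TOKEN": 3, "URL_TOKEN": 4}
tagset = {"DT": 0, "USR": 1, "HT": 2, "RT": 3, "URL": 4}
emission = np.full((5, 5), 0.2)
new_emission = add_prior(emission, vocab, tagset)
print(new_emission)
```

Working on a copy keeps the original PTB emission matrix available for comparison with the adapted tagger.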

<b>Instructions</b>: now evaluate your new tagger on the test tweets again. You should report accuracy but also do a fine-grained error analysis. Print the F-scores for **each tag**. <b>Hint:</b> use the "classification_report" function in scikit-learn for that. You should report the tags that performed the best and the worst. (0.5)
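As an illustration of the hint, the sketch below runs `classification_report` on two short, made-up tag lists and uses its `output_dict=True` option to pick out the per-tag F-scores. The toy lists are assumptions; in the homework they would be the flattened gold and predicted tags.

```python
from sklearn.metrics import classification_report

# Toy flattened gold and predicted tag sequences (hypothetical data).
gold = ["DT", "NN", "USR", "NN", "VB", "NN"]
pred = ["DT", "NN", "USR", "VB", "VB", "NN"]

# Human-readable table with precision, recall and F-score per tag.
print(classification_report(gold, pred))

# The same information as a dictionary, to find the best and worst tags.
report = classification_report(gold, pred, output_dict=True)
labels = set(gold) | set(pred)
per_tag_f1 = {tag: scores["f1-score"] for tag, scores in report.items() if tag in labels}
best_tag = max(per_tag_f1, key=per_tag_f1.get)
worst_tag = min(per_tag_f1, key=per_tag_f1.get)
print(best_tag, worst_tag)
```

Filtering on `labels` drops the summary rows (such as the averages) that the report dictionary also contains.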

<b>Instructions</b>: finally, based on the information you got above, do some analysis. Why do you think the tagger performed worse on the tags you mentioned above? How would you improve the tagger? Feel free to inspect some instances manually if you want (and show us if you do). Write your analysis in the markdown cell below. Notice that this question is inherently subjective: this is on purpose as you will be evaluated on your analytical abilities. But don't worry about going into depth: 2-4 sentences is enough (but feel free to write more if you need). (0.5)
    

<b>WRITE YOUR ANALYSIS HERE</b>

### Extra credits 1: Expectation-Maximisation (1.0)

<b>Instructions</b>: here your goal is to improve the tagger using **hard EM**. This question is divided into two parts. Because EM can take a long time to run, we will modify our code above to make it faster and also more robust to underflow by doing the calculations in log space.

Your first task is to modify the `count` and `viterbi` functions. For `count`, you should return log probabilities for all matrices. For `viterbi`, you should modify the code in the following way:
- Calculate scores using log probabilities. Remember that in log space, any products become sums. Also remember to make sure you change the base case as well (not only the inner loop).
- You should rewrite the algorithm in vectorised form, in order to make it more efficient. Remember that in Viterbi, the third (inner) `for` loop calculates scores for a single cell, while the second `for` loop calculates scores for a whole column. These operations can be done in parallel, which makes them amenable to vectorisation. Replace the second and third `for` loops in the code with appropriate vector and matrix operations. Remember to do the same to obtain the backpointers. <b>Hint</b>: start by vectorising the inner loop only and check if it is correct.
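A sketch of what the log-space, vectorised version might look like is given below, under the assumptions that sentences are non-empty and that `log_transition[i, j]` holds the log probability of moving from tag `i` to tag `j`. The function name `viterbi_log` and the toy parameters are hypothetical.

```python
import numpy as np


def viterbi_log(params, words):
    """Viterbi in log space with one matrix operation per time step."""
    log_initial, log_transition, log_emission = params
    n_tags, n_words = len(log_initial), len(words)
    scores = np.empty((n_words, n_tags))
    backptr = np.zeros((n_words, n_tags), dtype=int)
    # Base case: products become sums in log space.
    scores[0] = log_initial + log_emission[:, words[0]]
    for t in range(1, n_words):
        # candidates[i, j]: best score ending in tag i at t-1, then moving to j.
        candidates = scores[t - 1][:, None] + log_transition
        backptr[t] = candidates.argmax(axis=0)
        scores[t] = candidates.max(axis=0) + log_emission[:, words[t]]
    tags = [int(scores[-1].argmax())]
    for t in range(n_words - 1, 0, -1):
        tags.append(int(backptr[t, tags[-1]]))
    tags.reverse()
    return list(zip(words, tags))


# Toy two-tag, two-word HMM (hypothetical parameters), moved to log space.
toy_log_params = (
    np.log(np.array([0.9, 0.1])),
    np.log(np.array([[0.1, 0.9], [0.5, 0.5]])),
    np.log(np.array([[0.9, 0.1], [0.1, 0.9]])),
)
prediction = viterbi_log(toy_log_params, [0, 1, 1])
print(prediction)
```

Broadcasting the previous column against the transition matrix computes every cell of the new column at once, and `argmax(axis=0)` yields all backpointers for that column in the same step.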

The second task is to implement EM. This can be done without the first step above but bear in mind that it will take much longer to run. Write a function that:

- 1) Tags a corpus using Viterbi and an initial tagger.
- 2) Trains a new tagger using the tagged corpus obtained above.
- 3) Repeats both steps above N times.

If you've done the main homework correctly you should be able to reuse the `count` and `viterbi` functions for this and the code should be very straightforward. You should pretrain your tagger using the steps in the main homework, including the tag adaptation in Part 3. Then, in the inner loop, use the training tweets as your unlabelled corpora. Run 5 iterations of EM and report test accuracy **at every iteration**.

EM can take a long time to train, even if you're using the vectorised Viterbi code. If it is too slow on your machine, you're allowed to use a subset of the training tweets for this task.

To get full marks, adapt the algorithm in the following manner:
- At step 2) above, when training the new tagger, combine **both** the PTB gold data and the tagged training tweets, instead of just using the training tweets.

This is an easy trick to obtain better results (and it is essentially **semi-supervised** learning).
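The overall loop, including the semi-supervised variant, might be organised as in the sketch below. It bundles compact log-space stand-ins for `count` and `viterbi` so that it runs on its own; the `hard_em` name, its signature, the `eps` default and the toy data are all assumptions made for this example.

```python
import numpy as np


def count(corpus, vocab, tagset, eps=1e-8):
    """Compact stand-in for `count`, returning log probabilities."""
    n_tags, n_words = len(tagset), len(vocab)
    initial = np.full(n_tags, eps)
    transition = np.full((n_tags, n_tags), eps)
    emission = np.full((n_tags, n_words), eps)
    for sent in corpus:
        for i, (word, tag) in enumerate(sent):
            emission[tag, word] += 1
            if i == 0:
                initial[tag] += 1
            else:
                transition[sent[i - 1][1], tag] += 1
    return (np.log(initial / initial.sum()),
            np.log(transition / transition.sum(axis=1, keepdims=True)),
            np.log(emission / emission.sum(axis=1, keepdims=True)))


def viterbi(params, words):
    """Compact log-space Viterbi returning (word, tag) index pairs."""
    log_initial, log_transition, log_emission = params
    scores = log_initial + log_emission[:, words[0]]
    backptrs = []
    for word in words[1:]:
        candidates = scores[:, None] + log_transition
        backptrs.append(candidates.argmax(axis=0))
        scores = candidates.max(axis=0) + log_emission[:, word]
    tags = [int(scores.argmax())]
    for backptr in reversed(backptrs):
        tags.append(int(backptr[tags[-1]]))
    return list(zip(words, reversed(tags)))


def hard_em(params, gold_corpus, unlabelled, vocab, tagset, n_iterations=5):
    """Hard EM: tag the unlabelled data, retrain on gold + tagged data, repeat.

    Returns the parameters after every iteration so each can be evaluated.
    """
    history = []
    for _ in range(n_iterations):
        tagged = [viterbi(params, sent) for sent in unlabelled if sent]
        params = count(gold_corpus + tagged, vocab, tagset)
        history.append(params)
    return history


# Toy data (hypothetical): one gold sentence and one unlabelled sentence.
vocab = {"the": 0, "dog": 1, "cat": 2}
tagset = {"DT": 0, "NN": 1}
gold_corpus = [[(0, 0), (1, 1)]]
unlabelled = [[0, 2]]
history = hard_em(count(gold_corpus, vocab, tagset), gold_corpus, unlabelled, vocab, tagset)
final_params = history[-1]
print(viterbi(final_params, [0, 2]))
```

Returning one parameter tuple per iteration makes it straightforward to compute test accuracy after each round, as the task requires.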

### Extra credits 2: Soft EM using Forward-Backward (1.0)

<b>Instructions</b>: This is only for the truly intrepid: expect a substantial amount of work to get full marks in this. The goal is to perform **soft EM** using the Forward-Backward algorithm and expected counts. You will need to implement and adapt a set of functions for this task:
- The `count` method, which will still be used for pretraining, needs to also calculate **final** probabilities and return these as part of the output. These are never used in Viterbi but are essential for Forward-Backward.
- Implement the `forward` function. If you want to work in log space you should be careful because it requires summing probabilities. Hint: check the Scipy function `logsumexp` for that. Remember: the algorithm is very similar to Viterbi so feel free to use it as a starting point. The function should return the matrix with the alpha values and the marginal probability for the sentence.
- Implement the `backward` function. This should return a matrix with the beta values and the marginal. Remember the very useful sanity check: for the same sentence and tagger, the marginals returned from `forward` and `backward` should match.
- Implement the `expected_count` function, which is similar to `count` in the sense that it trains a tagger and outputs new parameters. However, it should have the alpha and beta matrices as inputs, as well as the marginals.
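The sanity check on the marginals can be illustrated with the sketch below, which implements `forward` and `backward` in log space on a toy HMM. It uses NumPy's `np.logaddexp.reduce` for the log-space sums as a stand-in for the Scipy `logsumexp` function mentioned in the hint; the parameter layout (a 4-tuple ending with the final probabilities) and the toy numbers are assumptions. The `expected_count` step is not shown.

```python
import numpy as np


def forward(params, words):
    """Return the log alpha matrix and the log marginal of the sentence."""
    log_initial, log_transition, log_emission, log_final = params
    alpha = np.empty((len(words), len(log_initial)))
    alpha[0] = log_initial + log_emission[:, words[0]]
    for t in range(1, len(words)):
        # Sum (not max) over previous tags, done stably in log space.
        incoming = np.logaddexp.reduce(alpha[t - 1][:, None] + log_transition, axis=0)
        alpha[t] = incoming + log_emission[:, words[t]]
    return alpha, np.logaddexp.reduce(alpha[-1] + log_final)


def backward(params, words):
    """Return the log beta matrix and the log marginal of the sentence."""
    log_initial, log_transition, log_emission, log_final = params
    beta = np.empty((len(words), len(log_initial)))
    beta[-1] = log_final
    for t in range(len(words) - 2, -1, -1):
        outgoing = log_emission[:, words[t + 1]] + beta[t + 1]
        beta[t] = np.logaddexp.reduce(log_transition + outgoing[None, :], axis=1)
    return beta, np.logaddexp.reduce(log_initial + log_emission[:, words[0]] + beta[0])


# Toy two-tag HMM with final probabilities (hypothetical parameters).
initial = np.array([0.6, 0.4])
transition = np.array([[0.5, 0.3], [0.2, 0.6]])
emission = np.array([[0.7, 0.3], [0.4, 0.6]])
final = np.array([0.2, 0.2])
log_params = tuple(np.log(p) for p in (initial, transition, emission, final))
words = [0, 1, 0]

alpha, fwd_marginal = forward(log_params, words)
beta, bwd_marginal = backward(log_params, words)
# Posterior tag probabilities per position; each row should sum to 1.
posteriors = np.exp(alpha + beta - fwd_marginal)
print(fwd_marginal, bwd_marginal)
```

The posterior matrix computed at the end is the kind of quantity an expected-count update would accumulate in place of the hard counts used by `count`.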

After implementing all of this, rerun EM using the soft approach and evaluate it in a similar way as done in Extra credits 1.
