# Week 2 - Second Year Project

---

**Learning goals**
* be familiar with the Universal POS tagset and data annotation for POS tagging
* be able to discuss annotation quality, both qualitatively and quantitatively, by comparing your annotations to those of a peer
* be able to implement a Naive Bayes and a logistic regression classifier for language identification, using BOW and character n-gram features
* be able to analyze the performance of classifiers on both in-domain and out-of-domain data

**Notebook overview**

*Lecture 3*
1. Annotation - annotate a small sample of social media data with POS tags
2. Annotation quality - inspect annotation quality through kappa scores, but also qualitatively
3. Words as features - convert input text to features that can be used in machine learning algorithms

*Lecture 4*

4. Naive Bayes Classifier (pen and paper)
5. Naive Bayes with BOW in sklearn - train a classifier with bag-of-words features
6. Discriminative Classifier with BOW - train a discriminative classifier
7. Character n-grams - extract and use character n-grams
8. Analysis of a model's performance - some examples of how to analyze when/how a model fails

# Lecture 3: Annotation and POS tagging

This assignment consists of 2 parts: first you annotate the data, then you compare your annotations against the annotations of a peer. For this reason, we have an intermediate deadline for uploading your annotations on LearnIT:
* **08-02-2022** 12:00 Danish time: upload your annotations on LearnIT

Before Thursday (09-02) you will receive the annotations from a peer for the same data. You can then compare the annotations. Note that you can already implement your solution on Tuesday: change a few of your own annotations and use the modified copy as a dummy test file.

## 1. Annotation 

Find the file with your ITU username in `assignments/week2/annotate/`. In this file, you will find 20 TikTok comments which are pre-tokenized and in conll format (see [week1](https://github.itu.dk/robv/intro-nlp/blob/main/assignments/week1/week1.ipynb)). After each word, add its POS tag, separated from the word by a tab. The final file should look like this:

```
-       PUNCT
en      DET
mand    NOUN
der     PRON
hedder  VERB
goergh  PROPN
bush    PROPN
.       PUNCT

```

You can use a whitespace or a tab between the word and its tag. Please check with the script `posCheck.py` whether the file format is correct. Usage: `python3 posCheck.py origFile annotatedFile`.

For annotation guidelines we refer to the slides and https://universaldependencies.org/u/pos/all.html. It might also be helpful to look at example annotations, which are provided in `assignments/week2/pos-data/da_ddt-ud-sample.conllu` and `assignments/week2/pos-data/en_ewt-ud-sample.conllu`.

**NOTE** If you do not speak Danish, please annotate the English sample (the file ending with `_en`).

Upload your annotation on LearnIT (before **08-02-2022 12:00**) and name it `[username]_[language].conll`. For example, if your username is robv, use `robv_da.conll`.

## 2. Annotation Quality

* a) Calculate the accuracy between you and the other annotator: how often did you agree?
* b) Now implement Cohen’s Kappa score and calculate the Kappa for your annotation sample. In which range does your Kappa score fall?
* c) Take a closer look at the cases where you disagreed with the other annotator: are these disagreements due to ambiguity, or are there mistakes in the annotation? Would you place your agreement in the same category that the standard Kappa interpretation assigns to it?
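
As a rough illustration of a) and b), the sketch below computes accuracy and Cohen's Kappa for two tag sequences. It assumes both files list the same tokens in the same order, one `word tag` pair per line; the helper names (`read_tags`, `accuracy`, `cohen_kappa`) and the toy annotations are invented for this example and are not part of the course material.

```python
from collections import Counter

def read_tags(lines):
    """Collect the tag (last column) of every line that has a word and a tag."""
    tags = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # blank lines separate sentences and are skipped
            tags.append(parts[-1])
    return tags

def accuracy(tags_a, tags_b):
    """Fraction of tokens that received the same tag from both annotators."""
    assert len(tags_a) == len(tags_b), "annotations must cover the same tokens"
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)

def cohen_kappa(tags_a, tags_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(tags_a)
    p_o = accuracy(tags_a, tags_b)  # observed agreement
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    # Expected agreement if both annotators picked tags independently,
    # each following their own tag distribution.
    p_e = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    # Undefined (division by zero) when p_e == 1, i.e. both use a single tag.
    return (p_o - p_e) / (1 - p_e)

# Invented toy annotations for the same eight tokens.
mine = read_tags(["-\tPUNCT", "en\tDET", "mand\tNOUN", "der\tPRON",
                  "hedder\tVERB", "goergh\tPROPN", "bush\tPROPN", ".\tPUNCT"])
peer = read_tags(["-\tPUNCT", "en\tDET", "mand\tNOUN", "der\tSCONJ",
                  "hedder\tVERB", "goergh\tPROPN", "bush\tNOUN", ".\tPUNCT"])

print(f"accuracy: {accuracy(mine, peer):.3f}")
print(f"kappa:    {cohen_kappa(mine, peer):.3f}")
```

With real files, `read_tags(open(path, encoding="utf-8"))` would feed the same functions.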

## 3. Words as Features
In this assignment, we will convert text to a matrix of features for the purpose of language identification (the classifiers will be trained in Thursday's assignments, see below). We will use data from the Star Wars fandom wikis:
* English [Wookipedia](https://starwars.fandom.com/wiki/Main_Page)  
* Danish [Kraftens Arkiver](https://starwars.fandom.com/da/wiki) 
* Dutch [Yodapedia](https://starwars.fandom.com/da/wiki)

The data for the following assignments can be read like this:

```python
def load_langid(path):
    """Read a tab-separated file with one `label<TAB>text` instance per line."""
    text = []
    labels = []
    # Open explicitly as UTF-8 and close the file when done.
    with open(path, encoding='utf-8') as f:
        for line in f:
            tok = line.strip().split('\t')
            labels.append(tok[0])
            text.append(tok[1])
    return text, labels

wooki_train_text, wooki_train_labels = load_langid('langid-data/wookipedia_langid.train.tok.txt')
wooki_dev_text, wooki_dev_labels = load_langid('langid-data/wookipedia_langid.dev.tok.txt')
```

a) Convert the train data to "binary word features". This means that every instance (sentence) is represented by a vector of binary values, each of which corresponds to a word. For example (features are on the columns, input on the rows):

|             | hello | bye | there | here | ... |
|-------------|-------|-----|-------|------|-----|
| hello there | 1     | 0   | 1     | 0    |     |
| bye bye     | 0     | 1   | 0     | 0    |     |


Note that this means that you will end up with a matrix of size `(#data_instances, len(vocab))` where `vocab` contains your vocabulary (i.e. all the words in the train data), and the `#data_instances` is the number of input sentences (feel free to use numpy, torch or native python lists). This matrix will be filled with 0's and 1's, indicating which features are present in which instances.

**Hint**: Start with two sentences, as it is much easier to debug (and with 1 sentence, you will have only 1s)

b) Convert the dev data to the same features generated from the training data. Note that no new features can be introduced at this point: check that the size of the matrix is `(#dev_instances, len(vocab))`.

c) Write down the pros and cons of representing text as `BOW` (bag-of-words).
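
The conversion in a) and b) can be sketched on two toy sentences as follows. This is only one possible approach: it assumes words are separated by single spaces (the data is pre-tokenized), uses plain Python lists, and the function names `build_vocab` and `binary_features` are made up for the illustration.

```python
def build_vocab(sentences):
    """Map every word seen in the training sentences to a column index."""
    vocab = {}
    for sent in sentences:
        for word in sent.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def binary_features(sentences, vocab):
    """One row per sentence, one column per vocabulary word (1 = present)."""
    matrix = []
    for sent in sentences:
        row = [0] * len(vocab)
        for word in sent.split():
            if word in vocab:  # words unseen during training are ignored
                row[vocab[word]] = 1
        matrix.append(row)
    return matrix

# Toy data, invented for the example.
train = ["hello there", "bye bye"]
dev = ["hello here", "bye there"]

vocab = build_vocab(train)               # built from the train data only
train_x = binary_features(train, vocab)
dev_x = binary_features(dev, vocab)      # same columns, same order

print(vocab)
print(train_x)
print(dev_x)
```

Because the dev matrix is built from the training vocabulary, the word `here` has no column and is simply dropped.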

# Lecture 4: Generative and Discriminative Classification

## 4. Naive Bayes Classifier (pen and paper)

Solve the following exercises from [Chapter 4 of Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/4.pdf):

a) Exercise 4.1 from J&M (copied here for your convenience):

Assume the following likelihoods for each word being part of a positive or
negative movie review, and equal prior probabilities for each class.

| feature | pos  |  neg |
| :------ | :--: | ---: |
| I       | 0.09 | 0.16 |
| always  | 0.07 | 0.06 |
| like    | 0.29 | 0.06 |
| foreign | 0.04 | 0.15 |
| films   | 0.08 | 0.11 |

- What class will Naive Bayes assign to the sentence `“I always like foreign films.”`?

b) Exercise 4.2 from J&M (copied here for your convenience):

Given the following short movie reviews, each labeled with a genre, either comedy or action:

1. fun, couple, love, love **comedy**

2. fast, furious, shoot **action**

3. couple, fly, fast, fun, fun **comedy**

4. furious, shoot, shoot, fun **action**

5. fly, fast, shoot, love **action**

and a new document D:

```
fast, couple, shoot, fly
```

- Compute the most likely class for D. Assume a Naive Bayes classifier and use *add-1 smoothing* for the likelihoods.
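
For reference, both exercises rely on the Naive Bayes decision rule, and the second one additionally on the add-1 smoothed likelihood. Written out in the notation of J&M Chapter 4, with $w_1, \dots, w_n$ the words of the document and $V$ the vocabulary (all word types that occur in the training data, across classes):

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

$$
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$

The prior $P(c)$ is estimated as the fraction of training documents that belong to class $c$.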

## 5. Naive Bayes with BOW in sklearn 

In this assignment, we will focus on the task of language identification. You can use the data from assignment 3:

```python
def load_langid(path):
    """Read a tab-separated file with one `label<TAB>text` instance per line."""
    text = []
    labels = []
    # Open explicitly as UTF-8 and close the file when done.
    with open(path, encoding='utf-8') as f:
        for line in f:
            tok = line.strip().split('\t')
            labels.append(tok[0])
            text.append(tok[1])
    return text, labels

wooki_train_text, wooki_train_labels = load_langid('langid-data/wookipedia_langid.train.tok.txt')
wooki_dev_text, wooki_dev_labels = load_langid('langid-data/wookipedia_langid.dev.tok.txt')
```

a) Train a Naive Bayes classifier. You can make use of the scikit-learn implementation, see: [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). Note that there are multiple Naive Bayes implementations in sklearn; the one discussed in the book/slides is the multinomial variant.

**Note**: the input is a list of lists of features `x` and a list of corresponding gold labels `y`. Therefore, the following should hold `len(x) == len(y)` and their indices should match.
Additionally, every instance in `x` should have the same length (the number of features).

b) Run the classifier on the dev data. It is crucial that you ensure that the feature values have exactly the same order as during training. How well does it perform? (accuracy?)

**Note**: you cannot introduce new features here (!): you have to use the exact same features as the ones used during training.

**Hint**: If the accuracy is lower than 50%, you are probably mixing up the feature order, either during training or during development or both.
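
A minimal sketch of a) and b) on a hand-made feature matrix is shown below. The three columns, the label strings and all values are invented for the illustration; in the assignment, `x` would be the binary word features from assignment 3 and `y` the labels returned by `load_langid`.

```python
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: the columns stand for the (hypothetical) words
# "the", "de" and "og", in that fixed order.
train_x = [[1, 0, 0], [1, 0, 0],
           [0, 1, 0], [0, 1, 0],
           [0, 0, 1], [0, 0, 1]]
train_y = ["en", "en", "nl", "nl", "da", "da"]

# Dev instances must use the same columns in the same order as train_x.
dev_x = [[1, 0, 0], [0, 0, 1]]
dev_y = ["en", "da"]

clf = MultinomialNB()       # add-1 smoothing (alpha=1.0) is the default
clf.fit(train_x, train_y)   # requires len(train_x) == len(train_y)

pred = clf.predict(dev_x)
acc = accuracy_score(dev_y, pred)
print(list(pred), acc)
```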



## 6. Discriminative Classifier with BOW

a) Train a `logistic regression` classifier in a similar fashion. For more information, see: [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Does it outperform the Naive Bayes classifier?

b) Now evaluate both classifiers (`logistic regression` and `Naive Bayes`) on the out-of-domain Bulbapedia data:



```python
bulba_dev_text, bulba_dev_labels = load_langid('langid-data/bulbapedia_langid.dev.tok.txt')
```

- Are the trends similar to the Wookipedia data? Is there a performance drop compared to the Wookipedia data?

## 7. Character n-grams
Instead of using word unigrams as features, character n-grams can provide better generalization. 

a) Implement character tri-gram features without using the sklearn implementations.

b) Train the `logistic regression` model on the tri-gram features and inspect performance on both the Wookipedia and Bulbapedia data. Does it outperform the BOW model?
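
One way to extract the trigrams for a) is sketched below. It treats the sentence as a raw string, so spaces are part of the n-grams; whether to keep spaces, lowercase the text, or pad the sentence boundaries is a design choice that the assignment leaves open. The function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("the cat"))  # a string of 7 characters yields 5 trigrams
print(char_ngrams("ab"))       # shorter than n: no n-grams at all
```

The resulting n-grams can be turned into a binary feature matrix in the same way as words: build the vocabulary from the training n-grams only, then reuse it for the dev data.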

## 8. Analysis
There are two obvious ways to inspect the classifiers in more detail: by inspecting a confusion matrix and by
examining the feature weights.

### Confusion matrix
a) Plot a confusion matrix for the logistic regression BOW model `6a)` when used on Bulbapedia data, and inspect the errors (it is not important how you visualize the results: a table, a figure, or even an ASCII table will suffice). 
Are there any interesting trends?
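
A confusion matrix can be built without any plotting library, as in the sketch below, which prints a plain tab-separated table. The gold labels, predictions and label strings are invented; in the assignment they would come from `bulba_dev_labels` and the output of the classifier.

```python
from collections import Counter

def confusion_matrix(gold, pred, labels):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]

# Invented example labels and predictions.
gold = ["en", "en", "da", "da", "nl", "nl"]
pred = ["en", "en", "da", "nl", "nl", "da"]
labels = ["da", "en", "nl"]

matrix = confusion_matrix(gold, pred, labels)

# Plain-text table: header row with predicted labels, one row per gold label.
print("gold\\pred", *labels, sep="\t")
for label, row in zip(labels, matrix):
    print(label, *row, sep="\t")
```

Off-diagonal cells show which language pairs are confused with each other, and in which direction.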

### Feature weights
In scikit-learn, you can inspect the internal weights given to each feature in the `.coef_` attribute of a trained model. Inspect the most important features for both classifiers.

b) Are there any interesting differences?

**Hint**: The weights are given per class, so you can either inspect three lists, or compute the average importance per feature (make sure to average the absolute values of the weights, so that positive and negative weights do not cancel out).
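
Both options from the hint are sketched below for a logistic regression model trained on invented toy data (three hypothetical word features and three made-up labels). The sketch only covers `LogisticRegression`; whether `.coef_` is available for other classifiers depends on the estimator and the scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data; vocab[i] is the (hypothetical) word behind column i.
vocab = ["the", "de", "og"]
train_x = [[1, 0, 0], [1, 0, 0],
           [0, 1, 0], [0, 1, 0],
           [0, 0, 1], [0, 0, 1]]
train_y = ["en", "en", "nl", "nl", "da", "da"]

clf = LogisticRegression()
clf.fit(train_x, train_y)

# With three classes, coef_ has shape (n_classes, n_features);
# its rows follow the order of clf.classes_.
for label, weights in zip(clf.classes_, clf.coef_):
    top = np.argsort(weights)[::-1][:2]  # indices of the two largest weights
    print(label, [(vocab[i], round(float(weights[i]), 2)) for i in top])

# Alternative: a single importance score per feature, averaged over classes.
mean_abs = np.abs(clf.coef_).mean(axis=0)
for i in np.argsort(mean_abs)[::-1]:
    print(vocab[i], round(float(mean_abs[i]), 2))
```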