# Embeddings

## WordNet

This first part should be able to run on almost any system. No need for Google Colab for this part!

Import WordNet from NLTK's corpus package:

In [None]:
from nltk.corpus import wordnet as wn

The word "grade" can mean many things in English. Use `wn.synsets()` to find its different synsets.

Print out the definitions for each one.

Take the synset `mark.n.01` and print out some example usages of it.

Does it have any hyponyms? (Examples / instances of it... `A ____ is a mark.`)

What are its hypernyms? `A mark is a ____`

What lemma names does `mark` have? i.e. What are some synonyms of `mark`?

## An End-to-End Text Classification System

This is a combination of a review of last week (making a classifier) and also practice with embeddings.

The task will be to classify questions.

It's going to be computationally heavy, so we advise that you run this task in [Google Colaboratory](https://colab.research.google.com) (also called Google Colab), a cloud service for running Jupyter notebooks. The demonstrator will show how to use Google Colab. For additional information and to practise using notebooks in Google Colab, you can also follow this link:

* [Welcome notebook and link to additional resources](https://colab.research.google.com/notebooks/welcome.ipynb)

### Question Classification

NLTK has a corpus of questions and their question types according to a particular classification scheme (e.g. DESC refers to a question expecting a descriptive answer, such as one starting with "How"; HUM refers to a question expecting an answer referring to a human). Below is an example of use of the corpus:

In [None]:
import nltk
nltk.download("qc")
from nltk.corpus import qc
train = qc.tuples("train.txt")
test = qc.tuples("test.txt")

In [None]:
train[:3]

In [None]:
test[:3]

### Exercise: Find all question types
Write Python code that lists all the possible question types of the training set (**remember: for data exploration, never look at the test set**).
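A minimal sketch of one way to do this, assuming each item returned by `qc.tuples` is a `(label, question)` pair. The helper name `question_types` and the toy pairs are hypothetical stand-ins so that the snippet runs without the corpus.

```python
def question_types(pairs):
    """Return the sorted distinct labels found in (label, question) pairs."""
    return sorted({label for label, _question in pairs})

# Toy stand-in for the training set (made-up questions and labels)
sample_train = [
    ("DESC:manner", "How do magnets work ?"),
    ("HUM:ind", "Who wrote this notebook ?"),
    ("DESC:manner", "How is paper made ?"),
    ("LOC:city", "What city hosts the workshop ?"),
]
sample_qtypes = question_types(sample_train)
print(sample_qtypes)
```

In the notebook, `question_types(train)` would give the list for the real training set.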


### Exercise: Find all general types

The question types have two parts. The first part describes a general type, and the second part defines a subtype. For example, the question type `DESC:manner` belongs to the general `DESC` type and within that type to the `manner` subtype. Let's focus on the general types only. Write Python code that lists all the possible general types (there are 6 of them).

In [None]:
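# Distinct fine-grained labels, taken from the training set only
qtypes = {label for label, _ in train}
general_types = sorted({q.split(':')[0] for q in qtypes})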
general_types

### Exercise: Partition the data
The corpus comes with train and test data, but for this exercise we want a partition into train, dev-test, and test. Combine all the data into one array and do a 3-way partition into train, dev-test, and test. Make sure that you shuffle the data before partitioning, and that you only use the general label types.
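One possible shape for this step is sketched below. The split fractions, the seed, and the function name `three_way_split` are assumptions for illustration (the exercise does not prescribe them), and the toy data stands in for the combined corpus.

```python
import random

def three_way_split(pairs, dev_fraction=0.15, test_fraction=0.15, seed=1234):
    """Shuffle (label, question) pairs and split into train, dev-test, test.

    Labels are reduced to their general type (the part before the colon).
    """
    data = [(label.split(":")[0], question) for label, question in pairs]
    random.Random(seed).shuffle(data)  # fixed seed so the split is repeatable
    n_test = int(len(data) * test_fraction)
    n_dev = int(len(data) * dev_fraction)
    test_part = data[:n_test]
    dev_part = data[n_test:n_test + n_dev]
    train_part = data[n_test + n_dev:]
    return train_part, dev_part, test_part

# Toy stand-in for `list(train) + list(test)` from the corpus
toy = [("DESC:manner", f"question {i}") for i in range(10)]
toy += [("HUM:ind", f"question {i}") for i in range(10, 20)]
new_train, new_devtest, new_test = three_way_split(toy)
print(len(new_train), len(new_devtest), len(new_test))
```

Shuffling before slicing matters because the original files may be ordered in a way that would otherwise leak into the partitions.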

### Exercise: Tokenise the data

Use Keras' tokeniser to tokenise all the data. For this exercise we will use only the 100 most frequent words in the training set (since you aren't supposed to use the dev-test or test sets to extract features).
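To make the idea concrete, here is a simplified pure-Python stand-in for what a tokeniser with a limited vocabulary does. It is not Keras' tokeniser, which has its own punctuation filtering and indexing rules; the function names and toy sentences are hypothetical. The key point it illustrates is that the vocabulary is fitted on the training text only and then applied to the other partitions.

```python
from collections import Counter

def fit_vocabulary(texts, num_words=100):
    """Map the `num_words` most frequent lowercased tokens to indices 1..num_words."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    ranked = counts.most_common(num_words)
    return {word: i for i, (word, _count) in enumerate(ranked, start=1)}

def texts_to_indices(texts, vocab):
    """Replace each known token by its index; unknown tokens are dropped."""
    return [[vocab[tok] for tok in text.lower().split() if tok in vocab]
            for text in texts]

toy_train = ["What is the capital of France ?", "Who is the president ?"]
toy_devtest = ["What is the tallest mountain ?"]

vocab = fit_vocabulary(toy_train, num_words=100)  # fitted on training text only
indices_train = texts_to_indices(toy_train, vocab)
indices_devtest = texts_to_indices(toy_devtest, vocab)
print(indices_train)
print(indices_devtest)
```

Index 0 is left unused here so that it can serve as a padding value later.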

### Exercise: Vectorize the data
The following code shows the distribution of lengths of my training data (could be different in your training data):

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.hist([len(d) for d in indices_train])

The histogram shows that the longest question in the training data has 30 word indices, but the vast majority of questions have at most 20. Based on this, use Keras' `pad_sequences` to vectorize the questions into sequences of 20 word indices. By default `pad_sequences` truncates the beginning of a sequence, but we want to truncate the end (since the first words of a question are often very important for determining the question type). For this you can use the option `truncating='post'`: https://keras.io/preprocessing/sequence/
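The effect of truncating at the end can be illustrated with a small NumPy stand-in. This is a sketch of the intended behaviour, not Keras' implementation: it assumes short sequences are padded with zeros on the left and long ones lose their last items. The function name `pad_or_truncate` is hypothetical.

```python
import numpy as np

def pad_or_truncate(sequences, maxlen=20):
    """Left-pad with zeros and cut off the end of sequences longer than maxlen."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[:maxlen]          # keep the first maxlen indices
        if kept:                           # guard: an empty slice would fill the row
            out[row, -len(kept):] = kept   # right-align, zeros stay on the left
    return out

toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9], []]
padded = pad_or_truncate(toy_sequences, maxlen=4)
print(padded)
```

Keeping the first words means question words such as "Who" or "How" survive truncation.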

### Exercise: Vectorise the labels
Convert the labels to one-hot encoding. If you use Keras' `to_categorical`, you will first need to convert the labels to integers.
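The two steps (labels to integers, integers to one-hot rows) can be sketched with NumPy alone. The toy labels and variable names are illustrative; with Keras, the last step would be replaced by `to_categorical` applied to the integer labels.

```python
import numpy as np

toy_labels = ["DESC", "HUM", "DESC", "LOC"]

# Fix an ordering of the classes so the mapping is reproducible
classes = sorted(set(toy_labels))
label_to_int = {label: i for i, label in enumerate(classes)}

int_labels = np.array([label_to_int[label] for label in toy_labels])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(classes), dtype="float32")[int_labels]
print(one_hot)
```

The mapping should be built once and reused for every partition so that each class keeps the same column.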

### Exercise: Define the model

Define a model for classification. For this model, use a feedforward architecture with an embedding layer of size 20, a layer that computes the average of word embeddings (use `GlobalAveragePooling1D`), a hidden layer of 16 units, and `relu` activation. You need to determine the size and activation of the output layer.
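A possible sketch of such a model is given below, assuming TensorFlow's bundled Keras is available. The output layer shown (6 units with `softmax`, one per general type) is one reasonable choice rather than the only one, and the constants `VOCAB_SIZE`, `SEQ_LEN` and `NUM_CLASSES`, as well as the optimizer, are assumptions for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 101   # assumption: 100 word indices plus index 0 for padding
SEQ_LEN = 20       # length of the padded questions
NUM_CLASSES = 6    # one output unit per general question type

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=20),  # embeddings of size 20
    layers.GlobalAveragePooling1D(),                        # average over the words
    layers.Dense(16, activation="relu"),                    # hidden layer
    layers.Dense(NUM_CLASSES, activation="softmax"),        # class probabilities
])

# categorical_crossentropy matches one-hot encoded labels
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```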

### Exercise: Train and evaluate
Train your model. In the process you need to determine the optimal number of epochs. Then answer the following questions:
1. What was the optimal number of epochs and how did you determine this?
2. Is the system overfitting? Justify your answer.
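One common way to pick the number of epochs is to take the epoch with the lowest validation loss. The sketch below uses made-up loss curves purely for illustration; in practice the lists would come from the `history.history["loss"]` and `history.history["val_loss"]` entries returned by Keras' `fit` when validation data is supplied. The helper name `best_epoch` is hypothetical.

```python
import numpy as np

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return int(np.argmin(val_losses)) + 1

# Made-up curves, for illustration only (not results from this notebook)
toy_train_loss = [1.60, 1.20, 0.90, 0.70, 0.55, 0.45]
toy_val_loss = [1.62, 1.30, 1.10, 1.05, 1.08, 1.15]

epoch = best_epoch(toy_val_loss)
# A validation loss well above the training loss is a sign of overfitting
gap = toy_val_loss[epoch - 1] - toy_train_loss[epoch - 1]
print(epoch, round(gap, 2))
```

Training loss that keeps falling while validation loss rises again after the chosen epoch is the usual signature of overfitting.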

Based on the validation loss, a good value of epochs is 41. At this point the system is overfitting already but the validation loss appears to be optimal. Let's check with the accuracy as well:

Yes, accuracy also looks near optimal at 41 epochs.

### Optional Exercise: Data exploration
Plot the distribution of labels in the training data and compare with the distribution of labels in the devtest data. Plot also the distribution of predictions in the devtest data. What can you learn from this?

The training and devtest sets have similar distributions. The data is somewhat balanced except for the `ABBR` class.

The predicted labels have a different distribution to the labels of the training and devtest data, and there are no predictions for the ABBR class. This is a common issue. Sometimes, the system does not learn classes that have poor representation in the training data.
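Before plotting, the distributions can be compared numerically. The sketch below uses toy gold and predicted labels (made up for illustration, not taken from the corpus) and a hypothetical helper `label_distribution` to show how a class that is never predicted can be detected.

```python
from collections import Counter

def label_distribution(labels, classes):
    """Return the fraction of `labels` that falls in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in classes}

classes = ["ABBR", "DESC", "HUM"]
toy_gold = ["DESC", "HUM", "DESC", "ABBR", "HUM", "DESC", "HUM", "DESC"]
toy_pred = ["DESC", "HUM", "DESC", "DESC", "HUM", "DESC", "HUM", "HUM"]

gold_dist = label_distribution(toy_gold, classes)
pred_dist = label_distribution(toy_pred, classes)

# Classes that appear in the gold labels but are never predicted
never_predicted = [c for c in classes if gold_dist[c] > 0 and pred_dist[c] == 0]
print(gold_dist)
print(pred_dist)
print(never_predicted)
```

The resulting dictionaries can be passed to `plt.bar` (keys as positions, values as heights) to produce the plots.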

### Optional Exercise: Improve your system

Try the following options:

1. Use pre-trained word embeddings
2. Use recurrent neural networks.

Feel free to try each option separately and in combination, and compare the results. Also feel free to try other variants of the initial architecture, such as:

1. Introducing more hidden layers.
2. Changing the size of embeddings.
3. Changing the number of units in the hidden layer(s).