# Part-of-speech tagging for Treebank of Learner English corpora with Recurrent Neural Networks

## Motivation

>Part-of-speech (POS) tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. [Wikipedia](https://en.wikipedia.org/wiki/Part-of-speech_tagging)

POS tagging is a foundation for many NLP/NLU tasks, such as Named Entity Recognition (NER) and Abstract Meaning Representation (AMR) parsing. In this project, I want to explore state-of-the-art Recurrent Neural Network (RNN) based models for POS tagging. The candidate models are:
- Long Short-Term Memory (LSTM)
- Bidirectional LSTM (BI-LSTM)
- LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF) 
- Bidirectional LSTM with a CRF layer (BI-LSTM-CRF)

(**Update 2018/04/12: the basic LSTM is added**)

## Dataset

>UD English-ESL/TLE is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with POS tags and dependency trees in the Universal Dependencies formalism. Each sentence is annotated both in its original and error corrected forms. The annotations follow the standard English UD guidelines, along with a set of supplementary guidelines for ESL. The dataset represents upper-intermediate level adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were randomly drawn from the Cambridge Learner Corpus First Certificate in English (FCE) corpus. The treebank is split randomly to a training set of 4,124 sentences, development set of 500 sentences and a test set of 500 sentences. Further information is available at [esltreebank.org](http://esltreebank.org).

Citation: (Berzak et al., 2016; Yannakoudakis et al., 2011)


### Data Loader

I've built a data loader for this dataset. To use it, first install the [CoNLL-U Parser](https://github.com/EmilStenstrom/conllu) built by [Emil Stenström](https://github.com/EmilStenstrom). The following example shows how to use `data_loader`:

In [None]:
import data_loader

meta_list, data_list = data_loader.load_data(load_train=True, load_dev=True, load_test=True)

train_meta, train_meta_corrected, \
dev_meta, dev_meta_corrected, \
test_meta, test_meta_corrected = meta_list

train_data, train_data_corrected, \
dev_data, dev_data_corrected, \
test_data, test_data_corrected = data_list

### Metadata
- doc_id: filename (also learner ID) of the original xml file
- sent: raw text of the sentence written by the learner, with error-correction tags
- native_language: native language of the learner
- age_range: age range of the learner
- score: exam score of the learner

Some observations:
- "native_language" enables us to design tasks related to native language identification.
- "age_range" enables us to identify a learner's age range based on their writing style.
- "score" can help us group learners into categories, such as Beginner, Intermediate, Expert, Fluent, and Proficient. It enables us to discover the writing style and common mistakes of different groups of learners.

In [2]:
train_meta.head()

| id | doc_id | sent | native_language | age_range | score |
|---|---|---|---|---|---|
| 1 | doc2664 | `I was <ns type="S"><i>shoked</i><c>shocked</c>...` | Russian | 21-25 | 21.0 |
| 2 | doc648 | I am very sorry to say it was definitely not a... | French | 26-30 | 38.0 |
| 3 | doc1081 | Of course, I became aware of her feelings sinc... | Spanish | 16-20 | 36.0 |
| 4 | doc724 | I also suggest that more plays and films shoul... | Japanese | 21-25 | 33.0 |
| 5 | doc567 | `Although my parents were very happy <ns type="...` | Spanish | 31-40 | 34.0 |


### Sentence Format
In this project, I will only use "form" (words) and "upostag" (part-of-speech tags).

In [3]:
train_data[0]

| id | form | lemma | upostag | xpostag | feats | head | deprel | deps | misc | meta_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | I | _ | PRON | PRP | | 3 | nsubj | | | 1 |
| 2 | was | _ | VERB | VBD | | 3 | cop | | | 1 |
| 3 | shoked | _ | ADJ | JJ | | 0 | root | | | 1 |
| 4 | because | _ | SCONJ | IN | | 8 | mark | | | 1 |
| 5 | I | _ | PRON | PRP | | 8 | nsubj | | | 1 |
| 6 | had | _ | AUX | VBD | | 8 | aux | | | 1 |
| 7 | alredy | _ | ADV | RB | | 8 | advmod | | | 1 |
| 8 | spoken | _ | VERB | VBN | | 3 | advcl | | | 1 |
| 9 | with | _ | ADP | IN | | 10 | case | | | 1 |
| 10 | them | _ | PRON | PRP | | 8 | nmod | | | 1 |

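Since only "form" and "upostag" are used, each sentence can be reduced to (word, tag) pairs, with the tags mapped to integer ids before training. The sketch below shows one way to do that for the example sentence above. The helper names (`build_tag_index`, `encode_tags`) are hypothetical and not part of `data_loader`, and the project itself may encode labels differently.

```python
# Hypothetical helpers for illustration; they are not part of data_loader.

def build_tag_index(sentences):
    """Map each distinct POS tag to an integer id, in sorted order."""
    tags = sorted({tag for sentence in sentences for _, tag in sentence})
    return {tag: i for i, tag in enumerate(tags)}


def encode_tags(sentence, tag_to_ix):
    """Convert the tags of one sentence to a list of integer ids."""
    return [tag_to_ix[tag] for _, tag in sentence]


# The example sentence shown above, reduced to (form, upostag) pairs.
sentence = [
    ("I", "PRON"), ("was", "VERB"), ("shoked", "ADJ"), ("because", "SCONJ"),
    ("I", "PRON"), ("had", "AUX"), ("alredy", "ADV"), ("spoken", "VERB"),
    ("with", "ADP"), ("them", "PRON"),
]

tag_to_ix = build_tag_index([sentence])
encoded = encode_tags(sentence, tag_to_ix)
print(tag_to_ix)
print(encoded)
```

In practice the index would be built over all training sentences, so that every one of the 17 tags gets an id.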

## Task 1: Continuous POS tagging with RNNs

### Architecture

In this task, a POS tagger was trained on the full training set (4,124 sentences), validated on the dev set (500 sentences), and tested on the test set (500 sentences). The architecture is as follows:

![Task1 Architecture](figures/task1-arch.png)


### RNN Models

In this project, I mainly use [PyTorch](http://pytorch.org/) to implement the RNN models. The models implemented so far are:

#### Long Short-Term Memory (LSTM)
>Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). A RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. [Wikipedia](https://en.wikipedia.org/wiki/Long_short-term_memory)

The following is the high-level architecture for the LSTM model:

![Task1_Feature](figures/task1-w2v-lstm.png)

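To make the architecture above concrete, here is a minimal PyTorch sketch of an LSTM tagger that takes 300-dimensional word vectors and outputs a score for each of the 17 POS tags at every token. It is not the project's actual implementation. The class name `LSTMTagger`, the hidden size of 128, the single unidirectional layer, and the cross-entropy loss are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class LSTMTagger(nn.Module):
    """Maps a sequence of word vectors to per-token tag scores."""

    def __init__(self, input_dim=300, hidden_dim=128, num_tags=17):
        super().__init__()
        # hidden_dim=128 is an assumed value, not taken from the project.
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)

    def forward(self, word_vectors):
        # word_vectors: (batch, seq_len, input_dim)
        lstm_out, _ = self.lstm(word_vectors)
        # Returns unnormalized scores of shape (batch, seq_len, num_tags).
        return self.hidden2tag(lstm_out)


torch.manual_seed(0)
model = LSTMTagger()

# Random stand-ins for the word vectors and gold tags of one 10-token sentence.
sentence = torch.randn(1, 10, 300)
gold_tags = torch.randint(0, 17, (1, 10))

scores = model(sentence)
# CrossEntropyLoss expects (N, C) scores and (N,) integer targets.
loss = nn.CrossEntropyLoss()(scores.view(-1, 17), gold_tags.view(-1))
predicted = scores.argmax(dim=-1)  # most likely tag id per token
```

Passing `bidirectional=True` to `nn.LSTM` and doubling the input size of the linear layer would give a BI-LSTM variant of the same sketch.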
#### Word Features

I use the pre-trained [Word2Vec model](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit) built on the Google News corpus (3 million 300-dimensional English word vectors). It might not be the best choice (e.g. the Google News corpus might not be representative of English learner text), but it is still a legitimate one:
1. It saves me the time of building a large dictionary that covers all words in the UD English-ESL/TLE corpus.
2. It saves the time and computing resources needed to build large, sparse unigram vectors for words, and I don't need to worry about dimension reduction for now.
3. A 300-dimensional w2v vector is small enough for this task, and because the dimension is fixed, the vector can be used directly as neural network input.
4. It's free and available on Google Drive :).

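Learner text contains misspellings such as "shoked" that are unlikely to appear in a vocabulary built from news text, so the lookup needs some fallback for out-of-vocabulary words. The sketch below uses a zero vector, which is an assumption. This write-up does not state how unknown words are handled. The function names are hypothetical, and a plain dict with 3-dimensional vectors stands in for the pre-trained model.

```python
EMBEDDING_DIM = 300  # dimensionality of the Google News vectors


def lookup(word, vectors, dim=EMBEDDING_DIM):
    """Return the vector for `word`, or a zero vector if it is unknown."""
    if word in vectors:
        return list(vectors[word])
    # Assumed fallback: out-of-vocabulary words map to all zeros.
    return [0.0] * dim


def sentence_to_features(words, vectors, dim=EMBEDDING_DIM):
    """Build one feature vector per word, giving seq_len rows of length dim."""
    return [lookup(word, vectors, dim) for word in words]


# Toy stand-in for the pre-trained model, using 3-dimensional vectors.
toy_vectors = {"I": [0.1, 0.2, 0.3], "was": [0.4, 0.5, 0.6]}
features = sentence_to_features(["I", "was", "shoked"], toy_vectors, dim=3)
print(features)
```

A loaded gensim `KeyedVectors` object supports the same `in` and `[]` operations, so it could presumably be passed in place of the dict.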
### Parameter Tuning

#### Number of epochs

The dataset was divided into train, dev, and test sets. I used the train and dev sets to observe how accuracy and loss fluctuate over 1000 training epochs. There are 17 different POS tags in this experiment, and a prediction is considered correct only if it is the same as the actual POS tag. At the end of the 1000 epochs, the LSTM model achieves **88.09%** training accuracy, **80.9%** validation accuracy, and **88.42%** test accuracy. According to the following figures, there is no apparent overfitting, and the best number of training epochs is around 650-700. However, the train loss and dev loss curves did not intersect during the experiment.

![Task1_Accu](figures/accu_linear.png)
![Task1_Loss](figures/loss_linear.png)

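The accuracy criterion described above can be sketched as a token-level exact match between predicted and gold tags. The second helper picks the epoch with the lowest dev loss, which is one common way to choose the number of epochs. Whether the 650-700 range was chosen this way is an assumption. Both function names are hypothetical, and the numbers are made up for illustration rather than taken from the experiment.

```python
def token_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def best_epoch(dev_losses):
    """1-based index of the epoch with the lowest dev loss."""
    return min(range(len(dev_losses)), key=dev_losses.__getitem__) + 1


# Made-up values for illustration only, not results from the experiment.
gold = ["PRON", "VERB", "ADJ", "SCONJ"]
predicted = ["PRON", "VERB", "NOUN", "SCONJ"]
accuracy = token_accuracy(predicted, gold)  # 3 of 4 tags match

dev_losses = [0.9, 0.6, 0.5, 0.55, 0.7]
chosen = best_epoch(dev_losses)
print(accuracy, chosen)
```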
## References

1. Berzak, Y., Kenney, J., Spadine, C., Wang, J. X., Lam, L., Mori, K. S., ... & Katz, B. (2016). Universal dependencies for learner English. arXiv preprint arXiv:1605.04278.
2. Yannakoudakis, H., Briscoe, T., & Medlock, B. (2011, June). A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1 (pp. 180-189). Association for Computational Linguistics.