# A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning

Link: https://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf

Authors: Ronan Collobert, Jason Weston

Institution: NEC Labs America

Publication: Proceedings of the 25th International Conference on Machine Learning

Date: 2008




## Background Materials




## Papers citing this paper



## What is this paper about?

A single convolutional neural network architecture that, given a sentence, outputs part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words, and the likelihood that the sentence makes sense grammatically and semantically.


## What is the motivation of this research?

Analyzing NLP tasks *separately* has the following failings:
- it is shallow, in the sense that the classifier is often linear
- it requires many hand-engineered, task-specific features for good performance
- it propagates errors by cascading features learnt separately from other tasks


## What makes this paper different from previous research?

They define a *unified* architecture for NLP that learns relevant features between tasks given *very limited prior knowledge*.


## How does this paper achieve it?

### NLP tasks

#### Part-Of-Speech Tagging (POS)

Labeling each word with a unique tag that indicates its syntactic role (e.g. noun, adverb).

#### Chunking

Labeling segments of a sentence with syntactic constituents (e.g. noun phrase NP, verb phrase VP).

#### Named Entity Recognition (NER)

Labeling atomic elements in the sentence with categories (e.g. "PERSON", "COMPANY", "LOCATION").

#### Semantic Role Labeling (SRL)

Giving a semantic role to a syntactic constituent of a sentence. In the PropBank formalism, one assigns roles ARG0-5 to words that are arguments of a predicate in the sentence (e.g. $[\mathrm{John}]_{\mathrm{ARG0}} [\mathrm{ate}]_\mathrm{REL} [\mathrm{the\ apple}]_{\mathrm{ARG1}}$).

#### Language Models

Estimating the probability of the next word being $w$ in a sentence.

#### Semantically Related Words ("Synonyms")

Predicting whether two words are semantically related, which is measured using the WordNet database.


### General Deep Architecture for NLP

A deep neural network is used and trained in an end-to-end fashion.

The first layer extracts features for each word. The second layer extracts features from the sentence, treating it as a sequence with local and global structure. The following layers are classical NN layers.

<img src="img/A_Unified_Architecture_for_Natural_Language_Processing-Deep_Neural_Networks_with_Multitask_Learning_Figure1.png" width="300">

#### Transforming Indices into Vectors

The first layer deals directly with raw words. Each word is mapped to a vector.

Each word $i \in \mathcal{D}$ is embedded into d-dimensional space using a lookup table $\mathrm{LT}_W(.)$:

$\mathrm{LT}_W(i) = W_i$

where $\mathcal{D}$ is a finite dictionary of words, $W \in \mathbb{R}^{d\times |\mathcal{D}|}$ is a matrix of parameters to be learnt, $W_i \in \mathbb{R}^d$ is the i-th column of $W$ and $d$ is the word vector size (wsz) to be chosen by the user.

An input sentence $\{s_1,..., s_n\}$ is thus transformed into a series of vectors $\{W_{s_1}, ..., W_{s_n}\}$.

The parameter matrix $W$ is trained during the learning process, together with the rest of the network.
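The lookup described above can be sketched in a few lines of NumPy. This is only an illustration: the toy dictionary `vocab`, the size `d = 3` and the random initialization of `W` are assumptions made for the example, and in the paper `W` is learnt by backpropagation rather than left random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dictionary D; a real one is far larger.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
d = 3  # word vector size (wsz), chosen by the user

# W has one column per word, so its shape is (d, |D|).
# It would be learnt during training; here it is just random.
W = rng.normal(scale=0.1, size=(d, len(vocab)))


def lookup(W, i):
    """LT_W(i) = W_i, the i-th column of W."""
    return W[:, i]


sentence = ["the", "cat", "sat"]
indices = [vocab[w] for w in sentence]

# The sentence becomes a sequence of n column vectors, shape (d, n).
vectors = np.stack([lookup(W, i) for i in indices], axis=1)
print(vectors.shape)
```

Storing words as columns mirrors the notation $W \in \mathbb{R}^{d\times |\mathcal{D}|}$; many libraries store embeddings as rows instead, which is the same table transposed.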

#### Variations on Word Representations

All words are converted to lower case, and capitalization is represented as a separate feature flag.

When a word is decomposed into $K$ features, it is represented as a tuple $\boldsymbol{i} = \{i^1,...,i^K\} \in \mathcal{D}^1\times ... \times \mathcal{D}^K$, where $\mathcal{D}^k$ is the dictionary for the k-th element.

Each element is associated with a lookup table $\mathrm{LT}_{W^k}(.)$ with parameters $W^k \in \mathbb{R}^{d^k\times |\mathcal{D}^k|}$, where $d^k$ is a user-specified vector size. The word is then embedded in a $d = \sum_k d^k$ dimensional space by concatenating the outputs of all lookup tables.
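As a rough sketch of the multi-feature case, the example below uses $K = 2$ features, the lower-cased word and a capitalization flag, each with its own lookup table, and assumes the per-feature vectors are concatenated. The dictionaries, the sizes `d_word` and `d_caps`, and the helper names `featurize` and `embed` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dictionaries for the two features.
word_vocab = {"john": 0, "ate": 1, "the": 2, "apple": 3}
caps_vocab = {"lower": 0, "capitalized": 1}

# One lookup table per feature, each with its own (assumed) vector size.
d_word, d_caps = 4, 2
W_word = rng.normal(size=(d_word, len(word_vocab)))
W_caps = rng.normal(size=(d_caps, len(caps_vocab)))


def featurize(token):
    """Split a raw token into the index tuple (i^1, i^2)."""
    caps = "capitalized" if token[:1].isupper() else "lower"
    return word_vocab[token.lower()], caps_vocab[caps]


def embed(token):
    """Concatenate the outputs of the per-feature lookup tables."""
    i_word, i_caps = featurize(token)
    return np.concatenate([W_word[:, i_word], W_caps[:, i_caps]])


x = embed("John")
print(x.shape)
```

With this layout "John" and "john" share the same word vector and differ only in the capitalization part of the embedding.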

#### Classifying with Respect to a Predicate

In SRL the class label of each word in a sentence depends on a given predicate.

A feature that encodes each word's relative distance to the predicate is added. For the i-th word in a sentence, if the predicate is at position $\mathrm{pos_p}$, an additional lookup table $LT^\mathrm{dist_p}(i - \mathrm{pos_p})$ is used.
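One way to picture the distance feature is as a lookup table indexed by the signed offset $i - \mathrm{pos_p}$. The sketch below shifts the offset so that it becomes a valid column index; the clipping range `max_dist` and the embedding size `d_dist` are assumptions of this example, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

max_dist = 5  # hypothetical clipping range for relative distances
d_dist = 2    # hypothetical size of the distance embedding

# One column for each relative distance in [-max_dist, max_dist].
W_dist = rng.normal(size=(d_dist, 2 * max_dist + 1))


def distance_feature(i, pos_p):
    """Look up the embedding of the signed distance i - pos_p."""
    rel = int(np.clip(i - pos_p, -max_dist, max_dist))
    return W_dist[:, rel + max_dist]  # shift so the index is non-negative


sentence = ["John", "ate", "the", "apple"]
pos_p = 1  # position of the predicate "ate"

# One distance vector per word, shape (d_dist, n).
features = np.stack(
    [distance_feature(i, pos_p) for i in range(len(sentence))], axis=1
)
print(features.shape)
```

These vectors would be combined with the word vectors of the same positions, so the same sentence gets a different input representation for each predicate.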


#### Variable Sentence Length

To deal with variable-length sequences, Time-Delay Neural Networks (TDNNs, Waibel et al., 1989) are used.

A TDNN "reads" the sequence in an online fashion, at time $t$ one sees $x_t$.

Given an input sequence $x$, a classical TDNN layer performs a convolution that outputs another sequence $\boldsymbol{o}$, whose value at time $t$ is:

$\boldsymbol{o}(t) = \sum_{j=1-t}^{n-t} \boldsymbol{L}_j \cdot x_{t+j}$

where $\boldsymbol{L}_j \in \mathbb{R}^{n_{hu}\times d} (-n \le j \le n)$ are the trainable parameters of the layer with $n_{hu}$ hidden units.

The convolution is constrained by defining a *kernel width* (ksz), which enforces

$\forall |j| \gt (ksz - 1)/2, \boldsymbol{L}_j = 0$

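Unlike the window approach, which only considers the words in a window of size ksz around the word to be labeled, a TDNN considers all windows of ksz words in the sentence at the same time.

A max layer then takes the maximum over time (over the sentence) for each of the $n_{hu}$ output features. As this layer's output has a fixed dimension, independent of the sentence length, subsequent layers can be classical NN layers.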
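The convolution with a kernel width can be sketched as follows. The function `tdnn_layer` implements the sum over $j$ with $\boldsymbol{L}_j = 0$ outside the kernel, using 0-based positions. The step that makes the output size independent of sentence length is written here as a max over time; treat that pooling choice, the function names and all sizes as assumptions of this sketch.

```python
import numpy as np


def tdnn_layer(x, L):
    """Convolve a sequence x of shape (d, n) with kernel matrices L.

    L has shape (ksz, n_hu, d); L[j + half] plays the role of L_j for
    |j| <= (ksz - 1) / 2. All other L_j are implicitly zero.
    """
    ksz, n_hu, _ = L.shape
    half = (ksz - 1) // 2
    n = x.shape[1]
    o = np.zeros((n_hu, n))
    for t in range(n):
        for j in range(-half, half + 1):
            if 0 <= t + j < n:  # positions outside the sentence contribute nothing
                o[:, t] += L[j + half] @ x[:, t + j]
    return o


def max_over_time(o):
    """Keep, for each hidden unit, its largest value over the sentence."""
    return o.max(axis=1)


rng = np.random.default_rng(0)
d, n_hu, ksz = 3, 4, 3
L = rng.normal(size=(ksz, n_hu, d))

short = rng.normal(size=(d, 5))  # a 5-word sentence
long = rng.normal(size=(d, 9))   # a 9-word sentence

f_short = max_over_time(tdnn_layer(short, L))
f_long = max_over_time(tdnn_layer(long, L))
print(f_short.shape, f_long.shape)
```

Both sentences end up as vectors of length `n_hu`, which is what allows ordinary fully connected layers to follow.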

#### Deep Architecture

A TDNN layer performs a linear operation.

A nonlinearity is added by applying a $\tanh$ activation to the output of layer $l-1$:

$\boldsymbol{o}^l = \tanh(\boldsymbol{L}^l \cdot \boldsymbol{o}^{l-1})$

The size of the last layer's output is the number of classes of the NLP task. The last layer is followed by a softmax layer, and the network is trained with the cross-entropy criterion.
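A minimal sketch of the classical layers on top of the fixed-size feature vector: one $\tanh$ hidden layer, a final linear layer with one output per class, a softmax, and the cross-entropy loss for a single example. The layer sizes, the single hidden layer, and the absence of bias terms are simplifying assumptions.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def forward(features, L1, L2):
    hidden = np.tanh(L1 @ features)  # o^l = tanh(L^l . o^{l-1})
    return softmax(L2 @ hidden)      # one probability per class


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[label])


rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 4, 5, 3  # hypothetical sizes
L1 = rng.normal(size=(n_hidden, n_features))
L2 = rng.normal(size=(n_classes, n_hidden))

features = rng.normal(size=n_features)
probs = forward(features, L1, L2)
loss = cross_entropy(probs, label=1)
print(probs.sum(), loss)
```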


### Multitasking with Deep NN

#### Deep Joint Training

If one considers related tasks, features useful for one task might be useful for other ones.

In NLP, POS predictions are often used as features for SRL and NER.

When NNs are trained on related tasks, sharing deep layers between them is expected to improve generalization performance.
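To make the idea of sharing concrete, the sketch below lets two tasks share one lookup table `W` while each keeps its own output layer. It is deliberately reduced to a single linear layer per task, and the alternating task schedule, the toy data, the class counts and the learning rate are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 6, 4

W = rng.normal(scale=0.1, size=(d, vocab_size))  # shared lookup table
heads = {                                        # task-specific output layers
    "pos": rng.normal(scale=0.1, size=(3, d)),   # 3 hypothetical POS classes
    "ner": rng.normal(scale=0.1, size=(2, d)),   # 2 hypothetical NER classes
}


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def sgd_step(task, word, label, lr=0.1):
    """One cross-entropy gradient step on a single (word, label) example."""
    V = heads[task]
    e = W[:, word].copy()
    p = softmax(V @ e)
    grad_logits = p.copy()
    grad_logits[label] -= 1.0                        # d loss / d logits
    heads[task] = V - lr * np.outer(grad_logits, e)  # update this task only
    W[:, word] -= lr * (V.T @ grad_logits)           # update the shared table
    return -np.log(p[label])


# Hypothetical toy data: (word index, label) pairs per task.
data = {"pos": [(0, 1), (1, 2), (2, 0)], "ner": [(0, 1), (3, 0)]}

for step in range(20):
    task = ["pos", "ner"][step % 2]                  # alternate between tasks
    word, label = data[task][(step // 2) % len(data[task])]
    sgd_step(task, word, label)
```

Because word 0 appears in both toy datasets, its column of `W` receives gradient from both tasks, while each output layer is only touched by its own task.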

#### Previous Work in MTL for NLP

Two types of previous multi-task learning research exist:

- cascading features
- shallow joint training


### Leveraging Unlabeled Data

The proposed architecture can be trained jointly on supervised tasks using labeled data and on unsupervised tasks using unlabeled data.

#### Language Model

They trained a language model as a two-class classification task: deciding whether the word in the middle of the input window is related to its context or not.
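Training data for such a two-class task can be built from raw text alone. The sketch below treats every genuine window as a positive example and creates a negative example by replacing its middle word with a different word. The replacement strategy, the helper name `make_examples` and the toy vocabulary are assumptions made for illustration.

```python
import random


def make_examples(tokens, vocabulary, ksz=5, rng=None):
    """Return (window, label) pairs: 1 for a genuine window, 0 for a corrupted one."""
    rng = rng or random.Random(0)
    half = ksz // 2
    examples = []
    for t in range(half, len(tokens) - half):
        window = tokens[t - half : t + half + 1]
        examples.append((window, 1))

        # Corrupt a copy of the window by swapping in a different middle word.
        candidates = [w for w in vocabulary if w != window[half]]
        corrupted = list(window)
        corrupted[half] = rng.choice(candidates)
        examples.append((corrupted, 0))
    return examples


tokens = "the cat sat on the mat".split()
vocabulary = sorted(set(tokens)) + ["dog", "ran"]  # hypothetical vocabulary
examples = make_examples(tokens, vocabulary, ksz=3)

for window, label in examples[:4]:
    print(label, " ".join(window))
```

No labels are needed beyond the text itself, which is why this task can use unlabeled data such as Wikipedia.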

Their experiments showed that the embedding learnt by the lookup-table layer clusters semantically similar words. 

The word lookup table learnt by the language model was then used to initialize the lookup table in MTL.

<img src="img/A_Unified_Architecture_for_Natural_Language_Processing-Deep_Neural_Networks_with_Multitask_Learning_Figure2.png" width="400">

#### Semantically Related Words Task

The embedding obtained with a language model on unlabeled data is compared with an embedding obtained with labeled data.



## Datasets used in this study

- PropBank dataset version 1
- Penn TreeBank
- Wikipedia
- WordNet


## Implementations




## Further Readings

- http://sebastianruder.com/multi-task/
