# Cross-lingual Learning through Adversarial Training

### Problem Formulation

Let $L_1$ and $L_2$ be two different natural languages, and let $C_1$ and $C_2$ be corpora from $L_1$ and $L_2$, respectively.

We denote a document from $C_i$ by $C_i^{(j)}$. Let $Q_i^{(j)}$ be a question based on the information available in the document $C_i^{(j)}$, and let $A_i^{(j)}$ be the answer to that question.

We wish to find two encoding networks $\phi_1(\cdot)$, $\phi_2(\cdot)$ and an answerer network $\gamma(\cdot, \cdot)$ such that $\gamma(\phi_i(C_i^{(j)}), \phi_i(Q_i^{(j)})) \to A_i^{(j)}$ for both $i=1$ and $i=2$.

Let $R_i^{(j)}$ be the output of $\phi_i(C_i^{(j)})$ (the document representation) and $r_i^{(j)}$ be the output of $\phi_i(Q_i^{(j)})$ (the question representation). We aim to minimise the language dependency of $R$ and $r$ so that the answerer network $\gamma$ learns a language-independent way to solve the question-answering task. Ideally, this means that if $C_1^{(j)}$ and $C_2^{(j)}$ are similar in content, then $||R_1^{(j)} - R_2^{(j)}||$ should be small (and likewise for $r$). In practice, however, this is difficult to measure or enforce directly because:
1. We are unlikely to find documents in two languages that are perfectly aligned and also suitable for QA tasks;
2. Directly forcing $R_1^{(j)} = R_2^{(j)}$ (i.e. the approach taken by neural machine translation) leads to a model that is difficult to train.

To circumvent the issue of directly minimising the representation difference, we introduce a discriminator network $\delta(\cdot)$ that classifies an input representation $R$ (or $r$) as either "from $L_1$" or "from $L_2$". The goal of $\delta$ is to distinguish as well as possible between representations generated by $\phi_1$ and $\phi_2$ (which ultimately means inputs from $L_1$ or $L_2$). In addition to maximising the accuracy of $\gamma$, we give $\phi_i$ a second training goal: to minimise the success rate (or, equivalently, maximise the error rate) of $\delta$.

### Details of the Model

$C_i^{(j)}$ is an S by N by V tensor, where S is the number of sentences, N is the number of words per sentence and V is the size of the vocabulary.

$\phi_i(\cdot)=\phi_i^{summary}(\phi_i^{sent}(\phi_i^{emb}(\cdot)))$

$\phi_i^{emb}$ is the embedding layer, whereas $\phi_i^{sent}$ and $\phi_i^{summary}$ are both bi-RNNs. $\phi_i^{sent}$ generates a single output for each sentence, and $\phi_i^{summary}$ generates a context-aware representation for each sentence given all the others.

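$R_i^{(j)}$ and $r_i^{(j)}$ are S by D and 1 by D matrices, respectively, where D is the dimension of the representation. (The question representation must share the dimension D so that it can be compared with the rows of $R_i^{(j)}$ and fed to the same discriminator.)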
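To make the shapes concrete, the composition $\phi_i^{summary}(\phi_i^{sent}(\phi_i^{emb}(\cdot)))$ can be traced with a toy sketch. Almost everything below is an assumption made for illustration: word ids replace the one-hot V dimension, the embedding table is random, and mean pooling and neighbour averaging are crude placeholders for the two bi-RNNs. Only the shape flow (S by N words in, S by D representation out) mirrors the text.

```python
import random

def embed(doc_ids, table):
    """phi_emb stand-in: look up a D-dim vector per word id (S x N -> S x N x D)."""
    return [[table[w] for w in sent] for sent in doc_ids]

def encode_sentences(embedded):
    """phi_sent stand-in: mean-pool the words of each sentence (S x N x D -> S x D).
    The note uses a bi-RNN here; mean pooling is only a placeholder."""
    out = []
    for sent in embedded:
        dim = len(sent[0])
        out.append([sum(vec[d] for vec in sent) / len(sent) for d in range(dim)])
    return out

def summarise(sent_vecs):
    """phi_summary stand-in: average each sentence with its neighbours (S x D -> S x D).
    This resembles a context-aware bi-RNN output in shape only."""
    out = []
    for k in range(len(sent_vecs)):
        window = sent_vecs[max(0, k - 1): k + 2]
        dim = len(sent_vecs[k])
        out.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return out

def phi(doc_ids, table):
    """Full encoder: summary(sent(emb(document)))."""
    return summarise(encode_sentences(embed(doc_ids, table)))

random.seed(0)
V, D = 10, 4  # hypothetical vocabulary size and representation dimension
table = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(V)]
doc = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # S = 3 sentences, N = 3 word ids each
R = phi(doc, table)
print(len(R), len(R[0]))  # number of rows (S) and columns (D)
```

In a real model each language would have its own `table` and recurrent weights, since $\phi_1$ and $\phi_2$ are separate networks.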

$\gamma(R, r) = \gamma^{MLP}(r, \sigma(R, r) \odot R)$, where $\sigma(\cdot, \cdot)$ is a similarity measure (such as the dot product) applied across the S dimension of R. Here $\sigma$ serves as the attention mechanism. The output of $\gamma$ is a softmax (or logits) vector with the same length as $A_i^{(j)}$. (Open problem: in order to use a softmax output, the answer space must be shared between the two languages; otherwise the answerer network $\gamma$ cannot achieve language independence.)
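A minimal sketch of this attention step, assuming a dot-product $\sigma$, may help. The softmax normalisation of the scores, the sum over the S dimension that produces a single context vector, and the single linear layer standing in for $\gamma^{MLP}$ are all assumptions that the text above does not specify; the weights `W` and `b` are made-up numbers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(R, r):
    """sigma(R, r) applied row-wise to R, then used to scale the rows.
    Softmax normalisation and the final sum over S are assumptions."""
    weights = softmax([dot(row, r) for row in R])
    weighted = [[w * x for x in row] for w, row in zip(weights, R)]
    context = [sum(col) for col in zip(*weighted)]
    return weights, context

def answer(R, r, W, b):
    """gamma^MLP stand-in: one linear layer over [r ; context], then softmax."""
    _, context = attend(R, r)
    features = r + context  # list concatenation, length 2 * D
    logits = [dot(w_row, features) + b_k for w_row, b_k in zip(W, b)]
    return softmax(logits)

# Toy inputs: S = 3 sentence vectors and one question vector, D = 2.
R = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
r = [2.0, 0.0]
weights, context = attend(R, r)

# Hypothetical weights for a three-way answer space.
W = [[0.1, 0.0, 1.0, 0.0], [0.0, 0.1, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]
probs = answer(R, r, W, b)
```

The first sentence vector is the most similar to the question, so it receives the largest attention weight. Because `answer` only sees `R` and `r`, the same function can be applied to representations from either language, which is the property the shared answer space is meant to protect.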

$\delta$ is an MLP that takes either $R_i^{(j)}[k, :]$ or $r_i^{(j)}$ as input and generates a binary output.

### Training Goals

The main goal of the model is to maximise the probability of successfully predicting the answer, i.e.
$$
\arg\max_{\Theta_\phi, \Theta_\gamma} \log P(A \mid Q, C, \Theta_\phi, \Theta_\gamma)
$$

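The auxiliary goal is to maximise the confusion of the discriminator network $\delta$. Writing $L$ for the true source language of the input and $\lnot L$ for the other language, this means maximising the probability that $\delta$ predicts the wrong language, i.e.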
$$
\arg\max_{\Theta_\phi} \log P(\lnot L \mid Q, C, \Theta_\phi, \Theta_\delta)
$$

The joint goal, with a weight $\alpha$ that balances the answering term against the adversarial term, can be represented as:
$$
\arg\max_{\Theta_\phi, \Theta_\gamma} \left[ \log P(A \mid Q, C, \Theta_\phi, \Theta_\gamma)
+ \alpha \cdot \log P(\lnot L \mid Q, C, \Theta_\phi, \Theta_\delta) \right]
$$

The goal of the discriminator network, in contrast, is to maximise the probability of the correct prediction, i.e.
$$
\arg\max_{\Theta_\delta} \log P(L \mid Q, C, \Theta_\phi, \Theta_\delta)
$$
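The objectives above can be read as two losses to be minimised, one for the encoders and answerer and one for the discriminator. The sketch below is only an illustration: it assumes $\delta$ outputs a single probability `p_l1` that the input came from $L_1$, takes the probability the answerer assigns to the correct answer as given, and turns each $\arg\max$ into a negative log-likelihood. Updating the two groups of parameters in alternation is a common choice in adversarial training, but it is an assumption here, not something the goals above prescribe.

```python
import math

def discriminator_loss(p_l1, true_lang):
    """Negative of log P(L | ...): delta is rewarded for naming the true language.
    p_l1 is delta's probability that the input came from L1; true_lang is 1 or 2."""
    p_true = p_l1 if true_lang == 1 else 1.0 - p_l1
    return -math.log(p_true)

def encoder_answerer_loss(p_answer, p_l1, true_lang, alpha):
    """Negative of the joint goal: log P(A | ...) + alpha * log P(not L | ...)."""
    p_wrong = 1.0 - p_l1 if true_lang == 1 else p_l1
    return -(math.log(p_answer) + alpha * math.log(p_wrong))

alpha = 0.5  # hypothetical weight of the adversarial term

# Same answer quality, but delta is either right (0.9) or fooled (0.2) on an L1 input.
loss_when_detected = encoder_answerer_loss(0.8, p_l1=0.9, true_lang=1, alpha=alpha)
loss_when_fooled = encoder_answerer_loss(0.8, p_l1=0.2, true_lang=1, alpha=alpha)

# The discriminator sees the opposite incentive on the same two cases.
delta_loss_when_right = discriminator_loss(0.9, true_lang=1)
delta_loss_when_fooled = discriminator_loss(0.2, true_lang=1)
```

With the answer probability held fixed, the encoder loss falls as the discriminator is fooled, while the discriminator loss rises; $\alpha$ controls how strongly the encoders trade answer accuracy for language confusion.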

### Issues with the Data

In order to train a cross-lingual question-answering model, we need QA corpora of decent quality in more than one language. Ideally, the length, topic and similar properties of the two corpora should also align; otherwise the discriminator network will likely be able to tell the two languages apart from these non-language-specific features.
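One rough way to check for such a shortcut before training is to see how well document length alone separates the two corpora. The sketch below is an assumption-laden illustration, not part of the proposed model: the function name and the token counts are hypothetical, and a single length threshold is only the simplest possible stand-in for what a discriminator might exploit.

```python
def best_length_threshold_accuracy(lengths_1, lengths_2):
    """Accuracy of the best rule of the form 'length <= t means one corpus,
    otherwise the other', tried over every observed length t."""
    labelled = [(n, 1) for n in lengths_1] + [(n, 2) for n in lengths_2]
    total = len(labelled)
    best = 0.0
    for t in sorted({n for n, _ in labelled}):
        # Rule: length <= t is predicted as corpus 1, otherwise corpus 2.
        correct = sum(1 for n, lab in labelled if (lab == 1) == (n <= t))
        # The mirrored rule is right exactly where this rule is wrong.
        best = max(best, correct / total, (total - correct) / total)
    return best

# Hypothetical document lengths (in tokens) for the two corpora.
lengths_1 = [120, 135, 150, 110, 140]
lengths_2 = [300, 280, 125, 310, 295]
acc = best_length_threshold_accuracy(lengths_1, lengths_2)
print(acc)
```

An accuracy close to 0.5 on balanced corpora suggests that length carries little information about the language; a value close to 1.0 suggests the corpora should be resampled or truncated before adversarial training.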

So far, there is a lack of datasets that enable cross-lingual training of question answering models. Below are some of the possible sources of QA datasets:

1. Parallel texts. Plenty of parallel texts exist, and they often cover exactly the same topics. Some of them are even aligned at the sentence or word level. These attributes are desirable for training our model. However, these corpora are usually compiled only for translation purposes, and many of them are not suitable for conversion into QA datasets.

2. Existing QA datasets in two or more languages. There might be issues with the topic / vocabulary alignment of datasets, but this is the most convenient approach.

3. Compile a new dataset from web data etc. so that it is compatible with an existing QA dataset.

Here are some possible datasets for use:

1. The Stanford Question Answering Dataset (SQuAD) - https://rajpurkar.github.io/SQuAD-explorer/. High-quality QA dataset based on Wikipedia.

2. WebQA: A Chinese Open-Domain Factoid Question Answering Dataset - http://idl.baidu.com/WebQA.html. Similar in idea to SQuAD, but different in various ways (based on community QA websites, contains possibly useless evidence, uses informal language).

3. OPUS - http://opus.lingfil.uu.se/index.php. Large collection of parallel corpora. However, it is challenging to generate questions from these documents, and many of them are fragmented (only aligned sentences are retained).

4. LDC - (example) https://catalog.ldc.upenn.edu/LDC95T13. High data usage fee.

5. Select one well-known corpus available in multiple languages, such as http://www.umiacs.umd.edu/~resnik/parallel/bible.html. It is relatively easy to find existing Q-A pairs from other sources and put them together as a dataset; e.g. there are trivia questions on http://www.christianity.com/trivia/category/. However, extracting the answers still requires some effort.
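Whichever source is chosen, the data eventually has to be brought into the $(C, Q, A)$ form used in the problem formulation. As a sketch, the function below flattens a dictionary in the SQuAD v1.1 JSON layout (`data` / `paragraphs` / `context` / `qas` / `answers`) into such triples. The function name and the sample record are hypothetical; other datasets, such as WebQA, would need their own converters.

```python
def flatten_squad(squad):
    """Yield (context, question, answer_text) triples from a SQuAD-format dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    yield context, qa["question"], ans["text"]

# A made-up record in the SQuAD v1.1 layout, for illustration only.
sample = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The river flows north to the sea.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Which way does the river flow?",
                            "answers": [{"text": "north", "answer_start": 16}],
                        }
                    ],
                }
            ],
        }
    ]
}

triples = list(flatten_squad(sample))
print(triples[0])
```

For a real file, the dictionary would come from `json.load` on the downloaded dataset. Each answer in this layout is a span of the context located by `answer_start`, which differs from the shared softmax answer space discussed earlier and would have to be reconciled.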