# Notes on "Representation learning of speech data by mutual information maximization"

In this repository, I share the code and notes on "representation learning of speech data by mutual information maximization".

For the theoretical background, see ["Representation Learning with Contrastive Predictive Coding"](https://arxiv.org/abs/1807.03748) and ["Wasserstein Dependency Measure for Representation Learning"](https://arxiv.org/abs/1903.11780).

To summarize, I used the following ideas.
* Maximize a surrogate of the mutual information between two representations.
* This is formulated as predicting the local representation at the future $(t+k)$-th step from the global representation at the $t$-th step.
* The InfoNCE loss, also known as the contrastive predictive coding (CPC) loss, was first proposed by [Aäron van den Oord et al.](https://arxiv.org/abs/1807.03748). [Sherjil Ozair et al.](https://arxiv.org/abs/1903.11780) reformulate it by enforcing 1-Lipschitz continuity on the critic function used to predict future representations; the resulting loss is called the Wasserstein predictive coding (WPC) loss.

Next, I discuss the methodology and the results of this small personal project.

## Methodology

![](predictive_coding.jpg)

As shown in the above figure,
* A set of input speech data is given to the model. The window size of the inputs is 20480 (= 128 x 160).
* Then, the convolutional encoder and the GRU-RNN blocks produce $z_t$ and $c_t$, respectively.
* As described in the paper, the convolutional encoder is composed of five convolutional layers with strides=[5,4,2,2,2], filter_size=[10,8,4,4,4] and hidden dimension=512. Therefore, the down-sampling factor of this encoder is 160 (= 5 x 4 x 2 x 2 x 2), and the encoder produces an output of shape [batch_size, 128, 512], which corresponds to a set of $z_t$.
* The GRU-RNN sequentially reads $z_t$ and produces $c_t$ with hidden dimension=512.
* We can understand $z_t$ as a local vector that contains the local information of the sensory inputs within a window of size 160, and $c_t$ as a vector that contains the global information up to the $t$-th step.
* This model aims to learn both local and global representations (latent vectors) by minimizing the InfoNCE loss. [Aäron van den Oord et al.](https://arxiv.org/abs/1807.03748) propose to optimize the following objective:
$$
\mathcal{L}_{CPC}(c_t, z_{t+k}, \{\tilde{z}_{t+k}\}) = \sup_{f \in \mathcal{F}}  \mathbb{E}_{p(c_{t}, z_{t+k})}[f(c_t, z_{t+k})] - \mathbb{E}_{p(c_{t})p(\tilde{z}_{j,t+k})}[\log \sum_j \exp f(c_t, \tilde{z}_{j,t+k})],
$$
where $f(c_{t}, z_{t+k}) = c_{t}^T W_k z_{t+k}$, $z_{t+k}$ is the positive sample, and $\tilde{z}_{j, t+k}$ are the negative samples, both at step $t+k$. Optimizing this objective amounts to **predicting the positive sample from the mixture of positive and negative samples** by using the global representation $c_t$ at the $t$-th step.
* [Sherjil Ozair et al.](https://arxiv.org/abs/1903.11780) modify the original CPC loss by enforcing 1-Lipschitz continuity on the critic function $f$ (parameterized by $W_k$):
$$
\mathcal{L}_{WPC}(c_t, z_{t+k}, \{\tilde{z}_{t+k}\}) = \sup_{f \in \mathcal{F}_{1-Lipschitz}}  \mathbb{E}_{p(c_{t}, z_{t+k})}[f(c_t, z_{t+k})] - \mathbb{E}_{p(c_{t})p(\tilde{z}_{j,t+k})}[\log \sum_j \exp f(c_t, \tilde{z}_{j,t+k})],
$$
where $\mathcal{F}_{1-Lipschitz}$ is the family of critic functions satisfying 1-Lipschitz continuity. The idea comes from measuring the Wasserstein-1 distance, rather than the KL divergence, between the joint distribution and the product of the marginal distributions of $c_{t}$ and $z_{t+k}$.
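
The shape arithmetic and the objective above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code used in this repository: the function names (`downsampling_factor`, `bilinear_critic`, `infonce_objective`) are hypothetical, the estimate uses a single $(c_t, z_{t+k})$ pair, and I assume that the log-sum-exp runs over the positive together with the negatives, as in the usual InfoNCE estimator. The 1-Lipschitz constraint of the WPC variant is not enforced here.

```python
import math


def downsampling_factor(strides):
    """Product of the strides: input samples consumed per encoder output frame."""
    factor = 1
    for s in strides:
        factor *= s
    return factor


def bilinear_critic(c, W, z):
    """Bilinear critic f(c, z) = c^T W z on plain lists; W has shape [len(c)][len(z)]."""
    return sum(c[i] * W[i][j] * z[j]
               for i in range(len(c))
               for j in range(len(z)))


def infonce_objective(c, W, z_pos, z_negs):
    """Single-pair estimate of f(c, z_pos) - log sum_j exp f(c, z_j).

    Assumption: the candidate set is the positive plus all negatives,
    so the value is the log-softmax score of the positive (always <= 0).
    """
    scores = [bilinear_critic(c, W, z) for z in [z_pos] + z_negs]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[0] - log_sum_exp


# Shape check for the encoder described above.
factor = downsampling_factor([5, 4, 2, 2, 2])  # 160
num_frames = 20480 // factor                   # 128

# Toy example with 2-dimensional vectors and an identity critic matrix.
value = infonce_objective(
    c=[1.0, 0.0],
    W=[[1.0, 0.0], [0.0, 1.0]],
    z_pos=[1.0, 0.0],
    z_negs=[[0.0, 1.0], [-1.0, 0.0]],
)
```

With an all-zero `W`, every candidate gets the same score and the value equals $-\log N$ for $N$ candidates, which is a handy sanity check. Training would maximize this value (equivalently, minimize its negative) with respect to the encoder, the GRU-RNN and $W_k$.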

## Experiments
1. Representation learning of speech data
![](loss_plot.png)

First, I pre-trained the encoder and the GRU-RNN by minimizing the CPC or WPC loss.
The figure above shows how the CPC and WPC losses change as training progresses over epochs.
In this plot, the WPC loss is slightly lower than the CPC loss.

2. Semi-supervised learning with learned representations
![](supervised.jpg)

Next, I evaluated the quality of the representations learned by minimizing the CPC/WPC losses.
To do so, I summarized $\{c_t\}_{t=1}^{128}$ by global average pooling and trained a linear classifier to predict the speaker.
There are 251 speaker classes in total.
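
The pooling and linear read-out described above can be sketched as follows. This is only an illustration in plain Python, not the code used for the experiments: the function names and the toy weights are hypothetical, and the tiny dimensions stand in for the real ones (128 steps, 512 features, 251 classes).

```python
def global_average_pool(contexts):
    """Average a sequence of context vectors of shape [T][D] over time -> [D]."""
    num_steps = len(contexts)
    dim = len(contexts[0])
    return [sum(c[d] for c in contexts) / num_steps for d in range(dim)]


def linear_logits(x, weight, bias):
    """Logits of a linear classifier; weight has shape [num_classes][D]."""
    return [sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
            for w, b in zip(weight, bias)]


def predict_speaker(contexts, weight, bias):
    """Pool the contexts over time and return the index of the largest logit."""
    logits = linear_logits(global_average_pool(contexts), weight, bias)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy example: T=2 steps, D=2 features, 3 hypothetical speaker classes.
toy_contexts = [[1.0, 0.0], [3.0, 2.0]]
toy_weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_bias = [0.0, 0.0, 0.0]

pooled = global_average_pool(toy_contexts)
speaker = predict_speaker(toy_contexts, toy_weight, toy_bias)
```

In the actual experiment the classifier weights are learned from the speaker labels; here they are fixed by hand only to show the data flow.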

![](accuracy_plot.png)
As shown in the above figure, the pre-trained models greatly outperform the purely supervised model. Also, the representations learned by minimizing the WPC loss perform better than those learned by minimizing the CPC loss on this prediction task.

## Future Direction

Learning representations by mutual information maximization, formulated here as predicting future contexts from past contexts, performs well on speaker classification with LibriSpeech data.

I expect that representation learning by predictive coding and mutual information maximization will also be powerful in other domains, such as images, medical records, molecular graphs and financial time series.