
In this case study, we'll train a multilingual sentiment classifier that uses the [multilingual Universal Sentence Encoder (USEm)](https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3) for feature generation, followed by two hidden linear layers with dropout and a linear output layer, trained with a cross-entropy loss.

The multilingual Universal Sentence Encoder is a Transformer encoder trained so that texts with similar meanings receive similar encodings, even when they are written in different languages.
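
To make "similar encoding" concrete: closeness between sentence vectors is commonly measured with cosine similarity. The sketch below uses made-up 4-dimensional vectors as stand-ins for real 512-dimensional embeddings; the values and the helper name `cosine_similarity` are illustrative assumptions, not output of the encoder.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings (invented values).
english = np.array([0.9, 0.1, 0.0, 0.4])
german = np.array([0.8, 0.2, 0.1, 0.5])       # hypothetical translation of the English sentence
unrelated = np.array([-0.3, 0.9, 0.2, -0.1])  # hypothetical sentence with a different meaning

print(cosine_similarity(english, german))     # close to 1
print(cosine_similarity(english, unrelated))  # much lower
```

A well-trained multilingual encoder should place a sentence and its translation close together in this sense, which is what lets a classifier trained on English reviews transfer to other languages.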

# Environment setup

In [None]:
!pip install tensorflow_hub "tensorflow_text>=2.0.0rc0" pytorch_lightning==1.4.7 datasets==1.12.1

In [None]:
!nvidia-smi

Sat Oct 23 14:52:38 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text

from torch.utils.data import DataLoader
from datasets import Dataset, load_dataset, load_metric
import numpy as np

from typing import List, Dict

In [None]:
pl.seed_everything(445326, workers=True)

Global seed set to 445326


445326
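
Seeding matters here because dropout masks, weight initialization, and data shuffling all draw from random number generators. As a rough illustration of the idea (not Lightning's implementation), reseeding a generator replays the same sequence of draws. The helper name `draw` is made up for this sketch.

```python
import random
import numpy as np

def draw(seed: int):
    """Reseed Python's and NumPy's global generators, then draw a few numbers."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3).tolist()

first = draw(445326)
second = draw(445326)
different = draw(1)

print(first == second)     # same seed, same draws
print(first == different)  # different seed, different draws
```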

# Sentence Embeddings

In [None]:
model_URL = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3"
encoder = hub.load(model_URL)

We want the vectors as NumPy arrays rather than TensorFlow tensors, because they will be consumed by PyTorch.

In [None]:
def embed_text(text: List[str]) -> List[np.ndarray]:
    vectors = encoder(text)
    return [vector.numpy() for vector in vectors]
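
A list of NumPy vectors still has to become a single tensor before it reaches a PyTorch layer. One way to do that is sketched below with random stand-in vectors (the real ones would come from `embed_text`). Stacking first and then calling `torch.from_numpy` avoids building the tensor element by element.

```python
import numpy as np
import torch

# Stand-ins for three 512-dimensional sentence vectors (random values, not real embeddings).
rng = np.random.default_rng(0)
vectors = [rng.standard_normal(512).astype(np.float32) for _ in range(3)]

# Stack into one (batch, dim) array, then wrap that array as a tensor without another copy.
batch = torch.from_numpy(np.stack(vectors))

print(batch.shape)  # torch.Size([3, 512])
print(batch.dtype)  # torch.float32
```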

# Data

In [None]:
class YelpDataModule(pl.LightningDataModule):
    def __init__(self, 
                 batch_size: int = 32, 
                 num_workers: int = 2):
        super().__init__()
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.pin_memory = torch.cuda.is_available()        

    def prepare_data(self):
        self.test_ds = load_dataset('yelp_polarity', split="test[:2%]")
        self.train_ds = load_dataset('yelp_polarity', split="train[:2%]")
        self.val_ds = load_dataset('yelp_polarity', split="train[99%:]")
        
        self.label_names = self.train_ds.unique("label")
        label2int = {str(label): n for n, label in enumerate(self.label_names)}
        self.encoder = encoder_factory(label2int)
        
    def setup(self, stage=None):
        # Compute embeddings in batches, so that they fit in the GPU's RAM.
        self.train = self.train_ds.map(self.encoder, batched=True, batch_size=self.batch_size)
        self.train.set_format(type="torch", columns=["embedding", "label"], 
                              output_all_columns=True)

        self.val = self.val_ds.map(self.encoder, batched=True, batch_size=self.batch_size)
        self.val.set_format(type="torch", columns=["embedding", "label"], 
                            output_all_columns=True)

        self.test = self.test_ds.map(self.encoder, batched=True, batch_size=self.batch_size)
        self.test.set_format(type="torch", columns=["embedding", "label"], 
                             output_all_columns=True)

    def train_dataloader(self):
        return DataLoader(self.train,
                          batch_size=self.batch_size,
                          num_workers=self.num_workers,
                          pin_memory=self.pin_memory,
                          shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val,
                          batch_size=self.batch_size,
                          num_workers=self.num_workers,
                          pin_memory=self.pin_memory)

    def test_dataloader(self):
        return DataLoader(self.test, 
                          batch_size=self.batch_size,
                          num_workers=self.num_workers)        


def encoder_factory(label2int: Dict[str, int]):
    def encode(batch):        
        batch["embedding"] = embed_text(batch["text"])
        batch["label"] = [label2int[str(x)] for x in batch["label"]]        
        return batch
        
    return encode
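
`Dataset.map` with `batched=True` hands the function a dict of lists (one list per column) and expects a dict of lists back. The sketch below mimics that contract on a plain dict, with a hypothetical `fake_embed` standing in for `embed_text` so that it runs without downloading the encoder.

```python
from typing import Dict, List

def fake_embed(texts: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for embed_text: one tiny "vector" per input string.
    return [[float(len(t)), float(t.count(" "))] for t in texts]

def encoder_factory(label2int: Dict[str, int]):
    # The returned closure remembers label2int, so map() only needs to pass the batch.
    def encode(batch):
        batch["embedding"] = fake_embed(batch["text"])
        batch["label"] = [label2int[str(x)] for x in batch["label"]]
        return batch
    return encode

# A batch shaped the way a batched map presents it: column name -> list of values.
batch = {"text": ["great food", "never again"], "label": [1, 0]}
encode = encoder_factory({"0": 0, "1": 1})
encoded = encode(batch)

print(encoded["embedding"])  # [[10.0, 1.0], [11.0, 1.0]]
print(encoded["label"])      # [1, 0]
```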

In [None]:
data = YelpDataModule()
data.prepare_data()

Downloading:   0%|          | 0.00/2.37k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.66k [00:00<?, ?B/s]

Downloading and preparing dataset yelp_polarity/plain_text (download: 158.67 MiB, generated: 421.07 MiB, post-processed: Unknown size, total: 579.73 MiB) to /root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/a770787b2526bdcbfc29ac2d9beb8e820fbc15a03afd3ebc4fb9d8529de57544...


Downloading:   0%|          | 0.00/166M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset yelp_polarity downloaded and prepared to /root/.cache/huggingface/datasets/yelp_polarity/plain_text/1.0.0/a770787b2526bdcbfc29ac2d9beb8e820fbc15a03afd3ebc4fb9d8529de57544. Subsequent calls will reuse this data.




In [None]:
data.setup()
print(len(data.train))
print(len(data.val))
print(len(data.test))



  0%|          | 0/350 [00:00<?, ?ba/s]

  0%|          | 0/175 [00:00<?, ?ba/s]

  0%|          | 0/24 [00:00<?, ?ba/s]

11200
5600
760
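
These counts can be checked by hand. Assuming the published `yelp_polarity` split sizes of 560,000 training and 38,000 test reviews, the percentage slices and the default batch size of 32 give the dataset lengths printed above and the batch counts shown in the progress bars.

```python
import math

TRAIN_SIZE, TEST_SIZE = 560_000, 38_000  # assumed full split sizes for yelp_polarity
BATCH_SIZE = 32

splits = {
    "train": TRAIN_SIZE * 2 // 100,  # train[:2%]
    "val": TRAIN_SIZE * 1 // 100,    # train[99%:]
    "test": TEST_SIZE * 2 // 100,    # test[:2%]
}

# Print each split's size and the number of batches needed to embed it.
for name, n in splits.items():
    print(name, n, math.ceil(n / BATCH_SIZE))
# train 11200 350
# val 5600 175
# test 760 24
```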


# Model

## Multilingual binary classifier

In [None]:
class Model(pl.LightningModule):
    def __init__(self, 
                 hidden_dims: List[int] = [768, 128], 
                 dropout_prob: float = 0.5,
                 learning_rate: float = 1e-3):        
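        super().__init__()
        # Store the constructor arguments in the checkpoint so that
        # load_from_checkpoint can rebuild the same architecture.
        self.save_hyperparameters()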
        self.train_acc = load_metric("accuracy")
        self.val_acc = load_metric("accuracy")
        self.test_acc = load_metric("accuracy")
        self.hidden_dims = hidden_dims
        self.dropout_prob = dropout_prob
        self.learning_rate = learning_rate

        self.embedding_dim = 512

        layers = []
        prev_dim = self.embedding_dim
        
        if dropout_prob > 0:
            layers.append(nn.Dropout(dropout_prob))

        for h in hidden_dims:
            layers.append(nn.Linear(prev_dim, h))
            prev_dim = h
            if dropout_prob > 0:
                layers.append(nn.Dropout(dropout_prob))
            layers.append(nn.ReLU())
            if dropout_prob > 0:
                layers.append(nn.Dropout(dropout_prob))
        # output layer
        layers.append(nn.Linear(prev_dim, 2))

        self.layers = nn.Sequential(*layers)                

    def forward(self, x):
        # x will be a batch of USEm vectors
        logits = self.layers(x)
        return logits

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer

    def __compute_loss(self, batch):        
        x, y = batch["embedding"], batch["label"]
        logits = self(x)
        preds = torch.argmax(logits, dim=1).detach().cpu().numpy()
        loss = F.cross_entropy(logits, y)
        return loss, preds, y

    def training_step(self, batch, batch_idx):
        loss, preds, y = self.__compute_loss(batch)
        self.train_acc.add_batch(predictions=preds, references=y)
        acc = self.train_acc.compute()["accuracy"]
        values = {"train_loss": loss, "train_accuracy": acc}
        self.log_dict(values, on_step=True, on_epoch=True, 
                      prog_bar=True, logger=True)
        return loss

    def validation_step(self, batch, batch_idx):
        loss, preds, y = self.__compute_loss(batch)
        self.val_acc.add_batch(predictions=preds, references=y)
        acc = self.val_acc.compute()["accuracy"]
        values = {"val_loss": loss, "val_accuracy": acc}
        self.log_dict(values, on_step=True, on_epoch=True, 
                      prog_bar=True, logger=True)
        return loss

    def test_step(self, batch, batch_idx):
        loss, preds, y = self.__compute_loss(batch)
        self.test_acc.add_batch(predictions=preds, references=y)
        acc = self.test_acc.compute()["accuracy"]
        values = {"test_loss": loss, "test_accuracy": acc}
        self.log_dict(values, on_step=False, on_epoch=True, 
                      prog_bar=True, logger=True)
        return loss
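
With the default `hidden_dims`, the constructor builds three linear layers of sizes 512 to 768, 768 to 128, and 128 to 2. Dropout and ReLU have no weights, so the parameter count is just the sum of each linear layer's weights and biases. A quick check follows, which lines up with the 492 K parameters and 1.971 MB reported in the model summary during training. The helper name `count_linear_params` is made up for this sketch.

```python
def count_linear_params(dims):
    """Weights (in * out) plus biases (out) for each consecutive pair of layer sizes."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

dims = [512, 768, 128, 2]  # embedding size, two hidden layers, two output classes
total = count_linear_params(dims)

print(total)            # 492674
print(total * 4 / 1e6)  # 1.970696 (MB, at 4 bytes per float32 parameter)
```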


## Train

In [None]:
model = Model()

Downloading:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

In [None]:
MAX_EPOCHS = 5

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    monitor="val_loss",
    dirpath="model",
    filename="yelp-sentiment-multilingual-{epoch:02d}-{val_loss:.3f}",
    save_top_k=3,
    mode="min")

trainer = pl.Trainer(gpus=1, max_epochs=MAX_EPOCHS, 
                     callbacks=[checkpoint_callback])

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
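
With `monitor="val_loss"`, `mode="min"`, and `save_top_k=3`, the callback keeps the three checkpoints with the lowest validation loss. The selection rule can be sketched in plain Python. The per-epoch losses below are invented for illustration, and this is not how Lightning implements it internally.

```python
import heapq

# Hypothetical validation losses per epoch (made-up numbers, not results from this run).
val_loss_by_epoch = {0: 0.41, 1: 0.36, 2: 0.34, 3: 0.35, 4: 0.38}

# mode="min" with save_top_k=3: keep the three epochs with the smallest monitored value.
kept = heapq.nsmallest(3, val_loss_by_epoch.items(), key=lambda item: item[1])
best_epoch, best_loss = kept[0]

print(kept)        # [(2, 0.34), (3, 0.35), (1, 0.36)]
print(best_epoch)  # 2
```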


In [None]:
trainer.fit(model, data.train_dataloader(), data.val_dataloader())

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name   | Type       | Params
--------------------------------------
0 | layers | Sequential | 492 K 
--------------------------------------
492 K     Trainable params
0         Non-trainable params
492 K     Total params
1.971     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 445326


Training: -1it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

## Test

In [None]:
trainer.test(dataloaders=data.test_dataloader())

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_accuracy': 0.8631578683853149, 'test_loss': 0.33023467659950256}
--------------------------------------------------------------------------------


[{'test_accuracy': 0.8631578683853149, 'test_loss': 0.33023467659950256}]

# Inference

In [None]:
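best_model = Model.load_from_checkpoint(checkpoint_callback.best_model_path)
# The loaded module starts in training mode; switch to eval mode so dropout is disabled.
best_model.eval()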
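
One thing to keep in mind when loading a model for inference: `nn.Dropout` is only a no-op in evaluation mode. In training mode it zeroes activations at random and rescales the rest, so predictions can vary from call to call. The sketch below shows the difference on a small tensor.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()  # training mode: random elements are zeroed, survivors scaled by 1 / (1 - p)
train_out = dropout(x)

dropout.eval()   # evaluation mode: the input passes through unchanged
eval_out = dropout(x)

print(train_out)  # each entry is either 0.0 or 2.0
print(eval_out)   # all ones
```

Calling `.eval()` on a loaded module, and wrapping inference in `torch.no_grad()`, gives deterministic predictions and skips gradient bookkeeping.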

In [None]:
def predict(text: List[str]):
    # Stack the per-sentence vectors into a single (batch, dim) tensor.
    embeddings = torch.from_numpy(np.stack(embed_text(text)))
    # No gradients are needed at inference time.
    with torch.no_grad():
        logits = best_model(embeddings)
    preds = torch.argmax(logits, dim=1).cpu().numpy()
    scores = torch.softmax(logits, dim=1).cpu().numpy()

    results = []
    for t, best_index, score_pair in zip(text, preds, scores):
        results.append({
            "text": t,
            "label": "positive" if best_index == 1 else "negative",
            "score": score_pair[best_index]
        })
    return results

In [None]:
predict(["I love that restaurant!", "I hate italian food."])

[{'label': 'positive', 'score': 0.99893814, 'text': 'I love that restaurant!'},
 {'label': 'negative', 'score': 0.8079468, 'text': 'I hate italian food.'}]
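
The `score` field is the softmax probability of the predicted class. For a pair of logits, the computation in `predict` amounts to the following NumPy sketch. The logit values are made up for illustration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for two inputs: columns are (negative, positive).
logits = np.array([[-3.0, 3.5], [1.2, -0.3]])

probs = softmax(logits)
preds = probs.argmax(axis=-1)
labels = ["positive" if i == 1 else "negative" for i in preds]
scores = probs[np.arange(len(preds)), preds]  # probability of the winning class per row

print(labels)  # ['positive', 'negative']
print(scores)
```

Because there are only two classes, the score is always at least 0.5, and a score near 0.5 signals that the classifier is close to undecided.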

## Inference on non-English text

Since we used USEm embeddings, we should be able to predict sentiment for non-English languages. Let's try it out. [USEm supports 16 languages](https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3):

Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian

In [None]:
from pprint import PrettyPrinter
pp = PrettyPrinter()

Compare predictions for English and German.

In [None]:
english_text = "Our server was horrid. He messed up the order and didn't even apologize when he spilled wine on my sister's hair!"
german_translation = "Unser Server war schrecklich. Er hat die Bestellung durcheinander gebracht und sich nicht einmal entschuldigt, als er Wein in die Haare meiner Schwester verschüttet hat!"

pp.pprint(predict([english_text, german_translation]))


[{'label': 'negative',
  'score': 0.9564845,
  'text': "Our server was horrid. He messed up the order and didn't even "
          "apologize when he spilled wine on my sister's hair!"},
 {'label': 'negative',
  'score': 0.9694613,
  'text': 'Unser Server war schrecklich. Er hat die Bestellung durcheinander '
          'gebracht und sich nicht einmal entschuldigt, als er Wein in die '
          'Haare meiner Schwester verschüttet hat!'}]


Compare predictions for English and Italian. For kicks, let's also see how it performs on a European language that USEm does not support, Finnish.

In [None]:
english_text = "My least favorite film is Showgirls. I hate it so much. In fact, it's so bad that it makes me angry."
italian_translation = "Il mio film meno preferito è Showgirls. Lo odio così tanto. In effetti, è così brutto che mi fa arrabbiare."
finnish_translation = "Minun lempi elokuva on Showgirls. Vihaan sitä niin paljon. Itse asiassa se on niin paha, että se saa minut vihaiseksi."
pp.pprint(predict([english_text, italian_translation, finnish_translation]))

[{'label': 'negative',
  'score': 0.98994666,
  'text': 'My least favorite film is Showgirls. I hate it so much. In fact, '
          "it's so bad that it makes me angry."},
 {'label': 'negative',
  'score': 0.974451,
  'text': 'Il mio film meno preferito è Showgirls. Lo odio così tanto. In '
          'effetti, è così brutto che mi fa arrabbiare.'},
 {'label': 'negative',
  'score': 0.7616636,
  'text': 'Minun lempi elokuva on Showgirls. Vihaan sitä niin paljon. Itse '
          'asiassa se on niin paha, että se saa minut vihaiseksi.'}]


In this example, the classifier gets the Finnish review right too, though with a noticeably lower score (0.76) than for English or Italian. But why? Without digging deeper, it is difficult to know for sure. Our guess is that the subword units used in USEm's tokenization let the Transformer learn, during training, which subword units are shared across languages. The layers we added on top of USEm, which are trained for classification, let the model learn which of those units are associated with positive or negative sentiment. Presumably the subword units used in Finnish are close enough to those in one of the 16 languages that USEm supports.