# TD 6

[Use PyTorch for all questions]

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import glob
import time
from torch.utils.data import DataLoader, Dataset


## RNN: Determining Last Name Origins

The goal here is to build our first (basic) RNN.

We have a dataset composed of 18,000 names from 18 nationalities (1,000 names from each country).
We try to build a network that classifies each name into its correct nationality.
We do that with an RNN that "reads" each letter one by one.

Link to the dataset `name_1000`, containing 1,000 names from 18 nationalities:
https://drive.google.com/drive/folders/1qqyB_ZRMsz_7veqlKYnJH2kmYK6myV4Y?usp=share_link

Start by downloading it and storing it in your working directory.

As we haven't dealt with unbalanced datasets yet, all 18 nationalities are represented by the same number of names. This is not the case in real life, and it wasn't the case in the original data. We will ignore this for now, but keep in mind that because the original data contained fewer names for some nationalities, some names appear several times (see `Vietnamese.txt`). In theory this is a problem: when we split the dataset into `train` and `test`, some names in the `test` set will already be in the `train` set. The point, though, is not to commercialize this model but to learn how to build an RNN (question: can you remind us what problem RNNs try to solve?). So we will ignore this.

### Pre-processing

Some countries use a non-Latin alphabet, so we need an ASCII version of each name.
You can try the function `unidecode` from the `unidecode` module.

Create a function that takes a name and returns its version using only letters from
`LETTERS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'` (you can add some letters if you want, but the more you add, the more complex your network will become).

Test your function on `'Ślusàrski'`, `'François'`, `'北亰'`, `'Kožušček'`, and `'+-*/'`.
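One possible sketch of such a function is given here. It deliberately swaps in the standard-library `unicodedata` module instead of `unidecode`, so it only strips accents and does not transliterate non-Latin scripts (`'北亰'` becomes an empty string, whereas `unidecode` would attempt a romanization). The name `to_ascii` is a hypothetical choice, not one imposed by the exercise.

```python
import unicodedata

LETTERS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'

def to_ascii(name: str) -> str:
    """Strip accents, then drop every character that is not in LETTERS."""
    # NFD splits an accented letter into its base letter plus combining marks;
    # the combining marks are not in LETTERS, so the filter below removes them.
    decomposed = unicodedata.normalize('NFD', name)
    return ''.join(c for c in decomposed if c in LETTERS)

for sample in ['Ślusàrski', 'François', '北亰', 'Kožušček', '+-*/']:
    print(repr(sample), '->', repr(to_ascii(sample)))
```

Letters that have no canonical decomposition (for example `ł`) are dropped by this approach, which is one reason to prefer `unidecode` when it is available.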

### Feeding letters to a network

A network cannot, by itself, process characters/letters; networks can only understand numbers and lists of numbers.
We need to turn our characters into vectors. We could just take the binary byte representing the character in ASCII. However, this would be very hard for the network to understand (`x: 1111000`, `y: 1111001` and `z: 1111010` differ by only one or two bits, so they will have very similar activations).
Thus, we will use 'one-hot encoding' of our set of letters `LETTERS`. That is, we transform each letter into a tensor of size `<1 x N_LETTERS>`, where all entries are zero except the one corresponding to the position of the letter, which we set to one.
e.g.:
- `a => [1, 0, 0, 0, ..., 0]`
- `b => [0, 1, 0, 0, ..., 0]`
- `c => [0, 0, 1, 0, ..., 0]`

Define a `letterToTensor` function that performs this operation.

Now define a `nameToTensor` function that performs this operation for each letter in the name (resulting in a `<name_length x 1 x N_LETTERS>` tensor).
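A minimal sketch of both functions, assuming every letter of the name is already in `LETTERS` (that is, the name went through the pre-processing step first):

```python
import torch

LETTERS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'
N_LETTERS = len(LETTERS)

def letterToTensor(letter: str) -> torch.Tensor:
    """Return a <1 x N_LETTERS> one-hot tensor for a single letter."""
    tensor = torch.zeros(1, N_LETTERS)
    # LETTERS.index raises ValueError if the letter is not in LETTERS.
    tensor[0, LETTERS.index(letter)] = 1.0
    return tensor

def nameToTensor(name: str) -> torch.Tensor:
    """Return a <name_length x 1 x N_LETTERS> tensor, one one-hot row per letter."""
    return torch.stack([letterToTensor(letter) for letter in name])

print(letterToTensor('b'))
print(nameToTensor('Abel').shape)  # torch.Size([4, 1, 56])
```

Note that `torch.stack` fails on an empty list, so an empty name would need to be filtered out beforehand.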

### Loading data

Create a custom dataset:
- in the `__init__`, read all files, and create a list of names and associated country
- add a `countryID` method that turns a country into its index
- add a `countryTensor` method that turns a country to a one-hot encoded tensor
- the `__getitem__` should return one piece of data in the form `(name, country, nameTensor, countryID)`

Create a dataloader for the dataset; the `batch_size` must be 1, since different names can have different lengths (and therefore different tensor sizes).
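The dataset and dataloader could be sketched as follows. Several things here are assumptions rather than facts about `name_1000`: that the folder holds one `<Country>.txt` file per nationality with one name per line, that names are already restricted to `LETTERS`, and the class name `NamesDataset`. A tiny temporary folder stands in for the real data so that the example runs on its own.

```python
import glob
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset

LETTERS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'

def nameToTensor(name):
    """<name_length x 1 x N_LETTERS> one-hot tensor (letters assumed to be in LETTERS)."""
    tensor = torch.zeros(len(name), 1, len(LETTERS))
    for i, letter in enumerate(name):
        tensor[i, 0, LETTERS.index(letter)] = 1.0
    return tensor

class NamesDataset(Dataset):
    def __init__(self, data_dir):
        # Assumption: one file per country, named <Country>.txt, one name per line.
        self.idx_to_country = []
        self.samples = []  # list of (name, country) pairs
        for path in sorted(glob.glob(os.path.join(data_dir, '*.txt'))):
            country = os.path.splitext(os.path.basename(path))[0]
            self.idx_to_country.append(country)
            with open(path, encoding='utf-8') as f:
                for line in f:
                    name = line.strip()
                    if name:
                        self.samples.append((name, country))

    def countryID(self, country):
        return self.idx_to_country.index(country)

    def countryTensor(self, country):
        tensor = torch.zeros(1, len(self.idx_to_country))
        tensor[0, self.countryID(country)] = 1.0
        return tensor

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name, country = self.samples[idx]
        return name, country, nameToTensor(name), self.countryID(country)

# Tiny stand-in for the real folder (hypothetical names and countries).
demo_dir = tempfile.mkdtemp()
for country, names in {'French': ['Martin', 'Dubois'], 'Italian': ['Rossi']}.items():
    with open(os.path.join(demo_dir, country + '.txt'), 'w', encoding='utf-8') as f:
        f.write('\n'.join(names))

dataset = NamesDataset(demo_dir)
loader = DataLoader(dataset, batch_size=1, shuffle=True)
name, country, name_tensor, country_id = next(iter(loader))
print(name[0], country[0], name_tensor.shape, country_id)
```

The default collation adds a batch dimension: `name` and `country` come back as length-1 sequences of strings, `name_tensor` has shape `<1 x name_length x 1 x N_LETTERS>`, and `country_id` is a tensor holding one index. Calling `name_tensor.squeeze(0)` recovers the `<name_length x 1 x N_LETTERS>` shape.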

### Build the RNN

Define the RNN with the following parameters:
- `input_size`: number of input features
- `hidden_size`: number of hidden units
- `output_size`: number of output features
- `idx_to_country`: list of countries

The input is a one-hot vector of size `N_LETTERS`; the output is a vector of size `N_COUNTRIES = len(idx_to_country)` with one score per country, plus a hidden state of size `hidden_size`.
You can build the architecture you like, but one that is known to work is the [following](https://i.imgur.com/AJHiuhO.png).

That is, a single dense layer takes as input the concatenation of the hidden state and the current letter, and outputs both the new hidden state and a vector of likelihoods, one per country. Adding a softmax layer to the country likelihoods turns them into actual probabilities.

On top of the `__init__` and `forward` methods, define:
- `init_hidden`, a method that creates a zero hidden state (used as the hidden state when sending the first letter)
- `outputToCountry` to convert output probabilities to the corresponding country
- `outputToID` to convert output probabilities to the corresponding country ID

Build a network with 128 hidden units.

### Feeding the RNN

Feed a single letter to the network (i.e. 1 step)

Feed a full word to the network (i.e. multiple steps)
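The architecture described in "Build the RNN" and the two feeding steps above can be sketched as follows. This is one reading of the description, not necessarily what the linked diagram shows in every detail: two linear layers (`i2h`, `i2o`, both hypothetical names) read the concatenated letter and hidden state, and a `LogSoftmax` is used so that the output can be passed to `NLLLoss`. The three-country list is made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LETTERS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, idx_to_country):
        super().__init__()
        self.hidden_size = hidden_size
        self.idx_to_country = idx_to_country
        # Both layers read the concatenation [letter, hidden].
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        # Log-probabilities, which is what NLLLoss expects as input.
        self.log_softmax = nn.LogSoftmax(dim=1)

    def forward(self, letter, hidden):
        combined = torch.cat((letter, hidden), dim=1)
        new_hidden = self.i2h(combined)
        output = self.log_softmax(self.i2o(combined))
        return output, new_hidden

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

    def outputToID(self, output):
        return int(output.argmax(dim=1).item())

    def outputToCountry(self, output):
        return self.idx_to_country[self.outputToID(output)]

# Hypothetical 3-country list, just to make the example runnable.
countries = ['French', 'Italian', 'Vietnamese']
rnn = RNN(len(LETTERS), 128, len(countries), countries)

# One-hot encode 'Rossi' as a <5 x 1 x N_LETTERS> tensor.
indices = torch.tensor([LETTERS.index(c) for c in 'Rossi'])
name_tensor = F.one_hot(indices, num_classes=len(LETTERS)).float().unsqueeze(1)

# 1 step: feed only the first letter, starting from the zero hidden state.
output, hidden = rnn(name_tensor[0], rnn.init_hidden())
print(output.shape, hidden.shape)  # torch.Size([1, 3]) torch.Size([1, 128])

# Multiple steps: feed the letters one by one, carrying the hidden state along.
hidden = rnn.init_hidden()
for letter in name_tensor:
    output, hidden = rnn(letter, hidden)
print(rnn.outputToID(output), rnn.outputToCountry(output))  # untrained, so arbitrary
```

Only the output produced after the last letter is used as the prediction; the intermediate outputs are discarded.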

### Training the network

Train the network; for each iteration of the training:
- Create a zero initial hidden state
- Feed each letter in and keep the hidden state for the next letter
- Compute the loss
- Back-propagate
- Update the weights (optimizer step)
- Zero out the gradients

One configuration that is known to work (for the architecture described above):
- Optimizer: Adam
- Learning rate: `lr = 0.001`
- Loss: Negative log likelihood (`NLLLoss`)
- Epochs: No need for too many epochs (~5-10 is enough)

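NB: `NLLLoss` expects log-probabilities, so the final layer should be a `LogSoftmax` rather than a plain softmax. If the network outputs raw scores (no softmax), use `CrossEntropyLoss` instead of `NLLLoss`.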

*(Training takes ~3min on a modern laptop.)*
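One training iteration can be sketched as follows. `train_step` and `TinyRNN` are hypothetical names; `TinyRNN` is only a compact stand-in so that the example runs on its own, and the two fake "names" replace the real dataloader. The optimizer, learning rate and loss follow the configuration listed above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class TinyRNN(nn.Module):
    """Compact stand-in: concatenate letter and hidden state, then two linear layers."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)

    def forward(self, letter, hidden):
        combined = torch.cat((letter, hidden), dim=1)
        return torch.log_softmax(self.i2o(combined), dim=1), self.i2h(combined)

    def init_hidden(self):
        return torch.zeros(1, self.hidden_size)

def train_step(rnn, optimizer, criterion, name_tensor, country_id):
    """Run one training iteration on one name and return the loss as a float."""
    hidden = rnn.init_hidden()            # zero initial hidden state
    for letter in name_tensor:            # feed letters one by one
        output, hidden = rnn(letter, hidden)
    loss = criterion(output, country_id)  # loss on the output after the last letter
    loss.backward()                       # back-propagate
    optimizer.step()                      # update the weights
    optimizer.zero_grad()                 # zero out the gradients for the next iteration
    return loss.item()

torch.manual_seed(0)
N_LETTERS, N_COUNTRIES = 56, 2

def fake_name(letter_index, length=3):
    """Synthetic <length x 1 x N_LETTERS> name that repeats a single letter."""
    tensor = torch.zeros(length, 1, N_LETTERS)
    tensor[:, 0, letter_index] = 1.0
    return tensor

data = [(fake_name(0), torch.tensor([0])), (fake_name(1), torch.tensor([1]))]

rnn = TinyRNN(N_LETTERS, 128, N_COUNTRIES)
optimizer = optim.Adam(rnn.parameters(), lr=0.001)
criterion = nn.NLLLoss()

for epoch in range(5):
    total = sum(train_step(rnn, optimizer, criterion, x, y) for x, y in data)
    print(f'epoch {epoch}: mean loss {total / len(data):.4f}')
```

With a real dataloader using `batch_size=1`, the name tensor carries an extra leading batch dimension, so it would need a `squeeze(0)` before being passed to `train_step`.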

### Testing the network

**Testing should be done on a different dataset than training!**
Here we use the same dataset for simplicity.

Test on a couple of names from the dataset, display the name, the prediction, and the ground truth.

Test on the full dataset:
- Plot the confusion matrix
- Compute the accuracy
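The two metrics can be computed from the lists of true and predicted country IDs. The sketch below uses made-up IDs for three classes instead of real model predictions; `confusion_matrix` and `accuracy` are hypothetical helper names.

```python
import torch

def confusion_matrix(true_ids, predicted_ids, n_classes):
    """confusion[i, j] = number of names of country i predicted as country j."""
    confusion = torch.zeros(n_classes, n_classes, dtype=torch.long)
    for t, p in zip(true_ids, predicted_ids):
        confusion[t, p] += 1
    return confusion

def accuracy(confusion):
    """Fraction of names on the diagonal, i.e. correctly classified."""
    return confusion.diag().sum().item() / confusion.sum().item()

# Made-up example: 6 names, 3 countries.
true_ids = [0, 0, 1, 1, 2, 2]
predicted_ids = [0, 1, 1, 1, 2, 0]

confusion = confusion_matrix(true_ids, predicted_ids, n_classes=3)
print(confusion)
print(f'accuracy: {accuracy(confusion):.2f}')  # accuracy: 0.67
```

To plot the matrix, `plt.imshow(confusion)` followed by `plt.colorbar()` is usually enough; dividing each row by its sum first makes the rows comparable when classes have different sizes.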

### Conclusion

Accuracy on the dataset is ~60%; this is much better than a random guess (which would have an accuracy of 1/18 ≈ 5.6%).

You can run your own experiments to improve this result (add layers, test other layers such as LSTM or GRU, try some combinations).

You can also change the dataset (e.g. word -> language; name -> gender; title -> newspaper; etc.).

This RNN exercise was inspired by a notebook from [the official PyTorch documentation](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html).