# Classifying names with a character-level RNN
In this notebook we will use a recurrent neural network to predict the language a surname originates from. Given a surname, the network outputs a probability distribution over 18 possible languages, indicating how likely the name is to originate from each of them.

This exercise was taken from the [PyTorch website](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html).

### Download the dataset
The dataset that is used can be downloaded [here](https://download.pytorch.org/tutorial/data.zip). Extract it to the directory where this notebook is located.
Included in the ``data/names`` directory are 18 text files named as
"[Language].txt". Each file contains a bunch of names, one name per
line, mostly romanized (but we still need to convert from Unicode to
ASCII).

If you are running this notebook on Colab you can access the dataset by storing it on your Drive.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


**Change the following path variable such that it points to the location of the dataset**

In [0]:
import os

path_to_data = './gdrive/My Drive/DL_1920/codes/3_tutorial/labs_RNN/lab01_rnn/'  # TODO -- set this to the right location!

os.listdir(path_to_data + 'data/names/')  # path_to_data already ends with a slash

['Arabic.txt',
 'Irish.txt',
 'Chinese.txt',
 'Italian.txt',
 'Dutch.txt',
 'English.txt',
 'Czech.txt',
 'Greek.txt',
 'German.txt',
 'French.txt',
 'Japanese.txt',
 'Russian.txt',
 'Polish.txt',
 'Spanish.txt',
 'Scottish.txt',
 'Portuguese.txt',
 'Vietnamese.txt',
 'Korean.txt']

### Preparing the data
We first preprocess the dataset by converting every name to plain ASCII characters.

In [0]:
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))



Slusarski
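To see why filtering on the ``'Mn'`` category strips accents, it can help to look at what NFD normalization does to the example name. The snippet below is a small illustration using only the standard library; the variable names are ours and the name is written with escape sequences so that it is in precomposed form.

```python
import unicodedata

name = '\u015alus\u00e0rski'  # 'Ślusàrski' with precomposed accented letters
decomposed = unicodedata.normalize('NFD', name)

# NFD splits each accented letter into a base letter plus a combining mark,
# so the decomposed string is longer than the original
print(len(name), len(decomposed))

# Combining marks have the Unicode category 'Mn' (Mark, nonspacing)
marks = [unicodedata.name(c) for c in decomposed if unicodedata.category(c) == 'Mn']
print(marks)

# Dropping the marks leaves only the base letters
stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(stripped)
```

The additional ``c in all_letters`` check in ``unicodeToAscii`` then removes any remaining character that is not part of our alphabet.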


In [0]:
# Build a dictionary containing a list of names for each language
names_per_language = dict()
languages = list()  # Keep a list containing all languages

def readNames(file_path):  # Define a function that reads all names from some file in /data/names/ and converts them to ASCII
  with open(file_path, encoding='utf-8') as f:
    unicode_names = f.read().strip().split('\n')  # Split the file on new lines. Each line contains a name (in unicode)
    return [unicodeToAscii(name) for name in unicode_names]  # Convert all names to ASCII

# For all files in /data/names/ read the names. Group the names by the language they are in

for filename in os.listdir(path_to_data + 'data/names/'):
  language, _ = filename.split('.')  # Remove the file extension to obtain the class label (the language)
  languages.append(language)
  names = readNames(path_to_data + 'data/names/' + filename)  # Read the names in the current file
  names_per_language[language] = names

n_languages = len(languages)

Now we have ``names_per_language``, a dictionary mapping each language to a list of names.

In [0]:
print(names_per_language['Italian'][:5])

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']


### Exercise - Transforming names into suitable inputs

Now that we have all the names organized, we need to turn them into
Tensors to make any use of them.

To represent a single letter, we use a "one-hot vector" of size
``<1 x n_letters>``. A one-hot vector is filled with 0s except for a 1
at the index of the current letter, e.g. ``"b" = <0 1 0 0 0 ...>``.

To make a word we stack those vectors into a 3D tensor of size
``<line_length x 1 x n_letters>``.

The extra dimension of size 1 is there because PyTorch assumes everything
is in batches; we're just using a batch size of 1 here.
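As a rough illustration of this encoding, here is one possible sketch based on ``torch.nn.functional.one_hot``. The helper names ``one_hot_letter`` and ``one_hot_line`` are hypothetical and only one of several ways to build these tensors. The sketch assumes every character is part of ``all_letters``, which holds for names that went through ``unicodeToAscii``.

```python
import string
import torch
import torch.nn.functional as F

all_letters = string.ascii_letters + " .,;'"  # same alphabet as in the notebook
n_letters = len(all_letters)

def one_hot_letter(letter):
    # Hypothetical helper: a <1 x n_letters> one-hot row for a single letter
    index = torch.tensor([all_letters.find(letter)])
    return F.one_hot(index, num_classes=n_letters).float()

def one_hot_line(line):
    # Hypothetical helper: a <line_length x 1 x n_letters> tensor for a whole name
    indices = torch.tensor([all_letters.find(c) for c in line])
    # one_hot gives <line_length x n_letters>; unsqueeze adds the batch dimension
    return F.one_hot(indices, num_classes=n_letters).float().unsqueeze(1)

b = one_hot_letter('b')
print(b[0, :5])                     # the single 1 sits at index 1
print(one_hot_line('Jones').shape)  # one row per character
```

An equivalent approach is to allocate a tensor with ``torch.zeros`` and set the entry at the letter's index to 1.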


In [0]:
import torch
import torch.nn.functional as F

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    pass  # COMPLETE THIS CODE

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    pass  # COMPLETE THIS CODE

print(letterToTensor('J'))

print(lineToTensor('Jones').size())

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
torch.Size([5, 1, 57])


### Exercise - Define an RNN architecture

Create a neural network that takes as input some encoding of a character as well as a hidden state tensor. These two tensors are concatenated and passed to the following:
* a linear layer that produces the next hidden state tensor (no activation function)
* a linear layer that produces an output tensor (no activation function)
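The description above could be sketched roughly as follows. This is only one possible layout, not necessarily the intended solution: the class name ``TinyRNNCell`` and the attribute names ``to_hidden`` and ``to_output`` are made up for illustration, and the sizes 57, 128 and 18 simply mirror the values used in this notebook.

```python
import torch
import torch.nn as nn

class TinyRNNCell(nn.Module):  # hypothetical name, not the notebook's RNN class
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Both linear layers receive the concatenation [input, hidden]
        self.to_hidden = nn.Linear(input_size + hidden_size, hidden_size)
        self.to_output = nn.Linear(input_size + hidden_size, output_size)

    def forward(self, x, hidden):
        # Concatenate along the feature dimension: <1 x (input_size + hidden_size)>
        combined = torch.cat((x, hidden), dim=1)
        # No activation function is applied to either result
        return self.to_output(combined), self.to_hidden(combined)

cell = TinyRNNCell(input_size=57, hidden_size=128, output_size=18)
x = torch.zeros(1, 57)   # placeholder for a one-hot encoded character
h = torch.zeros(1, 128)  # initial hidden state
out, h_next = cell(x, h)
print(out.shape, h_next.shape)
```

Returning raw scores without a softmax keeps the output compatible with loss functions that expect unnormalized scores.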


In [0]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.i2h = ... # COMPLETE THIS CODE
        self.i2o = ... # COMPLETE THIS CODE

    def forward(self, input, hidden):

        # COMPLETE THIS CODE

        return output, hidden

n_hidden = 128
rnn = RNN(n_letters, n_hidden, n_languages)

To run a step of this network we need to pass an input (in our case, the
Tensor for the current letter) and a previous hidden state (which we
initialize as zeros at first). We'll get back the output and a next hidden state (which we keep for the next
step).


In [0]:
input_tensor = lineToTensor('Albert')
hidden = torch.zeros(1, n_hidden)  # Initialize the hidden state as zeros

print(input_tensor.shape)  # The name contains 6 characters which are all encoded as 1-hot vectors of length 57 (corresponding to all possible input characters)
print(input_tensor[0].shape)  # Show the shape of a single character. The 1 is the batch dimension. In this example we set the batch size to 1 for simplicity
print(hidden.shape)  # Show the shape of the hidden state vector

output, next_hidden = rnn(input_tensor[0], hidden)  # Pass the first letter in the name to the network, as well as the initial hidden state

print(F.softmax(output, dim=1))

torch.Size([6, 1, 57])
torch.Size([1, 57])
torch.Size([1, 128])
tensor([[0.0569, 0.0577, 0.0514, 0.0543, 0.0606, 0.0548, 0.0510, 0.0544, 0.0616,
         0.0599, 0.0533, 0.0541, 0.0537, 0.0493, 0.0603, 0.0580, 0.0516, 0.0571]],
       grad_fn=<ExpBackward>)


### Training

First we define a function that samples random data points from the training set.

In [0]:
import random

def random_train_example():
  # Select a random name in some random language
  language = random.choice(languages)
  name = random.choice(names_per_language[language])
  # Convert the name to a suitable input tensor
  name_tensor = lineToTensor(name)
  # Convert the language to a suitable target tensor
  lang_tensor = torch.LongTensor([languages.index(language)])  # The tensor datatype is 'long' as it contains an integer corresponding to the index of the language
  return language, name, lang_tensor, name_tensor

for i in range(10):
    language, name, lang_tensor, name_tensor = random_train_example()

    print(f'Example {i}, Name: {name}, Language: {language}')


Example 0, Name: Scott, Language: Scottish
Example 1, Name: Nomikos, Language: Greek
Example 2, Name: Zhi, Language: Chinese
Example 3, Name: Kieu, Language: Vietnamese
Example 4, Name: Aberquero, Language: Spanish
Example 5, Name: Marquering, Language: Dutch
Example 6, Name: Kaplanek, Language: Czech
Example 7, Name: Sakoda, Language: Japanese
Example 8, Name: Satoh, Language: Japanese
Example 9, Name: Alamilla, Language: Spanish


Next we define a function that performs one step of stochastic gradient descent on a single data point.

In [0]:
criterion = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(rnn.parameters(), lr=0.005)

def train_on_example(name_tensor, language_tensor):
  hidden_state = torch.zeros(1, n_hidden)

  optimizer.zero_grad()

  for character_tensor in name_tensor:  # Perform a forward pass for each character in the name
    out, hidden_state = ...  # COMPLETE THIS CODE

  loss = criterion(out, language_tensor)
  loss.backward()

  optimizer.step()

  return out, loss.item()
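``nn.CrossEntropyLoss`` expects raw, unnormalized scores together with a ``long`` tensor of class indices, which is why the network's output layer has no activation function. The small sketch below checks this on made-up numbers (three classes instead of 18) by comparing the loss with the negative log-probability computed by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.CrossEntropyLoss()

logits = torch.tensor([[2.0, 0.5, -1.0]])  # made-up raw scores, batch size 1
target = torch.tensor([0])                 # index of the correct class (dtype long)

loss = criterion(logits, target)

# The same value by hand: negative log-softmax at the target index
manual = -F.log_softmax(logits, dim=1)[0, target.item()]

print(loss.item(), manual.item())  # the two numbers agree
```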



Now we repeatedly sample random examples from the dataset to train the network.

In [0]:

n_iters = 100000

for i in range(1, n_iters + 1):
  language, name, lang_tensor, name_tensor = random_train_example()
  output, loss = train_on_example(name_tensor, lang_tensor)

  if not i % 1000:
    lang_pred = languages[torch.argmax(output).item()]

    print(f'Example {i}, Loss: {loss:.3f} Name: {name:16s} Language: {language:16s} Classified as: {lang_pred:16s} {"Correct" if lang_pred == language else "Incorrect"}')


### Evaluating the network

We now define a function that classifies a name you enter manually and prints the most likely languages together with their predicted probabilities.

In [0]:


def predict(input_name, num_langs=3):
  name_tensor = lineToTensor(input_name)
  hidden_state = torch.zeros(1, n_hidden)
  for character_tensor in name_tensor:
    out, hidden_state = rnn(character_tensor, hidden_state) 

  dist = list(zip(languages, F.softmax(out, dim=1).squeeze()))

  topk_langs = sorted(dist, key=lambda p: p[1].item())[-num_langs:]

  for lang, p in reversed(topk_langs):
    print(f'{lang}, {p.item()}')
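As an alternative to sorting the full list of probabilities, ``torch.topk`` returns the largest values and their indices directly. The sketch below uses a toy label list and made-up scores rather than the trained network, so the printed numbers are purely illustrative.

```python
import torch
import torch.nn.functional as F

languages = ['Czech', 'Russian', 'English', 'Greek']  # toy label list for illustration
out = torch.tensor([[1.2, 1.0, -0.5, -2.0]])          # made-up network output, <1 x 4>

probs = F.softmax(out, dim=1).squeeze(0)
top_p, top_i = torch.topk(probs, k=3)  # values come back sorted, largest first

for p, i in zip(top_p, top_i):
    print(f'{languages[i.item()]}, {p.item():.3f}')
```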


In [0]:
while True:
  predict(input())

Dovesky
Czech, 0.45103010535240173
Russian, 0.38874995708465576
English, 0.06384486705064774
Hazaki
Japanese, 0.8367655277252197
Polish, 0.1282850056886673
Arabic, 0.010923446156084538
Brune
Scottish, 0.5156903862953186
German, 0.15303932130336761
English, 0.12645377218723297
Jackson
Scottish, 0.7522407174110413
English, 0.1534046083688736
Russian, 0.030946804210543633
Satoshi
Italian, 0.3838347792625427
Japanese, 0.26992765069007874
Arabic, 0.17385496199131012
Lee
Vietnamese, 0.5058001279830933
Chinese, 0.3838532567024231
French, 0.02899548038840294
Hinton
Scottish, 0.5770138502120972
English, 0.14913064241409302
Korean, 0.09496493637561798
Schmidhuber
German, 0.8255714774131775
Arabic, 0.06226326897740364
Dutch, 0.03400055691599846
