# RNN for Classifying Names

In this notebook we build and train a basic character-level RNN to classify
names by their language of origin. A character-level RNN reads a word as a
sequence of characters. At each step it produces an output and a "hidden state",
and it feeds that hidden state into the next time step. We take the final output
as the prediction, i.e. the class the word belongs to.

### Preparing the Data

Download the data in folder `data/names` from GitHub.

Included in the ``data/names`` directory are 18 text files named as
``[Language].txt``. Each file contains a bunch of names, one name per
line, mostly romanized (but we still need to convert from Unicode to
ASCII).

We first get all the filenames:

In [1]:
import glob
filenames = glob.glob('data/names/*.txt')

print(filenames)

['data/names/Korean.txt', 'data/names/Portuguese.txt', 'data/names/Dutch.txt', 'data/names/Italian.txt', 'data/names/French.txt', 'data/names/Vietnamese.txt', 'data/names/Chinese.txt', 'data/names/Irish.txt', 'data/names/Japanese.txt', 'data/names/Scottish.txt', 'data/names/Greek.txt', 'data/names/Czech.txt', 'data/names/Russian.txt', 'data/names/English.txt', 'data/names/Spanish.txt', 'data/names/German.txt', 'data/names/Arabic.txt', 'data/names/Polish.txt']


And save each language as a category:

In [2]:
import os
all_categories = []


for filename in filenames:
    language = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(language)

print(all_categories)

['Korean', 'Portuguese', 'Dutch', 'Italian', 'French', 'Vietnamese', 'Chinese', 'Irish', 'Japanese', 'Scottish', 'Greek', 'Czech', 'Russian', 'English', 'Spanish', 'German', 'Arabic', 'Polish']


Next we load the data. Every name goes into one list, and its category (the label) goes into a second list at the same position:

In [3]:
X = []
y = []


for index, filename in enumerate(filenames):
    # read one name per line; the context manager closes the file again
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    category = all_categories[index]
    for line in lines:
        X.append(line)
        y.append(category)

n_categories = len(all_categories)
n_categories, len(X)

(18, 20074)

Let's check which characters are included in the names:

In [4]:
all_characters = set([c for name in X for c in name])
print(all_characters)
print(len(all_characters), "characters")

{'ą', 'ú', 'H', ':', 'C', 'ù', 'r', ' ', 'ż', 'õ', 'ò', '/', 'á', 'j', 'f', ',', '-', 'ó', 'k', 'ü', 'ì', 'E', 'w', 'B', 'l', '1', 'ß', 'L', 'F', 'K', 'M', 'ä', 'I', 'ń', 'z', 'y', 'p', 'R', 'Q', 'J', 'n', 'ł', 'P', 'i', 'q', 'u', 'Ś', '\xa0', 'b', 'U', 'g', 'N', 'a', 'ç', 'ö', 't', 'x', 'ê', 'V', 'A', 'ñ', 'G', 'D', 'à', 'ã', 'Á', 'T', 'è', 'c', 'h', 'é', 'Y', 'v', 'W', 'o', "'", 'S', 'd', 'Z', 'É', 'í', 's', 'm', 'e', 'Ż', 'X', 'O'}
87 characters


We see that the files contain many accented and special characters, which makes our problem more difficult. To reduce the character count, we keep only ASCII letters:

In [5]:
import string

# this is the vocabulary we will use
all_letters = string.ascii_letters
n_letters = len(all_letters)

print(f"Vocab is of size {n_letters} and contains:", all_letters)

Vocab is of size 52 and contains: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ


In [6]:
import unicodedata

# strip diacritics and drop every character that is not in all_letters
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))
print(unicodeToAscii('Frühling'))

Slusarski
Fruhling
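
To see why filtering on the Unicode category `Mn` removes accents, it helps to look at what NFD normalization does to a single character. The sketch below repeats the logic of `unicodeToAscii` under the hypothetical name `strip_to_ascii`, so that it runs on its own. It also shows a limitation: characters without a canonical decomposition (such as `ß`) and punctuation are not mapped to anything, they are simply dropped.

```python
import string
import unicodedata

all_letters = string.ascii_letters  # same vocabulary as in the notebook

def strip_to_ascii(s):
    """Same logic as unicodeToAscii: decompose, drop combining marks, keep ASCII letters."""
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

# NFD splits an accented letter into a base letter plus a combining mark (category 'Mn').
decomposed = unicodedata.normalize('NFD', '\u00fc')  # '\u00fc' is the letter u with diaeresis
print([unicodedata.name(c) for c in decomposed])
print([unicodedata.category(c) for c in decomposed])

# Characters without a canonical decomposition are dropped, not transliterated.
print(strip_to_ascii('Gro\u00df'))   # '\u00df' is the German sharp s
print(strip_to_ascii("O'Neal"))      # the apostrophe is not in all_letters
```

Names can therefore become shorter after the conversion, which is worth keeping in mind when inspecting individual predictions.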


In [7]:
# convert all letters to ascii
X = [unicodeToAscii(x) for x in X]

# print again all characters
all_characters = set([c for name in X for c in name])
print(all_characters)
print(len(all_characters), "characters")

{'z', 'G', 'f', 'D', 'y', 'p', 'R', 'Q', 'T', 'J', 'H', 'c', 'k', 'n', 'P', 'h', 'V', 'C', 'E', 'w', 'B', 'O', 'i', 'l', 'Y', 'q', 'v', 'r', 'u', 'W', 'o', 'b', 'L', 'U', 'g', 'N', 'a', 'S', 'F', 'K', 'd', 'Z', 'M', 't', 's', 'm', 'e', 'I', 'x', 'X', 'j', 'A'}
52 characters


The names now use only the 52 ASCII letters, so we can split the data into training and test sets:

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Train data points:", len(X_train))

Train data points: 16059
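
The size of the split can be checked by hand. The sketch below assumes that the test share is rounded up and that the training set receives the remainder; under that assumption the arithmetic reproduces the 16059 training points printed above.

```python
import math

n_samples = 20074   # total number of names loaded above
test_size = 0.2     # fraction passed to train_test_split

# Assumption: the test share is rounded up, the training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```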


Turning Names into Tensors
--------------------------

Now that we have all the names organized, we need to turn them into
Tensors to make any use of them.

To represent a single letter, we use a "one-hot vector" of size
``<1 x n_letters>``. A one-hot vector is filled with 0s except for a 1
at the index of the current letter, e.g. ``"b" = <0 1 0 0 0 ...>``.

To represent a word we stack one such vector per letter into a 3D tensor of shape
``<line_length x 1 x n_letters>``. In this notebook we feed the letters one at a
time, so each step sees a ``<1 x n_letters>`` tensor.

The extra dimension of size 1 is there because PyTorch assumes everything is in
batches; we're just using a batch size of 1 here.
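
The shapes described above can be illustrated without PyTorch. This is a minimal sketch with nested Python lists; the helper names `letter_to_one_hot` and `line_to_one_hot` are hypothetical and not part of the notebook. Passing the nested list to `torch.tensor` would give a tensor of the same shape.

```python
import string

all_letters = string.ascii_letters
n_letters = len(all_letters)

def letter_to_one_hot(letter):
    """Return a <1 x n_letters> nested list with a single 1.0 at the letter's index."""
    row = [0.0] * n_letters
    row[all_letters.index(letter)] = 1.0   # raises ValueError for letters outside the vocabulary
    return [row]

def line_to_one_hot(line):
    """Stack one <1 x n_letters> block per character: <line_length x 1 x n_letters>."""
    return [letter_to_one_hot(letter) for letter in line]

encoded = line_to_one_hot("Jones")
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)
print(encoded[0][0].index(1.0))   # position of the single 1 for 'J'
```

Unlike `str.find`, which returns -1 for a missing character, `str.index` fails loudly, so an out-of-vocabulary letter cannot silently end up in the last slot.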




In [9]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    index = all_letters.find(letter)
    tensor[0][index] = 1
    return tensor

Using cpu device


Now let's check what the encoding of one letter looks like:

In [10]:
print(letterToTensor('J'))

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])


We also need to convert the label into a number, which is just the index of the category:

In [11]:
def categoryToTensor(category):
    index = all_categories.index(category)
    return torch.tensor([index], dtype=torch.long)

categoryToTensor("Korean")

tensor([0])

Creating the RNN
----------------

This RNN module has two linear layers, and both take the concatenation of the current input and the previous hidden state. One calculates the next hidden state, the other one the output.

In [12]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = 128 # size of the hidden state

        self.input2hidden = nn.Linear(input_size + self.hidden_size, self.hidden_size)
        self.input2output = nn.Linear(input_size + self.hidden_size, output_size)

    def forward(self, x, hidden):
        combined = torch.cat((x, hidden), 1) # input and hidden state are combined
        hidden = self.input2hidden(combined) # calculate next hidden state
        output = self.input2output(combined) # calculate output scores (one per category)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

To run a step of this network we need to pass an input (in our case, the
tensor for the current letter) and a previous hidden state (which we
initialize as zeros at first). We get back the output and a next hidden state (which we keep for the next
step).
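
The step described above, and how the hidden state is carried through a sequence, can be traced by hand on a tiny example. The following is a sketch in plain Python with made-up sizes and constant weights (all of them assumptions chosen for illustration); it mirrors the structure of the `RNN` class but is not the class itself.

```python
def linear(weights, bias, vector):
    """Compute weights @ vector + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def rnn_step(x, hidden, w_hidden, b_hidden, w_output, b_output):
    combined = x + hidden                                # list concatenation, like torch.cat
    next_hidden = linear(w_hidden, b_hidden, combined)   # next hidden state
    output = linear(w_output, b_output, combined)        # output scores
    return output, next_hidden

# Toy sizes (assumptions): 3 input features, 2 hidden units, 2 classes.
input_size, hidden_size, output_size = 3, 2, 2
w_hidden = [[0.1] * (input_size + hidden_size) for _ in range(hidden_size)]
b_hidden = [0.0] * hidden_size
w_output = [[0.5] * (input_size + hidden_size) for _ in range(output_size)]
b_output = [0.0] * output_size

sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two one-hot "letters"
hidden = [0.0] * hidden_size                     # the initial hidden state is all zeros
for x in sequence:
    output, hidden = rnn_step(x, hidden, w_hidden, b_hidden, w_output, b_output)

print(output, hidden)
```

Only the output of the last step is used for classification; the outputs of earlier steps are discarded, while the hidden state passes information from earlier letters forward.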




In [13]:
rnn = RNN(n_letters, n_categories)

x = letterToTensor('A')
hidden = rnn.initHidden()

output, next_hidden = rnn(x, hidden)
print(torch.softmax(output, 1))

tensor([[0.0555, 0.0561, 0.0533, 0.0539, 0.0538, 0.0594, 0.0599, 0.0544, 0.0543,
         0.0576, 0.0609, 0.0600, 0.0529, 0.0565, 0.0519, 0.0517, 0.0521, 0.0555]],
       grad_fn=<SoftmaxBackward0>)


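As you can see, the output is a ``<1 x n_categories>`` Tensor. After the softmax,
every item is the predicted probability of that category (higher is more likely).
Because the network is still untrained, all 18 probabilities are close to 1/18 ≈ 0.056.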
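
For readers who want to see what the softmax does to the raw output scores, here is a minimal plain-Python sketch. The function `softmax` below is a stand-in written for illustration, not PyTorch's implementation, and the input scores are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

n_categories = 18

# Equal raw scores give every category the same probability.
uniform = softmax([0.0] * n_categories)
print(round(uniform[0], 4))

# A larger raw score translates into a larger probability.
probs = softmax([2.0, 1.0, 0.0])
print([round(p, 3) for p in probs])
```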




Task 1: Training the Network
--------------------

Finish the following training loop to train the RNN on the training data set.

In [None]:
rnn.to(device)
rnn.train()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.005)


for epoch in range(1, 10):
    running_loss = 0.0
    print("Training epoch:", epoch)
    # iterate through all names in X_train (one optimizer step per name)
    for i, name in enumerate(X_train, 0):
        next_hidden = rnn.initHidden().to(device)
        optimizer.zero_grad()

        # forward pass: feed the name one character at a time
        for char in name:
            x = letterToTensor(char).to(device)
            output, next_hidden = rnn(x, next_hidden)

        # loss on the final output + backward + optimize
        category = y_train[i]
        loss = criterion(output, categoryToTensor(category).to(device))
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 names
            print(f'[{epoch}, {i + 1:5d}] loss: {running_loss / 2000:.9f}')
            running_loss = 0.0

print("Finished Learning")


Training epoch: 1
[1,  2000] loss: 1.876466891
[1,  4000] loss: 1.563148097
[1,  6000] loss: 1.482917607
[1,  8000] loss: 1.382190724
[1, 10000] loss: 1.372809337
[1, 12000] loss: 1.297917074
[1, 14000] loss: 1.298332493
[1, 16000] loss: 1.212345366
Training epoch: 2
[2,  2000] loss: 1.200130932
[2,  4000] loss: 1.151070391
[2,  6000] loss: 1.130098969
[2,  8000] loss: 1.091652181
[2, 10000] loss: 1.099722245
[2, 12000] loss: 1.093102953
[2, 14000] loss: 1.118540812
[2, 16000] loss: 1.050824200
Training epoch: 3
[3,  2000] loss: 1.074618266
[3,  4000] loss: 1.041089648
[3,  6000] loss: 1.002159093
[3,  8000] loss: 0.985734719
[3, 10000] loss: 0.986821996
[3, 12000] loss: 1.006558733
[3, 14000] loss: 1.037499289
[3, 16000] loss: 0.966549328
Training epoch: 4
[4,  2000] loss: 1.005560512
[4,  4000] loss: 0.981200496
[4,  6000] loss: 0.933879589
[4,  8000] loss: 0.923400705
[4, 10000] loss: 0.922129239
[4, 12000] loss: 0.954495491
[4, 14000] loss: 0.982984848
[4, 16000] loss: 0.913223639
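
To put the loss values in context, it helps to know what cross-entropy would be for a model that has learned nothing. The sketch below computes the loss by hand for a uniform prediction over 18 categories and converts a loss value back into a probability. The function `cross_entropy` is a simplified stand-in that takes probabilities directly, whereas `nn.CrossEntropyLoss` takes raw scores and applies a log-softmax internally.

```python
import math

def cross_entropy(probabilities, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[target_index])

n_categories = 18

# Baseline: a model that spreads its probability evenly over all categories.
uniform = [1.0 / n_categories] * n_categories
baseline_loss = cross_entropy(uniform, target_index=0)
print(round(baseline_loss, 3))

# exp(-average loss) is the geometric mean of the probabilities given to the correct class.
print(round(math.exp(-0.913), 3))
```

The baseline is ln(18), roughly 2.89, so a running loss below 1 means the network assigns far more probability to the correct category than a uniform guess would.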


### Task 2: Evaluating the Results

Evaluate the accuracy of the RNN on the test data.

In [18]:
# Evaluate accuracy on the held-out test set
rnn.eval()
all_preds = []
all_labels = []
correct = 0
with torch.no_grad():  # no gradients are needed for evaluation
    for i, name in enumerate(X_test, 0):
        next_hidden = rnn.initHidden().to(device)
        for char in name:
            x = letterToTensor(char).to(device)
            output, next_hidden = rnn(x, next_hidden)
        predicted_category = all_categories[output.argmax().item()]
        category = y_test[i]
        # append whole strings; `+=` would extend the lists character by character
        all_preds.append(predicted_category)
        all_labels.append(category)
        if predicted_category == category:
            correct += 1

accuracy = correct / len(X_test)

print(f"Accuracy is: {100. * accuracy}")

Accuracy is: 74.17185554171856
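
A single accuracy number can hide large differences between categories, in particular if some languages contribute many more names than others. Assuming predictions and labels are available as two parallel lists of category names, a per-category breakdown could be sketched as follows. The helper `per_category_accuracy` and the toy data are hypothetical and are not results from this notebook.

```python
from collections import Counter

def per_category_accuracy(labels, predictions):
    """Return {category: fraction of its names that were predicted correctly}."""
    totals = Counter(labels)
    hits = Counter(label for label, pred in zip(labels, predictions) if label == pred)
    return {category: hits[category] / totals[category] for category in totals}

# Hypothetical toy data, only to show the mechanics.
labels      = ["Russian", "Russian", "Russian", "Irish", "Irish", "Korean"]
predictions = ["Russian", "Russian", "English", "Irish", "Russian", "Russian"]

overall = sum(l == p for l, p in zip(labels, predictions)) / len(labels)
breakdown = per_category_accuracy(labels, predictions)
print(overall)
print(breakdown)
```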


### Task 3: Running on User Input

Write a function that takes an arbitrary name as input and returns the RNN's top 3 categories for it.


In [87]:
# function to take an arbitrary name as input and output the top 3 categories
def name_heritage_specifier(model, name):
    model.eval()  # call the method; a bare `model.eval` does nothing
    model_device = next(model.parameters()).device
    with torch.no_grad():
        next_hidden = model.initHidden().to(model_device)
        for char in unicodeToAscii(name):
            x = letterToTensor(char).to(model_device)
            output, next_hidden = model(x, next_hidden)
    # softmax over the categories (dim 1); dim 0 is the batch dimension of size 1
    probs = torch.softmax(output, 1)[0]
    top_probs, top_indices = probs.topk(3)
    return [(all_categories[i], p) for i, p in zip(top_indices.tolist(), top_probs.tolist())]


In [88]:
name_heritage_specifier(rnn, "Colin")
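
One detail worth double-checking when turning a ``<1 x n_categories>`` output into a ranking is the axis of the softmax. The sketch below uses plain Python lists and made-up scores (both assumptions for illustration). Normalizing along the batch axis, which has length 1, turns every score into 1.0, while normalizing across the categories gives a proper distribution from which the top 3 can be picked.

```python
import math

def softmax(values):
    """Softmax over a flat list of scores."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one name, shaped <1 x n_categories> like the RNN output.
categories = ["Korean", "Irish", "English", "German"]
output = [[0.2, 2.0, 1.5, -1.0]]

# Wrong axis: each column of a one-row batch holds a single value, so softmax returns 1.0.
along_batch = [softmax([row[j] for row in output])[0] for j in range(len(categories))]

# Right axis: normalize across the categories of the single row.
along_categories = softmax(output[0])

# Rank the categories by probability and keep the best three.
ranked = sorted(zip(categories, along_categories), key=lambda pair: pair[1], reverse=True)
top3 = ranked[:3]

print(along_batch)
print([name for name, _ in top3])
```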