# Classifying Names with a Character-Level RNN

Welcome! In this assignment you will learn to build a recurrent network that classifies names by their language of origin.

1. ***Input:*** Kawachi ***Output:*** Japanese
2. ***Input:*** Watson ***Output:*** Scottish

Why do we need an RNN?

1. In a text classification task, a simple approach is to one-hot encode each character and combine the encodings without regard to position. This loses the ordering information. For example: Watson is Scottish but So Twan is Chinese!!
2. Both names contain the same characters (ignoring case and spaces), so such an order-free encoding is identical for the two. We therefore need to preserve the order in which the characters occur for better classification.
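To make the ordering argument concrete, here is a minimal sketch. It assumes the order-free encoding is a bag of characters (per-letter counts, ignoring case and spaces). The helper name `bag_of_chars` is hypothetical and is not part of the assignment code.

```python
from collections import Counter

def bag_of_chars(name):
    # hypothetical helper: count letters, ignoring case and spaces
    return Counter(name.lower().replace(" ", ""))

watson = bag_of_chars("Watson")
so_twan = bag_of_chars("So Twan")

# the two bags are identical, so an order-free model cannot tell the names apart
same_bag = (watson == so_twan)
print(same_bag)  # True

# the character sequences differ, and that is the information an RNN can exploit
same_sequence = (list("watson") == list("sotwan"))
print(same_sequence)  # False
```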

***Model Structure/Overview***
<img src = "images/rnn_2.png">

In [None]:
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import unicodedata
import string

# Overview

We are going to build the network shown above so that it can use the "ordering" information when classifying names.

***How to interpret the diagram:***

1. The arrows represent flow of information/data
2. Yellow boxes -> Input/Output to the network
3. Blue boxes -> Variables/parameters/activation in the Network

***You are going to code the following portions:***

1. Processing Input and converting it to "Desirable Form"
2. The RNN Class which builds the network
3. Training the Network
4. Visualizing Loss and Confusion Matrix

***What you need not code***

1. Loading files
2. Basic String Processing
3. Training phase - input processing

However, it is a good exercise to read the utilities and understand them before proceeding.

Let's get started!!

# Check if input files are available

In [None]:
# Utility - list the files for the "name categories" we have
def findFiles(path):
    return glob.glob(path)

print(findFiles('data/RNN/names/*.txt'))

# String Processing - and mapping to ASCII

In [None]:
# define possible set of characters
all_letters = string.ascii_letters + " .,;'"
# compute n_letters - kind of vocabulary
n_letters = len(all_letters)

# Adapted from Stack Overflow: convert Unicode to plain ASCII by stripping accents and dropping special characters
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )
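As a quick illustration of what the conversion does, here is a self-contained sketch that repeats the same logic under a hypothetical name, `unicode_to_ascii`, and runs it on two accented names. NFD normalization splits an accented character into a base letter plus a combining mark, and the combining marks (category 'Mn') are then dropped.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"

def unicode_to_ascii(s):
    # NFD splits an accented character into base letter + combining mark;
    # combining marks have category 'Mn' and are filtered out, as is any
    # character that is not in all_letters
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

first = unicode_to_ascii('Ślusàrski')
second = unicode_to_ascii("O'Néàl")
print(first)   # Slusarski
print(second)  # O'Neal
```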

# Load files

***Note:*** In the category_lines dictionary, each key should be a language and each value should be the list of all lines (names) in that category's file.

In [None]:
import os

# category_lines dict -> key:language and value:names in that category
category_lines = {}
# for all categories - like English, French, Spanish etc
all_categories = []

# for every word in a given category, convert it to ASCII
def readLines(filename):
    # the context manager closes the file after reading
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

# for each category, load data into dict
for filename in findFiles('data/RNN/names/*.txt'):
    # category = file name without directory or extension (works with any path separator)
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

# You need to code the following

1. Given a letter, find its index in the "all_letters" variable
2. Given a letter, return the one-hot encoding of the letter
3. Given a word, return a tensor of shape (len(word),1,n_letters) which one-hot encodes the input

Throughout, we keep the batch size at 1, to be consistent with the PyTorch input representation.

***Ensure*** that the output of lineToTensor() for "Sairam" is a tensor of dimension (6,1,57)

In [None]:
import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# convert each letter to one-hot tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('a'))

# convert each letter of a given line into a (seq_len, batch, vocab_size) tensor, with batch_size=1
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(lineToTensor('Sairam').size())
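One way to sanity-check the encoding is to decode it again. The sketch below is self-contained and uses a hypothetical copy of the function named `line_to_tensor`. It checks that every time step contains exactly one 1 and that an argmax over the vocabulary axis recovers the original name.

```python
import string
import torch

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)  # 52 letters + 5 punctuation characters = 57

def line_to_tensor(line):
    # (seq_len, batch=1, vocab_size) one-hot tensor, as in the cell above
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

encoded = line_to_tensor('Sairam')

# each time step should hold exactly one 1
ones_per_step = encoded.sum(dim=2).squeeze(1).tolist()

# argmax over the vocabulary axis recovers the letter indices
indices = encoded.argmax(dim=2).squeeze(1).tolist()
decoded = ''.join(all_letters[i] for i in indices)

print(tuple(encoded.size()), ones_per_step, decoded)
```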

# The RNN Module - Yay!

<img src = "images/rnn_2.png">

Code the network depicted above!!

***Don't worry about using the nn.RNN module for now. We will use it in the next exercise. For now, focus on understanding how the "recurrence" relation is captured in PyTorch.***

I have already typed out the LHS. You need to fill in the RHS, taking help from the comments that precede each line.

***For each line of code, please look up the diagram above to understand the flow of logic***

Also, for now we use ***linear combinations*** only. We will add non-linearities (activations) in the next phase.

***How to interpret the diagram:***

1. The arrows represent flow of information/data
2. Yellow boxes -> Input/Output to the network
3. Blue boxes -> Variables/parameters/activation in the Network

For any syntax clarification look up: http://pytorch.org/docs/master/nn.html

In [None]:
import torch.nn as nn
from torch.autograd import Variable

# RNN Module Class
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        # save hidden_size
        self.hidden_size = hidden_size
        # i2h -> Linear Class taking (input+hidden)dim to hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        # i2o -> Linear Class taking (input+hidden)dim to output_size
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        # output activation (not a loss): LogSoftmax over the category dimension, pairs with nn.NLLLoss
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # use torch.cat to concatenate input and hidden as input for next stage
        combined = torch.cat((input, hidden), 1)
        # compute hidden layer values
        hidden = self.i2h(combined)
        # compute output
        output = self.i2o(combined)
        # log-softmax of output for classification task
        output = self.softmax(output)
        # return output and hidden state for "next" time-step
        return output, hidden

    def initHidden(self):
        # return Variable - torch.zeros of dimension (1,hidden_size)
        return Variable(torch.zeros(1, self.hidden_size))

# define hidden size
n_hidden = 128
# define model
rnn = RNN(n_letters, n_hidden, n_categories)
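If the concatenation step feels like magic, the following sketch may help. It checks numerically that one Linear layer applied to the concatenated (input, hidden) vector gives the same result as two separate weight matrices, one acting on the input and one on the previous hidden state. The names `W_x` and `W_h` are introduced here for illustration only, and the sizes 57 and 128 are taken from the cells above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 57, 128

i2h = nn.Linear(input_size + hidden_size, hidden_size)

# a one-hot input character and an arbitrary previous hidden state
x = torch.zeros(1, input_size)
x[0][3] = 1
h = torch.randn(1, hidden_size)

# what the forward pass does: concatenate, then apply one linear layer
out_cat = i2h(torch.cat((x, h), 1))

# the same computation written as two separate matrix products
W_x = i2h.weight[:, :input_size]   # columns that multiply the input
W_h = i2h.weight[:, input_size:]   # columns that multiply the hidden state
out_split = x @ W_x.t() + h @ W_h.t() + i2h.bias

matches = torch.allclose(out_cat, out_split, atol=1e-5)
print(matches)
```

So the hidden-state columns of the weight matrix are what carry information from one character to the next.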

# Try out an example

In [None]:
# Try out an example!
name = "Sairam"
# create an input variable using lineToTensor
input = Variable(lineToTensor(name))
# define initial hidden variable of size (1,n_hidden)
hidden = Variable(torch.zeros(1, n_hidden))
# pass the first character "S" (input[0]) and hidden as input to the RNN
output, next_hidden = rnn(input[0], hidden)
# output - log-probability of the name belonging to each of the 18 classes (LogSoftmax prediction)
print(output.size()) # Interpret output!!! - Very important
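To help interpret the output, here is a small stand-alone sketch. It uses random numbers as a stand-in for the i2o layer's output (an assumption, since the real values depend on your trained weights) and shows that LogSoftmax values are log-probabilities: they are all non-positive, and exponentiating them gives probabilities that sum to 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_categories = 18  # number of classes mentioned in the text

# stand-in for the raw i2o output of a single character step
logits = torch.randn(1, n_categories)

log_probs = nn.LogSoftmax(dim=1)(logits)  # same shape as the network output
probs = log_probs.exp()                   # back to ordinary probabilities
total = probs.sum().item()

print(tuple(log_probs.size()), total)
```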

# Utility code to generate one training example at a time

***randomTrainingExample()*** -> return a training example in the following format

1. Category: which language the word is from
2. Line: the word itself
3. Category_tensor: index of the category the word belongs to (label information)
4. Line_tensor: one-hot representation of the word (len(word),1,n_letters)

In [None]:
# Utility functions for generating a random word from a random category
import random

def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randomTrainingExample():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    category_tensor = Variable(torch.LongTensor([all_categories.index(category)]))
    line_tensor = Variable(lineToTensor(line))
    return category, line, category_tensor, line_tensor

for i in range(10):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    print('category =', category, '/ line =', line)

# Defining your Loss - nn.NLLLoss()

In [None]:
criterion = nn.NLLLoss()
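A short sketch of what this loss computes, using made-up probabilities for three classes: nn.NLLLoss simply picks the log-probability of the correct class and negates it. The second check shows that LogSoftmax followed by NLLLoss matches nn.CrossEntropyLoss applied to raw scores.

```python
import torch
import torch.nn as nn

# made-up log-probabilities for one example over three classes
log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1]]))
target = torch.tensor([1])  # index of the correct class

loss = nn.NLLLoss()(log_probs, target)
expected = -log_probs[0, 1]  # negative log-probability of the correct class
print(loss.item(), expected.item())

# LogSoftmax + NLLLoss is equivalent to CrossEntropyLoss on raw scores
logits = torch.tensor([[2.0, 0.5, -1.0]])
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
ce = nn.CrossEntropyLoss()(logits, target)
print(torch.isclose(nll, ce).item())
```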

# Training your Network

1. Define hidden layer
2. Zero out the rnn gradients
3. For each input character
    1. output, hidden = rnn(input_char, hidden)
4. Compute loss
5. Backpropagate the loss through the network for the entire word
6. For every weight matrix in rnn.parameters()
    1. Do a gradient descent step on the rnn parameters using the p.data.add_ utility
7. Return output and loss value!

In [None]:
learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn

#train module
def train(category_tensor, line_tensor):
    # initialize hidden layer
    hidden = rnn.initHidden()
    # zero-out gradients
    rnn.zero_grad()
    # for each character in the name
    for i in range(line_tensor.size()[0]):
        # forward propagate input - and output is "prob"
        # remember the diagram!!!
        output, hidden = rnn(line_tensor[i], hidden)

    # compute loss
    loss = criterion(output, category_tensor)
    # backpropagate loss
    loss.backward()

    # manually update weights
    # Add parameters' gradients to their values, multiplied by the negative learning rate
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    # return output, and loss as a Python float
    return output, loss.item()
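The manual weight update is plain stochastic gradient descent. As a sketch, and assuming no momentum or weight decay, the same step can be taken with torch.optim.SGD. The tiny Linear model below is a stand-in for the RNN so that the comparison stays short.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
learning_rate = 0.005

# two identical toy models (stand-ins for the RNN)
model_manual = nn.Linear(4, 3)
model_sgd = copy.deepcopy(model_manual)

x = torch.randn(1, 4)
target = torch.tensor([2])
criterion = nn.NLLLoss()
log_softmax = nn.LogSoftmax(dim=1)

# manual update: parameter <- parameter - learning_rate * gradient
model_manual.zero_grad()
loss = criterion(log_softmax(model_manual(x)), target)
loss.backward()
with torch.no_grad():
    for p in model_manual.parameters():
        p.add_(p.grad, alpha=-learning_rate)

# the same step using the built-in optimizer
optimizer = torch.optim.SGD(model_sgd.parameters(), lr=learning_rate)
optimizer.zero_grad()
loss = criterion(log_softmax(model_sgd(x)), target)
loss.backward()
optimizer.step()

same = all(
    torch.allclose(a, b)
    for a, b in zip(model_manual.parameters(), model_sgd.parameters())
)
print(same)
```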

# Given the softmax output, classify into one of the 18 categories.

In [None]:
# given the log-probability for each class, return the class it belongs to
def categoryFromOutput(output):
    # compute maximum, and index of maximum on output.data using topk
    top_n, top_i = output.data.topk(1) # Tensor out of Variable with .data
    # store the index value as a Python int
    category_i = top_i[0][0].item()
    # return all_categories[index] and index
    return all_categories[category_i], category_i

print(categoryFromOutput(output))
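topk can return more than the single best class. The sketch below uses a hypothetical three-language list and made-up probabilities to show how the values and indices line up when asking for the top 2 guesses.

```python
import torch

categories = ['Chinese', 'Japanese', 'Scottish']  # hypothetical short list

# made-up log-probabilities for a single name
output = torch.log(torch.tensor([[0.1, 0.6, 0.3]]))

# values are sorted from largest to smallest, indices point into `categories`
top_values, top_indices = output.topk(2)
ranked = [categories[i] for i in top_indices[0].tolist()]
best = ranked[0]

for name, value in zip(ranked, top_values[0]):
    print(name, round(value.exp().item(), 2))
```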

# Time to run your code!

***Repeat the following a number of times***

1. Generate a random example using randomTrainingExample()
2. Call the train function with category_tensor as the label and line_tensor as the input
3. Accumulate the loss

In [None]:
import time
import math

n_iters = 100000
print_every = 5000
plot_every = 1000

# Keep track of losses for plotting
current_loss = 0
all_losses = []

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

start = time.time()

for iter in range(1, n_iters + 1):
    # obtain a training example
    category, line, category_tensor, line_tensor = randomTrainingExample()
    # train RNN using that example - one character at a time
    output, loss = train(category_tensor, line_tensor)
    # cumulative loss for each "plot-every-x" number of iterations
    current_loss += loss

    # Print iter number, loss, name and guess
    if iter % print_every == 0:
        guess, guess_i = categoryFromOutput(output)
        correct = '✓' if guess == category else '✗ (%s)' % category
        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))

    # Add current loss avg to list of losses
    if iter % plot_every == 0:
        all_losses.append(current_loss / plot_every)
        current_loss = 0
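The bookkeeping for the loss plot can be seen in isolation with a few synthetic loss values (these numbers are made up for illustration). Losses are accumulated and, every plot_every iterations, their average is stored and the accumulator is reset.

```python
plot_every = 3
losses = [3.0, 2.0, 1.0, 0.9, 0.6, 0.3]  # synthetic per-iteration losses

current_loss = 0.0
all_losses = []

for it, loss in enumerate(losses, start=1):
    current_loss += loss
    if it % plot_every == 0:
        # store the average of the last plot_every losses, then reset
        all_losses.append(current_loss / plot_every)
        current_loss = 0.0

print(all_losses)  # two averages, roughly 2.0 and 0.6
```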
        

# Plot the loss

***if loss reducing==True:***

Feel happy seeing the loss go down!! Congrats... Great job.

***else:***

Oops.. the loss is not reducing, so something went wrong somewhere. Please re-read the comments/pointers and be careful while coding. No worries!

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Basic Plots
plt.figure()
plt.plot(all_losses)
plt.show()

# Visualize Confusion Matrix on Evaluation

Just run the cell below - and visualize confusion matrix!

Please read the code at your leisure. It is worth understanding how to compute the confusion matrix and visualize the output.

In [None]:
# Keep track of correct guesses in a confusion matrix
confusion = torch.zeros(n_categories, n_categories)
n_confusion = 10000

# Just return an output given a line
def evaluate(line_tensor):
    hidden = rnn.initHidden()

    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    return output

# Go through a bunch of examples and record which are correctly guessed
for i in range(n_confusion):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output = evaluate(line_tensor)
    guess, guess_i = categoryFromOutput(output)
    category_i = all_categories.index(category)
    confusion[category_i][guess_i] += 1

# Normalize by dividing every row by its sum
for i in range(n_categories):
    confusion[i] = confusion[i] / confusion[i].sum()

# Set up plot
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion.numpy())
fig.colorbar(cax)

# Set up axes
ax.set_xticklabels([''] + all_categories, rotation=90)
ax.set_yticklabels([''] + all_categories)

# Force label at every tick
ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

plt.show()
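The row normalization can also be written without a loop. The sketch below uses a made-up 3x3 count matrix in which the last category was never sampled, and it guards against dividing by zero for such a row. The clamp is an addition of this sketch and is not in the cell above.

```python
import torch

# made-up counts: rows = actual category, columns = guessed category
confusion = torch.tensor([[8.0, 2.0, 0.0],
                          [1.0, 3.0, 0.0],
                          [0.0, 0.0, 0.0]])  # last category never sampled

# sum of each row, kept as a column so it broadcasts across the row
row_sums = confusion.sum(dim=1, keepdim=True)

# clamp avoids 0/0 = NaN for rows with no samples; non-empty rows are unaffected
normalized = confusion / row_sums.clamp(min=1)
print(normalized)
```

Each non-empty row now sums to 1, so the diagonal entry reads as the fraction of names from that language that were classified correctly.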

# Sayonara

Great! You have reached the end of the assignment.

***Take Away***

1. How to build a recurrent network that works on character-level input
2. How to combine input and hidden state as the new input and classify a given name

***What Next***

1. Now improve this by using an RNN module instead of linear nodes and see the improvement

***Early Bird Offer!!!***

2. Can an LSTM make it even better?? - For early birds!!
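As a possible starting point for the first "What Next" item, here is a minimal sketch built on nn.RNN. The class name `NameRNN` and the layer name `h2o` are hypothetical, and the sizes 57, 128 and 18 are taken from this notebook. Treat it as one way to wire things up under those assumptions, not as the reference solution for the next exercise.

```python
import torch
import torch.nn as nn

class NameRNN(nn.Module):  # hypothetical name, not from the assignment
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # nn.RNN handles the loop over characters and applies tanh by default
        self.rnn = nn.RNN(input_size, hidden_size)
        self.h2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, line_tensor):
        # line_tensor: (seq_len, batch=1, input_size)
        _, h_n = self.rnn(line_tensor)   # h_n: (1, 1, hidden_size), final hidden state
        return self.softmax(self.h2o(h_n[0]))

model = NameRNN(57, 128, 18)
line_tensor = torch.zeros(6, 1, 57)      # stand-in for lineToTensor("Sairam")
output = model(line_tensor)
print(output.size())
```

Note that the whole word is passed in one call, so no explicit per-character loop is needed in the training function.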