# Recurrent Neural Networks

An RNN, or Recurrent Neural Network, is a type of artificial neural network designed for processing sequences of data. Unlike traditional feedforward neural networks, which process data in fixed-size input vectors, RNNs are capable of handling input sequences of variable length, making them well-suited for tasks involving time series data, natural language processing, and other sequential data types.

The key feature of RNNs is their ability to maintain a hidden state that captures information from previous time steps in the sequence. This hidden state is updated as new inputs are processed, allowing the network to capture temporal dependencies and context within the sequence.

Here's a simplified explanation of how an RNN works:

- Initialization: At the start of processing a sequence, the RNN initializes its hidden state to a fixed-size vector, typically all zeros.

- Sequential Processing: The RNN processes the input sequence one element at a time, such as one word in a sentence or one data point in a time series. At each time step, it takes the current input and combines it with the previous hidden state to produce an output and update the hidden state.

- Recurrent Connections: The recurrent connections in the RNN allow information to flow from one time step to the next, enabling the network to capture dependencies and patterns within the sequence.
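The three steps above can be sketched with a toy scalar RNN. This is a minimal illustration, not the model trained later: the function names and the weights `w_x` and `w_h` are hypothetical values chosen only for the example.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8):
    """One recurrent update for a scalar input and a scalar hidden state."""
    return math.tanh(w_x * x + w_h * h_prev)


def run_sequence(inputs):
    h = 0.0  # initialization: the hidden state starts at zero
    for x in inputs:  # sequential processing: one element per time step
        h = rnn_step(x, h)  # recurrent connection: h feeds the next step
    return h


# The same elements in a different order give a different final state,
# because each step depends on everything that came before it.
print(run_sequence([1.0, 0.0, 0.0]))
print(run_sequence([0.0, 0.0, 1.0]))
```

Because the loop simply runs once per element, the same function handles sequences of any length.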

RNNs have been used in various applications, including natural language processing (e.g., language modeling and machine translation), speech recognition, time series analysis, and more. However, they have some limitations, such as difficulty in capturing long-range dependencies, which has led to the development of more advanced recurrent architectures like Long Short-Term Memory (LSTM) networks and Gated Recurrent Unit (GRU) networks, which are designed to address some of these issues.

![alt text](https://miro.medium.com/v2/resize:fit:1400/1*xs2EgGPGlpWrSW4zUANYXA.png)

![alt text](https://miro.medium.com/v2/resize:fit:1194/1*B0q2ZLsUUw31eEImeVf3PQ.png)

As we can see, the calculations at each time step take the context of the previous time steps into account through the hidden state. This ability to use contextual information from earlier inputs is the key to RNNs’ success on sequential problems.

While it may seem that a different RNN cell is being used at each time step in the graphics, the underlying principle of Recurrent Neural Networks is that the RNN cell is actually the exact same one and reused throughout.


## Processing RNN Outputs

You might be wondering, which portion of the RNN do I extract my output from? This really depends on what your use case is. For example, if you’re using the RNN for a classification task, you’ll only need one final output after passing in all the input - a vector representing the class probability scores. In another case, if you’re doing text generation based on the previous character/word, you’ll need an output at every single time step.

![alt text](https://blog.floydhub.com/content/images/2019/04/karpathy.jpeg)

This is where RNNs are really flexible and can adapt to your needs. As seen in the image above, your input and output size can come in different forms, yet they can still be fed into and extracted from the RNN model.

![alt text](https://blog.floydhub.com/content/images/2019/04/Slide6.jpg)

For the case where you’ll only need a single output from the whole process, getting that output can be fairly straightforward as you can easily take the output produced by the last RNN cell in the sequence. As this final output has already undergone calculations through all the previous cells, the context of all the previous inputs has been captured. This means that the final result is indeed dependent on all the previous computations and inputs.

![alt text](https://blog.floydhub.com/content/images/2019/04/Slide7.jpg)

For the second case where you’ll need output information from the intermediate time steps, this information can be taken from the hidden state produced at each step as shown in the figure above. The output produced can also be fed back into the model at the next time step if necessary.

Of course, the type of output that you can obtain from an RNN model is not limited to just these two cases. There are other methods such as Sequence-To-Sequence translation where the output is only produced in a sequence after all the input has been passed through.
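The two cases described above (a single final output versus one output per time step) can be sketched with a toy scalar RNN. The weights `w_x`, `w_h`, and `w_out` are hypothetical values picked for illustration.

```python
import math


def rnn_hidden_states(inputs, w_x=0.5, w_h=0.8):
    """Return the hidden state after every time step (scalar toy RNN)."""
    states, h = [], 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


states = rnn_hidden_states([1.0, -1.0, 0.5, 2.0])

# Many-to-one (e.g. classification): keep only the last hidden state.
final_state = states[-1]

# Many-to-many (e.g. next-character prediction): map every state to an output.
w_out = 2.0  # hypothetical output weight
per_step_outputs = [w_out * h for h in states]

print(final_state)
print(per_step_outputs)
```

The recurrence is identical in both cases; only the choice of which hidden states are turned into outputs differs.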

## Inner Workings

Now that we have a basic understanding and a bird's eye view of how RNNs work, let's explore some basic computations that the RNN’s cells have to do to produce the hidden states and outputs.


<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <msub>
    <mtext>hidden</mtext>
    <mi>t</mi>
  </msub>
  <mo>=</mo>
  <mtext>F</mtext>
  <mo stretchy="false">(</mo>
  <msub>
    <mtext>hidden</mtext>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>t</mi>
      <mo>&#x2212;<!-- − --></mo>
      <mn>1</mn>
    </mrow>
  </msub>
  <mo>,</mo>
  <msub>
    <mtext>input</mtext>
    <mi>t</mi>
  </msub>
  <mo stretchy="false">)</mo>
</math>


In the first step, the hidden state is usually seeded with zeros so that it can be fed into the RNN cell together with the first input in the sequence. In the simplest RNNs, the hidden state and the input data are each multiplied by a weight matrix, initialized via a scheme such as Xavier or Kaiming. The results of these multiplications are summed and then passed through an activation function (such as tanh) to introduce non-linearity.

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <msub>
    <mtext>hidden</mtext>
    <mi>t</mi>
  </msub>
  <mo>=</mo>
  <mtext>tanh</mtext>
  <mo stretchy="false">(</mo>
  <msub>
    <mtext>weight</mtext>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>h</mi>
      <mi>i</mi>
      <mi>d</mi>
      <mi>d</mi>
      <mi>e</mi>
      <mi>n</mi>
    </mrow>
  </msub>
  <mo>&#x2217;<!-- ∗ --></mo>
  <msub>
    <mtext>hidden</mtext>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>t</mi>
      <mo>&#x2212;<!-- − --></mo>
      <mn>1</mn>
    </mrow>
  </msub>
  <mo>+</mo>
  <msub>
    <mtext>weight</mtext>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>i</mi>
      <mi>n</mi>
      <mi>p</mi>
      <mi>u</mi>
      <mi>t</mi>
    </mrow>
  </msub>
  <mo>&#x2217;<!-- ∗ --></mo>
  <msub>
    <mtext>input</mtext>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>t</mi>
    </mrow>
  </msub>
  <mo stretchy="false">)</mo>
</math>


Additionally, if we require an output at the end of each time step we can pass the hidden state that we just produced through a linear layer or just multiply it by another weight matrix to obtain the desired shape of the result.

<math xmlns="http://www.w3.org/1998/Math/MathML" display="block">
  <msub>
    <mtext>output</mtext>
    <mi>t</mi>
  </msub>
  <mo>=</mo>
  <msub>
    <mtext>weight</mtext>
    <mrow class="MJX-TeXAtom-ORD">
      <mi>o</mi>
      <mi>u</mi>
      <mi>t</mi>
      <mi>p</mi>
      <mi>u</mi>
      <mi>t</mi>
    </mrow>
  </msub>
  <mo>&#x2217;<!-- ∗ --></mo>
  <msub>
    <mtext>hidden</mtext>
    <mi>t</mi>
  </msub>
</math>
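As a rough sketch of the two formulas above, the following pure-Python cell computes `hidden_t` and `output_t` for small vectors. The sizes, the helper names, and the use of Xavier uniform initialization are assumptions made for the example; biases are omitted to match the equations.

```python
import math
import random

random.seed(0)


def xavier_matrix(rows, cols):
    """Xavier/Glorot uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (rows + cols))
    return [[random.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


input_size, hidden_size, output_size = 4, 3, 2
weight_input = xavier_matrix(hidden_size, input_size)
weight_hidden = xavier_matrix(hidden_size, hidden_size)
weight_output = xavier_matrix(output_size, hidden_size)


def rnn_cell(input_t, hidden_prev):
    # hidden_t = tanh(weight_hidden * hidden_{t-1} + weight_input * input_t)
    from_hidden = matvec(weight_hidden, hidden_prev)
    from_input = matvec(weight_input, input_t)
    hidden_t = [math.tanh(a + b) for a, b in zip(from_hidden, from_input)]
    # output_t = weight_output * hidden_t
    output_t = matvec(weight_output, hidden_t)
    return output_t, hidden_t


hidden = [0.0] * hidden_size  # seeded with zeros
sequence = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]  # three one-hot inputs
for input_t in sequence:
    output, hidden = rnn_cell(input_t, hidden)

print(len(hidden), len(output))  # 3 2
```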


The hidden state that we just produced will then be fed back into the RNN cell together with the next input and this process continues until we run out of input or the model is programmed to stop producing outputs.

During training, each piece of training data has a corresponding ground-truth label, or simply put, a “correct answer” that we want the model to output. Of course, the first few times we pass the input data through the model, its outputs won’t match these correct answers. After obtaining the outputs, we calculate the loss, which measures how far the model’s output is from the correct answer. Using this loss, we can calculate the gradient of the loss function with respect to the weights via back-propagation.
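To make the notion of a loss concrete, here is a small sketch of cross-entropy, a common loss for classification. The scores are made-up raw outputs for three classes, and the helper names are illustrative only.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def cross_entropy(scores, target_index):
    """Negative log-probability assigned to the correct class."""
    return -log_softmax(scores)[target_index]


scores = [2.0, 0.5, -1.0]  # hypothetical raw outputs for three classes
print(cross_entropy(scores, 0))  # small: class 0 already has the highest score
print(cross_entropy(scores, 2))  # large: class 2 has the lowest score
```

The loss is small when the model puts most of its probability on the correct class and grows as that probability shrinks.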

With the gradient that we just obtained, we can update the weights in the model so that future computations on the input data produce more accurate results. The weights here are the weight matrices that are multiplied with the input data and hidden states during the forward pass. Computing the gradients is called back-propagation (for RNNs, back-propagation through time, since the gradient flows back through every time step); an optimizer then uses these gradients to update the weights. The forward pass, back-propagation, and weight update are repeated over and over, allowing the model to become more accurate as the values of the weight matrices are adjusted to pick out the patterns in the data.

Although it may look as if each RNN cell is using a different weight as shown in the graphics, all of the weights are actually the same as that RNN cell is essentially being re-used throughout the process. Therefore, only the input data and hidden state carried forward are unique at each time step.
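Because the same weights are reused at every step, the gradient for a shared weight is the sum of the contributions from all time steps. The sketch below illustrates this for a toy scalar RNN with a single weight `w` and a squared-error loss (both are simplifying assumptions), and checks the result against a finite-difference estimate.

```python
import math


def forward(w, inputs):
    """Scalar RNN h_t = tanh(w * h_{t-1} + x_t); returns [h_0, h_1, ..., h_T]."""
    states = [0.0]
    for x in inputs:
        states.append(math.tanh(w * states[-1] + x))
    return states


def loss(w, inputs, target):
    return 0.5 * (forward(w, inputs)[-1] - target) ** 2


def grad_w(w, inputs, target):
    """Back-propagation through time for the single shared weight w."""
    states = forward(w, inputs)
    grad_h = states[-1] - target  # dLoss/dh_T
    total = 0.0
    for t in range(len(inputs), 0, -1):
        grad_pre = grad_h * (1.0 - states[t] ** 2)  # back through tanh
        total += grad_pre * states[t - 1]  # this step's contribution to dLoss/dw
        grad_h = grad_pre * w  # pass the gradient back to h_{t-1}
    return total


inputs, target, w = [0.5, -0.3, 0.8], 0.2, 0.7
analytic = grad_w(w, inputs, target)
eps = 1e-6
numeric = (loss(w + eps, inputs, target) - loss(w - eps, inputs, target)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely

w_updated = w - 0.1 * analytic  # one gradient-descent step (learning rate 0.1)
print(loss(w, inputs, target), loss(w_updated, inputs, target))
```

In practice an automatic differentiation library computes these gradients; the manual loop is only meant to show where the sum over time steps comes from.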

## Code

We will be building and training a basic character-level Recurrent Neural Network (RNN) to classify words. A character-level RNN reads words as a series of characters - outputting a prediction and “hidden state” at each step, feeding its previous hidden state into each next step. We take the final prediction to be the output, i.e. which class the word belongs to.

Specifically, we’ll train on a few thousand surnames and predict which language a name comes from based on its spelling. The original dataset covers 18 languages of origin; the copy used in the run below contains 15 of them, as the printed label map shows.


Dataset Link: https://download.pytorch.org/tutorial/data.zip

The data/names directory contains one text file per language, named [Language].txt. Each file contains a list of names, one name per line, mostly romanized (but we still need to convert from Unicode to ASCII).

We’ll end up with a dictionary of lists of names per language, {language: [names ...]}.
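One way to build such a dictionary is sketched below using only the standard library. Note the assumptions: `load_names` and `unicode_to_ascii` are hypothetical helpers, and accents are stripped with `unicodedata` rather than the Unidecode library used in the cells that follow. Unlike those cells, which skip names containing unsupported characters, this sketch simply drops the unsupported characters.

```python
import glob
import os
import string
import unicodedata

ALLOWED = set(string.ascii_letters)


def unicode_to_ascii(text):
    """Strip accents via NFD decomposition and keep only ASCII letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch in ALLOWED)


def load_names(data_dir):
    """Build {language: [names, ...]} from files named [Language].txt."""
    names_by_lang = {}
    for path in sorted(glob.glob(os.path.join(data_dir, "*.txt"))):
        lang = os.path.splitext(os.path.basename(path))[0]
        with open(path, encoding="utf-8") as f:
            names = [unicode_to_ascii(line.strip()) for line in f]
        names_by_lang[lang] = [name for name in names if name]  # drop empty lines
    return names_by_lang


print(unicode_to_ascii("Ślusàrski"))  # Slusarski
```

Calling `load_names("./data/names")` on the extracted dataset would return the dictionary described above. Characters that do not decompose into an ASCII letter plus accents are dropped by this approach, which is one reason a transliteration library can be preferable.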

In [7]:
%pip install scikit-learn unidecode


In [8]:
import os
import glob
import random
from string import ascii_letters

import torch
from torch import nn
import torch.nn.functional as F
from unidecode import unidecode
from sklearn.model_selection import train_test_split

torch.manual_seed(47)

<torch._C.Generator at 0x1152891b0>

In [9]:
def get_device():
    if torch.cuda.is_available():
        return torch.device("cuda")
    elif torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


DEVICE = get_device()
print(DEVICE)

mps


In [12]:
'china.txt'.split('.')[0]

'china'

In [13]:
data_dir = "./data/names"

lang2label = {
    file_name.split(".")[0]: torch.tensor([i], dtype=torch.long)
    for i, file_name in enumerate(os.listdir(data_dir))
}
num_langs = len(lang2label)
print(num_langs, lang2label)

15 {'Czech': tensor([0]), 'German': tensor([1]), 'Japanese': tensor([2]), 'Chinese': tensor([3]), 'Vietnamese': tensor([4]), 'French': tensor([5]), 'Irish': tensor([6]), 'Spanish': tensor([7]), 'Greek': tensor([8]), 'Italian': tensor([9]), 'Portuguese': tensor([10]), 'Scottish': tensor([11]), 'Dutch': tensor([12]), 'Korean': tensor([13]), 'Polish': tensor([14])}


In [15]:
def findFiles(path):
    return glob.glob(path)


print(findFiles("data/names/*.txt"))

['data/names/Czech.txt', 'data/names/German.txt', 'data/names/Japanese.txt', 'data/names/Chinese.txt', 'data/names/Vietnamese.txt', 'data/names/French.txt', 'data/names/Irish.txt', 'data/names/Spanish.txt', 'data/names/Greek.txt', 'data/names/Italian.txt', 'data/names/Portuguese.txt', 'data/names/Scottish.txt', 'data/names/Dutch.txt', 'data/names/Korean.txt', 'data/names/Polish.txt']



Unidecode is a Python library that transliterates Unicode strings to ASCII. This can be useful for a variety of reasons, such as:

- To make text compatible with older systems that do not support Unicode.
- To create filenames that are compatible with all operating systems.
- To improve the performance of applications that need to process large amounts of text.

Here, it lets us reduce every name to the 52 ASCII letters used by the character-level encoding below.

In [16]:
unidecode("Ślusàrski")

'Slusarski'

In [17]:
ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [18]:
# character level encoding

char2idx = {letter: i for i, letter in enumerate(ascii_letters)}
num_letters = len(char2idx)
print(num_letters)
print(char2idx)

52
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, 'A': 26, 'B': 27, 'C': 28, 'D': 29, 'E': 30, 'F': 31, 'G': 32, 'H': 33, 'I': 34, 'J': 35, 'K': 36, 'L': 37, 'M': 38, 'N': 39, 'O': 40, 'P': 41, 'Q': 42, 'R': 43, 'S': 44, 'T': 45, 'U': 46, 'V': 47, 'W': 48, 'X': 49, 'Y': 50, 'Z': 51}


In [26]:
'''
word => w -> [0 ,0, ..., 1, 0, 0, 0] -> o
'''

'\nword => w -> [0 ,0, ..., 1, 0, 0, 0] -> o\n'

In [25]:
#'a'
#'meena' -> 5 x 1 x 52
vector = torch.zeros(1, 52)
vector[0][char2idx['A']] = 1
vector

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [26]:
def name2tensor(name):
    tensor = torch.zeros(len(name), 1, num_letters)
    for i, char in enumerate(name):
        tensor[i][0][char2idx[char]] = 1
    return tensor

In [27]:
name2tensor('ravi')

tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0.]],

        [[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0.]

In [28]:
lang2label

{'Czech': tensor([0]),
 'German': tensor([1]),
 'Japanese': tensor([2]),
 'Chinese': tensor([3]),
 'Vietnamese': tensor([4]),
 'French': tensor([5]),
 'Irish': tensor([6]),
 'Spanish': tensor([7]),
 'Greek': tensor([8]),
 'Italian': tensor([9]),
 'Portuguese': tensor([10]),
 'Scottish': tensor([11]),
 'Dutch': tensor([12]),
 'Korean': tensor([13]),
 'Polish': tensor([14])}

In [29]:
tensor_names = []
target_langs = []

for file in os.listdir(data_dir):
    with open(os.path.join(data_dir, file), encoding="utf-8") as f:
        lang = file.split(".")[0]
        names = [unidecode(line.strip()) for line in f]
        for name in names:
            try:
                tensor_names.append(name2tensor(name))
                target_langs.append(lang2label[lang])
            except KeyError:
                # Skip names containing characters outside ascii_letters
                # (e.g. spaces, apostrophes, hyphens).
                pass

In [33]:
tensor_names[2], target_langs[2]

(tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0.]],
 
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0.]],
 
         [[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0.]],
 
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

In [32]:
tensor_names[0].shape

torch.Size([3, 1, 52])

In [34]:
train_idx, test_idx = train_test_split(
    range(len(target_langs)), test_size=0.1, shuffle=True, stratify=target_langs
)

train_dataset = [(tensor_names[i], target_langs[i]) for i in train_idx]

test_dataset = [(tensor_names[i], target_langs[i]) for i in test_idx]

print(f"Train: {len(train_dataset)}")
print(f"Test: {len(test_dataset)}")

Train: 4406
Test: 490


In [35]:
train_dataset[0][0].shape

torch.Size([5, 1, 52])

In [37]:
torch.cat((torch.zeros(1, 52), torch.zeros(1, 128)), 1).shape

torch.Size([1, 180])

In [47]:
class MyRNN(nn.Module):
    def __init__(
        self,
        input_size,
        hidden_size,
        output_size,
    ):
        super(MyRNN, self).__init__()
        self.hidden_size = hidden_size
        self.input_2_hidden = nn.Linear(input_size, hidden_size)
        self.hidden_2_hidden = nn.Linear(hidden_size, hidden_size)
        self.hidden_2_output = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(
        self,
        input,
        hidden,
    ):
        hidden = torch.tanh(self.input_2_hidden(input) + self.hidden_2_hidden(hidden))
        output = self.hidden_2_output(hidden)
        output = self.softmax(output)
        return output, hidden

    def init_hidden(
        self,
    ):
        return torch.zeros(1, self.hidden_size)

In [48]:
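hidden_size = 64
learning_rate = 0.0001  # 1e-4

model = MyRNN(num_letters, hidden_size, num_langs).to(DEVICE)
# CrossEntropyLoss applies log-softmax internally, so the LogSoftmax in MyRNN is
# redundant (log-softmax is idempotent); nn.NLLLoss would be the usual pairing.
criterion = nn.CrossEntropyLoss().to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)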

In [51]:
num_epochs = 5
print_interval = 500

for epoch in range(num_epochs):
    random.shuffle(train_dataset)
    for i, (name, label) in enumerate(train_dataset):
        hidden_state = model.init_hidden()
        hidden_state = hidden_state.to(DEVICE)
        name = name.to(DEVICE)
        label = label.to(DEVICE)
        for char in name:
            #char dim = 1x52
            output, hidden_state = model(char, hidden_state)
        loss = criterion(output, label)

        optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(model.parameters(), 1)
        optimizer.step()

        if (i + 1) % print_interval == 0:
            print(
                f"Epoch [{epoch + 1}/{num_epochs}], "
                f"Step [{i + 1}/{len(train_dataset)}], "
                f"Loss: {loss.item():.4f}"
            )

Epoch [1/5], Step [500/4406], Loss: 2.1679
Epoch [1/5], Step [1000/4406], Loss: 2.2954
Epoch [1/5], Step [1500/4406], Loss: 0.9446
Epoch [1/5], Step [2000/4406], Loss: 0.2101
Epoch [1/5], Step [2500/4406], Loss: 0.4039
Epoch [1/5], Step [3000/4406], Loss: 0.2436
Epoch [1/5], Step [3500/4406], Loss: 0.0430
Epoch [1/5], Step [4000/4406], Loss: 2.1790
Epoch [2/5], Step [500/4406], Loss: 1.1256
Epoch [2/5], Step [1000/4406], Loss: 0.4066
Epoch [2/5], Step [1500/4406], Loss: 0.0866
Epoch [2/5], Step [2000/4406], Loss: 2.0380
Epoch [2/5], Step [2500/4406], Loss: 0.4662
Epoch [2/5], Step [3000/4406], Loss: 2.5327
Epoch [2/5], Step [3500/4406], Loss: 0.1123
Epoch [2/5], Step [4000/4406], Loss: 0.3766
Epoch [3/5], Step [500/4406], Loss: 1.6536
Epoch [3/5], Step [1000/4406], Loss: 2.9541
Epoch [3/5], Step [1500/4406], Loss: 2.8093
Epoch [3/5], Step [2000/4406], Loss: 0.6116
Epoch [3/5], Step [2500/4406], Loss: 1.0921
Epoch [3/5], Step [3000/4406], Loss: 1.6476
Epoch [3/5], Step [3500/4406], Loss

In [52]:
num_correct = 0
num_samples = len(test_dataset)

model.eval()

with torch.no_grad():
    for name, label in test_dataset:
        hidden_state = model.init_hidden()
        hidden_state = hidden_state.to(DEVICE)
        name = name.to(DEVICE)
        label = label.to(DEVICE)
        for char in name:
            output, hidden_state = model(char, hidden_state)
        _, pred = torch.max(output, dim=1)
        num_correct += bool(pred == label)

print(f"Accuracy: {num_correct / num_samples * 100:.4f}%")

Accuracy: 58.1633%


In [53]:
label2lang = {label.item(): lang for lang, label in lang2label.items()}


def myrnn_predict(name):
    model.eval()
    tensor_name = name2tensor(name)
    with torch.no_grad():
        hidden_state = model.init_hidden()
        hidden_state = hidden_state.to(DEVICE)
        tensor_name = tensor_name.to(DEVICE)
        for char in tensor_name:
            output, hidden_state = model(char, hidden_state)
        _, pred = torch.max(output, dim=1)
    model.train()
    return label2lang[pred.item()]

In [70]:
myrnn_predict("Federico")

'Italian'

In [55]:
myrnn_predict("Xin")

'Chinese'

In [56]:
myrnn_predict("Iskander")

'German'

# Gated Recurrent Unit Neural Networks

A Gated Recurrent Unit (GRU), as its name suggests, is a variant of the RNN architecture that uses gating mechanisms to control the flow of information between cells in the neural network. GRUs were introduced in 2014 by Cho et al., which makes them relatively new compared to the widely adopted LSTM, proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber.

![alt text](https://blog.floydhub.com/content/images/2019/07/image17-1.jpg)

The structure of the GRU allows it to adaptively capture dependencies across long sequences of data without discarding information from earlier parts of the sequence. This is achieved through its gating units, similar to the ones in LSTMs, which mitigate the vanishing gradient problem of traditional RNNs (exploding gradients are usually handled separately, for example with gradient clipping). These gates regulate which information is kept or discarded at each time step.

![alt text](https://blog.floydhub.com/content/images/2019/07/image15.jpg)

Other than its internal gating mechanisms, the GRU functions just like an RNN: at each time step, the GRU cell consumes the next element of the input sequence along with the memory, otherwise known as the hidden state. The updated hidden state is then fed back into the GRU cell together with the next input in the sequence. This process continues like a relay, producing the desired output.
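As a rough sketch of what the gates do, the following scalar GRU step follows the same general formulation as PyTorch's `nn.GRU` (reset gate, update gate, candidate state), with biases omitted for brevity. The weight names and values in `params` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One GRU update for a scalar input and a scalar hidden state."""
    r = sigmoid(p["w_xr"] * x + p["w_hr"] * h_prev)  # reset gate
    z = sigmoid(p["w_xz"] * x + p["w_hz"] * h_prev)  # update gate
    n = math.tanh(p["w_xn"] * x + r * (p["w_hn"] * h_prev))  # candidate state
    return (1.0 - z) * n + z * h_prev  # blend candidate with previous state


params = {"w_xr": 0.5, "w_hr": 0.5, "w_xz": 0.5, "w_hz": 0.5, "w_xn": 1.0, "w_hn": 1.0}

h = 0.0
for x in [1.0, 0.0, -1.0, 0.5]:
    h = gru_step(x, h, params)
print(h)

# With the update gate saturated near 1, the previous state passes through
# almost unchanged, which is how a GRU can carry information across many steps.
keep = dict(params, w_xz=50.0)
print(gru_step(1.0, 0.3, keep))  # very close to 0.3
```

The update gate `z` decides how much of the old state survives, while the reset gate `r` decides how much of the old state is used when forming the candidate.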

In [77]:
class GRUModel(nn.Module):
    def __init__(self, num_layers, hidden_size):
        super(GRUModel, self).__init__()
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.gru = nn.GRU(
            input_size=num_letters,
            hidden_size=hidden_size,
            num_layers=num_layers,
        )
        self.fc = nn.Linear(hidden_size, num_langs)

    def forward(self, x):
        hidden_state = self.init_hidden()
        output, hidden_state = self.gru(x, hidden_state)
        output = self.fc(output[-1])
        return output

    def init_hidden(self):
        return torch.zeros(self.num_layers, 1, self.hidden_size).to(DEVICE)

In [78]:
model = GRUModel(num_layers=2, hidden_size=hidden_size).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [79]:
for epoch in range(num_epochs):
    random.shuffle(train_dataset)
    for i, (name, label) in enumerate(train_dataset):
        name = name.to(DEVICE)
        label = label.to(DEVICE)
        output = model(name)
        loss = criterion(output, label)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i + 1) % print_interval == 0:
            print(
                f"Epoch [{epoch + 1}/{num_epochs}], "
                f"Step [{i + 1}/{len(train_dataset)}], "
                f"Loss: {loss.item():.4f}"
            )

Epoch [1/4], Step [500/4406], Loss: 2.5898
Epoch [1/4], Step [1000/4406], Loss: 1.5315
Epoch [1/4], Step [1500/4406], Loss: 2.1059
Epoch [1/4], Step [2000/4406], Loss: 2.8973
Epoch [1/4], Step [2500/4406], Loss: 2.0770
Epoch [1/4], Step [3000/4406], Loss: 0.0929
Epoch [1/4], Step [3500/4406], Loss: 2.9198
Epoch [1/4], Step [4000/4406], Loss: 2.8344
Epoch [2/4], Step [500/4406], Loss: 0.5632
Epoch [2/4], Step [1000/4406], Loss: 2.2569
Epoch [2/4], Step [1500/4406], Loss: 0.4495
Epoch [2/4], Step [2000/4406], Loss: 0.0000
Epoch [2/4], Step [2500/4406], Loss: 3.2544
Epoch [2/4], Step [3000/4406], Loss: 3.5822
Epoch [2/4], Step [3500/4406], Loss: 0.0046
Epoch [2/4], Step [4000/4406], Loss: 1.8146
Epoch [3/4], Step [500/4406], Loss: 0.0170
Epoch [3/4], Step [1000/4406], Loss: 0.0470
Epoch [3/4], Step [1500/4406], Loss: 0.2245
Epoch [3/4], Step [2000/4406], Loss: 0.4016
Epoch [3/4], Step [2500/4406], Loss: 2.5534
Epoch [3/4], Step [3000/4406], Loss: 0.0472
Epoch [3/4], Step [3500/4406], Loss

In [81]:
num_correct = 0

model.eval()

with torch.no_grad():
    for name, label in test_dataset:
        name = name.to(DEVICE)
        label = label.to(DEVICE)
        output = model(name)
        _, pred = torch.max(output, dim=1)
        num_correct += bool(pred == label)

print(f"Accuracy: {num_correct / num_samples * 100:.4f}%")

Accuracy: 58.1633%


In [82]:
def pytorch_predict(name):
    model.eval()
    tensor_name = name2tensor(name)
    with torch.no_grad():
        tensor_name = tensor_name.to(DEVICE)
        output = model(tensor_name)
        _, pred = torch.max(output, dim=1)
    model.train()
    return label2lang[pred.item()]

In [83]:
pytorch_predict("Jake")

'Czech'

In [84]:
pytorch_predict("Qin")

'Chinese'

In [44]:
pytorch_predict("Fernando")

'Italian'

In [45]:
pytorch_predict("Demirkan")

'Russian'

GRU -> https://blog.floydhub.com/gru-with-pytorch/
LSTM -> https://cnvrg.io/pytorch-lstm/