# Writing in the Style of Arthur Conan Doyle
In this example, we'll create a model that generates text in the style of Sir Arthur Conan Doyle. We will walk through the steps of training an RNN built from analog layers and look at how well the resulting model performs.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!python --version
!wget https://aihwkit-gpu-demo.s3.us-east.cloud-object-storage.appdomain.cloud/aihwkit-0.6.0.cuda111-cp37-cp37m-manylinux2014_x86_64.whl
!pip install aihwkit-0.6.0.cuda111-cp37-cp37m-manylinux2014_x86_64.whl -f https://download.pytorch.org/whl/torch_stable.html

Python 3.7.13
--2022-05-27 13:30:44--  https://aihwkit-gpu-demo.s3.us-east.cloud-object-storage.appdomain.cloud/aihwkit-0.6.0.cuda111-cp37-cp37m-manylinux2014_x86_64.whl
Resolving aihwkit-gpu-demo.s3.us-east.cloud-object-storage.appdomain.cloud (aihwkit-gpu-demo.s3.us-east.cloud-object-storage.appdomain.cloud)... 169.63.118.98
Connecting to aihwkit-gpu-demo.s3.us-east.cloud-object-storage.appdomain.cloud (aihwkit-gpu-demo.s3.us-east.cloud-object-storage.appdomain.cloud)|169.63.118.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 346965225 (331M) [application/octet-stream]
Saving to: ‘aihwkit-0.6.0.cuda111-cp37-cp37m-manylinux2014_x86_64.whl’


2022-05-27 13:30:49 (72.4 MB/s) - ‘aihwkit-0.6.0.cuda111-cp37-cp37m-manylinux2014_x86_64.whl’ saved [346965225/346965225]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Processing ./aihwkit-0.6.0.cu

In [None]:
#Import from aihwkit
from aihwkit.nn import AnalogRNN, AnalogLSTMCell, AnalogGRUCell, AnalogVanillaRNNCell
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.configs import SingleRPUConfig
from aihwkit.simulator.configs import InferenceRPUConfig
from aihwkit.simulator.configs.utils import (
    WeightNoiseType, WeightClipType, WeightModifierType)
from aihwkit.simulator.presets import GokmenVlasovPreset
from aihwkit.inference import PCMLikeNoiseModel, GlobalDriftCompensation
from aihwkit.nn import AnalogLinear, AnalogSequential

#other imports
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
from collections import Counter
import re
from tqdm import tqdm

First, we'll define some hyperparameters. 

In [None]:
LEARNING_RATE = 0.05
NUM_LAYERS = 2
VOCAB_SIZE = 4096
EMBED_SIZE = 512
HIDDEN_SIZE = 1024
DROPOUT_RATIO = 0.0

EPOCHS = 10
BATCH_SIZE = 64
SEQ_LEN = 32
RNN_CELL = AnalogGRUCell #type of RNN cell
WITH_BIDIR = True
USE_ANALOG_TRAINING = False  # or hardware-aware training

Next, we'll specify a configuration for our Resistive Processing Unit (RPU). The RPU represents the analog memory devices that store the weights of the network. We can either train the model directly in analog, or train it digitally while injecting the non-idealities of analog devices (such as weight noise and clipping) into the training passes. The latter is known as hardware-aware training and is the method that we'll use in this example.


In [None]:
if USE_ANALOG_TRAINING:
    # Define an RPU configuration for analog training
    rpu_config = GokmenVlasovPreset()

else:
    # Define an RPU configuration using inference/hardware-aware training tile
    rpu_config = InferenceRPUConfig()
    rpu_config.forward.out_res = -1.  # Turn off (output) ADC discretization.
    rpu_config.forward.w_noise_type = WeightNoiseType.ADDITIVE_CONSTANT
    rpu_config.forward.w_noise = 0.02  # Short-term w-noise.

    rpu_config.clip.type = WeightClipType.FIXED_VALUE
    rpu_config.clip.fixed_value = 1.0
    rpu_config.modifier.pdrop = 0.03  # Drop connect.
    rpu_config.modifier.type = WeightModifierType.ADD_NORMAL  # Fwd/bwd weight noise.
    rpu_config.modifier.std_dev = 0.1
    rpu_config.modifier.rel_to_actual_wmax = True

    # Inference noise model.
    rpu_config.noise_model = PCMLikeNoiseModel(g_max=25.0)

    # drift compensation
    rpu_config.drift_compensation = GlobalDriftCompensation()
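
To build some intuition for what the hardware-aware settings above ask for, here is a minimal conceptual sketch in plain Python. It is not aihwkit's implementation: the function name `noisy_forward_weights` is hypothetical, drop-connect and the inference noise model are left out, and the only assumptions carried over from the configuration are a fixed clipping value of 1.0 and additive Gaussian noise whose standard deviation is 0.1 times the largest absolute weight.

```python
import random


def noisy_forward_weights(weights, clip_value=1.0, std_dev=0.1, rng=None):
    """Return a clipped, noise-perturbed copy of `weights` (illustration only)."""
    rng = rng or random.Random(0)

    # Clip every weight into [-clip_value, clip_value].
    clipped = [max(-clip_value, min(clip_value, w)) for w in weights]

    # Scale the noise relative to the largest absolute (clipped) weight.
    w_max = max(abs(w) for w in clipped)

    # Add zero-mean Gaussian noise to each weight; the input list is untouched.
    return [w + rng.gauss(0.0, std_dev * w_max) for w in clipped]


weights = [0.4, -1.7, 0.05, 0.9]
print(noisy_forward_weights(weights))
```

The idea is that a network trained while its weights are perturbed like this should be less sensitive to the perturbations it will meet on analog hardware.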


Now we'll talk about data processing. The dataset we'll be using is a copy of 'The Adventures of Sherlock Holmes' by Sir Arthur Conan Doyle. The goal of this network will be to predict the next word given the sequence of words that came before it.

First, we'll perform some basic preprocessing of the dataset. We'll add a space before and after each punctuation character and convert every character to lower case.

In [None]:
def process_data(filename):
    with open(filename, 'r') as f:
        data = f.read()

    #first replace newline characters with spaces so that words on adjacent lines are not merged
    data = data.replace("\n", " ")

    #next we add a space before and after each punctuation character
    data = re.sub('([.,!?"\'()])', r' \1 ', data)
    data = re.sub(r'\s+', ' ', data) #remove extra spaces between words/punctuation

    #for ease of processing, we'll convert all characters to lower case
    data = data.lower()

    return data
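
To see what these steps produce, the sketch below applies the same regular expressions to a short in-memory string instead of a file. The helper name `tokenize_preview` and the sample sentence are made up for illustration, and newlines are replaced with a space here so that words on adjacent lines stay separate.

```python
import re


def tokenize_preview(text):
    """Run the preprocessing steps on a string rather than a file."""
    # Treat line breaks as ordinary whitespace.
    text = text.replace("\n", " ")
    # Surround each punctuation character with spaces.
    text = re.sub('([.,!?"\'()])', r' \1 ', text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text)
    return text.lower()


sample = 'Holmes said, "Come at once!"\nI obeyed.'
preview = tokenize_preview(sample)
print(preview.split())
# ['holmes', 'said', ',', '"', 'come', 'at', 'once', '!', '"', 'i', 'obeyed', '.']
```

Each punctuation mark becomes its own token, so the model can learn where sentences and quotations begin and end.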

Next, we need to find a way to vectorize the inputs before feeding them into the neural network. We'll use a technique called one-hot encoding, and the steps are as follows:

1. Split the dataset on empty spaces and find the most common tokens (words or punctuation). The number of tokens we want to include will determine our vocab size.
2. Create a mapping from each word in our vocab to a number. Words that aren't in our vocab (words with a low occurrence) are mapped to an 'UNK', or unknown, token instead.
3. Iterate through each sentence in our dataset and replace the word with its corresponding number in the mapping we created previously. 

Converting a number to its one-hot embedding is quite simple. The embedding is a vector with 0s everywhere except at the index represented by that number, which contains a 1. Take the following example:

$$\begin{bmatrix} 2 & 0 & 1 & 3 \end{bmatrix}$$ would be converted to: $$\begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

In the vectorize_data function, we'll simply convert each word into its corresponding number. To conserve memory, we'll only convert these numbers to their one-hot embeddings inside the training loop.
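
As a quick check of the encoding, here is a minimal plain-Python sketch that one-hot encodes the index vector [2, 0, 1, 3]. The helper `one_hot` is written only for this illustration; the training loop in this notebook uses `torch.nn.functional.one_hot` instead.

```python
def one_hot(indices, num_classes):
    """Return one row per index, with a single 1 at that index."""
    rows = []
    for idx in indices:
        row = [0] * num_classes
        row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot([2, 0, 1, 3], num_classes=4)
for row in encoded:
    print(row)
# [0, 0, 1, 0]
# [1, 0, 0, 0]
# [0, 1, 0, 0]
# [0, 0, 0, 1]
```

The memory argument follows directly: with a vocab size of 4096 and a sequence length of 32, one encoded sample holds 32 × 4096 = 131,072 values, while its integer form holds only 32.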


In [None]:
def vectorize_data(data, vocab_size, seq_len):
    #get frequency of each word/punctuation
    words = data.split()
    freq_dict = Counter(words)
    vocab = ["UNK"] + [word[0] for word in freq_dict.most_common(vocab_size-1)] #only select the most common words

    #create one-hot encoding dictionaries
    word2idx = dict((word, idx) for idx, word in enumerate(vocab))
    idx2word = dict((idx, word) for idx, word in enumerate(vocab))


    features = []
    labels = []
    for seq in range(0, len(words)-seq_len-1):
        data_sample = words[seq:seq+seq_len]
        next_word = words[seq+seq_len]

        #convert each word to its integer index if it exists. otherwise, it will be replaced with "UNK"
        vectorized_data_sample = [word2idx[word] if word in word2idx else word2idx["UNK"] for word in data_sample]
        vectorized_next_word = [word2idx[next_word] if next_word in word2idx else word2idx["UNK"]]

        features.append(vectorized_data_sample)
        labels.append(vectorized_next_word)

    #shuffle the samples
    p = np.random.permutation(len(features))
    features = np.array(features)[p]
    labels = np.array(labels)[p]

    return features, labels, word2idx, idx2word
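
The vocabulary and windowing logic is easier to follow on a toy input. The sketch below uses hypothetical helpers `build_vocab` and `make_windows` and a nine-word sentence; it skips the shuffling step and, unlike the range used in vectorize_data, keeps the final window.

```python
from collections import Counter


def build_vocab(words, vocab_size):
    """Map the most common words to indices; index 0 is reserved for UNK."""
    counts = Counter(words)
    vocab = ["UNK"] + [w for w, _ in counts.most_common(vocab_size - 1)]
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for i, w in enumerate(vocab)}
    return word2idx, idx2word


def make_windows(words, word2idx, seq_len):
    """Pair each window of seq_len word indices with the index of the next word."""
    unk = word2idx["UNK"]
    ids = [word2idx.get(w, unk) for w in words]
    return [(ids[i:i + seq_len], ids[i + seq_len])
            for i in range(len(ids) - seq_len)]


words = "the dog saw the cat and the dog ran".split()
word2idx, idx2word = build_vocab(words, vocab_size=3)
pairs = make_windows(words, word2idx, seq_len=3)

print(word2idx)   # {'UNK': 0, 'the': 1, 'dog': 2}
print(pairs[0])   # ([1, 2, 0], 1)
```

With a vocab size of 3, only "the" and "dog" keep their own index; every other word collapses to index 0, which is why UNK shows up in generated text.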

Below, we create our GRU neural network. It consists of an embedding layer, a GRU layer, and a decoder layer.

In [None]:
class AnalogRNNNetwork(AnalogSequential):
    """Analog Bidirectional RNN Network definition using AnalogLinear for embedding and decoder."""

    def __init__(self):
        super().__init__()
        self.dropout = nn.Dropout(DROPOUT_RATIO)
        self.embedding = AnalogLinear(VOCAB_SIZE, EMBED_SIZE, rpu_config=rpu_config)
        self.rnn = AnalogRNN(RNN_CELL, EMBED_SIZE, HIDDEN_SIZE, bidir=WITH_BIDIR, num_layers=NUM_LAYERS,
                               dropout=DROPOUT_RATIO, bias=True,
                               rpu_config=rpu_config)
        if WITH_BIDIR:
            self.decoder = AnalogLinear(2*HIDDEN_SIZE, VOCAB_SIZE, bias=True)
        else:
            self.decoder = AnalogLinear(HIDDEN_SIZE, VOCAB_SIZE, bias=True)

    def forward(self, x_in, in_states=None):  # pylint: disable=arguments-differ
        embed = self.dropout(self.embedding(x_in))
        out, out_states = self.rnn(embed, in_states)

        #to predict the output, we'll use the final hidden state of the final layer

        final_layer_state = out_states[-1]

        if WITH_BIDIR: #concat the forward and backward states
            states = torch.cat((final_layer_state[0], final_layer_state[1]), dim=-1)
        else: #only use the forward state
            states = final_layer_state[0]

        out = self.dropout(self.decoder(states))

        return out, out_states


In [None]:
!nvidia-smi

Fri May 27 00:36:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Here, we'll load our data and instantiate our analog GRU model. The AnalogSGD optimizer supports both analog training and hardware-aware training; which one is performed is determined by our RPU configuration. For this example, we've configured it for hardware-aware training.

The training loop grabs a batch and converts the features and labels to their appropriate one-hot encoding. Then, we perform the forward and backward passes and update our analog weights. 

In [None]:
cleaned_data = process_data('/content/gdrive/MyDrive/sherlock_holmes.txt') #change this to the correct filepath!
features, labels, word2idx, idx2word = vectorize_data(cleaned_data, VOCAB_SIZE, SEQ_LEN)
num_train, num_test = int(0.8 * len(features)), int(0.2 * len(features))

#create our training and testing sets
train_features, test_features = features[:num_train], features[-num_test:]
train_labels, test_labels = labels[:num_train], labels[-num_test:]

model = AnalogRNNNetwork().cuda()
optimizer = AnalogSGD(model.parameters(), lr=LEARNING_RATE)
optimizer.regroup_param_groups(model)
criterion = nn.MSELoss()

# train
losses = []
for i in range(EPOCHS):
    loss = 0
    for batch in tqdm(range(0, len(train_features), BATCH_SIZE)):
        batch_features = torch.Tensor(train_features[batch:batch+BATCH_SIZE]).long()
        batch_labels = torch.Tensor(train_labels[batch:batch+BATCH_SIZE]).long()

        #we want the input to be of shape (SEQ_LEN, BATCH_SIZE, VOCAB_SIZE)
        batch_features = torch.transpose(batch_features, 0, 1) 

        #one hot encode the inputs
        batch_one_hot_features = F.one_hot(batch_features, num_classes=VOCAB_SIZE).float().cuda()
        batch_one_hot_labels = F.one_hot(batch_labels, num_classes=VOCAB_SIZE).squeeze().float().cuda()

        optimizer.zero_grad()
        pred, states = model(batch_one_hot_features)

        step_loss = criterion(pred, batch_one_hot_labels)

        loss += step_loss.detach()  #accumulate the value only, without keeping the autograd history

        step_loss.backward()
        optimizer.step()

    print('Epoch = %d: Train Perplexity = %f' % (i, np.exp(loss.detach().cpu().numpy())))

print("Saving Trained Model")
torch.save(model, '/content/gdrive/MyDrive/saved_analog_lstm.pt')


100%|██████████| 1580/1580 [09:20<00:00,  2.82it/s]


Epoch = 0: Train Perplexity = 3.477093


100%|██████████| 1580/1580 [09:16<00:00,  2.84it/s]


Epoch = 1: Train Perplexity = 3.008308


100%|██████████| 1580/1580 [09:15<00:00,  2.84it/s]


Epoch = 2: Train Perplexity = 2.740196


100%|██████████| 1580/1580 [09:13<00:00,  2.85it/s]


Epoch = 3: Train Perplexity = 2.570108


100%|██████████| 1580/1580 [09:13<00:00,  2.85it/s]


Epoch = 4: Train Perplexity = 2.454168


100%|██████████| 1580/1580 [09:13<00:00,  2.86it/s]


Epoch = 5: Train Perplexity = 2.370631


100%|██████████| 1580/1580 [09:13<00:00,  2.86it/s]


Epoch = 6: Train Perplexity = 2.308246


100%|██████████| 1580/1580 [09:12<00:00,  2.86it/s]


Epoch = 7: Train Perplexity = 2.260464


100%|██████████| 1580/1580 [09:13<00:00,  2.86it/s]


Epoch = 8: Train Perplexity = 2.222378


100%|██████████| 1580/1580 [09:12<00:00,  2.86it/s]


Epoch = 9: Train Perplexity = 2.192434
Saving Trained Model
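
A note on reading these numbers: the value printed as "Train Perplexity" is the exponential of the summed per-batch MSE losses, so it is not directly comparable to perplexity figures reported elsewhere. One conventional definition is the exponential of the mean negative log-probability that the model assigns to the correct next token. The sketch below shows that definition in plain Python; the function name and the example probabilities are illustrative and not taken from this notebook.

```python
import math


def perplexity(true_token_probs):
    """exp of the mean negative log-probability given to the correct tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that guesses uniformly over a 4096-word vocab has a perplexity
# of about 4096, while a model that is always certain and correct scores 1.
uniform_guess = perplexity([1 / 4096] * 8)
always_right = perplexity([1.0, 1.0])
print(uniform_guess, always_right)
```

Computing this for the network would require outputs that can be read as probabilities, for example by training with a cross-entropy loss instead of MSE.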


We finally get to the exciting part: writing some sentences! To start, we take a random snippet from our test_features. This will be our starting sentence, and our model will build on it by predicting the next word. Since the model is autoregressive, it will use its previous predictions to predict the words that follow.

To find our prediction, we will find the index of the largest value in the output vector and use our reverse lookup table to find the word that index corresponds to. We'll append this prediction to our sentence and repeat these steps until we have predicted 30 words!
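
The procedure described above is often called greedy decoding. Here is a minimal sketch of the loop with a toy stand-in for the network: `toy_model` is hypothetical and simply favours the token after the last one it saw, so that the sliding window and the argmax step are easy to trace.

```python
def greedy_generate(predict_scores, seed, steps):
    """Repeatedly append the highest-scoring token, feeding back a fixed-size window."""
    seq_len = len(seed)
    tokens = list(seed)
    for _ in range(steps):
        window = tokens[-seq_len:]          # the most recent seq_len tokens
        scores = predict_scores(window)     # one score per vocab entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        tokens.append(next_id)
    return tokens


def toy_model(window, vocab_size=5):
    """Hypothetical model: scores (last token + 1) mod vocab_size highest."""
    target = (window[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


generated = greedy_generate(toy_model, seed=[0, 1, 2], steps=4)
print(generated)  # [0, 1, 2, 3, 4, 0, 1]
```

Because each prediction is fed back in as input, an early mistake influences every later word, which is one reason generated text tends to drift away from the seed sentence.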

In [None]:
# model = torch.load('/content/gdrive/MyDrive/saved_analog_lstm.pt')
#write some sentences!
PREDICT_LENGTH = 30
for _ in range(10):
    rand_idx = np.random.randint(low=0, high=len(test_features)) #high is exclusive
    sentence = list(test_features[rand_idx]) #start with a sentence from the test set

    for i in range(PREDICT_LENGTH):
        #after one hot encoding, the input shape should be (SEQ_LEN, 1, VOCAB_SIZE)
        batch_features = torch.Tensor(sentence[i:]).long()
        batch_features = torch.unsqueeze(batch_features, 0)
        batch_features = torch.transpose(batch_features, 0, 1)
        batch_one_hot_features = F.one_hot(batch_features, num_classes=VOCAB_SIZE).float().cuda()

        #predict the next word and add it to the existing sentence
        with torch.no_grad(): #gradients are not needed when generating text
            pred, _ = model(batch_one_hot_features)
        idx = int(torch.argmax(pred[0]))
        sentence.append(idx)

    print(" ".join([idx2word[idx] for idx in sentence]))
    print("\n\n")

find her eyes fixed upon me with a most searching gaze . she said nothing , but i am convinced that she had UNK that i had a UNK in my hand . remarks . wished i cold ezekiah forming parents desperate " . energy winchester peering i defined supper arranged absurdly . stark individuality obvious patted whimsical ship ball barred wilson



. " " alas ! " replied our visitor , " the very horror of my situation lies in the fact that my fears are so vague , and my suspicions depend entering minute him; glitter trace vanished thoughtfully discovering jewels friday see certainty the wet the . i command encyclopaedia the year extending lay . . waste . contains shelf .



, that is it . " it was a widespread , comfortable-looking building , two-storied , UNK , with great yellow blotches of UNK upon the grey walls . the drawn blinds rush barque vacancies the broken died painful armchair shattered the on " impunity constable spotted shop appearance repeated heavily stoner complete aloud purpose fl

# Next Steps
From the results, we can see that the current model doesn't always produce coherent sentences. There are a few things we can try to improve performance: 
1. Increase the training data by adding another book by Sir Arthur Conan Doyle
2. Modify the model architecture (using LSTMs, adding more layers, larger hidden size, etc.)
3. Instead of using a fixed word-level vocabulary, which relies on an UNK token for rare words, we can use a different encoding mechanism, such as byte pair encoding (more on this [here](https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10))
4. Another exciting architecture for NLP is the [Transformer](https://arxiv.org/abs/1706.03762)! And what's even better is that it can also be greatly accelerated using Analog AI. 
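
To make item 3 more concrete, here is a minimal sketch of how byte pair encoding learns its merges, assuming words are split into single characters to begin with. The function `learn_bpe_merges` and the toy word list are illustrative only; a production tokenizer would also handle word boundaries and apply the learned merges to unseen text.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties are broken in favour of the pair that was seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)        # [('l', 'o'), ('lo', 'w')]
print(dict(vocab))
```

After two merges, "low" is a single symbol while "lower" and "lowest" are spelled as "low" plus leftover characters, so a rare word can still be represented without falling back to UNK.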
