# Minimal character-level, ultra-basic plain vanilla RNN model

This notebook is an annotated, updated-for-Python3 fork of [one of several explanatory gists on Torch neural networks, RNNs, and LSTMs](https://gist.github.com/karpathy), specifically [**min-char-rnn.py**](https://gist.github.com/karpathy/d4dee566867f8291f086), which worked with Python2 and was written several years ago by [Andrej Karpathy](https://gist.github.com/karpathy) ([Twitter@karpathy](https://twitter.com/karpathy)).

You will probably want to play with the parameters, or change things in this code or in the **input.txt** file, to see how it works. Others, like [Jy-Yoon](https://github.com/JY-Yoon) and [Nikhil Barhate](https://github.com/nikhilbarhate99), have worked through the [original **min-char-rnn.py**](https://gist.github.com/karpathy/d4dee566867f8291f086) and shared their own Python3-updated takes: [an RNN implementation using NumPy](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy) and [Char RNN PyTorch](https://github.com/nikhilbarhate99/Char-RNN-PyTorch), respectively. There are plenty more of these out there; we can't look at them all, but it's worth looking at several to see how they differ.

Naturally, the [BSD license](https://en.wikipedia.org/wiki/BSD_licenses) of the [original **min-char-rnn.py**](https://gist.github.com/karpathy/d4dee566867f8291f086)  would carry over to this particular .ipynb notebook given the notebook's derivation, but [the choice of that license](https://choosealicense.com/) does not apply to the other files in this repository.

In [4]:
# This .ipynb notebook was developed using Python 3.9.14 and NumPy 1.23.5; your mileage could vary with other versions.

import numpy as np

#### NumPy needs to be installed -- could we do this without NumPy or use a different library?

That might take some re-invention of the wheel. We can look at what we actually need NumPy for, and then we might wonder why NumPy is not built in or included with every Python installation. That could be a question worth pondering if we are doing trillions of ephemeral installations of Python or Jupyter Notebooks.

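We will eventually need NumPy for the following, all of which appear in the code cells below:

* creating and randomly initializing arrays: `np.zeros()`, `np.zeros_like()`, `np.random.randn()`
* the math of the forward and backward passes: `np.dot()`, `np.tanh()`, `np.exp()`, `np.log()`, `np.sum()`, `np.sqrt()`
* clipping gradients and sampling characters: `np.clip()`, `np.random.choice()`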

We don't need NumPy right off the bat for [`open()`](https://docs.python.org/3/library/functions.html#open), which is a [built-in function in Python3](https://docs.python.org/3/library/functions.html), as are [`list()`](https://docs.python.org/3/library/functions.html#func-list), [`set()`](https://docs.python.org/3/library/functions.html#func-set), and [`range()`](https://docs.python.org/3/library/functions.html#func-range), which is used in the code below. [`range()`](https://docs.python.org/3/library/functions.html#func-range) replaced `xrange()` from Python2. In Python2, `range()` built a complete list in memory, while [`xrange()` was a sequence object that evaluated *lazily*](https://stackoverflow.com/questions/94935/what-is-the-difference-between-range-and-xrange-functions-in-python-2-x). In Python3, `range()` itself is the lazy sequence object, so a separate `xrange()` is not needed and no longer exists.
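
As a quick illustration of how `range()` behaves in Python3, here is a minimal sketch. The variable names `big` and `small` are made up for this example and are not part of the original gist.

```python
# In Python 3, range() returns a lazy sequence object, not a list.
big = range(10**9)            # created instantly; no billion-element list is built
print(type(big).__name__)     # the object is a 'range', not a 'list'
print(len(big), big[12345])   # length and indexing work without materializing values

# Materialize explicitly only when a real list is needed.
small = list(range(5))
print(small)
```

Because the values are produced on demand, loops such as `for t in range(len(inputs))` in the cells below do not allocate a list of indices first.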

This general pattern, where Python improves and code that ran flawlessly the last time you used it turns into JUNK that is no longer supported, is a good reason to keep up with the latest Python3 dev list discussions in your RSS reader. ***Therefore, it's perhaps worth delving into the minutiae just a bit in order to really understand data I/O.***

First, we use the [`open()` function](https://docs.python.org/3/library/functions.html#open), which returns a file object with a [`read()` method](https://peps.python.org/pep-3116/). Called with no argument (or with a size of -1), `read()` reads the whole text file from the current position to the end. We could pass another size, like `read(60)`, to read only the first 60 characters of data, maybe to debug how our code processes a small sample instead of the whole file.
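
To see what the size argument of `read()` does, here is a small sketch. It writes a throwaway temporary file so that it does not depend on **input.txt**; the text and the names `sample_text`, `first_five` and `rest` are assumptions made for illustration.

```python
import os
import tempfile

# Write a small throwaway text file so the example does not depend on input.txt.
sample_text = "hello world, this is a tiny corpus"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample_text)
    path = tmp.name

with open(path, "r", encoding="utf-8") as f:
    first_five = f.read(5)   # reads at most 5 characters from the current position
    rest = f.read()          # no argument (or -1): reads everything that is left

os.remove(path)              # clean up the temporary file
print(repr(first_five))
print(repr(rest))
```

Using `with open(...)` also closes the file when the block ends, which the one-line `open('input.txt', 'r').read()` form leaves to the interpreter.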

In Python, a function like [`open()`](https://docs.python.org/3/library/functions.html#open) is a block of code that accomplishes a certain task and runs when it is called. We can pass data, known as arguments, into a function, and a function can return data as well. A function that is defined inside a class and called on an object of that class is called a method. We call a method just like a function, except that we write the object first, then [a dot operator](https://www.askpython.com/python/built-in-methods/dot-notation), then the method name; hence the period before the `r` in `read()` in the `open('input.txt', 'r').read()` call.
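
The difference in call syntax can be shown in a few lines. This is only a sketch: `io.StringIO` is used here as an in-memory stand-in for a file opened with `open()`, which is an assumption of the example rather than something the notebook does.

```python
import io

# A function is called by name; a method is called on an object with a dot.
f = io.StringIO("abc")   # in-memory stand-in for a file object
text = f.read()          # read() is a method of the file-like object f
n = len(text)            # len() is a built-in function
upper = text.upper()     # upper() is a method of the str object text
print(text, n, upper)
```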

In addition to the functions and methods, you can play around with the [simple text file **input.txt**](input.txt), which the following `open()` call reads, to see how it affects what comes after. The first cells process quickly, but you might want to let the last cell run for a while to see what happens.

In [5]:
data = open('input.txt', 'r').read() # just a simple plain text file, but could be any text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)

print ("The data in the 'input.txt' file has %d characters -- %2d of those are unique." % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

The data in the 'input.txt' file has 3010935 characters -- 99 of those are unique.
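
The two dictionaries built above are what turn text into integers and back. Here is a minimal sketch on a tiny stand-in string; the `demo_` names and the use of `sorted()` are choices made for this example, not part of the original cell.

```python
# Stand-in for the contents of input.txt (an assumption for illustration).
demo_data = "hello world"
demo_chars = sorted(set(demo_data))   # sorted() makes the mapping reproducible
demo_char_to_ix = {ch: i for i, ch in enumerate(demo_chars)}
demo_ix_to_char = {i: ch for i, ch in enumerate(demo_chars)}

# Encode a string to integer indices, then decode it again.
encoded = [demo_char_to_ix[ch] for ch in "hello"]
decoded = "".join(demo_ix_to_char[ix] for ix in encoded)
print(demo_chars)
print(encoded)
print(decoded)
```

The notebook uses `list(set(data))`, whose ordering can differ from one Python session to the next, so the same character may get a different index on each run. Sorting is one way to pin the mapping down if weights are ever saved and reloaded.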


#### What are RNN hyperparameters?

*Don't get excited* ... "hyper" just means these settings sit over the entire landscape of your model: they are chosen before training starts and are not learned from the data. Hyperparameters are sort of like the rug in the Big Lebowski; they kind of tie the whole room together.

Or, maybe that's a bad metaphor ... but [RNN hyperparameters](https://learning.oreilly.com/beta-search/?q=%22hyperparameters%22%20RNN&type=*) are indeed an excellent starting point for getting a sense of why RNNs are ridiculously effective.

In [6]:
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1 # step size used in the Adagrad parameter update
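
To get a feel for what `seq_length` controls, here is a sketch of how the training loop further down slices the text into input and target chunks. The string `demo_text` and the small `demo_seq_length` are illustrative assumptions; the notebook itself uses 25.

```python
demo_text = "hello world"
demo_seq_length = 4   # illustrative value; the notebook uses 25

chunks = []
p = 0
# Same stopping rule as the training loop: stop when a full chunk plus
# its one-character-shifted target no longer fits in the data.
while p + demo_seq_length + 1 < len(demo_text):
    inputs = demo_text[p:p + demo_seq_length]
    targets = demo_text[p + 1:p + demo_seq_length + 1]   # inputs shifted by one
    chunks.append((inputs, targets))
    p += demo_seq_length

for inputs, targets in chunks:
    print(repr(inputs), "->", repr(targets))
```

Each target character is simply the character that follows the corresponding input character, which is what makes this a next-character prediction task.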

#### Wait, what are RNN model parameters?

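[RNN model parameters](https://learning.oreilly.com/beta-search/?q=%22model%20parameters%22%20RNN%20%22machine%20learning%22&type=*) are the weights and biases that training adjusts; here they are the three weight matrices and two bias vectors initialized below. They would be worthy of some serious reading and study.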

In [7]:

Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
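
As a rough sense of scale, the shapes above can be turned into a parameter count. This sketch assumes `hidden_size = 100` from the hyperparameter cell and the 99 unique characters reported for this particular **input.txt**; a different text file would change `vocab`.

```python
hidden = 100   # hidden_size from the notebook
vocab = 99     # unique characters reported for this input.txt

# (rows, columns) of each parameter array, matching the cell above.
shapes = {
    "Wxh": (hidden, vocab),
    "Whh": (hidden, hidden),
    "Why": (vocab, hidden),
    "bh": (hidden, 1),
    "by": (vocab, 1),
}
counts = {name: rows * cols for name, (rows, cols) in shapes.items()}
total = sum(counts.values())
print(counts)
print(total)
```

Under those assumptions the whole model has just under thirty thousand trainable numbers.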

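#### What is a 'loss function'?

A ['loss function' in RNNs](https://learning.oreilly.com/beta-search/?q=%22loss%20function%22%20RNN%20%22machine%20learning%22&type=*) measures how far the model's predictions are from the targets; here it is the cross-entropy of the softmax probabilities assigned to the correct next characters. It is a much larger topic that you probably want to ponder in detail.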

In [8]:

def lossFun(inputs, targets, hprev):
  """
  inputs, targets are both lists of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0

  # first, the forward pass

  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)

  # now, the backward pass, to compute gradients going backwards

  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]  
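
The line `dy[targets[t]] -= 1` in the backward pass relies on the fact that the gradient of softmax cross-entropy with respect to the scores is "probabilities minus one-hot target". One way to convince yourself is a numerical check on a single time step. This is only a sketch: the helper names `softmax` and `cross_entropy`, the 5-entry vocabulary and the max-subtraction for numerical stability are assumptions of this example, not part of `lossFun`.

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax for a column vector of scores."""
    e = np.exp(y - np.max(y))   # subtracting the max does not change the result
    return e / np.sum(e)

def cross_entropy(y, target):
    """Loss for one time step: -log of the probability given to the target."""
    return -np.log(softmax(y)[target, 0])

rng = np.random.default_rng(0)
y = rng.standard_normal((5, 1))   # scores for 5 hypothetical vocabulary entries
target = 2

# Analytic gradient, as in lossFun: probabilities with 1 subtracted at the target.
dy = softmax(y)
dy[target] -= 1

# Numerical gradient by central differences.
eps = 1e-5
dy_num = np.zeros_like(y)
for i in range(y.shape[0]):
    y_plus, y_minus = y.copy(), y.copy()
    y_plus[i, 0] += eps
    y_minus[i, 0] -= eps
    dy_num[i, 0] = (cross_entropy(y_plus, target) - cross_entropy(y_minus, target)) / (2 * eps)

max_diff = float(np.max(np.abs(dy - dy_num)))
print(max_diff)   # should be tiny if the analytic gradient is right
```

The same finite-difference idea can be applied to the weight matrices to check the rest of the backward pass, at the cost of more loops.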

That's a lot to process ... but we're not through.

#### What function definition comes next?

In [9]:
def sample(h, seed_ix, n):
  """ 
  sample a sequence of integers from the model 
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in range(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes
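
`sample()` draws each next character at random in proportion to the softmax probabilities. A common variation, which is not in the original code, is to divide the scores by a "temperature" before the softmax to make sampling more or less adventurous. A minimal sketch, where the function name `softmax_with_temperature` and the example scores are hypothetical:

```python
import numpy as np

def softmax_with_temperature(y, temperature=1.0):
    """Turn scores into probabilities; temperature is an added, hypothetical knob."""
    z = y / temperature
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / np.sum(e)

y = np.array([[2.0], [1.0], [0.1]])   # made-up scores for a 3-character vocabulary
p_default = softmax_with_temperature(y, 1.0)   # same result as the notebook's softmax
p_sharp = softmax_with_temperature(y, 0.5)     # more weight on the top score
p_flat = softmax_with_temperature(y, 2.0)      # closer to uniform

np.random.seed(0)   # only to make this demo repeatable
ix = np.random.choice(range(3), p=p_default.ravel())   # same call as in sample()
print(p_default.ravel(), p_sharp.ravel(), p_flat.ravel(), ix)
```

Lower temperatures tend to produce more repetitive but more plausible text, while higher ones produce more varied and more error-prone text.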


#### We're ready to initialize some of the parameters and start things off!

***We can see that by the time it runs with this data for 30,000 iterations or so, the loss function is pretty close to being minimized*** ... you can see what happens when you want to change it up a bit.
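
The training cell starts its smoothed loss at `-np.log(1.0/vocab_size)*seq_length`, which is the loss of a model that guesses every character with equal probability. Here is that arithmetic worked out, assuming the 99 unique characters reported for this **input.txt** and `seq_length = 25`:

```python
import math

vocab = 99   # unique characters reported for this input.txt
steps = 25   # seq_length

# A model that knows nothing assigns probability 1/vocab to every character,
# so each step costs -log(1/vocab) and a full sequence costs steps times that.
per_char = -math.log(1.0 / vocab)
initial_loss = per_char * steps
print(round(per_char, 4), round(initial_loss, 3))
```

This is why the first progress line in the output reads about 114.88: the loss only has meaning relative to that starting point.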

In [10]:
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0

while True:
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0: 
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print ("---- \n %s \n ----" % (txt, ))

  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print ("iter %d, loss: %f" % (n, smooth_loss)) # print progress
  
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by], 
                                [dWxh, dWhh, dWhy, dbh, dby], 
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

  p += seq_length # move data pointer
  n += 1 # iteration counter 


---- 
 z[c' kAPHâAVpav°!MoC°r½£62wW4»`vG7dkX*b)°¢0R;-)t71fIÂpMzQ0"Xxx´LchP6!Ad( €c™5rus([h€X¼G±´Tr).z30L°®-¢l€P™UT-Sm±Yb7wG!:[™v®[-´½Ltm
Mi4H:¼ 3Il½&b`]bHdyDuC'sC8yMK[GtH5°XA±eV½,0T¼wDTX)h)XZhS°C©?©jCC(¼ZbW" 
 ----
iter 0, loss: 114.877990
---- 
 uIr dBf]elu"
 yicmTerT Ad(ke ªSNeAdevFu¢
o [vvnYt¢eheo&fvs Mn] oeiheu arihfh1 ugurK hBxsMXe rooly   r Hrco tteeIe hdtt°e u b '  tKued  GnTBSrTo
nBv dMrLio]o'eMTrr on[™shaTVTA f7Mfsrai  myOtgegTrvmn t  
 ----
iter 100, loss: 113.774804
---- 
  talhc at r vrD-haME. hYis8rflorrh nF,taty id i rolithdWvG xtnrhii k crul he alr
ce eh f  e ncsiy Tst nckps ofhreeDe eaao  hs F rmrDn e ®kar.rd  te srrenAns tparlt ehsh Xtonen0 nr A  Wiegofretvcne aaa 
 ----
iter 200, loss: 111.048840
---- 
  Tgl 
i
n, onrptie!vg.oeoqid!6n stlies sy h
vsri!oraa ps ramrceehhaf!mAmlttist!nbr tsimL! oFneeeirt oye slbtecr drhsto!l   ssii ! ha ahtni!mnrho. ie!e 
r het
!o vu et  !o aehnn e! raete ii!,cmeWe s !g 
 ----
iter 300, loss: 108.214304
---- 
 eit  shysommrehT  t iy   t 

KeyboardInterrupt: 
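
The Adagrad update used in the training loop can be tried in isolation on a toy problem. This sketch minimizes f(x) = sum(x**2) with the same two update lines; the starting point, the number of steps and the variable names are assumptions made for illustration.

```python
import numpy as np

x = np.array([5.0, -3.0])   # toy parameters (illustrative)
mem = np.zeros_like(x)      # running sum of squared gradients, like mWxh etc.
lr = 1e-1                   # same learning_rate as the notebook

start = float(np.sum(x ** 2))
for _ in range(500):
    grad = 2 * x                            # gradient of f(x) = sum(x**2)
    mem += grad * grad                      # accumulate squared gradients
    x += -lr * grad / np.sqrt(mem + 1e-8)   # per-parameter scaled step
end = float(np.sum(x ** 2))
print(start, end)
```

Because `mem` only grows, each parameter's effective step size shrinks over time, which is part of why the loss in the output above falls slowly rather than in big jumps.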