# Minimal character-level, ultra-basic plain vanilla RNN model

This notebook is actually just an annotated, updated-for-Python3 fork of [one of several explanatory gists on Torch, neural networks, RNN, LSTM](https://gist.github.com/karpathy), specifically [**min-char-rnn.py**](https://gist.github.com/karpathy/d4dee566867f8291f086), which worked with Python2 and was written several years ago by [Andrej Karpathy](https://gist.github.com/karpathy) ([Twitter@karpathy](https://twitter.com/karpathy)).

You will probably want to play with parameters or change things in this code or the **input.txt** file to see how it works. Others, like [Jy-Yoon](https://github.com/JY-Yoon) or [Nikhil Barhate](https://github.com/nikhilbarhate99), have worked through the [original **min-char-rnn.py**](https://gist.github.com/karpathy/d4dee566867f8291f086) and shared their own takes on it: [an RNN implementation using Numpy](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy) and [Char RNN PyTorch](https://github.com/nikhilbarhate99/Char-RNN-PyTorch), respectively. There are obviously plenty more of these out there -- we can't look at them all, but it's worth looking at a few more to see how they differ.

Naturally, the [BSD license](https://en.wikipedia.org/wiki/BSD_licenses) of the [original **min-char-rnn.py**](https://gist.github.com/karpathy/d4dee566867f8291f086)  would carry over to this particular .ipynb notebook given the notebook's derivation, but [the choice of that license](https://choosealicense.com/) does not apply to the other files in this repository.

In [10]:
# This .ipynb notebook was developed using Python 3.9.14 and NumPy 1.23.5; obviously, your mileage could vary with other versions.

import numpy as np

#### NumPy needs to be installed -- could we do this without NumPy or use a different library?

That would take some re-invention involving more than a few wheels ... remember, the NumPy library uses vectorization, and where possible takes advantage of SIMD, to process multiple data points in a single processor instruction, and there's [a ton of stuff going on under the NumPy hood](https://numpy.org/doc/stable/reference/index.html). There are a few decades' worth of distilled dev effort in it ... starting with the [Matrix Special Interest Group, which was proposed in August 1995](https://mail.python.org/pipermail/matrix-sig/1995-August/date.html) and which led to NumPy's ancestor, Numeric, then to Numarray, and then to the NumPy library in 2005 ... so a lot has gone into getting NumPy to this state of efficiency as the fundamental package for scientific computing with Python.
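
To make the "could we do this without NumPy?" question concrete, here is a minimal sketch comparing a plain-Python matrix-vector product with `np.dot`. The helper name `matvec_pure_python` and the small random shapes are assumptions made up for this illustration; they are not part of the original gist.

```python
import numpy as np

def matvec_pure_python(matrix, vector):
    """Multiply a list-of-lists matrix by a list vector using plain Python loops."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]

rng = np.random.default_rng(0)      # seeded so the example is repeatable
W = rng.standard_normal((4, 3))     # hypothetical small weight matrix
x = rng.standard_normal(3)          # hypothetical input vector

slow = matvec_pure_python(W.tolist(), x.tolist())   # no NumPy in the arithmetic
fast = np.dot(W, x)                                  # vectorized equivalent

print(np.allclose(slow, fast))
```

Both routes give the same numbers; the difference is that the RNN below performs this kind of product at every time step of every iteration, which is where the vectorized version earns its keep.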

Of course, NumPy continues to be under [VERY active development](https://mail.python.org/archives/list/numpy-discussion@python.org/#most-active) ... which means that the NumPy library is not going to be part of the standard library for a while, because the standard library tends to be very stable and there's a lot of work still being done on the NumPy library. In other words, *do NOT expect NumPy-using code that ran fine last week or five years ago to* ***just*** *work.* **NumPy is going to continue to evolve somewhat rapidly** ... the pace of evolution will be partly a matter of the refinements necessary given the volume of computation being done. NumPy is no longer anything like the isolated, almost esoteric work of different researchers and grad students working on *freely extensible* matrix manipulation routines for scientists and engineers. It's not that tools like [GAUSS](https://www.aptech.com/blog/gauss23/?utm_source=homepage) or [Matlab](https://www.mathworks.com/products/matlab.html) are bad tools -- it's mostly that they are not free and open source, thus not *freely extensible* ... you will notice that the early stalwarts in Python development tended to have either .gov or .edu email addresses, because it's just not possible to RAPIDLY get approval for a new software license in a rapidly evolving research environment in academia or government.

Extensibility characterizes Python. It's not just the NumPy library. Extensibility is the characteristic of ALL of Python, all 648,469 developers involved in extending Python and certainly ALL [421,533 Python projects intent on getting to a production-stable release](https://pypi.org/search/?q=&o=-zscore&c=Development+Status+%3A%3A+5+-+Production%2FStable).

It's not just that the language has evolved over the last several decades ... it is still evolving, and evolving faster, and that will continue as the volume of users grows. Who uses and extends Python defines what Python will become ... in a larger sense, Python is a ***recursively*** intelligent hypernetwork ... and that's WHY, all the way back to Python's roots in its [high-level, prototyping precursor, ABC](https://en.wikipedia.org/wiki/ABC_(programming_language)), human-understandable *Pythonic* collaborative thinking across multilingual networks defines what Python is. MANY more people using Python now means that many more people are working on extending and developing the language, i.e., the extensibility of free, open source development is mainstream, not just .gov or .edu email addresses anymore.

***The defining attribute of the Python language, especially the NumPy library, is the language's CONSTANT state of improvement.*** It's exciting because everything about Python is constantly getting better and then better and better ... but the flipside is that your old favorite code, which you artisanally crafted into a state of comfortable, flawless perfection the last time you used it an hour ago, might now be some kind of JUNK based on something that is no longer supported. This is a good reason to think about keeping up with the latest Python3 dev list discussions in your email reader ... maybe you don't need a separate email account and tweaked email sorter just for dev lists ... but you should understand why things like NumPy or even Python are ***not fixed in stone*** UNLESS you make it that way. This is why performance-optimized AI/HPC software containers, with pre-trained AI models and Jupyter Notebooks or fully-controlled configurations of all software dependencies, have become the way that many, e.g. [NVIDIA's NGC private container registry](https://catalog.ngc.nvidia.com/), prefer sharing their own PRIVATE machine learning demonstrations of how AI or HPC workloads might be accomplished.

The statement `import numpy as np` is deceptively simple and assumes that you understand these things about NumPy and Python ... many do not. Don't take the continued development of NumPy for granted. 

##### The defining attribute of the Python language, especially the NumPy library, is the language's CONSTANT state of improvement.

Of course, we won't need NumPy right off the bat ... in the immediately following sub-block of code, we use the [`open()` function](https://docs.python.org/3/library/functions.html#open), which is [built-in in Python3](https://docs.python.org/3/library/functions.html) ... just as the [`list()` function](https://docs.python.org/3/library/functions.html#func-list) and [`set()` function](https://docs.python.org/3/library/functions.html#func-set) are ... and as [`range()`](https://docs.python.org/3/library/functions.html#func-range) is, which is used in code further below to process the opened inputs ... the point here is that [`range()`](https://docs.python.org/3/library/functions.html#func-range) replaced `xrange()` from Python2; `xrange()` no longer exists in Python3. Python3 was a fairly major, backward-incompatible revision. Backward incompatibility was deemed necessary because there were lots of things that really needed to be firmly replaced. For example, Python2's [xrange() is a sequence object which evaluates *lazily*](https://stackoverflow.com/questions/94935/what-is-the-difference-between-range-and-xrange-functions-in-python-2-x), while Python2's `range()` built a full list in memory ... since Python3's `range()` is itself lazy and behaves like the old `xrange()`, there's no way or reason to call `xrange` in Python 3.
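
A small sketch of what "lazy" means for Python3's `range()`: the object below describes a billion integers without ever storing them. The variable name `big` is just an illustration, and the memory remark in the comments is a general expectation for CPython, not a measured result.

```python
# range() in Python 3 only stores start, stop and step, so creating it is instant
big = range(10**9)

print(len(big))            # 1000000000, computed from start/stop/step
print(big[123456789])      # 123456789, computed on demand by indexing
print(999_999_999 in big)  # True, membership is arithmetic, not a scan

# Only when we ask for actual values do integers get produced, one at a time
first_three = [i for i in big[:3]]
print(first_three)         # [0, 1, 2]
```

In Python2, `range(10**9)` would have tried to build a list of a billion integers, which is the behaviour `xrange()` existed to avoid.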

***It is perhaps worth delving into the minutiae of the lower level just a bit in order to REALLY understand some of the implications of data I/O ... and why such a utilitarian thing is typically built-in or part of the standard library.*** First, we use the [`open()` function](https://docs.python.org/3/library/functions.html#open) ... it returns a file object, which works with the [read() method](https://peps.python.org/pep-3116/) ... if the size argument is left out (empty parentheses), or given as -1 or `None`, `read()` takes in the whole text file from the current position to the end ... we could choose another argument for read() like 30, 300 or maybe 3,000 ... to read at most that many characters from the start of the file and ignore the rest ... maybe to debug how our code processes the data differently if we ignore a big part of it.
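
The behaviour of `read()` with and without a size argument can be sketched as follows. To keep the example self-contained it writes a throwaway file called `demo.txt` into a temporary folder; that file name and its contents are stand-ins invented for this illustration, not the notebook's `input.txt`.

```python
import os
import tempfile

sample_text = "hello world, this is a tiny stand-in for input.txt"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")   # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample_text)

    # No argument: read everything from the current position to the end
    with open(path, "r", encoding="utf-8") as f:
        whole = f.read()

    # With an argument: read AT MOST that many characters, then stop
    with open(path, "r", encoding="utf-8") as f:
        first_five = f.read(5)    # 'hello'
        next_six = f.read(6)      # ' world', reading resumes where it stopped

print(repr(first_five), repr(next_six), len(whole))
```

Note that a second `read(n)` continues from where the first one stopped, so the size argument limits how much is read and never skips anything.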

In Python, a Function like the [`open()` function](https://docs.python.org/3/library/functions.html#open) is a block of code that accomplishes a certain task and runs when it is called. We can pass data, known as arguments, into a Function, and a Function can return data as well ... A Function defined inside a class and associated with an object or class is called a Method ... to use a Method, we call it just like we call a Function, but since Methods are associated with a class/object, we need to use the class/object name and [a dot operator](https://www.askpython.com/python/built-in-methods/dot-notation) to call the Method ... thus the period before `read()` in the `open('input.txt', 'r').read()` call, where `open()` returns a file object and `read()` is a Method of that object.
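
As a rough illustration of the Function versus Method distinction, the sketch below uses `io.StringIO` as an in-memory stand-in for a real file, so nothing has to exist on disk. The function name `count_unique` and the sample string are made up for the example.

```python
import io

def count_unique(text):
    """A plain Function: called by name, with its data passed in as an argument."""
    return len(set(text))

# io.StringIO behaves like an opened text file, but lives in memory
fake_file = io.StringIO("abracadabra")

contents = fake_file.read()         # a Method: reached through the object with a dot
n_unique = count_unique(contents)   # a Function: no object and no dot needed

print(contents, n_unique)           # abracadabra 5
```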

In addition to the Functions and Methods ... you should play around with the [simple text file **input.txt**](input.txt) a bit ... you can and should also play around with the code in the following cells. It is broken up like this in order to illustrate something about execution, e.g. the first cells process quickly, but you might want to let that last cell run for a while to see what happens.

In [11]:
data = open('input.txt', 'r').read() # input.txt is just a simple plain text file in this folder of this repo, but could be any *.txt file that you want to use
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)

print ("The data in the 'input.txt' file has %d characters -- %2d of those are unique." % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

The data in the 'input.txt' file has 5953 characters -- 82 of those are unique.
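
To see what the two dictionaries do, here is a minimal round-trip sketch on a short made-up string. One deliberate difference from the cell above: it uses `sorted(set(...))` instead of `list(set(...))`, purely as an assumption to make the character order reproducible from run to run.

```python
text = "hello world"                     # hypothetical stand-in for the file contents

chars = sorted(set(text))                # sorted() only to get a repeatable order
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

encoded = [char_to_ix[ch] for ch in text]                # characters -> integers
decoded = "".join(ix_to_char[ix] for ix in encoded)      # integers -> characters

print(len(chars))    # 8 unique characters, including the space
print(encoded)
print(decoded)       # hello world
```

The network only ever sees the integers; the second dictionary is what turns its sampled output back into readable text.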


#### What are RNN hyperparameters?

*Don't get excited* ... hyper means over the entire landscape of your model ... hyperparameters are the settings you choose before training starts, rather than values the model learns ... they are sort of like the rug in the Big Lebowski, they kind of tie the whole room together.

Or, maybe that's a bad metaphor ... but [RNN hyperparameters](https://learning.oreilly.com/beta-search/?q=%22hyperparameters%22%20RNN&type=*) are indeed an excellent starting point for getting a sense of why RNNs are ridiculously effective.

In [12]:
hidden_size = 100 # size of hidden layer of neurons
seq_length = 25 # number of steps to unroll the RNN for
learning_rate = 1e-1
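
As a sketch of what `seq_length` controls, the snippet below cuts one input window and its target window out of a short made-up string, using the same slicing that appears in the training loop further down. The sample sentence and the small `seq_length` of 5 are assumptions chosen so the result is easy to read.

```python
data = "the quick brown fox"    # hypothetical stand-in for the file contents
seq_length = 5                  # smaller than the notebook's 25, for readability
p = 0                           # data pointer, as in the training loop

inputs = data[p:p + seq_length]             # what the RNN reads
targets = data[p + 1:p + seq_length + 1]    # the same window shifted by one character

for ch_in, ch_out in zip(inputs, targets):
    print(repr(ch_in), "->", repr(ch_out))
```

Each input character is paired with the character that follows it, so a larger `seq_length` means the gradients flow back through more time steps per update.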

#### Wait, what are RNN model parameters?

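[RNN model parameters](https://learning.oreilly.com/beta-search/?q=%22model%20parameters%22%20RNN%20%22machine%20learning%22&type=*) would be worthy of some serious reading and study. In this notebook they are the three weight matrices and two bias vectors defined in the next cell, and unlike the hyperparameters above, these are the values that training adjusts.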

In [13]:

Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias
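
For a sense of scale, here is a quick tally of how many trainable numbers these five arrays hold. It assumes `hidden_size = 100` from the hyperparameter cell and `vocab_size = 82` from the output printed earlier; a different `input.txt` would give a different vocabulary size and total.

```python
import numpy as np

hidden_size, vocab_size = 100, 82   # values taken from the cells above

shapes = {
    "Wxh": (hidden_size, vocab_size),    # input to hidden
    "Whh": (hidden_size, hidden_size),   # hidden to hidden
    "Why": (vocab_size, hidden_size),    # hidden to output
    "bh": (hidden_size, 1),              # hidden bias
    "by": (vocab_size, 1),               # output bias
}

for name, shape in shapes.items():
    print(name, shape, int(np.prod(shape)))

total = sum(int(np.prod(shape)) for shape in shapes.values())
print("total parameters:", total)   # 26582
```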

#### What is a 'loss function'?

A ['loss function' in RNNs](https://learning.oreilly.com/beta-search/?q=%22loss%20function%22%20RNN%20%22machine%20learning%22&type=*) is a much larger topic that you probably want to ponder in greater detail ... at minimum, it's worth reading over the [loss function computation suggested by Karpathy](https://cs231n.github.io/neural-networks-case-study/#loss) ... and it would not be a bad idea to peruse the [CS231n course materials in general](https://cs231n.github.io/), especially the [git repo](https://github.com/cs231n/cs231n.github.io/blob/master/rnn.md) and the [CS231n student reports](http://cs231n.stanford.edu/reports.html).
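
A worked toy example may help before reading the full function. The sketch below applies the same softmax and cross-entropy expressions used in `lossFun` to three made-up scores; the scores and the choice of target index are assumptions for illustration only.

```python
import numpy as np

y = np.array([[2.0], [1.0], [0.1]])   # hypothetical unnormalized scores for 3 characters
target = 0                             # pretend the correct next character is index 0

p = np.exp(y) / np.sum(np.exp(y))      # softmax: scores -> probabilities summing to 1
loss = -np.log(p[target, 0])           # cross-entropy: penalty for the true character

print(p.ravel())
print(loss)

# Sanity check: the same loss written as log-sum-exp minus the target score
alt = np.log(np.sum(np.exp(y))) - y[target, 0]
print(np.isclose(loss, alt))
```

The loss is small when the model puts high probability on the true next character and grows without bound as that probability approaches zero.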

In [14]:

def lossFun(inputs, targets, hprev):
  """
  inputs,targets are both list of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0

  # first, the forward pass

  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)

  # now, the backward pass, to compute gradients going backwards

  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]  
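
The line `dy[targets[t]] -= 1` relies on the fact that the gradient of the softmax cross-entropy loss with respect to the scores is the probability vector minus a one-hot vector. The sketch below checks that claim numerically on made-up scores with a centered finite difference; the helper `softmax_loss`, the random scores and the step size `eps` are all assumptions for this illustration.

```python
import numpy as np

def softmax_loss(y, target):
    """Return the cross-entropy loss and probabilities for a column of scores."""
    p = np.exp(y) / np.sum(np.exp(y))
    return -np.log(p[target, 0]), p

rng = np.random.default_rng(1)
y = rng.standard_normal((5, 1))     # hypothetical scores for 5 characters
target = 2

# Analytic gradient, written the same way as in the backward pass
_, p = softmax_loss(y, target)
analytic = np.copy(p)
analytic[target] -= 1

# Numerical gradient: nudge each score up and down and watch the loss
eps = 1e-5
numeric = np.zeros_like(y)
for i in range(y.shape[0]):
    y_plus, y_minus = y.copy(), y.copy()
    y_plus[i, 0] += eps
    y_minus[i, 0] -= eps
    numeric[i, 0] = (softmax_loss(y_plus, target)[0]
                     - softmax_loss(y_minus, target)[0]) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic))
print(max_diff)   # expected to be tiny
```

The same nudge-and-compare idea can be extended to the weight matrices if you ever change the backward pass and want to confirm it still matches the forward pass.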

That's a lot to process ... but we're not through.

#### What function definition comes next? The `sample()` function

In [15]:
def sample(h, seed_ix, n):
  """ 
  sample a sequence of integers from the model 
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in range(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes
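
The key step in `sample()` is `np.random.choice(..., p=p.ravel())`, which picks an index at random in proportion to the model's probabilities instead of always taking the most likely one. Here is a small sketch with a made-up three-way distribution; the seed and the number of draws are arbitrary choices for the illustration.

```python
import numpy as np

np.random.seed(0)                       # seeded only so reruns look the same

p = np.array([[0.7], [0.2], [0.1]])     # hypothetical column of probabilities
vocab_size = p.shape[0]

# Draw many indices the same way sample() draws one per time step
draws = [np.random.choice(range(vocab_size), p=p.ravel()) for _ in range(2000)]

freq = np.bincount(draws, minlength=vocab_size) / len(draws)
print(freq)   # should land near 0.7, 0.2, 0.1
```

Because the choice is random, an untrained network with near-uniform probabilities produces the character soup seen at iteration 0 in the output further down.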


#### We're ready to initialize some of the parameters and start things off!

***We can see that by the time it runs with this data for 10,000 iterations or so, the loss function is pretty close to being minimized ... you can see what happens when you want to change it up a bit.***
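
To judge whether the loss is going down, it helps to know where it starts. The training cell initializes `smooth_loss` to the loss of a model that guesses every character with equal probability; the sketch below reproduces that number, assuming `vocab_size = 82` from the earlier output and `seq_length = 25` from the hyperparameter cell.

```python
import numpy as np

vocab_size, seq_length = 82, 25             # values taken from the cells above

per_char = -np.log(1.0 / vocab_size)        # loss for one uniformly random guess
baseline = per_char * seq_length            # summed over one sequence

print(round(per_char, 4))
print(round(baseline, 2))                   # 110.17
```

Anything clearly below this baseline means the network has learned something about which characters tend to follow which.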

In [16]:
n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0

while True:
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0: 
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print ("---- \n %s \n ----" % (txt, ))

  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print ("iter %d, loss: %f" % (n, smooth_loss)) # print progress
  
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by], 
                                [dWxh, dWhh, dWhy, dbh, dby], 
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

  p += seq_length # move data pointer
  n += 1 # iteration counter 


---- 
 
l(.kL6Evt!.;P/24LgAòéRé雲Z2bgip
4oVVbj!hs)aò雲xK'ëAwY3ò(q@Qh9WJ-金
—S
N/.DWK@eLsI-,ha4meiLxgdrGQ雲5;bE-
còpp/9DU雲金ëbqfRIJfqLTPZ2f!-金z3sFuTjCiQjgC傳W金9e qH傳t:EexM..f雲'j1—BvLEkpbe
Jk傳tAJJ'rcdPK,v86pfSGv6BBI 
 ----
iter 0, loss: 110.167974
---- 
 ey lrE le/Ciolbybn3Rrf lsc cuIolCcoWf olry alerfilb2Ye et nIrlhl9a 
ocsc crg cbnrgrntc yiaeehers JTtsy2sos  sa
e crlhO cilhhtydcoi3c cb kloR oee/o1y l yre4Bhc nAake cbe nLabooesotnafolro a ltltom i bc 
 ----
iter 100, loss: 111.992251
---- 
  ethec enlkecTeine Wisl bica meeaobneo mn h nyHTnos eis 16iba nhn t ao  optsit lyoaaneyb otaaiho TFmvmo(hwrt ocsno ly 7l enuétel oy t Vyo (7G6(0ë—yl sdn yyce6 lr6a o6(6
W0como hWotesnezln tc drth6haal 
 ----
iter 200, loss: 110.013889
---- 
 l.lgn tfy ,algmdaeloasttbro 1 Mlr uan f y Wels,cee,Me An T.e 7agweaWtnrgle (123136-11)
Tv Svfmkeasn bo St 0esn jmonn 9leosr w(1WiSYok :WFob bbimi Giunleee (1-22)
),t (D,e Jy 1TWet by WaykhudWRk M Frko 
 ----
iter 300, loss: 107.343387
---- 
 y ye  uosl by cieiN (109)
T

KeyboardInterrupt: 
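
The Adagrad update at the end of the loop can be hard to picture inside a full RNN, so here is a minimal sketch of the same two lines applied to a toy problem: minimizing f(w) = w**2 for a single made-up parameter. The starting value, the number of steps and the toy objective are assumptions for this illustration; only the update rule mirrors the cell above.

```python
import numpy as np

learning_rate = 1e-1
w = np.array([5.0])           # hypothetical single parameter
mem = np.zeros_like(w)        # running sum of squared gradients, like mWxh etc.

start_value = float(w[0] ** 2)
for step in range(200):
    grad = 2 * w                                        # gradient of f(w) = w**2
    mem += grad * grad                                  # accumulate squared gradient
    w += -learning_rate * grad / np.sqrt(mem + 1e-8)    # same form as the loop above
end_value = float(w[0] ** 2)

print(start_value, end_value)
```

The `+=` matters: it changes the arrays in place, which is why updating the loop variable `param` in the training cell actually modifies `Wxh`, `Whh` and the other global arrays. Because `mem` only ever grows, the effective step size shrinks over time.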