# PHYS 481 Assignment 1: Spamlet and Random numbers

Gisu Ham 10134838  
Lincoln Phung 10148276

## Introduction

This notebook covers aspects of probability theory, starting with the Bernoulli process and ending with random number generators. Probability theory matters in physics, particularly in statistical mechanics, where it underlies the entropy of classical and quantum mechanical systems. The notebook starts with the Bernoulli process (a series of "success" or "failure" experiments), which is used to determine the entropy of a simplified version of William Shakespeare's *Hamlet*. Following this, the "infinite monkey theorem" is explored to determine the odds that a monkey hitting keys at random produces the text of simplified Hamlet. This is calculated for a uniform distribution (all keys have an equal chance of being hit) and for a weighted distribution based on how often each key appears in simplified Hamlet. Finally, the implementation of pseudorandom number generators (PRNGs) in computers is explored, including number representation in computers and evaluating the "randomness" of given PRNGs.

### Q1: what is the entropy of "simplified Hamlet" (Spamlet)?
    
### Q2: what is the probability that a monkey with a uniform random selection of 27-keys would produce Spamlet?  In other words, how many different sequences with 167664 characters are there?

### Q3: how does the probability change if the chance of hitting any given key was not 1/27, but the same as the distribution of Spamlet?

### Optional Q4: determine the joint probability of each 2-key sequence eg. 'aa', 'ab', 'ac' from Spamlet.  How does the probability of producing Spamlet change if the monkey hits keys according to this distribution?

### Optional Q5: write a program to generate sequences of text that sound somewhat like Shakespeare.  See for inspiration http://www.elsewhere.org/journal/pomo/ 

First, we run the code that builds `catalog`, a dictionary counting the occurrences of each character in Spamlet, and `charlist`, a list containing each character of Spamlet in order.

In [4]:
# So how long will it take a monkey to type Hamlet?

# Don't pull the same file every time!
#
import urllib.request
url = 'http://www.gutenberg.org/cache/epub/1524/pg1524.txt'
bytedata = urllib.request.urlopen( url ).read()

# Get a local copy with browser or "wget" command.

In [5]:
# Note that the file contains a preamble about the text (metadata).
# A more careful analysis might remove all this, but for simplicity 
# we will ignore this detail.
#
print( bytedata[3:99].decode('ascii') )

Project Gutenberg Etext of Hamlet by Shakespeare
PG has multiple editions of William Shakespear


In [6]:
# Change bytes to characters
# +ignore the first 4 "magic" bytes (file type)
# +ignore everything except a-z and spaces
# +change all letters to lower case
data = bytedata[4:].decode('ascii')
charlist = [c.lower() for c in data if c.isalpha() or c==' ']
print('Length of simplified Hamlet: ', len(charlist) )

# This is one way to count how many times each letter occurs.
# Using numpy might be faster, but this is easy to understand.
#
catalog = {}
for symbol in charlist:
    
    if symbol in catalog:
        catalog[symbol] += 1  #increment
    else:
        catalog[symbol] = 1  # initialize
        
print( catalog ) 

Length of simplified Hamlet:  167664
{'r': 8381, 'o': 11985, 'j': 150, 'e': 16138, 'c': 2906, 't': 12707, ' ': 28805, 'g': 2608, 'u': 4617, 'n': 8864, 'b': 1971, 'x': 270, 'f': 2889, 'h': 8988, 'a': 10513, 'm': 4487, 'l': 6195, 'y': 3437, 's': 8840, 'k': 1315, 'p': 2251, 'i': 9141, 'd': 5358, 'w': 3239, 'v': 1302, 'z': 78, 'q': 229}


# Q1

The probabilities of the letters are extracted directly from the `catalog` dictionary above.
First I check that its length is 27; otherwise I would need to assign a value of 0 to the missing keys.

In [171]:
assert len(catalog)==27

Then simply apply the discrete Shannon entropy formula (in bits, so the logarithm is base 2):
$$  H = -\sum_{i=1}^n p_i \; \log_2{p_i} $$

In [174]:
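import time
import numpy as np

#Entropy H (in bits)
H=0.0

#also measure time
begin=time.time()

#NOTE: this loop runs over every character of the text, not over the 27
#distinct symbols, so each symbol's term is added once per occurrence.
for i in charlist:
    #probability = number of occurrences of this letter / length of Spamlet
    p=catalog[i]/len(charlist)
    H+=-p*np.log(p)/(np.log(2))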

end=time.time()

print("Entropy of Spamlet: ",H)
print("time taken: ",end-begin," seconds")


Entropy of Spamlet:  42498.5737891
time taken:  0.4161057472229004  seconds


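The same quantity can also be found in a more elegant way, using 'pythonic' code that loops over the 27 distinct symbols. Note that both versions add each symbol's term $-p_i \log_2 p_i$ once per occurrence of that symbol, so the printed value is $\sum_i n_i\,(-p_i \log_2 p_i)$, with $n_i$ the count of symbol $i$, rather than the per-symbol entropy $H$ defined above.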

In [14]:
#Entropy H
H=0.0

begin=time.time()

for i in catalog:
    p=catalog[i]/len(charlist)*np.ones(catalog[i])
    H+=-np.sum(np.log(p)*p)/(np.log(2))

end=time.time()

print("Entropy of Spamlet: ",H)
print("time taken: ",end-begin," seconds")

Entropy of Spamlet:  42498.5737892
time taken:  0.04689311981201172  seconds
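
For comparison, the Shannon formula stated earlier takes exactly one term per distinct symbol. The following is a minimal sketch of that per-symbol calculation. The helper name `shannon_entropy_bits` and the toy counts are assumptions made for illustration; `counts` plays the role of `catalog`.

```python
import math

def shannon_entropy_bits(counts):
    """Per-symbol Shannon entropy in bits from a dict of symbol counts.

    Hypothetical helper for illustration; `counts` plays the role of `catalog`.
    """
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        if n > 0:  # a symbol that never occurs contributes nothing
            p = n / total
            entropy -= p * math.log2(p)
    return entropy

# Toy example: four equally likely symbols carry exactly 2 bits each.
toy_counts = {'a': 5, 'b': 5, 'c': 5, 'd': 5}
print(shannon_entropy_bits(toy_counts))  # 2.0
```

Applied to `catalog`, the result can be at most $\log_2 27 \approx 4.75$ bits, the value for a uniform distribution over 27 keys.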


# Q2

Given the uniform distribution, the probability $P_u$ of typing the Spamlet would be
$$P_u=\left(\frac{1}{27}\right)^{167664}$$

It's apparent that this value approaches 0 very rapidly as the exponent grows:

In [15]:
for i in np.arange(0,300)[::20]:
    a=(1/27)**i
    print("(1/27) to the power of ",i," = ",a)

(1/27) to the power of  0  =  1.0
(1/27) to the power of  20  =  2.35898248759e-29
(1/27) to the power of  40  =  5.56479837677e-58
(1/27) to the power of  60  =  1.31272619178e-86
(1/27) to the power of  80  =  3.09669809741e-115
(1/27) to the power of  100  =  7.30505658115e-144
(1/27) to the power of  120  =  1.72325005458e-172
(1/27) to the power of  140  =  4.0651167005e-201
(1/27) to the power of  160  =  9.58953910649e-230
(1/27) to the power of  180  =  2.26215548163e-258
(1/27) to the power of  200  =  5.33638516538e-287
(1/27) to the power of  220  =  1.25884391521e-315
(1/27) to the power of  240  =  0.0
(1/27) to the power of  260  =  0.0
(1/27) to the power of  280  =  0.0


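Python floats are double-precision numbers on typical platforms: any result smaller than about $5\times10^{-324}$ underflows to exactly 0.0, which is why the output above drops to 0.0 after $10^{-315}$.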
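
These limits can be inspected directly. The following is a small sketch using the standard `sys.float_info` attributes; the test values are illustrative choices and assume IEEE 754 double precision.

```python
import sys

# Smallest positive *normal* double; below this, precision degrades gradually.
print(sys.float_info.min)

# 5e-324 is the smallest positive subnormal double on IEEE 754 platforms.
smallest = 5e-324
print(smallest > 0.0)          # True
print(smallest / 2 == 0.0)     # True: halving it underflows to zero

# The same underflow seen in the table above.
print((1 / 27) ** 240 == 0.0)  # True
```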

So it's clear that there is a limit on how small a float Python can handle. However, Python has no limit on how large an integer can be, as long as RAM permits it. So I define
$$PR_n:=P_u^{-1}=27^{167664}$$

Simply displaying $PR_n$ gives a wall of digits. So while that confirms Python can handle very, very large integers, it has way too much entropy for us humans to handle. So I count its digits to write it in scientific notation.

In [176]:
PR_n=str(27**len(charlist))
length=len(PR_n)
print("PR_n= ",PR_n[0]+"."+PR_n[1]+PR_n[2],"*10^",length-1)

PR_n=  1.49 *10^ 239988


So the probability of typing Spamlet using the uniform distribution is simply the inverse of this number, roughly $6.7\times10^{-239989}$.
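
The same exponent can be cross-checked with logarithms, without ever forming the huge integer. This is a minimal sketch, not the method used in the cells above; the variable names are assumptions for illustration, and the character count is copied from the earlier output.

```python
import math

n_chars = 167664   # length of simplified Hamlet, from the output above
n_keys = 27        # 26 letters plus the space bar

# log10 of 27**167664, computed without building the integer itself
log10_inverse = n_chars * math.log10(n_keys)

exponent = math.floor(log10_inverse)
mantissa = 10 ** (log10_inverse - exponent)
print(f"1/P_u is about {mantissa:.2f} * 10^{exponent}")
```

Working in log space trades exactness for speed: the exponent is reliable, but the mantissa carries ordinary floating-point rounding.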

# Q3

Define the probability of typing Spamlet according to the distribution of occurrences of characters as $P_d$:
$$P_d=\prod_{i=0}^{167664-1}p_i$$
where $p_i$ is the probability of occurrence of the $i^{th}$ character of Spamlet.

This can alternatively be written as
$$P_d=\prod_{l=0}^{27-1}\prod_{n=1}^{n_l}p_l=\prod_{l=0}^{27-1}p_l^{\,n_l}$$

where $n_l$ is the number of occurrences of the $l^{th}$ unique character and $p_l$ is the probability of that character occurring in Spamlet.

Similar to Q2, the float $P_d$ underflows to 0 long before the product is complete. So I define
$$PR_d:=\frac{1}{P_d}$$
and work with integers. A probability cannot be represented as an integer (unless it is 0 or 1), but since each probability here is a rational number, we can use two integers to represent it as a fraction:

In [196]:
PR_d=1
up=1   # product of the denominators of p_l (each is len(charlist))
down=1 # product of the numerators of p_l (each is catalog[l])

#multiply the numerators and denominators separately, once per occurrence
for l in catalog:
    for n in range(catalog[l]):
        up=up*len(charlist)
        down=down*catalog[l]

In [253]:
#Then find PR_d = 1/P_d by dividing. We want to stay with integers,
#so we use the floor-division operator //.
#Since the remainder is very small compared to the final result,
#we ignore it.

PR_d=str(up//down)
L_PR_d=len(PR_d)  #number of digits, used for the exponent below


In [198]:
print(PR_d[0]+"."+PR_d[1:3],"10^",L_PR_d-1)

4.13 10^ 208633


As in Q2, the inverse of this number is the probability of typing Spamlet using the character distribution of Spamlet, roughly $2.4\times10^{-208634}$. Compared to the uniform distribution, the probability has increased by more than 31,000 orders of magnitude, but it is still negligible.
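
The product over characters can also be evaluated in log space, which avoids the very large integers. This is a minimal sketch and not the method used in the cells above; the counts are copied from the `catalog` output earlier in this notebook, and the names `log10_P_d`, `log10_P_u` and `bits_per_char` are assumptions for illustration.

```python
import math

# Character counts copied from the `catalog` output earlier in this notebook.
catalog = {'r': 8381, 'o': 11985, 'j': 150, 'e': 16138, 'c': 2906,
           't': 12707, ' ': 28805, 'g': 2608, 'u': 4617, 'n': 8864,
           'b': 1971, 'x': 270, 'f': 2889, 'h': 8988, 'a': 10513,
           'm': 4487, 'l': 6195, 'y': 3437, 's': 8840, 'k': 1315,
           'p': 2251, 'i': 9141, 'd': 5358, 'w': 3239, 'v': 1302,
           'z': 78, 'q': 229}
n_chars = sum(catalog.values())

# log10(P_d) = sum over symbols of n_l * log10(n_l / N)
log10_P_d = sum(n * math.log10(n / n_chars) for n in catalog.values())

# Uniform case for comparison: log10(P_u) = -N * log10(27)
log10_P_u = -n_chars * math.log10(len(catalog))

# -log2(P_d) / N is the per-symbol Shannon entropy in bits
bits_per_char = -log10_P_d / (n_chars * math.log10(2))

print(n_chars)
print(log10_P_d, log10_P_u)
print(bits_per_char)
```

The exponents should agree with the exact integer results above, up to floating-point rounding in the mantissa.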

# Optional Q4

First we divide Spamlet into non-overlapping pieces of two characters each. We can use these pieces to find the distribution of 2-key sequences.

In [267]:
#Joint Probability dictionary
JC={}

for i in range(0,len(charlist)-1,2):
    #jc= 2-key sequence. append to dictionary appropriately
    jc=charlist[i]+charlist[i+1]
    
    if jc in JC:
        JC[jc]+=1
    else:
        JC[jc]=1


In [268]:
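#same as Q3: up is the product of the denominators of the pair
#probabilities and down is the product of the numerators, both initially 1.

up=1
down=1

#there are len(charlist)/2 pairs, so the probability of a pair is
#JC[jc]/(len(charlist)/2) = 2*JC[jc]/len(charlist).
#then multiply each respectively, going up by steps of 2
for i in range(0,len(charlist)-1,2):
    jc=charlist[i]+charlist[i+1]
    up=up*len(charlist)
    down=down*JC[jc]*2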

In [269]:
#calculating the inverse of the probability. Again, we ignore the remainder

PJ=str(up//down)
L_PJ=len(str(up//down))

In [270]:
print(PJ[0]+"."+PJ[1:3],"10^",L_PJ-1)

8.88 10^ 191580


Of the three distributions, the 2-key sequence distribution gives the highest probability of typing Spamlet, roughly $1.1\times10^{-191581}$, but it is still negligibly low.
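
The pair counting and the probability can be combined into one short log-space calculation. This is a minimal sketch on a toy string; the helper name `log10_pair_probability` is an assumption for illustration, and it uses the same non-overlapping pairs as the cells above.

```python
import math
from collections import Counter

def log10_pair_probability(text):
    """log10 of the probability of typing `text` two keys at a time,
    with pair probabilities taken from the text's own non-overlapping pairs.

    Hypothetical helper for illustration only.
    """
    pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]
    counts = Counter(pairs)
    n_pairs = len(pairs)
    # each pair type contributes (count) * log10(count / number of pairs)
    return sum(n * math.log10(n / n_pairs) for n in counts.values())

toy_text = "abcdabcd"   # pairs: ab, cd, ab, cd, each with probability 1/2
print(log10_pair_probability(toy_text))  # equals 4 * log10(0.5)
```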

# Pseudo Random Q1

First I create `isbitset`, which checks whether the bit of order `setbit` is set in each integer of a sequence.

In [274]:
def isbitset(sequence,setbit=1):
    """Determine whether a specific bit is set for each integer value
        args:
            sequence (list or np.ndarray): integers to examine
            setbit (int): position of the bit to test (0 = least significant)
        returns:
            return1 (np.ndarray of bool): True where that bit is set."""
    
    #convert to array if a list is passed in (arrays pass through unchanged)
    sequence=np.asarray(sequence)
        
    #create bitmask with only the requested bit set
    bitmask=2**setbit
    
    return1= (sequence & bitmask)!=0
    
    return return1

## numpy random generator
I apply the function above to random numbers generated by numpy.random.randint.

In [286]:
#create random numbers
np_random_list=[]

for i in range(10):
    np_random_list.append(np.random.randint(0,2**31))


displaying the numbers:

In [287]:
np_random_list

[2044920961,
 853679168,
 891923477,
 244145354,
 279165345,
 1355676035,
 1246909089,
 683573426,
 1326368857,
 706097092]

examining output of each set bit

In [288]:
for i in range(10):
    print("setbit=",i," :",isbitset(np_random_list,setbit=i))

setbit= 0  : [ True False  True False  True  True  True False  True False]
setbit= 1  : [False False False  True False  True False  True False False]
setbit= 2  : [False False  True False False False False False False  True]
setbit= 3  : [False False False  True False False False False  True False]
setbit= 4  : [False False  True False False False False  True  True False]
setbit= 5  : [False False False False  True False  True  True False False]
setbit= 6  : [False  True False  True False False False False  True  True]
setbit= 7  : [ True False False  True  True  True  True  True False  True]
setbit= 8  : [False False False False  True  True False False False  True]
setbit= 9  : [False False False False False False  True False False  True]


## Unix rand

We take the unix_rand function directly from the pseudorandomness notebook by Dr. Jackel.

In [313]:
def unix_rand(seed=None):
    """Produce random integers based on the Grogonno algorithm
        args:
            seed (int): initializes the random number generator algorithm
        returns:
            unix_rand.seed (int): random number generated"""
    
    #keep the previous state unless a new seed is given; getattr avoids an
    #AttributeError when the very first call passes no seed
    unix_rand.seed = getattr(unix_rand, 'seed', None) if seed is None else seed
    if unix_rand.seed is None:
        unix_rand.seed = 0
    
    multval, addval, maxval = 1103515245, 12345, 2**31
    unix_rand.seed = (multval * unix_rand.seed + addval) % maxval
    
    return unix_rand.seed


create random numbers using unix_rand

In [314]:
unix_random_list=[]

unix_random_list.append(unix_rand(0))
for i in range(10):
    unix_random_list.append(unix_rand())



display

In [315]:
unix_random_list

[12345,
 1406932606,
 654583775,
 1449466924,
 229283573,
 1109335178,
 1051550459,
 1293799192,
 794471793,
 551188310,
 803550167]

and run for each set bit

In [316]:
for i in range(10):
    print ("setbit=",i,": ",isbitset(unix_random_list,setbit=i))

setbit= 0 :  [ True False  True False  True False  True False  True False  True]
setbit= 1 :  [False  True  True False False  True  True False False  True  True]
setbit= 2 :  [False  True  True  True  True False False False False  True  True]
setbit= 3 :  [ True  True  True  True False  True  True  True False False False]
setbit= 4 :  [ True  True  True False  True False  True  True  True  True  True]
setbit= 5 :  [ True  True False  True  True False  True False  True False False]
setbit= 6 :  [False  True  True False  True False  True False  True  True  True]
setbit= 7 :  [False False  True False  True  True  True False False False  True]
setbit= 8 :  [False False  True False False False False  True  True  True  True]
setbit= 9 :  [False  True  True False  True False  True  True False  True  True]



From the above results, using setbit=1, we can see that the numpy.random generator appears to be more random. To explain this, we first describe what the isbitset function does.

The function isbitset takes two parameters: the list of integers to be examined and a "setbit". Using setbit=0 means that we are checking whether the least significant bit has a value of 1, setbit=1 means we are checking whether the second least significant bit has a value of 1, and so on. For example, for the above results (setbit=1), we write the first 4 numbers generated by the UNIX rand generator in binary:

12345$_{10}$ = 110000001110**0**1$_{2}$  
1406932606$_{10}$ = 10100111101110000010110011111**1**0$_{2}$  
654583775$_{10}$ = 1001110000010000100111110111**1**1$_{2}$  
1449466924$_{10}$ = 10101100110010100011100001011**0**0$_{2}$  

The second least significant bit is bolded above. Because the values are 0, 1, 1, and 0, the bit is set only for the second and third numbers.
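
The binary expansions and the second least significant bit can be checked with plain Python integers. This is a small sketch using a shift and a mask instead of `isbitset`; the list `values` is copied from the unix_rand output above.

```python
# First four values produced by unix_rand in this notebook.
values = [12345, 1406932606, 654583775, 1449466924]

for v in values:
    bit1 = (v >> 1) & 1          # shift right once, keep the lowest bit
    print(format(v, 'b'), bit1)

second_bits = [(v >> 1) & 1 for v in values]
print(second_bits)  # [0, 1, 1, 0]
```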

Observing whether the bit is set shows us that the UNIX rand generator is not as random as it appears. Looking at the results, the pattern of set and unset bits has a period of 4. This means the numbers generated by UNIX rand have a second least significant bit that cycles through 0, 1, 1, 0 (False, True, True, False) repeatedly.

However, this result is not specific to setbit=1. More generally, as mentioned by S.K. Park and K.W. Miller in *"Random Number Generators: Good Ones Are Hard to Find"*, UNIX rand and any generator of the form

$$f(z) = (az+c) \bmod m, \quad \text{where } m=2^{b},\; a \bmod 4 = 1,\; \text{and } c \text{ is odd}$$

have low-order bits that cycle. The least significant bit has a period of 2, the next least significant bit a period of 4, and each further bit has twice the period of the one before it.
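
The doubling of the period can be checked numerically. This is a minimal sketch that re-implements the same recurrence with the constants from unix_rand; the helper names `lcg_sequence` and `smallest_period` are assumptions for illustration, not part of the original notebook.

```python
def lcg_sequence(count, seed=0, a=1103515245, c=12345, m=2**31):
    """Return `count` successive values of z -> (a*z + c) mod m."""
    values, z = [], seed
    for _ in range(count):
        z = (a * z + c) % m
        values.append(z)
    return values

def smallest_period(bits):
    """Smallest p such that bits[i] == bits[i + p] over the whole sample."""
    for p in range(1, len(bits)):
        if all(bits[i] == bits[i + p] for i in range(len(bits) - p)):
            return p
    return None

sample = lcg_sequence(256)  # long enough to cover several periods of bit 4
periods = [smallest_period([(z >> k) & 1 for z in sample]) for k in range(5)]
print(periods)  # [2, 4, 8, 16, 32]
```

The sample must span at least a few full periods of the highest bit examined, otherwise the measured period is not meaningful.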

For this reason, the numpy.random generator is more random than the UNIX rand generator in this test: its low bits show no such short period.


# Conclusion

We explored the probabilities of typing "Spamlet" according to various probability distributions of the keys: a uniform distribution, a character-based distribution, and a 2-key sequence distribution.  

We also discussed two pseudo-random number generators: the one used by numpy.random, based on the Mersenne Twister, and unix_rand, based on Grogonno's generator type. It's evident that Grogonno's type shows a definite pattern in its low-order bits, giving us a cycle of 1's and 0's, while numpy.random gives us low-order bits that are more 'random'.

We have learned that even when something looks random on the surface, examining it closely can be a valuable way to reveal its predictability.