# Lincoln Phung
# Assignment 1: Bernoulli Entropy, Spamlet, and Random Number Generators
## Lincoln Phung 10148276, David Gisu

## Introduction

This notebook covers aspects of probability theory, starting with the Bernoulli process and ending with random number generators. Probability theory is important in physics, particularly in statistical mechanics, where it underlies entropy in classical and quantum mechanical systems. The notebook starts with the Bernoulli process (a series of "success" or "failure" experiments), which is used to determine the entropy of a simplified version of William Shakespeare's *Hamlet*. Following this, the "infinite monkey theorem" is explored to determine the odds that monkeys hitting keys will produce the text of simplified Hamlet. This is calculated for a uniform random distribution (all keys have an equal chance of being hit) and for a weighted distribution based on the number of times each key appears in simplified Hamlet. Finally, the implementation of pseudo-random number generators (PRNGs) in computers is explored, including number representation in computers and evaluating the "randomness" of given PRNGs.

In [4]:
# import the necessary libraries
# load modules for math and plotting
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


### Question 1: 
What is the entropy of "simplified Hamlet" (Spamlet)?

In [5]:
### Code from Assignment ###

# Don't pull the same file every time!
#
import urllib.request
url = 'http://www.gutenberg.org/cache/epub/1524/pg1524.txt'
bytedata = urllib.request.urlopen( url ).read()

# Get a local copy with browser or "wget" command.

# Note that the file contains a preamble about the text (metadata).
# A more careful analysis might remove all this, but for simplicity 
# we will ignore this detail.
#
print( bytedata[3:99].decode('ascii') )

# Change bytes to characters
# +ignore the first 4 "magic" bytes (file type)
# +ignore everything except a-z and spaces
# +change all letters to lower case
data = bytedata[4:].decode('ascii')
charlist = [c.lower() for c in data if c.isalpha() or c==' ']
print('Length of simplified Hamlet: ', len(charlist) )

# This is one way to count how many times each letter occurs.
# Using numpy might be faster, but this is easy to understand.
#
catalog = {}
for symbol in charlist:
    
    if symbol in catalog:
        catalog[symbol] += 1  #increment
    else:
        catalog[symbol] = 1  # initialize
        

### ###



Project Gutenberg Etext of Hamlet by Shakespeare
PG has multiple editions of William Shakespear
Length of simplified Hamlet:  167664
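
The comments in the cell above suggest keeping a local copy of the text rather than downloading it on every run. A minimal sketch of that idea follows; the helper name `load_bytes_cached`, its `fetch` parameter, and the cache file name in the usage comment are assumptions made for illustration and are not part of the assignment code.

```python
import urllib.request
from pathlib import Path

def load_bytes_cached(url, cache_path, fetch=None):
    """Return the bytes at url, reading from cache_path if it already exists.

    fetch is a hypothetical hook: a callable taking a URL and returning bytes.
    It defaults to a plain urllib download.
    """
    cache = Path(cache_path)
    if cache.exists():
        # A local copy is present, so no network request is made.
        return cache.read_bytes()
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u).read()
    data = fetch(url)
    cache.write_bytes(data)  # keep a copy for the next run
    return data

# Possible usage (not executed here):
# bytedata = load_bytes_cached(url, 'pg1524.txt')
```

Only the first call touches the network; later calls read the saved file.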


In [6]:
# Probability of each unique character: number of occurrences divided by
# the total number of characters in Spamlet.
total = len(charlist)
pchar = np.array([count / total for count in catalog.values()])


# Calculating the entropy in bits per key: H = -sum(p * log2(p)), with one
# term per unique symbol. Looping over charlist instead adds one term per
# character occurrence and gives 42498.5737892, far above the
# log2(27) ~ 4.75 bit maximum for 27 symbols.

hamEntropy = -np.sum(pchar * np.log2(pchar))
print("Entropy of simplified Hamlet in bits per key: ", hamEntropy)

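
The entropy in bits per key is $H = -\sum_i p_i \log_2 p_i$, with one term per unique symbol. The sketch below illustrates this with the standard library only; `shannon_entropy_bits` is a hypothetical helper name, not something defined by the assignment. For a 27-key alphabet the result can never exceed $\log_2 27 \approx 4.75$ bits, which gives a quick sanity check on any computed value.

```python
import math
from collections import Counter

def shannon_entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of an iterable of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # One term per unique symbol, weighted by that symbol's probability.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy_bits("abab"))  # two equally likely symbols: 1.0
print(shannon_entropy_bits("abcd"))  # four equally likely symbols: 2.0
print(math.log2(27))                 # upper bound for a 27-key alphabet
```

Applied to the notebook's data, the call would be `shannon_entropy_bits(charlist)`.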


### Question 2:
What is the probability that a monkey with a uniform random selection of 27-keys would produce Spamlet?  In other words, how many different sequences with 167774 characters are there?


In [7]:
# The number of different sequences as long as Spamlet, typed with 27 keys,
# is 27**len(charlist). The integer is too long to print usefully, so report
# its order of magnitude; the probability of producing Spamlet is the reciprocal.
log10_sequences = len(charlist) * np.log10(27)
print("Number of sequences: about 10^%.2f" % log10_sequences)


Number of sequences: about 10^239988.17
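
Because the number of sequences is far too large to read as a plain integer, its size is easier to handle through logarithms. The sketch below is one way to do that; the function names are hypothetical. Using the length of 167664 printed earlier, the count is roughly $10^{239988}$, so the probability for a uniform monkey is about $10^{-239988}$.

```python
import math

def log10_num_sequences(n_keys, length):
    """log10 of n_keys**length, computed without building the huge integer."""
    return length * math.log10(n_keys)

def num_digits(n_keys, length):
    """Number of decimal digits of n_keys**length.

    Relies on floating point, so it may be off by one when
    n_keys**length is an exact power of ten.
    """
    return math.floor(log10_num_sequences(n_keys, length)) + 1

print(num_digits(27, 3))  # 27**3 = 19683 has 5 digits
print(log10_num_sequences(27, 167664))
```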

### Question 3:
How does the probability change if the chance of hitting any given key was not 1/27, but the same as the distribution of Spamlet?

In [28]:

# For each key, raise its number of occurrences N to the power N, where N is
# the number of times the key appears in Spamlet.

def inv_prob(lettermap, charlist):
    pcharN = []

    # go through the dictionary and compute count**count for each key
    for letter in lettermap:
        pcharN.append(lettermap[letter]**lettermap[letter])

    # Numerator of the probability of a monkey typing Spamlet.
    num = np.prod(pcharN)

    # Denominator of the probability of a monkey typing Spamlet.
    dem = len(charlist)**len(charlist)

    # Return the inverse probability, since the probability itself is too
    # small a number to represent.
    inv_pspam = dem//num
    return inv_pspam


singleSpam=inv_prob(catalog, charlist)
# Since the value is so large, print it in exponential form.

print(str(singleSpam)[0] , "." , str(singleSpam)[1:3] ,"x 10^", len(str(singleSpam)[1:]))
    

4 . 13 x 10^ 208633
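
The weighted probability is $P = \prod_i (n_i/N)^{n_i}$, where $n_i$ is the count of key $i$ and $N$ is the total length, which equals $2^{-NH}$ with $H$ the entropy in bits per key. Taking the printed result at face value with $N = 167664$, $\log_{10}(4.13 \times 10^{208633}) / (N \log_{10} 2)$ gives $H \approx 4.13$ bits per key. The same inverse probability can be evaluated with logarithms instead of huge integers, as in this sketch (the helper name `log10_inverse_probability` is an assumption):

```python
import math

def log10_inverse_probability(counts):
    """log10 of N**N / prod(n_i**n_i) for a dict mapping symbol -> count."""
    total = sum(counts.values())
    return total * math.log10(total) - sum(
        n * math.log10(n) for n in counts.values()
    )

# Small check: counts a=2, b=2 give 4**4 / (2**2 * 2**2) = 16
print(10 ** log10_inverse_probability({'a': 2, 'b': 2}))
```

With the notebook's data the call would be `log10_inverse_probability(catalog)`.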


### Optional 1:
Determine the joint probability of each 2-key sequence eg. 'aa', 'ab', 'ac' from Spamlet.  How does the probability of producing Spamlet change if the monkey hits keys according to this distribution?


In [35]:
### Test code / sample ###
a = ['a', 'b', 'c','a','c']
b = []
c = ['a', 'b', 'c']

#counter
n=0

for item in c:
    
    for symbol in a:
        
        n=n+1
       
        if item == symbol:
            if n < len(a):
                
                b.append(symbol+a[n])
        
        
        
    n=0
    
print(b)

5
['ab', 'ac', 'bc', 'ca']


In [29]:
#print(charlist)

#Add new list for double characters
dcharlist = []

#Add a counter to track indx of Spamlet, what the position of a character is at.
n=0

#Go through each unique character in Spamlet

for unique in catalog:
    
    #search through the character list
    for symbol in charlist:
        
        #if a match is found between the unique character and a symbol in spamlet, append the symbol and the next symbol to
        #a new list
        n=n+1
        if unique == symbol:
            if n < len(charlist):
                dcharlist.append(symbol+charlist[n])
    
    #reset counter after looping through all characters in Spamlet
    n=0

#Make a new dictionary counting all the times a sequence of two characters (double) appear
dcatalog = {}
for dsymbol in dcharlist:
    
    if dsymbol in dcatalog:
        dcatalog[dsymbol] += 1  #increment
    else:
        dcatalog[dsymbol] = 1  # initialize
        
##print(dcatalog)

#Now apply the same procedure as in question 3

doubleSpam=inv_prob(dcatalog,charlist)


print(str(doubleSpam)[0] , "." , str(doubleSpam)[1:3] ,"x 10^", len(str(doubleSpam)[1:]))


9 . 50 x 10^ 383403
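
The nested loop above scans the whole text once for every unique character. The same overlapping two-key counts can also be collected in a single pass by pairing each character with its successor. This is a sketch using `collections.Counter`; `bigram_counts` is a hypothetical name.

```python
from collections import Counter

def bigram_counts(chars):
    """Count overlapping two-character sequences in a list of characters."""
    # zip pairs each character with the one that follows it.
    return Counter(a + b for a, b in zip(chars, chars[1:]))

# Same sample as the test list used earlier
print(bigram_counts(['a', 'b', 'c', 'a', 'c']))
```

A text of length N yields N-1 overlapping pairs, one fewer than the number of characters.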


In [39]:
#### Probability if we consider only two keystrokes at a time (non-overlapping pairs)

#initialize a list
d2charlist = []

#go through Spamlet and add groups of two into the list;
#stopping at len(charlist)-1 avoids an IndexError if the length is odd
for indx in range(0,len(charlist)-1,2):
    d2charlist.append(charlist[indx]+charlist[indx+1])
    
#Count all entries in a dictionary
d2catalog={}
for d2symbol in d2charlist:
    if d2symbol in d2catalog:
        d2catalog[d2symbol] += 1 
    else:
        d2catalog[d2symbol]=1 


d2Spam = inv_prob(d2catalog,d2charlist)

print(str(d2Spam)[0],".",str(d2Spam)[1:3],"x 10^", len(str(d2Spam)[1:]))

8 . 88 x 10^ 191580


## Question- random bits?
Write a function that takes an integer sequence and determines whether a specific bit is set for each value ie.
function call:

 isbitset( sequence=[0,2,1,4,7], setbit=1 )

result:

[False, True, False, False, True]

Use this function to examine the output from the unix_rand generator. Then examine the output from the numpy.random generator. Compare and discuss.

Optional: Carry out a quantitative analysis using "mutual information".

In [3]:
#Defining the isbitset function
def isbitset(sequence, setbit):
    """Given an integer sequence and a bit index, print whether that bit is set in each value (bitwise AND with a mask)."""
    sequence = np.array(sequence)
    bitmask = 2**setbit
    print((sequence&bitmask) != 0)
    
    

In [5]:
#Showing that it works
seq = [0,2,1,4,7]

isbitset(seq,1)

[False  True False False  True]


In [23]:
#Applying the function onto the output for unix_rand generator and output for numpy.random generator


#### FOR UNIX RAND GENERATOR (CODE FROM ASSIGNMENT SHEET) ####
def unix_rand(seed=None):
    
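    # use the stored seed unless a new one is given (getattr avoids an
    # AttributeError if the very first call passes no seed)
    unix_rand.seed = getattr(unix_rand, 'seed', None) if seed is None else seed
    if unix_rand.seed is None:
        unix_rand.seed = 0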
    
    multval, addval, maxval = 1103515245, 12345, 2**31
    unix_rand.seed = (multval * unix_rand.seed + addval) % maxval
    
    return unix_rand.seed

#### Using the unix_rand generator for a list of 16 numbers
unix_seq = [unix_rand(seed=0)]
for i in range(15):
    unix_seq.append(unix_rand())


#### FOR NUMPY.RANDOM GENERATOR ####
np_seq = [np.random.randint(low=0, high=2**31)]
for j in range(15):
    np_seq.append(np.random.randint(low=0, high=2**31))



#### Comparing the two generators ####
print(unix_seq)
isbitset(unix_seq, 1)

print('\n', np_seq)
isbitset(np_seq,1)



[12345, 1406932606, 654583775, 1449466924, 229283573, 1109335178, 1051550459, 1293799192, 794471793, 551188310, 803550167, 1772930244, 370913197, 639546082, 1381971571, 1695770928]
[False  True  True False False  True  True False False  True  True False
 False  True  True False]

 [1464242791, 700072020, 477838115, 637816628, 60148709, 664122637, 1647831330, 678432377, 778852780, 1794426448, 511334789, 1991914002, 2010360818, 366609175, 1471999402, 956388741]
[ True False  True False False False  True False False False False  True
  True  True  True False]


## Comparing UNIX rand and numpy.random.randint

From the above results, using setbit=1, we can see that the numpy.random generator appears to be more random. To see why, first consider what the isbitset function does.

The function isbitset takes two parameters: the list of integers to be examined and a bit index, setbit. Using setbit=0 tests whether the least significant bit is 1, using setbit=1 tests whether the second least significant bit is 1, and so on. For example, for the above results (setbit=1), the first 4 numbers generated by the UNIX rand generator are, in binary:

12345$_{10}$ = 110000001110**0**1$_{2}$  
1406932606$_{10}$ = 10100111101110000010110011111**1**0$_{2}$  
654583775$_{10}$ = 1001110000010000100111110111**1**1$_{2}$  
1449466924$_{10}$ = 10101100110010100011100001011**0**0$_{2}$  

The second least significant bit is shown in bold above. Because its values are 0, 1, 1, and 0, the bit is set only for the second and third numbers.
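
The binary expansions above can be checked directly in Python. This short sketch prints each value in binary together with its second least significant bit, obtained by shifting right by one place and masking with 1:

```python
# First four values from the unix_rand output shown earlier
values = [12345, 1406932606, 654583775, 1449466924]

for v in values:
    # format(v, 'b') gives the binary digits; (v >> 1) & 1 extracts bit 1
    print(format(v, 'b'), (v >> 1) & 1)
```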

Observing whether the bit is set shows that the UNIX rand generator is not as random as it appears. In the results, the pattern of set bits has a period of 4: the second least significant bit of the numbers generated by UNIX rand returns 0, 1, 1, 0 (False, True, True, False) repeatedly.

However, this result is not specific to setbit=1. More generally, as mentioned by S.K. Park and K.W. Miller in *"Random Number Generators: Good Ones Are Hard to Find"*, the UNIX rand generator, and any generator of the form

$f(z) = (az + c) \bmod m$, **where** $m = 2^{b}$, $a \bmod 4 = 1$, **and** $c$ is odd,

has bits that cycle. The least significant bit has a period of 2, the next least significant bit a period of 4, and each further bit a period twice that of the bit below it, so bit $k$ (counting from 0) has a period of $2^{k+1}$.
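
The cycling of the low bits can be checked numerically. The sketch below re-implements the same recurrence as a pure function and searches for the shortest repeat length of each bit. The helper names `lcg_values` and `bit_period` are assumptions, and the search only finds periods up to half the number of values supplied.

```python
def lcg_values(n, seed=0, a=1103515245, c=12345, m=2**31):
    """First n outputs of z -> (a*z + c) mod m, starting from seed."""
    values = []
    z = seed
    for _ in range(n):
        z = (a * z + c) % m
        values.append(z)
    return values

def bit_period(values, bit):
    """Smallest shift p at which the chosen bit repeats over the whole list."""
    bits = [(v >> bit) & 1 for v in values]
    for p in range(1, len(bits) // 2 + 1):
        if all(bits[i] == bits[i + p] for i in range(len(bits) - p)):
            return p
    return None  # no period found within the window

values = lcg_values(64)
for k in range(4):
    print("bit", k, "period", bit_period(values, k))
```

For bits 0 to 3 the periods found are 2, 4, 8, and 16, matching the doubling pattern.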

For this reason, the numpy.random generator appears more random than the UNIX rand generator: its low bits show no such short period in the output above.
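
For the optional quantitative comparison, one possible measure is the mutual information between a bit in one output and the same bit in the next output: $I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$. A perfectly alternating bit gives 1 bit of mutual information, while independent bits give 0. The sketch below is an illustration on synthetic sequences only; the function name is hypothetical, and no results for the two generators are claimed here.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information, in bits, between two equal-length boolean sequences."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    mi = 0.0
    for xv in (False, True):
        for yv in (False, True):
            p_xy = np.mean((x == xv) & (y == yv))  # joint frequency
            p_x = np.mean(x == xv)                  # marginal of x
            p_y = np.mean(y == yv)                  # marginal of y
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

# An alternating bit paired with its successor: fully predictable
alternating = [0, 1] * 50
print(mutual_information(alternating[:-1], alternating[1:]))
```

To apply it to a generator, `x` would be the chosen bit of outputs 0 to n-2 and `y` the same bit of outputs 1 to n-1.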


## Conclusion



