# Phys 481 Fall 2021 Assignment 2: Spamlet
### A.G. Swadling (30098501)
### E.J. Thompson (30087678)
### G.J. Gelinas (30085897)
### T.J. Cey (30088060)

In [24]:
# load libraries for numerical methods and plotting
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load libraries for reading from a url and handling large numbers
import urllib.request
import mpmath as mp

## Introduction

In this assignment we investigated probability and Shannon entropy by estimating the likelihood of reproducing a stripped-down version of Shakespeare's "Hamlet" (Spamlet) from random keystrokes. We first calculated the entropy of Spamlet in bits per character. This required the probability of each character, which we obtained by dividing the number of times that character occurs by the total number of characters in Spamlet (counting repeats). In the following two questions we examined the number of possible sequences with the same length as Spamlet and the probability that a given sequence is Spamlet. We considered two cases: random keystrokes where every character key is equally likely, and keystrokes where each key's probability matches that character's frequency in Spamlet.

We also examined how the probability distribution and entropy change when characters are grouped rather than treated individually. We first found the joint probability of each character pair, then the probability of producing Spamlet from this pair distribution, and compared the result with the single-character case. The pair probabilities were then used to calculate the associated Shannon entropy, and the same calculation was extended to full words. Finally, we illustrated the effect of these different building blocks by constructing Shakespeare-like sentences from random combinations of characters, character pairs, full words, and word groupings drawn from Spamlet.

## Questions

## Question 1:

In this question we determine the Shannon Entropy of Spamlet in bits per character, using the probabilities of different characters occurring in Spamlet.

In [25]:
def shannonEntropy(probs):
    """Returns the shannon entropy associated with a collection of probabilities"""
    
    # Obtains the probability associated with all the items in the probs dictionary
    prob_values = list(probs.values())
    
    # Calculates the shannon entropy
    entropy = -sum(np.log2(prob_values)*prob_values)
    
    return entropy

In [26]:
def dictProbs(itemDict):
    """Returns a dictionary of probabilities associated with the occurrences of specified items in a file"""
    
    # Total number of occurrences
    total = sum(itemDict.values())
    
    # Initializes probability dictionary
    probs = {}
    
    # Converts the count of each item into a probability and stores it in the dictionary
    for item in itemDict.keys():
        probs[item] = (1.0*itemDict[item]/total)
    
    return probs

In [27]:
def charDecoder(url):
    """Reads the data from a url and decodes it into characters. 
    Letters are made lower-case and added to a list along with spaces
    and newline characters recoded as spaces. The collection of decoded
    bytedata is also returned in case it is needed for other work."""
    
    # Retrieves bytedata and decodes it into characters
    bytedata = urllib.request.urlopen( url ).read()
    data = bytedata.decode()

    # Creates a list to contain all characters in the data 
    charlist = []
    # Adds letters (lower-cased) and spaces to the character list; newlines are recoded as spaces
    for char in data:
        if (char.isalpha()):
            charlist.append(char.lower())
        # Skips repeated spaces; checking that charlist is non-empty avoids an
        # IndexError if the data begins with whitespace
        elif (char == ' ' or char == '\n') and charlist and charlist[-1] != ' ':
            charlist.append(' ')
            
    return charlist, data

In [28]:
def occurenceDict(termList, N = 1, words = False):
    """Creates a dictionary for the number of times 
    a collection of N terms occurs in a row in the inputted list.
    If the inputted list is made up of words, spaces are added between 
    consecutive terms."""
    
    # Creates a dictionary counting the number of times each term occurs in the list
    termDict = {}
    # Ranges over the termlist until there are not N terms left
    for i in range(len(termList)-(N-1)):
        if words:
            # Adjoins the next N-1 terms to a string, followed by a space, then adds the 
            # Nth term without any space
            termString = ''.join([(termList[i+j]+" ") for j in range(N-1)])
            termString += termList[i+(N-1)]
        else:
            # Adjoins the next N terms directly
            termString = ''.join([(termList[i+j]) for j in range(N)])

        # Increments the item's dictionary value if it is already stored, 
        # and initializes it at one otherwise
        if termString in termDict.keys():
            termDict[termString] += 1
        else:
            termDict[termString] = 1
    
    return termDict

In [29]:
# Obtain spamlet data and character dictionary for future use
url = r'http://www.gutenberg.org/files/1524/1524-0.txt'
charList, dataCollection = charDecoder(url)

In [30]:
# Obtains the character dictionary and associated probability dictionary for calculating the shannon entropy
charDict = occurenceDict(charList)
charProbs = dictProbs(charDict)
entropy = shannonEntropy(charProbs)

print("The entropy of Spamlet in bits per character is:", entropy)

The entropy of Spamlet in bits per character is: 4.095554914480837


Thus, we observe that the Shannon Entropy for Spamlet in bits per character is approximately 4.096.
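
For reference, the quantity computed by `shannonEntropy` is $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of characters equal to symbol $i$. The following is a minimal standalone sketch on toy strings rather than the notebook's own code; the helper name `entropy_bits_per_symbol` is hypothetical, and only the standard library is used.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(text):
    """Shannon entropy (bits per symbol) of the empirical symbol distribution of text."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Two equally likely symbols carry exactly one bit each
h_uniform = entropy_bits_per_symbol("aabb")
# A skewed distribution carries less than one bit per symbol
h_skewed = entropy_bits_per_symbol("aaab")
# 27 equally likely keys give the largest entropy possible on this keyboard
h_max = math.log2(27)
print(h_uniform, h_skewed, h_max)
```

The measured 4.096 bits per character sits below the uniform bound of $\log_2 27 \approx 4.755$ bits, as expected for a non-uniform character distribution.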



## Question 2:

In this question we determine the probability of a monkey typing Spamlet on a keyboard consisting of the 26 letter characters, and a spacebar, where all 27 characters are equally probable. In particular, we determine the number of 27-key sequences of length 184730 (the length of Spamlet after removing double spaces), and we determine the probability of any one of these sequences being typed.

In [31]:
def num_sequences(num_options, length):
    """Returns the number of sequences of a certain length given that each entry has num_options choices"""
    
    return mp.power(num_options,length)

In [32]:
def prob_total(num_options, length):
    """Returns the probability of obtaining a particular sequence of characters of length 'length'
    from a selection of 'num_options' symbols."""
    
    return mp.power(num_options, -length)

In [33]:
# Obtains the length of the character dictionary found in Question 1 for Spamlet
length = sum(charDict.values())

# Obtains the number of sequences and the probability of any given sequence 
# of length equal to Spamlet for the 26 letter characters and space
sequenceNum = num_sequences(27, length)
probTot = prob_total(27, length)

print("The number of sequences of length", length, "made up of 27-keys is", sequenceNum)
print("The probability of any one such sequence occurring is", probTot)

The number of sequences of length 184730 made up of 27-keys is 6.73213923799414e+264415
The probability of any one such sequence occurring is 1.4854119391297e-264416


Thus, if the monkey presses each of the 27 keys with equal probability, the number of possible sequences is approximately $6.732 \times 10^{264415}$, and each sequence has a probability of approximately $1.485 \times 10^{-264416}$ of being typed at random.
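
Numbers of this size overflow ordinary floats, which is why `mpmath` is used above. As a cross-check, the same result can be obtained with logarithms, since $\log_{10} 27^{N} = N \log_{10} 27$. The sketch below assumes the length 184730 reported above; the helper name `power_as_sci` is hypothetical.

```python
import math

def power_as_sci(base, exponent):
    """Returns (mantissa, power of ten) for base**exponent using logarithms only."""
    log10_value = exponent * math.log10(base)
    exp10 = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exp10)
    return mantissa, exp10

# Number of 27-key sequences with the same length as Spamlet
mantissa, exp10 = power_as_sci(27, 184730)
print(f"{mantissa:.3f}e+{exp10}")

# Probability of any one such sequence (a negative exponent gives the inverse)
p_mantissa, p_exp10 = power_as_sci(27, -184730)
print(f"{p_mantissa:.3f}e{p_exp10}")
```

Working in log space avoids any dependence on arbitrary-precision arithmetic, at the cost of keeping only about ten significant digits of the mantissa.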

## Question 3:

We now investigate how the probability of a monkey typing Spamlet changes if the key probabilities are not uniform but instead the probability of hitting any given key is the same as the probability of selecting that key at random out of Spamlet.

In [34]:
def prob_selection(probs, occurrences):
    """
    Returns the probability of selecting a specific sequence of characters
    with weighted probabilities.
    
    Arguments:
    probs - a dictionary of probabilities of selecting each item
    occurrences - a dictionary for the number of selections of each desired symbol
    
    Returns:
    prob -  the probability of selecting a certain number of characters of each type from
            the probs dictionary
    """
    
    # Obtains the terms in the probability dictionary
    terms = occurrences.keys()
    
    # Initializes the probability
    prob = 1.0
    
    # Multiplies the independent probability of each character
    for term in terms:
        prob *= mp.power(probs[term], occurrences[term])
    
    
    return prob

In [35]:
# Obtains the weighted probability associated with typing Spamlet with the character dictionary.
weightedProb = prob_selection(charProbs, charDict)

print("The probability of typing Spamlet given that the probability " + 
      "distributions for the keys are the same as they are for those in Spamlet is", weightedProb)

The probability of typing Spamlet given that the probability distributions for the keys are the same as they are for those in Spamlet is 1.50127331056182e-227751


Therefore, a monkey typing with key probabilities that match the character frequencies in Spamlet reproduces it with a probability of approximately $1.501 \times 10^{-227751}$, which is far greater than the probability of approximately $1.485 \times 10^{-264416}$ obtained when all 27 keys are equally likely.
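
When the key probabilities equal the character frequencies of the text, the probability of typing the text is tied directly to the entropy from Question 1: $P = \prod_i p_i^{n_i} = 2^{-N H}$, where $n_i = N p_i$ is the count of character $i$ and $N$ is the total length. The sketch below checks this identity on a toy string and then applies it to the values reported above; the function name `log2_prob_of_text` is hypothetical.

```python
import math
from collections import Counter

def log2_prob_of_text(text):
    """log2 of the probability of typing text when each key's probability
    equals that character's frequency in text."""
    counts = Counter(text)
    total = len(text)
    return sum(n * math.log2(n / total) for n in counts.values())

toy = "to be or not to be"
counts = Counter(toy)
entropy = -sum((n / len(toy)) * math.log2(n / len(toy)) for n in counts.values())

# The log-probability equals -N * H for the toy text
log2_p = log2_prob_of_text(toy)
print(log2_p, -len(toy) * entropy)

# Applying the same identity to the values reported in Questions 1 and 3
N, H = 184730, 4.095554914480837
log10_p = -N * H * math.log10(2)
print(log10_p)  # the integer part gives the power of ten of the probability
```

The last line reproduces the order of magnitude of the weighted probability from the entropy alone, which is a useful consistency check between Questions 1 and 3.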

## Question 4:

In this question we find the joint probability of 2-key sequences in Spamlet, and then use the resulting pair distribution to determine the probability that a monkey types Spamlet two keys at a time. To estimate the probability of each 2-key sequence we counted every adjacent pair of characters, including spaces, in Spamlet. To compute the probability of reconstructing Spamlet, however, we need the occurrences of the even pairs only, where an "even pair" is a pair whose second character sits at an even position in Spamlet. For example, if the first sentence of Spamlet were "A platform before the Castle," the even pairs would be "a ", "pl", "at", "fo", "rm", and so on, while " p", "la", "tf", and so on would be odd pairs. Building Spamlet from pairs therefore means concatenating even pairs: the first pair is always even, and the non-overlapping pair that follows an even pair is also even.
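
The distinction between even and odd pairs can be illustrated on the short example from the paragraph above. This is a minimal sketch using list slicing, separate from the `twoCharsDict` function defined below; the variable names are chosen for illustration only.

```python
text = "a platform"

# All adjacent (overlapping) pairs: used to estimate the pair probabilities
all_pairs = [text[i:i + 2] for i in range(len(text) - 1)]

# Non-overlapping pairs starting at index 0: the "even" pairs that tile the text
even_pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]

# The remaining overlapping pairs are the "odd" pairs
odd_pairs = [text[i:i + 2] for i in range(1, len(text) - 1, 2)]

print(even_pairs)  # ['a ', 'pl', 'at', 'fo', 'rm']
print(odd_pairs)   # [' p', 'la', 'tf', 'or']
```

Joining the even pairs reproduces the original string, which is why only their counts enter the probability of typing the text pair by pair.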

In [36]:
def twoCharsDict(charList):
    """Stores the occurrences of even 2-pairs in one dictionary, 
    and stores the total probabilities for all 2-pairs in Spamlet
    in another, then returns both dictionaries to the user.
    
    Arguments:
    charList - list of characters to sort
    
    Returns:
    pairDict_even - a dictionary of the number of occurrences of all even pairs
    dictProbs(pairDict_full) - a dictionary of the probability of all pairs
    """
    
    #  Creates a dictionary for storing the even character 2-pairs
    pairDict_even = {}
    # Runs over all even pairs in Spamlet 
    for i in range(len(charList)//2):
        charPair = charList[2*i]+charList[2*i+1]

        # Increments the count if a pair appears in the dictionary, and adds it otherwise.
        if charPair in pairDict_even.keys():
            pairDict_even[charPair] += 1
        else:
            pairDict_even[charPair] = 1
    
    # Creates a dictionary for storing all character 2-pairs in Spamlet
    pairDict_full = occurenceDict(charList, N = 2)
    
    # Returns the even pair occurrences, and a probability dictionary for the total occurrences.
    return pairDict_even, dictProbs(pairDict_full)

In [37]:
def testDisplay(pairProbs, size = 4):
    """Displays a test range of formatted probabilities
    in a square array based on the characters paired together."""
    
    # Creates a list of all lowercase letters as well as space
    testChars = list(' abcdefghijklmnopqrstuvwxyz')
    
    print("Test Probabilities:")
    # Adjoins the 'size' different characters into 'size^2' different pairs
    # and prints the resulting probability of each pair in a square array beside them
    for char1 in testChars[:size]:
        print('\t', end = "")
        for char2 in testChars[:size]:
            pair = ''.join([char1,char2])
            if pair in pairProbs.keys():
                print(pair + ":", "{:.4e}".format(pairProbs[pair]), end = '\t')
            else:
                print(pair + ":", "{:.4e}".format(0.0), end = '\t')
        print("")

In [38]:
# Obtains a dictionary of even 2-pairs in Spamlet along with one containing their probabilities
pairDict_even, totalPairProbs = twoCharsDict(charList)

# Obtains the probability of reconstructing Spamlet from the 2-pairs
pairSelectionProb = prob_selection(totalPairProbs, pairDict_even)

print("The probability of reconstructing Spamlet if pairs of keys " + 
      "are hit according to the even key distribution in Spamlet is",pairSelectionProb)
testDisplay(totalPairProbs)

The probability of reconstructing Spamlet if pairs of keys are hit according to the even key distribution in Spamlet is 4.78319487722973e-207316
Test Probabilities:
	  : 0.0000e+00	 a: 1.7729e-02	 b: 7.9468e-03	 c: 6.4852e-03	
	a : 3.9896e-03	aa: 0.0000e+00	ab: 6.7667e-04	ac: 1.5915e-03	
	b : 1.5699e-04	ba: 7.1997e-04	bb: 5.9547e-05	bc: 0.0000e+00	
	c : 5.0344e-04	ca: 1.8784e-03	cb: 0.0000e+00	cc: 2.4901e-04	


Therefore, when a monkey types on a keyboard of 2-key pairs, with each pair's probability equal to its frequency among the adjacent pairs in Spamlet, the probability of typing Spamlet is approximately $4.783 \times 10^{-207316}$. This is greater than the probability for single characters in both the uniform case and the weighted case. We also displayed the probabilities of a few test pairs in a square array to illustrate the calculated distribution.

## Question 5:

In this question we determine the entropy associated with 2-key sequences in Spamlet and for words in Spamlet.

In [39]:
# 2-key Entropy
# Calculates the shannon entropy for the pairs using the probability dictionary calculated in question 4
pairEntropy = shannonEntropy(totalPairProbs)

print("The shannon entropy of 2-key sequences in Spamlet in bits per pair is:",pairEntropy)

The shannon entropy of 2-key sequences in Spamlet in bits per pair is: 7.455948161245874


Hence, the Shannon Entropy of 2-key sequences in Spamlet is approximately 7.456 bits per pair, which is greater than the 4.096 bits per symbol found for single characters. Because each pair spans two characters, this is about 3.728 bits per character, lower than the single-character value, as expected when adjacent characters are correlated.
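
The comparison between pair and single-character entropies can be checked on a toy string. For any pair distribution, the joint entropy cannot exceed the sum of the entropies of its first and second characters, with equality only when the two are independent. The sketch below illustrates this with the standard library; the helper `entropy` and the toy text are assumptions for illustration, and the last lines simply reuse the values reported in Questions 1 and 5.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter of occurrences."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

toy = "the rain in spain stays mainly in the plain"
pairs = [toy[i:i + 2] for i in range(len(toy) - 1)]

# Joint entropy of adjacent pairs, and the entropies of the first and second characters
h_pair = entropy(Counter(pairs))
h_first = entropy(Counter(p[0] for p in pairs))
h_second = entropy(Counter(p[1] for p in pairs))
print(h_pair, h_first + h_second)

# Values reported for Spamlet: pair entropy converted to a per-character figure
spamlet_pair_per_char = 7.455948161245874 / 2
spamlet_single = 4.095554914480837
print(spamlet_pair_per_char, spamlet_single)
```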

In [40]:
def decodeWords(charList, numWords = 1):
    """Adjoins the characters in the character list into words, 
    and then sorts them into a dictionary of occurrences. If numWords is more than 1,
    adjacent words are put together."""
    
    # Joins the characters into words separated by spaces
    text = ''.join(charList)
    
    # Removes the spaces and stores the individual words into a list
    wordList = str.split(text, sep = " ")
    # Creates a dictionary counting the number of times each word occurs
    wordDict = occurenceDict(wordList, N = numWords, words = True)
    
    return wordDict

In [41]:
# Obtains a dictionary of words in Spamlet along with one containing their probabilities
wordDict = decodeWords(charList)
wordProbs = dictProbs(wordDict)

# Calculates the shannon entropy for the words in Spamlet
wordEntropy = shannonEntropy(wordProbs)

print("The shannon entropy of word sequences in Spamlet in bits per word is:",wordEntropy)

The shannon entropy of word sequences in Spamlet in bits per word is: 9.395778599888459


Therefore, the Shannon Entropy of words in Spamlet is approximately 9.396 bits per word. Per selected term, this is greater than the Shannon Entropy of both 2-key sequences (7.456 bits per pair) and single characters (4.096 bits per character).

## Question 6:

In this problem we attempt to generate text similar to Shakespeare through the use of weighted probability distributions of characters, pairs, words, and punctuation. Random phrases are reconstructed with each of these options (characters, pairs, or words) as the base building block to show how each step is closer to intelligible sentences than the last.

In [42]:
def wordAndPunct(charList, data):
    """Adjoins the characters in the character list into words,
    and appends punctuation before sorting them into a dictionary
    of occurrences in Spamlet.
    """
    
    # Initializes an array for punctuation
    punct = [',','.',';',':','?','!']
    
    # Creates a list of the occurrences of punctuation in Spamlet
    punctuations = [c for c in data if (c in punct)]
    
    # Joins the characters into words separated by certain numbers of spaces
    text = ''.join(charList)
    
    # Removes the spaces and stores the individual words into a list along with the punctuation
    wordAndPunctList = str.split(text)
    wordAndPunctList.extend(punctuations)
    
    # Creates a dictionary counting the number of times each word and punctuation mark occurs
    totalDict = occurenceDict(wordAndPunctList)
    
    return totalDict

In [43]:
def randTermPhrase(termDict, min_length, words = False):
    """Creates a phrase of a given minimum length using probabilities 
    associated with each term (for example pairs or singular characters) in the dictionary.
    
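    Arguments:
    termDict - dictionary of occurrences (or unnormalized weights) of the terms used to generate our phrase
    min_length - the minimum number of entries in the phrase list; when words is True the
                 separating spaces are counted too, so about min_length/2 words are generated
    words - boolean describing whether the terms in termDict are whole words or not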
    
    Returns:
    finalPhrase - the string obtained by concatenating the generated terms
    """
    
    # Initializes a list for the terms in a phrase
    phraseTerms = []
    
    # Obtains a dictionary of probabilities for the terms
    probDict = dictProbs(termDict)
    
    # Adds terms to the phrase list until it reaches the min length
    while (len(phraseTerms) < min_length):
        
        # Obtains a random term based on the probabilities in the dictionary
        term = np.random.choice(list(probDict.keys()), p = list(probDict.values()))
        
        # Appends the selected term
        phraseTerms.append(term)
        
        # If the terms are words, add a space after each one
        if words:
            phraseTerms.append(' ')
    
    # Concatenates the terms into a single string
    finalPhrase = ''.join(phraseTerms)
    
    return finalPhrase

In [44]:
def randWordPunctPhrase(fullDict, wordDict, min_length = 50):
    """Creates a phrase of a given minimum length using
    probabilities associated with each word in the dictionary
    as well as the probabilities of punctuation.
    
    Arguments:
    fullDict - dictionary of all words and punctuation along with their occurrences
    wordDict - dictionary of all words with their occurrences
    min_length - the minimum number of words and punctuation points in the generated phrase
    
    Returns:
    finalPhrase - the string obtained by concatenating the generated words and punctuation
    """
    
    # Initializes arrays for punctuation and end punctuation of sentences
    endPunct = ['.','?','!']
    punct = [',','.',';',':','?','!']

    # Initializes a list for the terms in a phrase
    phraseTerms = []
    
    # Turns the occurrences into probabilities for just the words 
    # and for the words along with punctuation
    fullProbDict = dictProbs(fullDict)
    wordProbDict = dictProbs(wordDict)
    
    # Initialize Sentence with a random first word and capitalizes it
    term = np.random.choice(list(wordProbDict.keys()), p = list(wordProbDict.values()))
    term = term.capitalize()
    phraseTerms.append(term)
    
    # Adds words and punctuation to the phrase list until it is longer than the min length
    # and it ends in a period, exclamation mark, or question mark.
    while (len(phraseTerms) < min_length or not (phraseTerms[-1] in endPunct)):
        
        # Takes a random word if the last term was a punctuation, and capitalizes it
        # if it was also an ending punctuation (i.e. period, exclamation mark, or question mark)
        if phraseTerms[-1] in punct:
            term = np.random.choice(list(wordProbDict.keys()), p = list(wordProbDict.values()))
            if phraseTerms[-1] in endPunct:
                term = term.capitalize()
        else:
            term = np.random.choice(list(fullProbDict.keys()), p = list(fullProbDict.values()))
        
        # Adds a space in front of the term if it is not a punctuation
        if not(term in punct):
            phraseTerms.append(' ')
            
        # Capitalizes single i's
        if term == 'i':
            term = term.capitalize()
            
        # Appends the last term
        phraseTerms.append(term)
    
    # Joins the words and punctuations into a single string
    finalPhrase = ''.join(phraseTerms)
    
    return finalPhrase

In [45]:
def phraseGenerator(termDict, termType, min_termNum = 50, num_phrases = 3):
    """Generates random phrases using a dictionary of terms and their occurrences,
    and prints them to the screen.
    """
    
    # Checks whether the inputted term dictionary is in words
    if termType == "Word" or termType == "Double Word":
        areWords = True
    else:
        areWords = False
    
    # Decomposes termDict if the terms being printed are formatted words
    if termType == "Formatted Word":
        totalDict, wordDict = termDict
    
    # Prints phrases reconstructed from term probabilities
    print(termType + " Phrases:")
    for i in range(num_phrases):
        if termType == "Formatted Word":
            print(str(i+1) + ": \"" + str(randWordPunctPhrase(totalDict, wordDict)) + "\"")
        else:
            print(str(i+1) + ": \"" + str(randTermPhrase(termDict, min_termNum, words = areWords)) + "\"")

In [46]:
# Obtains a dictionary for the number of occurrences of pairs of words in Spamlet.
doubleWordDict = decodeWords(charList, numWords = 2)

# Obtains a dictionary for the number of occurrences of single words and punctuation
# using the character list and collection of data obtained in question 1
totalDict = wordAndPunct(charList, dataCollection)

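# Creates an array of the different term dictionaries in this assignment, and their names for printing.
# totalPairProbs holds probabilities rather than counts, which works because dictProbs renormalizes its input.
dictArray = [charDict, totalPairProbs, wordDict, doubleWordDict, (totalDict, wordDict)]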
typeArray = ["Letter", "Pair", "Word", "Double Word", "Formatted Word"]


for i in range(len(dictArray)):
    phraseGenerator(dictArray[i], typeArray[i])
    print("\n")

Letter Phrases:
1: "gl ioeitndilsto rley h vroo rohe  we r ohee niofia"
2: "r itgg mh f  dalitit  teu dste eaohrtrureunsssxrdc"
3: " ae si  rhaiont cscu toi y soiboavdn i psutoehd ui"


Pair Phrases:
1: "liitee olist ms a noy f ushso lasr smaths  f tnt h oelasnditrdor tallehe orie  nceheieguhopa thaomn "
2: "thmelaexon nic qshofcu dguishe fthd al camexe soseeala ctoetkeutve mteesmyire ha tacea crmeee pp fh "
3: "usagdoat h darourrinbld ironth nw  aow his ausannole ievr threbo nouhte rgvdisomhenscat stdrttni b p"


Word Phrases:
1: "i taxes the his argues can food goodly needs first of without lewdness that to hot where till an pronounced speak speak shows arm use "
2: "mould dear confess their not say kneels elsinore license throat our come waits something enter very ever literary did o whispers are though far delay "
3: "kissd live season hamlet have who come law pleasure hamlet give ophelia that overthrown one and int him or a held the accident to have "


Double Word Phrases:
1: "ha

As we can see, when single characters are the building block, the reconstructed phrases are almost entirely gibberish, and it is difficult to discern even the shape of any words. With pairs, the strings start to look slightly more word-like, with "she", "you", and "no" appearing, but they are still mostly unintelligible. With words as the building block the phrases become readable, and some fragments are even somewhat coherent. With pairs of adjacent words the phrases start to read more like Shakespeare, and their structure makes more sense as a whole. Finally, adding capitalization and punctuation brings the random phrases closer to the form of Shakespearean text, although, because this is done with single words rather than pairs, the structure of these phrases is less coherent than that of the double-word phrases.
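
The weighted sampling behind these phrases can be sketched compactly with the standard library. This is an alternative illustration using `random.choices` rather than the `np.random.choice` calls in the functions above; the function name `random_phrase` and the toy corpus standing in for Spamlet are assumptions.

```python
import random
from collections import Counter

def random_phrase(terms, num_terms, seed=None):
    """Draws num_terms terms independently, weighted by how often each occurs in terms."""
    rng = random.Random(seed)
    counts = Counter(terms)
    chosen = rng.choices(list(counts.keys()), weights=list(counts.values()), k=num_terms)
    return " ".join(chosen)

# Toy corpus standing in for Spamlet (an assumption for illustration)
words = "to be or not to be that is the question".split()
word_pairs = [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

print(random_phrase(words, 6, seed=0))       # single words as building blocks
print(random_phrase(word_pairs, 3, seed=0))  # adjacent word pairs as building blocks
```

Because each draw is independent, any coherence in the output comes only from the structure already contained in the building blocks, which is why longer blocks read more naturally.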

## Conclusions/Summary

This assignment focused on probability and entropy calculations for a monkey reproducing Spamlet by randomly hitting keys on a keyboard. In Question 1, the Shannon Entropy of Spamlet was found to be 4.096 bits per character.

In Question 2, the number of sequences of length 184730 made up of 27 keys, each selected with probability 1/27, was found to be $6.732 \times 10^{264415}$. Because the key distribution is uniform, the probability of producing the exact key sequence of Spamlet is the inverse of this number, $1.485 \times 10^{-264416}$.

Question 3 examined the probability of reproducing Spamlet through random keystrokes when the keys are not equally likely. The probability of each of the 27 keys was set to its frequency in Spamlet, and the overall probability of reproducing Spamlet was calculated to be $1.501 \times 10^{-227751}$. A monkey is therefore more likely to type Spamlet when the probability of hitting each key follows the distribution of keys in Spamlet itself than when the keys are uniformly distributed.

In Question 4, the joint probability of 2-key sequences was determined for every pair of adjacent keys in Spamlet. The probability of a monkey producing Spamlet by hitting pairs of keys according to this distribution was determined to be $4.783 \times 10^{-207316}$, which is higher than both single-key probabilities (uniform and non-uniform).

Finally, the Shannon Entropy of 2-key sequences and of words in Spamlet was determined to be 7.456 bits per pair and 9.396 bits per word, respectively. Hence, as the terms we select become more complex (characters, then pairs of characters, then full words), the Shannon Entropy per term increases. This follows intuitively from the fact that selecting a random word from all the possibilities in Spamlet provides more information than selecting a 2-key sequence, which in turn provides more information than selecting a single character. Indeed, the number of possible single characters is $27$, the number of possible 2-key sequences is $27\cdot 27$, and the number of possible words is larger still, since words are sequences of one or more letters. Thus, the number of possibilities increases at each step. Measured per character, however, the pair entropy is about 3.728 bits, lower than the single-character value, which reflects the correlation between adjacent characters and is consistent with the higher probability found in Question 4. To conclude, this assignment analyzed the probability that typing monkeys, as a model of a random process, reproduce Spamlet.


In optional question 6, a random generation of characters, 2-key sequences, words, and pairs of words was used to create a section of text that, although largely incoherent, resembled Spamlet. Here it was found that to achieve the most coherent and sensible result, it was useful to pull pairs of adjacent words from Spamlet and include simple grammar such as capitalization and punctuation. 