# Phys 481 Fall 2021 Assignment 1: Spamlet
### A.G. Swadling (30098501)
### E.J. Thompson (30087678)
### G.J. Gelinas (30085897)
### T.J. Cey (30088060)

In [5]:
# load libraries for numerical methods and plotting
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Load libraries for reading from a url and handling large numbers
import urllib.request
import mpmath as mp

## Introduction



## Questions

## Question 1:

In this question we determine the Shannon entropy of Spamlet in bits per character, using the probabilities of the different characters occurring in Spamlet.
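For reference, the quantity computed in the cells that follow is the standard Shannon entropy of a discrete distribution. Writing $p_i$ for the probability of character $i$ (notation introduced here for convenience, not taken from the assignment text):

$$H = -\sum_i p_i \log_2 p_i$$

The base-2 logarithm gives $H$ in bits. A uniform distribution over $M$ symbols attains the maximum value $H = \log_2 M$, and any non-uniform distribution over the same symbols gives a smaller value.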

In [1]:
def shannonEntropy(probs):
    """Returns the shannon entropy associated with a collection of probabilities"""
    
    # Obtains the probability associated with all the items in the probs dictionary
    prob_values = list(probs.values())
    
    # Calculates the shannon entropy
    entropy = -sum(np.log2(prob_values)*prob_values)
    
    return entropy

In [2]:
def dictProbs(itemDict):
    """Returns a dictionary of probabilities associated with the occurrences of items in a dictionary of counts"""
    
    # Total number of occurrences
    total = sum(itemDict.values())
    
    # Initializes probability dictionary
    probs = {}
    
    # Gets the count of each item and adds its relative frequency to the dictionary
    for item in itemDict.keys():
        probs[item] = itemDict[item]/total
    
    return probs

In [3]:
def singleCharDecode(url):
    """Reads the data from a url and decodes it into characters before sorting it into a dictionary"""
    
    # Retrieves bytedata and decodes it into characters
    bytedata = urllib.request.urlopen( url ).read()
    data = bytedata.decode()

    # Creates a list to contain all characters in the data 
    charlist = []
    # Adds characters and spaces, as well as single newlines as spaces, to our character list
    for char in data:
        if (char.isalpha()):
            charlist.append(char.lower())
        elif (char == ' ' or char == '\n') and charlist and charlist[-1] != ' ': # Removes double spaces (and leading whitespace)
            charlist.append(' ')
    
    # Creates a dictionary counting the number of times each character occurs
    charDict = {}
    for char in charlist:
        # Increments the character's dictionary value if it is already stored, 
        # and initializes it at one otherwise
        if char in charDict.keys():
            charDict[char] += 1
        else:
            charDict[char] = 1
    
    return charDict
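The cleaning and counting steps above can be tried on a short string without downloading anything. The following is a minimal sketch, not the notebook's own code: `normalize` is a hypothetical helper that takes a string instead of a URL, and `collections.Counter` stands in for the explicit counting loop.

```python
from collections import Counter

def normalize(text):
    """Hypothetical helper: keep letters (lowercased) and collapse
    runs of spaces/newlines into a single space."""
    chars = []
    for ch in text:
        if ch.isalpha():
            chars.append(ch.lower())
        # 'chars and ...' guards against indexing an empty list
        elif ch in " \n" and chars and chars[-1] != " ":
            chars.append(" ")
    return "".join(chars)

cleaned = normalize("To be,\nor  not to be")
counts = Counter(cleaned)  # maps each character to its number of occurrences

print(repr(cleaned))
print(counts.most_common(3))
```

Punctuation is dropped entirely, while the newline and the double space each become a single space, so the sample reduces to `to be or not to be`.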

In [7]:
# Obtain spamlet data
url = r'http://www.gutenberg.org/files/1524/1524-0.txt'

# Obtains the character dictionary and associated probability dictionary for calculating the shannon entropy
charDict = singleCharDecode(url)
charProbs = dictProbs(charDict)
entropy = shannonEntropy(charProbs)

print("The entropy of Spamlet in bits per character is:", entropy)

The entropy of Spamlet in bits per character is: 4.095554914480837


Thus, we observe that the Shannon entropy of Spamlet is approximately 4.0955549 bits per character.
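One way to put this number in context is to compare it with the largest entropy a 27-symbol alphabet can have, $\log_2 27$, which is reached only when every key is equally likely. The sketch below assumes the cleaned text really contains only the 26 letters and the space; the variable names are illustrative.

```python
import math

n_keys = 27
h_max = math.log2(n_keys)        # entropy of a uniform 27-key distribution
h_spamlet = 4.095554914480837    # value reported in the output above

# Fraction by which the text falls short of the uniform maximum
redundancy = 1.0 - h_spamlet / h_max

print(f"uniform entropy: {h_max:.4f} bits per character")
print(f"relative redundancy: {redundancy:.3f}")
```

The uniform value is about 4.755 bits per character, so the single-character statistics of Spamlet sit roughly 14% below the maximum.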



## Question 2:

In this question we determine the probability of a monkey typing Spamlet on a keyboard consisting of the 26 letter keys and a spacebar, where all 27 keys are equally probable. In particular, we determine the number of 27-key sequences of length 184730 (the length of Spamlet after discarding punctuation and digits and collapsing repeated spaces and newlines into single spaces), and we determine the probability of any one of these sequences being typed.

In [8]:
def num_sequences(num_options, length):
    """Returns the number of sequences of a certain length given that each entry has num_options choices"""
    
    return mp.power(num_options,length)

In [9]:
def prob_total(num_options, length):
    """Returns the probability of obtaining a particular sequence of characters of length 'length'
    from a selection of 'num_options' symbols."""
    
    return mp.power(num_options, -length)

In [57]:
# Obtains the character dictionary and its length
charDict = singleCharDecode(url)
length = sum(charDict.values())

# Obtains the number of sequences and the probability of any given sequence 
# of length equal to Spamlet for the 26 letter characters and space
sequenceNum = num_sequences(27, length)
probTot = prob_total(27, length)

print("The number of sequences of length", length, "made up of 27 keys is", sequenceNum)
print("The probability of any one such sequence occurring is", probTot)

The number of sequences of length 184730 made up of 27 keys is 6.73213923799414e+264415
The probability of any one such sequence occurring is 1.4854119391297e-264416


Thus, if the monkey presses each of the 27 keys with equal probability, the number of possible sequences is approximately $6.732139 \times 10^{264415}$, and each sequence has a probability of approximately $1.4854119 \times 10^{-264416}$ of being typed at random.
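These magnitudes are far outside the range of ordinary floating-point numbers, which is why an arbitrary-precision library is used above. As a rough cross-check that needs only the standard library, the same result can be recovered from logarithms. This is a sketch under the assumption that the length is the 184730 characters reported above.

```python
import math

n_keys = 27
length = 184730  # number of characters reported above

# An ordinary float cannot hold the probability: it underflows to zero
underflowed = 27.0 ** -length
print(underflowed)

# Work with log10 instead: log10(27**length) = length * log10(27)
log10_count = length * math.log10(n_keys)
exponent = math.floor(log10_count)
mantissa = 10 ** (log10_count - exponent)

print(f"27^{length} is about {mantissa:.3f}e+{exponent}")
```

The mantissa and exponent obtained this way agree with the `mpmath` output to the digits shown.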

## Question 3:

We now investigate how the probability of a monkey typing Spamlet changes if the key probabilities are not uniform but instead the probability of hitting any given key is the same as the probability of selecting that key at random out of Spamlet.
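Assuming the keystrokes are independent, this probability has a compact form. Let $n_i$ be the number of times character $i$ appears in Spamlet, $N = \sum_i n_i$ the total length, and $p_i = n_i / N$ the key probabilities (symbols introduced here for illustration). Then

$$P = \prod_i p_i^{\,n_i}, \qquad \log_2 P = \sum_i n_i \log_2 p_i = -N \sum_i p_i \log_2 p_i = -N H,$$

where $H$ is the single-character Shannon entropy from Question 1. In other words, the probability of typing the text is $2^{-NH}$, compared with $2^{-N \log_2 27}$ for the uniform keyboard.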

In [13]:
def prob_selection(probs, occurences):
    """
    Returns the probability of typing one specific sequence in which each
    symbol appears a given number of times, assuming every symbol is drawn
    independently with the probability given in probs.
    
    Arguments:
    probs - a dictionary of probabilities of selecting each item
    occurences - a dictionary for the number of selections of each desired symbol
    
    Returns:
    prob -  the product over all symbols of probs[symbol]**occurences[symbol]
    """
    
    # Obtains the terms in the probability dictionary
    terms = probs.keys()
    
    # Initializes the probability
    prob = 1.0
    
    # Multiplies the independent probability of each character
    for term in terms:
        prob *= mp.power(probs[term], occurences[term])
    
    
    return prob

In [15]:
# Obtains the character dictionary and associated probability dictionary
charDict = singleCharDecode(url)
charProbs = dictProbs(charDict)

weightedProb = prob_selection(charProbs, charDict)

print("The probability of typing Spamlet given that the probability " + 
      "distributions for the keys are the same as they are for those in Spamlet is", weightedProb)

The probability of typing Spamlet given that the probability distributions for the keys are the same as they are for those in Spamlet is 1.50127331056182e-227751


Therefore, the probability of a monkey typing Spamlet when each of the 27 keys is equally probable, approximately $1.4854119 \times 10^{-264416}$, is far smaller than the probability when each key is hit with the same probability as that of selecting it at random out of Spamlet, approximately $1.5012733 \times 10^{-227751}$. The weighted keyboard is more likely to produce Spamlet by a factor of roughly $10^{36665}$.
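This result can be cross-checked against the entropy from Question 1. When the key probabilities equal the character frequencies of the text, the probability of typing the text is $2^{-NH}$, with $N$ the length and $H$ the entropy in bits per character. The sketch below plugs in the values reported above; it assumes independent keystrokes and uses only the standard library.

```python
import math

length = 184730                   # characters in the cleaned text
entropy_bits = 4.095554914480837  # bits per character, from Question 1

# log10 of 2**(-length * entropy_bits)
log10_prob = -length * entropy_bits * math.log10(2)
exponent = math.floor(log10_prob)
mantissa = 10 ** (log10_prob - exponent)

print(f"probability is about {mantissa:.4f}e{exponent}")
```

The value obtained this way matches the directly computed probability to the digits shown, which is a useful consistency check between the two questions.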

## Question 4:

In this question we find the joint probability of tuples (pairs) of keys in Spamlet, and then use the resulting probability distribution to determine the probability that a monkey types Spamlet when it types pairs of keys at a time following this distribution.

If a monkey were to attempt to reconstruct Spamlet from tuples found in Spamlet, only the even tuples would contribute, where by "even tuple" we mean a tuple whose second letter occurs at an even position in Spamlet (counting from 1). For example, if the first sentence of Spamlet were "A platform before the Castle," the even tuples would be "a ", "pl", "at", "fo", "rm", and so on, while " p", "la", "tf", and so on would be odd tuples. Building Spamlet from tuples means placing non-overlapping tuples one after another: the first tuple is always even, and any tuple that follows an even tuple is also even. Hence, in determining the probabilities of occurrence, we consider only the even tuples in Spamlet.
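The even/odd distinction can be illustrated on the first two words of the example sentence. This is a small sketch with illustrative variable names; it uses string slicing rather than the index arithmetic of the notebook functions.

```python
text = "a platform"

# Even tuples: non-overlapping pairs starting at index 0, 2, 4, ...
even_pairs = [text[i:i + 2] for i in range(0, len(text) - 1, 2)]

# Odd tuples: the pairs that straddle two neighbouring even tuples
odd_pairs = [text[i:i + 2] for i in range(1, len(text) - 1, 2)]

print(even_pairs)
print(odd_pairs)
```

Concatenating the even tuples in order reproduces the text, whereas the odd tuples overlap them and cannot be placed alongside them.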

In [None]:
def twoCharsDecode(url):
    """Stores the occurrences of even tuples in one dictionary, 
    and stores the total probabilities for all tuples in Spamlet
    in another, then returns both dictionaries to the user.
    """
    
    # Retrieves bytedata and decodes it into characters
    bytedata = urllib.request.urlopen( url ).read()
    data = bytedata.decode()

    # Creates a list to contain all characters in the data 
    charlist = []
    # Adds characters and spaces, as well as single newlines as spaces, to our character list
    for char in data:
        if (char.isalpha()):
            charlist.append(char.lower())
        elif (char == ' ' or char == '\n') and charlist and charlist[-1] != ' ': # Removes double spaces (and leading whitespace)
            charlist.append(' ')
    
    
    #  Creates a dictionary for storing the even character tuples
    tupleDict_even = {}
    # Runs over all even pairs in Spamlet 
    for i in range(len(charlist)//2):
        charPair = charlist[2*i]+charlist[2*i+1]

        # Increments the count if a tuple appears in the dictionary, and adds it otherwise.
        if charPair in tupleDict_even.keys():
            tupleDict_even[charPair] += 1
        else:
            tupleDict_even[charPair] = 1
    
    # Creates a dictionary for storing all character tuples in Spamlet
    tupleDict_full = {}
    # Runs over all pairs in Spamlet 
    for i in range(len(charlist)-1):
        charPair = charlist[i]+charlist[i+1]

        # Increments the count if a tuple appears in the dictionary, and adds it otherwise.
        if charPair in tupleDict_full.keys():
            tupleDict_full[charPair] += 1
        else:
            tupleDict_full[charPair] = 1
    
    # Returns the even tuple occurrences, and a probability dictionary for the total occurrences.
    return tupleDict_even, dictProbs(tupleDict_full)

In [58]:
def twoCharsDecode_even(url):
    """Counts the occurrences of even (non-overlapping) tuples of characters
    in a dictionary and returns the result
    """
    
    # Retrieves bytedata and decodes it into characters
    bytedata = urllib.request.urlopen( url ).read()
    data = bytedata.decode()

    # Creates a list to contain all characters in the data 
    charlist = []
    # Adds characters and spaces, as well as single newlines as spaces, to our character list
    for char in data:
        if (char.isalpha()):
            charlist.append(char.lower())
        elif (char == ' ' or char == '\n') and charlist and charlist[-1] != ' ': # Removes double spaces (and leading whitespace)
            charlist.append(' ')
    
    # Creates a dictionary for storing character tuples
    tupleDict = {}
    # Runs over all pairs in Spamlet such that the 2nd term is even (denoting the 0th term as 1)
    for i in range(len(charlist)//2):
        charPair = charlist[2*i]+charlist[2*i+1]

        # Increments the count if a tuple appears in the dictionary, and adds it otherwise.
        if charPair in tupleDict.keys():
            tupleDict[charPair] += 1
        else:
            tupleDict[charPair] = 1
    
    return tupleDict

In [59]:
def testDisplay(tupleProbs, size = 4):
    """Displays a test range of formatted probabilities"""
    
    # Creates a list of all lowercase letters as well as space
    testChars = list(' abcdefghijklmnopqrstuvwxyz')
    
    print("Test Probabilities:")
    for char1 in testChars[:size]:
        print('\t', end = "")
        for char2 in testChars[:size]:
            pair = ''.join([char1,char2])
            if pair in tupleProbs.keys():
                print(pair + ":", "{:.4e}".format(tupleProbs[pair]), end = '\t')
            else:
                print(pair + ":", "{:.4e}".format(0.0), end = '\t')
        print("")

In [60]:
# Obtains a dictionary of even tuples in Spamlet along with one containing their probabilities
tupleDict = twoCharsDecode_even(url)
tupleProbs = dictProbs(tupleDict)

# Obtains the probability of reconstructing Spamlet from the tuples
tupleSelectionProb = prob_selection(tupleProbs, tupleDict)

print("The probability of reconstructing Spamlet if pairs of keys " + 
      "are hit according to the even key distribution in Spamlet is",tupleSelectionProb)
testDisplay(tupleProbs)

The probability of reconstructing Spamlet if pairs of keys are hit according to the even key distribution in Spamlet is 1.64706860331996e-207260
Test Probabilities:
	  : 0.0000e+00	 a: 1.7582e-02	 b: 7.7194e-03	 c: 6.4743e-03	
	a : 3.9734e-03	aa: 0.0000e+00	ab: 7.4704e-04	ac: 1.4941e-03	
	b : 1.5157e-04	ba: 8.1200e-04	bb: 4.3306e-05	bc: 0.0000e+00	
	c : 5.0885e-04	ca: 1.9596e-03	cb: 0.0000e+00	cc: 2.4901e-04	


Therefore, using the even tuple approach, the probability of a monkey typing Spamlet on a keyboard of tuples, where each tuple is typed with the probability with which it occurs in Spamlet, is approximately $1.6470686 \times 10^{-207260}$, which is greater than the corresponding probability for single characters ($1.5012733 \times 10^{-227751}$). Additionally, we have displayed the probabilities of certain test tuples in a square array.

## Question 5:

In this question we determine the entropy associated with 2-key sequences in Spamlet and for words in Spamlet.

In [23]:
# 2-key Entropy

# Obtains a dictionary of even tuples in Spamlet along with one containing their probabilities
tupleDict = twoCharsDecode_even(url)
tupleProbs = dictProbs(tupleDict)

# Calculates the shannon entropy for the tuples
tupleEntropy = shannonEntropy(tupleProbs)

print("The Shannon entropy of 2-key sequences in Spamlet in bits per pair is:",tupleEntropy)

The Shannon entropy of 2-key sequences in Spamlet in bits per pair is: 7.454144936347158


Hence, we find that the Shannon entropy of 2-key sequences in Spamlet is approximately 7.4541449 bits per pair, or about 3.727 bits per character, since each pair spans two characters.
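The pair entropy can be tied back to the earlier questions in two ways: it can be compared with the single-character entropy on a per-character basis, and it should reproduce the Question 4 probability through $P = 2^{-N_{\text{pairs}} H_{\text{pair}}}$. The sketch below uses only the values reported above and assumes the number of even pairs is half the character count, rounded down.

```python
import math

n_chars = 184730
h_char = 4.095554914480837   # single-character entropy, Question 1
h_pair = 7.454144936347158   # entropy of even pairs, this question

# Each pair covers two characters
h_pair_per_char = h_pair / 2
print(f"{h_pair_per_char:.4f} vs {h_char:.4f} bits per character")

# Consistency check against the pair-typing probability of Question 4
n_pairs = n_chars // 2
log10_prob = -n_pairs * h_pair * math.log10(2)
exponent = math.floor(log10_prob)
mantissa = 10 ** (log10_prob - exponent)
print(f"probability is about {mantissa:.4f}e{exponent}")
```

The per-character figure for pairs is lower than the single-character entropy, which reflects the correlation between neighbouring letters and explains why typing in pairs makes Spamlet more probable.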

In [24]:
def decodeWords(url):
    """Reads the data from a url and decodes it into words before sorting it into a dictionary"""
    
    # Retrieves bytedata and decodes it into characters
    bytedata = urllib.request.urlopen( url ).read()
    data = bytedata.decode()

    # Creates a list to contain all characters in the data 
    charlist = []
    # Adds characters and spaces, as well as single newlines as spaces, to our character list
    for char in data:
        if (char.isalpha()):
            charlist.append(char.lower())
        elif (char == ' ' or char == '\n') and charlist and charlist[-1] != ' ': # Removes double spaces (and leading whitespace)
            charlist.append(' ')
    
    
    # Joins the characters into words separated by spaces
    text = ''.join(charlist)
    
    # Removes the spaces and stores the individual words into a list
    wordList = str.split(text, sep = " ")
    # Creates a dictionary counting the number of times each word occurs
    wordDict = {}
    for word in wordList:
        # Increments the count if a word appears in the dictionary, and adds it to the dictionary otherwise.
        if word in wordDict.keys():
            wordDict[word] += 1
        else:
            wordDict[word] = 1
    
    return wordDict

In [25]:
# Obtains a dictionary of words in Spamlet along with one containing their probabilities
wordDict = decodeWords(url)
wordProbs = dictProbs(wordDict)

# Calculates the shannon entropy for the words in Spamlet
wordEntropy = shannonEntropy(wordProbs)

print("The Shannon entropy of words in Spamlet in bits per word is:",wordEntropy)

The Shannon entropy of words in Spamlet in bits per word is: 9.395778599888459


Therefore, we find that the Shannon entropy of words in Spamlet is approximately 9.3957786 bits per word.

## Question 6:

In this problem we attempt to generate text similar to Shakespeare through the use of weighted probability distributions of characters, tuples, words, and punctuation. Random phrases are reconstructed with each of these options (characters, tuples, or words) as the base building block to show how each step comes closer to intelligible sentences than the last.

In [41]:
def wordAndPunct(url):
    """Reads the data from a url and decodes it into words and punctuation before sorting it into a dictionary"""
    
    # Retrieves bytedata and decodes it into characters
    bytedata = urllib.request.urlopen( url ).read()
    data = bytedata.decode()
    
    # Initializes an array for punctuation
    punct = [',','.',';',':','?','!']

    # Creates a list to contain all characters in the data 
    charlist = []
    # Adds characters and spaces, as well as single newlines as spaces, to our character list
    for char in data:
        if (char.isalpha()):
            charlist.append(char.lower())
        elif (char == ' ' or char == '\n') and charlist and charlist[-1] != ' ': # Removes double spaces (and leading whitespace)
            charlist.append(' ')
    
    punctuations = [c for c in data if (c in punct)]
    
    # Joins the characters into words separated by certain numbers of spaces
    text = ''.join(charlist)
    
    # Removes the spaces and stores the individual words into a list along with the punctuation
    wordAndPunctList = str.split(text)
    wordAndPunctList.extend(punctuations)
    
    # Creates a dictionary counting the number of times each word and punctuation mark occurs
    totalDict = {}
    for word in wordAndPunctList:
        # Increments the count if a word appears in the dictionary, and adds it to the dictionary otherwise.
        if word in totalDict.keys():
            totalDict[word] += 1
        else:
            totalDict[word] = 1
    
    return totalDict

In [42]:
def randTermPhrase(probDict, min_length = 100, words = False):
    """Creates a phrase of a given minimum length using the
    probabilities associated with each term (tuple, single character, or word) in the dictionary."""
    
    # Initializes a list for the terms in a phrase
    phraseTerms = []
    
    # Normalizes the probabilities so that they sum to one, leaving probDict unmodified
    terms = list(probDict.keys())
    weights = np.array(list(probDict.values()), dtype = float)
    weights = weights/weights.sum()
    
    # Adds terms to the phrase list until it reaches the min length
    while (len(phraseTerms) < min_length):
        
        # Obtains a random term based on the probabilities in the dictionary
        term = np.random.choice(terms, p = weights)
        
        # Appends the new term
        phraseTerms.append(term)
        
        # If the terms are words, add a space after each one
        if words:
            phraseTerms.append(' ')
    
    return ''.join(phraseTerms)
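The core of this function is weighted random sampling. As a minimal illustration on a toy distribution, the sketch below uses the standard library's `random.choices` instead of `np.random.choice`; the three-symbol dictionary `toy_probs` and the seed are made up for the example and have nothing to do with the Spamlet statistics.

```python
import random

# Hypothetical three-symbol distribution, for illustration only
toy_probs = {"a": 0.5, "b": 0.3, " ": 0.2}

rng = random.Random(481)  # fixed seed so the run is repeatable
terms = list(toy_probs.keys())
weights = list(toy_probs.values())

# Draw 20 symbols independently, each according to its weight
phrase = "".join(rng.choices(terms, weights=weights, k=20))
print(repr(phrase))
```

Drawing all symbols in a single call avoids rebuilding the key and weight lists on every iteration, which matters once the dictionary holds thousands of words.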

In [47]:
def randWordPunctPhrase(fullDict, wordDict, min_length = 50):
    """Creates a phrase of a given minimum length using
    probabilities associated with each word in the dictionary."""
    
    # Initializes arrays for punctuation and end punctuation of sentences
    endPunct = ['.','?','!']
    punct = [',','.',';',':','?','!']

    # Initializes a list for the terms in a phrase
    phraseTerms = []
    
    # Turns the occurrences into probabilities for just the words 
    # and for the words along with punctuation, leaving the input dictionaries unmodified
    fullTerms = list(fullDict.keys())
    fullWeights = np.array(list(fullDict.values()), dtype = float)
    fullWeights = fullWeights/fullWeights.sum()
    wordTerms = list(wordDict.keys())
    wordWeights = np.array(list(wordDict.values()), dtype = float)
    wordWeights = wordWeights/wordWeights.sum()
    
    # Initializes the sentence with a random first word and capitalizes it
    term = np.random.choice(wordTerms, p = wordWeights)
    term = term.capitalize()
    phraseTerms.append(term)
    
    # Adds words and punctuation to the phrase list until it is longer than the min length
    # and it ends in a period, exclamation point, or question mark.
    while (len(phraseTerms) < min_length or not (phraseTerms[-1] in endPunct)):
        
        # Ensures punctuation is followed by a word, capitalized if it starts a sentence
        if phraseTerms[-1] in punct:
            term = np.random.choice(wordTerms, p = wordWeights)
            if phraseTerms[-1] in endPunct:
                term = term.capitalize()
        else:
            term = np.random.choice(fullTerms, p = fullWeights)
        
        # Adds a space in front of the term if it is not a punctuation
        if not(term in punct):
            phraseTerms.append(' ')
            
        # Capitalizes single i's
        if term == 'i':
            term = term.capitalize()
            
        # Appends the last term
        phraseTerms.append(term)
    
    return ''.join(phraseTerms)

In [48]:
# Obtains dictionaries of characters, tuples, and words in Spamlet along with ones containing their probabilities
charDict = singleCharDecode(url)
charProbs = dictProbs(charDict)
tupleDict = twoCharsDecode_even(url)
tupleProbs = dictProbs(tupleDict)
wordDict = decodeWords(url)
wordProbs = dictProbs(wordDict)
totalDict = wordAndPunct(url)


print("Letter Phrases:")
for i in range(3):
    print(str(i+1) + ": \"" + str(randTermPhrase(charProbs)) + "\"")
print('\n')
    
print("Tuple Phrases:")
for i in range(3):
    print(str(i+1) + ": \"" + str(randTermPhrase(tupleProbs)) + "\"")
print('\n')

print("Word Phrases:")
for i in range(3):
    print(str(i+1) + ": \"" + str(randTermPhrase(wordProbs, words = True)) + "\"")
print('\n')

print("Formatted Word Phrases:")
for i in range(3):
    print(str(i+1) + ": \"" + str(randWordPunctPhrase(totalDict, wordDict)) + "\"")

Letter Phrases:
1: "utd osktthemts yel aa   esenhurina ue ee    ne iannow sovinllmi sartm soux s tl rngteseorethe kirhio"
2: "gtfoedwef mr asesfnosnhrtous  ehesid   dkrrreh  irisigd r y rgsovweg sttdl ooreoetnrc  armuessydh rf"
3: "rhfnimwdft  eehr ibpii xh neieaoe  dwclllvhagvmta tpoie tl ri mestfheete n tgineh  puntroft iee shn "


Tuple Phrases:
1: "an a fe owasigke sh igs enfingbewiho cusy av tenle oley enotad nisndri hthr r  dutysllheiton as gvd ommo lor t csto s ri nfohisc illyo tndssanegheomooteraon f hi yoteatityodsg iaetonulrtin wh mpannois"
2: "t  oen to  wswo ctkierm atitthreou aheamnitost m a bndro tenzzvdllkspai ouyo gblaneaplcoeteds ndde ihtutoaf ustena omeccthtldedeteai akietofrbhennshughasohoanisetglriarhimendwat shtoal tutol golanu  n"
3: "weowdeatf voenncosm infaet a she no  d bitmaf t epe yequ we ertoanomormdonh verendye sou ber an  de anusagrif  mnig  i hixml fth pmiantesefoor avewial muswh se taunve tomuge le you ymm wct pett  serm "


Word Phrases:
1: "all did ugl

As we can see, when the building blocks are single characters the reconstructed phrases are almost entirely gibberish, and it is difficult to discern even the shape of any words. With tuples, the strings start to appear slightly more word-like, with "she", "you", and "no" appearing, but they are still mostly unintelligible. With words as the building block the phrases become readable, and occasionally form somewhat coherent fragments, since they follow the probability distribution of words in Spamlet. Finally, the addition of capitalization and punctuation brings the random phrases closer still to the form of Shakespearean text.

## Conclusions/Summary

