We will implement a very simple encryption scheme that closely resembles the one-time pad. You have probably seen this method used in movies like [Unknown](http://www.imdb.com/title/tt1401152/?ref_=nm_flmg_act_43). The idea is that you and your counterparty share a book whose words you will use as the raw material for a codebook. In this case, you need [Metamorphosis, by Franz Kafka](https://storage.googleapis.com/class-notes-181217.appspot.com/pg5200.txt).

Your job is to create a codebook of 2-tuples that map to specific words in the given text, based on the line on which each word appears and its position within that line. The text is very long, so there will be duplicated words. Strip out all of the punctuation and make everything lowercase.

For example, the word **let** appears on line `1682` in the text as the fourth word (reading from left to right). Similarly,
the word **us** appears on line `1760` as the fifth word.
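
The mapping from words to positions can be sketched on a small in-memory example. This is only an illustration under stated assumptions: `build_codebook` and `sample` are hypothetical names, the indices count from zero as Python's `enumerate` does, and punctuation is taken to mean anything that is neither a word character nor whitespace.

```python
import re
from collections import defaultdict

def build_codebook(lines):
    """Map each normalized word to every (line, position) where it occurs."""
    codebook = defaultdict(list)
    for line_no, line in enumerate(lines):
        # Drop punctuation, fold to lowercase, then split on whitespace.
        words = re.sub(r"[^\w\s]", "", line).lower().split()
        for word_no, word in enumerate(words):
            codebook[word].append((line_no, word_no))
    return codebook

# Hypothetical two-line "book" used only for this illustration.
sample = ["Let us go, then!", "Us? Not yet."]
codebook = build_codebook(sample)
print(codebook["us"])   # [(0, 1), (1, 0)]
```

A word that occurs several times ends up with several candidate 2-tuples, which is what makes it possible to encode repeated words differently.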

Thus, if the message you want to send is the following:

    let us not say we met late at the night about the secret
    
Then, one possible valid sequence for that message is the following:
    
    [(1682,4),(1760,5),(1650,2),(304,7),(1190,4),(2327,2),(731,4),(988,4),(1091,6),(958,7),(564,10),(1923,9),(849,2)]

Your counterparty receives the above sequence of tuples, and, because she has the same text, she is able to look up the line and word numbers of each of the tuples to retrieve the encoded message. Notice that the word **the** appears twice in the above message but is encoded differently each time. This is because re-using codewords (i.e., 2-tuples) weakens the encryption. In case of repeated words, you should have a randomized scheme to ensure that no message contains the same 2-tuple twice, even if the same word appears multiple times in the message. If the message uses a word more times than that word occurs in the text, so that each occurrence in the message cannot have a unique 2-tuple, then the message should be rejected (i.e., assert against this).
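
One possible way to meet the uniqueness requirement is to count how often each word occurs in the message and draw that many positions without replacement. The sketch below assumes a codebook that is already built as a dictionary from words to lists of 2-tuples. The name `pick_unique_codes` and the tiny codebook are hypothetical and are not part of the assignment.

```python
import random
from collections import Counter

def pick_unique_codes(words, codebook):
    """Choose one (line, position) per word, never reusing a tuple."""
    pools = {}
    for word, count in Counter(words).items():
        positions = codebook.get(word, [])
        # Reject the message if the text cannot supply enough distinct tuples.
        assert len(positions) >= count, "not enough occurrences of %r" % word
        # random.sample draws without replacement, so the picks are distinct.
        pools[word] = random.sample(positions, count)
    # Hand out the pre-drawn tuples in message order.
    return [pools[word].pop() for word in words]

# Hypothetical codebook used only for this illustration.
codebook = {"the": [(0, 0), (3, 2), (7, 5)], "night": [(4, 1)]}
codes = pick_unique_codes(["the", "night", "the"], codebook)
assert len(set(codes)) == len(codes)  # no 2-tuple is used twice
```

Drawing all the tuples for a word in a single `random.sample` call avoids having to remember which positions were already used.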

Your assignment is to create an encryption function and the corresponding decryption function to implement this scheme. Note that your downloaded text should have 2362 lines and 25186 words in it.
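
A quick sanity check of the download could look like the following sketch. It assumes that the quoted word count refers to whitespace-separated tokens before any punctuation is stripped and that the file is readable as UTF-8. Both are assumptions, and `count_lines_and_words` is a hypothetical helper.

```python
def count_lines_and_words(fname):
    """Return (number of lines, number of whitespace-separated words)."""
    n_lines = n_words = 0
    with open(fname, encoding="utf-8") as f:
        for line in f:
            n_lines += 1
            n_words += len(line.split())
    return n_lines, n_words

# If the counting convention matches the one used for the figures above,
# count_lines_and_words("pg5200.txt") should give (2362, 25186).
```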



In [1]:
def encrypt_message(message,fname):
    '''
    Given `message`, which is a lowercase string without any punctuation, and `fname`, which is the
    name of a text file source for the codebook, generate a sequence of 2-tuples that
    represents the `(line number, word number)` of each word in the message. Both numbers are
    0-based indices, as produced by `enumerate`. The output is a list of 2-tuples for the entire
    message. Repeated words in the message never share the same 2-tuple.
    
    :param message: message to encrypt
    :type message: str
    :param fname: filename for source text
    :type fname: str
    :returns: list of 2-tuples
    '''
    import string, re, random
    from collections import defaultdict, Counter
    # check the types first so that the regex checks below cannot raise a TypeError
    assert isinstance(message, str) and isinstance(fname, str)
    assert len(message) > 0
    # no punctuation
    assert len(re.findall('[%s]'%re.escape(string.punctuation),message)) == 0
    # no uppercase characters
    assert len(re.findall('[%s]'%string.ascii_uppercase,message))==0
    
    # build the codebook: word -> list of (line, position) occurrences
    indexTable = defaultdict(list)
    with open(fname) as file:
        for line,sentence in enumerate(file):
            # strip punctuation and lowercase; must match the cleanup in decrypt_message
            pureLine = re.sub(r'[^\w\s]','',sentence).lower()
            for row,word in enumerate(pureLine.split()):
                indexTable[word].append((line,row))
    
    # draw as many distinct 2-tuples per word as the message needs
    pools = {}
    for word,count in Counter(message.split()).items():
        # reject messages that need more distinct 2-tuples than the text offers
        assert len(indexTable[word]) >= count
        pools[word] = random.sample(indexTable[word],count)
    return [pools[word].pop() for word in message.split()]


In [2]:
encrypt_message("let us not say we met late at the night about the secret", "pg5200.txt")

[(954, 4),
 (1878, 3),
 (196, 14),
 (533, 6),
 (2076, 4),
 (2327, 2),
 (731, 4),
 (1791, 1),
 (496, 9),
 (382, 6),
 (2276, 3),
 (1453, 2),
 (901, 0)]

In [3]:
def decrypt_message(inlist,fname):
    '''
    Given `inlist`, which is a list of 2-tuples, and `fname`, which is the
    name of a text file source for the codebook, return the decrypted message.
    Each 2-tuple is a 0-based `(line number, word number)` pair.
    
    :param inlist: list of 2-tuples to decrypt
    :type inlist: list
    :param fname: filename for source text
    :type fname: str
    :returns: string decrypted message
    '''
    import re
    
    assert isinstance(inlist, list) and isinstance(fname, str)
    for element in inlist:
        assert isinstance(element, tuple) and len(element) == 2
        
    pureFile = []
    with open(fname) as file:
        for line in file:
            # strip punctuation and lowercase; must match the cleanup in encrypt_message
            pureFile.append(re.sub(r'[^\w\s]','', line).lower().split())
    # look up each (line, position) pair and join the words with single spaces
    return ' '.join(pureFile[line][row] for line, row in inlist)


In [4]:
decrypt_message([(1682,4),(1760,5),(1650,2),(304,7),(1190,4),(2327,2),(731,4),(988,4),(1091,6),(958,7),(564,10),(1923,9),(849,2)], "pg5200.txt")

'let us not say we met late at the night about the secret'