In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab01.ipynb")

In [None]:
import numpy as np
from collections import defaultdict, Counter
import re
from functools import reduce
import math
import string
import pandas as pd
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
import random
%matplotlib inline

In [None]:
alphabet = string.ascii_lowercase

# Lab 1: Substitution Ciphers and Steganography
Contributions From: Ryan Cottone, Imran Khaliq-Baporia, Will Giorza

Welcome to Lab 1! In this lab, we will explore the most common type of early cipher -- the substitution cipher.
By the end of lab, you will have built a full encrypting and decrypting system complete with a cracking function. We will also be building a tool to hide secret messages inside of seemingly ordinary image files.

Please review the lecture slides and Note 1 if you are unsure what a substitution cipher is.

## Part 1: Substitution Ciphers

In this part, we'll build a system to encrypt and decrypt text using Caesar and Vigenere ciphers, and use frequency analysis to crack these ciphers when we don't know the key.

### Helpers

In [None]:
# Takes in a character and returns its index in the alphabet (a=0, b=1, ...)
def char_to_num(text):
    return ord(text.lower()) - 97

In [None]:
# Takes in a number and returns its character in the alphabet (0=a, 1=b, ...)
def num_to_char(num):
    return chr(num + 97).lower()

In [None]:
# Takes in a character and a numerical shift, and returns its shifted version. Example: (a, 2) -> c
def shift_letter(char, shift):
    if not char.isalpha():
        return char
    else:
        return num_to_char((char_to_num(char) + shift) % 26) 

In [None]:
# Lowercases the text and removes every character except letters, whitespace, and the punctuation !?.,$
def clean_text(text):
    return re.sub(r'[^A-Za-z\s!?.,$]*', '', text.lower())

In [None]:
# Pads the key to be equal to length by repeating itself
# Example: pad_key('abc', 7) -> 'abcabca'
def pad_key(key, length):
    key = clean_text(key)
    newkey = key
    for i in range(len(key), length):
        newkey += key[i%len(key)]
    return newkey

**Question 1**: Complete the following implementation of a Caesar Cipher.

As a quick reminder, a Caesar cipher shifts every letter in the plaintext over by a fixed amount, wrapping around at the end of the alphabet. For example, "abc" shifted by 1 becomes "bcd".
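
As a minimal sketch of the wraparound (independent of the lab's helpers; `shift_char` is a hypothetical name used only here and assumes a single lowercase letter), the modulo operation keeps shifted letters inside the alphabet:

```python
# Illustrative only: `shift_char` is a hypothetical helper, not part of the lab's code.
def shift_char(ch, shift):
    # Map 'a'..'z' to 0..25, shift, wrap around with % 26, then map back.
    return chr((ord(ch) - ord('a') + shift) % 26 + ord('a'))

print(shift_char('a', 1))   # b
print(shift_char('z', 1))   # a (wraps around the end of the alphabet)
print(shift_char('a', -1))  # z (Python's % is non-negative for a positive modulus)
```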

The encryption function is given to you. Complete the implementation of the decryption function.

*HINT: Be sure to use `shift_letter`*!

In [None]:
# Encrypts the given plaintext with the given numerical shift
def caesar_encrypt(plaintext, shift):
    plaintext = clean_text(plaintext)
    encrypted = "" 
    for i in range(len(plaintext)):
        encrypted += shift_letter(plaintext[i], shift)
    return encrypted

In [None]:
# Decrypts the given ciphertext with the given numerical shift
def caesar_decrypt(ciphertext, shift):
    ciphertext = clean_text(ciphertext)
    decrypted = "" 
    for i in range(len(ciphertext)):
        ...
    return decrypted

In [None]:
grader.check("q1")

**Question 2**: Complete the following implementation of a Vigenère cipher.

As a reminder, a Vigenère cipher acts as a Caesar cipher for each individual letter. Once we extend the key to be the same length as the plaintext, we shift each individual letter by its corresponding key letter (converted to a number). For example:

Plaintext: "abcd"

Key: "ab"

Padded key: "abab"

"abcd" + "abab" = "acce"

"acce" - "abab" = "abcd"

**The encryption function is given to you as an example. Complete the implementation of the decryption function.**

*Hint: be sure to use `pad_key`*
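
The letter arithmetic in the "abcd" + "abab" example above can be sketched as follows. This is only an illustration: `add_letters` is a hypothetical helper that assumes lowercase letters and an already padded key.

```python
# Illustrative only: `add_letters` is a hypothetical helper, not part of the lab's code.
def add_letters(text, padded_key):
    # Add each key letter's alphabet index (a=0, b=1, ...) to the text letter, mod 26.
    out = ""
    for t, k in zip(text, padded_key):
        out += chr((ord(t) - 97 + ord(k) - 97) % 26 + 97)
    return out

print(add_letters("abcd", "abab"))  # acce
```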

In [None]:
# Encrypts a given plaintext with the key using the Vigenere cipher
def vigenere_encrypt(plaintext, key):
    plaintext = clean_text(plaintext)
    padded_key = pad_key(key, len(plaintext)) 
    encrypted = ""
    
    for i in range(len(plaintext)):
        encrypted += shift_letter(plaintext[i], char_to_num(padded_key[i])) 
    
    return encrypted

In [None]:
# Decrypts a given ciphertext with the key using the Vigenere cipher
def vigenere_decrypt(ciphertext, key):
    ciphertext = clean_text(ciphertext)
    padded_key = ...
    decrypted = ""
    
    for i in range(len(ciphertext)):
        ...
        
    return decrypted

In [None]:
grader.check("q2")

In this part, we'll use frequency analysis to crack a Caesar cipher where we don't know the plaintext.

Here are some more helper functions:

In [None]:
english_frequencies = {
    'a': .0812,
    'b': .0149,
    'c': .0271,
    'd': .0432,
    'e': .1202,
    'f': .0230,
    'g': .0203,
    'h': .0592,
    'i': .0731,
    'j': .0100,
    'k': .0069,
    'l': .0398,
    'm': .0261,
    'n': .0695,
    'o': .0768,
    'p': .0182,
    'q': .0011,
    'r': .0602,
    's': .0628,
    't': .0910,
    'u': .0288,
    'v': .0111,
    'w': .0209,
    'x': .0017,
    'y': .0211,
    'z': .0007
}
np_english_frequencies = np.fromiter(english_frequencies.values(), dtype=float)

In [None]:
# Return a dictionary with the counts of every letter.
def count_letters(text):
    init = Counter(english_frequencies.keys())
    init.update(text)
    return {k: init[k] - 1 for k in english_frequencies.keys()}

In [None]:
# Calculates the proportions of each letter in the given text
def calculate_proportions(text): # Coded for you
    counts = count_letters(text).values()
    nparr = np.fromiter(counts, dtype=float)
    return nparr / sum(counts)

In [None]:
# Plots frequencies 
def plot_freqs(freqs): # Coded for you
    plt.bar([alphabet[i] for i in range(26)], freqs)

In [None]:
# Plots the frequencies of letters from a given text
def plot_freqs_from_text(text): # Coded for you
    plot_freqs(calculate_proportions(text))

Now, let's try plotting the frequencies of letters in a long text:

In [None]:
sample_text = ""
with open('sample.txt', 'r') as file:
    sample_text = file.read().replace('\n', '')

In [None]:
plot_freqs_from_text(sample_text)

In [None]:
plot_freqs(english_frequencies.values())

In [None]:
# Plots two frequency histograms over one another
def plot_overlay(freq1, freq2): # Coded for you
    plt.bar([alphabet[i] for i in range(26)],freq1, color='orange', width = 0.5)
    plt.bar([alphabet[i] for i in range(26)], freq2, color='blue', alpha=0.5)

In [None]:
plot_overlay(np_english_frequencies, calculate_proportions(sample_text))

As you can see, the frequencies of longer texts are almost identical to the frequency of English as a whole. 

We can now write a function to compute the Total Variation Distance of two categorical distributions:

$$TVD(freq1, freq2) = \frac{1}{2} \sum_{i=0}^{k} | freq1_i - freq2_i | $$
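
As a small worked example of the formula (the two toy distributions below are made-up numbers, not data from the lab):

```python
import numpy as np

# Two toy distributions over the same three categories (made-up numbers).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# Half the sum of absolute differences: (0.3 + 0.0 + 0.3) / 2
toy_tvd = np.abs(p - q).sum() / 2
print(toy_tvd)  # approximately 0.3

# A distribution compared with itself has a TVD of 0.
same_tvd = np.abs(p - p).sum() / 2
print(same_tvd)
```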

In [None]:
# Takes in two NumPy arrays and returns their TVD.
def tvd(freq1, freq2):
    diff = abs(freq1 - freq2)
    return sum(diff)/2

In [None]:
tvd(np_english_frequencies, calculate_proportions(sample_text))

Our text has a relatively low TVD with the base English frequencies. Remember, the lower the TVD is, the closer the two distributions are. Let's see how the TVD of an encrypted version of the sample text compares to English.

In [None]:
encrypted_sample_text = caesar_encrypt(sample_text, 8)
tvd(np_english_frequencies, calculate_proportions(encrypted_sample_text))

That is a lot higher! If we chart these two, you will see why it is so large:

In [None]:
plot_overlay(np_english_frequencies, calculate_proportions(encrypted_sample_text))

You can see that the histograms are completely different. In fact, you can make out the shift of the blue to the right from the orange.

**Question 3.1**: Build a function to minimize the TVD over all 26 possible shifts.

In [None]:
# Given to you as a helper function
# Returns the TVD of a text's frequencies against the base English frequencies
def analyze_frequency(text):
    return tvd(np_english_frequencies, calculate_proportions(text))

In [None]:
# Returns the best Caesar cipher shift for a given ciphertext to result in the most English-like plaintext
def find_best_shift(ciphertext):
    best_shift = 0
    best_tvd = float('inf')
    
    for i in range(26):
        shifted = ...
        result_tvd = ...
        
        if (result_tvd < best_tvd):
            best_shift = i
            best_tvd = result_tvd
    
    return best_shift

The following test may take a few seconds to run.

In [None]:
grader.check("q3_1")

**Question 3.2**: Build a function to decrypt an arbitrary Caesar ciphertext.

HINT: We just need to find the best "key" (shift) and then decrypt it using that!

In [None]:
# Breaks a Caesar ciphertext
def crack_caesar(ciphertext):
    shift = ...
    ...

Try it out for yourself! Longer texts will be decrypted much more reliably than shorter ones (can you see why?).

In [None]:
text = "Hello from Codebreaking at Cal!"
shift = 5 # Change this if you want!
encrypted_text = caesar_encrypt(text, shift)
print("Encrypted ciphertext:", encrypted_text)
print("Cracked plaintext:", crack_caesar(encrypted_text))

In [None]:
grader.check("q3_2")

Unfortunately, a simple frequency analysis will not work for polyalphabetic ciphers like Vigenère. For example, let's see what happens when we plot the frequencies of our sample text encoded with the key "samplekey":

*Note: The code in the rest of Part 1 is provided for you, but make sure you understand it!*

In [None]:
plot_overlay(np_english_frequencies, calculate_proportions(vigenere_encrypt(sample_text, "samplekey")))

As you can see, our frequency chart lines up very poorly with overall English frequencies -- there is no noticeable shift. If we run our `find_best_shift` function and decrypt with the shift it returns, we will see that the result is nothing more than gibberish.

In [None]:
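vigenere_ciphertext = vigenere_encrypt(sample_text, "samplekey")
shift = find_best_shift(vigenere_ciphertext)
# Attempt to decrypt the Vigenere ciphertext (not the plaintext) as if it were a Caesar cipher
attempted_decryption = caesar_decrypt(vigenere_ciphertext, shift)
plot_overlay(np_english_frequencies, calculate_proportions(attempted_decryption))
print(attempted_decryption[:50] + '...')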

In order to remedy this, we use a key principle that applies to polyalphabetic ciphers -- they are simply a combination of *monoalphabetic* ciphers. If we can split the polyalphabetic cipher into numerous monoalphabetic ones, we can solve those individually and re-combine to produce our decrypted text. For example, take the Vigenere ciphertext "bcbc" with key "ab". 'a' is the key for the first and third letters, and 'b' is the key for the second and fourth letters. Thus, we can split this into two Caesar ciphers -- "bb" and "cc". Cracking each one gives the keys 'a' and 'b', respectively, and concatenating them recovers the overall key "ab".
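
The split-and-recombine step for the "bcbc" example can be sketched as below. The names `split_by_key_position` and `interleave` are hypothetical and used only for illustration; the key length is assumed to be known already.

```python
# Illustrative only: both helpers are hypothetical names, not part of the lab's code.
def split_by_key_position(ciphertext, key_length):
    # Character i was encrypted with key letter i % key_length, so group by that index.
    return [ciphertext[pos::key_length] for pos in range(key_length)]

def interleave(columns):
    # Inverse of the split: take characters round-robin from each column.
    total = sum(len(col) for col in columns)
    return "".join(columns[i % len(columns)][i // len(columns)] for i in range(total))

columns = split_by_key_position("bcbc", 2)
print(columns)              # ['bb', 'cc']
print(interleave(columns))  # bcbc
```

Each column is an ordinary Caesar ciphertext, so it can be attacked with the single-shift frequency analysis from earlier in this part.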

The tricky part, however, is finding the key length. We can't know how to split the cipher unless we know how many ciphers to split into. Fortunately, there exists a very useful tool called the *Kasiski test* to determine the key length of a polyalphabetic substitution cipher. Simply put, we look for repeated substrings of words and check the distance between the two. For example: 

Plaintext: "Cryptography isnt cryptocurrency"

Key: "abcd"

Ciphertext: "**Csastp**iuaqjb itpw **csastp**exrsgqcz"

Notice the two bolded substrings, whose starting positions are exactly 16 letters apart (ignoring spaces). This implies our key could be 16 characters long or any factor of 16 (8, 4, 2), since the key must "repeat" in time to encrypt the same plaintext to the same ciphertext. Any of these is possible, but over a long text, only the key length and its multiples (4, 8, 12, ...) will show up in appreciable amounts. (A notable exception is 2, just because it's really small and a lot of repeats happen by chance.)
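
To make the distance check concrete, here is a minimal sketch that counts the gaps between repeated 3-letter substrings in the example ciphertext. `repeat_distances` is a hypothetical helper and is simpler than the approach this lab uses later; the ciphertext is lowercased with its spaces removed.

```python
from collections import Counter, defaultdict

# Illustrative only: `repeat_distances` is a hypothetical helper, not part of the lab's code.
def repeat_distances(text, n=3):
    # Record every start position of each length-n substring.
    positions = defaultdict(list)
    for i in range(len(text) - n + 1):
        positions[text[i:i + n]].append(i)
    # Count the gaps between consecutive occurrences of the same substring.
    distances = Counter()
    for starts in positions.values():
        for a, b in zip(starts, starts[1:]):
            distances[b - a] += 1
    return distances

# The example ciphertext, lowercased with spaces removed.
ciphertext = "csastpiuaqjbitpwcsastpexrsgqcz"
print(repeat_distances(ciphertext))  # Counter({16: 4})
```

All four repeated trigrams ("csa", "sas", "ast", "stp") sit 16 letters apart, which is a multiple of the true key length 4.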

Our algorithm for determining the key length is as follows:

1. Find all repeated substrings of length 3 or 4 and their distances. Put this data into a dictionary with (key length -> count)
2. Compare the proportions of these values to those of an unencrypted, very long text (same idea as comparing to base English frequency values)
3. Collect those which are significantly above the base values and determine the key length based on least-inversions principle (this part is a bit tricky and subjective, so it is implemented for you)
4. Try the best *k* keylengths and compare using a frequency analysis on the final product (k is set by the user).


In [None]:
# Returns a dictionary mapping each candidate key length to its count of repeated substrings
def find_repeated_substrings(text):
    text=clean_text(text)
    appearances = defaultdict(lambda: 0)
    for i in range(2,len(text)//2):
        counts = defaultdict(lambda:-1)
        for k in range(0, len(text), i):
            if (k+3 > len(text)):
                continue         
            snippet_three = text[k:k+3]
            snippet_four = text[k:k+4]
            if (counts[snippet_three] != -1):
                appearances[i] += 1
            if (counts[snippet_four] != -1 and k+4 < len(text)):
                appearances[i] += 1
            counts[snippet_three] = k
            counts[snippet_four] = k
    return appearances

`MAX_KEYLENGTH` defines the maximum length of key to check for. Theoretically, this could be up to `len(text)` (we will see why this keylength creates an unbreakable code in the next lab), but for practical purposes we will set it at 20. Feel free to change and see how the program works!

In [None]:
MAX_KEYLENGTH = 20

In [None]:
# Returns the proportions of repeated strings for each keylength
def keylen_proportions_from_text(text): # Coded for you 
    parsed = np.fromiter(find_repeated_substrings(text).values(), dtype=int)[:MAX_KEYLENGTH*2]
    return parsed/sum(parsed)

In [None]:
def approx(x): # Ignore this, used for graphing visuals
    offset = 0.012
    if (x == 2):
        offset+=0.02
    elif (x > 2 and x < 6):
        offset-=  0.01
    return 0.45* 0.7**x + offset

In [None]:
base_keylen_proportions = keylen_proportions_from_text(sample_text)

In [None]:
colors = []
for i in range(2, MAX_KEYLENGTH+3):
    if (i % 5 == 0):
        colors.append('#50514f')
    elif (i % 4 == 0):
        colors.append('#f25f5c')
    elif (i % 3 == 0):
        colors.append('#ffae4a')
    elif (i % 2 == 0):
        colors.append('#247ba0')
    else:
        colors.append('#70c1b3')
        
fig, ax = plt.subplots()

ax.bar(range(2,MAX_KEYLENGTH+3), base_keylen_proportions[:MAX_KEYLENGTH+1], color=colors)
ax.set_xticks(range(2,MAX_KEYLENGTH+3))
ax.plot(range(2,MAX_KEYLENGTH+3), [approx(x) for x in range(2,MAX_KEYLENGTH+3)] ,color="black")
plt.show()

The following function finds the differences of a given ciphertext from the base keylength proportions.

In [None]:
# Finds the difference of keylength frequencies between a text and the base frequencies
def find_diff_from_base(text):
    text_keylen_proportions = keylen_proportions_from_text(text)
    
    diffs = text_keylen_proportions - base_keylen_proportions[:len(text_keylen_proportions)]
    
    return diffs

Now that we have `find_diff_from_base`, let's visualize what diffs of various keylengths look like when plotted.

In [None]:
EXAMPLE_KEYLENGTH = 5 # Change between 2 and 20!

diffs = find_diff_from_base(vigenere_encrypt(sample_text, alphabet[0:EXAMPLE_KEYLENGTH]))
fig, ax = plt.subplots()
ax.bar(range(2,MAX_KEYLENGTH+3), diffs[:MAX_KEYLENGTH+1], color=colors)
ax.set_xticks(range(2,MAX_KEYLENGTH+3))
plt.show()

You'll notice that the key length and its multiples are noticeably above 0 while the rest tend to be lower. (2 is often an exception; it's so small that a lot of statistical noise gets in the way.)

Detecting the best key length programmatically is often challenging, and has a few different approaches. The one below scores different lengths based on whether any given multiple is less than its previous multiple, as well as how many in between are larger than the given multiple. As you see in the example graph, each successive multiple should be less than the previous. If you try a key length like 4, you'll see 2 < 4 and 6 < 8, which helps disqualify 2 as a potential key length.

In [None]:
# Given a list of possible keylengths and the overall diffs from base, return the top potential key lengths
def find_best_divisor(nums, diffs): 
    result = []
    k = 1
    for num in nums:
        score = 0
        for i in range(2*num, min(len(diffs)+2, MAX_KEYLENGTH*2+2), num):
            inbetween = sum([diffs[a-2] > diffs[i-2] for a in range(i-num+1, i)]) / len(range(i-num+1, i))
            
            if inbetween > 0.1:
                score -= 1
            if (diffs[i-num-2] < diffs[i - 2]):
                score -= 1
        k+=1
        
        result.append((num, score))
    result = sorted(result, key=lambda x: x[1], reverse=True)
    top_three = [x[0] for x in result[:3]] 
    
    return top_three + [reduce(math.gcd, top_three)]

We can now write a function that takes in a ciphertext and returns the three best key lengths, along with their greatest common divisor.

In [None]:
# Returns the best possible key lengths from a Vigenere ciphertext
def find_vigenere_key_lengths(ciphertext):
    diffs = find_diff_from_base(ciphertext)
    
    potential = []
    for i in range(2, min(len(diffs)+2,MAX_KEYLENGTH+2)):
        if diffs[i-2] > 0:
            potential.append(i)
    return find_best_divisor(potential, diffs)

At this point, we have everything we need to build the final function -- `crack_vigenere`! This function will take in an arbitrary ciphertext and return the most likely plaintext.

In [None]:
# Breaks a Vigenere ciphertext
def crack_vigenere(ciphertext):
    keylengths = find_vigenere_key_lengths(ciphertext)
    
    finalstrs = []
    
    for keylen in keylengths:
        texts = ['' for i in range(keylen)]
        
        for i in range(len(ciphertext)):
            # Put the first letter into the first text, second into second, etc...
            texts[i%keylen] += ciphertext[i]
            
        # Use our caesar cipher cracker to individually break each text    
        cracked_texts = [crack_caesar(text) for text in texts]
        
        finalstr = ""
    
        for i in range(len(ciphertext)):
            # Recombine the original string from the assorted texts
            finalstr += cracked_texts[i%keylen][i//keylen]
        
        finalstrs.append(finalstr)    
    
    # Find the "best" string via frequency analysis (Remember we have analyze_frequency)
    return min(finalstrs, key = lambda x: analyze_frequency(x))

Let's decrypt our sample text!

In [None]:
print('...', crack_vigenere(vigenere_encrypt(sample_text, "samplekey"))[4512:4751], '...')

And with that, you have built a Caesar cipher and Vigenere cipher breaker -- something which took hundreds of years to accomplish in the past. Congrats! 

## Part 2: Steganography

In this part, we'll build a system to hide and recover messages inside images, then also find a way to break this system.

### Background: Image and Pixel Representation

Before we begin hiding messages inside of images, let's begin by taking a look at how pixels and images are represented in computers. 

In the digital world, one of the most common methods of encoding a pixel's color is to represent that color as an RGB triplet. An RGB triplet is a list of 3 numbers, each ranging from 0 to 255, where the first number represents the level of red, the second number represents the level of green, and the third number represents the level of blue in the pixel. The higher the number (i.e., the closer to 255), the more intense that component will be in the final color.

For example, the RGB triplet `(0, 0, 0)` represents the color black, as `0` indicates the absence of a component entirely and we are asking for the absence of all three components. In contrast, the RGB triplet `(255, 255, 255)` represents the color white, as `255` indicates the full intensity for a component, and we are asking for the full intensity of all three components. Below are some examples of other RGB triplets and the colors they represent.

<img src="https://linuxhint.com/wp-content/uploads/2022/02/image7-9.png" width="300">

So, now that we know how to represent a single pixel, how could we extrapolate this to representing an entire image? Well, we can think of an image as a matrix of pixels, so we could store an image as a 2D-array of RGB triplets. Each array inside of the 2D-array represents a row of the image, and inside each of these rows are the RGB triplets for that row, each stored as a list of three numbers.

As a simple example, consider the following 3 by 3 image (consisting of 9 total pixels).

![](https://i.imgur.com/7aOJTgl.png)

This would be represented by an array that looks like this:

```
[
    [[255, 0, 0], [255, 255, 0], [255, 0, 255]],
    [[255, 128, 0], [0, 255, 255], [0, 0, 0]],
    [[128, 0, 255], [255, 255, 255], [0, 255, 0]]
]
```
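
As a short sketch of how this nested structure is indexed (row first, then column, then color component), using only the example array above:

```python
import numpy as np

# The 3 by 3 example image from above, as nested lists of RGB triplets.
image = [
    [[255, 0, 0], [255, 255, 0], [255, 0, 255]],
    [[255, 128, 0], [0, 255, 255], [0, 0, 0]],
    [[128, 0, 255], [255, 255, 255], [0, 255, 0]],
]

pixel = image[1][2]           # row 1, column 2 (zero-indexed)
print(pixel)                  # [0, 0, 0], a black pixel
print(image[0][1][1])         # green component of row 0, column 1: 255
print(np.array(image).shape)  # (3, 3, 3): rows, columns, color components
```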

Lastly, it may be helpful throughout this lab to remember that we can represent each of the numbers in the RGB triplets as a binary number. Since the numbers in the triplet range between 0 and 255, there are a total of 256 different possible values. Thus, we need 8 bits (or 1 byte) to represent one of these numbers. This means that each RGB triplet requires 3 bytes to represent. The above image, represented with binary numbers, would be

```
[
    [[11111111, 00000000, 00000000], [11111111, 11111111, 00000000], [11111111, 00000000, 11111111]],
    [[11111111, 10000000, 00000000], [00000000, 11111111, 11111111], [00000000, 00000000, 00000000]],
    [[10000000, 00000000, 11111111], [11111111, 11111111, 11111111], [00000000, 11111111, 00000000]]
]
```
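
The conversion between decimal and 8-bit binary can be sketched with Python's built-in `format` and `int`. The helper name `to_binary` is hypothetical and used only for illustration.

```python
# Illustrative only: `to_binary` is a hypothetical helper, not part of the lab's code.
def to_binary(pixel):
    # Format each component of an RGB triplet as a zero-padded 8-bit binary string.
    return [format(value, '08b') for value in pixel]

print(to_binary([255, 128, 0]))  # ['11111111', '10000000', '00000000']
print(int('10000000', 2))        # 128
```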

### Hiding a Message

In this part of the lab, we will begin by hiding a message within a "masking" image. In order to do this, we will take advantage of the limitations of the human eye. Our ability to tell similar colors apart is quite limited. So much so, in fact, that if we change the least significant bit of one of the numbers in an RGB triplet, the change is imperceptible to ordinary human eyesight. As an example, below are two colors. One is true red (the RGB triplet is `[11111111, 00000000, 00000000]`) and one is a slightly off red (the RGB triplet is `[11111110, 00000000, 00000000]`). It is nearly impossible to determine which is which using only your eyes.

<img src="https://i.imgur.com/NcWgSZg.png" width="300">

So, using this fact, we will try to secretly encode a black and white image into a "masking" image by changing the last bit of the green component of each pixel in the "masking" image.

Here is our scheme we will use to do this:

We will take in two images (represented as 2D arrays). One image is `mask` and the other is `message`. `mask` and `message` will have the same exact dimensions. `mask` will represent an innocent-looking image and `message` will be the secret image we want to hide. `message` will be guaranteed to only contain white and black pixel values (RGB triplets of `[255, 255, 255]` and `[0, 0, 0]`). 

To produce our encoded message, we will loop through the pixels inside of `message`. If the pixel in `message` is white, then we want to set the last bit of the green value in the corresponding pixel of `mask` to 1. If the pixel in `message` is black, then we want to set the last bit of the green value in the corresponding pixel of `mask` to 0.
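
To see what "setting the last bit" does to a single green value, here is a minimal sketch on plain integers. `with_last_bit` is a hypothetical name used only for this illustration, and the value 200 is an arbitrary example.

```python
# Illustrative only: `with_last_bit` is a hypothetical helper for a single channel value.
def with_last_bit(value, bit):
    # Clear the least significant bit, then set it to `bit` (0 or 1).
    return (value & ~1) | bit

green = 200                     # 11001000 in binary
print(with_last_bit(green, 1))  # 201 (11001001): marks a white message pixel
print(with_last_bit(green, 0))  # 200 (unchanged): marks a black message pixel
print(with_last_bit(255, 0))    # 254: the value never changes by more than 1
```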

For this lab, we will use the following picture of Oski as our "masking" image and our secret message will be the black-and-white image saying "go bears!". For each pixel in the "go bears!" image, we will find the corresponding pixel in the Oski image, and change its green component according to whether the pixel is black or white.

![](https://i.ibb.co/5Kgxw7Z/sidebyside.png)

**Question 4:** Write a function `encode_message` that takes in two images, `mask` and `message`, and hides `message` inside of `mask` using the algorithm described above.

In [None]:
# helper functions

# Takes in an RGB triplet PIXEL and returns True iff it encodes the color white
def is_white_pixel(pixel):
    return len(pixel) == 3 and all(n == 255 for n in pixel)

# Takes in an RGB triplet PIXEL and returns True iff it encodes the color black
def is_black_pixel(pixel):
    return len(pixel) == 3 and all(n == 0 for n in pixel)

# Takes in a number NUM and returns a new version with its last bit changed to NEW_LAST_BIT
def set_last_bit(num, new_last_bit):
    assert new_last_bit == 0 or new_last_bit == 1, 'new bit invalid'
    return ((num >> 1) << 1) | new_last_bit

In [None]:
def encode_message(mask, message):
    # check that mask and message have the same dimension
    assert len(mask) == len(message), 'mismatched number of rows'
    assert list(map(len, mask)) == list(map(len, message)), 'mismatched number of columns'
    
    # check that message contains only white and black pixels
    for row in message:
        for pixel in row:
            assert is_white_pixel(pixel) or is_black_pixel(pixel), 'message must be black and white'

    # create a 2D array for the output image, which will contain the secret message
    output_image = []
    for r in range(len(mask)):
        output_row = []
        for c in range(len(mask[r])):
            mask_pixel = mask[r][c]
            message_pixel = message[r][c]
            new_pixel = list(mask_pixel) # make a copy of mask_pixel
            # modify new_pixel's green value according to the algorithm above
            # hint: the helper function(s) above may be helpful.
            # hint: the green component will be the second value of the pixel
            ...
            output_row.append(new_pixel)
        output_image.append(output_row)
    return output_image

In [None]:
grader.check("q4")

Now, let's take a look at what the output image looks like if we encode our "Go bears!" message into the image of Oski!

**Note:** This may take up to a minute to execute.

In [None]:
OSKI_IMG = 'https://pbs.twimg.com/profile_images/1276527827848818688/dfr7_4Kn_400x400.jpg'
SECRET_MESSAGE = 'https://i.ibb.co/djphCHY/gobears.png'

oski = np.asarray(Image.open(BytesIO(requests.get(OSKI_IMG).content))).tolist()
secret = np.asarray(Image.open(BytesIO(requests.get(SECRET_MESSAGE).content))).tolist()

encoded_oski = encode_message(oski, secret)

plt.imshow(encoded_oski)
plt.show()

Notice how the image looks almost exactly the same as the original image of Oski, seen below. The modifications to the last bits of the green values are completely imperceptible.

![](https://pbs.twimg.com/profile_images/1276527827848818688/dfr7_4Kn_400x400.jpg)

### Recovering a Message

In this part, we are going to now write a function that will let us recover a message from an image that has already had a message hidden inside of it.

For this part, you will take in an image `encoded_image` that has had a secret message encoded in it using the `encode_message` function you implemented in the previous part. Your goal is to derive the original message, by examining each pixel of `encoded_image`. Remember: if the last bit of the green value is a 1, then the corresponding pixel in the secret message was white, and if the last bit of the green value is a 0, then the corresponding pixel in the secret message was black.

**Question 5:** Write a function `decode_message` that takes in an image `encoded_image`, and returns the image of the secret message hidden inside of `encoded_image`.

In [None]:
# helper function(s)

# Takes in a number NUM and returns the last bit (either 0 or 1)
def get_last_bit(num):
    return num & 1

In [None]:
def decode_message(encoded_image):
    WHITE_PIXEL = [255, 255, 255]
    BLACK_PIXEL = [0, 0, 0]
    recovered_message = []
    for r in range(len(encoded_image)):
        recovered_row = []
        for c in range(len(encoded_image[r])):
            encoded_pixel = encoded_image[r][c]
            # determine, using the last bit of the green value, whether to
            # append a white or black pixel to the end of recovered_row
            # hint: the helper function(s) above may be helpful.
            ...
        recovered_message.append(recovered_row)
    return recovered_message

In [None]:
grader.check("q5")

Finally, let's test our decoding method on the picture of Oski we made at the end of the "Hiding a Message" section, and make sure that we get back our original secret message containing "go bears!". If this block doesn't work, go back to that section and re-run the block where you created the `encoded_oski` image.

In [None]:
recovered_message = decode_message(encoded_oski)

plt.imshow(recovered_message)
plt.show()

If everything has gone well, we should see our original message of "go bears!". 

To wrap up this section, let's decode a brand new image whose hidden message we don't know. Here's a seemingly innocent-looking image.

<img src="https://i.ibb.co/gM62mM4/problem.png" width="300">


However, if we run it through our decoding algorithm, we will see that we run into sort of a **problem**. ;)

In [None]:
NORMAL_IMG = 'https://i.ibb.co/gM62mM4/problem.png'

normal_image = np.asarray(Image.open(BytesIO(requests.get(NORMAL_IMG).content))).tolist()
recovered_image = decode_message(normal_image)

plt.imshow(recovered_image)
plt.show()

### Content Threat Removal

To conclude this lab, we will create an algorithm that tries to block this steganographic technique.

Let's say that we were trying to devise a messaging system that blocked secret messages from hiding within images, but that still allowed legitimate images to pass through. One way to do this is to run advanced steganalysis techniques on the images to detect which ones have had a message encoded within them. However, due to the difficulty of successfully implementing these techniques, it is often preferred to instead add random "noise" to the image, with the hope of corrupting any message that may be hiding within while still keeping the image looking the same to the human eye.

For the sake of simplicity, we will implement an algorithm that adds "noise" to images by changing the last bit of all values in the RGB triplets of the image to randomly be 0 or 1.
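
A rough sketch of why this works, using a plain list of bits rather than an image (the bit pattern and the seed below are arbitrary choices for illustration): once every last bit is replaced by a coin flip, only about half of them still agree with the hidden message, which is no better than guessing.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable; any seed works

# Pretend these are the hidden last bits carrying a message.
hidden_bits = [1, 0] * 500

# Overwrite every bit with a fresh random bit, as the noise step does.
noisy_bits = [random.randint(0, 1) for _ in hidden_bits]

matches = sum(h == n for h, n in zip(hidden_bits, noisy_bits))
print(matches / len(hidden_bits))  # close to 0.5
```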

You'll need the following 2 helper functions:

In [None]:
# Returns either 0 or 1 at random.
def get_random_bit():
    return random.randint(0, 1)

# Takes in a number NUM and returns a new version with its last bit changed to NEW_LAST_BIT
def set_last_bit(num, new_last_bit):
    assert new_last_bit == 0 or new_last_bit == 1, 'new bit invalid'
    return ((num >> 1) << 1) | new_last_bit

**Question 6:** Write a function `add_noise` that takes in `image`, and returns the same image with the least significant bit of every color value set randomly to either 0 or 1.

In [None]:
def add_noise(image):
    noisy_image = []
    for r in range(len(image)):
        noisy_row = []
        for c in range(len(image[r])):
            pixel = image[r][c]
            # modify the red, green, and blue components of pixel
            # by changing each of their last bits to a random bit
            # append this modified pixel to the end of noisy_row
            # hint: the helper function(s) above may be helpful.
            ...
        noisy_image.append(noisy_row)
    return noisy_image

In [None]:
grader.check("q6")

Now, let's see what happens if we add noise to an encoded message and then try to decode it. We should expect the decoded message to be totally garbled (looking almost like TV static). We will test this by adding noise to the image we decoded at the end of the "Recovering a Message" section and then trying to decode it.

For reference, here is the encoded image:

<img src="https://i.ibb.co/gM62mM4/problem.png" width="300">

In [None]:
NORMAL_IMG = 'https://i.ibb.co/gM62mM4/problem.png'

normal_image = np.asarray(Image.open(BytesIO(requests.get(NORMAL_IMG).content))).tolist()

noisy_image = add_noise(normal_image)

plt.imshow(noisy_image)
plt.show()

Here, we see that the "noisy" version of the image looks identical to the original. This is intended, since we want legitimate images to not be visually impacted by the noise.

However, if someone were to try and decode the image, then they would run into issues.

In [None]:
recovered_image = decode_message(noisy_image)  # decode the noisy image, not the original

plt.imshow(recovered_image)
plt.show()

As we can see, we no longer are getting our message back, but rather absolute garbage.

One thing to keep in mind about this method of content threat removal is that it required us to have some prior knowledge about how the message might be encoded in the image. In this case, that meant that we already knew the message was being sent through the last bits of some of the values. If the sender of the message were careful, they could encode the message with an entirely different scheme that may be able to get past this sort of filter. This reflects the arms race that exists between those implementing filters and those trying to get around them.

**FOR SUBMISSION: If you run into an error (Runtime Error) at first, try running it again. There's an infrequent bug causing an error, but it goes away after re-running the submission cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Once you have generated the zip file, go to the Gradescope page for this assignment to submit.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)