<center><img src="https://docs.google.com/drawings/d/e/2PACX-1vT4S4QVOsu1GtRuJmYftcySJMZGo_4woIB8S2p52sttdzdnRL3AEb-Z7A7dyBzLDQL1n9DYeqvmoV6r/pub?w=816&amp;h=144"></center>

# Encoding, Encryption, and Hashing

There are three concepts that often get confused when dealing with information - *encoding*, *encryption*, and *hashing*. Let's look at each of these. For all three examples, we will use the notion of a *password*. This analogy breaks down a little bit when we use *encoding*, but nonetheless we'll stick with it as the payoff for *encryption* and *hashing* is worth it.

We'll be using *byte literals* in this conversation. For now, it's sufficient to know that a byte literal is just the computer version of a `string`. For example, let's see what a `string` looks like and a byte literal of the same `string`. Note two things:

1. We don't really need to know how to convert between the two; we just need to know that some functions prefer a byte literal over a `string`

2. If we jam a `b` in front of a quoted `string` (plain ASCII characters only), Python treats it as a byte literal

Sometimes there is no visible difference between the `string` and the byte literal. But trust me - the computer knows the difference:

In [None]:
print('Too many secrets')
print(type('Too many secrets'))
print()
print(b'Too many secrets')
print(type(b'Too many secrets'))
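
To make the difference concrete, here is a small sketch (not part of the original walkthrough) showing that Python treats the two as different things, and that `encode` and `decode` convert between them. The variable names are made up for the example.

```python
# a string and a byte literal with the same visible characters
text = 'Too many secrets'
raw = b'Too many secrets'

# they look alike, but Python does not consider them equal
print(text == raw)

# encode() turns a string into bytes; decode() turns bytes back into a string
print(text.encode('ascii') == raw)
print(raw.decode('ascii') == text)

# indexing a string gives a character; indexing bytes gives a number
print(text[0], raw[0])
```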

Okay. Enough nerding out. Let's look at *encoding*.

## Encoding

Encoding is the process of transforming some text into text with stricter guidelines. For instance, when we think about how websites work, consider the URL. Do you know what happens when you put a space in some text in a URL? Browsers don't like spaces (in fact there's a whole bunch of characters that browsers don't like) so they replace the space with `%20`. Don't believe me? Check out what happens when I try to access a website with a space in the title - **the browser swaps the space for a `%20`!!!**

<center><img style="margin-bottom: .5em;border-radius: 15px;" src="https://lh3.googleusercontent.com/d/1EfWl4ANB0h6C_5mk4cM1gPl5BVbh0Z_k" alt="URL Encoding" title="URL Encoding" /></center>

So what if we have a system that behaves like that, but we actually want a `%20` to be different than a space?

This is *exactly* what encoding is for. It's a way to reduce some text into a subset of characters that have a strict guideline. So **base64** encoding is designed to only use capital letters `A-Z`, lowercase letters `a-z`, numerals `0-9`, and the symbols `+` and `/` (plus `=` for padding at the end). And that's it.

So let's imagine that we want to design a URL that has a link to a web conference as well as the password. Well, since passwords oftentimes contain special characters, we can anticipate that there *might* be a `%` sign in there. What's worse is that it's possible that the phrase `%20` appears in the password. We know that browsers treat the `%20` as a space, but we really need it to be not a space.
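
Python's standard `urllib.parse` module applies the same percent-encoding rules, so we can sketch the problem without a browser. This is only an illustration of the idea, not a claim about what any particular browser does internally.

```python
from urllib.parse import quote, unquote

# a space is not allowed in a URL, so it gets percent-encoded
title = 'Too many secrets'
encoded_title = quote(title)
print(encoded_title)

# going the other way, a literal %20 in a password is read back as a space
password = 'aT8%20!'
misread_password = unquote(password)
print(misread_password)
```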

So we can *encode* it.

Note that encoding isn't super special. It is not used to protect a secret (like *encryption* or *hashing*). All encoding does is translate characters to a specific subset. Let's look at an example. Let's assume that a URL contains the characters `aT8%20!`. Well, right off the bat we know that browsers aren't going to like it. But we also know that if we encode it in base64 we can be sure that the `%20` is not converted to a space. (When the result really is headed into a URL, Python's `base64.urlsafe_b64encode` swaps `+` and `/` for `-` and `_`, which are safe there.)

Let's see that in action!

### STEPS
1. Convert the `string` to a byte literal (in this case, we will change the text to ASCII, which is pretty much like converting it strictly to a byte literal)<br /><br />
2. We'll encode the byte literal to base64<br /><br />
3. We'll take the result and change it from a byte literal to a `string`<br /><br />
4. We'll output the base64 version of the original `string`<br /><br />

In [None]:
import base64

# define password
password = 'aT8%20!'

# output the original string
print(f'Original message: {password}')

# convert `password` to byte literal
password_bytes = password.encode('ascii')

# note that the code:
# password_bytes = b"aT8%20!"
# does the same thing as encoding in ASCII - but if you have a variable
# and not a string literal, you have to use the `encode` function.

# encode the byte literal
base64_bytes = base64.b64encode(password_bytes)

# change from a byte literal to a `string`
base64_string = base64_bytes.decode('ascii')

print(f'Encoded message:  {base64_string}')
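
Since encoding is not meant to protect a secret, anyone can undo it. A minimal sketch of the reverse trip, using `base64.b64decode` and no key of any kind:

```python
import base64

# the base64 text for the password 'aT8%20!'
base64_string = base64.b64encode('aT8%20!'.encode('ascii')).decode('ascii')

# no key or secret is needed to get the original back
decoded_bytes = base64.b64decode(base64_string)
decoded_password = decoded_bytes.decode('ascii')

print(f'Encoded message:  {base64_string}')
print(f'Decoded message:  {decoded_password}')
```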


By the way, base64 turns every three bytes of input into four characters of output, so *encoding* text this way lengthens it (by about 33%). That in turn requires more memory and bandwidth. So encoding isn't ideal unless you need to restrict the character set.

For funsies, change the value of the password from `aT8%20!` to `%20` and see what happens. How many characters do you see?

Oh - and you'll notice that if you change the value of `password` a few times and run the code, many of the encodings will end in one or two `=` signs; this is because of the way base64 encoding works. It chunks the input three bytes at a time, so if the length of the input is not a multiple of three, an `=` sign or two is added to *pad* the result to a multiple of four characters.
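
Here is a quick sketch of the padding at work. The sample inputs are arbitrary; the thing to watch is how the number of `=` signs changes with the length of the input.

```python
import base64

# collect the encodings so we can compare them
results = {}

for sample in ['a', 'ab', 'abc', 'abcd']:
    encoded = base64.b64encode(sample.encode('ascii')).decode('ascii')
    results[sample] = encoded
    padding = encoded.count('=')
    print(f'{sample:5} -> {encoded:9} ({len(encoded)} characters, {padding} padding)')
```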

Change the password to `usb` and see how many `=` signs are in the result.

If you want to dive deeper, [read this StackOverflow conversation](https://stackoverflow.com/questions/6916805/why-does-a-base64-encoded-string-have-an-sign-at-the-end).

Also, Python strings have a built-in `encode` method. By default, it uses UTF-8. You can quickly encode data by:

```python
message = 'Too many secrets'
message = message.encode()
print(message)
```

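Strictly speaking, UTF-8 and base64 do different jobs: UTF-8 turns a `string` into bytes, while base64 turns bytes into a restricted set of text characters. That difference doesn't really matter for the small things we are doing.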
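
If you are curious how the two fit together, here is a small sketch. The message `café` is just an assumed example with one non-ASCII character in it.

```python
import base64

message = 'café'

# UTF-8 is a character encoding: string -> bytes
utf8_bytes = message.encode('utf-8')
print(utf8_bytes)

# base64 is a binary-to-text encoding: bytes -> restricted set of characters
base64_text = base64.b64encode(utf8_bytes).decode('ascii')
print(base64_text)

# undo both steps in reverse order to get the original back
round_trip = base64.b64decode(base64_text).decode('utf-8')
print(round_trip)
```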


## Encryption

*Encryption* is similar to *encoding* in that it transforms text to something unrecognizable. The difference is that encryption is secret; that is, once you encrypt something you can't decrypt it to a readable form without a key.

Some forms of encryption are *symmetric*. That means that the same key is used to encrypt and decrypt the message. Like the lock on your home. You can give the same key to several people and they can all lock or unlock the door with it.

Encryption can also be *asymmetric* which means one key can encrypt the message but that message can only be decrypted with a different key. You may have heard of the notion of *public key* and *private key*. In this modality, a public key is available for anyone to see. All it can do is encrypt data. Even that key can't be used to decrypt the same data. A matching (but different) key, the *private key*, is the **only key in the world** that can decrypt the data.

For example, I could give you my public key (if you are curious, you can [find it here](http://daveghidiu.com/security/davesKey.key) - and it looks like maybe it's encoded in base64...) and you could encrypt any data with it that you want. But you - or anybody else - would not be able to decrypt it. Only I can because I have the private key.

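For the scope of this conversation, we will just consider symmetric cryptography (where the same key encrypts and decrypts the data). Much like how encoding has different flavors (base64, UTF-8, etc.), there are a number of encryption algorithms. RSA has long been the standard for asymmetric encryption, and there are a few other algorithms you are likely to run into:
* AES (symmetric; trusted by the US government and other governments)
* Triple DES (symmetric; an older algorithm that is being phased out)
* RSA (asymmetric; it uses the public key and private key described above)
* Blowfish (symmetric; an older algorithm)
* Twofish (symmetric; the successor to Blowfish)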

We will be using the `cryptography` module in Python (a third-party package, installed with `pip install cryptography`). One of its classes, `Fernet`, is built on top of AES with a 128-bit key. Note that AES-256 is also available, but it takes a bit longer to use, so we'll stick with `Fernet`.

In [None]:
from cryptography.fernet import Fernet

# create a secret message
message = 'Too many secrets'

# print out the message
print(f'Message:   {message}')

# encode the message to bytes since Fernet requires that
encoded_message = message.encode()

# print out the encoded message for funsies
print(f'Encoded:   {encoded_message}')

# generate a key using the `Fernet` method
key = Fernet.generate_key()

# print out the key for funsies
print(f'Key:       {key}')

# create a Fernet object from the key so we can encrypt with it
fernet = Fernet(key)

# encrypt the freshly encoded message
encrypted_message = fernet.encrypt(encoded_message)

# print out the encrypted message
print(f'Encrypted: {encrypted_message}')
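
The cell above only goes one direction. As a minimal sketch of the other direction, the same key decrypts the message, while a different key is rejected with `InvalidToken`. It assumes the `cryptography` package is installed; the variable names are made up for the example.

```python
from cryptography.fernet import Fernet, InvalidToken

# generate a key and encrypt a message with it
key = Fernet.generate_key()
fernet = Fernet(key)
token = fernet.encrypt('Too many secrets'.encode())

# the same key decrypts the message (symmetric encryption)
decrypted = fernet.decrypt(token).decode()
print(f'Decrypted: {decrypted}')

# a different key cannot decrypt it
wrong_fernet = Fernet(Fernet.generate_key())
try:
    wrong_fernet.decrypt(token)
    wrong_key_worked = True
except InvalidToken:
    wrong_key_worked = False

print(f'Wrong key worked: {wrong_key_worked}')
```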

There are two interesting things to say about encryption:
1. Most of the popular encryption algorithms are public - that means anyone can inspect how they work, and many implementations are open source. In fact, it's widely considered to be more secure if everyone can see the algorithm. The security comes from the math and from keeping the key secret, not from hiding the code.<br /><br />
2. Every generated key is different. That makes sense, but that also means that if you run the code above a few times in a row you'll get a different key (and a different encrypted message) each time.

## Hashing

Lastly, *hashing*. This is sort of like encryption in that the process of hashing takes data and uses an algorithm to convert it to something else. But unlike encryption, *it cannot be reversed!*

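That is, a good hash cannot be reversed. Some older hashing algorithms, like `md5`, have been compromised: researchers can deliberately produce two different inputs with the same hash, so they should no longer be used for security. Hashing algorithms must abide by several guidelines to be safe: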
1. One way - they cannot be reversed<br /><br />
2. Fixed size - they are ALWAYS the same length, regardless of the data<br /><br />
3. Consistent - hashing the same data should always result in the same hash<br /><br />
4. No collisions - it should be practically impossible to find two different pieces of data that hash to the same thing

So let's first look at why we might want to hash data. Let's say you program an app. And every user has a password. Well, you might be tempted to have a database that stores the username and their password. So when a user logs in you can ask them for their username and password - if what they typed in matches what is in your records, then access is granted. But this isn't safe. What if someone hacks into your computer? They can see all the passwords of the users. Besides the obvious problem with bad actors using these credentials to access your app, they now have usernames and passwords that they can attempt to use on other websites. If any of the poor souls using your app have used the same password somewhere else, now the bad actors can abuse other apps and websites too!

Instead, as soon as your app gets the username and the password from someone setting up an account, *hash* the password and store that. Since it cannot be reversed, if the bad actors break into your database now all they have are the usernames and the hashes - no passwords. But if one of your customers logs in, as soon as they type in their username and password to log in, you can hash what they just typed in and see if it matches the hash you have on file (from when they created the account).
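
The flow described above can be sketched in a few lines. Everything here is hypothetical (the `users` dictionary stands in for a database, and the function names are made up), and it uses a plain SHA-256 hash from Python's `hashlib` module only to illustrate the idea.

```python
import hashlib

def hash_password(password):
    """Return the SHA-256 hash of a password as a hex string."""
    return hashlib.sha256(password.encode()).hexdigest()

# hypothetical user table: username -> stored hash (never the password itself)
users = {}

def create_account(username, password):
    # hash the password right away and store only the hash
    users[username] = hash_password(password)

def log_in(username, password):
    # hash what was just typed and compare it with the hash on file
    return users.get(username) == hash_password(password)

create_account('bishop', 'Too many secrets')
right_password = log_in('bishop', 'Too many secrets')
wrong_password = log_in('bishop', 'too many secrets')

print(f'Right password: {right_password}')
print(f'Wrong password: {wrong_password}')
```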

Let's look at hashing in action. We will be using the `hashlib` module, and that also requires the data to be encoded first:

In [None]:
import hashlib

# create the message
message = 'so complex it cannot be solved'

# output the message for funsies
print(f'Message: {message}')

# encode the message to bytes since hashlib requires that
message_bytes = message.encode()

# use hashlib to create a SHA-256 hash
hashed_message = hashlib.sha256(message_bytes)

# get the hexadecimal version of the hash
hash_hex = hashed_message.hexdigest()

# print the hashed version for funsies
print(f'Hashed:  {hash_hex}')


You should totally try changing the message - you'll note that the resulting hash will ALWAYS be the same number of characters. Even if the message is `'a'`, it's the same number of characters.
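
Here is a short sketch of those properties using the same `hashlib.sha256` call. The sample messages are arbitrary.

```python
import hashlib

# fixed size: very different message lengths give the same hash length
for message in ['a', 'so complex it cannot be solved', 'Too many secrets' * 1000]:
    digest = hashlib.sha256(message.encode()).hexdigest()
    print(f'{len(message):6} characters in -> {len(digest)} characters out')

# consistent: the same input always gives the same hash
first = hashlib.sha256(b'Too many secrets').hexdigest()
second = hashlib.sha256(b'Too many secrets').hexdigest()
print(first == second)

# changing a single character gives a different hash
changed = hashlib.sha256(b'Too many secretz').hexdigest()
print(first == changed)
```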

If you've been paying close attention, you'll note an issue with storing passwords as hashes (a user problem, not a technical problem). Since hashes *always* generate the same hash if given the same input, some passwords are less safe than others even while hashed. For instance:

```
Message: p@sssword
Hashed:  dd87c56f039e99a7bbf800265b15baa53f28c535c21fc7ae949d3604f8d9e41b
```

That means if one of your customers uses that password, the bad actors will see `dd87c56f039e99a7bbf800265b15baa53f28c535c21fc7ae949d3604f8d9e41b` as the hash. And popular passwords and their corresponding hashes are floating around in things called *rainbow tables*, so a popular password can be looked up from its hash. If a customer uses a unique password, on the other hand, the resulting hash probably doesn't appear on a rainbow table. Also, YOU as the database administrator should probably be *salting* the passwords anyhow (but that's a lesson for another day).
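
As a small preview of salting, here is a rough sketch using only the standard library. It is an illustration under assumptions, not a recommended production setup: the function names are made up, and the 16-byte salt and the iteration count are assumed values. The idea is that a random salt is mixed in with each password, so two customers with the same password end up with different hashes.

```python
import hashlib
import hmac
import os

def hash_with_salt(password, salt=None):
    # a fresh random salt for each password (16 bytes is an assumed size)
    if salt is None:
        salt = os.urandom(16)
    # PBKDF2 with SHA-256; the iteration count here is only illustrative
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, 100_000)
    return salt, digest

def verify(password, salt, expected_digest):
    # re-hash with the stored salt and compare the results
    _, digest = hash_with_salt(password, salt)
    return hmac.compare_digest(digest, expected_digest)

# the same password hashed twice gets two different salts
salt_a, hash_a = hash_with_salt('p@ssword')
salt_b, hash_b = hash_with_salt('p@ssword')

print(f'Same hash for the same password: {hash_a == hash_b}')
print(f'Right password verifies:         {verify("p@ssword", salt_a, hash_a)}')
print(f'Wrong password verifies:         {verify("password", salt_a, hash_a)}')
```

The salt is stored next to the hash (it is not a secret); its job is to make a precomputed table of popular hashes useless.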