# Question 3: Break Random OTP

### Challenges:

**Where were these messages taken from?**

They appear to have been taken from George Orwell's novel 1984.

**Could we listen to other people's channels *without* asking for their email or student number?**

The algorithm used in this notebook probably could not do it. It groups messages by key by taking pairs of messages whose XOR satisfies a certain condition, so trying to separate everyone's encrypted messages this way would produce many "collisions". The condition would have to be replaced by a more elaborate and smarter scheme. That said, anything is possible with brute force :D
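
The pairing condition works because XORing two ciphertexts encrypted with the same key cancels the key. A minimal sketch, assuming a hypothetical 10-byte key (`demo_key`) and a helper `xor_bytes` that is not part of the notebook; the two plaintext fragments are taken from the decrypted output shown in the manual testing section:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical key, chosen only for illustration.
demo_key = bytes([17, 200, 45, 99, 3, 250, 128, 64, 77, 31])
p1, p2 = b"ght. The s", b"udden jerk"

c1, c2 = xor_bytes(p1, demo_key), xor_bytes(p2, demo_key)

# The key cancels out: c1 ^ c2 == p1 ^ p2.
same_key_xor = xor_bytes(c1, c2)
# Number of positions whose XOR is >= 64 (the quantity count_over_64 measures).
high_bytes = sum(1 for b in same_key_xor if b >= 64)
print(same_key_xor == xor_bytes(p1, p2), high_bytes)
```

When the two keys differ, the XOR also contains `key1 ^ key2`. If the keys were uniformly random bytes, about three quarters of the positions would be expected to be >= 64, which is what lets a count threshold separate the two cases.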

## Import `custom_md5` (Question 2)
Note: This module requires importing the function `custom_md5` from the Question 2 notebook, `pregunta2.ipynb`.

In [6]:
# Import from Jupyter
# %run '../Pregunta 2/pregunta2.ipynb'
# Import as a script
from pregunta2 import custom_md5

REAL MD5: 7052292b1c02ae4b0b35fabca4fbd487
CUSTOM MD5: 7052292b1c02ae4b0b35fabca4fbd487


## Utils

### Message loading utils

In [7]:
def load_messages(path):
    """
    Loads messages from file.
    Returns dictionary { md5_digest : binary_string }
    """
    with open(path, 'r', encoding='utf-8') as messages_file:
        messages = dict()
        for line in messages_file:
            md5_hash, encoded_message = line.rstrip('\n').split(',')
            md5_hash, encoded_message = md5_hash.strip('"'),  encoded_message.strip('"')
            messages[md5_hash] = encoded_message
    return messages

def find_channel_messages(messages, email, n_alumno):
    """
    Extract messages using the same key from messages dictionary.
    Returns list of ordered binary strings
    """
    channel_messages = list()
    current_message_index = 0
    md5_hash = custom_md5(email, n_alumno * 100 + current_message_index)
    try:
        while True:
            message = messages[md5_hash]
            channel_messages.append(message)
            current_message_index += 1
            md5_hash = custom_md5(email, n_alumno * 100 + current_message_index)
    except KeyError:
        print("Could not find message with md5", md5_hash)
        print("Messages found =", current_message_index)
    return channel_messages
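
The parsing in `load_messages` can also be sketched with the standard `csv` module, which strips the surrounding double quotes on its own. This assumes each line has exactly two quoted fields, as the original parser implies; `load_messages_csv` and the sample rows are hypothetical and only illustrate the format.

```python
import csv
import io

def load_messages_csv(lines):
    """Parse '"md5","bits"' rows into a {md5_digest: binary_string} dict."""
    return {md5_hash: bits for md5_hash, bits in csv.reader(lines)}

# Made-up sample rows; the hashes and bit strings are placeholders.
sample = io.StringIO('"aaaa","0100100001101001"\n"bbbb","0100111101001011"\n')
parsed = load_messages_csv(sample)
print(parsed)
```

An open file object can be passed in place of the `io.StringIO` buffer, since `csv.reader` accepts any iterable of lines.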


### OTP Breaking utils
Some of the functions below are based on those shown in Professor Martín's OTP lecture, available in the [course repository](https://github.com/UC-IIC3253/2021/tree/main/src/otp).

In [8]:
def encode_binary(string_):
    """
    Receives a string of '0' and '1' characters (8 per byte).
    Returns the corresponding bytes object
    """
    # bytes(iterable_of_ints) also handles byte values >= 128,
    # which chr(...) followed by ASCII encoding would reject
    return bytes(int(string_[i:i+8], 2) for i in range(0, len(string_), 8))

def string_to_ints(string_):
    """
    Receives a bytestring.
    Returns a tuple of integers, one per byte value
    """
    return tuple(string_)

def ints_to_string(tup):
    return "".join(chr(x) for x in tup)

def int_tuple_xor(t1, t2):
    return tuple(t1[i] ^ t2[i] for i in range(len(t1)))

def probable_space_count_vector(messages, picked_message):
    vector = [0] * len(picked_message)
    for m in messages:
        res_xor = int_tuple_xor(picked_message, m)
        for i, c in enumerate(res_xor):
            if c >= 64:
                vector[i] += 1
    return tuple(round(x / len(messages), 4) for x in vector)

def probable_space_count_matrix(messages):
    # One row per message: how often each byte position looks like a space
    return [probable_space_count_vector(messages, m) for m in messages]


def count_over_64(tup):
    count = 0
    for i in tup:
        if i >= 64:
            count += 1
    return count
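
To show how the space-count idea leads to key bytes, here is a minimal sketch. It assumes a hypothetical key (`demo_key`), a helper `xor_bytes`, and a cut-off of 0.5 on the score, none of which come from the notebook; the plaintexts are the sample fragments used in the Example section.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

plaintexts = [b"nate the k", b"II letters", b"hexadecima", b"letters XO", b"bably not "]
demo_key = bytes([17, 200, 45, 99, 3, 250, 128, 64, 77, 31])  # hypothetical key
ciphertexts = [xor_bytes(p, demo_key) for p in plaintexts]

recovered = {}
for col in range(len(demo_key)):
    for c in ciphertexts:
        # Fraction of messages whose XOR with c at this column is >= 64.
        score = sum((c[col] ^ other[col]) >= 64 for other in ciphertexts) / len(ciphertexts)
        if score > 0.5:  # assumed cut-off, not taken from the notebook
            # c probably holds a space here, so key byte = cipher byte ^ 32.
            recovered[col] = c[col] ^ 32

print(recovered)
```

Only columns where some message has a space can be recovered this way; in this sample the columns made up entirely of letters stay unknown.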


## `break_random_otp` implementation

### ASCII notes:
Number | Character
--- | ---
32 | `Space`
33 | `!`
39 | `'`
40 | `(`
41 | `)`
44 | `,`
45 | `-`
46 | `.`
48-57 | `0-9`
58 | `:`
59 | `;`
63 | `?`
64 | `@`
65-90 | `A-Z`
97-122 | `a-z`
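
As a quick cross-check, the code points in the table can be turned into the alphabet string programmatically. This is only a sketch; `table_codes` and `alphabet` are illustrative names, and the result should match the `valid_chars` literal defined in the next cell.

```python
# Code points copied from the table above; ranges are inclusive.
table_codes = [32, 33, 39, 40, 41, 44, 45, 46, *range(48, 58), 58, 59, 63, 64,
               *range(65, 91), *range(97, 123)]
alphabet = "".join(chr(code) for code in table_codes)
print(repr(alphabet))
```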

In [9]:
valid_chars = " !'(),-.0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
# TODO: After defining a key byte, check every message in possible_mates at that index
# and remove from the list those that, once decoded with the key, have a character
# at that position that does not belong to valid_chars

# TODO: Idea: variable threshold

def break_random_otp(encrypted_messages: list[str]) -> dict[str, list[str]]:
    """
    Arguments:
        encrypted_messages: list[str] - encrypted messages, each given as a binary string.
    Returns:
        dict[str, list[str]] - dictionary mapping a key to all the decrypted messages that
            were originally encrypted with that key.
    """
    int_result_dict = {}
    result_dict = {}
    # If input is empty, return empty dict
    if not encrypted_messages:
        return result_dict
    # Transform binary strings into int tuples representing each byte value
    int_encrypted_messages = list(map(lambda m: string_to_ints(encode_binary(m)), encrypted_messages))
    # Main loop
    while int_encrypted_messages:
        print(len(int_encrypted_messages))
        # Pick message
        m1 = int_encrypted_messages.pop()
        # Get the rest of the messages that may have been encrypted with the same key (possible mates)
        possible_mates = [m1]
        for i in range(len(int_encrypted_messages)):
            # Shift by the number of messages already popped (possible_mates starts with m1)
            index = i - (len(possible_mates) - 1)
            m2 = int_encrypted_messages[index]
            if count_over_64(int_tuple_xor(m1, m2)) <= 2:
                int_encrypted_messages.pop(index)
                possible_mates.append(m2)
        # Define vector to store possible key
        current_key_vector = [(0, 0)] * len(m1)
        # Loop once for every "slot" in the key
        for _ in range(len(current_key_vector)):
            # Get matrix containing the probability of each character of being a space
            matrix = probable_space_count_matrix(possible_mates)
            # Get max value for each column (and the respective index):
            max_vector = [(0, 0)] * len(current_key_vector)
            for i in range(len(matrix)):
                for j in range(len(matrix[i])):
                    # TODO: case == (equals)
                    if matrix[i][j] > max_vector[j][1]:
                        max_vector[j] = (i, matrix[i][j])
            # Find key byte
            key_changed = False
            changed_index = 0
            while not key_changed:
                curr_max = max(max_vector, key=lambda x: x[1])
                index = max_vector.index(curr_max)
                if current_key_vector[index] == (0, 0):
                    current_key_vector[index] = (curr_max[1], possible_mates[curr_max[0]][index] ^ 32)
                    key_changed = True
                    changed_index = index
                else: 
                    max_vector[index] = (0, 0)
            # Remove unmatching
            unmatching_mates = []
            for i in range(len(possible_mates)):
                index = i - len(unmatching_mates)
                m = possible_mates[index]
                char = chr(m[changed_index] ^ current_key_vector[changed_index][1])
                if char not in valid_chars:
                    possible_mates.pop(index)
                    unmatching_mates.append(m)
            int_encrypted_messages += unmatching_mates
            # print(changed_index)
            # for x in possible_mates:
            #     print(repr(ints_to_string(int_tuple_xor(tuple(x[1] for x in current_key_vector), x))))
        current_key = tuple(x[1] for x in current_key_vector)
        int_result_dict[tuple(current_key)] = possible_mates[:]
        # NOTE: stops after the first key group, so the result holds a single key
        break
    for key in int_result_dict:
        decrypted_messages = [ints_to_string(int_tuple_xor(key, m)) for m in int_result_dict[key]]
        result_dict[ints_to_string(key)] = decrypted_messages
    return result_dict
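
The "Remove unmatching" step inside `break_random_otp` can also be written without popping from a list while looping over it. A minimal sketch, where `split_by_validity`, the key byte, and the two sample tuples are all hypothetical:

```python
def split_by_validity(messages, index, key_byte, alphabet):
    """Split messages into (kept, rejected) by whether byte `index` decrypts into `alphabet`."""
    kept, rejected = [], []
    for message in messages:
        target = kept if chr(message[index] ^ key_byte) in alphabet else rejected
        target.append(message)
    return kept, rejected

key_byte = 0x41  # hypothetical key byte
good = (ord("e") ^ key_byte, 0)  # decrypts to 'e' at index 0
bad = (0x01 ^ key_byte, 0)       # decrypts to the control character \x01
kept, rejected = split_by_validity([good, bad], 0, key_byte, "abcdefghijklmnopqrstuvwxyz ")
print(kept, rejected)
```

Building two new lists keeps the indices stable, at the cost of one extra pass over the candidates.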


## Testing

In [None]:
if __name__ == "__main__":
    file_path = 'mensajes_pregunta_3/mensajes_p3.csv'
    email = 'matias.duhalde@uc.cl'
    n_alumno = 18639496

    messages = load_messages(file_path)
    out = find_channel_messages(messages, email, n_alumno)
    result = break_random_otp(out)
    for key in result:
        print("KEY:", repr(key))
        for message in result[key]:
            print(repr(message))

## Example
What XORed messages look like

In [11]:
example_m = ["nate the k", "II letters", "hexadecima", "letters XO", "bably not "]
example_m_int = [string_to_ints(bytes(m, "ASCII")) for m in example_m]

# for a in example_m_int:
#     for b in example_m_int:
#         res = int_tuple_xor(a, b)
        
#         print(count_over_64(res))



In [12]:
# What two letters XORed together look like:
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
# for a in letters:
#     for b in letters:
#         print(ord(a) ^ ord(b))
# Result: A byte with value < 64
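
The result above follows from the bit layout of ASCII: every letter has bit 6 (value 64) set and bit 7 clear, while a space (32) has bit 6 clear. A short sketch that checks both cases exhaustively, using only the standard `string` module:

```python
import string

letters = string.ascii_letters
# Letter XOR letter: bit 6 is set in both operands and cancels out.
letter_pairs_low = all(ord(a) ^ ord(b) < 64 for a in letters for b in letters)
# Space XOR letter: only the letter has bit 6 set, so it survives.
space_pairs_high = all(ord(" ") ^ ord(a) >= 64 for a in letters)
print(letter_pairs_low, space_pairs_high)  # True True
```

This is why a position whose XOR with most other messages is >= 64 is a good candidate for a space.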

### Manual testing

In [132]:
messages = load_messages(file_path)
out = find_channel_messages(messages, email, n_alumno)
int_encrypted_messages = list(map(lambda m: string_to_ints(encode_binary(m)), out))
m1 = int_encrypted_messages[0]
possible_mates = []
for m2 in int_encrypted_messages:
    if m2 not in possible_mates:
        if count_over_64(int_tuple_xor(m1, m2)) <= 4:
            add_m2 = True
            for m3 in possible_mates:
                if count_over_64(int_tuple_xor(m3, m2)) > 4:
                    add_m2 = False
            if add_m2:
                possible_mates.append(m2)

a = probable_space_count_matrix(possible_mates)
for i in range(len(a)):
    for j in range(len(a[i])):
        print((a[i][j], f"{possible_mates[i][j]:03}"), end="")
    print()

Could not find message with md5 7a13c239aafdf428b7952849a18b2eb5
Messages found = 200
(0.0667, '094')(0.1333, '060')(0.0667, '003')(0.7333, '000')(0.2667, '071')(0.1333, '037')(0.0, '072')(0.2667, '032')(0.6667, '093')(0.2, '099')
(0.0667, '076')(0.1333, '048')(0.0667, '019')(0.2667, '075')(0.7333, '009')(0.8667, '081')(0.0, '074')(0.2667, '032')(0.3333, '015')(0.2, '123')
(0.9333, '025')(0.1333, '059')(0.0667, '002')(0.2667, '090')(0.2667, '071')(0.1333, '030')(0.0, '070')(0.7333, '101')(0.3333, '014')(0.2, '124')
(0.0667, '092')(0.1333, '049')(0.0667, '007')(0.7333, '002')(0.2667, '071')(0.1333, '005')(0.0, '072')(0.2667, '032')(0.6667, '093')(0.2, '098')
(0.0667, '086')(0.1333, '033')(0.0667, '016')(0.2667, '070')(0.2667, '071')(0.1333, '025')(0.0, '065')(0.2667, '043')(0.3333, '025')(0.8, '048')
(0.0667, '074')(0.1333, '060')(0.0667, '022')(0.2667, '069')(0.7333, '014')(0.1333, '031')(0.0, '071')(0.7333, '101')(0.3333, '004')(0.2, '127')
(0.0667, '076')(0.1333, '038')(0.9333, '087'

In [174]:
current_key = [25 ^ 32, 116 ^ 32, 87 ^ 32, 14 ^ 32, 71 ^ 32, 81 ^ 32, 0 ^ 32, 101 ^ 32, 93 ^ 32, 48 ^ 32]
decrypted_messages = [ints_to_string(int_tuple_xor(current_key, m)) for m in possible_mates]

for k in decrypted_messages:
    print(repr(k))

'ght. The s'
'udden jerk'
' out of sl'
'eep, the r'
'ough hand '
'shaking yo'
'ur shoulde'
'hts glarin'
'g in your '
'eyes, the '
'ring of ha'
'ound the b'
'rity of ca'
'B$~h;nUX\x06w'
"er}\x13'Ks\x7f x"


In [172]:
for i in range(0, 127):
    current_key = [25 ^ 32, 116 ^ 32, 87 ^ 32, 14 ^ 32, i ^ 32, 81 ^ 32, 0 ^ 32, 29 ^ 32, 93 ^ 32, 48 ^ 32]
    decrypted_messages = [ints_to_string(int_tuple_xor(current_key, m)) for m in possible_mates]

    print(i, repr(decrypted_messages[0]))

0 'ght.gTh\x1d s'
1 'ght.fTh\x1d s'
2 'ght.eTh\x1d s'
3 'ght.dTh\x1d s'
4 'ght.cTh\x1d s'
5 'ght.bTh\x1d s'
6 'ght.aTh\x1d s'
7 'ght.`Th\x1d s'
8 'ght.oTh\x1d s'
9 'ght.nTh\x1d s'
10 'ght.mTh\x1d s'
11 'ght.lTh\x1d s'
12 'ght.kTh\x1d s'
13 'ght.jTh\x1d s'
14 'ght.iTh\x1d s'
15 'ght.hTh\x1d s'
16 'ght.wTh\x1d s'
17 'ght.vTh\x1d s'
18 'ght.uTh\x1d s'
19 'ght.tTh\x1d s'
20 'ght.sTh\x1d s'
21 'ght.rTh\x1d s'
22 'ght.qTh\x1d s'
23 'ght.pTh\x1d s'
24 'ght.\x7fTh\x1d s'
25 'ght.~Th\x1d s'
26 'ght.}Th\x1d s'
27 'ght.|Th\x1d s'
28 'ght.{Th\x1d s'
29 'ght.zTh\x1d s'
30 'ght.yTh\x1d s'
31 'ght.xTh\x1d s'
32 'ght.GTh\x1d s'
33 'ght.FTh\x1d s'
34 'ght.ETh\x1d s'
35 'ght.DTh\x1d s'
36 'ght.CTh\x1d s'
37 'ght.BTh\x1d s'
38 'ght.ATh\x1d s'
39 'ght.@Th\x1d s'
40 'ght.OTh\x1d s'
41 'ght.NTh\x1d s'
42 'ght.MTh\x1d s'
43 'ght.LTh\x1d s'
44 'ght.KTh\x1d s'
45 'ght.JTh\x1d s'
46 'ght.ITh\x1d s'
47 'ght.HTh\x1d s'
48 'ght.WTh\x1d s'
49 'ght.VTh\x1d s'
50 'ght.UTh\x1d s'
51 'ght.TTh\x1d s'
52 'ght.STh\x1d s'
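
Rather than reading every candidate by eye, a key byte can be narrowed down by keeping only the values that decrypt a whole column into the allowed alphabet. A minimal sketch, assuming a hypothetical key byte and a made-up column of plaintext characters; `candidate_key_bytes` is not part of the notebook:

```python
VALID = set(" !'(),-.0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")

def candidate_key_bytes(column, alphabet=VALID):
    """Return every key byte that decrypts all ciphertext bytes in `column` into `alphabet`."""
    return [k for k in range(256) if all(chr(c ^ k) in alphabet for c in column)]

true_key_byte = 0x67  # hypothetical key byte
# One ciphertext byte per message at the same position.
column = [ord(ch) ^ true_key_byte for ch in " eTn.k'0"]
candidates = candidate_key_bytes(column)
print(true_key_byte in candidates, len(candidates))
```

The more messages share the key, the fewer candidates survive, so the remaining ones can be checked manually.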
