# Question 3: Break Random OTP

### Challenges:

**Where were these messages obtained from?**

They appear to come from George Orwell's novel 1984.

**Can we listen to other people's channels *without* asking for their email or student number?**

The algorithm used in this notebook probably could not do it. When it tries to separate the encrypted messages by the key they were encrypted with, there would be many "collisions", because it groups messages by checking whether the XOR of each pair satisfies a simple condition (at most 4 bytes of the XOR are greater than or equal to 64). That condition would have to be replaced by a more elaborate and smarter criterion. That said, anything is possible with brute force :D
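
As a rough estimate of how often such collisions could occur, consider two ciphertexts from different channels. The sketch below assumes 10-byte messages (as in the sample output further down), 7-bit bytes, and independent, uniformly random keys, so each byte of their XOR is greater than or equal to 64 with probability 1/2. The function name `false_match_probability` is hypothetical and not part of the notebook. It computes the probability that such a pair still passes the `count_over_64(...) <= 4` test used by `break_random_otp`.

```python
from math import comb

def false_match_probability(length: int = 10, threshold: int = 4, p_high: float = 0.5) -> float:
    """Probability that at most `threshold` of `length` independent bytes are >= 64,
    when each byte is >= 64 with probability `p_high` (an assumption, see above)."""
    return sum(
        comb(length, k) * p_high**k * (1 - p_high)**(length - k)
        for k in range(threshold + 1)
    )

print(round(false_match_probability(), 3))  # 0.377
```

Under these assumptions, roughly 38% of the pairs encrypted with different keys would still look like mates, which suggests why the pairwise condition alone would not separate a large number of channels.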

## Import `custom_md5` (Question 2)
Note: This notebook requires importing the function `custom_md5` from the Question 2 notebook, `pregunta2.ipynb`.

In [3]:
# Import from Jupyter
%run '../Pregunta 2/pregunta2.ipynb'
# Import as a script
# from pregunta2 import custom_md5

REAL MD5: 7052292b1c02ae4b0b35fabca4fbd487
CUSTOM MD5: 7052292b1c02ae4b0b35fabca4fbd487


## Utils

### Message loading utils

In [4]:
def load_messages(path):
    """
    Loads messages from file.
    Returns dictionary { md5_digest : binary_string }
    """
    with open(path, 'r', encoding='utf-8') as messages_file:
        messages = dict()
        for line in messages_file:
            md5_hash, encoded_message = line.rstrip('\n').split(',')
            md5_hash, encoded_message = md5_hash.strip('"'),  encoded_message.strip('"')
            messages[md5_hash] = encoded_message
    return messages

def find_channel_messages(messages, email, n_alumno):
    """
    Extract messages using the same key from messages dictionary.
    Returns list of ordered binary strings
    """
    channel_messages = list()
    current_message_index = 0
    md5_hash = custom_md5(email, n_alumno * 100 + current_message_index)
    try:
        while True:
            message = messages[md5_hash]
            channel_messages.append(message)
            current_message_index += 1
            md5_hash = custom_md5(email, n_alumno * 100 + current_message_index)
    except KeyError:
        print("Could not find message with md5", md5_hash)
        print("Messages found =", current_message_index)
    return channel_messages


### OTP Breaking utils
Some of the functions below are based on those shown in the OTP lecture by Professor Martín, available in the [course repository](https://github.com/UC-IIC3253/2021/tree/main/src/otp).

In [5]:
def encode_binary(string_):
    """
    Receives a string of '0' and '1' characters.
    Returns the bytestring obtained by reading each group of 8 bits as one byte
    """
    # Build the bytes directly so that values above 127 do not raise an encoding error
    return bytes(int(string_[i:i+8], 2) for i in range(0, len(string_), 8))

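def string_to_ints(string_):
    """
    Receives a bytestring.
    Returns tuple of integers, one per byte
    """
    return tuple(string_)

def ints_to_string(tup):
    """
    Receives an iterable of integers.
    Returns string built from the character with each code point
    """
    return "".join(chr(x) for x in tup)

def int_tuple_xor(t1, t2):
    """
    Returns tuple with the element-wise XOR of t1 and t2 (t2 must be at least as long as t1)
    """
    return tuple(t1[i] ^ t2[i] for i in range(len(t1)))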

def probable_space_count_vector(messages, picked_message):
    """
    For each position of picked_message, returns the fraction of messages whose
    XOR with picked_message is >= 64 at that position. A high fraction suggests
    that picked_message has a space (or another character below 64) there.
    """
    vector = [0] * len(picked_message)
    for m in messages:
        res_xor = int_tuple_xor(picked_message, m)
        for i, c in enumerate(res_xor):
            if c >= 64:
                vector[i] += 1
    return tuple(round(x / len(messages), 4) for x in vector)

def probable_space_count_matrix(messages):
    """
    Returns one probable_space_count_vector per message, in the same order as messages
    """
    return [probable_space_count_vector(messages, m) for m in messages]

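def get_max_index_vector(matrix):
    """
    For each position (column of matrix), returns the list of message indices
    whose value is the highest for that position
    """
    vector = []
    # Transpose so that each row holds the values of one position across all messages
    matrix = list(zip(*matrix))

    for a in matrix:
        max_prob = max(a)
        max_indices = []
        for i, b in enumerate(a):
            if b == max_prob:
                max_indices.append(i)
                # If the maximum is 0 nothing tells the candidates apart,
                # so keep only the first one
                if b == 0:
                    break
        vector.append(max_indices)
    return vector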

def count_over_64(tup):
    count = 0
    for i in tup:
        if i >= 64:
            count += 1
    return count

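def rec_part_keys(lok, current_key, messages, max_vector):
    """
    Recursively extends current_key one position at a time and appends every
    complete candidate key to lok. For each position, every message index listed
    in max_vector is assumed to hold a space (32) there.
    """
    if len(current_key) == len(messages[0]):
        lok.append(current_key)
        return
    for i in max_vector[len(current_key)]:
        # If message i has a space at this position, the key byte is ciphertext ^ 32
        next_key = current_key + [messages[i][len(current_key)] ^ 32]
        rec_part_keys(lok, next_key, messages, max_vector)

def gen_all_possible_keys(messages, max_vector):
    """
    Returns list with every candidate key that can be built from max_vector
    """
    list_of_keys = []
    rec_part_keys(list_of_keys, [], messages, max_vector)
    return list_of_keys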
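
The idea behind these helpers can be tried on a toy case. The sketch below is self-contained and uses a hypothetical key and plaintexts chosen only for illustration. `xor_bytes` and `space_votes` are not part of the notebook. It counts, for one ciphertext, how many of the others XOR to a value of 64 or more at each position, and then assumes a space at the position with the most votes to recover that key byte.

```python
SPACE = 32

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length bytestrings position by position."""
    return bytes(x ^ y for x, y in zip(a, b))

def space_votes(picked: bytes, others: list[bytes]) -> list[int]:
    """For each position, count the other ciphertexts whose XOR with `picked` is >= 64."""
    return [sum((c ^ other[i]) >= 64 for other in others) for i, c in enumerate(picked)]

# Hypothetical key and plaintexts (illustration only)
key = bytes([57, 84, 119, 46, 103])
plaintexts = [b"a dog", b"we go", b"hello"]
ciphertexts = [xor_bytes(p, key) for p in plaintexts]

votes = space_votes(ciphertexts[0], ciphertexts[1:])
print(votes)  # [0, 2, 1, 0, 0]

# Position 1 gets the most votes, so assume a space there and recover that key byte
best = votes.index(max(votes))
recovered = ciphertexts[0][best] ^ SPACE
print(best, recovered == key[best])  # 1 True
```

With only three messages the votes are noisy (position 2 gets one vote because the second plaintext has a space there), which is why the notebook keeps several candidates per position and scores the resulting keys afterwards.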


## `break_random_otp` implementation

### ASCII notes:
Number | Character
--- | ---
32 | `Space`
33 | `!`
39 | `'`
40 | `(`
41 | `)`
44 | `,`
45 | `-`
46 | `.`
48-57 | `0-9`
58 | `:`
59 | `;`
63 | `?`
64 | `@`
65-90 | `A-Z`
97-122 | `a-z`
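
The table also shows why the code compares XOR results against 64: every letter has bit 6 (value 64) set, while the space, the punctuation marks and the digits listed below 64 do not. A short check of that property, written as a sketch that uses only the standard library:

```python
import string

letters = [ord(c) for c in string.ascii_letters]
low_chars = [ord(c) for c in " !'(),-.0123456789:;?"]  # table entries below 64

# letter XOR letter clears bit 6, so the result is always below 64
assert all((a ^ b) < 64 for a in letters for b in letters)
# space (or any other character below 64) XOR letter keeps bit 6 set, so the result is >= 64
assert all((a ^ b) >= 64 for a in low_chars for b in letters)
# two characters below 64 also XOR to a value below 64
assert all((a ^ b) < 64 for a in low_chars for b in low_chars)
```

So a byte of 64 or more in the XOR of two ciphertexts only indicates that exactly one of the two plaintext characters is below 64. A space is a plausible guess for that character, not a certainty.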

In [8]:
# valid_chars = " !'(),-.0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
valid_chars = " ,.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def break_random_otp(encrypted_messages: list[str]) -> dict[str, list[str]]:
    """
    Arguments:
        encrypted_messages:  list[str] - list of encrypted messages, each one a binary string.
    Returns:
        dict[str, list[str]] - dictionary mapping a key to all the decrypted messages that
            were originally encrypted with that key.
    """
    int_result_dict = {}
    result_dict = {}
    # If input is empty, return empty dict
    if not encrypted_messages:
        return result_dict
    # Transform binary strings into int tuples representing each byte value
    int_encrypted_messages = list(map(lambda m: string_to_ints(encode_binary(m)), encrypted_messages))
    # Main loop
    prev_len = 0
    while int_encrypted_messages:
        if prev_len == len(int_encrypted_messages):
            break
        prev_len = len(int_encrypted_messages)
        # Pick message
        m1 = int_encrypted_messages[0]
        # Get the rest of the messages that may have been encrypted with the same key (possible mates)
        possible_mates = []
        for m2 in int_encrypted_messages:
            if m2 not in possible_mates:
                if count_over_64(int_tuple_xor(m1, m2)) <= 4:
                    add_m2 = True
                    for m3 in possible_mates:
                        if count_over_64(int_tuple_xor(m3, m2)) > 4:
                            add_m2 = False
                    if add_m2:
                        possible_mates.append(m2)
        for m in possible_mates:
            if m in int_encrypted_messages:
                int_encrypted_messages.remove(m)
        # Get space probability matrix
        matrix = probable_space_count_matrix(possible_mates)
        max_vector = get_max_index_vector(matrix)
        # Generate every candidate key from the most probable space positions and keep the best-scoring one
        res = gen_all_possible_keys(possible_mates, max_vector)
        current_key = []
        current_key_score = 0
        for possible_key in res:
            decrypted_messages = [ints_to_string(int_tuple_xor(possible_key, m)) for m in possible_mates]
            possible_key_score = 0
            for a in decrypted_messages:
                for char in a:
                    if char in valid_chars:
                        possible_key_score += 1
            if possible_key_score > current_key_score:
                current_key = possible_key[:]
                current_key_score = possible_key_score
        # loop this somehow
        # decrypted_messages = [ints_to_string(int_tuple_xor(current_key, m)) for m in possible_mates]
        # for j in range(len(decrypted_messages[0])):
        #     invalid_count = 0
        #     for i in range(len(decrypted_messages)):
        #         if decrypted_messages[i][j] not in valid_chars:
        #             invalid_count += 1
        #     if (invalid_count / len(decrypted_messages)) > 0.2:
        #         if max_vector.
        # Remove bad messages
        to_remove = []
        for m in possible_mates:
            decrypted_with_current_key = ints_to_string(int_tuple_xor(current_key, m))
            invalid_count = 0
            for char in decrypted_with_current_key:
                if char not in valid_chars:
                    invalid_count += 1
            if invalid_count / len(decrypted_with_current_key) >= 0.3:
                to_remove.append(m)
        for m in to_remove:
            possible_mates.remove(m)
            int_encrypted_messages.append(m)
        # Add other possible messages
        for m in int_encrypted_messages:
            decrypted_with_current_key = ints_to_string(int_tuple_xor(current_key, m))
            invalid_count = 0
            for char in decrypted_with_current_key:
                if char not in valid_chars:
                    invalid_count += 1
            if invalid_count / len(decrypted_with_current_key) <= 0.15:
                possible_mates.append(m)
        for m in possible_mates:
            if m in int_encrypted_messages:
                int_encrypted_messages.remove(m)
        int_result_dict[tuple(current_key)] = possible_mates[:]
    for key in int_result_dict:
        decrypted_messages = [ints_to_string(int_tuple_xor(key, m)) for m in int_result_dict[key]]
        result_dict[ints_to_string(key)] = decrypted_messages
    return result_dict
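
The key-selection step inside `break_random_otp` can be illustrated in isolation. The sketch below uses a hypothetical helper `score_key` and made-up data. It mirrors the scoring idea (count the decrypted characters that fall in `valid_chars`) rather than reproducing the function above.

```python
VALID_CHARS = set(" ,.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz")

def score_key(key, ciphertexts):
    """Count the decrypted characters that belong to VALID_CHARS."""
    return sum(chr(k ^ c) in VALID_CHARS for ct in ciphertexts for k, c in zip(key, ct))

# Hypothetical key and plaintexts (illustration only)
true_key = (57, 84, 119)
ciphertexts = [tuple(p ^ k for p, k in zip(text, true_key)) for text in (b"the", b"cat")]

# A candidate that differs from the true key only in the last byte
wrong_key = (57, 84, 119 ^ 64)

candidates = [wrong_key, true_key]
best = max(candidates, key=lambda k: score_key(k, ciphertexts))
print(score_key(true_key, ciphertexts), score_key(wrong_key, ciphertexts))  # 6 4
print(best == true_key)  # True
```

The wrong candidate decrypts the last column to `%` and `4`, neither of which is in the reduced character set, so it scores lower than the true key.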


## Testing

In [9]:
if __name__ == "__main__":
    file_path = 'mensajes_pregunta_3/mensajes_p3.csv'
    email = 'matias.duhalde@uc.cl'
    n_alumno = 18639496

    messages = load_messages(file_path)
    out = find_channel_messages(messages, email, n_alumno)
    result = break_random_otp(out)
    for key in result:
        print("KEY:", repr(key))
        for message in result[key]:
            print(repr(message))

Could not find message with md5 7a13c239aafdf428b7952849a18b2eb5
Messages found = 200
[(123, 112, 9, 70, 92, 31, 117, 29, 123, 103), (92, 38, 10, 61, 64, 58, 83, 58, 93, 104)]
[(75, 48, 87, 72, 6, 18, 69, 54, 93, 98), (74, 49, 4, 14, 19, 25, 69, 55, 24, 48), (21, 45, 125, 91, 50, 82, 117, 21, 23, 5), (88, 41, 8, 84, 19, 31, 10, 6, 46, 28)]
[(84, 36, 33, 36, 24, 103, 108, 102, 117, 19), (69, 8, 117, 60, 15, 118, 117, 103, 111, 122), (34, 25, 7, 78, 3, 74, 92, 82, 73, 63), (116, 35, 84, 50, 31, 17, 100, 82, 76, 11), (115, 47, 13, 62, 5, 92, 98, 91, 65, 0)]
[]
[(23, 25, 119, 60, 13, 109, 52, 127, 126, 63), (72, 114, 86, 51, 76, 52, 19, 86, 126, 2), (67, 55, 69, 53, 14, 48, 23, 0, 122, 65), (64, 120, 95, 63, 0, 120, 52, 3, 107, 77), (125, 123, 84, 62, 5, 14, 54, 88, 65, 26)]
[(109, 56, 15, 9, 76, 80, 98, 6, 46, 36), (106, 106, 9, 93, 64, 90, 126, 82, 122, 47), (106, 106, 9, 93, 64, 90, 126, 82, 70, 34), (89, 97, 111, 34, 86, 98, 56, 42, 96, 18), (69, 77, 45, 121, 108, 122, 123, 120, 105, 4

### Manual testing

In [17]:
if __name__ == "__main__":
    messages = load_messages(file_path)
    out = find_channel_messages(messages, email, n_alumno)
    int_encrypted_messages = list(map(lambda m: string_to_ints(encode_binary(m)), out))
    m1 = int_encrypted_messages[0]
    possible_mates = []
    for m2 in int_encrypted_messages:
        if m2 not in possible_mates:
            if count_over_64(int_tuple_xor(m1, m2)) <= 4:
                add_m2 = True
                for m3 in possible_mates:
                    if count_over_64(int_tuple_xor(m3, m2)) > 4:
                        add_m2 = False
                if add_m2:
                    possible_mates.append(m2)

    a = probable_space_count_matrix(possible_mates)
    for i in range(len(a)):
        for j in range(len(a[i])):
            print((a[i][j], f"{possible_mates[i][j]:03}"), end="")
        print()
    current_key = [25 ^ 32, 116 ^ 32, 87 ^ 32, 14 ^ 32, 71 ^ 32, 81 ^ 32, 0 ^ 32, 101 ^ 32, 93 ^ 32, 48 ^ 32]
    decrypted_messages = [ints_to_string(int_tuple_xor(current_key, m)) for m in possible_mates]

    for k in decrypted_messages:
        print(repr(k))

Could not find message with md5 7a13c239aafdf428b7952849a18b2eb5
Messages found = 200
(0.1463, '094')(0.122, '060')(0.3415, '003')(0.7805, '000')(0.7073, '071')(0.1707, '037')(0.2683, '072')(0.2927, '032')(0.6829, '093')(0.4878, '099')
(0.1463, '076')(0.122, '048')(0.3415, '019')(0.2195, '075')(0.2927, '009')(0.8293, '081')(0.2683, '074')(0.2927, '032')(0.3171, '015')(0.4878, '123')
(0.8537, '025')(0.122, '059')(0.3415, '002')(0.2195, '090')(0.7073, '071')(0.1707, '030')(0.2683, '070')(0.7073, '101')(0.3171, '014')(0.4878, '124')
(0.1463, '092')(0.122, '049')(0.3415, '007')(0.7805, '002')(0.7073, '071')(0.1707, '005')(0.2683, '072')(0.2927, '032')(0.6829, '093')(0.4878, '098')
(0.1463, '086')(0.122, '033')(0.3415, '016')(0.2195, '070')(0.7073, '071')(0.1707, '025')(0.2683, '065')(0.2927, '043')(0.3171, '025')(0.5122, '048')
(0.1463, '074')(0.122, '060')(0.3415, '022')(0.2195, '069')(0.2927, '014')(0.1707, '031')(0.2683, '071')(0.7073, '101')(0.3171, '004')(0.4878, '127')
(0.1463, '076'