# Notebook 02 - Statistical Analysis
<i>This material has been designed by Azade Rezaeeade and Lizzy Grootjen. Distribution without permission is not permitted.</i>

Let's start with the statistical analysis! 

Load in the Python imports, the auxiliary functions and the auxiliary arrays below. You can have a look at the internals, but they are meant to be used as a black box. 

In [None]:
# Python imports

import numpy as np
import scipy.stats
from tqdm import trange

In [None]:
## Auxiliary arrays

sbox = np.array([99, 124, 119, 123, 242, 107, 111, 197, 48,  1,   103, 43,  254, 215, 171, 118, 
                202, 130, 201, 125, 250, 89,  71,  240, 173, 212, 162, 175, 156, 164, 114, 192, 
                183, 253, 147, 38,  54,  63,  247, 204, 52,  165, 229, 241, 113, 216, 49,  21, 
                4,   199, 35,  195, 24,  150, 5,   154, 7,   18,  128, 226, 235, 39,  178, 117, 
                9,   131, 44,  26,  27,  110, 90,  160, 82,  59,  214, 179, 41,  227, 47,  132, 
                83,  209, 0,   237, 32,  252, 177, 91,  106, 203, 190, 57,  74,  76,  88,  207,
                208, 239, 170, 251, 67,  77,  51,  133, 69,  249, 2,   127, 80,  60,  159, 168, 
                81,  163, 64,  143, 146, 157, 56,  245, 188, 182, 218, 33,  16,  255, 243, 210, 
                205, 12,  19,  236, 95,  151, 68,  23,  196, 167, 126, 61,  100, 93,  25,  115, 
                96,  129, 79,  220, 34,  42,  144, 136, 70,  238, 184, 20,  22,  94,  11,  219, 
                224, 50,  58,  10,  73,  6,   36,  92,  194, 211, 172, 98,  145, 149, 228, 121, 
                231, 200, 55,  109, 141, 213, 78,  169, 108, 86,  244, 234, 101, 122, 174, 8, 
                186, 120, 37,  46,  28,  166, 180, 198, 232, 221, 116, 31,  75,  189, 139, 138, 
                112, 62,  181, 102, 72,  3,   246, 14,  97,  53,  87,  185, 134, 193, 29,  158, 
                225, 248, 152, 17,  105, 217, 142, 148, 155, 30,  135, 233, 206, 85,  40,  223, 
                140, 161, 137, 13,  191, 230, 66,  104, 65,  153, 45,  15,  176, 84,  187, 22]) 

hw = np.array([bin(n).count("1") for n in range(0, 256)])

In [6]:
# Auxiliary functions

# Load in the dataset using numpy
def load_dataset(path):
    dataset = np.load(path)
    traces = dataset["trace"][:, 1300:2000]
    textin = dataset["textin"]
    return traces, textin

# Decode the bytes to text using ASCII
def decode_bytes_to_text(predicted_key):
    password = ""
    for elem in predicted_key:
        elem_char = chr(elem)
        password = password + elem_char
    return password

# Calculate the hypotheses for the attack
def calculate_hypothetical_sensitive_values(key_guess, textin, byte):
    hypothetical_sensitive_values = sbox[key_guess ^ textin[:, byte]]
    HW_hypothetical_sensitive_values = hw[hypothetical_sensitive_values]
    return HW_hypothetical_sensitive_values

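# Correlation power analysis (CPA): returns the highest absolute Pearson
# correlation between the hypothesis and any time sample of the traces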
def cpa(traces, hypothesis):
    num_samples = traces.shape[1]
    correlation_value = np.zeros(num_samples)
    for i in range(num_samples):
        correlation_value[i], _ = scipy.stats.pearsonr(traces[:, i], hypothesis)

    max_correlate_value = np.max(abs(correlation_value))
    return max_correlate_value


## Exercise 1: Prepare our dataset
Let's start with recovering the first character of our password. We already collected traces for you from this target device. Load in the traces using the auxiliary functions. 

To do:
- Fill in the blanks. Take a look at NB_01 if you want to.
- To determine `num_traces` and `time_samples`, use the [shape](https://numpy.org/doc/2.1/reference/generated/numpy.shape.html) function.
- How many traces are there? And how many time samples?
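
As a reminder of how `shape` behaves, here is a minimal sketch on a made-up array. The name `example_traces` and its dimensions are assumptions chosen for illustration; they say nothing about the real dataset.

```python
import numpy as np

# Hypothetical array: 5 rows (one per trace) with 12 columns (one per time sample)
example_traces = np.zeros((5, 12))

# shape is a tuple with one entry per dimension
rows, columns = example_traces.shape
print(example_traces.shape)  # (5, 12)
print(rows, columns)         # 5 12
```

The first entry counts the rows and the second counts the columns, so you can unpack both in one line or index the tuple directly.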

In [None]:
## Exercise
path = "../../datasets/dataset_nb02.npy"

traces, plaintexts = ...

traces_shape = ...

num_traces = ...
time_samples = ...

print("Shape of dataset: " + str(traces_shape))
print("Number of traces: " + str(num_traces))
print("Number of time samples: " + str(time_samples))


## Exercise 2: Prepare our correlation attack

Let's start with preparing the attack. As we have quite a few traces of exactly the same computation, we can use statistics to infer information about the password. The only thing we need to do is to test every possible key.

Suppose we guess that the first byte of our password has the value 37, which is 00100101 in binary. This is one key hypothesis. We know that a "1" consumes more power in a computer than a "0". To determine whether the value 37 occurs somewhere in the power trace, we can correlate each time sample of our traces with this key hypothesis.
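
To make this idea concrete, the sketch below counts the "1" bits of 37 (its Hamming weight) and then shows, on simulated data only, that a correct hypothesis correlates with the power while an unrelated one does not. The leakage model (power = Hamming weight + Gaussian noise) and all `toy_*` names are assumptions for illustration, not measurements from the target device.

```python
import numpy as np

# Hamming weight = number of "1" bits in the binary representation
value = 37
print(format(value, "08b"))   # 00100101
print(bin(value).count("1"))  # 3

# Toy leakage model (assumption): power = Hamming weight + Gaussian noise
rng = np.random.default_rng(0)
toy_values = rng.integers(0, 256, size=1000)
toy_hw = np.array([bin(v).count("1") for v in toy_values])
toy_power = toy_hw + rng.normal(0.0, 1.0, size=1000)

# Correct hypothesis: the Hamming weights that were really processed
r_correct = np.corrcoef(toy_hw, toy_power)[0, 1]

# Wrong hypothesis: Hamming weights of unrelated random values
wrong_values = rng.integers(0, 256, size=1000)
wrong_hw = np.array([bin(v).count("1") for v in wrong_values])
r_wrong = np.corrcoef(wrong_hw, toy_power)[0, 1]

print("correct hypothesis:", round(r_correct, 2))
print("wrong hypothesis:  ", round(r_wrong, 2))
```

The correct hypothesis should give a correlation well above zero, while the wrong one stays close to zero. This gap is what the attack exploits.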


To do:
- Fill in the blanks of the **for loop** and the **cpa** function below. 


In [None]:
# Exercise
# Returns an array with a correlation value for each key hypothesis
def correlate_traces_to_byte_hypotheses(traces, textin, byte_number):
    number_of_key_hypothesis = ...  # 1 byte = 8 bits = ?? possible binary values

    # Just empty array initialisation
    correlation = np.zeros(number_of_key_hypothesis)
    
    # Test every possible value for the specified key byte
    for key_guess in trange(...):
        hypothesis = calculate_hypothetical_sensitive_values(key_guess, textin, byte_number) # No need to look into this, you get it for free! 
        correlation[key_guess] = cpa(..., ...) # correlate power traces to hypothesis. Check the auxiliary functions for parameters.
        
    return correlation

## Exercise 3: Attack one byte
You've completed the function that calculates, for one specific byte, the correlation between the power traces and every possible key hypothesis. Now you can use it to attack a single byte of the password.

To do:
- Have a look at the function **correlate_traces_to_byte_hypotheses()** from Exercise 2. Do you recognise what is happening? Can you relate it to the cinnamon example?
- Then, calculate the correlations for byte 0 using the **correlate_traces_to_byte_hypotheses()** function. 
- Finally, determine from these correlations which of the hypotheses is most likely the correct key byte. (hint: look into the [argmax()](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) function)
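
As a quick illustration of the hint, here is a minimal sketch of `argmax()` on a made-up array. The name `toy_scores` and its values are hypothetical and unrelated to the real correlations.

```python
import numpy as np

# Hypothetical correlation scores for five candidate guesses (made-up numbers)
toy_scores = np.array([0.12, 0.08, 0.91, 0.15, 0.10])

# argmax returns the index of the largest value, not the value itself
best_index = np.argmax(toy_scores)
print(best_index)              # 2
print(toy_scores[best_index])  # 0.91
```

Because the correlation array in this notebook is indexed by key guess, the index of the largest value is itself the most likely key byte.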

In [None]:
## Exercise
byte_number = ...

correlations = ...

key_hypothesis_with_highest_corr = ...

print("The password hypothesis with the highest correlation for byte {} is:  {}".format(byte_number, decode_bytes_to_text([key_hypothesis_with_highest_corr])))

## Exercise 4: Recover the whole password
If you can do this attack for one password byte, then you can do it for all bytes of the password! 

To do:
- Fill in the missing variables. Do you remember how many bytes the password consists of?
- Calculate the correlations and the most likely password byte for each of the bytes and add them to the list.
- Which password did you find?
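
The cell below ends with a conversion to integers before decoding. The following sketch, with made-up values, suggests why: `np.zeros` creates floating-point numbers by default, while `chr()` only accepts integers. The name `toy_bytes` and the three character codes are assumptions for illustration.

```python
import numpy as np

# np.zeros creates floats by default
toy_bytes = np.zeros(3)
toy_bytes[0] = 72   # ASCII code for "H"
toy_bytes[1] = 105  # ASCII code for "i"
toy_bytes[2] = 33   # ASCII code for "!"
print(toy_bytes.dtype)  # float64

# Convert to integers first, then map each code to its character
toy_text = "".join(chr(b) for b in toy_bytes.astype(int))
print(toy_text)  # Hi!
```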

In [None]:
## Exercise:
number_of_bytes_to_attack = ...

# Initialise empty array for our key
password = np.zeros(number_of_bytes_to_attack)

for byte_number in ... :
    correlations = ...
    predicted_keybyte = ...
    password[byte_number] = predicted_keybyte


print(decode_bytes_to_text(password.astype(int)))


# Done!

You've reached the end of the notebook! Hopefully you were able to crack the password. During this whole process, you only had access to the power traces and the plaintexts/messages. As an attacker, you can almost always submit your own plaintexts for encryption with the system's built-in password (key).

If you want to know more details on this attack: it is called Correlation Power Analysis (CPA). For this attack, we assume the password is the same for every power trace; i.e. you collect multiple data points of exactly the same operation. This way, you can use correlation to distinguish the recurring patterns.

Because we use the power traces directly from our target device, it is called an unprofiled side-channel attack. We do not assume any prior knowledge of this device except which algorithm is running. Of course, you do need physical access to the device.
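
The whole idea can be replayed on simulated data. The sketch below is a toy under stated assumptions, not the notebook's dataset or solution: the substitution table is a random permutation standing in for the real S-box, each "trace" is a single noisy sample, and `secret_byte` is a made-up key.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in substitution table (assumption): a random permutation, NOT the AES S-box
toy_sbox = rng.permutation(256)
toy_hw = np.array([bin(n).count("1") for n in range(256)])

secret_byte = 0x4B     # made-up key byte used by the simulated device
num_toy_traces = 2000
toy_plaintexts = rng.integers(0, 256, size=num_toy_traces)

# Simulated one-sample "trace": Hamming weight leakage plus Gaussian noise
noise = rng.normal(0.0, 2.0, size=num_toy_traces)
toy_leakage = toy_hw[toy_sbox[toy_plaintexts ^ secret_byte]] + noise

# Correlate the leakage with the prediction for each of the 256 key guesses
toy_correlations = np.zeros(256)
for guess in range(256):
    prediction = toy_hw[toy_sbox[toy_plaintexts ^ guess]]
    toy_correlations[guess] = abs(np.corrcoef(prediction, toy_leakage)[0, 1])

# The guess with the highest absolute correlation is the recovered key byte
recovered = int(np.argmax(toy_correlations))
print("recovered:", hex(recovered), "secret:", hex(secret_byte))
```

The key stays fixed while the plaintexts vary, so only the correct guess predicts the leakage consistently across all simulated traces.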