# Hitag-2 Brute Force GPU version

Author: [Kaci Amaouche](mailto:amaouchekaci28@gmail.com)

In this [Jupyter](https://jupyter.org/) notebook, we present an implementation of the Hitag-2 brute-force attack on GPU using Numba CUDA in Python. If you are unfamiliar with Jupyter, you can take a quick look at the [Notebook Basics](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20Basics.html) guide (~5 min).

Here is an outline of the session:

* [Environment Setup](#1---Environment-Setup)
* [Implementation](#2---Implementation)

## 1 - Environment Setup

To run this lab on your own laptop you will need:
1. This notebook (the `.ipynb` file)
2. Python >= 3.8
3. The following packages installed: `numpy` and `numba`


You can install the above dependencies:
- [Using `pip`](#Installation-using-pip): `pip install numpy numba`
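
Once the packages are installed, a quick check like the one below can confirm that the environment matches the requirements listed above. This is only a sketch: the helper names `missing_packages` and `python_is_recent_enough` are hypothetical and are not part of the notebook.

```python
import importlib.util
import sys

# Package names and minimum version taken from the requirements list above.
REQUIRED_PACKAGES = ("numpy", "numba")
MINIMUM_PYTHON = (3, 8)


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that cannot be found in this environment (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_recent_enough(minimum=MINIMUM_PYTHON):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


print("Python OK:", python_is_recent_enough())
print("Missing packages:", missing_packages() or "none")
```

If the second line lists a package, install it before running the remaining cells.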

## 2 - Implementation

## GPU Execution

This notebook implements the same algorithm as the previous (CPU) version, but runs the key search on a GPU. The main advantages of GPU execution are:

- A significant increase in computing speed, thanks to the parallel computing power of GPUs.
- Better handling of compute-intensive tasks, such as matrix calculations and vector operations.
- Efficient use of the available GPU resources to accelerate computations.

To run this version on GPU, make sure you have access to an environment with a compatible GPU and the appropriate GPU drivers installed. You can also use cloud services that offer GPU instances to run this notebook at scale.

Please note that results and performance may vary depending on the hardware and software configuration of your system.

To run this notebook on a GPU, select a runtime environment with GPU acceleration enabled.
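
Before launching the search, it can help to confirm that Numba actually sees a GPU. The sketch below relies on `numba.cuda.is_available()`; the wrapper `gpu_available` is a hypothetical helper, written so that the check also runs (and returns `False`) when Numba is not installed.

```python
def gpu_available():
    """Return True if Numba can see a CUDA-capable GPU, False otherwise (hypothetical helper)."""
    try:
        # Imported lazily so that the check still works when Numba is missing.
        from numba import cuda
    except ImportError:
        return False
    return bool(cuda.is_available())


if gpu_available():
    print("CUDA GPU detected")
else:
    print("No CUDA GPU detected: select a GPU runtime before running the kernel")
```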


We begin by importing the necessary libraries.






In [None]:
import math
from typing import List

import numpy as np
from numba import cuda, njit

Now we can implement the non-linear filter function f, called `f20` in the code because it reads 20 bits of the state (five groups of four bits).





In [None]:
@njit(parallel=True)
def i4(x, a, b, c, d):
    """
    Computes a value by extracting specific bits from the input 'x' based on the given indices 'a', 'b', 'c', and 'd'.

    Parameters:
    x (int): The input value.
    a, b, c, d (int): Indices specifying the position of the bits to extract.

    Returns:
    int: The computed value based on the extracted bits.
    """
    return (((x >> a) & 1)*8)+((x >> b) & 1)*4+((x >> c) & 1)*2+((x >> d) & 1)

@njit(parallel=True)
def f20_4(state):
    """
    Computes the value of the fourth component of the 'f20' function based on the given 'state'.

    Parameters:
    state (int): The input state.

    Returns:
    int: The computed value of the fourth component of 'f20'.
    """
    return ((0x3c65 >> i4(state,34,43,44,46)) & 1)

@njit(parallel=True)
def f20_3(state):
    """
    Computes the value of the third component of the 'f20' function based on the given 'state'.

    Parameters:
    state (int): The input state.

    Returns:
    int: The computed value of the third component of 'f20'.
    """
    return (( 0xee5 >> i4(state,28,29,31,33)) & 1)

@njit(parallel=True)
def f20_2(state):
    """
    Computes the value of the second component of the 'f20' function based on the given 'state'.

    Parameters:
    state (int): The input state.

    Returns:
    int: The computed value of the second component of 'f20'.
    """
    return (( 0xee5 >> i4(state,17,21,23,26)) & 1)

@njit(parallel=True)
def f20_1(state):
    """
    Computes the value of the first component of the 'f20' function based on the given 'state'.

    Parameters:
    state (int): The input state.

    Returns:
    int: The computed value of the first component of 'f20'.
    """
    return (( 0xee5 >> i4(state, 8,12,14,15)) & 1)

@njit(parallel=True)
def f20_0(state):
    """
    Computes the value of the zeroth component of the 'f20' function based on the given 'state'.

    Parameters:
    state (int): The input state.

    Returns:
    int: The computed value of the zeroth component of 'f20'.
    """
    return ((0x3c65 >> i4(state, 2, 3, 5, 6)) & 1)

@njit(parallel=True)
def f20_last(s0,s1,s2,s3,s4):
    """
    Computes the last component of the 'f20' function based on the given five input components.

    Parameters:
    s0, s1, s2, s3, s4 (int): The five input components.

    Returns:
    int: The computed value of the last component of 'f20'.
    """
    return (0xdd3929b >> ((s0 * 16)
                        + (s1 *  8)
                        + (s2 *  4)
                        + (s3 *  2)
                        + (s4 *  1))) & 1

# The fc function
@njit(parallel=True)
def f20(state):
    """
    Computes the final value of the 'f20' function based on the given 'state'.

    Parameters:
    state (int): The input state.

    Returns:
    int: The computed value of the 'f20' function.
    """
    return f20_last(f20_0(state), f20_1(state), f20_2(state), f20_3(state), f20_4(state))
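
Each constant in the code above (`0x3c65`, `0xee5`, `0xdd3929b`) acts as a truth table stored in the bits of an integer: the selected state bits form an index, and the function value is the bit found at that index. The plain-Python sketch below illustrates this for `0x3c65`. It copies the logic of `i4` without the Numba decorator; the names `lookup` and `truth_table` are introduced here for illustration only.

```python
def i4(x, a, b, c, d):
    """Pack bits a, b, c, d of x into a 4-bit index (bit a is the most significant)."""
    return (((x >> a) & 1) << 3) | (((x >> b) & 1) << 2) | (((x >> c) & 1) << 1) | ((x >> d) & 1)


def lookup(table, index):
    """Read entry `index` of a truth table stored as the bits of an integer (hypothetical helper)."""
    return (table >> index) & 1


# The 16 entries of the table 0x3c65, listed from index 0 to index 15.
truth_table = [lookup(0x3C65, index) for index in range(16)]
print(truth_table)

# State with only bits 2 and 6 set: the selected bits (2, 3, 5, 6) read 1, 0, 0, 1.
state = (1 << 2) | (1 << 6)
index = i4(state, 2, 3, 5, 6)
print(index, lookup(0x3C65, index))
```

The final stage works the same way, with five input bits indexing a 32-entry table.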


Next comes the initialization phase. Note that its implementation remains identical to the previous version (except for a few additional parameters).






In [None]:
@njit(parallel=True)
def hitag2_init(key, uid, nonce):
    """
    Performs the initialization phase of the Hitag-2 algorithm.

    Parameters:
    key (int): The key value.
    uid (int): The unique identifier value.
    nonce (int): The nonce value.

    Returns:
    int: The resulting state after the initialization phase.
    """
    state = 0

    # Extract the key bits and append them to the state
    for i in range(32, 48):
        state = (state << 1) | ((key >> i) & 1)

    # Extract the UID bits and append them to the state
    for i in range(0, 32):
        state = (state << 1) | ((uid >> i) & 1)

    # Generate the state based on the nonce bits and key bits
    for i in range(0, 32):
        nonce_bit = (f20(state) ^ ((nonce >> (31-i)) & 1))
        state = (state >> 1) | (((nonce_bit ^ (key >> (31-i))) & 1) << 47)
    return state
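
To see how the first two loops lay out the 48-bit state, the sketch below reproduces them in plain Python. The helper name `load_initial_state` is hypothetical, and the third loop (which mixes in the nonce and the low key bits) is deliberately left out.

```python
def load_initial_state(key, uid):
    """Plain-Python copy of the first two loops of hitag2_init (hypothetical helper)."""
    state = 0
    # Key bits 32..47 are shifted in first, so they end up in the top 16 bits.
    for i in range(32, 48):
        state = (state << 1) | ((key >> i) & 1)
    # UID bits 0..31 follow, so they end up in the low 32 bits.
    for i in range(0, 32):
        state = (state << 1) | ((uid >> i) & 1)
    return state


# Key bit 32 is shifted in first and lands in state bit 47.
print(hex(load_initial_state(1 << 32, 0)))
# UID bit 0 lands in state bit 31, and UID bit 31 in state bit 0.
print(hex(load_initial_state(0, 1)))
print(hex(load_initial_state(0, 1 << 31)))
```

Because bits are appended on the right, both the key bits and the UID bits appear in the state in reversed order relative to their position in the input integers.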

Finally, we can implement the Hitag-2 algorithm that generates the keystream.






In [None]:
@njit(parallel=True)
def lfsr_feedback(state):
    """
    Computes the feedback bit for the LFSR (Linear Feedback Shift Register) based on the given state.

    Parameters:
    state (int): The current state of the LFSR.

    Returns:
    int: The computed feedback bit.
    """
    return (((state >>  0) ^ (state >>  2) ^ (state >>  3)
            ^ (state >>  6) ^ (state >>  7) ^ (state >>  8)
            ^ (state >> 16) ^ (state >> 22) ^ (state >> 23)
            ^ (state >> 26) ^ (state >> 30) ^ (state >> 41)
            ^ (state >> 42) ^ (state >> 43) ^ (state >> 46)
            ^ (state >> 47)) & 1)

@njit(parallel=True)
def lfsr(state):
    """
    Updates the state of the LFSR (Linear Feedback Shift Register) based on the given state.

    Parameters:
    state (int): The current state of the LFSR.

    Returns:
    int: The updated state of the LFSR.
    """
    return (state >>  1) + (lfsr_feedback(state) << 47)

@njit(parallel=True)
def hitag2(state):
    """
    Generates the first 32 bits of the keystream using the Hitag-2 algorithm.

    Parameters:
    state (int): The current state of the Hitag-2 algorithm.

    Returns:
    int: The first 32 bits of the keystream.
    """
    keystream = 0
    for _ in range(0, 32):
        # Append the output bit of 'f20' to the rightmost bit of the keystream
        keystream = (keystream << 1) | f20(state)

        # Update the state using the LFSR function
        state = lfsr(state)
    return keystream
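
The LFSR update can also be written with an explicit list of tap positions, which makes the feedback easier to read. The sketch below is a plain-Python equivalent of `lfsr_feedback` and `lfsr` combined; the names `TAPS` and `lfsr_step` are introduced here for illustration.

```python
# Tap positions copied from lfsr_feedback above.
TAPS = (0, 2, 3, 6, 7, 8, 16, 22, 23, 26, 30, 41, 42, 43, 46, 47)


def lfsr_step(state):
    """Shift the 48-bit state right by one and insert the feedback bit at position 47."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> tap) & 1
    return (state >> 1) | (feedback << 47)


# Clock the register a few times starting from a state with only bit 0 set.
state = 1
for step in range(4):
    state = lfsr_step(state)
    print(step, format(state, "048b"))
```

Since bit 47 of the shifted state is always zero, using `|` here gives the same result as the `+` used in `lfsr`.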

Now we implement the kernel and the function that executes it.






In [None]:
MAX_THREAD = 2**32  # number of candidate keys tested by one kernel launch
KEY_SIZE = 2**48    # size of the 48-bit key space

@cuda.jit
def kernel_hitag_48_bits_key(keystream1, keystream2, key_found, uid, iv1, iv2, progress, call_number):
    # Count the threads that have started (progress indicator)
    cuda.atomic.add(progress, 0, 1)

    # Candidate key = global thread index + number of keys covered by the previous launches
    tested_key = (cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x) + call_number[0] * MAX_THREAD

    # The key is found when the generated keystream matches the captured keystream for both IVs
    if hitag2(hitag2_init(tested_key, uid, iv1)) == keystream1:
        if hitag2(hitag2_init(tested_key, uid, iv2)) == keystream2:
            cuda.atomic.add(key_found, 0, tested_key)


            


def gpu_hitag_brute_force_48_bits_keys(initialization_vector: List[int], keystream: List[int], serial_number: int) -> int:
    """
    Searches the 48-bit key space on the GPU.

    Returns the recovered key, or 1 if no key was found.
    """
    # Data preprocessing
    np_initialization_vector = np.array(initialization_vector)
    np_keystream = np.array(keystream)
    np_serial_number = np.array([serial_number])

    # Managed memory, accessible from both host and device; zeroed before use
    d_key = cuda.managed_array(1, dtype=np.uint64, strides=None, order='C', stream=0, attach_global=True)
    progress = cuda.managed_array(1, dtype=np.uint64, strides=None, order='C', stream=0, attach_global=True)
    d_key[0] = 0
    progress[0] = 0

    # Copy the inputs to the device once; they do not change between launches
    d_keystream = cuda.to_device(np_keystream)
    d_serial_number = cuda.to_device(np_serial_number)
    d_initialization_vector = cuda.to_device(np_initialization_vector)

    # Compute the number of blocks, threads and kernel launches
    threads_per_block = 1024
    blocks_per_grid = MAX_THREAD // threads_per_block
    number_call_kernel = math.ceil(KEY_SIZE / MAX_THREAD)

    # One launch covers MAX_THREAD keys, so the kernel is launched as many times as needed to cover the key space.
    # For example: for 2**33 keys the kernel is launched twice, with 1024 threads per block and 4194304 blocks (1024 * 4194304 = 2**32)
    for call in range(number_call_kernel):
        d_kernel_call_number = cuda.to_device(np.array([call]))
        kernel_hitag_48_bits_key[blocks_per_grid, threads_per_block](d_keystream[0], d_keystream[1], d_key, d_serial_number[0],
                                                                     d_initialization_vector[0], d_initialization_vector[1], progress,
                                                                     d_kernel_call_number)
        cuda.synchronize()
        print(d_key[0])
        # Check the candidate on the host before returning it
        if hitag2(hitag2_init(d_key[0], serial_number, initialization_vector[1])) == keystream[1]:
            return d_key[0]

    return 1
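
The partitioning described in the kernel comments (thread index plus an offset per kernel launch) can be checked with plain integer arithmetic. The sketch below uses the same figures as the code above: 1024 threads per block, 2**32 keys per launch and a 2**48 key space. The function `candidate_key` and the constant names are hypothetical and only mirror that mapping on the host.

```python
THREADS_PER_BLOCK = 1024
KEYS_PER_LAUNCH = 2**32   # same value as MAX_THREAD
KEY_SPACE = 2**48         # same value as KEY_SIZE

blocks_per_grid = KEYS_PER_LAUNCH // THREADS_PER_BLOCK
launches = -(-KEY_SPACE // KEYS_PER_LAUNCH)  # ceiling division on integers


def candidate_key(block_idx, thread_idx, call):
    """Key assigned to one thread of one launch (hypothetical host-side mirror of the kernel)."""
    return block_idx * THREADS_PER_BLOCK + thread_idx + call * KEYS_PER_LAUNCH


print(blocks_per_grid, launches)
# First key of the second launch and last key of the last launch.
print(candidate_key(0, 0, 1))
print(candidate_key(blocks_per_grid - 1, THREADS_PER_BLOCK - 1, launches - 1))
```

With these figures, the last thread of the last launch tests key 2**48 - 1, so the launches cover the key space exactly once.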

We can test the validity of this algorithm as follows.






In [None]:
uid, key, iv1, iv2 = 2**32 - 45, 2**32 - 175, 2**32 - 47, 2**32 - 1452
ks1, ks2 = hitag2(hitag2_init(key, uid, iv1)), hitag2(hitag2_init(key, uid, iv2))
gpu_hitag_brute_force_48_bits_keys([iv1, iv2], [ks1, ks2], uid)

For this example, we used a key that fits in 32 bits, so it lies within the first 2**32 candidates. When testing on Google Colab (where performance depends on the GPU assigned to the session), we obtained a result in 6 seconds, compared to 6319542 seconds for a CPU implementation.
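
For a quick sanity check without a GPU, the same test can be run in plain Python over a small window of keys around the expected one. The sketch below assumes undecorated copies of the notebook functions, condensed for brevity. The function `search_window` is hypothetical and is far too slow for the full key space.

```python
TAPS = (0, 2, 3, 6, 7, 8, 16, 22, 23, 26, 30, 41, 42, 43, 46, 47)


def i4(x, a, b, c, d):
    return (((x >> a) & 1) << 3) | (((x >> b) & 1) << 2) | (((x >> c) & 1) << 1) | ((x >> d) & 1)


def f20(s):
    # Five 4-bit lookups feed one 5-bit lookup, as in the notebook's f20.
    index = ((((0x3C65 >> i4(s, 2, 3, 5, 6)) & 1) << 4)
             | (((0xEE5 >> i4(s, 8, 12, 14, 15)) & 1) << 3)
             | (((0xEE5 >> i4(s, 17, 21, 23, 26)) & 1) << 2)
             | (((0xEE5 >> i4(s, 28, 29, 31, 33)) & 1) << 1)
             | ((0x3C65 >> i4(s, 34, 43, 44, 46)) & 1))
    return (0xDD3929B >> index) & 1


def hitag2_init(key, uid, nonce):
    state = 0
    for i in range(32, 48):
        state = (state << 1) | ((key >> i) & 1)
    for i in range(32):
        state = (state << 1) | ((uid >> i) & 1)
    for i in range(32):
        bit = f20(state) ^ ((nonce >> (31 - i)) & 1) ^ ((key >> (31 - i)) & 1)
        state = (state >> 1) | (bit << 47)
    return state


def hitag2(state):
    keystream = 0
    for _ in range(32):
        keystream = (keystream << 1) | f20(state)
        feedback = 0
        for tap in TAPS:
            feedback ^= (state >> tap) & 1
        state = (state >> 1) | (feedback << 47)
    return keystream


def search_window(start, count, uid, ivs, keystreams):
    """Return every key in [start, start + count) that reproduces all captured keystreams."""
    return [k for k in range(start, start + count)
            if all(hitag2(hitag2_init(k, uid, iv)) == ks for iv, ks in zip(ivs, keystreams))]


# Same test values as in the cell above.
uid, key, iv1, iv2 = 2**32 - 45, 2**32 - 175, 2**32 - 47, 2**32 - 1452
ks1, ks2 = hitag2(hitag2_init(key, uid, iv1)), hitag2(hitag2_init(key, uid, iv2))
found = search_window(key - 100, 200, uid, [iv1, iv2], [ks1, ks2])
print(key in found)
```

Requiring a match on both keystreams is what makes a false positive unlikely, which is the same reason the kernel checks two initialization vectors.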