<a href="https://colab.research.google.com/github/Alessandro1999/FreeKeystrokeDynamics/blob/main/code/Contrastive_Learning_for_Keystroke_Dynamics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Contrastive learning for Keystroke Dynamics
by Alessandro Torri.


The purpose of this project is to train a PyTorch model with contrastive learning to learn a feature representation of free-text keystroke dynamics, and then to use this model to implement a verification system.
The idea of the project is strongly inspired by this [paper](https://arxiv.org/pdf/2004.03627.pdf), with the objective of improving on its results.

##0. Environment setup

In [None]:
#@title Downloading libraries
!pip -q install pytorch-lightning
!pip -q install wandb


In [None]:
#@title Import libraries
import os
import re
from google.colab import drive
import pandas as pd
from pathlib import Path
from typing import *
import csv
from tqdm import tqdm
from sklearn import model_selection
import math
import random
import torch
import pytorch_lightning as pl
from pytorch_lightning import callbacks
import wandb

seed = 17
pl.seed_everything(seed)

INFO:lightning_lite.utilities.seed:Global seed set to 17


17

##1. The dataset
The [dataset](https://userinterfaces.aalto.fi/136Mkeystrokes/) that will be used to train our model contains 15 sentences written by 168K different users with the modalities described in the following [paper](https://userinterfaces.aalto.fi/136Mkeystrokes/resources/chi-18-analysis.pdf).

The dataset consists of one tab-separated .csv file per user, in which every line is a single key event.
This representation wastes a lot of space and also makes it harder to extract a typed sequence from the dataframe.
For this reason, the first thing I did was to preprocess the dataset so that each line holds one whole sequence typed by the user.
This reformatting reduced the size of the dataset from 16GB to 3GB and makes it easier to use.

However, given the size of the dataset, formatting it in Python was very slow, so I did it in Julia; the code can be found in the following [notebook](https://colab.research.google.com/drive/1yrVAJRK3cgsgGTrXojRrs7v5Gfapffpf).

So, as a first step, we download and unzip the dataset from the public Drive link where I uploaded it.

In [None]:
#@title Download the dataset
!rm -rf sample_data
#!gdown "https://drive.google.com/uc?id=1yg8zcON6zyu90VCuEvVM8W9yg8Lk9olb"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1yg8zcON6zyu90VCuEvVM8W9yg8Lk9olb' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1yg8zcON6zyu90VCuEvVM8W9yg8Lk9olb" -O Keystrokes_processed.zip && rm -rf /tmp/cookies.txt
!unzip -q Keystrokes_processed.zip
!rm Keystrokes_processed.zip


### Data visualization
Now that we have the dataset, we can look at how it is structured. We have one file per user, and each file stores the timing information of the sentences that user typed.
Each user has a unique ID, and (as said above) each row of a user file is one typed sequence. The columns of the files are the following:

- PARTICIPANT_ID: The unique id of the user typing;
- TEST_SECTION_ID: The unique id of the sentence shown to the user;
- SENTENCE: Sentence shown to the user;
- USER_INPUT: Sentence typed by the user after pressing Enter or Next button;
- TIMINGS: An ordered list of 4-element tuples (key,keycode,dwell_time,waiting_time), where each tuple represents a key press/release event with its timing information.

In [None]:
column_names = ['PARTICIPANT_ID','TEST_SECTION_ID','SENTENCE','USER_INPUT','TIMINGS']

In [None]:
#@title Visualize data of a specific user {run:"auto"}
id: int = 306610 #@param{type:"integer"}
df: pd.DataFrame = None
try:
    df : pd.DataFrame = pd.read_csv(f"data/Keystrokes_processed/{id}_keystrokes.txt",
                                sep=",",
                                names = column_names,
                                header=None,
                                encoding = "ISO-8859-1",
                                )
except FileNotFoundError:
    print(f"There is no user with id = {id}")

df

Unnamed: 0,PARTICIPANT_ID,TEST_SECTION_ID,SENTENCE,USER_INPUT,TIMINGS
0,306610,3297345,I didn't hear from Ginger this week.,L didn't hear from Ginger this week.,"Any[(""SHIFT"", 16, 431, 0), (""76"", missing, 158..."
1,306610,3297487,We haven't made that decision here though.,We haven't made that decision here though.,"Any[(""SHIFT"", 16, 426, 0), (""87"", missing, 197..."
2,306610,3297580,The team mate says were not interested at this...,The team mate says were not interested at this...,"Any[(""SHIFT"", 16, 348, 0), (""84"", missing, 160..."
3,306610,3297600,But both reports were denied by the southern l...,But both reports were denied by the southern l...,"Any[(""SHIFT"", 16, 463, 0), (""66"", missing, 196..."
4,306610,3297318,Can you rough out a slide on rating agencies?,Can you rough out a slide on rating agencies?,"Any[(""SHIFT"", 16, 116, 0), (""SHIFT"", 16, 456, ..."
5,306610,3297516,Hours are listed in 24hr format central time.,Hours are listed in 24hr format central time.,"Any[(""SHIFT"", 16, 272, 0), (""72"", missing, 121..."
6,306610,3297458,The film will be shown for the first time.,The film will be shown for the first time.,"Any[(""SHIFT"", 16, 234, 0), (""84"", missing, 121..."
7,306610,3297637,Now go get washed up for dinner.,No go get washed up for dinner.,"Any[(""SHIFT"", 16, 195, 0), (""SHIFT"", 16, 437, ..."
8,306610,3297410,There are differing views among the dissident ...,There are differin views among the dissiedent ...,"Any[(""SHIFT"", 16, 319, 0), (""84"", missing, 159..."
9,306610,3297653,Juntao is in Washington this week meeting wit...,Jantao is in Washington this week meeting wit...,"Any[(""SHIFT"", 16, 311, 0), (""74"", missing, 159..."


##2. Data preprocessing

As we have seen from the dataframe visualization example before, the TIMINGS column looks a bit weird; this is because we preprocessed the dataset using Julia and we saved it into .csv files using the Julia CSV package. Basically, our TIMINGS column is a string that represents a Julia list of tuples.

Our goal in this section is to convert this string into an actual Python list of tuples, so that we can easily manipulate our data.


### Javascript keycode mapping
The dataset we are analyzing was collected with JavaScript, and its readme file says that for some events the authors were not able to capture the key that was pressed/released; they were, however, still able to record the JavaScript keycode of the key. So, to get the key from the keycode, one has to use the JavaScript keycodes.

Unfortunately, Python keycodes are different from the JavaScript ones, so to obtain a proper mapping we retrieve the keycode information from this nice [website](https://www.toptal.com/developers/keycode/table-of-all-keycodes) and store it in a dictionary.


In [None]:
#@title Obtain the Javascript keycode mappings
from bs4 import BeautifulSoup
import requests

# get the html of the page
url = "https://www.toptal.com/developers/keycode/table-of-all-keycodes"
page = requests.get(url)

js_code_to_key : Dict[int,str] = dict()
# reverse mapping
js_key_to_code : Dict[str,int] = dict()

# parse the html using beautiful soup
soup = BeautifulSoup(page.content, "html.parser")

# look at all the rows of the table
for row in soup.find(class_="table-body").find_all("tr"):
    # for each row, the first 2 columns contains keycode and key
    columns = row.find_all("td")
    keycode = int(columns[0].text)
    if keycode == 32: # the space is not shown on the website, we force it
        value = " "
    else:
        value = columns[1].text.lower()
    # save the mapping keycode:key
    js_code_to_key[keycode] = value
    js_key_to_code[value] = keycode

### Conversion of the TIMINGS column
As described in the introduction of this section, we want to convert the TIMINGS column of our dataset from a string to a Python list of tuples.

We will use regular expressions (regex) to find the tuples in the TIMINGS strings.
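
As a quick illustration of the idea, here is a minimal, self-contained sketch that reads the four fields through capturing groups instead of splitting the matched text afterwards. The pattern `TUPLE_RE`, the helper `parse_timings`, the tiny keycode dictionary and the sample string (modeled on the first row shown above) are assumptions made for this example; this is not the cell used in the rest of the notebook.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical pattern: one capturing group per field of (key, keycode, dt, wt).
TUPLE_RE = re.compile(
    r'\("((?:\\.|[^"\\])*)", (\d+|missing), (-?\d+(?:\.\d+)?), (-?\d+(?:\.\d+)?)\)'
)

def parse_timings(s: str, code_to_key: Dict[int, str]) -> List[Tuple[str, int, float, float]]:
    out = []
    for key, keycode, dt, wt in TUPLE_RE.findall(s):
        if keycode == "missing":
            # in this case the "key" field actually holds the JavaScript keycode
            keycode = key
            key = code_to_key[int(key)]
        out.append((key.lower(), int(keycode), float(dt), float(wt)))
    return out

# illustrative string in the Julia format shown in the dataframe above
sample = 'Any[("SHIFT", 16, 431, 0), ("76", missing, 158, -164)]'
parsed = parse_timings(sample, {76: "l"})
print(parsed)  # [('shift', 16, 431.0, 0.0), ('l', 76, 158.0, -164.0)]
```

Using groups avoids a second split on `", "`, which could misbehave if the key itself were a comma.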

In [None]:
#@title Regex and function to convert a TIMING string into a python list

# this set will contain all the characters used in typing (for the training set)
chars : Set[str] = set()

# the regex that identifies a tuple (key,keycode,dt,wt)
#regex = r"\(\"(\w+|\s|\W|\\\w)\", (\d+|missing), \d+, \-?\d+\)"
regex = r"\(\"(\w+|\s|\W|\\\w)\", (\d+|missing), \d+(\.\d+)?, \-?\d+(\.\d+)?\)"
# the actual function that makes the conversion
def string_to_timings(s : str) -> List[Tuple[str,int,float,float]]:
    '''
    Given a string representing the TIMINGS cell of a row in the dataframe,
    this function processes the string and formats it into a list of tuples,
    each of which has:
        - key, str representation of the key pressed;
        - keycode, JavaScript keycode of the key pressed;
        - dwell time, elapsed time between key press and key release;
        - waiting time, elapsed time between the previous key press and the current key release.
    '''
    out : List[Tuple[str,int,float,float]] = list()
    for match in re.finditer(regex, s, re.MULTILINE):
        key, keycode, dt, wt = match.group()[1:-1].split(", ")
        key = key[1:-1] # remove " at the beginning and " at the end
        if keycode == "missing":
            keycode = key
            key = js_code_to_key[int(key)]
        elif key == "\\b": #TODO (?)
            key = "backspace"
        chars.add(key.lower())
        out.append((key.lower(),int(keycode),float(dt),float(wt)))
    return out

def string_to_vocab(s : str) -> None:
    '''
    Given a string representing the TIMINGS cell of a row in the dataframe,
    this function adds all the typed characters to the global set "chars".
    '''
    for match in re.finditer(regex, s, re.MULTILINE):
        key, keycode, dt, wt = match.group()[1:-1].split(", ")
        key = key[1:-1] # remove " at the beginning and " at the end
        if keycode == "missing":
            keycode = key
            key = js_code_to_key[int(key)]
        elif key == "\\b": #TODO (?)
            key = "backspace"
        chars.add(key.lower())


def string_to_tensor(s : str, vocab : Dict[str,int]) -> torch.Tensor:
    out : List[Tuple[int,float,float]] = list()
    for match in re.finditer(regex, s, re.MULTILINE):
        key, keycode, dt, wt = match.group()[1:-1].split(", ")
        key = key[1:-1] # remove " at the beginning and " at the end
        if keycode == "missing":
            keycode = key
            key = js_code_to_key[int(key)]
        elif key == "\\b": #TODO (?)
            key = "backspace"
        out.append((vocab.get(key.lower(),vocab[unk_char]),float(dt),float(wt)))
    return torch.tensor(out)

In [None]:
#@title Example of a conversion

# convert the TIMINGS column
df.TIMINGS = df.TIMINGS.apply(string_to_timings)
# show the updated dataframe
df

Unnamed: 0,PARTICIPANT_ID,TEST_SECTION_ID,SENTENCE,USER_INPUT,TIMINGS
0,306610,3297345,I didn't hear from Ginger this week.,L didn't hear from Ginger this week.,"[(shift, 16, 431.0, 0.0), (l, 76, 158.0, -164...."
1,306610,3297487,We haven't made that decision here though.,We haven't made that decision here though.,"[(shift, 16, 426.0, 0.0), (w, 87, 197.0, -157...."
2,306610,3297580,The team mate says were not interested at this...,The team mate says were not interested at this...,"[(shift, 16, 348.0, 0.0), (t, 84, 160.0, -119...."
3,306610,3297600,But both reports were denied by the southern l...,But both reports were denied by the southern l...,"[(shift, 16, 463.0, 0.0), (b, 66, 196.0, 151.0..."
4,306610,3297318,Can you rough out a slide on rating agencies?,Can you rough out a slide on rating agencies?,"[(shift, 16, 116.0, 0.0), (shift, 16, 456.0, 6..."
5,306610,3297516,Hours are listed in 24hr format central time.,Hours are listed in 24hr format central time.,"[(shift, 16, 272.0, 0.0), (h, 72, 121.0, -80.0..."
6,306610,3297458,The film will be shown for the first time.,The film will be shown for the first time.,"[(shift, 16, 234.0, 0.0), (t, 84, 121.0, -43.0..."
7,306610,3297637,Now go get washed up for dinner.,No go get washed up for dinner.,"[(shift, 16, 195.0, 0.0), (shift, 16, 437.0, 2..."
8,306610,3297410,There are differing views among the dissident ...,There are differin views among the dissiedent ...,"[(shift, 16, 319.0, 0.0), (t, 84, 159.0, -119...."
9,306610,3297653,Juntao is in Washington this week meeting wit...,Jantao is in Washington this week meeting wit...,"[(shift, 16, 311.0, 0.0), (j, 74, 159.0, -119...."


##3. Training, Validation and Test sets
Following the paper that inspired this work (TypeNet: Scaling up Keystroke Biometrics), we will train our model using just 68K users. I have nevertheless exposed this number as a tunable parameter of the notebook, in case one wants to change it.

In [None]:
#@title Training dataset
n : int = 68000 #@param{type:"integer", min:"1", max:"168000"}
dataframes : List[pd.DataFrame] = list()
users : List[str] = os.listdir("data/Keystrokes_processed/")
added : int = 0
idx : int = 0
with tqdm(total=n) as pbar:
    while added < n:
        user = users[idx]
        if "keystrokes" in user: # it is a user file (and not readme)
            df : pd.DataFrame = pd.read_csv(f"data/Keystrokes_processed/{user}",
                                        sep=",",
                                        names = column_names,
                                        header=None,
                                        encoding = "ISO-8859-1",
                                        )
            dataframes.append(df)
            added += 1
            pbar.update(1)
        idx += 1

train_dataset : pd.DataFrame = pd.concat(dataframes)
# for training we are not going to need these columns
train_dataset = train_dataset.drop(['TEST_SECTION_ID','SENTENCE','USER_INPUT'],axis=1) 
del dataframes # let's free some RAM space

100%|██████████| 68000/68000 [03:05<00:00, 366.24it/s]


Now the dataset is restricted to the two columns we need for the siamese training, namely the user id and his/her timing information:

In [None]:
train_dataset.head()

Unnamed: 0,PARTICIPANT_ID,TIMINGS
0,448593,"Any[(""SHIFT"", 16, 538, 0), (""O"", 79, 99, -83),..."
1,448593,"Any[(""SHIFT"", 16, 264, 0), (""T"", 84, 97, -82),..."
2,448593,"Any[(""SHIFT"", 16, 257, 0), (""W"", 87, 88, -88),..."
3,448593,"Any[(""SHIFT"", 16, 191, 0), (""I"", 73, 87, -72),..."
4,448593,"Any[(""SHIFT"", 16, 227, 0), (""S"", 83, 98, -83),..."


In [None]:
#@title Train-Validation split
#@markdown perc is the percentage of users to use for validation
perc: float = 0.000397 #@param{type:"number"}
# all the users of our dataset
users: List[int] = sorted(set(train_dataset.PARTICIPANT_ID))

# the number of users we will keep in the training set
train_num = int(len(users)*(1-perc))

# sample the users at random
training_users: List[int] = random.sample(users, k=train_num)

# the boolean series for the rows of the training dataframe
train_series: pd.Series = train_dataset.PARTICIPANT_ID.isin(training_users)

# the validation set is taken from the users not kept in the training set
val_dataset: pd.DataFrame = train_dataset[~train_series]

# the remaining users will be in our training set
train_dataset = train_dataset[train_series]

# split directly on the dataframe (I preferred to split by user, to keep the genuine and impostor pairs of the training balanced)
#train_dataset, val_dataset = model_selection.train_test_split(train_dataset, test_size=perc)

After the split we can see how the data has been divided according to the given percentage and how the two dataframes have no rows in common:

In [None]:
train_dataset.shape, val_dataset.shape, pd.merge(train_dataset,val_dataset)

((1019595, 2), (405, 2), Empty DataFrame
 Columns: [PARTICIPANT_ID, TIMINGS]
 Index: [])

In [None]:
#@title Compute character vocabulary
chars : Set[str] = set()
train_dataset.TIMINGS.apply(string_to_vocab)
# compute vocabulary only for the training set
char_vocab : Dict[str,int] = { c:i+2 for i,c in enumerate(sorted(chars)) }
unk_char = "<UNK>"
pad_char = "<PAD>"
char_vocab[pad_char] = 0
char_vocab[unk_char] = 1

##4. Pytorch pipeline

Now we have everything we need to move to the PyTorch framework and start training our deep models.


###4.1 Pytorch Dataset and Dataloader

The first thing to do is to wrap our datasets into PyTorch Datasets:

####4.1.1. Siamese training
Recall that our contrastive training will be a siamese training, meaning that each time we give our model a couple $(x_1,x_2)$ and train it to distinguish whether $id(x_1) = id(x_2)$ or $id(x_1) \neq id(x_2)$ ($id$ is the function that returns the identity of the input sample).

This means that, if our dataset has $n$ samples, our siamese training will consist of $\binom{n}{2} = \frac{n(n-1)}{2}$ inputs to the model.

However, we do not really want to keep all the couples in memory; we simply want to assign an index to each of them, so that they are represented implicitly and can be generated "on the fly" whenever they are requested.

In the cell below I implemented the function that indexes all the possible couples of a generic list $[0,1,...,n-1]$ of $n$ elements.


In [None]:
#@title Couples indexer function
def index_2_combination(index : int, n: int, comb_num : int = None) -> Tuple[int,int]:
    '''
    Given the number of elements n (0,1,2,...,n-1) and an index, this function returns
    the index-th combination of length 2 without repetition.
    The combinations are enumerated in ascending (lexicographic) order.
    '''
    if comb_num is None: # if not given, compute the number of possible couples
        comb_num : int = n*(n-1)//2 # the number of combinations
    cur_index = index + 1
    assert(cur_index <= comb_num) # assert you do not go out of index
    first_e : int = 0 # the first element of the combination -> (0,?)
    # while you can subtract by (n-e) it means that the index refers to a number > e
    while cur_index - n + first_e + 1 > 0: 
        cur_index -= n - (first_e + 1)
        first_e += 1
    # the remainder gives us the second element
    return first_e, (first_e + cur_index)

In [None]:
# example
print(index_2_combination(0,5,10),index_2_combination(2,5,10), index_2_combination(3,5,10), index_2_combination(5,5,10))

(0, 1) (0, 3) (0, 4) (1, 3)
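
The loop in `index_2_combination` takes a number of steps proportional to the first element of the couple, which can add up when $n$ is around a million. As a possible alternative, here is a sketch of a closed-form version based on an integer square root, checked against `itertools.combinations` for small $n$. The name `pair_from_index` is hypothetical, and `math.isqrt` assumes Python 3.8 or later; this is not the function used in the rest of the notebook.

```python
import itertools
import math
from typing import Tuple

def pair_from_index(index: int, n: int) -> Tuple[int, int]:
    """Closed-form variant (hypothetical name): the index-th pair (i, j), i < j,
    of range(n) in lexicographic order, without looping over the rows."""
    total = n * (n - 1) // 2
    if not 0 <= index < total:
        raise IndexError("pair index out of range")
    r = total - 1 - index                   # position counted from the last pair
    t = (math.isqrt(8 * r + 1) - 1) // 2    # largest t with t*(t+1)/2 <= r
    i = n - 2 - t                           # first element of the pair
    offset = i * (2 * n - i - 1) // 2       # number of pairs whose first element is < i
    j = index - offset + i + 1              # second element of the pair
    return i, j

# compare with the reference enumeration for a few small sizes
for n in range(2, 9):
    expected = list(itertools.combinations(range(n), 2))
    assert [pair_from_index(k, n) for k in range(len(expected))] == expected

print(pair_from_index(0, 5), pair_from_index(2, 5), pair_from_index(3, 5), pair_from_index(5, 5))
# (0, 1) (0, 3) (0, 4) (1, 3)
```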


####4.1.2. First Dataset implementation
Let's now create a simple implementation of our siamese training with a PyTorch Dataset:

In [None]:
#@title First implementation of the Dataset
class SiameseDataset(torch.utils.data.Dataset):
    '''
    Dataset for the Siamese training of Keystroke Dynamics
    '''
    def __init__(self, df : pd.DataFrame, vocab : Dict[str,int]):
        # the identities of the users
        self.ground_truth : torch.Tensor = torch.tensor(list(df.PARTICIPANT_ID))
        self.tot = self.ground_truth.shape[0] # number of samples
        # we are doing siamese training, so the total n will be the possible couples
        self.n = (self.tot-1)*self.tot // 2 # number of combinations (integer division, no float rounding)
        # lengths and timings
        self.lengths : torch.Tensor = torch.zeros_like(self.ground_truth)
        strings = list(df.TIMINGS)
        tensors : List[torch.Tensor] = list()
        for i,string in tqdm(enumerate(strings),total = self.tot): # for every row of the dataset
            tensors.append(string_to_tensor(string,vocab)) # convert the timings into tensors
            self.lengths[i] = tensors[-1].shape[0] # get the length of the sequence
        # pad the smaller sequences with 0's
        self.timings : torch.Tensor = torch.nn.utils.rnn.pad_sequence(tensors,batch_first=True)
    
    def __len__(self) -> int:
        return self.n
    
    def __getitem__(self,idx) -> Dict[str,torch.Tensor]:
        # convert the i-th element in the 2 indexes of the couple
        idx1, idx2 = index_2_combination(idx,self.tot,self.n)
        # return the data
        return {
                "genuine" : (self.ground_truth[idx1] == self.ground_truth[idx2]).long(),
                "timings1": self.timings[idx1],
                "lengths1": self.lengths[idx1],
                "timings2": self.timings[idx2],
                "lengths2": self.lengths[idx2]
                }
    


This implementation works just fine, but there is a big issue: we explicitly store the whole dataset in memory and, given its size, this will definitely fill up the RAM and lead to an error.

If you want, you can try it (although I do not recommend it):

In [None]:
#train_siamese = SiameseDataset(train_dataset,char_vocab)
#val_siamese = SiameseDataset(val_dataset,char_vocab)

####4.1.3. Lazy implementation

The solution to this issue is to implement our Dataset lazily, meaning that we load the data into memory only when it is actually requested.

Now the dataset only keeps track of which files correspond to its data, and loads those files when the samples are requested.

So, this is the lazy implementation, in which I used the IterableDataset class instead of the classic Dataset class:

In [None]:
#@title Lazy dataset
class LazySiameseDataset(torch.utils.data.IterableDataset):
    def __init__(self, users_list : List[int], vocab : Dict[str,int]):
        self.vocab = vocab
        self.users_list = users_list
        users_num: int = len(users_list) # each user file has 15 samples
        self.samples_num: int = 15 * len(users_list)
        # number of possible couples
        self.comb_num: int = self.samples_num * (self.samples_num-1) // 2
    
    def __iter__(self) -> Tuple[int,int]:
        return map(self.load_couple,map(self.internal_index_2_combination, range(self.comb_num)))

    def internal_index_2_combination(self,idx : int) -> Tuple[int,int]:
        return index_2_combination(idx, self.samples_num, self.comb_num)

    def load_item(self, idx : int) -> Tuple[torch.Tensor,torch.Tensor,torch.Tensor]:
        file_idx : int = idx // 15
        row_idx : int= idx % 15
        user_id : int = self.users_list[file_idx]
        df : pd.DataFrame = pd.read_csv(f"data/Keystrokes_processed/{user_id}_keystrokes.txt",
                                sep=",",
                                names = column_names,
                                header=None,
                                encoding = "ISO-8859-1",
                                )
        row = df.iloc[row_idx]
        ground_truth : torch.Tensor = torch.tensor(row.PARTICIPANT_ID)
        timings : torch.Tensor = string_to_tensor(row.TIMINGS, self.vocab)
        length : torch.Tensor = torch.tensor(timings.shape[0])
        return ground_truth, timings, length
    
    def load_couple(self, indexes: Tuple[int,int]) -> Dict[str,torch.Tensor]:
        g1, t1, l1 = self.load_item(indexes[0])
        g2, t2, l2 = self.load_item(indexes[1])
        return {
            "genuine" : g1 == g2,
            "timings1": t1,
            "length1" : l1,
            "timings2": t2,
            "length2" : l2
            }

The only small issue with the lazy implementation is that, when the dataset is wrapped into a DataLoader, the data cannot be shuffled; shuffling is something we want for the training set, to avoid biasing the model with the order of the samples.

To fix this, the training set is wrapped in the [ShuffleDataset](https://discuss.pytorch.org/t/how-to-shuffle-an-iterable-dataset/64130/6) class, which does the trick by shuffling only within a given number of samples (the buffer size of this class):

In [None]:
#@title Shuffle Dataset wrapper
class ShuffleDataset(torch.utils.data.IterableDataset):
  def __init__(self, dataset : torch.utils.data.IterableDataset , buffer_size : int):
    super().__init__()
    self.dataset = dataset
    self.buffer_size = buffer_size

  def __iter__(self):
    shufbuf = []
    try:
      dataset_iter = iter(self.dataset)
      for i in range(self.buffer_size): # iterate over the dataset for buffer times
        shufbuf.append(next(dataset_iter)) # append the data to the shuffle buffer
    except StopIteration: # the dataset has fewer samples than the buffer size
      self.buffer_size = len(shufbuf)

    try:
      while True:
        try:
          item = next(dataset_iter) # take a new sample
          evict_idx = random.randint(0, self.buffer_size - 1) # take a random index
          yield shufbuf[evict_idx] # return a random element of the shuffle buffer
          shufbuf[evict_idx] = item # substitute the element with new one
        except StopIteration: # when the iterator ends
          break
      while len(shufbuf) > 0: # return the remaining element of the shuffle buffer
        yield shufbuf.pop()
    except GeneratorExit:
      pass
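
To see what a shuffle buffer does and does not guarantee, the same idea can be tried in plain Python, without PyTorch. The sketch below is a simplified stand-in (the helper name `buffered_shuffle` is hypothetical, it assumes `buffer_size >= 1`, and it drains the leftover buffer in random order rather than with `pop`): every item is returned exactly once, but an item can only be emitted once it has entered the buffer, so the shuffling is local rather than global.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def buffered_shuffle(items: Iterable[T], buffer_size: int, rng: random.Random) -> Iterator[T]:
    """Plain-Python sketch of a shuffle buffer (hypothetical helper, buffer_size >= 1)."""
    buf: List[T] = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)            # still filling the buffer
            continue
        slot = rng.randrange(len(buf))  # evict a random buffered element
        yield buf[slot]
        buf[slot] = item                # the new item takes its place
    rng.shuffle(buf)                    # drain what is left in random order
    yield from buf

out = list(buffered_shuffle(range(20), buffer_size=4, rng=random.Random(0)))
assert sorted(out) == list(range(20))                 # nothing lost or duplicated
assert all(x < pos + 4 for pos, x in enumerate(out))  # items only move locally
```

With a buffer as small as one batch, consecutive batches therefore still come from neighbouring regions of the stream.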

However, even though this solution works, it raises another issue: while before we had a memory problem, we now have a time problem.

Since our training set is composed of $68 \cdot 10^3$ users, and each user has 15 samples, we actually have $15 \cdot 68 \cdot 10^3 = 1.02 \cdot 10^6$ samples (about a million). If we now consider all the possible couples, our model would have to be trained on $\frac{1.02 \cdot 10^6 \cdot (1.02 \cdot 10^6 - 1)}{2} = 520{,}199{,}490{,}000 \approx 520 \cdot 10^9$ examples.
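
The count can be double-checked in a couple of lines. This is only a sanity check of the arithmetic above, with the round figure of 68,000 users and 15 sentences per user taken as assumptions (`math.comb` needs Python 3.8 or later):

```python
import math

users = 68_000            # users in the training pool (round figure)
sentences_per_user = 15   # sentences typed by each user
samples = users * sentences_per_user
pairs = math.comb(samples, 2)   # n choose 2 = n * (n - 1) / 2

print(f"{samples:,} samples -> {pairs:,} pairs")
# 1,020,000 samples -> 520,199,490,000 pairs
```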

Training a model on 520 billion examples is simply not feasible (at least here in Colab). We can solve this problem in 2 ways:

1. Reduce the number of users.
2. Do not consider all the possible couples; instead, for each sample of the dataset, pick a random "twin" to be trained with.

With the second solution, which is implemented in the class below, we decrease the number of training steps from 520 billion to about 1 million.

In [None]:
#@title Random Lazy dataset
class RandomLazySiameseDataset(torch.utils.data.Dataset):
    def __init__(self, users_list : List[int], vocab : Dict[str,int], dataset : pd.DataFrame, max_len : int = None):
        self.vocab = vocab
        self.users_list = users_list
        self.dataset = dataset
        users_num: int = len(users_list) # each user file has 15 samples
        self.samples_num: int = 15 * len(users_list)
        self.len = max_len if max_len is not None else self.samples_num
        # number of possible couples
        #self.comb_num: int = self.samples_num * (self.samples_num-1) // 2

    def __len__(self) -> int:
        return self.len

    def __getitem__(self, idx : int) -> Dict[str,torch.Tensor]:
        # whether to pick a genuine or a false twin
        pick_genuine = bool(random.randint(0,1)) 

        # we want a genuine couple
        if pick_genuine:
            sample_idx = idx % 15
            twin_idx = sample_idx
            # sample until you pick a different twin than itself
            while twin_idx == sample_idx:
                twin_idx = random.randint(0,14)
            final_idx = (idx - sample_idx) + twin_idx
        # we want a non-genuine couple
        else:
            # since we have only 14 out of 1020000 possibility of picking a genuine
            # we do not really care of not considering them in the sampling
            final_idx = idx
            while final_idx == idx:
                final_idx = random.randint(0,self.samples_num - 1)
        return self.load_couple((idx,final_idx))

    def load_item(self, idx : int) -> Tuple[torch.Tensor,torch.Tensor,torch.Tensor]:
        file_idx : int = idx // 15
        row_idx : int= idx % 15
        user_id : int = self.users_list[file_idx]
        #df : pd.DataFrame = pd.read_csv(f"data/Keystrokes_processed/{user_id}_keystrokes.txt",
        #                        sep=",",
        #                        names = column_names,
        #                        header=None,
        #                        encoding = "ISO-8859-1",
        #                        )
        df = self.dataset[self.dataset.PARTICIPANT_ID == user_id]
        row = df.iloc[row_idx]
        ground_truth : torch.Tensor = torch.tensor(row.PARTICIPANT_ID)
        timings : torch.Tensor = string_to_tensor(row.TIMINGS, self.vocab)
        length : torch.Tensor = torch.tensor(timings.shape[0])
        return ground_truth, timings, length
    
    def load_couple(self, indexes: Tuple[int,int]) -> Dict[str,torch.Tensor]:
        g1, t1, l1 = self.load_item(indexes[0])
        g2, t2, l2 = self.load_item(indexes[1])
        return {
            "genuine" : g1 == g2,
            "timings1": t1,
            "length1" : l1,
            "timings2": t2,
            "length2" : l2
            }
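
The index arithmetic behind the twin selection can be exercised on its own, without loading any data. The sketch below assumes, like the class above, that samples are laid out user by user with 15 rows each; the function name `pick_twin` is hypothetical, and instead of resampling until the twin differs from the sample it draws from one fewer position and skips over the sample's own index, which gives the same uniform choice without a loop.

```python
import random
from typing import Optional, Tuple

def pick_twin(idx: int, samples_num: int, per_user: int = 15,
              rng: Optional[random.Random] = None) -> Tuple[int, bool]:
    """Return (twin_idx, genuine_requested) for the sample at position idx.
    Samples are assumed to be stored user by user, per_user rows each."""
    rng = rng or random.Random()
    genuine_requested = rng.random() < 0.5
    if genuine_requested:
        first = idx - idx % per_user               # first row of the same user
        twin = first + rng.randrange(per_user - 1) # one of the other rows...
        if twin >= idx:
            twin += 1                              # ...skipping idx itself
    else:
        # any other sample; it may rarely belong to the same user, so the
        # final label should still come from comparing the user ids
        twin = rng.randrange(samples_num - 1)
        if twin >= idx:
            twin += 1
    return twin, genuine_requested

rng = random.Random(17)
twin, genuine_requested = pick_twin(idx=31, samples_num=45, rng=rng)
assert twin != 31 and 0 <= twin < 45
```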

####4.1.4. Final wrapping into DataLoaders
Now, we just have to create our datasets and wrap them into dataloaders.

In [None]:
batch_size : int = 512 #@param{type:"integer"}

In [None]:
#@title Dataset instantiation

train_batches_per_epoch : int = 150 #@param{type:"integer"}

# the list of the users for each dataset
train_users = sorted(set(train_dataset.PARTICIPANT_ID))
val_users = sorted(set(val_dataset.PARTICIPANT_ID))

# the training set will be shuffled
#train_siamese = ShuffleDataset(LazySiameseDataset(train_users,char_vocab), buffer_size = batch_size)
train_siamese = RandomLazySiameseDataset(train_users,char_vocab, train_dataset, max_len = train_batches_per_epoch * batch_size)
# for the validation it is not necessary
val_siamese = SiameseDataset(val_dataset,char_vocab)
#val_siamese = LazySiameseDataset(val_users,char_vocab)

100%|██████████| 405/405 [00:00<00:00, 7063.47it/s]


In order to batch the data into dataloaders, we have to pad the shorter sequences so that they have the same length as the longest one in the batch.

This is done in the so-called collate function:

In [None]:
#@title Collate function
def collate_fn(batch : List[Dict[str,torch.Tensor]]) -> Dict[str,torch.Tensor]:
    batch_out : Dict[str,torch.Tensor] = dict()
    timings1 : List[torch.Tensor] = list()
    timings2 : List[torch.Tensor] = list()
    genuine : torch.Tensor = torch.zeros(len(batch))
    lengths1 : torch.Tensor = torch.zeros_like(genuine)
    lengths2 : torch.Tensor = torch.zeros_like(genuine)
    for i,sample in enumerate(batch):
        timings1.append(sample["timings1"])
        timings2.append(sample["timings2"])
        lengths1[i] = sample['length1']
        lengths2[i] = sample['length2']
        genuine[i] = sample['genuine']
    
    batch_out["genuine"] = genuine
    batch_out["lengths1"] = lengths1
    batch_out["lengths2"] = lengths2
    batch_out["timings1"] = torch.nn.utils.rnn.pad_sequence(timings1, batch_first = True)
    batch_out["timings2"] = torch.nn.utils.rnn.pad_sequence(timings2, batch_first = True)

    return batch_out
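
As a quick illustration of what the padding does, here is a minimal sketch on toy data. The tensor sizes are made up for the example; only `torch.nn.utils.rnn.pad_sequence` is taken from the collate function above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# two toy sequences with 3 features per keystroke and different lengths
# (hypothetical sizes, chosen only for the illustration)
seq_a = torch.ones(4, 3)
seq_b = torch.ones(2, 3)

# pad with zeros up to the longest sequence, batch dimension first
padded = pad_sequence([seq_a, seq_b], batch_first=True)

# the original lengths must be kept separately, since padding hides them
lengths = torch.tensor([seq_a.shape[0], seq_b.shape[0]])

print(padded.shape)   # (batch, longest sequence, features)
print(padded[1, 2:])  # the padded steps of the shorter sequence
```

The padded steps carry no information, which is why the lengths are stored in the batch next to the padded tensors.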


In [None]:
#@title DataLoaders instantiation
#train_dl = torch.utils.data.DataLoader(train_siamese,batch_size = batch_size, collate_fn = collate_fn)
#train_dl = torch.utils.data.DataLoader(train_siamese,batch_size = batch_size, collate_fn = collate_fn, shuffle = True)
#val_dl = torch.utils.data.DataLoader(val_siamese,batch_size = batch_size)

###4.2. Deep learning model

In [None]:
#@title Contrastive loss
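def contrastive_loss(distances : torch.Tensor, genuine : torch.Tensor, alpha : float = 1.5) -> torch.Tensor:
    # genuine pairs (label 1) are pulled together: the loss grows with their distance
    genuine_loss = genuine * (distances**2) / 2
    # impostor pairs (label 0) are pushed apart until they are at least alpha away;
    # clamp works on any device, with no need for a separate zero tensor
    impostor_loss = (1 - genuine) * (torch.clamp(alpha - distances, min=0)**2) / 2
    return (impostor_loss + genuine_loss).mean()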
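
To make the behaviour of the loss concrete, the following sketch evaluates it by hand on four toy pairs. The function is restated in compact form so that the snippet runs on its own; the distances and labels are invented for the example.

```python
import torch

def contrastive_loss(distances, genuine, alpha=1.5):
    # same formula as above, restated so the snippet is self-contained
    genuine_loss = genuine * distances**2 / 2
    impostor_loss = (1 - genuine) * torch.clamp(alpha - distances, min=0)**2 / 2
    return (genuine_loss + impostor_loss).mean()

# toy values: two genuine pairs followed by two impostor pairs
distances = torch.tensor([0.5, 2.0, 0.5, 2.0])
genuine = torch.tensor([1.0, 1.0, 0.0, 0.0])

# per-pair contributions with alpha = 1.5:
#   genuine,  d = 0.5 -> 0.5**2 / 2         = 0.125
#   genuine,  d = 2.0 -> 2.0**2 / 2         = 2.0
#   impostor, d = 0.5 -> (1.5 - 0.5)**2 / 2 = 0.5
#   impostor, d = 2.0 -> already beyond the margin, contributes 0
loss = contrastive_loss(distances, genuine)
print(loss.item())  # mean of the four contributions: 2.625 / 4
```

Genuine pairs are always penalised in proportion to their squared distance, while impostor pairs stop contributing once they are farther apart than the margin `alpha`.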

Let's wrap our datasets into a PyTorch Lightning DataModule:

In [None]:
#@title Lightning Datamodule
class KeystrokeDataModule(pl.LightningDataModule):
    def __init__(self,
                 train_set : torch.utils.data.Dataset,
                 val_set : torch.utils.data.Dataset):
        super().__init__()
        self.train_set = train_set
        self.val_set = val_set
    
    def train_dataloader(self) -> torch.utils.data.DataLoader:
        return torch.utils.data.DataLoader(self.train_set,
                                           batch_size = batch_size,
                                           collate_fn = collate_fn, 
                                           shuffle = True)

    def val_dataloader(self) -> torch.utils.data.DataLoader:
        return torch.utils.data.DataLoader(self.val_set,
                                           batch_size = batch_size)

####4.2.1. LSTM

In [None]:
#@title LSTM Model
class KeystrokeLSTM(pl.LightningModule):
    def __init__(self,
                 embedding_dim: int,
                 time_dim: int,
                 hidden_size: int,
                 output_size : int,
                 alpha : float = 1.5,
                 lstm_layers: int = 1) -> None:
        super().__init__()

        # embedding layer
        self.key_emb = torch.nn.Embedding(num_embeddings=len(char_vocab),
                                          embedding_dim=embedding_dim,
                                          padding_idx=char_vocab[pad_char])

        # linear projection of the time features
        self.time_features = torch.nn.Linear(2, time_dim)

        # lstm (inputs are padded batches of shape (batch, sequence, features))
        self.lstm = torch.nn.LSTM(input_size=embedding_dim+time_dim,
                                  hidden_size=hidden_size,
                                  num_layers=lstm_layers,
                                  batch_first=True)
        # linear layer
        self.linear = torch.nn.Linear(in_features=hidden_size,
                                      out_features=output_size)
        # activation function
        self.activation = torch.nn.functional.relu

        # loss hyperparam
        self.alpha = alpha
   
        self.save_hyperparameters()

    def single_forward(self,
                       timings: torch.Tensor,
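                       lengths: torch.Tensor) -> torch.Tensor:
        
        batch_size = lengths.shape[0]

        # column 0 holds the key index, the other two columns the time features
        emb = self.key_emb(timings[:,:,0].long())
        timings = self.time_features(timings[:,:,1:])

        x = torch.concat((emb, timings), dim=-1)

        # keep, for every sequence, the LSTM output at its last non-padded step
        x = self.lstm(x)[0][torch.arange(batch_size), (lengths-1).long(), :]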

        x = self.linear(x)

        x = self.activation(x)

        return x

    def forward(self,
                timings1: torch.Tensor,
                lengths1: torch.Tensor,
                timings2: torch.Tensor,
                lengths2: torch.Tensor,
                genuine: torch.Tensor
                ) -> torch.Tensor:

        o1 = self.single_forward(timings1,lengths1)
        o2 = self.single_forward(timings2,lengths2)

        # the small epsilon keeps the square root differentiable when o1 == o2
        euclidean_distance = (((o1 - o2)**2).sum(dim = 1) + 1e-8) ** 0.5

        return contrastive_loss(euclidean_distance,genuine, alpha = self.alpha)

    def step(self, batch) -> torch.Tensor:
        loss : torch.Tensor = self(**batch)
        return loss

    def training_step(self, train_batch, batch_idx) -> torch.Tensor:
        return self.step(train_batch)

    def validation_step(self, val_batch, batch_idx) -> torch.Tensor:
        return self.step(val_batch)
    
    def log_metrics(self, loss: float, type: str):
        self.log(f'{type}_loss', loss)
        self.log(f'epoch', float(self.current_epoch))

    def training_epoch_end(self, outputs) -> None:
        loss = sum([x["loss"] for x in outputs]) / len(outputs)
        self.log_metrics(loss.item(), 'train')
        return super().training_epoch_end(outputs)

    def validation_epoch_end(self, outputs) -> None:
        loss = sum(outputs) / len(outputs)
        self.log_metrics(loss.item(), 'val')
        return super().validation_epoch_end(outputs)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)
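
The indexing used in `single_forward` selects, for each padded sequence, the LSTM output at its last real time step. The sketch below illustrates the idea on a toy LSTM; the sizes are hypothetical and it assumes inputs laid out as (batch, sequence, features), i.e. `batch_first=True`. It is not the model defined above.

```python
import torch

torch.manual_seed(0)

# toy sizes, chosen only for the illustration
batch, max_len, features, hidden = 2, 5, 3, 4
lstm = torch.nn.LSTM(input_size=features, hidden_size=hidden, batch_first=True)

x = torch.randn(batch, max_len, features)
lengths = torch.tensor([5, 3])
x[1, 3:] = 0  # the second sequence has 3 real steps followed by padding

with torch.no_grad():
    out, _ = lstm(x)                                 # (batch, max_len, hidden)
    last = out[torch.arange(batch), lengths - 1, :]  # (batch, hidden)
    # reference: run the second sequence alone, without its padding
    ref, _ = lstm(x[1:2, :3])

print(last.shape)
print(torch.allclose(last[1], ref[0, -1], atol=1e-5))
```

Because the LSTM processes each sequence left to right, the output at step `length - 1` does not depend on the padding that follows it, so it can serve as the representation of the whole sequence.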


In [None]:
datamodule = KeystrokeDataModule(train_siamese,val_siamese)
net = KeystrokeLSTM(embedding_dim=100, time_dim=100, hidden_size=100, output_size=50, alpha=1)

In [None]:
wandb.init(project='FreeKeystrokeDynamics', entity='ale99')
logger = pl.loggers.WandbLogger(name='long_run', project='FreeKeystrokeDynamics')

wandb.define_metric('epoch')
wandb.define_metric('train_loss',step_metric='epoch')
wandb.define_metric('val_loss',step_metric='epoch')


In [None]:
# Checkpoint to save the model with the lowest validation loss
from pytorch_lightning import callbacks
from google.colab import drive

drive.mount('/content/drive', force_remount=True)
%cd /content/drive/MyDrive/Keystroke

checkpoint = callbacks.ModelCheckpoint("checkpoints/",
                                       monitor="val_loss",
                                       mode="min")

Mounted at /content/drive
/content/drive/MyDrive/Keystroke


In [None]:
trainer = pl.Trainer(max_epochs=20,
                         accelerator='gpu',
                         devices=1,
                         logger=logger,
                         callbacks=[checkpoint])

trainer.fit(model=net, datamodule=datamodule)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name          | Type      | Params
--------------------------------------------
0 | key_emb       | Embedding | 13.3 K
1 | time_features | Linear    | 300   
2 | lstm          | LSTM      | 120 K 
3 | linear        | Linear    | 5.0 K 
--------------------------------------------
139 K     Trainable params
0         Non-trainable params
139 K     Total params
0.558     Total estimated model params size (MB)



### Qualitative Testing

In [None]:
users : List[str] = os.listdir("data/Keystrokes_processed/")

In [None]:
net = KeystrokeLSTM.load_from_checkpoint("checkpoints/(2)noDropout5epochsTrainedLSTM.ckpt")

RuntimeError: ignored