# Decision Tree Method

This notebook implements a decision tree classifier to study keystroke dynamics.

## List of imports

In [41]:
from sklearn import tree
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join

## Pre-processing

**Input:** one reference file per user, with the format ``{Key|Shift|HD_mean|HD_std|DL_mean|DL_std}`` \
**Output:** a single dataframe merging all reference files, used to train the decision tree classifier, with the format
``{Label|Key Numerical|Shift|HD_mean|HD_std|DL_mean|DL_std}``

1. Add a `name` attribute to each reference to identify the user
2. Merge all references into one pandas dataframe
3. Map the `Key` values to integers
4. Map the `name` values to integers
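
Steps 3 and 4 can be illustrated on a toy dataframe before running them on the real references. This is a minimal sketch: the rows and the `toy_` names are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy data standing in for the merged reference files (values are made up).
toy = pd.DataFrame({
    "Key": ["'b'", "'a'", "Key.space", "'a'"],
    "name": ["hugo", "andrieu", "hugo", "benjamin"],
})

# cat.categories holds the sorted unique values; number them from 0.
key_categories = toy["Key"].astype("category").cat.categories
toy_key_mapping = {key: code for code, key in enumerate(key_categories)}

# Translate each key string into its integer code.
toy["key_numerical"] = toy["Key"].map(toy_key_mapping)

# cat.codes yields the same integers without building the dictionary by hand,
# but the dictionary is still needed to encode other files consistently.
codes = toy["Key"].astype("category").cat.codes
```

Keeping the explicit dictionary matters here because the same mapping has to be reused later on the files to identify.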

In [2]:
# import reference into pandas dataframe

files = [f for f in listdir("reference_1/") if isfile(join("reference_1/", f))]

# concat reference files in one dataframe
dataframes_list = []
for file in files:
    df_file = pd.read_csv("reference_1/" + file, sep="|", encoding = "ISO-8859-1")
    # remove .csv suffix
    df_file["name"] = file[:-4]
    dataframes_list.append(df_file)
    
df_reference = pd.concat(dataframes_list)

# map 'key' to integers (add a column 'key_numerical')
key_map = dict(enumerate(df_reference['Key'].astype('category').cat.categories))
key_mapping = {v: k for k, v in key_map.items()}

df_reference['key_numerical'] = df_reference['Key'].map(key_mapping)

# map 'name' to integers (add a column 'label')
label_to_name_map = dict(enumerate(df_reference['name'].astype('category').cat.categories))
name_to_label_map = {v: k for k, v in label_to_name_map.items()}

df_reference['label'] = df_reference['name'].map(name_to_label_map)

# re-organize dataframe to '{Label|Key Numerical|Shift|HD_mean|HD_std|DL_mean|DL_std}'
df_reference = df_reference[['label', 'key_numerical', 'Shift', 'HD_mean', 'HD_std', 'DL_mean', 'DL_std']]



Here are the key and name mappings, followed by the first rows of the dataframe:

In [3]:
print("KEY MAPPING")
print(key_mapping)
print("LABEL MAPPING")
print(name_to_label_map)
print("DATAFRAME")
df_reference.head()

KEY MAPPING
{"'": 0, "'&'": 1, "'('": 2, "','": 3, "'-'": 4, "'.'": 5, "'?'": 6, "'B'": 7, "'C'": 8, "'D'": 9, "'H'": 10, "'J'": 11, "'K'": 12, "'M'": 13, "'P'": 14, "'S'": 15, "'W'": 16, "'X'": 17, "'Y'": 18, "'\\x01'": 19, "'a'": 20, "'b'": 21, "'c'": 22, "'d'": 23, "'e'": 24, "'f'": 25, "'g'": 26, "'h'": 27, "'i'": 28, "'j'": 29, "'k'": 30, "'l'": 31, "'m'": 32, "'n'": 33, "'o'": 34, "'p'": 35, "'q'": 36, "'r'": 37, "'s'": 38, "'t'": 39, "'u'": 40, "'v'": 41, "'w'": 42, "'x'": 43, "'y'": 44, "'z'": 45, "'è'": 46, "'é'": 47, 'Key.backspace': 48, 'Key.ctrl_l': 49, 'Key.enter': 50, 'Key.esc': 51, 'Key.right': 52, 'Key.shift': 53, 'Key.shift_r': 54, 'Key.space': 55}
LABEL MAPPING
{'andrieu': 0, 'benjamin': 1, 'hugo': 2}
DATAFRAME


Unnamed: 0,label,key_numerical,Shift,HD_mean,HD_std,DL_mean,DL_std
0,0,53,1,0.253048,0.18726,-0.093205,0.228329
1,0,14,1,0.0622,0.014784,0.1422,0.028833
2,0,34,0,0.072855,0.01036,0.201289,0.289603
3,0,37,0,0.06939,0.011595,0.109506,0.117702
4,0,39,0,0.064707,0.010277,0.05978,0.062745


Next, load all the files to be identified.

In [4]:
# import unknown into pandas dataframe

files = [f for f in listdir("identification/") if isfile(join("identification/", f))]
# load each unknown file into its own dataframe (they are not concatenated)
identification_list = []
for file in files:
    df_file = pd.read_csv("identification/" + file, sep="|", encoding = "ISO-8859-1")
    # remove .csv suffix
    df_file["name"] = file[:-4]
    identification_list.append(df_file)

Now we need to prepare each file we want to re-identify
(re-identification means identifying the user behind an unknown keylog):

1. Import data as dataframe
2. Convert the key to an integer (`key_numerical`)
    - if the key is not in the reference mapping, drop the row
    - otherwise, convert the key to its integer code
3. Reorganize columns
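
Step 2 can be sketched on toy data as follows. The mapping and the keylog below are hypothetical; the point is that `Series.map` with a dictionary returns `NaN` for keys missing from the dictionary, so unseen keys can be dropped afterwards, whereas indexing the dictionary directly would raise a `KeyError`.

```python
import pandas as pd

# Hypothetical mapping learned from the reference files.
toy_key_mapping = {"'a'": 0, "'b'": 1, "Key.space": 2}

# Hypothetical unknown keylog containing a key ("'z'") absent from the references.
unknown = pd.DataFrame({
    "Key": ["'a'", "'z'", "Key.space"],
    "HD_mean": [0.07, 0.09, 0.11],
})

# Keys missing from the dictionary become NaN instead of raising KeyError.
unknown["key_numerical"] = unknown["Key"].map(toy_key_mapping)

# Drop the rows for unseen keys, then restore the integer dtype lost to NaN.
known = unknown.dropna(subset=["key_numerical"]).copy()
known["key_numerical"] = known["key_numerical"].astype(int)
```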

In [5]:
to_identify = []
truths = []
for df_reidentification in identification_list:
    # use the reference key mapping to translate keys from strings to integers;
    # keys that never appear in the references are not in the mapping and become NaN
    df_reidentification['key_numerical'] = df_reidentification['Key'].map(key_mapping)
    # drop those rows: keys unseen in the references cannot help identification
    df_reidentification = df_reidentification.dropna()
    # NaN forces a float dtype; restore integer key codes
    df_reidentification = df_reidentification.astype({'key_numerical': int})

    # the user name comes from the file name, so it is the same on every row;
    # use iloc because the row labelled 0 may have been dropped
    truths.append(df_reidentification['name'].iloc[0])
    # re-organize dataframe to '{Key Numerical|Shift|HD_mean|HD_std|DL_mean|DL_std}'
    df_reidentification = df_reidentification[['key_numerical', 'Shift', 'HD_mean', 'HD_std', 'DL_mean', 'DL_std']]
    to_identify.append(df_reidentification)
truths

['benjamin', 'hugo']

Here's what a re-identification file looks like:

In [6]:
to_identify[0].head()

Unnamed: 0,key_numerical,Shift,HD_mean,HD_std,DL_mean,DL_std
0,53,1,0.260692,0.194857,-0.119282,0.216057
1,14,1,0.097,0.012617,0.1276,0.032463
2,34,0,0.079286,0.014217,0.119584,0.202539
3,37,0,0.082557,0.020573,0.069759,0.092685
4,39,0,0.081304,0.011329,0.05263,0.073434


## Decision Tree Implementation

In [7]:
# tree training
features = ["key_numerical", "Shift", "HD_mean", "HD_std", "DL_mean", "DL_std"]
target = "label"

Y_reference = df_reference[target]
X_reference = df_reference[features]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_reference, Y_reference)
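
To get a feel for what a fitted tree looks like, here is a minimal sketch on synthetic data. The two "users", their feature values, and the `toy_` names are assumptions made for illustration; only `DecisionTreeClassifier` and `export_text` come from scikit-learn.

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(0)

# Two synthetic users: their HD_mean values are well separated,
# while DL_mean is drawn from the same distribution for both.
toy_X = np.column_stack([
    np.concatenate([rng.normal(0.06, 0.005, 20), rng.normal(0.10, 0.005, 20)]),
    rng.normal(0.10, 0.02, 40),
])
toy_y = np.array([0] * 20 + [1] * 20)

# random_state makes tie-breaking between equally good splits reproducible.
toy_clf = tree.DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Text rendering of the learned split rules.
report = tree.export_text(toy_clf, feature_names=["HD_mean", "DL_mean"])
print(report)
```

With default settings the tree grows until every leaf is pure, so accuracy measured on the training data itself says little about how well the model generalizes.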

In [44]:
# Run prediction for all unknowns and compare to truths to get accuracy

predictions = []
for unknown in to_identify:
    # the tree returns class probabilities for each key entry;
    # average them over all entries to get the final prediction
    probs = clf.predict_proba(unknown)
    prob_df = pd.DataFrame(probs)
    prob_df = prob_df.rename(columns=label_to_name_map)
    # sum the per-entry probabilities for each label
    pred_df = prob_df.sum().to_frame('prob_sum')
    # normalize the sums; each row of probs sums to 1,
    # so this equals the mean probability per label
    pred_df['prob_mean'] = pred_df['prob_sum'] / pred_df['prob_sum'].sum()
    # find the label with the highest mean probability
    pred_label = pred_df['prob_mean'].idxmax()
    pred_probability = pred_df.loc[pred_label,'prob_mean']
    # log result
    print("Prediction: ",pred_label," with probability ",pred_probability)
    predictions.append(pred_label)

tree_acc = accuracy_score(truths, predictions)
print("---\nFinal accuracy for all predictions: ",tree_acc)
        

Prediction:  benjamin  with probability  1.0
Prediction:  hugo  with probability  1.0
---
Final accuracy for all predictions:  1.0
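
The aggregation used above can be checked on a small hand-made example. The probability matrix below is hypothetical, not taken from the notebook's data; it shows that dividing each column sum by the grand total gives the same result as averaging over rows, and that the predicted user is the label with the highest average.

```python
import numpy as np

# Hypothetical per-entry class probabilities: one row per key entry,
# one column per label, each row summing to 1.
probs = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])
label_names = ["andrieu", "benjamin", "hugo"]

# Mean probability per label over all key entries.
mean_probs = probs.mean(axis=0)
# Same quantity computed as column sums divided by the grand total.
column_share = probs.sum(axis=0) / probs.sum()

# The predicted user is the label with the highest mean probability.
pred_index = int(np.argmax(mean_probs))
pred_label = label_names[pred_index]
```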
