# Decision Tree Method

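This notebook implements a decision tree classifier to study keystroke dynamics: the tree is trained on per-user keystroke statistics and then used to predict which user produced a given file.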

## List of imports

In [1]:
from sklearn import tree
import pandas as pd
import numpy as np

## Pre-processing

**Input:** one reference file per user, in the format ``{Key|Shift|HD_mean|HD_std|DL_mean|DL_std}`` \
**Output:** a single dataframe combining all reference files, ready for the decision tree classifier, in the format 
``{Label|Key Numerical|Shift|HD_mean|HD_std|DL_mean|DL_std}``

1. We need to add an attribute `name` to the data to identify the user
2. We need to merge all references in one pandas dataframe
3. The `key` values must have an integer value
4. The `name` values must have an integer value

In [2]:
# import reference into pandas dataframe

files = ["benjamin", "hugo"]

# concat reference files in one dataframe
dataframes_list = []
for file in files:
    df_file = pd.read_csv("reference_1/" + file + ".csv", sep="|", encoding = "ISO-8859-1")
    df_file["name"] = file
    dataframes_list.append(df_file)
    
df_reference = pd.concat(dataframes_list)

# map 'key' to integers (add a column 'key_numerical')
key_map = dict(enumerate(df_reference['Key'].astype('category').cat.categories))
key_mapping = {v: k for k, v in key_map.items()}

df_reference['key_numerical'] = df_reference['Key'].map(key_mapping)

# map 'name' to integers (add a column 'label')
label_map = dict(enumerate(df_reference['name'].astype('category').cat.categories))
label_mapping = {v: k for k, v in label_map.items()}

df_reference['label'] = df_reference['name'].map(label_mapping)

# re-organize dataframe to '{Label|Key Numerical|Shift|HD_mean|HD_std|DL_mean|DL_std}'
df_reference = df_reference[['label', 'key_numerical', 'Shift', 'HD_mean', 'HD_std', 'DL_mean', 'DL_std']]
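
The two mapping steps above follow the same pattern: pandas sorts the distinct values into categories, and each value is replaced by its position in that sorted list. Here is a minimal sketch of that pattern on made-up keys (the toy dataframe `df` is an assumption, not data from the reference files). It also checks that `cat.codes` gives the same integers directly:

```python
import pandas as pd

# Toy stand-in for the merged reference dataframe (values are made up).
df = pd.DataFrame({"Key": ["'b'", "'a'", "Key.space", "'a'"]})

# Categories are the sorted distinct keys, so a key's integer is its position in that list.
key_cat = df["Key"].astype("category")
key_mapping = {key: code for code, key in enumerate(key_cat.cat.categories)}

df["key_numerical"] = df["Key"].map(key_mapping)

# cat.codes yields the same integers without building the dictionary.
same_as_codes = bool((df["key_numerical"] == key_cat.cat.codes).all())

print(key_mapping)
print(df)
```

Keeping the dictionary is still useful in this notebook, because the same mapping has to be reused later on the file to re-identify.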



Here are the key and name mappings, along with the first rows of the dataframe:

In [3]:
print("KEY MAPPING")
print(key_mapping)
print("LABEL MAPPING")
print(label_mapping)
print("DATAFRAME")
df_reference.head()

KEY MAPPING
{"'": 0, "'&'": 1, "'('": 2, "','": 3, "'-'": 4, "'.'": 5, "'?'": 6, "'B'": 7, "'C'": 8, "'D'": 9, "'H'": 10, "'J'": 11, "'K'": 12, "'M'": 13, "'P'": 14, "'S'": 15, "'W'": 16, "'X'": 17, "'Y'": 18, "'\\x01'": 19, "'a'": 20, "'b'": 21, "'c'": 22, "'d'": 23, "'e'": 24, "'f'": 25, "'g'": 26, "'h'": 27, "'i'": 28, "'j'": 29, "'k'": 30, "'l'": 31, "'m'": 32, "'n'": 33, "'o'": 34, "'p'": 35, "'q'": 36, "'r'": 37, "'s'": 38, "'t'": 39, "'u'": 40, "'v'": 41, "'w'": 42, "'x'": 43, "'y'": 44, "'z'": 45, "'è'": 46, "'é'": 47, 'Key.backspace': 48, 'Key.ctrl_l': 49, 'Key.enter': 50, 'Key.esc': 51, 'Key.right': 52, 'Key.shift': 53, 'Key.space': 54}
LABEL MAPPING
{'benjamin': 0, 'hugo': 1}
DATAFRAME


| | label | key_numerical | Shift | HD_mean | HD_std | DL_mean | DL_std |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 53 | 1 | 0.260692 | 0.194857 | -0.119282 | 0.216057 |
| 1 | 0 | 14 | 1 | 0.097 | 0.012617 | 0.1276 | 0.032463 |
| 2 | 0 | 34 | 0 | 0.079286 | 0.014217 | 0.119584 | 0.202539 |
| 3 | 0 | 37 | 0 | 0.082557 | 0.020573 | 0.069759 | 0.092685 |
| 4 | 0 | 39 | 0 | 0.081304 | 0.011329 | 0.05263 | 0.073434 |


Next, we prepare the file we want to re-identify so that it has the same feature columns as the reference data:

1. Import the data as a dataframe
2. Convert each key to an integer (`key_numerical`) using the reference key mapping
    - if the key is not in the mapping, drop the row
    - otherwise, replace the key with its integer
3. Reorganize the columns

In [5]:
# Set filename that we want to reidentify, should be in 'reference_1/' directory
reidentification_file = 'benjamin'

# import data into a dataframe
df_reidentification = pd.read_csv("reference_1/" + reidentification_file + ".csv", sep="|", encoding = "ISO-8859-1")

# translate keys to integers with the reference key mapping (unknown keys become NaN)
df_reidentification['key_numerical'] = df_reidentification['Key'].map(key_mapping)
# drop rows whose key is not in the mapping, then restore the integer dtype
df_reidentification = df_reidentification.dropna(subset=['key_numerical']).astype({'key_numerical': int})

# re-organize dataframe to '{Key Numerical|Shift|HD_mean|HD_std|DL_mean|DL_std}'
df_reidentification = df_reidentification[['key_numerical', 'Shift', 'HD_mean', 'HD_std', 'DL_mean', 'DL_std']]


Here's what the re-identification file looks like:

In [6]:
df_reidentification.head()

| | key_numerical | Shift | HD_mean | HD_std | DL_mean | DL_std |
|---|---|---|---|---|---|---|
| 0 | 53 | 1 | 0.260692 | 0.194857 | -0.119282 | 0.216057 |
| 1 | 14 | 1 | 0.097 | 0.012617 | 0.1276 | 0.032463 |
| 2 | 34 | 0 | 0.079286 | 0.014217 | 0.119584 | 0.202539 |
| 3 | 37 | 0 | 0.082557 | 0.020573 | 0.069759 | 0.092685 |
| 4 | 39 | 0 | 0.081304 | 0.011329 | 0.05263 | 0.073434 |


## Decision Tree Implementation

In [9]:
features = ["key_numerical", "Shift", "HD_mean", "HD_std", "DL_mean", "DL_std"]
target = "label"

Y_reference = df_reference[target]
X_reference = df_reference[features]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_reference, Y_reference)
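
Once the tree is fitted, it can help to look at which features it actually splits on. The sketch below uses scikit-learn's `tree.export_text` and `feature_importances_` on a handful of made-up rows (the values in `X` and `y` are assumptions, not taken from the reference files):

```python
from sklearn import tree

features = ["key_numerical", "Shift", "HD_mean", "HD_std", "DL_mean", "DL_std"]

# Made-up rows in the same column order as `features`.
X = [
    [34, 0, 0.08, 0.01, 0.12, 0.20],
    [37, 0, 0.08, 0.02, 0.07, 0.09],
    [34, 0, 0.15, 0.03, 0.25, 0.10],
    [37, 0, 0.16, 0.02, 0.30, 0.12],
]
y = [0, 0, 1, 1]  # hypothetical user labels

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Text rendering of the learned split rules.
rules = tree.export_text(clf, feature_names=features)
print(rules)

# Relative contribution of each feature to the splits (sums to 1 when the tree has a split).
importances = dict(zip(features, clf.feature_importances_))
print(importances)
```

Running the same two calls on the notebook's `clf` would show whether the tree relies on the timing statistics or mostly on `key_numerical`.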

In [11]:
clf.predict(df_reidentification)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0], dtype=int64)
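
`predict` returns one label per row, that is, one per key in the file. To name a single user for the whole file, one option is a majority vote over the rows. This is a minimal sketch, assuming a hypothetical prediction array and the label mapping printed earlier; the vote itself is an illustration and not part of the original notebook:

```python
import numpy as np

label_mapping = {"benjamin": 0, "hugo": 1}  # as printed earlier in the notebook
predictions = np.array([0, 0, 1, 0, 0])     # hypothetical clf.predict(...) output

# Invert the mapping so integer labels can be translated back to names.
inverse_labels = {code: name for name, code in label_mapping.items()}

# Count the votes per label; minlength keeps labels with zero votes in the tally.
votes = np.bincount(predictions, minlength=len(label_mapping))
winner = int(votes.argmax())
share = votes[winner] / votes.sum()

print(inverse_labels[winner], share)
```

The vote share gives a rough sense of how consistent the per-key predictions are across the file.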

In [14]:
clf.predict_proba(df_reidentification)

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])
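
Every row receives probability 1 for label 0. The file being re-identified (`benjamin`) is one of the two reference files the tree was trained on, and an unpruned decision tree normally reproduces its training labels exactly, so this result is expected and does not measure how well the model generalizes. Below is a minimal sketch of a held-out check, assuming synthetic data and scikit-learn's `train_test_split` (neither appears in the original notebook):

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for two users' timing statistics (made-up numbers).
X = np.vstack([
    rng.normal(loc=0.08, scale=0.03, size=(100, 2)),  # user 0
    rng.normal(loc=0.10, scale=0.03, size=(100, 2)),  # user 1
])
y = np.array([0] * 100 + [1] * 100)

# Hold out 30% of the rows; the tree never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

For the keystroke data, the equivalent check would be to re-identify a file recorded in a separate session rather than one of the reference files.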