# KB-CAPTCHA Documentation

## Overview
This notebook documents the rationale and process behind the data analysis that informs the design of this project.
For instructions on running the solution, installation and clean-up, please see the README at the root of the project files.

## Hypothesis
Our hypothesis is that human users can be distinguished from a large class of bots by applying unsupervised machine learning techniques to five typing metrics: the average and standard deviation of key held time, the average and standard deviation of keystroke delay time, and the key overlap percentage.

## Data Set Description
We will source the data for this project from the "136M Keystrokes" data set available [here](https://userinterfaces.aalto.fi/136Mkeystrokes/). This data set contains the following fields:

* Participant id (Integer)
* Test section id (Integer)
* Sentence (String)
* User input (String)
* Keystroke id (Integer)
* Press time (Integer)
* Release time (Integer)
* Letter (String)
* Keycode (Integer)

To test our hypothesis, we need to extract the data from these fields and convert it into the following derived metrics for each participant:

* Average key held time
* Standard deviation of key held time
    * We can derive both of these fields by subtracting the "Press time" from the "Release time" of each entry and then taking the average / standard deviation of the results.
* Average time between keystrokes
* Standard deviation of time between strokes
    * Both of these fields can be derived by subtracting the "Press time" of the previous key from the "Press time" of the current key and taking the average / standard deviation of the results.
* Percentage of keys pressed while the previous key is still held down
    * We can determine if a given key is an "overlap" key by checking if its "Press time" is before the previous key's "Release time". Then, add up overlap keys and divide by the total.
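The derivations listed above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual cleaning code: the function name `typing_features`, the result keys and the four-row sample are assumptions made for the example, while the `PRESS_TIME` and `RELEASE_TIME` column names match the raw files used later in this notebook.

```python
import pandas as pd


def typing_features(keystrokes: pd.DataFrame) -> dict:
    """Compute the five per-user typing metrics from raw keystroke rows."""
    # Sort by press time so "previous key" means the key pressed just before.
    ks = keystrokes.sort_values("PRESS_TIME").reset_index(drop=True)

    # Key held time: release minus press, per keystroke.
    held = ks["RELEASE_TIME"] - ks["PRESS_TIME"]

    # Time between keystrokes: current press minus previous press.
    # The first row has no previous key, so its delay is NaN and is ignored.
    delay = ks["PRESS_TIME"].diff()

    # Overlap: the key was pressed before the previous key was released.
    # The first key has no predecessor, so it never counts as an overlap.
    overlap = ks["PRESS_TIME"] < ks["RELEASE_TIME"].shift(1)

    return {
        "held_mean": held.mean(),
        "held_std": held.std(),
        "delay_mean": delay.mean(),
        "delay_std": delay.std(),
        # Overlap keys divided by the total number of keys, as a percentage.
        "overlap_percent": 100 * overlap.mean(),
    }


# Hypothetical sample: four keystrokes, times in milliseconds.
sample = pd.DataFrame({
    "PRESS_TIME": [0, 100, 250, 400],
    "RELEASE_TIME": [120, 180, 330, 480],
})
features = typing_features(sample)
print(features)
```

In this sample only the second key is pressed while the first is still held, so one of the four keys counts as an overlap.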

## Data Set Exploration

To explore the viability of standard deviation as a measure of spread as well as identify outliers, we will visualize a random sample of the data and try to make observations and generalizations.

First, let's visualize the distributions for key held time.

### Key Held Time



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import random
import re
%matplotlib inline 

print("Generating sample charts. This can take a while.")

# The product of these two is the number of sample files that get plotted.
graph_row_count = 2
graph_col_count = 2
number_of_samples = graph_row_count * graph_col_count
graph_size_x = 10
graph_size_y = 8


directory = "../Data/raw/"
all_files = [f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))]

# Pick distinct files so that the same user is not plotted twice.
files_to_show = []
for file_name in random.sample(all_files, number_of_samples):
    file_to_add = os.path.join(directory, file_name)
    print(f"Selecting file {file_to_add}")
    files_to_show.append(file_to_add)


figure, subplot = plt.subplots(nrows=graph_row_count, ncols=graph_col_count, figsize=(graph_size_x, graph_size_y))

graph_row = 0
graph_col = 0

for file_to_show in files_to_show:
    data_frame = pd.read_csv(file_to_show, delimiter='\t')
    data_frame["TIME_HELD"] = data_frame["RELEASE_TIME"] - data_frame["PRESS_TIME"]
    
    time_held_std_dev = data_frame["TIME_HELD"].std()
    time_held_mean = data_frame["TIME_HELD"].mean()

    subplot[graph_row][graph_col].hist(data_frame['TIME_HELD'], bins=20, edgecolor='black')
    user_id = re.match(r".*/(\d+)_keystrokes.*", file_to_show).group(1)
    subplot[graph_row][graph_col].set_title(f'Key held duration distribution for user {user_id}')
    subplot[graph_row][graph_col].set_xlabel('Time held (ms)')
    subplot[graph_row][graph_col].set_ylabel('Frequency')

    subplot[graph_row][graph_col].axvline(time_held_mean, color='red', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Mean: {time_held_mean:.2f}', xy=(time_held_mean, 0.75), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black', 
                arrowprops=dict(arrowstyle='->', color='black'))

    subplot[graph_row][graph_col].axvline(time_held_mean + time_held_std_dev, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].axvline(time_held_mean - time_held_std_dev, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Std Dev: {time_held_std_dev:.2f}', xy=(time_held_mean + time_held_std_dev, 0.60), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black',
                arrowprops=dict(arrowstyle='->', color='black'))


    graph_row += 1

    if graph_row >= graph_row_count:
        graph_col += 1
        graph_row %= graph_row_count

plt.tight_layout()
plt.show()


Let's remove outliers. To do this, we will remove data points where the key is held for more than 500 milliseconds.

In [None]:
figure, subplot = plt.subplots(nrows=graph_row_count, ncols=graph_col_count, figsize=(graph_size_x, graph_size_y))

graph_row = 0
graph_col = 0

for file_to_show in files_to_show:
    data_frame = pd.read_csv(file_to_show, delimiter='\t')
    data_frame["TIME_HELD"] = data_frame["RELEASE_TIME"] - data_frame["PRESS_TIME"]
    data_frame = data_frame.query("TIME_HELD <= 500")
    time_held_std_dev = data_frame["TIME_HELD"].std()
    time_held_mean = data_frame["TIME_HELD"].mean()

    subplot[graph_row][graph_col].hist(data_frame['TIME_HELD'], bins=20, edgecolor='black')
    user_id = re.match(r".*/(\d+)_keystrokes.*", file_to_show).group(1)
    subplot[graph_row][graph_col].set_title(f'Key held duration distribution for user {user_id}')
    subplot[graph_row][graph_col].set_xlabel('Time held (ms)')
    subplot[graph_row][graph_col].set_ylabel('Frequency')

    subplot[graph_row][graph_col].axvline(time_held_mean, color='red', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Mean: {time_held_mean:.2f}', xy=(time_held_mean, 0.75), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black', 
                arrowprops=dict(arrowstyle='->', color='black'))

    subplot[graph_row][graph_col].axvline(time_held_mean + time_held_std_dev, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].axvline(time_held_mean - time_held_std_dev, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Std Dev: {time_held_std_dev:.2f}', xy=(time_held_mean + time_held_std_dev, 0.60), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black',
                arrowprops=dict(arrowstyle='->', color='black'))
    

    graph_row += 1

    if graph_row >= graph_row_count:
        graph_col += 1
        graph_row %= graph_row_count
        
plt.tight_layout()
plt.show()

This seems to produce much better results. However, we don't want to exclude much more than this, as a very slow typist may have data that is shifted to the right, toward the 500 millisecond cutoff.

Now, let's observe how the distributions for time between strokes look.

### Time Between Strokes

In [None]:
figure, subplot = plt.subplots(nrows=graph_row_count, ncols=graph_col_count, figsize=(graph_size_x, graph_size_y))

graph_row = 0
graph_col = 0

for file_to_show in files_to_show:
    data_frame = pd.read_csv(file_to_show, delimiter='\t')
    data_frame["KEY_DELAY"] = data_frame["PRESS_TIME"].shift(-1) - data_frame["PRESS_TIME"]
    
    key_delay_std_dev = data_frame["KEY_DELAY"].std()
    key_delay_mean = data_frame["KEY_DELAY"].mean()
    
    subplot[graph_row][graph_col].hist(data_frame['KEY_DELAY'], bins=20, edgecolor='black')
    user_id = re.match(r".*/(\d+)_keystrokes.*", file_to_show).group(1)
    subplot[graph_row][graph_col].set_title(f'Key delay distribution for user {user_id}')
    subplot[graph_row][graph_col].set_xlabel('Time between strokes (ms)')
    subplot[graph_row][graph_col].set_ylabel('Frequency')

    subplot[graph_row][graph_col].axvline(key_delay_mean, color='red', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Mean: {key_delay_mean:.2f}', xy=(key_delay_mean, 0.75), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black', 
                arrowprops=dict(arrowstyle='->', color='black'))

    subplot[graph_row][graph_col].axvline(key_delay_mean + key_delay_std_dev, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].axvline(key_delay_mean - key_delay_std_dev, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Std Dev: {key_delay_std_dev:.2f}', xy=(key_delay_mean + key_delay_std_dev, 0.60), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black',
                arrowprops=dict(arrowstyle='->', color='black'))
    
    graph_row += 1

    if graph_row >= graph_row_count:
        graph_col += 1
        graph_row %= graph_row_count

plt.tight_layout()
plt.show()

These are unfortunately not very informative due to the extreme outliers. Let's do the same visualization, but remove all entries with a delay of more than 1 second, along with any entries whose delay is not positive.


In [None]:
figure, subplot = plt.subplots(nrows=graph_row_count, ncols=graph_col_count, figsize=(graph_size_x, graph_size_y))

graph_row = 0
graph_col = 0

for file_to_show in files_to_show:
    data_frame = pd.read_csv(file_to_show, delimiter='\t')
    data_frame["KEY_DELAY"] = data_frame["PRESS_TIME"].shift(-1) - data_frame["PRESS_TIME"]
    data_frame = data_frame.query("KEY_DELAY <= 1000")
    data_frame = data_frame.query("KEY_DELAY > 0")
    
    key_delay_std_dev = data_frame["KEY_DELAY"].std()
    key_delay_mean = data_frame["KEY_DELAY"].mean()
    
    subplot[graph_row][graph_col].hist(data_frame['KEY_DELAY'], bins=20, edgecolor='black')
    user_id = re.match(r".*/(\d+)_keystrokes.*", file_to_show).group(1)
    subplot[graph_row][graph_col].set_title(f'Key delay distribution for user {user_id}')
    subplot[graph_row][graph_col].set_xlabel('Time between strokes (ms)')
    subplot[graph_row][graph_col].set_ylabel('Frequency')

    subplot[graph_row][graph_col].axvline(key_delay_mean, color='red', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Mean: {key_delay_mean:.2f}', xy=(key_delay_mean, 0.75), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black', 
                arrowprops=dict(arrowstyle='->', color='black'))

    subplot[graph_row][graph_col].axvline(key_delay_mean + key_delay_std_dev, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].axvline(key_delay_mean - key_delay_std_dev, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Std Dev: {key_delay_std_dev:.2f}', xy=(key_delay_mean + key_delay_std_dev, 0.60), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black',
                arrowprops=dict(arrowstyle='->', color='black'))
    
    graph_row += 1

    if graph_row >= graph_row_count:
        graph_col += 1
        graph_row %= graph_row_count
    
plt.tight_layout()
plt.show()


We can see that the data is roughly normal in shape, though it does have a non-negligible positive skew. To get a more representative description of the data, let's try using the median and interquartile range (IQR) instead.

In [None]:
figure, subplot = plt.subplots(nrows=graph_row_count, ncols=graph_col_count, figsize=(graph_size_x, graph_size_y))

graph_row = 0
graph_col = 0


for file_to_show in files_to_show:
    data_frame = pd.read_csv(file_to_show, delimiter='\t')
    data_frame["KEY_DELAY"] = data_frame["PRESS_TIME"].shift(-1) - data_frame["PRESS_TIME"]
    data_frame = data_frame.query("KEY_DELAY <= 1000")
    data_frame = data_frame.query("KEY_DELAY > 0")

    first_quartile = data_frame["KEY_DELAY"].quantile(0.25)
    third_quartile = data_frame["KEY_DELAY"].quantile(0.75)
    key_delay_median = data_frame["KEY_DELAY"].median()

    
    subplot[graph_row][graph_col].hist(data_frame['KEY_DELAY'], bins=20, edgecolor='black')
    user_id = re.match(r".*/(\d+)_keystrokes.*", file_to_show).group(1)
    subplot[graph_row][graph_col].set_title(f'Key delay distribution for user {user_id}')
    subplot[graph_row][graph_col].set_xlabel('Time between strokes (ms)')
    subplot[graph_row][graph_col].set_ylabel('Frequency')

    subplot[graph_row][graph_col].axvline(key_delay_median, color='red', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'Median: {key_delay_median:.2f}', xy=(key_delay_median, 0.75), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black', 
                arrowprops=dict(arrowstyle='->', color='black'))

    subplot[graph_row][graph_col].axvline(first_quartile, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].axvline(third_quartile, color='green', linestyle='solid', linewidth=2)
    subplot[graph_row][graph_col].annotate(f'IQR: {third_quartile - first_quartile:.2f}', xy=(third_quartile, 0.60), xycoords=('data', 'axes fraction'),
                xytext=(20, 20), textcoords='offset points', color='black',
                arrowprops=dict(arrowstyle='->', color='black'))

    graph_row += 1

    if graph_row >= graph_row_count:
        graph_col += 1
        graph_row %= graph_row_count

plt.tight_layout()
plt.show()

Based on these visualizations, median / IQR seems to be a much better measure of center and spread for key delay than mean / standard deviation.
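A small worked example shows why a positive skew favors median / IQR. The delay values below are made up for illustration and are not taken from the data set: seven quick strokes plus two long pauses that stay under the 1 second cutoff.

```python
import pandas as pd

# Hypothetical key delays in milliseconds (not real data).
delays = pd.Series([90, 100, 110, 120, 130, 140, 150, 600, 900])

mean = delays.mean()
std = delays.std()
median = delays.median()
iqr = delays.quantile(0.75) - delays.quantile(0.25)

print(f"mean={mean:.1f} ms, std dev={std:.1f} ms")
print(f"median={median:.1f} ms, IQR={iqr:.1f} ms")
```

Here the two pauses pull the mean to 260 ms, which is above seven of the nine values, while the median stays at 130 ms. The IQR of 40 ms reflects the spread of the typical strokes, whereas the standard deviation is dominated by the two pauses.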

## Hypothesis Acceptance Evaluation (Part one)
Our hypothesis was that we could use the mean and standard deviation of the typing metrics to differentiate bots from humans. Based on the visualizations above, we need to adjust this slightly. While the mean and standard deviation were effective measures of center and spread for the key held time, the median and IQR seem to describe the distribution of keystroke delay time more accurately.

We will move forward using median / IQR for key delay and mean / standard deviation for key held time.




## Data Cleaning and Preparation

From what we found above, mean and standard deviation were good measures of center and spread for key held time, but median and IQR were better for time between keystrokes. The next step is to parse the raw data and compute these values. In this project, this is done using the file(s) in DataCleaner/. A brief overview of the cleaning method follows.

1. Use the script in Data/ to download and extract the raw data.
2. Remove non-data files.
3. Parse each file and check for irregularity.
    * If a file is missing needed information, discard the sample.
    * If the only missing data is data we do not need, keep the sample but take extra care when parsing.
4. Perform the numeric analysis of center and spread for applicable values on a user basis.
5. Write the results to a new file.
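Steps 3 to 5 above can be sketched as follows. This is an illustrative outline under stated assumptions, not the code in DataCleaner/: the function name `summarize_user`, the `REQUIRED_COLUMNS` list, the rule for discarding a sample and the in-memory test files are all hypothetical. The cutoffs are the ones used in the exploration above, and the output columns follow the order in which the processed file is read later in this notebook.

```python
import io
from typing import Optional

import pandas as pd

# Assumption: these are the only columns the summary cannot do without.
REQUIRED_COLUMNS = ["PRESS_TIME", "RELEASE_TIME"]


def summarize_user(raw_file) -> Optional[dict]:
    """Return one row of summary statistics, or None if the sample is unusable."""
    try:
        frame = pd.read_csv(raw_file, delimiter="\t")
    except (pd.errors.ParserError, pd.errors.EmptyDataError):
        return None  # Irregular file: discard the sample.

    if any(column not in frame.columns for column in REQUIRED_COLUMNS):
        return None  # Needed information is missing: discard the sample.

    # Coerce timings to numbers and drop rows whose timings cannot be parsed.
    timings = frame[REQUIRED_COLUMNS].apply(pd.to_numeric, errors="coerce")
    timings = timings.dropna().sort_values("PRESS_TIME")
    if len(timings) < 2:
        return None

    held = timings["RELEASE_TIME"] - timings["PRESS_TIME"]
    delay = timings["PRESS_TIME"].diff().dropna()
    overlap = timings["PRESS_TIME"] < timings["RELEASE_TIME"].shift(1)

    # Outlier cutoffs from the exploration: 500 ms held, 1000 ms delay.
    held = held[held <= 500]
    delay = delay[(delay > 0) & (delay <= 1000)]

    return {
        "KEY_HELD_AVG": held.mean(),
        "KEY_HELD_STD_DEV": held.std(),
        "KEY_DELAY_MEDIAN": delay.median(),
        "KEY_DELAY_IQR": delay.quantile(0.75) - delay.quantile(0.25),
        "OVERLAP_PERCENT": 100 * overlap.mean(),
    }


# Hypothetical raw files: one usable, one missing RELEASE_TIME.
good = io.StringIO(
    "PRESS_TIME\tRELEASE_TIME\tLETTER\n0\t120\ta\n100\t180\tb\n250\t330\tc\n400\t480\td\n"
)
bad = io.StringIO("PRESS_TIME\tLETTER\n0\ta\n100\tb\n")

summaries = (summarize_user(raw_file) for raw_file in (good, bad))
rows = [row for row in summaries if row is not None]

# Write one header-less line per user, here to an in-memory buffer.
output = io.StringIO()
pd.DataFrame(rows).to_csv(output, header=False, index=False)
print(output.getvalue())
```

Returning `None` for unusable samples keeps the discard decision in one place, so the caller only has to filter the results before writing them out.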

Now that the data has been processed, we can look at the distribution for details we can use.

## Derived Data Analysis



In [None]:

data_frame = pd.read_csv("../Data/processed/result.csv", delimiter=',', names=["KEY_HELD_AVG","KEY_HELD_STD_DEV","KEY_DELAY_MEDIAN","KEY_DELAY_IQR","OVERLAP_PERCENT"])

plt.hist(data_frame['KEY_HELD_AVG'], bins=20, edgecolor='black')
plt.title(f'Mean Key Held Duration Distribution')
plt.xlabel('Average Time Held (ms)')
plt.ylabel('Frequency')

plt.show()

In [None]:
plt.hist(data_frame['KEY_HELD_STD_DEV'], bins=20, edgecolor='black')
plt.title('Key Held Duration Standard Deviation Distribution')
plt.xlabel('Standard Deviation of Time Held (ms)')
plt.ylabel('Frequency')

plt.show()

In [None]:
plt.hist(data_frame['KEY_DELAY_MEDIAN'], bins=20, edgecolor='black')
plt.title('Median Time Between Keystrokes Distribution')
plt.xlabel('Median Time Between Keystrokes (ms)')
plt.ylabel('Frequency')

plt.show()

In [None]:
plt.hist(data_frame['KEY_DELAY_IQR'], bins=50, edgecolor='black')
plt.title(f'Time Between Keystrokes IQR Distribution')
plt.xlabel('Time Between Keystrokes IQR (ms)')
plt.ylabel('Frequency')

plt.show()

In [None]:

plt.hist(data_frame['OVERLAP_PERCENT'], bins=50, edgecolor='black')
plt.title(f'Key Press Overlap Percentage Distribution')
plt.xlabel('Key Press Overlap Percentage (%)')
plt.ylabel('Frequency')

plt.show()

Unfortunately, it appears that many real users do not overlap their keys at all when they type, so a low overlap percentage alone does not indicate a bot.

In [None]:
data_frame.plot.scatter(x="KEY_DELAY_MEDIAN" ,y="OVERLAP_PERCENT")
plt.show()
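The observation about users who never overlap keys can also be quantified rather than read off the charts. The sketch below uses a small made-up table in place of the processed results file. The values are hypothetical, and only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical per-user summary rows standing in for the processed results.
summary = pd.DataFrame({
    "KEY_DELAY_MEDIAN": [110.0, 150.0, 210.0, 260.0, 320.0],
    "OVERLAP_PERCENT": [35.0, 12.0, 0.0, 4.0, 0.0],
})

# Boolean mask of users who never press a key while the previous one is held.
no_overlap = summary["OVERLAP_PERCENT"] == 0
share_without_overlap = 100 * no_overlap.mean()

# Compare the typical key delay of the two groups.
delay_without = summary.loc[no_overlap, "KEY_DELAY_MEDIAN"].median()
delay_with = summary.loc[~no_overlap, "KEY_DELAY_MEDIAN"].median()

print(f"{share_without_overlap:.1f}% of users never overlap keys")
print(f"median key delay: {delay_without:.1f} ms without overlap, {delay_with:.1f} ms with overlap")
```

Running the same calculation on the real `data_frame` would show how common non-overlapping typists are and whether they tend to be the slower typists.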



## Hypothesis Acceptance Evaluation (Part two)
