<a href="https://colab.research.google.com/github/Alessandro1999/FreeKeystrokeDynamics/blob/main/Data_preprocessing_with_Julia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preprocessing with Julia
In this notebook, the data preprocessing is performed in Julia to speed up the process.
The final output of this stage is one .csv file per user, containing one row per typed sentence.

## Installing Julia

### <img src="https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png" height="100" /> _Colab Notebook Template_

## Instructions
1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account). Alternatively, you can download the notebook using _File_ > _Download .ipynb_, then upload it to [Colab](https://colab.research.google.com/).
2. If you need a GPU: _Runtime_ > _Change runtime type_ > _Hardware accelerator_ = _GPU_.
3. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia and other packages (if needed, update `JULIA_VERSION` and the other parameters). This takes a couple of minutes.
4. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the next section.

_Notes_:
* If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 2, 3 and 4.
* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Factory reset runtime_ and repeat steps 3 and 4.

In [None]:
%%shell

#---------------------------------------------------#
JULIA_VERSION="1.8.2" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools Plots"
JULIA_PACKAGES_IF_GPU="CUDA" # or CuArrays for older Julia versions
JULIA_NUM_THREADS=8
#---------------------------------------------------#

if [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  GPU_LIST=`nvidia-smi -L 2> /dev/null`
  if [ "$?" -eq "0" ]; then
    JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia  

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.8.2 on the current Colab Runtime...
2022-11-05 09:19:40 URL:https://storage.googleapis.com/julialang2/bin/linux/x64/1.8/julia-1.8.2-linux-x86_64.tar.gz [135859273/135859273] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
Installing Julia package BenchmarkTools...
Installing Julia package Plots...
Installing IJulia kernel...
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInstalling julia kernelspec in /root/.local/share/jupyter/kernels/julia-1.8

Successfully installed julia version 1.8.2!
Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then
jump to the 'Checking the Installation' section.




### Checking the Installation
The `versioninfo()` function should print your Julia version and some other info about the system:

In [None]:
versioninfo()

Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 8 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /usr/local/nvidia/lib:/usr/local/nvidia/lib64
  LD_PRELOAD = /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
  JULIA_NUM_THREADS = 8


## Get the dataset from Google Drive

Download the dataset

In [None]:
; gdown "https://drive.google.com/uc?id=11SjBTq8AdgFcirmnClYl97VAzj6YpM9D"

Downloading...
From: https://drive.google.com/uc?id=11SjBTq8AdgFcirmnClYl97VAzj6YpM9D
To: /content/Keystrokes.zip
100%|██████████████████████████████████████| 1.57G/1.57G [00:10<00:00, 152MB/s]


Unzip it

In [None]:
; unzip -q Keystrokes.zip

## Download and import packages

In [None]:
using Pkg
Pkg.add("Ranges")
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("ProgressMeter")
Pkg.add("PyCall")

using Ranges
using CSV
using DataFrames
using ProgressMeter
using PyCall

[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.ju

## Data preprocessing

In [None]:
#@title Visualize data of a specific user
id = 2200 #@param
rows_to_show = 21 #@param
df = DataFrame(CSV.File("Keystrokes/files/"*string(id)*"_keystrokes.txt",delim="\t"))
                                
first(df,rows_to_show)

| Row | PARTICIPANT_ID | TEST_SECTION_ID | SENTENCE | USER_INPUT | KEYSTROKE_ID | PRESS_TIME | RELEASE_TIME | LETTER | KEYCODE |
|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | String | String | Int64 | Int64 | Int64 | String15 | Int64 |
| 1 | 2200 | 22307 | What are the units? | What are the units? | 1063868 | 1471961564282 | 1471961564656 | SHIFT | 16 |
| 2 | 2200 | 22307 | What are the units? | What are the units? | 1063875 | 1471961564552 | 1471961564657 | W | 87 |
| 3 | 2200 | 22307 | What are the units? | What are the units? | 1063882 | 1471961564912 | 1471961565000 | h | 72 |
| 4 | 2200 | 22307 | What are the units? | What are the units? | 1063888 | 1471961565001 | 1471961565120 | a | 65 |
| 5 | 2200 | 22307 | What are the units? | What are the units? | 1063891 | 1471961565272 | 1471961565336 | t | 84 |
| 6 | 2200 | 22307 | What are the units? | What are the units? | 1063897 | 1471961565337 | 1471961565424 | | 32 |
| 7 | 2200 | 22307 | What are the units? | What are the units? | 1063931 | 1471961565480 | 1471961565584 | a | 65 |
| 8 | 2200 | 22307 | What are the units? | What are the units? | 1063937 | 1471961565656 | 1471961565720 | r | 82 |
| 9 | 2200 | 22307 | What are the units? | What are the units? | 1063943 | 1471961565696 | 1471961565808 | e | 69 |
| 10 | 2200 | 22307 | What are the units? | What are the units? | 1063951 | 1471961565809 | 1471961565872 | | 32 |


In [None]:
#@title Function to convert a single dataframe
function convert_user_df(df::DataFrame)
    # dwell time: how long each key is held down (this adds a column to df in place)
    df[!, "DWELL_TIME"] = df[!, "RELEASE_TIME"] - df[!, "PRESS_TIME"]
    sentences = Set(df.TEST_SECTION_ID) # one ID for each typed sentence
    data = Vector{Any}()
    for sentence_id in sentences
        df_s = filter(row -> row.TEST_SECTION_ID == sentence_id, df) # keystrokes of this sentence
        n_rows = nrow(df_s)
        timings = []
        row = Vector{Any}(df_s[1, [1,2,3,4]]) # PARTICIPANT_ID, TEST_SECTION_ID, SENTENCE, USER_INPUT
        for r in 1:n_rows
            l = df_s[r, "LETTER"]
            keycode = df_s[r, "KEYCODE"]
            pt = df_s[r, "PRESS_TIME"]
            wt = 0 # wait time: 0 for the first key of the sentence
            if r != 1
                previous_rt = df_s[r-1, "RELEASE_TIME"]
                wt = pt - previous_rt # negative if the key is pressed before the previous one is released
            end
            dt = df_s[r, "DWELL_TIME"]
            push!(timings, (l, keycode, dt, wt))
        end
        push!(row, timings)
        push!(data, row)
    end
    # one output row per sentence, with all its keystroke timings in the last column
    out_df = DataFrame(PARTICIPANT_ID = Int[], TEST_SECTION_ID = Int[], SENTENCE = String[], USER_INPUT = String[], TIMINGS = Vector{Any}())
    for row in data
        push!(out_df,row)
    end
    return out_df
end

convert_user_df (generic function with 1 method)
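
As a quick sanity check of the timing features, the snippet below recomputes by hand what `convert_user_df` stores for the first three keystrokes of user 2200 shown above (SHIFT, W, h). It is only a sketch on plain vectors: the variable names are illustrative and not part of the notebook's pipeline.

```julia
# Press and release timestamps of SHIFT, W, h, copied from the table of user 2200
press   = [1471961564282, 1471961564552, 1471961564912]
release = [1471961564656, 1471961564657, 1471961565000]

# Dwell time: how long each key is held down
dwell = release .- press

# Wait time: gap between the release of the previous key and the press of the current one
# (0 for the first key, as in convert_user_df)
wait_times = [0; press[2:end] .- release[1:end-1]]

println(dwell)       # [374, 105, 88]
println(wait_times)  # [0, -104, 255]
```

The negative wait time for W means it was pressed while SHIFT was still held down, which is why negative values show up in the `TIMINGS` column.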

Then, we will apply this function to all the users.

However, some files contain an extra newline (`\n`) that splits a record across two lines, which causes an error when we try to process them. That's why I've also written a Python function to fix these files.

In [None]:
#@title Function to fix the error in the dataset
py"""
def fix_file(src_path : str, trg_path : str = None):
    if trg_path is None: # by default, overwrite the source file
        trg_path = src_path
    lines = []
    before = None # last (possibly merged) line not yet stored in lines
    with open(src_path,"r",encoding="iso-8859-1") as r:
        for line in r: # read every line of the file
            if before is None: # previous row variable initialization
                before = line
            elif len(line) < 10: # the line is too short, there is probably a \n more
                before = before[:-1] + line # remove the \n from the previous line and append the current one
            else: # the previous line is complete: store it and move on
                lines.append(before)
                before = line
    if before is not None: # do not lose the last line of the file
        lines.append(before)
    with open(trg_path,"w") as w: # write the lines
        w.writelines(lines)
"""
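
The same merging idea can be sketched in Julia on an in-memory list of lines. This is not the notebook's `fix_file`: the function name `merge_broken_lines`, the `min_len` keyword and the toy rows are assumptions for illustration, and the lines are taken without their trailing newline (as returned by `readlines`).

```julia
# Hypothetical helper: glue every too-short line onto the previous one
function merge_broken_lines(lines::Vector{String}; min_len::Int = 10)
    fixed = String[]
    for line in lines
        if !isempty(fixed) && length(line) < min_len
            fixed[end] *= line   # continuation of a record split by a stray newline
        else
            push!(fixed, line)
        end
    end
    return fixed
end

# Toy file content: the last record is split across two lines
raw = [
    "PARTICIPANT_ID\tTEST_SECTION_ID\tLETTER\tKEYCODE",
    "107740\t1175419\tH\t72",
    "107740\t1175419\t",
    "\t13",
]
fixed = merge_broken_lines(raw)
println(length(fixed))  # 3
```

A length threshold is only a heuristic: it works when the second half of a split record is always shorter than any complete record.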

### Example of an error fixed

Let's take the file of user 107740:

In [None]:
u_107740 =  DataFrame(CSV.File("Keystrokes/files/107740_keystrokes.txt",delim="\t"))

└ @ CSV /root/.julia/packages/CSV/mgO6B/src/file.jl:579
└ @ CSV /root/.julia/packages/CSV/mgO6B/src/file.jl:579


| Row | PARTICIPANT_ID | TEST_SECTION_ID | SENTENCE | USER_INPUT | KEYSTROKE_ID | PRESS_TIME | RELEASE_TIME | LETTER | KEYCODE |
|---|---|---|---|---|---|---|---|---|---|
| | Int64? | Int64 | String? | String? | Int64? | Int64? | Int64? | String7? | Int64? |
| 1 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892037 | 1473359887796 | 1473359888251 | SHIFT | 16 |
| 2 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892042 | 1473359888100 | 1473359888259 | H | 72 |
| 3 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892048 | 1473359888351 | 1473359888490 | a | 65 |
| 4 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892051 | 1473359888554 | 1473359888733 | v | 86 |
| 5 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892053 | 1473359888726 | 1473359888825 | e | 69 |
| 6 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892056 | 1473359888889 | 1473359888996 | | 32 |
| 7 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892117 | 1473359889040 | 1473359889143 | a | 65 |
| 8 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892121 | 1473359889184 | 1473359889319 | | 32 |
| 9 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892128 | 1473359889654 | 1473359889741 | g | 71 |
| 10 | 107740 | 1175419 | Have a good weekend. | Have a good weekend. | 55892133 | 1473359889777 | 1473359889864 | o | 79 |


We were able to read it because CSV.jl is lenient and fills the fields it cannot find with `missing`, but as soon as we try to convert it into our format...

In [None]:
u_107740 = convert_user_df(u_107740)

┌ Error: Error adding value to column :PARTICIPANT_ID. Maybe it was forgotten to ask for column element type promotion, which can be done by passing the promote=true keyword argument.
└ @ DataFrames /root/.julia/packages/DataFrames/bza1S/src/dataframe/insertion.jl:688


LoadError: ignored
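
The failure can be reproduced in isolation. The sketch below assumes the cause is a `missing` value reaching a column declared as `Int[]`, which is what the error message suggests; the variable names are illustrative.

```julia
# A column read from a broken file may contain `missing` (element type Int64?)
participant_ids = Union{Missing, Int}[107740, missing, 107740]
println(any(ismissing, participant_ids))  # true

# The output column is declared as Int[], so it cannot hold `missing`
typed_column = Int[]
err = try
    push!(typed_column, missing)
    nothing
catch e
    e
end
println(err isa MethodError)  # true
```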

We get an error because the file effectively contains a line with no fields. The solution is to call the Python function written for this purpose, which repairs the file, and then retry:

In [None]:
py"fix_file"("Keystrokes/files/107740_keystrokes.txt")
u_107740 =  DataFrame(CSV.File("Keystrokes/files/107740_keystrokes.txt",delim="\t"))
u_107740 = convert_user_df(u_107740)

Row,PARTICIPANT_ID,TEST_SECTION_ID,SENTENCE,USER_INPUT,TIMINGS
,Int64,Int64,String,String,Any
1,107740,1175618,Jones executive vice president and chief operating officer.,Jones executive vice president and chief operating officer.,"Any[(String7(""SHIFT""), 16, 520, 0), (String7(""J""), 74, 105, -81), (String7(""o""), 79, 171, 108), (String7(""n""), 78, 138, 13), (String7(""e""), 69, 111, -6), (String7(""s""), 83, 87, 102), (String7("" ""), 32, 139, 65), (String7(""e""), 69, 100, 47), (String7(""x""), 88, 119, 468), (String7(""e""), 69, 115, 122) … (String7(""g""), 71, 108, -7), (String7("" ""), 32, 148, 19), (String7(""o""), 79, 140, 203), (String7(""f""), 70, 136, 4), (String7(""f""), 70, 76, 103), (String7(""i""), 73, 152, 56), (String7(""c""), 67, 148, 42), (String7(""e""), 69, 239, 79), (String7(""r""), 82, 144, -75), (String7("".""), 190, 132, 519)]"
2,107740,1175514,I think those are the right dates.,I think those are the right dates.,"Any[(String7(""SHIFT""), 16, 788, 0), (String7(""I""), 73, 84, -88), (String7("" ""), 32, 107, 135), (String7(""t""), 84, 91, 39), (String7(""h""), 72, 87, -23), (String7(""i""), 73, 115, 156), (String7(""n""), 78, 119, 44), (String7(""k""), 75, 103, 38), (String7("" ""), 32, 132, 76), (String7(""t""), 84, 97, -8) … (String7(""g""), 71, 108, 68), (String7(""h""), 72, 139, 82), (String7(""t""), 84, 107, 49), (String7("" ""), 32, 172, 81), (String7(""d""), 68, 164, 227), (String7(""a""), 65, 164, 75), (String7(""t""), 84, 135, 106), (String7(""e""), 69, 76, 4), (String7(""s""), 83, 128, 176), (String7("".""), 190, 76, 219)]"
3,107740,1175648,Don't forget the wood.,Don't forget the wood.,"Any[(String7(""SHIFT""), 16, 300, 0), (String7(""D""), 68, 131, -103), (String7(""o""), 79, 143, 81), (String7(""n""), 78, 156, 13), (String7(""t""), 84, 103, 1), (String7("";""), 186, 199, 164), (String7(""BKSP""), 8, 139, 324), (String7(""BKSP""), 8, 97, 335), (String7(""'""), 222, 108, 444), (String7(""t""), 84, 103, -8) … (String7("" ""), 32, 103, 43), (String7(""t""), 84, 99, 17), (String7(""h""), 72, 111, 9), (String7(""e""), 69, 111, -27), (String7("" ""), 32, 111, 26), (String7(""w""), 87, 35, 89), (String7(""o""), 79, 91, 152), (String7(""o""), 79, 91, 100), (String7(""d""), 68, 139, 193), (String7("".""), 190, 103, 240)]"
4,107740,1175501,We are all fragile.,We are all fragile.,"Any[(String7(""SHIFT""), 16, 400, 0), (String7(""W""), 87, 128, -72), (String7(""e""), 69, 99, 71), (String7("" ""), 32, 131, 93), (String7(""a""), 65, 123, 166), (String7(""r""), 82, 91, 100), (String7(""e""), 69, 100, 400), (String7("" ""), 32, 123, 53), (String7(""a""), 65, 59, 23), (String7(""l""), 76, 90, 114), (String7(""l""), 76, 81, 93), (String7("" ""), 32, 111, 115), (String7(""f""), 70, 107, 430), (String7(""r""), 82, 95, 68), (String7(""a""), 65, 199, 227), (String7(""g""), 71, 106, 149), (String7(""i""), 73, 91, 252), (String7(""l""), 76, 167, 358), (String7(""e""), 69, 119, -43), (String7("".""), 190, 83, 369)]"
5,107740,1175419,Have a good weekend.,Have a good weekend.,"Any[(String7(""SHIFT""), 16, 455, 0), (String7(""H""), 72, 159, -151), (String7(""a""), 65, 139, 92), (String7(""v""), 86, 179, 64), (String7(""e""), 69, 99, -7), (String7("" ""), 32, 107, 64), (String7(""a""), 65, 103, 44), (String7("" ""), 32, 135, 41), (String7(""g""), 71, 87, 335), (String7(""o""), 79, 87, 36) … (String7(""w""), 87, 127, 32), (String7(""e""), 69, 95, 99), (String7(""e""), 69, 118, 214), (String7(""k""), 75, 155, 75), (String7(""e""), 69, 115, -7), (String7(""n""), 78, 131, 106), (String7(""d""), 68, 103, 29), (String7("",""), 188, 151, 56), (String7(""BKSP""), 8, 75, 346), (String7("".""), 190, 159, 112)]"
6,107740,1175481,Hopefully this can wait until Monday.,Hopefully this can wait until Monday.,"Any[(String7(""SHIFT""), 16, 392, 0), (String7(""H""), 72, 115, -107), (String7(""o""), 79, 131, 76), (String7(""p""), 80, 143, 181), (String7(""e""), 69, 119, 270), (String7(""f""), 70, 111, 847), (String7(""u""), 85, 103, 24), (String7(""l""), 76, 132, 255), (String7(""y""), 89, 123, 118), (String7("" ""), 32, 147, 45) … (String7(""l""), 76, 147, 106), (String7("" ""), 32, 127, 47), (String7(""SHIFT""), 16, 212, 267), (String7(""M""), 77, 107, -79), (String7(""o""), 79, 107, 505), (String7(""n""), 78, 127, 69), (String7(""d""), 68, 131, 74), (String7(""a""), 65, 123, 150), (String7(""y""), 89, 91, 13), (String7("".""), 190, 131, 557)]"
7,107740,1175526,This time I'm more comfortable and aware of a lot more situations.,This time I'm more comfortable and aware of a lot more situations.,"Any[(String7(""SHIFT""), 16, 467, 0), (String7(""T""), 84, 112, -119), (String7(""h""), 72, 104, 132), (String7(""s""), 83, 112, 352), (String7(""BKSP""), 8, 99, 249), (String7(""i""), 73, 103, 156), (String7(""s""), 83, 104, 9), (String7("" ""), 32, 131, 23), (String7(""t""), 84, 119, 101), (String7(""i""), 73, 120, -7) … (String7(""t""), 84, 87, 56), (String7(""u""), 85, 135, 438), (String7(""a""), 65, 87, 47), (String7(""t""), 84, 99, 193), (String7(""i""), 73, 106, 289), (String7(""o""), 79, 178, 608), (String7(""n""), 78, 175, -31), (String7(""s""), 83, 163, 9), (String7("".""), 190, 165, 433), (String7("" ""), 32, 91, 50)]"
8,107740,1175603,It is not surprising.,It is not surprising.,"Any[(String7(""SHIFT""), 16, 282, 0), (String7(""I""), 73, 58, -57), (String7(""t""), 84, 115, 202), (String7("" ""), 32, 109, 67), (String7(""i""), 73, 146, 48), (String7(""s""), 83, 111, -11), (String7("" ""), 32, 83, 9), (String7(""n""), 78, 143, 367), (String7(""o""), 79, 155, 13), (String7(""t""), 84, 107, 17) … (String7(""o""), 79, 56, 59), (String7(""BKSP""), 8, 66, 386), (String7(""p""), 80, 147, 836), (String7(""r""), 82, 115, -19), (String7(""i""), 73, 98, 449), (String7(""s""), 83, 107, 40), (String7(""i""), 73, 95, 17), (String7(""n""), 78, 151, 70), (String7(""g""), 71, 139, -63), (String7("".""), 190, 151, 651)]"
9,107740,1175582,He doesn't want to give the trading positions.,He doesn't want to give the trading positions.,"Any[(String7(""SHIFT""), 16, 508, 0), (String7(""H""), 72, 79, -47), (String7(""e""), 69, 99, 75), (String7("" ""), 32, 143, 106), (String7(""d""), 68, 159, 201), (String7(""o""), 79, 135, 18), (String7(""e""), 69, 91, 17), (String7(""s""), 83, 87, 319), (String7(""n""), 78, 59, 377), (String7(""'""), 222, 70, 516) … (String7(""p""), 80, 243, 165), (String7(""o""), 79, 106, -14), (String7(""s""), 83, 147, 82), (String7(""i""), 73, 179, 512), (String7(""t""), 84, 111, 22), (String7(""i""), 73, 399, 65), (String7(""o""), 79, 167, -147), (String7(""n""), 78, 193, 17), (String7(""s""), 83, 135, -25), (String7("".""), 190, 107, 665)]"
10,107740,1175685,Hope you guys are doing fine.,Hope you guys are doing fine.,"Any[(String7(""SHIFT""), 16, 518, 0), (String7(""H""), 72, 84, -65), (String7(""o""), 79, 167, 129), (String7(""p""), 80, 160, 140), (String7(""e""), 69, 99, 13), (String7("" ""), 32, 139, 27), (String7(""o""), 79, 99, 292), (String7(""BKSP""), 8, 99, 294), (String7(""u""), 85, 118, 127), (String7(""o""), 79, 103, 125) … (String7(""o""), 79, 201, -87), (String7(""i""), 73, 163, -5), (String7(""n""), 78, 143, -23), (String7(""g""), 71, 116, -8), (String7("" ""), 32, 123, 24), (String7(""f""), 70, 140, 65), (String7(""i""), 73, 139, -31), (String7(""n""), 78, 103, 69), (String7(""n""), 69, 112, 26), (String7(""BKSP""), 8, 60, 453)]"
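
Each `TIMINGS` entry is a vector of `(letter, keycode, dwell time, wait time)` tuples, in the order pushed by `convert_user_df`. A minimal sketch of how such an entry can be unpacked, using the first three tuples of the "Have a good weekend." row with plain strings instead of `String7`:

```julia
timings = [("SHIFT", 16, 455, 0), ("H", 72, 159, -151), ("a", 65, 139, 92)]

letters     = [t[1] for t in timings]
dwell_times = [t[3] for t in timings]
# A negative wait time means the key was pressed before the previous one was released
n_overlaps  = count(t -> t[4] < 0, timings)

println(letters)      # ["SHIFT", "H", "a"]
println(dwell_times)  # [455, 159, 139]
println(n_overlaps)   # 1
```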


### Data conversion
Now we can finally convert all the files to our format.

In [None]:
; mkdir -p Keystrokes_formatted

In [None]:
path = "Keystrokes/files/"
files = readdir(path)
n = length(files)
ProgressMeter.ijulia_behavior(:clear) # hide warnings of the progress bar
p = Progress(n, dt=0.5, barglyphs=BarGlyphs("[=> ]"), barlen=50, color=:red)
# read a user file and convert it to our format
read_and_convert(src) = convert_user_df(DataFrame(CSV.File(src, delim="\t",quoted=false,ignorerepeated=true, silencewarnings=true)))
for (i,file) in enumerate(files)
    if occursin("keystrokes",file) # it is a user file
        src = joinpath(path, file)
        out_df = try
            read_and_convert(src)
        catch e # error found, correct it and retry
            py"fix_file"(src)
            read_and_convert(src)
        end
        # written only if the conversion succeeded, so a failure cannot write stale data
        CSV.write(joinpath("Keystrokes_formatted", file),out_df,delim=",",append=true)
    end
    ProgressMeter.next!(p,showvalues = [(:iter,i), (:tot,n)])
end
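
The "try, fix the file, retry once" pattern used in this loop can be isolated into a small helper. The sketch below is self-contained and uses toy stand-ins: `process_with_repair`, `parse_numbers` and `drop_blank_lines` are hypothetical names that play the roles of the conversion step and of `fix_file`, and are not part of the notebook.

```julia
# Hypothetical helper: run process(path); on failure, call repair(path) once and retry
function process_with_repair(process, repair, path)
    try
        return process(path)
    catch
        repair(path)            # fix the file in place
        return process(path)    # a second failure propagates to the caller
    end
end

# Toy stand-ins for the real reader and for fix_file
parse_numbers(path) = [parse(Int, line) for line in eachline(path)]
function drop_blank_lines(path)
    kept = filter(!isempty, readlines(path))
    write(path, join(kept, "\n") * "\n")
end

# A temporary file with a stray blank line, which makes the first attempt fail
tmp_path, io = mktemp()
write(io, "1\n\n2\n")
close(io)

result = process_with_repair(parse_numbers, drop_blank_lines, tmp_path)
println(result)  # [1, 2]
```

Returning the value from the `try` expression, rather than writing in a `finally` block, means nothing is written when both attempts fail.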
