This is the MATLAB analysis component for the MailPhys project. The user-facing, data collection component can be found at https://github.com/ForSubmission/MailPhys.
Analysis requires source data, obtained from MailEye and Shimmer sensors.
The source data files are CSVs (for Shimmer data) and JSONs contained within MAIL folders (for MailEye data). These should be named in the format `GSR_P01_S1.csv`, `EMG_P01_S1.csv`, `MAIL_P01_S1_A1`, `MAIL_P01_S1_A2`, etc., where P is followed by the participant number, S by the session number, and A by the account number (only for MAIL folders).
Once these data are collected, we preprocess them and perform classification by following the steps below. The procedure is broken into steps to keep it modular, so that individual steps can be iteratively improved without re-running the whole pipeline, which could be time-consuming.
The steps are:
- Preprocessing 1 (just for GSR data)
- Preprocessing 2 (Aggregation and conversion into features)
- Table creation
- Classification
All these steps output intermediate data and results in `~/MATLAB/MailPhys`, where `~` stands for the user's home folder. `~/MATLAB` is defined in Params.m as `basedir` and can be changed to another location (Windows users might especially want to do this, since `~` is not defined under all Windows installations). `Params.outdir` specifies that we want to save outputs under the `MailPhys` subdirectory of `basedir` (this can also be changed as preferred).
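For reference, below is a minimal sketch of what the relevant part of Params.m might look like; the actual file in this repository may define more parameters, and only `basedir` and `outdir` are assumed here.

```matlab
% Sketch: the two output-location parameters referenced throughout this README.
classdef Params
    properties (Constant)
        % Change this if '~'/HOME is not defined on your system (e.g. some Windows setups).
        basedir = fullfile(getenv('HOME'), 'MATLAB');
        % All intermediate data and results are written under this subdirectory.
        outdir = fullfile(Params.basedir, 'MailPhys');
    end
end
```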
We preprocess GSR files using Ledalab, obtaining an EDA MAT file for the following step (aggregation).
As soon as all data is collected, run preprocess_eda.m and point it to a Shimmer CSV (GSR) file. This converts the GSR data into a format understood by Ledalab, then runs Ledalab on the converted data.
The data will be saved in the target directory (defined in `Params.outdir`, e.g. `~/MATLAB/MailPhys`), under the `edatemp` subdirectory.
`preprocess_eda` can be run without arguments (a file selection window will appear) or with arguments, for example:
```matlab
preprocess_eda('/Users/username/Google Drive/MailPhys/Data/Participant 1 - training/GSR_P01_S4.csv');
```
This will generate an `EDA_P01_S4.mat` file in `edatemp`. Move this file back alongside the original CSV files and MAIL folders for the participant.
Data should then be ready to be aggregated (next step).
After preprocessing, the data is aggregated so that we generate a JSON file for each message that the user read. To each JSON we add physiological and accelerometer data related to the given mail (obtained from the Shimmer files and the EDA MAT file).
Run `aggregateData`, pointing it to the folder that contains all the source data (pass a parameter specifying the directory, or call it without parameters to pop up a folder selection window). The selected folder must contain the CSV files, the EDA MAT files (added in the previous step) and the MAIL directories, organised as follows:
```
EDA_P01_S1.mat
EDA_P01_S2.mat
EMG_P01_S1.csv
EMG_P01_S2.csv
GSR_P01_S1.csv
GSR_P01_S2.csv
MAIL_P01_S1_A1
MAIL_P01_S2_A1
```
where P01 is the participant number and S1 the session number. MAIL folders also have an account number; A1 is mandatory, and additional accounts can be separated by using A2, A3 and so on.
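For example, reusing the illustrative path from the preprocessing step:

```matlab
% Point aggregateData at the folder holding the files listed above
% (or call it with no arguments to get a folder selection window).
aggregateData('/Users/username/Google Drive/MailPhys/Data/Participant 1 - training');
```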
Running `aggregateData` on this dataset will create a new directory called `P01` (where 01 is the participant number) under the target directory (`Params.outdir`). In the `P01` directory, a JSON file will be created for each message, in chronological order (the original order in which the user read the messages). Two JSON files containing raw data will also be created here. The outputted data is defined in the Data specification section below.
Once the data is split into individual JSON files (named `0.json`, `1.json`, ...) it can be read and saved into a table using writeTable.m. The table will be called `tab` and saved into `datatable.mat` in the same directory we point `writeTable` to (we can pass a parameter to specify the target location, or run it without parameters to pop up a folder selection window).
```matlab
writeTable('~/MATLAB/MailPhys/P01');
```
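After this, the table can be loaded and inspected in MATLAB as usual; a minimal sketch, assuming the default `Params` locations (the variable name `tab` is the one saved by `writeTable`):

```matlab
% Load the table written by writeTable and take a quick look at its variables.
s = load(fullfile(getenv('HOME'), 'MATLAB', 'MailPhys', 'P01', 'datatable.mat'));
tab = s.tab;
summary(tab)
```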
Once the procedure has been run for at least one participant, we can run the classifier, along with a randomisation test to make sure that the actual classification performed significantly better than chance.
To do this, we run saveResults.m, which runs the classifier and outputs the results of the randomisation tests in the target folder. It requires the participant number and the desired number of randomisation iterations.
For example:
```matlab
saveResults(1, 500)
```
will run the test for participant 1 (automatically translated to `P01` under the target directory specified in `Params.outdir`). The results will be saved in the target directory, under `results_P01_500`.
If we run this procedure for a few participants, we can then run finalTableMaker.m, specifying a vector of participant numbers and a number-of-iterations parameter, for example:
```matlab
finalTableMaker([1 2 3 4], 500)
```
will parse all the needed `results_PXX_500` folders (where PXX is the participant number) and compute a final table, summarising all randomisation results for the given participants, under the `results_final_500` folder. This final table will be called `finalTable.csv` and will be outputted in the target directory. It contains how often a given feature set was significant and what its average AUC was. Feature sets are defined in SetCollection.m. A `participantTable.csv` will also be outputted, reporting the "best" feature set for each participant (i.e. the feature set that obtained the highest AUC for that participant).
Once we have a set of results (generated during the classification step described above), we can send tags back to the server using pushResults.m. We need to specify the details of the server from which the e-mail data was originally obtained:
```matlab
pushResults('imap.example.com', 'username', 'password');
```
This will run through the classification results, searching for individual e-mails for which any tag was estimated with a probability of at least 75%. Found e-mails will be tagged with the variable name of interest, sent as an IMAP flag. For example, this procedure may tag a few e-mails as spam by adding an IMAP flag named `spam` to them.
## Data specification

Each JSON corresponds to an `AugmentedMessage` (as defined in https://github.com/ForSubmission/MailPhys) and contains the following fields.
- `pleasure` - number from 1 to 4 indicating how pleasant this message was (all these labels could be split into binary classification, e.g. 1 if > 2.5, otherwise 0)
- `priority` - 1 to 4, how important this message was
- `workload` - 1 to 4, how much work it took to deal with this message
- `spam` - 1 if spam (it might be interesting to classify spam based on physiological data)
- `wasUnread` - binary, 1 if the message was initially unread, 0 if the participant had read it before (we could discard all messages for which this is 0)
- `eagerlyExpected` - binary, 1 if the participant was eagerly expecting this message (there are not many of these)
- `corrupted` - binary, 1 if the text was corrupted (all corrupted messages should be discarded)
- `containsAttachment` - binary, 1 if attachments were present
- `session` - session number, just in case
- `account` - account number, just in case
- `bodySize` - how big the message was overall, in pixels (including offscreen text)
- `rawEyeData` - matrix of all raw eye data; contains Xs, Ys and Ts (timepoints in nanoseconds)
- `eda` - EDA and wrist accelerometer data; contains
  - `accX` - raw data for the x axis
  - `accY` - raw data for the y axis
  - `accZ` - raw data for the z axis
  - (the above also have `_n` variants calculated on normalised data)
  - `accel` - sum of all raw accelerometer data
  - `accelSum` - sum of the raw accelerometer data
  - `raw` - raw GSR data
  - `phasic` - phasic driver; should contain a representation of how much the fingers were sweating at each timepoint
  - `phasicSum` - summation of the phasic driver, i.e. total sweating (arousal) for that message; might correlate with priority
- `emg` - facial EMG and head accelerometer data; contains
  - `accel`, `accelSum`, `accX`, `accY`, `accZ` - similar to what `eda` contains, but for the head
  - `corrugator` - normalised data from corrugator supercilii (eyebrow; concentration)
  - `zygomaticus` - normalised data from zygomaticus major (cheek; smiling)
  - `corrSum` - summation of corrugator data
  - `zygoSum` - summation of zygomaticus data
  - (the above also have `_n` variants calculated on normalised data)
- `gazes` - fixations detected in various boxes; contains
  - `body` - fixations in the body of the message
  - `thread` - fixations in the thread (left-hand side) related to this message
  - `reference` - fixations in the reference view (related messages) related to this message
  - `header` - fixations in the header (sender, subject, time)

  Each of these boxes contains
  - `x`, `y` - coordinates of the fixation
  - `unixtime` - when the fixation took place (milliseconds from 1/1/1970)
  - `duration` - fixation duration in ns
- `keywords` - which words the user looked at (or looked near); each entry contains
  - `name` - hash of the keyword (so we don't spy on the user's data, but it can still be used to correlate keywords)
  - `gazedurations` - duration of each fixation that took place on or near this keyword
  - `length` - original length of the word
- `clickActivity` - "chunks" of mouse clicks (one entry represents the number of mouse clicks that took place in a 2 s interval, if any)
- `pointerActivity` - chunks of mouse movement (hypotenuse of the x/y mouse travel within a 2 s interval)
- `keyboardActivity` - how many key presses occurred in a 5 s interval (one entry per chunk)
- `visits` - when and how many times the user visited this message (start and end unixtime per visit)
- `selections` - one entry is created every time the user highlights some text in the message; contains
  - `endTime` - when the selection ended (there is no start time)
  - `nOfCharacters` - how many characters were selected
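As a quick illustration, a single message file can be inspected from MATLAB (R2016b or later, for `jsondecode`); the path below assumes the default `Params` locations:

```matlab
% Sketch: decode one aggregated message and peek at a few of the fields above.
msg = jsondecode(fileread(fullfile(getenv('HOME'), 'MATLAB', 'MailPhys', 'P01', '0.json')));
msg.priority           % self-reported priority, 1 to 4
msg.eda.phasicSum      % total phasic driver (arousal) for this message
numel(msg.gazes.body)  % number of fixations in the message body
```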
Apart from the numbered JSON files, the following files are also outputted; they contain the entire data stream.
`emgData.json` contains
- `time` - unix time (ms from 1/1/1970) when this sample was taken
- `accelX`, `accelY`, `accelZ` - accelerometer data from the head
- `corr` - corrugator supercilii muscle
- `zygo` - zygomaticus major muscle
`gsrData.json` contains
- `time`, `accelX`, `accelY`, `accelZ` - same as above, but the accelerometer data is from the wrist
- `gsr` - raw skin conductance
- `phasic` - phasic driver (should be more informative than raw GSR)
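These stream files can be read the same way, for example to plot the phasic driver over the whole session. A sketch under the same path assumptions; the `[gsr.time]` bracketing works whether `jsondecode` returns a struct of arrays or a struct array:

```matlab
% Sketch: plot the phasic driver (arousal) across the entire recording.
gsr = jsondecode(fileread(fullfile(getenv('HOME'), 'MATLAB', 'MailPhys', 'P01', 'gsrData.json')));
plot([gsr.time], [gsr.phasic]);
xlabel('unix time (ms)'); ylabel('phasic driver');
```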