# Parsing a single log file

_parse: to examine in a minute way_

In this notebook we'll extract the _information_ on reaction time and accuracy from a single log file, and generalise our code to apply to _any_ log file (written with the same structure).

It is considered _good practice_ to import all the modules you use in a notebook at the beginning, so we'll start with that:

In [1]:
import string

We'll be using two strings defined in the `string`-module:

1. all lowercase (ASCII) letters
1. all digits (as characters in a string, not as numbers)

In [2]:
print(string.ascii_lowercase)
print(string.digits)

abcdefghijklmnopqrstuvwxyz
0123456789


## Read lines of a single log-file into a list

Assign the _path to one of the logfiles_ to the variable `logfile_name`. You will need to adjust the path to wherever you placed the `logs`-directory containing them!

In [20]:
logfile_name = '../src/logs/0023_FCA_2017-03-09.log'

Open the file, read the lines & close the file.

In [4]:
fp = open(logfile_name, 'r')
all_lines = fp.readlines()
fp.close()
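
Opening and closing by hand works, but a common alternative is the `with`-statement, which closes the file automatically when the block ends. The sketch below is self-contained: the file name `demo.log` and its two lines are made up for illustration and written to a temporary directory, so it does not depend on where your `logs`-directory is.

```python
import os
import tempfile

# Hypothetical stand-in for a log file, so the example runs anywhere
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.log')
with open(demo_path, 'w') as fp:
    fp.write('# Time\tHHGG\tEvent\n28558\t42\tSTIM=d\n')

# The file is closed automatically when the with-block ends
with open(demo_path, 'r') as fp:
    demo_lines = fp.readlines()

print(demo_lines)
print(fp.closed)  # True: no explicit close() needed
```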

Display the _first ten lines_. For this, you can use the slice-syntax `[:10]`, which reads: 'from the start up to, but not including, index 10'.

In [5]:
all_lines[:10]

['# Original filename: 0023_UZD_2017-02-04.log\n',
 '# Time unit: 100 us\n',
 '# RARECAT=digit\n',
 '#\n',
 '# Time\tHHGG\tEvent\n',
 '28558\t42\tSTIM=d\n',
 '38397\t42\tRESP=1\n',
 '46843\t42\tSTIM=q\n',
 '54687\t42\tRESP=1\n',
 '63719\t42\tSTIM=7\n']

The first five lines are comments, which we'll want to skip over. How many events are there in the file (how many rows after the comments)?

In [6]:
len(all_lines[5:])

2560

## Splitting the lines

From the above, determine the field-separator character used in the file.

In [7]:
field_sep = '\t'

Split the 6th line and display:

In [8]:
line = all_lines[5]
split_line = line.split(field_sep)
print(split_line)

['28558', '42', 'STIM=d\n']


The 1st value of the split list is the time; the 3rd value tells us whether the _event_ was a _stimulus presentation_ or a _response_. Since the data is consistent, we can get the actual stimulus presented (letter or digit) by counting how many characters into the string the value after the equal-sign sits. The index of the stimulus is:

In [9]:
# what is the index of the stimulus?
# Try changing the relevant value below until you get 'd'
split_line[2][2]

'I'

In [10]:
idx = 5

Note that this index is also the one we need for getting to the response (1 or 2).
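
Instead of finding the index by trial and error, it can also be computed. The following is a minimal sketch, assuming every event has the form `NAME=value` followed by a newline, as in the lines shown above; the variable names are chosen for illustration only.

```python
event = 'STIM=d\n'

# Position of the equal-sign, plus one, gives the index of the value
eq_idx = event.index('=')
print(eq_idx + 1)         # 5
print(event[eq_idx + 1])  # d

# Alternative: strip the trailing newline and split on '='
name, value = event.strip().split('=')
print(name, value)        # STIM d
```

The second variant would also cope with values longer than one character, which indexing a single position does not.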

* split the 6th line & print the stimulus delivery time and stimulus presented
* split the 7th line & print the response time and button number pressed
* _calculate the reaction time_
    * __NB__: the contents of the file we are reading from are _textual_
    * _arithmetic on text_ is very different from that on _numbers_...
    * (you'll need to convert the string to a number; use the `int`-function)
* assign the reaction time to a variable ('`RT`') and print it
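
To see why the conversion matters, here is a small sketch using the two time stamps from the 6th and 7th lines. It shows what `+` and `-` do on strings before the `int`-function is applied.

```python
# Two time stamps as they appear in the file: text, not numbers
stim_text = '28558'
resp_text = '38397'

# '+' on strings concatenates instead of adding
joined = resp_text + stim_text
print(joined)  # 3839728558

# '-' is not defined for strings at all
try:
    resp_text - stim_text
except TypeError as err:
    print('TypeError:', err)

# Converting with int() first gives ordinary arithmetic
difference = int(resp_text) - int(stim_text)
print(difference)  # 9839
```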

In [11]:
# 6th line: STIM
line = all_lines[5]
split_line = line.split(field_sep)
print(split_line)

stim_time = split_line[0]
cur_stim = split_line[2][idx]
print(stim_time, cur_stim)
    
# 7th line: RESP
line = all_lines[6]
split_line = line.split(field_sep)
print(split_line)
    
resp_time = split_line[0]
cur_resp = split_line[2][idx]
print(resp_time, cur_resp)

# calculate RT
RT = int(resp_time) - int(stim_time)
print('reaction time: ', RT)    

['28558', '42', 'STIM=d\n']
28558 d
['38397', '42', 'RESP=1\n']
38397 1
reaction time:  9839


## Loop over the lines

Convert the above into something that can be used to _loop over_ the list. Start by just looping over the 6th and 7th rows: you should arrive at the same answer as above.

You'll need logic for determining whether the current line _starts with the string_ `STIM`. Strings have a method `startswith` for this! Use an `if`-`else`-construct.

In [12]:
for line in all_lines[5:7]:
    split_line = line.split(field_sep)

    # does the 3rd element of the list start with 'STIM'?
    if split_line[2].startswith('STIM'):
        stim_time = split_line[0]
        cur_stim = split_line[2][idx]
        print(stim_time, cur_stim)

    else:  # nope; it starts with something other than 'STIM'
        resp_time = split_line[0]
        cur_resp = split_line[2][idx]
        print(resp_time, cur_resp)

        # calculate RT
        RT = int(resp_time) - int(stim_time)
        print('reaction time: ', RT)    

28558 d
38397 1
reaction time:  9839


## Saving the reaction times into lists

Instead of printing out 1280 RT values, we want to save them into memory for later use (we need to calculate mean and median values over them). Start with two empty lists for reaction times:

* one for the _frequent_ category of stimuli (letter)
* one for the _rare_ category of stimuli (digit)

In [13]:
rt_freq = []
rt_rare = []

In [14]:
for line in all_lines[5:]:
    split_line = line.split(field_sep)

    if split_line[2].startswith('STIM'):
        stim_time = split_line[0]
        cur_stim = split_line[2][idx]
    else:
        resp_time = split_line[0]
        cur_resp = split_line[2][idx]

        # calculate RT
        RT = int(resp_time) - int(stim_time)
        
        if cur_stim in string.ascii_lowercase:
            rt_freq.append(RT)            
        elif cur_stim in string.digits:
            rt_rare.append(RT)            
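
The category test in the loop relies on the `in`-operator, which checks whether a character occurs in a string. As a sketch, the same logic can be isolated in a small helper; `category` is a hypothetical name used for illustration and is not used elsewhere in this notebook.

```python
import string

def category(stim):
    # Hypothetical helper mirroring the if/elif in the loop above
    if stim in string.ascii_lowercase:
        return 'frequent'
    elif stim in string.digits:
        return 'rare'
    return 'unknown'

print(category('d'))  # frequent
print(category('7'))  # rare
print(category('D'))  # unknown: uppercase is in neither string
```

Note that a stimulus matching neither string is silently skipped by the loop, so its reaction time ends up in neither list.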

## Accuracy: is each response correct or incorrect?

Modify the above code to also include logic for determining whether the response is correct or not. Initialise two _counters_ for the number of correct responses.

In [15]:
rt_freq = []
rt_rare = []
n_corr_freq = 0
n_corr_rare = 0

In [16]:
for line in all_lines[5:]:
    split_line = line.split(field_sep)

    if split_line[2].startswith('STIM'):
        stim_time = split_line[0]
        cur_stim = split_line[2][idx]
    else:
        resp_time = split_line[0]
        cur_resp = split_line[2][idx]

        # calculate RT
        RT = int(resp_time) - int(stim_time)
        
        if cur_stim in string.ascii_lowercase:
            rt_freq.append(RT)
            if cur_resp == '1':
                n_corr_freq += 1
        elif cur_stim in string.digits:
            rt_rare.append(RT)            
            if cur_resp == '2':
                n_corr_rare += 1

## Print out the mean and median RTs and the accuracies for frequent and rare stimuli

* use the functions you previously wrote as an exercise
    * you'll have to copy the code into the present notebook and execute
* recall that times are given in the odd unit of '100's of microseconds'
    * multiply by `100e-6` (_i.e._ 0.0001) to obtain seconds
* accuracy is simply the number of correct responses divided by the total number of responses (multiplied by 100 to express it as a percentage)
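
As a worked example of the two calculations, the sketch below converts the single reaction time found earlier (9839) to seconds and computes an accuracy from made-up counts. The counts `n_correct` and `n_responses` are illustrative and not taken from any log file.

```python
# One reaction time in the file's unit (100 us per tick)
rt_ticks = 9839
rt_seconds = rt_ticks * 100e-6
print(round(rt_seconds, 4))  # 0.9839

# Accuracy as a percentage, with made-up counts for illustration
n_correct = 9
n_responses = 10
accuracy = 100 * n_correct / n_responses
print(accuracy)  # 90.0
```

The rounding is only for display: multiplying by a floating-point number can leave tiny rounding residue in the last digits, as some of the printed means and medians in this notebook show.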

In [9]:
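def mean(values):
    # arithmetic mean: sum of the values divided by their count
    return sum(values) / len(values)
def median(values):
    # element at the middle index of the sorted values; for an
    # even number of values this is the upper of the two middle ones
    return sorted(values)[len(values) // 2]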
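
The middle-index approach to the median differs from the textbook definition when the number of values is even. The sketch below compares it with `statistics.median` from the standard library, which averages the two middle values in that case; `median_upper` is a name made up here to avoid clashing with the function above.

```python
import statistics

def median_upper(values):
    # Same approach as the cell above: element at the middle index
    return sorted(values)[len(values) // 2]

odd = [3, 1, 2]
even = [4, 1, 3, 2]

print(median_upper(odd), statistics.median(odd))    # 2 2
print(median_upper(even), statistics.median(even))  # 3 2.5
```

With hundreds of reaction times per category the difference is likely to be small, but it is worth knowing about when comparing results with other tools.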

In [18]:
# freq
mean_rt_freq = mean(rt_freq) * 100e-6
median_rt_freq = median(rt_freq) * 100e-6
accuracy_freq = 100 * n_corr_freq / len(rt_freq)

# rare
mean_rt_rare = mean(rt_rare) * 100e-6
median_rt_rare = median(rt_rare) * 100e-6
accuracy_rare = 100 * n_corr_rare / len(rt_rare)

In [19]:
print('Frequent category:')
print('------------------')
print('Mean:', mean_rt_freq)
print('Median:', median_rt_freq)
print('Accuracy:', accuracy_freq)

Frequent category:
------------------
Mean: 0.5881216796875001
Median: 0.5517000000000001
Accuracy: 93.5546875


In [20]:
print('Rare category:')
print('--------------')
print('Mean:', mean_rt_rare)
print('Median:', median_rt_rare)
print('Accuracy:', accuracy_rare)

Rare category:
--------------
Mean: 0.680767578125
Median: 0.6271
Accuracy: 86.71875


## Convert all of the above into a _function_

Now that we have code that works for one file, we can make it into a __function__ and apply it on the other files (hoping they 'behave' the same way as the file we used to develop the code on...).

In [10]:
def read_log_file(logfile_name, field_sep='\t'):
    r'''Read a single log file.
    
    The field-separator is assumed to be the tab-character (\t).
    
    Return the mean and median RT (in seconds) and the accuracy (in
    percent), separately for the frequent and rare categories, as a
    tuple of 6 values in the order:
    (mean_rt_freq, median_rt_freq, accuracy_freq,
     mean_rt_rare, median_rt_rare, accuracy_rare)
    '''

    rt_freq = []
    rt_rare = []
    n_corr_freq = 0
    n_corr_rare = 0

    # open file and read all lines; the with-block closes it afterwards
    with open(logfile_name, 'r') as fp:
        all_lines = fp.readlines()

    # hard-code the index of the stimulus/response type/number
    idx = 5
    
    # loop over lines from 6th onwards
    for line in all_lines[5:]:
        split_line = line.split(field_sep)

        if split_line[2].startswith('STIM'):
            stim_time = split_line[0]
            cur_stim = split_line[2][idx]
        else:
            resp_time = split_line[0]
            cur_resp = split_line[2][idx]

            # calculate RT
            RT = int(resp_time) - int(stim_time)

            if cur_stim in string.ascii_lowercase:
                rt_freq.append(RT)
                if cur_resp == '1':
                    n_corr_freq += 1
            elif cur_stim in string.digits:
                rt_rare.append(RT)            
                if cur_resp == '2':
                    n_corr_rare += 1    

    # calculate return values
    # freq
    mean_rt_freq = mean(rt_freq) * 100e-6
    median_rt_freq = median(rt_freq) * 100e-6
    accuracy_freq = 100 * n_corr_freq / len(rt_freq)

    # rare
    mean_rt_rare = mean(rt_rare) * 100e-6
    median_rt_rare = median(rt_rare) * 100e-6
    accuracy_rare = 100 * n_corr_rare / len(rt_rare)                    
                    
    return(mean_rt_freq, median_rt_freq, accuracy_freq,
           mean_rt_rare, median_rt_rare, accuracy_rare)
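
The function assumes that every file starts with exactly five comment lines. One possible way to relax that assumption, sketched here on made-up lines in the same layout as the log files, is to skip any line that starts with the comment character `#`.

```python
# Made-up lines in the same layout as the log files
demo_lines = [
    '# Time unit: 100 us\n',
    '#\n',
    '# Time\tHHGG\tEvent\n',
    '28558\t42\tSTIM=d\n',
    '38397\t42\tRESP=1\n',
]

# Keep only the lines that do not start with the comment character
event_lines = [line for line in demo_lines if not line.startswith('#')]
print(len(event_lines))  # 2
print(event_lines[0].split('\t'))
```

This is only a sketch: whether comment lines can also appear between events in the real files is not something the notebook establishes.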

## Test the function on the same file, then on a new one

In [21]:
(mean_rt_freq, median_rt_freq, accuracy_freq,
    mean_rt_rare, median_rt_rare, accuracy_rare) = read_log_file(logfile_name)

In [22]:
print('Frequent category:')
print('------------------')
print('Mean:', mean_rt_freq)
print('Median:', median_rt_freq)
print('Accuracy:', accuracy_freq)

Frequent category:
------------------
Mean: 0.4994505859375
Median: 0.4647
Accuracy: 96.484375


In [23]:
print('Rare category:')
print('--------------')
print('Mean:', mean_rt_rare)
print('Median:', median_rt_rare)
print('Accuracy:', accuracy_rare)

Rare category:
--------------
Mean: 0.595238671875
Median: 0.5651
Accuracy: 85.9375


In [24]:
logfile_name = '../src/logs/0048_MSB_2016-09-23.log'

In [25]:
(mean_rt_freq, median_rt_freq, accuracy_freq,
    mean_rt_rare, median_rt_rare, accuracy_rare) = read_log_file(logfile_name)

In [26]:
print('Frequent category:')
print('------------------')
print('Mean:', mean_rt_freq)
print('Median:', median_rt_freq)
print('Accuracy:', accuracy_freq)

Frequent category:
------------------
Mean: 0.50315771484375
Median: 0.4666
Accuracy: 95.60546875


In [27]:
print('Rare category:')
print('--------------')
print('Mean:', mean_rt_rare)
print('Median:', median_rt_rare)
print('Accuracy:', accuracy_rare)

Rare category:
--------------
Mean: 0.58263359375
Median: 0.5496
Accuracy: 88.671875
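
The print-cells above repeat the same five lines for each category and file. As a sketch of how that repetition could be avoided, the hypothetical helper `format_summary` below (not part of the original notebook) builds the same report as a string; the example call uses the rare-category values printed above.

```python
def format_summary(label, mean_rt, median_rt, accuracy):
    # Hypothetical helper: build the report shown in the cells above
    title = label + ' category:'
    lines = [
        title,
        '-' * len(title),  # underline as long as the title
        'Mean: ' + str(mean_rt),
        'Median: ' + str(median_rt),
        'Accuracy: ' + str(accuracy),
    ]
    return '\n'.join(lines)

summary = format_summary('Rare', 0.58263359375, 0.5496, 88.671875)
print(summary)
```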
