## Feature making after image screening

This notebook continues from notebook (1). It takes the locks that were accepted during screening and produces a features file. That file should have one row for every lock of the participant, containing all the features we can compute for that lock. The lock files can be found in

    'data/Main collection - processed/',
    
and the new files will be saved in

    'data/Main collection - features/'
    
All locks for a person are used, and all 916 features are computed for each lock. Once again, the folder names of the participants for whom we have locks are:

    'Giorgos Mon 19',
    'the remaining names have been removed to protect the privacy of the participants'

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean
from IPython.display import display, clear_output
import os
import shutil
import time

# Pretty display for notebooks
%matplotlib inline

# specify the place from where we will use the data
source_path_rel = 'data/Main collection - processed/'

# specify the place where we will make the new file
target_path_rel = 'data/Main collection - features/'

# specify the subject that we are running the process for
subject = 'Giorgos Mon 19'

# build the path of the features file that will be written
save_path = target_path_rel + subject + '.csv'

# build the path of the folder that holds the subject's lock files
source_path = source_path_rel + subject + '/'

# list the lock files in the subject's folder
lock_file_names = os.listdir(source_path)

# remove the logs file
lock_file_names.remove('logs')

print("The number of files in the directory is {}, and its first \
5 files are:\n{},\n{},\n{},\n{},\n{}.".format(len(lock_file_names), 
                                        *lock_file_names[:5]))

The number of files in the directory is 80, and its first 5 files are:
Giorgos_M19_f__0.csv,
Giorgos_M19_f__1.csv,
Giorgos_M19_f__10.csv,
Giorgos_M19_f__11.csv,
Giorgos_M19_f__12.csv.


In [2]:
# DEBUGGING OPTIONS
np.set_printoptions(threshold=40)
pd.set_option('display.max_rows', 280)
pd.set_option('display.max_columns', 1000)

#### Sorting File Names in Natural Order

Sorting the list into natural (human) order is not strictly necessary; it just gives more high-level control over the flow of the loop.

In [3]:
import re

def atoi(text):
    """ Converts a digit-only chunk to int (helper for natural sorting) """
    return int(text) if text.isdigit() else text

def natural_keys(text):
    """
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    """
    # raw string so that \d is not treated as a string escape
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]

# sort the files according to natural order
lock_file_names.sort(key=natural_keys)

print("The number of files in the directory is {}, and its first \
5 files are:\n{},\n{},\n{},\n{},\n{}.".format(len(lock_file_names), 
                                        *lock_file_names[:5]))

The number of files in the directory is 80, and its first 5 files are:
Giorgos_M19_f__0.csv,
Giorgos_M19_f__1.csv,
Giorgos_M19_f__2.csv,
Giorgos_M19_f__4.csv,
Giorgos_M19_f__5.csv.
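
The difference between the two listings above can be reproduced with a minimal sketch. The file names below are hypothetical, and `natural_key` is an illustrative helper written in the same spirit as `natural_keys`, not a replacement for it.

```python
import re

def natural_key(name):
    # split into digit and non-digit chunks; digit chunks compare as integers
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', name)]

# hypothetical file names, for illustration only
names = ['lock__10.csv', 'lock__2.csv', 'lock__1.csv']

lexicographic = sorted(names)             # plain string comparison
natural = sorted(names, key=natural_key)  # numbers compared as numbers

print(lexicographic)
print(natural)
```

With plain string comparison `lock__10.csv` lands before `lock__2.csv`, which is why the first listing jumps from file 1 to file 10.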


### Finding Points of Interest

Finding the PoIs is perhaps the most important part of this notebook, because most of the features are built from these points. Since PoI detection is not the subject of this project, the method is not very robust. For that reason, in the previous notebook we tested that the algorithms used here find the PoIs adequately.

In [4]:
def findCE(df, idxD, idxB, idxF, verbose=False):
    """ Finds the C and E indices, based on B, D and F points. """
    
    # find index C
    idxC = df['x'][idxB:idxD].idxmin()
    
    # find index E
    idxE = df['x'][idxD:idxF].idxmax()
    
    if verbose:
        print("The C index is {}.".format(idxC))
        print("The E index is {}.".format(idxE))
        
    return idxC, idxE


def findF(df, idxG, idxD, verbose=False):
    """ Finds the index of point F, based on the gradual change of the movement
    (captured by percentage change). The calculations happen on around one third 
    of the shape."""
    
    # make a series that records the change in x values in percentages
    dpct = df['x'][idxD:].pct_change(periods=1, fill_method='pad', limit=1, freq=None)
    
    # fill the NaN values
    dpct = dpct.fillna(value=0)
        
    # find the first index, coming from D, that has negative change
    idxFirstNeg = dpct[dpct<0].head(1).index.values[0]
    
    # find index F as the minimum x between the first negative change and G
    idxF = df['x'][idxFirstNeg:idxG].idxmin()
    
    if verbose:
        print("The F index is {}.".format(idxF))
        
    return idxF


def findB(df, idxA, idxD, verbose=False):
    """ Finds the index of point B, based on the gradual change of the movement
    (captured by percentage change). The calculations happen on around one third 
    of the shape."""
    
    # make a series that records the change in x values in percentages
    dpct = df['x'][:idxD].pct_change(periods=1, fill_method='pad', limit=1, freq=None)
    
    # fill the NaN values
    dpct = dpct.fillna(value=0)
    
    # find the last index, in the range up to D, that has negative change
    idxLastNeg = dpct[dpct<0].tail(1).index.values[0]
    
    # find index B as the maximum x between A and the last negative change
    idxB = df['x'][idxA:idxLastNeg].idxmax()
    
    if verbose:
        print("The B index is {}.".format(idxB))
        
    return idxB


def findG(df, verbose=False):
    """ Finds the index of point G of the lock, as the sample that maximizes
    x plus the reversed y (max_y - y), i.e. the bottom-right corner. """
  
    # initialize dummy dataframe
    dummy = pd.DataFrame(columns=['y_rev', 'x and y_rev'])
  
    # find the max value in the y axis
    max_y = df['y'].max()
    
    if verbose:
        print('The max value in y axis of the plot is {}.'.format(max_y))
  
    # make a reversed(y) column
    dummy['y_rev'] = df['y'].apply(lambda x: max_y - x)

    # make the column we extract G's index from
    dummy['x and y_rev'] = df['x'].add(dummy['y_rev'])

    # get the index of G
    idxG = dummy['x and y_rev'].idxmax()
    
    if verbose:
        print("The index G is... {}.".format(idxG))

    return idxG


def findIndices(data):
    """ Finds the indices of the Points of Interest in the lock. This function
    works under the assumption that the start of the axis is near point A. """
    
    # A (bottom left) starts the shape, H (bottom left) ends it, G is bottom right
    iA, iH, iG = 0, -1, findG(data)
    
    # index for D (highest point)
    iD = data['y'].idxmax()
    
    # find the index for B (neck left)
    iB = findB(data, iA, iD)
    
    # find the index for F (neck right)
    iF = findF(data, iG, iD)
    
    # find the indices for C (head left) and E (head right)
    iC, iE = findCE(data, iD, iB, iF)
        
    return iA, iB, iC, iD, iE, iF, iG, iH


def findPoints(data, iA, iB, iC, iD, iE, iF, iG, iH, keepTimestamp=False):
    """ Finds the points in the dataframe according to the indices """
    
    if keepTimestamp: # find ABCDEFGH while keeping 'millis' column
        
        A = data.iloc[iA]  # bottom left (starting point)
        B = data.iloc[iB]  # neck left
        C = data.iloc[iC]  # head left
        D = data.iloc[iD]  # highest point
        E = data.iloc[iE]  # head right
        F = data.iloc[iF]  # neck right
        G = data.iloc[iG]  # bottom right
        H = data.iloc[iH]  # bottom left (ending point)
            
        return A, B, C, D, E, F, G, H
        
    else: # find ABCDEFGH while dropping 'millis' column
        
        A = data.iloc[iA][:3]  # bottom left (starting point)
        B = data.iloc[iB][:3]  # neck left
        C = data.iloc[iC][:3]  # head left
        D = data.iloc[iD][:3]  # highest point
        E = data.iloc[iE][:3]  # head right
        F = data.iloc[iF][:3]  # neck right
        G = data.iloc[iG][:3]  # bottom right
        H = data.iloc[iH][:3]  # bottom left (ending point)
            
        return A, B, C, D, E, F, G, H
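
As a rough illustration of the heuristic behind `findG` and the choice of D in `findIndices`, the sketch below scores a hypothetical eight-point outline in plain Python. The coordinates are made up for the example and are not taken from any participant's data; the notebook itself works on pandas columns.

```python
# hypothetical (x, y) outline of a lock, ordered A..H as in the notebook
points = [(0.0, 0.0), (0.5, 2.0), (0.2, 3.0), (1.0, 4.0),
          (1.8, 3.0), (1.5, 2.0), (2.0, 0.1), (0.1, 0.0)]

max_y = max(y for _, y in points)

# score each sample by x plus the reversed y (max_y - y), mirroring findG:
# the bottom-right corner has both a large x and a small y
scores = [x + (max_y - y) for x, y in points]
idx_g = scores.index(max(scores))

# D is simply the highest sample
idx_d = max(range(len(points)), key=lambda i: points[i][1])

print(idx_g, idx_d)
```

On this outline the bottom-right sample (index 6) gets the highest score and the top sample (index 3) is picked as D.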


### Feature Creation

In this step we use the PoI functions to create features, some of which depend on the PoIs and some of which do not. These features can be separated into three groups:

1. Global features
2. Distance Features (in-between Points of Interest)
3. Ratios between all Distance Features

To create all of them for each file, we load the file and compute various characteristics that we store in a Python dictionary. That dictionary becomes one row of the features file: every key-value pair corresponds to a column and its value. Note that, as said earlier, at this stage we want to put as many features as possible into it.

**Global Features** are the features that come from the distributions of the raw data, without using any semantic cues such as Points of Interest. We take 11 distributions into account here, computing 9 statistics for 7 of them and 10 statistics for the remaining 4. The distributions are:

    'positional values for x, y and z (raw data)'                       -> 9 statistics
    'changes in position for x, y, z and their magnitude'               -> 10 statistics (includes summation)
    'speed of the change in position for x, y, z and their magnitude'   -> 9 statistics

Position changes are the only group where summing all the elements makes sense: the sum of the signed per-axis increments gives the net change along x, y and z, and the sum of the magnitudes gives the actual total distance covered. All these account for $(3*9) + (4*10) + (4*9) = 103$ features. If we add the total time taken to make the shape, we have $104$ Global Features.
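
The summation can be illustrated with a small sketch in plain Python. The sample positions are hypothetical, and the first increment is set to zero to mirror the `fillna(0)` used in the notebook. Because the per-axis increments are signed, their sum telescopes to last minus first position, while the sum of the magnitudes is the length of the path.

```python
import math

# hypothetical (x, y, z) samples that return to the starting point
samples = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0), (0.0, 0.0, 0.0)]

# the first row has no predecessor, so its increment is zero
increments = [(0.0, 0.0, 0.0)]
for prev, cur in zip(samples, samples[1:]):
    increments.append(tuple(c - p for p, c in zip(prev, cur)))

# magnitude of each step
magnitudes = [math.sqrt(dx * dx + dy * dy + dz * dz) for dx, dy, dz in increments]

x_sum = sum(dx for dx, _, _ in increments)  # net change along x
path_length = sum(magnitudes)               # total distance covered

print(x_sum, path_length)
```

Here the x increments cancel out because the path ends where it started, while the path length is 5 + 12 + 13 = 30.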

The naming convention for every feature goes like this: **'axis name' + 'group it belongs to' + '_' + 'statistic'**. The axis names can be x, y, z, millis or mag (for magnitude). The group is CamelCased and is one of the groups mentioned above, namely 'Pos', 'PosInc' and 'Speed'. The statistics are: mean, median, std, variance, skewness, kurtosis, min, max, range, and summation in the case of 'PosInc'.
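
A minimal sketch of the naming convention, which also re-derives the count of 104. The statistic suffixes follow the keys written by `allMeasurements`; the `groups` dictionary is an illustrative structure, not something the notebook defines.

```python
# axes available in each group (illustrative structure)
groups = {
    'Pos': ['x', 'y', 'z'],
    'PosInc': ['x', 'y', 'z', 'mag'],
    'Speed': ['x', 'y', 'z', 'mag'],
}
stats = ['mean', 'median', 'var', 'std', 'skew', 'kurt', 'min', 'max', 'range']

names = []
for group, axes in groups.items():
    for axis in axes:
        tag = axis + group                  # e.g. 'xPosInc'
        names.extend(tag + '_' + s for s in stats)
        if tag.endswith('Inc'):
            names.append(tag + '_sum')      # summation only for increments

# the single purely temporal feature
names.append('millis_range')

print(len(names))
```

This yields names such as `xPos_mean`, `magPosInc_sum` and `zSpeed_kurt`.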

Regarding **Distance Features**, these relate to the Points of Interest, the important points that in some sense define a lock drawing. There are 8 such points, as seen in the screening earlier, and this group of features covers all possible distances between those 8 points, of which there are $\binom{8}{2}=28$. That is not the entire story, however: for each pair we take both the Euclidean distance and the temporal distance, which makes $28*2=56$ features.
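
The pair count can be checked with a short sketch that uses `itertools.combinations` in place of the ban-list loops used in the notebook; both enumerate each unordered pair once, in the same A-to-H order.

```python
from itertools import combinations

points = 'ABCDEFGH'

# every unordered pair of points, e.g. 'AB', 'AC', ..., 'GH'
pairs = [a + b for a, b in combinations(points, 2)]

# one spatial and one temporal distance per pair
distance_names = [p + suffix for p in pairs for suffix in ('_xyz', '_mil')]

print(len(pairs), len(distance_names))
```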

Finally, in a similar fashion to the Distance Features, for the **Ratio Features** we take the 28 Euclidean distances and the 28 temporal distances and find all possible ratios within each set. That amounts to $\binom{28}{2}=378$ features per set, or $756$ features when both sets are taken into account.
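
The same counting argument applies one level up. The sketch below builds the ratio names with the `'|'` separator used in `extractRatios` and adds up the three groups; `math.comb` requires Python 3.8 or later.

```python
from itertools import combinations
from math import comb

# the 28 point pairs that carry a distance
pairs = [a + b for a, b in combinations('ABCDEFGH', 2)]

# one spatial and one temporal ratio per pair of distances
ratio_names = [p + '|' + q + suffix
               for p, q in combinations(pairs, 2)
               for suffix in ('_xyz', '_mil')]

ratios_per_set = comb(28, 2)
total_features = 104 + 56 + len(ratio_names)

print(ratios_per_set, len(ratio_names), total_features)
```

The total matches the 916 columns of the features file.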

In [5]:
##### 3. Ratio Features (RoIs) #############################################################################
def extractRatios(ROIs, d):
    """ Computes the ratio of every pair of distance features (spatial and
    temporal) and adds the ratios to d. """
    
    # start the banlist
    banList = []
    
    # find the distances
    for key1, values1 in ROIs.items():
        
        # update the banlist
        banList.append(key1)
        
        for key2, values2 in ROIs.items():
            
            # skip the pair if the key is in banlist already
            if key2 in banList:
                continue
            
            # calculate the ratio of spatial distances for the pair at hand
            d[key1+'|'+key2+'_xyz'] = values1['xyz'] / values2['xyz']
            
            # calculate the ratio of temporal distances for the pair at hand
            d[key1+'|'+key2+'_mil'] = values1['millisec'] / values2['millisec']
    
    return d


def extractROIs(POIs, d):
    """ Builds the Ratios of Interest: a dictionary keyed by PoI pair (e.g. 'AB')
    that holds the pair's spatial and temporal distance taken from d. """
    
    # start the banlist
    banList = []
    
    # start the list
    ROIs = []
    
    # find the distances
    for key1, values1 in POIs.items():
        
        # update the banlist
        banList.append(key1)
        
        for key2, values2 in POIs.items():
            
            # skip the pair if the key is in banlist already
            if key2 in banList:
                continue
            
            #append the correct string
            ROIs.append(key1+key2)
    
    # initialize the actual ROIs data structure with the keys we just came up with
    ROIs = {key: {'xyz': [], 'millisec': []} for key in ROIs}
        
    # establish the correct dictionary
    for key, value in ROIs.items():

        # add them to dict
        value['xyz'] = d[key+'_xyz']
        value['millisec'] = d[key+'_mil']
            
    return ROIs


##### 2. Distance Features (PoIs) #############################################################################
def extractDistances(POIs, d):
    """ Computes the Euclidean and temporal distance between every pair of
    Points of Interest and adds them to d. """
    
    # start the banlist
    banList = []
    
    # find the distances
    for key1, values1 in POIs.items():
        
        # update the banlist
        banList.append(key1)
        
        for key2, values2 in POIs.items():
            
            # skip the pair if the key is in banlist already
            if key2 in banList:
                continue
            
            #calculate the spatial distance for every pair and add to dict
            d[key1+key2+'_xyz'] = euclidean(values1['xyz'], values2['xyz'])
            
            #calculate the temporal distance for every pair and add it to dict
            d[key1+key2+'_mil'] = abs(values2['millisec'][0] - values1['millisec'][0])
    
    return d


def extractPOIs(df):
    """ Find the Points of Interest of each file. It returns a data structure of 
    a dictionary of dictionaries. The first dictionary has every point's name
    (e.g. 'A') as keys, and the second, the corresponding index, XYZ value and 
    time. """
  
    # initialize the data structure
    POIs = {'A': None, 
            'B': None, 
            'C': None, 
            'D': None, 
            'E': None, 
            'F': None, 
            'G': None,
            'H': None}
  
    # populate it with an internal dict
    for k, v in POIs.items():
        POIs[k] = {'index': [], 'xyz': [], 'millisec': []}
  
    # find the indices for all POIs
    iA, iB, iC, iD, iE, iF, iG, iH = findIndices(df)
    
    # list the values that will go to POIs
    values = findPoints(df, iA, iB, iC, iD, iE, iF, iG, iH, keepTimestamp=True)
      
    # list the keys, the indices, the XYZs and the millis of the dictionary
    keys = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
    indices = [iA, iB, iC, iD, iE, iF, iG, iH] # kept as separate names until here for readability

    for i in range(len(keys)):
        POIs[keys[i]]['index'].append(indices[i])
        POIs[keys[i]]['xyz'].extend([values[i]['x'], values[i]['y'], values[i]['z']])
        POIs[keys[i]]['millisec'].append(values[i]['millis'])
  
    return POIs


##### 1. Simple Features #############################################################################
def allMeasurements(series, tag, d):
    """ Accepts a pandas series and does multiple statistical operations. The default way to treat 
    NaNs is to skip them. """
    
    # find mean
    d[tag+'_mean'] = series.mean()

    # find median
    d[tag+'_median'] = series.median()

    # find variance (2nd moment)
    d[tag+'_var'] = series.var(ddof=0)

    # find standard deviation
    d[tag+'_std'] = series.std(ddof=0)

    # find skewness (3rd moment)
    d[tag+'_skew'] = series.skew()

    # find kurtosis (4th moment)
    d[tag+'_kurt'] = series.kurt()

    # find min
    d[tag+'_min'] = series.min()

    # find max
    d[tag+'_max'] = series.max()

    # find range (max - min)
    d[tag+'_range'] = d[tag+'_max'] - d[tag+'_min']
    
    if tag.endswith('Inc'):
            
        # make an additional summation measurement (net change per axis, total distance for magnitude)
        d[tag+'_sum'] = series.sum()


def remakePosGroup(df):
    """ Remakes the three positional columns to start from 0 """
    
    # find the starting values of x y and z
    xx = df.head(1)['xPos'].values[0]
    yy = df.head(1)['yPos'].values[0]
    zz = df.head(1)['zPos'].values[0]
    
    # subtract the above values from every entry to position the lock to start from (0, 0, 0)
    df['xPos'] = df['xPos'].sub(xx)
    df['yPos'] = df['yPos'].sub(yy)
    df['zPos'] = df['zPos'].sub(zz)
    
        
def extractSimpleFeatures(df_raw, d):
    """ Extracts simple features from data """
    
    # wrap the raw data in a new DataFrame object to work on
    df = pd.DataFrame(df_raw)
    
    # change column names to be in the desired format for csv
    df.rename(index=int, copy=False, columns={'x': 'xPos', 'y': 'yPos', 'z': 'zPos'}, inplace=True)
    
    # calculate the 3 positional increment columns and the time increments
    df[['xPosInc', 'yPosInc', 'zPosInc', 'millisInc']] = df.sub(df.shift(1)).fillna(0)
    
    # calculate the change of magnitude column
    df['magPosInc'] = np.sqrt(df[['xPosInc', 'yPosInc', 'zPosInc']].pow(2).sum(axis=1))
    
    # calculate the 4 speed increment columns
    df[['xSpeed', 'ySpeed', 'zSpeed', 'magSpeed']] = \
     df[['xPosInc', 'yPosInc', 'zPosInc', 'magPosInc']].div(df['millisInc'], axis=0)
        
    # remake xPos, yPos and zPos to start from 0
    remakePosGroup(df)
        
    #display(df)
    
    # list the distributions that statistics will be computed for
    distributions = ['xPos', 'yPos', 'zPos', 
                     'xPosInc', 'yPosInc', 'zPosInc', 'magPosInc', 
                     'xSpeed', 'ySpeed', 'zSpeed', 'magSpeed']
    
    for dist in distributions:
        
        # make all measurements for the distributions of each axis
        allMeasurements(df[dist], dist, d)
                    
    # the 104th feature is the total time taken (only purely temporal feature)
    d['millis_range'] = df.tail(1)['millis'].values[0] - df.head(1)['millis'].values[0]

In [6]:
# control appending to csv once created
appendNewLock = False

for file in lock_file_names:
    
    # load the file
    data = pd.read_csv(source_path+file)
    
    # make a dictionary for the instance features
    d = {}
    
    # 1. construct SIMPLE FEATURES
    extractSimpleFeatures(data, d)
    
    # find the Points of Interest of the 3d shape
    POIs = extractPOIs(data)
      
    # 2. construct DISTANCE FEATURES
    d = extractDistances(POIs, d)

    # find the Ratios of Interest of the 3d shape
    ROIs = extractROIs(POIs, d)
      
    # 3. construct RATIO FEATURES
    d = extractRatios(ROIs, d)
    
    # load into a one-row dataframe for ease of writing it to csv
    d = pd.DataFrame(d, index=[0])
    
    #display(d)
    
    #break #-------------------------------------------------
    
    if appendNewLock:
        # append to existing
        d.to_csv(save_path, index=False, mode='a', header=False) 
    else:
        # create new csv
        d.to_csv(save_path, index=False, mode='w')
    
    # switch for appending instead of creating new csvs
    appendNewLock = True



In [7]:
print("The number of columns of the last file is {}.".format(d.columns.values.shape[0]))

The number of columns of the last file is 916.
