## Manual Lock Classification and Pre-processing

The purpose of this notebook is to select and pre-process the lock files for a single person.

The paths used are:
- to find the locks: 'data/Main collection - raw/'
- to create the locks: 'data/' (the folder should then be moved manually to 'Main collection - processed')

To begin, we import the necessary libraries, initialize the path variables, and create the folders for the subject whose processed files we are creating. After that we define some pre-processing functions and then visualize each image. During the visualization, user input is required to say whether each image should be kept or dropped.

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, clear_output
import os
import shutil
import time

# Pretty display for notebooks
%matplotlib inline

# specify the subject that we are running the process for
subject = 'Giorgos Mon 19'

# specify saving paths for the accepted and rejected images
img_saving_path = 'images/' + subject + '/'
img_saving_path_accepted = img_saving_path+'accepted images/'
img_saving_path_rejected = img_saving_path+'rejected images/'

# delete the subject's image folder and all its contents if it exists
if os.path.exists(img_saving_path):
    shutil.rmtree(img_saving_path)

# create fresh folders for the accepted and rejected images
os.makedirs(img_saving_path_accepted)
os.makedirs(img_saving_path_rejected)

# specify the place where we will make the new files
target_path_rel = 'data/'

# build the path of the subject's output folder
target_path = target_path_rel + subject + '/'

# delete the subject's output folder and all its contents if it exists
if os.path.exists(target_path):
    shutil.rmtree(target_path)

# create the output folder together with its logs subfolder
os.makedirs(target_path + 'logs/')

# specify the place from where we will read the data
source_path_rel = 'data/Main collection - raw/'

# build the path of the subject's raw folder
source_path = source_path_rel + subject + '/'

# list the lock files of the subject
lock_file_names = os.listdir(source_path)

print("The number of files in the directory is {}, and its first \
5 files are:\n{},\n{},\n{},\n{},\n{}.".format(len(lock_file_names), 
                                        *lock_file_names[:5]))

The number of files in the directory is 93, and its first 5 files are:
Giorgos_M19_f__0.csv,
Giorgos_M19_f__1.csv,
Giorgos_M19_f__10.csv,
Giorgos_M19_f__11.csv,
Giorgos_M19_f__12.csv.
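
The folder setup in the cell above follows a "delete if present, then recreate" pattern. A minimal sketch of that pattern as a reusable helper is shown below; the name `reset_dir` and the temporary folder are assumptions made for illustration and are not part of the notebook.

```python
import os
import shutil
import tempfile

def reset_dir(path):
    """Delete `path` with all its contents if it exists, then create it empty."""
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)  # also creates any missing parent folders

# demonstrate inside a throwaway temporary folder (hypothetical layout)
base = tempfile.mkdtemp()
target = os.path.join(base, 'images', 'subject', 'accepted images')

reset_dir(target)                                    # first run creates the folder
open(os.path.join(target, 'old.png'), 'w').close()   # leave a stale file behind
reset_dir(target)                                    # second run wipes it

remaining = os.listdir(target)
print(remaining)

shutil.rmtree(base)  # clean up the temporary folder
```

Using `os.makedirs` instead of `os.mkdir` means the call does not fail when a parent folder such as 'images/' is missing.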


In [2]:
# DEBUGGING OPTIONS
np.set_printoptions(threshold=40)
pd.set_option('display.max_rows', 280)

#### Sorting Folder Names to Natural Order

Sorting the file names into natural (human) order is not strictly necessary; it only gives some more high-level control over the flow of the selection loop.

In [3]:
import re

def atoi(text):
    """ Converts a chunk of digits to int (helper for natural_keys) """
    return int(text) if text.isdigit() else text

def natural_keys(text):
    """
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    """
    # raw string avoids the invalid escape sequence warning for \d
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]

# sort the files according to natural order
lock_file_names.sort(key=natural_keys)

print("The number of files in the directory is {}, and its first \
5 files are:\n{},\n{},\n{},\n{},\n{}.".format(len(lock_file_names), 
                                        *lock_file_names[:5]))

The number of files in the directory is 93, and its first 5 files are:
Giorgos_M19_f__0.csv,
Giorgos_M19_f__1.csv,
Giorgos_M19_f__2.csv,
Giorgos_M19_f__3.csv,
Giorgos_M19_f__4.csv.
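
The difference between the two outputs above can be reproduced on a tiny list. The sketch below uses hypothetical file names and a helper called `natural_key` (an illustrative name, equivalent in spirit to `natural_keys`) to contrast plain string sorting with natural sorting.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', name)]

# hypothetical file names in the same style as the lock files
names = ['lock__10.csv', 'lock__2.csv', 'lock__1.csv']

lexicographic = sorted(names)               # compares character by character
natural = sorted(names, key=natural_key)    # compares the numbers as integers

print(lexicographic)  # ['lock__1.csv', 'lock__10.csv', 'lock__2.csv']
print(natural)        # ['lock__1.csv', 'lock__2.csv', 'lock__10.csv']
```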


***

### Pre-processing functions

The lock selection will be in terms of the 'lockiness' of a lock and the success in calculating some Points of Interest (PoIs) that will be used to make features. In order to increase the number of files that can be considered locks, and to succeed more often when calculating the PoIs, we will do some pre-processing before making features for every lock. The operations are:

1. Reset the coordinates (meaning to change the center of the interaction box).
2. Chop off the front and back tails of the lock.
3. Remove duplicates (this helps the next step).
4. Interpolate some crazy values.

The first step brings the lock into a space that makes more intuitive sense, and puts the origin of the axes in the front left corner of the interaction box, from the perspective of the user. The second step deals with the fact that the user sometimes delayed before making a move, and we do not want to take the slow reaction time into account. The third step removes repeated points so that the interpolation works better. Finally, the fourth step is needed because the Kinect sometimes glitched and gave some crazy values (e.g. in some rare cases the user would momentarily bring the palm closer to the Kinect than the index finger, and one value needs to be corrected).

#### Coordinate reset

The functions that deal with the co-ordinate reset are in the cell below. They use some values that refer to the optics of the Kinect v2 and they were taken from [1].

[1] https://github.com/shiffman/OpenKinect-for-Processing/tree/master/OpenKinect-Processing/examples/Kinect_v2/DepthPointCloud2
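
As a rough illustration of where the offsets come from, the sketch below applies the back-projection formula `(pixel - c) * depth / f` to the two corner pixels used by default. Reading `cx`, `cy` as the principal point and `fx`, `fy` as the focal lengths is an interpretation of the values from [1], and the helper name `pixel_to_mm` is an assumption for illustration.

```python
# Values of the Kinect v2 depth camera as used in this notebook
CX, CY = 254.878, 205.395   # assumed to be the principal point, in pixels
FX, FY = 365.456, 365.456   # assumed to be the focal lengths, in pixels
MAX_Z = 4500                # maximum depth range, in mm

def pixel_to_mm(pixel, centre, focal, depth_mm):
    """Back-project a pixel index to a metric offset at the given depth."""
    return (pixel - centre) * depth_mm / focal

offset_x = pixel_to_mm(0, CX, FX, MAX_Z)     # leftmost pixel column
offset_y = pixel_to_mm(423, CY, FY, MAX_Z)   # lowest pixel row

# the x offset is negative (left of the centre), the y offset is positive
print(round(offset_x, 1), round(offset_y, 1))
```

Because the offset for pixel 0 is negative, its absolute value has to be added to the x values to move the origin to the left edge.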

In [4]:
from math import ceil


def getOffsetX(X):
    """ Calculates the offset for x dimension according to the Kinect v2 
    specifications. The offset is different than just mapping pixel values 
    into their real dimensions, so only 0 and 511, which are the leftmost and 
    rightmost possible pixel indices, are accepted.
    - X: x position of the pixel of the kinect depth image """

    if X != 0 and X != 511:
        raise ValueError("X pixel needs to be set as 0 or 511 which are the \
leftmost and rightmost indices of the pixels of the Kinect depth image")
  
    cx = 254.878 # parameter of the Kinect v2 optics
    fx = 365.456 # parameter of the Kinect v2 optics
    #maxX = 0     # min pixel width of the kinect v2 depth camera
    maxZ = 4500  # max range of the kinect v2 depth camera
    
    return (X - cx) * maxZ / fx


def getOffsetY(Y):
    """ Calculates the offset for y dimension according to the Kinect v2 
    specifications. The offset is different than just mapping pixel values 
    into their real dimensions, so only 0 and 423, which are the highest and 
    lowest possible pixel indices, are accepted.
    - Y: y position of the pixel of the kinect depth image """

    if Y != 0 and Y != 423:
        raise ValueError("Y pixel needs to be set as 0 or 423 which are the \
highest and lowest indices of the pixels of the Kinect depth image")
  
    cy = 205.395 # parameter of the Kinect v2 optics
    fy = 365.456 # parameter of the Kinect v2 optics
    #maxY = 424   # max pixel width of the kinect v2 depth camera
    maxZ = 4500  # max range of the kinect v2 depth camera
  
    return (Y - cy) * maxZ / fy


def roundUp(x):
    """ Round a value upwards to the first decimal """
    return ceil(x * 10) / 10.0


def resetXY(data, offset_Z=500, pixelX=0, pixelY=423):
    """ Resets x, y and z data to bring them to the desired coordinate system. 
    - data: dataframe with 'x', 'y' and 'z' columns (modified in place)
    - offset_Z: value that is subtracted from z
    - pixelX, pixelY: pixel indices used to compute the x and y offsets """
  
    # compute the offsets from the Kinect v2 optics
    offset_X, offset_Y = getOffsetX(pixelX), getOffsetY(pixelY)
  
    # take the absolute value of offset_X to prepare it for addition
    offset_X = abs(offset_X)

    # reset X
    data['x'] = data['x'].add(offset_X)
  
    # reset Y
    data['y'] = data['y'].subtract(offset_Y)
  
    # make the negative value of Y into positive (this also flips the y axis)
    data['y'] = data['y'].abs()
  
    # reset Z
    data['z'] = data['z'].subtract(offset_Z)
  
    return data

#### Chopping sides

Chopping the sides removes the response time elements, which we do not want to leak into the features. Regarding the back chop, there are 56 lines that represent the closing signal, and we could drop all of them because they are not shape making but just the participant waiting for the signal to close. However, we will keep a few of them (6) and drop fewer than 56, because some of them might be part of the tail end of someone's shape.

Regarding the front chop the situation is a bit more complicated. In the front tail there are no extra points from the signal, since the application that created the data discards them; the signal points are not included in the buffer that later becomes the csv. However, even when the signal is made, the user's movement doesn't start immediately. Some users take more time to respond and some less, and all these points are recorded. So, in order to avoid keeping the points that come after the signal but before the actual drawing, we check the Euclidean distance between the very first point and the ones that follow. When that distance is above 20mm we can be sure that substantial movement has occurred. Similarly to the previous case, we do not discard all the points but keep 3 (chosen arbitrarily, like in the back chop case) that should capture the early movement, which has presumably started within the 20mm window. Theoretically, the participant could also drift for more than 20mm before starting the movement, but these incidents should be very rare and cover little distance, since prolonged movement can be seen on the shape and has therefore been discarded earlier.

In addition to that we will keep a logs file with the duplicates of a processed image, the front chop and the backchop and statistics about the front chop and the duplicates in the entire file.
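
The front chop rule described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the notebook's `chopSides`: the function name `front_chop_index`, its parameters, and the synthetic trace are all hypothetical.

```python
import numpy as np

def front_chop_index(points, threshold_mm=20.0, keep=3):
    """Return the index of the first row to keep.

    points: (n, 3) array of x, y, z values in mm.
    """
    # distance of every point from the very first point
    dists = np.linalg.norm(points - points[0], axis=1)
    # indices where the hand has clearly moved away from the start
    moved = np.flatnonzero(dists > threshold_mm)
    if moved.size == 0:
        return 0  # no substantial movement found, keep everything
    # step back by `keep` rows, but never before the start of the trace
    return int(max(moved[0] - keep, 0))

# hypothetical trace: 10 idle samples, then 10 mm of movement per sample along x
idle = np.zeros((10, 3))
moving = np.column_stack([np.arange(1, 11) * 10.0, np.zeros(10), np.zeros(10)])
trace = np.vstack([idle, moving])

start = front_chop_index(trace)
chopped = trace[start:]
print(start, len(chopped))
```

In this trace the first sample farther than 20mm from the start is row 12, so stepping back 3 rows keeps everything from row 9 onwards.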

In [5]:
from scipy.spatial.distance import euclidean


def chopSides(data, tFront=None, tBack=None, flatChopBack=None, trim_break=3):
    """ Chops the unnecessary points from the sides (beginning - ending) of a 
    drawn 3D line such as the lock. It accepts:
    - data: as a pandas dataframe of the data
    - tFront: as a threshold value for how much difference is allowed at the 
      front side of the trace
    - tBack: as a threshold value for how much difference is allowed at the 
      back side of the trace
    - flatChopBack: flat chop amount from the back
    - trim_break: the trim break factor accounts for the number of rows that 
      will be kept despite the fact that they break the threshold """
  
    # raise error if two options are chosen for the back chop
    if tBack and flatChopBack:
        raise ValueError("Two backchop options are specified instead of one.")
  
    # return early with a notification if all values are None
    if tFront is None and tBack is None and flatChopBack is None:
        print("All trimming values are None, no chopping was done.")
        return data, 0, 0

    # if only flatChopBack is set to a numeric value
    if flatChopBack and not tFront and not tBack:
        return data[:-flatChopBack], 0, flatChopBack
    
    # initialize the counters
    frontChopCounter, backChopCounter = 0, 0
  
    # load the threshold values to a dictionary
    t = {'f': tFront, 'b':tBack}
  
    # s will control frontwards and backwards traversing
    s = 'f' if t['f'] else ''
    s = s+'b' if t['b'] else s
      
    # loop once or twice depending on the sides selection
    for letter in s:
          
        # reverse the data to traverse backwards
        data = data[::-1] if letter=='b' else data
    
        # hold the first line as a reference r, iloc gets first line no matter the df index
        firstLine = data.iloc[0]
    
        # loop over data and break with the first incident of dissimilarity
        for i, row in data.iloc[1:].iterrows():
      
            # checks if the line is NOT similar enough with the reference
            if euclidean([row['x'], row['y'], row['z']], 
                         [firstLine['x'], firstLine['y'], firstLine['z']]) > t[letter]:
                
                # prevent bugs in case the trim factor is bigger than i - should be very rare
                trim_break = 0 if i<trim_break else trim_break
        
                # change the i if backwards
                i = data.shape[0]-i+trim_break if letter == 'b' else i-trim_break
        
                # set results to the appropriate slice of "data"
                data = data[i:]
        
                # set the counters depending on the side the loop is for
                frontChopCounter = i if letter == 'f' else frontChopCounter
                backChopCounter = i if letter == 'b' else backChopCounter
                break
    
        # reset indices if front chop occured
        data.index = range(data.shape[0]) if letter=='f' else data.index
    
        # reverse data again if they have been reversed for the backwards loop
        data = data[::-1] if letter=='b' else data
  
    if flatChopBack:
  
        # if there is a flatChop amount specified for the back tail slice it out 
        data = data[:-flatChopBack]
    
        # add flatChopBack to the variable that gets returned
        backChopCounter = flatChopBack
    
    # returns the data, the amount of front and back lines chopped separately
    return data, frontChopCounter, backChopCounter


#### Finding and Removing Duplicates

In order to improve the interpolation process that follows, we find the points that are exact duplicates of their previous point in terms of 'x', 'y' and 'z'. Of each such pair, the earlier point is kept and the later one is removed.
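
For comparison, the same idea can be expressed without an explicit loop. The sketch below is an alternative formulation using `shift`, not the function defined in the next cell, and the sample data is hypothetical.

```python
import pandas as pd

# hypothetical samples: rows 1 and 2 repeat the position of the row before them
df = pd.DataFrame({'x': [1.0, 1.0, 1.0, 2.0],
                   'y': [5.0, 5.0, 5.0, 5.0],
                   'z': [9.0, 9.0, 9.0, 9.0],
                   'millis': [0, 33, 66, 99]})

cols = ['x', 'y', 'z']

# True where a row has the same x, y, z as the row directly before it
is_repeat = df[cols].eq(df[cols].shift()).all(axis=1)

# keep the first row of each run and renumber the index
deduplicated = df[~is_repeat].reset_index(drop=True)

print(int(is_repeat.sum()))
print(deduplicated)
```

The first row is never flagged, because `shift` leaves nothing to compare it with.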

In [6]:
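def checkForDuplicates(data, variableNames='xyz', removeDuplicates=False):
    """ Checks for consecutive duplicate lines in a dataframe. Returns 
    (data, duplicateCounter, duplicates), plus indicesToKill as a fourth 
    value when removeDuplicates is True """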
  
    # sort the variable names and put them back to string
    v = ''.join(sorted(variableNames))
  
    if v != 'xy' and v != 'xyz':
        raise ValueError("variableNames accepts only format like 'xy' or 'xyz'. \
It specifies which axis will be looked for duplicates.")
    
    # check for the index values to be nice and in order
    if list(data.index.values) != list(range(data.shape[0])):
        raise ValueError("The index values of the data frame are not in order.")
  
    # initialize a counter
    duplicateCounter = 0
  
    # initialize a data structure to store the indices
    duplicates = []
  
    # specify the columns
    cols = list(v)
  
    # init previous line
    prevLine = data.iloc[0]
  
    for i, row in data.iloc[1:].iterrows():
    
        # compare the selected columns with the previous line
        if prevLine[cols].equals(row[cols]):
    
            # increment a counter in case of duplicate
            duplicateCounter += 1
      
            # add the indices to a data structure
            duplicates.append((i-1, i))
    
        # keep the previous line to compare in the next iteration
        prevLine = data.iloc[i]
  
    if removeDuplicates:
      
        indicesToKill = []
    
        for tup in duplicates:
    
            i, j = tup
    
            if data.iloc[i]['millis'] <= data.iloc[j]['millis']:
        
                # keep the earlier point and mark the later one for removal
                indicesToKill.append(j)
    
        # drop the specified indices
        data = data.drop(data.index[indicesToKill])
    
        data.index = range(data.shape[0])
  
        return data, duplicateCounter, duplicates, indicesToKill

    return data, duplicateCounter, duplicates

#### Interpolation

Two functions are involved in the interpolation. First we use a g-h filter to construct a new, smoothed lock out of the user's lock; by superimposing this lock on the original one we can find the points with a large disparity. Then we check the neighbours of such a point, and if its distance to both of them is too high, it is a crazy point.

The values of g and h that worked best for our case are 0.9 for both, found by trial and error. The interpolation and the filter are combined so that only one point is fixed at a time; then we run the filter again on the corrected data and see if we find another point. That is because once the filter meets a crazy point it takes a while to adjust, and in that little while more points can easily be considered crazy.
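
To make the "takes a while to adjust" remark concrete, here is a minimal sketch of a textbook one-dimensional g-h filter run over a hypothetical signal with a single outlier. It is not the notebook's `smoothingGH`, which works on 'x', 'y' and 'z' together and handles the rate term differently; the names `gh_filter`, `x0`, `dx0` and `dt` are assumptions for illustration.

```python
import numpy as np

def gh_filter(measurements, x0, dx0, g, h, dt=1.0):
    """Textbook g-h filter for a 1-D signal sampled at a fixed interval dt."""
    x_est, dx = x0, dx0
    estimates = []
    for z in measurements:
        # prediction step: project the state forward with the current rate
        x_pred = x_est + dx * dt
        # update step: blend the prediction with the measurement
        residual = z - x_pred
        dx = dx + h * residual / dt
        x_est = x_pred + g * residual
        estimates.append(x_est)
    return np.array(estimates)

# hypothetical signal: a straight line with one outlier at index 5
signal = np.arange(10, dtype=float)
signal[5] = 50.0

smooth = gh_filter(signal, x0=-1.0, dx0=1.0, g=0.9, h=0.9)

# disparity between the raw signal and the filtered one
disparity = np.abs(signal - smooth)
flagged = np.flatnonzero(disparity > 3.0)
print(flagged)
```

With these settings the single outlier at index 5 also pushes the following sample over the threshold, which is consistent with correcting one point at a time and then filtering again.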

In [7]:
def smoothingGH(data, g=0.25, h=0.25):
    """ Applies a simple g-h filter on the 'x', 'y' and 'z' columns. The first 
    three rows seed the estimate and the last row is left untouched. """
    
    axis = ['x', 'y', 'z']
    
    df = pd.DataFrame(data.loc[:, axis])

    # make an estimate according to the first three rows
    x_est = df.loc[:2, axis].sum().div(3)
            
    # initialize delta
    delta_xyz = [0, 0, 0]
    
    for i, row in df.loc[3:df.index.values[-2], axis].iterrows():
        
        # prediction step
        x_pred = x_est + delta_xyz           # prediction estimate
        delta_xyz = [0, 0, 0]                # reset delta

        # update step
        residual = row - x_pred              # residual calculation
        delta_xyz = delta_xyz + h * residual # update delta
        x_est = x_pred + g * residual        # update estimate
        df.loc[i, axis] = x_est              # update data frame
        
    return df

In [8]:
def find_two_smallest_indices(numbers, i, sp):
    """ Finds the two smallest numbers in a list and their indices. Then, according to the 
    index of the row (i), and the space factor (sp) which defines how many points backwards will be
    checked, the new index is being set. """
    m1, m2 = float('inf'), float('inf')
    i1, i2 = 0, 0
    for idx, x in enumerate(numbers):
        if x <= m1:
            m1, m2 = x, m1
            i1, i2 = idx, i1
        elif x < m2:
            m2 = x
            i2 = idx
    return i+i1-sp, i+i2-sp


def checkForTwoOffsInARow(A, B, C, D):
    """ Returns True if only one point (B) is off, and False if two points 
    in a row (B and C) are off """
    
    AB = np.sqrt(np.sum((B-A)**2))   # first vector distance
    BC = np.sqrt(np.sum((C-B)**2))   # second vector distance
    CD = np.sqrt(np.sum((D-C)**2))   # third vector distance
    
    if np.mean([AB, CD]) < 2*BC:
        return True # one off value
    else:
        return False # two off values
    

def logPointToDict(d, num_points, i, i1, i2, cc):
    """ Logs the point to dictionary """
    
    d['point'+str(num_points+1)] = {'index': i, 
                                    'closest 2 indices(smooth_data)':(i1, i2),
                                    'color code': cc}

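def findFromSmooth(row, data_smooth, axis_compare, sp, i):
    """ Finds the distance between a row and the smoothened lock. The two 
    smoothened points closest to the row (looking sp indices around i) are 
    joined by a line of 15 points, and the smallest distance from the row to 
    that line is returned along with the indices of the two points. """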
    # find the nearby points in the smoothened lock
    nearby_points_smooth = [data_smooth.loc[j, axis_compare] for j in range(i-sp, i+sp)]

    # find the distance between the rows and nearby rows
    distances = np.sum((nearby_points_smooth - row.values)**2, axis=1)

    # Find the indices of the two smallest points
    point1_idx, point2_idx = find_two_smallest_indices(distances, i, sp)

    # make linear space in between them for each axis
    x_lin_space = np.linspace(data_smooth.loc[point1_idx, 'x'],
                              data_smooth.loc[point2_idx, 'x'], 15)
    y_lin_space = np.linspace(data_smooth.loc[point1_idx, 'y'], 
                              data_smooth.loc[point2_idx, 'y'], 15)

    # combine the linear spaces
    lin_space = np.transpose(np.array([x_lin_space, y_lin_space]))

    # get the distances of all those points
    distances_two = np.sqrt(np.sum((lin_space - row.values)**2, axis=1))

    return min(distances_two), point1_idx, point2_idx



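def interpolate(data_interp, data_smooth, interpolation_points, skip_indices,
                gh_threshold=20, interpolation_threshold=25, verbose=False, sp=3):
    """ Interpolates on data, according to data_smooth and returns a new data structure
    together with a boolean. The boolean is True when a second suspicious point is met
    (the caller should smooth again and re-run) and False when the whole lock was checked. 
    - sp, is the space that is checked around an index"""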
        
    axis_compare = ['x', 'y']                                         # keeps comparisons in 'x' and 'y'
    df = pd.DataFrame(data_interp.loc[:, ['x', 'y', 'z', 'millis']])  # makes a new df to work with
    found_one = False                                                 # control variable to control stopping
    last_index = list(data_interp.index.values)[-6]                   # last point that will be checked
    
    # go through every point in the base dataset
    for i, row in df.loc[sp+1:last_index, axis_compare].iterrows():
        
        if i in skip_indices:
            continue
        
        # find the smallest distance between the point in the row and other points
        smallest_distance, point1_idx, point2_idx = findFromSmooth(row, data_smooth, 
                                                                   axis_compare, sp, i)
        
        # if the difference is too large consider interpolation
        if smallest_distance > gh_threshold:
            
            if found_one:
                return df, True
            
            found_one = True
            
            # find the number of points to name the next point
            num_points = len(interpolation_points.items())

            # find whether there are two points in a row that need to be interpolated
            one_point_off = checkForTwoOffsInARow(df.loc[i-1, axis_compare], 
                                                  df.loc[i, axis_compare],
                                                  df.loc[i+1, axis_compare],
                                                  df.loc[i+2, axis_compare])
            
            if one_point_off:
            
                # find the new point that will occur after the interpolation
                new_point = np.mean([df.loc[i-1, axis_compare], df.loc[i+1, axis_compare]], axis=0)
            
                # find the difference of the two points
                interpolation_diff = np.sqrt(np.sum((df.loc[i, axis_compare] - new_point)**2, axis=0))
                
                if interpolation_diff > interpolation_threshold:
                
                    # interpolate
                    df.loc[i, axis_compare] = new_point
                    
                    # log the value to the dictionary
                    logPointToDict(interpolation_points, num_points, i, point1_idx, point2_idx, 'r')
                        
                else:

                    # log the value to the dictionary
                    logPointToDict(interpolation_points, num_points, 
                                   i, point1_idx, point2_idx, 'grey')
                        
                # mark the index so we dont go over it again
                skip_indices.append(i)

            else: # case of: one_point_off==False
                
                # take one third of the distance to use it for interpolation
                a_third = (df.loc[i+2, axis_compare] - df.loc[i-1, axis_compare]) / 3
                
                # find the new point that will occur after the interpolation
                new_point_one = df.loc[i-1, axis_compare] + a_third
            
                # find the new point that will occur after the interpolation
                new_point_two = df.loc[i-1, axis_compare] + 2 * a_third
            
                # interpolate first point
                df.loc[i, axis_compare] = new_point_one
                
                # log the value to the dictionary
                logPointToDict(interpolation_points, num_points, i, point1_idx, point2_idx, 'r')
                    
                # interpolate second point
                df.loc[i+1, axis_compare] = new_point_two
                
                # log the value to the dictionary
                logPointToDict(interpolation_points, num_points+1, i+1, point2_idx, point2_idx+1, 'r')
                    
                # mark the index so we dont go over it again
                skip_indices.append(i)
                skip_indices.append(i+1)

    return df, False
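
The two replacement rules used inside `interpolate` (midpoint for a single bad point, thirds for two bad points in a row) can be checked on small numbers. The helper names `replace_single` and `replace_pair` and the coordinates below are hypothetical and serve only as an illustration.

```python
import numpy as np

def replace_single(prev_pt, next_pt):
    """One bad point: replace it with the midpoint of its two neighbours."""
    return (prev_pt + next_pt) / 2.0

def replace_pair(prev_pt, next_pt):
    """Two bad points in a row: place them at one third and two thirds of the gap."""
    step = (next_pt - prev_pt) / 3.0
    return prev_pt + step, prev_pt + 2.0 * step

# hypothetical x, y neighbours (in mm) on either side of the bad samples
before = np.array([0.0, 0.0])
after = np.array([6.0, 3.0])

middle = replace_single(before, after)
first, second = replace_pair(before, after)

print(middle)          # [3.  1.5]
print(first, second)   # [2. 1.] [4. 2.]
```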

#### pre-Processing function and Keeping logs

In the next cell there is a function that bundles the pre-processing steps. In some cases, though, the files were too small or too bad to compute the PoIs, so some attention has also been paid to that in order to avoid crashing.

In [9]:
def processData(data,
                trimming_factor_front_mm=20,
                trimming_factor_back_mm=0,
                flat_back_trim=50):
    """ This is a function that bundles-up the pre-processing functions. """
    
    img_quality_good = True
    
    front_chop = 0
    
    duplicate_count = 0
    
    if data.shape[0] > 55:
    
        # step 1: reset coordinates
        data = resetXY(data)

        # step 2: chop sides
        data, front_chop, back_chop = chopSides(data, 
                                                tFront=trimming_factor_front_mm,
                                                tBack=trimming_factor_back_mm,
                                                flatChopBack=flat_back_trim)
                                
    else:
        
        img_quality_good = False
        
        print("The data has too few rows (only {}) to remove duplicates or do any kind of chopping."
              .format(data.shape[0]))
    
    if data.shape[0] > 15:
        
        # check and remove duplicates
        data, duplicate_count, _, _ = checkForDuplicates(data, removeDuplicates=True)
        
        # smoothen the x, y and z columns
        data_smooth = smoothingGH(data, g=0.9, h=0.9)

        data_first_smooth = pd.DataFrame(data_smooth.loc[:, :]) # initialize and keep first smooth
        data_interp = pd.DataFrame(data.loc[:, :])              # make it again for plotting the pure
        
        corrections_remaining = True

        interpolation_points = {}

        skip_indices = []

        while corrections_remaining:
            
            # interpolate data according to data_smooth
            data_interp, corrections_remaining = interpolate(data_interp,
                                                             data_smooth,
                                                             interpolation_points,
                                                             skip_indices,
                                                             gh_threshold=3,
                                                             verbose=True)
            
            if corrections_remaining:

                # run smoothing again on interpolated data
                data_smooth = smoothingGH(data_interp, g=0.9, h=0.9)
                
    else:
            
        img_quality_good = False

        print("The data after chopping has too few rows (only {}) for any meaningful processing."
              .format(data.shape[0]))

        dd = pd.DataFrame({'x': [1,4,2.5,1,4],
                           'y': [1,4,2.5,4,1], 
                           'z': [1,1,1,1,1], 
                           'millis': [1,2,3,4,5]})
        
        # if too few points return a dummy shape three times
        return dd, dd, dd, {}, front_chop, duplicate_count, img_quality_good 
    
    # this is what gets returned if all goes well
    return data, data_first_smooth, data_interp, interpolation_points, front_chop, duplicate_count, img_quality_good

### Visualize and Select the Locks

In order to visualize the locks we will plot them and ask for Y/N user input for each image. All the locks of a person are visualized in a for loop, which pauses until input is given. In addition, we want to produce some images that can be used for documentation purposes, without the title etc.

In the next cells we will add the plotting function for the selection loop, as well as the functions for the saved image:

In [10]:
def findCE(df, idxD, idxB, idxF, verbose=False):
    """ Finds the C and E indices, based on B, D and F points. """
    
    # take care of bugs
    if idxB==0 or idxF==0:
        print("There is an error: Points B or F are set to 0 which is an unrealistic value -> because of that, C and E are both set to 0.")
        return 0, 0
    
    # find index C
    idxC = df['x'][idxB:idxD].idxmin()
    
    # find index E
    idxE = df['x'][idxD:idxF].idxmax()
    
    if verbose:
        print("The C index is {}.".format(idxC))
        print("The E index is {}.".format(idxE))
        
    return idxC, idxE


def findF(df, idxG, idxD, verbose=False):
    """ Finds the index of point F, based on the gradual change of the movement
    (captured by percentage change). The calculations happen on around one third 
    of the shape."""
    
    # make a series that records the change in x values in percentages
    dpct = df['x'][idxD:].pct_change(periods=1, fill_method='pad', limit=1, freq=None)
    
    # fill the NaN values
    dpct = dpct.fillna(value=0)
    
    try:
        
        # find the first index, coming from D, that has negative change
        idxFirstNeg = dpct[dpct<0].head(1).index.values[0]
    
        # initialize index F
        idxF = df['x'][idxFirstNeg:idxG].idxmin()
    
    except Exception:
        
        print("ERROR -> There is a problem with finding index F (probably no negative value between D to G)")
        return 0
    
    if verbose:
        print("The F index is {}.".format(idxF))
        
    return idxF


def findB(df, idxA, idxD, verbose=False):
    """ Finds the index of point B, based on the gradual change of the movement
    (captured by percentage change). The calculations happen on around one third 
    of the shape."""
    
    # make a series that records the change in x values in percentages
    dpct = df['x'][:idxD].pct_change(periods=1, fill_method='pad', limit=1, freq=None)
    
    # fill the NaN values
    dpct = dpct.fillna(value=0)
    
    try:
        
        # find the last index, in the range from A to D, that has negative change
        idxLastNeg = dpct[dpct<0].tail(1).index.values[0]
        
        # initialize index B
        idxB = df['x'][idxA:idxLastNeg].idxmax()
        
    except Exception:
        
        print("ERROR -> There is a problem with finding index B (probably no negative value between A to D)")
        return 0
    
    if verbose:
        print("The B index is {}.".format(idxB))
        
    return idxB


def findG(df, verbose=False):
    """ Finds the index of point G of the lock (bottom right), as the row that 
    maximizes the sum of x and the reversed y (max_y - y). """
  
    # initialize dummy dataframe
    dummy = pd.DataFrame(columns=['y_rev', 'x and y_rev'])
  
    # find the max value in the y axis
    max_y = df['y'].max()
    
    if verbose:
        print('The max value in y axis of the plot is {}.'.format(max_y))
  
    # make a reversed(y) column
    dummy['y_rev'] = df['y'].apply(lambda x: max_y - x)

    # make the column we extract G's index from
    dummy['x and y_rev'] = df['x'].add(dummy['y_rev'])

    # get the index of G
    idxG = dummy['x and y_rev'].idxmax()
    
    if verbose:
        print("The index G is... {}.".format(idxG))

    return idxG


def findIndices(data):
    """ Finds the indices of the Points of Interest in the lock. This function
    works under the assumption that the start of the axis is near point A. """
    
    # find the indexes for A and H (both bottom left A starts the shape and H ends it) and G (bottom right)
    iA, iH, iG = 0, -1, findG(data)
    
    # index for D (highest point)
    iD = data['y'].idxmax()
    
    # find the index for B (neck left)
    iB = findB(data, iA, iD)
    
    # find the index for F (neck right)
    iF = findF(data, iG, iD)
    
    # find the indices for C (head left) and E (head right)
    iC, iE = findCE(data, iD, iB, iF)
        
    return iA, iB, iC, iD, iE, iF, iG, iH


def findPoints(data, iA, iB, iC, iD, iE, iF, iG, iH, keepTimestamp=False):
    """ Finds the points in the dataframe according to the indices """
    
    if keepTimestamp: # find ABCDEFGH while keeping 'millis' column
        
        A = data.iloc[iA]  # bottom left (starting point)
        B = data.iloc[iB]  # neck left
        C = data.iloc[iC]  # head left
        D = data.iloc[iD]  # highest point
        E = data.iloc[iE]  # head right
        F = data.iloc[iF]  # neck right
        G = data.iloc[iG]  # bottom right
        H = data.iloc[iH]  # bottom left (ending point)
            
        return A, B, C, D, E, F, G, H
        
    else: # find ABCDEFGH while dropping 'millis' column
        
        A = data.iloc[iA][:3]  # bottom left (starting point)
        B = data.iloc[iB][:3]  # neck left
        C = data.iloc[iC][:3]  # head left
        D = data.iloc[iD][:3]  # highest point
        E = data.iloc[iE][:3]  # head right
        F = data.iloc[iF][:3]  # neck right
        G = data.iloc[iG][:3]  # bottom right
        H = data.iloc[iH][:3]  # bottom left (ending point)
            
        return A, B, C, D, E, F, G, H
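
As a rough illustration of the idea behind `findG`, the sketch below restates the heuristic in plain Python: G is taken to be the sample that maximizes `x + (max_y - y)`, so a point scores highest when it is both far to the right and low. The function name `find_g_index` and the toy coordinates are assumptions made for this example only; the notebook's own `findG` works on a pandas dataframe.

```python
def find_g_index(xs, ys):
    """Return the index that maximizes x + (max_y - y), i.e. the bottom-right point."""
    max_y = max(ys)
    # flipping y turns "lowest" into "largest", so it can be added to x
    scores = [x + (max_y - y) for x, y in zip(xs, ys)]
    # the first index with the highest score wins, like idxmax() does
    return max(range(len(scores)), key=scores.__getitem__)


# toy outline (hypothetical values): starts bottom left, rises to the top,
# comes down to the bottom right and returns towards the start
xs = [0.0, 0.2, 0.5, 0.8, 1.0, 0.1]
ys = [0.0, 0.6, 1.0, 0.6, 0.0, 0.0]

print(find_g_index(xs, ys))  # 4
```

The heuristic relies on the lock being drawn upright; a rotated shape would move the highest score to a different corner.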


In [11]:
from mpl_toolkits.mplot3d.axes3d import Axes3D
import matplotlib.gridspec as gridspec

# plotting style (from matplotlib 3.6 onwards this style is named 'seaborn-v0_8-deep')
plt.style.use('seaborn-deep')

def findBorders(X, Y, Z):
    """ Finds the plotting dimensions for a box whose sides are all the same 
    size and in which the lock fits at its real dimensions """
  
    # mins, maxes, differences, and medians for every axis
    mins = [np.min(X), np.min(Y), np.min(Z)]
    maxs = [np.max(X), np.max(Y), np.max(Z)]
    diff = [I-J for I,J in zip(maxs, mins)]
    mids = [(I/2)+J for I,J in zip(diff, mins)]
    # get the highest difference
    bigMax = max(diff)
    # calculate half the side of the plotting box (with a 5% margin)
    gap = (bigMax/2) + bigMax*0.05
    # return the plotting box dimensions
    return [mids[0]-gap, mids[1]-gap, mids[2]-gap, mids[0]+gap, mids[1]+gap, mids[2]+gap]


def simplePlotter(data, data_first_smooth, data_interp, true_world_scale=True, first_img='', 
                  second_img='', image_title='', interpolation_points=None, show_image=True, 
                  PoIs=True, img_quality_good=True):
    """ This is the basic plotting function that plots an instance. Its parameters 
    are explained as follows:
    - data: the raw data as a loaded data frame (eg. pandas df)
    - data_first_smooth: the data after the first smoothing
    - data_interp: the interpolated data
    - true_world_scale: plots will be in a box with all sides the same size
    - first_img, second_img: the titles of the left and the right plot
    - image_title: the name of the image
    - interpolation_points: dict with the info of every interpolation point
    - show_image: controls whether images will be output at the ipython console
    - PoIs, img_quality_good: the points of interest are plotted only if both are true """
    
    # set the graph
    fig = plt.figure()
    fig.set_figwidth(12)
    fig.set_figheight(6)

    # set axes
    ax1 = fig.add_subplot(121, projection='3d')
    ax2 = fig.add_subplot(122, projection='3d', sharex=ax1)
    # plot axes
    ax1.plot(data['x'], data['y'], data['z'])
    ax1.plot(data_first_smooth['x'], data_first_smooth['y'], data_first_smooth['z'], 
             alpha=0.6, lw=1, c='g')
    ax2.plot(data_interp['x'], data_interp['y'], data_interp['z'])
    
    if interpolation_points:
        for _, point in interpolation_points.items():
            # get the info of the interpolation point
            idx = point['index']
            cc = point['color code']
            # plot the interpolation point in ax1
            ax1.scatter(data['x'][idx], data['y'][idx], data['z'][idx], color=cc, s=20, alpha=0.55)
            if cc=='r':
                # plot the interpolation point in ax2
                ax2.scatter(data_interp['x'][idx], data_interp['y'][idx], data_interp['z'][idx], 
                            color=cc, s=20, alpha=0.55)
            # plot the points used to find it
            p1, p2 = point['closest 2 indices(smooth_data)']
            ax1.scatter(data_first_smooth['x'][p1], 
                        data_first_smooth['y'][p1], 
                        data_first_smooth['z'][p1], color="g", s=20, alpha=0.35)
            ax1.scatter(data_first_smooth['x'][p2], 
                        data_first_smooth['y'][p2], 
                        data_first_smooth['z'][p2], color="g", s=20, alpha=0.35)
   
    if PoIs and img_quality_good:
    
        # finds the indices for all points
        iA, iB, iC, iD, iE, iF, iG, iH = findIndices(data_interp)

        # finds the points 
        A, B, C, D, E, F, G, H = findPoints(data_interp, iA, iB, iC, iD, iE, iF, iG, iH)
    
        # plot points of interest
        ax2.scatter(*A, color="g", s=100, alpha=0.65)  #A (starting point)
        ax2.scatter(*B, color="y", s=100, alpha=0.65)  #B
        ax2.scatter(*C, color="y", s=100, alpha=0.65)  #C
        ax2.scatter(*D, color="y", s=100, alpha=0.65)  #D
        ax2.scatter(*E, color="y", s=100, alpha=0.65)  #E
        ax2.scatter(*F, color="y", s=100, alpha=0.65)  #F
        ax2.scatter(*G, color="y", s=100, alpha=0.65)  #G
        ax2.scatter(*H, color="r", s=100, alpha=0.65)  #H (ending point)

    # adjust the space in-between the axes
    plt.subplots_adjust(wspace=0, hspace=0)    
    
    # set titles
    ax1.set_title(first_img)
    ax2.set_title(second_img)
    # figure title
    plt.suptitle(image_title, fontsize=12)
      
    # label axes
    ax1.set_xlabel('x', linespacing=1)
    ax1.set_ylabel('y', linespacing=1)
    ax1.set_zlabel('z', linespacing=1)
    ax2.set_xlabel('x', linespacing=1)
    ax2.set_ylabel('y', linespacing=1)
    ax2.set_zlabel('z', linespacing=1)
  
    # set the axis according to world scale
    if true_world_scale:
        limits = findBorders(data['x'],data['y'],data['z'])
        #print(limits[3] - limits[0], limits[4] - limits[1], limits[5] - limits[2])
        ax1.set_xlim3d(limits[0], limits[3])
        ax1.set_ylim3d(limits[1], limits[4])
        ax1.set_zlim3d(limits[2], limits[5])
        ax2.set_xlim3d(limits[0], limits[3])
        ax2.set_ylim3d(limits[1], limits[4])
        ax2.set_zlim3d(limits[2], limits[5])
  
    # rotate for desired angle
    ax1.view_init(elev=90., azim=270)
    ax2.view_init(elev=90., azim=270)
    
    #ax1.tick_params(axis='x', labelrotation=45, direction='out')
    #fig.autofmt_xdate()
    #fig.tight_layout()
    ax1.locator_params(tight=True, axis='x', nbins=6)
    ax1.locator_params(tight=True, axis='y', nbins=6)
    ax1.locator_params(tight=True, axis='z', nbins=6)
    ax2.locator_params(tight=True, axis='x', nbins=6)
    ax2.locator_params(tight=True, axis='y', nbins=6)
    ax2.locator_params(tight=True, axis='z', nbins=6)
    plt.tight_layout()
   
    # show the figure in the notebook output
    if show_image:
        plt.show()
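
To make the arithmetic in `findBorders` concrete, here is a minimal sketch of the same computation in plain Python with a small worked example. The name `cube_limits`, the `padding` parameter and the sample coordinates are assumptions introduced for illustration; the returned order mirrors `findBorders` (three minimums followed by three maximums).

```python
def cube_limits(xs, ys, zs, padding=0.05):
    """Return [x_min, y_min, z_min, x_max, y_max, z_max] of a cube that
    is centred on the data and is as wide as the largest axis span plus padding."""
    mins = [min(xs), min(ys), min(zs)]
    maxs = [max(xs), max(ys), max(zs)]
    spans = [hi - lo for hi, lo in zip(maxs, mins)]
    mids = [lo + span / 2 for lo, span in zip(mins, spans)]
    # half the side of the cube: half the largest span plus a margin
    half_side = max(spans) / 2 + max(spans) * padding
    return [m - half_side for m in mids] + [m + half_side for m in mids]


# hypothetical data: x spans 10 units, y spans 4 and z spans 2
limits = cube_limits([0, 10], [0, 4], [1, 3])
sides = [hi - lo for lo, hi in zip(limits[:3], limits[3:])]
print(limits)
print(sides)
```

With these values the largest span is 10, so every side of the box comes out at 11 units and the shorter axes are centred inside it, which keeps the lock from looking stretched.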

In [12]:
def debug_show(data, image_title, true_world_scale=True):
    """ This function makes and shows the plot of only the raw data. """
    
    # set the graph
    fig = plt.figure()
    fig.set_figwidth(6)
    fig.set_figheight(6)
    
    # set axes
    ax = fig.add_subplot(111, projection='3d')
    # plot axes
    ax.plot(data['x'], data['y'], data['z'])
              
    # label axes
    ax.set_xlabel('x', linespacing=1)
    ax.set_ylabel('y', linespacing=1)
    ax.set_zlabel('z', linespacing=1)
  
    # set the axis according to world scale
    if true_world_scale:
        limits = findBorders(data['x'],data['y'],data['z'])
        #print(limits[3] - limits[0], limits[4] - limits[1], limits[5] - limits[2])
        ax.set_xlim3d(limits[0], limits[3])
        ax.set_ylim3d(limits[1], limits[4])
        ax.set_zlim3d(limits[2], limits[5])
  
    # rotate for desired angle
    ax.view_init(elev=90., azim=270)
    
    ax.locator_params(tight=True, axis='x', nbins=6)
    ax.locator_params(tight=True, axis='y', nbins=6)
    ax.locator_params(tight=True, axis='z', nbins=6)
    plt.tight_layout()
    
    plt.show()


def saveRawDataPlot(data, saving_path, image_title, true_world_scale=True):
    """ This function makes and saves the plot of only the raw data. """
    
    # set the graph
    fig = plt.figure()
    fig.set_figwidth(6)
    fig.set_figheight(6)
    
    # set axes
    ax = fig.add_subplot(111, projection='3d')
    # plot axes
    ax.plot(data['x'], data['y'], data['z'])
              
    # label axes
    ax.set_xlabel('x', linespacing=1)
    ax.set_ylabel('y', linespacing=1)
    ax.set_zlabel('z', linespacing=1)
  
    # set the axis according to world scale
    if true_world_scale:
        limits = findBorders(data['x'],data['y'],data['z'])
        #print(limits[3] - limits[0], limits[4] - limits[1], limits[5] - limits[2])
        ax.set_xlim3d(limits[0], limits[3])
        ax.set_ylim3d(limits[1], limits[4])
        ax.set_zlim3d(limits[2], limits[5])
  
    # rotate for desired angle
    ax.view_init(elev=90., azim=270)
    
    ax.locator_params(tight=True, axis='x', nbins=6)
    ax.locator_params(tight=True, axis='y', nbins=6)
    ax.locator_params(tight=True, axis='z', nbins=6)
    plt.tight_layout()
    
    fig.savefig(saving_path + image_title + '_raw' + '.png')
    plt.close(fig)

    
def saveRawAndSmoothDataPlot(data, data_first_smooth, saving_path, image_title, 
                             interpolation_points, true_world_scale=True):
    """ This function makes and saves the plot of the raw and smoothed data together 
    with the interpolation points. """
    
    # set the graph
    fig = plt.figure()
    fig.set_figwidth(6)
    fig.set_figheight(6)
    
    # set axes
    ax = fig.add_subplot(111, projection='3d')
    # plot axes
    ax.plot(data['x'], data['y'], data['z'])
    ax.plot(data_first_smooth['x'], data_first_smooth['y'], data_first_smooth['z'], 
            alpha=0.6, lw=1, c='g')
    
    if interpolation_points:
        for _, point in interpolation_points.items():
            # get the info of the interpolation point
            idx = point['index']
            cc = point['color code']
            # plot the interpolation point
            ax.scatter(data['x'][idx], data['y'][idx], data['z'][idx], color=cc, s=20, alpha=0.55)
            # find the points used to calculate the interpolation
            p1, p2 = point['closest 2 indices(smooth_data)']
            # plot those points
            ax.scatter(data_first_smooth['x'][p1], 
                       data_first_smooth['y'][p1], 
                       data_first_smooth['z'][p1], color="g", s=20, alpha=0.35)
            ax.scatter(data_first_smooth['x'][p2], 
                       data_first_smooth['y'][p2], 
                       data_first_smooth['z'][p2], color="g", s=20, alpha=0.35)
          
    # label axes
    ax.set_xlabel('x', linespacing=1)
    ax.set_ylabel('y', linespacing=1)
    ax.set_zlabel('z', linespacing=1)
  
    # set the axis according to world scale
    if true_world_scale:
        limits = findBorders(data['x'],data['y'],data['z'])
        #print(limits[3] - limits[0], limits[4] - limits[1], limits[5] - limits[2])
        ax.set_xlim3d(limits[0], limits[3])
        ax.set_ylim3d(limits[1], limits[4])
        ax.set_zlim3d(limits[2], limits[5])
  
    # rotate for desired angle
    ax.view_init(elev=90., azim=270)
    
    ax.locator_params(tight=True, axis='x', nbins=6)
    ax.locator_params(tight=True, axis='y', nbins=6)
    ax.locator_params(tight=True, axis='z', nbins=6)
    plt.tight_layout()
    
    fig.savefig(saving_path + image_title + '_raw-smooth' + '.png')
    plt.close(fig)

    
def saveInterpDataPlot(data_interp, saving_path, image_title, interpolation_points, 
                       img_quality_good, true_world_scale=True, PoIs=True):
    """ This function makes and saves the plot of the interpolated data with their PoIs. """
    
    # set the graph
    fig = plt.figure()
    fig.set_figwidth(6)
    fig.set_figheight(6)
    
    # set axes
    ax = fig.add_subplot(111, projection='3d')
    # plot axes
    ax.plot(data_interp['x'], data_interp['y'], data_interp['z'])
    
    if interpolation_points:
        for _, point in interpolation_points.items():
            # get the info of the interpolation point
            idx = point['index']
            cc = point['color code']
            # plot only the red color-coded interpolation points
            if cc=='r':
                ax.scatter(data_interp['x'][idx], data_interp['y'][idx], data_interp['z'][idx], 
                           color=cc, s=20, alpha=0.55)
   
    if PoIs and img_quality_good:
    
        # finds the indices for all points
        iA, iB, iC, iD, iE, iF, iG, iH = findIndices(data_interp)

        # finds the points 
        A, B, C, D, E, F, G, H = findPoints(data_interp, iA, iB, iC, iD, iE, iF, iG, iH)
    
        # plot points of interest
        ax.scatter(*A, color="g", s=100, alpha=0.65)  #A (starting point)
        ax.scatter(*B, color="y", s=100, alpha=0.65)  #B
        ax.scatter(*C, color="y", s=100, alpha=0.65)  #C
        ax.scatter(*D, color="y", s=100, alpha=0.65)  #D
        ax.scatter(*E, color="y", s=100, alpha=0.65)  #E
        ax.scatter(*F, color="y", s=100, alpha=0.65)  #F
        ax.scatter(*G, color="y", s=100, alpha=0.65)  #G
        ax.scatter(*H, color="r", s=100, alpha=0.65)  #H (ending point)

    # label axes
    ax.set_xlabel('x', linespacing=1)
    ax.set_ylabel('y', linespacing=1)
    ax.set_zlabel('z', linespacing=1)
  
    # set the axis according to world scale
    if true_world_scale:
        limits = findBorders(data_interp['x'],data_interp['y'],data_interp['z'])
        #print(limits[3] - limits[0], limits[4] - limits[1], limits[5] - limits[2])
        ax.set_xlim3d(limits[0], limits[3])
        ax.set_ylim3d(limits[1], limits[4])
        ax.set_zlim3d(limits[2], limits[5])
  
    # rotate for desired angle
    ax.view_init(elev=90., azim=270)
    
    ax.locator_params(tight=True, axis='x', nbins=6)
    ax.locator_params(tight=True, axis='y', nbins=6)
    ax.locator_params(tight=True, axis='z', nbins=6)
    plt.tight_layout()
    
    fig.savefig(saving_path + image_title + '_interpolated' + '.png')
    plt.close(fig)

***

### Lock Selection Process

The following cell requires user input. Each lock is displayed in turn, and the user has to type 'y' or 'n' (a single letter). If 'y', the pre-processed data is written to a new file and the image is saved with the accepted images, and some logs are kept. If 'n', the lock is left out and its image is saved with the rejected images.
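
The prompt-until-valid pattern described above can be sketched as a small helper. This is only an illustration: the name `ask_keep` and its `read` parameter are hypothetical and not part of the notebook, and unlike the notebook's loop it also tolerates surrounding whitespace in the answer. Passing the reader in as an argument lets the logic be tried out without a keyboard.

```python
def ask_keep(prompt="Is it a lock (Y/N)? ", read=input):
    """Ask until the answer is 'y' or 'n' (case-insensitive).
    Returns True for 'y' and False for 'n'."""
    while True:
        answer = read(prompt).strip().lower()
        # compare by value; any other answer simply triggers another prompt
        if answer in ("y", "n"):
            return answer == "y"


# usage with scripted answers standing in for the keyboard
answers = iter(["maybe", "", "Y"])
keep = ask_keep(read=lambda _prompt: next(answers))
print(keep)  # True
```

Returning a boolean keeps the accept and reject branches of the loop free of string comparisons.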

In [13]:
# reserve the first 26 entries for the summary stats that are filled in after the loop
logs = ['' for x in range(26)]

lock_file_names_kept, lock_file_names_dropped = [], []
front_chops_per_lock, duplicates_per_lock, interpolations_per_lock = [], [], []

decided = False

back_chop = 50

for file in lock_file_names:
    
    # load the file
    data_raw = pd.read_csv(source_path+file)
    
    # do the pre-processing steps to the data and pass the variables for logging
    data_raw, data_first_smooth, data_interp, interpolation_points, \
    front_chop , duplicate_count, img_quality_good = processData(data_raw, flat_back_trim=back_chop)
    #print(data_interp[:3])
    # show raw&smooth:left - interpolated:right
    simplePlotter(data_raw, data_first_smooth, data_interp, 
                  first_img='Non-interpolated data & on top of first Smoothing',
                  second_img='Interpolated data (with highlights) & PoIs', 
                  interpolation_points=interpolation_points,
                  image_title=file, 
                  img_quality_good=img_quality_good)
    
    # loop until a valid answer is given
    while not decided:
        
        # get the key
        key = input("Is it a lock (Y/N)? ")
        # re-evaluate the loop condition (compare by value, not by identity)
        decided = key in ('Y', 'y', 'N', 'n')
    
    # reset the control key for next loop
    decided = False
    
    if key in ('Y', 'y'):
        
        # add file to kept
        lock_file_names_kept.append(file)

        inter_counter = 0

        for _, point in interpolation_points.items():
            
            if point['color code']=='r':
            
                # increment the counter if the point is red color coded
                inter_counter += 1

        # add the interpolated points counter
        interpolations_per_lock.append(inter_counter)
        
        # add the front chop to the front chops
        front_chops_per_lock.append(front_chop)
    
        # add the duplicate count to the duplicates
        duplicates_per_lock.append(duplicate_count)
    
        # make logs
        logs.append('In {}, the frontchop was {} rows and the backchop {}; there were also found {} duplicates, and {} points were interpolated.'
                    .format(file, front_chop, back_chop, duplicate_count, inter_counter))

        # save the file into new location
        data_interp.to_csv(target_path+file, index=False)
        
        # save the images
        saveRawDataPlot(data_raw, img_saving_path_accepted, file)
        saveRawAndSmoothDataPlot(data_raw, data_first_smooth, img_saving_path_accepted, 
                                 file, interpolation_points)
        saveInterpDataPlot(data_interp, img_saving_path_accepted, file, 
                           interpolation_points, img_quality_good)

    else:
        
        # add file to dropped
        lock_file_names_dropped.append(file)

        # save the images
        saveRawDataPlot(data_raw, img_saving_path_rejected, file)
        saveRawAndSmoothDataPlot(data_raw, data_first_smooth, img_saving_path_rejected, 
                                 file, interpolation_points)
        saveInterpDataPlot(data_interp, img_saving_path_rejected, file, 
                           interpolation_points, img_quality_good)
    
    # clear all output for next plot
    clear_output()
    
    #break

print("There are {} images that are kept out of the selection process."
      .format(len(lock_file_names_kept)))
print("There are {} images that are dropped out of the selection process."
      .format(len(lock_file_names_dropped)))

There are 78 images that are kept out of the selection process.
There are 15 images that are dropped out of the selection process.
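
The per-lock count of interpolated points in the selection loop can also be written as a single expression. The sketch below assumes the same dictionary layout the loop reads (a `'color code'` entry per point); the helper name `count_red_points`, the sample dictionary and the `'b'` color code are made up for illustration.

```python
def count_red_points(interpolation_points):
    """Count the interpolation points that are red color coded ('r')."""
    return sum(1 for point in interpolation_points.values()
               if point['color code'] == 'r')


# hypothetical sample: two red points and one with another color code
sample_points = {
    0: {'index': 12, 'color code': 'r'},
    1: {'index': 40, 'color code': 'b'},
    2: {'index': 57, 'color code': 'r'},
}

count = count_red_points(sample_points)
print(count)  # 2
```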


#### Write out some logs for the process

Some values were saved while the selection loop ran, such as the names of the kept and dropped files. Others can be computed quickly, such as statistics on the front chop, the duplicates, and the interpolations per lock. Writing them to a logs csv preserves all of that.
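
The summary repeats the same five statistics (mean, std, median, min, max) for each list, so the lines could be generated by one helper. The following is a minimal sketch under stated assumptions: `summarize` is a hypothetical name, it uses the standard library `statistics` module instead of numpy, and it uses `pstdev` because `np.std` computes the population standard deviation by default. The sample values are invented.

```python
import statistics


def summarize(name, values):
    """Build the five log lines (mean, std, median, min, max) for one list."""
    return [
        "{} mean: {}".format(name, statistics.mean(values)),
        # population standard deviation, matching the default of np.std
        "{} std: {}".format(name, statistics.pstdev(values)),
        "{} median: {}".format(name, statistics.median(values)),
        "{} min: {}".format(name, min(values)),
        "{} max: {}".format(name, max(values)),
    ]


# hypothetical front chop values for three kept locks
lines = summarize("front chop", [12, 18, 24])
for line in lines:
    print(line)
```

Calling the helper once per list would produce the three blocks of the summary with identical wording, so a typo in one label cannot drift away from the others.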

In [14]:
num_files_kept = len(lock_file_names_kept)

num_kept_and_interpolated = np.count_nonzero(interpolations_per_lock)

perc_interpolated = num_kept_and_interpolated / num_files_kept * 100

logs[0] = "Some stats for the distributions of the front_chop, the duplicates and the number of interpolations per lock in the files that have been kept are:"
logs[1] = "front chop mean: {}".format(np.mean(front_chops_per_lock))
logs[2] = "front chop std: {}".format(np.std(front_chops_per_lock))
logs[3] = "front chop median: {}".format(np.median(front_chops_per_lock))
logs[4] = "front chop min: {}".format(np.min(front_chops_per_lock))
logs[5] = "front chop max: {}".format(np.max(front_chops_per_lock))
logs[6] = " - - - - - "
logs[7] = "duplicates mean: {}".format(np.mean(duplicates_per_lock))
logs[8] = "duplicates std: {}".format(np.std(duplicates_per_lock))
logs[9] = "duplicates median: {}".format(np.median(duplicates_per_lock))
logs[10] = "duplicates min: {}".format(np.min(duplicates_per_lock))
logs[11] = "duplicates max: {}".format(np.max(duplicates_per_lock))
logs[12] = " - - - - - "
logs[13] = "interpolations mean: {}".format(np.mean(interpolations_per_lock))
logs[14] = "interpolations std: {}".format(np.std(interpolations_per_lock))
logs[15] = "interpolations median: {}".format(np.median(interpolations_per_lock))
logs[16] = "interpolations min: {}".format(np.min(interpolations_per_lock))
logs[17] = "interpolations max: {}".format(np.max(interpolations_per_lock))
logs[18] = "---------------------------------------------------------------------------------------"
logs[19] = "The criterion to keep or drop a lock was its 'lockyness', no matter some sporadic offshoot values \
that are clear glitches and can be interpolated."
logs[20] = "number of locks kept: {}".format(num_files_kept)
logs[21] = "number of locks dropped: {}".format(len(lock_file_names_dropped))
logs[22] = " - - - - - "
logs[23] = "From the files that have been kept, in {} out of the {} cases ({:.2f}%) we interpolated one or more points.".format(num_kept_and_interpolated, num_files_kept, perc_interpolated)
logs[24] = "---------------------------------------------------------------------------------------"
logs[25] = "The front-back chops and the duplicates in each file separately are:"

# put the list into a series
logs = pd.Series(logs, index=range(len(logs)))

print("The log file is as follows:\n")

display(logs)

# write the logs to a csv
logs.to_csv(target_path+'logs/logs.csv', index=False)

The log file is as follows:



0      Some stats for the distributions of the front_...
1                    front chop mean: 18.564102564102566
2                      front chop std: 3.168716247338545
3                                front chop median: 18.0
4                                     front chop min: 12
5                                     front chop max: 33
6                                             - - - - - 
7                     duplicates mean: 4.987179487179487
8                      duplicates std: 3.894419379203745
9                                 duplicates median: 4.0
10                                     duplicates min: 1
11                                    duplicates max: 20
12                                            - - - - - 
13              interpolations mean: 0.34615384615384615
14                interpolations std: 0.5734940832396165
15                            interpolations median: 0.0
16                                 interpolations min: 0
17                             