# Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders

Companion demonstration code to the paper: *Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders*.

This code demonstrates the analytical part of the paper: the classification of trajectories with a *single* gap.

The code is written to rely on as few dependencies as possible and to demonstrate the classification of trajectories described in the paper. It is *not* production-level or performant code, nor a clean library.

The code comes with a small set of demonstration data. These are artificial data, not the data used in the paper itself, and they are only suited to demonstrating the functionality of this code.



## Theory at a glance

The paper *Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders* proposes a simple conceptual model of incomplete trajectories, i.e., trajectories with gaps (each captured in a tracked *session*):
        
![Schematic of trajectory with gap](figs/trajectory_annot.png)

The code demonstrated here classifies trajectories by analysing the pre-gap, gap, and post-gap sections of the trajectories, evaluating whether the tracked object *moved* (**M**) or *did not move* (**N**) in these respective trajectory parts.

This classification results in eight characteristic sequences (*MMM, MMN, MNM, NMM, NNN, NNM, NMN, MNN*), that form the following conceptual neighbourhoods:

![Conceptual neighbourhood](figs/conceptualNeighbourhood.png)
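
As a quick illustration of where the eight sequences come from, the sketch below enumerates every combination of *moved* (M) and *did not move* (N) over the three trajectory parts. It uses only the Python standard library; the variable name `sequences` is introduced here for illustration and is not part of the companion code.

```python
from itertools import product

# Each of the three parts (pre-gap, gap, post-gap) is either M (moved) or N (did not move).
sequences = ["".join(parts) for parts in product("MN", repeat=3)]

print(sequences)
print(len(sequences))  # 2 ** 3 = 8 characteristic sequences
```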

## Relating gapped trajectories to information needs

To apply the findings of the paper -- i.e., to relate the presence of gaps to information needs -- an analyst must be certain that two assumptions are met:

- The presence of a gap in the trajectory must be relatable to information use (in our paper, the user was not tracked when the screen was off). In other data sources this may be reflected differently, for instance as a decrease in sampling frequency rather than a real gap in the trajectory. The code below would need to be modified to analyse such data sources.
- The gaps must be reliably identifiable as active gaps, i.e., technical failures must be effectively filtered out.


## Code assumptions

This code is designed to run with data in the following format. 

| time     | sessionID | x | y | 
| -------------:| -------------:| ---:| ---:|
| 2000-01-01 01:00:00 | 1 | 100 | 100 |

All coordinates are assumed to be in a planar coordinate system for the code to work. This is easily achieved by projecting GPS lon/lat coordinates to the desired system, or by computing distances on the spheroid instead (not demonstrated here, to reduce dependencies).
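
For data that are still in lon/lat, one dependency-light option is a spherical (Haversine) distance. The sketch below is an assumption-laden stand-in and not part of the companion code: the function name `haversine_dist`, the column names `lon` and `lat`, and the mean Earth radius of 6,371 km are all choices made here for illustration. It returns distances between consecutive rows in metres.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius (spherical approximation)

def haversine_dist(df, lon_col="lon", lat_col="lat"):
    """Great-circle distance in metres between consecutive rows (hypothetical helper)."""
    lon = np.radians(df[lon_col])
    lat = np.radians(df[lat_col])
    dlon = lon.diff()
    dlat = lat.diff()
    # Haversine formula; lat.shift() is the latitude of the previous row
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.shift()) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Two points on the equator, one degree of longitude apart
demo = pd.DataFrame({"lon": [0.0, 1.0], "lat": [0.0, 0.0]})
print(haversine_dist(demo))  # first value is NaN, second is roughly 111 km
```

As with the Euclidean version, the first row of each session has no predecessor and yields NaN.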

Here, we read the dataset from the *"/data"* subfolder into a pandas data frame.

In [1]:
### Library imports

import os # to manipulate paths
import numpy as np
import pandas as pd # Pandas data frames
import math

In [2]:
### Function declarations

def dist(df, x_col="x",y_col="y"):
    """Simple distance function on data frame
    Euclidean distance,
    replace with a Haversine or other function 
    if needed to apply on lon,lat
    """
    return np.sqrt(df[x_col].diff()**2 + df[y_col].diff()**2)


In [3]:
### Read data
input_file_name = "gap_trajs_input_data.csv"

# Change this if you are not using the "data" folder for your input dataset
input_file = os.path.join(os.getcwd(),'data',input_file_name)
os.path.exists(input_file) # test that your path is correct

# read csv file into a pandas data frame, with header (first row)
data = pd.read_csv(input_file, dtype={14:str}, header=0)
data.head()


|   | sessionID | time | x | y |
| ---:| ---:| ---:| ---:| ---:|
| 0 | s82 | 2016-09-11 23:42:22 | 517124.889294 | 202226.341991 |
| 1 | s82 | 2016-09-11 23:42:27 | 517119.144369 | 202206.67101 |
| 2 | s82 | 2016-09-11 23:42:32 | 517119.144369 | 202206.67101 |
| 3 | s82 | 2016-09-11 23:42:37 | 517119.210414 | 202207.778931 |
| 4 | s82 | 2016-09-11 23:42:42 | 517119.210414 | 202207.778931 |


## Data processing pipeline

The pre-processing loads the data (as CSV) and performs the following steps:

1. **Adding time differences** Computes time differences between consecutive observations (timeDiff, negtimeDiff).
2. **Adding distance differences** Computes the distances between consecutive observations (dist) *based on Euclidean distance* (hence, requires projected data in this demo code).
3. **Session annotation** Flags whether an observation is the start or end of a session (sessionStart, sessionEnd).
4. **Gap annotation** Identifies whether an observation is the start or end of a gap, based on the defined gap durations (gapStart, gapEnd).
5. **Cleanup** Resolves conflicts between gapStart and sessionStart, and between gapEnd and sessionEnd.
6. **Session classification** Classifies sessions into the eight types from the conceptual model, by movement in the pre-gap, gap and post-gap periods.

The code below produces two datasets: a session-level summary (the result of the classification, written to CSV at the end), and an observation-level table holding the parameters that enable the classification.


## Preprocessing step 


### Time and distance differences, session starts and ends

In [4]:
## convert timestamps to a timestamp format
data["time"] = pd.to_datetime(data["time"]) # cast time stamp strings to pandas datetime

## compute time differences between consecutive observations, by session
# this is computed in two directions, from start and from end of session.
data["timeDiff"] = data.sort_values(['sessionID','time']).groupby("sessionID")["time"].diff()/np.timedelta64(1,"s")/60 # time difference to the previous fix, in minutes
data["negtimeDiff"] = data.sort_values(['sessionID','time']).groupby("sessionID")["time"].diff(periods=-1)/np.timedelta64(1,"s")/60 # time difference to the next fix, in minutes (negative values)

## compute distance differences between consecutive observations, by session
# this is where having planar data is important
data["dist"] = data.groupby(['sessionID'], group_keys=False).apply(dist)

## Annotate sessions - start and end of session
# first, set the entire column to a default value
data['sessionStart'] = False
data['sessionEnd'] = False

# then, set the first and last observation of the session to Start/End respectively
data.loc[data.groupby("sessionID").head(1).index, 'sessionStart'] = True
data.loc[data.groupby("sessionID").tail(1).index, 'sessionEnd'] = True

# Note - we do not include code to robustly handle edge cases, such as single-observation sessions.
# This needs to be added if required.

## Verify that you have identified session starts and ends correctly
# The number of session ends and starts should match

print(len(data[data["timeDiff"].isna()]))
print(len(data[data["sessionEnd"].eq(True)]))
print(len(data[data["sessionStart"].eq(True)]))

10
10
10


## Identify gaps in the dataset

In this step, we identify gaps based on threshold gap durations. Here, we demonstrate the use of three thresholds: 10 s, 1 min and 3 min.
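
To make the thresholding concrete, here is a minimal sketch on a toy session. The timestamps and the data frame name `toy` are made up for illustration and are not part of the demonstration data. Because the thresholds are applied in increasing order, each observation ends up labelled with the largest threshold that its time difference exceeds.

```python
import pandas as pd

# Toy single-session data (hypothetical values)
toy = pd.DataFrame({
    "sessionID": ["a"] * 5,
    "time": pd.to_datetime([
        "2000-01-01 00:00:00",
        "2000-01-01 00:00:05",  # 5 s after the previous fix: no gap
        "2000-01-01 00:00:30",  # 25 s: exceeds the 10 s threshold
        "2000-01-01 00:02:30",  # 2 min: exceeds the 1 min threshold
        "2000-01-01 00:07:30",  # 5 min: exceeds the 3 min threshold
    ]),
})

# Time difference to the previous fix, in minutes (NaN for the first fix)
toy["timeDiff"] = toy.groupby("sessionID")["time"].diff().dt.total_seconds() / 60

# Larger thresholds overwrite smaller ones
for gap_duration in [10 / 60, 1, 3]:
    toy.loc[toy["timeDiff"] > gap_duration, "gapEnd"] = gap_duration

print(toy[["time", "timeDiff", "gapEnd"]])
```

The first two rows keep a missing `gapEnd`; the remaining three are labelled with the 10 s, 1 min and 3 min thresholds respectively.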
    

In [5]:
## Gap durations in minutes. 
# Consider experimenting with values.

gap_durations = [10/60,1,3] 

In [6]:
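## Identify gap ends and gap starts,
# based on the time differences exceeding the gap duration thresholds.
# Thresholds are applied in increasing order, so each gap keeps the largest threshold it exceeds.
for gap_duration in gap_durations:
    # timeDiff looks back to the previous fix, so it marks the observation that ends a gap
    data.loc[abs(data.timeDiff) > gap_duration, 'gapEnd'] = gap_duration
    # negtimeDiff looks ahead to the next fix (negative values, hence abs), so it marks the gap start
    data.loc[abs(data.negtimeDiff) > gap_duration, 'gapStart'] = gap_duration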

## Session summaries

Here, we summarize each session: its start and end times, and the number of gaps it contains per duration threshold.

In [7]:
## Summarize sessions 
# start by a copy of sessionIDs and sessionStarts 
session_summaries = data[data.sessionStart.eq(True)][["sessionID","time"]].copy()
session_summaries = session_summaries.rename(columns={"time":"sessionStartTime"})

# add session Ends
session_ends = data[data.sessionEnd.eq(True)][["sessionID","time"]]
session_summaries = session_summaries.merge(session_ends, on="sessionID")
session_summaries = session_summaries.rename(columns={"time":"sessionEndTime"})

## count number of gaps, by type of duration
gaps10s = data[data['gapStart'] == 10/60].groupby('sessionID')['gapStart'].count()
gaps10s.name= "n10sGaps"

gaps1 = data[data['gapStart'] == 1].groupby('sessionID')['gapStart'].count().astype('int64')
gaps1.name= "n1minGaps"

gaps3 = data[data['gapStart'] == 3].groupby('sessionID')['gapStart'].count().astype('int64')
gaps3.name= "n3minGaps"

# merge the results to session summaries
gaps10s = pd.DataFrame(gaps10s)
gaps1 = pd.DataFrame(gaps1)
gaps3 = pd.DataFrame(gaps3)
session_summaries = session_summaries.merge(gaps10s, on="sessionID", how='left').merge(gaps1, on="sessionID", how='left').merge(gaps3, on="sessionID", how='left')

# sanity check
no_gap_sessions = len(session_summaries[session_summaries['n10sGaps'].isna() & session_summaries['n1minGaps'].isna() & session_summaries['n3minGaps'].isna()])
total_by_gaps = len(session_summaries[session_summaries['n10sGaps'].notna() | session_summaries['n1minGaps'].notna() | session_summaries['n3minGaps'].notna()])
overall_sessions = data["sessionID"].nunique()
match = (no_gap_sessions+total_by_gaps)==overall_sessions
print(f'Sessions without gaps {no_gap_sessions}, with gaps {total_by_gaps}, total sessions {overall_sessions}, counts match: {match} .')

session_summaries.sort_values("sessionStartTime")
session_summaries


Sessions without gaps 1, with gaps 9, total sessions 10, counts match: True .


|   | sessionID | sessionStartTime | sessionEndTime | n10sGaps | n1minGaps | n3minGaps |
| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 0 | s82 | 2016-09-11 23:42:22 | 2016-09-11 23:44:07 | NaN | NaN | NaN |
| 1 | s06 | 2016-09-11 06:53:13 | 2016-09-11 07:29:41 | 5.0 | 2.0 | NaN |
| 2 | s61 | 2016-09-11 03:03:05 | 2016-09-11 05:48:10 | NaN | NaN | 1.0 |
| 3 | s02 | 2016-09-11 06:38:21 | 2016-09-11 07:54:52 | NaN | NaN | 1.0 |
| 4 | s30 | 2016-09-11 20:40:50 | 2016-09-11 23:59:58 | NaN | NaN | 1.0 |
| 5 | s63 | 2016-09-11 03:47:43 | 2016-09-11 04:18:25 | NaN | NaN | 1.0 |
| 6 | s24 | 2016-09-11 06:28:54 | 2016-09-11 07:31:57 | NaN | NaN | 1.0 |
| 7 | s12 | 2016-09-11 04:34:36 | 2016-09-11 05:43:51 | NaN | NaN | 1.0 |
| 8 | s1 | 2016-09-11 22:41:01 | 2016-09-11 22:52:55 | NaN | NaN | 1.0 |
| 9 | s25 | 2016-09-11 11:49:27 | 2016-09-11 12:02:46 | NaN | NaN | 1.0 |


## Classify sessions

In [8]:
def getCategory(distB,distG,distA,threshold):
    """Classify a session by the distance covered before, during and after the gap.
    Each part counts as moved (M) if its distance exceeds the threshold, else as not moved (N).
    Uses the 8 classes of sessions from the paper, plus a NoGap class.
    """
    definedClass = ""
    if math.isnan(distB):
        distB = 0
    if math.isnan(distG):
        distG = 0
    if math.isnan(distA):
        distA = 0
    distB = float(distB)
    distA = float(distA)
    distG = float(distG)
    if (distB == 0 and distG == 0 and distA == 0):
        definedClass = "NoGap"
    elif (distB > threshold and distG > threshold and distA > threshold):
        definedClass = "MMM"
    elif (distB > threshold and distG > threshold and distA <= threshold):
        definedClass = "MMN"
    elif (distB > threshold and distG <= threshold and distA > threshold):
        definedClass = "MNM"
    elif(distB > threshold and distG <= threshold and distA <= threshold):
        definedClass = "MNN"
    elif(distB <= threshold and distG > threshold and distA > threshold):
        definedClass = "NMM"
    elif(distB <= threshold and distG > threshold and distA <= threshold):
        definedClass = "NMN"
    elif(distB <= threshold and distG <= threshold and distA > threshold):
        definedClass = "NNM"
    elif(distB <= threshold and distG <= threshold and distA <= threshold):
        definedClass = "NNN"
    return definedClass

### We only classify sessions that contain a single gap

Here, we use sessions with a single 3 min gap, and a distance threshold of 200 m.
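
The thresholding rule can be written compactly, as in the sketch below. The helper `label_parts` and the distances in the example are hypothetical and only illustrate the rule: a part counts as *moved* (M) when its distance exceeds the threshold, and as *not moved* (N) otherwise. Unlike `getCategory`, this sketch does not handle missing values or the no-gap case.

```python
def label_parts(dist_before, dist_gap, dist_after, threshold=200.0):
    """Hypothetical helper: label pre-gap, gap and post-gap parts as M or N."""
    parts = (dist_before, dist_gap, dist_after)
    return "".join("M" if d > threshold else "N" for d in parts)

# Moved 3500 m before the gap, 5 m during the gap, 4800 m after it
print(label_parts(3500.0, 5.0, 4800.0))  # MNM
```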

In [9]:
## Parameters for classification
gap_duration = 3 # gap duration threshold, in minutes
threshold = 200 # distance threshold, in metres

In [10]:
## Execute session classification
# requires the original data frame with time  differences, and the summary data

# filter only sessions with a single gap, 1x 3 mins gap
sessions_3min = session_summaries[session_summaries['n3minGaps']==1]

# identify data for these sessions
data_3mingap = data[data['gapEnd']==gap_duration]
data_3mingap = data_3mingap.rename(columns={'time': 'gapEndTime'})
data_3min = pd.merge(data,data_3mingap[['sessionID','gapEndTime']],on='sessionID', how='left')

# to be sure that we are working with date times.
data_3min["time"] = pd.to_datetime(data_3min["time"])
data_3min["gapEndTime"] = pd.to_datetime(data_3min["gapEndTime"])
data_3min_filtered = data_3min[data_3min['gapEndTime'].notna()]

## Extract and summarize parts of sessions before and after gaps
# filter
data_3min_before_gap = data_3min_filtered[data_3min_filtered['time'] < data_3min_filtered['gapEndTime']]
data_3min_after_gap = data_3min_filtered[data_3min_filtered['time'] > data_3min_filtered['gapEndTime']]

# summarize time before/after gap
before_gap_summaries = data_3min_before_gap[['sessionID','timeDiff','dist']].groupby(['sessionID']).sum()
before_gap_summaries = before_gap_summaries.rename(columns={'timeDiff': 'minuteB', 'dist': 'distB'})
after_gap_summaries = data_3min_after_gap[['sessionID','timeDiff','dist']].groupby(['sessionID']).sum()
after_gap_summaries = after_gap_summaries.rename(columns={'timeDiff': 'minuteA', 'dist': 'distA'})

## Produce final summaries
session_summaries2 = session_summaries.merge(data_3mingap[['sessionID','timeDiff','dist']],on='sessionID', how='left')
session_summaries2 = session_summaries2.rename(columns={'timeDiff': 'minuteG', 'dist': 'distG'})
session_summaries_full = session_summaries2.merge(before_gap_summaries,on='sessionID', how='left').merge(after_gap_summaries,on='sessionID', how='left')

## Classify sessions
# note: arguments must follow the order of getCategory's signature (before, gap, after)
session_summaries_full['types'] = session_summaries_full.apply(lambda x: getCategory(x['distB'],x['distG'],x['distA'],threshold), axis=1)
session_summaries_full

|   | sessionID | sessionStartTime | sessionEndTime | n10sGaps | n1minGaps | n3minGaps | minuteG | distG | minuteB | distB | minuteA | distA | types |
| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 0 | s82 | 2016-09-11 23:42:22 | 2016-09-11 23:44:07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NoGap |
| 1 | s06 | 2016-09-11 06:53:13 | 2016-09-11 07:29:41 | 5.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NoGap |
| 2 | s61 | 2016-09-11 03:03:05 | 2016-09-11 05:48:10 | NaN | NaN | 1.0 | 158.583333 | 10176.847114 | 0.75 | 582.941837 | 5.75 | 6956.011968 | MMM |
| 3 | s02 | 2016-09-11 06:38:21 | 2016-09-11 07:54:52 | NaN | NaN | 1.0 | 31.933333 | 13.782077 | 43.75 | 8.417493 | 0.833333 | 48.234331 | NNN |
| 4 | s30 | 2016-09-11 20:40:50 | 2016-09-11 23:59:58 | NaN | NaN | 1.0 | 24.75 | 22182.516079 | 12.766667 | 17255.702635 | 161.616667 | 538.538022 | MMM |
| 5 | s63 | 2016-09-11 03:47:43 | 2016-09-11 04:18:25 | NaN | NaN | 1.0 | 13.5 | 5.2426 | 7.016667 | 3576.77409 | 10.183333 | 4804.656319 | MNM |
| 6 | s24 | 2016-09-11 06:28:54 | 2016-09-11 07:31:57 | NaN | NaN | 1.0 | 32.516667 | 19.593015 | 29.783333 | 9935.276975 | 0.75 | 72.408479 | MNN |
| 7 | s12 | 2016-09-11 04:34:36 | 2016-09-11 05:43:51 | NaN | NaN | 1.0 | 18.716667 | 1539.664485 | 3.75 | 11.316544 | 46.783333 | 51237.192912 | NMM |
| 8 | s1 | 2016-09-11 22:41:01 | 2016-09-11 22:52:55 | NaN | NaN | 1.0 | 11.55 | 1826.899355 | 0.233333 | 167.392846 | 0.116667 | 8.093199 | NMN |
| 9 | s25 | 2016-09-11 11:49:27 | 2016-09-11 12:02:46 | NaN | NaN | 1.0 | 8.35 | 13.435171 | 0.05 | 11.134682 | 4.916667 | 4171.81668 | NNM |


## Write output

In [11]:
output_file_name = "session_annot.csv"
output_file = os.path.join(os.getcwd(),"data",output_file_name)
session_summaries_full.to_csv(output_file, index=False)

In [12]:
pd.__version__

'1.2.4'

In [13]:
np.__version__

'1.20.2'