# Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders

Companion demonstration code to the paper: *Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders*.

This code demonstrates the analytical part that enables the classification of trajectories with a *single* gap.

This code is devised to rely on a minimal number of dependencies and to demonstrate the classification of trajectories described in the paper. It is *not* production-level, performant code, nor a clean library.

The code comes with a small set of demonstration data. These data are artificial, not the data used in the paper itself, and are only suited to demonstrating the functionality of this code.



## Theory at a glance

The paper *Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders* proposes a simple conceptual model of incomplete trajectories, i.e., trajectories with gaps, each captured in a tracked *session*:
        
![Schematic of trajectory with gap](figs/trajectory_annot.png)

The code demonstrated here classifies trajectories by analysing the pre-gap, gap, and post-gap sections of the trajectories, evaluating whether the tracked object *moved* (**M**) or *did not move* (**N**) in these respective trajectory parts.

This classification results in eight characteristic sequences (*MMM, MMN, MNM, NMM, NNN, NNM, NMN, MNN*), which form the following conceptual neighbourhoods:

![Conceptual neighbourhood](figs/conceptualNeighbourhood.png)
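
As a minimal sketch of the idea, the eight sequences can be enumerated and a single session can be labelled from three distances. The function name `movement_sequence` and the example distances are hypothetical and only serve as illustration; the threshold of 200 map units mirrors the value used in the classification cell further below.

```python
from itertools import product


def movement_sequence(dist_before, dist_gap, dist_after, threshold):
    """Return a three-letter code for the pre-gap, gap and post-gap parts.

    A part is labelled M (moved) if its distance exceeds the threshold,
    otherwise N (did not move). Hypothetical helper, for illustration only.
    """
    parts = (dist_before, dist_gap, dist_after)
    return "".join("M" if d > threshold else "N" for d in parts)


# All eight characteristic sequences of the conceptual model
all_sequences = ["".join(p) for p in product("MN", repeat=3)]
print(all_sequences)

# Made-up distances: movement before and after the gap, none during the gap
example = movement_sequence(950.0, 12.0, 430.0, threshold=200)
print(example)
```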

## Relating gapped trajectories to information needs

To apply the findings of the paper -- i.e., to relate the presence of gaps to information needs -- an analyst must be certain that two assumptions are met:

- The presence of a gap in the trajectory must be relatable to information use (in our paper, the user was not tracked when the screen was off). In other data sources this may manifest differently, for instance as a decrease in sampling frequency rather than a real gap in the trajectory. The code below would need to be modified to analyse such data sources.
- The gaps must be reliably identifiable as active gaps, i.e., technical failures must be effectively filtered out.


## Code assumptions

This code is designed to run with data in the following format. All coordinates are assumed to be in a planar (projected) coordinate system. This is easily achieved by projecting GPS lon/lat coordinates to the desired system, or by replacing the distance function with a spherical or spheroidal distance computation (not demonstrated here, to reduce dependencies).

| time | sessionID | x | y |
| -------------:| -------------:| ---:| ---:|
| 2000-01-01 01:00:00 | 1 | 100 | 100 |
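
If the data are only available as lon/lat, the planar assumption could be sidestepped with a great-circle distance. The following is a sketch under stated assumptions, not part of the original demo: the function name `haversine_dist`, the column names `lon` and `lat`, and the use of a spherical Earth with a mean radius are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (spherical approximation)


def haversine_dist(df, lon_col="lon", lat_col="lat"):
    """Great-circle distance in metres between consecutive rows.

    Hypothetical drop-in alternative to the Euclidean dist() of the demo.
    The first row has no predecessor and yields NaN.
    """
    lon = np.radians(df[lon_col])
    lat = np.radians(df[lat_col])
    dlon = lon.diff()
    dlat = lat.diff()
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat.shift()) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Made-up coordinates: one degree north, then one degree east
demo = pd.DataFrame({"lon": [0.0, 0.0, 1.0], "lat": [0.0, 1.0, 1.0]})
demo["dist"] = haversine_dist(demo)
print(demo)
```

One degree of latitude corresponds to roughly 111 km, which gives a quick plausibility check for the second row.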

## Data pre-processing

The pre-processing loads the data (as CSV) and performs the following steps:

1. **Adding time differences** Computes time differences between consecutive observations (`timeDiff`, `negtimeDiff`).
2. **Adding distance differences** Computes the distance between consecutive observations (`dist`), *based on Euclidean distance* (hence this demo code requires projected data).
3. **Session annotation** Flags whether an observation is the start or end of a session (`sessionStart`, `sessionEnd`).
4. **Gap annotation** Identifies whether an observation is the start or end of a gap, based on the defined gap durations (`gapStart`, `gapEnd`).
5. **Cleanup** Resolves conflicts between `gapStart` and `sessionStart`, and between `gapEnd` and `sessionEnd`.
6. **Session classification** Classifies sessions into the eight types from the conceptual model, by movement in the pre-gap, gap and post-gap periods.

The code below results in two (CSV) datasets: one characterising each session in summary (the result of the classification), the other holding the parameters computed at the observation level that enable the classification.
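
The first three steps can be sketched on a tiny, made-up data frame before running them on the demonstration data. The values below are invented for illustration, and the use of `duplicated` for the session flags is one possible approach, not necessarily the one used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Made-up observations for two sessions, in a planar coordinate system
toy = pd.DataFrame({
    "sessionID": ["a", "a", "a", "b", "b"],
    "time": pd.to_datetime([
        "2000-01-01 01:00:00", "2000-01-01 01:00:30", "2000-01-01 01:05:30",
        "2000-01-01 02:00:00", "2000-01-01 02:01:00",
    ]),
    "x": [0.0, 3.0, 3.0, 10.0, 10.0],
    "y": [0.0, 4.0, 4.0, 10.0, 20.0],
})
toy = toy.sort_values(["sessionID", "time"]).reset_index(drop=True)

# Step 1: minutes elapsed since the previous observation of the same session
toy["timeDiff"] = toy.groupby("sessionID")["time"].diff().dt.total_seconds() / 60

# Step 2: Euclidean distance to the previous observation of the same session
dx = toy.groupby("sessionID")["x"].diff()
dy = toy.groupby("sessionID")["y"].diff()
toy["dist"] = np.sqrt(dx ** 2 + dy ** 2)

# Step 3: first and last observation of each session
toy["sessionStart"] = ~toy["sessionID"].duplicated(keep="first")
toy["sessionEnd"] = ~toy["sessionID"].duplicated(keep="last")

print(toy)
```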


In [1]:
### Imports
import os # to manipulate paths
import numpy as np
import pandas as pd # Pandas data frames
import math

In [2]:
### Functions


def dist(df, x_col="x",y_col="y"):
    """Simple distance function on data frame
    Euclidean distance,
    replace with a Haversine or other function 
    if needed to apply on lon,lat
    """
    return np.sqrt(df[x_col].diff()**2 + df[y_col].diff()**2)


In [3]:
### Main execution

input_file_name = "gap_trajs_input_data.csv"
input_file = os.path.join(os.getcwd(),'data',input_file_name)
os.path.exists(input_file) # test that your path is correct


True

In [4]:
# read csv file with header (first row); pandas infers the column types
data = pd.read_csv(input_file, header=0)

# cast timestamp strings to pandas datetime
data["time"] = pd.to_datetime(data["time"])

# compute time and distance differences between consecutive observations, by session
# timeDiff: time elapsed since the previous fix, in minutes
data["timeDiff"] = data.sort_values(['sessionID','time']).groupby("sessionID")["time"].diff()/np.timedelta64(1,"s")/60
# negtimeDiff: time difference to the next fix, in minutes (negative values)
data["negtimeDiff"] = data.sort_values(['sessionID','time']).groupby("sessionID")["time"].diff(periods=-1)/np.timedelta64(1,"s")/60

data["dist"] = data.groupby(['sessionID'], group_keys=False).apply(dist)

data['sessionStart'] = False
data['sessionEnd'] = False

# Annotate sessions - start and end of session
data.loc[data.groupby("sessionID").head(1).index, 'sessionStart'] = True
data.loc[data.groupby("sessionID").tail(1).index, 'sessionEnd'] = True

# Border case: an observation flagged as both session start and session end
# (a session with a single observation) keeps only the sessionStart flag
conditions = [
    data["sessionEnd"].eq(True) & data["sessionStart"].eq(True)
]
outcomes = [False]
    
data["sessionEnd"] = np.select(conditions, outcomes, default=data["sessionEnd"])





### Verify that session starts and ends have been identified


In [5]:
print(len(data[data["timeDiff"].isna()]))
print(len(data[data["sessionEnd"].eq(True)]))
print(len(data[data["sessionStart"].eq(True)]))

10
10
10


## Identify gaps in the dataset

In this step, we will now identify gaps based on a threshold duration.
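
To illustrate the thresholding on its own, here is a small sketch on a made-up series of time differences. The values are invented; the three thresholds are the ones used in the cell below. Because the thresholds are applied in ascending order, each observation ends up labelled with the largest threshold it exceeds.

```python
import pandas as pd

# Made-up time differences to the previous observation, in minutes
time_diff = pd.Series([float("nan"), 0.08, 0.5, 0.08, 4.0, 1.5])

gap_durations = [10 / 60, 1, 3]  # 10 seconds, 1 minute, 3 minutes

# NaN marks observations that do not end a gap
gap_end = pd.Series(float("nan"), index=time_diff.index)
for gap_duration in sorted(gap_durations):
    # later (larger) thresholds overwrite earlier (smaller) ones
    gap_end.loc[time_diff > gap_duration] = gap_duration

print(gap_end)
```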
    

In [6]:
gap_durations = [10/60,1,3] # in minutes. Consider experimenting with 1 min, or 10/60 => 10seconds

for gap_duration in gap_durations:
    # annotate gap ends and gap starts; thresholds are processed in ascending order,
    # so each flag stores the largest threshold exceeded
    data.loc[abs(data.timeDiff) > gap_duration, 'gapEnd'] = gap_duration
    data.loc[abs(data.negtimeDiff) > gap_duration, 'gapStart'] = gap_duration

In [7]:
print(len(data[data["gapEnd"].eq(10/60)]))
print(len(data[data["gapStart"].eq(10/60)]))
print(len(data[data["gapEnd"].eq(1)]))
print(len(data[data["gapStart"].eq(1)]))
print(len(data[data["gapEnd"].eq(3)]))
print(len(data[data["gapStart"].eq(3)]))

5
5
2
2
8
8


In [8]:
## Summarize sessions - we start by a seed of sessionIDs and sessionStarts

session_summaries = data[data.sessionStart.eq(True)][["sessionID","time"]].copy()
session_summaries = session_summaries.rename(columns={"time":"sessionStartTime"})
#session Ends
session_ends = data[data.sessionEnd.eq(True)][["sessionID","time"]]
session_summaries = session_summaries.merge(session_ends, on="sessionID")
session_summaries = session_summaries.rename(columns={"time":"sessionEndTime"})

# add in timeDiff
# count number of gaps, by type
gaps = data[data['gapStart'] == 10/60].groupby('sessionID')['gapStart'].count().astype('int64')

gaps.name= "n10sGaps"
gaps = pd.DataFrame(gaps)
gaps1 = data[data['gapStart'] == 1].groupby('sessionID')['gapStart'].count().astype('int64')

gaps1.name= "n1minGaps"
gaps1 = pd.DataFrame(gaps1)
gaps3 = data[data['gapStart'] == 3].groupby('sessionID')['gapStart'].count().astype('int64')
gaps3.name= "n3minGaps"
gaps3 = pd.DataFrame(gaps3)

session_summaries = session_summaries.merge(gaps, on="sessionID", how='left').merge(gaps1, on="sessionID", how='left').merge(gaps3, on="sessionID", how='left')
# sanity check: sessions without gaps and sessions with gaps should add up to the total number of sessions

no_gap_sessions = len(session_summaries[session_summaries['n10sGaps'].isna() & session_summaries['n1minGaps'].isna() & session_summaries['n3minGaps'].isna()])

total_by_gaps = len(session_summaries[session_summaries['n10sGaps'].notna() | session_summaries['n1minGaps'].notna() | session_summaries['n3minGaps'].notna()])
overall_sessions = data["sessionID"].nunique()
print(f'sessions without gaps {no_gap_sessions}, with gaps {total_by_gaps}, total sessions {overall_sessions} matches .')


session_summaries.sort_values("sessionStartTime")

#.select(["sessionID",])
#data.set_index("sessionID")

sessions without gaps 1, with gaps 9, total sessions 10 matches .


| | sessionID | sessionStartTime | sessionEndTime | n10sGaps | n1minGaps | n3minGaps |
| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 2 | s61 | 2016-09-11 03:03:05 | 2016-09-11 05:48:10 | NaN | NaN | 1.0 |
| 5 | s63 | 2016-09-11 03:47:43 | 2016-09-11 04:18:25 | NaN | NaN | 1.0 |
| 7 | s12 | 2016-09-11 04:34:36 | 2016-09-11 05:43:51 | NaN | NaN | 1.0 |
| 6 | s24 | 2016-09-11 06:28:54 | 2016-09-11 07:31:57 | NaN | NaN | 1.0 |
| 3 | s02 | 2016-09-11 06:38:21 | 2016-09-11 07:54:52 | NaN | NaN | 1.0 |
| 1 | s06 | 2016-09-11 06:53:13 | 2016-09-11 07:29:41 | 5.0 | 2.0 | NaN |
| 9 | s25 | 2016-09-11 11:49:27 | 2016-09-11 12:02:46 | NaN | NaN | 1.0 |
| 4 | s30 | 2016-09-11 20:40:50 | 2016-09-11 23:59:58 | NaN | NaN | 1.0 |
| 8 | s1 | 2016-09-11 22:41:01 | 2016-09-11 22:52:55 | NaN | NaN | 1.0 |
| 0 | s82 | 2016-09-11 23:42:22 | 2016-09-11 23:44:07 | NaN | NaN | NaN |


In [9]:
data['gapStart']

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
4312   NaN
4313   NaN
4314   NaN
4315   NaN
4316   NaN
Name: gapStart, Length: 4317, dtype: float64

In [10]:
data[data['gapStart'].notna()]

| | sessionID | time | x | y | timeDiff | negtimeDiff | dist | sessionStart | sessionEnd | gapEnd | gapStart |
| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 29 | s06 | 2016-09-11 06:53:49 | 813093.006897 | 927907.1 | 0.083333 | -0.9 | 0.0 | False | False | NaN | 0.166667 |
| 40 | s06 | 2016-09-11 06:55:33 | 813479.262193 | 928967.9 | 0.083333 | -0.35 | 82.646621 | False | False | NaN | 0.166667 |
| 50 | s06 | 2016-09-11 06:56:39 | 814228.382551 | 929667.7 | 0.083333 | -1.416667 | 79.495278 | False | False | NaN | 1.0 |
| 67 | s06 | 2016-09-11 06:59:24 | 815045.016061 | 931214.2 | 0.083333 | -0.6 | 82.56877 | False | False | NaN | 0.166667 |
| 125 | s06 | 2016-09-11 07:04:45 | 819362.328631 | 931110.0 | 0.083333 | -1.25 | 68.629742 | False | False | NaN | 1.0 |
| 136 | s06 | 2016-09-11 07:06:50 | 820787.003532 | 931069.0 | 0.083333 | -0.483333 | 114.227019 | False | False | NaN | 0.166667 |
| 163 | s06 | 2016-09-11 07:09:29 | 822807.523868 | 931558.3 | 0.083333 | -0.416667 | 78.016515 | False | False | NaN | 0.166667 |
| 411 | s61 | 2016-09-11 03:03:50 | 804059.40367 | 1035425.0 | 0.083333 | -158.583333 | 12.514436 | False | False | NaN | 3.0 |
| 1006 | s02 | 2016-09-11 07:22:06 | -586136.543418 | 203618.1 | 0.083333 | -31.933333 | 0.0 | False | False | NaN | 3.0 |
| 1171 | s30 | 2016-09-11 20:53:36 | -1997.659939 | -50251.28 | 0.083333 | -24.75 | 145.374756 | False | False | NaN | 3.0 |


In [11]:
gapLengths = data[data['gapStart'] > 10/60][["sessionID","timeDiff"]]
gapLengths


| | sessionID | timeDiff |
| ---:| ---:| ---:|
| 50 | s06 | 0.083333 |
| 125 | s06 | 0.083333 |
| 411 | s61 | 0.083333 |
| 1006 | s02 | 0.083333 |
| 1171 | s30 | 0.083333 |
| 3154 | s63 | 0.083333 |
| 3633 | s24 | 0.083333 |
| 3688 | s12 | 0.083333 |
| 4251 | s1 | 0.033333 |
| 4256 | s25 | 0.05 |


In [12]:
output_file = os.path.join(os.getcwd(),"data","session_annot.csv")
data.to_csv(output_file)

## Classify sessions
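
Before the full classification, the following sketch shows how the observations of one session with a single gap can be split into pre-gap, gap and post-gap distances. The session data are made up for illustration; the column names follow the annotated data frame built above, and the threshold of 200 map units mirrors the value used in the classification cell.

```python
import pandas as pd

nan = float("nan")

# Made-up session: the fourth observation ends a gap of 10 minutes
session = pd.DataFrame({
    "time": pd.to_datetime([
        "2000-01-01 01:00:00", "2000-01-01 01:00:05", "2000-01-01 01:00:10",
        "2000-01-01 01:10:10", "2000-01-01 01:10:15", "2000-01-01 01:10:20",
    ]),
    "dist": [nan, 150.0, 120.0, 15.0, 5.0, 4.0],
    "gapEnd": [nan, nan, nan, 3.0, nan, nan],
})

# Time of the observation that closes the gap
gap_end_time = session.loc[session["gapEnd"].notna(), "time"].iloc[0]

# Distance covered before the gap, across the gap, and after the gap
dist_before = session.loc[session["time"] < gap_end_time, "dist"].sum()
dist_gap = session.loc[session["time"] == gap_end_time, "dist"].sum()
dist_after = session.loc[session["time"] > gap_end_time, "dist"].sum()

threshold = 200
label = "".join("M" if d > threshold else "N" for d in (dist_before, dist_gap, dist_after))
print(dist_before, dist_gap, dist_after, label)
```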

In [13]:
def getCategory(distB,distG,distA,threshold):
    definedClass = ""
    if math.isnan(distB):
        distB = 0
    if math.isnan(distG):
        distG = 0
    if math.isnan(distA):
        distA = 0
    distB = float(distB)
    distA = float(distA)
    distG = float(distG)
    if (distB==0 and distG==0 and distA==0):
        definedClass = "NoGap"
    elif (distB>threshold and distG>threshold and distA>threshold):
        definedClass = "MMM"
    elif (distB>threshold and distG>threshold and distA<=threshold):
        definedClass = "MMN"
    elif (distB>threshold and distG<=threshold and distA>threshold):
        definedClass = "MNM"
    elif(distB>threshold and distG<=threshold and distA<=threshold):
        definedClass = "MNN"
    elif(distB<=threshold and distG>threshold and distA>threshold):
        definedClass = "NMM"
    elif(distB<=threshold and distG>threshold and distA<=threshold):
        definedClass = "NMN"
    elif(distB<=threshold and distG<=threshold and distA>threshold):
        definedClass = "NNM"
    elif(distB<=threshold and distG<=threshold and distA<=threshold):
        definedClass = "NNN"
    return definedClass

In [14]:
## Execute session classification

In [15]:
# sessions with exactly one 3-minute gap
sessions_3min = session_summaries[session_summaries['n3minGaps']==1]
data_3mingap = data[data['gapEnd']==3]
data_3mingap = data_3mingap.rename(columns={'time': 'gapEndTime'})

data_3min = pd.merge(data,data_3mingap[['sessionID','gapEndTime']],on='sessionID', how='left')

data_3min["time"] = pd.to_datetime(data_3min["time"])
data_3min["gapEndTime"] = pd.to_datetime(data_3min["gapEndTime"])
data_3min_filtered = data_3min[data_3min['gapEndTime'].notna()]


data_3min_before_gap = data_3min_filtered[data_3min_filtered['time']<data_3min_filtered['gapEndTime']]
data_3min_after_gap = data_3min_filtered[data_3min_filtered['time']>data_3min_filtered['gapEndTime']]

before_gap_summaries = data_3min_before_gap[['sessionID','timeDiff','dist']].groupby(['sessionID']).sum()
before_gap_summaries = before_gap_summaries.rename(columns={'timeDiff': 'minuteB', 'dist': 'distB'})
after_gap_summaries = data_3min_after_gap[['sessionID','timeDiff','dist']].groupby(['sessionID']).sum()
after_gap_summaries = after_gap_summaries.rename(columns={'timeDiff': 'minuteA', 'dist': 'distA'})
#print(before_gap_summaries)

session_summaries2 = pd.merge(session_summaries,data_3mingap[['sessionID','timeDiff','dist']],on='sessionID', how='left')
session_summaries2 = session_summaries2.rename(columns={'timeDiff': 'minuteG', 'dist': 'distG'})
session_summaries3 = pd.merge(session_summaries2,before_gap_summaries,on='sessionID', how='left')
session_summaries4 = pd.merge(session_summaries3,after_gap_summaries,on='sessionID', how='left')
print(session_summaries4)
# argument order must follow getCategory: pre-gap, gap and post-gap distance, then the threshold
session_summaries4['types'] = session_summaries4.apply(lambda x: getCategory(x['distB'],x['distG'],x['distA'],200), axis=1)
print(session_summaries4)



  sessionID    sessionStartTime      sessionEndTime  n10sGaps  n1minGaps  \
0       s82 2016-09-11 23:42:22 2016-09-11 23:44:07       NaN        NaN   
1       s06 2016-09-11 06:53:13 2016-09-11 07:29:41       5.0        2.0   
2       s61 2016-09-11 03:03:05 2016-09-11 05:48:10       NaN        NaN   
3       s02 2016-09-11 06:38:21 2016-09-11 07:54:52       NaN        NaN   
4       s30 2016-09-11 20:40:50 2016-09-11 23:59:58       NaN        NaN   
5       s63 2016-09-11 03:47:43 2016-09-11 04:18:25       NaN        NaN   
6       s24 2016-09-11 06:28:54 2016-09-11 07:31:57       NaN        NaN   
7       s12 2016-09-11 04:34:36 2016-09-11 05:43:51       NaN        NaN   
8        s1 2016-09-11 22:41:01 2016-09-11 22:52:55       NaN        NaN   
9       s25 2016-09-11 11:49:27 2016-09-11 12:02:46       NaN        NaN   

   n3minGaps     minuteG         distG    minuteB         distB     minuteA  \
0        NaN         NaN           NaN        NaN           NaN         NaN   
1    

# Write output

In [16]:
session_summaries4

| | sessionID | sessionStartTime | sessionEndTime | n10sGaps | n1minGaps | n3minGaps | minuteG | distG | minuteB | distB | minuteA | distA | types |
| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
| 0 | s82 | 2016-09-11 23:42:22 | 2016-09-11 23:44:07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NoGap |
| 1 | s06 | 2016-09-11 06:53:13 | 2016-09-11 07:29:41 | 5.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NoGap |
| 2 | s61 | 2016-09-11 03:03:05 | 2016-09-11 05:48:10 | NaN | NaN | 1.0 | 158.583333 | 10176.847114 | 0.75 | 582.941837 | 5.75 | 6956.011968 | MMM |
| 3 | s02 | 2016-09-11 06:38:21 | 2016-09-11 07:54:52 | NaN | NaN | 1.0 | 31.933333 | 13.782077 | 43.75 | 8.417493 | 0.833333 | 48.234331 | NNN |
| 4 | s30 | 2016-09-11 20:40:50 | 2016-09-11 23:59:58 | NaN | NaN | 1.0 | 24.75 | 22182.516079 | 12.766667 | 17255.702635 | 161.616667 | 538.538022 | MMM |
| 5 | s63 | 2016-09-11 03:47:43 | 2016-09-11 04:18:25 | NaN | NaN | 1.0 | 13.5 | 5.2426 | 7.016667 | 3576.77409 | 10.183333 | 4804.656319 | MNM |
| 6 | s24 | 2016-09-11 06:28:54 | 2016-09-11 07:31:57 | NaN | NaN | 1.0 | 32.516667 | 19.593015 | 29.783333 | 9935.276975 | 0.75 | 72.408479 | MNN |
| 7 | s12 | 2016-09-11 04:34:36 | 2016-09-11 05:43:51 | NaN | NaN | 1.0 | 18.716667 | 1539.664485 | 3.75 | 11.316544 | 46.783333 | 51237.192912 | NMM |
| 8 | s1 | 2016-09-11 22:41:01 | 2016-09-11 22:52:55 | NaN | NaN | 1.0 | 11.55 | 1826.899355 | 0.233333 | 167.392846 | 0.116667 | 8.093199 | NMN |
| 9 | s25 | 2016-09-11 11:49:27 | 2016-09-11 12:02:46 | NaN | NaN | 1.0 | 8.35 | 13.435171 | 0.05 | 11.134682 | 4.916667 | 4171.81668 | NNM |
