# Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders

Companion demonstration code to the paper: *Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders*.

This code demonstrates the analytical part that classifies trajectories with a *single* gap.

This code is designed to rely on a minimal number of dependencies and to demonstrate the classification of trajectories described in the paper. It is *not* production-level or performant code, nor a clean library.

The code comes with a small set of demonstration data. These data are artificial, not the data used in the paper itself, and are only suited to demonstrating the functionality of this code.



## Theory at a glance

The paper *Knowledge from Incomplete Trajectories: Navigation systems do not cater for familiar wayfinders* proposes a simple conceptual model of incomplete trajectories (captured in a tracked *session*). 
This is the conceptual model of trajectories with gaps:
        
![Schematic of trajectory with gap](figs/trajectory_annot.png)

The code demonstrated here classifies trajectories by analysing the pre-gap, gap, and post-gap sections of the trajectories, evaluating whether the tracked object *moved* (**M**) or *did not move* (**N**) in these respective trajectory parts.

This classification results in eight characteristic sequences (*MMM, MMN, MNM, NMM, NNN, NNM, NMN, MNN*) that form the following conceptual neighbourhoods:

![Conceptual neighbourhood](figs/conceptualNeighbourhood.png)
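
The count of eight follows from three trajectory parts (pre-gap, gap, post-gap) with two possible states each. A minimal sketch that enumerates them is given here; the variable name `sequences` is only illustrative and is not used elsewhere in this notebook.

```python
from itertools import product

# Each of the three trajectory parts (pre-gap, gap, post-gap) is either
# M (moved) or N (did not move), giving 2**3 = 8 sequences.
sequences = ["".join(parts) for parts in product("MN", repeat=3)]
print(sequences)
```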

## Relating gapped trajectories to information needs

To apply the findings of the paper -- i.e., to relate the presence of gaps to information needs -- an analyst must be certain that two assumptions are met:

- The presence of a gap in the trajectory must be relatable to information use (in our paper, the user was not tracked when the screen was off). In other data sources this may be reflected differently, for instance as a decrease in sampling frequency rather than a real gap in the trajectory. The code below would need to be modified to analyse such data sources.
- The gaps must be reliably identifiable as active gaps, i.e., technical failures must be effectively filtered out.


## Code assumptions

This code is designed to run with data in the following format. All coordinates are assumed to be in a planar system for the code to work. This is easily achieved by projecting GPS lon/lat coordinates to the desired system, or by computing distances on the spheroid (not demonstrated here, to reduce dependencies).

| time | sessionID | x | y |
| -------------:| -------------:| ---:| ---:|
| 2000-01-01 01:00:00 | 1 | 100 | 100 |
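
If the data are in lon/lat rather than a planar system, the Euclidean distance could be swapped for a great-circle distance. The following is a minimal sketch using the Haversine formula on a sphere (an approximation, not a spheroidal computation). The function name `haversine_dist`, the column names `lon` and `lat`, and the radius constant are assumptions for illustration and are not part of the demonstration code.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (spherical approximation)


def haversine_dist(df, lon_col="lon", lat_col="lat"):
    """Great-circle distance in metres between consecutive rows.

    The first row has no predecessor and therefore yields NaN.
    """
    lon = np.radians(df[lon_col])
    lat = np.radians(df[lat_col])
    dlon = lon.diff()
    dlat = lat.diff()
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat) * np.cos(lat.shift()) * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Two hypothetical fixes on the equator, one degree of longitude apart
demo = pd.DataFrame({"lon": [0.0, 1.0], "lat": [0.0, 0.0]})
demo_dist = haversine_dist(demo)
print(demo_dist)
```

Such a function could be passed to the per-session `apply` in place of the Euclidean one; the movement threshold would then be interpreted in metres.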

## Data pre-processing

The pre-processing loads the data (as CSV) and performs the following steps:

1. **Adding time differences** Computes the time differences between consecutive observations (timeDiff).
2. **Adding distance differences** Computes the distances between consecutive observations (dist) *based on Euclidean distance* (hence, this demo code requires projected data).
3. **Session annotation** Flags whether an observation is the start or end of a session (sessionStart, sessionEnd).
4. **Gap annotation** Identifies whether an observation is the start or end of a gap, based on a defined gap duration (gapStart, gapEnd).
5. **Cleanup** Resolves conflicts between gapStart and sessionStart, and between gapEnd and sessionEnd.
6. **Session classification** Classifies sessions into the eight types of the conceptual model, by movement in the pre-gap, gap and post-gap periods.

The code below produces two CSV datasets: one summarising each session (the result of the classification), the other holding the observation-level parameters that enable the classification.
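
To make the first three steps concrete, here is a minimal sketch on a hypothetical four-row session. The data frame `toy` and its values are invented for illustration, and the calls differ slightly from the notebook cells that follow (for example, `np.hypot` and `duplicated` are used here for brevity).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one session, four fixes, planar coordinates in metres
toy = pd.DataFrame({
    "time": pd.to_datetime([
        "2000-01-01 01:00:00", "2000-01-01 01:01:00",
        "2000-01-01 01:06:00", "2000-01-01 01:07:00",
    ]),
    "sessionID": [1, 1, 1, 1],
    "x": [100.0, 103.0, 103.0, 106.0],
    "y": [100.0, 104.0, 104.0, 108.0],
})
toy = toy.sort_values(["sessionID", "time"]).reset_index(drop=True)

# Step 1: minutes since the previous observation in the same session
toy["timeDiff"] = toy.groupby("sessionID")["time"].diff().dt.total_seconds() / 60

# Step 2: Euclidean distance to the previous observation in the same session
dx = toy.groupby("sessionID")["x"].diff()
dy = toy.groupby("sessionID")["y"].diff()
toy["dist"] = np.hypot(dx, dy)

# Step 3: first and last observation of each session
toy["sessionStart"] = ~toy.duplicated("sessionID", keep="first")
toy["sessionEnd"] = ~toy.duplicated("sessionID", keep="last")

print(toy)
```

The first observation of each session has no predecessor, so its timeDiff and dist are NaN.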


In [1]:
### Imports
import os # to manipulate paths
import numpy as np
import pandas as pd # Pandas data frames
import math

In [2]:
### Functions


def dist(df, x_col="x",y_col="y"):
    """Simple distance function on data frame
    Euclidean distance,
    replace with a Haversine or other function 
    if needed to apply on lon,lat
    """
    return np.sqrt(df[x_col].diff()**2 + df[y_col].diff()**2)


In [3]:
### Main execution

input_file_name = "gap_trajs_input_data.csv"
input_file = os.path.join(os.getcwd(),'data',input_file_name)
os.path.exists(input_file) # test that your path is correct


True

In [4]:
# read csv file, with header (first row)
data = pd.read_csv(input_file, header=0)

# convert timestamps to a timestamp format
data["time"] = pd.to_datetime(data["time"]) # cast timestamp strings to pandas datetime

# compute time and distance differences between consecutive observations, by session
data["timeDiff"] = data.sort_values(['sessionID','time']).groupby("sessionID")["time"].diff()/np.timedelta64(1,"s")/60 # time difference to the previous fix, in minutes
data["negtimeDiff"] = data.sort_values(['sessionID','time']).groupby("sessionID")["time"].diff(periods=-1)/np.timedelta64(1,"s")/60 # time difference to the next fix, in minutes (negative)

data["dist"] = data.groupby(['sessionID'], group_keys=False).apply(dist)

data['sessionStart'] = False
data['sessionEnd'] = False

# Annotate sessions - start and end of session
data.loc[data.groupby("sessionID").head(1).index, 'sessionStart'] = True
data.loc[data.groupby("sessionID").tail(1).index, 'sessionEnd'] = True

# cleanup for border cases: a session with a single observation
# is flagged as a session start only, not also as a session end
conditions = [
    data["sessionEnd"].eq(True) & data["sessionStart"].eq(True)
]
outcomes = [False]
    
data["sessionEnd"] = np.select(conditions, outcomes, default=data["sessionEnd"])





### Verify that session starts and ends have been identified


In [5]:
print(len(data[data["timeDiff"].isna()]))
print(len(data[data["sessionEnd"].eq(True)]))
print(len(data[data["sessionStart"].eq(True)]))

10
10
10


## Identify gaps in the dataset

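This step identifies gaps by comparing the time difference between consecutive observations against a threshold duration. Several thresholds are applied in ascending order, so each observation keeps the largest threshold it exceeds.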
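
As a minimal sketch of this thresholding, consider a handful of hypothetical time differences (in minutes). The series `time_diff` and `gap_end` are illustrative stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical time differences (minutes) to the previous observation;
# the first observation of a session has none
time_diff = pd.Series([None, 0.1, 0.5, 2.0, 8.0], dtype="float64")

gap_durations = [10 / 60, 1, 3]  # thresholds in minutes, ascending
gap_end = pd.Series(float("nan"), index=time_diff.index)

for gap_duration in gap_durations:
    # Larger thresholds overwrite smaller ones, so each observation
    # ends up labelled with the largest threshold it exceeds
    gap_end.loc[time_diff > gap_duration] = gap_duration

print(gap_end.tolist())
```

Observations with a missing or sub-threshold time difference stay NaN, which is why later cells select gaps with equality tests or `notna()`.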
    

In [6]:
gap_durations = [10/60,1,3] # in minutes. Consider experimenting with 1 min, or 10/60 => 10seconds

for gap_duration in gap_durations:
    # flag gap ends and gap starts; larger thresholds overwrite smaller ones
    data.loc[abs(data.timeDiff) > gap_duration, 'gapEnd'] = gap_duration
    data.loc[abs(data.negtimeDiff) > gap_duration, 'gapStart'] = gap_duration

In [7]:
print(len(data[data["gapEnd"].eq(10/60)]))
print(len(data[data["gapStart"].eq(10/60)]))
print(len(data[data["gapEnd"].eq(1)]))
print(len(data[data["gapStart"].eq(1)]))
print(len(data[data["gapEnd"].eq(3)]))
print(len(data[data["gapStart"].eq(3)]))

360
360
2
2
8
8


In [8]:
## Summarize sessions - we start by a seed of sessionIDs and sessionStarts

session_summaries = data[data.sessionStart.eq(True)][["sessionID","time"]].copy()
session_summaries = session_summaries.rename(columns={"time":"sessionStartTime"})
#session Ends
session_ends = data[data.sessionEnd.eq(True)][["sessionID","time"]]
session_summaries = session_summaries.merge(session_ends, on="sessionID")
session_summaries = session_summaries.rename(columns={"time":"sessionEndTime"})

# add in timeDiff
# count number of gaps, by type
gaps = data[data['gapStart'] == 10/60].groupby('sessionID')['gapStart'].count().astype('int64')

gaps.name= "n10sGaps"
gaps = pd.DataFrame(gaps)
gaps1 = data[data['gapStart'] == 1].groupby('sessionID')['gapStart'].count().astype('int64')

gaps1.name= "n1minGaps"
gaps1 = pd.DataFrame(gaps1)
gaps3 = data[data['gapStart'] == 3].groupby('sessionID')['gapStart'].count().astype('int64')
gaps3.name= "n3minGaps"
gaps3 = pd.DataFrame(gaps3)

session_summaries = session_summaries.merge(gaps, on="sessionID", how='left').merge(gaps1, on="sessionID", how='left').merge(gaps3, on="sessionID", how='left')
# sanity check: sessions with and without gaps should add up to the total

no_gap_sessions = len(session_summaries[session_summaries['n10sGaps'].isna() & session_summaries['n1minGaps'].isna() & session_summaries['n3minGaps'].isna()])

total_by_gaps = len(session_summaries[session_summaries['n10sGaps'].notna() | session_summaries['n1minGaps'].notna() | session_summaries['n3minGaps'].notna()])
overall_sessions = data["sessionID"].nunique()
print(f'sessions without gaps {no_gap_sessions}, with gaps {total_by_gaps}, total sessions {overall_sessions} matches .')


session_summaries.sort_values("sessionStartTime")

#.select(["sessionID",])
#data.set_index("sessionID")

sessions without gaps 0, with gaps 10, total sessions 10 matches .


Unnamed: 0,sessionID,sessionStartTime,sessionEndTime,n10sGaps,n1minGaps,n3minGaps
2,3fedd0690e6e498c9a610f173d34c40d,2016-11-09 03:03:00,2016-11-09 05:48:00,6.0,,1.0
5,faa5da9f6f2244deb3f7d283fd3bac0f,2016-11-09 03:47:00,2016-11-09 04:18:00,17.0,,1.0
7,5a839eeaa0164f6da36199521c701cf5,2016-11-09 04:34:00,2016-11-09 05:43:00,50.0,,1.0
6,6a7c86eba1e249ce9ebc34c20a8ec05f,2016-11-09 06:28:00,2016-11-09 07:31:00,30.0,,1.0
3,2dfdf169dae8438598111dbfed43dbbc,2016-11-09 06:38:00,2016-11-09 07:54:00,44.0,,1.0
1,041640aef216425cbccdf976e150dc49,2016-11-09 06:53:00,2016-11-09 07:29:00,32.0,2.0,
9,ff850e772a094fb7838312f42d053d27,2016-11-09 11:49:00,2016-11-09 12:02:00,5.0,,1.0
4,66f6ace4d9bf4019899528a15a7b5cca,2016-11-09 20:40:00,2016-11-09 23:59:00,174.0,,1.0
8,40203e8e70404da9b785e9500b9c9828,2016-11-09 22:41:00,2016-11-09 22:52:00,,,1.0
0,0017f6bb4c9747369843223faf23673e,2016-11-09 23:42:00,2016-11-09 23:44:00,2.0,,


In [9]:
data['gapStart']

0      0.166667
1      0.166667
2           NaN
3      0.166667
4      0.166667
         ...   
375    0.166667
376    0.166667
377    0.166667
378    0.166667
379         NaN
Name: gapStart, Length: 380, dtype: float64

In [10]:
data[data['gapStart'].notna()]

Unnamed: 0,time,sessionID,x,y,timeDiff,negtimeDiff,dist,sessionStart,sessionEnd,gapEnd,gapStart
0,2016-11-09 23:42:00,0017f6bb4c9747369843223faf23673e,3017124.889,2702226.342,,-1.0,,True,False,,0.166667
1,2016-11-09 23:43:00,0017f6bb4c9747369843223faf23673e,3017119.210,2702207.779,1.0,-1.0,19.412264,False,False,0.166667,0.166667
3,2016-11-09 06:53:00,041640aef216425cbccdf976e150dc49,3313093.007,3427907.141,,-1.0,,True,False,,0.166667
4,2016-11-09 06:54:00,041640aef216425cbccdf976e150dc49,3313523.984,3428150.985,1.0,-1.0,495.177817,False,False,0.166667,0.166667
5,2016-11-09 06:55:00,041640aef216425cbccdf976e150dc49,3313582.779,3428485.505,1.0,-1.0,339.647586,False,False,0.166667,0.166667
...,...,...,...,...,...,...,...,...,...,...,...
374,2016-11-09 11:57:00,ff850e772a094fb7838312f42d053d27,2542718.100,2380516.819,8.0,-1.0,3.443298,False,False,3.000000,0.166667
375,2016-11-09 11:58:00,ff850e772a094fb7838312f42d053d27,2542717.205,2380513.494,1.0,-1.0,3.443349,False,False,0.166667,0.166667
376,2016-11-09 11:59:00,ff850e772a094fb7838312f42d053d27,2542660.546,2380587.045,1.0,-1.0,92.843911,False,False,0.166667,0.166667
377,2016-11-09 12:00:00,ff850e772a094fb7838312f42d053d27,2542188.613,2380962.408,1.0,-1.0,603.007577,False,False,0.166667,0.166667


In [11]:
gapLengths = data[data['gapStart'] > 10/60][["sessionID","timeDiff"]]
gapLengths


Unnamed: 0,sessionID,timeDiff
6,041640aef216425cbccdf976e150dc49,1.0
13,041640aef216425cbccdf976e150dc49,1.0
38,3fedd0690e6e498c9a610f173d34c40d,
90,2dfdf169dae8438598111dbfed43dbbc,1.0
105,66f6ace4d9bf4019899528a15a7b5cca,1.0
275,faa5da9f6f2244deb3f7d283fd3bac0f,1.0
317,6a7c86eba1e249ce9ebc34c20a8ec05f,1.0
323,5a839eeaa0164f6da36199521c701cf5,1.0
371,40203e8e70404da9b785e9500b9c9828,
373,ff850e772a094fb7838312f42d053d27,


In [12]:
output_file = os.path.join(os.getcwd(),"data","session_annot.csv")
data.to_csv(output_file)

## Classify sessions

In [13]:
def getCategory(distB,distG,distA,threshold):
    definedClass = ""
    if math.isnan(distB):
        distB = 0
    if math.isnan(distG):
        distG = 0
    if math.isnan(distA):
        distA = 0
    distB = float(distB)
    distA = float(distA)
    distG = float(distG)
    if (distB==0 and distG==0 and distA==0):
        definedClass = "NoGap"
    elif (distB>threshold and distG>threshold and distA>threshold):
        definedClass = "MMM"
    elif (distB>threshold and distG>threshold and distA<=threshold):
        definedClass = "MMN"
    elif (distB>threshold and distG<=threshold and distA>threshold):
        definedClass = "MNM"
    elif(distB>threshold and distG<=threshold and distA<=threshold):
        definedClass = "MNN"
    elif(distB<=threshold and distG>threshold and distA>threshold):
        definedClass = "NMM"
    elif(distB<=threshold and distG>threshold and distA<=threshold):
        definedClass = "NMN"
    elif(distB<=threshold and distG<=threshold and distA>threshold):
        definedClass = "NNM"
    elif(distB<=threshold and distG<=threshold and distA<=threshold):
        definedClass = "NNN"
    return definedClass
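
The same decision logic can be written more compactly. The following is a sketch only: the function name `classify` is hypothetical, and it is meant to mirror `getCategory` (arguments in the order pre-gap, gap, post-gap), not to replace it.

```python
import math


def classify(dist_before, dist_gap, dist_after, threshold):
    """Return a label such as 'MNM' for the pre-gap, gap and post-gap parts.

    A part is M if its distance exceeds the threshold, otherwise N.
    Missing distances count as 0; three zero distances yield 'NoGap'.
    """
    parts = [0.0 if math.isnan(d) else float(d)
             for d in (dist_before, dist_gap, dist_after)]
    if all(d == 0 for d in parts):
        return "NoGap"
    return "".join("M" if d > threshold else "N" for d in parts)


# Movement before and after the gap, but not during it
label = classify(500.0, 20.0, 350.0, 200)
print(label)  # MNM

# No distances available at all
no_gap_label = classify(float("nan"), float("nan"), float("nan"), 200)
print(no_gap_label)  # NoGap
```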

In [14]:
## Execute session classification

In [15]:
# sessions with exactly one 3-minute gap
sessions_3min = session_summaries[session_summaries['n3minGaps']==1]
data_3mingap = data[data['gapEnd']==3]
data_3mingap = data_3mingap.rename(columns={'time': 'gapEndTime'})

data_3min = pd.merge(data,data_3mingap[['sessionID','gapEndTime']],on='sessionID', how='left')

data_3min["time"] = pd.to_datetime(data_3min["time"])
data_3min["gapEndTime"] = pd.to_datetime(data_3min["gapEndTime"])
data_3min_filtered = data_3min[data_3min['gapEndTime'].notna()]


data_3min_before_gap = data_3min_filtered[data_3min_filtered['time']<data_3min_filtered['gapEndTime']]
data_3min_after_gap = data_3min_filtered[data_3min_filtered['time']>data_3min_filtered['gapEndTime']]

before_gap_summaries = data_3min_before_gap[['sessionID','timeDiff','dist']].groupby(['sessionID']).sum()
before_gap_summaries = before_gap_summaries.rename(columns={'timeDiff': 'minuteB', 'dist': 'distB'})
after_gap_summaries = data_3min_after_gap[['sessionID','timeDiff','dist']].groupby(['sessionID']).sum()
after_gap_summaries = after_gap_summaries.rename(columns={'timeDiff': 'minuteA', 'dist': 'distA'})
#print(before_gap_summaries)

session_summaries2 = pd.merge(session_summaries,data_3mingap[['sessionID','timeDiff','dist']],on='sessionID', how='left')
session_summaries2 = session_summaries2.rename(columns={'timeDiff': 'minuteG', 'dist': 'distG'})
session_summaries3 = pd.merge(session_summaries2,before_gap_summaries,on='sessionID', how='left')
session_summaries4 = pd.merge(session_summaries3,after_gap_summaries,on='sessionID', how='left')
print(session_summaries4)
# arguments follow the getCategory signature: pre-gap, gap, post-gap, threshold
session_summaries4['types'] = session_summaries4.apply(lambda x: getCategory(x['distB'],x['distG'],x['distA'],200), axis=1)
print(session_summaries4)



                          sessionID    sessionStartTime      sessionEndTime  \
0  0017f6bb4c9747369843223faf23673e 2016-11-09 23:42:00 2016-11-09 23:44:00   
1  041640aef216425cbccdf976e150dc49 2016-11-09 06:53:00 2016-11-09 07:29:00   
2  3fedd0690e6e498c9a610f173d34c40d 2016-11-09 03:03:00 2016-11-09 05:48:00   
3  2dfdf169dae8438598111dbfed43dbbc 2016-11-09 06:38:00 2016-11-09 07:54:00   
4  66f6ace4d9bf4019899528a15a7b5cca 2016-11-09 20:40:00 2016-11-09 23:59:00   
5  faa5da9f6f2244deb3f7d283fd3bac0f 2016-11-09 03:47:00 2016-11-09 04:18:00   
6  6a7c86eba1e249ce9ebc34c20a8ec05f 2016-11-09 06:28:00 2016-11-09 07:31:00   
7  5a839eeaa0164f6da36199521c701cf5 2016-11-09 04:34:00 2016-11-09 05:43:00   
8  40203e8e70404da9b785e9500b9c9828 2016-11-09 22:41:00 2016-11-09 22:52:00   
9  ff850e772a094fb7838312f42d053d27 2016-11-09 11:49:00 2016-11-09 12:02:00   

   n10sGaps  n1minGaps  n3minGaps  minuteG         distG  minuteB  \
0       2.0        NaN        NaN      NaN           NaN     

# Write output
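
A minimal sketch of writing the session-level result to CSV might look as follows. The data frame `summaries`, the temporary directory and the file name `session_summaries.csv` are hypothetical stand-ins; in the notebook, the classified summaries would be written next to `session_annot.csv` in the data directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the classified session summaries
summaries = pd.DataFrame({"sessionID": ["a", "b"], "types": ["MNM", "NoGap"]})

out_dir = tempfile.mkdtemp()  # stand-in for the notebook's data directory
output_file = os.path.join(out_dir, "session_summaries.csv")  # hypothetical name
summaries.to_csv(output_file, index=False)

# Read the file back to confirm the round trip
restored = pd.read_csv(output_file)
print(restored)
```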