### We process and clean the original data to prepare it for the study. The original data consists of 161,009 GPS locations from 262 users across the 10 schools. However, a large number of participants did not walk during the study and used another means of transport instead. Furthermore, several trajectories were not recorded correctly (only a few GPS locations and no clear path). Finally, we also need to remove the outliers from each individual path; these appear mainly at the beginning and the end of the trajectory because of the GPS connection.

### After cleaning the data, we are left with just 83 participants and 36,091 GPS locations. Each participant's data set is saved in a .csv file with three new columns: the time difference between consecutive geolocations, the Euclidean distance and the instantaneous velocity.

### Finally, in a fourth stage, we linearly interpolate between GPS locations so that all of them are uniformly separated by 1 second. We save the interpolated data in a new .csv file.

In [7]:
import networkx as nx
import osmnx as ox
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import glob
import os
from math import sin, cos, sqrt, atan2, radians

#%matplotlib inline
ox.config(log_console=True)
ox.__version__


def getDistanceFromLatLonInM(lat1,lon1,lat2,lon2):
    """ Function that returns the distance in metres between 2 GPS locations in degrees (latitude and longitude).
    It is based on the Haversine formula (https://en.wikipedia.org/wiki/Haversine_formula), which takes into account the
    Earth's curvature. 
    
    Input:
        - 2 GPS coordinates: (latitude1,longitude1) of the first point and (latitude2,longitude2) of the second point. 
        
    Output:
        - Distance in metres between the two GPS locations.
    """
    
    R = 6371 # Radius of the earth in km
    dLat = radians(lat2-lat1)
    dLon = radians(lon2-lon1)
    rLat1 = radians(lat1)
    rLat2 = radians(lat2)
    a = sin(dLat/2) * sin(dLat/2) + cos(rLat1) * cos(rLat2) * sin(dLon/2) * sin(dLon/2) 
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    d = R * c # Distance in km
    e= d*1000 #distance in m
    
    return e

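As a quick sanity check of the Haversine formula, one degree of latitude along a meridian should measure R·π/180, about 111.19 km for R = 6371 km. The sketch below restates the formula in a compact helper (`haversine_m` is a name introduced here only for illustration, and the coordinates are arbitrary) and evaluates that case:

```python
from math import radians, sin, cos, sqrt, atan2


def haversine_m(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in metres between two points given in degrees."""
    d_lat = radians(lat2 - lat1)
    d_lon = radians(lon2 - lon1)
    a = sin(d_lat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2
    return 2 * radius_km * atan2(sqrt(a), sqrt(1 - a)) * 1000


# Arbitrary test coordinates: two points on the same meridian, one degree apart.
one_degree = haversine_m(41.0, 2.0, 42.0, 2.0)
# Identical points must be at distance zero.
same_point = haversine_m(41.0, 2.0, 41.0, 2.0)
print(round(one_degree, 1), same_point)
```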

## Stage 1: Remove non-pedestrian journeys

We remove all non-pedestrian journeys from this study by looking at each participant's nickname, which indicates the means of transport.

Example: 2018-11-05_sgv_1302_bus.csv
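
The filtering code for this stage is not shown, so the following is only a minimal sketch of one way to do it. It assumes that the means of transport is the last underscore-separated token of the file name, as in the example above, and that walking journeys carry a label such as `walk`. The label set, the helper names and the second file name are hypothetical.

```python
import os

# Hypothetical labels for walking journeys; adapt to the labels actually used.
WALKING_LABELS = {"walk", "foot"}


def transport_mode(path):
    """Return the last underscore-separated token of the file name, lower-cased."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return stem.rsplit("_", 1)[-1].lower()


def keep_pedestrian(paths, walking_labels=WALKING_LABELS):
    """Keep only the files whose transport token is a walking label."""
    return [p for p in paths if transport_mode(p) in walking_labels]


# The first name comes from the text; the second one is made up for the example.
files = ["2018-11-05_sgv_1302_bus.csv", "2018-11-05_sgv_1303_walk.csv"]
pedestrian_files = keep_pedestrian(files)
print(pedestrian_files)
```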

## Stage 2: Remove invalid journeys

We remove all erroneous journeys, that is, those with only a few GPS locations and no clear trajectory.
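
The text does not state the exact criterion for an invalid journey, so the sketch below is only one possible reading. It rejects a track that has too few points or whose bounding box is too small to show a path. Both thresholds and the function name are hypothetical, and the bounding-box extent is a crude proxy for "no clear trajectory".

```python
MIN_POINTS = 30        # hypothetical: fewer fixes than this counts as "too few"
MIN_SPAN_DEG = 0.0005  # hypothetical: minimum extent in degrees (roughly 55 m of latitude)


def is_valid_journey(points, min_points=MIN_POINTS, min_span_deg=MIN_SPAN_DEG):
    """points: list of (latitude, longitude) tuples in recording order."""
    if len(points) < min_points:
        return False
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    # Largest extent of the track along either axis.
    span = max(max(lats) - min(lats), max(lons) - min(lons))
    return span >= min_span_deg


# Made-up tracks: too few points, enough points but no movement, and a real walk.
too_short = [(41.0, 2.0)] * 5
stationary = [(41.0, 2.0)] * 40
walked = [(41.0 + i * 1e-4, 2.0) for i in range(40)]
print(is_valid_journey(too_short), is_valid_journey(stationary), is_valid_journey(walked))
```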

## Stage 3: Remove outliers

We remove from each journey the outliers that may appear (especially at the beginning and the end of the trajectory), for example because of the GPS connection.
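
The outlier rule itself is not given here. One plausible approach, sketched below as an assumption rather than as the method actually used, is to trim points from both ends of a track while the speed to the neighbouring point is implausible for a pedestrian. The threshold `MAX_SPEED_MS`, the helper names and the sample track are hypothetical.

```python
from math import radians, sin, cos, sqrt, atan2

MAX_SPEED_MS = 3.0  # hypothetical upper bound for walking speed, in m/s


def haversine_m(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in metres between two points given in degrees."""
    d_lat, d_lon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(d_lat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2
    return 2 * radius_km * atan2(sqrt(a), sqrt(1 - a)) * 1000


def speed_ms(p, q):
    """Speed in m/s between two (t_seconds, latitude, longitude) fixes."""
    (t1, lat1, lon1), (t2, lat2, lon2) = p, q
    dt = t2 - t1
    return haversine_m(lat1, lon1, lat2, lon2) / dt if dt > 0 else float("inf")


def trim_ends(track, max_speed=MAX_SPEED_MS):
    """Drop leading and trailing fixes that imply an implausible speed."""
    start, end = 0, len(track)
    while end - start > 1 and speed_ms(track[start], track[start + 1]) > max_speed:
        start += 1
    while end - start > 1 and speed_ms(track[end - 2], track[end - 1]) > max_speed:
        end -= 1
    return track[start:end]


# Made-up track: the first and last fixes jump by more than 1 km in one second.
track = [(0, 41.01, 2.0), (1, 41.0, 2.0), (2, 41.00001, 2.0), (3, 41.00002, 2.0), (4, 41.02, 2.0)]
cleaned = trim_ends(track)
print(cleaned)
```

This only handles outliers at the two ends of a journey; outliers in the middle of a track would need a separate rule.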

## Stage 4: Linear Interpolation

We want all the data to be uniformly spaced in time (1 second). For this reason, we linearly interpolate the data to estimate the geolocations in the "temporal gaps" where the time difference is greater than 1 second.

In other words, if two geolocations are separated by more than 1 second (for instance, 2 seconds), we perform a linear interpolation that creates the intermediate timestamps together with their geolocations.

To do that, we first set the time column as the index, then resample it to 1-second steps, and fill in the rows with missing geolocations by interpolating along a straight line.

The final data set for each user contains the data at 1-second intervals and three new columns: the time difference between geolocations, the Haversine/Euclidean distance and the instantaneous velocity.
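
The resample-and-interpolate step can be illustrated on a toy track. The sketch below uses made-up timestamps and coordinates with a 3-second gap between the second and third fixes, and assumes the column names `time`, `latitude` and `longitude`:

```python
import pandas as pd

# Toy track (made-up values) with a 3-second gap between the last two fixes.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-11-05 10:00:00",
        "2018-11-05 10:00:01",
        "2018-11-05 10:00:04",
    ]),
    "latitude": [41.0000, 41.0001, 41.0004],
    "longitude": [2.0000, 2.0001, 2.0004],
})

resampled = (
    df.set_index("time")   # the time column becomes the index
      .resample("1s")      # one row per second
      .asfreq()            # seconds without a fix become NaN rows
      .interpolate()       # default method="linear" fills the NaN rows
      .reset_index()
)
print(resampled)
```

Because the resampled index is evenly spaced, the default `method="linear"` (which ignores the index values) gives the same result as interpolating in time.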


In [9]:
all_files = glob.glob(os.path.join("*.csv")) #make list of paths for all the csv files (each user)

for file in all_files:
    df = pd.read_csv(file) #read the file     
    df2=df.copy() 

    df2['time'] = pd.to_datetime(df2['time'])   # Time to datetime format
    df2.index = df2['time']  # Then convert the column time into index

    del df2['time']  # Drop every column except the coordinates (latitude and longitude)
    del df2['nickname']
    del df2['At']
    del df2['d']
    del df2['v']

    df2=df2.resample('1S').asfreq().interpolate()     # Resample the index of times every 1 second (1S) and interpolate linearly
    df2.reset_index(level=0, inplace=True)            # the missing values of the latitude and longitude. Then reset index.

    
    At=[]     # Recompute the time difference between geolocations, the Haversine distance and the velocity.
    distance=[]
    for i in range(1,len(df2['time'])):
        At.append((df2['time'][i]-df2['time'][i-1]).total_seconds())
        dist=getDistanceFromLatLonInM(df2['latitude'][i-1],df2['longitude'][i-1],df2['latitude'][i],df2['longitude'][i])
        distance.append(dist)

    At.append(np.nan)          # The last row has no following point, so pad both lists with NaN
    distance.append(np.nan)

    df2['At']=At
    df2['d']=distance
    df2['v']=df2['d']/df2['At']
    
    file2=os.path.splitext(file)[0]           # Save new .csv file for each user with the extension "interpolated"
    #df2.to_csv(file2+'_interpolated.csv')