## Preprocessing trajectory data

Data preprocessing is the set of steps that prepare data for later analysis and data mining.

## Load data from file

The dataset used in this tutorial is GeoLife GPS Trajectories, available at https://www.microsoft.com/en-us/download/details.aspx?id=52367

In [12]:
import pandas as pd
import numpy as np
from pymove import MoveDataFrame

In [13]:
df = pd.read_csv('examples/geolife_sample.csv', parse_dates=['datetime'])
df.head()

Unnamed: 0,lat,lon,datetime,id
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [14]:
df_move = MoveDataFrame(df, latitude="lat", longitude="lon", datetime="datetime")

In [None]:
df_move.show_trajectories_info()

## Filtering

The filters module provides functions to perform different types of data filtering.

Importing the module:

In [None]:
from pymove import filters

A bounding box (usually shortened to bbox) is an area defined by two longitudes and two latitudes. The by_bbox function filters the points of the trajectories according to a specified bounding box.

In [None]:
bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
filt_df = filters.by_bbox(df_move, bbox)
filt_df.head()
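
The idea behind a bounding-box filter can be sketched in plain Python, without PyMove. The helper `in_bbox` below is hypothetical and is not the library's implementation. It assumes the tuple order (lat_min, lon_min, lat_max, lon_max), which is consistent with the values used in the cell above, and treats the bounds as inclusive.

```python
def in_bbox(lat, lon, bbox):
    """Return True if (lat, lon) lies inside bbox (bounds inclusive)."""
    lat_min, lon_min, lat_max, lon_max = bbox  # assumed tuple order
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
points = [
    (39.984094, 116.319236),  # first row of the sample, inside the box
    (48.856600, 2.352200),    # made-up point far outside the box
]
inside = [p for p in points if in_bbox(p[0], p[1], bbox)]
print(inside)
```

Only the first point survives the filter, since the second one falls outside both the latitude and the longitude range.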

The by_datetime function filters trajectory points according to the time range specified by the parameters start_datetime and end_datetime.

In [None]:
filters.by_datetime(df_move, start_datetime="2009-03-19 05:45:37", end_datetime="2009-03-19 05:46:17")

The by_label function filters trajectory points according to a specified value and column label, set by the value and label_name parameters respectively.

In [None]:
filters.by_label(df_move, value=116.327219, label_name="lon").head()

The by_id function filters trajectory points according to a specified trajectory id.

In [None]:
filters.by_id(df_move, id_=5).head()

A tid is the result of concatenating the id and the date of a trajectory.
The by_tid function filters trajectory points according to the tid specified by the tid_ parameter.

In [None]:
df_move.generate_tid_based_on_id_datatime()
filters.by_tid(df_move, "12008102305").head()
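
To make the tid format concrete, here is a minimal sketch. The helper `make_tid` is hypothetical, and the format (id followed by year, month, day and hour) is inferred from the value "12008102305" used in the cell above rather than taken from the PyMove source.

```python
from datetime import datetime


def make_tid(traj_id, when):
    """Concatenate the id with the date and hour of the point (assumed format)."""
    return f"{traj_id}{when:%Y%m%d%H}"


# First row of the sample: id 1, recorded on 2008-10-23 at 05:53:05
print(make_tid(1, datetime(2008, 10, 23, 5, 53, 5)))  # 12008102305
```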

The outliers function filters the trajectory points that are outliers.

In [None]:
outliers_points = filters.outliers(df_move)
outliers_points.head()

The clean_duplicates function removes duplicate rows from the dataframe. Optionally, only certain columns can be considered.

In [None]:
filters.clean_duplicates(df_move)

The clean_consecutive_duplicates function removes consecutive duplicate rows from the dataframe. Optionally, only certain columns can be considered; they are defined by the subset parameter. In this example only the lat column is considered.

In [None]:
filtered_df = filters.clean_consecutive_duplicates(df_move, subset = ["lat"])
len(filtered_df)
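
The difference between plain and consecutive duplicates is easy to miss, so here is a small plain-Python sketch. The helper `drop_consecutive_duplicates` and the sample rows are made up for illustration and do not reproduce PyMove's code. A row is dropped only when its subset values equal those of the row immediately before it.

```python
def drop_consecutive_duplicates(rows, subset):
    """Keep a row only if its subset values differ from the previous row's."""
    kept = []
    previous_key = object()  # sentinel that never equals a real key
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key != previous_key:
            kept.append(row)
        previous_key = key
    return kept


rows = [
    {"lat": 39.984094, "lon": 116.319236},
    {"lat": 39.984094, "lon": 116.319300},  # same lat as the row before: dropped
    {"lat": 39.984224, "lon": 116.319402},
    {"lat": 39.984094, "lon": 116.319236},  # repeats an earlier lat, but not consecutively: kept
]
result = drop_consecutive_duplicates(rows, subset=["lat"])
print(len(result))  # 3
```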

The clean_nan_values function removes missing values from the dataframe.

In [None]:
filters.clean_nan_values(df_move)
len(df_move)

The clean_gps_jumps_by_distance function removes from the dataframe the trajectory points that are outliers.

In [None]:
filters.clean_gps_jumps_by_distance(df_move)

The clean_gps_nearby_points_by_distances function removes a point from a trajectory when the distance between it and the point before is smaller than the radius_area parameter.

In [None]:
filters.clean_gps_nearby_points_by_distances(df_move, radius_area = 10)
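
As a rough illustration of this rule, the sketch below computes the haversine distance between consecutive points and drops every point that lies closer than radius_area metres to the point before it in the input. The helpers `haversine_m` and `drop_nearby_points` are hypothetical, the Earth radius of 6,371,000 m is an assumption, and PyMove's own implementation may differ in its details.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def drop_nearby_points(points, radius_area):
    """Drop each point closer than radius_area metres to its predecessor in the input."""
    kept = [points[0]] if points else []
    for prev, cur in zip(points, points[1:]):
        if haversine_m(*prev, *cur) >= radius_area:
            kept.append(cur)
    return kept


# First five rows of the sample as (lat, lon)
points = [
    (39.984094, 116.319236),
    (39.984198, 116.319322),
    (39.984224, 116.319402),
    (39.984211, 116.319389),
    (39.984217, 116.319422),
]
print(round(haversine_m(*points[0], *points[1]), 2))
print(drop_nearby_points(points, radius_area=10))
```

With a radius of 10 metres only the first two points remain, because the later points each lie only a few metres from their predecessor.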

The clean_gps_nearby_points_by_speed function removes a point from a trajectory when the speed of travel between it and the point before is smaller than the value set by the speed_radius parameter.

In [None]:
filters.clean_gps_nearby_points_by_speed(df_move, speed_radius=40.0)

The clean_gps_speed_max_radius function recursively removes trajectory points with a speed higher than the value specified by the user. Given any point p of the trajectory, the point is removed if either of the following holds: the travel speed from the point before p to p is greater than the maximum speed between adjacent points set by the user, or the travel speed between p and the next point is greater than that value. When the cleaning is done, the function updates the time and distance features in the dataframe and calls itself again. It finishes when it can no longer find points that violate the speed limit.

In [None]:
filters.clean_gps_speed_max_radius(df_move)
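
The repeated cleaning described above can be sketched as a loop. This is a simplified illustration, not PyMove's code: the helper `clean_by_max_speed` is hypothetical, points are reduced to (time in seconds, position in metres along a line), and timestamps are assumed to be strictly increasing. Speeds are recomputed after each pass, which plays the role of the function calling itself again.

```python
def clean_by_max_speed(points, max_speed):
    """Repeatedly drop points whose incoming or outgoing speed exceeds max_speed."""
    points = list(points)
    while True:
        # Speed on each segment between adjacent points
        speeds = [
            abs(x2 - x1) / (t2 - t1)
            for (t1, x1), (t2, x2) in zip(points, points[1:])
        ]
        too_fast = [s > max_speed for s in speeds]
        keep = []
        for i, point in enumerate(points):
            fast_in = i > 0 and too_fast[i - 1]
            fast_out = i < len(points) - 1 and too_fast[i]
            if not (fast_in or fast_out):
                keep.append(point)
        if len(keep) == len(points):
            return keep  # no violation left, stop
        points = keep  # recompute speeds on the cleaned trajectory


# Made-up trajectory with one GPS jump at t=3
trajectory = [(0, 0), (1, 5), (2, 10), (3, 500), (4, 20), (5, 25)]
print(clean_by_max_speed(trajectory, max_speed=50))  # [(0, 0), (1, 5), (5, 25)]
```

Note that, following the rule as stated, the neighbours of the jump are removed together with the jump itself, because each of them is an endpoint of a segment that is too fast.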

The clean_trajectories_with_few_points function removes from the given dataframe the trajectories with fewer points than specified by the min_points_per_trajectory parameter.

In [None]:
filters.clean_trajectories_with_few_points(df_move)
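
A minimal sketch of the same idea in plain Python follows. The helper `drop_short_trajectories` is hypothetical, and grouping by the "id" key is an assumption of this sketch; the library may group by a different label.

```python
from collections import Counter


def drop_short_trajectories(rows, min_points_per_trajectory):
    """Keep only rows whose trajectory has at least min_points_per_trajectory points."""
    counts = Counter(row["id"] for row in rows)
    return [row for row in rows if counts[row["id"]] >= min_points_per_trajectory]


# Made-up rows: trajectory 1 has three points, trajectory 5 has one
rows = [{"id": 1}, {"id": 1}, {"id": 1}, {"id": 5}]
print(len(drop_short_trajectories(rows, min_points_per_trajectory=2)))  # 3
```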

## Segmentation

The segmentation module is used to segment trajectories based on different parameters.

Importing the module:

In [17]:
from pymove import segmentation

The bbox_split function splits the bounding box into grids of the same size. The number of grids is defined by the parameter number_grids.

In [None]:
bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
segmentation.bbox_split(bbox, number_grids=4)
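
One possible way to divide a bounding box into equally sized parts is to cut it into strips of equal width along the longitude axis. The sketch below shows that reading only. It is an assumption made for illustration, the helper `split_bbox_by_longitude` is hypothetical, and the output of the cell above shows the layout PyMove actually produces.

```python
def split_bbox_by_longitude(bbox, number_grids):
    """Cut bbox into number_grids strips of equal longitude width (assumed layout)."""
    lat_min, lon_min, lat_max, lon_max = bbox  # assumed tuple order
    step = (lon_max - lon_min) / number_grids
    return [
        (lat_min, lon_min + i * step, lat_max, lon_min + (i + 1) * step)
        for i in range(number_grids)
    ]


bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
for cell in split_bbox_by_longitude(bbox, number_grids=4):
    print(cell)
```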

The by_dist_time_speed function segments the trajectories into clusters based on distance, time and speed. The distance, time and speed limits are defined by the parameters max_dist_between_adj_points, max_time_between_adj_points and max_speed_between_adj_points respectively. The column tid_part is added; it indicates the segment to which the point belongs.

In [None]:
segmentation.by_dist_time_speed(df_move, max_dist_between_adj_points=5000,
                                max_time_between_adj_points=800, max_speed_between_adj_points=60.0)
df_move.head()

The by_speed function segments the trajectories into clusters based on speed. The speed limit is defined by the parameter max_speed_between_adj_points. The column tid_speed is added; it indicates the segment to which the point belongs.

In [None]:
segmentation.by_speed(df_move, max_speed_between_adj_points=70.0)
df_move.head()

The by_time function segments the trajectories into clusters based on time. The time limit is defined by the parameter max_time_between_adj_points. The column tid_time is added; it indicates the segment to which the point belongs.

In [None]:
segmentation.by_time(df_move, max_time_between_adj_points = 1000)
df_move.head()
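
Gap-based segmentation of this kind can be sketched in a few lines of plain Python. The helper `segment_by_time` is hypothetical and handles a single trajectory only. It assumes the threshold is expressed in seconds and that a new segment starts whenever the gap to the previous point exceeds it; the timestamps are partly made up.

```python
from datetime import datetime


def segment_by_time(timestamps, max_time_between_adj_points):
    """Return one segment label per timestamp; a long gap starts a new segment."""
    labels = []
    current = 1
    for i, ts in enumerate(timestamps):
        if i > 0:
            gap = (ts - timestamps[i - 1]).total_seconds()
            if gap > max_time_between_adj_points:
                current += 1
        labels.append(current)
    return labels


timestamps = [
    datetime(2008, 10, 23, 5, 53, 5),
    datetime(2008, 10, 23, 5, 53, 6),
    datetime(2008, 10, 23, 5, 53, 11),
    datetime(2008, 10, 23, 10, 56, 57),  # gap of several hours
    datetime(2008, 10, 23, 10, 57, 2),
]
print(segment_by_time(timestamps, max_time_between_adj_points=1000))  # [1, 1, 1, 2, 2]
```

The same pattern applies to distance or speed: only the quantity compared against the threshold changes.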

The segment_traj_by_max_dist function segments the trajectories into clusters based on distance. The distance limit is defined by the parameter max_dist_between_adj_points. The column tid_dist is added; it indicates the segment to which the point belongs.

In [18]:
segmentation.segment_traj_by_max_dist(df_move, max_dist_between_adj_points = 4000)
df_move.head()

Split trajectories by max distance between adjacent points: 4000
...setting id as index


(217653/217653) 100% in 00:00:00.018 - estimated end in 00:00:00.000
... Reseting index

Total Time: 0.02 seconds
------------------------------------------



Unnamed: 0,id,lat,lon,datetime,dist_to_prev,dist_to_next,dist_prev_to_next,situation,tid_dist
0,1,39.984094,116.319236,2008-10-23 05:53:05,,13.690153,,,1
1,1,39.984198,116.319322,2008-10-23 05:53:06,13.690153,7.403788,20.223428,move,1
2,1,39.984224,116.319402,2008-10-23 05:53:11,7.403788,1.821083,5.888579,move,1
3,1,39.984211,116.319389,2008-10-23 05:53:16,1.821083,2.889671,1.873356,stop,1
4,1,39.984217,116.319422,2008-10-23 05:53:21,2.889671,66.555997,68.72726,move,1


## Stay point detection 

A stay point is a location where a moving object has stayed for a while within a certain distance threshold. A stay point can represent different kinds of places, such as a restaurant, a school or a workplace.
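
The definition can be turned into a small algorithm. The sketch below is a common textbook-style formulation, not PyMove's implementation: the helper `detect_stay_points` is hypothetical, points are given in planar coordinates (metres) with a time in seconds, and a stay is reported when consecutive points remain within dist_threshold of the first point of the window for at least time_threshold seconds.

```python
import math


def detect_stay_points(points, dist_threshold, time_threshold):
    """points: list of (x_m, y_m, t_s) sorted by time.
    Returns (mean_x, mean_y, arrival_t, leave_t) for each detected stay."""
    stays = []
    i = 0
    n = len(points)
    while i < n:
        j = i + 1
        # Extend the window while points stay close to the anchor point i
        while j < n and math.hypot(points[j][0] - points[i][0],
                                   points[j][1] - points[i][1]) <= dist_threshold:
            j += 1
        duration = points[j - 1][2] - points[i][2]
        if duration >= time_threshold:
            window = points[i:j]
            mean_x = sum(p[0] for p in window) / len(window)
            mean_y = sum(p[1] for p in window) / len(window)
            stays.append((mean_x, mean_y, points[i][2], points[j - 1][2]))
            i = j  # continue after the stay
        else:
            i += 1
    return stays


# Made-up track: three points close together for 120 s, then fast movement
track = [(0, 0, 0), (5, 0, 60), (3, 4, 120), (200, 0, 180), (400, 0, 240)]
stays = detect_stay_points(track, dist_threshold=10, time_threshold=100)
print(stays)
```

For this track a single stay is found, covering the first three points from t=0 to t=120.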

Importing the module:

In [4]:
from pymove import stay_point_detection

The create_update_datetime_in_format_cyclical function converts the time data into a cyclical format. The columns hour_sin and hour_cos are added to the dataframe.

In [5]:
stay_point_detection.create_update_datetime_in_format_cyclical(df_move)

Encoding cyclical continuous features - 24-hour time
...hour_sin and  hour_cos features were created...



In [6]:
df_move.head()

Unnamed: 0,lat,lon,datetime,id,hour_sin,hour_cos
0,39.984094,116.319236,2008-10-23 05:53:05,1,0.979084,0.203456
1,39.984198,116.319322,2008-10-23 05:53:06,1,0.979084,0.203456
2,39.984224,116.319402,2008-10-23 05:53:11,1,0.979084,0.203456
3,39.984211,116.319389,2008-10-23 05:53:16,1,0.979084,0.203456
4,39.984217,116.319422,2008-10-23 05:53:21,1,0.979084,0.203456
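
The hour_sin and hour_cos values in the table above can be reproduced with a sine and cosine of the hour mapped onto a circle. In the sketch below the divisor 23 is inferred from the printed values for hour 5 and is not taken from the PyMove source; a divisor of 24 is another common choice for this kind of encoding. The helper `encode_hour` is hypothetical.

```python
import math

HOURS_DIVISOR = 23.0  # inferred from the printed output; an assumption of this sketch


def encode_hour(hour):
    """Map an hour of the day onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * hour / HOURS_DIVISOR
    return math.sin(angle), math.cos(angle)


hour_sin, hour_cos = encode_hour(5)  # the sample rows were recorded at 05:53
print(round(hour_sin, 6), round(hour_cos, 6))  # 0.979084 0.203456
```

The benefit of the encoding is that hours at both ends of the day end up close together on the circle, which a plain integer hour column does not express.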


The create_or_update_move_stop_by_dist_time function creates or updates the stay points of the trajectories, based on distance and time metrics. The column segment_stop is added to the dataframe; it indicates the trajectory segment to which the point belongs. The column stop is also added; it indicates whether the point represents a stop, that is, a place where the object was stationary.

In [7]:
stay_point_detection.create_or_update_move_stop_by_dist_time(df_move, dist_radius=40, time_radius=1000)

Split trajectories by max distance between adjacent points: 40

Creating or updating distance features in meters...

...Sorting by id and datetime to increase performance

...Set id as index to increase attribution performance

(217653/217653) 100% in 00:00:00.088 - estimated end in 00:00:00.000
...Reset index

..Total Time: 0.09474706649780273
...setting id as index


(217653/217653) 100% in 00:00:00.064 - estimated end in 00:00:00.000
... Reseting index

Total Time: 0.07 seconds
------------------------------------------


Creating or updating distance, time and speed features in meters by seconds

...Sorting by segment_stop and datetime to increase performance

...Set segment_stop as index to a higher peformance

(5/217653) 0% in 00:00:00.176 - estimated end in 02:08:04.202
(43995/217653) 20% in 00:00:00.228 - estimated end in 00:00:00.901
(88581/217653) 40% in 00:00:00.276 - estimated end in 00:00:00.403
(130800/217653) 60% in 00:00:00.338 - estimated end in 00:00:00.224
(174825/217653) 80% in 00:00:00.481 - estimated end in 00:00:00.118
...Reset index...

..Total Time: 0.584
Create or update stop as True or False
...Creating stop features as True or False using 1000 to time in seconds
True     157738
False     59915
Name: stop, dtype: int64

Total Time: 1.02 seconds
-----------------------------------------------------



In [8]:
df_move.head()

Unnamed: 0,segment_stop,id,lat,lon,datetime,hour_sin,hour_cos,dist_to_prev,dist_to_next,dist_prev_to_next,time_to_prev,speed_to_prev,stop
0,1,1,39.984094,116.319236,2008-10-23 05:53:05,0.979084,0.203456,,13.690153,,,,False
1,1,1,39.984198,116.319322,2008-10-23 05:53:06,0.979084,0.203456,13.690153,7.403788,20.223428,1.0,13.690153,False
2,1,1,39.984224,116.319402,2008-10-23 05:53:11,0.979084,0.203456,7.403788,1.821083,5.888579,5.0,1.480758,False
3,1,1,39.984211,116.319389,2008-10-23 05:53:16,0.979084,0.203456,1.821083,2.889671,1.873356,5.0,0.364217,False
4,1,1,39.984217,116.319422,2008-10-23 05:53:21,0.979084,0.203456,2.889671,66.555997,68.72726,5.0,0.577934,False


The create_update_move_and_stop_by_radius function creates or updates the stay points of the trajectories, based on distance. The column segment_stop is added to the dataframe; it indicates the trajectory segment to which the point belongs. The column situation is also added; it indicates whether the point represents a stop point or a moving point.

In [15]:
stay_point_detection.create_update_move_and_stop_by_radius(df_move, radius=2)


Creating or updating features MOVE and STOPS...


Creating or updating distance features in meters...

...Sorting by id and datetime to increase performance

...Set id as index to increase attribution performance

(217653/217653) 100% in 00:00:00.092 - estimated end in 00:00:00.000
...Reset index

..Total Time: 0.09885811805725098

....There are 58981 stops to this parameters



In [11]:
df_move.head()

Unnamed: 0,segment_stop,id,lat,lon,datetime,hour_sin,hour_cos,dist_to_prev,dist_to_next,dist_prev_to_next,time_to_prev,speed_to_prev,stop,situation
0,1,1,39.984094,116.319236,2008-10-23 05:53:05,0.979084,0.203456,,13.690153,,,,False,
1,1,1,39.984198,116.319322,2008-10-23 05:53:06,0.979084,0.203456,13.690153,7.403788,20.223428,1.0,13.690153,False,move
2,1,1,39.984224,116.319402,2008-10-23 05:53:11,0.979084,0.203456,7.403788,1.821083,5.888579,5.0,1.480758,False,move
3,1,1,39.984211,116.319389,2008-10-23 05:53:16,0.979084,0.203456,1.821083,2.889671,1.873356,5.0,0.364217,False,stop
4,1,1,39.984217,116.319422,2008-10-23 05:53:21,0.979084,0.203456,2.889671,66.555997,68.72726,5.0,0.577934,False,move


## Compression

Importing the module:

In [20]:
from pymove import compression

The function below reduces the size of the trajectory by using the stop points to perform the compression.

In [21]:
compression.compress_segment_stop_to_point(df_move)

Split trajectories by max distance between adjacent points: 30
...setting id as index


(217653/217653) 100% in 00:00:00.079 - estimated end in 00:00:00.000
... Reseting index

Total Time: 0.08 seconds
------------------------------------------


Creating or updating distance, time and speed features in meters by seconds

...Sorting by segment_stop and datetime to increase performance

...Set segment_stop as index to a higher peformance

(5/217653) 0% in 00:00:00.159 - estimated end in 01:55:48.248
(43973/217653) 20% in 00:00:00.213 - estimated end in 00:00:00.843
(88533/217653) 40% in 00:00:00.261 - estimated end in 00:00:00.381
(130666/217653) 60% in 00:00:00.326 - estimated end in 00:00:00.217
(174240/217653) 80% in 00:00:00.533 - estimated end in 00:00:00.132
...Reset index...

..Total Time: 0.657
Create or update stop as True or False
...Creating stop features as True or False using 900 to time in seconds
True     152603
False     65050
Name: stop, dtype: int64

Total Time: 0.86 seconds
-----------------------------------------------------

...setting mean to lat and

(807/152603) 0% in 00:00:00.086 - estimated end in 00:00:16.321
(8046/152603) 5% in 00:00:00.237 - estimated end in 00:00:04.265
(15384/152603) 10% in 00:00:00.537 - estimated end in 00:00:04.795
(23731/152603) 15% in 00:00:01.050 - estimated end in 00:00:05.705
(31376/152603) 20% in 00:00:01.219 - estimated end in 00:00:04.713
(38917/152603) 25% in 00:00:01.440 - estimated end in 00:00:04.207
(46175/152603) 30% in 00:00:01.683 - estimated end in 00:00:03.880
(53618/152603) 35% in 00:00:01.999 - estimated end in 00:00:03.692
(61596/152603) 40% in 00:00:02.260 - estimated end in 00:00:03.339
(69846/152603) 45% in 00:00:02.591 - estimated end in 00:00:03.070
(76415/152603) 50% in 00:00:02.799 - estimated end in 00:00:02.791
(84323/152603) 55% in 00:00:03.022 - estimated end in 00:00:02.447
(91696/152603) 60% in 00:00:03.220 - estimated end in 00:00:02.138
(99924/152603) 65% in 00:00:03.616 - estimated end in 00:00:01.906
(107013/152603) 70% in 00:00:04.067 - estimated end in 00:00:01.732

In [22]:
df_move

Unnamed: 0,segment_stop,id,lat,lon,datetime,dist_to_prev,dist_to_next,dist_prev_to_next,situation,tid_dist,time_to_prev,speed_to_prev,stop,lat_mean,lon_mean
562,13,1,40.008987,116.312734,2008-10-23 10:56:57,,11.013568,26.512002,move,1,,,True,40.013824,116.306535
1368,13,1,39.990973,116.326094,2008-10-24 00:04:28,16.068387,40.942759,56.717519,move,1,2.0,8.034193,True,40.013824,116.306535
1575,17,1,39.978484,116.326845,2008-10-24 01:45:41,,14.366909,50.322414,move,1,,,True,39.980124,116.310749
1847,17,1,39.980909,116.308171,2008-10-24 02:28:19,21.870376,51.683385,53.596809,move,1,1508.0,0.014503,True,39.980124,116.310749
1938,21,1,39.981700,116.310196,2008-10-24 03:18:22,,20.642805,58.753841,move,1,,,True,39.979594,116.313750
2646,21,1,39.982684,116.311197,2008-10-24 05:40:23,6.437367,73.921445,72.942149,move,1,5.0,1.287473,True,39.979594,116.313750
2667,25,1,39.981691,116.310004,2008-10-24 06:09:34,,19.612423,51.388287,move,1,,,True,39.981541,116.310107
3039,25,1,39.979459,116.325806,2008-10-24 06:33:05,5.697464,78.265156,83.404371,move,1,5.0,1.139493,True,39.981541,116.310107
3089,27,1,40.013812,116.306483,2008-10-24 23:44:05,,7.587244,4358.356247,move,2,,,True,39.996381,116.299268
3924,27,1,39.997757,116.276764,2008-10-25 00:49:53,12.807471,320.654991,327.200429,move,2,2.0,6.403736,True,39.996381,116.299268
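
One way to read the lat_mean and lon_mean columns above is as the average position of the stop points in each segment_stop. The sketch below shows that computation on made-up rows. The helper `mean_position_per_segment` is hypothetical and does not reproduce how PyMove selects or keeps the rows of each segment.

```python
from collections import defaultdict


def mean_position_per_segment(rows):
    """Average lat/lon of the rows flagged as stop, grouped by segment_stop."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # lat sum, lon sum, count
    for row in rows:
        if row["stop"]:
            acc = sums[row["segment_stop"]]
            acc[0] += row["lat"]
            acc[1] += row["lon"]
            acc[2] += 1
    return {seg: (lat / n, lon / n) for seg, (lat, lon, n) in sums.items()}


# Made-up rows: two stop points in segment 13, one moving point in segment 14
rows = [
    {"segment_stop": 13, "lat": 40.0, "lon": 116.0, "stop": True},
    {"segment_stop": 13, "lat": 40.5, "lon": 116.5, "stop": True},
    {"segment_stop": 14, "lat": 39.0, "lon": 115.0, "stop": False},
]
means = mean_position_per_segment(rows)
print(means)  # {13: (40.25, 116.25)}
```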
