# #02 - Exploring Preprocessing 

Data preprocessing is a set of activities performed to prepare data for future analysis and data mining activities.

## Load data from file

The dataset used in this tutorial is GeoLife GPS Trajectories. It is available at https://www.microsoft.com/en-us/download/details.aspx?id=52367

In [1]:
import pandas as pd
import numpy as np
from pymove import MoveDataFrame
from pymove import read_csv

In [2]:
df = pd.read_csv('geolife_sample.csv', parse_dates=['datetime'])
df.head()

Unnamed: 0,lat,lon,datetime,id
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [3]:
df_move = MoveDataFrame(df, latitude="lat", longitude="lon", datetime="datetime")

In [4]:
df_move.show_trajectories_info()



Number of Points: 217653

Number of IDs objects: 2

Start Date:2008-10-23 05:53:05     End Date:2009-03-19 05:46:37

Bounding Box:(22.147577, 113.54884, 41.13206, 121.15623)





## Filtering

The filters module provides functions to perform different types of data filtering.

Importing the module:

In [5]:
from pymove import filters
df_move = read_csv('geolife_sample.csv')

A bounding box (usually shortened to bbox) is an area defined by two longitudes and two latitudes. The by_bbox function filters the points of the trajectories according to a specified bounding box.

In [6]:
bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
filt_df = filters.by_bbox(df_move, bbox)
filt_df.head()

Unnamed: 0,lat,lon,datetime,id
0,39.984093,116.319237,2008-10-23 05:53:05,1
1,39.9842,116.319321,2008-10-23 05:53:06,1
2,39.984222,116.319405,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984219,116.31942,2008-10-23 05:53:21,1
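
To make the idea concrete, here is a minimal plain-Python sketch of a bounding-box test. It is not PyMove's implementation. It assumes the bbox tuple is ordered (lat_min, lon_min, lat_max, lon_max), which matches the bounding box printed earlier, and it assumes the bounds are inclusive. The helper name inside_bbox and the last two points are hypothetical.

```python
# Sketch only: a bounding-box membership test in plain Python.
# Assumption: bbox = (lat_min, lon_min, lat_max, lon_max), bounds inclusive.

def inside_bbox(lat, lon, bbox):
    """Return True when the point lies inside the bounding box."""
    lat_min, lon_min, lat_max, lon_max = bbox
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

bbox = (22.147577, 113.548843, 41.132062, 121.156224)
points = [
    (39.984094, 116.319236),  # first row of the sample
    (10.0, 116.0),            # hypothetical point, latitude too small
    (40.0, 125.0),            # hypothetical point, longitude too large
]

# Keep only the points that fall inside the box.
kept = [p for p in points if inside_bbox(p[0], p[1], bbox)]
print(kept)
```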


The by_datetime function filters trajectory points according to the time interval specified by the parameters start_datetime and end_datetime.

In [7]:
filters.by_datetime(df_move, start_datetime="2009-03-19 05:45:37", end_datetime="2009-03-19 05:46:17")

Unnamed: 0,lat,lon,datetime,id
217643,40.000206,116.327171,2009-03-19 05:45:37,5
217644,40.00013,116.327171,2009-03-19 05:45:42,5
217645,40.000069,116.327179,2009-03-19 05:45:47,5
217646,40.0,116.327217,2009-03-19 05:45:52,5
217647,39.99992,116.327209,2009-03-19 05:45:57,5
217648,39.999897,116.327293,2009-03-19 05:46:02,5
217649,39.999901,116.327354,2009-03-19 05:46:07,5
217650,39.999947,116.327393,2009-03-19 05:46:12,5
217651,40.000015,116.327431,2009-03-19 05:46:17,5
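
The output above contains both the start and the end timestamp, which suggests an inclusive interval. A minimal sketch of such a filter, using only the standard library, could look like this. The helper name in_time_window is hypothetical and the code is an illustration, not PyMove's implementation.

```python
from datetime import datetime

# Sketch only: inclusive time-window filter on a list of timestamps.
FMT = "%Y-%m-%d %H:%M:%S"

def in_time_window(ts, start, end):
    """Return True when start <= ts <= end (both ends inclusive)."""
    return start <= ts <= end

start = datetime.strptime("2009-03-19 05:45:37", FMT)
end = datetime.strptime("2009-03-19 05:46:17", FMT)

stamps = [
    datetime.strptime(s, FMT)
    for s in (
        "2009-03-19 05:45:32",  # just before the window
        "2009-03-19 05:45:37",  # equal to start, kept
        "2009-03-19 05:46:17",  # equal to end, kept
        "2009-03-19 05:46:37",  # after the window
    )
]

kept = [ts for ts in stamps if in_time_window(ts, start, end)]
print([ts.strftime(FMT) for ts in kept])
```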


The by_label function filters trajectory points according to a specified value and column label, set by the parameters value and label_name respectively.

In [8]:
filters.by_label(df_move, value = 116.327219, label_name = "lon").head()

Unnamed: 0,lat,lon,datetime,id
133,39.979034,116.327217,2008-10-23 06:01:02,1
159,39.977978,116.327217,2008-10-23 10:33:47,1
160,39.978092,116.327217,2008-10-23 10:33:50,1
3066,39.97916,116.327217,2008-10-24 06:34:27,1
13792,39.976044,116.327217,2008-10-26 07:03:51,1


The by_id function filters trajectory points according to a specified trajectory id.

In [9]:
filters.by_id(df_move, id_=5).head()

Unnamed: 0,lat,lon,datetime,id
108607,40.004154,116.321335,2008-10-24 04:12:30,5
108608,40.003834,116.321465,2008-10-24 04:12:35,5
108609,40.003784,116.321434,2008-10-24 04:12:40,5
108610,40.003689,116.321426,2008-10-24 04:12:45,5
108611,40.00359,116.321426,2008-10-24 04:12:50,5


A tid is the result of concatenating the id of a trajectory with its date and hour, for example 12008102305 for id 1 at 2008-10-23, hour 05.
The by_tid function filters trajectory points according to the tid specified by the tid_ parameter.

In [10]:
df_move.generate_tid_based_on_id_datatime()
filters.by_tid(df_move, "12008102305").head()


Creating or updating tid feature...

...Sorting by id and datetime to increase performance


...tid feature was created...



Unnamed: 0,lat,lon,datetime,id,tid
0,39.984093,116.319237,2008-10-23 05:53:05,1,12008102305
1,39.9842,116.319321,2008-10-23 05:53:06,1,12008102305
2,39.984222,116.319405,2008-10-23 05:53:11,1,12008102305
3,39.984211,116.319389,2008-10-23 05:53:16,1,12008102305
4,39.984219,116.31942,2008-10-23 05:53:21,1,12008102305
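
The tid values in the output are consistent with the id followed by the timestamp formatted as year, month, day and hour. A small sketch under that assumption follows. The helper name make_tid is hypothetical, and the format string is inferred from the output rather than taken from PyMove's source.

```python
from datetime import datetime

# Sketch only: build a tid as "<id><YYYYMMDDHH>".
# Assumption: the format is inferred from the tid column shown above.

def make_tid(traj_id, ts):
    """Concatenate the trajectory id with the date and hour of the point."""
    return f"{traj_id}{ts.strftime('%Y%m%d%H')}"

tid_first = make_tid(1, datetime(2008, 10, 23, 5, 53, 5))
tid_last = make_tid(5, datetime(2009, 3, 19, 5, 46, 17))
print(tid_first, tid_last)
```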


The outliers function returns the trajectory points that are considered outliers.

In [11]:
outliers_points = filters.outliers(df_move)
outliers_points.head()


Creating or updating distance features in meters...

...Sorting by id and datetime to increase performance

...Set id as index to increase attribution performance




...Reset index

..Total Time: 0.39299511909484863
...Filtering jumps 



Unnamed: 0,id,lat,lon,datetime,tid,dist_to_prev,dist_to_next,dist_prev_to_next
148,1,39.970512,116.341454,2008-10-23 10:32:53,12008102310,1452.723849,1470.02729,70.179656
338,1,39.995041,116.326462,2008-10-23 10:44:24,12008102310,11.784583,10.573243,2.327355
8133,1,39.991074,116.188393,2008-10-25 08:20:19,12008102508,5.442257,6.559655,1.139224
8380,1,39.988392,116.188454,2008-10-25 09:07:13,12008102509,10.914588,8.551564,2.448361
10175,1,40.015167,116.311043,2008-10-25 23:40:12,12008102523,23.015178,24.266527,3.971611
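
One rule that is consistent with the rows above: a point is a jump when it is far from both neighbors while the neighbors are close to each other. The sketch below encodes that idea on precomputed distances. It is an assumption, not a copy of PyMove's code. The coefficient 3.0 comes from the log of clean_gps_jumps_by_distance further down; the threshold of 1 meter and the helper name is_jump are hypothetical.

```python
# Sketch only: flag a point as a GPS jump from its three distance features.
# Assumptions: jump_coefficient=3.0 (seen in the cleaning log),
# threshold=1.0 meter (hypothetical), rule inferred from the output rows.

def is_jump(dist_to_prev, dist_to_next, dist_prev_to_next,
            jump_coefficient=3.0, threshold=1.0):
    """Return True when the point looks like a detour between close neighbors."""
    if None in (dist_to_prev, dist_to_next, dist_prev_to_next):
        return False  # first and last points lack a neighbor
    limit = jump_coefficient * dist_prev_to_next
    return (
        dist_to_prev > threshold
        and dist_to_next > threshold
        and dist_prev_to_next > threshold
        and limit < dist_to_prev
        and limit < dist_to_next
    )

# Row 338 from the outliers output above.
flag_338 = is_jump(11.784583, 10.573243, 2.327355)
# Row 2 of the sample, which is not listed as an outlier.
flag_2 = is_jump(7.345484, 1.628622, 5.929780)
print(flag_338, flag_2)
```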


The clean_duplicates function removes the duplicate rows of the dataframe. Optionally, only certain columns can be considered.

In [12]:
filters.clean_duplicates(df_move)


Remove rows duplicates by subset
...Sorting by id and datetime to increase performance

...There are no GPS points duplicated


The clean_consecutive_duplicates function removes consecutive duplicate rows of the dataframe. Optionally, only certain columns can be considered, as defined by the parameter subset. In this example only the lat column is considered.

In [13]:
filtered_df = filters.clean_consecutive_duplicates(df_move, subset = ["lat"])
len(filtered_df)

183291
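
The difference from plain duplicate removal is that only runs of equal neighbors are collapsed, so a value may reappear later. A minimal sketch, assuming the first row of each run is kept (the helper name and the sample rows are hypothetical):

```python
# Sketch only: drop rows whose key equals the key of the row just before.
# Assumption: the first row of each run of duplicates is the one kept.

def drop_consecutive_duplicates(rows, key):
    """Keep a row only when its key differs from the previous row's key."""
    kept = []
    previous = object()  # sentinel that compares unequal to any key
    for row in rows:
        current = key(row)
        if current != previous:
            kept.append(row)
        previous = current
    return kept

rows = [
    (39.98, 116.31),
    (39.98, 116.32),  # same lat as the previous row, dropped
    (39.99, 116.32),
    (39.98, 116.33),  # lat seen before, but not consecutive, kept
]

# Compare on latitude only, like subset=["lat"].
filtered = drop_consecutive_duplicates(rows, key=lambda r: r[0])
print(filtered)
```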

The clean_nan_values function removes missing values from the dataframe.

In [14]:
filters.clean_nan_values(df_move)
len(df_move)

217653

The clean_gps_jumps_by_distance function removes from the dataframe the trajectory points that are outliers, that is, GPS jumps detected from the distances between neighboring points.

In [15]:
filters.clean_gps_jumps_by_distance(df_move)


Cleaning gps jumps by distance to jump_coefficient 3.0...

...Filtering jumps 

...Dropping 417 rows of gps points

...Rows before: 217653, Rows after:217236, Sum drop:417

...Filtering jumps 

417 GPS points were dropped


Unnamed: 0,id,lat,lon,datetime,tid,dist_to_prev,dist_to_next,dist_prev_to_next
0,1,39.984093,116.319237,2008-10-23 05:53:05,12008102305,,14.015319,
1,1,39.984200,116.319321,2008-10-23 05:53:06,12008102305,14.015319,7.345484,20.082062
2,1,39.984222,116.319405,2008-10-23 05:53:11,12008102305,7.345484,1.628622,5.929780
3,1,39.984211,116.319389,2008-10-23 05:53:16,12008102305,1.628622,2.448495,1.224247
4,1,39.984219,116.319420,2008-10-23 05:53:21,12008102305,2.448495,66.161008,68.115491
...,...,...,...,...,...,...,...,...
217648,5,39.999897,116.327293,2009-03-19 05:46:02,52009031905,7.470501,5.830361,13.000767
217649,5,39.999901,116.327354,2009-03-19 05:46:07,52009031905,5.830361,6.359989,10.913244
217650,5,39.999947,116.327393,2009-03-19 05:46:12,52009031905,6.359989,7.943372,14.161495
217651,5,40.000015,116.327431,2009-03-19 05:46:17,52009031905,7.943372,5.443728,6.747541


The clean_gps_nearby_points_by_distances function removes points from the trajectories when the distance between them and the previous point is smaller than the parameter radius_area.

In [16]:
filters.clean_gps_nearby_points_by_distances(df_move, radius_area = 10)


Cleaning gps points from radius of 10 meters

...Dropping 138524 gps points

...Rows before: 217653, Rows after:79129

138524 GPS points were dropped


Unnamed: 0,id,lat,lon,datetime,tid,dist_to_prev,dist_to_next,dist_prev_to_next
0,1,39.984093,116.319237,2008-10-23 05:53:05,12008102305,,14.015319,
1,1,39.984200,116.319321,2008-10-23 05:53:06,12008102305,14.015319,7.345484,20.082062
5,1,39.984711,116.319862,2008-10-23 05:53:23,12008102305,66.161008,6.254723,60.106536
14,1,39.984959,116.319969,2008-10-23 05:54:03,12008102305,40.444375,10.888440,50.636317
15,1,39.985035,116.320053,2008-10-23 05:54:04,12008102305,10.888440,32.678475,24.990416
...,...,...,...,...,...,...,...,...
217637,5,40.000759,116.327087,2009-03-19 05:45:07,52009031905,28.504349,18.264679,46.708167
217638,5,40.000595,116.327065,2009-03-19 05:45:12,52009031905,18.264679,10.233476,27.953506
217639,5,40.000515,116.327019,2009-03-19 05:45:17,52009031905,10.233476,6.455600,16.250192
217641,5,40.000366,116.327072,2009-03-19 05:45:27,52009031905,11.259984,8.672381,19.873955
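
As a rough illustration of this rule, the sketch below filters the first six rows of the sample on their dist_to_prev values. It works on precomputed distances in a single pass and is not PyMove's implementation. The helper name is hypothetical, and treating a distance equal to the radius as "nearby" is an assumption.

```python
# Sketch only: drop points that lie within radius_area of the previous point.
# Assumption: points with no previous distance (first point) are kept.

def drop_nearby_points(rows, radius_area):
    """rows: list of (index, dist_to_prev in meters or None)."""
    return [
        (idx, dist)
        for idx, dist in rows
        if dist is None or dist > radius_area
    ]

# dist_to_prev for rows 0 to 5 of the sample, taken from the outputs above.
rows = [
    (0, None),
    (1, 14.015319),
    (2, 7.345484),
    (3, 1.628622),
    (4, 2.448495),
    (5, 66.161008),
]

kept = drop_nearby_points(rows, radius_area=10)
print([idx for idx, _ in kept])
```

The kept indices match the first three rows of the output above (0, 1 and 5).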


The clean_gps_nearby_points_by_speed function removes points from the trajectories when the speed of travel between them and the previous point is smaller than the value set by the parameter speed_radius.

In [17]:
filters.clean_gps_nearby_points_by_speed(df_move, speed_radius=40.0)


Creating or updating distance, time and speed features in meters by seconds

...Sorting by id and datetime to increase performance

...Set id as index to a higher peformance




...Reset index...

..Total Time: 0.259

Cleaning gps points using 40.0 speed radius

...Dropping 217372 gps points

...Rows before: 217653, Rows after:281

217372 GPS points were dropped


Unnamed: 0,id,lat,lon,datetime,tid,dist_to_prev,dist_to_next,dist_prev_to_next,time_to_prev,speed_to_prev
0,1,39.984093,116.319237,2008-10-23 05:53:05,12008102305,,14.015319,,,
149,1,39.977650,116.326927,2008-10-23 10:33:00,12008102310,1470.027290,7.308369,1467.127499,7.0,210.003899
560,1,40.009804,116.313248,2008-10-23 10:56:54,12008102310,47.087018,65.579993,112.380725,1.0,47.087018
561,1,40.009262,116.312950,2008-10-23 10:56:55,12008102310,65.579993,35.035117,100.449449,1.0,65.579993
1369,1,39.990658,116.326347,2008-10-24 00:04:29,12008102400,40.734946,42.020713,82.547348,1.0,40.734946
...,...,...,...,...,...,...,...,...,...,...
216382,5,40.000183,116.327286,2009-02-28 03:52:45,52009022803,333.702810,28.429016,361.914007,5.0,66.740562
217458,5,39.999920,116.320061,2009-03-19 04:36:02,52009031904,556.623406,265.002011,821.356742,5.0,111.324681
217459,5,39.999077,116.317154,2009-03-19 04:36:07,52009031904,265.002011,85.242667,219.182959,5.0,53.000402
217463,5,40.001122,116.320877,2009-03-19 04:40:52,52009031904,267.543222,127.292949,394.833355,5.0,53.508644
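
The speed_to_prev column in the output is consistent with dist_to_prev divided by time_to_prev, in meters per second. The sketch below recomputes it for a few rows and applies the speed_radius rule. It is an illustration on precomputed features, not PyMove's code, and the helper name is hypothetical.

```python
# Sketch only: speed in m/s from distance (m) and elapsed time (s),
# followed by the "drop slow points" rule described above.

def speed_to_prev(dist_to_prev, time_to_prev):
    """Return dist / time, or None when the point has no usable predecessor."""
    if dist_to_prev is None or time_to_prev is None or time_to_prev == 0:
        return None
    return dist_to_prev / time_to_prev

# (index, dist_to_prev, time_to_prev) taken from the outputs in this tutorial.
rows = [
    (0, None, None),
    (2, 7.345484, 5.0),
    (149, 1470.027290, 7.0),
    (560, 47.087018, 1.0),
]

speed_radius = 40.0
kept = []
for idx, dist, seconds in rows:
    speed = speed_to_prev(dist, seconds)
    # Keep the first point and every point faster than speed_radius.
    if speed is None or speed > speed_radius:
        kept.append(idx)

print(kept)
```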


The clean_gps_speed_max_radius function recursively removes trajectory points whose speed is higher than the value specified by the user. A point p of the trajectory is removed if either of the following holds: the travel speed from the previous point to p is greater than the maximum speed between adjacent points set by the user, or the travel speed from p to the next point is greater than that maximum. When the cleaning is done, the function updates the time and distance features in the dataframe and calls itself again. It stops when it can no longer find points that violate the speed limit.

In [18]:
filters.clean_gps_speed_max_radius(df_move)


Creating or updating distance, time and speed features in meters by seconds

...Sorting by id and datetime to increase performance

...Set id as index to a higher peformance




...Reset index...

..Total Time: 0.219

Clean gps points with speed max > 50.0 meters by seconds
...Dropping 183 rows of jumps by speed max

...Rows before: 217653, Rows after:217470

183 GPS points were dropped


Unnamed: 0,id,lat,lon,datetime,tid,dist_to_prev,dist_to_next,dist_prev_to_next,time_to_prev,speed_to_prev
0,1,39.984093,116.319237,2008-10-23 05:53:05,12008102305,,14.015319,,,
1,1,39.984200,116.319321,2008-10-23 05:53:06,12008102305,14.015319,7.345484,20.082062,1.0,14.015319
2,1,39.984222,116.319405,2008-10-23 05:53:11,12008102305,7.345484,1.628622,5.929780,5.0,1.469097
3,1,39.984211,116.319389,2008-10-23 05:53:16,12008102305,1.628622,2.448495,1.224247,5.0,0.325724
4,1,39.984219,116.319420,2008-10-23 05:53:21,12008102305,2.448495,66.161008,68.115491,5.0,0.489699
...,...,...,...,...,...,...,...,...,...,...
217648,5,39.999897,116.327293,2009-03-19 05:46:02,52009031905,7.470501,5.830361,13.000767,5.0,1.494100
217649,5,39.999901,116.327354,2009-03-19 05:46:07,52009031905,5.830361,6.359989,10.913244,5.0,1.166072
217650,5,39.999947,116.327393,2009-03-19 05:46:12,52009031905,6.359989,7.943372,14.161495,5.0,1.271998
217651,5,40.000015,116.327431,2009-03-19 05:46:17,52009031905,7.943372,5.443728,6.747541,5.0,1.588674
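
The repeat-until-clean procedure described before this cell can be sketched as follows. To keep the example short it uses a loop instead of recursion and one-dimensional positions in meters instead of latitude and longitude; both are simplifications. The default of 50.0 m/s comes from the log above, while the function name and the sample points are hypothetical.

```python
# Sketch only: repeatedly drop points adjacent to a too-fast step.
# Simplifications: positions are 1-D (meters along a line) and the
# procedure is written as a loop rather than a recursive call.

def clean_by_max_speed(points, max_speed=50.0):
    """points: list of (time_s, position_m), sorted by time."""
    while True:
        too_fast = set()
        for i in range(1, len(points)):
            (t0, x0), (t1, x1) = points[i - 1], points[i]
            if t1 == t0:
                continue  # speed undefined, skip this pair
            speed = abs(x1 - x0) / (t1 - t0)
            if speed > max_speed:
                # Point i has a too-high speed from its predecessor, and
                # point i-1 has a too-high speed to its successor.
                too_fast.update((i - 1, i))
        if not too_fast:
            return points  # nothing violates the limit any more
        # Remove the flagged points, then recompute speeds on the next pass.
        points = [p for i, p in enumerate(points) if i not in too_fast]

points = [(0, 0.0), (5, 10.0), (10, 20.0), (15, 2000.0), (20, 40.0), (25, 50.0)]
cleaned = clean_by_max_speed(points)
print(cleaned)
```

Note that the neighbors of the spike at 2000 m are removed as well, which follows from the rule that a point is dropped when the speed to either neighbor exceeds the limit.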


The clean_trajectories_with_few_points function removes from the given dataframe the trajectories that have fewer points than specified by the parameter min_points_per_trajectory.

In [19]:
filters.clean_trajectories_with_few_points(df_move)


Cleaning gps points from trajectories of fewer than 2 points


...There are 4 ids with few points

...Tids before drop: 625

...Tids after drop: 621

...Shape - before drop: (217653, 8) - after drop: (217649, 8)


Unnamed: 0,id,lat,lon,datetime,tid,dist_to_prev,dist_to_next,dist_prev_to_next
0,1,39.984093,116.319237,2008-10-23 05:53:05,12008102305,,14.015319,
1,1,39.984200,116.319321,2008-10-23 05:53:06,12008102305,14.015319,7.345484,20.082062
2,1,39.984222,116.319405,2008-10-23 05:53:11,12008102305,7.345484,1.628622,5.929780
3,1,39.984211,116.319389,2008-10-23 05:53:16,12008102305,1.628622,2.448495,1.224247
4,1,39.984219,116.319420,2008-10-23 05:53:21,12008102305,2.448495,66.161008,68.115491
...,...,...,...,...,...,...,...,...
217648,5,39.999897,116.327293,2009-03-19 05:46:02,52009031905,7.470501,5.830361,13.000767
217649,5,39.999901,116.327354,2009-03-19 05:46:07,52009031905,5.830361,6.359989,10.913244
217650,5,39.999947,116.327393,2009-03-19 05:46:12,52009031905,6.359989,7.943372,14.161495
217651,5,40.000015,116.327431,2009-03-19 05:46:17,52009031905,7.943372,5.443728,6.747541
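
A minimal sketch of the same idea: count the points per tid and keep only the tids that reach the minimum. The minimum of 2 matches the log above; the helper name and the tid values in the sample are hypothetical.

```python
from collections import Counter

# Sketch only: drop trajectories (grouped by tid) with too few points.

def drop_short_trajectories(rows, min_points_per_trajectory=2):
    """rows: list of (tid, payload). Keep rows whose tid occurs often enough."""
    counts = Counter(tid for tid, _ in rows)
    return [row for row in rows if counts[row[0]] >= min_points_per_trajectory]

rows = [
    ("A", 1), ("A", 2), ("A", 3),  # three points, kept
    ("B", 4),                      # single point, dropped
    ("C", 5), ("C", 6),            # two points, kept
]

kept = drop_short_trajectories(rows)
print(kept)
```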


## Segmentation

The segmentation module is used to segment trajectories based on different parameters.

Importing the module:

In [20]:
from pymove import segmentation
df_move = read_csv('geolife_sample.csv')

The bbox_split function splits the bounding box into grid cells of the same size. The number of cells is defined by the parameter number_grids.

In [21]:
bbox = (22.147577, 113.54884299999999, 41.132062, 121.156224)
segmentation.bbox_split(bbox, number_grids=4)

const_lat: 4.74612125
const_lon: 1.901845250000001


Unnamed: 0,lat_min,lon_min,lat_max,lon_max
0,22.147577,113.548843,41.132062,115.450688
1,22.147577,115.450688,41.132062,117.352533
2,22.147577,117.352533,41.132062,119.254379
3,22.147577,119.254379,41.132062,121.156224
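
In the table above every cell spans the full latitude range and only the longitude range is divided, with a step equal to the printed const_lon. The sketch below reproduces that layout. It is inferred from the output, not taken from PyMove's source, and the function name is hypothetical.

```python
# Sketch only: split a bbox into equal-width strips along the longitude.
# Assumption: bbox = (lat_min, lon_min, lat_max, lon_max), and latitude
# is left undivided, as in the output table above.

def split_bbox_by_longitude(bbox, number_grids):
    lat_min, lon_min, lat_max, lon_max = bbox
    step = (lon_max - lon_min) / number_grids  # matches const_lon above
    return [
        (lat_min, lon_min + i * step, lat_max, lon_min + (i + 1) * step)
        for i in range(number_grids)
    ]

bbox = (22.147577, 113.548843, 41.132062, 121.156224)
cells = split_bbox_by_longitude(bbox, number_grids=4)
for cell in cells:
    print(tuple(round(v, 6) for v in cell))
```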


The by_dist_time_speed function segments the trajectories into clusters based on distance, time and speed. The distance, time and speed limits are defined by the parameters max_dist_between_adj_points, max_time_between_adj_points and max_speed_between_adj_points respectively. The column tid_part is added and indicates the segment to which each point belongs.

In [22]:
segmentation.by_dist_time_speed(df_move, max_dist_between_adj_points=5000,
                                max_time_between_adj_points=800, max_speed_between_adj_points=60.0)
df_move.head()


Split trajectories
...max_time_between_adj_points: 800
...max_dist_between_adj_points: 5000
...max_speed: 60.0

Creating or updating distance, time and speed features in meters by seconds

...Sorting by id and datetime to increase performance

...Set id as index to a higher peformance




...Reset index...

..Total Time: 0.192
...setting id as index



... Reseting index

...No trajs with only one point. (217653, 8)

Creating or updating distance, time and speed features in meters by seconds

...Sorting by id and datetime to increase performance

...Set id as index to a higher peformance




...Reset index...

..Total Time: 0.200
------------------------------------------



Unnamed: 0,id,lat,lon,datetime,dist_to_prev,time_to_prev,speed_to_prev,tid_part
0,1,39.984093,116.319237,2008-10-23 05:53:05,,,,1
1,1,39.9842,116.319321,2008-10-23 05:53:06,14.015319,1.0,14.015319,1
2,1,39.984222,116.319405,2008-10-23 05:53:11,7.345484,5.0,1.469097,1
3,1,39.984211,116.319389,2008-10-23 05:53:16,1.628622,5.0,0.325724,1
4,1,39.984219,116.31942,2008-10-23 05:53:21,2.448495,5.0,0.489699,1
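
The segmentation idea can be sketched for a single trajectory: walk through the points in time order and start a new segment whenever the step from the previous point exceeds one of the limits. This is an illustration under that assumption, not PyMove's implementation. The function name and the fourth sample point are hypothetical.

```python
# Sketch only: assign a segment number to each point of one trajectory.
# A new segment starts at the first point and whenever the distance,
# elapsed time, or speed to the previous point exceeds its limit.

def label_segments(steps, max_dist, max_time, max_speed):
    """steps: list of (dist_to_prev_m, time_to_prev_s); None for the first point."""
    labels = []
    segment = 0
    for dist, seconds in steps:
        if dist is None or seconds is None:
            segment += 1  # first point of the trajectory
        else:
            speed = dist / seconds if seconds > 0 else float("inf")
            if dist > max_dist or seconds > max_time or speed > max_speed:
                segment += 1  # limit exceeded, open a new segment
        labels.append(segment)
    return labels

steps = [
    (None, None),
    (14.015319, 1.0),
    (7.345484, 5.0),
    (6000.0, 600.0),  # hypothetical gap larger than max_dist
    (3.0, 5.0),
]

labels = label_segments(steps, max_dist=5000, max_time=800, max_speed=60.0)
print(labels)
```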


The by_max_speed function segments the trajectories into clusters based on speed. The speed limit is defined by the parameter max_speed_between_adj_points. The column tid_speed is added and indicates the segment to which each point belongs.

In [23]:
segmentation.by_max_speed(df_move, max_speed_between_adj_points=70.0)
df_move.head()

Split trajectories by max_speed_between_adj_points: 70.0
...setting id as index



... Reseting index
...No trajs with only one point. (217653, 9)

Creating or updating distance, time and speed features in meters by seconds

...Sorting by id and datetime to increase performance

...Set id as index to a higher peformance




...Reset index...

..Total Time: 0.251
------------------------------------------



Unnamed: 0,id,lat,lon,datetime,dist_to_prev,time_to_prev,speed_to_prev,tid_part,tid_speed
0,1,39.984093,116.319237,2008-10-23 05:53:05,,,,1,1
1,1,39.9842,116.319321,2008-10-23 05:53:06,14.015319,1.0,14.015319,1,1
2,1,39.984222,116.319405,2008-10-23 05:53:11,7.345484,5.0,1.469097,1,1
3,1,39.984211,116.319389,2008-10-23 05:53:16,1.628622,5.0,0.325724,1,1
4,1,39.984219,116.31942,2008-10-23 05:53:21,2.448495,5.0,0.489699,1,1


The by_max_time function segments the trajectories into clusters based on time. The time limit is defined by the parameter max_time_between_adj_points. The column tid_time is added and indicates the segment to which each point belongs.

In [24]:
segmentation.by_max_time(df_move, max_time_between_adj_points=1000)
df_move.head()

Split trajectories by max_time_between_adj_points: 1000
...setting id as index



... Reseting index
...No trajs with only one point. (217653, 10)

Creating or updating distance, time and speed features in meters by seconds

...Sorting by id and datetime to increase performance

...Set id as index to a higher peformance




...Reset index...

..Total Time: 0.248
------------------------------------------



Unnamed: 0,id,lat,lon,datetime,dist_to_prev,time_to_prev,speed_to_prev,tid_part,tid_speed,tid_time
0,1,39.984093,116.319237,2008-10-23 05:53:05,,,,1,1,1
1,1,39.9842,116.319321,2008-10-23 05:53:06,14.015319,1.0,14.015319,1,1,1
2,1,39.984222,116.319405,2008-10-23 05:53:11,7.345484,5.0,1.469097,1,1,1
3,1,39.984211,116.319389,2008-10-23 05:53:16,1.628622,5.0,0.325724,1,1,1
4,1,39.984219,116.31942,2008-10-23 05:53:21,2.448495,5.0,0.489699,1,1,1


The by_max_dist function segments the trajectories into clusters based on distance. The distance limit is defined by the parameter max_dist_between_adj_points. The column tid_dist is added and indicates the segment to which each point belongs.

In [25]:
segmentation.by_max_dist(df_move, max_dist_between_adj_points = 4000)
df_move.head()

Split trajectories by max distance between adjacent points: 4000
...setting id as index



... Reseting index
...No trajs with only one point. (217653, 11)

Creating or updating distance features in meters...

...Sorting by id and datetime to increase performance

...Set id as index to increase attribution performance




...Reset index

..Total Time: 0.3769388198852539
------------------------------------------



Unnamed: 0,id,lat,lon,datetime,dist_to_prev,time_to_prev,speed_to_prev,tid_part,tid_speed,tid_time,tid_dist,dist_to_next,dist_prev_to_next
0,1,39.984093,116.319237,2008-10-23 05:53:05,,,,1,1,1,1,14.015319,
1,1,39.9842,116.319321,2008-10-23 05:53:06,14.015319,1.0,14.015319,1,1,1,1,7.345484,20.082062
2,1,39.984222,116.319405,2008-10-23 05:53:11,7.345484,5.0,1.469097,1,1,1,1,1.628622,5.92978
3,1,39.984211,116.319389,2008-10-23 05:53:16,1.628622,5.0,0.325724,1,1,1,1,2.448495,1.224247
4,1,39.984219,116.31942,2008-10-23 05:53:21,2.448495,5.0,0.489699,1,1,1,1,66.161008,68.115491


## Stay point detection 

A stay point is a location where a moving object has stayed for a while within a certain distance threshold. A stay point can stand for different kinds of places, such as a restaurant, a school or a workplace.

Importing the module:

In [26]:
from pymove import stay_point_detection
df_move = read_csv('geolife_sample.csv')

The create_update_datetime_in_format_cyclical function converts the hour of each timestamp into a cyclical format. The columns hour_sin and hour_cos are added to the dataframe.

In [27]:
stay_point_detection.create_update_datetime_in_format_cyclical(df_move)
df_move.head()

Encoding cyclical continuous features - 24-hour time
...hour_sin and  hour_cos features were created...



Unnamed: 0,lat,lon,datetime,id,hour_sin,hour_cos
0,39.984093,116.319237,2008-10-23 05:53:05,1,0.979084,0.203456
1,39.9842,116.319321,2008-10-23 05:53:06,1,0.979084,0.203456
2,39.984222,116.319405,2008-10-23 05:53:11,1,0.979084,0.203456
3,39.984211,116.319389,2008-10-23 05:53:16,1,0.979084,0.203456
4,39.984219,116.31942,2008-10-23 05:53:21,1,0.979084,0.203456
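
Cyclical encoding maps the hour onto a circle so that late and early hours end up close to each other. The values in the table above (hour 5) are reproduced by using a period of 23 in the sine and cosine, so the sketch below uses that value. The period is inferred from the output rather than from PyMove's source, and the function name is hypothetical.

```python
import math

# Sketch only: encode the hour of day as a point on the unit circle.
# Assumption: period=23.0, inferred from the hour_sin/hour_cos values above.

def encode_hour(hour, period=23.0):
    """Return (sin, cos) of the hour mapped onto a full circle."""
    angle = 2.0 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)

hour_sin, hour_cos = encode_hour(5)
print(round(hour_sin, 6), round(hour_cos, 6))
```

With a period of 24 the same hour would give roughly 0.966 and 0.259, which does not match the table.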


The create_or_update_move_stop_by_dist_time function creates or updates the stay points of the trajectories based on distance and time thresholds. The column segment_stop is added to the dataframe and indicates the trajectory segment to which each point belongs. The column stop is also added and indicates whether the point represents a stop, that is, a place where the object was stationary.

In [28]:
stay_point_detection.create_or_update_move_stop_by_dist_time(df_move, dist_radius=40, time_radius=1000)
df_move.head()

Split trajectories by max distance between adjacent points: 40

Creating or updating distance features in meters...

...Sorting by id and datetime to increase performance

...Set id as index to increase attribution performance




...Reset index

..Total Time: 0.3743858337402344
...setting id as index



... Reseting index
...No trajs with only one point. (217653, 10)

Creating or updating distance features in meters...

...Sorting by id and datetime to increase performance

...Set id as index to increase attribution performance




...Reset index

..Total Time: 0.38266563415527344
------------------------------------------


Creating or updating distance, time and speed features in meters by seconds

...Sorting by segment_stop and datetime to increase performance

...Set segment_stop as index to a higher peformance




...Reset index...

..Total Time: 1.891
Create or update stop as True or False
...Creating stop features as True or False using 1000 to time in seconds
True     157842
False     59811
Name: stop, dtype: int64

Total Time: 2.98 seconds
-----------------------------------------------------



Unnamed: 0,segment_stop,id,lat,lon,datetime,hour_sin,hour_cos,dist_to_prev,dist_to_next,dist_prev_to_next,time_to_prev,speed_to_prev,stop
0,1,1,39.984093,116.319237,2008-10-23 05:53:05,0.979084,0.203456,,14.015319,,,,False
1,1,1,39.9842,116.319321,2008-10-23 05:53:06,0.979084,0.203456,14.015319,7.345484,20.082062,1.0,14.015319,False
2,1,1,39.984222,116.319405,2008-10-23 05:53:11,0.979084,0.203456,7.345484,1.628622,5.92978,5.0,1.469097,False
3,1,1,39.984211,116.319389,2008-10-23 05:53:16,0.979084,0.203456,1.628622,2.448495,1.224247,5.0,0.325724,False
4,1,1,39.984219,116.31942,2008-10-23 05:53:21,0.979084,0.203456,2.448495,66.161008,68.115491,5.0,0.489699,False


The create_update_move_and_stop_by_radius function creates or updates the stay points of the trajectories based on distance. The column situation is added and indicates whether the point represents a stop point or a moving point.

In [29]:
stay_point_detection.create_update_move_and_stop_by_radius(df_move, radius=2)
df_move.head()


Creating or updating features MOVE and STOPS...


....There are 58738 stops to this parameters



Unnamed: 0,segment_stop,id,lat,lon,datetime,hour_sin,hour_cos,dist_to_prev,dist_to_next,dist_prev_to_next,time_to_prev,speed_to_prev,stop,situation
0,1,1,39.984093,116.319237,2008-10-23 05:53:05,0.979084,0.203456,,14.015319,,,,False,
1,1,1,39.9842,116.319321,2008-10-23 05:53:06,0.979084,0.203456,14.015319,7.345484,20.082062,1.0,14.015319,False,move
2,1,1,39.984222,116.319405,2008-10-23 05:53:11,0.979084,0.203456,7.345484,1.628622,5.92978,5.0,1.469097,False,move
3,1,1,39.984211,116.319389,2008-10-23 05:53:16,0.979084,0.203456,1.628622,2.448495,1.224247,5.0,0.325724,False,stop
4,1,1,39.984219,116.31942,2008-10-23 05:53:21,0.979084,0.203456,2.448495,66.161008,68.115491,5.0,0.489699,False,move
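
The situation column above is consistent with a simple comparison of dist_to_prev against the radius. The sketch below applies that rule to the five rows shown. It is inferred from the output and is not PyMove's code. The helper name is hypothetical, and labeling a distance equal to the radius as a stop is an assumption.

```python
# Sketch only: label each point as "move" or "stop" from dist_to_prev.
# Assumption: points without a previous distance get no label (None).

def label_situation(dist_to_prev, radius):
    if dist_to_prev is None:
        return None
    return "move" if dist_to_prev > radius else "stop"

# dist_to_prev of rows 0 to 4, taken from the table above.
distances = [None, 14.015319, 7.345484, 1.628622, 2.448495]
situation = [label_situation(d, radius=2) for d in distances]
print(situation)
```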


## Compression

Importing the module:

In [30]:
from pymove import compression
df_move = read_csv('geolife_sample.csv')

The function below reduces the size of the trajectories by using the detected stop segments: the points of a stop segment are compressed into a single point.

In [31]:
df_compressed = compression.compress_segment_stop_to_point_optimizer(df_move)
len(df_move), len(df_compressed)

Split trajectories by max distance between adjacent points: 30

Creating or updating distance features in meters...

...Sorting by id and datetime to increase performance

...Set id as index to increase attribution performance




...Reset index

..Total Time: 0.3160533905029297
...setting id as index



... Reseting index
...No trajs with only one point. (217653, 8)

Creating or updating distance features in meters...

...Sorting by id and datetime to increase performance

...Set id as index to increase attribution performance




...Reset index

..Total Time: 0.18229031562805176
------------------------------------------


Creating or updating distance, time and speed features in meters by seconds

...Sorting by segment_stop and datetime to increase performance

...Set segment_stop as index to a higher peformance




...Reset index...

..Total Time: 2.212
Create or update stop as True or False
...Creating stop features as True or False using 900 to time in seconds
True     151227
False     66426
Name: stop, dtype: int64

Total Time: 3.06 seconds
-----------------------------------------------------

...setting mean to lat and lon...
...get only segments stop...



...Dropping 150655 points...
...Shape_before: 217653
...Current shape: 66998
-----------------------------------------------------



(217653, 66998)
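
The log mentions setting the mean of lat and lon and keeping only stop segments, which suggests that each stop segment is replaced by one representative point. A minimal sketch of that idea follows. It assumes rows are already labeled with a segment number and a stop flag; the function name and the sample rows are hypothetical, and the details of PyMove's optimizer may differ.

```python
from itertools import groupby

# Sketch only: collapse every stop segment into one point at its mean position.
# rows: list of (segment, lat, lon, is_stop), ordered by segment.

def compress_stops(rows):
    compressed = []
    for (segment, is_stop), group in groupby(rows, key=lambda r: (r[0], r[3])):
        group = list(group)
        if is_stop:
            # Replace the whole stop segment by its mean latitude and longitude.
            mean_lat = sum(r[1] for r in group) / len(group)
            mean_lon = sum(r[2] for r in group) / len(group)
            compressed.append((segment, mean_lat, mean_lon, True))
        else:
            # Moving points are kept unchanged.
            compressed.extend(group)
    return compressed

rows = [
    (1, 0.0, 0.0, False),
    (2, 1.0, 2.0, True),
    (2, 3.0, 4.0, True),
    (3, 5.0, 5.0, False),
]

compressed = compress_stops(rows)
print(len(rows), len(compressed))
```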

## Map matching

Importing the module:

In [32]:
from pymove import map_matching

In [33]:
# map_matching.check_time_dist(df_move)