Author: Luca Pappalardo
</br>Geospatial Analytics, Master degree in Data Science and Business Informatics, University of Pisa

# Geospatial Analytics - Lesson 4: Preprocessing Data

In this lesson, we will learn how to handle and explore spatial data in Python using folium and scikit-mobility.

1. [Noise Filtering](#filtering)
2. [Trajectory compression](#compression)
3. [Stop Detection](#stopdetection)
4. [Stops Clustering](#clustering)
5. [Practice](#practice)
6. [From trajectories to flows](#flowtotraj)

Mobility data analysis requires data cleaning and preprocessing steps. 

The `preprocessing` module allows the user to perform noise filtering, trajectory compression, and stop detection. 

Note that if a `TrajDataFrame` contains multiple trajectories from multiple objects, the preprocessing functions are applied separately to each trajectory and, when necessary, to each object. 

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# import the libraries
import skmob
import pandas as pd
import geopandas as gpd
import folium

## Load the GeoLife dataset
- you can find a portion of the GeoLife dataset at this link: https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz

In [3]:
# create a TrajDataFrame from a dataset of trajectories 
url = "https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz"
tdf = skmob.TrajDataFrame.from_file(url)
print(type(tdf))
tdf.head()

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>


Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


Let's create a `TrajDataFrame` for a single user

In [4]:
tdf['uid'].unique()

array([1, 5], dtype=int64)

In [5]:
user1_tdf = tdf[tdf.uid == 1]
print('points of this user: %s' %len(user1_tdf))
user1_tdf.head()

points of this user: 108607


Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [6]:
user1_tdf.plot_trajectory()

<a id='filtering'></a>
## Noise filtering

Trajectory data are in general **noisy**, usually because of recording errors like poor signal reception. When the error associated with the coordinates of points is large, the best solution is to **filter out** these points. 

In scikit-mobility, the `filter` function filters out a point if the speed from the previous point is higher than the parameter `max_speed_kmh`, which is set to 500 km/h by default. 

The intensity of the filter is controlled by the `max_speed_kmh` parameter. The lower the value, the more intense the filter is.
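
The speed rule can be sketched in plain Python. This is a simplified illustration under stated assumptions, not scikit-mobility's implementation: the helper names `haversine_km` and `filter_by_speed` are hypothetical, points are assumed to be `(lat, lng, datetime)` tuples sorted by time, and the speed is measured from the last point that was kept.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (an assumption of this sketch)

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng, ...) tuples in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def filter_by_speed(points, max_speed_kmh=500.0):
    """Drop every point reached from the last kept point faster than max_speed_kmh."""
    if not points:
        return []
    kept = [points[0]]
    for lat, lng, dt in points[1:]:
        hours = (dt - kept[-1][2]).total_seconds() / 3600
        if hours <= 0:
            # no elapsed time means no meaningful speed: drop the point
            continue
        if haversine_km(kept[-1], (lat, lng)) / hours <= max_speed_kmh:
            kept.append((lat, lng, dt))
    return kept

# usage: two plausible points followed by an implausible jump of about 111 km in 5 s
points = [
    (39.9840, 116.3190, datetime(2008, 10, 23, 5, 53, 5)),
    (39.9841, 116.3191, datetime(2008, 10, 23, 5, 53, 10)),
    (40.9841, 116.3191, datetime(2008, 10, 23, 5, 53, 15)),
]
print(len(filter_by_speed(points, max_speed_kmh=500.0)))
```

Lowering `max_speed_kmh` in this sketch removes more points, which mirrors the behaviour described above.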

`filter` has other parameters, check them here: https://scikit-mobility.github.io/scikit-mobility/reference/preprocessing.html#skmob.preprocessing.filtering.filter 

To use the `filter` function, you must import it from the `preprocessing` module.

In [7]:
from skmob.preprocessing import filtering

In [8]:
f_tdf = filtering.filter(tdf, max_speed_kmh=500.)
print("Number of points in the filtered tdf: %d" %len(f_tdf))
print("Number of filtered points: %d\n" %(len(tdf) - len(f_tdf)))
f_tdf.head()

Number of points in the filtered tdf: 217599
Number of filtered points: 54



Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


Every time you use a `preprocessing` function, an item is added to the `parameters` attribute of the resulting `TrajDataFrame`, recording the parameter values used in the call.

In [9]:
f_tdf.parameters

{'from_file': 'https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25}}

Let's compare visually the original trajectory and the filtered trajectory of the selected user

In [10]:
user1_f_tdf = f_tdf[f_tdf['uid'] == 1]
print(user1_f_tdf.parameters)
print('Filtered points:\t%s'%(len(user1_tdf) - len(user1_f_tdf)))

{'from_file': 'https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz', 'filter': {'function': 'filter', 'max_speed_kmh': 500.0, 'include_loops': False, 'speed_kmh': 5.0, 'max_loop': 6, 'ratio_max': 0.25}}
Filtered points:	18


In [11]:
map_f = user1_tdf.plot_trajectory(zoom=11, weight=10, opacity=0.5, hex_color='black') 
user1_f_tdf.plot_trajectory(map_f=map_f, hex_color='red')

### Which points have been filtered?

In [12]:
# indicator adds column _merge
merged = user1_tdf.merge(user1_f_tdf, indicator=True, how='outer')
diff_df = merged[merged['_merge'] == 'left_only']
print(len(diff_df))
diff_df

18


Unnamed: 0,lat,lng,datetime,uid,_merge
149,39.977648,116.326925,2008-10-23 10:33:00,1,left_only
17792,40.013398,116.30649,2008-10-27 12:27:55,1,left_only
23212,39.975403,116.312814,2008-10-31 06:15:21,1,left_only
23213,39.975342,116.312961,2008-10-31 06:15:23,1,left_only
24509,40.070867,116.301276,2008-11-01 01:06:36,1,left_only
24510,40.070832,116.301441,2008-11-01 01:06:37,1,left_only
25373,40.062216,116.294486,2008-11-01 04:17:41,1,left_only
25374,40.061976,116.294452,2008-11-01 04:17:42,1,left_only
25375,40.061711,116.29427,2008-11-01 04:17:43,1,left_only
25376,40.061615,116.294441,2008-11-01 04:17:44,1,left_only


Let's extract the points between indexes `25372` and `25377`, i.e., from the point just before the filtered run `25373`-`25376` to the point just after it.



In [13]:
min_index, max_index = 25373, 25376
dt_start = user1_tdf.loc[min_index - 1]['datetime']
dt_end = user1_tdf.loc[max_index + 1]['datetime']
filtered_tdf = user1_f_tdf[(user1_f_tdf['datetime'] >= dt_start) \
                 & (user1_f_tdf['datetime'] <= dt_end)]

unfiltered_tdf = user1_tdf[(user1_tdf['datetime'] >= dt_start) \
                  & (user1_tdf['datetime'] <= dt_end)]
filtered_tdf

Unnamed: 0,lat,lng,datetime,uid
25366,40.064046,116.301866,2008-11-01 04:17:40,1
25367,40.061521,116.294584,2008-11-01 04:17:45,1


Compute the average speed from the last retained point to each of the following points on the unfiltered trajectory

In [14]:
lat_lng_dt = unfiltered_tdf[['lat', 'lng', 'datetime']].values

In [15]:
# average speed (km/h) between the last retained point and each following point
from skmob.utils.gislib import getDistance
lat0, lng0, dt0 = lat_lng_dt[0]
rows = []
for lat, lng, dt in lat_lng_dt[1:]:
    speed_kmh = getDistance((lat, lng), (lat0, lng0)) / ((dt - dt0).total_seconds() / 3600)
    rows.append([dt0, dt, speed_kmh, speed_kmh > 500.0])  # True if above max_speed_kmh
pd.DataFrame(rows, columns=['time 0', 'time 1', 'speed (km/h)', 'to_filter'])

Unnamed: 0,time 0,time 1,speed (km/h),to_filter
0,2008-11-01 04:17:40,2008-11-01 04:17:41,2376.687211,True
1,2008-11-01 04:17:40,2008-11-01 04:17:42,1208.91039,True
2,2008-11-01 04:17:40,2008-11-01 04:17:43,835.951942,True
3,2008-11-01 04:17:40,2008-11-01 04:17:44,618.545448,True
4,2008-11-01 04:17:40,2008-11-01 04:17:45,489.850389,False


### Playing with the `max_speed_kmh` parameter

In [16]:
f2_tdf = filtering.filter(tdf, max_speed_kmh=100.)
print("Number of points in the filtered tdf: %d" %len(f2_tdf))
print("Number of filtered points: %d\n" %(len(tdf) - len(f2_tdf)))
f2_tdf.head()

Number of points in the filtered tdf: 216374
Number of filtered points: 1279



Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [17]:
user1_f2_tdf = f2_tdf[f2_tdf['uid'] == 1]
print(user1_f2_tdf.parameters)
print('Filtered points:\t%s'%(len(user1_tdf) - len(user1_f2_tdf)))

{'from_file': 'https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz', 'filter': {'function': 'filter', 'max_speed_kmh': 100.0, 'include_loops': False, 'speed_kmh': 5.0, 'max_loop': 6, 'ratio_max': 0.25}}
Filtered points:	558


In [18]:
map_f = user1_tdf.plot_trajectory(zoom=12, weight=10, opacity=0.5, hex_color='black') 
user1_f2_tdf.plot_trajectory(map_f=map_f, hex_color='blue')

In [19]:
map_f = folium.plugins.DualMap(location=(user1_tdf['lat'].mean(), 
                                         user1_tdf['lng'].mean()), 
                                       tiles='cartodbpositron', zoom_start=12)
m1, m2 = map_f.m1, map_f.m2

# filtering 1
user1_tdf.plot_trajectory(map_f=m1, zoom=12, weight=10, opacity=0.5, hex_color='black') 
user1_f_tdf.plot_trajectory(map_f=m1, start_end_markers=False, hex_color='blue')

# filtering 2
user1_tdf.plot_trajectory(map_f=m2, zoom=12, weight=10, opacity=0.5, hex_color='black') 
user1_f2_tdf.plot_trajectory(map_f=m2, start_end_markers=False, hex_color='blue')
#display(map_f)
map_f

<a id="compression"></a>
## Trajectory compression

The goal of trajectory compression is to reduce the number of points while preserving the trajectory structure. 

In scikit-mobility, we can use the function `compression.compress` of the `preprocessing` module. 

All points within a radius of `spatial_radius_km` kilometers from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point. 

Check the documentation of `compress` here: https://scikit-mobility.github.io/scikit-mobility/reference/preprocessing.html#skmob.preprocessing.compression.compress 
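
The compression rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: the names `haversine_km`, `compress_points` and `summarize` are hypothetical, and points are assumed to be `(lat, lng, datetime)` tuples sorted by time for a single object.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt
from statistics import median

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng, ...) tuples in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius (assumption)

def summarize(group):
    """Median coordinates of the group, timestamped with its initial point."""
    return (median(p[0] for p in group), median(p[1] for p in group), group[0][2])

def compress_points(points, spatial_radius_km=0.2):
    """Collapse each run of points lying within the radius of the run's initial point."""
    compressed, group = [], []
    for point in points:
        # a point farther than the radius from the initial point closes the current run
        if group and haversine_km(group[0], point) > spatial_radius_km:
            compressed.append(summarize(group))
            group = []
        group.append(point)
    if group:
        compressed.append(summarize(group))
    return compressed

# usage: three points a few metres apart, then two points about 1 km further north
points = [
    (39.9840, 116.3190, datetime(2008, 10, 23, 5, 53, 5)),
    (39.9842, 116.3192, datetime(2008, 10, 23, 5, 53, 10)),
    (39.9841, 116.3191, datetime(2008, 10, 23, 5, 53, 15)),
    (39.9940, 116.3190, datetime(2008, 10, 23, 5, 54, 5)),
    (39.9941, 116.3190, datetime(2008, 10, 23, 5, 54, 10)),
]
for lat, lng, dt in compress_points(points, spatial_radius_km=0.2):
    print(lat, lng, dt)
```

A larger `spatial_radius_km` makes each run longer, so fewer points survive the compression.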

In [20]:
from skmob.preprocessing import compression

In [21]:
fc_tdf = compression.compress(f_tdf, spatial_radius_km=0.2)
fc_tdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984302,116.32073,2008-10-23 05:53:05,1
1,39.982115,116.321225,2008-10-23 05:56:06,1
2,39.979737,116.321564,2008-10-23 05:57:03,1
3,39.979671,116.323778,2008-10-23 05:59:05,1
4,39.979638,116.326375,2008-10-23 05:59:59,1


In [22]:
print('Points of the filtered trajectory:\t%s'%len(f_tdf))
print('Points of the compressed trajectory:\t%s'%len(fc_tdf))
print('Compressed points:\t\t\t%s'%(len(f_tdf) - len(fc_tdf)))

Points of the filtered trajectory:	217599
Points of the compressed trajectory:	6280
Compressed points:			211319


In [23]:
fc_tdf.parameters

{'from_file': 'https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25},
 'compress': {'function': 'compress', 'spatial_radius_km': 0.2}}

In [24]:
user1_fc_tdf = fc_tdf[fc_tdf['uid'] == 1]

In [25]:
print('Points of the filtered trajectory:\t%s'%len(user1_f_tdf))
print('Points of the compressed trajectory:\t%s'%len(user1_fc_tdf))
print('Compressed points:\t\t\t%s'%(len(user1_f_tdf)-len(user1_fc_tdf)))

Points of the filtered trajectory:	108589
Points of the compressed trajectory:	3489
Compressed points:			105100


In [26]:
map_f = user1_tdf.plot_trajectory(zoom=12, weight=10, opacity=0.5, hex_color='black') 
user1_fc_tdf.plot_trajectory(map_f=map_f, hex_color='blue')

### Playing with the `spatial_radius_km` parameter

In [27]:
end_time = user1_f_tdf.iloc[10000]['datetime']
map_f = user1_f_tdf[user1_f_tdf['datetime'] < end_time].plot_trajectory(zoom=14, weight=5, hex_color='black',
                                                                      opacity=0.5, start_end_markers=False)
user1_fc_tdf[user1_fc_tdf['datetime'] < end_time].plot_trajectory(map_f=map_f, \
                                                  start_end_markers=False, hex_color='red')

In [28]:
spatial_radius_km=0.5

user1_fc_tdf = compression.compress(user1_f_tdf, spatial_radius_km=spatial_radius_km)
end_time = user1_f_tdf.iloc[10000]['datetime']
map_f = user1_f_tdf[user1_f_tdf['datetime'] < end_time].plot_trajectory(zoom=14, weight=5, hex_color='black',
                                                                      opacity=0.5, start_end_markers=False)
user1_fc_tdf[user1_fc_tdf['datetime'] < end_time].plot_trajectory(map_f=map_f, \
                                                  start_end_markers=False, hex_color='red')

<a id="stopdetection"></a>
## Stop detection

Some points in a trajectory can represent Points of Interest (POIs) such as schools, restaurants, and bars, or individual-specific places such as home and work locations. These points are usually called Stay Points or Stops, and they can be detected in different ways.

A common approach is to apply spatial clustering algorithms to cluster trajectory points by looking at their spatial proximity. 

In scikit-mobility, the `stay_locations` function in the `detection` module finds the stay points visited by an object. 

A stop is detected when the individual spends at least `minutes_for_a_stop` minutes within a distance `stop_radius_factor * spatial_radius_km` from a given trajectory point. 

The stop's coordinates are the median latitude and longitude values of the points found within the specified distance.

Check the documentation of `stay_locations` here: https://scikit-mobility.github.io/scikit-mobility/reference/preprocessing.html#skmob.preprocessing.detection.stops
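
The stop rule described above can be sketched in plain Python. This is a simplified illustration, not scikit-mobility's implementation: the names `haversine_km` and `stay_points` are hypothetical, points are assumed to be `(lat, lng, datetime)` tuples sorted by time for a single object, and the leaving time is assumed to be the time of the first point recorded outside the radius.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from statistics import median

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng, ...) tuples in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius (assumption)

def stay_points(points, minutes_for_a_stop=20.0, stop_radius_km=0.1):
    """Return (lat, lng, arrival, leaving) for every stop found in the trajectory."""
    stops, i, n = [], 0, len(points)
    while i < n:
        # extend the window while the points stay within the radius of points[i]
        j = i + 1
        while j < n and haversine_km(points[i], points[j]) <= stop_radius_km:
            j += 1
        # leaving time: first point outside the radius, or the last point available
        leaving = points[j][2] if j < n else points[j - 1][2]
        if leaving - points[i][2] >= timedelta(minutes=minutes_for_a_stop):
            group = points[i:j]
            stops.append((median(p[0] for p in group), median(p[1] for p in group),
                          points[i][2], leaving))
            i = j
        else:
            i += 1
    return stops

# usage: 30 minutes around one place, then two moves of roughly 2 km each
t0 = datetime(2008, 10, 23, 6, 0, 0)
points = [
    (39.97890, 116.32680, t0),
    (39.97895, 116.32685, t0 + timedelta(minutes=10)),
    (39.97892, 116.32682, t0 + timedelta(minutes=20)),
    (39.97891, 116.32681, t0 + timedelta(minutes=30)),
    (39.99890, 116.32680, t0 + timedelta(minutes=40)),
    (40.01890, 116.32680, t0 + timedelta(minutes=45)),
]
# the radius plays the role of stop_radius_factor * spatial_radius_km = 0.5 * 0.2
stops = stay_points(points, minutes_for_a_stop=20.0, stop_radius_km=0.5 * 0.2)
print(stops)
```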

In [29]:
from skmob.preprocessing import detection

In [30]:
fcs_tdf = detection.stay_locations(fc_tdf, stop_radius_factor=0.5, 
                          minutes_for_a_stop=20.0, spatial_radius_km=0.2)
fcs_tdf.head()

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime
0,39.978945,116.326825,2008-10-23 05:59:59,1,2008-10-23 10:32:53
1,40.013819,116.306532,2008-10-23 11:10:09,1,2008-10-23 23:46:02
2,39.978987,116.326686,2008-10-24 00:10:39,1,2008-10-24 01:48:57
3,39.980755,116.310771,2008-10-24 01:53:53,1,2008-10-24 03:26:35
4,39.97958,116.313649,2008-10-24 03:26:35,1,2008-10-24 03:50:36


A new column `leaving_datetime` is added to the `TrajDataFrame` to indicate the time when the moving object left the stop location.

In [31]:
fcs_tdf.parameters

{'from_file': 'https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25},
 'compress': {'function': 'compress', 'spatial_radius_km': 0.2},
 'detect': {'function': 'stay_locations',
  'stop_radius_factor': 0.5,
  'minutes_for_a_stop': 20.0,
  'spatial_radius_km': 0.2,
  'leaving_time': True,
  'no_data_for_minutes': 1000000000000.0,
  'min_speed_kmh': None}}

#### Visualise the compressed trajectory and the stops
Click on the stop markers to see a pop-up with:

- User ID
- Coordinates of the stop (click to see the location on Google maps)
- Arrival time
- Departure time

In [32]:
user1_fcs_tdf = fcs_tdf[fcs_tdf['uid'] == 1]
map_f = user1_fcs_tdf.plot_trajectory(hex_color='blue', start_end_markers=False)
user1_fcs_tdf.plot_stops(map_f=map_f, hex_color='red', number_of_sides=4, radius=8)

In [33]:
dt1 = user1_fcs_tdf.iloc[0].leaving_datetime
dt2 = user1_fcs_tdf.iloc[1].leaving_datetime
dt1, dt2

(Timestamp('2008-10-23 10:32:53'), Timestamp('2008-10-23 23:46:02'))

In [34]:
# select all points between the first two stops
user1_tid1_tdf = user1_tdf[(user1_tdf.datetime >= dt1) 
                           & (user1_tdf.datetime <= dt2)]
user1_tid1_tdf.head()

Unnamed: 0,lat,lng,datetime,uid
148,39.970511,116.341455,2008-10-23 10:32:53,1
149,39.977648,116.326925,2008-10-23 10:33:00,1
150,39.977586,116.326918,2008-10-23 10:33:05,1
151,39.977596,116.326894,2008-10-23 10:33:10,1
152,39.977661,116.326947,2008-10-23 10:33:14,1


In [35]:
# plot the trip
user1_tid1_map = user1_tid1_tdf.plot_trajectory(zoom=12, weight=5, opacity=0.9, hex_color='red', tiles='Stamen Toner', )
user1_tid1_map

<a id="clustering"></a>
## Clustering

Stops that are spatially close to each other correspond to visits to the same location at different times. The `cluster` function groups the stops by spatial proximity.

The clustering algorithm used is DBSCAN (as implemented in scikit-learn).

- a new column `cluster` is added with the cluster ID (int)
- 0 is the most visited, 1 the second most visited, etc.
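
The labelling convention can be illustrated with a small plain-Python sketch. It relies on one known property of DBSCAN: with `min_samples=1` every point is a core point, so the clusters are the connected components of the "closer than the radius" graph. Everything else is an assumption of this sketch, not the library's code: the names `haversine_km` and `cluster_stops` are hypothetical, and ties in the number of visits are broken by order of first appearance.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius (assumption)

def cluster_stops(stops, cluster_radius_km=0.1):
    """Label each stop with a cluster ID; 0 is the most visited cluster."""
    n = len(stops)
    parent = list(range(n))

    def find(i):
        # union-find lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # link every pair of stops closer than the radius
    for i in range(n):
        for j in range(i + 1, n):
            if haversine_km(stops[i], stops[j]) <= cluster_radius_km:
                parent[find(j)] = find(i)

    roots = [find(i) for i in range(n)]
    counts = Counter(roots)
    first_seen = {}
    for index, root in enumerate(roots):
        first_seen.setdefault(root, index)
    # rank clusters by number of visits, most visited first
    ranked = sorted(counts, key=lambda r: (-counts[r], first_seen[r]))
    label_of = {root: label for label, root in enumerate(ranked)}
    return [label_of[root] for root in roots]

# usage: one place visited three times and two places visited once
stops = [
    (39.978945, 116.326825),
    (40.013819, 116.306532),
    (39.978987, 116.326686),
    (39.980755, 116.310771),
    (39.978950, 116.326800),
]
labels = cluster_stops(stops, cluster_radius_km=0.1)
print(labels)
```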

In [36]:
from skmob.preprocessing import clustering

In [37]:
fcscl_tdf = clustering.cluster(fcs_tdf, cluster_radius_km=0.1, min_samples=1)
fcscl_tdf.head()

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime,cluster
0,39.978945,116.326825,2008-10-23 05:59:59,1,2008-10-23 10:32:53,0
1,40.013819,116.306532,2008-10-23 11:10:09,1,2008-10-23 23:46:02,1
2,39.978987,116.326686,2008-10-24 00:10:39,1,2008-10-24 01:48:57,0
3,39.980755,116.310771,2008-10-24 01:53:53,1,2008-10-24 03:26:35,4
4,39.97958,116.313649,2008-10-24 03:26:35,1,2008-10-24 03:50:36,50


In [38]:
fcscl_tdf.parameters

{'from_file': 'https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25},
 'compress': {'function': 'compress', 'spatial_radius_km': 0.2},
 'detect': {'function': 'stay_locations',
  'stop_radius_factor': 0.5,
  'minutes_for_a_stop': 20.0,
  'spatial_radius_km': 0.2,
  'leaving_time': True,
  'no_data_for_minutes': 1000000000000.0,
  'min_speed_kmh': None},
 'cluster': {'function': 'cluster',
  'cluster_radius_km': 0.1,
  'min_samples': 1}}

In [39]:
user1_fcscl_tdf = fcscl_tdf[fcscl_tdf['uid'] == 1]
map_f = user1_fcscl_tdf.plot_trajectory(start_end_markers=False, hex_color='black')
user1_fcscl_tdf.plot_stops(map_f=map_f, radius=8)

### Playing with the `cluster_radius_km` parameter

In [40]:
user1_fcscl_tdf = clustering.cluster(user1_fcs_tdf, cluster_radius_km=0.5, min_samples=1)
map_f = user1_fcscl_tdf.plot_trajectory(start_end_markers=False, hex_color='black')
user1_fcscl_tdf.plot_stops(map_f=map_f, radius=8)

<a id="practice"></a>
## Practice

### Load the tessellation of the neighborhoods in San Francisco
- find it here: https://raw.githubusercontent.com/scikit-mobility/tutorials/master/mda_masterbd2020/data/bay_area_zip_codes.geojson
- visualize the tessellation (use black for background, red for borders, and a value of 2 for the weight of the borders)

In [41]:
# load the tessellation as a GeoDataFrame
url = "https://raw.githubusercontent.com/scikit-mobility/tutorials/master/mda_masterbd2020/data/bay_area_zip_codes.geojson"
tessellation = gpd.read_file(url)
# keep only the first polygon of each MultiPolygon geometry
geoms = [geom.geoms[0] if geom.geom_type == 'MultiPolygon' else geom
         for geom in tessellation['geometry']]
tessellation['geometry'] = geoms
tessellation.head()

Unnamed: 0,area,zip,state,po_name,length,geometry
0,12313263537.0,94558,CA,NAPA,995176.225313,"POLYGON ((-122.10329 38.51328, -122.10348 38.5..."
1,7236949520.92,95620,CA,DIXON,441860.2014,"POLYGON ((-121.65336 38.31339, -121.69340 38.3..."
2,3001414164.85,95476,CA,SONOMA,311318.546326,"POLYGON ((-122.40684 38.15568, -122.40757 38.1..."
3,1194301744.88,94559,CA,NAPA,359104.646602,"POLYGON ((-122.29369 38.15524, -122.29850 38.1..."
4,991786103.42,94533,CA,FAIRFIELD,200772.556587,"POLYGON ((-121.94748 38.30151, -121.94718 38.2..."


In [42]:
tessellation.rename(columns={'zip': 'tile_ID'}, inplace=True)
print(tessellation.shape)
tessellation.head()

(187, 6)


Unnamed: 0,area,tile_ID,state,po_name,length,geometry
0,12313263537.0,94558,CA,NAPA,995176.225313,"POLYGON ((-122.10329 38.51328, -122.10348 38.5..."
1,7236949520.92,95620,CA,DIXON,441860.2014,"POLYGON ((-121.65336 38.31339, -121.69340 38.3..."
2,3001414164.85,95476,CA,SONOMA,311318.546326,"POLYGON ((-122.40684 38.15568, -122.40757 38.1..."
3,1194301744.88,94559,CA,NAPA,359104.646602,"POLYGON ((-122.29369 38.15524, -122.29850 38.1..."
4,991786103.42,94533,CA,FAIRFIELD,200772.556587,"POLYGON ((-121.94748 38.30151, -121.94718 38.2..."


In [43]:
from skmob.utils.plot import plot_gdf

In [44]:
# black background, red borders, border weight 2 (as requested by the exercise)
tess_style = {'color': 'red', 'fillColor': 'black', 'weight': 2}
popup_features=['tile_ID', 'po_name', 'area']
map_f = plot_gdf(tessellation, zoom=9, style_func_args=tess_style, 
             popup_features=popup_features)
map_f

### Load the taxi San Francisco dataset

- [**download the dataset**](https://drive.google.com/file/d/1fKB3W10bY2OAZmz2XEICTEVHpIxnxw98/view) and put it into a `data` folder

In [45]:
%%time
# read the raw records; the 'timestamp' column holds Unix epoch seconds
df = pd.read_csv('data/cabs.csv.gz', compression='gzip')
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
# build the TrajDataFrame and sort it by user and time
tdf = skmob.TrajDataFrame(df, longitude='lon', datetime='timestamp',
                          user_id='driver').sort_values(by=['uid', 'datetime'])

FileNotFoundError: [Errno 2] No such file or directory: 'data/cabs.csv.gz'

In [46]:
print('records: %s' %len(tdf))
print('taxis: %s' %len(tdf['uid'].unique()))
print('period: %s - %s' %(tdf.datetime.min(), tdf.datetime.max()))
tdf.head()

records: 217653
taxis: 2
period: 2008-10-23 05:53:05 - 2009-03-19 05:46:37


Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [47]:
tdf.plot_trajectory(start_end_markers=False, opacity=0.15, hex_color='red', zoom=10)

### Select a subset of days and drivers
- select the first 100 drivers
- select points up to `2008-05-21 00:00:00`

Print again:
- the number of records
- the number of taxis
- the period of time covered by the dataset

In [48]:
max_datetime = pd.to_datetime('2008-05-21 00:00:00')
drivers = tdf['uid'].unique()[:100]
tdf = tdf[(tdf['datetime'] <= max_datetime) & (tdf['uid'].isin(drivers))]
print('records: %s' %len(tdf))
print('taxis: %s' %len(tdf['uid'].unique()))
print('period: %s - %s' %(tdf.datetime.min(), tdf.datetime.max()))
tdf.head()

records: 0
taxis: 0
period: NaT - NaT


Unnamed: 0,lat,lng,datetime,uid


### Filtering 
Filter the trajectories with the `filtering.filter` function using `max_speed_kmh=500.0`

Print:
- how many points the new `TrajDataFrame` has
- how many points have been filtered out

In [51]:
%%time
f_tdf = filtering.filter(tdf, max_speed_kmh=500.0)
print('Number of records:\t%s'%len(f_tdf))
print('Filtered points:\t%s'%(len(tdf) - len(f_tdf)))

TypeError: __init__() got an unexpected keyword argument 'index'

Visualize the trajectory of user `uid = 'abboip'` and the filtered trajectory of the same user

In [52]:
map_f = f_tdf[f_tdf['uid'] == 'abboip'].plot_trajectory(hex_color='red')
map_f = tdf[tdf['uid'] == 'abboip'].plot_trajectory(map_f=map_f, hex_color='blue', opacity=0.5)
map_f

Filter the original `TrajDataFrame` using `max_speed_kmh=100.0`
- print how many records have been filtered out
- plot the original trajectory and the newly filtered trajectory of user `'abboip'`

In [53]:
f_tdf2 = filtering.filter(tdf, max_speed_kmh=100.0)
print('Number of records:\t%s'%len(f_tdf2))
print('Filtered points:\t%s'%(len(tdf) - len(f_tdf2)))

TypeError: __init__() got an unexpected keyword argument 'index'

In [None]:
map_f = f_tdf2[f_tdf2['uid'] == 'abboip'].plot_trajectory(hex_color='red')
map_f = tdf[tdf['uid'] == 'abboip'].plot_trajectory(map_f=map_f, hex_color='blue', opacity=0.5)
map_f

### Compression 
Compress the `TrajDataFrame` filtered with `max_speed_kmh=500.0` using the default argument values

In [54]:
%%time
cf_tdf = compression.compress(f_tdf)
print('Points of the filtered trajectory:\t%s'%len(f_tdf))
print('Points of the compressed trajectory:\t%s'%len(cf_tdf))
print('Compressed points:\t\t\t%s'%(len(f_tdf)-len(cf_tdf)))

Points of the filtered trajectory:	217599
Points of the compressed trajectory:	6280
Compressed points:			211319
CPU times: total: 2.39 s
Wall time: 2.39 s


Print the `parameters` attribute of the obtained `TrajDataFrame`

In [55]:
cf_tdf.parameters

{'from_file': 'https://github.com/scikit-mobility/tutorials/raw/master/mda_masterbd2020/data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25},
 'compress': {'function': 'compress', 'spatial_radius_km': 0.2}}

In [56]:
cf_tdf.plot_trajectory(map_f=map_f, start_end_markers=False)

Plot the compressed trajectory of user `abboip` and the original trajectory of the same user together

In [57]:
map_f = cf_tdf[cf_tdf['uid'] == 'abboip'].plot_trajectory(hex_color='red')
map_f = tdf[tdf['uid'] == 'abboip'].plot_trajectory(map_f=map_f, hex_color='blue', opacity=0.5)
map_f

Create a very compressed tdf (`spatial_radius_km=2.0`) and visually compare the compressed trajectory of `abboip` with their original one

In [58]:
cf_tdf2 = compression.compress(f_tdf, spatial_radius_km=2.0)
map_f = cf_tdf2[cf_tdf2['uid'] == 'abboip'].plot_trajectory(hex_color='red')
map_f = tdf[tdf['uid'] == 'abboip'].plot_trajectory(map_f=map_f, hex_color='blue', opacity=0.5)
map_f

### Stop detection
Detect the stops (stay locations) in the filtered and compressed `TrajDataFrame`

In [59]:
from skmob.preprocessing.detection import stay_locations

In [60]:
scf_tdf = stay_locations(cf_tdf, minutes_for_a_stop=5)
print(len(scf_tdf))
scf_tdf.head()

734


Unnamed: 0,lat,lng,datetime,uid,leaving_datetime
0,39.978945,116.326825,2008-10-23 05:59:59,1,2008-10-23 10:32:53
1,40.015963,116.306171,2008-10-23 11:02:56,1,2008-10-23 11:10:09
2,40.013819,116.306532,2008-10-23 11:10:09,1,2008-10-23 23:46:02
3,39.978987,116.326686,2008-10-24 00:10:39,1,2008-10-24 01:48:57
4,39.980755,116.310771,2008-10-24 01:53:53,1,2008-10-24 03:26:35


In [61]:
map_f = cf_tdf[cf_tdf['uid'] == 'abboip'].plot_trajectory(hex_color='red')
map_f = scf_tdf[scf_tdf['uid'] == 'abboip'].plot_stops(map_f=map_f, hex_color='blue')
map_f

ValueError: Location values cannot contain NaNs.

### Clustering
Cluster the stops

In [62]:
cl_scf_tdf = clustering.cluster(scf_tdf)
cl_scf_tdf.head()

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime,cluster
0,39.978945,116.326825,2008-10-23 05:59:59,1,2008-10-23 10:32:53,0
1,40.015963,116.306171,2008-10-23 11:02:56,1,2008-10-23 11:10:09,1
2,40.013819,116.306532,2008-10-23 11:10:09,1,2008-10-23 23:46:02,1
3,39.978987,116.326686,2008-10-24 00:10:39,1,2008-10-24 01:48:57,0
4,39.980755,116.310771,2008-10-24 01:53:53,1,2008-10-24 03:26:35,16


In [63]:
map_f = cf_tdf[cf_tdf['uid'] == 'abboip'].plot_trajectory(hex_color='red', start_end_markers=False)
map_f = cl_scf_tdf[cl_scf_tdf['uid'] == 'abboip'].plot_stops(map_f=map_f, radius=8)
map_f

ValueError: Location values cannot contain NaNs.

### Focus on Berkeley
- select only tiles in the tessellation for which `po_name = BERKELEY`

In [64]:
berkeley = tessellation[tessellation['po_name'] == 'BERKELEY']
berkeley

Unnamed: 0,area,tile_ID,state,po_name,length,geometry
60,93749561.2773,94708,CA,BERKELEY,57103.8440235,"POLYGON ((-122.24556 37.88067, -122.24776 37.8..."
61,48164525.3774,94707,CA,BERKELEY,41165.1690067,"POLYGON ((-122.28202 37.88165, -122.28220 37.8..."
65,77089960.9892,94710,CA,BERKELEY,83425.2822454,"POLYGON ((-122.30103 37.84738, -122.30039 37.8..."
67,15499387.5023,94709,CA,BERKELEY,19530.661874,"POLYGON ((-122.27328 37.87333, -122.27336 37.8..."
68,38096533.8085,94703,CA,BERKELEY,35167.821257,"POLYGON ((-122.28338 37.88096, -122.28202 37.8..."
69,63505176.7551,94704,CA,BERKELEY,53493.6016455,"POLYGON ((-122.21804 37.86767, -122.21838 37.8..."
71,33198715.556,94702,CA,BERKELEY,31001.1628029,"POLYGON ((-122.28338 37.88096, -122.28308 37.8..."
72,20110702.5332,94720,CA,BERKELEY,21911.8449073,"POLYGON ((-122.26640 37.87415, -122.26543 37.8..."
76,51854267.7298,94705,CA,BERKELEY,48080.6093518,"POLYGON ((-122.26943 37.86003, -122.26691 37.8..."


- plot the tessellation

In [65]:
ber_map_f = plot_gdf(berkeley, zoom=12)
ber_map_f

- map the tdf to this new tessellation (with `remove_na=True`)

In [66]:
mapped_cf_tdf_ber = cf_tdf.mapping(berkeley, remove_na=True)
mapped_cf_tdf_ber.head()

Unnamed: 0,lat,lng,datetime,uid,tile_ID


- plot the trajectories on top of the new tessellation

In [67]:
mapped_cf_tdf_ber.plot_trajectory(map_f=ber_map_f, start_end_markers=False)

<a id="flowtotraj"></a>
## Extracting a `FlowDataFrame` from a `TrajDataFrame`
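
Before running the conversion, it may help to see the idea behind it in plain Python. The sketch below is an illustration under assumptions, not scikit-mobility's implementation: it assumes each point has already been assigned to a tile, and that a flow counts the movements between the tiles of consecutive points of the same user. The name `flows_from_visits` and its `self_loops` argument are hypothetical.

```python
from collections import Counter
from itertools import groupby

def flows_from_visits(visits, self_loops=True):
    """Count transitions between consecutive tiles of each user.

    visits: iterable of (uid, time, tile_id) tuples, where time is any sortable value.
    Returns a Counter mapping (origin, destination) to the number of movements.
    """
    flows = Counter()
    ordered = sorted(visits, key=lambda v: (v[0], v[1]))
    for _, user_visits in groupby(ordered, key=lambda v: v[0]):
        tiles = [tile for _, _, tile in user_visits]
        # pair every tile with the next one visited by the same user
        for origin, destination in zip(tiles, tiles[1:]):
            if self_loops or origin != destination:
                flows[(origin, destination)] += 1
    return flows

# usage: two users moving between two Berkeley ZIP-code tiles
visits = [
    ('a', 1, '94704'), ('a', 2, '94705'), ('a', 3, '94705'),
    ('b', 1, '94704'), ('b', 2, '94705'),
]
for (origin, destination), flow in flows_from_visits(visits).items():
    print(origin, destination, flow)
```

The resulting triples have the same shape as the `origin`, `destination`, `flow` columns of a `FlowDataFrame`.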

In [68]:
fdf = cf_tdf.to_flowdataframe(tessellation)
fdf.head()

Unnamed: 0,origin,destination,flow


In [69]:
map_f = fdf.plot_tessellation(zoom=10, style_func_args=tess_style, )
fdf.plot_flows(map_f=map_f, flow_color='red', color_origin_point='red', 
               min_flow=0, flow_exp=0.5, radius_origin_point=5)

## Comparing two users
- select the 1st and the 6th driver in the list of drivers
- create two new TDFs with their trajectories
- compare their trajectories in a DualMap
- also add the heatmap and the choropleth map to the two maps

In [70]:
driver1_tdf = cf_tdf[cf_tdf['uid'] == cf_tdf['uid'].unique()[0]]
driver2_tdf = cf_tdf[cf_tdf['uid'] == cf_tdf['uid'].unique()[5]]

IndexError: index 5 is out of bounds for axis 0 with size 2

In [71]:
driver1_tdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984302,116.32073,2008-10-23 05:53:05,1
1,39.982115,116.321225,2008-10-23 05:56:06,1
2,39.979737,116.321564,2008-10-23 05:57:03,1
3,39.979671,116.323778,2008-10-23 05:59:05,1
4,39.979638,116.326375,2008-10-23 05:59:59,1


In [72]:
map_f = folium.plugins.DualMap(location=(tdf['lat'].mean(), 
                                         tdf['lng'].mean()), 
                                       tiles='cartodbpositron', zoom_start=12)
m1, m2 = map_f.m1, map_f.m2
driver1_tdf = cf_tdf[cf_tdf['uid'] == cf_tdf['uid'].unique()[0]]
driver2_tdf = cf_tdf[cf_tdf['uid'] == cf_tdf['uid'].unique()[5]]

driver1_tdf.plot_trajectory(map_f=m1, start_end_markers=False, hex_color='red')
driver2_tdf.plot_trajectory(map_f=m2, start_end_markers=False, hex_color='blue')

map_f

ValueError: Location values cannot contain NaNs.

Compare in a DualMap the `FlowDataFrame`s of the two drivers

In [73]:
map_f = folium.plugins.DualMap(location=(cf_tdf['lat'].mean(), 
                                         cf_tdf['lng'].mean()), 
                                       tiles='cartodbpositron', zoom_start=10)
m1, m2 = map_f.m1, map_f.m2

fdf1 = driver1_tdf.to_flowdataframe(tessellation)
map_f1 = fdf1.plot_tessellation(map_f=m1, style_func_args=tess_style)
map_f1 = fdf1.plot_flows(map_f=map_f1, flow_color='red', color_origin_point='red', 
               min_flow=0, flow_exp=0.5, radius_origin_point=5)

fdf2 = driver2_tdf.to_flowdataframe(tessellation)
map_f2 = fdf2.plot_tessellation(map_f=m2, style_func_args=tess_style)
map_f2 = fdf2.plot_flows(map_f=map_f2, flow_color='blue', color_origin_point='blue', 
               min_flow=0, flow_exp=0.5, radius_origin_point=5)
map_f

NameError: name 'driver2_tdf' is not defined