<span>
<img src="img/logo_skmob.png" width="260px" align="right"/>
</span>
<span>
<b>Author:</b> <a href="http://about.giuliorossetti.net">Luca Pappalardo</a><br/>
<b>Python version:</b>  >=3.8<br/>
</span>

<a id='top'></a>
# *scikit-mobility: Mobility Data Analysis*

This notebook introduces the main steps for reading, analyzing and visualizing mobility data. For a complete overview of the ``skmob`` library, refer to the online <a href="http://bit.ly/skmob_doc">documentation</a>.

**Note:** this notebook is purposely not comprehensive; it only covers the basics you need to get started.

In [1]:
import skmob
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

## Social Media: the <font color="blue">Brightkite</font> data set
[Brightkite](https://snap.stanford.edu/data/loc-brightkite.html) was a location-based social networking service where users shared their locations by checking in. The data set covers the period Apr 2008 - Oct 2010:
- 58,228 users
- 4,491,143 checkins

In [2]:
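# load the first 100,000 check-ins into a pandas DataFrame
# note: with header=0, pandas treats the first line of the file as a header row
# and replaces it with `names`; use header=None if the file has no header line
url = "https://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz"
df = pd.read_csv(url, sep='\t', header=0, nrows=100000,
                 names=['user', 'check-in_time', 'latitude', 'longitude', 'location id'])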

# convert it to a TrajDataFrame
bdf = skmob.TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='check-in_time', user_id='user')
bdf.head()

Unnamed: 0,uid,datetime,lat,lng,location id
0,0,2010-10-16 06:02:04+00:00,39.891383,-105.070814,7a0f88982aa015062b95e3b4843f9ca2
1,0,2010-10-16 03:48:54+00:00,39.891077,-105.068532,dd7cd3d264c2d063832db506fba8bf79
2,0,2010-10-14 18:25:51+00:00,39.750469,-104.999073,9848afcc62e500a01cf6fbf24b797732f8963683
3,0,2010-10-14 00:21:47+00:00,39.752713,-104.996337,2ef143e12038c870038df53e0478cefc
4,0,2010-10-13 23:31:51+00:00,39.752508,-104.996637,424eb3dd143292f9e013efa00486c907


In [3]:
# a check-in has no duration: use the check-in time as the leaving time of each stop
bdf['leaving_datetime'] = bdf.datetime
# take the points of a single user
user0_bdf = bdf[bdf.uid == bdf.uid.unique()[0]]
# take a sample of 200 random points
user0_bdf_sample = user0_bdf.sample(200)
# plot the stops of the user
user0_map = user0_bdf_sample.plot_stops(zoom=3)
# plot the trajectory of the user
user0_bdf_sample.plot_trajectory(map_f=user0_map)

## GPS: the <font color="blue">GeoLife</font> data set

GPS trajectories collected in the Microsoft Research Asia **[GeoLife](https://www.microsoft.com/en-us/download/details.aspx?id=52367)** project by 182 users in the period Apr 2007 - Aug 2012:

- 17,621 trajectories
- total distance of about 1.2 million kilometers 
- total duration of 48,000+ hours.

In [4]:
tdf = skmob.TrajDataFrame.from_file('data/geolife_sample.txt.gz').sort_values(by='datetime')
print(type(tdf))
print(tdf.crs)
print(tdf.parameters)
tdf.head()

<class 'skmob.core.trajectorydataframe.TrajDataFrame'>
{'init': 'epsg:4326'}
{'from_file': 'data/geolife_sample.txt.gz'}


Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [5]:
tdf.plot_trajectory(zoom=12, weight=3, opacity=0.9, tiles='Stamen Toner')

- How many users in the data set?
- How many points?
- What's the time window?

In [6]:
print('# users: %s' %len(tdf.uid.unique()))
print('# points: %s' %len(tdf))
print('time window: %s' 
      %(tdf.iloc[-1].datetime - tdf.iloc[0].datetime))

# users: 2
# points: 217653
time window: 146 days 23:53:32


## Let's focus on a single user
We select the user's points with a boolean mask, as we would in **pandas**.

In [7]:
user1_tdf = tdf[tdf.uid == 1]
user1_tdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984094,116.319236,2008-10-23 05:53:05,1
1,39.984198,116.319322,2008-10-23 05:53:06,1
2,39.984224,116.319402,2008-10-23 05:53:11,1
3,39.984211,116.319389,2008-10-23 05:53:16,1
4,39.984217,116.319422,2008-10-23 05:53:21,1


In [8]:
user1_map = user1_tdf.plot_trajectory(zoom=11, weight=3, tiles='Open Street Map')
user1_map

## Mobility data preprocessing

There are 3 common steps we can apply to clean our data:

- Filtering
- Compression
- Stop detection


## Filtering trajectories

Filter out every point whose speed from the previous point is higher than `max_speed_kmh` km/h.

In [9]:
from skmob.preprocessing import filtering

In [10]:
# filter points with speed higher than 500km/h
user1_ftdf = filtering.filter(user1_tdf, max_speed_kmh=500.)

In [11]:
user1_ftdf.parameters

{'from_file': 'data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25}}

Very few points have been filtered: 18 out of 108,607.

In [12]:
print('Points of the raw trajectory:\t\t%s'%len(user1_tdf))
print('Points of the filtered trajectory:\t%s'%len(user1_ftdf))
print('Filtered points:\t\t\t%s'%(len(user1_tdf)-len(user1_ftdf)))

Points of the raw trajectory:		108607
Points of the filtered trajectory:	108589
Filtered points:			18
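
To make the idea behind speed filtering concrete, here is a minimal pure-Python sketch. It is not the `skmob` implementation: the helper names (`haversine_km`, `filter_by_speed`) are hypothetical, the third sample point is a made-up GPS glitch, and the speed is measured from the last *kept* point, which is a simplifying assumption.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius (assumption)


def filter_by_speed(points, max_speed_kmh):
    """Drop points reached faster than max_speed_kmh from the last kept point.

    points: list of (lat, lng, datetime) tuples sorted by time.
    """
    if not points:
        return []
    kept = [points[0]]
    for lat, lng, t in points[1:]:
        prev_lat, prev_lng, prev_t = kept[-1]
        hours = (t - prev_t).total_seconds() / 3600.0
        if hours <= 0:
            continue  # duplicate timestamp: speed is undefined, drop the point
        speed_kmh = haversine_km((prev_lat, prev_lng), (lat, lng)) / hours
        if speed_kmh <= max_speed_kmh:
            kept.append((lat, lng, t))
    return kept


points = [
    (39.984094, 116.319236, datetime(2008, 10, 23, 5, 53, 5)),
    (39.984198, 116.319322, datetime(2008, 10, 23, 5, 53, 6)),
    (41.000000, 116.300000, datetime(2008, 10, 23, 5, 53, 8)),  # made-up glitch
    (39.984224, 116.319402, datetime(2008, 10, 23, 5, 53, 11)),
]
glitch = points[2]
kept = filter_by_speed(points, max_speed_kmh=500.0)
print(len(points), "->", len(kept))
```

The glitch lies more than 100 km from its neighbours and is only two seconds apart from the previous point, so it is the only point removed.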


## Compressing trajectories

Reduce the number of points of the trajectory, preserving the structure.

Merge together all points that are closer than `spatial_radius_km` kilometers from each other (0.1 km in the example below).

In [13]:
from skmob.preprocessing import compression

In [14]:
user1_ctdf = compression.compress(user1_ftdf, spatial_radius_km=0.1)
user1_ctdf.head()

Unnamed: 0,lat,lng,datetime,uid
0,39.984578,116.319749,2008-10-23 05:53:05,1
1,39.984533,116.320287,2008-10-23 05:54:03,1
2,39.984235,116.320923,2008-10-23 05:54:38,1
3,39.982974,116.321144,2008-10-23 05:55:54,1
4,39.982069,116.321219,2008-10-23 05:56:22,1


In [15]:
user1_ctdf.parameters

{'from_file': 'data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25},
 'compress': {'function': 'compress', 'spatial_radius_km': 0.1}}

The compressed trajectory retains only about 6.5% of the points of the filtered trajectory (7,099 out of 108,589).

In [16]:
print('Points of the filtered trajectory:\t%s'%len(user1_ftdf))
print('Points of the compressed trajectory:\t%s'%len(user1_ctdf))
print('Compressed points:\t\t\t%s'%(len(user1_ftdf)-len(user1_ctdf)))

Points of the filtered trajectory:	108589
Points of the compressed trajectory:	7099
Compressed points:			101490
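
The following is a simplified sketch of what compression does, not the `skmob` code itself. It assumes a greedy strategy: consecutive points within `spatial_radius_km` of the first point of the current group are merged into one point with the median coordinates and the time of the first point. The function names and the sample coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius (assumption)


def summarise(group):
    """Median coordinates of the group, timestamp of its first point."""
    return (median(g[0] for g in group), median(g[1] for g in group), group[0][2])


def compress(points, spatial_radius_km):
    """Greedily merge consecutive points lying within the radius of a group's first point.

    points: list of (lat, lng, seconds) tuples sorted by time.
    """
    compressed, group = [], []
    for p in points:
        # close the current group when the new point is too far from its first point
        if group and haversine_km(group[0][:2], p[:2]) > spatial_radius_km:
            compressed.append(summarise(group))
            group = []
        group.append(p)
    if group:
        compressed.append(summarise(group))
    return compressed


points = [
    (39.98409, 116.31924, 0),
    (39.98420, 116.31932, 1),
    (39.98422, 116.31940, 6),
    (39.99500, 116.31940, 60),   # more than 1 km north of the first group
    (39.99510, 116.31950, 65),
]
compressed = compress(points, spatial_radius_km=0.1)
print(len(points), "->", len(compressed))
```

Five input points collapse into two: one for each cluster of nearby points.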


## Stop detection

Identify the locations where the user spent at least `minutes_for_a_stop` minutes within a distance of `spatial_radius_km` $\times$ `stop_radius_factor` from a given point.

A new column `leaving_datetime` is added, indicating the time when the user departs from the stop.

In [17]:
from skmob.preprocessing import detection

In [18]:
user1_stdf = detection.stops(user1_ctdf, stop_radius_factor=0.5,
                             minutes_for_a_stop=20.0, spatial_radius_km=0.2,
                             leaving_time=True)
user1_stdf.head()

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime
0,39.978532,116.327267,2008-10-23 06:00:30,1,2008-10-23 10:32:53
1,40.013999,116.306183,2008-10-23 11:09:35,1,2008-10-23 23:45:27
2,39.979245,116.325659,2008-10-24 00:14:40,1,2008-10-24 01:49:03
3,39.9813,116.310084,2008-10-24 01:56:07,1,2008-10-24 03:22:00
4,39.979556,116.312931,2008-10-24 03:22:00,1,2008-10-24 03:50:05


In [19]:
user1_stdf.parameters

{'from_file': 'data/geolife_sample.txt.gz',
 'filter': {'function': 'filter',
  'max_speed_kmh': 500.0,
  'include_loops': False,
  'speed_kmh': 5.0,
  'max_loop': 6,
  'ratio_max': 0.25},
 'compress': {'function': 'compress', 'spatial_radius_km': 0.1},
 'detect': {'function': 'stops',
  'stop_radius_factor': 0.5,
  'minutes_for_a_stop': 20.0,
  'spatial_radius_km': 0.2,
  'leaving_time': True,
  'no_data_for_minutes': 1000000000000.0,
  'min_speed_kmh': None}}
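
As an illustration of the definition above, here is a minimal stop-detection sketch in plain Python. It is a sketch under stated assumptions, not the library's algorithm: the helper names are hypothetical, the sample points are invented, and the leaving time is assumed to be the timestamp of the first point that falls outside the radius.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from statistics import median


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius (assumption)


def detect_stops(points, minutes_for_a_stop, spatial_radius_km, stop_radius_factor):
    """Return (lat, lng, arrival, leaving) tuples; points are (lat, lng, datetime)."""
    radius_km = spatial_radius_km * stop_radius_factor
    min_stay = timedelta(minutes=minutes_for_a_stop)
    stops, i, n = [], 0, len(points)
    while i < n:
        # extend the group while the points stay within the radius of points[i]
        j = i + 1
        while j < n and haversine_km(points[i][:2], points[j][:2]) <= radius_km:
            j += 1
        arrival = points[i][2]
        # assumption: the user leaves at the first point outside the radius
        leaving = points[j][2] if j < n else points[j - 1][2]
        if leaving - arrival >= min_stay:
            group = points[i:j]
            stops.append((median([g[0] for g in group]),
                          median([g[1] for g in group]), arrival, leaving))
            i = j
        else:
            i += 1
    return stops


t0 = datetime(2008, 10, 23, 6, 0, 0)
points = [
    (39.97850, 116.32730, t0),
    (39.97852, 116.32728, t0 + timedelta(minutes=10)),
    (39.97851, 116.32731, t0 + timedelta(minutes=25)),
    (39.99000, 116.32000, t0 + timedelta(minutes=30)),  # on the move
    (40.01400, 116.30620, t0 + timedelta(minutes=35)),
    (40.01401, 116.30621, t0 + timedelta(minutes=40)),  # too short to be a stop
]
stops = detect_stops(points, minutes_for_a_stop=20.0,
                     spatial_radius_km=0.2, stop_radius_factor=0.5)
print(stops)
```

Only the first cluster qualifies as a stop: the user stays within 0.1 km of the first point for 30 minutes, while the last two points span just 5 minutes.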

#### Visualise the compressed trajectory and the stops

Click on the stop markers to see a pop up with: 
- User ID
- Coordinates of the stop (click to see the location on Google maps)
- Arrival time
- Departure time

In [20]:
map_f = user1_stdf.plot_trajectory(max_points=1000, hex_color=-1, start_end_markers=False)
user1_stdf.plot_stops(map_f=map_f, hex_color=-1)

In [21]:
from skmob.preprocessing import detection
# detect the stops again, this time on the raw (unfiltered, uncompressed) trajectory
user1_stdf = detection.stops(user1_tdf, stop_radius_factor=0.5, 
                             minutes_for_a_stop=20.0, spatial_radius_km=0.2, 
                             leaving_time=True)
user1_stdf.head(4)

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime
0,39.978253,116.327275,2008-10-23 06:01:05,1,2008-10-23 10:32:53
1,40.013819,116.306532,2008-10-23 11:10:09,1,2008-10-23 23:46:02
2,39.97895,116.326439,2008-10-24 00:12:30,1,2008-10-24 01:48:57
3,39.981316,116.310181,2008-10-24 01:56:47,1,2008-10-24 02:28:19


In [22]:
user1_stdf.plot_stops(map_f=user1_map, hex_color=-1)

## Stops define <font color="blue">trips</font>
Let's take the first trip of the individual using the stops

In [23]:
user1_stdf.head(4)

Unnamed: 0,lat,lng,datetime,uid,leaving_datetime
0,39.978253,116.327275,2008-10-23 06:01:05,1,2008-10-23 10:32:53
1,40.013819,116.306532,2008-10-23 11:10:09,1,2008-10-23 23:46:02
2,39.97895,116.326439,2008-10-24 00:12:30,1,2008-10-24 01:48:57
3,39.981316,116.310181,2008-10-24 01:56:47,1,2008-10-24 02:28:19


In [24]:
dt1 = user1_stdf.iloc[0].leaving_datetime
dt2 = user1_stdf.iloc[1].leaving_datetime
dt1, dt2

(Timestamp('2008-10-23 10:32:53'), Timestamp('2008-10-23 23:46:02'))

In [25]:
# select all points recorded between the departure from the first stop (dt1)
# and the departure from the second stop (dt2)
user1_tid1_tdf = user1_tdf[(user1_tdf.datetime >= dt1) 
                           & (user1_tdf.datetime <= dt2)]
user1_tid1_tdf.head()

Unnamed: 0,lat,lng,datetime,uid
148,39.970511,116.341455,2008-10-23 10:32:53,1
149,39.977648,116.326925,2008-10-23 10:33:00,1
150,39.977586,116.326918,2008-10-23 10:33:05,1
151,39.977596,116.326894,2008-10-23 10:33:10,1
152,39.977661,116.326947,2008-10-23 10:33:14,1


In [26]:
# plot the trip
user1_tid1_map = user1_tid1_tdf.plot_trajectory(zoom=13, weight=5, opacity=0.9, tiles='Stamen Toner')
user1_tid1_map

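Compute the distance between the origin and the destination of the trip, and then the length of the trip (with `distance_straight_line`, which sums the straight-line distances between consecutive points).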

In [27]:
from skmob.utils.gislib import getDistanceByHaversine
from skmob.measures.individual import distance_straight_line
# take origin and destination of the trip
start_loc = user1_tid1_tdf.iloc[0][['lat', 'lng']]
end_loc = user1_tid1_tdf.iloc[-1][['lat', 'lng']]
# compute distance between origin and destination
print("distance:", getDistanceByHaversine(end_loc, start_loc))

distance: 5.511092656068364
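
For reference, the haversine formula behind this kind of origin-destination distance can be sketched in a few lines of plain Python. This is an independent illustration, not the code of `getDistanceByHaversine`: the function name `haversine_km` and the Earth radius of 6371 km are assumptions, and the two sample points are the first two stops of user 1 shown earlier.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(origin, destination):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1 = map(radians, origin)
    lat2, lng2 = map(radians, destination)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


stop_a = (39.978253, 116.327275)  # first stop of user 1
stop_b = (40.013819, 116.306532)  # second stop of user 1
d = haversine_km(stop_a, stop_b)
print("distance between the two stops (km):", round(d, 3))
```

The result is symmetric in its arguments, which is why the order of `end_loc` and `start_loc` in the cell above does not matter.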


In [28]:
distance_straight_line(user1_tid1_tdf)


Unnamed: 0,uid,distance_straight_line
0,1,8.713016
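
The trip length is larger than the origin-destination distance because the path is not a straight line. The sketch below shows the difference on three invented points forming an L-shaped path; `haversine_km` and `path_length_km` are hypothetical helpers, not `skmob` functions.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius (assumption)


def path_length_km(points):
    """Sum of the distances between consecutive (lat, lng) points."""
    return sum(haversine_km(p, q) for p, q in zip(points, points[1:]))


# invented L-shaped path: first north, then east
path = [(39.97, 116.34), (39.98, 116.34), (39.98, 116.35)]
length = path_length_km(path)
direct = haversine_km(path[0], path[-1])
print("path length (km):", round(length, 3))
print("origin-destination distance (km):", round(direct, 3))
```

The path length can never be shorter than the origin-destination distance; the two coincide only when the points lie along a single great-circle segment.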
