### Import packages

In [43]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [44]:
import os
import pathlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn.cluster
import sklearn.neighbors

from util import config
from util import mapping
from util import road_backbone

## Remove weird rides - DBSCAN

GPS-recorded rides can contain gaps in the data.  The cause may be the machine, or the human, e.g. forgetting to turn the unit back on for a while after pausing.

Here, look for long gaps in each ride, and keep only the largest continuous segment (the one with the most points) of each.

To find continuous segments, use DBSCAN:

* Want to extract contiguous segments, so density-based clustering is a natural fit
* DBSCAN is a convenient density-based option, because its 'epsilon' parameter (the maximum distance between two points for them to count as neighbours) is intuitive to define in geographic space
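
As a minimal sketch of the idea, the snippet below runs DBSCAN on a synthetic ride with one long gap and keeps the largest segment. The track itself is made up for illustration, and `eps` is in degrees because the input is raw lat/lon.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic ride (assumed data): 100 points heading north about 110 m apart,
# then a 0.5 degree jump where the unit was off, then 30 more points.
first = np.column_stack([51.0 + 0.001 * np.arange(100), np.full(100, -1.0)])
second = np.column_stack([51.6 + 0.001 * np.arange(30), np.full(30, -1.0)])
track = np.vstack([first, second])

# eps is in the same units as the input, i.e. degrees here
labels = DBSCAN(eps=0.1).fit(track).labels_

# Count points per cluster, ignoring DBSCAN's noise label (-1)
cluster_ids, counts = np.unique(labels[labels >= 0], return_counts=True)
largest = cluster_ids[np.argmax(counts)]
longest_segment = track[labels == largest]

print(len(cluster_ids), len(longest_segment))
```

Points that DBSCAN cannot attach to any cluster get the label -1, so it is worth excluding that label before picking the largest segment.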

In [None]:
trips = pd.read_feather(
    os.path.join(config.PROCESSED_DATA_PATH, 'trips.feather')
)
trips.set_index('rte_id', inplace=True)


# If everything is working well, GPS points should be < 20 m apart.
# However, a user's tolerance is much higher than this - as long as
# the ride is easy to follow in map view, it is fine for this purpose.
# Note: 0.1 degrees of latitude is ~11 km (0.01 degrees would be ~1 km).
clusterer = sklearn.cluster.DBSCAN(eps=0.1)

bad_rtes = []
for i, rte_id in enumerate(trips.index.tolist()):
    if i % 200 == 0:
        print('{} of {}'.format(i, trips.shape[0]))
    
    ride = pd.read_feather(
        os.path.join(config.CLEAN_TRIPS_PATH, '{}.feather'.format(rte_id))
    )
        
    clusterer.fit(ride[['lat', 'lon']])
    ride['labels'] = clusterer.labels_
    if ride.labels.nunique() == 1:
        continue
        
    # Keep the labelled ride so it can be plotted later
    bad_rtes.append(ride)
    
    # Find segment with largest number of breadcrumb points
    biggest_segment = ride.labels.value_counts().index[0]
    
    ride = (ride[ride.labels == biggest_segment]
            .reset_index(drop=True).drop('labels', axis=1))
    
    # If this now means that the ride is very short, drop it
    if ride['dist'].sum() <= 1:
        # Delete from the same directory the ride was read from
        os.remove(os.path.join(config.CLEAN_TRIPS_PATH, '{}.feather'.format(rte_id)))
        continue
    
    # Overwrite the ride with its largest continuous segment
    ride.to_feather(
        os.path.join(config.CLEAN_TRIPS_PATH, '{}.feather'.format(rte_id))
    )
    

0 of 19570
200 of 19570
400 of 19570
600 of 19570
800 of 19570
1000 of 19570
1200 of 19570
1400 of 19570
1600 of 19570
1800 of 19570
2000 of 19570
2200 of 19570
2400 of 19570
2600 of 19570
2800 of 19570
3000 of 19570
3200 of 19570
3400 of 19570
3600 of 19570
3800 of 19570
4000 of 19570
4200 of 19570
4400 of 19570
4600 of 19570
4800 of 19570
5000 of 19570
5200 of 19570
5400 of 19570
5600 of 19570
5800 of 19570
6000 of 19570
6200 of 19570
6400 of 19570
6600 of 19570
6800 of 19570
7000 of 19570
7200 of 19570
7400 of 19570
7600 of 19570


Check out some of those discontinuous trips visually.

In [None]:
colours = sns.color_palette('husl', 5)
inds = np.random.choice(len(bad_rtes), 5, replace=False)

for ind in inds:
    ride = bad_rtes[ind]
    n_clusters = ride.labels.nunique()
    # Clusters are ordered from most to fewest points; smaller clusters are drawn darker
    for i, cluster_label in enumerate(ride.labels.value_counts().index.tolist()):
        segment = ride[ride.labels == cluster_label]
        plt.plot(segment.lon, segment.lat,
                 '.', markersize=0.5,
                 # Cycle through the palette in case a ride has more than 5 clusters
                 color=np.array(colours[i % len(colours)]) * (n_clusters - i) / n_clusters
                )

Finally, remove the deleted trips from the main DataFrame.

In [None]:
trips = pd.read_feather(
    os.path.join(config.PROCESSED_DATA_PATH, 'trips.feather')
)
# List all [rte_id].feather files in CLEAN_TRIPS_PATH
clean_trip_ids = [path.stem for path
                  in pathlib.Path(config.CLEAN_TRIPS_PATH).glob('*.feather')]

# File stems are strings, so compare against rte_id as a string
trips = trips[trips['rte_id'].astype(str).isin(clean_trip_ids)].reset_index(drop=True)
trips.to_feather(
    os.path.join(config.PROCESSED_DATA_PATH, 'trips.feather')
)
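
One thing to watch when matching file names back to ids: `Path.stem` always returns a string, so numeric ids never match unless they are converted first. The sketch below shows this with made-up ids and empty placeholder files in a temporary directory (nothing here touches the real data paths).

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Pretend rides 101, 102 and 104 survived cleaning (hypothetical ids)
    for rte_id in (101, 102, 104):
        (pathlib.Path(tmp) / '{}.feather'.format(rte_id)).touch()
    stems = {path.stem for path in pathlib.Path(tmp).glob('*.feather')}

all_ids = [101, 102, 103, 104]

# Integers never compare equal to strings, so this silently keeps nothing
kept_without_cast = [rte_id for rte_id in all_ids if rte_id in stems]

# Converting to string first gives the intended result
kept = [rte_id for rte_id in all_ids if str(rte_id) in stems]

print(kept_without_cast, kept)
```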

## Build the road backbone - CLIQUE

The lat/lon breadcrumbs are not consistent between rides: two rides along the same road record different points.  To make life easier, build a backbone of consistent lat/lon points along all of the roads included in the database.

CLIQUE clustering is well suited to this:

* Above some threshold (e.g. one ride), data density does not matter, so there will not be more cluster centres where there are more rides
* Cluster centres are evenly distributed in space at whatever granularity you decide is appropriate
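
A rough sketch of the grid idea behind these two points, in plain Python. This is not the code in `road_backbone`; the function names, the `min_points` threshold and the sample rides are all assumptions made for illustration.

```python
import collections
import math

def snap_to_grid(lat, lon, pts_per_degree):
    """Return the integer index of the grid cell containing (lat, lon)."""
    return (math.floor(lat * pts_per_degree), math.floor(lon * pts_per_degree))

def build_backbone(rides, pts_per_degree, min_points=1):
    """Map grid cells to ride ids, and ride ids to grid cells.

    A cell is kept once it holds at least min_points breadcrumbs; beyond
    that threshold, extra rides through the cell change nothing.
    """
    rtes_at_grid = collections.defaultdict(set)
    points_in_cell = collections.Counter()
    for rte_id, points in rides.items():
        for lat, lon in points:
            cell = snap_to_grid(lat, lon, pts_per_degree)
            rtes_at_grid[cell].add(rte_id)
            points_in_cell[cell] += 1

    dense = {cell: rtes for cell, rtes in rtes_at_grid.items()
             if points_in_cell[cell] >= min_points}

    gridpts_at_rte = collections.defaultdict(set)
    for cell, rtes in dense.items():
        for rte_id in rtes:
            gridpts_at_rte[rte_id].add(cell)
    return dense, gridpts_at_rte

# Hypothetical rides: 'a' and 'b' share a road, 'a' then carries on north
rides = {
    'a': [(51.001, -0.101), (51.002, -0.102), (51.06, -0.101)],
    'b': [(51.003, -0.104)],
}
dense, gridpts_at_rte = build_backbone(rides, pts_per_degree=20)
print(sorted(dense))
```

However many rides pass through a cell, it contributes exactly one backbone point, which is the property wanted above.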

For the basic road backbone, a granularity of around 200 m (0.002 degrees) seems appropriate.  Except very close to junctions, this can distinguish between different roads.

Later, I will want to calculate distances between points with low latency: when the user puts their preferred start location into the app, I need to calculate how far it is to any given ride.  As distances of less than 1 mile are unlikely to matter for this proximity question, a coarser granularity (0.015 degrees) can be used.

For the de-duping question, an even coarser grid can be used.  I would rather throw out rides that are a little bit different than retain them.  This also makes the clustering at that stage easier.  I use a granularity of around 3.5 miles (0.05 degrees).
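
The degree figures above can be sanity-checked with a quick conversion. This is a sketch that assumes a spherical Earth with a mean radius of 6371 km; the helper names are made up here. Note that a degree of longitude covers less ground away from the equator, so the quoted distances are for latitude.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation
KM_PER_MILE = 1.609344

def lat_degrees_to_km(degrees):
    """Ground distance covered by a span of latitude."""
    return math.radians(degrees) * EARTH_RADIUS_KM

def lon_degrees_to_km(degrees, latitude):
    """Ground distance covered by a span of longitude at a given latitude."""
    return lat_degrees_to_km(degrees) * math.cos(math.radians(latitude))

for granularity in (0.002, 0.015, 0.05):
    km = lat_degrees_to_km(granularity)
    print('{} degrees ~ {:.2f} km ~ {:.2f} miles'.format(
        granularity, km, km / KM_PER_MILE))
```

Under these assumptions, 0.002 degrees comes out at roughly 220 m, 0.015 degrees at just over a mile, and 0.05 degrees at about 3.5 miles.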

In [21]:
trips = pd.read_feather(
    os.path.join(config.PROCESSED_DATA_PATH, 'trips.feather')
)
trips.set_index('rte_id', inplace=True)



We want to address de-duping first, as this will make everything else run much more quickly :)
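
One possible way to use the coarse grid for de-duping, shown only as a sketch: treat each ride as the set of coarse grid cells it passes through, and drop a ride when its set overlaps heavily with one already kept. The Jaccard measure, the 0.9 threshold and the sample cells are assumptions for illustration, not necessarily what is done later in this project.

```python
def jaccard(cells_a, cells_b):
    """Overlap between two sets of grid cells, from 0 (disjoint) to 1 (identical)."""
    if not cells_a and not cells_b:
        return 1.0
    return len(cells_a & cells_b) / len(cells_a | cells_b)

def drop_near_duplicates(cells_by_ride, threshold=0.9):
    """Keep a ride only if it is not too similar to any ride already kept."""
    kept = []
    for rte_id, cells in cells_by_ride.items():
        if all(jaccard(cells, cells_by_ride[other]) < threshold for other in kept):
            kept.append(rte_id)
    return kept

# Hypothetical rides described by coarse grid cells
cells_by_ride = {
    'a': {(1, 1), (1, 2), (1, 3)},
    'b': {(1, 1), (1, 2), (1, 3)},  # same cells as 'a'
    'c': {(5, 5), (5, 6)},
}
unique_rides = drop_near_duplicates(cells_by_ride)
print(unique_rides)
```

A coarser grid makes more rides collapse onto the same cell set, which matches the preference for throwing out rides that are only a little bit different.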

In [42]:
import collections
collections.defaultdict(set)

defaultdict(set, {})

In [None]:
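# 20 points per degree gives a 0.05 degree grid (the coarse de-duping granularity)
pts_per_degree = 20

grid_pts, gridpts_at_rte = road_backbone.make_road_backbone(
    trips.index.tolist(), pts_per_degree
)

# NOTE: grid_dict, grid_fname and rtes_at_grid_fname are not defined anywhere
# in this notebook; they must be set before this call will run.
road_backbone.save_gridpts(grid_pts, grid_dict, grid_fname, rtes_at_grid_fname)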
