# Trajectories Clustering and Analysis

### Goal of this Notebook
Parse and cluster the available trajectory data and analyze aggregate measurements.
***
**Outputs:**
- No file outputs (yet). All visualizations and analysis are contained within this IPython notebook.

**Inputs:**
- TRAJECTORIES.CSV [Manually Made] from raw trajectories data. Available on Dropbox in `Theophile Cabannes/Data Collection/Raw data/Here data/step_019_organize_by_provider`.

**Temporary Files Within the Pipeline:** 
- No temporary files.

**Dependent Scripts:**
- No script dependencies.

**Dependent Libraries:**
- [numpy](https://numpy.org), can be installed with `pip install numpy`
- pandas
- os
- csv
- json
- geopy
- matplotlib
- gmplot
- dipy
- shapely
- keplergl
***
**Sections:**
- A. [Parse Raw Trajectory Data](#section_ID_a)
- B. [Clustering](#section_ID_b)
- C. [Plotting, Mapping & Analysis](#section_ID_c)

# To dos
0. Discuss with Michal how you did the clustering, as he might be able to reuse some of your code.
1. Use the `fremontdropbox` module to get the folder locations from Dropbox (see next cell).
2. Create both `trajectories.csv` and `trajectories_condensed.csv` in the current IPython notebook.
    - Put them in `/Private Structured data collection/Data processing/Auxiliary files/Demand/Flow_speed/Trajectories`
3. Use the external and internal TAZs instead of the sklearn clustering to cluster the trajectories by origin and destination.
    - The TAZs are shapefiles in `Private Structured data collection/Data processing/Raw/Demand/OD demand/TAZ`
4. Write a function that takes the ids of the origin and destination TAZs as input and outputs the corresponding trajectories using Kepler.gl.
5. Remove `trajectories.csv` and `trajectories_condensed.csv` from GitHub (they are under NDA)
6. Generate all Kepler.gl maps in `/Private Structured data collection/Data processing/Temporary exports to be copied to processed data`
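
Items 3 and 4 above amount to assigning each trajectory endpoint to a zone and then filtering on the pair of zone ids. The sketch below is one possible shape for that selection step, not an agreed design. It assumes trajectories are rows with the `trajectories_condensed.csv` column names, and it uses a plain ray-casting test as a stand-in for shapely's `contains` so that it runs without the shapefiles. The square zones, the sample trajectories, and the helper names (`point_in_polygon`, `find_taz`, `trajectories_between`) are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of (x, y) vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross the edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_taz(x, y, taz_polygons):
    """Return the id of the first TAZ containing (x, y), or None if there is none."""
    for taz_id, polygon in taz_polygons.items():
        if point_in_polygon(x, y, polygon):
            return taz_id
    return None


def trajectories_between(trajectories, taz_polygons, origin_taz, dest_taz):
    """Keep the trajectories that start in origin_taz and end in dest_taz."""
    return [
        t for t in trajectories
        if find_taz(t["Origin X"], t["Origin Y"], taz_polygons) == origin_taz
        and find_taz(t["Dest X"], t["Dest Y"], taz_polygons) == dest_taz
    ]


# Hypothetical square zones and trajectories as (lon, lat); not real TAZ geometry.
taz_polygons = {
    1: [(-122.00, 37.50), (-121.95, 37.50), (-121.95, 37.55), (-122.00, 37.55)],
    2: [(-121.95, 37.50), (-121.90, 37.50), (-121.90, 37.55), (-121.95, 37.55)],
}
trajectories = [
    {"Source": "a", "Origin X": -121.98, "Origin Y": 37.52, "Dest X": -121.92, "Dest Y": 37.53},
    {"Source": "b", "Origin X": -121.92, "Origin Y": 37.52, "Dest X": -121.98, "Dest Y": 37.53},
]

selected = trajectories_between(trajectories, taz_polygons, origin_taz=1, dest_taz=2)
print([t["Source"] for t in selected])
```

The selected rows could then be loaded into a DataFrame and passed to `KeplerGl` in the same way as in the last cell of this notebook.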

### To do later
7. Match paths to road sections (see Jane McFarlan for that)
8. For every O-D pair (where O and D are TAZ ids) and every 15-minute time step, output the corresponding paths used by drivers.
9. Compare the paths used by drivers in the Here data with the ones used by drivers in Aimsun simulations.
10. For the path going from south of I-680N to north of I-680N, deduce the percentage of drivers using local roads instead of staying on the highway at different times of the day.
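
The grouping described in item 8 could look roughly like the sketch below: floor each trip's start time to its 15-minute slot and collect trajectory ids per (origin TAZ, destination TAZ, slot). The trip tuples, the TAZ ids, the timestamp format, and the `floor_to_15_minutes` helper are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical trips: (trajectory id, origin TAZ id, destination TAZ id, start time).
# The TAZ ids and the timestamp format are assumptions for illustration.
trips = [
    ("t1", 12, 40, "2019-03-05 08:03:10"),
    ("t2", 12, 40, "2019-03-05 08:14:59"),
    ("t3", 12, 40, "2019-03-05 08:15:00"),
    ("t4", 7, 40, "2019-03-05 08:20:30"),
]


def floor_to_15_minutes(timestamp):
    """Round a datetime down to the start of its 15-minute interval."""
    return timestamp.replace(minute=timestamp.minute - timestamp.minute % 15,
                             second=0, microsecond=0)


# Map (origin TAZ, destination TAZ, slot start) to the list of trajectory ids.
paths_by_od_and_slot = defaultdict(list)
for trip_id, origin, dest, start in trips:
    slot = floor_to_15_minutes(datetime.strptime(start, "%Y-%m-%d %H:%M:%S"))
    paths_by_od_and_slot[(origin, dest, slot)].append(trip_id)

for (origin, dest, slot), ids in sorted(paths_by_od_and_slot.items()):
    print(origin, dest, slot.strftime("%H:%M"), ids)
```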

In [1]:
import sys
import os

# Tell this notebook where to find the fremontdropbox module
module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from fremontdropbox import get_dropbox_location

dropbox_dir = get_dropbox_location()

rootdir = dropbox_dir + "/Private Structured data collection/Data processing/Raw/Demand/Flow_speed/Here data"
print(rootdir)

/Users/theophile/Fremont Dropbox/Theophile Cabannes/Private Structured data collection/Data processing/Raw/Demand/Flow_speed/Here data


In [None]:
import os
import csv
import json
import numpy as np
import pandas as pd
import geopy.distance
import matplotlib.pyplot as plt

from gmplot import gmplot
from keplergl import KeplerGl
from dipy.segment.metric import Metric
from dipy.segment.metric import ResampleFeature
from dipy.segment.clustering import QuickBundles
from shapely.geometry import MultiPoint
from geopy.distance import great_circle

<a id="section_ID_a"></a>
## A. Parse Raw Trajectory Data into Singular CSV

In [None]:
# rootdir = './step_019_organize_by_provider'
counter = 0
with open('trajectories.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Time", "Speed", "Heading", "Origin X", "Origin Y", "Dest X", "Dest Y", "Source"])

    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            counter += 1
            path = os.path.join(subdir, file)

            # Skip Jupyter checkpoint files, which are not trajectory data
            if ".ipynb_checkpoints" not in path:

                with open(path) as f:
                    data = json.load(f)

                for feature in data['features']:
                    
                    props = feature['properties']
                    origin, dest = feature['geometry']['coordinates'][0], feature['geometry']['coordinates'][1]
                    source = os.path.basename(path).split(".")[0]
                    trajectory = [props['time'], props['speed'], props['heading'],
                                  origin[0], origin[1], dest[0], dest[1], source]
                    
                    writer.writerow(trajectory)
                    # print(trajectory)

    print("All trajectory data has been parsed.", counter, "files total.")
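
The loop above implies that each input file is a GeoJSON-like collection whose features carry `time`, `speed` and `heading` properties plus a two-point segment under `geometry.coordinates`. The sketch below flattens one such feature into a CSV row. The sample values, the timestamp format, the `provider_a` source name and the `feature_to_row` helper are made up for illustration and are not taken from the data.

```python
import json

# Hypothetical sample with the structure the parsing loop expects;
# the values are made up for illustration.
sample = json.loads("""
{
  "features": [
    {
      "properties": {"time": "2019-03-05 08:15:00", "speed": 42.0, "heading": 180},
      "geometry": {"coordinates": [[-121.95, 37.53], [-121.94, 37.52]]}
    }
  ]
}
""")


def feature_to_row(feature, source):
    """Flatten one segment feature into a row matching the trajectories.csv header."""
    props = feature["properties"]
    (origin_x, origin_y), (dest_x, dest_y) = feature["geometry"]["coordinates"][:2]
    return [props["time"], props["speed"], props["heading"],
            origin_x, origin_y, dest_x, dest_y, source]


rows = [feature_to_row(f, "provider_a") for f in sample["features"]]
print(rows[0])
```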

In [None]:
# rootdir = './step_019_organize_by_provider'
counter = 0
with open('trajectories_condensed.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["Time Start", "Time End", "Origin X", "Origin Y", "Dest X", "Dest Y", "Source"])

    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            counter += 1
            path = os.path.join(subdir, file)

            # Skip Jupyter checkpoint files, which are not trajectory data
            if ".ipynb_checkpoints" not in path:

                with open(path) as f:
                    data = json.load(f)
                start, end = data['features'][0], data['features'][-1]
                
                origin = start['geometry']['coordinates'][0]
                dest = end['geometry']['coordinates'][1]
                source = os.path.basename(path).split(".")[0]
                trajectory = [start['properties']['time'], end['properties']['time'],
                              origin[0], origin[1], dest[0], dest[1], source]

                writer.writerow(trajectory)
                # print(trajectory)

    print("All trajectory data has been parsed.", counter, "files total.")

<a id="section_ID_b"></a>
## B. Clustering

In [None]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

df = pd.read_csv("trajectories.csv")

# represent GPS points as (lat, lon)
# (DataFrame.as_matrix was removed in pandas 1.0; to_numpy is its replacement)
coords = df[['Origin Y', 'Origin X']].to_numpy()
# earth's radius in km
kms_per_radian = 6371.0088
# define epsilon as 0.1 kilometers, converted to radians for use by haversine
epsilon = 0.1 / kms_per_radian

# eps is the max distance that points can be from each other to be considered in a cluster
# min_samples is the minimum cluster size (everything else is classified as noise)
db = DBSCAN(eps=epsilon, min_samples=100, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
# get the number of clusters (ignore noisy samples which are given the label -1)
num_clusters = len(set(cluster_labels) - set([-1]))

print ('Clustered ' + str(len(df)) + ' points to ' + str(num_clusters) + ' clusters')

# turn the clusters in to a pandas series
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
clusters
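
The `eps` passed to DBSCAN above is an angle rather than a length: the haversine metric takes (lat, lon) in radians and returns the great-circle distance on a unit sphere, which is why the 0.1 km threshold is divided by the Earth's radius. The standard-library sketch below reproduces that conversion on three hypothetical points. The `haversine_radians` helper is written here for illustration and is not the scikit-learn implementation.

```python
import math

KMS_PER_RADIAN = 6371.0088  # mean Earth radius in km, as in the cell above


def haversine_radians(p, q):
    """Great-circle angle in radians between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


epsilon = 0.1 / KMS_PER_RADIAN  # 100 m expressed as an angle

# Hypothetical points: b is about 56 m north of a, c is about 1.1 km north of a.
a = (37.5300, -121.9500)
b = (37.5305, -121.9500)
c = (37.5400, -121.9500)

for name, point in (("b", b), ("c", c)):
    angle = haversine_radians(a, point)
    print(name, round(angle * KMS_PER_RADIAN, 3), "km, within eps:", angle <= epsilon)
```

With this threshold, `b` would count as a neighbor of `a` and `c` would not.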

<a id="section_ID_c"></a>
## C. Plotting, Mapping, and Analysis

In [None]:
def get_centermost_point(cluster):
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

# get the centroid point for each cluster
centermost_points = clusters.map(get_centermost_point)
lats, lons = zip(*centermost_points)

rep_points = pd.DataFrame({'lon':lons, 'lat':lats})
fig, ax = plt.subplots(figsize=[15, 10])

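# NOTE: `maxes` (the tallest histogram bar of each cluster) is defined in the
# histogram cell further down; run that cell before this one.
rs_scatter = ax.scatter(rep_points['lon'][0], rep_points['lat'][0], c='#99cc99', edgecolor='None', alpha=0.7, s=maxes[0]/10)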

for i in range(1, num_clusters):
    ax.scatter(rep_points['lon'][i], rep_points['lat'][i], c='#99cc99', edgecolor='None', alpha=0.7, s=maxes[i]*2)

df_scatter = ax.scatter(df['Origin X'], df['Origin Y'], c='k', alpha=0.9, s=3)

ax.set_title('Full GPS trace vs. DBSCAN clusters')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend([df_scatter, rs_scatter], ['GPS Points', 'Cluster Centers'], loc='upper right')

labels = ['Cluster {0}'.format(i) for i in range(1, num_clusters+1)]
for label, x, y in zip(labels, rep_points['lon'], rep_points['lat']):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (-25, -30),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'white', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()

In [None]:
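# M[i] holds the hour of day of every point in cluster i
M = []
def groupByTime(row):
    # Look up the timestamp of the first row of df whose origin matches this (lat, lon) point
    t = df[(df['Origin Y']==row[0]) & (df['Origin X']==row[1])]['Time'].iloc[0]
    # Keep the three characters starting at the first space, i.e. " HH";
    # this assumes the hour directly follows that space and precedes the first ':'
    return t[ (t[:t.index(':')].index(" ")):][0:3]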
for i in range(num_clusters):
    hours = np.apply_along_axis(groupByTime, 1, clusters[i]).tolist()
    M.append(list(map(int, hours)))

In [None]:
f, axarr = plt.subplots(len(M), figsize=(12.5,50))
maxes = []
for i in range(len(M)):
    y, x, _ = axarr[i].hist(list(M[i]))
    maxes.append(y.max())
    axarr[i].set_title("Cluster {0}".format(i + 1))
    axarr[i].set_xlabel("Hour")
    axarr[i].set_ylabel("Trajectories")
f.tight_layout(pad=1.0)


In [None]:
gmap = gmplot.GoogleMapPlotter(df["Origin Y"][0], df["Origin X"][0], 11)
gmap.plot(df["Origin Y"], df["Origin X"], "cornflowerblue", edge_width=1)

gmap.draw("trajectories_map.html")

print("Plotted trajectories.")


gmap = gmplot.GoogleMapPlotter(df["Origin Y"][0], df["Origin X"][0], 11)
gmap.plot(df["Origin Y"], df["Origin X"], "cornflowerblue", edge_width=1)
gmap.heatmap(rep_points['lat'], rep_points['lon'], radius=20)

gmap.draw("trajectories_map_with_clusters.html")

print("Plotted trajectories with clusters.")


gmap = gmplot.GoogleMapPlotter(df["Origin Y"][0], df["Origin X"][0], 11)
gmap.heatmap(rep_points['lat'], rep_points['lon'], radius=20)

gmap.draw("map_with_clusters.html")

print("Plotted clusters.")

In [None]:
df = pd.read_csv("trajectories_condensed.csv")

map1 = KeplerGl(height=500, data={"d1":df})
map1

map1.save_to_html(file_name='map1.html')