# UBER Pickups 
--------
In this second part of this project, we want to:

> implement **DBSCAN** technique  

> **Visualize** hot-zones for drivers using **maps**

Some ideas in this part of the project was inspired by **[Geoff Boeing](http://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/)** article about dealing with spatial data.

--------

### Table of Contents

* [1. Clustering: DBSCAN](#section1)
    * [1.1. DBSCAN parameters](#section21)
    * [1.2. Useful functions](#section21)
    * [1.3. Load data](#section22)
    * [1.4. Driver Query](#section22)
        * [1.4.1. Find the hot zones](#section22)
        * [1.4.2. Visualize recommendations](#section22)
* [2. Further improvements](#section2)

 #### Import useful modules ⬇️⬇️ and Global params

In [1]:
# generic libs
import pandas as pd
import numpy as np

# ML libs
from sklearn.cluster import DBSCAN

# geometry libs
from shapely.geometry import MultiPoint
from geopy.distance import great_circle
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="geoapiExercises")

# plot libs
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "iframe_connected"

# global params
file_path = "data/pre_aug14.csv"

 # Clustering : DBSCAN 

 ## DBSCAN parameters  

> We will try to restate the required DBSCAN parameters for a better understanding by drivers when using the application. 
>> **Eps/Epsilon/Radius (km)**  the maximum distance between two pickups to be in the same hot-zone.  

>> **min_samples**  the minimum number of pickups per zone to be considered as hot.     

>> **Metric : haversine**   
As we are dealing with spatial locations, the most suitable distance metric is the haversine distance. It determines the **great-circle** distance between two points on a **sphere** (the earth in our case) given their longitudes and latitudes.   
The **great-circle** distance is the shortest distance between two points on the surface of a sphere.

 ## Useful functions

> Recall that we will use the **haversine** distance as a metric for the DBSCAN algorithm. Hence, before applying the algorithm, we must convert the coordinates and the maximum distance between pickups (eps) to **radians**. 

> Given the predefined **_maximum distance_** and the **_minimum pickups_**, the function **get_clusters** runs the DBSCAN algorithm to the pickups data and return **_annotated pickups_** with their equivalent **_clusters_** and the **_number of clusters_** (number of hot zones).

In [2]:
def get_clusters(max_distance,min_pickups,pickups):
    
    ## get pickups coordinates from available data
    coords = pickups[['Lat', 'Lon']].to_numpy()
        
    # convert distances from kilometers to radians, as measured along a great circle on a sphere 
    # with a radius of 6371 km, the mean radius of the Earth.
    kms_per_radian = 6371.0088
    
    # convert coordinates and epsilon to radians
    epsilon = max_distance / kms_per_radian
    coords_rad = np.radians(coords)
    
    ##  DBSCAN clustering
    db = DBSCAN(eps=epsilon, min_samples=min_pickups, metric='haversine')
    db.fit(coords_rad)
    
    ## count the clusters
    cluster_labels = db.labels_
    n_clusters = len(set(cluster_labels))
    print('Number of clusters: {}'.format(n_clusters))
    
    # annotate the pickups with their equivalent clusters
    annotated_pickups = np.insert(coords, 2, cluster_labels, axis = 1)
   
    return annotated_pickups, n_clusters

> **Generate understandable recommendations**: DBSCAN does not define centroids like K-means. Hence, the visualization of the hot spots with all the pickups locations in those zones will be a mess for drivers and hard to be understood. To solve this issue, we can compute the **centroid** of each cluster and only visualize centroids in the recommendation. In this way, the recommendation will be more intelligible and not too cluttered.

>> The problem is that the centroid of a collection of points might be one of its points or a point that does not exist in the collection. Hence, instead of computing the **centroid** of a cluster, we want to find the point in the cluster **nearest** that centroid. We will call such point the **center-most** point.

>> To do so, we will use 2 geometry libraries: The **shapely** library, more precisely, the class **MultiPoint** that allows us to implement a collection of points (cluster of locations) and the **geopy** library and its **great-circle** distance.

In [3]:
def get_centermost_point(cluster):
    # centroid : tuple (latitude, longitude)
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    # centermost : one point in the cluster that gives the shortest distance to the centroid in meters(m)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)

    return tuple(centermost_point)

> Given the **_annotated pickups_** and the **_number of clusters_** got by the function **get_clusters**, the function **get_centermost_points** computes the **_centermost_** locations of each hot-spot and its **_size_** (the number of pickups). These 2 values will help us to generate understandable recommendations. We exclude the outliers cluster.

In [4]:
def get_centermost_points(annotated_pickups, n_clusters):
    # return centermost coordinates (lat, lon) and the clusters' sizes
    # annotated_pickups [lat, lon, cluster] : np array
    
    # initialize lists for centermost_point and size
    lats = []
    lons = []
    sizes = []
    
    # grouping clusters (without considering the noise cluster)
    for c in range(n_clusters-1):
        # get all pickups in the cluster "c"
        mask = (annotated_pickups[:, 2] == c)
        cluster = annotated_pickups[mask, :][:,0:2]
        
        # filter empty clusters
        if cluster.any():
            centermost =  get_centermost_point(cluster)
            lats.append(centermost[0]) 
            lons.append(centermost[1])
            sizes.append(len(cluster))
            
    # make the computed data in a df     
    centermosts = [lats,lons,sizes]
    df_centermosts =  pd.DataFrame({'lat':lats, 'lon':lons, 'size':sizes})
    
    return df_centermosts

> The last function **_get_location_info_** is used to get information about a location given its gps coordinates (latitude & longitude). To do so, we will use **_Nominatim_** class of the geocoding library **_geopy_**.

In [5]:
def get_location_info(latitude, longitude):
    location = str( geolocator.reverse(str(latitude)+","+str(longitude)))
    return location

 ## Load data

In [6]:
pickups = pd.read_csv(file_path)
pickups.head()

Unnamed: 0,date,Lat,Lon,weekday,hour
0,2014-08-01 00:03:00,40.7366,-73.9906,4,0
1,2014-08-01 00:09:00,40.726,-73.9918,4,0
2,2014-08-01 00:12:00,40.7209,-74.0507,4,0
3,2014-08-01 00:12:00,40.7387,-73.9856,4,0
4,2014-08-01 00:12:00,40.7323,-74.0077,4,0


 ## Driver Query

> To illustrate the functioning of the proposed implementation we will take an example of a driver query:  
>> Where are the locations which experience more than 25 pickups that are within 50 meters of each other between 4 pm and 6 pm on Thursday ? 
>>> 1) **Run** DBSCAN using the query elements to find the hot zones. 

>>> 2) **Visualize** the hot-zones (only the centermost locations) using a map to make understandable recommendations for the driver.

 ### Find the hot zones

In [7]:
# Query params
'''
day = Thursday (3)
hour = between 4 and 6 pm
max_distance = 50 m = 0.05 km (eps: maximum distance between two cluster members in kilometers)
min_pickups = 25 (min_smaples: minimum number of cluster members)
'''

# get pickups data given the driver query
mask = (pickups['weekday']== 3) & (pickups['hour'] >= 4) & (pickups['hour'] <= 6)
df = pickups[mask].reset_index()

# Clustering params
max_distance = 0.05
min_pickups = 25

# get hot zones
annotated_pickups, n_clusters= get_clusters(max_distance ,min_pickups, df)

# get zones centermost locations & the number of pickups in each zone (size)
df_centermosts = get_centermost_points(annotated_pickups, n_clusters)
df_centermosts['location_info'] = df_centermosts.apply(lambda row: get_location_info(row.lat, row.lon), axis = 1)
df_centermosts

Number of clusters: 8


Unnamed: 0,lat,lon,size,location_info
0,40.646,-73.7767,78,"John F. Kennedy International Airport, JFK Acc..."
1,40.6465,-73.7898,34,"JFK Terminal 8, Terminal 8 Parking, Queens, Qu..."
2,40.6448,-73.782,168,"John F. Kennedy International Airport, JFK Acc..."
3,40.695,-74.178,42,"Newark Liberty International Airport, US 1-9 L..."
4,40.7387,-74.0088,39,"108, Horatio Street, Manhattan Community Board..."
5,40.6484,-73.7828,31,"Departures Parking Connector, Queens, Queens C..."
6,40.7111,-74.0057,27,"8 Spruce Street, 8, Spruce Street, Manhattan C..."


 ### Visualize recommendations

> We will use the **_size_** value to emphasize the most condensed locations during a predefined time frame. This way, recommendations will be more informative and easier to be understood.

In [None]:
fig = px.scatter_mapbox(
    df_centermosts, 
    lat= "lat", 
    lon = "lon", 
    size = 'size',
    color = 'size',
    text = 'location_info',
    mapbox_style = "carto-positron",
    zoom = 10,
    title = 'Hot Spots for Thursday between 4 pm and 6 pm',
    width = 1500, height = 600  
)

fig.show()

 # Application

> We have implemented the proposed solution in a user-friendly application that is more manageable by drivers using **_Dash, HTML_** and **_CSS_**. You find the implementation in the file **_app.py_**.

<img src="img/driver_query.png">

![image](img/map.png)

 # Further improvements

> A further improvement that can be made is to get automatically the location of the driver and recommend the **_K_** nearest hot zones in order to alleviate the decision phase and help drivers make optimal decisions. (**TO BE DONE !!!**)