# UBER Unsupervised learning

The data file for this notebook was originally quite large, so the data used here is a sample of 30,000 entries.

In [8]:
# Import libraries

import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import matplotlib.pyplot as plt
import random
import os

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.cluster import DBSCAN

path = os.getcwd() + '/'

# Convert features to readable formats

In [11]:
# Import dataset
data = pd.read_csv(path + "data_sample.csv")

In [12]:
# Convert the Date/Time column to datetime and store it in a new Date column
data["Date"] = pd.to_datetime(data['Date/Time'], dayfirst=False)

In [13]:
# Delete Date/Time since it is now redundant with Date
# Delete column Base as it might skew the results. Base represents the geographic allocation of Uber drivers.
del data["Date/Time"]
del data["Base"]

In [14]:
# Extract day of month, day of week (Monday=0) and hour from the Date column.
data['Day'] = pd.DatetimeIndex(data['Date']).day
data['Day_of_week'] = pd.DatetimeIndex(data['Date']).weekday
data['time'] = pd.DatetimeIndex(data['Date']).hour
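
As a quick sanity check on the extraction above, here is a minimal sketch on two made-up rows. The sample timestamps are hypothetical and only assume a month/day/year layout, consistent with `dayfirst=False`. It uses the `.dt` accessor, which exposes the same fields as wrapping the column in `pd.DatetimeIndex`.

```python
import pandas as pd

# Hypothetical two-row sample in a month/day/year hour:minute:second layout
sample = pd.DataFrame({"Date/Time": ["4/1/2014 0:11:00", "4/6/2014 17:45:00"]})
sample["Date"] = pd.to_datetime(sample["Date/Time"], dayfirst=False)

# The .dt accessor gives the same fields as pd.DatetimeIndex(...)
sample["Day"] = sample["Date"].dt.day
sample["Day_of_week"] = sample["Date"].dt.weekday  # Monday=0 ... Sunday=6
sample["time"] = sample["Date"].dt.hour  # hour of day, 0-23

print(sample[["Day", "Day_of_week", "time"]])
```

Note that `weekday` counts from Monday = 0, so the `Day_of_week` slider on the maps runs from 0 (Monday) to 6 (Sunday).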

# Preprocessing

The question now becomes which features to include as input for the model.
Since the goal is to describe hot zones per day of the week, we have elected to cluster on geography alone. The resulting clusters are then filtered on the Mapbox map using a day-of-week slider.


In [15]:
# Indicating the two numeric features
numeric_features = ['Lat', 'Lon']

In [16]:
# indicate transforms
numeric_transformer = StandardScaler()

In [17]:
# Declare preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ])

We only want to feed geographic data to the model, while still being able to filter the results by variables such as date and time. The dataset is therefore copied: the copy fed to the model, named "data_sample_geo", is reduced to Lat and Lon, and the original "data" keeps the time columns for filtering.
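
The sketch below illustrates, on hypothetical rows, why only Lat and Lon reach the model: a `ColumnTransformer` leaves `remainder` at its default `"drop"`, so any column not listed in `transformers` is discarded from the output. Only the column names follow the notebook; the values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical rows; only the column names follow the notebook
toy = pd.DataFrame({
    "Lat": [40.70, 40.75, 40.80],
    "Lon": [-74.00, -73.98, -73.95],
    "Day_of_week": [0, 3, 6],
})

# remainder defaults to "drop": columns not listed are left out of the output
toy_preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["Lat", "Lon"])]
)

X_toy = toy_preprocessor.fit_transform(toy)
print(X_toy.shape)  # two columns remain: scaled Lat and scaled Lon
```

Deleting the extra columns by hand, as done in this notebook, therefore makes the intent explicit but does not change what the preprocessor outputs.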

In [20]:
data_sample_geo = data.copy()

In [21]:
# Delete the derived time columns (Date stays in the copy but is ignored by the preprocessor)
del data_sample_geo['Day_of_week']
del data_sample_geo['time']
del data_sample_geo['Day']

In [22]:
# preprocess dataset
X_full = preprocessor.fit_transform(data_sample_geo)

# K-means elbow

First we will attempt to use K-means to cluster the geographic data points.

In [23]:
# Fit K-means for k = 1..11 and record the WCSS (inertia) to determine an appropriate number of clusters
wcss = []
k = []
for i in range(1, 12):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X_full)
    wcss.append(kmeans.inertia_)
    k.append(i)
    print("WCSS for K={} --> {}".format(i, wcss[-1]))

WCSS for K=1 --> 60000.000000000015
WCSS for K=2 --> 42145.8843353345
WCSS for K=3 --> 28715.755289147874
WCSS for K=4 --> 23418.109823751984
WCSS for K=5 --> 18713.46636998722
WCSS for K=6 --> 15118.079574355083
WCSS for K=7 --> 12833.35555668224
WCSS for K=8 --> 11248.020099467438
WCSS for K=9 --> 9710.504403566869
WCSS for K=10 --> 8771.425410823234
WCSS for K=11 --> 7969.153090252349


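The curve has no sharp elbow, but by k = 10 each additional cluster reduces the WCSS by less than 1,000. This first loop therefore suggests that k = 10 is a reasonable place to evaluate what the dataset looks like with clusters.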
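
The notebook imports `silhouette_score` but relies on the WCSS alone. As a complementary check, the sketch below scores each candidate k with the silhouette coefficient. It runs on synthetic blobs standing in for `X_full`, so `X_demo`, the blob parameters and `n_init=10` are assumptions for illustration, and the scores say nothing about the Uber data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled Lat/Lon matrix (X_full in the notebook)
points, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)
X_demo = StandardScaler().fit_transform(points)

scores = {}
for n_clusters in range(2, 8):  # the silhouette needs at least 2 clusters
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(X_demo)
    scores[n_clusters] = silhouette_score(X_demo, labels)
    print("Silhouette for K={} --> {:.3f}".format(n_clusters, scores[n_clusters]))

# Higher is better; values lie between -1 and 1
best_k = max(scores, key=scores.get)
print("Highest silhouette at K =", best_k)
```

On the full 30,000-row sample the silhouette is expensive because it compares pairs of points; `silhouette_score` accepts a `sample_size` argument to score a random subset instead.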

In [24]:
# Declare an instance of K-means with 10 clusters
kmeans_10 = KMeans(n_clusters=10, random_state=0)

In [25]:
# Fit model
kmeans_10.fit(X_full)

In [28]:
# Add a new column 'cluster' to the dataframe holding the cluster assigned to each row by the trained model.
data["cluster"] = kmeans_10.predict(X_full)

In [29]:
# Visualisation (cast cluster to str so it is drawn as discrete colors)
data['cluster'] = data['cluster'].astype(str)
fig = px.scatter_mapbox(
        data, 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        animation_frame="Day_of_week",
        animation_group="Day_of_week",
        mapbox_style="carto-positron"
)

fig.show()

While this first map does indicate general hot zones, it fails to capture real geographic distances, as some of these zones are separated by a river. Moreover, we can question the pertinence of a model that splits Manhattan into three separate zones rather than one, since Manhattan represents a center of gravity for the city of New York.

# Test with a DBSCAN model

To compare and contrast with K-means, we will apply a DBSCAN model using the Manhattan metric.
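
To show how DBSCAN behaves with the Manhattan metric, here is a minimal sketch on hand-made points. The points, `eps=0.25` and `min_samples=3` are hypothetical and chosen only so the result is easy to follow. The Manhattan distance between two points is |Δx| + |Δy|, and a point that is neither dense enough nor close to a dense point receives the label -1 (noise).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical points: two tight groups plus one isolated point
X_pts = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [10.0, 0.0],
])

# Within a group the largest Manhattan distance is 0.1 + 0.1 = 0.2 <= eps
db_toy = DBSCAN(eps=0.25, min_samples=3, metric="manhattan").fit(X_pts)
print(db_toy.labels_)  # -> [ 0  0  0  0  1  1  1  1 -1]
```

Because the notebook fits DBSCAN on standardized coordinates, its `eps` is expressed in standard deviations of Lat and Lon rather than in degrees or kilometres.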

In [30]:
# Declare an instance of DBSCAN
db = DBSCAN(eps=0.2, min_samples=200, metric="manhattan")

In [31]:
# Fit model (this step might take some time and require a fair amount of RAM)
db.fit(X_full)

In [32]:
# The cluster column now holds the label attributed by the model to each row (-1 marks noise).
data["cluster"] = db.labels_

In [33]:
# Convert int to str so the clusters are drawn as discrete colors and not as a gradient.
data['cluster'] = data['cluster'].astype(str)
# Visualize
fig = px.scatter_mapbox(
        data, 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        animation_frame="Day_of_week",
        animation_group="Day_of_week",
        mapbox_style="carto-positron"
)

fig.show()

In [34]:
# Count the points in each cluster.
data['cluster'].value_counts()

0     22924
-1     3577
3      1024
1       696
4       691
2       650
5       236
6       202
Name: cluster, dtype: int64

This second version using DBSCAN captures geography more effectively than the K-means model. Namely, it clusters Manhattan into a single zone and detects two new clusters centered around New York's two airports. However, it leaves a considerable number of points unclustered: 3,577 of the 30,000 rows carry the noise label -1.
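
To put the outlier count in proportion, the short calculation below reuses the cluster sizes printed by `value_counts()` above. The labels are written as integers here for readability, whereas in the notebook they are strings at this point.

```python
# Cluster sizes copied from the value_counts() output; -1 is DBSCAN's noise label
counts = {0: 22924, -1: 3577, 3: 1024, 1: 696, 4: 691, 2: 650, 5: 236, 6: 202}

total = sum(counts.values())
noise_share = counts[-1] / total
largest_share = max(counts.values()) / total

print("total points:", total)
print("noise share: {:.1%}".format(noise_share))
print("largest cluster share: {:.1%}".format(largest_share))
```

Roughly 12% of the points are treated as noise, while cluster 0 alone holds about 76% of the sample.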