<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Uber_logo_2018.svg/1024px-Uber_logo_2018.svg.png" alt="UBER LOGO" width="50%" />

# UBER Pickups 

## Company's Description 📇

<a href="http://uber.com/" target="_blank">Uber</a> is one of the most famous startups in the world. It started as a ride-sharing application for people who couldn't afford a taxi. Uber has since expanded into food delivery with <a href="https://www.ubereats.com/fr-en" target="_blank">Uber Eats</a>, package delivery, freight transportation, and even urban transportation with <a href="https://www.uber.com/fr/en/ride/uber-bike/" target="_blank">Jump Bike</a> and <a href="https://www.li.me/" target="_blank">Lime</a>, which the company funded.


The company's goal is to revolutionize transportation across the globe. It now operates in about 70 countries and 900 cities and generates over $14 billion in revenue! 😮


## Project 🚧

One of the main pain points that Uber's team found is that drivers are sometimes not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in the Castro.

(If you are not familiar with the Bay Area, check out <a href="https://www.google.com/maps/place/San+Francisco,+CA,+USA/@37.7515389,-122.4567213,13.43z/data=!4m5!3m4!1s0x80859a6d00690021:0x4a501367f076adff!8m2!3d37.7749295!4d-122.4194155" target="_blank">Google Maps</a>.)

Even though the two neighborhoods are not that far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users are willing to wait 5 to 7 minutes; beyond that, they cancel their ride.

Therefore, Uber's data team would like to work on a project where **their app would recommend hot-zones in major cities to be in at any given time of day.**  

## Goals 🎯

Uber already has data about pickups in major cities. Your objective is to create algorithms that determine where the hot-zones are, so that drivers know where they should be. You will therefore:

* Create an algorithm to find hot zones 
* Visualize results on a nice dashboard 

## Scope of this project 🖼️

To start off, Uber wants to try this feature in New York City, so you will focus only on this city. Data can be found here:

👉👉<a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/Projects/uber-trip-data.zip" target="_blank"> Uber Trip Data</a> 👈👈

**You only need to focus on New York City for this project**

## Helpers 🦮

Here are a few tips to help you with this project:

### Clustering is your friend 

Clustering techniques are a perfect fit for the job. Think about it: all the pickup locations can be gathered into different clusters. You can then use **cluster coordinates to pin hot zones** 😉
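As a minimal sketch of this idea, assuming synthetic pickup coordinates drawn around two made-up centers (not the Uber data), KMeans can group the points and its `cluster_centers_` can serve as candidate hot-zone pins:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic pickups around two hypothetical centers (illustration only, not the Uber dataset)
rng = np.random.default_rng(0)
spot_a = rng.normal(loc=[40.75, -73.98], scale=0.005, size=(200, 2))
spot_b = rng.normal(loc=[40.64, -73.78], scale=0.005, size=(200, 2))
pickups = np.vstack([spot_a, spot_b])

# Group the pickups into 2 clusters; each centroid is a candidate hot zone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pickups)
hot_zones = kmeans.cluster_centers_  # shape (2, 2): one (lat, lon) pair per cluster
print(hot_zones)
```

The number of clusters is fixed at 2 here only because the synthetic data was built that way; on real pickups it has to be chosen.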
    

### Create maps with `plotly` 

Check out the <a href="https://plotly.com/" target="_blank">Plotly</a> documentation: you can create maps and populate them easily. There are other libraries, of course, but this one should do the job well.


### Start small, grow big

Even though Uber wants hot-zones per hour and per day of week, you should first **start small**. Pick one day at a given hour and **then start to generalize** your approach.
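One possible way to organize this, sketched on a tiny made-up DataFrame that only mimics the `Date/Time`, `Lat` and `Lon` columns, is to derive the day of week and the hour once, filter a single slot first, and then iterate over all slots with `groupby`. The column names `day_of_week` and `hour` are assumptions introduced for the example:

```python
import pandas as pd

# Tiny hypothetical sample using the same "Date/Time" text format as the dataset
df = pd.DataFrame({
    "Date/Time": ["7/1/2014 11:03:00", "7/1/2014 18:45:00", "7/5/2014 11:10:00"],
    "Lat": [40.75, 40.72, 40.76],
    "Lon": [-73.97, -74.00, -73.98],
})
df["Date/Time"] = pd.to_datetime(df["Date/Time"], format="%m/%d/%Y %H:%M:%S")

# Derive the two grouping keys
df["day_of_week"] = df["Date/Time"].dt.day_name()
df["hour"] = df["Date/Time"].dt.hour

# Start small: one day of week at one hour
one_slot = df[(df["day_of_week"] == "Tuesday") & (df["hour"] == 11)]

# Then generalize: visit every (day, hour) slot
for (day, hour), group in df.groupby(["day_of_week", "hour"]):
    print(day, hour, len(group))
```

Each `group` could then be passed to the same clustering code that was validated on the single slot.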

## Deliverable 📬

To complete this project, your team should: 

* Have a map with hot-zones using any Python library (`plotly` or anything else). 
* Describe hot-zones **at least** per day of week. 
* Compare the results of **at least** two unsupervised algorithms, such as KMeans and DBSCAN. 

Your maps should look something like this: 

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Clusters_uber_pickups.png" alt="Uber Cluster Map" />

# I. EDA 

## Import libraries

In [1]:
# install plotly
!pip install plotly



In [2]:
# import Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN
import plotly.express as px
from sklearn.metrics import silhouette_score

## Load dataset

In [3]:
# Load the dataset
uber = pd.read_csv("Data/uber-raw-data-jul14.csv")

## Basic info & stats of the data

In [5]:
print("Uber pickups data :")
display(uber)

display(uber.info())

print("Basics statistics: ")
display(uber.describe(include="all"))

print("Percentage of missing values : ")
display((uber.isna().sum()/uber.shape[0]*100).sort_values(ascending=False))

Uber pickups data :


| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| 0 | 7/1/2014 0:03:00 | 40.7586 | -73.9706 | B02512 |
| 1 | 7/1/2014 0:05:00 | 40.7605 | -73.9994 | B02512 |
| 2 | 7/1/2014 0:06:00 | 40.7320 | -73.9999 | B02512 |
| 3 | 7/1/2014 0:09:00 | 40.7635 | -73.9793 | B02512 |
| 4 | 7/1/2014 0:20:00 | 40.7204 | -74.0047 | B02512 |
| ... | ... | ... | ... | ... |
| 796116 | 7/31/2014 23:22:00 | 40.7285 | -73.9846 | B02764 |
| 796117 | 7/31/2014 23:23:00 | 40.7615 | -73.9868 | B02764 |
| 796118 | 7/31/2014 23:29:00 | 40.6770 | -73.9515 | B02764 |
| 796119 | 7/31/2014 23:30:00 | 40.7225 | -74.0038 | B02764 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 796121 entries, 0 to 796120
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   Date/Time  796121 non-null  object 
 1   Lat        796121 non-null  float64
 2   Lon        796121 non-null  float64
 3   Base       796121 non-null  object 
dtypes: float64(2), object(2)
memory usage: 24.3+ MB


None

Basics statistics: 


| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| count | 796121 | 796121.0 | 796121.0 | 796121 |
| unique | 44286 | | | 5 |
| top | 7/15/2014 19:30:00 | | | B02617 |
| freq | 79 | | | 310160 |
| mean | | 40.739141 | -73.972353 | |
| std | | 0.040551 | 0.05866 | |
| min | | 39.7214 | -74.826 | |
| 25% | | 40.7209 | -73.9961 | |
| 50% | | 40.7425 | -73.9832 | |
| 75% | | 40.7608 | -73.9651 | |


Percentage of missing values : 


Base         0.0
Lon          0.0
Lat          0.0
Date/Time    0.0
dtype: float64

In [5]:
uber['Base'].value_counts()

B02617    310160
B02598    245597
B02682    196754
B02512     35021
B02764      8589
Name: Base, dtype: int64

# II. First clustering on only one day / one hour

#### Create data selection

In [4]:
# convert "Date/time" to datetime format
uber["Date/Time"] = pd.to_datetime(uber["Date/Time"])

In [5]:
uber["Date/Time"][0]

Timestamp('2014-07-01 00:03:00')

In [6]:
uber_d1 = uber.copy()

In [7]:
# For the first analysis, we focus on a single hour of a single day:
# 2014-07-01 from 11:00 (inclusive) to 12:00 (exclusive)
mask = (uber_d1['Date/Time'] >= '2014-07-01 11:00:00') & (uber_d1['Date/Time'] < '2014-07-01 12:00:00')
selecthour=uber_d1.loc[mask]
display(selecthour)

| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| 249 | 2014-07-01 11:00:00 | 40.7571 | -73.9744 | B02512 |
| 250 | 2014-07-01 11:00:00 | 40.8584 | -73.9346 | B02512 |
| 251 | 2014-07-01 11:03:00 | 40.7582 | -73.9653 | B02512 |
| 252 | 2014-07-01 11:03:00 | 40.7210 | -74.0084 | B02512 |
| 253 | 2014-07-01 11:04:00 | 40.7559 | -73.9833 | B02512 |
| ... | ... | ... | ... | ... |
| 787597 | 2014-07-01 11:30:00 | 40.7712 | -73.9811 | B02764 |
| 787598 | 2014-07-01 11:34:00 | 40.6824 | -73.9617 | B02764 |
| 787599 | 2014-07-01 11:36:00 | 40.7365 | -73.9933 | B02764 |
| 787600 | 2014-07-01 11:46:00 | 40.7488 | -73.9756 | B02764 |


#### Preprocessing

In [8]:
# preprocessing on features
numeric_features = [1,2]
categorical_features = [3]

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop="first")

preprocessor = ColumnTransformer(
    transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
    ])

print("Preprocessing on features ... ")
X = preprocessor.fit_transform(selecthour)
print(X[:5])
print("...Done.")

Preprocessing on features ... 
[[ 0.39146691  0.02590775  0.          0.          0.          0.        ]
 [ 3.13674293  0.94967072  0.          0.          0.          0.        ]
 [ 0.42127741  0.23711989  0.          0.          0.          0.        ]
 [-0.58685949 -0.76323649  0.          0.          0.          0.        ]
 [ 0.35894637 -0.18066236  0.          0.          0.          0.        ]]
...Done.


#### KMeans model

##### Find the optimal k

In [9]:
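# Elbow method to find the "best" k
# Note: a bare `%time` line times an empty statement, which is why the readings
# below are in microseconds; `%%time` as the first line of the cell would time the loop.

%time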
wcss =  []
k = []
for i in range (2,9): 
    kmeans = KMeans(n_clusters= i, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    k.append(i)

print(k)
print(wcss)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.91 µs
[2, 3, 4, 5, 6, 7, 8]
[1638.0297632408503, 1306.6418680181177, 1138.3392593446918, 990.1501804099037, 858.4326618336308, 749.639289652669, 674.546688563885]


In [10]:
# Create figure
fig= px.line( x=k, y=wcss)
fig.show(renderer="iframe")

In [11]:
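# Silhouette method to refine our hypothesis for k
# Note: this bare `%time` line does not time the loop (use `%%time` at the top of the cell for that).

%time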
sil = []
k = []
for i in range (2,9): 
    kmeans = KMeans(n_clusters= i, random_state = 0)
    kmeans.fit(X)
    sil.append(silhouette_score(X, kmeans.labels_))
    k.append(i)

print(k)
print(sil)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.68 µs
[2, 3, 4, 5, 6, 7, 8]
[0.5373276860954993, 0.23643531130135434, 0.2519051401397063, 0.2775838851582872, 0.37028392630747514, 0.3856420346755343, 0.3573816591796094]


In [12]:
# Create figure
fig= px.bar( x=k, y=sil)
fig.show(renderer="iframe")
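To read the silhouette scores without the chart, they can be ranked directly. The sketch below reuses the values printed by the silhouette cell above; the variable names `k_values`, `sil_scores` and `ranked_k` are introduced only for this example:

```python
import numpy as np

# Values copied from the silhouette output above
k_values = [2, 3, 4, 5, 6, 7, 8]
sil_scores = [0.5373276860954993, 0.23643531130135434, 0.2519051401397063,
              0.2775838851582872, 0.37028392630747514, 0.3856420346755343,
              0.3573816591796094]

# Rank the candidate k values by silhouette score, best first
order = np.argsort(sil_scores)[::-1]
ranked_k = [k_values[i] for i in order]
print(ranked_k)  # [2, 7, 6, 8, 5, 4, 3]
```

On this hour of data the silhouette score alone points to k = 2, with k = 7 next. The score is only one input: the elbow curve and the number of zones that is useful to drivers also matter, so the final choice of k remains a judgment call.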

##### KMeans with optimized k

In [13]:
# Create KMeans instance
kmeans = KMeans (n_clusters = 4, random_state=0)

# Apply KMeans on X
kmeans.fit(X)

KMeans(n_clusters=4, random_state=0)

In [14]:
kmeans.cluster_centers_

array([[ 0.44869032, -0.05181067,  0.28470588,  0.34117647,  0.29176471,
         0.01411765],
       [-0.80264144, -0.39050265,  0.38338658,  0.32268371,  0.23322684,
         0.00958466],
       [-1.56518589,  4.24024747,  0.33333333,  0.26666667,  0.4       ,
         0.        ],
       [ 2.00026603,  1.92007498,  0.4047619 ,  0.23809524,  0.30952381,
         0.        ]])
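The first two columns of `cluster_centers_` are in standardized units because `Lat` and `Lon` went through `StandardScaler`; the remaining four columns come from the one-hot encoded `Base`. To pin a centroid on a map, the scaled coordinates need to be mapped back to degrees. A minimal sketch, assuming a scaler fitted on only four of the rows shown earlier (so the resulting numbers are illustrative, not the notebook's real centroid):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four raw (Lat, Lon) pairs standing in for selecthour[["Lat", "Lon"]]
coords = np.array([[40.7571, -73.9744],
                   [40.8584, -73.9346],
                   [40.7582, -73.9653],
                   [40.7210, -74.0084]])
scaler = StandardScaler().fit(coords)

# A centroid expressed in scaled units (first two columns of one cluster center)
scaled_center = np.array([[0.44869032, -0.05181067]])

# Map it back to degrees of latitude and longitude
center_latlon = scaler.inverse_transform(scaled_center)
print(center_latlon)
```

With the notebook's fitted `preprocessor`, the fitted scaler should be reachable as `preprocessor.named_transformers_["num"]`, so the same `inverse_transform` call could be applied to `kmeans.cluster_centers_[:, :2]`.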

In [15]:
# Add new column on dataset with results of KMeans
selecthour["Cluster_KMeans"] = kmeans.labels_
display(selecthour.head())
print()
print(selecthour["Cluster_KMeans"].value_counts())
print()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



| | Date/Time | Lat | Lon | Base | Cluster_KMeans |
|---|---|---|---|---|---|
| 249 | 2014-07-01 11:00:00 | 40.7571 | -73.9744 | B02512 | 0 |
| 250 | 2014-07-01 11:00:00 | 40.8584 | -73.9346 | B02512 | 3 |
| 251 | 2014-07-01 11:03:00 | 40.7582 | -73.9653 | B02512 | 0 |
| 252 | 2014-07-01 11:03:00 | 40.721 | -74.0084 | B02512 | 1 |
| 253 | 2014-07-01 11:04:00 | 40.7559 | -73.9833 | B02512 | 0 |



0    425
1    313
3     42
2     15
Name: Cluster_KMeans, dtype: int64



In [16]:
# Create a scatter mapbox with KMeans clusters
fig = px.scatter_mapbox(selecthour, lat="Lat", lon="Lon", zoom=10, color="Cluster_KMeans", mapbox_style="carto-positron")
fig.show(renderer="iframe_connected")

#### DBSCAN model

In [71]:
# DBScan
db = DBSCAN(eps=0.6, min_samples=40, metric = 'manhattan')
db.fit(X)

DBSCAN(eps=0.6, metric='manhattan', min_samples=40)

In [72]:
# Clusters created by DBScan
np.unique(db.labels_)

array([-1,  0,  1,  2])

In [73]:
# Add new column on dataset with results of DBScan
selecthour["Cluster_DBScan"] = db.labels_
display(selecthour.head())
print()
print(selecthour["Cluster_DBScan"].value_counts())
print()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



| | Date/Time | Lat | Lon | Base | Cluster_KMeans | Cluster_DBScan |
|---|---|---|---|---|---|---|
| 249 | 2014-07-01 11:00:00 | 40.7571 | -73.9744 | B02512 | 0 | -1 |
| 250 | 2014-07-01 11:00:00 | 40.8584 | -73.9346 | B02512 | 3 | -1 |
| 251 | 2014-07-01 11:03:00 | 40.7582 | -73.9653 | B02512 | 0 | -1 |
| 252 | 2014-07-01 11:03:00 | 40.721 | -74.0084 | B02512 | 1 | -1 |
| 253 | 2014-07-01 11:04:00 | 40.7559 | -73.9833 | B02512 | 0 | -1 |



-1    223
 1    203
 0    202
 2    167
Name: Cluster_DBScan, dtype: int64
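Unlike KMeans, DBSCAN does not expose cluster centers, so a pin for each hot zone has to be derived from the labelled points. One simple option, shown here on a small made-up DataFrame with the same `Cluster_DBScan` column, is to drop the noise label `-1` and average the coordinates per cluster. This is an assumption for illustration; since DBSCAN clusters can have irregular shapes, the mean is only a rough summary.

```python
import pandas as pd

# Hypothetical labelled pickups; -1 is DBSCAN's label for noise
labelled = pd.DataFrame({
    "Lat": [40.75, 40.76, 40.64, 40.65, 40.90],
    "Lon": [-73.98, -73.99, -73.78, -73.79, -73.50],
    "Cluster_DBScan": [0, 0, 1, 1, -1],
})

# Drop noise, then average coordinates per cluster to get one pin per hot zone
hot_zones = (
    labelled[labelled["Cluster_DBScan"] != -1]
    .groupby("Cluster_DBScan")[["Lat", "Lon"]]
    .mean()
)

# Attach the cluster sizes (aligned on the cluster label, so -1 is ignored)
hot_zones["n_pickups"] = labelled["Cluster_DBScan"].value_counts()
print(hot_zones)
```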



In [74]:
#Visualize all clusters (except outliers : -1) with DBScan results
fig = px.scatter_mapbox(selecthour[selecthour["Cluster_DBScan"] != -1], 
                        lat='Lat', lon='Lon', color='Cluster_DBScan', 
                        mapbox_style="open-street-map", zoom = 10)
fig.show(renderer="iframe_connected")

# III. Clustering on more data

### DBSCAN

In [99]:
# Create sample for testing hyperparameters
uber_sample = uber.sample(70000)

In [100]:
# preprocessing on features
numeric_features = [1,2]
categorical_features = [3]

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop="first")

preprocessor = ColumnTransformer(
    transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
    ])

print("Preprocessing on features ... ")
X2 = preprocessor.fit_transform(uber_sample)
print(X2[:5])
print("...Done.")

Preprocessing on features ... 
[[ 0.04891627 -0.55314556  0.          1.          0.          0.        ]
 [-0.59422257 -0.40673752  0.          1.          0.          0.        ]
 [-0.49162066 -0.55484798  0.          1.          0.          0.        ]
 [-1.95307234  4.19490104  0.          1.          0.          0.        ]
 [ 0.42178665 -0.37609398  1.          0.          0.          0.        ]]
...Done.


In [140]:
# DBScan
db1 = DBSCAN(eps=0.3, min_samples=2000, metric='manhattan')
db1.fit(X2)

# Other setting tried: eps=0.7 / min_samples=500

DBSCAN(eps=0.3, metric='manhattan', min_samples=2000)
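In the cell above, `eps` is expressed in standardized units (and the distance also includes the one-hot `Base` columns), which makes it hard to relate to a distance on the ground. One alternative, not used in this notebook, is to run DBSCAN on raw coordinates with the haversine metric so that `eps` can be reasoned about in kilometers. A sketch on synthetic points, where `eps_km` and `min_samples` are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088

# Synthetic pickups: two dense spots plus one isolated point (illustration only)
rng = np.random.default_rng(0)
spot_a = rng.normal(loc=[40.75, -73.98], scale=0.001, size=(50, 2))
spot_b = rng.normal(loc=[40.64, -73.78], scale=0.001, size=(50, 2))
isolated = np.array([[41.00, -73.50]])
coords_deg = np.vstack([spot_a, spot_b, isolated])

# Haversine expects [lat, lon] in radians; eps is then an angle,
# so a radius in km is divided by the Earth's radius
eps_km = 0.5
db = DBSCAN(eps=eps_km / EARTH_RADIUS_KM, min_samples=10,
            metric="haversine", algorithm="ball_tree")
labels = db.fit_predict(np.radians(coords_deg))
print(np.unique(labels))
```

With this setup only `Lat` and `Lon` drive the clustering, so the result is not comparable one-to-one with the scaled, Manhattan-distance run above.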

In [141]:
np.unique(db1.labels_)

array([-1,  0,  1,  2])

In [142]:
# Add new column on dataset with results of DBScan
uber_sample["Cluster_DBScan"] = db1.labels_
display(uber_sample.head())
print()
print(uber_sample["Cluster_DBScan"].value_counts())
print()

| | Date/Time | Lat | Lon | Base | Cluster_DBScan |
|---|---|---|---|---|---|
| 530412 | 2014-07-26 21:51:00 | 40.7409 | -74.005 | B02617 | 0 |
| 451445 | 2014-07-20 01:31:00 | 40.7152 | -73.9964 | B02617 | 0 |
| 576783 | 2014-07-31 00:08:00 | 40.7193 | -74.0051 | B02617 | 0 |
| 448118 | 2014-07-19 20:27:00 | 40.6609 | -73.7261 | B02617 | -1 |
| 62563 | 2014-07-04 23:14:00 | 40.7558 | -73.9946 | B02598 | 1 |



-1    27062
 0    18356
 1    13989
 2    10593
Name: Cluster_DBScan, dtype: int64



In [143]:
#Visualize all clusters (except outliers : -1) with DBScan results
fig = px.scatter_mapbox(uber_sample[uber_sample["Cluster_DBScan"] != -1], 
                        lat='Lat', lon='Lon', color='Cluster_DBScan', 
                        mapbox_style="open-street-map", zoom = 10)
fig.show(renderer="iframe_connected")

In [144]:
#Visualize all clusters (with outliers : -1) with DBScan results
fig = px.scatter_mapbox(uber_sample, 
                        lat='Lat', lon='Lon', color='Cluster_DBScan', 
                        mapbox_style="open-street-map", zoom = 10)
fig.show(renderer="iframe_connected")
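The deliverable asks for hot zones per day of week. One possible way to extend the clustering is to fit one model per day and collect the centroids in a single table that can be plotted with a day selector. The sketch below uses a synthetic stand-in for the `uber` DataFrame; `N_ZONES`, `day_of_week` and `hot_zones_by_day` are hypothetical names and values chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the `uber` DataFrame (illustration only)
rng = np.random.default_rng(0)
n = 300
minutes = rng.integers(0, 7 * 24 * 60, n)  # random offsets within one week
df = pd.DataFrame({
    "Date/Time": pd.Timestamp("2014-07-01") + pd.to_timedelta(minutes, unit="min"),
    "Lat": rng.normal(40.74, 0.04, n),
    "Lon": rng.normal(-73.97, 0.06, n),
})
df["day_of_week"] = df["Date/Time"].dt.day_name()

# Fit one KMeans per day of week and keep the centroids as hot zones
N_ZONES = 3  # hypothetical value; in practice choose it per day (elbow, silhouette)
rows = []
for day, group in df.groupby("day_of_week"):
    km = KMeans(n_clusters=N_ZONES, n_init=10, random_state=0)
    km.fit(group[["Lat", "Lon"]])
    for lat, lon in km.cluster_centers_:
        rows.append({"day_of_week": day, "Lat": lat, "Lon": lon})

hot_zones_by_day = pd.DataFrame(rows)
print(hot_zones_by_day.groupby("day_of_week").size())
```

The same loop could group by hour as well, or swap KMeans for DBSCAN, to compare the two algorithms day by day.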