<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Uber_logo_2018.svg/1024px-Uber_logo_2018.svg.png" alt="UBER LOGO" width="50%" />

# UBER Pickups 

## Company's Description 📇

<a href="http://uber.com/" target="_blank">Uber</a> is one of the most famous startups in the world. It started as a ride-sharing application for people who couldn't afford a taxi. Uber has since expanded into food delivery with <a href="https://www.ubereats.com/fr-en" target="_blank">Uber Eats</a>, package delivery, freight transportation, and even urban transportation with <a href="https://www.uber.com/fr/en/ride/uber-bike/" target="_blank">Jump Bike</a> and <a href="https://www.li.me/" target="_blank">Lime</a>, which the company funded.


The company's goal is to revolutionize transportation across the globe. It now operates in about 70 countries and 900 cities and generates over $14 billion in revenue! 😮


## Project 🚧

One of the main pain points that Uber's team found is that sometimes drivers are not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in the Castro.

(If you are not familiar with the Bay Area, check out <a href="https://www.google.com/maps/place/San+Francisco,+CA,+USA/@37.7515389,-122.4567213,13.43z/data=!4m5!3m4!1s0x80859a6d00690021:0x4a501367f076adff!8m2!3d37.7749295!4d-122.4194155" target="_blank">Google Maps</a>.)

Even though the two neighborhoods are not that far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users are willing to wait 5 to 7 minutes; beyond that, they cancel their ride.

Therefore, Uber's data team would like to work on a project where **their app would recommend hot zones in major cities where drivers should be at any given time of day.**

## Goals 🎯

Uber already has data about pickups in major cities. Your objective is to create algorithms that determine where the hot zones are, so that drivers know where to be. To do so, you will:

* Create an algorithm to find hot zones 
* Visualize results on a nice dashboard 

## Scope of this project 🖼️

To start off, Uber wants to try this feature in New York City, so you will focus only on this city. The data can be found here:

👉👉<a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/Projects/uber-trip-data.zip" target="_blank"> Uber Trip Data</a> 👈👈

**You only need to focus on New York City for this project**

In [2]:
# install plotly
!pip install plotly

Collecting plotly
  Using cached plotly-5.3.1-py2.py3-none-any.whl (23.9 MB)
Collecting tenacity>=6.2.0
  Using cached tenacity-8.0.1-py3-none-any.whl (24 kB)
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.3.1 tenacity-8.0.1


In [2]:
# import Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
import plotly.express as px
from sklearn.metrics import silhouette_score

In [3]:
# Load the dataset
data = pd.read_csv("uber-raw-data-jul14.csv")
data.head()

| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| 0 | 7/1/2014 0:03:00 | 40.7586 | -73.9706 | B02512 |
| 1 | 7/1/2014 0:05:00 | 40.7605 | -73.9994 | B02512 |
| 2 | 7/1/2014 0:06:00 | 40.732 | -73.9999 | B02512 |
| 3 | 7/1/2014 0:09:00 | 40.7635 | -73.9793 | B02512 |
| 4 | 7/1/2014 0:20:00 | 40.7204 | -74.0047 | B02512 |


In [4]:
data.shape

(796121, 4)

In [5]:
data.describe(include="all")

| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| count | 796121 | 796121.0 | 796121.0 | 796121 |
| unique | 44286 | | | 5 |
| top | 7/15/2014 19:30:00 | | | B02617 |
| freq | 79 | | | 310160 |
| mean | | 40.739141 | -73.972353 | |
| std | | 0.040551 | 0.05866 | |
| min | | 39.7214 | -74.826 | |
| 25% | | 40.7209 | -73.9961 | |
| 50% | | 40.7425 | -73.9832 | |
| 75% | | 40.7608 | -73.9651 | |


In [6]:
# null values ?
data.isnull().sum()

Date/Time    0
Lat          0
Lon          0
Base         0
dtype: int64

In [7]:
data.dtypes

Date/Time     object
Lat          float64
Lon          float64
Base          object
dtype: object

In [8]:
data['Base'].value_counts()

B02617    310160
B02598    245597
B02682    196754
B02512     35021
B02764      8589
Name: Base, dtype: int64

## Helpers 🦮

To help you with this project, here are a few tips:

### Clustering is your friend 

Clustering techniques are a perfect fit for the job. Think about it: all the pickup locations can be grouped into clusters. You can then use **cluster coordinates to pin hot zones** 😉
    

### Create maps with `plotly` 

Check out the <a href="https://plotly.com/" target="_blank">Plotly</a> documentation: you can create maps and populate them easily. There are other libraries, of course, but this one should do the job well.


### Start small, grow big

Even though Uber wants hot zones per hour and per day of week, you should first **start small**. Pick one day at a given hour and **then start to generalize** your approach.
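
As a minimal sketch of what "one day at a given hour" could look like in pandas, the snippet below parses the `Date/Time` column and keeps a single time slot. The small `pickups` frame and the `day_of_week` and `hour` column names are assumptions made for illustration; the real data would come from the CSV file.

```python
import pandas as pd

# Tiny stand-in for the pickup data; column names match the CSV used in this notebook.
pickups = pd.DataFrame({
    "Date/Time": ["7/1/2014 0:03:00", "7/1/2014 17:45:00",
                  "7/4/2014 17:10:00", "7/8/2014 17:30:00"],
    "Lat": [40.7586, 40.7605, 40.7320, 40.7635],
    "Lon": [-73.9706, -73.9994, -73.9999, -73.9793],
})

# Parse the timestamps once, with an explicit format.
pickups["Date/Time"] = pd.to_datetime(pickups["Date/Time"], format="%m/%d/%Y %H:%M:%S")

# Derive the two time features the project cares about (hypothetical column names).
pickups["day_of_week"] = pickups["Date/Time"].dt.day_name()
pickups["hour"] = pickups["Date/Time"].dt.hour

# Start small: keep a single slot, here Tuesdays between 17:00 and 17:59.
one_slot = pickups[(pickups["day_of_week"] == "Tuesday") & (pickups["hour"] == 17)]
print(one_slot[["Date/Time", "Lat", "Lon"]])
```

Once clustering works on one slot, the same filter can be applied to every day and hour combination.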

## Clustering zones

### KMeans Model

In [4]:
# Create sample for testing hyperparameters
data_sample = data.sample(50000)

In [5]:
# Create X1 with only Lat & Lon
X1 = data_sample.iloc[:, 1:3]
print(X1.head())

            Lat      Lon
678196  40.7833 -73.9787
321926  40.7220 -73.9865
82483   40.7396 -73.9962
758811  40.7408 -73.9981
658828  40.7051 -73.9332


In [6]:
# Standardize X1 (zero mean, unit variance per column)
sc = StandardScaler()
X1_norm = sc.fit_transform(X1)


In [19]:
print(X1_norm[:5])

[[ 2.75570035  0.56337507]
 [-3.42504967  0.12194599]
 [-0.68955026 -0.570995  ]
 [-1.01485289 -0.38107784]
 [-0.52197011 -0.59152659]]


In [20]:
%%time
# Elbow method to find the "best" k

wcss = []
k = []
for i in range(3, 10):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X1_norm)
    wcss.append(kmeans.inertia_)
    k.append(i)

print(k)
print(wcss)

[3, 4, 5, 6, 7, 8, 9]
[296211.1652642944, 243454.2792337618, 206358.1714253708, 171665.3896743139, 145991.20932146473, 124653.52905476769, 109376.50117186522]
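
One hedged way to read these numbers without a plot is to look at how much the inertia falls each time a cluster is added. The sketch below reuses the values printed above; the relative-drop measure is an illustrative heuristic chosen here, not part of the original notebook.

```python
import numpy as np

# Values copied from the elbow-method output above.
k_values = np.array([3, 4, 5, 6, 7, 8, 9])
wcss = np.array([296211.1652642944, 243454.2792337618, 206358.1714253708,
                 171665.3896743139, 145991.20932146473, 124653.52905476769,
                 109376.50117186522])

# Relative drop in inertia when going from k to k + 1 clusters.
relative_drop = -np.diff(wcss) / wcss[:-1]

for k_from, drop in zip(k_values[:-1], relative_drop):
    print(f"k={k_from} -> k={k_from + 1}: inertia falls by {drop:.1%}")
```

A sharp fall followed by small ones would suggest an elbow; when the drops shrink only gradually, a second criterion such as the silhouette score helps to choose k.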


In [21]:
%%time
# Show the result

# Create figure
fig = px.line(x=k, y=wcss)

# Add title and axis labels
fig.update_layout(
    yaxis_title="Inertia",
    xaxis_title="# Clusters",
    title="Inertia per number of clusters"
)

# Render
fig.show(renderer="iframe")


In [None]:
%%time
# Silhouette method to refine our hypothesis for k

sil = []
k = []
for i in range(3, 10):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X1_norm)
    sil.append(silhouette_score(X1_norm, kmeans.labels_))
    k.append(i)

print(k)
print(sil)
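
The silhouette score relies on pairwise distances, so it can be slow on tens of thousands of points. As a sketch, `silhouette_score` accepts a `sample_size` argument that scores a random subset instead. The synthetic `points` array and the value `sample_size=300` below are assumptions chosen for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for standardized coordinates: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(500, 2)) for c in centers])

scores = {}
for n_clusters in range(2, 6):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    # sample_size scores a random subset rather than every pair of points.
    scores[n_clusters] = silhouette_score(
        points, model.labels_, sample_size=300, random_state=0
    )

# The k with the highest silhouette score is the candidate to keep.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```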


In [12]:
%%time

# Create figure
fig = px.bar(x=k, y=sil)

# Add title and axis labels
fig.update_layout(
    yaxis_title="Silhouette Score",
    xaxis_title="# Clusters",
    title="Silhouette score per number of clusters"
)

# Render
fig.show(renderer="iframe")


#### KMeans Model with k=7

In [9]:
# X with all observations of the dataframe on Lat & Lon
X = data.iloc[:, 1:3]
print(X.head())

       Lat      Lon
0  40.7586 -73.9706
1  40.7605 -73.9994
2  40.7320 -73.9999
3  40.7635 -73.9793
4  40.7204 -74.0047


In [10]:
# Standardize X (zero mean, unit variance per column)
sc = StandardScaler()
X_norm = sc.fit_transform(X)

In [11]:
# Create KMeans instance
kmeans = KMeans(n_clusters=7, random_state=0)

In [12]:
# Fit KMeans on the standardized coordinates
kmeans.fit(X_norm)

KMeans(n_clusters=7, random_state=0)

In [13]:
kmeans.cluster_centers_

array([[-0.27708846, -0.37227255],
       [ 0.59285174, -0.0600337 ],
       [-1.98226268,  3.305348  ],
       [-1.55964134,  0.08309462],
       [ 1.33861411,  1.29791652],
       [ 4.40870823,  5.6736707 ],
       [-1.04823365, -3.96298136]])
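
These centers are expressed in standardized units, so they cannot be pinned on a map directly. A minimal sketch of converting them back to degrees with the scaler's `inverse_transform` follows; it runs on synthetic coordinates around two made-up locations, and the names `hubs`, `coords` and `centers_deg` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic pickups around two made-up locations (latitude, longitude).
rng = np.random.default_rng(0)
hubs = np.array([[40.75, -73.98], [40.65, -73.78]])
coords = np.vstack([hub + rng.normal(scale=0.005, size=(300, 2)) for hub in hubs])

# Same pipeline as in the notebook: standardize, then cluster.
scaler = StandardScaler()
coords_norm = scaler.fit_transform(coords)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords_norm)

# Centers come back in standardized units; undo the scaling to get degrees.
centers_deg = scaler.inverse_transform(model.cluster_centers_)
print(centers_deg)
```

In the notebook itself, the equivalent call would presumably be `sc.inverse_transform(kmeans.cluster_centers_)`, since `sc` is the scaler fitted on `X`.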

In [14]:
kmeans.labels_

array([1, 1, 0, ..., 3, 0, 0], dtype=int32)

In [15]:
# Add new column on dataset with results of KMeans
data["Cluster_KMeans"] = kmeans.labels_
data.head()

| | Date/Time | Lat | Lon | Base | Cluster_KMeans |
|---|---|---|---|---|---|
| 0 | 7/1/2014 0:03:00 | 40.7586 | -73.9706 | B02512 | 1 |
| 1 | 7/1/2014 0:05:00 | 40.7605 | -73.9994 | B02512 | 1 |
| 2 | 7/1/2014 0:06:00 | 40.732 | -73.9999 | B02512 | 0 |
| 3 | 7/1/2014 0:09:00 | 40.7635 | -73.9793 | B02512 | 1 |
| 4 | 7/1/2014 0:20:00 | 40.7204 | -74.0047 | B02512 | 0 |


In [16]:
data["Cluster_KMeans"].value_counts()

0    329088
1    311131
3     71840
4     47418
2     25749
6      7579
5      3316
Name: Cluster_KMeans, dtype: int64

In [17]:
# Create a scatter mapbox with KMeans clusters
fig = px.scatter_mapbox(data, lat="Lat", lon="Lon", zoom=8, color="Cluster_KMeans", mapbox_style="carto-positron")
fig.show(renderer="iframe_connected")

### DBSCAN Model on the sample

In [7]:
db = DBSCAN(min_samples=5000)

In [None]:
db.fit(X1_norm)

In [1]:
# List the distinct cluster labels (-1 marks noise points)
np.unique(db.labels_)
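
To illustrate how DBSCAN output is usually read, here is a small sketch on synthetic points: two dense groups plus a few isolated points. The `eps` and `min_samples` values are illustrative assumptions and are not tuned for the Uber data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups and three isolated points (all synthetic).
rng = np.random.default_rng(0)
dense_a = rng.normal(loc=[0.0, 0.0], scale=0.01, size=(200, 2))
dense_b = rng.normal(loc=[1.0, 1.0], scale=0.01, size=(200, 2))
isolated = np.array([[5.0, 5.0], [-5.0, 5.0], [5.0, -5.0]])
points = np.vstack([dense_a, dense_b, isolated])

# eps and min_samples are illustrative values, not tuned for real pickups.
model = DBSCAN(eps=0.1, min_samples=10).fit(points)

# DBSCAN labels noise as -1; every other label is a cluster id.
labels, counts = np.unique(model.labels_, return_counts=True)
n_clusters = int((labels != -1).sum())
n_noise = int((model.labels_ == -1).sum())
print(dict(zip(labels.tolist(), counts.tolist())), n_clusters, n_noise)
```

Unlike KMeans, the number of clusters is not chosen in advance: it follows from `eps` and `min_samples`, so both usually need some trial and error.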

## Deliverable 📬

To complete this project, your team should: 

* Have a map with hot zones built with any Python library (`plotly` or anything else).
* You should **at least** describe hot zones per day of week.
* Compare results from **at least** two unsupervised algorithms, such as KMeans and DBSCAN.
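
One possible way to get hot zones per day of week is to fit a separate model on each day's pickups and collect the centers. The sketch below does this on synthetic data; the `day_of_week` column, the two made-up locations, and the choice of `n_clusters=2` on unscaled coordinates are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic pickups: one made-up busy spot per day.
rng = np.random.default_rng(0)
frames = []
for day, hub in [("Monday", (40.75, -73.98)), ("Saturday", (40.72, -73.95))]:
    frames.append(pd.DataFrame({
        "day_of_week": day,
        "Lat": hub[0] + rng.normal(scale=0.003, size=200),
        "Lon": hub[1] + rng.normal(scale=0.003, size=200),
    }))
pickups = pd.concat(frames, ignore_index=True)

# Fit one model per day and keep its centers as candidate hot zones.
rows = []
for day, group in pickups.groupby("day_of_week"):
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(group[["Lat", "Lon"]])
    for lat, lon in model.cluster_centers_:
        rows.append({"day_of_week": day, "Lat": lat, "Lon": lon})
hot_zones = pd.DataFrame(rows)
print(hot_zones)
```

The resulting `hot_zones` frame has one row per day and center, which makes it straightforward to plot one map per day or to animate over days.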

Your maps should look something like this: 

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Clusters_uber_pickups.png" alt="Uber Cluster Map" />