<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Uber_logo_2018.svg/1024px-Uber_logo_2018.svg.png" alt="UBER LOGO" width="50%" />

# UBER Pickups

## Company's Description 📇

<a href="http://uber.com/" target="_blank">Uber</a> is one of the most famous startups in the world. It started as a ride-sharing application for people who couldn't afford a taxi. Uber has since expanded into food delivery with <a href="https://www.ubereats.com/fr-en" target="_blank">Uber Eats</a>, package delivery, freight transportation and even urban transportation with <a href="https://www.uber.com/fr/en/ride/uber-bike/" target="_blank">Jump Bike</a> and <a href="https://www.li.me/" target="_blank">Lime</a>, which the company funded.


The company's goal is to revolutionize transportation across the globe. It now operates in about 70 countries and 900 cities and generates over $14 billion in revenue! 😮


## Project 🚧

One of the main pain points that Uber's team found is that drivers are sometimes not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in the Castro.

(If you are not familiar with the Bay Area, check out <a href="https://www.google.com/maps/place/San+Francisco,+CA,+USA/@37.7515389,-122.4567213,13.43z/data=!4m5!3m4!1s0x80859a6d00690021:0x4a501367f076adff!8m2!3d37.7749295!4d-122.4194155" target="_blank">Google Maps</a>)

Even though the two neighborhoods are not that far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users are willing to wait 5-7 minutes; beyond that, they cancel their ride.

Therefore, Uber's data team would like to work on a project where **their app would recommend hot-zones in major cities to be in at any given time of day.**  

## Goals 🎯

Uber already has data about pickups in major cities. Your objective is to create algorithms that determine where the hot-zones are, that is, where drivers should be. Therefore you will:

* Create an algorithm to find hot zones
* Visualize results on a nice dashboard

## Scope of this project 🖼️

To start off, Uber wants to try this feature in New York City. Therefore you will only focus on this city. Data can be found here:

👉👉<a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/Projects/uber-trip-data.zip" target="_blank"> Uber Trip Data</a> 👈👈

**You only need to focus on New York City for this project**

## Helpers 🦮

To help you complete this project, here are a few tips:

### Clustering is your friend

Clustering techniques are a perfect fit for the job. Think about it: all the pickup locations can be grouped into clusters. You can then use **cluster coordinates to pin hot zones** 😉
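
As a minimal sketch of that idea, the snippet below clusters synthetic pickup coordinates with KMeans and reads the cluster centers as candidate hot zones. The two centers and the noise level are made-up values for illustration, not Uber data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic pickups scattered around two made-up (lat, lon) centers (illustration only)
rng = np.random.default_rng(0)
true_centers = np.array([[40.75, -73.99], [40.65, -73.78]])
points = np.vstack([c + rng.normal(scale=0.005, size=(200, 2)) for c in true_centers])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Each row of cluster_centers_ is a (lat, lon) pair that can be pinned on a map
hot_zones = kmeans.cluster_centers_
print(hot_zones.round(3))
```

The number of clusters is an input here; on real data it has to be chosen, for example with an elbow plot or a silhouette score.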
    

### Create maps with `plotly`

Check out the <a href="https://plotly.com/" target="_blank">Plotly</a> documentation: you can create maps and populate them easily. There are other libraries, of course, but this one should do the job well.


### Start small, grow big

Even though Uber wants hot-zones per hour and per day of week, you should first **start small**. Pick one day at a given hour and **then generalize** your approach.
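
One possible way to generalize, sketched under assumptions: once a single (day, hour) slot works, wrap the clustering in a function and loop over every slot. The helper name `hot_zones_by_slot`, its parameters and the synthetic data are hypothetical; the column names `Lat`, `Lon`, `day_of_week` and `hour` follow the notebook below.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def hot_zones_by_slot(df, n_clusters=3, min_points=50):
    """Return one row per (day_of_week, hour, cluster) with the cluster center and size."""
    rows = []
    for (day, hour), group in df.groupby(["day_of_week", "hour"]):
        if len(group) < min_points:
            continue  # too few pickups to cluster meaningfully
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = km.fit_predict(group[["Lat", "Lon"]].to_numpy())
        sizes = np.bincount(labels, minlength=n_clusters)
        for k, (lat, lon) in enumerate(km.cluster_centers_):
            rows.append({"day_of_week": day, "hour": hour, "cluster": k,
                         "Lat": lat, "Lon": lon, "pickups": int(sizes[k])})
    return pd.DataFrame(rows)

# Synthetic pickups (illustration only, not Uber data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Lat": 40.74 + rng.normal(scale=0.03, size=600),
    "Lon": -73.98 + rng.normal(scale=0.04, size=600),
    "day_of_week": rng.choice(["Friday", "Saturday"], size=600),
    "hour": rng.choice([8, 18], size=600),
})
zones = hot_zones_by_slot(demo)
print(zones.head())
```

A fixed `n_clusters` for every slot is a simplification; quiet slots may deserve fewer zones than busy ones.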

## Deliverable 📬

To complete this project, your team should:

* Have a map with hot-zones using any Python library (`plotly` or anything else).
* You should **at least** describe hot-zones per day of week.
* Compare results with **at least** two unsupervised algorithms like KMeans and DBSCAN.

Your maps should look something like this:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Clusters_uber_pickups.png" alt="Uber Cluster Map" />
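
A hedged sketch of how the two algorithms could be compared on the same data: the number of clusters found, the number of points DBSCAN rejects as noise, and a silhouette score computed on the clustered points only. The blobs, `eps` and `min_samples` are illustrative assumptions, not tuned values for the Uber data.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Three synthetic blobs of (lat, lon) points (illustration only)
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=(40.75, -73.99), scale=0.004, size=(300, 2)),
    rng.normal(loc=(40.64, -73.78), scale=0.004, size=(300, 2)),
    rng.normal(loc=(40.77, -73.87), scale=0.004, size=(300, 2)),
])
X_demo = StandardScaler().fit_transform(blobs)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
db_labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X_demo)

def summarize(name, labels):
    """Print and return (number of clusters, number of noise points, silhouette score)."""
    kept = labels != -1  # DBSCAN marks noise with -1; KMeans never does
    n_clusters = len(set(labels[kept]))
    n_noise = int((~kept).sum())
    score = silhouette_score(X_demo[kept], labels[kept]) if n_clusters > 1 else float("nan")
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points, silhouette={score:.3f}")
    return n_clusters, n_noise, score

km_summary = summarize("KMeans", km_labels)
db_summary = summarize("DBSCAN", db_labels)
```

Dropping noise points before scoring favors DBSCAN slightly, so the noise count should be reported next to the score.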

In [1]:
%matplotlib inline


In [2]:
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import  silhouette_score

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import plotly.express as px
import plotly.io as pio

In [3]:
taxi=pd.read_csv('taxi-zone-lookup.csv')
taxi.head()

Unnamed: 0,LocationID,Borough,Zone
0,1,EWR,Newark Airport
1,2,Queens,Jamaica Bay
2,3,Bronx,Allerton/Pelham Gardens
3,4,Manhattan,Alphabet City
4,5,Staten Island,Arden Heights


In [4]:
# Basic stats
print("Number of rows : {}".format(taxi.shape[0]))
print()

print("Display of dataset: ")
display(taxi.head())
print()

print("Basic statistics: ")
taxi_desc = taxi.describe(include="all")
display(taxi_desc)
print()

print("Percentage of missing values: ")
display(100 * taxi.isnull().sum() / taxi.shape[0])

Number of rows : 265

Display of dataset: 


Unnamed: 0,LocationID,Borough,Zone
0,1,EWR,Newark Airport
1,2,Queens,Jamaica Bay
2,3,Bronx,Allerton/Pelham Gardens
3,4,Manhattan,Alphabet City
4,5,Staten Island,Arden Heights



Basic statistics: 


Unnamed: 0,LocationID,Borough,Zone
count,265.0,265,265
unique,,7,261
top,,Queens,Governor's Island/Ellis Island/Liberty Island
freq,,69,3
mean,133.0,,
std,76.643112,,
min,1.0,,
25%,67.0,,
50%,133.0,,
75%,199.0,,



Percentage of missing values: 


Unnamed: 0,0
LocationID,0.0
Borough,0.0
Zone,0.0


In [5]:
apr_14=pd.read_csv('/content/uber-raw-data-apr14.csv')
apr_14.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [6]:
# Basic stats
print("Number of rows : {}".format(apr_14.shape[0]))
print()

print("Display of dataset: ")
display(apr_14.head())
print()

print("Basic statistics: ")
apr_14_desc = apr_14.describe(include="all")
display(apr_14_desc)
print()

print("Percentage of missing values: ")
display(100 * apr_14.isnull().sum() / apr_14.shape[0])

Number of rows : 564516

Display of dataset: 


Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512



Basic statistics: 


Unnamed: 0,Date/Time,Lat,Lon,Base
count,564516,564516.0,564516.0,564516
unique,41999,,,5
top,4/7/2014 20:21:00,,,B02682
freq,97,,,227808
mean,,40.740005,-73.976817,
std,,0.036083,0.050426,
min,,40.0729,-74.7733,
25%,,40.7225,-73.9977,
50%,,40.7425,-73.9848,
75%,,40.7607,-73.97,



Percentage of missing values: 


Unnamed: 0,0
Date/Time,0.0
Lat,0.0
Lon,0.0
Base,0.0


### Date processing on apr_14

In [7]:
apr_14['Date/Time'] = pd.to_datetime(apr_14['Date/Time'])

In [8]:
apr_14['day_of_week'] = apr_14['Date/Time'].dt.day_name()
apr_14['hour'] = apr_14['Date/Time'].dt.hour
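
As a small sketch of the same step with an explicit format string: passing `format` to `pd.to_datetime` avoids format inference on several hundred thousand rows and makes the month/day order unambiguous. The format `%m/%d/%Y %H:%M:%S` is inferred from the sample rows shown above, and the second timestamp is a made-up value for illustration.

```python
import pandas as pd

# First value copied from the data preview; second value is made up for illustration
sample = pd.DataFrame({"Date/Time": ["4/1/2014 0:11:00", "4/30/2014 23:22:00"]})

# Month/day/year order stated explicitly instead of being inferred
sample["Date/Time"] = pd.to_datetime(sample["Date/Time"], format="%m/%d/%Y %H:%M:%S")
sample["day_of_week"] = sample["Date/Time"].dt.day_name()
sample["hour"] = sample["Date/Time"].dt.hour
print(sample)
```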

In [9]:
apr_14.head()

Unnamed: 0,Date/Time,Lat,Lon,Base,day_of_week,hour
0,2014-04-01 00:11:00,40.769,-73.9549,B02512,Tuesday,0
1,2014-04-01 00:17:00,40.7267,-74.0345,B02512,Tuesday,0
2,2014-04-01 00:21:00,40.7316,-73.9873,B02512,Tuesday,0
3,2014-04-01 00:28:00,40.7588,-73.9776,B02512,Tuesday,0
4,2014-04-01 00:33:00,40.7594,-73.9722,B02512,Tuesday,0


In [10]:
apr_14 = apr_14.drop(['Date/Time'], axis=1)

In [11]:
apr_14.head()

Unnamed: 0,Lat,Lon,Base,day_of_week,hour
0,40.769,-73.9549,B02512,Tuesday,0
1,40.7267,-74.0345,B02512,Tuesday,0
2,40.7316,-73.9873,B02512,Tuesday,0
3,40.7588,-73.9776,B02512,Tuesday,0
4,40.7594,-73.9722,B02512,Tuesday,0


In [12]:
# Basic stats
print("Number of rows : {}".format(apr_14.shape[0]))
print()

print("Display of dataset: ")
display(apr_14.head())
print()

print("Basic statistics: ")
apr_14_desc = apr_14.describe(include="all")
display(apr_14_desc)
print()

print("Percentage of missing values: ")
display(100 * apr_14.isnull().sum() / apr_14.shape[0])

Number of rows : 564516

Display of dataset: 


Unnamed: 0,Lat,Lon,Base,day_of_week,hour
0,40.769,-73.9549,B02512,Tuesday,0
1,40.7267,-74.0345,B02512,Tuesday,0
2,40.7316,-73.9873,B02512,Tuesday,0
3,40.7588,-73.9776,B02512,Tuesday,0
4,40.7594,-73.9722,B02512,Tuesday,0



Basic statistics: 


Unnamed: 0,Lat,Lon,Base,day_of_week,hour
count,564516.0,564516.0,564516,564516,564516.0
unique,,,5,7,
top,,,B02682,Wednesday,
freq,,,227808,108631,
mean,40.740005,-73.976817,,,14.465043
std,0.036083,0.050426,,,5.873925
min,40.0729,-74.7733,,,0.0
25%,40.7225,-73.9977,,,10.0
50%,40.7425,-73.9848,,,16.0
75%,40.7607,-73.97,,,19.0



Percentage of missing values: 


Unnamed: 0,0
Lat,0.0
Lon,0.0
Base,0.0
day_of_week,0.0
hour,0.0


### Preprocessing attempt

In [13]:
apr_14=apr_14.drop(columns=['Base'])
apr_14

Unnamed: 0,Lat,Lon,day_of_week,hour
0,40.7690,-73.9549,Tuesday,0
1,40.7267,-74.0345,Tuesday,0
2,40.7316,-73.9873,Tuesday,0
3,40.7588,-73.9776,Tuesday,0
4,40.7594,-73.9722,Tuesday,0
...,...,...,...,...
564511,40.7640,-73.9744,Wednesday,23
564512,40.7629,-73.9672,Wednesday,23
564513,40.7443,-73.9889,Wednesday,23
564514,40.6756,-73.9405,Wednesday,23


### Finding the most frequent day

In [80]:
apr_14['day_of_week'].value_counts()

day_of_week,count
Wednesday,108631
Tuesday,91185
Friday,90303
Thursday,85067
Saturday,77218
Monday,60861
Sunday,51251
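
The same counting idea extends to (day, hour) slots, which is one way to pick the slot to start with. The sketch below uses a tiny made-up frame with the notebook's column names; on the real data the same two lines would run on `apr_14`.

```python
import pandas as pd

# Tiny made-up sample (illustration only)
demo = pd.DataFrame({
    "day_of_week": ["Friday", "Friday", "Friday", "Monday", "Monday"],
    "hour": [18, 18, 17, 8, 18],
})

# Number of pickups in each (day_of_week, hour) slot
counts = demo.groupby(["day_of_week", "hour"]).size()
busiest_day, busiest_hour = counts.idxmax()

# Wide table, one row per day and one column per hour
print(counts.unstack(fill_value=0))
print(busiest_day, busiest_hour)
```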


### Test on Friday

In [14]:
apr_14_friday = apr_14[apr_14['day_of_week'] == 'Friday']
apr_14_friday

Unnamed: 0,Lat,Lon,day_of_week,hour
3829,40.7528,-73.9858,Friday,0
3830,40.7263,-74.0018,Friday,0
3831,40.7263,-73.9917,Friday,0
3832,40.7813,-73.9516,Friday,0
3833,40.7170,-73.9987,Friday,0
...,...,...,...,...
562589,40.7113,-73.9474,Friday,23
562590,40.7028,-73.9294,Friday,23
562591,40.7238,-73.9880,Friday,23
562592,40.7675,-73.9806,Friday,23


In [15]:
# Basic stats
print("Number of rows : {}".format(apr_14_friday.shape[0]))
print()

print("Display of dataset: ")
display(apr_14_friday.head())
print()

print("Basic statistics: ")
apr_14_friday_desc = apr_14_friday.describe(include="all")
display(apr_14_friday_desc)
print()

print("Percentage of missing values: ")
display(100 * apr_14_friday.isnull().sum() / apr_14_friday.shape[0])

Number of rows : 90303

Display of dataset: 


Unnamed: 0,Lat,Lon,day_of_week,hour
3829,40.7528,-73.9858,Friday,0
3830,40.7263,-74.0018,Friday,0
3831,40.7263,-73.9917,Friday,0
3832,40.7813,-73.9516,Friday,0
3833,40.717,-73.9987,Friday,0



Basic statistics: 


Unnamed: 0,Lat,Lon,day_of_week,hour
count,90303.0,90303.0,90303,90303.0
unique,,,1,
top,,,Friday,
freq,,,90303,
mean,40.740356,-73.978503,,15.067019
std,0.034619,0.04689,,5.803299
min,40.2825,-74.6563,,0.0
25%,40.7234,-73.998,,11.0
50%,40.7421,-73.9856,,16.0
75%,40.7603,-73.9712,,20.0



Percentage of missing values: 


Unnamed: 0,0
Lat,0.0
Lon,0.0
day_of_week,0.0
hour,0.0


In [79]:
valeur_la_plus_frequente = apr_14_friday['hour'].mode()[0]
print(f"The most frequent hour is {valeur_la_plus_frequente}:00")


The most frequent hour is 18:00


In [17]:
apr_14_friday_18 = apr_14_friday[apr_14_friday['hour'] == 18]
apr_14_friday_18

Unnamed: 0,Lat,Lon,day_of_week,hour
4940,40.7182,-74.0029,Friday,18
4941,40.7649,-73.9766,Friday,18
4942,40.7326,-74.0081,Friday,18
4943,40.6859,-73.9725,Friday,18
4944,40.7367,-74.0100,Friday,18
...,...,...,...,...
562448,40.7626,-73.9768,Friday,18
562449,40.6903,-74.1784,Friday,18
562450,40.7527,-73.9966,Friday,18
562451,40.7268,-73.9780,Friday,18


In [18]:
apr_14_friday_18.describe(include='all')

Unnamed: 0,Lat,Lon,day_of_week,hour
count,7258.0,7258.0,7258,7258.0
unique,,,1,
top,,,Friday,
freq,,,7258,
mean,40.742365,-73.980539,,18.0
std,0.029237,0.039724,,0.0
min,40.5765,-74.4162,,18.0
25%,40.725,-73.9981,,18.0
50%,40.7458,-73.9858,,18.0
75%,40.7605,-73.9722,,18.0


In [19]:
# apr_14_friday = apr_14_friday.sample(n = 20000, random_state = 42)
# apr_14_friday.head()

In [21]:
# Pipeline for the numeric features
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
numeric_features = [0,1,3] # Positions of the numeric columns (Lat, Lon, hour)
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()) # standardize the features
])

# Pipeline for the categorical features
categorical_features = [2] # Position of the categorical column (day_of_week)
categorical_transformer = Pipeline(
    steps=[
    ('encoder', OneHotEncoder(drop='first')) # one-hot encode the categories as 0/1 columns
    ])

# Combine both pipelines in a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Preprocess the dataset
print("Preprocessing the dataset...")
print(apr_14_friday_18.head())
X = preprocessor.fit_transform(apr_14_friday_18) # fit_transform: fit the scaler and encoder, then transform
print('...Done.')
print(X[0:5, :])
print()


Preprocessing the dataset...
          Lat      Lon day_of_week  hour
4940  40.7182 -74.0029      Friday    18
4941  40.7649 -73.9766      Friday    18
4942  40.7326 -74.0081      Friday    18
4943  40.6859 -73.9725      Friday    18
4944  40.7367 -74.0100      Friday    18
...Done.
[[-0.82659247 -0.56293581  0.        ]
 [ 0.77082518  0.09917222  0.        ]
 [-0.33402685 -0.6938469   0.        ]
 [-1.9314445   0.20239059  0.        ]
 [-0.19378248 -0.7416798   0.        ]]
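
The third column of zeros and the missing `day_of_week` columns follow from filtering to a single day and hour. A small sketch on toy arrays shows both effects: a constant column standardizes to zero, and `OneHotEncoder(drop='first')` on a single category leaves no column at all. The arrays are made up; only the two scikit-learn transformers come from the cell above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy columns mimicking a slice filtered to Friday at 18:00 (illustration only)
hour = np.full((5, 1), 18.0)
day = np.array([["Friday"]] * 5)

# Zero variance: the scaler only subtracts the mean, so every value becomes 0
scaled_hour = StandardScaler().fit_transform(hour)

# One category with drop='first': the only category is dropped, leaving 0 columns
encoded_day = OneHotEncoder(drop="first").fit_transform(day)

print(scaled_hour.ravel())
print(encoded_day.shape)
```

For a single slot the clustering therefore runs on standardized `Lat` and `Lon` only; the other two features start to matter once several days or hours are clustered together.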



In [22]:
# Elbow method: within-cluster sum of squares (inertia) for k = 2 to 24
wcss =  []
for i in range (2,25):
    kmeans = KMeans(n_clusters= i, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

print(wcss)

[10457.696737224625, 6358.034831804139, 5353.056331646358, 4966.841921533823, 4022.068705207002, 3257.2597363319665, 2722.603795142155, 2330.868884363785, 1988.199515481465, 1921.3873881610784, 1647.632320769708, 1525.1955176671847, 1453.7659756219446, 1357.0224529201341, 1243.2839444359126, 1169.4897256212223, 1115.3669701801361, 1062.9661680979027, 1041.5960260267548, 920.1635079037842, 872.0995857498241, 831.9768398207848, 799.6652602809824]


In [23]:
pio.renderers.default = "colab"
fig = px.line(x = range(2,25), y = wcss)
fig.show()

In [None]:
# # Use silhouette_score to find the optimal number of clusters
# s_score = []
# for i in range (2,25):
#     kmeans = KMeans(n_clusters= i)
#     kmeans.fit(X)
#     s_score.append(silhouette_score(X, kmeans.predict(X)))

# print(s_score)

In [None]:
# # Use silhouette_score to find the optimal number of clusters
# s_score = []
# k = range(2, 25)  # Define k here
# for i in k:  # Iterate through k directly
#     kmeans = KMeans(n_clusters=i)
#     kmeans.fit(X)
#     s_score.append(silhouette_score(X, kmeans.predict(X)))

# print(s_score)


# # Create a data frame
# cluster_scores = pd.DataFrame(s_score, columns=['Silhouette Score']) # Use s_score and provide a column name
# #k_frame = pd.Series(k)  # k is already defined

# # Create figure
# fig = px.bar(data_frame=cluster_scores,
#              x=k,
#              y='Silhouette Score'  # Refer to the column name
#             )

# # Add title and axis labels
# fig.update_layout(
#     yaxis_title="Silhouette Score",
#     xaxis_title="# Clusters",
#     title="Silhouette Score per cluster"
# )

# # Render
# #fig.show(renderer="notebook")
# fig.show() # if using workspace
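
The silhouette cells above are commented out. If runtime is the concern (an assumption, the notebook does not say), `silhouette_score` accepts a `sample_size` argument that scores a random subset instead of every pair of points. A minimal sketch on synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs (illustration only)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=c, scale=0.2, size=(500, 2))
    for c in [(-2, -2), (0, 2), (2, -1)]
])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    # Score a random subset of 500 points instead of all 1500
    scores[k] = silhouette_score(X_demo, labels, sample_size=500, random_state=0)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The sampled score is an estimate, so fixing `random_state` keeps runs comparable across values of `k`.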

In [24]:
kmeans = KMeans(n_clusters=10, random_state=0)

# Fit kmeans to our dataset
kmeans.fit(X)

In [26]:
# assign() returns a new DataFrame, so pandas does not warn about writing to a slice
apr_14_friday_18 = apr_14_friday_18.assign(Cluster_KMeans=kmeans.predict(X))
apr_14_friday_18.head()

Unnamed: 0,Lat,Lon,day_of_week,hour,Cluster_KMeans
4940,40.7182,-74.0029,Friday,18,4
4941,40.7649,-73.9766,Friday,18,9
4942,40.7326,-74.0081,Friday,18,4
4943,40.6859,-73.9725,Friday,18,7
4944,40.7367,-74.01,Friday,18,1


In [27]:
# Two-dimensional view (longitude on x, latitude on y, so the plot is oriented like a map)
fig = px.scatter(apr_14_friday_18, x = 'Lon', y = 'Lat', color = "Cluster_KMeans")
fig.show()


In [28]:
fig = px.scatter_mapbox(
        apr_14_friday_18, # KMeans assigns every point to a cluster, so there is no noise label to filter out
        lat="Lat",
        lon="Lon",
        height = 1000,
        color="Cluster_KMeans",
        mapbox_style="carto-positron"
)

fig.show()
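
Because KMeans was fitted on standardized features, its `cluster_centers_` are in scaled units rather than degrees. A hedged sketch of mapping them back with `inverse_transform`, using a standalone `StandardScaler` on synthetic coordinates; in the notebook the scaler sits inside `preprocessor` and also covers `hour`, so the exact access path would differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic (lat, lon) pickups around two made-up centers (illustration only)
rng = np.random.default_rng(1)
true_centers = np.array([[40.75, -73.99], [40.65, -73.78]])
coords = np.vstack([c + rng.normal(scale=0.005, size=(200, 2)) for c in true_centers])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(coords)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Centers are in standardized units; undo the scaling to get degrees back
centers_latlon = scaler.inverse_transform(km.cluster_centers_)
print(centers_latlon.round(4))
```

The resulting rows can be plotted as a second layer of markers on top of the pickup points.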

In [74]:
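import numpy as np
from sklearn.cluster import DBSCAN

# eps is expressed in standardized units because X was scaled
db = DBSCAN(eps=0.15, min_samples=200, metric="manhattan")

db.fit(X)

# Label -1 marks noise points; the other labels are clusters
np.unique(db.labels_)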

array([-1,  0,  1,  2,  3])
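
An alternative worth trying, sketched here as an assumption rather than as the approach used above: run DBSCAN on raw coordinates with the haversine metric, so that `eps` becomes a physical radius. scikit-learn's haversine metric expects `[lat, lon]` in radians and returns distances on the unit sphere, hence the division by the Earth's radius. The data, the 300 m radius and `min_samples=20` are illustrative values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088

# One dense synthetic hot spot plus sparse background pickups (illustration only)
rng = np.random.default_rng(0)
dense = rng.normal(loc=(40.75, -73.99), scale=0.002, size=(300, 2))
scattered = rng.uniform(low=(40.55, -74.25), high=(40.90, -73.70), size=(30, 2))
coords_deg = np.vstack([dense, scattered])

eps_km = 0.3  # neighborhood radius of 300 m
db_geo = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # convert km to radians on the unit sphere
    min_samples=20,
    metric="haversine",
    algorithm="ball_tree",
)
geo_labels = db_geo.fit_predict(np.radians(coords_deg))
print(np.unique(geo_labels, return_counts=True))
```

With this setup `eps` can be reasoned about directly in terms of how far a driver would travel, which is harder with standardized units.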

In [75]:
# assign() returns a new DataFrame, so pandas does not warn about writing to a slice
apr_14_friday_18 = apr_14_friday_18.assign(cluster=db.labels_)
apr_14_friday_18.head()

Unnamed: 0,Lat,Lon,day_of_week,hour,Cluster_KMeans,cluster
4940,40.7182,-74.0029,Friday,18,4,1
4941,40.7649,-73.9766,Friday,18,9,0
4942,40.7326,-74.0081,Friday,18,4,-1
4943,40.6859,-73.9725,Friday,18,7,-1
4944,40.7367,-74.01,Friday,18,1,-1


In [76]:
fig = px.scatter_mapbox(
        apr_14_friday_18[apr_14_friday_18.cluster != -1], # Mask to drop outliers (DBSCAN noise, label -1)
        lat="Lat",
        lon="Lon",
        color="cluster",
        height = 1000,
        mapbox_style="carto-positron"
)

fig.show()
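
To turn labelled points into a short list of hot zones, one option is to average the coordinates per cluster and count the pickups, ignoring DBSCAN noise. The frame below is a tiny made-up stand-in for `apr_14_friday_18` with its `cluster` column; the name `hot_zones` is hypothetical.

```python
import pandas as pd

# Tiny made-up stand-in for the labelled pickups (illustration only)
labelled = pd.DataFrame({
    "Lat": [40.75, 40.76, 40.64, 40.65, 40.70],
    "Lon": [-73.99, -73.98, -73.78, -73.79, -74.20],
    "cluster": [0, 0, 1, 1, -1],
})

hot_zones = (
    labelled[labelled["cluster"] != -1]  # drop DBSCAN noise
    .groupby("cluster")
    .agg(Lat=("Lat", "mean"), Lon=("Lon", "mean"), pickups=("Lat", "size"))
    .sort_values("pickups", ascending=False)
    .reset_index()
)
print(hot_zones)
```

The mean is only a rough center for DBSCAN clusters, which can be elongated or curved; it is still a convenient single point to pin on the map.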
