# UBER Pickups 

## Company's Description 📇

Uber is one of the most famous startups in the world. It started as a ride-sharing application for people who couldn't afford a taxi. Uber has since expanded its activities to food delivery with Uber Eats, package delivery, freight transportation, and even urban transportation with Jump Bike and Lime, which the company funded.

## Project 🚧

One of the main pain points that Uber's team found is that drivers are sometimes not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in Castro.

Even though the two neighborhoods are not far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users are willing to wait 5 to 7 minutes; beyond that, they cancel their ride.

Therefore, Uber's data team would like to work on a project where **their app would recommend hot-zones in major cities to be in at any given time of day.**  

## Goals 🎯

Uber already has data about pickups in major cities. Your objective is to create algorithms that determine the hot zones where drivers should be. Therefore you will:

* Create an algorithm to find hot zones 
* Visualize results on a nice dashboard 

## EDA

In [23]:
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN, MiniBatchKMeans
from sklearn.metrics import silhouette_score


In [6]:
data = pd.read_csv("uber-trip-data/uber-raw-data-apr14.csv")
data.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [3]:
data.describe(include="all")

Unnamed: 0,Date/Time,Lat,Lon,Base
count,564516,564516.0,564516.0,564516
unique,41999,,,5
top,4/7/2014 20:21:00,,,B02682
freq,97,,,227808
mean,,40.740005,-73.976817,
std,,0.036083,0.050426,
min,,40.0729,-74.7733,
25%,,40.7225,-73.9977,
50%,,40.7425,-73.9848,
75%,,40.7607,-73.97,


## Preprocessing
There are no missing values, so no imputation is needed.

We only need to create new columns from the date and drop the `Base` column, which we do not use.

The values also need to be standardized.
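One quick way to confirm that nothing is missing is to count nulls per column. The sketch below uses the five rows displayed by `data.head()` above as a stand-in for the full file; on the real data the equivalent call would be `data.isna().sum()`. The names `sample` and `missing_per_column` are illustrative.

```python
import pandas as pd

# Stand-in sample: the five rows displayed by data.head() above
sample = pd.DataFrame({
    "Date/Time": [
        "4/1/2014 0:11:00",
        "4/1/2014 0:17:00",
        "4/1/2014 0:21:00",
        "4/1/2014 0:28:00",
        "4/1/2014 0:33:00",
    ],
    "Lat": [40.769, 40.7267, 40.7316, 40.7588, 40.7594],
    "Lon": [-73.9549, -74.0345, -73.9873, -73.9776, -73.9722],
    "Base": ["B02512"] * 5,
})

# Number of missing values in each column; all zeros means no imputation is needed
missing_per_column = sample.isna().sum()
print(missing_per_column)
```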

In [7]:
data = data.drop("Base",axis=1)
data["Date/Time"] = pd.to_datetime(data['Date/Time'], format="%m/%d/%Y %H:%M:%S")
data["Hour"] = data["Date/Time"].dt.hour
data["DayOfWeek"] =  data["Date/Time"].dt.strftime("%A")
data.head()

Unnamed: 0,Date/Time,Lat,Lon,Hour,DayOfWeek
0,2014-04-01 00:11:00,40.769,-73.9549,0,Tuesday
1,2014-04-01 00:17:00,40.7267,-74.0345,0,Tuesday
2,2014-04-01 00:21:00,40.7316,-73.9873,0,Tuesday
3,2014-04-01 00:28:00,40.7588,-73.9776,0,Tuesday
4,2014-04-01 00:33:00,40.7594,-73.9722,0,Tuesday


In [12]:
# Create pipeline for numeric features
numeric_features = ['Lat', 'Lon', 'Hour'] 
numeric_transformer = Pipeline(steps=[
  ('scaler', StandardScaler())
])

# Create pipeline for categorical features
categorical_features = ['DayOfWeek']
categorical_transformer = Pipeline(steps=[
  ('encoder', OneHotEncoder(drop='first'))
])

# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

X = preprocessor.fit_transform(data)

print(X[0:5, :])

[[ 0.8035544   0.43463557 -2.46258789  0.          0.          0.
   0.          1.          0.        ]
 [-0.36873718 -1.14392195 -2.46258789  0.          0.          0.
   0.          1.          0.        ]
 [-0.23293981 -0.20789287 -2.46258789  0.          0.          0.
   0.          1.          0.        ]
 [ 0.52087416 -0.01553096 -2.46258789  0.          0.          0.
   0.          1.          0.        ]
 [ 0.53750241  0.09155711 -2.46258789  0.          0.          0.
   0.          1.          0.        ]]


## Train models

We will test two models: K-means and DBSCAN.

### K-means

We start by looking for the best number of clusters using the elbow method and the silhouette score.
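For reference, the two quantities tracked in the loop below can be written as follows. The notation ($C_k$ for the set of points in cluster $k$, $\mu_k$ for its centroid, $n$ for the number of points) is introduced here for illustration and does not come from the original notebook.

$$\mathrm{WCSS}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \text{silhouette score} = \frac{1}{n} \sum_{i=1}^{n} s(i)$$

Here $a(i)$ is the mean distance from point $i$ to the other points of its own cluster, and $b(i)$ is the mean distance from point $i$ to the points of the nearest other cluster. The elbow method looks for the $K$ after which WCSS stops decreasing quickly. The silhouette score lies in $[-1, 1]$, with higher values indicating better-separated clusters; it is undefined for $K = 1$, which is why the loop stores a placeholder value of 0 for that case.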

In [25]:
# Elbow & Silhouette Graph
wcss =  []
sil = [0]
k = []
for i in range (1,16): 
    kmeans = MiniBatchKMeans(n_clusters= i, random_state = 0, n_init='auto')
    print("Start fitting nb clusters : {}".format(i))
    predictions = kmeans.fit_predict(X)
    wcss.append(kmeans.inertia_)
    if i > 1:
        print("Start mesuring silhouette nb clusters : {}".format(i))
        sil.append(silhouette_score(X, predictions, n_jobs=-1, sample_size=100000, random_state=0))  # Computed on a random sample of 100,000 rows (about 18% of the data)
    k.append(i)
    print("WCSS for K={} --> {}".format(i, wcss[-1]))
    print("Silhouette score for K={} is {}".format(i, sil[-1]))


Start fitting nb clusters : 1



MiniBatchKMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can prevent it by setting batch_size >= 5120 or by setting the environment variable OMP_NUM_THREADS=4



WCSS for K=1 --> 2097536.7644196097
Silhouette score for K=1 is 0
Start fitting nb clusters : 2
Start mesuring silhouette nb clusters : 2
WCSS for K=2 --> 1686182.8700218673
Silhouette score for K=2 is 0.2554364774919503
Start fitting nb clusters : 3
Start mesuring silhouette nb clusters : 3
WCSS for K=3 --> 1373057.5902631548
Silhouette score for K=3 is 0.27590190498527695
Start fitting nb clusters : 4
Start mesuring silhouette nb clusters : 4
WCSS for K=4 --> 1216374.4650407946
Silhouette score for K=4 is 0.21234533067606967
Start fitting nb clusters : 5
Start mesuring silhouette nb clusters : 5
WCSS for K=5 --> 1119484.4643467409
Silhouette score for K=5 is 0.1850701998050938
Start fitting nb clusters : 6
Start mesuring silhouette nb clusters : 6
WCSS for K=6 --> 1039363.7216393219
Silhouette score for K=6 is 0.16552873293796672
Start fitting nb clusters : 7
Start mesuring silhouette nb clusters : 7
WCSS for K=7 --> 988833.197360777
Silhouette score for K=7 is 0.14519571842391543
Start fitting nb clusters : 8
Start mesuring silhouette nb clusters : 8
WCSS for K=8 --> 968348.4169798974
Silhouette score for K=8 is 0.1895353810516464
Start fitting nb clusters : 9
Start mesuring silhouette nb clusters : 9
WCSS for K=9 --> 925001.817974631
Silhouette score for K=9 is 0.21436747216337337
Start fitting nb clusters : 10
Start mesuring silhouette nb clusters : 10
WCSS for K=10 --> 852674.7019681594
Silhouette score for K=10 is 0.2164866262558394
Start fitting nb clusters : 11
Start mesuring silhouette nb clusters : 11
WCSS for K=11 --> 774685.4667952444
Silhouette score for K=11 is 0.2246043861056426
Start fitting nb clusters : 12
Start mesuring silhouette nb clusters : 12
WCSS for K=12 --> 741778.4126237568
Silhouette score for K=12 is 0.22476027077193433
Start fitting nb clusters : 13
Start mesuring silhouette nb clusters : 13
WCSS for K=13 --> 715831.2464747402
Silhouette score for K=13 is 0.22982416863369878
Start fitting nb clusters : 14
Start mesuring silhouette nb clusters : 14
WCSS for K=14 --> 704676.5709208992
Silhouette score for K=14 is 0.22805951116619003
Start fitting nb clusters : 15
Start mesuring silhouette nb clusters : 15
WCSS for K=15 --> 674577.3385819043
Silhouette score for K=15 is 0.23920837757424776


In [26]:
wcss_frame = pd.DataFrame(wcss)
sil_frame = pd.DataFrame(sil)
k_frame = pd.Series(k)

fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Bar(x=k_frame, y=sil_frame.iloc[:,-1], name="Silhouette"),
    secondary_y=False,
)

fig.add_trace(
    # go.Line is deprecated; a Scatter trace in "lines" mode draws the same curve
    go.Scatter(x=k_frame, y=wcss_frame.iloc[:,-1], mode="lines", name="Inertia"),
    secondary_y=True,
)

fig.update_layout(
    title_text="K-means"
)

# Set x-axis title
fig.update_xaxes(title_text="Nb clusters")

# Set y-axes titles
fig.update_yaxes(title_text="Silhouette", secondary_y=False)
fig.update_yaxes(title_text="Inertia", secondary_y=True)

fig.show()




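Based on these curves, we take 12 as the best number of clusters: the decrease in inertia slows from around K=12, and the silhouette score plateaus at about 0.22 between K=11 and K=14. Note that the silhouette score on its own peaks at K=3 (about 0.28).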