<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Uber_logo_2018.svg/1024px-Uber_logo_2018.svg.png" alt="UBER LOGO" width="50%" />

# UBER Pickups 

## Company's Description 📇

<a href="http://uber.com/" target="_blank">Uber</a> is one of the most famous startups in the world. It started as a ride-sharing application for people who couldn't afford a taxi. Uber has since expanded into food delivery with <a href="https://www.ubereats.com/fr-en" target="_blank">Uber Eats</a>, package delivery, freight transportation, and even urban transportation with <a href="https://www.uber.com/fr/en/ride/uber-bike/" target="_blank"> Jump Bike</a> and <a href="https://www.li.me/" target="_blank"> Lime </a>, which the company funded. 


The company's goal is to revolutionize transportation across the globe. It now operates in about 70 countries and 900 cities and generates over $14 billion in revenue! 😮


## Project 🚧

One of the main pain points that Uber's team found is that drivers are sometimes not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in the Castro.  

(If you are not familiar with the Bay Area, check out <a href="https://www.google.com/maps/place/San+Francisco,+CA,+USA/@37.7515389,-122.4567213,13.43z/data=!4m5!3m4!1s0x80859a6d00690021:0x4a501367f076adff!8m2!3d37.7749295!4d-122.4194155" target="_blank">Google Maps</a>.)

Even though the two neighborhoods are not that far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users are willing to wait 5 to 7 minutes; beyond that, they cancel their ride. 

Therefore, Uber's data team would like to work on a project where **their app would recommend hot-zones in major cities to be in at any given time of day.**  

## Goals 🎯

Uber already has data about pickups in major cities. Your objective is to create algorithms that determine the hot zones where drivers should be. To do so, you will:

* Create an algorithm to find hot zones 
* Visualize results on a nice dashboard 

## Scope of this project 🖼️

To start off, Uber wants to try this feature in New York City, so you will focus on this city only. Data can be found here: 

👉👉<a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/Projects/uber-trip-data.zip" target="_blank"> Uber Trip Data</a> 👈👈

**You only need to focus on New York City for this project**

## Helpers 🦮

Here are a few tips to help you complete this project: 

### Clustering is your friend 

Clustering techniques are a perfect fit for the job. Think about it: all the pickup locations can be gathered into different clusters. You can then use **cluster coordinates to pin hot zones** 😉
    

### Create maps with `plotly` 

Check out the <a href="https://plotly.com/" target="_blank">Plotly</a> documentation: you can create maps and populate them easily. There are other libraries, of course, but this one should do the job well. 


### Start small, grow big 

Even though Uber wants hot-zones per hour and per day of week, you should first **start small**. Pick one day at a given hour and **then start to generalize** your approach. 
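
As a rough sketch of the "start small" idea, the snippet below filters a pickup table down to a single weekday and hour before any clustering. The tiny DataFrame and the `DayOfWeek` and `Hour` column names are assumptions made for illustration (0 = Monday, as in pandas).

```python
import pandas as pd

# Hypothetical miniature pickup table; real data would come from the CSV files.
pickups = pd.DataFrame({
    "Date/Time": pd.to_datetime([
        "2014-04-03 17:05:00",  # Thursday
        "2014-04-03 17:40:00",  # Thursday
        "2014-04-03 09:15:00",  # Thursday
        "2014-04-05 17:20:00",  # Saturday
    ]),
    "Lat": [40.75, 40.74, 40.72, 40.76],
    "Lon": [-73.98, -73.99, -74.00, -73.97],
})
pickups["DayOfWeek"] = pickups["Date/Time"].dt.dayofweek
pickups["Hour"] = pickups["Date/Time"].dt.hour

# Start small: keep only Thursdays (3) at 5 p.m. (17)
subset = pickups[(pickups["DayOfWeek"] == 3) & (pickups["Hour"] == 17)]
print(len(subset))
```

Once the approach works on one such slice, the same filter can be looped over other days and hours.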

## Deliverable 📬

To complete this project, your team should: 

* Have a map with hot-zones using any Python library (`plotly` or anything else). 
* Describe hot-zones **at least** per day of week. 
* Compare results from **at least** two unsupervised algorithms, such as KMeans and DBSCAN. 

Your maps should look something like this: 

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Clusters_uber_pickups.png" alt="Uber Cluster Map" />

## Descriptive analysis and data preparation

In [51]:
import pandas as pd
import numpy as np

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans, OPTICS
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import silhouette_score

import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# setting Jedha color palette as default
pio.templates["jedha"] = go.layout.Template(
    layout_colorway=["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]
)
pio.templates.default = "jedha"
#pio.renderers.default = "svg" # to be replaced by "iframe" if working on JULIE

In [2]:
# Load the Uber data (one CSV file per month, April to September 2014)
df_avr = pd.read_csv('uber-trip-data/uber-raw-data-apr14.csv')
df_mai = pd.read_csv('uber-trip-data/uber-raw-data-may14.csv') 
df_juin = pd.read_csv('uber-trip-data/uber-raw-data-jun14.csv')
df_juil = pd.read_csv('uber-trip-data/uber-raw-data-jul14.csv')
df_aou = pd.read_csv('uber-trip-data/uber-raw-data-aug14.csv')
df_sep = pd.read_csv('uber-trip-data/uber-raw-data-sep14.csv')

In [3]:
# Concatenate the monthly Uber data into a single DataFrame
df = pd.concat([df_avr, df_mai, df_juin, df_juil, df_aou, df_sep], ignore_index=True)

In [4]:
df.describe(include="all")

| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| count | 4534327 | 4534327.0 | 4534327.0 | 4534327 |
| unique | 260093 | | | 5 |
| top | 4/7/2014 20:21:00 | | | B02617 |
| freq | 97 | | | 1458853 |
| mean | | 40.73926 | -73.97302 | |
| std | | 0.03994991 | 0.0572667 | |
| min | | 39.6569 | -74.929 | |
| 25% | | 40.7211 | -73.9965 | |
| 50% | | 40.7422 | -73.9834 | |
| 75% | | 40.761 | -73.9653 | |


In [5]:
df.shape

(4534327, 4)

In [6]:
# Restrict the data to the area around Manhattan to keep the maps readable
int_lat = [40.56, 40.9]
int_lon = [-74.2, -73.6]
df_nyc = df[(df['Lat'] >= int_lat[0]) & (df['Lat'] <= int_lat[1]) &
            (df['Lon'] >= int_lon[0]) & (df['Lon'] <= int_lon[1])]

In [7]:
# Draw a 1% random sample to reduce the size of the data
sample = df_nyc.sample(frac=0.01, random_state=0)
sample.shape

(45050, 4)

In [8]:
sample.head()

| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| 2909535 | 8/28/2014 21:35:00 | 40.7566 | -73.9779 | B02598 |
| 2775031 | 8/9/2014 11:19:00 | 40.6875 | -73.9723 | B02598 |
| 951830 | 5/23/2014 20:54:00 | 40.7315 | -74.0049 | B02617 |
| 4413776 | 9/13/2014 13:31:00 | 40.7414 | -74.0065 | B02764 |
| 496716 | 4/24/2014 12:51:00 | 40.7523 | -73.9675 | B02682 |


In [9]:
data = sample.copy()
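# Parse the whole column at once (vectorized) instead of row by row;
# the explicit format matches timestamps such as '8/28/2014 21:35:00'
data['Date/Time'] = pd.to_datetime(data['Date/Time'], format='%m/%d/%Y %H:%M:%S')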

In [10]:
data['Hour'] = data['Date/Time'].dt.hour
data['Month'] = data['Date/Time'].dt.month
data['DayOfWeek'] = data['Date/Time'].dt.dayofweek
data['Day'] = data['Date/Time'].dt.day
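
The `DayOfWeek` values follow the pandas convention where 0 is Monday and 6 is Sunday, which Python's standard `datetime.weekday()` shares. A small standard-library check, using one raw timestamp from the sample shown earlier:

```python
import calendar
from datetime import datetime

# One raw timestamp in the format used by the Uber CSV files
raw = "8/28/2014 21:35:00"
ts = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S")

day_index = ts.weekday()                 # 0 = Monday ... 6 = Sunday
day_name = calendar.day_name[day_index]  # name depends on the active locale

print(ts.hour, ts.month, day_index, day_name)
```

This is only a sanity check on the convention; the notebook itself relies on the pandas `.dt` accessors.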

In [11]:
data = data.reset_index(drop=True)

In [12]:
data.head()

| | Date/Time | Lat | Lon | Base | Hour | Month | DayOfWeek | Day |
|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-28 21:35:00 | 40.7566 | -73.9779 | B02598 | 21 | 8 | 3 | 28 |
| 1 | 2014-08-09 11:19:00 | 40.6875 | -73.9723 | B02598 | 11 | 8 | 5 | 9 |
| 2 | 2014-05-23 20:54:00 | 40.7315 | -74.0049 | B02617 | 20 | 5 | 4 | 23 |
| 3 | 2014-09-13 13:31:00 | 40.7414 | -74.0065 | B02764 | 13 | 9 | 5 | 13 |
| 4 | 2014-04-24 12:51:00 | 40.7523 | -73.9675 | B02682 | 12 | 4 | 3 | 24 |


In [13]:
fig = px.scatter_mapbox(data, lat="Lat", lon="Lon", size_max=15,
                        zoom=10, width=800, height=800,
                        title="Uber pickup locations")
fig.update_layout(mapbox_style="carto-positron")

fig.show()

## Analysis of pickup locations

In [14]:
# Analysis by hour
df_heures = data.groupby('Hour').size().reset_index(name='count')
df_heures.head()

| | Hour | count |
|---|---|---|
| 0 | 0 | 1062 |
| 1 | 1 | 695 |
| 2 | 2 | 428 |
| 3 | 3 | 501 |
| 4 | 4 | 545 |


In [15]:
fig = px.bar(df_heures, x='Hour', y='count', title='Number of pickups per hour',
             labels={'Hour': 'Hour', 'count': 'Number of pickups'})
fig.show()

Pickups are more frequent in the evening, peaking between 5 p.m. and 6 p.m.

In [16]:
# Analysis by day of week (pandas convention: 0 = Monday, ..., 6 = Sunday)
def affiche_joursem(d):
    if d==0: return 'Monday'
    if d==1: return 'Tuesday'
    if d==2: return 'Wednesday'
    if d==3: return 'Thursday'
    if d==4: return 'Friday'
    if d==5: return 'Saturday'
    if d==6: return 'Sunday'
df_joursem = data.groupby('DayOfWeek').size().reset_index(name='count')
df_joursem['DayOfWeek'] = df_joursem['DayOfWeek'].apply(affiche_joursem)
df_joursem.head()

| | DayOfWeek | count |
|---|---|---|
| 0 | Monday | 5353 |
| 1 | Tuesday | 6569 |
| 2 | Wednesday | 7009 |
| 3 | Thursday | 7569 |
| 4 | Friday | 7416 |


In [17]:
fig = go.Figure()
fig.add_trace(go.Bar(x=df_joursem['DayOfWeek'], 
                     y=df_joursem['count'], 
                     text=df_joursem['count'],
                     textposition='outside'))
fig.update_layout(title='Number of pickups per day of week',
                  xaxis_title='Day of week',
                  yaxis_title='Number of pickups',
                  yaxis=dict(range=[0, 8100]))
fig.show()

Thursday is the day with the most pickups.

In [18]:
# Analysis by month
def affiche_mois(m):
    if m==4: return 'April'
    if m==5: return 'May'
    if m==6: return 'June'
    if m==7: return 'July'
    if m==8: return 'August'
    if m==9: return 'September'
df_mois = data.groupby('Month').size().reset_index(name='count')
df_mois['Month'] = df_mois['Month'].apply(affiche_mois)
df_mois

| | Month | count |
|---|---|---|
| 0 | April | 5608 |
| 1 | May | 6432 |
| 2 | June | 6630 |
| 3 | July | 7908 |
| 4 | August | 8257 |
| 5 | September | 10215 |


In [19]:
fig = go.Figure()
fig.add_trace(go.Bar(x=df_mois['Month'], 
                     y=df_mois['count'], 
                     text=df_mois['count'],
                     textposition='outside'))
fig.update_layout(title='Number of pickups per month',
                  xaxis_title='Month',
                  yaxis_title='Number of pickups',
                  yaxis=dict(range=[0, 11000]))
fig.show()

Pickups are most frequent in September.

In [20]:
# Bivariate analysis: pickups by hour (rows) and day of week (columns)

df_bivarie = data.groupby(['Hour', 'DayOfWeek']).size().unstack(fill_value=0)
df_bivarie.head()

| Hour / DayOfWeek | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 58 | 61 | 87 | 90 | 120 | 290 | 356 |
| 1 | 36 | 45 | 56 | 50 | 88 | 179 | 241 |
| 2 | 37 | 25 | 29 | 37 | 54 | 95 | 151 |
| 3 | 67 | 54 | 50 | 63 | 71 | 85 | 111 |
| 4 | 100 | 73 | 70 | 92 | 82 | 72 | 56 |


In [21]:
# Build the figure
fig = go.Figure()
buttons = []

# Add one trace per day of the week, with a dropdown button that shows only that trace
for j, jour in enumerate(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']):
    fig.add_trace(go.Scatter(x=np.arange(24), y=df_bivarie[j], mode='lines', name=jour, visible=False))
    buttons.append(dict(label=jour,
                        method='update',
                        args=[{'visible': [k == j for k in range(7)]},
                              {'title': f'Number of pickups - {jour}'}]))

# Show Monday by default
fig.data[0].visible = True

fig.update_layout(
    title='Number of pickups per hour and day of week',
    xaxis_title='Hour',
    yaxis_title='Number of pickups',
    updatemenus=[{
        'buttons': buttons,
        'direction': 'down',
        'showactive': True,
    }]
)

fig.update_yaxes(range=[0, df_bivarie.values.max() + 10])
fig.update_xaxes(showgrid=True)

fig.show()

In [22]:
# Color assigned to each day of the week
color_days = {
    0: '#FF0000',
    1: '#00DD00',
    2: '#5555FF',
    3: '#DDAA00',
    4: '#008080',
    5: '#800080',
    6: '#000000'
}

In [23]:
# Build the figure
fig = go.Figure()

# Add one trace per day of the week
for j, jour in enumerate(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']):
    fig.add_trace(go.Scatter(
        x=np.arange(24),
        y=df_bivarie[j],
        mode='lines+markers',
        name=jour,
        line=dict(width=2, color=color_days[j])
    ))

# Format the chart
fig.update_layout(
    title='Number of pickups per hour and day of week',
    xaxis_title='Hour',
    yaxis_title='Number of pickups',
    xaxis=dict(tickmode='linear', tick0=0, dtick=1),
    legend_title='Day of week'
)

fig.update_yaxes(range=[0, df_bivarie.values.max() + 10])
fig.update_xaxes(showgrid=True)

fig.show()

## Data preprocessing

In [24]:
numeric_features = ['Lat', 'Lon']
numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
])

X = preprocessor.fit_transform(data)
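
Because `StandardScaler` rescales `Lat` and `Lon` independently, a distance in `X` does not map to a fixed distance on the ground. As a hedged aside (not part of the original notebook), the great-circle distance between two points can be sketched with the haversine formula, here assuming a spherical Earth of radius 6371 km; `haversine_km` is a name introduced for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# One degree of latitude versus one degree of longitude near Manhattan
north_south = haversine_km(40.0, -74.0, 41.0, -74.0)
east_west = haversine_km(40.74, -74.0, 40.74, -73.0)
print(round(north_south, 1), round(east_west, 1))
```

At this latitude a degree of longitude covers noticeably less ground than a degree of latitude, which is one reason to interpret distances in standardized units with care.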

## KMeans

In [25]:
# Compute the silhouette score for each number of clusters
sil = []
k = []

# Start at i=2 because the silhouette score is undefined for fewer than 2 labels
for i in range(2,11):
    kmeans = KMeans(n_clusters = i, random_state = 0, n_init = 'auto')
    kmeans.fit(X)
    sil.append(silhouette_score(X, kmeans.predict(X)))
    k.append(i)
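
One way to read the scores collected in this loop is to keep the number of clusters with the highest silhouette. The sketch below shows the idea on synthetic blobs rather than on the Uber sample; `make_blobs`, the blob centers, and `n_init=10` are illustrative choices, not the notebook's settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled coordinates: three well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)

scores = {}
for n_clusters in range(2, 7):
    model = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(X_demo)
    scores[n_clusters] = silhouette_score(X_demo, model.labels_)

# Keep the number of clusters with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real pickup data the maximum is usually less clear-cut, so the silhouette is best read alongside the elbow curve and the map itself.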

In [26]:
# Build the DataFrame of scores
cluster_scores = pd.DataFrame(sil)
k_frame = pd.Series(k)

In [27]:
# Plot
fig_sil = px.bar(data_frame=cluster_scores,
                 x=k_frame,
                 y=cluster_scores.iloc[:, -1]
)

# Title and labels
fig_sil.update_layout(
    yaxis_title="Silhouette",
    xaxis_title="Number of clusters",
    title="Silhouette score by number of clusters"
)

fig_sil.show()

In [28]:
# Loop over K and collect the within-cluster sum of squares (WCSS)
# The fitted model's .inertia_ attribute gives the WCSS for each K
wcss = []
k = []
for i in range(1,11):
    kmeans = KMeans(n_clusters = i, random_state = 0, n_init = 'auto')
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    k.append(i)
print(wcss)

[90099.99999999983, 60143.8740926911, 37501.723477390675, 29330.334320252397, 22493.750838839485, 17520.32377794447, 15446.435621638093, 12346.9617634701, 10483.43963364688, 9235.58434223069]
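
To make the elbow easier to read, the relative drop in inertia from one `k` to the next can be computed directly from the values printed above (rounded here). This is only a rough heuristic, and `drops` is a name introduced for illustration.

```python
# Inertia values printed above for k = 1..10, rounded to one decimal
wcss = [90100.0, 60143.9, 37501.7, 29330.3, 22493.8,
        17520.3, 15446.4, 12347.0, 10483.4, 9235.6]

# Relative decrease in inertia when going from k-1 to k clusters
drops = {
    k: (wcss[k - 2] - wcss[k - 1]) / wcss[k - 2]
    for k in range(2, len(wcss) + 1)
}

for k, drop in drops.items():
    print(f"k={k}: inertia falls by {drop:.1%}")

# The k whose addition reduces inertia the most, in relative terms
largest_drop_k = max(drops, key=drops.get)
print(largest_drop_k)
```

With these values the largest relative drop occurs when moving from 2 to 3 clusters. The first value, about 90100, is simply the number of samples (45050) times the two standardized features.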


In [29]:
# Build the DataFrame of inertia values
wcss_frame = pd.DataFrame(wcss)
k_frame = pd.Series(k)

# Plot
fig_elbow = px.line(
    wcss_frame,
    x=k_frame,
    y=wcss_frame.iloc[:,-1]
)

# Title and labels
fig_elbow.update_layout(
    yaxis_title="Inertia",
    xaxis_title="Number of clusters",
    title="Inertia by number of clusters"
)

fig_elbow.show()

In [30]:
kmeans = KMeans(n_clusters = 3, random_state = 0, n_init = 'auto')
kmeans.fit(X)

data['Cluster_KMeans'] = kmeans.predict(X)
data.head()

| | Date/Time | Lat | Lon | Base | Hour | Month | DayOfWeek | Day | Cluster_KMeans |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-28 21:35:00 | 40.7566 | -73.9779 | B02598 | 21 | 8 | 3 | 28 | 1 |
| 1 | 2014-08-09 11:19:00 | 40.6875 | -73.9723 | B02598 | 11 | 8 | 5 | 9 | 0 |
| 2 | 2014-05-23 20:54:00 | 40.7315 | -74.0049 | B02617 | 20 | 5 | 4 | 23 | 0 |
| 3 | 2014-09-13 13:31:00 | 40.7414 | -74.0065 | B02764 | 13 | 9 | 5 | 13 | 0 |
| 4 | 2014-04-24 12:51:00 | 40.7523 | -73.9675 | B02682 | 12 | 4 | 3 | 24 | 1 |
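
Since the model was fitted on standardized coordinates, its cluster centers are in standardized units too. A minimal sketch of mapping them back to latitude and longitude with `inverse_transform`, shown on a handful of made-up points and a bare `StandardScaler` instead of the notebook's `ColumnTransformer`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up pickups around two locations (columns: Lat, Lon)
coords = np.array([
    [40.750, -73.990], [40.752, -73.988], [40.748, -73.992],
    [40.690, -73.950], [40.692, -73.948], [40.688, -73.952],
])

scaler = StandardScaler()
X_demo = scaler.fit_transform(coords)

kmeans_demo = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_demo)

# Centers are in standardized units; convert them back to degrees
centers_deg = scaler.inverse_transform(kmeans_demo.cluster_centers_)
print(np.round(centers_deg, 3))
```

With the notebook's `ColumnTransformer`, the fitted scaler should be reachable as `preprocessor.named_transformers_['num']`, and the recovered centers can then serve as hot-zone pins on the map.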


In [31]:
data_kmeans = data.sort_values('Cluster_KMeans', ascending=True)
# Cast labels to strings so Plotly treats the clusters as discrete categories
data_kmeans['Cluster_KMeans'] = data_kmeans['Cluster_KMeans'].astype('str')

In [32]:
color_kmeans = {
    '0': '#AA0000',
    '1': '#00DD00',
    '2': '#5555FF'
}

In [33]:
fig = px.scatter_mapbox(data_frame=data_kmeans,
                        lat='Lat',
                        lon='Lon',
                        color='Cluster_KMeans',
                        mapbox_style="carto-positron",
                        zoom=8.5,
                        color_discrete_map=color_kmeans)

fig.update_layout(
    title="KMeans clustering"
)

fig.show()

## DBSCAN

In [34]:
data_dbscan = data.copy()

In [35]:
numeric_features = ['Lat', 'Lon']
numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
])

X = preprocessor.fit_transform(data_dbscan)

In [36]:
# Search ranges
eps_values = np.linspace(0.1, 0.5, 5)
min_samples_values = range(20, 30, 5)

# Collected results
results = []

for eps in eps_values:
    for min_samples in min_samples_values:
        db = DBSCAN(eps=eps, min_samples=min_samples, metric="manhattan")
        db.fit(X)
        
        labels = db.labels_
        # Label -1 marks noise points, so it is not counted as a cluster
        num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        num_outliers = int((labels == -1).sum())
        results.append((eps, min_samples, num_clusters, num_outliers))

# Convert to a DataFrame for analysis
results_df = pd.DataFrame(results, columns=['eps', 'min_samples', 'num_clusters', 'num_outliers'])

# Filter on specific criteria (here: at most 5 clusters and at most 10% outliers)
filtered_results = results_df[
    (results_df['num_clusters'] <= 5) & 
    (results_df['num_outliers'] <= len(X) * 0.1)
].copy()
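
A common complement to a grid search is the k-distance heuristic: sort each point's distance to its `min_samples`-th nearest neighbor and look for the bend in the curve as a candidate `eps`. A sketch on synthetic data, where `make_blobs` and all parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X_demo, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

min_samples = 20
# Each point is returned as its own nearest neighbor (distance 0), and DBSCAN
# also counts the point itself in min_samples, so n_neighbors=min_samples
# matches that convention.
nn = NearestNeighbors(n_neighbors=min_samples, metric="manhattan").fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Distance to the farthest of those neighbors, sorted for plotting
k_distances = np.sort(distances[:, -1])
print(k_distances[[0, len(k_distances) // 2, -1]])
```

Plotting `k_distances` (for example with `px.line`) gives the curve whose bend suggests a value for `eps`.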

In [37]:
# Show the best configurations
filtered_results.sort_values(by=['num_clusters', 'num_outliers'], inplace=True)
filtered_results

| | eps | min_samples | num_clusters | num_outliers |
|---|---|---|---|---|
| 8 | 0.5 | 20 | 3 | 313 |
| 9 | 0.5 | 25 | 3 | 338 |
| 7 | 0.4 | 25 | 3 | 561 |
| 5 | 0.3 | 25 | 4 | 841 |
| 4 | 0.3 | 20 | 5 | 732 |


In [38]:
eps = 0.5
min_samples = 20
db = DBSCAN(eps=eps, min_samples=min_samples, metric="manhattan")
db.fit(X)

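# Count clusters and outliers from the labels (-1 marks noise points);
# this stays correct even when no point is labelled as noise
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_outliers = int((db.labels_ == -1).sum())
print(f'Number of clusters: {n_clusters}')
print(f'Number of outliers: {n_outliers}')

data_dbscan['Cluster_DB'] = db.labels_
data_dbscan.head()

Number of clusters: 3
Number of outliers: 313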


| | Date/Time | Lat | Lon | Base | Hour | Month | DayOfWeek | Day | Cluster_KMeans | Cluster_DB |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-08-28 21:35:00 | 40.7566 | -73.9779 | B02598 | 21 | 8 | 3 | 28 | 1 | 0 |
| 1 | 2014-08-09 11:19:00 | 40.6875 | -73.9723 | B02598 | 11 | 8 | 5 | 9 | 0 | 0 |
| 2 | 2014-05-23 20:54:00 | 40.7315 | -74.0049 | B02617 | 20 | 5 | 4 | 23 | 0 | 0 |
| 3 | 2014-09-13 13:31:00 | 40.7414 | -74.0065 | B02764 | 13 | 9 | 5 | 13 | 0 | 0 |
| 4 | 2014-04-24 12:51:00 | 40.7523 | -73.9675 | B02682 | 12 | 4 | 3 | 24 | 1 | 0 |
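
The cluster and outlier counts can also be wrapped in a small helper that does not depend on label ordering. `summarize_labels` is a hypothetical name; the only convention it relies on is scikit-learn's use of `-1` for noise points.

```python
import numpy as np

def summarize_labels(labels):
    """Return (number of clusters, number of noise points) for DBSCAN-style labels."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise

# Works whether or not noise points are present
print(summarize_labels([0, 0, 1, -1, 2, -1]))
print(summarize_labels([0, 0, 1, 1]))
```

Calling it on `db.labels_` would give the same two numbers as the printed summary.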


In [39]:
custom_colors = {
    '0': '#BB0000',
    '1': '#00DD00',
    '2': '#0000FF',
    '3': '#FFA500',
    '4': '#4B0082',
    '5': '#808080',
    '6': '#008080',
    '7': '#32CD32',
    '8': '#00FFFF',
    '9': '#E6A400',
    '10': '#FF00FF',
    '11': '#FF0000',
    '12': '#800080',
    '13': '#FF6347',
    '14': '#7FFF00',
    '15': '#D2691E',
    '16': '#ADFF2F',
    '17': '#1E90FF',
    '18': '#FF4500',
    '19': '#C0C0C0',
    '20': '#FF1493',
    '21': '#8B008B',
    '22': '#FF8C00',
    '23': '#B22222',
    '24': '#DDAA00'
}

In [40]:
data_dbscan['Cluster_DB'] = data_dbscan['Cluster_DB'].astype('int')
mask = (data_dbscan['Cluster_DB']>=0)
data_dbscan['Cluster_DB'] = data_dbscan['Cluster_DB'].astype('str')

In [41]:
fig = px.scatter_mapbox(data_frame=data_dbscan[mask],
                        lat='Lat',
                        lon='Lon',
                        color='Cluster_DB',
                        mapbox_style="carto-positron",
                        zoom=8.5,
                        color_discrete_map=custom_colors,
                        category_orders={'Cluster_DB': list(custom_colors.keys())})

fig.update_layout(
    title="DBSCAN clustering"
)

fig.show()

## OPTICS

In [66]:
# Copy the DataFrame
data_optics = data.copy()

numeric_features = ['Lat', 'Lon']
numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
])

X = preprocessor.fit_transform(data_optics)

In [67]:
# Initialize OPTICS with suitable parameters
optics = OPTICS(min_samples=20, max_eps=0.5, metric="manhattan")

# Fit the model
optics.fit(X)

# Extract the labels
data_optics['Cluster_OPTICS'] = optics.labels_


divide by zero encountered in divide



In [69]:
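mask_optics = (data_optics['Cluster_OPTICS'] >= 0)
# Cast labels to strings so Plotly uses discrete colors instead of a continuous scale
plot_optics = data_optics[mask_optics].astype({'Cluster_OPTICS': 'str'})

fig = px.scatter_mapbox(data_frame=plot_optics,
                        lat='Lat',
                        lon='Lon',
                        color='Cluster_OPTICS',
                        mapbox_style="carto-positron",
                        zoom=9)

fig.update_layout(
    title="OPTICS clustering"
)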

fig.show()

## DBSCAN by day of week

We use DBSCAN to show pickup concentrations by day of week. It is better suited to highlighting "hot zones" because it keeps only dense areas and labels isolated pickups as noise.

To make the clustering stricter, we use a smaller epsilon (0.3 instead of 0.5).

In [42]:
eps = 0.3
min_samples = 20
numeric_features = ['Lat', 'Lon']
numeric_transformer = StandardScaler()
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
])

In [43]:
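# Cluster each day of the week separately
for day in range(0,7):
    maskday = (data_dbscan['DayOfWeek'] == day)
    df_day = data_dbscan[maskday].copy()
    # The scaler is refit on each day's subset, so eps is expressed in that day's standardized units
    X = preprocessor.fit_transform(df_day)
    db_day = DBSCAN(eps=eps, min_samples=min_samples, metric="manhattan")
    db_day.fit(X=X)
    # Write the day's labels back into the shared DataFrame
    data_dbscan.loc[maskday,'Cluster_DB'] = db_day.labels_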
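
Once each day has its own labels, one way to turn clusters into hot-zone pins is to average the coordinates of each cluster and count its pickups. A sketch on a made-up table that reuses the notebook's column names; `hot_zones` is introduced here for illustration, and the labels are kept as integers for simplicity.

```python
import pandas as pd

# Made-up labelled pickups; -1 marks DBSCAN noise
labelled = pd.DataFrame({
    "DayOfWeek": [0, 0, 0, 0, 5, 5, 5],
    "Cluster_DB": [0, 0, 1, -1, 0, 0, 0],
    "Lat": [40.75, 40.76, 40.70, 40.60, 40.72, 40.73, 40.74],
    "Lon": [-73.99, -73.98, -73.95, -74.10, -74.00, -73.99, -73.98],
})

hot_zones = (
    labelled[labelled["Cluster_DB"] != -1]          # drop noise points
    .groupby(["DayOfWeek", "Cluster_DB"])
    .agg(Lat=("Lat", "mean"), Lon=("Lon", "mean"), pickups=("Lat", "size"))
    .reset_index()
    .sort_values(["DayOfWeek", "pickups"], ascending=[True, False])
)
print(hot_zones)
```

Each row is then one candidate hot zone for a given day, ranked by how many pickups it contains.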

In [44]:
data_dbscan = data_dbscan.sort_values('DayOfWeek', ascending=True)

data_dbscan['Cluster_DB'] = data_dbscan['Cluster_DB'].astype('int')
mask = (data_dbscan['Cluster_DB']>=0)
data_dbscan['Cluster_DB'] = data_dbscan['Cluster_DB'].astype('str')

category_orders = {
    'Cluster_DB': sorted(custom_colors.keys(), key=lambda x: int(x))
}

In [45]:
# Work around the number of clusters varying from one day of the week to the next
# (which can make the animation unstable, with points displayed on the wrong day):
# fill the gaps with empty placeholder rows
data_fake = data_dbscan.copy()
all_clusters = set(data_fake['Cluster_DB'].unique())
for day in data_fake['DayOfWeek'].unique():
    clusters_present = set(data_fake[data_fake['DayOfWeek'] == day]['Cluster_DB'])
    missing_clusters = all_clusters - clusters_present
    for cluster in missing_clusters:
        data_fake = pd.concat([
            data_fake,
            pd.DataFrame({'Lat': [None], 'Lon': [None], 'Cluster_DB': [cluster], 'DayOfWeek': [day]})
        ])
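
The same gap-filling can be sketched without concatenating inside a loop, by listing every (frame, cluster) pair and adding one empty row per missing pair. The table below is made up and `placeholders` is a name introduced for illustration; this is an alternative sketch, not the notebook's code.

```python
from itertools import product

import numpy as np
import pandas as pd

# Made-up clustered pickups: cluster '1' never appears on day 1
points = pd.DataFrame({
    "DayOfWeek": [0, 0, 1],
    "Cluster_DB": ["0", "1", "0"],
    "Lat": [40.75, 40.70, 40.73],
    "Lon": [-73.99, -73.95, -73.98],
})

present = set(zip(points["DayOfWeek"], points["Cluster_DB"]))
all_pairs = product(points["DayOfWeek"].unique(), points["Cluster_DB"].unique())

# One coordinate-less row for each (day, cluster) pair missing from the data
placeholders = pd.DataFrame(
    [{"DayOfWeek": day, "Cluster_DB": cluster, "Lat": np.nan, "Lon": np.nan}
     for day, cluster in all_pairs if (day, cluster) not in present]
)

filled = pd.concat([points, placeholders], ignore_index=True)
print(filled)
```

Building all placeholder rows first and concatenating once avoids copying the whole DataFrame on every iteration.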

In [46]:
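# Filter on the label itself: data_fake has extra placeholder rows, so the
# `mask` Series built from data_dbscan does not line up with its index
fig = px.scatter_mapbox(data_frame=data_fake[data_fake['Cluster_DB'] != '-1'],
                        lat='Lat',
                        lon='Lon',
                        color='Cluster_DB',
                        zoom=8.5,
                        animation_frame='DayOfWeek',
                        mapbox_style="carto-positron",
                        color_discrete_map=custom_colors,
                        category_orders=category_orders)

fig.update_layout(
    title="DBSCAN clustering by day of week",
    legend_title="Cluster"
)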

fig.show()

## Saturdays, hour by hour

In [47]:
# Caution: df_day still holds the last iteration of the loop above (DayOfWeek == 6, i.e. Sunday)
df_day[mask]

| | Date/Time | Lat | Lon | Base | Hour | Month | DayOfWeek | Day | Cluster_KMeans | Cluster_DB |
|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 2014-09-07 02:45:00 | 40.7416 | -74.0035 | B02617 | 2 | 9 | 6 | 7 | 0 | 0 |
| 22 | 2014-08-17 23:04:00 | 40.7489 | -73.9759 | B02598 | 23 | 8 | 6 | 17 | 1 | 0 |
| 43 | 2014-08-17 02:19:00 | 40.7116 | -73.9535 | B02598 | 2 | 8 | 6 | 17 | 0 | 0 |
| 44 | 2014-09-28 07:15:00 | 40.7202 | -73.9919 | B02617 | 7 | 9 | 6 | 28 | 0 | 0 |
| 47 | 2014-06-29 14:06:00 | 40.7448 | -73.9831 | B02682 | 14 | 6 | 6 | 29 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45006 | 2014-06-22 00:05:00 | 40.7621 | -73.9208 | B02512 | 0 | 6 | 6 | 22 | 1 | 0 |
| 45008 | 2014-08-17 16:00:00 | 40.6836 | -73.9844 | B02617 | 16 | 8 | 6 | 17 | 0 | 0 |
| 45024 | 2014-09-14 01:05:00 | 40.7184 | -73.9529 | B02682 | 1 | 9 | 6 | 14 | 0 | 0 |
| 45030 | 2014-07-06 03:45:00 | 40.7353 | -74.0052 | B02598 | 3 | 7 | 6 | 6 | 0 | 0 |


In [48]:
# DayOfWeek == 5 selects Saturdays (0 = Monday)
maskday = (data_dbscan['DayOfWeek'] == 5)
df_day = data_dbscan.loc[maskday].copy()
df_day = df_day.sort_values('Hour', ascending=True)

In [49]:
df_day_fake = df_day.copy()
all_clusters = set(df_day_fake['Cluster_DB'].unique())
for hour in df_day_fake['Hour'].unique():
    clusters_present = set(df_day_fake[df_day_fake['Hour'] == hour]['Cluster_DB'])
    missing_clusters = all_clusters - clusters_present
    for cluster in missing_clusters:
        df_day_fake = pd.concat([
            df_day_fake,
            pd.DataFrame({'Lat': [None], 'Lon': [None], 'Cluster_DB': [cluster], 'Hour': [hour]})
        ])

In [50]:
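# Filter on the label itself: df_day_fake has extra placeholder rows, so the
# `mask` Series built from data_dbscan does not line up with its index
fig = px.scatter_mapbox(
    data_frame=df_day_fake[df_day_fake['Cluster_DB'] != '-1'],
    lat='Lat',
    lon='Lon',
    color='Cluster_DB',
    animation_frame='Hour',
    zoom=9,
    mapbox_style="carto-positron",
    color_discrete_map=custom_colors,
    category_orders=category_orders
)

fig.update_layout(
    title="DBSCAN clustering by hour of the day (Saturdays)",
    legend_title="Cluster"
)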

fig.show()