#  Uber

# Preface

<p align="center">
<img src="./assets/ny.png" alt="drawing" width="800"/>
</p>


## Executive summary

## Notes about specifications

* Read 
    * https://app.jedha.co/course/projects-unsupervised-machine-learning-ft/uber-pickups-ft

* Goals 
    * Make sure drivers are at the right place at the right time so that the waiting time for the user does not exceed 5 to 7 minutes 
    * Ideally the app would recommend hot-zones in major cities to be in at any given time of day
    * Create an algorithm to find hot zones
    * Visualize results on a dashboard 

* Instructions 
    * Get the data
    * Focus on New-York
    * Use cluster coordinates to pin hot zones
        * Pickup locations can be gathered into different clusters. Use cluster coordinates to pin hot zones
    * Create maps with plotly
    * Start small then generalize
        * Pick one day and a given hour
        * **and then** start to generalize your approach

* Deliverables
  * Have a map with hot-zones (plotly)
  * At least describe hot-zones per day of week
  * Compare results with at least : KMeans and DBSCAN

## TODO & Ideas

* DBSCAN 
  * ~~metric = haversine because the Earth is round and not flat (or so I believe)~~
  * ~~unit of epsilon~~
  * https://blog.stackademic.com/mastering-clustering-dbscan-a880566704bc
* ~~warning KMeans~~
  * ~~set OMP_NUM_THREADS=5~~
  * ~~Get-ChildItem Env:~~
  * ~~echo $env:OMP_NUM_THREADS~~


In [None]:
# prelude

# Avoid the KMeans warning about an OpenMP memory leak on Windows.
# The variable must be set before scikit-learn is imported.
import os
import platform

if platform.system() == "Windows":
  os.environ["OMP_NUM_THREADS"] = "5"


import pandas             as pd
import plotly.express     as px
import numpy              as np
import seaborn            as sns
import matplotlib.pyplot  as plt
import datetime


# from sklearn.model_selection    import train_test_split, GridSearchCV 
from sklearn.pipeline           import make_pipeline
from sklearn.impute             import SimpleImputer
from sklearn.preprocessing      import OneHotEncoder, StandardScaler    
from sklearn.compose            import make_column_transformer                  
# from sklearn.linear_model       import LinearRegression, Lasso, Ridge
# from sklearn.metrics            import mean_squared_error, mean_absolute_error, r2_score
# from sklearn.decomposition      import PCA

from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

# -----------------------------------------------------------------------------
k_AssetsDir     = "assets/"
# k_FileName    = "walmart_store_sales.csv"
k_Gold          = 1.618         # gold number for ratio
k_Width         = 12
k_Height        = k_Width/k_Gold
k_WidthPx       = 1024
k_HeightPx      = k_WidthPx/k_Gold

k_target        = "weekly_sales"
k_random_state  = 0                       
k_test_size     = 20/100        

# EDA

## About the files available

In [None]:
tmp_df = pd.read_csv(f"assets/taxi-zone-lookup.csv", nrows=10)
display (tmp_df)

* With this file we could relate a LocationID to a geographical area 

In [None]:
tmp_df = pd.read_csv(f"assets/uber-raw-data-janjune-15.csv", nrows=10)
display (tmp_df)

* Using the locationID from the taxi lookup table, we could extend this DataFrame with Borough and Zone
* However, this DataFrame lacks longitude and latitude
* We need both if we want a tool that helps deliver the service within a 5-minute time frame
    * For example, nothing says whether a zone within a borough can or should be considered a "5 min" area
    * Even then, we would have to take traffic into account to estimate what a 5 min zone is at 1 PM vs. at 2 AM


In [None]:
tmp_df = pd.read_csv(f"assets/uber-raw-data-apr14.csv", nrows=5)
display (tmp_df)

* We cannot use the column Base since we cannot relate it to either a borough or a zone
* Indeed, as we can see in the 2015 file, a dispatching base number can be related to multiple affiliated base numbers, and an affiliated base number can be related to multiple location IDs. 

### <span style="color:orange"><b>Comments :</b></span>

For the reasons explained before, in the rest of this notebook :

* We do not use data from 2015
* We do not use the taxi zone look up
* In files from 2014, we drop the column "Base" and keep only ``date``, ``latitude`` and ``longitude``

## Getting the dataset

In [None]:
def uber_preprocessor(df):
  df.columns = df.columns.str.lower()
  df.columns = df.columns.str.replace("/", "_")

  # df["base"] = df["base"].astype("category")
  df.drop(columns="base", inplace=True)

  df["date_time"] = pd.to_datetime(df["date_time"]) # , dayfirst=True
  df["year"] = df["date_time"].dt.year
  df["month"] = df["date_time"].dt.month
  df["day"] = df["date_time"].dt.day
  df["weekday"] = df["date_time"].dt.weekday
  df["hour"] = df["date_time"].dt.hour
  df["minute"] = df["date_time"].dt.minute
  df.drop(columns="date_time", inplace=True)

  # maybe we could drop hour and minute. We will see how it goes 
  df["bin"] = df["hour"]*12 + df["minute"]//5
  # df.drop(columns="hour", inplace=True)
  # df.drop(columns="minute", inplace=True)


  return df

In [None]:
months = ["may"]
# months = ["apr", "may", "jun", "jul", "aug", "sep"]            # uncomment/comment this line to take all the months available into account 

df = pd.DataFrame()
for month in months:
  tmp_df = pd.read_csv(f"assets/uber-raw-data-{month}14.csv")
  df = pd.concat([df, tmp_df])

df = uber_preprocessor(df)
# display(df.sort_values(by="bin", ascending=False))
display(df.head(10))

Notes : 
* In the preprocessed DataFrame, a ``bin`` is a time unit of 5 minutes 
* So the ``bin`` column indicates which time bin the observation belongs to
* Each day has 288 time bins, numbered from 0 to 287, covering 00:00 to 23:59
* This might be useful later since users are not willing to wait more than 5-7 minutes  
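
As a quick sanity check of the binning rule described in these notes, here is a minimal sketch using only the standard library. The helper name `time_bin` and the sample timestamps are illustrative assumptions; the formula is the one used in `uber_preprocessor` (`hour*12 + minute//5`).

```python
from datetime import datetime

BIN_MINUTES = 5                      # width of a time bin, in minutes
BINS_PER_HOUR = 60 // BIN_MINUTES    # 12 bins per hour

def time_bin(ts: datetime) -> int:
    """Return the index of the 5-minute bin that contains ts."""
    return ts.hour * BINS_PER_HOUR + ts.minute // BIN_MINUTES

# Illustrative timestamps (not taken from the dataset)
first = time_bin(datetime(2014, 5, 9, 0, 0))    # first bin of the day
rush = time_bin(datetime(2014, 5, 9, 7, 42))    # a pickup at 07:42
last = time_bin(datetime(2014, 5, 9, 23, 59))   # last bin of the day

print(first, rush, last)
```

The last minute of the day falls in bin 287, which is why the bins cover the full 24 hours.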

In [None]:
print(f"At this point the dataset consists of :")
print(f"\t{len(df.shape):>9_} dimensions")
print(f"\t{df.shape[0]:>9_} observations")
print(f"\t{df.shape[1]:>9_} features    ")

In [None]:
types_df = pd.DataFrame ({
  "types" : df.dtypes.value_counts()
})
types_df["as_%"] = (100 * types_df["types"]/types_df["types"].sum()).round(2)

display(types_df)

In [None]:
# -----------------------------------------------------------------------------
def quick_View(df):
  summary_lst = []
  
  for col_name in df.columns:
    col_dtype               = df[col_name].dtype
    num_of_null             = df[col_name].isnull().sum()
    percent_of_null         = num_of_null/len(df)
    num_of_non_null         = df[col_name].notnull().sum()
    num_of_distinct_values  = df[col_name].nunique()
    
    if num_of_distinct_values <= 10:
        distinct_values_counts = df[col_name].value_counts().to_dict()
    else:
        top_10_values_counts    = df[col_name].value_counts().head(10).to_dict()
        distinct_values_counts  = {k: v for k, v in sorted(top_10_values_counts.items(), key=lambda item: item[1], reverse=True)}

    if col_dtype != "object":
       max_of_col = df[col_name].max()
       min_of_col = df[col_name].min()
       outlier_hi = df[col_name].mean() + 3*df[col_name].std()
       outlier_lo = df[col_name].mean() - 3*df[col_name].std()
    else:
       max_of_col = -1
       min_of_col =  1
       outlier_hi = -1
       outlier_lo =  1
    
    summary_lst.append({
      "name"                : col_name,
      "dtype"               : col_dtype,
      "# null"              : num_of_null,
      "% null"              : (100*percent_of_null).round(2),
      "# NOT null"          : num_of_non_null,
      "distinct val"        : num_of_distinct_values,
      "-3*sig"              : round(outlier_lo,2) ,
      "min"                 : round(min_of_col,2),
      "max"                 : round(max_of_col,2),
      "+3*sig"              : round(outlier_hi,2) ,
      "distinct val count"  : distinct_values_counts
    })
  
  tmp_df = pd.DataFrame(summary_lst)
  return tmp_df

In [None]:
# df.describe(include="all").T
# df.info()

tmp_df = quick_View(df)
display(tmp_df.sort_values(by="# null", ascending=False))                 

### <span style="color:orange"><b>Comments :</b></span>
* 4.5 M observations when all months are loaded (560k observations for April 2014)
* 0 % of null values
* Values outside $\bar{x} \pm 3\sigma$ (outliers) are found in
    * latitude
    * longitude

## Outliers

In [None]:
col_outliers = ["lat", "lon", ]

for col in col_outliers:
  # fig = px.box(df, y=col)
  # fig.show()
  # Plotly cannot handle the whole dataset from April to September,
  # so fall back to seaborn, a safe bet that always works.
  fig, ax = plt.subplots(figsize=(k_Width, k_Height))
  sns.boxplot(data=df, x=col, ax=ax)
  ax.set_title(f"Outliers - {col}")


How many observations are considered as outliers ?

In [None]:
for col in col_outliers:
  upper_bound = df[col].mean() + 3*df[col].std()
  lower_bound = df[col].mean() - 3*df[col].std()
  nb_out = df.shape[0] - df[((df[col] >= lower_bound) & (df[col] <= upper_bound)) | df[col].isna()].shape[0]
  print(f"{col} has {nb_out:>6_} outliers ({100*nb_out/df.shape[0]:.2f} %)")

In [None]:
def remove_Outliers_Sigma(df, column):
    mean_col = df[column].mean()
    sigma_col = df[column].std()

    lower_bound = mean_col - 3 * sigma_col
    upper_bound = mean_col + 3 * sigma_col
    df = df[((df[column] >= lower_bound) & (df[column] <= upper_bound)) | df[column].isna()]
    return df


In [None]:
print(f"Before outliers removal : {df.shape}")
for col in col_outliers:
    df = remove_Outliers_Sigma(df, col)
print(f"After  outliers removal : {df.shape}")


# Focus on one month

In [None]:
hour      = 7
day       = 9                                                      # 4 would be very cool but it's a Sunday, 9 is Friday
month     = 5                                                      # select another month if needed (it must have been loaded in df)
year      = 2014                                                   # so far year must be 2014

date_obj = datetime.datetime(year, month, day)                     # day is mandatory to create a date object but not yet used
date_string = date_obj.strftime("%Y-%B")

In [None]:
month_df  = df[(df["month"] == month) & (df["year"] == year)] 
month_df.describe().T

### Pickups during the month

In [None]:

fig = px.histogram(
  month_df, 
  x="day",
  height = k_HeightPx,
  width = k_WidthPx,
  title = f"{date_string} - Pickups per day",
)
fig.show()

* The graph here above shows pickups per day during the month  
* It seems there is a kind of cycle along the weeks

### Breakdown of the pickups per day of the week

In [None]:
fig = px.histogram(
  month_df, 
  x="weekday", 
  height = k_HeightPx, 
  width = k_WidthPx,
  title = f"{date_string} - Pickups per day (0=Monday)",
)
fig.show()


* The weekly pattern is confirmed over the selected month of 2014

In [None]:
fig = px.histogram(
  df, 
  x="weekday", 
  height = k_HeightPx, 
  width = k_WidthPx,
  title = "Apr-Sept 2014 - Pickups per day (0=Monday)",
)
fig.show()


### <span style="color:orange"><b>Comments :</b></span>

* The weekly pattern is confirmed over the April-September period of 2014
* Most active days of the week : Wednesday-Friday
* Least active days of the week : Sunday-Monday 
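
To put names on the weekday indices used in the histograms (0 = Monday), a minimal sketch on a made-up sample could look like this. `sample` and `WEEKDAY_NAMES` are illustrative names, not part of the notebook, and the counts are invented.

```python
import pandas as pd

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Tiny made-up sample: one row per pickup, weekday encoded as 0..6 (0 = Monday)
sample = pd.DataFrame({"weekday": [0, 2, 2, 4, 4, 4, 6]})

# Count pickups per weekday, keeping days without any pickup (count 0)
counts = sample["weekday"].value_counts().reindex(range(7), fill_value=0)
counts.index = WEEKDAY_NAMES

print(counts)
```

The same pattern should apply to the notebook's `df["weekday"]` column, which is built with `dt.weekday`.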


In [None]:
plt.figure(figsize=(k_Width, k_Height))
ax = sns.kdeplot(data = month_df, x = 'hour', fill = True, hue = 'weekday', palette = 'coolwarm', alpha = 0.2) 
ax.set_title(f"{date_string} - KDE plots pickups of each day vs hour")
ax.set_ylabel('Density of pickups')
ax.set_xticks(list(range(25))) # using a low bw_adjust value and cut=None doesn't work
plt.show()

### <span style="color:orange"><b>Comments :</b></span>

* Saturday and Sunday show a peak of activity between 0 and 1 AM
* All the other days show 2 peaks, around 7 AM and 6 PM 
* The density helps to see why Friday (day 4) is so important : the afternoon peak is not only high but also very wide 


### Localization of the pickups during the month

In [None]:

# it is safer to sort the values to be used as an animation_frame 
month_df = month_df.sort_values(by= "day", ascending=True)

fig = px.density_mapbox(
  month_df,
  lat="lat",
  lon="lon",
  animation_frame='day', 
  mapbox_style="carto-positron",
  radius=3,
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title= f"{date_string} - Localized pickups per day over the selected month"
)
fig.show()

* The more pickups you have on one spot, the higher the "temperature" of the spot
* "Temperature" obviously refers to the number of pickups
* The view is per day along the month of study
* At this point there is no information about timing along the day

<p align="center">
<img src="./assets/hotspots.png" alt="drawing" width="800"/>
</p>


### <span style="color:orange"><b>Comments :</b></span>

* Whatever the day of the month, the areas below always top the list : 
    * LaGuardia
    * Brooklyn
    * Greenpoint
    * Manhattan


In [None]:
month_df = month_df.sort_values(by= "hour", ascending=True)

fig = px.density_mapbox(
  month_df,
  lat="lat",
  lon="lon",
  animation_frame='hour', 
  mapbox_style="carto-positron",
  radius=3,
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title= f"{date_string} - Localized pickups per hour over the selected month"
)

fig.show()

* We look at all the days of the month at once
* We look how the pickups go per hour
* Brooklyn and LaGuardia slow down their pickups from 1 AM to 4 AM and 5 AM respectively
* On the other hand Manhattan and Greenpoint never stop (NY, the city that never sleeps ?)

In [None]:

# ! One must order the dataframe by "bin" before the statement animation_frame='bin' otherwise the animation slider is really weird
month_df = month_df.sort_values(by= "bin", ascending=True)

fig = px.density_mapbox(
  month_df,
  lat="lat",
  lon="lon",
  animation_frame="bin", 
  mapbox_style="carto-positron",
  radius=5,                        # ! radius changed from 3 to 5
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title= f"{date_string} - Localized pickups per bin of 5 minutes over the selected month"
)

fig.show()

* This graph is similar to the previous one but has a higher time resolution
* 12 (from 0 to 11) bins correspond to 1 hour
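
Since the animation slider only shows raw bin numbers, it can help to translate a bin index back into a clock interval. The sketch below is one possible way to do it; `bin_to_label` is a hypothetical helper, not a function of the notebook.

```python
def bin_to_label(bin_index: int, bin_minutes: int = 5) -> str:
    """Translate a time-bin index into an 'HH:MM-HH:MM' label."""
    bins_per_day = 24 * 60 // bin_minutes
    if not 0 <= bin_index < bins_per_day:
        raise ValueError(f"bin_index must be in [0, {bins_per_day - 1}]")
    start = bin_index * bin_minutes      # first minute of the bin
    end = start + bin_minutes - 1        # last minute of the bin
    return f"{start // 60:02d}:{start % 60:02d}-{end // 60:02d}:{end % 60:02d}"

print(bin_to_label(0))     # first bin of the day
print(bin_to_label(92))    # a bin in the morning
print(bin_to_label(287))   # last bin of the day
```

Such labels could be stored in an extra column and used as `animation_frame`, provided the DataFrame is still sorted by `bin`.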

# Focus on one day

In [None]:
date_obj = datetime.datetime(year, month, day)
date_string = date_obj.strftime("%Y-%B-%d-%a")          # https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior


In [None]:
day_df = month_df[month_df["day"]==day]
day_df.describe().T

In [None]:
day_df = day_df.sort_values(by= "hour", ascending=True)

fig = px.density_mapbox(
  day_df,
  lat="lat",
  lon="lon",
  animation_frame='hour', 
  mapbox_style="carto-positron",
  radius=5,
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title= f"{date_string} - Localized pickups during the next hour"
)

fig.show()

* No surprises. 
* For one day, we find the same pattern as with the monthly data, whether on the map or over the course of the day
* Bear in mind that, for every hour of the day, the graph shows where the pickups will take place in the next hour
* With a radius of 5 and some zooming and panning, we can observe that at 6 AM the area of SoHo (south Manhattan) and the east and west sides of the middle of Central Park "light up" simultaneously
* Then the center of Manhattan lights up
* Throughout the day, the “fire” spreads and covers the whole of Manhattan
* Combined with the next density graph, we can see where the 6 PM peak takes place


In [None]:
plt.figure(figsize=(k_Width, k_Height))
ax = sns.kdeplot(data = day_df, x = 'hour', fill = True, hue = 'weekday', palette = 'coolwarm', alpha = 0.2) 
ax.set_title(f"{date_string} - KDE plots pickups of each day vs hour")
ax.set_ylabel('Density of pickups')
ax.set_xticks(list(range(25))) # using a low bw_adjust value and cut=None doesn't work
plt.show()

* This graph is similar to the previous density plots
* It is useful when read in conjunction with the previous map

In [None]:
day_df = day_df.sort_values(by= "bin", ascending=True)

fig = px.density_mapbox(
  day_df,
  lat="lat",
  lon="lon",
  animation_frame='bin', 
  mapbox_style="carto-positron",
  radius=10,                          # ! radius changed from 3 to 10
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title= f"{date_string} - Localized pickups during the next 5 minutes",
  
)

fig.show()

* Same graph but with a higher time resolution
* Here too, keep in mind that, for every time bin of the day, the graph shows where the pickups will take place within the next 5 minutes


Let's try ``scatter_mapbox()`` instead of ``density_mapbox()``

In [None]:
fig = px.scatter_mapbox(
  day_df, 
  lat="lat", 
  lon="lon", 
  # opacity = 0.8,
  animation_frame='bin', 
  mapbox_style = "carto-positron",  
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title = f"{date_string} - Localized pickups during the next 5 minutes",
)
fig.show()

# Focus on one hour

In [None]:
date_obj = datetime.datetime(year, month, day, hour)
date_string = date_obj.strftime("%Y-%B-%d-%a-%HH") # https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
print(date_string)

In [None]:
hour_df = day_df[day_df["hour"]==hour]
hour_df.describe().T

In [None]:
fig = px.scatter_mapbox(
  hour_df, 
  lat="lat", 
  lon="lon", 
  opacity = 0.5,
  mapbox_style = "carto-positron",  
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title = f"{date_string} - Localized pickups during the next hour",
)
fig.show()

In [None]:
fig = px.scatter_mapbox(
  hour_df, 
  lat="lat", 
  lon="lon", 
  opacity = 0.5,
  animation_frame='bin', 
  mapbox_style = "carto-positron",  
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title = f"{date_string} - Localized pickups during the next 5 minutes",
)
fig.show()

### <span style="color:orange"><b>Comments :</b></span>

* Surprisingly, the graph using bins of 5 minutes seems more difficult to read
* On the first graph we can see that 
    * LaGuardia is up and running
    * The east and west sides of Central Park are active
    * South Manhattan & Greenpoint are active
    * Brooklyn is waking up
* Knowing what we know from the initial study, we could identify 6 zones for this hour


# RESUME FROM HERE -----------------------------

# KMeans

In [None]:
lat_lon_hour = hour_df[["lat", "lon"]]
lat_lon_hour.shape

In [None]:
scaler = StandardScaler()
lat_lon_hour = scaler.fit_transform(lat_lon_hour)


In [None]:
wcss = []
for i in range(1, 20):
  kmeans = KMeans(n_clusters=i, n_init='auto', init='k-means++', random_state = k_random_state)
  kmeans.fit(lat_lon_hour)
  wcss.append(kmeans.inertia_)

fig = px.line(
  x = list(range(1, 20)),
  y = wcss,
  labels = {"x": "# of clusters", "y": "WCSS (inertia)"},
  height = k_HeightPx,
  width = k_WidthPx,
  title = f"{date_string} - KMEANS - Determine the best # of clusters",
)
fig.show()
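
The elbow of the inertia curve is sometimes hard to read, so the silhouette score (already imported in the prelude) can serve as a second opinion. The sketch below runs on synthetic blobs, not on the Uber data; `points`, `scores` and `best_k` are illustrative names, and `n_init=10` is an arbitrary choice that also works with older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (an assumption for the demo)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

# Silhouette score for each candidate number of clusters (needs k >= 2)
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)   # k with the highest silhouette score
print(scores)
print(f"best k according to the silhouette score: {best_k}")
```

On real pickup data the silhouette computation is quadratic in the number of points, so it is probably wise to restrict it to one hour or to a random sample.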


In [None]:
hour_extended_df = hour_df.copy()

optimal_clusters = 6
kmeans = KMeans(n_clusters=optimal_clusters, n_init='auto', random_state=k_random_state)
hour_extended_df['cluster'] = kmeans.fit_predict(lat_lon_hour)

In [None]:

fig = px.scatter_mapbox(
  hour_extended_df, 
  lat="lat", 
  lon="lon", 
  opacity = 0.5,
  # animation_frame='bin', 
  mapbox_style = "carto-positron",  
  color='cluster',
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title = f"{date_string} - KMeans - Hot spots for the next hour",
)

fig.show()
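
The specifications ask to use the cluster coordinates to pin hot zones. Because KMeans is fitted on standardized coordinates, its centers have to be mapped back to degrees first. The following is a minimal sketch on synthetic points around two made-up hubs (the coordinates are approximate and only illustrative); `sample` and `centers` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic pickups around two made-up hubs in the New York area
rng = np.random.default_rng(0)
hubs = np.array([[40.7769, -73.8740], [40.7233, -74.0030]])   # [lat, lon], approximate
points = np.vstack([h + rng.normal(scale=0.002, size=(100, 2)) for h in hubs])
sample = pd.DataFrame(points, columns=["lat", "lon"])

# Same recipe as in the notebook: scale, then cluster
scaler = StandardScaler()
scaled = scaler.fit_transform(sample[["lat", "lon"]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Map the centers from the scaled space back to latitude/longitude
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=["lat", "lon"],
)
centers["pickups"] = np.bincount(kmeans.labels_, minlength=2)   # size of each cluster
print(centers)
```

The resulting `centers` DataFrame could then be passed to `px.scatter_mapbox()` with `size="pickups"` to pin one marker per hot zone.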

# DBSCAN

In [None]:
# lat_lon_hour = hour_df[["lat", "lon"]]
lat_lon_hour = np.deg2rad(hour_df[["lat", "lon"]])
lat_lon_hour.shape

In [None]:
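# No StandardScaler here: the haversine metric expects [lat, lon] in radians,
# and scaling would turn the radians into unitless values.
lat_lon_hour = lat_lon_hour.to_numpy()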


In [None]:
# pickups within 250 m of each other
kms_per_radian = 6371.0088                                  # mean radius of Earth in kilometers
epsilon = 0.25 / kms_per_radian                             # 0.25 km expressed in radians

# db = DBSCAN(eps=0.25, min_samples=6, metric="euclidean", algorithm="auto")
db = DBSCAN(eps=epsilon, min_samples=5, metric='haversine', algorithm="auto") # algorithm='ball_tree', metric="euclidean"
db.fit(lat_lon_hour)                                        # fit() also assigns the labels (db.labels_); DBSCAN has no predict()

hour_extended_df = hour_df.copy()
hour_extended_df['cluster'] = db.labels_

tmp_df = pd.DataFrame(db.labels_)
tmp_df[0].value_counts()

# print(len(set(db.labels_)))
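
To make the units explicit, here is a minimal self-contained sketch of DBSCAN with the haversine metric: coordinates are converted to radians and left unscaled, and `eps` is a distance in kilometers divided by the Earth radius. The data is synthetic (two made-up dense hubs plus two isolated points), and all variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0088

# Synthetic pickups: two dense hubs and two isolated points, as [lat, lon] in degrees
rng = np.random.default_rng(0)
hubs_deg = np.array([[40.7769, -73.8740], [40.7233, -74.0030]])
dense = np.vstack([h + rng.normal(scale=0.0005, size=(40, 2)) for h in hubs_deg])
isolated = np.array([[40.60, -73.75], [40.90, -74.10]])
coords_deg = np.vstack([dense, isolated])

# Haversine expects [lat, lon] in radians; eps is an angle (km / Earth radius)
coords_rad = np.deg2rad(coords_deg)
eps_km = 0.25
db = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,
    min_samples=5,
    metric="haversine",
    algorithm="ball_tree",
).fit(coords_rad)

labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # label -1 means noise
n_noise = int(np.sum(labels == -1))
print(f"{n_clusters} clusters, {n_noise} noise points")

# One possible "pin" per hot zone: the mean position of each cluster, in degrees
for label in sorted(set(labels) - {-1}):
    lat, lon = coords_deg[labels == label].mean(axis=0)
    print(f"cluster {label}: lat={lat:.4f}, lon={lon:.4f}")
```

Unlike KMeans, DBSCAN does not need the number of clusters in advance and leaves isolated pickups unassigned, which seems to match the idea of a hot zone fairly well.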


In [None]:
fig = px.scatter_mapbox(
  hour_extended_df, 
  lat="lat", 
  lon="lon", 
  opacity = 0.5,
  mapbox_style = "carto-positron",  
  color='cluster',
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title = f"{date_string} - DBSCAN - Hot spots for the next hour",
)
fig.show()

In [None]:
fig = px.scatter_mapbox(
  hour_extended_df[hour_extended_df["cluster"]!=-1], 
  lat="lat", 
  lon="lon", 
  opacity = 0.5,
  # animation_frame='bin', 
  mapbox_style = "carto-positron",  
  color='cluster',
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  title = f"{date_string} - DBSCAN (noise removed) - Hot spots for the next hour",
)

fig.show()

In [None]:
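from sklearn.neighbors import NearestNeighbors

# k-distance plot: distance from each point to its k-th nearest neighbor.
# Column 0 of `distances` is the point itself (distance 0), hence k + 1 neighbors.
k = 20
coords_rad = np.deg2rad(hour_df[["lat", "lon"]].to_numpy())   # unscaled [lat, lon] in radians, as haversine expects
neighbors = NearestNeighbors(n_neighbors=k + 1, metric="haversine")
neighbors_fit = neighbors.fit(coords_rad)
distances, indices = neighbors_fit.kneighbors(coords_rad)

In [None]:
distances = np.sort(distances[:, -1]) * kms_per_radian        # distance to the k-th neighbor, converted to km

# distances_df = pd.DataFrame(distances)
# fig = px.line(distances_df,  title='')
fig = px.line(distances,  title='Points sorted by distance to the 20th nearest neighbor (km)')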
fig.show()

https://www.sefidian.com/2022/12/18/how-to-determine-epsilon-and-minpts-parameters-of-dbscan-clustering/

Optimal eps = 0.25


In [None]:
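# NOTE: lat_lon_OfTheDay was never defined in this notebook; it is assumed here
# to be the [lat, lon] pairs of the selected day, in radians.
lat_lon_OfTheDay = np.deg2rad(day_df[["lat", "lon"]].to_numpy())

nn_model = NearestNeighbors(n_neighbors=2, metric="haversine")
nn_model.fit(lat_lon_OfTheDay)
distances, indices = nn_model.kneighbors(lat_lon_OfTheDay)

distances = np.sort(distances[:, 1]) * kms_per_radian       # column 0 is the point itself

fig = px.line(distances,  title='Points sorted by distance to the nearest neighbor (km)')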
fig.show()


# SCRAP BOOK - Please ignore


In [None]:
pivot_df = day_df.pivot_table(index="day", columns="hour", values='minute', aggfunc='count')
display(pivot_df)

In [None]:

       
# # -----------------------------------------------------------------------------
# def get_Tranformer(X):
#   numeric_features      = X.select_dtypes(include="number").columns
#   # categorical_features  = X.select_dtypes(exclude="number").columns

#   numerical_transformer = make_pipeline(
#       # SimpleImputer(strategy="median"),
#       StandardScaler(),
#   )

#   # categorical_transformer = make_pipeline(
#   #     SimpleImputer(strategy="most_frequent"),
#   #     OneHotEncoder(drop="first"),                 
#   # )

#   col_transformer = make_column_transformer(
#     (numerical_transformer,     numeric_features),
#     # (categorical_transformer, categorical_features),
#   )
#   return col_transformer


# wcss        = []
# nb_clstr    = []
# # silhouette  = []

# col_transformer = get_Tranformer(X)

# for n in range(2,20):
#   nb_clstr.append(n)
#   model = make_pipeline(
#     col_transformer,
#     KMeans(n_clusters= n, n_init='auto', random_state = k_random_state)        # default init = "k-means++", 
#   )
#   model.fit_transform(X)
#   # KMeans is the last step of the pipeline => use -1
#   # .steps returns (name, object) tuples
#   # The KMeans object is accessed via [1]
#   kmeans_model = model.steps[-1][1] 
#   wcss.append(kmeans_model.inertia_)
#   print(f"WCSS for             K = {n:02} is wcss = {wcss[-1]:_.2f}")

#   # silhouette.append(silhouette_score(X, model.predict(X)))
#   # print(f"Silhouette score for K = {n:02} is {silhouette[-1]}")









In [None]:
# april_04_2014_df = april_04_2014_df.sort_values(by = 'hour')
fig = px.density_mapbox(
  day_df, 
  lat = "lat", 
  lon = "lon", 
  z = "hour", 
  radius = 10,  # default = 30
  # opacity = 0.05,
  mapbox_style="open-street-map", 
  color_continuous_scale=px.colors.sequential.Rainbow,
  zoom = 10.0,
  height = k_HeightPx,
  width = k_WidthPx,
  animation_frame = "hour",
  title= "april 04 2014 - Pickup per hour", 
)

fig.show()

In [None]:
# wcss_df = pd.DataFrame(wcss)
# nb_clstr_df = pd.Series(nb_clstr)

# fig= px.line(
#     wcss_df,
#     x = nb_clstr_df,
#     y = wcss_df.iloc[:,-1]
# )

# fig.update_layout(
#     yaxis_title="Inertia",
#     xaxis_title="# Clusters",
#     title="Inertia per cluster"
# )
# fig.show() 

In [None]:
# # Create a data frame
# cluster_scores=pd.DataFrame(silhouette)
# nb_clstr_df = pd.Series(nb_clstr)

# # Create figure
# fig = px.bar(data_frame=cluster_scores,
#              x=nb_clstr_df,
#              y=cluster_scores.iloc[:, -1]
#             )

# # Add title and axis labels
# fig.update_layout(
#     yaxis_title="Silhouette Score",
#     xaxis_title="# Clusters",
#     title="Silhouette Score per cluster"
# )

# # Render
# #fig.show(renderer="notebook")
# fig.show() # if using workspace