# UBER Pickups

One of the main pain points Uber's team has identified is that drivers are sometimes not around when users need them. For example, a user might be in San Francisco's Financial District while Uber drivers are looking for customers in the Castro.
Even though the two neighborhoods are not far apart, users would still have to wait 10 to 15 minutes before being picked up, which is too long. Uber's research shows that users are willing to wait 5 to 7 minutes; beyond that, they tend to cancel their ride.

Therefore, Uber's data team would like to work on a project in which the app recommends hot zones in major cities where drivers should be at any given time of day.

To start off, Uber wants to try this feature in New York City, so we will focus only on that city.

## Part 1 : EDA

In [1]:
# Import useful libraries

import pandas as pd
import numpy as np
import datetime
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import  silhouette_score
import seaborn as sns
import matplotlib.pyplot as plt 
! pip install plotly -q
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "iframe_connected"

In [2]:
# Read the file

print("Loading dataset...")
df = pd.read_csv("uber-raw-data-apr14.csv")
print("...Done.")
print()
df.head()

Loading dataset...
...Done.



Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [3]:
# Basic statistics

print("Number of rows : {}".format(df.shape[0]))
print()

print("Number of columns : {}".format(df.shape[1]))
print()

print("Basics statistics: ")
df_desc = df.describe(include='all')
display(df_desc)
print()

print("Percentage of missing values: ")
print()
display(100*df.isnull().sum()/df.shape[0])

Number of rows : 564516

Number of columns : 4

Basics statistics: 


Unnamed: 0,Date/Time,Lat,Lon,Base
count,564516,564516.0,564516.0,564516
unique,41999,,,5
top,4/7/2014 20:21:00,,,B02682
freq,97,,,227808
mean,,40.740005,-73.976817,
std,,0.036083,0.050426,
min,,40.0729,-74.7733,
25%,,40.7225,-73.9977,
50%,,40.7425,-73.9848,
75%,,40.7607,-73.97,



Percentage of missing values: 



Date/Time    0.0
Lat          0.0
Lon          0.0
Base         0.0
dtype: float64

In [4]:
# Drop the "Base" column to focus only on location and time

useless_cols = ['Base']
print("Dropping column...")
df = df.drop(useless_cols, axis=1)
print("...Done.")
print()
df.head()

Dropping column...
...Done.



Unnamed: 0,Date/Time,Lat,Lon
0,4/1/2014 0:11:00,40.769,-73.9549
1,4/1/2014 0:17:00,40.7267,-74.0345
2,4/1/2014 0:21:00,40.7316,-73.9873
3,4/1/2014 0:28:00,40.7588,-73.9776
4,4/1/2014 0:33:00,40.7594,-73.9722


In [5]:
# Create new columns from the 'Date/Time' column for data processing
# Extract day of week and hour, but not year or month, since all data are from April 2014
# Drop the 'Date/Time' column

print("Convert to datetime ...")
df['Date/Time']= pd.to_datetime(df['Date/Time']) 
print("... Done")

print("Extracting hour ...")
df['Hour'] = df['Date/Time'].dt.hour
print("... Done")

print("Extract day of week ...")
df['DayOfWeek'] = df['Date/Time'].dt.weekday
print("... Done")

print("Drop column Date/Time ...")
df = df.drop(columns=["Date/Time"])
print("...Done")

df.head()

Convert to datetime ...
... Done
Extracting hour ...
... Done
Extract day of week ...
... Done
Drop column Date/Time ...
...Done


Unnamed: 0,Lat,Lon,Hour,DayOfWeek
0,40.769,-73.9549,0,1
1,40.7267,-74.0345,0,1
2,40.7316,-73.9873,0,1
3,40.7588,-73.9776,0,1
4,40.7594,-73.9722,0,1


In [6]:
# Distribution of variables

features = ["DayOfWeek", "Hour"]
fig1 = make_subplots(rows = len(features), cols = 1, subplot_titles = features)
for i in range(len(features)):
    fig1.add_trace(
        go.Histogram(
            x = df[features[i]], nbinsx = 50),
        row = i + 1,
        col = 1)
fig1.update_layout(
        title = go.layout.Title(text = "Distribution of variables", x = 0.5), showlegend = False, 
            autosize=True, height=1000)
fig1.show()

In [7]:
# Using plotly scatter mapbox, visualize all data points on a map

fig = px.scatter_mapbox(
        df, 
        lat="Lat", 
        lon="Lon",
        color="DayOfWeek",
        mapbox_style="carto-positron",
        zoom = 8
)

fig.show()

For this analysis, I started by studying a single day at a specific hour, hoping to generalize the approach afterwards.

The hour histogram above shows that 17h (5 p.m.) is the hour with the most pickups, so my analysis is based on that hour.
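
The busiest hour can also be read directly from the data rather than from the histogram. This is a minimal sketch, assuming a DataFrame with an `Hour` column like the one built above; the toy values are made up for illustration and are not the Uber data.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (illustrative values only)
toy = pd.DataFrame({"Hour": [8, 17, 17, 18, 17, 8, 21]})

# Count pickups per hour, then take the hour with the highest count
pickups_per_hour = toy["Hour"].value_counts()
busiest_hour = pickups_per_hour.idxmax()

print(pickups_per_hour.sort_index())
print("Busiest hour:", busiest_hour)
```

On the real data, replacing `toy` with `df` should give the same hour as the histogram.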

The dataset is quite large (564,516 rows), so I took a random sample of 10,000 rows before doing any further processing.

## Part 2 : Clustering with KMeans for Monday at 17h

In [8]:
df_monday = df.sample(10000, random_state=0)
df_monday = df_monday[df_monday['DayOfWeek']== 0]
df_monday = df_monday[df_monday['Hour']== 17]

In [9]:
df_monday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
477996,40.7436,-73.979,17,0
267987,40.7118,-74.0066,17,0
531289,40.7429,-73.9948,17,0
478114,40.7065,-73.931,17,0
380970,40.7299,-73.993,17,0


In [10]:
df_monday.shape

(95, 4)

In [11]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_monday.head())
X = preprocessor.fit_transform(df_monday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek
477996  40.7436 -73.9790    17          0
267987  40.7118 -74.0066    17          0
531289  40.7429 -73.9948    17          0
478114  40.7065 -73.9310    17          0
380970  40.7299 -73.9930    17          0

...Done.

[[-0.00896085 -0.03209147  0.        ]
 [-1.26806387 -0.50374966  0.        ]
 [-0.03667695 -0.30209869  0.        ]
 [-1.47791437  0.78818365  0.        ]
 [-0.5514046  -0.27133837  0.        ]]
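
The third column of the output is all zeros and there is no column for DayOfWeek. This is expected for a slice where Hour and DayOfWeek are constant, so the clustering effectively runs on scaled Lat and Lon only. A small sketch of that behaviour on made-up rows (the values are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy slice: two coordinates, a constant hour and a constant day of week
toy = np.array([
    [40.74, -73.98, 17, 0],
    [40.71, -74.01, 17, 0],
    [40.73, -73.99, 17, 0],
])

# A constant column has zero variance, so StandardScaler maps it to 0
scaled = StandardScaler().fit_transform(toy[:, :3])
print(scaled[:, 2])

# With a single category, drop='first' removes the only encoded column
encoded = OneHotEncoder(drop="first").fit_transform(toy[:, [3]])
print(encoded.shape)
```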



In [12]:
# Use the Elbow method to find the optimal number of clusters

wcss =  []
k = []
for i in range (2,15): 
    kmeans = KMeans(n_clusters= i, random_state = 0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    k.append(i)
    print("WCSS for K={} --> {}".format(i, wcss[-1]))

WCSS for K=2 --> 116.81501638667395
WCSS for K=3 --> 60.93353017168727
WCSS for K=4 --> 39.73011049964254
WCSS for K=5 --> 27.04116311949092
WCSS for K=6 --> 19.02354733742236
WCSS for K=7 --> 13.891484892228991
WCSS for K=8 --> 10.637693355965547
WCSS for K=9 --> 7.599644376320941
WCSS for K=10 --> 5.805632374706117
WCSS for K=11 --> 5.002954320284434
WCSS for K=12 --> 4.087405528247166
WCSS for K=13 --> 3.720662231816298
WCSS for K=14 --> 3.2299002495973728


In [13]:
# Create DataFrame
wcss_frame = pd.DataFrame(wcss)
k_frame = pd.Series(k)

# Create figure
fig= px.line(
    wcss_frame,
    x=k_frame,
    y=wcss_frame.iloc[:,-1]
)

# Create title and axis labels
fig.update_layout(
    yaxis_title="Inertia",
    xaxis_title="# Clusters",
    title="Inertia per cluster"
)

fig.show()

As we can see, WCSS decreases much more slowly after K=10, so the optimal number of clusters seems to be around 10. Let's verify with the silhouette method.
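
To make the elbow reading less subjective, one option is to look at the relative drop in WCSS for each additional cluster. The sketch below uses the WCSS values printed above, rounded to two decimals; where to put the cut-off remains a judgment call.

```python
# WCSS values copied from the output above, rounded to 2 decimals
wcss_by_k = {
    2: 116.82, 3: 60.93, 4: 39.73, 5: 27.04, 6: 19.02, 7: 13.89, 8: 10.64,
    9: 7.60, 10: 5.81, 11: 5.00, 12: 4.09, 13: 3.72, 14: 3.23,
}

# Relative decrease in WCSS when going from k-1 to k clusters
relative_drop = {
    k: (wcss_by_k[k - 1] - wcss_by_k[k]) / wcss_by_k[k - 1]
    for k in sorted(wcss_by_k)
    if k - 1 in wcss_by_k
}

for k, drop in relative_drop.items():
    print(f"K={k}: WCSS drops by {drop:.1%}")
```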

In [14]:
# Use the silhouette method to see if we can refine our hypothesis for k clusters

# Compute the mean silhouette score
sil = []
k = []

# We need to start at i=2 because the silhouette score requires at least 2 labels
for i in range (2,15): 
    kmeans = KMeans(n_clusters= i, random_state = 0)
    kmeans.fit(X)
    sil.append(silhouette_score(X, kmeans.predict(X)))
    k.append(i)
    print("Silhouette score for K={} is {}".format(i, sil[-1]))

Silhouette score for K=2 is 0.8469811615529953
Silhouette score for K=3 is 0.5078507478968481
Silhouette score for K=4 is 0.4949525464121174
Silhouette score for K=5 is 0.5125451950045308
Silhouette score for K=6 is 0.5013365549421439
Silhouette score for K=7 is 0.4922061332000023
Silhouette score for K=8 is 0.5030902214438921
Silhouette score for K=9 is 0.517474010006963
Silhouette score for K=10 is 0.5137459155550975
Silhouette score for K=11 is 0.4415830005034284
Silhouette score for K=12 is 0.4398556476912527
Silhouette score for K=13 is 0.43261923247886097
Silhouette score for K=14 is 0.42576413365340865


In [15]:
# Create a data frame 
cluster_scores=pd.DataFrame(sil)
k_frame = pd.Series(k)

# Create figure
fig = px.bar(data_frame=cluster_scores,  
             x=k, 
             y=cluster_scores.iloc[:, -1]
            )

# Add title and axis labels
fig.update_layout(
    yaxis_title="Silhouette Score",
    xaxis_title="Clusters",
    title="Silhouette Score per cluster"
)

fig.show()

The silhouette score is highest for K=2 (0.85), but that split is much coarser than the elbow suggests. Among the remaining values, K=9 scores best (0.52), which is close to the elbow estimate.
So I decided to use 9 clusters.
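
The ranking of the silhouette scores can be checked programmatically. This sketch uses the scores printed above, rounded to three decimals; restricting the search to K >= 3 is an assumption made here to ignore the two-cluster split.

```python
# Silhouette scores copied from the output above, rounded to 3 decimals
sil_by_k = {
    2: 0.847, 3: 0.508, 4: 0.495, 5: 0.513, 6: 0.501, 7: 0.492, 8: 0.503,
    9: 0.517, 10: 0.514, 11: 0.442, 12: 0.440, 13: 0.433, 14: 0.426,
}

# Best K over all candidates
best_overall = max(sil_by_k, key=sil_by_k.get)

# Best K when the two-cluster split is excluded (assumption: K >= 3)
best_from_3 = max((k for k in sil_by_k if k >= 3), key=sil_by_k.get)

print("Best K overall:", best_overall)
print("Best K with K >= 3:", best_from_3)
```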

In [16]:
# Train again a KMeans with the optimal number of clusters

kmeans = KMeans(n_clusters= 9)
kmeans.fit(X)

KMeans(n_clusters=9)

In [17]:
df_monday.loc[:,'Cluster_KMeans'] = kmeans.predict(X)
df_monday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,Cluster_KMeans
477996,40.7436,-73.979,17,0,0
267987,40.7118,-74.0066,17,0,2
531289,40.7429,-73.9948,17,0,0
478114,40.7065,-73.931,17,0,3
380970,40.7299,-73.993,17,0,2


In [18]:
# Visualization clusters on a map

fig = px.scatter_mapbox(df_monday, lat="Lat", lon="Lon",color="Cluster_KMeans",hover_name="Cluster_KMeans",
                        mapbox_style="carto-positron", zoom = 10)
fig.show()

KMeans seems to work, but let's also try DBSCAN to choose the better method for this analysis.

## Part 3 : Clustering with DBSCAN for Monday at 17h

In [19]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_monday.head())
X = preprocessor.fit_transform(df_monday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek  Cluster_KMeans
477996  40.7436 -73.9790    17          0               0
267987  40.7118 -74.0066    17          0               2
531289  40.7429 -73.9948    17          0               0
478114  40.7065 -73.9310    17          0               3
380970  40.7299 -73.9930    17          0               2

...Done.

[[-0.00896085 -0.03209147  0.        ]
 [-1.26806387 -0.50374966  0.        ]
 [-0.03667695 -0.30209869  0.        ]
 [-1.47791437  0.78818365  0.        ]
 [-0.5514046  -0.27133837  0.        ]]



In [20]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=10, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=10)

In [21]:
np.unique(db.labels_)

array([-1,  0,  1])

In [22]:
df_monday["cluster"] = db.labels_
df_monday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,Cluster_KMeans,cluster
477996,40.7436,-73.979,17,0,0,-1
267987,40.7118,-74.0066,17,0,2,-1
531289,40.7429,-73.9948,17,0,0,1
478114,40.7065,-73.931,17,0,3,-1
380970,40.7299,-73.993,17,0,2,-1


In [23]:
# Visualization clusters on a map

fig = px.scatter_mapbox(
        df_monday[df_monday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Monday at 17h, I found 2 clusters in Manhattan: the first in Midtown (mainly Midtown West, Midtown East and the Garment District) and the second also in Midtown (mainly Chelsea and Gramercy).
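
To back the description above with counts, the DBSCAN labels can be tallied per cluster, with -1 counted as noise. This is a minimal sketch on synthetic points (two dense blobs plus scattered points); the data and the `X_toy` and `db_toy` names are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in: two dense blobs plus scattered points
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.03, size=(30, 2))
blob_b = rng.normal(loc=[1.0, 1.0], scale=0.03, size=(30, 2))
scattered = rng.uniform(low=-3, high=3, size=(20, 2))
X_toy = np.vstack([blob_a, blob_b, scattered])

db_toy = DBSCAN(eps=0.2, min_samples=10, metric="manhattan").fit(X_toy)

# Count how many points fall in each label (-1 is DBSCAN's noise label)
labels, counts = np.unique(db_toy.labels_, return_counts=True)
for label, count in zip(labels, counts):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {count} points")
```

Applied to `db.labels_` in this notebook, the same tally shows how many of the sampled pickups each hot zone actually contains.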

DBSCAN seems better suited to this analysis: it isolates dense pickup areas and labels the remaining points as noise instead of assigning every point to a cluster. I decided to generalize with it.
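
The per-day steps that follow repeat the same filtering, scaling and DBSCAN calls. Below is a minimal sketch of how they could be wrapped in one function. `cluster_hot_zones` is a hypothetical helper, not part of the notebook, and the data are synthetic. It scales only Lat and Lon, on the assumption that the constant Hour column adds nothing to the distances.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def cluster_hot_zones(data, day, hour, eps=0.2, min_samples=10):
    """Hypothetical helper: DBSCAN on scaled Lat/Lon for one (day, hour) slot."""
    subset = data[(data["DayOfWeek"] == day) & (data["Hour"] == hour)].copy()
    # Hour and DayOfWeek are constant inside a slot, so only Lat/Lon are scaled
    coords = StandardScaler().fit_transform(subset[["Lat", "Lon"]])
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="manhattan")
    subset["cluster"] = model.fit_predict(coords)
    return subset


# Synthetic pickups around one made-up location, 30 rows per weekday
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Lat": rng.normal(40.75, 0.01, size=210),
    "Lon": rng.normal(-73.98, 0.01, size=210),
    "Hour": 17,
    "DayOfWeek": np.arange(210) % 7,
})

for day in range(7):
    result = cluster_hot_zones(toy, day=day, hour=17, min_samples=5)
    print(day, sorted(result["cluster"].unique()))
```

Keeping `eps` and `min_samples` as arguments leaves room to relax them for days with few pickups.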

## Part 4 : Generalize to every day of the week at 17h

#### For Tuesday

In [24]:
df_tuesday = df.sample(10000, random_state=0)
df_tuesday = df_tuesday[df_tuesday['DayOfWeek']== 1]
df_tuesday = df_tuesday[df_tuesday['Hour']== 17]
df_tuesday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
292814,40.7407,-73.9919,17,1
37887,40.7548,-73.9722,17,1
9815,40.7245,-73.9833,17,1
109902,40.755,-73.9943,17,1
110001,40.7674,-73.9594,17,1


In [25]:
df_tuesday.shape

(167, 4)

In [26]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer 

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_tuesday.head())
X = preprocessor.fit_transform(df_tuesday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek
292814  40.7407 -73.9919    17          1
37887   40.7548 -73.9722    17          1
9815    40.7245 -73.9833    17          1
109902  40.7550 -73.9943    17          1
110001  40.7674 -73.9594    17          1

...Done.

[[ 5.37714460e-04 -1.88858660e-01  0.00000000e+00]
 [ 4.87520884e-01  3.33044831e-01  0.00000000e+00]
 [-5.58974863e-01  3.89773816e-02  0.00000000e+00]
 [ 4.94428446e-01 -2.52440811e-01  0.00000000e+00]
 [ 9.22697332e-01  6.72149637e-01  0.00000000e+00]]



In [27]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=10, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=10)

In [28]:
np.unique(db.labels_)

array([-1,  0,  1])

In [29]:
df_tuesday["cluster"] = db.labels_
df_tuesday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
292814,40.7407,-73.9919,17,1,-1
37887,40.7548,-73.9722,17,1,0
9815,40.7245,-73.9833,17,1,-1
109902,40.755,-73.9943,17,1,-1
110001,40.7674,-73.9594,17,1,-1


In [30]:
# Visualization clusters on a map

fig = px.scatter_mapbox(
        df_tuesday[df_tuesday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 11
)

fig.show()

For Tuesday at 17h, I found 2 clusters in Manhattan: the first in Midtown (mainly Midtown East and Murray Hill) and the second in the Financial District.

#### For Wednesday

In [31]:
df_wednesday = df.sample(10000, random_state=0)
df_wednesday = df_wednesday[df_wednesday['DayOfWeek']== 2]
df_wednesday = df_wednesday[df_wednesday['Hour']== 17]
df_wednesday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
296073,40.7044,-73.9866,17,2
18762,40.7305,-73.9935,17,2
395738,40.7476,-73.9846,17,2
338082,40.7041,-74.011,17,2
116210,40.7476,-73.9822,17,2


In [32]:
df_wednesday.shape

(154, 4)

In [33]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer 

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_wednesday.head())
X = preprocessor.fit_transform(df_wednesday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek
296073  40.7044 -73.9866    17          2
18762   40.7305 -73.9935    17          2
395738  40.7476 -73.9846    17          2
338082  40.7041 -74.0110    17          2
116210  40.7476 -73.9822    17          2

...Done.

[[-1.37927515 -0.21455717  0.        ]
 [-0.49032342 -0.44035539  0.        ]
 [ 0.09209322 -0.14910842  0.        ]
 [-1.38949298 -1.01303201  0.        ]
 [ 0.09209322 -0.07056991  0.        ]]



In [34]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=10, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=10)

In [35]:
np.unique(db.labels_)

array([-1,  0,  1,  2])

In [36]:
df_wednesday["cluster"] = db.labels_
df_wednesday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
296073,40.7044,-73.9866,17,2,-1
18762,40.7305,-73.9935,17,2,-1
395738,40.7476,-73.9846,17,2,0
338082,40.7041,-74.011,17,2,-1
116210,40.7476,-73.9822,17,2,0


In [37]:
# Visualization clusters on a map

fig = px.scatter_mapbox(
        df_wednesday[df_wednesday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Wednesday at 17h, I found 3 clusters in Manhattan: the first in Midtown (mainly Midtown East and Murray Hill), the second in the Flatiron District and the third in Chelsea.

#### For Thursday

In [38]:
df_thursday = df.sample(10000, random_state=0)
df_thursday = df_thursday[df_thursday['DayOfWeek']== 3]
df_thursday = df_thursday[df_thursday['Hour']== 17]
df_thursday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
403520,40.7561,-73.9976,17,3
255005,40.8078,-73.9595,17,3
122304,40.7788,-73.956,17,3
499137,40.7654,-73.9721,17,3
403106,40.7577,-73.9825,17,3


In [39]:
df_thursday.shape

(121, 4)

In [40]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_thursday.head())
X = preprocessor.fit_transform(df_thursday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek
403520  40.7561 -73.9976    17          3
255005  40.8078 -73.9595    17          3
122304  40.7788 -73.9560    17          3
499137  40.7654 -73.9721    17          3
403106  40.7577 -73.9825    17          3

...Done.

[[ 0.5499176  -0.42051047  0.        ]
 [ 2.45074614  0.5442528   0.        ]
 [ 1.38451736  0.63287935  0.        ]
 [ 0.89184614  0.22519723  0.        ]
 [ 0.60874401 -0.03815022  0.        ]]



In [41]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=10, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=10)

In [42]:
np.unique(db.labels_)

array([-1,  0,  1,  2])

In [43]:
df_thursday["cluster"] = db.labels_
df_thursday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
403520,40.7561,-73.9976,17,3,-1
255005,40.8078,-73.9595,17,3,-1
122304,40.7788,-73.956,17,3,-1
499137,40.7654,-73.9721,17,3,2
403106,40.7577,-73.9825,17,3,1


In [44]:
# Visualization clusters on a map

fig = px.scatter_mapbox(
        df_thursday[df_thursday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Thursday at 17h, I found 3 clusters in Manhattan: the first in the Flatiron District, the second in Murray Hill and Midtown East, and the third in Midtown East.

#### For Friday

In [45]:
df_friday = df.sample(10000, random_state=0)
df_friday = df_friday[df_friday['DayOfWeek']== 4]
df_friday = df_friday[df_friday['Hour']== 17]
df_friday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
90104,40.7253,-73.978,17,4
53051,40.7712,-73.9822,17,4
411707,40.7546,-73.9819,17,4
304358,40.7161,-74.0149,17,4
304331,40.7699,-73.9843,17,4


In [46]:
df_friday.shape

(110, 4)

In [47]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer 

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_friday.head())
X = preprocessor.fit_transform(df_friday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek
90104   40.7253 -73.9780    17          4
53051   40.7712 -73.9822    17          4
411707  40.7546 -73.9819    17          4
304358  40.7161 -74.0149    17          4
304331  40.7699 -73.9843    17          4

...Done.

[[-0.68770162  0.06100928  0.        ]
 [ 0.81327338 -0.04969884  0.        ]
 [ 0.27043711 -0.04179112  0.        ]
 [-0.98855064 -0.91164063  0.        ]
 [ 0.77076211 -0.1050529   0.        ]]



In [48]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=10, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=10)

In [49]:
np.unique(db.labels_)

array([-1,  0])

In [50]:
df_friday["cluster"] = db.labels_
df_friday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
90104,40.7253,-73.978,17,4,-1
53051,40.7712,-73.9822,17,4,-1
411707,40.7546,-73.9819,17,4,0
304358,40.7161,-74.0149,17,4,-1
304331,40.7699,-73.9843,17,4,-1


In [51]:
# Visualization clusters on a map

fig = px.scatter_mapbox(
        df_friday[df_friday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Friday at 17h, I found 1 cluster in Manhattan, in Midtown East.

#### For Saturday

In [52]:
df_saturday = df.sample(10000, random_state=0)
df_saturday = df_saturday[df_saturday['DayOfWeek']== 5]
df_saturday = df_saturday[df_saturday['Hour']== 17]
df_saturday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
308670,40.7306,-73.9862,17,5
134436,40.7188,-73.9976,17,5
58858,40.7461,-73.9841,17,5
95782,40.7212,-73.9877,17,5
517702,40.7467,-73.9901,17,5


In [53]:
df_saturday.shape

(88, 4)

In [54]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_saturday.head())
X = preprocessor.fit_transform(df_saturday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek
308670  40.7306 -73.9862    17          5
134436  40.7188 -73.9976    17          5
58858   40.7461 -73.9841    17          5
95782   40.7212 -73.9877    17          5
517702  40.7467 -73.9901    17          5

...Done.

[[-0.16669379 -0.21573237  0.        ]
 [-0.44547433 -0.43854995  0.        ]
 [ 0.19950098 -0.17468703  0.        ]
 [-0.3887732  -0.24505047  0.        ]
 [ 0.21367626 -0.29195944  0.        ]]



In [55]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=10, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=10)

In [56]:
np.unique(db.labels_)

array([-1,  0])

In [57]:
df_saturday["cluster"] = db.labels_
df_saturday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
308670,40.7306,-73.9862,17,5,-1
134436,40.7188,-73.9976,17,5,0
58858,40.7461,-73.9841,17,5,-1
95782,40.7212,-73.9877,17,5,-1
517702,40.7467,-73.9901,17,5,-1


In [58]:
# Visualization clusters on a map

fig = px.scatter_mapbox(
        df_saturday[df_saturday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Saturday at 17h, I found 1 cluster in Manhattan, covering Tribeca and SoHo.

#### For Sunday

In [59]:
df_sunday = df.sample(10000, random_state=0)
df_sunday = df_sunday[df_sunday['DayOfWeek']== 6]
df_sunday = df_sunday[df_sunday['Hour']== 17]
df_sunday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
525584,40.7198,-73.9993,17,6
189416,40.7477,-73.9788,17,6
241099,40.7435,-73.992,17,6
472812,40.7509,-73.991,17,6
374512,40.7665,-73.9529,17,6


In [60]:
df_sunday.shape

(54, 4)

In [61]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_sunday.head())
X = preprocessor.fit_transform(df_sunday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek
525584  40.7198 -73.9993    17          6
189416  40.7477 -73.9788    17          6
241099  40.7435 -73.9920    17          6
472812  40.7509 -73.9910    17          6
374512  40.7665 -73.9529    17          6

...Done.

[[-0.34026588 -0.41530446  0.        ]
 [ 0.37893245 -0.06249805  0.        ]
 [ 0.27066604 -0.28967096  0.        ]
 [ 0.46142115 -0.27246089  0.        ]
 [ 0.86355355  0.38324274  0.        ]]



In [62]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=10, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=10)

In [63]:
np.unique(db.labels_)

array([-1])

The previous parameters are not suited to this day and hour: every point is labeled as noise. With only 54 pickups in the sample, min_samples=10 is too strict, so I lower it to find clusters.
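
One common way to guide the choice of eps for a given min_samples is a k-distance curve: for each point, compute the distance to its k-th nearest neighbour and sort these distances. The sketch below is an illustration on synthetic data, not the method used in this notebook; `X_toy` is a made-up stand-in with the same number of rows as the Sunday slice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(54, 2))  # synthetic stand-in for the scaled features

min_samples = 5

# Each query point is returned as its own nearest neighbour (distance 0),
# so the last column is the distance to the (min_samples - 1)-th other point
nn = NearestNeighbors(n_neighbors=min_samples, metric="manhattan").fit(X_toy)
distances, _ = nn.kneighbors(X_toy)
k_distances = np.sort(distances[:, -1])

print("Smallest k-distances:", k_distances[:5])
print("Largest k-distances:", k_distances[-5:])
```

A point can only be a DBSCAN core point if its k-distance is at most eps, so the sorted curve shows how many core points a given eps would produce.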

In [64]:
db = DBSCAN(eps=0.2, min_samples=5, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan')

In [65]:
np.unique(db.labels_)

array([-1,  0,  1])

In [66]:
df_sunday["cluster"] = db.labels_
df_sunday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
525584,40.7198,-73.9993,17,6,0
189416,40.7477,-73.9788,17,6,1
241099,40.7435,-73.992,17,6,-1
472812,40.7509,-73.991,17,6,-1
374512,40.7665,-73.9529,17,6,-1


In [67]:
# Visualization clusters on a map

fig = px.scatter_mapbox(
        df_sunday[df_sunday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Sunday at 17h, I found 2 clusters in Manhattan: the first in SoHo and the second in Murray Hill.

## Part 5 : Generalize to every day of the week

#### For Monday

In [68]:
df_all_monday = df.sample(10000, random_state=0)
df_all_monday = df_all_monday[df_all_monday['DayOfWeek']== 0]

In [69]:
df_all_monday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
377815,40.7818,-73.9561,9,0
529279,40.7311,-73.9841,11,0
430050,40.7743,-73.963,14,0
105728,40.7486,-73.9909,18,0
382258,40.7392,-73.987,19,0


In [70]:
df_all_monday.shape

(1099, 4)

In [71]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocess the dataset
print()
print("Preprocessing the dataset...")
print()
print(df_all_monday.head())
X = preprocessor.fit_transform(df_all_monday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing the dataset...

            Lat      Lon  Hour  DayOfWeek
377815  40.7818 -73.9561     9          0
529279  40.7311 -73.9841    11          0
430050  40.7743 -73.9630    14          0
105728  40.7486 -73.9909    18          0
382258  40.7392 -73.9870    19          0

...Done.

[[ 0.92679218  0.22404626 -0.8997545 ]
 [-0.28510074 -0.21831335 -0.53406286]
 [ 0.74751808  0.11503621  0.0144746 ]
 [ 0.1332055  -0.32574354  0.74585788]
 [-0.09148471 -0.26412917  0.9287037 ]]



In [72]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=12, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=12)

In [73]:
np.unique(db.labels_)

array([-1,  0,  1,  2])
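In scikit-learn's DBSCAN, the label `-1` marks noise points, so the output above means three clusters plus noise. A minimal sketch for counting how many points fall under each label, using a made-up `labels` array in place of `db.labels_`:

```python
import numpy as np

# Hypothetical labels standing in for db.labels_
labels = np.array([-1, 0, 0, 1, -1, 2, 2, 2, -1, -1])

# Count the points per label; -1 is DBSCAN's noise label
values, counts = np.unique(labels, return_counts=True)
sizes = {int(v): int(c) for v, c in zip(values, counts)}

# Number of real clusters and share of points left as noise
n_clusters = sum(1 for v in sizes if v != -1)
noise_share = sizes.get(-1, 0) / len(labels)
print(sizes, n_clusters, noise_share)
```

A very high noise share would suggest that `eps` is too small or `min_samples` too large for the sample size.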

In [74]:
df_all_monday["cluster"] = db.labels_
df_all_monday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
377815,40.7818,-73.9561,9,0,-1
529279,40.7311,-73.9841,11,0,-1
430050,40.7743,-73.963,14,0,-1
105728,40.7486,-73.9909,18,0,-1
382258,40.7392,-73.987,19,0,-1


In [75]:
# Visualize the clusters on a map

fig = px.scatter_mapbox(
        df_all_monday[df_all_monday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 11
)

fig.show()

For Monday, I found 3 clusters in Manhattan: the first in Midtown (Midtown East, Murray Hill and the Garment District), the second also in Midtown (Flatiron District, Chelsea East and Midtown East) and the third in SoHo.
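The cells in this part repeat the same steps for each day: filter the sample, scale the features, fit DBSCAN and attach the labels. As a sketch only, these steps could be wrapped in one helper. `cluster_day` and the synthetic `demo` frame are hypothetical names introduced here, and the helper scales `Lat`, `Lon` and `Hour` directly with `StandardScaler`, on the assumption that this matches the three-column array printed above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def cluster_day(data, day, eps=0.2, min_samples=12):
    """Filter one day of week, scale Lat/Lon/Hour and label rows with DBSCAN."""
    # .copy() avoids writing the new column into a slice of `data`
    day_df = data[data["DayOfWeek"] == day].copy()
    X = StandardScaler().fit_transform(day_df[["Lat", "Lon", "Hour"]])
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="manhattan")
    day_df["cluster"] = model.fit_predict(X)
    return day_df


# Synthetic stand-in for the sampled pickups (not the real Uber data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Lat": rng.normal(40.75, 0.02, 500),
    "Lon": rng.normal(-73.98, 0.02, 500),
    "Hour": rng.integers(0, 24, 500),
    "DayOfWeek": rng.integers(0, 7, 500),
})

# eps and min_samples here are illustrative values for the synthetic data
monday = cluster_day(demo, day=0, eps=0.5, min_samples=5)
print(monday["cluster"].value_counts())
```

Calling the helper with `day=1`, `day=2` and so on, together with the `eps` and `min_samples` chosen for that day, would replace the duplicated cells.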

#### For Tuesday

In [76]:
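# Reproducible 10,000-row sample, filtered to Tuesday (DayOfWeek == 1); .copy() lets us add columns later without a pandas warning
df_all_tuesday = df.sample(10000, random_state=0)
df_all_tuesday = df_all_tuesday[df_all_tuesday['DayOfWeek'] == 1].copy()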

In [77]:
df_all_tuesday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
440591,40.7212,-74.0011,21,1
219003,40.7145,-74.0085,6,1
438222,40.7432,-73.9934,18,1
484249,40.7689,-73.9607,18,1
435079,40.798,-73.9735,11,1


In [78]:
df_all_tuesday.shape

(1648, 4)

In [79]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocessings on dataset
print()
print("Preprocessing on the train set...")
print()
print(df_all_tuesday.head())
X = preprocessor.fit_transform(df_all_tuesday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing on the train set...

            Lat      Lon  Hour  DayOfWeek
440591  40.7212 -74.0011    21          1
219003  40.7145 -74.0085     6          1
438222  40.7432 -73.9934    18          1
484249  40.7689 -73.9607    18          1
435079  40.7980 -73.9735    11          1

...Done.

[[-0.60564831 -0.50750026  1.20619372]
 [-0.80203806 -0.6759859  -1.58828695]
 [ 0.03921356 -0.33218412  0.64729759]
 [ 0.79252948  0.41234025  0.64729759]
 [ 1.64550587  0.12090563 -0.65679339]]



In [80]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=15, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=15)

In [81]:
np.unique(db.labels_)

array([-1,  0,  1,  2])

In [82]:
df_all_tuesday["cluster"] = db.labels_
df_all_tuesday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
440591,40.7212,-74.0011,21,1,-1
219003,40.7145,-74.0085,6,1,-1
438222,40.7432,-73.9934,18,1,1
484249,40.7689,-73.9607,18,1,-1
435079,40.798,-73.9735,11,1,-1


In [83]:
# Visualize the clusters on a map

fig = px.scatter_mapbox(
        df_all_tuesday[df_all_tuesday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 11
)

fig.show()

For Tuesday, I found 3 clusters in Manhattan: the first in Midtown (Midtown East, Murray Hill and the Garment District), the second also in Midtown (Flatiron District, Chelsea East and Midtown East) and the third in SoHo.
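Reading neighbourhood names off the map can be complemented by a small table of cluster centres. The sketch below uses a made-up `labelled` frame in place of `df_all_tuesday` and averages the coordinates of each non-noise cluster; the values are illustrative only.

```python
import pandas as pd

# Hypothetical labelled pickups standing in for df_all_tuesday
labelled = pd.DataFrame({
    "Lat": [40.750, 40.752, 40.724, 40.726, 40.800],
    "Lon": [-73.980, -73.982, -74.000, -74.002, -73.950],
    "Hour": [18, 19, 21, 22, 3],
    "cluster": [0, 0, 1, 1, -1],
})

# Drop noise, then compute the mean position and pickup count per cluster
centers = (
    labelled[labelled["cluster"] != -1]
    .groupby("cluster")
    .agg(Lat=("Lat", "mean"), Lon=("Lon", "mean"), pickups=("Lat", "size"))
)
print(centers)
```

The centre coordinates could then be looked up on a map or geocoded to name each hot zone consistently across days.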

#### For Wednesday

In [84]:
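# Reproducible 10,000-row sample, filtered to Wednesday (DayOfWeek == 2); .copy() lets us add columns later without a pandas warning
df_all_wednesday = df.sample(10000, random_state=0)
df_all_wednesday = df_all_wednesday[df_all_wednesday['DayOfWeek'] == 2].copy()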

In [85]:
df_all_wednesday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
216098,40.7555,-73.9917,20,2
544773,40.7276,-73.953,11,2
295739,40.7537,-73.9807,16,2
153414,40.723,-73.9987,13,2
250316,40.7767,-73.9531,11,2


In [86]:
df_all_wednesday.shape

(1932, 4)

In [87]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocessings on dataset
print()
print("Preprocessing on the train set...")
print()
print(df_all_wednesday.head())
X = preprocessor.fit_transform(df_all_wednesday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing on the train set...

            Lat      Lon  Hour  DayOfWeek
216098  40.7555 -73.9917    20          2
544773  40.7276 -73.9530    11          2
295739  40.7537 -73.9807    16          2
153414  40.7230 -73.9987    13          2
250316  40.7767 -73.9531    11          2

...Done.

[[ 0.38918288 -0.31527897  1.00047312]
 [-0.43179216  0.49971847 -0.70353575]
 [ 0.33621675 -0.08362596  0.24313584]
 [-0.56715005 -0.46269453 -0.32486712]
 [ 1.01300621  0.49761253 -0.70353575]]



In [88]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=14, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=14)

In [89]:
np.unique(db.labels_)

array([-1,  0,  1,  2])

In [90]:
df_all_wednesday["cluster"] = db.labels_
df_all_wednesday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
216098,40.7555,-73.9917,20,2,0
544773,40.7276,-73.953,11,2,-1
295739,40.7537,-73.9807,16,2,0
153414,40.723,-73.9987,13,2,-1
250316,40.7767,-73.9531,11,2,-1


In [91]:
# Visualize the clusters on a map

fig = px.scatter_mapbox(
        df_all_wednesday[df_all_wednesday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 11
)

fig.show()

For Wednesday, I found 3 clusters in Manhattan: the first in Midtown (Midtown East, Murray Hill, the Garment District and Gramercy), the second Downtown in Greenwich Village and the third also Downtown in SoHo.

#### For Thursday

In [92]:
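# Reproducible 10,000-row sample, filtered to Thursday (DayOfWeek == 3); .copy() lets us add columns later without a pandas warning
df_all_thursday = df.sample(10000, random_state=0)
df_all_thursday = df_all_thursday[df_all_thursday['DayOfWeek'] == 3].copy()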

In [93]:
df_all_thursday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
160769,40.645,-73.7819,13,3
48518,40.756,-73.9832,21,3
348950,40.756,-73.9693,22,3
496295,40.7281,-73.8701,11,3
455826,40.75,-74.0026,21,3


In [94]:
df_all_thursday.shape

(1509, 4)

In [95]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocessings on dataset
print()
print("Preprocessing on the train set...")
print()
print(df_all_thursday.head())
X = preprocessor.fit_transform(df_all_thursday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing on the train set...

            Lat      Lon  Hour  DayOfWeek
160769  40.6450 -73.7819    13          3
48518   40.7560 -73.9832    21          3
348950  40.7560 -73.9693    22          3
496295  40.7281 -73.8701    11          3
455826  40.7500 -74.0026    21          3

...Done.

[[-2.77009241  3.90163861 -0.34066159]
 [ 0.39568915 -0.16004072  1.10433724]
 [ 0.39568915  0.12042298  1.28496209]
 [-0.40003433  2.12200564 -0.70191129]
 [ 0.22456582 -0.55147926  1.10433724]]



In [96]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.13, min_samples=12, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.13, metric='manhattan', min_samples=12)

In [97]:
np.unique(db.labels_)

array([-1,  0,  1,  2])

In [98]:
df_all_thursday["cluster"] = db.labels_
df_all_thursday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
160769,40.645,-73.7819,13,3,-1
48518,40.756,-73.9832,21,3,-1
348950,40.756,-73.9693,22,3,-1
496295,40.7281,-73.8701,11,3,-1
455826,40.75,-74.0026,21,3,-1


In [99]:
# Visualize the clusters on a map

fig = px.scatter_mapbox(
        df_all_thursday[df_all_thursday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Thursday, I found 3 clusters in Manhattan: the first in Midtown East and the second and third both in Gramercy.

#### For Friday

In [100]:
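# Reproducible 10,000-row sample, filtered to Friday (DayOfWeek == 4); .copy() lets us add columns later without a pandas warning
df_all_friday = df.sample(10000, random_state=0)
df_all_friday = df_all_friday[df_all_friday['DayOfWeek'] == 4].copy()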

In [101]:
df_all_friday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
169442,40.7076,-74.0101,13,4
349900,40.7171,-74.0144,2,4
232582,40.7843,-73.9769,16,4
352668,40.7127,-73.9411,11,4
90104,40.7253,-73.978,17,4


In [102]:
df_all_friday.shape

(1573, 4)

In [103]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocessings on dataset
print()
print("Preprocessing on the train set...")
print()
print(df_all_friday.head())
X = preprocessor.fit_transform(df_all_friday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing on the train set...

            Lat      Lon  Hour  DayOfWeek
169442  40.7076 -74.0101    13          4
349900  40.7171 -74.0144     2          4
232582  40.7843 -73.9769    16          4
352668  40.7127 -73.9411    11          4
90104   40.7253 -73.9780    17          4

...Done.

[[-0.99944652 -0.69784081 -0.33714938]
 [-0.71499739 -0.79344565 -2.24420914]
 [ 1.29710592  0.04031745  0.18295782]
 [-0.84674225  0.8362833  -0.68388752]
 [-0.46947288  0.0158604   0.35632689]]



In [104]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=17, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=17)

In [105]:
np.unique(db.labels_)

array([-1,  0,  1])

In [106]:
df_all_friday["cluster"] = db.labels_
df_all_friday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
169442,40.7076,-74.0101,13,4,-1
349900,40.7171,-74.0144,2,4,-1
232582,40.7843,-73.9769,16,4,-1
352668,40.7127,-73.9411,11,4,-1
90104,40.7253,-73.978,17,4,-1


In [107]:
# Visualize the clusters on a map

fig = px.scatter_mapbox(
        df_all_friday[df_all_friday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Friday, I found 2 clusters in Manhattan: the first in Midtown East and the second in the West Village.

#### For Saturday

In [108]:
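# Reproducible 10,000-row sample, filtered to Saturday (DayOfWeek == 5); .copy() lets us add columns later without a pandas warning
df_all_saturday = df.sample(10000, random_state=0)
df_all_saturday = df_all_saturday[df_all_saturday['DayOfWeek'] == 5].copy()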

In [109]:
df_all_saturday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
239000,40.7475,-74.0089,23,5
512706,40.7475,-73.982,0,5
422568,40.7685,-73.9847,23,5
364300,40.7273,-74.0017,15,5
308670,40.7306,-73.9862,17,5


In [110]:
df_all_saturday.shape

(1325, 4)

In [111]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocessings on dataset
print()
print("Preprocessing on the train set...")
print()
print(df_all_saturday.head())
X = preprocessor.fit_transform(df_all_saturday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing on the train set...

            Lat      Lon  Hour  DayOfWeek
239000  40.7475 -74.0089    23          5
512706  40.7475 -73.9820     0          5
422568  40.7685 -73.9847    23          5
364300  40.7273 -74.0017    15          5
308670  40.7306 -73.9862    17          5

...Done.

[[ 0.30971178 -0.63658196  1.19747928]
 [ 0.30971178 -0.05684711 -2.17433824]
 [ 0.8652213  -0.11503611  1.19747928]
 [-0.22463547 -0.4814113   0.02467318]
 [-0.13734112 -0.14736334  0.31787471]]



In [112]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=14, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=14)

In [113]:
np.unique(db.labels_)

array([-1,  0,  1])

In [114]:
df_all_saturday["cluster"] = db.labels_
df_all_saturday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
239000,40.7475,-74.0089,23,5,-1
512706,40.7475,-73.982,0,5,-1
422568,40.7685,-73.9847,23,5,-1
364300,40.7273,-74.0017,15,5,-1
308670,40.7306,-73.9862,17,5,-1


In [115]:
# Visualize the clusters on a map

fig = px.scatter_mapbox(
        df_all_saturday[df_all_saturday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 11
)

fig.show()

For Saturday, I found 2 clusters in Manhattan: the first in Downtown North (West Village, Greenwich Village, SoHo and NoHo) and the second in the Meatpacking District.

#### For Sunday

In [116]:
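# Reproducible 10,000-row sample, filtered to Sunday (DayOfWeek == 6); .copy() lets us add columns later without a pandas warning
df_all_sunday = df.sample(10000, random_state=0)
df_all_sunday = df_all_sunday[df_all_sunday['DayOfWeek'] == 6].copy()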

In [117]:
df_all_sunday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek
63031,40.78,-73.9486,5,6
100482,40.7495,-73.9917,11,6
526979,40.774,-73.8723,22,6
374098,40.7394,-73.9912,16,6
370963,40.7394,-74.008,1,6


In [118]:
df_all_sunday.shape

(914, 4)

In [119]:
# Create pipeline for numeric features

print("Creating numeric pipeline...")
numeric_features = [0,1,2] 
numeric_transformer = Pipeline(steps=[ ('scaler', StandardScaler())])
print("... Done")

# Create pipeline for categorical features

print("Creating categorical pipeline...")
categorical_features = [3]
categorical_transformer = Pipeline(
    steps=[('encoder', OneHotEncoder(drop='first'))])
print("... Done")

# Use ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)  
    ])

# Preprocessings on dataset
print()
print("Preprocessing on the train set...")
print()
print(df_all_sunday.head())
X = preprocessor.fit_transform(df_all_sunday)
print()
print('...Done.')
print()
print(X[0:5, :])
print()

Creating numeric pipeline...
... Done
Creating categorical pipeline...
... Done

Preprocessing on the train set...

            Lat      Lon  Hour  DayOfWeek
63031   40.7800 -73.9486     5          6
100482  40.7495 -73.9917    11          6
526979  40.7740 -73.8723    22          6
374098  40.7394 -73.9912    16          6
370963  40.7394 -74.0080     1          6

...Done.

[[ 1.17586168  0.45089613 -0.93215382]
 [ 0.38405773 -0.31557729 -0.09646123]
 [ 1.02009697  1.80778528  1.43564185]
 [ 0.12185379 -0.30668549  0.59994926]
 [ 0.12185379 -0.60545007 -1.48928222]]



In [120]:
# Instantiate DBSCAN

db = DBSCAN(eps=0.2, min_samples=25, metric="manhattan")
db.fit(X)

DBSCAN(eps=0.2, metric='manhattan', min_samples=25)

In [121]:
np.unique(db.labels_)

array([-1,  0])

In [122]:
df_all_sunday["cluster"] = db.labels_
df_all_sunday.head()

Unnamed: 0,Lat,Lon,Hour,DayOfWeek,cluster
63031,40.78,-73.9486,5,6,-1
100482,40.7495,-73.9917,11,6,-1
526979,40.774,-73.8723,22,6,-1
374098,40.7394,-73.9912,16,6,-1
370963,40.7394,-74.008,1,6,-1


In [123]:
# Visualize the clusters on a map

fig = px.scatter_mapbox(
        df_all_sunday[df_all_sunday.cluster != -1], 
        lat="Lat", 
        lon="Lon",
        color="cluster",
        hover_name="cluster",
        mapbox_style="carto-positron",
        color_continuous_scale ="viridis",
        zoom = 12
)

fig.show()

For Sunday, I found 1 cluster in Manhattan, in Downtown North (East Village, NoHo and the Lower East Side).

## Part 6 : Conclusion

Midtown Manhattan seems to be the hot zone on weekdays, while on weekends the hot zone shifts to Downtown North.

For a more precise and detailed picture, this analysis should be repeated for each day at each hour, so that the areas needing more drivers can be targeted precisely.
Ideally, with more data in the sample, the clusters would be refined down to street level, because a New York district is very large.
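One possible way to pursue both ideas is sketched below. It runs DBSCAN separately for every (day, hour) slot and, instead of the Manhattan metric on scaled features used above, swaps in the haversine metric on raw coordinates so that `eps` can be expressed as a real distance. The function name `hot_zones`, the parameter `eps_km`, the chosen values and the synthetic `demo` data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0


def hot_zones(data, eps_km=0.3, min_samples=10):
    """Run DBSCAN per (DayOfWeek, Hour) slot with eps given in kilometres."""
    rows = []
    for (day, hour), slot in data.groupby(["DayOfWeek", "Hour"]):
        # The haversine metric expects [lat, lon] pairs in radians
        coords = np.radians(slot[["Lat", "Lon"]].to_numpy())
        labels = DBSCAN(
            eps=eps_km / EARTH_RADIUS_KM,  # convert km to radians
            min_samples=min_samples,
            metric="haversine",
            algorithm="ball_tree",
        ).fit_predict(coords)
        clustered = slot.assign(cluster=labels)
        clustered = clustered[clustered["cluster"] != -1]  # drop noise
        for cluster_id, grp in clustered.groupby("cluster"):
            rows.append({
                "DayOfWeek": day, "Hour": hour, "cluster": cluster_id,
                "Lat": grp["Lat"].mean(), "Lon": grp["Lon"].mean(),
                "pickups": len(grp),
            })
    return pd.DataFrame(rows)


# Synthetic pickups: two tight blobs in a single Monday 18:00 slot
rng = np.random.default_rng(0)
blob_a = pd.DataFrame({"Lat": rng.normal(40.7580, 0.0005, 40),
                       "Lon": rng.normal(-73.9855, 0.0005, 40)})
blob_b = pd.DataFrame({"Lat": rng.normal(40.7230, 0.0005, 40),
                       "Lon": rng.normal(-74.0000, 0.0005, 40)})
demo = pd.concat([blob_a, blob_b], ignore_index=True)
demo["DayOfWeek"] = 0
demo["Hour"] = 18

zones = hot_zones(demo, eps_km=0.3, min_samples=10)
print(zones)
```

With `eps` in kilometres, tightening the radius is a direct way to move from district-sized zones toward street-level ones, provided each slot contains enough pickups.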