# DBSCAN
This notebook uses DBSCAN as a clustering density based approach.



Note: before starting, for reasons of time we could,for this delivery, do the clustering on the full dataset so for now we decided to employ some sort of data reduction as to make it feasible to run such an algorithm

The approach we used was:
- aggregation of races features for each races instance (the year races couple in the dataset)
- remove useless features that don't mean anything after aggregation e.g. stages
- chunking on the dataset reduction every two years to get small enough clusters

In [None]:
import datetime
from sklearn.cluster import DBSCAN
from os import path
import pandas as pd
from sklearn.preprocessing import StandardScaler

def most_frequent(series):
    return series.mode()[0] if not series.mode().empty else series.iloc[0]
RACES_PATH=path.join("..","dataset","engineered_races.csv")
races_df=pd.read_csv(RACES_PATH)

#aggregation of races to reduce dataset size
clustering_data=races_df.groupby(['date','stage','std_name','cyclist']).agg({
    'profile':most_frequent,
    'is_tarmac':most_frequent,
    'difficulty_level':most_frequent,

    'points':'sum',

    'length':'mean',
    'climb_total':'mean',
    'competitive_age':'mean',
    'startlist_quality':'mean',
    'delta':'mean',
    'performance_index':'mean',
    'difficulty':'mean',
    'convenience_score':'mean',
    'difficulty_score':'mean',
    'gain_ratio':'mean',

    'cyclist_age':'first',
    'position':'first',
    'cyclist_team':'first',
}).reset_index()


#convert to timestamp(units are useless since it's getting normalized)
clustering_data['date']=pd.to_datetime(clustering_data['date'])
clustering_data['day']=clustering_data['date'].dt.day
clustering_data['month']=clustering_data['date'].dt.month
clustering_data['year']=clustering_data['date'].dt.year

#one hot encoding difficulty
ohe_diff_lvl=pd.get_dummies(races_df['difficulty_level']).astype(float)

#dividing into chunks
dec_cut=pd.date_range(
    start=clustering_data['date'].min(),
    end=clustering_data['date'].max(),
    freq='2YE'
)
#apply chunks
clustering_data['decade']=pd.cut(
    clustering_data['date'],
    bins=dec_cut,
)
clustering_data[ohe_diff_lvl.columns]=ohe_diff_lvl
#remove useless columns
clustering_data=clustering_data.drop(columns="date")

clustering_data

# clustering organization

A few notes are due before starting, first the eps are difficulty to setup for now a good strategy would be to take inspiration using the first paper the introduced the algorithm, which you can find [here](https://dl.acm.org/doi/10.5555/3001460.3001507), and use the distance from the k-th NN varying K until we find a good eps value for us.

## applying the elbow method
In this part since it is diifcult to estimate values we picked a kth neighbor that is not too low to have an eps taht is higher and manages to reach more points.

In [None]:
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import itertools as it
import numpy as np
import utils
clustering_data=clustering_data.drop(columns=["difficulty_level","stage","std_name","cyclist","cyclist_team","easy","hard","moderate","is_tarmac","gain_ratio","difficulty_score","position"]).drop_duplicates()

std_scaler=StandardScaler()

print(clustering_data.columns)

dec_groups=clustering_data.groupby('decade')
normalized_decade_groups={k:std_scaler.fit_transform(g.drop(columns="decade").drop_duplicates()) for k,g in dec_groups }

print({k:len(g) for k,g in normalized_decade_groups.items()})

initial_eps=dict()

kth_neighbor=30

for k,data in normalized_decade_groups.items():
    min_pts=data.shape[1]
    nn=NearestNeighbors(n_neighbors=min_pts-1,n_jobs=-1)
    nn.fit(data)
    distances,indices= nn.kneighbors(data)
    k_distances= np.sort(distances[:, -1])

    initial_eps[k]=k_distances[kth_neighbor-1]


print(f"""
number of groups={len(normalized_decade_groups)}
initial eps values per group={initial_eps}
""")

Now that we have the sorted distances we can pick the eps values and proceed to test dbscan , in this case we have more starting eps values given the segmentation hance we have a lot of tests to do.


NOTE: since we didn't manage to make execution feasible we had to cut the clusterings and we go only from 1970 to 1994 with jumps of two years

NOTE: DBSCAN relies a lot on the density we where afraid that sampling would make clustering meaningless because of too many points removed and having a too approximated distribution.

In [None]:

# useful for reference
db_scan_mapping={
    -1:'noisy',
    0:'border',
    1:'core'

}

group_results=pd.DataFrame()


for k,decade_data in normalized_decade_groups.items():
    #NOTE: this might have to be revisited for it's just to try if everyting works
    dimension=decade_data.shape[0]
    min_pts=int(dimension-1)
    #using the method seen at laboratory to select initial values
    #print(decade_data.drop(columns="decade").info())
    maximum_distance = abs(decade_data.max() - decade_data.min()).sum().item()
    average_concentration = dimension / maximum_distance
    #use diferent scales for eps values
    # during the tests a lot of low values where not taken into consideration
    eps_values=initial_eps[k] * np.array([500,250,100,50,10, 5, 2.5, 1, 0.1, 0.01, 0.0001])
    #try various metrics
    metrics=['euclidean']

    min_pts_values=[min_pts]
    print(
    f"""
    period {k}
    maxium distance: {maximum_distance}
    average concentration:{average_concentration}
    eps values:{eps_values}
    used metrics:{metrics}
    number of minimum samples:{min_pts}
    number of samples used:{decade_data.shape[0]}
    """
    )
    #normalization is done for each group
    result=utils.run_dbscan(min_pts_values,eps_values,metrics,decade_data)
    result["group"]=k
    group_results=pd.concat([group_results,result])
group_results.reset_index()

As for the result we only managed to find a meaningful clustering in th first two years with a silhoutte score of 0.79 all the otehr are all noise , however after some consideration we found out that we don't have any meaningful clusteriong because we have all points taht are core so eps is too high bnut lowering it doesn't change even after testing very different scales both big and small.

In [None]:
best_idx=0
best_params=group_results.iloc[best_idx]

best_dbscan=DBSCAN(eps=best_params['eps'],min_samples=best_params['min_samples']).fit(normalized_decade_groups[best_params['group']])

labels=best_dbscan.labels_

statistics=np.unique(best_dbscan.labels_,return_counts=True)

print(
f"""
results:{best_params}
statistics:
    raw counts: noise {statistics[0][0]}| core {statistics[1][0]}
"""
)




So for this first delivery we can only say taht after some consideration the dataset tends to be very sparse, probably some sensd approaches would be to:
- use sampling and make a bigger hyperparameters space.
- find a more refined method to select the eps values.
- use different segmentations.

As you can see clustyring by years even after trying to reduce the dimension to get less sparse clusters is not effective, aside from a nice 0.71 in the first part we cannot get much more than that, we can infer very different densities across years which makes for very bad clusterings.

So we can try other kind of segmentations, a first approach could be geospatial: group by the race occurencies across time.

## geospatial clustering

In [1]:
import datetime
from sklearn.cluster import DBSCAN
from os import path
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import itertools as it
import numpy as np
import utils

RACES_PATH=path.join("..","dataset","engineered_races.csv")
races_df=pd.read_csv(RACES_PATH)
def most_frequent(series):
    return series.mode()[0] if not series.mode().empty else series.iloc[0]

clustering_data=races_df.groupby(['date','stage','std_name','cyclist']).agg({
    'profile':most_frequent,
    'is_tarmac':most_frequent,
    'difficulty_level':most_frequent,

    'points':'sum',

    'length':'mean',
    'climb_total':'mean',
    'competitive_age':'mean',
    'startlist_quality':'mean',
    'delta':'mean',
    'performance_index':'mean',
    'difficulty':'mean',
    'convenience_score':'mean',
    'difficulty_score':'mean',
    'gain_ratio':'mean',

    'cyclist_age':'first',
    'position':'first',
    'cyclist_team':'first',
}).reset_index()
clustering_data.groupby('std_name').count()

Unnamed: 0_level_0,date,stage,cyclist,profile,is_tarmac,difficulty_level,points,length,climb_total,competitive_age,startlist_quality,delta,performance_index,difficulty,convenience_score,difficulty_score,gain_ratio,cyclist_age,position,cyclist_team
std_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
amstel-gold-race,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349,4349
dauphine,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669,26669
dwars-door-vlaanderen,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656,2656
e3-harelbeke,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287,3287
giro-d-italia,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581,95581
gp-montreal,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070,1070
gp-quebec,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299,1299
gran-camino,792,792,792,792,792,792,792,792,792,792,792,792,792,792,792,792,792,792,792,792
il-lombardia,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069,3069
itzulia-basque-country,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936,17936


In [None]:
ohe_diff_lvl=pd.get_dummies(races_df['difficulty_level']).astype(float)
clustering_data[ohe_diff_lvl.columns]=ohe_diff_lvl

ohe_tarmac=pd.get_dummies(races_df['is_tarmac']).astype(float)

cols=list(ohe_tarmac.columns)
cols[0]='True_is_tarmac'
cols[1]='False_is_tarmac'
clustering_data[cols]=ohe_tarmac


clustering_data=clustering_data.drop(columns=["cyclist","cyclist_team","is_tarmac","difficulty_level","date","stage"]).drop_duplicates()

std_scaler=StandardScaler()

clustering_data=utils.random_sampling_reduce(clustering_data,0.25)

races_groups=clustering_data.groupby('std_name')
normalized_races_groups={k:std_scaler.fit_transform(g.drop(columns="std_name").drop_duplicates()) for k,g in races_groups }

print({k:len(g) for k,g in normalized_races_groups.items()})

initial_eps=dict()

kth_neighbor=4

for k,data in normalized_races_groups.items():
    min_pts=data.shape[1]
    nn=NearestNeighbors(n_neighbors=min_pts-1,n_jobs=-1)
    nn.fit(data)
    distances,indices= nn.kneighbors(data)
    k_distances= np.sort(distances[:, -1])

    initial_eps[k]=k_distances[kth_neighbor-1]


print(f"""
number of groups={len(normalized_races_groups)}
initial eps values per group={initial_eps}
""")

{'amstel-gold-race': 1108, 'dauphine': 6671, 'dwars-door-vlaanderen': 648, 'e3-harelbeke': 821, 'giro-d-italia': 23911, 'gp-montreal': 269, 'gp-quebec': 327, 'gran-camino': 175, 'il-lombardia': 749, 'itzulia-basque-country': 4528, 'la-fleche-wallone': 1271, 'liege-bastogne-liege': 1190, 'milano-sanremo': 1638, 'omloop-het-nieuwsblad': 1005, 'paris-nice': 8011, 'paris-roubaix': 909, 'ronde-van-vlaanderen': 1076, 'san-sebastian': 1168, 'strade-bianche': 304, 'tirreno-adriatico': 6844, 'tour-de-france': 36613, 'tour-de-romandie': 4985, 'tour-de-suisse': 8383, 'uae-tour': 1072, 'volta-a-catalunya': 6528, 'vuelta-a-espana': 26282, 'world-championship': 949}

number of groups=27
initial eps values per group={'amstel-gold-race': 1.331096502661081, 'dauphine': 0.8401145858418447, 'dwars-door-vlaanderen': 1.2632881010334707, 'e3-harelbeke': 1.3823696824421974, 'giro-d-italia': 0.7261308576322284, 'gp-montreal': 2.0184520099061922, 'gp-quebec': 1.882460185176687, 'gran-camino': 2.48516596897522,

: 

In [None]:
import gc

group_results=pd.DataFrame()
for k,decade_data in normalized_races_groups.items():
    #NOTE: this might have to be revisited for it's just to try if everyting works
    dimension=decade_data.shape[0]
    min_pts=int(dimension-1)
    #using the method seen at laboratory to select initial values
    #print(decade_data.drop(columns="decade").info())
    maximum_distance = abs(decade_data.max() - decade_data.min()).sum().item()
    average_concentration = dimension / maximum_distance
    #use diferent scales for eps values
    # during the tests a lot of low values where not taken into consideration
    eps_values=initial_eps[k] * np.array([500,250,100,50,10, 5, 2.5, 1, 0.1, 0.01, 0.0001])
    #try various metrics
    metrics=['euclidean']

    min_pts_values=[min_pts]
    print(
    f"""
    period {k}
    maxium distance: {maximum_distance}
    average concentration:{average_concentration}
    eps values:{eps_values}
    used metrics:{metrics}
    number of minimum samples:{min_pts}
    number of samples used:{decade_data.shape[0]}
    """
    )
    gc.collect()
    #normalization is done for each group
    result=utils.run_dbscan(min_pts_values,eps_values,metrics,decade_data)
    result["group"]=k
    group_results=pd.concat([group_results,result])
group_results.reset_index()


    period amstel-gold-race
    maxium distance: 13.372756617282896
    average concentration:82.85501873024634
    eps values:[6.65548251e+02 3.32774126e+02 1.33109650e+02 6.65548251e+01
 1.33109650e+01 6.65548251e+00 3.32774126e+00 1.33109650e+00
 1.33109650e-01 1.33109650e-02 1.33109650e-04]
    used metrics:['euclidean']
    number of minimum samples:1107
    number of samples used:1108
    
-0 - (665.5482513305406, 'euclidean', 1107)
dbscan done, time=0.02056097984313965 seconds | silhoutte score:all noise
-1 - (332.7741256652703, 'euclidean', 1107)
dbscan done, time=0.016307830810546875 seconds | silhoutte score:all noise
-2 - (133.1096502661081, 'euclidean', 1107)
dbscan done, time=0.02366805076599121 seconds | silhoutte score:all noise
-3 - (66.55482513305405, 'euclidean', 1107)
dbscan done, time=0.019884347915649414 seconds | silhoutte score:all noise
-4 - (13.31096502661081, 'euclidean', 1107)
dbscan done, time=0.01560664176940918 seconds | silhoutte score:all noise
-5 - (6.