# Feature engineering

This notebook takes care of the feature engineering part, in particular we're gonna add 3 new features to the dataset.

In [41]:
import pandas as pd
from os import path

RACES_PATH=path.join("..","dataset","races_cleaned.csv")
races_df=pd.read_csv(RACES_PATH)

CYCLISTS_PATH=path.join("..","dataset","cyclists_cleaned.csv")
cyclist_df=pd.read_csv(CYCLISTS_PATH)

## before starting the analysis

Before analyzing the features we edveloped it is important to note that our main entities to be analyzed are races and cyclists, some features depends only on the cyclists, others only on the race and some on both to highlight such relationship we specify this in function like manner `features(entity)`.

Also during the analysis much more features emerged to be necessary in order to determine the original 3 ones we developed.

## engineered features

The features we came up are:
- climbing_efficiency: we want to get a ratio between the climb total and the length, this  can help us understand better the performances for cyclists on different kind of slopeness on average for the race.
$$
climbing\_ efficiency(race)= \frac{climb\_ total(race)(m)}{length(race)(KM)\times 1000}
$$
Note: length is multiplied by 1000 to get the same unit as the length
- competitive\_age: we calculate the difference the age a cyclist has at the time of the race, this is useful because we might track the performances  of a single cyclist in multiple races across time and determine,if possible, if there are ages that are more performant that others or any kind of pattern across ages.
$$
competitive\_age(cyclist,race)=current\_ race \_ date(race) - cyclist\_ age(cyclist)
$$

- difficulty: this one serves to estimate the difficulty an athlete might have when facing certain stages, the important factors are
    - the difficulty of the terrain quantified by the terrain \_ score which is the weight sum of the difficulty of each profile type for a certain race, the elements being weighted are the difficulties attributed to each profile type and the weights are determined for each race using the percentage of time the profile appears.
    - the age of the athlete at the time of the race i.e. the competitive age
    - the BMI of the athlete
    The rationale is that a person that is older, overweight and deals with a difficult race has a lot more difficulty than another that is in shape, younger for the same race.
    
$$
terrain \_ score(race) = \sum_i\frac{( \# \text{of times terrain } i \text{ appears in the race}) }{( \# \text{of stages for the race})}*(\text{terrain $i$ difficulty})
$$
$$
difficulty(cyclist,race)=climbing\_ efficiency(race)*terrain \_ score(race) * BMI(cyclist) * competitive \_ age(cyclist)
$$

- convenience\_score: a ratio between points and difficulty, basically how much we gain from participating on certain races. The points of this feature is to find if there is any cyclist that tries to maximize the points gained while partecipating to the least difficult races or a brave one that goes to partecipate in difficult races to the it's capabilities and so on with any kind of behavioral pattern we can deduct.
$$
convenience\_ score(cyclist,age)=\frac{points(cyclist)}{difficulty(cyclist,age)}
$$

In [42]:
races_df['climbing_efficiency']=races_df['climb_total']/(races_df['length']*1000)
races_df['date']=pd.to_datetime(races_df['date'])

years_to_sub=races_df.merge(cyclist_df,left_on="cyclist",right_on="_url",how='inner')['birth_year'].astype('int32')
races_df['competitive_age']=races_df['date'].dt.year-years_to_sub

#profile types difficulty for terrain difficulty estimation
terrain_difficulty={
    1:10,
    2:20,
    3:30,
    4:40,
    5:50
}

#extract name, and stage. Year can be deducted from the date feature
url_df=races_df['_url'].str.split('/',expand=True)
url_df.rename(columns={0:'name',1:'year',2:'stage'},inplace=True)
races_df['std_name']=url_df['name']
races_df['stage']=url_df['stage']

#group races by (race_name,year)
stages_grouping=races_df.groupby(['std_name',races_df.date.dt.year])

#calculate components of the difficulty formula coming from the race
profile_difficulty=races_df['profile'].map(terrain_difficulty)
profile_counting=stages_grouping['profile'].transform('size')
profile_freqs=stages_grouping['profile'].value_counts().reset_index(name='prof_freqs')
profile_freqs=races_df.merge(profile_freqs,left_on=['std_name',races_df.date.dt.year],right_on=['std_name','date'],how='inner')['prof_freqs']

# calculate components of the  difficulty formula coming from the cyclist
cyclists_physicals=races_df.merge(cyclist_df,left_on='cyclist',right_on='_url',how='inner')[['weight','height']]
cyclist_df['bmi']=cyclist_df['weight'] / (cyclist_df['height']/100) ** 2
cyclist_bmi=races_df.merge(cyclist_df,left_on=['cyclist'],right_on=['_url'],how='inner')['bmi']

diff_terms_df=pd.DataFrame()

diff_terms_df['norm_bmi']=cyclist_bmi
diff_terms_df['norm_age']=races_df['competitive_age']
diff_terms_df['norm_terrain_diff']=((profile_freqs/profile_counting)*profile_difficulty)
# given different scales for the values this ensures to us consistenccy for scales
diff_terms_df=(diff_terms_df-diff_terms_df.min())/(diff_terms_df.max()-diff_terms_df.min())

#calculate formula,difficulty score and remove useless columns
races_df['difficulty']= diff_terms_df['norm_bmi']*diff_terms_df['norm_age']*diff_terms_df['norm_terrain_diff']
# the 1000 mutliplier is to have the same scales given tht points reach up to circa 2000 max so we don't get strange points
races_df['convenience_score']=races_df['points']/(races_df['difficulty']*1000)

# as per the description of the dataset the hour of start is noisy and irrelevant, therefore it is evicted
races_df['date']=races_df['date'].dt.date

races_df=races_df.drop(columns=['_url','name'])
races_df

Unnamed: 0,points,length,climb_total,profile,startlist_quality,date,position,cyclist,cyclist_age,is_tarmac,cyclist_team,delta,climbing_efficiency,competitive_age,std_name,stage,difficulty,convenience_score
0,100.0,162000.0,1101.0,1.0,1241,1978-07-05,0,sean-kelly,22.0,True,vini-ricordi-pinarello-sidermec-1986,0.0,0.000007,22,tour-de-france,stage-6,0.000022,4447.062477
1,100.0,162000.0,1101.0,1.0,1241,1978-07-05,1,gerrie-knetemann,27.0,True,norway-1987,0.0,0.000007,27,tour-de-france,stage-6,0.000007,15082.524217
2,100.0,162000.0,1101.0,1.0,1241,1978-07-05,2,rene-bittinger,24.0,True,france-1978,0.0,0.000007,24,tour-de-france,stage-6,0.000005,19918.560455
3,100.0,162000.0,1101.0,1.0,1241,1978-07-05,3,joseph-bruyere,30.0,True,navigare-blue-storm-1993,0.0,0.000007,30,tour-de-france,stage-6,0.000004,22967.664812
4,100.0,162000.0,1101.0,1.0,1241,1978-07-05,4,sven-ake-nilsson,27.0,True,spain-1991,0.0,0.000007,27,tour-de-france,stage-6,0.000003,30564.962371
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
589860,80.0,8400.0,60.0,1.0,878,2010-05-08,192,anders-lund-1,25.0,True,watney-avia-1972,80.0,0.000007,25,giro-d-italia,stage-1,0.000004,22151.287781
589861,80.0,8400.0,60.0,1.0,878,2010-05-08,193,andrea-masciarelli,28.0,True,free-agent,82.0,0.000007,28,giro-d-italia,stage-1,0.000002,49201.110052
589862,80.0,8400.0,60.0,1.0,878,2010-05-08,194,marco-corti,24.0,True,kazakhstan-2001,83.0,0.000007,24,giro-d-italia,stage-1,0.000009,8965.874034
589863,80.0,8400.0,60.0,1.0,878,2010-05-08,195,robbie-mcewen,38.0,True,radio-popular-paredes-boavista-2023,90.0,0.000007,38,giro-d-italia,stage-1,0.000029,2728.665767


Note: a nice analysis to do could be look for performance at high altitudes (lungs capacity maybe), in 50's and 60's training and eating changed so athlets in more recents races should perform better on high altitudes or we show see a trend for better performances (t.b.d. later how to measure them) as years get close to 2024.