# Feature Engineering

### Data Mining Project 2024/25

Authors: Nicola Emmolo, Simone Marzeddu, Jacopo Raffi

In [72]:
%load_ext autoreload
%autoreload 2


In [73]:
import pandas as pd

In [74]:
dataset = pd.read_csv('../data/complete_dataset.csv')
dataset['date'] = pd.to_datetime(dataset['date'], format='%Y-%m-%d')
dataset.head()
# TODO: uncomment the fillna(0) lines further down once the cleaned dataset is used

Unnamed: 0,race_url,race_name,points,uci_points,length,climb_total,profile,startlist_quality,date,position,cyclist_url,cyclist_age,mostly_tarmac,cyclist_team,delta,cyclist_name,birth_year,weight,height,nationality
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,1978-07-05,0,sean-kelly,22.0,True,vini-ricordi-pinarello-sidermec-1986,0.0,Sean Kelly,1956.0,77.0,180.0,Ireland
1,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,1978-07-05,1,gerrie-knetemann,27.0,True,norway-1987,0.0,Gerrie Knetemann,1951.0,,,Netherlands
2,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,1978-07-05,2,rene-bittinger,24.0,True,,0.0,René Bittinger,1954.0,69.0,174.0,France
3,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,1978-07-05,3,joseph-bruyere,30.0,True,navigare-blue-storm-1993,0.0,Joseph Bruyère,1948.0,,,Belgium
4,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,1978-07-05,4,sven-ake-nilsson,27.0,True,spain-1991,0.0,Sven-Åke Nilsson,1951.0,63.0,172.0,Sweden


## Position Attribute Normalization

The "position" attribute is normalized by the highest position recorded in each race, so that its values are more meaningful and comparable across races, regardless of the total number of participants.

In [75]:
max_position = dataset.groupby('race_url')['position'].max().reset_index()
max_pos_dict = max_position.set_index('race_url')['position'].to_dict()

dataset['position'] = dataset['position'] / dataset['race_url'].map(max_pos_dict)
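
As a quick check of what this normalization produces, here is a minimal sketch on toy data (the race names and positions are hypothetical, not taken from the dataset). It uses `transform('max')` instead of the dictionary lookup above, which should give the same result.

```python
import pandas as pd

# Toy data: hypothetical races and 0-based finishing positions
toy_positions = pd.DataFrame({
    'race_url': ['race-a', 'race-a', 'race-a', 'race-b', 'race-b'],
    'position': [0, 1, 2, 0, 4],
})

# Largest position of each race, broadcast back to every row of that race
max_per_race = toy_positions.groupby('race_url')['position'].transform('max')

# The winner maps to 0.0 and the last recorded rider to 1.0 in every race
toy_positions['position_norm'] = toy_positions['position'] / max_per_race
print(toy_positions)
```

Note that a race whose only recorded position is 0 would yield 0/0, i.e. NaN, with either approach.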

## Length and Climb Total Attributes Scaling

Both the "length" and "climb_total" attributes are expressed in meters, but their values usually reach or exceed the kilometer scale. For this reason we rescale them, changing the unit of measure from meters to kilometers.

In [76]:
dataset['length'] = dataset['length'] / 1000
dataset['climb_total'] = dataset['climb_total'] / 1000

## New Feature: Season Attribute

To extract as much information as possible, we engineered a categorical "race_season" attribute. In particular, since the original "temperature" attribute cannot be meaningfully exploited because of its massive number of NaN values, the season can serve as a useful proxy for similar kinds of information.

This attribute is computed by dividing the year into quarters: January, February and March are considered Winter months; April, May and June are considered Spring months; July, August and September are considered Summer months; and the remaining months of October, November and December are considered Autumn months.

In [77]:
def get_season(month):
    if month in [7, 8, 9]:
        return 'summer'
    elif month in [1, 2, 3]:
        return 'winter'
    elif month in [4, 5, 6]:
        return 'spring'
    else:
        return 'autumn'
    
dataset['race_season'] = dataset['date'].dt.month.apply(get_season)
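
An alternative sketch of the same quarter-based mapping, written as a lookup table applied with `Series.map`. The name `SEASON_BY_MONTH` and the sample dates are assumptions made for illustration.

```python
import pandas as pd

# Month-to-season lookup following the quarters described above
SEASON_BY_MONTH = {
    1: 'winter', 2: 'winter', 3: 'winter',
    4: 'spring', 5: 'spring', 6: 'spring',
    7: 'summer', 8: 'summer', 9: 'summer',
    10: 'autumn', 11: 'autumn', 12: 'autumn',
}

# Hypothetical sample dates, one per season
sample_dates = pd.to_datetime(
    pd.Series(['1978-07-05', '2001-03-31', '2010-04-01', '2015-10-12'])
)
sample_seasons = sample_dates.dt.month.map(SEASON_BY_MONTH)
print(sample_seasons.tolist())
```

One difference worth keeping in mind: with the lookup table a missing date stays missing, whereas `get_season` falls through to its `else` branch and labels it 'autumn'.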

## New Feature: BMI Attribute

The "height" and "weight" attributes from the cyclist dataset are both interesting sources of information, and they are also highly correlated. Our intuition is that a more complex feature combining the two gives a more complete description of the physical condition of each cyclist. The BMI (Body Mass Index) is a well-known proxy for a person's physical health: it still lacks information about muscle and fat mass, but it is more descriptive than height or weight alone.

In [78]:
dataset['cyclist_bmi'] = dataset['weight'] / (dataset['height'] / 100) ** 2
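
As a worked example, the same formula can be applied to three of the cyclists shown in the `head()` output above, with weight in kilograms and height in centimeters. The frame `bmi_sample` is built by hand for illustration only.

```python
import numpy as np
import pandas as pd

# Weight (kg) and height (cm) copied from the head() output above
bmi_sample = pd.DataFrame({
    'cyclist_name': ['Sean Kelly', 'Gerrie Knetemann', 'René Bittinger'],
    'weight': [77.0, np.nan, 69.0],
    'height': [180.0, np.nan, 174.0],
})

# BMI = weight [kg] / (height [m]) ** 2; height is converted from cm to m
bmi_sample['cyclist_bmi'] = bmi_sample['weight'] / (bmi_sample['height'] / 100) ** 2
print(bmi_sample.round(2))
```

Cyclists with a missing weight or height end up with a missing BMI, so this feature inherits the NaN values of both source attributes.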

## New Feature: Age Group Attribute

Cyclists' ages span a wide range of values. In our view, small fluctuations in this attribute do not carry particularly relevant information, which is why we find it more interesting to consider the age group each cyclist belongs to when studying relations with other attributes.

In [79]:
bins = [0, 18, 25, 30, float('inf')]
# With right=False (below) the bins are [0, 18), [18, 25), [25, 30), [30, inf)
labels = ['<18', '18-25', '25-30', '>30']  # TODO: consider clearer labels ('>30' also includes age 30)

dataset['cyclist_age_group'] = pd.cut(dataset['cyclist_age'], bins=bins, labels=labels, right=False)
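
To see how ages on the bin edges are assigned, here is a small sketch with hypothetical ages. Because of `right=False`, each interval includes its left edge and excludes its right one.

```python
import pandas as pd

age_bins = [0, 18, 25, 30, float('inf')]
age_labels = ['<18', '18-25', '25-30', '>30']

# Hypothetical ages chosen to sit on or next to the bin edges
edge_ages = pd.Series([17, 18, 24, 25, 30, 41])
edge_groups = pd.cut(edge_ages, bins=age_bins, labels=age_labels, right=False)

print(pd.DataFrame({'age': edge_ages, 'group': edge_groups}))
```

An 18-year-old falls in '18-25', a 25-year-old in '25-30', and a 30-year-old in '>30'.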

## Climb Percentage

In [80]:
dataset['climb_percentage'] = dataset['climb_total'] / dataset['length']

## New Feature: Climb Power Attribute (Power-Weight Ratio Proxy)

As mentioned, the BMI alone says little about the muscular power or build of a given cyclist. Based on our research, a value such as the PWR (Power-to-Weight Ratio) would be an interesting addition to our dataset, but since we have no information about the power produced by each cyclist in a given race, we cannot compute it directly.

To get as close as possible to this kind of information, we designed a new feature called "Climb Power". For a given race it considers both the climbing difficulty of the track (a mix of "climb_total" and "profile") and the effectiveness demonstrated by the cyclist in that context (the "delta" achieved in the race; "position" would be less significant, since cyclists finish in groups that are more clearly identified by "delta"). The "power" obtained from this calculation is finally weighted by the cyclist's "weight" (the code below uses "weight" rather than the BMI), so as to relate it to the athlete's build.

https://calculator.academy/bike-climbing-power-calculator/

In [81]:
dataset['cyclist_climb_power'] = ((dataset['climb_percentage']) * dataset['profile'] * dataset['weight']) / (dataset['delta']+1)

min_value = dataset['cyclist_climb_power'].min()
max_value = dataset['cyclist_climb_power'].max()
dataset['cyclist_climb_power'] = (dataset['cyclist_climb_power'] - min_value) / (max_value - min_value)
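
To illustrate how the raw score (before min-max scaling) reacts to "delta", the sketch below evaluates the same expression for the Sean Kelly row shown in the `head()` output (length and climb converted to kilometers) and for a hypothetical rider with identical attributes but a delta of 60. The function name `climb_power_raw` is an assumption made for illustration.

```python
def climb_power_raw(climb_percentage, profile, weight, delta):
    """Raw climb power: same expression as the cell above, before scaling."""
    # The +1 in the denominator avoids a division by zero when delta is 0
    return (climb_percentage * profile * weight) / (delta + 1)

# climb_total / length for tour-de-france/1978/stage-6, both in km
stage_ratio = 1.101 / 162.0

power_delta_0 = climb_power_raw(stage_ratio, profile=1.0, weight=77.0, delta=0.0)
# Hypothetical rider: same race and weight, but delta = 60
power_delta_60 = climb_power_raw(stage_ratio, profile=1.0, weight=77.0, delta=60.0)

print(power_delta_0, power_delta_60)
```

With everything else equal, a delta of 60 divides the score by 61, so the feature drops off quickly for riders who finish behind the leading group.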

## New Feature: Physical Effort Attribute

This feature, called "race_physical_effort", summarizes the technical difficulty of a given race; its value is computed from the "length", "climb_total" and "profile" of the track.

In [82]:
dataset['race_physical_effort'] = dataset['length'] * dataset['climb_total'] * (dataset['profile']+1)

min_value = dataset['race_physical_effort'].min()
max_value = dataset['race_physical_effort'].max()
dataset['race_physical_effort'] = (dataset['race_physical_effort'] - min_value) / (max_value - min_value)
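
The min-max rescaling applied here is the same one used for the other engineered scores in this notebook, so it could be factored into a small helper. The sketch below is one possible version; `min_max_scale` is a hypothetical name, not a function used elsewhere in the project.

```python
import numpy as np
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Rescale a numeric series to [0, 1]; missing values stay missing."""
    lo, hi = series.min(), series.max()  # both skip NaN by default
    # Caveat: a constant series gives hi == lo, hence 0/0 = NaN everywhere
    return (series - lo) / (hi - lo)

scaled_demo = min_max_scale(pd.Series([2.0, np.nan, 4.0, 6.0]))
print(scaled_demo.tolist())
```

Since the minimum and maximum are taken over the whole dataset, a single extreme race can compress all other values toward 0. This is worth checking on the real distribution.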

## New Feature: Prestige Attribute

Similarly to the previous feature, the "race_prestige" attribute is designed to evaluate the relevance of a given race in terms of its participants and points value (the "startlist_quality" and "points" attributes). Note that in this case we ignored the "uci_points" attribute, for two reasons:
- the dataset lacks "uci_points" values prior to 2001 (since this metric was introduced in that period)
- the "points" and "uci_points" features are highly correlated, so we can assume that keeping both would be redundant

In [83]:
dataset['race_prestige'] = dataset['points'] * dataset['startlist_quality']

min_value = dataset['race_prestige'].min()
max_value = dataset['race_prestige'].max()
dataset['race_prestige'] = (dataset['race_prestige'] - min_value) / (max_value - min_value)

## New Feature: Experience Attribute

The "cyclist_previous_experience" feature is designed to give insight into the experience a cyclist has accumulated in the sport before a given race. We assume that each race gives the cyclist an amount of experience proportional to the prestige and physical effort of the race, with a greater weight given to the former than to the latter.

In [84]:
dataset = dataset.sort_values(by=['cyclist_url', 'date'])

prestige_coeff = 1
physical_effort_coeff = 0.2

dataset['cyclist_previous_experience'] = dataset['race_prestige'] * prestige_coeff + dataset['race_physical_effort'] * physical_effort_coeff
dataset['cyclist_previous_experience'] = dataset.groupby('cyclist_url')['cyclist_previous_experience'].transform(lambda x: x.shift().cumsum())
# dataset['cyclist_previous_experience'] = dataset['cyclist_previous_experience'].fillna(0)
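
The `shift().cumsum()` pattern is what keeps the current race out of its own experience total. A minimal sketch on toy data follows; the cyclists, dates and per-race values are hypothetical.

```python
import pandas as pd

# Toy data: per-race experience value for two hypothetical cyclists
toy_exp = pd.DataFrame({
    'cyclist_url': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01',
                            '2020-01-15', '2020-02-15']),
    'race_value': [1.0, 2.0, 3.0, 10.0, 20.0],
}).sort_values(by=['cyclist_url', 'date'])

# shift() drops the current race, cumsum() accumulates the earlier ones
toy_exp['previous_experience'] = (
    toy_exp.groupby('cyclist_url')['race_value']
           .transform(lambda x: x.shift().cumsum())
)
print(toy_exp)
```

Each cyclist's first race has no earlier races to sum, so its value is NaN. The commented-out `fillna(0)` is what would turn that into an experience of 0.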

## New Feature: Days Since Last Race Attribute

The "cyclist_days_since_last_race" attribute is designed to represent both the sharpness (the opposite of being out of practice) and the level of fatigue of a given cyclist, depending on how the feature is used in the analysis. It is the number of days elapsed since the previous race in which the cyclist took part, and therefore gives insight into the activity and fatigue level of the athlete at a given race.

In [85]:
dataset = dataset.sort_values(by=['cyclist_url', 'date'])

# Calculate days since last race
dataset['cyclist_days_since_last_race'] = dataset.groupby('cyclist_url')['date'].diff().dt.days
# dataset['cyclist_days_since_last_race'] = dataset['cyclist_days_since_last_race'].fillna(0)
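
A small sketch of the per-cyclist `diff()` on toy data (hypothetical cyclists and dates) shows what the feature looks like, including the first race of each cyclist.

```python
import pandas as pd

# Toy data: race dates for two hypothetical cyclists
toy_days = pd.DataFrame({
    'cyclist_url': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2021-05-01', '2021-05-02', '2021-05-10',
                            '2021-06-01', '2021-06-15']),
}).sort_values(by=['cyclist_url', 'date'])

# Difference with the previous race of the same cyclist, in whole days
toy_days['days_since_last_race'] = (
    toy_days.groupby('cyclist_url')['date'].diff().dt.days
)
print(toy_days)
```

The first race of each cyclist has no predecessor and stays NaN. Filling it with 0, as in the commented-out line, would make a debut indistinguishable from a race held on the same day as the previous one, so a separate indicator or a large sentinel value may be worth considering.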

In [86]:
dataset.to_csv('../data/complete_dataset_fe.csv', index=False)