# 3. Calculate New Features

In this notebook, we'll load the preprocessed dataframe saved at the end of the second notebook (the PreProcessing notebook) and use it to build new features from scratch!

More precisely, to train our model more effectively, we'll compute several parameters for each match and add them to our dataframe.

These parameters, which help decide who is more likely to win a tennis match, are the following:
- a rank index that ranges from -1 to 1 and compares the rankings of Player 0 and Player 1. If Player 0 is ranked higher than Player 1 (i.e. has the lower ranking number), the rank index is positive and grows towards 1; if the opposite is true, the rank index is negative

- a surface performance index that ranges from 0 to 1 for each player and represents the proportion of matches won out of the total number of matches played on a given surface

- a recent form index that ranges from 0 to 1 for each player and represents the proportion of matches won in the last 7 matches played

- a form index like the previous one, but calculated on the last 20 matches

- and finally another index that measures the performance of each player against opponents whose ranking is similar to that of the current opponent in a given match

Since we'll work with dataframes and CSV files, we import the pandas library.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("csv/Preprocessed_Data.csv")

In [3]:
df.head()

Unnamed: 0,Avg_Pl1,Avg_Pl0,Date,Pl1_Rank,Player 1,Max_Pl1,Max_Pl0,Pl0_Rank,Player 0,Won,...,Masters 1000,Masters Cup,1st Round,2nd Round,3rd Round,4th Round,Quarterfinals,Round Robin,Semifinals,The Final
0,,,2000-01-03,63,Dosedel S.,,,77,Ljubicic I.,1.0,...,0,0,1,0,0,0,0,0,0,0
1,,,2000-01-03,56,Clement A.,,,5,Enqvist T.,0.0,...,0,0,1,0,0,0,0,0,0,0
2,,,2000-01-03,40,Escude N.,,,655,Baccanello P.,1.0,...,0,0,1,0,0,0,0,0,0,0
3,,,2000-01-03,87,Knippschild J.,,,65,Federer R.,0.0,...,0,0,1,0,0,0,0,0,0,0
4,,,2000-01-03,81,Fromberg R.,,,198,Woodbridge T.,1.0,...,0,0,1,0,0,0,0,0,0,0


The following function calculates an index based on the rankings of the two players.

In [4]:
def ranking_comparator_index(player_0_rank, player_1_rank):
    """
    Take the rankings of Player 0 and Player 1 (e.g. 3 and 19) and return a value between -1 and 1.

    The value is positive if Player 0 is ranked higher than Player 1 (i.e. has the lower ranking number),
    negative if Player 1 is ranked higher, and close to 0 if the two rankings are similar.
    Similarity does not depend only on the distance between the rankings: the gap between
    number 1 and number 2 counts for more than the gap between number 59 and number 60.
    Working with the reciprocals of the rankings takes this into account.
    """
    tmp0 = 1/player_0_rank
    tmp1 = 1/player_1_rank
    
    max_tmp = max(tmp0,tmp1)
    coeff = ( abs(tmp0-tmp1) / max_tmp)**2
    
    proportion = coeff * max_tmp / (tmp0 + tmp1)
    proportion = round(proportion,4)
    if player_0_rank < player_1_rank:
        return proportion
    else:
        return proportion * -1
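
As a quick sanity check of the behaviour described above, the sketch below repeats the function so that it runs on its own and evaluates it on a few ranking pairs. The pairs (1, 2) and (59, 60) are illustrative choices; (77, 63) and (5, 56) are the rankings in the first two rows shown by `df.head()`.

```python
def ranking_comparator_index(player_0_rank, player_1_rank):
    # Same logic as the function defined above, repeated so the example is self-contained.
    tmp0 = 1 / player_0_rank
    tmp1 = 1 / player_1_rank
    max_tmp = max(tmp0, tmp1)
    coeff = (abs(tmp0 - tmp1) / max_tmp) ** 2
    proportion = round(coeff * max_tmp / (tmp0 + tmp1), 4)
    return proportion if player_0_rank < player_1_rank else proportion * -1

# (Player 0 rank, Player 1 rank) pairs to compare.
pairs = [(1, 2), (59, 60), (77, 63), (5, 56)]
examples = {pair: ranking_comparator_index(*pair) for pair in pairs}
print(examples)
# {(1, 2): 0.1667, (59, 60): 0.0001, (77, 63): -0.0182, (5, 56): 0.7614}
```

Both (1, 2) and (59, 60) are one place apart, yet the index for the first pair is far larger, which is the effect of comparing reciprocals rather than raw rankings.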

Next, we're going to generate a new dataframe in which, for each player in the dataset, we calculate the performance index on each of the surfaces: Carpet, Grass, Clay and Hard.
Note: this might take a minute or two to complete!

In [5]:
surface_performance_list = []

#list all players in the dataset
players = set(df['Player 0']) | set(df['Player 1'])

surfaces = ['Hard','Grass','Clay','Carpet']
for player in players:
    for surface in surfaces:
        player_df = df[(df["Player 0"]==player) | (df["Player 1"]==player)]

        total_played = player_df[ (player_df[surface] == 1) ].shape[0]
        
        total_won = player_df[(player_df[surface] == 1) & \
                        ( ( (player_df["Player 0"]==player) & (player_df["Won"]==0) ) | (( (player_df["Player 1"]==player) & (player_df["Won"]==1) )) ) ].shape[0]
        
        if total_played == 0: performance_index = 0
        else: performance_index = round(float(total_won/total_played), 2)
        surface_performance_list.append({"Player":player,"Surface":surface,"Performance":performance_index})

surface_perf_df = pd.DataFrame(surface_performance_list)
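
If the nested loop above feels slow, the same proportions can be obtained with a single `groupby`. The following is only a sketch on made-up toy data, not the notebook's method: the names `toy`, `long_df` and `toy_perf` are hypothetical, and it assumes the same column conventions as above (`Won == 0` means Player 0 won, one dummy column per surface). Unlike the loop, it produces no row for a player/surface combination with zero matches, and `idxmax` would return the first surface column for a row whose dummies are all 0.

```python
import pandas as pd

# Toy data using the notebook's column conventions (values are made up).
toy = pd.DataFrame({
    "Player 0": ["A", "B", "A"],
    "Player 1": ["B", "C", "C"],
    "Won":      [0, 1, 1],   # 0 -> Player 0 won, 1 -> Player 1 won
    "Hard":     [1, 1, 0],
    "Clay":     [0, 0, 1],
})

surfaces = ["Hard", "Clay"]
# Name of the surface for each match, derived from the dummy columns.
toy["Surface"] = toy[surfaces].idxmax(axis=1)

# One row per (match, player), with a flag saying whether that player won.
as_pl0 = pd.DataFrame({"Player": toy["Player 0"], "Surface": toy["Surface"],
                       "PlayerWon": toy["Won"] == 0})
as_pl1 = pd.DataFrame({"Player": toy["Player 1"], "Surface": toy["Surface"],
                       "PlayerWon": toy["Won"] == 1})
long_df = pd.concat([as_pl0, as_pl1], ignore_index=True)

# The mean of a boolean flag is the proportion of matches won.
toy_perf = (long_df.groupby(["Player", "Surface"])["PlayerWon"]
                   .mean().round(2).rename("Performance").reset_index())
print(toy_perf)
```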

Let's take a look at the surface performance of the greatest tennis players of our generation!

In [6]:
surface_perf_df[surface_perf_df['Player'] == "Federer R."]

Unnamed: 0,Player,Surface,Performance
148,Federer R.,Hard,0.84
149,Federer R.,Grass,0.88
150,Federer R.,Clay,0.77
151,Federer R.,Carpet,0.77


In [7]:
surface_perf_df[surface_perf_df['Player'] == "Nadal R."]

Unnamed: 0,Player,Surface,Performance
2936,Nadal R.,Hard,0.77
2937,Nadal R.,Grass,0.78
2938,Nadal R.,Clay,0.91
2939,Nadal R.,Carpet,0.5


In [8]:
surface_perf_df[surface_perf_df['Player'] == "Djokovic N."]

Unnamed: 0,Player,Surface,Performance
2148,Djokovic N.,Hard,0.84
2149,Djokovic N.,Grass,0.85
2150,Djokovic N.,Clay,0.8
2151,Djokovic N.,Carpet,0.54


In [9]:
surface_perf_df[surface_perf_df['Player'] == "Murray A."]

Unnamed: 0,Player,Surface,Performance
4456,Murray A.,Hard,0.77
4457,Murray A.,Grass,0.81
4458,Murray A.,Clay,0.69
4459,Murray A.,Carpet,0.75


Now, we're going to implement a function that returns the performance index of a player in the last X matches before a given date.

In [10]:
def form_performance_index(df, player, date, matches_to_consider_tuple):
    """
    Take a dataframe containing all the matches, a player, a date and a tuple of two integers.

    The smaller integer is the number of matches used to calculate the recent form, the larger one
    is the number of matches used to calculate the form. Only matches played before the given date
    are considered. Return the pair (form index, recent form index).
    """
    try:
        recent_form_matches = min(matches_to_consider_tuple)
        form_matches = max(matches_to_consider_tuple)
    except (TypeError, ValueError):
        raise ValueError("Wrong value for argument matches_to_consider_tuple. It has to be a tuple of integers.")
    
    player_df = df[(df['Player 0']==player) | (df['Player 1']==player)]
    
    #Form DF
    last_matches_before_date_df_form = player_df[player_df['Date'] < date]       \
                                            .sort_values('Date', ascending=False).head(form_matches)
    
    #Recent Form DF
    last_matches_before_date_df_recent_form = last_matches_before_date_df_form.head(recent_form_matches)
    
    #Form
    total_played_form = last_matches_before_date_df_form.shape[0]
    total_won_form = last_matches_before_date_df_form[( ( \
                (last_matches_before_date_df_form["Player 0"]==player) & (last_matches_before_date_df_form["Won"]==0) ) | \
                (( (last_matches_before_date_df_form["Player 1"]==player) & (last_matches_before_date_df_form["Won"]==1) )) )].shape[0]
        
    if total_played_form == 0: performance_index_form = 0
    else: performance_index_form = round(float(total_won_form/total_played_form), 2)
    
    
    #Recent Form
    total_played_recent_form = last_matches_before_date_df_recent_form.shape[0]
    total_won_recent_form = last_matches_before_date_df_recent_form[( ( \
                (last_matches_before_date_df_recent_form["Player 0"]==player) & (last_matches_before_date_df_recent_form["Won"]==0) ) | \
                (( (last_matches_before_date_df_recent_form["Player 1"]==player) & (last_matches_before_date_df_recent_form["Won"]==1) )) )].shape[0]
        
    if total_played_recent_form == 0: performance_index_recent_form = 0
    else: performance_index_recent_form = round(float(total_won_recent_form/total_played_recent_form), 2)   
    
    return performance_index_form, performance_index_recent_form
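
To make the windowing logic easier to follow, here is a condensed sketch of the same idea for a single window size, run on made-up toy data. The helper `last_n_win_ratio` and the dataframe `toy` are hypothetical names introduced only for this illustration. It relies on the `Date` column holding ISO-formatted strings (`YYYY-MM-DD`), as in the output of `df.head()` above, so that plain string comparison and sorting follow chronological order.

```python
import pandas as pd

def last_n_win_ratio(matches, player, date, n):
    """Hypothetical helper: share of wins in the player's last n matches before `date`."""
    played = matches[(matches["Player 0"] == player) | (matches["Player 1"] == player)]
    # Keep only earlier matches, most recent first, and cut to the window size.
    recent = played[played["Date"] < date].sort_values("Date", ascending=False).head(n)
    if recent.empty:
        return 0
    # Won == 0 means Player 0 won, Won == 1 means Player 1 won.
    won = ((recent["Player 0"] == player) & (recent["Won"] == 0)) | \
          ((recent["Player 1"] == player) & (recent["Won"] == 1))
    return round(float(won.mean()), 2)

# Toy data (made up): A wins, A loses, A loses, A wins.
toy = pd.DataFrame({
    "Date":     ["2000-01-03", "2000-01-10", "2000-01-17", "2000-01-24"],
    "Player 0": ["A", "B", "A", "A"],
    "Player 1": ["B", "A", "C", "B"],
    "Won":      [0, 0, 1, 0],
})

print(last_n_win_ratio(toy, "A", "2000-01-24", 2))  # 0.0  (the two most recent earlier matches were lost)
print(last_n_win_ratio(toy, "A", "2000-01-24", 3))  # 0.33 (one win out of three)
print(last_n_win_ratio(toy, "A", "2000-01-03", 3))  # 0    (no earlier matches)
```

The last call shows why the form columns are 0.0 for matches played on the first date in the dataset: there is no earlier match to look at.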

Next, we're going to define a function that measures the performance index of a player in matches against opponents in a given ranking range (e.g. ranked 1 to 4, 5 to 10, 11 to 20, etc.).

In [11]:
def performance_against_opponent_ranking_index(df, player, range_tuple):
    try:
        range_min = range_tuple[0]
        range_max = range_tuple[1]
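    except (TypeError, IndexError):
        # str() is needed here: concatenating a str and a tuple would itself raise a TypeError
        raise ValueError("Opponent Ranking Range tuple not provided correctly: " + str(range_tuple))
        
    # make sure range_min <= range_max
    if range_min > range_max:
        range_min, range_max = range_max, range_min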
    
    player_won_as_pl0 = df[(df["Player 0"]==player) & (df["Won"]==0) & (df['Pl1_Rank'] <= range_max) & (df['Pl1_Rank'] >= range_min)]
    player_won_as_pl1 = df[(df["Player 1"]==player) & (df["Won"]==1) & (df['Pl0_Rank'] <= range_max) & (df['Pl0_Rank'] >= range_min)]
    
    player_lost_as_pl0 = df[(df["Player 0"]==player) & (df["Won"]==1) & (df['Pl1_Rank'] <= range_max) & (df['Pl1_Rank'] >= range_min)]
    player_lost_as_pl1 = df[(df["Player 1"]==player) & (df["Won"]==0) & (df['Pl0_Rank'] <= range_max) & (df['Pl0_Rank'] >= range_min)]
    
    
    won = player_won_as_pl0.shape[0] + player_won_as_pl1.shape[0]
    lost = player_lost_as_pl0.shape[0] + player_lost_as_pl1.shape[0]
    
    if won + lost == 0: return 0
    return round(float(won/(won+lost)), 2)
    

The function `__get_bin__` takes the ranking of a player and returns a tuple representing a bin.
The bin is the group of rankings to which a given ranking belongs (e.g. ranking 2 belongs to the bin (1,4)).

In [12]:
def __get_bin__(ranking):
    bins = [(1,4), (5,10), (11,20), (21,50), (51,100), (101,250), (251, 10000)]
        
    for b in bins:
        b_min = b[0]
        b_max = b[1]
        if ranking >= b_min and ranking <= b_max:
            return b
    
    return bins[-1]

The function below takes a row of the dataframe and returns the surface of the match, based on the dummy variables representing the surface.

In [13]:
def _get_surface_by_row__(row):
    if row['Hard'] == 1: return "Hard"
    if row['Clay'] == 1: return "Clay"
    if row['Grass'] == 1: return "Grass"
    if row['Carpet'] == 1: return "Carpet"
    return ""

Now it's time to rumble!

Finally, we calculate the other features defined above, and we append these features at the end of the dataframe.

Note: this might take up to 1 hour!!

In [14]:
for ix, row in df.iterrows():

    pl0 = row["Player 0"]
    pl1 = row["Player 1"]

    rank_index = ranking_comparator_index(row['Pl0_Rank'], row['Pl1_Rank'])
    pl0_form_index_last20, pl0_form_index_last7 = form_performance_index(df, pl0, row['Date'], (20,7))
    pl1_form_index_last20, pl1_form_index_last7 = form_performance_index(df, pl1, row['Date'], (20,7))

    pl0_against_opponent_bin = performance_against_opponent_ranking_index(df, pl0, __get_bin__(row['Pl1_Rank']))
    pl1_against_opponent_bin = performance_against_opponent_ranking_index(df, pl1, __get_bin__(row['Pl0_Rank']))

    surface = _get_surface_by_row__(row)

    df.at[ix, 'Rank Index'] = rank_index
    df.at[ix, 'Pl0 Recent Form'] = pl0_form_index_last7
    df.at[ix, 'Pl0 Form'] = pl0_form_index_last20
    df.at[ix, 'Pl1 Recent Form']= pl1_form_index_last7
    df.at[ix, 'Pl1 Form'] = pl1_form_index_last20
    df.at[ix, 'Pl0 Perf. vs Similar Opponent'] = pl0_against_opponent_bin
    df.at[ix, 'Pl1 Perf. vs Similar Opponent'] = pl1_against_opponent_bin
    df.at[ix, 'Pl0 Surface Performance'] = surface_perf_df[(surface_perf_df["Player"] == pl0) & (surface_perf_df["Surface"] == surface)].iloc[0]['Performance']
    df.at[ix, 'Pl1 Surface Performance'] = surface_perf_df[(surface_perf_df["Player"] == pl1) & (surface_perf_df["Surface"] == surface)].iloc[0]['Performance']
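
The two surface lookups at the end of the loop filter `surface_perf_df` once per row. As a possible alternative, they could be expressed as two left merges. The sketch below uses made-up toy data and hypothetical names (`matches`, `perf`), and it assumes the match dataframe already has a `Surface` column holding the surface name, which the notebook's `df` does not have (it would first need to be derived from the dummy columns).

```python
import pandas as pd

# Toy match data with an explicit Surface column (an assumption, see above).
matches = pd.DataFrame({
    "Player 0": ["A", "B"],
    "Player 1": ["B", "C"],
    "Surface":  ["Hard", "Clay"],
})

# Toy lookup table with the same columns as surface_perf_df (values are made up).
perf = pd.DataFrame({
    "Player":      ["A", "B", "B", "C"],
    "Surface":     ["Hard", "Hard", "Clay", "Clay"],
    "Performance": [0.8, 0.5, 0.4, 0.6],
})

for side, player_col in (("Pl0", "Player 0"), ("Pl1", "Player 1")):
    # Rename the lookup columns so the merge keys and the output column line up.
    lookup = perf.rename(columns={"Player": player_col,
                                  "Performance": side + " Surface Performance"})
    matches = matches.merge(lookup, on=[player_col, "Surface"], how="left")

print(matches)
```

With `how="left"`, every match row is kept, and a player/surface pair missing from the lookup table would yield NaN instead of raising an error.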


Let's have a look at the dataset after having added all those new features:

In [15]:
df.head()

Unnamed: 0,Avg_Pl1,Avg_Pl0,Date,Pl1_Rank,Player 1,Max_Pl1,Max_Pl0,Pl0_Rank,Player 0,Won,...,The Final,Rank Index,Pl0 Recent Form,Pl0 Form,Pl1 Recent Form,Pl1 Form,Pl0 Perf. vs Similar Opponent,Pl1 Perf. vs Similar Opponent,Pl0 Surface Performance,Pl1 Surface Performance
0,,,2000-01-03,63,Dosedel S.,,,77,Ljubicic I.,1.0,...,0,-0.0182,0.0,0.0,0.0,0.0,0.66,0.17,0.61,0.4
1,,,2000-01-03,56,Clement A.,,,5,Enqvist T.,0.0,...,0,0.7614,0.0,0.0,0.0,0.0,0.58,0.24,0.56,0.51
2,,,2000-01-03,40,Escude N.,,,655,Baccanello P.,1.0,...,0,-0.8309,0.0,0.0,0.0,0.0,0.0,0.91,0.14,0.65
3,,,2000-01-03,87,Knippschild J.,,,65,Federer R.,0.0,...,0,0.0366,0.0,0.0,0.0,0.0,0.9,0.3,0.84,0.1
4,,,2000-01-03,81,Fromberg R.,,,198,Woodbridge T.,1.0,...,0,-0.2478,0.0,0.0,0.0,0.0,0.29,0.36,0.27,0.46


Finally, we save this dataset to a CSV file, so that the next notebook can simply load it from disk.

In [16]:
df.to_csv("csv/FeatureCalculated_Data.csv", index=False)