# 3. Calculate New Features

In this notebook, we load the preprocessed dataframe saved at the end of the second notebook (the PreProcessing notebook) and use it to build new features from scratch.

More precisely, to train our model more effectively, we'll compute several parameters for each match and add them to our dataframe.

These parameters, which help decide who is more likely to win a tennis match, are the following:
- a rank index, ranging from -1 to 1, that compares the rankings of Player 0 and Player 1. If Player 0 is ranked higher than Player 1, the rank index grows towards 1; if the opposite is true, the rank index is negative

- a surface performance index, ranging from 0 to 1 for each player, that represents the proportion of matches won out of the total number of matches played on a given surface

- a recent form index, ranging from 0 to 1 for each player, based on the last 7 matches played before the match: the proportion of matches won, averaged with the mean per-match performance index

- a form index like the previous one, but calculated on the last 20 matches

- an index that measures the performance of each player against opponents ranked similarly to the current opponent

- and finally a head-to-head (H2H) index, an experience index and a reliability index, each defined in its own section of this notebook

Since we'll work with dataframes and CSV files, we import the pandas library.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("csv/Preprocessed_Data.csv")

In [3]:
df.head()

Unnamed: 0,Avg_Pl1,Avg_Pl0,B365_Pl1,B365_Pl0,Date,Pl1_Rank,Player 1,Max_Pl1,Max_Pl0,PS_Pl1,...,Masters 1000,Masters Cup,1st Round,2nd Round,3rd Round,4th Round,Quarterfinals,Round Robin,Semifinals,The Final
0,,,,,2000-01-03,63,Dosedel S.,,,,...,0,0,1,0,0,0,0,0,0,0
1,,,,,2000-01-03,56,Clement A.,,,,...,0,0,1,0,0,0,0,0,0,0
2,,,,,2000-01-03,40,Escude N.,,,,...,0,0,1,0,0,0,0,0,0,0
3,,,,,2000-01-03,87,Knippschild J.,,,,...,0,0,1,0,0,0,0,0,0,0
4,,,,,2000-01-03,81,Fromberg R.,,,,...,0,0,1,0,0,0,0,0,0,0


In [4]:
df.columns

Index(['Avg_Pl1', 'Avg_Pl0', 'B365_Pl1', 'B365_Pl0', 'Date', 'Pl1_Rank',
       'Player 1', 'Max_Pl1', 'Max_Pl0', 'PS_Pl1', 'PS_Pl0', 'Pl0_Rank',
       'Player 0', 'Won', 'Indoor', 'Outdoor', 'Carpet', 'Clay', 'Grass',
       'Hard', 'ATP250', 'ATP500', 'Grand Slam', 'Masters 1000', 'Masters Cup',
       '1st Round', '2nd Round', '3rd Round', '4th Round', 'Quarterfinals',
       'Round Robin', 'Semifinals', 'The Final'],
      dtype='object')

In [5]:
df['Date'] = pd.to_datetime(df['Date'])

### Adding a Performance index per match

For every match, we'll calculate a performance index for both players: the winner always gets an index greater than 0.5, and the loser never gets more than 0.5.
The exact value depends on the players' rankings, so that beating a high-ranked player results in a value close to 1 for the winner, and losing to a low-ranked player results in a value close to 0 for the loser.

In [6]:
def perf_index(winner_rank, loser_rank):
    winner_index = 0.5 + 1/(loser_rank+1)
    loser_index = 1/(winner_rank+1)
    return winner_index, loser_index
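
To make the formula concrete, the sketch below re-declares `perf_index` as defined above and evaluates it for two illustrative matches. The sample rankings are made up for the example.

```python
def perf_index(winner_rank, loser_rank):
    # Same formula as in the cell above
    winner_index = 0.5 + 1 / (loser_rank + 1)
    loser_index = 1 / (winner_rank + 1)
    return winner_index, loser_index

# An upset: the player ranked 100 beats the player ranked 1
upset_winner, upset_loser = perf_index(winner_rank=100, loser_rank=1)

# An expected result: the player ranked 1 beats the player ranked 100
expected_winner, expected_loser = perf_index(winner_rank=1, loser_rank=100)

print(upset_winner, round(upset_loser, 4))        # 1.0 0.0099
print(round(expected_winner, 4), expected_loser)  # 0.5099 0.5
```

Note that the winner's index depends only on the loser's rank, and the loser's index only on the winner's rank.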

In [7]:
def calculate_performance_index_per_match_df(df):
    copy_df = df.copy()
    
    for ix, row in copy_df.iterrows():
        pl0_rank = row['Pl0_Rank']
        pl1_rank = row['Pl1_Rank']
        winner = row["Won"]
        if winner == 0:
            winner_index, loser_index = perf_index(pl0_rank, pl1_rank)
            copy_df.at[ix,'Match Perf. Index Pl0'] = winner_index
            copy_df.at[ix,'Match Perf. Index Pl1'] = loser_index
        else:
            winner_index, loser_index = perf_index(pl1_rank, pl0_rank)
            copy_df.at[ix,'Match Perf. Index Pl0'] = loser_index
            copy_df.at[ix,'Match Perf. Index Pl1'] = winner_index
    return copy_df
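
The row-by-row loop above can be slow on large dataframes. As a possible alternative, here is a minimal vectorized sketch using `numpy.where`. The function name `add_match_performance_index` and the two-row sample are assumptions made for illustration; the column names are the ones used in this notebook.

```python
import numpy as np
import pandas as pd

def add_match_performance_index(df):
    """Hypothetical vectorized variant of the loop above."""
    out = df.copy()
    pl0_won = out["Won"] == 0

    # Pick winner and loser ranks for every row at once
    winner_rank = np.where(pl0_won, out["Pl0_Rank"], out["Pl1_Rank"])
    loser_rank = np.where(pl0_won, out["Pl1_Rank"], out["Pl0_Rank"])

    # Same formula as perf_index
    winner_index = 0.5 + 1 / (loser_rank + 1)
    loser_index = 1 / (winner_rank + 1)

    out["Match Perf. Index Pl0"] = np.where(pl0_won, winner_index, loser_index)
    out["Match Perf. Index Pl1"] = np.where(pl0_won, loser_index, winner_index)
    return out

# Illustrative data: Player 0 wins the first match, Player 1 wins the second
sample = pd.DataFrame({"Pl0_Rank": [1, 50], "Pl1_Rank": [100, 9], "Won": [0, 1]})
result = add_match_performance_index(sample)
print(result[["Match Perf. Index Pl0", "Match Perf. Index Pl1"]])
```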

In [8]:
df = calculate_performance_index_per_match_df(df)

### Ranking Feature

The purpose of the following function is to calculate an index based on the rankings of 2 players.

In [9]:
def ranking_comparator_index(player_0_rank, player_1_rank):
    """
    Take the ranks of players 0 and 1 (e.g. 3 and 19) and return a value between -1 and 1.
    The value is negative if player 1 is ranked higher than player 0, positive if player 0 is ranked
    higher than player 1, and close to 0 if the players have similar rankings.
    Similarity is not expressed only by the distance between the rankings: the gap between
    number 1 and number 2 counts for more than the gap between number 59 and number 60.
    The ranking comparator index takes this into account.
    """
    tmp0 = 1/player_0_rank
    tmp1 = 1/player_1_rank
    
    max_tmp = max(tmp0,tmp1)
    coeff = ( abs(tmp0-tmp1) / max_tmp)**2
    
    proportion = coeff * max_tmp / (tmp0 + tmp1)
    proportion = round(proportion,4)
    if player_0_rank < player_1_rank:
        return proportion
    else:
        return proportion * -1
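
A few worked values may help to see how the index behaves. The sketch below repeats the computation from the cell above in compact form and evaluates it for sample rankings chosen for illustration.

```python
def ranking_comparator_index(player_0_rank, player_1_rank):
    # Same computation as in the cell above
    tmp0 = 1 / player_0_rank
    tmp1 = 1 / player_1_rank
    max_tmp = max(tmp0, tmp1)
    coeff = (abs(tmp0 - tmp1) / max_tmp) ** 2
    proportion = round(coeff * max_tmp / (tmp0 + tmp1), 4)
    return proportion if player_0_rank < player_1_rank else -proportion

close_at_top = ranking_comparator_index(1, 2)       # 0.1667
close_mid_table = ranking_comparator_index(59, 60)  # 0.0001
wide_gap = ranking_comparator_index(3, 19)          # 0.6124
reversed_gap = ranking_comparator_index(19, 3)      # -0.6124

print(close_at_top, close_mid_table, wide_gap, reversed_gap)
```

A one-place gap at the top of the rankings (1 vs 2) weighs far more than the same gap further down (59 vs 60), and swapping the players flips the sign.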

### Surface Performance Feature

Next, we're going to generate a new dataframe where, for each player in the dataset, we calculate the performance index on each of the classified surfaces: Carpet, Grass, Clay and Hard.
Note: this might take a minute or two to complete!

In [10]:
surface_performance_list = []

#list all players in the dataset
players = set(df['Player 0']) | set(df['Player 1'])

surfaces = ['Hard','Grass','Clay','Carpet']
for player in players:
    # Filter the matches once per player rather than once per (player, surface) pair
    player_df = df[(df["Player 0"]==player) | (df["Player 1"]==player)]
    for surface in surfaces:

        total_played = player_df[ (player_df[surface] == 1) ].shape[0]
        
        total_won = player_df[(player_df[surface] == 1) & \
                        ( ( (player_df["Player 0"]==player) & (player_df["Won"]==0) ) | (( (player_df["Player 1"]==player) & (player_df["Won"]==1) )) ) ].shape[0]
        
        if total_played == 0: performance_index = 0
        else: performance_index = round(float(total_won/total_played), 2)
        surface_performance_list.append({"Player":player,"Surface":surface,"Performance":performance_index})

surface_perf_df = pd.DataFrame(surface_performance_list)
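
As a possible faster alternative to the nested loops, the same win rates can be obtained with a single `groupby`. This is only a sketch on a tiny made-up dataframe: the names `matches`, `long_df` and `surface_perf` are assumptions for the example. Unlike the loop above, it produces no row for a player who never played on a given surface (the loop stores 0 in that case).

```python
import pandas as pd

# Illustrative matches with surface dummy columns, as in this notebook
matches = pd.DataFrame({
    "Player 0": ["A", "A", "B"],
    "Player 1": ["B", "C", "C"],
    "Won": [0, 1, 0],
    "Hard": [1, 1, 0],
    "Clay": [0, 0, 1],
})
surfaces = ["Hard", "Clay"]

# Recover the surface name of each match from the dummy columns
surface = matches[surfaces].idxmax(axis=1)

# One row per (player, match), with a 0/1 flag telling whether that player won
long_df = pd.concat([
    pd.DataFrame({"Player": matches["Player 0"], "Surface": surface,
                  "Win": (matches["Won"] == 0).astype(int)}),
    pd.DataFrame({"Player": matches["Player 1"], "Surface": surface,
                  "Win": (matches["Won"] == 1).astype(int)}),
], ignore_index=True)

# The mean of the 0/1 flag is the proportion of matches won
surface_perf = (long_df.groupby(["Player", "Surface"])["Win"]
                .mean().round(2).reset_index(name="Performance"))
print(surface_perf)
```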

Let's take a look at the surface performance of the greatest tennis players of our generation!

In [11]:
surface_perf_df[surface_perf_df['Player'] == "Federer R."]

Unnamed: 0,Player,Surface,Performance
3176,Federer R.,Hard,0.84
3177,Federer R.,Grass,0.88
3178,Federer R.,Clay,0.77
3179,Federer R.,Carpet,0.77


In [12]:
surface_perf_df[surface_perf_df['Player'] == "Nadal R."]

Unnamed: 0,Player,Surface,Performance
3408,Nadal R.,Hard,0.77
3409,Nadal R.,Grass,0.78
3410,Nadal R.,Clay,0.91
3411,Nadal R.,Carpet,0.5


In [13]:
surface_perf_df[surface_perf_df['Player'] == "Djokovic N."]

Unnamed: 0,Player,Surface,Performance
2616,Djokovic N.,Hard,0.84
2617,Djokovic N.,Grass,0.85
2618,Djokovic N.,Clay,0.8
2619,Djokovic N.,Carpet,0.54


In [14]:
surface_perf_df[surface_perf_df['Player'] == "Murray A."]

Unnamed: 0,Player,Surface,Performance
940,Murray A.,Hard,0.77
941,Murray A.,Grass,0.81
942,Murray A.,Clay,0.69
943,Murray A.,Carpet,0.75


The function below takes a row of the dataframe and returns the surface of the match, based on the dummy variables that encode the surface.

In [15]:
def _get_surface_by_row__(row):
    if row['Hard'] == 1: return "Hard"
    if row['Clay'] == 1: return "Clay"
    if row['Grass'] == 1: return "Grass"
    if row['Carpet'] == 1: return "Carpet"
    return ""

### Form Performance

Now, we're going to implement a function that returns the performance index of a player in the last X matches before a given date.

In [16]:
def avg_performance_player(df, player):
    sum_as_p0 = df[df["Player 0"] == player]["Match Perf. Index Pl0"].sum()
    sum_as_p1 = df[df["Player 1"] == player]["Match Perf. Index Pl1"].sum()
    if df.shape[0] == 0: return 0
    return round( (sum_as_p0+sum_as_p1)/df.shape[0], 2)

In [17]:
def form_performance_index(df, player, date, matches_to_consider_tuple):
    """
    Take a dataframe containing all the matches, a player, a date D and a tuple of integers.
    The smallest number in the tuple is the number of matches used to calculate the recent form,
    and the largest is the number of matches used to calculate the form of the player.
    Only matches played before D are considered.
    Return a tuple (form index, recent form index).
    """
    try:
        recent_form_matches = min(matches_to_consider_tuple)
        form_matches = max(matches_to_consider_tuple)
    except (TypeError, ValueError):
        raise ValueError("Wrong value for argument matches_to_consider_tuple. It has to be a tuple of integers.")
    
    player_df = df[(df['Player 0']==player) | (df['Player 1']==player)]
    
    #Form DF
    last_matches_before_date_df_form = player_df[player_df['Date'] < date].tail(form_matches).sort_values('Date', ascending=False)
    avg_perf_form = avg_performance_player(last_matches_before_date_df_form, player)
    
    #Recent Form DF
    last_matches_before_date_df_recent_form = last_matches_before_date_df_form.head(recent_form_matches)
    avg_perf_recent_form = avg_performance_player(last_matches_before_date_df_recent_form, player)
    
    #Form
    total_played_form = last_matches_before_date_df_form.shape[0]
    total_won_form = last_matches_before_date_df_form[( ( \
                (last_matches_before_date_df_form["Player 0"]==player) & (last_matches_before_date_df_form["Won"]==0) ) | \
                (( (last_matches_before_date_df_form["Player 1"]==player) & (last_matches_before_date_df_form["Won"]==1) )) )].shape[0]
        
    if total_played_form == 0: performance_index_form = 0
    else: performance_index_form = (round(float(total_won_form/total_played_form), 2) + avg_perf_form)/2
    
    
    #Recent Form
    total_played_recent_form = last_matches_before_date_df_recent_form.shape[0]
    total_won_recent_form = last_matches_before_date_df_recent_form[( ( \
                (last_matches_before_date_df_recent_form["Player 0"]==player) & (last_matches_before_date_df_recent_form["Won"]==0) ) | \
                (( (last_matches_before_date_df_recent_form["Player 1"]==player) & (last_matches_before_date_df_recent_form["Won"]==1) )) )].shape[0]
        
    if total_played_recent_form == 0: performance_index_recent_form = 0
    else: performance_index_recent_form = (round(float(total_won_recent_form/total_played_recent_form), 2) + avg_perf_recent_form)/2
    
    return performance_index_form, performance_index_recent_form
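
The core step of the function above is selecting a player's last N matches before a given date. Because `tail` is applied before any sorting, the function relies on the dataframe already being in chronological order. The sketch below shows only the win-rate half of the index, with an explicit sort; the helper name `last_n_win_rate` and the sample data are assumptions for illustration.

```python
import pandas as pd

def last_n_win_rate(df, player, date, n):
    """Hypothetical helper: share of wins in the player's last n matches before date."""
    involved = (df["Player 0"] == player) | (df["Player 1"] == player)
    # Sort explicitly so that tail(n) really returns the most recent matches
    previous = df[involved & (df["Date"] < date)].sort_values("Date").tail(n)
    if previous.empty:
        return 0.0
    won = (((previous["Player 0"] == player) & (previous["Won"] == 0))
           | ((previous["Player 1"] == player) & (previous["Won"] == 1)))
    return round(float(won.mean()), 2)

# Illustrative matches, deliberately not in chronological order
matches = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-05", "2020-01-01", "2020-01-03", "2020-01-07"]),
    "Player 0": ["A", "A", "B", "A"],
    "Player 1": ["B", "C", "A", "C"],
    "Won": [0, 1, 1, 0],
})

cutoff = pd.Timestamp("2020-01-07")
print(last_n_win_rate(matches, "A", cutoff, 2))  # 1.0
print(last_n_win_rate(matches, "A", cutoff, 3))  # 0.67
```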

### Performance against similar ranked opponents

Next, we're going to define a function that measures the performance index of a player in matches against opponents in a given ranking range (e.g. top 4, top 10, top 20, etc.).

In [18]:
def performance_against_opponent_ranking_index(df, player, range_tuple):
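    try:
        range_min = range_tuple[0]
        range_max = range_tuple[1]
    except (TypeError, IndexError):
        # Use format(): concatenating a str with a tuple would itself raise a TypeError
        raise ValueError("Opponent Ranking Range tuple not provided correctly: {}".format(range_tuple))

    # Make sure that range_min <= range_max
    if range_min > range_max:
        range_min, range_max = range_max, range_min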
    
    player_won_as_pl0 = df[(df["Player 0"]==player) & (df["Won"]==0) & (df['Pl1_Rank'] <= range_max) & (df['Pl1_Rank'] >= range_min)]
    player_won_as_pl1 = df[(df["Player 1"]==player) & (df["Won"]==1) & (df['Pl0_Rank'] <= range_max) & (df['Pl0_Rank'] >= range_min)]
    
    player_lost_as_pl0 = df[(df["Player 0"]==player) & (df["Won"]==1) & (df['Pl1_Rank'] <= range_max) & (df['Pl1_Rank'] >= range_min)]
    player_lost_as_pl1 = df[(df["Player 1"]==player) & (df["Won"]==0) & (df['Pl0_Rank'] <= range_max) & (df['Pl0_Rank'] >= range_min)]
    
    
    won = player_won_as_pl0.shape[0] + player_won_as_pl1.shape[0]
    lost = player_lost_as_pl0.shape[0] + player_lost_as_pl1.shape[0]
    
    if won + lost == 0: return 0
    return round(float(won/(won+lost)), 2)
    

Now let's create a dataframe that contains, for each player, their performance against opponents in different "ranking bins".

In [19]:
def create_performance_vs_opponents_dataframe(df):
    perf_df_list = []
    
    players = set(df["Player 0"]) | set(df["Player 1"])
    bins = [(1,4), (5,10), (11,20), (21,50), (51,100), (101,250), (251, 10000)]
    for player in players:
        player_df = df[(df["Player 0"] == player) | (df["Player 1"] == player)]
        for b in bins:
            performance = performance_against_opponent_ranking_index(player_df, player, b)
            perf_df_list.append({"Player":player, "Min Bin":b[0], "Max Bin":b[1], "Performance":performance})
    
    return pd.DataFrame(perf_df_list)

In [20]:
performance_opponents_dataframe = create_performance_vs_opponents_dataframe(df)

The function `__get_bin__` takes as input the ranking of a player and returns a tuple representing a bin.
The bin is the group of rankings to which a ranking belongs (e.g. ranking = 2 belongs to bin (1,4), etc.).

In [21]:
def __get_bin__(ranking):
    bins = [(1,4), (5,10), (11,20), (21,50), (51,100), (101,250), (251, 10000)]
        
    for b in bins:
        b_min = b[0]
        b_max = b[1]
        if ranking >= b_min and ranking <= b_max:
            return b
    
    return bins[-1]
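
Since the bins are contiguous and sorted, the lookup can also be written as a binary search over the upper bounds. The following is a sketch with the standard-library `bisect` module; `get_bin`, `BINS` and `UPPER_BOUNDS` are hypothetical names. One difference to keep in mind: for a ranking below 1 this version returns the first bin, whereas the function above falls back to the last one.

```python
from bisect import bisect_left

BINS = [(1, 4), (5, 10), (11, 20), (21, 50), (51, 100), (101, 250), (251, 10000)]
UPPER_BOUNDS = [b[1] for b in BINS]

def get_bin(ranking):
    """Hypothetical variant: find the first bin whose upper bound is >= ranking."""
    ix = bisect_left(UPPER_BOUNDS, ranking)
    # Rankings beyond the last upper bound fall back to the last bin
    return BINS[min(ix, len(BINS) - 1)]

print(get_bin(2))      # (1, 4)
print(get_bin(5))      # (5, 10)
print(get_bin(250))    # (101, 250)
print(get_bin(20000))  # (251, 10000)
```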

### H2H Feature

Next, we're going to consider head-to-head (H2H) matches and overall experience (number of matches played) as factors.

In [22]:
def h2h_index(won_p0, won_p1):
    """
    Take the number of matches won by players 0 and 1 and return a number representing the H2H index.
    This index ranges from -1 to 1. It is positive if Player 0 has won more matches than Player 1,
    negative if Player 1 has won more, and 0 if they have won the same number of matches.
    The index approaches 1 as the lead of Player 0 over Player 1 increases,
    and approaches -1 as the lead of Player 1 over Player 0 increases.
    """
    if won_p0 == won_p1: return 0
    regr = 0.05128*max(won_p0,won_p1) - 0.06035*min(won_p0,won_p1) + 0.1831
    if won_p0 > won_p1: return min(max(0.001,regr),0.999)
    else: return max(min(-0.001,-regr),-0.999)
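
To see how the regression and the clamping interact, the sketch below repeats the function from the cell above and evaluates it for a few made-up head-to-head records.

```python
def h2h_index(won_p0, won_p1):
    # Same computation as in the cell above
    if won_p0 == won_p1:
        return 0
    regr = 0.05128 * max(won_p0, won_p1) - 0.06035 * min(won_p0, won_p1) + 0.1831
    if won_p0 > won_p1:
        return min(max(0.001, regr), 0.999)
    return max(min(-0.001, -regr), -0.999)

even = h2h_index(2, 2)         # 0
narrow_lead = h2h_index(3, 1)  # about 0.277
mirror = h2h_index(1, 3)       # about -0.277
dominant = h2h_index(20, 0)    # 0.999, clamped at the upper bound

print(even, round(narrow_lead, 3), round(mirror, 3), dominant)
```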

In [23]:
def calculate_h2h_index_before_date(df, pl0, pl1, date):
    first_half_games_df = df[(df["Player 0"]==pl0) & (df["Player 1"]==pl1) & (df["Date"] < date)]
    second_half_games_df = df[(df["Player 0"]==pl1) & (df["Player 1"]==pl0) & (df["Date"] < date)]
    
    total_first_half = first_half_games_df.shape[0]
    total_second_half = second_half_games_df.shape[0]
    
    first_half_won_p0 = first_half_games_df[first_half_games_df["Won"] == 0].shape[0]
    first_half_won_p1 = total_first_half - first_half_won_p0
    
    second_half_won_p1 = second_half_games_df[second_half_games_df["Won"] == 0].shape[0]
    second_half_won_p0 = total_second_half - second_half_won_p1
    
    won_p0 = first_half_won_p0 + second_half_won_p0
    won_p1 = first_half_won_p1 + second_half_won_p1
    
    return h2h_index(won_p0, won_p1)

### Experience Feature

In [24]:
def expierence_index(df, pl0, pl1, date):    
    pl0_exp = df[((df["Player 0"]==pl0) | (df["Player 1"]==pl0)) & (df["Date"] < date)].shape[0]
    pl1_exp = df[((df["Player 0"]==pl1) | (df["Player 1"]==pl1)) & (df["Date"] < date)].shape[0]
    if pl0_exp + pl1_exp == 0: return 0
    return (pl0_exp-pl1_exp)/(pl0_exp+pl1_exp)
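
The function above counts the matches each player has played before the given date and returns their normalized difference, a value between -1 and 1. A minimal sketch of the ratio itself, starting from match counts rather than from the dataframe (`experience_ratio` is a hypothetical helper used only for this example):

```python
def experience_ratio(pl0_matches, pl1_matches):
    """Hypothetical helper: the same ratio as above, computed from match counts."""
    total = pl0_matches + pl1_matches
    if total == 0:
        return 0
    return (pl0_matches - pl1_matches) / total

print(experience_ratio(300, 100))  # 0.5  -> Player 0 is more experienced
print(experience_ratio(10, 90))    # -0.8 -> Player 1 is more experienced
print(experience_ratio(50, 0))     # 1.0  -> Player 1 has no recorded matches
print(experience_ratio(0, 0))      # 0    -> no recorded matches for either player
```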

### Reliability Feature

This feature calculates how often a given player wins when he's the favourite or loses when he's the underdog.
In essence, it measures how "easy" it is to predict his matches correctly, given whether or not he's the favourite.

In [25]:
def reliability_index(df, player):
    betting_data_df = df[(pd.notna(df["B365_Pl1"])) & (pd.notna(df["B365_Pl0"]))]
    fav_as_p0 = betting_data_df[(betting_data_df["Player 0"]==player) & (betting_data_df["B365_Pl0"] < betting_data_df["B365_Pl1"])]
    fav_as_p1 = betting_data_df[(betting_data_df["Player 1"]==player) & (betting_data_df["B365_Pl0"] > betting_data_df["B365_Pl1"])]
    
    und_as_p0 = betting_data_df[(betting_data_df["Player 0"]==player) & (betting_data_df["B365_Pl0"] > betting_data_df["B365_Pl1"])]
    und_as_p1 = betting_data_df[(betting_data_df["Player 1"]==player) & (betting_data_df["B365_Pl0"] < betting_data_df["B365_Pl1"])]
    
    played_as_fav = fav_as_p0.shape[0] + fav_as_p1.shape[0]
    played_as_und = und_as_p0.shape[0] + und_as_p1.shape[0]
    
    if played_as_fav + played_as_und == 0: return 0
    
    won_as_fav = fav_as_p0[fav_as_p0["Won"]==0].shape[0] + fav_as_p1[fav_as_p1["Won"] == 1].shape[0]
    lost_as_fav = played_as_fav - won_as_fav
    
    won_as_und = und_as_p0[und_as_p0["Won"]==0].shape[0] + und_as_p1[und_as_p1["Won"] == 1].shape[0]
    lost_as_und = played_as_und - won_as_und
    
    return (won_as_fav + lost_as_und)/(played_as_fav + played_as_und)
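
Counting "won as favourite" plus "lost as underdog" amounts to counting the matches in which the bookmaker's favourite won. Assuming `Won` only takes the values 0 and 1, the index can therefore be sketched more compactly as below; `favourite_hit_rate` and the sample odds are assumptions for illustration.

```python
import pandas as pd

def favourite_hit_rate(df, player):
    """Hypothetical compact variant: share of the player's matches won by the favourite."""
    odds = df.dropna(subset=["B365_Pl0", "B365_Pl1"])
    odds = odds[(odds["Player 0"] == player) | (odds["Player 1"] == player)]
    # Matches with identical odds have no favourite and are skipped
    odds = odds[odds["B365_Pl0"] != odds["B365_Pl1"]]
    if odds.empty:
        return 0
    # 0 if Player 0 has the lower odds (favourite), 1 if Player 1 does
    favourite = (odds["B365_Pl1"] < odds["B365_Pl0"]).astype(int)
    return float((favourite == odds["Won"]).mean())

# Illustrative matches; the last one has no betting data
matches = pd.DataFrame({
    "Player 0": ["A", "A", "B", "A"],
    "Player 1": ["B", "C", "A", "D"],
    "B365_Pl0": [1.5, 3.0, 2.0, None],
    "B365_Pl1": [2.5, 1.3, 1.8, None],
    "Won": [0, 0, 1, 0],
})

print(round(favourite_hit_rate(matches, "A"), 2))  # 0.67
print(favourite_hit_rate(matches, "B"))            # 1.0
```

Player A is involved in three matches with odds; the favourite wins two of them (A's upset win over C is the miss), hence 2/3.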

# Putting it all together

Now it's time to rumble!

Finally, we calculate the other features defined above, and we append these features at the end of the dataframe.

Note: this can take a long time; the run recorded below took 106 minutes!

In [26]:
def calculate_new_features(df):
    
    new_features_df = df.copy()
    
    performance_df_absent = False
    try:
        form_performance_df = pd.read_csv("csv/Form_Performance.csv")
        form_performance_df['Date'] = pd.to_datetime(form_performance_df['Date'])
    except FileNotFoundError:
        print("Warn: file 'csv/Form_Performance.csv' not found. Creating it now...")
        performance_df_absent = True
        form_performance_df_list = []
    
    for ix, row in df.iterrows():

        if ix % 5000 == 0: print('{} matches computed on total {}'.format(ix, df.shape[0]))
                
        pl0 = row["Player 0"]
        pl1 = row["Player 1"]

        rank_index = ranking_comparator_index(row['Pl0_Rank'], row['Pl1_Rank'])
        
        if performance_df_absent:
            pl0_form_index_last20, pl0_form_index_last7 = form_performance_index(df, pl0, row['Date'], (20,7))
            pl1_form_index_last20, pl1_form_index_last7 = form_performance_index(df, pl1, row['Date'], (20,7))
            form_performance_df_list.append(
                {"Date":row['Date'],"Pl0":row['Player 0'],"Pl1":row['Player 1'], "Pl0_last_7":pl0_form_index_last7,
                "Pl0_last_20":pl0_form_index_last20,"Pl1_last_7":pl1_form_index_last7,"Pl1_last_20":pl1_form_index_last20}
            )
        else:
            match_row = form_performance_df[(form_performance_df['Date']==row['Date']) & (form_performance_df['Pl0']==pl0) & (form_performance_df['Pl1']==pl1)].iloc[0]
            pl0_form_index_last20 = match_row['Pl0_last_20']
            pl1_form_index_last20 = match_row['Pl1_last_20']
            pl0_form_index_last7 = match_row['Pl0_last_7']
            pl1_form_index_last7 = match_row['Pl1_last_7']

        p0_bin = __get_bin__(row['Pl0_Rank'])
        p1_bin = __get_bin__(row['Pl1_Rank'])
        
        pl0_against_opponent_bin = performance_opponents_dataframe[
            (performance_opponents_dataframe["Player"]==pl0) & (performance_opponents_dataframe["Min Bin"]==p1_bin[0]) 
            & (performance_opponents_dataframe["Max Bin"]==p1_bin[1])].iloc[0,3]
        
        pl1_against_opponent_bin = performance_opponents_dataframe[
            (performance_opponents_dataframe["Player"]==pl1) & (performance_opponents_dataframe["Min Bin"]==p0_bin[0]) 
            & (performance_opponents_dataframe["Max Bin"]==p0_bin[1])].iloc[0,3]

        surface = _get_surface_by_row__(row)

        h2h_index = calculate_h2h_index_before_date(df, pl0, pl1, row['Date'])
        experience_index = expierence_index(df, pl0, pl1, row['Date'])
        reliability_pl0 = reliability_index(df, pl0)
        reliability_pl1 = reliability_index(df, pl1)
        
        new_features_df.at[ix, 'Rank Index'] = rank_index
        new_features_df.at[ix, 'Pl0 Recent Form'] = pl0_form_index_last7
        new_features_df.at[ix, 'Pl0 Form'] = pl0_form_index_last20
        new_features_df.at[ix, 'Pl1 Recent Form']= pl1_form_index_last7
        new_features_df.at[ix, 'Pl1 Form'] = pl1_form_index_last20
        new_features_df.at[ix, 'Pl0 Perf. vs Similar Opponent'] = pl0_against_opponent_bin
        new_features_df.at[ix, 'Pl1 Perf. vs Similar Opponent'] = pl1_against_opponent_bin
        new_features_df.at[ix, 'Pl0 Surface Performance'] = surface_perf_df[(surface_perf_df["Player"] == pl0) & (surface_perf_df["Surface"] == surface)].iloc[0]['Performance']
        new_features_df.at[ix, 'Pl1 Surface Performance'] = surface_perf_df[(surface_perf_df["Player"] == pl1) & (surface_perf_df["Surface"] == surface)].iloc[0]['Performance']
        new_features_df.at[ix, 'H2H Index'] = h2h_index
        new_features_df.at[ix, 'Exp Index'] = experience_index
        new_features_df.at[ix, 'Reliability Pl0'] = reliability_pl0
        new_features_df.at[ix, 'Reliability Pl1'] = reliability_pl1
        
    if performance_df_absent:
        form_performance_df = pd.DataFrame(form_performance_df_list)
        form_performance_df.to_csv("csv/Form_Performance.csv", index=False)
    
    return new_features_df
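
Much of the running time of the loop above comes from filtering the lookup dataframes once per row. One possible way to speed up the lookup-based features is to attach them with `DataFrame.merge` instead. The sketch below does this for the surface performance only, on made-up data; it assumes a hypothetical `Surface` column holding the surface name, which in this notebook would first have to be derived from the dummy columns.

```python
import pandas as pd

# Illustrative matches with an assumed "Surface" column
matches = pd.DataFrame({
    "Player 0": ["A", "B"],
    "Player 1": ["B", "C"],
    "Surface": ["Hard", "Clay"],
})

# Illustrative lookup table with the same layout as surface_perf_df
surface_perf = pd.DataFrame({
    "Player": ["A", "B", "B", "C"],
    "Surface": ["Hard", "Hard", "Clay", "Clay"],
    "Performance": [0.8, 0.4, 0.6, 0.5],
})

for side, player_col in (("Pl0", "Player 0"), ("Pl1", "Player 1")):
    # Rename the lookup columns so they line up with this side of the match
    lookup = surface_perf.rename(columns={
        "Player": player_col,
        "Performance": f"{side} Surface Performance",
    })
    # A left merge keeps every match, in its original order
    matches = matches.merge(lookup, on=[player_col, "Surface"], how="left")

print(matches)
```

Features that depend on the match date (form, H2H, experience) cannot be attached this way without further work, since they must only use matches played before each row's date.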

In [27]:
import time
start_time = time.time()
calculated_new_features = calculate_new_features(df)
mins = round( (time.time() - start_time)/60 )
print("--- {} minutes ---".format(mins))

0 matches computed on total 54908
5000 matches computed on total 54908
10000 matches computed on total 54908
15000 matches computed on total 54908
20000 matches computed on total 54908
25000 matches computed on total 54908
30000 matches computed on total 54908
35000 matches computed on total 54908
40000 matches computed on total 54908
45000 matches computed on total 54908
50000 matches computed on total 54908
--- 106 minutes ---


Let's have a look at the dataset after having added all those new features:

In [28]:
list(calculated_new_features.columns)

['Avg_Pl1',
 'Avg_Pl0',
 'B365_Pl1',
 'B365_Pl0',
 'Date',
 'Pl1_Rank',
 'Player 1',
 'Max_Pl1',
 'Max_Pl0',
 'PS_Pl1',
 'PS_Pl0',
 'Pl0_Rank',
 'Player 0',
 'Won',
 'Indoor',
 'Outdoor',
 'Carpet',
 'Clay',
 'Grass',
 'Hard',
 'ATP250',
 'ATP500',
 'Grand Slam',
 'Masters 1000',
 'Masters Cup',
 '1st Round',
 '2nd Round',
 '3rd Round',
 '4th Round',
 'Quarterfinals',
 'Round Robin',
 'Semifinals',
 'The Final',
 'Match Perf. Index Pl0',
 'Match Perf. Index Pl1',
 'Rank Index',
 'Pl0 Recent Form',
 'Pl0 Form',
 'Pl1 Recent Form',
 'Pl1 Form',
 'Pl0 Perf. vs Similar Opponent',
 'Pl1 Perf. vs Similar Opponent',
 'Pl0 Surface Performance',
 'Pl1 Surface Performance',
 'H2H Index',
 'Exp Index',
 'Reliability Pl0',
 'Reliability Pl1']

Before saving our dataframe, let's reorder the columns to make the file more readable.

In [29]:
new_features = calculated_new_features.copy()

In [30]:
new_features = new_features[['Date', 'Player 0', 'Player 1', 'Won', 'Pl0_Rank', 'Pl1_Rank', 'Avg_Pl0', 'Avg_Pl1', 'Max_Pl1', 'Max_Pl0', 
         'PS_Pl0', 'PS_Pl1', 'B365_Pl0', 'B365_Pl1', 'Indoor', 'Outdoor', 'Carpet', 'Clay', 'Grass', 'Hard', 'ATP250', 'ATP500', 
                     'Grand Slam', 'Masters 1000', 'Masters Cup', '1st Round', '2nd Round', '3rd Round', '4th Round', 
                     'Quarterfinals', 'Round Robin', 'Semifinals', 'The Final', 'Rank Index', 'Pl0 Recent Form', 'Pl0 Form', 
                     'Pl1 Recent Form', 'Pl1 Form', 'Pl0 Perf. vs Similar Opponent', 'Pl1 Perf. vs Similar Opponent', 
                     'Pl0 Surface Performance', 'Pl1 Surface Performance', 'H2H Index', 'Exp Index', 'Reliability Pl0', 'Reliability Pl1']]

In [31]:
new_features.head()

Unnamed: 0,Date,Player 0,Player 1,Won,Pl0_Rank,Pl1_Rank,Avg_Pl0,Avg_Pl1,Max_Pl1,Max_Pl0,...,Pl1 Recent Form,Pl1 Form,Pl0 Perf. vs Similar Opponent,Pl1 Perf. vs Similar Opponent,Pl0 Surface Performance,Pl1 Surface Performance,H2H Index,Exp Index,Reliability Pl0,Reliability Pl1
0,2000-01-03,Ljubicic I.,Dosedel S.,1.0,77,63,,,,,...,0.0,0.0,0.66,0.17,0.61,0.4,0.0,0.0,0.687008,0.0
1,2000-01-03,Enqvist T.,Clement A.,0.0,5,56,,,,,...,0.0,0.0,0.58,0.24,0.56,0.51,0.0,0.0,0.625,0.640805
2,2000-01-03,Baccanello P.,Escude N.,1.0,655,40,,,,,...,0.0,0.0,0.0,0.91,0.14,0.65,0.0,0.0,0.75,0.680851
3,2000-01-03,Federer R.,Knippschild J.,0.0,65,87,,,,,...,0.0,0.0,0.9,0.3,0.84,0.1,0.0,0.0,0.868249,1.0
4,2000-01-03,Woodbridge T.,Fromberg R.,1.0,198,81,,,,,...,0.0,0.0,0.29,0.36,0.27,0.46,0.0,0.0,0.0,0.0


Finally, we save this dataset to a CSV file, so that the next notebook can simply load it from disk.

In [32]:
new_features.to_csv("csv/FeatureCalculated_Data.csv", index=False)