# Tennis Match Data Analysis

## How to load and process tennis data?

Please note the following:
- It takes a long time to process the point data. So we avoid processing the point data if possible.
- The raw match data is missing many match_ids. Some match_ids in the point data are not in the original match data. So we need to synchronize them before using the match data.
- Both match data and point data need cleaning.
- We need to process the point data to generate new data columns.

Here are the steps for loading data:
- Fastest: Load the processed point data and processed match data, and then you can start your data analysis.
- Slow: If you download new data from GitHub, you need to clean and process both the raw match data and raw point data.
- Faster: If you have changed some of the match data cleaning and processing functions, load the raw match data and process it again. Then load the processed point data and synchronize the two. Only the match data is updated; the point data is unchanged. Save the processed match data.
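
The synchronization step mentioned above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual routine: the helper name `synchronize_match_ids` and the sample IDs are assumptions, and only the `match_id` column name comes from the data.

```python
import pandas as pd

def synchronize_match_ids(df_matches, df_points):
    """Hypothetical helper: keep only matches that have point data and report orphans."""
    point_ids = set(df_points["match_id"].unique())
    match_ids = set(df_matches["match_id"].unique())

    # match_ids that appear in the point data but not in the match data
    missing_in_matches = sorted(point_ids - match_ids)

    # Keep only the matches for which point data exists
    synced = df_matches[df_matches["match_id"].isin(point_ids)].copy()
    return synced, missing_in_matches

# Toy data (made-up IDs, for illustration only)
df_matches_demo = pd.DataFrame({"match_id": ["m1", "m2", "m3"]})
df_points_demo = pd.DataFrame({"match_id": ["m1", "m1", "m3", "m4"]})

synced_demo, orphans_demo = synchronize_match_ids(df_matches_demo, df_points_demo)
print(synced_demo["match_id"].tolist())
print(orphans_demo)
```

The orphan list gives the match_ids that would have to be added to the match data by hand before the two tables agree.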

## Problems with the data

- The original match data has a problem on line 955 that needs to be fixed manually.
- There are missing values in match data.
- There are duplicate match_ids in the match data.
- One match_id in the match data is cut off. Check the point data to fix it manually.
- Some match_ids appear in the point data but not in the match data.
- There are a few cases where a player's handedness is entered wrong.
- Need to manually fix "Final TB?" for Wimbledon 2019 and later.

- There are missing values in the point data columns "Gm1", "Gm2", and "rallyCount". These need to be fixed manually.
- There are rallies without point ending code.
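
A few of the problems listed above can be detected automatically before any manual fixing. The sketch below is only an illustration on made-up data: the helper name `report_data_problems` is hypothetical, while the column names `match_id`, `Gm1`, `Gm2`, and `rallyCount` are the ones mentioned above.

```python
import numpy as np
import pandas as pd

def report_data_problems(df_matches, df_points):
    """Hypothetical helper: count duplicates and missing values."""
    return {
        # Rows whose match_id already appeared earlier in the table
        "duplicate_match_ids": int(df_matches["match_id"].duplicated().sum()),
        # Total number of empty cells in the match data
        "missing_match_values": int(df_matches.isna().sum().sum()),
        # Number of empty cells per point-data column
        "missing_point_values": df_points[["Gm1", "Gm2", "rallyCount"]].isna().sum().to_dict(),
    }

# Toy data (made-up values, for illustration only)
df_matches_demo = pd.DataFrame({"match_id": ["a", "a", "b"],
                                "Surface": ["hard", "hard", np.nan]})
df_points_demo = pd.DataFrame({"Gm1": [0, np.nan],
                               "Gm2": [0, 0],
                               "rallyCount": [np.nan, np.nan]})

problem_report = report_data_problems(df_matches_demo, df_points_demo)
print(problem_report)
```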

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Match data

Download the raw match data from https://github.com/JeffSackmann/tennis_MatchChartingProject.
<br>For male players, download charting-m-matches.csv. It's better to convert it to a .xls file.

Some commonly used functions are specified in this section.

## Functions for match data cleaning

In [None]:
import pandas as pd
import numpy as np
import re
import math

In [None]:
# Clean match data
def clean_match_data(df_matches):

    df_matches = df_matches.copy()

    # There are several duplicated matches. Remove them.
    df_matches.drop_duplicates(subset="match_id", inplace=True)

    # Data cleaning
    # Remove unnecessary white spaces and convert to lower case letters
    df_matches["Player 1"] = df_matches["Player 1"].str.strip()
    df_matches["Player 2"] = df_matches["Player 2"].str.strip()
    df_matches["Player 1"] = df_matches["Player 1"].replace(to_replace=r"\s+", value="_", regex=True)
    df_matches["Player 2"] = df_matches["Player 2"].replace(to_replace=r"\s+", value="_", regex=True)
    df_matches["Player 1"] = df_matches["Player 1"].str.lower()
    df_matches["Player 2"] = df_matches["Player 2"].str.lower()
    df_matches["Pl 1 hand"] = df_matches["Pl 1 hand"].str.strip()
    df_matches["Pl 2 hand"] = df_matches["Pl 2 hand"].str.strip()
    df_matches["Pl 1 hand"] = df_matches["Pl 1 hand"].str.lower()
    df_matches["Pl 2 hand"] = df_matches["Pl 2 hand"].str.lower()
    df_matches["Tournament"] = df_matches["Tournament"].str.strip()
    df_matches["Tournament"] = df_matches["Tournament"].str.lower()
    df_matches["Tournament"] = df_matches["Tournament"].replace(to_replace=r"\s+", value="_", regex=True)
    df_matches["Surface"] = df_matches["Surface"].str.strip()
    df_matches["Surface"] = df_matches["Surface"].str.lower()
    df_matches["Gender"] = df_matches["Gender"].str.strip()
    df_matches["Gender"] = df_matches["Gender"].str.lower()
    df_matches["Round"] = df_matches["Round"].str.strip()
    df_matches["Round"] = df_matches["Round"].str.lower()

    # Change date to string type
    df_matches = df_matches.astype({"Date": "str"})

    # Fill missing data with "unknown"
    df_matches["Surface"] = df_matches["Surface"].fillna("unknown")
    df_matches["Tournament"] = df_matches["Tournament"].fillna("unknown")
    df_matches["Player 1"] = df_matches["Player 1"].fillna("unknown")
    df_matches["Player 2"] = df_matches["Player 2"].fillna("unknown")
    df_matches["Round"] = df_matches["Round"].fillna("unknown")

    return df_matches
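
As a quick illustration of what the string normalization above does to a single column, here is a small sketch on sample values. It chains the same kind of steps (strip, collapse whitespace to underscores, lower case, fill missing values) on a standalone Series; the variable names are chosen for this example only.

```python
import pandas as pd

# Sample values: extra whitespace, mixed case, and one missing entry
names = pd.Series(["  Roger  Federer ", "Rafael Nadal", None])

normalized = (
    names.str.strip()                          # remove leading/trailing spaces
         .str.replace(r"\s+", "_", regex=True)  # inner whitespace -> underscore
         .str.lower()                          # lower case
         .fillna("unknown")                    # missing value -> "unknown"
)
print(normalized.tolist())
```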

### Tennis match formats

Tennis tournaments use a variety of formats and have a complicated history of format changes. Here is a summary:
- Australian Open: Before 2019, no tiebreak in the 5th set. Since 2019, after 6-6 in the 5th set, there is a 10 point tiebreak.
- Wimbledon: Before 2019, no tiebreak in the 5th set. Since 2019, after 12-12 in the 5th set, there is a 7 point tiebreak.
- Roland Garros: No tiebreak in the 5th set.
- US Open: After 6-6 in the 5th set, there is a 7 point tiebreak.
- NextGen ATP Final: All singles matches are the best-of-five sets, with each set the first to four games (not six games). Tiebreak at 3-All. No-Ad scoring (server’s choice in 2019, receiver's choice in 2018).
- Davis Cup: Before 2016, it's best of 5 sets with no tiebreak in the 5th set (same as Wimbledon and Roland Garros). Since 2016, there is a 7-point tiebreak at 6-6 in the 5th set. Since 2018, all matches are the best of three tiebreak sets.
- Olympics: Before 2016, all the matches are best of three sets, except for the final, which is best of 5 sets, with no tiebreak in the final set of every match. From 2016, all the matches are best of three sets, except for the final, which is best of 5 sets, with a 7-point tiebreak in the final set of every match. From Tokyo 2021, the final match will also be best of 3 sets.
- Laver Cup: Best of 3 sets, with the third set a 10-point tiebreaker.
- All the other tournaments are best of 3 tiebreak sets.

All these can affect the calculation of the game and set score gaps.

The word "tiebreaker" was the original term; "tiebreak" came into use more recently. The two words are used interchangeably.

The function below will set the match format properly.


In [None]:
# Specify the match format and try to handle all the special cases.
def set_match_format(match_data, match_format_file):

    # This file contains the match format for special cases, such as grand slams.
    match_format = pd.read_csv(match_format_file)

    match_data = match_data.copy()

    # Set the most common match format for professional tennis matches.
    match_data["Best of"] = 3
    match_data["Final TB?"] = 1
    match_data["best_of_game"] = 6
    match_data["regular_tiebreak_trigger"] = 6
    match_data["final_set_tiebreak_trigger"] = 6

    for index, row in match_data.iterrows():
        # We get match information directly from match_id
        match_info = pd.Series(row["match_id"].split("-"), index =
                           ['date','gender', "tournament", "round", "player1", "player2"])

        match_info = match_info.str.lower()

        year = re.findall(r"^(\d{4})", match_info["date"])
        if len(year) > 0:
            match_info["year"] = int(year.pop(0))
        else:
            # Without a year, the range checks below would fail on None,
            # so keep the default format for this match.
            continue

        # Handle special cases, such as grand slams, Olympics, Davis Cup, etc.
        for index2, row2 in match_format.iterrows():
            if ((row2["tournament"] in match_info["tournament"]) &
               (match_info["year"] >= row2["year_min"]) &
               (match_info["year"] <= row2["year_max"])):
                if ((row2["round"] == "any") |
                    ((row2["round"] != "any") &
                    (((match_info["round"] == "f") &
                      (row2["round"] == "f")) |
                      ((match_info["round"] != "f") &
                       (row2["round"] != "f"))))):
                    match_data.at[index, "Best of"] = row2["best_of_set"]
                    match_data.at[index, "Final TB?"] = row2["final_tb"]
                    match_data.at[index, "best_of_game"] = row2["best_of_game"]
                    match_data.at[index, "regular_tiebreak_trigger"] = \
                        row2["regular_tiebreak_trigger"]
                    match_data.at[index, "final_set_tiebreak_trigger"] = \
                        row2["final_set_tiebreak_trigger"]
        # print("end of for")

    return match_data
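
`set_match_format` reads its special cases from `match_format_file`, which is not shown here. The sketch below guesses at what such a table might look like, using only the column names the function reads. The two rows and all their values are hypothetical examples based on the format summary above, and `lookup_format` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical format table; the real file may encode these cases differently.
match_format_demo = pd.DataFrame(
    [
        ("wimbledon", 2019, 2021, "any", 5, 1, 6, 6, 12),
        ("nextgen", 2018, 2019, "any", 5, 1, 4, 3, 3),
    ],
    columns=["tournament", "year_min", "year_max", "round",
             "best_of_set", "final_tb", "best_of_game",
             "regular_tiebreak_trigger", "final_set_tiebreak_trigger"],
)

def lookup_format(tournament, year, formats=match_format_demo):
    """Return the format rows whose tournament name and year range match."""
    mask = (
        # Same substring test as in set_match_format
        formats["tournament"].apply(lambda name: name in tournament.lower())
        & (formats["year_min"] <= year)
        & (formats["year_max"] >= year)
    )
    return formats[mask]

print(lookup_format("Wimbledon", 2019))
```

A match outside every listed range, such as Wimbledon 2018 in this toy table, returns no rows, which corresponds to keeping the defaults set at the top of `set_match_format`.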

In [None]:
# Clean the handedness columns because there are errors.
# Create a table of players and their handednesses.
def clean_handedness_match_data(df_matches):
    """Replace each player's handedness with the value recorded most often for that player."""
    df_matches = df_matches.copy()
    players = get_players(df_matches)
    df_players_handedness = []

    for player in players:
        # Logic for Player 1 side
        p1_mask = df_matches["Player 1"] == player
        selected_p1_hand = df_matches[p1_mask]["Pl 1 hand"]
        counts_p1 = selected_p1_hand.value_counts()

        # Logic for Player 2 side
        p2_mask = df_matches["Player 2"] == player
        selected_p2_hand = df_matches[p2_mask]["Pl 2 hand"]
        counts_p2 = selected_p2_hand.value_counts()

        # Combine counts to find the true handedness
        combined_counts = counts_p1.add(counts_p2, fill_value=0)
        
        if len(combined_counts) > 0:
            correct_hand = combined_counts.idxmax()
            
            # Update dataframe with corrected hand
            df_matches.loc[p1_mask, "Pl 1 hand"] = correct_hand
            df_matches.loc[p2_mask, "Pl 2 hand"] = correct_hand
            
            df_players_handedness.append({"player": player, "handedness": correct_hand})

    return df_matches, pd.DataFrame(df_players_handedness)
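
The majority-vote idea behind this function can also be written without a loop over players. The following is a sketch of that alternative on toy data, not a replacement for the function above; the player names and the `long_form` and `majority_hand` variables are made up for the example.

```python
import pandas as pd

# Toy data: in the last row, player "a" is wrongly recorded as left-handed
df_demo = pd.DataFrame({
    "Player 1": ["a", "a", "b"],
    "Player 2": ["b", "c", "a"],
    "Pl 1 hand": ["r", "r", "l"],
    "Pl 2 hand": ["l", "r", "l"],
})

# Stack both sides into one long table with columns "player" and "hand"
long_form = pd.concat(
    [
        df_demo[["Player 1", "Pl 1 hand"]].rename(
            columns={"Player 1": "player", "Pl 1 hand": "hand"}),
        df_demo[["Player 2", "Pl 2 hand"]].rename(
            columns={"Player 2": "player", "Pl 2 hand": "hand"}),
    ],
    ignore_index=True,
)

# For each player, keep the handedness that was recorded most often
majority_hand = long_form.groupby("player")["hand"].agg(
    lambda hands: hands.value_counts().idxmax())
print(majority_hand)
```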

## Functions for match data retrieval

In [None]:
# How many matches are there for each player in the database?
def get_player_match_count(df_matches):
    # Series.append() was removed in pandas 2.0; use pd.concat() instead.
    players = pd.concat([df_matches["Player 1"], df_matches["Player 2"]])
    player_match_count = players.value_counts()

    return player_match_count

In [None]:
def get_players(df_matches):
    # Series.append() was removed in pandas 2.0; use pd.concat() instead.
    players = pd.concat([df_matches["Player 1"], df_matches["Player 2"]]).unique()

    return players

In [None]:
# Find out how many right handed and left handed players are in the database
def get_handedness_count(df_players_handedness):
    # The parameter name must match the name used below.
    unique_handed_players = df_players_handedness["handedness"]
    handedness_count = unique_handed_players.value_counts()
    return handedness_count

In [None]:
# Find the number of matches per tournament
def get_match_count_by_tournament(df_matches):
    tournaments = df_matches["Tournament"]
    tournaments_value_counts = tournaments.value_counts(dropna=False)
    return tournaments_value_counts

In [None]:
# Input: match data
# Output: A dataframe with each player's match count for each tournament
def get_player_match_count_by_tournament(data):

    data = data.copy()

    # Selecting just the columns, not the rows.
    # Renaming the player column makes concatenation easier:
    # this way, the players from both sides end up in one column.
    player1_tournament = data[["Player 1", "Tournament", "Year"]].rename(columns={"Player 1": "Player"})
    player2_tournament = data[["Player 2", "Tournament", "Year"]].rename(columns={"Player 2": "Player"})

    # DataFrame.append() was removed in pandas 2.0; use pd.concat() instead.
    player_tournament = pd.concat([player1_tournament, player2_tournament], ignore_index=True)

    # Count the matches for each combination of Player, Tournament, and Year
    player_tournament_count = player_tournament.groupby(["Player", "Tournament", "Year"])["Player"].count()

    # reset_index() turns "Player", "Tournament", and "Year" back into regular columns
    # and stores the counts in a new "Count" column
    player_tournament_count = player_tournament_count.reset_index(name="Count")

    # Sort values from greatest to least
    player_tournament_count.sort_values(by=["Count"], ascending=False, inplace=True)
    return player_tournament_count

In [None]:
# This function counts how many times two players have played within the database,
# regardless of who served first or whether they appeared just in "Player 1" or "Player 2"
def get_player_head_to_head_count(data):
    data = data.copy()

    # checking how many matches "A" as player 1 has played against "B" as player 2
    player_hth_count = data.groupby(["Player 1", "Player 2"])["Player 1"].count()
    player_hth_count = player_hth_count.reset_index(name="Count")
    player_hth_count.sort_values(by=["Count"], ascending=False, inplace=True)
#     print(player_hth_count.head(20))
    # Make a swapped copy so each pairing is also listed from the other player's side.
    # The data in the copy has already gone through groupby(), so no need to do groupby() again
    player_hth_count2 = player_hth_count.copy()
    # Swapping column names to count how many matches "B" as player 1 has played against "A" as player 2
    player_hth_count2.rename(columns={"Player 1": "Player 2", "Player 2": "Player 1"}, inplace=True)
    # DataFrame.append() was removed in pandas 2.0; use pd.concat() instead.
    player_hth_count = pd.concat([player_hth_count, player_hth_count2])
    # Sum up in "Count" because that is the number of times they have played
    player_hth_count = player_hth_count.groupby(["Player 1", "Player 2"])["Count"].sum()
    # Reset the index so "Player 1" and "Player 2" become regular columns again,
    # with the summed totals in the "Count" column
    player_hth_count = player_hth_count.reset_index(name="Count")
    player_hth_count.sort_values(by=["Count"], ascending=False, inplace=True)

    return player_hth_count
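
The function above lists every pairing twice, once from each player's side. If one row per pairing is preferred, a possible alternative is to sort the two names within each row before counting. The sketch below shows this on toy data; the column names `player_a` and `player_b` and the sample players are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data: "a" and "b" meet twice, once in each column order
df_demo = pd.DataFrame({"Player 1": ["a", "b", "a", "c"],
                        "Player 2": ["b", "a", "c", "b"]})

# Sort the two names within each row so that (a, b) and (b, a) become the same pair
sorted_pairs = np.sort(df_demo[["Player 1", "Player 2"]].to_numpy(), axis=1)
pairs = pd.DataFrame(sorted_pairs, columns=["player_a", "player_b"])

# Count the matches for each unordered pair
pair_counts = (pairs.groupby(["player_a", "player_b"])
                    .size()
                    .reset_index(name="Count")
                    .sort_values(by="Count", ascending=False))
print(pair_counts)
```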

In [None]:
# For each edition of a tournament (tournament and year), count the matches in the database.
# For example, how many matches from a given year's "US Open" appear in the database.
def get_tournament_count(data):
    data = data.copy()

    tournament_year_count = data.groupby(["Tournament", "Year"])["Tournament"].count()
    # reset_index() turns "Tournament" and "Year" back into regular columns
    # and stores the counts in a new "Count" column
    tournament_year_count = tournament_year_count.reset_index(name="Count")
    # Sort values from greatest to least
    tournament_year_count.sort_values(by=["Count"], ascending=False, inplace=True)

    return tournament_year_count

In [None]:
# Select matches based on different conditions
# This is one of the most important functions
def get_matches(data, match_id = None, player1=None, player2=None, tournament=None, surface=None,
                   player1_handedness=None, player2_handedness=None, best_of=None, date=None):
    df_results = data.copy()

    if match_id is not None:
        df_results = df_results.loc[df_results["match_id"] == match_id]

    if player1 is not None:
        df_results = df_results.loc[(df_results["Player 1"].str.contains(player1)) | (df_results["Player 2"].str.contains(player1))]

    if player2 is not None:
        df_results = df_results.loc[(df_results["Player 1"].str.contains(player2)) | (df_results["Player 2"].str.contains(player2))]

    if tournament is not None:
        df_results = df_results.loc[(df_results["Tournament"].str.contains(tournament))]

    if surface is not None:
        df_results = df_results.loc[(df_results["Surface"].str.contains(surface))]

    # Keep the matches in which player1's opponent has the requested handedness
    if (player1 is not None) and (player2_handedness is not None):
        df_results = df_results.loc[((df_results["Player 1"].str.contains(player1)) & (df_results["Pl 2 hand"] == player2_handedness)) |
                                   ((df_results["Player 2"].str.contains(player1)) & (df_results["Pl 1 hand"] == player2_handedness))]

    # Keep the matches in which player2's opponent has the requested handedness
    if (player2 is not None) and (player1_handedness is not None):
        df_results = df_results.loc[((df_results["Player 1"].str.contains(player2)) & (df_results["Pl 2 hand"] == player1_handedness)) |
                                   ((df_results["Player 2"].str.contains(player2)) & (df_results["Pl 1 hand"] == player1_handedness))]

    if best_of is not None:
        df_results = df_results.loc[(df_results["Best of"] == best_of)]

    # The date is matched as a prefix, so "2019" selects every match from 2019
    if date is not None:
        df_results = df_results.loc[(df_results["Date"].str.contains("^" + date))]

    return df_results


## Point-by-point data

Download the raw point data from https://github.com/JeffSackmann/tennis_MatchChartingProject.
<br>For male players, download charting-m-points.csv.

<br> If a very large csv file can't be read after downloading, open it in Excel and save it in the "CSV UTF-8 (Comma delimited)" format. The entire spreadsheet can then be read.
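
If the read failure is caused by the file's text encoding (an assumption; the cause is not stated above), a possible alternative to re-saving the file in Excel is to retry the read with a second encoding. The sketch below demonstrates this on a tiny temporary file with made-up content; `read_csv_with_fallback` is a hypothetical helper.

```python
import os
import tempfile

import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: try each encoding in turn until the file can be decoded."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding, low_memory=False)
        except UnicodeDecodeError as error:
            last_error = error
    raise last_error

# Demo: a tiny file written in Latin-1 (made-up content, not real charting data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_points.csv")
    with open(demo_path, "wb") as handle:
        handle.write("match_id,note\nm1,café\n".encode("latin-1"))
    df_demo_csv = read_csv_with_fallback(demo_path)

print(df_demo_csv)
```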

Some commonly used functions are specified in this section.

## Functions for point data cleaning and pre-processing

In [None]:
def clean_point_data(df_points):
    """Convert the integer columns to nullable integers and repair point scores that Excel turned into dates."""
    df_points = df_points.copy()

    # Use 'Int32' (capital I) for nullable integers to prevent crash on NaN
    int_cols = ["TB?", "Pt", "Set1", "Set2", "Gm1", "Gm2", "Svr", "Ret", "rallyCount", "PtWinner", "isSvrWinner"]
    for col in int_cols:
        if col in df_points.columns:
            # Special case for TB? which might have 'S'
            if col == "TB?":
                df_points[col] = df_points[col].astype(str).str.replace("S", "2", case=False)
            df_points[col] = pd.to_numeric(df_points[col], errors='coerce').astype('Int32')

    # Month conversion logic
    month_to_num = {"jan":"1", "feb":"2", "mar":"3", "apr":"4", "may":"5", "jun":"6", 
                    "jul":"7", "aug":"8", "sep":"9", "oct":"10", "nov":"11", "dec":"12"}
    
    # Process 'Pts' column (handling Excel's auto-date conversion)
    if 'Pts' in df_points.columns:
        def fix_pts_score(row):
            pts = str(row['Pts']).lower()
            for m_name, m_num in month_to_num.items():
                if m_name in pts:
                    score = pts.split("-")
                    if len(score) == 2:
                        # If month is second, it's likely flipped (e.g. 2-Mar instead of 3-2)
                        if m_name in score[1]:
                            return f"{m_name}-{score[0]}".replace(m_name, m_num)
                    return pts.replace(m_name, m_num)
            return pts.replace("00", "0")
        
        df_points['Pts'] = df_points.apply(fix_pts_score, axis=1)

    return df_points
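
The reason for the nullable `Int32` type (capital I) used above can be seen on a small example. This sketch uses made-up values: converting a column that contains missing or unparseable entries to plain `int32` fails, while `Int32` keeps those entries as `<NA>`.

```python
import pandas as pd

# Made-up raw values: one missing entry and one that is not a number
raw_games = pd.Series(["3", None, "x", "7"])

# Unparseable values become NaN, so the result is a float column
as_float = pd.to_numeric(raw_games, errors="coerce")

# A plain NumPy integer column cannot hold NaN
try:
    as_float.astype("int32")
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True

# The nullable integer type keeps the gaps as <NA>
as_nullable_int = as_float.astype("Int32")
print(plain_cast_failed)
print(as_nullable_int)
```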

## Functions for point data retrieval

In [None]:
# Input: point_data (all the point data), match_data (user selected matches)
# Output: df_selected_points (all the points from the selected matches)
# This function is used to select points before analyzing them.
def get_points_in_matches(point_data, match_data):

    # Use a copy to avoid "A value is trying to be set on a copy of a slice from a DataFrame" warning.
    point_data = point_data.copy()

    # There may be multiple match_ids
    df_selected_points = point_data.loc[point_data["match_id"].isin(match_data["match_id"])]

    # df_selected_points.info()
    return df_selected_points

In [None]:
# Get the number of points in a match
# Cross check with the maximum point index in the data table

def get_point_count(point_data, match_id):
    points_in_match = point_data[point_data["match_id"] == match_id]

    point_count = len(points_in_match)
    max_point_index = points_in_match["Pt"].max()
    if point_count != max_point_index:
        print("Error: point count is different from the maximum point index.")
        print(match_id)

    return point_count

In [None]:
# Get all the points either before or after the specified point_id
# If you want to get the points for the entire match, just use the default values.

# search_direction can be either "before" or "after".

# Return a list of points, list of sets (that each contains a list of games),
# and a list of all the games, all sorted.

# The returned data can be further processed to find service point, return points,
# service games, and return games, etc.

def get_points_in_one_match(point_data, match_id, point_id = 0,
                            search_direction = "after"):

    # Get all the points in the selected match
    points_in_match = point_data[point_data["match_id"] == match_id]

    if points_in_match.empty:
        print("Error: Cannot find this match.")
        print(match_id)
        return None

    # Get all the points from the beginning of the match
    # This is the "flat" points table
    if search_direction == "before":
        selected_points = points_in_match[(points_in_match["Pt"] < point_id)]
    elif search_direction == "after":
        selected_points = points_in_match[(points_in_match["Pt"] > point_id)]
    else:
        print("Error: You can only use \"before\" or \"after\" in search_direction.")
        return None

    if selected_points.empty:
        print("Error: Cannot find points.")
        return None

    # Sort the point table by point index
    selected_points = selected_points.sort_values(by=["Pt"])

    # Divide points by sets and games
    # This is the "nested" points table
    list_sets = []
    # Divide all the points into sets, and then into games.
    # The sets are sorted
    for set_index in np.sort(selected_points["set_index"].unique()):
        list_games = []
        # For each set, find all the points in this set
        points_in_set = selected_points[selected_points["set_index"] == set_index]

        # For each game, find all the points in this game
        # Divide all the points in a set into games.
        # The games are sorted
        for game_index in np.sort(points_in_set["game_index_in_set"].unique()):
            points_in_game = points_in_set[points_in_set["game_index_in_set"] == game_index]
            # Add all the points in a game to this list.
            list_games.append(points_in_game)
        # Add the per-game point list to the set list.
        list_sets.append(list_games)

    # Check if the nested point list is the same as the flat point list

    # This is how to flatten a nested list into a flat list
    # Now we have a list of all the games
    list_games = [item for sublist in list_sets for item in sublist]

    # Merge all the dataframes in the list into one
    # In this case, merge all the game dataframes into one
    selected_points2 = pd.concat(list_games)

    # Check if the nested list has the same content as the flat list
    if not selected_points.equals(selected_points2):
        print("Error: The nested point list " +
              "is not the same as the flat point list.")

    # Return a flat point list, a nested set and game lists, and a game list
    return selected_points, list_sets, list_games
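
The nested set and game lists built above can also be produced with `groupby`. The following sketch shows that alternative on toy data, assuming the same `set_index`, `game_index_in_set`, and `Pt` columns; it is an illustration only and not a replacement for the function above.

```python
import pandas as pd

# Toy data: set 1 has two games, set 2 has one game
points_demo = pd.DataFrame({
    "Pt": [3, 1, 2, 4, 5],
    "set_index": [1, 1, 1, 2, 2],
    "game_index_in_set": [2, 1, 1, 1, 1],
})

sorted_points = points_demo.sort_values(by="Pt")

# One list per set, each holding one dataframe per game, both in sorted order
list_sets_demo = [
    [game for _, game in points_in_set.groupby("game_index_in_set", sort=True)]
    for _, points_in_set in sorted_points.groupby("set_index", sort=True)
]

# Flatten the nested list and check that no point was lost or reordered
list_games_demo = [game for games in list_sets_demo for game in games]
rebuilt_points = pd.concat(list_games_demo)
print(rebuilt_points.equals(sorted_points))
```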

In [None]:
# Get points based on their places on the match timeline

# Keep the arguments simple. Avoid using long and complex arguments in a method.
# It's better to write the simple functions and then write the
# more complex functions by calling several basic functions, rather than creating a
# function with complex arguments.
def get_service_points(point_data, player):
    point_data = point_data.copy()

    return point_data[point_data["server_name"].str.contains(player)]

def get_return_points(point_data, player):
    point_data = point_data.copy()

    return point_data[point_data["returner_name"].str.contains(player)]

# Get X points to the end of the given point list.
def get_last_x_points(point_data, x = 1):
    point_data = point_data.copy()

    point_data.sort_values(by=["Pt"], inplace = True)

    return point_data.tail(x)

def get_first_x_points(point_data, x = 1):
    point_data = point_data.copy()

    point_data.sort_values(by=["Pt"], inplace = True)

    return point_data.head(x)

def is_tiebreak_game(points_in_game):
    # Check if all the points in the given table are tiebreak points.
    tb_symbol = points_in_game["TB?"].unique()

    if len(tb_symbol) != 1:
        # There is more than one distinct value in "TB?"
        return False
    elif(tb_symbol[0] == 1):
        return True
    else:
        return False

# get_last_x_service_games can be achieved by calling get_service_games()
# and then getting the last x games from the list.
# Tiebreak games are excluded.
def get_service_games(list_games, player):
    list_service_games = []
    for game in list_games:
        if ((is_tiebreak_game(game) != True) &
            (game[game["server_name"].str.contains(player)].empty != True)):
            list_service_games.append(game)

    return list_service_games

# Tiebreak games are excluded
def get_return_games(list_games, player):
    list_return_games = []

    for game in list_games:
        if ((is_tiebreak_game(game) != True) &
            (game[game["returner_name"].str.contains(player)].empty != True)):
            list_return_games.append(game)

    return list_return_games

# Get tiebreak games
def get_tiebreak_games(list_games):
    list_tiebreak_games = []

    for game in list_games:
        if (is_tiebreak_game(game) == True):
            list_tiebreak_games.append(game)

    return list_tiebreak_games



In [None]:
def get_player_names(point_data, match_id):
    points_in_match= point_data[point_data["match_id"] == match_id]

    if points_in_match.empty:
        print("Error: The specified match does not exist.")
        return None

    first_point = points_in_match.head(1)

    return first_point["player1"].to_string(index=False), first_point["player2"].to_string(index=False)

In [None]:
def get_score(point_data, match_id):

    points_in_match= point_data[point_data["match_id"] == match_id]

    if points_in_match.empty:
        print("Error: The specified match does not exist.")
        return None

    player1_name, player2_name = get_player_names(point_data, match_id)
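    # Collect the last point of each set, then build the score table once.
    # DataFrame.append() was removed in pandas 2.0; use pd.concat() instead.
    last_points = []

    set_indices = np.sort(points_in_match["set_index"].unique())

    for set_index in set_indices:
        points_in_set = points_in_match[points_in_match["set_index"] == set_index]
        # Sort by point index so that tail(1) is the last point of the set
        points_in_set = points_in_set.sort_values(by="Pt")
        last_point = points_in_set.tail(1)
        last_points.append(last_point[["Set1.1", "Set2.1", "Gm1.1", "Gm2.1"]])

    scores = pd.concat(last_points, ignore_index=True)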

    scores.columns =[player1_name + "_set", player2_name + "_set", player1_name+"_game", player2_name+"_game"]
    return scores

## Calculate the winning probabilities for different game scores and point scores


Calculate winning probabilities for different game score scenarios

In [None]:
# Calculate winning probability for different game scores.
# Rows are collected in a list and turned into a dataframe once at the end,
# because DataFrame.append() is slow inside a loop and was removed in pandas 2.0.
game_score_rows = []

for index_m, row_m in df_matches.iterrows():
  my_match_id = row_m["match_id"]
  match_format = row_m["Best of"]

  # Get points from one match
  points_in_match, all_sets, all_games = get_points_in_one_match(df_points,
                                                                 my_match_id)

  # Get player names
  player_name1, player_name2 = get_player_names(points_in_match, my_match_id)
  # print(player_name1 + " " + player_name2)

  # Go through the sets in order
  set_indices = np.sort(points_in_match["set_index"].unique())

  for set_index in set_indices:
    points_in_set = points_in_match[points_in_match["set_index"] == set_index]
    # Sort by point index so the first and last rows are the first and last points of the set
    points_in_set = points_in_set.sort_values(by="Pt")

    set_winner_id = points_in_set.iloc[-1]["PtWinner"]
    first_server_id = points_in_set.iloc[0]["Svr"]

    # Get the unique game score combinations
    result_df = points_in_set.drop_duplicates(subset=['Gm1', 'Gm2'], keep='first')
    # display(result_df)

    for index, row in result_df.iterrows():
      # One row from player 1's point of view
      player1_won = bool(set_winner_id == 1)
      game_score_rows.append({"name": player_name1,
                              "serve_first": bool(first_server_id == 1),
                              "game_score_player": row["Gm1"], "game_score_opp": row["Gm2"],
                              "win_count": int(player1_won), "lose_count": int(not player1_won)})

      # One row from player 2's point of view, with the game score reversed
      player2_won = bool(set_winner_id == 2)
      game_score_rows.append({"name": player_name2,
                              "serve_first": bool(first_server_id == 2),
                              "game_score_player": row["Gm2"], "game_score_opp": row["Gm1"],
                              "win_count": int(player2_won), "lose_count": int(not player2_won)})

game_score_win_prob = pd.DataFrame(
    game_score_rows,
    columns=['name','serve_first','game_score_player','game_score_opp','win_count','lose_count'])

# Calculate the win or lose count for different score scenarios.
# If we want to calculate win or lose count per player, we can add "name" to groupby()
game_score_win_prob = game_score_win_prob.groupby(["serve_first",
                                                   "game_score_player",
                                                   "game_score_opp"]).agg({"win_count":"sum","lose_count":"sum"}).reset_index()

display(game_score_win_prob)

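| | serve_first | game_score_player | game_score_opp | win_count | lose_count |
|---|---|---|---|---|---|
| 0 | False | 0 | 0 | 2242 | 2213 |
| 1 | False | 0 | 1 | 1129 | 1783 |
| 2 | False | 0 | 2 | 202 | 798 |
| 3 | False | 0 | 3 | 88 | 632 |
| 4 | False | 0 | 4 | 9 | 308 |
| ... | ... | ... | ... | ... | ... |
| 121 | True | 13 | 13 | 1 | 1 |
| 122 | True | 14 | 13 | 1 | 1 |
| 123 | True | 14 | 14 | 1 | 1 |
| 124 | True | 14 | 15 | 0 | 1 |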


In [None]:
# Calculate win rate
game_score_win_prob["win_rate"] = round((game_score_win_prob["win_count"] / (game_score_win_prob["win_count"] + game_score_win_prob["lose_count"])), 2)
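
As a sanity check, the win rate formula can be applied to the first five rows of the table displayed above. The sketch below copies those counts into a small dataframe (the name `win_rate_demo` is chosen for this example) and also guards against a zero denominator, which the cell above does not need as long as every score has at least one recorded set.

```python
import pandas as pd

# Counts copied from the first five rows of the displayed table
win_rate_demo = pd.DataFrame({
    "win_count": [2242, 1129, 202, 88, 9],
    "lose_count": [2213, 1783, 798, 632, 308],
})

total = win_rate_demo["win_count"] + win_rate_demo["lose_count"]

# where() turns a zero total into NaN, so the division yields NaN instead of inf
win_rate_demo["win_rate"] = (win_rate_demo["win_count"] / total.where(total > 0)).round(2)
print(win_rate_demo)
```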

In [None]:
# Save the data to a csv file
game_score_win_prob.to_csv("/content/gdrive/MyDrive/Tennis_data/w_game_score_win_prob.csv", index=False)

In [None]:
display(game_score_win_prob)

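| | serve_first | game_score_player | game_score_opp | win_count | lose_count | win_rate |
|---|---|---|---|---|---|---|
| 0 | False | 0 | 0 | 2242 | 2213 | 0.50 |
| 1 | False | 0 | 1 | 1129 | 1783 | 0.39 |
| 2 | False | 0 | 2 | 202 | 798 | 0.20 |
| 3 | False | 0 | 3 | 88 | 632 | 0.12 |
| 4 | False | 0 | 4 | 9 | 308 | 0.03 |
| ... | ... | ... | ... | ... | ... | ... |
| 121 | True | 13 | 13 | 1 | 1 | 0.50 |
| 122 | True | 14 | 13 | 1 | 1 | 0.50 |
| 123 | True | 14 | 14 | 1 | 1 | 0.50 |
| 124 | True | 14 | 15 | 0 | 1 | 0.00 |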


Calculate the winning percentage for different point score scenarios

In [None]:
# The game_index is wrong in "20080125-M-Australian_Open-SF-Jo_Wilfried_Tsonga_-Rafael_Nadal"

# Create a dataframe
point_score_win_prob = pd.DataFrame(
    columns=["name","server", "tiebreak",
             "point_score_player", "point_score_opp",
             'win_count','lose_count'])

# This takes too long. Process 1/3 of the matches at a time.
for i in range(2200, 3424):
  print("match index: " + str(i))

  # Get "match id" and "best of" from the matches list
  my_match_id = df_matches.iloc[i]["match_id"]
  match_format = df_matches.iloc[i]["Best of"]

  # print(my_match_id)

  # Get points from one match
  points_in_match, all_sets, all_games = get_points_in_one_match(df_points,
                                                            my_match_id)

  # Get player names
  player_name1, player_name2 = get_player_names(points_in_match, my_match_id)
  # print(player_name1 + " " + player_name2)

  # Get points from each set
  for set_index in np.sort(points_in_match["set_index"].unique()):
    points_in_set = points_in_match[(points_in_match["set_index"] == set_index)]
    points_in_set = points_in_set.sort_values(by="Pt") # sort by point index

    # Get points from each game
    for game_index_in_set in np.sort(points_in_set["game_index_in_set"].unique()):
      points_in_game = points_in_set[(points_in_set["game_index_in_set"] == game_index_in_set)]
      points_in_game = points_in_game.sort_values(by="Pt")

      game_winner_id = points_in_game.iloc[-1]["PtWinner"]
      game_server_id = points_in_game.iloc[0]["Svr"]

      # Is this a tiebreak game?
      if points_in_game.iloc[0]["TB?"] == 0:
        is_tiebreak = False
      else:
        is_tiebreak = True

      # Get the unique point scores
      result_df = points_in_game.drop_duplicates(subset=["Pts"], keep='first')
      # display(result_df)

      for index, row in result_df.iterrows():
        # Separate point score because the point score is like 15-40 in the table
        # Need to separate them into 15 and 40
        point_score = row["Pts"].split("-", maxsplit=1)

        # Use player name instead of game_winner_id?
        if game_winner_id == 1:
          win_count = 1
          lose_count = 0
        else:
          win_count = 0
          lose_count = 1

        if game_server_id == 1:
          is_server = True
        else:
          is_server = False

        # If this is a tiebreak game, ignore who serves first by
        # setting is_server to false
        if is_tiebreak:
          is_server = False

        new_row_player1 = {"name": player_name1, "server": is_server,
                          "tiebreak": is_tiebreak,
                          "point_score_player": point_score[0],
                          "point_score_opp": point_score[1],
                          "win_count": win_count, "lose_count": lose_count}

        # Append the row for player 1 to the table
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        point_score_win_prob = pd.concat(
            [point_score_win_prob, pd.DataFrame([new_row_player1])],
            ignore_index=True)

        if game_winner_id == 2:
          win_count = 1
          lose_count = 0
        else:
          win_count = 0
          lose_count = 1

        if game_server_id == 2:
          is_server = True
        else:
          is_server = False

        # If this is a tiebreak game, ignore who serves first by
        # setting is_server to false
        if is_tiebreak:
          is_server = False

        # Need to reverse the point score
        new_row_player2 = {"name": player_name2, "server": is_server,
                          "tiebreak": is_tiebreak,
                          "point_score_player": point_score[1],
                          "point_score_opp": point_score[0],
                          "win_count": win_count, "lose_count": lose_count}

        # Append the row for player 2 to the table
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        point_score_win_prob = pd.concat(
            [point_score_win_prob, pd.DataFrame([new_row_player2])],
            ignore_index=True)

# Sum the win and lose counts for each score scenario, grouped by player.
# Including "name" in groupby() gives per-player counts.
point_score_win_prob_with_name = point_score_win_prob.groupby(["name", "server", "tiebreak",
                                                     "point_score_player",
                                                     "point_score_opp"]).agg({"win_count":"sum","lose_count":"sum"}).reset_index()

# Sum the overall win and lose counts for each score combination.
# Leave "name" out of groupby() to pool all players.
point_score_win_prob_no_name = point_score_win_prob.groupby(["server", "tiebreak",
                                                     "point_score_player",
                                                     "point_score_opp"]).agg({"win_count":"sum","lose_count":"sum"}).reset_index()

# display(point_score_win_prob)
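
The loop above grows the dataframe one row at a time, which is one likely reason it takes so long. A minimal sketch of an alternative, assuming the same row layout: collect the row dicts in a plain Python list and build the dataframe once at the end. The helper `make_row`, the player names, and the scores are made up for illustration and are not part of the notebook.

```python
import pandas as pd

columns = ["name", "server", "tiebreak",
           "point_score_player", "point_score_opp",
           "win_count", "lose_count"]

def make_row(name, is_server, is_tiebreak, score_player, score_opp, won_game):
    """Hypothetical helper: one row dict shaped like new_row_player1 / new_row_player2."""
    return {"name": name, "server": is_server, "tiebreak": is_tiebreak,
            "point_score_player": score_player, "point_score_opp": score_opp,
            "win_count": int(won_game), "lose_count": int(not won_game)}

rows = []  # appending to a list is cheap compared with growing a dataframe

# Toy data: player_a serves and wins a game that passed through 0-0 and 15-0
for pts in ["0-0", "15-0"]:
    p1, p2 = pts.split("-", maxsplit=1)
    rows.append(make_row("player_a", True, False, p1, p2, won_game=True))
    # The opponent sees the same score reversed
    rows.append(make_row("player_b", False, False, p2, p1, won_game=False))

# Build the dataframe once, after all rows are collected
toy_win_prob = pd.DataFrame(rows, columns=columns)

# Same aggregation as in the notebook, without "name"
toy_no_name = (toy_win_prob
               .groupby(["server", "tiebreak",
                         "point_score_player", "point_score_opp"])
               .agg({"win_count": "sum", "lose_count": "sum"})
               .reset_index())
```

In the real loop, the two `rows.append(...)` calls would replace the per-row dataframe appends, and the groupby steps would stay as they are.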

In [None]:
# Calculate win rate for different score combinations, grouped by players
point_score_win_prob_with_name["win_rate"] = round((point_score_win_prob_with_name["win_count"] / (point_score_win_prob_with_name["win_count"] + point_score_win_prob_with_name["lose_count"])), 2)
point_score_win_prob_with_name.to_csv("/content/gdrive/MyDrive/Tennis_data/m_point_score_win_prob_with_name3.csv", index=False)
display(point_score_win_prob_with_name)

Unnamed: 0,name,server,tiebreak,point_score_player,point_score_opp,win_count,lose_count,win_rate
0,aaron_krickstein,False,False,0,0,17,78,0.18
1,aaron_krickstein,False,False,0,15,11,28,0.28
2,aaron_krickstein,False,False,0,30,6,7,0.46
3,aaron_krickstein,False,False,0,40,2,0,1.00
4,aaron_krickstein,False,False,15,0,6,50,0.11
...,...,...,...,...,...,...,...,...
12695,younes_el_aynaoui,True,False,40,15,0,1,0.00
12696,younes_el_aynaoui,True,False,40,30,4,2,0.67
12697,younes_el_aynaoui,True,False,40,40,8,1,0.89
12698,younes_el_aynaoui,True,False,40,AD,8,0,1.00


In [None]:
# Calculate win rate for different score combinations, not grouped by players
point_score_win_prob_no_name["win_rate"] = round((point_score_win_prob_no_name["win_count"] / (point_score_win_prob_no_name["win_count"] + point_score_win_prob_no_name["lose_count"])), 2)
point_score_win_prob_no_name.to_csv("/content/gdrive/MyDrive/Tennis_data/m_point_score_win_prob_no_name3.csv", index=False)
display(point_score_win_prob_no_name)

Unnamed: 0,server,tiebreak,point_score_player,point_score_opp,win_count,lose_count,win_rate
0,False,False,0,0,7605,28963,0.21
1,False,False,0,15,3842,14692,0.21
2,False,False,0,30,1807,8459,0.18
3,False,False,0,40,913,5340,0.15
4,False,False,15,0,3763,14274,0.21
...,...,...,...,...,...,...,...
118,True,False,40,15,6941,1872,0.79
119,True,False,40,30,6600,2452,0.73
120,True,False,40,40,5961,2284,0.72
121,True,False,40,AD,3796,1484,0.72
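
Because the matches are processed in chunks, each saved file holds counts for only part of the data. The sketch below shows one way to merge such tables, assuming each chunk has the same columns as `point_score_win_prob_no_name`. The idea is to sum the counts first and recompute `win_rate` afterwards, since averaging per-chunk win rates would weight small chunks too heavily. The two chunk tables and their numbers are made up, and no file names are assumed.

```python
import pandas as pd

keys = ["server", "tiebreak", "point_score_player", "point_score_opp"]

# Made-up counts standing in for two per-chunk tables
chunk_a = pd.DataFrame({"server": [True, False],
                        "tiebreak": [False, False],
                        "point_score_player": ["40", "0"],
                        "point_score_opp": ["15", "0"],
                        "win_count": [8, 2],
                        "lose_count": [2, 8]})
chunk_b = pd.DataFrame({"server": [True, False],
                        "tiebreak": [False, False],
                        "point_score_player": ["40", "0"],
                        "point_score_opp": ["15", "0"],
                        "win_count": [7, 1],
                        "lose_count": [3, 9]})

# Stack the chunks, then sum the counts for each score scenario
combined = (pd.concat([chunk_a, chunk_b], ignore_index=True)
            .groupby(keys, as_index=False)[["win_count", "lose_count"]]
            .sum())

# Recompute the win rate from the pooled counts
total = combined["win_count"] + combined["lose_count"]
combined["win_rate"] = (combined["win_count"] / total).round(2)
```

With real data, the chunk tables would come from `pd.read_csv` on the saved per-chunk files, dropping any stale `win_rate` column before the merge.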
