# Tennis Match Data Analysis 

For an introduction to the tennis match data set, visit 
https://docs.google.com/document/d/1pUpK76tkvP09V1ptHHEv77gEnzyZLIMGq2vNobyICGE/edit

## How to load and process tennis data?

Please note the following:
- It takes a long time to process the point data. So we avoid processing the point data if possible. 
- The raw match data is missing many match_ids. Some match_ids in the point data are not in the original match data. So we need to synchronize them before using the match data. 
- Both match data and point data need cleaning. 
- We need to process the point data to generate new data columns. 

Here are the steps for loading data: 
- Fastest: Load the processed point data and processed match data, and then you can start your data analysis. 
- Slow: If you download new data from GitHub, you need to clean and process both the raw match data and raw point data. 
- Faster: If you have changed some of the match data cleaning and processing functions, load the raw match data and process it. Then load the processed point data and synchronize the two. The match data will be updated; the point data is unchanged. Save the processed match data.
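
The choice between the three loading paths can be sketched as a small helper. This is only an illustration: the function name `choose_loading_path` and its two flags are hypothetical and do not appear elsewhere in this notebook.

```python
def choose_loading_path(have_new_raw_data: bool, match_cleaning_changed: bool) -> str:
    """Return which of the three loading paths applies (hypothetical helper)."""
    if have_new_raw_data:
        # New download from GitHub: clean and process raw match AND raw point data.
        return "slow"
    if match_cleaning_changed:
        # Re-process the raw match data only, reuse the processed point data,
        # synchronize the two, and save the processed match data.
        return "faster"
    # Nothing changed: load the processed match data and processed point data.
    return "fastest"


print(choose_loading_path(have_new_raw_data=False, match_cleaning_changed=True))
```

The point of the ordering is that the expensive point-data processing is only repeated when the raw data itself is new.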

## Problem with data

- The original match data has some problem on line 955. Need to manually fix it. 
- There are missing values in match data.
- There are duplicate match_ids in the match data.
- Match data has a match_id that is cut off. Need to check the point data to manually fix it. 
- There are match_ids in the point data that are not in the match data.
- There are a few cases where a player's handedness is entered wrong.
- Need to manually fix "Final TB?" for Wimbledon 2019 and later. 

- There are missing values in point data: "Gm1", "Gm2", "rallyCount". Need to fix them manually.  
- There are rallies without point ending code. 
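
Several of the problems listed above can be detected with a few pandas calls before any manual fixing. The sketch below uses small made-up tables in place of the real match and point data; only the column names `match_id`, `Surface`, and `Gm1` come from this notebook.

```python
import pandas as pd

# Toy tables standing in for the real data (all values are made up).
df_matches = pd.DataFrame({
    "match_id": ["m1", "m2", "m2", "m3"],
    "Surface": ["hard", None, None, "clay"],
})
df_points = pd.DataFrame({
    "match_id": ["m1", "m1", "m2", "m4"],
    "Gm1": [0, 1, None, 0],
})

# Duplicate match_ids in the match data.
duplicate_ids = df_matches.loc[df_matches["match_id"].duplicated(), "match_id"].unique().tolist()

# Number of missing values per column in the match data.
missing_in_matches = df_matches.isna().sum()

# match_ids that are in the point data but not in the match data.
ids_only_in_points = sorted(set(df_points["match_id"]) - set(df_matches["match_id"]))

# Rows of the point data with a missing "Gm1" value.
rows_missing_gm1 = df_points.index[df_points["Gm1"].isna()].tolist()

print(duplicate_ids, ids_only_in_points, rows_missing_gm1)
print(missing_in_matches)
```

Checks like these only locate the problems; the manual fixes described above are still needed.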

## 1. Commonly used functions for match data

Download the raw match data from https://github.com/JeffSackmann/tennis_MatchChartingProject.
<br>For male players, download charting-m-matches.csv. It's better to convert it to a .xls file. 

Some commonly used functions are specified in this section.

## Functions for match data cleaning

In [109]:
import pandas as pd
import numpy as np
import re
import math

In [110]:
def clean_match_data(df_matches): 
    
    df_matches = df_matches.copy()
    
    # There are several duplicated matches. Remove them. 
    df_matches.drop_duplicates(subset="match_id", inplace=True)
    
    # Data cleaning
    df_matches["Player 1"] = df_matches["Player 1"].str.strip()
    df_matches["Player 2"] = df_matches["Player 2"].str.strip()
    df_matches["Player 1"] = df_matches["Player 1"].replace(to_replace=r"\s+", value="_", regex=True)
    df_matches["Player 2"] = df_matches["Player 2"].replace(to_replace=r"\s+", value="_", regex=True)
    df_matches["Player 1"] = df_matches["Player 1"].str.lower()
    df_matches["Player 2"] = df_matches["Player 2"].str.lower()
    df_matches["Pl 1 hand"] = df_matches["Pl 1 hand"].str.strip()
    df_matches["Pl 2 hand"] = df_matches["Pl 2 hand"].str.strip()
    df_matches["Pl 1 hand"] = df_matches["Pl 1 hand"].str.lower()
    df_matches["Pl 2 hand"] = df_matches["Pl 2 hand"].str.lower()
    df_matches["Tournament"] = df_matches["Tournament"].str.strip()
    df_matches["Tournament"] = df_matches["Tournament"].str.lower()
    df_matches["Tournament"] = df_matches["Tournament"].replace(to_replace=r"\s+", value="_", regex=True)
    df_matches["Surface"] = df_matches["Surface"].str.strip()
    df_matches["Surface"] = df_matches["Surface"].str.lower()
    df_matches["Gender"] = df_matches["Gender"].str.strip()
    df_matches["Gender"] = df_matches["Gender"].str.lower()
    df_matches["Round"] = df_matches["Round"].str.strip()
    df_matches["Round"] = df_matches["Round"].str.lower()
#     df_matches["Final TB?"] = df_matches["Final TB?"].str.strip()
#     df_matches["Final TB?"] = df_matches["Final TB?"].str.lower()

    df_matches = df_matches.astype({"Date": "str"})
    # df_matches = df_matches.astype({"Best of": "int32"})
    df_matches["Year"] = df_matches["Date"].str.extract(r"^(\d{4})")

    df_matches["Surface"] = df_matches["Surface"].fillna("unknown")
    df_matches["Tournament"] = df_matches["Tournament"].fillna("unknown")
#     df_matches["Best of"] = df_matches["Best of"].fillna(0)
    df_matches["Player 1"] = df_matches["Player 1"].fillna("unknown")
    df_matches["Player 2"] = df_matches["Player 2"].fillna("unknown")
#     df_matches["Pl 1 hand"] = df_matches["Pl 1 hand"].fillna("unknown")
#     df_matches["Pl 2 hand"] = df_matches["Pl 2 hand"].fillna("unknown")
    df_matches["Round"] = df_matches["Round"].fillna("unknown")
#     df_matches["Final TB?"] = df_matches["Final TB?"].fillna("unknown")
    
#     df_matches = df_matches.astype({"Best of": "int32"}) 
    
    return df_matches
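
Most of the cleaning above repeats the same four steps on each text column: strip, replace whitespace with underscores, lowercase, and fill missing values. As a sketch only, those steps can be collected into one helper; `normalize_text_column` is a hypothetical name and is not used elsewhere in this notebook.

```python
import pandas as pd


def normalize_text_column(column: pd.Series) -> pd.Series:
    """Strip, replace runs of whitespace with "_", lowercase, and fill missing values."""
    return (
        column.str.strip()
        .str.replace(r"\s+", "_", regex=True)
        .str.lower()
        .fillna("unknown")
    )


# Made-up example values.
players = pd.Series(["  Roger Federer ", "Rafael  Nadal", None])
cleaned = normalize_text_column(players)
print(cleaned.tolist())
```

Note that `clean_match_data` deliberately does not fill the handedness columns, so a helper like this would not fit every column unchanged.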

### Tennis match formats

Tennis tournaments use a variety of formats and have a complicated history of format changes. Here is a summary:
- Australian Open: Before 2019, no tiebreak in the 5th set. Since 2019, after 6-6 in the 5th set, there is a 10 point tiebreak.
- Wimbledon: Before 2019, no tiebreak in the 5th set. Since 2019, after 12-12 in the 5th set, there is a 7 point tiebreak.
- Roland Garros: No tiebreak in the 5th set.
- US Open: After 6-6 in the 5th set, there is a 7 point tiebreak. 
- NextGen ATP Final: All singles matches are the best-of-five sets, with each set the first to four games (not six games). Tiebreak at 3-All. No-Ad scoring (server’s choice in 2019, receiver's choice in 2018).
- Davis Cup: Before 2016, it's best of 5 sets with no tiebreak in the 5th set (same as Wimbledon and Roland Garros). Since 2016, there is a 7-point tiebreak at 6-6 in the 5th set. Since 2018, all matches are the best of three tiebreak sets.
- Olympics: Before 2016, all the matches are best of three sets, except for the final, which is best of 5 sets, with no tiebreak in the final set of every match. From 2016, all the matches are best of three sets, except for the final, which is best of 5 sets, with a 7-point tiebreak in the final set of every match. From Tokyo 2021, the final match will also be best of 3 sets. 
- Laver Cup: Best of 3 sets, with the third set a 10-point tiebreaker.
- All the other tournaments are best of 3 tiebreak sets. 

All these can affect the calculation of the game and set score gaps. 

The word "tiebreaker" was the original term. "Tiebreak" came into use more recently. The two words are used interchangeably. 

The function below will set the match format properly.


In [111]:
# Set the match format and try to handle all the special cases.
def set_match_format(match_data, match_format_file):
    
    # This file contains the match format for special cases, such as grand slams.
    match_format = pd.read_csv(match_format_file)
    
    match_data = match_data.copy()
    
    # Set the most common match format for professional tennis matches. 
    match_data["Best of"] = 3
    match_data["Final TB?"] = 1
    match_data["best_of_game"] = 6
    match_data["regular_tiebreak_trigger"] = 6
    match_data["final_set_tiebreak_trigger"] = 6
    
    for index, row in match_data.iterrows():
        # We get match information directly from match_id
        match_info = pd.Series(row["match_id"].split("-"), index = 
                           ['date','gender', "tournament", "round", "player1", "player2"])

        match_info = match_info.str.lower()

        year = re.findall(r"^(\d{4})", match_info["date"])
        if len(year) > 0:
            match_info["year"] = int(year.pop(0))
        else: 
            match_info["year"] = None
        
        # If you use Series.str.lower() on a Series with different data types, 
        # it will turn a number into NaN.
        
        # Handle special cases, such as grand slams, Olympics, Davis Cup, etc.
        for index2, row2 in match_format.iterrows():
            if ((row2["tournament"] in match_info["tournament"]) & 
               (match_info["year"] >= row2["year_min"]) & 
               (match_info["year"] <= row2["year_max"])):
                if ((row2["round"] == "any") | 
                    ((row2["round"] != "any") &
                    (((match_info["round"] == "f") & 
                      (row2["round"] == "f")) |
                      ((match_info["round"] != "f") &
                       (row2["round"] != "f"))))):
                    match_data.at[index, "Best of"] = row2["best_of_set"]
                    match_data.at[index, "Final TB?"] = row2["final_tb"]
                    match_data.at[index, "best_of_game"] = row2["best_of_game"]
                    match_data.at[index, "regular_tiebreak_trigger"] = \
                        row2["regular_tiebreak_trigger"]
                    match_data.at[index, "final_set_tiebreak_trigger"] = \
                        row2["final_set_tiebreak_trigger"]

    return match_data
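
The function above expects a match format file with the columns `tournament`, `year_min`, `year_max`, `round`, `best_of_set`, `final_tb`, `best_of_game`, `regular_tiebreak_trigger`, and `final_set_tiebreak_trigger`. The sketch below builds a few hypothetical rows from the format summary and applies the same matching rule to a single match. The tournament spellings, the 0/1 coding of `final_tb`, and the year bounds 0 and 9999 are assumptions, not values taken from the real file.

```python
import pandas as pd

columns = ["tournament", "year_min", "year_max", "round", "best_of_set", "final_tb",
           "best_of_game", "regular_tiebreak_trigger", "final_set_tiebreak_trigger"]

# Hypothetical rows based on the format summary (placeholder year bounds).
match_format = pd.DataFrame(
    [
        ["australian_open", 0, 2018, "any", 5, 0, 6, 6, 6],
        ["australian_open", 2019, 9999, "any", 5, 1, 6, 6, 6],
        ["wimbledon", 0, 2018, "any", 5, 0, 6, 6, 6],
        ["wimbledon", 2019, 9999, "any", 5, 1, 6, 6, 12],
    ],
    columns=columns,
)


def lookup_format(tournament: str, year: int, round_name: str):
    """Return the last matching rule as a dict, or None if no special case applies."""
    selected = None
    for _, rule in match_format.iterrows():
        tournament_ok = rule["tournament"] in tournament
        year_ok = rule["year_min"] <= year <= rule["year_max"]
        # "any" matches every round; otherwise final rules match finals only
        # and non-final rules match non-finals only.
        round_ok = (rule["round"] == "any") or ((rule["round"] == "f") == (round_name == "f"))
        if tournament_ok and year_ok and round_ok:
            selected = rule.to_dict()
    return selected


print(lookup_format("wimbledon", 2019, "f"))
```

As in `set_match_format`, a later matching row overrides an earlier one, and a match with no matching row keeps the default best-of-three format.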

In [112]:
# clean the handedness columns because there are errors. Create a table of players and their handednesses.
def clean_handedness_match_data(df_matches):
    
    df_matches = df_matches.copy()
    
    df_players_handedness = pd.DataFrame()
    
    players = get_players(df_matches)

    # Check every player and check if there is an error in this player's handedness.
    # Some players may be marked as righthanded in some matches and lefthanded in other matches.
    # A player cannot have two different handednesses. 
    for player in players:

        #these two lines select the columns "Pl 1 hand" and "Pl 2 hand" from df_matches for each player
        selected_player1_hand = df_matches[(df_matches["Player 1"] == player)]["Pl 1 hand"]
        selected_player2_hand = df_matches[(df_matches["Player 2"] == player)]["Pl 2 hand"]

        #correct_handedness needs to have an initial value
        #so that if there is no error in handedness,
        #we can just use the default handedness
        #if there is an error, it will display later in the if function
        if len(selected_player1_hand) > 0:
            # Use the majority of the handedness as a default handedness. 
            correct_handedness = selected_player1_hand.value_counts().idxmax()
        elif len(selected_player2_hand) > 0:
            correct_handedness = selected_player2_hand.value_counts().idxmax()
        else:
            correct_handedness = None

        #assign variable so that you do not have to continuously use the value_counts() function
        selected_player_hand_value_counts = selected_player1_hand.value_counts()

        #if the player has more than one handedness in Pl 1 hand column,
        #replace the wrong hand with correct hand
        if (len(selected_player_hand_value_counts) > 1):
            print(player)
            print(selected_player_hand_value_counts)

            #finding which hand occurs more frequently for the player
            #this is also the correct handedness for that player
            correct_handedness = selected_player_hand_value_counts.idxmax()

            #finding which hand occurs less frequently for the player
            #this is also the wrong handedness for that player
            wrong_handedness = selected_player_hand_value_counts.idxmin()

            #must save back to the original column and replace wrong_hand with correct_hand
            df_matches.loc[(df_matches["Player 1"] == player), "Pl 1 hand"] = \
                df_matches.loc[(df_matches["Player 1"] == player), "Pl 1 hand"].replace(
                wrong_handedness, correct_handedness)

        #Do the same for the "Pl 2 hand" column
        selected_player_hand_value_counts = selected_player2_hand.value_counts()
        if (len(selected_player_hand_value_counts) > 1):
            print(player)
            print(selected_player_hand_value_counts)
            correct_handedness = selected_player_hand_value_counts.idxmax()
            wrong_handedness = selected_player_hand_value_counts.idxmin()
            df_matches.loc[(df_matches["Player 2"] == player), "Pl 2 hand"] = \
                df_matches.loc[(df_matches["Player 2"] == player), "Pl 2 hand"].replace(
                wrong_handedness, correct_handedness)

        row = {"player": player, "handedness": correct_handedness}
        # DataFrame.append() was removed in pandas 2.0; use pd.concat() instead.
        df_players_handedness = pd.concat([df_players_handedness, pd.DataFrame([row])],
                                          ignore_index=True)

    # df_players_handedness now has the correct handedness for each player. 
#     print(df_players_handedness)
    
    return df_matches, df_players_handedness
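
The majority-vote idea can also be sketched in a vectorized form. Unlike the function above, which treats the "Pl 1 hand" and "Pl 2 hand" columns separately, this variant pools both columns before voting, so it is an alternative under that assumption rather than a drop-in replacement. The toy data is made up, and a player whose handedness is missing in every match would need extra handling.

```python
import pandas as pd

# Toy match data; the last row has a wrong "Pl 1 hand" entry for rafael_nadal.
df = pd.DataFrame({
    "Player 1": ["rafael_nadal", "rafael_nadal", "roger_federer", "rafael_nadal"],
    "Pl 1 hand": ["l", "l", "r", "r"],
    "Player 2": ["roger_federer", "novak_djokovic", "rafael_nadal", "novak_djokovic"],
    "Pl 2 hand": ["r", "r", "l", "r"],
})

# Stack both sides of each match into one long (player, hand) table.
side1 = df[["Player 1", "Pl 1 hand"]].rename(columns={"Player 1": "player", "Pl 1 hand": "hand"})
side2 = df[["Player 2", "Pl 2 hand"]].rename(columns={"Player 2": "player", "Pl 2 hand": "hand"})
long_table = pd.concat([side1, side2], ignore_index=True)

# Majority vote: the most frequent handedness per player.
majority = long_table.groupby("player")["hand"].agg(lambda s: s.value_counts().idxmax())

# Write the corrected handedness back to both columns.
df["Pl 1 hand"] = df["Player 1"].map(majority)
df["Pl 2 hand"] = df["Player 2"].map(majority)

print(majority)
```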

## Functions for match data retrieval. 

In [113]:
# How many matches are there for each player in the database?
def get_player_match_count(df_matches):
    # Series.append() was removed in pandas 2.0; use pd.concat() instead.
    players = pd.concat([df_matches["Player 1"], df_matches["Player 2"]])
    player_match_count = players.value_counts()
    
    return player_match_count 

In [114]:
def get_players(df_matches):
    # Series.append() was removed in pandas 2.0; use pd.concat() instead.
    players = pd.concat([df_matches["Player 1"], df_matches["Player 2"]]).unique()
    
    return players

In [115]:
# Find out how many right handed and left handed players are in the database
def get_handedness_count(df_players_handedness): 
    unique_handed_players = df_players_handedness["handedness"]
    handedness_count = unique_handed_players.value_counts()
    return handedness_count

In [116]:
# Find the number of matches per tournament
def get_match_count_by_tournament(df_matches):
    tournaments = df_matches["Tournament"]
    tournaments_value_counts = tournaments.value_counts(dropna=False)
    return tournaments_value_counts

In [117]:
# Input: match data
# Output: A dataframe with each player's match count for each tournament
def get_player_match_count_by_tournament(data):
    
    data = data.copy()
    
    # Selecting just the columns, not the rows
    player1_tournament = data[["Player 1", "Tournament", "Year"]]
    player2_tournament = data[["Player 2", "Tournament", "Year"]]
    
    # Rename the player columns so that both sides stack into a single "Player" column.
    # rename() returns a new DataFrame, which avoids modifying a slice of data in place.
    player1_tournament = player1_tournament.rename(columns={"Player 1": "Player"})
    player2_tournament = player2_tournament.rename(columns={"Player 2": "Player"})
    
    # DataFrame.append() was removed in pandas 2.0, so stack the two tables with pd.concat().
    player_tournament = pd.concat([player1_tournament, player2_tournament], ignore_index=True)
#     print(player_tournament_sorting)
    # Grouping the players by Player, Tournament, and Year, combines similar matches with same value in column
    player_tournament_count = player_tournament.groupby(["Player", "Tournament", "Year"])["Player"].count()
    # groupby().count() returns a Series indexed by "Player", "Tournament", and "Year".
    
    # reset_index() turns those index levels back into columns and names the counts column "Count".
    player_tournament_count = player_tournament_count.reset_index(name="Count")
    
    # Sort values from greatest to least
    player_tournament_count.sort_values(by=["Count"], ascending=False, inplace=True)
    return player_tournament_count

In [118]:
# This function counts how many times two players have played within the database,
# regardless of who served first or whether they appeared just in "Player 1" or "Player 2"
def get_player_head_to_head_count(data):
    data = data.copy()
    
    # checking how many matches "A" as player 1 has played against "B" as player 2
    player_hth_count = data.groupby(["Player 1", "Player 2"])["Player 1"].count()
    player_hth_count = player_hth_count.reset_index(name="Count")
    player_hth_count.sort_values(by=["Count"], ascending=False, inplace=True)
#     print(player_hth_count.head(20))
    # Making a copy of the data so the original is not affected
    # The data in the copy has already gone through groupby(), so no need to do groupby() again
    player_hth_count2 = player_hth_count.copy()
    # Swapping column names to count how many matches "B" as player 1 has played against "A" as player 2
    player_hth_count2.rename(columns={"Player 1": "Player 2", "Player 2": "Player 1"}, inplace=True)
    # DataFrame.append() was removed in pandas 2.0; pd.concat() aligns the columns by name.
    player_hth_count = pd.concat([player_hth_count, player_hth_count2])
    # Sum up in "Count" because that is the number of times they have played
    player_hth_count = player_hth_count.groupby(["Player 1", "Player 2"])["Count"].sum()
    # Reset the index so "Player 1" and "Player 2" become real columns, not indices
    # The summed values go into a column named "Count"
    player_hth_count = player_hth_count.reset_index(name="Count")
    player_hth_count.sort_values(by=["Count"], ascending=False, inplace=True)
    
    return player_hth_count
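
An alternative way to count head-to-head matches is to sort the two player names within each row, so that "A vs B" and "B vs A" fall into the same group. This is a sketch on made-up data; the column names `player_a` and `player_b` are hypothetical. Unlike the function above, which lists every pair in both orders, it returns one row per unordered pair.

```python
import numpy as np
import pandas as pd

# Toy match data: a and b meet twice, a and c meet twice.
df = pd.DataFrame({
    "Player 1": ["a", "b", "a", "c"],
    "Player 2": ["b", "a", "c", "a"],
})

# Sort the two names within each row so the order of appearance no longer matters.
sorted_pairs = np.sort(df[["Player 1", "Player 2"]].to_numpy(), axis=1)
pair_table = pd.DataFrame(sorted_pairs, columns=["player_a", "player_b"])

# Count matches per unordered pair, most frequent first.
h2h = (
    pair_table.groupby(["player_a", "player_b"])
    .size()
    .reset_index(name="Count")
    .sort_values(by="Count", ascending=False)
)

print(h2h)
```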

In [119]:
# For each edition of a tournament (tournament and year), count the matches that appear in the database. 
# For example, each "US Open" year in the database gets its own row with the number of matches from that year.
def get_tournament_count(data):
    data = data.copy()
    
    tournament_year_count = data.groupby(["Tournament", "Year"])["Tournament"].count()
    # reset_index() turns "Tournament" and "Year" back into columns
    # and names the counts column "Count"
    tournament_year_count = tournament_year_count.reset_index(name="Count")
    # Sort values from greatest to least
    tournament_year_count.sort_values(by=["Count"], ascending=False, inplace=True)
    
    return tournament_year_count

In [120]:
# Select matches based on different conditions
# This is one of the most important functions
def get_matches(data, match_id = None, player1=None, player2=None, tournament=None, surface=None, 
                   player1_handedness=None, player2_handedness=None, best_of=None, date=None):
    df_results = data.copy()
    
    if match_id != None:
        df_results = df_results.loc[df_results["match_id"] == match_id]
    
    if player1 != None:
        df_results = df_results.loc[(df_results["Player 1"].str.contains(player1)) | (df_results["Player 2"].str.contains(player1))]
    
    if player2 != None:
        df_results = df_results.loc[(df_results["Player 1"].str.contains(player2)) | (df_results["Player 2"].str.contains(player2))]
        
    if tournament != None:
        df_results = df_results.loc[(df_results["Tournament"].str.contains(tournament))]
        
    if surface != None:
        df_results = df_results.loc[(df_results["Surface"].str.contains(surface))]
    
    if (player1 != None) & (player2_handedness != None):
        df_results = df_results.loc[((df_results["Player 1"].str.contains(player1)) & (df_results["Pl 2 hand"] == player2_handedness)) | 
                                   ((df_results["Player 2"].str.contains(player1)) & (df_results["Pl 1 hand"] == player2_handedness))]
    
    if (player2 != None) & (player1_handedness != None):
        df_results = df_results.loc[((df_results["Player 1"].str.contains(player2)) & (df_results["Pl 2 hand"] == player1_handedness)) | 
                                   ((df_results["Player 2"].str.contains(player2)) & (df_results["Pl 1 hand"] == player1_handedness))]
        
    if (best_of != None):
        df_results = df_results.loc[(df_results["Best of"] == best_of)]
        
    if (date != None):
        df_results = df_results.loc[(df_results["Date"].str.contains("^" + date))]
                                    
    return df_results
    

## 2. Commonly used functions for point-by-point data

Download the raw point data from https://github.com/JeffSackmann/tennis_MatchChartingProject.
<br>For male players, download charting-m-points.csv. 

<br> When you download very large csv files and the files can't be read, you can use excel to save it as "CSV UTF-8(Comma delimited)" format, then you can read the entire spreadsheet.

Some commonly used functions are specified in this section.

## Functions for point data cleaning and pre-processing

In [121]:
import math

def clean_point_data(df_points):
    df_points = df_points.copy()
    
    # By explicitly setting the data types for different columns,
    # we are checking the data consistency. If there are missing 
    # values or multiple data types in a column, astype() will raise an error. 
    
    # Do we need to fill na with 0?
    df_points = df_points.astype({"rallyLen": "float64"})
    
    # Someone entered "S" in the "TB?" column to indicate a 10-point super-tiebreak
    df_points["TB?"] = df_points["TB?"].str.replace("S", "2", case=False)    
    df_points = df_points.astype({"TB?": "int32"})
    
    df_points = df_points.astype({"Pt": "int32"})
    df_points = df_points.astype({"Set1": "int32"})
    df_points = df_points.astype({"Set2": "int32"})
    
    # If there are missing values in "Gm1", you have to manually fix it. 
    for index, row in df_points.iterrows():
        if math.isnan(row["Gm1"]):
            print("\"Gm1\" == nan")
            print(row[["match_id", "Pt"]])
    
    df_points = df_points.astype({"Gm1": "int32"})
    
    # If there are missing values in "Gm2", you have to manually fix it.
    for index, row in df_points.iterrows():
        if math.isnan(row["Gm2"]):
            print("\"Gm2\" == nan")
            print(row[["match_id", "Pt", "Gm2"]])
            
    df_points = df_points.astype({"Gm2": "int32"})
    
    # df_points = df_points.astype({"TbSet": "int32"})
    df_points = df_points.astype({"Svr": "int32"})
    df_points = df_points.astype({"Ret": "int32"})
    df_points = df_points.astype({"rallyLen": "int32"})
    
    month_to_number = {"Jan":"1", "Feb":"2", "Mar":"3", "Apr":"4",
                      "May":"5", "Jun":"6", "Jul":"7", "Aug":"8",
                       "Sep":"9", "Oct":"10", "Nov":"11", "Dec":"12"}
    
    months = month_to_number.keys()
    
    for index, row in df_points.iterrows():

        # There are missing values in "rallyCount". You have to manually fix it.
        # Some values in "rallyLength" are incorrect. You have to manually fix it. 
        # There are easier ways to check this. 
        if math.isnan(row["rallyCount"]):
            print("\"rallyCount\" == nan")
            print(row[["match_id", "Pt", "rallyCount"]])
        
        # Excel automatically changes "2-1" (not "1-2") to "1-Feb", etc. So we have to swap them back. 
        # Excel also converts "1-0" to "Jan-00", etc. In this case, do not swap them. 
        if any(x in row["Pts"] for x in months):
            
            # Switch the server-returner score because Excel has changed the sequence. 
            score = row["Pts"].split("-") 
            
            if len(score) != 2:
                print("Error: the score is wrong")
                print(row)
            
            if score[1] in months:
                # If the second score is the name of a month, then switch the score, 
                # because Excel will convert "3-2" to "2-Mar", "5-6" to "6-May", etc. 
                df_points.at[index, "Pts"] = score[1] + "-" + score[0]
                # If the first score is the name of a month, do nothing, 
                # because Excel will convert "1-0" to "Jan-00", "3-0" to "Mar-00", "12-13" to "Dec-13", etc. 
                # It doesn't swap the numbers. 

        if any(x in row["PtsAfter"] for x in months):
            
            # Switch the server-returner score because Excel has changed the sequence. 
            score = row["PtsAfter"].split("-") 
            if len(score) != 2:
                print("Error: the score is wrong")
                print(row)
            
            if score[1] in months:
                df_points.at[index, "PtsAfter"] = score[1] + "-" + score[0]          
 
    for month in months:
        df_points["Pts"] = df_points["Pts"].str.replace(month, month_to_number[month], 
                                                        case=False)
        df_points["PtsAfter"] = df_points["PtsAfter"].str.replace(month, month_to_number[month], 
                                                                  case=False)
    
    # Excel will convert "1-0" to "Jan-00" so we need to replace 00 with 0.
    df_points["Pts"] = df_points["Pts"].str.replace("00", "0")
    df_points["PtsAfter"] = df_points["PtsAfter"].str.replace("00", "0")
    
    df_points = df_points.astype({"rallyCount": "int32"})
    df_points = df_points.astype({"PtWinner": "int32"})
    df_points = df_points.astype({"isSvrWinner": "int32"})
    
    return df_points
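
The Excel date repair is easier to follow on a single score string. The sketch below applies the same two rules to one value; `fix_excel_score` is a hypothetical helper. It matches month names exactly, whereas the function above replaces them case-insensitively.

```python
MONTH_TO_NUMBER = {"Jan": "1", "Feb": "2", "Mar": "3", "Apr": "4",
                   "May": "5", "Jun": "6", "Jul": "7", "Aug": "8",
                   "Sep": "9", "Oct": "10", "Nov": "11", "Dec": "12"}


def fix_excel_score(score: str) -> str:
    """Undo Excel's date conversion of a "server-returner" score string."""
    parts = score.split("-")
    if len(parts) != 2:
        return score  # Unexpected shape: leave it for manual inspection.
    first, second = parts
    if second in MONTH_TO_NUMBER:
        # "2-1" became "1-Feb": the month holds the first number, so swap back.
        return MONTH_TO_NUMBER[second] + "-" + first
    if first in MONTH_TO_NUMBER:
        # "1-0" became "Jan-00": the order is kept, only "00" needs fixing.
        return MONTH_TO_NUMBER[first] + "-" + ("0" if second == "00" else second)
    return score  # Not touched by Excel.


for raw in ["1-Feb", "Jan-00", "Dec-13", "40-30"]:
    print(raw, "->", fix_excel_score(raw))
```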

In [122]:
# Only call this if you are processing match data downloaded from GitHub. 
# If you are loading the processed match data, you don't need to call this.

# The original match data and point data are not consistent. 
# Some matches in the original point data are not in the original matches data. 
# Need to merge them
def sync_match_and_point_data(point_data, match_data):
    point_data = point_data.copy()
    match_data = match_data.copy()
    
    print("Checking if points data and match data are consistent ... ")
    num_matches_point_data = point_data["match_id"].nunique()
    num_matches_match_data = match_data["match_id"].nunique()
    print("There are " + str(num_matches_match_data) + " match ids in the match data set.")
    print("There are " + str(num_matches_point_data) + " match ids in the point data set.")
    
    if  num_matches_point_data != num_matches_match_data:
        print("The number of match ids do not match. Trying to fix the problem ...")
        unique_match_id_match_data = set(match_data["match_id"].unique())
        unique_match_id_point_data = set(point_data["match_id"].unique())
        
#         print("The following match ids are in the point data set but not the match data set")
#         print(*(unique_match_id_point_data - unique_match_id_match_data), sep="\n")
        
        print("The following match ids are in the match data set but not the point data set. Fix it.")
        print(*(unique_match_id_match_data - unique_match_id_point_data), sep ="\n")
        
        # Find the match ids that exist in the point data but not in the match data
        new_match_id = pd.Series(list(unique_match_id_point_data - unique_match_id_match_data))
        
        # Construct a dataframe based on the match ids in the point data but not in the match data 
        df_new_match_data = pd.DataFrame()
        df_new_match_data["match_id"] = new_match_id
        # Break each match_id into individual components
        df_new_match_data[['Date','Gender', "Tournament", "Round", "Player 1", "Player 2"]] = \
            new_match_id.str.lower().str.split("-", expand = True)
        
        df_new_match_data["Year"] = df_new_match_data["Date"].str.extract(r"^(\d{4})")
        
        df_new_match_data["Surface"] = "unknown"
        
        # "Best of" and "Final TB?" will be handled in another function.
        
        # Get a list of tournaments from the original match data
        list_current_matches = match_data["Tournament"].tolist()
        
        # We will check if the tournament in each new match_id already exists in the original 
        # match data. If so, we can copy the "surface", "best of", and "final tb?" into the new 
        # match data.
        
        # Write a function to generate Best of, Final TB, and Surface for a given match_id.  
        
        for index, row in df_new_match_data.iterrows():
            # If the current tournament is already in the original match data set, copy the surface
            # Best of, and Final TB to the new match data.
            if row["Tournament"] in list_current_matches: 
                # The current tournament may appear multiple times in the original match data
                # So you only select the first one. Use iloc[], do not use index chain. 
                df_new_match_data.loc[index, "Surface"] = \
                    match_data.loc[match_data["Tournament"] == row["Tournament"], 
                                   "Surface"].iloc[0]
#             else:
                # The current match is not previously known in the list. 
                
       
        # To-do: fill player handedness 
        
        # Merge the new match data with the original match data
        # DataFrame.append() was removed in pandas 2.0; use pd.concat() instead.
        df_merged_match_data = pd.concat([match_data, df_new_match_data], ignore_index=True)
        
        # Check if the merged match data is consistent with the point data. 
        num_matches_match_data = df_merged_match_data["match_id"].nunique()
        if num_matches_match_data == num_matches_point_data:
            print("Match data and point data are the same at " + str(num_matches_match_data))
            print("Problem solved")
        else:
            print("Number of matches in the match data: " + str(num_matches_match_data))
            print("Number of matches in the point data: " + str(num_matches_point_data))
            print("Still has problem")
            
        return df_merged_match_data
    else:
        print("No problem found")
        return match_data
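
The core of the synchronization is a set difference on match_ids followed by splitting each new match_id into its six fields. The sketch below shows just that step. The two match_ids are illustrative values written in the six-field format that the split assumes, not rows copied from the data set.

```python
import pandas as pd

MATCH_ID_FIELDS = ["Date", "Gender", "Tournament", "Round", "Player 1", "Player 2"]

# Illustrative match_ids (assumed format: six fields separated by "-").
ids_in_match_data = {"20190714-M-Wimbledon-F-Novak_Djokovic-Roger_Federer"}
ids_in_point_data = {
    "20190714-M-Wimbledon-F-Novak_Djokovic-Roger_Federer",
    "20190609-M-Roland_Garros-F-Rafael_Nadal-Dominic_Thiem",
}

# match_ids that are in the point data but not in the match data.
missing_ids = pd.Series(sorted(ids_in_point_data - ids_in_match_data))

# Split each match_id into its components, one column per field.
new_rows = missing_ids.str.lower().str.split("-", expand=True)
new_rows.columns = MATCH_ID_FIELDS
new_rows.insert(0, "match_id", missing_ids)
new_rows["Year"] = new_rows["Date"].str.extract(r"^(\d{4})", expand=False)

print(new_rows)
```

A match_id with more or fewer than six hyphen-separated fields would break the column assignment, so such ids need to be fixed by hand first.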


In [123]:
import re

# Identify serve side for each point and save the results in a new column "serve_side"
def identify_serve_side(data):
    data = data.copy()
    # List all the possible scores (before serve) and their corresponding serve sides. 
    dict_serve_side = {"0-0": "deuce", 
                       "0-15": "ad", 
                      "15-0": "ad",
                      "15-15": "deuce",
                      "30-0": "deuce",
                      "0-30": "deuce",
                      "30-15": "ad",
                      "15-30": "ad",
                      "40-0": "ad",
                      "0-40": "ad",
                      "40-15": "deuce",
                      "15-40": "deuce",
                       "30-30": "deuce",
                       "40-30": "ad",
                       "30-40": "ad",
                       "40-40": "deuce",
                       "40-AD": "ad",
                       "AD-40": "ad"
                      }

    data["serve_side"] = None

    for index, row in data.iterrows():
        if data.loc[index, "TB?"] == 0:
            # Identify serve sides based on the score (before serve)
            data.at[index, "serve_side"] = dict_serve_side[data.loc[index, "Pts"]]
        elif data.loc[index, "TB?"] > 0:
            # If "TB?" == 1, it's a regular 7-point tiebreak.
            # If "TB?" == 2, it's a 10-point super tiebreak.
            
            # For tiebreak points, if the sum of the two scores (before serve) is even, it's on the deuce side.
            # If the sum of the two scores (before serve) is odd, it's on the ad side. 
    #         print(data.loc[index, "Pts"])

            # Retrieve the first score
            tb_point1_str = re.search(r"^(\d+)-", data.loc[index, "Pts"])
            if tb_point1_str:
                tb_point1 = int(tb_point1_str.group(1))
            # Retrieve the second score
            tb_point2_str = re.search(r"-(\d+)", data.loc[index, "Pts"])
            if tb_point2_str:
                tb_point2 = int(tb_point2_str.group(1))

            if ((tb_point1 + tb_point2) % 2) == 0:
                data.at[index, "serve_side"] = "deuce"
            elif ((tb_point1 + tb_point2) % 2) == 1:
                data.at[index, "serve_side"] = "ad"
        else:
            print("Unable to identify server side.")
            print("TB? == " + str(data.loc[index, "TB?"]))

    # data[["Pts", "TB?", "serve_side"]]
    return data
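
The lookup table above follows the same parity rule as the tiebreak branch: the serve is from the deuce side when an even number of points has been played in the game, and from the ad side otherwise. The sketch below makes that explicit. `GAME_POINT_INDEX` and `serve_side` are hypothetical names, and "AD" is counted as one point beyond 40.

```python
# Number of points a player has won at each regular-game score.
GAME_POINT_INDEX = {"0": 0, "15": 1, "30": 2, "40": 3, "AD": 4}


def serve_side(pts: str, is_tiebreak: bool) -> str:
    """Return "deuce" or "ad" for a score string such as "30-15" or "6-5"."""
    first, second = pts.split("-")
    if is_tiebreak:
        # Tiebreak scores are plain point counts.
        points_played = int(first) + int(second)
    else:
        points_played = GAME_POINT_INDEX[first] + GAME_POINT_INDEX[second]
    return "deuce" if points_played % 2 == 0 else "ad"


for pts, tb in [("0-0", False), ("40-AD", False), ("6-6", True), ("3-2", True)]:
    print(pts, tb, serve_side(pts, tb))
```

Every entry of `dict_serve_side` agrees with this rule, so the function could serve as a cross-check on the table.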

In [124]:
# Separate serve direction from serve outcome

# Identify serve direction and outcome for each serve and save the results in new columns
# "Sv1_direction", "Sv1_outcome", "Sv2_direction", and "Sv2_outcome"
def identify_serve_direction_outcome(data):
    data = data.copy()
    
    # First serve
    # The first digit of "Sv1" is the serve direction
    data["Sv1_direction"] = data["Sv1"].str.extract(r"^(\d)")
    # If the serve direction is a character not 4, 5, or 6, set it to 0.
    data.loc[~data["Sv1_direction"].isin(["4", "5", "6"]), "Sv1_direction"] = "0"
    # If the serve direction is empty, fill it with 0. 
    # Assign the result back instead of using inplace=True on a selected column,
    # which may not update the DataFrame in recent pandas versions.
    data["Sv1_direction"] = data["Sv1_direction"].fillna("0")
    # Replace numeric code with a word. May need to keep the numbers for stats analysis. 
    data["Sv1_direction"] = data["Sv1_direction"].replace(
        {"4": "wide", "5": "body", "6": "t", "0": "unknown"})

    # Retrieve the serve outcome.
    # Whatever is inside the parentheses is what is captured.
    data["Sv1_outcome"] = data["Sv1"].str.extract(r"^\d(.+)")
    
    # Convert code to meaningful words.
    # replace() only works if the entire string in a cell matches the pattern. 
    # It cannot do substring replacement.
    # Assign the result back instead of using inplace=True on a column selection.
    data["Sv1_outcome"] = data["Sv1_outcome"].replace(
        {"n": "net", "d": "deep", "*": "ace", "w": "wide",
         "#": "unreturnable", "x": "wide_and_deep", 
         "+n": "S&V_net", "+d": "S&V_deep", "+*": "S&V_ace", 
         "+w": "S&V_wide", "+#": "S&V_unreturnable",
         "+x": "S&V_wide_and_deep"})
    # If the serve outcome is empty, it means the serve is in. 
    data["Sv1_outcome"] = data["Sv1_outcome"].fillna(value="in")

    # Second serve
    data["Sv2_direction"] = data["Sv2"].str.extract(r"^(\d)")
    data.loc[~data["Sv2_direction"].isin(["4", "5", "6"]), "Sv2_direction"] = "0"
 
    # Assign the result back instead of using inplace=True on a column selection.
    data["Sv2_direction"] = data["Sv2_direction"].replace(
        {"4": "wide", "5": "body", "6": "t", "0": "unknown"})

    data["Sv2_outcome"] = data["Sv2"].str.extract(r"^\d(.+)")
    # replace() only works if the entire string in a cell matches the pattern. 
    # It cannot do substring replacement.
    data["Sv2_outcome"] = data["Sv2_outcome"].replace(
        {"n": "net", "d": "deep", "*": "ace", "w": "wide",
         "#": "unreturnable", "x": "wide_and_deep", 
         "+n": "S&V_net", "+d": "S&V_deep", "+*": "S&V_ace", 
         "+w": "S&V_wide", "+#": "S&V_unreturnable",
         "+x": "S&V_wide_and_deep"})

    # data[["Pts", "serve_side", "Sv1", "Sv2", "Sv1_direction", "Sv2_direction", "Sv1_outcome", "Sv2_outcome"]]

    # print(data["Sv1_outcome"].value_counts())
    # data.info()
    return data
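
To see how the direction digit and the outcome code are split, here is a small sketch on a made-up series of "Sv1" codes. The sample values are hypothetical and only a few of the outcome codes are mapped. It is meant to illustrate the extract-then-replace idea, not to replace the function above.

```python
import pandas as pd

# Hypothetical "Sv1" codes: a direction digit followed by an optional outcome code.
sv1 = pd.Series(["4n", "6*", "5", "0d", None, "6+#"])

# The first digit is the serve direction. Anything other than 4, 5, or 6 becomes "0".
direction = sv1.str.extract(r"^(\d)", expand=False)
direction = direction.where(direction.isin(["4", "5", "6"]), "0")
direction = direction.replace({"4": "wide", "5": "body", "6": "t", "0": "unknown"})

# Everything after the first digit is the outcome code.
# replace() with a dict matches whole cell values only.
outcome = sv1.str.extract(r"^\d(.+)", expand=False)
outcome = outcome.replace({"n": "net", "d": "deep", "*": "ace", "+#": "S&V_unreturnable"})
# No outcome code means the serve was in.
outcome = outcome.fillna("in")

print(pd.DataFrame({"Sv1": sv1, "direction": direction, "outcome": outcome}))
```

Note that a missing "Sv1" value also ends up with the outcome "in", which mirrors the behavior of the fillna step in the function.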

In [125]:
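# Identify the server, returner, and point winner names for each point and save the 
# results in the new columns "server_name," "returner_name," and "point_winner_name."
# Also save the match information parsed from match_id (date, gender, tournament, 
# round, player1, and player2) in new columns.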
def identify_server_returner_point_winner(data):
    data = data.copy()
    
    data["server_name"] = None
    data["returner_name"] = None
    data["point_winner_name"] = None
    data["player1"] = None
    data["player2"] = None
    data["date"] = None
    data["gender"] = None
    data["tournament"] = None
    data["round"] = None

    for index, row in data.iterrows():
        # Retrieve player and tournament information from match_id, instead of
        # df_matches because it's simpler and perhaps more accurate. 
        
        match_info = pd.Series(row["match_id"].split("-"), index = 
                               ['date','gender', "tournament", "round", "player1", "player2"])
        match_info = match_info.str.lower()
        
        if row["Svr"] == 1:
            # Save data directly to the data frame.
            # Do not save data to row["Svr"] because it will not be saved to the data frame. 
            data.at[index, "server_name"] = match_info["player1"]
            data.at[index, "returner_name"] = match_info["player2"]
        elif row["Svr"] == 2:
            data.at[index, "server_name"] = match_info["player2"]
            data.at[index, "returner_name"] = match_info["player1"]

        if row["PtWinner"] == 1:
            data.at[index, "point_winner_name"] = match_info["player1"]
        else:
            data.at[index, "point_winner_name"] = match_info["player2"]

        data.at[index, "player1"] = match_info["player1"]
        data.at[index, "player2"] = match_info["player2"]
        data.at[index, "tournament"] = match_info["tournament"]
        data.at[index, "gender"] = match_info["gender"]
        data.at[index, "date"] = match_info["date"]
        data.at[index, "round"] = match_info["round"]

    # data[["server_name", "Svr", "Serving", "Pts", "serve_side", "Sv1", "Sv2", "Sv1_direction", "Sv2_direction", "Sv1_outcome", "Sv2_outcome"]]

    # data.to_csv("fed_nadal_points.csv")
    return data
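
Because every point of a match shares the same match_id, the same columns can also be derived without a row loop. The following is a sketch under two assumptions: the match_id has exactly six fields separated by "-", and no field (such as a player name) contains a hyphen. The sample match_id is made up. Unlike the loop above, np.where() assigns player2 whenever "Svr" is not 1, so invalid server values are not left empty.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the match_id below is made up.
points = pd.DataFrame({
    "match_id": ["20200101-M-Sample_Open-F-Player_One-Player_Two"] * 2,
    "Svr": [1, 2],
    "PtWinner": [2, 2],
})

fields = ["date", "gender", "tournament", "round", "player1", "player2"]

# Split the lower-cased match_id into one column per field.
info = points["match_id"].str.lower().str.split("-", expand=True)
info.columns = fields
points = points.join(info)

# Pick the names based on who serves and who wins the point.
points["server_name"] = np.where(points["Svr"] == 1, points["player1"], points["player2"])
points["returner_name"] = np.where(points["Svr"] == 1, points["player2"], points["player1"])
points["point_winner_name"] = np.where(points["PtWinner"] == 1,
                                       points["player1"], points["player2"])

print(points[["server_name", "returner_name", "point_winner_name"]])
```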

In [126]:
# Assign scores to the players. The original point score only shows the server score 
# and returner score. 

def assign_point_scores_to_players(point_data):
    
    point_data = point_data.copy()
    
    point_data["player1_point_score_before"] = None
    point_data["player2_point_score_before"] = None
    
    for index, row in point_data.iterrows():
        point_score = row["Pts"].split("-")
        
        if len(point_score) != 2:
            print("wrong point score: " + row["match_id"] + " " + str(point_score))
            # Skip this row. Indexing point_score below would fail or give wrong scores.
            continue

        # The original data table already switched score sequence when the server is switched.
        server_point = point_score[0]
        returner_point = point_score[1]
                
        if row["Svr"] == 1:
            point_data.at[index, "player1_point_score_before"] = server_point
            point_data.at[index, "player2_point_score_before"] = returner_point
        elif row["Svr"] == 2:
            point_data.at[index, "player2_point_score_before"] = server_point
            point_data.at[index, "player1_point_score_before"] = returner_point            
        else:
            print("Error: server is incorrect: ")
            print(row)
    
    return point_data  

In [127]:
# Calculate set index, game index, and game index within a set
# These indices can be used to retrieve points by sets and games

def calculate_set_game_index(point_data):
    point_data = point_data.copy()
    
    point_data["set_index"] = point_data["Set1"] + point_data["Set2"] + 1
    point_data["game_index_in_set"] = point_data["Gm1"] + point_data["Gm2"] + 1
    
    # Retrieve only game indices from point_data["Gm#"], which is in the format of "3(2)"
    # or just "3". Some points do not have point indices in point_data["Gm#"]. 
    # So we ignore the point indices.
    point_data["game_index"] = point_data["Gm#"].str.extract(r'^(\d+).*')
    
    return point_data
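
A quick check of the three indices on made-up rows. The values are hypothetical, and the conversion of the game index to an integer is an addition in this sketch. The function above keeps it as a string.

```python
import pandas as pd

# Hypothetical rows: sets and games won by each player before the point,
# plus "Gm#" in the formats "3(2)" or "3".
points = pd.DataFrame({
    "Set1": [0, 1, 1],
    "Set2": [0, 0, 1],
    "Gm1": [0, 2, 6],
    "Gm2": [0, 3, 6],
    "Gm#": ["1(1)", "15(4)", "36"],
})

points["set_index"] = points["Set1"] + points["Set2"] + 1
points["game_index_in_set"] = points["Gm1"] + points["Gm2"] + 1
# Keep only the leading digits and drop the point index in parentheses.
points["game_index"] = points["Gm#"].str.extract(r"^(\d+)", expand=False).astype(int)

print(points[["set_index", "game_index_in_set", "game_index"]])
```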
    

In [128]:
# Convert traditional tennis scores to simple point scores

def calculate_simple_point_score(point_data):
    
    point_data = point_data.copy()
    point_data["player1_simple_point_score_before"] = 0
    point_data["player2_simple_point_score_before"] = 0
    point_data["player1_simple_point_score_after"] = 0
    point_data["player2_simple_point_score_after"] = 0    
    
    point_data.sort_values(by=["match_id","Pt"], inplace=True)
    
    player1_simple_score = 0
    player2_simple_score = 0
    
    for index, row in point_data.iterrows():
        if row["Pts"] == "0-0":
            player1_simple_score = 0
            player2_simple_score = 0            
        
        # To-do: What if Pts is Gm?
        point_data.at[index, "player1_simple_point_score_before"] = player1_simple_score
        point_data.at[index, "player2_simple_point_score_before"] = player2_simple_score
            
        if row["PtWinner"] == 1:
            player1_simple_score += 1
        elif row["PtWinner"] == 2:
            player2_simple_score += 1
        else: 
            print("Error: PtWinner is wrong: ")        
            print(row)

        point_data.at[index, "player1_simple_point_score_after"] = player1_simple_score
        point_data.at[index, "player2_simple_point_score_after"] = player2_simple_score

    return point_data
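
The same running counts can be sketched without a row loop. A new game starts at every "0-0" row, so a cumulative sum over that condition gives a game identifier, and a grouped cumulative sum of points won gives the score after each point. The sample data is made up, and the sketch assumes, like the loop above, that every game starts with a "0-0" row.

```python
import pandas as pd

# Hypothetical points: one game won by player1, then the first point of the next game.
points = pd.DataFrame({
    "match_id": ["m1"] * 6,
    "Pt": [1, 2, 3, 4, 5, 6],
    "Pts": ["0-0", "15-0", "15-15", "30-15", "40-15", "0-0"],
    "PtWinner": [1, 2, 1, 1, 1, 2],
})
points = points.sort_values(["match_id", "Pt"])

# Every "0-0" row starts a new game.
game_id = (points["Pts"] == "0-0").cumsum()

for player in (1, 2):
    won = (points["PtWinner"] == player).astype(int)
    after = won.groupby(game_id).cumsum()
    points[f"player{player}_simple_point_score_after"] = after
    # The score before a point is the score after it minus the point just won.
    points[f"player{player}_simple_point_score_before"] = after - won

print(points.filter(like="simple_point_score"))
```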
        

In [129]:
import re   
from collections import defaultdict 

# This method must be called after 
# identify_serve_side() and identify_serve_direction_outcome()
# because we need to use the new columns created by these methods. 

# Parse the Rally string and separate server shots and returner shots.
# Also try to infer both players' court positions from the shot directions
# and shot types. 
def separate_server_returner_shots(point_data):
    point_data = point_data.copy()
    
    # Add new columns
    point_data["server_shots_code"] = ""
    point_data["returner_shots_code"] = ""
    point_data["how_rally_end"] = ""
    point_data["player1_shots_code"] = ""
    point_data["player2_shots_code"] = ""
    point_data["server_shots_code_w_position"] = ""
    point_data["returner_shots_code_w_position"] = ""
    point_data["player1_shots_code_w_position"] = ""
    point_data["player2_shots_code_w_position"] = ""
    
    # Use one player's shot direction to identify 
    # the other player's lateral court position
    gen_lateral_position = defaultdict(lambda: "na")
    gen_lateral_position["1"] = "deuce"
    gen_lateral_position["2"] = "middle"
    gen_lateral_position["3"] = "ad"
    
    # Use one player's shot type to identify the opponent's court depth, if possible.
    gen_depth = defaultdict(lambda: "deep")
    gen_depth["u"] = "net" # my dropshot
    gen_depth["y"] = "net" # my dropshot
    gen_depth["q"] = "na" # unknown
    gen_depth["7"] = "short" # return within the service box
    gen_depth["8"] = "short" # return in the "no man's land"
    gen_depth["9"] = "deep" # return deep
    
    for index, row in point_data.iterrows():
        # Only process if the Rally is not empty. 
        if pd.isnull(row["Rally"]) == False:
            
            # Retrieve the rally end code
            rally_end_codes = re.findall(r"[*nwdx!e@#SRPQC]+$", row["Rally"])
            if len(rally_end_codes) > 0:
                point_data.at[index, "how_rally_end"] = rally_end_codes.pop(0)
            else:
                print("Warning: No rally end code: ", end = "")
                print(row[["match_id", "Pt", "Rally"]])
            
            # Use a regular expression to separate each shot from the Rally string.
            # Return a list of shots
            shots = re.findall(r"[fbrsvzopuylmhijktq][\+\-\=;^]*\d*[*nwdx!e@#SRPQC]*", 
                               row["Rally"])
            
            # How to figure out the court positions of the players?
            # The returner court position is determined by the last serve direction 
            # For each shot, retrieve the shot direction code and the shot code. 
            # The lateral position of the current player is initially set 
            # based on the previous shot direction and shot type. 
            # Then the player depth is modified by the current shot type. 
            # Save the court position alongside the shot type and direction.
            # Estimate the opponent's position and depth for the next shot 
            # based on the current shot direction and type. 
            # Repeat the above process.
            
            # Set the lateral_position and depth based on the serve
            # If a player serves from the deuce side, the returner is also on his deuce side.
            estimated_lateral_position = row["serve_side"]
            estimated_depth = "deep" # return of serve is almost always deep
            
            # However, if the serve goes to T, then the returner must go to the middle.
            if (((row["1stIn"] == 1) & (row["isRally1st"] == 1) & (row["Sv1_direction"] == "t")) | 
                ((row["2ndIn"] == 1) & (row["isRally2nd"] == 1) & (row["Sv2_direction"] == "t"))):
                    estimated_lateral_position = "middle"
            
            for i in range(0, len(shots)):
                # Go through each shot
                
                if i % 2 == 0:
                    # If the index is even, it's the returner's shot
                    column = "returner_shots_code"
                else:
                    column = "server_shots_code"
                
                point_data.at[index, column] = (point_data.at[index,column] + shots[i] + ",")
                
                # Next, try to identify each player's court position for each shot.
                # Retrieve the shot code and the shot direction code for the current shot.
                m = re.search(r"\d+", shots[i])
                if m:               
                    shot_direction = m.group(0)

                    # If there are two digits, the second one is the shot depth.
                    shot_depth = None
                    if (len(shot_direction) > 1):
                        # Get the depth first, otherwise the depth is lost.
                        shot_depth = shot_direction[1]
                        # The depth is lost after this line because we override shot_direction.
                        shot_direction = shot_direction[0]
                else:
                    # Some shots have no direction (e.g., forced error)
                    shot_direction = None
                    shot_depth = None
                
                # Find shot type code
                m = re.match(r"[fbrsvzopuylmhijktq]", shots[i])
                # Check for None only. "|" does not short-circuit, so calling 
                # m.group(0) in the same condition would fail when m is None.
                if m is None:
                    print("Error: shot type is wrong")
                    print(row[["match_id", "Rally"]])
                    # Do not continue
                    break
                    
                shot_type = m.group(0)
                
                m = re.search(r"[\+\-\=;^]", shots[i])                
                if m:
                    shot_type_extra = m.group(0)
                else: 
                    shot_type_extra = None
                
                # Set the court position based on the opponent's previous shot.
                # estimated_lateral_position and estimated_depth hold the position
                # estimated from that shot.
                my_lateral_position = estimated_lateral_position
                my_depth = estimated_depth
                
                # Update the court depth based on my shot type
                if re.match(r"[jk]", shot_type):
                    # Swing volleys. Test these first so that "j" is not 
                    # treated as a regular volley.
                    my_depth = "short"
                elif re.match(r"[vzopi]", shot_type):
                    # Volleys and overheads
                    my_depth = "net"
                
                if shot_type_extra != None:
                    if shot_type_extra == "+":
                        # approach shot
                        my_depth = "short"
                    elif shot_type_extra == "-":
                        # net
                        my_depth = "net"
                    elif shot_type_extra == "=":
                        # baseline
                        my_depth = "deep"
                    elif shot_type_extra == "^":
                        # stop volley or drop volley
                        my_depth = "net"
                    elif shot_type_extra == ";":
                        # net cord
                        my_depth = "short"

                # Attach court position to the shot code and save it. 
                if i % 2 == 0:
                    # If the index is even, it's the returner's shot
                    column = "returner_shots_code_w_position"
                else:
                    column = "server_shots_code_w_position"
                
                try:
                    point_data.at[index, column] = (point_data.at[index,column] + my_lateral_position + "_" +
                                                    my_depth + "_" + shots[i] + ",")                    
                    
                except TypeError:
                    print("TypeError: ")
                    print(row[["match_id", "Pts"]])
                
                if shot_direction != None:
                    # Save the estimated lateral_position and depth for the opponent in the next shot
                    estimated_lateral_position = gen_lateral_position[shot_direction]
                else:
                    # If the current shot has no direction, 
                    # then we don't know the opponent's next lateral position. 
                    estimated_lateral_position = "na"

                if shot_depth != None:
                    # If we have the shot depth, use it. 
                    # Some matches have shot depth for each shot. 
                    # Most matches have shot depth only for the return shot. 
                    estimated_depth = gen_depth[shot_depth]
                else:
                    # Otherwise, use the shot type to estimate the depth.
                    estimated_depth = gen_depth[shot_type]
                   
            # Also save the shot codes based on players
            if row["Svr"] == 1:
                # If player1 is the server ...
                point_data.at[index, "player1_shots_code"] = point_data.at[index, 
                                                                           "server_shots_code"]
                point_data.at[index, "player2_shots_code"] = point_data.at[index, 
                                                                           "returner_shots_code"]
                point_data.at[index, "player1_shots_code_w_position"] = point_data.at[index, 
                                                                           "server_shots_code_w_position"]
                point_data.at[index, "player2_shots_code_w_position"] = point_data.at[index, 
                                                                           "returner_shots_code_w_position"]
            else:
                # If player2 is the server ...
                point_data.at[index, "player2_shots_code"] = point_data.at[index, 
                                                                           "server_shots_code"]
                point_data.at[index, "player1_shots_code"] = point_data.at[index, 
                                                                           "returner_shots_code"]
                point_data.at[index, "player2_shots_code_w_position"] = point_data.at[index, 
                                                                           "server_shots_code_w_position"]
                point_data.at[index, "player1_shots_code_w_position"] = point_data.at[index, 
                                                                           "returner_shots_code_w_position"]
                
    # to-do: Find an easy way to convert rally code to meaningful words.
            
    return point_data
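
To make the two regular expressions easier to follow, here is a small sketch that applies them to one made-up rally string. The rally string is hypothetical. The serve itself is not part of the Rally string, so the first shot belongs to the returner.

```python
import re

# Same patterns as in separate_server_returner_shots().
SHOT_PATTERN = r"[fbrsvzopuylmhijktq][\+\-\=;^]*\d*[*nwdx!e@#SRPQC]*"
END_PATTERN = r"[*nwdx!e@#SRPQC]+$"

# Hypothetical rally: return, server shot, returner approach shot,
# and a server shot that ends in the net.
rally = "b28f1f+3b2n@"

shots = re.findall(SHOT_PATTERN, rally)
returner_shots = shots[0::2]  # even positions: returner
server_shots = shots[1::2]    # odd positions: server

end_match = re.search(END_PATTERN, rally)
how_rally_end = end_match.group(0) if end_match else ""

print(shots)           # ['b28', 'f1', 'f+3', 'b2n@']
print(returner_shots)  # ['b28', 'f+3']
print(server_shots)    # ['f1', 'b2n@']
print(how_rally_end)   # n@
```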
    

In [130]:
# Identify critical moments: game point, break point, set point, match point, etc.
# This function handles special cases of the 5th set in Wimbledon, Roland Garros, and the Australian Open. 
def identify_critical_points(point_data, match_data):
    
    # To-do: How to deal with NextGen ATP final matches. They play first to 4 in each set. 
    # At 3-3, they play a tiebreak. There is no 5-3 score. 
    
    point_data = point_data.copy()
    
    point_data["server_critical_point"] = None
    point_data["returner_critical_point"] = None
    
    game_point_scores = ["40-0", "40-15", "40-30", "AD-40"]
    break_point_scores = ["0-40", "15-40", "30-40", "40-AD"]
    server_setup_point_scores = ["30-0", "30-15", "30-30"]
    returner_setup_point_scores = ["0-30", "15-30", "30-30"]
    deuce_point_score = ["40-40"]
    
    for index, row in point_data.iterrows():
        
        match_info = match_data.loc[match_data["match_id"] == row["match_id"]]
        if match_info.empty:
            print("Cannot find match: " + row["match_id"])
            # Skip this point. The lookups below need a matching row in match_data.
            continue
        
        # Get information about the current match
        best_of = int(match_info["Best of"].iloc[0])
        tournament = match_info["Tournament"].iloc[0]
        final_tb = int(match_info["Final TB?"].iloc[0])
        
        if row["TB?"] == 0:
            # Not in tiebreak
            if row["Pts"] in game_point_scores:
                # Server has game point
                point_data.at[index, "server_critical_point"] = "game_point"
                
                if row["Svr"] == 1:                    
                    # In Wimbledon and Roland Garros, a player needs to win two games in a row
                    # after 5-5 to win the 5th set.                    
                    if (((row["Gm1"] == 5) & (row["Gm2"] <= 4)) | 
                        ((row["Gm1"] > 5) & (row["Gm1"] > row["Gm2"]))):
                        # Server is player1 and has a set point
                        point_data.at[index, "server_critical_point"] += ",set_point"
                        
                        if (((best_of == 3) & (row["Set1"] == 1)) | 
                            ((best_of == 5) & (row["Set1"] == 2))):
                            # Player1 has already won 1 or 2 sets and now has a set point
                            point_data.at[index, "server_critical_point"] += ",match_point"
                else:
                    if (((row["Gm2"] == 5) & (row["Gm1"] <= 4)) | 
                        ((row["Gm2"] > 5) & (row["Gm2"] > row["Gm1"]))):
                        # Server is player2 and has a set point
                        point_data.at[index, "server_critical_point"] += ",set_point"
                        
                        if (((best_of == 3) & (row["Set2"] == 1)) | 
                            ((best_of == 5) & (row["Set2"] == 2))):
                            # Player2 has already won 1 or 2 sets and now has a set point
                            point_data.at[index, "server_critical_point"] += ",match_point"
                        
            elif row["Pts"] in break_point_scores:
                # Returner has a break point
                point_data.at[index, "returner_critical_point"] = "break_point"
                
                if row["Svr"] == 1:
                    # Server is player1, and player2 has a breakpoint
                    # In Wimbledon and Roland Garros, a player needs to win two games in a row
                    # after 5-5 to win the 5th set. 
                    if (((row["Gm2"] == 5) & (row["Gm1"] <= 4)) | 
                        ((row["Gm2"] > 5) & (row["Gm2"] > row["Gm1"]))):
                        point_data.at[index, "returner_critical_point"] += ",set_point"
                        
                        if (((best_of == 3) & (row["Set2"] == 1)) | 
                            ((best_of == 5) & (row["Set2"] == 2))):
                            # Player2 has already won 1 or 2 sets and now has a set point
                            point_data.at[index, "returner_critical_point"] += ",match_point"                        
                else:
                    # Server is player2, and player1 has a break point
                    if (((row["Gm1"] == 5) & (row["Gm2"] <= 4)) | 
                        ((row["Gm1"] > 5) & (row["Gm1"] > row["Gm2"]))):
                        point_data.at[index, "returner_critical_point"] += ",set_point" 

                        if (((best_of == 3) & (row["Set1"] == 1)) | 
                            ((best_of == 5) & (row["Set1"] == 2))):
                            # Player1 has already won 1 or 2 sets and now has a set point
                            point_data.at[index, "returner_critical_point"] += ",match_point"
            
            if row["Pts"] in server_setup_point_scores:
                point_data.at[index, "server_critical_point"] = "setup_point"
            
            if row["Pts"] in returner_setup_point_scores:
                point_data.at[index, "returner_critical_point"] = "setup_point" 
            
            if row["Pts"] in deuce_point_score:
                point_data.at[index, "server_critical_point"] = "deuce_point"
                point_data.at[index, "returner_critical_point"] = "deuce_point"
            
        else:
            #Tiebreak points
            tiebreak_point = row["Pts"].split("-")
            
            if len(tiebreak_point) != 2:
                print("wrong tiebreak point: " + str(tiebreak_point))
            
            # The original data table already switched score sequence when the server is switched.
            server_point = int(tiebreak_point[0])
            returner_point = int(tiebreak_point[1])
            
            # For AO or Laver Cup, if the 5th set is 6-6, they will play a 10 point tiebreak, not 7.
            # For these tournaments, the value for the "Final TB?" is 2. 
            # If row["TB?"] == 2, it means it's a 10 point super-tiebreak. But I am not sure if this is 
            # consistently set. So I check both. 
            is_super_tb = (((final_tb == 2) and (row["Set1"] == 2) and (row["Set2"] == 2)) or 
                           (row["TB?"] == 2))
            # A player has a tiebreak point when ahead with at least 6 points 
            # (9 points in a 10 point tiebreak). The 6-point rule must not be applied
            # to a 10 point tiebreak; otherwise 6-5 would be marked as a set point.
            tb_point_threshold = 9 if is_super_tb else 6
            
            if (server_point >= tb_point_threshold) and (server_point > returner_point):
                # A tiebreak point is also a set point
                point_data.at[index, "server_critical_point"] = "tb_game_point,set_point"
                
                # French Open does not have a tiebreak in the 5th set, therefore the 
                # code below does not apply to French Open.
                
                # Since 2019, Wimbledon has a tiebreak after 12-12 in the 5th set. 
                # Therefore, the following code applies to Wimbledon from 2019, if
                # the players reach 12-12. No special code is needed for Wimbledon after 2019. 
                
                if row["Svr"] == 1:
                    # Server is player1
                    if (((best_of == 3) & (row["Set1"] == 1)) | 
                        ((best_of == 5) & (row["Set1"] == 2))):
                        # Player1 has already won 1 or 2 sets and now has a set point
                        point_data.at[index, "server_critical_point"] += ",match_point"
                else:                        
                    # Server is player2
                    if (((best_of == 3) & (row["Set2"] == 1)) | 
                        ((best_of == 5) & (row["Set2"] == 2))):
                        # Player2 has already won 1 or 2 sets and now has a set point
                        point_data.at[index, "server_critical_point"] += ",match_point"                
                
            elif (returner_point >= tb_point_threshold) and (returner_point > server_point):
                # For returner
                point_data.at[index, "returner_critical_point"] = "tb_break_point,set_point"
                
                if row["Svr"] == 1:
                    # Server is player1, and player2 has a tiebreak point 
                    if (((best_of == 3) & (row["Set2"] == 1)) | 
                        ((best_of == 5) & (row["Set2"] == 2))):
                        # Player2 has already won 1 or 2 sets and now has a set point
                        point_data.at[index, "returner_critical_point"] += ",match_point"                        
                else:
                    # Server is player2, and player1 has a break point
                    if (((best_of == 3) & (row["Set1"] == 1)) | 
                        ((best_of == 5) & (row["Set1"] == 2))):
                        # Player1 has already won 1 or 2 sets and now has a set point
                        point_data.at[index, "returner_critical_point"] += ",match_point"
            
            # If row["TB?"] == 2, it means it's a 10 point super-tiebreak, like in Laver Cup. 
            if (((final_tb == 2) &  (row["Set1"] == 2) & (row["Set2"] == 2)) | (row["TB?"] == 2)):
                # In a 10 point tiebreak, the setup point is 8-x
                if ((server_point == 8) & (server_point >= returner_point)): 
                    point_data.at[index, "server_critical_point"] = "tb_setup_point"
                
                if ((returner_point == 8) & (returner_point >= server_point)):
                    point_data.at[index, "returner_critical_point"] = "tb_setup_point"
                
                if (server_point == returner_point) & (server_point > 8):
                    point_data.at[index, "server_critical_point"] = "tb_deuce_point"
                    point_data.at[index, "returner_critical_point"] = "tb_deuce_point"
            else:
                # For all the other tournaments
                if (server_point == 5) & (server_point >= returner_point): 
                    point_data.at[index, "server_critical_point"] = "tb_setup_point"

                if (returner_point == 5) & (returner_point >= server_point):
                    point_data.at[index, "returner_critical_point"] = "tb_setup_point"

                if (server_point == returner_point) & (server_point > 5):
                    point_data.at[index, "server_critical_point"] = "tb_deuce_point"
                    point_data.at[index, "returner_critical_point"] = "tb_deuce_point"                
                
    # Use df.apply() to replace "for index, row in df.iterrows()" for simple data processing. 
    # df.apply() is more efficient than for loops. 
    # df.apply() applies a function to every row or column of df.
    # The function is usually short, such as a one-line lambda function. 
   
    # Use a lambda function in place of a short "def xxxx()".
    # A lambda function is limited to a single expression.
    # Lambda functions are concise but can be more difficult to understand.
    # Lambda functions are used along with Python functions like 
    # filter(), map(), functools.reduce(), and the Pandas dataframe function apply().

    # For lambda functions, use "row" or "column" (instead of x) for the argument
    # because it's easier to understand. 
    point_data["player1_critical_point"] = point_data.apply(
        lambda row: row["server_critical_point"] if row["Svr"] == 1 else row["returner_critical_point"], 
        axis = 1)
    
    point_data["player2_critical_point"] = point_data.apply(
        lambda row: row["server_critical_point"] if row["Svr"] == 2 else row["returner_critical_point"], 
        axis = 1)

    return point_data
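
The tiebreak set point rule can be stated compactly: a player has a set point when that player is ahead and is one point away from the target (7 in a regular tiebreak, 10 in a super-tiebreak). The helper below is a minimal sketch of that rule only. The name `has_tiebreak_set_point` and its parameters are hypothetical, and it does not look at the match data or decide which target applies.

```python
def has_tiebreak_set_point(own_points, opponent_points, target=7):
    """Return True if winning the next point would win the tiebreak.

    Hypothetical helper: target is 7 for a regular tiebreak
    and 10 for a super-tiebreak.
    """
    # The player needs at least target - 1 points and must be ahead,
    # so that the next point gives both the target and a 2-point lead.
    return own_points >= target - 1 and own_points > opponent_points


print(has_tiebreak_set_point(6, 5))             # True
print(has_tiebreak_set_point(6, 6))             # False
print(has_tiebreak_set_point(6, 5, target=10))  # False
print(has_tiebreak_set_point(9, 8, target=10))  # True
```

Passing the target explicitly keeps the 7-point and 10-point cases from being mixed up, for example at 6-5 in a super-tiebreak.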

In [131]:
# Calculate the gaps in point scores, game scores, and set scores. 
# These gaps are used in calculating anxiety indices.
# They may also be used in storytelling analysis.

def calculate_score_gaps(point_data, match_data, point_gap_file): 
    
    # To-do: How to deal with NextGen ATP final matches. They play first to 4 in each set. 
    # At 3-3, they play a tiebreak. There is no 5-3 score. 
    
    # Everything needs to be normalized to [0:1]
    # Uncertainty = min(server_points_to_win, server_points_to_loss) * server_returner_gap * weightUncertainty
    
    # Hope = (server_points_to_win * server_returner_gap * (serve game * service_game_confidence or return game * return_game_confidence) 
    # * (point_confidence) * critical_moment * weightHope
    
    # Fear = (server_points_to_loss * server_returner_gap * (serve game * service_game_confidence or return game * return_game_confidence) 
    # * (point_confidence) * critical_moment * weightFear
    
    point_data = point_data.copy()
    
    # To-do: add normalized gaps to the point gap table
    
    # The point score gaps for non-tiebreak points are stored in this table 
    # so we don't need to calculate it. 
    # Other cases are too complicated to precalculate, so we'll have to write code. 
    point_gap_table = pd.read_csv(point_gap_file)
    
    # Calculate point score gaps for non-tiebreak points using the point_gap_table.
    
    # Left merge point_gap_table with the point data based on "Pts".
    # For example, if Pts is "40-30", values in "server_points_to_win", 
    # "returner_points_to_loss," etc. in point_gap_table will fill the 
    # corresponding columns in the point_data.
    # This is the best way to merge two tables with matching columns. 
    point_data = pd.merge(point_data, point_gap_table,
                        how="left", on=["Pts"])
        
    # Calculate point gaps for tiebreak points
    for index, row in point_data.iterrows():  
        
        match_info = match_data.loc[match_data["match_id"] == row["match_id"]]
        if match_info.empty:
            print("Cannot find match: " + row["match_id"])
            # Skip this point. The lookups below need a matching row in match_data.
            continue
        
        # Get information about the current match
        best_of = int(match_info["Best of"].iloc[0])
        tournament = match_info["Tournament"].iloc[0]
        final_set_tb = int(match_info["Final TB?"].iloc[0])
        
        if row["TB?"] >= 1:
            #Tiebreak points
            tiebreak_point = row["Pts"].split("-")
            
            if len(tiebreak_point) != 2:
                # Skip malformed scores instead of failing on the conversions below.
                print("Wrong tiebreak point: " + str(tiebreak_point))
                continue
            
            # The original data table already switched score sequence when the server is switched.
            server_point = int(tiebreak_point[0])
            returner_point = int(tiebreak_point[1])

            tb_target_point = 7
            tb_combined_points_for_2_point_gap = 12
            
            # If final_set_tb == 2, the final set tiebreak is a 10-point tiebreak, as
            # in the AO or Laver Cup.
            # If row["TB?"] == 2, it also means it's a 10-point super-tiebreak.
            # I am not sure if they do this consistently, so I check both. 
            if ((final_set_tb == 2 and row["Set1"] == 2 and row["Set2"] == 2) or 
                row["TB?"] == 2):
                tb_target_point = 10
                tb_combined_points_for_2_point_gap = 18

            if (server_point + returner_point) < tb_combined_points_for_2_point_gap:
                # Before 6-6 (or 9-9 in a 10-point tiebreak)
                point_data.at[index, "server_points_to_win"] = tb_target_point - server_point
                point_data.at[index, "server_points_to_loss"] = tb_target_point - returner_point
                point_data.at[index, "returner_points_to_win"] = tb_target_point - returner_point
                point_data.at[index, "returner_points_to_loss"] = tb_target_point - server_point

                point_data.at[index, "server_returner_points_gap"] = server_point - returner_point
                point_data.at[index, "returner_server_points_gap"] = returner_point - server_point
                
                # Calculate normalized gaps (between 0 and 1) so we can compare the gaps between 
                # tiebreak scores and non-tiebreak scores.
                point_data.at[index, "server_points_to_win_norm"] = \
                    point_data.at[index, "server_points_to_win"] / tb_target_point
            
                point_data.at[index, "server_points_to_loss_norm"] = \
                     point_data.at[index, "server_points_to_loss"] / tb_target_point

                point_data.at[index, "returner_points_to_win_norm"] = \
                    point_data.at[index, "returner_points_to_win"] / tb_target_point

                point_data.at[index, "returner_points_to_loss_norm"] = \
                    point_data.at[index, "returner_points_to_loss"] / tb_target_point

                # The biggest possible point gap in tiebreak is 6 (or 9 in 10-point tiebreak)
                point_data.at[index, "server_returner_points_gap_norm"] = \
                    point_data.at[index, "server_returner_points_gap"] / (tb_target_point - 1)

                point_data.at[index, "returner_server_points_gap_norm"] = \
                    point_data.at[index, "returner_server_points_gap"] / (tb_target_point - 1)
                
            else: 
                # After 6-6 (or 9-9 in 10-point tiebreak)
                point_data.at[index, "server_points_to_win"] = 2 - (server_point - returner_point)
                point_data.at[index, "server_points_to_loss"] = 2 - (returner_point - server_point)
                point_data.at[index, "returner_points_to_win"] = 2 - (returner_point - server_point)
                point_data.at[index, "returner_points_to_loss"] = 2 - (server_point - returner_point)
                
                point_data.at[index, "server_returner_points_gap"] = server_point - returner_point
                point_data.at[index, "returner_server_points_gap"] = returner_point - server_point                
                            
                # Calculate normalized gaps so we can compare the gaps between tiebreak scores and 
                # non-tiebreak scores.
                # When the tiebreak score is tied at 6-6 or later, each side needs to win two points in
                # a row to win the tiebreak. The most points a player can need is 3 (when one point behind). 
                # The biggest possible gap is 1, so the gap columns need no further scaling. 
                point_data.at[index, "server_points_to_win_norm"] = \
                    point_data.at[index, "server_points_to_win"] / 3
            
                point_data.at[index, "server_points_to_loss_norm"] = \
                     point_data.at[index, "server_points_to_loss"] / 3

                point_data.at[index, "returner_points_to_win_norm"] = \
                    point_data.at[index, "returner_points_to_win"] / 3

                point_data.at[index, "returner_points_to_loss_norm"] = \
                    point_data.at[index, "returner_points_to_loss"] / 3

                point_data.at[index, "server_returner_points_gap_norm"] = \
                    point_data.at[index, "server_returner_points_gap"] 

                point_data.at[index, "returner_server_points_gap_norm"] = \
                    point_data.at[index, "returner_server_points_gap"] 
        
    # Calculate game score gap

    point_data[["server_games_to_win", "server_games_to_loss", 
                "returner_games_to_win", "returner_games_to_loss", 
                "server_returner_games_gap", "returner_server_games_gap",
                "server_games_to_win_norm", "server_games_to_loss_norm", 
                "returner_games_to_win_norm", "returner_games_to_loss_norm", 
                "server_returner_games_gap_norm", "returner_server_games_gap_norm"]] = None
    
    for index, row in point_data.iterrows():        
        if row["TB?"] == 0:
            # Non-tiebreak points
            
            if row["Svr"] == 1:
                server_game_score = int(row["Gm1"])
                returner_game_score = int(row["Gm2"])
            else:
                server_game_score = int(row["Gm2"])
                returner_game_score = int(row["Gm1"])
                
            if (server_game_score + returner_game_score) <= 9:
                # Before the game scores are tied at 5-5
                point_data.at[index, "server_games_to_win"] = 6 - server_game_score
                point_data.at[index, "server_games_to_loss"] = 6 - returner_game_score
                point_data.at[index, "returner_games_to_win"] = 6 - returner_game_score
                point_data.at[index, "returner_games_to_loss"] = 6 - server_game_score
                
                # Before 5-5, the biggest possible number of games to win or lose is 6
                point_data.at[index, "server_games_to_win_norm"] = \
                    point_data.at[index, "server_games_to_win"] / 6
                
                point_data.at[index, "server_games_to_loss_norm"] = \
                    point_data.at[index, "server_games_to_loss"] / 6
                    
                point_data.at[index, "returner_games_to_win_norm"] = \
                    point_data.at[index, "returner_games_to_win"] / 6
                
                point_data.at[index, "returner_games_to_loss_norm"] = \
                    point_data.at[index, "returner_games_to_loss"] / 6
                
            else: 
                # The game score is at 5-5 or later
                
                point_data.at[index, "server_games_to_win"] = 7 - server_game_score
                point_data.at[index, "server_games_to_loss"] = 7 - returner_game_score
                point_data.at[index, "returner_games_to_win"] = 7 - returner_game_score
                point_data.at[index, "returner_games_to_loss"] = 7 - server_game_score
                
                # The most games you need to win or lose is 2, 
                # so divide by 2.
                
                point_data.at[index, "server_games_to_win_norm"] = \
                    point_data.at[index, "server_games_to_win"] / 2
                
                point_data.at[index, "server_games_to_loss_norm"] = \
                    point_data.at[index, "server_games_to_loss"] / 2 
                
                point_data.at[index, "returner_games_to_win_norm"] = \
                    point_data.at[index, "returner_games_to_win"] / 2
                
                point_data.at[index, "returner_games_to_loss_norm"] = \
                    point_data.at[index, "returner_games_to_loss"] / 2
                
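                # If final_set_tb == 0, this match has no tiebreak in the 5th set.
                # We need to override the above.
                
                # To-do: Wimbledon introduced a tiebreak in the 5th set after 12-12.
                # Does the current code cover this? 
        
                # Look up "Final TB?" for this row's match. Otherwise final_set_tb would still 
                # hold the value left over from the last row of the tiebreak loop above.
                final_set_tb = int(match_data.loc[
                    match_data["match_id"] == row["match_id"], "Final TB?"].iloc[0])
                if final_set_tb == 0 and row["Set1"] == 2 and row["Set2"] == 2: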
                    point_data.at[index, "server_games_to_win"] = \
                        (2 - (server_game_score - returner_game_score))
                    point_data.at[index, "server_games_to_loss"] = \
                        (2 - (returner_game_score - server_game_score))
                    point_data.at[index, "returner_games_to_win"] = \
                        (2 - (returner_game_score - server_game_score))
                    point_data.at[index, "returner_games_to_loss"] = \
                        (2 - (server_game_score - returner_game_score))
                    
                    # If there is no tiebreak in the final set, the most games you 
                    # need to win or lose are 3. Therefore, divide by 3
                    point_data.at[index, "server_games_to_win_norm"] = \
                        point_data.at[index, "server_games_to_win"] / 3
                
                    point_data.at[index, "server_games_to_loss_norm"] = \
                        point_data.at[index, "server_games_to_loss"] / 3 

                    point_data.at[index, "returner_games_to_win_norm"] = \
                        point_data.at[index, "returner_games_to_win"] / 3

                    point_data.at[index, "returner_games_to_loss_norm"] = \
                        point_data.at[index, "returner_games_to_loss"] / 3
                
            point_data.at[index, "server_returner_games_gap"] = (server_game_score - 
                                                                returner_game_score)
            point_data.at[index, "returner_server_games_gap"] = (returner_game_score - 
                                                                 server_game_score)
            
            # For non-tiebreak points, the largest possible gap is 6.
            point_data.at[index, "server_returner_games_gap_norm"] = \
                point_data.at[index, "server_returner_games_gap"] / 6
            point_data.at[index, "returner_server_games_gap_norm"] = \
                point_data.at[index, "returner_server_games_gap"] / 6
        else: 
            # tiebreak points
            point_data.at[index, "server_games_to_win"] = 1
            point_data.at[index, "server_games_to_loss"] = 1
            point_data.at[index, "returner_games_to_win"] = 1
            point_data.at[index, "returner_games_to_loss"] = 1
            point_data.at[index, "server_returner_games_gap"] = 0
            point_data.at[index, "returner_server_games_gap"] = 0
            
            # In a tiebreak, the most games you can win or lose is always 1.
            point_data.at[index, "server_games_to_win_norm"] = \
                point_data.at[index, "server_games_to_win"]
            
            point_data.at[index, "server_games_to_loss_norm"] = \
                point_data.at[index, "server_games_to_loss"]
            
            point_data.at[index, "returner_games_to_win_norm"] = \
                point_data.at[index, "returner_games_to_win"]
            
            point_data.at[index, "returner_games_to_loss_norm"] = \
                point_data.at[index, "returner_games_to_loss"]
            
            # For tiebreak points, the games gap is 0.
            point_data.at[index, "server_returner_games_gap_norm"] = \
                point_data.at[index, "server_returner_games_gap"]
            
            point_data.at[index, "returner_server_games_gap_norm"] = \
                point_data.at[index, "returner_server_games_gap"]            
    
    # Calculate set point gap
    point_data[["server_sets_to_win", "server_sets_to_loss", 
                "returner_sets_to_win", "returner_sets_to_loss", 
                "server_returner_sets_gap", "returner_server_sets_gap",
                "server_sets_to_win_norm", "server_sets_to_loss_norm", 
                "returner_sets_to_win_norm", "returner_sets_to_loss_norm", 
                "server_returner_sets_gap_norm", "returner_server_sets_gap_norm"]] = None   
    
    for index, row in point_data.iterrows():
        
        # Best of may be 0
        best_of = match_data.loc[match_data["match_id"] == row["match_id"], "Best of"]
        best_of = int(best_of.iloc[0])
    
        if row["Svr"] == 1:
            server_set_score = int(row["Set1"])
            returner_set_score = int(row["Set2"])
        else:
            server_set_score = int(row["Set2"])
            returner_set_score = int(row["Set1"])
        
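        if best_of == 5:
            target_sets = 3
            max_set_score_gap = 2
        elif best_of == 3:
            target_sets = 2
            max_set_score_gap = 1
        else:
            # "Best of" may be 0 or another unexpected value. Leave the set gaps 
            # empty for this row instead of reusing values from a previous row.
            continue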
            
        point_data.at[index, "server_sets_to_win"] = target_sets - server_set_score
        point_data.at[index, "server_sets_to_loss"] = target_sets - returner_set_score
        point_data.at[index, "returner_sets_to_win"] = target_sets - returner_set_score
        point_data.at[index, "returner_sets_to_loss"] = target_sets - server_set_score
        
        # Normalize the value of gaps so they are comparable with other gaps
        # We can then use them in the same equation.
        point_data.at[index, "server_sets_to_win_norm"] = \
            point_data.at[index, "server_sets_to_win"] / target_sets
        point_data.at[index, "server_sets_to_loss_norm"] = \
            point_data.at[index, "server_sets_to_loss"] / target_sets
        point_data.at[index, "returner_sets_to_win_norm"] = \
            point_data.at[index, "returner_sets_to_win"] / target_sets
        point_data.at[index, "returner_sets_to_loss_norm"] = \
            point_data.at[index, "returner_sets_to_loss"] / target_sets
        
        point_data.at[index, "server_returner_sets_gap"] = server_set_score - returner_set_score
        point_data.at[index, "returner_server_sets_gap"] = returner_set_score - server_set_score
        
        # Divide by max_set_score_gap
        point_data.at[index, "server_returner_sets_gap_norm"] = \
            point_data.at[index, "server_returner_sets_gap"] / max_set_score_gap
        point_data.at[index, "returner_server_sets_gap_norm"] = \
            point_data.at[index, "returner_server_sets_gap"] / max_set_score_gap
    
    # For game anxiety, in addition to win_gap, loss_gap, and server_returner_game_gap, 
    # also consider (num_service_games - num_return_games). If you have more service games, 
    # then you are less anxious. 
    # Should we consider form? How do we calculate form? Confidence is a measure of form. 
    
    return point_data
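
The tiebreak branch of `calculate_score_gaps` can be read as a small pure function. The sketch below is only an illustration: `tiebreak_point_gaps` and its parameter names are hypothetical and not part of this notebook, and it assumes the same rule as the loop above (first to 7 or 10 points, win by two).

```python
def tiebreak_point_gaps(server_point, returner_point, target_point=7):
    """Return (server_points_to_win, server_points_to_loss) for a tiebreak score.

    Hypothetical helper that mirrors the tiebreak branch of calculate_score_gaps.
    target_point is 7 for a regular tiebreak and 10 for a super-tiebreak.
    """
    # An unfinished tiebreak with this many points played is at 6-6 (or 9-9) or later.
    combined_points_for_2_point_gap = 2 * (target_point - 1)

    if server_point + returner_point < combined_points_for_2_point_gap:
        # Before 6-6 (or 9-9): count the points left to reach the target.
        to_win = target_point - server_point
        to_loss = target_point - returner_point
    else:
        # From 6-6 (or 9-9) on: a player must lead by two points.
        to_win = 2 - (server_point - returner_point)
        to_loss = 2 - (returner_point - server_point)
    return to_win, to_loss


print(tiebreak_point_gaps(3, 5))       # (4, 2)
print(tiebreak_point_gaps(7, 6))       # (1, 3)
print(tiebreak_point_gaps(9, 9, 10))   # (2, 2)
```

The returner's values are the same two numbers swapped, which is why the loop writes each result into two columns.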

In [132]:
# This function can only be called after the shots are separated.

def calculate_shot_count(point_data):
    point_data = point_data.copy()
    
    # Count the number of commas (",") in each row's shot code string.
    # If there is no shot for server, row["server_shots_code"] is empty "".
    point_data["server_shot_count"] = point_data.apply(
        lambda row: row["server_shots_code"].count(","), axis=1)
    
    # Do not add service count to the rally count because it may skew the later stats calculation. 
    # Add a separate column for service count and treat services separately. 
    # The service count can be added to the server shot count if needed.
    # Use pd.isna() to check for missing values.
    point_data["service_count"] = point_data.apply(
        lambda row: 1 if pd.isna(row["2nd"]) else 2, axis = 1)
    
    point_data["returner_shot_count"] = point_data.apply(
        lambda row: row["returner_shots_code"].count(","), axis=1)
    
    # Assign to player1 and player 2
    point_data["player1_shot_count"] = point_data.apply(
        lambda row: row["server_shot_count"] if (row["Svr"] == 1) else row["returner_shot_count"], 
    axis = 1)
    
    point_data["player2_shot_count"] = point_data.apply(
        lambda row: row["server_shot_count"] if (row["Svr"] == 2) else row["returner_shot_count"], 
    axis = 1)
    
    return point_data
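
As a possible alternative to the row-wise `apply` calls in `calculate_shot_count`, the same counts can be computed with vectorized pandas string methods. This is a sketch on made-up data: the shot code strings and the `"2nd"` value below are hypothetical, and the only assumption is that every shot is followed by a comma.

```python
import pandas as pd

# Hypothetical sample rows; real shot codes come from the shot separation step.
sample = pd.DataFrame({
    "server_shots_code": ["f1,b2,", "", "f3,"],
    "returner_shots_code": ["b2,f1,", "", ""],
    "2nd": [None, "6d", None],
    "Svr": [1, 2, 1],
})

# Series.str.count counts the matches of a pattern in every string of the column.
sample["server_shot_count"] = sample["server_shots_code"].str.count(",")
sample["returner_shot_count"] = sample["returner_shots_code"].str.count(",")

# One serve if there was no second serve, otherwise two.
sample["service_count"] = sample["2nd"].isna().map({True: 1, False: 2})

# Series.where keeps the server's count where player 1 served, else the returner's.
sample["player1_shot_count"] = sample["server_shot_count"].where(
    sample["Svr"] == 1, sample["returner_shot_count"])

print(sample[["server_shot_count", "returner_shot_count",
              "service_count", "player1_shot_count"]])
```

On a large point table, the vectorized form avoids calling a Python lambda once per row.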

In [133]:
# This function can only be called after the shots are separated. 

# To-do: take shot direction into the consideration, not just the shot type.
# Not just the shot type variety but direction variety

# Calculate each player's shot variety count.
# It seems that this count is quite consistent for each player.
# For example, Federer almost always had a higher variety count than Nadal and Djokovic.
# Nadal almost always had a higher variety count than Djokovic.
def calculate_shot_variety_count(point_data, exclude_common_shots = True):
    point_data = point_data.copy()
    
    if exclude_common_shots:
        # exclude the most common shot types: f, b
        # Variety means a player often uses other types of shots, 
        # such as slice or dropshot. 
        regex_shot_type = re.compile(r"[rsvzopuylmhijktq][\+\-\=;^]*")
    else:
        # include forehand and backhand
        regex_shot_type = re.compile(r"[fbrsvzopuylmhijktq][\+\-\=;^]*")

    # Don't forget to add axis = 1.
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0.
    point_data["server_shots_variety_count"] = point_data.apply(
        lambda row: len(np.unique(regex_shot_type.findall(row["server_shots_code"])))
        if not pd.isna(row["server_shots_code"]) else np.nan,
    axis = 1)
    
    point_data["returner_shots_variety_count"] = point_data.apply(
        lambda row: len(np.unique(regex_shot_type.findall(row["returner_shots_code"]))) 
        if not pd.isna(row["returner_shots_code"]) else np.nan,
    axis = 1)

    # Assign to player1 and player 2
    point_data["player1_shots_variety_count"] = point_data.apply(
        lambda row: row["server_shots_variety_count"] 
        if (row["Svr"] == 1) else row["returner_shots_variety_count"], 
    axis = 1)
    
    point_data["player2_shots_variety_count"] = point_data.apply(
        lambda row: row["server_shots_variety_count"] 
        if (row["Svr"] == 2) else row["returner_shots_variety_count"], 
    axis = 1)    

    return point_data
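
To see what the two regular expressions in `calculate_shot_variety_count` actually count, the sketch below applies them to one string. The shot code string is hypothetical and only illustrates the matching; real codes come from the shot separation step.

```python
import re

# Same patterns as in calculate_shot_variety_count.
uncommon_shots = re.compile(r"[rsvzopuylmhijktq][\+\-\=;^]*")
all_shots = re.compile(r"[fbrsvzopuylmhijktq][\+\-\=;^]*")

# Hypothetical shot code string with five shots.
shots_code = "f1,s2,b3,s2,v-1,"

uncommon = uncommon_shots.findall(shots_code)
print(uncommon)                                   # ['s', 's', 'v-']
print(len(set(uncommon)))                         # 2
print(len(set(all_shots.findall(shots_code))))    # 4
```

Because the trailing modifier characters are part of each match, a shot type with a modifier (`v-`) counts as a different variety from the same type without one (`v`).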

In [134]:
# calculate shot direction change count
def calculate_shot_direction_change_count(point_data):
    point_data = point_data.copy()
    
    point_data["server_shot_direction_change_count"] = 0
    point_data["returner_shot_direction_change_count"] = 0
    point_data["player1_shot_direction_change_count"] = 0
    point_data["player2_shot_direction_change_count"] = 0
    
    switch_column = {"returner_shot_direction_change_count": 
                     "server_shot_direction_change_count",
                     "server_shot_direction_change_count": 
                     "returner_shot_direction_change_count"}
        
    for index, row in point_data.iterrows(): 
        # If there is no rally for this point, do nothing
        if pd.isna(row["Rally"]):
            continue

        shot_directions = re.findall(r"\d+", row["Rally"])

        prev_dir = None
        column = "returner_shot_direction_change_count"
        while len(shot_directions) > 0:
            # Pop the next shot
            current_dir = shot_directions.pop(0)

            # If there is direction code and depth code, 
            # only use the direction code. 
            if len(current_dir) > 1:
                current_dir = current_dir[0]

            # If the current shot direction is different from the previous shot direction
            if (prev_dir is not None) and (current_dir != prev_dir): 
                point_data.at[index, column] += 1

            prev_dir = current_dir
            column = switch_column[column]
        
    # Assign to player1 and player 2
    point_data["player1_shot_direction_change_count"] = point_data.apply(
        lambda row: row["server_shot_direction_change_count"] 
        if (row["Svr"] == 1) else row["returner_shot_direction_change_count"], 
    axis = 1)
    
    point_data["player2_shot_direction_change_count"] = point_data.apply(
        lambda row: row["server_shot_direction_change_count"] 
        if (row["Svr"] == 2) else row["returner_shot_direction_change_count"], 
    axis = 1)    
    
    return point_data    
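
The per-point logic of `calculate_shot_direction_change_count` can be tried on a single rally string. In this sketch, `count_direction_changes` is a hypothetical helper and the rally string is made up. Like the loop above, it assumes that the first shot in the rally belongs to the returner and that every shot carries a direction digit.

```python
import re

def count_direction_changes(rally):
    """Count direction changes per role for one rally string (hypothetical helper)."""
    # Keep only the first digit of each digit run: the direction code.
    directions = [digits[0] for digits in re.findall(r"\d+", rally)]

    counts = {"returner": 0, "server": 0}
    hitter = "returner"   # The first shot of the rally is the return.
    prev_dir = None
    for current_dir in directions:
        if prev_dir is not None and current_dir != prev_dir:
            counts[hitter] += 1
        prev_dir = current_dir
        # The players alternate shots.
        hitter = "server" if hitter == "returner" else "returner"
    return counts

# Directions are 2, 1, 3, 3: the server changes once, then the returner changes once.
print(count_direction_changes("b28f1f3b3"))   # {'returner': 1, 'server': 1}
```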

In [135]:
def calculate_net_shot_count(point_data, including_middle_court = True):
    point_data = point_data.copy()
    
    court_positions = {"server_net_shot_count" : "server_shots_code_w_position",
                   "returner_net_shot_count" : "returner_shots_code_w_position", 
                   "player1_net_shot_count" : "player1_shots_code_w_position",
                   "player2_net_shot_count" : "player2_shots_code_w_position"}

    for column in court_positions:
        # Count how many "_net" are in the court position codes.
        point_data[column] = point_data.apply(
            lambda row: row[court_positions[column]].count("_net"), axis = 1)
    
        if including_middle_court:
            # Count how many "_short" are in the court position codes. 
            point_data[column] += point_data.apply(
                lambda row: row[court_positions[column]].count("_short"), axis = 1)
        
    return point_data
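
The substring counting in `calculate_net_shot_count` can be sketched with vectorized string methods. The position-coded strings below are hypothetical; the only assumption is that each shot contains a token such as `_net` or `_short`.

```python
import pandas as pd

# Hypothetical position-coded shots; the real column is built in an earlier step.
sample = pd.DataFrame({
    "server_shots_code_w_position": [
        "f1_deuce_deep,v2_middle_net,", "b3_ad_short,", ""],
})

codes = sample["server_shots_code_w_position"]
net_only = codes.str.count("_net")
net_and_short = net_only + codes.str.count("_short")

print(net_only.tolist())        # [1, 0, 0]
print(net_and_short.tolist())   # [1, 1, 0]
```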

In [136]:
# The best way to verify my run index is to compare the run distance difference between server and returner 
# in my run index and the real data. 

# Source: http://www.tennisabstract.com/blog/category/distance-run/
# In the average match, there was only a 125 meter difference between the distances covered by the two players. 
# In percentage terms, that means one player outran the other by only 5.5%.

# On average, the returner must run just over 10% further. When the first serve is put in play, 
# that difference jumps to 12%. On second-serve points, it drops to 7%.

# In the average Roland Garros match, the competitors combined for 4.8 km per match, compared to 4.1 km at Wimbledon. 
# (The dataset consists of about twice as many Wimbledon matches, so the overall numbers are skewed in that direction.) 
# Measured by the point, that’s 47 meters per point on clay and 37 meters per point on grass.

# Between these extremes, the average match features a combined 4.4 km of running, or just over 20 meters per point. 
# If we limit our view to points of five shots or longer (a very approximate way of separating rallies from points 
# in which the serve largely determines the outcome), the average distance per point is 42 meters.

# to-do: server runs too much? Returner also needs to run if the serve is to the middle.  
# Our run index tends to underestimate the distance run because players may stand far behind baseline, like Thiem. 

# Calculate each player's run index
# The run index is the estimated distance run by a player. 
# A run index of 1 is about half the width of a tennis court. 
# lateral_run_base_value estimates how much a player moves laterally 
# when we cannot derive movement from data. The player has to make a move.
# depth_run_base_value estimates how much a player moves in depth when he/she hits a ball
# when we cannot derive movement from data. The player has to make a move. 
# depth_run_weight is the ratio half_court_length / court_width.
def calculate_run_index(point_data, 
                        lateral_run_base_value = 0.3, 
                        depth_run_base_value = 0.2,
                        depth_run_weight = 1.44):
    point_data = point_data.copy()
    
    # retrieve the court position string
    # Pop the first one
    # if the lateral position is different from the previous position, +1 or +2
    # if the vertical position is different +1 or +2, 
    # based on the lateral and depth position change, calculate the run index, add it to the current shot run index

    point_data["server_run_index"] = 0.0
    point_data["returner_run_index"] = 0.0
    point_data["player1_run_index"] = 0.0
    point_data["player2_run_index"] = 0.0   
    
    for index, row in point_data.iterrows(): 
        
        source_columns = ["server_shots_code_w_position", "returner_shots_code_w_position"]
        target_columns = ["server_run_index", "returner_run_index"]
       
        # iterate through two lists in parallel
        for source_col, target_col in zip(source_columns, target_columns):
            
            # Retrieve all the position code
            court_positions = re.findall(r"(deuce|ad|middle|na)_(deep|short|net|na)", 
                                         row[source_col])

            # The starting position for each player when the point starts
            # We assume that the returner always stands behind the baseline.
            # But sometimes a returner stands inside the court but we don't know it 
            # from the data. 
            prev_lateral_position = row["serve_side"]
            prev_depth = "deep"
            while len(court_positions) > 0:
                # Pop the next position
                current_position = court_positions.pop(0)

                lateral_position = current_position[0]
                depth = current_position[1]

                # Only add a run when the previous position is known
                if (prev_lateral_position is not None) and (prev_depth is not None):
                    
                    # Calculate the lateral run index
                    # One lateral run is about 3.66 meter (1/3 of the court width)
                    if lateral_position != prev_lateral_position:
                        if (lateral_position == "middle") | (prev_lateral_position == "middle"):
                            # Running from middle to deuce or ad, 
                            # or running from deuce or ad to middle
                            lateral_run = 1
                        elif lateral_position in ["deuce", "ad"]:
                            # Running from deuce to ad, or vice versa.
                            lateral_run = 2
                        else:
                            # Unknown position
                            lateral_run = 0
                    else: 
                        # Even if the ball comes back to the same side,
                        # the player is not standing still. 
                        # It's reasonable to assume that the player still needs to 
                        # move lateral_run_base_value of one lateral run (about 1 meter) to get the ball. 
                        # The ball comes with a variety of lateral angles.
                        lateral_run = lateral_run_base_value

                    # Calculate depth run index
                    # One depth_run is about 3.96 meter
                    if depth != prev_depth:
                        if (depth == "short") | (prev_depth == "short"):
                            depth_run = 1
                        elif depth in ["deep", "net"]:
                            depth_run = 2
                        else:
                            depth_run = 0
                    else:
                        # Even if the ball comes back to the same depth,
                        # the player still needs to move a little. 
                        # The ball comes with a variety of depths, but not as much as in the lateral direction.
                        # It's reasonable to assume that a player will move 
                        # depth_run_base_value (0.2 by default) of one depth run.
                        depth_run = depth_run_base_value

                    # Convert lateral run and depth run to run index
                    # Because each half court is longer in depth than in width, we need to 
                    # give a little more weight to the depth_run. 
                    # Because half_court_length / court_width = 11.885 m / 8.23 m = 1.44,
                    # 1 depth_run = 1.44 * (1 lateral_run)
                    run_index = math.sqrt((lateral_run**2) + ((depth_run * depth_run_weight)**2))

                    point_data.at[index, target_col] += run_index

                prev_lateral_position = lateral_position
                prev_depth = depth
            
    # Assign to player1 and player 2
    point_data["player1_run_index"] = point_data.apply(
        lambda row: row["server_run_index"] 
        if (row["Svr"] == 1) else row["returner_run_index"], 
    axis = 1)
    
    point_data["player2_run_index"] = point_data.apply(
        lambda row: row["server_run_index"] 
        if (row["Svr"] == 2) else row["returner_run_index"], 
    axis = 1)
    
    return point_data
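
The contribution of a single move to the run index can be checked in isolation. The sketch below restates the inner loop of `calculate_run_index` for one pair of positions; `step_run_index` is a hypothetical helper, and its defaults are the same as the function's default parameters.

```python
import math

def step_run_index(prev_position, position,
                   lateral_run_base_value=0.3,
                   depth_run_base_value=0.2,
                   depth_run_weight=1.44):
    """Run index for one move between two (lateral, depth) positions.

    Hypothetical helper that mirrors the inner loop of calculate_run_index.
    """
    prev_lateral, prev_depth = prev_position
    lateral, depth = position

    if lateral == prev_lateral:
        lateral_run = lateral_run_base_value   # Small adjustment on the same side
    elif "middle" in (lateral, prev_lateral):
        lateral_run = 1                        # Middle to a side, or a side to middle
    elif lateral in ("deuce", "ad"):
        lateral_run = 2                        # Deuce to ad, or vice versa
    else:
        lateral_run = 0                        # Unknown position

    if depth == prev_depth:
        depth_run = depth_run_base_value
    elif "short" in (depth, prev_depth):
        depth_run = 1
    elif depth in ("deep", "net"):
        depth_run = 2
    else:
        depth_run = 0

    # Combine both components as the length of the diagonal.
    return math.sqrt(lateral_run**2 + (depth_run * depth_run_weight)**2)

# Side to side along the baseline, then baseline to net in the middle.
print(step_run_index(("deuce", "deep"), ("ad", "deep")))
print(step_run_index(("middle", "deep"), ("middle", "net")))
```

With the defaults, a move between two identical positions still adds a small positive amount, so the index grows with every shot in the rally.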

In [137]:
# aggressiveness = opponent run index, count of net shots, fh %, bh %, slice %, dropshot, change direction, hit to middle
# Can you correlate aggressiveness with winning or losing? 

# The goal is to identify a player's style: aggressive, defensive, variety, baseliner, all-court player, 
# Identify multiple dimensions and assign a scale to each dimension. A player's style is the combination of these dimensions. 
# How to define aggressive: use more aggressive shots, try to move the opponent, hit more winners and forced errors.  
# Defensive: wait for opponent to make error. 
# aggressiveness and defensiveness are relative. You need to compare this player with other players. 

# Calculate fh count, fh %, bh count, bh %, slice count, slice %, dropshot count, hit to middle count, hit to middle %
def calculate_shot_stats(point_data):
    point_data = point_data.copy()
    
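    # To-do (not implemented yet):
    # player1 shots code: count how many f, count fh dtl
    # count how many b, count bh dtl
    # count r & s (slice)
    # count u & v (dropshot)
    # count 2 (to middle)

    # Return the unchanged copy until the counts above are implemented.
    return point_data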

# def calculate_defensiveness_index()

# calculate average number of points per service game for each player

# Functions for retrieving information from the point data

In [138]:
# Input: point_data (all the point data), match_data (user selected matches)
# Output: df_selected_points (all the points from the selected matches)
# This function is used to select points before analyzing them. 
def get_points_in_matches(point_data, match_data):

    # There may be multiple match_ids.
    # Copy the selection to avoid the "A value is trying to be set on a copy of a 
    # slice from a DataFrame" warning when the result is modified later. 
    df_selected_points = point_data.loc[
        point_data["match_id"].isin(match_data["match_id"])].copy()

    # df_selected_points.info()
    return df_selected_points
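
The selection in `get_points_in_matches` relies on `isin`. A minimal sketch with made-up data (the match_ids below are hypothetical) shows which rows survive:

```python
import pandas as pd

# Hypothetical match_ids and points, for illustration only.
point_data = pd.DataFrame({
    "match_id": ["match_a", "match_a", "match_b", "match_c"],
    "Pt": [1, 2, 1, 1],
})
selected_matches = pd.DataFrame({"match_id": ["match_a", "match_c"]})

# Boolean mask: True where the point belongs to one of the selected matches.
mask = point_data["match_id"].isin(selected_matches["match_id"])
selected_points = point_data.loc[mask].copy()

print(selected_points)
```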

In [139]:
# Get the number of points in a match
# Cross check with the maximum point index in the data table

def get_point_count(point_data, match_id):
    points_in_match = point_data[point_data["match_id"] == match_id]
    
    point_count = len(points_in_match)
    max_point_index = points_in_match["Pt"].max()
    if point_count != max_point_index:
        print("Error: point count is different from the maximum point index.")
        print(match_id)
    
    return point_count
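
The same cross-check as in `get_point_count` can be run for every match at once with a `groupby`. This is a sketch on made-up data; the match_ids and the column names `point_count` and `max_point_index` in the summary are hypothetical.

```python
import pandas as pd

# Hypothetical points; match_b is missing point 2.
point_data = pd.DataFrame({
    "match_id": ["match_a", "match_a", "match_a", "match_b", "match_b"],
    "Pt": [1, 2, 3, 1, 3],
})

# One row per match: the number of points and the largest point index.
summary = point_data.groupby("match_id")["Pt"].agg(
    point_count="size", max_point_index="max")

# Matches where the two numbers disagree need a closer look.
mismatched = summary[summary["point_count"] != summary["max_point_index"]]
print(mismatched.index.tolist())   # ['match_b']
```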

In [140]:
# Get all the points either before or after the specified point_id
# If you want to get the points for the entire match, just use the default values.

# search_direction can be either "before" or "after".

# Return a list of points, list of sets (that each contains a list of games), 
# and a list of all the games, all sorted. 

# The returned data can be further processed to find service point, return points, 
# service games, and return games, etc.

def get_points_in_one_match(point_data, match_id, point_id = 0, search_direction = "after"):
    
    # Get all the points in the selected match
    points_in_match = point_data[point_data["match_id"] == match_id]
        
    if points_in_match.empty: 
        print("Error: Cannot find this match.")
        print(match_id)
        return None

    # Get all the points from the beginning of the match
    # This is the "flat" points table
    if search_direction == "before": 
        selected_points = points_in_match[(points_in_match["Pt"] < point_id)]
    elif search_direction == "after":
        selected_points = points_in_match[(points_in_match["Pt"] > point_id)]
    else:
        print("Error: You can only use \"before\" or \"after\" in search_direction.")
        return None
    
    if selected_points.empty:
        print("Error: Cannot find points.")
        return None
    
    # Sort the point table by point index
    selected_points = selected_points.sort_values(by=["Pt"])
    
    # Divide points by sets and games
    # This is the "nested" points table
    list_sets = []
    # Divide all the points into sets, and then into games.
    # The sets are sorted
    for set_index in np.sort(selected_points["set_index"].unique()):
        list_games = []
        # For each set, find all the points in this set
        points_in_set = selected_points[selected_points["set_index"] == set_index]
        
        # For each game, find all the points in this game
        # Divide all the points in a set into games.
        # The games are sorted
        for game_index in np.sort(points_in_set["game_index_in_set"].unique()):
            points_in_game = points_in_set[points_in_set["game_index_in_set"] == game_index]
            # Add all the points in a game to this list. 
            list_games.append(points_in_game)
        # Add the per-game point list to the set list.
        list_sets.append(list_games)
    
    # Check if the nested point list is the same as the flat point list
    
    # This is how to flatten a nested list into a flat list
    # Now we have a list of all the games
    list_games = [item for sublist in list_sets for item in sublist] 

    # Merge all the dataframes in the list into one
    # In this case, merge all the game dataframes into one
    selected_points2 = pd.concat(list_games)

    # Check if the nested list has the same content as the flat list
    if not selected_points.equals(selected_points2):
        print("Error: The nested point list " + 
              "is not the same as the flat point list.")
        
    # Return a flat point list, a nested set and game lists, and a game list
    return selected_points, list_sets, list_games
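
The nested set and game lists can also be built with `groupby`. The sketch below is an alternative to the two nested loops above, not the notebook's own implementation; `nest_points_by_set_and_game` and the toy values are hypothetical, and only the column names `Pt`, `set_index`, and `game_index_in_set` come from the notebook.

```python
import pandas as pd

# Toy points, deliberately out of order (values are made up).
toy_points = pd.DataFrame({
    "Pt": [3, 1, 2, 5, 4],
    "set_index": [1, 1, 1, 2, 1],
    "game_index_in_set": [2, 1, 1, 1, 2],
})

def nest_points_by_set_and_game(points):
    # Sort by point index first; groupby keeps the row order inside each group.
    points = points.sort_values(by=["Pt"])
    list_sets = []
    # sort=True iterates over the sets, and then the games, in ascending order.
    for _, points_in_set in points.groupby("set_index", sort=True):
        list_games = [game for _, game in
                      points_in_set.groupby("game_index_in_set", sort=True)]
        list_sets.append(list_games)
    return list_sets

nested = nest_points_by_set_and_game(toy_points)
print(len(nested), [len(games) for games in nested])
```

Each element of `nested` is one set, and each set is a list of per-game DataFrames, which is the same shape as `list_sets`.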

In [141]:
# Get points based on their places on the match timeline

# Keep the arguments simple. Avoid using long and complex arguments in a method. 
# It's better to write the simple functions and then write the 
# more complex functions by calling several basic functions, rather than creating a 
# function with complex arguments. 
def get_service_points(point_data, player):
    point_data = point_data.copy()
    
    return point_data[point_data["server_name"].str.contains(player)]

def get_return_points(point_data, player):
    point_data = point_data.copy()
    
    return point_data[point_data["returner_name"].str.contains(player)]

# Get the last x points of the given point list.
def get_last_x_points(point_data, x = 1):
    point_data = point_data.copy()
    
    point_data.sort_values(by=["Pt"], inplace = True)
    
    return point_data.tail(x)

def get_first_x_points(point_data, x = 1):
    point_data = point_data.copy()
    
    point_data.sort_values(by=["Pt"], inplace = True)
    
    return point_data.head(x)

def is_tiebreak_game(points_in_game):
    # Check if all the points in the given table are tiebreak points.
    tb_symbol = points_in_game["TB?"].unique()
    
    if len(tb_symbol) != 1:
        # If there are more than one symbols in "TB?"
        return False
    elif(tb_symbol[0] == 1):
        return True
    else:
        return False

# get_last_x_service_games can be achieved by calling get_service_games() 
# and then taking the last x games from the list.
# Tiebreak games are excluded. 
def get_service_games(list_games, player):
    list_service_games = []
    for game in list_games:
        if (not is_tiebreak_game(game) and 
            not game[game["server_name"].str.contains(player)].empty):
            list_service_games.append(game)
    
    return list_service_games

# Tiebreak games are excluded
def get_return_games(list_games, player):
    list_return_games = []
    
    for game in list_games:
        if (not is_tiebreak_game(game) and 
            not game[game["returner_name"].str.contains(player)].empty):
            list_return_games.append(game)
    
    return list_return_games

# Get tiebreak games
def get_tiebreak_games(list_games):
    list_tiebreak_games = []

    for game in list_games:
        if (is_tiebreak_game(game) == True):
            list_tiebreak_games.append(game)
    
    return list_tiebreak_games
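
The comment above `get_service_games()` says that the last x service games can be obtained by composing simple helpers. A minimal sketch of that composition follows. It is self-contained, so it uses a simplified stand-in (`select_service_games`, exact name match) instead of the notebook's function; `make_game`, `get_last_x_service_games`, and the toy data are hypothetical.

```python
import pandas as pd

def make_game(game_index, server, is_tiebreak=0, n_points=4):
    # Build a toy game table with the columns the helpers rely on.
    return pd.DataFrame({
        "game_index_in_set": [game_index] * n_points,
        "server_name": [server] * n_points,
        "TB?": [is_tiebreak] * n_points,
    })

def select_service_games(list_games, player):
    # Simplified stand-in for get_service_games():
    # keep non-tiebreak games in which the player served.
    return [
        game for game in list_games
        if (game["TB?"] != 1).all() and (game["server_name"] == player).any()
    ]

def get_last_x_service_games(list_games, player, x=1):
    service_games = select_service_games(list_games, player)
    # Guard x <= 0, because service_games[-0:] would return the whole list.
    return service_games[-x:] if x > 0 else []

toy_games = [
    make_game(1, "Player A"),
    make_game(2, "Player B"),
    make_game(3, "Player A"),
    make_game(4, "Player A", is_tiebreak=1),
]

last_game = get_last_x_service_games(toy_games, "Player A", x=1)
print([int(game["game_index_in_set"].iloc[0]) for game in last_game])
```

The tiebreak (game 4) is skipped, so the last service game of Player A is game 3.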
    
    

In [142]:
def get_player_names(point_data, match_id):
    points_in_match= point_data[point_data["match_id"] == match_id]
    
    if points_in_match.empty:
        print("Error: The specified match does not exist.")
        return None
    
    # Read the names as scalar values. Series.to_string() may pad the 
    # names with whitespace, depending on the pandas version.
    first_point = points_in_match.iloc[0]
    
    return first_point["player1"], first_point["player2"]

In [143]:
def get_score(point_data, match_id):
    
    points_in_match= point_data[point_data["match_id"] == match_id]

    if points_in_match.empty:
        print("Error: The specified match does not exist.")
        return None
    
    player1_name, player2_name = get_player_names(point_data, match_id)
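    # Collect the last point of every set, then build the score table in one step.
    # (DataFrame.append() was removed in pandas 2.0.)
    last_points = []
    
    set_indices = np.sort(points_in_match["set_index"].unique())
    
    for set_index in set_indices:
        points_in_set = points_in_match[points_in_match["set_index"] == set_index]
        # Sort by point index so that tail(1) is the last point of the set.
        points_in_set = points_in_set.sort_values(by="Pt")
        last_point = points_in_set.tail(1)
        last_points.append(last_point[["Set1.1", "Set2.1", "Gm1.1", "Gm2.1"]])
    
    scores = pd.concat(last_points, ignore_index = True)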
    
    scores.columns =[player1_name + "_set", player2_name + "_set", player1_name+"_game", player2_name+"_game"] 
    return scores
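
The to-do at the top of the next cell describes a potency score: 1 point for a winner, 0.5 point for the shot before a winner, and -1 point for an unforced error. The sketch below shows one way that rule could be coded. The input format (one list of `(shot_type, outcome)` tuples per rally, holding one player's shots in order), the function name `shot_potency`, and the reading of "before a winner" as the same player's previous shot are all assumptions, not the notebook's data model.

```python
def shot_potency(rallies, shot_type):
    # rallies: list of rallies; each rally is a list of (shot_type, outcome)
    # tuples for ONE player's shots, in the order they were hit.
    # outcome is "winner", "unforced_error", or None (hypothetical labels).
    score = 0.0
    shot_count = 0
    for shots in rallies:
        for i, (kind, outcome) in enumerate(shots):
            if kind != shot_type:
                continue
            shot_count += 1
            if outcome == "winner":
                score += 1.0
            elif outcome == "unforced_error":
                score -= 1.0
            elif i + 1 < len(shots) and shots[i + 1][1] == "winner":
                # This shot set up the player's winner on the next shot.
                score += 0.5
    # Return the shot count too, so the score can be normalized later.
    return score, shot_count

toy_rallies = [
    [("f", None), ("f", "winner")],
    [("b", None), ("f", "unforced_error")],
    [("f", None), ("b", "winner")],
]

forehand_potency = shot_potency(toy_rallies, "f")
backhand_potency = shot_potency(toy_rallies, "b")
print(forehand_potency, backhand_potency)
```

Returning the number of shots alongside the score follows the to-do's note that the number of shots made should be taken into consideration.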

In [144]:
# to-do: calculate forehand potency and backhand potency based on Jeff Sackmann's formula. 
# Do the same thing for other shots: slice, volley. Take the number of shots made into consideration. 
# 1 point for a forehand/backhand winner, 0.5 point for a forehand or backhand before a winner, 
# -1 point for an unforced error.
# Serve potency 

# to-do: how to detect the turning point in a match? When does a player start to dominate?
# Can we use run index? When the gap between the run index starts to grow, when a player comes forward?

# to-do: Take shot directions into consideration in the calculation of variety

# to-do: How to detect patterns in rallies?

# to-do: get stats for the given points

# There are four main lines of research: 
# 1. visualize the scores and stats (live data and more detailed match chart data)
# 1.1 Storytelling and suspense
# 2. analyze and visualize player styles (serve and rally)
# 2.1 analyze player strengths and weaknesses
# 3. Tactical analysis and recommendations 
# 4. Injury identification, visualization, and prevention (linking type of shots and known injury?)
# Which part of the body or muscle is used more frequently than others?

# Research idea: analyze and visualize player serve styles
# Research idea: analyze and visualize player's strength, weaknesses, style
# Need to come up with a list 
# Research idea: find frequently used shot patterns/combinations, calculate the potency of these patterns
# Make suggestions based on strength and weaknesses of individual shots and combinations. 
# Is there a more sophisticated way to link shots and combinations and win or loss?

# Get the following stats for a player

# first serve return point won %
# second serve return point won %
# first serve wide %
# first serve wide in %
# first serve T %
# first serve T in %
# first serve body %
# first serve body in %
# first serve wide won %
# first serve T won %
# first serve body won %
# first serve short rally %
# second serve short rally %
# first serve short rally won %
# second serve short rally won % 
# first serve + forehand %
# first serve + backhand %
# first serve + forehand won %
# first serve + backhand won % 
# first serve aggressive play %
# first serve defensive play %
# first serve neutral play %
# first serve long running % 
# first serve average rally length
# first serve short, medium, and long rally distribution 
# 

# second serve directions %

# The key is to identify how to win points in second serves. 
# The key is to identify how to win return points
# First serve points are relatively easy to win

# number of winners
# number of opponent's forced errors
# number of unforced errors
# number of break points won

# Critical points won %

# Return a dataframe that contains the statistics for all the players in the data set
# This data is for game-level, set-level, or match-level data analysis. 
# Point-level data is collected in other functions. 
def get_stats(point_data):
    stats = pd.DataFrame(columns = [
        "player_name",
        "first_serve_in_%",
        "first_serve_won_%",
        "second_serve_in_%",
        "second_serve_won_%",
        "ace_count",
        "unreturnable_serve_count",
        "double_fault_count",
        "first_serve_rally_length_median",
        "first_serve_rally_length_mean",
        "first_serve_rally_length_upper_quantile",
        "first_serve_rally_length_max",
        "second_serve_rally_length_median",
        "second_serve_rally_length_mean",
        "second_serve_rally_length_max",
        "second_serve_rally_length_upper_quantile",
        "first_serve_deuce_won_%",
        "first_serve_ad_won_%",
        "second_serve_deuce_won_%",
        "second_serve_ad_won_%",
        "first_serve_deuce_rally_length_mean",
        "first_serve_ad_rally_length_mean",
        "second_serve_deuce_rally_length_mean",
        "second_serve_ad_rally_length_mean",
        "first_serve_wide_%",
        "first_serve_wide_won_%",
        "first_serve_wide_rally_length_mean",
        "first_serve_body_%",
        "first_serve_body_won_%",
        "first_serve_body_rally_length_mean",
        "first_serve_t_%",
        "first_serve_t_won_%",
        "first_serve_t_rally_length_mean",
        "second_serve_wide_%",
        "second_serve_wide_won_%",
        "second_serve_wide_rally_length_mean",
        "second_serve_body_%",
        "second_serve_body_won_%",
        "second_serve_body_rally_length_mean",
        "second_serve_t_%",
        "second_serve_t_won_%",
        "second_serve_t_rally_length_mean",
        "break_point_count",
        "break_point_won_count",
        "game_point_count",
        "game_point_won_count",
        "set_point_count",
        "set_point_won_count",
        "match_point_count",
        "match_point_won_count",
        "setup_point_count",
        "setup_point_won_count",
        "deuce_point_count",
        "deuce_point_won_count",
        "forehand_%",
        "forehand_won_%", # Need to consider directions too
        "forehand_winner_opp_forced_error_%",
        "forehand_unforced_error_%",
        "backhand_%",
        # The comma after "backhand_won_%" was missing, which silently merged 
        # this column name with the next one.
        "backhand_won_%",
        "backhand_winner_opp_forced_error_%",
        "backhand_unforced_error_%",
        "backhand_slice_%",
        "backhand_slice_won_%", # Last shot or just part of rally?
        "backhand_slice_winner_opp_forced_error_%",
        "backhand_slice_unforced_error_%", # consider approach shot too
        "netpoint_%",
        "netpoint_won_%",
        "netpoint_unforced_error_%",
        "relative_short_rally_won_%",
        "relative_medium_rally_won_%",
        "relative_long_rally_won_%",
    ])
    
    # Get all the servers, divide the data by server, calculate stats, add to dataframe
    # Get a list of unique players. We assume every player serves. 
    list_players = point_data["server_name"].unique()
    
    # Add a row for each player.
    # DataFrame.append() was removed in pandas 2.0, so create all the rows at once.
    stats = stats.reindex(range(len(list_players)))
    stats["player_name"] = list_players
    
    # Service-related stats
    for player in list_players:
        # Select all the points served by the selected server
        server_points = point_data.loc[point_data["server_name"] == player]
        returner_points = point_data.loc[point_data["returner_name"] == player]
        
        # Get the index of the selected player's row in stats
        index = (stats.loc[stats["player_name"] == player].index)[0]
        
        first_serve_in_points = server_points.loc[server_points["1stIn"] == 1]
        second_serve_in_points = server_points.loc[server_points["2ndIn"] == 1]
        
        # count() excludes all None and NaN cells.
        first_serve_in_count = first_serve_in_points["1stIn"].count()
        first_serve_count = server_points["1stIn"].count()
        stats.loc[index, "first_serve_in_%"] = first_serve_in_count / first_serve_count 
        
        first_serve_won_count = first_serve_in_points.loc[first_serve_in_points["isSvrWinner"] == 1, "1stIn"].count()
        stats.loc[index, "first_serve_won_%"] = first_serve_won_count / first_serve_in_count
        
        second_serve_in_count = second_serve_in_points["2ndIn"].count()
        # count() excludes all the None and NaN cells.
        second_serve_count = server_points["2ndIn"].count()
        stats.loc[index, "second_serve_in_%"] = second_serve_in_count / second_serve_count
        
        second_serve_won_count = second_serve_in_points.loc[(second_serve_in_points["isSvrWinner"] == 1), "2ndIn"].count()
        stats.loc[index, "second_serve_won_%"] = second_serve_won_count / second_serve_in_count
        
        stats.loc[index, "ace_count"] = server_points.loc[(server_points["isAce"] == True), "isAce"].count()
        
        stats.loc[index, "unreturnable_serve_count"] = \
            server_points.loc[(server_points["isUnret"] == True), "isUnret"].count()
        
        stats.loc[index, "double_fault_count"] = \
            server_points.loc[(server_points["isDouble"] == True), "isDouble"].count()
    
        stats.loc[index, "first_serve_rally_length_median"] = first_serve_in_points["rallyLen"].median()
        stats.loc[index, "first_serve_rally_length_mean"] = first_serve_in_points["rallyLen"].mean()
        stats.loc[index, "first_serve_rally_length_max"] = first_serve_in_points["rallyLen"].max()
        stats.loc[index, "first_serve_rally_length_upper_quantile"] = first_serve_in_points["rallyLen"].quantile(0.75)

        stats.loc[index, "second_serve_rally_length_median"] = second_serve_in_points["rallyLen"].median()
        stats.loc[index, "second_serve_rally_length_mean"] = second_serve_in_points["rallyLen"].mean()
        stats.loc[index, "second_serve_rally_length_max"] = second_serve_in_points["rallyLen"].max()
        stats.loc[index, "second_serve_rally_length_upper_quantile"] = second_serve_in_points["rallyLen"].quantile(0.75)
        
        first_serve_deuce_points = first_serve_in_points.loc[first_serve_in_points["serve_side"] == "deuce"]
        first_serve_deuce_points_won = first_serve_deuce_points.loc[first_serve_deuce_points["point_winner_name"] == player]
        stats.loc[index, "first_serve_deuce_won_%"] = (first_serve_deuce_points_won["point_winner_name"].count() / 
                                                       first_serve_deuce_points["point_winner_name"].count())
        
        first_serve_ad_points = first_serve_in_points.loc[first_serve_in_points["serve_side"] == "ad"]
        first_serve_ad_points_won = first_serve_ad_points.loc[first_serve_ad_points["point_winner_name"] == player]
        stats.loc[index, "first_serve_ad_won_%"] = (first_serve_ad_points_won["point_winner_name"].count() / 
                                                       first_serve_ad_points["point_winner_name"].count())
    
        stats.loc[index, "first_serve_deuce_rally_length_mean"] = first_serve_deuce_points["rallyLen"].mean()
        stats.loc[index, "first_serve_ad_rally_length_mean"] = first_serve_ad_points["rallyLen"].mean()
    
        second_serve_deuce_points = second_serve_in_points.loc[second_serve_in_points["serve_side"] == "deuce"]
        second_serve_deuce_points_won = second_serve_deuce_points.loc[second_serve_deuce_points["point_winner_name"] == player]
        stats.loc[index, "second_serve_deuce_won_%"] = (second_serve_deuce_points_won["point_winner_name"].count() / 
                                                       second_serve_deuce_points["point_winner_name"].count())
        
        second_serve_ad_points = second_serve_in_points.loc[second_serve_in_points["serve_side"] == "ad"]
        second_serve_ad_points_won = second_serve_ad_points.loc[second_serve_ad_points["point_winner_name"] == player]
        stats.loc[index, "second_serve_ad_won_%"] = (second_serve_ad_points_won["point_winner_name"].count() / 
                                                       second_serve_ad_points["point_winner_name"].count())

        stats.loc[index, "second_serve_deuce_rally_length_mean"] = second_serve_deuce_points["rallyLen"].mean()
        stats.loc[index, "second_serve_ad_rally_length_mean"] = second_serve_ad_points["rallyLen"].mean()
        
        first_serve_wide_points = server_points.loc[server_points["Sv1_direction"] == "wide"]
        first_serve_body_points = server_points.loc[server_points["Sv1_direction"] == "body"]
        first_serve_t_points = server_points.loc[server_points["Sv1_direction"] == "t"]
        # Every point has a first serve
        first_serve_count = server_points["Sv1_direction"].count()
        stats.loc[index, "first_serve_wide_%"] = first_serve_wide_points["Sv1_direction"].count() / first_serve_count
        stats.loc[index, "first_serve_body_%"] = first_serve_body_points["Sv1_direction"].count() / first_serve_count
        stats.loc[index, "first_serve_t_%"] = first_serve_t_points["Sv1_direction"].count() / first_serve_count
        
        second_serve_wide_points = server_points.loc[server_points["Sv2_direction"] == "wide"]
        second_serve_body_points = server_points.loc[server_points["Sv2_direction"] == "body"]
        second_serve_t_points = server_points.loc[server_points["Sv2_direction"] == "t"]
        second_serve_count = server_points.loc[server_points["Sv2_direction"] != "unknown", "Sv2_direction"].count()
        
        stats.loc[index, "second_serve_wide_%"] = second_serve_wide_points["Sv2_direction"].count() / second_serve_count
        stats.loc[index, "second_serve_body_%"] = second_serve_body_points["Sv2_direction"].count() / second_serve_count
        stats.loc[index, "second_serve_t_%"] = second_serve_t_points["Sv2_direction"].count() / second_serve_count 

        first_serve_wide_points_won = first_serve_wide_points.loc[first_serve_wide_points["point_winner_name"] == player]
        first_serve_body_points_won = first_serve_body_points.loc[first_serve_body_points["point_winner_name"] == player]
        first_serve_t_points_won = first_serve_t_points.loc[first_serve_t_points["point_winner_name"] == player]
        
        min_sample_size = 10
        # if the sample size is too small, don't calculate the winning %.
        first_serve_wide_count = first_serve_wide_points["point_winner_name"].count()
        if first_serve_wide_count >= min_sample_size: 
            first_serve_wide_won_count = first_serve_wide_points_won["point_winner_name"].count()
            stats.loc[index, "first_serve_wide_won_%"] = \
                first_serve_wide_won_count / first_serve_wide_count
        
        first_serve_body_count = first_serve_body_points["point_winner_name"].count()
        if first_serve_body_count >= min_sample_size: 
            first_serve_body_won_count = first_serve_body_points_won["point_winner_name"].count()
            stats.loc[index, "first_serve_body_won_%"] = \
                first_serve_body_won_count / first_serve_body_count
        
        first_serve_t_count = first_serve_t_points["point_winner_name"].count()
        if first_serve_t_count >= min_sample_size: 
            first_serve_t_won_count = first_serve_t_points_won["point_winner_name"].count()
            stats.loc[index, "first_serve_t_won_%"] = \
                first_serve_t_won_count / first_serve_t_count
            
        # second serve  
        second_serve_wide_points_won = second_serve_wide_points.loc[second_serve_wide_points["point_winner_name"] == player]
        second_serve_body_points_won = second_serve_body_points.loc[second_serve_body_points["point_winner_name"] == player]
        second_serve_t_points_won = second_serve_t_points.loc[second_serve_t_points["point_winner_name"] == player]
        
        second_serve_wide_count = second_serve_wide_points["point_winner_name"].count()
        if second_serve_wide_count >= min_sample_size: 
            second_serve_wide_won_count = second_serve_wide_points_won["point_winner_name"].count()
            stats.loc[index, "second_serve_wide_won_%"] = \
                second_serve_wide_won_count / second_serve_wide_count
        
        second_serve_body_count = second_serve_body_points["point_winner_name"].count()
        if second_serve_body_count >= min_sample_size: 
            second_serve_body_won_count = second_serve_body_points_won["point_winner_name"].count()
            stats.loc[index, "second_serve_body_won_%"] = \
                second_serve_body_won_count / second_serve_body_count
        
        second_serve_t_count = second_serve_t_points["point_winner_name"].count()
        if second_serve_t_count >= min_sample_size: 
            second_serve_t_won_count = second_serve_t_points_won["point_winner_name"].count()
            stats.loc[index, "second_serve_t_won_%"] = \
                second_serve_t_won_count / second_serve_t_count
            
        stats.loc[index, "first_serve_wide_rally_length_mean"] = first_serve_wide_points["rallyLen"].mean()
        stats.loc[index, "first_serve_body_rally_length_mean"] = first_serve_body_points["rallyLen"].mean()
        stats.loc[index, "first_serve_t_rally_length_mean"] = first_serve_t_points["rallyLen"].mean()
        
        stats.loc[index, "second_serve_wide_rally_length_mean"] = second_serve_wide_points["rallyLen"].mean()
        stats.loc[index, "second_serve_body_rally_length_mean"] = second_serve_body_points["rallyLen"].mean()
        stats.loc[index, "second_serve_t_rally_length_mean"] = second_serve_t_points["rallyLen"].mean()

        # Critical point stats
        # Since set_point and match_point are also break_point or game_point, there 
        # is no need to count set_point and match_point when you count the total number of critical points. 
        keywords = ["break_point", "game_point", "set_point", 
                    "match_point", "setup_point", "deuce_point"]
        
        print(player)
        for keyword in keywords:
            print(keyword)
            server_critical_points = \
                server_points.loc[server_points["server_critical_point"].str.contains(keyword, na=False)]
            returner_critical_points = \
                returner_points.loc[returner_points["returner_critical_point"].str.contains(keyword, na=False)]
           
            server_critical_point_count = server_critical_points["server_critical_point"].count()
            print(server_critical_point_count)
            returner_critical_point_count = returner_critical_points["returner_critical_point"].count()
            print(returner_critical_point_count)
            critical_point_count = server_critical_point_count + returner_critical_point_count
            
            stats.loc[index, keyword+"_count"] = critical_point_count

            server_critical_points_won = \
                server_critical_points.loc[server_critical_points["point_winner_name"] == player]
            returner_critical_points_won = \
                returner_critical_points.loc[returner_critical_points["point_winner_name"] == player]
            
            server_critical_point_won_count = server_critical_points_won["server_critical_point"].count()
            print(server_critical_point_won_count)
            returner_critical_point_won_count = returner_critical_points_won["returner_critical_point"].count()
            print(returner_critical_point_won_count)
            critical_point_won_count = server_critical_point_won_count + returner_critical_point_won_count
            
            stats.loc[index, keyword+"_won_count"] = critical_point_won_count 
    
    # You cannot use round() if there are non-numerical values in the dataframe.
    stats.loc[:, stats.columns != "player_name"] = stats.loc[:, stats.columns != "player_name"].astype(float).round(decimals=3)
    return stats
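
`get_stats()` computes every percentage from separate per-player counts. As a cross-check, the two first-serve percentages can be sketched with `groupby`. This is a sketch under the assumption that `1stIn` and `isSvrWinner` contain only 0, 1, or NaN, so that the mean of a column equals the share of 1s; `serve_summary` and the toy values are hypothetical.

```python
import pandas as pd

# Toy point table (values are made up; column names follow the notebook).
toy_points = pd.DataFrame({
    "server_name": ["A", "A", "A", "A", "B", "B"],
    "1stIn": [1, 1, 0, 1, 0, 1],
    "isSvrWinner": [1, 0, 1, 1, 0, 1],
})

def serve_summary(points):
    # Share of points on which the first serve went in, per server.
    first_in_pct = points.groupby("server_name")["1stIn"].mean()
    # Among the first serves that went in, share of points won by the server.
    first_in_points = points[points["1stIn"] == 1]
    first_won_pct = first_in_points.groupby("server_name")["isSvrWinner"].mean()
    summary = pd.DataFrame({
        "first_serve_in_%": first_in_pct,
        "first_serve_won_%": first_won_pct,
    })
    return summary.round(3)

toy_summary = serve_summary(toy_points)
print(toy_summary)
```

If the numbers from this sketch and from `get_stats()` disagree for a real player, the columns probably contain values other than 0 and 1 and need cleaning first.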


In [396]:
# to-do: add a code for first and second serve. Treat first and second serves just as another shot code. 
# remove let code c
# remove shot end code to find most frequent patterns. Maybe it's better to keep the shot end code. 
# Remove repeated shot code? Do you remove 1-shot repetition, 2-shot repetition, 3-shot repetition, etc.?
# Remove shot direction?

import re

# Input: a Series of rallies, minimum shot pattern length, and maximum shot pattern length
# Output: a dictionary with shot patterns and frequencies
def count_shot_pattern_freq(rallies, min_pattern_len = 1, max_pattern_len = 4):
    
    shot_pattern_freq = {}
    # Regular expression of one shot code, including the trailing comma.
    # Use a raw string so that "\d" etc. are passed to the regex engine unchanged.
    # To-do: add first and second serve code
    shot_code = r"(((FIR)|(SEC)|(RET)*[fbrsvzopuylmhijktq])[\+\-\=;^]*\d*[*nwdx!e@#SRPQC]*,)"
        
    for index, rally in rallies.items():
        # Find and store the starting position of every single shot code
        # You cannot save the iterator from re.finditer() and use it repeatedly.
        # An iterator object cannot be easily reset.
        # It's better to save the starting positions.
        shot_start_positions = [m.start() for m in re.finditer(shot_code, rally)]

        # A pattern cannot be longer than the number of shots in this rally.
        # Use a local variable so that one short rally does not shrink 
        # max_pattern_len for all the rallies that follow.
        rally_max_pattern_len = min(max_pattern_len, len(shot_start_positions))

        for pattern_len in range(min_pattern_len, rally_max_pattern_len + 1):
            # Retrieve shot patterns of pattern_len shots.
            shot_pattern = shot_code + "{" + str(pattern_len) + "}"

            # Try to match a pattern starting at every shot code
            for start in shot_start_positions:
                sub = re.match(shot_pattern, rally[start:])
                if sub is not None:
                    # Get the matching string
                    sub_string = sub.group(0)
                    # Count the frequency of this shot pattern
                    shot_pattern_freq[sub_string] = shot_pattern_freq.get(sub_string, 0) + 1
                        
    return shot_pattern_freq
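
Counting shot patterns is n-gram counting over comma-terminated shot codes. The sketch below shows that idea with `str.split` and `collections.Counter` instead of the regular expression. It is a simplified alternative, not a replacement: it does not validate the shot codes the way the regex does, and `count_shot_ngrams` and the toy rallies are hypothetical.

```python
from collections import Counter

def count_shot_ngrams(rallies, min_len=1, max_len=4):
    freq = Counter()
    for rally in rallies:
        # Every shot code ends with a comma, so drop the empty trailing piece.
        shots = [shot for shot in rally.split(",") if shot]
        # A pattern cannot be longer than the rally itself.
        for n in range(min_len, min(max_len, len(shots)) + 1):
            # Slide a window of n shots over the rally.
            for start in range(len(shots) - n + 1):
                pattern = ",".join(shots[start:start + n]) + ","
                freq[pattern] += 1
    return freq

toy_rallies = ["f1,b2,f1,", "f1,b2,"]
toy_freq = count_shot_ngrams(toy_rallies, min_len=1, max_len=2)
print(toy_freq.most_common())
```

The keys keep the trailing comma so that they look like the keys produced by `count_shot_pattern_freq()`.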

In [388]:
def is_leftie(player_name, file_name = "player_handedness.csv"):
    handedness_table = pd.read_csv(file_name)
    
    player_handedness = handedness_table.loc[handedness_table["player"].str.contains(player_name), "handedness"]
    if player_handedness.size == 1:
        return player_handedness.iloc[0] == "l"
    else:
        print("is_leftie() error: Player doesn't exist or there are multiple entries.")
        return False
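
`is_leftie()` reads the CSV file on every call. When it is called for many players, one option is to load the table once and look names up in a dictionary. The sketch below assumes exact name matches (the notebook uses `str.contains`) and uses an in-memory toy CSV in place of `player_handedness.csv`; `is_leftie_cached` and the player names are hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for player_handedness.csv (made-up rows).
toy_csv = io.StringIO("player,handedness\nPlayer A,l\nPlayer B,r\n")

# Load the table once and turn it into a name -> handedness lookup.
handedness_table = pd.read_csv(toy_csv)
handedness_by_player = dict(zip(handedness_table["player"],
                                handedness_table["handedness"]))

def is_leftie_cached(player_name, lookup=handedness_by_player):
    # Unknown players return False, like the error branch of is_leftie().
    return lookup.get(player_name) == "l"

print(is_leftie_cached("Player A"), is_leftie_cached("Player B"))
```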
    

In [395]:
# This function assumes that the dataframe only contains two players.
# To-do: Calculate a player's shot pattern frequency over multiple matches. 
# Find all the unique players. For each player, find all the rows with this player as player1 and retrieve player1 shot code,
# count frequency, and then find all the rows with this player as player2, retrieve all the player2 shot code, count frequency,
# merge the two tables. Repeat for every player. 
# To-do: add first and second serve code.
# To-do: simplify, remove shot direction and merge repetition, 
# Add critical moments
# Add service game and return game pattern frequencies.
# Add high pressure points
# The goal is to build a player profile. 
def get_shot_pattern_freq_table(point_data, 
                                min_pattern_len = 1, max_pattern_len = 5, 
                                flip_shot_code_for_leftie = False,
                               remove_shot_direction = False,
                               merge_repetition = False):
    point_data = point_data.copy()
    
    shot_pattern_freq_list = []
    #print(points_in_match.loc[points_in_match.index[0], players])

    for player_id in [1, 2]:
        col = "player"+str(player_id)+"_shots_code"
        
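        # Assign the result back. fillna(inplace=True) on a selected column 
        # may not update point_data in recent pandas versions.
        point_data[col] = point_data[col].fillna("")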
        
#         print(point_data[col])

        for index, row in point_data.iterrows():
            if row["Svr"] == player_id:
                # This player is serving
                if row["1stIn"] == 1:
#                     print("first serve in")
                    # first serve in
                    point_data.at[index, col] = \
                    ("FIR" + row["Sv1"] + "," + point_data.loc[index, col])
                elif row["2ndIn"] == 1:
#                     print("2nd serve in")
                    point_data.at[index, col] = \
                    ("SEC" + row["Sv2"] + "," + point_data.loc[index, col])
            elif row["Ret"] == player_id:
#                 print("returning")
                # This player is returning
                point_data.loc[index, col] = \
                    ("RET" + point_data.loc[index, col])
        
#         print(point_data[col])
        dict_freq = count_shot_pattern_freq(point_data[col],
                                           min_pattern_len = min_pattern_len, 
                                            max_pattern_len = max_pattern_len)

        # Use the first row of the dataframe passed in, not the global points_in_match.
        player_name = point_data.loc[point_data.index[0], "player"+str(player_id)]
        # Create a dataframe from a dictionary, with keys and values as two columns. 
        df = pd.DataFrame(list(dict_freq.items()), 
                          columns = ["shot_pattern", player_name])
        
        # Flip the shot direction code 1 to 3 and 3 to 1 for leftie players so 
        # the patterns can be easily compared with the right handed players. 
        if flip_shot_code_for_leftie and is_leftie(player_name):
            print("Flip shot code for leftie player: " + player_name)
            df["shot_pattern"] = df["shot_pattern"].str.replace("1", "ONE")
            df["shot_pattern"] = df["shot_pattern"].str.replace("3", "THREE")
            df["shot_pattern"] = df["shot_pattern"].str.replace("ONE", "3")
            df["shot_pattern"] = df["shot_pattern"].str.replace("THREE", "1")
            
        shot_pattern_freq_list.append(df)

    df_shot_pattern_freq = pd.merge(shot_pattern_freq_list[0], shot_pattern_freq_list[1], 
                                 how = "outer", on = ["shot_pattern"])
    df_shot_pattern_freq.sort_values(by = [df_shot_pattern_freq.columns[1], df_shot_pattern_freq.columns[2]], 
                                  ascending=False, inplace=True)
    
    return df_shot_pattern_freq
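
The placeholder swap above ("1" to "ONE" to "3") can also be done in a single pass. The following is a minimal sketch, assuming the direction codes are the single characters 1 and 3; the helper name `flip_direction_codes` and the sample patterns are hypothetical and not part of the notebook.

```python
import pandas as pd

def flip_direction_codes(patterns: pd.Series) -> pd.Series:
    """Swap direction codes 1 and 3 in every pattern string (hypothetical helper)."""
    # str.maketrans builds a character-to-character mapping that is applied in one
    # pass, so no intermediate placeholder text is needed.
    swap_table = str.maketrans("13", "31")
    return patterns.str.translate(swap_table)

# Made-up shot patterns for illustration only.
example = pd.Series(["f1,b3", "FIR4,f2", "b3,b3,f1"])
flipped = flip_direction_codes(example)
print(flipped.tolist())
```

Because both characters are mapped at once, a "1" that becomes "3" is never converted back.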

# This is the method I'm working on right now. <a name='coding' /> 
# Go to <a href=#testing>Testing Code</a>



In [37]:
# win_loss
# outcome: winner, opponent_forced_error, etc. 
# length
# shot_type
# def get_points_by_type(player, win_loss, outcome, length, shot_type)

# To-do

# What is the most efficient way to turn a match around? What is the minimal intervention to change the outcome of a match?
# What is the minimal intervention to reduce the number of unforced errors? What is the thing you can fix to eliminate the most errors or losses?

First, we try to identify the characteristics of a player's game. We try to find the ingredients of his game, just like the ingredients of a chemical substance (e.g., paint). We start with single shots and count their frequencies. Then we look for 2-shot and 3-shot patterns and count their frequencies. 
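
Counting single shots and short patterns can be sketched as n-gram counting. This is an illustration only, assuming each point's shots are stored as a comma-separated string of shot codes; `count_patterns` is a hypothetical name and is not the notebook's `count_shot_pattern_freq`.

```python
from collections import Counter

def count_patterns(shot_sequences, min_len=1, max_len=3):
    """Count every run of consecutive shots whose length is in [min_len, max_len]."""
    counts = Counter()
    for sequence in shot_sequences:
        # Assumed format: shot codes separated by commas; empty pieces are skipped.
        shots = [s for s in sequence.split(",") if s]
        for n in range(min_len, max_len + 1):
            for start in range(len(shots) - n + 1):
                counts[",".join(shots[start:start + n])] += 1
    return counts

# Two made-up points for one player.
toy_points = ["f1,b3,f1", "f1,b3"]
pattern_counts = count_patterns(toy_points)
print(pattern_counts.most_common(3))
```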

There are three measurements of a shot: consistency, power, and precision. Power and precision can be measured by the outcome, so we merge them into effectiveness.
# How to measure consistency? Count of unforced errors; divide unforced errors by shot type; unforced error %; Max, min, mean, and median shots between unforced errors, standard deviation.
# How to measure effectiveness? Count of winner/FE; divide winner/FE by shot type; combine shot type and shot direction; winner/FE %; Shots between winners and the opponent's forced errors. 
# How to measure vulnerability? Same as above. Shots before an opponent's winner/your FE. 

If the pattern is consistent across many matches against a player, we can then identify it as a characteristic of a player. 

In addition to the point outcome, we can also use one player's running left and right or back and forth as a way to measure his vulnerability and then associate shots or shot patterns with it. 

# How to identify the cause of winning and losing a point, a game, a set, or a match? How close are you to winning? You analyze the match on different levels. 
Match-level: how close are they?
Set-level: how close are they?
Game-level: how close are they? Correlate certain parameters with winning a game (first serve %, etc.)
Point-level: Correlate the count of single shots (fh, bh, slice) with winning or losing a point. Correlate the count of combinations with winning or losing a point. 

# The shot preceding a winner/opp_forced_error or an unforced_error will be given a lethality score and entered into a hash table. The shot preceding that shot will be given a lower lethality score and entered as well. The further a shot is from the outcome, the lower its lethality score. 
The 2-shot and 3-shot patterns preceding the outcome will also be given a score and entered into a hash table. In the end, the frequency and the score of these patterns will be used to determine their effectiveness and vulnerability. 
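
One way to sketch the lethality idea is a dictionary of accumulated scores, where the score halves with each step away from the outcome. The decay factor of 0.5, the look-back of 3 shots, and the function name `accumulate_lethality` are all assumptions made for illustration.

```python
from collections import defaultdict

def accumulate_lethality(shots, scores, decay=0.5, max_lookback=3):
    """Add decayed scores for the last shots of one rally to a running table.

    shots: one player's shot codes in order, ending with the shot closest to the outcome.
    scores: dict mapping shot code -> accumulated lethality score (updated in place).
    """
    for distance, shot in enumerate(reversed(shots[-max_lookback:])):
        # distance 0 is the shot closest to the outcome and gets the full score of 1.0.
        scores[shot] += decay ** distance
    return scores

lethality = defaultdict(float)
accumulate_lethality(["b2", "f1", "f3"], lethality)  # made-up rally
accumulate_lethality(["f3"], lethality)              # made-up one-shot rally
print(dict(lethality))
```

Dividing each accumulated score by the number of times the shot appears would give an average lethality per use, which keeps frequent shots from dominating the table.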

# Can specify same serve direction to get the outcome of the last service point 
# with the same direction and same serve side
def get_last_service_point():
    pass

# Can specify the outcome of the last return point from the same side.
def get_last_return_point():
    pass

def get_last_point():
    pass

def get_last_long_rally():
    pass

# Get the number of unforced errors in the last X points
def get_num_prev_unforced_errors():
    pass

def get_num_prev_winners():
    pass

def get_num_prev_forced_errors():
    pass

# Get the number of first serve faults for the last X service points
# Can specify serve direction 
def get_num_prev_first_serve_faults():
    pass

def get_num_prev_double_faults():
    pass

def get_num_prev_return_errors():
    pass

def calc_accum_rally_length():
    pass

def calc_accum_run_index():
    pass

def calc_aggressiveness_index():
    pass

def calc_variety_index():
    pass

# For storytelling
def identify_comeback():
    pass

def identify_reversal():
    pass

def identify_new_ball():
    pass

def get_num_unforced_errors_current_game():
    pass

def get_num_winners_current_game():
    pass

def get_num_forced_errors_current_game():
    pass

# To calculate confidence level 
def get_num_unforced_errors(current_game, last_points):
    pass

# Get number of winners and number of forced errors 

# Return the following stats for the last X points or last X games.
# num_winners, num_unforced, num_forced, num_aces, num_double_faults, num_unreturnables,
# pct_winners
# pct_points_won, 
# pct_first_serve, pct_first_serve_point_won, pct_second_serve_points_won
# Try to construct the state of mind for a player before any point. 
# The None defaults for the rally length limits are placeholders (assumed to mean "no limit").
def get_stats_by_point(match_id, pt, num_points_back=5, num_games_back=0,
                       max_rally_len=None, min_rally_len=None):
    pass

### Service analysis
First we calculate the statistics, then we correlate them. There are many possible combinations. 
- The goal is to get what is on the player's mind
- First get the direct outcome of the serves: ace, in or miss
- There is also an indirect outcome: point won or lost. We can further divide it into short rally point won or lost. 
- Then we break it down into directions: wide, t, body
- Then we correlate the outcome with the directions: 
- Get direct serve outcomes: 1st serve pct, 2nd serve pct, number of aces. 
- These are the indirect serve outcomes: 1st serve point won pct, 2nd serve point won pct, 1st serve short rally won pct, 2nd serve short rally won pct
- What are the statistics we can get for serves: direct outcome (in, miss, ace), indirect outcome (point won, point lost, short rally), serve directions, serve sides, variety, surface, effectiveness/quality/confidence/consistency of serves (in pct, aces/short rally won), critical moments, fatigue. You can pick any combination and try to correlate them. For example, correlate serve confidence with the next serve direction. 
- Calculate the variety index of first and second serves. 
- Anxiety 
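
The direct and indirect serve outcomes listed above can be sketched with a `groupby`. The column names `Sv1_direction` and `1stIn` come from the point data; `server_won` is a hypothetical column (1 if the server won the point), and the toy values are made up.

```python
import pandas as pd

# Toy service points for one server; values are made up for illustration.
serve_points = pd.DataFrame({
    "Sv1_direction": ["wide", "t", "wide", "body", "t", "wide"],
    "1stIn":         [1, 0, 1, 1, 1, 0],
    "server_won":    [1, 0, 0, 1, 1, 1],  # hypothetical column
})

# Direct outcome: share of first serves that landed in, per direction.
first_serve_in_pct = serve_points.groupby("Sv1_direction")["1stIn"].mean()

# Indirect outcome: share of points won when the first serve was in, per direction.
first_in = serve_points[serve_points["1stIn"] == 1]
first_serve_won_pct = first_in.groupby("Sv1_direction")["server_won"].mean()

print(first_serve_in_pct)
print(first_serve_won_pct)
```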

Statistical analysis
Frequency analysis:

Correlation analysis: 
- Correlate direct serve outcome with serve directions: wide pct, t pct, or body pct
- Correlate indirect serve outcome with serve directions: wide short rally won pct, t short rally won pct, body short rally won pct
- Correlate direct or indirect outcome with critical moments: short rally won pct at critical moments
- Correlate serve directions with critical moments: serve directions at setup points, set point or deuce points, etc. 
- Correlate serve outcome with serve sides: 
- Correlate serve direction with serve sides: 
- Correlate variety with serve sides, and critical moments. 
- Correlate outcome of previous points with serve directions. 
- We can pick any two statistics and try to correlate them. 
- We can calculate the serve quality/confidence 
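
Correlating two categorical statistics, for example serve direction with serve side, usually starts from a contingency table. The sketch below uses `pd.crosstab` on made-up data; the direction labels are assumed to be `wide`, `t`, and `body`.

```python
import pandas as pd

# Made-up first serves: four from the ad side, four from the deuce side.
toy = pd.DataFrame({
    "serve_side":    ["ad", "ad", "ad", "ad", "deuce", "deuce", "deuce", "deuce"],
    "Sv1_direction": ["wide", "wide", "t", "body", "t", "t", "wide", "t"],
})

# Rows are serve sides and columns are directions.
direction_counts = pd.crosstab(toy["serve_side"], toy["Sv1_direction"])

# normalize="index" turns each row of counts into shares that sum to 1.
direction_pct = pd.crosstab(toy["serve_side"], toy["Sv1_direction"], normalize="index")

print(direction_counts)
print(direction_pct)
```

The count table is the kind of input an independence test expects, while the percentage table is easier to read when comparing sides.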

### Rally analysis

First we calculate the statistics, then we correlate them. 
What statistics can we get out of the data: 
- outcome: win (winner, opponent's forced error, opponent's unforced error) or loss (my unforced error, my forced error, opponent's winner)
- techniques: fh, bh, etc.
- Technical patterns: combinations of techniques
- directions:
- rally count:
- variety of techniques in a rally
- variety of directions in a rally 
- Aggressiveness in a rally: how do they set up points (correlate aggressiveness with techniques or technical patterns) 
- accumulated historical data: fatigue (shot count, run index), confidence 
- critical moments 
- anxiety 
- Style: style is also frequency - something they do repeatedly. 

Frequency analysis:
- Patterns: what happens more often at what moment. 
- 

Correlation analysis 
Correlation analysis is also frequency analysis. It just checks what happens more frequently at certain moments. 
Pick any two statistics and try to correlate them. 
- Correlate unforced errors or winners with techniques (fh, bh)
- Correlate errors with critical moments
- Correlate variety with anxiety
- Correlate aggressiveness with anxiety
- Correlate outcome with aggressiveness
- correlate outcome with variety
- correlate outcome with rally count
- correlate outcome with fatigue

# How do they set up points: 

# Get percentage of winner, unforced errors, and forced errors in last X points. Use this to calculate rally confidence. 
# long rally confidence. 

# Get first serve percentage and first serve win pct in the last X points. Use this to calculate serve confidence. 
# Get serve wide confidence, T confidence, etc. 
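
A "last X points" percentage can be sketched as a rolling mean that excludes the current point. This is only an illustration on a made-up series of `1stIn` values; on the real data the calculation would need to be done per match and per server, and `rolling_pct_before` is a hypothetical name.

```python
import pandas as pd

# Made-up sequence of first serves for one server (1 = in, 0 = fault).
first_in = pd.Series([1, 0, 1, 1, 0, 1], name="1stIn")

def rolling_pct_before(series, window):
    """Mean of the previous `window` values, excluding the current row."""
    # shift(1) drops the current point, so each value reflects only earlier points.
    return series.shift(1).rolling(window, min_periods=1).mean()

serve_confidence = rolling_pct_before(first_in, window=3)
print(serve_confidence.tolist())
```

The first value is NaN because nothing has happened before the first point.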

# Find style, find favorite pattern, can you group them? 

## Commonly used functions end here. 

Data cleaning and pre-processing start below.

## 3. Load raw match data and perform data cleaning and pre-processing

If the raw match data is not changed and you have processed match data, skip this step. 


In [37]:
# You may need to manually correct the CSV file. On line 955, one cell has an opening "
# without a closing ". Delete the stray " mark by hand. 
# The file "charting-m-matches.xls" has already been cleaned this way. 

# Use .xls rather than .csv or .xlsx, because the latter two usually cause errors.
df_matches = pd.read_excel("charting-m-matches.xls")

In [38]:
df_matches = clean_match_data(df_matches)

In [39]:
# Do this before new data is added. The new match data may not have handedness info.

df_matches, df_players_handedness = clean_handedness_match_data(df_matches)

stanislas_wawrinka
r    20
l     1
Name: Pl 2 hand, dtype: int64
rafael_nadal
l    87
r     1
Name: Pl 1 hand, dtype: int64
rafael_nadal
l    160
r      1
Name: Pl 2 hand, dtype: int64
kyle_edmund
r    2
l    1
Name: Pl 2 hand, dtype: int64
nick_kyrgios
r    8
l    1
Name: Pl 1 hand, dtype: int64
thomas_muster
l    11
r     1
Name: Pl 2 hand, dtype: int64


In [40]:
df_players_handedness.to_csv("player_handedness.csv", index=False)

## The match data still needs to be synchronized with the point data. So the match data is not ready yet. 

## 4. Load raw point data and perform data cleaning and pre-processing

If the raw point data is not changed and you have processed point data, skip this step and load the processed data directly. 

Start here when you download a new charting-m-points.csv file from GitHub.

Otherwise, skip this part and start from the cell that loads the processed data. 

The cell below may take a long time to run. 

In [41]:
# This is the "CSV UTF-8" file.

# Use "charting-m-points_1.csv" if not testing
%time df_points = pd.read_csv("charting-m-points_1.csv", low_memory=False)

# 3 min
%time df_points = clean_point_data(df_points) 

Wall time: 2.03 s
Wall time: 2min 34s


In [42]:
df_points.to_csv("point_data_cleaned.csv", index=False)

In [43]:
# 1 sec
print("Synchronize match and point data ...")
%time df_matches = sync_match_and_point_data(df_points, df_matches)

Synchronize match and point data ...
Checking if points data and match data are consistent ... 
There are 2350 match ids in the match data set.
There are 2573 match ids in the point data set.
The number of match ids do not match. Trying to fix the problem ...
The following match ids are in the match data set but not the point data set. Fix it.

Match data and point data are the same at 2573
Problem solved
Wall time: 329 ms
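
The mismatch reported above can be checked with set differences. The following is a minimal sketch on made-up ids; it is not the notebook's `sync_match_and_point_data`, only an illustration of how ids missing from either table can be listed.

```python
import pandas as pd

# Made-up tables: one row per match, and several rows (points) per match.
toy_matches = pd.DataFrame({"match_id": ["m1", "m2", "m3"]})
toy_points = pd.DataFrame({"match_id": ["m2", "m2", "m3", "m4", "m4"]})

match_ids = set(toy_matches["match_id"])
point_ids = set(toy_points["match_id"])

missing_from_points = sorted(match_ids - point_ids)   # matches with no charted points
missing_from_matches = sorted(point_ids - match_ids)  # points whose match row is absent

print(missing_from_points, missing_from_matches)
```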


In [44]:
# After merging the match_ids, set match format properly

# 6 sec
%time df_matches = set_match_format(df_matches, "match_format.csv")

Wall time: 7.53 s


In [45]:
# Save match data 
df_matches.to_csv("match_data_cleaned.csv", index=False)

In [46]:
# Continue to process point data

# 47 sec
print("Identify serve side ...")
%time df_points = identify_serve_side(df_points)

Identify serve side ...
Wall time: 46.6 s


In [47]:
# 4 sec
print("Identify serve direction and outcome ...")
%time df_points = identify_serve_direction_outcome(df_points)

Identify serve direction and outcome ...
Wall time: 2.31 s


In [48]:
# 6 min
print("Identify server, returner, and point winner ...")
%time df_points = identify_server_returner_point_winner(df_points)

Identify server, returner, and point winner ...
Wall time: 5min 4s


In [49]:
# 1 min
print("Separate server and returner shots ...")
%time df_points = separate_server_returner_shots(df_points)

Separate server and returner shots ...
Pt                                                         63
Rally                                                      b1
Name: 158509, dtype: object
Pt                                                         25
Rally                                                    s;27
Name: 158899, dtype: object
Pt                                                         44
Rally                                                   s2n#3
Name: 161399, dtype: object
Pt                                                         10
Rally                                                     b83
Name: 169011, dtype: object
Pt                                                        131
Rally                                                     f39
Name: 170302, dtype: object
Pt                                                        125
Rally                                                     b38
Name: 175373, dtype: object
Pt                                               

In [50]:
# 6 min
print("Identify critical points ...")
%time df_points = identify_critical_points(df_points, df_matches)

Identify critical points ...
Wall time: 5min 35s


In [51]:
# 12 min 
print("calculate score gaps ...")
%time df_points = calculate_score_gaps(df_points, df_matches, "point_gap_table.csv")

calculate score gaps ...
Wall time: 11min 41s


In [52]:
# 1 min
print("assign point scores to players ...")
%time df_points = assign_point_scores_to_players(df_points)

assign point scores to players ...
Wall time: 50.4 s


In [53]:
# 1 min
print("calculate simple point scores ...")
%time df_points = calculate_simple_point_score(df_points)

calculate simple point scores ...
Wall time: 59.8 s


In [54]:
# 2 s
%time df_points = calculate_set_game_index(df_points)

Wall time: 1.22 s


In [55]:
# 30 s
%time df_points = calculate_shot_count(df_points)

Wall time: 29.7 s


In [56]:
# 40 s
%time df_points = calculate_shot_variety_count(df_points)

Wall time: 39.6 s


In [57]:
# 1.5 min
%time df_points = calculate_shot_direction_change_count(df_points)

Wall time: 1min 9s


In [58]:
# 45 s
%time df_points = calculate_net_shot_count(df_points, including_middle_court = True)

Wall time: 43.9 s


In [59]:
# 1.5 min
%time df_points = calculate_run_index(df_points)

Wall time: 1min 32s


In [60]:
# 25 s
# Save the results so we don't need to do this every time.
%time df_points.to_csv("point_data_cleaned_processed.csv", index=False)

Wall time: 22 s


## 5. Load the cleaned and processed match and point data

Start here if you already have processed data. It's much faster.

In [146]:
import pandas as pd

# Load cleaned match data
df_matches = pd.read_csv("match_data_cleaned.csv")

# The type of the column "Date" will be set to int64. Need to set it to str.
df_matches = df_matches.astype({"Date": "str"})

In [147]:
# Read the processed data 
%time df_points = pd.read_csv("point_data_cleaned_processed.csv", low_memory=False)

Wall time: 10.6 s


In [148]:
players = get_players(df_matches)

len(players)

550

In [149]:
player_match_count = get_player_match_count(df_matches)

player_match_count

roger_federer        447
novak_djokovic       275
rafael_nadal         267
andy_murray          153
stefan_edberg        130
                    ... 
giovanni_lapentti      1
andrew_harris          1
dario_acosta           1
sanam_singh            1
jakob_hlasek           1
Length: 550, dtype: int64

In [44]:
match_count_by_tourney = get_match_count_by_tournament(df_matches)

match_count_by_tourney

australian_open         210
wimbledon               190
us_open                 175
roland_garros           126
indian_wells_masters    121
                       ... 
zhuhai_ch                 1
maui_ch                   1
burnie_ch                 1
pereira_ch                1
panama_city_ch            1
Name: Tournament, Length: 241, dtype: int64

In [44]:
player_match_count_by_tournament = get_player_match_count_by_tournament(df_matches)

player_match_count_by_tournament

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,Player,Tournament,Year,Count
3201,roger_federer,australian_open,2018,7
3406,roger_federer,us_open,2015,7
3006,rafael_nadal,roland_garros,2014,7
3300,roger_federer,indian_wells_masters,2018,6
2624,novak_djokovic,olympics,2012,6
...,...,...,...,...
1458,james_blake,cincinnati_masters,2007,1
1460,james_blake,masters_cup,2006,1
1461,james_blake,olympics,2008,1
1462,james_blake,queens_club,2006,1


In [45]:
h2h_count = get_player_head_to_head_count(df_matches)

h2h_count

Unnamed: 0,Player 1,Player 2,Count
2259,novak_djokovic,rafael_nadal,48
2501,rafael_nadal,novak_djokovic,48
2751,roger_federer,novak_djokovic,47
2263,novak_djokovic,roger_federer,47
2760,roger_federer,rafael_nadal,35
...,...,...,...
1163,hubert_hurkacz,novak_djokovic,1
1164,hubert_hurkacz,roger_federer,1
1166,hubert_hurkacz,taylor_harry_fritz,1
1167,hugo_dellien,aljaz_bedene,1


In [46]:
tournament_count = get_tournament_count(df_matches)

tournament_count

Unnamed: 0,Tournament,Year,Count
993,wimbledon,2019,39
401,indian_wells_masters,2019,32
522,miami_masters,2019,25
66,australian_open,2019,22
67,australian_open,2020,22
...,...,...,...
432,long_island,1990,1
434,los_angeles,1999,1
435,los_angeles,2001,1
436,los_angeles,2008,1


In [47]:
# Find the matches between two players
player1_name = "djokovic"
player2_name = "murray"

selected_matches = get_matches(df_matches, player1=player1_name, player2=player2_name)
print(len(selected_matches))
selected_matches.info

30


<bound method DataFrame.info of                                                match_id        Player 1  \
858        20170107-M-Doha-F-Novak_Djokovic-Andy_Murray  novak_djokovic   
864   20161120-M-Tour_Finals-F-Andy_Murray-Novak_Djo...     andy_murray   
910   20160515-M-Rome_Masters-F-Andy_Murray-Novak_Dj...     andy_murray   
991   20150405-M-Miami_Masters-F-Novak_Djokovic-Andy...  novak_djokovic   
1001  20150321-M-Indian_Wells_Masters-SF-Novak_Djoko...  novak_djokovic   
1033  20150201-M-Australian_Open-F-Novak_Djokovic-An...  novak_djokovic   
1056   20141004-M-Beijing-SF-Novak_Djokovic-Andy_Murray  novak_djokovic   
1064   20140903-M-US_Open-QF-Novak_Djokovic-Andy_Murray  novak_djokovic   
1136  20140326-M-Miami_Masters-QF-Novak_Djokovic-And...  novak_djokovic   
1221  20130707-M-Wimbledon-F-Novak_Djokovic-Andy_Murray  novak_djokovic   
1249  20130127-M-Australian_Open-F-Novak_Djokovic-An...  novak_djokovic   
1269  20121107-M-Tour_Finals-RR-Novak_Djokovic-Andy_...  novak_djoko

In [136]:
# Find the head-to-head match count for two players
player1_name = "nadal"
player2_name = "federer"

len(get_matches(df_matches, player1=player1_name, player2=player2_name))

35

## For testing new point data processing functions

### Because it takes too long to process the entire point data table, I will select one match and test the new point data processing functions. The goal is to apply these functions to the entire point data table. 

If they work well, I will copy these functions to the cells above to apply them to the entire point data table. 

# Testing code here <a name='testing' />. Go to <a href=#coding>the method I'm writing. </a> 

In [287]:
# Get matches for testing
my_players = ["nadal", "federer", "djokovic", "wawrinka", "dimitrov"]
match_frames = []

while my_players:
    player_1 = my_players.pop(0)
    
    for player_2 in my_players:
        matches = get_matches(df_matches, player1=player_1, player2=player_2)
        match_frames.append(matches)

# DataFrame.append was removed in pandas 2.0; concatenate the collected frames instead.
my_matches = pd.concat(match_frames)

# Get all the points for that match
my_points = get_points_in_matches(df_points, my_matches)

In [288]:
my_points.loc[my_points["match_id"].str.contains("dimitrov", case=False), "match_id"].unique()

array(['20121008-M-Shanghai_Masters-R32-Novak_Djokovic-Grigor_Dimitrov',
       '20130509-M-Madrid_Masters-R16-Grigor_Dimitrov-Stanislas_Wawrinka',
       '20131025-M-Basel-QF-Grigor_Dimitrov-Roger_Federer',
       '20140122-M-Australian_Open-QF-Grigor_Dimitrov-Rafael_Nadal',
       '20140517-M-Rome_Masters-SF-Grigor_Dimitrov-Rafael_Nadal',
       '20140614-M-Queens_Club-SF-Stanislas_Wawrinka-Grigor_Dimitrov',
       '20140704-M-Wimbledon-SF-Grigor_Dimitrov-Novak_Djokovic',
       '20150507-M-Madrid_Masters-R16-Grigor_Dimitrov-Stanislas_Wawrinka',
       '20151029-M-Basel-R16-Rafael_Nadal-Grigor_Dimitrov',
       '20160108-M-Brisbane-QF-Roger_Federer-Grigor_Dimitrov',
       '20170127-M-Australian_Open-SF-Rafael_Nadal-Grigor_Dimitrov',
       '20170710-M-Wimbledon-R16-Grigor_Dimitrov-Roger_Federer',
       '20180421-M-Monte_Carlo_Masters-SF-Rafael_Nadal-Grigor_Dimitrov',
       '20180621-M-Queens_Club-R16-Novak_Djokovic-Grigor_Dimitrov',
       '20190813-M-Cincinnati_Masters-R64-Grigor

In [321]:
ser = df_points["Rally"].str.extract(r"^(.)").iloc[:, 0].unique()

ser

array(['f', 's', 'r', 'b', nan, 'y', 'm', ';', 'l', 'u', 'q', 'h', 'v',
       't', 'o', '*', 'i', 'C'], dtype=object)

In [351]:
# 20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic
# "20100815-M-Canada_Masters-SF-Roger_Federer-Novak_Djokovic"
# "20080706-M-Wimbledon-F-Roger_Federer-Rafael_Nadal"
# 20040328-M-Miami_Masters-R32-Roger_Federer-Rafael_Nadal
# 20180713-M-Wimbledon-SF-Rafael_Nadal-Novak_Djokovic
# 20160128-M-Australian_Open-SF-Novak_Djokovic-Roger_Federer
# 20151122-M-Tour_Finals-F-Roger_Federer-Novak_Djokovic
# 20150913-M-US_Open-F-Roger_Federer-Novak_Djokovic
# 20150712-M-Wimbledon-F-Roger_Federer-Novak_Djokovic
# 20080606-M-Roland_Garros-SF-Novak_Djokovic-Rafael_Nadal
# 20070610-M-Roland_Garros-F-Roger_Federer-Rafael_Nadal
# 20170319-M-Indian_Wells_Masters-F-Roger_Federer-Stanislas_Wawrinka
# 20150911-M-US_Open-SF-Stanislas_Wawrinka-Roger_Federer
# 20160911-M-US_Open-F-Novak_Djokovic-Stanislas_Wawrinka
# 20140126-M-Australian_Open-F-Stanislas_Wawrinka-Rafael_Nadal
# 20170710-M-Wimbledon-R16-Grigor_Dimitrov-Roger_Federer
match_id = "20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic"

points_in_match, list_sets, list_games = get_points_in_one_match(my_points, match_id)

# service_games = get_service_games(list_games, "fed")
# return_games = get_return_games(list_games, "fed")

In [352]:
print(points_in_match["player1_shots_variety_count"].sum() - points_in_match["player2_shots_variety_count"].sum())

127


In [83]:
print(points_in_match["player1_shot_direction_change_count"].sum() - points_in_match["player2_shot_direction_change_count"].sum())

-17


In [84]:
print(points_in_match["player1_net_shot_count"].sum() - points_in_match["player2_net_shot_count"].sum())

28


In [1252]:
print(points_in_match["player1_run_index"].sum())
print(points_in_match["player2_run_index"].sum())
print((points_in_match["player1_run_index"].sum() + points_in_match["player2_run_index"].sum())*3.7)
print(points_in_match["player1_run_index"].sum() - points_in_match["player2_run_index"].sum())
print((points_in_match["player1_run_index"].sum() - points_in_match["player2_run_index"].sum())/
      ((points_in_match["player1_run_index"].sum() + points_in_match["player2_run_index"].sum())/2))

print((points_in_match["server_run_index"].mean() + points_in_match["returner_run_index"].mean())*3.7)
print((points_in_match["server_run_index"].mean() - points_in_match["returner_run_index"].mean())/points_in_match["server_run_index"].mean())

688.8971765137868
653.7483388058865
4967.788406682792
35.14883770790027
0.05235758404858129
19.481523163461926
0.08578694946932196


In [353]:
get_score(my_points, match_id)

Unnamed: 0,roger_federer_set,novak_djokovic_set,roger_federer_game,novak_djokovic_game
0,0,1,6,7
1,1,1,6,1
2,1,2,6,7
3,2,2,6,4
4,2,3,12,13


In [397]:
# To-do: Count the frequencies at critical moments and compare them with non-critical moments
# Compare service game points and return game points. Are they significantly different? 
# Against leftie and against right handed.
# On different surfaces
# Age difference 
# Compare different players. 

shot_pattern_freq = get_shot_pattern_freq_table(points_in_match, flip_shot_code_for_leftie = True)

In [398]:
shot_pattern_freq.to_csv("shot_pattern_frequency.csv", index=False)

In [48]:
stats = get_stats(points_in_match)

stats.to_csv("stats.csv", index=False)

novak_djokovic
break_point
0
6
0
2
game_point
15
0
10
0
set_point
0
1
0
0
match_point
0
0
0
0
setup_point
11
14
9
4
deuce_point
9
7
5
1
rafael_nadal
break_point
2
16
1
5
game_point
18
0
13
0
set_point
4
2
3
0
match_point
2
2
1
0
setup_point
17
14
12
6
deuce_point
7
9
6
4


### To-do:
- Identify point outcome (done)
- Separate the shot sequence for each player (done)
- Calculate fatigue index for each point and the cumulative fatigue index for each player and each point
- Calculate run index for each point and the cumulative run index for each player and each point
- Write a function to get the outcome of the last X number of points with rallies. Maybe differentiate between short and long rallies. This can be used to calculate rally confidence. 
- Write a function to get the outcome of the last point with the same serve direction. 
- Write a function to get the outcome of the last X number of serve faults. This can be used to calculate serve confidence. 
- Write a function to get the outcome of the last X number of returns. This can be used to calculate return confidence. 
- Write a function to calculate the fear, hope, and uncertainty scores for each player and each point. 
- Write a function to identify crucial points. How to define crucial points? Two points from game (1), one-point from game (2), set point (add weight) (3), match point (add weight) (4). Breakpoint should have a higher weight.  
- Write a function to calculate the residue emotions after each game and each set (relief, excitement, frustration, anger)
- Calculate the gap between winning, losing, and the gap between the two players
- Write a function to evaluate the aggressiveness of a rally for a player
- Detect patterns in a player's rally. Can you classify players into "style" groups?
- Write a function to convert from traditional tennis score to point score
- Add "server first serve streak", "server second serve streak", "server/returner point streak", 
- How to evaluate a rally? Find statistical parameters for a rally:
    - Number of different shot types per point (how to measure variety?)
    - Distribution of different shot types 
    - Length of rally
    - Aggressiveness
    - Run index
- Long rally winning percentage?
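
For the to-do item on converting a traditional tennis score to a point score, here is a minimal sketch. It assumes scores are strings such as "40-15" or "40-AD" and covers regular games only (tiebreak scores are not handled); the names `POINT_VALUES`, `score_to_points`, and `point_gap` are hypothetical.

```python
# Points won within a regular game; "AD" is treated as one point beyond 40.
POINT_VALUES = {"0": 0, "15": 1, "30": 2, "40": 3, "AD": 4}

def score_to_points(score):
    """Convert a score such as '40-15' into a pair of point counts."""
    first, second = score.split("-")
    return POINT_VALUES[first], POINT_VALUES[second]

def point_gap(score):
    """Signed gap in points: positive when the first-listed player leads."""
    first, second = score_to_points(score)
    return first - second

print(score_to_points("40-15"), point_gap("40-AD"))
```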

## 6. Analyzing serve patterns

In [1]:
from scipy.stats import chi2_contingency
from scipy.stats import chi2

# Chi-square independence test
def chi_square(data, print_table = False):

    print("Chi-square Independence Test")
    table = data.copy()

#     table = table.set_index("server")

    test_stats, p, dof, expected = chi2_contingency(table)
#     print("chi-squared test results:")
#     print("chi2 test stats = %f" % test_stats)
#     print("p = %f" % p)
#     print('dof = %d ' % dof)

     # interpret test-statistic
    prob = 0.95
    critical = chi2.ppf(prob, dof)
    print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, test_stats))
#     if abs(test_stats) >= critical:
#         print('Dependent (reject H0)')
#     else:
#         print('Independent (fail to reject H0)')

     # interpret p-value
#     alpha = 1.0 - prob
#     print('significance=%.3f, p=%.3f' % (alpha, p))
#     if p <= alpha:
#         print('Dependent (reject H0)')
#     else:
#         print('Independent (fail to reject H0)')

    if print_table:
        print(table)
        
    return p
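
As a usage illustration, the sketch below runs `chi2_contingency` on a made-up 2x3 table (serve side by serve direction) and checks the expected counts. The rule of thumb that expected counts below 5 make the chi-square approximation unreliable is a common guideline, and it is one reason to fall back on an exact test for small samples.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: ad / deuce side. Columns: wide / body / T. Counts are made up.
observed = np.array([[30, 10, 20],
                     [20, 10, 30]])

stat, p_value, dof, expected = chi2_contingency(observed)

# Flag tables where any expected count is below 5.
small_expected = bool((expected < 5).any())

print(stat, p_value, dof, small_expected)
```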

In [None]:
# May need the commented code for windows
# from __main__ import *
# import os
# os.environ['PYTHONHOME'] = 'C:/Program Files/Python'
# os.environ['PYTHONPATH'] = 'C:/Program Files/Python/lib/site-packages'
# os.environ['R_HOME'] = 'C:\\Program Files\\R\\R-4.0.2'
# os.environ['R_USER'] = 'C:/Program Files/Python/Lib/site-packages/rpy2'

# I can't find a good implementation of the Freeman-Halton test in Python,
# so I have to use R via the rpy2 package. 
# rpy2 works well on Mac, but doesn't seem to work well on Windows 10. 
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()

# Use the "stats" package in R. 
stats = importr('stats')

# Freeman-Halton test is an extension of the Fisher's Exact Test. 
# It can handle 2x3 tables, which is needed for this project. 
# It's suitable for small sample size.
# data is numpy ndarray
def freeman_halton_test(data):
#     print("Freeman-Halton Test")
#     m = np.array([[4,4],[4,5],[10,6]])
    res = stats.fisher_test(data)
    # print('p-value: {}'.format(res[0][0]))
    p_value = res[0][0]
#     print("p = " + str(p_value))
    
#     if p_value <= 0.05:
#         print("Dependent (reject H0)")
#     else:
#         print("Independent (fail to reject H0)")
        
    return p_value

In [None]:
from scipy.stats import chisquare
def serve_direction_test_of_even_distribution(point_data, match_data, player, opponent=None):
    surfaces = ["hard", "clay", "grass"]
    serve_sides = ["ad", "deuce"]
    
    serve_sequences = ["Sv1_direction", "Sv2_direction"]
        
    contingency_table = pd.DataFrame()
    
    for serve_sequence in serve_sequences:
        print(serve_sequence)
        for serve_side in serve_sides:
            print(serve_side)
            for surface in surfaces:
                print(surface)
                matches_by_surface = select_matches(data=match_data, player1=player, player2=opponent, surface=surface)
                selected_points = select_players_points(point_data, matches_by_surface)
                print("finished selecting points")

                if not selected_points.empty:
                    my_data = selected_points.loc[(selected_points["server_name"].str.contains(player))]
                    # Without Series, you cannot use value_counts()
                    serve_frequency = my_data.loc[my_data["serve_side"] == serve_side, serve_sequence].value_counts()
                    serve_frequency.drop(labels="unknown", errors="ignore", inplace=True)
                    # Replace nan with 0
#                     serve_frequency.fillna(value=0, inplace=True)
                    # Serves with no recorded direction are NaN; value_counts() already excludes them by default. 
                    
                    serve_frequency_list = serve_frequency.tolist()
                    print(serve_frequency)
                    chisq, p = chisquare(serve_frequency_list)
                    print("chi-square: " + str(chisq))
                    print("p: " + str(p))
                    if p < 0.05:
                        print("Reject null hypothesis, not evenly distributed.\n")
                    else:
                        print("Cannot reject null hypothesis, possibly evenly distributed.\n")
                else:
                    print("No matches found")


In [None]:
# Some players may raise errors because of NaN values in the "Surface" column
serve_direction_test_of_even_distribution(point_data=df_points, match_data=df_matches, player="nadal", opponent=None)
# nadal_matches = select_matches(data=df_matches, player1="nadal")
# nadal_matches["Surface"].value_counts(dropna=False)
# possible even distribution:
# goffin, shapovalov, tsitsipas, wawrinka, pouille
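
One thing to watch in the test above: `value_counts()` leaves out any direction that was never served, so the chi-square test may silently run on two categories instead of three. The sketch below shows the same goodness-of-fit calculation in plain Python with all three directions always present. The helper names are assumptions for this example, and the closed-form p-value only holds for exactly three categories (2 degrees of freedom).

```python
from collections import Counter
from math import exp

DIRECTIONS = ("wide", "body", "t")


def direction_counts(directions):
    """Count serves per direction, keeping zero counts and ignoring other labels."""
    counts = Counter(d for d in directions if d in DIRECTIONS)
    return [counts[d] for d in DIRECTIONS]


def chi_square_even_distribution(counts):
    """Return (statistic, p-value) for H0: the three directions are equally likely."""
    if len(counts) != 3:
        raise ValueError("this sketch only supports exactly three categories")
    total = sum(counts)
    if total == 0:
        raise ValueError("no serves to test")
    expected = total / 3
    statistic = sum((observed - expected) ** 2 / expected for observed in counts)
    # With 3 categories there are 2 degrees of freedom, and the chi-square
    # survival function for 2 degrees of freedom is exactly exp(-x / 2).
    return statistic, exp(-statistic / 2)


# Hypothetical serve log: 30 wide, 10 body, 20 T, 3 with unknown direction.
serves = ["wide"] * 30 + ["body"] * 10 + ["t"] * 20 + ["unknown"] * 3
counts = direction_counts(serves)
statistic, p = chi_square_even_distribution(counts)
print(counts, statistic, p)
```

For these hypothetical counts the statistic is 10 and the p-value is exp(-5), so the even-distribution hypothesis would be rejected at the 0.05 level.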

## To-do

Within-subject reliability/consistency test:
- Internal consistency: whether the serve pattern is consistent within a match (very short time -- hours)
- External consistency: whether the serve pattern is consistent over time in a tournament (short time -- days)
- External consistency: Check if a player's serve pattern is consistent against the same player on the same surface or at the same tournament, year after year (long time)
- Check if the serve direction patterns vary from match to match on the same surface and same serve side (use MANOVA)

Within-subject differences:
- Compare serve patterns on ad and deuce side in the same match
- Compare serve patterns against different opponents in the same tournament
- Compare serve patterns against the same player on different surfaces/tournaments

Between-subject differences:
- Compare two players' serve patterns when they play each other
- Compare the serve patterns of different players against the same player in the same tournament

## Check if there is a difference in serve patterns between ad and deuce sides

In [None]:
# Input: player, match_data
# Output: a table with the following columns: player, match_id, surface, opponent, serve_sequence, p_value

def compare_players_serves_on_different_sides(player, first_serve=True, surface=None, opponent=None):
    selected_matches = select_matches(data=df_matches, player1=player, player2=opponent, surface=surface)
    selected_points = select_players_points(df_points, selected_matches)
    
    if first_serve == True:
        serve_sequence = "Sv1_direction"
    else:
        serve_sequence = "Sv2_direction"
        
    if selected_points.empty != True:
        my_data = selected_points.loc[(selected_points["server_name"].str.contains(player))]
        contingency_table = pd.DataFrame()
            
        for serve_side in my_data["serve_side"].unique():
            # Must use .loc[], as not using .loc[] does not result in a Series
            #Without Series, you cannot use value_counts()
            serve_frequency = my_data.loc[(my_data["serve_side"] == serve_side), serve_sequence].value_counts()
            serve_frequency.drop(labels="unknown", errors="ignore", inplace=True)
            serve_frequency["serve_side"] = serve_side
            contingency_table = contingency_table.append(serve_frequency, ignore_index = True)
        contingency_table = contingency_table.dropna(how="all")
        contingency_table = contingency_table.set_index("serve_side")
#         contingency_table.fillna(value=0, inplace=True)
        contingency_table = contingency_table[["wide", "body", "t"]]
        print(contingency_table)
        chi_square(contingency_table, print_table = False)
    else:
        print("No matches found")

In [None]:
compare_players_serves_on_different_sides("nadal", first_serve=True, surface="clay", opponent=None)
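
Note that `DataFrame.append` was removed in pandas 2.0, so the table-building loops in this notebook need an older pandas. As one possible alternative, the sketch below builds the same kind of contingency table with `pd.crosstab`. The column names follow this notebook; the helper name `serve_contingency_table` and the toy data are assumptions for illustration only.

```python
import pandas as pd

DIRECTIONS = ["wide", "body", "t"]


def serve_contingency_table(points, group_col, direction_col):
    """Cross-tabulate serve direction by group (e.g. serve side or surface)."""
    # Keep only rows with a known direction; this drops "unknown" and NaN.
    known = points[points[direction_col].isin(DIRECTIONS)]
    table = pd.crosstab(known[group_col], known[direction_col])
    # Make sure all three directions appear as columns, in a fixed order,
    # even when a direction was never served.
    return table.reindex(columns=DIRECTIONS, fill_value=0)


# Toy data, not taken from the match charting data set.
toy_points = pd.DataFrame({
    "serve_side": ["ad", "ad", "deuce", "deuce", "deuce", "ad"],
    "Sv1_direction": ["wide", "t", "t", "unknown", "wide", "wide"],
})
table = serve_contingency_table(toy_points, "serve_side", "Sv1_direction")
print(table)
```

The resulting table has one row per serve side and can be passed to `chi_square()` or, via `table.to_numpy()`, to `freeman_halton_test()`.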

## Check if there is a difference in serve directions on different surfaces
- Need to control by year and/or by opponent

In [None]:
def compare_players_serves_on_different_surfaces(player, first_serve=True, opponent=None):
    surfaces = ["hard", "clay", "grass"]
    serve_sides = ["ad", "deuce"]
    
    if first_serve == True:
        serve_sequence = "Sv1_direction"
    else:
        serve_sequence = "Sv2_direction"
        
    contingency_table = pd.DataFrame()
    
    for serve_side in serve_sides:
        print(serve_side)
        for surface in surfaces:
            matches_by_surface = select_matches(data=df_matches, player1=player, player2=opponent, surface=surface)
            selected_points = select_players_points(df_points, matches_by_surface)

            if selected_points.empty != True:
                my_data = selected_points.loc[(selected_points["server_name"].str.contains(player))]

                # Without Series, you cannot use value_counts()
                serve_frequency = my_data.loc[my_data["serve_side"] == serve_side, serve_sequence].value_counts()
                serve_frequency.drop(labels="unknown", errors="ignore", inplace=True)
                serve_frequency["surface"] = surface
                contingency_table = contingency_table.append(serve_frequency, ignore_index = True)    
            else:
                print("No matches found")

        contingency_table = contingency_table.dropna(how="all")
        contingency_table = contingency_table.set_index("surface")
        contingency_table.fillna(value=0, inplace=True)
        contingency_table = contingency_table[["wide", "body", "t"]]
        print(contingency_table)
        chi_square(contingency_table, print_table = False)
        contingency_table.drop(contingency_table.index, inplace=True)

In [None]:
compare_players_serves_on_different_surfaces("murray", first_serve=False)

## Check the difference between two players' serve patterns when they play each other

- To-do: control by surface

In [322]:
def compare_player_serves_when_they_play_each_other(point_data, match_data, player1, player2, test_mode="freeman-halton"):
    
    df_points_selected = select_players_points(point_data=point_data, match_data=match_data)
    
    p_value = None
    
    df_two_player_serve_comparison = pd.DataFrame()
    
    if df_points_selected.empty == False:      
        players = [player1, player2]
        serve_sides = ["deuce", "ad"]
        serve_seqs = ["Sv1_direction", "Sv2_direction"]
        
        for serve_side in serve_sides:
#             print(serve_side)
            for serve_seq in serve_seqs:
#                 print(serve_seq)
                contingency_table = pd.DataFrame()

                for player in players:
                    serve_dir_counts = df_points_selected.loc[(df_points_selected["server_name"].str.contains(player)) & 
                                                              (df_points_selected["serve_side"] == serve_side), serve_seq].value_counts()
        #             print(serve_dir_counts)
                    serve_dir_counts.drop(labels = "unknown", inplace=True, errors="ignore")
                    serve_dir_counts["server_name"] = player
                    contingency_table = contingency_table.append(serve_dir_counts, ignore_index=True)

                contingency_table.set_index("server_name", inplace=True)
                contingency_table.fillna(0, inplace=True)
#                 contingency_table = contingency_table[["wide", "body", "t"]]
                
#                 print(contingency_table)
                
                if (test_mode == "freeman-halton"):
                    p_value = freeman_halton_test(contingency_table.to_numpy())
                
                elif test_mode == "chi-square":
                    # The sample size may be too small for chi-square test. 
                    p_value = chi_square(contingency_table)
                    
                serve_comparison = pd.Series({"player_1": player1, 
                                             "player_2": player2, 
                                             "serve_side": serve_side,
                                             "serve_sequence": serve_seq,
                                             "p_value": p_value})
                
                df_two_player_serve_comparison = df_two_player_serve_comparison.append(serve_comparison, 
                                                                                       ignore_index=True)
                
#                 print(p_value)
    else:
        print("No match found")
        
    return df_two_player_serve_comparison

In [337]:
player_hth_count.head(60)

Unnamed: 0,Player 1,Player 2,Count
2500,rafael_nadal,novak_djokovic,48
2258,novak_djokovic,rafael_nadal,48
2750,roger_federer,novak_djokovic,47
2262,novak_djokovic,roger_federer,47
2511,rafael_nadal,roger_federer,35
2759,roger_federer,rafael_nadal,35
2206,novak_djokovic,andy_murray,30
307,andy_murray,novak_djokovic,30
2661,roger_federer,andy_murray,24
429,boris_becker,stefan_edberg,24


In [None]:
# Compare players' serves when they play each other
# player1 = "djokovic"
# player2 = "nadal"
# tournament = None
# surface = "grass"
# date = "2018"

# my_matches = select_matches(df_matches, player1=player1, player2=player2, 
#                             tournament=tournament, surface=surface, date = date)
# print(str(len(my_matches)) + " matches")
# print()

# if (len(my_matches) > 0):
#     df_two_player_serve_comparison = compare_player_serves_when_they_play_each_other(df_points, my_matches, player1, player2, 
#                                                               test_mode="freeman-halton")
#     print(df_two_player_serve_comparison)
# else:
#     print("No match found")
    
df_player_hth_serve_comparison = pd.DataFrame()

for index, row in player_hth_count[player_hth_count["Count"] >= 8].iterrows():
    player1 = row["Player 1"]
    player2 = row["Player 2"]
    player_matches = select_matches(data=df_matches, player1=player1, 
                                    player2=player2)
    print(player1 + " vs " + player2)
    print(len(player_matches))
    
    for match_id in player_matches["match_id"]:
        print(match_id)
        selected_match = df_matches.loc[df_matches["match_id"] == match_id]
        df_two_player_serve_comparison = compare_player_serves_when_they_play_each_other(df_points, 
                                                            selected_match,
                                                            player1, player2, 
                                                            test_mode="freeman-halton")
        if (df_two_player_serve_comparison.empty == False):
            # Tag every row with the match it came from (row_match was undefined here)
            df_two_player_serve_comparison["match_id"] = match_id
            df_player_hth_serve_comparison = \
            df_player_hth_serve_comparison.append(df_two_player_serve_comparison, 
                                                  ignore_index=True)

df_player_hth_serve_comparison.head(20)

In [339]:
df_player_hth_serve_comparison.to_excel("player_hth_serve_comparison.xls")

In [340]:
player_hth_serve_difference = pd.DataFrame()

for index, row in player_hth_count.loc[player_hth_count["Count"] >= 8].iterrows():
    player1 = row["Player 1"]
    player2 = row["Player 2"]
    
    serve_sides = ["deuce", "ad"]
    serve_sequences = ["Sv1_direction", "Sv2_direction"]
    
    for serve_side in serve_sides:
        for serve_seq in serve_sequences:
            selected_stats = df_player_hth_serve_comparison.loc[(df_player_hth_serve_comparison["serve_side"] == serve_side) & 
                                                   (df_player_hth_serve_comparison["serve_sequence"] == serve_seq) & 
                                                   (df_player_hth_serve_comparison["player_1"] == player1) & 
                                                   (df_player_hth_serve_comparison["player_2"] == player2)]
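            # Skip combinations with no test results to avoid dividing by zero
            if len(selected_stats) == 0:
                continue
            pct_serve_different = len(selected_stats.loc[selected_stats["p_value"] < 0.05]) / len(selected_stats)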
            
            entry = {"player_1": player1,
                    "player_2": player2,
                    "serve_side": serve_side,
                    "serve_sequence": serve_seq,
                    "% of matches with different serve patterns": pct_serve_different}
            
            player_hth_serve_difference = player_hth_serve_difference.append(entry, ignore_index=True)

In [342]:
player_hth_serve_difference.to_excel("player_hth_serve_difference.xls")

In [None]:
# Compare players' serves when they play the same player on the same surface
def compare_player_serves_when_they_play_same_opponent(point_data, match_data, player1, player2, opponent, surface, test_mode="freeman-halton"):
    
    my_matches1 = select_matches(data=match_data, player1=player1, player2=opponent, surface=surface)
    if my_matches1.empty:
        print("No match found between " + player1 + " and " + opponent)
    else:
        print("%d matches between %s and %s" % (len(my_matches1), player1, opponent))
        
    df_points_selected1 = select_players_points(point_data=point_data, match_data=my_matches1)
    
    my_matches2 = select_matches(data=match_data, player1=player2, player2=opponent, surface=surface)
    if my_matches2.empty:
        print("No match found between " + player2 + " and " + opponent)
    else:
        print("%d matches between %s and %s" % (len(my_matches2), player2, opponent))
        
    df_points_selected2 = select_players_points(point_data=point_data, match_data=my_matches2)
    
    if (df_points_selected1.empty == False) & (df_points_selected2.empty == False):      
        players = [player1, player2]
        serve_sides = ["deuce", "ad"]
        serve_seqs = ["Sv1_direction", "Sv2_direction"]
        surfaces = ["hard", "clay", "grass"]
        
        print(surface)
        for serve_side in serve_sides:
            print(serve_side)
            for serve_seq in serve_seqs:
                print(serve_seq)
                contingency_table = pd.DataFrame()
                
                serve_dir_counts1 = df_points_selected1.loc[(df_points_selected1["server_name"].str.contains(player1)) & 
                                                          (df_points_selected1["serve_side"] == serve_side), serve_seq].value_counts()
                serve_dir_counts1["server_name"] = player1
                contingency_table = contingency_table.append(serve_dir_counts1, ignore_index=True)

                serve_dir_counts2 = df_points_selected2.loc[(df_points_selected2["server_name"].str.contains(player2)) & 
                                                          (df_points_selected2["serve_side"] == serve_side), serve_seq].value_counts()
                serve_dir_counts2["server_name"] = player2
                
                contingency_table = contingency_table.append(serve_dir_counts2, ignore_index=True)
            
                contingency_table.set_index("server_name", inplace=True)
                contingency_table.fillna(0, inplace=True)
                
                contingency_table = contingency_table[["wide", "body", "t"]]
                print(contingency_table)
                
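                if test_mode == "freeman-halton":
                    # freeman_halton_test() only returns the p-value, so print it here
                    p_value = freeman_halton_test(contingency_table.to_numpy())
                    print("p = " + str(p_value))
                elif test_mode == "chi-square":
                    chi_square(contingency_table)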
                print()
    else:
        print("No match found")

In [None]:
player1 = "federer"
player2 = "nadal"
opponent = "djokovic"
surface = "grass"

compare_player_serves_when_they_play_same_opponent(point_data = df_points, match_data = df_matches, player1 = player1, player2 = player2, 
                                                   opponent = opponent, surface = surface, test_mode="chi-square")

In [None]:
# Find out how many matches two players played against each other
# Use the match for checking internal consistency

player1 = "djokovic"
player2 = "murray"
tournament = "australian"
date = "2019"

my_matches = select_matches(data=df_matches, player1=player1, player2=player2, tournament=tournament, date=None)

print(my_matches)

## Check the internal consistency of a player's serve patterns

In [263]:
# Check whether a player's serve pattern is consistent within a match. 
# Retrieve all the points the player served and randomly split them in half. 
# Count the frequency of serve directions in each half and test whether the two halves differ significantly, 
# controlling for serve side and serve sequence.
# The random split can be repeated multiple times to check the stability of the results. 

# Use the Freeman-Halton test (the extension of Fisher's exact test to 2x3 tables) to compare the two halves, 
# because the sample size is often very small and there are 0s in the data. 
# Do not use the chi-square test here: it is only appropriate when at least 80% of the cells have an 
# expected frequency of 5 or greater and no cell has an expected frequency smaller than 1.0. 
# Online Freeman-Halton tests can be found at http://vassarstats.net/fisher2x3.html and 
# https://www.danielsoper.com/statcalc/calculator.aspx?id=58. Use them for verification.
# 
def check_internal_consistency_split_half(point_data, match_id, player1):
    df_points_selected = point_data.loc[point_data["match_id"] == match_id]
    
    return_value = pd.DataFrame()
    
    min_point_count = 20
    # Some matches, such as "20150208-M-Montpellier-F-Richard_Gasquet-Jerzy_Janowicz"
    # have only a small number of points. Ignore them. 
    if len(df_points_selected) < min_point_count:
        print("Error: %s has fewer than %d points" % (match_id, min_point_count))
        print()
        return return_value
    
    for player in [player1]:
#         print(player)
        
        # Get all the points served by the current player
        df_points_player_served = df_points_selected.loc[df_points_selected["server_name"].str.contains(player), 
                                                          ["serve_side", "server_name", "Sv1_direction", "Sv2_direction"]]

        # Split the points in half, randomly
        df_points_player_served_half1 = df_points_player_served.sample(frac=0.5)
        df_points_player_served_half2 = df_points_player_served.drop(df_points_player_served_half1.index)
    #     print(df_points_player1_served_half1)
    #     print(df_points_player1_served_half2)

        # Divide each half by serve side and serve sequence
        for serve_side in ["deuce", "ad"]:
#             print(serve_side)
            for serve_seq in ["Sv1_direction", "Sv2_direction"]:

                contingency_table = pd.DataFrame(columns=["group", "wide", "body", "t"])

#                 print(serve_seq)
                
                # Count the frequency of serve directions for the first half
                serve_dir_counts1 = df_points_player_served_half1.loc[df_points_player_served_half1["serve_side"] \
                                                                       == serve_side, serve_seq].value_counts(dropna=False)
                
                # Remove unknown serve directions
                serve_dir_counts1.drop(labels = "unknown", inplace=True, errors="ignore")
                serve_dir_counts1["group"] = 1
                
                # Add to the contingency table
                contingency_table = contingency_table.append(serve_dir_counts1)
    #             print(serve_dir_counts1)

                # Count the frequency of serve directions for the second half
                serve_dir_counts2 = df_points_player_served_half2.loc[df_points_player_served_half2["serve_side"] \
                                                                       == serve_side, serve_seq].value_counts(dropna=False)
                serve_dir_counts2.drop(labels = "unknown", inplace=True, errors="ignore")
                serve_dir_counts2["group"] = 2
                contingency_table = contingency_table.append(serve_dir_counts2)
    #             print(serve_dir_counts2)

                # Compare the distribution of the two halves with the Freeman-Halton test
                contingency_table.set_index("group", inplace=True)
                # If a column is all na, drop it. Otherwise, Chi-square test will report error. 
#                 contingency_table.dropna(axis=1, how="all", inplace=True)
                contingency_table.fillna(0, inplace=True)
#                 print(contingency_table)
                
                p_value = freeman_halton_test(contingency_table.to_numpy())
                entry = pd.Series({"player": player, "serve_side": serve_side, "serve_sequence": serve_seq, "p-value": p_value})
                return_value = return_value.append(entry, ignore_index=True)
                
    return return_value
                
                # The sample sizes are often too small for chi-square test
                # chi_square(contingency_table)
#                 print()
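
The rule of thumb quoted in the comments above (at least 80% of cells with an expected frequency of 5 or more, and no cell below 1.0) can be checked directly before choosing between the chi-square and Freeman-Halton tests. The sketch below is one way to do that; the function names are assumptions for this example. The first two tables are the Federer second-serve and first-serve deuce-side tables printed later in this notebook, and the third is hypothetical.

```python
def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        raise ValueError("the table is empty")
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_is_appropriate(table):
    """Apply the 80% / minimum-1.0 rule of thumb to the expected counts."""
    expected = [value for row in expected_frequencies(table) for value in row]
    share_at_least_5 = sum(value >= 5 for value in expected) / len(expected)
    return share_at_least_5 >= 0.8 and min(expected) >= 1.0


second_serves = [[6, 4, 5], [12, 10, 6]]     # only 4 of 6 expected counts reach 5
first_serves = [[30, 0, 22], [44, 2, 40]]    # "body" expected count is below 1
large_sample = [[50, 40, 60], [55, 45, 50]]  # hypothetical, all expected counts >= 5

for name, table in [("second serves", second_serves),
                    ("first serves", first_serves),
                    ("large sample", large_sample)]:
    print(name, chi_square_is_appropriate(table))
```

Both per-match tables fail the rule, which is consistent with the decision to use the Freeman-Halton test for split-half and match-to-match comparisons.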

In [265]:
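# Check internal consistency for each player with at least 30 charted matches. 
# Note: player_match_consistency is re-created for every player, so after the loop 
# (and in the saved file) it only holds the results for the last player in the list.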

for player in player_match_count[player_match_count >= 30].index:
    print(player)

    player_matches = select_matches(data=df_matches, player1=player)
    player_match_consistency = pd.DataFrame()
    for match_id in player_matches["match_id"]:
#         print(match_id)
        player_consistency_table = check_internal_consistency_split_half(df_points, match_id, player)
        if player_consistency_table.empty != True:
            player_consistency_table["match_id"] = match_id
            player_match_consistency = player_match_consistency.append(player_consistency_table, ignore_index=True)

player_match_consistency.to_excel("player_match_serve_consistency.xls")

roger_federer
novak_djokovic
rafael_nadal
andy_murray
stefan_edberg
pete_sampras
andre_agassi
lleyton_hewitt
juan_martin_del_potro
stanislas_wawrinka
ivan_lendl
boris_becker
gael_monfils
david_ferrer
alexander_zverev
dominic_thiem
andy_roddick
tomas_berdych
milos_raonic
karim_mohamed_maamoun
kei_nishikori
grigor_dimitrov
jo_wilfried_tsonga
daniil_medvedev
richard_gasquet
Error: 20150208-M-Montpellier-F-Richard_Gasquet-Jerzy_Janowicz has fewer than 20 points

john_isner
stefanos_tsitsipas
nick_kyrgios
robin_haase
philipp_kohlschreiber
denis_shapovalov
borna_coric
roberto_bautista_agut


In [267]:
pct_H0 = 100 * len(player_match_consistency.loc[player_match_consistency["p-value"] > 0.05])/len(player_match_consistency)

print("%f%% of matches have internal serve pattern consistency" % pct_H0)

99.166667% of matches have internal serve pattern consistency
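
The figure above comes from a single random split per match, so it can change from run to run. As the comments in the split-half function suggest, the split can be repeated. The sketch below shows a reproducible version in plain Python; the helper names and the serve list are hypothetical. In pandas, passing `random_state` to `DataFrame.sample` serves the same purpose.

```python
import random
from collections import Counter

DIRECTIONS = ("wide", "body", "t")


def split_half(serves, seed):
    """Randomly split a list of serves into two halves, reproducibly."""
    rng = random.Random(seed)
    shuffled = list(serves)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def direction_table(half1, half2):
    """Build a 2x3 table of direction counts, one row per half."""
    rows = []
    for half in (half1, half2):
        counts = Counter(half)
        rows.append([counts[d] for d in DIRECTIONS])
    return rows


# Hypothetical first serves from the deuce side in one match.
serves = ["wide"] * 12 + ["body"] * 3 + ["t"] * 9

# Repeat the split with different seeds; each table can then be tested separately.
tables = [direction_table(*split_half(serves, seed)) for seed in range(5)]
for table in tables:
    print(table)
```

Running the test on each table and reporting the share of splits with p > 0.05 gives a more stable picture than a single split.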


## Check the external consistency of a player's serve pattern using parallel forms reliability test

In [121]:
# Find how many matches a player played in a tournament. 
# The more matches in a tournament the better for checking external consistency using parallel forms reliability test

# Need at least four matches per tournament to test external consistency
# Djokovic 2016 AO 2019(3), 2016(4), 2012(4), wimbledon 2019(6), 2018(3), 2015 (3), french open 2015(3)
# Federer 2020 AO (4), Wimbledon 2019(6), 2018(5), 2017(3)  
# Nadal AO 2014(6), 2017(3), 2019(2), Wimbledon 2019(5), 2018(2), 2008(3), Roland Garros 2014(7)
# Wawrinka US Open 2016(3), AO 2014 (3)
# Murray AO 2012 (3), 2020 (3), US Open 2012 (3), 
# Sampras: US Open 2001(3), AO 1995(3)
# del Potro: French 2019 (3), 2018(3), us open 2018(3), 2009(3)
# Agassi: us open 1995(3), 2005(3)

player = "federer"
tournament = "wimbledon"
date = None

my_matches = select_matches(data=df_matches, player1=player, tournament=tournament, date=date)

print(my_matches)

                                               match_id            Player 1  \
244   20190714-M-Wimbledon-F-Roger_Federer-Novak_Djo...       roger_federer   
245   20190712-M-Wimbledon-SF-Roger_Federer-Rafael_N...       roger_federer   
247   20190710-M-Wimbledon-QF-Roger_Federer-Kei_Nish...       roger_federer   
250   20190708-M-Wimbledon-R16-Roger_Federer-Matteo_...       roger_federer   
256   20190706-M-Wimbledon-R32-Roger_Federer-Lucas_P...       roger_federer   
263   20190704-M-Wimbledon-R64-Roger_Federer-Jay_Clarke       roger_federer   
663   20180711-M-Wimbledon-QF-Kevin_Anderson-Roger_F...      kevin_anderson   
664   20180709-M-Wimbledon-R16-Adrian_Mannarino-Roge...    adrian_mannarino   
665   20180706-M-Wimbledon-R32-Roger_Federer-Jan_Len...       roger_federer   
666   20180704-M-Wimbledon-R64-Lukas_Lacko-Roger_Fed...         lukas_lacko   
667   20180702-M-Wimbledon-R128-Dusan_Lajovic-Roger_...       dusan_lajovic   
790    20170716-M-Wimbledon-F-Marin_Cilic-Roger_Fede
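
To find tournaments with enough matches for a parallel forms comparison, the match IDs can be grouped by year and tournament and then paired up. The sketch below assumes that every match_id follows the pattern date-gender-tournament-round-player1-player2, with hyphens only between the fields, as in the one complete ID shown above. The helper names and the Player_A to Player_D IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def parse_match_id(match_id):
    """Split a match_id into its fields (assumes exactly six hyphen-separated parts)."""
    date, gender, tournament, match_round, player_1, player_2 = match_id.split("-", 5)
    return {"year": date[:4], "tournament": tournament, "round": match_round,
            "player_1": player_1, "player_2": player_2}


def match_pairs_by_tournament(match_ids):
    """Return every pair of matches played in the same tournament and year."""
    groups = defaultdict(list)
    for match_id in match_ids:
        info = parse_match_id(match_id)
        groups[(info["year"], info["tournament"])].append(match_id)
    # Only tournaments with at least two matches can be compared.
    return {key: list(combinations(ids, 2))
            for key, ids in groups.items() if len(ids) >= 2}


match_ids = [
    "20190704-M-Wimbledon-R64-Roger_Federer-Jay_Clarke",  # from the output above
    "20190706-M-Wimbledon-R32-Player_A-Player_B",         # hypothetical
    "20190708-M-Wimbledon-R16-Player_A-Player_C",         # hypothetical
    "20180702-M-Wimbledon-R128-Player_A-Player_D",        # hypothetical
]
pairs = match_pairs_by_tournament(match_ids)
for key, tournament_pairs in pairs.items():
    print(key, len(tournament_pairs), "pairs")
```

Each pair can then be looked up in the match data and passed as the two match sets to the external consistency check.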

In [47]:
def check_external_consistency_parallel_forms_reliability(point_data, match_data1, match_data2, player):
    selected_point_data1 = select_players_points(point_data=point_data, match_data=match_data1)
    selected_point_data1 = selected_point_data1.loc[selected_point_data1["server_name"].str.contains(player)]
    
    selected_point_data2 = select_players_points(point_data=point_data, match_data=match_data2)
    selected_point_data2 = selected_point_data2.loc[selected_point_data2["server_name"].str.contains(player)]
    
    print(player)
    
    for serve_side in ["deuce", "ad"]:
        print(serve_side)
        for serve_seq in ["Sv1_direction", "Sv2_direction"]:
            print(serve_seq)
            
            contingency_table = pd.DataFrame(columns=["match", "wide", "body", "t"])
            
            serve_dir_counts1 = selected_point_data1.loc[selected_point_data1["serve_side"] == serve_side, serve_seq].value_counts(dropna=False)
            serve_dir_counts1["match"] = 1
            serve_dir_counts1.drop(labels = "unknown", inplace=True, errors="ignore")
            
            contingency_table = contingency_table.append(serve_dir_counts1)
            
            serve_dir_counts2 = selected_point_data2.loc[selected_point_data2["serve_side"] == serve_side, serve_seq].value_counts(dropna=False)
            serve_dir_counts2["match"] = 2
            serve_dir_counts2.drop(labels = "unknown", inplace=True, errors="ignore")
            
            contingency_table = contingency_table.append(serve_dir_counts2)
            
            contingency_table.set_index("match", inplace=True)
            # If a column is all na, drop it. Otherwise, Chi-square test will report error.
#             contingency_table.dropna(axis=1, how="all", inplace=True)
            contingency_table.fillna(0, inplace=True)
            print(contingency_table)
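            # freeman_halton_test() only returns the p-value, so print it here
            p_value = freeman_halton_test(contingency_table.to_numpy())
            print("p = " + str(p_value))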
            
            # The sample sizes are often too small for chi-square test
            # chi_square(contingency_table)
            print()
    

In [52]:
# Year-by-year comparison found more inconsistent serve patterns, but many are still consistent. 
# Within-tournament consistency is generally very high.

player = "federer"
opponent1 = "djokovic"
opponent2 = "millman"
tournament = "australian"
date1 = "2020"
date2 = "2020"

my_matches1 = select_matches(data=df_matches, player1=player, player2 = opponent1, tournament=tournament, date=date1)
print(str(len(my_matches1)) + " matches")

my_matches2 = select_matches(data=df_matches, player1=player, player2 = opponent2, tournament=tournament, date=date2)
print(str(len(my_matches2)) + " matches")

# print(match_id1 + " " + match_id2)

check_external_consistency_parallel_forms_reliability(point_data = df_points, 
                                                      match_data1 = my_matches1, 
                                                      match_data2 = my_matches2, 
                                                      player = player)


1 matches
1 matches
federer
deuce
Sv1_direction
       wide  body     t
match                  
1.0    30.0   0.0  22.0
2.0    44.0   2.0  40.0
Freeman-Halton Test
p = 0.549067376698846
Independent (fail to reject H0)

Sv2_direction
       wide  body  t
match               
1         6     4  5
2        12    10  6
Freeman-Halton Test
p = 0.7196135395075792
Independent (fail to reject H0)

ad
Sv1_direction
       wide  body     t
match                  
1.0    26.0   0.0  26.0
2.0    42.0   3.0  30.0
Freeman-Halton Test
p = 0.2440413219043995
Independent (fail to reject H0)

Sv2_direction
       wide  body  t
match               
1         7     8  6
2        12     9  7
Freeman-Halton Test
p = 0.8156386818235442
Independent (fail to reject H0)



## To-do

1. Serve direction analysis
The characteristics of serves for one player
    - Comparison by serve sides (use chi-square)
    - Comparison of first and second serves (use chi-square, ind=serves, dep=serve_direction)
    - Comparison by surfaces (control by opponent) (use chi-square)
    - Comparison by opponents (different serve patterns for different opponents?) (use chi-square)
    - Comparison by age (early years vs later years) (chi-square, ind=different ages, dep=serve_directions)
    - Comparison by tournament (grand slam vs other tournaments) (chi-square)
    - Show the change of serve directions during a match (draw a line chart show the counts of each serve direction over time)
    - Serve patterns at critical moments vs non-critical moments (1 point from winning, 1 point from losing, etc.)
    - Serve pattern based on tiredness, group serves by the count of points played (early, middle, or later) and compare between groups.
    - Serve patterns when leading vs when trailing
    - Serve patterns when first serve rate is high (high confidence) vs when the rate is low (low confidence)
    - Serve patterns when winning several points in a row (high confidence), serve patterns when losing several points in a row (low confidence, frustration). 
    - Serve after unforced errors (one or more). Serve after the opponent hits a winner. Serve after hitting winners. 
    - Serve right after long points. 
    - Serve right after aces (is there a difference? Is he likely to serve a different direction?)
    - Serve right after double faults (is there a difference?)
    - Serve right after repeated first serve faults
    - Does the success or failure of certain serve directions (e.g., aces, short points won, double faults) early in a match influence later serve decisions (anchoring effect)? Are there short-term or long-term effects?
    - Can we find any subtle bias in serve selection?
    
Compare the serves of two players
    - When they play each other ...
    - When they play the same opponent ...
    - When they are tired
    - When they are leading or losing
    - At critical moments ...
    - At high or low confidence level 

Compare the serves for three or more players?
    - Is it useful?
    
2. Serve error analysis
    - Frequency of errors at critical moments (double fault, first serve fault)
    - Types of errors correlated with tension, confidence, etc. 
