# Wimbledon 2024 Match Statistics - Data Science Portfolio Project

In this project I ask several basic questions about tennis match play and answer them with match statistics from the 2024 Wimbledon tournament, with the goal of gaining a deeper understanding of high-percentage tennis.

Dataset: http://www.tennis-data.co.uk/wimbledon.php

Let's get started!
***

Import libraries and dataset:

In [27]:
import pandas as pd
wimbledon_matches = pd.read_csv('wimbledon_dataset.csv')
wimbledon_matches.head()

Unnamed: 0,ATP,Location,Tournament,Date,Series,Court,Surface,Round,Best of,Winner,...,Lsets,Comment,B365W,B365L,PSW,PSL,MaxW,MaxL,AvgW,AvgL
0,39,London,Wimbledon,01/07/2024,Grand Slam,Outdoor,Grass,1st Round,5,Fognini F.,...,0.0,Completed,2.0,1.8,2.2,1.75,2.2,1.82,2.08,1.75
1,39,London,Wimbledon,01/07/2024,Grand Slam,Outdoor,Grass,1st Round,5,Shang J.,...,0.0,Completed,1.8,2.0,1.9,2.01,1.91,2.1,1.84,1.97
2,39,London,Wimbledon,01/07/2024,Grand Slam,Outdoor,Grass,1st Round,5,Ruud C.,...,0.0,Completed,1.4,3.0,1.45,3.0,1.45,3.2,1.4,2.97
3,39,London,Wimbledon,01/07/2024,Grand Slam,Outdoor,Grass,1st Round,5,Coric B.,...,0.0,Completed,1.62,2.3,1.61,2.49,1.66,2.49,1.59,2.37
4,39,London,Wimbledon,01/07/2024,Grand Slam,Outdoor,Grass,1st Round,5,Struff J.L.,...,1.0,Completed,1.25,4.0,1.23,4.8,1.26,4.8,1.22,4.3


Inspecting the dataset: 

In [65]:
wimbledon_matches.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 127 entries, 0 to 126
Data columns (total 36 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ATP         127 non-null    int64  
 1   Location    127 non-null    object 
 2   Tournament  127 non-null    object 
 3   Date        127 non-null    object 
 4   Series      127 non-null    object 
 5   Court       127 non-null    object 
 6   Surface     127 non-null    object 
 7   Round       127 non-null    object 
 8   Best of     127 non-null    int64  
 9   Winner      127 non-null    object 
 10  Loser       127 non-null    object 
 11  WRank       127 non-null    int64  
 12  LRank       127 non-null    int64  
 13  WPts        127 non-null    int64  
 14  LPts        127 non-null    int64  
 15  W1          125 non-null    float64
 16  L1          125 non-null    float64
 17  W2          124 non-null    float64
 18  L2          124 non-null    float64
 19  W3          124 non-null    f

This dataset is fairly clean and uniform. A couple of things to note:
> - Lots of columns that may not be essential
> - Some columns are missing data
> - Data types are correct
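One caveat on the data types: the `.info()` output lists `Date` as `object`, meaning the dates are stored as plain strings. Here is a minimal sketch of how that column could be parsed and how missing values could be counted per column, assuming the day/month/year format seen in the preview (e.g. `01/07/2024`). The small `sample` dataframe is made up for illustration and is not the real dataset.

```python
import pandas as pd

# Hypothetical three-row sample mimicking the dataset's layout
sample = pd.DataFrame({
    "Date": ["01/07/2024", "03/07/2024", "14/07/2024"],
    "W1": [6.0, None, 6.0],
    "L1": [1.0, None, 2.0],
})

# Parse day/month/year strings into real datetimes
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# Count missing values in each column
missing_per_column = sample.isna().sum()
print(sample.dtypes)
print(missing_per_column)
```

Parsing the dates is optional for the questions asked here, but it would matter for anything that sorts or groups matches by day.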

Now I am going to drop the non-essential columns:
> - Specifications about the tournament that are all the same
> - Betting columns (just so we work with only the raw match data)
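The "all the same" columns could also be found programmatically rather than by eye. This is a rough sketch on a made-up `sample` dataframe (the player names and values are placeholders, not rows from the dataset): any column with a single unique value is constant and can be dropped.

```python
import pandas as pd

# Hypothetical sample: two constant columns, two that vary
sample = pd.DataFrame({
    "Location": ["London", "London", "London"],
    "Surface": ["Grass", "Grass", "Grass"],
    "Winner": ["Player A", "Player B", "Player C"],
    "WRank": [94, 91, 8],
})

# A column with one unique value carries no information within a single tournament
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_cols)

# Drop the constant columns and keep the rest
trimmed = sample.drop(columns=constant_cols)
print(list(trimmed.columns))
```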

Before we go further into the dataset, I want to preface the analysis with the specifications of the Wimbledon tournament, which come from the columns being dropped. We are looking at the 2024 Wimbledon Men's tournament (all 127 matches, not only the final). Wimbledon is an ATP tennis tournament held in London and is part of the Grand Slam series. It is played outdoors on grass courts, and matches are best of 5 sets. Please keep these specifications in mind if you are referencing other tennis data that might have different ones.

In [29]:
wimbledon_matches = wimbledon_matches.drop(labels=['ATP', 'Location', 'Tournament', 'Series', 'Court', 'Surface', 'Best of', 'B365W', 'B365L', 'PSW', 'PSL', 'MaxW', 'MaxL', 'AvgW', 'AvgL'], axis=1)

In [31]:
wimbledon_matches.head()

Unnamed: 0,Date,Round,Winner,Loser,WRank,LRank,WPts,LPts,W1,L1,...,L2,W3,L3,W4,L4,W5,L5,Wsets,Lsets,Comment
0,01/07/2024,1st Round,Fognini F.,Van Assche L.,94,104,635,585,6.0,1.0,...,3.0,7.0,5.0,,,,,3.0,0.0,Completed
1,01/07/2024,1st Round,Shang J.,Garin C.,91,106,658,570,7.0,5.0,...,4.0,6.0,4.0,,,,,3.0,0.0,Completed
2,01/07/2024,1st Round,Ruud C.,Bolt A.,8,234,4025,252,7.0,6.0,...,4.0,6.0,4.0,,,,,3.0,0.0,Completed
3,01/07/2024,1st Round,Coric B.,Meligeni Alves F.,89,145,664,403,6.0,3.0,...,6.0,6.0,3.0,,,,,3.0,0.0,Completed
4,01/07/2024,1st Round,Struff J.L.,Marozsan F.,41,43,1135,1047,6.0,4.0,...,7.0,6.0,2.0,6.0,3.0,,,3.0,1.0,Completed


***
Now that we have narrowed down the columns, we can start asking our questions.

The first question I am going to ask is: what is the average score in each set?

First, let's isolate the set scores in a separate dataframe:

In [71]:
set_scores = wimbledon_matches.loc[:, ['W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5']]
set_scores.head()

Unnamed: 0,W1,L1,W2,L2,W3,L3,W4,L4,W5,L5
0,6.0,1.0,6.0,3.0,7.0,5.0,,,,
1,7.0,5.0,6.0,4.0,6.0,4.0,,,,
2,7.0,6.0,6.0,4.0,6.0,4.0,,,,
3,6.0,3.0,7.0,6.0,6.0,3.0,,,,
4,6.0,4.0,6.0,7.0,6.0,2.0,6.0,3.0,,


Now that we have the isolated columns, let's call the `.describe()` method to display the mean value of each column:

In [73]:
set_scores.describe()

Unnamed: 0,W1,L1,W2,L2,W3,L3,W4,L4,W5,L5
count,125.0,125.0,124.0,124.0,124.0,124.0,75.0,75.0,36.0,36.0
mean,5.632,4.504,5.846774,4.379032,5.516129,4.524194,5.96,4.146667,6.166667,3.361111
std,1.310848,1.920282,1.097257,1.689724,1.553825,1.73188,1.032447,1.641467,0.377964,1.514742
min,1.0,1.0,1.0,1.0,1.0,1.0,2.0,0.0,6.0,1.0
25%,5.0,3.0,6.0,3.0,6.0,3.0,6.0,3.0,6.0,2.0
50%,6.0,5.0,6.0,4.0,6.0,4.5,6.0,4.0,6.0,3.0
75%,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,4.0
max,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,6.0


Now I am going to summarize the part of the table we want: the means, rounded to set scores that are actually possible in tennis (for example, 5.6 - 4.5 becomes 6 - 4, since a set cannot end 6 - 5).

Takeaway average scores: 
> - 1st set: 6 - 4
> - 2nd set: 6 - 4
> - 3rd set: 6 - 4
> - 4th set: 6 - 4
> - 5th set: 6 - 3

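This is interesting: on average, a set at this tournament ended at roughly 6 - 4, with the match winner taking the 6 games. Keep in mind that this is an average, not the most frequent score, and that the W columns record the match winner's games, so sets that the match winner lost (such as the 6 - 7 second set in row 4) are included as well.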
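A mean does not tell us which scoreline occurred most often. A quick way to check that would be to count each (winner games, loser games) pair. The sketch below uses a made-up `sample` of first-set scores, not the real data, so the numbers it prints say nothing about Wimbledon.

```python
import pandas as pd

# Hypothetical first-set scores (not the real data)
sample = pd.DataFrame({
    "W1": [6, 7, 6, 6, 7, 6],
    "L1": [4, 6, 3, 4, 5, 4],
})

# Count how often each (winner games, loser games) pair occurs
score_counts = sample.value_counts(["W1", "L1"])

# The most frequent scoreline and the share of sets it accounts for
most_common_score = score_counts.idxmax()
share = score_counts.max() / len(sample)
print(most_common_score, share)
```

Running the same idea on `set_scores` for each set would show whether 6 - 4 is also the single most common score, or only the rounded average.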

If you play tennis or are familiar with the rules, you might notice that something is off. As I said previously, this tournament is best of 5 sets, so a player who wins the first 3 sets does not have to play the last two. An average cannot describe every specific match situation. However, this has me wondering how many matches actually reach the fourth, or even the fifth, set.

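To do this, I am going to write a for loop that iterates through each match and checks whether it went to a fifth set, a fourth set, or neither. Because the checks are chained with `elif`/`else`, each match is counted exactly once. Note that the last branch also catches matches that ended early (retirements and walkovers), so they are counted together with the three-set matches.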

In [75]:
five_sets = 0
four_sets = 0
three_sets = 0

for index, row in set_scores.iterrows():
    # Match went to 5 sets
    if pd.notnull(row['W5']):
        five_sets += 1
    # Match went to 4 sets
    elif pd.notnull(row['W4']):
        four_sets += 1
    # Everything else: 3 sets, or fewer if the match ended early
    else:
        three_sets += 1
        
print("5th set: " + str(five_sets))
print("4th set: " + str(four_sets))
print("3rd set: " + str(three_sets))

5th set: 36
4th set: 39
3rd set: 52
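As an alternative to the loop, the number of sets with a recorded score could be counted per match without iterating. This is only a sketch on a made-up `sample` (one 3-set, one 4-set, and one 5-set match plus a walkover with no scores); it assumes, as in the dataset, that unplayed sets are stored as missing values.

```python
import pandas as pd

# Hypothetical sample: a 3-set, a 4-set, and a 5-set match, plus a walkover
sample = pd.DataFrame({
    "W1": [6.0, 6.0, 7.0, None],
    "W2": [6.0, 6.0, 3.0, None],
    "W3": [7.0, 6.0, 6.0, None],
    "W4": [None, 6.0, 4.0, None],
    "W5": [None, None, 6.0, None],
})

# Number of sets with a recorded score in each match
sets_played = sample[["W1", "W2", "W3", "W4", "W5"]].notna().sum(axis=1)

# How many matches had 0, 3, 4, or 5 recorded sets
print(sets_played.value_counts().sort_index())
```

One difference from the loop: matches with fewer than three recorded sets get their own bucket here instead of being folded into the three-set count.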


Converting these counts to percentages:
> - 41% of matches had only 3 sets
> - 31% of matches had only 4 sets
> - 28% of matches had 5 sets
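For reference, the percentages follow directly from the counts printed by the loop (52, 39, and 36 out of 127 matches). A short sketch of the arithmetic:

```python
# Counts printed by the loop
counts = {"3 sets": 52, "4 sets": 39, "5 sets": 36}
total = sum(counts.values())  # 127 matches

# Share of each outcome, rounded to a whole percent
percentages = {label: round(100 * n / total) for label, n in counts.items()}
print(percentages)  # {'3 sets': 41, '4 sets': 31, '5 sets': 28}
```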

To summarize: the most common outcome was a match with only 3 sets, meaning one player won all 3. The second most common was a 4-set match, where the loser managed to take 1 set from the winner. The least common, though still fairly likely, was a 5-set match, where the loser took 2 sets.

***
This is a generalization across all the matches played in all rounds of the tournament. We can dig deeper and use the `Round` column to see whether these scores change depending on the round.

Wimbledon is organized in the following manner:
> - First round
> - Second round
> - Third round
> - Fourth round
> - Quarterfinals
> - Semifinals
> - The Final

Since there are 7 rounds, I am going to write a function that returns a list containing the name of the round, the average number of sets won by the winner, and the average number of sets won by the loser. The function iterates through the rows and keeps only the ones that match the round passed in. It adds the relevant columns to a few counter variables, computes the averages from them, and returns the result as a list.

In [107]:
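def set_count(round_num):
    """Return [round name, average winner sets, average loser sets] for one round.

    round_num is the round's name as it appears in the Round column,
    e.g. "1st Round". Averages are rounded to whole sets.
    """
    round_w = 0  # total sets won by match winners
    round_l = 0  # total sets won by match losers
    round_c = 0  # number of matches counted

    for index, row in wimbledon_matches.iterrows():
        if row['Round'] == round_num:
            # Skip walkover matches, which have no set counts
            if pd.notnull(row['Wsets']):
                round_w += row['Wsets']
                round_l += row['Lsets']
                round_c += 1

    avg_w = int(round(round_w / round_c, 0))
    avg_l = int(round(round_l / round_c, 0))

    set_count_data = [round_num, avg_w, avg_l]

    return set_count_data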
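For comparison, the same per-round averages could be computed in one step with `groupby`. This is a sketch on a made-up `sample`, not the real data; the row with missing set counts stands in for a walkover. It returns the unrounded means, which keep more detail than whole-set averages.

```python
import pandas as pd

# Hypothetical sample (not the real data); the None row stands in for a walkover
sample = pd.DataFrame({
    "Round": ["1st Round", "1st Round", "1st Round", "The Final"],
    "Wsets": [3.0, 3.0, None, 3.0],
    "Lsets": [0.0, 2.0, None, 0.0],
})

# mean() skips missing values, so walkovers drop out automatically
round_means = sample.groupby("Round", sort=False)[["Wsets", "Lsets"]].mean()
print(round_means)
```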


Now that we have a convenient way to get the averages for each round, I am going to store the data from every round in a master list. Because each call returns a list, collecting the results gives us a 2D array. All that is left is to put the data into a dataframe and align the column names with the order in which we returned our variables.

In [109]:
set_count_data = [set_count("1st Round"), set_count("2nd Round"), set_count("3rd Round"), set_count("4th Round"), set_count("Quarterfinals"), set_count("Semifinals"), set_count("The Final")]
round_avgs = pd.DataFrame(set_count_data, columns=["Round", "Avg_Wsets", "Avg_Lsets"])
round_avgs

Unnamed: 0,Round,Avg_Wsets,Avg_Lsets
0,1st Round,3,1
1,2nd Round,3,1
2,3rd Round,3,1
3,4th Round,3,1
4,Quarterfinals,3,2
5,Semifinals,3,0
6,The Final,3,0


Having the data in a dataframe makes these statistics easier to analyze. The average stays at 3 - 1 through the first four rounds. One possible explanation is that these early stages contain a wide range of player rankings, so the typical gap between opponents stays similar from round to round. In the quarterfinals the gap narrows to 3 - 2, presumably reflecting more competitive matches. The Semifinals and the Final both show 3 - 0. Because those two rounds consist of only three matches in total, the result depends heavily on the specifics of each match and cannot be generalized without comparing it to other datasets with similar data.
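One caveat when reading the table: `set_count` rounds each average to a whole number with Python's built-in `round`, which sends exact halves to the nearest even integer. With only a few matches in a round, that can hide a set. The sketch below illustrates this; the `loser_sets` values are hypothetical and not taken from the dataset.

```python
# Python 3 rounds exact halves to the nearest even integer
print(round(0.5))  # 0
print(round(1.5))  # 2
print(round(2.5))  # 2

# Hypothetical: two matches in which the losers won 1 and 0 sets
loser_sets = [1, 0]
avg = sum(loser_sets) / len(loser_sets)  # 0.5
rounded_avg = int(round(avg, 0))
print(rounded_avg)  # 0
```

Reporting the averages to one decimal place instead would avoid this effect for the small late rounds.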

One last statistic I want to look at: what are the chances of winning a match without finishing it? This can happen in one of two ways: either the opponent retires, or the match is a walkover. Let's isolate these rows and see the percentage.

In [37]:
free_matches = (wimbledon_matches['Comment'] == "Retired") | (wimbledon_matches['Comment'] == "Walkover")
wimbledon_matches[free_matches]

Unnamed: 0,Date,Round,Winner,Loser,WRank,LRank,WPts,LPts,W1,L1,...,L2,W3,L3,W4,L4,W5,L5,Wsets,Lsets,Comment
56,03/07/2024,1st Round,Khachanov K.,Karatsev A.,22,99,1780,615,6.0,3.0,...,7.0,7.0,6.0,2.0,0.0,,,2.0,1.0,Retired
77,04/07/2024,2nd Round,Fils A.,Hurkacz H.,34,7,1250,4235,7.0,6.0,...,4.0,2.0,6.0,6.0,6.0,,,2.0,1.0,Retired
91,04/07/2024,2nd Round,Pouille L.,Kokkinakis T.,212,93,284,639,2.0,6.0,...,5.0,5.0,2.0,,,,,1.0,1.0,Retired
100,06/07/2024,3rd Round,De Minaur A.,Pouille L.,9,212,3830,284,,,...,,,,,,,,,,Walkover
114,07/07/2024,4th Round,Medvedev D.,Dimitrov G.,5,10,6445,3750,5.0,3.0,...,,,,,,,,0.0,0.0,Retired
122,10/07/2024,Quarterfinals,Djokovic N.,De Minaur A.,2,9,8360,3830,,,...,,,,,,,,,,Walkover


As we can see, this happened only 6 times in the whole tournament. That is 6 of 127 matches, so roughly 5% of matches were won under those circumstances.
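The share can also be computed directly from a boolean mask, since the mean of a True/False series is the fraction of True values. This is a sketch on a made-up `comments` series, not the real `Comment` column; the last line reproduces the 6-of-127 figure quoted in the text.

```python
import pandas as pd

# Hypothetical Comment column (not the real data)
comments = pd.Series(["Completed"] * 17 + ["Retired", "Retired", "Walkover"])

# True where the match ended by retirement or walkover
is_free = comments.isin(["Retired", "Walkover"])
print(is_free.sum(), round(100 * is_free.mean(), 1))  # 3 15.0

# The figure quoted in the text: 6 of 127 matches
tournament_share = round(100 * 6 / 127, 1)
print(tournament_share)  # 4.7
```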

***
Here are some of the statistics in recap: 
> - Average score per set was 6 - 4
> - 41% of matches had only 3 sets
> - 31% of matches had only 4 sets
> - 28% of matches had 5 sets
> - Match scores vary by round; the average was 3 - 1
> - About 5% of matches were won by a walkover or retirement



As we said earlier, some of these statistics should be validated by comparing them to data from other tournaments. Matches also vary case by case, and there may be specific reasons behind a result that the raw data cannot explain.

That concludes this dive into the 2024 Wimbledon Men's Tournament. Although some of these findings may seem simple, they are easy to overlook. It is good to be familiar with these numbers and other tennis statistics if you want to improve your game. Thank you for taking the time to look through this project, and of course see you next year for the 2025 Wimbledon Men's finals!