# Motivation
In tennis, tall players are stereotypically considered big servers and bad returners. I want to find out whether there really is a relationship between height and both serving and returning statistics. When you think of the great servers, the names that come to mind include John Isner and Ivo Karlovic, who are tall even for tennis, but also players like Pete Sampras and Roger Federer, who are not especially tall for professional tennis players. This leads me to question whether there is a real relationship between height and serving in professional tennis, or whether it is an old stereotype that persists because of observation rather than statistics. I don't only want to know whether tall players have bigger serves, in the sense that they can serve faster, but whether they are more effective servers: do they win more service points than other players?

There is also the idea in tennis that tall players are not as good at returning. I can think of examples on both sides. John Isner is tall and not a very good returner, but Del Potro was a tall player and a good returner. Once again, I want to see if there is any statistical evidence for these claims. These observations are not in themselves evidence for or against the claim, but they did spark my interest in looking deeper into the idea. If height is related to both serving and returning, I also want to know whether the trade-off is a good or a bad one, by looking at the relationship between height and win rate.

This would be useful to know when evaluating tennis prospects, and coaches would benefit from knowing the effect height has on a player's ability to both serve and return. It is also important to look at the data in sports and not get distracted by a player's flair. Someone may have a big serve that is impressive to watch, but that doesn't mean they win service points more often. I feel that it is very common to make judgements about sport based on appearance rather than data, and that is why I want to take a data-driven look at players' serving and returning abilities versus their height.

# Background
I want to go over some of the basics of tennis first so you can better understand what I am going to be talking about. If you want a more in-depth look at the rules of tennis you can look here: http://protennistips.net/tennis-rules/. If you want to learn more about tennis in general and about playing tennis, you can look at the USTA site here: https://www.usta.com/. Now I will give a brief summary of the things I think you should know to understand what I am talking about. First, please look at this diagram of a court so you will know the terminology: https://tenniscompanion.org/a-diagram-of-tennis-court-dimensions-layout/. A tennis match is either best of 3 or best of 5 sets. A set is first to 6 games, win by 2, but if the set gets to 6 all you play a tiebreak. In each game one player serves and the other returns. A game is first to four points, win by 2. Players alternate serving each game. A serve is the shot that starts a point, where one player standing behind the baseline hits the ball into the opponent's diagonal service box. You get two serves in tennis, so if the 1st serve misses you have one more chance. If the server also misses the second serve, they lose the point. After each point the server switches which service box is being served to. <br>

<u>Terminology</u>: <br>
<u>Double Fault</u>: If the server misses both the 1st and 2nd serve, that is called a double fault <br>
<u>Ace</u>: If the server hits the serve in and it goes untouched by the returner <br>
<u>Break Point</u>: If the returner is one point from winning the game <br>


# Getting the Data
The data I will be looking at is from https://github.com/JeffSackmann/tennis_atp. It has the ATP matches for every year from 1968 until the current year. I will only be looking at 2015 and after, since I want the most recent data: the game evolves over the years, and serving now is not the same as it was earlier, as both technique and racquet technology have changed. I thought 6 years of match data would be enough to look at. I will be excluding the 2022 data as the year is not over yet. I will only be looking at ATP tour-level players and not Challengers and below. The Challenger circuit is like the AAA league in baseball, where the players are professionals but not at the highest level, so I thought it would be more insightful to only look at the highest level players.

In [184]:
import pandas as pd
import numpy as np
import statsmodels.api as stats
from matplotlib import pyplot as plt 
import csv

In [185]:
# Load one CSV of tour-level matches per season
data2015 = pd.read_csv('data/atp_matches_2015.csv')
data2014 = pd.read_csv('data/atp_matches_2014.csv')
data2016 = pd.read_csv('data/atp_matches_2016.csv')
data2017 = pd.read_csv('data/atp_matches_2017.csv')
data2018 = pd.read_csv('data/atp_matches_2018.csv')
data2019 = pd.read_csv('data/atp_matches_2019.csv')
data2020 = pd.read_csv('data/atp_matches_2020.csv')
data2021 = pd.read_csv('data/atp_matches_2021.csv')
# Note: data2014 and data2018 are loaded above but are not part of this concatenation
data = pd.concat([data2015,data2016,data2017,data2019,data2020,data2021])

The tour-level match data that we are looking at includes Davis Cup, which is technically not part of the ATP Tour, so I removed those matches.

In [186]:
data = data[data['tourney_level'] != 'D']
data = data[['w_1stIn','w_1stWon','l_1stIn', 'l_1stWon','winner_name','loser_name','loser_ht','w_ace','w_svpt','winner_ht',
             'w_2ndWon','w_bpFaced','w_bpSaved','l_ace','l_2ndWon','l_bpSaved','l_bpFaced','l_svpt']]

Now that I have loaded the data, I will first take a quick look at it to show what the data I'm dealing with looks like. Then I will look for missing data and try to fill it in, either by finding the information somewhere else or by coming up with a method to estimate it.

In [187]:
data

Unnamed: 0,w_1stIn,w_1stWon,l_1stIn,l_1stWon,winner_name,loser_name,loser_ht,w_ace,w_svpt,winner_ht,w_2ndWon,w_bpFaced,w_bpSaved,l_ace,l_2ndWon,l_bpSaved,l_bpFaced,l_svpt
0,24.0,19.0,31.0,20.0,John Millman,Rhyne Williams,,6.0,44.0,183.0,14.0,1.0,1.0,3.0,5.0,1.0,5.0,50.0
1,59.0,39.0,50.0,26.0,Jarkko Nieminen,Denis Kudla,180.0,4.0,92.0,185.0,17.0,7.0,4.0,6.0,19.0,3.0,8.0,83.0
2,27.0,20.0,37.0,22.0,James Duckworth,Gilles Simon,183.0,4.0,45.0,183.0,11.0,3.0,2.0,2.0,5.0,10.0,15.0,56.0
3,39.0,31.0,38.0,30.0,Jeremy Chardy,Andrey Golubev,185.0,7.0,53.0,188.0,11.0,0.0,0.0,9.0,8.0,1.0,3.0,57.0
4,79.0,55.0,62.0,40.0,Martin Klizan,Jurgen Melzer,183.0,9.0,130.0,190.0,27.0,8.0,6.0,4.0,19.0,4.0,8.0,95.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2593,43.0,33.0,25.0,16.0,Cameron Norrie,Denis Shapovalov,185.0,1.0,54.0,188.0,4.0,1.0,1.0,3.0,7.0,2.0,6.0,45.0
2594,48.0,32.0,33.0,23.0,Andrey Rublev,Diego Schwartzman,170.0,4.0,80.0,188.0,15.0,8.0,7.0,1.0,7.0,5.0,10.0,64.0
2595,47.0,32.0,68.0,44.0,Casper Ruud,Grigor Dimitrov,188.0,3.0,83.0,183.0,19.0,6.0,1.0,4.0,13.0,6.0,12.0,94.0
2596,56.0,41.0,47.0,34.0,Cameron Norrie,Andrey Rublev,188.0,8.0,93.0,188.0,20.0,9.0,8.0,13.0,24.0,2.0,4.0,86.0


First I will look at which columns have missing data so I know what parts of the table I need to look deeper into. I will then find ways to fill in any missing data.

In [188]:
# Print the name of every column that contains at least one missing value
for column in data.columns:
    if data[column].isna().any():
        print(column)

w_1stIn
w_1stWon
l_1stIn
l_1stWon
loser_ht
w_ace
w_svpt
winner_ht
w_2ndWon
w_bpFaced
w_bpSaved
l_ace
l_2ndWon
l_bpSaved
l_bpFaced
l_svpt
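
The loop above only reports which columns have gaps. As a minimal sketch of how the size of each gap could also be checked, the snippet below counts missing entries per column. The `toy` frame and its values are made up for illustration and are not the match data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the match data (values are made up for illustration).
toy = pd.DataFrame({
    'winner_name': ['A', 'B', 'C'],
    'winner_ht': [183.0, np.nan, 188.0],
    'w_ace': [6.0, np.nan, np.nan],
})

# Number of missing entries per column, keeping only columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Running the same two lines on `data` instead of `toy` would show how many rows each column is missing, which helps decide whether to look values up by hand or impute them.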


It seems that all of the numeric columns have some amount of missing data. The first part of the missing data I will address is the heights of the players, since that is one of the main features I am focusing on. Luckily https://www.atptour.com/ gives the heights of all the players that play on tour, so I will write the names of all players with missing heights into a csv file and then manually enter the height for each player. Then I can load that file back into pandas and enter the heights into the data. One player whose height I did not find on atptour was Flavio Cobolli, for whom I had to find a separate bio. I will start by doing this for missing winner heights and then for loser heights, since I assume there will be overlap and this stops names from being written twice. When I did this there were 140 players whose heights needed to be filled in for the winners alone. I have put the csv file PlayerNames.csv in the repository in case you want to repeat what I have done, so you do not have to enter all that data yourself.

In [7]:
# Write the names of winners with a missing height to a csv file for manual lookup
with open('PlayerNames23.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for name in np.unique(data[data['winner_ht'].isna()]['winner_name']):
        writer.writerow([name])

PermissionError: [Errno 13] Permission denied: 'PlayerNames.csv'

While going through the names to add in the heights I found one mistake, where Alejandro Gomez was named as Alejandro Gomez Gb42. I have verified that these refer to the same player and that Alejandro Gomez Gb42 is not another player.

In [189]:
# Name was entered incorrectly on some rows; I verified these refer to the same player
data.loc[data['winner_name'] == 'Alejandro Gomez Gb42','winner_name'] = 'Alejandro Gomez'
data.loc[data['loser_name'] == 'Alejandro Gomez Gb42','loser_name'] = 'Alejandro Gomez'

In [190]:
heights = pd.read_csv('PlayerNames.csv')
for name in np.unique(data[data['winner_ht'].isna()]['winner_name']):
    # Look up the manually entered height once per player
    ht = int(heights.loc[heights['Name'] == name, 'HT'].iloc[0])
    data.loc[data['winner_name'] == name, 'winner_ht'] = ht
    data.loc[data['loser_name'] == name, 'loser_ht'] = ht
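
The same lookup can also be expressed without an explicit loop. The following is only a sketch under assumptions: the `lookup` frame mimics the `Name` and `HT` columns of the csv file, and the player names and heights are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table with the same columns as the height csv (Name, HT).
lookup = pd.DataFrame({'Name': ['Player A', 'Player B'], 'HT': [185, 191]})
height_by_name = lookup.set_index('Name')['HT']

# Toy matches with some heights missing (made-up values).
matches = pd.DataFrame({
    'winner_name': ['Player A', 'Player C'],
    'winner_ht': [np.nan, 180.0],
    'loser_name': ['Player B', 'Player A'],
    'loser_ht': [np.nan, np.nan],
})

for side in ('winner', 'loser'):
    # Map each name to its looked-up height, then fill only the missing cells.
    looked_up = matches[side + '_name'].map(height_by_name)
    matches[side + '_ht'] = matches[side + '_ht'].fillna(looked_up)

print(matches)
```

Heights that were already present are left alone, and names that are absent from the lookup table simply stay missing.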

In [191]:
# Write the names of losers that still have a missing height to a second csv file
with open('PlayerNames21.csv', 'w', newline='') as file2:
    writer = csv.writer(file2)
    for name in np.unique(data[data['loser_ht'].isna()]['loser_name']):
        writer.writerow([name])

For the players whose height I could not find on the ATP Tour site or anywhere else, I filled in the average losing height as my imputation technique, since this is the list of losing players with a missing height and any who had won would have been filled in earlier. The 8 players below were the only ones I could not find anything on. For the rest I used the ATP site again to find their height, or for some a Google search. There were another 123 players whose heights had to be filled in.

In [196]:
# Average known loser height, rounded to a whole number of centimetres
avgHt = float(round(data['loser_ht'].mean()))
# The 8 players whose height could not be found anywhere
unknownHt = ['Adrien Bossel', 'Alexios Halebian', 'Anil Yuksel', 'Brian Shi',
             'Mubarak Shannan Zayid', 'Riccardo Bellotti', 'Yassine Idmbarek', 'Omar Awadhy']
data.loc[data['loser_name'].isin(unknownHt), 'loser_ht'] = avgHt

In [197]:
heights = pd.read_csv('PlayerNames2.csv')
for name in np.unique(data[data['loser_ht'].isna()]['loser_name']):
    # Take the single matching height as a scalar before converting to int
    data.loc[data['loser_name'] == name, 'loser_ht'] = int(heights.loc[heights['Name'] == name, 'HT'].iloc[0])

Now I am going to fill in the missing data for the rest of the columns. These columns consist of 1st serves in and 1st serves won for both winners and losers, aces, 2nd serves won, number of service points, and break points faced and saved. For each player that is missing a value in one of these categories, I will calculate their average value of the statistic in the matches they won or lost respectively, and then fill in the missing value with that average. I chose to round the average since a decimal doesn't make sense as an entry for these statistics; only a whole number does.

In [201]:
columnNames = ['1stIn','1stWon','ace','svpt','bpFaced','bpSaved','2ndWon']
for col in columnNames:
    # Winners: fill each gap with the player's rounded average in matches they won
    for name in np.unique(data[data['w_'+col].isna()]['winner_name']):
        playerdata = data[data['winner_name'] == name]
        playerdata = playerdata[playerdata['w_'+col].notna()]
        avg = np.average(playerdata['w_'+col])
        data.loc[(data['winner_name'] == name) & (data['w_'+col].isna()),'w_'+col] = np.round(avg)

    # Losers: same idea, using the player's average in matches they lost
    for name in np.unique(data[data['l_'+col].isna()]['loser_name']):
        playerdata = data[data['loser_name'] == name]
        playerdata = playerdata[playerdata['l_'+col].notna()]
        avg = np.average(playerdata['l_'+col])
        data.loc[(data['loser_name'] == name) & (data['l_'+col].isna()),'l_'+col] = np.round(avg)
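
The per-player average fill can also be written with a grouped transform. This is a minimal sketch on made-up numbers, not the notebook's data, and it only covers a single column on the winner side.

```python
import numpy as np
import pandas as pd

# Toy data: player C has no observed value at all (values are made up).
toy = pd.DataFrame({
    'winner_name': ['A', 'A', 'A', 'B', 'B', 'C'],
    'w_ace': [4.0, 8.0, np.nan, 10.0, np.nan, np.nan],
})

# Mean of the observed values for each player, aligned to every row of that player.
player_mean = toy.groupby('winner_name')['w_ace'].transform('mean')

# Fill only the gaps, using the rounded per-player mean.
toy['w_ace'] = toy['w_ace'].fillna(player_mean.round())
print(toy)
```

A player with no observed value for a statistic has no average to fall back on, so that entry stays missing and needs to be handled separately.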

In [199]:
# Check again which columns still contain missing values
for column in data.columns:
    if data[column].isna().any():
        print(column)

l_1stIn
l_1stWon
l_ace
l_2ndWon
l_bpSaved
l_bpFaced
l_svpt


As seen above, there is still missing data in all of the columns associated with the loser. In fact, the missing values all belong to the same observation. I chose to remove this observation from the table because all of the information was missing for that match, and for the losing player it was his only appearance in the data. Instead of keeping a match made up of completely estimated data, I thought it best to remove it, since nothing is really lost other than the names.

In [200]:
data = data[data['l_1stIn'].notna()]

In [205]:
data[data['w_svpt']==0]

Unnamed: 0,w_1stIn,w_1stWon,l_1stIn,l_1stWon,winner_name,loser_name,loser_ht,w_ace,w_svpt,winner_ht,w_2ndWon,w_bpFaced,w_bpSaved,l_ace,l_2ndWon,l_bpSaved,l_bpFaced,l_svpt
2612,0.0,0.0,5.0,3.0,Yuichi Sugita,Milos Raonic,196.0,0.0,0.0,173.0,0.0,0.0,0.0,0.0,0.0,2.0,3.0,8.0


Now that I have filled in all the missing data, the table is ready for analysis and calculations. I will use the data in the table to calculate 7 different statistics for all the players: four service statistics and three return statistics. I will calculate the first serve percent, which is the percentage of first serves that land in. Then I will calculate both the first serve and second serve win percent, which is the share of points they win on each serve. A double fault, when the second serve also misses, counts as a lost second serve point. Finally for serve, I will calculate the percentage of service points that are aces. For the return statistics, I will calculate the total number of break chances they had, meaning the times they were one point away from winning a game on return. Next is the break points won percent: of the chances they had, how many did they win? For this statistic I decided that if a player had 0 break point chances then the percent would also be zero, since a player who never gets a break chance isn't being very successful at returning and the percent should reflect that. The last is the total return points won percent.
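
To make these definitions concrete, here is a small worked example for a single match, using the winner's numbers from the first row of the table shown earlier (John Millman beating Rhyne Williams). In the analysis the same ratios are computed from sums over all of a player's matches, so this is only an illustration of the formulas; the variable names are my own.

```python
# Winner's serve numbers from row 0 of the table above
w_1stIn, w_1stWon, w_2ndWon, w_ace, w_svpt = 24, 19, 14, 6, 44
# Loser's serve numbers from the same row, which describe the winner's returning
l_svpt, l_1stWon, l_2ndWon, l_bpFaced, l_bpSaved = 50, 20, 5, 5, 1

# Service statistics
first_serve_pct = w_1stIn / w_svpt                   # 24/44, about 0.545
first_serve_win_pct = w_1stWon / w_1stIn             # 19/24, about 0.792
second_serve_win_pct = w_2ndWon / (w_svpt - w_1stIn) # 14/20 = 0.7, double faults count as losses
ace_rate = w_ace / w_svpt                            # 6/44, about 0.136

# Return statistics
bp_chances = l_bpFaced                                # 5 break chances
bp_win_pct = (l_bpFaced - l_bpSaved) / l_bpFaced      # 4/5 = 0.8
return_win_pct = (l_svpt - l_1stWon - l_2ndWon) / l_svpt  # 25/50 = 0.5

print(first_serve_pct, first_serve_win_pct, second_serve_win_pct, ace_rate)
print(bp_chances, bp_win_pct, return_win_pct)
```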

In [257]:
players = np.unique(np.append(data['winner_name'],data['loser_name']))
winrates = []
firstPercents = []
heights = []
firstWRates = []
secWinRate = []
aceRate = []
bpChances = []
bpWinPer = []
winReturnPercents = []
for player in players:
    windata = data[data['winner_name']==player]
    losedata = data[data['loser_name']==player]

    # Each player should have exactly one height; .item() raises if that is not the case
    heights.append(int(np.unique(np.append(windata['winner_ht'], losedata['loser_ht'])).item()))

    # Share of matches won
    winrate = len(windata)/(len(windata)+len(losedata))
    winrates.append(winrate)

    # 1st serve percent: first serves in divided by service points.
    # np.sum gives 0 for an empty selection, so no None/NaN checks are needed.
    winServesIn = np.sum(windata['w_1stIn'])
    winServes = np.sum(windata['w_svpt'])
    loseServesIn = np.sum(losedata['l_1stIn'])
    loseServes = np.sum(losedata['l_svpt'])
    firstPercent = (winServesIn + loseServesIn)/(winServes + loseServes)
    firstPercents.append(firstPercent)

    # First and second serve win rates
    firstWRate = (np.sum(windata['w_1stWon']) + np.sum(losedata['l_1stWon']))/(winServesIn + loseServesIn)
    firstWRates.append(firstWRate)
    secServes = (winServes - winServesIn) + (loseServes - loseServesIn)
    secWinPer = (np.sum(windata['w_2ndWon'])+np.sum(losedata['l_2ndWon']))/secServes
    secWinRate.append(secWinPer)

    # Ace rate: aces per service point
    acePer = (np.sum(windata['w_ace']) + np.sum(losedata['l_ace']))/(winServes + loseServes)
    aceRate.append(acePer)

    # Return stats, starting with break point chances
    # (break points faced by the opponent)
    breakpoints = np.sum(windata['l_bpFaced']) + np.sum(losedata['w_bpFaced'])
    bpChances.append(breakpoints)

    # Some players never had a break point chance, so avoid dividing by 0.
    # I chose to append 0 as the percent, since a player with no break point
    # chances wasn't returning well and the percent should reflect that.
    if breakpoints == 0:
        bpWinPer.append(0)
    else:
        bpPercent = (breakpoints - (np.sum(windata['l_bpSaved']) + np.sum(losedata['w_bpSaved'])))/breakpoints
        bpWinPer.append(bpPercent)

    # Return points won: opponent's service points minus the points the opponent won on serve
    returnWinper = (np.sum(windata['l_svpt'])-(np.sum(windata['l_1stWon']) + np.sum(windata['l_2ndWon'])) +
                np.sum(losedata['w_svpt'])-(np.sum(losedata['w_1stWon']) + np.sum(losedata['w_2ndWon'])))
    returnWinper = returnWinper/(np.sum(windata['l_svpt']) + np.sum(losedata['w_svpt']))

    winReturnPercents.append(returnWinper)

In [255]:
data[data['loser_name'] == 'Adrien Bossel']

Unnamed: 0,w_1stIn,w_1stWon,l_1stIn,l_1stWon,winner_name,loser_name,loser_ht,w_ace,w_svpt,winner_ht,w_2ndWon,w_bpFaced,w_bpSaved,l_ace,l_2ndWon,l_bpSaved,l_bpFaced,l_svpt
1789,51.0,41.0,39.0,28.0,Dustin Brown,Adrien Bossel,187.0,18.0,73.0,196.0,13.0,2.0,2.0,8.0,20.0,5.0,6.0,75.0


In [254]:
players[5]

'Adrien Bossel'

In [260]:
len(winReturnPercents)

580