# Swinging Success: Unveiling Tennis Players' Careers

## Introduction

Tennis is one of the most prestigious and popular sports in the world, with large prizes on offer for the top-ranked players. The path to the top, however, is not easy.





This data science project builds a predictive model that forecasts a player's career trajectory, measured here by the best ranking the player reaches, from a set of individual parameters.

Data were collected on 500 players from around the world, covering a range of rankings and playing styles. The columns used include Age, Current Rank, Best Rank, Seasons, Country, Plays, Backhand, Prize Money, and Last Appearance.

## Preliminary Data Analysis

In [1]:
import random
import altair as alt
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import set_config
import seaborn as sns



# Simplifying work with large datasets in Altair
alt.data_transformers.disable_max_rows()

#Outputs Dataframes instead of Arrays
set_config(transform_output="pandas")

**Reading Data from Web**

In [2]:
url= "https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS"

In [3]:
players= pd.read_csv(url)


**Data Summary**


In [4]:
columns=players.columns.to_list()
columns

['Unnamed: 0',
 'Age',
 'Country',
 'Plays',
 'Wikipedia',
 'Current Rank',
 'Best Rank',
 'Name',
 'Backhand',
 'Prize Money',
 'Height',
 'Favorite Surface',
 'Turned Pro',
 'Seasons',
 'Active',
 'Current Elo Rank',
 'Best Elo Rank',
 'Peak Elo Rating',
 'Last Appearance',
 'Titles',
 'GOAT Rank',
 'Best Season',
 'Retired',
 'Masters',
 'Birthplace',
 'Residence',
 'Weight',
 'Coach',
 'Facebook',
 'Twitter',
 'Nicknames',
 'Grand Slams',
 'Davis Cups',
 'Web Site',
 'Team Cups',
 'Olympics',
 'Weeks at No. 1',
 'Tour Finals']

In [5]:
players.head()

Unnamed: 0.1,Unnamed: 0,Age,Country,Plays,Wikipedia,Current Rank,Best Rank,Name,Backhand,Prize Money,...,Facebook,Twitter,Nicknames,Grand Slams,Davis Cups,Web Site,Team Cups,Olympics,Weeks at No. 1,Tour Finals
0,0,26 (25-04-1993),Brazil,Right-handed,Wikipedia,378 (97),363 (04-11-2019),Oscar Jose Gutierrez,,,...,,,,,,,,,,
1,1,18 (22-12-2001),United Kingdom,Left-handed,Wikipedia,326 (119),316 (14-10-2019),Jack Draper,Two-handed,"$59,040",...,,,,,,,,,,
2,2,32 (03-11-1987),Slovakia,Right-handed,Wikipedia,178 (280),44 (14-01-2013),Lukas Lacko,Two-handed,"US$3,261,567",...,,,,,,,,,,
3,3,21 (29-05-1998),"Korea, Republic of",Right-handed,Wikipedia,236 (199),130 (10-04-2017),Duck Hee Lee,Two-handed,"$374,093",...,,,,,,,,,,
4,4,27 (21-10-1992),Australia,Right-handed,Wikipedia,183 (273),17 (11-01-2016),Bernard Tomic,Two-handed,"US$6,091,971",...,,,,,,,,,,


In [6]:
# Let's check how many missing values there are in each column of the dataset

missing_count=players.isnull().sum()
missing_percentage = (missing_count / len(players)) * 100
column_type= players.dtypes

# Create a DataFrame to display the missing values count and percentage
missing_data = pd.DataFrame({'Missing Count': missing_count, 'Missing Percentage': missing_percentage, "Data Type" : column_type})
missing_data = missing_data.sort_values(by='Missing Count', ascending=True)

print(missing_data)

                  Missing Count  Missing Percentage Data Type
Unnamed: 0                    0                 0.0     int64
Name                          0                 0.0    object
Age                           1                 0.2    object
Country                       1                 0.2    object
Wikipedia                     1                 0.2    object
Best Rank                     1                 0.2    object
Current Rank                  5                 1.0    object
Plays                        47                 9.4    object
Prize Money                  81                16.2    object
Backhand                     92                18.4    object
Seasons                     126                25.2   float64
Last Appearance             158                31.6    object
Active                      218                43.6    object
Turned Pro                  254                50.8   float64
Favorite Surface            259                51.8    object
Best Elo

Given the substantial share of missing values in some columns, these columns will be dropped later.

The summary below gives the count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%) for each numeric variable.

In [7]:
round(players.describe(),2)

Unnamed: 0.1,Unnamed: 0,Turned Pro,Seasons,Titles,Best Season,Retired,Masters,Grand Slams,Davis Cups,Team Cups,Olympics,Weeks at No. 1,Tour Finals
count,500.0,246.0,374.0,95.0,101.0,80.0,16.0,7.0,32.0,6.0,2.0,4.0,6.0
mean,249.5,2009.25,6.49,7.8,2015.55,2016.75,7.94,9.0,1.41,1.33,1.5,208.5,2.5
std,144.48,4.72,4.95,15.97,3.86,1.91,12.59,8.85,1.07,0.52,0.71,119.42,2.35
min,0.0,1997.0,1.0,1.0,2004.0,2008.0,1.0,1.0,1.0,1.0,1.0,41.0,1.0
25%,124.75,2005.0,2.0,1.0,2014.0,2016.0,1.0,2.0,1.0,1.0,1.25,166.25,1.0
50%,249.5,2009.0,5.0,3.0,2017.0,2017.0,1.0,3.0,1.0,1.0,1.5,241.5,1.0
75%,374.25,2013.0,10.0,7.0,2019.0,2018.0,5.75,17.5,1.0,1.75,1.75,283.75,4.0
max,499.0,2019.0,22.0,103.0,2020.0,2019.0,35.0,20.0,5.0,2.0,2.0,310.0,6.0


**Preliminary Data Cleaning**

In [8]:
# Extracting the numeric part of columns stored as "value (detail)",
# e.g. Age is stored as "26 (25-04-1993)", i.e. "age (date of birth)"
df= pd.DataFrame(players)

df["Clean_Age"]= pd.to_numeric(df["Age"].str.split().str[0])
df["Clean_Best_Rank"]= pd.to_numeric(df["Best Rank"].str.split().str[0])
df["Clean_Height"]= pd.to_numeric(df["Height"].str.split().str[0])
df["Clean_Current_Rank"]= pd.to_numeric(df["Current Rank"].str.split().str[0])

# Removing the thousands separators and the '$' / 'US$' prefixes from Prize Money
df['Prize Money'] = df['Prize Money'].str.replace(",", "", regex= False).str.replace('US$', '', regex=False).str.replace('$', '',  regex=False)
# Convert the values to float
df['Prize Money'] = pd.to_numeric(df['Prize Money'], errors='coerce')
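
The `value (detail)` layout used by `Age`, `Current Rank` and `Best Rank` can also be split with a regular expression, which keeps the detail in parentheses instead of discarding it. The following is a minimal sketch on two values copied from the preview above plus one missing value; the names `raw_age`, `parsed`, `age` and `birth_date` are illustrative and not part of the notebook.

```python
import pandas as pd

# 'Age' values in the same "age (dd-mm-yyyy)" layout as the dataset preview
raw_age = pd.Series(["26 (25-04-1993)", "18 (22-12-2001)", None], name="Age")

# Capture the leading number and the text inside the parentheses as two columns
parsed = raw_age.str.extract(r"^(?P<age>\d+)\s*\((?P<birth>[^)]*)\)")

# Missing inputs stay missing (NaN / NaT) instead of raising an error
age = pd.to_numeric(parsed["age"])
birth_date = pd.to_datetime(parsed["birth"], format="%d-%m-%Y")

print(pd.DataFrame({"age": age, "birth_date": birth_date}))
```

The same pattern would apply to the rank columns, where the parentheses hold points or a date rather than a date of birth.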


In [9]:
# Transforming categorical values to numerical
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# Turning the "Country" column values into numerical
country_encoded = pd.get_dummies(df['Country'], prefix='Country')
country_encoded['Country_encoded'] = label_encoder.fit_transform(df['Country'])
country = df["Country"]
             
# Turning the "Plays" column values into numerical
plays_encoded = pd.get_dummies(df['Plays'], prefix='Plays')
plays_encoded['Plays_encoded'] = label_encoder.fit_transform(df['Plays'])

# Turning the "Backhand" column values into numerical
backhand_encoded = pd.get_dummies(df['Backhand'], prefix='Backhand')
backhand_encoded['Backhand_encoded'] = label_encoder.fit_transform(df['Backhand'])


# Concatenating the results
players = pd.concat([df, country_encoded, plays_encoded, backhand_encoded], axis=1)
players

Unnamed: 0.1,Unnamed: 0,Age,Country,Plays,Wikipedia,Current Rank,Best Rank,Name,Backhand,Prize Money,...,Country_Uruguay,Country_Uzbekistan,Country_Zimbabwe,Country_encoded,Plays_Left-handed,Plays_Right-handed,Plays_encoded,Backhand_One-handed,Backhand_Two-handed,Backhand_encoded
0,0,26 (25-04-1993),Brazil,Right-handed,Wikipedia,378 (97),363 (04-11-2019),Oscar Jose Gutierrez,,,...,0,0,0,8,0,1,1,0,0,2
1,1,18 (22-12-2001),United Kingdom,Left-handed,Wikipedia,326 (119),316 (14-10-2019),Jack Draper,Two-handed,59040.0,...,0,0,0,57,1,0,0,0,1,1
2,2,32 (03-11-1987),Slovakia,Right-handed,Wikipedia,178 (280),44 (14-01-2013),Lukas Lacko,Two-handed,3261567.0,...,0,0,0,47,0,1,1,0,1,1
3,3,21 (29-05-1998),"Korea, Republic of",Right-handed,Wikipedia,236 (199),130 (10-04-2017),Duck Hee Lee,Two-handed,374093.0,...,0,0,0,34,0,1,1,0,1,1
4,4,27 (21-10-1992),Australia,Right-handed,Wikipedia,183 (273),17 (11-01-2016),Bernard Tomic,Two-handed,6091971.0,...,0,0,0,1,0,1,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,495,20 (13-04-1999),France,Right-handed,Wikipedia,382 (95),380 (11-11-2019),Dan Added,Two-handed,57943.0,...,0,0,0,24,0,1,1,0,1,1
496,496,26 (03-09-1993),Austria,Right-handed,Wikipedia,5 (5890),4 (06-11-2017),Dominic Thiem,One-handed,,...,0,0,0,2,0,1,1,1,0,0
497,497,23 (14-03-1996),Netherlands,Left-handed,Wikipedia,495 (60),342 (05-08-2019),Gijs Brouwer,,,...,0,0,0,39,1,0,0,0,0,2
498,498,24 (17-05-1995),Ukraine,,Wikipedia,419 (81),419 (20-01-2020),Vladyslav Orlov,,,...,0,0,0,56,0,0,2,0,0,2


In [10]:
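# Dropping rows with null values in the columns used for modelling.
# Note: the *_encoded columns contain no nulls, because missing categories
# received their own code during encoding (e.g. Backhand_encoded == 2 above).

players=players.dropna(subset=["Clean_Age", "Clean_Current_Rank", "Clean_Best_Rank", "Seasons", "Country_encoded", "Plays_encoded", "Backhand_encoded", "Prize Money"])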


In this project, we use only columns with a low share of missing values. Since some of them are of type object, they are first converted to numeric types that a prediction model can work with.
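
To illustrate what converting an object column to numeric form can look like, here is a small sketch using pandas only. It is not the notebook's exact method (which uses `LabelEncoder` together with `pd.get_dummies`); the series `plays` and its values are made up for illustration.

```python
import pandas as pd

# A small stand-in for the 'Plays' column, including one missing value
plays = pd.Series(["Right-handed", "Left-handed", None, "Right-handed"], name="Plays")

# Integer codes: categories are sorted alphabetically, missing values become -1
codes = plays.astype("category").cat.codes

# One-hot (dummy) columns: a missing value gets zeros in every column
dummies = pd.get_dummies(plays, prefix="Plays")

print(codes.tolist())
print(dummies)
```

Integer codes impose an arbitrary order on the categories, which distance-based models such as KNN will treat as meaningful, whereas one-hot columns avoid that at the cost of more columns.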

**Data Visualizations**

In [11]:

# The number of players at each age

age_counts = df['Clean_Age'].value_counts().sort_index().reset_index()
age_counts.columns = ['Clean_Age', 'Player Count']

plot1 = alt.Chart(age_counts).mark_bar().encode(
    x=alt.X("Clean_Age", title="Age"),
    y=alt.Y("Player Count", title="Count")
).properties(
    title='Age of the players'
)
plot1

In [12]:

bins = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500]  # Define custom bin intervals

df['Rank Group'] = pd.cut(df['Clean_Best_Rank'], bins=bins)

# Count the number of players in each rank group
rank_counts = df['Rank Group'].value_counts().sort_index().reset_index()
rank_counts.columns = ['Rank Group', 'Player Count']
rank_counts['Rank Group'] = rank_counts['Rank Group'].astype(str)  # Convert Interval to string

plot2 = alt.Chart(rank_counts).mark_bar().encode(
    x='Rank Group:O',  # O: Ordinal (categorical) axis
    y='Player Count:Q',  # Q: Quantitative axis (numerical)
    tooltip=['Rank Group', 'Player Count']  # Hover tooltip with data
).properties(
    title='Number of Players in their Best Rank Groups'
).configure_axisX(
    labelAngle=45
).configure_view(
    width=500)
plot2

In [13]:
df['Current Rank Group'] = pd.cut(df['Clean_Current_Rank'], bins=bins)

# Count the number of players in each rank group
rank_counts = df['Current Rank Group'].value_counts().sort_index().reset_index()
rank_counts.columns = ['Current Rank Group', 'Player Count']
rank_counts['Current Rank Group'] = rank_counts['Current Rank Group'].astype(str)  # Convert Interval to string


plot3 = alt.Chart(rank_counts).mark_bar().encode(
    x='Current Rank Group:O',  # O: Ordinal (categorical) axis
    y='Player Count:Q',  # Q: Quantitative axis (numerical)
    tooltip=['Current Rank Group', 'Player Count']  # Hover tooltip with data
).properties(
    title='Number of Players in their Current Rank Groups'
).configure_axisX(
    labelAngle=45
).configure_view(
    width=500)
plot3

In [14]:
players["Country"]

2                Slovakia
3      Korea, Republic of
4               Australia
5                  Poland
6           United States
              ...        
491              Bulgaria
492               Ecuador
493                 India
494    Russian Federation
499               Tunisia
Name: Country, Length: 344, dtype: object

In [15]:
#visualization of number of players from the most represented countries

country_counts= players["Country"].value_counts()

top_countries = country_counts[country_counts > 5]

filtered_df = df.copy()
# Membership is checked against the country names, i.e. the index of top_countries
filtered_df['Country'] = filtered_df['Country'].apply(lambda x: x if x in top_countries.index else 'others')

final_counts = filtered_df['Country'].value_counts()

plot_data = pd.DataFrame({
    'Country': final_counts.index,
    'Players': final_counts.values
})

chart = alt.Chart(plot_data).mark_bar().encode(
    x='Country',
    y='Players',
    tooltip=['Country', 'Players']
).properties(
    width=600,
    title="Top countries in number of players"
).configure_axisX(
    labelAngle=45
)

chart

**Building a Model**

In [16]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import set_config


Given the large number of columns with missing values, I set a threshold and use only columns with at most 50% missing values, in order to maximise the accuracy of the predictive model.
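
The 50% threshold can be expressed directly in pandas. The sketch below uses a toy frame: the column names match the dataset, but the values and the names `toy`, `max_missing` and `kept_columns` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `players`; values are invented for illustration
toy = pd.DataFrame({
    "Seasons": [5.0, np.nan, 3.0, 10.0],              # 25% missing
    "Turned Pro": [np.nan, np.nan, 2010.0, np.nan],   # 75% missing
    "Name": ["A", "B", "C", "D"],                     # 0% missing
})

max_missing = 0.5  # keep columns with at most 50% missing values

# Fraction of missing values per column, then keep those under the threshold
missing_share = toy.isnull().mean()
kept_columns = missing_share[missing_share <= max_missing].index.tolist()

print(kept_columns)
```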

In [17]:
# Now, let's split the set into training and testing sets. The training set is 75% of the data.

players_train, players_test = train_test_split(
    players, train_size=0.75
)
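
The split above is random, so the scores reported in the following cells will change slightly between runs. The sketch below assumes reproducibility is wanted, which the notebook itself does not require; `random_state=42` and the toy frame are arbitrary illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows, standing in for `players`
toy = pd.DataFrame({"x": range(100), "y": range(100)})

# random_state fixes the shuffle, so the same rows land in each part on every run
train_a, test_a = train_test_split(toy, train_size=0.75, random_state=42)
train_b, test_b = train_test_split(toy, train_size=0.75, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```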

In [18]:
from sklearn.neighbors import KNeighborsRegressor

players_preprocessor = make_column_transformer((StandardScaler(), ["Clean_Age", "Clean_Current_Rank", "Seasons", "Country_encoded", "Plays_encoded", "Backhand_encoded", "Prize Money"]))
players_pipeline = make_pipeline(players_preprocessor, KNeighborsRegressor())

players_grid = {
    "kneighborsregressor__n_neighbors": range(1, 201, 3),
}
players_gridsearch = GridSearchCV(
    estimator=players_pipeline,
    param_grid=players_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

In [19]:
print(players_train.columns)


Index(['Unnamed: 0', 'Age', 'Country', 'Plays', 'Wikipedia', 'Current Rank',
       'Best Rank', 'Name', 'Backhand', 'Prize Money',
       ...
       'Country_Uruguay', 'Country_Uzbekistan', 'Country_Zimbabwe',
       'Country_encoded', 'Plays_Left-handed', 'Plays_Right-handed',
       'Plays_encoded', 'Backhand_One-handed', 'Backhand_Two-handed',
       'Backhand_encoded'],
      dtype='object', length=111)


In [20]:
# fit the GridSearchCV object
players_gridsearch.fit(
    players_train[["Clean_Age", "Clean_Current_Rank", "Seasons", "Country_encoded", "Plays_encoded", "Backhand_encoded", "Prize Money"]],  # A data frame with the seven predictor columns
    players_train["Clean_Best_Rank"]  # A series
)



In [21]:
# Retrieve the CV scores
players_results = pd.DataFrame(players_gridsearch.cv_results_)
# Standard error of the mean score across the 5 cross-validation folds
players_results["sem_test_score"] = players_results["std_test_score"] / 5**(1/2)
players_results = (
    players_results[[
        "param_kneighborsregressor__n_neighbors",
        "mean_test_score",
        "sem_test_score"
    ]]
    .rename(columns={"param_kneighborsregressor__n_neighbors": "n_neighbors"})
)


In [22]:
# The scorer returns the negated RMSPE, so flip the sign to get a positive error
players_results["mean_test_score"]=-players_results["mean_test_score"]
players_results

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,58.855075,3.888702
1,4,56.943450,3.058051
2,7,56.962370,2.796000
3,10,57.418989,2.550775
4,13,57.243775,3.114117
...,...,...,...
62,187,101.851363,5.300009
63,190,102.516574,5.277674
64,193,103.323762,5.243979
65,196,104.188396,5.146798


In [23]:
players_gridsearch.best_params_

{'kneighborsregressor__n_neighbors': 4}

So K = 4 is the number of neighbours that gives the lowest cross-validated error, i.e. the most accurate model.
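
The value reported by `best_params_` can be cross-checked against the table of cross-validation scores by picking the row with the lowest error. This sketch uses only the first five rows printed above; the middle rows are elided in the output, so it assumes none of them scores lower, which is consistent with `best_params_`. The names `cv_scores`, `best_row` and `best_k` are illustrative.

```python
import pandas as pd

# First five rows of the cross-validation table shown above
cv_scores = pd.DataFrame({
    "n_neighbors": [1, 4, 7, 10, 13],
    "mean_test_score": [58.855075, 56.943450, 56.962370, 57.418989, 57.243775],
})

# After the sign flip the score is an RMSPE, so lower is better
best_row = cv_scores.loc[cv_scores["mean_test_score"].idxmin()]
best_k = int(best_row["n_neighbors"])

print(best_k, round(float(best_row["mean_test_score"]), 2))
```

Note that the scores for K = 4 and K = 7 differ by far less than their standard errors, so the two settings perform about equally well.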

In [24]:
plot_neighbours= alt.Chart(players_results).mark_line().encode(
    x="n_neighbors",
    y="mean_test_score"
)
plot_neighbours

In [25]:
from sklearn.metrics import mean_squared_error

players_test["predicted"] = players_gridsearch.predict(players_test)

RMSPE = mean_squared_error(
    y_true=players_test["Clean_Best_Rank"],
    y_pred=players_test["predicted"]
)**(1/2)
RMSPE

52.066873368059866
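
For reference, the RMSPE is the square root of the mean squared difference between true and predicted values. A minimal sketch of the same calculation done by hand with NumPy, on made-up numbers rather than the notebook's data:

```python
import numpy as np

# Made-up best ranks and predictions, purely for illustration
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# Root mean squared prediction error: sqrt(mean((y - y_hat)^2))
errors = y_true - y_pred
rmspe = np.sqrt(np.mean(errors ** 2))

print(round(float(rmspe), 3))
```

Because the errors are squared before averaging, a few large misses raise the RMSPE more than many small ones.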

## Summary

With the KNN regression model, we can predict the best rank in a player's career with a typical error of about 52 positions (the test RMSPE).
This means that, given the predictors in the dataset, we can be more confident in predicting the 50-position group a player's best rank falls into (0-50, 50-100, 100-150, etc.) than in predicting the exact best ranking position.
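
Whether a prediction lands in the correct 50-position group can be checked by binning both the true and the predicted rank. The notebook does not run this evaluation; the sketch below reuses the bin edges from the plots, while the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Same bin edges as in the rank-group plots
bins = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500]

# Made-up true and predicted best ranks, purely for illustration
toy = pd.DataFrame({
    "true_rank": [44, 130, 17, 363],
    "predicted": [80.0, 120.5, 40.0, 310.0],
})

# Convert both columns to 50-wide rank groups and compare them row by row
true_group = pd.cut(toy["true_rank"], bins=bins).astype(str)
pred_group = pd.cut(toy["predicted"], bins=bins).astype(str)
toy["same_group"] = true_group == pred_group

print(toy)
print("Share in the correct group:", toy["same_group"].mean())
```

Applied to `players_test`, the same comparison would give a group-level accuracy to report alongside the RMSPE.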

**Future Questions**

The correlation between the variables used (Age, Current Rank, Seasons, Country, Plays, Backhand, and Prize Money) was higher than initially assumed at the beginning of the project. This indicates the need to include additional variables, such as health and physical performance, history of injuries, diet, mental health, and years of training, to make the predictor more accurate.

These results suggest that sports analytics can play an important role in understanding and improving a player's performance. Further studies and models can be developed to investigate these correlations more thoroughly.

**Citations**

- Data derived from Ultimate Tennis Statistics: https://www.ultimatetennisstatistics.com/
- Dataset link: https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS