### What is the Type of Anticipated Tennis Court?¶

### Introduction

Tennis, often recognized as the most popular individual sport globally, attracts a diverse array of competitors and enthusiasts alike.  In elite tennis, matches are contested on a variety of surfaces, each imparting distinct characteristics to the play.  These include hard courts, exemplified by the US Open;  clay courts, as seen in Paris at the French Open;  and grass courts, most famously at Wimbledon.  The choice of surface significantly influences the dynamics of a match, altering the speed and trajectory of the ball.  For instance, clay courts tend to decelerate the ball, favoring players who excel in precision and endurance.  Conversely, grass surfaces facilitate faster play, benefiting those who rely on speed and quick reflexes.

Given these distinctions, our research endeavors to explore the impact of playing surfaces on the nature of the game, specifically through the metric of ace counts during matches.  We pose the following research question: Can the type of playing surface—be it hard or clay—be predicted based on the number of aces recorded in a game?

To address this query, we will employ a comprehensive dataset provided by "Tennis Data," which incorporates both qualitative and quantitative aspects of numerous tennis matches.  Each record within the dataset corresponds to an individual game, offering detailed insights into various game metrics such as scores, number of aces, surface type, and tournament particulars.  By analyzing this data, we aim to ascertain the correlation between the court surface and the frequency of aces, thereby enhancing our understanding of how surface preferences may influence game outcomes in professional tennis.

### HypothesisWe are analyzing two types of tennis courts: hard and clay. Clay courts typically consist of a mix of clay and sand, which tends to slow down the ball and results in a more predictable bounce. In contrast, hard courts are often made from a combination of asphalt, concrete, and acrylic, generally producing faster ball speeds and less predictable bounces. According to Sportskeeda, an ace in tennis occurs when a serve lands within the play bounds and is not touched by the receiver, awarding a point to the server. Our analysis suggests that hard courts are likely to witness a higher number of aces due to their unpredictable bounce, which can prevent the receiver from returning the ball effectively. This research aims to verify if this inference is accurate.


### Methods 

## Reading the data
In the code below, we load data from three databases for tennis matches from 2017 to 2019. We then combine these into one database, sum the values from 'w_ace' and 'l_ace' into a new 'total_aces' column, and drop the original ace columns.

In [6]:
import pandas as pd 

#reading in the data and choosing only the relevant columns for data anyalisis:
tennis_2017=pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2017.csv').loc[:, [
    'match_num',
    'surface',
    'w_ace',
    'l_ace']
].dropna().assign(year = 2017)

tennis_2018 = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2016.csv').loc[:, [
    'match_num',
    'surface',
    'w_ace',
    'l_ace']
].dropna().assign(year = 2018)


tennis_2019 = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2019.csv').loc[:, [
    'match_num',
    'surface',
    'w_ace',
    'l_ace']
].dropna().assign(year = 2019)

##Combines the csvs
files= [tennis_2017, tennis_2018, tennis_2019]
tennis_pre = pd.concat(files)

##Sum aces and drop other ace related cols
tennis = tennis_pre.assign(total_aces = tennis_pre["w_ace"] + tennis_pre["l_ace"]).drop(columns = ["w_ace","l_ace"])

## Visualizing DataFrame

In this code segment, we perform exploratory data analysis on a dataset of tennis matches, focusing on those played on 'Hard' or 'Clay' surfaces. We calculate and display mean statistics for these surfaces, count the number of matches on each, and merge these summaries into a final table. Additionally, we create a scatter plot using Altair to visualize the relationship between game length and total aces for these surface types. This analysis helps us simply determine if we can use models to accurately make predictions based on the observed correlations rather than actually using this to answer our question.

In [7]:
## Explortary anaylis of Data set

#The following code is to create a relevant summzization of the Data set. 
tennis_sur=tennis[(tennis['surface']=='Hard') | (tennis['surface']=='Clay')]

mean_tennis_sur= tennis_sur.groupby('surface').mean()

#counts the number of observations for each surface type
surface_counts = pd.DataFrame(tennis['surface'].value_counts()).reset_index()

#filters out to the relevant surface types 
rel_surface_counts=surface_counts[(surface_counts['surface']=='Hard') | (surface_counts['surface']=='Clay')]

#mergine the values taken from the groupby function and the counts for each observation. 
final_table=mean_tennis_sur.merge(rel_surface_counts, on='surface').drop(columns=['year'])


display(final_table)

#creating a visualization for exploatory anyalysis
import altair as alt

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

#The following code makes a scatter plot showing relationship between minutes and total aces.
aces_per_surface = alt.Chart(tennis_sur,
                            title = 'Minutes in a Game vs Aces in a Game'
                            ).mark_point(opacity=0.4).encode(
    x=alt.X("minutes").title("Game Length in Minutes").scale(domain = [0, 350]),
    y=alt.Y("total_aces").title("Total Aces in the Game").scale(zero=False),
    color=alt.Color("surface").title('All Surface Types').legend(orient="top")
).properties(width = 700)
aces_per_surface

Unnamed: 0,surface,match_num,minutes,total_aces,count
0,Clay,299.586619,111.272003,9.073278,2511
1,Hard,236.91533,108.464772,13.954112,4925


## Optimizing KNN model for surface prediction


In this code segment, we are analyzing a dataset of tennis matches to predict the surface type using the K-Nearest Neighbors(KN) classifier. First, we split the data into training and testing sets, then prepare it by scaling the"'total_ace"' feature. We use GridSearchCV to optimize the KNN model by finding the best number of neighbors. Finally, we visualize the accuracy of different KNN settings and determine the optimal number of neighbors for our model Essentially this allows us to find an optimal k value to use for our models later on. .

In [50]:
### Data Anaylisis: 
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer

#splitting the orginal data set to create a testign and training data set.
tennis_training, tennis_testing = train_test_split(
    tennis_sur,
    test_size=0.25,
    random_state = 42
)
# random state allows for a consistent k value each time


#setting predicting values and parameters. 
X_train = tennis_training[["total_aces"]]  # A single column data frame
y_train = tennis_training["surface"]  # A series

X_test = tennis_testing[["total_aces"]]  # A single column data frame
y_test = tennis_testing["surface"]  # A series

#creating the model's preprocessor 
tennis_preprocessor = make_column_transformer(
    (StandardScaler(), ["total_aces"]),
)

#Tuning the model to find the best knn value
knn = KNeighborsClassifier()
tennis_tune_pipe = make_pipeline(tennis_preprocessor, knn)

### Using GridSearchCV to find the best knn value 
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 200, 2),
}

tennis_tune_grid = GridSearchCV(
    estimator=tennis_tune_pipe,
    param_grid=parameter_grid,
    cv=10
)

### Fitting grid 
tennis_tune_grid.fit(X_train, y_train)

#find the accuary of different knn values 
accuracies_grid = pd.DataFrame(tennis_tune_grid.cv_results_)
#accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 10**(1/2)

#this is a plot to see what knn value is the best. 
accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors").scale(zero=False),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

#finding the best knn value 
tennis_tune_grid.best_params_ #from this the best knn value will be 143.


{'kneighborsclassifier__n_neighbors': 125}

## Classifiying Anaylsis

In this code, we create and train a K-Nearest Neighbors KNN model using the best determined number of neighbors 125 to predict the surface type of tennis matches based on the "total_aces" feature. We then apply this model to the testing dataset to generate predictions for each match. Key performance metrics such as accuracy, precision, and recall are calculated to evaluate how well the model predicts that matches are played on "Hard" surfaces. Finally, the results are displayed, summarizing the model's effectiveness in predicting tennis match surfaces.

In [45]:

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

#creating the new knn model withthe best knn value 
knn = KNeighborsClassifier(n_neighbors=125)

#making and training the new knn pipeline 
knn_pipeline = make_pipeline(tennis_preprocessor, knn)
knn_pipeline.fit(X_train, y_train)

#creating a new colum of predicted surfaces. 
tennis_testing["predicted_surface"] = knn_pipeline.predict(tennis_testing[["total_aces"]])

#filtering the relevant columns for the data set. 
tennis_testing[['surface', "predicted_surface", 'match_num', 'total_aces']]


#Calulcating accuracy, precision and recall scores.
score= tennis_tune_grid.score(tennis_testing[["total_aces"]],tennis_testing["surface"])

precision= precision_score(tennis_testing[["surface"]],tennis_testing["predicted_surface"], pos_label = "Hard")

recall = recall_score(tennis_testing[["surface"]],tennis_testing["predicted_surface"], pos_label = "Hard")

# # pd.crosstab

print("Accuracy Tests: Score = {}, Precision = {}, Recall = {}".format(round(score,4), round(precision,4), round(recall,4)))

Accuracy Tests: Score = 0.6923, Precision = 0.7251, Recall = 0.8688


<h2> 
Visualizing actual vs predicted surface </h2>

Here we are visualizing the actual vs predicted results for all matches from 2017 to 2019. Essentially, in the predicted visualization, the model went through each match and was able to predict based on the number of aces what type of surface the match was played on.s

In [49]:
#Chart of findings 
plot_predicted_surface = alt.Chart(tennis_testing [tennis_testing["match_num"] <=300], 
    title = 'Predicted Surface Type and Total Aces in a Match').mark_point().encode(
    x=alt.X('match_num').title('Match Number').scale(zero=False), 
    y=alt.Y('total_aces').title('Total Aces'), 
    color = alt.Color('predicted_surface').title('Predicted Surface Type')
)
plot_actual_surface= alt.Chart(tennis_testing [tennis_testing["match_num"] <=300], title = 'Actual Surface Type and Total Aces in a Match').mark_point().encode(
    x=alt.X('match_num').title('Match Number').scale(zero=False),
    y=alt.Y('total_aces').title('Total Aces'),
    color = alt.Color('surface').title('Actual Surface Type')
)
#Above we filtered only for the Match Numbers less that 300 to plot the graphs

display(plot_actual_surface, plot_predicted_surface)


## Discussion 

Through this research project, we aimed to determine if there was a correlation between surface type and the number of aces in a game, and, if a correlation was found, to further create a predictive model to predict the surface type of a match from the number of aces in the match. The two surface types contrasted were clay and hard surfaces. In order to answer the questions at hand, we used the ATP matches dataset from 2017 to 2019. This allowed us to have a comprehensive set of examples to use in our predictive models and also to help us visualize any initial trends we saw between the quantities. To find an initial correlation, we concatenated the datasets for the various years into one dataset, and then we created a scatter plot for the dataset, which compared the total number of aces in a game to the game length in minutes, where the color of the points indicated the surface type. What we observed, both through the visualization of this plot and through numerical calculations of each point, was that the hard surface had a larger average number of aces per game, at 13.954112, compared to 9.073278 aces per game observed on the clay surfaces. Although it is important to note that while the data did reveal that there is a correlation between the two quantities, the number of games played on clay surfaces was also fewer, at 2511, compared to 4925 played on hard surfaces. The larger sample size for hard surfaces means that outliers will have a smaller effect when computing the mean value for the number of aces per game. To support our earlier questions, we then proceeded to create a K-Neighbors Classifier model that predicted the surface type based on the number of aces. Lastly, we visually compared the actual and predicted data through scatter plots and were able to conclude that our model was able to make correct predictions.

When comparing the findings to our initial visualizations of the dataset, we observed a clear correlation between the court surface type and the number of aces per game. This correlation aligned with our initial hypothesis, which predicted that hard surface types would have more aces compared to their clay counterparts. Additionally, when creating a model to answer our predictive question, the results indicated that matches with a low number of aces were predominantly predicted to be on hard surfaces, whereas matches with a high number of aces were mostly predicted to be on clay surfaces.

Understanding how the surface material of a tennis court influences the number of aces in a game can be highly beneficial for the sport. One use of our predictive model is to help players adjust their play styles to different match environments based on the location. Knowing the number of aces scored at previous venues can enable players to develop more effective defensive strategies to counteract the impact of the surface on the ball's motion. Additionally, insights into how the ball behaves on different surfaces can assist tennis leagues in standardizing court materials to ensure fairness. Leagues might also explore new materials that better accommodate players familiar with a variety of court surfaces.

Analyzing our results can answer many questions and pave the way for further research. Future studies could examine how players' serving statistics vary across different surfaces to determine whether the number of aces in a game is influenced more by the court surface or by the players' skills. Additionally, the model could be refined by incorporating factors such as players' rankings, speed, home environment, and environmental conditions like wind and precipitation. Lastly, our findings raise an important question for global tennis organizations: Should material types be standardized across all courts to ensure fairness and consistency in the sport?


<h2> Conclusion </h2>
In conlusion there are many factors that determine weather or not an ace occurs in a game of tennis and a more accurate model would take into account others factors such as environmental and player specific histores to generate an accurate clue. That being through our intial explosatory visualtions we were able to see that their defeintly was a correlation between the two quanties. Hence answering the question: Can the type of playing surface—be it hard or clay—be predicted based on the number of aces recorded in a game? we answer that our model is accurate in most scenarios but that being said like most things, their are other varibles which can affect the accuracy of the predictions being made. Although we were able to reciece and accuracy of around 70 percent when prediction the surface type so our model was defintly able to find correlations between the two quatities.





## References 

1. "Tennis Surfaces." Talk Tennis, www.talktennis.co.uk/tennis-surfaces/.
   Accessed 10 Apr. 2024.

2. "What Is an Ace in Tennis?" Sportskee
   a, www.sportskeeda.com/tennis/what-is-an-ace-in-tennis.
   Accessed 10 Apr. 2034.
3. Sackmann, Jeff. "atp_matches_doubles_2017." GitHub, GitHub repository, 
4. 017, https://github.com/JeffSackmann/tennis_atp/blob/master/atp_matches_doubles_2017.
   Accessed 10 Apr.52024
4. Sackmann, Jeff. "atp_matches_doubles_2018." GitHub, GitHub repository, 2018, https://github.com/JeffSackmann/tennis_atp/blob/master/atp_matches_doubles_2018.csv.
   Accessed 10 Ap6. 2024
5. Sackmann, Jeff. "atp_matches_qual_chall_2019." GitHub, GitHub repository, 2019, https://github.com/JeffSackmann/tennis_atp/blob/master/atp_matches_qual_chall_2019.csv.
   Accessed 10 Apr. 2024