### Project Report 

### Introduction

Tennis, often recognized as the most popular individual sport globally, attracts a diverse array of competitors and enthusiasts alike.  In elite tennis, matches are contested on a variety of surfaces, each imparting distinct characteristics to the play.  These include hard courts, exemplified by the US Open;  clay courts, as seen in Paris at the French Open;  and grass courts, most famously at Wimbledon.  The choice of surface significantly influences the dynamics of a match, altering the speed and trajectory of the ball.  For instance, clay courts tend to decelerate the ball, favoring players who excel in precision and endurance.  Conversely, grass surfaces facilitate faster play, benefiting those who rely on speed and quick reflexes.

Given these distinctions, our research endeavors to explore the impact of playing surfaces on the nature of the game, specifically through the metric of ace counts during matches.  We pose the following research question: Can the type of playing surface—be it hard or clay—be predicted based on the number of aces recorded in a game?

To address this query, we will employ a comprehensive dataset provided by "Tennis Data," which incorporates both qualitative and quantitative aspects of numerous tennis matches.  Each record within the dataset corresponds to an individual game, offering detailed insights into various game metrics such as scores, number of aces, surface type, and tournament particulars.  By analyzing this data, we aim to ascertain the correlation between the court surface and the frequency of aces, thereby enhancing our understanding of how surface preferences may influence game outcomes in professional tennis.

### Methods 

In [3]:
import pandas as pd 

#reading in the data and choosing only the relevant columns for data anyalisis:
tennis_2017=pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2017.csv').loc[:, [
    'match_num',
    'surface',
    'minutes',
    'w_ace',
    'l_ace']
].dropna().assign(year = 2017)

tennis_2018 = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2016.csv').loc[:, [
    'match_num',
    'surface',
    'minutes',
    'w_ace',
    'l_ace']
].dropna().assign(year = 2018)


tennis_2019 = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2019.csv').loc[:, [
    'match_num',
    'surface',
    'minutes',
    'w_ace',
    'l_ace']
].dropna().assign(year = 2019)

##Combines the csvs
files= [tennis_2017, tennis_2018, tennis_2019]
tennis_pre = pd.concat(files)

##Sum aces and drop other ace related cols
tennis = tennis_pre.assign(total_aces = tennis_pre["w_ace"] + tennis_pre["l_ace"]).drop(columns = ["w_ace","l_ace"])

In [4]:
## Explortary anaylis of Data set

#The following code is to create a relevant summzization of the Data set. 
tennis_sur=tennis[(tennis['surface']=='Hard') | (tennis['surface']=='Clay')]

mean_tennis_sur= tennis_sur.groupby('surface').mean()

#counts the number of observations for each surface type
surface_counts = pd.DataFrame(tennis['surface'].value_counts()).reset_index()

#filters out to the relevant surface types 
rel_surface_counts=surface_counts[(surface_counts['surface']=='Hard') | (surface_counts['surface']=='Clay')]

#mergine the values taken from the groupby function and the counts for each observation. 
final_table=mean_tennis_sur.merge(rel_surface_counts, on='surface').drop(columns=['year'])


display(final_table)

#creating a visualization for exploatory anyalysis
import altair as alt

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

#The following code makes a scatter plot showing relationship between minutes and total aces.
aces_per_surface = alt.Chart(tennis_sur,
                            title = 'Minutes in a Game vs Aces in a Game'
                            ).mark_point(opacity=0.4).encode(
    x=alt.X("minutes").title("Game Length in Minutes").scale(domain = [0, 350]),
    y=alt.Y("total_aces").title("Total Aces in the Game").scale(zero=False),
    color=alt.Color("surface").title('All Surface Types').legend(orient="top")
).properties(width = 700)
aces_per_surface

Unnamed: 0,surface,match_num,minutes,total_aces,index


TypeError: 'UndefinedType' object is not callable

In [None]:
### Data Anaylisis: 
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer

#renaming columns to so data KNN can run on numeric values: 
tennis_training, tennis_testing = train_test_split(
    tennis_sur,
    test_size=0.25,
)

X_train = tennis_training[["total_aces"]]  # A single column data frame
y_train = tennis_training["surface"]  # A series

X_test = tennis_testing[["total_aces"]]  # A single column data frame
y_test = tennis_testing["surface"]  # A series

tennis_preprocessor = make_column_transformer(
    (StandardScaler(), ["total_aces"]),
)

#Tuning the model to find the best knn value
knn = KNeighborsClassifier()
tennis_tune_pipe = make_pipeline(tennis_preprocessor, knn)

parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 200, 2),
}

tennis_tune_grid = GridSearchCV(
    estimator=tennis_tune_pipe,
    param_grid=parameter_grid,
    cv=10
)

## Fitting grid 
tennis_tune_grid.fit(X_train, y_train)

#find the accuary of different knn values 
accuracies_grid = pd.DataFrame(tennis_tune_grid.cv_results_)
#accuracies_grid["sem_test_score"] = accuracies_grid["std_test_score"] / 10**(1/2)

#this is a plot to see what knn value is the best. 
accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors").scale(zero=False),
    y=alt.Y("mean_test_score")
        .scale(zero=False)
        .title("Accuracy estimate")
)

#finding the best knn value 
tennis_tune_grid.best_params_ #from this the best knn value will be 155.


In [28]:
#Classifiying Anaylsis

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

knn = KNeighborsClassifier(n_neighbors=79)

knn_pipeline = make_pipeline(tennis_preprocessor, knn)
knn_pipeline.fit(X_train, y_train)

tennis_testing["predicted_surface"] = knn_pipeline.predict(tennis_testing[["total_aces"]])

tennis_testing[['surface', "predicted_surface", 'match_num', 'total_aces']]

#Calulcating accuracy, precision and recall scores.

score= tennis_tune_grid.score(tennis_testing[["total_aces"]],tennis_testing["surface"])

precision= precision_score(tennis_testing[["surface"]],tennis_testing["predicted_surface"], pos_label = "Hard")

recall = recall_score(tennis_testing[["surface"]],tennis_testing["predicted_surface"], pos_label = "Hard")

# # pd.crosstab

print("Accuracy Tests: Score = {}, Precision = {}, Recall = {}".format(round(score,4), round(precision,4), round(recall,4)))

Accuracy Tests: Score = 0.6859, Precision = 0.7117, Recall = 0.8655


In [11]:
#Chart of findings 
plot_predicted = alt.Chart(tennis_testing).mark_point().encode(
    x=alt.Y('match_num',
           scale = alt.Scale(domain=[0, 300])),
    y='total_aces',
    color = 'predicted_surface'
)
plot_actual= alt.Chart(tennis_testing).mark_point().encode(
    x=alt.Y('match_num',
           scale = alt.Scale(domain=[0, 300])),
    y='total_aces',
    color = 'surface'
)
display(plot_actual, plot_predicted)

### Discussion 

### References 