# What is the Type of Anticipated Tennis Court?

### Introduction

Tennis is the world's most popular individual sport worldwide. At the elite level, the sport is played on various types of surfaces including hard (US Open), clay (Paris), and grass (Wimbledon). The surface used can greatly affect the game since the bounces go at different speeds. For example, clay slows the ball down thus giving precise players an advantage whereas grass prefers quicker players. Therefore, our research question is that: Based on the number of aces in a game, will the surface of a court be hard, or clay? 

To answer the question, we will use the data set provided by “Tennis Data”.  There is a mix of qualitative and quantitative variables about various tennis games. Each observation is a specific tennis game whereas the variables provide insight into the score, number of aces, surface, tournament and more. 

### Preliminary exploratory data analysis: 

Demonstrate that the dataset can be read from the web into Python 

Clean and wrangle your data into a tidy format 

Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.  

Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis. 

In [3]:
import pandas as pd 

tennis_2017=pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2017.csv').loc[:, [
    'match_num',
    'surface',
    'minutes',
    'w_ace',
    'l_ace']
].dropna().assign(year = '2017')

tennis_2018 = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2016.csv').loc[:, [
    'match_num',
    'surface',
    'minutes',
    'w_ace',
    'l_ace']
].dropna().assign(year = '2018')


tennis_2019 = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2019.csv').loc[:, [
    'match_num',
    'surface',
    'minutes',
    'w_ace',
    'l_ace']
].dropna().assign(year = '2019')

##Combines the csvs
files= [tennis_2017,tennis_2018,tennis_2019]
tennis_pre = pd.concat(files)

##Sum aces and drop other cols
tennis = tennis_pre.assign(total_aces = tennis_pre["w_ace"] + tennis_pre["l_ace"]).drop(columns = ["w_ace","l_ace"])

#The following code is to create a relevant summzization of the Data set. 
tennis_sur=tennis[(tennis['surface']=='Hard') | (tennis['surface']=='Clay')]

mean_tennis_sur= tennis_sur.groupby('surface').mean()

#counts the number of observations for each surface type
surface_counts = pd.DataFrame(tennis['surface'].value_counts()).reset_index()

#filters out to the relevant surface types 
rel_surface_counts=surface_counts[(surface_counts['surface']=='Hard') | (surface_counts['surface']=='Clay')]

#mergine the values taken from the groupby function and the counts for each observation. 
final_table=mean_tennis_sur.merge(rel_surface_counts, on='surface').drop(columns=['year'])

final_table

Unnamed: 0,surface,match_num,minutes,total_aces,count
0,Clay,299.586619,111.272003,9.073278,2511
1,Hard,236.91533,108.464772,13.954112,4925


In [4]:
import altair as alt

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

#The following code makes a scatter plot showing relationship between minutes and total aces.
aces_per_surface = alt.Chart(tennis,
                            title = 'Minutes in a Game vs Aces in a Game'
                            ).mark_point(opacity=0.4).encode(
    x=alt.X("minutes").title("Game Length in Minutes").scale(domain = [0, 350]),
    y=alt.Y("total_aces").title("Total Aces in the Game").scale(zero=False),
    color=alt.Color("surface").title('All Surface Types').legend(orient="top")
).properties(width = 700)
aces_per_surface

### Methods 
The data analysis on how the number of aces in a game relates to surface of the court will be conducted using the columns, ‘total_aces’, or the sum of the columns ‘l_ace’ (loser’s ace’s) and ‘w_ace’ (winner ace‘s) from the original data set, the length of the game in minutes (’minutes’), and a games identify number (match_num). Addtionally, only the surfaces Clay and Hard were used in this data set because there are two little observations of them to conducte an effective data analysis. There are several other columns in the data including other scoring types form the winner and the loser, tournaments information, player information and match score, but are not included in this data analysis as these variables are not influence by the surface of the court.  The results will be visualized with a scatter plot that displays a k-nearest neighbor analysis. 

  

### Expectations 
We anticipate that our model will accurately predict the surface of the tennis court (hard, clay, or grass) based on the number of aces served in a game. This expectation is grounded in the hypothesis that the surface type significantly influences the game's pace and, consequently, the ability of players to serve aces.

We expect to find that our model will be very accurate considering the effects of the surface court on the speed of tennis serves. Also, we quantify the extent to which different surfaces affect the probability of serving an ace. Through analyzing the data, we aim to identify clear patterns that differentiate the playing styles and strategies best suited for hard, clay, and grass courts.

When we complete the project, the final expectation would be providing a comparative insight into how surface affects the game's dynamics.