In [None]:
pip install -U altair 

# Handedness and Tennis Success 

## Introduction

Holtzen (2000) hypothesized that left-handed tennis players have better-developed tennis skills than right-handed players, arguing that left-handed players can serve more strategically against their opponents given their developed spatial, motor, and attentional functions. Given this hypothesis, we want to examine whether there is an association between handedness and tennis success.

The data set we will use is Player Stats for Top 500 Players from Ultimate Tennis Statistics (https://www.ultimatetennisstatistics.com). The data set includes each player's name, age, where they're from, handedness, source of information, current ranking, best ranking, backhand position, net prize money, height, weight, coach, preferred surface to play on, information on the duration and dates of their career, social media, and personal game stats.

Because our question is whether handedness, and specifically left-handedness, has an impact on success in tennis, we will designate handedness as the class label and best rank, prize money, best elo rank, peak elo rating, titles, and GOAT rank as the predictor variables. We chose these predictor variables because they all relate to the success of a tennis player's career.

We omitted other variables such as the player's personal info (name, age, height, weight, where they're from), social media and information source, coach, nicknames, backhand position, preferred surface to play on, and information on the duration and dates of their career, as they were not relevant to the question we are answering in this project. From the remaining variables regarding the player's game stats, we omitted masters, grand slams, Davis cups, team cups, Olympics, weeks at No. 1, and tour finals because there was very limited data for these variables and including them in our analysis would require a significant reduction in the sample size of the data set. We also decided to omit current rank and current elo rating because some of the players may be retired and ranking decreases when a player is no longer playing, which would misrepresent the player's performance while they were active.





## Methods









### Loading and Cleaning the Data

For the data analysis of this project, we require functions from the pandas package and the altair package. We will load both packages below.

In [None]:
import pandas as pd
import altair as alt

To read the data into our notebook, we will use the url (https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS) and the read_csv function from the pandas package. At this point, we will name the data 'data'.

In [None]:
data = pd.read_csv("https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS")
data

To view all the columns of the data set, we will use the .columns attribute.

In [None]:
data.columns

Now that we have a list of the columns in this data set, we will select the columns we need to answer our project's question. The columns we will include are plays, best rank, prize money, best elo rank, peak elo rating, titles, and GOAT rank. We will name this 'data_filtered'.

In [None]:
data_filtered = data[['Plays', 'Best Rank', 'Prize Money', 'Best Elo Rank', 'Peak Elo Rating', 'Titles', 'GOAT Rank']]
data_filtered

Now that we have selected the columns we will be using for analysis, we want to omit the observations that are missing values by using the .dropna() function. This will allow us to perform K-nearest neighbors classification later on. We will name this 'data_no_na'.

In [None]:
data_no_na = data_filtered.dropna()
data_no_na

Many of the columns in the data set contain more than one value in a cell. For the best rank, best elo rank, and peak elo rating columns, there is a date included in the cell that tells us when the ranking or rating was obtained. Since we are not concerned with the date of achievements for the tennis players in answering our predictive question, we can omit this data. In the GOAT rank column, there is the ranking of the player and value of the points the player has that determines their ranking in brackets. Since both values go towards GOAT ranking, we will simplify the data by excluding the points of the player in the brackets. In the prize money column, there is additional information in the form of a string. Since the value of the player's prize money will be sufficient in our analysis, we will keep only the prize money value in this column. In the next step of cleaning the data, we will clean the data so each cell only contains one value, filter for only the values we need in our analysis and will convert the columns (besides the handedness column) from object columns to numeric data.

To clean the data so that there is only one value per cell, we will use the str.split function. We will complete this process in multiple steps, starting with the GOAT rank column. We will separate the values of the GOAT rank column by specifying that the split should be made on the space between the values. We will name this 'data_split_GOAT'. In this step, we will rename the columns so that when we combine all the adjusted columns later, it will be easy to differentiate them and easy to discard the ones we do not need. We will name the column we will be keeping 'GOAT Rank' and the column we will discard 'GOAT Discard'.

In [None]:
data_split_GOAT = data_no_na['GOAT Rank'].str.split(" ", expand = True)
data_split_GOAT = data_split_GOAT.rename(columns={0:"GOAT Rank", 1:"GOAT Discard"})
data_split_GOAT
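
To illustrate what str.split does with expand=True, here is a minimal sketch on a made-up Series. The values and the names `toy` and `toy_split` are hypothetical and are not taken from the data set; the `n=1` argument is an optional addition that guarantees exactly two resulting columns.

```python
import pandas as pd

# Hypothetical cells shaped like "rank (points)"; not real values from the data set.
toy = pd.Series(["12 (340)", "105 (27)"], name="GOAT Rank")

# Split on the first space only (n=1) so each cell becomes exactly two columns.
toy_split = toy.str.split(" ", n=1, expand=True)
toy_split = toy_split.rename(columns={0: "GOAT Rank", 1: "GOAT Discard"})

print(toy_split)
```

The split pieces are still strings at this point, which is why a numeric conversion is needed later.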

We will now separate the values of the peak elo rating column, naming it 'data_split_Peak_Elo'. Again, we will specify that the space is where the data should be split and rename the columns to 'Peak Elo Rating' and 'Peak Elo Discard'.

In [None]:
data_split_Peak_Elo = data_no_na['Peak Elo Rating'].str.split(" ", expand = True)
data_split_Peak_Elo = data_split_Peak_Elo.rename(columns={0:"Peak Elo Rating", 1:"Peak Elo Discard"})
data_split_Peak_Elo

We will do the same for the best elo rank column, naming it 'data_split_Best_Elo'. We will rename the columns to 'Best Elo Rank' and 'Best Elo Discard'.

In [None]:
data_split_Best_Elo = data_no_na['Best Elo Rank'].str.split(" ", expand = True)
data_split_Best_Elo = data_split_Best_Elo.rename(columns={0:"Best Elo Rank", 1:"Best Elo Discard"})
data_split_Best_Elo

We will do the same for the best rank column, naming it 'data_split_Best_Rank'. We will rename the columns to 'Best Rank' and 'Best Rank Discard'.

In [None]:
data_split_Best_Rank = data_no_na['Best Rank'].str.split(" ", expand = True)
data_split_Best_Rank = data_split_Best_Rank.rename(columns={0:"Best Rank", 1:"Best Rank Discard"})
data_split_Best_Rank

Finally, we will remove the '$', 'US', and all other unnecessary information from the prize money column, keeping only the numeric amount.

In [None]:
# Remove the thousands separators so each amount is one unbroken run of digits.
data_clean_Prize_Money = data_no_na['Prize Money'].str.replace(',', '', regex=False)
# Keep only the first run of digits. This drops the currency prefix and any
# trailing text, so separate str.strip calls (which trim sets of characters,
# not whole phrases) are not needed.
data_clean_Prize_Money = data_clean_Prize_Money.str.extract(r'(\d+)', expand=False)
# Convert the Series to a one-column data frame named 'Prize Money'.
data_clean_Prize_Money = data_clean_Prize_Money.to_frame(name="Prize Money")
data_clean_Prize_Money
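
As a minimal sketch of the digit extraction on its own, the hypothetical strings below (the formats are assumptions, not values from the data set) show how removing commas and capturing the first run of digits recovers a numeric amount regardless of the surrounding text.

```python
import pandas as pd

# Hypothetical prize-money strings; the formats are assumptions for illustration.
toy_money = pd.Series(["US$1,234,567", "$98,765 all-time leader in earnings"])

# Drop thousands separators, then capture the first run of digits.
amounts = toy_money.str.replace(",", "", regex=False).str.extract(r"(\d+)", expand=False)

# Convert the extracted strings to numbers.
amounts = pd.to_numeric(amounts)

print(amounts.tolist())
```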

Now that all of the columns have been split or cleaned so that each cell does not contain more than one value, we can combine the columns from each individual step so that we have all the data from the data_no_na data frame. Using the concat function from pandas, we will concatenate the plays and titles columns from the data_no_na data frame with data_split_GOAT, data_split_Peak_Elo, data_split_Best_Elo, data_split_Best_Rank, and data_clean_Prize_Money. This data frame will be called tennis.

In [None]:
tennis = pd.concat(
    [data_no_na['Plays'], data_no_na['Titles'], data_split_GOAT, data_split_Peak_Elo, data_split_Best_Elo, 
     data_split_Best_Rank, data_clean_Prize_Money],
    axis=1,
)
tennis
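
One property worth keeping in mind is that concat with axis=1 matches rows by index label rather than by position, which is what keeps each player's values together after dropna has removed some rows. The sketch below uses hypothetical pieces (`plays`, `ranks`, `money`) with made-up values to show this.

```python
import pandas as pd

# Hypothetical pieces that share a non-contiguous index, as left behind by dropna().
plays = pd.Series(["Left-handed", "Right-handed"], index=[2, 5], name="Plays")
ranks = pd.DataFrame({"Best Rank": ["15", "210"]}, index=[2, 5])
# This piece lists the same index labels in a different order.
money = pd.DataFrame({"Prize Money": ["98765", "1234"]}, index=[5, 2])

# axis=1 places the pieces side by side, matching rows by index label.
combined = pd.concat([plays, ranks, money], axis=1)
print(combined)
```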

Now that all of the data has been combined, we can drop the columns that aren't needed for our analysis (labeled with 'Discard' in their column titles) using the drop function. This data frame will be called tennis_clean.

In [None]:
tennis_clean = tennis.drop(columns=['GOAT Discard', 'Peak Elo Discard', 'Best Elo Discard', 'Best Rank Discard'])
tennis_clean

The final step of the data cleaning process is to ensure the predictor columns are numeric. The str.split function returns its results as strings. Since we will want to use functions on the predictor columns that treat them as numbers, we need to change the predictor columns to numeric data types using the pandas.to_numeric function. Included below is a code cell where the info function is applied to the tennis_clean data frame to see which columns are object types and therefore strings. We will keep the name 'tennis_clean' for the data frame once its predictor variables have been converted to numerical data, and this will be the final version of the clean data.

In [None]:
tennis_clean.info()

In [None]:
tennis_clean["Titles"] = pd.to_numeric(tennis_clean["Titles"])
tennis_clean["GOAT Rank"] = pd.to_numeric(tennis_clean["GOAT Rank"])
tennis_clean["Peak Elo Rating"] = pd.to_numeric(tennis_clean["Peak Elo Rating"])
tennis_clean["Best Elo Rank"] = pd.to_numeric(tennis_clean["Best Elo Rank"])
tennis_clean["Best Rank"] = pd.to_numeric(tennis_clean["Best Rank"])
tennis_clean["Prize Money"] = pd.to_numeric(tennis_clean["Prize Money"])
tennis_clean

To check that the predictor variables are in the form of numerical data, we will use the info function again.

In [None]:
tennis_clean.info()
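
The column-by-column conversion above could also be written more compactly. The following is a sketch on a hypothetical data frame (`toy` and `numeric_columns` are illustrative names, and the values are made up), applying pd.to_numeric to several columns at once.

```python
import pandas as pd

# Hypothetical frame with numeric values stored as strings.
toy = pd.DataFrame({
    "Plays": ["Left-handed", "Right-handed"],
    "Titles": ["3", "0"],
    "Best Rank": ["15", "210"],
})

numeric_columns = ["Titles", "Best Rank"]

# Apply pd.to_numeric to each selected column; by default it raises on bad values.
toy[numeric_columns] = toy[numeric_columns].apply(pd.to_numeric)

print(toy.dtypes)
```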

### Splitting the Data into Training Data and Testing Data and Exploring the Training Data

To split the data randomly to create the training and testing data, we will need to set the seed using the np.random.seed function from the numpy package. We will also need the sklearn package to split the data into training and testing data. We will import the train_test_split function from the sklearn package and the numpy package below.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

Now, we will split the data into the training and testing data. We will set the seed using the np.random.seed function and then use the train_test_split function on the tennis_clean data frame. As we want to balance the size of the training and testing data sets so that we can train a relatively accurate model and have a good evaluation of the model's performance, we will split the data so that 75% of the data is training data and 25% is testing data. Lastly, we will stratify the data based on the class label (plays) so that roughly the same proportions of each label are present in the training and testing data. The training data will be called 'tennis_train', and the testing data will be called 'tennis_test'.

In [None]:
np.random.seed(1)

tennis_train, tennis_test = train_test_split(
    tennis_clean, train_size=0.75, stratify=tennis_clean['Plays']
)
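
To show what stratification does, here is a small sketch on hypothetical data with an 80/20 class split. The data frame `toy` is made up, and passing `random_state` directly is shown as an alternative way to make the split reproducible, not as what the cell above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 right-handed and 20 left-handed rows.
toy = pd.DataFrame({
    "Plays": ["Right-handed"] * 80 + ["Left-handed"] * 20,
    "Titles": range(100),
})

# random_state fixes the shuffle for this call only, without a global seed.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["Plays"], random_state=1
)

# Both splits keep the original 80/20 class proportions.
print(toy_train["Plays"].value_counts(normalize=True))
print(toy_test["Plays"].value_counts(normalize=True))
```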

We will now use the info function to check that the data has been split correctly.

In [None]:
tennis_train.info()

In [None]:
tennis_test.info()

One concern in answering our predictive question is that there may be significantly fewer left-handed observations in the data set than right-handed observations. This may impact the model's left-handed classification, as the K-nearest neighbors algorithm bases classification on the majority class of nearby points. To check the percentage of each class (right-handed and left-handed) in the data set, we will use the value_counts function with the normalize argument set to True.

In [None]:
tennis_clean['Plays'].value_counts(normalize=True)

In [None]:
tennis_train['Plays'].value_counts(normalize=True)

In [None]:
tennis_test['Plays'].value_counts(normalize=True)

Visualizing the number of observations in each class with a bar plot also allows us to check whether there are significantly fewer left-handed observations than right-handed observations. We will use the Chart class from the altair package to create the plot and specify that we want a bar plot by using mark_bar. The x-axis will be the different classes and the y-axis will be the count for each class. We will name this plot 'classification_count_plot'.

In [None]:
classification_count_plot = alt.Chart(tennis_train).mark_bar().encode(
    x=alt.X("Plays").title("Handedness"),
    y=alt.Y("count()").title("Count")
)
classification_count_plot

As suspected, there is a much greater proportion of right-handed observations than left-handed observations. In an effort to decrease the class imbalance and limit its effect on the training of our model, we will oversample the left-handed class in the training data. We will need the resample function from the sklearn package, so we will begin by importing that. Then, we will split the training data into two groups by class through filtering. Next, we will use the resample function on the left-handed observations to increase their number to match the right-handed observations by setting the n_samples argument to the number of right-handed observations. Finally, we will use the concat function to concatenate the data back together.

In [None]:
from sklearn.utils import resample

right_tennis = tennis_train[tennis_train['Plays'] == 'Right-handed']
left_tennis = tennis_train[tennis_train['Plays'] == 'Left-handed']

left_tennis_upsample = resample(
    left_tennis, n_samples = right_tennis.shape[0]
)
    
tennis_train = pd.concat((left_tennis_upsample, right_tennis))

To check that the number of observations in each class is now the same, we will use the value_counts function.

In [None]:
tennis_train['Plays'].value_counts()
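
The oversampling step can be sketched on a tiny hypothetical data set to make its effect visible. All values below are made up, and `random_state` is added here only so the sketch is reproducible.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training data: 6 right-handed rows and 2 left-handed rows.
toy_train = pd.DataFrame({
    "Plays": ["Right-handed"] * 6 + ["Left-handed"] * 2,
    "Titles": [0, 1, 2, 3, 4, 5, 6, 7],
})

right = toy_train[toy_train["Plays"] == "Right-handed"]
left = toy_train[toy_train["Plays"] == "Left-handed"]

# Sample with replacement until the minority class matches the majority class.
left_upsampled = resample(left, n_samples=len(right), replace=True, random_state=1)

balanced = pd.concat([left_upsampled, right])
print(balanced["Plays"].value_counts())
```

Note that the added left-handed rows are exact copies of existing rows, so oversampling balances the class counts without adding new information.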

To explore the data further, we will also calculate the means of each predictor variable in our analysis using the mean function and specifying the numeric_only argument as true to exclude the classifier variable.

In [None]:
tennis_train.mean(numeric_only=True)

The last step of our exploratory data analysis is to visualize the training data. To do so, we will create several scatter plots of the training data, coloring the left-handed and right-handed classes differently to visualize their distribution. We will use the Chart class from altair and specify that we want a scatter plot by using mark_circle. For this first scatter plot, we will designate the x variable as prize money and the y variable as best rank. We will name this plot br_pm_plot.

In [None]:
br_pm_plot = alt.Chart(tennis_train).mark_circle().encode(
    x=alt.X("Prize Money").title("Total Prize Money (USD)"),
    y=alt.Y("Best Rank").title("Player's Best Rank"),
    color=alt.Color("Plays").title("Handedness")
)
br_pm_plot

In [None]:
per_ber_plot = alt.Chart(tennis_train).mark_circle().encode(
    x=alt.X("Best Elo Rank").title("Player's Best Elo Rank"),
    y=alt.Y("Peak Elo Rating").title("Player's Peak Elo Rating"),
    color=alt.Color("Plays").title("Handedness")
)
per_ber_plot

In [None]:
# GOAT rank against number of titles, colored by handedness.
goat_titles_plot = alt.Chart(tennis_train).mark_circle().encode(
    x=alt.X("GOAT Rank").title("Player's GOAT Rank"),
    y=alt.Y("Titles").title("Number of Titles"),
    color=alt.Color("Plays").title("Handedness")
)
goat_titles_plot

### Data Analysis

To begin our KNN classification, we will create our preprocessor. Our preprocessor will use the make_column_transformer function to select the predictor variables to scale. Once selected, our preprocessor will use StandardScaler() to transform the predictor variables so that they have a mean of 0 and a standard deviation of 1. Scaling does not remove outliers; it puts all predictors on a comparable scale so that variables with large values, such as prize money, do not dominate the Euclidean distance used by K-nearest neighbors.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer


tennis_preprocessor = make_column_transformer(
    (StandardScaler(), ['Best Rank', 'Prize Money', 'Best Elo Rank', 'Peak Elo Rating', 'Titles', 'GOAT Rank']),
)
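
As a small illustration of what StandardScaler does, the sketch below standardizes a hypothetical array (`toy_X`, with made-up values on very different scales) and checks the resulting column means and standard deviations.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor values on very different scales, with one extreme value.
toy_X = np.array([
    [1.0, 1_000.0],
    [2.0, 2_000.0],
    [3.0, 3_000.0],
    [4.0, 90_000.0],
])

scaled = StandardScaler().fit_transform(toy_X)

# Each column now has mean 0 and (population) standard deviation 1.
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))

# The extreme value is rescaled, not removed: it is still the largest entry.
print(scaled[:, 1])
```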

After creating our preprocessor, we will use KNeighborsClassifier() to create our model and set our K value through its n_neighbors argument. We have set our K value to 3 as XXXX. Using the training split of our tennis data, we have assigned our predictor variables to "X" and our response variable to "y".

We created a pipeline using the make_pipeline() function so that the preprocessing and classification steps are applied together to any data we pass to our model. Using the .fit() method, we will fit the pipeline to our training data. Fitting only on the training data keeps the test split unseen, so the accuracy we measure on it reflects how the model performs on new observations.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

knn = KNeighborsClassifier(n_neighbors=3)

X = tennis_train[['Best Rank', 'Prize Money', 'Best Elo Rank', 'Peak Elo Rating', 'Titles', 'GOAT Rank']]
y = tennis_train['Plays']

knn_fit = make_pipeline(tennis_preprocessor, knn).fit(X, y)
knn_fit
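
One common way to choose K, shown here only as a sketch and not as the procedure used in this notebook, is to compare cross-validated accuracy across candidate values with GridSearchCV. The data below is synthetic, and the candidate range and the names `toy_X`, `toy_y`, `toy_pipeline`, and `toy_search` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): two predictors, two classes.
rng = np.random.default_rng(1)
toy_X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
toy_y = np.array(["Left-handed"] * 40 + ["Right-handed"] * 40)

toy_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate K values to compare; the range is arbitrary for this sketch.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 16, 2))}

# 5-fold cross-validation, scoring each candidate K by accuracy.
toy_search = GridSearchCV(toy_pipeline, param_grid, cv=5, scoring="accuracy")
toy_search.fit(toy_X, toy_y)

best_k = toy_search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, round(toy_search.best_score_, 3))
```

If this approach were applied to oversampled training data, duplicated rows could appear in both the fitting and validation folds, which would likely make the cross-validated accuracy look better than it is.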

After training our model, we will apply it to our test split. This is done with the .predict() method, which runs the pipeline to make predictions on unseen data. In our case, the model will predict handedness from our predictor variables. Using the .assign() function, the predictions from our model will be displayed in an additional column within our test data frame.

To determine the accuracy of the model, we will call the .score() method on the predictor and response variables of the test split.

In [None]:
tennis_test_predictions = tennis_test.assign(
    predicted = knn_fit.predict(tennis_test[['Best Rank', 'Prize Money', 'Best Elo Rank', 'Peak Elo Rating', 'Titles', 'GOAT Rank']])
)
tennis_test_predictions[['Plays', 'predicted']]

tennis_acc_1 = knn_fit.score(
    tennis_test[['Best Rank', 'Prize Money', 'Best Elo Rank', 'Peak Elo Rating', 'Titles', 'GOAT Rank']],
    tennis_test["Plays"]
)
tennis_acc_1

Our score of 0.5 means that our model classified 50% of the observations in the test set correctly.
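
An accuracy figure is easier to interpret next to a baseline and a breakdown by class. The sketch below uses hypothetical true and predicted labels (not this notebook's results) to compute accuracy, the accuracy of always predicting the most common class, and a confusion matrix.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels; not the notebook's actual results.
toy_true = pd.Series(["Right-handed"] * 8 + ["Left-handed"] * 2)
toy_pred = pd.Series(
    ["Right-handed"] * 4 + ["Left-handed"] * 4 + ["Left-handed", "Right-handed"]
)

# Fraction of predictions that match the true labels.
accuracy = (toy_true == toy_pred).mean()

# Accuracy of always predicting the most common class in the true labels.
majority_baseline = toy_true.value_counts(normalize=True).max()

# Rows are true classes and columns are predicted classes, in the order of `labels`.
labels = ["Left-handed", "Right-handed"]
matrix = confusion_matrix(toy_true, toy_pred, labels=labels)

print(accuracy, majority_baseline)
print(pd.DataFrame(matrix, index=labels, columns=labels))
```

In this made-up example the accuracy falls below the majority-class baseline, which is the kind of comparison that helps judge whether a model has learned anything about handedness.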

### Methods
Using KNN, we will use best rank, prize money, best elo rank, peak elo rating, number of titles, and GOAT (greatest of all time) rank as predictors to determine whether a player is left- or right-handed (the response variable). We will examine the output of our model and compare whether top-ranked players are left- or right-handed.
We chose not to use the rest of the columns as they were not relevant to the players' skills or there were not enough observations for the data to be meaningful. Using a scatter plot, we will display the ranking (y-axis) of each player, number of titles (x-axis), and color code handedness. In doing so, we hope to show whether there is a correlation between handedness and tennis success.



### Expected outcomes and significance









In this project, we hope to identify whether there is a correlation between handedness and tennis success. Beyond exploring this hypothesis, the data could also be used to recognize where points are lost to an opponent given their handedness, which could help players develop and strategize offensive and defensive plays.

### References

Holtzen, D. W. (2000). Handedness and professional tennis. International Journal of Neuroscience, 105(1-4), 101–119. https://doi.org/10.3109/00207450009003270

FlashScore. (2019, December 2). Ultimate Tennis Statistics. https://www.ultimatetennisstatistics.com/