In [61]:
pip install -U altair 

Note: you may need to restart the kernel to use updated packages.


# Handedness and Tennis Success 

## Introduction

It was hypothesized that left-handed tennis players have better-developed tennis skills than right-handed players (Holtzen, 2000). The argument is that left-handed players can serve more strategically against their opponents because of more developed spatial, motor, and attentional functions (Holtzen, 2000). Given this hypothesis, we want to examine whether there is an association between handedness and tennis success.

The data set we will use is Player Stats for Top 500 Players from Ultimate Tennis Statistics (https://www.ultimatetennisstatistics.com). It includes each player's name, age, country, handedness, source of information, current ranking, best ranking, backhand style, prize money, height, weight, coach, preferred surface, information on the duration and dates of their career, social media, and personal game stats.

Because our question is whether handedness, and specifically left-handedness, has an impact on success in tennis, we will designate handedness as the class label and best rank, prize money, best Elo rank, peak Elo rating, titles, and GOAT rank as the predictor variables. We chose these predictor variables because they all relate to the success of a tennis player's career. We omitted the player's personal information (name, age, height, weight, country), social media and information source, coach, nicknames, backhand style, preferred surface, and information on the duration and dates of their career, as they are not relevant to the question we are answering in this project. From the remaining game-stat variables, we omitted masters, grand slams, Davis cups, team cups, Olympics, weeks at No. 1, and tour finals because very little data is available for them, and including them in our analysis would require a significant reduction in the sample size of the data set. We also omitted current rank and current Elo rating because some of the players may be retired; ranking decreases when a player is no longer playing, which would misrepresent the player's performance when they were active.





## Methods









### Loading and Cleaning the Data

For the data analysis of this project, we require functions from the pandas package and the altair package. We will load both packages below.

In [62]:
import pandas as pd
import altair as alt

To read the data into our notebook, we will pass the URL (https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS) to the read_csv function from the pandas package. At this point, we will name the data frame 'data'.

In [63]:
data = pd.read_csv("https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS")
data

Unnamed: 0.1,Unnamed: 0,Age,Country,Plays,Wikipedia,Current Rank,Best Rank,Name,Backhand,Prize Money,...,Facebook,Twitter,Nicknames,Grand Slams,Davis Cups,Web Site,Team Cups,Olympics,Weeks at No. 1,Tour Finals
0,0,26 (25-04-1993),Brazil,Right-handed,Wikipedia,378 (97),363 (04-11-2019),Oscar Jose Gutierrez,,,...,,,,,,,,,,
1,1,18 (22-12-2001),United Kingdom,Left-handed,Wikipedia,326 (119),316 (14-10-2019),Jack Draper,Two-handed,"$59,040",...,,,,,,,,,,
2,2,32 (03-11-1987),Slovakia,Right-handed,Wikipedia,178 (280),44 (14-01-2013),Lukas Lacko,Two-handed,"US$3,261,567",...,,,,,,,,,,
3,3,21 (29-05-1998),"Korea, Republic of",Right-handed,Wikipedia,236 (199),130 (10-04-2017),Duck Hee Lee,Two-handed,"$374,093",...,,,,,,,,,,
4,4,27 (21-10-1992),Australia,Right-handed,Wikipedia,183 (273),17 (11-01-2016),Bernard Tomic,Two-handed,"US$6,091,971",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,495,20 (13-04-1999),France,Right-handed,Wikipedia,382 (95),380 (11-11-2019),Dan Added,Two-handed,"$57,943",...,,,,,,,,,,
496,496,26 (03-09-1993),Austria,Right-handed,Wikipedia,5 (5890),4 (06-11-2017),Dominic Thiem,One-handed,"$22,132,368 15th all-time leader in earnings",...,1.Dominic.Thiem,@ThiemDomi,Dominator,,,dominicthiem.tennis,,,,
497,497,23 (14-03-1996),Netherlands,Left-handed,Wikipedia,495 (60),342 (05-08-2019),Gijs Brouwer,,,...,,,,,,,,,,
498,498,24 (17-05-1995),Ukraine,,Wikipedia,419 (81),419 (20-01-2020),Vladyslav Orlov,,,...,,,,,,,,,,


To list all the columns of the data set, we will use the .columns attribute.

In [64]:
data.columns

Index(['Unnamed: 0', 'Age', 'Country', 'Plays', 'Wikipedia', 'Current Rank',
       'Best Rank', 'Name', 'Backhand', 'Prize Money', 'Height',
       'Favorite Surface', 'Turned Pro', 'Seasons', 'Active',
       'Current Elo Rank', 'Best Elo Rank', 'Peak Elo Rating',
       'Last Appearance', 'Titles', 'GOAT Rank', 'Best Season', 'Retired',
       'Masters', 'Birthplace', 'Residence', 'Weight', 'Coach', 'Facebook',
       'Twitter', 'Nicknames', 'Grand Slams', 'Davis Cups', 'Web Site',
       'Team Cups', 'Olympics', 'Weeks at No. 1', 'Tour Finals'],
      dtype='object')
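
The 'Unnamed: 0' column in the listing above usually appears when a data frame's row index was written into the CSV file. The following is a minimal sketch of how pandas treats such a column, using a small made-up CSV string rather than the project's file; the names `csv_text`, `with_extra`, and `without_extra` are illustrative only.

```python
import io
import pandas as pd

# Made-up CSV text that mimics a file saved with its row index as the first column.
csv_text = ",Name,Plays\n0,Player A,Right-handed\n1,Player B,Left-handed\n"

# Default read: the unnamed first column is loaded as a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))  # ['Unnamed: 0', 'Name', 'Plays']

# index_col=0 tells pandas to use the first column as the row index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(without_extra.columns))  # ['Name', 'Plays']
```

Since we select only the columns we need in the next step, the extra column does not affect this analysis either way.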

Now that we have a list of the columns in this data set, we will select the columns we need to answer our project's question. The columns we will include are plays, best rank, prize money, best Elo rank, peak Elo rating, titles, and GOAT rank. We will name this data frame 'data_filtered'.

In [65]:
data_filtered = data[['Plays', 'Best Rank', 'Prize Money', 'Best Elo Rank', 'Peak Elo Rating', 'Titles', 'GOAT Rank']]
data_filtered

Unnamed: 0,Plays,Best Rank,Prize Money,Best Elo Rank,Peak Elo Rating,Titles,GOAT Rank
0,Right-handed,363 (04-11-2019),,,,,
1,Left-handed,316 (14-10-2019),"$59,040",,,,
2,Right-handed,44 (14-01-2013),"US$3,261,567",60 (06-02-2012),1886 (06-02-2012),,
3,Right-handed,130 (10-04-2017),"$374,093",,,,
4,Right-handed,17 (11-01-2016),"US$6,091,971",21 (23-03-2015),2037 (01-02-2016),4.0,264 (6)
...,...,...,...,...,...,...,...
495,Right-handed,380 (11-11-2019),"$57,943",,,,
496,Right-handed,4 (06-11-2017),"$22,132,368 15th all-time leader in earnings",5 (18-11-2019),2211 (18-11-2019),16.0,58 (58)
497,Left-handed,342 (05-08-2019),,,,,
498,,419 (20-01-2020),,,,,


Now that we have selected the columns we will be using for analysis, we want to omit the observations that are missing values by using the .dropna() function. This will allow us to perform K-nearest neighbors classification later on. We will name this 'data_no_na'.

In [66]:
data_no_na = data_filtered.dropna()
data_no_na

Unnamed: 0,Plays,Best Rank,Prize Money,Best Elo Rank,Peak Elo Rating,Titles,GOAT Rank
4,Right-handed,17 (11-01-2016),"US$6,091,971",21 (23-03-2015),2037 (01-02-2016),4.0,264 (6)
5,Right-handed,31 (20-01-2020),"$1,517,157",33 (19-01-2020),1983 (20-01-2020),1.0,489 (1)
11,Right-handed,4 (09-09-2019),"US$ 10,507,693",4 (14-10-2019),2243 (14-10-2019),7.0,109 (27)
15,Right-handed,3 (13-08-2018),"US$25,889,586 11th all-time leader in earnings",3 (07-06-2010),2329 (14-09-2009),22.0,33 (109)
19,Right-handed,25 (05-08-2019),"US$2,722,314",29 (05-08-2019),1999 (05-08-2019),1.0,489 (1)
...,...,...,...,...,...,...,...
473,Right-handed,8 (21-08-2006),"$8,918,917",9 (20-08-2007),2098 (18-02-2008),4.0,169 (15)
484,Right-handed,29 (11-02-2019),"US$3,304,117",36 (13-05-2019),1962 (28-01-2019),1.0,400 (2)
487,Left-handed,58 (20-03-2017),"$1,703,096",43 (10-01-2020),1952 (10-01-2020),1.0,489 (1)
490,Right-handed,74 (19-02-2018),"US$1,048,314",57 (12-02-2018),1904 (12-02-2018),1.0,489 (1)
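
Because dropna removes every row that has at least one missing value, it can shrink the data set considerably. The sketch below, on a made-up data frame (`toy` is not the project data), shows one way to count missing values per column and compare row counts before and after.

```python
import numpy as np
import pandas as pd

# Made-up example data; the real data frame in this notebook is data_filtered.
toy = pd.DataFrame({
    "Plays": ["Right-handed", "Left-handed", None, "Right-handed"],
    "Titles": [4.0, np.nan, 2.0, 1.0],
})

# Missing values per column show which columns are responsible for dropped rows.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# With default arguments, dropna() removes any row containing a missing value.
toy_no_na = toy.dropna()
print(len(toy), "rows before,", len(toy_no_na), "rows after")
```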


Many of the columns in the data set contain more than one value in a cell. In the best rank, best Elo rank, and peak Elo rating columns, each cell includes the date on which the ranking or rating was obtained. Since the dates of the players' achievements are not relevant to our predictive question, we can omit them. In the GOAT rank column, each cell contains the player's ranking followed, in brackets, by the points that determine that ranking. Since both values describe the same GOAT ranking, we will simplify the data by keeping the rank and excluding the points in brackets. In the prize money column, some cells contain additional text. Since the amount of prize money is sufficient for our analysis, we will keep only that number. In the next step, we will clean the data so that each cell contains only one value, keep only the values we need for our analysis, and convert the columns (besides the handedness column) from object columns to numeric data.

To clean the data so that there is only one value per cell, we will use the str.split function. We will complete this process in multiple steps, starting with the GOAT rank column. We will separate the values of the GOAT rank column by specifying that the split should be made on the space between the values. We will name the result 'data_split_GOAT'. In this step, we will also rename the columns so that when we combine all the adjusted columns later, it will be easy to tell them apart and to discard the ones we do not need. We will name the column we are keeping 'GOAT Rank' and the column we will discard 'GOAT Discard'.

In [67]:
data_split_GOAT = data_no_na['GOAT Rank'].str.split(" ", expand = True)
data_split_GOAT = data_split_GOAT.rename(columns={0:"GOAT Rank", 1:"GOAT Discard"})
data_split_GOAT

Unnamed: 0,GOAT Rank,GOAT Discard
4,264,(6)
5,489,(1)
11,109,(27)
15,33,(109)
19,489,(1)
...,...,...
473,169,(15)
484,400,(2)
487,489,(1)
490,489,(1)


We will now separate the values of the peak elo rating column, naming it 'data_split_Peak_Elo'. Again, we will specify that the space is where the data should be split and rename the columns to 'Peak Elo Rating' and 'Peak Elo Discard'.

In [68]:
data_split_Peak_Elo = data_no_na['Peak Elo Rating'].str.split(" ", expand = True)
data_split_Peak_Elo = data_split_Peak_Elo.rename(columns={0:"Peak Elo Rating", 1:"Peak Elo Discard"})
data_split_Peak_Elo

Unnamed: 0,Peak Elo Rating,Peak Elo Discard
4,2037,(01-02-2016)
5,1983,(20-01-2020)
11,2243,(14-10-2019)
15,2329,(14-09-2009)
19,1999,(05-08-2019)
...,...,...
473,2098,(18-02-2008)
484,1962,(28-01-2019)
487,1952,(10-01-2020)
490,1904,(12-02-2018)


We will do the same for the best elo rank column, naming it 'data_split_Best_Elo'. We will rename the columns to 'Best Elo Rank' and 'Best Elo Discard'.

In [69]:
data_split_Best_Elo = data_no_na['Best Elo Rank'].str.split(" ", expand = True)
data_split_Best_Elo = data_split_Best_Elo.rename(columns={0:"Best Elo Rank", 1:"Best Elo Discard"})
data_split_Best_Elo

Unnamed: 0,Best Elo Rank,Best Elo Discard
4,21,(23-03-2015)
5,33,(19-01-2020)
11,4,(14-10-2019)
15,3,(07-06-2010)
19,29,(05-08-2019)
...,...,...
473,9,(20-08-2007)
484,36,(13-05-2019)
487,43,(10-01-2020)
490,57,(12-02-2018)


We will do the same for the best rank column, naming it 'data_split_Best_Rank'. We will rename the columns to 'Best Rank' and 'Best Rank Discard'.

In [70]:
data_split_Best_Rank = data_no_na['Best Rank'].str.split(" ", expand = True)
data_split_Best_Rank = data_split_Best_Rank.rename(columns={0:"Best Rank", 1:"Best Rank Discard"})
data_split_Best_Rank

Unnamed: 0,Best Rank,Best Rank Discard
4,17,(11-01-2016)
5,31,(20-01-2020)
11,4,(09-09-2019)
15,3,(13-08-2018)
19,25,(05-08-2019)
...,...,...
473,8,(21-08-2006)
484,29,(11-02-2019)
487,58,(20-03-2017)
490,74,(19-02-2018)
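
The four splits above follow the same pattern: keep the text before the first space. As a sketch of how that pattern could be written once and reused, the hypothetical helper `keep_first_token` below is applied to a small data frame whose cells are copied from the rows shown earlier. It is an illustration, not the approach used in the rest of this notebook.

```python
import pandas as pd

def keep_first_token(column: pd.Series) -> pd.Series:
    """Return the text before the first space in each cell (hypothetical helper)."""
    return column.str.split(" ", n=1).str[0]

# Two rows copied from the data_no_na output shown earlier.
toy = pd.DataFrame({
    "Best Rank": ["17 (11-01-2016)", "31 (20-01-2020)"],
    "GOAT Rank": ["264 (6)", "489 (1)"],
})

# DataFrame.apply calls the helper once per column.
first_values = toy.apply(keep_first_token)
print(first_values)
```

The values are still strings at this point, so a conversion to numeric types is needed afterwards, as in the notebook.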


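Finally, we will remove the '$', 'US', thousands separators, and all other unnecessary information from the prize money column, keeping only the numeric amount. We will name the result 'data_clean_Prize_Money'.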

In [71]:
# Remove the thousands separators so each amount is one unbroken run of digits.
data_clean_Prize_Money = data_no_na['Prize Money'].str.replace(',', '', regex=False)
# Keep only the first run of digits. This drops prefixes such as '$' and 'US$' and
# trailing text such as '15th all-time leader in earnings'. (str.strip is not needed:
# it removes a set of characters from both ends, not a literal substring.)
data_clean_Prize_Money = data_clean_Prize_Money.str.extract(r'(\d+)', expand=False)
# Store the result as a one-column data frame named 'Prize Money'.
data_clean_Prize_Money = data_clean_Prize_Money.rename("Prize Money").to_frame()
data_clean_Prize_Money

Unnamed: 0,Prize Money
4,6091971
5,1517157
11,10507693
15,25889586
19,2722314
...,...
473,8918917
484,3304117
487,1703096
490,1048314
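
To see why this cleaning isolates the amount, here is a small sketch on three strings in the formats that appear in the data above. Note that pandas `str.strip` removes any leading and trailing characters drawn from the given set rather than a literal prefix or suffix, so the step that reliably isolates the amount is removing the commas and then extracting the first run of digits. The variable names `raw`, `no_commas`, and `amounts` are illustrative.

```python
import pandas as pd

# Three prize-money strings in the formats seen in the data set.
raw = pd.Series([
    "$59,040",
    "US$ 10,507,693",
    "$22,132,368 15th all-time leader in earnings",
])

# Remove thousands separators so each amount is one unbroken run of digits.
no_commas = raw.str.replace(",", "", regex=False)

# Extract the first run of digits and convert it to an integer column.
amounts = pd.to_numeric(no_commas.str.extract(r"(\d+)", expand=False))
print(amounts.tolist())  # [59040, 10507693, 22132368]
```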


Now that all of the columns have been split or cleaned so that each cell contains only one value, we can combine the columns from each individual step so that we have all the data from the data_no_na data frame. Using the concat function from pandas, we will concatenate the plays and titles columns from the data_no_na data frame with data_split_GOAT, data_split_Peak_Elo, data_split_Best_Elo, data_split_Best_Rank, and data_clean_Prize_Money. This data frame will be called 'tennis'.

In [72]:
tennis = pd.concat(
    [data_no_na['Plays'], data_no_na['Titles'], data_split_GOAT, data_split_Peak_Elo, data_split_Best_Elo, 
     data_split_Best_Rank, data_clean_Prize_Money],
    axis=1,
)
tennis

Unnamed: 0,Plays,Titles,GOAT Rank,GOAT Discard,Peak Elo Rating,Peak Elo Discard,Best Elo Rank,Best Elo Discard,Best Rank,Best Rank Discard,Prize Money
4,Right-handed,4.0,264,(6),2037,(01-02-2016),21,(23-03-2015),17,(11-01-2016),6091971
5,Right-handed,1.0,489,(1),1983,(20-01-2020),33,(19-01-2020),31,(20-01-2020),1517157
11,Right-handed,7.0,109,(27),2243,(14-10-2019),4,(14-10-2019),4,(09-09-2019),10507693
15,Right-handed,22.0,33,(109),2329,(14-09-2009),3,(07-06-2010),3,(13-08-2018),25889586
19,Right-handed,1.0,489,(1),1999,(05-08-2019),29,(05-08-2019),25,(05-08-2019),2722314
...,...,...,...,...,...,...,...,...,...,...,...
473,Right-handed,4.0,169,(15),2098,(18-02-2008),9,(20-08-2007),8,(21-08-2006),8918917
484,Right-handed,1.0,400,(2),1962,(28-01-2019),36,(13-05-2019),29,(11-02-2019),3304117
487,Left-handed,1.0,489,(1),1952,(10-01-2020),43,(10-01-2020),58,(20-03-2017),1703096
490,Right-handed,1.0,489,(1),1904,(12-02-2018),57,(12-02-2018),74,(19-02-2018),1048314


Now that all of the data has been combined, we can drop the columns that aren't needed for our analysis (labeled with 'Discard' in their column titles) using the drop function. This data frame will be called 'tennis_clean'.

In [73]:
tennis_clean = tennis.drop(columns=['GOAT Discard', 'Peak Elo Discard', 'Best Elo Discard', 'Best Rank Discard'])
tennis_clean

Unnamed: 0,Plays,Titles,GOAT Rank,Peak Elo Rating,Best Elo Rank,Best Rank,Prize Money
4,Right-handed,4.0,264,2037,21,17,6091971
5,Right-handed,1.0,489,1983,33,31,1517157
11,Right-handed,7.0,109,2243,4,4,10507693
15,Right-handed,22.0,33,2329,3,3,25889586
19,Right-handed,1.0,489,1999,29,25,2722314
...,...,...,...,...,...,...,...
473,Right-handed,4.0,169,2098,9,8,8918917
484,Right-handed,1.0,400,1962,36,29,3304117
487,Left-handed,1.0,489,1952,43,58,1703096
490,Right-handed,1.0,489,1904,57,74,1048314


The final step of the data cleaning process is to ensure the predictor columns are numeric. The str.split function returns strings, so the columns we split are stored as object columns. Since we will want to use functions that treat the predictor columns as numbers, we need to convert them to numeric data types using the pandas.to_numeric function. The code cell below applies the info function to the tennis_clean data frame to check which columns are object types and therefore strings. After the conversion, we will keep the name 'tennis_clean' for the data frame, and this will be the final version of the clean data.

In [74]:
tennis_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 4 to 496
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Plays            95 non-null     object 
 1   Titles           95 non-null     float64
 2   GOAT Rank        95 non-null     object 
 3   Peak Elo Rating  95 non-null     object 
 4   Best Elo Rank    95 non-null     object 
 5   Best Rank        95 non-null     object 
 6   Prize Money      95 non-null     object 
dtypes: float64(1), object(6)
memory usage: 5.9+ KB


In [75]:
tennis_clean["Titles"] = pd.to_numeric(tennis_clean["Titles"])
tennis_clean["GOAT Rank"] = pd.to_numeric(tennis_clean["GOAT Rank"])
tennis_clean["Peak Elo Rating"] = pd.to_numeric(tennis_clean["Peak Elo Rating"])
tennis_clean["Best Elo Rank"] = pd.to_numeric(tennis_clean["Best Elo Rank"])
tennis_clean["Best Rank"] = pd.to_numeric(tennis_clean["Best Rank"])
tennis_clean["Prize Money"] = pd.to_numeric(tennis_clean["Prize Money"])
tennis_clean

Unnamed: 0,Plays,Titles,GOAT Rank,Peak Elo Rating,Best Elo Rank,Best Rank,Prize Money
4,Right-handed,4.0,264,2037,21,17,6091971
5,Right-handed,1.0,489,1983,33,31,1517157
11,Right-handed,7.0,109,2243,4,4,10507693
15,Right-handed,22.0,33,2329,3,3,25889586
19,Right-handed,1.0,489,1999,29,25,2722314
...,...,...,...,...,...,...,...
473,Right-handed,4.0,169,2098,9,8,8918917
484,Right-handed,1.0,400,1962,36,29,3304117
487,Left-handed,1.0,489,1952,43,58,1703096
490,Right-handed,1.0,489,1904,57,74,1048314


To check that the predictor variables are in the form of numerical data, we will use the info function again.

In [76]:
tennis_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 4 to 496
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Plays            95 non-null     object 
 1   Titles           95 non-null     float64
 2   GOAT Rank        95 non-null     int64  
 3   Peak Elo Rating  95 non-null     int64  
 4   Best Elo Rank    95 non-null     int64  
 5   Best Rank        95 non-null     int64  
 6   Prize Money      95 non-null     int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 5.9+ KB
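
The column-by-column conversion above can also be written more compactly. The following sketch, on a made-up data frame (`toy` and `numeric_columns` are illustrative names), applies pandas.to_numeric to several columns at once with DataFrame.apply.

```python
import pandas as pd

# Made-up data frame whose numeric values are stored as strings.
toy = pd.DataFrame({
    "Plays": ["Right-handed", "Left-handed"],
    "GOAT Rank": ["264", "489"],
    "Best Rank": ["17", "58"],
})

# Convert only the listed columns; the class label column is left untouched.
numeric_columns = ["GOAT Rank", "Best Rank"]
toy[numeric_columns] = toy[numeric_columns].apply(pd.to_numeric)
print(toy.dtypes)
```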


### Splitting the Data into Training Data and Testing Data and Exploring the Training Data

To split the data randomly into training and testing data in a reproducible way, we will need to set the seed using the np.random.seed function from the numpy package. We will also need the sklearn package to perform the split. We will import the numpy package and the train_test_split function from the sklearn package below.

In [77]:
import numpy as np
from sklearn.model_selection import train_test_split

Now, we will split the data into the training and testing data. We will set the seed using the np.random.seed function and then use the train_test_split function on the tennis_clean data frame. We want to balance the sizes of the two sets so that we can train a relatively accurate model and still have a good evaluation of the model's performance, so we will split the data so that 75% is training data and 25% is testing data. Lastly, we will stratify the data based on the class label (plays) so that roughly the same proportion of each label ends up in the training and testing data. The training data will be called 'tennis_train', and the testing data will be called 'tennis_test'.

In [78]:
np.random.seed(1)

tennis_train, tennis_test = train_test_split(
    tennis_clean, train_size=0.75, stratify=tennis_clean['Plays']
)
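
As a sketch of what stratification does, the example below splits a made-up data frame with an 85/15 class balance and prints the class proportions in each part. It passes `random_state=1` directly to train_test_split, which is an alternative way of making the split reproducible and is an assumption of this sketch rather than what the cell above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 85 right-handed and 15 left-handed rows.
toy = pd.DataFrame({
    "Plays": ["Right-handed"] * 85 + ["Left-handed"] * 15,
    "Titles": range(100),
})

# Stratifying on the class label keeps the class proportions similar in both parts.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["Plays"], random_state=1
)

print(toy_train["Plays"].value_counts(normalize=True))
print(toy_test["Plays"].value_counts(normalize=True))
```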

We will now use the info function to check that the data has been split correctly.

In [79]:
tennis_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71 entries, 207 to 62
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Plays            71 non-null     object 
 1   Titles           71 non-null     float64
 2   GOAT Rank        71 non-null     int64  
 3   Peak Elo Rating  71 non-null     int64  
 4   Best Elo Rank    71 non-null     int64  
 5   Best Rank        71 non-null     int64  
 6   Prize Money      71 non-null     int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 4.4+ KB


In [80]:
tennis_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 316 to 29
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Plays            24 non-null     object 
 1   Titles           24 non-null     float64
 2   GOAT Rank        24 non-null     int64  
 3   Peak Elo Rating  24 non-null     int64  
 4   Best Elo Rank    24 non-null     int64  
 5   Best Rank        24 non-null     int64  
 6   Prize Money      24 non-null     int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 1.5+ KB


One concern in answering our predictive question is that there may be significantly fewer left-handed observations in the data set than right-handed observations. This may affect the model's ability to classify left-handed players, as the K-nearest neighbors algorithm bases its classification on the majority class among nearby points. To check the proportion of each class (right-handed and left-handed) in the data set, we will use the value_counts function with the normalize argument set to True.

In [81]:
tennis_clean['Plays'].value_counts(normalize=True)

Right-handed    0.852632
Left-handed     0.147368
Name: Plays, dtype: float64

In [82]:
tennis_train['Plays'].value_counts(normalize=True)

Right-handed    0.859155
Left-handed     0.140845
Name: Plays, dtype: float64

In [83]:
tennis_test['Plays'].value_counts(normalize=True)

Right-handed    0.833333
Left-handed     0.166667
Name: Plays, dtype: float64

Visualizing the number of observations in each class with a bar plot also allows us to check whether there are significantly fewer left-handed observations than right-handed observations. We will use the Chart class from the altair package to create the plot and specify that we want a bar plot by using mark_bar. The x-axis will show the two classes and the y-axis will show the count for each class. We will name this plot 'classification_count_plot'.

In [85]:
classification_count_plot = alt.Chart(tennis_train).mark_bar().encode(
    x=alt.X("Plays").title("Handedness"),
    y=alt.Y("count()").title("Count")
)
classification_count_plot

As suspected, there is a much greater proportion of right-handed observations than left-handed observations. To reduce the class imbalance and limit its effect on the training of our model, we will oversample the left-handed class in the training data. We will need the resample function from the sklearn package, so we will begin by importing it. Then, we will split the training data into two groups by class through filtering. Next, we will use the resample function on the left-handed observations to increase their number to match the right-handed observations by setting the n_samples argument equal to the number of right-handed observations. Because resample draws with replacement by default, the added left-handed observations are duplicates of existing ones. Finally, we will use the concat function to concatenate the data back together.

In [54]:
from sklearn.utils import resample

right_tennis = tennis_train[tennis_train['Plays'] == 'Right-handed']
left_tennis = tennis_train[tennis_train['Plays'] == 'Left-handed']

left_tennis_upsample = resample(
    left_tennis, n_samples = right_tennis.shape[0]
)
    
tennis_train = pd.concat((left_tennis_upsample, right_tennis))

To check that the number of observations in each class is now the same, we will use the value_counts function.

In [55]:
tennis_train['Plays'].value_counts()

Left-handed     61
Right-handed    61
Name: Plays, dtype: int64
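
The sketch below repeats the oversampling steps on a made-up training set (`toy_train`, 10 right-handed and 3 left-handed rows) to show that the balanced data contains repeated left-handed rows rather than new information. Passing `random_state=1` is an assumption added here for reproducibility.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up training data: 10 right-handed rows and 3 left-handed rows.
toy_train = pd.DataFrame({
    "Plays": ["Right-handed"] * 10 + ["Left-handed"] * 3,
    "Titles": range(13),
})

right = toy_train[toy_train["Plays"] == "Right-handed"]
left = toy_train[toy_train["Plays"] == "Left-handed"]

# Draw left-handed rows with replacement until there are as many as right-handed rows.
left_upsampled = resample(left, replace=True, n_samples=len(right), random_state=1)
balanced = pd.concat([left_upsampled, right])

print(balanced["Plays"].value_counts())
# At most 3 distinct left-handed rows exist, however many times they are repeated.
print(left_upsampled["Titles"].nunique())
```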

To explore the data further, we will also calculate the mean of each predictor variable in the training data using the mean function, setting the numeric_only argument to True to exclude the class label.

In [56]:
tennis_train.mean(numeric_only=True)

Titles             6.057377e+00
GOAT Rank          3.044508e+02
Peak Elo Rating    2.035820e+03
Best Elo Rank      2.986885e+01
Best Rank          2.440984e+01
Prize Money        1.035640e+07
dtype: float64
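
Overall means do not show whether the two classes differ. As a sketch of a per-class summary, the example below groups a small made-up data frame (`toy`) by handedness before taking the mean; the same call pattern could be applied to the training data.

```python
import pandas as pd

# Made-up data with two players per class.
toy = pd.DataFrame({
    "Plays": ["Right-handed", "Right-handed", "Left-handed", "Left-handed"],
    "Titles": [4.0, 2.0, 1.0, 3.0],
    "Best Rank": [17, 31, 58, 20],
})

# One row of means per class, which makes the classes directly comparable.
class_means = toy.groupby("Plays").mean(numeric_only=True)
print(class_means)
```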

The last step of our exploratory data analysis is to visualize the training data. To do so, we will create several scatter plots of the training data, coloring the left-handed and right-handed classes differently to visualize their distributions. We will use the Chart class from altair and specify that we want a scatter plot by using mark_circle. For the first scatter plot, we will designate prize money as the x variable and best rank as the y variable, and name the plot 'br_pm_plot'. The second plot shows peak Elo rating against best Elo rank, and the third shows number of titles against GOAT rank.

In [57]:
br_pm_plot = alt.Chart(tennis_train).mark_circle().encode(
    x=alt.X("Prize Money").title("Total Prize Money (USD)"),
    y=alt.Y("Best Rank").title("Player's Best Rank"),
    color=alt.Color("Plays").title("Handedness")
)
br_pm_plot

In [58]:
per_ber_plot = alt.Chart(tennis_train).mark_circle().encode(
    x=alt.X("Best Elo Rank").title("Player's Best Elo Rank"),
    y=alt.Y("Peak Elo Rating").title("Player's Peak Elo Rating"),
    color=alt.Color("Plays").title("Handedness")
)
per_ber_plot

In [59]:
goat_titles_plot = alt.Chart(tennis_train).mark_circle().encode(
    x=alt.X("GOAT Rank").title("Player's GOAT Rank"),
    y=alt.Y("Titles").title("Number of Titles"),
    color=alt.Color("Plays").title("Handedness")
)
goat_titles_plot

### Data Analysis

### Methods
Using K-nearest neighbors (KNN) classification, we will use best rank, prize money, best Elo rank, peak Elo rating, number of titles, and GOAT (greatest of all time) rank as predictors to determine whether a player is left- or right-handed (the response variable). We will then compare the model's predictions with the players' actual handedness to see whether top-ranked players tend to be left- or right-handed.
We chose not to use the rest of the columns, as they were either not relevant to the players' skills or had too few observations to be useful. Using a scatter plot, we will display each player's ranking (y-axis) against their number of titles (x-axis), color-coded by handedness. In doing so, we hope to show whether there is an association between handedness and tennis success.
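
The classification step described above could be sketched as follows. This is a minimal illustration on randomly generated data, not the project's model: the choice of three predictors, the standardization step, `n_neighbors=5`, and `random_state=1` are all assumptions made for the example. Standardizing is included because KNN relies on distances, and a predictor on a large scale such as prize money would otherwise dominate.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data: 102 right-handed and 18 left-handed players.
rng = np.random.default_rng(1)
n = 120
toy = pd.DataFrame({
    "Best Rank": rng.integers(1, 100, size=n),
    "Prize Money": rng.integers(1_000_000, 30_000_000, size=n),
    "Titles": rng.integers(1, 25, size=n),
    "Plays": ["Right-handed"] * 102 + ["Left-handed"] * 18,
})

X = toy[["Best Rank", "Prize Money", "Titles"]]
y = toy["Plays"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=1
)

# Standardize the predictors, then classify by the 5 nearest neighbors (assumed value).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
accuracy = knn.score(X_test, y_test)
print(accuracy)
```

Because the features here are random, the accuracy printed by this sketch says nothing about handedness; it only shows the shape of the workflow.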



### Expected outcomes and significance









In this project, we hope to identify whether there is an association between handedness and tennis success. Beyond exploring this hypothesis, the data could also help recognize where points are lost to an opponent of a given handedness, which could be used to develop offensive and defensive strategies.

### References