# Project Proposal

### Introduction

Tennis is a racket sport in which players rely on a repertoire of shot techniques (e.g., forehand and backhand). It can be played as singles or doubles, and a player wins a match by winning two sets in a best-of-three match or three sets in a best-of-five match. A set is won by winning at least six games with a margin of at least two games over the opponent, and a game is won by winning at least four points with a margin of at least two points over the opponent.

Officially, players are assigned an Elo rating according to the Universal Tennis Rating (UTR) system, which rates players on head-to-head results, independent of age or nationality. The UTR system factors in the opponents a player has competed against as well as the set scores of those matches. Because it is used on a global scale, the UTR can be used to match players with similar Elo ratings and therefore similar ability. The system makes matches more competitive and is recognized as a helpful tool for player development and for evaluating tennis skill.

In this project, we therefore seek to assess other factors that may contribute to tennis players' current Elo ratings. We will examine experience (number of years played, measured from the year a player turned professional) and hand technique (backhand type and dominant hand) to see whether experience has a greater impact than technique on a player's current Elo rating and rank.

To help answer these questions, we will use the Player Stats for Top 500 Players dataset from https://www.ultimatetennisstatistics.com/. This dataset uses a rating system similar to the UTR, but with a new, optimized K-factor function that yields more stable ratings and player rankings.
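
For readers unfamiliar with Elo-style ratings and the role of the K-factor, the sketch below shows the textbook Elo update in R. It is only an illustration: the function names `elo_expected` and `elo_update` and the constant `k = 32` are assumptions made here, and the optimized K-factor function used by Ultimate Tennis Statistics is not reproduced.

```r
# Textbook Elo update (illustrative; not the site's actual K-factor function).

# Expected score (win probability) of player A against player B.
elo_expected <- function(rating_a, rating_b) {
  1 / (1 + 10^((rating_b - rating_a) / 400))
}

# New rating for player A after one match.
# score_a is 1 for a win and 0 for a loss; k = 32 is an assumed constant.
elo_update <- function(rating_a, rating_b, score_a, k = 32) {
  rating_a + k * (score_a - elo_expected(rating_a, rating_b))
}

# Two equally rated players: the expected score is 0.5,
# so a win moves the winner up by k / 2 points.
elo_expected(1800, 1800)
elo_update(1800, 1800, score_a = 1)
```

A larger `k` makes ratings react faster to each result, while a smaller `k` makes them more stable, which is the trade-off a tuned K-factor function tries to balance.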

In [2]:
# Loading Libraries, Remember to run this cell!
library(tidyverse)
library(repr)
library(tidymodels)

options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

### Data Analysis

In [3]:
# Load the data
player_stats <- read_csv("https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS")
# Make column names syntactically valid (e.g. `Turned Pro` becomes Turned.Pro)
colnames(player_stats) <- make.names(colnames(player_stats))

player_stats <- player_stats %>%
    # Keep Age through Peak.Elo.Rating plus Retired, then drop unneeded columns
    select(Age:Peak.Elo.Rating, Retired, -Country, -Wikipedia, -Current.Rank, -Name, -Seasons, -Prize.Money, -Active, -Favorite.Surface, -Best.Elo.Rank) %>%
    # Convert the categorical columns to factors
    mutate(Plays = as_factor(Plays),
           Backhand = as_factor(Backhand)) %>%
    # Strip the " cm" unit from Height and convert it to an integer
    mutate(Height = strtoi(str_remove(Height, " cm"))) %>%
    # Years of professional experience as of 2020
    mutate(Year.Experience = 2020 - Turned.Pro) %>%
    # Keep the first two characters of Age as an integer (R strings are 1-indexed)
    mutate(Age = strtoi(substr(Age, 1, 2))) %>%
    # Drop players with no recorded year of turning professional
    filter(!is.na(Turned.Pro))

player_stats

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  .default = col_character(),
  X1 = col_double(),
  `Turned Pro` = col_double(),
  Seasons = col_double(),
  Titles = col_double(),
  `Best Season` = col_double(),
  Retired = col_double(),
  Masters = col_double(),
  `Grand Slams` = col_double(),
  `Davis Cups` = col_double(),
  `Team Cups` = col_double(),
  Olympics = col_double(),
  `Weeks at No. 1` = col_double(),
  `Tour Finals` = col_double()
)

See spec(...) for full column specifications.



| Age | Plays | Best.Rank | Backhand | Height | Turned.Pro | Current.Elo.Rank | Peak.Elo.Rating | Retired | Year.Experience |
|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<fct>` | `<chr>` | `<fct>` | `<int>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` |
| 32 | Right-handed | 44 (14-01-2013) | Two-handed | 185 | 2005 | 144 (1764) | 1886 (06-02-2012) | NA | 15 |
| 27 | Right-handed | 17 (11-01-2016) | Two-handed | 193 | 2008 | 100 (1826) | 2037 (01-02-2016) | NA | 12 |
| 22 | Right-handed | 31 (20-01-2020) | Two-handed | NA | 2015 | 33 (1983) | 1983 (20-01-2020) | NA | 5 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 28 | Right-handed | 74 (19-02-2018) | Two-handed | NA | 2008 | 143 (1764) | 1904 (12-02-2018) | NA | 12 |
| 26 | Right-handed | 249 (24-12-2018) | Two-handed | NA | 2009 | 180 (1679) | 1679 (10-01-2020) | NA | 11 |
| 26 | Right-handed | 4 (06-11-2017) | One-handed | 185 | 2011 | 6 (2188) | 2211 (18-11-2019) | NA | 9 |


### Methods

* If we use every variable to fit a model, we face two problems. First, the model may become too complex. Second, the model may overfit: it is influenced too strongly by individual observations in the data and therefore loses predictive power on new data.

* We first narrowed the variables by removing those we subjectively judged to be less related to ranking, which was done in the data analysis step. Because too many variables still remained, which would make the model complex, we then ran variable selection again using three methods: exhaustive, forward, and backward selection. These show which variables should be included in the best three-variable model (current_rank, Peak.Elo.Rating, Age).

* We will visualize the results with scatter plots of each explanatory variable against the response variable and examine the resulting distributions. We will then compare these with our final predictions to see how good our model is.
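
The overfitting concern and the final comparison against predictions can be illustrated with a hold-out check. The sketch below is a minimal base-R example on simulated data, not the tennis data: the data frame `sim`, the helper `rmse`, the 75/25 split, and the seed are all assumptions chosen for illustration.

```r
# Hold-out check for overfitting on simulated data (illustrative only).
set.seed(2020)  # arbitrary seed, for reproducibility

n <- 200
sim <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  noise1 = rnorm(n), noise2 = rnorm(n), noise3 = rnorm(n)
)
# By construction, only x1, x2 and x3 influence the response.
sim$y <- 2 * sim$x1 - sim$x2 + 0.5 * sim$x3 + rnorm(n)

# Split into 75% training rows and 25% test rows.
train_rows <- sample(n, size = 0.75 * n)
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# Root mean squared prediction error of a fitted model on a data set.
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

full_model  <- lm(y ~ ., data = train)             # all six predictors
small_model <- lm(y ~ x1 + x2 + x3, data = train)  # three predictors only

# The full model always fits the training rows at least as well,
# so the comparison that matters is the error on the held-out rows.
c(full = rmse(full_model, train), small = rmse(small_model, train))
c(full = rmse(full_model, test),  small = rmse(small_model, test))
```

If the larger model's advantage on the training rows disappears or reverses on the test rows, the extra variables are fitting noise rather than signal.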

In [6]:
# Split each "value (detail)" string on the space, keep the leading value,
# drop the parenthesised detail, and convert the value to numeric
df2 <- player_stats %>%
    separate(Best.Rank, sep = " ", into = c("best_rank", "time")) %>%
    separate(Current.Elo.Rank, sep = " ", into = c("current_rank", "current_time")) %>%
    separate(Peak.Elo.Rating, sep = " ", into = c("Peak.Elo.Rating", "timing")) %>%
    select(-time, -current_time, -timing) %>%
    mutate(best_rank = as.numeric(best_rank),
           current_rank = as.numeric(current_rank),
           Peak.Elo.Rating = as.numeric(Peak.Elo.Rating))

head(df2)

# leaps provides regsubsets(); the install step only needs to run once
install.packages("leaps")
library(leaps)

# Best-subset selection with three search strategies on the same formula
rgs1 <- regsubsets(best_rank ~ Age + Plays + Backhand + Height + Turned.Pro + current_rank + Peak.Elo.Rating, data = df2, method = "exhaustive")
srgs1 <- summary(rgs1)

rgs2 <- regsubsets(best_rank ~ Age + Plays + Backhand + Height + Turned.Pro + current_rank + Peak.Elo.Rating, data = df2, method = "forward")
srgs2 <- summary(rgs2)

rgs3 <- regsubsets(best_rank ~ Age + Plays + Backhand + Height + Turned.Pro + current_rank + Peak.Elo.Rating, data = df2, method = "backward")
srgs3 <- summary(rgs3)

srgs1
srgs2
srgs3

# Full linear model with all seven predictors, for comparison
summary(lm(best_rank ~ Age + Plays + Backhand + Height + Turned.Pro + current_rank + Peak.Elo.Rating, data = df2))

| Age | Plays | best_rank | Backhand | Height | Turned.Pro | current_rank | Peak.Elo.Rating | Retired | Year.Experience |
|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<fct>` | `<dbl>` | `<fct>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 32 | Right-handed | 44 | Two-handed | 185 | 2005 | 144 | 1886 | NA | 15 |
| 27 | Right-handed | 17 | Two-handed | 193 | 2008 | 100 | 2037 | NA | 12 |
| 22 | Right-handed | 31 | Two-handed | NA | 2015 | 33 | 1983 | NA | 5 |
| 28 | Right-handed | 213 | Two-handed | NA | 2010 | NA | NA | 2017 | 10 |
| 19 | Right-handed | 17 | Two-handed | NA | 2017 | 51 | 1992 | NA | 3 |
| 23 | Right-handed | 4 | Two-handed | NA | 2014 | 5 | 2243 | NA | 6 |


Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



Subset selection object
Call: regsubsets.formula(best_rank ~ Age + Plays + Backhand + Height + 
    Turned.Pro + current_rank + Peak.Elo.Rating, data = df2, 
    method = "exhaustive")
7 Variables  (and intercept)
                   Forced in Forced out
Age                    FALSE      FALSE
PlaysLeft-handed       FALSE      FALSE
BackhandOne-handed     FALSE      FALSE
Height                 FALSE      FALSE
Turned.Pro             FALSE      FALSE
current_rank           FALSE      FALSE
Peak.Elo.Rating        FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: exhaustive
         Age PlaysLeft-handed BackhandOne-handed Height Turned.Pro current_rank
1  ( 1 ) " " " "              " "                " "    " "        " "         
2  ( 1 ) " " " "              " "                " "    " "        "*"         
3  ( 1 ) "*" " "              " "                " "    " "        "*"         
4  ( 1 ) "*" " "              " "                "*"    " "        "*"         
5  

Subset selection object
Call: regsubsets.formula(best_rank ~ Age + Plays + Backhand + Height + 
    Turned.Pro + current_rank + Peak.Elo.Rating, data = df2, 
    method = "forward")
7 Variables  (and intercept)
                   Forced in Forced out
Age                    FALSE      FALSE
PlaysLeft-handed       FALSE      FALSE
BackhandOne-handed     FALSE      FALSE
Height                 FALSE      FALSE
Turned.Pro             FALSE      FALSE
current_rank           FALSE      FALSE
Peak.Elo.Rating        FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: forward
         Age PlaysLeft-handed BackhandOne-handed Height Turned.Pro current_rank
1  ( 1 ) " " " "              " "                " "    " "        " "         
2  ( 1 ) " " " "              " "                " "    " "        "*"         
3  ( 1 ) "*" " "              " "                " "    " "        "*"         
4  ( 1 ) "*" " "              " "                "*"    " "        "*"         
5  ( 1 ) 

Subset selection object
Call: regsubsets.formula(best_rank ~ Age + Plays + Backhand + Height + 
    Turned.Pro + current_rank + Peak.Elo.Rating, data = df2, 
    method = "backward")
7 Variables  (and intercept)
                   Forced in Forced out
Age                    FALSE      FALSE
PlaysLeft-handed       FALSE      FALSE
BackhandOne-handed     FALSE      FALSE
Height                 FALSE      FALSE
Turned.Pro             FALSE      FALSE
current_rank           FALSE      FALSE
Peak.Elo.Rating        FALSE      FALSE
1 subsets of each size up to 7
Selection Algorithm: backward
         Age PlaysLeft-handed BackhandOne-handed Height Turned.Pro current_rank
1  ( 1 ) " " " "              " "                " "    " "        " "         
2  ( 1 ) " " " "              " "                " "    " "        "*"         
3  ( 1 ) "*" " "              " "                " "    " "        "*"         
4  ( 1 ) "*" " "              " "                "*"    " "        "*"         
5  ( 1 


Call:
lm(formula = best_rank ~ Age + Plays + Backhand + Height + Turned.Pro + 
    current_rank + Peak.Elo.Rating, data = df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.416  -8.607  -1.136   5.639  41.448 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         977.22480 2260.19533   0.432    0.667    
Age                  -1.08425    1.05821  -1.025    0.309    
PlaysLeft-handed     -0.84609    4.32001  -0.196    0.845    
BackhandOne-handed   -1.92972    4.12155  -0.468    0.641    
Height               -0.19703    0.20901  -0.943    0.349    
Turned.Pro           -0.38379    1.11102  -0.345    0.731    
current_rank          0.22246    0.04808   4.627 1.41e-05 ***
Peak.Elo.Rating      -0.06139    0.01480  -4.147 8.33e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.66 on 80 degrees of freedom
  (158 observations deleted due to missingness)
Multiple R-squared:  0.7289,	Adju
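
To make the exhaustive search less of a black box, the following base-R sketch enumerates every predictor subset of a given size and keeps the one with the smallest residual sum of squares, which, as far as we understand, is the criterion `regsubsets()` applies within each subset size. It runs on the built-in `mtcars` data rather than the tennis data, and `best_subset` is a hypothetical helper written only for this illustration.

```r
# Exhaustive best-subset search by hand (illustrative; uses built-in mtcars,
# not the tennis data). best_subset() is a hypothetical helper.
best_subset <- function(data, response, predictors, size) {
  # Every combination of `size` predictors, as a list of character vectors.
  candidates <- combn(predictors, size, simplify = FALSE)
  # Residual sum of squares of the linear model fitted to each combination.
  rss <- vapply(candidates, function(vars) {
    fit <- lm(reformulate(vars, response = response), data = data)
    sum(residuals(fit)^2)
  }, numeric(1))
  candidates[[which.min(rss)]]
}

predictors <- c("wt", "hp", "disp", "qsec", "drat")

# Best subset of each size, from one predictor up to all five.
for (size in seq_along(predictors)) {
  chosen <- best_subset(mtcars, "mpg", predictors, size)
  cat(size, ":", paste(chosen, collapse = ", "), "\n")
}
```

Forward and backward selection avoid this full enumeration by adding or removing one variable at a time, so they can disagree with the exhaustive result even though they agree on the sizes shown in the output above.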

### Expected Outcomes and Significance

In our data analysis, we compare players' techniques as well as their experience. We expect to find that experience outweighs technique and therefore has the greater impact on ranking. Our reasoning is that, with enough practice, a player can be successful in terms of current ranking even with a less efficient technique. However, our findings may instead show that certain techniques are superior, allowing a player who uses one technique to outperform an opponent with equal experience. This would be significant because it would show which hand techniques top-ranked players use to improve their chances of defeating their opponents.