In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
library(rvest)
library(stringr)
library(DBI)
library(dbplyr)
library(GGally)
options(repr.matrix.max.rows = 10)

# Predicting the Win Rate of Tennis Players  
<img align="left" src="https://images.unsplash.com/photo-1554068865-24cecd4e34b8?ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8dGVubmlzfGVufDB8fDB8fA%3D%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=60" width="1000" />  

**Source: https://unsplash.com/photos/WqI-PbYugn4**

# Introduction<img align="left" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDixQBw3HoqS_gnC9xVtHO-5NrnS1eQ91N3w&usqp=CAU" width="35" />   


### <u> Background </u> 

Tennis is a popular, competitive sport played around the world.
Tennis can be played as "singles", with one person on each side of the net, or as "doubles", with teams of two.
It can be played on a variety of surfaces such as grass, clay, or hard court (similar to a gym floor).

The Association of Tennis Professionals (ATP) organizes professional tournaments and collects data on the players and the matches that take place.

### <u> Our Question </u> 

Based on the career statistics of a tennis player, what will be their win rate?

### <u>  Our Dataset </u> 

We are using the "Game results for Top 500 Players from 2017-2019" dataset for our analysis. Each row in this dataset represents a singles match between two players. Each row contains player stats (e.g. age, height, rank) and match stats (break points, serve points, double faults etc.). We can use this data to determine the relationship between a player's stats and their win rate for this time period.

*Picture source: https://www.emojipng.com/preview/458725*

# Methods and Results<img align="left" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDixQBw3HoqS_gnC9xVtHO-5NrnS1eQ91N3w&usqp=CAU" width="35" />   


### <u> Outline </u>

We will first transform the dataset of tennis matches into a tidy dataset with only player stats. Next, we will look at the relationships between the different variables and win_rate to choose predictors for our regression models. To answer our predictive question, we will train both KNN and linear regression models and then compare them to find the model type and predictors that give the lowest error. Once we have the best model, we will use it to predict the win rate for a new player observation.

### <u> Exploring the Data </u>

<span style="color:blue"> **The code below reads the CSV file from the given URL link.** </span>

In [None]:
## the dataset can be read from the URL link:
atp_data_frame <- read_csv("https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn")
atp_data_frame

**Table 1: Raw data table**

<span style="color:blue"> **Next, we clean and wrangle the raw data set into tidy form, with each row representing an individual player, and derive the candidate predictors listed in the table below.** </span>

From the raw match data we derive nine candidate predictors that a KNN regression model can use to predict a player's career win rate. The predictors include:


| Variable                        | Explanation                                                                                         |
|---------------------------------|-----------------------------------------------------------------------------------------------------|
| Age (years)                     | Older players will have sustained more injuries and be less fit.                                    |
| Height (cm)                     | Height can provide an advantage when serving.                                                       |
| Serve Points that were Aces (%) | Winning points on a serve indicates a strong serve.                                                 |
| First Serves (%)                | A high ratio of "first serve points" to "first serves made in" indicates a more accurate serve.     |
| First Serves Won (%)            | Strong and accurate first serves will lead to fewer double faults.                                  |
| Second Serves Won (%)           | Strong second serves means fewer lost points due to a slow serve.                                   |
| Double Faults per Game (ratio)  | Fewer double faults per game indicates accurate serving.                                            |
| Breakpoints Saved (%)           | Saving break points means a player wins the points that matter most in a match.                     |
| Rank Points                     | Awarded to players by the ATP for winning matches.                                                  |

**Table 2: List of Potential Predictors created for our data set**

The predictors related to serving are useful because a player has the most control over the match during the games in which they are serving. For information on each type of serve stat, see Keith Prowse Editors under References.

The rank points stat is important because players earn a different number of rank points for each type of match (Nag, Utathya). Players may accumulate many rank points by winning many lower-ranked matches or by winning a few major matches, which gives us insight into the wins a player may have.

<span style="color:blue"> **The code below cleans and wrangles the raw data set into tidy form by grouping the observations by player. We mutate some statistics to percentages through ratios of the raw variables. We then obtain each player's "career stats" by joining observations in both winning and losing rounds to the player ID. This forms a data frame with each row representing an individual player.** </span>
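
As a minimal sketch of the reshaping idea, the toy example below counts wins and losses per player from a handful of made-up match rows and computes win rate as wins / (wins + losses) × 100. The `toy_matches` data frame and its values are hypothetical; only the column names `winner_id` and `loser_id` come from the real dataset. It uses base R rather than the tidyverse pipeline in the next cell.

```r
# Hypothetical toy matches: each row is one match (values are made up)
toy_matches <- data.frame(
    winner_id = c(1, 1, 2, 3, 1),
    loser_id  = c(2, 3, 3, 1, 2)
)

# Count wins and losses per player
wins   <- as.data.frame(table(player_id = toy_matches$winner_id), responseName = "n_wins")
losses <- as.data.frame(table(player_id = toy_matches$loser_id),  responseName = "n_lose")

# Join the two tables on player_id; all = TRUE keeps players missing from one side
toy_career <- merge(wins, losses, by = "player_id", all = TRUE)
toy_career$n_wins[is.na(toy_career$n_wins)] <- 0
toy_career$n_lose[is.na(toy_career$n_lose)] <- 0

# Win rate as a percentage of all matches played
toy_career$win_rate <- toy_career$n_wins / (toy_career$n_wins + toy_career$n_lose) * 100
toy_career
```

Player 1 wins three of four matches here, so their win rate is 75%.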

In [None]:
# calculate player wins and mean match stats for winning matches
player_wins <- atp_data_frame %>%
    group_by(player_id = winner_id) %>%
    summarize(w_height = mean(winner_ht, na.rm =TRUE),
              w_breakpoint_saved_pct = mean(w_bpSaved/w_bpFaced, na.rm =TRUE),
              w_second_serve_win_pct = mean(w_2ndWon / w_svpt,na.rm =TRUE),
              w_first_serve_pct = mean(w_1stWon / w_1stIn,na.rm =TRUE),
              w_first_serve_win_pct = mean(w_1stWon / w_svpt, na.rm = TRUE),
              n_wins = n(),
              mean_age_w  = mean(winner_age),
              mean_rank_points_w = mean(winner_rank_points),
              w_ace_point_pct = mean(w_ace/w_svpt,na.rm = TRUE)
             ) %>%
    drop_na() %>%
    mutate(player_id = as.character(player_id))

# calculate player losses and mean match stats for losing matches
player_lose <- atp_data_frame %>%
    group_by(player_id = loser_id) %>%
    summarize(l_height = mean(loser_ht, na.rm =TRUE),
              l_breakpoint_saved_pct = mean(l_bpSaved/l_bpFaced, na.rm =TRUE),
              l_second_serve_win_pct = mean(l_2ndWon / l_svpt,na.rm =TRUE),
              l_first_serve_pct = mean(l_1stWon / l_1stIn,na.rm =TRUE),
              l_first_serve_win_pct = mean(l_1stWon / l_svpt, na.rm = TRUE),
              n_lose = n(),
              mean_age_l  = mean(loser_age),
              mean_rank_points_l = mean(loser_rank_points),
              l_ace_point_pct = mean(l_ace/l_svpt,na.rm = TRUE)
             ) %>%
    drop_na() %>%
    mutate(player_id = as.character(player_id))

# join datasets for wins and losses using unique player ids
player_join <- left_join(player_wins, player_lose, by = "player_id")

# calculate career stats for all player matches
player_career <- player_join %>%
    mutate(height = (w_height + l_height)/2,
          breakpoint_saved_pct = (w_breakpoint_saved_pct+l_breakpoint_saved_pct)/2,
          second_serve_win_pct = (w_second_serve_win_pct+l_second_serve_win_pct)/2,
          first_serve_pct = (w_first_serve_pct+l_first_serve_pct)/2,
          first_serve_win_pct = (w_first_serve_win_pct + l_first_serve_win_pct)/2,
          win_rate = (n_wins/(n_lose+n_wins)*100),
          age = (mean_age_w + mean_age_l) /2,
          mean_rank_points = (mean_rank_points_w + mean_rank_points_l)/2,
          ace_point_pct = (w_ace_point_pct+l_ace_point_pct)/2) %>%
    select(player_id,height,breakpoint_saved_pct,second_serve_win_pct,first_serve_pct,first_serve_win_pct, win_rate,age,mean_rank_points,ace_point_pct) %>%
    drop_na()
player_career

**Table 3: Mutated data table used for data processing**

<span style="color:blue">**We split the player career dataset into testing and training sets by a 75/25 split. We decided that this split ratio allowed for enough observations to be used to train our model while still having enough observations in our test set to evaluate its accuracy.** </span>
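
To illustrate what a stratified 75/25 split does, here is a rough base R sketch on made-up data. It assumes that stratifying on a numeric outcome amounts to binning the outcome into quartiles and sampling 75% of the rows within each bin; the `toy` data frame is hypothetical and this is not the `rsample` implementation itself.

```r
set.seed(20)

# Hypothetical outcome values (made up for illustration)
toy <- data.frame(win_rate = runif(100, 20, 80))

# Bin the outcome into quartiles
toy$bin <- cut(toy$win_rate,
               breaks = quantile(toy$win_rate, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)

# Within each bin, sample 75% of the row indices for training
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$bin), function(idx) {
    idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both sets cover the full range of win rates in similar proportions
table(toy_train$bin)
table(toy_test$bin)
```

Sampling within bins keeps the distribution of win rates similar in the training and testing sets, which matters when the dataset is small.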

In [None]:
# split the data set into training and testing set. The following exploratory data analysis uses only the training set
set.seed(20)
player_split <- initial_split(player_career, prop = 0.75, strata = win_rate)
player_train <- training(player_split)
player_test <- testing(player_split)

<span style="color:blue">**The table below contains the means of each quantitative variable in the training set. This gives an idea of the average statistics for a given player, which is relevant for exploratory data analysis. It tells us what sort of values (or percentages) we can expect for each stat.** </span>

In [None]:
# the means of the predictor variables we plan to use in our analysis
exploratory_data_analysis_table <- player_train %>%
    select(-player_id) %>%
    map_df(mean, na.rm = TRUE)
exploratory_data_analysis_table

**Table 4: Mean Values for each Predictor Variable**

<span style="color:blue"> **The code below produces a visualization which is also very useful in our exploratory data analysis. By using the function `ggpairs`, we can see the "big picture" of all the relationships between each pair of variables. This visualization helps us pick which variables have a relatively strong relationship with win rate, and thus will be effective in predictions.** </span>

In [None]:
player_ggpairs <- player_train %>%
    select(-player_id) %>%
    ggpairs()

player_ggpairs

**Figure 1: Plot of All Predictor Relationships using ggpairs**
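
As a numeric complement to the pairs plot, one could rank candidate predictors by the absolute value of their correlation with win rate. The sketch below does this on simulated data; the `toy` data frame and the coefficients used to generate it are assumptions for illustration only. Pearson correlation only measures linear association, so it can understate a non-linear relationship.

```r
set.seed(1)
n <- 200

# Simulated data: win rate depends on rank points but not on height
toy <- data.frame(
    rank_points = runif(n, 100, 5000),
    height      = rnorm(n, 185, 7)
)
toy$win_rate <- 30 + 0.008 * toy$rank_points + rnorm(n, sd = 5)

# Correlation of every variable with win_rate, excluding win_rate itself
cors <- cor(toy)[, "win_rate"]
cors <- cors[names(cors) != "win_rate"]

# Strongest absolute correlation first
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```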

### <u> Model Selection </u> 

<span style="color:blue"> **The first option for our model is K-NN regression for *individual predictors* with win_rate as the target value. In order to simplify the steps, we use a for loop to run the model on each predictor. The result is a table with 3 columns: predictor, best k value (as chosen through cross validation), and RMSPE.** </span> 

<span style="color:red">**Warning! The for_loop iteration may take time.** </span>
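
To make the idea behind K-NN regression concrete, here is a minimal from-scratch sketch for a single predictor. With the "rectangular" weight function, the prediction is the plain (unweighted) mean of the responses of the k nearest training points. The function `knn_predict` and the data are hypothetical and only illustrate the technique; the analysis itself uses the `kknn` engine.

```r
# Predict the response at x_new as the mean response of the k nearest training points
knn_predict <- function(x_train, y_train, x_new, k) {
    dists   <- abs(x_train - x_new)        # distance in one dimension
    nearest <- order(dists)[seq_len(k)]    # indices of the k closest points
    mean(y_train[nearest])
}

# Made-up training data: one predictor and the win rate
x_train <- c(1, 2, 3, 10, 11)
y_train <- c(40, 42, 44, 70, 72)

# The three nearest neighbours of 2.2 are x = 2, 3 and 1, so the prediction is mean(42, 44, 40)
knn_predict(x_train, y_train, x_new = 2.2, k = 3)
```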

In [None]:
set.seed(1)

predictors <- c(
    'height','breakpoint_saved_pct','second_serve_win_pct','first_serve_pct','first_serve_win_pct','age','mean_rank_points','ace_point_pct'
)

# empty results table with one typed column per reported value
results <- tibble(predictor = character(), best_k = integer(), rmspe = double())

for (pred in predictors) {

    print(pred)
    
    train_data <- player_train %>%
        select(win_rate, all_of(pred))
    
    test_data <- player_test %>%
        select(win_rate, all_of(pred))
    
    tennis_recipe <- recipe(win_rate ~ ., data = train_data) %>%
       step_scale(all_predictors()) %>%
       step_center(all_predictors())
    
    tennis_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
       set_engine("kknn") %>%
       set_mode("regression")
    
    tennis_vfold <- vfold_cv(train_data, v = 5, strata = win_rate)
    
    tennis_workflow <- workflow() %>%
       add_recipe(tennis_recipe) %>%
       add_model(tennis_spec)
    
    gridvals <- tibble(neighbors = seq(1,40))
    
    tennis_results <- tennis_workflow %>%
       tune_grid(resamples = tennis_vfold, grid = gridvals) %>%
       collect_metrics() %>%
       filter(.metric == "rmse") %>%
       slice_min(mean, n = 1, with_ties = FALSE)  # keep a single K even if several tie
    
    kmin <- pull(tennis_results, neighbors)
    
    tennis_spec_kmin <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin) %>%
       set_engine("kknn") %>%
       set_mode("regression")
    
    tennis_fit <- workflow() %>%
       add_recipe(tennis_recipe) %>%
       add_model(tennis_spec_kmin) %>%
       fit(data = train_data)
    
    rmspe_val <- tennis_fit %>%
       predict(test_data) %>%
       bind_cols(test_data) %>%
       metrics(truth = win_rate, estimate = .pred) %>%
       filter(.metric == "rmse") %>%
       select(.estimate) %>%
       pull()
    
    
    results <- results %>%
        add_row(predictor = pred, best_k=kmin, rmspe = rmspe_val)
    
}

In [None]:
results %>% arrange(rmspe)

**Table 5: RMSPE and Best K Values for Single Predictor Models**
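
For reference, RMSPE is the square root of the mean squared difference between the observed and predicted values on the test set. The small helper below is a sketch of that calculation; the name `rmspe` and the example values are made up, and the notebook itself obtains this number from `metrics()`.

```r
# Root mean squared prediction error between observed and predicted values
rmspe <- function(truth, estimate) {
    sqrt(mean((truth - estimate)^2))
}

# Made-up observed and predicted win rates (in %)
truth    <- c(50, 60, 40, 55)
estimate <- c(47, 64, 40, 55)

# Squared errors are 9, 16, 0, 0, so the result is sqrt(25 / 4) = 2.5
rmspe(truth, estimate)
```

Because win rate is measured in percent, an RMSPE of 2.5 means predictions are off by about 2.5 percentage points on average, in the root-mean-square sense.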

<span style="color:blue"> **The second option is K-NN regression for *combined predictors* with win_rate as the target. The combined predictors are chosen from the strongest relationships we observed in player_ggpairs. Again, we iterate with a for loop to reduce the amount of code. The resulting table contains the same 3 columns as the individual predictors.** 

</span> <span style="color:red"> **Warning! The for_loop iteration may take time.** </span>
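
With more than one predictor, standardizing matters because K-NN relies on distances. The sketch below uses made-up values to show how an unscaled variable measured in the thousands (rank points) dominates a variable measured between 0 and 1 (a serve percentage). Base R's `scale()` stands in here for the `step_scale()` and `step_center()` recipe steps.

```r
# Made-up players: rank points are in the thousands, serve percentage is between 0 and 1
players <- data.frame(
    mean_rank_points    = c(500, 1500, 3000),
    first_serve_win_pct = c(0.40, 0.50, 0.45)
)

# Unscaled distances are driven almost entirely by rank points
dist(players)

# Standardize each column to mean 0 and standard deviation 1
scaled <- scale(players)

# After scaling, both predictors contribute on a comparable footing
dist(scaled)
```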

In [None]:
set.seed(1)

formulas <- c(
"win_rate ~ mean_rank_points + first_serve_win_pct",
"win_rate ~ mean_rank_points + height",
"win_rate ~ mean_rank_points + first_serve_pct",
"win_rate ~ mean_rank_points + first_serve_pct + first_serve_win_pct",
"win_rate ~ mean_rank_points + first_serve_pct + height"    
)

# empty results table with one typed column per reported value
multi_results <- tibble(predictor = character(), best_k = integer(), rmspe = double())

for (f in formulas) {
    
    print(as.formula(f))
    
    tennis_recipe_multiple <- recipe(as.formula(f), data = player_train) %>%
        step_scale(all_predictors()) %>%
        step_center(all_predictors())

    tennis_spec_mul <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
       set_engine("kknn") %>%
       set_mode("regression")
    
    tennis_vfold_mul <- vfold_cv(player_train, v = 5, strata = win_rate)
    
    tennis_workflow_multiple <- workflow() %>%
       add_recipe(tennis_recipe_multiple) %>%
       add_model(tennis_spec_mul)
    
    gridvals_mul <- tibble(neighbors = seq(1,40))
    
    tennis_results_multiple <- tennis_workflow_multiple %>%
       tune_grid(resamples = tennis_vfold_mul, grid = gridvals_mul) %>%
       collect_metrics() %>%
       filter(.metric == "rmse")
    # keep the tuning curve of the first formula for Figure 2
    if (f == "win_rate ~ mean_rank_points + first_serve_win_pct") {
        best_plot <- ggplot(tennis_results_multiple, aes(x = neighbors, y = mean)) +
            geom_point() +
            geom_line() +
            labs(x = "K", y = "RMSPE") +
            theme(text = element_text(size = 20))
    }

    # pick the K with the lowest cross-validated error; keep a single K even if several tie
    tennis_result_min <- tennis_results_multiple %>%
       slice_min(mean, n = 1, with_ties = FALSE)

    kmin_multiple <- pull(tennis_result_min, neighbors)
    
    tennis_spec_kmin_mul <- nearest_neighbor(weight_func = "rectangular", neighbors = kmin_multiple) %>%
       set_engine("kknn") %>%
       set_mode("regression")
    
    tennis_fit_multiple <- workflow() %>%
       add_recipe(tennis_recipe_multiple) %>%
       add_model(tennis_spec_kmin_mul) %>%
       fit(data = player_train)
    
    rmspe_val_mul <- tennis_fit_multiple %>%
       predict(player_test) %>%
       bind_cols(player_test) %>%
       metrics(truth = win_rate, estimate = .pred) %>%
       filter(.metric == "rmse") %>%
       select(.estimate) %>%
       pull()
    
    multi_results <- multi_results %>%
        add_row(predictor = f, best_k = kmin_multiple, rmspe = rmspe_val_mul)
}

In [None]:
multi_results %>%
    arrange(rmspe)

**Table 6: RMSPE and Best K Values for Multi Predictor Models**

In [None]:
best_plot

**Figure 2: K vs RMSPE for mean rank points and first serve win %**
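
The curve in a plot like this comes from cross-validation: for each K, the training data is split into folds, the model is fit on all but one fold, and the error is measured on the held-out fold. The base R sketch below reproduces that idea on simulated one-predictor data. The data, the helper `knn_predict`, and the unstratified fold assignment are assumptions for illustration, not the `tune_grid()` implementation.

```r
set.seed(1)
n <- 100

# Simulated non-linear relationship between one predictor and the response
x <- runif(n, 0, 10)
y <- 5 * sqrt(x) + rnorm(n, sd = 1)

# Unweighted K-NN prediction for a vector of new points
knn_predict <- function(x_train, y_train, x_new, k) {
    sapply(x_new, function(x0) mean(y_train[order(abs(x_train - x0))[seq_len(k)]]))
}

# Randomly assign each row to one of 5 folds
folds <- sample(rep(1:5, length.out = n))
ks <- 1:20

# For each K, average the held-out RMSE over the 5 folds
cv_rmse <- sapply(ks, function(k) {
    fold_rmse <- sapply(1:5, function(f) {
        held_out <- folds == f
        pred <- knn_predict(x[!held_out], y[!held_out], x[held_out], k)
        sqrt(mean((y[held_out] - pred)^2))
    })
    mean(fold_rmse)
})

# K with the lowest cross-validated error
best_k <- ks[which.min(cv_rmse)]
best_k
```

Very small K tends to overfit noise and very large K tends to oversmooth, which is why the error curve usually has a minimum somewhere in between.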

<span style="color:blue"> **The third option is linear regression for *individual predictors* with win_rate as the target. Again, we iterate with a for loop to reduce the amount of code. The resulting table has only 2 columns this time: predictor and RMSPE. Since the model is using linear regression, there is no k-value.** 

</span> <span style="color:red"> **Warning! The for_loop iteration may take time.** </span>
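
For comparison with K-NN, a single-predictor linear regression fits one straight line to the training data and uses it for every prediction. The sketch below shows the fit-then-evaluate pattern with base R's `lm()` on simulated data; the data frames `train` and `test` and the generating coefficients are made up for illustration.

```r
set.seed(1)

# Simulated training and testing sets with a linear relationship plus noise
train <- data.frame(mean_rank_points = runif(80, 100, 5000))
train$win_rate <- 35 + 0.006 * train$mean_rank_points + rnorm(80, sd = 4)

test <- data.frame(mean_rank_points = runif(20, 100, 5000))
test$win_rate <- 35 + 0.006 * test$mean_rank_points + rnorm(20, sd = 4)

# Fit the line on the training set only
fit <- lm(win_rate ~ mean_rank_points, data = train)
coef(fit)

# Evaluate on the unseen test set
pred <- predict(fit, newdata = test)
lm_rmspe <- sqrt(mean((test$win_rate - pred)^2))
lm_rmspe
```

Unlike K-NN, there is no K to tune, so no cross-validation step is needed before computing the test error.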

In [None]:
set.seed(1)

# one single-predictor formula per candidate variable
formulas <- paste("win_rate ~", c(
    "height", "breakpoint_saved_pct", "second_serve_win_pct", "first_serve_pct",
    "first_serve_win_pct", "age", "mean_rank_points", "ace_point_pct"
))

# empty results table (the name is reused by the multi-predictor cell further down)
multi_lm_results <- tibble(lm_predictor = character(), lm_rmspe = double())

for (f in formulas) {

    print(f)
    
    tennis_spec_lm <- linear_reg() %>%
        set_engine("lm") %>%
        set_mode("regression")
    
    tennis_recipe_lm <- recipe(as.formula(f), data = player_train)
    
    tennis_fit_lm <- workflow() %>%
        add_recipe(tennis_recipe_lm) %>%
        add_model(tennis_spec_lm) %>%
        fit(data = player_train)
    
    lm_rmspe_val <- tennis_fit_lm %>%
        predict(player_test) %>%
        bind_cols(player_test) %>%
        metrics(truth = win_rate, estimate = .pred) %>%
        filter(.metric == "rmse") %>%
        select(.estimate) %>%
        pull()

    multi_lm_results <- multi_lm_results %>%
        add_row(lm_predictor = f, lm_rmspe = lm_rmspe_val)

}

In [None]:
multi_lm_results %>%
    arrange(lm_rmspe)

**Table 7: RMSPE for Single Variable Linear Regression**

<span style="color:blue"> **Finally, the last option is linear regression for *combined predictors* with win_rate as the target. Again, we iterate with a for loop to reduce the amount of code. The result is presented by a table with 2 columns.** 

</span> <span style="color:red"> **Warning! The for_loop iteration may take time.** </span>

In [None]:
set.seed(1)

formulas <- c(
"win_rate ~ mean_rank_points + first_serve_win_pct",
"win_rate ~ mean_rank_points + height",
"win_rate ~ mean_rank_points + first_serve_pct",
"win_rate ~ mean_rank_points + first_serve_pct + first_serve_win_pct",
"win_rate ~ mean_rank_points + first_serve_pct + height"    
)

# empty results table with one typed column per reported value
multi_lm_results <- tibble(lm_predictor = character(), lm_rmspe = double())

for (f in formulas) {

    print(f)
    
    tennis_spec_lm <- linear_reg() %>%
        set_engine("lm") %>%
        set_mode("regression")
    
    tennis_recipe_lm <- recipe(as.formula(f), data = player_train)
    
    tennis_fit_lm <- workflow() %>%
        add_recipe(tennis_recipe_lm) %>%
        add_model(tennis_spec_lm) %>%
        fit(data = player_train)
    
    lm_rmspe_val <- tennis_fit_lm %>%
        predict(player_test) %>%
        bind_cols(player_test) %>%
        metrics(truth = win_rate, estimate = .pred) %>%
        filter(.metric == "rmse") %>%
        select(.estimate) %>%
        pull()

    multi_lm_results <- multi_lm_results %>%
        add_row(lm_predictor = f, lm_rmspe = lm_rmspe_val)

}

In [None]:
multi_lm_results %>%
    arrange(lm_rmspe)

**Table 8: RMSPE for Multi-Variable Linear Regression**

### <u> Using the Model </u> 

<span style="color:blue"> **After testing the candidate models, the one that produces the lowest RMSPE is:** </span>

- K-NN regression with mean rank points and first serve win percentage as predictors
- k = 6

<span style="color:blue">**Therefore, this is the model we will use to predict win rate.**</span>

In [None]:
set.seed(1)

tennis_recipe_final <- recipe(win_rate ~ mean_rank_points + first_serve_win_pct, data = player_train) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

tennis_model_final <- nearest_neighbor(weight_func = "rectangular", neighbors = 6) %>%
    set_engine("kknn") %>%
    set_mode("regression")

tennis_fit_final <- workflow() %>%
    add_recipe(tennis_recipe_final) %>%
    add_model(tennis_model_final) %>%
    fit(data = player_train)

<span style="color:blue"> **Now, we can try testing the model for a new player.** </span>

In [None]:
# age is not a predictor in the final model; it is carried along only for display
new_player <- tibble(mean_rank_points = 1400, first_serve_win_pct = 0.46, age = 29)

prediction <- predict(tennis_fit_final, new_player) %>%
    bind_cols(new_player) %>%
    rename(predicted_win_rate = .pred)

prediction

**Table 9: New Player Analysis**

# Discussion <img align="left" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDixQBw3HoqS_gnC9xVtHO-5NrnS1eQ91N3w&usqp=CAU" width="35" />  

Overall, we found that a player's "mean rank points" and "first serve win %" are good predictors of a player's win rate. Using these variables, we trained a KNN regression model with an RMSPE of only 8.53 percentage points. This is lower than the RMSPE of a linear regression model using the same predictors. We tested our model on a newly created player with 1400 rank points and a first serve win % of 46%, and the predicted win rate was 52%.

This result makes sense because a player who has many rank points and a strong serve is likely to win more games. The difference in RMSPE between the KNN and linear regression models was also expected because we observed a non-linear relationship between rank points and win rate. For our test player, the outcome was approximately what we would expect, since the predictor values were slightly above average based on the mean stats we calculated earlier.


Our model can predict the win rate of a tennis player using only a few statistics, and this could be useful in several ways. For one, it gives a sense of how the player will perform in the future. In other words, with the knowledge of their win rate, one can make a rough estimate of a player's chances in an upcoming match, tournament or season. This could be useful in a range of practical applications including scouting, sports betting, or even simply personal knowledge.


Some further questions that this analysis raised include: 

 - Can the other stats in the dataset be predicted from win rate and first serve win percentage (i.e. going off those two stats, can we be confident all the other stats are "good")?
 - Are there stats not included in the dataset that could improve the effectiveness of the model?
 - Do certain stats influence win rate more, and if so, is there a better weight function to use in the regression engine?





# References <img align="left" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDixQBw3HoqS_gnC9xVtHO-5NrnS1eQ91N3w&usqp=CAU" width="35" /> 


Keith Prowse Editors. “Love? Ace? Tennis Terminology Explained: Tennis Glossary.” *Keith Prowse*, 2019, www.keithprowse.co.uk/news-and-blog/2019/01/02/tennis-terminology-explained/
\
\
Nag, Utathya. “Tennis Rankings: How They Work and Difference between ATP and WTA Systems.” *Tennis Rankings: Everything You Need to Know*, International Olympic Committee, 2021, www.olympics.com/en/featured-news/tennis-rankings-atp-wta-men-women-doubles-singles-system-grand-slam-olympics
\
\
Timbers et al. “Data Science: A First Introduction.” *UBC Data Science*, 2021
\
\
**Data Source:** https://github.com/JeffSackmann/tennis_atp