In [None]:
install.packages("knitr") 

#installs knitr, a package used to render tables here.
#it's here because we didn't use it in class. No need to run this if you have it already.

## Project Report

**Introduction**

Tennis is a sport played with rackets and a rubber ball. It can be played with either one or two players on each side, and only two sides play at a time. The objective of the game is to hit the ball over the net in such a way that the opponent is unable to return it.[1] Each time the opponent fails to return the ball, the other side wins a point. The first side to win the required number of points wins a game, and after winning 6 games, the side is said to have won a "set". A match is typically played as best of three or best of five sets. Every year a new season starts. Throughout the season, players travel around the world to compete in different tournaments. For the matches they win, players earn points that are used to rank them, so each player's rank reflects their results. Throughout tennis history, there have been great battles between veterans to attain the top spot and become the best tennis player of all time.[2]

In our data analysis, we aim to answer the following question: “What relationship do a player’s current ranking, number of seasons played, and prize money have with the player’s best ranking?” We will be using the “Player Stats for Top 500 Players” dataset, which includes statistical information about the top 500 tennis players in the world.[3] Specifically, we will focus on the following variables: “Age, Prize Money, Seasons, Current Rank, and Best Rank”. We will use KNN regression to do this analysis.

**Methods and Results**

In [None]:
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(repr))
suppressMessages(library(GGally))
suppressMessages(library(knitr))



set.seed(4747) # the ASCII code of "G" in hexadecimal is 47, and we are group 47. Thus: 4747 = G47

Here we load all the libraries that we will use to analyze our dataset, and set a random seed so that our results are reproducible.

In [None]:
# Stop if the "data" directory already exists, to prevent possible errors with the R kernel crashing from running this twice;
# otherwise create the directory.
if (file.exists("data")) {
    stop("Folder already exists. If CSV file isn't downloaded, delete data folder")
} else {
    dir.create("data")
}

url <- "https://drive.google.com/uc?export=download&id=1_MECmUXZuuILYeEOfonSGqodW6qVdhsS"
download.file(url, "data/top500players.csv") # downloads the dataset into the "data" folder

### RUN THIS BLOCK ONLY ONCE 

In this code block, we check whether the folder "data" already exists. If it does, we force an error to avoid downloading a duplicate copy. If not, we create the folder, download the data from the URL, and store it in the folder as "top500players.csv".
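A re-runnable variant of this step could look like the sketch below. It is not part of our analysis: `ensure_download()` and its `downloader` argument are hypothetical names introduced here for illustration. The idea is to create the folder only if it is missing and to download only if the file is not already on disk, so running the cell twice does no harm.

```r
# Hypothetical helper (not used in the analysis above): download a file
# only when it is not already on disk, so the cell can be re-run safely.
ensure_download <- function(url, dest, downloader = utils::download.file) {
  dest_dir <- dirname(dest)
  # Create the target folder (and any parents) if it does not exist yet
  if (!dir.exists(dest_dir)) {
    dir.create(dest_dir, recursive = TRUE)
  }
  # Skip the download when the file is already present
  if (!file.exists(dest)) {
    downloader(url, dest)
  }
  invisible(dest)
}
```

With this helper, the download step would become `ensure_download(url, "data/top500players.csv")`.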

In [None]:
player_data <- read_csv("data/top500players.csv", show_col_types = FALSE)
glimpse(player_data)

In this code block, we use the function read_csv() to read the data file using its relative path, and we store the resulting data frame as player_data. We then use the function glimpse() to preview the columns, their types, and the first few values in each.

In [None]:
colnames(player_data) <- make.names(colnames(player_data)) # converts column names to syntactically valid R names
player_data_tidy <- player_data %>%
    separate(Best.Elo.Rank, c("Best.Rank", NA), sep = " ") %>% # removes the date next to the all-time best ranks
    separate(Current.Elo.Rank, c("Current.Rank", NA), sep = " ") %>% # removes the Elo rating next to the current ranks
    separate(Age, c("Age", NA), sep = " ") %>% # keeps only the number at the start of Age
    select(Name, Age, Prize.Money, Seasons, Current.Rank, Best.Rank) %>%
    mutate(Prize.Money = gsub("[a-zA-Z$, -]", "", Prize.Money)) %>% # removes all string chars next to numbers, US$ etc.
    mutate(Prize.Money = as.numeric(Prize.Money)) %>%
    mutate(Best.Rank = as.numeric(Best.Rank)) %>%
    mutate(Current.Rank = as.numeric(Current.Rank)) %>%
    mutate(Age = as.numeric(Age)) %>%
    na.omit() # drops players with missing values in any selected column

kable(head(player_data_tidy),
      caption = "Table 1.0")

In this code block, we wrangled and tidied the data. In our original data, the date appeared next to the best rank. To make our data more readable, we wanted to keep only the rank without the date, which we did using the separate() function. We used the same function to remove the Elo rating next to the rank in Current.Rank and the extra text next to Age. We used gsub() to remove all the non-numeric characters, such as the currency, from the Prize.Money column. Then, we converted the columns Prize.Money, Current.Rank, Best.Rank and Age to numeric data and dropped rows with missing values.
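The two cleaning steps can be illustrated on a few made-up strings. The sample values below are hypothetical and only mimic the kind of formatting described above; the real columns may look different. The sketch uses base R's sub() in place of separate() to keep the text before the first space.

```r
# Hypothetical raw values; the real columns may be formatted differently.
raw_rank  <- c("3 (01-02-2016)", "120 (15-07-2019)")
raw_money <- c("US$ 1,234,567", "$98,765")

# Keep only the text before the first space, similar to
# separate(..., sep = " ") when the second piece is discarded.
rank_only <- as.numeric(sub(" .*$", "", raw_rank))

# Drop letters, "$", commas, spaces and hyphens, then convert to numeric.
money_only <- as.numeric(gsub("[a-zA-Z$, -]", "", raw_money))

rank_only
money_only
```

Note that this pattern also removes hyphens, so it would strip a minus sign if one were present.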


Below, we will use scatter plots to explore the relationships between our variables. This will give us an overview of how each variable relates to a player's best ranking. It will also help us determine which variables to omit, since including unsuitable variables may affect the accuracy of our predictions.

In [None]:
options(repr.plot.height = 10, repr.plot.width = 8)

SeasonsVsBestRank <- ggplot(player_data_tidy, aes(x = Seasons, y = Best.Rank)) + 
                    # geom_smooth(method = "lm", colour = "red", formula = y ~ x, se = FALSE) + #Line of best fit, helps in seeing relationship
                    geom_point(colour = "blue", alpha = 0.4) +
                     labs(x = "Seasons Played", y = "Best Rank", 
                          title = "Seasons vs Best Rank (lower is better)",
                          caption = "Figure 1.0") + 
                     theme(text = element_text(size = 17))

SeasonsVsBestRank

In Figure 1.0, we compare seasons played to best rank. These variables appear to have a negative relationship: best rank (y) tends to decrease as seasons played (x) increases. Because a lower rank number is better, this means that players who have played more seasons tend to have a better best rank.

In [None]:
CurrentRankVsBestRank <- ggplot(player_data_tidy, aes(x = Current.Rank, y = Best.Rank)) + 
                    geom_point(color = "blue", alpha = 0.4) +
                     labs(x = "Current Rank", y = "Best Rank",
                          title = "Current Rank vs Best Rank",
                          caption = "Figure 1.1") + 
                     theme(text = element_text(size = 17)) 

CurrentRankVsBestRank

In Figure 1.1, we compare a player's current rank to their best rank. These variables appear to have a positive relationship: y tends to increase as x increases. This means that players who have a good (numerically low) current rank also tend to have a good best rank. We also noticed that the y value is always less than or equal to the x value. That is because a player's current rank (x value) cannot be better than their best rank (y value): the current rank is either equal to the best rank or numerically higher (worse).

In [None]:
AgeVsSeasons <- ggplot(player_data_tidy, aes(x = Age, y = Seasons)) + 
                    geom_point(color = "blue", alpha = 0.4) +
                     labs(x = "Age", y = "Seasons Played",
                          title = "Age vs Seasons",
                          caption = "Figure 1.2") + 
                     theme(text = element_text(size = 17)) 

AgeVsSeasons

In Figure 1.2, we can see that the variables Age and Seasons are very strongly correlated. Including both as predictors would be largely redundant: in KNN regression, two highly correlated predictors effectively count the same information twice in the distance calculation. For this reason, we will omit Age as a predictor variable and use Seasons in our analysis.

In [None]:
PrizeMoneyVsBestRank <- ggplot(player_data_tidy, aes(x = Prize.Money, y = Best.Rank)) + 
                    geom_point(color = "blue", alpha = 0.4) +
                     labs(x = "Prize Money", y = "Best Rank",
                          title = "Prize Money vs Best Rank",
                          caption = "Figure 2.0") +  
                     theme(text = element_text(size = 17))

PrizeMoneyVsBestRank

In Figure 2.0, we compare prize money to best rank. As we can see on the plot, the majority of the data points are crowded near 0e+00 on the x axis. Because a few players have extremely high prize money, the axis is stretched and we are unable to properly view the majority of the data points, which are located at the lower values. To fix this problem, we will apply a log scale to prize money. This spreads out values that span several orders of magnitude so that we can properly see the relationship between these variables.

In [None]:
LogPrizeMoneyVsBestRank <- ggplot(player_data_tidy, aes(x = Prize.Money, y = Best.Rank)) + 
                    geom_point(color = "blue", alpha = 0.4) +
                     labs(x = "Log-Scaled Prize Money", y = "Best Rank",
                          title = "Log-Scaled Prize Money vs Best Rank",
                          caption = "Figure 2.1") +  
                     theme(text = element_text(size = 17)) +
                     scale_x_log10()

LogPrizeMoneyVsBestRank

In Figure 2.1, prize money is log-scaled and compared to best rank. These variables appear to have a negative relationship: y tends to decrease as x increases. This means that players who have earned more prize money tend to have a better (numerically lower) best rank. The overall shape is roughly linear up to about 1e+07 on the x axis; after this point, the data flattens out. The relationship between the variables is fairly strong, as y reliably decreases as x increases.


Below, we will use KNN regression for our data analysis. We use regression instead of classification because we want to predict a player's best rank, which is a numerical value. We chose KNN regression over linear regression because some of the relationships between the variables are not very linear. For example, in Figure 2.1 the trend flattens out at high prize money, and in Figure 1.0 and Figure 1.1 the data points are quite scattered rather than following a straight line. KNN regression does not assume a particular shape for the relationship, so it can take this into account.
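As a rough illustration of what KNN regression does for a single new observation, the sketch below averages the responses of the k closest training points. The function `knn_predict_one()` and the toy data are made up for this example; our actual analysis uses the tidymodels workflow with the kknn engine, not this code.

```r
# Minimal KNN regression for one new observation (illustrative only).
# train_x: numeric matrix of predictors, one row per training observation
# train_y: numeric vector of responses, one per training row
# new_x:   numeric vector with one value per column of train_x
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  diffs <- sweep(train_x, 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training rows
  nearest <- order(dists)[seq_len(k)]
  # Unweighted ("rectangular") average of their responses
  mean(train_y[nearest])
}

# Toy data, made up for illustration
toy_x <- matrix(c(0, 1, 2, 10), ncol = 1)
toy_y <- c(5, 7, 9, 100)
knn_predict_one(toy_x, toy_y, new_x = 1.2, k = 3)
```

With k = 3, the far-away point at x = 10 is ignored, so its extreme response does not pull the prediction up. A larger k would include it and smooth the prediction toward the overall mean.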

In [None]:
log_player_data_tidy <- player_data_tidy %>%
                            mutate(Prize.Money = log(Prize.Money))

In this code, we replace the prize money column with its natural logarithm, using log(), for further analysis.
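The plot in Figure 2.1 uses a base-10 log scale, while log() computes the natural logarithm. The sketch below, using made-up prize money values, suggests why this difference should not matter for a model whose predictors are centered and scaled: the two logarithms differ only by a constant factor, which standardization removes. The helper `standardize()` is a hypothetical stand-in for the centering and scaling steps.

```r
# Made-up prize money values spanning several orders of magnitude
prize <- c(5e3, 2e4, 1.5e5, 3e6, 1.2e8)

natural_log <- log(prize)    # natural logarithm, as computed by log()
base10_log  <- log10(prize)  # base-10 logarithm, as used by scale_x_log10()

# The two differ only by the constant factor log(10)
same_up_to_factor <- all.equal(natural_log, base10_log * log(10))

# After centering and scaling, the two versions coincide
standardize <- function(x) (x - mean(x)) / sd(x)
same_after_standardizing <- all.equal(standardize(natural_log),
                                      standardize(base10_log))

same_up_to_factor
same_after_standardizing
```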

In [None]:
set.seed(4747)


player_split <- initial_split(log_player_data_tidy)
player_training <- training(player_split)
player_testing <- testing(player_split)

In [None]:
set.seed(4747)

player_vfold <- vfold_cv(player_training, v = 5, strata = Best.Rank)
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 1))

player_recipe <- recipe(Best.Rank ~ Seasons + Prize.Money + Current.Rank, data = player_training) %>% 
                        step_scale(all_predictors()) %>%
                        step_center(all_predictors())

knn_spec_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
                set_engine("kknn") %>%
                set_mode("regression")

player_resamples_metrics <- workflow() %>%
                            add_model(knn_spec_tune) %>%
                            add_recipe(player_recipe) %>%
                            tune_grid(resamples = player_vfold, grid = k_vals) %>%
                            collect_metrics()

In [None]:
resamples_metrics_clean <- player_resamples_metrics %>%
                                filter(.metric == "rmse") %>%
                                arrange(mean) %>%
                                slice(1) %>%
                                select(-.estimator, -n, -.config)

kable(resamples_metrics_clean, caption = "Table 2.0")

In the code above, we split the data into training and testing sets (initial_split() uses a 75/25 split by default). Next, using only our training data, we run 5-fold cross-validation to find the best k value. For each number of neighbors from 1 to 100, we tuned the model so that it returns the cross-validated RMSE. We will use the number of neighbors with the lowest RMSE as the best k value. In Table 2.0, we used the arrange() and slice() functions to show the number of neighbors that has the lowest RMSE. In this case, our best k value is 10.

In [None]:
set.seed(4747)

best_k <- resamples_metrics_clean %>%
            select(neighbors) %>%
            pull()

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) %>%
                set_engine("kknn") %>%
                set_mode("regression")

knn_fit <- workflow() %>%
            add_model(knn_spec) %>%
            add_recipe(player_recipe) %>%
            fit(player_training) 

knn_predict <- predict(knn_fit, player_testing) %>%
                bind_cols(player_testing)

knn_metrics <- metrics(knn_predict, truth = Best.Rank, estimate = .pred)

knn_fit

In this code, we fit the model on the training set with k = 10 and evaluate it on the test set. We will analyze our results using the two tables below.

In [None]:
kable(head(knn_predict),
      caption = "Table 3.0")

In Table 3.0, we can compare our predictions (.pred) with the actual values (Best.Rank). This table shows only the first few rows.

In [None]:
kable(head(knn_metrics), 
      caption = "Table 3.1")

Table 3.1 shows the metrics of our KNN regression model on the test set, including the RMSE. Our RMSE value is 18.1065986.
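To make the metrics concrete, the sketch below computes an RMSE and an R squared by hand on made-up numbers, not on our results. It assumes that R squared is taken as the squared correlation between the true and predicted values, which is our understanding of what the `rsq` metric reports; treat that as an assumption rather than a statement about the library.

```r
# Made-up true and predicted best ranks, for illustration only
truth <- c(10, 50, 120, 300)
pred  <- c(14, 45, 130, 280)

# Root mean squared error: the typical size of a prediction error,
# in the same units as the response (rank positions)
rmse_value <- sqrt(mean((truth - pred)^2))

# Squared correlation between true and predicted values
# (assumed here to match the reported R squared)
rsq_value <- cor(truth, pred)^2

rmse_value
rsq_value
```

Read this way, an RMSE near 18 means that predictions are typically off by roughly 18 rank positions.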

In [None]:
pred_current_rank <- ggplot(knn_predict, aes(x = Current.Rank, y = Best.Rank)) + 
                    geom_point(color = "blue", alpha = 0.4) +
                    geom_line(aes(x = Current.Rank, y = .pred), color = "red") +
                     labs(x = "Current Rank", y = "Best Rank",
                          title = paste("Current Rank vs Best Rank, k =", best_k, sep = " "),
                          caption = "Figure 3.0") +  
                     theme(text = element_text(size = 17))
pred_current_rank

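Figure 3.0 shows the test-set predictions of the KNN regression model (red line, K = 10) plotted against current rank, with the actual best ranks shown as blue points. Because the model uses three predictors, the line is not a smooth function of current rank alone. It has an overall increasing trend, indicating that as current rank increases, the predicted best rank also increases.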

In [None]:
pred_prize_money <- ggplot(knn_predict, aes(x = Prize.Money, y = Best.Rank)) + 
                    geom_point(color = "blue", alpha = 0.4) +
                    geom_line(aes(x = Prize.Money, y = .pred), color = "red") +
                     labs(x = "Log Prize Money", y = "Best Rank",
                          title = paste("Log Prize Money vs Best Rank, k =", best_k, sep = " "),
                          caption = "Figure 3.1") +  
                     theme(text = element_text(size = 17))
pred_prize_money

Figure 3.1 shows the test-set predictions of the KNN regression model (red line, K = 10) plotted against prize money, which is on the natural log scale here. It has an overall decreasing trend: as prize money increases, the best rank number decreases, indicating that the player is ranked higher.

In [None]:
pred_seasons <- ggplot(knn_predict, aes(x = Seasons, y = Best.Rank)) + 
                    geom_point(color = "blue", alpha = 0.4) +
                    geom_line(aes(x = Seasons, y = .pred), color = "red") +
                     labs(x = "Seasons Played", y = "Best Rank",
                          title = paste("Seasons Played vs Best Rank, k =", best_k, sep = " "),
                          caption = "Figure 3.2") +  
                     theme(text = element_text(size = 17))
pred_seasons

Figure 3.2 shows the test-set predictions of the KNN regression model (red line, K = 10) plotted against seasons played. It has an overall decreasing trend, indicating that players who have played more seasons tend to have better (numerically lower) best ranks.

**Discussion**

- We see clear relationships in our data analysis. As prize money increases, the best rank number decreases, indicating that these players are ranked higher. This was expected because top players earn more prize money from winning matches. [4]

- As the number of seasons played increases, the best rank number decreases, indicating that these players reached a higher rank. This was expected because players gain vital experience in previous seasons and use it to improve, which plays a key role in getting better and therefore ranking higher.

- As a player's current rank number increases, their best rank number also increases. In addition, a current rank that is worse than the best rank may indicate that the player's performance has declined since their peak, whereas a current rank equal to the best rank may indicate that the player is at their peak or still improving.

- These findings could help the sports industry identify promising athletes and predict their success. For example, sports universities could use seasons played, prize money, and current rank to decide which athletes to enter in competitions.

- Using only Seasons, Prize Money and Current Rank as predictors of a player's best rank has limited accuracy, because ranks can be influenced by many other factors that are not present in this dataset. For example, weather on match day, an athlete's mental resilience, and their choice of shoes and sportswear could all potentially affect results. Physical traits also have a large effect on performance and rank. Future questions could focus on addressing these limitations by exploring other variables that have relationships with best rank.

- The plots also show a few outliers: players who start off very well and achieve a good rank after only a few seasons. These outliers could have affected our results.

- A root mean squared error of 18.1066 means that our predictions are typically off by about 18 rank positions, which is not too large for a dataset of about 500 players. An R squared value of 0.84 means that the model's predictions account for roughly 84% of the variation in best rank on the test set, which suggests that the three predictors capture most, but not all, of what determines a player's best rank.


**References**
1. Bruce, Morys George Lyndhurst, Aberdare, 4th Baron and Lorge, Barry Steven. "tennis". Encyclopedia Britannica, 4 Jun. 2021, https://www.britannica.com/sports/tennis.
2. Tennis Scoring: Points, Sets & Games | Tennis Rules | USTA. (n.d.). https://www.usta.com/en/home/improve/tips-and-instruction/national/tennis-scoring-rules.html.
3. Sackmann, J. (2015). GitHub - JeffSackmann/tennis_atp: ATP Tennis Rankings, Results, and Stats. GitHub. https://github.com/JeffSackmann/tennis_atp
4. Spiegel, J. (2021, September 12). Us open prize money: How much will the winners make in 2021? Purse, breakdown for field. Sporting News Canada. Retrieved December 8, 2021. https://www.sportingnews.com/ca/tennis/news/us-open-prize-money-2021-purse-breakdown/jtbepmuo3vmu1xm049906xs2i.