Jonathan Chan (34466219), Erick Jovan Muljadi, Katie Swangard, Aurora Wang

DSCI 100 004

29 October 2022

Group Project Proposal

In [None]:
# Loading in necessary libraries:
library(tidyverse)
library(repr)
library(tidymodels)
library(ggplot2)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



**Background Information**

The ATP Tour is a worldwide top-tier men's tennis tour organized by the Association of Tennis Professionals (ATP).

The question our project asks is: based on three characteristics of a new player (height, age, and hand use), what ATP ranking will he achieve on each of the three surfaces (clay, grass, and hard court)?

We will use the dataset "Match Results for Top 500 Players 2017-2019". This dataset contains tournament match statistics from 2017 to 2019. It includes the historical rankings, results, and match statistics of the top 500 players in the ATP rankings, as well as individual characteristics of the winner and loser of each match, such as height, age, and hand use. By analyzing the relationship between these characteristics and each player's ranking position, we can predict the ranking of a new player on each of the three surfaces based on his characteristics.


In [None]:
# Reading in our dataset:
tennis <- read_csv("https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn")|>
    rename("ID_num" = 1)
# As the whole dataset would be too large to display, we are only displaying the first 10 rows here.
head(tennis, 10)

In [None]:
# Selecting only the columns needed for our data analysis:
tennis_smaller <- tennis |>
    select(ID_num, surface,
           winner_hand, winner_ht, winner_age, winner_rank,
           loser_hand, loser_ht, loser_age, loser_rank) |>
    mutate(ID_num = as_factor(ID_num)) # Converting ID_num into a factor column.
# Combining winner and loser data into one row per player per match.
# A single pivot with the ".value" sentinel keeps each player's hand, height, age,
# and rank together; four chained pivots would instead cross every winner/loser
# value with every other one (16 rows per match).
tennis_pivot <- tennis_smaller |>
    pivot_longer(cols = c(starts_with("winner_"), starts_with("loser_")),
                 names_to = c("wol", ".value"),
                 names_sep = "_") |>
    rename(height = ht) |>
    mutate(hand = as_factor(hand)) # Converting hand into a factor column.
tennis_pivot |>
    slice(1:10) # Displaying the first 10 rows.
# TODO: filter out rows with NA values before continuing.
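
A small sketch of the reshaping step on made-up data: chaining several separate `pivot_longer()` calls doubles the row count with each call and can pair one player's rank with the other player's height, so one option is a single pivot that uses the `.value` sentinel to keep each player's attributes on one row. The `toy_matches` table and all of its values are hypothetical and only mirror the column names used in this notebook.

```r
library(tidyr)
library(tibble)

# Two made-up matches (hypothetical values, not taken from the real dataset).
toy_matches <- tibble(
  ID_num      = c(1, 2),
  surface     = c("Clay", "Hard"),
  winner_hand = c("R", "L"),
  winner_ht   = c(185, 190),
  winner_age  = c(24.5, 30.1),
  winner_rank = c(10, 3),
  loser_hand  = c("R", "R"),
  loser_ht    = c(178, 196),
  loser_age   = c(27.0, 22.3),
  loser_rank  = c(55, 40)
)

# Split each column name at "_": the first part ("winner"/"loser") goes into
# the new column `wol`, and the second part names the output column (".value").
toy_long <- toy_matches |>
  pivot_longer(
    cols = c(starts_with("winner_"), starts_with("loser_")),
    names_to = c("wol", ".value"),
    names_sep = "_"
  )

toy_long
```

Each match contributes exactly two rows here (one for the winner and one for the loser), so every rank stays next to the height, age, and hand of the same player.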

In [None]:
# Grouping by rank, surface, and hand, then finding mean height and age.
tennis_mean <- tennis_pivot |>
    group_by(rank, surface, hand) |>
    summarize(mean_ht = mean(height, na.rm = TRUE),
              mean_age = mean(age, na.rm = TRUE),
              .groups = "drop") # Dropping the grouping so later steps see a plain tibble.
# Displaying only the first 10 rows due to space constraints:
head(tennis_mean, 10)

In [None]:
# Splitting our data 75/25 into training and testing data:
set.seed(2022) # Any fixed seed makes the random split reproducible; 2022 is an arbitrary choice.
tennis_split <- initial_split(tennis_mean, prop = 0.75, strata = rank)
tennis_training <- training(tennis_split)
tennis_testing <- testing(tennis_split)

In [None]:
tennis_training

In [None]:
# Plotting rank against mean age:
options(repr.plot.width = 9, repr.plot.height = 7)
rank_vs_age_plot <- tennis_training|>
    filter(surface == "Clay"| surface == "Grass"| surface == "Hard")|>
    ggplot(aes(x=mean_age, y=rank, colour = surface))+
    geom_point() +
    labs(x = "Mean Age (years)", y = "Rank")
rank_vs_age_plot
# TODO: use facet_grid() to show each surface in its own panel.
#
# Plotting rank against mean height:
rank_vs_height_plot<- tennis_training|>
    filter(surface == "Clay"| surface == "Grass"| surface == "Hard")|>
    ggplot(aes(x=mean_ht, y=rank, colour = surface))+
    geom_point() +
    labs(x = "Mean Height (cm)", y = "Rank")
rank_vs_height_plot
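
As a sketch of the `facet_grid()` idea noted in the cell above, the following draws one panel per surface instead of colouring the points by surface. The `toy_training` table and its values are made up for illustration; only the column names (`rank`, `mean_age`, `surface`) follow this notebook.

```r
library(ggplot2)
library(tibble)

# Hypothetical summary rows shaped like our training data (values are made up).
toy_training <- tibble(
  rank     = c(5, 40, 120, 15, 80, 200),
  mean_age = c(29.1, 26.4, 24.0, 31.2, 27.5, 22.8),
  surface  = c("Clay", "Clay", "Grass", "Grass", "Hard", "Hard")
)

# One column of panels per surface; all panels share the same axes.
faceted_plot <- ggplot(toy_training, aes(x = mean_age, y = rank)) +
  geom_point() +
  facet_grid(cols = vars(surface)) +
  labs(x = "Mean Age (years)", y = "Rank")

faceted_plot
```

Separate panels make it easier to compare the age and rank relationship across surfaces when the points of different surfaces overlap heavily.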

In [None]:
# Creating a workflow for further data analysis and prediction using regression:
tennis_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
                  set_engine("kknn") |>
                  set_mode("regression") 

tennis_recipe <- recipe(rank ~ mean_ht + mean_age, data = tennis_training) |>
                  step_scale(all_predictors()) |>
                  step_center(all_predictors())

tennis_workflow <- workflow() |>
                    add_recipe(tennis_recipe) |>
                    add_model(tennis_spec)
tennis_workflow

**Methods**

We will use the variables height, age, and playing hand (left or right) to predict a new player’s rank. After splitting the data into training and testing sets (75/25 split), we will use cross-validation on the training set to find the best value of K for our k-nearest neighbours regression model. We will then fit the model with this K on the training data and predict the ranks for our testing data. Afterwards, we will compare the true ranks to those predicted by our model and measure the prediction error using RMSE. If the RMSE on the training set and on the testing set are similar, this suggests that our model generalizes well to data it has not seen.
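
The tuning and evaluation steps described above could look roughly like the following sketch. It runs on randomly generated stand-in data (`toy`), and the seed, the 5 folds, and the grid of K values are all assumptions chosen for illustration rather than settings we have committed to. It also assumes the `kknn` package is installed.

```r
library(tidymodels)

set.seed(1) # arbitrary seed (an assumption) so the sketch is reproducible

# Made-up data standing in for the summarized table; not the real ATP data.
toy <- tibble(
  mean_ht  = rnorm(200, mean = 185, sd = 7),
  mean_age = rnorm(200, mean = 27, sd = 4),
  rank     = round(runif(200, min = 1, max = 500))
)

toy_split    <- initial_split(toy, prop = 0.75, strata = rank)
toy_training <- training(toy_split)
toy_testing  <- testing(toy_split)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

knn_recipe <- recipe(rank ~ mean_ht + mean_age, data = toy_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# 5-fold cross-validation over a hypothetical grid of K values.
toy_vfold <- vfold_cv(toy_training, v = 5, strata = rank)
k_grid    <- tibble(neighbors = seq(from = 1, to = 41, by = 4))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = toy_vfold, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "rmse")

# K with the lowest cross-validation RMSE.
best_k <- cv_results |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen K, then score the test set.
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(final_spec) |>
  fit(data = toy_training)

test_rmse <- final_fit |>
  predict(toy_testing) |>
  bind_cols(toy_testing) |>
  metrics(truth = rank, estimate = .pred) |>
  filter(.metric == "rmse")
```

Comparing the `mean` column of `cv_results` at `best_k` with the `.estimate` in `test_rmse` is one way to check whether the training and testing errors are similar.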

Next, we will input a new player’s height, age, and preferred hand and predict his rank using the model fitted on the training data. We will then make two plots, rank vs. height and rank vs. age, that overlay the new player’s predicted rank on the observed data to visualise whether the prediction is consistent with the existing players.
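
A minimal sketch of the new-player step, again on made-up data: the `new_player` values, the fixed `neighbors = 5`, and the seed are hypothetical, and the sketch uses only the two numeric predictors from our recipe (height and age), leaving out playing hand. It assumes the `kknn` package is installed.

```r
library(tidymodels)

set.seed(1) # arbitrary seed (an assumption) for reproducible toy data

# Made-up training data standing in for the real summarized table.
toy_training <- tibble(
  mean_ht  = rnorm(100, mean = 185, sd = 7),
  mean_age = rnorm(100, mean = 27, sd = 4),
  rank     = round(runif(100, min = 1, max = 500))
)

toy_recipe <- recipe(rank ~ mean_ht + mean_age, data = toy_training) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K is fixed at 5 here only for illustration; in practice it would come from tuning.
toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

toy_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(toy_spec) |>
  fit(data = toy_training)

# A hypothetical new player, described with the same predictor columns.
new_player <- tibble(mean_ht = 188, mean_age = 24)

new_player_pred <- predict(toy_fit, new_player) |>
  bind_cols(new_player)

# Overlay the predicted rank (red point) on the training data.
prediction_plot <- ggplot(toy_training, aes(x = mean_ht, y = rank)) +
  geom_point(alpha = 0.4) +
  geom_point(data = new_player_pred, aes(x = mean_ht, y = .pred),
             colour = "red", size = 3) +
  labs(x = "Mean Height (cm)", y = "Rank")

prediction_plot
```

Because the model averages the ranks of the nearest neighbours, the predicted rank always falls within the range of ranks seen in the training data.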


**Expected Outcomes**

We expect to be able to predict the rank of a new player with reasonable accuracy, as we believe the predictors we have picked are informative about a new player's ability. Furthermore, we expect the surface each match is played on to affect the predicted rank of a new player, because players often have more practice on specific types of court, which affects their performance on those courts.

The impacts of such findings could be significant: as can be seen from our dataset, higher-ranked tennis players often win against lower-ranked players. By predicting the rank of an unknown player, we can also roughly predict the outcome of a tennis match before it is played. This has implications not only for tennis itself (being able to predict whether you would win against your opponent), but also for sports gambling and fantasy sports.

As for our secondary hypothesis, if our hypothetical player finds that he is underperforming on certain surfaces, he could practice more on those surfaces to improve his rank there. Our analysis can thus also help tennis players improve their abilities and earnings.

One additional question that could arise from this analysis is whether the same results apply to the top 500 women's players. Women's tennis is one of the most watched women's sports in the world, and if the same results hold there, then all the impacts discussed above would also apply to women's tennis.
