**PREDICTING THE OUTCOME OF A TENNIS ROUND BASED ON SERVE STATISTICS**

**Introduction:**


In tennis, the serve is the only shot over which the player has complete control. It is therefore reasonable to assume that the serve is the most influential part of a round and plays a large role in determining who wins it. Using serve statistics, we will test whether this assumption holds and, if so, how large a role the serve plays, so that we can advise aspiring tennis players on whether honing their serve is the key to doing well. The dataset used is Jeff Sackmann's ATP tennis ranking data from 2017-2019.

**Preliminary exploratory data analysis:**

In [6]:
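# Load the packages used in this analysis
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
library(cowplot)
# Show at most 6 rows when a data frame is displayed
options(repr.matrix.max.rows = 6)
# Read the match data from the download URL and preview the first rows
url <- "https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn"
tennis_data <- read_csv(url)
head(tennis_data)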

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  .default = col_double(),
  tourney_id = col_character(),
  tourney_name = col_character(),
  surface = col_character(),
  tourney_level = col_character(),
  winner_seed = col_character(),
  winner_entry = col_character(),
  winner_name = col_character(),
  winner_hand = col_character(),
  winner_ioc = col_character(),
  loser_seed = col_character(),
  loser_entry = col_character(),
  loser_name = col_character(),
  loser_hand = col_character(),
  loser_ioc = col_character(),
  score = col_character(),
  round = col_character()
)

See spec(...) for full column specifications.



X1,tourney_id,tourney_name,surface,draw_size,tourney_level,tourney_date,match_num,winner_id,winner_seed,⋯,l_1stIn,l_1stWon,l_2ndWon,l_SvGms,l_bpSaved,l_bpFaced,winner_rank,winner_rank_points,loser_rank,loser_rank_points
<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0,2019-M020,Brisbane,Hard,32,A,20181231,300,105453,2.0,⋯,54,34,20,14,10,15,9,3590,16,1977
1,2019-M020,Brisbane,Hard,32,A,20181231,299,106421,4.0,⋯,52,36,7,10,10,13,16,1977,239,200
2,2019-M020,Brisbane,Hard,32,A,20181231,298,105453,2.0,⋯,27,15,6,8,1,5,9,3590,40,1050
3,2019-M020,Brisbane,Hard,32,A,20181231,297,104542,,⋯,60,38,9,11,4,6,239,200,31,1298
4,2019-M020,Brisbane,Hard,32,A,20181231,296,106421,4.0,⋯,56,46,19,15,2,4,16,1977,18,1855
5,2019-M020,Brisbane,Hard,32,A,20181231,295,104871,,⋯,54,40,18,15,6,9,40,1050,185,275


The data is already in an easy-to-read format and meets the criteria for "tidy data". However, inspecting the data shows that some values are missing. For simplicity, we will omit rows with missing cells from our data. The table below summarises the winner's serve variables:

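| Variable    | No. of obs  | Description                                  |
| ----------- | ----------- | -------------------------------------------- |
| w_ace       |             | Aces served by the match winner              |
| w_df        |             | Double faults by the match winner            |
| w_svpt      |             | Serve points played by the match winner      |
| w_1stIn     |             | First serves in by the match winner          |
| w_1stWon    |             | First-serve points won by the match winner   |
| w_2ndWon    |             | Second-serve points won by the match winner  |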
In [12]:
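# Total number of matches in the dataset
nrow(tennis_data)
# Number of matches with a recorded (non-missing) value for w_ace
nrow(tennis_data) - sum(is.na(tennis_data$w_ace))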
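
The same idea can be extended from `w_ace` to every serve column, followed by the row removal described earlier. The block below is only a minimal sketch: `toy_matches` is a hypothetical stand-in for `tennis_data` with made-up values, and using `drop_na()` from tidyr is one assumed way of omitting incomplete rows, not necessarily the approach used later in this analysis.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data standing in for tennis_data; values are made up
# purely for illustration and do not come from the real dataset.
toy_matches <- tibble(
  w_ace   = c(5, NA, 12, 7),
  w_df    = c(2, 1, NA, 3),
  w_svpt  = c(60, 75, 80, NA),
  w_1stIn = c(40, 50, 55, 45)
)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(toy_matches))

# Keep only the rows that have no missing value in any column.
complete_matches <- drop_na(toy_matches)

print(missing_per_column)
print(nrow(complete_matches))
```

On the real data, passing column names to `drop_na()` (for example `drop_na(tennis_data, w_ace, w_df)`) would restrict the check to those columns, so matches are not discarded because of gaps in unrelated fields such as `winner_seed`.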