In [5]:
library(tidyverse)
library(caret)

Loading required package: lattice


Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift




# Predicting Tennis Match Winners

# Introduction

The US Open Grand Slam tennis tournament has an estimated total prize pool of USD 50.4 million, which puts it in line with high-value events such as the MLB World Series and the PGA FedEx Cup (source: pledgesports.org). With such a high potential payoff, competitive tennis players and coaches have a strong incentive to understand which player attributes contribute to tournament success.

This project studies match results collected for the top 500 tennis players. Its purpose is to determine, by analyzing player and match information, whether certain attributes contribute to greater competitive success. The dataset contains results from nearly 7,000 matches and includes the winner and loser heights, playing hands, countries of origin, ages, match times and other information (source: https://github.com/JeffSackmann/tennis_atp).

# Preliminary Exploratory Data Analysis

In [7]:
data2017 <- read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2017.csv')
data2018 <- read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2018.csv')
data2019 <- read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2019.csv')

combined_data <- rbind(data2017, data2018, data2019)

tennis <- combined_data %>% 
    select(winner_hand, winner_age, loser_hand, loser_age, winner_rank, loser_rank)

# The source data has no `result` column: each row records one winner and one loser.
# One possible fix: stack both sides into one row per player per match and label each row.
tennis <- bind_rows(
    tennis %>% transmute(hand = winner_hand, age = winner_age, rank = winner_rank, result = "win"),
    tennis %>% transmute(hand = loser_hand, age = loser_age, rank = loser_rank, result = "loss")
) %>%
    mutate(result = factor(result))

# Stratified 75/25 split on the outcome label
training_rows <- tennis %>%
    select(result) %>%
    unlist() %>%
    createDataPartition(p = 0.75, list = FALSE) %>%
    as.vector()  # plain integer vector of row indices for slice()

training_set <- tennis %>% slice(training_rows)
testing_set <- tennis %>% slice(-training_rows)

Parsed with column specification:
cols(
  .default = col_double(),
  tourney_id = col_character(),
  tourney_name = col_character(),
  surface = col_character(),
  tourney_level = col_character(),
  winner_entry = col_character(),
  winner_name = col_character(),
  winner_hand = col_character(),
  winner_ioc = col_character(),
  loser_entry = col_character(),
  loser_name = col_character(),
  loser_hand = col_character(),
  loser_ioc = col_character(),
  score = col_character(),
  round = col_character()
)

See spec(...) for full column specifications.


# Methods

To load the data, we call read_csv() on the raw file URLs from https://github.com/JeffSackmann/tennis_atp. Because the analysis spans three years of data, we read three files and combine them with rbind().

For the classification analysis we will use the following columns:

* winner_hand (the dominant hand of the winner)
* loser_hand (the dominant hand of the loser)
* winner_age (the age of the winner)
* loser_age (the age of the loser)
* winner_ht (the height of the winner)
* loser_ht (the height of the loser)
* minutes (the length of the match)
* w_ace (the number of aces the winner had)
* l_ace (the number of aces the loser had)

We will then build a k-nearest neighbours (k-NN) model to classify whether a hypothetical player with given attributes will win their match. The remaining columns will be removed from the set, as they are not variables of interest for this analysis.
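
As a rough illustration of the modelling step, the sketch below fits a k-NN classifier with caret on simulated data. Everything in it is an assumption made for illustration: the `toy` data frame, its `height` and `aces` columns, the simulated win/loss labels, and the fixed choice of k = 5 are stand-ins, not the project's data or its tuned model.

```r
library(caret)

set.seed(2020)  # reproducible toy data

# Simulated stand-in for reshaped match data (NOT real ATP results):
# taller players with more aces are made slightly more likely to win.
n <- 200
toy <- data.frame(
    height = rnorm(n, mean = 186, sd = 7),
    aces   = rpois(n, lambda = 6)
)
score <- 0.05 * (toy$height - 186) + 0.2 * (toy$aces - 6) + rnorm(n)
toy$result <- factor(ifelse(score > 0, "win", "loss"), levels = c("loss", "win"))

# Stratified 75/25 split on the outcome label
train_idx <- as.vector(createDataPartition(toy$result, p = 0.75, list = FALSE))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Fit k-NN with a fixed k; centre and scale so both predictors share a scale
knn_fit <- train(
    result ~ height + aces,
    data = toy_train,
    method = "knn",
    preProcess = c("center", "scale"),
    tuneGrid = data.frame(k = 5),
    trControl = trainControl(method = "none")
)

# Predict on the held-out rows and compute the share classified correctly
toy_pred <- predict(knn_fit, newdata = toy_test)
toy_accuracy <- mean(toy_pred == toy_test$result)
```

Centring and scaling matter here because k-NN relies on distances, so a predictor measured on a larger scale would otherwise dominate. In practice k would likely be chosen by cross-validation rather than fixed in advance.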

We will also create a plot of height versus aces, colouring each point by whether it was a win or a loss.
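
A minimal sketch of how such a plot could be built with ggplot2 follows. The `toy_matches` values are made up for illustration, and the reshaping into one row per player per match is an assumed approach; only the column names `winner_ht`, `w_ace`, `loser_ht` and `l_ace` come from the dataset. The axis label assumes heights are recorded in centimetres.

```r
library(dplyr)
library(ggplot2)

# Hypothetical rows in the wide winner/loser layout of the source data (values are made up)
toy_matches <- tibble(
    winner_ht = c(188, 185, 198, 180),
    w_ace     = c(10, 4, 21, 3),
    loser_ht  = c(183, 193, 178, 185),
    l_ace     = c(5, 12, 2, 6)
)

# One row per player per match, labelled with the outcome
toy_long <- bind_rows(
    toy_matches %>% transmute(height = winner_ht, aces = w_ace, result = "win"),
    toy_matches %>% transmute(height = loser_ht, aces = l_ace, result = "loss")
)

# Scatter plot of height against aces, coloured by match outcome
height_aces_plot <- ggplot(toy_long, aes(x = height, y = aces, colour = result)) +
    geom_point(alpha = 0.6) +
    labs(x = "Height (cm)", y = "Aces per match", colour = "Result")
```

Printing `height_aces_plot` in a notebook cell renders the figure. With several thousand matches, the partial transparency set by `alpha` helps show where points overlap.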

# Expected outcomes and significance