## Remaining Write-up Sections
- Introduction/outline of methods section
- Descriptions of exploratory analysis section (scatter plots and ggpairs)
- Descriptions of modelling section (single and multi knn and linear regression)
- Discussion section (see outline below)
- References (see outline below)

In [2]:
library(tidyverse)
library(repr)
library(tidymodels)
library(rvest)
library(stringr)
library(DBI)
library(dbplyr)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

# Predicting the Win Rate of Tennis Players  
<img align="left" src="https://images.unsplash.com/photo-1554068865-24cecd4e34b8?ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8dGVubmlzfGVufDB8fDB8fA%3D%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=60" width="1000" />  

*Source: https://unsplash.com/photos/WqI-PbYugn4*

# Introduction<img align="left" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDixQBw3HoqS_gnC9xVtHO-5NrnS1eQ91N3w&usqp=CAU" width="35" />   


### Background 

Tennis is a popular, competitive sport played around the world. Players score points by hitting a tennis ball across a net so that their opponent cannot return it before it bounces twice.
Tennis can be played as "singles", with one person on each side of the net, or as "doubles", with teams of two.
It can be played on a variety of surfaces, such as grass, clay, or hard court (similar to a gym floor).

The Association of Tennis Professionals (ATP) organizes professional tournaments and collects data on the players and the matches that take place.

### Our Question

Based on the career statistics of a tennis player, what will be their win rate?

### Our Dataset

We are using the "Game results for Top 500 Players from 2017-2019" dataset for our analysis. Each row in this dataset represents a match between two players. Each row contains player stats (e.g. age, height, rank) and match stats (break points, serve points, double faults etc.). We can use this data to determine the relationship between player stats and win rate.
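
Because each row is a match rather than a player, a win rate has to be derived by counting how often each player appears as the winner versus the loser. The following is a minimal sketch of that idea on a made-up four-match table. The `winner_name` and `loser_name` columns exist in the dataset, but the player names and the helper objects (`matches`, `wins`, `losses`, `win_rates`) are illustrative assumptions, not the analysis code itself.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the match-level data: one row per match.
# The players are invented for illustration.
matches <- tibble(
  winner_name = c("Player A", "Player A", "Player B", "Player C"),
  loser_name  = c("Player B", "Player C", "Player C", "Player A")
)

# Count wins and losses separately, using a common key column name.
wins <- matches %>%
  count(winner_name, name = "wins") %>%
  rename(player = winner_name)

losses <- matches %>%
  count(loser_name, name = "losses") %>%
  rename(player = loser_name)

# A full join keeps players who only ever won or only ever lost;
# their missing counts are treated as zero.
win_rates <- full_join(wins, losses, by = "player") %>%
  mutate(
    wins     = coalesce(wins, 0L),
    losses   = coalesce(losses, 0L),
    win_rate = wins / (wins + losses)
  ) %>%
  arrange(player)

win_rates
```

A full join is used here rather than an inner join so that a player with no recorded losses (or no recorded wins) is not silently dropped.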

*Picture source: https://www.emojipng.com/preview/458725*

# Methods and Results<img align="left" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDixQBw3HoqS_gnC9xVtHO-5NrnS1eQ91N3w&usqp=CAU" width="35" />   


    describe in written English the methods you used to perform your analysis from beginning to end, narrating the code that does the analysis.
    your report should include code which:
        loads data from the original source on the web 
        wrangles and cleans the data from its original (downloaded) format to the format necessary for the planned analysis
        performs a summary of the data set that is relevant for exploratory data analysis related to the planned analysis 
        creates a visualization of the dataset that is relevant for exploratory data analysis related to the planned analysis
        performs the data analysis
        creates a visualization of the analysis 
        note: all tables and figures should have a figure/table number and a legend


In [22]:
## the dataset can be read from the URL link:
atp_data_frame <- read_csv("https://drive.google.com/uc?export=download&id=1fOQ8sy_qMkQiQEAO6uFdRX4tLI8EpSTn")

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  .default = col_double(),
  tourney_id = col_character(),
  tourney_name = col_character(),
  surface = col_character(),
  tourney_level = col_character(),
  winner_seed = col_character(),
  winner_entry = col_character(),
  winner_name = col_character(),
  winner_hand = col_character(),
  winner_ioc = col_character(),
  loser_seed = col_character(),
  loser_entry = col_character(),
  loser_name = col_character(),
  loser_hand = col_character(),
  loser_ioc = col_character(),
  score = col_character(),
  round = col_character()
)

See spec(...) for full column specifications.



# Discussion <img align="left" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDixQBw3HoqS_gnC9xVtHO-5NrnS1eQ91N3w&usqp=CAU" width="35" />  


    summarize what you found
    discuss whether this is what you expected to find?
    discuss what impact could such findings have?
    discuss what future questions could this lead to?


There are many components involved in the game of tennis, which is reflected in the 50 columns of the raw data set. These variables contribute to whether a player wins or loses a match and therefore affect that player's overall win rate. By wrangling the raw columns into nine predictors, we can fit a KNN regression model to predict a player's career win rate. The predictors include:


| Variable                        | Explanation                                                                              |
|---------------------------------|------------------------------------------------------------------------------------------|
| Age (years)                     | Older players may have sustained more injuries and be less fit.                          |
| Height (cm)                     | Height can provide an advantage when serving.                                            |
| Serve Points that were Aces (%) | Winning points outright on a serve indicates a strong serve.                             |
| First Serves (%)                | A higher share of first serves landing in means a player's serve is more accurate.       |
| First Serves Won (%)            | Strong and accurate first serves will lead to fewer double faults.                       |
| Second Serves Won (%)           | Strong second serves mean fewer lost points due to a slow serve.                         |
| Double Faults per Game (ratio)  | Fewer double faults per game indicates accurate serving.                                 |
| Breakpoints Saved (%)           | Preventing breaks means a player wins the points that matter most for winning the match. |
| Rank Points                     | Awarded to players by the ATP for winning matches.                                       |
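
The serve-related predictors in the table are ratios of match counts, so one way to obtain career values is to sum the counts per player and then divide. The sketch below illustrates this on invented numbers. Every column name in `serve_stats` is a hypothetical placeholder (the real dataset's column names may differ, and it presumably records these counts separately for the winner and loser of each match), and treating second-serve points as serve points minus first serves in is an assumption of this example.

```r
library(dplyr)
library(tibble)

# Toy per-match serve counts. All column names are HYPOTHETICAL
# placeholders, not the names used in the real dataset.
serve_stats <- tibble(
  player     = c("Player A", "Player A", "Player B"),
  ace        = c(10, 6, 3),    # aces served
  svpt       = c(80, 70, 60),  # total serve points
  first_in   = c(50, 45, 36),  # first serves that landed in
  first_won  = c(40, 30, 24),  # points won on first serve
  second_won = c(15, 15, 12),  # points won on second serve
  df         = c(2, 4, 3),     # double faults
  sv_gms     = c(12, 10, 10),  # service games played
  bp_saved   = c(3, 5, 2),     # break points saved
  bp_faced   = c(4, 8, 5)      # break points faced
)

# Sum the counts over each player's matches first, then take ratios,
# so that longer matches carry proportionally more weight.
predictors <- serve_stats %>%
  group_by(player) %>%
  summarize(
    ace_pct        = 100 * sum(ace) / sum(svpt),
    first_in_pct   = 100 * sum(first_in) / sum(svpt),
    first_won_pct  = 100 * sum(first_won) / sum(first_in),
    # Assumption: every serve point without a first serve in is a second-serve point.
    second_won_pct = 100 * sum(second_won) / (sum(svpt) - sum(first_in)),
    df_per_game    = sum(df) / sum(sv_gms),
    bp_saved_pct   = 100 * sum(bp_saved) / sum(bp_faced),
    .groups = "drop"
  )

predictors
```

Averaging per-match percentages instead would weight a short match the same as a long one; which of the two is more appropriate for this analysis is a judgment call.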

The KNN regression model will need to be trained, tuned, and evaluated using training and testing sets before it can predict the win rate of a new player. By recording the K and RMSE (root mean squared error) values for each model, we will be able to compare how well each predictor predicts a player's win rate. We can then analyze which combination of predictors, if any, yields the lowest test error and therefore the best ability to predict a player's win rate. 
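
To make the tuning loop concrete, here is a small base-R sketch of KNN regression with a single predictor: it chooses K by 5-fold cross-validation on the training set and then reports RMSE once on the test set. It is not the tidymodels workflow loaded at the top of this notebook. The data are synthetic, and the helper names (`knn_predict`, `rmse`, `cv_results`) and the grid of K values are assumptions made for illustration.

```r
set.seed(2020)

# Synthetic data: win rate rises with one hypothetical predictor.
n <- 120
players <- data.frame(rank_points = runif(n, 0, 10))
players$win_rate <- 0.3 + 0.04 * players$rank_points + rnorm(n, sd = 0.05)

# Split into training (75%) and testing (25%) sets.
train_idx <- sample(n, size = 0.75 * n)
train <- players[train_idx, ]
test  <- players[-train_idx, ]

# KNN regression: predict the mean response of the k nearest training points.
knn_predict <- function(train_x, train_y, new_x, k) {
  sapply(new_x, function(x) {
    nearest <- order(abs(train_x - x))[seq_len(k)]
    mean(train_y[nearest])
  })
}

# Root mean squared error between observed and predicted values.
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# 5-fold cross-validation on the training set for each candidate k.
folds <- sample(rep(1:5, length.out = nrow(train)))
k_values <- seq(1, 41, by = 4)

cv_results <- data.frame(
  k = k_values,
  rmse = sapply(k_values, function(k) {
    mean(sapply(1:5, function(f) {
      fit  <- train[folds != f, ]
      held <- train[folds == f, ]
      preds <- knn_predict(fit$rank_points, fit$win_rate, held$rank_points, k)
      rmse(held$win_rate, preds)
    }))
  })
)

# Pick the k with the lowest cross-validated RMSE.
best_k <- cv_results$k[which.min(cv_results$rmse)]

# Evaluate that k once on the held-out test set.
test_preds <- knn_predict(train$rank_points, train$win_rate, test$rank_points, best_k)
test_rmse <- rmse(test$win_rate, test_preds)

cv_results
```

Keeping `cv_results` as a table of K against RMSE makes it straightforward to plot the error curve, and repeating the loop for each predictor (or combination of predictors) would give the comparison described above. Choosing K on the training folds only, rather than on the test set, keeps the final test RMSE an honest estimate.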

# References <img align="left" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSDixQBw3HoqS_gnC9xVtHO-5NrnS1eQ91N3w&usqp=CAU" width="35" /> 


    At least 2 citations of literature relevant to the project (format is your choice, just be consistent across the references).
    Make sure to cite the source of your data as well.


We expect that the estimated win rate will reflect the respective career stats of the player. For example, if they have impressive stats, such as a high "breakpoint save percentage" and high "rank points", then their win rate will be high. Conversely, if our player has poor stats, their win rate will reflect this and be lower. 

Predicting the approximate win rate of a given player could be useful in a number of applications. For one, it could be used to get a sense of how the player will perform in the future. In other words, with the knowledge of their win rate, one can make a rough estimate of a player's chances in an upcoming tournament or season. Additionally, as the win rate is based on existing stats, it can be used to provide a clearer estimate of the result of a head-to-head match. 

This analysis could also prompt further questions:
 - Are there stats not included in the dataset that could improve the effectiveness of the model?
 - Do certain stats influence win rate more, and if so, is there a better weight function to use in the regression engine?

