# Predicting the Next Tennis Tournament Winner's Country

For this project we use a data set containing information about tennis players from various countries to develop a predictive model that forecasts the country of the next tournament winner. The data set includes variables such as the winner's height, age, rank and country, the loser's country, age and height, and match statistics.

The data cover 2017 to 2019. By grouping winners by country and using the historical tournament data, we aim to identify patterns and trends that can help predict future tournament outcomes.

First we load the libraries needed to read and model the data.

In [2]:
library(tidyverse)
library(repr)
library(tidymodels)
library(readr)
library(dplyr)
library(rsample)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
✔ broom        1.0.5     ✔ rsample

Now we use `read_csv` to load the data set, then convert the winner's country code (`winner_ioc`) into a factor so that it can be treated as a categorical variable. Missing values in the numerical variables are dropped in the summarising step that follows.

In [6]:
# Read the dataset into R
tennis_players <- read_csv("data/tennis_players.csv")
# Preprocess data: store the winner's country code as a factor
tournament_players <- tennis_players |>
  mutate(winner_country = as.factor(winner_ioc))


New names:
• `` -> `...1`
Rows: 6866 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): tourney_id, tourney_name, surface, tourney_level, winner_seed, win...
dbl (34): ...1, draw_size, tourney_date, match_num, winner_id, winner_ht, wi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Here we group the matches by winner country and count the number of wins for each country (each row of the data is one match, so this counts match wins). We also calculate the average age and height of the winners.

In [7]:
# Group winners by country and calculate statistics
winner_stats <- tournament_players |>
  group_by(winner_country) |>
  summarize(avg_age = mean(winner_age, na.rm = TRUE), 
            avg_height = mean(winner_ht, na.rm = TRUE),
            total_wins = n()) |>
  filter(!is.na(avg_height))
winner_stats

winner_country,avg_age,avg_height,total_wins
<fct>,<dbl>,<dbl>,<int>
ARG,27.97259,184.4926,389
AUS,24.07626,189.6061,295
AUT,24.68328,185.0,157
BEL,28.46732,165.5746,148
BIH,25.96133,172.0,85
BRA,27.31489,183.0,42
BUL,26.80617,188.0,88
CAN,23.12242,194.5567,215
COL,26.78224,188.0,14
CRO,28.01608,200.4242,212


The steps above give us four variables: one categorical variable (`winner_country`) and three numerical variables (`avg_age`, `avg_height` and `total_wins`) that we will use as predictors.
We want to predict the winner's country in tennis matches from these characteristics, and we use them to train a predictive model.

We split the data into two sets:

## Training Set
- It is used to train the predictive model. It contains a subset of the data with known outcomes, from which the model learns patterns linking the predictors to the response variable.
- The model is fitted so as to minimise the prediction error on these observations.

## Testing Set
- It is used to evaluate the trained model. It consists of the observations that were not assigned to the training set.
- It lets us assess how the model performs on data it has not seen.
- By comparing the predictions for the testing set with the actual outcomes, we can measure the model's accuracy, precision and recall.

This approach helps us to evaluate the performance of the model, and also helps us to check whether it is overfitting or underfitting.
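To make accuracy, precision and recall concrete, here is a minimal base R sketch that computes them by hand from a confusion matrix. The labels are toy values invented for illustration and are not taken from the tennis data. "ESP" is arbitrarily treated as the positive class.

```r
# Toy labels, invented for illustration only (not from the tennis data).
truth <- factor(c("ESP", "ESP", "ESP", "ESP", "FRA", "FRA", "FRA", "FRA"),
                levels = c("ESP", "FRA"))
estimate <- factor(c("ESP", "ESP", "ESP", "FRA", "ESP", "ESP", "FRA", "FRA"),
                   levels = c("ESP", "FRA"))

# Confusion matrix: rows are predictions, columns are the true classes.
conf <- table(Prediction = estimate, Truth = truth)

# Treat "ESP" as the positive class.
tp <- conf["ESP", "ESP"]  # predicted ESP, truly ESP
fp <- conf["ESP", "FRA"]  # predicted ESP, truly FRA
fn <- conf["FRA", "ESP"]  # predicted FRA, truly ESP

accuracy_all  <- sum(diag(conf)) / sum(conf)  # share of correct predictions
precision_esp <- tp / (tp + fp)               # of predicted ESP, share that is right
recall_esp    <- tp / (tp + fn)               # of true ESP, share that was found

conf
c(accuracy = accuracy_all, precision = precision_esp, recall = recall_esp)
```

With more than two classes, precision and recall are usually computed per class and then averaged, so a single number hides which classes are predicted well.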

In [9]:
# Split the dataset into training and testing sets
library(rsample)
set.seed(123) # for reproducibility
data_split <- initial_split(winner_stats, prop = 0.8)
data_train <- training(data_split)
data_test <- testing(data_split)
data_train
data_test

winner_country,avg_age,avg_height,total_wins
<fct>,<dbl>,<dbl>,<int>
SLO,28.95868,181.2456,57
EST,30.13224,190.0,5
ESP,30.90478,186.0724,679
AUT,24.68328,185.0,157
URU,32.36745,180.0,52
LTU,28.09732,175.3636,22
NED,30.84991,190.0,58
POR,28.65361,184.038,85
BIH,25.96133,172.0,85
HUN,27.01081,180.0,60


winner_country,avg_age,avg_height,total_wins
<fct>,<dbl>,<dbl>,<int>
ARG,27.97259,184.4926,389
AUS,24.07626,189.6061,295
BRA,27.31489,183.0,42
FRA,29.92857,187.8065,709
GER,27.067,189.9818,460
ISR,31.71397,175.0,19
LAT,29.85739,190.0,17
RSA,30.70277,203.0,91


In [None]:
tennis_recipe <- recipe(winner_country ~ avg_age + avg_height + total_wins, data = data_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

tennis_recipe
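
Centering and scaling matter for k-NN because the predictors are on very different scales: `total_wins` runs into the hundreds while `avg_age` varies by only a few years, so unscaled distances would be dominated by `total_wins`. The sketch below shows by hand, in base R, roughly what `step_center()` followed by `step_scale()` does to one predictor: subtract the training mean and divide by the training standard deviation. The five `avg_age` values are copied from the training rows shown above, and the variable names are illustrative only.

```r
# avg_age for five training rows shown above (SLO, EST, ESP, AUT, URU).
avg_age <- c(28.95868, 30.13224, 30.90478, 24.68328, 32.36745)

# Statistics are learned from the training data only.
age_mean <- mean(avg_age)
age_sd   <- sd(avg_age)

# Centre, then scale: the result has mean 0 and standard deviation 1.
avg_age_std <- (avg_age - age_mean) / age_sd

# A new observation (ARG from the test rows shown above) is transformed
# with the training mean and sd, not with its own.
new_age     <- 27.97259
new_age_std <- (new_age - age_mean) / age_sd

round(avg_age_std, 3)
new_age_std
```

Reusing the training statistics for new data keeps information from the test set out of the preprocessing step.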

In [None]:
# Specify the k-NN model
knn_model <- nearest_neighbor(weight_func = "rectangular", neighbors = tune())|> 
  set_engine("kknn") |>
  set_mode("classification")
knn_model
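
With `weight_func = "rectangular"`, every one of the k nearest neighbours gets an equal vote, which is the classic unweighted k-NN rule. The following base R sketch illustrates that rule on toy, already standardised points. The data, the class labels "A" and "B", and the helper `knn_predict()` are all hypothetical and written only for this illustration; this is not how the `kknn` engine is implemented internally.

```r
# Toy, already-standardised training points (invented for illustration).
train_x <- matrix(c(-1.0, -0.8,
                    -0.9, -1.1,
                     1.0,  0.9,
                     1.2,  1.1,
                     0.8,  1.0), ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "B", "B", "B")

# Hypothetical helper: unweighted majority vote among the k nearest points.
knn_predict <- function(new_x, train_x, train_y, k) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Count votes per class; on a tie, which.max() keeps the first class
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

pred_b <- knn_predict(c(0.9, 1.0), train_x, train_y, k = 3)
pred_a <- knn_predict(c(-1.0, -1.0), train_x, train_y, k = 3)
c(pred_b, pred_a)
```

The choice of k trades off noise against detail: a small k follows individual points closely, while a large k smooths over local structure, which is why `neighbors` is normally tuned rather than fixed by hand.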

In [None]:
# Create the workflow
workflow <- workflow() |>
  add_recipe(tennis_recipe) |>
  add_model(knn_model)
workflow

In [None]:
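# Train the model. neighbors = tune() is only a placeholder, so fix k before
# fitting; k = 3 is an assumed value for illustration, not a tuned one.
model <- workflow |>
  finalize_workflow(tibble(neighbors = 3)) |>
  fit(data = data_train)
model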

In [None]:
# # Make predictions on the test set
# predictions <- predict(model, data_test)|>
# bind_cols(data_test)
# predictions
# # Evaluate model performance
# players_test_metrics <- predictions |>
#   metrics(truth = winner_country, estimate = .pred_class)
# players_test_precision <- predictions |>
#   precision(truth = winner_country, estimate = .pred_class)
# players_test_recall <- predictions |>
#   recall(truth = winner_country, estimate = .pred_class)
# players_test_conf_matrix <- predictions |>
#   conf_mat(truth = winner_country, estimate = .pred_class)

# # Display evaluation metrics and confusion matrix
# players_test_metrics
# players_test_precision
# players_test_recall
# players_test_conf_matrix

In [None]:
# set.seed(2020)


# players_test_metrics <- predictions |>
#   metrics(truth = winner_country, estimate = .pred_class)
# #filter(.metric=="accuracy")
# players_test_metrics
# # Precision
# players_test_precision <- predictions |>
#   precision(truth = winner_country, estimate = .pred_class, event_level="first")
# players_test_precision
# # Recall
# players_test_recall <- predictions |>
#   recall(truth = winner_country, estimate = .pred_class, event_level="first")
# players_test_recall

# # Confusion Matrix
# players_test_conf_matrix <- predictions |>
#   conf_mat(truth = winner_country, estimate=.pred_class)
# players_test_conf_matrix