# WTA Match Predictor

### Contents

- Understanding the Problem Statement
- Data Collection
- Data Checks
- Exploratory Data Analysis
- Preprocessing
- Model Training
- Choose Best Model

### **Understanding the Problem Statement**

- This project seeks to determine what variables affect the outcome of a tennis match on the WTA tour.
- Additionally, this project seeks to train and test a model that can accurately predict the winner of a WTA match.

### **Data Collection**

* The data for this project comes from [Jeff Sackmann of Tennis Abstract](https://github.com/JeffSackmann/tennis_wta).
* Each observation is one match, and matches are grouped by year/season. Each observation includes information about both players and their performance in the match.
    - Key statistics include first serves in, break points saved and faced, aces, double faults, and more.

We will be using several libraries to help us explore the data, including ``numpy``, ``pandas``, ``seaborn``, and ``matplotlib``.

In [2]:
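import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Render plots inline in the notebook.
%matplotlib inline
# Silence all warnings to keep cell output readable (this also hides deprecation notices).
warnings.filterwarnings('ignore')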

#### Import Data as Pandas DataFrame

In [6]:
df = pd.read_csv('data/stud.csv')

#### Explore Head and Shape

In [7]:
df.head()

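| | tourney_id | tourney_name | surface | draw_size | tourney_level | tourney_date | match_num | winner_id | winner_seed | winner_entry | ... | l_1stIn | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023-609 | Indian Wells | Hard | 96 | PM | 20230306 | 286 | 214544 | 2.0 | NaN | ... | 58.0 | 37.0 | 16.0 | 13.0 | 8.0 | 11.0 | 2 | 6100 | 16 | 2205 |
| 1 | 2023-609 | Indian Wells | Hard | 96 | PM | 20230306 | 285 | 216347 | 1.0 | NaN | ... | 55.0 | 30.0 | 4.0 | 10.0 | 3.0 | 8.0 | 1 | 10585 | 36 | 1303 |
| 2 | 2023-609 | Indian Wells | Hard | 96 | PM | 20230306 | 284 | 221054 | NaN | NaN | ... | 52.0 | 37.0 | 10.0 | 12.0 | 5.0 | 8.0 | 77 | 784 | 13 | 2246 |
| 3 | 2023-609 | Indian Wells | Hard | 96 | PM | 20230306 | 283 | 201514 | NaN | NaN | ... | 24.0 | 14.0 | 12.0 | 8.0 | 0.0 | 5.0 | 83 | 743 | 43 | 1200 |
| 4 | 2023-609 | Indian Wells | Hard | 96 | PM | 20230306 | 282 | 201614 | 5.0 | NaN | ... | 50.0 | 34.0 | 21.0 | 14.0 | 1.0 | 4.0 | 5 | 4905 | 49 | 1080 |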


#### Dataset Shape

In [8]:
df.shape

(113, 49)

The file ``stud.csv``, a sample of the larger dataset, contains 113 observations (matches) and 49 variables.

#### Dataset Information

- Tourney Information: tourney_id, tourney_name, surface, draw_size, etc.
- Match Information: match_num, winner_id, loser_id, etc.
- Match Statistics: w_1stIn, w_1stWon, w_2ndWon, w_SvGms, w_bpSaved, w_bpFaced, etc.
    - The same statistics are recorded for the winner (``w_`` prefix) and the loser (``l_`` prefix).
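
The match statistics listed above are raw counts, so rates such as first-serve percentage have to be derived from them. The following is a minimal sketch, assuming that ``w_svpt`` is the winner's total serve points and ``w_1stIn`` the number of first serves that landed in (meanings inferred from the column names). The helper ``add_serve_rates``, the derived column names, and the toy values are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def add_serve_rates(df: pd.DataFrame, prefix: str) -> pd.DataFrame:
    """Return a copy of df with derived rate columns for one player.

    prefix is "w" (winner) or "l" (loser), matching the dataset's
    column naming. Column meanings are assumed from their names.
    """
    out = df.copy()
    # Share of serve points on which the first serve landed in.
    out[f"{prefix}_1st_in_pct"] = out[f"{prefix}_1stIn"] / out[f"{prefix}_svpt"]
    # Share of break points faced that were saved; NaN when none were faced.
    faced = out[f"{prefix}_bpFaced"].where(out[f"{prefix}_bpFaced"] > 0)
    out[f"{prefix}_bp_save_pct"] = out[f"{prefix}_bpSaved"] / faced
    return out


# Made-up illustrative values, not rows from the real data.
toy = pd.DataFrame(
    {
        "w_svpt": [80, 60],
        "w_1stIn": [50, 45],
        "w_bpSaved": [3, 0],
        "w_bpFaced": [4, 0],
    }
)
rates = add_serve_rates(toy, "w")
print(rates[["w_1st_in_pct", "w_bp_save_pct"]])
```

Calling the helper once with ``"w"`` and once with ``"l"`` would give comparable rate columns for both players.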

### **Data Checks**
Before working with the data, we must check and address the following:
- Missing values
- Duplicates
- Data types
- Unique values
- Statistics
- Categories
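
Several of these checks can be collected into one per-column table, which makes it easier to scan 49 variables at once. This is a sketch only: ``summarize_columns`` is a hypothetical helper, and the small frame below is made up to stand in for the match data.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Collect dtype, missing count, and unique count per column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),
        }
    )


# Made-up frame; only the column names follow the dataset.
toy = pd.DataFrame(
    {
        "surface": ["Hard", "Clay", "Hard"],
        "winner_seed": [2.0, None, 1.0],
    }
)
summary = summarize_columns(toy)
# Duplicates are a row-level property, so they are counted separately.
n_duplicate_rows = int(toy.duplicated().sum())
print(summary)
print("duplicate rows:", n_duplicate_rows)
```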

#### Check Missing Values

In [14]:
df.isna().sum()

tourney_id             0
tourney_name           0
surface                0
draw_size              0
tourney_level          0
tourney_date           0
match_num              0
winner_id              0
winner_seed           48
winner_entry          99
winner_name            0
winner_hand            0
winner_ht             30
winner_ioc             0
winner_age             0
loser_id               0
loser_seed            77
loser_entry           81
loser_name             0
loser_hand             0
loser_ht              36
loser_ioc              0
loser_age              0
score                  0
best_of                0
round                  0
minutes                0
w_ace                  1
w_df                   1
w_svpt                 1
w_1stIn                1
w_1stWon               1
w_2ndWon               1
w_SvGms                1
w_bpSaved              1
w_bpFaced              1
l_ace                  1
l_df                   1
l_svpt                 1
l_1stIn                1
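
Raw counts are easier to judge as a share of rows, and not every gap needs the same treatment. The sketch below rests on an assumption that should be verified against the data source: a missing seed means the player was unseeded, so it can be kept as a flag and need not be imputed. The column ``winner_is_seeded`` is a hypothetical name and the rows are made up.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset.
toy = pd.DataFrame(
    {
        "winner_seed": [2.0, None, None, 5.0],
        "winner_ht": [170.0, 182.0, None, 175.0],
        "w_ace": [3.0, 1.0, 0.0, None],
    }
)

# Fraction of rows missing in each column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Assumption: a missing seed means "unseeded", which is information
# in itself, so it is recorded as a boolean flag.
toy["winner_is_seeded"] = toy["winner_seed"].notna()

# One option for rows that lack serve statistics is to drop them.
complete = toy.dropna(subset=["w_ace"])
print(missing_share)
```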


#### Check Duplicates

In [13]:
df.duplicated().sum()

np.int64(0)

There are no duplicate rows in the dataset.
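
``df.duplicated()`` only flags rows that match in every column. A stricter check looks for repeated match identifiers. The sketch below assumes that ``tourney_id`` together with ``match_num`` identifies a match; that is an assumption about the data, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows: the first and third share a key but differ in minutes.
toy = pd.DataFrame(
    {
        "tourney_id": ["2023-609", "2023-609", "2023-609"],
        "match_num": [286, 285, 286],
        "minutes": [95.0, 70.0, 96.0],
    }
)

# Rows identical in every column.
full_row_dupes = int(toy.duplicated().sum())
# Rows that repeat the assumed match key, even if other columns differ.
key_dupes = int(toy.duplicated(subset=["tourney_id", "match_num"]).sum())
print(full_row_dupes, key_dupes)
```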

#### Check Data Types

In [16]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 49 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tourney_id          113 non-null    object 
 1   tourney_name        113 non-null    object 
 2   surface             113 non-null    object 
 3   draw_size           113 non-null    int64  
 4   tourney_level       113 non-null    object 
 5   tourney_date        113 non-null    int64  
 6   match_num           113 non-null    int64  
 7   winner_id           113 non-null    int64  
 8   winner_seed         65 non-null     float64
 9   winner_entry        14 non-null     object 
 10  winner_name         113 non-null    object 
 11  winner_hand         113 non-null    object 
 12  winner_ht           83 non-null     float64
 13  winner_ioc          113 non-null    object 
 14  winner_age          113 non-null    float64
 15  loser_id            113 non-null    int64  
 16  loser_se
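
The ``tourney_date`` column is stored as ``int64``, in a form that looks like ``YYYYMMDD`` (for example ``20230306``). A minimal sketch of converting such integers to proper dates, assuming every value follows that pattern:

```python
import pandas as pd

# Integer dates in the form seen in the tourney_date column.
dates = pd.Series([20230306, 20230306], name="tourney_date")

# Convert to string first so the digits can be parsed as year, month, day.
parsed = pd.to_datetime(dates.astype(str), format="%Y%m%d")
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```

With a datetime column, year and month can be extracted directly, which integer arithmetic on the raw values does not give cleanly.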

#### Check Count of Unique Values

In [18]:
df.nunique()

tourney_id              7
tourney_name            7
surface                 3
draw_size               3
tourney_level           3
tourney_date            5
match_num              38
winner_id              61
winner_seed            18
winner_entry            3
winner_name            61
winner_hand             3
winner_ht              22
winner_ioc             28
winner_age             55
loser_id               96
loser_seed             20
loser_entry             3
loser_name             96
loser_hand              3
loser_ht               22
loser_ioc              36
loser_age              78
score                  77
best_of                 1
round                   6
minutes                70
w_ace                  14
w_df                   14
w_svpt                 62
w_1stIn                48
w_1stWon               35
w_2ndWon               24
w_SvGms                13
w_bpSaved              15
w_bpFaced              18
l_ace                  10
l_df                   15
l_svpt      
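
A column with a single unique value, such as ``best_of`` in this sample, cannot help distinguish one match from another. The sketch below shows one way to list such columns; the frame is made up and only mirrors the column names.

```python
import pandas as pd

# Made-up frame; best_of is constant, as in the sample above.
toy = pd.DataFrame(
    {
        "best_of": [3, 3, 3],
        "surface": ["Hard", "Clay", "Hard"],
        "minutes": [95.0, 70.0, 120.0],
    }
)

n_unique = toy.nunique()
# Columns with at most one distinct non-missing value.
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)
```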

#### Check Statistics of Dataset

In [19]:
df.describe()

| | draw_size | tourney_date | match_num | winner_id | winner_seed | winner_ht | winner_age | loser_id | loser_seed | loser_ht | ... | l_1stIn | l_1stWon | l_2ndWon | l_SvGms | l_bpSaved | l_bpFaced | winner_rank | winner_rank_points | loser_rank | loser_rank_points |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 113.0 | 113.0 | 113.0 | 113.0 | 65.0 | 83.0 | 113.0 | 113.0 | 36.0 | 77.0 | ... | 112.0 | 112.0 | 112.0 | 112.0 | 112.0 | 112.0 | 113.0 | 113.0 | 113.0 | 113.0 |
| mean | 50.123894 | 20230460.0 | 284.185841 | 210496.292035 | 6.230769 | 174.626506 | 26.423894 | 210621.433628 | 9.666667 | 173.636364 | ... | 43.991071 | 25.866071 | 10.839286 | 10.392857 | 4.642857 | 9.205357 | 93.787611 | 1665.982301 | 126.19469 | 1033.424779 |
| std | 24.518627 | 113.9821 | 9.897284 | 7052.453367 | 5.98114 | 5.547381 | 4.08611 | 6756.064422 | 8.691539 | 6.29099 | ... | 15.723681 | 10.907057 | 5.260003 | 2.819997 | 2.995062 | 3.678862 | 138.02112 | 1720.814072 | 148.874339 | 1007.0179 |
| min | 32.0 | 20230310.0 | 263.0 | 201458.0 | 1.0 | 160.0 | 17.9 | 200748.0 | 1.0 | 160.0 | ... | 15.0 | 5.0 | 0.0 | 5.0 | 0.0 | 3.0 | 1.0 | 26.0 | 3.0 | 11.0 |
| 25% | 32.0 | 20230400.0 | 277.0 | 203475.0 | 2.0 | 170.5 | 23.7 | 203500.0 | 3.75 | 170.0 | ... | 31.75 | 16.75 | 7.0 | 8.0 | 2.0 | 7.0 | 15.0 | 563.0 | 37.0 | 429.0 |
| 50% | 32.0 | 20230400.0 | 284.0 | 211651.0 | 4.0 | 175.0 | 26.1 | 211702.0 | 6.5 | 175.0 | ... | 40.5 | 25.0 | 10.0 | 10.0 | 4.0 | 9.0 | 66.0 | 851.0 | 103.0 | 630.0 |
| 75% | 64.0 | 20230610.0 | 293.0 | 214954.0 | 7.0 | 178.0 | 28.6 | 214939.0 | 13.5 | 178.0 | ... | 54.25 | 33.25 | 14.0 | 12.0 | 6.0 | 11.0 | 116.0 | 2296.0 | 154.0 | 1258.0 |
| max | 96.0 | 20230620.0 | 300.0 | 232447.0 | 29.0 | 185.0 | 35.6 | 232447.0 | 32.0 | 185.0 | ... | 85.0 | 59.0 | 26.0 | 18.0 | 16.0 | 24.0 | 817.0 | 10585.0 | 1077.0 | 5605.0 |
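
``describe()`` summarises every numeric column, including identifiers and integer-coded dates, for which a mean or standard deviation has no useful interpretation. One option, sketched below, is to drop those columns before summarising. The list ``id_like`` is a hypothetical choice, and the values are the first three rows shown in the ``head()`` output.

```python
import pandas as pd

# Values from the first three rows of the head() output.
toy = pd.DataFrame(
    {
        "winner_id": [214544, 216347, 221054],
        "tourney_date": [20230306, 20230306, 20230306],
        "winner_rank": [2, 1, 77],
        "loser_rank": [16, 36, 13],
    }
)

# Identifier-like columns: numeric in storage, but not quantities.
id_like = ["winner_id", "tourney_date"]
stats = toy.drop(columns=id_like).describe()
print(stats)
```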


### **Exploratory Data Analysis**


### **Preprocessing**


### **Model Training**


### **Choose Best Model**
