# Tennis Match Prediction
#### By: Alejandro Velasquez, Chloe Whitaker, Daniel Northcutt, Mason Sherbondy

## Project Goals
- Create a tennis match predictor that will determine the outcome of a match
- Explore and compare the great rivalries of Roger Federer

## Executive Summary 
We discovered:
* Rivals that won half or more meetings with Federer beat him at Grand Slam events
* The court surface affects player performance
* Players performed differently based on the tourney level
* Higher ranked players win more often (about 64.28% of the time)
* Our best model (logistic regression, 70% accuracy on validate) beat the no_upset baseline (about 64%) by about 6%
* Features we moved forward with: player_1_rank_points, player_1_hand_R, player_1_hand_L, surface_Clay.

## Initial Questions
- What aspects of the game drive the performance of the players?
- Do Federer's rivalries take a different story at Grand Slam events?
- Do higher ranked players win more?
- Does the court surface matter?

In [1]:
%%time

# imports

import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import regex as re

# Custom Helper Files
from modules.proper_prep import *
from modules.explore import *
from modules.model import *
import modules.explore1 as e1

# Split 
from sklearn.model_selection import train_test_split

# Stats
from scipy import stats

# Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Visualize
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore Warnings
import warnings
warnings.filterwarnings('ignore')

# Remove Limits On Viewing Dataframes
pd.set_option('display.max_columns', None)

ModuleNotFoundError: No module named 'modules.explore'

## Acquire:

- Data was acquired from a repository of men's ATP tennis match data covering 1968 - 2019
- https://github.com/JeffSackmann (source)
- Collected 180k rows of data

## ATPTotal Prepare:
- Assigned each match's winner and loser to player_1 and player_2 alphabetically, so the winner does not always occupy the same slot
- Set our target to player_1_wins (True or False) for binary classification
- Filtered for matches played from 1999 through the end of 2019
- Removed all walkovers and best-of-1 matches
- Removed matches where a player retired very early
- Removed records where data was incomplete (an extremely small proportion)
- Filled in player heights where null
- Set our index to the tourney date, parsed with the format %Y%m%d
- Filtered out players who played fewer than 50 matches
- Filled missing match lengths with the average match length for the respective tournament
- 50470 matches
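
The player assignment and target steps above can be sketched as follows. This is a minimal illustration on toy data, not the code in `prepare_atp`: the raw column names `winner_name` and `loser_name` and the choice to sort on the full name string are assumptions.

```python
import numpy as np
import pandas as pd

# Toy raw matches; winner_name / loser_name are assumed raw column names.
raw = pd.DataFrame({
    "winner_name": ["Roger Federer", "Andy Murray", "Rafael Nadal"],
    "loser_name": ["Andy Roddick", "Roger Federer", "Novak Djokovic"],
})

# player_1 is whichever name sorts first alphabetically, player_2 the other.
winner_first = raw["winner_name"] < raw["loser_name"]
raw["player_1"] = np.where(winner_first, raw["winner_name"], raw["loser_name"])
raw["player_2"] = np.where(winner_first, raw["loser_name"], raw["winner_name"])

# Binary target: did the player in the player_1 slot win the match?
raw["player_1_wins"] = raw["player_1"] == raw["winner_name"]

print(raw[["player_1", "player_2", "player_1_wins"]])
```

Because the slot depends only on the names, the target ends up as a mix of True and False instead of always pointing at the same column.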

## Data Dictionary
#### DatetimeIndex: 50470 entries, 1999-01-11 to 2019-11-24
 
 |  #  | Column                            |  Non-Null Count | Dtype   | Description
 |:--- |:----------------------------------|:--------------- |:--------|:----------------------------------------------------------------------------------
 | 0   | tourney_id                        |  50470 non-null | object  | Unique identifier for the tournament that the record of match data belongs to.
 | 1   | tourney_name                      |  50470 non-null | object  | Name of the tournament the recorded match belongs to.
 | 2   | surface                           |  50470 non-null | object  | Court construction type - surface material of the court the match is played on.
 | 3   | draw_size                         |  50470 non-null | int64   | Number of players in a tournament rounded to nearest power of 2.
 | 4   | tourney_level                     |  50470 non-null | object  | Level of tournament: G = Grand Slams, M = Masters 1000, A = Other tour level events.
 | 5   | match_num                         |  50470 non-null | int64   | Match specific identifier within the tourney id.
 | 6   | score                             |  50470 non-null | object  | The final results of the match outcome.
 | 7   | best_of                           |  50470 non-null | int64   | The match format. 3 = Best of 3 sets, 5 = Best of 5 sets for the match.
 | 8   | round                             |  50470 non-null | object  | What round the match is in a tournament. RR = Round robin, ER = Early rounds.
 | 9   | minutes                           |  35543 non-null | float64 | Match length.
 | 10  | player_1                          |  50470 non-null | object  | One of the players featured in a match.
 | 11  | player_2                          |  50470 non-null | object  | The other player featured in the match.
 | 12  | player_1_age                      |  50470 non-null | float64 | Age of player 1 at the time of the match.
 | 13  | player_2_age                      |  50470 non-null | float64 | Age of player 2 at the time of the match.
 | 14  | player_1_hand                     |  50470 non-null | object  | Dominant hand for player 1.
 | 15  | player_2_hand                     |  50470 non-null | object  | Dominant hand for player 2.
 | 16  | player_1_ht                       |  50470 non-null | float64 | Height of player 1.
 | 17  | player_2_ht                       |  50470 non-null | float64 | Height of player 2.
 | 18  | player_1_id                       |  50470 non-null | int64   | Unique player identifier for player 1.
 | 19  | player_2_id                       |  50470 non-null | int64   | Unique player identifier for player 2.
 | 20  | player_1_ioc                      |  50470 non-null | object  | Country of origin for player 1.
 | 21  | player_2_ioc                      |  50470 non-null | object  | Country of origin for player 2.
 | 22  | player_1_rank                     |  50470 non-null | float64 | Player 1 rank at the start of the tournament.
 | 23  | player_2_rank                     |  50470 non-null | float64 | Player 2 rank at the start of the tournament.
 | 24  | player_1_rank_points              |  50470 non-null | float64 | Player 1 rank points at the start of the tournament.
 | 25  | player_2_rank_points              |  50470 non-null | float64 | Player 2 rank points at the start of the tournament.
 | 26  | player_1_seed                     |  13218 non-null | float64 | Player 1 seed for the tournament, if seeded.
 | 27  | player_2_seed                     |  14181 non-null | float64 | Player 2 seed for the tournament, if seeded.
 | 28  | player_1_aces                     |  50470 non-null | float64 | Number of serves from player 1 in the match completely untouched by player 2.
 | 29  | player_2_aces                     |  50470 non-null | float64 | Number of serves from player 2 in the match completely untouched by player 1.
 | 30  | player_1_double_faults            |  50470 non-null | float64 | Number of times player 1 failed to start a point by faulting twice (free p2 point).
 | 31  | player_2_double_faults            |  50470 non-null | float64 | Number of times player 2 failed to start a point by faulting twice (free p1 point).
 | 32  | player_1_service_points           |  50470 non-null | float64 | Number of points player 1 played on his serve.
 | 33  | player_2_service_points           |  50470 non-null | float64 | Number of points player 2 played on his serve.
 | 34  | player_1_first_serves_in          |  50470 non-null | float64 | Number of first serves player 1 made.
 | 35  | player_2_first_serves_in          |  50470 non-null | float64 | Number of first serves player 2 made.
 | 36  | player_1_first_serve_points_won   |  50470 non-null | float64 | Number of first serve points won by player 1.
 | 37  | player_2_first_serve_points_won   |  50470 non-null | float64 | Number of first serve points won by player 2.
 | 38  | player_1_second_serve_points_won  |  50470 non-null | float64 | Number of second serve points won by player 1.
 | 39  | player_2_second_serve_points_won  |  50470 non-null | float64 | Number of second serve points won by player 2.
 | 40  | player_1_service_game_total       |  50470 non-null | float64 | Number of games player 1 served in a match.
 | 41  | player_2_service_game_total       |  50470 non-null | float64 | Number of games player 2 served in a match.
 | 42  | player_1_break_points_saved       |  50470 non-null | float64 | Number of points player 1 won to stave off a break of serve.
 | 43  | player_2_break_points_saved       |  50470 non-null | float64 | Number of points player 2 won to stave off a break of serve.
 | 44  | player_1_break_points_faced       |  50470 non-null | float64 | Number of break points player 1 faced.
 | 45  | player_2_break_points_faced       |  50470 non-null | float64 | Number of break points player 2 faced.
 | 46  | winner                            |  50470 non-null | object  | The name of the winner.
 | 47  | player_1_first_serve_%            |  50470 non-null | float64 | Percent of first serves in for player 1.
 | 48  | player_2_first_serve_%            |  50470 non-null | float64 | Percent of first serves in for player 2.
 | 49  | player_1_first_serve_win_%        |  50470 non-null | float64 | Percent of first service points won for player 1.
 | 50  | player_2_first_serve_win_%        |  50470 non-null | float64 | Percent of first service points won for player 2.
 | 51  | player_1_break_points_won         |  50470 non-null | float64 | Number of times player 1 broke player 2's service.
 | 52  | player_2_break_points_won         |  50470 non-null | float64 | Number of times player 2 broke player 1's service.
 | 53  | player_1_wins                     |  50470 non-null | bool    | Target variable. Boolean value that designates whether or not player 1 won the match.
 | 54  | player_2_seeded                   |  50470 non-null | bool    | Boolean value that designates whether or not player 2 is seeded.
 | 55  | player_1_seeded                   |  50470 non-null | bool    | Boolean value that designates whether or not player 1 is seeded.
 | 56  | surface_Carpet                    |  50470 non-null | uint8   | Whether or not the match was played on carpet. 1 = Yes, 0 = No.
 | 57  | surface_Clay                      |  50470 non-null | uint8   | Whether or not the match was played on clay. 1 = Yes, 0 = No.
 | 58  | surface_Grass                     |  50470 non-null | uint8   | Whether or not the match was played on grass. 1 = Yes, 0 = No.
 | 59  | surface_Hard                      |  50470 non-null | uint8   | Whether or not the match was played on hard court. 1 = Yes, 0 = No.
 | 60  | tourney_level_A                   |  50470 non-null | uint8   | Whether or not the tournament was a tour-level event. 1 = Yes, 0 = No.
 | 61  | tourney_level_D                   |  50470 non-null | uint8   | Whether or not the tournament was a Davis Cup event. 1 = Yes, 0 = No.
 | 62  | tourney_level_F                   |  50470 non-null | uint8   | Whether or not the tournament was a Tour Finals or season-ending event. 1 = Yes, 0 = No.
 | 63  | tourney_level_G                   |  50470 non-null | uint8   | Whether or not the tournament was a Grand Slam event. 1 = Yes, 0 = No.
 | 64  | tourney_level_M                   |  50470 non-null | uint8   | Whether or not the tournament was a Masters 1000 event. 1 = Yes, 0 = No.
 | 65  | player_1_hand_L                   |  50470 non-null | uint8   | Whether or not player 1 plays left-handed. 1 = Yes, 0 = No.
 | 66  | player_1_hand_R                   |  50470 non-null | uint8   | Whether or not player 1 plays right-handed. 1 = Yes, 0 = No.
 | 67  | player_2_hand_L                   |  50470 non-null | uint8   | Whether or not player 2 plays left-handed. 1 = Yes, 0 = No.
 | 68  | player_2_hand_R                   |  50470 non-null | uint8   | Whether or not player 2 plays right-handed. 1 = Yes, 0 = No.
 | 69  | round_ER                          |  50470 non-null | uint8   | Whether or not the match was in the early rounds. 1 = Yes, 0 = No.
 | 70  | round_BR                          |  50470 non-null | uint8   | Whether or not the match was a bronze round (2016 Olympics / 2018 season-ending event). 1 = Yes, 0 = No.
 | 71  | round_F                           |  50470 non-null | uint8   | Whether or not the match was the final round. 1 = Yes, 0 = No.
 | 72  | round_QF                          |  50470 non-null | uint8   | Whether or not the match was a quarter-final. 1 = Yes, 0 = No.
 | 73  | round_R128                        |  50470 non-null | uint8   | Whether or not the match was in the round of 128. 1 = Yes, 0 = No.
 | 74  | round_R16                         |  50470 non-null | uint8   | Whether or not the match was in the round of 16. 1 = Yes, 0 = No.
 | 75  | round_R32                         |  50470 non-null | uint8   | Whether or not the match was in the round of 32. 1 = Yes, 0 = No.
 | 76  | round_R64                         |  50470 non-null | uint8   | Whether or not the match was in the round of 64. 1 = Yes, 0 = No.
 | 77  | round_RR                          |  50470 non-null | uint8   | Whether or not the match was a round robin match. 1 = Yes, 0 = No.
 | 78  | round_SF                          |  50470 non-null | uint8   | Whether or not the match was a semifinal. 1 = Yes, 0 = No.
 | 79  | h2h_1                             |  50470 non-null | int64   | Number that represents how many victories player1 has over player 2.
 | 80  | h2h_2                             |  50470 non-null | int64   | Number that represents how many victories player2 has over player 1.
 
Break Point: If a player wins a break point, they win the service game of the opponent.
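
As an illustration of that definition, a player's break points won equal the break points the opponent faced minus the ones the opponent saved. The sketch below uses toy numbers; whether `prepare_atp` derives `player_1_break_points_won` this way is an assumption.

```python
import pandas as pd

# Toy serve stats for player 2 in two matches.
stats_df = pd.DataFrame({
    "player_2_break_points_faced": [8.0, 3.0],
    "player_2_break_points_saved": [5.0, 3.0],
})

# Every break point player 2 faced but did not save was won by player 1.
stats_df["player_1_break_points_won"] = (
    stats_df["player_2_break_points_faced"]
    - stats_df["player_2_break_points_saved"]
)

print(stats_df)
```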

In [2]:
%%time

# pulling function - 50470 matches
df = prepare_atp()
df.shape

FileNotFoundError: [Errno 2] No such file or directory: 'ATPMain.csv'

In [3]:
# confirm prepare
df.head(3)

NameError: name 'df' is not defined

In [None]:
df.tail(3)

In [None]:
# # Pulling Player Database aggregated stats of players within the time span
# PlayerData = pd.read_csv('PlayerData.csv')
# PlayerData.shape

In [None]:
## PlayerDatabase of 371 players that hit a maxrank of at least 100 and have 50 or more games
#PlayerData.head()

In [None]:
# confirm clean
# df = clean_for_model(df)

In [None]:
# split the data into train, validate, and test so that we can confirm our model is not overfitting
train, validate, test = train_validate_test_split(df)

In [None]:
# confirm split
train.head(1)
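
The body of `train_validate_test_split` lives in the helper modules and is not shown here. A three-way split of this kind is commonly built from two calls to `train_test_split`; the sketch below is one such version, where the function name `split_sketch`, the 20% / 30% proportions, the seed, and the stratification on the target are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_sketch(df, target="player_1_wins", seed=123):
    """Hypothetical three-way split: hold out test first, then validate."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Toy data so the sketch runs on its own.
toy = pd.DataFrame({
    "player_1_wins": [True, False] * 50,
    "player_1_rank_points": range(100),
})
toy_train, toy_validate, toy_test = split_sketch(toy)
print(len(toy_train), len(toy_validate), len(toy_test))
```

Fitting on train, tuning on validate, and touching test only once is what lets a gap between train and validate accuracy flag overfitting.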

## Explore

### What drives winning? 

Plan for explore:

To identify features to use in modeling, separate the matches into wins and losses and then check whether there is a significant difference in each variable between the two groups. For example: I'll take all of the matches that player 1 wins and get the mean rank points for player 1 in that group, then do the same for all of the matches where player 1 loses. If there is a significant difference, then rank points are likely a driver of winning and losing.

In [None]:
# separate player 1 wins and losses into dfs
player_1_w = train[train['player_1_wins'] == True]
player_1_l = train[train['player_1_wins'] == False]

In [None]:
# confirm wins df
player_1_w.head(1)

In [None]:
# confirm losses df
player_1_l.head(1)

#### Does player_1_rank_points impact player_1_wins?

I will check to see if the average ranking points when player 1 wins is not the same as when player 1 loses.

In [None]:
# Rank points
print('Average rank points, all matches: ' + str(train['player_1_rank_points'].mean()))
print('Average rank points, player 1 wins: ' + str(player_1_w['player_1_rank_points'].mean()))
print('Average rank points, player 1 loses: ' + str(player_1_l['player_1_rank_points'].mean()))
print('\n')

It does look like there's a significant difference in the average rank points amongst wins and losses.

I will confirm that with stats and a visual: 

In [None]:
# We want to be sure that the effect we see isn't explained by chance, so I will use a confidence level of 95%
# The resulting alpha is .05.
null_hypothesis = "The average ranking points when player 1 wins is the same as when player 1 loses."
alternative_hypothesis = "The average ranking points when player 1 wins is not the same as when player 1 loses."
alpha = 0.05

In [None]:
get_ttest_rank_points(train)
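
`get_ttest_rank_points` is defined in the helper modules, so its exact test is not visible here. A comparison of two group means like this one is typically a two-sample t-test; the sketch below runs one with `scipy.stats.ttest_ind` on synthetic rank points. The synthetic distributions and the use of Welch's variant (`equal_var=False`) are assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic rank points for matches player 1 won and lost (not real data).
rng = np.random.default_rng(0)
wins_points = rng.normal(loc=1500, scale=400, size=500)
loss_points = rng.normal(loc=1100, scale=400, size=500)

# Two-sided Welch t-test: are the two group means different?
t_stat, p_value = stats.ttest_ind(wins_points, loss_points, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject null: {p_value < alpha}")
```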

In [None]:
p = 0.000
if p < alpha:
    print("We reject the null hypothesis that", null_hypothesis)
    print("We move forward with the alternative hypothesis that", alternative_hypothesis)
else:
    print("We fail to reject the null")
    print("Our evidence does not support the claim that the average ranking points when player 1 wins is not the same as when player 1 loses")


In [None]:
get_winning_player_rank_points(train)

##### Does player_1_hand_R impact player_1_wins?

I will check to see if there is a dependence between right hand used and player 1 winning using stats: 

In [None]:
# We want to be sure that the effect we see isn't explained by chance, so I will use a confidence level of 95%
# The resulting alpha is .05.
null_hypothesis = "There is no dependence between right hand used and player 1 winning"
alternative_hypothesis = "There is a dependence between right hand used and player 1 winning"
alpha = 0.05

In [None]:
get_chi_right_hand(train)
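
The `get_chi_*` helpers are also defined outside this notebook. A dependence check between two categorical columns is usually a chi-square test of independence on their contingency table; the sketch below shows that pattern with `pd.crosstab` and `scipy.stats.chi2_contingency` on toy counts. The counts are invented for illustration.

```python
import pandas as pd
from scipy import stats

# Toy data: 60 matches with a right-handed player 1, 40 without.
toy = pd.DataFrame({
    "player_1_hand_R": [1] * 60 + [0] * 40,
    "player_1_wins": [True] * 40 + [False] * 20 + [True] * 10 + [False] * 30,
})

# Contingency table of handedness against the match outcome.
observed = pd.crosstab(toy["player_1_hand_R"], toy["player_1_wins"])

# Chi-square test of independence on the observed counts.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```

The same pattern applies to the left-hand and clay-surface tests by swapping the first column.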

In [None]:
p = 0.0055
if p < alpha:
    print("We reject the null that", null_hypothesis)
    print("We move forward with the alternative hypothesis that", alternative_hypothesis)
else:
    print("We fail to reject the null")
    print("Evidence does not support", alternative_hypothesis)

##### Does player_1_hand_L impact player_1_wins?

I will check to see if there is a dependence between left hand used and player 1 winning using stats:

In [None]:
# We want to be sure that the effect we see isn't explained by chance, so I will use a confidence level of 95%
# The resulting alpha is .05.
null_hypothesis = "There is no dependence between left hand used and player 1 winning"
alternative_hypothesis = "There is a dependence between left hand used and player 1 winning"
alpha = 0.05

In [None]:
get_chi_left_hand(train)

In [None]:
p = 0.0055
if p < alpha:
    print("We reject the null that", null_hypothesis)
    print("We move forward with the alternative hypothesis that", alternative_hypothesis)
else:
    print("We fail to reject the null")
    print("Evidence does not support", alternative_hypothesis)

##### Does Clay impact player_1_wins?

I will check to see if there is a dependence between clay surface used and player 1 winning using stats and a visual.

In [None]:
# We want to be sure that the effect we see isn't explained by chance, so I will use a confidence level of 95%
# The resulting alpha is .05.
null_hypothesis = "There is no dependence between clay surface and player 1 winning"
alternative_hypothesis = "There is a dependence between clay surface used and player 1 winning"
alpha = 0.05

In [None]:
get_chi_clay(train)

In [None]:
p = 0.0032
if p < alpha:
    print("We reject the null that", null_hypothesis)
    print("We move forward with the alternative hypothesis that", alternative_hypothesis)
else:
    print("We fail to reject the null")
    print("Evidence does not support", alternative_hypothesis)

In [None]:
get_pie_surface(train)

### Federer vs the World.

Whatever drives winning, no one can deny that Roger Federer has it in spades (or maybe in rackets). Over the years many have tried to dethrone the man many consider to be the best that ever played the game. Here is a look at Federer compared to some of his top rivals over the past 20 years. 
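
The rivalry charts in this section come from the `e1` helper module. A head-to-head tally of the kind they plot can be sketched as below, using the `player_1`, `player_2`, and `winner` columns from the data dictionary. The function name `head_to_head` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy matches using the column names from the data dictionary.
toy = pd.DataFrame({
    "player_1": ["Andy Roddick", "Andy Roddick", "Andy Murray", "Andy Roddick"],
    "player_2": ["Roger Federer", "Roger Federer", "Roger Federer", "Rafael Nadal"],
    "winner": ["Roger Federer", "Andy Roddick", "Andy Murray", "Rafael Nadal"],
})


def head_to_head(df, a, b):
    """Count wins per player across all matches between a and b."""
    pair = df[df["player_1"].isin([a, b]) & df["player_2"].isin([a, b])]
    return pair["winner"].value_counts()


h2h = head_to_head(toy, "Roger Federer", "Andy Roddick")
print(h2h)
```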

## Roger Federer vs Andy Roddick

In [None]:
# Roger Federer 14 Wins
# Andy Roddick 3 Wins
e1.get_pie_wins_rod_fed()

#### Federer dominated his rivalry with Roddick.

In [None]:
# Showing the rivalry across the multiple years
e1.rod_fed_bar()

#### Roddick found the answers he was looking for on the court with Federer, but these moments were few and far between.

In [None]:
e1.get_pies_upsets_fed_rod()

#### Most of Roddick's wins against Federer were upsets.

In [None]:
e1.get_pie_tourney_level_fed_rod()

#### Andy Roddick found answers for Federer only at Masters 1000 events. Specifically, at the 2003 Canada Masters and in Miami in 2008 and 2012.

## Roger Federer vs Andy Murray

In [None]:
# Roger Federer 10 Wins
# Andy Murray 10 Wins
e1.get_pie_wins_mur_fed()

#### Federer and Murray have split their matchup wins throughout their careers.

In [None]:
# Showing the rivalry across the multiple years
e1.rod_mur_bar()

#### While Murray did win a lot of early meetings, Federer held most of the victories toward the end.

In [None]:
e1.get_pies_upsets_fed_mur()

#### Most of Andy Murray's wins against Federer were upsets.

In [None]:
e1.get_pie_tourney_level_fed_mur()

#### While Federer generally dominated Andy Murray at Major (Grand Slam) events, Andy Murray played past the pressure at lower tier events, especially in Masters 1000 tournaments.

## Roger Federer vs Rafael Nadal

In [None]:
#Roger Federer 11 Wins
#Rafael Nadal 19 Wins

e1.get_pie_wins_nad_fed()

#### As spectacular as all of their matches were, Nadal generally dominated Federer throughout his career.

In [None]:
e1.rod_nad_bar()

#### While Nadal did dominate the rivalry over the years, Federer has won the majority of their recent meetings.

In [None]:
e1.get_pies_upsets_fed_nad()

#### Federer and Nadal both have a fair number of upset wins over each other, but each also beat the other quite a few times as the favorite.

In [None]:
e1.get_pie_tourney_level_fed_nad()

#### Nadal dominated the rivalry on the big stages, at Grand Slam and Masters 1000 events, but Federer won most of their meetings at the season-ending events.

## Roger Federer vs Novak Djokovic

In [None]:
# Roger Federer 11 Wins
# Novak Djokovic 21 Wins
e1.get_pie_wins_djo_fed()

#### Federer and Djokovic have a pretty even amount of wins against each other.

In [None]:
e1.fed_djo_bar()

#### While Federer dominated Djokovic early in the "Djoker's" career, Djokovic has been writing the pages for their meetings more often as of late.

In [None]:
e1.get_pies_upsets_djo_fed()

#### Federer and Djokovic upset each other quite a bit.

In [None]:
e1.get_pie_tourney_level_fed_djo()

#### Federer won the majority of their matchups at Masters 1000 events, but Djokovic wouldn't let victory go on the biggest of stages; Djokovic has grabbed the majority of their meetings at Majors.

### Explore Summary: 
#### Federer vs. Roddick
* Federer dominated his rivalry with Roddick.
* Roddick found the answers he was looking for on the court with Federer, but these moments were few and far between.
* Most of Roddick's wins against Federer were upsets.
* Andy Roddick found answers for Federer only at Masters 1000 events. Specifically, at the 2003 Canada Masters and in Miami in 2008 and 2012.

#### Federer vs. Murray
* Federer and Murray have split their matchup wins throughout their careers.
* While Murray did win a lot of early meetings, Federer held most of the victories toward the end.
* Most of Andy Murray's wins against Federer were upsets.
* While Federer generally dominated Andy Murray at Major (Grand Slam) events, Andy Murray played past the pressure at lower tier events, especially in Masters 1000 tournaments.

#### Federer vs. Nadal
* As spectacular as all of their matches were, Nadal generally dominated Federer throughout his career.
* While Nadal did dominate the rivalry over the years, Federer has won the majority of their recent meetings.
* Federer and Nadal both have a fair number of upset wins over each other, but each also beat the other quite a few times as the favorite.
* Nadal dominated the rivalry on the big stages, at Grand Slam and Masters 1000 events, but Federer won most of their meetings at the season-ending events.

#### Federer vs Djokovic
* Federer and Djokovic have a pretty even amount of wins against each other.
* While Federer dominated Djokovic early in the "Djoker's" career, Djokovic has been writing the pages for their meetings more often as of late.
* Federer and Djokovic upset each other quite a bit.
* Federer won the majority of their matchups at Masters 1000 events, but Djokovic wouldn't let victory go on the biggest of stages; Djokovic has grabbed the majority of their meetings at Majors.

All of these rivalries have been unique. While Federer dominated Roddick throughout his entire career, Nadal dominated Federer earlier on, and Federer has won most of their recent meetings. While Federer split wins evenly with Murray and Djokovic career-total wise, Federer dominated Murray at Grand Slam events while Djokovic took control of most of his Grand Slam meetings with Federer. Nadal, Djokovic and Murray did upset him quite a bit on their paths to world number 1, and the few times Roddick prevailed, they, too, were upsets.

### Features to Move Forward with to Modeling: 

Features to move forward with: player_1_rank_points, player_1_hand_R, player_1_hand_L, surface_Clay.

In [None]:
# split data into train, validate and test sets
train, validate, test = train_validate_test_split(df)

In [None]:
# verify training set
train.head(1)

## Modeling

Before modeling, there is a little preparation that needs to happen. For the model_prep function, see model.py.

In [None]:
train.head()

In [None]:
# set up modeling data
X_train, X_validate, X_test, y_train, y_validate, y_test = model_prep(train,validate,test)

In [None]:
# verify model data
X_train.head(1)

In [None]:
# verify clean
X_train.isnull().sum()

### Baseline

Creating a baseline to compare the final model performance against is the last step before modeling.

In [None]:
train.player_1_wins.value_counts(normalize=True)

In [None]:
baseline = y_train.mode()

In [None]:
baseline

In [None]:
match_bsl_prediction = y_train == 0

In [None]:
baseline_accuracy = match_bsl_prediction.mean()

In [None]:
# baseline accuracy = 52%
baseline_accuracy
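
The baseline cells above amount to predicting the most common class for every match and measuring how often that guess is right. A compact, self-contained version of the same idea on a toy target might look like this; the toy values are invented.

```python
import pandas as pd

# Toy target: True means player 1 won the match.
y_toy = pd.Series([False, False, True, False, True])

# The baseline prediction is the most frequent class in the training target.
most_common = y_toy.mode()[0]

# Baseline accuracy is the share of matches where that constant guess is right.
baseline_acc = (y_toy == most_common).mean()

print(most_common, baseline_acc)
```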

### No_upset Model

The next model we will make assumes that the higher-ranked player always wins, that is, that no upset occurs. This should be more accurate than our baseline of 52%.

In [None]:
# look at the no_upset column
train.no_upset.head(1)

In [None]:
# find the value counts for no_upset
train.no_upset.value_counts(normalize=True)

In [None]:
# find the most frequently occurring value in no_upset
baseline = train.no_upset.mode()
baseline

In [None]:
# set baseline and check accuracy
match_bsl_prediction = train.no_upset == 1
baseline_accuracy = match_bsl_prediction.mean()
# baseline accuracy = 64%
baseline_accuracy

##### The no_upset model accuracy is about 64%.
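
How the `no_upset` column is built is not shown in this notebook. One plausible construction, sketched below on toy data, marks a match as `no_upset` when the player with the better (numerically lower) rank wins; this derivation is an assumption.

```python
import pandas as pd

# Toy matches; a lower rank number means a better-ranked player.
toy = pd.DataFrame({
    "player_1_rank": [3.0, 50.0, 10.0, 7.0],
    "player_2_rank": [20.0, 5.0, 2.0, 90.0],
    "player_1_wins": [True, False, True, True],
})

# Player 1 is the favorite when their rank number is lower.
p1_favored = toy["player_1_rank"] < toy["player_2_rank"]

# No upset: the favorite won, whichever slot the favorite is in.
toy["no_upset"] = p1_favored == toy["player_1_wins"]

# Accuracy of always picking the favorite.
no_upset_accuracy = toy["no_upset"].mean()
print(no_upset_accuracy)
```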

### Adding More Features

Instead of assuming the highest ranking person wins, we will now use the features we identified earlier to create new models:

Features to move forward with: player_1_rank_points, player_1_hand_R, player_1_hand_L, surface_Clay, h2h_1, h2h_2.

### Decision Tree

In [None]:
# get_decision_tree from model.py with a max depth of 3
get_decision_tree(X_train, X_validate, y_train, y_validate)
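
`get_decision_tree` wraps the fitting and scoring in model.py. A bare-bones version of fitting a depth-3 tree and scoring it on train and validate could look like the sketch below. The synthetic features, the 300 / 100 split, and the `random_state` are assumptions; only `max_depth=3` comes from the cell above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and target standing in for X_train / y_train.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=400)) > 0
X_tr, X_val, y_tr, y_val = X[:300], X[300:], y[:300], y[300:]

# Shallow tree: max_depth=3 limits how closely it can fit the training data.
tree = DecisionTreeClassifier(max_depth=3, random_state=123)
tree.fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
val_acc = tree.score(X_val, y_val)
print(f"train accuracy: {train_acc:.2f}, validate accuracy: {val_acc:.2f}")
```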

### Random Forest

In [None]:
# get_random_forest from model.py with a max depth of 13 and min samples leaf of 3
get_random_forest(X_train, X_validate, y_train, y_validate)

### Logistic Regression

In [None]:
# get_log_reg from model.py with c = 9
get_log_reg(X_train, X_validate, y_train, y_validate)
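
`get_log_reg` is likewise defined in model.py. The sketch below fits a logistic regression with `C=9` on synthetic data and compares train and validate accuracy, which is the overfitting check the validate split exists for. Everything except the value of `C` is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features and target standing in for the modeling data.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 2))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=400)) > 0
X_tr, X_val, y_tr, y_val = X[:300], X[300:], y[:300], y[300:]

# C is the inverse regularization strength; a larger C means weaker regularization.
logit = LogisticRegression(C=9, random_state=123)
logit.fit(X_tr, y_tr)

train_acc = logit.score(X_tr, y_tr)
val_acc = logit.score(X_val, y_val)

# A large gap between the two scores would suggest overfitting.
print(f"train: {train_acc:.2f}, validate: {val_acc:.2f}, gap: {train_acc - val_acc:.2f}")
```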

##### Best model: logistic regression, with an accuracy of 70% on the validate data.

## Conclusion 
* Rivals that won half or more meetings with Federer beat him at Grand Slam events
* The court surface affects player performance
* Players performed differently based on the tourney level
* Higher ranked players win more often (about 64.28% of the time)
* Our best model (logistic regression, 70% accuracy on validate) beat the no_upset baseline (about 64%) by about 6%

## Next Steps

- Create a model to predict if a player will reach the top 30 ranking by evaluating their first 50 games
- Filter for players that hit a max rank of 100 or better
- Aggregate full stats for Aces, Breakpoints, Double Faults, Wins, and First Serve Win by Match
- Aggregate career performance by court surface (hard, grass, clay, carpet)
- Aggregate first 50 matches statistics - later use to predict future ranking

### Questions for our further exploration:
- Does a difference in career average break points won impact victory?

- Does a difference in career average break points saved impact victory?

- What characteristics and trends determine whether a player becomes a top-30 player?

- Does surface performance predict a player's future rank?

- Does a difference in career percent-of-break-points-won impact victory?