# Tennis Match Prediction
#### By: Alejandro Velasquez, Chloe Whitaker, Daniel Northcutt, Mason Sherbondy

## Project Goals:
- Create a tennis match predictor that will determine the outcome of a match
- Explore and compare the great rivalries of current tennis stars
- Create a model to predict whether a player will reach a top 30 ranking by evaluating their first 30 matches


## Initial Questions:
- Does a difference in career average break points saved impact victory?

- Does a difference in career average break points won impact victory?

- Does a difference in career percent-of-break-points-won impact victory?

- Does a difference in average forehand winners impact victory?

- Does a difference in average backhand winners impact victory?

- What are the drivers that determine a change in the dynamic between two players? Is there anything in our data set to suggest a change in dynamic?

- How do key rivalries play out in best-of-3 matches vs. best-of-5? Do rivalries tell a different story at Grand Slam events?

- How do key rivalries play out on clay? On grass? On hard court?

- What characteristics and trends determine whether a player reaches the top 30?

- Does surface performance predict a player's future rank?
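
The "difference in career average" questions above can be explored by building one difference feature per match and comparing it across outcomes. The following is a minimal sketch on toy data, not the project's actual analysis: the player names and values are made up, and the helper column `bp_saved_diff` is a hypothetical name. Only the `player_1_break_points_saved`, `player_2_break_points_saved`, and `player_1_wins` column names come from the prepared dataframe shown later.

```python
import pandas as pd

# Toy matches (hypothetical players and values, for illustration only).
matches = pd.DataFrame({
    "player_1": ["Abe", "Abe", "Cal", "Abe"],
    "player_2": ["Zed", "Cal", "Zed", "Zed"],
    "player_1_break_points_saved": [4.0, 5.0, 2.0, 6.0],
    "player_2_break_points_saved": [3.0, 1.0, 2.0, 1.0],
    "player_1_wins": [True, True, False, True],
})

# Stack both player slots into one long table: one row per (match, player).
long_rows = pd.concat(
    [
        pd.DataFrame({
            "player": matches[slot],
            "bp_saved": matches[f"{slot}_break_points_saved"],
        })
        for slot in ("player_1", "player_2")
    ],
    ignore_index=True,
)

# Career average break points saved per player.
career_avg = long_rows.groupby("player")["bp_saved"].mean()

# Difference feature: player_1's career average minus player_2's.
matches["bp_saved_diff"] = (
    matches["player_1"].map(career_avg) - matches["player_2"].map(career_avg)
)

# Mean difference when player_1 loses (False) vs. wins (True).
summary = matches.groupby("player_1_wins")["bp_saved_diff"].mean()
print(summary)
```

If the mean difference is clearly higher in the rows where `player_1_wins` is true, that is a hint worth testing formally, for example with a t-test from `scipy.stats`. Note that this sketch computes career averages over all matches, including the one being described, so a real model would need averages built only from earlier matches.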

In [1]:
# imports

import pandas as pd
import numpy as np
import regex as re

# Custom Helper Files
from prepare import *

# Split 
from sklearn.model_selection import train_test_split

# Stats
from scipy import stats

# Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Visualize
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

# Remove Limits On Viewing Dataframes
pd.set_option('display.max_columns', None)

## Executive Summary:

## Acquire:

- Data was acquired from a GitHub repository collecting men's ATP tennis match data from 1968 to 2019
- Source: https://github.com/JeffSackmann
- Collected about 180,000 rows of data

## ATPTotal Prepare:
- Assigned each match's winner and loser to player_1 and player_2 alphabetically by name, so a player's slot does not reveal who won
- Filtered for matches between 1999-01-01 and 2020-01-01
- Removed all walkovers and best-of-1 matches
- Set the index to the tourney date, parsed with the format %Y%m%d
- Dropped all matches that did not provide full match statistics
- Filtered out players who played fewer than 50 matches
- Result: 35,969 matches
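
The steps above can be sketched in pandas roughly as follows. This is an illustrative approximation, not the project's `prepare_atp()` function: the raw column names (`winner_name`, `loser_name`, `w_ace`, `l_ace`), the `"W/O"` walkover marker, the function name `prepare_matches_sketch`, and the rule that both players must meet the match minimum are all assumptions, and the rows are toy data.

```python
import pandas as pd

def prepare_matches_sketch(raw: pd.DataFrame, min_matches: int = 50) -> pd.DataFrame:
    """Hypothetical sketch of the prepare steps; not the project's prepare_atp()."""
    df = raw.copy()

    # Parse YYYYMMDD integers into datetimes and keep the study window.
    df["tourney_date"] = pd.to_datetime(df["tourney_date"].astype(str), format="%Y%m%d")
    in_window = df["tourney_date"].between(
        pd.Timestamp("1999-01-01"), pd.Timestamp("2020-01-01")
    )

    # Drop walkovers (assumed to be scored "W/O") and best-of-1 matches.
    playable = (df["score"] != "W/O") & (df["best_of"] != 1)

    # Drop matches without match statistics (only aces in this toy example).
    df = df[in_window & playable].dropna(subset=["w_ace", "l_ace"]).copy()

    # Order the two players alphabetically so the slot carries no outcome information.
    winner_first = df["winner_name"] < df["loser_name"]
    df["player_1"] = df["winner_name"].where(winner_first, df["loser_name"])
    df["player_2"] = df["loser_name"].where(winner_first, df["winner_name"])
    df["player_1_aces"] = df["w_ace"].where(winner_first, df["l_ace"])
    df["player_2_aces"] = df["l_ace"].where(winner_first, df["w_ace"])
    df["player_1_wins"] = winner_first

    # Keep matches where both players appear at least min_matches times (assumed rule).
    counts = pd.concat([df["player_1"], df["player_2"]]).value_counts()
    regulars = counts[counts >= min_matches].index
    df = df[df["player_1"].isin(regulars) & df["player_2"].isin(regulars)]

    keep = ["player_1", "player_2", "player_1_aces", "player_2_aces", "player_1_wins"]
    return df.set_index("tourney_date")[keep]

# Toy rows (hypothetical players): one too early, two valid, one walkover.
raw = pd.DataFrame({
    "tourney_date": [19980105, 19990111, 19990118, 20050620],
    "winner_name": ["Zed", "Abe", "Zed", "Abe"],
    "loser_name": ["Abe", "Zed", "Abe", "Zed"],
    "score": ["6-1 6-1", "7-6(1) 6-1", "5-7 6-3 6-2", "W/O"],
    "best_of": [3, 3, 3, 5],
    "w_ace": [1.0, 2.0, 6.0, None],
    "l_ace": [0.0, 0.0, 8.0, None],
})

prepared = prepare_matches_sketch(raw, min_matches=2)
print(prepared)
```

Ordering the players by name means `player_1_wins` becomes the prediction target, and every winner/loser statistic has to be swapped into the matching `player_1_*` / `player_2_*` column in the same way the aces are here.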

## PlayerDatabase Prepare:
- (Same prepare steps as above)
- Filtered for players that hit a max rank of 100 or better
- Aggregated full stats for Aces, Breakpoints, Double Faults, Wins, and First Serve Win by Match
- Aggregated career performance by court surface (hard, grass, clay, carpet)
- Aggregated statistics from each player's first 30 matches, used later to predict future ranking
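
The surface and first-matches aggregations above could look something like the sketch below. It assumes match rows in the `player_1` / `player_2` / `winner` / `surface` layout of the prepared dataframe; the helper names (`to_player_rows`, `surface_win_pct`, `first_n_win_pct`) and the toy rows are hypothetical and not taken from the project's code.

```python
import pandas as pd

def to_player_rows(matches: pd.DataFrame) -> pd.DataFrame:
    """Reshape one row per match into one row per (match, player), oldest first."""
    frames = [
        pd.DataFrame({
            "tourney_date": matches["tourney_date"],
            "player": matches[slot],
            "surface": matches["surface"],
            "won": (matches[slot] == matches["winner"]).astype(float),
        })
        for slot in ("player_1", "player_2")
    ]
    rows = pd.concat(frames, ignore_index=True)
    return rows.sort_values("tourney_date", kind="stable")

def surface_win_pct(player_rows: pd.DataFrame) -> pd.DataFrame:
    """Career win percentage per player (rows) and surface (columns)."""
    return player_rows.groupby(["player", "surface"])["won"].mean().unstack()

def first_n_win_pct(player_rows: pd.DataFrame, n: int = 30) -> pd.Series:
    """Win percentage over each player's first n matches."""
    first_n = player_rows.groupby("player").head(n)
    return first_n.groupby("player")["won"].mean()

# Toy matches between two hypothetical players.
matches = pd.DataFrame({
    "tourney_date": pd.to_datetime(
        ["1999-01-11", "1999-02-01", "1999-06-21", "1999-07-05"]
    ),
    "surface": ["Hard", "Hard", "Grass", "Clay"],
    "player_1": ["Abe", "Abe", "Abe", "Abe"],
    "player_2": ["Zed", "Zed", "Zed", "Zed"],
    "winner": ["Abe", "Zed", "Abe", "Abe"],
})

rows = to_player_rows(matches)
by_surface = surface_win_pct(rows)
early = first_n_win_pct(rows, n=3)
print(by_surface)
print(early)
```

Reshaping to one row per player first keeps every later aggregation a plain `groupby`, which avoids writing each statistic twice for the two player slots.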


In [4]:
# Load the prepared ATP match data via the helper in prepare.py (35,969 matches)
df = prepare_atp()
df.shape

(35969, 75)

In [5]:
df.head()

tourney_date,tourney_id,tourney_name,surface,draw_size,tourney_level,match_num,score,best_of,round,minutes,player_1,player_2,player_1_age,player_2_age,player_1_entry,player_2_entry,player_1_hand,player_2_hand,player_1_ht,player_2_ht,player_1_id,player_2_id,player_1_ioc,player_2_ioc,player_1_name,player_2_name,player_1_rank,player_2_rank,player_1_rank_points,player_2_rank_points,player_1_seed,player_2_seed,player_1_aces,player_2_aces,player_1_double_faults,player_2_double_faults,player_1_service_points,player_2_service_points,player_1_first_serves_in,player_2_first_serves_in,player_1_first_serve_points_won,player_2_first_serve_points_won,player_1_second_serve_points_won,player_2_second_serve_points_won,player_1_service_game_total,player_2_service_game_total,player_1_break_points_saved,player_2_break_points_saved,player_1_break_points_faced,player_2_break_points_faced,winner,player_1_first_serve_%,player_2_first_serve_%,player_1_first_serve_win_%,player_2_first_serve_win_%,player_1_break_points_won,player_2_break_points_won,player_1_wins,surface_Clay,surface_Grass,surface_Hard,tourney_level_D,tourney_level_F,tourney_level_G,tourney_level_M,player_1_hand_R,player_2_hand_R,round_F,round_QF,round_R128,round_R16,round_R32,round_R64,round_RR,round_SF
1999-01-11,1999-338,Sydney,Hard,32,A,16,7-6(1) 6-1,3,R32,84.0,Lleyton Hewitt,Patrick Rafter,17.878166,26.036961,,,R,R,180.0,185.0,103720,102158,AUS,AUS,Lleyton Hewitt,Patrick Rafter,104.0,4.0,456.0,3315.0,,2.0,2.0,0.0,0.0,5.0,73.0,59.0,51.0,36.0,32.0,24.0,14.0,8.0,10.0,9.0,4.0,3.0,6.0,7.0,Lleyton Hewitt,0.69863,0.610169,0.627451,0.666667,4.0,2.0,True,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0
1999-01-11,1999-338,Sydney,Hard,32,A,15,5-7 6-3 6-2,3,R32,115.0,Martin Damm Sr,Nicolas Kiefer,26.444901,21.519507,Q,,R,R,188.0,183.0,210013,103017,CZE,GER,Martin Damm Sr,Nicolas Kiefer,75.0,36.0,657.0,1007.0,,,8.0,6.0,2.0,4.0,84.0,86.0,52.0,38.0,40.0,31.0,11.0,32.0,14.0,15.0,3.0,2.0,6.0,3.0,Nicolas Kiefer,0.619048,0.44186,0.769231,0.815789,1.0,3.0,False,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0
1999-01-11,1999-338,Sydney,Hard,32,A,14,6-2 6-4,3,R32,61.0,Jan Siemerink,Mariano Puerta,28.744695,20.312115,,,L,L,183.0,180.0,101733,103264,NED,ARG,Jan Siemerink,Mariano Puerta,19.0,38.0,1664.0,983.0,,,1.0,1.0,7.0,1.0,51.0,46.0,24.0,31.0,15.0,25.0,14.0,11.0,9.0,9.0,3.0,0.0,6.0,0.0,Mariano Puerta,0.470588,0.673913,0.625,0.806452,0.0,3.0,False,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
1999-01-11,1999-338,Sydney,Hard,32,A,13,6-1 6-3,3,R32,43.0,Hicham Arazi,Todd Martin,25.229295,28.511978,,,L,R,175.0,198.0,102271,101774,MAR,USA,Hicham Arazi,Todd Martin,34.0,16.0,1069.0,1774.0,,8.0,2.0,10.0,3.0,0.0,47.0,36.0,27.0,27.0,16.0,25.0,6.0,7.0,8.0,8.0,1.0,0.0,5.0,0.0,Todd Martin,0.574468,0.75,0.592593,0.925926,0.0,4.0,False,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0
1999-01-11,1999-338,Sydney,Hard,32,A,12,6-3 7-6(4),3,R32,85.0,Carlos Moya,Thomas Johansson,22.373717,23.802875,,,R,R,190.0,180.0,102845,102563,ESP,SWE,Carlos Moya,Thomas Johansson,5.0,17.0,3159.0,1761.0,3.0,,7.0,8.0,1.0,2.0,79.0,58.0,37.0,26.0,28.0,19.0,22.0,17.0,11.0,10.0,5.0,2.0,7.0,5.0,Carlos Moya,0.468354,0.448276,0.756757,0.730769,3.0,2.0,True,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0


In [10]:
# Load the Player Database: aggregated stats for players active within the time span
PlayerData = pd.read_csv('PlayerData.csv')
PlayerData.shape

(371, 48)

In [13]:
# PlayerDatabase of 371 players who reached a max rank of 100 or better and played 50 or more matches
PlayerData.head()

Unnamed: 0.1,Unnamed: 0,PlayerID,Player_Name,Age,Height,MaxRank,Hand,Country,win_count,lose_count,match_count,win%,aces_in_match_lost,aces_in_match_won,ace_count,aces_per_game,first_serve_percentage_match_lost,first_serve_percentage_match_won,first_serve_won_percentage_match_lost,first_serve_won_percentage_match_won,breakpoints_won_match_lost,breakpoints_won_match_won,breakpoint_count,breakpoints_per_game,win_count_30,loss_count_30,win_count_100,loss_count_100,total_top30_matches,total_top100_matches,top_30_win%,top_100_win%,hard_surface_win,hard_surface_loss,hard_surface_match_count,hard_win%,clay_surface_win,clay_surface_loss,clay_surface_match_count,clay_win%,grass_surface_win,grass_surface_loss,grass_surface_match_count,grass_win%,carpet_surface_win,carpet_surface_loss,carpet_surface_match_count,carpet_win%
0,0,103720,Lleyton Hewitt,20.752909,180.0,1.0,R,AUS,458.0,196.0,654.0,0.7,183.0,422.0,605.0,0.925,0.525558,0.539098,0.675974,0.77347,183.0,422.0,605.0,0.925076,134.0,110.0,351.0,184.0,244.0,535.0,0.54918,0.656075,265.0,118.0,383.0,0.691906,85.0,46.0,131.0,0.648855,93.0,27.0,120.0,0.775,15.0,5.0,20.0,0.75
1,1,102158,Patrick Rafter,26.477755,185.0,2.0,R,AUS,97.0,44.0,141.0,0.69,40.0,86.0,126.0,0.894,0.637279,0.653328,0.700358,0.795185,40.0,86.0,126.0,0.893617,35.0,25.0,75.0,38.0,60.0,113.0,0.583333,0.663717,47.0,25.0,72.0,0.652778,14.0,11.0,25.0,0.56,31.0,6.0,37.0,0.837838,5.0,2.0,7.0,0.714286
2,2,103017,Nicolas Kiefer,22.53525,183.0,4.0,R,GER,217.0,162.0,379.0,0.57,150.0,205.0,355.0,0.937,0.514047,0.53799,0.689337,0.79237,150.0,205.0,355.0,0.936675,70.0,87.0,168.0,148.0,157.0,316.0,0.44586,0.531646,141.0,88.0,229.0,0.615721,37.0,44.0,81.0,0.45679,19.0,14.0,33.0,0.575758,20.0,16.0,36.0,0.555556
3,3,210013,Martin Damm Sr,28.227242,188.0,67.0,R,CZE,19.0,32.0,51.0,0.37,31.0,19.0,50.0,0.98,0.546553,0.611743,0.69308,0.806999,31.0,19.0,50.0,0.980392,4.0,9.0,11.0,25.0,13.0,36.0,0.307692,0.305556,9.0,16.0,25.0,0.36,4.0,7.0,11.0,0.363636,6.0,7.0,13.0,0.461538,0.0,0.0,0.0,0.0
4,4,103264,Mariano Puerta,26.90486,180.0,9.0,L,ARG,76.0,79.0,155.0,0.49,77.0,70.0,147.0,0.948,0.637432,0.680268,0.613399,0.746508,77.0,70.0,147.0,0.948387,11.0,33.0,55.0,71.0,44.0,126.0,0.25,0.436508,14.0,21.0,35.0,0.4,61.0,48.0,109.0,0.559633,0.0,0.0,0.0,0.0,1.0,7.0,8.0,0.125
