### ATP Tennis Data

Database compiled by Jeff Sackmann / Tennis Abstract: [Jeff Sackmann GitHub](https://github.com/JeffSackmann/tennis_atp) | [Tennis Abstract](http://www.tennisabstract.com/)

We'll explore ATP match-level data from 1991 to 2020.

Some of the questions we'll look into include:

* What is the effect of age on performance, and has this changed over time? (Interpretable Models)

* What are the most important features for identifying variation in performance? (PCA)

* How well does the previous season's performance determine current-season performance?

* Do players have "hot" streaks where winning begets more winning? 

* Is there more variance in the results for best-of-three versus best-of-five set matches? (Does the higher-ranked player, or the player with the better head-to-head record, win more frequently?)

* How much do first-serve percentage, break points saved/won, number of aces, average serve speed, ... affect performance?

* Are big servers better? (Not controlling for other factors; does a big serve usually come at the cost of other elements that make players more successful?)

* Do lefties perform better against righties or other lefties?

We'll also try to build a model for predicting the match winner. (Using DL, SVM, and/or XGBoost.)

---
#### Preliminaries and Read Data
We import the relevant packages and read the data.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob2
import datetime, sys

In [42]:
def readAllFiles(directory,prefix,ext):
    '''
    Read every file in a directory whose name matches prefix*ext and
    concatenate them into a single DataFrame.
    Required inputs: 
        directory ("directory")
        file name prefix ("prefix")
        file extension ("ext")
    '''
    AllFiles = glob2.glob(directory + "/" + prefix + "*" + ext)
    list_ = list()
    for file in AllFiles:
        print(file)
        # header=None: any header line is read as an ordinary row and filtered out later
        temp_df = pd.read_csv(file,index_col=None,header=None)
        list_.append(temp_df)
    df = pd.concat(list_,sort=False)
    return df

#read all rank data
ranks = readAllFiles("..\\data\\","atp_rankings_",".csv")
print(ranks.shape)

..\data\atp_rankings_00s.csv
..\data\atp_rankings_10s.csv
..\data\atp_rankings_70s.csv
..\data\atp_rankings_80s.csv
..\data\atp_rankings_90s.csv
..\data\atp_rankings_current.csv
(2851265, 4)
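
The same read-and-stack pattern can be written with only the standard library's `pathlib` in place of `glob2`. The block below is a minimal sketch, not the notebook's actual loader: the function name `read_all_files`, the `names` argument, the `RANK_COLUMNS` list, and the two tiny demo files are all assumptions made for illustration. It reads every column as a string so that a stray header row cannot produce mixed types, and it names the columns at read time.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical column names for the four-column rankings files.
RANK_COLUMNS = ["ranking_date", "rank", "player", "points"]


def read_all_files(directory, prefix, ext, names):
    """Read every file matching prefix*ext in directory and stack the rows."""
    paths = sorted(Path(directory).glob(f"{prefix}*{ext}"))
    if not paths:
        raise FileNotFoundError(f"no files matching {prefix}*{ext} in {directory}")
    # header=None keeps any header line as a data row; dtype=str avoids mixed types.
    frames = [pd.read_csv(p, header=None, names=names, dtype=str) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Demo on two throwaway files: one with a header line, one without (assumed layout).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "atp_rankings_a.csv").write_text(
        "ranking_date,rank,player,points\n20000110,1,101736,4135\n"
    )
    Path(tmp, "atp_rankings_b.csv").write_text("20000110,2,102338,2915\n")
    demo = read_all_files(tmp, "atp_rankings_", ".csv", RANK_COLUMNS)

# Drop rows that are really header lines.
demo = demo[demo["rank"] != "rank"]
print(demo.shape)
```

Sorting the paths makes the row order reproducible across machines, and `ignore_index=True` gives the stacked frame one continuous index instead of repeating each file's own row numbers.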


In [47]:
#clean up rank data
ranks.rename(columns = {0:'ranking_date',1:'rank',2:'player',3:'points'}, inplace = True)
ranks = ranks[ ranks['rank'] != "rank" ]

ranks.head()

Unnamed: 0,ranking_date,rank,player,points
1,20000110,1,101736,4135
2,20000110,2,102338,2915
3,20000110,3,101948,2419
4,20000110,4,103017,2184
5,20000110,5,102856,2169
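
Because header rows were read as data, the columns may still hold strings or mixed types after the filter above. The sketch below shows one way to convert them, assuming `ranking_date` is always in `YYYYMMDD` form and that `rank`, `player`, and `points` are whole numbers. The helper name `tidy_ranks`, the choice of `errors="coerce"`, and the small `raw` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the concatenated rankings frame: a stray header
# row is mixed in with the data, so every column starts out as strings.
raw = pd.DataFrame(
    [
        ["ranking_date", "rank", "player", "points"],
        ["20000110", "1", "101736", "4135"],
        ["20000110", "2", "102338", "2915"],
    ],
    columns=["ranking_date", "rank", "player", "points"],
)


def tidy_ranks(df):
    """Drop header rows, then parse the date and cast the numeric columns."""
    out = df[df["rank"] != "rank"].copy()
    # Assumes dates are stored as YYYYMMDD, e.g. 20000110.
    out["ranking_date"] = pd.to_datetime(
        out["ranking_date"].astype(str), format="%Y%m%d"
    )
    for col in ["rank", "player", "points"]:
        # Unparseable values become missing; Int64 is pandas' nullable integer type.
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    return out.reset_index(drop=True)


tidy = tidy_ranks(raw)
print(tidy.dtypes)
```

With a real datetime column, later steps such as grouping rankings by year or joining them to match dates can use the `.dt` accessor instead of string slicing.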
