# 2018 PGA Tour Strokes Gained Analysis - Performance from different distance ranges

The data is 2018 PGA Tour shot-level data, courtesy of PGA Tour ShotLink. Let's first load the modules and read the data:

In [69]:
import pandas as pd
import numpy as np

data = pd.read_csv('data/rshot.csv', sep = ';', low_memory=False)


In [70]:
data.columns

Index(['Tour Code', ' Tour Description', ' Year', ' Tourn.#', ' Player #',
       ' Course #', ' Permanent Tournament #', ' Player First Name',
       ' Player Last Name', ' Round', ' Tournament Name', 'Course Name',
       ' Hole', ' Hole Score', ' Par Value', ' Yardage', '  Shot',
       ' Shot Type(S/P/D)', '# of Strokes', ' From Location(Scorer)',
       ' From Location(Enhanced)', ' To Location(Scorer)',
       ' To Location(Enhanced)', ' Distance', ' Distance to Pin',
       ' In the Hole Flag', ' Around the Green Flag', '1st Putt Flag',
       ' Distance to Hole after the Shot', ' Time', ' Lie', ' Elevation',
       ' Slope', ' X Coordinate', ' Y Coordinate', '  Z Coordinate',
       ' Distance from Center', ' Distance from Edge', ' Date', ' Left/Right',
       ' Strokes Gained/Baseline', ' Strokes Gained Category',
       ' Recovery Shot'],
      dtype='object')

Many of the columns are not needed for this analysis, so we can drop them. We can also clean up the column names to make them easier to work with:

In [71]:
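#Column renaming and cleanup (regex=False so that '.' is a literal dot, not "any character")
data.columns = data.columns.str.replace(' ', '', regex=False)
data.columns = data.columns.str.replace('#', 'Nr', regex=False)
data.columns = data.columns.str.replace('.', '', regex=False)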
data['player_name'] = data['PlayerFirstName'] + " " + data['PlayerLastName']
data_cleaned = data.drop(['XCoordinate', 'YCoordinate', 'ZCoordinate', 'TourCode', 
                          'TournNr', 'CourseNr', 'PermanentTournamentNr', 'TourDescription', 
                          'Lie', 'Slope', 'Elevation', '1stPuttFlag', 
                          'Time', 'Date', 'Yardage', 'DistancefromEdge', 'DistancetoHoleaftertheShot', 
                          'DistancefromCenter', 'ToLocation(Scorer)', 'ToLocation(Enhanced)', 'HoleScore', 'ParValue', 'Left/Right',
                          'PlayerFirstName', 'PlayerLastName', 'IntheHoleFlag', 'AroundtheGreenFlag', 'RecoveryShot', 'NrofStrokes'], axis=1)

For this analysis we are only interested in approach shots, which here means shots in the Around the Green and Approach the Green categories, so we can remove the rest. We also remove shots that have no distance data and tournaments whose data is not in the correct format.

In [72]:
#Create dummy categories for strokes gained category
data_cleaned = pd.concat([data_cleaned,pd.get_dummies(data_cleaned['StrokesGainedCategory'] ) ], axis=1 )
#Remove original category
data_cleaned.drop('StrokesGainedCategory', axis=1, inplace=True)
# Remove pair tournament with invalid data
data_cleaned = data_cleaned[data_cleaned['TournamentName'] != 'Zurich Classic of New Orleans']
#Remove match play tournament
data_cleaned = data_cleaned[~data_cleaned['TournamentName'].str.contains('Match Play')]
#Remove shots from rounds without distance data
data_cleaned = data_cleaned[data_cleaned['DistancetoPin'] != 0]
#Remove penalty and drop shots
data_cleaned = data_cleaned[data_cleaned['ShotType(S/P/D)'] == 'S']
data_cleaned.drop('ShotType(S/P/D)', axis=1, inplace=True)
#Remove putting strokes
data_cleaned = data_cleaned[data_cleaned['Putting'] != 1]
#Remove strokes off the tee (tee shots on par 3s and 3rd shots are not in this category, so they are kept)
data_cleaned = data_cleaned[data_cleaned['Off the Tee'] != 1]
#Remove the stroke categories since we do not need them anymore
data_cleaned.drop(['Putting', 'Around the Green', 'Approach the Green', 'Off the Tee'], axis = 1, inplace=True )
#Convert from inches to yards
data_cleaned['DtP'] = ( data_cleaned['DistancetoPin'] / 36 ).round(0) 
#Create new FromLocation category
data_cleaned['FromLocation'] = np.where( data_cleaned['FromLocation(Scorer)'].isnull(), data_cleaned['FromLocation(Enhanced)'], data_cleaned['FromLocation(Scorer)'])
#Remove the other columns
data_cleaned.drop(['FromLocation(Scorer)', 'FromLocation(Enhanced)'], axis=1, inplace=True)
data_cleaned.head()

|  | Year | PlayerNr | Round | TournamentName | CourseName | Hole | Shot | Distance | DistancetoPin | StrokesGained/Baseline | player_name | DtP | FromLocation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2018 | 1810 | 1 | Safeway Open | Silverado Resort and Spa North | 1 | 2 | 4824 | 4929 | 0.505 | Phil Mickelson | 137.0 | Primary Rough |
| 4 | 2018 | 1810 | 1 | Safeway Open | Silverado Resort and Spa North | 2 | 1 | 7128 | 7308 | 0.158 | Phil Mickelson | 203.0 | Tee Box |
| 8 | 2018 | 1810 | 1 | Safeway Open | Silverado Resort and Spa North | 3 | 2 | 4994 | 4917 | 0.143 | Phil Mickelson | 137.0 | Fairway |
| 12 | 2018 | 1810 | 1 | Safeway Open | Silverado Resort and Spa North | 4 | 2 | 4952 | 4740 | 0.225 | Phil Mickelson | 132.0 | Fairway Bunker |
| 16 | 2018 | 1810 | 1 | Safeway Open | Silverado Resort and Spa North | 5 | 2 | 7978 | 7645 | 0.499 | Phil Mickelson | 212.0 | Primary Rough |
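
The dummy columns in the cell above are only used as row filters, so the same selection can be sketched with a single `isin` mask. This is a minimal illustration on a made-up frame (the rows in `toy` are invented; only the column names and category labels follow the notebook), not a drop-in replacement for the cell above. One difference to be aware of: `isin` drops rows whose category is missing, whereas the dummy-based filter keeps them.

```python
import pandas as pd

# Made-up shots; only the column names and category labels come from the notebook
toy = pd.DataFrame({
    'StrokesGainedCategory': ['Off the Tee', 'Approach the Green', 'Around the Green', 'Putting'],
    'ShotType(S/P/D)': ['S', 'S', 'S', 'S'],
    'DistancetoPin': [15000, 4929, 540, 120],  # inches
})

# Keep real strokes (S) in the two categories of interest
keep = toy['StrokesGainedCategory'].isin(['Approach the Green', 'Around the Green'])
approach = toy[keep & (toy['ShotType(S/P/D)'] == 'S')].copy()

# 36 inches per yard, rounded to whole yards as in the notebook
approach['DtP'] = (approach['DistancetoPin'] / 36).round(0)
print(approach[['StrokesGainedCategory', 'DtP']])
```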


To categorize player performance in different categories, we must first define the distance ranges:
 - Define the distance thresholds in yards. The last category covers everything at or beyond the final threshold
 - Give them a label
 - Categorize every shot into one of those labels
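
As a compact illustration of these three steps, the sketch below bins a handful of made-up distances with `pd.cut`. The thresholds are the ones used in this notebook; `toy_dtp`, `toy_cat` and `labels` are hypothetical names introduced only for this example. It is an alternative way to express the idea, and the notebook's own loop-based version follows in the next cell.

```python
import numpy as np
import pandas as pd

distances = [0, 50, 100, 140, 175, 200, 230]

# Labels such as "0-50", plus an open-ended last bin ">=230"
labels = [f"{lo}-{hi}" for lo, hi in zip(distances[:-1], distances[1:])]
labels.append(f">={distances[-1]}")

# right=False makes every bin closed on the left: [0, 50), [50, 100), ..., [230, inf)
toy_dtp = pd.Series([12, 50, 137, 203, 230, 260])
toy_cat = pd.cut(toy_dtp, bins=distances + [np.inf], labels=labels, right=False)
print(toy_cat.tolist())
```

With `pd.cut` a missing distance stays missing instead of landing in a category, and the result is an ordered categorical, so sorting follows distance order rather than alphabetical order.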

In [73]:
#Create categories
distances = [0, 50, 100, 140, 175, 200, 230 ]
distance_categories = []

#Place every shot into one category based on distance
for i in range(0, len(distances) - 1):
    cat_name = str(distances[i]) + "-" + str(distances[i+1]) 
    data_cleaned[cat_name] = ( data_cleaned['DtP'] >= distances[i] ) & ( data_cleaned['DtP'] < distances[i+1] )
    distance_categories.append(cat_name)
#Remaining shots (at or beyond the last threshold)
data_cleaned[ ">=" + str(distances[-1]) ] = data_cleaned['DtP'] >= distances[-1] 
distance_categories.append(">=" + str(distances[-1]))

#Create one column for category instead of multiple binary features
data_cleaned['dist_cat'] = data_cleaned[distance_categories].idxmax(axis=1)
#Drop old features
data_cleaned.drop(distance_categories, axis=1, inplace=True)
#For each tournament, round and distance category, take the average strokes gained to adjust for difficulty of course
#(select the column before calling mean() so that non-numeric columns are not averaged)
new_baseline = data_cleaned.groupby(['TournamentName', 'Round', 'dist_cat'], as_index=False)['StrokesGained/Baseline'].mean()
#Rename column
new_baseline = new_baseline.rename(columns={'StrokesGained/Baseline': 'new_baseline'})
#Merge with shot data
data_cleaned_new = data_cleaned.merge(new_baseline, how='left', on=['TournamentName', 'Round', 'dist_cat'])
#Create normalized strokes gained per shot
data_cleaned_new['adj_sg'] = data_cleaned_new['StrokesGained/Baseline'] - data_cleaned_new['new_baseline']
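
To make the adjustment concrete, here is a small sketch on invented numbers. It uses `groupby(...).transform('mean')` rather than the merge above, which should give the same per-shot result in one step; the tournament name and values are made up for illustration.

```python
import pandas as pd

# Made-up strokes-gained values for two rounds of one hypothetical tournament
toy = pd.DataFrame({
    'TournamentName': ['Toy Open'] * 4,
    'Round': [1, 1, 2, 2],
    'dist_cat': ['100-140'] * 4,
    'StrokesGained/Baseline': [0.30, -0.10, 0.20, 0.40],
})

# Field average within each tournament, round and distance category
group_mean = toy.groupby(['TournamentName', 'Round', 'dist_cat'])['StrokesGained/Baseline'].transform('mean')

# A shot's adjusted value is its distance from that field average
toy['adj_sg'] = toy['StrokesGained/Baseline'] - group_mean
print(toy[['Round', 'StrokesGained/Baseline', 'adj_sg']])
```

By construction the adjusted values average to zero within each tournament, round and category, so a positive `adj_sg` means the shot was better than the field managed from that distance range on that day.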


Now that we have the adjusted strokes gained for each shot, we can get the mean for each player and category for the whole season:

In [None]:
#Create new dataframe with each player's mean strokes gained per category over the season

player_category_adj_sg = data_cleaned_new.groupby(['PlayerNr', 'player_name', 'dist_cat'])['adj_sg'].describe()
player_category_adj_sg = player_category_adj_sg[['mean', 'count', 'std']].rename(index = str, columns={"mean": 'adj_sg', 'count': 'shot_count'})
player_category_adj_sg['std_err'] = player_category_adj_sg['std'] / np.sqrt(player_category_adj_sg['shot_count'])
player_category_adj_sg['95_conf_upp'] = 1.96*player_category_adj_sg['std_err'] + player_category_adj_sg['adj_sg']
player_category_adj_sg['95_conf_low'] = -1.96*player_category_adj_sg['std_err'] + player_category_adj_sg['adj_sg']
player_category_adj_sg.head()
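
The interval arithmetic in the cell above can be checked by hand on a tiny sample. The sketch below uses made-up values for one hypothetical player and category and the same normal-approximation multiplier of 1.96.

```python
import numpy as np
import pandas as pd

# Made-up adjusted strokes gained for one hypothetical player in one category
toy_sg = pd.Series([0.25, -0.10, 0.40, 0.05, -0.20, 0.30])

n = toy_sg.count()
mean = toy_sg.mean()
std = toy_sg.std()            # sample standard deviation (ddof=1), as reported by describe()
std_err = std / np.sqrt(n)    # standard error of the mean

# Normal-approximation 95% interval, using the same 1.96 multiplier as the notebook
low, upp = mean - 1.96 * std_err, mean + 1.96 * std_err
print(f"mean={mean:.3f}, 95% CI=({low:.3f}, {upp:.3f})")
```

For categories with only a few shots, the 1.96 multiplier is likely to understate the uncertainty; a Student t multiplier would give a somewhat wider interval.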


In [None]:
#Restructure the dataframe
player_category_adj_sg = player_category_adj_sg.unstack(level=-1).swaplevel(0,1,axis=1).sort_index(level=0, axis=1)
player_category_adj_sg.head()
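
What the restructuring does is easier to see on a toy frame. The sketch below applies the same `unstack` / `swaplevel` / `sort_index` chain to made-up data with two hypothetical players and two categories.

```python
import pandas as pd

# Made-up long-format stats, indexed by player and distance category
idx = pd.MultiIndex.from_product(
    [['Player A', 'Player B'], ['0-50', '50-100']],
    names=['player_name', 'dist_cat'],
)
toy_long = pd.DataFrame(
    {'adj_sg': [0.1, -0.2, 0.0, 0.3], 'shot_count': [40, 55, 38, 61]},
    index=idx,
)

# unstack moves dist_cat into the columns, swaplevel puts it on top,
# sort_index groups the columns by category and then by statistic
toy_wide = toy_long.unstack(level=-1).swaplevel(0, 1, axis=1).sort_index(level=0, axis=1)
print(toy_wide.columns.tolist())

# One row per player; selecting a category returns that category's statistics
print(toy_wide['50-100'])
```

Note that `sort_index` orders the category labels as strings, so in the real data `100-140` sorts before `50-100`. This is why the plotting code later re-sorts the categories with `getSortValue`.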

In [None]:

#Players for plot
players = [
    'Tiger Woods',
    'Rory McIlroy',
    # 'Jordan Spieth',
    'Justin Thomas',
    'Henrik Stenson',
    # 'Brooks Koepka',
    # 'Phil Mickelson',
    # 'Keegan Bradley',
    # 'Dustin Johnson',
    # 'Jason Day',
    # 'Bryson DeChambeau',
]

#For sorting the categories
def getSortValue(category):
    return distance_categories.index(category)






In [None]:
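#Plotting
#plotly.plotly moved to the separate chart_studio package in plotly 4;
#the offline module renders the figure in the notebook without an account
import plotly.offline as py
import plotly.graph_objs as go

py.init_notebook_mode(connected=True)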


plots = []
for player in players:
    playerdf = player_category_adj_sg[player_category_adj_sg.index.get_level_values('player_name') == player]
    playerdf = playerdf.transpose().unstack(level=1)
    #The statistics come out in alphabetical order, so '95_conf_low' precedes '95_conf_upp'
    playerdf.columns = ['95_low', '95_up', 'adj_sg', 'shot_count', 'std', 'std_err']
    playerdf['cat_ind'] = pd.Series(playerdf.index.values).apply(getSortValue).values
    playerdf = playerdf.sort_values('cat_ind')
    pplot = go.Scatter (  
        x = playerdf.index.values, 
        y = playerdf['adj_sg'].round(3), 
        name = player,
        text =  "Nr of shots: " + playerdf['shot_count'].astype(int).astype(str),
        error_y=dict(
            type='data',
            symmetric=True,
            array=1.96*playerdf['std_err'],
        )
    )
    plots.append(pplot)


layout = {"title": "Average strokes gained per shot in category"}
py.iplot({"data":plots, "layout": layout } )


## Results
The first thing one notices is that the error bars, which represent the 95% confidence interval, are very large. All of these results should therefore be taken with a grain of salt because of the small sample sizes, but there are still a few interesting observations to make:
 - Tiger Woods was the best player on tour between 50 and 100 yards last season (see next table).
 - Justin Thomas, who had a great season, dominated on shorter approach shots but fell off in the longer categories.
 - Henrik Stenson, the Swedish Iceman (I might be biased), is an absolute monster from beyond 175 yards while weaker at shorter distances. His legendary 3-wood had another impressive season judging by the >=230 category.
 - Rory McIlroy, who had a relatively weak season, was average across the board, perhaps poking a hole in the theory that it is his wedge game that is pulling him down.
 
We can also have a look at the best and worst players in each category:

In [None]:
#Collect one row per category; DataFrame.append was removed in pandas 2.0, so we concatenate at the end
min_rows = []
max_rows = []

for cat in distance_categories:
    #Only consider players with more than 30 shots in the category
    cat_df = player_category_adj_sg[cat]
    cat_df = cat_df[cat_df['shot_count'] > 30]

    #Worst player in the category
    min_row = cat_df.loc[[cat_df['adj_sg'].idxmin()]].reset_index()[['player_name', 'adj_sg']]
    min_row['cat'] = cat
    min_rows.append(min_row)

    #Best player in the category
    max_row = cat_df.loc[[cat_df['adj_sg'].idxmax()]].reset_index()[['player_name', 'adj_sg']]
    max_row['cat'] = cat
    max_rows.append(max_row)

min_player = pd.concat(min_rows, ignore_index=True)[['cat', 'player_name', 'adj_sg']]
max_player = pd.concat(max_rows, ignore_index=True)[['cat', 'player_name', 'adj_sg']]

display(max_player)
display(min_player)
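
If the per-player results are kept in long format (one row per player and category), the same best/worst lookup can be sketched without an explicit loop. The frame below is made up for illustration, and the 30-shot cutoff mirrors the one used above.

```python
import pandas as pd

# Made-up per-player results in long format (one row per player and category)
toy = pd.DataFrame({
    'player_name': ['Player A', 'Player B', 'Player C', 'Player A', 'Player B', 'Player C'],
    'dist_cat': ['0-50', '0-50', '0-50', '50-100', '50-100', '50-100'],
    'adj_sg': [0.10, -0.05, 0.40, 0.02, 0.08, -0.12],
    'shot_count': [45, 60, 12, 70, 33, 52],
})

# Same sample-size guard as above: ignore players with 30 or fewer shots
eligible = toy[toy['shot_count'] > 30]

# idxmax/idxmin per category return the row labels of the best and worst players
best = eligible.loc[eligible.groupby('dist_cat')['adj_sg'].idxmax()]
worst = eligible.loc[eligible.groupby('dist_cat')['adj_sg'].idxmin()]
print(best[['dist_cat', 'player_name', 'adj_sg']])
print(worst[['dist_cat', 'player_name', 'adj_sg']])
```

In this toy data Player C has the highest average from 0-50 yards but only 12 shots, so the cutoff excludes that row, which is exactly the kind of small-sample outlier the filter is meant to guard against.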

In [None]:
player_category_adj_sg.to_csv('data/player_category_sg_2018')