# Rough vs fairway difficulty analysis for the 2017-2018 PGA Tour Season
Data courtesy of PGA Tour ShotLink. The data contains every shot recorded by the PGA Tour's ShotLink system over the 2017-2018 PGA Tour season, together with a strokes gained value for each shot.

In [1]:
import pandas as pd
import numpy as np

Read in data:

In [2]:
data = pd.read_csv('../data/rshot.csv.TXT', delimiter=';', low_memory=False)

Strip spaces, `#` and periods from the column names so they are easier to read and reference:

In [3]:
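# regex=False treats each pattern as a literal string; as a regex, '.' would match any character
data.columns = data.columns.str.replace(' ', '', regex=False)
data.columns = data.columns.str.replace('#', 'Nr', regex=False)
data.columns = data.columns.str.replace('.', '', regex=False)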
data.columns

Index(['TourCode', 'TourDescription', 'Year', 'TournNr', 'PlayerNr',
       'CourseNr', 'PermanentTournamentNr', 'PlayerFirstName',
       'PlayerLastName', 'Round', 'TournamentName', 'CourseName', 'Hole',
       'HoleScore', 'ParValue', 'Yardage', 'Shot', 'ShotType(S/P/D)',
       'NrofStrokes', 'FromLocation(Scorer)', 'FromLocation(Enhanced)',
       'ToLocation(Scorer)', 'ToLocation(Enhanced)', 'Distance',
       'DistancetoPin', 'IntheHoleFlag', 'AroundtheGreenFlag', '1stPuttFlag',
       'DistancetoHoleaftertheShot', 'Time', 'Lie', 'Elevation', 'Slope',
       'XCoordinate', 'YCoordinate', 'ZCoordinate', 'DistancefromCenter',
       'DistancefromEdge', 'Date', 'Left/Right', 'StrokesGained/Baseline',
       'StrokesGainedCategory', 'RecoveryShot'],
      dtype='object')
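
The three replacements can be checked on a few header strings without loading the data. This is a minimal sketch in plain Python; the raw names below are guesses at what the ShotLink file might contain (they are not shown in this notebook), and `clean_column_name` is a hypothetical helper.

```python
def clean_column_name(name: str) -> str:
    """Apply the same three literal replacements used on data.columns."""
    return name.replace(' ', '').replace('#', 'Nr').replace('.', '')

# Hypothetical raw headers, chosen to resemble the cleaned names above
raw_names = ['Tourn.#', 'Player #', 'Distance to Pin', 'Strokes Gained/Baseline']
cleaned = [clean_column_name(n) for n in raw_names]
print(cleaned)
```

Python's built-in `str.replace` always treats its pattern literally, which is the behaviour the period replacement relies on.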

Check what locations are possible:

In [4]:
data['FromLocation(Scorer)'].unique()

array(['Tee Box', 'Primary Rough', 'Green', 'Fairway', 'Fairway Bunker',
       'Intermediate Rough', 'Fringe', 'Green Side Bunker', 'Unknown',
       'Native Area', 'Other', nan, 'Water'], dtype=object)

Define conditions for when shots were hit from the rough or fairway

In [5]:
data['is_rough'] = data['FromLocation(Scorer)'].str.contains('Rough', na=False)

In [7]:
# The fairway flag reads the Enhanced location column; missing values count as not fairway
data['is_fairway'] = data['FromLocation(Enhanced)'].str.contains('Fairway|Enhanced', na=False)
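
To see which labels a substring match picks up, the check can be run over the `FromLocation(Scorer)` values listed above. This is a minimal sketch in plain Python, with `matches` as a hypothetical helper that mimics a NaN-safe `str.contains`. The values of `FromLocation(Enhanced)` are not shown in this notebook, so the fairway result here only illustrates the matching rule.

```python
import math

# Values returned by data['FromLocation(Scorer)'].unique() above
scorer_locations = ['Tee Box', 'Primary Rough', 'Green', 'Fairway', 'Fairway Bunker',
                    'Intermediate Rough', 'Fringe', 'Green Side Bunker', 'Unknown',
                    'Native Area', 'Other', math.nan, 'Water']

def matches(value, needle):
    """Return True only for strings containing needle; missing values give False."""
    return isinstance(value, str) and needle in value

rough_labels = [loc for loc in scorer_locations if matches(loc, 'Rough')]
fairway_labels = [loc for loc in scorer_locations if matches(loc, 'Fairway')]

print(rough_labels)    # ['Primary Rough', 'Intermediate Rough']
print(fairway_labels)  # ['Fairway', 'Fairway Bunker']
```

Among the Scorer labels, a plain match on `'Fairway'` also catches `'Fairway Bunker'`, so it may be worth inspecting the unique values of the Enhanced column before relying on the flag.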

Filter out shots where no distance information (and thus no strokes gained) was recorded

In [8]:
data = data[data.DistancetoPin > 0]

Filter out tournaments with invalid data:

In [9]:
# Remove the pairs tournament, which has invalid data
data = data[data['TournamentName'] != 'Zurich Classic of New Orleans']
# Remove the match play tournament
data = data[~data['TournamentName'].str.contains('Match Play')]

Create separate dataframes for shots:

In [15]:
rough_shots = data[data['is_rough']]
fairway_shots = data[data['is_fairway']]

Calculate stats for the two dataframes:

In [11]:
# Select the strokes gained column before describe() so only that column is summarised
rough_sg = rough_shots.groupby('CourseName')['StrokesGained/Baseline'].describe()[['mean', 'count', 'std']]

In [12]:
fairway_sg = fairway_shots.groupby('CourseName')['StrokesGained/Baseline'].describe()[['mean', 'count', 'std']]

Combine dataframes:

In [19]:
sg_mean = pd.concat((fairway_sg, rough_sg), axis=1)

In [20]:
sg_mean.columns = ['fairway_mean', 'fairway_count', 'fairway_std', 'rough_mean', 'rough_count', 'rough_std']

Normalize around the mean of all rough/fairway shots. The mean should be 0 since strokes gained is calculated against a baseline, but it may deviate depending on playing conditions over a season.

In [21]:
all_rough_mean = rough_shots['StrokesGained/Baseline'].mean()
all_fairway_mean = fairway_shots['StrokesGained/Baseline'].mean()
sg_mean['rough_mean'] = sg_mean['rough_mean'] - all_rough_mean
sg_mean['fairway_mean'] = sg_mean['fairway_mean'] - all_fairway_mean
sg_mean['diff'] = sg_mean['rough_mean'] - sg_mean['fairway_mean']
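
The normalization can be followed on a toy example. This is a minimal sketch with made-up strokes gained values for two fictional courses (all numbers and course names are hypothetical), using only the standard library.

```python
from statistics import mean

# Hypothetical strokes gained values per course and lie type
rough = {'Course A': [-0.2, -0.1, -0.3], 'Course B': [0.0, 0.1, -0.1]}
fairway = {'Course A': [0.1, 0.2, 0.0], 'Course B': [0.0, 0.1, 0.2]}

# Season-wide mean over all shots of each lie type
all_rough_mean = mean(v for shots in rough.values() for v in shots)
all_fairway_mean = mean(v for shots in fairway.values() for v in shots)

diff = {}
for course in rough:
    rough_rel = mean(rough[course]) - all_rough_mean        # course rough vs season rough
    fairway_rel = mean(fairway[course]) - all_fairway_mean  # course fairway vs season fairway
    diff[course] = rough_rel - fairway_rel

print(diff)
```

A negative `diff` means the course's rough plays harder, relative to the season-wide rough average, than its fairways do relative to the season-wide fairway average.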

In [22]:
import scipy.stats as stats


## Two-sample t-test
Using a two-sample t-test, we can calculate a p-value for the null hypothesis that the rough and the fairway are equally difficult at a given course (each compared to the baseline for its lie type). We then compare this p-value against our chosen significance level (0.05). This is justified because the samples are large enough that the distribution of the sample means is approximately normal. Since we do not assume equal variances for the two populations (rough and fairway shots), we use [Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test), as implemented in the `scipy.stats` library.

In [23]:
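def two_sample_ttest(row):
    """Return the Welch's t-test p-value for one course's fairway vs rough summary stats."""
    statistic, p_val = stats.ttest_ind_from_stats(row['fairway_mean'],
                                                  row['fairway_std'],
                                                  row['fairway_count'],
                                                  row['rough_mean'],
                                                  row['rough_std'],
                                                  row['rough_count'],
                                                  equal_var=False)  # unequal variances: Welch's t-test
    return p_val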

sg_mean['p_val'] = sg_mean.apply(two_sample_ttest, axis=1)
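
As a sanity check on what `ttest_ind_from_stats` computes with `equal_var=False`, the Welch statistic and the Welch-Satterthwaite degrees of freedom can be worked out by hand from the summary statistics. This is a minimal sketch using the East Lake GC figures from the results table; `welch_from_stats` is a hypothetical helper. The p-value uses a normal approximation to the t distribution, which is reasonable only because the degrees of freedom are in the thousands, so it will not match scipy's value exactly.

```python
import math

def welch_from_stats(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic, Welch-Satterthwaite df, and a normal-approximation p-value."""
    v1, v2 = std1**2 / n1, std2**2 / n2  # squared standard errors of the two means
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
    p_approx = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p_approx

# East Lake GC: fairway (mean, std, count), then rough (mean, std, count)
t, df, p_approx = welch_from_stats(0.066558, 0.330997, 1156,
                                   -0.059045, 0.338802, 1042)
print(f"t = {t:.2f}, df = {df:.0f}, p (normal approx.) = {p_approx:.1e}")
```

Using `math.erfc` rather than `1 - cdf` keeps precision for very small tail probabilities, which would otherwise round to zero.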

Compare p-values against a significance level of 0.05:

In [24]:
alpha = 0.05
sg_mean['significant'] = sg_mean['p_val'] < alpha


## Results
List the courses sorted by the difference in difficulty between rough and fairway. Courses at the top play harder from the rough than from the fairway relative to baseline (in absolute terms, rough shots are almost always more difficult than fairway shots).

In [25]:
sg_mean.sort_values('diff')

| CourseName | fairway_mean | fairway_count | fairway_std | rough_mean | rough_count | rough_std | diff | p_val | significant |
|---|---|---|---|---|---|---|---|---|---|
| East Lake GC | 0.066558 | 1156.0 | 0.330997 | -0.059045 | 1042.0 | 0.338802 | -0.125603 | 3.431573e-18 | True |
| Muirfield Village GC | 0.019699 | 5070.0 | 0.383432 | -0.054591 | 2977.0 | 0.386643 | -0.07429 | 8.55837e-17 | True |
| Bellerive CC | 0.04618 | 6055.0 | 0.32033 | -0.027218 | 2590.0 | 0.360528 | -0.073398 | 4.775592e-19 | True |
| TPC Southwind | 0.012856 | 4885.0 | 0.346411 | -0.057463 | 4077.0 | 0.381274 | -0.070319 | 1.581681e-19 | True |
| TPC Potomac at Avenel Farm | 0.044356 | 4198.0 | 0.357672 | -0.015418 | 3132.0 | 0.359574 | -0.059773 | 1.880642e-12 | True |
| TPC Deere Run | 0.032486 | 5974.0 | 0.358393 | -0.026956 | 2575.0 | 0.354337 | -0.059442 | 1.514008e-12 | True |
| Sedgefield CC | 0.04742 | 5119.0 | 0.331337 | -0.006151 | 3117.0 | 0.349761 | -0.053571 | 6.730312e-12 | True |
| Bay Hill Club & Lodge | -0.016071 | 5746.0 | 0.351686 | -0.065948 | 1893.0 | 0.370427 | -0.049877 | 2.855011e-07 | True |
| Ridgewood CC | 0.021153 | 5020.0 | 0.326354 | -0.017918 | 2859.0 | 0.338498 | -0.039071 | 6.196614e-07 | True |
| Firestone CC (South) | 0.024789 | 2667.0 | 0.353075 | -0.013633 | 2979.0 | 0.36569 | -0.038422 | 6.052872e-05 | True |


The site of the Tour Championship, East Lake, has a clear lead at the top. This is not surprising, as the course is famous for its difficult rough, and I believe our favourite scientist golfer [Bryson DeChambeau would agree](https://www.golfchannel.com/article/golf-central-blog/dechambeau-rough-never-encountered-something-thick)!