## Hypothesis Testing to Test Some Classical Golf Theories

### The "Drive for Show, Putt for Dough" Theory

In [8]:
import pandas as pd
import os

# check directories
os.listdir('../ml')

['training_data.csv',
 'made_the_cut.sav',
 'win_with_other_info.ipynb',
 'training_data.pkl',
 'make_the_cut.ipynb',
 'winners_model.sav',
 'win.ipynb',
 '.ipynb_checkpoints']

In [9]:
# import data
path = '../ml/training_data.pkl'
data: pd.DataFrame = pd.read_pickle(path)
data.head()

| | tournament_id | player_name | score | Alabama | Arizona | California | Canada | Connecticut | Delaware | Florida | ... | app | ott | t2g | result | tournament_putt | tournament_arg | tournament_app | tournament_ott | tournament_t2g | tournament_cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 147 | Grayson Murray | -21 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 3 | 1 | -0.125295 | -0.053159 | -0.217258 | -0.091008 | -0.361742 | 3 |
| 1 | 147 | Chad Collins | -20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 3 | 1 | -0.125295 | -0.053159 | -0.217258 | -0.091008 | -0.361742 | 3 |
| 2 | 147 | Brian Gay | -19 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 1 | -0.125295 | -0.053159 | -0.217258 | -0.091008 | -0.361742 | 3 |
| 3 | 147 | Scott Stallings | -19 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 3 | 1 | -0.125295 | -0.053159 | -0.217258 | -0.091008 | -0.361742 | 3 |
| 4 | 147 | Tag Ridings | -19 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 3 | 1 | -0.125295 | -0.053159 | -0.217258 | -0.091008 | -0.361742 | 3 |


In [10]:
# data columns
print(data.columns)

Index(['tournament_id', 'player_name', 'score', 'Alabama', 'Arizona',
       'California', 'Canada', 'Connecticut', 'Delaware', 'Florida', 'Georgia',
       'Hawaii', 'Illinois', 'Kentucky', 'Maryland', 'Massachusetts', 'Mexico',
       'Michigan', 'Minnesota', 'Missouri', 'New Jersey', 'New York',
       'North Carolina', 'Ohio', 'Oklahoma', 'Pennsylvania', 'Scotland',
       'South Carolina', 'Tennessee', 'Texas', 'sg_putt', 'sg_arg', 'sg_app',
       'sg_ott', 'sg_t2g', 'sg_total', 'pos', 'putting', 'arg', 'app', 'ott',
       't2g', 'result', 'tournament_putt', 'tournament_arg', 'tournament_app',
       'tournament_ott', 'tournament_t2g', 'tournament_cluster'],
      dtype='object')


In [11]:
# extracting just numeric data
numeric_cols = ['sg_putt', 'sg_arg', 'sg_app', 'sg_ott', 'sg_t2g', 'sg_total', 'pos']
num_df: pd.DataFrame = data[numeric_cols].dropna()
num_df.mean(numeric_only=True)

sg_putt    -0.149599
sg_arg     -0.066407
sg_app     -0.157595
sg_ott     -0.077082
sg_t2g     -0.301058
sg_total   -0.422657
dtype: float64

In [12]:
# average strokes gained by the winner
num_df.query('pos == "1"').mean(numeric_only=True)

sg_putt     1.264197
sg_arg      0.438145
sg_app      1.331370
sg_ott      0.669659
sg_t2g      2.439595
sg_total    3.717081
dtype: float64

### Driver vs Putter Debate

In [13]:
from scipy.stats import ttest_ind
import numpy as np

# test whether mean strokes gained putting is lower than mean strokes gained off the tee (driving)
putt: np.ndarray = num_df.sg_putt.values
drive: np.ndarray = num_df.sg_ott.values

# test
ttest_ind(putt, drive, alternative="less")

Ttest_indResult(statistic=-7.885895183600832, pvalue=1.5946451749026323e-15)

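With a p-value of about 1.6e-15, we reject the null hypothesis that mean sg_putt is at least as large as mean sg_ott: across the field, average strokes gained putting is significantly lower than average strokes gained off the tee. A comparison of means does not by itself measure impact on scoring, but it suggests that driving may be more important than the old saying implies.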
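Because `sg_putt` and `sg_ott` are taken from the same rows of `num_df`, the two samples are arguably paired rather than independent. The sketch below compares `ttest_ind` with the paired `ttest_rel` on synthetic data; the arrays `putt_demo` and `drive_demo` and the shared `skill` component are illustrative assumptions, not values from `training_data.pkl`.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumed): each row shares a common "skill" component,
# which makes the two columns correlated, as same-row measurements often are.
skill = rng.normal(0.0, 1.0, 500)
putt_demo = skill + rng.normal(-0.15, 0.5, 500)
drive_demo = skill + rng.normal(-0.08, 0.5, 500)

# Independent-samples test, as used in the cell above
independent = ttest_ind(putt_demo, drive_demo, alternative="less")

# Paired test: works on the row-by-row differences putt_demo - drive_demo
paired = ttest_rel(putt_demo, drive_demo, alternative="less")

print("independent:", independent.statistic, independent.pvalue)
print("paired:     ", paired.statistic, paired.pvalue)
```

On the real data the paired version would be `ttest_rel(putt, drive, alternative="less")`; whether it changes the conclusion depends on how strongly the two columns are correlated.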

### Do Winners Putt Better?

In [14]:
# test to see if winners putt better
winners_putting: np.ndarray = num_df.query('pos == "1"').sg_putt.values
others_putting: np.ndarray = num_df.query('pos != "1"').sg_putt.values

# run test
ttest_ind(winners_putting, others_putting, alternative="greater")

Ttest_indResult(statistic=16.369259579106767, pvalue=3.3819991028603986e-60)

These results allow us to reject the null hypothesis that winners' mean sg_putt is no greater than that of non-winners, so we can state that putting does have an impact on winning a golf tournament.
<br>
<br>
For a more interesting comparison, let's see whether a winner can be differentiated from someone who finished in the top 5 just by putting.

In [15]:
# test to see if those who barely lose have worse putting than those who win
top5_putting: np.ndarray = num_df.query('pos in ["2", "3", "4", "5", "T2", "T3", "T4", "T5"]').sg_putt.values

# run test
ttest_ind(winners_putting, top5_putting, alternative="greater")

Ttest_indResult(statistic=5.0775060206955684, pvalue=2.265078298359371e-07)

This test shows that putting differentiates, with statistical significance, even between a win and a top-5 finish.

### Do Winners Rely More on Putting or Ball Striking?

In [16]:
# test winners sg_putt > sg_app
winners: pd.DataFrame = num_df.query('pos == "1"')
putts: np.ndarray = winners.sg_putt.values
approach: np.ndarray = winners.sg_app.values

# test
ttest_ind(putts, approach, alternative="greater")

Ttest_indResult(statistic=-0.8359178209234898, pvalue=0.7981093118397729)

In [17]:
# test sg_putt < sg_app
# test
ttest_ind(putts, approach, alternative="less")

Ttest_indResult(statistic=-0.8359178209234898, pvalue=0.20189068816022715)

In [18]:
# test sg_putt != sg_app (two-sided alternative; the null is that the means are equal)
ttest_ind(putts, approach, alternative="two-sided")

Ttest_indResult(statistic=-0.8359178209234898, pvalue=0.4037813763204543)

None of the three tests rejects the null hypothesis, so there is no statistical evidence that winners' mean sg_putt and mean sg_app differ. On this data, neither stat can be said to have a larger impact on winners' performance.
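The three p-values above come from the same t statistic, so they are tied together: the two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller one-sided value. A quick arithmetic check using the numbers printed above (the variable names are only for illustration):

```python
import math

# p-values copied from the three test outputs above
p_greater = 0.7981093118397729
p_less = 0.20189068816022715
p_two_sided = 0.4037813763204543

# The two one-sided p-values of the same t statistic are complementary
print(math.isclose(p_greater + p_less, 1.0))

# The two-sided p-value is twice the smaller one-sided p-value
print(math.isclose(2 * min(p_greater, p_less), p_two_sided))
```

In practice this means the two-sided test alone is enough to conclude that no difference was detected.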

### Let's Combine All Strokes Gained Stats to Find Differences Between Winners and the Top 5

In [21]:
num_df.head()

| | sg_putt | sg_arg | sg_app | sg_ott | sg_t2g | sg_total | pos |
|---|---|---|---|---|---|---|---|
| 0 | 0.387 | 0.064 | 1.417 | 1.518 | 3.0 | 3.387 | 1 |
| 1 | 1.429 | 0.107 | 1.668 | -0.067 | 1.707 | 3.137 | 2 |
| 2 | 1.017 | 0.944 | 1.093 | -0.167 | 1.87 | 2.887 | T3 |
| 3 | -0.187 | 0.165 | 1.887 | 1.022 | 3.074 | 2.887 | T3 |
| 4 | 2.118 | 0.414 | -0.212 | 0.566 | 0.768 | 2.886 | T3 |


In [19]:
from scipy.stats import f_oneway

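# create winners and top5 matrices
# iloc[:, :-3] drops sg_t2g, sg_total and pos, keeping sg_putt, sg_arg, sg_app, sg_ott
winners: np.ndarray = num_df.query('pos == "1"').iloc[:, :-3].values
top5: np.ndarray = num_df.query('pos in ["2", "3", "4", "5", "T2", "T3", "T4", "T5"]').iloc[:, :-3].values

# run one-way ANOVA column by column (one p-value per strokes gained category)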
f_stat, p_value = f_oneway(winners, top5)
p_value

array([4.53015660e-07, 1.69006938e-01, 3.05133542e-08, 8.84656344e-04])

These results show a statistically significant difference (p < 0.05) between winners and the rest of the top 5 in sg_putt, sg_app and sg_ott. The difference in sg_arg (p ≈ 0.17) is not statistically significant.
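The p-value array is easier to read once each entry is paired with its column. The sketch below assumes the column order left by `iloc[:, :-3]` (`sg_putt`, `sg_arg`, `sg_app`, `sg_ott`) and a significance level of 0.05; the p-values are copied from the output above.

```python
# Column order left by num_df.iloc[:, :-3] (assumed from numeric_cols)
sg_cols = ['sg_putt', 'sg_arg', 'sg_app', 'sg_ott']

# p-values copied from the ANOVA output above
p_values = [4.53015660e-07, 1.69006938e-01, 3.05133542e-08, 8.84656344e-04]

alpha = 0.05  # assumed significance level

# Pair each p-value with its category and flag the significant ones
labeled = dict(zip(sg_cols, p_values))
significant = {col: p < alpha for col, p in labeled.items()}

for col in sg_cols:
    print(f"{col}: p = {labeled[col]:.3g}, significant = {significant[col]}")
```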

### Let's See if Winners and Runners-Up Are Unique

In [20]:
# create winners and runner-up matrices (columns: sg_putt, sg_arg, sg_app, sg_ott)
winners: np.ndarray = num_df.query('pos == "1"').iloc[:, :-3].values
runner_up: np.ndarray = num_df.query('pos in ["2", "T2"]').iloc[:, :-3].values

# run one-way ANOVA
f_stat, p_value = f_oneway(winners, runner_up)
p_value

array([0.01148892, 0.81770125, 0.00089102, 0.10420376])

These results show that sg_putt and sg_app are the two stats that differentiate winners from runners-up at the 0.05 level; sg_arg (p ≈ 0.82) and sg_ott (p ≈ 0.10) do not. This aligns well with Mark Broadie's claims and shows that approach shots are a very important indicator of success.
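Since each ANOVA in this section compares only two groups, it is equivalent to a two-sided, equal-variance t-test applied column by column: the F statistic equals the squared t statistic and the p-values coincide. A minimal sketch on synthetic data (the arrays `group_a` and `group_b` are made-up stand-ins, not the golf data):

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(1)

# Synthetic stand-ins (assumed): two groups of different size, four columns each
group_a = rng.normal(1.0, 1.0, size=(40, 4))
group_b = rng.normal(0.5, 1.0, size=(120, 4))

# One-way ANOVA, computed per column (axis=0 is the default)
f_stat, p_anova = f_oneway(group_a, group_b)

# Two-sided, pooled-variance t-test, also per column
t_stat, p_ttest = ttest_ind(group_a, group_b)

print(np.allclose(f_stat, t_stat ** 2))
print(np.allclose(p_anova, p_ttest))
```

This is consistent with the earlier outputs: the ANOVA p-value for sg_putt between winners and the top 5 (about 4.53e-07) is twice the one-sided t-test p-value reported for the same comparison (about 2.27e-07).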