# Project to determine the Best PGA player from 2010 to 2018
In this notebook, we build a machine learning model that predicts which player on the PGA Tour is the best, using the features in Kaggle's PGA Tour dataset.
## 1. Problem definition:
How well can we predict the best player on the PGA Tour, given each player's season statistics and previous examples of wins and top-10 finishes?
## 2. Data:
The data is downloaded from Kaggle: https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018
## 3. Evaluation:
The model is evaluated on how accurately it predicts which player obtains the most wins and top-10 finishes, using the available features.
## 4. Features:
Kaggle provides a dataset of 18 columns for evaluation (see the link above).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

In [2]:
# Import data for project
df = pd.read_csv('pgaTourData.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2312 entries, 0 to 2311
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player Name         2312 non-null   object 
 1   Rounds              1678 non-null   float64
 2   Fairway Percentage  1678 non-null   float64
 3   Year                2312 non-null   int64  
 4   Avg Distance        1678 non-null   float64
 5   gir                 1678 non-null   float64
 6   Average Putts       1678 non-null   float64
 7   Average Scrambling  1678 non-null   float64
 8   Average Score       1678 non-null   float64
 9   Points              2296 non-null   object 
 10  Wins                293 non-null    float64
 11  Top 10              1458 non-null   float64
 12  Average SG Putts    1678 non-null   float64
 13  Average SG Total    1678 non-null   float64
 14  SG:OTT              1678 non-null   float64
 15  SG:APR              1678 non-null   float64
 16  SG:ARG

In [4]:
df.isna().sum()

Player Name              0
Rounds                 634
Fairway Percentage     634
Year                     0
Avg Distance           634
gir                    634
Average Putts          634
Average Scrambling     634
Average Score          634
Points                  16
Wins                  2019
Top 10                 854
Average SG Putts       634
Average SG Total       634
SG:OTT                 634
SG:APR                 634
SG:ARG                 634
Money                   12
dtype: int64

In [5]:
df.columns

Index(['Player Name', 'Rounds', 'Fairway Percentage', 'Year', 'Avg Distance',
       'gir', 'Average Putts', 'Average Scrambling', 'Average Score', 'Points',
       'Wins', 'Top 10', 'Average SG Putts', 'Average SG Total', 'SG:OTT',
       'SG:APR', 'SG:ARG', 'Money'],
      dtype='object')

In [6]:
df.head().T

Unnamed: 0,0,1,2,3,4
Player Name,Henrik Stenson,Ryan Armour,Chez Reavie,Ryan Moore,Brian Stuard
Rounds,60,109,93,78,103
Fairway Percentage,75.19,73.58,72.24,71.94,71.44
Year,2018,2018,2018,2018,2018
Avg Distance,291.5,283.5,286.5,289.2,278.9
gir,73.51,68.22,68.67,68.8,67.12
Average Putts,29.93,29.31,29.12,29.17,29.11
Average Scrambling,60.67,60.13,62.27,64.16,59.23
Average Score,69.617,70.758,70.432,70.015,71.038
Points,868,1006,1020,795,421


In [7]:
# Rename 'Player Name' and 'Top 10' so the column names contain no spaces
df.rename(columns={'Player Name': 'Player', 'Top 10': 'Top_10'}, inplace=True)
df.head().T

Unnamed: 0,0,1,2,3,4
Player,Henrik Stenson,Ryan Armour,Chez Reavie,Ryan Moore,Brian Stuard
Rounds,60,109,93,78,103
Fairway Percentage,75.19,73.58,72.24,71.94,71.44
Year,2018,2018,2018,2018,2018
Avg Distance,291.5,283.5,286.5,289.2,278.9
gir,73.51,68.22,68.67,68.8,67.12
Average Putts,29.93,29.31,29.12,29.17,29.11
Average Scrambling,60.67,60.13,62.27,64.16,59.23
Average Score,69.617,70.758,70.432,70.015,71.038
Points,868,1006,1020,795,421


In [8]:
# Look at one player
df[df.Player == "Dustin Johnson"]

Unnamed: 0,Player,Rounds,Fairway Percentage,Year,Avg Distance,gir,Average Putts,Average Scrambling,Average Score,Points,Wins,Top_10,Average SG Putts,Average SG Total,SG:OTT,SG:APR,SG:ARG,Money
124,Dustin Johnson,77.0,59.46,2018,314.0,70.57,28.47,62.5,68.698,2717,3.0,10.0,0.385,2.372,0.919,0.829,0.238,"$8,457,352"
336,Dustin Johnson,77.0,56.44,2017,314.4,69.61,29.0,62.63,69.549,2466,3.0,7.0,0.019,1.972,1.071,0.67,0.121,"$8,732,193"
520,Dustin Johnson,87.0,57.17,2016,313.6,67.82,28.49,59.58,69.172,2701,2.0,12.0,0.328,1.993,1.117,0.477,0.07,"$9,365,185"
732,Dustin Johnson,75.0,55.53,2015,317.7,67.05,28.47,57.85,69.585,1718,1.0,8.0,0.128,1.455,0.96,0.579,-0.212,"$5,509,467"
898,Dustin Johnson,58.0,57.18,2014,311.0,68.03,28.74,59.76,69.546,1769,1.0,7.0,0.082,1.331,0.73,0.533,-0.015,"$4,249,180"
1099,Dustin Johnson,71.0,53.36,2013,305.8,66.75,29.4,53.22,70.115,1226,1.0,5.0,-0.101,0.805,0.332,0.663,-0.09,"$2,963,214"
1266,Dustin Johnson,70.0,56.3,2012,310.2,65.75,28.64,60.29,69.564,1097,1.0,5.0,0.185,1.509,0.72,0.638,-0.035,"$3,393,820"
1446,Dustin Johnson,72.0,57.17,2011,314.2,68.39,29.56,51.24,70.457,1191,,5.0,-0.549,0.576,0.912,0.195,0.019,"$4,309,961"
1657,Dustin Johnson,83.0,56.35,2010,308.5,67.95,29.37,55.81,70.135,1362,1.0,5.0,0.112,0.979,0.76,0.288,-0.179,4473122


In [9]:
# Sort DataFrame in date order
df.sort_values(by=["Year"], inplace=True, ascending=True)
df.Year.head(20)

2311    2010
1607    2010
1608    2010
1609    2010
1610    2010
1611    2010
1612    2010
1613    2010
1614    2010
1615    2010
1606    2010
1616    2010
1618    2010
1619    2010
1620    2010
1621    2010
1622    2010
1623    2010
1624    2010
1625    2010
Name: Year, dtype: int64

### Make a copy of the original dataframe
That way, when we manipulate the copy, we still have the original data.

In [10]:
# Make a copy of the original dataframe to perform edits on.
df_tmp = df.copy()

In [11]:
# Check the values of different columns
df_tmp.Points.value_counts()

1        25
2        23
3        15
4        15
9        15
         ..
451       1
795       1
1,575     1
774       1
1,846     1
Name: Points, Length: 1039, dtype: int64

In [12]:
df_tmp.head()

Unnamed: 0,Player,Rounds,Fairway Percentage,Year,Avg Distance,gir,Average Putts,Average Scrambling,Average Score,Points,Wins,Top_10,Average SG Putts,Average SG Total,SG:OTT,SG:APR,SG:ARG,Money
2311,"Jim Gallagher, Jr.",,,2010,,,,,,,,,,,,,,6552
1607,Freddie Jacobson,83.0,62.16,2010,283.1,67.8,28.93,63.2,70.367,826.0,,3.0,0.241,0.722,-0.064,0.302,0.241,1666252
1608,Matt Every,60.0,62.06,2010,290.9,64.75,28.93,57.61,70.986,322.0,,1.0,0.24,-0.073,0.166,-0.292,-0.185,456847
1609,Steve Marino,81.0,61.96,2010,290.6,67.49,29.53,56.33,70.875,636.0,,2.0,0.261,0.567,0.27,-0.016,0.053,1479239
1610,Steve Lowery,67.0,61.9,2010,286.2,65.52,30.03,50.38,72.202,67.0,,,-0.386,-1.177,0.122,-0.31,-0.607,118602


In [13]:
len(df_tmp)

2312

## Convert the data into numbers

In [14]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2312 entries, 2311 to 0
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player              2312 non-null   object 
 1   Rounds              1678 non-null   float64
 2   Fairway Percentage  1678 non-null   float64
 3   Year                2312 non-null   int64  
 4   Avg Distance        1678 non-null   float64
 5   gir                 1678 non-null   float64
 6   Average Putts       1678 non-null   float64
 7   Average Scrambling  1678 non-null   float64
 8   Average Score       1678 non-null   float64
 9   Points              2296 non-null   object 
 10  Wins                293 non-null    float64
 11  Top_10              1458 non-null   float64
 12  Average SG Putts    1678 non-null   float64
 13  Average SG Total    1678 non-null   float64
 14  SG:OTT              1678 non-null   float64
 15  SG:APR              1678 non-null   float64
 16  SG:ARG

In [15]:
df_tmp["Player"].dtype

dtype('O')

In [16]:
df_tmp.isna().sum()

Player                   0
Rounds                 634
Fairway Percentage     634
Year                     0
Avg Distance           634
gir                    634
Average Putts          634
Average Scrambling     634
Average Score          634
Points                  16
Wins                  2019
Top_10                 854
Average SG Putts       634
Average SG Total       634
SG:OTT                 634
SG:APR                 634
SG:ARG                 634
Money                   12
dtype: int64

### Convert strings into categories
One way to turn all of our data into numbers is to convert the string columns into pandas categories.

In [17]:
df_tmp.head().T

Unnamed: 0,2311,1607,1608,1609,1610
Player,"Jim Gallagher, Jr.",Freddie Jacobson,Matt Every,Steve Marino,Steve Lowery
Rounds,,83,60,81,67
Fairway Percentage,,62.16,62.06,61.96,61.9
Year,2010,2010,2010,2010,2010
Avg Distance,,283.1,290.9,290.6,286.2
gir,,67.8,64.75,67.49,65.52
Average Putts,,28.93,28.93,29.53,30.03
Average Scrambling,,63.2,57.61,56.33,50.38
Average Score,,70.367,70.986,70.875,72.202
Points,,826,322,636,67


In [18]:
pd.api.types.is_string_dtype(df_tmp["Points"])

True

In [19]:
# Find the columns which contain strings
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

Player
Points
Money
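`Points` and `Money` show up as string columns because some entries carry formatting such as `1,575` or `$8,457,352`. As an alternative to encoding them as categories, they could be parsed into real numbers so that their magnitude is preserved. The following is a minimal sketch on a toy frame; the helper name `to_number` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame using the same kinds of formatting seen in the outputs above
toy = pd.DataFrame({
    "Points": ["1,575", "868", None],
    "Money": ["$8,457,352", "4473122", None],
})

def to_number(series):
    """Hypothetical helper: strip '$' and ',' and convert to float.

    Entries that cannot be parsed (including missing values) become NaN.
    """
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

toy["Points_num"] = to_number(toy["Points"])
toy["Money_num"] = to_number(toy["Money"])
print(toy)
```

Parsed this way, the two columns would be treated as numeric in the later cells and filled like the other numeric columns.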


In [20]:
# Turn every string column into an ordered pandas category
for label, content in df_tmp.items():
    if pd.api.types.is_string_dtype(content):
        df_tmp[label] = content.astype("category").cat.as_ordered()

In [21]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2312 entries, 2311 to 0
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Player              2312 non-null   category
 1   Rounds              1678 non-null   float64 
 2   Fairway Percentage  1678 non-null   float64 
 3   Year                2312 non-null   int64   
 4   Avg Distance        1678 non-null   float64 
 5   gir                 1678 non-null   float64 
 6   Average Putts       1678 non-null   float64 
 7   Average Scrambling  1678 non-null   float64 
 8   Average Score       1678 non-null   float64 
 9   Points              2296 non-null   category
 10  Wins                293 non-null    float64 
 11  Top_10              1458 non-null   float64 
 12  Average SG Putts    1678 non-null   float64 
 13  Average SG Total    1678 non-null   float64 
 14  SG:OTT              1678 non-null   float64 
 15  SG:APR              1678 non-null   fl

In [22]:
df_tmp.Player.cat.categories

Index(['Aaron Baddeley', 'Aaron Watkins', 'Aaron Wise', 'Abraham Ancer',
       'Adam Hadwin', 'Adam Schenk', 'Adam Scott', 'Alex Aragon', 'Alex Cejka',
       'Alex Noren',
       ...
       'Woody Austin', 'Xander Schauffele', 'Xinjun Zhang', 'Y.E. Yang',
       'Zac Blair', 'Zach Johnson', 'Zack Miller', 'Zack Sucher',
       'Zecheng Dou', 'Ángel Cabrera'],
      dtype='object', length=526)

In [23]:
df_tmp.Player.cat.codes

2311    233
1607    169
1608    324
1609    456
1610    455
       ... 
166     117
165     324
164     365
179     484
0       194
Length: 2312, dtype: int16
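To make the relationship between categories and codes concrete, here is a small sketch on a toy Series (the names are sample values, not a claim about the dataset's row order). Each code is the position of the label in the sorted list of categories, and a missing value receives the code -1.

```python
import pandas as pd

names = pd.Series(["Zach Johnson", "Adam Scott", "Zach Johnson", None])

# Same conversion as in the notebook: string -> ordered category
cat = names.astype("category").cat.as_ordered()

categories = list(cat.cat.categories)  # sorted unique labels
codes = cat.cat.codes.tolist()         # index into `categories`; -1 marks missing

print(categories)
print(codes)
```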

Thanks to pandas categories, we now have a way to access all of our data in the form of numbers.

## But we still have a lot of missing data

In [24]:
# Check the missing data
df_tmp.isnull().sum()/len(df_tmp)

Player                0.000000
Rounds                0.274221
Fairway Percentage    0.274221
Year                  0.000000
Avg Distance          0.274221
gir                   0.274221
Average Putts         0.274221
Average Scrambling    0.274221
Average Score         0.274221
Points                0.006920
Wins                  0.873270
Top_10                0.369377
Average SG Putts      0.274221
Average SG Total      0.274221
SG:OTT                0.274221
SG:APR                0.274221
SG:ARG                0.274221
Money                 0.005190
dtype: float64

## Save the preprocessed data

In [25]:
# Export the current tmp dataframe
# df_tmp.to_csv("Desktop/pga-project-folder",
#             index=False)

# Import the preprocessed data
# df_tmp = pd.read_csv("Desktop/pga-project-folder",
#                      low_memory=False)

# Set up environment when beginning

In [26]:
df_tmp.head().T

Unnamed: 0,2311,1607,1608,1609,1610
Player,"Jim Gallagher, Jr.",Freddie Jacobson,Matt Every,Steve Marino,Steve Lowery
Rounds,,83,60,81,67
Fairway Percentage,,62.16,62.06,61.96,61.9
Year,2010,2010,2010,2010,2010
Avg Distance,,283.1,290.9,290.6,286.2
gir,,67.8,64.75,67.49,65.52
Average Putts,,28.93,28.93,29.53,30.03
Average Scrambling,,63.2,57.61,56.33,50.38
Average Score,,70.367,70.986,70.875,72.202
Points,,826,322,636,67


In [27]:
df.isna().sum()

Player                   0
Rounds                 634
Fairway Percentage     634
Year                     0
Avg Distance           634
gir                    634
Average Putts          634
Average Scrambling     634
Average Score          634
Points                  16
Wins                  2019
Top_10                 854
Average SG Putts       634
Average SG Total       634
SG:OTT                 634
SG:APR                 634
SG:ARG                 634
Money                   12
dtype: int64

## Fill missing values
### Fill numerical missing values first

In [28]:
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

Rounds
Fairway Percentage
Year
Avg Distance
gir
Average Putts
Average Scrambling
Average Score
Wins
Top_10
Average SG Putts
Average SG Total
SG:OTT
SG:APR
SG:ARG


In [29]:
df_tmp.gir

2311      NaN
1607    67.80
1608    64.75
1609    67.49
1610    65.52
        ...  
166     62.48
165     64.59
164     69.04
179     67.08
0       73.51
Name: gir, Length: 2312, dtype: float64

In [30]:
# Check for which numeric columns have null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

Rounds
Fairway Percentage
Avg Distance
gir
Average Putts
Average Scrambling
Average Score
Wins
Top_10
Average SG Putts
Average SG Total
SG:OTT
SG:APR
SG:ARG


In [31]:
# Fill missing values in numeric columns with the column median
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            # Add a binary column which tells if the data is missing or not
            df_tmp[label+"_is_missing"] = pd.isnull(content)
            # Fill the missing numeric values with median
            df_tmp[label] = content.fillna(content.median())
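The effect of this loop is easier to see on a tiny example. The sketch below applies the same idea (a missing-value indicator plus a median fill) to a toy frame with made-up values; it iterates over a fixed list of column names so that adding the indicator columns does not interfere with the loop.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Wins": [1.0, np.nan, 3.0, np.nan],
    "Year": [2010, 2011, 2012, 2013],
})

for label in list(toy.columns):
    content = toy[label]
    if pd.api.types.is_numeric_dtype(content) and content.isna().any():
        # Remember which rows were originally missing
        toy[label + "_is_missing"] = content.isna()
        # Replace missing entries with the median of the observed values
        toy[label] = content.fillna(content.median())

print(toy)
```

Only `Wins` has gaps here, so it is the only column that gains an `_is_missing` companion; `Year` is left untouched.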

In [32]:
# Check whether any numeric columns still contain null values
for label, content in df_tmp.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [33]:
df_tmp.isna().sum()

Player                            0
Rounds                            0
Fairway Percentage                0
Year                              0
Avg Distance                      0
gir                               0
Average Putts                     0
Average Scrambling                0
Average Score                     0
Points                           16
Wins                              0
Top_10                            0
Average SG Putts                  0
Average SG Total                  0
SG:OTT                            0
SG:APR                            0
SG:ARG                            0
Money                            12
Rounds_is_missing                 0
Fairway Percentage_is_missing     0
Avg Distance_is_missing           0
gir_is_missing                    0
Average Putts_is_missing          0
Average Scrambling_is_missing     0
Average Score_is_missing          0
Wins_is_missing                   0
Top_10_is_missing                 0
Average SG Putts_is_missing 

### Filling and turning categorical variables into numbers

In [34]:
# Check for columns which aren't numeric
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        print(label)

Player
Points
Money


In [35]:
# Turn categorical variables into numbers and fill missing
for label, content in df_tmp.items():
    if not pd.api.types.is_numeric_dtype(content):
        # Add binary column to indicate whether sample had missing value
        df_tmp[label+"_is_missing"] = pd.isnull(content)
        # Turn categories into integer codes; adding 1 shifts missing values from -1 to 0
        df_tmp[label] = pd.Categorical(content).codes+1

In [36]:
pd.Categorical(df_tmp["Player"]).codes+1

array([234, 170, 325, ..., 366, 485, 195], dtype=int16)
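One consequence of encoding `Points` this way is worth illustrating: the categories are sorted as strings, so the resulting codes do not follow numeric order. The sketch below uses a few point totals written as strings (a toy Series, assumed for illustration) to show the effect, along with the code 0 that a missing value receives after the `+1` shift.

```python
import pandas as pd

points = pd.Series(["826", "322", "636", "67", None])

# Same encoding as in the notebook: category codes shifted by one
codes = (pd.Categorical(points).codes + 1).tolist()

# Categories sort lexicographically: '322' < '636' < '67' < '826'
print(dict(zip(points.tolist(), codes)))
```

Here `"67"` receives a larger code than `"636"`, even though 67 is the smaller number, which is why these codes act as arbitrary identifiers rather than as a measure of points earned.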

In [37]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2312 entries, 2311 to 0
Data columns (total 35 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Player                         2312 non-null   int16  
 1   Rounds                         2312 non-null   float64
 2   Fairway Percentage             2312 non-null   float64
 3   Year                           2312 non-null   int64  
 4   Avg Distance                   2312 non-null   float64
 5   gir                            2312 non-null   float64
 6   Average Putts                  2312 non-null   float64
 7   Average Scrambling             2312 non-null   float64
 8   Average Score                  2312 non-null   float64
 9   Points                         2312 non-null   int16  
 10  Wins                           2312 non-null   float64
 11  Top_10                         2312 non-null   float64
 12  Average SG Putts               2312 non-null   f

In [38]:
df_tmp.head().T

Unnamed: 0,2311,1607,1608,1609,1610
Player,234,170,325,457,456
Rounds,79.5,83,60,81,67
Fairway Percentage,61.43,62.16,62.06,61.96,61.9
Year,2010,2010,2010,2010,2010
Avg Distance,290.55,283.1,290.9,290.6,286.2
gir,65.79,67.8,64.75,67.49,65.52
Average Putts,29.14,28.93,28.93,29.53,30.03
Average Scrambling,58.275,63.2,57.61,56.33,50.38
Average Score,70.902,70.367,70.986,70.875,72.202
Points,0,929,461,768,798


In [39]:
df_tmp.isna().sum()

Player                           0
Rounds                           0
Fairway Percentage               0
Year                             0
Avg Distance                     0
gir                              0
Average Putts                    0
Average Scrambling               0
Average Score                    0
Points                           0
Wins                             0
Top_10                           0
Average SG Putts                 0
Average SG Total                 0
SG:OTT                           0
SG:APR                           0
SG:ARG                           0
Money                            0
Rounds_is_missing                0
Fairway Percentage_is_missing    0
Avg Distance_is_missing          0
gir_is_missing                   0
Average Putts_is_missing         0
Average Scrambling_is_missing    0
Average Score_is_missing         0
Wins_is_missing                  0
Top_10_is_missing                0
Average SG Putts_is_missing      0
Average SG Total_is_

In [43]:
# Combine Wins and Top_10 into a single target column, Best
df_tmp['Best'] = df_tmp['Wins'] + df_tmp['Top_10']
df_tmp.head().T

Unnamed: 0,2311,1607,1608,1609,1610
Player,234,170,325,457,456
Rounds,79.5,83,60,81,67
Fairway Percentage,61.43,62.16,62.06,61.96,61.9
Year,2010,2010,2010,2010,2010
Avg Distance,290.55,283.1,290.9,290.6,286.2
gir,65.79,67.8,64.75,67.49,65.52
Average Putts,29.14,28.93,28.93,29.53,30.03
Average Scrambling,58.275,63.2,57.61,56.33,50.38
Average Score,70.902,70.367,70.986,70.875,72.202
Points,0,929,461,768,798
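Because `Wins` and `Top_10` were filled with their medians earlier, a row with no recorded wins or top-10 finishes still receives a nonzero `Best`. If a missing entry is taken to mean "none that season" (an assumption about the dataset, not something the notebook confirms), filling with zero before summing may be closer to the intended target. A minimal comparison on made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Wins": [np.nan, 3.0, 1.0],
    "Top_10": [np.nan, 10.0, 2.0],
})

# What the notebook effectively does: median fill, then sum
best_median = (toy["Wins"].fillna(toy["Wins"].median())
               + toy["Top_10"].fillna(toy["Top_10"].median()))

# Alternative under the assumption that missing means zero
best_zero = toy["Wins"].fillna(0) + toy["Top_10"].fillna(0)

print(pd.DataFrame({"best_median": best_median, "best_zero": best_zero}))
```

The first row has no recorded results, yet the median-filled target credits it with the typical player's totals, while the zero-filled target leaves it at zero.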


In [44]:
len(df_tmp)

2312

## Splitting data into train & test sets

In [46]:
# Split data into X (features) and y (target)
# Note: X still contains Wins and Top_10, whose sum is exactly Best
X = df_tmp.drop("Best", axis=1)

y = df_tmp["Best"]
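Since `Best` is defined as `Wins + Top_10`, leaving those two columns in `X` lets a model reconstruct the target directly. One way to avoid that, sketched here on a toy frame with made-up values, is to drop the target's components (and their `_is_missing` indicators) from the features. The list name `leaky` is an assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "gir": [67.8, 64.75, 67.49],
    "Wins": [1.0, 0.0, 2.0],
    "Top_10": [3.0, 1.0, 2.0],
    "Wins_is_missing": [False, True, False],
    "Top_10_is_missing": [False, False, False],
})
toy["Best"] = toy["Wins"] + toy["Top_10"]

# Columns that reveal the target and should not be used as features
leaky = ["Wins", "Top_10", "Wins_is_missing", "Top_10_is_missing"]

X = toy.drop(columns=["Best"] + leaky)
y = toy["Best"]

print(list(X.columns))
```

With these columns removed, the model has to rely on playing statistics such as `gir` rather than on the ingredients of the target itself.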

In [47]:
X.head().T

Unnamed: 0,2311,1607,1608,1609,1610
Player,234,170,325,457,456
Rounds,79.5,83,60,81,67
Fairway Percentage,61.43,62.16,62.06,61.96,61.9
Year,2010,2010,2010,2010,2010
Avg Distance,290.55,283.1,290.9,290.6,286.2
gir,65.79,67.8,64.75,67.49,65.52
Average Putts,29.14,28.93,28.93,29.53,30.03
Average Scrambling,58.275,63.2,57.61,56.33,50.38
Average Score,70.902,70.367,70.986,70.875,72.202
Points,0,929,461,768,798


## Modelling

In [53]:
# Let's build a machine learning model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestRegressor

In [54]:
# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
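The split above is random across all seasons. Because the data was sorted by `Year` earlier, another option is to hold out the most recent season, which mimics predicting a future year from past ones. This is a sketch of that alternative on a toy frame with made-up values, not what the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2010, 2011, 2017, 2018, 2018],
    "gir": [67.8, 64.75, 69.61, 70.57, 68.22],
    "Best": [4.0, 2.0, 10.0, 13.0, 3.0],
})

# Train on every season before 2018, test on 2018 only
train = toy[toy["Year"] < 2018]
test = toy[toy["Year"] == 2018]

X_train, y_train = train.drop(columns="Best"), train["Best"]
X_test, y_test = test.drop(columns="Best"), test["Best"]

print(len(X_train), len(X_test))
```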

In [56]:
%%time
# Instantiate model
model = RandomForestRegressor(n_jobs=-1,
                              random_state=42)

# Fit the model
model.fit(X_train, y_train)

Wall time: 759 ms


RandomForestRegressor(n_jobs=-1, random_state=42)

In [57]:
# Score the model on the training data (R^2 for a regressor)
model.score(X_train, y_train)

0.997947941627134

## Check the test set

In [59]:
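# Caution: this refits the model on the test set, so the score in the next
# cell measures fit on data the model has just seen, not held-out performance
model.fit(X_test, y_test)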

RandomForestRegressor(n_jobs=-1, random_state=42)

In [60]:
model.score(X_test, y_test)

0.9978104736728195
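For a held-out estimate, the model is usually fitted on the training set only and then scored on the test set without refitting. The sketch below shows that pattern on synthetic data (the arrays and the relationship between them are invented for illustration), reporting R^2 and mean absolute error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the PGA features and target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)          # fit on training data only

preds = model.predict(X_test)        # the test set is never used for fitting
test_r2 = r2_score(y_test, preds)    # same value as model.score(X_test, y_test)
test_mae = mean_absolute_error(y_test, preds)

print(f"test R^2: {test_r2:.3f}, test MAE: {test_mae:.3f}")
```

A large gap between the training score and this test score would point to overfitting, which the refit-then-score pattern cannot reveal.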

In [70]:
df_tmp["Best"].head(30)

2311    3.0
1607    4.0
1608    2.0
1609    3.0
1610    3.0
1611    3.0
1612    7.0
1613    4.0
1614    3.0
1615    3.0
1606    3.0
1616    3.0
1618    6.0
1619    6.0
1620    2.0
1621    2.0
1622    4.0
1623    2.0
1624    3.0
1625    2.0
1626    3.0
1617    3.0
1627    3.0
1605    6.0
1603    2.0
1583    2.0
1584    2.0
1585    3.0
1586    4.0
1587    3.0
Name: Best, dtype: float64

In [71]:
df_tmp["Player"].head(30)

2311    234
1607    170
1608    325
1609    457
1610    456
1611    228
1612     91
1613    103
1614    258
1615    186
1606    312
1616     20
1618    128
1619    368
1620    345
1621    222
1622     39
1623    172
1624    471
1625    506
1626    332
1617     48
1627    496
1605    303
1603    404
1583    336
1584    334
1585    214
1586    338
1587    505
Name: Player, dtype: int16
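The `Player` column now holds integer codes, so answering "who is the best?" requires mapping the codes back to names. The sketch below shows one way to do that on a toy frame: keep the categories when encoding, build a code-to-name lookup, and rank players by their summed `Best`. The `Best` values and the names `code_to_name` and `ranking` are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "Player": ["Dustin Johnson", "Ryan Moore", "Dustin Johnson", "Chez Reavie"],
    "Best": [13.0, 4.0, 10.0, 5.0],
})

# Encode as in the notebook, but keep the Categorical so names can be recovered
player_cat = pd.Categorical(raw["Player"])
raw["Player_code"] = player_cat.codes + 1

# Codes start at 1 because of the +1 shift
code_to_name = dict(enumerate(player_cat.categories, start=1))

# Sum Best per player across seasons and sort from highest to lowest
totals = raw.groupby("Player_code")["Best"].sum().sort_values(ascending=False)
ranking = totals.rename(index=code_to_name)

print(ranking)
```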