# Final Project - What is the Best Indicator of Success on the PGA Tour?
Max Guryan
DS 2023
## Notebook 1: Data Download and Read
Before we begin, it is important to set up a virtual environment with the necessary packages. This can be done by running the following command in your terminal. Make sure that your working directory is set to `final_project` before running it. (Copy and paste the command as a single line.)

```bash
python -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt
```

Once this environment is set up, make sure you select the `.venv` kernel in Jupyter Notebook to ensure that the correct packages are being used.

### Import Packages
Run the cell below to import the necessary packages for this notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Data Download
The data used in this analysis comes from the PGA TOUR's website. Unfortunately, the data is only available for download as individual CSV files, one per stat per year. I have written a series of Python scripts to download these files and merge them into a master dataset. The scripts can be found in the `final_project/src` directory. To save time, I have already run them and saved the final merged dataset as `master_player_seasons.csv` in the `final_project/data/processed` directory.

**Note:** The three scripts used to download and merge the dataset are from my DS2022 final project. They have been modified to fit the needs of this analysis.
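
The scripts themselves are not reproduced in this notebook. As a rough illustration of the merge step, here is a minimal sketch of how per-stat tables for one season might be combined with an outer join on the player name. The frames, player names, and the `merge_stats` helper are all hypothetical; only the idea of an outer join comes from the build log further down, which reports more merged players than any single stat contains.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-stat tables for one season (made-up players and values).
sg_total = pd.DataFrame(
    {"player_name": ["Player A", "Player B"], "sg_total": [1.2, -0.3]}
)
money = pd.DataFrame(
    {
        "player_name": ["Player A", "Player B", "Player C"],
        "money_earned": [3_000_000.0, 800_000.0, 50_000.0],
    }
)
rank = pd.DataFrame(
    {"player_name": ["Player A", "Player D"], "final_season_rank": [9.0, 140.0]}
)


def merge_stats(frames):
    """Outer-join a list of per-stat frames on player_name (hypothetical helper)."""
    return reduce(
        lambda left, right: left.merge(right, on="player_name", how="outer"),
        frames,
    )


merged = merge_stats([sg_total, money, rank])
merged.insert(0, "year", 2007)  # tag every row with its season
print(merged)
```

Because the join is an outer join, a player who appears in only one table still gets a row, with `NaN` in the columns they are missing from. That is one way the missing values discussed later in this notebook can arise.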

The external link to the PGA TOUR's stats page is: [PGA TOUR Stats](https://www.pgatour.com/stats)

The API endpoint used to download the data is:
```
    "https://www.pgatour.com/api/stats-download?timePeriod=THROUGH_EVENT&tourCode=R&statsId={stat_id}&year={year}"
```
`stat_id` indicates which stat is being downloaded, and `year` indicates which year the data is from.

Here is a dictionary of each stat and its corresponding `stat_id`:
```python
STAT_IDS = {
    "sg_total": "02675", # Strokes Gained: Total
    "sg_ott": "02567", # Strokes Gained: Off the Tee
    "sg_app": "02568", # Strokes Gained: Approaching the Green
    "sg_arg": "02569", # Strokes Gained: Around the Green
    "sg_putt": "02564", # Strokes Gained: Putting

    # Traditional stats
    "driving_distance": "101",
    "driving_accuracy": "102",
    "greens_in_regulation": "103",
    "scoring_average": "120",

    # Success metrics
    "money_earned": "109",   # Money Leaders
    "fedex_rank": "02671",   # FedEx Cup points-style ranking
}
```
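
To make the endpoint and the dictionary concrete, here is a small sketch of how one URL per stat and year could be assembled. The `build_url` and `build_jobs` helpers are hypothetical and are not the code in `src/download_stats.py`; the output path pattern simply mirrors the `data/raw/...` paths in the download log below. Nothing is actually downloaded here.

```python
from itertools import product

# Endpoint template copied from the section above.
BASE_URL = (
    "https://www.pgatour.com/api/stats-download"
    "?timePeriod=THROUGH_EVENT&tourCode=R&statsId={stat_id}&year={year}"
)

# Two entries from STAT_IDS, repeated so the example is self-contained.
EXAMPLE_STAT_IDS = {"sg_total": "02675", "driving_distance": "101"}


def build_url(stat_id, year):
    """Fill the endpoint template for one stat and one season."""
    return BASE_URL.format(stat_id=stat_id, year=year)


def build_jobs(stat_ids, years):
    """Return (stat_name, year, url, output_path) for every stat/year pair."""
    jobs = []
    for (name, stat_id), year in product(stat_ids.items(), years):
        path = f"data/raw/{name}_{year}.csv"  # assumed naming, as in the log
        jobs.append((name, year, build_url(stat_id, year), path))
    return jobs


jobs = build_jobs(EXAMPLE_STAT_IDS, range(2007, 2009))
for name, year, url, path in jobs:
    print(f"{name} {year}: {url} -> {path}")
```

Note that the stat IDs are kept as strings so that leading zeros (for example `02675`) survive formatting.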

### How the PGA TOUR Collects the Data:
The PGA TOUR records every shot taken by every golfer in every tournament during a season, using a combination of automated tracking and manual entry. Its shot tracking technology, ShotLink, collects data on shot distances, accuracy, proximity, and other factors that contribute to a golfer's performance on the course. These shot-level records are then used to calculate the statistics in this analysis, such as the strokes gained metrics, greens in regulation, and scoring average. FedEx ranking and money earned are based on a golfer's placement in tournaments throughout the season. All of this data is compiled and made publicly available to inspect and download through the PGA TOUR website.

### --- IGNORE BELOW ---
Do not run the cells below. They are only for reference.


In [6]:
%run src/download_stats.py

Downloading sg_total (02675) for 2007
  URL: https://www.pgatour.com/api/stats-download?timePeriod=THROUGH_EVENT&tourCode=R&statsId=02675&year=2007
  Saved to data/raw/sg_total_2007.csv

Downloading sg_ott (02567) for 2007
  URL: https://www.pgatour.com/api/stats-download?timePeriod=THROUGH_EVENT&tourCode=R&statsId=02567&year=2007
  Saved to data/raw/sg_ott_2007.csv

Downloading sg_app (02568) for 2007
  URL: https://www.pgatour.com/api/stats-download?timePeriod=THROUGH_EVENT&tourCode=R&statsId=02568&year=2007
  Saved to data/raw/sg_app_2007.csv

Downloading sg_arg (02569) for 2007
  URL: https://www.pgatour.com/api/stats-download?timePeriod=THROUGH_EVENT&tourCode=R&statsId=02569&year=2007
  Saved to data/raw/sg_arg_2007.csv

Downloading sg_putt (02564) for 2007
  URL: https://www.pgatour.com/api/stats-download?timePeriod=THROUGH_EVENT&tourCode=R&statsId=02564&year=2007
  Saved to data/raw/sg_putt_2007.csv

Downloading driving_distance (101) for 2007
  URL: https://www.pgatour.com/api/

In [10]:
%run src/parse_stats.py

Wrote intermediate file: data/intermediate/driving_accuracy_2007.csv (196 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2008.csv (196 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2009.csv (184 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2010.csv (192 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2011.csv (186 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2012.csv (191 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2013.csv (180 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2014.csv (177 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2015.csv (184 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2016.csv (185 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2017.csv (190 rows)
Wrote intermediate file: data/intermediate/driving_accuracy_2018.csv (193 rows)
Wrote intermediate file: data/intermedia

In [11]:
%run src/build_master.py

Building master dataset for years: [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025]
  -> Merging stats for 2007
    Loaded 11 stats: sg_total, sg_ott, sg_app, sg_arg, sg_putt, driving_distance, driving_accuracy, greens_in_regulation, scoring_average, money_earned, fedex_rank
    Player counts per stat: {'sg_total': 196, 'sg_ott': 196, 'sg_app': 196, 'sg_arg': 196, 'sg_putt': 196, 'driving_distance': 196, 'driving_accuracy': 196, 'greens_in_regulation': 196, 'scoring_average': 196, 'money_earned': 256, 'fedex_rank': 253}
    Final merged dataset: 257 players (may be more due to outer join)
     Saved data/processed/master_2007.csv (257 rows)
  -> Merging stats for 2008
    Loaded 11 stats: sg_total, sg_ott, sg_app, sg_arg, sg_putt, driving_distance, driving_accuracy, greens_in_regulation, scoring_average, money_earned, fedex_rank
    Player counts per stat: {'sg_total': 196, 'sg_ott': 196, 'sg_app': 196, 'sg_arg': 196, 's

### --- IGNORE ABOVE ---

### Data Reading
Run the cell below to read in the processed data.

In [4]:
PGA = pd.read_csv('data/processed/master_player_seasons.csv')

### Definition of Each Metric

The PGA TOUR defines Strokes Gained (SG) as a statistic that measures a golfer's performance on a specific shot by comparing it to the average number of strokes the field needs to complete the hole from that same position. A positive SG value indicates that the golfer needed fewer strokes than the field average from that position (performing better), while a negative SG value indicates that the golfer needed more strokes than the field average (performing worse). SG can therefore be used to evaluate a golfer's performance against the field for a given shot type. It is broken down into the following sub-metrics:

- Strokes Gained Off-the-Tee (SG: OTT): Measures a golfer's tee shot performance on Par 4s and Par 5s by comparing how far they advance the ball against the field average from that distance and lie. It reflects how much their driving helps or hurts their scoring opportunities compared to the field average.
- Strokes Gained on Approach (SG: APP): Measures a golfer's performance on shots from the fairway or rough that are more than about 30 yards from the green, as well as tee shots on Par 3s. It reflects how much their approach play into greens helps or hurts their scoring opportunities compared to the field average.
- Strokes Gained Around-the-Green (SG: ARG): Measures shots within about 30 yards of the green but not on the putting surface, for example chipping, pitching, and bunker shots. It reflects how much their short game helps or hurts their scoring opportunities compared to the field average.
- Strokes Gained Putting (SG: PUTT): Compares a golfer's putting performance, given the initial distance of each putt from the hole, against the field average for those same putts. It reflects how much their putting helps or hurts their scoring opportunities compared to the field average.
- Strokes Gained Tee-to-Green (SG: T2G): Covers all shots from the tee box until the ball is on the green. This is the sum of SG: OTT, SG: APP, and SG: ARG. It reflects how much everything except putting (driving, approach, and short game) helps or hurts their scoring opportunities compared to the field average.
- Strokes Gained Total (SG: TOT): Covers all shots, including putting. This is the sum of SG: T2G and SG: PUTT. It reflects how much their overall game helps or hurts their scoring opportunities compared to the field average.
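
The two sums above can be checked by hand. The sketch below uses the published values for two 2007 player-seasons, as they appear in the first rows of the cleaned dataset later in this notebook. The helper names are hypothetical, and because the published figures are rounded to three decimals, the comparison uses a small tolerance. Whether the `sg_tee_to_green` column was downloaded or computed by the build script is not shown here.

```python
import math

# Two 2007 player-seasons, values as displayed in the cleaned dataset.
rows = [
    {
        "player_name": "Aaron Baddeley",
        "sg_off_the_tee": 0.152, "sg_approach": -0.252,
        "sg_around_green": 0.535, "sg_putting": 0.629,
        "sg_tee_to_green": 0.435, "sg_total": 1.064,
    },
    {
        "player_name": "Adam Scott",
        "sg_off_the_tee": 0.478, "sg_approach": 0.708,
        "sg_around_green": -0.081, "sg_putting": 0.129,
        "sg_tee_to_green": 1.105, "sg_total": 1.234,
    },
]


def tee_to_green(row):
    """SG: T2G = SG: OTT + SG: APP + SG: ARG."""
    return row["sg_off_the_tee"] + row["sg_approach"] + row["sg_around_green"]


def total(row):
    """SG: TOT = SG: T2G + SG: PUTT."""
    return tee_to_green(row) + row["sg_putting"]


TOL = 0.002  # allow for rounding in the published three-decimal values
for row in rows:
    t2g_ok = math.isclose(tee_to_green(row), row["sg_tee_to_green"], abs_tol=TOL)
    tot_ok = math.isclose(total(row), row["sg_total"], abs_tol=TOL)
    print(row["player_name"], "T2G matches:", t2g_ok, "| Total matches:", tot_ok)
```

Both players have positive SG: Total and finished the season ranked 9th and 12th respectively, which is consistent with positive values meaning better-than-field performance.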

Additional Statistics:
- Driving Distance: Average distance (in yards) a golfer hits the ball off the tee on Par 4s and Par 5s.
- Driving Accuracy: Percentage of a golfer's tee shots on Par 4s and Par 5s that land in the fairway.
- Greens in Regulation: Percentage of holes on which a golfer reaches the green in at least two strokes fewer than par.
- Scoring Average: Average number of strokes a golfer takes per round. (Success Metric)
- Money Earned: Total prize money a golfer has earned in a season (in USD). (Success Metric)
- FedEx Rank: A golfer's position in the points-based ranking used to determine the top players on the PGA TOUR throughout a given season. (Success Metric)

### Data Features
Run the following cells to see the features of the dataset.

#### The dataset is made up of the following columns:
- year: The year of the season (integer)
- player_name: The name of the player (string)
- sg_total: Strokes Gained: Total (float)
- sg_off_the_tee: Strokes Gained: Off the Tee (float)
- sg_approach: Strokes Gained: Approach the Green (float)
- sg_around_green: Strokes Gained: Around the Green (float)
- sg_putting: Strokes Gained: Putting (float)
- driving_distance: Average Driving Distance in yards (float)
- driving_accuracy: Driving Accuracy Percentage (float)
- greens_in_regulation: Greens in Regulation Percentage (float)
- scoring_average: Scoring Average (float)
- money_earned: Total Money Earned in USD (float)
- final_season_rank: FedEx Cup Ranking at the end of the season (float)
- sg_tee_to_green: Strokes Gained: Tee to Green (float)


In [40]:
COLS = pd.DataFrame(index=PGA.columns)
COLS.index.name = 'col_id'
COLS['dtypes'] = PGA.dtypes
COLS['n_unique'] = PGA.nunique()
COLS["tot_observations"] = len(PGA)
COLS["na_count"] = PGA.isna().sum()

COLS

col_id,dtypes,n_unique,tot_observations,na_count
year,int64,19,4839,0
player_name,object,806,4839,0
sg_total,float64,1917,4839,1258
sg_off_the_tee,float64,1323,4839,1258
sg_approach,float64,1364,4839,1258
sg_around_green,float64,958,4839,1258
sg_putting,float64,1306,4839,1258
driving_distance,float64,484,4839,1258
driving_accuracy,float64,1671,4839,1258
greens_in_regulation,float64,1061,4839,1258


As shown in the `na_count` column of the `COLS` DataFrame, there are missing values in the dataset. These missing values occur when a player does not play enough rounds in a season to qualify for the advanced stats, never makes the cut at a tournament (a player must make the cut in order to get paid), or plays in a tournament on a sponsor's exemption, in which case they could have won money but have no other stats to track. To ensure the integrity of the analysis and create a clean dataset, I will remove any rows with missing values.

In [42]:
PGA_clean = PGA.copy().dropna()
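
In the real data, this step takes the dataset from 4839 rows to 3218, so 1621 player-seasons are removed. To see where such rows come from, it can help to compare row counts per season before and after dropping. The sketch below does this on a small made-up frame; `toy` and `dropna_report` are hypothetical names, and with the real data `PGA` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Made-up player-seasons with gaps in different columns.
toy = pd.DataFrame(
    {
        "year": [2007, 2007, 2007, 2008, 2008],
        "player_name": ["Player A", "Player B", "Player C", "Player A", "Player D"],
        "sg_total": [1.1, np.nan, -0.2, 0.4, np.nan],
        "money_earned": [3_000_000.0, 40_000.0, 500_000.0, np.nan, 25_000.0],
    }
)


def dropna_report(df, by="year"):
    """Count rows per group before and after dropping incomplete rows."""
    before = df.groupby(by).size().rename("rows_before")
    after = df.dropna().groupby(by).size().rename("rows_after")
    # Groups that lose every row are absent from `after`, so fill them with 0.
    report = pd.concat([before, after], axis=1).fillna(0).astype(int)
    report["rows_dropped"] = report["rows_before"] - report["rows_after"]
    return report


report = dropna_report(toy)
print(report)
```

`dropna()` removes a row if any column is missing, so a player with money earned but no strokes gained stats is dropped just like a player with stats but no money.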

In [43]:
COLS_clean = pd.DataFrame(index=PGA_clean.columns)
COLS_clean.index.name = 'col_id'
COLS_clean['dtypes'] = PGA_clean.dtypes
COLS_clean['n_unique'] = PGA_clean.nunique()
COLS_clean["tot_observations"] = len(PGA_clean)
COLS_clean["na_count"] = PGA_clean.isna().sum()
COLS_clean

col_id,dtypes,n_unique,tot_observations,na_count
year,int64,19,3218,0
player_name,object,670,3218,0
sg_total,float64,1762,3218,0
sg_off_the_tee,float64,1251,3218,0
sg_approach,float64,1296,3218,0
sg_around_green,float64,925,3218,0
sg_putting,float64,1250,3218,0
driving_distance,float64,477,3218,0
driving_accuracy,float64,1591,3218,0
greens_in_regulation,float64,1018,3218,0


Now, we have a clean dataset, where each column has the same number of total observations and no missing values.

In [44]:
PGA_clean.head(20)

col_id,year,player_name,sg_total,sg_off_the_tee,sg_approach,sg_around_green,sg_putting,driving_distance,driving_accuracy,greens_in_regulation,scoring_average,money_earned,final_season_rank,sg_tee_to_green
0,2007,Aaron Baddeley,1.064,0.152,-0.252,0.535,0.629,291.9,60.0,60.35,70.088,3441119.0,9.0,0.435
1,2007,Adam Scott,1.234,0.478,0.708,-0.081,0.129,300.9,59.17,65.44,70.008,3413185.0,12.0,1.105
2,2007,Alex Cejka,0.728,0.257,0.609,0.34,-0.479,288.9,68.08,69.44,70.437,868303.0,129.0,1.206
3,2007,Anders Hansen,-0.089,-0.224,0.335,-0.023,-0.176,280.7,66.95,62.85,70.856,461216.0,138.0,0.088
4,2007,Andrew Buckle,-0.265,-0.223,-0.28,0.077,0.161,294.7,58.14,62.52,71.443,513630.0,141.0,-0.426
5,2007,Anthony Kim,0.673,0.578,0.2,0.016,-0.121,302.4,60.79,65.35,70.128,1545195.0,44.0,0.794
7,2007,Arron Oberholser,1.183,0.125,0.3,0.396,0.362,285.5,61.7,62.25,69.807,1797458.0,33.0,0.821
9,2007,Bart Bryant,0.638,0.18,0.253,0.008,0.198,281.1,70.66,66.34,70.637,1167874.0,80.0,0.441
12,2007,Ben Curtis,-0.749,-0.03,-0.554,0.023,-0.188,277.1,67.37,60.56,71.582,772321.0,115.0,-0.561
14,2007,Bill Haas,0.408,0.443,0.063,-0.081,-0.018,302.7,62.93,65.98,70.653,967443.0,127.0,0.425


In [11]:
PGA_clean_z.head()

col_id,year,player_name,sg_total,sg_off_the_tee,sg_approach,sg_around_green,sg_putting,driving_distance,driving_accuracy,greens_in_regulation,scoring_average,money_earned,final_season_rank,sg_tee_to_green,money_earned_log,money_earned_log_z
0,2007,Aaron Baddeley,1.270978,0.257487,-0.934173,2.264203,1.724075,-0.169482,-0.344816,-2.128946,-1.128322,3441119.0,9.0,0.416555,15.051308,1.096274
1,2007,Adam Scott,1.526412,1.181114,1.685539,-0.484133,0.261055,0.735368,-0.507903,-0.232152,-1.247124,3413185.0,12.0,1.506587,15.043157,1.087346
2,2007,Alex Cejka,0.766121,0.554974,1.415382,1.394194,-1.517978,-0.471099,1.242826,1.258452,-0.610048,868303.0,129.0,1.670906,13.674297,-0.411903
3,2007,Anders Hansen,-0.461464,-0.807801,0.667672,-0.225361,-0.631387,-1.295518,1.020792,-1.197318,0.012177,461216.0,138.0,-0.147984,13.041624,-1.104841
4,2007,Andrew Buckle,-0.725913,-0.804968,-1.010582,0.220797,0.354688,0.112027,-0.710288,-1.320293,0.883886,513630.0,141.0,-0.984218,13.14926,-0.986952
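
The `PGA_clean_z` frame shown above is not constructed in this notebook. Its columns suggest that the stat columns were standardized to z-scores and that `money_earned` was log-transformed and then standardized; the displayed `money_earned_log` values are consistent with a natural log of `money_earned`. Below is a minimal sketch of that kind of transformation under stated assumptions: z-scores are taken over the whole sample with the population standard deviation (`ddof=0`), which may differ from what was actually used (for example, standardizing within each season). The `zscore` and `standardize` helpers are hypothetical, and because the toy frame has only four rows, its z-scores will not match the output above.

```python
import numpy as np
import pandas as pd

# Four rows of driving distance and money earned from the cleaned dataset.
toy = pd.DataFrame(
    {
        "player_name": ["Aaron Baddeley", "Adam Scott", "Alex Cejka", "Anders Hansen"],
        "driving_distance": [291.9, 300.9, 288.9, 280.7],
        "money_earned": [3441119.0, 3413185.0, 868303.0, 461216.0],
    }
)


def zscore(series):
    """Standardize to mean 0 and standard deviation 1 (ddof=0 is an assumption)."""
    return (series - series.mean()) / series.std(ddof=0)


def standardize(df, stat_cols, money_col="money_earned"):
    """Z-score the stat columns and add log and log-z versions of money earned."""
    out = df.copy()
    for col in stat_cols:
        out[col] = zscore(out[col])
    out[f"{money_col}_log"] = np.log(out[money_col])  # natural log
    out[f"{money_col}_log_z"] = zscore(out[f"{money_col}_log"])
    return out


toy_z = standardize(toy, ["driving_distance"])
print(toy_z.round(3))
```

Taking the log before standardizing compresses the long right tail of prize money, so a handful of very high earners do not dominate the scale.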
