# CMSC320 Final Tutorial
By: Austin Brady

# Table of Contents  
&emsp;**0.** Introduction  
&emsp;**1.** Data Collection  
&emsp;**2.** Data Processing  
&emsp;**3.** Exploratory Analysis and Data Visualization

# 0. Introduction
Some say golf is a game of luck. While this may be true for the weekend player hacking it around his local municipal course, the players on the PGA Tour are a different beast. They bomb drives down the fairway with pinpoint accuracy and have beautiful short games to match. The question remains, though: what makes a player have a great season? In this tutorial, I answer that question by analyzing player rankings in a few key categories to determine which of them matter most.

# 1. Data Collection

This is the first stage of the data lifecycle. It consists of gathering data from whatever source best suits your data science project. Common sources include websites, publicly available datasets, and APIs (more on these later). The data normally comes in a few standard forms (we will only talk about three of them):  
&emsp;**1. CSV Files:** CSV stands for comma-separated values, and it is exactly what it sounds like. Each attribute (column) is separated by a comma and each observation (row) is separated by a newline. For more information, follow this link: https://www.howtogeek.com/348960/what-is-a-csv-file-and-how-do-i-open-it/  
&emsp;**2. JSON Files:** JSON is another standard data format. Besides the primitive values (strings, numbers, booleans, and null), there are two structures you will have to deal with: the object, called a JSON object, and the array. A JSON object is a collection of key-value pairs, where each key is a string naming whatever its value represents. An array, on the other hand, is an ordered container for values that are related in some way. For example, if a teacher were to store their students' names in a JSON object, it might look like this: {"students":["John","Suzy",...]}, where the ... stands for the rest of the array containing the other names. For more information on JSON, follow this link: https://www.copterlabs.com/json-what-it-is-how-it-works-how-to-use-it/  
&emsp;**3. HTML (this tutorial):** Many times, the data we gather needs to be extracted from an HTML table on a webpage. The structure of this data is not always as uniform as in the previous two forms, but HTML parsing libraries such as BeautifulSoup make the process much easier.
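
To make the first two formats concrete, here is a minimal sketch that parses a small CSV string and a small JSON string with Python's standard library. The sample text and the variable names are made up for illustration; they are not part of the PGA Tour data.

```python
import csv
import io
import json

# Hypothetical CSV text: one header row, then one observation per line
csv_text = "name,score\nJohn,71\nSuzy,68\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
# csv reads every field as a string, so numeric columns need converting
scores = [int(row["score"]) for row in rows]

# Hypothetical JSON text: an object whose "students" key maps to an array
json_text = '{"students": ["John", "Suzy"]}'
roster = json.loads(json_text)

print(rows[0]["name"], scores)
print(roster["students"])
```

In practice pandas can load both formats directly into a DataFrame, but the standard library version shows how little structure each format actually has.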

**Libraries needed for data collection stage:**    
&emsp;1. requests: Used to send GET requests for different web pages on the PGA Tour website and receive the responses  
&emsp;2. BeautifulSoup: Used to parse the HTML retrieved by the GET request  
&emsp;3. Pandas: Used to create data structures that organize the data for ease of analysis  
&emsp;4. Pickle (optional): Used to store a data structure in an external file for quicker subsequent retrievals (multiple GET requests take up time). As mentioned above, CSV would be another valid format to store the data in, but since my data will be split up into about 17 different DataFrames all stored in a dictionary, it is easier to store it all in one pickle file. In the case of a single DataFrame, CSV should definitely be one of a user's first instincts for faster retrieval.
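
As a rough illustration of why pickle is convenient here, the sketch below saves and reloads a dictionary in a single call each way. A dictionary of plain lists stands in for the dictionary of DataFrames, and the stat names, player names, and file name are all made up.

```python
import os
import pickle
import tempfile

# Stand-in for the dictionary of DataFrames: stat name -> list of rows (made-up values)
stats = {
    "EXAMPLE STAT A": [("Player One", 1), ("Player Two", 2)],
    "EXAMPLE STAT B": [("Player Two", 1), ("Player One", 2)],
}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stats.pk")
    with open(path, "wb") as fi:
        pickle.dump(stats, fi)
    with open(path, "rb") as fi:
        restored = pickle.load(fi)

# The whole nested structure comes back in one call
print(restored == stats)
```

One caveat worth keeping in mind: unpickling can execute arbitrary code, so only load pickle files you created yourself or otherwise trust.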

In [44]:
# libraries for this stage
import requests
from bs4 import BeautifulSoup
import pandas as pd
import pickle

In [101]:
pga_page_url = "http://www.pgatour.com"
basic_stat_url_format = "http://www.pgatour.com/stats/stat.%s.2018.html"
basic_category_url = "http://www.pgatour.com/stats/categories.%s.html"
categories = ['ROTT_INQ', 'RAPP_INQ', 'RARG_INQ', 'RPUT_INQ', 'RSCR_INQ', 'RMNY_INQ','RPTS_INQ']
pga_tour_stats_dict = {}
stat_ids = []
categories_i_care_about = ['SG: TEE-TO-GREEN',
'SG: OFF-THE-TEE','DRIVING DISTANCE','DRIVING ACCURACY PERCENTAGE',
'TOTAL DRIVING','BALL STRIKING','SG: APPROACH-THE-GREEN','GREENS IN REGULATION PERCENTAGE','SG: AROUND-THE-GREEN',
'SCRAMBLING','SG: PUTTING','SG: TOTAL','SCORING AVERAGE','OFFICIAL MONEY','TOP 10 FINISHES','FEDEXCUP SEASON POINTS','FEDEXCUP STANDINGS',
'OFFICIAL WORLD GOLF RANKING']

In [102]:
for category in categories:
    spec_category_url = basic_category_url % category
    r = requests.get(spec_category_url)
    soup = BeautifulSoup(r.content,'html.parser')
    for table in soup.find_all("div", class_="table-content"):
        for link in table.find_all("a"):
            if link.text.upper() in categories_i_care_about and link['href'].split('.')[1] not in stat_ids:
                stat_ids.append(link['href'].split('.')[1])
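
The loop above pulls each stat's identifier out of the link's `href` by splitting on periods. Here is a small offline sketch of that step; the `href` value is a hypothetical example in the shape the loop appears to expect, not one copied from the site.

```python
# Hypothetical href in the shape the loop expects: '/stats/stat.<id>.html'
href = "/stats/stat.02674.html"
parts = href.split(".")
stat_id = parts[1]

# Plug the identifier into the URL template used in this tutorial
basic_stat_url_format = "http://www.pgatour.com/stats/stat.%s.2018.html"
stat_url = basic_stat_url_format % stat_id

print(parts)
print(stat_url)
```

Note that this indexing only works for relative links. If an `href` were an absolute URL, index 1 of the split would be a piece of the hostname rather than the stat identifier.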

In [168]:
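from io import StringIO  # newer pandas versions expect a file-like object in read_html

for identifier in stat_ids:
    spec_stat_url = basic_stat_url_format % identifier
    r = requests.get(spec_stat_url)
    soup = BeautifulSoup(r.content,'html.parser')
    # Grab the stats table and let pandas parse it into a DataFrame
    table = soup.find('table',class_='table-styled')
    df = pd.read_html(StringIO(str(table)))[0]
    # The stat's name is taken from the first meta tag whose content starts with 'Stat'
    for stat in soup.find_all('meta'):
        content = stat.get('content', '')  # some meta tags have no content attribute
        if content.split(' ')[0] == 'Stat':
            stat_name = ' '.join(content.split(' ')[2:]).upper()
            pga_tour_stats_dict[stat_name] = df
            break
    pga_tour_stats_dict[stat_name] = pga_tour_stats_dict[stat_name].drop(columns=['RANK LAST WEEK'])
# Cache everything in one file so later runs can skip the web requests
with open('./pga_data.pk','wb') as fi:
    pickle.dump(pga_tour_stats_dict,fi)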
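
The stat name in the cell above is built by splitting a meta tag's `content` on spaces, dropping the first two tokens, and upper-casing the rest. The sketch below walks through that on a single string; the `content` value is a hypothetical example of the shape the loop looks for.

```python
# Hypothetical content string in the 'Stat <separator> <name>' shape the loop looks for
content = "Stat - Driving Distance"
words = content.split(" ")
if words[0] == "Stat":
    # Skip 'Stat' and the separator token, then upper-case the remainder
    stat_name = " ".join(words[2:]).upper()
print(stat_name)
```

Upper-casing the name is what lets it match the entries in `categories_i_care_about`, which are all written in capitals.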

In [187]:
with open('./pga_data.pk','rb') as fi:
    pga_tour_stats_dict = pickle.load(fi)
pga_tour_stats_dict['OFFICIAL WORLD GOLF RANKING'].head()

| | RANK THIS WEEK | PLAYER NAME | EVENTS | AVG POINTS | TOTAL POINTS | POINTS LOST | POINTS GAINED | COUNTRY |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Dustin Johnson | 45 | 10.29 | 463.25 | -340.53 | 335.37 | USA |
| 1 | 2 | Justin Rose | 45 | 10.23 | 460.51 | -192.95 | 300.75 | ENG |
| 2 | 3 | Brooks Koepka | 43 | 9.92 | 426.49 | -169.68 | 298.73 | USA |
| 3 | 4 | Justin Thomas | 50 | 9.43 | 471.25 | -267.04 | 303.54 | USA |
| 4 | 5 | Francesco Molinari | 50 | 7.25 | 362.57 | -125.16 | 295.49 | ITA |


# 2. Data Processing

Now comes the second step of the data science pipeline, Data Processing. We already did a little of this during Data Collection. This step involves taking the data you collected and put into a DataFrame and cleaning it up for future analysis. This can include tidying the data (each attribute in a column and each observation in a row), dealing with columns or entries that contain missing data, and melting data. In the last step we put the data into DataFrames, but there is one major problem: our data is very wide and spread out, as it is stored in 17 different DataFrames. It gives us a lot of information, but far more than we will actually need. We will condense all 17 DataFrames into 1 DataFrame. Below I have the code that does this, and I will explain what I did after.

In [192]:
condense_dict = {}
df = pd.DataFrame([])
for key in pga_tour_stats_dict.keys():
    # The world ranking column keeps its name; every other stat gets a ' RANKING' suffix
    rank_col = key if key == "OFFICIAL WORLD GOLF RANKING" else key + ' RANKING'
    # Keep only PLAYER NAME and RANK THIS WEEK, in that order
    condense_dict[key] = pga_tour_stats_dict[key].iloc[:,[1,0]].rename(columns = {'RANK THIS WEEK':rank_col}).astype(str)
    # Strip the 'T' that marks tied ranks (e.g. 'T5' -> '5') and convert to int
    condense_dict[key].iloc[:,1:] = condense_dict[key].iloc[:,1:].replace({'T': ''}, regex=True).astype(int)
    if df.empty:
        df = condense_dict[key]
    else:
        df = pd.merge(df,condense_dict[key],how='outer',on='PLAYER NAME')        
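# Swap the labels at positions 1 and 16 so the FedExCup season points label comes first.
# Caution: assigning to df.columns only renames the columns in place; the data under each
# label does not move. To move the data along with its label, use df = df[col_list] instead.
col_list = list(df)
col_list[1], col_list[16] = col_list[16], col_list[1]
df.columns = col_list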
df.iloc[:,1:] = df.iloc[:,1:].apply(pd.to_numeric)
df.sort_values("FEDEXCUP SEASON POINTS RANKING",inplace=True)
df.head()

| | PLAYER NAME | FEDEXCUP SEASON POINTS RANKING | SG: OFF-THE-TEE RANKING | DRIVING DISTANCE RANKING | DRIVING ACCURACY PERCENTAGE RANKING | TOTAL DRIVING RANKING | BALL STRIKING RANKING | SG: APPROACH-THE-GREEN RANKING | GREENS IN REGULATION PERCENTAGE RANKING | SG: AROUND-THE-GREEN RANKING | SCRAMBLING RANKING | SG: PUTTING RANKING | SG: TOTAL RANKING | SCORING AVERAGE RANKING | OFFICIAL MONEY RANKING | TOP 10 FINISHES RANKING | SG: TEE-TO-GREEN RANKING | FEDEXCUP STANDINGS RANKING | OFFICIAL WORLD GOLF RANKING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dustin Johnson | 1 | 1 | 6 | 125 | 28 | 11 | 5 | 9 | 33 | 23 | 25 | 1 | 1 | 2 | 1 | 1 | 4 | 1 |
| 1 | Francesco Molinari | 2 | 8 | 52 | 46 | 5 | 5 | 10 | 16 | 17 | 82 | 182 | 16 | 14 | 11 | 24 | 8 | 17 | 5 |
| 2 | Justin Thomas | 3 | 28 | 11 | 138 | 47 | 41 | 4 | 45 | 20 | 24 | 47 | 3 | 3 | 1 | 4 | 2 | 7 | 4 |
| 3 | Justin Rose | 4 | 16 | 34 | 33 | 1 | 6 | 17 | 23 | 6 | 15 | 21 | 2 | 2 | 3 | 2 | 4 | 1 | 2 |
| 4 | Henrik Stenson | 5 | 26 | 139 | 1 | 37 | 13 | 1 | 1 | 113 | 48 | 157 | 14 | 12 | 40 | 24 | 50 | 57 | 24 |


Our data is now in one DataFrame, with each PGA Tour player's ranking in each measured category stored in the columns next to their name. This will allow us to determine later on where a player would stand in the regular season rankings given their rankings in the other categories. We are not done with our data processing yet. While the data is in a more usable form, there are players far down the FedExCup season points standings who have NaN as their ranking in some, or many, categories, so now we must remedy that.
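
To see the two key moves of the condensing step in isolation (stripping the tie marker and outer-merging on the player name), here is a small sketch on made-up data. The player names, column names, and ranks are invented, and `str.replace` is used here as a simpler stand-in for the regex-based `replace` in the cell above.

```python
import pandas as pd

# Made-up rank tables in the same two-column shape as the scraped data
driving = pd.DataFrame({"PLAYER NAME": ["Player A", "Player B"],
                        "DRIVING RANKING": ["1", "T2"]})
putting = pd.DataFrame({"PLAYER NAME": ["Player B", "Player C"],
                        "PUTTING RANKING": ["T5", "9"]})

for frame in (driving, putting):
    rank_col = frame.columns[1]
    # Tied ranks carry a 'T' prefix; remove it so the column can become an integer
    frame[rank_col] = frame[rank_col].str.replace("T", "", regex=False).astype(int)

# An outer join keeps every player that appears in either table
merged = pd.merge(driving, putting, how="outer", on="PLAYER NAME")
print(merged)
```

Player A and Player C each appear in only one table, so the outer join leaves a NaN in the column they are missing from. That is the same kind of gap that shows up in the full DataFrame.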

The PGA Tour does not keep a ranking for a player in a category if their ranking in that category is below a certain threshold, so we can either drop these players or replace NaN with some other value. I am going to replace each NaN with that player's Official World Golf Ranking, assuming they have a FedExCup standing, as that should give a sound indicator of how their game is at the moment. If we just wanted to analyze the top 100 players or so, it would make sense to drop them, but I think it would be fun to see how accurately the Official World Golf Ranking can predict a lower-ranked player's FedExCup standing. If a player does not have a FedExCup standing, though, I will drop them from the DataFrame, as we have no metric to measure their year by.

In [195]:
# Drop players without a FedExCup ranking; copy so the assignments below modify a real DataFrame
df = df.dropna(subset=["FEDEXCUP SEASON POINTS RANKING"]).copy()
for col in df.columns[2:]:
    # Replace NaN values in each column (player name/fedexcup ranking excluded) with the player's world ranking
    df[col] = df[col].fillna(df['OFFICIAL WORLD GOLF RANKING']).astype(int)
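
The cell above relies on the fact that `fillna` accepts a Series and fills each gap with the value from the same row of that Series. Here is a minimal sketch of that behavior on made-up data; the player names and ranks are invented for illustration.

```python
import pandas as pd

# Made-up rankings; None becomes NaN and marks a category the player is not ranked in
ranks = pd.DataFrame({
    "PLAYER NAME": ["Player A", "Player B", "Player C"],
    "FEDEXCUP SEASON POINTS RANKING": [1.0, 2.0, None],
    "PUTTING RANKING": [10.0, None, 3.0],
    "OFFICIAL WORLD GOLF RANKING": [4.0, 150.0, 90.0],
})

# Drop players with no FedExCup ranking; copy so later assignments hit a real frame
ranks = ranks.dropna(subset=["FEDEXCUP SEASON POINTS RANKING"]).copy()

# fillna with a Series fills each gap from the same row of that Series
ranks["PUTTING RANKING"] = (
    ranks["PUTTING RANKING"].fillna(ranks["OFFICIAL WORLD GOLF RANKING"]).astype(int)
)
print(ranks)
```

Player C is dropped for lacking a FedExCup ranking, and Player B's missing putting rank is filled with their world ranking of 150. One thing to watch for: `astype(int)` raises an error if any NaN is left, which would happen for a player who also has no world ranking.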

In [196]:
# Save the new tidy dataframe as a csv file for later use
df.to_csv('ranking_data.csv')

Now our data is clean and tidy, and missing data is accounted for. We are ready for the next step of the data science pipeline: exploratory analysis and data visualization.

# 3. Exploratory Analysis and Data Visualization