## EDA On PGA Tour Statistics

This dataset contains every recorded statistic from the PGA Tour since the start of the 1980 season. Its structure is fairly simple: each tournament has an associated date, each tournament has a field of players, and each player has a number of statistics recorded for that tournament.
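
To make that structure concrete, here is a minimal sketch in plain Python that nests a handful of rows as tournament, then player, then statistic. The player, tournament, and statistic names and all values are placeholders invented for illustration, not records from the dataset. The `nest_rows` helper is hypothetical, and converting `value` to `float` assumes every value is numeric, which may not hold for every statistic in the real file.

```python
from collections import defaultdict

# Placeholder rows in the dataset's six-column layout:
# (player_name, date, tournament, statistic, variable, value).
# None of these names or values come from the real data.
rows = [
    ("Player A", "1980-01-13", "Tournament X", "Stat 1", "AVG", "70.00"),
    ("Player A", "1980-01-13", "Tournament X", "Stat 2", "AVG", "1.75"),
    ("Player B", "1980-01-13", "Tournament X", "Stat 1", "AVG", "73.00"),
    ("Player A", "1980-01-20", "Tournament Y", "Stat 1", "AVG", "71.00"),
]


def nest_rows(rows):
    """Group rows as (tournament, date) -> player -> {(statistic, variable): value}."""
    nested = defaultdict(lambda: defaultdict(dict))
    for player, date, tournament, statistic, variable, value in rows:
        # Assumption: the value parses as a number.
        nested[(tournament, date)][player][(statistic, variable)] = float(value)
    return nested


nested = nest_rows(rows)
for (tournament, date), players in nested.items():
    print(f"{tournament} ({date}): {len(players)} player(s)")
    for player, stats in players.items():
        print(f"  {player}: {len(stats)} statistic(s)")
```

Keying each tournament by name and date together keeps repeat editions of the same event separate, since a tournament name alone can recur from season to season.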

In [1]:
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
#Starting a Spark session
spark = SparkSession.builder.appName("PGA_Analysis").getOrCreate()

In [3]:
# Reading the data into a Spark dataframe
# Note: without inferSchema or an explicit schema, every column is read as a string
df = spark.read.csv("/home/gerardo/Desktop/Projects/Datasets/PGA/PGA_Data_Historical.csv", header=True)
df.show()

+-----------------+----------+--------------------+--------------------+--------+-----+
|      player_name|      date|          tournament|           statistic|variable|value|
+-----------------+----------+--------------------+--------------------+--------+-----+
|   Rik Massengale|1980-01-13|Bob Hope Desert C...|Final Round Scori...|     AVG|70.00|
|    Bobby Nichols|1980-01-13|Bob Hope Desert C...|Final Round Scori...|     AVG|73.00|
|       Andy North|1980-01-13|Bob Hope Desert C...|Final Round Scori...|     AVG|73.00|
|    John Mahaffey|1980-01-13|Bob Hope Desert C...|Final Round Scori...|     AVG|73.00|
|   Peter Jacobsen|1980-01-13|Bob Hope Desert C...|Final Round Scori...|     AVG|73.00|
|    Charles Coody|1980-01-13|Bob Hope Desert C...|Final Round Scori...|     AVG|72.00|
|      Grier Jones|1980-01-13|Bob Hope Desert C...|Final Round Scori...|     AVG|72.00|
|     Calvin Peete|1980-01-13|Bob Hope Desert C...|Final Round Scori...|     AVG|73.00|
|      Jim Nelford|1980-01-13|Bo

In [4]:
type(df)

pyspark.sql.dataframe.DataFrame

In [4]:
#How many rows are there?
rows = df.count()
print(f"The dataset contains {rows:,} rows")

The dataset contains 46,147,897 rows


In [14]:
#How many different statistics have been collected?
df.createOrReplaceTempView('stats')
spark.sql("SELECT COUNT(DISTINCT statistic) AS number_of_stats FROM stats").show()

+---------------+
|number_of_stats|
+---------------+
|            442|
+---------------+



In [15]:
# How many distinct players have had stats recorded?
spark.sql("SELECT COUNT(DISTINCT player_name) AS number_of_distinct_players FROM stats").show()

+--------------------------+
|number_of_distinct_players|
+--------------------------+
|                      2441|
+--------------------------+



In [16]:
# How many distinct tournaments (by name) appear in the data?
spark.sql("SELECT COUNT(DISTINCT tournament) AS number_of_tournaments FROM stats").show()

+---------------------+
|number_of_tournaments|
+---------------------+
|                  305|
+---------------------+



In [28]:
#I'd like to create a list of all of the recorded stats
stats = spark.sql("SELECT DISTINCT statistic FROM stats").toPandas()
stats_list = stats['statistic'].tolist()

#Write the list to a text file for easy reference
with open('stats.txt', 'w') as file:
    for stat in stats_list:
        file.write(stat + "\n")
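
Once the list is on disk, a small lookup helper can make it easier to use as a reference. The sketch below is one possible approach, not part of the original analysis: `find_stats` is a hypothetical helper, and the demo runs against a throwaway file of placeholder names rather than the real `stats.txt`.

```python
import tempfile
from pathlib import Path


def find_stats(keyword, path="stats.txt"):
    """Return the names in `path` that contain `keyword`, ignoring case."""
    lines = Path(path).read_text().splitlines()
    needle = keyword.lower()
    return sorted(name for name in lines if needle in name.lower())


# Demo on a temporary file; these names are placeholders, not real statistics.
demo_path = Path(tempfile.mkdtemp()) / "stats_demo.txt"
demo_path.write_text(
    "Placeholder Scoring A\nPlaceholder Putting B\nplaceholder scoring C\n"
)

matches = find_stats("scoring", demo_path)
print(matches)
```

With the real file in place, calling `find_stats("Scoring")` with the default path would list every statistic whose name mentions scoring.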