This Python notebook pulls weather data from NOAA's API and transforms it into a clean dataframe. The dataframe will be saved as a CSV and moved to the file share for the team to use. The goal is to be able to join this dataset by end date (and location if needed) to the other completed datasets. The file will contain the low, high, and precipitation for each day that has been recorded. I am looking for a way to add the average daily temperature, but I may just average the high and low to get a mean temperature for the day. I am using the requests package to pull data for the 5 courses we chose to pull weather on. We chose these 5 courses because their tournaments are played every year at the same location. This is great for exploratory analysis, because we can compare year-to-year results using the weather and location.

In [1]:
import pandas as pd
import requests
import json
import numpy as np
from datetime import datetime

I start the file by reading in the other datasets in the file share to get info on the courses I am pulling weather for. 

In [2]:
df = pd.read_csv('/dsa/groups/casestudy2022su/team06/PGA_majors.csv')
df.head(40)

Unnamed: 0,Year,Start_date,End_date,Major,Location,Course
0,2014,4/10/2014,4/13/2014,The Masters,"Augusta, GA",Augusta National Golf Club
1,2015,4/9/2015,4/12/2015,The Masters,"Augusta, GA",Augusta National Golf Club
2,2016,4/7/2016,4/10/2016,The Masters,"Augusta, GA",Augusta National Golf Club
3,2017,4/6/2017,4/9/2017,The Masters,"Augusta, GA",Augusta National Golf Club
4,2018,4/5/2018,4/8/2016,The Masters,"Augusta, GA",Augusta National Golf Club
5,2019,4/11/2019,4/14/2019,The Masters,"Augusta, GA",Augusta National Golf Club
6,2020,11/12/2020,11/15/2020,The Masters,"Augusta, GA",Augusta National Golf Club
7,2021,4/5/2021,4/11/2021,The Masters,"Augusta, GA",Augusta National Golf Club
8,2022,4/7/2022,4/10/2022,The Masters,"Augusta, GA",Augusta National Golf Club
9,2014,9/7/2014,9/10/14,PGA Championship,"Louisville, KY",Valhalla


In [3]:
golftourn = pd.read_csv('/dsa/groups/casestudy2022su/team06/GolfTournament.csv')
golftourn.head()

Unnamed: 0,Player_initial_last,tournament id,player id,hole_par,strokes,hole_DKP,hole_FDP,hole_SDP,streak_DKP,streak_FDP,...,purse,season,no_cut,Finish,sg_putt,sg_arg,sg_app,sg_ott,sg_t2g,sg_total
0,A. Ancer,401353254,9261,288,285,60.5,55.4,65,3,6.6,...,20.0,2022,0,T33,0.16,0.04,0.7,0.2,0.94,1.09
1,A. Hadwin,401353254,5548,288,281,71.0,65.3,70,0,11.8,...,20.0,2022,0,T9,1.43,-0.11,1.04,-0.27,0.67,2.09
2,A. Lahiri,401353254,4989,288,276,87.5,80.5,79,0,11.8,...,20.0,2022,0,2,1.41,0.02,1.04,0.87,1.93,3.34
3,A. Long,401353254,6015,288,287,57.0,51.3,62,0,5.6,...,20.0,2022,0,T46,0.37,0.15,-0.61,0.67,0.22,0.59
4,A. Noren,401353254,3832,288,284,67.0,60.4,66,0,6.2,...,20.0,2022,0,T26,-0.15,0.08,1.25,0.16,1.49,1.34


In [4]:
golftourn.columns

Index(['Player_initial_last', 'tournament id', 'player id', 'hole_par',
       'strokes', 'hole_DKP', 'hole_FDP', 'hole_SDP', 'streak_DKP',
       'streak_FDP', 'streak_SDP', 'n_rounds', 'made_cut', 'pos', 'finish_DKP',
       'finish_FDP', 'finish_SDP', 'total_DKP', 'total_FDP', 'total_SDP',
       'player', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'tournament name',
       'course', 'date', 'purse', 'season', 'no_cut', 'Finish', 'sg_putt',
       'sg_arg', 'sg_app', 'sg_ott', 'sg_t2g', 'sg_total'],
      dtype='object')

I take the unique dates for each course to find the end date of each tournament. The dataset we have only records end dates so far, so I need to look up the start date of each tournament manually. I will then pull the weather for each tournament from start to end. When I looked up the start dates, I verified them by checking that the end date listed online matched the date in our dataset.
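The dataset stores dates as M/D/YYYY, while the NOAA requests later in this notebook use YYYY-MM-DD. A minimal sketch of that conversion follows; `to_iso` is a hypothetical helper name, and the sample values are end dates in the dataset's format.

```python
from datetime import datetime

def to_iso(date_text):
    """Convert a 'M/D/YYYY' string (the dataset's format) to 'YYYY-MM-DD'."""
    return datetime.strptime(date_text, '%m/%d/%Y').strftime('%Y-%m-%d')

# Sample end dates in the same format as the 'date' column.
sample_end_dates = ['3/13/2022', '3/14/2021', '3/17/2019']
iso_end_dates = [to_iso(d) for d in sample_end_dates]
print(iso_end_dates)
```

Converting in code instead of retyping the dates by hand would reduce the chance of a typo in the start and end date lists.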

In [6]:
TPCsawgrass = golftourn[golftourn['course']=='TPC Sawgrass - Ponte Vedra Beach, FL']
TPCsawgrass['date'].unique()

array(['3/13/2022', '3/14/2021', '3/17/2019', '5/13/2018', '5/14/2017',
       '5/15/2016', '5/10/2015'], dtype=object)

In [7]:
Augusta = golftourn[golftourn['course']=='Augusta National Golf Club - Augusta, GA']
Augusta['date'].unique()
#Augusta was played in April every year except 2020, when it was played in November

array(['4/11/2021', '11/15/2020', '4/14/2019', '4/8/2018', '4/9/2017',
       '4/10/2016', '4/12/2015'], dtype=object)

In [8]:
Augusta1 = df[df['Course']=='Augusta National Golf Club']
start_end = Augusta1[['Start_date','End_date']]
start_end

Unnamed: 0,Start_date,End_date
0,4/10/2014,4/13/2014
1,4/9/2015,4/12/2015
2,4/7/2016,4/10/2016
3,4/6/2017,4/9/2017
4,4/5/2018,4/8/2016
5,4/11/2019,4/14/2019
6,11/12/2020,11/15/2020
7,4/5/2021,4/11/2021
8,4/7/2022,4/10/2022


In [9]:
TPCscottsdale = golftourn[golftourn['course']=='TPC Scottsdale - Scottsdale, AZ']
TPCscottsdale['date'].unique()

array(['2/13/2022', '2/7/2021', '2/1/2020', '2/3/2019', '2/4/2018',
       '2/5/2017', '2/7/2016', '2/1/2015'], dtype=object)

In [10]:
HiltonHead = golftourn[golftourn['course']=='Harbour Town Golf Links - Hilton Head Island, SC']
HiltonHead['date'].unique()

array(['4/18/2021', '6/21/2020', '4/21/2019', '4/15/2018', '4/16/2017',
       '4/17/2016', '4/19/2015'], dtype=object)

In [11]:
TorreyPines = golftourn[golftourn['course']=='Torrey Pines North - La Jolla, CA']
TorreyPines['date'].unique()

array(['1/29/2022', '1/31/2021', '1/25/2020', '2/27/2019', '1/30/2018',
       '1/29/2017', '2/1/2016', '2/8/2015'], dtype=object)

In [12]:
#Augusta National Golf Club - Augusta, GA (the masters)
#TPC Sawgrass - Ponte Vedra Beach, FL - PLAYERS
#TPC Scottsdale - Scottsdale, AZ - WMPO
#Harbour Town Golf Links - Hilton Head Island, SC 
#Torrey Pines North - La Jolla, CA




#base url request location = https://www.ncdc.noaa.gov/cdo-web/api/v2/data?
#getting weather for a specific day GHCND = Global Historical Climatology Network Daily

This is the start of my weather data collection. I need an API token, which I should keep in a separate text file so that it is not shared with everyone. I will do that before publishing anything. On the NOAA website I found the stations that record the weather at the location of each tournament. I made sure each station had captured weather for every year before selecting it. I then saved the stations I am using in a commented-out cell below.
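One way to keep the token out of the notebook is sketched below. The file name `noaa_token.txt`, the environment variable `NOAA_TOKEN`, and the helper `load_token` are all assumptions made for illustration; none of them come from the project.

```python
import os
from pathlib import Path

def load_token(path='noaa_token.txt', env_var='NOAA_TOKEN'):
    """Return the API token from an environment variable or a local text file.

    Both the file name and the variable name are hypothetical defaults.
    """
    token = os.environ.get(env_var)
    if token:
        return token.strip()
    token_file = Path(path)
    if token_file.exists():
        return token_file.read_text().strip()
    raise RuntimeError(f'No token found in {env_var} or {path}')
```

A call such as `token = load_token()` would then replace any literal token string, as long as the text file is excluded from whatever gets shared.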

In [13]:
import WeatherAPI_token_key  # assumed to be a local, unshared module that defines my_key
token = WeatherAPI_token_key.my_key


In [14]:
augusta_stationid = 'GHCND:USW00013837'


In [15]:
start= '2016-01-01'
end = '2017-01-01'

This part of the notebook is a practice run to check that my API call succeeds and to see what the returned JSON contains. The data I want to collect is in the "results" portion of the JSON.

In [16]:
r = requests.get('https://www.ncdc.noaa.gov/cdo-web/api/v2/data?datasetid=GHCND&limit=1000&stationid=GHCND:USW00013837&startdate=2014-04-10&enddate=2014-04-10', headers={'token':token})
d = json.loads(r.text)
r

<Response [200]>

In [17]:
print(r.text)

{"metadata":{"resultset":{"offset":1,"count":8,"limit":1000}},"results":[{"date":"2014-04-10T00:00:00","datatype":"AWND","station":"GHCND:USW00013837","attributes":",,W,","value":26},{"date":"2014-04-10T00:00:00","datatype":"PRCP","station":"GHCND:USW00013837","attributes":",,W,","value":0},{"date":"2014-04-10T00:00:00","datatype":"TMAX","station":"GHCND:USW00013837","attributes":",,W,","value":250},{"date":"2014-04-10T00:00:00","datatype":"TMIN","station":"GHCND:USW00013837","attributes":",,W,","value":111},{"date":"2014-04-10T00:00:00","datatype":"WDF2","station":"GHCND:USW00013837","attributes":",,W,","value":170},{"date":"2014-04-10T00:00:00","datatype":"WDF5","station":"GHCND:USW00013837","attributes":",,W,","value":170},{"date":"2014-04-10T00:00:00","datatype":"WSF2","station":"GHCND:USW00013837","attributes":",,W,","value":67},{"date":"2014-04-10T00:00:00","datatype":"WSF5","station":"GHCND:USW00013837","attributes":",,W,","value":89}]}
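The "results" list in the response printed above can go straight into a dataframe. The sketch below uses a trimmed copy of that payload (two of the eight records) so it runs without an API call; `payload` and `sample_df` are illustrative names.

```python
import pandas as pd

# Trimmed copy of the response printed above (two of the eight records).
payload = {
    'metadata': {'resultset': {'offset': 1, 'count': 8, 'limit': 1000}},
    'results': [
        {'date': '2014-04-10T00:00:00', 'datatype': 'TMAX',
         'station': 'GHCND:USW00013837', 'attributes': ',,W,', 'value': 250},
        {'date': '2014-04-10T00:00:00', 'datatype': 'TMIN',
         'station': 'GHCND:USW00013837', 'attributes': ',,W,', 'value': 111},
    ],
}

# .get() with a default avoids a KeyError if a response has no 'results' key.
records = payload.get('results', [])

# Keep only the four fields of interest and drop 'attributes'.
sample_df = pd.DataFrame(records, columns=['date', 'datatype', 'station', 'value'])
print(sample_df)
```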


In [18]:
r.json()

{'metadata': {'resultset': {'offset': 1, 'count': 8, 'limit': 1000}},
 'results': [{'date': '2014-04-10T00:00:00',
   'datatype': 'AWND',
   'station': 'GHCND:USW00013837',
   'attributes': ',,W,',
   'value': 26},
  {'date': '2014-04-10T00:00:00',
   'datatype': 'PRCP',
   'station': 'GHCND:USW00013837',
   'attributes': ',,W,',
   'value': 0},
  {'date': '2014-04-10T00:00:00',
   'datatype': 'TMAX',
   'station': 'GHCND:USW00013837',
   'attributes': ',,W,',
   'value': 250},
  {'date': '2014-04-10T00:00:00',
   'datatype': 'TMIN',
   'station': 'GHCND:USW00013837',
   'attributes': ',,W,',
   'value': 111},
  {'date': '2014-04-10T00:00:00',
   'datatype': 'WDF2',
   'station': 'GHCND:USW00013837',
   'attributes': ',,W,',
   'value': 170},
  {'date': '2014-04-10T00:00:00',
   'datatype': 'WDF5',
   'station': 'GHCND:USW00013837',
   'attributes': ',,W,',
   'value': 170},
  {'date': '2014-04-10T00:00:00',
   'datatype': 'WSF2',
   'station': 'GHCND:USW00013837',
   'attributes': ',,

In [186]:
for i in r.json()['results']:
    print(i)
    print(i['date'],i['station'],i['datatype'],i['value'])

{'date': '2014-04-10T00:00:00', 'datatype': 'AWND', 'station': 'GHCND:USW00013837', 'attributes': ',,W,', 'value': 26}
2014-04-10T00:00:00 GHCND:USW00013837 AWND 26
{'date': '2014-04-10T00:00:00', 'datatype': 'PRCP', 'station': 'GHCND:USW00013837', 'attributes': ',,W,', 'value': 0}
2014-04-10T00:00:00 GHCND:USW00013837 PRCP 0
{'date': '2014-04-10T00:00:00', 'datatype': 'TMAX', 'station': 'GHCND:USW00013837', 'attributes': ',,W,', 'value': 250}
2014-04-10T00:00:00 GHCND:USW00013837 TMAX 250
{'date': '2014-04-10T00:00:00', 'datatype': 'TMIN', 'station': 'GHCND:USW00013837', 'attributes': ',,W,', 'value': 111}
2014-04-10T00:00:00 GHCND:USW00013837 TMIN 111
{'date': '2014-04-10T00:00:00', 'datatype': 'WDF2', 'station': 'GHCND:USW00013837', 'attributes': ',,W,', 'value': 170}
2014-04-10T00:00:00 GHCND:USW00013837 WDF2 170
{'date': '2014-04-10T00:00:00', 'datatype': 'WDF5', 'station': 'GHCND:USW00013837', 'attributes': ',,W,', 'value': 170}
2014-04-10T00:00:00 GHCND:USW00013837 WDF5 170
{'da

Once I had figured out how the API calls work, I wrote a function that can pull any data from NOAA given my token, the station I want to pull from, and the start and end dates. The function below sends the request and saves each returned record in a dictionary. Each dictionary is appended to a list, which is later used to create a dataframe.

In [232]:
token = WeatherAPI_token_key.my_key  # read the token from the key module instead of hard-coding it
datasetid = 'GHCND'
limit = '1000'
url = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data?'

In [507]:
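def get_weather(token,datasetid,limit,station,startdate,enddate):
    """Request daily records from NOAA and return them as a list of dictionaries."""
    weather_list = []
    url = 'https://www.ncdc.noaa.gov/cdo-web/api/v2/data'
    # Let requests build the query string from a dictionary of parameters
    params = {'datasetid': datasetid, 'limit': limit, 'stationid': station,
              'startdate': startdate, 'enddate': enddate}
    apirequest = requests.get(url, params=params, headers={'token':token})
    apirequest.raise_for_status()  # stop early if the request failed
    datajson = apirequest.json()
    # .get() returns an empty list when the response has no 'results' key
    for i in datajson.get('results', []):
        weather_rows = {}
        weather_rows['date'] = i['date']
        weather_rows['datatype'] = i['datatype']
        weather_rows['station'] = i['station']
        weather_rows['value'] = i['value']
        weather_list.append(weather_rows)
#     print(weather_list)
    return weather_list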
#     weather_df = pd.DataFrame(weather_list)
#     return weather_df
#     return weather_df.to_csv('weather_data.csv')
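A single request returns at most `limit` records, so a longer date range could be cut off. The sketch below shows one way to page through results. It assumes the API accepts an `offset` query parameter that matches the `offset` field in the response metadata; `fetch_all`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the API so the sketch runs offline.

```python
def fetch_all(fetch_page, limit=1000):
    """Collect records across pages.

    fetch_page(offset, limit) is a hypothetical callable that returns the
    decoded JSON for one request, e.g. a thin wrapper around requests.get.
    """
    records = []
    offset = 1  # the sample response above reports offset 1 for the first page
    while True:
        page = fetch_page(offset, limit)
        batch = page.get('results', [])
        records.extend(batch)
        if len(batch) < limit:  # a short page means there is nothing left
            break
        offset += limit
    return records

# Stand-in for the API: 2,500 fake records served in pages.
fake_rows = [{'value': n} for n in range(2500)]

def fake_fetch(offset, limit):
    start = offset - 1
    return {'results': fake_rows[start:start + limit]}

print(len(fetch_all(fake_fetch)))
```

For the four-to-seven-day tournament windows used here, one page is enough, so this only matters if the date ranges grow.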
    
    

masters_day1 was my test to make sure the function worked correctly. I pulled only the weather for the first Masters tournament in our data, the 2014 tournament.

In [463]:
masters_day1 = get_weather(token,'GHCND','1000','GHCND:USW00013837','2014-04-10','2014-04-13')

[{'date': '2014-04-10T00:00:00', 'datatype': 'AWND', 'station': 'GHCND:USW00013837', 'value': 26}, {'date': '2014-04-10T00:00:00', 'datatype': 'PRCP', 'station': 'GHCND:USW00013837', 'value': 0}, {'date': '2014-04-10T00:00:00', 'datatype': 'TMAX', 'station': 'GHCND:USW00013837', 'value': 250}, {'date': '2014-04-10T00:00:00', 'datatype': 'TMIN', 'station': 'GHCND:USW00013837', 'value': 111}, {'date': '2014-04-10T00:00:00', 'datatype': 'WDF2', 'station': 'GHCND:USW00013837', 'value': 170}, {'date': '2014-04-10T00:00:00', 'datatype': 'WDF5', 'station': 'GHCND:USW00013837', 'value': 170}, {'date': '2014-04-10T00:00:00', 'datatype': 'WSF2', 'station': 'GHCND:USW00013837', 'value': 67}, {'date': '2014-04-10T00:00:00', 'datatype': 'WSF5', 'station': 'GHCND:USW00013837', 'value': 89}, {'date': '2014-04-11T00:00:00', 'datatype': 'AWND', 'station': 'GHCND:USW00013837', 'value': 30}, {'date': '2014-04-11T00:00:00', 'datatype': 'PRCP', 'station': 'GHCND:USW00013837', 'value': 0}, {'date': '2014-04

In [464]:
masters_day1

[{'date': '2014-04-10T00:00:00',
  'datatype': 'AWND',
  'station': 'GHCND:USW00013837',
  'value': 26},
 {'date': '2014-04-10T00:00:00',
  'datatype': 'PRCP',
  'station': 'GHCND:USW00013837',
  'value': 0},
 {'date': '2014-04-10T00:00:00',
  'datatype': 'TMAX',
  'station': 'GHCND:USW00013837',
  'value': 250},
 {'date': '2014-04-10T00:00:00',
  'datatype': 'TMIN',
  'station': 'GHCND:USW00013837',
  'value': 111},
 {'date': '2014-04-10T00:00:00',
  'datatype': 'WDF2',
  'station': 'GHCND:USW00013837',
  'value': 170},
 {'date': '2014-04-10T00:00:00',
  'datatype': 'WDF5',
  'station': 'GHCND:USW00013837',
  'value': 170},
 {'date': '2014-04-10T00:00:00',
  'datatype': 'WSF2',
  'station': 'GHCND:USW00013837',
  'value': 67},
 {'date': '2014-04-10T00:00:00',
  'datatype': 'WSF5',
  'station': 'GHCND:USW00013837',
  'value': 89},
 {'date': '2014-04-11T00:00:00',
  'datatype': 'AWND',
  'station': 'GHCND:USW00013837',
  'value': 30},
 {'date': '2014-04-11T00:00:00',
  'datatype': 'PRCP

My next step was to find a way to call my get_weather function for every location and every start/end date. The best way to do this was to create a dictionary within a dictionary that separates each location and lets me loop through the dates.

In [565]:
tournaments = {'Augusta':{'station_name':'GHCND:USW00013837', 'start_dates': ['2014-04-10','2015-04-09','2016-04-07',
                                '2017-04-06','2018-04-05','2019-04-11','2020-11-12','2021-04-05','2022-04-07'], 
                          'end_dates': ['2014-04-13','2015-04-12','2016-04-10','2017-04-09','2018-04-08',
                                       '2019-04-14','2020-11-15','2021-04-11','2022-04-10']},
               'Scottsdale':{'station_name':'GHCND:US1AZMR0264', 'start_dates': ['2015-01-29','2016-02-01',
                                '2017-01-30','2018-02-01','2019-01-31','2020-01-27','2021-02-01','2022-02-07'], 
                          'end_dates': ['2015-02-01','2016-02-07',
                                '2017-02-05','2018-02-04','2019-02-03','2020-02-01','2021-02-07','2022-02-13']},
               'Sawgrass':{'station_name':'GHCND:US1FLNS0012', 'start_dates': ['2015-05-07','2016-05-12',
                                '2017-05-11','2018-05-10','2019-03-14','2020-03-12','2021-03-11','2022-03-10'], 
                          'end_dates': ['2015-05-10','2016-05-15',
                                '2017-05-14','2018-05-13','2019-03-17','2020-03-15','2021-03-14','2022-03-13']},
               'Hilton_head':{'station_name':'GHCND:US1SCBF0002', 'start_dates': ['2015-04-15','2016-04-14',
                                '2017-04-13','2018-04-12','2019-04-15','2020-06-18','2021-04-12'], 
                          'end_dates': ['2015-04-19','2016-04-17',
                                '2017-04-16','2018-04-15','2019-04-21','2020-06-21','2021-04-18']},
               'Torrey_pines':{'station_name':'GHCND:US1CASD0015', 'start_dates': ['2015-02-05','2016-01-28',
                            '2017-01-26','2018-01-25','2019-01-24','2020-01-23', '2021-01-28', '2022-01-26'], 
                          'end_dates': ['2015-02-08','2016-02-01',
                                       '2017-01-29','2018-01-30','2019-02-27','2020-01-25', '2021-01-31','2022-01-29']}
               
              }


#some tournaments are longer in some years than others (Scottsdale 2021/2022 are long)
#Sawgrass was moved from May to March starting in 2019, which could change the tournament
#Hilton Head had the 2020 tournament moved to June, probably due to COVID


# weather_data = pd.DataFrame()
# for i, j in tournaments.items():
#     station_name = j['station_name']
#     start_dates = j['start_dates']
#     end_dates = j['end_dates']
#     for k in range(len(start_dates)):
#         df = get_weather(token,datasetid,limit,station_name,start_dates[k],end_dates[k])
#         weather_data.append(df, inplace = True, axis = 0)
# weather_data.to_csv('weather_data.csv')
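Because the start and end dates are typed by hand, a quick sanity check on the dictionary may be worthwhile before any requests are sent. The sketch below flags pairs whose lists differ in length, whose end precedes the start, or whose span exceeds a threshold. `check_dates` and the 7-day threshold are assumptions; the sample entry copies two date pairs from the Torrey_pines lists above.

```python
from datetime import date

def check_dates(tournament_dict, max_days=7):
    """Return a list of (name, start, end, reason) for suspicious date pairs."""
    problems = []
    for name, info in tournament_dict.items():
        starts, ends = info['start_dates'], info['end_dates']
        if len(starts) != len(ends):
            problems.append((name, None, None, 'list lengths differ'))
            continue
        for start, end in zip(starts, ends):
            span = (date.fromisoformat(end) - date.fromisoformat(start)).days
            if span < 0:
                problems.append((name, start, end, 'end before start'))
            elif span > max_days:
                problems.append((name, start, end, f'{span} days long'))
    return problems

# Two date pairs copied from the Torrey_pines entry above.
sample = {'Torrey_pines': {'station_name': 'GHCND:US1CASD0015',
                           'start_dates': ['2019-01-24', '2020-01-23'],
                           'end_dates': ['2019-02-27', '2020-01-25']}}
print(check_dates(sample))
```

On this sample the 2019 pair is reported as 34 days long, which suggests that end date deserves a second look.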
        


I then made a function that iterates through my dictionary and calls my previous function to pull the weather data for each tournament's dates. I pass in the static values: my key, the dataset to request (always the daily summaries, GHCND), and the limit of 1000 results per request. For each location, the function reads the station ID and the start and end date of each tournament. Once all the data is pulled, it goes into a dataframe that is stored as a CSV. This CSV can then be moved to the file share for my team to use.

Additional work that I have not done this week is cleaning the pulled data. I want to align each station with a location. I need to convert the temperatures from tenths of degrees Celsius to degrees Fahrenheit. I also want to split the datatype column into separate columns for high temperature, low temperature, and precipitation. I plan to remove the snow columns, because there is probably no snow at these golf tournaments. These are just ideas for cleaning the data before I move my file to the file share. This should be finished by week 3.
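The cleaning steps listed above could look roughly like the sketch below. It is not the finished cleaning code. The rows copy values from the 2014-04-10 output earlier in the notebook, `station_to_location` is a hypothetical lookup, and the conversion assumes TMAX and TMIN are in tenths of degrees Celsius as stated above. PRCP is left in whatever units the API returns.

```python
import pandas as pd

# Rows in the shape returned by get_weather (values from the 2014-04-10 output).
rows = [
    {'date': '2014-04-10T00:00:00', 'datatype': 'PRCP', 'station': 'GHCND:USW00013837', 'value': 0},
    {'date': '2014-04-10T00:00:00', 'datatype': 'TMAX', 'station': 'GHCND:USW00013837', 'value': 250},
    {'date': '2014-04-10T00:00:00', 'datatype': 'TMIN', 'station': 'GHCND:USW00013837', 'value': 111},
]
long_df = pd.DataFrame(rows)

# Hypothetical lookup from station ID to a location label.
station_to_location = {'GHCND:USW00013837': 'Augusta, GA'}

# One row per station and day, one column per datatype.
wide = long_df.pivot_table(index=['station', 'date'], columns='datatype',
                           values='value', aggfunc='first').reset_index()
wide.columns.name = None
wide['date'] = pd.to_datetime(wide['date']).dt.date
wide['location'] = wide['station'].map(station_to_location)

# Tenths of degrees Celsius -> degrees Fahrenheit (assumed input units).
for col in ['TMAX', 'TMIN']:
    wide[col + '_F'] = wide[col] / 10 * 9 / 5 + 32

# Mean daily temperature taken as the midpoint of the high and the low.
wide['TMEAN_F'] = (wide['TMAX_F'] + wide['TMIN_F']) / 2
print(wide[['date', 'location', 'TMAX_F', 'TMIN_F', 'TMEAN_F', 'PRCP']])
```

Pivoting first keeps one row per day, which is the shape needed to join on end date later.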

In [566]:
def full_weather_data(tournament_dict):
    all_rows = []  # collect the records from every request in one list
    for i, j in tournament_dict.items():
        station_name = j['station_name']
        start_dates = j['start_dates']
        end_dates = j['end_dates']
        for k in range(len(start_dates)):
            # token, datasetid and limit are the variables set in an earlier cell
            rows = get_weather(token,datasetid,limit,station_name,start_dates[k],end_dates[k])
            all_rows.extend(rows)
    # DataFrame.append was removed in pandas 2.0, so build the dataframe once at the end
    weather_data = pd.DataFrame(all_rows)
    weather_data.to_csv('weather_data.csv')
    return weather_data


In [489]:
# def update_weather(weather_df, tournament_dict):
#     for key, value in tournament_dict.items():
#         if key in list(weather_df['tournament_name']):
#             tournament_dict.remove(i)
#     new_tournament = full_weather_data(tournament_dict)
#     weather_df = weather_df.append(new_tournament, axis = 0)
#     return weather_df

In [499]:
# try:
#     weather_data = pd.read_csv('weather_data.csv')
# except:
#     weather_data = full_weather_data(tournaments)
    

In [568]:
weather_df = full_weather_data(tournaments)

In [20]:
#AUGUSTA MASTERS STATION WORKS
# Name	AUGUSTA DANIEL FIELD AIRPORT, GA US
# ID	GHCND:USW00013837
# Lat/Lon	33.466676, -82.038345
# PERIOD OF RECORD
# Start/End	1996-07-01 to 2022-06-04
# Coverage	99%



# Name	JACKSONVILLE CRAIG MUNICIPAL AIRPORT, FL US
# Network:ID	GHCND:USW00053860
# Latitude/Longitude	30.33708°, -81.51277°
# Elevation	11.9 m
# Air Temperature
# Precipitation
# Sunshine
# Weather Type
# Wind


#SCOTTSDALE AIRPORT WORKS FOR TPC SCOTTSDALE WM PHOENIX OPEN
# Name	SCOTTSDALE MUNICIPAL AIRPORT, AZ US
# Network:ID	GHCND:USW00003192
# Latitude/Longitude	33.61234°, -111.92317°
# Elevation	436.1 m


#just north of hilton head island
# Name	BEAUFORT MCAS, SC US
# Network:ID	GHCND:USW00093831
# Latitude/Longitude	32.48333°, -80.71667°
# Elevation	11.3 m
# Air Temperature
# Precipitation
# Sky cover & clouds
# Weather Type
# Wind


#using this station for torrey pines
# Name	SAN DIEGO MONTGOMERY FIELD, CA US
# Network:ID	GHCND:USW00003131
# Latitude/Longitude	32.81453°, -117.13747°
# Elevation	127.5 m
# Air Temperature
# Precipitation
# Sunshine
# Weather Type
# Wind

The cells below are where I plan to take a smaller sample of the weather data I am pulling above and clean it. A smaller sample takes less time to pull, and once it is cleaned, I should be able to move my changes into the function above. That way, when I pull the data, I will get back a cleaned dataset and a CSV file of the data.
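One possible way to get that smaller sample is to cut the tournaments dictionary down to a single location and year before pulling. `sample_tournaments` is a hypothetical helper, and the demo dictionary is a shortened copy of the Augusta entry above.

```python
def sample_tournaments(tournament_dict, name, n_years=1):
    """Return a new dictionary with only the first n_years of one location."""
    info = tournament_dict[name]
    return {name: {'station_name': info['station_name'],
                   'start_dates': info['start_dates'][:n_years],
                   'end_dates': info['end_dates'][:n_years]}}

# Shortened copy of the Augusta entry from the tournaments dictionary above.
tournaments_demo = {'Augusta': {'station_name': 'GHCND:USW00013837',
                                'start_dates': ['2014-04-10', '2015-04-09'],
                                'end_dates': ['2014-04-13', '2015-04-12']}}
small = sample_tournaments(tournaments_demo, 'Augusta')
print(small)
```

The result has the same shape as the full dictionary, so it could be passed to full_weather_data unchanged.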