# Retrieving NBA Salary and Win Data

This section retrieves the data from external sources, saves them locally, then loads them to build a JSON object. 

Re-running these steps takes considerable time because they involve 900+ pages. I save each page as a local HTML file so that I do not have to revisit the site, out of consideration for the site owner. These files capture the teams for each year dating back to the 1990-1991 season, as well as the players and salaries for each team. They are provided in a zip file that is current as of 1/4/2021, near the beginning of the 2020-2021 season.

We will also fetch standings data from the unofficial NBA API that are current as of a similar time.

The final output is `nbaSalaryData.json` which contains the salary and win data necessary for analysis.

<a id='top'></a>
# Table of Contents
- [Import](#import)

- [Retrieve base HTML which contains all the years](#retrieve-base-html)

- [Get the years from base HTML, save to JSON, and save soup and data](#get-the-years)

- [Crawl for each past year and save HTML](#crawl-for-each)

- [Read the teams and team salary from each saved year soup and add to JSON object](#read-the-teams)

- [Crawl each team in each year and save HTML](#crawl-each-team)

- [Get the players' salaries from each team from each year and add to JSON object](#get-the-players)

- [Adding win data](#adding-win-data)
  - [Fetch the raw standings data from unofficial nba api](#fetch-the-raw)
  - [Extracting and normalizing the win data for each team](#extracting-and-normalizing)
  - [Add win data to JSON](#add-win-data)

- [Appendix](#appendix)
- [Data sources](#data-sources)
- [Git history](#git-history)

<a id='import'></a>
# Import [top](#top)
All imports for the notebook are handled here.

In [1]:
# data retrieval 
from bs4 import BeautifulSoup
import requests
import json
import os
from time import sleep
from nba_api.stats.endpoints import leaguestandings
import datetime as dt

<a id='retrieve-base-html'></a>
## Retrieve base HTML which contains all the years [top](#top)

I am getting all of my salary data from https://hoopshype.com/salaries/. I was tempted to use https://www.basketball-reference.com, which is the gold standard for NBA data, but they [explicitly prohibit scraping](https://www.sports-reference.com/data_use.html).

By saving the soup to file we avoid having to re-scrape the site to get the page into memory. Putting `sleep()` in `getSoup()` ensures we do not make requests too quickly later, when we are scraping lots of different URLs and retrying requests.

In [3]:
def getSoup(url):
    page = requests.get(url)
    sleep(0.6)
    return BeautifulSoup(page.content, 'html.parser')

def saveSoup(soup, filePath):
    path = filePath.rpartition('/')[0] # == 'parent/child' for 'parent/child/file.txt'
    
    if not os.path.exists(path):
        os.makedirs(path)

    with open(filePath, 'w', encoding='utf-8') as fp:
        fp.write(str(soup))

def loadSoup(filePath):
    with open(filePath, 'rb') as html:
        # name the parser explicitly so it matches getSoup() and avoids a bs4 warning
        return BeautifulSoup(html, 'html.parser')

The page is requested only if the HTML file does not exist locally, or if `force` is set to `True`.

In [5]:
baseDir = 'data/raw/soups/base.html'
force = False

if not os.path.exists(baseDir) or force:
    print(f'{baseDir} doesn\'t exist.\nRequesting and saving. ')
    soup = getSoup('https://hoopshype.com/salaries')
    saveSoup(soup, baseDir)
else:
    print("File already exists. Set force to 'True' if you want to re-request.")

File already exists. Set force to 'True' if you want to re-request.


The website displays the year in a format like 2019/20, but the URL uses the format 2019-2020, so we convert each year to the long format before saving it. Later we will also need to convert between the long format and a short format like 2019-20 when working with the standings data.

In [6]:
def toShortYear(longYear):
    '''converts from format like 2019-2020 to format like 2019-20'''
    return f'{longYear[:5]}{longYear[7:]}'
    
def toLongYear(shortYear):
    '''converts from format like 2019/20 to format like 2019-2020'''
    longYear = shortYear.replace("/","-")

    firstTwo = shortYear[:2] #20 from 2019
    lastTwo = shortYear[2:4] #19 from 2019

    indexToInsert = longYear.find("-") + 1

    if lastTwo != '99':
        longYear = longYear[:indexToInsert] + firstTwo  + longYear[indexToInsert:]
    else:
        newFirstTwo = str(int(firstTwo) + 1)
        longYear = longYear[:indexToInsert] + newFirstTwo  + longYear[indexToInsert:]

    return longYear
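
As a quick check of the conversion logic, the sketch below restates the two helpers in a compact form and round-trips a few seasons, including the century rollover that the `'99'` branch handles. The snake_case names are mine, not the notebook's, and the compact `to_long_year` assumes a four-digit start year followed by a two-digit end year.

```python
def to_short_year(long_year):
    """'2019-2020' -> '2019-20'"""
    return f"{long_year[:5]}{long_year[7:]}"


def to_long_year(short_year):
    """'2019/20' or '2019-20' -> '2019-2020'"""
    start, end = short_year.replace("/", "-").split("-")
    # The end year is always the start year plus one, so the century
    # is derived from that rather than copied from the start year.
    century = str(int(start) + 1)[:2]
    return f"{start}-{century}{end}"


for site_format in ["1990/91", "1999/00", "2019/20"]:
    long_year = to_long_year(site_format)
    print(site_format, "->", long_year, "->", to_short_year(long_year))
```

Deriving the century from `start + 1` is one way to avoid a special case for seasons that span a century boundary; the notebook's version handles the same case with an explicit branch.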

<a id='get-the-years'></a>
## Get the years from base HTML, save to JSON, and save soup and data [top](#top)
As we parse data at the different levels (years, teams of years, players of teams) we will save to our main data object. I save to a work-in-progress JSON file until the data is finalized, so that re-running an earlier cell cannot overwrite the completed data file.

In [8]:
def saveDataWIP(data):
    '''Use this function for saving to JSON while building the object'''
    with open('data/nbaSalaryData-WIP.json', 'w') as fp:
        json.dump(data, fp, indent=4)

def loadDataWIP():
    with open('data/nbaSalaryData-WIP.json', 'r') as fp:
        return json.load(fp)

First we find all the year strings from the base html. We use this to start our JSON object and will refer back to it to continue scraping.

In [11]:
soup = loadSoup('data/raw/soups/base.html')
# the first two links are duplicates, so skip the first one
yearLinks = soup.find('li', class_='all').find_all('a')[1:]
yearStrings = []

# strip
for year in yearLinks:
    txt = toLongYear(year.text.strip())
    yearStrings.append(txt)
print(f"Found {len(yearStrings)} years")
        
#sort from low year to high year, and fill dict
yearStrings.sort()
currentYear = yearStrings[-1]

#start data object and save
nbaSalaryData = dict.fromkeys(yearStrings)
saveDataWIP(nbaSalaryData)

saveSoup(soup, f'data/raw/soups/{currentYear}/{currentYear}.html')

Found 31 years
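
For reference, `dict.fromkeys()` called without a second argument maps every key to `None`, so each year's value has to be replaced with a dict before teams can be nested under it. A minimal sketch with two sample year strings and a placeholder team entry (the values are illustrative only):

```python
import json

year_strings = ["1990-1991", "1991-1992"]  # sample keys only

data = dict.fromkeys(year_strings)
print(json.dumps(data))  # every value serializes as null

# Assigning a fresh dict per year makes nested assignment possible.
data["1990-1991"] = {}
data["1990-1991"]["Atlanta"] = {"salary": None, "players": {}}  # placeholder entry
```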


<a id='crawl-for-each'></a>
## Crawl for each past year and save HTML [top](#top)
I chose to save the HTML pages and wait between requests because we will be making a request for each year. I do not want to make many requests in quick succession nor make the same requests again in the future unless I expect that the data has actually changed.

Occasionally scraping a year works but the teams don't populate in the table. To deal with this, I check the number of teams in the table. When the table of teams doesn't populate, I retry the request a limited number of times and print the year that failed along with the retry attempt number. If all retries fail, that year has to be scraped manually.

I first check if the soup exists before making any new requests.

In [12]:
def soupIsValid(soup):
    # detects if table is not populated correctly
    return len(soup.find_all('td', class_='name')) > 1
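
The retry idea described above can be sketched independently of the scraper. The helper below is hypothetical (`fetch_with_retries` is not part of the notebook): it takes any fetch function and any validity check, and stops as soon as a result passes. In the notebook, `getSoup` and `soupIsValid` would play those two roles.

```python
def fetch_with_retries(fetch, is_valid, max_retries=5):
    """Call fetch() until is_valid(result) is true or retries run out.

    Returns (result, attempts_used). The result may still be invalid
    if every attempt failed, so callers should check it again.
    """
    result = fetch()
    attempts = 1
    while not is_valid(result) and attempts <= max_retries:
        result = fetch()
        attempts += 1
    return result, attempts


# Stand-in for a flaky page: the first two responses have no teams.
responses = iter([[], [], ["Atlanta", "Boston"]])
page, attempts = fetch_with_retries(lambda: next(responses), lambda r: len(r) > 1)
print(page, attempts)
```

Returning the attempt count alongside the result makes it easy to log which pages needed retries without printing inside the helper.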

In [13]:
nbaSalaryData = loadDataWIP()
force = False
existLocally = []

# past years only because current year was saved earlier
pastYears = list(nbaSalaryData.keys())[:-1]

for year in pastYears:
    soupDir = f'data/raw/soups/{year}/{year}.html'
    
    if not os.path.exists(soupDir) or force:
        print(f"Getting soup for {year}")
        url = f'https://hoopshype.com/salaries/{year}/'
        soup = getSoup(url)
    
        if not soupIsValid(soup):
            print(f'soup retrieved for year {year} is not valid')
            for n in range(5):
                print(f'retrying {n}')
                soup = getSoup(url)
                if soupIsValid(soup):
                    break  # stop retrying once the table is populated
        #save soup to .html file
        saveSoup(soup, soupDir)
    else:
        existLocally.append(year)

if len(existLocally) > 0:
    print(f"Did not make requests for {len(existLocally)} " \
        f"years because they were found locally: \n{existLocally}")

Did not make requests for 30 years because they were found locally: 
['1990-1991', '1991-1992', '1992-1993', '1993-1994', '1994-1995', '1995-1996', '1996-1997', '1997-1998', '1998-1999', '1999-2000', '2000-2001', '2001-2002', '2002-2003', '2003-2004', '2004-2005', '2005-2006', '2006-2007', '2007-2008', '2008-2009', '2009-2010', '2010-2011', '2011-2012', '2012-2013', '2013-2014', '2014-2015', '2015-2016', '2016-2017', '2017-2018', '2018-2019', '2019-2020']


<a id='read-the-teams'></a>
## Read the teams and team salary from each saved year soup and add to JSON object [top](#top)

In [14]:
# load data and soup files
nbaSalaryData = loadDataWIP()
# the first file listed in each year directory is assumed to be the year page (e.g. 1990-1991.html)
yearFiles = [f'{x[0]}/{x[2][0]}' for x in list(os.walk("data/raw/soups"))[1:]]
# add teams to year data for each year
for f in yearFiles:
    soup = loadSoup(f)
    year = f.split("/")[-1].split('.')[0]
    
    # assign empty object so we can assign object with 'nbaSalaryData[year][name]' later
    # otherwise 'nbaSalaryData[year][name]' results in NoneType error (because nbaSalaryData[year]=None)
    nbaSalaryData[year] = {}

    #filter for elements containing the team names
    options = soup.find_all('td', class_='name')
    options = options[1:] # get rid of the first element, which doesn't contain a team
        
    #parse team data
    for o in options:
        #find team name and salary
        teamTags = o.find_all('a')
        
        for t in teamTags:
            name = t.text.strip()
            salary = t.parent.find_next_sibling('td').text.strip()
            url = t.get('href')
            nbaSalaryData[year][name] = {"salary": salary, "players": {}, "url": url}

# save
saveDataWIP(nbaSalaryData)
print("Team information added to WIP JSON")

Team information added to WIP JSON


<a id='crawl-each-team'></a>
## Crawl each team in each year and save HTML [top](#top)
These are the last soup objects we will need to collect.

This step will take a while (about 20 minutes in my experience) because there are 31 years and at least 28 teams per year, and `getSoup()` waits 0.6 seconds per request. We could theoretically shorten this by reducing the pause, but this might negatively impact the website and our ability to scrape from it. Alternatively, the repository contains [a zip file current as of 1/4/2021](https://github.com/BlairCurrey/nba-salary-distribution/tree/main/data/raw). If the pages exist locally then they are not rescraped.

In [15]:
nbaSalaryData = loadDataWIP()

In [16]:
force = False
existLocally = 0
total = 0

for year in nbaSalaryData.keys():
    for team in nbaSalaryData[year]:
        soupDir = f'data/raw/soups/{year}/{team}.html'
        
        if not os.path.exists(soupDir) or force:
            print(f"Saving soup for {year}: {team}")
            soup = getSoup(nbaSalaryData[year][team]['url'])
            saveSoup(soup, soupDir)
        else:
            existLocally += 1
        total += 1

if existLocally > 0:
    print(f"Did not make requests for {existLocally}/{total} teams")

Did not make requests for 906/906 teams


The soup object file structure now looks something like this:
```
data
 |_raw
 |  |_soups
 |     |_base.html           #homepage. used to find years
 |     |_1990-1991
 |     |   |_1990-1991.html  #page containing teams for 1990-1991
 |     |   |_Atlanta.html    #page containing Atlanta
 |     |   |_Boston.html
 |     |   ...
 |     |   |_Washington.html
 |     |_1991-1992
 |     ...
 |     |_2020-2021
 |_nbaSalaryData-WIP.json        
```
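
Given this layout, one way to tell the year page apart from the team pages is by file name rather than by position in the directory listing. The sketch below is an assumption-laden alternative, not the notebook's approach: `split_year_dir` is a hypothetical helper, and it is exercised here against a throwaway directory that mimics the tree above.

```python
import os
import tempfile


def split_year_dir(year_dir):
    """Return (year_page, team_pages) for one season directory.

    The year page is identified by name (<year>.html) rather than by
    its position in the directory listing.
    """
    year = os.path.basename(os.path.normpath(year_dir))
    year_page = f"{year}.html"
    team_pages = sorted(
        name for name in os.listdir(year_dir)
        if name.endswith(".html") and name != year_page
    )
    return year_page, team_pages


# Build a tiny throwaway tree that mirrors the layout above.
with tempfile.TemporaryDirectory() as root:
    season = os.path.join(root, "1990-1991")
    os.makedirs(season)
    for name in ["1990-1991.html", "Atlanta.html", "Boston.html"]:
        open(os.path.join(season, name), "w").close()

    year_page, team_pages = split_year_dir(season)
    print(year_page, team_pages)
```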

<a id='get-the-players'></a>
## Get the players' salaries from each team from each year and add to JSON object [top](#top)
This cell checks whether players are already loaded into the JSON before saving. This can be overridden by setting `force` to `True` to save over existing player salary information. This step is slow, taking roughly 30 seconds to run.

In [17]:
nbaSalaryData = loadDataWIP()

In [20]:
force = False
existJson = []
addedTo = 0

# list of paths to each year
yearPaths = [x[0] for x in os.walk("data/raw/soups")][1:]

for y in yearPaths:
    year = os.path.basename(y)  # last path component; works with '/' and '\\' separators
    #list of team files in each year
    teamFiles = list(os.walk(y))[0][2][1:] 
    
    for tf in teamFiles:
        team = tf.split(".")[0]
        
        if not nbaSalaryData[year][team]["players"] or force:
            soup = loadSoup(os.path.join(y, tf))

            #filter for elements containing the team names
            options = soup.find('table', class_='hh-salaries-team-table').find_all('td', class_="name")

            #parse player data
            for o in options:
                #find player name and salary
                playerTags = o.find_all('a')

                for p in playerTags:
                    name = p.text.strip()
                    salary = p.parent.find_next_sibling('td').text.strip()
                    nbaSalaryData[year][team]['players'][name] = salary
            addedTo += 1
        else:
            existJson.append(f"{year}: {team}")
# save
saveDataWIP(nbaSalaryData)

if len(existJson) > 0:
    print(f"Skipped {len(existJson)} teams because salary information was already found.")
else:
    print(f"{addedTo} teams' player salary information added to WIP JSON")

Skipped 906 teams because salary information was already found.


This salary data can be accessed like so:

In [21]:
nbaSalaryData["2014-2015"]["Orlando"]["players"]

{'Channing Frye': '$8,579,088',
 'Al Harrington': '$7,609,800',
 'Glen Davis': '$6,600,000',
 'Victor Oladipo': '$4,978,200',
 'Ben Gordon': '$4,500,000',
 'Aaron Gordon': '$3,992,040',
 'Nikola Vucevic': '$2,902,757',
 'Luke Ridnour': '$2,750,000',
 'Tobias Harris': '$2,511,432',
 'Elfrid Payton': '$2,397,840',
 'Jameer Nelson': '$2,000,000',
 'Moe Harkless': '$1,887,840',
 'Anthony Randolph': '$1,825,359',
 'Andrew Nicholson': '$1,545,840',
 'Evan Fournier': '$1,483,920',
 'Willie Green': '$1,448,490',
 "Kyle O'Quinn": '$915,243',
 'Devyn Marble': '$884,879',
 'Dewayne Dedmon': '$816,482'}
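
Note that the salaries are stored as formatted strings. Any numeric work on them needs a conversion step first; a minimal sketch, assuming every value is a dollar sign followed by comma-grouped digits (`parse_salary` is a hypothetical helper, not defined in the notebook):

```python
def parse_salary(text):
    """Convert a string such as '$8,579,088' to the integer 8579088."""
    return int(text.replace("$", "").replace(",", ""))


players = {"Channing Frye": "$8,579,088", "Dewayne Dedmon": "$816,482"}
numeric = {name: parse_salary(s) for name, s in players.items()}
print(numeric)
```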

<a id='adding-win-data'></a>
## Adding win data [top](#top)

<a id='fetch-the-raw'></a>
### Fetch the raw standings data from unofficial nba api [top](#top)
Checks if it exists locally before requesting from the [unofficial nba api](https://github.com/swar/nba_api). Each request sleeps for 0.6 seconds but there are only 31 requests so the overall time is not that long. Saves after retrieving. Can be forced by setting `force` to `True`.

In [22]:
force = False
rsDir = 'data/raw/standings.json'
seasons = [toShortYear(k) for k in nbaSalaryData.keys()]

# gets win data from nba.com api unless found locally
if not os.path.isfile(rsDir) or force:
    print(f'Requesting data starting in {seasons[0]} and ending in {seasons[-1]}')
    
    rawStandingsData = {"resourceSets": []}
    
    # get data for all seasons
    for s in seasons:
        print(f'Requesting data for {s}')
        standing = json.loads(leaguestandings.LeagueStandings(season=s).get_json())
        rawStandingsData["resourceSets"].append(standing)
        sleep(0.6)

    with open(rsDir, 'w') as fp:
        json.dump(rawStandingsData, fp, indent=4)
    
else:
    print("Raw standings data already exists locally.")

Raw standings data already exists locally.


<a id='extracting-and-normalizing'></a>
### Extracting and normalizing the win data for each team [top](#top)
The teams in `rawStandingsData` are organized differently than in our JSON data. To combine this with our salary data we need to normalize the cities. For example, this data source includes both `Seattle` and `Oklahoma City`, which our previous data source simplifies to `Oklahoma City` (the Seattle team moved to Oklahoma City in 2008). We don't care about preserving these distinctions, so the simplified version that we already have works better. Other normalizations include converting the two Los Angeles teams (Lakers, Clippers) to `LA Lakers` and `LA Clippers` to match the `nbaSalaryData` format. This transformation happens after we access the team name and city data from `rawStandingsData`.

We can validate that the team names have been normalized by comparing the teams in our win data to the teams in our JSON object. This check is defined in `printInvalidTeams()` and called later to print the teams (if any) that differ. There should be 0 if they have all been converted. The following functions are used in normalizing the data from the new source:

In [23]:
def normalizeCity(originalCity, teamName=None):
    cityMap = {
        "New Jersey": "Brooklyn",
        "Vancouver": "Memphis",
        "New Orleans/Oklahoma City": "New Orleans",
        "Seattle": "Oklahoma City",
        "LA": "Los Angeles"
    }
    
    c = originalCity
    if c in cityMap:
        c = cityMap[originalCity]
    if c == "Los Angeles":
        c = whichLA(teamName)
    return c

def whichLA(teamName):
    if teamName == "Lakers":
        return "LA Lakers"
    elif teamName == "Clippers":
        return "LA Clippers"
    else:
        raise Exception('teamName did not match expected values')

def printInvalidTeams(winData, nbaSalaryData):
    teams = set()

    for year in winData:
        requested = set(winData[year])
        existing = set(nbaSalaryData[year])
        # teams present in the requested win data but missing from nbaSalaryData
        teams |= requested - existing

    if len(teams)>0:
        print(f"{len(teams)} team(s) in requested data but not in nbaSalaryData:")
        print(teams)
    else:
        print("No teams in requested data that don't exist in nbaSalaryData")
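
The validation boils down to a per-year set difference. The sketch below shows that idea on toy data where one city has deliberately not been normalized; the function name and the win percentages are made up for illustration.

```python
def teams_missing_from_salary_data(win_data, salary_data):
    """Teams that appear in win_data but have no entry in salary_data."""
    missing = set()
    for year, teams in win_data.items():
        missing |= set(teams) - set(salary_data.get(year, {}))
    return missing


# Toy data: 'Seattle' has not been normalized to 'Oklahoma City'.
salary_data = {"2007-2008": {"Oklahoma City": {}, "Boston": {}}}
win_data = {"2007-2008": {"Seattle": 0.25, "Boston": 0.75}}  # made-up values
print(teams_missing_from_salary_data(win_data, salary_data))
```

The direction of the difference matters here: a team that exists only in the win data is the case that would raise a `KeyError` when the win percentages are later written into the salary object.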

To get the win data from our new source we will use `getWinData()`. This traverses the `rawStandingsData` and returns a winData object like so:
   ```
   {
       '1990-1991': 
           {
               'Portland': 0.768, 
               'Chicago': 0.744, 
               ...
           }, 
        ...
   }
   ```
This is also where we use `normalizeCity()` to map this data source's naming scheme to our JSON object's naming scheme.

In [24]:
def getWinData(rawStandingsData):
    winData = {}

    #get indexes for categories we need
    rawStandingsDataHeaders = rawStandingsData["resourceSets"][0]["resultSets"][0]["headers"]
    iCity = rawStandingsDataHeaders.index("TeamCity")
    iName = rawStandingsDataHeaders.index("TeamName")
    iWinPct = rawStandingsDataHeaders.index("WinPCT")

    for year in rawStandingsData["resourceSets"]:
        y = toLongYear(year["parameters"]["SeasonYear"])
        winData[y] = {}
        for team in year["resultSets"][0]["rowSet"]:
            t = normalizeCity(team[iCity], team[iName])
            winData[y][t] = team[iWinPct]

    return winData
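
To make the traversal concrete, here is a hand-made sample in the shape that `getWinData()` reads, reduced to the three columns it uses. The real response contains many more columns, and the `SeasonYear` format `'1990-91'` is an assumption inferred from the call to `toLongYear()`. The win percentages are the two shown in the example output above. City normalization is left out to keep the sketch short.

```python
# Hand-made sample; only the fields getWinData() touches are included.
raw = {
    "resourceSets": [
        {
            "parameters": {"SeasonYear": "1990-91"},
            "resultSets": [
                {
                    "headers": ["TeamCity", "TeamName", "WinPCT"],
                    "rowSet": [
                        ["Portland", "Trail Blazers", 0.768],
                        ["Chicago", "Bulls", 0.744],
                    ],
                }
            ],
        }
    ]
}

# Look up column positions by header name instead of hard-coding them.
headers = raw["resourceSets"][0]["resultSets"][0]["headers"]
i_city = headers.index("TeamCity")
i_pct = headers.index("WinPCT")

win_by_city = {}
for season in raw["resourceSets"]:
    key = season["parameters"]["SeasonYear"]
    rows = season["resultSets"][0]["rowSet"]
    win_by_city[key] = {row[i_city]: row[i_pct] for row in rows}

print(win_by_city)
```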

<a id='add-win-data'></a>
### Add win data to JSON [top](#top)
The following functions are used to add the data to our json object and to save it. I also created a new `loadData()` function to accompany `saveData()` because `saveData()` saves our now finalized JSON to a different location.

In [25]:
def addWinDataToJson(winData, nbaSalaryData):
    for year in winData.keys():
        for team in winData[year].keys():
            nbaSalaryData[year][team]["winPct"] = winData[year][team]
    
    return nbaSalaryData

def saveData(nbaSalaryData):
    with open('data/nbaSalaryData.json', 'w') as fp:
        json.dump(nbaSalaryData, fp, indent=4)
        
def loadData():
    with open('data/nbaSalaryData.json', 'r') as fp:
        return json.load(fp)

This cell uses the above functions to go through all the years in our raw data set, normalize the team city names, print any teams that do not match, and add the win percentages to our JSON, unless the final file already exists. This can be overridden by setting `force` to `True`.

In [26]:
force = False

# open raw standings data
with open(rsDir, 'r') as fp:
    rawStandingsData = json.load(fp)

# get win pct for each team and add to json object   
if not os.path.exists('data/nbaSalaryData.json') or force:
    nbaSalaryData = loadDataWIP()
    winData = getWinData(rawStandingsData)
    printInvalidTeams(winData, nbaSalaryData)
    nbaSalaryData = addWinDataToJson(winData, nbaSalaryData)
    saveData(nbaSalaryData)
    print("Saved winPct to nbaSalaryData.json")
else:
    print("winPct already added to nbaSalaryData.json")

winPct already added to nbaSalaryData.json


Now that we are done building our WIP json object we can delete the file.

In [28]:
if os.path.exists('data/nbaSalaryData-WIP.json'):
    os.remove('data/nbaSalaryData-WIP.json')
    print("removed json")
else:
    print("json not found")

removed json


Now we can access a team's win percentage for a year like so:

In [29]:
nbaSalaryData = loadData()
nbaSalaryData["2011-2012"]["Miami"]["winPct"] # returns 0.697

0.697

This concludes our data retrieval and storage.

<a id='appendix'></a>
# Appendix [top](#top)

<a id='data-sources'></a>
## Data sources [top](#top)
- https://hoopshype.com/salaries/ for team, player, and salary records
- An unofficial API for https://www.nba.com/stats/, maintained here https://github.com/swar/nba_api

<a id='git-history'></a>
## Git history [top](#top)

I used Git and GitHub for version control of this project. The git history can be seen here:

In [32]:
!git log --oneline --decorate --graph --all

* 0cf956e (HEAD -> make-presentable, origin/main, main) removed, didnt work
* 079439c moved load function to part2
* f4aafa8 (refs/original/refs/heads/main) added note about soups
* 48a3403 removed unzipped soups from git repo
* bea35e1 small tweaks
* 985194b linguist-detectable=false for html
* 29bb867 added conclusion and appendix
* 3123431 requirements added
* d888703 updated data
* 2c49acf directory structure changes and refactor
* 06d0508 no longer holds rel std dev
* 35087a3 removed dirs during refactor
* c68259f major refactor
* bf5025d added soup backup
* b111794 added relStdDev and winPct
* 5b043b4 added details on sources
* 1a670f1 added file
* d1cbd24 changed file structure
* 0be867a added backup
* c38081a modularized and added more data and analysis
* f80ef0e init commit
