# NBA Awards and Betting Models
Part 1: Web Scraping

Author: Abhi Vellore
Inspired By: Dataquest Web Scraping NBA Stats With Python: Data Project [Part 1 of 3] and https://github.com/JustinGong03/nba-awards-predictor

Some portions are adapted from the below sources. 
https://www.youtube.com/watch?v=JGQGd-oa0l4
https://github.com/JustinGong03/nba-awards-predictor
Accessed 2023. 

Part 1 uses web scraping to gather the data our models will eventually be built on. It uses BeautifulSoup, Requests, and Selenium to create CSVs of important NBA statistics. These statistics will later be used to build models that predict the winners of major NBA awards and to generate betting predictions.

In [2]:
# Import tools
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from itertools import product

MVP Data

All of our data will be scraped from Basketball Reference, and we will initially focus on the MVP data, which is at the top of the awards page. 

We will use the Requests module to save each year's HTML file in the "data/mvp" folder. Afterward, we will use BeautifulSoup and pandas to parse the data into a CSV.


In [3]:
# Seasons to scrape (2000 through 2023)
years = range(2000,2024)

In [4]:
# Link to award winners
url_start = "https://www.basketball-reference.com/awards/awards_{}.html"

for year in years:
   # Sleep to prevent being rate limited
   time.sleep(2)
   url = url_start.format(year)
   data = requests.get(url)
   with open("data/mvp/{}.html".format(year), "w+") as f:
      f.write(data.text)
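
The loop above writes whatever the server returns, including an error page if a request is rate limited. Below is a minimal sketch of a more defensive download step. `save_page` and its `fetch` parameter are hypothetical names that do not appear in the original notebook, and `fetch` is assumed to be any callable that returns a `(status_code, text)` pair.

```python
import time
from pathlib import Path


def save_page(url, path, fetch, delay=2.0):
    """Save the page at `url` to `path`; return False if it was already cached.

    `fetch` is a hypothetical callable returning (status_code, text), so the
    HTTP client can be swapped out or faked.
    """
    path = Path(path)
    if path.exists():
        # Skip pages downloaded on an earlier run
        return False
    status, text = fetch(url)
    if status != 200:
        # Do not cache error pages (for example, a rate-limit response)
        raise RuntimeError("GET {} returned status {}".format(url, status))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    # Pause between real downloads to stay under the rate limit
    time.sleep(delay)
    return True
```

With Requests, `fetch` could be a small function that calls `requests.get(url)` and returns `(response.status_code, response.text)`. Skipping cached files also means a rerun of the cell only downloads the years that are still missing.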


In [5]:
# List of DataFrames 
dfsm = []

for year in years:
   # Open page to be read
   with open ("data/mvp/{}.html".format(year)) as f:
      page = f.read()
   soup = BeautifulSoup(page, "html.parser")
   # Remove the over-header row from the MVP table
   soup.find("tr", class_="over_header").decompose()

   # Build MVP table from HTML; append year
   mvp_table = soup.find(id = "mvp")
   mvp = pd.read_html(str(mvp_table))[0]
   mvp["Year"] = year
   dfsm.append(mvp)
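
To illustrate why the over-header row is removed before the table is parsed, here is a small sketch on made-up markup. The table below only imitates a two-level header and is not copied from Basketball Reference; the column names are placeholders.

```python
from bs4 import BeautifulSoup

# Toy markup imitating a two-level table header (assumed, not the real page)
toy_html = """
<table id="mvp">
  <tr class="over_header"><th></th><th colspan="2">Voting</th></tr>
  <tr><th>Player</th><th>First</th><th>Pts Won</th></tr>
  <tr><td>Example Player</td><td>10</td><td>100</td></tr>
</table>
"""

soup = BeautifulSoup(toy_html, "html.parser")
over_header = soup.find("tr", class_="over_header")
if over_header is not None:  # find() returns None when the row is absent
    over_header.decompose()

# After removal, the first remaining row holds the real column names
table = soup.find(id="mvp")
header_cells = [th.get_text() for th in table.find("tr").find_all("th")]
print(header_cells)
```

With the grouping row gone, the table has a single header row, so a parser sees one flat set of column names. The `None` check is an optional safeguard for pages where the row is missing.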

In [6]:
# Convert from list of dataframes into one CSV
mvp = pd.concat(dfsm)
mvp.to_csv("data/mvps.csv")
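
`pd.concat` keeps each yearly frame's own row index, and `to_csv` writes the index as an unnamed first column by default. The sketch below uses made-up frames to show how `ignore_index=True` and `index=False` would avoid that. Whether later parts of this project rely on the extra column is not stated, so treat this as an optional variation.

```python
import pandas as pd

# Two made-up yearly frames standing in for the scraped tables
frames = [
    pd.DataFrame({"Player": ["A", "B"], "Year": 2000}),
    pd.DataFrame({"Player": ["C", "D"], "Year": 2001}),
]

combined = pd.concat(frames)
print(list(combined.index))  # each frame keeps its own 0-based index

flat = pd.concat(frames, ignore_index=True)
print(list(flat.index))  # one continuous index across all years

# index=False leaves the row index out of the CSV text
csv_text = flat.to_csv(index=False)
print(csv_text)
```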

Other Awards Data

Unfortunately, the other award tables on the same webpage cannot be read directly from the HTML that Requests returns, so we need to use Selenium instead.


Selenium uses a Chrome WebDriver to control a Chrome browser and "scrolls" through the site to load information that was previously hidden. We can then store the rendered page as an HTML file and parse it with BeautifulSoup, as before.
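
One common reason a table is missing from the raw HTML is that the site ships it inside an HTML comment and reveals it with JavaScript. Whether that is what happens here is an assumption, not something the sources above confirm. Under that assumption, the sketch below shows how BeautifulSoup could pull a commented-out table from made-up markup without a browser; `find_commented_table` is a hypothetical helper.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: the table only exists inside an HTML comment
sample_html = """
<div id="all_dpoy">
<!--
<table id="dpoy"><tr><th>Player</th></tr><tr><td>Example Player</td></tr></table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the <table> with `table_id` hidden in a comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Parse the comment body as its own small document
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


dpoy_table = find_commented_table(sample_html, "dpoy")
print(dpoy_table.find("td").get_text())
```

If the assumption does not hold for a given page, the function simply returns `None`, and the Selenium approach remains the fallback.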

In [7]:
# Selenium Webdriver. The latest version of Selenium automatically sets up a driver without additional input
driver=webdriver.Chrome()

In [8]:
# Scraping whole web page. Note this can be combined with the MVP scrape, but is done separately
# as a learning experience.

for year in years:
   url = "https://www.basketball-reference.com/awards/awards_{}.html".format(year)
   driver.get(url) 
   # Simulate the "scroll" using javascript
   driver.execute_script("window.scrollTo(1, 10000)")
   time.sleep(2)

   html = driver.page_source

   # Store data
   with open("data/otherAwards/{}.html".format(year), "w+") as f:
      f.write(html)


In [9]:
# Create lists of dataframes to later save as CSV
dfsd = []
dfss = []
dfsmi = []

for year in years:

   # Open page to be read
   with open ("data/otherAwards/{}.html".format(year)) as f:
      page = f.read()
   soup = BeautifulSoup(page, "html.parser")
   # Remove the over-header rows from all tables
   for header in soup.find_all("tr", class_="over_header"):
      header.decompose()

   # DPOY Table
   dpoy_table = soup.find(id = "dpoy")
   dpoy = pd.read_html(str(dpoy_table))[0]
   dpoy["Year"] = year
   dfsd.append(dpoy)

   # SMOY Table
   smoy_table = soup.find(id = "smoy")
   smoy = pd.read_html(str(smoy_table))[0]
   smoy["Year"] = year
   dfss.append(smoy)

   # MIP Table
   mip_table = soup.find(id = "mip")
   mip = pd.read_html(str(mip_table))[0]
   mip["Year"] = year
   dfsmi.append(mip)

In [10]:
# Combine DFs and Convert to CSV

dpoy = pd.concat(dfsd)
smoy = pd.concat(dfss)
mip = pd.concat(dfsmi)
dpoy.to_csv("data/dpoys.csv")
smoy.to_csv("data/smoys.csv")
mip.to_csv("data/mips.csv")

Player Statistics 

So far, we have gathered information about award winners. However, we must also know how award winners stand relative to all other players, so we scrape information about all players across the year range. Again, we use Selenium in order to generate the full table.

Note: we only want to consider players who have played a minimum number of games, to get rid of extraneous data. We choose 28 games as the minimum, or roughly 1/3 of a standard 82-game season. Also, players can get traded during the season, which gives them more than one row for that year. To avoid skewing predictions, we will pre-clean the data to drop such instances/players.
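
The pre-cleaning described in the note could look roughly like the sketch below. It runs on made-up rows, and the column names `Player`, `Tm`, and `G` (games played) are assumptions about the scraped table. `preclean` is a hypothetical helper, and treating any player with more than one row in a season as traded is one possible reading of the note.

```python
import pandas as pd

MIN_GAMES = 28  # threshold from the note above

# Made-up rows; player "B" appears three times, as a traded player might
toy = pd.DataFrame({
    "Player": ["A", "B", "B", "B", "C"],
    "Tm": ["BOS", "TOT", "LAL", "MIA", "DEN"],
    "G": ["70", "60", "35", "25", "12"],
    "Year": [2020, 2020, 2020, 2020, 2020],
})


def preclean(df, min_games=MIN_GAMES):
    """Drop players with several rows in one season and low-game players."""
    # Scraped values may arrive as strings, so coerce before comparing
    games = pd.to_numeric(df["G"], errors="coerce")
    # keep=False marks every row of a repeated (Player, Year) pair
    traded = df.duplicated(subset=["Player", "Year"], keep=False)
    keep = ~traded & (games >= min_games)
    return df[keep].reset_index(drop=True)


cleaned = preclean(toy)
print(cleaned)
```

Note that two different players who share a name in the same season would also be dropped by this check, so a unique player identifier would be safer if one is available.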

In [11]:
player_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

for year in years:
   url = player_stats_url.format(year)
   driver.get(url)
   # Simulate the "scroll" using javascript
   driver.execute_script("window.scrollTo(1, 10000)")
   time.sleep(2)

   html = driver.page_source

   # Store the data
   with open("data/player/{}.html".format(year), "w+") as f:
      f.write(html)

In [12]:
# Create list of dataframes to later save as CSV
dfPerGame = []

for year in years:
   # Open page to be read
   with open ("data/player/{}.html".format(year)) as f:
      page = f.read()
   soup = BeautifulSoup(page, "html.parser")
   # Remove the header rows (class "thead") from the table
   for header in soup.find_all("tr", class_="thead"):
      header.decompose()

   # Player table
   player_table = soup.find(id = "per_game_stats")
   players = pd.read_html(str(player_table))[0]
   players["Year"] = year
   dfPerGame.append(players)


Player 36 Minute Stats + Advanced Stats

To improve our models, it's critical to also have advanced statistics and per-36-minute statistics, which account for differences in playing time and team situations. We'll gather both the advanced and the per-36-minute statistics for each player so they can be combined into one table per player.

NOTE: Now that we're comfortable using Selenium, we'll parse each page directly into a list of DataFrames without saving the HTML first.

In [17]:
dfsPer36 = []
dfsAdv = []

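# NOTE: "per_poss" is Basketball Reference's per-100-possessions page; the per-36-minute
# page is "per_minute". The Per36 variable names below are kept as originally written.
for year, type_ in product(years, ["per_poss", "advanced"]):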
      driver.get("https://www.basketball-reference.com/leagues/NBA_{}_{}.html".format(year, type_))
      driver.execute_script("window.scrollTo(1, 20000)")
      time.sleep(2)
      html = driver.page_source
      
      soup = BeautifulSoup(html, "html.parser")
      for header in soup.find_all("tr", class_="thead"):
            header.decompose()

      data = soup.find(id = "{}_stats".format(type_))
      df = pd.read_html(str(data))[0]
      df["Year"] = year
      if type_ == "per_poss":
            dfsPer36.append(df)
      else:
            dfsAdv.append(df)
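
The cell above relies on `itertools.product` to visit every (year, page type) pair in a single loop. A short sketch of the iteration order, using a shortened year range for illustration:

```python
from itertools import product

url_template = "https://www.basketball-reference.com/leagues/NBA_{}_{}.html"
sample_years = range(2000, 2002)  # shortened range for illustration
page_types = ["per_poss", "advanced"]

# product varies the last argument fastest: both page types for one year,
# then both page types for the next year
pairs = list(product(sample_years, page_types))
urls = [url_template.format(year, type_) for year, type_ in pairs]
for url in urls:
    print(url)
```

Because both page types for a year are fetched back to back, the two lists of DataFrames end up in the same year order.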

In [18]:
# Combine DFs and save one CSV per statistic type

playersPerGame = pd.concat(dfPerGame)
playersPer36 = pd.concat(dfsPer36)
playersAdv = pd.concat(dfsAdv)

playersPerGame.to_csv("data/players.csv")
playersPer36.to_csv("data/players36.csv")
playersAdv.to_csv("data/playersAdv.csv")

Team Data Statistics

Next, team data is also critical: it plays a role in award voting and, of course, provides information about winning teams. While we can just use the Requests module for the standings pages, the remaining team tables are scraped with Selenium.

As we did with players, we should collect advanced information about the performance of each team, both for our later betting model and to provide context for player statistics.

In [19]:
# Link to team standings
url_start = "https://www.basketball-reference.com/leagues/NBA_{}_standings.html"

for year in years:
   time.sleep(2)
   url = url_start.format(year)
   data = requests.get(url)
   with open("data/teams/{}.html".format(year), "w+") as f:
      f.write(data.text)

Note: We use Selenium to scrape our team statistics. However, the "click" call works more reliably when the page is not also being scrolled, so this cell may occasionally fail.
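
One way to cope with an occasionally failing click is to retry the step a few times before giving up. The sketch below is a generic helper under that assumption; `retry` is a hypothetical name and is not part of Selenium or of the original notebook.

```python
import time


def retry(action, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call `action()` until it succeeds or `attempts` tries have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up and surface the last error
            # Wait briefly so the page has time to settle before retrying
            time.sleep(delay)
```

In the cell below, the click could then be wrapped as `retry(lambda: driver.find_element("link text", "Opponent").click())`. Selenium's own `WebDriverWait` with an expected condition is another option for waiting until an element is ready.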

In [21]:
# Create list of dataframes to later save as CSV
dfst = []
dfso = []
dfsa = []
dfss = []

for year in years:
      
      driver.get("https://www.basketball-reference.com/leagues/NBA_{}.html".format(year))
      if year < 2016:
            driver.execute_script("window.scrollTo(1, 2300)")
      else:
            driver.execute_script("window.scrollTo(1, 2700)") 
      time.sleep(2)
      html = driver.page_source
      
      soup = BeautifulSoup(html, "html.parser")


      # Remove all table overheaders
      for header in soup.find_all("tr", class_="over_header"):
            header.decompose()


      # Per Game Team Stats
      data = soup.find(id = "per_game-team")
      df = pd.read_html(str(data))[0]
      df["Year"] = year
      dfst.append(df)


      # Scroll to the correct statistics and switch tabs  
      opponent_tab = driver.find_element("link text", "Opponent")
      opponent_tab.click()
      
      # Wait for the "Opponent" tab content to load
      time.sleep(3)
      
      # Get the page source with the "Opponent" tab content
      opponent_page_source = driver.page_source
      soup = BeautifulSoup(opponent_page_source, "html.parser")

      # Remove all table overheaders in new website
      for header in soup.find_all("tr", class_="over_header"):
            header.decompose()

      # Per game information on opponents
      data = soup.find(id = "per_game-opponent")
      df = pd.read_html(str(data))[0]
      df["Year"] = year
      dfso.append(df)      

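      # Scroll further down, then re-parse so the tables below come from the updated page
      driver.execute_script("window.scrollTo(1, 7500)")
      time.sleep(2)
      html = driver.page_source
      soup = BeautifulSoup(html, "html.parser")
      for header in soup.find_all("tr", class_="over_header"):
            header.decompose()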

      # Advanced Stats

      data = soup.find(id = "advanced-team")
      df = pd.read_html(str(data))[0]
      df["Year"] = year
      dfsa.append(df)

      # Shooting Stats

      data = soup.find(id = "shooting-team")
      df = pd.read_html(str(data))[0]
      df["Year"] = year
      dfss.append(df)


In [23]:
# Combine DFs and Convert to CSV

teamsAllStats = pd.concat(dfst)
defenseStats = pd.concat(dfso)
teamsAdvanced = pd.concat(dfsa)
shootingTeam = pd.concat(dfss)

teamsAllStats.to_csv("data/teamAllStats.csv")
teamsAdvanced.to_csv("data/teamAdvanced.csv")
defenseStats.to_csv("data/teamDefense.csv")
shootingTeam.to_csv("data/shooting.csv")
