In [33]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

We start by creating a list of all the years with the range function. Because range excludes its end value, I use range(1991, 2023) to generate the years 1991 through 2022. Passing end=" " to print replaces the trailing newline with a space, which keeps the output compact so you can confirm the list at a glance.

In [39]:
years = list(range(1991, 2023))
print(years, end=" ")

[1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022] 

The next step is to find the page with the data we need and assign its URL to a variable. The original URL has this format: https://www.basketball-reference.com/awards/awards_1991.html. Replacing the year in the URL with a pair of curly braces gives a template that can be filled with any year between 1991 and 2022 to reach that year's page.

In [43]:
url_start = "https://www.basketball-reference.com/awards/awards_{}.html"

The curly braces in the URL are a placeholder for the year, which we fill in as we iterate through the years and download each HTML page. We will use the requests library to request each page from the website.
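As a small illustration of how the placeholder is filled, the sketch below formats the template for the first three years only. The name urls is used just for this example.

```python
url_start = "https://www.basketball-reference.com/awards/awards_{}.html"

# str.format replaces the {} placeholder with the value passed in
urls = [url_start.format(year) for year in range(1991, 1994)]

for url in urls:
    print(url)
```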

Create a folder called MVP (or a name you prefer). Make sure it is in the same directory as your notebook; otherwise, you will have to write the full path to the folder.
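If you would rather create the folder from the notebook than by hand, a minimal sketch with the standard library's pathlib could look like this. The helper name ensure_folder is hypothetical and not part of the original project.

```python
from pathlib import Path

def ensure_folder(name, base="."):
    """Create the folder `name` under `base` if it does not exist yet."""
    folder = Path(base) / name
    # exist_ok=True makes the call safe to repeat on later runs
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Calling ensure_folder("MVP") before the download loop would create the folder next to the notebook, assuming the notebook runs from its own directory.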

In [44]:
import requests  # make requests to the website to download its pages
import time

for year in years:
    # create the URL for a specific year
    url = url_start.format(year)
    data = requests.get(url)

    # "w+" opens the file for writing and overwrites it if it already exists
    with open("MVP/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(data.text)  # .text is the response body decoded as a string

    # pause between requests to avoid overloading the server
    time.sleep(3)

To avoid character encoding errors, specify the encoding as utf-8. The time.sleep(3) call pauses for 3 seconds after each year’s page is downloaded, so that we do not overload the website’s server with requests. You can increase the number of seconds if you wish.
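Because the loop re-downloads every page each time it runs, one possible refinement is to skip years that are already saved. The sketch below is an assumption about how that could be structured, not the author's code: download_pages is a hypothetical helper, and fetch is any function that takes a URL and returns the page text, which keeps the example runnable without network access.

```python
import time
from pathlib import Path

def download_pages(years, url_template, folder, fetch, delay=3):
    """Save one HTML file per year, skipping years already on disk.

    `fetch` is any callable that takes a URL and returns the page text.
    """
    folder = Path(folder)
    downloaded = []
    for year in years:
        target = folder / "{}.html".format(year)
        if target.exists():
            continue  # already saved on an earlier run, no request needed
        html = fetch(url_template.format(year))
        target.write_text(html, encoding="utf-8")
        downloaded.append(year)
        time.sleep(delay)  # pause between requests to go easy on the server
    return downloaded
```

With requests, fetch could be a small function that calls requests.get(url), then raise_for_status() on the response so that failed requests are not saved as pages, and finally returns the response's text.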

We parse the table using the Beautiful Soup library. We will first parse and extract data from a single page to ensure that we are doing it correctly, and then repeat for the remaining years.

In [45]:
# import beautiful soup
from bs4 import BeautifulSoup

# read the HTML data
with open("MVP/1991.html", encoding="utf-8") as f:
    page = f.read()

# create a parser class to extract table from the page
soup = BeautifulSoup(page, "html.parser")

To scrape the data, we need to find the tag elements of the table. We do this by inspecting the webpage. Right-click anywhere on the webpage and select Inspect. This displays the HTML code for the page and when you hover over it, it will highlight different elements of the page.

From the table, we can see that there is an extra header row. We need to remove this row because when we load the table data in pandas, it will create an extra header row that is unnecessary. Afterward, we find the specific table whose data we want to extract.

We use Beautiful Soup’s find method to locate the header element, which is a <tr> tag with a class of “over_header”. Because class is a reserved keyword in Python and using it as an argument name would cause a syntax error, Beautiful Soup accepts the keyword argument class_ (with a trailing underscore) instead.

In [46]:
# remove the top row of the table
soup.find("tr", class_="over_header").decompose()
print("Header row removed successfully")

# find the specific table we want using its id
mvp_table = soup.find(id="mvp")

Header row removed successfully


The decompose method removes a tag along with its inner content. We find the specific table we want using its id, because in HTML an id should be unique within a page, so only one element should match.
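To see find, decompose, and the id lookup in isolation, here is a small sketch on a made-up HTML snippet. The HTML only imitates the structure described above. The None check is an addition: find returns None when nothing matches, and calling decompose on None raises an AttributeError.

```python
from bs4 import BeautifulSoup

# made-up HTML that mimics a table with an extra header row
html = """
<table id="mvp">
  <tr class="over_header"><th colspan="2">Voting</th></tr>
  <tr><th>Rank</th><th>Player</th></tr>
  <tr><td>1</td><td>Michael Jordan</td></tr>
</table>
<table id="other"><tr><td>ignored</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first match, or None when nothing matches
over_header = soup.find("tr", class_="over_header")
if over_header is not None:
    over_header.decompose()  # removes the row and everything inside it

# look the table up by its id; the second table is left out
table = soup.find(id="mvp")
rows = table.find_all("tr")
print(len(rows))
```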

Finally, we read the table into pandas. mvp_table is a Beautiful Soup Tag object rather than a string, so we use the str() function to convert it to an HTML string first. read_html returns a list of data frames, one per table it finds, so we take the first element with [0].
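A minimal sketch of this step on a hand-written table follows. Recent pandas releases warn when read_html is given a raw HTML string and suggest wrapping it in io.StringIO, so the sketch does that. It assumes an HTML parser that pandas supports (such as lxml) is installed. The table is made up for illustration, with two rows borrowed from the 1991 data.

```python
from io import StringIO
import pandas as pd

table_html = """
<table id="mvp">
  <thead><tr><th>Rank</th><th>Player</th><th>Share</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Michael Jordan</td><td>0.928</td></tr>
    <tr><td>2</td><td>Magic Johnson</td><td>0.518</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one dataframe per <table> it finds
tables = pd.read_html(StringIO(table_html))
mvp_sample = tables[0]
print(mvp_sample)
```

In the notebook, the equivalent call would be pd.read_html(StringIO(str(mvp_table)))[0].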

In [47]:
import pandas as pd

mvp_1991 = pd.read_html(str(mvp_table))[0]

mvp_1991

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,1,Michael Jordan,27,CHI,77.0,891.0,960,0.928,82,37.0,31.5,6.0,5.5,2.7,1.0,0.539,0.312,0.851,20.3,0.321
1,2,Magic Johnson,31,LAL,10.0,497.0,960,0.518,79,37.1,19.4,7.0,12.5,1.3,0.2,0.477,0.32,0.906,15.4,0.251
2,3,David Robinson,25,SAS,6.0,476.0,960,0.496,82,37.7,25.6,13.0,2.5,1.5,3.9,0.552,0.143,0.762,17.0,0.264
3,4,Charles Barkley,27,PHI,2.0,222.0,960,0.231,67,37.3,27.6,10.1,4.2,1.6,0.5,0.57,0.284,0.722,13.4,0.258
4,5,Karl Malone,27,UTA,0.0,142.0,960,0.148,82,40.3,29.0,11.8,3.3,1.1,1.0,0.527,0.286,0.77,15.5,0.225
5,6,Clyde Drexler,28,POR,1.0,75.0,960,0.078,82,34.8,21.5,6.7,6.0,1.8,0.7,0.482,0.319,0.794,12.4,0.209
6,7,Kevin Johnson,24,PHO,0.0,32.0,960,0.033,77,36.0,22.2,3.5,10.1,2.1,0.1,0.516,0.205,0.843,12.7,0.22
7,8,Dominique Wilkins,31,ATL,0.0,29.0,960,0.03,81,38.0,25.9,9.0,3.3,1.5,0.8,0.47,0.341,0.829,11.4,0.177
8,9T,Larry Bird,34,BOS,0.0,25.0,960,0.026,60,38.0,19.4,8.5,7.2,1.8,1.0,0.454,0.389,0.891,6.6,0.14
9,9T,Terry Porter,27,POR,0.0,25.0,960,0.026,81,32.9,17.0,3.5,8.0,2.0,0.1,0.515,0.415,0.823,13.0,0.235


The output above shows that we have successfully extracted the data for 1991. So we will do the same for the rest of the years and then merge the data frames into one dataset. All the data frames will be stored in a list called all_dfs.

One observation I made is that it is impossible to tell which year each row came from, so I added a Year column to each data frame.

In [48]:
all_dfs = []
for year in years:
    # read the HTML data
    with open("MVP/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()

        # create a parser class to extract table from the page
        soup = BeautifulSoup(page, "html.parser")

        # remove the top row of the table
        soup.find("tr", class_="over_header").decompose()

        # remove all other page elements and only find the specific table we want
        mvp_table = soup.find(id="mvp")

        # read table into pandas dataframe
        mvp = pd.read_html(str(mvp_table))[0]
        
        # create year column to know where data came from
        mvp["Year"] = year

        all_dfs.append(mvp)

We then combine all of the dataframes into a single dataframe called mvps, which we save as a csv file.

In [50]:
mvps = pd.concat(all_dfs)

pd.set_option('display.max_columns', None)  # display all columns
mvps.sample(5)

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
12,13T,Dikembe Mutombo,30,ATL,0.0,4.0,1150,0.003,80,37.2,13.3,11.6,1.4,0.6,3.3,0.527,,0.705,11.3,0.183,1997
4,5,Stephen Curry,30,GSW,0.0,175.0,1010,0.173,69,33.8,27.3,5.3,5.2,1.3,0.4,0.472,0.437,0.916,9.7,0.199,2019
13,14,Dirk Nowitzki,35,DAL,0.0,7.0,1250,0.006,80,32.9,21.7,6.2,2.7,0.9,0.6,0.497,0.398,0.899,10.9,0.199,2014
6,7,Karl Malone,37,UTA,0.0,21.0,1240,0.017,81,35.7,23.2,8.3,4.5,1.1,0.8,0.498,0.4,0.793,13.1,0.217,2001
3,4,Chris Paul,27,LAC,0.0,289.0,1210,0.239,70,33.4,16.9,3.7,9.7,2.4,0.1,0.481,0.328,0.885,13.9,0.287,2013
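The row labels in the sample above repeat across years because concat keeps each data frame's own index. If that is not wanted, a sketch like the following renumbers the rows and leaves the index out of the saved CSV. The two small frames are made up for the example, and no file is written here; passing a file name of your choice to to_csv would save it to disk.

```python
import pandas as pd

# two tiny made-up frames standing in for two years of scraped data
df_1991 = pd.DataFrame({"Player": ["Michael Jordan", "Magic Johnson"], "Year": 1991})
df_1992 = pd.DataFrame({"Player": ["Michael Jordan", "Magic Johnson"], "Year": 1992})

# by default concat keeps each frame's own index, so labels repeat (0, 1, 0, 1)
repeated = pd.concat([df_1991, df_1992])

# ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([df_1991, df_1992], ignore_index=True)

# index=False leaves the row labels out of the CSV text;
# with no path given, to_csv returns the CSV as a string
csv_text = combined.to_csv(index=False)
print(csv_text)
```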


The data we have gathered so far covers only the players who received votes for the MVP award each year. We need stats for all players to determine the properties associated with players who are likely to win it. In the next part of this project, we will download the stats for all players as well as team data, and introduce Selenium for scraping pages that load their content with JavaScript.

The following is the code used to extract data using the requests library.

In [67]:
player_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

url = player_stats_url.format(1991)
data = requests.get(url)
with open("PLAYERS/1991.html", "w+", encoding="utf-8") as f:
    f.write(data.text)

The web page we intend to scrape loads its tables using JavaScript. As a result, the page downloaded with requests does not contain all the records we want: it provides only 17 rows, even though the table contains over 300. To get around this problem, we'll use Selenium.
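One way to check how many rows a saved page actually contains is to count them with Beautiful Soup. The helper below is a hypothetical sketch. It assumes the rows sit inside a <tbody> element and uses the table id per_game_stats that appears later in this notebook. The sample HTML is made up.

```python
from bs4 import BeautifulSoup

def count_body_rows(html, table_id):
    """Return the number of <tr> rows inside the <tbody> of the given table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id=table_id)
    if table is None or table.tbody is None:
        return 0  # table (or its body) is not present in this HTML
    return len(table.tbody.find_all("tr"))

# made-up HTML with a three-row table
sample = """
<table id="per_game_stats">
  <thead><tr><th>Player</th></tr></thead>
  <tbody>
    <tr><td>A</td></tr><tr><td>B</td></tr><tr><td>C</td></tr>
  </tbody>
</table>
"""
print(count_body_rows(sample, "per_game_stats"))
```

Running the same function on the text of a downloaded file would show whether the page was captured in full.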

Selenium is a free, open-source software testing framework that allows developers to automate web browser actions. It is primarily used for testing web applications, but it can also be used for web scraping.

One of the key advantages of Selenium is that it allows developers to control a real web browser, rather than just making HTTP requests like some other web scraping tools do. This means that it can interact with websites in the same way a user would, and can handle complex interactions such as JavaScript, cookies, and pop-ups.

To use Selenium, you will need to have the following prerequisites installed on your system:

A web browser
The Selenium Python library
A web driver

After installing the web driver, import selenium and create a variable that holds the driver.

In [68]:
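#!pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the driver path through a Service object;
# the executable_path argument is deprecated in Selenium 4
driver = webdriver.Chrome(service=Service("PATH TO CHROMEDRIVER EXECUTABLE"))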


Running the code above creates a new browser window that’s being controlled by selenium. We will use it to render a page with all the rows of data we need. As we did in Part 1, we will do it for one year to ensure that our program is working properly before creating a loop for all of the years.

In [57]:
import time

year = 1991

url = player_stats_url.format(year)

# render url in the browser
driver.get(url)

# add js to tell the browser to scroll down to be able to render the entire table
driver.execute_script("window.scrollTo(1, 1000)")
time.sleep(2)

# get the html of the page
html = driver.page_source

We then save the rendered HTML page, which contains all 300+ rows. First, I created a folder called PLAYERS to hold all the HTML pages we will download.

In [58]:
with open("PLAYERS/{}.html".format(year), "w+", encoding="utf-8") as f:
    f.write(html)

Open the downloaded HTML page to confirm it captured all the rows in the table. If successful, create a loop to download the pages for all the years from 1991 to 2022. The 2-second delay gives the browser time to render the many rows before we read the page source.

In [59]:
for year in years:
    url = player_stats_url.format(year)

    # render url in the browser
    driver.get(url)

    # add js to tell the browser to scroll down to be able to render the entire table
    driver.execute_script("window.scrollTo(1, 1000)")
    time.sleep(2)

    # get the html of the page
    html = driver.page_source
    
    with open("PLAYERS/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(html)

After downloading all the pages, it is now time to parse the stats with BeautifulSoup. When we look at the structure of the table, we see that the header row is repeated within the table after every 20 records.

This would be confusing once the table is loaded into pandas. Using the decompose() method, we will remove all of the header rows except the first one. Upon inspection, the repeated header rows are <tr> tags with a class of thead, while the table has an id of per_game_stats.

The dataframes are stored in a list called player_df.

In [60]:
player_df = []
for year in years:

    with open("PLAYERS/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()

    # create a parser class to extract table from the page
    soup = BeautifulSoup(page, "html.parser")

    # remove every repeated header row inside the table;
    # find() alone would only remove the first one
    for row in soup.find_all("tr", class_="thead"):
        row.decompose()

    # remove all other page elements and only find the specific table we want
    player_table = soup.find(id="per_game_stats")

    # convert the table into a string
    # you'll get a list of dataframes so just get the first index.
    player = pd.read_html(str(player_table))[0]
    player["Year"] = year
    player_df.append(player)
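If repeated header rows still end up in a data frame, they can also be filtered out in pandas. This is an alternative technique rather than what the notebook does, sketched here on a made-up frame whose column names follow the player table.

```python
import pandas as pd

# made-up frame in which the header row reappears as a data row
raw = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Player": ["Player A", "Player B", "Player", "Player C"],
        "PTS": ["31.5", "19.4", "PTS", "25.6"],
    }
)

# keep only rows whose Rk value is not the column name itself
cleaned = raw[raw["Rk"] != "Rk"].reset_index(drop=True)

# the stray header text forced the column to strings, so convert it back
cleaned["PTS"] = pd.to_numeric(cleaned["PTS"])
print(cleaned)
```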

We then use the pandas concat() function to combine all of the player stats before viewing a sample of the data to ensure it worked properly.

In [61]:
players = pd.concat(player_df)

# view sample of data
pd.set_option('display.max_columns', None)  # display all columns
players.sample(5)

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
368,267,Chris Mullin*,SF,33,GSW,79,63,34.6,5.5,10.0,0.553,1.1,2.6,0.411,4.5,7.5,0.602,0.605,2.3,2.7,0.864,0.9,3.1,4.0,4.1,1.6,0.4,2.4,2.0,14.5,1997
376,318,Bobby Phills,SG,30,CHH,28,9,29.5,5.4,12.0,0.454,1.1,3.3,0.33,4.4,8.7,0.5,0.499,1.7,2.3,0.723,0.6,1.9,2.5,2.8,1.5,0.3,1.7,2.6,13.6,2000
320,247,Jamaal Magloire,C,25,NOH,82,82,33.9,4.7,9.9,0.473,0.0,0.0,0.0,4.7,9.9,0.474,0.473,4.3,5.7,0.751,3.3,7.1,10.3,1.0,0.5,1.2,2.5,3.4,13.6,2004
251,200,Eddie House,PG,31,BOS,50,0,16.9,2.6,6.4,0.401,1.3,3.3,0.383,1.3,3.1,0.419,0.5,0.7,0.8,0.9,0.1,1.2,1.4,1.0,0.6,0.1,0.5,1.2,7.2,2010
551,390,Jeff Teague,PG,22,ATL,70,7,13.8,1.9,4.3,0.438,0.3,0.7,0.375,1.6,3.7,0.449,0.467,1.1,1.4,0.794,0.2,1.3,1.5,2.0,0.6,0.4,0.9,1.2,5.2,2011


In [62]:
players.to_csv("player_stats.csv")

The next thing we’ll do is scrape the team data. This will be important in helping us make predictions. The URL we’ll use for this is below; the curly braces are a placeholder for the year. The code downloads all the HTML pages from 1991 to 2022 into a folder called TEAM, which must already exist.

In [63]:
team_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_standings.html"

# scraping the data
for year in years:
    url = team_stats_url.format(year)
    data = requests.get(url)

    with open("TEAM/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(data.text)

There are two separate tables that we need to scrape, one per conference, and we will do this using the BeautifulSoup library. You can check the table elements using Inspect in your browser (right-click > Inspect, or Ctrl+Shift+I).

In [69]:
dfs = []
for year in years:
    with open("TEAM/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()

    soup = BeautifulSoup(page, 'html.parser')

    # find returns None when no row matches, so check before removing it
    header_row = soup.find('tr', class_="thead")
    if header_row is not None:
        header_row.decompose()

    # Eastern Conference
    e_table = soup.find_all(id="divs_standings_E")[0]
    e_df = pd.read_html(str(e_table))[0]
    e_df["Year"] = year
    e_df["Team"] = e_df["Eastern Conference"]
    del e_df["Eastern Conference"]
    dfs.append(e_df)

    # Western Conference
    w_table = soup.find_all(id="divs_standings_W")[0]
    w_df = pd.read_html(str(w_table))[0]
    w_df["Year"] = year
    w_df["Team"] = w_df["Western Conference"]
    del w_df["Western Conference"]
    dfs.append(w_df)

Calling soup.find('tr', class_="thead").decompose() directly fails with AttributeError: 'NoneType' object has no attribute 'decompose' on a page that has no such row, because find returns None when nothing matches. Checking for None first avoids that error.

In [None]:
# combine the per-conference data frames before saving
teams = pd.concat(dfs)
teams.to_csv("teams.csv")
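The two conference blocks in the loop differ only in the name of the first column, so the renaming could also be written once with DataFrame.rename. The sketch below uses made-up standings frames with placeholder team names and win counts; only the column names come from the notebook.

```python
import pandas as pd

# made-up standings frames; only the first column name differs
east = pd.DataFrame({"Eastern Conference": ["Team A", "Team B"], "W": [61, 56]})
west = pd.DataFrame({"Western Conference": ["Team C", "Team D"], "W": [63, 58]})

frames = []
for frame, conference in [(east, "Eastern Conference"), (west, "Western Conference")]:
    # rename returns a new frame with the conference column called "Team"
    renamed = frame.rename(columns={conference: "Team"})
    renamed["Year"] = 1991
    frames.append(renamed)

teams_sample = pd.concat(frames, ignore_index=True)
print(teams_sample)
```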