Predict who is going to be the NBA MVP: web scraping NBA data

We need data on players, such as their per-game statistics for each season

Take data from https://www.basketball-reference.com/

Take a look at https://www.basketball-reference.com/awards/awards_2020.html and inspect the page before loading any data, to find the elements needed for the scraping

In [1]:
years = list(range(1991, 2022))  # seasons we want data for (1991 through 2021)

In [2]:
import requests

In [3]:
from bs4 import BeautifulSoup
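
The cells that follow read pages from a local `mvp/` folder, but the download step itself is not shown. A minimal sketch of it, assuming one file per season named `mvp/<year>.html` and the awards URL pattern linked above. The function name `save_award_pages`, the `delay` pause and the swappable `get` argument are illustrative choices, not part of the original notebook:

```python
import time
from pathlib import Path

import requests

# URL pattern taken from the awards page linked above
URL_TEMPLATE = "https://www.basketball-reference.com/awards/awards_{}.html"


def save_award_pages(years, out_dir="mvp", delay=5.0, get=requests.get):
    """Download one awards page per season and save it as <out_dir>/<year>.html.

    `delay` (seconds between requests) and `get` are hypothetical knobs:
    the pause is only a guess at polite crawling, and `get` lets the HTTP
    call be swapped out, for example in a test.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for year in years:
        response = get(URL_TEMPLATE.format(year))
        response.raise_for_status()  # fail on 4xx/5xx instead of saving an error page
        path = out / f"{year}.html"
        path.write_text(response.text, encoding="utf8")
        saved.append(path)
        time.sleep(delay)
    return saved
```

Calling `save_award_pages(years)` once would fill the `mvp/` folder, so later cells can re-parse the saved files without hitting the site again.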

In [None]:
# assumes the 1991 awards page was already saved locally as mvp/1991.html
with open("mvp/1991.html", encoding="utf8") as f:
    page = f.read()

In [None]:
soup = BeautifulSoup(page, "html.parser")

In [None]:
# By inspecting the table we can see that the first row is not necessary, let's remove it
# the row we want is identified by <tr class="over_header">
soup.find('tr', class_="over_header").decompose()
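
To see what `decompose()` does, here is a small sketch on a hand-written table. The markup is a hypothetical, heavily simplified stand-in for the real awards table, and the player names are placeholders:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the awards table markup
html = """
<table id="mvp">
  <thead>
    <tr class="over_header"><th colspan="2">Voting</th></tr>
    <tr><th>Player</th><th>Share</th></tr>
  </thead>
  <tbody>
    <tr><td>Player A</td><td>0.9</td></tr>
  </tbody>
</table>
"""

demo_soup = BeautifulSoup(html, "html.parser")
header_rows_before = len(demo_soup.find("thead").find_all("tr"))

# decompose() removes the tag and everything inside it from the tree, in place
demo_soup.find("tr", class_="over_header").decompose()

header_rows_after = len(demo_soup.find("thead").find_all("tr"))
remaining_headers = [th.get_text() for th in demo_soup.find("thead").find_all("th")]
print(header_rows_before, header_rows_after, remaining_headers)
```

Only the row with the real column names is left in the header, which is what `pd.read_html` needs to produce single-level column labels.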

In [None]:
# We only want one table from each page, let's drop all the other stuff
# find() returns the single element with id="mvp" (the loop below uses the same call)
mvp_table = soup.find(id="mvp")

In [4]:
import pandas as pd

In [None]:
# read_html returns a list of DataFrames; select the first element to get our table
mvp_1991 = pd.read_html(str(mvp_table))[0]
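
Recent pandas releases (2.1 and later) emit a deprecation warning when `read_html` is given a literal HTML string. Wrapping the string in `io.StringIO` avoids that. A minimal sketch on a toy table, where the markup and names are placeholders rather than real data:

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature table; the real one has many more columns
table_html = """
<table id="mvp">
  <thead><tr><th>Player</th><th>Share</th></tr></thead>
  <tbody>
    <tr><td>Player A</td><td>0.9</td></tr>
    <tr><td>Player B</td><td>0.1</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(table_html))
mvp_demo = tables[0]
print(mvp_demo)
```

In the notebook's cells the equivalent change would be `pd.read_html(StringIO(str(mvp_table)))[0]`.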

In [None]:
mvp_1991

In [None]:
dfs = []
for year in years:
    with open("mvp/{}.html".format(year), encoding="utf8") as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_="over_header").decompose()
    mvp_table = soup.find(id="mvp")
    mvp = pd.read_html(str(mvp_table))[0]
    mvp["Year"] = year
    
    dfs.append(mvp)

In [None]:
mvps = pd.concat(dfs)
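
Note that `pd.concat(dfs)` keeps each season's own row labels, so the index values repeat from one season to the next. A sketch of the alternative with `ignore_index=True`, followed by a quick sanity check that picks the top vote share per season. The frames are toy data, and the `Share` column name is an assumption about the scraped table:

```python
import pandas as pd

# Two toy per-season frames standing in for the scraped tables
dfs_demo = [
    pd.DataFrame({"Player": ["Player A", "Player B"], "Share": [0.9, 0.1], "Year": 1991}),
    pd.DataFrame({"Player": ["Player C", "Player D"], "Share": [0.3, 0.7], "Year": 1992}),
]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1 for each season
mvps_demo = pd.concat(dfs_demo, ignore_index=True)

# Row with the highest vote share in each season
top_rows = mvps_demo.groupby("Year")["Share"].idxmax()
top_by_year = mvps_demo.loc[top_rows, ["Year", "Player"]]
print(top_by_year)
```

With a unique index, `idxmax` points at exactly one row per season; with repeated labels the same lookup would return several rows.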

In [None]:
mvps.head()

In [None]:
mvps.to_csv("mvps.csv")

Now we have data on every player who received MVP votes in each season

We also need the stats of all the other players, to see what makes the MVP candidates stand out

Per-game stats for all NBA players can be found at https://www.basketball-reference.com/leagues/NBA_2021_per_game.html (2021 shown as an example)

This page relies on the browser to render all of its data, so fetching it with a plain HTTP request, as `requests` does, will not put the whole page into the `1991.html` file

We need to load the page the way a browser does. To do so we will use Selenium, which drives a real Chrome instance through ChromeDriver, so we need to know which Chrome release we are using in order to pick the matching driver.

For example, I am using Chrome 113

In [None]:
from selenium import webdriver

In [None]:
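# Selenium 4 takes the driver location through a Service object; the
# executable_path argument was removed in Selenium 4.10 (on Selenium 3,
# keep executable_path=... instead). The path must point to the chromedriver
# executable itself.
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service("/Users/sirja/Codes/chromedriver_win32"))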
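
The step that writes the `player/<year>.html` files is not shown in the cells. A minimal sketch of it, assuming a Selenium `driver` like the one created above and the per-game URL pattern linked earlier. The function name `save_rendered_pages`, the scroll call and the `wait` pause are illustrative guesses, not the author's exact procedure:

```python
import time
from pathlib import Path

# URL pattern taken from the per-game stats page linked above
PLAYER_URL = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"


def save_rendered_pages(driver, years, out_dir="player", wait=2.0):
    """Open each season's per-game page in the browser and save the rendered HTML."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for year in years:
        driver.get(PLAYER_URL.format(year))
        # scroll to the bottom in case some content only loads on scroll (an assumption)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(wait)  # give the page a moment to finish rendering
        path = out / f"{year}.html"
        # page_source is the HTML as the browser currently sees it
        path.write_text(driver.page_source, encoding="utf8")
        saved.append(path)
    return saved
```

Passing the driver in as an argument keeps the function independent of how Chrome was started; `save_rendered_pages(driver, years)` would fill the `player/` folder.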

In [None]:
year = 1991
with open("player/{}.html".format(year), encoding="utf8") as f:
    page = f.read()

soup = BeautifulSoup(page, "html.parser")
# the header row is repeated inside the table: remove every copy, not only the first
for row in soup.find_all('tr', class_="thead"):
    row.decompose()
player_table = soup.find(id="per_game_stats")
player = pd.read_html(str(player_table))[0]
player["Year"] = year

In [None]:
player.head()

In [None]:
year = 2021
with open("player/{}.html".format(year), encoding="utf8") as f:
    page = f.read()

soup = BeautifulSoup(page, "html.parser")
# the header row is repeated inside the table: remove every copy, not only the first
for row in soup.find_all('tr', class_="thead"):
    row.decompose()
player_table = soup.find(id="per_game_stats")
player = pd.read_html(str(player_table))[0]
player["Year"] = year

In [None]:
player.tail()

In [None]:
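del player
dfs = []  # start a fresh list so player tables are not mixed with the MVP ones
for year in years:
    with open("player/{}.html".format(year), encoding="utf8") as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")
    # the header row is repeated inside the table: remove every copy, not only the first
    for row in soup.find_all('tr', class_="thead"):
        row.decompose()
    player_table = soup.find(id="per_game_stats")
    player = pd.read_html(str(player_table))[0]
    player["Year"] = year

    dfs.append(player)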

In [None]:
players = pd.concat(dfs)
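
If any repeated header rows survive in a scraped table, they show up as rows whose `Player` cell is literally `Player`, and they force the numeric columns to be read as text. A sketch of a check and fix on toy data, where the `Player` and `PTS` column names are assumptions about the per-game table:

```python
import pandas as pd

# Toy frame with one leftover header row mixed into the data (placeholder values)
players_demo = pd.DataFrame(
    {"Player": ["Player A", "Player", "Player B"], "PTS": ["20.1", "PTS", "11.3"]}
)

# Keep only real data rows, then restore the numeric dtype the stray header text prevented
clean = players_demo[players_demo["Player"] != "Player"].copy()
clean["PTS"] = pd.to_numeric(clean["PTS"])
print(clean)
```

Running the same filter on `players` and comparing the row counts before and after is a quick way to confirm that no header rows slipped through.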

In [None]:
players

In [None]:
players.to_csv("players.csv")

Team record matters a lot in the MVP race

We need this data to predict the MVP well

Take data from https://www.basketball-reference.com/leagues/NBA_2021_standings.html; specifically, we are going to use the data in the Division Standings tables

In [None]:
year = 1991
with open("team/{}.html".format(year), encoding="utf8") as f:
    page = f.read()

soup = BeautifulSoup(page, "html.parser")
# remove every in-table header row (class "thead") so only team rows remain
for row in soup.find_all("tr", class_="thead"):
    row.decompose()
team_table = soup.find(id="divs_standings_E")
team = pd.read_html(str(team_table))[0]
team["Year"] = year
# the first column is named after the conference; store it as "Team" instead
team["Team"] = team["Eastern Conference"]
del team["Eastern Conference"]

In [None]:
team

In [8]:
dfs = []
for year in years:
    with open("team/{}.html".format(year), encoding="utf8") as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")
    # remove every in-table header row (class "thead") so only team rows remain
    for row in soup.find_all("tr", class_="thead"):
        row.decompose()

    # one standings table per conference; its first column is named after the conference
    conferences = [
        ("divs_standings_E", "Eastern Conference"),
        ("divs_standings_W", "Western Conference"),
    ]
    for table_id, conference in conferences:
        team_table = soup.find(id=table_id)
        team = pd.read_html(str(team_table))[0]
        team["Year"] = year
        team["Team"] = team[conference]
        del team[conference]
        dfs.append(team)

In [9]:
teams = pd.concat(dfs)

In [10]:
teams.to_csv("teams.csv")
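
Before joining this table with the player data, the team names may need a small cleanup. A sketch, assuming the standings mark some teams with a trailing `*` (worth verifying against the saved pages); the frame below is toy data with placeholder names:

```python
import pandas as pd

# Toy frame; the names and win totals are placeholders, not real standings
teams_demo = pd.DataFrame({"Team": ["Team A*", "Team B", "Team C*"], "W": [56, 30, 48]})

# regex=False treats "*" as a literal character rather than a regex quantifier
teams_demo["Team"] = teams_demo["Team"].str.replace("*", "", regex=False)
print(teams_demo["Team"].tolist())
```

Without `regex=False`, a bare `"*"` is not a valid regular expression, so being explicit avoids an error on pandas versions that treat the pattern as a regex.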