# Web Scraping NBA Data

The goal of this Notebook is to scrape NBA player per-game statistics from the Basketball Reference website.

In [1]:
# Import the necessary libraries for Web Scraping the NBA player data
import sys
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

I begin this Notebook by testing my code on a single season before running it over multiple seasons. The season I test on is 2021-2022, the most recent NBA season.

In [2]:
# Establish the url with the data
first_url = "https://www.basketball-reference.com/leagues/NBA_2022_per_game.html"

In [3]:
# Perform a get request for the website
first_page = requests.get(first_url)

In [4]:
# Use BeautifulSoup to parse the HTML data
first_soup = BeautifulSoup(first_page.content, 'html.parser')
# The following variable will contain the HTML code that has the NBA player data
first_table = first_soup.find_all(class_='full_table')

In [5]:
# Use the BeautifulSoup object to create column names for the DataFrame
first_head = first_soup.find(class_='thead')
first_columns_raw = first_head.text
# Then clean up the column names and only save the relevant ones
first_columns = first_columns_raw.replace('\n', ',').split(',')[2:-1]
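Splitting the header text on newlines works here, but it depends on the exact whitespace in the page's HTML. A minimal sketch of an alternative is to read the header cells directly. It assumes each column name sits in its own `th` element inside the header row; the HTML snippet is a made-up stand-in for the real page, and `header_names` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# A tiny, made-up stand-in for the header row of the real page
sample_html = """
<table>
  <thead>
    <tr class="thead">
      <th>Rk</th><th>Player</th><th>Pos</th><th>Age</th><th>PTS</th>
    </tr>
  </thead>
</table>
"""

def header_names(soup, skip_first=True):
    """Return the text of each <th> in the first element with class 'thead'."""
    head = soup.find(class_='thead')
    names = [th.get_text(strip=True) for th in head.find_all('th')]
    # Assumption: the first header (Rk) has no matching <td> in the data rows,
    # so it is dropped to line up with rows built from <td> cells only
    return names[1:] if skip_first else names

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_columns = header_names(sample_soup)
print(sample_columns)
```

On the real page this would be called as `header_names(first_soup)`, provided the header row has the structure assumed above.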

In [6]:
# Loop through the table to extract the data for each player
first_players = []
for row in first_table:
    # Each <td> cell in the row holds one value for the player
    first_players.append([td.text for td in row.find_all('td')])
# Create a Pandas DataFrame of the player data and add the year as a column
first_df = pd.DataFrame(first_players, columns = first_columns)
first_df['Year'] = 2022
first_df.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,0.439,...,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1,2022
1,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,0.547,...,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9,2022
2,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,0.557,...,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1,2022
3,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,0.402,...,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1,2022
4,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,0.55,...,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9,2022
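Because every value comes from `td.text`, each column in the DataFrame holds strings, including the statistics. A minimal sketch of one way to convert them uses `pd.to_numeric` on a small sample frame. The sample rows and the choice of which columns stay as text are assumptions for illustration.

```python
import pandas as pd

# Small sample in the same shape as the scraped rows: every value is a string
sample_df = pd.DataFrame(
    [['Precious Achiuwa', 'C', '22', '73', '28'],
     ['Steven Adams', 'C', '28', '76', '75'],
     ['Curly Armstrong', 'G-F', '31', '63', '']],
    columns=['Player', 'Pos', 'Age', 'G', 'GS'],
)

text_columns = ['Player', 'Pos']  # columns that should stay as text
numeric_columns = [c for c in sample_df.columns if c not in text_columns]

# errors='coerce' turns empty strings and other non-numeric text into NaN
sample_df[numeric_columns] = sample_df[numeric_columns].apply(
    pd.to_numeric, errors='coerce'
)
print(sample_df.dtypes)
```

Columns with missing values, such as `GS` here, end up as floats because `NaN` is a floating-point value.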


Since the first test on the 2021-2022 season was successful, I now modify the code slightly to run over multiple seasons. The first season that Basketball Reference has data on is 1949-1950, so I collect data from that season through the most recent one.

In [7]:
# Create a list of the years that we want data from
years = np.arange(1950, 2023).tolist()

In [8]:
# This for loop will loop over all of the years where there is data
players_list = []
for year in years:
    url = f'https://www.basketball-reference.com/leagues/NBA_{year}_per_game.html' # insert the year into the webpage
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all(class_='full_table')
    head = soup.find(class_='thead')
    columns_raw = head.text  # use this year's header, not the 2022 header from the test above
    columns_clean = columns_raw.replace('\n',',').split(',')[2:-1]
    players = []
    for row in table:
        # Each <td> cell in the row holds one value for the player
        players.append([td.text for td in row.find_all('td')])
    # Create a DataFrame for the specific year
    year_df = pd.DataFrame(players, columns=columns_clean)
    year_df['Year'] = year
    players_list.append(year_df)
# Create a DataFrame for all of the player data over all of the seasons
players_df = pd.concat(players_list)
players_df.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,Curly Armstrong,G-F,31,FTW,63,,,2.3,8.2,0.279,...,,,,2.8,,,,3.4,7.3,1950
1,Cliff Barker,SG,29,INO,49,,,2.1,5.6,0.372,...,,,,2.2,,,,2.0,5.7,1950
2,Leo Barnhorst,SF,25,CHS,67,,,2.6,7.4,0.349,...,,,,2.1,,,,2.9,6.5,1950
3,Ed Bartels,F,24,TOT,15,,,1.5,5.7,0.256,...,,,,1.3,,,,1.9,4.2,1950
4,Ralph Beard,G,22,INO,60,,,5.7,15.6,0.363,...,,,,3.9,,,,2.2,14.9,1950
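The loop above sends one request per season back to back and does not check whether each request succeeded. A minimal sketch of a more cautious fetch step pauses between requests and records failures. `fetch_pages` and its parameters are hypothetical, `get` stands for any callable that behaves like `requests.get`, and the default delay is an arbitrary assumption, not a documented limit of the site.

```python
import time

URL_TEMPLATE = 'https://www.basketball-reference.com/leagues/NBA_{year}_per_game.html'

def fetch_pages(years, get, delay_seconds=3.0, sleep=time.sleep):
    """Fetch one page per year, pausing between requests.

    `get` is any callable that behaves like requests.get: it takes a URL and
    returns an object with `status_code` and `content` attributes.
    Returns a dict mapping year -> page content for successful requests,
    and a list of (year, status_code) pairs for the failed ones.
    """
    pages, failures = {}, []
    for position, year in enumerate(years):
        if position > 0:
            sleep(delay_seconds)  # wait before every request except the first
        response = get(URL_TEMPLATE.format(year=year))
        if response.status_code == 200:
            pages[year] = response.content
        else:
            failures.append((year, response.status_code))
    return pages, failures
```

In this Notebook it could be called as `pages, failures = fetch_pages(years, requests.get)`, with each `pages[year]` then passed to `BeautifulSoup` as before.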


After creating the DataFrame with all of the player data available, I do some quick analysis to get an idea of what's in the data.

In [9]:
# Use .value_counts() to get an idea of the number of players for each season
players_df['Year'].value_counts()

2022    605
2021    540
2018    540
2019    530
2020    529
       ... 
1958     99
1957     99
1961     93
1956     92
1959     92
Name: Year, Length: 73, dtype: int64

In [10]:
# Use .value_counts() to see how many seasons each player played in,
# although some counts will be off: a player with several rows in one season
# is counted more than once, and different players who share a name are merged
players_df['Player'].value_counts()

Eddie Johnson     27
Mike Dunleavy     26
George Johnson    24
Vince Carter      22
Robert Parish*    21
                  ..
Trey Thompkins     1
Tony Jackson       1
Lee Johnson        1
Walter Jordan      1
Omer Yurtseven     1
Name: Player, Length: 4488, dtype: int64
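`value_counts()` counts rows, so a player with more than one row in a season would be counted more than once. A minimal sketch of counting distinct seasons per name instead, using made-up sample rows:

```python
import pandas as pd

# Made-up rows: 'Player A' appears twice in 2021 (for example, one row per team)
sample = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player A', 'Player B'],
    'Year':   [2021, 2021, 2022, 2022],
})

rows_per_player = sample['Player'].value_counts()
seasons_per_player = sample.groupby('Player')['Year'].nunique()

print(rows_per_player['Player A'])     # 3 rows
print(seasons_per_player['Player A'])  # 2 distinct seasons
```

This still groups by name only, so two different players with the same name would be merged into one count.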

After this quick look at the NBA player data, I save it to a CSV file.

In [11]:
# Save the combined DataFrame to a CSV file (the path assumes Google Drive is mounted in Colab)
players_df.to_csv('/content/drive/MyDrive/NBA_players_data.csv')
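By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. Since `pd.concat` keeps each season's own index, that column carries little information here. A minimal sketch of saving without the index, using a made-up sample frame and a temporary file with a hypothetical name:

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({'Player': ['Player A', 'Player B'], 'Year': [2021, 2022]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_players.csv')  # hypothetical file name
    sample.to_csv(path, index=False)  # do not write the row index as a column
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
```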