# Web Scraping NBA Data

The goal of this notebook is to scrape NBA player data from the Basketball Reference website (www.basketball-reference.com) for later analysis.

In [1]:
# Import the necessary libraries for Web Scraping the NBA player data
import sys
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Initial Testing

The player data is accessed through https://www.basketball-reference.com/players. The final code at the end of this notebook loops over each letter of the alphabet, representing the first letter of a player's last name, and appends it to the end of the URL. The page for each letter provides biographical data for each player, such as birth date, college attended, and position played. To start, however, we test only last names beginning with the letter 'A'.

In [2]:
# Establish the url with the data
first_url = 'https://www.basketball-reference.com/players/a'
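
The letter-by-letter URLs described above can be sketched with the standard library alone. The names `BASE_URL` and `letter_urls` are illustrative and do not appear elsewhere in this notebook, and the root-relative link format passed to `urljoin` is an assumption about how the index pages link to players.

```python
import string
from urllib.parse import urljoin

# Site root, taken from the URLs used in this notebook
BASE_URL = 'https://www.basketball-reference.com'

# One index page per lowercase letter of the alphabet
letter_urls = [f'{BASE_URL}/players/{letter}' for letter in string.ascii_lowercase]

# Join an (assumed) root-relative player link onto the site root
player_url = urljoin(BASE_URL, '/players/a/abdelal01.html')

print(letter_urls[0])
print(player_url)
```

Using `urljoin` rather than string concatenation avoids having to track whether a link starts with a slash.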

In [3]:
# Perform a get request for the website
first_page = requests.get(first_url)

In [4]:
# Use BeautifulSoup to parse the HTML data
first_soup = BeautifulSoup(first_page.content, 'html.parser')
# The following variable will contain the HTML code that has the NBA player data
first_table = first_soup.find_all('tr')
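
Because `find_all('tr')` returns every row on the page, header rows included, it can help to keep only the rows that actually link to a player. The sketch below runs on a made-up miniature table, not the real page, and `player_rows` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up miniature of an index page; the real markup is more elaborate
SAMPLE_HTML = """<table>
<thead><tr><th>Player</th><th>From</th></tr></thead>
<tbody>
<tr><th><a href="/players/a/abdelal01.html">Alaa Abdelnaby</a></th><td>1991</td></tr>
<tr class="thead"><th>Player</th><th>From</th></tr>
</tbody></table>"""

def player_rows(soup):
    """Yield (name, href, cells) for each row that links to a player page."""
    for row in soup.find_all('tr'):
        link = row.find('a', href=True)
        if link is None:
            continue  # header or spacer row without a player link
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        yield link.get_text(strip=True), link['href'], cells

sample_rows = list(player_rows(BeautifulSoup(SAMPLE_HTML, 'html.parser')))
print(sample_rows)
```

Filtering on the presence of a link, rather than on the row's position, means a repeated header row in the middle of the table would not raise an error.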

In [5]:
# Use the BeautifulSoup object to create column names for the DataFrame
first_head = first_soup.find('thead')
first_columns_raw = first_head.text
# Then clean up the column names and only save the relevant ones
first_columns = first_columns_raw.replace('\n', ',').split(',')[2:-2]
first_columns

['Player', 'From', 'To', 'Pos', 'Ht', 'Wt', 'Birth Date', 'Colleges']
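
Splitting the header text on newlines depends on the exact whitespace in the page. As a hedged alternative, the header cells can be read one at a time. The HTML below is a made-up fragment with only four columns, and `extract_columns` is a hypothetical helper, so this is a sketch of the idea rather than a drop-in replacement for the cell above.

```python
from bs4 import BeautifulSoup

# Made-up header fragment; the real page has more columns and markup
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Player</th><th>From</th><th>To</th><th>Pos</th></tr>
  </thead>
</table>
"""

def extract_columns(soup):
    """Return the stripped text of every header cell in the first <thead>."""
    head = soup.find('thead')
    return [th.get_text(strip=True) for th in head.find_all('th')]

sample_columns = extract_columns(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(sample_columns)
```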

After creating the BeautifulSoup object for the page listing last names that begin with the letter 'A', we run a test on one specific player's page, which provides career statistics such as games played, points per game, and Player Efficiency Rating (PER). We also grab career accolades such as MVPs, championships, and All-Star appearances.

In [6]:
# Request and parse the page for one specific player, Alaa Abdelnaby
player_url = 'https://www.basketball-reference.com/players/a/abdelal01.html'
player_page = requests.get(player_url)
player_soup = BeautifulSoup(player_page.content, 'html.parser')
player_target = player_soup.find(class_='stats_pullout')
player_career_stats = []
career_columns = []
# Loop through the target HTML data to extract relevant player data
for i, div in enumerate(player_target.find_all('div')):
  if i < 2: # skip the first two divs
    continue
  career_columns.append(div.find('strong').text) # extract column names
  player_career_stats.append(div.find_all('p')[1].text) # extract career statistics
# Extract the player's career accolades, if any
acco = player_soup.find_all(id='bling')
accolades = [a.text for a in acco[0].find_all('a')] if acco else []
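
The label and value extraction can also be sketched so that it tolerates missing elements and pairs each value with its label. The fragment below is made up: only the `stats_pullout` class name comes from this notebook, while the inner layout and the numbers are assumptions for illustration, and `career_stats` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the inner layout is an assumption, not the real page
SAMPLE_HTML = """
<div class="stats_pullout">
  <div class="stat"><strong>G</strong><p>12</p><p>256</p></div>
  <div class="stat"><strong>PTS</strong><p>4.1</p><p>5.7</p></div>
</div>
"""

def career_stats(soup):
    """Map each stat label to the text of its second <p> element."""
    target = soup.find(class_='stats_pullout')
    if target is None:  # page has no summary box
        return {}
    stats = {}
    for div in target.find_all('div'):
        label = div.find('strong')
        values = div.find_all('p')
        if label is None or len(values) < 2:
            continue  # skip divs that carry no label/value pair
        stats[label.get_text(strip=True)] = values[1].get_text(strip=True)
    return stats

sample_stats = career_stats(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(sample_stats)
```

Keying the values by label means a page with a missing or extra statistic would not silently shift values into the wrong column.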

After working out the code for one specific player's page, we can now test it on every player whose last name begins with the letter 'A'.

As a reminder, the biographical data comes from the index page listing all players whose last names start with a given letter, while the career statistics and accolades come from each player's own page. We then combine the data from the two pages into one row per player.

In [7]:
# Loop through the table to extract the data for each player
first_players = []
career_columns = []
for i, row in enumerate(first_table[1:]): # skip the header row
  # Request and parse the player's own page for career statistics
  player_link = row.find('a', href=True)['href']
  player_url = f"https://www.basketball-reference.com/{player_link.lstrip('/')}"
  player_page = requests.get(player_url)
  player_soup = BeautifulSoup(player_page.content, 'html.parser')
  player_target = player_soup.find(class_='stats_pullout')
  # Grab the player's career accolades if they have any
  acco = player_soup.find_all(id='bling')
  accolades = [a.text for a in acco[0].find_all('a')] if acco else []
  # Grab the career statistics, skipping the first two divs
  player_career_stats = []
  for j, div in enumerate(player_target.find_all('div')):
    if j < 2:
      continue
    if i == 0: # only grab the column names once, from the first player's page
      career_columns.append(div.find('strong').text)
    player_career_stats.append(div.find_all('p')[1].text)
  # Grab the player's name and biographical data from the index table row
  player_name = [th.text for th in row.find_all('th')]
  player_stats = [td.text for td in row.find_all('td')]
  first_players.append(player_name + player_stats + player_career_stats + accolades)
# Create a Pandas DataFrame of the player data
full_columns = first_columns + career_columns
first_df = pd.DataFrame(first_players)
first_df.rename(columns=dict(enumerate(full_columns)), inplace=True)
first_df.head()

Unnamed: 0,Player,From,To,Pos,Ht,Wt,Birth Date,Colleges,G,G.1,...,24,25,26,27,28,29,30,31,32,33
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240,"June 24, 1968",Duke,256,256,...,,,,,,,,,,
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235,"April 7, 1946",Iowa State,505,505,...,,,,,,,,,,
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225,"April 16, 1947",UCLA,1560,1560,...,1975-76 TRB Champ,4x BLK Champ,6x NBA Champ,15x All-NBA,11x All-Defensive,1969-70 All-Rookie,1969-70 ROY,2x Finals MVP,6x MVP,NBA 75th Anniv. Team
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162,"March 9, 1969",LSU,586,586,...,,,,,,,,,,
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223,"November 3, 1974","Michigan, San Jose State",236,236,...,,,,,,,,,,
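
The numbered trailing columns in the output above hold the accolades: each player's row is a list whose length depends on how many accolades were found, and pandas pads the shorter rows with missing values. One possible alternative, sketched here with made-up rows and a hypothetical `Accolades` column, is to join the accolades into a single string so every row has the same width.

```python
import pandas as pd

# Made-up rows: fixed fields followed by a variable number of accolades
rows = [
    ['Player A', '256'],
    ['Player B', '1560', '6x MVP', '6x NBA Champ'],
]

# Ragged rows: pandas pads the shorter row with missing values
ragged_df = pd.DataFrame(rows)

# Alternative: keep the fixed fields and join the accolades into one column
n_fixed = 2
joined_rows = [row[:n_fixed] + ['; '.join(row[n_fixed:])] for row in rows]
joined_df = pd.DataFrame(joined_rows, columns=['Player', 'G', 'Accolades'])

print(ragged_df.shape)
print(joined_df)
```

Whether one column or many is preferable depends on the later analysis; the joined form is easier to name and save, while separate columns are easier to filter on.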


Since the test on players whose last names start with the letter 'A' was successful, we now run the code for every letter of the alphabet.

## Collect Data for Every Player
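
The loop in this section requests one page per player, which adds up to thousands of requests. A pause between requests is a common courtesy and may be needed to avoid being blocked. The class below is a minimal sketch: `Throttle` is a hypothetical helper, and the interval used is an assumption that should be set according to the site's own usage policy.

```python
import time

class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between calls
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough that calls are min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Short demonstration with a tiny, assumed interval
throttle = Throttle(0.01)
for _ in range(3):
    throttle.wait()  # in the scraper, requests.get(...) would follow this call
```

Calling `throttle.wait()` immediately before each `requests.get(...)` would space out both the index-page and player-page requests.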

In [8]:
import string

players = [] # initialize list to save the player data to
for letter in string.ascii_lowercase: # loop over every letter of the alphabet
  url = f'https://www.basketball-reference.com/players/{letter}'
  page = requests.get(url)
  soup = BeautifulSoup(page.content, 'html.parser')
  table = soup.find_all('tr')
  for row in table[1:]: # skip the header row
    # Request and parse the player's own page
    player_link = row.find('a', href=True)['href']
    player_url = f"https://www.basketball-reference.com/{player_link.lstrip('/')}"
    player_page = requests.get(player_url)
    player_soup = BeautifulSoup(player_page.content, 'html.parser')
    player_target = player_soup.find(class_='stats_pullout')
    # Grab the player's career accolades if they have any
    acco = player_soup.find_all(id='bling')
    accolades = [a.text for a in acco[0].find_all('a')] if acco else []
    # Grab the career statistics, skipping the first two divs
    player_career_stats = [div.find_all('p')[1].text
                           for div in player_target.find_all('div')[2:]]
    # Grab the player's name and biographical data from the index table row
    player_name = [th.text for th in row.find_all('th')]
    player_stats = [td.text for td in row.find_all('td')]
    players.append(player_name + player_stats + player_career_stats + accolades)

players_df = pd.DataFrame(players)
players_df.rename(columns=dict(enumerate(full_columns)), inplace=True)
players_df

Unnamed: 0,Player,From,To,Pos,Ht,Wt,Birth Date,Colleges,G,G.1,...,25,26,27,28,29,30,31,32,33,34
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240,"June 24, 1968",Duke,256,256,...,,,,,,,,,,
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235,"April 7, 1946",Iowa State,505,505,...,,,,,,,,,,
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225,"April 16, 1947",UCLA,1560,1560,...,4x BLK Champ,6x NBA Champ,15x All-NBA,11x All-Defensive,1969-70 All-Rookie,1969-70 ROY,2x Finals MVP,6x MVP,NBA 75th Anniv. Team,
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162,"March 9, 1969",LSU,586,586,...,,,,,,,,,,
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223,"November 3, 1974","Michigan, San Jose State",236,236,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5018,Ante Žižić,2018,2020,F-C,6-10,266,"January 4, 1997",,113,113,...,,,,,,,,,,
5019,Jim Zoet,1983,1983,C,7-1,240,"December 20, 1953",Kent State University,7,7,...,,,,,,,,,,
5020,Bill Zopf,1971,1971,G,6-1,170,"June 7, 1948",Duquesne,53,53,...,,,,,,,,,,
5021,Ivica Zubac,2017,2022,C,7-0,240,"March 18, 1997",,360,360,...,,,,,,,,,,


Now that we have the data for every player who has played in the NBA, we save it so that it can be cleaned and analyzed later.

In [9]:
# Save the data to a csv file to access later
players_df.to_csv('/content/drive/MyDrive/NBA_players_data.csv', index=False)
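
When the saved file is read back for cleaning, two details of the CSV round trip are worth keeping in mind: `index=False` keeps the row index out of the file, and `read_csv` renames repeated header labels by appending a numeric suffix. The sketch below uses a made-up one-row frame and an in-memory buffer in place of the Drive path.

```python
import io
import pandas as pd

# Made-up frame with a repeated column label
df = pd.DataFrame([['Player A', 256, 256]], columns=['Player', 'G', 'G'])

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
buffer.seek(0)

# On reload, the second 'G' column is renamed to 'G.1'
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```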