<a href="https://colab.research.google.com/github/GoAshim/WebScraping/blob/main/Web_Scraping_3_NBA_Player_Stats_for_Entire_Career.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape the entire career statistics of all NBA players who played in the 2021-22 season
In this web scraping exercise we revisit our first assignment and take it further. We start with the page containing the 2021-22 NBA player stats on the Basketball Reference site (link [here](https://www.basketball-reference.com/leagues/NBA_2022_totals.html)). Instead of scraping only the 2021-22 season statistics (as we did in the first exercise), we find every player who played in the 2021-22 season in the table on that page, then visit each player's individual page and scrape the statistics of their entire career.


## Summary
Basketball Reference provides the stats of all NBA players for the 2021-22 season in tabular form at the link above. We identify the table, extract the URL of every player from it, then visit each player's URL to scrape the relevant data and load it into a dataframe. At the end, the dataframe holds the entire career statistics of all NBA players who played in the 2021-22 season.

### Step 1 - Import required libraries

In [None]:
import requests # To pull data from webpage
from bs4 import BeautifulSoup # To parse data pulled from the webpage
import pandas as pd # To view, modify and store data parsed from the webpage 


### Step 2 - Extract the content of the webpage

In [None]:
url = "https://www.basketball-reference.com/leagues/NBA_2022_totals.html"

# Using requests.get to fetch the source content of the page
page_data = requests.get(url).text

# Using BeautifulSoup to parse the content with the lxml parser
soup = BeautifulSoup(page_data, "lxml")
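
The single `requests.get(url).text` call above assumes the request always succeeds. A slightly more defensive variant might look like the sketch below. The helper name `fetch_html` and the 30-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=30):
    """Return the HTML text of `url`, raising an exception on HTTP errors.

    `fetch_html` is a hypothetical helper introduced for illustration.
    """
    http = session or requests.Session()
    # A timeout keeps the notebook from hanging on a stalled connection.
    response = http.get(url, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into an exception instead of
    # silently handing an error page to BeautifulSoup.
    response.raise_for_status()
    return response.text
```

With this in place, `page_data = fetch_html(url)` could stand in for the direct call, and a failed download would surface immediately rather than as a confusing parsing error later on.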


### Step 3 - Locate the table within the starting page where the stats of all players in 2021-22 season are listed

In [None]:
# This is a manual step: I inspected the source code of the page in my web browser and identified the table where the stats are stored.
# The table can be identified with <table class="sortable stats_table" and we will use that to extract the content of the table
table_data = soup.find('table', {"class" : "sortable stats_table"})

# Then let's extract the body of the table
table_body = table_data.find('tbody')

# Now we are going to extract the rows of the table, find_all returns a list
table_rows = table_body.find_all('tr')
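
One thing worth knowing about the lookup above: when a string containing two class names is passed to `find`, BeautifulSoup matches it against the exact value of the `class` attribute, so the lookup returns `None` if the site ever adds another class to the table. A CSS selector is a more tolerant alternative. The sketch below uses a made-up HTML snippet (not the real page) and the built-in `html.parser` so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the table carries an extra class, as can happen when a site
# changes its markup.
html = """
<table class="sortable stats_table now_sortable">
  <tbody>
    <tr><th>1</th><td><a href="/players/a/achiupr01.html">Precious Achiuwa</a></td></tr>
  </tbody>
</table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: fails because of the extra class.
exact = soup_demo.find("table", {"class": "sortable stats_table"})

# CSS selector: matches any table that has both classes, in any order.
by_css = soup_demo.select_one("table.sortable.stats_table")

print(exact is None)
print(by_css is not None)
```

Checking the result for `None` before calling `.find('tbody')` on it gives a clearer error message than the `AttributeError` that would otherwise follow.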


### Step 4 - Print the name and the link of the individual page for the first 5 players in the 2021-22 season.

In [None]:
rank = 1
main_url = "https://www.basketball-reference.com"

for table_row in table_rows:
  players = []

  if table_row['class'][0] != 'thead':
    players.append(table_row.find('th').get_text())

    table_row_cells = table_row.find_all('td')
    players.append(table_row_cells[0].text)
    player_url = main_url + table_row_cells[0].find('a')['href']
    players.append(player_url)
    
    print(players)
    rank += 1

    if rank == 6:
      break


['1', 'Precious Achiuwa', 'https://www.basketball-reference.com/players/a/achiupr01.html']
['2', 'Steven Adams', 'https://www.basketball-reference.com/players/a/adamsst01.html']
['3', 'Bam Adebayo', 'https://www.basketball-reference.com/players/a/adebaba01.html']
['4', 'Santi Aldama', 'https://www.basketball-reference.com/players/a/aldamsa01.html']
['5', 'LaMarcus Aldridge', 'https://www.basketball-reference.com/players/a/aldrila01.html']
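
Plain string concatenation works here because every `href` in the table starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library builds the same URL and also copes with an `href` that is already absolute. The second `href` below is a hypothetical case used only for illustration.

```python
from urllib.parse import urljoin

main_url = "https://www.basketball-reference.com"

# A site-relative href, as found in the player cells above.
relative = urljoin(main_url, "/players/a/achiupr01.html")

# An already-absolute href (hypothetical case) is returned unchanged instead
# of being glued onto main_url a second time.
absolute = urljoin(
    main_url, "https://www.basketball-reference.com/players/a/adamsst01.html"
)

print(relative)
print(absolute)
```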


### Step 5 - Extract the entire career statistics of one player for both the regular season and, if applicable, the playoffs.

In [None]:
player_url = 'https://www.basketball-reference.com/players/a/achiupr01.html'
player_name = 'Precious Achiuwa'
player_id = 'P1'

col_names = []
for i in range(33):
  col_names.append('C' + str(i + 1))

df = pd.DataFrame(columns = col_names)

# Parse the page of the player
player_page = requests.get(player_url).text
p_soup = BeautifulSoup(player_page, "lxml")

# Get the statistics of all regular seasons
player_table = p_soup.find_all(id = "per_game")
player_table_body = player_table[0].find('tbody')
player_table_rows = player_table_body.find_all('tr')

for player_table_row in player_table_rows:

  data_row = []
  data_row.append(player_id)
  data_row.append(player_name)
  data_row.append('Regular')
  data_row.append(player_table_row.find('th').find('a').text)

  player_table_cells = player_table_row.find_all('td')
  for player_table_cell in player_table_cells:
    data_row.append(player_table_cell.text)

  df.loc[len(df.index)] = data_row

# If the player has played any playoff seasons then get the statistics of all those seasons 
player_table = p_soup.find_all(id = "playoffs_per_game")
if player_table:
  player_table_body = player_table[0].find('tbody')
  player_table_rows = player_table_body.find_all('tr')

  for player_table_row in player_table_rows:

    data_row = []
    data_row.append(player_id)
    data_row.append(player_name)
    data_row.append('Playoff')
    data_row.append(player_table_row.find('th').find('a').text)

    player_table_cells = player_table_row.find_all('td')
    for player_table_cell in player_table_cells:
      data_row.append(player_table_cell.text)

    df.loc[len(df.index)] = data_row

df.head()

Unnamed: 0,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,...,C24,C25,C26,C27,C28,C29,C30,C31,C32,C33
0,P1,Precious Achiuwa,Regular,2020-21,21,MIA,NBA,PF,61,4,...,0.509,1.2,2.2,3.4,0.5,0.3,0.5,0.7,1.5,5.0
1,P1,Precious Achiuwa,Regular,2021-22,22,TOR,NBA,C,73,28,...,0.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
2,P1,Precious Achiuwa,Playoff,2020-21,21,MIA,NBA,PF,3,0,...,0.25,0.0,2.0,2.0,0.0,0.0,0.7,1.3,0.3,2.3
3,P1,Precious Achiuwa,Playoff,2021-22,22,TOR,NBA,C,6,1,...,0.6,1.3,3.5,4.8,1.0,0.2,0.8,1.5,2.3,10.2
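
The placeholder names `C1` to `C33` keep the code simple but make the output hard to read. Assuming the stats table has a `<thead>` row whose `<th>` cells hold the column labels, the names could be read from the page instead. The sketch below runs on a made-up, shortened table; the three leading names (`PlayerID`, `Player`, `SeasonType`) are hypothetical labels for the three values the notebook prepends to every row.

```python
from bs4 import BeautifulSoup

# Made-up, shortened stand-in for a player's per-game table.
html = """
<table id="per_game">
  <thead><tr><th>Season</th><th>Age</th><th>Tm</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><th><a href="#">2020-21</a></th><td>21</td><td>MIA</td><td>5.0</td></tr>
  </tbody>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Read the labels from the header row of the table.
header_cells = demo_soup.find(id="per_game").find("thead").find_all("th")

# Hypothetical names for the three values prepended to each data row.
lead = ["PlayerID", "Player", "SeasonType"]
col_names = lead + [cell.get_text(strip=True) for cell in header_cells]

print(col_names)
```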


### Step 6 - Create a function that takes the parameters of a player and finds that player's career statistics for both the regular season and, if applicable, the playoffs. The function populates a dataframe with those statistics and returns it.

In [None]:
def getPlayerStat(player_id, player_name, player_url):
  
  col_names = []
  for i in range(33):
    col_names.append('C' + str(i + 1))

  df = pd.DataFrame(columns = col_names)

  # Parse the page of the player
  player_page = requests.get(player_url).text
  p_soup = BeautifulSoup(player_page, "lxml")

  # Get the statistics of all regular seasons
  player_table = p_soup.find_all(id = "per_game")
  player_table_body = player_table[0].find('tbody')
  player_table_rows = player_table_body.find_all('tr')

  for player_table_row in player_table_rows:

    data_row = []
    data_row.append(player_id)
    data_row.append(player_name)
    data_row.append('Regular')
    data_row.append(player_table_row.find('th').find('a').text)

    player_table_cells = player_table_row.find_all('td')
    for player_table_cell in player_table_cells:
      data_row.append(player_table_cell.text)

    df.loc[len(df.index)] = data_row

  # If the player has played any playoff seasons then get the statistics of all those seasons 
  player_table = p_soup.find_all(id = "playoffs_per_game")
  if player_table:
    player_table_body = player_table[0].find('tbody')
    player_table_rows = player_table_body.find_all('tr')

    for player_table_row in player_table_rows:

      data_row = []
      data_row.append(player_id)
      data_row.append(player_name)
      data_row.append('Playoff')
      data_row.append(player_table_row.find('th').find('a').text)

      player_table_cells = player_table_row.find_all('td')
      for player_table_cell in player_table_cells:
        data_row.append(player_table_cell.text)

      df.loc[len(df.index)] = data_row

  return df  

### Step 7 - Loop through the table on the main page containing the 2021-22 season statistics of every player. Pass the info of the first 5 players to the function created above to store each player's entire career statistics in a dataframe.

In [None]:
rank = 1
main_url = "https://www.basketball-reference.com"

col_names = []
for i in range(33):
  col_names.append('C' + str(i + 1))
all_players_all_seasons_df = pd.DataFrame(columns = col_names)

for table_row in table_rows:

  if table_row['class'][0] != 'thead':
    p_id = table_row.find('th').get_text()

    table_row_cells = table_row.find_all('td')
    p_name = table_row_cells[0].text
    p_url = main_url + table_row_cells[0].find('a')['href']

    # Call the function for each player to get the stats of all regular and playoff seasons of that player in form of a dataframe
    each_player_all_seasons_df = getPlayerStat(p_id, p_name, p_url)

    # Append the player's stats obtained above to the main dataframe holding all players' stats
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    all_players_all_seasons_df = pd.concat([all_players_all_seasons_df, each_player_all_seasons_df])

    rank += 1

    if rank == 6:
      break

all_players_all_seasons_df.head(25)

Unnamed: 0,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,...,C24,C25,C26,C27,C28,C29,C30,C31,C32,C33
0,1,Precious Achiuwa,Regular,2020-21,21,MIA,NBA,PF,61,4,...,0.509,1.2,2.2,3.4,0.5,0.3,0.5,0.7,1.5,5.0
1,1,Precious Achiuwa,Regular,2021-22,22,TOR,NBA,C,73,28,...,0.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
2,1,Precious Achiuwa,Playoff,2020-21,21,MIA,NBA,PF,3,0,...,0.25,0.0,2.0,2.0,0.0,0.0,0.7,1.3,0.3,2.3
3,1,Precious Achiuwa,Playoff,2021-22,22,TOR,NBA,C,6,1,...,0.6,1.3,3.5,4.8,1.0,0.2,0.8,1.5,2.3,10.2
0,2,Steven Adams,Regular,2013-14,20,OKC,NBA,C,81,20,...,0.581,1.8,2.3,4.1,0.5,0.5,0.7,0.9,2.5,3.3
1,2,Steven Adams,Regular,2014-15,21,OKC,NBA,C,70,67,...,0.502,2.8,4.6,7.5,0.9,0.5,1.2,1.4,3.2,7.7
2,2,Steven Adams,Regular,2015-16,22,OKC,NBA,C,80,80,...,0.582,2.7,3.9,6.7,0.8,0.5,1.1,1.1,2.8,8.0
3,2,Steven Adams,Regular,2016-17,23,OKC,NBA,C,80,80,...,0.611,3.5,4.2,7.7,1.1,1.1,1.0,1.8,2.4,11.3
4,2,Steven Adams,Regular,2017-18,24,OKC,NBA,C,76,76,...,0.559,5.1,4.0,9.0,1.2,1.2,1.0,1.7,2.8,13.9
5,2,Steven Adams,Regular,2018-19,25,OKC,NBA,C,80,80,...,0.5,4.9,4.6,9.5,1.6,1.5,1.0,1.7,2.6,13.9


### Step 8 - Repeat the above step to collect the entire career statistics of every player who played in the 2021-22 season and store them in a dataframe. We print the name of each player so that, if the code fails during execution because of a missing tag, we know which player caused it.

In [None]:
main_url = "https://www.basketball-reference.com"

col_names = []
for i in range(33):
  col_names.append('C' + str(i + 1))
all_players_all_seasons_df = pd.DataFrame(columns = col_names)

for table_row in table_rows:

  if table_row['class'][0] != 'thead':
    p_id = table_row.find('th').get_text()

    table_row_cells = table_row.find_all('td')
    p_name = table_row_cells[0].text
    p_url = main_url + table_row_cells[0].find('a')['href']

    # Print the name of each player as we loop, so we can see which player the code fails on
    print(p_name)

    # Call the function for each player to get the stats of all regular and playoff seasons of that player in form of a dataframe
    each_player_all_seasons_df = getPlayerStat(p_id, p_name, p_url)

    # Append the player's stats obtained above to the main dataframe holding all players' stats
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    all_players_all_seasons_df = pd.concat([all_players_all_seasons_df, each_player_all_seasons_df])


Precious Achiuwa
Steven Adams
Bam Adebayo
Santi Aldama
LaMarcus Aldridge
Nickeil Alexander-Walker
Nickeil Alexander-Walker
Nickeil Alexander-Walker
Grayson Allen
Jarrett Allen
Jose Alvarado
Justin Anderson
Justin Anderson
Justin Anderson
Kyle Anderson
Giannis Antetokounmpo
Thanasis Antetokounmpo


AttributeError: ignored
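
In BeautifulSoup code, an `AttributeError` like this typically appears when a `find` call returns `None` and an attribute such as `.text` is then read from it. Printing names pinpoints the first failing player, but the loop still stops there. As a hedged alternative, the loop could catch the error, record the player, and carry on. In the sketch below, `scrape_all` and `fake_scraper` are hypothetical names, and the stand-in scraper only imitates the failure without touching the network.

```python
def scrape_all(players, scrape_one):
    """Call scrape_one for each player; collect results and failures separately."""
    results, failures = [], []
    for player_id, name, url in players:
        try:
            results.append(scrape_one(player_id, name, url))
        except (AttributeError, IndexError) as error:
            # Remember who failed and why, then keep going.
            failures.append((name, repr(error)))
    return results, failures


def fake_scraper(player_id, name, url):
    """Stand-in scraper that fails for one player, imitating a missing tag."""
    if name == "Player B":
        missing_tag = None  # what find(...) returns when nothing matches
        return missing_tag.text  # raises AttributeError
    return {"id": player_id, "name": name}


players = [
    ("1", "Player A", "url-a"),
    ("2", "Player B", "url-b"),
    ("3", "Player C", "url-c"),
]
results, failures = scrape_all(players, fake_scraper)
print(len(results), [name for name, _ in failures])
```

The list of failures then shows every problematic player after a single pass, instead of one per run.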

### Step 9 - By printing the name of each player in the above block, we found that the code fails for the player 'Thanasis Antetokounmpo'. Manually checking his career page showed that he did not play in certain years and hence has no data for those years. To accommodate that, let's modify the function we created to fetch the career statistics of each player.

In [None]:
def getPlayerStatUpdated(player_id, player_name, player_url):
  
  col_names = []
  for i in range(33):
    col_names.append('C' + str(i + 1))

  df = pd.DataFrame(columns = col_names)

  # Parse the page of the player
  player_page = requests.get(player_url).text
  p_soup = BeautifulSoup(player_page, "lxml")

  # Get the statistics of all regular seasons
  player_table = p_soup.find_all(id = "per_game")
  player_table_body = player_table[0].find('tbody')
  player_table_rows = player_table_body.find_all('tr')

  for player_table_row in player_table_rows:
    # CHANGE from the function above - If the statistics is available for a season then that row will have an 'id' attribute
    if player_table_row.has_attr('id'):
      data_row = []
      data_row.append(player_id)
      data_row.append(player_name)
      data_row.append('Regular')
      data_row.append(player_table_row.find('th').find('a').text)

      player_table_cells = player_table_row.find_all('td')
      for player_table_cell in player_table_cells:
        data_row.append(player_table_cell.text)

      df.loc[len(df.index)] = data_row
  # End of loop for regular season

  # If the player has played any playoff seasons then get the statistics of all those seasons 
  player_table = p_soup.find_all(id = "playoffs_per_game")
  if player_table:
    player_table_body = player_table[0].find('tbody')
    player_table_rows = player_table_body.find_all('tr')

    for player_table_row in player_table_rows:
      # CHANGE from the function above - If the statistics is available for a season then that row will have an 'id' attribute
      if player_table_row.has_attr('id'):
        data_row = []
        data_row.append(player_id)
        data_row.append(player_name)
        data_row.append('Playoff')
        data_row.append(player_table_row.find('th').find('a').text)

        player_table_cells = player_table_row.find_all('td')
        for player_table_cell in player_table_cells:
          data_row.append(player_table_cell.text)

        df.loc[len(df.index)] = data_row
  # End of loop for playoff season

  return df  
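
The regular-season and playoff loops in this function are identical apart from the table id and the season label, so they could be folded into one helper. The sketch below shows one possible shape. `extract_rows` is a hypothetical name, the HTML is a made-up two-row table, and the row without an `id` attribute imitates a season with no statistics.

```python
from bs4 import BeautifulSoup


def extract_rows(p_soup, table_id, season_type, player_id, player_name):
    """Return one list per season row of the table with the given id."""
    table = p_soup.find(id=table_id)
    if table is None:  # e.g. the player has no playoff table
        return []
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        if not tr.has_attr("id"):  # same check as in getPlayerStatUpdated
            continue
        season = tr.find("th").find("a").text
        cells = [td.text for td in tr.find_all("td")]
        rows.append([player_id, player_name, season_type, season] + cells)
    return rows


# Made-up table: one season with stats, one season without.
html = """
<table id="per_game"><tbody>
  <tr id="per_game.2021"><th><a href="#">2020-21</a></th><td>21</td><td>MIA</td></tr>
  <tr><th>2021-22</th><td colspan="2">Did not play</td></tr>
</tbody></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")
regular = extract_rows(demo_soup, "per_game", "Regular", "P1", "Demo Player")
playoff = extract_rows(demo_soup, "playoffs_per_game", "Playoff", "P1", "Demo Player")
print(regular)
print(playoff)
```

Returning plain lists also means the dataframe can be built once from all rows, rather than grown one row at a time with `df.loc`.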

### Step 10 - Repeat step 8 to collect the entire career statistics of every player who played in the 2021-22 season and store them in a dataframe. This time we call the updated function created in step 9.

In [None]:
main_url = "https://www.basketball-reference.com"

col_names = []
for i in range(33):
  col_names.append('C' + str(i + 1))
all_players_all_seasons_df = pd.DataFrame(columns = col_names)

for table_row in table_rows:

  if table_row['class'][0] != 'thead':
    p_id = table_row.find('th').get_text()

    table_row_cells = table_row.find_all('td')
    p_name = table_row_cells[0].text
    p_url = main_url + table_row_cells[0].find('a')['href']

    # Let's print the name of each player as we are looping through, which will help to find if the code gets error in any step
    #print(p_name)

    # Call the function for each player to get the stats of all regular and playoff seasons of that player in form of a dataframe
    each_player_all_seasons_df = getPlayerStatUpdated(p_id, p_name, p_url)

    # Append the player's stats obtained above to the main dataframe holding all players' stats
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    all_players_all_seasons_df = pd.concat([all_players_all_seasons_df, each_player_all_seasons_df])
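
Growing the main dataframe inside the loop copies all accumulated rows on every iteration. A common alternative is to collect the per-player dataframes in a list and combine them once at the end. The sketch below shows the pattern with small stand-in dataframes; the three-column layout and the player names are made up for brevity.

```python
import pandas as pd

col_names = ["C1", "C2", "C3"]

# Stand-ins for the per-player dataframes returned by the scraping function.
frames = [
    pd.DataFrame([["1", "Player A", "Regular"]], columns=col_names),
    pd.DataFrame(
        [["2", "Player B", "Regular"], ["2", "Player B", "Playoff"]],
        columns=col_names,
    ),
]

# One concatenation at the end; ignore_index=True gives a clean 0..n-1 index
# instead of restarting at 0 for every player.
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

In the real loop, each call to the scraping function would append its result to `frames`. A short pause between requests, for example with `time.sleep`, is also a common courtesy when fetching several hundred pages from one site.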


### Step 11 - The above code took about 4 minutes 30 seconds to collect the entire career statistics of every player who played in the 2021-22 season and store them in a dataframe. Now let's inspect the dataframe and then save it to a CSV file for future analysis.

In [None]:
all_players_all_seasons_df.shape
# The above command shows the dataframe has 7637 rows and 33 columns

all_players_all_seasons_df.to_csv('NBAPlayerStat.csv')
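
Every value scraped with `.text` is a string, so the statistics columns need converting before any numeric analysis. One hedged way to do that is `pd.to_numeric` with `errors="coerce"`, which turns empty or non-numeric cells into `NaN`. The small dataframe and the choice of numeric columns below are assumptions made for illustration.

```python
import pandas as pd

# Made-up sample in the same all-strings form the scraper produces.
demo = pd.DataFrame(
    {"C4": ["2020-21", "2021-22"], "C5": ["21", "22"], "C33": ["5.0", ""]}
)

numeric_cols = ["C5", "C33"]  # hypothetical choice of numeric columns

# Convert the selected columns; cells that cannot be parsed become NaN.
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
```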