<a href="https://colab.research.google.com/github/GoAshim/WebScraping/blob/main/Web_Scraping_1_2021_22_NBA_Player_Stats_Version_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape 2021-22 NBA Player Summary Statistics
In this first web scraping exercise, we scrape the 2021-22 NBA player stats from the Basketball Reference site (link [here](https://www.basketball-reference.com/leagues/NBA_2022_totals.html)).


## Summary
Basketball Reference provides the stats of all NBA players for the 2021-22 season in tabular form at the link above. We will identify the table, scrape the relevant data, and load it into a dataframe for further analysis.

### Step 1 - Import required libraries

In [None]:
import requests # To pull data from webpage
from bs4 import BeautifulSoup # To parse data pulled from the webpage
import pandas as pd # To view, modify and store data parsed from the webpage 


### Step 2 - Extract the content of the webpage

In [None]:
url = "https://www.basketball-reference.com/leagues/NBA_2022_totals.html"

# Using requests.get to fetch the source content of the page
page_data = requests.get(url).text

# Using BeautifulSoup to parse the content with the lxml parser
soup = BeautifulSoup(page_data, "lxml")

# Uncomment the next line to see the content of the page in a more readable form
#print(soup.prettify())
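
The fetch above assumes the request succeeds. A minimal sketch of a more defensive version follows. The helper name `fetch_html`, the `User-Agent` string, the 10-second timeout, and the injectable `get` parameter are all assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body as text, raising an exception on HTTP errors."""
    # A descriptive User-Agent is an assumption; some sites reject the library default.
    headers = {"User-Agent": "nba-stats-notebook/0.1 (educational use)"}
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `page_data = fetch_html(url)` could replace the `requests.get(url).text` call. The `get` parameter exists only so the function can be exercised without network access.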

### Step 3 - Locate the table within the page where the stats are listed

In [None]:
# This is a manual step: I inspected the page source in my web browser and identified the table where the stats are stored.
# The table can be identified by <table class="sortable stats_table">, and we use that to extract the content of the table
table_data = soup.find('table', {"class" : "sortable stats_table"})

# Then let's extract the body of the table
table_body = table_data.find('tbody')

# Now extract the rows of the table; find_all returns a list of matching tags
table_rows = table_body.find_all('tr')
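
The lookup above passes the whole class attribute as a single string. As far as I understand the Beautiful Soup documentation, that form compares against the exact string value of `class`, so it may stop matching if the site adds or reorders classes. Below is a small sketch of a CSS-selector alternative, run on a made-up HTML fragment. The `sample_html` markup, the extra `now_sortable` class, and the `demo_table` id are assumptions for illustration, and `html.parser` is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; it is not copied from the real page.
sample_html = """
<table class="sortable stats_table now_sortable" id="demo_table">
  <tbody><tr class="full_table"><td>Player A</td></tr></tbody>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The selector requires both classes but tolerates extra ones in any order.
sample_table = sample_soup.select_one("table.sortable.stats_table")
print(sample_table["id"])  # prints demo_table
```

On the real page the same selector would be applied to `soup` instead of `sample_soup`.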


### Step 4 - Loop through the header row to get the names of the columns of the table

In [None]:
header_cells = table_data.find('thead').find('tr').find_all('th')
column_names = []

for header_cell in header_cells:
  # Use the value of the data-tip attribute if it exists, otherwise fall back to aria-label
  if header_cell.has_attr('data-tip'):
    column_names.append(header_cell['data-tip'])
  else:
    column_names.append(header_cell['aria-label'])

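# Replace two long column names with shorter ones
column_names[3] = 'Players Age'
column_names[17] = 'Effective FG Percentage'

# Replace spaces and hyphens with underscores so the names are easier to use as dataframe columns
for i in range(len(column_names)):
  column_names[i] = column_names[i].replace(' ', '_').replace('-', '_')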

print(column_names)


['Rank', 'Player', 'Position', 'Players_Age', 'Team', 'Games', 'Games_Started', 'Minutes_Played', 'Field_Goals', 'Field_Goal_Attempts', 'Field_Goal_Percentage', '3_Point_Field_Goals', '3_Point_Field_Goal_Attempts', '3_Point_Field_Goal_Percentage', '2_Point_Field_Goals', '2_point_Field_Goal_Attempts', '2_Point_Field_Goal_Percentage', 'Effective_FG_Percentage', 'Free_Throws', 'Free_Throw_Attempts', 'Free_Throw_Percentage', 'Offensive_Rebounds', 'Defensive_Rebounds', 'Total_Rebounds', 'Assists', 'Steals', 'Blocks', 'Turnovers', 'Personal_Fouls', 'Points']
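
The clean-up loop in this step can also be written as a small function plus a list comprehension. This is only a sketch: `normalize_column_name` is a hypothetical helper name, and the `raw_names` list holds illustrative labels rather than values read from the page.

```python
def normalize_column_name(name):
    """Replace spaces and hyphens with underscores."""
    return name.replace(" ", "_").replace("-", "_")


# Illustrative labels only; the real ones come from the header cells.
raw_names = ["Games Started", "3-Point Field Goals", "Effective FG Percentage"]
clean_names = [normalize_column_name(name) for name in raw_names]
print(clean_names)
```

Keeping the rule in one function makes it easy to reuse if more tables are scraped later.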


### Step 5 - Loop through the rows in the table body and extract the player names from the first 10 rows

In [None]:
n = 1
for table_row in table_rows:

  # The table lists the stats of each player in a separate row, which is what we need to extract. However, the header row is repeated
  # several times in the table body to keep the table readable on the web page, so we need to skip those header rows.
  # Using .get() avoids a KeyError on any row that has no class attribute.
  if 'thead' not in table_row.get('class', []):

    # Extract the content of cells of each row of the table
    table_cells = table_row.find_all('td')

    if n <= 10:
      # Print the content of the first cell of first 10 rows of the table
      print(table_cells[0].string)
    # End of the inner if block

    n += 1
  # End of the outer if block

# End of the for loop block


Precious Achiuwa
Steven Adams
Bam Adebayo
Santi Aldama
LaMarcus Aldridge
Nickeil Alexander-Walker
Nickeil Alexander-Walker
Nickeil Alexander-Walker
Grayson Allen
Jarrett Allen
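
The header check and the manual counter in this step can also be expressed as a filter followed by a slice. Here is a small sketch on a made-up fragment. The player names are placeholders, the `full_table` and `partial_table` class names are assumptions, and `html.parser` is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up rows for illustration, including one repeated header and one row without a class.
sample_html = """
<table><tbody>
  <tr class="full_table"><td>Player A</td><td>C</td></tr>
  <tr class="thead"><th>Player</th><th>Pos</th></tr>
  <tr class="partial_table"><td>Player B</td><td>PF</td></tr>
  <tr><td>Player C</td><td>SG</td></tr>
</tbody></table>
"""
sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")

# Keep every row that is not a repeated header; .get() returns [] when a row has no class.
data_rows = [row for row in sample_rows if "thead" not in row.get("class", [])]

# Slice instead of counting by hand; here we take the first two names.
first_names = [row.find("td").get_text() for row in data_rows[:2]]
print(first_names)
```

On the real table, `data_rows[:10]` would correspond to the ten rows printed above.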


### Step 6 - Extract the values of all columns for the first player in the table

In [None]:
rank = 1
dataframe_row = []

for table_row in table_rows:
  # Skip the repeated header rows; .get() avoids a KeyError on rows without a class attribute
  if 'thead' not in table_row.get('class', []):
    dataframe_row.append(rank)

    table_row_cells = table_row.find_all('td')

    for table_row_cell in table_row_cells:
      dataframe_row.append(table_row_cell.text)

    # Stop after the first data row
    break

print(dataframe_row)


[1, 'Precious Achiuwa', 'C', '22', 'TOR', '73', '28', '1725', '265', '603', '.439', '56', '156', '.359', '209', '447', '.468', '.486', '78', '131', '.595', '146', '327', '473', '82', '37', '41', '84', '151', '664']
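
A flat list like the one printed above is hard to read without the column names next to it. One way to check that values and names line up is to pair them with `zip`. The sketch below uses only the first five names and values from the outputs above; the variable names `sample_columns`, `sample_row`, and `record` are introduced here for illustration.

```python
# First five column names and values, copied from the outputs shown earlier.
sample_columns = ["Rank", "Player", "Position", "Players_Age", "Team"]
sample_row = [1, "Precious Achiuwa", "C", "22", "TOR"]

# Pair each column name with its value to get a readable record.
record = dict(zip(sample_columns, sample_row))
print(record)
```

Note that every scraped value except the rank is still a string at this point, including the age.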


### Step 7 - Loop through the table to extract all columns for every player and store them in a dataframe

In [None]:
# First create an empty dataframe with the columns we got
df1 = pd.DataFrame(columns=column_names)

# Then loop through the rows in the table as we did in step 6 above, this time keeping every row
rank = 1

for table_row in table_rows:
  dataframe_row = []

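  # Skip the repeated header rows; .get() avoids a KeyError on rows without a class attribute
  if 'thead' not in table_row.get('class', []):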
    dataframe_row.append(rank)
    table_row_cells = table_row.find_all('td')

    for table_row_cell in table_row_cells:
      dataframe_row.append(table_row_cell.text)
    
    rank += 1
    df1.loc[len(df1.index)] = dataframe_row

df1.head()

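| | Rank | Player | Position | Players_Age | Team | Games | Games_Started | Minutes_Played | Field_Goals | Field_Goal_Attempts | ... | Free_Throw_Percentage | Offensive_Rebounds | Defensive_Rebounds | Total_Rebounds | Assists | Steals | Blocks | Turnovers | Personal_Fouls | Points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Precious Achiuwa | C | 22 | TOR | 73 | 28 | 1725 | 265 | 603 | ... | 0.595 | 146 | 327 | 473 | 82 | 37 | 41 | 84 | 151 | 664 |
| 1 | 2 | Steven Adams | C | 28 | MEM | 76 | 75 | 1999 | 210 | 384 | ... | 0.543 | 349 | 411 | 760 | 256 | 65 | 60 | 115 | 153 | 528 |
| 2 | 3 | Bam Adebayo | C | 24 | MIA | 56 | 56 | 1825 | 406 | 729 | ... | 0.753 | 137 | 427 | 564 | 190 | 80 | 44 | 148 | 171 | 1068 |
| 3 | 4 | Santi Aldama | PF | 21 | MEM | 32 | 0 | 360 | 53 | 132 | ... | 0.625 | 33 | 54 | 87 | 21 | 6 | 10 | 16 | 36 | 132 |
| 4 | 5 | LaMarcus Aldridge | C | 36 | BRK | 47 | 12 | 1050 | 252 | 458 | ... | 0.873 | 73 | 185 | 258 | 42 | 14 | 47 | 44 | 78 | 607 |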
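
Appending to the dataframe one row at a time works, but it can be slow on larger tables, and the scraped cell values arrive as strings. A possible variation is to collect the rows in a plain list, build the dataframe once, and then convert the numeric columns. The sketch below uses two rows and five columns taken from the outputs above; the names `sample_columns`, `sample_rows`, `sample_df`, and `numeric_columns` are introduced here for illustration.

```python
import pandas as pd

# A reduced set of columns and two rows copied from the outputs shown earlier.
sample_columns = ["Rank", "Player", "Team", "Games", "Free_Throw_Percentage"]
sample_rows = [
    [1, "Precious Achiuwa", "TOR", "73", ".595"],
    [2, "Steven Adams", "MEM", "76", ".543"],
]

# Build the dataframe in one call instead of appending row by row.
sample_df = pd.DataFrame(sample_rows, columns=sample_columns)

# Convert the text columns that hold numbers; unparseable cells become NaN.
numeric_columns = ["Games", "Free_Throw_Percentage"]
sample_df[numeric_columns] = sample_df[numeric_columns].apply(pd.to_numeric, errors="coerce")

print(sample_df.dtypes)
```

After the conversion, numeric operations such as `sample_df["Games"].sum()` behave as expected instead of concatenating strings.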
