<a href="https://colab.research.google.com/github/DarylUnix/Data-Related-Learnings-and-Projects/blob/main/Webscraping_NBA_Summer_League_stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Webscraping NBA Summer League Stats

In this notebook, we scrape player stats from every NBA summer league, from the 2003-2004 season through the 2023-2024 season.

Acknowledgement:
All of the data are from **basketball.realgm.com**. Huge thanks to them for preserving the data and keeping it public.

Disclaimer:
This project is for personal use as part of my portfolio. It is not for commercial use.

## Preparation of Requirements

To start web scraping, we need to import three Python libraries:

*   Requests - used to connect to the site and download its pages
*   Beautiful Soup - used to search and extract data within a page
*   pandas - used to convert the data into a table




In [1]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

We should also collect the name and year ids of each league, which appear in the URL of every page we will use, and store them in arrays. This lets us loop over them instead of manually replacing values every time we need data.

Ex. https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/15/stats

* league_name = "NBA-Summer-League"
* league_year_id = "15"

By the way, let's ignore the URLs of the historical data, since those pages are an aggregated summary of all records and we won't need them.

In [8]:
league_name = ["NBA-Summer-League", "Orlando-Pro-Summer-League", "Salt-Lake-City-Summer-League", "California-Classic"]
league_year_id = [
    [16, 17, 7, 5, 3, 8, 14, 26, 27, 30, 34, 37, 39, 43, 47, 49, 53],
    [20, 19, 18, 4, 2, 9, 13, 25, 28, 31, 33, 36],
    [24, 23, 22, 21, 6, 29, 32, 35, 38, 41, 44, 48, 52],
    [40, 42, 46, 50, 51]
]
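As a rough sketch of how these two lists combine, the snippet below pairs each league name with its year ids and builds the base stats URL in the pattern of the example above. The helper name `base_stats_url` is made up for illustration, and the league number in the path is assumed to be the 1-based position of the league in `league_name`, which is how the scraping loops later in this notebook build it.

```python
league_name = ["NBA-Summer-League", "Orlando-Pro-Summer-League",
               "Salt-Lake-City-Summer-League", "California-Classic"]
league_year_id = [
    [16, 17, 7, 5, 3, 8, 14, 26, 27, 30, 34, 37, 39, 43, 47, 49, 53],
    [20, 19, 18, 4, 2, 9, 13, 25, 28, 31, 33, 36],
    [24, 23, 22, 21, 6, 29, 32, 35, 38, 41, 44, 48, 52],
    [40, 42, 46, 50, 51],
]


def base_stats_url(league_number, name, year_id):
    # Hypothetical helper: mirrors the example URL pattern shown above.
    return f"https://basketball.realgm.com/nba/summer/{league_number}/{name}/{year_id}/stats"


# enumerate(..., start=1) gives the 1-based league number used in the URL path.
urls = [
    base_stats_url(number, name, year_id)
    for number, (name, year_ids) in enumerate(zip(league_name, league_year_id), start=1)
    for year_id in year_ids
]

print(len(urls))   # one URL per league/year combination
print(urls[0])
```

Pairing the lists with `zip` keeps each name next to its own ids, so adding a league later only means appending to both lists.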

## Getting all the Data

These are the steps I took to construct the loop that repeatedly finds the data and stores it in a table:



1. Make an empty data frame
2. Create a for loop over all the league names
3. Create another for loop over all the league year ids
4. Set the current page to page 1
5. Create a while loop to get the data from every page
6. Build the URL from the URL string and the array values, so that it changes automatically on every iteration
7. Connect to the URL using Requests
8. Get the table headers using Beautiful Soup
9. Store them in a table using pandas
10. Rename the column # to League, since the row number is irrelevant while the league stats detail is very important
11. Make an empty array that will store all records
12. Find all table records
13. Create a for loop that finds the data cells of each record, gets their text, and appends each record as an array to the empty array we just made
14. Find the league stats detail
15. Create a for loop to replace the # value of every record with the league stats detail
16. Make a temporary table that combines the existing table header with all the records you appended
17. Concatenate the temporary table to your table, disregarding the table header since they are the same
18. Find the "Next Page" anchor in the page (the code identifies it by the next page number in the anchor's href)
19. If it exists, go to the next page and continue the loop
20. If it doesn't exist, go to the next league year id
21. The loop continues until no page of any league year id and league name has a "Next Page" anchor
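To make steps 8 to 17 concrete without touching the network, here is a minimal sketch that runs them on a small made-up HTML snippet. The snippet's markup, player names, and values are placeholders, not real data from the site, and the real pages may be structured differently. Unlike the cells below, this sketch skips rows that have no data cells and collects the per-page tables in a list before concatenating once.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up HTML standing in for one downloaded stats page.
sample_html = """
<h2>Sample League Stats</h2>
<table>
  <tr><th>#</th><th>Player</th><th>Team</th><th>PPG</th></tr>
  <tr><td>1</td><td>Player A</td><td>AAA</td><td>20.5</td></tr>
  <tr><td>2</td><td>Player B</td><td>BBB</td><td>18.0</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Steps 8-10: read the header cells and rename "#" to "League".
headers = [th.text.strip() for th in soup.find_all("th")]
headers = ["League" if h == "#" else h for h in headers]

# Steps 11-13: collect the text of every record, skipping rows without <td> cells.
rows = []
for tr in soup.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

# Steps 14-15: overwrite the row number with the league stats detail.
league_desc = soup.find("h2").text.strip()
for row in rows:
    row[0] = league_desc

# Steps 16-17: build a table for this page and combine all page tables.
frames = []  # one DataFrame per scraped page
frames.append(pd.DataFrame(rows, columns=headers))
df = pd.concat(frames, ignore_index=True)
print(df)
```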

### Average Stats

NBAsummerleagueAverage.csv includes:

* League - League stats details
* Player - Name of the player
* Team - Team of the player
* GP - Games Played
* MPG - Minutes per Game
* PPG - Points per Game
* FGM - Field Goals Made
* FGA - Field Goals Attempted
* FG% - Field Goal Percentage
* 3PM - 3 Pointers Made
* 3PA - 3 Pointers Attempted
* 3P% - 3 Point Percentage
* FTM - Free Throws Made
* FTA - Free Throws Attempted
* FT% - Free Throw Percentage
* ORB - Offensive Rebounds
* DRB - Defensive Rebounds
* RPG - Rebounds per Game
* APG - Assists per Game
* SPG - Steals per Game
* BPG - Blocks per Game
* TOV - Turnovers
* PF - Personal Fouls



In [9]:
# Initialize the DataFrame outside the loop
df = pd.DataFrame()

for i in range(len(league_name)):
    for j in league_year_id[i]:
        current_page = 1

        while True:
            url = f'https://basketball.realgm.com/nba/summer/{i+1}/{league_name[i]}/{j}/stats/NBA/0/Averages/All/points/All/desc/{current_page}/Summer_League'
            print(f"Fetching URL: {url}")  # Print the URL to verify
            print(f"Current League: {league_name[i]}, League ID: {j}, Current Page: {current_page}")  # Print the current league, ID, and page number

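            page = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection
            page.raise_for_status()  # Stop on HTTP errors instead of parsing an error page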
            soup = BeautifulSoup(page.text, 'html.parser')

            elements = soup.find_all('th')
            headers = [element.text for element in elements]

            if df.empty:
                df = pd.DataFrame(columns=headers)
                if "#" in df.columns:
                    df.rename(columns={"#": "League"}, inplace=True)

            all_rows = []
            column_data = soup.find_all('tr')
            for row in column_data[1:]:
                row_data = row.find_all('td')
                individual_row_data = [data.text.strip() for data in row_data]
                all_rows.append(individual_row_data)

            league_desc = soup.find_all('h2')[0].text.strip()
            for row in all_rows:
                if row:  # Ensure that the row is not empty
                    row[0] = league_desc

            temp_df = pd.DataFrame(all_rows, columns=df.columns)
            df = pd.concat([df, temp_df], ignore_index=True)

            next_page_button = soup.find('a', href=lambda href: href and f"desc/{current_page + 1}/Summer_League" in href)
            if not next_page_button:
                print(f"No more pages for League ID: {j} in {league_name[i]}")
                break
            else:
                current_page += 1
                print(f"Moving to next page: {current_page} for League ID: {j} in {league_name[i]}")

# Print the DataFrame to verify

print(df)

Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Averages/All/points/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 1
Moving to next page: 2 for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Averages/All/points/All/desc/2/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 2
Moving to next page: 3 for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Averages/All/points/All/desc/3/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 3
No more pages for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/17/stats/NBA/0/Averages/All/points/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 17, Current Page: 1
Moving to next page
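The log above shows the loop advancing page by page until no link to the following page is found. A small offline sketch of that check is given here; the pagination markup is made up for illustration and the helper name `has_page` is hypothetical, so treat it as a way to see how the `href` filter behaves rather than a copy of the real page.

```python
from bs4 import BeautifulSoup

# Made-up pagination markup; the real page's markup may differ.
sample_html = """
<div>
  <a href="/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Averages/All/points/All/desc/1/Summer_League">1</a>
  <a href="/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Averages/All/points/All/desc/2/Summer_League">2</a>
  <a>no href here</a>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")


def has_page(soup, page_number):
    # True when some link's href points at the given page number.
    marker = f"desc/{page_number}/Summer_League"
    # Anchors without an href pass None to the filter, hence the "href and" guard.
    link = soup.find("a", href=lambda href: href and marker in href)
    return link is not None


current_page = 1
print(has_page(soup, current_page + 1))  # a link to page 2 exists in the sample
print(has_page(soup, current_page + 2))  # no link to page 3 in the sample
```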

### Misc Stats

NBAsummerleagueMisc.csv includes:

* League - League stats details
* Player - Name of the player
* Team - Team of the player
* Dbl Dbl - No. of double-doubles
* Tpl Dbl - No. of triple-doubles
* 40 Pts - No. of games with 40+ points
* 20 Reb - No. of games with 20+ rebounds
* 20 Ast - No. of games with 20+ assists
* 5 Stl - No. of games with 5+ steals
* 5 Blk - No. of games with 5+ blocked shots
* High Game - Highest points in a game
* Techs - No. of technical fouls received
* HOB - Hands on Buckets
* Ast/TO - Assists to Turnover ratio
* Stl/TO - Steals to Turnover ratio
* FT/FGA - Free Throw Attempts to Field Goal Attempts ratio
* W's - No. of wins
* L's - No. of losses
* Win % - Win percentage
* OWS - Offensive Win Shares
* DWS - Defensive Win Shares
* WS - Win Shares

In [10]:
# Initialize the DataFrame outside the loop
df2 = pd.DataFrame()

for i in range(len(league_name)):
    for j in league_year_id[i]:
        current_page = 1

        while True:
            url = f'https://basketball.realgm.com/nba/summer/{i+1}/{league_name[i]}/{j}/stats/NBA/0/Misc_Stats/All/per/All/desc/{current_page}/Summer_League'
            print(f"Fetching URL: {url}")  # Print the URL to verify
            print(f"Current League: {league_name[i]}, League ID: {j}, Current Page: {current_page}")  # Print the current league, ID, and page number

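            page = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection
            page.raise_for_status()  # Stop on HTTP errors instead of parsing an error page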
            soup = BeautifulSoup(page.text, 'html.parser')

            elements = soup.find_all('th')
            headers = [element.text for element in elements]

            if df2.empty:
                df2 = pd.DataFrame(columns=headers)
                if "#" in df2.columns:
                    df2.rename(columns={"#": "League"}, inplace=True)

            all_rows = []
            column_data = soup.find_all('tr')
            for row in column_data[1:]:
                row_data = row.find_all('td')
                individual_row_data = [data.text.strip() for data in row_data]
                all_rows.append(individual_row_data)

            league_desc = soup.find_all('h2')[0].text.strip()
            for row in all_rows:
                if row:  # Ensure that the row is not empty
                    row[0] = league_desc

            temp_df2 = pd.DataFrame(all_rows, columns=df2.columns)
            df2 = pd.concat([df2, temp_df2], ignore_index=True)

            next_page_button = soup.find('a', href=lambda href: href and f"desc/{current_page + 1}/Summer_League" in href)
            if not next_page_button:
                print(f"No more pages for League ID: {j} in {league_name[i]}")
                break
            else:
                current_page += 1
                print(f"Moving to next page: {current_page} for League ID: {j} in {league_name[i]}")

# Print the DataFrame to verify
print(df2)

Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Misc_Stats/All/per/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 1
No more pages for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/17/stats/NBA/0/Misc_Stats/All/per/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 17, Current Page: 1
No more pages for League ID: 17 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/7/stats/NBA/0/Misc_Stats/All/per/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 7, Current Page: 1
No more pages for League ID: 7 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/5/stats/NBA/0/Misc_Stats/All/per/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 5, Current Page: 1
No more pages for League ID: 5 in NBA-Summer-L

### Advanced Stats

NBAsummerleagueAdvanced.csv includes:

* League - League stats details
* Player - Name of the player
* TS% - True Shooting Percentage
* eFG% - Effective Field Goal Percentage
* Total S % - Total Shooting Percentage
* ORB% - Offensive Rebound Percentage
* DRB% - Defensive Rebound Percentage
* TRB% - Total Rebound Percentage
* AST% - Assist Percentage
* TOV% - Turnover Percentage
* STL% - Steal Percentage
* BLK% - Block Percentage
* USG% - Usage Percentage
* PPR - Pure Point Rating
* PPS - Points Per Shot
* ORtg - Offensive Rating
* DRtg - Defensive Rating
* eDiff - Efficiency Differential
* FIC - Floor Impact Counter
* PER - Player Efficiency Rating

In [11]:
# Initialize the DataFrame outside the loop
df3 = pd.DataFrame()

for i in range(len(league_name)):
    for j in league_year_id[i]:
        current_page = 1

        while True:
            url = f'https://basketball.realgm.com/nba/summer/{i+1}/{league_name[i]}/{j}/stats/NBA/0/Advanced_Stats/All/dbl_dbl/All/desc/{current_page}/Summer_League'
            print(f"Fetching URL: {url}")  # Print the URL to verify
            print(f"Current League: {league_name[i]}, League ID: {j}, Current Page: {current_page}")  # Print the current league, ID, and page number

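            page = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection
            page.raise_for_status()  # Stop on HTTP errors instead of parsing an error page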
            soup = BeautifulSoup(page.text, 'html.parser')

            elements = soup.find_all('th')
            headers = [element.text for element in elements]

            if df3.empty:
                df3 = pd.DataFrame(columns=headers)
                if "#" in df3.columns:
                    df3.rename(columns={"#": "League"}, inplace=True)

            all_rows = []
            column_data = soup.find_all('tr')
            for row in column_data[1:]:
                row_data = row.find_all('td')
                individual_row_data = [data.text.strip() for data in row_data]
                all_rows.append(individual_row_data)

            league_desc = soup.find_all('h2')[0].text.strip()
            for row in all_rows:
                if row:  # Ensure that the row is not empty
                    row[0] = league_desc

            temp_df3 = pd.DataFrame(all_rows, columns=df3.columns)
            df3 = pd.concat([df3, temp_df3], ignore_index=True)

            next_page_button = soup.find('a', href=lambda href: href and f"desc/{current_page + 1}/Summer_League" in href)
            if not next_page_button:
                print(f"No more pages for League ID: {j} in {league_name[i]}")
                break
            else:
                current_page += 1
                print(f"Moving to next page: {current_page} for League ID: {j} in {league_name[i]}")

# Print the DataFrame to verify
print(df3)

Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Advanced_Stats/All/dbl_dbl/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 1
Moving to next page: 2 for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Advanced_Stats/All/dbl_dbl/All/desc/2/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 2
Moving to next page: 3 for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Advanced_Stats/All/dbl_dbl/All/desc/3/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 3
No more pages for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/17/stats/NBA/0/Advanced_Stats/All/dbl_dbl/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 17, Current

### Total Stats

NBAsummerleagueTotal.csv includes:

* League - League stats details
* Player - Name of the player
* Team - Team of the player
* GP - Games Played
* MPG - Minutes per Game
* PPG - Points per Game
* FGM - Field Goals Made
* FGA - Field Goals Attempted
* FG% - Field Goal Percentage
* 3PM - 3 Pointers Made
* 3PA - 3 Pointers Attempted
* 3P% - 3 Point Percentage
* FTM - Free Throws Made
* FTA - Free Throws Attempted
* FT% - Free Throw Percentage
* ORB - Offensive Rebounds
* DRB - Defensive Rebounds
* RPG - Rebounds per Game
* APG - Assists per Game
* SPG - Steals per Game
* BPG - Blocks per Game
* TOV - Turnovers
* PF - Personal Fouls

In [13]:
# Initialize the DataFrame outside the loop
df4 = pd.DataFrame()

for i in range(len(league_name)):
    for j in league_year_id[i]:
        current_page = 1

        while True:
            url = f'https://basketball.realgm.com/nba/summer/{i+1}/{league_name[i]}/{j}/stats/NBA/0/Totals/All/dbl_dbl/All/desc/{current_page}/Summer_League'
            print(f"Fetching URL: {url}")  # Print the URL to verify
            print(f"Current League: {league_name[i]}, League ID: {j}, Current Page: {current_page}")  # Print the current league, ID, and page number

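            page = requests.get(url, timeout=30)  # Avoid hanging forever on a stalled connection
            page.raise_for_status()  # Stop on HTTP errors instead of parsing an error page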
            soup = BeautifulSoup(page.text, 'html.parser')

            elements = soup.find_all('th')
            headers = [element.text for element in elements]

            if df4.empty:
                df4 = pd.DataFrame(columns=headers)
                if "#" in df4.columns:
                    df4.rename(columns={"#": "League"}, inplace=True)

            all_rows = []
            column_data = soup.find_all('tr')
            for row in column_data[1:]:
                row_data = row.find_all('td')
                individual_row_data = [data.text.strip() for data in row_data]
                all_rows.append(individual_row_data)

            league_desc = soup.find_all('h2')[0].text.strip()
            for row in all_rows:
                if row:  # Ensure that the row is not empty
                    row[0] = league_desc

            temp_df4 = pd.DataFrame(all_rows, columns=df4.columns)
            df4 = pd.concat([df4, temp_df4], ignore_index=True)

            next_page_button = soup.find('a', href=lambda href: href and f"desc/{current_page + 1}/Summer_League" in href)
            if not next_page_button:
                print(f"No more pages for League ID: {j} in {league_name[i]}")
                break
            else:
                current_page += 1
                print(f"Moving to next page: {current_page} for League ID: {j} in {league_name[i]}")

# Print the DataFrame to verify
print(df4)

Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Totals/All/dbl_dbl/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 1
Moving to next page: 2 for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Totals/All/dbl_dbl/All/desc/2/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 2
Moving to next page: 3 for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/16/stats/NBA/0/Totals/All/dbl_dbl/All/desc/3/Summer_League
Current League: NBA-Summer-League, League ID: 16, Current Page: 3
No more pages for League ID: 16 in NBA-Summer-League
Fetching URL: https://basketball.realgm.com/nba/summer/1/NBA-Summer-League/17/stats/NBA/0/Totals/All/dbl_dbl/All/desc/1/Summer_League
Current League: NBA-Summer-League, League ID: 17, Current Page: 1
Moving to next page: 2 
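The four cells above differ only in two URL segments: the stat view (`Averages`, `Misc_Stats`, `Advanced_Stats`, `Totals`) and the sort key (`points`, `per`, `dbl_dbl`, `dbl_dbl`). One possible way to avoid repeating the loop is to pass those two values as parameters. The sketch below is only an illustration of that idea: `stats_page_url`, `scrape_all_pages`, and the `fetch_page` callback are hypothetical names, and the fake fetcher stands in for the real download and parsing so the example runs offline.

```python
# (stat view, sort key) pairs copied from the four URLs used above.
STAT_VIEWS = {
    "Average": ("Averages", "points"),
    "Misc": ("Misc_Stats", "per"),
    "Advanced": ("Advanced_Stats", "dbl_dbl"),
    "Total": ("Totals", "dbl_dbl"),
}


def stats_page_url(league_number, name, year_id, view, sort_key, page):
    # Hypothetical helper that fills in the URL pattern shared by all four cells.
    return (
        f"https://basketball.realgm.com/nba/summer/{league_number}/{name}/{year_id}"
        f"/stats/NBA/0/{view}/All/{sort_key}/All/desc/{page}/Summer_League"
    )


def scrape_all_pages(league_number, name, year_id, view, sort_key, fetch_page):
    """Collect rows page by page until fetch_page reports no next page.

    fetch_page is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns a (rows, has_next) tuple.
    """
    all_rows = []
    page = 1
    while True:
        url = stats_page_url(league_number, name, year_id, view, sort_key, page)
        rows, has_next = fetch_page(url)
        all_rows.extend(rows)
        if not has_next:
            return all_rows
        page += 1


def fake_fetch_page(url):
    # Offline stand-in for the network: pretends there are exactly two pages
    # and returns the URL itself as the only "row".
    is_first_page = "/desc/1/" in url
    return [url], is_first_page


view, sort_key = STAT_VIEWS["Misc"]
visited = scrape_all_pages(1, "NBA-Summer-League", 16, view, sort_key, fake_fetch_page)
print(visited)
```

In a real run, `fetch_page` would wrap the Requests and Beautiful Soup steps from the cells above and return the parsed records together with the result of the next-page check.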

## Saving the Data

Let's save the data separately, since each table has different columns.

Don't forget to clean them in your preferred environment. You may break down the league stats detail into separate fields for your own convenience (e.g. year_start, year_end, year_duration, league_name, etc.).

In [12]:
df.to_csv("NBAsummerleagueAverage.csv", index=False)
df2.to_csv("NBAsummerleagueMisc.csv", index=False)
df3.to_csv("NBAsummerleagueAdvanced.csv", index=False)
df4.to_csv("NBAsummerleagueTotal.csv", index=False)
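For the cleaning step mentioned above, a minimal pandas sketch might look like the following. It assumes two things about the scraped tables: every value is stored as text, and rows without data cells may have left all-empty records. The sample rows and the choice of `GP` and `PPG` as the numeric columns are made up for illustration.

```python
import pandas as pd

# Made-up rows shaped like the scraped output: every value is text,
# and one record is completely empty.
raw = pd.DataFrame(
    {
        "League": ["Sample League Stats", None, "Sample League Stats"],
        "Player": ["Player A", None, "Player B"],
        "GP": ["5", None, "4"],
        "PPG": ["20.5", None, "18.0"],
    }
)

# Drop records where every column is missing.
clean = raw.dropna(how="all").reset_index(drop=True)

# Convert the stat columns from text to numbers; unparseable values become NaN.
stat_columns = ["GP", "PPG"]
clean[stat_columns] = clean[stat_columns].apply(pd.to_numeric, errors="coerce")

print(clean.dtypes)
```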