# QBC8 Project: [basketball-reference](https://www.basketball-reference.com/)
This project analyzes basketball performance data from [basketball-reference.com](https://www.basketball-reference.com/) for the Quera QBC8 data-analysis bootcamp. It uses web scraping, a relational database, and statistical methods to extract insights about player and team statistics that can inform strategic decisions in professional basketball. This notebook focuses on the web scraper that gathers the required data from the source.

## Crawler Program
This web crawler is built with [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/), a Python library that parses HTML and XML documents into navigable parse trees, which makes scraping straightforward. Tabular data is handled with [Pandas](https://pandas.pydata.org/), a Python library for data manipulation and analysis whose DataFrame structure supports efficient data processing.

The whole program is encapsulated in a single class, which groups related methods and data together and makes the code easier to organize, reuse, and maintain.

#### Modules To Import:

In [63]:
from bs4 import BeautifulSoup
from datetime import datetime
import pandas as pd
import requests
import time
import json
import re

## The Crawler Object
This Crawler class is a web scraping tool designed to extract and process basketball-related data from Basketball-Reference.com. It provides methods to retrieve information about MVP winners, top season players and teams, team rosters, and detailed player data, using BeautifulSoup for HTML parsing and Pandas for data management. The specific functionalities and use of each method will be explained in detail in subsequent sections of the notebook.

This code block defines the Crawler class and initializes a crawler object:

In [112]:
class Crawler:
    current_year = datetime.now().year

    # Fetches a URL and returns its parsed HTML, retrying on connection errors
    def __get_soup(self, url, retries=3, wait=5):
        for i in range(retries):
            try:
                page = requests.get(url)
            except requests.RequestException:
                if i == retries - 1:
                    print(f"Failed to retrieve {url} after {retries} attempts. Returning an empty soup.")
                    return BeautifulSoup("", "html.parser")
                print(f"Failed to retrieve {url} (attempt {i+1} of {retries}). Waiting {wait} seconds...")
                time.sleep(wait)
                continue
            break
        if page.status_code == 429:
            # Return an empty soup (not None) so both failure paths return the same type
            print("Rate limited!")
            return BeautifulSoup("", "html.parser")
        return BeautifulSoup(page.content, "html.parser")

    # Gets all the MVPs and returns them, with some data attached, in a pd.DataFrame
    def get_mvps(self):
        mvp_soup = self.__get_soup("https://www.basketball-reference.com/awards/mvp.html")
        mvp_soup = mvp_soup.find('table', id="mvp_NBA").tbody.find_all("tr")
        mvp_df = pd.DataFrame(columns=["player_name", "player_id", "team_id", "year"])
        
        for tr in mvp_soup:
            mvp_df.loc[len(mvp_df)] = [      
                tr.find(attrs={"data-stat":"player"}).a.string,
                tr.find(attrs={"data-stat":"player"}).a.get("href")[9:-5],
                tr.find(attrs={"data-stat":"team_id"}).a.get("href")[1:-5].split("/")[1],
                tr.find(attrs={"data-stat":"team_id"}).a.get("href")[1:-5].split("/")[-1]
            ]
        return mvp_df

    # Gets the top players in each season
    def get_top_season_players(self, year: int, n=0):
        player_soup = self.__get_soup(f"https://www.basketball-reference.com/leagues/NBA_{year}_totals.html")
        team_soup = player_soup.find_all("td", attrs={"data-stat":"team_name_abbr"})
        player_soup = player_soup.find_all("td", attrs={"data-stat":"name_display"})
        player_list = [[tdp.a.string, tdp.a.get("href")[9:-5], tdt.a.string, year] for tdp, tdt in zip(player_soup, team_soup) if tdp.a and tdt.a]
        player_list = player_list[:n] if n else player_list
        return pd.DataFrame(player_list, columns=["player_name", "player_id", "team_id", "year"], index=range(1,len(player_list)+1))

    # Gets table standings of each season
    def get_table_standings(self, year: int, n=0):
        standings_soup = self.__get_soup(f"https://www.basketball-reference.com/leagues/NBA_{year}_standings.html")
        standings_soup = BeautifulSoup(standings_soup.find(class_="placeholder").next_sibling.next_sibling.string, "html.parser").find_all("td", attrs={"data-stat":"team_name"})
        standings_list = [[td.a.string,td.a.get("href")[7:-5]] for td in standings_soup if td.a]
        standings_list = standings_list[:n] if n else standings_list
        return pd.DataFrame(standings_list, columns=["team_name", "team_year_id"], index=range(1,len(standings_list)+1))

    # Gets each team's roster by year
    def get_team_roster(self, team_year_id):
        roster_soup = self.__get_soup(f"https://www.basketball-reference.com/teams/{team_year_id}.html")
        players_soup = roster_soup.find_all("td", attrs={"data-stat":"player"})
        players_number_soup = roster_soup.find_all("th", attrs={"data-stat":"number"})
        playerlist = [[td.a.string, td.a.get("href")[9:-5]] for td in players_soup]
        playernumbers = [int(th.string) for th in players_number_soup[1:]]
        return pd.DataFrame(playerlist, columns=["player_name", "player_id"], index=playernumbers)

    # Gets details for one player from a Series that contains "player_id" and "year". Taking a Series lets this method be applied row by row to a whole DataFrame
    def get_player_data(self, player_series: pd.Series):
        player_series = player_series.copy()
        player_soup = self.__get_soup(f"https://www.basketball-reference.com/players/{player_series['player_id']}.html")
        player_json_soup = player_soup.find("script", type="application/ld+json").string
        player_json = json.loads(player_json_soup) if player_json_soup else None           # some player data is embedded as JSON
        player_weightheight = player_soup.find("span", string=re.compile(".*lb")).next_sibling.strip()[1:-1].split(",\xa0")          # Player height (cm) and weight (kg) as a two-item list

        # Gets and Cleans Player Position
        position_string = player_soup.find("strong", string=re.compile(".*Position:.*")).next_sibling
        positions_cleaned = re.sub(r'[^\w\s,]', '', position_string)
        position_list = re.split(r',|and', positions_cleaned)
        position_list = [pos.strip() for pos in position_list if pos.strip()]

        # Check whether the player is retired (only retired players have a "Career Length" entry)
        retired = bool(player_soup.find("strong", string=re.compile(".*Career Length:.*")))

        # Apply the captured data to the player pd.Series (each line is self-explanatory). They are written one per line so the order and parameters are easy to change
        player_series["height_cm"] = player_weightheight[0][:-2]
        player_series["weight_kg"] = player_weightheight[1][:-2]
        player_series["position"] = position_list
        player_series["shooting_hand"] = player_soup.find("strong", string=re.compile(".*Shoots:.*")).next_sibling.strip()
        player_series["retired"] = retired     
        player_series["experience_total"] = int(player_soup.find("strong", string=re.compile(".*Experience:.*|.*Career Length:.*")).next_sibling.strip().split()[0])      # for active players, experience is counted up to the current year
        player_series["experience_at_year"] = player_series["experience_total"] - (self.current_year - int(player_series["year"])) if not retired else player_series["experience_total"]    # for active players, subtract the years elapsed since the season in question; retired players keep their career length
        player_series["birthplace"] = player_json.get('birthPlace').split(",")[-1].strip() if player_json_soup else None
        player_series["birthdate"] = player_json.get('birthDate') if player_json_soup else None

        return player_series

crawler = Crawler()
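
The retry logic in `__get_soup` can be tried offline without touching the network. The following is a minimal sketch, not the class's actual code: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and `flaky_fetch` stands in for `requests.get` by failing twice before succeeding.

```python
import time


def fetch_with_retries(fetch, url, retries=3, wait=0.0):
    """Call fetch(url) up to `retries` times; return None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except ConnectionError:
            if attempt == retries:
                return None  # give up after the last attempt
            time.sleep(wait)  # pause before the next attempt
    return None


# Hypothetical stand-in for requests.get: fails twice, then succeeds
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return f"<html>{url}</html>"


def always_failing_fetch(url):
    raise ConnectionError("simulated network failure")


result = fetch_with_retries(flaky_fetch, "https://example.com/page")
failed = fetch_with_retries(always_failing_fetch, "https://example.com/page")
print(result, calls["count"], failed)
```

Passing the fetch function in as an argument is only a convenience for this sketch; it lets the retry loop be checked without real HTTP requests.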

## Getting All MVPs
The `get_mvps()` method scrapes MVP award winners from Basketball-Reference.com, extracting each player's name, ID, team ID, and season year into a Pandas DataFrame.

In [66]:
mvp = crawler.get_mvps()
mvp

| | player_name | player_id | team_id | year |
|---|---|---|---|---|
| 0 | Nikola Jokić | j/jokicni01 | DEN | 2024 |
| 1 | Joel Embiid | e/embiijo01 | PHI | 2023 |
| 2 | Nikola Jokić | j/jokicni01 | DEN | 2022 |
| 3 | Nikola Jokić | j/jokicni01 | DEN | 2021 |
| 4 | Giannis Antetokounmpo | a/antetgi01 | MIL | 2020 |
| ... | ... | ... | ... | ... |
| 64 | Wilt Chamberlain | c/chambwi01 | PHW | 1960 |
| 65 | Bob Pettit | p/pettibo01 | STL | 1959 |
| 66 | Bill Russell | r/russebi01 | BOS | 1958 |
| 67 | Bob Cousy | c/cousybo01 | BOS | 1957 |
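
The IDs in this table come from fixed-position slices of link targets. The sketch below reproduces those slices on plain strings. The helper names are hypothetical, and the href shapes are assumptions inferred from the IDs shown above and from the URLs the crawler builds.

```python
def player_id_from_href(href):
    # "/players/" is 9 characters and ".html" is 5, so [9:-5] keeps the ID
    return href[9:-5]


def team_and_year_from_href(href):
    # Drop the leading "/" and trailing ".html", then split the path
    parts = href[1:-5].split("/")
    return parts[1], parts[-1]


def team_year_id_from_href(href):
    # "/teams/" is 7 characters; this slice is used in get_table_standings
    return href[7:-5]


# Assumed href shapes, for illustration only
player_id = player_id_from_href("/players/j/jokicni01.html")
team_id, year = team_and_year_from_href("/teams/DEN/2024.html")
team_year_id = team_year_id_from_href("/teams/DEN/2024.html")
print(player_id, team_id, year, team_year_id)
```

Note that the year comes back as a string here, which is why `get_player_data` converts it with `int()` before doing arithmetic.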


## Adding Details To MVPs
Applying the `get_player_data()` method to each row of the MVP table adds height, weight, position, shooting hand, experience, birthplace, and birthdate, which allows a more detailed analysis of MVP winners' characteristics.

In [70]:
mvp_detailed = mvp.apply(crawler.get_player_data, axis="columns")
mvp_detailed

| | player_name | player_id | team_id | year | height_cm | weight_kg | position | shooting_hand | retired | experience_total | experience_at_year | birthplace | birthdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nikola Jokić | j/jokicni01 | DEN | 2024 | 211 | 128 | [Center] | Right | False | 9 | 8 | Serbia | 1995-02-19 |
| 1 | Joel Embiid | e/embiijo01 | PHI | 2023 | 213 | 127 | [Center] | Right | False | 8 | 6 | Cameroon | 1994-03-16 |
| 2 | Nikola Jokić | j/jokicni01 | DEN | 2022 | 211 | 128 | [Center] | Right | False | 9 | 6 | Serbia | 1995-02-19 |
| 3 | Nikola Jokić | j/jokicni01 | DEN | 2021 | 211 | 128 | [Center] | Right | False | 9 | 5 | Serbia | 1995-02-19 |
| 4 | Giannis Antetokounmpo | a/antetgi01 | MIL | 2020 | 211 | 109 | [Power Forward, Small Forward, Point Guard, Sh... | Right | False | 11 | 6 | Greece | 1994-12-06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 64 | Wilt Chamberlain | c/chambwi01 | PHW | 1960 | 216 | 124 | [Center] | Right | True | 14 | 14 | United States | 1936-08-21 |
| 65 | Bob Pettit | p/pettibo01 | STL | 1959 | 206 | 92 | [Power Forward, Center] | Right | True | 11 | 11 | United States | 1932-12-12 |
| 66 | Bill Russell | r/russebi01 | BOS | 1958 | 208 | 97 | [Center] | Left | True | 13 | 13 | United States | 1934-02-12 |
| 67 | Bob Cousy | c/cousybo01 | BOS | 1957 | 185 | 79 | [Point Guard] | Right | True | 14 | 14 | United States | 1928-08-09 |
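
The cleaning steps behind these columns can be reproduced on plain strings. This is a minimal sketch with hypothetical helper names. The raw input strings are assumptions about what the page text looks like, and `current_year=2025` is an assumption that happens to be consistent with the `experience_at_year` values shown above.

```python
import re


def parse_height_weight(text):
    # Strip whitespace, drop the parentheses, split on comma + non-breaking space,
    # then cut the two-letter units ("cm", "kg") from each part
    height, weight = text.strip()[1:-1].split(",\xa0")
    return height[:-2], weight[:-2]


def parse_positions(text):
    # Remove everything except word characters, whitespace, and commas,
    # then split on commas or the word "and"
    cleaned = re.sub(r"[^\w\s,]", "", text)
    return [p.strip() for p in re.split(r",|and", cleaned) if p.strip()]


def experience_at_year(experience_total, season_year, current_year, retired):
    # Retired players keep their full career length; for active players,
    # subtract the years elapsed since the season in question
    if retired:
        return experience_total
    return experience_total - (current_year - season_year)


# Assumed raw strings, for illustration only
height_weight = parse_height_weight("\xa0(211cm,\xa0128kg) ")
positions = parse_positions("\n  Power Forward and Center\n  \u25aa\n")
active = experience_at_year(9, 2024, current_year=2025, retired=False)
retired_case = experience_at_year(14, 1960, current_year=2025, retired=True)
print(height_weight, positions, active, retired_case)
```

Because the subtraction depends on the current year, re-running the notebook in a later year while the site's experience figure is unchanged would give different `experience_at_year` values for active players.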


## Getting Top 50 Players from Seasons 2019-2020 Till 2023-2024 
This code block collects data on the top 50 NBA players from the 2019-2020 to 2023-2024 seasons. It uses the `get_top_season_players()` method to scrape player lists for each season and enriches this data with detailed player information like height, weight, position, and experience using the `get_player_data()` method. The combined data is stored in a pandas DataFrame.

In [None]:
seasons = list(range(2020, 2025))  # 2019-2020 is year 2020 for basketball-reference

# Initialize an empty DataFrame to store all player data
top50players_data = pd.DataFrame()

# Iterate over each season
for season in seasons:
    print(f"Processing season {season}...")
    top_players = crawler.get_top_season_players(year=season, n=50)
    top_players["season"] = season
    enriched_players = top_players.apply(crawler.get_player_data, axis=1)
    top50players_data = pd.concat([top50players_data, enriched_players.reset_index()], ignore_index=True)

top50players_data 

## Player Insights from Top 2 NBA Teams (2019-2024)
This script gathers detailed data on all players from the top 2 NBA teams for each season between 2019-2020 and 2023-2024. It first identifies the top 2 teams using the `get_table_standings()` method, retrieves each team's roster with `get_team_roster()`, and enriches player details like height, weight, position, and experience using `get_player_data()`. The combined data is saved in a CSV file, creating a comprehensive dataset for analysis.

In [None]:
seasons = list(range(2020, 2025))  # 2019-2020 is represented as 2020 for basketball-reference; the range covers 2019-2020 through 2023-2024

# Initialize an empty DataFrame to store all player data
top_team_players_data = pd.DataFrame()

# Iterate over each season
for season in seasons:
    print(f"Processing season {season}...")

    # Get standings and fetch the top 2 teams
    top_teams = crawler.get_table_standings(year=season, n=2)

    # Iterate over the top 2 teams
    for _, team in top_teams.iterrows():
        team_year_id = team["team_year_id"]
        team_name = team["team_name"]
        print(f"Processing team: {team_name} ({team_year_id}) for season {season}...")

        # Get the roster of the team
        team_roster = crawler.get_team_roster(team_year_id=team_year_id)

        # Add the season  to the roster DataFrame
        team_roster["year"] = season

        # Apply get_player_data to enrich player data
        enriched_roster = team_roster.apply(crawler.get_player_data, axis=1)

        # Append the enriched data to the main DataFrame
        top_team_players_data = pd.concat([top_team_players_data, enriched_roster.reset_index()], ignore_index=True)

top_team_players_data

## Exporting Data
This code block exports all DataFrames into `.csv` files.

In [None]:
mvp_detailed.to_csv("data/mvp_detailed.csv")
top50players_data.to_csv("data/all_players_data.csv")
top_team_players_data.to_csv("data/top_team_players_data.csv")



> #### Note On Season Identifiers
> Seasons are normally written in the form "year-year+1" (e.g. Season 2021-2022). To reduce data complexity and improve efficiency, I chose "year+1" as the key for each season, because [basketball-reference.com](https://www.basketball-reference.com/) does the same.
> 
>  **TL;DR**: The key "2024" points to the 2023-24 season.  
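
The convention above can be expressed as two small conversions. This is a minimal sketch; `season_key` and `season_label` are hypothetical helpers and are not part of the crawler.

```python
def season_key(label):
    # "2023-24" or "2023-2024" -> 2024 (the start year plus one)
    start_year = int(label.split("-")[0])
    return start_year + 1


def season_label(key):
    # 2024 -> "2023-24"
    return f"{key - 1}-{key % 100:02d}"


key_short = season_key("2023-24")
key_long = season_key("2021-2022")
label = season_label(2024)
print(key_short, key_long, label)
```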