# Scraping NBA.com

In this assignment, you will scrape data from the [New York Knicks Team Page](https://www.nba.com/team/1610612752/knicks). The goal of the exercise is to collect performance data for every player on the New York Knicks and identify the top three players by Points per Game (PPG).

The end result is to write a function: *`get_players(max_retrieve=None)`* that will return a list of **dicts**. Each **dict** should correspond to a player and should contain the following key-value pairs:

- **name**: Player name **[str]**
- **position**: Player position (e.g., Guard/Center/Forward) **[str]**
  - When a player has multiple preferred positions, prioritize the one mentioned first. For example:
    - If a player lists 'Center-Forward,' consider their primary position as 'Center.'
    - If a player lists 'Forward-Guard,' return 'Forward.'
- **experience**: Years of experience of the player **[int]**
  - If the player is a rookie or has 'R' listed for experience, replace with 0.
  - If there is no information on player experience, assign None
- **PPG**: Points Per Game (PPG) **[float]**
  - If no information, assign a <b>0</b> (zero) value.
- **RPG**: Rebounds Per Game (RPG) **[float]**
  - If no information, assign a <b>0</b> (zero) value.
- **APG**: Assists Per Game (APG) **[float]**
  - If no information, assign a <b>0</b> (zero) value.
- **link**: A link to the player's page **[str]**
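
The position and experience rules above could be sketched as two small helpers. The names `parse_position` and `parse_experience` are hypothetical, and the raw strings used here (`'Center-Forward'`, `'R'`, `'9 Years'`) are assumptions about how the site renders these fields.

```python
def parse_position(raw):
    """Return the first listed position, or None when nothing usable is given."""
    if not raw or not raw.strip():
        return None
    # "Center-Forward" -> "Center"; a single position is returned unchanged
    return raw.strip().split("-")[0].strip()


def parse_experience(raw):
    """Return years of experience as an int, 0 for rookies, None if unknown."""
    if raw is None:
        return None
    text = raw.strip()
    if text in ("R", "Rookie"):
        return 0
    try:
        # "9 Years" -> 9; only the leading number matters
        return int(text.split()[0])
    except (ValueError, IndexError):
        # empty or non-numeric text: experience stays unknown
        return None
```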

### Additional Requirement
- The `get_players` function should **only** be used to iterate through the list of players and get their *link* information; all other information (name, position, experience, PPG, RPG, APG) should be retrieved from the player's page using the `get_player_info` function.

- The function should accept an optional parameter `max_retrieve` **[int]**, which determines the maximum number of players to return, based on the order they appear on the website.
  - If `max_retrieve` = n, return only the top `n` players in the order they are listed on the website.
  - If `max_retrieve` is `None` (default), return all players.
  
- Throughout the assignment, you should use **each** of the following functions at least once:
  - find()
  - find_all()
  - find_all() or find() with a class that matches a regular expression pattern. For example: find_all('something', {'class': re.compile('somePattern')})
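
For the regular-expression requirement, BeautifulSoup tests a compiled pattern against each class name with the pattern's `search()` method, so a prefix is enough to match class names that end in a generated suffix. The sketch below uses only the `re` module to show which names a given pattern would accept. The class names are the ones used later in this notebook; treating `PlayerSummary_playerInfo` as a useful shared prefix is an assumption about the page's markup.

```python
import re

# Class names as they appear in this notebook's solution
class_names = [
    "PlayerSummary_mainInnerInfo__jv3LO",
    "PlayerSummary_playerInfoValue__JS8_v",
    "PlayerSummary_playerInfoLabel__hb5fs",
]

# A prefix pattern: the generated suffix after "__" does not need to be spelled out
pattern = re.compile(r"PlayerSummary_playerInfo")

# search() succeeds when the pattern occurs anywhere in the class name
matching = [name for name in class_names if pattern.search(name)]
print(matching)
```

Because `search()` is used, a trailing `.*` in the pattern is not needed for the match to succeed.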

The goal is to scrape all player data and allow an option to limit the number of players returned based on their appearance order on the website.
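
One way to honor `max_retrieve` is to cap the sequence of links before doing any per-player work. This is a minimal sketch: `limit_players` and the sample links are hypothetical and not part of the assignment template.

```python
from itertools import islice


def limit_players(links, max_retrieve=None):
    """Return the first max_retrieve links in page order, or all when None."""
    # islice(iterable, None) yields every item, so None needs no special case
    return list(islice(links, max_retrieve))


sample_links = ["/player/1/a", "/player/2/b", "/player/3/c"]
print(limit_players(sample_links, 2))
print(limit_players(sample_links))
```

Capping first also avoids requesting player pages whose results would be discarded.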


### Process
- Retrieve the information for players' links.
- Iterate through the list and invoke the function `get_player_info(player_url)` for each player. **(Hint: You need to add `https://www.nba.com` in front of the player hyperlink.)**
- Accumulate the name, position, experience, points, rebounds, assists, and link for each player in `output_list`.
- Get the top three players by sorting on Points per Game (PPG).
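
The middle steps above could be sketched without any network access as follows. `build_player_record` is a hypothetical helper, and the `info` dict stands in for whatever `get_player_info` returns.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nba.com"


def build_player_record(name, href, info):
    """Combine a roster entry with its player-page data in the required key order."""
    link = urljoin(BASE_URL, href)  # turns the relative href into an absolute URL
    record = {"name": name}
    record.update(info)    # position, experience, PPG, RPG, APG
    record["link"] = link  # added last so it becomes the final key
    return record


info = {"position": "Guard", "experience": 3, "PPG": 9.2, "RPG": 2.5, "APG": 2.7}
record = build_player_record("Miles McBride", "/player/1630540/miles-mcbride/", info)
print(list(record))
```

Dicts preserve insertion order in Python 3.7 and later, so the order in which keys are added is the order in which they are printed.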

<div class="alert alert-block alert-info">
<b>Attention:</b> 
    
Please read and follow the instructions carefully to avoid point deduction.
    
You are encouraged to use class materials and online resources to help you with this assignment. However, copying code directly from Generative AI (ChatGPT, Llama, etc.) or coding websites (Stack Overflow, GitHub, etc.) is strictly forbidden. We TAs have used these tools to generate answers for this assignment, so we WILL know if you directly copy or plagiarize your code. If we suspect any dishonest conduct, we reserve the right to call you in during office hours for a code review. If you fail to explain your code, we reserve the right to give you a 0 for the assignment. 

Feel free to email us or come to our office hours if you have any questions regarding this assignment.
</div>

### 1. Collecting player performance data

In [3]:
import requests
import re
from bs4 import BeautifulSoup

<div class="alert alert-block alert-info">
    <b> Attention: </b> You are not allowed to change the input parameters or the output format of this function. However, you may use helper functions if desired.
</div>

In [23]:
def get_players(max_retrieve = None):
    # Define the URL to the NBA players page
    url = "https://www.nba.com/team/1610612752/knicks"
    
    # Initialize an empty list to store player information
    output_list = list()

    ## YOUR CODE HERE
    # a timeout keeps the notebook from hanging if the site does not respond
    response = requests.get(url, timeout=10)
    
    if response.status_code != 200:
        return output_list

    # print(response.text)
    
    soup = BeautifulSoup(response.content, "lxml")

    player_tags = soup.find_all("a", {"href": re.compile(r"/player/\d+/\w+")})
    # print(player_tags)
    
    # we want this order of dict keys:
    # name, position, experience, PPG, RPG, APG, link
    # so we store the name first, then update the dict with the stats,
    # and only then add the link, since insertion order determines the key order
    # (i tried with the hint but i couldn't make it work)
    for player in player_tags:
        player_name = " ".join(player.get_text(strip=True).split("\n"))
        player_link = "https://www.nba.com" + player.get("href")
        
        player_data = {"name": player_name}
        
        # --> PLEASE make sure to run the next cell before this one <--
        # i didn't want to change the architecture of the HW as i didn't know if it was acceptable
        detailed_player_data = get_player_info(player_link)
        
        # dicts preserve insertion order (Python 3.7+), so updating here
        # keeps "name" first, followed by the stats in the order they were defined
        player_data.update(detailed_player_data)
        
        player_data["link"] = player_link
        
        # we could skip players whose position is None so that only players with
        # complete data are returned, but the page seems pretty "stable" based on my tries
        output_list.append(player_data)

        if max_retrieve and len(output_list) >= max_retrieve:
            break
    
    return output_list
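
If the roster page links to the same player more than once (an assumption, since it depends on the page markup), the matched hrefs could be de-duplicated before any player page is requested. `unique_in_order` is a hypothetical helper, not part of the template.

```python
def unique_in_order(hrefs):
    """Drop repeated hrefs while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


hrefs = ["/player/1/a/", "/player/2/b/", "/player/1/a/"]
print(unique_in_order(hrefs))
```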

In [24]:
players = get_players(5)
for player in players:
    print(player)

{'name': 'Delon Wright', 'position': 'Guard', 'experience': 9, 'PPG': 2.4, 'RPG': 1.6, 'APG': 1.7, 'link': 'https://www.nba.com/player/1626153/delon-wright/'}
{'name': 'MarJon Beauchamp', 'position': 'Forward', 'experience': 2, 'PPG': 2.2, 'RPG': 1.2, 'APG': 0.3, 'link': 'https://www.nba.com/player/1630699/marjon-beauchamp/'}
{'name': 'P.J. Tucker', 'position': None, 'experience': 13, 'PPG': 1.7, 'RPG': 2.7, 'APG': 0.5, 'link': 'https://www.nba.com/player/200782/pj-tucker/'}
{'name': 'Cameron Payne', 'position': 'Guard', 'experience': 9, 'PPG': 6.9, 'RPG': 1.4, 'APG': 2.7, 'link': 'https://www.nba.com/player/1626166/cameron-payne/'}
{'name': 'Miles McBride', 'position': 'Guard', 'experience': 3, 'PPG': 9.2, 'RPG': 2.5, 'APG': 2.7, 'link': 'https://www.nba.com/player/1630540/miles-mcbride/'}


In [7]:
def get_player_info(player_url):
    ## YOUR CODE HERE
    
    player_data = {"position": None, "experience": None, "PPG": 0.0, "RPG": 0.0, "APG": 0.0,}

    # a timeout keeps the notebook from hanging if the site does not respond
    response = requests.get(player_url, timeout=10)
    
    if response.status_code != 200:
        return player_data 

    soup = BeautifulSoup(response.content, "lxml")

    # we already have the name in the previous function and the 
    # url gives us the rest and acts as a primary key 
    # to join the 2 lists in the previous function
    
    # position
    position_tag = soup.find("p", {"class": "PlayerSummary_mainInnerInfo__jv3LO"})
    if position_tag is not None:
        info_text = position_tag.get_text(strip=True)
        parts = info_text.split("|")
        if len(parts) > 2:
            # keep only the first listed position, e.g. "Center-Forward" -> "Center"
            player_data["position"] = parts[-1].strip().split("-")[0]

    # experience
    experience_value = None
    experience_tags = soup.find_all("p", {"class": "PlayerSummary_playerInfoValue__JS8_v"})
    experience_labels = soup.find_all("p", {"class": "PlayerSummary_playerInfoLabel__hb5fs"})

    # there isn't a built-in "zip while", so looping and breaking is the cleanest form i've found
    for label, value in zip(experience_labels, experience_tags): 
        if label.get_text(strip=True) == "EXPERIENCE":
            experience_value = value.get_text(strip=True)
            # then we can break since we found the value :)
            break

    if experience_value is not None:
        if experience_value in ("Rookie", "R"):
            player_data["experience"] = 0  # rookie, like "Ariel Hukporti" at the bottom of the roster
        else:
            try:
                # we split to get rid of the "years" part
                player_data["experience"] = int(experience_value.split()[0])
            except (ValueError, IndexError):
                # non-numeric or empty text: leave the experience unknown
                player_data["experience"] = None

    # stats
    stat_labels = soup.find_all("p", {"class": re.compile(r"PlayerSummary_playerStatLabel.*")})
    stat_values = soup.find_all("p", {"class": re.compile(r"PlayerSummary_playerStatValue.*")})

    stats_dict = {}
    # same problem here as before, there is probably a cleaner way to do this but i couldn't find it
    # in the architecture of the html code
    for label, value in zip(stat_labels, stat_values):
        stat_name = label.get_text(strip=True)
        stat_value = value.get_text(strip=True)
        try:
            stats_dict[stat_name] = float(stat_value)
        except ValueError:
            stats_dict[stat_name] = 0.0
            
    # print(stats_dict) # --> we also get "PIE", this could be optimized as it takes some memory
    # we could simply check the name and only add the stats we want, but this way it's more general
    # and can easily be expanded to other stats if needed (here only a few stats so it's ok)

    # we use .get("PPG", 0.0) so that if "PPG" does not exist in stats_dict
    # we get the default value of 0.0 instead of a KeyError
    # (stats_dict["PPG"] would raise if the page had no PPG entry)
    # nonetheless, better safe than sorry haha
    player_data["PPG"] = stats_dict.get("PPG", 0.0)
    player_data["RPG"] = stats_dict.get("RPG", 0.0)
    player_data["APG"] = stats_dict.get("APG", 0.0)

    return player_data
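
The label/value pairing used for the stats can be tried out on plain strings. In this sketch `to_float` and `pair_stats` are hypothetical names, and the `"--"` placeholder for a missing value is an assumption about what the page might show.

```python
def to_float(text, default=0.0):
    """Convert text to float, falling back to default when it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return default


def pair_stats(labels, values, wanted=("PPG", "RPG", "APG")):
    """Zip labels with values and keep only the wanted stats, defaulting to 0.0."""
    paired = {label: to_float(value) for label, value in zip(labels, values)}
    return {key: paired.get(key, 0.0) for key in wanted}


print(pair_stats(["PPG", "RPG", "APG", "PIE"], ["9.2", "2.5", "--", "8.1"]))
```

Note that `zip` pairs items purely by position, so this only works when labels and values appear in the same order on the page.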

I made sure to use all of these at least once :
- find()
- find_all()
- find_all() or find() with a class that matches a regular expression pattern

In [None]:
# print(players[0]["link"])
# player_info = get_player_info(players[0]["link"])
# print(player_info)

{'position': 'Forward', 'experience': 1, 'PPG': 0.4, 'RPG': 0.7, 'APG': 0.3}


In [64]:
# Run this cell to get the data
data = get_players()
data

[{'name': 'Delon Wright',
  'position': 'Guard',
  'experience': 9,
  'PPG': 2.4,
  'RPG': 1.6,
  'APG': 1.7,
  'link': 'https://www.nba.com/player/1626153/delon-wright/'},
 {'name': 'MarJon Beauchamp',
  'position': 'Forward',
  'experience': 2,
  'PPG': 2.2,
  'RPG': 1.2,
  'APG': 0.3,
  'link': 'https://www.nba.com/player/1630699/marjon-beauchamp/'},
 {'name': 'Cameron Payne',
  'position': 'Guard',
  'experience': 9,
  'PPG': 7.0,
  'RPG': 1.4,
  'APG': 2.7,
  'link': 'https://www.nba.com/player/1626166/cameron-payne/'},
 {'name': 'Miles McBride',
  'position': 'Guard',
  'experience': 3,
  'PPG': 9.2,
  'RPG': 2.5,
  'APG': 2.6,
  'link': 'https://www.nba.com/player/1630540/miles-mcbride/'},
 {'name': 'Josh Hart',
  'position': 'Guard',
  'experience': 7,
  'PPG': 14.4,
  'RPG': 9.6,
  'APG': 5.7,
  'link': 'https://www.nba.com/player/1628404/josh-hart/'},
 {'name': 'Pacome Dadiet',
  'position': 'Forward',
  'experience': 0,
  'PPG': 1.9,
  'RPG': 1.0,
  'APG': 0.4,
  'link': 'ht

In [38]:
# Running the above cell should return (note: the results may vary over time since the website is always updated)
"""
[{'name': 'Pacome Dadiet',
  'position': 'Forward',
  'experience': 0,
  'PPG': 0.0,
  'RPG': 0.0,
  'APG': 0.0,
  'link': 'https://www.nba.com/player/1642359/pacome-dadiet/'},
 {'name': 'Tyler Kolek',
  'position': 'Guard',
  'experience': 0,
  'PPG': 0.0,
  'RPG': 0.0,
  'APG': 0.0,
  'link': 'https://www.nba.com/player/1642278/tyler-kolek/'},
 {'name': 'Ariel Hukporti',
  'position': 'Center',
  'experience': 0,
  'PPG': 0.0,
  'RPG': 0.0,
  'APG': 0.0,
  'link': 'https://www.nba.com/player/1630574/ariel-hukporti/'},
 {'name': 'Kevin McCullar Jr.',
  'position': 'Guard',
  'experience': 0,
  'PPG': 0.0,
  'RPG': 0.0,
  'APG': 0.0,
  'link': 'https://www.nba.com/player/1641755/kevin-mccullar-jr/'},
 {'name': 'Donte DiVincenzo',
  'position': 'Guard',
  'experience': 6,
  'PPG': 15.5,
  'RPG': 3.7,
  'APG': 2.7,
  'link': 'https://www.nba.com/player/1628978/donte-divincenzo/'},
 {'name': 'Jacob Toppin',
  'position': 'Forward',
  'experience': 1,
  'PPG': 1.4,
  'RPG': 0.8,
  'APG': 0.3,
  'link': 'https://www.nba.com/player/1631210/jacob-toppin/'},
 ...]
"""

In [65]:
# Set max_retrieve to 6 and check the len of your output
len(get_players(max_retrieve = 6)) 

6

### 2. Who are the top 3 players with the most points per game (PPG)?
- Get the top 3 players with the highest PPG on the New York Knicks
    - Sample output: [('Jalen Brunson', 28.7),('Julius Randle', 24.0),('Mikal Bridges', 19.6)]
    - Note: this sample is intended to provide an idea of the structure of the output, but it should not be used as a reference for the correct answer, as information may change over time.

In [66]:
## YOUR CODE HERE

top_scorers = sorted([(player["name"], player["PPG"]) for player in data], key=lambda x: x[1], reverse=True)[:3]
print(top_scorers)

[('Jalen Brunson', 26.1), ('Karl-Anthony Towns', 24.5), ('Mikal Bridges', 17.4)]
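
As an alternative to sorting the whole list, `heapq.nlargest` picks the top entries directly. The sketch below uses made-up sample data rather than the scraped results.

```python
from heapq import nlargest
from operator import itemgetter

# Hypothetical sample data with the same keys as the scraped dicts
sample = [
    {"name": "A", "PPG": 9.2},
    {"name": "B", "PPG": 26.1},
    {"name": "C", "PPG": 0.0},
    {"name": "D", "PPG": 17.4},
]

# nlargest returns the 3 dicts with the highest PPG, largest first
top_three = [(p["name"], p["PPG"]) for p in nlargest(3, sample, key=itemgetter("PPG"))]
print(top_three)
```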


<h3>Hint: How to sort dicts by value for a specific key? How to get selected element in dicts?</h3>

In [67]:
x = [{'letter1':'a','number':23.2,'letter2':'b'},{'letter1':'c','number':17.4,'letter2':'f'},{'letter1':'d','number':29.2,'letter2':'z'},{'letter1':'e','number':1.74,'letter2':'bb'}]
#Sort by the letter1 key in the dict

x.sort(key=lambda a: a['letter1'])
x

[{'letter1': 'a', 'number': 23.2, 'letter2': 'b'},
 {'letter1': 'c', 'number': 17.4, 'letter2': 'f'},
 {'letter1': 'd', 'number': 29.2, 'letter2': 'z'},
 {'letter1': 'e', 'number': 1.74, 'letter2': 'bb'}]

In [81]:
x = [{'letter1':'a','number':23.2,'letter2':'b'},{'letter1':'c','number':17.4,'letter2':'f'},{'letter1':'d','number':29.2,'letter2':'z'},{'letter1':'e','number':1.74,'letter2':'bb'}]
# Sort by the number key in the dict 

x.sort(key=lambda a: a['number'])
x

[{'letter1': 'e', 'number': 1.74, 'letter2': 'bb'},
 {'letter1': 'c', 'number': 17.4, 'letter2': 'f'},
 {'letter1': 'a', 'number': 23.2, 'letter2': 'b'},
 {'letter1': 'd', 'number': 29.2, 'letter2': 'z'}]

In [82]:
x = [{'letter1':'a','number':23.2,'letter2':'b'},{'letter1':'c','number':17.4,'letter2':'f'},{'letter1':'d','number':29.2,'letter2':'z'},{'letter1':'e','number':1.74,'letter2':'bb'}]

[(sub['letter1'],sub['number']) for sub in x]

[('a', 23.2), ('c', 17.4), ('d', 29.2), ('e', 1.74)]
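
The hints above sort in place with `list.sort()`. When the original order still matters, for example to keep the website order, `sorted()` returns a new list and leaves the input untouched. A small sketch with made-up rows:

```python
rows = [
    {"letter1": "a", "number": 23.2},
    {"letter1": "c", "number": 17.4},
    {"letter1": "d", "number": 29.2},
]

# sorted() builds a new list; reverse=True puts the largest number first
by_number_desc = sorted(rows, key=lambda row: row["number"], reverse=True)

print([row["letter1"] for row in by_number_desc])
print([row["letter1"] for row in rows])  # original order is unchanged
```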