<h1>Scraping NBA.com</h1>
In this assignment, you will scrape data from https://www.nba.com/players. The goal is to collect performance data for 50 players from the official NBA website and identify the top five players by Points Per Game (PPG).

The end result is a function, <i>get_players()</i>, that returns a list of tuples. Each tuple corresponds to one player and contains the following data:
<li>1. player name <b>[str]</b>
<li>2. team name (full name, e.g: Miami Heat, Toronto Raptors, etc.) <b>[str]</b>
<li>3. player position (e.g. Guard/Center/Forward) <b>[str]</b> -- When a player lists multiple positions, use the one mentioned first. For example, 'Center-Forward' becomes 'Center' and 'Forward-Guard' becomes 'Forward'.
<li>4. Points Per Game (PPG) <b>[float]</b>
<li>5. Rebounds Per Game (RPG) <b>[float]</b>
<li>6. Assists Per Game (APG) <b>[float]</b>
<li>7. a link to the player's page <b>[str]</b>
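
The position rule in item 3 can be sketched as a small helper. This is only an illustration: the name `primary_position` is hypothetical, and it assumes the position arrives as a hyphen-separated string such as 'Center-Forward'.

```python
def primary_position(raw):
    """Return the first-listed position, or None when nothing is listed."""
    if raw is None:
        return None
    # Keep only the text before the first hyphen
    first = raw.split("-")[0].strip()
    return first or None


print(primary_position("Center-Forward"))  # Center
print(primary_position("Forward-Guard"))   # Forward
print(primary_position(""))                # None
```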

<h3>Process</h3>
<li>Retrieve the information for players' names and links.
<li>Iterate through the list and invoke the function get_player_info(player_url) for each player.
<li>Accumulate the name, team, position, points, rebounds, assists, and link for each player in the output_list.
<li>Get the top five players by sorting on Points Per Game (PPG).
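
The steps above can be sketched offline, without touching the website. In this sketch, `collect_players`, `fake_listing`, and `fake_stats` are hypothetical names, the rows are made-up placeholders rather than real players, and the stats lookup is passed in as a function so that a stub can stand in for get_player_info(player_url).

```python
def collect_players(listing, get_info, limit=50):
    """Combine listing rows with per-player stats.

    listing:  iterable of (name, team, position, url) tuples
    get_info: callable taking a url and returning (ppg, rpg, apg)
    """
    output_list = []
    for name, team, position, url in list(listing)[:limit]:
        ppg, rpg, apg = get_info(url)
        output_list.append((name, team, position, ppg, rpg, apg, url))
    return output_list


# Made-up placeholder rows (not real players)
fake_listing = [
    ("Player A", "Team X", "Guard", "https://example.com/a"),
    ("Player B", None, None, "https://example.com/b"),
]
fake_stats = {"https://example.com/a": (12.5, 3.0, 4.1)}

# The stub returns zeros for any player without stats
rows = collect_players(fake_listing, lambda url: fake_stats.get(url, (0, 0, 0)))

# Sort by PPG (index 3), highest first, and keep at most five rows
top = sorted(rows, key=lambda row: row[3], reverse=True)[:5]
print(top)
```

Passing the stats function as an argument is a design choice for this sketch only; the assignment itself calls get_player_info directly.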

<b>Notes:</b>
<li>Prioritize the position mentioned first (e.g., change "Center-Forward" to "Center").
<li>If there is no information about a player's team name or position, assign a <b>None</b> value.
<li>If there is no information about a player's PPG/RPG/APG, assign a <b>0</b> (zero) value.
<li>You only need to retrieve information for the first 50 players from the initial page of the NBA website.
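
The zero default for missing statistics can be sketched as follows. The helper name `parse_stat` is hypothetical, and the sketch assumes a missing value shows up as a non-numeric placeholder (such as "--"), an empty string, or no text at all.

```python
def parse_stat(text):
    """Convert a scraped stat string to float, or 0 when it is missing."""
    if text is None:
        return 0
    try:
        return float(text.strip())
    except ValueError:
        # Placeholders such as "--" or "" are not valid numbers
        return 0


print(parse_stat("20.4"))  # 20.4
print(parse_stat("--"))    # 0
print(parse_stat(None))    # 0
```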

### 1. Collecting performance data for 50 players

In [1]:
import requests
import re
from bs4 import BeautifulSoup

In [2]:
# Notes:

# Note that you need to prioritize the position mentioned first (e.g., change "Center-Forward" to "Center").
# If there is no information about players' team name or position, assign a None value.
# If there is no information about the player PPG/RPG/APG, assign a 0 (zero) value.
# You only need to retrieve information for the first 50 players from the initial page of the NBA website.

In [3]:
def get_players():
    # Define the URL to the NBA players page
    url = "https://www.nba.com/players"
    # Initialize an empty list to store player information
    output_list = list()
    ## YOUR CODE HERE
    page = BeautifulSoup(requests.get(url).content,'lxml')
    #find each player's row, then locate name, team name, position.
    
    #locate all rows, each player's info is inside a single row:
    rows = page.find_all('tr')[1:51]
    positions = {"F": "Forward", "C": "Center", "G": "Guard"}
    
    for k in rows:
        #name:
        both_names = k.find('div', class_ = 'RosterRow_playerName__G28lg')
        player_name = both_names.find_all('p')[0].text + " " + both_names.find_all('p')[1].text
        
        #team name:
        col_two = k.find_all('td', class_ = 'text')[1]
        team_short = col_two.find('a')
        #find() returns None when the cell has no team link
        if team_short is None:
            player_team = None
        else:
            #strip any leading "/" from the href to avoid a double slash in the URL
            team_url = "https://www.nba.com/" + team_short.get('href').lstrip('/')
            team_page = BeautifulSoup(requests.get(team_url).content,'lxml')
            all_text = team_page.find('div', class_ = 'TeamHeader_name__MmHlP').text.strip()
            
            #SPECIAL CASE: some team names contain the non-breaking space character "\xa0".
            #split() with no arguments splits on any whitespace, including "\xa0", so
            #re-joining with a plain space normalizes the name without touching digits
            #or punctuation (e.g. the "76" in "Philadelphia 76ers").
            player_team = " ".join(all_text.split())

   
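        #player position:
        col_three = k.find_all('td', class_ = 'text')[2]
        position_text = col_three.text.strip()
        if not position_text:
            player_position = None
        else:
            #the first letter identifies the primary position;
            #letters missing from the positions dict fall back to None
            player_position = positions.get(position_text[0])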
        
        #url:
        half_url = k.find('a').get('href')
        player_url = "https://www.nba.com" + half_url
        
        
        #PPG, RPG, APG:
        PPG, RPG, APG = get_player_info(player_url)
        
        output_list.append((player_name, player_team, player_position, PPG, RPG, APG, player_url))
          
    return output_list
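
Scraped text can contain non-breaking spaces ("\xa0") where a regular space is expected. A minimal sketch of one way to normalize such names, using only string methods; the helper name `clean_team_name` is hypothetical, and the input strings are hand-written examples rather than scraped values.

```python
def clean_team_name(raw):
    """Collapse every run of whitespace (including "\xa0") into one space."""
    # str.split() with no arguments treats "\xa0" as whitespace
    return " ".join(raw.split())


print(clean_team_name("New\xa0Orleans Pelicans"))    # New Orleans Pelicans
print(clean_team_name("  Philadelphia\xa076ers "))   # Philadelphia 76ers
```

Unlike a character-by-character filter that keeps only letters, this leaves digits and punctuation in the name untouched.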


In [4]:
def get_player_info(player_url):
    ## YOUR CODE HERE
    player_page = BeautifulSoup(requests.get(player_url).content,'lxml')
    player_info = player_page.find_all('p', class_ = "PlayerSummary_playerStatValue___EDg_")
    #we need three stats: PPG, RPG, APG (the first three values, in that order).
    
    stats_lst = []
    for stat in player_info[:3]:
        bench = stat.get_text().strip()
        #a missing stat is shown as "--"
        if bench == "--":
            stats_lst += [0]
        else:
            stats_lst += [float(bench)]
    
    #pad with zeros when the page has fewer than three stat values (or none at all)
    stats_lst += [0] * (3 - len(stats_lst))
    
    PPG, RPG, APG = stats_lst
    
    return PPG, RPG, APG

In [5]:
# Run this cell to get the data
data = get_players()
data

[('Precious Achiuwa',
  'Toronto Raptors',
  'Forward',
  9.2,
  6.0,
  0.9,
  'https://www.nba.com/player/1630173/precious-achiuwa/'),
 ('Steven Adams',
  'Memphis Grizzlies',
  'Center',
  8.6,
  11.5,
  2.3,
  'https://www.nba.com/player/203500/steven-adams/'),
 ('Bam Adebayo',
  'Miami Heat',
  'Center',
  20.4,
  9.2,
  3.2,
  'https://www.nba.com/player/1628389/bam-adebayo/'),
 ('Ochai Agbaji',
  'Utah Jazz',
  'Guard',
  7.9,
  2.1,
  1.1,
  'https://www.nba.com/player/1630534/ochai-agbaji/'),
 ('Santi Aldama',
  'Memphis Grizzlies',
  'Forward',
  9.0,
  4.8,
  1.3,
  'https://www.nba.com/player/1630583/santi-aldama/'),
 ('Nickeil Alexander-Walker',
  'Minnesota Timberwolves',
  'Guard',
  6.2,
  1.7,
  1.8,
  'https://www.nba.com/player/1629638/nickeil-alexander-walker/'),
 ('Grayson Allen',
  'Phoenix Suns',
  'Guard',
  10.4,
  3.3,
  2.3,
  'https://www.nba.com/player/1628960/grayson-allen/'),
 ('Jarrett Allen',
  'Cleveland Cavaliers',
  'Center',
  14.3,
  9.8,
  1.7,
  '

In [None]:
# Running the above cell should return something like the following (note: results may vary over time because the website is updated regularly)
"""
[('Precious Achiuwa',
  'Toronto Raptors',
  'Forward',
  9.2,
  6.0,
  0.9,
  'https://www.nba.com/player/1630173/precious-achiuwa/'),
 ('Steven Adams',
  'Memphis Grizzlies',
  'Center',
  8.6,
  11.5,
  2.3,
  'https://www.nba.com/player/203500/steven-adams/'),
 ('Bam Adebayo',
  'Miami Heat',
  'Center',
  20.4,
  9.2,
  3.2,
  'https://www.nba.com/player/1628389/bam-adebayo/'),
 ('Ochai Agbaji',
  'Utah Jazz',
  'Guard',
  7.9,
  2.1,
  1.1,
  'https://www.nba.com/player/1630534/ochai-agbaji/'),
 ('Santi Aldama',
  'Memphis Grizzlies',
  'Forward',
  9.0,
  4.8,
  1.3,
  'https://www.nba.com/player/1630583/santi-aldama/'),
 ('Nickeil Alexander-Walker',
  'Minnesota Timberwolves',
  'Guard',
  6.2,
  1.7,
  1.8,
  'https://www.nba.com/player/1629638/nickeil-alexander-walker/'),
 ('Angelo Allegri',
  None,
  None,
  0,
  0,
  0,
  'https://www.nba.com/player/1641874/angelo-allegri/'),
 ...]
"""

In [6]:
# Check the output length
len(data) #Should return 50

50

### 2. Who's getting the most points per game (PPG)? 

In [7]:
# Top 5 players with highest PPG
# Sample output: [('Giannis Antetokounmpo', 31.1), ('LaMelo Ball', 23.3), ('Bradley Beal', 23.2), ('Desmond Bane', 21.5), ('Bam Adebayo', 20.4)]
# note: the results may vary over time

## YOUR CODE HERE
#Build a (name, PPG) pair for each player; PPG is at index 3 of each tuple.
ppg_rank = [(player[0], player[3]) for player in data]

#By default, sort() compares whole tuples, starting with the first element.
#The key argument tells sort() which value to compare (here, the PPG),
#and reverse=True sorts in descending order.
ppg_rank.sort(key=lambda a: a[1], reverse = True)

#top five players with highest PPG:
ppg_rank[:5]

[('Giannis Antetokounmpo', 31.1),
 ('LaMelo Ball', 23.3),
 ('Bradley Beal', 23.2),
 ('Desmond Bane', 21.5),
 ('Bam Adebayo', 20.4)]
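
As an alternative to sorting the whole list, the standard library's heapq.nlargest picks the top entries directly. A small sketch, using (name, PPG) pairs copied from the first eight rows of the sample output in section 1; the variable names `sample` and `top5` are illustrative only.

```python
import heapq

# (name, PPG) pairs taken from the sample output shown earlier
sample = [
    ('Precious Achiuwa', 9.2),
    ('Steven Adams', 8.6),
    ('Bam Adebayo', 20.4),
    ('Ochai Agbaji', 7.9),
    ('Santi Aldama', 9.0),
    ('Nickeil Alexander-Walker', 6.2),
    ('Grayson Allen', 10.4),
    ('Jarrett Allen', 14.3),
]

# The five pairs with the highest PPG, highest first
top5 = heapq.nlargest(5, sample, key=lambda pair: pair[1])
print(top5)
```

For 50 players the difference in speed is negligible, so either approach works here.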

<h3>Hint: How to sort tuples by an arbitrary element? How to get selected elements from tuples?</h3>

In [109]:
x = [('a',23.2,'b'),('z',17.4,'f'),('d',29.2,'z'),('e',1.74,'bb')]
#Sort by the first element of the tuple

x.sort(key=lambda a: a[0])
x

[('a', 23.2, 'b'), ('d', 29.2, 'z'), ('e', 1.74, 'bb'), ('z', 17.4, 'f')]

In [None]:
x = [('a',23.2,'b'),('c',17.4,'f'),('d',29.2,'z'),('e',1.74,'bb')]
#Sort by the element at position 1

x.sort(key=lambda a: a[1])
x

In [107]:
x = [('a',23.2,'b'),('c',17.4,'f'),('d',29.2,'z'),('e',1.74,'bb')]

[(sub[0],sub[1]) for sub in x]

[('a', 23.2), ('c', 17.4), ('d', 29.2), ('e', 1.74)]
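
The same two operations can also be written with operator.itemgetter from the standard library instead of a lambda. A short sketch on the list from the hint; the variable names `by_second_desc` and `pairs` are illustrative only.

```python
from operator import itemgetter

x = [('a',23.2,'b'),('c',17.4,'f'),('d',29.2,'z'),('e',1.74,'bb')]

# Sort by the element at position 1, largest first, without modifying x
by_second_desc = sorted(x, key=itemgetter(1), reverse=True)
print(by_second_desc)
# [('d', 29.2, 'z'), ('a', 23.2, 'b'), ('c', 17.4, 'f'), ('e', 1.74, 'bb')]

# itemgetter(0, 1) returns a tuple of the elements at positions 0 and 1
pairs = [itemgetter(0, 1)(sub) for sub in x]
print(pairs)
# [('a', 23.2), ('c', 17.4), ('d', 29.2), ('e', 1.74)]
```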

Hannah's Scratch work:

In [221]:
## for loop for player name:

# url = "https://www.nba.com/players"
# page = BeautifulSoup(requests.get(url).content,'lxml')


# #Step 1: get 50 players' names and links.
# table = page.find_all('div',class_="RosterRow_playerName__G28lg")
# name_list = []
# for i in table:
#     first_last_name = i.find_all('p')
#     name = ''
#     for j in first_last_name:
#         partial = " "+ j.get_text()
#         name = name + partial
#     name_list += [name]
# print(name_list) 

In [222]:
# #Step 2: get links

# output_list = list()
# ## YOUR CODE HERE

# page = BeautifulSoup(requests.get(url).content,'lxml')  
# tbl = page.find('table', class_ = "players-list")
# body = tbl.find('tbody').find_all('tr')
# for k in body:
#     link = k.find('a').get('href')
#     output_list.append(link)
# print(output_list)

In [106]:
# url2 = "https://www.nba.com/team/1610612740/pelicans"
# ho = BeautifulSoup(requests.get(url2).content,'lxml')
# aw = ho.find('div', class_ = 'TeamHeader_name__MmHlP').text.strip()
# lower = list(map(chr, range(97, 123))) 
# upper = [x.upper() for x in lower]

# for i in range(len(aw)-1):
#     if aw[i] in lower or aw[i] in upper or len(aw[i]) == 0:
#         i+= 1
#     else:
#         meow = aw.replace(aw[i], " ")
#         break

# check = meow
# print(check)