# NBA Roster Scraping
*(ref: Basketball Reference)*
#### To use, copy the __[Basketball Reference](https://www.basketball-reference.com/)__ link to any team's roster page for any season.
>Link example: __[https://www.basketball-reference.com/teams/AAA/XXXX.html](https://www.basketball-reference.com/teams/PHI/2021.html)__ → Replace **AAA** with the team acronym and **XXXX** with the season's year.
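
The link pattern above can also be assembled in code instead of being typed by hand. This is a minimal sketch: `build_roster_url` and `BASE_URL` are hypothetical names introduced here for illustration, and the function only normalizes the abbreviation's case and whitespace without checking that the team or season exists.

```python
BASE_URL = "https://www.basketball-reference.com/teams"  # taken from the link example


def build_roster_url(team, season):
    """Return the roster page URL for a team abbreviation and a season year."""
    team = team.strip().upper()  # 'phi ' -> 'PHI'
    return f"{BASE_URL}/{team}/{int(season)}.html"


print(build_roster_url("phi", 2021))
# → https://www.basketball-reference.com/teams/PHI/2021.html
```

The returned string can be assigned to `url` in place of the `input()` call.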


In [1]:
import pandas as pd
import requests as req
from bs4 import BeautifulSoup
import json
import os

#### Getting the webpage and parsing it with BeautifulSoup

In [2]:
url = input()
page = req.get(url)
page.raise_for_status() # fail early on an HTTP error instead of parsing an error page

soup = BeautifulSoup(page.content, 'html.parser')


#### Getting the roster table with some relevant columns and transforming it into a DataFrame

In [3]:
roster = soup.find(name='table', attrs={'id': 'roster'})
teamRoster = []

for row in roster.find('tbody').find_all('tr'):
    player = {}
    player['Name'] = row.find(attrs={'data-stat' : 'player'}).text
    player['No.'] = row.find(attrs={'data-stat' : 'number'}).text
    player['Position'] = row.find(attrs={'data-stat' : 'pos'}).text
    player['Height'] = row.find(attrs={'data-stat' : 'height'}).text
    player['Weight'] = row.find(attrs={'data-stat' : 'weight'}).text
    teamRoster.append(player)

rosterDf = pd.DataFrame(teamRoster)

#### Getting the Per Game stats table and transforming it into a DataFrame

In [4]:
perGame = soup.find(name="table", attrs={'id': 'per_game'})

playerStats = []
for row in perGame.find('tbody').find_all('tr'):
    player = {}
    player['Name'] = row.find(attrs={'data-stat' : 'player'}).text
    player['Age'] = row.find('td', {'data-stat' : 'age'}).text
    player['Min per Game'] = row.find('td', {'data-stat' : 'mp_per_g'}).text
    player['Field Goal %'] = row.find('td', {'data-stat' : 'fg_pct'}).text
    player['Rebounds per Game'] = row.find('td', {'data-stat' : 'trb_per_g'}).text
    player['Assists per Game'] = row.find('td', {'data-stat' : 'ast_per_g'}).text
    player['Steals per Game'] = row.find('td', {'data-stat' : 'stl_per_g'}).text
    player['Blocks per Game'] = row.find('td', {'data-stat' : 'blk_per_g'}).text
    player['Turnovers per Game'] = row.find('td', {'data-stat' : 'tov_per_g'}).text
    player['Points per Game'] = row.find('td', {'data-stat' : 'pts_per_g'}).text

    playerStats.append(player)

playerStatsDf = pd.DataFrame(playerStats)
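
Both extraction loops above repeat the same `data-stat` lookup for every column. One possible way to condense them, shown here only as a sketch, is a mapping from column label to `data-stat` value plus a single helper. `extract_rows`, `SAMPLE_COLUMNS`, and the hand-written `SAMPLE_HTML` are assumptions made for illustration; the sample is not real page content. A cell that is missing from a row yields `None` instead of raising `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that mimics one table row (not real site data).
SAMPLE_HTML = """
<table id="roster"><tbody>
<tr><th data-stat="number">7</th><td data-stat="player">Player A</td>
<td data-stat="pos">C</td></tr>
</tbody></table>
"""

# Column label -> data-stat attribute; 'height' is deliberately absent from the sample.
SAMPLE_COLUMNS = {"Name": "player", "No.": "number", "Position": "pos", "Height": "height"}


def extract_rows(table, columns):
    """Return one dict per <tr> in the table body, keyed by column label."""
    rows = []
    for row in table.find("tbody").find_all("tr"):
        record = {}
        for label, stat in columns.items():
            cell = row.find(attrs={"data-stat": stat})
            # A missing cell becomes None rather than raising AttributeError.
            record[label] = cell.get_text(strip=True) if cell is not None else None
        rows.append(record)
    return rows


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_rows = extract_rows(sample_soup.find("table", attrs={"id": "roster"}), SAMPLE_COLUMNS)
print(sample_rows)
# → [{'Name': 'Player A', 'No.': '7', 'Position': 'C', 'Height': None}]
```

The resulting list of dicts can be passed to `pd.DataFrame` exactly like `teamRoster` and `playerStats`.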

#### Merging the previous tables into one single DataFrame, sorted by Jersey No.

In [5]:
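# A player who wore several numbers is listed as e.g. '1, 23', so int() fails on the raw text; sort by the first number instead
jersey = lambda col: pd.to_numeric(col.str.extract(r'(\d+)', expand=False), errors='coerce')
teamDf = pd.merge(rosterDf, playerStatsDf, on='Name').sort_values(by=['No.'], key=jersey)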
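
A player who wore more than one number during the season is listed with a value such as `'1, 23'`, which `int()` cannot convert. The sketch below shows one way to derive a sort key from such strings using only the standard library. `first_jersey_number` is a hypothetical helper, and treating a missing number as `-1` (so it sorts first) is an arbitrary choice made here.

```python
import re


def first_jersey_number(value):
    """Return the first number in a jersey string, or -1 if there is none."""
    match = re.search(r"\d+", value)
    return int(match.group()) if match else -1


numbers = ["23", "1, 23", "", "00", "5"]
print(sorted(numbers, key=first_jersey_number))
# → ['', '00', '1, 23', '5', '23']
```

With pandas, the same helper could serve as the sort key via `key=lambda col: col.map(first_jersey_number)`.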

#### Output to JSON
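The DataFrame is written to a JSON file in an organized, *database-like* folder structure: one folder per team and one file per season. `to_json()` is called with its default orientation, so the file maps each column name to its values keyed by row index. <br>
If the specific web page has already been scraped, its JSON will be overwritten.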

In [None]:
jsonDf = teamDf.to_json()
jsonOutput = json.loads(jsonDf) # back to a dict so json.dump can pretty-print it
teamName = url.split('/')[-2] # getting the 'PHI' from '/PHI/2020.html' (ex.)
season = url.split('/')[-1].split('.')[0] # getting '2020' from '2020.html' (ex.), which is how all the used links are built
path = f'../NBA Data/{teamName}'
os.makedirs(path, exist_ok=True) # creating the folder of the team that's being read, if it doesn't exist yet
with open(f'{path}/{season}.json', 'w') as f: # 'w' overwrites the file from any earlier scrape of this page
    json.dump(jsonOutput, f, indent=4)
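
Splitting the URL on `'/'` works for links that follow the pattern exactly, but a query string or a trailing slash would end up in the file name. As a hedged alternative, the sketch below parses the URL with the standard library first. `team_and_season` is a hypothetical helper, and it still assumes that the last two path segments are the team abbreviation and `<season>.html`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def team_and_season(url):
    """Split a roster URL into (team abbreviation, season) strings."""
    # urlparse drops the query string; parts looks like ('/', 'teams', 'PHI', '2021.html')
    parts = PurePosixPath(urlparse(url).path).parts
    team = parts[-2]
    season = PurePosixPath(parts[-1]).stem  # '2021.html' -> '2021'
    return team, season


print(team_and_season("https://www.basketball-reference.com/teams/PHI/2021.html?x=1"))
# → ('PHI', '2021')
```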

### Final DataFrame:

In [None]:
teamDf.style.hide(axis='index') # replaces Styler.hide_index(), which was deprecated in pandas 1.4 and removed in 2.0