# NBA Web Scraping

## Introduction
> In this project, I analyze NBA team statistics starting from the season the league introduced the 3-point line, 1979-80 (listed as 1980 on Basketball Reference). My goal is to explore how NBA teams have performed: rather than individual players, I am interested in each team's offensive performance. The first step of this personal data exploration project is to gather all of the data I need.
>
> In this short notebook, I scrape the data from the **[Basketball Reference](https://www.basketball-reference.com/leagues/)** website, which records each team's performance by season. I focus on the 1980-2021 seasons only; 2022 is excluded because that season is still in progress at the time of this writing.


In [13]:
# import packages
import requests
import bs4 as BeautifulSoup
import pandas as pd

pd.set_option('display.max_columns', None)

In [14]:
years = list(range(1980,2022))
url_start = "https://www.basketball-reference.com/leagues/NBA_{}.html"

In [15]:
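# download each season's page and save it locally
for year in years:
    url = url_start.format(year)
    data = requests.get(url)
    data.raise_for_status()  # stop on a failed request instead of saving an error page

    # note: the team_stats/ directory must already exist
    with open("team_stats/{}.html".format(year), "w+") as f:
        f.write(data.text)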
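
Re-running the cell above downloads all 42 season pages again. One way to avoid that is sketched below. The helper name `fetch_if_missing`, its `fetch` parameter, and the 3-second default pause are assumptions made for illustration and are not part of the original notebook.

```python
import time
from pathlib import Path


def fetch_if_missing(url, path, fetch, delay=3.0):
    """Save the page at `url` to `path` unless a copy already exists.

    `fetch` is any callable that takes a URL and returns the page text.
    Returns True when a download happened, False when the saved file was reused.
    """
    path = Path(path)
    if path.exists():
        return False  # already downloaded on an earlier run
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    path.write_text(fetch(url))
    time.sleep(delay)  # pause so consecutive downloads are spaced out
    return True


# Possible use with the notebook's own names (requires network access):
# for year in years:
#     fetch_if_missing(url_start.format(year),
#                      "team_stats/{}.html".format(year),
#                      lambda u: requests.get(u).text)
```

Passing the download function in as an argument keeps the helper easy to try out with a stand-in function before pointing it at the real site.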

In [16]:
# create lists to store dataframes
total_dfs = []
avg_dfs = []
advanced_dfs = []

for year in years:
    with open("team_stats/{}.html".format(year)) as f:
        page = f.read()
    soup = BeautifulSoup.BeautifulSoup(page,'html.parser')
    # decompose unnecessary parts of the table
    soup.find('div', class_= "table_container").decompose()
    soup.find('tr', class_= "over_header").decompose()
    
    total_stats_table = soup.find(id = "div_totals-team")
    avg_stats_table = soup.find(id = "div_per_game-team")
    advanced_table = soup.find(id = "advanced-team")
    
    total_stats = pd.read_html(str(total_stats_table))[0]
    avg_stats = pd.read_html(str(avg_stats_table))[0]
    advanced_stats = pd.read_html(str(advanced_table))[0]
    
    # Add a Year column 
    total_stats["Year"] =  year
    avg_stats["Year"] =  year
    advanced_stats["Year"] = year
    
    total_dfs.append(total_stats)
    avg_dfs.append(avg_stats)
    advanced_dfs.append(advanced_stats)
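
To see the `decompose()` and `pd.read_html` steps in isolation, here is a small sketch on a made-up table. The HTML, the `demo` id, the team names, and the numbers are all invented for illustration and are not taken from Basketball Reference. The sketch also wraps the HTML in `StringIO`, since recent pandas versions warn when `read_html` is given a raw HTML string.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Invented table with a grouped "over header" row above the real column names
html = """
<table id="demo">
  <thead>
    <tr class="over_header"><th></th><th colspan="2">Offense</th></tr>
    <tr><th>Team</th><th>FG</th><th>FGA</th></tr>
  </thead>
  <tbody>
    <tr><td>Team A</td><td>10</td><td>20</td></tr>
    <tr><td>Team B</td><td>12</td><td>25</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop the grouped header row so only one header row remains
soup.find("tr", class_="over_header").decompose()

table = soup.find(id="demo")
df = pd.read_html(StringIO(str(table)))[0]
df["Year"] = 1980  # same tagging step as in the loop above

print(df.columns.tolist())
```

With the grouped row removed, the remaining header row becomes a flat set of column names.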

**I noticed that the tables do not include the NBA champion for each year, which is something I would like to include in my analysis. Therefore, I will also scrape the champions and runners-up from this [Basketball Reference webpage](https://www.basketball-reference.com/playoffs/). The steps are the same as before, but without the loop.**

In [17]:
url_2nd = 'https://www.basketball-reference.com/playoffs/'
data = requests.get(url_2nd)
with open("team_stats/champions.html", "w+") as f:
    f.write(data.text)

In [18]:
with open("team_stats/champions.html") as f:
    page = f.read()
soup = BeautifulSoup.BeautifulSoup(page,'html.parser')
# decompose unnecessary parts of the table
soup.find('tr', class_= "over_header").decompose()
champions_table = soup.find(id = "div_champions_index")
champions_stat = pd.read_html(str(champions_table))[0]

## Storing Data

In [19]:
total_stats_df = pd.concat(total_dfs)
total_stats_df.head()

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1.0,San Antonio Spurs*,82,19755,3856,7738,0.498,52,206,0.252,3804,7532,0.505,2024,2528,0.801,1153,2515,3668,2326,771,333,1589,2103,9788,1980
1,2.0,Los Angeles Lakers*,82,19880,3898,7368,0.529,20,100,0.2,3878,7268,0.534,1622,2092,0.775,1085,2653,3738,2413,774,546,1639,1784,9438,1980
2,3.0,Cleveland Cavaliers,82,19930,3811,8041,0.474,36,187,0.193,3775,7854,0.481,1702,2205,0.772,1307,2381,3688,2108,764,342,1370,1934,9360,1980
3,4.0,New York Knicks,82,19780,3802,7672,0.496,42,191,0.22,3760,7481,0.503,1698,2274,0.747,1236,2303,3539,2265,881,457,1613,2168,9344,1980
4,5.0,Boston Celtics*,82,19880,3617,7387,0.49,162,422,0.384,3455,6965,0.496,1907,2449,0.779,1227,2457,3684,2198,809,308,1539,1974,9303,1980
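
Each yearly table carries its own 0-based index, so `pd.concat` keeps those repeated labels and the first row of every season is labelled 0. The small sketch below, using invented frames, shows the effect and the `ignore_index=True` option in case a unique row index is preferred. The notebook itself keeps the default.

```python
import pandas as pd

# Two invented "seasons" with the same default 0-based index
df_1980 = pd.DataFrame({"Team": ["Team A", "Team B"], "Year": 1980})
df_1981 = pd.DataFrame({"Team": ["Team A", "Team B"], "Year": 1981})

stacked = pd.concat([df_1980, df_1981])                        # keeps original labels
renumbered = pd.concat([df_1980, df_1981], ignore_index=True)  # fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```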


In [20]:
avg_stats_df = pd.concat(avg_dfs)
avg_stats_df.head()

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1.0,San Antonio Spurs*,82,240.9,47.0,94.4,0.498,0.6,2.5,0.252,46.4,91.9,0.505,24.7,30.8,0.801,14.1,30.7,44.7,28.4,9.4,4.1,19.4,25.6,119.4,1980
1,2.0,Los Angeles Lakers*,82,242.4,47.5,89.9,0.529,0.2,1.2,0.2,47.3,88.6,0.534,19.8,25.5,0.775,13.2,32.4,45.6,29.4,9.4,6.7,20.0,21.8,115.1,1980
2,3.0,Cleveland Cavaliers,82,243.0,46.5,98.1,0.474,0.4,2.3,0.193,46.0,95.8,0.481,20.8,26.9,0.772,15.9,29.0,45.0,25.7,9.3,4.2,16.7,23.6,114.1,1980
3,4.0,New York Knicks,82,241.2,46.4,93.6,0.496,0.5,2.3,0.22,45.9,91.2,0.503,20.7,27.7,0.747,15.1,28.1,43.2,27.6,10.7,5.6,19.7,26.4,114.0,1980
4,5.0,Boston Celtics*,82,242.4,44.1,90.1,0.49,2.0,5.1,0.384,42.1,84.9,0.496,23.3,29.9,0.779,15.0,30.0,44.9,26.8,9.9,3.8,18.8,24.1,113.5,1980


In [21]:
advanced_stats_df = pd.concat(advanced_dfs)
advanced_stats_df.head()

Unnamed: 0,Rk,Team,Age,W,L,PW,PL,MOV,SOS,SRS,ORtg,DRtg,NRtg,Pace,FTr,3PAr,TS%,Unnamed: 17,eFG%,TOV%,ORB%,FT/FGA,Unnamed: 22,eFG%.1,TOV%.1,DRB%,FT/FGA.1,Unnamed: 27,Arena,Attend.,Attend./G,Year
0,1.0,Boston Celtics*,27.3,61.0,21.0,60,22,7.79,-0.42,7.37,109.4,101.9,7.5,102.6,0.332,0.057,0.55,,0.501,15.4,34.8,0.258,,0.475,16.5,67.8,0.234,,Boston Garden,596349.0,14664.0,1980
1,2.0,Los Angeles Lakers*,26.2,60.0,22.0,55,27,5.9,-0.51,5.4,109.5,103.9,5.6,104.1,0.284,0.014,0.569,,0.53,16.5,32.6,0.22,,0.475,14.0,66.9,0.181,,The Forum,582882.0,17505.0,1980
2,3.0,Seattle SuperSonics*,27.0,56.0,26.0,53,29,4.66,-0.42,4.24,105.8,101.2,4.6,101.8,0.298,0.025,0.52,,0.474,14.9,36.4,0.229,,0.463,15.4,67.9,0.221,,King County Domed Stadium,,28726.0,1980
3,4.0,Philadelphia 76ers*,27.0,59.0,23.0,52,30,4.22,-0.18,4.04,105.0,101.0,4.0,103.0,0.34,0.017,0.544,,0.494,17.2,33.5,0.262,,0.46,15.5,66.7,0.217,,The Spectrum,,,1980
4,5.0,Milwaukee Bucks*,25.3,49.0,33.0,51,31,3.94,-0.37,3.57,106.8,102.9,3.9,102.4,0.278,0.021,0.532,,0.491,15.0,35.2,0.212,,0.467,16.2,63.8,0.229,,MECCA Arena,,,1980


In [22]:
champions_df = pd.DataFrame(champions_stat)
champions_df

Unnamed: 0,Year,Lg,Champion,Runner-Up,Finals MVP,Unnamed: 5,Points,Rebounds,Assists,Win Shares
0,2021.0,NBA,Milwaukee Bucks,Phoenix Suns,G. Antetokounmpo,,G. Antetokounmpo (634),G. Antetokounmpo (269),J. Holiday (199),G. Antetokounmpo (3.7)
1,2020.0,NBA,Los Angeles Lakers,Miami Heat,L. James,,A. Davis (582),L. James (226),L. James (184),A. Davis (4.5)
2,2019.0,NBA,Toronto Raptors,Golden State Warriors,K. Leonard,,K. Leonard (732),D. Green (223),D. Green (187),K. Leonard (4.9)
3,2018.0,NBA,Golden State Warriors,Cleveland Cavaliers,K. Durant,,L. James (748),D. Green (222),L. James (198),L. James (5.2)
4,2017.0,NBA,Golden State Warriors,Cleveland Cavaliers,K. Durant,,L. James (591),K. Love (191),L. James (141),L. James (4.3)
...,...,...,...,...,...,...,...,...,...,...
83,,,,,,,,,,
84,1950.0,NBA,Minneapolis Lakers,Syracuse Nationals,,,G. Mikan (376),,J. Pollard (56),G. Mikan (3.7)
85,1949.0,BAA,Minneapolis Lakers,Washington Capitols,,,G. Mikan (303),,J. Pollard (39),G. Mikan (4.2)
86,1948.0,BAA,Baltimore Bullets,Philadelphia Warriors,,,J. Fulks (282),,H. Dallmar (37),C. Simmons (2.5)
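
The champions table reaches back well before 1980 and contains blank separator rows (such as row 83 above). Below is a minimal sketch of narrowing it to the 1980-2021 range used for the team tables, on a small invented frame that reuses the `Year` and `Champion` column names. Whether to do this here or in the cleaning notebook is a design choice, and this is only one possible approach.

```python
import pandas as pd

# Invented stand-in for the champions table, including one blank row
demo = pd.DataFrame({
    "Year": [2021.0, 1980.0, None, 1950.0],
    "Champion": ["Team A", "Team B", None, "Team C"],
})

cleaned = demo.dropna(subset=["Year"]).copy()  # drop blank separator rows
cleaned["Year"] = cleaned["Year"].astype(int)  # 2021.0 -> 2021
in_range = cleaned[cleaned["Year"].between(1980, 2021)].reset_index(drop=True)

print(in_range)
```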


In [23]:
total_stats_df.to_csv('data/total_stats_df.csv')
avg_stats_df.to_csv('data/avg_stats_df.csv')
advanced_stats_df.to_csv('data/advanced_stats_df.csv')
champions_df.to_csv('data/champions_df.csv')
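
Two details of the save step are worth knowing: `to_csv` does not create the `data/` folder, and by default it writes the row index as an extra unnamed column. The sketch below shows both points on an invented frame written to a temporary folder. The file names are placeholders, and the notebook's own calls above are left as they are.

```python
import tempfile
from pathlib import Path

import pandas as pd

demo = pd.DataFrame({"Team": ["Team A", "Team B"], "Year": [1980, 1980]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv will not create this folder

    demo.to_csv(out_dir / "with_index.csv")               # default: index is written
    demo.to_csv(out_dir / "no_index.csv", index=False)    # index is left out

    with_index_cols = pd.read_csv(out_dir / "with_index.csv").columns.tolist()
    no_index_cols = pd.read_csv(out_dir / "no_index.csv").columns.tolist()

print(with_index_cols)  # ['Unnamed: 0', 'Team', 'Year']
print(no_index_cols)    # ['Team', 'Year']
```

If the default is kept, the extra `Unnamed: 0` column can be dropped or read back as the index when the files are loaded again.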

**Please head on over to `Part II - NBA Data Cleaning.ipynb` to continue.**