### NBA MVP


* **Step 1** - Manually create three folders named `mvp`, `player`, and `team`.
* **Step 2** - Check whether any values seem incorrect or nonsensical.
* **Duplicate or redundant data** - Do we have duplicate rows or columns? Do any columns provide information that is already contained in other columns?
* **Missing data** - Are there any rows or columns that have blank, `np.NaN`, or otherwise missing data? Should they be dropped or replaced?
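
Step 1 can also be done from the notebook itself. The following is a minimal sketch, assuming the three folder names from the list above; `make_output_dirs` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path

def make_output_dirs(base=".", names=("mvp", "player", "team")):
    """Create one folder per scraped page type; existing folders are left alone."""
    created = []
    for name in names:
        path = Path(base) / name
        # exist_ok=True makes the call safe to repeat on later runs.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_output_dirs()` with no arguments would create the folders in the notebook's working directory.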

> Installing the "requests" library 

> Defining our range 

> Spotting the year part in the URL

In [1]:

!pip install requests
years = list(range(1991, 2023))
url_start = "https://www.basketball-reference.com/awards/awards_{}.html"



In [2]:
import requests

for year in years:
    url = url_start.format(year)
    data = requests.get(url)
    # Creates one .html file per year in the pre-created mvp folder.
    with open("mvp/{}.html".format(year), "w+") as f:
        f.write(data.text)
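
The loop above writes whatever the server returns, including error pages, and sends the requests back to back. As a hedged sketch of a more defensive variant: `download_pages`, its `fetch` parameter, and the `delay` value are assumptions made for illustration and are not part of the original notebook. `fetch` is any callable that takes a URL and returns the page text, so the real download call can be swapped in.

```python
import time
from pathlib import Path

def download_pages(years, url_template, out_dir, fetch, delay=0.0):
    """Save one HTML file per year and return the paths that were written.

    fetch: callable(url) -> str (hypothetical hook, see the note below).
    delay: seconds to wait after each request.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for year in years:
        html = fetch(url_template.format(year))
        path = out / f"{year}.html"
        # An explicit encoding avoids platform-dependent defaults.
        path.write_text(html, encoding="utf-8")
        written.append(path)
        time.sleep(delay)
    return written
```

With `requests`, `fetch` could be a small function that calls `requests.get(url)`, then `raise_for_status()` on the response, and returns its `.text`.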

In [3]:
# Parse the MVP voting table with BeautifulSoup
!pip install beautifulsoup4
from bs4 import BeautifulSoup



In [4]:
# Open the previously saved 1996.html in read mode.
with open("mvp/1996.html") as f:
    page = f.read()

We can inspect particular elements of the web page by right-clicking them and choosing Inspect in the browser.

In [5]:
soup = BeautifulSoup(page, "html.parser")
soup.find('tr', class_="over_header").decompose()  # remove the over-header row from the parsed page

The whole line is an HTML element: `tr` is the tag, and it has a class called `over_header`.
An id in HTML is meant to be unique within a page. The table we want has `id="mvp"`.
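
To make the tag, class, and id distinction concrete, here is a small self-contained sketch. The HTML string is made up for illustration and is far simpler than the real page; only the `over_header` class and the `mvp` id are taken from the notebook.

```python
from bs4 import BeautifulSoup

# Toy markup (an assumption, not the real page structure).
html = """
<table id="mvp">
  <tr class="over_header"><th>Voting</th><th>Per Game</th></tr>
  <tr><th>Rank</th><th>Player</th></tr>
  <tr><td>1</td><td>Michael Jordan</td></tr>
</table>
<table id="other"><tr><td>ignored</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first <tr> tag whose class is "over_header";
# decompose() removes that tag from the parsed tree.
soup.find("tr", class_="over_header").decompose()

# Looking up by id returns the table with id="mvp".
mvp_table = soup.find(id="mvp")
rows = mvp_table.find_all("tr")
header_cells = [th.get_text() for th in rows[0].find_all("th")]
```

After the over-header row is removed, the first remaining row of the `mvp` table is the real column header.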

In [6]:
mvp_table = soup.find(id="mvp") 

In [7]:
import pandas as pd
mvp_1996 = pd.read_html(str(mvp_table))[0]  # read_html returns a list of DataFrames; [0] selects the first one
mvp_1996

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,1,Michael Jordan,32,CHI,109.0,1114.0,1130,0.986,82,37.7,30.4,6.6,4.3,2.2,0.5,0.495,0.427,0.834,20.4,0.317
1,2,David Robinson,30,SAS,0.0,574.0,1130,0.508,82,36.8,25.0,12.2,3.0,1.4,3.3,0.516,0.333,0.761,18.3,0.29
2,3,Anfernee Hardaway,24,ORL,2.0,360.0,1130,0.319,82,36.8,21.7,4.3,7.1,2.0,0.5,0.513,0.314,0.767,14.4,0.229
3,4,Hakeem Olajuwon,33,HOU,1.0,238.0,1130,0.211,72,38.8,26.9,10.9,3.6,1.6,2.9,0.514,0.214,0.724,9.7,0.166
4,5,Scottie Pippen,30,CHI,0.0,226.0,1130,0.2,77,36.7,19.4,6.4,5.9,1.7,0.7,0.463,0.374,0.679,12.3,0.209
5,6,Gary Payton,27,SEA,0.0,98.0,1130,0.087,81,39.0,19.3,4.2,7.5,2.9,0.2,0.484,0.328,0.748,11.5,0.174
6,7,Karl Malone,32,UTA,1.0,85.0,1130,0.075,82,38.0,25.7,9.8,4.2,1.7,0.7,0.519,0.4,0.723,15.1,0.233
7,8,Shawn Kemp,26,SEA,0.0,73.0,1130,0.065,79,33.3,19.6,11.4,2.2,1.2,1.6,0.561,0.417,0.742,11.2,0.205
8,9T,Grant Hill,23,DET,0.0,63.0,1130,0.056,80,40.8,20.2,9.8,6.9,1.3,0.6,0.462,0.192,0.751,11.7,0.172
9,9T,Shaquille O'Neal,23,ORL,0.0,63.0,1130,0.056,54,36.0,26.6,11.0,2.9,0.6,2.1,0.573,0.5,0.487,6.9,0.171


In [10]:
# Build a list of DataFrames, one per saved page in the mvp folder.
dfs = []
for year in years:
    with open("mvp/{}.html".format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_='over_header').decompose()
    mvp_table = soup.find(id="mvp")
    mvp = pd.read_html(str(mvp_table))[0]
    mvp["Year"] = year  # record which year each row came from; do this before rearranging the columns
    dfs.append(mvp)

In [11]:
mvps = pd.concat(dfs)

# Rearranging the columns so that Year is the second column
year_col = mvps['Year']
mvps = mvps.drop(columns=['Year'])
mvps.insert(loc=1, column='Year', value=year_col)

mvps.sample(10)

Unnamed: 0,Rank,Year,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,...,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
11,12,2020,Jayson Tatum,21,BOS,0.0,1.0,1010,0.001,66,...,23.4,7.0,3.0,1.4,0.9,0.45,0.403,0.812,6.9,0.146
6,7,2006,Elton Brand,26,LAC,1.0,50.0,1250,0.04,79,...,24.7,10.0,2.6,1.0,2.5,0.527,0.333,0.775,14.8,0.229
8,9T,2005,Amar'e Stoudemire,22,PHO,1.0,41.0,1270,0.032,80,...,26.0,8.9,1.6,1.0,1.6,0.559,0.188,0.733,14.6,0.243
10,11,2021,Russell Westbrook,32,WAS,0.0,5.0,1010,0.005,65,...,22.2,11.5,11.7,1.4,0.4,0.439,0.315,0.656,3.7,0.075
4,5,2021,Chris Paul,35,PHO,2.0,139.0,1010,0.138,70,...,16.4,4.5,8.9,1.4,0.3,0.499,0.395,0.934,9.2,0.201
9,10T,2018,Jimmy Butler,28,MIN,0.0,5.0,1010,0.005,59,...,22.2,5.3,4.9,2.0,0.4,0.474,0.35,0.854,8.9,0.198
18,19T,1991,Tim Hardaway,24,GSW,0.0,1.0,960,0.001,82,...,22.9,4.0,9.7,2.6,0.1,0.476,0.385,0.803,9.9,0.148
18,16T,1999,Glenn Robinson,26,MIL,0.0,1.0,1180,0.001,47,...,18.4,5.9,2.1,1.0,0.9,0.459,0.392,0.87,4.0,0.122
3,4,2011,Kobe Bryant,32,LAL,1.0,428.0,1210,0.354,82,...,25.3,5.1,4.7,1.2,0.1,0.451,0.323,0.828,10.3,0.178
14,11T,1994,John Stockton,31,UTA,0.0,1.0,1010,0.001,82,...,15.1,3.1,12.6,2.4,0.3,0.528,0.322,0.805,13.2,0.214
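
The row labels in the sample above repeat across years because `pd.concat` keeps each table's original index by default. As a minimal sketch on made-up frames (the two small DataFrames are assumptions for illustration), `ignore_index=True` renumbers the rows, and `pop` plus `insert` moves `Year` in one step:

```python
import pandas as pd

# Two toy per-year tables standing in for the scraped MVP tables.
dfs = [
    pd.DataFrame({"Rank": ["1", "2"], "Player": ["A", "B"], "Year": 1991}),
    pd.DataFrame({"Rank": ["1", "2"], "Player": ["C", "D"], "Year": 1992}),
]

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
mvps = pd.concat(dfs, ignore_index=True)

# pop() removes the column and returns it; insert() puts it back at position 1.
mvps.insert(1, "Year", mvps.pop("Year"))
```

This is equivalent to the drop-then-insert sequence in the cell above, just written more compactly.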


In [12]:
# Save the pandas DataFrame to a CSV file, useful for loading into Tableau, SQL, etc.
mvps.to_csv("mvps.csv")
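
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. A small sketch on a made-up frame, writing to in-memory buffers instead of files, shows the effect of `index=False`:

```python
import io
import pandas as pd

df = pd.DataFrame({"Player": ["A", "B"], "Share": [0.9, 0.1]})

with_index = io.StringIO()
df.to_csv(with_index)  # default: the index is written as the first column

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # the index is left out

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
```

Whether to keep the index depends on whether the per-year row position is useful downstream.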

Next, we download the per-game statistics for every player in each year of our range.

In [15]:
player_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

url = player_stats_url.format(1996)
data = requests.get(url)
with open("player/1996.html", "w+") as f:
    f.write(data.text)

Herein lies a problem: we have only scraped a portion of the page. The rest of the page is rendered with JavaScript, which `requests` does not execute, so that content never reaches our parser.

To deal with the JavaScript, we use the Selenium library.
We downloaded the chromedriver build that matches our Google Chrome version. Wherever that executable is stored, make sure the `executable_path` below matches its file path.

In [17]:
!pip install selenium
from selenium import webdriver
driver = webdriver.Chrome(executable_path="/Users/Malcolm/Documents/ipynb Repository/chromedriver")





In [89]:
import time
year = 2022
url = player_stats_url.format(year)

driver.get(url)
driver.execute_script("window.scrollTo(1, 100000)")
time.sleep(2)

html = driver.page_source

NoSuchWindowException: Message: no such window: window was already closed
  (Session info: chrome=103.0.5060.53)
Stacktrace:
0   chromedriver                        0x00000001012ff079 chromedriver + 4444281
1   chromedriver                        0x000000010128b403 chromedriver + 3970051
2   chromedriver                        0x0000000100f26038 chromedriver + 409656
3   chromedriver                        0x0000000100f16599 chromedriver + 345497
4   chromedriver                        0x0000000100f179b2 chromedriver + 350642
5   chromedriver                        0x0000000100f1032c chromedriver + 320300
6   chromedriver                        0x0000000100f27452 chromedriver + 414802
7   chromedriver                        0x0000000100f8b8db chromedriver + 825563
8   chromedriver                        0x0000000100f79683 chromedriver + 751235
9   chromedriver                        0x0000000100f4fa45 chromedriver + 580165
10  chromedriver                        0x0000000100f50a95 chromedriver + 584341
11  chromedriver                        0x00000001012d055d chromedriver + 4253021
12  chromedriver                        0x00000001012d53a1 chromedriver + 4273057
13  chromedriver                        0x00000001012da16f chromedriver + 4292975
14  chromedriver                        0x00000001012d5dea chromedriver + 4275690
15  chromedriver                        0x00000001012af54f chromedriver + 4117839
16  chromedriver                        0x00000001012efed8 chromedriver + 4382424
17  chromedriver                        0x00000001012f005f chromedriver + 4382815
18  chromedriver                        0x00000001013068d5 chromedriver + 4475093
19  libsystem_pthread.dylib             0x00007ff8102a64f4 _pthread_start + 125
20  libsystem_pthread.dylib             0x00007ff8102a200f thread_start + 15


In [90]:
with open("player/{}.html".format(year), "w+") as f:
    f.write(html)

Now that the single-year test works, we can run the for loop over every year.
> The for loop below is not working, so it is commented out for now. We can still get the files by changing the `year` variable in the cell above and re-running it for each year in turn.

In [37]:
# for year in years:
#     url = player_stats_url.format(year)

#     driver.get(url)
#     driver.execute_script("window.scrollTo(1,10000)")
#     time.sleep(2)

#     html = driver.page_source
#     with open("player/{}.html".format(year), "w+") as f:
#         f.write(html)

WebDriverException: Message: unknown error: unexpected command response
  (Session info: chrome=103.0.5060.53)
Stacktrace:
0   chromedriver                        0x00000001012ff079 chromedriver + 4444281
1   chromedriver                        0x000000010128b403 chromedriver + 3970051
2   chromedriver                        0x0000000100f26038 chromedriver + 409656
3   chromedriver                        0x0000000100f133c8 chromedriver + 332744
4   chromedriver                        0x0000000100f12ac7 chromedriver + 330439
5   chromedriver                        0x0000000100f12047 chromedriver + 327751
6   chromedriver                        0x0000000100f11803 chromedriver + 325635
7   chromedriver                        0x0000000100f2d1fa chromedriver + 438778
8   chromedriver                        0x0000000100f8c62d chromedriver + 828973
9   chromedriver                        0x0000000100f79683 chromedriver + 751235
10  chromedriver                        0x0000000100f4fa45 chromedriver + 580165
11  chromedriver                        0x0000000100f50a95 chromedriver + 584341
12  chromedriver                        0x00000001012d055d chromedriver + 4253021
13  chromedriver                        0x00000001012d53a1 chromedriver + 4273057
14  chromedriver                        0x00000001012da16f chromedriver + 4292975
15  chromedriver                        0x00000001012d5dea chromedriver + 4275690
16  chromedriver                        0x00000001012af54f chromedriver + 4117839
17  chromedriver                        0x00000001012efed8 chromedriver + 4382424
18  chromedriver                        0x00000001012f005f chromedriver + 4382815
19  chromedriver                        0x00000001013068d5 chromedriver + 4475093
20  libsystem_pthread.dylib             0x00007ff8102a64f4 _pthread_start + 125
21  libsystem_pthread.dylib             0x00007ff8102a200f thread_start + 15
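
One way to make a loop like the commented-out one more tolerant of intermittent browser errors is to retry each year a few times and skip files that already exist. The sketch below is an assumption about how that could look, not the notebook's method: `scrape_with_retries` and its `fetch_page` parameter are hypothetical, and `fetch_page` stands in for the `driver.get` and `driver.page_source` steps.

```python
import time
from pathlib import Path

def scrape_with_retries(years, fetch_page, out_dir, attempts=3, wait=0.0):
    """Fetch each year's page, retrying on errors; return the years that failed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for year in years:
        path = out / f"{year}.html"
        if path.exists():
            continue  # already downloaded on an earlier run
        for _ in range(attempts):
            try:
                html = fetch_page(year)
            except Exception:
                time.sleep(wait)  # pause before the next attempt
                continue
            path.write_text(html, encoding="utf-8")
            break
        else:
            # The inner loop never reached `break`: every attempt raised.
            failed.append(year)
    return failed
```

In the notebook, `fetch_page` could wrap the Selenium lines from the single-year cell and return `driver.page_source`. Re-running the function later only retries the years whose files are still missing.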


In [93]:
year = 1991
with open("player/{}.html".format(year)) as f:
    page = f.read()
    
soup = BeautifulSoup(page, "html.parser")
soup.find('tr', class_='thead').decompose()
player_table = soup.find(id="per_game_stats")
player = pd.read_html(str(player_table))[0]
player["Year"] = year
player.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1,Alaa Abdelnaby,PF,22,POR,43,0,6.7,1.3,2.7,...,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1991
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19,22.5,6.2,15.1,...,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1991
2,3,Mark Acres,C,28,ORL,68,0,19.3,1.6,3.1,...,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1991
3,4,Michael Adams,PG,28,DEN,66,66,35.5,8.5,21.5,...,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1991
4,5,Mark Aguirre,SF,31,DET,78,13,25.7,5.4,11.7,...,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1991


Let's hope this loop works.

In [95]:
player_dfs = []
for year in years:
    with open("player/{}.html".format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_="thead").decompose()
    player_table = soup.find(id="per_game_stats")
    player = pd.read_html(str(player_table))[0]
    player["Year"] = year
    player_dfs.append(player)

In [102]:
players = pd.concat(player_dfs)

# Rearranging the columns so that Year is the second column
year_col = players['Year']
players = players.drop(columns=['Year'])
players.insert(loc=1, column='Year', value=year_col)
players

Unnamed: 0,Rk,Year,Player,Pos,Age,Tm,G,GS,MP,FG,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,1991,Alaa Abdelnaby,PF,22,POR,43,0,6.7,1.3,...,.568,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1
1,2,1991,Mahmoud Abdul-Rauf,PG,21,DEN,67,19,22.5,6.2,...,.857,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1
2,3,1991,Mark Acres,C,28,ORL,68,0,19.3,1.6,...,.653,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2
3,4,1991,Michael Adams,PG,28,DEN,66,66,35.5,8.5,...,.879,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5
4,5,1991,Mark Aguirre,SF,31,DET,78,13,25.7,5.4,...,.757,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
522,436,2022,Trevor Winter,C,25,MIN,1,0,5.0,0.0,...,,1.0,2.0,3.0,0.0,0.0,0.0,0.0,5.0,0.0
523,437,2022,Joe Wolf,PF,34,CHH,3,0,4.0,0.0,...,.000,0.0,0.3,0.3,0.0,0.0,0.0,0.0,2.0,0.0
524,438,2022,Haywoode Workman,PG,33,MIL,29,29,28.1,2.5,...,.787,0.5,3.0,3.5,5.9,1.1,0.0,2.2,1.8,6.9
525,439,2022,Lorenzen Wright,C,23,LAC,48,15,23.6,2.5,...,.692,3.0,4.6,7.5,0.7,0.5,0.8,1.0,3.4,6.6


In [103]:
players.to_csv("players.csv")

## Finally, we need the team standings per year.

In [105]:
# We don't need the JavaScript-rendered sections here, so requests is enough and Selenium is not needed
team_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_standings.html"
for year in years:
    url = team_stats_url.format(year)

    data = requests.get(url)

    with open("team/{}.html".format(year), "w+") as f:
        f.write(data.text)

In [None]:
dfs = []
with open("team/{}.html".format(year)) as f:
    page = f.read()

soup = BeautifulSoup(page, "html.parser")
soup.find('tr', class_="thead").decompose()
eastern_table = soup.find(id="divs_standings_E")  # find() returns a single tag, so no [0] index is needed
eastern_df = pd.read_html(str(eastern_table))[0]
eastern_df["Year"] = year
eastern_df["Team"] = eastern_df["Eastern Conference"]
del eastern_df["Eastern Conference"]
dfs.append(eastern_df)
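
The copy-then-delete step for the team column can also be written as a single `rename`. A minimal sketch on a made-up frame (the rows are illustrative only; the `Eastern Conference` column name is the one used in the cell above):

```python
import pandas as pd

# Toy standings rows standing in for the parsed Eastern Conference table.
eastern_df = pd.DataFrame({"Eastern Conference": ["Team A", "Team B"], "W": [50, 40]})
eastern_df["Year"] = 2022

# rename() returns a new frame with the column relabelled.
eastern_df = eastern_df.rename(columns={"Eastern Conference": "Team"})
```

One difference from the cell above: `rename` keeps `Team` in the position the conference column had, whereas copying and deleting moves it to the end.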

    