# Mini Project - NBA Players' Stats Web Scraping

The goal of this mini project is to practice web scraping with Python. The task is to scrape each selected NBA player's Wikipedia page and build a dataset from the columns of the statistics table found on that page. The dataset is then used for some exploratory data analysis.

In [58]:
import requests
from bs4 import BeautifulSoup
import pandas as pd



Let's define the URLs of the NBA players' pages that we want to scrape.

In [42]:
Players = {
    "LBJ": "Lebron James",
    "Giannis": "Giannis Antetokoumpo",
    "Luka": "Luka Doncic",
    "Chef": "Stephen Curry",
    "Joker": "Nikola Jokic",
    "Antman": "Anthony Edwards",
    "KD": "Kevin Durant",
    "Shai": "Shai Gilgeous-Alexander",
    "Wemby": "Victor Wembanyama",
    "Drew": "Kyrie Irving"
    # Sorry Tatum ;)
}

# Build each Wikipedia URL from the player's name (spaces become underscores).
URLs = {
    k: "https://en.wikipedia.org/wiki/" + v.replace(" ", "_")
    for k, v in Players.items()
}

In [49]:
# The plain "Anthony_Edwards" title does not point to the basketball player's article,
# so we append Wikipedia's disambiguation suffix.
URLs["Antman"] = URLs.get("Antman") + "_(basketball)"

In [50]:
print(URLs)

{'LBJ': 'https://en.wikipedia.org/wiki/Lebron_James', 'Giannis': 'https://en.wikipedia.org/wiki/Giannis_Antetokoumpo', 'Luka': 'https://en.wikipedia.org/wiki/Luka_Doncic', 'Chef': 'https://en.wikipedia.org/wiki/Stephen_Curry', 'Joker': 'https://en.wikipedia.org/wiki/Nikola_Jokic', 'Antman': 'https://en.wikipedia.org/wiki/Anthony_Edwards_(basketball)', 'KD': 'https://en.wikipedia.org/wiki/Kevin_Durant', 'Shai': 'https://en.wikipedia.org/wiki/Shai_Gilgeous-Alexander', 'Wemby': 'https://en.wikipedia.org/wiki/Victor_Wembanyama', 'Drew': 'https://en.wikipedia.org/wiki/Kyrie_Irving'}
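
The URL construction, including the disambiguation suffix, could be wrapped in a small helper. This is only a sketch: `wiki_url` and `WIKI_BASE` are hypothetical names that do not appear in the notebook, and percent-encoding with `urllib.parse.quote` is an addition that the original cells do not perform.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"  # hypothetical constant


def wiki_url(name, suffix=""):
    """Build a Wikipedia article URL from a display name.

    `suffix` carries an optional disambiguation tag such as " (basketball)".
    """
    title = (name + suffix).replace(" ", "_")
    # Keep parentheses readable; percent-encode any other unsafe character.
    return WIKI_BASE + quote(title, safe="()")


print(wiki_url("Kevin Durant"))
print(wiki_url("Anthony Edwards", " (basketball)"))
```

Calling the helper twice is harmless, whereas re-running a cell that appends the suffix to an existing URL would add it a second time.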

Now that we have the URLs, we can start scraping the data from each player's page. We will use the `requests` and `BeautifulSoup` libraries to scrape the data.

In [51]:
# Download every page; `url` holds the Response objects, in the same order as URLs.
url = [requests.get(v) for v in URLs.values()]
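
A slightly more defensive way to download the pages might look like the sketch below. It assumes a hypothetical helper `fetch_pages` and a made-up `User-Agent` string; neither is part of the notebook. Unlike the cell above, it returns a dict keyed by player rather than a list, sets a timeout, and fails early on HTTP errors.

```python
import requests

# Hypothetical User-Agent; identifying the client is considered polite scraping practice.
HEADERS = {"User-Agent": "nba-stats-mini-project/0.1 (educational scraping example)"}


def fetch_pages(urls, session=None, timeout=10):
    """Download every URL in `urls` (a dict of key -> URL) and return key -> Response."""
    session = session or requests.Session()
    responses = {}
    for key, page_url in urls.items():
        response = session.get(page_url, headers=HEADERS, timeout=timeout)
        # Raise an exception on 4xx/5xx instead of silently parsing an error page.
        response.raise_for_status()
        responses[key] = response
    return responses


# Usage with the notebook's dict (requires network access):
# pages = fetch_pages(URLs)
```

Keeping the responses keyed by player avoids having to remember which list index belongs to which player.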

In [82]:
content = [BeautifulSoup(response.content, 'html.parser') for response in url]

In [83]:
print(content[5].find('table', {'class': 'wikitable'}))

<table class="wikitable" style="font-size:90%;">
<caption><a href="/wiki/College_recruiting" title="College recruiting">College recruiting</a> information
</caption>
<tbody><tr>
<th>Name
</th>
<th>Hometown
</th>
<th>High school / college
</th>
<th>Height
</th>
<th>Weight
</th>
<th>Commit date
</th></tr>
<tr style="border-bottom: 3px; text-align: center">
<td rowspan="2"><b>Anthony Edwards </b><br/><i><a href="/wiki/Shooting_guard" title="Shooting guard">SG</a></i>
</td>
<td><a href="/wiki/Atlanta" title="Atlanta">Atlanta, GA</a>
</td>
<td><a href="/wiki/Holy_Spirit_Preparatory_School" title="Holy Spirit Preparatory School">Holy Spirit Prep</a> (GA)
</td>
<td>6 ft 4 in (1.93 m)
</td>
<td>205 lb (93 kg)
</td>
<td class="nowrap">Feb 11, 2019 
</td></tr>
<tr style="text-align: center">
<td colspan="7"><b><a href="/wiki/College_recruiting#Star_ratings" title="College recruiting">Star ratings</a>:</b> <a href="/wiki/Rivals.com" title="Rivals.com">Rivals</a>:<span typeof="mw:File"><a class="m

In [84]:
titles = [soup.find('h1').text for soup in content]
print(titles)

['LeBron James', 'Giannis Antetokounmpo', 'Luka Dončić', 'Stephen Curry', 'Nikola Jokić', 'Anthony Edwards (basketball)', 'Kevin Durant', 'Shai Gilgeous-Alexander', 'Victor Wembanyama', 'Kyrie Irving']


Now we scrape the stats table on each player's Wikipedia page.
We will store the rows in a list of lists, one list per player.

In [147]:
ALL_PLAYER_STATS = []

# Column names come from the header cells of the first player's first wikitable.
first_table = content[0].find('table', {'class': 'wikitable'})
table_headers = [header.text.strip() for header in first_table.find_all('th')]

for soup in content:
    # Only the first wikitable on each page is read.
    stats_table = soup.find('table', {'class': 'wikitable'})

    # Keep rows that contain data cells; header-only rows have no <td>.
    stats = []
    for row in stats_table.find_all('tr'):
        columns = row.find_all('td')
        if columns:
            stats.append([column.text.strip() for column in columns])
    ALL_PLAYER_STATS.append(stats)

The players at index 2, 3, 5 and 9 (Luka, Chef, Antman and Drew) cause problems: on their pages the first `wikitable` is not the stats table. For Anthony Edwards, for example, it is the college recruiting table printed above. For now we will skip these players and come back to them later.
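
One possible way to handle such pages later is to pick the table by its header cells instead of taking the first `wikitable`. The sketch below assumes a hypothetical helper `find_stats_table` and assumes that the stats table is the first one whose headers include `Year`, `Team`, `GP` and `PPG`; that assumption has not been checked against every player's page.

```python
from bs4 import BeautifulSoup

# Assumed marker columns of a stats table (taken from the column names used in this notebook).
REQUIRED_HEADERS = {"Year", "Team", "GP", "PPG"}


def find_stats_table(soup):
    """Return the first wikitable whose header cells contain REQUIRED_HEADERS, else None."""
    for table in soup.find_all("table", {"class": "wikitable"}):
        headers = {th.get_text(strip=True) for th in table.find_all("th")}
        if REQUIRED_HEADERS <= headers:
            return table
    return None


# Usage with the notebook's parsed pages:
# stats_table = find_stats_table(content[5])
```

Returning `None` when nothing matches lets the caller decide whether to skip the player or raise an error.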

In [148]:
# Pop from the highest index down so the remaining indexes stay valid.
for index in (9, 5, 3, 2):
    ALL_PLAYER_STATS.pop(index)

In [149]:
for i in ("Chef", "Antman", "Drew", "Luka"):
    # pop() with a default lets this cell be re-run without raising a KeyError.
    Players.pop(i, None)
    URLs.pop(i, None)
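
Skipping players is easier to keep consistent when every structure is keyed by the player's nickname rather than by list position. The following is a minimal sketch under that assumption; `drop_players` and `stats_by_player` are hypothetical names, and the empty lists stand in for the scraped rows.

```python
def drop_players(keys_to_drop, *mappings):
    """Remove the given keys from every mapping; missing keys are ignored."""
    for mapping in mappings:
        for key in keys_to_drop:
            mapping.pop(key, None)


players = {"LBJ": "Lebron James", "Chef": "Stephen Curry", "KD": "Kevin Durant"}
stats_by_player = {key: [] for key in players}  # placeholder rows, filled by the scraper

drop_players(["Chef"], players, stats_by_player)
drop_players(["Chef"], players, stats_by_player)  # running it again changes nothing
print(list(players))          # ['LBJ', 'KD']
print(list(stats_by_player))  # ['LBJ', 'KD']
```

Because the removal works on keys, it can be run any number of times and never shifts the position of the remaining players.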

In [150]:
print(len(ALL_PLAYER_STATS))
print(ALL_PLAYER_STATS[0][-1])

6
['All-Star[665]', '20‡', '20‡', '26.8', '.513', '.297', '.725', '5.7', '5.7', '1.1', '.4', '21.7']


Here we can see an issue with the "Career" and "All-Star" rows: they have no team name, so they are one cell shorter than the season rows. We will insert `None` in the team position (index 1) so that every row lines up with the headers.

In [151]:
# Every player except the last one has two summary rows to pad (the last two rows).
for stats in ALL_PLAYER_STATS[:-1]:
    stats[-1].insert(1, None)
    stats[-2].insert(1, None)

# For the last player, only the final row is padded.
ALL_PLAYER_STATS[5][-1].insert(1, None)
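
The same padding could be written without hard-coding which rows are summary rows, by padding every row that is shorter than the header list. This is a sketch only; `pad_short_rows` is a hypothetical helper, and it assumes that a short row is always missing the team cell at index 1.

```python
def pad_short_rows(rows, n_columns, position=1, fill=None):
    """Insert `fill` at `position` until each row has `n_columns` cells (modifies rows in place)."""
    for row in rows:
        while len(row) < n_columns:
            row.insert(position, fill)
    return rows


# The All-Star row printed earlier has 12 cells, while the table has 13 columns.
all_star = ['All-Star[665]', '20‡', '20‡', '26.8', '.513', '.297', '.725',
            '5.7', '5.7', '1.1', '.4', '21.7']
pad_short_rows([all_star], n_columns=13)
print(all_star[:3])  # ['All-Star[665]', None, '20‡']
```

Rows that already have the full number of cells are left untouched, so the function can be applied to a player's whole table.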


In [152]:
print(Players.keys())

dict_keys(['LBJ', 'Giannis', 'Joker', 'KD', 'Shai', 'Wemby'])


In [153]:
LBJDF = pd.DataFrame(ALL_PLAYER_STATS[0], columns=table_headers)
GiannisDF = pd.DataFrame(ALL_PLAYER_STATS[1], columns=table_headers)
JokerDF = pd.DataFrame(ALL_PLAYER_STATS[2], columns=table_headers)
KDDF = pd.DataFrame(ALL_PLAYER_STATS[3], columns=table_headers)
ShaiDF = pd.DataFrame(ALL_PLAYER_STATS[4], columns=table_headers)
WembyDF = pd.DataFrame(ALL_PLAYER_STATS[5], columns=table_headers)

In [154]:
LBJDF.head()

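| | Year | Team | GP | GS | MPG | FG% | 3P% | FT% | RPG | APG | SPG | BPG | PPG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003–04 | Cleveland | 79 | 79 | 39.5 | 0.417 | 0.29 | 0.754 | 5.5 | 5.9 | 1.6 | 0.7 | 20.9 |
| 1 | 2004–05 | Cleveland | 80 | 80 | 42.3* | 0.472 | 0.351 | 0.75 | 7.4 | 7.2 | 2.2 | 0.7 | 27.2 |
| 2 | 2005–06 | Cleveland | 79 | 79 | 42.5 | 0.48 | 0.335 | 0.738 | 7.0 | 6.6 | 1.6 | 0.8 | 31.4 |
| 3 | 2006–07 | Cleveland | 78 | 78 | 40.9 | 0.476 | 0.319 | 0.698 | 6.7 | 6.0 | 1.6 | 0.7 | 27.3 |
| 4 | 2007–08 | Cleveland | 75 | 74 | 40.4 | 0.484 | 0.315 | 0.712 | 7.9 | 7.2 | 1.8 | 1.1 | 30.0* |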


We now have a DataFrame of stats for each player we could scrape.

We can now do some exploratory data analysis on the dataset.
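
Before any numeric analysis, note that the scraped cells are strings, and some carry footnote markers such as `42.3*`, `20‡` or `All-Star[665]`. A conversion step along the following lines may be needed. It is a sketch: `to_number` is a hypothetical helper, and the marker characters it strips are only the ones visible in the outputs above.

```python
import re

# Footnote markers seen in the scraped cells: [123]-style references, *, † and ‡.
_MARKERS = re.compile(r"\[\d+\]|[*†‡]")


def to_number(cell):
    """Convert a scraped cell such as '42.3*' or '.513' to float; return None if not numeric."""
    if cell is None:
        return None
    cleaned = _MARKERS.sub("", cell).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


print(to_number("42.3*"))      # 42.3
print(to_number(".513"))       # 0.513
print(to_number("Cleveland"))  # None
```

Applied to a stat column, for example with `LBJDF["PPG"].map(to_number)`, this would give a numeric column that can be averaged or sorted.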

## Exploratory Data Analysis

We want to answer the following questions:

I. Average number of points, assists, rebounds, steals, and blocks per game for the players in the dataset.
1. How do the players rank in terms of points per game?
2. How do the players rank in terms of assists per game?
3. How do the players rank in terms of rebounds per game?
4. How do the players rank in terms of steals per game?
5. How do the players rank in terms of blocks per game?

II. Which players have the highest field goal percentage, free throw percentage, and three-point percentage?

III. Which players played the most minutes per game in a season?

I. Average number of points, assists, rebounds, steals, and blocks per game for the players in the dataset.

1. Let's rank the players in terms of points per game, on average and then per season.

In [157]:
LBJDF.head(24)

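| | Year | Team | GP | GS | MPG | FG% | 3P% | FT% | RPG | APG | SPG | BPG | PPG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003–04 | Cleveland | 79 | 79 | 39.5 | 0.417 | 0.29 | 0.754 | 5.5 | 5.9 | 1.6 | 0.7 | 20.9 |
| 1 | 2004–05 | Cleveland | 80 | 80 | 42.3* | 0.472 | 0.351 | 0.75 | 7.4 | 7.2 | 2.2 | 0.7 | 27.2 |
| 2 | 2005–06 | Cleveland | 79 | 79 | 42.5 | 0.48 | 0.335 | 0.738 | 7.0 | 6.6 | 1.6 | 0.8 | 31.4 |
| 3 | 2006–07 | Cleveland | 78 | 78 | 40.9 | 0.476 | 0.319 | 0.698 | 6.7 | 6.0 | 1.6 | 0.7 | 27.3 |
| 4 | 2007–08 | Cleveland | 75 | 74 | 40.4 | 0.484 | 0.315 | 0.712 | 7.9 | 7.2 | 1.8 | 1.1 | 30.0* |
| 5 | 2008–09 | Cleveland | 81 | 81 | 37.7 | 0.489 | 0.344 | 0.78 | 7.6 | 7.2 | 1.7 | 1.1 | 28.4 |
| 6 | 2009–10 | Cleveland | 76 | 76 | 39.0 | 0.503 | 0.333 | 0.767 | 7.3 | 8.6 | 1.6 | 1.0 | 29.7 |
| 7 | 2010–11 | Miami | 79 | 79 | 38.8 | 0.51 | 0.33 | 0.759 | 7.5 | 7.0 | 1.6 | 0.6 | 26.7 |
| 8 | 2011–12† | Miami | 62 | 62 | 37.5 | 0.531 | 0.362 | 0.771 | 7.9 | 6.2 | 1.9 | 0.8 | 27.1 |
| 9 | 2012–13† | Miami | 76 | 76 | 37.9 | 0.565 | 0.406 | 0.753 | 8.0 | 7.3 | 1.7 | 0.9 | 26.8 |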


In [159]:
# The second-to-last row is the "Career" summary row (the last one is "All-Star").
print(LBJDF.iloc[-2][['PPG']])

PPG    27.1
Name: 21, dtype: object
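
To rank all players on a career average, the lookup above could be repeated for each DataFrame and the results sorted. The sketch below assumes a hypothetical helper `rank_by_career_stat`, and it assumes that the summary row is labelled `Career` in the `Year` column, which should be checked against the scraped data.

```python
import pandas as pd


def rank_by_career_stat(frames, stat="PPG", career_label="Career"):
    """Return a Series of each player's career value for `stat`, sorted from highest to lowest.

    `frames` maps a player key to that player's stats DataFrame.
    """
    values = {}
    for name, df in frames.items():
        career = df[df["Year"].astype(str).str.startswith(career_label)]
        if career.empty:
            continue  # no Career row found for this player
        raw = str(career.iloc[0][stat])
        # Strip footnote markers, then convert; unparsable values become NaN.
        values[name] = pd.to_numeric(raw.rstrip("*†‡"), errors="coerce")
    return pd.Series(values, name=stat).sort_values(ascending=False)


# Usage with the notebook's DataFrames:
# ranking = rank_by_career_stat({"LBJ": LBJDF, "Giannis": GiannisDF, "Joker": JokerDF,
#                                "KD": KDDF, "Shai": ShaiDF, "Wemby": WembyDF})
```

Passing `stat="APG"`, `"RPG"`, `"SPG"` or `"BPG"` would cover the other rankings listed in question I.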
