In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import logging
import lxml

With the necessary packages imported, the next step is to parse the target URL. This page lists stats for each player in the league for the 2027-28 season, along with contract information for 2027-28 only. I am going to scrape the CapFriendly website (Players > Browse tab) into a dataframe that I can export and load into the MySQL database.

In [2]:
url_v2 = "https://www.capfriendly.com/browse/active/2028?stats-season=2017&display=signing-team,birthday,country,weight,height,weightkg,heightcm,draft,slide-candidate,waivers-exempt,signing-status,expiry-year,performance-bonus,signing-bonus,caphit-percent,aav,length,minors-salary,base-salary,arbitration-eligible,type,signing-age,signing-date,arbitration,extension&hide=skater-stats,goalie-stats&limits=gp-5-90"

req = requests.get(url_v2)
soup = BeautifulSoup(req.content, "lxml")  # parse the page's HTML with the lxml parser

In [3]:
df = pd.read_html(url_v2, header=0, index_col = 0, na_values=["-"])[0]

In [4]:
df.shape

(50, 32)

After running df.shape, we can see our dataframe has 50 rows (players) and 32 columns (attributes about each player for the 2027-28 season). There are more than 50 players in total, but the table at this URL only shows 50 per page, so the rest sit on separate page URLs and still need to be retrieved.

Scraping multiple pages of the main table

In [5]:
info_about_lists = soup.find_all("a", {"class": "whi pagin_r"})  # found via devtools: the links that switch between pages of data
     

In [6]:
print(info_about_lists)  # all links to other pages of data

[<a class="whi pagin_r" data-val="2" href="/browse/?p=2">2</a>]


In [7]:
last_list_num = int(info_about_lists[-1]["data-val"])  # the last link's data-val is the final page number, so we know how many pages there are
     

In [8]:
print(last_list_num)  # check that page 2 is the last page

2
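
The lookup above raises an IndexError when the results fit on a single page, because no pagination links are found. A minimal sketch of a more defensive version follows. `last_page_number` is a hypothetical helper, not part of the original notebook, and it assumes each link exposes its page number under `data-val`, as in the output above.

```python
def last_page_number(links, default=1):
    """Return the highest page number found in pagination links.

    `links` is any sequence of objects that support link["data-val"],
    such as the BeautifulSoup tags returned by soup.find_all(...).
    Falls back to `default` when no pagination links are present.
    """
    page_numbers = []
    for link in links:
        try:
            page_numbers.append(int(link["data-val"]))
        except (KeyError, ValueError, TypeError):
            continue  # skip links without a usable page number
    return max(page_numbers, default=default)


# Plain dicts stand in for BeautifulSoup tags here.
print(last_page_number([{"data-val": "2"}]))  # pagination link present
print(last_page_number([]))                   # no pagination links: falls back to default
```

In the notebook this could be called as `last_page_number(info_about_lists)` in place of indexing the list directly.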


Now we can use a for loop to parse the data from every page.

In [9]:
req = requests.get(url_v2)
soup = BeautifulSoup(req.content, "lxml")  # parse the page's HTML with the lxml parser

info_about_lists = soup.find_all("a", {"class": "whi pagin_r"})  # found via devtools: the links that switch between pages of data
last_list_num = int(info_about_lists[-1]["data-val"])  # the last link's data-val is the final page number, so we know how many pages there are

pages_dfs = []

url_start = "https://www.capfriendly.com/browse/active/2028?stats-season=2017&display=signing-team,birthday,country,weight,height,weightkg,heightcm,draft,slide-candidate,waivers-exempt,signing-status,expiry-year,performance-bonus,signing-bonus,caphit-percent,aav,length,minors-salary,base-salary,arbitration-eligible,type,signing-age,signing-date,arbitration,extension&hide=skater-stats,goalie-stats&limits=gp-5-90"

for page_num in range(1, last_list_num + 1):

    print(f"Start scraping page {page_num}")

    time.sleep(1)  # pause between requests so we do not hammer the server

    url = url_start + f"&pg={page_num}"  # select the page by appending the pg parameter to the URL
    df = pd.read_html(url, header=0, index_col=0, na_values=["-"])[0]

    df = df.reset_index()  # to have player name as a separate column

    print(df.shape[0], f"rows were retrieved from page number {page_num}")

    pages_dfs.append(df)


result_df = pd.concat(pages_dfs)

Start scraping page 1
50 rows were retrieved from page number 1
Start scraping page 2
5 rows were retrieved from page number 2
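
Appending `&pg=` by string concatenation works here because the base URL already has a query string. As an alternative sketch, the standard library's `urllib.parse` can set the parameter regardless of what the URL already contains. `with_page` is a hypothetical helper name, and the shortened URL below is only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page_num, param="pg"):
    """Return `url` with the page parameter set to `page_num`."""
    parts = urlsplit(url)
    # Keep every existing parameter except an older page value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page_num)))
    # safe="," leaves comma-separated lists in the query readable.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))


base = "https://www.capfriendly.com/browse/active/2028?stats-season=2017&limits=gp-5-90"
print(with_page(base, 2))
```

Because an existing page value is dropped before the new one is added, calling the helper repeatedly on its own output does not pile up duplicate `pg` parameters.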


In [10]:
result_df.head(5)

Unnamed: 0,PLAYER,TEAM,AGE,DATE OF BIRTH,COUNTRY,WEIGHT,HEIGHT,POS,HANDED,DRAFTED,...,EXPIRY,EXP. YEAR,CAP HIT,CAP HIT %,AAV,SALARY,BASE SALARY,MINORS,S.BONUS,P.BONUS
0,1. Auston Matthews,TOR,29,"Sep. 17, 1997",United States,215 lbs - 98 kg,"6'3"" - 191 cm",C,Left,1 - Round 1 - 2016 (TOR),...,UFA,2028,"$13,250,000",15.9%,"$13,250,000","$10,020,000","$900,000","$10,020,000","$9,120,000",$0
1,2. Nathan MacKinnon,COL,31,"Sep. 1, 1995",Canada,200 lbs - 91 kg,"6'0"" - 183 cm",C,Right,1 - Round 1 - 2013 (COL),...,UFA,2031,"$12,600,000",15.3%,"$12,600,000","$9,900,000","$990,000","$9,900,000","$8,910,000",$0
2,3. William Nylander,TOR,31,"May 1, 1996",Canada,202 lbs - 92 kg,"6'0"" - 183 cm",RW,Right,8 - Round 1 - 2014 (TOR),...,UFA,2032,"$11,500,000",13.8%,"$11,500,000","$11,500,000","$1,000,000","$11,500,000","$10,500,000",$0
3,4. David Pastrnak,BOS,31,"May 25, 1996",Czech Republic,194 lbs - 88 kg,"6'0"" - 183 cm",RW,Right,25 - Round 1 - 2014 (BOS),...,UFA,2031,"$11,250,000",13.6%,"$11,250,000","$11,250,000","$8,250,000","$11,250,000","$3,000,000",$0
4,5. Jonathan Huberdeau,CGY,34,"Jun. 4, 1993",Canada,202 lbs - 92 kg,"6'1"" - 185 cm","LW, RW",Left,3 - Round 1 - 2011 (FLA),...,UFA,2031,"$10,500,000",12.7%,"$10,500,000","$10,500,000","$1,000,000","$10,500,000","$9,500,000",$0


Now I have player statistics and cap info for every player in the league for the 2027-28 season. Time to export it to a CSV and then upload it into the MySQL database I created.

In [11]:
result_df.to_csv('Cap Friendly 2027-28 Player Data2.csv', encoding='utf-8')
result_df.to_csv('Cap Friendly 2027-28 Player Data.csv')

The first export came out as an unrecognized file type because I forgot to add the .csv extension; that is fixed now. The only remaining cleanup is the check mark (✔) the site uses: Excel cannot display it correctly and shows it as "âœ”", so replace that symbol with the text Yes.
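
The garbled check mark is what a UTF-8 encoded ✔ looks like when Excel reads the file with a legacy Windows code page. Below is a minimal sketch of two possible fixes, using a small made-up dataframe rather than the scraped data: replace the symbol before exporting, and write the file with the `utf-8-sig` encoding, which adds a byte order mark that Excel generally uses to detect UTF-8. The column names and file name are placeholders, not taken from the real table.

```python
import os
import tempfile

import pandas as pd

# Made-up sample; in the notebook this would be result_df.
sample_df = pd.DataFrame(
    {"PLAYER": ["Player A", "Player B"], "WAIVERS EXEMPT": ["✔", None]}
)

# Replace cells that consist of the check mark with the text "Yes".
cleaned_df = sample_df.replace("✔", "Yes")

# utf-8-sig writes a byte order mark at the start of the file.
out_path = os.path.join(tempfile.mkdtemp(), "player_data_sample.csv")
cleaned_df.to_csv(out_path, index=False, encoding="utf-8-sig")

print(cleaned_df)
```

Doing the replacement in pandas keeps the cleanup reproducible, so it does not have to be repeated by hand in Excel for each season's file.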

Now that the 2027-28 CapFriendly player data is downloaded, it is time to download the 10 prior seasons (each including that season's stats and contract information) and the 8 seasons after (contract information only). Each season will be done in a separate file so this notebook does not get bogged down and the code stays cleaner.