# PWHL Attendance

## Overview

As women’s sports gain popularity, I am interested in the evolution of game attendance over time. I am particularly interested in the case of the Professional Women’s Hockey League (PWHL).

## Introduction

I had the chance to witness the Montreal Victoire’s first home game on January 13, 2024. The team did not have a name yet and was simply referred to as "PWHL Montreal". The game was held at the Verdun Auditorium, the 4043-seat venue where the Victoire trains. The ambiance was amazing.

While the Victoire still trains in Verdun, the team now holds all of its regular home matches in a bigger venue, Place Bell, which can accommodate 10,062 spectators.

As women’s hockey gains popularity and athletes are paid full-time salaries to train together, we are seeing a higher level of play than ever before. I hope attendance will hold steady or even grow over time, securing the PWHL’s future and setting an example that leads to better funding for women’s sports and lets high-level female athletes flourish.

## Dataset

### Data source

The [schedule page](https://www.thepwhl.com/en/stats/schedule/all-teams/5/all-months) on the PWHL’s website lists all games for the current season in a table.

![the top of a table titled Schedule and listing the date, time, teams, scores, venues and broadcasters for PWHL games](./img/2025-04-07_ScheduleTable.png)

The R button at the end of each row leads to a webpage containing the official game report, which specifies, among other things, the venue, date, teams and attendance.

The current season is selected by default, but the same table can be obtained for past seasons.

It is therefore possible to use the webpage to get a list of URLs corresponding to game reports for all of a season’s games. Then, information can be collected from each of the reports, which appear to all be formatted in the same way.

### Data scraping

I scraped a list of game report links from each season’s webpage. Then, I scraped the following information from each report, storing it in a DataFrame: visiting team, home team, venue, date, game start and attendance.

In [4]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [5]:
game_report_urls = []

def scrape_game_report_urls(url):
    game_report_urls = []
    # set up the WebDriver
    driver = webdriver.Edge() #EDIT THIS LINE if not using Edge <<<<<<<<<<<<<<<<<<<< IMPORTANT <<<<<<<<<<<<<<<<<<<<
    # open the webpage
    driver.get(url)
    # wait for the table to load
    try:
        table = WebDriverWait(driver, 2).until(
            EC.presence_of_element_located((By.XPATH, '//table'))
        )
        all_links = driver.find_elements(By.XPATH, "//a[@href]")
        for link in all_links:
            link_url = link.get_attribute("href")
            if "official-game-report" in link_url:
                game_report_urls.append(link_url)
    finally:
        driver.quit()
    return game_report_urls

for i in range(1, 6): # currently, there are 5 seasons
    game_report_urls.extend(scrape_game_report_urls("https://www.thepwhl.com/en/stats/schedule/all-teams/" + str(i) + "/all-months"))

In [6]:
game_report_urls[0:5]

['https://lscluster.hockeytech.com/game_reports/official-game-report.php?client_code=pwhl&game_id=2&lang_id=1',
 'https://lscluster.hockeytech.com/game_reports/official-game-report.php?client_code=pwhl&game_id=3&lang_id=1',
 'https://lscluster.hockeytech.com/game_reports/official-game-report.php?client_code=pwhl&game_id=4&lang_id=1',
 'https://lscluster.hockeytech.com/game_reports/official-game-report.php?client_code=pwhl&game_id=5&lang_id=1',
 'https://lscluster.hockeytech.com/game_reports/official-game-report.php?client_code=pwhl&game_id=6&lang_id=1']

Now that I have the URL for each report, I want to extract specific information: visiting team, home team, venue, date, game start and attendance.

In [148]:
import re
import requests
from io import StringIO
from bs4 import BeautifulSoup as bs

In [150]:
def get_game_info(url):
    game_info = {}
    # captures visiting team, home team, venue and date from the report header
    regex = r"#.{0,2}\d+ (.+) \d+ at (.+) \d+  (.+) (\w+ \d+, \d+)"
    # scrape data
    r = requests.get(url)
    soup = bs(r.content)
    table = soup.select('table')[0]
    # wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
    game_data = pd.read_html(StringIO(str(table)))[0]
    # find all required values
    text = game_data.iloc[0][1]
    captures = re.findall(regex, text)
    if len(captures) > 0:
        if len(captures[0]) >= 4:
            game_info["visiting team"], game_info["home team"], game_info["venue"], game_info["date"] = captures[0]
        else:
            print("Problem with captures ", captures[0])
    else:
        print("Problem with url ", url)
    game_info["game start"] = game_data.iloc[4][1]
    game_info["attendance"] = game_data.iloc[7][1]
    return game_info

In [152]:
get_game_info("https://lscluster.hockeytech.com/game_reports/official-game-report.php?client_code=pwhl&game_id=105&lang_id=1")

{'visiting team': 'Boston',
 'home team': 'Toronto',
 'venue': 'Coca-Cola Coliseum',
 'date': 'Nov 30, 2024',
 'game start': '2:15 PM EST',
 'attendance': '8089'}
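
To make the regular expression in `get_game_info` easier to follow, here is a small sketch that applies it to a made-up header string. The exact layout of that string (game number, scores, the double space before the venue) is an assumption inferred from the pattern, not copied from a real report. `HEADER_REGEX`, `sample_header` and `parse_header` are names introduced only for this illustration.

```python
import re

# Same pattern as in get_game_info: four capture groups for the
# visiting team, home team, venue and date.
HEADER_REGEX = r"#.{0,2}\d+ (.+) \d+ at (.+) \d+  (.+) (\w+ \d+, \d+)"

# Hypothetical header line; the scores and layout are invented for the example.
sample_header = "# 105 Boston 1 at Toronto 3  Coca-Cola Coliseum Nov 30, 2024"


def parse_header(text):
    """Return a dict of the four captured fields, or None if nothing matches."""
    match = re.search(HEADER_REGEX, text)
    if match is None:
        return None
    keys = ("visiting team", "home team", "venue", "date")
    return dict(zip(keys, match.groups()))


parsed = parse_header(sample_header)
print(parsed)
```

Returning `None` when nothing matches lets the caller decide how to handle a report whose header does not follow the expected layout.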

Let’s go and get the info from all the game reports.

In [155]:
import warnings
warnings.filterwarnings('ignore')

In [157]:
games = []
for url in game_report_urls:
    new_data = get_game_info(url)
    games.append(new_data)

In [158]:
len(games)

181

In [159]:
df = pd.DataFrame(games)
df

| | visiting team | home team | venue | date | game start | attendance |
|---|---|---|---|---|---|---|
| 0 | New York | Toronto | Mattamy Athletic Centre | Jan 1, 2024 | 12:48 PM EST | 2537 |
| 1 | Montreal | Ottawa | TD Place | Jan 2, 2024 | 7:12 PM EST | 8318 |
| 2 | Minnesota | Boston | Tsongas Center | Jan 3, 2024 | 7:14 PM EST | 4012 |
| 3 | Toronto | New York | Total Mortgage Arena | Jan 5, 2024 | 7:15 PM EST | 2152 |
| 4 | Montreal | Minnesota | Xcel Energy Center | Jan 6, 2024 | 2:42 PM CST | 13316 |
| ... | ... | ... | ... | ... | ... | ... |
| 176 | Montréal | Minnesota | Xcel Energy Center | Mar 26, 2025 | 7:07 PM CST | 6330 |
| 177 | Ottawa | Boston | Enterprise Center | Mar 29, 2025 | 1:13 PM CST | 8578 |
| 178 | Toronto | Minnesota | Xcel Energy Center | Mar 30, 2025 | 12:07 PM CST | 9536 |
| 179 | New York | Montréal | Place Bell | Apr 1, 2025 | 7:08 PM EST | 8798 |


### Adding features

To see how filled each venue was during each match, I obtained seating capacity information and added it to the data.

In [161]:
CAPACITIES = {"Xcel Energy Center": 18300,
             "Tsongas Center": 6500,
             "TD Place": 9862,
             "Place Bell": 10062,
             "Prudential Center": 16514,
             "Coca-Cola Coliseum": 7851,
             "Mattamy Athletic Centre": 2600,
             "Verdun Auditorium": 4043,
             "Utica University Nexus Center - Mastrovito Hyundai": 1200,
             "UBS Arena": 17225,
             "Total Mortgage Arena": 8412,
             "Ford Performance Centre": 200,
             "Agganis Arena": 6150,
             "Little Caesars Arena": 19515,
             "Bell Centre": 21273,
             "Scotiabank Arena": 18800,
             "Videotron Centre": 18259,
             "Lenovo Center": 18680,
             "KeyBank Center": 18595,
             "Rogers Place": 18347,
             "Canadian Tire Centre": 19153,
             "Ball Arena": 18007,
             "Rogers Arena": 18910,
             "Climate Pledge Arena": 17200,
             "PPG Paints Arena": 18187,
             "3M Arena at Mariucci": 10257,
             "Enterprise Center": 18096}

def get_capacity(venue):
    # returns -1 if the venue is not known
    return CAPACITIES.get(venue, -1)

In [162]:
df["seating capacity"] = df.apply(lambda row: get_capacity(row["venue"]), axis=1)
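
Once the capacity is known, the share of seats filled can be derived from it. The following is only a sketch in plain Python: `fill_rate` is a hypothetical helper, not part of the original notebook. It assumes that attendance arrives as a string (as in the scraped data), that "-" means no attendance was recorded, and that -1 marks an unknown venue.

```python
def fill_rate(attendance, capacity):
    """Return attendance as a percentage of capacity, or None if unavailable.

    Assumptions: attendance may be a string such as "8089" or "-",
    and capacity is -1 when the venue is not in CAPACITIES.
    """
    try:
        spectators = int(attendance)
    except (TypeError, ValueError):
        return None  # e.g. "-" for games without recorded attendance
    if capacity <= 0:
        return None  # unknown venue
    return round(100 * spectators / capacity, 1)


# Mattamy Athletic Centre game: 2537 spectators, 2600 seats in CAPACITIES.
print(fill_rate("2537", 2600))
# No recorded attendance.
print(fill_rate("-", 10062))
```

On the DataFrame, this could be applied row by row, for example with `df.apply(lambda row: fill_rate(row["attendance"], row["seating capacity"]), axis=1)`. Values above 100 are possible when the listed capacity is lower than the number of people actually admitted.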

The day of the week can impact attendance. Typically, Friday, Saturday and Sunday games have the most spectators. So let’s add a "weekday" column.

In [164]:
from dateutil import parser
from datetime import datetime
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

In [165]:
def get_weekday(date):
    return parser.parse(date).weekday()

In [166]:
df["weekday"] = df.apply(lambda row: DAYS[get_weekday(str(row["date"]))], axis=1)
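
Since Friday, Saturday and Sunday games are the ones expected to draw the biggest crowds, a boolean flag may be handier than the day name for some comparisons. Here is a minimal sketch using only the standard library. `is_weekend_game` and `WEEKEND_DAYS` are hypothetical names, and the code assumes dates formatted like "Jan 1, 2024" and a locale with English month abbreviations.

```python
from datetime import datetime

# Friday, Saturday and Sunday, using the same numbering as weekday() (Monday = 0)
WEEKEND_DAYS = {4, 5, 6}


def is_weekend_game(date):
    """Return True if the date string falls on a Friday, Saturday or Sunday."""
    parsed = datetime.strptime(date, "%b %d, %Y")
    return parsed.weekday() in WEEKEND_DAYS


print(is_weekend_game("Jan 5, 2024"))  # a Friday
print(is_weekend_game("Jan 2, 2024"))  # a Tuesday
```

Counting Friday as part of the weekend follows the observation above rather than the calendar definition, so the set can be adjusted if the data suggests otherwise.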

In [167]:
df

| | visiting team | home team | venue | date | game start | attendance | seating capacity | weekday |
|---|---|---|---|---|---|---|---|---|
| 0 | New York | Toronto | Mattamy Athletic Centre | Jan 1, 2024 | 12:48 PM EST | 2537 | 2600 | Monday |
| 1 | Montreal | Ottawa | TD Place | Jan 2, 2024 | 7:12 PM EST | 8318 | 9862 | Tuesday |
| 2 | Minnesota | Boston | Tsongas Center | Jan 3, 2024 | 7:14 PM EST | 4012 | 6500 | Wednesday |
| 3 | Toronto | New York | Total Mortgage Arena | Jan 5, 2024 | 7:15 PM EST | 2152 | 8412 | Friday |
| 4 | Montreal | Minnesota | Xcel Energy Center | Jan 6, 2024 | 2:42 PM CST | 13316 | 18300 | Saturday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 176 | Montréal | Minnesota | Xcel Energy Center | Mar 26, 2025 | 7:07 PM CST | 6330 | 18300 | Wednesday |
| 177 | Ottawa | Boston | Enterprise Center | Mar 29, 2025 | 1:13 PM CST | 8578 | 18096 | Saturday |
| 178 | Toronto | Minnesota | Xcel Energy Center | Mar 30, 2025 | 12:07 PM CST | 9536 | 18300 | Sunday |
| 179 | New York | Montréal | Place Bell | Apr 1, 2025 | 7:08 PM EST | 8798 | 10062 | Tuesday |


### Data cleaning

Tickets are not sold for preseason games, so no attendance is recorded for them: the attendance field contains "-" instead of a number. I chose to remove the rows corresponding to these games from the dataset.

In [169]:
df.drop(df[df["attendance"] == "-"].index, inplace=True)

In [170]:
df

| | visiting team | home team | venue | date | game start | attendance | seating capacity | weekday |
|---|---|---|---|---|---|---|---|---|
| 0 | New York | Toronto | Mattamy Athletic Centre | Jan 1, 2024 | 12:48 PM EST | 2537 | 2600 | Monday |
| 1 | Montreal | Ottawa | TD Place | Jan 2, 2024 | 7:12 PM EST | 8318 | 9862 | Tuesday |
| 2 | Minnesota | Boston | Tsongas Center | Jan 3, 2024 | 7:14 PM EST | 4012 | 6500 | Wednesday |
| 3 | Toronto | New York | Total Mortgage Arena | Jan 5, 2024 | 7:15 PM EST | 2152 | 8412 | Friday |
| 4 | Montreal | Minnesota | Xcel Energy Center | Jan 6, 2024 | 2:42 PM CST | 13316 | 18300 | Saturday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 176 | Montréal | Minnesota | Xcel Energy Center | Mar 26, 2025 | 7:07 PM CST | 6330 | 18300 | Wednesday |
| 177 | Ottawa | Boston | Enterprise Center | Mar 29, 2025 | 1:13 PM CST | 8578 | 18096 | Saturday |
| 178 | Toronto | Minnesota | Xcel Energy Center | Mar 30, 2025 | 12:07 PM CST | 9536 | 18300 | Sunday |
| 179 | New York | Montréal | Place Bell | Apr 1, 2025 | 7:08 PM EST | 8798 | 10062 | Tuesday |


In [182]:
df.to_csv("games.csv")

## TODO

- Add column specifying if the local NHL team is playing that day to df
- Check if I can find attendance records for the CWHL (prior women’s hockey league) for comparison
- Check if I can find attendance records for other teams with the same home venues for comparison

## Note

**Seating capacity values are to be taken with a grain of salt.** Sources disagree on the hockey seating capacity of certain venues. For example, the Verdun Auditorium has either 4114 or 4043 seats, depending on the source. Many venues don’t give specific numbers on their own website. Every time I found a discrepancy between sources, I opted for the (subjectively) most trustworthy source I could find, which was sometimes not that trustworthy, because teams and venues often do not share exact figures themselves (they use rounded numbers instead). It is also possible that the seating capacity changes over time based on accessible accommodations, DJ booths, etc. [PPG Paints Arena](https://www.ppgpaintsarena.com/ppg-paints-arena/faqs) states that "Due to a flexible curtaining system, the seating capacity of each event will vary." [Ball Arena](https://www.ballarena.com/arena-information/about-ball-arena/) gets a special mention for listing both 18,807 and 18,809 as its hockey seating capacity on the same page.

## Sources

- [3M Arena at Mariucci](https://gophersports.com/sports/2018/5/21/facilities-mariucci-facts-html)
- [3M Arena at Mariucci (IIHF)](https://www.iihf.com/en/events/2026/wm20/static/64811/3m_arena_at_mariucci)
- [About Ball Arena](https://www.ballarena.com/arena-information/about-ball-arena/)
- [About Scotiabank Arena](https://www.scotiabankarena.com/venue-information/about)
- [About the Tsongas Center](https://tsongascenter.com/pages/about-the-tsongas-center)
- [Agganis Arena Capacity & Specifications](https://www.agganisarena.com/business-opportunities/book-the-arena/capacity-specifications/)
- [Auditorium de Verdun (Sporting Events)](https://sports.mtl.org/en/plan/sports-facilities/verdun-auditorium)
- [Bell Centre (Magil Construction)](https://www.magil.com/en/projects/bell-center)
- [Canadian Tire Centre A to Z Guide](https://www.canadiantirecentre.com/guide-a-a-z/)
- [Climate Pledge Arena (Populous)](https://populous.com/showcases/climate-pledge-arena)
- [Coca-Cola Coliseum Seating Chart](https://www.torontocoliseum.com/seating-chart/)
- [Enterprise Center Seat Locator](https://www.enterprisecenter.com/events/seat-locator)
- [Facilities - TMU Athletics & Recreation](https://tmubold.ca/sports/2019/8/22/204964949.aspx)
- [Ford Performance Centre Professional Recreational Hockey Facility](https://www.lakeshorearena.ca/)
- [Little Caesars Arena Information](https://www.detroiteventsarena.com/information/)
- [Place Bell - Laval Rocket (Stadium Journey)](https://www.stadiumjourney.com/stadiums/place-bell-laval-rocket)
- [PNC Arena Info (Internet Archive)](https://web.archive.org/web/20141205012555/http://www.thepncarena.com/arena_info)
- [PPG Paints Arena FAQ](https://www.ppgpaintsarena.com/ppg-paints-arena/faqs)
- [Prudential Center Capacity (Arena Capacity)](https://www.arenacapacity.com/prudential-center-capacity/)
- [Rogers Arena (Ticketmaster)](https://blog.ticketmaster.com/step-inside-rogers-arena-vancouver-bc/)
- [Rogers Place (Ticketmaster)](https://blog.ticketmaster.com/step-inside-rogers-place-edmonton-ab/)
- [TD Place Arena (OHL Arena Guide)](https://www.ohlarenaguide.com/67s.htm)
- [Total Mortgage Arena (Ticketsonsale)](https://www.ticketsonsale.com/venues/total-mortgage-arena)
- [UBS Arena (PCI)](https://www.pci.org/PCI/PCI/Project_Resources/Project_Profile/Project_Profile_Details.aspx?ID=253141)
- [Utica University Nexus Center](https://www.nexusutica.com/)
- [Vidéotron Centre, Québec Remparts (QMJHL Arena Guide)](https://www.qmjhlarenaguide.com/remparts.htm)
- [Xcel Energy Center Seating Capacities](https://www.xcelenergycenter.com/assets/doc/Seating_Capacities-e3d7d47c61.pdf)