
# 🧹 Web Scraping Project
#### Name: Joseph Maina
#### Date: 19 May 2025
#### This project scrapes NHL hockey team statistics from scrapethissite.com using Requests, BeautifulSoup, and pandas.


### Import Libraries.


In [2]:
# Import Required Libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

### Send a Request to the Website.
#### Use the requests library to send an HTTP GET request to the target URL and fetch the HTML content of the page.

In [3]:
# Send a GET request to the target URL
url = 'https://www.scrapethissite.com/pages/forms/'
response = requests.get(url)

In [6]:
# Check the status of the response
print("Status Code:", response.status_code)


Status Code: 200
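
The status check above can also be folded into the request itself. The following is a minimal sketch, not part of the original notebook: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. It relies only on `requests.get(..., timeout=...)` and `Response.raise_for_status()`.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the page body as text, raising on HTTP errors.

    `fetch_html` and its defaults are illustrative assumptions.
    """
    # Any object with a requests-style .get() can be passed as `session`,
    # which makes the helper easy to exercise without network access.
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing back an error page.
    response.raise_for_status()
    return response.text


# Example call (needs network access, so it is left commented out):
# html = fetch_html('https://www.scrapethissite.com/pages/forms/')
```

With this shape, a failed request stops the notebook at the fetch step rather than later, when the parser finds no table.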


In [7]:
# Check the content of the response
print("Response Content:", response.text)

Response Content: <!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
    <link rel="icon" type="image/png" href="/static/images/scraper-icon.png" />

    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="description" content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.">

    <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" crossorigin="anonymous">
    <link href='https://fonts.googleapis.com/css?family=Lato:400,700' rel='stylesheet' type='text/css'>
    <link rel="stylesheet" type="text/css" href="/static/css

### Parse the HTML using BeautifulSoup.
#### Parse the HTML content so I can navigate and extract elements from it.

In [8]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

In [12]:
# Print out a small part to check
print(soup.title.text) # This should print the page title

Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping


### Extract the Data Table.
#### Locate the table, then extract the column headers and the rows of data.

In [14]:
# Locate the table with hockey team stats
hockey_table = soup.find('table', class_='table')


In [15]:
# Extract column headers
table_headers = hockey_table.find_all('th')
column_names = [header.text.strip() for header in table_headers]
print("Column Headers:", column_names)

Column Headers: ['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses', 'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -']
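
To see what these selectors return without touching the network, here is a minimal sketch that runs the same `find` / `find_all` calls on a hand-written HTML fragment. The fragment and the `parse_table` helper are made up for illustration; only the column names are taken from the output above.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration only; it is not copied from the
# live page and keeps just three of the nine columns.
SAMPLE_HTML = """
<table class="table">
  <tr><th> Team Name </th><th> Year </th><th> Wins </th></tr>
  <tr><td> Boston Bruins </td><td> 1990 </td><td> 44 </td></tr>
  <tr><td> Buffalo Sabres </td><td> 1990 </td><td> 31 </td></tr>
</table>
"""


def parse_table(html):
    """Return (headers, rows) for the first <table class="table"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="table")
    if table is None:
        # find() returns None when nothing matches, so fail with a clear message.
        raise ValueError("no <table class='table'> found")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            rows.append(cells)
    return headers, rows


headers, rows = parse_table(SAMPLE_HTML)
print(headers)
print(rows)
```

Checking for `None` before calling `find_all` gives a readable error if the page layout ever changes.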


### Store the Extracted Table Data.
#### Loop through each table row, extract the cell data, and store the cleaned values in a pandas DataFrame.

In [16]:
# Create an empty DataFrame with the column headers
df = pd.DataFrame(columns=column_names)

In [17]:
# Extract all rows in the table (skip the first row with headers)
table_rows = hockey_table.find_all('tr')[1:]

In [18]:
# Loop through each row and extract data
for row in table_rows:
    cells = row.find_all('td')
    row_data = [cell.text.strip() for cell in cells]
    if row_data:
        df.loc[len(df)] = row_data
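
Appending with `df.loc[len(df)]` works for one page, but each assignment enlarges the DataFrame, which tends to get slower as the row count grows. A common alternative, sketched here under the assumption that the rows are already available as plain lists, is to collect them first and build the DataFrame in one call. The two sample rows are typed in by hand to stand in for scraped cell text, and `df_sketch` is a hypothetical name.

```python
import pandas as pd

# Column names as printed by the header-extraction cell.
column_names = ['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses',
                'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -']

# Hand-typed stand-ins for the text pulled out of each <td> cell.
scraped_rows = [
    ['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35'],
    ['Buffalo Sabres', '1990', '31', '30', '', '0.388', '292', '278', '14'],
    [],  # an empty row, like the one a header-only <tr> would produce
]

# Keep only rows whose length matches the header, then build the frame once.
records = [row for row in scraped_rows if len(row) == len(column_names)]
df_sketch = pd.DataFrame(records, columns=column_names)
print(df_sketch.shape)
```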


In [19]:
# Display the first five rows of the resulting DataFrame
df.head()

| | Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Against (GA) | + / - |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston Bruins | 1990 | 44 | 24 | | 0.55 | 299 | 264 | 35 |
| 1 | Buffalo Sabres | 1990 | 31 | 30 | | 0.388 | 292 | 278 | 14 |
| 2 | Calgary Flames | 1990 | 46 | 26 | | 0.575 | 344 | 263 | 81 |
| 3 | Chicago Blackhawks | 1990 | 49 | 23 | | 0.613 | 284 | 211 | 73 |
| 4 | Detroit Red Wings | 1990 | 34 | 38 | | 0.425 | 273 | 298 | -25 |


In [20]:
# Display the last five rows of the resulting DataFrame
df.tail()

| | Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Against (GA) | + / - |
|---|---|---|---|---|---|---|---|---|---|
| 20 | Winnipeg Jets | 1990 | 26 | 43 | | 0.325 | 260 | 288 | -28 |
| 21 | Boston Bruins | 1991 | 36 | 32 | | 0.45 | 270 | 275 | -5 |
| 22 | Buffalo Sabres | 1991 | 31 | 37 | | 0.388 | 289 | 299 | -10 |
| 23 | Calgary Flames | 1991 | 31 | 37 | | 0.388 | 296 | 305 | -9 |
| 24 | Chicago Blackhawks | 1991 | 36 | 29 | | 0.45 | 257 | 236 | 21 |
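
Every value in the DataFrame is still a string, because it came from `cell.text.strip()`. If numeric work is planned (sorting by wins, summing goals), the columns need converting first. A minimal sketch using `pd.to_numeric`, with two rows copied from the previews above; treating every column except `Team Name` as numeric is an assumption, and `raw` / `typed` are hypothetical names.

```python
import pandas as pd

columns = ['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses',
           'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -']

# Two rows copied from the previews; every value is a string, as scraped.
raw = pd.DataFrame(
    [
        ['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35'],
        ['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25'],
    ],
    columns=columns,
)

# Assumption: everything except the team name should be numeric.
numeric_columns = [c for c in raw.columns if c != 'Team Name']

typed = raw.copy()
for column in numeric_columns:
    # errors='coerce' turns text that cannot be parsed, such as an
    # empty cell, into NaN instead of raising.
    typed[column] = pd.to_numeric(typed[column], errors='coerce')

print(typed.dtypes)
```

The empty `OT Losses` cells become `NaN`, which keeps them distinguishable from a real zero.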


### Export the Data to a ".csv" File.
#### Save the scraped data locally in ".csv" format.

In [21]:
# Export the DataFrame to a CSV file
df.to_csv('hockey_stats.csv', index=False)

In [22]:
# Confirm that the CSV file was saved
print("CSV file 'hockey_stats.csv' has been saved successfully.")

CSV file 'hockey_stats.csv' has been saved successfully.


In [23]:
# Download it to your local machine
from google.colab import files
files.download('hockey_stats.csv')
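
The `google.colab` module is only available inside a Colab runtime, so the cell above fails with an import error when the notebook runs elsewhere. A minimal sketch of a guard, where `download_if_colab` is a hypothetical helper name:

```python
def download_if_colab(path):
    """Trigger a browser download in Colab; return False anywhere else."""
    try:
        # This import succeeds only inside a Colab runtime.
        from google.colab import files
    except ImportError:
        print(f"Not running in Colab; '{path}' stays on the local disk.")
        return False
    files.download(path)
    return True


# Example call:
# download_if_colab('hockey_stats.csv')
```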


### Scrape All Pages Using Pagination.
#### The website uses a form-style interface with query parameters such as ?page=1 and ?page=2, so I can loop through the pages programmatically.

In [24]:
# Base URL with pagination placeholder
base_url = 'https://www.scrapethissite.com/pages/forms/?page={}'


In [25]:
# Initialize an empty DataFrame
all_data =  pd.DataFrame(columns=column_names)

In [27]:
# Start from page 1 and go until there's no data
page_number = 1
while True:
    # Send request to current page
    url = base_url.format(page_number)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the table; stop if the page has none
    table = soup.find('table', class_='table')
    if table is None:
        break
    rows = table.find_all('tr')[1:]  # Skip header row

    # Stop if no data rows are found
    if not rows:
        break

    # Extract and store data
    for row in rows:
        cells = row.find_all('td')
        row_data = [cell.text.strip() for cell in cells]
        if row_data:
            all_data.loc[len(all_data)] = row_data

    print(f"✅ Page {page_number} scraped.")
    page_number += 1

# Preview combined DataFrame
all_data.head()

✅ Page 1 scraped.
✅ Page 2 scraped.
✅ Page 3 scraped.
✅ Page 4 scraped.
✅ Page 5 scraped.
✅ Page 6 scraped.
✅ Page 7 scraped.
✅ Page 8 scraped.
✅ Page 9 scraped.
✅ Page 10 scraped.
✅ Page 11 scraped.
✅ Page 12 scraped.
✅ Page 13 scraped.
✅ Page 14 scraped.
✅ Page 15 scraped.
✅ Page 16 scraped.
✅ Page 17 scraped.
✅ Page 18 scraped.
✅ Page 19 scraped.
✅ Page 20 scraped.
✅ Page 21 scraped.
✅ Page 22 scraped.
✅ Page 23 scraped.
✅ Page 24 scraped.
✅ Page 25 scraped.
✅ Page 26 scraped.
✅ Page 27 scraped.
✅ Page 28 scraped.
✅ Page 29 scraped.
✅ Page 30 scraped.
✅ Page 31 scraped.
✅ Page 32 scraped.
✅ Page 33 scraped.
✅ Page 34 scraped.
✅ Page 35 scraped.
✅ Page 36 scraped.
✅ Page 37 scraped.
✅ Page 38 scraped.
✅ Page 39 scraped.
✅ Page 40 scraped.
✅ Page 41 scraped.
✅ Page 42 scraped.
✅ Page 43 scraped.
✅ Page 44 scraped.
✅ Page 45 scraped.
✅ Page 46 scraped.
✅ Page 47 scraped.
✅ Page 48 scraped.
✅ Page 49 scraped.
✅ Page 50 scraped.
✅ Page 51 scraped.
✅ Page 52 scraped.
✅ Page 53 scraped.
✅ 

KeyboardInterrupt: 
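
The run above had to be interrupted by hand, which suggests the `if not rows` stop condition never fired. One possible cause, not verified against the live site here, is that the server ignores the `page` parameter and returns the same first page every time; the site's own pagination links show which parameter it actually expects. Whatever the cause, the loop can be made to stop on its own. The following is a minimal sketch: `scrape_all_pages`, the `fetch_rows` callback, and the `max_pages` cap are all hypothetical names, and the two fake fetchers stand in for real requests.

```python
def scrape_all_pages(fetch_rows, max_pages=100):
    """Collect rows page by page until a stop condition fires.

    fetch_rows(page_number) is a caller-supplied function that returns
    the list of row lists for one page.
    """
    all_rows = []
    previous = None
    for page_number in range(1, max_pages + 1):
        rows = fetch_rows(page_number)
        if not rows:
            break  # empty page: we ran past the last one
        if rows == previous:
            break  # same rows as the page before: the parameter is probably ignored
        all_rows.extend(rows)
        previous = rows
    return all_rows


def three_pages(page_number):
    """Fake fetcher: three pages of data, then nothing."""
    data = {1: [['A', '1990']], 2: [['B', '1990']], 3: [['C', '1991']]}
    return data.get(page_number, [])


def always_first_page(page_number):
    """Fake fetcher: behaves like a server that ignores the page number."""
    return [['A', '1990']]


print(len(scrape_all_pages(three_pages)))        # 3
print(len(scrape_all_pages(always_first_page)))  # 1
```

The `max_pages` cap is a last resort: even if neither of the other two checks fires, the loop ends after a bounded number of requests.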

In [28]:
# Save the full multi-page dataset to a CSV file
all_data.to_csv('all_hockey_stats.csv', index=False)

print("✅ Full dataset saved to 'all_hockey_stats.csv'")


✅ Full dataset saved to 'all_hockey_stats.csv'


In [29]:
# Download it to your local machine
from google.colab import files
files.download('all_hockey_stats.csv')

