#### Name: Andrina Watsemba
#### RegNo: S24B38/026
#### Assignment: WEB SCRAPING


### Part 2: Scraping a Different Website

### Chosen Website: https://www.scrapethissite.com/pages/forms/
- **Data to collect**: NHL hockey team stats: Team, Year, Wins, Losses, OT Losses, Win %, Goals For (GF), Goals Against (GA), and +/- (goal difference).
- **Reason**: A real sports leaderboard in a paginated table with structured data. The pages are static HTML (no JavaScript rendering needed) and the site is built for scraping practice, so it demonstrates multi-page scraping without Selenium.

In [5]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

In [6]:
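# Scrape every results page; stop at the first page whose table has no data rows
all_players = []  # one list of cell strings per table row (each row is one team in one season)

base_url = "https://www.scrapethissite.com/pages/forms/?page_num="
page = 1

print("Scraping hockey stats...")

while True:
    url = base_url + str(page)
    response = requests.get(url, timeout=10)  # do not hang forever on a stalled connection
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", class_="table")

    if not table:
        print("No table found. Stopping.")
        break

    rows = table.find_all("tr")[1:]  # skip header

    # Pages past the last one still return 200 OK, so an empty row list is the real stop condition
    if len(rows) == 0:
        print(f"Page {page} is empty. Last page was {page-1}. Done!")
        break

    for row in rows:
        cols = row.find_all("td")
        player_data = [col.text.strip() for col in cols]
        all_players.append(player_data)

    print(f"Page {page} done ({len(rows)} players)")
    page += 1
    time.sleep(1)  # pause between requests to be polite to the server

# page - 1 is the number of pages actually scraped (the loop breaks on the first empty page)
print(f"\nTotal players scraped: {len(all_players)} (from {page - 1} pages)")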

Scraping hockey stats...
Page 1 done (25 players)
Page 2 done (25 players)
Page 3 done (25 players)
Page 4 done (25 players)
Page 5 done (25 players)
Page 6 done (25 players)
Page 7 done (25 players)
Page 8 done (25 players)
Page 9 done (25 players)
Page 10 done (25 players)
Page 11 done (25 players)
Page 12 done (25 players)
Page 13 done (25 players)
Page 14 done (25 players)
Page 15 done (25 players)
Page 16 done (25 players)
Page 17 done (25 players)
Page 18 done (25 players)
Page 19 done (25 players)
Page 20 done (25 players)
Page 21 done (25 players)
Page 22 done (25 players)
Page 23 done (25 players)
Page 24 done (7 players)
Page 25 is empty. Last page was 24. Done!

Total players scraped: 582 (from 24 pages)
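
The stop condition used above can be separated from the HTTP and parsing details. The following is a minimal sketch, not the notebook's actual code: `scrape_all_pages`, `fetch_rows`, `fake_fetch`, and the `max_pages` safety cap are hypothetical names introduced for illustration, and the fake data stands in for the real site.

```python
from typing import Callable, Iterator, List

Row = List[str]


def scrape_all_pages(
    fetch_rows: Callable[[int], List[Row]],
    max_pages: int = 100,
) -> Iterator[Row]:
    """Yield rows page by page until a page comes back empty.

    fetch_rows is a hypothetical callable: given a 1-based page number it
    returns that page's data rows (an empty list past the last page).
    max_pages is an assumed safety cap, so a site that never returns an
    empty page cannot cause an endless loop.
    """
    for page in range(1, max_pages + 1):
        rows = fetch_rows(page)
        if not rows:  # empty page: everything has been collected
            return
        yield from rows


def fake_fetch(page: int) -> List[Row]:
    """Stand-in for the real site: three pages of made-up rows, then nothing."""
    fake_site = {
        1: [["A", "1990"], ["B", "1990"]],
        2: [["C", "1991"]],
        3: [["D", "1992"]],
    }
    return fake_site.get(page, [])


all_rows = list(scrape_all_pages(fake_fetch))
print(len(all_rows))  # 4
```

In a real run, `fetch_rows` would wrap the `requests.get` and BeautifulSoup calls from the cell above. Keeping the cap alongside the empty-page check guards against the kind of runaway loop that happens when a site keeps answering 200 OK.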


In [10]:
# Column names, in the order they appear in the site's table (9 columns)
columns = [
    "Team", "Year", "Wins", "Losses", "OT Losses",
    "Win %", "GF", "GA", "+/-"
]

# Create DataFrame
df = pd.DataFrame(all_players, columns=columns)

# Save to CSV without the DataFrame index
csv_filename = "Hockey_watsemba.csv"
df.to_csv(csv_filename, index=False)

print(f"Successfully saved {len(df)} players to {csv_filename}")
print("\nFirst 10 rows:")
df.head(10)

Successfully saved 582 players to Hockey_watsemba.csv

First 10 rows:


| | Team | Year | Wins | Losses | OT Losses | Win % | GF | GA | +/- |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston Bruins | 1990 | 44 | 24 | | 0.55 | 299 | 264 | 35 |
| 1 | Buffalo Sabres | 1990 | 31 | 30 | | 0.388 | 292 | 278 | 14 |
| 2 | Calgary Flames | 1990 | 46 | 26 | | 0.575 | 344 | 263 | 81 |
| 3 | Chicago Blackhawks | 1990 | 49 | 23 | | 0.613 | 284 | 211 | 73 |
| 4 | Detroit Red Wings | 1990 | 34 | 38 | | 0.425 | 273 | 298 | -25 |
| 5 | Edmonton Oilers | 1990 | 37 | 37 | | 0.463 | 272 | 272 | 0 |
| 6 | Hartford Whalers | 1990 | 31 | 38 | | 0.388 | 238 | 276 | -38 |
| 7 | Los Angeles Kings | 1990 | 46 | 24 | | 0.575 | 340 | 254 | 86 |
| 8 | Minnesota North Stars | 1990 | 27 | 39 | | 0.338 | 256 | 266 | -10 |
| 9 | Montreal Canadiens | 1990 | 39 | 30 | | 0.487 | 273 | 249 | 24 |
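
Because the scraper stores every cell as stripped text, the numbers in this table are still strings, and the empty OT Losses cells are empty strings. One possible way to convert a row into typed values is sketched below. `convert_row` and the column groupings are assumptions made for illustration; they are not part of the original notebook.

```python
from typing import Dict, List, Union

Value = Union[str, int, float, None]

# Column order matches the scraped table.
COLUMNS = ["Team", "Year", "Wins", "Losses", "OT Losses", "Win %", "GF", "GA", "+/-"]
# Assumed typing of each column, based on the sample rows shown above.
INT_COLUMNS = {"Year", "Wins", "Losses", "OT Losses", "GF", "GA", "+/-"}
FLOAT_COLUMNS = {"Win %"}


def convert_row(cells: List[str]) -> Dict[str, Value]:
    """Map one scraped row (a list of strings) to a dict of typed values."""
    record: Dict[str, Value] = {}
    for name, text in zip(COLUMNS, cells):
        if text == "":
            record[name] = None  # empty cell, e.g. OT Losses in the 1990 rows
        elif name in INT_COLUMNS:
            record[name] = int(text)
        elif name in FLOAT_COLUMNS:
            record[name] = float(text)
        else:
            record[name] = text
    return record


row = ["Detroit Red Wings", "1990", "34", "38", "", "0.425", "273", "298", "-25"]
record = convert_row(row)
print(record["OT Losses"], record["+/-"])  # None -25
```

With typed values, sorting by wins or averaging Win % behaves numerically rather than alphabetically.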


## Dataset Description
- **Website**: ScrapeThisSite.com – Hockey team stats sandbox.
- **Extracted**: Full leaderboard (team, year, wins/losses, goals, +/-, etc.) across 24 pages.
- **Rows**: 582, one per team per season.
- **Challenges**: Pagination via URL parameter (`?page_num=`). Static HTML → simple requests + BS4.
- **Ethical**: Site designed for scraping practice. robots.txt allows all. Public data, educational use.
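
A robots.txt claim like the one above can be checked in code with the standard library's `urllib.robotparser`. This is a minimal sketch: the helper `is_allowed` and both sample rule sets are made up for illustration and are not the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rule sets, NOT copied from the real site.
permissive = "User-agent: *\nDisallow:\n"           # empty Disallow allows everything
restrictive = "User-agent: *\nDisallow: /pages/\n"  # blocks everything under /pages/

url = "https://www.scrapethissite.com/pages/forms/?page_num=1"
print(is_allowed(permissive, url))   # True
print(is_allowed(restrictive, url))  # False
```

To check the live file instead of a sample string, `RobotFileParser` also provides `set_url()` and `read()`, which download and parse the site's robots.txt before `can_fetch()` is called.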

### Part 2 – Hockey Stats Scraper (2 min)

“For Part 2, I chose a different ethical site: https://www.scrapethissite.com/pages/forms/
This is a hockey team stats leaderboard: 24 pages, 582 rows.
Why this site?

- 100% static HTML (no JavaScript)
- Designed for scraping practice
- robots.txt allows everything
- Real sports data, more interesting than quotes!

Challenge:
The site returns 200 OK even on empty pages, so my first loop kept going all the way to page 70!
I fixed it by checking if len(rows) == 0, so the loop now stops right after page 24.
I saved everything to Hockey_watsemba.csv with 9 columns: Team, Year, Wins, Losses, OT Losses, Win %, GF, GA, and +/-.
Ethical note: I added time.sleep(1) to be polite to the server.”