# Project: NHL Team Statistics Web Scraper
# Overview
This project involves building an automated web scraper to collect historical performance data of NHL hockey teams. The goal is to aggregate scattered data from multiple web pages into a single, clean dataset suitable for analysis.

# Methodology
Request & Parse: Uses requests to fetch each page and BeautifulSoup to parse its HTML.
Pagination Handling: Loops through all 24 pages of results so that the entire dataset is captured.
Data Extraction: Targets the table rows and cells that hold each metric, such as Team Name, Year, Wins, Losses, and Goal Differential (+ / -).
Storage & Export: Collects each row in a Python list, converts the list to a pandas DataFrame, and exports the result to a CSV file.
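
The extraction step can be illustrated offline on a small piece of markup. This is only a minimal sketch: `SAMPLE_HTML` is a hypothetical snippet that mimics the table structure the notebook targets, `parse_team_rows` and its `fields` parameter are illustrative names that do not appear in the notebook, and the choice of the built-in `html.parser` is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the table structure targeted by the scraper.
SAMPLE_HTML = """
<table>
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr class="team">
    <td class="name"> Boston Bruins </td>
    <td class="year"> 1990 </td>
    <td class="wins"> 44 </td>
  </tr>
</table>
"""


def parse_team_rows(html, fields=("name", "year", "wins")):
    """Return one list of stripped cell texts per <tr class="team"> row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="team"):
        cells = []
        for field in fields:
            td = tr.find("td", class_=field)
            # A missing cell becomes an empty string instead of raising AttributeError.
            cells.append(td.get_text(strip=True) if td is not None else "")
        rows.append(cells)
    return rows


rows = parse_team_rows(SAMPLE_HTML)
print(rows)
```

Keeping the parsing in a function that takes an HTML string makes it possible to check the selectors without sending any requests.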

# Libraries Used
• requests
• bs4 (BeautifulSoup)
• pandas

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
base_url = 'https://www.scrapethissite.com/pages/forms/?page_num='

In [3]:
# Get the column headers (from page 1)

In [4]:
page = requests.get(base_url+str(1))
soup = BeautifulSoup(page.text, 'html.parser')  # name the parser explicitly so results do not depend on which parsers are installed

In [5]:
data = soup.find('tr')
headers_only=data.find_all('th')

In [6]:
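# Collect the stripped text of every header cell.
# (The original `i != ''` check compared a Tag with a string, so it was always true.)
df_headers = [th.text.strip() for th in headers_only]
df_headers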

['Team Name',
 'Year',
 'Wins',
 'Losses',
 'OT Losses',
 'Win %',
 'Goals For (GF)',
 'Goals Against (GA)',
 '+ / -']

In [7]:
all_rows = []  # one list per team-season row, filled by the loop over pages

In [8]:
# Loop through the pages and scrape the data

In [9]:
for page_num in range(1, 25):  # pages 1 through 24
    url = base_url + str(page_num)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    data = soup.find_all('tr', class_='team')
    # Use a separate name for the row so it does not shadow the page counter
    for team in data:
        name = team.find('td', class_='name').text.strip()
        year = team.find('td', class_='year').text.strip()
        wins = team.find('td', class_='wins').text.strip()
        losses = team.find('td', class_='losses').text.strip()
        ot_losses = team.find('td', class_='ot-losses').text.strip()
        pct = team.select_one('.pct').text.strip()
        gf = team.find('td', class_='gf').text.strip()
        ga = team.find('td', class_='ga').text.strip()
        diff = team.select_one('.diff').text.strip()
        # print(name, year, wins, losses, ot_losses, pct, gf, ga, diff)
        row = [name, year, wins, losses, ot_losses, pct, gf, ga, diff]
        all_rows.append(row)

In [10]:
df = pd.DataFrame(all_rows,columns = df_headers)

In [11]:
df

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -
0,Boston Bruins,1990,44,24,,0.55,299,264,35
1,Buffalo Sabres,1990,31,30,,0.388,292,278,14
2,Calgary Flames,1990,46,26,,0.575,344,263,81
3,Chicago Blackhawks,1990,49,23,,0.613,284,211,73
4,Detroit Red Wings,1990,34,38,,0.425,273,298,-25
...,...,...,...,...,...,...,...,...,...
577,Tampa Bay Lightning,2011,38,36,8,0.463,235,281,-46
578,Toronto Maple Leafs,2011,35,37,10,0.427,231,264,-33
579,Vancouver Canucks,2011,51,22,9,0.622,249,198,51
580,Washington Capitals,2011,42,32,8,0.512,222,230,-8


In [12]:
df.to_csv('Hockey Teams.csv',index=False)
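
Every value collected by the scraper is a string, and the OT Losses cell is empty for the early seasons. If the dataset is meant for numeric analysis, a conversion along the following lines could be applied to the DataFrame before exporting. This is a hedged sketch rather than part of the original notebook: `sample` and `numeric_cols` are illustrative names, and the two rows are copied from the output shown above.

```python
import pandas as pd

# Two rows copied from the output above; every scraped value is a string.
sample = pd.DataFrame(
    [
        ["Boston Bruins", "1990", "44", "24", "", "0.55", "299", "264", "35"],
        ["Tampa Bay Lightning", "2011", "38", "36", "8", "0.463", "235", "281", "-46"],
    ],
    columns=["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %",
             "Goals For (GF)", "Goals Against (GA)", "+ / -"],
)

# Every column except the team name holds a number.
numeric_cols = [col for col in sample.columns if col != "Team Name"]
for col in numeric_cols:
    # errors="coerce" maps anything that cannot be parsed to NaN;
    # the empty OT Losses cells also end up as NaN.
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
```

Columns that contain a missing value, such as OT Losses here, come out as floating point, because NaN cannot be stored in a plain integer column.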