<a href="https://colab.research.google.com/github/PKwaringa/web_scraping/blob/main/webscrapingproject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Title: Web Scraping Project
Name: Pauline Kungu
Date: 15/5/2025

Machine learning algorithms need data to learn patterns, make predictions, and improve their performance, and that data must be sufficient in quantity and high in quality. Web scraping is the automated process of extracting data from websites.

What I am going to cover:

*   Practical Python coding in Jupyter Notebooks hosted on Google Colab.
*   Using requests and BeautifulSoup to extract data from a web page.
*   Parsing and cleaning the extracted data.
*   Storing structured data in a Pandas DataFrame.
*   Exporting the final dataset to a .csv file.





In [None]:
#importing libraries for scraping
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [None]:
# Specify the URL to scrape and use the requests library to send a GET request to it
url = "https://www.scrapethissite.com/pages/forms"
page = requests.get(url)

In [None]:
#check if the request was successful
page

<Response [200]>
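
A `<Response [200]>` means the request succeeded. As a minimal sketch of a more explicit check, the helper below uses `raise_for_status()`, which raises an error for 4xx and 5xx status codes. The name `check_response` is hypothetical (it is not part of requests), and the two `Response` objects are built by hand only so the example runs without a network connection.

```python
import requests


def check_response(response):
    """Return False if the status code is 4xx or 5xx, otherwise True."""
    try:
        response.raise_for_status()
    except requests.HTTPError as err:
        print("Request failed:", err)
        return False
    return True


# Hand-built responses for illustration only (no request is sent).
ok_response = requests.Response()
ok_response.status_code = 200

bad_response = requests.Response()
bad_response.status_code = 404

print(check_response(ok_response))
print(check_response(bad_response))
```

In this notebook the same helper could be called as `check_response(page)` right after `requests.get(url)`.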

Next, we use BeautifulSoup, a Python library, to parse the HTML page we fetched with the requests library.

In [None]:
soup = BeautifulSoup(page.text, 'html')
print(soup)

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robot

The aim is to get the table containing the hockey teams.

In [None]:
#use the find function to locate the hockey table
hockey_table = soup.find('table', class_='table')
print(hockey_table)


<table class="table">
<tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>
<tr class="team">
<td class="name">
                            Boston Bruins
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            44
                        </td>
<td class="losses">
                            2
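
To show how `find()` behaves, here is a small sketch on a hand-written HTML snippet (the snippet is made up for illustration and is much smaller than the real page). It shows that `find()` returns the first match for the given tag and class, and returns `None` when nothing matches, so it is worth checking the result before using it.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure of the scraped page.
html = """
<table class="other"><tr><td>ignore me</td></tr></table>
<table class="table">
  <tr><th>Team Name</th><th>Year</th></tr>
  <tr class="team"><td class="name">Boston Bruins</td><td class="year">1990</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first <table> whose class list contains "table".
table = soup.find("table", class_="table")

# When nothing matches, find() returns None instead of raising an error.
missing = soup.find("table", class_="no-such-class")

if table is None:
    raise ValueError("table not found")

first_header = table.find("th").get_text(strip=True)
print(first_header)
print(missing)
```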

Next we get the column headers for the table. In HTML, header cells are written in "th" tags.

In [None]:
# find_all() works like find() but returns every matching element, not just the first one
headers = hockey_table.find_all('th')
# .text drops the HTML tags and strip() removes the surrounding whitespace
column_names = [header.text.strip() for header in headers]
print("Column Titles:", column_names)

Column Titles: ['Team Name', 'Year', 'Wins', 'Losses', 'OT Losses', 'Win %', 'Goals For (GF)', 'Goals Against (GA)', '+ / -']
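
The difference between the raw text of a header cell and the stripped text is easier to see on a tiny example. This sketch uses a made-up two-column table. `get_text(strip=True)` is shown as an alternative way to get the same cleaned text as `.text.strip()` for cells like these.

```python
from bs4 import BeautifulSoup

# Made-up header row with the same kind of indentation as the real page.
html = """
<table class="table">
<tr>
<th>
    Team Name
</th>
<th>
    Win %
</th>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_th = soup.find("th")
raw = first_th.text      # tags are gone, but newlines and spaces remain
clean = raw.strip()      # strip() removes the surrounding whitespace

# get_text(strip=True) removes tags and surrounding whitespace in one call.
names = [th.get_text(strip=True) for th in soup.find_all("th")]

print(repr(raw))
print(repr(clean))
print(names)
```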


In [None]:
# Create an empty DataFrame with the column names, using the pandas library
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -


Next we extract the rows from the table. The first row (index 0, since Python indexing starts from zero) holds the headers, so we slice from index 1 onward and store the result in rows.

In [None]:
# Get all the rows, which are in "tr" tags, skipping the header row
rows = hockey_table.find_all('tr')[1:]
rows

[<tr class="team">
 <td class="name">
                             Boston Bruins
                         </td>
 <td class="year">
                             1990
                         </td>
 <td class="wins">
                             44
                         </td>
 <td class="losses">
                             24
                         </td>
 <td class="ot-losses">
 </td>
 <td class="pct text-success">
                             0.55
                         </td>
 <td class="gf">
                             299
                         </td>
 <td class="ga">
                             264
                         </td>
 <td class="diff text-success">
                             35
                         </td>
 </tr>,
 <tr class="team">
 <td class="name">
                             Buffalo Sabres
                         </td>
 <td class="year">
                             1990
                         </td>
 <td class="wins">
                             3

Each row (tr) contains table data (td) cells, and each cell holds the text for one value. We loop through all the rows, get the text of each cell, then add it to the DataFrame we created.

In [None]:
# Loop through each row and extract the text from each cell
for row in rows:
    cells = row.find_all('td')
    # .text drops the tags and strip() removes the surrounding whitespace
    data = [cell.text.strip() for cell in cells]
    # Add the row to df using .loc, which enables accessing and modifying data in a DataFrame.
    # len(df) is the current number of rows, so it is also the next free index label.
    df.loc[len(df)] = data
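
Appending with `.loc` one row at a time works well for a table of this size. As an alternative sketch (not the approach used in this notebook), the rows can first be collected into a list of lists and then passed to `pd.DataFrame` in a single call. The HTML below is a made-up three-column sample, and `sample_df` and `records` are illustrative names.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample with the same layout as the hockey table.
html = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th><th>Wins</th></tr>
  <tr class="team"><td> Boston Bruins </td><td> 1990 </td><td> 44 </td></tr>
  <tr class="team"><td> Buffalo Sabres </td><td> 1990 </td><td> 31 </td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table", class_="table")

column_names = [th.get_text(strip=True) for th in table.find_all("th")]

# One inner list per data row; [1:] skips the header row.
records = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]
]

# Build the whole DataFrame in one step.
sample_df = pd.DataFrame(records, columns=column_names)
print(sample_df)
```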

In [None]:
# Print the first 5 rows of the DataFrame using the head() function
df.head()

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -
0,Boston Bruins,1990,44,24,,0.55,299,264,35
1,Buffalo Sabres,1990,31,30,,0.388,292,278,14
2,Calgary Flames,1990,46,26,,0.575,344,263,81
3,Chicago Blackhawks,1990,49,23,,0.613,284,211,73
4,Detroit Red Wings,1990,34,38,,0.425,273,298,-25


In [None]:
df.tail()

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -
20,Winnipeg Jets,1990,26,43,,0.325,260,288,-28
21,Boston Bruins,1991,36,32,,0.45,270,275,-5
22,Buffalo Sabres,1991,31,37,,0.388,289,299,-10
23,Calgary Flames,1991,31,37,,0.388,296,305,-9
24,Chicago Blackhawks,1991,36,29,,0.45,257,236,21


In [None]:
# Check for missing values (empty strings, as in OT Losses, are not counted as missing)
df.isnull().sum()

Unnamed: 0,0
Team Name,0
Year,0
Wins,0
Losses,0
OT Losses,0
Win %,0
Goals For (GF),0
Goals Against (GA),0
+ / -,0
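
The OT Losses column looks empty in the output above, yet `isnull()` reports 0 for it. This is because scraped empty cells are stored as empty strings, which pandas does not treat as missing. A minimal sketch on made-up data, assuming we want empty strings to count as missing:

```python
import pandas as pd

# Made-up sample: OT Losses holds empty strings, like the scraped data.
sample_df = pd.DataFrame({
    "Team Name": ["Boston Bruins", "Buffalo Sabres"],
    "OT Losses": ["", ""],
    "Wins": ["44", "31"],
})

# Empty strings are ordinary values, so nothing is reported as missing.
before = sample_df.isnull().sum()

# Replace empty strings with a real missing-value marker, then count again.
cleaned = sample_df.replace("", pd.NA)
after = cleaned.isnull().sum()

print(before)
print(after)
```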


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25 entries, 0 to 24
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Team Name           25 non-null     object
 1   Year                25 non-null     object
 2   Wins                25 non-null     object
 3   Losses              25 non-null     object
 4   OT Losses           25 non-null     object
 5   Win %               25 non-null     object
 6   Goals For (GF)      25 non-null     object
 7   Goals Against (GA)  25 non-null     object
 8   + / -               25 non-null     object
dtypes: object(9)
memory usage: 2.5+ KB
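
`df.info()` shows every column as `object` because the scraped values are all strings. If numeric work is planned later, the columns could be converted with `pd.to_numeric`. This is only a sketch on made-up data. `errors="coerce"` turns anything that cannot be parsed into NaN, and `converted` and `numeric_cols` are illustrative names.

```python
import pandas as pd

# Made-up sample with string values, like the scraped DataFrame.
sample_df = pd.DataFrame({
    "Team Name": ["Boston Bruins", "Detroit Red Wings"],
    "Year": ["1990", "1990"],
    "OT Losses": ["", ""],
    "Win %": ["0.55", "0.425"],
    "+ / -": ["35", "-25"],
})

# Every column except the team name should hold numbers.
numeric_cols = [col for col in sample_df.columns if col != "Team Name"]

converted = sample_df.copy()
for col in numeric_cols:
    # Unparseable values become NaN instead of raising an error.
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
```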


The scraped table has data from two years: 1990 and 1991.

In [None]:
# Show all unique years
print(df['Year'].unique())


['1990' '1991']


There are 21 teams.

In [None]:
# Show all unique teams
print(df['Team Name'].unique())


['Boston Bruins' 'Buffalo Sabres' 'Calgary Flames' 'Chicago Blackhawks'
 'Detroit Red Wings' 'Edmonton Oilers' 'Hartford Whalers'
 'Los Angeles Kings' 'Minnesota North Stars' 'Montreal Canadiens'
 'New Jersey Devils' 'New York Islanders' 'New York Rangers'
 'Philadelphia Flyers' 'Pittsburgh Penguins' 'Quebec Nordiques'
 'St. Louis Blues' 'Toronto Maple Leafs' 'Vancouver Canucks'
 'Washington Capitals' 'Winnipeg Jets']
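
Instead of counting the names in the printed array by hand, `nunique()` gives the number of distinct values directly. A small sketch on made-up data, which also counts the teams per year with `groupby`:

```python
import pandas as pd

# Made-up sample: one team appears in both years.
sample_df = pd.DataFrame({
    "Team Name": ["Boston Bruins", "Buffalo Sabres", "Boston Bruins"],
    "Year": ["1990", "1990", "1991"],
})

# Number of distinct team names in the whole DataFrame.
n_teams = sample_df["Team Name"].nunique()

# Number of distinct team names within each year.
teams_per_year = sample_df.groupby("Year")["Team Name"].nunique()

print(n_teams)
print(teams_per_year)
```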


Let's save the DataFrame to a CSV file.

In [None]:
df.to_csv('./Hockey_Stats.csv')
print("Data exported successfully to 'Hockey_Stats.csv'")

Data exported successfully to 'Hockey_Stats.csv'
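
By default `to_csv` also writes the row index as an extra first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, `index=False` leaves it out. The sketch below uses made-up data and an in-memory buffer instead of a file so that it leaves nothing on disk.

```python
import io

import pandas as pd

# Made-up one-row sample.
sample_df = pd.DataFrame({"Team Name": ["Boston Bruins"], "Year": ["1990"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(sample_df.to_csv()))

# index=False writes only the data columns.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
reloaded = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))
print(list(reloaded.columns))
```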
