<a href="https://colab.research.google.com/github/Janak-Khadka/GoogleColab-CurtinUni/blob/main/Week%209%20Notebooks/web_scraping_notebook_ipynb_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping Basics: A Friendly Introduction for Beginners
**Learn how to collect and analyze web data using Python, pandas, and BeautifulSoup**

## Learning Objectives
- Understand what web scraping is and why it's useful.
- Learn how to extract data from websites using BeautifulSoup.
- Use pandas to organize and analyze the scraped data.
- Practice ethical web scraping techniques.


## What is Web Scraping?

Web scraping is the process of automatically collecting information from websites.  
It helps you gather data that might not be available through APIs or downloads.

We'll use a combination of libraries:
- **requests**: to download web pages
- **BeautifulSoup**: to parse HTML and extract information
- **pandas**: to organize the scraped data

### Installing and Importing

Let's install the necessary libraries:

```python
!pip install pandas requests beautifulsoup4
```

Now let's import them:


In [1]:
import pandas as pd  # For data manipulation
import requests  # For making HTTP requests
from bs4 import BeautifulSoup  # For parsing HTML content
import time  # For adding delays between requests (ethical scraping)

## Web Scraping Ethics and Best Practices

Before we start, it's important to understand some ethical guidelines:

1. **Check robots.txt**: Always check if scraping is allowed (`website.com/robots.txt`).
2. **Respect rate limits**: Don't overload servers with too many requests.
3. **Identify yourself**: Use proper headers with your contact info.
4. **Only take what you need**: Don't scrape unnecessary data.

For this tutorial, we'll use a sandbox website built specifically for scraping practice.
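
The first two guidelines can be checked in code. The sketch below uses Python's standard `urllib.robotparser` module on a made-up robots.txt, so it runs without network access. The robots.txt content, the `EducationalScraper` name, and the `example.com` URLs are all assumptions for illustration, not rules published by any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
# Against a real site you would call parser.set_url(".../robots.txt")
# followed by parser.read() instead of parser.parse(...).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

USER_AGENT = "EducationalScraper"  # hypothetical scraper name

# Ask whether this user agent may fetch each URL
catalogue_allowed = parser.can_fetch(
    USER_AGENT, "https://example.com/catalogue/page-2.html"
)
private_allowed = parser.can_fetch(
    USER_AGENT, "https://example.com/private/report.html"
)

# Use the delay the site asks for, or fall back to 1 second if none is stated
requested_delay = parser.crawl_delay(USER_AGENT)
delay_seconds = requested_delay if requested_delay is not None else 1.0

print(f"Catalogue page allowed: {catalogue_allowed}")
print(f"Private page allowed: {private_allowed}")
print(f"Delay between requests: {delay_seconds} seconds")
```

The resulting `delay_seconds` value is the kind of number you would pass to `time.sleep()` between requests.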


## Downloading a Web Page

Let's start by downloading a simple web page. We'll use a book catalog website that's designed for scraping practice.


In [2]:
# Define headers that identify your scraper
headers = {
    'User-Agent': 'Mozilla/5.0 (Educational purpose scraper)',
    'Accept': 'text/html,application/xhtml+xml'
}

# URL of the page we want to scrape
url = 'http://books.toscrape.com/'

# Make a request to the website (the timeout stops the call from hanging indefinitely)
response = requests.get(url, headers=headers, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print(f"Successfully downloaded the page! Content length: {len(response.text)} characters")
else:
    print(f"Failed to download the page. Status code: {response.status_code}")

# Let's see what the raw HTML looks like (just the first 500 characters)
print(response.text[:500])

Successfully downloaded the page! Content length: 51294 characters
<!DOCTYPE html>
<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html lang="en-us" class="no-js"> <!--<![endif]-->
    <head>
        <title>
    All products | Books to Scrape - Sandbox
</title>

        <meta http-equiv="content-type" content="text/html; charset=UTF-8" /


## Parsing HTML with BeautifulSoup

Now that we have the HTML content, let's parse it to extract structured information.


In [3]:
# Create a BeautifulSoup object from the downloaded HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Let's check the title of the page
page_title = soup.title.text
print(f"Page title: {page_title}")

Page title: 
    All products | Books to Scrape - Sandbox



## Extracting Book Information

Let's extract information about the books shown on the page. We'll look for:
- Book titles
- Prices
- Star ratings


In [4]:
# Find all book containers on the page
book_containers = soup.find_all('article', class_='product_pod')
print(f"Found {len(book_containers)} books on this page")

# Lists to store our data
titles = []
prices = []
ratings = []

# Extract information from each book container
for book in book_containers:
    # Extract title
    title = book.h3.a['title']
    titles.append(title)

    # Extract price
    price = book.find('p', class_='price_color').text
    prices.append(price)

    # Extract star rating (contained in the class name)
    star_rating = book.find('p', class_='star-rating')['class'][1]
    ratings.append(star_rating)

# Display the first 5 items from each list
for i in range(5):
    print(f"Book {i+1}: {titles[i]}, Price: {prices[i]}, Rating: {ratings[i]}")

Found 20 books on this page
Book 1: A Light in the Attic, Price: Â£51.77, Rating: Three
Book 2: Tipping the Velvet, Price: Â£53.74, Rating: One
Book 3: Soumission, Price: Â£50.10, Rating: One
Book 4: Sharp Objects, Price: Â£47.82, Rating: Four
Book 5: Sapiens: A Brief History of Humankind, Price: Â£54.23, Rating: Five
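
The stray `Â` in front of each `£` in the output above is an encoding artifact, not part of the price. A likely explanation (an assumption, not verified against the server here) is that the page bytes are UTF-8 but were decoded as ISO-8859-1, which Requests may fall back to when the response headers name no charset. The effect can be reproduced offline with plain strings:

```python
# The pound sign occupies two bytes in UTF-8
raw_bytes = "£51.77".encode("utf-8")

# Decoding those bytes with the wrong codec turns each byte into its own character
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the right codec restores the original text
correct = raw_bytes.decode("utf-8")

print(f"Wrong codec: {garbled}")
print(f"Right codec: {correct}")
```

If that is the cause, setting `response.encoding = 'utf-8'` before reading `response.text`, or passing `response.content` to BeautifulSoup, should yield `£51.77` directly. The price-cleaning step would then need to strip `£` rather than `Â£`.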


## Converting to a DataFrame

Now let's organize our scraped data into a pandas DataFrame.


In [8]:
# Create a DataFrame from the scraped data
books_data = {
    'Title': titles,
    'Price': prices,
    'Rating': ratings
}

books_df = pd.DataFrame(books_data)

# Show the DataFrame
books_df

| | Title | Price | Rating |
|---|---|---|---|
| 0 | A Light in the Attic | Â£51.77 | Three |
| 1 | Tipping the Velvet | Â£53.74 | One |
| 2 | Soumission | Â£50.10 | One |
| 3 | Sharp Objects | Â£47.82 | Four |
| 4 | Sapiens: A Brief History of Humankind | Â£54.23 | Five |
| 5 | The Requiem Red | Â£22.65 | One |
| 6 | The Dirty Little Secrets of Getting Your Dream... | Â£33.34 | Four |
| 7 | The Coming Woman: A Novel Based on the Life of... | Â£17.93 | Three |
| 8 | The Boys in the Boat: Nine Americans and Their... | Â£22.60 | Four |
| 9 | The Black Maria | Â£52.15 | One |


## Cleaning and Processing Data

Let's clean up our data to make it more useful:
- Convert prices from string to numeric values
- Convert ratings to a numeric scale (e.g., 1-5)


In [9]:
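# Clean the price column: keep only digits and the decimal point
# (this drops the £ symbol and any stray 'Â' left by mis-decoded text), then convert to float
books_df['Price'] = books_df['Price'].str.replace(r'[^\d.]', '', regex=True).astype(float)

# Convert rating words to numeric values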
rating_mapping = {
    'One': 1,
    'Two': 2,
    'Three': 3,
    'Four': 4,
    'Five': 5
}
books_df['Rating'] = books_df['Rating'].map(rating_mapping)

# Show the updated DataFrame
books_df

| | Title | Price | Rating |
|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | 3 |
| 1 | Tipping the Velvet | 53.74 | 1 |
| 2 | Soumission | 50.1 | 1 |
| 3 | Sharp Objects | 47.82 | 4 |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 5 |
| 5 | The Requiem Red | 22.65 | 1 |
| 6 | The Dirty Little Secrets of Getting Your Dream... | 33.34 | 4 |
| 7 | The Coming Woman: A Novel Based on the Life of... | 17.93 | 3 |
| 8 | The Boys in the Boat: Nine Americans and Their... | 22.6 | 4 |
| 9 | The Black Maria | 52.15 | 1 |
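
One thing to watch with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN` without raising an error. The sketch below shows a quick check on a small made-up Series. The `"Zero"` label is a hypothetical bad value added for illustration and is not something observed on the site.

```python
import pandas as pd

rating_mapping = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Made-up ratings, including one label the mapping does not cover
sample_ratings = pd.Series(["Three", "One", "Zero", "Five"])

numeric_ratings = sample_ratings.map(rating_mapping)

# Labels that failed to map show up as NaN; list the original values behind them
unmapped_labels = sample_ratings[numeric_ratings.isna()]

print(numeric_ratings)
print(f"Unmapped labels: {unmapped_labels.tolist()}")
```

Running a check like this right after the mapping step makes it easier to notice if the page ever uses a rating word you did not expect.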


## Analyzing the Scraped Data

Now let's use pandas to analyze the data we've collected.


In [10]:
# Display basic statistics
books_df.describe()

| | Price | Rating |
|---|---|---|
| count | 20.0 | 20.0 |
| mean | 38.0485 | 2.85 |
| std | 15.135231 | 1.565248 |
| min | 13.99 | 1.0 |
| 25% | 22.6375 | 1.0 |
| 50% | 41.38 | 3.0 |
| 75% | 51.865 | 4.0 |
| max | 57.25 | 5.0 |


In [11]:
# Calculate the average price for each rating
avg_price_by_rating = books_df.groupby('Rating')['Price'].mean().sort_index()
print("Average price by rating:")
avg_price_by_rating

Average price by rating:


| Rating | Price |
|---|---|
| 1 | 40.018333 |
| 2 | 36.83 |
| 3 | 42.316667 |
| 4 | 31.105 |
| 5 | 39.75 |
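
With only 20 books, some rating groups contain very few items, so a group average can rest on one or two prices. A minimal sketch of reporting the group size next to the mean, using a small made-up DataFrame rather than the scraped data:

```python
import pandas as pd

# Made-up prices and ratings for illustration only
sample_df = pd.DataFrame({
    "Price": [10.0, 20.0, 30.0, 40.0],
    "Rating": [1, 1, 3, 5],
})

# Compute the mean price and the number of books for each rating
price_summary = sample_df.groupby("Rating")["Price"].agg(["mean", "count"])

print(price_summary)
```

Applying the same `.agg(["mean", "count"])` call to `books_df` would show how many books stand behind each average.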


## 🧠 Challenge: Scrape Multiple Pages

The book catalog has multiple pages. Can you expand your scraping to collect books from the first 3 pages?

(Hint: Look at the URL pattern for pagination and use a loop)

Try writing your own code below.


In [None]:
# Your code here
# Hint: The pagination URLs follow this pattern: 'http://books.toscrape.com/catalogue/page-{page_num}.html'

all_titles = []
all_prices = []
all_ratings = []

# Loop through multiple pages
for page_num in range(1, 4):  # Pages 1, 2, and 3
    # Construct the URL for each page
    if page_num == 1:
        page_url = 'http://books.toscrape.com/'
    else:
        page_url = f'http://books.toscrape.com/catalogue/page-{page_num}.html'

    # Add a small delay to be polite
    time.sleep(1)

    # Make the request (the timeout stops the call from hanging indefinitely)
    response = requests.get(page_url, headers=headers, timeout=10)

    # Check if successful
    if response.status_code == 200:
        print(f"Successfully scraped page {page_num}")

        # Parse the page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract book information
        # (Your extraction code here)
    else:
        print(f"Failed to scrape page {page_num}")
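
If you get stuck, here is one possible way to structure a solution; it is a sketch, not the only correct answer. The helper names `build_page_url`, `extract_books`, and `scrape_pages` are made up for this example, and the extraction logic simply reuses the lookups from the single-page cell above. Nothing is downloaded until `scrape_pages` is called.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://books.toscrape.com/"


def build_page_url(page_num):
    """Return the catalogue URL for a page number (page 1 is the home page)."""
    if page_num == 1:
        return BASE_URL
    return f"{BASE_URL}catalogue/page-{page_num}.html"


def extract_books(soup):
    """Collect title, price, and rating word from every book container."""
    books = []
    for pod in soup.find_all("article", class_="product_pod"):
        books.append({
            "Title": pod.h3.a["title"],
            "Price": pod.find("p", class_="price_color").text,
            "Rating": pod.find("p", class_="star-rating")["class"][1],
        })
    return books


def scrape_pages(num_pages, headers, delay_seconds=1.0):
    """Download the first num_pages pages, pausing between requests."""
    all_books = []
    for page_num in range(1, num_pages + 1):
        if page_num > 1:
            time.sleep(delay_seconds)  # be polite between requests
        response = requests.get(
            build_page_url(page_num), headers=headers, timeout=10
        )
        if response.status_code != 200:
            print(f"Failed to scrape page {page_num}")
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        all_books.extend(extract_books(soup))
    return all_books
```

Calling `scrape_pages(3, headers)` with the `headers` dictionary defined earlier should return a list of dictionaries that `pd.DataFrame(...)` accepts directly.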

## Saving the Scraped Data

Let's save our scraped data to a CSV file for future use.


In [None]:
# Save the DataFrame to a CSV file (index=False leaves out the row numbers)
books_df.to_csv('scraped_books.csv', index=False)
print("Data saved to 'scraped_books.csv'")
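
After saving, it is worth reading the file back to confirm it contains what you expect. The sketch below does this round trip with a small made-up DataFrame in a temporary directory, so it does not touch `scraped_books.csv`. The file name `scraped_books_sample.csv` is an assumption for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up rows in the same shape as the cleaned books DataFrame
sample_df = pd.DataFrame({
    "Title": ["A Light in the Attic", "Tipping the Velvet"],
    "Price": [51.77, 53.74],
    "Rating": [3, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "scraped_books_sample.csv")

    # Write without the index, then load the file again
    sample_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

print(reloaded_df)
```

The same `pd.read_csv('scraped_books.csv')` call is how you would load the saved data in a later session.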

## Reflection

- What was most interesting about the web scraping process?
- What challenges did you encounter?
- How could you extend this scraper to collect more detailed information?
- How might you use web scraping in your own projects?


## Additional Resources

- BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Requests Library: https://docs.python-requests.org/
- Pandas Documentation: https://pandas.pydata.org/docs/
- Web Scraping Ethics: https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
