<a href="https://colab.research.google.com/github/KaylumCassidy/CA/blob/main/17_WebScrawl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crawl A Web Page

**Web scraping**, often called web crawling or web spidering, or “programmatically going over a collection of web pages and extracting data,” is a powerful tool for working with data on the web.

With a web crawler, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own personal curiosity.

In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use **BrickSet**, a community-run site that contains information about LEGO sets. By the end of this tutorial, you’ll have a fully functional Python web crawler that walks through a series of pages on **Brickset** and extracts data about LEGO sets from each page, displaying the data to your screen.

The crawler will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.

## Step 1 — Creating a Basic Review
Scraping is a two step process:<br>
1. You systematically find and download web pages.<br>
2. You take those web pages and extract information from them.<br>

In [None]:
# Let’s create a new folder "brickset-crawler" for our project. You can do this in the terminal by running:

!mkdir brickset-crawler

Let's start from http://brickset.com/sets/year-2016 <br>
<img src="images/Bricksets.jpg"><br>

## Step 2: Crawling all links for  product items (on ONE category page)

Let's take one page for extracting content: http://brickset.com/sets/year-2016 <br>
Download the HTML source code of the category page

In [None]:
start_urls = 'http://brickset.com/sets/year-2016'

# Write you code here to download content of HTML page
import urllib.request
response = urllib.request.urlopen(start_urls)
raw_html = response.read().decode("utf-8")

# Write you code here to parse HTML page
from bs4 import BeautifulSoup
soup = BeautifulSoup(raw_html, 'html.parser')


Identify the product box for each brick item on category. Print out the title and link of detail page.<br>
For example, link for **10251: Brick Bank** is https://brickset.com/sets/10251-1/Brick-Bank <br>
<img src="images/BrickBankTitle.jpg"><br>
Save them into list **lst_productlinks**

In [None]:
import pprint
lst_productlinks = []


secbox = soup.find("section", {"class":"setlist"})
print(len(secbox.find_all("article", {"class": "set"})))
for divbox in secbox.find_all("article", {"class":"set"}):
  try:
      title = divbox.find("h1").get_text()
      href = divbox.find("a").get("href")
      stext = divbox.get_text().strip()
      print(title, "=>", href)
      lst_productlinks.append([title,href])
  except:
      pass

25
Brick Bank => https://images.brickset.com/sets/large/10251-1.jpg?201510121127
Volkswagen Beetle => https://images.brickset.com/sets/large/10252-1.jpg?201606140214
Big Ben => https://images.brickset.com/sets/large/10253-1.jpg?201605190256
Winter Holiday Train => https://images.brickset.com/sets/large/10254-1.jpg?201608110306
XL Creative Brick Box => https://images.brickset.com/sets/large/10654-1.jpg?201609271134
Creative Building Set => https://images.brickset.com/sets/large/10702-1.jpg?201511230710
Creative Building Basket => https://images.brickset.com/sets/large/10705-1.jpg?201605201119
Police Helicopter Chase => https://images.brickset.com/sets/large/10720-1.jpg?201601050913
Iron Man vs. Loki => https://images.brickset.com/sets/large/10721-1.jpg?201601050913
Snake Showdown => https://images.brickset.com/sets/large/10722-1.jpg?201601050913
Ariel's Dolphin Carriage => https://images.brickset.com/sets/large/10723-1.jpg?201601050913
Batman & Superman vs. Lex Luthor => https://images.

In [None]:
# Print result
print("{:,} links".format(len(lst_productlinks)))
for item in lst_productlinks:
    print(item)

25 links
['Brick Bank', 'https://images.brickset.com/sets/large/10251-1.jpg?201510121127']
['Volkswagen Beetle', 'https://images.brickset.com/sets/large/10252-1.jpg?201606140214']
['Big Ben', 'https://images.brickset.com/sets/large/10253-1.jpg?201605190256']
['Winter Holiday Train', 'https://images.brickset.com/sets/large/10254-1.jpg?201608110306']
['XL Creative Brick Box', 'https://images.brickset.com/sets/large/10654-1.jpg?201609271134']
['Creative Building Set', 'https://images.brickset.com/sets/large/10702-1.jpg?201511230710']
['Creative Building Basket', 'https://images.brickset.com/sets/large/10705-1.jpg?201605201119']
['Police Helicopter Chase', 'https://images.brickset.com/sets/large/10720-1.jpg?201601050913']
['Iron Man vs. Loki', 'https://images.brickset.com/sets/large/10721-1.jpg?201601050913']
['Snake Showdown', 'https://images.brickset.com/sets/large/10722-1.jpg?201601050913']
["Ariel's Dolphin Carriage", 'https://images.brickset.com/sets/large/10723-1.jpg?201601050913']
[

## Step 3 — Extracting Data from a Page

Let’s give it some data to extract.<br>
<img src="images/BrickBank.jpg">

Let's extract information for each product:<br>
 -  Title
 -  Link
 -  Theme
 -  Subtheme
 -  Pieces
 -  Packaging
 -  Price
 -  Set Type


In [None]:
import pprint
# Write you code here to extract information
secbox = soup.find("section", {"class":"setlist"})
print(len(secbox.find_all("article", {"class": "set"})))
for divbox in secbox.find_all("article", {"class":"set"}):
  try:
      title = divbox.find("h1").get_text()
      href = divbox.find("a").get("href")
      theme = divbox.find("div", {"class":"tags"}).find_all("a")[1].get_text()
      subtheme = divbox.find("div", {"class":"tags"}).find_all("a")[2].get_text()
      year = divbox.find("div", {"class":"tags"}).find_all("a")[-1].get_text()

      pieces = divbox.find("div", {"class":"col"}).find_all("a")[0].get_text()
      packaging = divbox.find("div", {"class":"col"}).find_all("dd")[5].get_text()

      kpos = -1
      for k, box in enumerate(divbox.find_all("dt")):
        if box.get_text().find("Related") >= 0:
          kpos = k
          break
      relatedset = ""
      if kpos >= 0:
        relatedset = divbox.find_all("dd")[kpos].get_text()
        #Option1:
        relatedset = relatedset[relatedset.find("with")+5:] if relatedset.find("with")>0 else relatedset
        #Option2:
        l = relatedset.split()
        l = [x for x in l if x[-2:]=="-1"]
        relatedset = " ".join(l)
        #Option3: relatedset = " ".join([t for t in relatedset.split() if t[-2:]=="-1"])

      for pricebox in divbox.find("div", {"class":"col"}).find_all("dd"):
        if pricebox.get_text().find("$") >=0:
          price = pricebox.get_text()
          break


      #set_type = divbox.find("div", {"class":"col"}).find_all("dt")[].get_text()
      lst_productlinks.append([title,href,theme,subtheme,pieces,year,packaging,price,relatedset])
  except:
      pass

  oneproduct = {}
  oneproduct["Title"] = title
  oneproduct["Link"] = href
  oneproduct["Theme"] = theme
  oneproduct["Subtheme"] = subtheme
  oneproduct["Pieces"] = pieces
  oneproduct["Year"] = year
  oneproduct["Packaging"] = packaging
  oneproduct["Price"] = price
  oneproduct["Set Type"] = relatedset

  pprint.pprint(oneproduct)
  #break

25
{'Link': 'https://images.brickset.com/sets/large/10251-1.jpg?201510121127',
 'Packaging': 'Box',
 'Pieces': '2380',
 'Price': '$169.99, €149.99 | More',
 'Set Type': '10182-1 10185-1 10190-1 10197-1 10232-1',
 'Subtheme': 'Modular Buildings Collection',
 'Theme': 'Creator Expert',
 'Title': 'Brick Bank',
 'Year': '2016'}
{'Link': 'https://images.brickset.com/sets/large/10252-1.jpg?201606140214',
 'Packaging': '20',
 'Pieces': '1167',
 'Price': '$99.99, €87.72 | More',
 'Set Type': '40252-1',
 'Subtheme': 'Vehicles',
 'Theme': 'Creator Expert',
 'Title': 'Volkswagen Beetle',
 'Year': '2016'}
{'Link': 'https://images.brickset.com/sets/large/10253-1.jpg?201605190256',
 'Packaging': '19',
 'Pieces': '4163',
 'Price': '$249.99, €219.99 | More',
 'Set Type': '',
 'Subtheme': 'Landmarks',
 'Theme': 'Creator Expert',
 'Title': 'Big Ben',
 'Year': '2016'}
{'Link': 'https://images.brickset.com/sets/large/10254-1.jpg?201608110306',
 'Packaging': 'Box',
 'Pieces': '734',
 'Price': '$99.99, €89.

# Crawling Multiple Pages

We’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.

## Step 1: Crawling all links for category pages

Let's start from http://brickset.com/sets/year-2016 <br>
<img src="images/NavigateLinks.jpg"><br>
So,estimate:<br>
 -  How many pages do we have from this category?
 -  How many pages do we have for each year?<br>
List all of these links and put into a list

In [1]:
import requests
from bs4 import BeautifulSoup

# Start URL
start_url = "http://brickset.com/sets/year-2016"

# Send a GET request to the URL
response = requests.get(start_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all links to category pages
    category_links = soup.find_all('a', class_='year')

    # Store the links in a list
    category_links_list = [link['href'] for link in category_links]

    # Print the number of pages and the list of links
    print(f"Number of pages in the category: {len(category_links_list)}")
    print("List of category links:")
    for link in category_links_list:
        print(link)

    # You can further loop through each year page to get more specific details
    for year_link in category_links_list:
        year_url = f"http://brickset.com{year_link}"
        year_response = requests.get(year_url)

        if year_response.status_code == 200:
            year_soup = BeautifulSoup(year_response.text, 'html.parser')

            # Find all links to sets for each year
            set_links = year_soup.find_all('a', class_='highslide')

            # Store the links in a list
            set_links_list = [link['href'] for link in set_links]

            # Print the number of pages and the list of links for each year
            print(f"\nNumber of pages for {year_link}: {len(set_links_list)}")
            print("List of set links:")
            for set_link in set_links_list:
                print(set_link)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")


Number of pages in the category: 50
List of category links:
/sets/theme-Creator-Expert/year-2016
/sets/theme-Creator-Expert/year-2016
/sets/theme-Creator-Expert/year-2016
/sets/theme-Creator-Expert/year-2016
/sets/theme-Creator-Expert/year-2016
/sets/theme-Creator-Expert/year-2016
/sets/theme-Creator-Expert/year-2016
/sets/theme-Creator-Expert/year-2016
/sets/theme-Classic/year-2016
/sets/theme-Classic/year-2016
/sets/theme-Classic/year-2016
/sets/theme-Classic/year-2016
/sets/theme-Classic/year-2016
/sets/theme-Classic/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Juniors/year-2016
/sets/theme-Ju

## Step 2: Crawling information for product items (on all category pages)

In [5]:
import pprint
import requests
from bs4 import BeautifulSoup

lst_productinfos = []

lst_catelinks = ["http://brickset.com/sets/year-2016", "http://brickset.com/sets/year-2017"]

# Iterate through each category page
for url in lst_catelinks:
    # Send a GET request to the category page
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the category page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract information for each product item
        for divbox in soup.find_all("article", {"class": "set"}):
            try:
                oneproduct = {}
                oneproduct["Title"] = divbox.find("h1").get_text()
                oneproduct["Link"] = divbox.find("a").get("href")
                oneproduct["Theme"] = divbox.find("div", {"class": "tags"}).find_all("a")[1].get_text()
                oneproduct["Subtheme"] = divbox.find("div", {"class": "tags"}).find_all("a")[2].get_text()
                oneproduct["Year"] = divbox.find("div", {"class": "tags"}).find_all("a")[-1].get_text()

                # Add more attributes as needed

                lst_productinfos.append(oneproduct)
            except Exception as e:
                print(f"Error processing product on {url}: {e}")
    else:
        print(f"Failed to retrieve the page {url}. Status code: {response.status_code}")

# Print or do something with lst_productinfos
for product_info in lst_productinfos:
    pprint.pprint(product_info)


{'Link': 'https://images.brickset.com/sets/large/10251-1.jpg?201510121127',
 'Subtheme': 'Modular Buildings Collection',
 'Theme': 'Creator Expert',
 'Title': 'Brick Bank',
 'Year': '2016'}
{'Link': 'https://images.brickset.com/sets/large/10252-1.jpg?201606140214',
 'Subtheme': 'Vehicles',
 'Theme': 'Creator Expert',
 'Title': 'Volkswagen Beetle',
 'Year': '2016'}
{'Link': 'https://images.brickset.com/sets/large/10253-1.jpg?201605190256',
 'Subtheme': 'Landmarks',
 'Theme': 'Creator Expert',
 'Title': 'Big Ben',
 'Year': '2016'}
{'Link': 'https://images.brickset.com/sets/large/10254-1.jpg?201608110306',
 'Subtheme': 'Winter Village Collection',
 'Theme': 'Creator Expert',
 'Title': 'Winter Holiday Train',
 'Year': '2016'}
{'Link': 'https://images.brickset.com/sets/large/10654-1.jpg?201609271134',
 'Subtheme': 'Creative Box',
 'Theme': 'Classic',
 'Title': 'XL Creative Brick Box',
 'Year': '2016'}
{'Link': 'https://images.brickset.com/sets/large/10702-1.jpg?201511230710',
 'Subtheme': '

## Step 3 — Extracting Detail information from Product Page

Each product has a detail page. For example, https://brickset.com/sets/5659-1/The-Great-Train-Chase
<img src="images/BrickDetail.jpg"><br>

In [9]:
# Assuming lst_productinfos is already populated from Step 2

for oneproduct in lst_productinfos:
    url = oneproduct["Link"]
    print(url)

    # Send a GET request to the product detail page
    response = requests.get(url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:

        # Parse the HTML content of the product detail page
        soup = BeautifulSoup(response.text, 'html.parser')

        # Write your code here to extract detailed information
        age_range_element = soup.find("div", {"class": "feature age"})

        # Check if the element exists before trying to extract information
        if age_range_element:
            age_range = age_range_element.find("dd").get_text()
            # Customize the code to extract other details as needed
            oneproduct["Age Range"] = age_range
        else:
            oneproduct["Age Range"] = "Not available"  # Or any default value

        # Add more attributes as needed

    else:
        print(f"Failed to retrieve the page {url}. Status code: {response.status_code}")

# Print or do something with updated lst_productinfos
for product_info in lst_productinfos:
    pprint.pprint(product_info)


https://images.brickset.com/sets/large/10251-1.jpg?201510121127
https://images.brickset.com/sets/large/10252-1.jpg?201606140214
https://images.brickset.com/sets/large/10253-1.jpg?201605190256
https://images.brickset.com/sets/large/10254-1.jpg?201608110306
https://images.brickset.com/sets/large/10654-1.jpg?201609271134
https://images.brickset.com/sets/large/10702-1.jpg?201511230710
https://images.brickset.com/sets/large/10705-1.jpg?201605201119
https://images.brickset.com/sets/large/10720-1.jpg?201601050913
https://images.brickset.com/sets/large/10721-1.jpg?201601050913
https://images.brickset.com/sets/large/10722-1.jpg?201601050913
https://images.brickset.com/sets/large/10723-1.jpg?201601050913
https://images.brickset.com/sets/large/10724-1.jpg?201605201119
https://images.brickset.com/sets/large/10725-1.jpg?201601050913
https://images.brickset.com/sets/large/10726-1.jpg?201605201119
https://images.brickset.com/sets/large/10727-1.jpg?201605201119
https://images.brickset.com/sets/large/1

AssertionError: ignored

The link of page has been stored in **lst_productinfos**.<br>
Save result into JSON file as pretty format

Save result into CSV file with struct: Title, Link, Theme, Subtheme, Pieces, Packaging, Price, Set Type

# Crawl images to build image dataset

Let's start from http://brickset.com/sets/year-2016 <br>
<img src="images/BrickBankImage.jpg"><br>
From each product information, let's navigate to extract src information from img node.<br>
Then, download image and save into folder **images/**

In [11]:
urlimg  = 'https://images.brickset.com/sets/large/5655-1.jpg?201009110931'
imgfname= 'images/5655-1.jpg'

import urllib.request
response = urllib.request.urlopen(urlimg)
imagesrc = response.read()
f = open(imgfname, 'wb')
f.write( imagesrc )
f.close()

In [16]:
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Function to download and save an image
def download_image(image_url, image_filename):
    try:
        response = requests.get(image_url)
        if response.status_code == 200:
            with open(image_filename, 'wb') as f:
                f.write(response.content)
            print(f"Downloaded and saved: {image_filename}")
        else:
            print(f"Failed to download image from {image_url}. Status code: {response.status_code}")
    except Exception as e:
        print(f"Error downloading image: {e}")

# URL of the starting page
base_url = 'http://brickset.com/sets/year-2016'
response = requests.get(base_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Create a folder to store images if it doesn't exist
    image_folder = 'images'
    if not os.path.exists(image_folder):
        os.makedirs(image_folder)

    # Extract product information and download images
    for set_box in soup.find_all("article", {"class": "set"}):
        try:
            title = set_box.find("h1").get_text()
            image_url = set_box.find("img")['src']
            image_filename = os.path.join(image_folder, f"{title}.jpg")

            # Download and save the image
            download_image(image_url, image_filename)
        except Exception as e:
            print(f"Error processing set: {e}")
else:
    print(f"Failed to retrieve the page {base_url}. Status code: {response.status_code}")


Downloaded and saved: images/Brick Bank.jpg
Downloaded and saved: images/Volkswagen Beetle.jpg
Downloaded and saved: images/Big Ben.jpg
Downloaded and saved: images/Winter Holiday Train.jpg
Downloaded and saved: images/XL Creative Brick Box.jpg
Downloaded and saved: images/Creative Building Set.jpg
Downloaded and saved: images/Creative Building Basket.jpg
Downloaded and saved: images/Police Helicopter Chase.jpg
Downloaded and saved: images/Iron Man vs. Loki.jpg
Downloaded and saved: images/Snake Showdown.jpg
Downloaded and saved: images/Ariel's Dolphin Carriage.jpg
Downloaded and saved: images/Batman & Superman vs. Lex Luthor.jpg
Downloaded and saved: images/Lost Temple.jpg
Downloaded and saved: images/Stephanie's Horse Carriage.jpg
Downloaded and saved: images/Emma's Ice Cream Truck.jpg
Downloaded and saved: images/Mia's Vet Clinic.jpg
Downloaded and saved: images/Cinderella's Carriage.jpg
Downloaded and saved: images/Baby Animals.jpg
Downloaded and saved: images/Savanna.jpg
Downloade