# Crawl A Web Page

**Web scraping**, often called web crawling or web spidering, means programmatically going over a collection of web pages and extracting data. It is a powerful tool for working with data on the web.

With a web crawler, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own personal curiosity.

In this tutorial, you’ll learn the fundamentals of the scraping and spidering process as you explore a real-world data set. We’ll use **eBay**, an online marketplace whose category pages list products for sale. By the end of this tutorial, you’ll have a working Python web crawler that collects product information from **eBay**.

This is a stepping stone for web crawling.

## Step 1: Creating a Basic Scraper
Scraping is a two-step process:<br>
1. You systematically find and download web pages.<br>
2. You take those web pages and extract information from them.<br>
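
The two steps can be kept as separate functions, which makes each one easy to try on its own. The sketch below is illustrative only: `download_page`, `LinkCollector`, and `extract_links` are hypothetical names, the sample HTML is made up, and it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages.

```python
import urllib.request
from html.parser import HTMLParser


def download_page(url, timeout=10):
    """Step 1: download a page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Step 2: extract information (here, link targets) from the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up HTML so the extraction step can be tried without a network request.
sample_html = '<ul><li><a href="/itm/1">First</a></li><li><a href="/itm/2">Second</a></li></ul>'
print(extract_links(sample_html))
```

Keeping the download step apart from the extraction step also means the extraction logic can be rerun on saved HTML without requesting the page again.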

In [1]:
# Let’s create a new folder "Ebay-crawler" for our project. In a notebook, the leading "!" runs the command in the shell:

!mkdir Ebay-crawler

Let's start from https://www.ebay.ie/b/Computers-Tablets-Network-Hardware/58058/bn_1838792?_catref=1 <br>

## Step 2: Crawling all links for product items (on ONE category page)

Let's take one page to extract content from: https://www.ebay.ie/b/Computers-Tablets-Network-Hardware/58058/bn_1838792?_catref=1 <br>
First, download the HTML source code of the category page and parse it.

In [2]:
import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.ebay.ie/b/Computers-Tablets-Network-Hardware/58058/bn_1838792?_catref=1'

# Write your code here to download the content of the HTML page
response = urllib.request.urlopen(url)
raw_html = response.read().decode("utf-8")

# Write your code here to parse the HTML page
soup = BeautifulSoup(raw_html, 'html.parser')
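
Some sites respond differently to requests that lack a browser-like `User-Agent` header, and a request without a timeout can hang. A minimal sketch of a more defensive download step follows. `build_request`, `fetch_html`, and the header value are assumptions for illustration and are not part of the original notebook.

```python
import urllib.error
import urllib.request

# Hypothetical header value; replace it with one that identifies your crawler.
USER_AGENT = "Mozilla/5.0 (compatible; tutorial-crawler/0.1)"


def build_request(url):
    """Create a Request object that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the decoded HTML, or None if the request fails with a URLError."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print("Download failed:", err)
        return None


# Building the request does not contact the server, so this line is safe to run offline.
request = build_request("https://www.ebay.ie/b/Computers-Tablets-Network-Hardware/58058/bn_1838792?_catref=1")
print(request.get_header("User-agent"))
```

With this helper, `raw_html = fetch_html(url)` could stand in for the two download lines, with a check for `None` before parsing.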


Identify the product box for each product item on the category page. Print out the title and the link to the detail page. <br>
Save them into the list **lst_productlinks**.

In [3]:
lst_productlinks = []

# Write your code here to extract the title and link.
# <Customize the selectors here to find the DOM nodes of the product items>
for divbox in soup.find_all("ul", {"class": "b-list__items_nofooter srp-results srp-grid"}):
    for libox in divbox.find_all("li"):
        title = libox.get_text().strip()
        anchor = libox.find("a")
        # Skip list items that contain no link
        if anchor is None:
            continue
        href = anchor.get("href")
        print(title, " => ", href)
        lst_productlinks.append([title, href])

ESET NOD32 Antivirus 1 Device 1 Year ESD KeyEUR 3.84Free postage616 sold  =>  https://www.ebay.ie/itm/185161216095?hash=item2b1c77f05f:g:3agAAOSwmhFfmWb3
ESET Internet Security 3 Device 1 Year DIGITAL Key DeliveryEUR 6.14Free postage19,073 sold  =>  https://www.ebay.ie/itm/184509310431?hash=item2af59ca5df:g:BUwAAOSwkKhgYVWT
Bluetooth 5.0 USB Dongle CSR 300Mbps Win 10 7 Vista Adapter PC Laptop ComputerEUR 5.95(EUR 5.95/Unit)Free postageor Best Offer718 sold  =>  https://www.ebay.ie/itm/233777759916?hash=item366e3d86ac:g:Wo4AAOSwPZhfrEpU
Samsung Galaxy Tab A7 8.7" Lite 2021 T220/T225 PU Leather Flip Smart Case CoverEUR 5.99 to EUR 6.99EUR 4.99 postage  =>  https://www.ebay.ie/itm/174969814832?hash=item28bd037f30:g:GEoAAOSwVb5fy6uZ&var=474116985269
Wireless Bluetooth 4.1 Receiver Transmitter Adapter For Car Music Aux 3.5mm JackEUR 6.49(EUR 6.49/Unit)Free postage609 sold  =>  https://www.ebay.ie/itm/233623956251?hash=item366512ab1b:g:MKEAAOSwe~Ze7Min
ESET Internet Security 5 Device License
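
The links printed above carry query parameters such as `hash` and `var`. If only the item page itself matters, the query string can be dropped with `urllib.parse`. This is a sketch under the assumption that the part before `?` is enough to reach the item page; `strip_query` and `item_id` are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(link):
    """Return the link without its query string and fragment."""
    parts = urlsplit(link)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def item_id(link):
    """Return the last path segment, which is the item number in the links above."""
    return urlsplit(link).path.rstrip("/").split("/")[-1]


link = "https://www.ebay.ie/itm/185161216095?hash=item2b1c77f05f:g:3agAAOSwmhFfmWb3"
print(strip_query(link))  # https://www.ebay.ie/itm/185161216095
print(item_id(link))      # 185161216095
```

Shortened links like these are also convenient keys for spotting the same item listed twice.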

In [4]:
# Print result
print("{:,} links".format(len(lst_productlinks)))
for item in lst_productlinks:
    print(item)

48 links
['ESET NOD32 Antivirus 1 Device 1 Year ESD KeyEUR 3.84Free postage616 sold', 'https://www.ebay.ie/itm/185161216095?hash=item2b1c77f05f:g:3agAAOSwmhFfmWb3']
['ESET Internet Security 3 Device 1 Year DIGITAL Key DeliveryEUR 6.14Free postage19,073 sold', 'https://www.ebay.ie/itm/184509310431?hash=item2af59ca5df:g:BUwAAOSwkKhgYVWT']
['Bluetooth 5.0 USB Dongle CSR 300Mbps Win 10 7 Vista Adapter PC Laptop ComputerEUR 5.95(EUR 5.95/Unit)Free postageor Best Offer718 sold', 'https://www.ebay.ie/itm/233777759916?hash=item366e3d86ac:g:Wo4AAOSwPZhfrEpU']
['Samsung Galaxy Tab A7 8.7" Lite 2021 T220/T225 PU Leather Flip Smart Case CoverEUR 5.99 to EUR 6.99EUR 4.99 postage', 'https://www.ebay.ie/itm/174969814832?hash=item28bd037f30:g:GEoAAOSwVb5fy6uZ&var=474116985269']
['Wireless Bluetooth 4.1 Receiver Transmitter Adapter For Car Music Aux 3.5mm JackEUR 6.49(EUR 6.49/Unit)Free postage609 sold', 'https://www.ebay.ie/itm/233623956251?hash=item366512ab1b:g:MKEAAOSwe~Ze7Min']
['ESET Internet Secu
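
Printing is fine for a quick check, but the collected pairs are easier to reuse once they are written to a file. Below is a minimal sketch using the standard `csv` module. The file name `productlinks.csv`, the `save_links` helper, and the sample rows are assumptions for illustration; in the notebook you would pass `lst_productlinks` instead.

```python
import csv
import os
import tempfile


def save_links(rows, path):
    """Write [title, link] pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title", "link"])
        writer.writerows(rows)


# Stand-in for lst_productlinks so the sketch runs on its own.
sample_links = [
    ["Example product A", "https://www.ebay.ie/itm/1"],
    ["Example product, with comma", "https://www.ebay.ie/itm/2"],
]

# Written to the system temp directory here; any writable path works.
csv_path = os.path.join(tempfile.gettempdir(), "productlinks.csv")
save_links(sample_links, csv_path)

# Read the file back to confirm that titles containing commas survive intact.
with open(csv_path, newline="", encoding="utf-8") as handle:
    print(list(csv.reader(handle)))
```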

## Step 3: Extracting Data from a Page

Now that we have the product links, let's extract structured data from the page.<br>

Let's extract the following information for a product:<br>
 -  Title
 -  Link
 -  Price
 -  Postage
 

In [6]:
# Write your code here to extract information
import pprint

for divbox in soup.find_all("ul", {"class": "b-list__items_nofooter srp-results srp-grid"}):
    # Take the first product box so that every field comes from the same item
    item = divbox.find("li", {"class": "s-item s-item--large s-item--bgcolored"})
    if item is None:
        continue
    details = item.find("div", {"class": "s-item__details clearfix"})
    primary = details.find_all("div", {"class": "s-item__detail s-item__detail--primary"})

    oneproduct = {}
    oneproduct["Title"] = item.find("h3").get_text()
    oneproduct["Link"] = item.find("a").get("href")
    oneproduct["Price"] = primary[0].find("span", {"class": "s-item__price"}).get_text()
    oneproduct["Postage"] = primary[1].get_text()

    pprint.pprint(oneproduct)

    # Try to get the rest yourself :)

{'Link': 'https://www.ebay.ie/itm/185161216095?hash=item2b1c77f05f:g:3agAAOSwmhFfmWb3',
 'Postage': 'Free postage',
 'Price': 'EUR 3.84',
 'Title': 'ESET NOD32 Antivirus 1 Device 1 Year ESD Key'}
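
The `Price` value is still text such as `'EUR 3.84'`, and the listing output from Step 2 also shows ranges like `EUR 5.99 to EUR 6.99`. To sort or compare prices, the numbers need to be pulled out of the text. The sketch below assumes that prices always follow the `EUR <amount>` pattern seen in the output and that a comma, if present, is a thousands separator. `parse_prices` is a hypothetical helper.

```python
import re
from decimal import Decimal

# Matches "EUR" followed by an amount such as 3.84 or 1,234.50.
PRICE_PATTERN = re.compile(r"EUR\s*(\d[\d,]*(?:\.\d+)?)")


def parse_prices(text):
    """Return every EUR amount found in text as a Decimal."""
    return [Decimal(amount.replace(",", "")) for amount in PRICE_PATTERN.findall(text)]


print(parse_prices("EUR 3.84"))
print(parse_prices("EUR 5.99 to EUR 6.99"))
```

A single price yields a one-element list and a range yields two elements, so the lowest price is always `min(parse_prices(text))` when the list is not empty.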
