# Build a Python Web Crawler From Scratch
![](images/pexels.jpg)

# Introduction

Why would anyone want to collect more data when there is so much already? 

# HTML anatomy basics

We will start with the basics of HTML anatomy. Nearly every website on the Internet is built with a combination of HTML and CSS (and usually JavaScript, which we won't cover here).

Below is a sample HTML code, with some important parts annotated.

![](images/1.png)

HTML organizes and positions information on a blank page using nodes called tags. The topmost parent tag above is `bookstore`. A tag nested inside another tag is called a child; the `book` tag is an example.

Many HTML tags have *attributes* that encode additional information about the things displayed on the screen. The first attribute we see is an `id` attribute, which is used to give a unique identifier to a tag, as there can be many `bookstore` tags in the page source.

Another, lower-level identifier for tags is the class, an example of which is the `title` tag with a class of `name`. HTML best practices dictate that a given `id` should be used on only one tag, while classes should be used to group tags that work similarly.

Other attributes include `lang`, which specifies the language of a tag's content, and the very common `href` attribute of the `a` tag, which links a piece of text to a web page.
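
To make the vocabulary of tags and attributes concrete, here is a minimal sketch that lists every opening tag in a small snippet together with its attributes. It uses Python's built-in `html.parser` module rather than the library introduced later, and the `TagLister` class and `SNIPPET` string are illustrative names, not part of the tutorial's code.

```python
from html.parser import HTMLParser

# A shortened version of the annotated sample, used only for illustration
SNIPPET = """
<bookstore id="main">
    <book>
        <title lang="en" class="name">Harry Potter</title>
    </book>
</bookstore>
"""


class TagLister(HTMLParser):
    """Collect every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag_name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # `attrs` arrives as a list of (name, value) tuples
        self.seen.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SNIPPET)

for tag, attributes in lister.seen:
    print(tag, attributes)
```

Each printed line pairs a tag name with a dictionary of its attributes, which is exactly the information a scraper looks for.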

The whole idea behind web scraping is to use automation to extract information from the massive sea of HTML tags and their attributes. One of the many tools for this job is XPath.

# XPath with `lxml`

XPath stands for XML Path Language. Its syntax contains intuitive rules for locating HTML tags and extracting information from their attributes and text. In this section, we will practice XPath on the HTML code you saw in the picture above:

In [1]:
sample_html = """
<bookstore id='main'>

    <book>
        <img src='https://books.toscrape.com/index.html'>
        <title lang="en" class='name'>Harry Potter</title>
        <price>29.99</price>
    </book>

    <book>
        <a href='https://www.w3schools.com/xml/xpath_syntax.asp'>
            <title lang="en">Learning XML</title>
        </a>
        <price>39.95</price>
    </book>

</bookstore>
"""

To start querying this HTML code with XPath, we need a small library:

```bash
pip install lxml
```

The `lxml` library lets you both parse HTML from a string and query it using XPath. First, we will convert the above string to an HTML element using the `fromstring` function:

In [2]:
from lxml import html

source = html.fromstring(sample_html)

source

<Element bookstore at 0x1e612a769a0>

In [3]:
type(source)

lxml.html.HtmlElement

Now, let's write our first XPath code. We will just select the `bookstore` tag first:

In [4]:
source.xpath("//bookstore")

[<Element bookstore at 0x1e612a769a0>]

It is as simple as that. Just write a double forward slash followed by a tag name to select that tag anywhere in the HTML tree. We can do the same for the `book` tag:

In [5]:
source.xpath("//book")

[<Element book at 0x1e612afcb80>, <Element book at 0x1e612afcbd0>]

As you can see, we get a list of two `book` tags. Now, let's see how to choose an immediate child of a tag. For example, let's choose the `title` tag that comes right inside the `book` tag:

In [6]:
source.xpath("//book/title")

[<Element title at 0x1e6129dfa90>]

We've only got a single element, which is the first `title` tag. The second one wasn't chosen because it is not an immediate child of the second `book` tag: it sits inside an `a` tag. But we can replace the single forward slash with a double one to choose both `title` tags:

In [7]:
source.xpath("//book//title")

[<Element title at 0x1e6129dfa90>, <Element title at 0x1e612b0edb0>]

Now, let's see how to choose the text inside a tag:

In [8]:
source.xpath("//book/title[1]/text()")

['Harry Potter']

We are choosing the text inside the first `title` tag. As you can see, we can also specify which of the `title` tags we want using bracket notation. Note that XPath positions start at 1, not 0, and are counted among the children of each parent. To choose the text inside that tag, just follow it with a forward slash and the `text()` function.
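
Because a position in brackets is counted per parent, it can return more than one element. The sketch below illustrates this on a small stand-in document (two lists, not the bookstore sample) and shows how parentheses change the meaning; the variable names are only for illustration.

```python
from lxml import html

# A small stand-in document with two lists of two items each
doc = html.fromstring("""
<div>
    <ul><li>a1</li><li>a2</li></ul>
    <ul><li>b1</li><li>b2</li></ul>
</div>
""")

# [1] is applied per parent: the first <li> inside *each* <ul>
first_per_list = doc.xpath("//ul/li[1]/text()")

# Parentheses apply [1] to the whole result: the first <li> in the document
first_overall = doc.xpath("(//ul/li)[1]/text()")

# last() selects the final <li> child of each <ul>
last_per_list = doc.xpath("//ul/li[last()]/text()")

print(first_per_list)  # ['a1', 'b1']
print(first_overall)   # ['a1']
print(last_per_list)   # ['a2', 'b2']
```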

Finally, let's look at how to locate tags based on their attributes, such as `id`, `class`, `href`, or any other attribute inside `<>`. Below, we will choose the `title` tag with the `name` class:

In [9]:
source.xpath("//title[@class='name']")

[<Element title at 0x1e6129dfa90>]

As expected, we get a single element. Here are a few examples of choosing other tags using attributes:

In [10]:
source.xpath("//*[@id='main']")  # choose any element with id 'main'

[<Element bookstore at 0x1e612a769a0>]

In [11]:
source.xpath("//title[@lang='en']")  # choose every title tag whose 'lang' attribute is 'en'

[<Element title at 0x1e6129dfa90>, <Element title at 0x1e612b0edb0>]

I suggest you look at [this page](https://www.w3schools.com/xml/xpath_syntax.asp) of W3Schools to learn more about XPath.
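
So far we have used attributes to find tags, but a scraper often needs the attribute values themselves, such as the URL inside an `href`. Here is a minimal sketch of two ways to read them with `lxml`; the snippet and its `example.com` links are made up for illustration.

```python
from lxml import html

# A made-up snippet with two links
doc = html.fromstring("""
<div id="main">
    <a href="https://example.com/first" class="link primary">First</a>
    <a href="https://example.com/second" class="link">Second</a>
</div>
""")

# Ending the expression with /@href returns the attribute values as strings
all_links = doc.xpath("//a/@href")

# Alternatively, select the element and read the attribute with .get()
primary = doc.xpath("//a[contains(@class, 'primary')]")[0]

print(all_links)
print(primary.get("href"), primary.text)
```

The `/@href` form is handy when you only need the values, while `.get()` is convenient when you already hold the element and want several of its attributes.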

# Creating a class to store the data

For this tutorial, we will be scraping this [e-store's computers section](https://slickdeals.net/computer-deals/?page=1):

![](images/2.png)

We will be extracting every item's name, manufacturer, price, number of likes, reviews and the image URL. To make things easier, we will create a class to hold this data, starting with the name, price and manufacturer:

In [82]:
class StoreItem:
    """
    A general class to store item data in a concise manner.
    """

    def __init__(self, name, price, manufacturer):
        self.name = name
        self.price = price
        self.manufacturer = manufacturer

Let's initialize the first item manually:

In [83]:
item1 = StoreItem("Lenovo IdeaPad", 749, "Walmart")
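
As an optional variation, the same container can be written as a dataclass, which generates `__init__` and a readable `__repr__` automatically. This is only a sketch: `StoreItemRecord` is a hypothetical name, and the field types are assumptions about what the scraped values will eventually look like.

```python
from dataclasses import dataclass


@dataclass
class StoreItemRecord:
    """Hypothetical dataclass variant of StoreItem with the same three fields."""

    name: str
    price: float
    manufacturer: str


record = StoreItemRecord("Lenovo IdeaPad", 749, "Walmart")

# The generated __repr__ shows every field, which helps when debugging a scraper
print(record)
```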

# Getting the page source

Now, let's get down to serious business. To scrape the website, we will need its HTML source. Getting it requires another library:

```bash
pip install requests
```

`requests` lets you send HTTP requests to websites and get back their responses, including the HTML code. It is as easy as calling its `get` function with the webpage address:

In [14]:
import requests

HOME_PAGE = "https://slickdeals.net/computer-deals/?page=1"
requests.get(HOME_PAGE)

<Response [200]>

If the response comes with a 200 status code, it means the request was successful. To get the HTML code, we use the `content` attribute:

In [15]:
r = requests.get(HOME_PAGE)

source = html.fromstring(r.content)

In [16]:
source

<Element html at 0x1e612ba63b0>

Above, we are converting the result to an `lxml`-compatible object. As we will probably repeat this process a few times, we will wrap it in a function:

In [17]:
def get_source(page_url):
    """
    A function to download the page source of the given URL.
    """
    r = requests.get(page_url)
    source = html.fromstring(r.content)

    return source
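
The function above assumes that every request succeeds. A slightly more defensive variant might look like the sketch below, where `get_source_checked` and its `timeout` default are hypothetical choices rather than part of the tutorial's code.

```python
import requests
from lxml import html


def get_source_checked(page_url, timeout=10):
    """
    Hypothetical variant of get_source that fails loudly on bad responses.
    """
    # A timeout stops the call from hanging forever on an unresponsive server
    r = requests.get(page_url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    r.raise_for_status()

    return html.fromstring(r.content)
```

Failing early this way makes it obvious when a page was blocked or missing, instead of silently producing an empty list of items later on.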

In [18]:
source = get_source(HOME_PAGE)
source

<Element html at 0x1e612d11770>

But here is a problem: a typical page contains tens of thousands of lines of HTML code, which makes visual exploration of the code impractical. For this reason, we will turn to our browser to figure out which tags and attributes contain the information we want.

After loading the page, right-click anywhere on the page and click "Inspect" to open developer tools:

![](images/2.gif)

Using the "selector arrow", you can hover over and click on parts of the page to see the element under the cursor along with its attributes. The panel below also jumps to the location of the selected element. As we can see, all store items are within `li` elements whose class attribute contains the words `fpGridBox grid`. Let's choose them using XPath:

In [73]:
source = get_source(HOME_PAGE)

li_list = source.xpath("//li[contains(@class, 'fpGridBox grid')]")
len(li_list)

28

Since the full class names vary from item to item, we match on the part of the class name that all the `li` elements share. As a result, we have selected 28 `li` elements, which can be double-checked by counting them on the webpage itself.
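
One caveat: `contains()` is a plain substring test, so it also matches longer class names that merely start with the same letters. The sketch below shows the difference on a made-up snippet, where `gridCategory` is a hypothetical class name invented for the example, along with a common XPath 1.0 idiom for matching a whole class token.

```python
from lxml import html

# Made-up snippet: the second class list only shares a prefix with the first
doc = html.fromstring("""
<ul>
    <li class="fpGridBox grid first">wanted</li>
    <li class="fpGridBox gridCategory">same prefix, different class</li>
</ul>
""")

# Substring test: also matches 'fpGridBox gridCategory'
loose = doc.xpath("//li[contains(@class, 'fpGridBox grid')]")

# Token test: pad the class list with spaces and look for ' grid ' as a whole word
strict = doc.xpath(
    "//li[contains(concat(' ', normalize-space(@class), ' '), ' grid ')]"
)

print(len(loose), len(strict))  # 2 1
```

On the page scraped in this tutorial the substring test returned the expected count, so the stricter form is only needed if unrelated elements start to sneak in.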

# Extracting the data

Now, let's start extracting the item details from the `li` elements. Let's first look at how to find the item's name using the "selector arrow":

![](images/3.gif)

As you can see, the item names are located inside `a` tags with class names that contain the `itemTitle` keyword. Let's select them with XPath to make sure:

In [30]:
item_names = [
    li.xpath(".//a[@class='itemTitle bp-p-dealLink bp-c-link']") for li in li_list
]

len(item_names)

28

As expected, we got 28 results, one for each `li` element. This time, we are chaining XPath on the `li` elements themselves, which requires starting the expression with a dot (`.`) so that the search stays inside each element. Below, I will write the XPath for the other item details using the browser tools:

In [84]:
li_xpath = "//li[contains(@class, 'fpGridBox grid')]"  # Choose the `li` items

names_xpath = ".//a[@class='itemTitle bp-p-dealLink bp-c-link']/text()"
manufacturer_xpath = ".//*[contains(@class, 'itemStore bp-p-storeLink')]/text()"
price_xpath = ".//*[contains(@class, 'itemPrice')]/text()"
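
The leading dot in the item-level expressions matters more than it may seem. As a small illustration on a made-up two-item list, the sketch below runs the same query on one `li` element with and without the dot.

```python
from lxml import html

# Made-up list with one link per item
doc = html.fromstring("""
<ul>
    <li><a>first</a></li>
    <li><a>second</a></li>
</ul>
""")

first_li = doc.xpath("//li")[0]

# Leading dot: search only among the descendants of first_li
inside = first_li.xpath(".//a/text()")

# No dot: '//' starts from the document root, so every <a> is returned
everywhere = first_li.xpath("//a/text()")

print(inside)      # ['first']
print(everywhere)  # ['first', 'second']
```

Without the dot, every item in a loop would receive the matches of the whole page rather than its own.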

Now we have everything we need to scrape all items on the page. Let's do it in a loop:

In [87]:
li_list = source.xpath(li_xpath)

items = list()
for li in li_list:
    name = li.xpath(names_xpath)
    manufacturer = li.xpath(manufacturer_xpath)
    price = li.xpath(price_xpath)

    # Store inside a class
    item = StoreItem(name, price, manufacturer)
    items.append(item)

In [89]:
len(items)

28
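
Keep in mind that `xpath()` returns a list, so each attribute stored above is a list of strings rather than a single clean value. One possible way to tidy them up is sketched below; `first_text` and `parse_price` are hypothetical helpers, and the raw values are invented stand-ins for what the page might return.

```python
import re


def first_text(matches):
    """Return the first XPath text match with whitespace stripped, or None."""
    return matches[0].strip() if matches else None


def parse_price(text):
    """Turn a string such as '$1,299.99' into a float; return None if no number is found."""
    if text is None:
        return None
    found = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(found.group().replace(",", "")) if found else None


# Invented raw values shaped like the lists returned by li.xpath(...)
raw_name = ["\n  Lenovo IdeaPad  \n"]
raw_price = ["$749"]
raw_missing = []

print(first_text(raw_name))                  # Lenovo IdeaPad
print(parse_price(first_text(raw_price)))    # 749.0
print(parse_price(first_text(raw_missing)))  # None
```

Returning `None` for missing values keeps the loop running when an item lacks a price or store name.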

# Handling the pagination

We've got all items on this page. However, if you scroll down, you'll see a "Next" button, indicating that there are more items to scrape. We don't want to visit all pages manually one by one, because there can be hundreds of pages.

But pay attention to the URL each time we click the Next button:

![](images/4.gif)

As you can see, the page number in the URL changes as we click the button. I've checked that there are 22 pages of items on the website, so we can reach every page by changing that number in the URL.
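
One way to put the page number to work is sketched below. It builds one URL per page from the pattern seen in the address bar and walks through them with a pause between requests. `BASE_URL`, `N_PAGES`, `crawl` and the `scrape_page` callable are illustrative names and assumptions, not the tutorial's final code.

```python
import time

BASE_URL = "https://slickdeals.net/computer-deals/?page={}"
N_PAGES = 22  # page count stated in the text; it may change over time

# One URL per page: ?page=1 through ?page=22
page_urls = [BASE_URL.format(n) for n in range(1, N_PAGES + 1)]


def crawl(urls, scrape_page, delay=1.0):
    """
    Call scrape_page(url) for every URL and collect the returned items.

    scrape_page is a hypothetical callable that downloads one page and
    returns a list of items, e.g. a wrapper around get_source and the
    item-extraction loop.
    """
    all_items = []
    for url in urls:
        all_items.extend(scrape_page(url))
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return all_items
```

Passing the scraping step in as a function keeps the pagination logic separate from the page-specific XPath expressions, so either part can change without touching the other.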

# Conclusion