<a href="https://colab.research.google.com/github/Liping-LZ/BDAO_DSDO/blob/main/Web_crawler%26API/02_Web_crawling_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Before we start

This notebook shows how to use Python to crawl a website and collect all of its links.

Web crawling is the process of systematically browsing and discovering web pages on the internet using automated software called web crawlers or spiders.

A crawler follows links from page to page in order to find as many web pages as possible.

Here we will use `requests` and `BeautifulSoup` to crawl and extract links from the website https://books.toscrape.com/, which was built for practising web scraping.

This is just a simple example. In practice, crawling can be much more complicated because real business websites often have more complex structures.

In addition, web crawling can be used for efficient web optimisation. We will continue this topic in Week 4.

## Step 1: Import the relevant libraries

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

## Step 2: Define the base URL, a set for visited URLs, and a list of URLs to visit

In [None]:
# URL of the website to crawl
base_url = "https://books.toscrape.com/"

# Set to store visited URLs
visited_urls = set()

# List to store URLs to visit next
urls_to_visit = [base_url]
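To see why a set and a list are chosen here, the small sketch below may help. It uses made-up `example.com` URLs and hypothetical names (`demo_visited`, `demo_queue`) that are separate from the notebook's own variables. It illustrates that a set silently ignores duplicates, while a list keeps insertion order, so `pop(0)` returns the oldest entry first.

```python
# Illustrative values only; these names are not used elsewhere in the notebook.
demo_visited = set()
demo_queue = ["https://example.com/"]

# Adding the same URL twice leaves a single entry in the set.
demo_visited.add("https://example.com/a")
demo_visited.add("https://example.com/a")

# New URLs are appended to the end of the list ...
demo_queue.append("https://example.com/a")
demo_queue.append("https://example.com/b")

# ... and pop(0) takes from the front, so pages come out in discovery order.
first = demo_queue.pop(0)

print(len(demo_visited))  # 1
print(first)              # https://example.com/
print(demo_queue)         # the two URLs that are still waiting
```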

## Step 3: Write a function to crawl each web page and extract the links.

`urllib.parse.urljoin`

Construct a full (“absolute”) URL by combining a “base URL” (base) with another URL (url). Informally, this uses components of the base URL, in particular the addressing scheme, the network location and (part of) the path, to provide missing components in the relative URL.

For example:

>from urllib.parse import urljoin

>urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')


Output: 'http://www.cwi.nl/%7Eguido/FAQ.html'
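The same idea applies to the kinds of links a crawler is likely to meet. The sketch below resolves a few `href` values against a page URL. The page URL and the `href` strings are illustrative assumptions (the `/static/style.css` path in particular is hypothetical), not values taken from the crawl.

```python
from urllib.parse import urljoin

# Assumed page URL, used only as the base for this illustration.
page = "https://books.toscrape.com/catalogue/category/books_1/index.html"

# A plain file name replaces the last path segment of the base.
sibling = urljoin(page, "page-2.html")

# Each "../" moves one directory up before the rest is appended.
parent = urljoin(page, "../../../index.html")

# A leading "/" restarts from the site root (hypothetical path).
rooted = urljoin(page, "/static/style.css")

# A URL that is already absolute is returned unchanged.
absolute = urljoin(page, "https://example.com/about")

print(sibling)   # https://books.toscrape.com/catalogue/category/books_1/page-2.html
print(parent)    # https://books.toscrape.com/index.html
print(rooted)    # https://books.toscrape.com/static/style.css
print(absolute)  # https://example.com/about
```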

In [None]:
# Function to crawl a page and extract links
def crawl_page(url):
    # Use try/except so that a failed request does not stop the whole crawl
    try:
        response = requests.get(url, timeout=10)  # Send an HTTP GET request; give up after 10 seconds
        response.raise_for_status()  # Raise an exception for HTTP error status codes (4xx/5xx)

        soup = BeautifulSoup(response.content, "html.parser")  # Parse the HTML page with BeautifulSoup

        # Extract the links found on this page
        links = []  # Start with an empty list of links
        # In HTML, links usually sit in the href attribute of <a> tags,
        # so find_all("a", href=True) returns every <a> tag that has an href
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])  # Resolve the (possibly relative) href against the page URL
            links.append(next_url)  # Add the absolute URL to the list

        return links

    except requests.exceptions.RequestException as e:
        print(f"Error crawling {url}: {e}")  # Report the request error and carry on
        return []
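The link-extraction step can also be tried without any network access. The sketch below runs the same `find_all("a", href=True)` and `urljoin` calls on a hand-written HTML string. The HTML, the `page_url` value, and the name `demo_links` are made up for illustration.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Made-up HTML standing in for a downloaded page.
html = """
<html><body>
  <a href="catalogue/page-2.html">next</a>
  <a href="/index.html">home</a>
  <a name="top">anchor without href</a>
</body></html>
"""
page_url = "https://books.toscrape.com/index.html"  # assumed URL of that page

soup = BeautifulSoup(html, "html.parser")

# href=True skips the third <a> tag, which has no href attribute.
demo_links = [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]

print(demo_links)
```

Only two URLs are collected, because the third `<a>` tag has no `href` and is filtered out.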

## Step 4: Start crawling the website

In [None]:
# Crawl the website
while urls_to_visit:  # Continue while there are still URLs waiting to be visited
    current_url = urls_to_visit.pop(0)  # Dequeue the first URL

    if current_url in visited_urls:  # Skip pages that have already been crawled
        continue

    print(f"Crawling: {current_url}")

    new_links = crawl_page(current_url)  # Get all the links found on the current page
    visited_urls.add(current_url)  # Mark the current URL as visited
    urls_to_visit.extend(new_links)  # Queue the new links for later visits

print("Crawling finished.")  # Reached once there are no more URLs to visit
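On a real website, a loop like this could wander onto other domains or run for a very long time. One possible variation is sketched below; it is not part of the original notebook. The function name `crawl_site`, the `max_pages` limit, and the `get_links` parameter are all assumptions introduced for illustration. It keeps the same queue-and-set logic but only follows links on the starting host and stops after a fixed number of pages. A dictionary called `fake_site` stands in for the network so the example runs offline.

```python
from collections import deque
from urllib.parse import urlparse

def crawl_site(start_url, get_links, max_pages=50):
    """Breadth-first crawl limited to start_url's host and to max_pages pages."""
    host = urlparse(start_url).netloc
    visited = set()
    queue = deque([start_url])  # deque gives a fast popleft()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        # Skip pages already seen and pages on a different host.
        if url in visited or urlparse(url).netloc != host:
            continue
        visited.add(url)
        queue.extend(get_links(url))
    return visited

# A tiny made-up "website": each URL maps to the links found on that page.
fake_site = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
}

found = crawl_site("https://example.com/", lambda u: fake_site.get(u, []))
print(sorted(found))
```

In this notebook, the equivalent call would be something like `crawl_site(base_url, crawl_page)`, with `crawl_page` playing the role of `get_links`.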

## Step 5: Store the links in a DataFrame

In [None]:
import pandas as pd
links = pd.DataFrame(sorted(visited_urls), columns=['links'])  # sort the set so the row order is stable
links
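Once the links are in a DataFrame, they can be exported for later use. The sketch below is one way to do that, using a small made-up set of URLs (`demo_urls`) rather than the real crawl results, and writing the CSV to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Made-up URLs standing in for the crawl results.
demo_urls = {
    "https://example.com/b",
    "https://example.com/",
    "https://example.com/a",
}

# Sorting first gives the rows a predictable order.
demo_df = pd.DataFrame(sorted(demo_urls), columns=["links"])

# Write CSV text to a buffer; passing a file name instead would save to disk.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)

print(buffer.getvalue())
```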

## **The end.**