# Required Codio activity: Web crawling in action
---

## Overview

In this activity, you will perform **web crawling using the Breadth-First Search (BFS) algorithm**. You'll simulate crawling through a simplified HTML structure and extract links in a manner similar to how search engines index web pages.

This task is designed to:
- Strengthen your Python coding skills.
- Reinforce your understanding of BFS.
- Help you gain hands-on experience with HTML parsing using the **Beautiful Soup** library.

As you proceed, the complexity of the tasks will increase. We recommend you run and test your code in each step before moving to the next one.

> **Important:** All functions you write must match the expected signatures and return formats exactly, as automated tests will be used to assess your submission.

---

## Learning Outcome

By the end of this activity, you should be able to:

- Implement the **Breadth-First Search** algorithm to crawl and extract information from an HTML page using the **Beautiful Soup** library.




## Index:

- [Question 1](#Question-1)
- [Question 2](#Question-2)
- [Question 3](#Question-3)
- [Question 4](#Question-4)

## Setup: HTML Excerpt for Parsing

Run the cell below to initialize `html_excerpt`, which contains an excerpt of HTML taken from [All Great Quotes – Literary Quotes](https://www.allgreatquotes.com/literary_quotes.shtml). You will use this throughout the activity to simulate a basic web crawler.

In [12]:
# Define html_excerpt for this activity

html_excerpt = ("""<!DOCTYPE html>
<html>
<head>
<title>Literary Quotes</title>


  AUTHORS by last name: <a href="/authors-a/">A</a>&nbsp <a href="/">B</a>&nbsp 

  <a href="/authors-c/">C</a>&nbsp <a href="/authors-d/">D</a>&nbsp <a href="/authors-e/">E</a>&nbsp 

  <a href="/authors-f/">F</a>&nbsp <a href="/authors-g/">G</a>&nbsp <a href="/authors-h/">H</a>&nbsp 

  <a href="/authors-i/">I</a>&nbsp <a href="/authors-j/">J</a>&nbsp <a href="/authors-k/">K</a>&nbsp 

  <a href="/authors-l/">L</a>&nbsp <a href="/authors-m/">M</a>&nbsp <a href="/authors-n/">N</a>&nbsp 

  <a href="/authors-o/">O</a>&nbsp <a href="/authors-p/">P</a>&nbsp <a href="/authors-q/">Q</a>&nbsp 

  <a href="/authors-r/">R</a>&nbsp <a href="/authors-s/">S</a>&nbsp <a href="/authors-t/">T</a>&nbsp 

  <a href="/authors-u/">U</a>&nbsp <a href="/authors-v/">V</a>&nbsp <a href="/authors-w/">W</a>&nbsp 

  <a href="/authors-x/">X</a>&nbsp <a href="/authors-y/">Y</a>&nbsp <a href="/authors-z/">Z</a><br>

  <br>

  <a href="/topics/motivational-quotes/">Motivational</a> - <a href="/topics/love-quotes/">Love</a> - <a href="/topics/funny-quotes/">Funny</a> 

  - <a href="/topics/friendship-quotes/">Friendship</a> - <a href="/topics/life-quotes/">Life</a> 

  - <a href="/topics/family/">Family</a> - <a href="/quote-authors/">Authors</a> - <a href="/quote-topics/">Topics</a><br>

  <br>
</html>"""
               )

###### [Back to top](#Index:) 
---

### Question 1: Extracting the Next URL from HTML

In this task, you will define a function `get_next_url(page)` that extracts the **next hyperlink** (i.e., the string between the quotes of the first `<a href="...">` tag) from the given HTML page string.

---

#### Instructions:

1. Complete the following function:

```python
def get_next_url(page):
    start_link = page.find(None)   # Find the starting point of the first anchor tag
    start_quote = page.find('"', None)  # Find the first quote after the anchor
    end_quote = page.find('"', None)  # Find the closing quote
    url = page[start_quote + 1: end_quote]  # Extract the URL between quotes
    return url, end_quote
```

2. What the function does:
    - Finds the index of the first `<a href=` in the string and stores it in `start_link`.
    - Finds the position of the first double quote (`"`) after `start_link`, which marks the beginning of the URL.
    - Finds the next double quote (`"`) after `start_quote`, which marks the end of the URL.
    - Extracts and returns the URL string between the quotes.
    - Also returns `end_quote`, the index of the closing quote, to track where parsing should continue.


3. **Test your function** by calling it on `html_excerpt` (initialized earlier):

```python
next_url, end_quote = get_next_url(html_excerpt)
print(next_url)
print(end_quote)
```

---

#### Expected Output:
```
/authors-a/
107
```

---

**Hint**: Review the documentation for the [`str.find()` method](https://www.programiz.com/python-programming/methods/string/find) in Python if you're unsure how it works.
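
As a quick illustration of the hint, the sketch below shows the two `str.find()` behaviors the function relies on: the optional start index, and the `-1` result when the substring is absent. The `sample` string is a made-up example and not part of the activity.

```python
# Hypothetical one-link snippet used only for illustration.
sample = '<p>See <a href="/about/">About</a></p>'

start_link = sample.find('<a href=')           # index where the anchor tag begins
start_quote = sample.find('"', start_link)     # first quote at or after the tag
end_quote = sample.find('"', start_quote + 1)  # next quote after the opening one

print(start_link, start_quote, end_quote)      # 7 15 23
print(sample[start_quote + 1:end_quote])       # /about/
print(sample.find('<img'))                     # -1: substring not present
```

Because a miss yields `-1` instead of raising an exception, code that passes the result on as a start index keeps running. It is worth checking for `-1` explicitly whenever a page might contain no links.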

---


In [13]:
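def get_next_url(page):
    start_link = page.find('<a href=')           # Start of the first anchor tag
    start_quote = page.find('"', start_link)     # Opening quote of the URL
    end_quote = page.find('"', start_quote + 1)  # Closing quote of the URL
    url = page[start_quote + 1:end_quote]        # Text between the two quotes
    return url, end_quote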
    
    
next_url, end_quote = get_next_url(html_excerpt)
print(next_url)
print(end_quote)

/authors-a/
107


###### [Back to top](#Index:) 

---

### Question 2: Extracting All URLs

In this task, you will complete a function `get_all_urls(page)` that extracts **all hyperlinks** (i.e., values from `<a href="...">`) from a given HTML string. You will use the `get_next_url()` function you implemented in **Question 1**.

---

#### Instructions

- Complete the following function definition using the template and comments as guidance:
```python
def get_all_urls(page):
    url_list = []
    while True:
        url, end_quote = None  # Call get_next_url here
        if url:
            url_list.append(None)  # Add the extracted url to the list
            page = page[None:]  # Slice the page string to continue parsing
        else:
            break
    return None  # Return the final list of URLs
```

- This function takes a string of HTML content and returns a list of all URLs found.
- The loop should continue until no more URLs are found by `get_next_url`.
- Use the `end_quote` index to slice the remaining string after each found URL.
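
To see why slicing at `end_quote` makes progress, the sketch below traces the loop on a made-up two-link string (`tiny_page` is hypothetical). It repeats a compact version of `get_next_url` so that it runs on its own, and it prints what is left of the page after each slice.

```python
def get_next_url(page):
    # Same approach as Question 1: locate the tag, then the two quotes around the URL.
    start_link = page.find('<a href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote

# Hypothetical input for illustration only.
tiny_page = '<a href="/one/">1</a> <a href="/two/">2</a>'

found = []
remaining = tiny_page
while True:
    url, end_quote = get_next_url(remaining)
    if not url:
        break
    found.append(url)
    remaining = remaining[end_quote:]   # keep everything from the closing quote on
    print(repr(url), '->', repr(remaining))

print(found)  # ['/one/', '/two/']
```

On the final pass no anchor tag is left. For this input the extracted slice is empty, so the loop stops.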

---

#### Test Your Function

Call your function using the provided `html_excerpt` string:

```python
url_list = get_all_urls(html_excerpt)
print(url_list)
```

---

#### Expected Output

```python
['/authors-a/', '/', '/authors-c/', '/authors-d/', '/authors-e/', '/authors-f/', '/authors-g/', '/authors-h/', '/authors-i/', '/authors-j/', '/authors-k/', '/authors-l/', '/authors-m/', '/authors-n/', '/authors-o/', '/authors-p/', '/authors-q/', '/authors-r/', '/authors-s/', '/authors-t/', '/authors-u/', '/authors-v/', '/authors-w/', '/authors-x/', '/authors-y/', '/authors-z/', '/topics/motivational-quotes/', '/topics/love-quotes/', '/topics/funny-quotes/', '/topics/friendship-quotes/', '/topics/life-quotes/', '/topics/family/', '/quote-authors/', '/quote-topics/']
```

---

**Hint**: The `get_next_url()` function should return a tuple with the next URL and the position of the end quote. Use both values to process the next segment of HTML in the loop.
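
String searching depends on every link being written exactly as `<a href="...">`. As a cross-check, the sketch below collects `href` values with Python's standard-library `html.parser` module. This is a swapped-in, dependency-free stand-in for the Beautiful Soup parsing mentioned in the overview, not the method the activity grades. `LinkCollector`, `collect_links`, and `sample_html` are hypothetical names.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

# Hypothetical sample: attribute order, letter case, and quoting differ between the links.
sample_html = '<a class="nav" href="/authors-a/">A</a> <A HREF=\'/topics/family/\'>Family</A>'
print(collect_links(sample_html))  # ['/authors-a/', '/topics/family/']
```

A search for the literal text `<a href=` would miss both links in this sample, since neither tag starts with exactly that text.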

---


In [14]:
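def get_all_urls(page):
    url_list = []
    while True:
        url, end_quote = get_next_url(page)  # Next URL and the index of its closing quote
        if url:
            url_list.append(url)
            page = page[end_quote:]          # Continue from the closing quote onward
        else:
            break                            # Empty URL: no more links to extract
    return url_list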
    
all_url_list = get_all_urls(html_excerpt)
print(all_url_list)

['/authors-a/', '/', '/authors-c/', '/authors-d/', '/authors-e/', '/authors-f/', '/authors-g/', '/authors-h/', '/authors-i/', '/authors-j/', '/authors-k/', '/authors-l/', '/authors-m/', '/authors-n/', '/authors-o/', '/authors-p/', '/authors-q/', '/authors-r/', '/authors-s/', '/authors-t/', '/authors-u/', '/authors-v/', '/authors-w/', '/authors-x/', '/authors-y/', '/authors-z/', '/topics/motivational-quotes/', '/topics/love-quotes/', '/topics/funny-quotes/', '/topics/friendship-quotes/', '/topics/life-quotes/', '/topics/family/', '/quote-authors/', '/quote-topics/']


###### [Back to top](#Index:) 
---

### Question 3: Getting All Links from a Live Web Page

In this task, you will implement a function `get_children(url)` that fetches a live web page using `requests`, extracts its HTML, and returns all the hyperlinks it contains.

---

#### Instructions

- Complete the following function definition using the template and comments as guidance:
```python
def get_children(url):
    try:
        page_source = None  # Use requests.get(url).text to fetch the page content
    except Exception:
        page_source = ''  # Fallback in case of an error (e.g., bad URL or connection issue)
    url_list = None  # Call get_all_urls(page_source) to extract URLs from the HTML
    return url_list
```
- This function should:
  - Use the `requests` library to fetch the web page contents.
  - Use the `get_all_urls` function (from Question 2) to extract all the hyperlinks from the HTML.
  - Return the list of hyperlinks.

---

#### Test Your Function

Use the URL below to test the function:

```python
url = "https://www.allgreatquotes.com/literary_quotes.shtml"
children = get_children(url)
print(children)
```

---

**Hint**: Don't forget to `import requests` if you have not already done so. Make sure your earlier `get_all_urls` function works correctly before testing this one.
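
Links on a page are often site-relative paths (the sample output in this activity lists entries such as `/quotes/literary-1.shtml`), while `requests.get` needs an absolute URL with a scheme. If you later want to fetch such children, one option is to resolve them against the page they came from with the standard library's `urllib.parse.urljoin`. This is a sketch only: `absolutize` is a hypothetical helper and is not part of the graded function signatures.

```python
from urllib.parse import urljoin

def absolutize(base_url, links):
    """Resolve each link against base_url; absolute links pass through unchanged."""
    return [urljoin(base_url, link) for link in links]

base = 'https://www.allgreatquotes.com/literary_quotes.shtml'
resolved = absolutize(base, ['/quotes/literary-1.shtml', 'https://example.com/x'])
print(resolved)
# ['https://www.allgreatquotes.com/quotes/literary-1.shtml', 'https://example.com/x']
```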

---

In [15]:
import requests

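def get_children(url):
    try:
        # Fetch the page; the 1-second timeout avoids hanging on a slow server
        page_source = requests.get(url, timeout=1).text
    except Exception:
        page_source = ''  # Treat any fetch failure as an empty page
    url_list = get_all_urls(page_source)
    return url_list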
    
agq_literaryQuotes = 'https://www.allgreatquotes.com/literary_quotes.shtml'
child_list=get_children(agq_literaryQuotes)
print(child_list)

['/quotes/literary-1.shtml', '/quotes/literary-1.shtml', '/quotes/literary-2.shtml', '/quotes/literary-2.shtml', '/quotes/literary-3.shtml', '/quotes/literary-3.shtml', '/quotes/literary-4.shtml', '/quotes/literary-4.shtml', '/quotes/literary-5.shtml', '/quotes/literary-5.shtml', '/quotes/literary-6.shtml', '/quotes/literary-6.shtml', '/quotes/literary-7.shtml', '/quotes/literary-7.shtml', '/quotes/literary-8.shtml', '/quotes/literary-8.shtml', '/quotes/literary-9.shtml', '/quotes/literary-9.shtml', '/quotes/literary-10.shtml', '/quotes/literary-10.shtml', '/quotes/literary-11.shtml', '/quotes/literary-11.shtml', '/quotes/literary-12.shtml', '/quotes/literary-12.shtml', '/quotes/literary-13.shtml', '/quotes/literary-13.shtml', '/quotes/literary-14.shtml', '/quotes/literary-14.shtml', '/quotes/literary-15.shtml', '/quotes/literary-15.shtml', '/quotes/literary-16.shtml', '/quotes/literary-16.shtml', '/quotes/literary-17.shtml', '/quotes/literary-17.shtml', '/quotes/literary-18.shtml', '/

###### [Back to top](#Index:) 

---

### Question 4: Web Crawler using Breadth-First Search (BFS)

Now it’s time to put it all together by implementing the BFS algorithm to crawl a website and collect hyperlinks.

---

#### Instructions

- Complete the function definition for `crawl_web(start_url, max_depth)` using the template below:
```python
def crawl_web(start_url, max_depth):
    """
    Returns a dictionary of all visited URLs and their children
    using a breadth-first search strategy.
    """
    crawled = []
    hyperlinksDict = {}
    to_crawl = [[start_url]]  # A queue of paths (BFS)

    while to_crawl:
        path = to_crawl.pop(0)
        if len(path) > max_depth:
            break
        url = path[-1]

        if url not in hyperlinksDict:
            children = None  # Call get_children(url)
            hyperlinksDict[None] = None  # Store children for the current URL
            to_crawl.extend([path + [child] for child in children])  # Add new paths to queue

    return None  # Return the dictionary of hyperlinks
```
- The `start_url` parameter is the entry point for the crawler.
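- The `max_depth` parameter limits how deep the crawler goes: the start URL counts as depth 1, and the loop stops once a path in the queue is longer than `max_depth`.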
- You should store each URL and the list of its child hyperlinks in a dictionary (`hyperlinksDict`).
- You should also prevent revisiting URLs.
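
The queue-of-paths idea can be tried without any network access. The sketch below runs the same loop over a small in-memory link graph. `fake_web`, `get_children_stub`, and `crawl_graph` are hypothetical stand-ins for the real site, `get_children`, and `crawl_web`.

```python
# Hypothetical link graph: each key is a page, each value lists the pages it links to.
fake_web = {
    'A': ['B', 'C'],
    'B': ['D', 'A'],
    'C': ['D'],
    'D': [],
}

def get_children_stub(url):
    # Stand-in for get_children: look the page up instead of downloading it.
    return fake_web.get(url, [])

def crawl_graph(start_url, max_depth):
    links = {}
    to_crawl = [[start_url]]            # queue of paths, oldest first
    while to_crawl:
        path = to_crawl.pop(0)          # take from the front: breadth-first order
        if len(path) > max_depth:       # queued paths never get shorter, so stop here
            break
        url = path[-1]
        if url not in links:            # skip pages that were already expanded
            children = get_children_stub(url)
            links[url] = children
            to_crawl.extend(path + [child] for child in children)
    return links

print(crawl_graph('A', 1))  # {'A': ['B', 'C']}
print(crawl_graph('A', 2))  # {'A': ['B', 'C'], 'B': ['D', 'A'], 'C': ['D']}
```

With `max_depth=2`, the start page and its direct links are expanded, while `D` appears only as a child. With `max_depth=3`, `D` is expanded as well, and the path `['A', 'B', 'A']` is skipped because `A` is already a key.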

---

#### Test Your Function

Run the function with the following parameters:

```python
result = crawl_web('https://www.allgreatquotes.com/literary_quotes.shtml', 2)
print(result)
```

---

**Hints**:
- Use the previously defined `get_children()` to get hyperlinks from a URL.
- Make sure to replace `None` placeholders with actual logic.
- Avoid revisiting pages by keeping track of visited URLs in the dictionary keys.
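
Checking the dictionary keys prevents a page from being expanded twice, but a single page's child list can still repeat a link when the page points to the same target more than once. If that matters for your analysis, one order-preserving way to drop repeats is `dict.fromkeys`. This is an optional sketch: `unique_in_order` is a hypothetical helper, and the automated tests may expect the unfiltered list.

```python
def unique_in_order(links):
    """Return links with repeats removed, keeping first-seen order."""
    return list(dict.fromkeys(links))

children = ['/quotes/literary-1.shtml', '/quotes/literary-1.shtml',
            '/quotes/literary-2.shtml', '/quotes/literary-2.shtml']
print(unique_in_order(children))
# ['/quotes/literary-1.shtml', '/quotes/literary-2.shtml']
```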

---


In [16]:
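def crawl_web(start_url, max_depth):
    crawled = []                  # Unused; kept to match the template
    hyperlinksDict = {}           # Maps each crawled URL to its list of child links
    to_crawl = [[start_url]]      # Queue of paths, each ending in the URL to visit
    while to_crawl:
        path = to_crawl.pop(0)    # Oldest path first gives breadth-first order
        if len(path) > max_depth:
            break                 # Every remaining path is at least this long
        url = path[-1]
        if url not in hyperlinksDict:  # Skip URLs that were already crawled
            children = get_children(url)
            hyperlinksDict[url] = children
            to_crawl.extend([path + [child] for child in children])
    return hyperlinksDict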
    
agq_websites = crawl_web(start_url='https://www.allgreatquotes.com/literary_quotes.shtml', max_depth=2)
print(agq_websites)

{'https://www.allgreatquotes.com/literary_quotes.shtml': ['/quotes/literary-1.shtml', '/quotes/literary-1.shtml', '/quotes/literary-2.shtml', '/quotes/literary-2.shtml', '/quotes/literary-3.shtml', '/quotes/literary-3.shtml', '/quotes/literary-4.shtml', '/quotes/literary-4.shtml', '/quotes/literary-5.shtml', '/quotes/literary-5.shtml', '/quotes/literary-6.shtml', '/quotes/literary-6.shtml', '/quotes/literary-7.shtml', '/quotes/literary-7.shtml', '/quotes/literary-8.shtml', '/quotes/literary-8.shtml', '/quotes/literary-9.shtml', '/quotes/literary-9.shtml', '/quotes/literary-10.shtml', '/quotes/literary-10.shtml', '/quotes/literary-11.shtml', '/quotes/literary-11.shtml', '/quotes/literary-12.shtml', '/quotes/literary-12.shtml', '/quotes/literary-13.shtml', '/quotes/literary-13.shtml', '/quotes/literary-14.shtml', '/quotes/literary-14.shtml', '/quotes/literary-15.shtml', '/quotes/literary-15.shtml', '/quotes/literary-16.shtml', '/quotes/literary-16.shtml', '/quotes/literary-17.shtml', '/q