
# Required end-of-module assignment: Web crawling and scraping
---

## Overview

In this assignment, you will apply the skills learned in this module to extract data from a real-world website using Python.

This exercise is designed to strengthen your understanding of:
- Writing Python code for data extraction.
- Parsing HTML using BeautifulSoup.
- Traversing web content using Breadth-First Search (BFS) strategies (as applicable).

As you progress, the questions will gradually increase in complexity. Be sure to approach the tasks with a programmer's mindset: break problems into smaller parts, write clean code, and test as you go.

> **Important:** Run your code in each cell before submitting. This will help you catch and fix errors early.

---

### Learning Outcomes Addressed

- Implement Breadth-First Search (BFS) for basic web crawling.
- Use BeautifulSoup to parse and extract structured data from HTML content.

---



## Index:

- [Question 1](#Question-1)
- [Question 2](#Question-2)
- [Question 3](#Question-3)
- [Question 4](#Question-4)
- [Question 5](#Question-5)
- [Question 6](#Question-6)
- [Question 7](#Question-7)
- [Question 8](#Question-8)
- [Question 9](#Question-9)
- [Question 10](#Question-10)

### Real-World Scenario

Imagine your manager has asked you, the company’s data analyst, to extract and format key information from the **FTSE 100 Index** Wikipedia page. You’ll retrieve relevant data from each table on the page and structure it for analysis.

**Target Webpage**: [FTSE 100 Index - Wikipedia](https://en.wikipedia.org/wiki/FTSE_100_Index)

Use this page as your data source. View the page and inspect its HTML structure to understand how data is organized in tables.

In [None]:
import requests
from bs4 import BeautifulSoup
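# Rebind the name `requests` to a Session so that later `requests.get(...)` calls
# reuse the default headers set below
requests = requests.Session()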

# Set default headers for the session
requests.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
})

###### [Back to top](#Index:) 

### Question 1: Retrieve the HTML Code

In this task, you'll retrieve the HTML content of the FTSE 100 Index Wikipedia page and parse it using BeautifulSoup.

#### Instructions

1. Import the `requests` library and `BeautifulSoup` from `bs4`.
2. Send an HTTP GET request to the following URL:
```python
    url_footsie = "https://en.wikipedia.org/wiki/FTSE_100_Index"
```
3. Extract the text of the HTML page from the response.
4. Use `BeautifulSoup` to parse the HTML text.
5. Assign the resulting BeautifulSoup object to a variable named `soup`.
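
As a rough illustration of the parsing step only, the sketch below feeds an inline HTML string to BeautifulSoup so that it runs offline. The names `sample_html` and `sample_soup` are made up for this example and are not part of the assignment; with a real response you would pass the response text in the same way.

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for the text of an HTTP response
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# "html.parser" is the parser bundled with the Python standard library
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title.text)  # Demo page
print(sample_soup.p.text)      # Hello
```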




In [None]:

url_footsie = "https://en.wikipedia.org/wiki/FTSE_100_Index"
soup = None
# YOUR CODE HERE
raise NotImplementedError()
# Answer test
print(soup)

###### [Back to top](#Index:) 

### Question 2: Locate the `<head>` Tag

Before scraping specific data like tables, it's important to understand the structure of the HTML. In this task, you'll examine the page layout and retrieve a specific section using BeautifulSoup.

#### Instructions

1. Open the FTSE Wikipedia page in a browser and **inspect the page source** to understand how the HTML is structured — especially how tables are defined.
2. Use the `.find()` method from BeautifulSoup to retrieve the `head` section (`<head>`) of the HTML.
3. Assign the result to a variable named `ans2`.


**Hint:** Use `soup.find()`.
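
To show how `.find()` behaves, here is a small sketch on a toy document (the HTML and the names `toy_soup`, `first_heading`, and `missing` are assumptions for illustration). It looks up a different tag than the one this question asks for.

```python
from bs4 import BeautifulSoup

toy_html = "<html><head><title>Demo</title></head><body><h1>First</h1><h1>Second</h1></body></html>"
toy_soup = BeautifulSoup(toy_html, "html.parser")

# .find() returns the first matching Tag, or None when nothing matches
first_heading = toy_soup.find("h1")
missing = toy_soup.find("table")

print(first_heading)       # <h1>First</h1>
print(first_heading.name)  # h1
print(missing)             # None
```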

In [None]:
ans2 = None

# YOUR CODE HERE
raise NotImplementedError()
# Answer test
print(ans2)

###### [Back to top](#Index:) 

### Question 3: Locate and Count Tables on the Page

Your boss has asked you to retrieve specific information from the **FTSE (QFA) Contract Specifications** table. To begin, you need to identify and count all tables on the page to locate the one containing the **"Contract Size: 10 GBP X Index Points"** entry.

#### Reference
You are looking for the table that contains the following value:

> **Contract Size: 10 GBP X Index Points**

Refer to the image:  
<img src="images/hk_qc_QFA_image_imp_pcba.png" alt="Contracts Table" />


Before you can do that, you need to retrieve all the tables from the page.

---

#### Instructions

1. Use the `.find_all()` method from BeautifulSoup to extract **all `<table>` elements** from the parsed HTML.
2. Assign the list of tables to a variable named `ans3a`.
3. Count the number of tables and assign the result to a variable named `ans3b`.
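
The same pattern can be tried on a toy document first. In this sketch the HTML and the names `toy_tables` and `toy_count` are illustrative assumptions; the real page will contain a different number of tables.

```python
from bs4 import BeautifulSoup

toy_html = """
<table><tr><td>A</td></tr></table>
<p>Some text</p>
<table><tr><td>B</td></tr></table>
"""
toy_soup = BeautifulSoup(toy_html, "html.parser")

# .find_all() returns every match in document order (a list-like ResultSet)
toy_tables = toy_soup.find_all("table")
toy_count = len(toy_tables)

print(toy_count)  # 2
```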


In [None]:
ans3a = None
ans3b = None
# YOUR CODE HERE
raise NotImplementedError()
# Answer test
print(ans3a)
print(f'length of ans3a: {ans3b}')

###### [Back to top](#Index:) 

### Question 4: Select the Desired Table

Now that you’ve retrieved all the tables from the page, your task is to identify the one that contains the following entry:

> **Contract Size: 10 GBP X Index Points**

This information appears in the **FTSE (QFA) Contract Specifications** table.

<img src="images/hk_qc_QFA_image_imp_pcba.png" alt="Contracts Table" />

---

#### Instructions

1. Manually inspect the tables in `ans3a` using indexing and `.prettify()` or `print(ans3a[i].text)` to identify which table contains the desired content.
2. Once you’ve identified the correct index (e.g., `ans3a[5]`), assign that specific table to a new variable called `ans4`.

```python
# Replace the index below with the correct one after inspection
ans4 = ans3a[??]
```
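
One possible way to speed up the manual inspection is to print a short text preview of each table next to its index. The sketch below does this on two toy tables; the HTML, the 60-character cut-off, and the names `toy_tables` and `previews` are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

toy_html = """
<table><tr><th>Company</th><td>Example plc</td></tr></table>
<table><tr><th>Unit</th><td>10 widgets</td></tr></table>
"""
toy_tables = BeautifulSoup(toy_html, "html.parser").find_all("table")

# Build a short, whitespace-normalised preview of each table's text
previews = [table.get_text(" ", strip=True)[:60] for table in toy_tables]

# Show each preview next to the index you would use to select the table
for i, preview in enumerate(previews):
    print(i, preview)
```

Scanning the previews for the text you are after tells you which index to use.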


In [None]:

ans4 = None
# YOUR CODE HERE
raise NotImplementedError()
# Answer test
print(ans4)

###### [Back to top](#Index:) 

### Question 5: Extract All Rows from the FTSE (QFA) Contract Specifications Table

Now that you've identified the table containing the contract information, your next task is to extract all the rows within this table.

---

#### Instructions

1. Use the `.find_all('tr')` method on `ans4` (the table identified in Question 4) to get all the table rows.
2. Assign the result (a list of `<tr>` tags) to a variable named `ans5a`.
3. Count the number of rows and assign it to a variable named `ans5b`.


**Example:**
```python
ans4.find_all('tr')
```

In [None]:
ans5a = None
ans5b = None
# YOUR CODE HERE
raise NotImplementedError()
# Answer test
print(ans5a)
print(f'Length of ans5a: {ans5b}')

###### [Back to top](#Index:) 

### Question 6: Extract the Contract Size Value

Now that you have extracted all the rows from the **FTSE (QFA) Contract Specifications** table, your task is to find the specific row for **Contract Size**.

The goal is to find the row containing the text `Contract Size:`; you will extract its value in Question 7.

---

#### Instructions

1. Inspect the rows in `ans5a` to locate the index of the row that contains the text **Contract Size**.
2. Once you identify the correct row, assign it to a variable named `ans6`.

```python
ans6 = ans5a[???]
```
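
If you prefer not to read every row by eye, a sketch like the following lists the indexes of rows whose text contains a given label. It uses a toy table; the HTML, the label `"Unit"`, and the names `toy_rows` and `matches` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

toy_html = """
<table>
  <tr><th>Name</th><td>Demo contract</td></tr>
  <tr><th>Unit:</th><td>10 widgets</td></tr>
  <tr><th>Currency</th><td>GBP</td></tr>
</table>
"""
toy_rows = BeautifulSoup(toy_html, "html.parser").find_all("tr")

# Collect the indexes of rows whose text contains the label of interest
matches = [i for i, row in enumerate(toy_rows) if "Unit" in row.get_text()]

print(matches)  # [1]
```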

In [None]:
ans6 = None
# YOUR CODE HERE
raise NotImplementedError()
# Answer test
print(ans6)

###### [Back to top](#Index:) 

### Question 7: Retrieve the Contract Size Value from the Table Row

Now that you've identified the correct row in the table, the final step is to extract the **Contract Size** value. This time, you need to look inside the **`<td>`** tags of the row containing the contract size information.

---

#### Instructions

1. Use BeautifulSoup’s `.find_all('td')` method to find all **`<td>` tags** in the row containing the contract size information (stored in `ans6`).
2. Extract the specific value/text (which should be in the `<td>` tag related to **Contract Size**) and assign it to `ans7`.
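
The difference between a `<td>` tag and the text inside it can be seen in this small sketch on a toy row. The HTML and the names `toy_cells`, `value_cell`, and `value_text` are assumptions for illustration, and the position of the value cell in the real row may differ.

```python
from bs4 import BeautifulSoup

toy_row_html = "<table><tr><td>Unit:</td><td> 10 widgets </td></tr></table>"
toy_row = BeautifulSoup(toy_row_html, "html.parser").find("tr")

toy_cells = toy_row.find_all("td")

# Indexing the result gives a Tag object, markup included
value_cell = toy_cells[1]

# .get_text(strip=True) returns a plain string without surrounding whitespace
value_text = value_cell.get_text(strip=True)

print(value_cell)  # <td> 10 widgets </td>
print(value_text)  # 10 widgets
```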


In [None]:
ans7 = None
ans7tds = ans6.find_all('td')
# YOUR CODE HERE
raise NotImplementedError()
# Answer test
print(ans7.text)


###### [Back to top](#Index:) 

### Question 8: Implement Function to Retrieve All Hyperlinks

For this part of the assignment, you need to retrieve all the hyperlinks on a webpage using BeautifulSoup. Your boss has asked you to implement the function **`get_all_urls(page)`**, which will return a list of all the hyperlinks found on the page.

In this task, you will:
1. Use **BeautifulSoup** to parse the HTML content of the page.
2. Extract and return the visible text of every hyperlink found within the page.

---

#### Instructions

1. Implement the function **`get_all_urls(page)`** using BeautifulSoup to parse through the HTML content and find all the hyperlinks.
2. Extract and return the **text** (`.text`) content of all hyperlinks (i.e., the visible clickable text between `<a>` and `</a>` tags).

Here is the function signature and code template for you to complete:
```python
def get_all_urls(page):
    hyperlinksList = []
    # Parse the page with BeautifulSoup

    # Find all the <a> tags (hyperlinks)

    # Loop through each link and add the visible text (.text) to hyperlinksList

    return hyperlinksList
```

3. Use the provided `test_excerpt` to test your function once implemented.

**Example Expected Output:**
```
['Archive', 'What If?', 'Blag', 'Store', 'About', '']
```
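
The empty string at the end of the expected output comes from a link that wraps only an image and therefore has no visible text. The sketch below illustrates this on a toy snippet and contrasts a link's text with its `href` target; the HTML and the names `link_texts` and `link_targets` are assumptions for this example.

```python
from bs4 import BeautifulSoup

toy_html = (
    '<p><a href="/docs">Docs</a> and '
    '<a href="https://example.com/"><img src="logo.png" alt="logo"/></a></p>'
)
toy_soup = BeautifulSoup(toy_html, "html.parser")

link_texts = []
link_targets = []
for link in toy_soup.find_all("a"):
    link_texts.append(link.text)           # visible text between <a> and </a>
    link_targets.append(link.get("href"))  # link target, or None if absent

print(link_texts)    # ['Docs', '']
print(link_targets)  # ['/docs', 'https://example.com/']
```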

In [None]:
test_excerpt = """<!DOCTYPE html>
                <html>
                <body>
                <div id="topContainer">
                <div id="topLeft">
                <ul>
                <li><a href="/archive">Archive</a></li>
                <li><a href="http://what-if.xkcd.com">What If?</a></li>
                <li><a href="http://blag.xkcd.com">Blag</a></li>
                <li><a href="http://store.xkcd.com/">Store</a></li>
                <li><a rel="author" href="/about">About</a></li>
                </ul>
                </div>
                <div id="topRight">
                <div id="masthead">
                <span><a href="/"><img src="/s/0b7742.png" alt="xkcd.com logo" height="83" width="185"/></a></span>
                <span id="slogan">A webcomic of romance,<br/> sarcasm, math, and language.</span>"""

In [None]:
import requests
from bs4 import BeautifulSoup 

def get_all_urls(page):
    # YOUR CODE HERE
    raise NotImplementedError()
# Answer test
get_all_urls(test_excerpt)

###### [Back to top](#Index:) 

### Question 9: Implement the `get_children()` Function

In this task, you are required to implement the **`get_children(url)`** function, which will:
1. Take a URL containing a web page as input.
2. Retrieve the HTML content from the URL using the **`requests`** library.
3. Call the previously implemented **`get_all_urls(page)`** function to extract and return all hyperlinks from the webpage.

---

#### Instructions

1. Implement the function **`get_children(url)`**, which will:
   - Fetch the HTML content from the provided URL using the **`requests`** library.
   - Use the **`get_all_urls(page)`** function to extract all hyperlinks from the HTML content.

2. Use **`try` and `except`** to handle any exceptions that might occur when fetching the page.
3. The function should return a list of URLs found on the web page.

Here is the provided template for you to complete:

```python
import requests

def get_children(url):
    try:
        # Fetch the HTML content from the URL
        page_source = None
    except Exception:
        # In case of an error, set the page source to an empty string
        page_source = ''
    
    # Call get_all_urls to extract URLs from the page source
    url_list = None
    
    return url_list
```

4. Once you’ve implemented the function, test it using:
```python
footsie_python = "https://en.wikipedia.org/wiki/FTSE_100_Index"
child_list=get_children(footsie_python)
print(child_list)
```
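
As a sketch of the fetch-with-fallback idea on its own, the hypothetical helper `fetch_html` below returns an empty string when a request fails. It differs from the template in two ways that are assumptions of this example: it catches `requests.exceptions.RequestException` rather than the broader `Exception`, and it adds a timeout and a status check.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text for `url`, or an empty string if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
        return response.text
    except requests.exceptions.RequestException:
        return ""

# A URL without a scheme is rejected before any network traffic is sent,
# so this call exercises the fallback path offline
print(repr(fetch_html("not-a-valid-url")))  # ''
```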

In [None]:
child_list=[]

def get_children(url):
    # YOUR CODE HERE
    raise NotImplementedError()
# Answer test
footsie_python = "https://en.wikipedia.org/wiki/FTSE_100_Index"
child_list=get_children(footsie_python)
print(child_list)

###### [Back to top](#Index:) 

### Question 10: Implement the `crawl_web(start_url, max_depth)` Function

In this task, you are required to implement the **`crawl_web(start_url, max_depth)`** function, which crawls the web starting from a given URL and explores links breadth-first up to the specified maximum depth. You will use the previously implemented function **`get_children(url)`** to extract hyperlinks from each page.

---

#### Instructions

1. **Implement the `crawl_web(start_url, max_depth)` function**:
   - This function performs a web crawl starting from the given `start_url`.
   - It uses a **path-based approach**, maintaining a list of paths (URL sequences) to crawl.
   - It explores new URLs only if the depth of the path is less than or equal to `max_depth`.
   - For each visited page, it:
     - Uses `get_children(url)` to retrieve all the hyperlinks on that page.
     - Adds those child URLs to a graph dictionary where the key is the parent URL and the value is the list of children.
     - Avoids revisiting already crawled pages.

Here is the function structure:

```python
FTSE_websites = {}

def crawl_web(start_url, max_depth):
    # Initialize a list to keep track of already crawled URLs
    crawled = []

    # Initialize a dictionary to hold the structure of crawled pages
    # Keys will be URLs, values will be lists of URLs found on those pages
    graph = {}

    # Initialize the list of paths to crawl, starting with the initial URL
    # Each path is a list of URLs from the start to the current node
    to_crawl = [[start_url]]

    # Start the crawling loop
    while to_crawl:
        # Take the first path from the list (FIFO queue for BFS)
        path = None

        # If the path length exceeds max_depth, stop crawling: paths are queued
        # in order of length, so every remaining path is at least as long
        if ...:  # replace ... with the depth condition
            break

        # Get the current URL to crawl (the last URL in the path)
        url = path[-1]

        # If this URL hasn't been crawled yet
        if url not in crawled:
            # Record the URL as crawled so it is not visited again
            crawled.append(url)

            # Get all the children (linked URLs) of the current URL using get_children()
            children = None

            # Add the URL and its children to the graph
            graph[url] = None 

            # Add new paths to to_crawl by extending the current path with each child
            to_crawl.extend(...)

    # Return the final graph of crawled pages and their links
    return graph

FTSE_websites = crawl_web(start_url="https://en.wikipedia.org/wiki/FTSE_100_Index", max_depth=2)

# Test output
print(FTSE_websites)
```
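
To see the path-based breadth-first idea without any network access, the sketch below runs it on a small in-memory link graph. Everything here is an illustrative assumption: `TOY_LINKS`, `toy_children` (a stand-in for `get_children`), and `toy_crawl` are made-up names, and "depth" is taken to mean the number of URLs in a path.

```python
# Hypothetical link structure standing in for real pages
TOY_LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "D"],
    "D": ["E"],
    "E": [],
}

def toy_children(url):
    """Stand-in for get_children(): look links up in TOY_LINKS instead of fetching."""
    return TOY_LINKS.get(url, [])

def toy_crawl(start_url, max_depth):
    crawled = []              # URLs already expanded
    graph = {}                # parent URL -> list of child URLs
    to_crawl = [[start_url]]  # queue of paths, each ending at the URL to visit

    while to_crawl:
        path = to_crawl.pop(0)     # oldest path first gives breadth-first order
        if len(path) > max_depth:  # every later path is at least this long
            break
        url = path[-1]
        if url not in crawled:
            crawled.append(url)
            children = toy_children(url)
            graph[url] = children
            # Queue one new path per child, extending the current path
            to_crawl.extend(path + [child] for child in children)
    return graph

print(toy_crawl("A", max_depth=2))
# {'A': ['B', 'C'], 'B': ['D'], 'C': ['A', 'D']}
```

With `max_depth=2`, only paths of one or two URLs are expanded, so `D` appears as a child but is never visited itself.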

---

#### Output

- `FTSE_websites` will be a dictionary that maps each URL visited to a list of its direct hyperlinks (children).
- The crawler visits pages up to a maximum path length of `max_depth`, so exploration stays within the depth limit.

---

#### Example Structure

If your crawler visits:

- Depth 0: FTSE main page
- Depth 1: Pages linked directly from FTSE page
- Depth 2: Pages linked from those in depth 1

The start of your dictionary might look like:

```python
{
    'https://en.wikipedia.org/wiki/FTSE_100_Index': [
    ...
}
```

In [None]:
FTSE_websites={}

def crawl_web(start_url, max_depth):
    # YOUR CODE HERE
    raise NotImplementedError()
# Answer test
print(FTSE_websites)