# COGS 188 Spring 2024: Lab 1

**Due Date: April 15th, 11:59 PM**

In this programming assignment, we will explore how to navigate the web by finding a path from one webpage to another. This involves treating the web as a directed graph, where webpages are nodes and each hyperlink is a directed edge from the page that contains it to the page it points to.

We will use three search strategies:

1. **Breadth-First Search (BFS)**: A layer-by-layer traversal method.
2. **Depth-First Search (DFS)**: Follows one branch as deep as possible before backtracking to try the others.
3. **Bidirectional Search**: A simultaneous search from both the start and target nodes.

Additionally, we will learn how to use Python libraries like `requests`, `BeautifulSoup`, and `googlesearch-python` to interact with web content and perform searches.

## Submission Instructions

After you finish this assignment, export this Jupyter notebook as a **.py** file and upload the resulting Python script to Gradescope.

## Imports

### `requests`

The `requests` library is one of the most popular Python libraries for making HTTP requests. It simplifies the process of sending HTTP requests to web servers and handling responses. We use `requests` to fetch the content of web pages.

### `BeautifulSoup`

`BeautifulSoup` is a library for parsing HTML and XML documents. It creates a parse tree from page source code that can be used to extract data easily. We use `BeautifulSoup` to parse the HTML content of web pages fetched with `requests` and to extract hyperlinks.

### `collections.deque`

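The `deque` (double-ended queue) from the `collections` module is a list-like container that supports fast, constant-time appends and pops at both ends. By contrast, removing the first element of a regular list requires shifting every remaining element. This makes `deque` well suited to queues and breadth-first search implementations, where elements are frequently added at one end and removed from the other.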

This [guide](https://www.geeksforgeeks.org/deque-in-python/) provides some helpful information on how to work with double-ended queues.
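
As a quick illustration of the operations described above, here is a minimal sketch of a `deque` used as a queue and as a stack. The page names are placeholders chosen for this example only.

```python
from collections import deque

# Start with one page waiting to be processed (placeholder names).
frontier = deque(["page_a"])

# append() adds to the right end.
frontier.append("page_b")
frontier.append("page_c")

# popleft() removes from the left end: first in, first out (queue order).
oldest = frontier.popleft()

# pop() removes from the right end: last in, first out (stack order).
newest = frontier.pop()

print(oldest)          # page_a
print(newest)          # page_c
print(list(frontier))  # ['page_b']
```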

### `urllib.parse`

The `urllib.parse` module provides a standard interface for breaking Uniform Resource Locator (URL) strings into components (addressing scheme, network location, path, etc.), combining the components back into a URL string, and converting a “relative URL” to an absolute URL given a “base URL.” We use `urljoin` from this module to resolve relative URLs to absolute URLs, so that we always work with complete URLs, and `urlparse` to inspect a URL's scheme.
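
The short sketch below shows how `urljoin` and `urlparse` behave on a few kinds of link. The `href` values are made up for illustration and are not taken from any real page.

```python
from urllib.parse import urljoin, urlparse

base = "https://books.toscrape.com/index.html"

# A relative link is resolved against the directory of the base URL.
relative = urljoin(base, "catalogue/page-2.html")

# A link starting with "/" is resolved against the site root.
root_relative = urljoin(base, "/static/style.css")

# A link that is already absolute is returned unchanged.
absolute = urljoin(base, "https://example.com/about")

# Non-web schemes such as mailto: also pass through unchanged,
# which is why checking the scheme afterwards is useful.
mail = urljoin(base, "mailto:someone@example.com")

print(relative)               # https://books.toscrape.com/catalogue/page-2.html
print(root_relative)          # https://books.toscrape.com/static/style.css
print(absolute)               # https://example.com/about
print(urlparse(mail).scheme)  # mailto

parts = urlparse(relative)
print(parts.scheme, parts.netloc, parts.path)
# https books.toscrape.com /catalogue/page-2.html
```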

In [None]:
# If you haven't installed requests and beautifulsoup4, you can uncomment the following:

# !pip install requests beautifulsoup4

In [None]:
import requests
from bs4 import BeautifulSoup
from collections import deque
from urllib.parse import urljoin, urlparse

## Fetching Links from a Webpage

To navigate the web, we first need to understand how to extract hyperlinks from a webpage. We will use the `requests` library to fetch the webpage content and `BeautifulSoup` to parse this content and extract links.


In [None]:
def get_links(url):
    """Fetches links from the given URL."""
    try:
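        # A timeout keeps one unresponsive server from stalling the whole search;
        # a timeout raises a requests.RequestException, which is handled below.
        response = requests.get(url, timeout=10)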
        soup = BeautifulSoup(response.text, 'html.parser')
        links = set()

        for link in soup.find_all('a', href=True): # the <a> tag in HTML indicates a link
            absolute_link = urljoin(url, link['href'])
            if urlparse(absolute_link).scheme in ['http', 'https']:
                links.add(absolute_link)

        return links
    except requests.RequestException:
        return set()

Observe that the `get_links()` function returns a **set** of links leading from a given webpage, and each link is a **string**. If the request fails, it returns an empty set.
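
To see the same extraction logic without making a network request, the sketch below parses a small hand-written HTML string. It swaps in Python's built-in `html.parser` in place of `BeautifulSoup` so that it runs with the standard library alone. The `LinkCollector` class and the sample HTML are assumptions made for this illustration and are not part of the lab code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags (illustrative)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:  # an <a> tag without an href is not a link
            return
        absolute_link = urljoin(self.base_url, href)
        if urlparse(absolute_link).scheme in ("http", "https"):
            self.links.add(absolute_link)


# Hand-written sample page; none of these links come from a real site.
sample_html = """
<html><body>
  <a href="catalogue/page-2.html">Next</a>
  <a href="https://example.com/about">About</a>
  <a href="mailto:someone@example.com">Email</a>
  <a name="no-href">Anchor without a link</a>
</body></html>
"""

collector = LinkCollector("https://books.toscrape.com/index.html")
collector.feed(sample_html)
print(sorted(collector.links))
# ['https://books.toscrape.com/catalogue/page-2.html', 'https://example.com/about']
```

Only two of the four `<a>` tags survive: the `mailto:` link is dropped by the scheme check, and the tag without an `href` is skipped.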

In [None]:
get_links("https://cogsci.ucsd.edu/")

## Task 1: Breadth-First Search (BFS)

BFS is a fundamental algorithm for traversing or searching tree or graph data structures. It starts at a selected node and explores all of the neighbor nodes at the present depth prior to moving on to the nodes at the next depth level.
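
The layer-by-layer behavior is easier to see on a small graph held in memory. The sketch below runs BFS on a made-up link graph (`toy_web` and its page names are hypothetical) and records each page's parent so the path can be rebuilt at the end. This is one common way to track paths and is deliberately structured differently from the function you will complete.

```python
from collections import deque

# Hypothetical link graph: each key is a page, each value lists the pages it links to.
toy_web = {
    "home": ["about", "blog"],
    "about": ["team"],
    "blog": ["post", "home"],
    "team": ["contact"],
    "post": ["contact"],
    "contact": [],
}


def bfs_toy(graph, start, target):
    """Returns a path with the fewest links from start to target, or [] if none exists."""
    parent = {start: None}  # maps each discovered page to the page that led to it
    queue = deque([start])

    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parent:  # not discovered yet
                parent[neighbor] = node
                queue.append(neighbor)

    return []


print(bfs_toy(toy_web, "home", "contact"))  # ['home', 'about', 'team', 'contact']
print(bfs_toy(toy_web, "contact", "home"))  # []
```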

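Your task is to complete the function below to implement BFS to find the shortest path between two webpages. Because BFS visits pages in order of their distance (in links) from the start, the first path it finds to the target uses the fewest links.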

In [None]:
def bfs(start_url, target_url):
    """Performs BFS to find the shortest path from start_url to target_url."""
    # visited is the set of URLs that have been visited so far
    visited = {start_url}

    # Create a queue of tuples (URL, path), initialized with the starting URL
    queue = deque([(start_url, [start_url])])

    while queue:
        current_url, path = ... # Fill in this blank to remove an element from the left of the queue
        print(f"Visiting {current_url}")

        if current_url == target_url:
            ... # Fill in this blank to indicate what happens when the target URL is reached successfully

        for link in get_links(current_url):
            if link not in visited:
                # Fill in the blank below to add the link to the visited set
                ...

                queue.append((link, path + [link])) # This adds the link and the path to that link to the right end of the queue

    return []

## Running BFS

Let's run BFS using two example webpages to find a path of hyperlinks connecting them. For this, we define a function called `find_path_bfs` that runs the BFS function described above and prints out the resulting path.

In [None]:
def find_path_bfs(start_url, target_url):
    """Finds a path of links from start_url to target_url."""
    path = bfs(start_url, target_url)

    if path:
        print("Path found:")
        for url in path:
            print(url)
    else:
        print("No path found.")

**NOTE**: You can expect this cell to take at least 15 seconds to run.

In [None]:
start_url = 'https://books.toscrape.com/index.html'
target_url = 'https://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html'
find_path_bfs(start_url, target_url)

What if we try using another approach: depth-first search?

## Task 2: Depth-First Search (DFS)

DFS is another fundamental algorithm that uses a different strategy than BFS. It explores as far as possible along each branch before backtracking. This means it goes deep into the graph as quickly as possible, but the path it finds is not guaranteed to be the shortest.
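
As a small illustration, the sketch below runs DFS on a made-up link graph (`toy_web` and its page names are hypothetical). It uses an explicit stack rather than recursion, which is a choice made for this sketch and not the structure the assignment asks for.

```python
# Hypothetical link graph: "home" links directly to "contact",
# but its first link leads down a longer branch.
toy_web = {
    "home": ["blog", "contact"],
    "blog": ["post"],
    "post": ["archive"],
    "archive": ["contact"],
    "contact": [],
}


def dfs_toy(graph, start, target):
    """Returns some path from start to target (not necessarily the shortest), or None."""
    visited = set()
    stack = [(start, [start])]  # a plain list works as a stack

    while stack:
        node, path = stack.pop()  # take the most recently added entry
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            return path
        # Push in reverse so the first listed link is explored first.
        for neighbor in reversed(graph.get(node, [])):
            if neighbor not in visited:
                stack.append((neighbor, path + [neighbor]))

    return None


print(dfs_toy(toy_web, "home", "contact"))
# ['home', 'blog', 'post', 'archive', 'contact']
```

Even though `home` links straight to `contact`, this DFS commits to the `blog` branch first and reaches the target the long way around.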

Now, your next task is to apply DFS to our web path finding problem.

In [None]:
def dfs(start_url, target_url, visited=None, path=None):
    """Performs DFS to find a path from start_url to target_url."""
    if visited is None:
        visited = set() # Define a set of visited nodes
    if path is None:
        path = [start_url]

    # Fill in the blank below to add the start url to the set of visited nodes.
    ...

    print(f"Visiting {start_url}")

    if start_url == target_url:
        print("\nTarget found.")
        # Fill in the blank to indicate what happens when the target is found
        ...

    for link in get_links(start_url):
        if link not in visited:

            # This recursively applies DFS to the current link
            result_path = dfs(link, target_url, visited, path + [link])

            if result_path is not None:
                return result_path

    return None

## Running DFS

Similar to before, let's try running DFS using the same two webpages provided earlier.

In [None]:
def find_path_dfs(start_url, target_url):
    """Finds a path of links from start_url to target_url using DFS."""
    path = dfs(start_url, target_url)

    if path:
        print("\nPath found:")
        for url in path:
            print(url)
    else:
        print("\nNo path found.")

Run the cell below to see the algorithm in action.

**NOTE**: If this cell takes more than 5 minutes to run, feel free to press the **STOP** button to interrupt execution.

In [None]:
start_url = 'https://books.toscrape.com/index.html'
target_url = 'https://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html'
find_path_dfs(start_url, target_url)

**Optional, ungraded question:** You might notice that DFS seems to take a really long time to run. Why do you think this is the case?

## Bidirectional Search

Bidirectional Search is an advanced search technique that runs two searches at once: one forward from the start node and one backward from the target node. The search stops when the two frontiers meet, that is, when one search reaches a node the other has already visited.

To implement the backward search, we will use a function to find incoming links to a page. This is where `googlesearch-python` comes into play. The `googlesearch-python` library is a third-party tool that allows us to use Google search within our Python scripts. It's particularly useful for finding incoming links to a webpage, a task that is otherwise quite complex to automate. We utilize this library for the bidirectional search to simulate searching for pages linking to our target URL.

In [None]:
!pip install googlesearch-python

### Caution: Usage Limits of `googlesearch-python`

When using the `googlesearch-python` library to programmatically perform Google searches, it's important to be mindful of Google's usage policies and limitations. Automated queries can quickly reach the rate limits imposed by Google, potentially leading to your IP being temporarily blocked from making further requests.

Remember, the goal of tools like `googlesearch-python` is to facilitate learning and small-scale automation. They are not intended for large-scale data extraction or activities that could harm the availability and reliability of web services.
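
One common way to stay within rate limits is to wait progressively longer after each failed request. The sketch below is a generic illustration of that idea: `call_with_backoff` and `flaky_lookup` are hypothetical names invented here, not part of `googlesearch-python`. Which exception a rate-limited search raises is not stated in this lab, so the sketch simply catches any exception.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Calls func(), waiting longer after each failure before trying again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # assumption: any exception is treated as retryable
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 5 s, 10 s, 20 s, ...


# Demonstration with a stand-in function that fails twice, then succeeds.
attempts = []
waits = []


def flaky_lookup():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("simulated rate limit")
    return ["https://example.com/page"]


# Passing waits.append as `sleep` records the delays instead of actually waiting.
result = call_with_backoff(flaky_lookup, sleep=waits.append)
print(result)  # ['https://example.com/page']
print(waits)   # [5.0, 10.0]
```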

### Helpful Tip

If you run into rate limit issues, I would recommend restarting your session. On Google Colab, you can disconnect and delete the runtime, and then reconnect. If you end up doing this, you'll have to re-run the first cell (that imports all the libraries).

### Finding Incoming Links

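Finding incoming links (or backlinks) to a webpage is challenging because webpages don't inherently list pages that link to them. However, we can use Google search to find such links by searching for pages that link to our target URL, using a query of the form `link:<url>`. Google no longer officially supports the `link:` operator, so treat the results as a rough approximation of the true set of backlinks.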
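
When the whole graph is known in advance, incoming links are simply the outgoing links with every edge reversed. The sketch below shows this on a made-up link graph (`toy_web` and `build_incoming` are hypothetical names for this example). On the real web we never hold the full graph, which is why the lab falls back on a search engine.

```python
from collections import defaultdict

# Hypothetical link graph: each key is a page, each value lists the pages it links to.
toy_web = {
    "home": ["about", "blog"],
    "about": ["team"],
    "blog": ["post"],
    "team": ["contact"],
    "post": ["contact"],
    "contact": [],
}


def build_incoming(graph):
    """Maps each page to the set of pages that link to it."""
    incoming = defaultdict(set)
    for page, outgoing in graph.items():
        for linked_page in outgoing:
            incoming[linked_page].add(page)  # reverse the edge page -> linked_page
    return incoming


incoming = build_incoming(toy_web)
print(sorted(incoming["contact"]))  # ['post', 'team']
print(sorted(incoming["home"]))     # []
```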

We will define a function `get_incoming_links` using `googlesearch-python` to perform this task.

In [None]:
from googlesearch import search

def get_incoming_links(url, num_results=10):
    """Finds pages that link to the specified URL using Google search."""
    query = f"link:{url}"
    # For simplicity, we limit our search to num_results
    for result in search(query, num_results=num_results, sleep_interval=5): # the sleep_interval ensures that requests aren't sent too quickly
        yield result

### Task 3: Implementing Bidirectional Search

With the ability to find both direct links and incoming links, we can now implement the bidirectional search algorithm. This algorithm alternates between expanding the forward frontier and the backward frontier until a connection is found.
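
The alternation can be sketched on a small in-memory graph, where incoming links are available by reversing the edges. Everything named here (`toy_web`, `incoming`, `expand`, `bidirectional_toy`) is hypothetical and written for this illustration. It is organized differently from the function you will complete, but it joins the two half-paths in the same way.

```python
from collections import deque

# Hypothetical link graph: each key is a page, each value lists the pages it links to.
toy_web = {
    "home": ["about", "blog"],
    "about": ["team"],
    "blog": ["post"],
    "team": ["contact"],
    "post": ["contact"],
    "contact": [],
}

# Reverse every edge so we can look up which pages link TO a page.
incoming = {page: [] for page in toy_web}
for page, outgoing in toy_web.items():
    for linked_page in outgoing:
        incoming[linked_page].append(page)


def expand(queue, neighbors, own_paths, other_paths):
    """Expands one node; returns the meeting node if the frontiers touch, else None."""
    node = queue.popleft()
    for neighbor in neighbors.get(node, []):
        if neighbor not in own_paths:
            own_paths[neighbor] = own_paths[node] + [neighbor]
            queue.append(neighbor)
            if neighbor in other_paths:
                return neighbor
    return None


def bidirectional_toy(start, target):
    if start == target:
        return [start]
    forward_paths = {start: [start]}     # start -> ... -> node
    backward_paths = {target: [target]}  # target -> ... -> node, along reversed edges
    forward_queue = deque([start])
    backward_queue = deque([target])

    while forward_queue and backward_queue:
        # Alternate: one forward expansion, then one backward expansion.
        meet = expand(forward_queue, toy_web, forward_paths, backward_paths)
        if meet is None:
            meet = expand(backward_queue, incoming, backward_paths, forward_paths)
        if meet is not None:
            # Reverse the backward path and drop its first element (the meeting
            # node) so that node is not listed twice.
            return forward_paths[meet] + backward_paths[meet][::-1][1:]

    return None


print(bidirectional_toy("home", "contact"))  # ['home', 'about', 'team', 'contact']
print(bidirectional_toy("contact", "home"))  # None
```

In the first call the searches meet at `team`: the forward path is `['home', 'about', 'team']`, the backward path is `['contact', 'team']`, and joining them yields the full route.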

Your task is to fill in the blanks below to complete the implementation of bidirectional search.

In [None]:
def bidirectional_search(start_url, target_url):
    forward_queue = deque([(start_url, [start_url])])
    backward_queue = deque([(target_url, [target_url])])
    forward_visited = {start_url}
    backward_visited = {target_url}
    forward_paths = {start_url: [start_url]}
    backward_paths = {target_url: [target_url]}

    while forward_queue and backward_queue:
        # Forward search step
        # Fill in the blank below to remove an element from the left of the queue
        current_forward, path_forward = ...
        print(f"Forward visiting: {current_forward}")
        for link in ...: # Fill in the blank here to iterate over all links from a webpage
            if link not in ...: # Fill in this blank
                forward_visited.add(link)
                new_path = path_forward + [link]
                forward_paths[link] = new_path
                forward_queue.append((link, new_path))
                if link in backward_visited:
                    return forward_paths[link] + backward_paths[link][::-1][1:]

        # Write some code below to implement the backward search step
        # It follows a very similar structure to the forward search step, but we iterate over all INCOMING links instead

        # INSERT YOUR CODE HERE

    return None  # If no connection is found

## Running Bidirectional Search

Similar to before, let's try running bidirectional search on the same two webpages we used earlier.

Run the cell below.

In [None]:
def find_path_bidirectional(start_url, target_url):
    """Finds a path of links from start_url to target_url using bidirectional search."""
    path = bidirectional_search(start_url, target_url)

    if path:
        print("Path found:")
        for url in path:
            print(url)
    else:
        print("No path found.")

In [None]:
start_url = 'https://books.toscrape.com/index.html'
target_url = 'https://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html'
find_path_bidirectional(start_url, target_url)

**Optional, ungraded question:** Comment on your observations after you run bidirectional search. Does it run faster or slower compared to BFS and DFS? Why do you think this is the case?

## Submission and Grading

Check that you've correctly implemented the `bfs`, `dfs`, and `bidirectional_search` functions above by filling in the blanks and writing the backward search step. These functions will be graded.

Once you're done with the assignment, **export this notebook as a .py file**, and turn in the .py file to **Gradescope**.