![Digital Futures](https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/datascience-notebook-header.png?raw=true)

# Setting up the Project Environment

Ultimately, the Python code written here will be extracted to scripts and executed in an automated pipeline.  To support this, we need to set up a project environment in which the code runs in a controlled and reproducible way.

In the initial stages of the activities, the packages needed are `requests`, `pytest` and *BeautifulSoup* (package name `beautifulsoup4`).  The `requests` package is used to make HTTP requests, `pytest` is used for testing the code, and *BeautifulSoup* is used to parse the returned HTML.  In later activities, you may need to install additional packages.  To do this, add the packages to the `pip install` command below and re-run the cell.

> **Remember:** The goal is to create a set of code cells that can be extracted to separate scripts for execution in an automated pipeline.  Therefore, the code should be kept in 3 distinct cells:
> 
> - **Shell Commands**:  Used to set up the project environment
> 
> - **Python Tests**: Used to test the Python production scripts both now and as part of the automated pipeline
> 
> - **Python Production Code**: The Python code that will be extracted to a script to execute during the pipeline

---

# Environment Setup Scripts

If you are running this notebook after cloning and have not set up your environment to run shell commands, you will need to run the following commands in your terminal to set up the environment.

> **NOTE:**  These commands need to be executed in the terminal.  
>
> Open a terminal at the root of your project before executing these commands.
> 
> Until your environment is set up, Jupyter Notebooks will not be able to run **shell** scripts.

```sh
# Create a virtual environment
# Note: on some systems the command is `python -m venv .venv`;
# `python3` and `python` usually both point to the Python version installed on your system
python3 -m venv .venv

# Activate the virtual environment 
source .venv/bin/activate

# Install required package to execute shell commands from Jupyter Notebook
pip install ipykernel               ## OR
pip install -r requirements.txt     ## IF the project already has a requirements.txt file CONTAINING ipykernel
```


In [None]:
# Install the necessary packages
!pip install requests pytest beautifulsoup4

# Create a requirements.txt file
!pip freeze > requirements.txt

> **Note:** 
> The `!` at the beginning of the lines is a special character in Jupyter Notebooks that allows you to run shell commands from the notebook.  
> These will need to be removed from any commands that are to be exported to a `.sh` shell script file for the pipeline.
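
As a rough illustration of the note above, the sketch below turns notebook-style command lines into plain shell lines by dropping the leading `!`.  The helper name `to_shell_lines` is hypothetical and not part of this notebook; it assumes each command sits on its own line.

```python
from typing import List


def to_shell_lines(cell_source: str) -> List[str]:
    """Drop the leading `!` that Jupyter uses to mark shell commands."""
    shell_lines = []
    for line in cell_source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("!"):
            # `!pip install x` in a notebook becomes `pip install x` in a .sh file
            shell_lines.append(stripped[1:].lstrip())
        else:
            # Comments and blank lines are kept unchanged
            shell_lines.append(line)
    return shell_lines


cell = """# Install the necessary packages
!pip install requests pytest beautifulsoup4

# Create a requirements.txt file
!pip freeze > requirements.txt"""

print("\n".join(to_shell_lines(cell)))
```

The printed lines can then be pasted into a `.sh` file for the pipeline.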

---

# Python Tests

Develop any tests for functions in separate cells below.  The first has been provided for you as an example; add others as necessary.

### Test constants

In [3]:
TEST_URL = 'https://www.testsite.com'
TEST_HTML = '<html><body><h1>Hello, World!</h1></body></html>'
SUCCESS_STATUS = "success"
ERROR_STATUS = "error"
ERROR_NOT_HTML = "The response is not HTML"
NAV_HTML_WITH_CLASS = '<ul class="nav nav-list"><li>Item</li></ul>'
NAV_HTML_WITHOUT_CLASS = '<ul><li>Item</li></ul>'
NAV_HTML_WITHOUT_UL_TAG = '<div class="content"><p>No ul here</p></div>'
NAV_HTML_WITH_DIFFERENT_CLASS = '''
<ul class="different-class"><li>Item</li></ul>
'''
NAV_HTML_INVALID = '<ul class="nav nav-list"><li>Item'
HTML_PARSER = 'html.parser'


### Test `request_to_scrape`

In [7]:
# Test request_to_scrape
import pytest
from unittest.mock import patch
from requests.exceptions import Timeout, RequestException

def test_request_to_scrape_makes_correct_request():
    """
    Test that the function makes a request to the correct URL.
    """
    with patch('requests.get') as mock_get:
        mock_get.return_value.status_code = 200
        request_to_scrape(TEST_URL)
        mock_get.assert_called_once_with(TEST_URL, timeout=10)

test_request_to_scrape_makes_correct_request()

In [None]:
def test_request_to_scrape_returns_html_for_200():
    """
    Test that the function returns the HTML content
    when the request is successful (status code 200).
    """
    with patch('requests.get') as mock_get:
        mock_get.return_value.status_code = 200
        mock_get.return_value.headers = {'Content-Type': 'text/html'}
        mock_get.return_value.text = TEST_HTML
        result = request_to_scrape(TEST_URL)
        assert result == {
            "status": SUCCESS_STATUS,
            "data": TEST_HTML
        }
        
test_request_to_scrape_returns_html_for_200()


In [None]:
def test_request_to_scrape_handles_non_200():
    """
    Test that the function returns an error message
    when the response status code is not 200.
    """
    with patch('requests.get') as mock_get:
        mock_get.return_value.status_code = 404
        result = request_to_scrape(TEST_URL)
        assert result == {
            "status": ERROR_STATUS,
            "error": ERROR_NOT_HTML
        }
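
One point worth checking in the test above: `patch('requests.get')` hands back a `MagicMock`, and calling `raise_for_status()` on a mock does nothing unless a `side_effect` is configured.  A function that relies on `raise_for_status()` (as the production code further down does) therefore never sees the 404; the mocked headers simply fail the `'text/html'` check.  The sketch below demonstrates this with the standard library only.  `FakeHTTPError` is a hypothetical stand-in; in a real test you would probably raise `requests.exceptions.HTTPError` instead, and the expected error message would then be the "Request failed for URL" variant.

```python
from unittest.mock import MagicMock


class FakeHTTPError(Exception):
    """Hypothetical stand-in for requests.exceptions.HTTPError."""


# A mocked response with only the status code set, as in the test above
response = MagicMock()
response.status_code = 404

# On a plain mock this call is a no-op: nothing is raised for the 404
response.raise_for_status()

# headers.get(...) returns another MagicMock, so the membership test is False
content_type = response.headers.get('Content-Type', '')
is_html = 'text/html' in content_type

# To exercise the HTTP error path, give raise_for_status a side effect
response.raise_for_status.side_effect = FakeHTTPError("404 Client Error")
try:
    response.raise_for_status()
    raised = False
except FakeHTTPError:
    raised = True

print(is_html, raised)
```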

In [None]:
@pytest.mark.parametrize("exception, error_message", [
    (
        Exception("An error occurred"),
        "An error occurred - Unexpected error for URL"
    ),
    (
        Timeout("The request timed out"),
        "The request timed out - Request failed for URL"
    ),
    (
        RequestException("Invalid URL"),
        "Invalid URL - Request failed for URL"
    )
])
def test_request_to_scrape_handles_exceptions(exception, error_message):
    """
    Test that the function returns an error message 
    when an exception occurs during the request.
    """
    with patch('requests.get') as mock_get:
        mock_get.side_effect = exception
        result = request_to_scrape(TEST_URL)
        assert result == {
            "status": ERROR_STATUS,
            "error": error_message
        }

### Test `extract_element`

In [None]:
# Test extract_element
from bs4 import BeautifulSoup

@pytest.fixture
def soup_with_class():
    return BeautifulSoup(NAV_HTML_WITH_CLASS, HTML_PARSER)

@pytest.fixture
def soup_without_class():
    return BeautifulSoup(NAV_HTML_WITHOUT_CLASS, HTML_PARSER)

@pytest.fixture
def soup_without_tag():
    return BeautifulSoup(NAV_HTML_WITHOUT_UL_TAG, HTML_PARSER)

@pytest.fixture
def soup_with_different_class():
    return BeautifulSoup(NAV_HTML_WITH_DIFFERENT_CLASS, HTML_PARSER)

@pytest.fixture
def empty_soup():
    return BeautifulSoup('', HTML_PARSER)

@pytest.fixture
def invalid_soup():
    return BeautifulSoup(NAV_HTML_INVALID, HTML_PARSER)

def test_extract_element_with_valid_html_with_class(
    soup_with_class
):
    """
    Test that `extract_element` correctly extracts an element with a specific class.
    """  # noqa: E501
    element = extract_element(soup_with_class, 'ul', 'nav nav-list')
    assert element is not None
    assert element['class'] == ['nav', 'nav-list']

def test_extract_element_with_valid_html_without_class(
    soup_without_class
):
    """
    Test that `extract_element` correctly extracts an element without a class.
    """  # noqa: E501
    element = extract_element(soup_without_class, 'ul')
    assert element is not None
    assert element.name == 'ul'

def test_extract_element_with_html_without_tag(
    soup_without_tag
):
    """
    Test that `extract_element` returns None when the specified tag is not present.
    """  # noqa: E501
    element = extract_element(soup_without_tag, 'ul')
    assert element is None

def test_extract_element_with_html_with_different_class(
    soup_with_different_class
):
    """
    Test that `extract_element` returns None when the class does not match.
    """  # noqa: E501
    element = extract_element(soup_with_different_class, 'ul', 'nav nav-list')
    assert element is None

def test_extract_element_with_empty_html(
    empty_soup
):
    """
    Test that `extract_element` returns None when provided with empty HTML content.
    """  # noqa: E501
    element = extract_element(empty_soup, 'ul')
    assert element is None

def test_extract_element_with_invalid_html(
    invalid_soup
):
    """
    Test that `extract_element` can handle and correctly parse invalid HTML input.
    """  # noqa: E501
    element = extract_element(invalid_soup, 'ul', 'nav nav-list')
    assert element is not None
    assert element['class'] == ['nav', 'nav-list']

### Run the tests

Run the cell containing the `ipytest.run()` command to execute the tests.  The tests should all fail until you have written the production code.

Don't forget to run the installation and initialisation cell too the first time you run the tests!


---

# Python Production Code


Develop any functions for use as production code in separate cells below. The first has been provided as an example under the Production Constants; add others as necessary.

### PRODUCTION CONSTANTS

In [8]:
# PRODUCTION CONSTANTS

# Constants for status messages
STATUS_SUCCESS = "success"
ERROR_STATUS = "error"
ERROR_NOT_HTML = "The response is not HTML"
ERROR_REQUEST_FAILED = "Request failed for URL"
ERROR_UNEXPECTED = "Unexpected error for URL"

# HTML Parser
HTML_PARSER = "html.parser"

URL = "https://books.toscrape.com/"

### `request_to_scrape` Production Code

In [10]:
# request_to_scrape Production Code
import requests
from requests.exceptions import RequestException, Timeout

def request_to_scrape(url: str, timeout: int = 10) -> dict:
    """
    Sends an HTTP GET request to the specified URL and returns the response content.

    Args:
        url (str): The URL to which the GET request is sent.
        timeout (int, optional): The timeout for the request in seconds. Defaults to 10.

    Returns:
        Dict[str, str]: A dictionary containing the status, data, and any error messages.
                        - If the request is successful and returns HTML, 'status' is 'success' and 'data' contains the response text.
                        - If the request fails or does not return HTML, 'status' is 'error' and 'error' contains the error message.
    """  # noqa: E501
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for bad responses
        response.raise_for_status()
        # Check if the response contains HTML
        if 'text/html' in response.headers.get('Content-Type', ''):
            return {
                "status": STATUS_SUCCESS,
                "data": response.text
            }
        else:
            return {
                "status": ERROR_STATUS,
                "error": ERROR_NOT_HTML
            }
    except (Timeout, RequestException) as e:
        return {
            "status": ERROR_STATUS,
            "error": f"{str(e)} - {ERROR_REQUEST_FAILED}"
        }
    except Exception as e:
        return {
            "status": ERROR_STATUS,
            "error": f"{str(e)} - {ERROR_UNEXPECTED}"
        }

#request_to_scrape(URL)
#test_request_to_scrape_handles_exceptions(Exception("An error occurred"),"An error occurred - Unexpected error for URL")

### `extract_book_categories` Production Code

In [14]:
# `extract_book_categories` Production Code
from bs4 import BeautifulSoup

# Fall back to an empty string if the request failed (no 'data' key)
HTML = request_to_scrape(URL).get('data', '')
def extract_book_categories(html: str, site: str) -> dict:
    """
    Extracts book categories and their corresponding links from the provided HTML content.
    Args:
        html (str): The HTML content of the webpage to parse.
        site (str): The URL of the site from which the HTML content was retrieved.
    Returns:
        dict: A dictionary where the keys are category names and the values are dictionaries containing the corresponding links.
    """  # noqa: E501
    soup = BeautifulSoup(html, HTML_PARSER)
    nav_list = extract_element(soup, 'ul', 'nav nav-list')
    if nav_list is None:
        return {}
    category_list = extract_element(nav_list, 'ul')
    categories = extract_categories_and_links(category_list, site)
    return categories

extract_book_categories(HTML, URL)

{'Travel': {'link': 'https://books.toscrape.com//catalogue/category/books/travel_2/index.html'},
 'Mystery': {'link': 'https://books.toscrape.com//catalogue/category/books/mystery_3/index.html'},
 'Historical Fiction': {'link': 'https://books.toscrape.com//catalogue/category/books/historical-fiction_4/index.html'},
 'Sequential Art': {'link': 'https://books.toscrape.com//catalogue/category/books/sequential-art_5/index.html'},
 'Classics': {'link': 'https://books.toscrape.com//catalogue/category/books/classics_6/index.html'},
 'Philosophy': {'link': 'https://books.toscrape.com//catalogue/category/books/philosophy_7/index.html'},
 'Romance': {'link': 'https://books.toscrape.com//catalogue/category/books/romance_8/index.html'},
 'Womens Fiction': {'link': 'https://books.toscrape.com//catalogue/category/books/womens-fiction_9/index.html'},
 'Fiction': {'link': 'https://books.toscrape.com//catalogue/category/books/fiction_10/index.html'},
 'Childrens': {'link': 'https://books.toscrape.com//

### `extract_element` Production Code

In [12]:
# extract_element Production Code
from typing import Optional

from bs4 import BeautifulSoup, Tag

def extract_element(
    soup: BeautifulSoup, tag: str, class_name: Optional[str] = None
) -> Optional[Tag]:
    """
    Extracts an HTML element from a BeautifulSoup object based on the specified tag and optional class name.
    Args:
        soup (BeautifulSoup): The BeautifulSoup object to search within.
        tag (str): The HTML tag to search for.
        class_name (str, optional): The class name to filter the search. Defaults to None.
    Returns:
        Tag or None: The first matching Tag object if found, otherwise None.
    """  # noqa: E501
    if soup is None:
        return None
    return soup.find(tag, class_=class_name) if class_name else soup.find(tag)

### `extract_categories_and_links` Production Code

In [13]:
# `extract_categories_and_links` Production Code
from bs4 import BeautifulSoup

def extract_categories_and_links(
    category_list: BeautifulSoup, site: str
) -> dict:
    """
    Extracts categories and their corresponding links from a given list of HTML anchor elements.
    Args:
        category_list (BeautifulSoup object): A BeautifulSoup object containing a list of HTML anchor elements.
        site (str): The site URL to prepend to relative links.
    Returns:
        dict: A dictionary whose keys are category names (str) and whose values are dictionaries holding the corresponding href link (str) under the 'link' key.
        e.g.
        {
            'Category 1': {'link': 'https://www.example.com/category1.html'},
            'Category 2': {'link': 'https://www.example.com/category2.html'}
        }
    """  # noqa: E501
    if not category_list:
        return {}

    categories = {}
    for link in category_list.find_all('a'):
        category_name = link.get_text(strip=True)
        category_href = link.get('href')
        categories[category_name] = {
            'link': f"{site}/{category_href}" if category_href else None
        }

    return categories
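
The sample output shown earlier contains `//catalogue` because `URL` already ends with `/` and the f-string adds another.  One possible alternative, sketched here rather than taken from the notebook, is `urllib.parse.urljoin` from the standard library, which resolves a relative `href` against the base URL.  The function name `build_link` is hypothetical.

```python
from typing import Optional
from urllib.parse import urljoin


def build_link(site: str, href: Optional[str]) -> Optional[str]:
    """Resolve a relative href against the site URL, or return None."""
    if not href:
        return None
    # urljoin resolves href relative to site, so a base URL that ends
    # in "/" does not produce a doubled slash
    return urljoin(site, href)


print(build_link(
    "https://books.toscrape.com/",
    "catalogue/category/books/travel_2/index.html"
))
# https://books.toscrape.com/catalogue/category/books/travel_2/index.html
```

Note that `urljoin` follows standard URL resolution rules: if the base URL has a path that does not end in `/`, its last segment is replaced rather than extended, so it is worth checking the result against the links you expect.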

---

# Python Execution Code

Develop any code to call the developed functions below.  Add additional cells so you don't need to re-run all of the code when you develop further scripts.
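
A minimal sketch of what the execution code might look like, assuming the result dictionaries have the shape returned by `request_to_scrape`.  The name `run_scrape` and its `fetch`/`parse` parameters are hypothetical; they are passed in so that the example runs on its own with stub functions, whereas in this notebook you would pass `request_to_scrape` and `extract_book_categories`.

```python
from typing import Callable


def run_scrape(url: str, fetch: Callable, parse: Callable) -> dict:
    """Fetch a page and parse it only if the request succeeded."""
    result = fetch(url)
    if result.get("status") != "success":
        # Report the error instead of raising a KeyError on result["data"]
        print(f"Scrape failed: {result.get('error')}")
        return {}
    return parse(result["data"], url)


# Stub functions standing in for the notebook's production code
def fake_fetch(url: str) -> dict:
    return {"status": "success", "data": "<html></html>"}


def fake_parse(html: str, site: str) -> dict:
    return {"Travel": {"link": f"{site}travel.html"}}


categories = run_scrape("https://example.com/", fake_fetch, fake_parse)
print(categories)
```

Checking the `status` key before reading `data` keeps a failed request from surfacing as an unrelated `KeyError` later in the pipeline.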

In [None]:
# Python Execution Code


---

# Jupyter Notebook Test and Linting Set Up

To run `pytest` scripts in a Jupyter Notebook cell, we need to install the `ipytest` package.  This package is NOT required for a pipeline and therefore it can be removed from the `requirements.txt` file before adding the production code to the pipeline.

To run linting, we need to install two packages: `nbqa` and `flake8`.  We will make sure that `flake8` is included in the `requirements.txt` file when constructing the pipeline so that we can lint as part of the pipeline tests.

Run the following cell to install the `ipytest`, `nbqa` and `flake8` packages.  If you also want to check whether all of your production code is executed during the tests, add a coverage package (for example `pytest-cov`) to the same command.

This cell only needs to be run once (or after restarting the notebook kernel) to set up the environment for testing and linting.


In [None]:
# Install the `ipytest`, `nbqa` and `flake8` packages
!pip install ipytest nbqa flake8

### Set up `ipytest` to execute `pytest` scripts in Jupyter Notebook

In [None]:
# Configure ipytest for Jupyter Notebook

import ipytest
ipytest.autoconfig(rewrite_asserts=True, magics=True)

### Create a *config* file for `flake8`

Run this cell to create a `.flake8` file in your project root.

In [None]:
# Create a config file and ignore some flake8 rules
!echo "[flake8]" > .flake8
!echo "ignore = E402, W291, F811" >> .flake8
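
The `echo` commands above assume a Unix-style shell; on Windows `cmd`, for instance, `echo` keeps the surrounding quotes in the output.  As a hedged alternative, the same file can be written from Python.  The function name `write_flake8_config` is hypothetical, and the ignore list simply mirrors the one above.

```python
from pathlib import Path


def write_flake8_config(path: str = ".flake8") -> Path:
    """Write a flake8 config that ignores the same rules as the cell above."""
    config = Path(path)
    # Same content as the two echo commands, written in one call
    config.write_text(
        "[flake8]\nignore = E402, W291, F811\n", encoding="utf-8"
    )
    return config
```

Calling `write_flake8_config()` from a notebook cell creates `.flake8` in the current working directory, which should be the project root.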

# Execute the tests and linting in the Jupyter Notebook

Run the following cells ***EVERY TIME*** you want to run the tests you have written in the *Python Tests* cells above and lint your code.

>**Note:**
>
> This entire section does not need to be part of any pipeline scripts.  
> It is only required for the Jupyter Notebook environment during development.


## Run the tests

In [None]:
# Run the tests
ipytest.run("-vv", "-ss")

## Run the linter

Run this cell each time you want to lint your code.

In [None]:
# Run the linter
!nbqa flake8 --show-source --format=pylint webscraping.ipynb


---
