![Digital Futures](https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/datascience-notebook-header.png?raw=true)

# Setting up the Project Environment

Ultimately, the Python code written here will be extracted to scripts for execution in an automated pipeline.  To facilitate this, you need to set up a project environment in which the code runs in a controlled and reproducible way.

In the initial stages of the activities, the packages needed are `requests`, `pytest` and *BeautifulSoup* (package name `beautifulsoup4`).  The `requests` package is used to make HTTP requests to the API, `pytest` is used for testing the code, and BeautifulSoup is used to parse the returned HTML.  In later activities, you may need to install additional packages.  To do this, add the packages to the `pip install` command below and re-run the cell.

> **Remember:** The goal is to create a set of code cells that can be extracted to separate scripts for execution in an automated pipeline.  Therefore, the code should be kept in 3 distinct cells:
> 
> - **Shell Commands**:  Used to set up the project environment
> 
> - **Python Tests**: Used to test the Python production scripts both now and as part of the automated pipeline
> 
> - **Python Production Code**: The Python code that will be extracted to a script to execute during the pipeline

---

# Environment Setup Scripts

If you are running this notebook after cloning and have not set up your environment to run shell commands, you will need to run the following commands in your terminal to set up the environment.

> **NOTE:**  These commands need to be executed in the terminal.  
>
> Open a terminal at the root of your project before executing these commands
> 
> Until your environment is set up, Jupyter Notebooks will not be able to run **shell** scripts.

```sh
# Create a virtual environment
# Note: depending on your system, the command may be `python` rather than `python3`; both typically run the Python version installed on your system
python3 -m venv .venv

# Activate the virtual environment 
source .venv/bin/activate

# Install required package to execute shell commands from Jupyter Notebook
pip install ipykernel               ## OR 
pip install -r requirements.txt     ## IF the project already has a requirements.txt file containing ipykernel
```


In [None]:
# Install the necessary packages
!pip install requests pytest beautifulsoup4

# Create a requirements.txt file
!pip freeze > requirements.txt

> **Note:** 
> The `!` at the beginning of the lines is a special character in Jupyter Notebooks that allows you to run shell commands from the notebook.  
> These will need to be removed from any commands that are to be exported to a `.sh` shell script file for the pipeline.

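As a rough illustration of that export step, the sketch below strips the leading `!` from the lines of a notebook cell so they can be written to a `.sh` file. `notebook_lines_to_shell` is a hypothetical helper written for this example, not part of the project, and the `#!/bin/sh` header is an assumption about how the pipeline script would start.

```python
def notebook_lines_to_shell(cell_source: str) -> str:
    """Turn notebook shell-escape lines ('!command') into plain shell lines."""
    shell_lines = ["#!/bin/sh"]  # assumed script header
    for line in cell_source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("!"):
            # Drop the Jupyter-only '!' prefix and keep the command itself
            shell_lines.append(stripped[1:].lstrip())
        else:
            # Comments and blank lines are already valid in a shell script
            shell_lines.append(line)
    return "\n".join(shell_lines) + "\n"


cell = """# Install the necessary packages
!pip install requests pytest beautifulsoup4

# Create a requirements.txt file
!pip freeze > requirements.txt"""

script = notebook_lines_to_shell(cell)
print(script)
```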
---

# Python Tests

Develop any tests for functions in separate cells below.  The first has been provided for you as an example; add others as necessary.

### Test CONSTANTS

In [None]:
# General Constants
TEST_URL = 'https://www.testsite.com'
TEST_HTML = '<html><body><h1>Hello, World!</h1></body></html>'
JSON_CONTENT = '{"key": "value"}'
HTML_PARSER = 'html.parser'

# Constants for expected results for requests
SUCCESS_STATUS = "success"
ERROR_STATUS = "error"
ERROR_NOT_HTML = "The response is not HTML"
ERROR_REQUEST_FAILED = "Request failed for URL"
ERROR_UNEXPECTED = "Unexpected error for URL"

# Constant Test Data and HTML for Extract Book Categories

# List of categories for testing category extraction
TEST_CATEGORIES_WITH_LINKS = [
    {'Category 1': {'link': f"{TEST_URL}/category1.html"}},
    {'Category 2': {'link': f"{TEST_URL}/category2.html"}},
    {'Category No Link Attr': {}},
    {'Category Link None': {'link': None}},
    {'Category Invalid Link': {'link': 'invalid-url'}},
]

# Test List of book data
TEST_BOOK_DATA = [
    {'title': 'Book 1', 'price': 10.99},
    {'title': 'Book 2', 'price': 15.99},
    {'title': 'Book 3', 'price': 9.99},
    {'title': 'Invalidly Priced', 'price': 'Invalid Price'},
    {'title': 'None Price', 'price': None},
    {'title': None, 'price': 20.99},
    {'title': 'Additional Data', 'price': 10.99, 'extra': 'data'},
    {'title': None, 'price': None},
    {'title': 'Very long title truncated in display', 'price': 30.99}
]

# HTML for category link extraction
CATEGORY_HTML_VALID = '''
<ul class="nav nav-list">
    <ul>
        <li><a href="category1.html">Category 1</a></li>
        <li><a href="category2.html">Category 2</a></li>
    </ul>
</ul>
'''

CATEGORY_HTML_WITHOUT_NAV_LIST = '<ul></ul>'

CATEGORY_HTML_INVALID = f'''
    <ul class="nav nav-list">
        <ul>
            <li>
                <a 
                    href="category1.html"
                >{next(iter(TEST_CATEGORIES_WITH_LINKS[0]))}
'''
# next(iter()) extracts the category name of the dictionary supplied

CATEGORY_HTML_NO_HREF = f'''
    <ul class="nav nav-list">
        <ul><li><a>{next(iter(TEST_CATEGORIES_WITH_LINKS[3]))}</a></li></ul>
    </ul>
'''


# Test HTML for testing category extraction from nav list
NAV_HTML_WITH_CLASS = '<ul class="nav nav-list"><li>Item</li></ul>'
NAV_HTML_WITHOUT_CLASS = '<ul><li>Item</li></ul>'
NAV_HTML_WITHOUT_UL_TAG = '<div class="content"><p>No ul here</p></div>'
NAV_HTML_WITH_DIFFERENT_CLASS = '''
<ul class="different-class"><li>Item</li></ul>
'''
NAV_HTML_INVALID = '<ul class="nav nav-list"><li>Item'

# Test HTML for Number of Books
BOOK_NUMBER_HTML_VALID = '''
    <form class="form-horizontal">
        <strong>10</strong> books found
    </form>
'''
BOOK_NUMBER_HTML_WITHOUT_NUMBER = '''
<form class="form-horizontal">No books found</form>
'''
BOOK_NUMBER_HTML_INVALID = '<form class="form-horizontal"><strong>10'
BOOK_NUMBER_HTML_NO_BOOKS_FOUND = '''
    <form class="form-horizontal">
        <strong>0</strong> books found
    </form>
'''
BOOK_NUMBER_HTML_NO_BOOKS_FOUND_TEXT = '''
<form class="form-horizontal">No books found</form>
'''
BOOK_NUMBER_HTML_NON_NUMERIC_BOOK_NUMBER = '''
    <form class="form-horizontal">
        <strong>Not a number</strong> books found
    </form>
'''
BOOK_NUMBER_HTML_NO_FORM_HORIZONTAL = '<form><strong>10</strong></form>'

# Book data HTML for testing
BOOK_DATA_HTML_VALID = f'''
<html>
    <body>
        <article class="product_pod">
            <h3>
                <a title="{TEST_BOOK_DATA[0]['title']}">
                    {TEST_BOOK_DATA[0]['title']}
                </a>
            </h3>
            <p class="price_color">Â£{TEST_BOOK_DATA[0]['price']}9</p>
        </article>
        <article class="product_pod">
            <h3>
                <a title="{TEST_BOOK_DATA[2]['title']}">
                    {TEST_BOOK_DATA[2]['title']}
                </a>
            </h3>
            <p class="price_color">Â£{TEST_BOOK_DATA[2]['price']}9</p>
        </article>
    </body>
</html>
'''

BOOK_DATA_HTML_INVALID = f'''
<html>
    <body>
        <article class="product_pod">
            <h3>{TEST_BOOK_DATA[3]['title']}</h3>
            <p class="price_color">{TEST_BOOK_DATA[3]['price']}</p>
        </article>
    </body>
</html>
'''

TITLE_LINK_NO_PRICE = f'<a title="{TEST_BOOK_DATA[4]["title"]}">{TEST_BOOK_DATA[4]["title"]}</a>'  # noqa: E501

BOOK_DATA_HTML_MISSING_PRICE = f'''
    <html>
        <body>
            <article class="product_pod">
                <h3>
                    {TITLE_LINK_NO_PRICE}
            </article>
        </body>
    </html>
'''

SINGLE_BOOK_PRICE_HTML_NONE = '<p class="price_color"></p>'

SINGLE_BOOK_TITLE_HTML_NO_ATTR = f'''
    <h3>{TEST_BOOK_DATA[0]['title']}</h3>
'''

TITLE_LINK_VALID = f'<a title="{TEST_BOOK_DATA[0]["title"]}">{TEST_BOOK_DATA[0]["title"]}</a>'  # noqa: E501

SINGLE_BOOK_TITLE_HTML_VALID = f'''
    <h3>{TITLE_LINK_VALID}</h3>
'''
SINGLE_BOOK_PRICE_HTML_VALID = f'''
    <p class="price_color">Â£{TEST_BOOK_DATA[0]['price']}</p>
'''

SINGLE_BOOK_PRICE_HTML_INVALID = f'''
    <p class="price_color">{TEST_BOOK_DATA[3]['price']}</p>
'''

TRUNCATED_TITLE_LINK = f'<a title="{TEST_BOOK_DATA[8]["title"]}">{TEST_BOOK_DATA[8]["title"][:10]}...</a>'  # noqa: E501

SINGLE_BOOK_TITLE_HTML_TRUNCATED = f'''
    <h3>{TRUNCATED_TITLE_LINK}</h3>
'''

SINGLE_BOOK_TITLE_TRUN_PRICE_HTML_VALID = f'''
    <p class="price_color">Â£{TEST_BOOK_DATA[8]['price']}</p>
'''

BOOK_DATA_HTML_NO_TITLE_ATTR = f'''
    <article class="product_pod">
        {SINGLE_BOOK_TITLE_HTML_NO_ATTR}
        {SINGLE_BOOK_PRICE_HTML_VALID}
    </article>
'''

BOOK_DATA_HTML_TRUNCATED_TITLE = f'''
    <article class="product_pod">
        {SINGLE_BOOK_TITLE_HTML_TRUNCATED}
        {SINGLE_BOOK_TITLE_TRUN_PRICE_HTML_VALID}
    </article>
'''

# Pagination HTML data for testing

PAGINATION_VALID = '<li class="current">Page 1 of 4</li>'
PAGINATION_INVALID = '<li class="current">Page one of four</li>'
PAGINATION_EDGE_400 = '<li class="current">Page 1 of 400</li>'
PAGINATION_EDGE_40 = '<li class="current">Page 1 of 40</li>'
PAGINATION_EDGE_4 = '<li class="current">Page 1 of 4</li>'

BOOK_DATA_HTML_MISSING_TITLE = f'''
    <html>
        <body>
            <article class="product_pod">
                <p class="price_color">Â£{TEST_BOOK_DATA[5]['price']}</p>
            </article>
        </body>
    </html>
'''

BOOK_DATA_HTML_MIXED = f'''
    <html>
        <body>
            <article class="product_pod">
                {SINGLE_BOOK_TITLE_HTML_VALID}
                {SINGLE_BOOK_PRICE_HTML_VALID}
            </article>
            <article class="product_pod">
                <h3>
                    <a title="{TEST_BOOK_DATA[4]['title']}">
                        {TEST_BOOK_DATA[4]['title']}
                    </a>
                </h3>
                <p class="price_color">{TEST_BOOK_DATA[4]['price']}</p>
            </article>
            <article class="product_pod">
                <h3>
                    <a title="{TEST_BOOK_DATA[5]['title']}">
                        {TEST_BOOK_DATA[5]['title']}
                    </a>
                </h3>
                <p class="price_color">Â£{TEST_BOOK_DATA[5]['price']}</p>
            </article>
            <article class="product_pod">
                <h3>
                    <a title="{TEST_BOOK_DATA[7]['title']}">
                        {TEST_BOOK_DATA[7]['title']}
                    </a>
                </h3>
                <p class="price_color">{TEST_BOOK_DATA[7]['price']}</p>
            </article>
        </body>
    </html>
'''

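The category constants above are single-entry dictionaries, which is why the HTML fixtures use `next(iter(...))` to pull out the category name and why the tests build expected results by merging entries with `**`. The short sketch below shows both idioms on a cut-down copy of the data; the variable names `categories`, `first_name` and `merged` are illustrative only.

```python
TEST_URL = 'https://www.testsite.com'

# Cut-down copy of TEST_CATEGORIES_WITH_LINKS for illustration
categories = [
    {'Category 1': {'link': f"{TEST_URL}/category1.html"}},
    {'Category 2': {'link': f"{TEST_URL}/category2.html"}},
]

# Iterating a dict yields its keys, so next(iter(d)) returns the first key
first_name = next(iter(categories[0]))

# ** unpacking merges the single-entry dicts into one expected result
merged = {**categories[0], **categories[1]}

print(first_name)
print(list(merged))
```

Because each entry has exactly one key, `next(iter(...))` is unambiguous here; with multi-key dictionaries it would only return the first key inserted.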
### Test `request_to_scrape`

In [None]:
# Test request_to_scrape
import pytest
from unittest.mock import patch
from requests.exceptions import Timeout, RequestException

def test_request_to_scrape_makes_correct_request():
    """
    Test that the function makes a request to the correct URL.
    """
    with patch('requests.get') as mock_get:
        mock_get.return_value.status_code = 200
        request_to_scrape(TEST_URL)
        mock_get.assert_called_once_with(TEST_URL, timeout=10)

def test_request_to_scrape_returns_html_for_200():
    """
    Test that the function returns the HTML content
    when the request is successful (status code 200).
    """
    with patch('requests.get') as mock_get:
        mock_get.return_value.status_code = 200
        mock_get.return_value.headers = {'Content-Type': 'text/html'}
        mock_get.return_value.text = TEST_HTML
        result = request_to_scrape(TEST_URL)
        assert result == {
            "status": SUCCESS_STATUS,
            "data": TEST_HTML
        }

def test_request_to_scrape_handles_non_200():
    """
    Test that the function returns an error message
    when the response status code is not 200.
    """
    with patch('requests.get') as mock_get:
        mock_get.return_value.status_code = 404
        result = request_to_scrape(TEST_URL)
        assert result == {
            "status": ERROR_STATUS,
            "error": ERROR_NOT_HTML
        }

def test_request_to_scrape_non_html_content():
    """
    Test that the function returns an error message
    when the response type is not HTML.
    """
    with patch('requests.get') as mock_get:
        mock_get.return_value.status_code = 200
        mock_get.return_value.headers = {'Content-Type': 'application/json'}
        mock_get.return_value.text = JSON_CONTENT
        result = request_to_scrape(TEST_URL)
        assert result == {
            "status": ERROR_STATUS,
            "error": ERROR_NOT_HTML
        }

@pytest.mark.parametrize("exception, error_message", [
    (
        Exception("An error occurred"),
        "An error occurred - Unexpected error for URL"
    ),
    (
        Timeout("The request timed out"),
        "The request timed out - Request failed for URL"
    ),
    (
        RequestException("Invalid URL"),
        "Invalid URL - Request failed for URL"
    )
])
def test_request_to_scrape_handles_exceptions(exception, error_message):
    """
    Test that the function returns an error message 
    when an exception occurs during the request.
    """
    with patch('requests.get') as mock_get:
        mock_get.side_effect = exception
        result = request_to_scrape(TEST_URL)
        assert result == {
            "status": ERROR_STATUS,
            "error": error_message
        }

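The tests above rely on two features of `unittest.mock`: `return_value` to fake a successful response and `side_effect` to make the patched call raise. The sketch below shows the same mechanics in isolation. `fetch` is a hypothetical stand-in for `request_to_scrape` that takes the HTTP getter as an argument, purely so the example runs without `requests` installed; the real function and its exact behaviour are not shown in this section.

```python
from unittest.mock import Mock


def fetch(url, get):
    """Hypothetical stand-in: call `get` and wrap the outcome in a dict."""
    try:
        response = get(url, timeout=10)
    except Exception as exc:
        # The error text follows the '<exception> - <reason>' pattern
        # used by the expected messages in the parametrized test
        return {"status": "error", "error": f"{exc} - Unexpected error for URL"}
    return {"status": "success", "data": response.text}


# return_value: the mock returns a fake response object
ok_get = Mock()
ok_get.return_value.text = "<html></html>"
ok_result = fetch("https://www.testsite.com", ok_get)
ok_get.assert_called_once_with("https://www.testsite.com", timeout=10)

# side_effect: the mock raises instead of returning
failing_get = Mock(side_effect=Exception("An error occurred"))
error_result = fetch("https://www.testsite.com", failing_get)

print(ok_result)
print(error_result)
```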
### Test `extract_book_categories`

In [None]:
# Test `extract_book_categories`

def test_extract_book_categories_with_valid_html():
    """
    Test that `extract_book_categories` correctly parses valid HTML input.
    """  # noqa E501
    categories = extract_book_categories(
        CATEGORY_HTML_VALID, 
        TEST_URL
    )
    assert categories == { 
        **TEST_CATEGORIES_WITH_LINKS[0],
        **TEST_CATEGORIES_WITH_LINKS[1]
    }

def test_extract_book_categories_with_html_without_nav_list():
    """
    Test that `extract_book_categories` returns an empty dictionary when the HTML content does not contain a navigation list.
    """  # noqa E501
    categories = extract_book_categories(
        CATEGORY_HTML_WITHOUT_NAV_LIST,
        TEST_URL
    )
    assert categories == {}

def test_extract_book_categories_with_empty_html():
    """
    Test that `extract_book_categories` returns an empty dictionary when the HTML contains no lists at all.
    """  # noqa E501
    categories = extract_book_categories(
        TEST_HTML,
        TEST_URL
    )
    assert categories == {}

def test_extract_book_categories_with_invalid_html():
    """
    Test that `extract_book_categories` can correctly parse and extract book categories from invalid HTML input.
    """  # noqa E501
    categories = extract_book_categories(
        CATEGORY_HTML_INVALID,
        TEST_URL
    )
    assert categories == TEST_CATEGORIES_WITH_LINKS[0]


### Test `extract_element`

In [None]:
# Test extract_element
from bs4 import BeautifulSoup

@pytest.fixture
def soup_with_class():
    return BeautifulSoup(NAV_HTML_WITH_CLASS, HTML_PARSER)

@pytest.fixture
def soup_without_class():
    return BeautifulSoup(NAV_HTML_WITHOUT_CLASS, HTML_PARSER)

@pytest.fixture
def soup_without_tag():
    return BeautifulSoup(NAV_HTML_WITHOUT_UL_TAG, HTML_PARSER)

@pytest.fixture
def soup_with_different_class():
    return BeautifulSoup(NAV_HTML_WITH_DIFFERENT_CLASS, HTML_PARSER)

@pytest.fixture
def empty_soup():
    return BeautifulSoup('', HTML_PARSER)

@pytest.fixture
def invalid_soup():
    return BeautifulSoup(NAV_HTML_INVALID, HTML_PARSER)

def test_extract_element_with_valid_html_with_class(
    soup_with_class
):
    """
    Test that `extract_element` correctly extracts an element with a specific class.
    """  # noqa E501
    element = extract_element(soup_with_class, 'ul', 'nav nav-list')
    assert element is not None
    assert element['class'] == ['nav', 'nav-list']

def test_extract_element_with_valid_html_without_class(
    soup_without_class
):
    """
    Test that `extract_element` correctly extracts an element without a class.
    """  # noqa E501
    element = extract_element(soup_without_class, 'ul')
    assert element is not None
    assert element.name == 'ul'

def test_extract_element_with_html_without_tag(
    soup_without_tag
):
    """
    Test that `extract_element` returns None when the specified tag is not present.
    """  # noqa E501
    element = extract_element(soup_without_tag, 'ul')
    assert element is None

def test_extract_element_with_html_with_different_class(
    soup_with_different_class
):
    """
    Test that `extract_element` returns None when the class does not match.
    """  # noqa E501
    element = extract_element(soup_with_different_class, 'ul', 'nav nav-list')
    assert element is None

def test_extract_element_with_empty_html(
    empty_soup
):
    """
    Test that `extract_element` returns None when provided with empty HTML content.
    """  # noqa E501
    element = extract_element(empty_soup, 'ul')
    assert element is None

def test_extract_element_with_invalid_html(
    invalid_soup
):
    """
    Test that `extract_element` can handle and correctly parse invalid HTML input.
    """  # noqa E501
    element = extract_element(invalid_soup, 'ul', 'nav nav-list')
    assert element is not None
    assert element['class'] == ['nav', 'nav-list']

### Test `extract_categories_and_links`

In [None]:
# Test extract_categories_and_links
from bs4 import BeautifulSoup

def test_extract_categories_and_links_with_valid_html_with_multiple_categories():  # noqa E501
    """
    Test that `extract_categories_and_links` correctly extracts categories and their links from valid HTML input.
    """  # noqa E501
    expected_output = {
        **TEST_CATEGORIES_WITH_LINKS[0],
        **TEST_CATEGORIES_WITH_LINKS[1]
    }
    soup = BeautifulSoup(CATEGORY_HTML_VALID, HTML_PARSER)
    category_list = soup.find('ul')
    categories = extract_categories_and_links(category_list, TEST_URL)
    assert categories == expected_output

def test_extract_categories_and_links_with_none():
    """
    Test that `extract_categories_and_links` returns an empty dictionary when input is None.
    """  # noqa E501
    categories = extract_categories_and_links(None, TEST_URL)
    assert categories == {}

def test_extract_categories_and_links_with_empty_html():
    """
    Test that `extract_categories_and_links` returns an empty dictionary when the HTML contains no `ul` element.
    """  # noqa E501
    soup = BeautifulSoup(TEST_HTML, HTML_PARSER)
    category_list = soup.find('ul')
    categories = extract_categories_and_links(category_list, TEST_URL)
    assert categories == {}

def test_extract_categories_and_links_with_html_without_a_tags():
    """
    Test that `extract_categories_and_links` returns an empty dictionary when there are no `a` tags in the HTML.
    """  # noqa E501
    soup = BeautifulSoup(CATEGORY_HTML_WITHOUT_NAV_LIST, HTML_PARSER)
    category_list = soup.find('ul')
    categories = extract_categories_and_links(category_list, TEST_URL)
    assert categories == {}

def test_extract_categories_and_links_with_a_tags_without_href():
    """
    Test that `extract_categories_and_links` handles `a` tags without `href` attributes gracefully.
    """  # noqa E501
    soup = BeautifulSoup(CATEGORY_HTML_NO_HREF, HTML_PARSER)
    category_list = soup.find('ul')
    categories = extract_categories_and_links(category_list, TEST_URL)
    assert categories == TEST_CATEGORIES_WITH_LINKS[3]

def test_extract_categories_and_links_with_invalid_html():
    """
    Test that `extract_categories_and_links` handles invalid HTML gracefully.
    """  # noqa E501
    soup = BeautifulSoup(CATEGORY_HTML_INVALID, HTML_PARSER)
    category_list = soup.find('ul')
    categories = extract_categories_and_links(category_list, TEST_URL)
    assert categories == TEST_CATEGORIES_WITH_LINKS[0]

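The expected values in these tests imply that each relative `href` is resolved against the base URL, and that a missing `href` is recorded as `None`. One way this could be done is sketched below with `urllib.parse.urljoin`. `build_category_entry` is a hypothetical helper for illustration; the production function may construct links differently.

```python
from urllib.parse import urljoin


def build_category_entry(name, href, base_url):
    """Hypothetical helper: build one {name: {'link': ...}} entry."""
    if href:
        # Normalise the base to end with '/' so the href is appended to it
        link = urljoin(base_url.rstrip("/") + "/", href)
    else:
        # An <a> tag without an href has no link to record
        link = None
    return {name: {"link": link}}


with_link = build_category_entry(
    "Category 1", "category1.html", "https://www.testsite.com"
)
without_link = build_category_entry(
    "Category Link None", None, "https://www.testsite.com"
)

print(with_link)
print(without_link)
```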

### Test `extract_category_data`

In [None]:
# Test `extract_category_data`
import pytest
from unittest.mock import patch

@pytest.fixture
def test_category_data():
    return {
        'category': 'Category 1',
        'link': f"{TEST_URL}/category1.html",
    }

@pytest.fixture
def expected_category_data_valid():
    return {
        'category': 'Category 1',
        'link': f"{TEST_URL}/category1.html",
        'number_of_books': 50
    }

@pytest.fixture
def expected_category_data_invalid():
    return {
        'category': 'Category 1',
        'link': f"{TEST_URL}/category1.html",
        'number_of_books': 0
    }

@pytest.fixture
def expected_category_data_with_books():
    return {
        'category': 'Category 1',
        'link': f"{TEST_URL}/category1.html",
        'number_of_books': 50,
        'books': TEST_BOOK_DATA
    }

@pytest.fixture
def expected_category_data_invalid_with_books():
    return {
        'category': 'Category 1',
        'link': f"{TEST_URL}/category1.html",
        'number_of_books': 0,
        'books': TEST_BOOK_DATA
    }

def test_extract_category_data_valid_book_number(
    test_category_data,
    expected_category_data_valid
):
    """
    Test that `extract_category_data` correctly adds the number of books in a category
    when `extract_number_in_category` returns a valid integer.
    """  # noqa E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={
                'status': 'success',
                'data': BOOK_NUMBER_HTML_VALID
            }
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value=50
        )
    ):
        category_data = extract_category_data(test_category_data)
        assert category_data == expected_category_data_valid

def test_extract_category_data_invalid_book_number(
    test_category_data,
    expected_category_data_invalid
):
    """
    Test that `extract_category_data` returns 0 books in category when `extract_number_in_category`
    returns an invalid integer value.
    """  # noqa E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={
                'status': 'success',
                'data': BOOK_NUMBER_HTML_INVALID
            }
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value='Not a number'
        )
    ):
        category_data = extract_category_data(test_category_data)
        assert category_data == expected_category_data_invalid

def test_extract_category_data_request_to_scrape_failure(
    test_category_data
):
    """
    Test that `extract_category_data` handles request_to_scrape failure gracefully.
    """  # noqa E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={'status': 'error', 'data': ''}
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value=0
        )
    ):
        category_data = extract_category_data(test_category_data)
        assert category_data['number_of_books'] == 0

def test_extract_category_data_empty_html_response(
    test_category_data
):
    """
    Test that `extract_category_data` handles empty HTML response gracefully.
    """  # noqa E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={'status': 'success', 'data': ''}
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value=0
        )
    ):
        category_data = extract_category_data(test_category_data)
        assert category_data['number_of_books'] == 0

def test_extract_category_data_test_no_books_found(
    test_category_data
):
    """
    Test that `extract_category_data` returns 0 books when no books are found.
    """  # noqa E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={
                'status': 'success',
                'data': BOOK_NUMBER_HTML_NO_BOOKS_FOUND
            }
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value=0
        )
    ):
        category_data = extract_category_data(test_category_data)
        assert category_data['number_of_books'] == 0

"""
TESTS FOR WHEN `get_book_data` is True
"""

def test_extract_category_data_valid_book_number_with_books(
    test_category_data,
    expected_category_data_with_books
):
    """
    Test that `extract_category_data` correctly adds the number of books in a category
    and extracts book data when `get_book_data` is True and `extract_number_in_category`
    returns a valid integer.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={
                'status': 'success',
                'data': BOOK_NUMBER_HTML_VALID
            }
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value=50
        ),
        patch(
            '__main__.extract_book_data',
            return_value=TEST_BOOK_DATA
        )
    ):
        category_data = extract_category_data(
            test_category_data,
            get_book_data=True
        )
        assert category_data == expected_category_data_with_books

def test_extract_category_data_invalid_book_number_with_books(
    test_category_data,
    expected_category_data_invalid_with_books
):
    """
    Test that `extract_category_data` returns 0 books in category and extracts book data
    when `get_book_data` is True and `extract_number_in_category` returns an invalid integer value.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={
                'status': 'success',
                'data': BOOK_NUMBER_HTML_INVALID
            }
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value='Not a number'
        ),
        patch(
            '__main__.extract_book_data',
            return_value=TEST_BOOK_DATA
        )
    ):
        category_data = extract_category_data(
            test_category_data,
            get_book_data=True
        )
        assert category_data == expected_category_data_invalid_with_books

def test_extract_category_data_request_to_scrape_failure_with_books(
    test_category_data
):
    """
    Test that `extract_category_data` handles request_to_scrape failure gracefully
    and still extracts book data when `get_book_data` is True.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={'status': 'error', 'data': ''}
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value=0
        ),
        patch(
            '__main__.extract_book_data',
            return_value=TEST_BOOK_DATA
        )
    ):
        category_data = extract_category_data(
            test_category_data,
            get_book_data=True
        )
        assert category_data['number_of_books'] == 0
        assert category_data['books'] == TEST_BOOK_DATA

def test_extract_category_data_empty_html_response_with_books(
    test_category_data
):
    """
    Test that `extract_category_data` handles empty HTML response gracefully
    and still extracts book data when `get_book_data` is True.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={'status': 'success', 'data': ''}
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value=0
        ),
        patch(
            '__main__.extract_book_data',
            return_value=TEST_BOOK_DATA
        )
    ):
        category_data = extract_category_data(
            test_category_data,
            get_book_data=True
        )
        assert category_data['number_of_books'] == 0
        assert category_data['books'] == TEST_BOOK_DATA

def test_extract_category_data_no_books_found_with_books(
    test_category_data
):
    """
    Test that `extract_category_data` returns 0 books when no books are found
    and still extracts book data when `get_book_data` is True.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value={
                'status': 'success',
                'data': BOOK_NUMBER_HTML_NO_BOOKS_FOUND
            }
        ),
        patch(
            '__main__.extract_number_in_category',
            return_value=0
        ),
        patch(
            '__main__.extract_book_data',
            return_value=TEST_BOOK_DATA
        )
    ):
        category_data = extract_category_data(
            test_category_data,
            get_book_data=True
        )
        assert category_data['number_of_books'] == 0
        assert category_data['books'] == TEST_BOOK_DATA

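Taken together, these tests describe a simple contract: the category dictionary gains a `number_of_books` key that falls back to 0 when the count is not an integer, and a `books` key only when book data is requested. The sketch below captures that contract in a hypothetical helper, `assemble_category_data`. It is an assumption about the shape of the logic, not the production `extract_category_data`, which also performs the request and parsing.

```python
def assemble_category_data(category_data, number_of_books, books=None):
    """Hypothetical helper: combine a category with its count and books."""
    # bool is a subclass of int, so exclude it explicitly
    is_valid_count = (
        isinstance(number_of_books, int)
        and not isinstance(number_of_books, bool)
    )
    result = {
        **category_data,
        "number_of_books": number_of_books if is_valid_count else 0,
    }
    if books is not None:
        result["books"] = books
    return result


category = {
    "category": "Category 1",
    "link": "https://www.testsite.com/category1.html",
}
valid = assemble_category_data(category, 50)
invalid = assemble_category_data(category, "Not a number")
with_books = assemble_category_data(
    category, 0, books=[{"title": "Book 1", "price": 10.99}]
)

print(valid)
print(invalid)
print(with_books)
```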
### Test `extract_number_in_category`

In [None]:
# Test `extract_number_in_category`
from bs4 import BeautifulSoup

def test_extract_number_in_category_with_valid_html_with_number():
    """
    Test that `extract_number_in_category` correctly extracts the number of books from valid HTML.
    """  # noqa E501
    soup = BeautifulSoup(BOOK_NUMBER_HTML_VALID, HTML_PARSER)
    number = extract_number_in_category(soup)
    assert number == 10

def test_extract_number_html_without_number():
    """
    Test that `extract_number_in_category` returns 0 when the number of books is not found.
    """  # noqa E501
    soup = BeautifulSoup(BOOK_NUMBER_HTML_WITHOUT_NUMBER, HTML_PARSER)
    number = extract_number_in_category(soup)
    assert number == 0

def test_extract_number_empty_html():
    """
    Test that `extract_number_in_category` returns 0 when the HTML contains no form element.
    """  # noqa E501
    soup = BeautifulSoup(TEST_HTML, HTML_PARSER)
    number = extract_number_in_category(soup)
    assert number == 0

def test_extract_number_invalid_html():
    """
    Test that `extract_number_in_category` correctly extracts the number of books from invalid HTML.
    """  # noqa E501
    soup = BeautifulSoup(BOOK_NUMBER_HTML_INVALID, HTML_PARSER)
    number = extract_number_in_category(soup)
    assert number == 10

def test_extract_number_html_with_no_books_found():
    """
    Test that `extract_number_in_category` returns 0 when the HTML indicates no books found.
    """  # noqa E501
    soup = BeautifulSoup(BOOK_NUMBER_HTML_NO_BOOKS_FOUND, HTML_PARSER)
    number = extract_number_in_category(soup)
    assert number == 0

def test_extract_number_html_with_no_books_found_text():
    """
    Test that `extract_number_in_category` returns 0 when the HTML indicates no books found.
    """  # noqa: E501
    soup = BeautifulSoup(BOOK_NUMBER_HTML_NO_BOOKS_FOUND_TEXT, HTML_PARSER)
    number = extract_number_in_category(soup)
    assert number == 0

def test_extract_number_html_with_non_numeric_number_of_books():
    """
    Test that `extract_number_in_category` returns 0 when the HTML contains a non-numeric value.
    """  # noqa: E501
    soup = BeautifulSoup(BOOK_NUMBER_HTML_NON_NUMERIC_BOOK_NUMBER, HTML_PARSER)
    number = extract_number_in_category(soup)
    assert number == 0

def test_extract_number_no_form_horizontal_class():
    """
    Test that `extract_number_in_category` returns 0 when the HTML does not contain the 'form-horizontal' class.
    """  # noqa: E501
    soup = BeautifulSoup(BOOK_NUMBER_HTML_NO_FORM_HORIZONTAL, HTML_PARSER)
    number = extract_number_in_category(soup)
    assert number == 0

### Test `extract_all_category_data`

In [None]:
# Test `extract_all_category_data`
import pytest
from unittest.mock import patch

@pytest.fixture
def valid_categories():
    return {
        **TEST_CATEGORIES_WITH_LINKS[0],
        **TEST_CATEGORIES_WITH_LINKS[1]
    }

@pytest.fixture
def empty_categories():
    return {}

@pytest.fixture
def single_category():
    return {
        **TEST_CATEGORIES_WITH_LINKS[0]
    }

@pytest.fixture
def invalid_category_link():
    return {
        **TEST_CATEGORIES_WITH_LINKS[4]
    }

@pytest.fixture
def category_with_missing_link():
    return {
        **TEST_CATEGORIES_WITH_LINKS[2]
    }

# @pytest.fixture
# def category_with_additional_data():
#     return {
#         **TEST_CATEGORIES_WITH_LINKS[6]
#     }

def test_extract_all_category_data_makes_right_extract_category_data_call(
    valid_categories
):
    """
    Verify `extract_all_category_data` calls `extract_category_data` with correct arguments.

    Asserts:
        - The function calls `extract_category_data` with the correct category link arguments.
    """  # noqa: E501
    with (
        patch('__main__.extract_category_data') as mock_extract_category_data
    ):
        extract_all_category_data(valid_categories)
        assert mock_extract_category_data.call_count == 2
        mock_extract_category_data.assert_any_call(
            {'link': f'{TEST_URL}/category1.html'}, False
        )
        mock_extract_category_data.assert_any_call(
            {'link': f'{TEST_URL}/category2.html'}, False
        )

def test_extract_all_category_data_empty_categories(
    empty_categories
):
    """
    Verify `extract_all_category_data` handles empty categories dictionary.
    """  # noqa: E501
    with (
        patch('__main__.extract_category_data') as mock_extract_category_data
    ):
        result = extract_all_category_data(empty_categories)
        assert result == {}
        assert mock_extract_category_data.call_count == 0

def test_extract_all_category_data_single_category(
    single_category
):
    """
    Verify `extract_all_category_data` handles single category.
    """  # noqa: E501
    with (
        patch('__main__.extract_category_data') as mock_extract_category_data
    ):
        extract_all_category_data(single_category)
        assert mock_extract_category_data.call_count == 1
        mock_extract_category_data.assert_any_call(
            {'link': f'{TEST_URL}/category1.html'}, False
        )

def test_extract_all_category_data_invalid_category_link(
    invalid_category_link
):
    """
    Verify `extract_all_category_data` handles invalid category link.
    """  # noqa: E501
    with (
        patch('__main__.extract_category_data') as mock_extract_category_data
    ):
        extract_all_category_data(invalid_category_link)
        assert mock_extract_category_data.call_count == 1
        mock_extract_category_data.assert_any_call(
            {'link': 'invalid-url'}, False
        )

def test_extract_all_category_data_category_with_missing_link(
    category_with_missing_link
):
    """
    Verify `extract_all_category_data` handles category with missing link.
    """  # noqa: E501
    with (
        patch('__main__.extract_category_data') as mock_extract_category_data
    ):
        extract_all_category_data(category_with_missing_link)
        assert mock_extract_category_data.call_count == 1
        mock_extract_category_data.assert_any_call(
            {}, False
        )


### Test `extract_book_data`

In [None]:
# Test `extract_book_data`
import pytest
from bs4 import BeautifulSoup
from unittest.mock import patch

@pytest.fixture
def category_page():
    return BeautifulSoup(CATEGORY_HTML_VALID, HTML_PARSER)

BOOK_DATA_FOR_TEST = [
    TEST_BOOK_DATA[0], 
    TEST_BOOK_DATA[1],
    TEST_BOOK_DATA[2]
]

def test_extract_book_data_single_page(category_page):
    """
    Verify `extract_book_data` correctly extracts book data from a single page.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[BOOK_DATA_FOR_TEST[0]]
        )
    ):
        books = extract_book_data(
            category_page,
            list(TEST_CATEGORIES_WITH_LINKS[0].values())[0]
        )
        assert len(books) == 1
        assert books == [BOOK_DATA_FOR_TEST[0]]

def test_extract_book_data_no_books(category_page):
    """
    Verify `extract_book_data` returns an empty list when no books are found.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[]
        )
    ):
        books = extract_book_data(
            category_page,
            list(TEST_CATEGORIES_WITH_LINKS[0].values())[0]
        )
        assert len(books) == 0
        assert books == []

def test_extract_book_data_invalid_data(category_page):
    """
    Verify `extract_book_data` correctly handles invalid data and returns the valid data.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[BOOK_DATA_FOR_TEST[0]]
        )
    ):
        books = extract_book_data(
            category_page,
            list(TEST_CATEGORIES_WITH_LINKS[0].values())[0]
        )
        assert len(books) == 1
        assert books == [BOOK_DATA_FOR_TEST[0]]

"""
TESTS FOR MULTIPLE PAGES
"""

# Helper to make many books for testing
def book_data_generator():
    i = 2
    while True:
        yield [{'title': f'Book {i}', 'price': round(float(i), 2)}]
        i += 1

def test_extract_book_data_multiple_pages(category_page):
    """
    Verify `extract_book_data` correctly extracts book data from multiple pages.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_number_of_book_pages',
            return_value=3
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[BOOK_DATA_FOR_TEST[0]]
        ),
        patch(
            '__main__.extract_additional_page_book_data',
            side_effect=[[BOOK_DATA_FOR_TEST[1]], [BOOK_DATA_FOR_TEST[2]]]
        ) as mock_extract_additional_page_book_data
    ):
        books = extract_book_data(
            category_page,
            list(TEST_CATEGORIES_WITH_LINKS[0].values())[0]
        )
        assert mock_extract_additional_page_book_data.call_count == 2
        assert len(books) == len(BOOK_DATA_FOR_TEST)
        assert books == BOOK_DATA_FOR_TEST
        mock_extract_additional_page_book_data.assert_called()

def test_extract_book_data_max_pages(category_page):
    """
    Verify `extract_book_data` correctly extracts book data from the maximum number of pages.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_number_of_book_pages',
            return_value=100
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[BOOK_DATA_FOR_TEST[0]]
        ),
        patch(
            '__main__.extract_additional_page_book_data',
            side_effect=book_data_generator()
        )
    ):
        books = extract_book_data(
            category_page,
            list(TEST_CATEGORIES_WITH_LINKS[0].values())[0]
        )
        assert len(books) == 100
        assert books[0] == BOOK_DATA_FOR_TEST[0]
        assert books[-1] == {'title': 'Book 100', 'price': 100.00}

### Test `extract_book_data_from_page`

In [None]:
# Test `extract_book_data_from_page`
from bs4 import BeautifulSoup
from unittest.mock import patch

def test_extract_book_data_from_page_normal_case():
    """
    Verify `extract_book_data_from_page` correctly extracts book data from a page with multiple books.
    """  # noqa: E501
    page = BeautifulSoup(BOOK_DATA_HTML_VALID, HTML_PARSER)

    with patch(
        '__main__.extract_book_data_from_article',
        side_effect=[
            TEST_BOOK_DATA[0],
            TEST_BOOK_DATA[1]
        ]
    ):
        books = extract_book_data_from_page(page)
        assert len(books) == 2
        assert books == [
            TEST_BOOK_DATA[0],
            TEST_BOOK_DATA[1]
        ]

def test_extract_book_data_from_page_no_books():
    """
    Verify `extract_book_data_from_page` returns an empty list when no books are found on the page.
    """  # noqa: E501
    page = BeautifulSoup(TEST_HTML, HTML_PARSER)
    books = extract_book_data_from_page(page)
    assert len(books) == 0
    assert books == []

def test_extract_book_data_from_page_empty_page():
    """
    Verify `extract_book_data_from_page` returns an empty list when the page is empty.
    """  # noqa: E501
    page = BeautifulSoup('', HTML_PARSER)
    books = extract_book_data_from_page(page)
    assert len(books) == 0
    assert books == []

def test_extract_book_data_from_page_invalid_data():
    """
    Verify `extract_book_data_from_page` correctly handles invalid book data and returns the valid data.
    """  # noqa: E501
    page = BeautifulSoup(BOOK_DATA_HTML_INVALID, HTML_PARSER)

    with patch(
        '__main__.extract_book_data_from_article',
        return_value=TEST_BOOK_DATA[3]
    ):
        books = extract_book_data_from_page(page)
        assert len(books) == 1
        assert books == [TEST_BOOK_DATA[3]]

def test_extract_book_data_from_page_none_case():
    """
    Verify `extract_book_data_from_page` returns an empty list when the page is None.
    """  # noqa: E501
    page = None

    with patch(
        '__main__.extract_book_data_from_article',
        return_value=None
    ):
        books = extract_book_data_from_page(page)
        assert len(books) == 0
        assert books == []


### Test `extract_book_data_from_article`

In [None]:
# Test `extract_book_data_from_article`
from bs4 import BeautifulSoup
from unittest.mock import patch

def test_extract_book_data_from_article_normal_case():
    """
    Verify `extract_book_data_from_article` correctly extracts book data from an article with valid data.
    """  # noqa: E501
    article = BeautifulSoup(BOOK_DATA_HTML_VALID, HTML_PARSER)
    with patch(
        '__main__.extract_element',
        side_effect=[
            BeautifulSoup(SINGLE_BOOK_TITLE_HTML_VALID, HTML_PARSER),
            BeautifulSoup(SINGLE_BOOK_PRICE_HTML_VALID, HTML_PARSER),
        ]
    ):
        book = extract_book_data_from_article(article)
        assert book == TEST_BOOK_DATA[0]

def test_extract_book_data_from_article_truncated_title():
    """
    Verify `extract_book_data_from_article` correctly extracts book data from an article with a truncated title.
    """  # noqa: E501
    article = BeautifulSoup(BOOK_DATA_HTML_TRUNCATED_TITLE, HTML_PARSER)
    with patch(
        '__main__.extract_element',
        side_effect=[
            BeautifulSoup(
                SINGLE_BOOK_TITLE_HTML_TRUNCATED,
                HTML_PARSER
            ),
            BeautifulSoup(
                SINGLE_BOOK_TITLE_TRUN_PRICE_HTML_VALID,
                HTML_PARSER
            ),
        ]
    ):
        book = extract_book_data_from_article(article)
        assert book == TEST_BOOK_DATA[8]

def test_extract_book_data_from_article_invalid_price():
    """
    Verify `extract_book_data_from_article` returns the price str when the price is invalid.
    """  # noqa: E501
    article = BeautifulSoup(BOOK_DATA_HTML_INVALID, HTML_PARSER)
    with patch(
        '__main__.extract_element',
        side_effect=[
            BeautifulSoup(SINGLE_BOOK_TITLE_HTML_VALID, HTML_PARSER),
            BeautifulSoup(SINGLE_BOOK_PRICE_HTML_INVALID, HTML_PARSER),
        ]
    ):
        book = extract_book_data_from_article(article)
        assert book == {
            'title': TEST_BOOK_DATA[0]['title'],
            'price': TEST_BOOK_DATA[3]['price']
        }

def test_extract_book_data_from_article_missing_price():
    """
    Verify `extract_book_data_from_article` returns None for price when element is missing from the article.
    """  # noqa: E501
    article = BeautifulSoup(BOOK_DATA_HTML_MISSING_PRICE, HTML_PARSER)
    with patch(
        '__main__.extract_element',
        side_effect=[
            BeautifulSoup(SINGLE_BOOK_TITLE_HTML_VALID, HTML_PARSER),
            None
        ]
    ):
        book = extract_book_data_from_article(article)
        assert book == {
            'title': TEST_BOOK_DATA[0]['title'],
            'price': None
        }

def test_extract_book_data_from_article_missing_title_attribute():
    """
    Verify `extract_book_data_from_article` returns None for the title when attribute is missing.
    """  # noqa: E501
    article = BeautifulSoup(BOOK_DATA_HTML_NO_TITLE_ATTR, HTML_PARSER)
    with patch(
        '__main__.extract_element',
        side_effect=[
            BeautifulSoup(SINGLE_BOOK_TITLE_HTML_NO_ATTR, HTML_PARSER),
            BeautifulSoup(SINGLE_BOOK_PRICE_HTML_VALID, HTML_PARSER),
        ]
    ):
        book = extract_book_data_from_article(article)
        assert book == {
            'title': None,
            'price': TEST_BOOK_DATA[0]['price']
        }

### Test `extract_number_of_book_pages`

In [None]:
# Test `extract_number_of_book_pages`
import pytest
from unittest.mock import patch
from bs4 import BeautifulSoup

@pytest.fixture
def category_page():
    return BeautifulSoup(CATEGORY_HTML_VALID, 'html.parser')

def test_extract_number_of_book_pages_pagination_not_found(
    category_page
):
    """
    Verify `extract_number_of_book_pages` returns 1 when no pagination is found.
    """  # noqa: E501
    with patch('__main__.extract_element', return_value=None):
        assert extract_number_of_book_pages(category_page) == 1

def test_extract_number_of_book_pages_pagination_found_valid_text(
    category_page
):
    """
    Verify `extract_number_of_book_pages` correctly extracts the number of pages from valid pagination text.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_element',
            return_value=BeautifulSoup(PAGINATION_VALID, 'html.parser').li
        )
    ):
        assert extract_number_of_book_pages(category_page) == 4

def test_extract_number_of_book_pages_pagination_found_invalid_text(
    category_page
):
    """
    Verify `extract_number_of_book_pages` returns 1 when the pagination text is invalid.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_element',
            return_value=BeautifulSoup(PAGINATION_INVALID, 'html.parser').li
        )
    ):
        assert extract_number_of_book_pages(category_page) == 1

def test_extract_number_of_book_pages_pagination_edge_case_page_400(
    category_page
):
    """
    Verify `extract_number_of_book_pages` correctly extracts the number of pages from edge case pagination text 'Page 1 of 400'.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_element',
            return_value=BeautifulSoup(PAGINATION_EDGE_400, 'html.parser').li
        )
    ):
        assert extract_number_of_book_pages(category_page) == 400

def test_extract_number_of_book_pages_pagination_edge_case_page_40(
    category_page
):
    """
    Verify `extract_number_of_book_pages` correctly extracts the number of pages from edge case pagination text 'Page 1 of 40'.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_element',
            return_value=BeautifulSoup(PAGINATION_EDGE_40, 'html.parser').li
        )
    ):
        assert extract_number_of_book_pages(category_page) == 40

def test_extract_number_of_book_pages_pagination_edge_case_page_4(
    category_page
):
    """
    Verify `extract_number_of_book_pages` correctly extracts the number of pages from edge case pagination text 'Page 1 of 4'.
    """  # noqa: E501
    with (
        patch(
            '__main__.extract_element',
            return_value=BeautifulSoup(PAGINATION_EDGE_4, 'html.parser').li
        )
    ):
        assert extract_number_of_book_pages(category_page) == 4
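
The pagination tests above feed text such as 'Page 1 of 400' into `extract_number_of_book_pages` and expect a fallback of 1 when the pagination element is missing or its text is malformed. A minimal sketch of that parsing step follows, assuming a regular expression is an acceptable approach. `parse_page_count` is a hypothetical helper used only for illustration; it is not the notebook's production function.

```python
import re
from typing import Optional


# Hypothetical helper for illustration only, not the notebook's production code
def parse_page_count(pagination_text: Optional[str]) -> int:
    """Return the total page count from text like 'Page 1 of 4', or 1 if unparseable."""
    # Treat a missing element (None) the same as empty text
    match = re.search(r'Page\s+\d+\s+of\s+(\d+)', pagination_text or '')
    return int(match.group(1)) if match else 1


page_counts = [
    parse_page_count('Page 1 of 400'),
    parse_page_count('Page 1 of 40'),
    parse_page_count('Page 1 of 4'),
    parse_page_count('no pagination here'),
    parse_page_count(None),
]
```

Capturing only the digits after "of" means 'Page 1 of 4', 'Page 1 of 40' and 'Page 1 of 400' are told apart, which is the distinction the three edge-case tests check.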

### Test `extract_additional_page_book_data`

In [None]:
# Test `extract_additional_page_book_data`
import pytest
from bs4 import BeautifulSoup
from unittest.mock import patch

# Book data for testing
BOOK_DATA_FOR_TESTS = [
    TEST_BOOK_DATA[0],
    TEST_BOOK_DATA[1]
]

@pytest.fixture
def category_link():
    return f"{TEST_URL}/category1.html"

@pytest.fixture
def book_data_html_valid_data_dict():
    return {'data': BOOK_DATA_HTML_VALID}

@pytest.fixture
def empty_page_data_dict():
    return {'data': TEST_HTML}

@pytest.fixture
def invalid_url_page_data():
    return {'data': ''}

@pytest.fixture
def book_data_html_missing_price_data_dict():
    return {'data': BOOK_DATA_HTML_MISSING_PRICE}

@pytest.fixture
def book_data_html_missing_title_data_dict():
    return {'data': BOOK_DATA_HTML_MISSING_TITLE}

@pytest.fixture
def book_data_html_mixed_data_dict():
    return {'data': BOOK_DATA_HTML_MIXED}

def test_extract_additional_page_book_data_normal_case(
    category_link,
    book_data_html_valid_data_dict
):
    """
    Verify `extract_additional_page_book_data` correctly extracts book data from a page with multiple books.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value=book_data_html_valid_data_dict
        ),
        patch(
            '__main__.extract_book_data_from_page',
            side_effect=[[BOOK_DATA_FOR_TESTS[0], BOOK_DATA_FOR_TESTS[1]]]
        )
    ):
        books = extract_additional_page_book_data(category_link, 2)
        assert len(books) == 2
        assert books == [BOOK_DATA_FOR_TESTS[0], BOOK_DATA_FOR_TESTS[1]]

def test_extract_additional_page_book_data_invalid_url(
    category_link,
    invalid_url_page_data
):
    """
    Verify `extract_additional_page_book_data` returns an empty list when the URL is invalid.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value=invalid_url_page_data
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[]
        )
    ):
        books = extract_additional_page_book_data(category_link, 2)
        assert len(books) == 0
        assert books == []

def test_extract_additional_page_book_data_empty_page(
    category_link,
    empty_page_data_dict
):
    """
    Verify `extract_additional_page_book_data` returns an empty list when the page is empty.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value=empty_page_data_dict
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[]
        )
    ):
        books = extract_additional_page_book_data(category_link, 2)
        assert len(books) == 0
        assert books == []

def test_extract_additional_page_book_data_no_books(
    category_link,
    empty_page_data_dict
):
    """
    Verify `extract_additional_page_book_data` returns an empty list when no books are found on the page.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value=empty_page_data_dict
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[]
        )
    ):
        books = extract_additional_page_book_data(category_link, 2)
        assert len(books) == 0
        assert books == []

def test_extract_additional_page_book_data_request_failure(
    category_link
):
    """
    Verify `extract_additional_page_book_data` returns an empty list when the request to scrape the page fails.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            side_effect=Exception('Request failed')
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[]
        )
    ):
        books = extract_additional_page_book_data(category_link, 2)
        assert len(books) == 0
        assert books == []

def test_extract_additional_page_book_data_missing_price(
    category_link,
    book_data_html_missing_price_data_dict
):
    """
    Verify `extract_additional_page_book_data` correctly handles book data with missing price.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value=book_data_html_missing_price_data_dict
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[TEST_BOOK_DATA[4]]
        )
    ):
        books = extract_additional_page_book_data(category_link, 2)
        assert len(books) == 1
        assert books == [TEST_BOOK_DATA[4]]

def test_extract_additional_page_book_data_missing_title(
    category_link,
    book_data_html_missing_title_data_dict
):
    """
    Verify `extract_additional_page_book_data` correctly handles book data with missing title.
    """  # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value=book_data_html_missing_title_data_dict
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[TEST_BOOK_DATA[4]]
        )
    ):
        books = extract_additional_page_book_data(category_link, 2)
        assert len(books) == 1
        assert books == [TEST_BOOK_DATA[4]]

def test_extract_additional_page_book_data_mixed_data(
    category_link,
    book_data_html_mixed_data_dict
):
    """
    Verify `extract_additional_page_book_data` correctly handles mixed valid and invalid book data.
    """   # noqa: E501
    with (
        patch(
            '__main__.request_to_scrape',
            return_value=book_data_html_mixed_data_dict
        ),
        patch(
            '__main__.extract_book_data_from_page',
            return_value=[
                TEST_BOOK_DATA[0],
                TEST_BOOK_DATA[4],
                TEST_BOOK_DATA[5],
                TEST_BOOK_DATA[7]
            ]
        )
    ):
        books = extract_additional_page_book_data(category_link, 2)
        assert len(books) == 4
        assert books == [
            TEST_BOOK_DATA[0],
            TEST_BOOK_DATA[4],
            TEST_BOOK_DATA[5],
            TEST_BOOK_DATA[7]  
        ]


### Run the tests

Run the cell containing the `ipytest.run()` command to execute the tests.  The tests should all fail until you have written the production code.

Don't forget to run the installation and initialisation cell as well the first time you run the tests!
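
The test cells above replace collaborators with `unittest.mock.patch`, using `return_value` for a fixed result and `side_effect` for a sequence of results or an exception. The sketch below shows those three behaviours on a plain `Mock`, which is the kind of object `patch` installs by default. The name `fake_page_fetcher` and the URL are assumptions made for illustration.

```python
from unittest.mock import Mock

# return_value: every call returns the same object
fixed = Mock(return_value=[])
first_call = fixed('anything')

# side_effect with a list: each call returns the next item in order
# (hypothetical stand-in for a patched page-fetching helper)
fake_page_fetcher = Mock(
    side_effect=[[{'title': 'Book 2'}], [{'title': 'Book 3'}]]
)
books = [{'title': 'Book 1'}]
for page_number in (2, 3):
    books.extend(
        fake_page_fetcher('https://example.com/category1.html', page_number)
    )
titles = [book['title'] for book in books]

# side_effect with an exception: the call raises instead of returning
failing = Mock(side_effect=Exception('Request failed'))
try:
    failing('https://example.com/category1.html')
    raised = False
except Exception:
    raised = True
```

A generator passed as `side_effect`, as in the maximum-pages test, behaves like the list form: each call consumes the next yielded value.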


---

# Python Production Code


Develop any functions for use as production code in separate cells below. The first has been provided as an example under the Production Constants; add others as necessary.

### PRODUCTION CONSTANTS

In [None]:
# PRODUCTION CONSTANTS

# Constants for status messages
STATUS_SUCCESS = "success"
STATUS_ERROR = "error"
ERROR_NOT_HTML = "The response is not HTML"
ERROR_REQUEST_FAILED = "Request failed for URL"
ERROR_UNEXPECTED = "Unexpected error for URL"

# HTML Parser
HTML_PARSER = "html.parser"

### `request_to_scrape` Production Code

In [None]:
# request_to_scrape Production Code
import requests
from requests.exceptions import RequestException, Timeout

def request_to_scrape(url: str, timeout: int = 10) -> dict:
    """
    Sends an HTTP GET request to the specified URL and returns the response content.

    Args:
        url (str): The URL to which the GET request is sent.
        timeout (int, optional): The timeout for the request in seconds. Defaults to 10.

    Returns:
        Dict[str, str]: A dictionary containing the status, data, and any error messages.
                        - If the request is successful and returns HTML, 'status' is 'success' and 'data' contains the response text.
                        - If the request fails or does not return HTML, 'status' is 'error' and 'error' contains the error message.
    """  # noqa: E501
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for bad responses
        response.raise_for_status()
        # Check if the response contains HTML
        if 'text/html' in response.headers.get('Content-Type', ''):
            return {
                "status": STATUS_SUCCESS,
                "data": response.text
            }
        else:
            return {
                "status": STATUS_ERROR,
                "error": ERROR_NOT_HTML
            }
    except (Timeout, RequestException) as e:
        return {
            "status": STATUS_ERROR,
            "error": f"{str(e)} - {ERROR_REQUEST_FAILED}"
        }
    except Exception as e:
        return {
            "status": STATUS_ERROR,
            "error": f"{str(e)} - {ERROR_UNEXPECTED}"
        }
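
`request_to_scrape` only includes the `'data'` key when the request succeeds, so callers may want to check `'status'` before reading it. The sketch below shows one way to do that. `html_or_empty` is a hypothetical helper, not part of the notebook, and the constant is redefined here only to keep the example self-contained.

```python
STATUS_SUCCESS = "success"  # mirrors the production constant


# Hypothetical helper for illustration only
def html_or_empty(result: dict) -> str:
    """Return the HTML from a `request_to_scrape` style result, or '' on error."""
    if result.get("status") == STATUS_SUCCESS:
        return result.get("data", "")
    return ""


ok_html = html_or_empty({"status": "success", "data": "<html></html>"})
error_html = html_or_empty(
    {"status": "error", "error": "The response is not HTML"}
)
```

Returning an empty string keeps downstream parsing simple, because `BeautifulSoup('', 'html.parser')` yields an empty document rather than raising.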


### `extract_book_categories` Production Code

In [None]:
# `extract_book_categories` Production Code
from bs4 import BeautifulSoup

def extract_book_categories(html: str, site: str) -> dict:
    """
    Extracts book categories and their corresponding links from the provided HTML content.
    Args:
        html (str): The HTML content of the webpage to parse.
        site (str): The URL of the site from which the HTML content was retrieved.
    Returns:
        dict: A dictionary where the keys are category names and the values are dictionaries containing the corresponding links.
    """  # noqa: E501
    soup = BeautifulSoup(html, HTML_PARSER)
    nav_list = extract_element(soup, 'ul', 'nav nav-list')
    if nav_list is None:
        return {}
    category_list = extract_element(nav_list, 'ul')
    categories = extract_categories_and_links(category_list, site)
    return categories


### `extract_element` Production Code

In [None]:
# extract_element Production Code
from bs4 import BeautifulSoup

def extract_element(
    soup: BeautifulSoup, tag: str, class_name: str = None
) -> BeautifulSoup:
    """
    Extracts an HTML element from a BeautifulSoup object based on the specified tag and optional class name.
    Args:
        soup (BeautifulSoup): The BeautifulSoup object to search within.
        tag (str): The HTML tag to search for.
        class_name (str, optional): The class name to filter the search. Defaults to None.
    Returns:
        Tag or None: The first matching Tag object if found, otherwise None.
    """  # noqa: E501
    if soup is None:
        return None
    return soup.find(tag, class_=class_name) if class_name else soup.find(tag)

### `extract_categories_and_links` Production Code

In [None]:
# `extract_categories_and_links` Production Code
from bs4 import BeautifulSoup

def extract_categories_and_links(
    category_list: BeautifulSoup, site: str
) -> dict:
    """
    Extracts categories and their corresponding links from a given list of HTML anchor elements.
    Args:
        category_list (BeautifulSoup object): A BeautifulSoup object containing a list of HTML anchor elements.
        site (str): The site URL to append to relative links.
    Returns:
        dict: A dictionary keyed by category name (str), where each value is a dictionary holding the corresponding link (str, or None if the anchor has no href) under the 'link' key.
        e.g.
        {
            'Category 1': {'link': 'https://www.example.com/category1.html'},
            'Category 2': {'link': 'https://www.example.com/category2.html'}
        }
    """  # noqa: E501
    if not category_list:
        return {}

    categories = {}
    for link in category_list.find_all('a'):
        category_name = link.get_text(strip=True)
        category_href = link.get('href')
        categories[category_name] = {
            'link': f"{site}/{category_href}" if category_href else None
        }

    return categories
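
The function above builds each link with an f-string, which works when `site` has no trailing slash and the `href` is a simple relative path. As an alternative sketch (not what the notebook uses), the standard library's `urllib.parse.urljoin` also resolves `../` segments. The base URL below is an assumption taken from the docstring example.

```python
from urllib.parse import urljoin

site = 'https://www.example.com'  # assumed base URL, as in the docstring example

# A simple relative href resolved against the site root
absolute = urljoin(f'{site}/', 'category1.html')

# A parent-relative href resolved against a page inside a subdirectory
parent_relative = urljoin(
    f'{site}/catalogue/index.html', '../category2.html'
)
```

Note that `urljoin` resolves relative to the last path segment of the base, so the base should be the URL of the page that contained the link.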

### `extract_category_data` Production Code

In [None]:
# `extract_category_data` Production Code
from bs4 import BeautifulSoup

def extract_category_data(
    category: dict, 
    get_book_data: bool = False
) -> dict:
    """
    Extracts the number of books in a category and optionally the book data.
    Args:
        category (dict): A dictionary containing the category name and link.
        get_book_data (bool, optional): A flag to indicate whether to extract book data. Defaults to False.
    Returns:
        dict: A dictionary containing the category name, link, and number of books.
    """  # noqa: E501
    category_page = request_to_scrape(category['link'])
    soup = BeautifulSoup(category_page['data'], HTML_PARSER)

    number_of_books = extract_number_in_category(soup)
    category['number_of_books'] = number_of_books if isinstance(
        number_of_books, 
        int
    ) else 0
    if get_book_data:
        category['books'] = extract_book_data(soup, category)

    return category


### `extract_number_in_category` Production Code

In [None]:
# extract_number_in_category Production Code
from bs4 import BeautifulSoup

def extract_number_in_category(category_page: BeautifulSoup) -> int:
    """
    Extracts the number of books in a category from the category page.
    Args:
        category_page (BeautifulSoup): The BeautifulSoup object of the category page.
    Returns:
        int: The number of books in the category.
    """  # noqa: E501
    form = extract_element(category_page, 'form', 'form-horizontal')
    try:
        number_of_books = int(
            extract_element(form, 'strong').get_text(strip=True)
        )
    except (AttributeError, ValueError):
        number_of_books = 0
    return number_of_books


### `extract_all_category_data` Production Code

In [None]:
# extract_all_category_data Production Code

def extract_all_category_data(
    categories: dict,
    get_book_data: bool = False
) -> dict:
    """
    Extracts data for all categories and updates the categories dictionary with the extracted data.

    Args:
        categories (Dict[str, Dict[str, Any]]): A dictionary where the keys are category names and the values are dictionaries containing category information.
        get_book_data (bool, optional): A flag to indicate whether to extract book data. Defaults to False.

    Returns:
        dict: The updated categories dictionary with extracted data.
    """  # noqa: E501
    for category in categories.values():
        category_data = extract_category_data(category, get_book_data)
        category.update(category_data)
    return categories

### `extract_book_data` Production Code

In [None]:
# `extract_book_data` Production Code
from bs4 import BeautifulSoup

def extract_book_data(category_page: BeautifulSoup, category: dict) -> list:
    """
    Extracts book data from a category page.
    Args:
        category_page (BeautifulSoup): The BeautifulSoup object of the category page.
        category (dict): A dictionary containing the category name and link.
    Returns:
        list: A list of dictionaries, each containing the data for one book.
    """  # noqa: E501
    books = []
    number_of_pages = extract_number_of_book_pages(category_page)
    books.extend(extract_book_data_from_page(category_page))
    if number_of_pages > 1:
        for page_number in range(2, number_of_pages + 1):
            try:
                books.extend(
                    extract_additional_page_book_data(
                        category['link'],
                        page_number
                    )
                )
            except Exception:
                break
    return books
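
The paging logic in `extract_book_data` can be tried without any network access.  The sketch below is not the production function: `collect_books` and `fake_fetch` are hypothetical names, and the fetcher is passed in as a parameter so that a failure can be simulated.

```python
from typing import Callable, List


def collect_books(
    first_page_books: List[dict],
    number_of_pages: int,
    fetch_page: Callable[[int], List[dict]],
) -> List[dict]:
    """Combine page 1 with pages 2..number_of_pages, stopping on an error."""
    books = list(first_page_books)
    for page_number in range(2, number_of_pages + 1):
        try:
            books.extend(fetch_page(page_number))
        except Exception:
            # Give up on the remaining pages but keep what was collected
            break
    return books


# Hypothetical fetcher: page 3 fails, so page 4 is never requested
def fake_fetch(page_number: int) -> List[dict]:
    if page_number == 3:
        raise ConnectionError('simulated failure')
    return [{'title': f'Book from page {page_number}', 'price': 1.0}]


first_page = [{'title': 'Book from page 1', 'price': 1.0}]
books = collect_books(first_page, 4, fake_fetch)
print([book['title'] for book in books])
# ['Book from page 1', 'Book from page 2']
```

Because `range(2, number_of_pages + 1)` is empty when there is only one page, the sketch needs no separate `if number_of_pages > 1` check.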

### `extract_book_data_from_page` Production Code

In [None]:
# extract_book_data_from_page Production Code
from bs4 import BeautifulSoup

def extract_book_data_from_page(page: BeautifulSoup) -> list:
    """
    Extracts book data from a category page.
    Args:
        page (BeautifulSoup): The BeautifulSoup object of the category page.
    Returns:
        list: A list of dictionaries containing book data.
    """  # noqa: E501
    if page is None:
        return []
    book_articles = page.find_all(class_='product_pod')
    books = [
        extract_book_data_from_article(book_markup) 
        for book_markup in book_articles
    ]
    return books

### `extract_book_data_from_article` Production Code

In [None]:
# extract_book_data_from_article Production Code

def extract_book_data_from_article(article: BeautifulSoup) -> dict:
    """
    Extracts book data from an article element.
    Args:
        article (BeautifulSoup): The BeautifulSoup object of the article element.
    Returns:
        dict: A dictionary containing the book title and price.
    """  # noqa: E501
    # Initialise both values so every except branch below can use them safely
    title = None
    price_str = None
    try:
        title_element = extract_element(article, 'h3')
        title = (
            title_element.a['title']
            if title_element and title_element.a
            else None
        )
        price_element = extract_element(article, 'p', 'price_color')
        price_str = price_element.get_text(strip=True)
        # Strip the pound sign, including its mis-decoded 'Â£' form
        price = float(price_str.replace('Â', '').replace('£', ''))

        return {
            'title': title,
            'price': price,
        }
    except ValueError:
        return {
            'title': title,
            'price': price_str
        }
    except Exception:
        return {
            'title': title or None,
            'price': None
        }
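
The `Â£` prefix handled above is what a UTF-8 encoded `£` looks like when its bytes are decoded as Latin-1, so whether it appears depends on how the response was decoded.  The sketch below shows one way to clean either form of the price text.  `parse_price` is a hypothetical helper, not part of the production code, and it returns `None` on failure rather than the raw string.

```python
from typing import Optional


def parse_price(price_text: str) -> Optional[float]:
    """Convert text such as '£51.77' or 'Â£51.77' to a float, or None."""
    cleaned = price_text.replace('Â', '').replace('£', '').strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_price('Â£51.77'))  # 51.77
print(parse_price('£10.00'))   # 10.0
print(parse_price('n/a'))      # None
```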


### `extract_number_of_book_pages` Production Code

In [None]:
# extract_number_of_book_pages Production Code

def extract_number_of_book_pages(category_page: BeautifulSoup) -> int:
    """
    Extracts the number of pages of books in a category from the category page text 
    showing number of pages, or 1 if not found.
    Args:
        category_page (BeautifulSoup): The BeautifulSoup object of the category page.
    Returns:
        int: The number of pages of books in the category.
    """  # noqa: E501
    pagination = extract_element(category_page, 'li', 'current')
    if pagination is None:
        return 1
    pagination_text = pagination.get_text(strip=True)
    try:
        return int(pagination_text.split()[-1])
    except (ValueError, IndexError):  # IndexError: empty pagination text
        return 1
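
The parsing step can be checked on plain strings.  This sketch assumes the pagination text has the form `Page 1 of N`.  `parse_page_count` is a hypothetical helper that also catches `IndexError`, so empty text falls back to 1 as well.

```python
def parse_page_count(pagination_text: str) -> int:
    """Return N from text like 'Page 1 of N', or 1 if it cannot be parsed."""
    try:
        return int(pagination_text.split()[-1])
    except (ValueError, IndexError):
        return 1


print(parse_page_count('Page 1 of 4'))  # 4
print(parse_page_count(''))             # 1
print(parse_page_count('Page one'))     # 1
```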


### `extract_additional_page_book_data` Production Code

In [None]:
# `extract_additional_page_book_data` Production Code

def extract_additional_page_book_data(
    category_link: str,
    page_number: int
) -> list:
    """
    Extracts additional book data from a specific page number in a category.
    Args:
        category_link (str): The link to the category page.
        page_number (int): The page number to extract book data.
    Returns:
        list: A list of dictionaries containing book data.
    """  # noqa: E501
    page_url = category_link.replace('index.html', f"page-{page_number}.html")
    try:
        page_data = request_to_scrape(page_url)
        page_soup = BeautifulSoup(page_data['data'], HTML_PARSER)
        return extract_book_data_from_page(page_soup)
    except Exception:
        return []
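
The `str.replace` call above leaves the link unchanged when it does not contain `index.html`, in which case every additional request would fetch the first page again.  One possible alternative, sketched here with the standard library's `urljoin`, builds the page URL relative to the category link.  `build_page_url` and the example link are illustrative assumptions.

```python
from urllib.parse import urljoin


def build_page_url(category_link: str, page_number: int) -> str:
    """Return the URL of page `page_number` in the same folder as the link."""
    return urljoin(category_link, f'page-{page_number}.html')


# Illustrative link only; the real links come from the scraped categories
link = 'https://example.com/catalogue/category/books/mystery_3/index.html'
print(build_page_url(link, 2))
# https://example.com/catalogue/category/books/mystery_3/page-2.html
```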

---

# Python Execution Code

Use the cells below to write any code that calls the functions developed above.  Add further cells as needed so that you don't have to re-run all of the code when you develop further scripts.

In [None]:
# Python Execution Code 1

# Get the site's homepage
site = 'http://books.toscrape.com'
home_page = request_to_scrape(site)

In [None]:
# Python Execution Code 2

# Extract book categories from the homepage
categories = extract_book_categories(home_page['data'], site)


In [None]:
# Python Execution Code 3

# category_data = extract_category_data(categories['Travel'])

# print(category_data)

# Get just the categories and the number of books in the category

# category_data = extract_all_category_data(categories)

# for category, data in category_data.items():
#     print(f"Category: {category}")
#     for key, value in data.items():
#         print(f"  {key}: {value}")

In [None]:
# Python Execution Code 4

# Get the book data as well as the number of books in each category
category_data = extract_all_category_data(categories, True)

for category, data in category_data.items():
    print(f"Category: {category}")
    for key, value in data.items():
        print(f"  {key}: {value}")

In [None]:
# Python Execution Code 5
import json

with open('category_data.json', 'w', encoding='utf-8') as file:
    json.dump(category_data, file, indent=2)
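
To confirm that the saved file can be read back, a round trip like the one below can help.  This is a sketch only: `sample_data` is a hypothetical stand-in with roughly the shape of `category_data`, and it is written to a temporary folder so nothing in the project is overwritten.

```python
import json
import os
import tempfile

# Hypothetical sample with the same general shape as category_data
sample_data = {
    'Travel': {
        'link': 'https://example.com/travel/index.html',
        'number_of_books': 1,
        'books': [{'title': 'Example Book', 'price': 9.99}],
    }
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'category_data.json')
    with open(path, 'w', encoding='utf-8') as file:
        json.dump(sample_data, file, indent=2)
    with open(path, encoding='utf-8') as file:
        loaded = json.load(file)

# Dicts, lists, strings and numbers survive the round trip unchanged
print(loaded == sample_data)  # True
```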

---

# Jupyter Notebook Test and Linting Set Up

To run `pytest` scripts in a Jupyter Notebook cell, we need to install the `ipytest` package.  This package is NOT required for a pipeline and therefore it can be removed from the `requirements.txt` file before adding the production code to the pipeline.

To run linting, we need to install 2 packages `nbqa` and `flake8`.  We will make sure that `flake8` is included in the `requirements.txt` file when constructing the pipeline so that we can lint as part of the pipeline tests.

Run the following cell to install the `ipytest`, `nbqa` and `flake8` packages.  The command does not install a coverage package; if you want to check that all of your production code is executed during the tests, add one (for example `pytest-cov`) to the `pip install` command.

The install cell only needs to be run once per environment.  The `ipytest` configuration cell that follows it must be re-run whenever the notebook kernel is restarted.


In [None]:
# Install the `ipytest`, `nbqa` and `flake8` packages
!pip install ipytest nbqa flake8

### Set up `ipytest` to execute `pytest` scripts in Jupyter Notebook

In [None]:
# Configure ipytest for Jupyter Notebook

import ipytest
ipytest.autoconfig(rewrite_asserts=True, magics=True)

### Create a *config* file for `flake8`

Run this cell to create a `.flake8` config file in your project root.

In [None]:
# Create a config file and ignore some flake8 rules
!echo "[flake8]" > .flake8
!echo "ignore = E402, W291, F811" >> .flake8

# Execute the tests and linting in the Jupyter Notebook

Run the following cells ***EVERY TIME*** you want to test and lint your code.  The tests that are executed are the ones written in the *Python Tests* cell above.

>**Note:**
>
> This entire section does not need to be part of any pipeline scripts.  
> It is only required for the Jupyter Notebook environment during development.


## Run the tests

In [None]:
# Run the tests
ipytest.run("-vv", "-s")

## Run the linter

Run this cell each time you want to lint your code.

In [None]:
# Run the linter
!nbqa flake8 --show-source --format=pylint webscraping.ipynb


---
