![Digital Futures](https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/datascience-notebook-header.png?raw=true)

# Setting up the Project Environment

Ultimately, the Python code written here will be extracted to scripts for execution in an automated pipeline.  To support this, set up a project environment in which the code runs in a controlled and reproducible way.

In the initial stages of the activities, the packages needed are `requests`, `pytest` and *BeautifulSoup* (package name `beautifulsoup4`).  The `requests` package is used to make HTTP requests, `pytest` is used for testing the code, and *BeautifulSoup* is used to parse the returned HTML.  In later activities, you may need to install additional packages.  To do this, add the packages to the `pip install` command below and re-run the cell.

> **Remember:** The goal is to create a set of code cells that can be extracted to separate scripts for execution in an automated pipeline.  Therefore, the code should be kept in 3 distinct cells:
> 
> - **Shell Commands**:  Used to set up the project environment
> 
> - **Python Tests**: Used to test the Python production scripts both now and as part of the automated pipeline
> 
> - **Python Production Code**: The Python code that will be extracted to a script to execute during the pipeline

---

# Environment Setup Scripts

If you are running this notebook after cloning and have not set up your environment to run shell commands, you will need to run the following commands in your terminal to set up the environment.

> **NOTE:**  These commands need to be executed in the terminal.  
>
> Open a terminal at the root of your project before executing these commands
> 
> Until your environment is set up, Jupyter Notebooks will not be able to run **shell** scripts.

```sh
# Create a virtual environment
# Note: on some systems the command is `python -m venv .venv`; `python3` and `python` usually point to the Python version installed on your system
python3 -m venv .venv

# Activate the virtual environment 
source .venv/bin/activate

# Install required package to execute shell commands from Jupyter Notebook
pip install ipykernel               ## OR 
pip install -r requirements.txt     ## IF there is already a requirements.txt file CONTAINING ipykernel in the project
```
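
Once the environment is activated, it can be useful to confirm that Python is actually running from it. The sketch below is one way to check this from Python; the helper name `in_virtualenv` is hypothetical and not part of the original notebook.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints `False` after activation, the notebook kernel is probably using a different interpreter from the one in `.venv`.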


In [None]:
# Install the necessary packages
!pip install requests pytest beautifulsoup4

# Create a requirements.txt file
!pip freeze > requirements.txt

> **Note:** 
> The `!` at the beginning of the lines is a special character in Jupyter Notebooks that allows you to run shell commands from the notebook.  
> These will need to be removed from any commands that are to be exported to a `.sh` shell script file for the pipeline.
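
As a rough illustration of that conversion, the sketch below strips the leading `!` from notebook shell lines so they could be written to a `.sh` file. The function name `notebook_lines_to_shell` is an assumption made for this example.

```python
def notebook_lines_to_shell(lines):
    """Convert notebook cell lines into plain shell script lines."""
    shell_lines = []
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith("!"):
            # Drop only the leading "!" that Jupyter uses to mark a shell command
            shell_lines.append(stripped[1:])
        else:
            # Comments and blank lines are kept unchanged
            shell_lines.append(line)
    return shell_lines


cell = [
    "# Install the necessary packages",
    "!pip install requests pytest beautifulsoup4",
    "# Create a requirements.txt file",
    "!pip freeze > requirements.txt",
]
print("\n".join(notebook_lines_to_shell(cell)))
```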

---

# Python Tests

Develop any tests for functions in separate cells below.  The first has been provided for you as an example; add others as necessary.

### Test `request_to_scrape`

In [59]:

from unittest.mock import patch
import requests  # Assuming the production function uses the requests library

dummy_url = "https://www.example.com"
dummy_html = "<html><body><h1>Hello, World!</h1></body></html>"

# Test request_to_scrape
def test_request_to_scrape():
    # Arrange
    with patch('requests.get') as mock_get:
        # Configure the mock to return a response with status_code 200
        mock_get.return_value.status_code = 200

        # Act
        request_to_scrape(dummy_url)  # The return value is not needed for this test
        
        # Assert
        mock_get.assert_called_once_with(dummy_url)  # Ensure the mock was called with the correct URL
        
def test_request_to_scrape_returns_html():
    # Arrange
    
    with patch('requests.get') as mock_get:
        # Configure the mock to return a response with status_code 200 and dummy HTML content
        mock_get.return_value.status_code = 200
        mock_get.return_value.text = dummy_html
        # Act
        result = request_to_scrape(dummy_url)
        # Assert
        assert result.text == dummy_html  # Check that the returned content matches the dummy HTML
        
def test_for_non_200_status():
    #Arrange
    
    with patch('requests.get') as mock_get:
        mock_get.return_value.status_code = 404   
        #Act
        result = request_to_scrape(dummy_url)
        #Assert
        assert result == "Error:page not found"
        
def test_request_handles_exception():
    #Arrange
    
    with patch('requests.get') as mock_get:
        mock_get.side_effect = requests.exceptions.RequestException("Connection Error")
        #Act
        result = request_to_scrape(dummy_url)
        #Assert
        assert result == "An error occurred: Connection Error"
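
The tests above rely on two features of `unittest.mock`: attributes set on `return_value` are what the caller sees, and `side_effect` makes the mock raise. The sketch below shows both on a standalone `Mock`, without `requests`; the built-in `ConnectionError` stands in for `requests.exceptions.RequestException` purely for illustration.

```python
from unittest.mock import Mock

# A mock standing in for requests.get
mock_get = Mock()
mock_get.return_value.status_code = 404

response = mock_get("https://www.example.com")
print(response.status_code)  # the value configured on return_value
mock_get.assert_called_once_with("https://www.example.com")

# side_effect makes the call raise instead of returning a value
failing_get = Mock(side_effect=ConnectionError("Connection Error"))
try:
    failing_get("https://www.example.com")
except ConnectionError as error:
    message = f"An error occurred: {error}"
    print(message)
```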

### Test `extract_element`

In [66]:
from bs4 import BeautifulSoup
import pytest

def test_extract_element_with_tag_and_class():
    # Arrange
    html_doc = "<html><body><h1 class='title'>Welcome</h1><p class='content'>Content here</p></body></html>"
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # Act
    result = extract_element(soup, 'h1', 'title')
    
    # Assert
    assert result is not None
    assert result.text == "Welcome"
    
def test_extract_element_with_tag_only():
    # Arrange
    html_doc = "<html><body><h1 class='title'>Welcome</h1><p class='content'>Content here</p></body></html>"
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # Act
    result = extract_element(soup, 'p')
    
    # Assert
    assert result is not None
    assert result.text == "Content here"

def test_extract_element_with_no_matching_class():
    # Arrange
    html_doc = "<html><body><h1 class='title'>Welcome</h1><p class='content'>Content here</p></body></html>"
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # Act
    result = extract_element(soup, 'h1', 'nonexistent-class')
    
    # Assert
    assert result is None

def test_extract_element_with_no_matching_tag():
    # Arrange
    html_doc = "<html><body><h1 class='title'>Welcome</h1><p class='content'>Content here</p></body></html>"
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # Act
    result = extract_element(soup, 'div', 'nonexistent-class')
    
    # Assert
    assert result is None

def test_extract_element_with_multiple_matching_elements_tag_and_class():
    # Arrange
    html_doc = "<html><body><h1 class='title'>First</h1><h1 class='title'>Second</h1></body></html>"
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # Act
    result = extract_element(soup, 'h1', 'title')
    
    # Assert
    assert result is not None
    assert result.text == "First"  # The first matching element

def test_extract_element_with_multiple_matching_elements_tag_only():
    # Arrange
    html_doc = "<html><body><h1>First</h1><h1>Second</h1></body></html>"
    soup = BeautifulSoup(html_doc, 'html.parser')
    
    # Act
    result = extract_element(soup, 'h1')
    
    # Assert
    assert result is not None
    assert result.text == "First"  # The first matching element

### Tests for `extract_categories_and_links`

In [77]:
from bs4 import BeautifulSoup
import pytest

# Test 1: Valid HTML with multiple categories
def test_extract_categories_and_links_valid_html():
    html_doc = """
    <html><body>
    <ul class="categories">
        <li><a href="cat1">Category 1</a></li>
        <li><a href="cat2">Category 2</a></li>
        <li><a href="cat3">Category 3</a></li>
    </ul>
    </body></html>
    """
    soup = BeautifulSoup(html_doc, 'html.parser')
    site = "http://example.com"
    
    result = extract_categories_and_links(soup, site)
    
    # The production function stores each URL under a 'link' key
    expected = {
        "Category 1": {"link": "http://example.com/cat1"},
        "Category 2": {"link": "http://example.com/cat2"},
        "Category 3": {"link": "http://example.com/cat3"}
    }
    
    assert result == expected

# Test 2: HTML without any categories
def test_extract_categories_and_links_no_categories():
    html_doc = """
    <html><body>
    <ul class="other">
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
    </body></html>
    """
    soup = BeautifulSoup(html_doc, 'html.parser')
    site = "http://example.com"
    
    result = extract_categories_and_links(soup, site)
    
    assert result == {}

# Test 3: Empty HTML
def test_extract_categories_and_links_empty_html():
    html_doc = ""
    soup = BeautifulSoup(html_doc, 'html.parser')
    site = "http://example.com"
    
    result = extract_categories_and_links(soup, site)
    
    assert result == {}

# Test 4: Invalid HTML
def test_extract_categories_and_links_invalid_html():
    html_doc = "<html><body><ul class='categories'><li><a href='cat1'>Category 1</a></ul></body></html>"
    soup = BeautifulSoup(html_doc, 'html.parser')  # Invalid HTML, missing closing tag for <li>
    site = "http://example.com"
    
    result = extract_categories_and_links(soup, site)
    
    # The production function stores each URL under a 'link' key
    expected = {
        "Category 1": {"link": "http://example.com/cat1"}
    }
    
    assert result == expected

# Test 5: HTML with nested categories
def test_extract_categories_and_links_nested_categories():
    html_doc = """
    <html><body>
    <ul class="categories">
        <li><a href="cat1">Category 1</a></li>
        <li>
            <ul>
                <li><a href="cat2">Category 2</a></li>
            </ul>
        </li>
    </ul>
    </body></html>
    """
    soup = BeautifulSoup(html_doc, 'html.parser')
    site = "http://example.com"
    
    result = extract_categories_and_links(soup, site)
    
    # The production function stores each URL under a 'link' key
    expected = {
        "Category 1": {"link": "http://example.com/cat1"},
        "Category 2": {"link": "http://example.com/cat2"}
    }
    
    assert result == expected

### Test the `extract_book_categories` Function

In [82]:
from bs4 import BeautifulSoup
import pytest


# Assuming extract_element and extract_categories_and_links are defined in the same module as extract_book_categories

# Test 1: Correct integration of helper functions (without patching)
def test_extract_book_categories_integration():
    # Arrange (a navigation list plus a link outside it that should be ignored)
    html_doc = """
    <html><body>
    <a href="other">Other link</a>
    <ul class="nav nav-list">
        <li><a href="cat1">Category 1</a></li>
        <li><a href="cat2">Category 2</a></li>
    </ul>
    </body></html>
    """
    site = "http://example.com"

    # Act
    # Call the real function so extract_element and extract_categories_and_links run together
    result = extract_book_categories(html_doc, site)

    # Assert
    # Only links inside the navigation list are returned
    assert result == {
        "Category 1": {"link": "http://example.com/cat1"},
        "Category 2": {"link": "http://example.com/cat2"}
    }

# Test 2: Handling different HTML structures
def test_extract_book_categories_invalid_structure():
    # Arrange (HTML without expected category structure)
    html_doc = """
    <html><body>
    <div class="other">
        <p>Some random text</p>
    </div>
    </body></html>
    """
    soup = BeautifulSoup(html_doc, 'html.parser')
    site = "http://example.com"
    
    # Act
    result = extract_book_categories(soup, site)
    
    # Assert (Expected result when no categories are found)
    assert result == {}

def test_extract_book_categories_empty_html():
    # Arrange (Empty HTML)
    html_doc = ""
    soup = BeautifulSoup(html_doc, 'html.parser')
    site = "http://example.com"
    
    # Act
    result = extract_book_categories(soup, site)
    
    # Assert (Expected result when no categories are found)
    assert result == {}

def test_extract_book_categories_invalid_html():
    # Arrange (Malformed HTML: missing closing tag for <li>)
    html_doc = "<html><body><ul class='nav nav-list'><li><a href='cat1'>Category 1</a></ul></body></html>"
    site = "http://example.com"
    
    # Act (the raw HTML is passed in; the production function parses it)
    result = extract_book_categories(html_doc, site)
    
    # Assert (Expected result when a valid category is found)
    assert result == {
        "Category 1": {"link": "http://example.com/cat1"}
    }

# Test 3: Returning the expected results
def test_extract_book_categories_returns_expected_results():
    # Arrange (Valid HTML with the navigation list the production code looks for)
    html_doc = """
    <html><body>
    <ul class="nav nav-list">
        <li><a href="cat1">Category 1</a></li>
        <li><a href="cat2">Category 2</a></li>
    </ul>
    </body></html>
    """
    site = "http://example.com"
    
    # Act (the raw HTML is passed in; the production function parses it)
    result = extract_book_categories(html_doc, site)
    
    # Assert
    assert result == {
        "Category 1": {"link": "http://example.com/cat1"},
        "Category 2": {"link": "http://example.com/cat2"}
    }

### Run the tests

Run the cell containing the `ipytest.run()` command to execute the tests.  The tests should all fail until you have written the production code.

Don't forget to run the installation and initialisation cells as well the first time you run the tests!


---

# Python Production Code


Develop any functions for use as production code in separate cells below. The first has been provided as an example under the Production Constants; add others as necessary.

### PRODUCTION CONSTANTS

In [79]:
# PRODUCTION CONSTANTS

# Constants for status messages
STATUS_SUCCESS = "success"
STATUS_ERROR = "error"
ERROR_NOT_HTML = "The response is not HTML"
ERROR_REQUEST_FAILED = "Request failed for URL"
ERROR_UNEXPECTED = "Unexpected error for URL"

# HTML Parser
HTML_PARSER = "html.parser"

### `request_to_scrape` Production Code

In [None]:
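import requests


def request_to_scrape(url: str):
    """Request a page; return the response on status 200, otherwise an error message string."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response
        else:
            return "Error:page not found"
    except requests.exceptions.RequestException as e:
        # Covers connection problems, timeouts and other request failures
        return f"An error occurred: {str(e)}"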
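
Because `request_to_scrape` returns either a response object or an error string, callers may want to check which one they received before reading `.content`. The sketch below is one possible guard; `is_successful_response` is a hypothetical helper, and `SimpleNamespace` is used only as a stand-in for a real response object.

```python
from types import SimpleNamespace


def is_successful_response(result) -> bool:
    """Return True when the result looks like a response rather than an error string."""
    # request_to_scrape returns a plain string for both error cases
    return not isinstance(result, str)


# Stand-in for a successful response (assumption: only status_code and content are needed here)
fake_response = SimpleNamespace(status_code=200, content=b"<html></html>")

for result in (fake_response, "Error:page not found"):
    if is_successful_response(result):
        print("Got content of length", len(result.content))
    else:
        print("Request failed:", result)
```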

### `extract_book_categories` Production Code

In [137]:
from bs4 import BeautifulSoup

def extract_book_categories(html, site: str, html_parser: str = 'html.parser'):
    """
    Extracts book categories and their links from the given HTML.
    
    Args:
        html (str or response object): The HTML content or response object to parse.
        site (str): The base URL of the website.
        html_parser (str): The parser to use for BeautifulSoup.
        
    Returns:
        dict: A dictionary of category names as keys and their full URLs as values.
    """
    try:
        # If `html` is a response object, extract the content; otherwise, assume it's a raw HTML string
        html_content = html.content if hasattr(html, 'content') else html
        #print(html_content)
        # Parse the HTML
        soup = BeautifulSoup(html_content, html_parser) if html_content else None
        
        # Handle empty or invalid HTML
        if not soup:
            return {}

        # Extract the navigation list
        nav_list = extract_element(soup, 'ul', 'nav nav-list')

        # Handle cases where the navigation list is not found
        if not nav_list:
            return {}

        # Extract categories and links
        categories = extract_categories_and_links(nav_list, site)
        
        return categories
    except Exception as e:
        # Gracefully handle exceptions and return an empty dictionary
        print(f"An error occurred: {str(e)}")
        return {}


### `extract_element` Production code

In [167]:
from bs4 import BeautifulSoup

def extract_element(soup, tag, class_name=None):
    """Return the first element matching the tag (and class, if given), or None."""
    if class_name is not None:
        return soup.find(tag, class_=class_name)
    return soup.find(tag)

### `extract_categories_and_links` Production code

In [166]:
def extract_categories_and_links(categories_list, site):
    """Map each category name to a dict holding its full link (None when the anchor has no href)."""
    categories = {}
    for link in categories_list.find_all('a'):
        category_name = link.get_text(strip=True)
        category_href = link.get('href')
        categories[category_name] = {'link': f"{site}/{category_href}"} if category_href else None
    return categories
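
The f-string above joins `site` and the `href` with a single `/`, which yields a double slash if `site` already ends in one. As an alternative technique (not what the notebook uses), the standard library's `urllib.parse.urljoin` handles both cases; the helper name `build_category_url` is hypothetical.

```python
from urllib.parse import urljoin


def build_category_url(site: str, href: str) -> str:
    """Join a base site URL and a relative href into a full URL."""
    # Ensure the base ends with "/" so the href is resolved relative to the full base URL
    base = site if site.endswith("/") else site + "/"
    return urljoin(base, href)


# Both calls produce the same URL, with or without a trailing slash on the site
print(build_category_url("http://example.com", "cat1"))
print(build_category_url("http://example.com/", "cat1"))
```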


---

# Python Execution Code

Develop any code to call the developed functions below.  Add additional cells so you don't need to re-run all of the code when you develop further scripts.

In [160]:
site = "http://books.toscrape.com"
home_page = request_to_scrape(site)

In [176]:
categories = extract_book_categories(home_page,site)
print(categories)

{'Books': {'link': 'http://books.toscrape.com/catalogue/category/books_1/index.html'}, 'Travel': {'link': 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'}, 'Mystery': {'link': 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html'}, 'Historical Fiction': {'link': 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html'}, 'Sequential Art': {'link': 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html'}, 'Classics': {'link': 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html'}, 'Philosophy': {'link': 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html'}, 'Romance': {'link': 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html'}, 'Womens Fiction': {'link': 'http://books.toscrape.com/catalogue/category/books/womens-fiction_9/index.html'}, 'Fiction': {'link': 'http://books.toscrape.com/catalogue/category/books/fiction_10/index.

In [189]:
for category in categories.keys():
    extract_category_data(categories[category])

print(categories)

{'Books': {'link': 'http://books.toscrape.com/catalogue/category/books_1/index.html', 'number_of_books': 1000}, 'Travel': {'link': 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html', 'number_of_books': 11}, 'Mystery': {'link': 'http://books.toscrape.com/catalogue/category/books/mystery_3/index.html', 'number_of_books': 32}, 'Historical Fiction': {'link': 'http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html', 'number_of_books': 26}, 'Sequential Art': {'link': 'http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html', 'number_of_books': 75}, 'Classics': {'link': 'http://books.toscrape.com/catalogue/category/books/classics_6/index.html', 'number_of_books': 19}, 'Philosophy': {'link': 'http://books.toscrape.com/catalogue/category/books/philosophy_7/index.html', 'number_of_books': 11}, 'Romance': {'link': 'http://books.toscrape.com/catalogue/category/books/romance_8/index.html', 'number_of_books': 35}, 'Womens Ficti
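
Once each category carries a `number_of_books` value, the dictionary can be summarised with ordinary Python. The sketch below uses the first five entries from the output above as sample data (links omitted for brevity) and sorts them by book count; the function name `rank_categories` is an assumption.

```python
def rank_categories(categories: dict) -> list:
    """Return (name, number_of_books) pairs, largest category first."""
    counts = [
        (name, data["number_of_books"])
        for name, data in categories.items()
        if data and "number_of_books" in data  # skip categories without data
    ]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


sample = {
    "Books": {"number_of_books": 1000},
    "Travel": {"number_of_books": 11},
    "Mystery": {"number_of_books": 32},
    "Historical Fiction": {"number_of_books": 26},
    "Sequential Art": {"number_of_books": 75},
}

for name, count in rank_categories(sample):
    print(f"{name}: {count}")
```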

### `extract_category_data` Production Code

In [188]:
from bs4 import BeautifulSoup

def extract_category_data(category, get_book_data=False):
    """Add the number of books to a category dict by scraping its page.

    Note: `get_book_data` is accepted but not used yet.
    """
    category_page = request_to_scrape(category['link'])
    soup = BeautifulSoup(category_page.content, HTML_PARSER)
    category['number_of_books'] = extract_number_in_category(soup)

    return category

### `extract_number_in_category` Production Code

In [183]:
def extract_number_in_category(category_page):
    """Return the book count shown in the `form-horizontal` form of a parsed category page."""
    form = extract_element(category_page, 'form', 'form-horizontal')
    num_of_books_in_category = int(extract_element(form, 'strong').get_text(strip=True))
    return num_of_books_in_category
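
`extract_number_in_category` assumes the form and its `<strong>` element are always present and that the text is a plain integer. A more defensive text-to-number step might look like the sketch below; `parse_book_count` is a hypothetical helper, and returning `None` on failure is a design choice made for this example.

```python
from typing import Optional


def parse_book_count(text: Optional[str]) -> Optional[int]:
    """Convert the text of a count element to an int, or None if it is not a number."""
    if text is None:
        return None
    cleaned = text.strip().replace(",", "")  # tolerate whitespace and thousands separators
    return int(cleaned) if cleaned.isdecimal() else None


print(parse_book_count(" 1000 "))
print(parse_book_count("1,000"))
print(parse_book_count("no results"))
```

Returning `None` lets the caller decide whether a missing count should be skipped, logged, or treated as an error.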

---

# Jupyter Notebook Test and Linting Set Up

To run `pytest` scripts in a Jupyter Notebook cell, we need to install the `ipytest` package.  This package is NOT required for a pipeline and therefore it can be removed from the `requirements.txt` file before adding the production code to the pipeline.

To run linting, we need to install two packages, `nbqa` and `flake8`.  We will make sure that `flake8` is included in the `requirements.txt` file when constructing the pipeline so that we can lint as part of the pipeline tests.

Run the following cell to install the `ipytest`, `nbqa` and `flake8` packages.  The command as written does not install a coverage package; if you want to check whether all of your production code is executed during the tests, add one (for example `pytest-cov`) to the command.

This cell only needs to be run once (or after restarting the notebook kernel) to set up the environment for testing and linting.


In [None]:
# Install the `ipytest`, `nbqa` and `flake8` packages
!pip install ipytest nbqa flake8

### Set up `ipytest` to execute `pytest` scripts in Jupyter Notebook

In [None]:
# Configure ipytest for Jupyter Notebook

import ipytest
ipytest.autoconfig(rewrite_asserts=True, magics=True)

### Create a *config* file for `flake8`

Run this script to create a `.flake8` file in your project root.

In [None]:
# Create a config file and ignore some flake8 rules
!echo "[flake8]" > .flake8
!echo "ignore = E402, W291, F811" >> .flake8
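
The `echo` commands above rely on shell quoting, which behaves differently between shells (for example, the Windows command prompt keeps the quotes). A portable alternative is to write the same file from Python. This is a sketch, not part of the original notebook; `write_flake8_config` is a hypothetical helper, and the example writes to a temporary directory rather than the project root.

```python
import tempfile
from pathlib import Path


def write_flake8_config(directory: Path, ignore=("E402", "W291", "F811")) -> Path:
    """Write a .flake8 file with an ignore list and return its path."""
    config_path = directory / ".flake8"
    config_path.write_text(f"[flake8]\nignore = {', '.join(ignore)}\n")
    return config_path


# Demonstrate in a temporary directory so no project files are touched
with tempfile.TemporaryDirectory() as tmp:
    path = write_flake8_config(Path(tmp))
    content = path.read_text()
    print(content)
```

To create the file in the project root instead, the helper could be called with `Path(".")`.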

# Execute the tests and linting in the Jupyter Notebook

Run the following cell ***EVERY TIME*** you want to run the tests and linting that you have written in the *Python Tests* cell above.

>**Note:**
>
> This entire section does not need to be part of any pipeline scripts.  
> It is only required for the Jupyter Notebook environment during development.


## Run the tests

In [None]:
# Run the tests
ipytest.run("-vv", "-ss")

## Run the linter

Run this script each time you want to lint your code.

In [71]:
# Run the linter
!nbqa flake8 --show-source --format=pylint webscraping.ipynb


---
