Conversation
@payala payala commented Mar 13, 2025

Adding a pydantic schema to SmartScrapeGraph was not working: the format instructions were being appended to the prompt text, which broke the prompt template's variable parsing.

This PR removes the appended "IMPORTANT: " text. The format_instructions are already passed to the prompt as a template variable, so the appended copy is redundant, and it is what breaks the prompt when a schema is supplied.

This is my first contribution to this project. I tried to follow all the guidelines; please let me know if I should do anything differently.
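For context, the failure mode can be illustrated without any of the project's code. A minimal sketch, using plain str.format as a stand-in for the prompt template engine (the schema JSON below is invented for illustration): appending the format instructions to the prompt text introduces literal braces that the template parser reads as placeholders, while passing them in as a template variable leaves the template intact.

```python
# Minimal sketch of the bug described above. Plain str.format stands in
# for the prompt template engine; the JSON is a made-up example of
# pydantic-style format instructions.
format_instructions = '{"title": "AnswerSchema", "type": "object"}'

# Appending the instructions puts literal braces into the template text,
# so the parser treats them as placeholders and fails.
template_broken = "Answer the question.\nIMPORTANT: " + format_instructions
try:
    template_broken.format()
except (KeyError, IndexError):
    print("template parsing broke on the literal braces")

# Passing the instructions as a variable keeps the template text clean;
# the substituted value is never re-parsed for placeholders.
template_ok = "Answer the question.\n{format_instructions}"
print(template_ok.format(format_instructions=format_instructions))
```

LangChain's PromptTemplate behaves analogously: braces in the template text itself are parsed as input variables, but values substituted into the template are not re-parsed.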

VinciGit00 and others added 14 commits March 9, 2025 15:09
## [1.41.0](ScrapeGraphAI/Scrapegraph-ai@v1.40.1...v1.41.0) (2025-03-09)

### Features

* add CLoD integration ([4e0e785](ScrapeGraphAI@4e0e785))

### Test

* Add coverage improvement test for tests/test_generate_answer_node.py ([6769c0d](ScrapeGraphAI@6769c0d))
* Add coverage improvement test for tests/test_models_tokens.py ([b21e781](ScrapeGraphAI@b21e781))
* Update coverage improvement test for tests/graphs/abstract_graph_test.py ([f296ac4](ScrapeGraphAI@f296ac4))

### CI

* **release:** 1.41.0-beta.1 [skip ci] ([7bfe494](ScrapeGraphAI@7bfe494))
## [1.42.1](ScrapeGraphAI/Scrapegraph-ai@v1.42.0...v1.42.1) (2025-03-12)

### Bug Fixes

* add new gpt model ([cff799b](ScrapeGraphAI@cff799b))
## [1.43.0](ScrapeGraphAI/Scrapegraph-ai@v1.42.1...v1.43.0) (2025-03-13)

### Features

* add integration for o3-mini ([fc0a148](ScrapeGraphAI@fc0a148))
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. bug Something isn't working tests Improvements or additions to test labels Mar 13, 2025
Contributor

codebeaver-ai bot commented Mar 13, 2025

I opened a Pull Request with the following:

🔄 4 test files added and 7 test files updated to reflect recent changes.
🐛 Found 1 bug
🛠️ 94/133 tests passed

🔄 Test Updates

I've added or updated 8 tests. They all pass ☑️
Updated Tests:

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_json

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_xml

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_csv

  • tests/nodes/fetch_node_test.py 🩹

    Fixed: tests/nodes/fetch_node_test.py::test_fetch_txt

  • tests/graphs/abstract_graph_test.py 🩹

    Fixed: tests/graphs/abstract_graph_test.py::TestAbstractGraph::test_create_llm[llm_config5-ChatBedrock]

  • tests/graphs/abstract_graph_test.py 🩹

    Fixed: tests/graphs/abstract_graph_test.py::TestAbstractGraph::test_create_llm_with_rate_limit[llm_config5-ChatBedrock]

  • tests/utils/test_proxy_rotation.py 🩹

    Fixed: tests/utils/test_proxy_rotation.py::test_parse_or_search_proxy_success

New Tests:

  • tests/test_generate_answer_node.py

🐛 Bug Detection

Potential issues:

  • scrapegraphai/utils/research_web.py
    The error occurs in the test_google_search function: the test expects exactly 2 results from search_on_web, but receives 4 instead, which causes the assertion to fail.
    Breaking down the problem:
    1. The test calls search_on_web("test query", search_engine="duckduckgo", max_results=2).
    2. The function is expected to return 2 results (as specified by max_results=2).
    3. However, the function actually returns 4 results.
    This suggests that search_on_web is not limiting the number of results to the specified max_results parameter when using the DuckDuckGo search engine.
    The issue is likely in the DuckDuckGo branch of search_on_web, specifically this part of the code:
if search_engine == "duckduckgo":
    research = DuckDuckGoSearchResults(max_results=max_results)
    res = research.run(query)
    results = re.findall(r"https?://[^\s,\]]+", res)

The DuckDuckGoSearchResults object is created with the correct max_results, but the results are then extracted using a regex pattern. This regex extraction might not be respecting the max_results limit.
To fix this, the code should explicitly limit the number of results after the regex extraction:

results = re.findall(r"https?://[^\s,\]]+", res)[:max_results]

This change would ensure that no more than max_results URLs are returned, regardless of how many are found by the regex.
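The effect of the proposed slice can be checked in isolation. In this sketch, the raw result string is fabricated to mimic a search-wrapper response containing four links even though only two were requested:

```python
import re

# Fabricated raw output standing in for the DuckDuckGo wrapper's
# response: four URLs, although max_results=2 was requested upstream.
res = (
    "snippet, link: https://example.com/a], "
    "snippet, link: https://example.com/b], "
    "snippet, link: https://example.com/c], "
    "snippet, link: https://example.com/d]"
)
max_results = 2

# Without the slice, every URL the regex finds is returned:
all_urls = re.findall(r"https?://[^\s,\]]+", res)
print(len(all_urls))  # 4

# The proposed fix caps the list after extraction:
results = re.findall(r"https?://[^\s,\]]+", res)[:max_results]
print(results)  # ['https://example.com/a', 'https://example.com/b']
```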

Test Error Log
tests/utils/research_web_test.py::test_google_search: def test_google_search():
        """Tests search_on_web with Google search engine."""
>       results = search_on_web("test query", search_engine="Google", max_results=2)
tests/utils/research_web_test.py:10: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
query = 'test query', search_engine = 'google', max_results = 2, port = 8080
timeout = 10, proxy = None, serper_api_key = None, region = None
language = 'en'
    def search_on_web(
        query: str,
        search_engine: str = "duckduckgo",
        max_results: int = 10,
        port: int = 8080,
        timeout: int = 10,
        proxy: str | dict = None,
        serper_api_key: str = None,
        region: str = None,
        language: str = "en",
    ) -> List[str]:
        """Search web function with improved error handling and validation
    
        Args:
            query (str): Search query
            search_engine (str): Search engine to use
            max_results (int): Maximum number of results to return
            port (int): Port for SearXNG
            timeout (int): Request timeout in seconds
            proxy (str | dict): Proxy configuration
            serper_api_key (str): API key for Serper
            region (str): Country/region code (e.g., 'mx' for Mexico)
            language (str): Language code (e.g., 'es' for Spanish)
        """
    
        # Input validation
        if not query or not isinstance(query, str):
            raise ValueError("Query must be a non-empty string")
    
        search_engine = search_engine.lower()
        valid_engines = {"duckduckgo", "bing", "searxng", "serper"}
        if search_engine not in valid_engines:
>           raise ValueError(f"Search engine must be one of: {', '.join(valid_engines)}")
E           ValueError: Search engine must be one of: searxng, duckduckgo, serper, bing
scrapegraphai/utils/research_web.py:45: ValueError
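Note that the log above shows the validation failing before any result counting can happen: the test passes "Google", which lowercases to "google", and that value is absent from valid_engines. A standalone reproduction of just the validation step (validate_engine is a hypothetical helper extracted from the traceback, not a function in the codebase):

```python
# Hypothetical helper reproducing only the validation step from the
# traceback above: "google" is not in the accepted engine set, so the
# ValueError fires before any search is attempted.
def validate_engine(search_engine: str) -> str:
    search_engine = search_engine.lower()
    valid_engines = {"duckduckgo", "bing", "searxng", "serper"}
    if search_engine not in valid_engines:
        raise ValueError(
            f"Search engine must be one of: {', '.join(valid_engines)}"
        )
    return search_engine

try:
    validate_engine("Google")
except ValueError as exc:
    # Set iteration order varies, so the listed engines may appear in
    # any order in the message.
    print(exc)
```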

☂️ Coverage Improvements

Coverage improvements by file:

  • tests/nodes/fetch_node_test.py

    New coverage: 71.30%
    Improvement: +71.30%

  • tests/graphs/abstract_graph_test.py

    New coverage: 71.88%
    Improvement: +71.88%

  • tests/utils/test_proxy_rotation.py

    New coverage: 0.00%
    Improvement: +0.00%

  • tests/test_generate_answer_node.py

    New coverage: 85.71%
    Improvement: +8.73%

🎨 Final Touches

  • I ran the hooks included in the pre-commit config.


@VinciGit00
Collaborator

Hi @payala,

Could you please add a screenshot of the results?

@payala
Author

payala commented Mar 21, 2025

Do you mean this, @VinciGit00?
[screenshot of results]

@VinciGit00
Collaborator

Yes thx

@VinciGit00 VinciGit00 changed the base branch from main to pre/beta March 21, 2025 08:17
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 21, 2025
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 21, 2025
@VinciGit00 VinciGit00 merged commit 16de81f into ScrapeGraphAI:pre/beta Mar 21, 2025
4 checks passed

🎉 This PR is included in version 1.43.1-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀


🎉 This PR is included in version 1.43.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
