#  <h1 style="text-align: center;"><font color="red">🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper</font></h1>

<font color="pink">Senior Data Scientist: Dr. Eddy Giusepe Chirinos Isidro</font>

<div align="center">

<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)

[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)

[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)

</div>

``Crawl4AI`` simplifies asynchronous web crawling and data extraction, making web content accessible to large language models (``LLMs``) and ``AI`` applications. 🆓🌐

- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [https://crawl4ai.com](https://crawl4ai.com)

# <font color="green">Quickstart with Crawl4AI</font>

## <font color="pink">Installation</font>

Install ``Crawl4AI`` and its required dependencies:

In [None]:
%%capture
! pip install -U crawl4ai
! pip install nest_asyncio

In [1]:
# Check crawl4ai version
import crawl4ai
print(crawl4ai.__version__.__version__)

0.5.0.post8


## <font color="pink">1. ``Setup Crawl4ai``</font>

The following command installs ``Playwright`` and its dependencies and updates some settings for ``Crawl4AI``. After that, you can run the ``doctor`` command to confirm that everything works as expected.


In [None]:
%%capture
! crawl4ai-setup

In [None]:
! crawl4ai-doctor

In [None]:
# If you hit an error, try installing the browser manually
# !playwright install --with-deps chrome  # Recommended for Colab/Linux

I suggest first running the code below to confirm that ``Playwright`` is installed and working correctly.

In [2]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

from playwright.async_api import async_playwright

async def test_browser():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto('https://example.com')
        print(f'Title: {await page.title()}')
        await browser.close()

asyncio.run(test_browser())

Title: Example Domain


## <font color="pink">2. ``Basic Setup and Simple Crawl``</font>

In [3]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def simple_crawl():
    crawler_run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=crawler_run_config
        )
        #print(result.markdown.raw_markdown[:500].replace("\n", " -- "))  # Print the first 500 characters
        print(result.markdown.raw_markdown)

asyncio.run(simple_crawl())

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://www.kidocode.com/degrees/technology... | Status: True | Time: 1.40s
[SCRAPE].. ◆ https://www.kidocode.com/degrees/technology... | Time: 0.04s
[COMPLETE] ● https://www.kidocode.com/degrees/technology... | Status: True | Total: 1.45s
[![coding school for kids](https://cdn.prod.website-files.com/61d6943d6b5924685ac825ca/64a6a12136e8f756c9df3baa_k-combomark-white.svg)](https://www.kidocode.com/)
[Trial Class](https://www.kidocode.com/trial-class)
Degrees
degrees
[All Degrees](https://www.kidocode.com/degrees)
[AI Degree](https://www.kidocode.com/degrees/artificial-intelligence)
[Technology Degree](https://www.kidocode.com/degrees/technology)
[Entrepreneurship Degree](https://www.kidocode.com/degrees/entrepreneurship)
About Us
About
[Mission](https://www.kidocode.com/about)
[Team](https://www.kidocode.com/team)
[Contact](https://www.kidocode.com/contact)
Community
[Our community](https://www.kidocode.com/community)
[Gallery](https://www

In [1]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def simple_crawl():
    crawler_run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=crawler_run_config
        )
        
        # 1. Access the raw Markdown content
        print("RAW MARKDOWN (first 500 characters):")
        print(result.markdown.raw_markdown[:500])

        # 2. Access the Markdown with citations
        print("\nMARKDOWN WITH CITATIONS:")
        print(result.markdown.markdown_with_citations[:300])

        # 3. Access the references
        print("\nREFERENCES:")
        print(result.markdown.references_markdown[:300])

        # 4. Access internal and external links
        print("\nLINKS:")
        print(f"Internal links: {len(result.links['internal'])}")
        print(f"External links: {len(result.links['external'])}")
        print(f"Sample internal links: {result.links['internal'][:3]}")

        # 5. Access image information
        print("\nIMAGES:")
        if result.media and 'images' in result.media:
            for img in result.media['images'][:3]:  # First 3 images
                print(f"URL: {img['src']}, Alt: {img['alt']}")

        # 6. Read the page metadata (it may be missing or empty)
        print("\nMETADATA:")
        if result.metadata:
            print(f"Title: {result.metadata.get('title', 'N/A')}")
            print(f"Description: {result.metadata.get('description', 'N/A')}")

asyncio.run(simple_crawl())

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://www.kidocode.com/degrees/technology... | Status: True | Time: 1.86s
[SCRAPE].. ◆ https://www.kidocode.com/degrees/technology... | Time: 0.049s
[COMPLETE] ● https://www.kidocode.com/degrees/technology... | Status: True | Total: 1.92s
RAW MARKDOWN (first 500 characters):
[![coding school for kids](https://cdn.prod.website-files.com/61d6943d6b5924685ac825ca/64a6a12136e8f756c9df3baa_k-combomark-white.svg)](https://www.kidocode.com/)
[Trial Class](https://www.kidocode.com/trial-class)
Degrees
degrees
[All Degrees](https://www.kidocode.com/degrees)
[AI Degree](https://www.kidocode.com/degrees/artificial-intelligence)
[Technology Degree](https://www.kidocode.com/degrees/technology)
[Entrepreneurship Degree](https://www.kidocode.com/degrees/entrepreneurship)
About Us


MARKDOWN WITH CITATIONS:
![coding school for kids⟨1⟩](https://www.kidocode.com/)
Trial Class⟨2⟩
Degrees
degrees
All Degrees⟨3⟩
AI Degree⟨4⟩
Technology Degree⟨5⟩
Entrepreneurship Degree⟨6⟩
About
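
The difference between the raw Markdown and the citation-style Markdown printed above can be illustrated with a small standalone sketch. It rewrites simple inline links as numbered markers and collects the URLs in a reference list. The function name `to_citations` and the layout of the printed reference list are assumptions made for illustration; this is not Crawl4AI's implementation, and image links are deliberately left untouched.

```python
import re

# Matches simple inline links such as [text](https://example.com).
# The negative lookbehind skips image syntax (![alt](src)).
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def to_citations(markdown: str) -> tuple[str, list[str]]:
    """Replace inline links with numbered markers; return (text, references)."""
    references: list[str] = []

    def replace(match: re.Match) -> str:
        text, url = match.group(1), match.group(2)
        if url not in references:
            references.append(url)  # Each distinct URL gets one number
        number = references.index(url) + 1
        return f"{text}⟨{number}⟩"

    return LINK_PATTERN.sub(replace, markdown), references


raw = (
    "[Trial Class](https://www.kidocode.com/trial-class)\n"
    "[All Degrees](https://www.kidocode.com/degrees)"
)
cited, refs = to_citations(raw)
print(cited)
for i, url in enumerate(refs, start=1):
    print(f"⟨{i}⟩ {url}")
```

Keeping URLs out of the running text tends to shorten the content passed to a language model while still letting each marker be traced back to its source.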

## <font color="pink">3. ``Dynamic Content Handling``</font>

In [2]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig


async def crawl_dynamic_content():
    # You can use wait_for to wait for a condition to be met before returning the result
    # wait_for = """() => {
    #     return Array.from(document.querySelectorAll('article.tease-card')).length > 10;
    # }"""

    # wait_for can be also just a css selector
    # wait_for = "article.tease-card:nth-child(10)"

    async with AsyncWebCrawler() as crawler:
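        # The snippet is wrapped in an IIFE so that it forms a single expression.
        # The bare `const` statement used in the recorded run was rejected with the
        # "Unexpected token 'const'" error that appears in the output below.
        js_code = [
            "(() => { const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); if (loadMoreButton) loadMoreButton.click(); })();"
        ]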
        config = CrawlerRunConfig(
            cache_mode=CacheMode.ENABLED,
            js_code=js_code,
            # wait_for=wait_for,
        )
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config,

        )
        print(result.markdown.raw_markdown[:500].replace("\n", " -- "))  # Print first 500 characters

asyncio.run(crawl_dynamic_content())

[INIT].... → Crawl4AI 0.5.0.post8
[JS_EXEC]. ℹ Playwright execution error: Page.evaluate: SyntaxError: Unexpected token 'const'
    at eval (<anonymous>)
    at UtilityScript.evaluate (<anonymous>:234:30)
    at UtilityScript.<anonymous> (<anonymous>:1:44)
[FETCH]... ↓ https://www.nbcnews.com/business... | Status: True | Time: 2.62s
[SCRAPE].. ◆ https://www.nbcnews.com/business... | Time: 0.106s
[COMPLETE] ● https://www.nbcnews.com/business... | Status: True | Total: 2.77s
IE 11 is not supported. For an optimal experience visit our site on another browser. -- Skip to Content -- [NBC News Logo](https://www.nbcnews.com) -- Sponsored By --   * [Trump admin](https://www.nbcnews.com/politics/trump-administration/live-blog/trump-tariffs-china-doge-musk-immigration-live-updates-rcna199420) --   * [Tariffs](https://www.nbcnews.com/business/economy/live-blog/trump-reciprocal-tariffs-reactions-markets-consumers-live-updates-rcna199411) --   * [Storms](https://www.nbcnews.com/weather/floods/l
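
The `[JS_EXEC]` log above reports `SyntaxError: Unexpected token 'const'`, which suggests the snippet was evaluated in a position where an expression was expected. One common workaround is to wrap the statements in an immediately invoked function expression (IIFE), the same form used by the longer JavaScript snippets later in this notebook. The helper name `wrap_in_iife` is hypothetical, and whether this resolves the error on a given Crawl4AI version is an assumption worth verifying.

```python
def wrap_in_iife(js_statements: str, is_async: bool = False) -> str:
    """Wrap JavaScript statements in an IIFE so the result is one expression."""
    prefix = "(async () => {" if is_async else "(() => {"
    return f"{prefix} {js_statements.strip()} }})();"


load_more = (
    "const loadMoreButton = Array.from(document.querySelectorAll('button'))"
    ".find(button => button.textContent.includes('Load More'));"
    " if (loadMoreButton) loadMoreButton.click();"
)

# The wrapped string can be passed wherever a list of JS snippets is expected.
js_code = [wrap_in_iife(load_more)]
print(js_code[0])
```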


## <font color="pink">4. ``Content Cleaning and Fit Markdown``</font>

In [2]:
import asyncio
import nest_asyncio
nest_asyncio.apply()
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def clean_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.ENABLED,
            excluded_tags=['nav', 'footer', 'aside'],
            remove_overlay_elements=True,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0),
                options={
                    "ignore_links": True
                }
            ),
        )
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            config=config,
        )
        full_markdown_length = len(result.markdown.raw_markdown)
        fit_markdown_length = len(result.markdown.fit_markdown)
        print(f"Full Markdown Length: {full_markdown_length}")
        print(f"Fit Markdown Length: {fit_markdown_length}")


asyncio.run(clean_content())

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://en.wikipedia.org/wiki/Apple... | Status: True | Time: 2.06s
[SCRAPE].. ◆ https://en.wikipedia.org/wiki/Apple... | Time: 1.351s
[COMPLETE] ● https://en.wikipedia.org/wiki/Apple... | Status: True | Total: 3.43s
Full Markdown Length: 98309
Fit Markdown Length: 77182
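
To put the two lengths above in perspective, the following short calculation shows how much of the raw Markdown the pruning filter removed in this run. The helper name `reduction_percent` is introduced only for illustration; the numbers are copied from the recorded output.

```python
def reduction_percent(full_length: int, fit_length: int) -> float:
    """Return the share of characters removed, as a percentage of the full length."""
    if full_length <= 0:
        raise ValueError("full_length must be positive")
    return 100.0 * (full_length - fit_length) / full_length


full_length, fit_length = 98309, 77182  # Values from the output above
removed = full_length - fit_length
percent = reduction_percent(full_length, fit_length)
print(f"Removed {removed} characters ({percent:.1f}% of the raw Markdown)")
```

A higher `threshold` in `PruningContentFilter` would be expected to prune more aggressively, so this percentage is a convenient number to watch while tuning it.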


## <font color="pink">5. ``Link Analysis and Smart Filtering``</font>

In [1]:
import asyncio
import nest_asyncio
nest_asyncio.apply()
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig


async def link_analysis():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.ENABLED,
            exclude_external_links=True,
            exclude_social_media_links=True,
            # exclude_domains=["facebook.com", "twitter.com"]
        )
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config,
        )
        print(f"Found {len(result.links['internal'])} internal links")
        print(f"Found {len(result.links['external'])} external links")

        for link in result.links['internal'][:5]:
            print(f"Href: {link['href']}\nText: {link['text']}\n")


asyncio.run(link_analysis())

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://www.nbcnews.com/business... | Status: True | Time: 0.03s
[COMPLETE] ● https://www.nbcnews.com/business... | Status: True | Total: 0.04s
Found 116 internal links
Found 60 external links
Href: https://www.nbcnews.com
Text: NBC News Logo

Href: https://www.nbcnews.com/politics/trump-administration/live-blog/trump-tariffs-china-doge-musk-immigration-live-updates-rcna199420
Text: Trump admin

Href: https://www.nbcnews.com/business/economy/live-blog/trump-reciprocal-tariffs-reactions-markets-consumers-live-updates-rcna199411
Text: Tariffs

Href: https://www.nbcnews.com/weather/floods/live-blog/severe-weather-live-updates-life-threatening-catastrophic-flash-floodi-rcna199446
Text: Storms

Href: https://www.nbcnews.com/politics
Text: Politics
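
The internal/external split reported above can be pictured with a small standalone sketch based on `urllib.parse`. This is an assumption about how such a classification might work, not Crawl4AI's own logic; the helper names `base_domain` and `split_links` are hypothetical, and stripping a leading `www.` is a simplification rather than a proper public-suffix lookup.

```python
from urllib.parse import urlparse


def base_domain(url: str) -> str:
    """Return the lowercased host without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def split_links(page_url: str, hrefs: list[str]) -> dict[str, list[str]]:
    """Group hrefs into 'internal' and 'external' relative to page_url."""
    site = base_domain(page_url)
    groups: dict[str, list[str]] = {"internal": [], "external": []}
    for href in hrefs:
        host = base_domain(href)
        # Same host or a subdomain of the crawled site counts as internal.
        is_internal = host == site or host.endswith("." + site)
        groups["internal" if is_internal else "external"].append(href)
    return groups


hrefs = [
    "https://www.nbcnews.com/politics",
    "https://www.nbcnews.com",
    "https://twitter.com/unclecode",
]
groups = split_links("https://www.nbcnews.com/business", hrefs)
print(groups)
```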



## <font color="pink">6. ``Media Handling``</font>

In [1]:
import asyncio
import nest_asyncio
nest_asyncio.apply()
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig

async def media_handling():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            cache_mode=CacheMode.ENABLED,
            exclude_external_images=False,
            # screenshot=True # Set this to True if you want to take a screenshot
        )
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config,
        )
        for img in result.media['images'][:5]:
            print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")

asyncio.run(media_handling())

[INIT].... → Crawl4AI 0.5.0.post8
[FETCH]... ↓ https://www.nbcnews.com/business... | Status: True | Time: 0.03s
[COMPLETE] ● https://www.nbcnews.com/business... | Status: True | Total: 0.04s
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-760x428,f_auto,q_auto:best/rockcms/2025-04/250403-donald-trump-white-house-mn-1515-864cd0.jpg, Alt: President Donald Trump gestures to members of the media before boarding Marine One on the South Lawn of the White House on April 3, 2025., Score: 6
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-60x60,f_auto,q_auto:best/newscms/2024_26/3653104/240627-katherine-doyle.jpg, Alt: , Score: 3
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-60x60,f_auto,q_auto:best/newscms/2024_21/3649017/reuters-live-blog-byline-image_1.png, Alt: , Score: 3
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-762x508,f_auto,q_auto:best/rockcms/2025-04/250403-stellantis-canada-mn-1130-8a295d.jpg, Alt: The 
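
Each image entry printed above carries `src`, `alt`, and `score` keys, so a simple post-processing step can keep only the more relevant images. The sketch below assumes that dictionary shape; the function name `top_images`, the threshold of 5, and the sample URLs are illustrative placeholders, not values taken from the crawl.

```python
def top_images(images: list[dict], min_score: int = 5, require_alt: bool = True) -> list[dict]:
    """Keep images at or above min_score (optionally requiring alt text), best first."""
    selected = [
        img for img in images
        if img.get("score", 0) >= min_score and (img.get("alt") or not require_alt)
    ]
    return sorted(selected, key=lambda img: img.get("score", 0), reverse=True)


# Hypothetical entries shaped like the items in result.media['images'].
sample_images = [
    {"src": "https://example.com/byline.jpg", "alt": "", "score": 3},
    {"src": "https://example.com/hero.jpg", "alt": "Lead photo", "score": 6},
    {"src": "https://example.com/chart.png", "alt": "Chart", "score": 5},
]

for img in top_images(sample_images):
    print(img["src"], img["score"])
```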

## <font color="pink">7. ``Using Hooks for Custom Workflow``</font>

Hooks in ``Crawl4AI`` let you run custom logic at specific stages of the crawling process. This is useful for tasks such as setting custom headers, logging activity, or processing content before it is returned. Below is an example of a basic workflow that uses a hook, followed by a list of available hooks with notes on how to use them.

In [None]:
import asyncio
import nest_asyncio
nest_asyncio.apply()
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig
from playwright.async_api import BrowserContext, Page  # Types used in the hook signature


async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
    """Hook called before navigating to each URL."""
    print(f"[HOOK] before_goto - About to visit: {url}")
    # Example: add custom headers to the request
    await page.set_extra_http_headers({
        "Custom-Header": "my-value"
    })
    return page

async def custom_hook_workflow(verbose=True):
    async with AsyncWebCrawler(config=BrowserConfig(verbose=verbose)) as crawler:
        # Set a 'before_goto' hook to run custom code just before navigation
        crawler.crawler_strategy.set_hook("before_goto", before_goto)

        # Perform the crawl operation
        result = await crawler.arun(
            url="https://crawl4ai.com",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        # Use result.markdown, as in the other cells of this notebook
        print(result.markdown.raw_markdown[:500].replace("\n", " -- "))

asyncio.run(custom_hook_workflow())

Available hooks, with an example for each stage of the ``crawling`` process:

- **on_browser_created**
    ```python
    # **kwargs absorbs any extra keyword arguments the crawler passes to the hook
    async def on_browser_created_hook(browser, **kwargs):
        print("[Hook] Browser created")
    ```

- **before_goto**
    ```python
    async def before_goto_hook(page, context=None, **kwargs):
        await page.set_extra_http_headers({"X-Test-Header": "test"})
    ```

- **after_goto**
    ```python
    async def after_goto_hook(page, context=None, **kwargs):
        print(f"[Hook] Navigated to {page.url}")
    ```

- **on_execution_started**
    ```python
    async def on_execution_started_hook(page, context=None, **kwargs):
        print("[Hook] JavaScript execution started")
    ```

- **before_return_html**
    ```python
    async def before_return_html_hook(page, html, context=None, **kwargs):
        print(f"[Hook] HTML length: {len(html)}")
    ```
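
To make the idea of stage-based hooks concrete, here is a toy registry that calls registered async callbacks at named stages. It is a conceptual sketch only: `HookRegistry`, `toy_crawl`, and the `events` list are hypothetical names, and nothing here reflects how Crawl4AI dispatches hooks internally.

```python
import asyncio
from collections import defaultdict


class HookRegistry:
    """Toy registry: maps a stage name to the async callbacks registered for it."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def set_hook(self, stage, callback):
        self._hooks[stage].append(callback)

    async def run(self, stage, *args, **kwargs):
        # Await every callback registered for this stage, in registration order.
        for callback in self._hooks[stage]:
            await callback(*args, **kwargs)


events = []


async def before_goto(url, **kwargs):
    events.append(f"before_goto:{url}")


async def after_goto(url, **kwargs):
    events.append(f"after_goto:{url}")


async def toy_crawl(registry, url):
    await registry.run("before_goto", url=url)
    events.append(f"fetch:{url}")  # Stand-in for the real navigation step
    await registry.run("after_goto", url=url)


registry = HookRegistry()
registry.set_hook("before_goto", before_goto)
registry.set_hook("after_goto", after_goto)
asyncio.run(toy_crawl(registry, "https://example.com"))
print(events)
```

Accepting `**kwargs` in each callback keeps a hook working even if the caller later passes additional keyword arguments.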

## <font color="pink">8. ``Session-Based Crawling``</font>

When to use ``Session-Based Crawling``:

Session-based crawling is especially useful when navigating multi-page content where each page load must keep the same session context. For example, when a "Next Page" button must be clicked to load further data, the new data usually replaces the previous content. Session-based crawling keeps the browser state intact across interactions, so a sequence of actions can run within the same session. An easy way to think of a session is as a browser tab: passing the same session ID reuses the same tab instead of opening a new one.
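
The tab analogy can be sketched with a toy model in which each session ID maps to one reusable "tab". The class `ToyBrowser` and its methods are hypothetical and only mimic the behavior described above; they are not part of Crawl4AI.

```python
class ToyBrowser:
    """Toy model: each session ID maps to one reusable 'tab'."""

    def __init__(self):
        self._tabs = {}
        self.tabs_opened = 0

    def get_tab(self, session_id=None):
        # Without a session ID, every call opens a fresh tab.
        if session_id is None:
            self.tabs_opened += 1
            return {"history": []}
        # With a session ID, the same tab (and its state) is reused.
        if session_id not in self._tabs:
            self.tabs_opened += 1
            self._tabs[session_id] = {"history": []}
        return self._tabs[session_id]

    def kill_session(self, session_id):
        # Discard the tab once the multi-step interaction is finished.
        self._tabs.pop(session_id, None)


browser = ToyBrowser()
for page_number in (1, 2):
    tab = browser.get_tab("commits_session")
    tab["history"].append(f"page {page_number}")

print(browser.tabs_opened)
print(tab["history"])
browser.kill_session("commits_session")
```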

Example: Multi-Page Navigation Using JavaScript

In this example, we navigate through several pages by clicking a "Next Page" button. After each page load, we extract the new content and repeat the process.

In [None]:
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def crawl_dynamic_content_pages_method_2():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")

    async with AsyncWebCrawler() as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []
        last_commit = ""

        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => {
                const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
                return commits.length > 0 ? commits[0].textContent.trim() : null;
            };

            const initialCommit = getCurrentCommit();
            const button = document.querySelector('a[data-testid="pagination-next-button"]');
            if (button) button.click();

            // Poll for changes
            while (true) {
                await new Promise(resolve => setTimeout(resolve, 100)); // Wait 100ms
                const newCommit = getCurrentCommit();
                if (newCommit && newCommit !== initialCommit) {
                    break;
                }
            }
        })();
        """

        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        for page in range(2):  # Crawl 2 pages
            config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page_and_wait if page > 0 else None,
                js_only=page > 0,
            )
            result = await crawler.arun(
                url=url,
                config=config
            )

            assert result.success, f"Failed to crawl page {page + 1}"

            commits = json.loads(result.extracted_content)
            all_commits.extend(commits)

            print(f"Page {page + 1}: Found {len(commits)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 2 pages")

asyncio.run(crawl_dynamic_content_pages_method_2())

## <font color="pink">9. ``Using Extraction Strategies``</font>

### Executing JavaScript & Extracting Structured Data without LLMs

In [None]:
import json

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


async def extract():
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .div-block-214.p-extraxx",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src"
            }
        ]
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler() as crawler:

        # Create the JavaScript that handles clicking multiple times
        js_click_tabs = """
        (async () => {
            const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");

            for(let tab of tabs) {
                // scroll to the tab
                tab.scrollIntoView();
                tab.click();
                // Wait for content to load and animations to complete
                await new Promise(r => setTimeout(r, 500));
            }
        })();
        """

        config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=extraction_strategy,
            js_code=[js_click_tabs],
        )
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=config
        )

        courses = json.loads(result.extracted_content)
        print(result.extracted_content)
        print(f"Successfully extracted {len(courses)} courses")
        print(len(result.markdown))
        # print(json.dumps(courses[0], indent=2))

await extract()
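
The schema above pairs a `baseSelector` (one match per record) with a list of `fields` (one value per field inside each match). The standalone sketch below mimics that mapping on a tiny well-formed snippet using only the standard library. It is heavily simplified: selectors are treated as plain class names rather than CSS selectors, and the sample markup, the toy schema, and the helper names `by_class` and `extract_records` are all invented for illustration, not taken from Crawl4AI.

```python
import xml.etree.ElementTree as ET

SAMPLE_HTML = """
<ul>
  <li class="course"><h3 class="title"> Python Basics </h3><img class="icon" src="/img/py.svg"/></li>
  <li class="course"><h3 class="title">Web Design</h3><img class="icon" src="/img/web.svg"/></li>
</ul>
"""

toy_schema = {
    "name": "Toy Courses",
    "baseSelector": "course",  # Simplification: a class name, not a CSS selector
    "fields": [
        {"name": "title", "selector": "title", "type": "text"},
        {"name": "icon", "selector": "icon", "type": "attribute", "attribute": "src"},
    ],
}


def by_class(node, class_name):
    """Return all descendants whose class attribute equals class_name."""
    return node.findall(f".//*[@class='{class_name}']")


def extract_records(html, schema):
    """Build one dict per base element, with one key per schema field."""
    root = ET.fromstring(html)
    records = []
    for base in by_class(root, schema["baseSelector"]):
        record = {}
        for field in schema["fields"]:
            matches = by_class(base, field["selector"])
            if not matches:
                continue  # Missing fields are simply left out of the record
            element = matches[0]
            if field["type"] == "text":
                record[field["name"]] = "".join(element.itertext()).strip()
            elif field["type"] == "attribute":
                record[field["name"]] = element.get(field["attribute"])
        records.append(record)
    return records


records = extract_records(SAMPLE_HTML, toy_schema)
print(records)
```

Because no model call is involved, this style of extraction is deterministic, which is the main appeal of schema-based strategies for pages with a stable layout.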

# <font color="green">LLM Extraction</font>

This example shows how to use language-model-based extraction to retrieve structured data from the pricing page on the OpenAI website.

In [None]:
import asyncio
import nest_asyncio
nest_asyncio.apply()
from crawl4ai import AsyncWebCrawler, BrowserConfig, CacheMode, CrawlerRunConfig, LLMConfig

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import json
import os
from dotenv import load_dotenv, find_dotenv

# On Google Colab, the key can be read from the notebook secrets instead:
# from google.colab import userdata
# os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
_ = load_dotenv(find_dotenv())  # Read the local .env file
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")


class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output tokens for the OpenAI model."
    )

async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: dict = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # Skip if the API token is missing (for providers that require one).
    # Ollama providers are written as "ollama/<model>", so match on the prefix.
    if api_token is None and not provider.startswith("ollama"):
        print(f"An API token is required for {provider}. Skipping this example.")
        return

    extra_args = {"extra_headers": extra_headers} if extra_headers else {}

    # Pass crawl options through CrawlerRunConfig, as in the other cells
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction=(
                "Extract all model names along with their fees for input and output tokens. "
                "Example: {model_name: 'GPT-4', input_fee: 'US$10.00 / 1M tokens', "
                "output_fee: 'US$30.00 / 1M tokens'}."
            ),
            **extra_args
        ),
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",  # https://platform.openai.com/docs/pricing
            config=config,
        )
        print(json.loads(result.extracted_content)[:5])

# Usage:
# await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
# await extract_structured_data_using_llm("ollama/llama3.2")
await extract_structured_data_using_llm("openai/gpt-4o-mini", os.getenv("OPENAI_API_KEY"))
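
Because the schema above stores fees as free-form strings, a small post-processing step can turn them into numbers. The sketch below assumes the fee format shown in the instruction example (`'US$10.00 / 1M tokens'`); real model output may vary, so the pattern and the helper name `parse_fee` are assumptions, and the sample row is the illustrative example from the prompt, not actual pricing.

```python
import re
from typing import Optional

# Matches amounts such as "US$10.00 / 1M tokens" (assumed format).
FEE_PATTERN = re.compile(r"\$\s*([0-9]+(?:\.[0-9]+)?)\s*/\s*1M tokens")


def parse_fee(fee: str) -> Optional[float]:
    """Return the dollar amount per 1M tokens, or None if the text does not match."""
    match = FEE_PATTERN.search(fee)
    return float(match.group(1)) if match else None


# Illustrative row copied from the instruction example, not real pricing data.
rows = [
    {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
]

for row in rows:
    input_fee = parse_fee(row["input_fee"])
    output_fee = parse_fee(row["output_fee"])
    print(row["model_name"], input_fee, output_fee)
```

Returning `None` for unparseable strings makes it easy to spot rows where the model deviated from the requested format.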

**Cosine Similarity Strategy**

This strategy uses semantic clustering to extract relevant content based on contextual similarity, which is helpful when extracting related sections from a single topic.

IMPORTANT: This strategy uses embedding models from HuggingFace. For reasonable response times, switch to a GPU runtime.

In [None]:
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.extraction_strategy import CosineStrategy

async def cosine_similarity_extraction():
    async with AsyncWebCrawler() as crawler:
        strategy = CosineStrategy(
            word_count_threshold=10,
            max_dist=0.2, # Maximum distance allowed when forming clusters
            linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single)
            top_k=3, # Number of top clusters to keep
            sim_threshold=0.3, # Similarity threshold for clustering
            semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings
            verbose=True
        )

        # Pass crawl options through CrawlerRunConfig, as in the other cells
        result = await crawler.arun(
            url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
            config=CrawlerRunConfig(
                extraction_strategy=strategy,
                cache_mode=CacheMode.ENABLED,
            ),
        )
        print(json.loads(result.extracted_content)[:5])

asyncio.run(cosine_similarity_extraction())
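
For readers unfamiliar with the metric behind this strategy, the sketch below computes cosine similarity in plain Python and applies a threshold to decide which chunks to keep. The three-dimensional vectors are made up for illustration; the real strategy works with model-generated embeddings and hierarchical clustering, neither of which is reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: one for the filter query, one per content chunk.
query = [1.0, 0.0, 1.0]
chunks = {
    "economy paragraph": [0.9, 0.1, 0.8],
    "cookie banner": [0.0, 1.0, 0.0],
}

sim_threshold = 0.3
kept = [
    name for name, vector in chunks.items()
    if cosine_similarity(query, vector) >= sim_threshold
]
print(kept)
```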


## <font color="pink">10. ``Conclusion and Next Steps``</font>

You’ve explored core features of Crawl4AI, including dynamic content handling, link analysis, and advanced extraction strategies. Visit the official documentation for further details on Crawl4AI’s other features.

- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [https://crawl4ai.com](https://crawl4ai.com)

Happy Crawling with Crawl4AI! 🕷️🤖
