<a href="https://colab.research.google.com/github/J-D-S-E/test/blob/main/crawl4ai_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper
<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
![PyPI - Downloads](https://img.shields.io/pypi/dm/Crawl4AI)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)

Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [https://crawl4ai.com](https://crawl4ai.com)

## 🌟 Meet the Crawl4AI Assistant: Your Copilot for Crawling
Use the [Crawl4AI GPT Assistant](https://tinyurl.com/crawl4ai-gpt) as your AI-powered copilot! With this assistant, you can:
- 🧑‍💻 Generate code for complex crawling and extraction tasks
- 💡 Get tailored support and examples
- 📘 Learn Crawl4AI faster with step-by-step guidance

### **Quickstart with Crawl4AI**

#### 1. **Installation**
Install Crawl4AI and necessary dependencies:

In [None]:
%%capture
!pip install crawl4ai
!pip install nest_asyncio
!playwright install

In [None]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

#### 2. **Basic Setup and Simple Crawl**

In [None]:
import asyncio
from crawl4ai import AsyncWebCrawler

async def simple_crawl():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True # By default this is False, meaning the cache will be used
        )
        print(result.markdown[:1000])  # Print the first 1000 characters

asyncio.run(simple_crawl())

[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 4.19 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.73 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.74 seconds.
IE 11 is not supported. For an optimal experience visit our site on another browser.

[Morning Rundown: Trump and Harris' vastly different closing pitches, why Kim Jong Un is helping Russia, and an ancient city is discovered by accident](https://www.nbcnews.com/news/harris-speech-ellipse-ancient-mayan-city-morning-rundown-rcna177973)[](https://www.nbcnews.com/news/harris-speech-ellipse-ancient-mayan-city-morning-rundown-rcna177973)

Skip to Content

[NBC News Logo](https://www.nbcnews.com)

Sponsored By

  * [2024 Election](https://www.nbcnews.com/politics/2024-election/live-blog/kamala-harris-donald-t

#### 3. **Dynamic Content Handling**

In [None]:
async def crawl_dynamic_content():
    # You can use wait_for to wait for a condition to be met before returning the result
    # wait_for = """() => {
    #     return Array.from(document.querySelectorAll('article.tease-card')).length > 10;
    # }"""

    # wait_for can also be just a CSS selector
    # wait_for = "article.tease-card:nth-child(10)"

    async with AsyncWebCrawler(verbose=True) as crawler:
        js_code = [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            # wait_for=wait_for,
            bypass_cache=True,
        )
        print(result.markdown[:500])  # Print first 500 characters

asyncio.run(crawl_dynamic_content())

[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!
[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 10.22 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.89 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.90 seconds.
IE 11 is not supported. For an optimal experience visit our site on another browser.

[Morning Rundown: Trump and Harris' vastly different closing pitches, why Kim Jong Un is helping Russia, and an ancient city is discovered by accident](https://www.nbcnews.com/news/harris-speech-ellipse-ancient-mayan-city-morning-rundown-rcna177973)[](https://www.nbcnews.com/news/harr

#### 4. **Content Cleaning and Fit Markdown**

In [None]:
async def clean_content():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://janineintheworld.com/places-to-visit-in-central-mexico",
            excluded_tags=['nav', 'footer', 'aside'],
            remove_overlay_elements=True,
            word_count_threshold=10,
            bypass_cache=True
        )
        full_markdown_length = len(result.markdown)
        fit_markdown_length = len(result.fit_markdown)
        print(f"Full Markdown Length: {full_markdown_length}")
        print(f"Fit Markdown Length: {fit_markdown_length}")
        print(result.fit_markdown[:1000])


asyncio.run(clean_content())

[LOG] 🚀 Crawling done for https://janineintheworld.com/places-to-visit-in-central-mexico, success: True, time taken: 6.99 seconds
[LOG] 🚀 Content extracted for https://janineintheworld.com/places-to-visit-in-central-mexico, success: True, time taken: 0.84 seconds
[LOG] 🔥 Extracting semantic blocks for https://janineintheworld.com/places-to-visit-in-central-mexico, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://janineintheworld.com/places-to-visit-in-central-mexico, time taken: 1.07 seconds.
Full Markdown Length: 86397
Fit Markdown Length: 80613
  * Pin31
  * [Share](https://www.facebook.com/sharer/sharer.php?u=https%3A%2F%2Fwww.janineintheworld.com%2Fplaces-to-visit-in-central-mexico%2F&t=23%20Can%27t-Miss%20Destinations%20in%20Central%20Mexico "Share on Facebook")
  * [Tweet](https://x.com/intent/tweet?text=From%20well-known%20destinations%20like%20San%20Miguel%20de%20Allende%20to%20hidden%20gems%20like%20Xilitla%20and%20Aguascalientes%2C%20central%20%23Mexico%20is%20a%

#### 5. **Link Analysis and Smart Filtering**

In [None]:

async def link_analysis():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True,
            exclude_external_links=True,
            exclude_social_media_links=True,
            # exclude_domains=["facebook.com", "twitter.com"]
        )
        print(f"Found {len(result.links['internal'])} internal links")
        print(f"Found {len(result.links['external'])} external links")

        for link in result.links['internal'][:5]:
            print(f"Href: {link['href']}\nText: {link['text']}\n")


asyncio.run(link_analysis())

[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 4.70 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.60 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.60 seconds.
Found 107 internal links
Found 34 external links
Href: https://www.nbcnews.com/news/harris-speech-ellipse-ancient-mayan-city-morning-rundown-rcna177973
Text: Morning Rundown: Trump and Harris' vastly different closing pitches, why Kim Jong Un is helping Russia, and an ancient city is discovered by accident

Href: https://www.nbcnews.com
Text: NBC News Logo

Href: https://www.nbcnews.com/politics/2024-election/live-blog/kamala-harris-donald-trump-rally-election-live-updates-rcna177529
Text: 2024 Election

Href: https://www.nbcnews.com/politics
Text: Politics

Href: https://www.nbcnews.com/us-news
Text: 
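
As the output above shows, each entry in `result.links['internal']` is a dict with `href` and `text` keys, and some entries have empty text. The following is a minimal post-processing sketch, not part of Crawl4AI: `clean_links` is a hypothetical helper that drops links with blank text and removes repeated URLs. The duplicate entry in the sample data is made up for illustration.

```python
def clean_links(links):
    """Drop links with blank text and keep the first occurrence of each href."""
    seen = set()
    cleaned = []
    for link in links:
        href = link.get("href", "")
        text = (link.get("text") or "").strip()
        if not href or not text or href in seen:
            continue
        seen.add(href)
        cleaned.append({"href": href, "text": text})
    return cleaned


# Sample shaped like the output above; the last entry is a made-up duplicate.
sample = [
    {"href": "https://www.nbcnews.com", "text": "NBC News Logo"},
    {"href": "https://www.nbcnews.com/politics", "text": "Politics"},
    {"href": "https://www.nbcnews.com/us-news", "text": ""},
    {"href": "https://www.nbcnews.com/politics", "text": "Politics"},
]
cleaned = clean_links(sample)
for link in cleaned:
    print(f"{link['text']}: {link['href']}")
```

In a crawl, the same helper could be applied to `result.links['internal']` before passing the links to a downstream step.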

#### 6. **Media Handling**

In [None]:
async def media_handling():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True,
            exclude_external_images=False,
            screenshot=True
        )
        for img in result.media['images'][:5]:
            print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")

asyncio.run(media_handling())

[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 14.74 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.47 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.47 seconds.
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-762x508,f_auto,q_auto:best/rockcms/2024-10/241023-NM-Chilccare-jg-27b982.jpg, Alt: , Score: 4
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-80x80,f_auto,q_auto:best/rockcms/2024-10/241030-china-ev-electric-mb-0746-cae05c.jpg, Alt: Volkswagen Workshop in Hefei, Score: 5
Image URL: https://media-cldnry.s-nbcnews.com/image/upload/t_focal-80x80,f_auto,q_auto:best/rockcms/2024-10/241029-nyc-subway-sandwich-2021-ac-922p-a92374.jpg, Alt: A sub is prepared at a Subway restaurant in Manhattan, New York City, Score: 5
Image 

#### 7. **Using Hooks for Custom Workflow**

Hooks in Crawl4AI let you run custom logic at specific stages of the crawling process, such as setting custom headers, logging activity, or processing content before it is returned. Below is a basic workflow using a hook, followed by a list of available hooks with an example of each.

In [None]:
async def custom_hook_workflow():
    async with AsyncWebCrawler() as crawler:
        # Set a 'before_goto' hook to run custom code just before navigation
        crawler.crawler_strategy.set_hook("before_goto", lambda page: print("[Hook] Preparing to navigate..."))

        # Perform the crawl operation
        result = await crawler.arun(
            url="https://crawl4ai.com",
            bypass_cache=True
        )
        print(result.markdown[:500])  # Display the first 500 characters

asyncio.run(custom_hook_workflow())

[Hook] Preparing to navigate...
[LOG] 🚀 Crawling done for https://crawl4ai.com, success: True, time taken: 2.01 seconds
[LOG] 🚀 Content extracted for https://crawl4ai.com, success: True, time taken: 0.29 seconds
[LOG] 🔥 Extracting semantic blocks for https://crawl4ai.com, Strategy: AsyncWebCrawler
[LOG] 🚀 Extraction done for https://crawl4ai.com, time taken: 0.30 seconds.
[Crawl4AI Documentation](https://docs.crawl4ai.com/)

  * [ Home ](.)
  * [ Installation ](basic/installation/)
  * [ Quick Start ](basic/quickstart/)
  * [ Search ](#)



  * Home
  * [Installation](basic/installation/)
  * [Quick Start](basic/quickstart/)
  * Basic
    * [Simple Crawling](basic/simple-crawling/)
    * [Output Formats](basic/output-formats/)
    * [Browser Configuration](basic/browser-config/)
    * [Page Interaction](basic/page-interaction/)
    * [Content Selection](basic/con


List of available hooks and examples for each stage of the crawling process:

- **on_browser_created**
    ```python
    async def on_browser_created_hook(browser):
        print("[Hook] Browser created")
    ```

- **before_goto**
    ```python
    async def before_goto_hook(page):
        await page.set_extra_http_headers({"X-Test-Header": "test"})
    ```

- **after_goto**
    ```python
    async def after_goto_hook(page):
        print(f"[Hook] Navigated to {page.url}")
    ```

- **on_execution_started**
    ```python
    async def on_execution_started_hook(page):
        print("[Hook] JavaScript execution started")
    ```

- **before_return_html**
    ```python
    async def before_return_html_hook(page, html):
        print(f"[Hook] HTML length: {len(html)}")
    ```
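
The examples in this section register both plain callables (the `lambda` passed to `set_hook`) and `async def` functions. As a rough mental model of how one dispatcher can accept both, here is a minimal sketch. `HookRegistry` is a hypothetical stand-in written for this illustration and is not Crawl4AI's actual implementation; only the hook names come from the list above.

```python
import asyncio
import inspect


class HookRegistry:
    """Hypothetical stand-in for an object that stores and runs named hooks."""

    def __init__(self, names):
        self._hooks = {name: None for name in names}

    def set_hook(self, name, func):
        if name not in self._hooks:
            raise ValueError(f"Unknown hook: {name}")
        self._hooks[name] = func

    async def execute_hook(self, name, *args):
        hook = self._hooks.get(name)
        if hook is None:
            return None
        result = hook(*args)
        # Async hooks return an awaitable; plain callables return a value directly.
        if inspect.isawaitable(result):
            result = await result
        return result


async def demo():
    events = []
    registry = HookRegistry(["before_goto", "after_goto"])

    # Synchronous hook, similar to the lambda in the workflow example.
    registry.set_hook("before_goto", lambda page: events.append(f"before:{page}"))

    # Asynchronous hook, similar to the hooks in the list above.
    async def after_goto_hook(page):
        events.append(f"after:{page}")

    registry.set_hook("after_goto", after_goto_hook)

    await registry.execute_hook("before_goto", "https://crawl4ai.com")
    await registry.execute_hook("after_goto", "https://crawl4ai.com")
    return events


hook_events = asyncio.run(demo())
print(hook_events)
```

The string passed as `page` stands in for the page object a real hook would receive.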

#### 8. **Session-Based Crawling**

When to Use Session-Based Crawling:
Session-based crawling is especially beneficial when navigating through multi-page content where each page load needs to maintain the same session context. For instance, in cases where a “Next Page” button must be clicked to load subsequent data, the new data often replaces the previous content. Here, session-based crawling keeps the browser state intact across each interaction, allowing for sequential actions within the same session.

Example: Multi-Page Navigation Using JavaScript
In this example, we’ll navigate through multiple pages by clicking a "Next Page" button. After each page load, we extract the new content and repeat the process.

In [None]:
async def multi_page_session_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "page_navigation_session"
        url = "https://example.com/paged-content"

        for page_number in range(1, 4):
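            result = await crawler.arun(
                url=url,
                session_id=session_id,
                # Click "Next Page" on every iteration after the first
                js_code="document.querySelector('.next-page-button').click();" if page_number > 1 else None,
                # After the first load, run only the JS in the open page instead of navigating again
                js_only=page_number > 1,
                css_selector=".content-section",
                bypass_cache=True
            )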
            print(f"Page {page_number} Content:")
            print(result.markdown[:500])  # Print first 500 characters

# asyncio.run(multi_page_session_crawl())

#### 9. **Using Extraction Strategies**

**LLM Extraction**

This example demonstrates how to use language model-based extraction to retrieve structured data from a pricing page on OpenAI’s site.

In [None]:
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
import os, json
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: dict = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # Skip if API token is missing (for providers that require it)
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

    extra_args = {"extra_headers": extra_headers} if extra_headers else {}

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider=provider,
                api_token=api_token,
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""Extract all model names along with fees for input and output tokens.
                Example: {model_name: 'GPT-4', input_fee: 'US$10.00 / 1M tokens', output_fee: 'US$30.00 / 1M tokens'}.""",
                **extra_args
            ),
            bypass_cache=True,
        )
        print(json.loads(result.extracted_content)[:5])

# Usage:
await extract_structured_data_using_llm("openai/gpt-4o-mini", os.getenv("OPENAI_API_KEY"))





--- Extracting Structured Data with openai/gpt-4o-mini ---
[LOG] 🌤️  Warming up the AsyncWebCrawler
[LOG] 🌞 AsyncWebCrawler is ready to crawl
[LOG] 🕸️ Crawling https://openai.com/api/pricing/ using AsyncPlaywrightCrawlerStrategy...
[LOG] ✅ Crawled https://openai.com/api/pricing/ successfully!
[LOG] 🚀 Crawling done for https://openai.com/api/pricing/, success: True, time taken: 4.21 seconds
[LOG] 🚀 Content extracted for https://openai.com/api/pricing/, success: True, time taken: 1.63 seconds
[LOG] 🔥 Extracting semantic blocks for https://openai.com/api/pricing/, Strategy: AsyncWebCrawler
[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 0
[LOG] Extracted 26 blocks from URL: https://openai.com/api/pricing/ block index: 0
[LOG] 🚀 Extraction done for https://openai.com/api/pricing/, time taken: 38.26 seconds.




[{'model_name': 'gpt-4o', 'input_fee': 'US$2.50 / 1M input tokens', 'output_fee': 'US$10.00 / 1M output tokens', 'error': False}, {'model_name': 'gpt-4o-2024-08-06', 'input_fee': 'US$2.50 / 1M input tokens', 'output_fee': 'US$10.00 / 1M output tokens', 'error': False}, {'model_name': 'gpt-4o-audio-preview', 'input_fee': 'US$2.50 / 1M input tokens', 'output_fee': 'US$10.00 / 1M output tokens', 'error': False}, {'model_name': 'gpt-4o-audio-preview-2024-10-01', 'input_fee': 'US$2.50 / 1M input tokens', 'output_fee': 'US$10.00 / 1M output tokens', 'error': False}, {'model_name': 'gpt-4o-2024-05-13', 'input_fee': 'US$5.00 / 1M input tokens', 'output_fee': 'US$15.00 / 1M output tokens', 'error': False}]
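
The fees come back as display strings such as `'US$2.50 / 1M input tokens'`. If you need to compare or sort them, a small parser helps. The sketch below assumes the string format seen in the sample output above; `parse_fee` and `FEE_PATTERN` are hypothetical names and not part of Crawl4AI.

```python
import re

# Matches a dollar amount followed by "/ 1M", e.g. "US$2.50 / 1M input tokens".
FEE_PATTERN = re.compile(r"\$\s*([0-9]+(?:\.[0-9]+)?)\s*/\s*1M")


def parse_fee(fee):
    """Return the dollar amount per 1M tokens, or None if the format is unexpected."""
    match = FEE_PATTERN.search(fee)
    return float(match.group(1)) if match else None


# Two records copied from the sample output above.
records = [
    {"model_name": "gpt-4o", "input_fee": "US$2.50 / 1M input tokens",
     "output_fee": "US$10.00 / 1M output tokens", "error": False},
    {"model_name": "gpt-4o-2024-05-13", "input_fee": "US$5.00 / 1M input tokens",
     "output_fee": "US$15.00 / 1M output tokens", "error": False},
]

# Map each model name to (input fee, output fee), skipping records flagged as errors.
fees = {
    record["model_name"]: (parse_fee(record["input_fee"]), parse_fee(record["output_fee"]))
    for record in records
    if not record["error"]
}
print(fees)
```

Returning `None` for unrecognized strings keeps the caller in control of how to handle a page whose pricing format has changed.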


**Cosine Similarity Strategy**

This strategy uses semantic clustering to extract relevant content based on contextual similarity, which is helpful when extracting related sections from a single topic.

IMPORTANT: This strategy uses embedding models from HuggingFace. For reasonable response times, switch the runtime to a GPU.
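
To build intuition for similarity-based filtering, here is a minimal, dependency-free sketch. It swaps the embedding model for simple word counts and skips clustering entirely, so it illustrates only the cosine-similarity idea and is not how `CosineStrategy` is implemented. The function names and the sample blocks are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_blocks(blocks, query, sim_threshold=0.3):
    """Keep blocks whose similarity to the query meets the threshold, best first."""
    query_vec = bag_of_words(query)
    scored = [(cosine_similarity(bag_of_words(block), query_vec), block) for block in blocks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [block for score, block in scored if score >= sim_threshold]


blocks = [
    "McDonald's economic impact on American consumer trends",
    "Subscribe to our newsletter for daily updates",
    "Consumer trends show McDonald's prices rising",
]
query = "McDonald's economic impact, American consumer trends"
relevant = filter_blocks(blocks, query)
print(relevant)
```

Word counts only match exact words, whereas embeddings can also relate paraphrases, which is why the real strategy relies on an embedding model.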

In [None]:
from crawl4ai.extraction_strategy import CosineStrategy

async def cosine_similarity_extraction():
    async with AsyncWebCrawler() as crawler:
        strategy = CosineStrategy(
            word_count_threshold=10,
            max_dist=0.2, # Maximum distance threshold for forming clusters
            linkage_method="ward", # Linkage method for hierarchical clustering (ward, complete, average, single)
            top_k=3, # Number of top clusters (categories) to keep
            sim_threshold=0.3, # Similarity threshold for clustering
            semantic_filter="McDonald's economic impact, American consumer trends", # Keywords to filter the content semantically using embeddings
            verbose=True
        )

        result = await crawler.arun(
            url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
            extraction_strategy=strategy
        )
        print(json.loads(result.extracted_content)[:5])

asyncio.run(cosine_similarity_extraction())


[LOG] Loading Extraction Model for cpu device.


[LOG] Loading Multilabel Classifier for cpu device.


[LOG] Model loaded sentence-transformers/all-MiniLM-L6-v2, models/reuters, took 27.069828987121582 seconds
[LOG] 🚀 Crawling done for https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156, success: True, time taken: 8.65 seconds
[LOG] 🚀 Content extracted for https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156, success: True, time taken: 0.64 seconds
[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156, Strategy: AsyncWebCrawler
[LOG] 🚀 Assign tags using cpu
[LOG] 🚀 Categorization done in 2.50 seconds
[LOG] 🚀 Extraction done for https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156, time taken: 59.85 seconds.
[{'index': 3, 'tags': ['business_&_entrepreneurs', 'news_&_social_c

#### 10. **Conclusion and Next Steps**

You’ve explored core features of Crawl4AI, including dynamic content handling, link analysis, and advanced extraction strategies. Visit our documentation for further details on using Crawl4AI’s extensive features.

- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [https://crawl4ai.com](https://crawl4ai.com)

Happy Crawling with Crawl4AI! 🕷️🤖
