# Example 3 — Deep-Crawl Specific URLs

**Goal:** Deep-crawl a list of seed URLs using a headless browser (crawl4ai), discover internal links with a breadth-first search (BFS) that ranks links by financial-keyword relevance, extract content from HTML and PDFs, and save the results to Parquet.

The `crawl` subcommand is fundamentally different from `search`:

| | Search | Crawl |
|---|---|---|
| Input | Keyword queries | Seed URLs |
| Discovery | DuckDuckGo results | BFS link-following on the seed domain |
| Renderer | Simple HTTP (aiohttp) | Headless browser (crawl4ai) |
| JS support | No | Yes |
| PDF handling | pdfplumber | pdfplumber or Docling |

Use `crawl` when you:
- Know the exact URLs or domains you want
- Need JavaScript rendering (SPAs, dynamic dashboards)
- Want to discover internal links on a corporate/investor site
- Need to extract PDFs (annual reports, SEC filings)

> **Prerequisite:** Install the crawl extra: `pip install -e ".[crawl]"`

## 1. Setup

In [None]:
from pathlib import Path

# Note: do NOT set WindowsSelectorEventLoopPolicy for crawl —
# crawl4ai uses Playwright which needs the default event loop.

import pandas as pd
from financial_scraper.crawl.config import CrawlConfig
from financial_scraper.crawl.pipeline import CrawlPipeline

## 2. Write a Seed URL File

One URL per line. These can be:
- HTML pages (the crawler will render them and follow links)
- Direct PDF URLs (downloaded and extracted automatically)
- Investor relations landing pages (the BFS scorer prioritizes financial links)

In [None]:
urls_path = Path("seed_urls_example.txt")
urls_path.write_text(
    """# Seed URLs for deep crawl
# Corporate investor pages — the crawler will follow internal links
https://investor.apple.com/sec-filings/default.aspx
https://ir.tesla.com

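# Direct document URL (an SEC filing in HTML; direct PDF URLs are handled the same way)
https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/aapl-20230930.htm
""",
    encoding="utf-8",  # the file contains non-ASCII characters; do not rely on the platform default
)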
print(f"Wrote {urls_path} with 3 seed URLs")
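
The seed file above mixes `#` comment lines, blank lines, and URLs. A minimal sketch of how such a file could be parsed, assuming comment and blank lines are simply skipped. The helper name `read_seed_urls` is hypothetical and is not part of `financial_scraper`:

```python
def read_seed_urls(text: str) -> list[str]:
    """Return the URLs in a seed file, skipping blank lines and '#' comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blanks and comment lines
        urls.append(line)
    return urls


sample = """# Seed URLs
https://ir.tesla.com

# Another comment
https://investor.apple.com/sec-filings/default.aspx
"""
seeds = read_seed_urls(sample)
print(f"{len(seeds)} seed URLs")
```

Counting URLs this way avoids hard-coding the number in the confirmation message.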

## 3. Configure and Run

Key settings for URL crawling:

| Setting | Value | Why |
|---------|-------|-----|
| `max_depth` | `2` | Follow links 2 levels deep from each seed |
| `max_pages` | `20` | Cap pages per seed URL (prevents runaway crawls) |
| `semaphore_count` | `2` | Max concurrent browser tabs |
| `pdf_extractor` | `"auto"` | Uses Docling if installed, falls back to pdfplumber |
| `min_word_count` | `100` | Filter out navigation/stub pages |
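
To make `max_depth` and `max_pages` concrete, here is a minimal sketch of a depth- and page-bounded breadth-first traversal over a toy in-memory link graph. The graph, the `bounded_bfs` function, and the page paths are hypothetical; this illustrates how the two limits interact, not how crawl4ai implements its traversal.

```python
from collections import deque


def bounded_bfs(graph: dict[str, list[str]], seed: str,
                max_depth: int, max_pages: int) -> list[str]:
    """Collect pages breadth-first from `seed`, bounded by depth and page count."""
    visited = [seed]          # the seed itself counts as one page
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(visited) < max_pages:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue          # do not follow links beyond max_depth
        for link in graph.get(page, []):
            if link in seen:
                continue
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
            if len(visited) >= max_pages:
                break         # page cap reached
    return visited


# Hypothetical site structure: page -> links found on that page
toy_graph = {
    "/": ["/investors", "/about"],
    "/investors": ["/investors/reports", "/investors/filings"],
    "/investors/reports": ["/investors/reports/2023-annual"],
}
pages = bounded_bfs(toy_graph, "/", max_depth=2, max_pages=20)
print(pages)
```

With `max_depth=2`, the page three links away from the seed is never reached; lowering `max_pages` cuts the traversal short even earlier.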

### How BFS scoring works

Discovered links are scored before the crawler decides which to visit:
- **Financial keyword relevance** (weight 0.8) — URLs containing `earnings`, `quarterly`, `report`, `sec`, `filings` score higher
- **Path depth** (weight 0.3) — shorter paths preferred
- **Freshness** (weight 0.15) — URLs containing the current year score higher

Non-content URLs (login, contact, career, legal) are filtered out before scoring.
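
The scoring described above can be sketched as a filter followed by a weighted sum. This is a minimal illustration that assumes simple substring matching on the URL path and a linear combination of the three weights listed above. The function names, the exact keyword and filter lists, and the per-component normalization are assumptions, not the library's implementation.

```python
from urllib.parse import urlparse

# Assumed lists, taken from the examples named in the text above
FINANCIAL_KEYWORDS = ("earnings", "quarterly", "report", "sec", "filings")
NON_CONTENT_MARKERS = ("login", "contact", "career", "legal")


def is_content_url(url: str) -> bool:
    """Drop login/contact/career/legal pages before scoring."""
    path = urlparse(url).path.lower()
    return not any(marker in path for marker in NON_CONTENT_MARKERS)


def score_url(url: str, current_year: int) -> float:
    """Weighted sum of keyword relevance, path depth, and freshness."""
    path = urlparse(url).path.lower()
    # Naive substring match: 1.0 if any financial keyword appears in the path
    keyword_score = 1.0 if any(kw in path for kw in FINANCIAL_KEYWORDS) else 0.0
    # Shorter paths score higher: 1 / (1 + number of path segments)
    segments = [s for s in path.split("/") if s]
    depth_score = 1.0 / (1 + len(segments))
    # URLs that mention the current year are treated as fresh
    freshness_score = 1.0 if str(current_year) in path else 0.0
    return 0.8 * keyword_score + 0.3 * depth_score + 0.15 * freshness_score


links = [
    "https://example.com/about/team/history",
    "https://example.com/login",
    "https://example.com/investors/quarterly-earnings-2024",
]
candidates = [u for u in links if is_content_url(u)]
ranked = sorted(candidates, key=lambda u: score_url(u, 2024), reverse=True)
for url in ranked:
    print(f"{score_url(url, 2024):.3f}  {url}")
```

Because the keyword weight dominates, a deep URL that mentions `earnings` still outranks a shallow URL with no financial terms.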

In [None]:
output_dir = Path("./runs_crawl_example")
output_dir.mkdir(exist_ok=True)

output_parquet = output_dir / "crawl_results.parquet"
output_jsonl = output_dir / "crawl_results.jsonl"

config = CrawlConfig(
    urls_file=urls_path,
    max_depth=2,
    max_pages=20,
    semaphore_count=2,
    pdf_extractor="auto",
    min_word_count=100,
    output_dir=output_dir,
    output_path=output_parquet,
    jsonl_path=output_jsonl,
    # Resume support — safe to re-run
    resume=True,
    checkpoint_file=output_dir / ".crawl_checkpoint.json",
    # Domain exclusion: a built-in list applies by default (set exclude_file=None to disable).
    # Here the exclusion file is passed explicitly.
    exclude_file=Path("../config/exclude_domains.txt"),
)

print("Config ready:")
print(f"  urls_file      = {config.urls_file}")
print(f"  max_depth      = {config.max_depth}")
print(f"  max_pages      = {config.max_pages}")
print(f"  pdf_extractor  = {config.pdf_extractor}")
print(f"  output         = {config.output_path}")

In [None]:
pipeline = CrawlPipeline(config)
await pipeline.run()
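
`semaphore_count=2` caps how many browser tabs are open at once. The general pattern can be sketched with `asyncio.Semaphore`. This is a generic illustration in which `fetch_page` is a hypothetical stand-in that sleeps instead of driving a browser; it is not the pipeline's actual code.

```python
import asyncio


async def crawl_all(urls: list[str], semaphore_count: int) -> int:
    """Process all URLs with at most `semaphore_count` running at once.

    Returns the peak number of simultaneously active tasks.
    """
    semaphore = asyncio.Semaphore(semaphore_count)
    active = 0
    peak = 0

    async def fetch_page(url: str) -> None:
        nonlocal active, peak
        async with semaphore:           # wait here until a slot is free
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # placeholder for rendering the page
            active -= 1

    await asyncio.gather(*(fetch_page(u) for u in urls))
    return peak


urls = [f"https://example.com/page{i}" for i in range(6)]
peak = asyncio.run(crawl_all(urls, semaphore_count=2))
print(f"Peak concurrency: {peak}")
```

Inside a notebook, where an event loop is already running, call `await crawl_all(urls, semaphore_count=2)` instead of `asyncio.run(...)`.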

## 4. Inspect Results

In [None]:
if output_parquet.exists():
    df = pd.read_parquet(output_parquet)
    print(f"Total documents: {len(df)}")
    print(f"Unique sources:  {df['source'].nunique()}")
    print(f"Total words:     {df['full_text'].str.split().str.len().sum():,}")
    print(f"Avg words/doc:   {df['full_text'].str.split().str.len().mean():.0f}")
else:
    print("No output file — check logs above for errors.")

In [None]:
# Preview extracted pages
if output_parquet.exists():
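    # A bare expression inside an `if` block is not auto-displayed, so print it explicitly
    print(df[["company", "title", "source", "date"]].head(15).to_string())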

In [None]:
# Pages per seed domain
if output_parquet.exists():
    print("Pages per seed domain:")
    print(df["company"].value_counts().to_string())

In [None]:
# Content type breakdown (HTML vs PDF, based on link extension)
if output_parquet.exists():
    df["is_pdf"] = df["link"].str.lower().str.endswith(".pdf")
    pdf_count = df["is_pdf"].sum()
    html_count = len(df) - pdf_count
    print(f"HTML pages: {html_count}")
    print(f"PDF files:  {pdf_count}")
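
The `endswith(".pdf")` check above misses PDF links that carry a query string or fragment. A slightly more robust sketch, assuming detection by URL path only; the `looks_like_pdf` helper and the sample URLs are hypothetical, and PDFs served from extensionless URLs would still be missed:

```python
from urllib.parse import urlparse


def looks_like_pdf(url: str) -> bool:
    """True if the URL's path (ignoring query and fragment) ends with .pdf."""
    return urlparse(url).path.lower().endswith(".pdf")


samples = [
    "https://example.com/reports/annual-2023.pdf",
    "https://example.com/reports/annual-2023.PDF?download=1",
    "https://example.com/reports/annual-2023.pdf#page=4",
    "https://example.com/reports/index.html",
]
for url in samples:
    print(f"{looks_like_pdf(url)!s:5}  {url}")
```

On the results DataFrame this could be applied as `df["link"].map(looks_like_pdf)` in place of the string check.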

In [None]:
# Word count distribution
if output_parquet.exists():
    word_counts = df["full_text"].str.split().str.len()
    print("Word count stats:")
    print(f"  Min:    {word_counts.min()}")
    print(f"  Median: {word_counts.median():.0f}")
    print(f"  Mean:   {word_counts.mean():.0f}")
    print(f"  Max:    {word_counts.max()}")

In [None]:
# Sample document
if output_parquet.exists() and len(df) > 0:
    row = df.iloc[0]
    print(f"Title:  {row['title']}")
    print(f"Source: {row['source']}")
    print(f"Link:   {row['link']}")
    print(f"Date:   {row['date']}")
    print(f"Words:  {len(row['full_text'].split())}")
    print("\n--- First 800 chars ---\n")
    print(row["full_text"][:800])

## 5. CLI Equivalent

The same run from the command line:

```bash
financial-scraper crawl \
    --urls-file seed_urls_example.txt \
    --max-depth 2 \
    --max-pages 20 \
    --resume \
    --output-dir ./runs_crawl_example \
    --jsonl
```

Use Docling for better PDF extraction (tables, layout):

```bash
pip install -e ".[docling]"

financial-scraper crawl \
    --urls-file seed_urls.txt \
    --max-depth 2 \
    --max-pages 50 \
    --pdf-extractor docling \
    --resume \
    --output-dir ./runs \
    --jsonl
```

Stealth mode (lower concurrency):

```bash
financial-scraper crawl \
    --urls-file seed_urls.txt \
    --max-depth 3 \
    --max-pages 100 \
    --stealth --resume \
    --output-dir ./runs \
    --jsonl
```

To disable domain exclusions:

```bash
financial-scraper crawl --urls-file seed_urls.txt --no-exclude
```

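> **Windows note:** If you see Unicode encoding errors from crawl4ai's logger, set `PYTHONUTF8=1` before running. The commands below use Command Prompt syntax; in PowerShell, use `$env:PYTHONUTF8 = "1"` instead of `set`: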
> ```bash
> set PYTHONUTF8=1
> python -m financial_scraper crawl --urls-file seed_urls.txt --output-dir ./runs
> ```

## 6. Cleanup

In [None]:
# Uncomment to delete temporary files
# urls_path.unlink(missing_ok=True)
# import shutil; shutil.rmtree(output_dir, ignore_errors=True)