# Example 1 — Text Search Mode

**Goal:** Search DuckDuckGo in **text** mode for broad financial research content (SEC filings, analyst reports, reference material), extract clean text, and save to Parquet.

Text mode returns organic web results (articles, PDFs, reports) rather than recent news. It is best suited for:
- SEC filings and annual reports
- Research papers and whitepapers
- Reference/educational financial content
- Historical data and analysis

> **Tip:** DuckDuckGo rate-limits text search more aggressively than news search. For large runs (50+ queries), use `--stealth` and `--use-tor`.

## 1. Setup

Make sure the package is installed:

```bash
cd financial_scraper
pip install -e .
```

In [None]:
import asyncio
import sys
from pathlib import Path

# Windows asyncio fix (required before any asyncio.run call)
if sys.platform.startswith("win"):
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

import pandas as pd
from financial_scraper import ScraperConfig, ScraperPipeline

## 2. Write a Query File

One query per line. Lines starting with `#` are comments.

In [None]:
queries_path = Path("queries_text_example.txt")
queries_path.write_text(
    """# Text search — broad financial research
SEC 10-K annual report risk factors analysis
copper supply chain disruption impact 2025
lithium mining production forecast
central bank monetary policy inflation outlook
renewable energy investment trends global
""",
    encoding="utf-8",  # explicit encoding: the header comment contains a non-ASCII dash
)
print(f"Wrote {queries_path} ({queries_path.stat().st_size} bytes)")
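
The format above (one query per line, `#` for comments) is simple enough to parse by hand when you want to check a file before a run. The sketch below is not the package's own loader: `parse_queries` is a hypothetical helper, and skipping blank lines is an assumption.

```python
def parse_queries(text: str) -> list[str]:
    """Return the non-comment, non-blank lines of a query file."""
    queries = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: blank lines are ignored; lines starting with '#' are comments.
        if not stripped or stripped.startswith("#"):
            continue
        queries.append(stripped)
    return queries


sample = """# Text search: broad financial research
SEC 10-K annual report risk factors analysis

lithium mining production forecast
"""
print(parse_queries(sample))
```

In the notebook, the same check could be run on `queries_path.read_text(encoding="utf-8")`.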

## 3. Configure and Run

Key settings for text search:

| Setting | Value | Why |
|---------|-------|-----|
| `search_type` | `"text"` | Organic results — broader, includes reports/PDFs |
| `max_results_per_query` | `10` | Moderate — avoids rate limits while getting enough coverage |
| `min_word_count` | `150` | Higher threshold filters out stub pages |
| `exclude_file` | default | Built-in list blocks social media, paywalls, low-quality sites |
| `favor_precision` | `True` | trafilatura precision mode extracts main content only |
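
To make the `min_word_count` row concrete: a threshold of this kind amounts to counting whitespace-separated tokens in the extracted text and dropping documents that fall short. The helper below is a hypothetical illustration of that idea, not the package's actual filter; `meets_min_words` is an invented name.

```python
def meets_min_words(text: str, min_word_count: int = 150) -> bool:
    """Return True when the text has at least min_word_count whitespace-separated words."""
    return len(text.split()) >= min_word_count


stub_page = "Page not found. Return to the homepage."
long_page = "risk " * 200  # 200 words of filler text

print(meets_min_words(stub_page))  # False: only 7 words
print(meets_min_words(long_page))  # True: 200 words
```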

In [None]:
output_dir = Path("./runs_text_example")
output_dir.mkdir(exist_ok=True)

output_parquet = output_dir / "text_search_results.parquet"
output_jsonl = output_dir / "text_search_results.jsonl"

config = ScraperConfig(
    queries_file=queries_path,
    search_type="text",
    max_results_per_query=10,
    min_word_count=150,
    favor_precision=True,
    output_dir=output_dir,
    output_path=output_parquet,
    jsonl_path=output_jsonl,
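    # Domain exclusion: the built-in list applies when exclude_file is left unset,
    # and exclude_file=None disables filtering. An explicit path is passed here;
    # it is resolved relative to the notebook's working directory.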
    exclude_file=Path("../config/exclude_domains.txt"),
)

print("Config ready:")
print(f"  search_type    = {config.search_type}")
print(f"  max_results    = {config.max_results_per_query}")
print(f"  min_words      = {config.min_word_count}")
print(f"  output         = {config.output_path}")
print(f"  exclude_file   = {config.exclude_file}")

In [None]:
pipeline = ScraperPipeline(config)
await pipeline.run()
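
The cell above relies on the top-level `await` that Jupyter and IPython provide. In a plain `.py` script there is no running event loop, so the coroutine has to be driven with `asyncio.run`. The sketch below shows that pattern with a stand-in class: `StubPipeline` is hypothetical and only mimics the awaitable `run()` method used above.

```python
import asyncio


class StubPipeline:
    """Hypothetical stand-in for ScraperPipeline: only run() is mimicked."""

    def __init__(self) -> None:
        self.finished = False

    async def run(self) -> None:
        await asyncio.sleep(0)  # placeholder for the real search and extraction work
        self.finished = True


async def main() -> StubPipeline:
    pipeline = StubPipeline()  # in a real script: ScraperPipeline(config)
    await pipeline.run()
    return pipeline


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs main() to completion, then closes the loop.
    result = asyncio.run(main())
    print("finished:", result.finished)
```

Calling `asyncio.run` inside a notebook cell raises a `RuntimeError` because a loop is already running there, which is why the notebook uses `await` directly.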

## 4. Inspect Results

In [None]:
if output_parquet.exists():
    df = pd.read_parquet(output_parquet)
    word_counts = df["full_text"].str.split().str.len()  # words per document
    print(f"Total documents: {len(df)}")
    print(f"Unique sources:  {df['source'].nunique()}")
    print(f"Total words:     {word_counts.sum():,}")
    print(f"Avg words/doc:   {word_counts.mean():.0f}")
    print()
    print("Columns:", list(df.columns))
else:
    print("No output file — check logs above for errors.")

In [None]:
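# Preview the first few rows.
# An expression inside an `if` block is not auto-displayed, so call display() explicitly.
from IPython.display import display

if output_parquet.exists():
    display(df[["company", "title", "source", "date"]].head(10))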

In [None]:
# Top sources by document count
if output_parquet.exists():
    print("Top sources:")
    print(df["source"].value_counts().head(10).to_string())

In [None]:
# Results per query (the `company` column is used here as the query label)
if output_parquet.exists():
    print("Results per query:")
    print(df["company"].value_counts().to_string())
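
A query that produced no saved documents does not appear in `value_counts()` at all. One way to spot such gaps is to compare the submitted queries against the values seen in the output. The sketch below assumes, as the cell above does, that the `company` column holds the query string; `missing_queries` is a hypothetical helper.

```python
from collections.abc import Iterable


def missing_queries(submitted: Iterable[str], seen: Iterable[str]) -> list[str]:
    """Return submitted queries that never appear in the results, in original order."""
    seen_set = set(seen)
    return [query for query in submitted if query not in seen_set]


submitted = [
    "lithium mining production forecast",
    "renewable energy investment trends global",
]
seen_in_results = [
    "lithium mining production forecast",
    "lithium mining production forecast",
]
print(missing_queries(submitted, seen_in_results))
```

With the notebook's data, `seen` would be `df["company"]` and `submitted` the non-comment lines of the query file.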

In [None]:
# Preview full text of the first document
if output_parquet.exists() and len(df) > 0:
    row = df.iloc[0]
    print(f"Title:  {row['title']}")
    print(f"Source: {row['source']}")
    print(f"Date:   {row['date']}")
    print(f"Words:  {len(row['full_text'].split())}")
    print("\n--- First 500 chars ---\n")
    print(row["full_text"][:500])

## 5. CLI Equivalent

The same run from the command line:

```bash
financial-scraper search \
    --queries-file queries_text_example.txt \
    --search-type text \
    --max-results 10 \
    --min-words 150 \
    --output-dir ./runs_text_example \
    --jsonl
```

Add `--stealth` and `--use-tor` for large runs (50+ queries). The example below also passes `--resume`:

```bash
financial-scraper search \
    --queries-file queries.txt \
    --search-type text \
    --max-results 10 \
    --stealth --use-tor --resume \
    --output-dir ./runs \
    --jsonl
```

To disable domain exclusions:

```bash
financial-scraper search --queries-file queries.txt --search-type text --no-exclude
```
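
When runs are launched from another Python program, the same flags can be assembled as an argument list. The sketch below only builds and prints the command, and it uses only flags that appear in the commands above; `build_search_command` and its parameters are hypothetical names.

```python
import shlex


def build_search_command(queries_file, output_dir, max_results=10, large_run=False):
    """Assemble the CLI invocation as a list of arguments."""
    cmd = [
        "financial-scraper", "search",
        "--queries-file", str(queries_file),
        "--search-type", "text",
        "--max-results", str(max_results),
        "--output-dir", str(output_dir),
        "--jsonl",
    ]
    if large_run:
        # Flags taken from the large-run example above.
        cmd += ["--stealth", "--use-tor", "--resume"]
    return cmd


cmd = build_search_command("queries.txt", "./runs", large_run=True)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems in file paths.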

## 6. Cleanup

In [None]:
# Uncomment to delete temporary files
# queries_path.unlink(missing_ok=True)
# import shutil; shutil.rmtree(output_dir, ignore_errors=True)