# Example 2 — News Search Mode

**Goal:** Search DuckDuckGo in **news** mode for recent financial articles, extract clean text with deduplication (exact + fuzzy), and analyze the results.

News mode returns recent articles from news outlets. It is best suited for:
- Earnings announcements and market reactions
- Commodity price movements and supply updates
- Central bank decisions and economic data releases
- Breaking financial events

> **Tip:** News mode is **less rate-limited** by DuckDuckGo than text mode. It is the recommended mode for financial content scraping.

## 1. Setup

In [None]:
import asyncio
import sys
from pathlib import Path

if sys.platform.startswith("win"):
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

import pandas as pd
from financial_scraper import ScraperConfig, ScraperPipeline

## 2. Write a Query File

News queries work best with 4-8 words focusing on a specific event or topic.
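
The length guideline can be checked before a run. The helper below is a minimal sketch, not part of `financial_scraper`: `check_query_lengths` is a hypothetical name, and it assumes that blank lines and lines starting with `#` are ignored, as the comment line in the example file below suggests.

```python
def check_query_lengths(text: str, min_words: int = 4, max_words: int = 8) -> list[tuple[str, int]]:
    """Return (query, word_count) pairs that fall outside the suggested range."""
    flagged = []
    for line in text.splitlines():
        query = line.strip()
        # Assumption: blank lines and '#' comment lines are not queries
        if not query or query.startswith("#"):
            continue
        n_words = len(query.split())
        if not min_words <= n_words <= max_words:
            flagged.append((query, n_words))
    return flagged


sample = """# News search
crude oil futures price outlook
gold
"""
for query, n_words in check_query_lengths(sample):
    print(f"Outside 4-8 words: {query!r} ({n_words} words)")
```

An empty result means every query falls inside the suggested range; flagged queries will still run, they are just more likely to be too broad or too narrow.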

In [None]:
queries_path = Path("queries_news_example.txt")
queries_path.write_text(
    """# News search — recent financial articles
crude oil futures price outlook
gold price safe haven demand
Federal Reserve interest rate decision
nvidia earnings revenue AI demand
wheat commodity market supply
natural gas storage winter forecast
copper demand electric vehicle battery
"""
)
print(f"Wrote {queries_path} with 7 queries")

## 3. Configure and Run

Key settings for news search:

| Setting | Value | Why |
|---------|-------|-----|
| `search_type` | `"news"` | Recent articles from news outlets — less rate-limited |
| `max_results_per_query` | `15` | Good coverage without being excessive |
| `min_word_count` | `80` | Lower threshold — news articles can be shorter |
| `date_from` | `"2025-01-01"` | Filter out stale articles (optional) |
| `resume` | `True` | Safe to re-run if interrupted |

### Deduplication layers

The pipeline automatically applies two dedup layers:
1. **Exact** — SHA256 of first 2000 chars (catches identical republished articles)
2. **Fuzzy** — MinHash LSH (catches syndicated rewrites with minor edits, added disclaimers, different headers)
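
To make the two layers concrete, here is a minimal standard-library sketch of the idea. It is not the pipeline's implementation: the function names, the 3-word shingles, and the 64-hash signature are assumptions chosen for illustration, and it compares signatures directly instead of building an LSH index.

```python
import hashlib

PREFIX_CHARS = 2000  # prefix length taken from the description above
NUM_HASHES = 64      # assumed signature length, for illustration only
SHINGLE_WORDS = 3    # assumed shingle size, for illustration only


def exact_key(text: str) -> str:
    """Exact layer: SHA-256 of the first PREFIX_CHARS characters."""
    return hashlib.sha256(text[:PREFIX_CHARS].encode("utf-8")).hexdigest()


def shingles(text: str, k: int = SHINGLE_WORDS) -> set[str]:
    """Overlapping k-word windows, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> list[int]:
    """Fuzzy layer: for each salted hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


original = " ".join(f"word{i}" for i in range(200))
syndicated = original + " disclaimer: not investment advice"

print("Same exact key:", exact_key(original) == exact_key(syndicated))
print("Estimated Jaccard:", estimated_jaccard(
    minhash_signature(original), minhash_signature(syndicated)
))
```

The appended disclaimer changes the hashed prefix, so the exact layer treats the two texts as different, while the shingle sets still overlap almost entirely (true Jaccard 198/202). That gap is what the fuzzy layer is meant to close. An LSH index adds banding on top of these signatures so that candidate pairs can be found without comparing every pair of documents.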

In [None]:
output_dir = Path("./runs_news_example")
output_dir.mkdir(exist_ok=True)

output_parquet = output_dir / "news_search_results.parquet"
output_jsonl = output_dir / "news_search_results.jsonl"

config = ScraperConfig(
    queries_file=queries_path,
    search_type="news",
    max_results_per_query=15,
    min_word_count=80,
    favor_precision=True,
    # Optional date filtering — keeps only articles after this date
    date_from="2025-01-01",
    output_dir=output_dir,
    output_path=output_parquet,
    jsonl_path=output_jsonl,
    # Resume support — safe to re-run
    resume=True,
    checkpoint_file=output_dir / ".checkpoint.json",
    # Domain exclusion: a custom list is passed here; omit exclude_file to use the built-in list
    exclude_file=Path("../config/exclude_domains.txt"),
)

print("Config ready:")
print(f"  search_type    = {config.search_type}")
print(f"  max_results    = {config.max_results_per_query}")
print(f"  min_words      = {config.min_word_count}")
print(f"  date_from      = {config.date_from}")
print(f"  resume         = {config.resume}")
print(f"  output         = {config.output_path}")

In [None]:
pipeline = ScraperPipeline(config)
# Top-level await works in Jupyter/IPython; in a plain script use asyncio.run(pipeline.run())
await pipeline.run()

## 4. Inspect Results

In [None]:
if output_parquet.exists():
    df = pd.read_parquet(output_parquet)
    # Compute per-document word counts once and reuse them below
    word_counts = df["full_text"].str.split().str.len()
    print(f"Total documents: {len(df)}")
    print(f"Unique sources:  {df['source'].nunique()}")
    # The query text is stored in the `company` column
    print(f"Unique queries:  {df['company'].nunique()}")
    print(f"Total words:     {word_counts.sum():,}")
    print(f"Avg words/doc:   {word_counts.mean():.0f}")
else:
    print("No output file. Check the logs above for errors.")

In [None]:
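# Preview the first rows
from IPython.display import display

if output_parquet.exists():
    # Jupyter only auto-renders the last top-level expression of a cell,
    # so a DataFrame inside an `if` block must be displayed explicitly
    display(df[["company", "title", "source", "date"]].head(15))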

In [None]:
# Top domains by article count
if output_parquet.exists():
    print("Top sources:")
    print(df["source"].value_counts().head(15).to_string())

In [None]:
# Results per query
if output_parquet.exists():
    print("Results per query:")
    for query, group in df.groupby("company"):
        avg_words = group["full_text"].str.split().str.len().mean()
        print(f"  {query}: {len(group)} articles, {avg_words:.0f} avg words")

In [None]:
# Date distribution
if output_parquet.exists():
    # Coerce to datetime so the .dt accessor works even if dates were stored as strings;
    # unparseable values become NaT and are dropped
    dates = pd.to_datetime(df["date"], errors="coerce").dropna()
    print(f"Articles with dates: {len(dates)}/{len(df)}")
    if len(dates) > 0:
        print(f"Date range: {dates.min()} to {dates.max()}")
        print("\nArticles per month:")
        print(dates.dt.to_period("M").value_counts().sort_index().to_string())

In [None]:
# Sample article
if output_parquet.exists() and len(df) > 0:
    row = df.iloc[0]
    print(f"Title:  {row['title']}")
    print(f"Source: {row['source']}")
    print(f"Query:  {row['company']}")
    print(f"Date:   {row['date']}")
    print(f"Words:  {len(row['full_text'].split())}")
    print("\n--- First 800 chars ---\n")
    print(row["full_text"][:800])

## 5. CLI Equivalent

The same run from the command line:

```bash
financial-scraper search \
    --queries-file queries_news_example.txt \
    --search-type news \
    --max-results 15 \
    --min-words 80 \
    --date-from 2025-01-01 \
    --resume \
    --output-dir ./runs_news_example \
    --jsonl
```

For a larger commodity run (50 queries) with stealth:

```bash
financial-scraper search \
    --queries-file config/commodities_50.txt \
    --search-type news \
    --max-results 20 \
    --stealth --resume \
    --output-dir ./runs \
    --exclude-file config/exclude_domains.txt \
    --jsonl
```

For 300+ queries, add Tor:

```bash
financial-scraper search \
    --queries-file config/commodities_300.txt \
    --search-type news \
    --stealth --use-tor --resume \
    --output-dir ./runs \
    --jsonl
```

## 6. Cleanup

In [None]:
# Uncomment to delete temporary files
# queries_path.unlink(missing_ok=True)
# import shutil; shutil.rmtree(output_dir, ignore_errors=True)