# Example 4 — Earnings Call Transcripts

**Goal:** Download structured earnings call transcripts from Motley Fool by ticker symbol, extract speakers, prepared remarks, and Q&A sections, and save to Parquet.

The `transcripts` subcommand differs from `search` and `crawl` in its input, discovery method, and content type:

| | Search | Crawl | Transcripts |
|---|---|---|---|
| Input | Keyword queries | Seed URLs | Ticker symbols |
| Discovery | DuckDuckGo results | BFS link-following | Motley Fool sitemaps |
| Content | General articles | General articles + PDFs | Earnings call transcripts |
| Output | Parquet | Parquet | Parquet (same schema) |

Use `transcripts` when you need:
- Full earnings call transcripts for specific companies
- Structured data: speakers, prepared remarks, Q&A sections
- Historical transcripts by fiscal year and quarter

> **No extra dependencies required.** The transcripts module uses `requests` and `beautifulsoup4` (both already installed).

## 1. Setup

Make sure the package is installed:

```bash
cd financial_scraper
pip install -e .
```

In [None]:
from pathlib import Path

import pandas as pd
from financial_scraper.transcripts import TranscriptConfig, TranscriptPipeline

## 2. Configure and Run

Key settings for transcript downloading:

| Setting | Value | Why |
|---------|-------|-----|
| `tickers` | `("AAPL", "MSFT")` | Companies to download transcripts for |
| `year` | `2025` | Fiscal year (default: current year) |
| `quarters` | `("Q1",)` | Filter to Q1 only (empty = all quarters) |

### How discovery works

1. Scans Motley Fool monthly XML sitemaps (`fool.com/sitemap/YYYY/MM`)
2. Filters URLs matching the ticker in the URL slug (e.g. `-aapl-q1-2025-earnings`)
3. Scans the sitemaps for the target year and the following year, because Q4 transcripts are often published in January or February of the next year
4. Fetches each transcript page with a polite 1.5 s delay between requests
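
The discovery steps above can be sketched in a few lines. This is a minimal illustration, not the package's actual implementation: `sitemap_urls` and `matches_ticker` are hypothetical helpers, the `https://www.fool.com` host prefix is an assumption, and the example URLs are made up to show the slug pattern.

```python
import re


def sitemap_urls(year: int) -> list[str]:
    """Monthly sitemap URLs for the target year and the following year."""
    # Assumption: sitemaps live at https://www.fool.com/sitemap/YYYY/MM
    return [
        f"https://www.fool.com/sitemap/{y}/{m:02d}"
        for y in (year, year + 1)
        for m in range(1, 13)
    ]


def matches_ticker(url: str, ticker: str, year: int, quarters: tuple[str, ...] = ()) -> bool:
    """Return True if the URL slug names this ticker, year, and (optionally) quarter."""
    m = re.search(rf"-{re.escape(ticker.lower())}-(q[1-4])-(\d{{4}})-earnings", url.lower())
    if m is None:
        return False
    quarter, slug_year = m.group(1).upper(), int(m.group(2))
    # An empty quarters tuple means "keep all quarters"
    return slug_year == year and (not quarters or quarter in quarters)


# Made-up URLs, only to illustrate the slug filter
candidates = [
    "https://www.fool.com/earnings/call-transcripts/2025/01/30/apple-aapl-q1-2025-earnings-call-transcript/",
    "https://www.fool.com/earnings/call-transcripts/2025/05/01/apple-aapl-q2-2025-earnings-call-transcript/",
    "https://www.fool.com/investing/2025/01/30/why-apple-stock-jumped/",
]
hits = [u for u in candidates if matches_ticker(u, "AAPL", 2025, ("Q1",))]
print(len(sitemap_urls(2025)), "sitemaps;", len(hits), "matching transcript URL")
```

Matching on the slug alone keeps discovery cheap: no transcript page is fetched until its URL has already passed the ticker, year, and quarter filter.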

In [None]:
output_dir = Path("./runs_transcripts_example")
output_dir.mkdir(exist_ok=True)

output_parquet = output_dir / "transcripts.parquet"
output_jsonl = output_dir / "transcripts.jsonl"

config = TranscriptConfig(
    tickers=("AAPL", "MSFT"),
    year=2025,
    quarters=("Q1",),          # Empty tuple = all quarters
    output_dir=output_dir,
    output_path=output_parquet,
    jsonl_path=output_jsonl,
    checkpoint_file=output_dir / ".checkpoint.json",
)

print("Config ready:")
print(f"  tickers   = {config.tickers}")
print(f"  year      = {config.year}")
print(f"  quarters  = {config.quarters}")
print(f"  output    = {config.output_path}")

In [None]:
pipeline = TranscriptPipeline(config)
pipeline.run()  # Synchronous — no asyncio.run() needed

## 3. Inspect Results

In [None]:
if output_parquet.exists():
    df = pd.read_parquet(output_parquet)
    word_counts = df["full_text"].str.split().str.len()  # Words per transcript
    print(f"Total transcripts: {len(df)}")
    print(f"Unique tickers:    {df['company'].nunique()}")
    print(f"Total words:       {word_counts.sum():,}")
    print(f"Avg words/doc:     {word_counts.mean():.0f}")
    print()
    print("Columns:", list(df.columns))
else:
    print("No output file — check logs above for errors.")

In [None]:
# Preview all rows (a bare expression inside an `if` block is not displayed, so print it)
if output_parquet.exists():
    print(df[["company", "title", "source", "date"]].to_string())

In [None]:
# Word count per transcript
if output_parquet.exists():
    print("Words per transcript:")
    for _, row in df.iterrows():
        words = len(row["full_text"].split())
        print(f"  {row['company']} {row['title'].split('Transcript')[0].strip()}: {words:,} words")
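
When a run covers many tickers, a per-ticker summary is often easier to scan than a row-by-row loop. The sketch below uses a small made-up DataFrame with the same `company` and `full_text` columns as the output above; with real data, replace `sample` with `df`.

```python
import pandas as pd

# Made-up stand-in for the Parquet output (real runs: use `df` instead)
sample = pd.DataFrame(
    {
        "company": ["AAPL", "AAPL", "MSFT"],
        "full_text": ["good morning everyone", "thank you operator", "welcome to the call today"],
    }
)

# Count whitespace-separated words once, then aggregate per ticker
sample["words"] = sample["full_text"].str.split().str.len()
summary = sample.groupby("company")["words"].agg(["count", "sum", "mean"])
print(summary)
```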

In [None]:
# Preview the first transcript
if output_parquet.exists() and len(df) > 0:
    row = df.iloc[0]
    print(f"Ticker: {row['company']}")
    print(f"Title:  {row['title']}")
    print(f"Date:   {row['date']}")
    print(f"Source: {row['source']}")
    print(f"Words:  {len(row['full_text'].split())}")
    print("\n--- First 1000 chars ---\n")
    print(row["full_text"][:1000])

## 4. Extract Structured Content

The transcript extractor can also return structured data (speakers, prepared remarks, Q&A). Let's use the extraction module directly on the raw HTML.

In [None]:
import requests
from financial_scraper.transcripts.extract import extract_transcript

# Fetch a transcript page directly
if output_parquet.exists() and len(df) > 0:
    url = df.iloc[0]["link"]
    print(f"Fetching: {url}\n")
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    result = extract_transcript(resp.text)

    if result:
        print(f"Company:  {result.company}")
        print(f"Ticker:   {result.ticker}")
        print(f"Quarter:  {result.quarter}")
        print(f"Year:     {result.year}")
        print(f"Date:     {result.date}")
        print(f"\nParticipants ({len(result.participants)}):")
        for p in result.participants[:10]:
            print(f"  - {p}")
        print(f"\nSpeakers ({len(result.speakers)}):")
        for s in result.speakers:
            print(f"  - {s}")
        print(f"\nPrepared remarks: {len(result.prepared_remarks.split())} words")
        print(f"Q&A section:      {len(result.qa_section.split())} words")
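
To keep the structured fields alongside the main table, each result can be flattened into one row. The sketch below is an illustration only: it uses a `SimpleNamespace` with made-up values as a stand-in for the object returned by `extract_transcript`, and it assumes only the attributes printed above. `to_row` is a hypothetical helper, not part of the package.

```python
from types import SimpleNamespace

import pandas as pd

# Made-up stand-in for a real extraction result (all values are illustrative)
result_stub = SimpleNamespace(
    company="Example Corp",
    ticker="EXMP",
    quarter="Q1",
    year=2025,
    date="2025-01-30",
    participants=["Jane Doe, CEO", "John Roe, CFO"],
    speakers=["Jane Doe", "John Roe", "Operator"],
    prepared_remarks="Thank you for joining us today",
    qa_section="Our first question comes from the line of an analyst",
)


def to_row(result) -> dict:
    """Flatten one structured transcript into a dict of scalar summary fields."""
    return {
        "ticker": result.ticker,
        "quarter": result.quarter,
        "year": result.year,
        "n_participants": len(result.participants),
        "n_speakers": len(result.speakers),
        "prepared_words": len(result.prepared_remarks.split()),
        "qa_words": len(result.qa_section.split()),
    }


summary = pd.DataFrame([to_row(result_stub)])
print(summary.T)
```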

## 5. Using a Ticker File

For many tickers, use a file (one ticker per line, `#` comments allowed):

In [None]:
tickers_path = Path("tickers_example.txt")
tickers_path.write_text(
    """# Big Tech
AAPL
MSFT
GOOG
AMZN
META
NVDA
"""
)
print(f"Wrote {tickers_path} with 6 tickers")
print()
print("To run from CLI:")
print("  financial-scraper transcripts --tickers-file tickers_example.txt --year 2025 --output-dir ./runs")
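
The file format is simple enough to parse by hand if you need the ticker list elsewhere in a notebook. `read_tickers` below is a hypothetical helper shown for illustration; the CLI's own parser may differ, for example in how it treats inline comments or letter case.

```python
import tempfile
from pathlib import Path


def read_tickers(path: Path) -> tuple[str, ...]:
    """Read one ticker per line, skipping blank lines and `#` comments."""
    tickers = []
    for line in path.read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # Assumption: inline comments are stripped too
        if line:
            tickers.append(line.upper())
    return tuple(tickers)


# Self-contained demo using a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "tickers_demo.txt"
    demo_path.write_text("# Big Tech\nAAPL\nmsft\n\nNVDA  # chips\n")
    parsed = read_tickers(demo_path)

print(parsed)  # ('AAPL', 'MSFT', 'NVDA')
```

The returned tuple has the same shape as the `tickers` argument used in the configuration earlier in this example.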

## 6. CLI Equivalent

The same run from the command line:

```bash
financial-scraper transcripts \
    --tickers AAPL MSFT \
    --year 2025 \
    --quarters Q1 \
    --output-dir ./runs_transcripts_example \
    --jsonl
```

All quarters for a single ticker:

```bash
financial-scraper transcripts \
    --tickers NVDA \
    --year 2025 \
    --output-dir ./runs
```

Resume an interrupted download:

```bash
financial-scraper transcripts \
    --tickers AAPL MSFT GOOG AMZN META NVDA \
    --year 2025 \
    --resume \
    --output-dir ./runs
```
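
Resuming generally relies on a record of what has already been fetched. The package's checkpoint format is not documented here, so the following is only a sketch of the general idea: the JSON layout (a list of completed URLs), the `load_done` and `mark_done` helpers, and the example URLs are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_done(checkpoint: Path) -> set[str]:
    """Return the set of already-processed URLs, or an empty set on first run."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def mark_done(checkpoint: Path, done: set[str], url: str) -> None:
    """Record a finished URL and persist the checkpoint immediately."""
    done.add(url)
    checkpoint.write_text(json.dumps(sorted(done)))


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / ".checkpoint.json"
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

    # First run is "interrupted" after two URLs
    done = load_done(checkpoint)
    for url in urls[:2]:
        mark_done(checkpoint, done, url)

    # Second run reloads the checkpoint and only processes what is left
    done = load_done(checkpoint)
    remaining = [u for u in urls if u not in done]

print(remaining)  # ['https://example.com/c']
```

Writing the checkpoint after every completed item costs a little I/O, but it means an interruption loses at most the transcript that was in flight.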

## 7. Cleanup

In [None]:
# Uncomment to delete temporary files
# tickers_path.unlink(missing_ok=True)
# import shutil; shutil.rmtree(output_dir, ignore_errors=True)