# Financial News RAG Orchestrator Usage Example

This notebook demonstrates how to use the `FinancialNewsRAG` orchestrator class, which provides a high-level interface to the financial-news-rag system. The orchestrator integrates all the low-level components:

- EODHD API client for fetching financial news articles
- Article Manager for storing and retrieving articles from SQLite
- Text Processor for cleaning and chunking article content
- Embeddings Generator for creating embeddings from text chunks
- ChromaDB Manager for storing and querying vector embeddings
- ReRanker for improving search results using the Gemini LLM

This example walks through the complete pipeline, from fetching articles to searching and retrieving relevant content.

## Setup

First, we need to import the necessary modules and initialize the orchestrator.

In [None]:
import os
import sys
from datetime import datetime, timedelta
import pandas as pd
from dotenv import load_dotenv
from pprint import pprint

# Add the project root to the path if needed
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), '..')))

# Import the orchestrator
from financial_news_rag.orchestrator import FinancialNewsRAG

### Load Environment Variables

The orchestrator requires API keys for the EODHD API and Gemini API. These should be set in a `.env` file in the project root.

In [None]:
# Load environment variables from .env file
load_dotenv()

# Check for required API keys
eodhd_api_key = os.getenv("EODHD_API_KEY")
gemini_api_key = os.getenv("GEMINI_API_KEY")

if not eodhd_api_key:
    print("⚠️ EODHD_API_KEY not found in environment variables.")
    print("Please create a .env file with your EODHD_API_KEY.")
else:
    print("✅ EODHD_API_KEY found.")
    
if not gemini_api_key:
    print("⚠️ GEMINI_API_KEY not found in environment variables.")
    print("Please create a .env file with your GEMINI_API_KEY.")
else:
    print("✅ GEMINI_API_KEY found.")
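
The two checks above follow the same pattern, so they could be folded into one small helper. The sketch below is illustrative only: `missing_keys` is a hypothetical name, not part of `financial_news_rag`, and it treats an empty string as missing.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Report every missing key in one pass instead of one if/else per key
missing = missing_keys(["EODHD_API_KEY", "GEMINI_API_KEY"])
if missing:
    print("Missing keys: " + ", ".join(missing))
else:
    print("All required API keys found.")
```

Passing a plain dict as `env` makes the helper easy to check without touching the real environment.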

### Initialize the Orchestrator

Now we can create an instance of the `FinancialNewsRAG` orchestrator.

In [None]:
# Initialize the orchestrator with default settings
# This will use the API keys from environment variables
# and default paths for the databases
try:
    rag = FinancialNewsRAG(
        # Optional: override API keys if needed
        # eodhd_api_key=eodhd_api_key,
        # gemini_api_key=gemini_api_key,
        
        # Optional: customize database paths
        # db_path="custom_financial_news.db",  # SQLite database path
        # chroma_persist_dir="custom_chroma_db",  # ChromaDB persistence directory
        
        # Optional: other settings
        # chroma_collection_name="custom_collection",  # Name for the ChromaDB collection
        max_tokens_per_chunk=2048,  # Maximum tokens per text chunk
    )
    print("✅ FinancialNewsRAG orchestrator initialized successfully.")
except Exception as e:
    print(f"❌ Error initializing orchestrator: {e}")
    raise  # Later cells need a working `rag` instance, so stop here

## 1. Fetching and Storing Articles

The first step in the RAG pipeline is to fetch articles from the EODHD API and store them in the SQLite database. We can fetch articles by tag or symbol.

### Fetch Articles by Tag

In [None]:
# Define date range for the fetch operation
today = datetime.now()
one_week_ago = today - timedelta(days=7)

# Format dates as YYYY-MM-DD
from_date = one_week_ago.strftime("%Y-%m-%d")
to_date = today.strftime("%Y-%m-%d")
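
If the same window is needed in several places, the date arithmetic above can be wrapped in a helper. This is a minimal sketch; `date_window` is a hypothetical name, and the fixed `end` date is only there to make the example reproducible.

```python
from datetime import date, timedelta

def date_window(days, end=None):
    """Return (from_date, to_date) as YYYY-MM-DD strings for the `days` days up to `end`."""
    end = date.today() if end is None else end
    start = end - timedelta(days=days)
    # date.isoformat() yields the same YYYY-MM-DD layout as strftime("%Y-%m-%d")
    return start.isoformat(), end.isoformat()

example_from, example_to = date_window(7, end=date(2024, 3, 8))
print(example_from, example_to)  # 2024-03-01 2024-03-08
```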

In [None]:
print(f"Fetching technology news articles from {from_date} to {to_date}...")

# Fetch articles with the TECHNOLOGY tag
result = rag.fetch_and_store_articles(
    tag="TECHNOLOGY",  # News category tag
    from_date=from_date,
    to_date=to_date,
    limit=10  # Maximum number of articles to fetch
)

print(f"\nFetch operation completed with status: {result['status']}")
print(f"Articles fetched: {result['articles_fetched']}")
print(f"Articles stored: {result['articles_stored']}")

if result['errors']:
    print("\nErrors encountered:")
    for error in result['errors']:
        print(f"- {error}")

### Fetch Articles by Symbol

In [None]:
# Fetch articles for a single stock symbol
print(f"Fetching news for AAPL.US from {from_date} to {to_date}...")

result = rag.fetch_and_store_articles(
    symbol="AAPL.US",  # Ticker in EODHD format (TICKER.EXCHANGE)
    from_date=from_date,
    to_date=to_date,
    limit=20  # Maximum number of articles to fetch
)

print(f"\nFetch operation completed with status: {result['status']}")
print(f"Articles fetched: {result['articles_fetched']}")
print(f"Articles stored: {result['articles_stored']}")
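
The call above passes one symbol. One way to cover several tickers while keeping per-symbol error reporting is to loop over them and total the counts. The sketch below assumes only the result keys already used in this notebook (`articles_fetched`, `articles_stored`, `errors`); `fetch_for_symbols` and `fake_fetch` are hypothetical names, and the stub stands in for `rag.fetch_and_store_articles` so the example runs offline.

```python
def fetch_for_symbols(fetch, symbols, **kwargs):
    """Call `fetch(symbol=..., **kwargs)` once per symbol and total the counts."""
    totals = {"articles_fetched": 0, "articles_stored": 0, "errors": []}
    for symbol in symbols:
        result = fetch(symbol=symbol, **kwargs)
        totals["articles_fetched"] += result.get("articles_fetched", 0)
        totals["articles_stored"] += result.get("articles_stored", 0)
        # Prefix each error with its symbol so failures stay traceable
        totals["errors"].extend(f"{symbol}: {err}" for err in result.get("errors", []))
    return totals

# Stub with made-up counts and status labels (assumptions, not real API output)
def fake_fetch(symbol, **kwargs):
    if symbol == "BAD.US":
        return {"status": "FAILED", "articles_fetched": 0,
                "articles_stored": 0, "errors": ["unknown symbol"]}
    return {"status": "SUCCESS", "articles_fetched": 3,
            "articles_stored": 2, "errors": []}

totals = fetch_for_symbols(fake_fetch, ["AAPL.US", "MSFT.US", "BAD.US"], limit=20)
print(totals)
```

With a live orchestrator, the first argument would be `rag.fetch_and_store_articles`, with `from_date`, `to_date`, and `limit` passed as keyword arguments.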

## 2. Processing Article Text

After fetching and storing the articles, we need to process the raw content to clean and normalize the text for embedding.

In [None]:
# Process pending articles (those with status_text_processing = 'PENDING')
print("Processing pending articles...")

result = rag.process_pending_articles(limit=5)  # Maximum number of pending articles to process

print(f"\nProcessing completed with status: {result['status']}")
print(f"Articles successfully processed: {result['articles_processed']}")
print(f"Articles that failed processing: {result['articles_failed']}")

if result['errors']:
    print("\nErrors encountered during processing:")
    for error in result['errors']:
        print(f"- {error}")
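
The orchestrator's own cleaning and chunking logic is not shown in this notebook. As a rough illustration of the general idea only, the sketch below strips HTML, normalizes whitespace, and splits text into fixed-size chunks. It counts whitespace-separated words as a stand-in for tokens, which is an assumption; `clean_text` and `chunk_words` are hypothetical names.

```python
import re
from html import unescape

def clean_text(raw):
    """Strip HTML tags, decode entities, and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()

def chunk_words(text, max_words):
    """Split text into chunks of at most `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

raw = "<p>Apple &amp; Microsoft\n shares   rose on AI demand.</p>"
cleaned = clean_text(raw)
print(cleaned)
print(chunk_words(cleaned, max_words=4))
```

A real tokenizer would give different chunk boundaries than a word count, so `max_words` here is not equivalent to `max_tokens_per_chunk`.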

### Reprocess Articles with Failed Text Processing

In [None]:
# Get any articles that failed text processing
failed_articles = rag.get_failed_text_processing_articles()
print(f"Found {len(failed_articles)} articles with failed text processing.")

if failed_articles:
    print("\nAttempting to reprocess failed articles...")
    result = rag.reprocess_failed_articles()
    
    print(f"\nReprocessing completed with status: {result['status']}")
    print(f"Articles successfully reprocessed: {result['articles_reprocessed']}")
    print(f"Articles that failed reprocessing: {result['articles_failed']}")

## 3. Generating and Storing Embeddings

Now that we have processed articles, we can generate embeddings for them and store them in ChromaDB.

In [None]:
# Generate embeddings for processed articles
print("Generating embeddings for processed articles...")

result = rag.embed_processed_articles(limit=10)  # Maximum number of processed articles to embed

print(f"\nEmbedding generation completed with status: {result['status']}")
print(f"Articles successfully embedded: {result['articles_embedded']}")
print(f"Articles that failed embedding: {result['articles_failed']}")

if result['errors']:
    print("\nErrors encountered during embedding:")
    for error in result['errors']:
        print(f"- {error}")

### Re-embed Articles with Failed Embedding

In [None]:
# Get any articles that failed embedding
failed_articles = rag.get_failed_embedding_articles()
print(f"Found {len(failed_articles)} articles with failed embedding.")

if failed_articles:
    print("\nAttempting to re-embed failed articles...")
    result = rag.re_embed_failed_articles()
    
    print(f"\nRe-embedding completed with status: {result['status']}")
    print(f"Articles successfully re-embedded: {result['articles_reembedded']}")
    print(f"Articles that failed re-embedding: {result['articles_failed']}")

## 4. Checking Database Status

We can check the status of both the article database (SQLite) and the vector database (ChromaDB).

In [None]:
# Get article database status
print("Checking article database status...")
article_db_status = rag.get_article_database_status()

print("\nArticle Database Status:")
print(f"Total articles: {article_db_status['total_articles']}")

print("\nText processing status:")
for status, count in article_db_status['text_processing_status'].items():
    print(f"  {status}: {count}")

print("\nEmbedding status:")
for status, count in article_db_status['embedding_status'].items():
    print(f"  {status}: {count}")

print("\nArticles by tag:")
for tag, count in article_db_status['articles_by_tag'].items():
    print(f"  {tag}: {count}")

print("\nArticles by symbol:")
for symbol, count in article_db_status['articles_by_symbol'].items():
    print(f"  {symbol}: {count}")

print(f"\nDate range: {article_db_status['date_range']['oldest_article']} to {article_db_status['date_range']['newest_article']}")
print(f"API calls: {article_db_status['api_calls']['total_calls']} (retrieved {article_db_status['api_calls']['total_articles_retrieved']} articles total)")
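
Raw counts are easier to read as shares of the total. A minimal sketch, assuming a `{status: count}` mapping like the ones printed above; `status_shares` is a hypothetical helper, and apart from `PENDING` the status labels in the sample data are made up for illustration.

```python
def status_shares(counts):
    """Convert a {status: count} mapping into {status: fraction of the total}."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero on an empty database
        return {status: 0.0 for status in counts}
    return {status: count / total for status, count in counts.items()}

# Sample data only; real values come from get_article_database_status()
example_counts = {"PENDING": 5, "SUCCESS": 14, "FAILED": 1}
for status, share in status_shares(example_counts).items():
    print(f"  {status}: {share:.0%}")
```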

In [None]:
# Get vector database status
print("Checking vector database status...")
vector_db_status = rag.get_vector_database_status()

print("\nVector Database Status:")
print(f"Collection name: {vector_db_status['collection_name']}")
print(f"Total chunks: {vector_db_status['total_chunks']}")
print(f"Unique articles: {vector_db_status['unique_articles']}")
print(f"Persistence location: {vector_db_status['persist_directory']}")

## 5. Searching for Articles

Now that we have articles stored and embedded, we can search for relevant content using natural language queries.

### Basic Search (No Re-ranking)

In [None]:
# Perform a basic search using semantic similarity only
query = "Latest artificial intelligence innovations in tech companies"
print(f"Searching with query: '{query}'")

results = rag.search_articles(
    query=query,
    n_results=3,  # Return top 3 results
    rerank=False  # No re-ranking
)

print(f"\nFound {len(results)} relevant articles:\n")

for i, article in enumerate(results):
    print(f"Result {i+1}: {article.get('title', 'Untitled')}")
    print(f"  URL: {article.get('url', 'No URL')}")
    print(f"  Published: {article.get('published_at', 'Unknown date')}")
    print(f"  Similarity score: {article.get('similarity_score', 0):.4f}")
    
    # Print a short preview of the content
    content = article.get('processed_content', 'No content')
    preview = content[:300] + '...' if len(content) > 300 else content
    print(f"  Preview: {preview}\n")
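
To make "semantic similarity" concrete, the sketch below ranks toy vectors by cosine similarity to a query vector. It illustrates the general technique only: the three-dimensional vectors are invented, and the distance metric actually configured for the ChromaDB collection is not shown in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k):
    """Rank (label, vector) pairs by similarity to `query_vec`, best first."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented toy embeddings; real embeddings have hundreds of dimensions
toy_chunks = [
    ("chip earnings", [0.9, 0.1, 0.0]),
    ("AI product launch", [0.1, 0.9, 0.2]),
    ("bond yields", [0.0, 0.1, 0.9]),
]
toy_query = [0.2, 0.8, 0.1]
for label, score in top_k(toy_query, toy_chunks, k=2):
    print(f"{label}: {score:.4f}")
```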

### Search with Re-ranking

Re-ranking uses the Gemini LLM to improve search results by considering semantic meaning beyond vector similarity.

In [None]:
# Search with re-ranking using Gemini LLM
query = "Impact of generative AI on financial markets"
print(f"Searching with query: '{query}' (with re-ranking)")
preview_length = 500  # Length of the preview to display

results = rag.search_articles(
    query=query,
    n_results=3,  # Return top 3 results
    rerank=True   # Apply re-ranking
)

print(f"\nFound {len(results)} relevant articles:\n")

for i, article in enumerate(results):
    print(f"Result {i+1}: {article.get('title', 'Untitled')}")
    print(f"  URL: {article.get('url', 'No URL')}")
    print(f"  Published: {article.get('published_at', 'Unknown date')}")
    print(f"  Similarity score: {article.get('similarity_score', 0):.4f}")
    print(f"  Re-rank score: {article.get('rerank_score', 0):.4f}")
    
    # Print a short preview of the content
    content = article.get('processed_content', 'No content')
    preview = content[:preview_length] + '...' if len(content) > preview_length else content
    print(f"  Preview: {preview}\n")
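
The retrieve-then-re-rank pattern can be sketched without an LLM by plugging in any scoring function. In the example below, `keyword_overlap` is a deliberately crude placeholder for the Gemini-based scorer, and `rerank` is a hypothetical helper; neither reflects how the package's ReRanker is implemented.

```python
def rerank(query, candidates, score_fn, top_n):
    """Re-score candidates with `score_fn(query, text)` and return the best `top_n`."""
    rescored = [
        {**c, "rerank_score": score_fn(query, c["processed_content"])}
        for c in candidates  # copies each dict, so the input list is left untouched
    ]
    rescored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return rescored[:top_n]

def keyword_overlap(query, text):
    """Placeholder scorer: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

# Invented candidates, ordered as a vector search might return them
candidates = [
    {"title": "A", "similarity_score": 0.82,
     "processed_content": "chip makers report earnings"},
    {"title": "B", "similarity_score": 0.79,
     "processed_content": "generative ai reshapes financial markets"},
]
reranked = rerank("generative AI financial markets", candidates, keyword_overlap, top_n=2)
print([c["title"] for c in reranked])
```

The second stage can reverse the order produced by vector similarity, which is why the results carry both a `similarity_score` and a `rerank_score`.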

### Compare Search Results: With and Without Re-ranking

Let's see how re-ranking affects the order and quality of search results.

In [None]:
# Compare search results with and without re-ranking
query = "AI applications in financial forecasting"
print(f"Comparing search results for query: '{query}'")

# Search without re-ranking
basic_results = rag.search_articles(
    query=query,
    n_results=5,
    rerank=False
)

# Search with re-ranking
reranked_results = rag.search_articles(
    query=query,
    n_results=5,
    rerank=True
)

# Create a comparison table
comparison_data = []

for i in range(max(len(basic_results), len(reranked_results))):
    basic_title = basic_results[i].get('title', '-') if i < len(basic_results) else '-'
    basic_similarity = basic_results[i].get('similarity_score', 0) if i < len(basic_results) else 0
    
    reranked_title = reranked_results[i].get('title', '-') if i < len(reranked_results) else '-'
    reranked_similarity = reranked_results[i].get('similarity_score', 0) if i < len(reranked_results) else 0
    reranked_score = reranked_results[i].get('rerank_score', 0) if i < len(reranked_results) else 0
    
    comparison_data.append({
        "Rank": i+1,
        "Basic Search Title": basic_title,
        "Basic Similarity": round(basic_similarity, 4),
        "Reranked Title": reranked_title,
        "Reranked Similarity": round(reranked_similarity, 4),
        "Rerank Score": round(reranked_score, 4),
    })

# Display as a DataFrame
comparison_df = pd.DataFrame(comparison_data)
comparison_df
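
The side-by-side table shows what sits at each rank but not how far a given article moved. A small sketch that tracks movement per article, assuming each result dict carries a `url_hash` key (the deletion example in section 6 reads one); `rank_changes` is a hypothetical helper and the data is invented.

```python
def rank_changes(before, after, key="url_hash"):
    """Map each item's key to (rank_before, rank_after); None where an item is absent."""
    before_rank = {item[key]: i + 1 for i, item in enumerate(before)}
    after_rank = {item[key]: i + 1 for i, item in enumerate(after)}
    # Keep the original order first, then append items that only appear afterwards
    all_keys = list(before_rank) + [k for k in after_rank if k not in before_rank]
    return {k: (before_rank.get(k), after_rank.get(k)) for k in all_keys}

toy_basic = [{"url_hash": "a1"}, {"url_hash": "b2"}, {"url_hash": "c3"}]
toy_reranked = [{"url_hash": "c3"}, {"url_hash": "a1"}, {"url_hash": "d4"}]
for url_hash, (old, new) in rank_changes(toy_basic, toy_reranked).items():
    print(f"{url_hash}: {old} -> {new}")
```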

## 6. Deleting Article Data

If needed, we can delete articles and their associated embeddings from the databases.

In [None]:
# Example: delete an article by its URL hash.
# Here the hash is taken from the top result of the most recent search above.
# Note: this removes the article from SQLite and its embeddings from ChromaDB.
if results:
    article_to_delete = results[0]['url_hash']
    print(f"Deleting article with URL hash: {article_to_delete}")
    
    result = rag.delete_article_data(article_to_delete)
    
    print(f"\nDelete operation completed with status: {result['status']}")
    print(f"Message: {result['message']}")
    print(f"Article deleted from SQLite: {result['article_deleted']}")
    print(f"Embeddings deleted from ChromaDB: {result['embeddings_deleted']}")
else:
    print("No articles available to demonstrate deletion.")

## 7. Clean Up

When finished, it's important to properly close database connections.

In [None]:
# Close database connections
rag.close()
print("Successfully closed all database connections.")
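
If a cell raises before `close()` is reached, the connections stay open. The standard library's `contextlib.closing` calls `close()` on exit for any object that has one, even when an exception occurs. The sketch below uses a stand-in class; whether `FinancialNewsRAG` also supports the `with` statement directly is not shown in this notebook.

```python
from contextlib import closing

class DummyResource:
    """Stand-in for any object exposing close(), such as the orchestrator."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

resource = DummyResource()
try:
    with closing(resource):
        raise RuntimeError("simulated failure mid-pipeline")
except RuntimeError:
    pass

print(resource.closed)  # True: close() ran despite the exception
```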

## Summary

In this notebook, we've demonstrated the complete workflow using the `FinancialNewsRAG` orchestrator:

1. **Initialization**: Setting up the orchestrator with API keys and configuration
2. **Article Fetching**: Retrieving articles by tag and symbol from the EODHD API
3. **Text Processing**: Cleaning and normalizing article content
4. **Embedding Generation**: Creating vector representations of article chunks
5. **Database Status**: Checking the state of both SQLite and ChromaDB
6. **Semantic Search**: Finding relevant articles using natural language queries
7. **Re-ranking**: Improving search results using the Gemini LLM
8. **Data Management**: Deleting articles and embeddings when no longer needed

The orchestrator provides a single high-level interface to the financial-news-rag system: it coordinates the interactions between components and reports status and errors for each operation.