# FinancialNewsRAG Orchestrator Example
This notebook demonstrates the functionality of the `FinancialNewsRAG` orchestrator class, which integrates various components of the financial news RAG system.

## 1. Initialization
First, we need to initialize the `FinancialNewsRAG` orchestrator. This requires API keys for EODHD and Gemini, which are automatically loaded from environment variables through the centralized `Config` class. You can also specify custom paths for the SQLite database and ChromaDB persistence directory.

Make sure you have a `.env` file in your project root with `EODHD_API_KEY` and `GEMINI_API_KEY` set. Alternatively, you can pass these values directly to the constructor as overrides.

### Centralized Configuration
The system uses a centralized configuration approach where all settings are managed by the `Config` class in `config.py`. This class loads values from environment variables with sensible defaults. When initializing `FinancialNewsRAG`, any parameters you provide will override the values from the centralized configuration.
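
The precedence described above (explicit parameter first, then environment variable, then default) can be sketched as follows. This is only an illustration and not the actual `Config` implementation: `resolve_setting`, the `DATABASE_PATH` variable name, and the default value are hypothetical.

```python
import os

def resolve_setting(name, override=None, default=None, env=None):
    """Return the override if given, else the environment value, else the default.

    Hypothetical helper for illustration; not part of financial_news_rag.
    """
    env = os.environ if env is None else env
    if override is not None:
        return override
    return env.get(name, default)

# Simulated environment so the sketch does not depend on your real settings.
fake_env = {"EODHD_API_KEY": "key-from-env"}

from_env = resolve_setting("EODHD_API_KEY", env=fake_env)
from_override = resolve_setting("EODHD_API_KEY", override="key-from-arg", env=fake_env)
# DATABASE_PATH is an assumed variable name; it is absent, so the default applies.
from_default = resolve_setting("DATABASE_PATH", default="financial_news.db", env=fake_env)
print(from_env, from_override, from_default)
```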

In [None]:
import os
from financial_news_rag.orchestrator import FinancialNewsRAG

# Initialize the orchestrator
try:
    # Using a unique DB path for this example to avoid conflicts with other uses
    example_db_path = "financial_news_rag_example.db"
    example_chroma_persist_dir = "chroma_db_example"
    
    # The orchestrator will automatically load API keys from the Config class,
    # which reads from environment variables. If you want to override them, you can
    # pass them directly as parameters:
    # orchestrator = FinancialNewsRAG(
    #     eodhd_api_key="your_eodhd_key_here",
    #     gemini_api_key="your_gemini_key_here",
    #     db_path=example_db_path,
    #     chroma_persist_dir=example_chroma_persist_dir
    # )
    
    # For this example, we'll just override the DB paths and let Config handle the API keys:
    orchestrator = FinancialNewsRAG(
        db_path=example_db_path,
        chroma_persist_dir=example_chroma_persist_dir
    )
    
    print("FinancialNewsRAG orchestrator initialized successfully.")
    print(f"SQLite DB will be created/used at: {os.path.abspath(example_db_path)}")
    print(f"ChromaDB will persist data in: {os.path.abspath(example_chroma_persist_dir)}")
except ValueError as e:
    print(f"Initialization failed: {e}")
    print("Please ensure EODHD_API_KEY and GEMINI_API_KEY are set in your .env file or environment variables.")

## 2. Fetching and Storing Articles
The `fetch_and_store_articles` method fetches news articles from the EODHD API and stores them in the SQLite database. You can fetch articles by `tag` (e.g., 'TECHNOLOGY', 'MERGERS AND ACQUISITIONS') or by `symbol` (e.g., 'AAPL.US'). You can also restrict the date range with `from_date` and `to_date`, and cap the number of articles with `limit`.
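
As a rough sketch of how such arguments might be checked before a request is made, the hypothetical helper below validates them and drops unset filters. `build_fetch_request`, its rules (for example, requiring a tag or a symbol), and the `YYYY-MM-DD` date format are assumptions for illustration, not the orchestrator's actual behavior.

```python
from datetime import datetime

def build_fetch_request(tag=None, symbol=None, from_date=None, to_date=None, limit=50):
    """Validate fetch arguments and return them as a dict (hypothetical helper)."""
    if not tag and not symbol:
        raise ValueError("Provide a tag or a symbol.")
    # strptime raises ValueError if a date does not match the year-month-day pattern.
    start = datetime.strptime(from_date, "%Y-%m-%d") if from_date else None
    end = datetime.strptime(to_date, "%Y-%m-%d") if to_date else None
    if start and end and start > end:
        raise ValueError("from_date must not be later than to_date.")
    if limit < 1:
        raise ValueError("limit must be a positive integer.")
    request = {"tag": tag, "symbol": symbol, "from_date": from_date,
               "to_date": to_date, "limit": limit}
    # Drop unset filters so only the requested ones remain.
    return {key: value for key, value in request.items() if value is not None}

request = build_fetch_request(symbol="AAPL.US", from_date="2024-01-01",
                              to_date="2024-01-08", limit=20)
print(request)
```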

In [None]:
if 'orchestrator' in locals():
    # Example 1: Fetch articles by tag (here, 'MERGERS AND ACQUISITIONS')
    # Using a small limit for demonstration purposes
    print("Fetching articles by tag 'MERGERS AND ACQUISITIONS'...")
    fetch_results_tag = orchestrator.fetch_and_store_articles(tag="MERGERS AND ACQUISITIONS", limit=20)
    print(f"Tag fetch results: {fetch_results_tag}")
    
    # Example 2: Fetch articles by symbol (e.g., 'MSFT.US' for Microsoft)
    print("Fetching articles by symbol 'MSFT.US'...")
    fetch_results_symbol = orchestrator.fetch_and_store_articles(symbol="MSFT.US", limit=20)
    print(f"Symbol fetch results: {fetch_results_symbol}")
    
    # Example 3: Fetch articles with a date range
    # Note: EODHD free tier might have limitations on date ranges for news.
    # Using a recent date range for better chances of getting results.
    from datetime import datetime, timedelta
    to_date_str = datetime.now().strftime('%Y-%m-%d')
    from_date_str = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')
    print(f"Fetching articles for 'AAPL.US' from {from_date_str} to {to_date_str}...")
    fetch_results_date = orchestrator.fetch_and_store_articles(
        symbol="AAPL.US", 
        from_date=from_date_str, 
        to_date=to_date_str, 
        limit=20
    )
    print(f"Date range fetch results: {fetch_results_date}")
else:
    print("Orchestrator not initialized. Skipping fetch examples.")

## 3. Processing Articles
The `process_articles_by_status` method retrieves articles from the database based on their processing status (e.g., 'PENDING', 'FAILED') and processes their raw content. Processing involves cleaning HTML, extracting text, and validating the content. Successfully processed articles are updated in the database with the cleaned content and a 'SUCCESS' status.
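
The clean, extract, and validate steps can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not the project's text processor: `clean_article`, the `min_chars` threshold, and the rule that short content counts as 'FAILED' are all hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def clean_article(raw_html, min_chars=20):
    """Return (cleaned_text, status); min_chars is an assumed validation rule."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    text = " ".join(" ".join(parser.parts).split())
    status = "SUCCESS" if len(text) >= min_chars else "FAILED"
    return text, status

cleaned_text, status = clean_article(
    "<p>Microsoft  agreed to <b>acquire</b> the firm.</p><script>var x = 1;</script>"
)
print(status, cleaned_text)
```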

In [None]:
if 'orchestrator' in locals():
    # Process articles that are currently 'PENDING' (newly fetched)
    print("Processing 'PENDING' articles...")
    processing_results_pending = orchestrator.process_articles_by_status(status='PENDING', limit=40)
    print(f"Pending processing results: {processing_results_pending}")
    
    # Optionally, you could try to re-process 'FAILED' articles if any exist
    # print("Attempting to re-process 'FAILED' articles...")
    # processing_results_failed = orchestrator.process_articles_by_status(status='FAILED', limit=5)
    # print(f"Failed processing results: {processing_results_failed}")
else:
    print("Orchestrator not initialized. Skipping processing examples.")

## 4. Embedding Articles
The `embed_processed_articles` method takes articles that have been successfully processed, generates embeddings for their content, and stores those embeddings in ChromaDB. The `status` argument selects which articles to embed: those whose embedding status is 'PENDING' (not yet embedded) or 'FAILED' (a previous attempt did not succeed).
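
This notebook later refers to article chunks and a `max_tokens_per_chunk` setting, which suggests that content is split before it is embedded. The sketch below shows the general idea with a simple word budget standing in for a token budget. `chunk_words` is a hypothetical helper, and the real system's tokenization and chunk boundaries may differ.

```python
def chunk_words(text, max_words_per_chunk=5):
    """Split text into consecutive chunks of at most max_words_per_chunk words.

    Words are used as a stand-in for tokens (an assumption for illustration).
    """
    if max_words_per_chunk < 1:
        raise ValueError("max_words_per_chunk must be at least 1.")
    words = text.split()
    return [
        " ".join(words[start:start + max_words_per_chunk])
        for start in range(0, len(words), max_words_per_chunk)
    ]

chunks = chunk_words("one two three four five six seven", max_words_per_chunk=3)
print(chunks)
```

Each chunk would then be embedded and stored separately, so a long article can match a query on any one of its parts.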

In [None]:
if 'orchestrator' in locals():
    # Embed articles whose content has been processed and are 'PENDING' embedding
    print("Embedding 'PENDING' (embedding status) articles...")
    embedding_results_pending = orchestrator.embed_processed_articles(status='PENDING', limit=20)
    print(f"Pending embedding results: {embedding_results_pending}")
    
    # Optionally, re-attempt embedding for articles that 'FAILED' previously
    # print("Re-attempting to embed 'FAILED' (embedding status) articles...")
    # embedding_results_failed = orchestrator.embed_processed_articles(status='FAILED', limit=5)
    # print(f"Failed embedding results: {embedding_results_failed}")
else:
    print("Orchestrator not initialized. Skipping embedding examples.")

## 5. Database Status
You can check the status of both the article database (SQLite) and the vector database (ChromaDB) using the following methods.

In [None]:
if 'orchestrator' in locals():
    # Get status of the article database (SQLite)
    print("Fetching article database status...")
    article_db_status = orchestrator.get_article_database_status()
    print(f"Article DB Status: {article_db_status}")
    
    # Get status of the vector database (ChromaDB)
    print("Fetching vector database status...")
    vector_db_status = orchestrator.get_vector_database_status()
    print(f"Vector DB Status: {vector_db_status}")
else:
    print("Orchestrator not initialized. Skipping database status examples.")

## 6. Searching Articles
The `search_articles` method allows you to search for articles relevant to a given query. It generates an embedding for the query, searches ChromaDB for similar article chunks, retrieves the corresponding articles from SQLite, and can optionally re-rank the results using a Gemini LLM.
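
The retrieval part of this pipeline (compare a query embedding with chunk embeddings, then collapse chunks back to articles) can be sketched with toy vectors. This is an illustration only: the real system delegates similarity search to ChromaDB, and `cosine_similarity`, `search_chunks`, and the two-dimensional embeddings are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_chunks(query_vec, chunks, n_results=5):
    """Rank articles by their best-matching chunk.

    chunks is a list of (article_id, embedding) pairs.
    """
    best = {}
    for article_id, vec in chunks:
        score = cosine_similarity(query_vec, vec)
        # Keep only the highest-scoring chunk per article.
        if score > best.get(article_id, float("-inf")):
            best[article_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n_results]

toy_chunks = [
    ("article-a", [1.0, 0.0]),
    ("article-a", [0.6, 0.8]),
    ("article-b", [0.0, 1.0]),
]
results = search_chunks([1.0, 0.0], toy_chunks, n_results=2)
print(results)
```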

In [None]:
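# Query the vector store here so this cell does not depend on the status cell having run
vector_db_status = orchestrator.get_vector_database_status() if 'orchestrator' in locals() else {}
if 'orchestrator' in locals() and vector_db_status.get('total_chunks', 0) > 0: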
    # Example 1: Basic search
    query1 = "latest advancements in AI by Microsoft"
    print(f"Searching for: '{query1}'...")
    search_results1 = orchestrator.search_articles(query=query1, n_results=5)
    print("Search Results 1 (basic):")
    for i, article in enumerate(search_results1):
        print(f"  {i+1}. Title: {article.get('title')}, Score: {article.get('similarity_score')}")
        # print(f"     URL: {article.get('url')}")
        # print(f"     Published: {article.get('published_at')}")
        # print(f"     Content Snippet: {article.get('processed_content', '')[:200]}...")
    
    # Example 2: Search with re-ranking
    # Re-ranking can provide more contextually relevant results but is slower.
    query2 = "Acquisitions in the tech industry related to safety and security"
    print(f"Searching for: '{query2}' with re-ranking...")
    search_results2 = orchestrator.search_articles(query=query2, n_results=5, rerank=True)
    print("Search Results 2 (re-ranked):")
    for i, article in enumerate(search_results2):
        # The re-ranker adds 'relevance_score'; fall back to 'similarity_score' if it is absent
        score = article.get('relevance_score', article.get('similarity_score'))
        print(f"  {i+1}. Title: {article.get('title')}, Score: {score}")
        print(f"     URL: {article.get('url')}")
        print(f"     Published: {article.get('published_at')}")
        print(f"     Content Snippet: {article.get('processed_content', '')[:200]}...")
    
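    # Example 3: Search with date filtering
    # Assuming some articles were published in the last few days
    # Build the bounds in UTC: the trailing "Z" denotes UTC, so local time would be mislabeled
    from datetime import datetime, timedelta, timezone
    now_utc = datetime.now(timezone.utc).replace(tzinfo=None)
    from_date_search = (now_utc - timedelta(days=3)).isoformat() + "Z"  # ISO 8601 format
    to_date_search = now_utc.isoformat() + "Z"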
    query3 = "market trends"
    print(f"Searching for: '{query3}' between {from_date_search} and {to_date_search}...")
    search_results3 = orchestrator.search_articles(
        query=query3, 
        n_results=5, 
        from_date_str=from_date_search, 
        to_date_str=to_date_search
    )
    print("Search Results 3 (date filtered):")
    for i, article in enumerate(search_results3):
        print(f"  {i+1}. Title: {article.get('title')}, Published: {article.get('published_at')}, Score: {article.get('similarity_score')}")
elif 'orchestrator' in locals():
    print("Vector database is empty. Skipping search examples. Please run fetch, process, and embed steps first.")
else:
    print("Orchestrator not initialized. Skipping search examples.")

## 7. Deleting Old Articles
The `delete_articles_older_than` method removes articles from both SQLite and ChromaDB that are older than a specified number of days. This is useful for managing data retention.
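
The age check behind such a retention rule can be sketched as a cutoff comparison. This is a hypothetical illustration: `ids_older_than` and the article dictionaries are invented here, and it assumes age is measured from the publication timestamp.

```python
from datetime import datetime, timedelta, timezone

def ids_older_than(articles, days, now=None):
    """Return the ids of articles published before now minus the given number of days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return [article["id"] for article in articles if article["published_at"] < cutoff]

# Fixed reference time so the example is reproducible.
reference_time = datetime(2024, 6, 10, tzinfo=timezone.utc)
sample_articles = [
    {"id": "old", "published_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
    {"id": "new", "published_at": datetime(2024, 6, 9, 12, tzinfo=timezone.utc)},
]
stale_ids = ids_older_than(sample_articles, days=1, now=reference_time)
print(stale_ids)
```

In a two-store setup like this one, the same ids would need to be removed from both the article database and the vector store to keep them consistent.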

In [None]:
if 'orchestrator' in locals():
    # Example: Delete articles older than 1 day (for demonstration)
    # A realistic retention window such as 365 days would be unlikely to delete anything here,
    # so this example uses 1 day to target the articles fetched above
    # (those published more than 1 day ago).
    # Be careful: this actually deletes the matching articles.
    # To keep them, use a larger value such as days=365.
    print("Attempting to delete articles older than 1 day (for demonstration)...")
    delete_results = orchestrator.delete_articles_older_than(days=1)
    print(f"Deletion results: {delete_results}")
    
    # Check status again after deletion
    print("Fetching article database status after potential deletion...")
    article_db_status_after_delete = orchestrator.get_article_database_status()
    print(f"Article DB Status: {article_db_status_after_delete}")
    
    print("Fetching vector database status after potential deletion...")
    vector_db_status_after_delete = orchestrator.get_vector_database_status()
    print(f"Vector DB Status: {vector_db_status_after_delete}")
else:
    print("Orchestrator not initialized. Skipping deletion examples.")

## 8. Closing Connections
Finally, the `close` method should be called to properly close database connections and release any other resources held by the orchestrator.
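
One way to make sure `close` runs even when an earlier step raises is `contextlib.closing`, which calls `close()` when the `with` block exits. The sketch below uses a stand-in class so it runs without API keys. `StubOrchestrator` and `do_work` are hypothetical, and the pattern is a suggestion rather than something the library requires.

```python
from contextlib import closing

class StubOrchestrator:
    """Stand-in object exposing a close() method, like the orchestrator."""

    def __init__(self):
        self.closed = False

    def do_work(self):
        raise RuntimeError("simulated failure")

    def close(self):
        self.closed = True

stub = StubOrchestrator()
try:
    # closing() calls stub.close() on exit, even if the body raises.
    with closing(stub) as resource:
        resource.do_work()
except RuntimeError as exc:
    print(f"Caught: {exc}")
print(f"Closed after error: {stub.closed}")
```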

In [None]:
if 'orchestrator' in locals():
    print("Closing orchestrator connections...")
    orchestrator.close()
    print("Orchestrator connections closed.")
    
    # Clean up the example database and chroma directory created by this notebook
    # You might want to comment this out if you want to inspect the files afterwards
    if os.path.exists(example_db_path):
        os.remove(example_db_path)
        print(f"Removed example SQLite DB: {example_db_path}")
    if os.path.exists(example_chroma_persist_dir):
        import shutil
        shutil.rmtree(example_chroma_persist_dir)
        print(f"Removed example ChromaDB directory: {example_chroma_persist_dir}")
else:
    print("Orchestrator not initialized. Skipping close example.")

## 9. Customizing Configuration

The system uses a centralized configuration approach through the `Config` class. There are several ways to customize the configuration:

1. **Environment Variables**: Set variables in your `.env` file or environment
2. **Direct Overrides**: Pass parameters directly to the `FinancialNewsRAG` constructor
3. **Custom Config Class**: Create a subclass of `Config` with your own settings

Here's an example of how to override configuration values:

In [None]:
# Example of creating an orchestrator with custom configuration
# (This is just for demonstration - don't actually run this code as it would create another instance)

'''
# Method 1: Override via constructor parameters
custom_orchestrator = FinancialNewsRAG(
    # API keys (override environment variables)
    eodhd_api_key="your_custom_eodhd_key",
    gemini_api_key="your_custom_gemini_key",
    
    # Database paths
    db_path="custom_database.db",
    chroma_persist_dir="custom_chroma_dir",
    
    # Text processing settings
    max_tokens_per_chunk=1024  # Override the default chunk size
)

# Method 2: Set environment variables before importing financial_news_rag
# You can set these in your .env file or programmatically:
import os
os.environ['EODHD_API_KEY'] = 'your_api_key'
os.environ['GEMINI_API_KEY'] = 'your_gemini_key'
os.environ['TEXTPROCESSOR_MAX_TOKENS_PER_CHUNK'] = '1024'
os.environ['CHROMA_DEFAULT_COLLECTION_NAME'] = 'my_custom_collection'
'''

This concludes the demonstration of the `FinancialNewsRAG` orchestrator. You can adapt these examples to build more complex workflows for your financial news analysis tasks.