# Financial Data Ingestion to Delta Lake

This notebook downloads financial data from Yahoo Finance and ingests it into our Delta Lake bronze layer. 
The process includes:
1. Downloading historical price data for selected stock symbols
2. Transforming and validating the data
3. Writing to Delta Lake with optimized writes and auto-compaction enabled

## Environment Setup

In [None]:
# Import required libraries
import sys
import os
from datetime import datetime, timedelta

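# Add the project root to the Python path so the agent can be imported.
# __file__ is not defined in a notebook, so resolve the root relative to the
# current working directory (assumes the notebook runs from a subfolder of the project).
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)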

from src.agents.ingest_agent import DataIngestionAgent, DataIngestionError

# Initialize the data ingestion agent
agent = DataIngestionAgent()

## Schema and Ingestion Statistics

Define the expected schema of the bronze price table and a dictionary for tracking ingestion statistics. The DataIngestionAgent class already configures its own schema and error handling; the schema below documents the column layout used when writing to Delta Lake.

In [None]:
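# Define schema for financial data
from pyspark.sql.types import (
    StructType, StructField, StringType, DateType, DoubleType, LongType
)

schema = StructType([
    StructField("ticker", StringType(), False),
    StructField("date", DateType(), False),
    StructField("open", DoubleType(), True),
    StructField("high", DoubleType(), True),
    StructField("low", DoubleType(), True),
    StructField("close", DoubleType(), True),
    StructField("adj_close", DoubleType(), True),
    StructField("volume", LongType(), True),
    # DateType keeps only the day of ingestion, not the time of day
    StructField("ingestion_timestamp", DateType(), False)
])

# DataIngestionError is imported from src.agents.ingest_agent in the setup cell.
# It is not redefined here: a local class with the same name would shadow the
# imported one, and `except DataIngestionError` would no longer match errors
# raised with the agent's class.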

# Initialize ingestion stats dictionary
ingestion_stats = {
    "successful_tickers": [],
    "failed_tickers": [],
    "total_rows": 0,
    "start_time": None,
    "end_time": None
}

## Define Target Symbols and Date Range

Specify the list of stock symbols to download and the time period for historical data.

In [None]:
# Define target stock symbols
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META']

# Set date range (default: 1 year of data)
end_date = datetime.now()
start_date = end_date - timedelta(days=365)

print(f"Configured to download data for {len(tickers)} symbols:")
print(f"Symbols: {', '.join(tickers)}")
print(f"Date range: {start_date.date()} to {end_date.date()}")
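
Ticker lists assembled by hand tend to pick up stray whitespace, mixed case, or duplicates. The following is a minimal sketch of a normalization step that could run before the download. `normalize_tickers` is a hypothetical helper introduced here for illustration; it is not part of `DataIngestionAgent`.

```python
def normalize_tickers(raw_tickers):
    """Return upper-cased, de-duplicated tickers in first-seen order."""
    seen = set()
    cleaned = []
    for raw in raw_tickers:
        ticker = raw.strip().upper()
        if not ticker:
            continue  # skip empty entries
        if ticker not in seen:
            seen.add(ticker)
            cleaned.append(ticker)
    return cleaned


# Illustrative input only: mixed case, padding, a duplicate, and an empty entry
example_tickers = normalize_tickers(['AAPL', ' msft', 'GOOGL', 'aapl', ''])
print(example_tickers)
```

Preserving first-seen order keeps the printed summary and any per-symbol logs in the order the list was written.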

## Download and Ingest Financial Data

Use the DataIngestionAgent to download historical price data for the selected symbols and ingest it into Delta Lake.

In [None]:
# Download price data using the agent
import pandas as pd  # needed for pd.to_datetime below

try:
    # Download the data
    price_data = agent.download_price_data(
        tickers=tickers,
        start_date=start_date,
        end_date=end_date
    )
    
    # Convert date column to datetime before ingestion
    price_data['date'] = pd.to_datetime(price_data['date'])
    
    # Ingest to Delta Lake
    agent.ingest_to_delta(price_data)
    
except DataIngestionError as e:
    print(f"Data ingestion failed: {str(e)}")
    raise
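
If a failure for one symbol should not abort the rest, the download can be run per ticker while recording the outcome. The sketch below fills a dictionary shaped like the `ingestion_stats` defined earlier. It assumes a hypothetical `download_one(ticker)` callable that returns a list of rows or raises. How `DataIngestionAgent` handles partial failures internally is not shown in this notebook.

```python
from datetime import datetime


def ingest_with_stats(tickers, download_one):
    """Download each ticker separately and record successes and failures.

    `download_one` is a hypothetical callable: ticker -> list of rows.
    """
    stats = {
        "successful_tickers": [],
        "failed_tickers": [],
        "total_rows": 0,
        "start_time": datetime.now(),
        "end_time": None,
    }
    for ticker in tickers:
        try:
            rows = download_one(ticker)
        except Exception as exc:  # record the failure and continue with the next ticker
            stats["failed_tickers"].append(ticker)
            print(f"{ticker}: download failed ({exc})")
            continue
        stats["successful_tickers"].append(ticker)
        stats["total_rows"] += len(rows)
    stats["end_time"] = datetime.now()
    return stats


# Stand-in downloader for illustration only: it fails for one made-up symbol.
def fake_download(ticker):
    if ticker == "BAD":
        raise ValueError("no data returned")
    return [{"ticker": ticker, "close": 1.0}] * 3


stats = ingest_with_stats(["AAPL", "BAD", "MSFT"], fake_download)
```

After the run, `stats["failed_tickers"]` lists the symbols to retry, and `stats["total_rows"]` can be compared against the row count reported by the Delta write.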

## Write to Delta Lake Bronze Layer

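Write the transformed data to the `finance_catalog.bronze.prices` Delta table in append mode, with schema merging, optimized writes, and auto-compaction enabled. If `agent.ingest_to_delta` already writes to this table, running this cell as well appends the same rows a second time.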

In [None]:
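# Names used in this cell that the earlier cells do not define
import logging
from pyspark.sql.types import StringType, DateType, DoubleType, LongType

logger = logging.getLogger(__name__)
spark = agent.spark  # Spark session held by the agent
combined_data = price_data.copy()  # assumes the pandas DataFrame downloaded above

# Create or get the catalog
spark.sql("CREATE CATALOG IF NOT EXISTS finance_catalog")
spark.sql("USE CATALOG finance_catalog")

spark.sql("CREATE DATABASE IF NOT EXISTS bronze")
spark.sql("USE bronze")

try:
    if 'adj_close' not in combined_data.columns:
        # Fall back to the unadjusted close when no adjusted close is present
        combined_data['adj_close'] = combined_data['close']
    if 'ingestion_timestamp' not in combined_data.columns:
        # Stamp the batch if the agent has not already done so
        combined_data['ingestion_timestamp'] = datetime.now()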
    
    selected_columns = ['ticker', 'date', 'open', 'high', 'low', 'close', 
                       'adj_close', 'volume', 'ingestion_timestamp']
    combined_data_filtered = combined_data[selected_columns]
    
    spark_df = spark.createDataFrame(combined_data_filtered)
    
    spark_df = spark_df.select(
        spark_df.ticker.cast(StringType()),
        spark_df.date.cast(DateType()),
        spark_df.open.cast(DoubleType()),
        spark_df.high.cast(DoubleType()),
        spark_df.low.cast(DoubleType()),
        spark_df.close.cast(DoubleType()),
        spark_df.adj_close.cast(DoubleType()),
        spark_df.volume.cast(LongType()),
        spark_df.ingestion_timestamp.cast(DateType())
    )
    
    table_name = "prices"
    (spark_df.write
     .format("delta")
     .mode("append")
     .option("mergeSchema", "true")
     .option("delta.autoOptimize.optimizeWrite", "true")
     .option("delta.autoOptimize.autoCompact", "true")
     .saveAsTable(f"finance_catalog.bronze.{table_name}"))
    
    row_count = spark_df.count()
    print(f"\nSuccessfully wrote {row_count} rows to finance_catalog.bronze.{table_name}")
    
except Exception as exc:
    logger.error("Failed to write to Delta table", exc_info=True)
    raise DataIngestionError(f"Delta table write failed: {str(exc)}")
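
Because the write uses `append` mode, a batch that contains the same (ticker, date) key more than once stores every copy. One way to guard against this within a single batch is to drop duplicate keys before writing. The sketch below uses plain Python dictionaries rather than Spark; `dedupe_rows` is a hypothetical helper and the prices are made-up values. It does not check against rows already stored in the table.

```python
def dedupe_rows(rows):
    """Keep the last row seen for each (ticker, date) key."""
    latest = {}
    for row in rows:
        # Later rows overwrite earlier ones that share the same key
        latest[(row["ticker"], row["date"])] = row
    return list(latest.values())


# Made-up values for illustration only
batch = [
    {"ticker": "AAPL", "date": "2024-01-02", "close": 100.0},
    {"ticker": "AAPL", "date": "2024-01-02", "close": 101.0},  # same key, later row
    {"ticker": "MSFT", "date": "2024-01-02", "close": 200.0},
]
unique_rows = dedupe_rows(batch)
```

On a Spark DataFrame, `spark_df.dropDuplicates(["ticker", "date"])` performs a comparable step, though it does not guarantee which of the duplicate rows is kept.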

## Data Validation

Perform basic validation of the ingested data using Spark SQL queries.
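
Alongside the aggregate queries, row-level invariants are cheap to check: the high should not be below the low, the open and close should fall within the low-high range, and volume should not be negative. The sketch below expresses these checks in plain Python on dictionaries. `find_invalid_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
PRICE_FIELDS = ("open", "high", "low", "close", "volume")


def find_invalid_rows(rows):
    """Return (row, reason) pairs for rows that break basic OHLCV invariants."""
    problems = []
    for row in rows:
        if any(row.get(field) is None for field in PRICE_FIELDS):
            problems.append((row, "missing value"))
        elif row["high"] < row["low"]:
            problems.append((row, "high below low"))
        elif not (row["low"] <= row["open"] <= row["high"]):
            problems.append((row, "open outside low-high range"))
        elif not (row["low"] <= row["close"] <= row["high"]):
            problems.append((row, "close outside low-high range"))
        elif row["volume"] < 0:
            problems.append((row, "negative volume"))
    return problems


# Made-up rows: the first is consistent, the others each break one rule
sample_rows = [
    {"ticker": "AAPL", "open": 10.0, "high": 12.0, "low": 9.0, "close": 11.0, "volume": 100},
    {"ticker": "MSFT", "open": 10.0, "high": 8.0, "low": 9.0, "close": 8.5, "volume": 100},
    {"ticker": "AMZN", "open": 10.0, "high": 12.0, "low": 9.0, "close": 11.0, "volume": -5},
    {"ticker": "META", "open": None, "high": 12.0, "low": 9.0, "close": 11.0, "volume": 100},
]
invalid = find_invalid_rows(sample_rows)
for row, reason in invalid:
    print(f"{row['ticker']}: {reason}")
```

Only the first failing rule is reported per row, which keeps the output short; collecting every violation would need separate `if` statements instead of the `elif` chain.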

In [None]:
# Read the table and create a view
spark = agent.spark
prices_df = spark.table(f"{agent.catalog}.{agent.database}.prices")
prices_df.createOrReplaceTempView("prices_view")

# Get basic statistics
print("\nBasic Statistics:")
print(f"Total records: {prices_df.count()}")
print("\nSchema:")
prices_df.printSchema()

# Check record counts by symbol
symbol_counts = spark.sql("""
    SELECT 
        ticker,
        COUNT(*) as record_count,
        MIN(date) as first_date,
        MAX(date) as last_date
    FROM prices_view
    GROUP BY ticker
    ORDER BY record_count DESC
""")

print("\nRecord counts by symbol:")
display(symbol_counts)

# Show sample of recent data
print("\nSample of recent data:")
display(spark.sql("""
    SELECT *
    FROM prices_view
    ORDER BY date DESC, ticker
    LIMIT 5
"""))