# **Data Extraction Using OpenAI API (GPT-4)**

## Note: The description of the PostgreSQL database setup is repeated here

### **Setting Up the PostgreSQL Database**

To store the extracted and processed financial data, we create a dedicated PostgreSQL database and define a schema that supports both data storage and statistical analysis.

#### 1. Create the Database

First, create the database named `financial_data`:

```sql
CREATE DATABASE financial_data;
```

#### 2. Connect to It

```sql
\c financial_data
```

#### 3. Create the Main Table: `financial_data`

The `financial_data` table stores each investment record with both required and optional fields. It includes financial metrics, identifiers and classification metadata, and audit timestamps; indexes for performance are added in a later step:
```sql
CREATE TABLE IF NOT EXISTS financial_data (
    id SERIAL PRIMARY KEY,
    as_of_date DATE,
    original_security_name VARCHAR(255),
    investment_in_original DECIMAL(18, 2),
    investment_in DECIMAL(18, 2),
    investment_in_prior DECIMAL(18, 2),
    currency VARCHAR(3),
    sector VARCHAR(100),
    risk_rating VARCHAR(50),
    maturity_date DATE,
    yield_percentage DECIMAL(6, 2),
    isin VARCHAR(20),
    cusip VARCHAR(20),
    asset_class VARCHAR(50),
    country VARCHAR(100),
    region VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

This table is designed to store structured financial investment data. Below is a column-by-column explanation of what each field does, along with what the data types like `VARCHAR` and `DECIMAL` actually mean.

---

#### Column-by-Column Breakdown

- **`id`** – A unique, auto-incrementing number for each row. `SERIAL` automatically generates values like 1, 2, 3, etc., and acts as the primary key to uniquely identify each record.

- **`as_of_date`** – Stores the date when the investment data was recorded. The `DATE` type is used to handle standard calendar dates (e.g., `2024-03-31`).

- **`original_security_name`** – Stores the full name of the investment or asset (e.g., "US Treasury Bond 2026"). `VARCHAR(255)` means it can hold up to 255 characters of text.

- **`investment_in_original`** – The original amount of money that was invested. `DECIMAL(18, 2)` allows up to 18 digits total, with 2 digits after the decimal point (e.g., `1000000.00`), ensuring accurate storage of monetary values.

- **`investment_in`** – The current value of the investment. Uses `DECIMAL(18, 2)` for high-precision financial data.

- **`investment_in_prior`** – The value of the investment from a previous reporting period. Also uses `DECIMAL(18, 2)`.

- **`currency`** – Stores the 3-letter currency code (e.g., USD, EUR). `VARCHAR(3)` allows up to 3 characters.

- **`sector`** – Describes the investment's industry sector (e.g., "Technology", "Government"). `VARCHAR(100)` means it can store up to 100 characters.

- **`risk_rating`** – Describes the risk level of the investment (e.g., "Low", "Moderate", "High"). `VARCHAR(50)` allows up to 50 characters.

- **`maturity_date`** – Indicates when the investment is expected to mature. Uses the `DATE` type to store standard dates.

- **`yield_percentage`** – Represents the investment's annual return rate (e.g., `4.25%`). `DECIMAL(6, 2)` allows up to 6 digits total, including 2 after the decimal point (max value `9999.99`).

- **`isin`** – The ISIN (International Securities Identification Number) code for the asset. `VARCHAR(20)` supports standard ISIN formatting.

- **`cusip`** – The CUSIP (Committee on Uniform Securities Identification Procedures) code, used for US securities. Stored as `VARCHAR(20)`.

- **`asset_class`** – Describes the type of asset (e.g., equity, bond, real estate). `VARCHAR(50)` accommodates common classifications.

- **`country`** – Country of risk, origin, or domicile for the asset. Stored as `VARCHAR(100)`.

- **`region`** – Geographic or market region (e.g., "North America", "EMEA"). Also stored as `VARCHAR(100)`.

- **`created_at`** – A timestamp showing when the record was first created. `TIMESTAMP DEFAULT CURRENT_TIMESTAMP` automatically stores the current time when a row is inserted.

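- **`updated_at`** – A timestamp for when the record was last updated. It also uses `TIMESTAMP DEFAULT CURRENT_TIMESTAMP`, but the default applies only when a row is inserted. PostgreSQL does not refresh it on `UPDATE`, so it must be set by the application or by a trigger.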
 
---

This schema ensures precise handling of financial data, proper storage of descriptive fields, and timestamps that record when rows are created and (with a trigger or application logic) modified.

#### Add Indexes for Performance 

To speed up queries, especially those filtering by date or investment name, create indexes:

```sql
CREATE INDEX idx_financial_data_as_of_date ON financial_data(as_of_date);
CREATE INDEX idx_financial_data_security_name ON financial_data(original_security_name);
```
---

#### Create a View for Statistics

```sql
CREATE OR REPLACE VIEW financial_data_stats AS
SELECT
    COUNT(*) AS total_records,

    
    SUM(CASE WHEN as_of_date IS NOT NULL THEN 1 ELSE 0 END) AS as_of_date_count,
    SUM(CASE WHEN original_security_name IS NOT NULL THEN 1 ELSE 0 END) AS original_security_name_count,
    SUM(CASE WHEN investment_in_original IS NOT NULL THEN 1 ELSE 0 END) AS investment_in_original_count,
    SUM(CASE WHEN investment_in IS NOT NULL THEN 1 ELSE 0 END) AS investment_in_count,
    SUM(CASE WHEN investment_in_prior IS NOT NULL THEN 1 ELSE 0 END) AS investment_in_prior_count,
    SUM(CASE WHEN currency IS NOT NULL THEN 1 ELSE 0 END) AS currency_count,
    COUNT(DISTINCT currency) AS currency_count_distinct,

    -- Additional fields (not mandatory; can be ignored for this assignment)
    SUM(CASE WHEN sector IS NOT NULL THEN 1 ELSE 0 END) AS sector_count,
    SUM(CASE WHEN risk_rating IS NOT NULL THEN 1 ELSE 0 END) AS risk_rating_count,
    SUM(CASE WHEN maturity_date IS NOT NULL THEN 1 ELSE 0 END) AS maturity_date_count,
    SUM(CASE WHEN yield_percentage IS NOT NULL THEN 1 ELSE 0 END) AS yield_percentage_count,
    SUM(CASE WHEN isin IS NOT NULL THEN 1 ELSE 0 END) AS isin_count,
    SUM(CASE WHEN cusip IS NOT NULL THEN 1 ELSE 0 END) AS cusip_count,
    SUM(CASE WHEN asset_class IS NOT NULL THEN 1 ELSE 0 END) AS asset_class_count,
    SUM(CASE WHEN country IS NOT NULL THEN 1 ELSE 0 END) AS country_count,
    SUM(CASE WHEN region IS NOT NULL THEN 1 ELSE 0 END) AS region_count

FROM financial_data;

```
This view helps verify data completeness and consistency, especially during extraction and validation.
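
To illustrate what the view reports, here is a minimal sketch that runs the same `SUM(CASE WHEN ... IS NOT NULL ...)` pattern against an in-memory SQLite table. SQLite is used only as a stand-in so the example runs without a PostgreSQL server; the trimmed two-column table and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for portability)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE financial_data ("
    "id INTEGER PRIMARY KEY, original_security_name TEXT, currency TEXT)"
)

# Hypothetical sample rows: one complete, one missing the currency, one missing the name
conn.executemany(
    "INSERT INTO financial_data (original_security_name, currency) VALUES (?, ?)",
    [("US Treasury Bond 2026", "USD"), ("Euro Corp Bond", None), (None, "USD")],
)

# Same completeness pattern as the view, trimmed to two columns
row = conn.execute(
    """
    SELECT
        COUNT(*) AS total_records,
        SUM(CASE WHEN original_security_name IS NOT NULL THEN 1 ELSE 0 END),
        SUM(CASE WHEN currency IS NOT NULL THEN 1 ELSE 0 END),
        COUNT(DISTINCT currency)
    FROM financial_data
    """
).fetchone()

total_records, name_count, currency_count, currency_distinct = row
print(total_records, name_count, currency_count, currency_distinct)
conn.close()
```

Comparing each `*_count` value with `total_records` shows which fields were not extracted for some rows.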


## **Project Imports Explanation**

### Core Data Processing Libraries

- `re`: Regular expressions library used to detect and extract patterns from unstructured text (e.g., field name variations).
- `os`: Provides utilities for interacting with the file system, such as identifying file extensions or handling file paths.
- `json`: Parses JSON input files and handles structured JSON configurations or data output.
- `logging`: Standard Python logging module used for tracking events, errors, and execution flow in a structured format.
- `typing` (`Dict`, `List`, `Any`, `Optional`): Enables type hints for improved readability and error checking in function definitions.
- `datetime`: Handles and formats date and time data within documents and during data normalization.
- `dataclasses` (`@dataclass`, `asdict`): Provides a clean syntax for creating structured objects to hold document metadata or parsed data records.

### Data Handling and Transformation

- `pandas` (as `pd`): A core library for data manipulation and analysis. Used extensively to transform extracted data, handle tabular formats, and prepare output for Excel or database storage.

### Excel File Processing

- `openpyxl`: Library for reading and writing `.xlsx` files.
  - `Workbook`: Used to create and save Excel workbooks.
  - `Font`, `PatternFill`, `Alignment`: Styling utilities to enhance the formatting and readability of Excel output.

### Database Connectivity

- `sqlalchemy`:
  - `create_engine`, `text`: Creates connections to various databases (e.g., PostgreSQL, MySQL) and executes raw SQL for inserting or querying extracted data.

### AI and Natural Language Processing

- `openai`, `OpenAI`: Interfaces with OpenAI’s language models, enabling intelligent parsing, summarization, or contextual understanding of unstructured data in documents.

### Document Parsing

- `docx2txt`: Extracts raw text from `.docx` (Microsoft Word) documents for downstream processing.
- `PyPDF2`: Parses and extracts textual content from PDF files, supporting analysis of semi-structured document formats.

### Date Parsing

- `dateutil.parser` (as `date_parser`): Provides robust date parsing to normalize inconsistent date formats before the structured results are exported to Excel or a database.


```python
import os
import json
import re
import logging
from typing import Dict, List, Any, Optional
from datetime import datetime
from dataclasses import dataclass, asdict
import pandas as pd
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from sqlalchemy import create_engine, text
import openai
from openai import OpenAI
import docx2txt
import PyPDF2
from dateutil import parser as date_parser
```

## **Configuration and Logging Setup**

### Logging Setup

This section configures logging to capture messages at `INFO` level and above (`INFO`, `WARNING`, `ERROR`, `CRITICAL`). `logging.basicConfig` sets a consistent format for log messages that includes:

- Timestamp (`%(asctime)s`)
- Logger name (`%(name)s`)
- Log level (`%(levelname)s`)
- The actual message (`%(message)s`)

It then creates a named logger, `ai_financial_extractor`, for use throughout the application. Note that `basicConfig` configures the root logger; the named logger is not the root logger but inherits its level and handler.

```python
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('ai_financial_extractor')
```
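
As a quick check of the format string, the sketch below attaches a handler that writes to an in-memory buffer and logs two messages. The `ai_financial_extractor.demo` logger name, the buffer, and the messages are hypothetical and used only for illustration.

```python
import io
import logging

# Capture log output in memory instead of the console (illustration only)
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(
    logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
)

demo_logger = logging.getLogger('ai_financial_extractor.demo')  # hypothetical child logger
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo output out of the root logger

demo_logger.debug("hidden: below the INFO threshold")
demo_logger.info("Extraction started")

log_output = buffer.getvalue()
print(log_output)
```

Only the `INFO` message reaches the buffer, prefixed with the timestamp, logger name, and level.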

## **CONFIG Dictionary Explanation**### NOTE: The OpenAI security key will not be included for security reasons and OpenAI policy breaking, for seperate testing please use your own keys

The `CONFIG` dictionary centralizes all configurable settings for the AI-enhanced financial data extraction pipeline. It simplifies code maintenance by separating logic from parameters and includes sections for database connectivity, OpenAI model usage, and output preferences.

---

### 1. `database`
Defines the parameters required to connect to a PostgreSQL database:

| Key        | Description                                        |
|------------|----------------------------------------------------|
| `type`     | Type of database system (`"postgresql"`)           |
| `host`     | Database host address (`"localhost"`)              |
| `port`     | Network port used by PostgreSQL (`5432`)           |
| `database` | Name of the PostgreSQL database (`"financial_data"`)|
| `user`     | Username for authenticating database access        |
| `password` | Password for authenticating database access        |

This section enables reading from or writing to the PostgreSQL database using tools like SQLAlchemy.

---

### 2. `openai`
Specifies how the application connects to and interacts with OpenAI's language models:

| Key           | Description                                                                 |
|----------------|-----------------------------------------------------------------------------|
| `api_key`      | The secret API key used to authenticate requests to OpenAI’s API           |
| `model`        | Specifies the model to use, e.g., `"gpt-4o"` (GPT-4 Omni)                   |
| `temperature`  | Controls randomness in responses. Low values (e.g., `0.1`) produce more consistent, predictable results, which is ideal for structured extraction tasks |

This section empowers the script to use OpenAI models to interpret and extract structured information from unstructured documents (e.g., PDFs or Word files).

---

### 3. `output`
Controls how and where the final extracted data will be saved:

| Key          | Description                                                   |
|---------------|---------------------------------------------------------------|
| `excel_file`  | Name of the Excel file to save the AI-extracted data into     |

The output is formatted and exported as an `.xlsx` file using `openpyxl` or `pandas`, depending on the implementation.

---

### Summary

This configuration:
- Enables **PostgreSQL database connectivity** without embedding credentials in the main logic.
- Integrates with **OpenAI's LLM** for intelligent document parsing and data extraction.
- Exports extracted data to a user-defined **Excel file**, ensuring results are accessible and well-formatted.

By modifying this dictionary, users can easily switch databases, AI models, or output files without touching the core extraction code.


```python
CONFIG = {
    "database": {
        "type": "postgresql",
        "host": "localhost",
        "port": 5432,
        "database": "financial_data",
        "user": "postgres",
        "password": "SenkoSQL"
    },
    "openai": {
        "api_key": "CANNOT_INPUT_MY_SECURITY_KEY_HERE_FOR_SECURITY_REASONS", # Input your key if needed for testing
        "model": "gpt-4o",
        "temperature": 0.1
    },
    "output": {
        "excel_file": "ai_extracted_financial_data.xlsx"
    }
}
```
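
One way to keep secrets out of the notebook is to read them from environment variables. The sketch below is an assumption, not part of the original pipeline: the variable names `OPENAI_API_KEY` and `PGPASSWORD`, the `demo_config` dictionary, and the helper `build_database_url` are hypothetical choices.

```python
import os
from urllib.parse import quote_plus


def build_database_url(db: dict) -> str:
    """Build a PostgreSQL connection URL from a CONFIG-style dictionary."""
    # quote_plus escapes characters such as '@' or '/' in the credentials
    user = quote_plus(db["user"])
    password = quote_plus(db["password"])
    return (
        f"{db['type']}://{user}:{password}"
        f"@{db['host']}:{db['port']}/{db['database']}"
    )


# Secrets come from the environment; the fallbacks are placeholders only
demo_config = {
    "database": {
        "type": "postgresql",
        "host": "localhost",
        "port": 5432,
        "database": "financial_data",
        "user": "postgres",
        "password": os.environ.get("PGPASSWORD", "change-me"),
    },
    "openai": {
        "api_key": os.environ.get("OPENAI_API_KEY", ""),
        "model": "gpt-4o",
        "temperature": 0.1,
    },
}

database_url = build_database_url(demo_config["database"])
```

The resulting string has the form that `sqlalchemy.create_engine()` accepts for PostgreSQL connections.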

## **`FinancialRecord` Data Class Explanation**

The `FinancialRecord` class defines a structured container for holding extracted financial investment data. It uses Python’s `@dataclass` decorator to automatically generate an initializer, representation methods, and support for type hinting and serialization.

This class ensures consistency and clarity when storing records parsed from various document sources (e.g., PDFs, Word files, Excel sheets).

---

### Core Fields (Mandatory)

These fields are typically required for financial reporting and analysis:

| Field Name              | Type         | Description                                                   |
|--------------------------|--------------|---------------------------------------------------------------|
| `as_of_date`             | `Optional[str]`  | The valuation or reporting date of the investment             |
| `original_security_name`| `Optional[str]`  | Name or identifier of the financial instrument                |
| `investment_in_original`| `Optional[float]`| Initial investment value (acquisition or purchase cost)       |
| `investment_in`         | `Optional[float]`| Current or most recent investment value                       |
| `investment_in_prior`   | `Optional[float]`| Investment value in the previous period                       |
| `currency`              | `Optional[str]`  | Currency in which the investment is denominated (e.g., USD)  |

---

### Additional Fields (Enrichment & Classification)

These fields provide further classification and context to enhance reporting or analytics:

| Field Name         | Type              | Description                                                |
|---------------------|-------------------|------------------------------------------------------------|
| `sector`            | `Optional[str]`   | Economic or industry sector of the security                |
| `risk_rating`       | `Optional[str]`   | Risk classification or rating for the investment           |
| `maturity_date`     | `Optional[str]`   | Date on which the security matures (for bonds, etc.)       |
| `yield_percentage`  | `Optional[float]` | Annualized return or yield expressed as a percentage       |
| `isin`              | `Optional[str]`   | International Securities Identification Number             |
| `cusip`             | `Optional[str]`   | U.S. security identifier (Committee on Uniform Securities Identification Procedures) |
| `asset_class`       | `Optional[str]`   | Classification of the asset (e.g., equity, bond, real estate) |
| `country`           | `Optional[str]`   | Country of risk or domicile of the issuer                  |
| `region`            | `Optional[str]`   | Broader market or geographic region                        |

---

### Summary

The `FinancialRecord` class enables:
- Clean, typed storage for each extracted investment entry (note that dataclass type hints are not enforced at runtime).
- Consistent use across data parsing, transformation, and export stages.
- Easy serialization to dictionaries or DataFrames using `asdict()` or similar.

It provides a robust foundation for AI-assisted or rule-based data extraction pipelines.


```python
@dataclass
class FinancialRecord:
    """Data class representing a financial investment record"""
    as_of_date: Optional[str] = None
    original_security_name: Optional[str] = None
    investment_in_original: Optional[float] = None
    investment_in: Optional[float] = None
    investment_in_prior: Optional[float] = None
    currency: Optional[str] = None
    # Additional fields
    sector: Optional[str] = None
    risk_rating: Optional[str] = None
    maturity_date: Optional[str] = None
    yield_percentage: Optional[float] = None
    isin: Optional[str] = None
    cusip: Optional[str] = None
    asset_class: Optional[str] = None
    country: Optional[str] = None
    region: Optional[str] = None
```
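
The following sketch shows how a dataclass like this serializes with `asdict()`. To stay self-contained it uses `DemoRecord`, a trimmed, hypothetical stand-in with three of the fields; the sample values are also made up.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DemoRecord:
    """Trimmed, hypothetical stand-in for FinancialRecord (three fields only)."""
    original_security_name: Optional[str] = None
    investment_in: Optional[float] = None
    currency: Optional[str] = None


record = DemoRecord(original_security_name="US Treasury Bond 2026", investment_in=1000000.0)

# asdict() returns a plain dictionary; unset fields keep their None default
record_dict = asdict(record)
print(record_dict)

# Type hints are not enforced at runtime: a string is accepted for a float field
unvalidated = DemoRecord(investment_in="not a number")
print(unvalidated.investment_in)
```

Because every field defaults to `None`, partially extracted records can still be created, and any type checking has to happen before or after construction.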

# **AIDocumentExtractor Class - Explanation**

The `AIDocumentExtractor` class provides an interface for extracting structured financial data from unstructured documents using OpenAI's language models (e.g., GPT-4o). It supports multiple document formats and converts text into structured JSON records by prompting the AI with a detailed schema.

---

## Key Components:

### 1. Initialization (`__init__`)
- Initializes the class with an OpenAI API key and model name.
- Sets up an OpenAI client via the `OpenAI` SDK.
- Configures a logger for error reporting and tracking.
- Default model is `"gpt-4o"` unless otherwise specified.

---

### 2. `extract_text_from_file()`
- Extracts raw text from the input file based on its extension:
  - `.docx` → via `docx2txt`
  - `.pdf` → reads all pages using `PyPDF2`
  - `.txt` → standard UTF-8 file read
  - `.csv` → read with `pandas.read_csv()` and converted into a plain-text table using `DataFrame.to_string()`
  - `.json` → pretty-printed string via `json.dumps()`
- Returns a unified, readable string ready for LLM input.
- Logs and raises errors for unsupported file types or I/O issues.

---

### 3. `extract_financial_data()`
- Core method for extracting structured data using the LLM:
  1. Extracts text using `extract_text_from_file()`.
  2. Builds an extraction prompt with `_create_extraction_prompt()`.
  3. Sends the prompt to the OpenAI API via `chat.completions.create()`.
  4. Parses the JSON response using `_parse_ai_response()`.
  5. Logs the number of successfully extracted records.
- Returns a list of dictionaries representing cleaned investment records.

---

### 4. `_create_extraction_prompt()`
- Constructs a detailed prompt instructing the LLM to:
  - Extract key financial fields.
  - Format dates as `MM/DD/YYYY`.
  - Remove currency symbols and commas from numbers.
  - Output **valid JSON only** with no additional text or formatting.
  - Omit missing fields and infer semantic equivalents (e.g., `"Market Value"` → `investment_in`).
- Embeds the entire document’s text in the prompt as the final input.

---

### 5. `_parse_ai_response()`
- Extracts the AI's JSON response from the raw text:
  - Finds the first `[` and last `]` to isolate the JSON array.
  - Uses `json.loads()` to convert it to Python objects.
  - Cleans each record using `_clean_record()`.
- If JSON decoding fails, falls back to `_extract_json_fallback()` to salvage valid objects.
- Logs the error message when parsing fails. The raw response is logged at `DEBUG` level, so it does not appear under the `INFO` level configured earlier.
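
The bracket-slicing step can be sketched on its own as follows. Both reply strings are hypothetical; the second one shows why a reply without any array ends up in the fallback path.

```python
import json

# Hypothetical model reply with extra text around the JSON array
response_text = (
    'Here are the records:\n'
    '[{"original_security_name": "US Treasury Bond 2026", "currency": "USD"}]\n'
    'Let me know if you need anything else.'
)

# Slice from the first '[' to the last ']' (inclusive)
json_start = response_text.find('[')
json_end = response_text.rfind(']') + 1
records = json.loads(response_text[json_start:json_end])
print(records)

# With no brackets, find() returns -1 and rfind() + 1 is 0, so the slice is empty
no_array = "Sorry, no data found."
empty_slice = no_array[no_array.find('['):no_array.rfind(']') + 1]
try:
    json.loads(empty_slice)
    fallback_needed = False
except json.JSONDecodeError:
    fallback_needed = True  # this is the case that triggers the regex fallback
```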

---

### 6. `_extract_json_fallback()`
- A regex-based recovery method to parse `"{...}"` blocks from messy LLM outputs.
- Iterates over matches and attempts `json.loads()` on each block.
- Returns a list of valid JSON objects if any are found.
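
A short sketch of the regex recovery on a hypothetical messy string. It also shows a limitation of the `\{[^{}]*\}` pattern: because braces are excluded inside a match, a nested object yields only its innermost part.

```python
import json
import re

# Hypothetical messy output: a flat object, a nested object, and an invalid block
messy = 'junk {"a": 1} more {"b": {"c": 2}} {bad json}'

blocks = re.findall(r'\{[^{}]*\}', messy)

salvaged = []
for block in blocks:
    try:
        salvaged.append(json.loads(block))
    except json.JSONDecodeError:
        continue  # skip blocks that are not valid JSON

print(blocks)
print(salvaged)
```

Investment records are flat key-value objects, so the nested-object limitation rarely matters for this use case.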

---

### 7. `_clean_record()` 
- Standardizes and validates each record returned by the LLM: 
  - Maps common alternate field names to standardized ones (e.g., `"yield_rate"` → `yield_percentage`).
  - Converts numeric fields to `float` (removing commas, symbols).
  - Parses and formats date fields using `dateutil.parser`.
  - Strips whitespace from all strings.
- Returns a cleaned dictionary if at least one valid field is present.
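
The key mapping and numeric conversion can be illustrated with a reduced version of the cleaning logic. The function `clean_demo_record`, the two-entry alias table, and the sample record are hypothetical; date handling is left out because it depends on the third-party `dateutil` package.

```python
import re

# Small, hypothetical subset of the alias table
field_mapping = {"market_value": "investment_in", "yield_rate": "yield_percentage"}
numeric_fields = {"investment_in", "yield_percentage"}


def clean_demo_record(record: dict) -> dict:
    """Map alias keys to standard names and convert numeric strings to float."""
    cleaned = {}
    for key, value in record.items():
        clean_key = field_mapping.get(key.lower(), key.lower())
        if value is None or value == "":
            continue  # drop empty values
        if clean_key in numeric_fields:
            # Keep only digits, the decimal point, and the minus sign
            digits = re.sub(r"[^\d.-]", "", str(value))
            try:
                cleaned[clean_key] = float(digits) if digits else None
            except ValueError:
                cleaned[clean_key] = None
        else:
            cleaned[clean_key] = str(value).strip()
    return cleaned


result = clean_demo_record(
    {"Market_Value": "$1,250,000.50", "yield_rate": "4.25%", "Currency": " USD ", "sector": ""}
)
print(result)
```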

---

## Extracted Financial Fields

The LLM is prompted to extract the following key fields from any document, including common variations of their names:

| Field Name               | Description                                      |
|--------------------------|--------------------------------------------------|
| `as_of_date`             | Valuation/reporting date (`MM/DD/YYYY`)         |
| `original_security_name` | Name of the investment/security                 |
| `investment_in_original` | Initial investment amount                        |
| `investment_in`          | Current value of the investment                 |
| `investment_in_prior`    | Previous period’s value                         |
| `currency`               | 3-letter currency code (e.g., USD, EUR)         |
| `sector`                 | Industry sector (e.g., Technology, Finance)     |
| `risk_rating`            | Risk classification (Low, Moderate, High)       |
| `maturity_date`          | Expiry or maturity date (`MM/DD/YYYY`)          |
| `yield_percentage`       | Annualized return percentage                    |
| `isin`                   | International Securities Identification Number  |
| `cusip`                  | U.S. securities identifier                      |
| `asset_class`            | Type of asset (Bond, Equity, etc.)              |
| `country`                | Country of issuance or origin                   |
| `region`                 | Geographical or market region                   |

---

## Summary

The `AIDocumentExtractor` class offers a modern and scalable solution for extracting complex structured data from financial documents. Its key benefits include:

- Seamless integration with OpenAI's GPT models.
- Support for diverse document formats and flexible layouts.
- Automatic recognition of field variations and formatting inconsistencies.
- Clean JSON output ideal for export to Excel, databases, or analytics tools.

It is ideal for use cases involving varied document sources where traditional pattern-based extraction would be brittle or require constant maintenance.


In [42]:
class AIDocumentExtractor:
    """Extract financial data from documents using OpenAI's API"""
    
    def __init__(self, openai_api_key: str, model: str = "gpt-4o"):
        self.client = OpenAI(api_key=openai_api_key)
        self.model = model
        self.logger = logging.getLogger(f'{__name__}.AIDocumentExtractor')
        
    def extract_text_from_file(self, file_path: str) -> str:
        """Extract text from various file formats"""
        file_extension = os.path.splitext(file_path)[1].lower()
        
        try:
            if file_extension == ".docx":
                return docx2txt.process(file_path)
            elif file_extension == ".pdf":
                text = ""
                with open(file_path, "rb") as file:
                    pdf_reader = PyPDF2.PdfReader(file)
                    for page in pdf_reader.pages:
                        text += page.extract_text()
                return text
            elif file_extension == ".txt":
                with open(file_path, "r", encoding="utf-8", errors="replace") as file:
                    return file.read()
            elif file_extension == ".csv":
                # For CSV, we'll convert to a readable format for the LLM
                df = pd.read_csv(file_path)
                return df.to_string()
            elif file_extension == ".json":
                with open(file_path, "r", encoding="utf-8") as file:
                    data = json.load(file)
                    return json.dumps(data, indent=2)
            else:
                raise ValueError(f"Unsupported file format: {file_extension}")
        except Exception as e:
            self.logger.error(f"Error extracting text from {file_path}: {str(e)}")
            raise
    
    def extract_financial_data(self, file_path: str) -> List[Dict[str, Any]]:
        """Extract financial data using AI"""
        self.logger.info(f"Starting AI extraction for {file_path}")
        
        document_text = self.extract_text_from_file(file_path)
        
        # Create the extraction prompt
        prompt = self._create_extraction_prompt(document_text)
        
        try:
            # Call OpenAI API
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {"role": "system", "content": "You are an expert financial data analyst specializing in extracting structured data from financial documents."},
                    {"role": "user", "content": prompt}
                ],
                temperature=CONFIG["openai"]["temperature"]
            )
            
            # Parse the response
            extracted_data = self._parse_ai_response(response.choices[0].message.content)
            
            self.logger.info(f"Successfully extracted {len(extracted_data)} records using AI")
            return extracted_data
            
        except Exception as e:
            self.logger.error(f"Error in AI extraction: {str(e)}")
            raise
    
    def _create_extraction_prompt(self, document_text: str) -> str:
        """Create a detailed prompt for the AI to extract financial data"""
        prompt = f"""
Please analyze the following financial document and extract all investment records. 
Return the data as a JSON array where each object represents one investment record.

For each record, extract the following fields (if available):
- as_of_date: The date when the data was recorded (format as MM/DD/YYYY)
- original_security_name: The name of the investment/security
- investment_in_original: Original investment amount (as a number, no currency symbols)
- investment_in: Current investment value (as a number, no currency symbols)
- investment_in_prior: Previous period investment value (as a number, no currency symbols)
- currency: Three-letter currency code (e.g., USD, EUR)
- sector: Investment sector (e.g., Technology, Government)
- risk_rating: Risk level (e.g., Low, Moderate, High)
- maturity_date: When the investment matures (format as MM/DD/YYYY)
- yield_percentage: Yield or return percentage (as a number without % sign)
- isin: ISIN code if available
- cusip: CUSIP code if available
- asset_class: Type of asset (e.g., Bond, Equity)
- country: Country of origin
- region: Geographic region

IMPORTANT RULES:
1. Only include fields that are actually present in the document
2. Convert all dates to MM/DD/YYYY format
3. Extract numeric values without currency symbols or commas
4. If a field is not found, omit it from the record
5. Look for variations in field names (e.g., "Market Value" might mean "investment_in")
6. Return valid JSON only, no additional text

Document to analyze:

{document_text}
"""
        return prompt
    
    def _parse_ai_response(self, response_text: str) -> List[Dict[str, Any]]:
        """Parse the AI response into structured data"""
        try:
            # Clean the response text to extract JSON
            json_start = response_text.find('[')
            json_end = response_text.rfind(']') + 1
            json_text = response_text[json_start:json_end]
            
            # Parse JSON
            extracted_data = json.loads(json_text)
            
            # Validate and clean the data
            cleaned_data = []
            for record in extracted_data:
                cleaned_record = self._clean_record(record)
                if cleaned_record:
                    cleaned_data.append(cleaned_record)
            
            return cleaned_data
            
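        except json.JSONDecodeError as e:
            self.logger.error(f"Error parsing AI response as JSON: {str(e)}")
            self.logger.debug(f"Raw response: {response_text}")
            # Fall back to regex extraction, cleaning salvaged objects like the main path does
            salvaged = self._extract_json_fallback(response_text)
            cleaned = (self._clean_record(obj) for obj in salvaged)
            return [record for record in cleaned if record]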
        except Exception as e:
            self.logger.error(f"Error processing AI response: {str(e)}")
            raise
    
    def _extract_json_fallback(self, text: str) -> List[Dict[str, Any]]:
        """Fallback method to extract flat JSON objects from a messy response"""
        # The pattern only matches objects that contain no nested braces
        json_objects = re.findall(r'\{[^{}]*\}', text)
        results = []
        for obj_str in json_objects:
            try:
                results.append(json.loads(obj_str))
            except json.JSONDecodeError:
                # Skip blocks that look like objects but are not valid JSON
                continue
        return results
    
    def _clean_record(self, record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Clean and validate a single record"""
        cleaned = {}
        
        # Map of field conversions
        field_mapping = {
            # Original security name
            'security_name': 'original_security_name',
            'instrument_name': 'original_security_name',
            'asset_name': 'original_security_name',
            'investment_name': 'original_security_name',
        
            # Original investment amount
            'original_investment': 'investment_in_original',
            'initial_investment': 'investment_in_original',
            'amount_invested': 'investment_in_original',
            'purchase_amount': 'investment_in_original',
        
            # Current investment value
            'market_value': 'investment_in',
            'current_value': 'investment_in',
            'value_as_of': 'investment_in',
            'valuation': 'investment_in',
        
            # Previous investment value
            'previous_value': 'investment_in_prior',
            'prior_value': 'investment_in_prior',
            'last_period_value': 'investment_in_prior',
            'previous_market_value': 'investment_in_prior',
        
            # Yield / return %
            'yield': 'yield_percentage',
            'yield_rate': 'yield_percentage',
            'annual_yield': 'yield_percentage',
            'interest_rate': 'yield_percentage',
            'rate_of_return': 'yield_percentage',
        
            # Currency
            'currency_code': 'currency',
            'base_currency': 'currency',
        
            # Risk
            'risk_level': 'risk_rating',
            'risk': 'risk_rating',
        
            # Dates
            'report_date': 'as_of_date',
            'valuation_date': 'as_of_date',
            'trade_date': 'as_of_date',
            'asof': 'as_of_date',
        
            'maturity': 'maturity_date',
            'maturity_dt': 'maturity_date',
        
            # Asset class
            'investment_type': 'asset_class',
            'asset_type': 'asset_class',
        
            # Identifiers
            'isin_code': 'isin',
            'cusip_code': 'cusip',
        
            # Geography
            'domicile': 'country',
            'region_name': 'region',
            'country_of_issue': 'country'
        }
        
        # Clean and map fields
        for key, value in record.items():
            # Normalize key name
            clean_key = field_mapping.get(key.lower(), key.lower())
            
            # Clean and convert values
            if value is not None and value != "":
                if clean_key in ['investment_in_original', 'investment_in', 'investment_in_prior', 'yield_percentage']:
                    # Convert to float, stripping everything except digits, the decimal point and the minus sign
                    try:
                        cleaned_value = re.sub(r'[^\d.-]', '', str(value))
                        cleaned[clean_key] = float(cleaned_value) if cleaned_value else None
                    except ValueError:
                        # The stripped string is still not a number (e.g. "1.2.3" or a lone "-")
                        cleaned[clean_key] = None
                elif clean_key in ['as_of_date', 'maturity_date']:
                    # Format dates
                    try:
                        if isinstance(value, str) and value.lower() not in ['n/a', 'na', 'none']:
                            date_obj = date_parser.parse(value, fuzzy=True)
                            cleaned[clean_key] = date_obj.strftime('%m/%d/%Y')
                        else:
                            cleaned[clean_key] = value
                    except (ValueError, OverflowError):
                        # Unparseable date: keep the original value unchanged
                        cleaned[clean_key] = value
                else:
                    cleaned[clean_key] = str(value).strip()
        
        # Only return records with at least one meaningful field
        return cleaned if len(cleaned) > 0 else None

# **AIDataProcessor Class - Explanation**

The `AIDataProcessor` class processes raw AI-extracted financial records and:
- Converts dictionary records into typed `FinancialRecord` objects
- Calculates completeness of mandatory fields
- Identifies missing data and inconsistencies
- Computes extraction accuracy metrics

---

## Key Methods

### 1. `__init__(self)`
Initializes the processor:
- Sets up a logger for internal logging and debugging
- Defines a list of `mandatory_fields` required for completeness:
  - `as_of_date`
  - `original_security_name`
  - `investment_in_original`
  - `investment_in`
  - `investment_in_prior`
  - `currency`

---

### 2. `process_data(raw_data)`
Converts raw dictionaries into `FinancialRecord` objects.

- Iterates through `raw_data`, attempting to unpack each dictionary into a `FinancialRecord`.
- Skips and logs any item that fails to convert due to missing or invalid fields.
- Returns a list of valid, structured `FinancialRecord` objects.
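
The skip-and-log behaviour can be sketched as follows. This is a minimal illustration that assumes a reduced stand-in for `FinancialRecord`; the names `MiniRecord` and `to_records` are hypothetical and exist only in this example. A dictionary containing a key the dataclass does not define raises `TypeError` when unpacked, so that item is skipped:

```python
import logging
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

logger = logging.getLogger("process_data_sketch")


@dataclass
class MiniRecord:
    """Hypothetical, reduced stand-in for FinancialRecord."""
    original_security_name: Optional[str] = None
    investment_in: Optional[float] = None
    currency: Optional[str] = None


def to_records(raw_data: List[Dict[str, Any]]) -> List[MiniRecord]:
    """Unpack each dict into a MiniRecord, skipping items that fail."""
    records = []
    for item in raw_data:
        try:
            records.append(MiniRecord(**item))
        except TypeError as exc:  # e.g. an unexpected keyword argument
            logger.warning("Skipping %r: %s", item, exc)
    return records


raw = [
    {"original_security_name": "Apple Inc. Common Stock", "investment_in": 57500.0, "currency": "USD"},
    {"original_security_name": "Mystery Fund", "ticker": "XYZ"},  # 'ticker' is not a field
]
records = to_records(raw)
print(len(records))  # 1
```

Because one bad key discards the whole item, it may be worth filtering each dictionary down to known field names before unpacking if partial records are preferable to dropped ones.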

---

### 3. `calculate_statistics(records)`
Analyzes the given list of `FinancialRecord` objects and returns a detailed report with:

- `total_records`: Number of processed records
- `extraction_accuracy`: Percentage of mandatory fields filled across all records
- `mandatory_field_completeness`: Count and completeness % for each required field
- `field_presence`: How often each field (mandatory or optional) appears
- `missing_fields`: Which mandatory fields are missing, and how often
- `inconsistent_data`: Any inconsistencies found across the dataset:
  - Multiple currencies
  - Mixed date formats

The accuracy is computed as:
- `(total filled mandatory fields) / (total expected mandatory fields) * 100`
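
A small worked example of this formula, using made-up records in which two of the eighteen expected mandatory values are missing (plain dictionaries are used here instead of `FinancialRecord` objects to keep the sketch self-contained):

```python
mandatory_fields = [
    "as_of_date", "original_security_name", "investment_in_original",
    "investment_in", "investment_in_prior", "currency",
]

# Illustrative values only; None marks a field that was not extracted
records = [
    {"as_of_date": "03/31/2024", "original_security_name": "Fund A", "investment_in_original": 100.0,
     "investment_in": 110.0, "investment_in_prior": 105.0, "currency": "USD"},
    {"as_of_date": None, "original_security_name": "Fund B", "investment_in_original": 200.0,
     "investment_in": 190.0, "investment_in_prior": 195.0, "currency": "USD"},
    {"as_of_date": "03/31/2024", "original_security_name": "Fund C", "investment_in_original": 300.0,
     "investment_in": 310.0, "investment_in_prior": None, "currency": "USD"},
]

filled = sum(1 for r in records for f in mandatory_fields if r.get(f) is not None)
expected = len(mandatory_fields) * len(records)
accuracy = filled / expected * 100 if expected else 0

print(f"{filled}/{expected} = {accuracy:.2f}%")  # 16/18 = 88.89%
```

Note that this metric measures how many mandatory fields are filled, not whether the filled values are correct.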

---

### 4. `_identify_date_format(date_str)`
A helper method that uses regular expressions to infer the format of a date string.

Possible outputs include:
- `"MM/DD/YYYY"`
- `"YYYY-MM-DD"`
- `"Month DD, YYYY"`
- `"Unknown format"`

Used internally during inconsistency checks to validate date format consistency across records.
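
The sketch below applies the same three patterns to a few sample strings. The `strict` parameter is an addition for illustration and is not part of the original method: it switches from `re.match`, which only anchors at the start of the string, to `re.fullmatch`, which requires the whole string to match.

```python
import re

# Same patterns as _identify_date_format, checked in order
PATTERNS = [
    (r"\d{1,2}/\d{1,2}/\d{4}", "MM/DD/YYYY"),
    (r"\d{4}-\d{1,2}-\d{1,2}", "YYYY-MM-DD"),
    (r"[A-Za-z]+ \d{1,2},?\s+\d{4}", "Month DD, YYYY"),
]


def identify_date_format(date_str: str, strict: bool = False) -> str:
    """Return a format label; strict=True requires the whole string to match."""
    matcher = re.fullmatch if strict else re.match
    for pattern, label in PATTERNS:
        if matcher(pattern, date_str):
            return label
    return "Unknown format"


samples = ["03/31/2024", "2024-03-31", "March 31, 2024", "31.03.2024", "03/31/2024 (estimated)"]
for s in samples:
    print(f"{s!r}: {identify_date_format(s)} | strict: {identify_date_format(s, strict=True)}")
```

With the default prefix matching, a value such as `"03/31/2024 (estimated)"` is still labelled `MM/DD/YYYY`; the strict variant reports it as `Unknown format`.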

---

## Summary
The `AIDataProcessor` class plays a key role in the post-AI validation pipeline. It transforms AI output into structured objects, computes quality metrics, and flags missing or inconsistent information.

It ensures that:
- Mandatory fields are reliably filled
- The dataset maintains consistency in key attributes (like currency or dates)
- Users receive actionable insights on the overall quality of the AI-based extraction process


In [18]:
class AIDataProcessor:
    """Process and validate AI-extracted data"""
    
    def __init__(self):
        self.logger = logging.getLogger(f'{__name__}.AIDataProcessor')
        self.mandatory_fields = [
            'as_of_date', 'original_security_name', 'investment_in_original',
            'investment_in', 'investment_in_prior', 'currency'
        ]
    
    def process_data(self, raw_data: List[Dict[str, Any]]) -> List[FinancialRecord]:
        """Convert raw data to FinancialRecord objects"""
        records = []
        
        for item in raw_data:
            try:
                # Convert to FinancialRecord
                record = FinancialRecord(**item)
                records.append(record)
            except Exception as e:
                self.logger.warning(f"Error creating record from {item}: {str(e)}")
                continue
        
        self.logger.info(f"Processed {len(records)} records")
        return records
    
    def calculate_statistics(self, records: List[FinancialRecord]) -> Dict[str, Any]:
        """Calculate extraction statistics"""
        total_records = len(records)
        if total_records == 0:
            return {
                'total_records': 0,
                'extraction_accuracy': 0,
                'mandatory_field_completeness': {},
                'field_presence': {},
                'missing_fields': [],
                'inconsistent_data': []
            }
        
        # Count field presence
        field_counts = {}
        for record in records:
            record_dict = asdict(record)
            for field, value in record_dict.items():
                if value is not None:
                    field_counts[field] = field_counts.get(field, 0) + 1
        
        # Calculate mandatory field completeness
        mandatory_completeness = {}
        missing_fields = []
        for field in self.mandatory_fields:
            count = field_counts.get(field, 0)
            percentage = (count / total_records) * 100
            mandatory_completeness[field] = {
                'count': count,
                'percentage': percentage
            }
            if count < total_records:
                missing_fields.append(f"{field} ({total_records - count} missing)")
        
        # Check for inconsistencies
        inconsistencies = []
        currencies = set()
        date_formats = set()
        
        for record in records:
            if record.currency:
                currencies.add(record.currency)
            if record.as_of_date:
                date_formats.add(self._identify_date_format(record.as_of_date))
        
        if len(currencies) > 1:
            inconsistencies.append(f"Multiple currencies: {', '.join(currencies)}")
        if len(date_formats) > 1:
            inconsistencies.append(f"Multiple date formats: {', '.join(date_formats)}")
        
        # Calculate overall accuracy
        total_mandatory_fields = len(self.mandatory_fields) * total_records
        filled_mandatory_fields = sum(counts['count'] for counts in mandatory_completeness.values())
        accuracy = (filled_mandatory_fields / total_mandatory_fields) * 100 if total_mandatory_fields > 0 else 0
        
        return {
            'total_records': total_records,
            'extraction_accuracy': accuracy,
            'mandatory_field_completeness': mandatory_completeness,
            'field_presence': field_counts,
            'missing_fields': missing_fields,
            'inconsistent_data': inconsistencies
        }
    
    def _identify_date_format(self, date_str: str) -> str:
        """Identify the format of a date string"""
        if re.match(r'\d{1,2}/\d{1,2}/\d{4}', date_str):
            return "MM/DD/YYYY"
        elif re.match(r'\d{4}-\d{1,2}-\d{1,2}', date_str):
            return "YYYY-MM-DD"
        elif re.match(r'[A-Za-z]+ \d{1,2},?\s+\d{4}', date_str):
            return "Month DD, YYYY"
        return "Unknown format"

# **AIDataStorage Class - Explanation**

The `AIDataStorage` class handles storing AI-processed financial data into two destinations:
1. A PostgreSQL database using SQLAlchemy
2. An Excel workbook with formatted data and extraction statistics

It manages data persistence, error handling, logging, and presentation formatting for both outputs.

---

## Key Components

### 1. `__init__(self, db_config, excel_file)`
Initializes with:
- `db_config`: A dictionary containing PostgreSQL connection details (host, port, user, password, database).
- `excel_file`: Path to the Excel output file.
- Sets up a logger for tracking export operations.

---

### 2. `store_in_database(records)`
Stores the list of `FinancialRecord` objects into a PostgreSQL table:
- Converts records to a `pandas` DataFrame.
- Creates a database connection using SQLAlchemy.
- Drops any existing table or view (`ai_financial_data` and `ai_financial_data_stats`) to ensure schema consistency.
- Uploads the data using `df.to_sql()` into the `ai_financial_data` table.
- Creates a statistics view (`ai_financial_data_stats`) using `_create_stats_view()`.

Returns `True` on success or `False` if any error occurs.
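
One caveat: the connection string is assembled with an f-string, so a password containing characters such as `@`, `:` or `/` would produce a malformed URL. A possible workaround, sketched here with the standard library only, is to percent-encode the credentials first. `build_postgres_url` is a hypothetical helper and the credentials are made up:

```python
from urllib.parse import quote


def build_postgres_url(cfg: dict) -> str:
    """Build a postgresql:// URL, percent-encoding the user and password."""
    user = quote(str(cfg["user"]), safe="")
    password = quote(str(cfg["password"]), safe="")
    return f"postgresql://{user}:{password}@{cfg['host']}:{cfg['port']}/{cfg['database']}"


# Made-up credentials for illustration only
cfg = {
    "user": "etl_user",
    "password": "p@ss:w/rd",
    "host": "localhost",
    "port": 5432,
    "database": "financial_data",
}
url = build_postgres_url(cfg)
print(url)  # postgresql://etl_user:p%40ss%3Aw%2Frd@localhost:5432/financial_data
```

SQLAlchemy 1.4 and later also provide `sqlalchemy.engine.URL.create`, which builds the URL from separate components and avoids manual escaping.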

---

### 3. `_create_stats_view(engine)`
Creates a PostgreSQL view named `ai_financial_data_stats` that:
- Counts total records.
- Counts how many records have non-null values for each mandatory field.
- Counts the number of distinct currencies in the dataset.

This view allows quick insights directly from the database.
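
To see how the view's counting behaves with missing values, the following sketch reproduces a reduced version of it. It uses an in-memory SQLite database as a stand-in for PostgreSQL, so the statement is `CREATE VIEW` rather than `CREATE OR REPLACE VIEW`, and the table has only three of the real columns. The missing values in the third row are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ai_financial_data "
    "(original_security_name TEXT, investment_in_prior REAL, currency TEXT)"
)
conn.executemany(
    "INSERT INTO ai_financial_data VALUES (?, ?, ?)",
    [
        ("Apple Inc. Common Stock", 52000.0, "USD"),
        ("US Treasury Bond 2030", 99200.0, "USD"),
        ("Emerging Markets ETF", None, None),  # made-up gap to show NULL handling
    ],
)
conn.execute(
    """
    CREATE VIEW ai_financial_data_stats AS
    SELECT
        COUNT(*) AS total_records,
        SUM(CASE WHEN investment_in_prior IS NOT NULL THEN 1 ELSE 0 END) AS investment_in_prior_count,
        SUM(CASE WHEN currency IS NOT NULL THEN 1 ELSE 0 END) AS currency_count,
        COUNT(DISTINCT currency) AS distinct_currencies
    FROM ai_financial_data
    """
)
row = conn.execute("SELECT * FROM ai_financial_data_stats").fetchone()
print(row)  # (3, 2, 2, 1)
conn.close()
```

`COUNT(DISTINCT currency)` ignores `NULL`, so the row without a currency does not add a distinct value.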

---

### 4. `store_in_excel(records, stats)`
Writes extracted data and calculated statistics to an Excel file:
- Sheet 1: `Extracted Data` → All `FinancialRecord` entries.
- Sheet 2: `Statistics` → Summary metrics such as:
  - Total records
  - Extraction accuracy
  - List of missing fields
  - Inconsistencies (e.g., currency/date issues)
  - Completeness of each mandatory field

It applies formatting via `_format_excel()` and returns `True` on success.

---

### 5. `_format_excel(writer)`
Applies visual enhancements using `openpyxl`:
- Bold headers with light gray fill.
- Center-aligned header cells.
- Automatically adjusts column widths for all sheets.
- Applies formatting to both `Extracted Data` and `Statistics` sheets to improve readability.

---

## Summary
The `AIDataStorage` class finalizes the AI extraction pipeline by:
- Persisting structured data to a relational database.
- Creating a view for database-level statistics.
- Exporting a clean, readable Excel file with both raw data and key metrics.

It ensures the extracted data is not only preserved, but also organized and ready for analysis or reporting in downstream systems.


In [19]:
class AIDataStorage:
    """Store processed data in database and Excel"""
    
    def __init__(self, db_config: Dict[str, Any], excel_file: str):
        self.db_config = db_config
        self.excel_file = excel_file
        self.logger = logging.getLogger(f'{__name__}.AIDataStorage')
    
    def store_in_database(self, records: List[FinancialRecord]) -> bool:
        """Store records in PostgreSQL database"""
        try:
            # Convert records to DataFrame
            df = pd.DataFrame([asdict(record) for record in records])
            
            # Create database connection
            connection_string = (f"postgresql://{self.db_config['user']}:{self.db_config['password']}"
                               f"@{self.db_config['host']}:{self.db_config['port']}/{self.db_config['database']}")
            engine = create_engine(connection_string)
            
            # Drop existing table/view
            with engine.connect() as connection:
                connection.execute(text("DROP VIEW IF EXISTS ai_financial_data_stats CASCADE;"))
                connection.execute(text("DROP TABLE IF EXISTS ai_financial_data CASCADE;"))
                connection.commit()
            
            # Store data
            df.to_sql('ai_financial_data', engine, if_exists='replace', index=False)
            
            # Create statistics view
            self._create_stats_view(engine)
            
            self.logger.info(f"Successfully stored {len(records)} records in database")
            return True
            
        except Exception as e:
            self.logger.error(f"Database storage error: {str(e)}")
            return False
    
    def _create_stats_view(self, engine):
        """Create database view with statistics"""
        view_sql = """
        CREATE OR REPLACE VIEW ai_financial_data_stats AS
        SELECT
            COUNT(*) AS total_records,
            SUM(CASE WHEN as_of_date IS NOT NULL THEN 1 ELSE 0 END) AS as_of_date_count,
            SUM(CASE WHEN original_security_name IS NOT NULL THEN 1 ELSE 0 END) AS original_security_name_count,
            SUM(CASE WHEN investment_in_original IS NOT NULL THEN 1 ELSE 0 END) AS investment_in_original_count,
            SUM(CASE WHEN investment_in IS NOT NULL THEN 1 ELSE 0 END) AS investment_in_count,
            SUM(CASE WHEN investment_in_prior IS NOT NULL THEN 1 ELSE 0 END) AS investment_in_prior_count,
            SUM(CASE WHEN currency IS NOT NULL THEN 1 ELSE 0 END) AS currency_count,
            COUNT(DISTINCT currency) AS distinct_currencies
        FROM ai_financial_data;
        """
        
        with engine.connect() as connection:
            connection.execute(text(view_sql))
            connection.commit()
    
    def store_in_excel(self, records: List[FinancialRecord], stats: Dict[str, Any]) -> bool:
        """Store records and statistics in Excel file"""
        try:
            # Convert records to DataFrame
            df = pd.DataFrame([asdict(record) for record in records])
            
            # Create Excel writer
            with pd.ExcelWriter(self.excel_file, engine='openpyxl') as writer:
                # Write data sheet
                df.to_excel(writer, sheet_name='Extracted Data', index=False)
                
                # Create statistics DataFrame
                stats_data = {
                    'Metric': [
                        'Total Records',
                        'Overall Extraction Accuracy (%)',
                        'Missing Fields',
                        'Inconsistent Data'
                    ],
                    'Value': [
                        stats['total_records'],
                        f"{stats['extraction_accuracy']:.2f}%",
                        ', '.join(stats['missing_fields']) if stats['missing_fields'] else 'None',
                        ', '.join(stats['inconsistent_data']) if stats['inconsistent_data'] else 'None'
                    ]
                }
                
                # Add mandatory field completeness
                for field, completeness in stats['mandatory_field_completeness'].items():
                    stats_data['Metric'].append(f'{field} completeness')
                    stats_data['Value'].append(f"{completeness['count']}/{stats['total_records']} ({completeness['percentage']:.1f}%)")
                
                stats_df = pd.DataFrame(stats_data)
                stats_df.to_excel(writer, sheet_name='Statistics', index=False)
                
                # Apply formatting
                self._format_excel(writer)
            
            self.logger.info(f"Successfully stored data in Excel file: {self.excel_file}")
            return True
            
        except Exception as e:
            self.logger.error(f"Excel storage error: {str(e)}")
            return False
    
    def _format_excel(self, writer):
        """Apply formatting to Excel file"""
        workbook = writer.book
        
        # Format data sheet
        worksheet = workbook['Extracted Data']
        for cell in worksheet[1]:
            cell.font = Font(bold=True)
            cell.fill = PatternFill(start_color="DDDDDD", end_color="DDDDDD", fill_type="solid")
            cell.alignment = Alignment(horizontal='center')
        
        # Format statistics sheet
        worksheet = workbook['Statistics']
        for cell in worksheet[1]:
            cell.font = Font(bold=True)
            cell.fill = PatternFill(start_color="DDDDDD", end_color="DDDDDD", fill_type="solid")
            cell.alignment = Alignment(horizontal='center')
        
        # Adjust column widths
        for sheet_name in ['Extracted Data', 'Statistics']:
            worksheet = workbook[sheet_name]
            for column in worksheet.columns:
                max_length = 0
                column_letter = column[0].column_letter
                for cell in column:
                    # Track the longest rendered value; empty cells are skipped
                    if cell.value is not None:
                        max_length = max(max_length, len(str(cell.value)))
                worksheet.column_dimensions[column_letter].width = max(max_length + 2, 12)

# **Explanation for create_sample_financial_document() and main_ai_extraction()**

This code defines two utility functions that together serve to:
- **Generate synthetic financial documents** for testing the AI extraction pipeline
- **Run the complete AI-powered ETL process**, including extraction, validation, storage, and reporting

---

## Function: `create_sample_financial_document(doc_type="txt")`

Creates a synthetic financial document for use in testing the AI extraction workflow.

### Supported formats:
- **`"txt"`**: Simulates a manually written financial report with multiple investment entries and human-style formatting.
- **`"csv"`**: Tabular format with standardized column headers, useful for structured data extraction.
- **`"json"`**: Structured format with a top-level `report_date` and a list of investment objects, some of which carry optional fields such as `maturity_date`.

### Output:
- Saves the document locally under a format-specific filename:
  - `"sample_financial_report.txt"`
  - `"sample_financial_data.csv"`
  - `"sample_financial_data.json"`
- Returns the file path string of the created sample document.

This function is essential for:
- Simulating real-world file inputs
- Testing the robustness of the extraction logic
- Validating AI performance against consistent, known data

---

## Function: `main_ai_extraction(file_path, openai_api_key)`

This is the main controller function that coordinates the **AI-based financial data extraction pipeline** from start to finish.

### Step-by-step flow:

#### 1. **Extraction**
- Initializes an instance of `AIDocumentExtractor` with the OpenAI API key and configured model.
- Reads and processes the document using OpenAI to extract raw structured data.

#### 2. **Processing**
- Initializes `AIDataProcessor` to:
  - Convert raw dictionaries into `FinancialRecord` objects
  - Compute statistics such as field completeness and consistency
  - Identify missing values and discrepancies

#### 3. **Storage**
- Initializes `AIDataStorage` with database and Excel configuration.
- Stores the processed records:
  - In a PostgreSQL database table (`ai_financial_data`)
  - In an Excel file with a summary statistics sheet

#### 4. **Reporting**
- Logs the following results:
  - Total records extracted
  - Extraction accuracy (based on filled mandatory fields)
  - Missing fields (if any)
  - Inconsistencies (e.g., multiple currencies or date formats)
  - Status of database and Excel exports

### Error Handling:
- All major steps are wrapped in a try-except block.
- Any failure (e.g., API issues, file errors, DB connection problems) is caught and logged.
- Returns `True` on full pipeline success or `False` on any error.
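
The control flow can be reduced to the sketch below. `run_pipeline` and its callable parameters are hypothetical stand-ins for the extractor, database and Excel steps. Since the storage methods catch their own exceptions and return `False`, a storage failure shows up through the final `and`, while an extraction failure is caught by the outer `except`:

```python
import logging
from typing import Callable

logger = logging.getLogger("pipeline_sketch")


def run_pipeline(extract: Callable[[], list],
                 store_db: Callable[[list], bool],
                 store_excel: Callable[[list], bool]) -> bool:
    """Hypothetical reduction of main_ai_extraction to its control flow."""
    try:
        records = extract()
        db_ok = store_db(records)
        excel_ok = store_excel(records)
        return db_ok and excel_ok
    except Exception as exc:
        logger.error("Error in pipeline: %s", exc)
        return False


def failing_extract() -> list:
    raise RuntimeError("simulated API failure")


all_ok = run_pipeline(lambda: [1, 2, 3], lambda r: True, lambda r: True)
db_failed = run_pipeline(lambda: [1, 2, 3], lambda r: False, lambda r: True)
extract_failed = run_pipeline(failing_extract, lambda r: True, lambda r: True)
print(all_ok, db_failed, extract_failed)  # True False False
```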

---

## Summary

These two functions are key components for both:
- **Testing**: `create_sample_financial_document()` provides test documents in multiple formats and structures.
- **Production-like Execution**: `main_ai_extraction()` runs the full AI-driven ETL pipeline including:
  - OpenAI-based data extraction
  - Record validation and statistics
  - Exporting to both SQL and Excel with logging

They are designed to ensure the pipeline can be validated, maintained, and operated end-to-end with minimal manual intervention.


In [20]:
def create_sample_financial_document(doc_type: str = "txt") -> str:
    """Create sample financial documents for testing"""
    if doc_type == "txt":
        content = """Financial Investment Report
Date: March 31, 2024

PORTFOLIO SUMMARY

Investment 1:
Security Name: Apple Inc. Common Stock
Original Investment: $50,000.00
Current Market Value: $57,500.00
Prior Quarter Value: $52,000.00
Currency: USD
Sector: Technology
Risk Rating: Moderate
Yield: 0.50%

Investment 2:
Security Name: US Treasury Bond 2030
Original Investment: $100,000.00
Current Market Value: $98,500.00
Prior Quarter Value: $99,200.00
Currency: USD
Sector: Government
Risk Rating: Low
Maturity Date: 12/31/2030
Yield: 4.25%

Investment 3:
Security Name: Emerging Markets ETF
Original Investment: $25,000.00
Current Market Value: $27,800.00
Prior Quarter Value: $26,100.00
Currency: USD
Sector: International
Risk Rating: High
Yield: 2.85%
"""
        filename = "sample_financial_report.txt"
        
    elif doc_type == "csv":
        content = """Security Name,Original Investment,Current Value,Prior Value,Currency,Sector,Risk Rating,Yield %
Apple Inc. Common Stock,50000,57500,52000,USD,Technology,Moderate,0.50
US Treasury Bond 2030,100000,98500,99200,USD,Government,Low,4.25
Emerging Markets ETF,25000,27800,26100,USD,International,High,2.85
"""
        filename = "sample_financial_data.csv"
        
    elif doc_type == "json":
        data = {
            "report_date": "2024-03-31",
            "investments": [
                {
                    "security_name": "Apple Inc. Common Stock",
                    "original_investment": 50000,
                    "current_value": 57500,
                    "prior_value": 52000,
                    "currency": "USD",
                    "sector": "Technology",
                    "risk_rating": "Moderate",
                    "yield_percentage": 0.50
                },
                {
                    "security_name": "US Treasury Bond 2030",
                    "original_investment": 100000,
                    "current_value": 98500,
                    "prior_value": 99200,
                    "currency": "USD",
                    "sector": "Government",
                    "risk_rating": "Low",
                    "maturity_date": "2030-12-31",
                    "yield_percentage": 4.25
                },
                {
                    "security_name": "Emerging Markets ETF",
                    "original_investment": 25000,
                    "current_value": 27800,
                    "prior_value": 26100,
                    "currency": "USD",
                    "sector": "International",
                    "risk_rating": "High",
                    "yield_percentage": 2.85
                }
            ]
        }
        content = json.dumps(data, indent=2)
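        filename = "sample_financial_data.json"

    else:
        # Fail early instead of reaching the write below with `content` and `filename` undefined
        raise ValueError(f"Unsupported doc_type: {doc_type!r}")

    with open(filename, 'w', encoding='utf-8') as f:
        f.write(content)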
    
    return filename


def main_ai_extraction(file_path: str, openai_api_key: str) -> bool:
    """Main function to run AI-powered extraction pipeline"""
    logger.info(f"Starting AI-powered extraction for {file_path}")
    
    try:
        # Initialize LLM extractor
        extractor = AIDocumentExtractor(openai_api_key, CONFIG["openai"]["model"])
        
        # Extract data
        raw_data = extractor.extract_financial_data(file_path)
        
        # Process data
        processor = AIDataProcessor()
        records = processor.process_data(raw_data)
        stats = processor.calculate_statistics(records)
        
        # Store data
        storage = AIDataStorage(CONFIG["database"], CONFIG["output"]["excel_file"])
        
        # Store in database
        db_success = storage.store_in_database(records)
        
        # Store in Excel
        excel_success = storage.store_in_excel(records, stats)
        
        # Print results
        logger.info("\n=== AI EXTRACTION RESULTS ===")
        logger.info(f"Total Records Extracted: {stats['total_records']}")
        logger.info(f"Extraction Accuracy: {stats['extraction_accuracy']:.2f}%")
        logger.info(f"Database Storage: {'Success' if db_success else 'Failed'}")
        logger.info(f"Excel Storage: {'Success' if excel_success else 'Failed'}")
        
        if stats['missing_fields']:
            logger.info(f"Missing Fields: {', '.join(stats['missing_fields'])}")
        
        if stats['inconsistent_data']:
            logger.info(f"Inconsistencies: {', '.join(stats['inconsistent_data'])}")
        
        return db_success and excel_success
        
    except Exception as e:
        logger.error(f"Error in AI extraction pipeline: {str(e)}")
        return False

## **Explanation: AI Extraction Test Across Multiple Document Formats**

This section of the script performs automated testing of the AI data extraction pipeline using sample financial documents in three common formats: `.txt`, `.csv`, and `.json`.

### What It Does:

- **Iterates through each file type** to test how well the extraction logic handles different data structures.
- **Logs the beginning of each test** with clear visual separators for easy identification in log files.
- **Creates a synthetic financial document** for each format, simulating realistic input data.
- **Runs the AI-based extraction pipeline** using the generated sample document and a valid API key.
- **Logs the result of the extraction**, indicating whether the process was successful or encountered errors.

### Purpose:

- To verify that the AI pipeline works reliably across multiple document formats.
- To detect any format-specific issues early in the development or deployment cycle.
- To ensure consistency and robustness in handling real-world financial documents.


In [22]:
if __name__ == "__main__":
    OPENAI_API_KEY = CONFIG["openai"]["api_key"]  # Read the key from CONFIG; supply your own key to run this test (the author's key is not shared, in line with OpenAI's policies)
    
    # Create and test with sample documents
    for doc_type in ["txt", "csv", "json"]:
        logger.info(f"\n{'='*50}")
        logger.info(f"Testing AI extraction with {doc_type.upper()} document")
        logger.info(f"{'='*50}")
        
        # Create sample document
        sample_file = create_sample_financial_document(doc_type)
        logger.info(f"Created sample {doc_type} file: {sample_file}")
        
        # Run AI extraction
        success = main_ai_extraction(sample_file, OPENAI_API_KEY)
        
        if success:
            logger.info(f"AI extraction completed successfully for {doc_type}")
        else:
            logger.error(f"AI extraction failed for {doc_type}")

2025-05-13 17:00:54,316 - ai_financial_extractor - INFO - 
2025-05-13 17:00:54,317 - ai_financial_extractor - INFO - Testing AI extraction with TXT document
2025-05-13 17:00:54,319 - ai_financial_extractor - INFO - Created sample txt file: sample_financial_report.txt
2025-05-13 17:00:54,319 - ai_financial_extractor - INFO - Starting AI-powered extraction for sample_financial_report.txt
2025-05-13 17:00:54,527 - __main__.AIDocumentExtractor - INFO - Starting AI extraction for sample_financial_report.txt
2025-05-13 17:01:00,202 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-13 17:01:00,211 - __main__.AIDocumentExtractor - INFO - Successfully extracted 3 records using AI
2025-05-13 17:01:00,212 - __main__.AIDataProcessor - INFO - Processed 3 records
2025-05-13 17:01:00,309 - __main__.AIDataStorage - INFO - Successfully stored 3 records in database
2025-05-13 17:01:00,336 - __main__.AIDataStorage - INFO - Successfully stored data in

## **Explanation: AI Extraction Results Display Function**

The `show_ai_results()` function serves as a simple post-processing verification step to ensure the AI extraction pipeline has successfully completed its tasks. It connects to a PostgreSQL database and provides a clear summary of results.

---

### What It Does:

- **Establishes a database connection**  
  It builds a PostgreSQL connection string from configuration settings and connects using SQLAlchemy.

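- **Prints fixed status messages**  
  Prints three hardcoded messages stating that:
  - Records have been stored in the database
  - A database view was dropped if it already existed
  - Data was successfully inserted

  These messages are not derived from the database state; the actual verification is the data preview that follows.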

- **Displays extracted financial data**  
  It reads and prints all rows from the `ai_financial_data` table to preview the raw extracted data.

- **Displays statistical summaries**  
  It queries and prints the contents of the `ai_financial_data_stats` view for analysis.

- **Provides a result summary**  
  Summarizes the outcome of the entire extraction process, including:
  - Number of records extracted
  - Whether all mandatory columns exist in the table (the check does not look at whether their values are filled)
  - Confirmation of output targets (e.g., table names, Excel file path)

- **Checks for Excel file output**  
  Verifies whether the final Excel file was created and logs its presence.

- **Handles errors gracefully**  
  Any connection or execution failures are caught and printed as error messages.

---

### Purpose:

This function is a diagnostic and verification tool to:
- Confirm that AI extraction output is correctly stored
- Give the user immediate visibility into the database contents and summary statistics
- Ensure output files were generated as expected
- Provide clear, readable console output for manual checks
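
The difference between a column existing and its values being filled can be sketched as follows. The rows are made up and stand in for the contents of `ai_financial_data`; plain dictionaries are used instead of a DataFrame to keep the example self-contained:

```python
MANDATORY = [
    "as_of_date", "original_security_name", "investment_in_original",
    "investment_in", "investment_in_prior", "currency",
]

# Made-up rows standing in for the contents of ai_financial_data
rows = [
    {"as_of_date": "03/31/2024", "original_security_name": "Fund A", "investment_in_original": 100.0,
     "investment_in": 110.0, "investment_in_prior": 105.0, "currency": "USD"},
    {"as_of_date": "03/31/2024", "original_security_name": "Fund B", "investment_in_original": 200.0,
     "investment_in": 190.0, "investment_in_prior": None, "currency": "USD"},
]

# Check 1: every mandatory column exists
columns = set().union(*(row.keys() for row in rows))
columns_present = all(field in columns for field in MANDATORY)

# Check 2: every row has a non-null value in each mandatory column
values_filled = all(row.get(field) is not None for row in rows for field in MANDATORY)

print(columns_present, values_filled)  # True False
```

A column-level check passes here even though one `investment_in_prior` value is missing, so the per-field counts in the statistics view are the more reliable indicator of completeness.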


In [31]:
def show_ai_results():
    """Simple display of AI extraction results"""
    
    try:
        # Connect to database
        connection_string = (f"postgresql://{CONFIG['database']['user']}:{CONFIG['database']['password']}"
                           f"@{CONFIG['database']['host']}:{CONFIG['database']['port']}/{CONFIG['database']['database']}")
        engine = create_engine(connection_string)
        
        print("=== AI FINANCIAL DATA EXTRACTION RESULTS ===\n")
        
        # 1. Show database is working
        print("Successfully stored 3 records in database")
        print("Successfully dropped the view (if it existed).")
        print("Data successfully stored in postgresql database.\n")
        
        # 2. Database Data Preview
        print("--- Database Data Preview ---")
        df = pd.read_sql("SELECT * FROM ai_financial_data ORDER BY original_security_name", engine)
        
        if len(df) > 0:
            print(df.to_string(index=True))
        else:
            print("No data found in database")
        
        # 3. Database Stats View Preview
        print("\n--- Database Stats View Preview ---")
        stats_df = pd.read_sql("SELECT * FROM ai_financial_data_stats", engine)
        print(stats_df.to_string(index=True))
        
        # 4. Result Summary
        print("\n" + "="*50)
        print("Result Summary")
        print("="*50)
        
        total_records = len(df)
        all_mandatory_present = all(col in df.columns for col in 
                                  ['as_of_date', 'original_security_name', 'investment_in_original', 
                                   'investment_in', 'investment_in_prior', 'currency'])
        
        print(f"Total Records Extracted: {total_records}")
        print("Database Table: ai_financial_data")
        print("Statistics View: ai_financial_data_stats")
        print(f"All Mandatory Fields Present: {all_mandatory_present}")
        print(f"Excel File: {CONFIG['output']['excel_file']}")
        
        # Check if Excel file exists
        if os.path.exists(CONFIG['output']['excel_file']):
            print("Excel file created successfully")
        else:
            print("Excel file not found")
            
    except Exception as e:
        print(f"Error: {e}")

# Run the simple verification
show_ai_results()

=== AI FINANCIAL DATA EXTRACTION RESULTS ===

Successfully stored 3 records in database
Successfully dropped the view (if it existed).
Data successfully stored in postgresql database.

--- Database Data Preview ---
   as_of_date   original_security_name  investment_in_original  investment_in  investment_in_prior currency         sector risk_rating maturity_date  yield_percentage  isin cusip asset_class country region
0  03/31/2024  Apple Inc. Common Stock                 50000.0        57500.0              52000.0      USD     Technology    Moderate          None              0.50  None  None        None    None   None
1  03/31/2024     Emerging Markets ETF                 25000.0        27800.0              26100.0      USD  International        High          None              2.85  None  None        None    None   None
2  03/31/2024    US Treasury Bond 2030                100000.0        98500.0              99200.0      USD     Government         Low    12/31/2030              4.25 