# 🌟 **Smart Statement Reader** 🚀  

> **An AI/ML-powered solution for processing financial PDFs with precision and efficiency.**  

🔹 **Features:**

- 📂 **Automatic File Format Detection** – Identifies PDF structures effortlessly.
- 📊 **Financial Ledger Extraction** – Extracts and organizes financial data into structured formats.
- 🛠️ **Handling Formatting Inconsistencies** – Adapts to varying PDF formats seamlessly.
- 🎯 **Confidence Scoring** – Provides accuracy insights on extracted data.
- 🔄 **User Feedback Integration** – Continuously improves through corrections and feedback.

✨ Transform financial statements into actionable insights with AI!

#### Detect PDF Type (Structured vs. Scanned)

In [1]:
import pdfplumber

def is_scanned_pdf(pdf_path):
    """Check if the PDF is scanned or structured"""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text and len(text.strip()) > 10:
                return False  # Text-based PDF
    return True  # Scanned PDF

#### _Testing_

In [2]:
pdf_path = "Ledger_Entries.pdf"
print("Scanned PDF:", is_scanned_pdf(pdf_path))

Scanned PDF: False
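
`is_scanned_pdf` returns as soon as any page yields text, so a PDF that mixes text pages with scanned pages is treated as fully text-based. A minimal sketch of a per-page decision follows, assuming the page texts have already been collected (for example from `page.extract_text()`). The helper name `pages_needing_ocr` and its `min_chars` parameter are hypothetical; the default mirrors the 10-character threshold used above.

```python
def pages_needing_ocr(page_texts, min_chars=10):
    """Return indices of pages whose extracted text is missing or too short.

    page_texts: list of str or None, one entry per page.
    min_chars: assumed threshold, mirroring the check in is_scanned_pdf.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Same test as is_scanned_pdf, applied to each page on its own
        if not text or len(text.strip()) <= min_chars:
            needs_ocr.append(index)
    return needs_ocr


# The second entry has no text layer, the third only a stray character.
sample = ["1.1.2000 Business started with Rs.2,50,000", None, " 3 "]
print(pages_needing_ocr(sample))  # [1, 2]
```

Only the returned page indices would then need to go through the OCR path, while the remaining pages keep their embedded text.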


#### Extract Text Based on PDF Type

##### For Text-Based PDFs (Structured PDFs)

In [3]:
def extract_text_from_pdf(pdf_path):
    """Extract text from structured PDFs using pdfplumber"""
    text_data = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() may return None or "" for a page without a text layer
            text_data.append(page.extract_text() or "")
    return "\n".join(text_data)

##### For Scanned PDFs (OCR-Based Extraction)

In [4]:
import fitz  # PyMuPDF
import cv2
import numpy as np
import pytesseract

def extract_text_from_scanned_pdf(pdf_path):
    """Extract text from scanned PDFs using OCR"""
    text_data = []
    doc = fitz.open(pdf_path)

    for page_num in range(len(doc)):
        img = doc[page_num].get_pixmap()  # Convert to image
        img_array = np.frombuffer(img.samples, dtype=np.uint8).reshape(img.height, img.width, 3)
        gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)  # PyMuPDF pixmaps are RGB; convert to grayscale
        text = pytesseract.image_to_string(gray)
        text_data.append(text)
    
    return "\n".join(text_data)

#### _Testing_

In [5]:
scanned = is_scanned_pdf(pdf_path)
if scanned:
    text = extract_text_from_scanned_pdf(pdf_path)
else:
    text = extract_text_from_pdf(pdf_path)

print(text[:1000])  # Print first 1000 characters for verification

JOURNAL, LEDGER,
SUBSIDIARY BOOKS AND
TRIAL BALANCE
Prepared by Mrs.M.Janani
Department of Commerce (International Business)
Governement Arts College, Coimbatore – 18.
Refernce: Financial Accounting
Author: T.S.Reddy & Dr.A.Murthy
Journal:
1. Journalise the following transactions of M/s. Radha & Sons.
1.1.2000 Business Started with Rs.2,50,000 and cash deposited with Bank – 1,50,000
3.1.2000 Purchasesd machinery on credit from Rangan – 50,000
6.1.2000 Bought furniture from Ramesh for cash – 25,000
12.1.2000 Goods sold to Yesodha – 22,500
13.1.2000 Goods returned by Yesodha – 2,500
15.1.2000 Goods sold for cash – 50,000
17.1.2000 Bought goods for cash – 25,000
20.1.2000 Cash received from Yesodha – 10,000
21.1.2000 Cash paid to Ramola – 20,000
25.1.2000 Cash withdrawn from bank – 50,000
29.1.2000 Paid advertisement expenses – 12,500
30.1.2000 Bought office stationery for cash – 5,000
31.1.2000 Cash withdrawn from bank for personal use of the proprietor – 6,250
31.1.2000 Paid salaries – 
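
The journal lines in this preview follow the shape "date, description, dash, amount", and some amounts use Indian digit grouping (1,50,000). A minimal sketch of a parser for that shape is given here, assuming every entry sits on one line. The names `JOURNAL_LINE` and `parse_journal_lines` are hypothetical and not part of the notebook.

```python
import re

# Assumed line shape, taken from the preview: "<d.m.yyyy> <description> – <amount>"
JOURNAL_LINE = re.compile(
    r"^(\d{1,2}\.\d{1,2}\.\d{4})[ \t]+(.+?)[ \t]+[–-][ \t]+(\d[\d,]*(?:\.\d{1,2})?)[ \t]*$",
    re.MULTILINE,
)


def parse_journal_lines(text):
    """Parse 'date description – amount' lines into dictionaries."""
    entries = []
    for date, description, amount in JOURNAL_LINE.findall(text):
        entries.append({
            "Date": date,
            "Description": description.strip(),
            # Dropping every comma also handles Indian grouping such as 1,50,000
            "Amount": float(amount.replace(",", "")),
        })
    return entries


preview = (
    "3.1.2000 Purchasesd machinery on credit from Rangan – 50,000\n"
    "6.1.2000 Bought furniture from Ramesh for cash – 25,000\n"
    "31.1.2000 Paid salaries – \n"  # truncated line: no amount, so it is skipped
)
for entry in parse_journal_lines(preview):
    print(entry)
```

Lines without a trailing amount, such as the truncated last line of the preview, are left out rather than guessed.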

#### Extract Financial Data (Ledger Entries)

In [11]:
import re

def extract_ledger_entries(text):
    """Extract financial ledger entries using regex"""
    pattern = re.compile(
        r'(\d{1,4}[-/.]\d{1,2}[-/.]\d{2,4})\s+([\w\s]+?)\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d{2})?)\s+(-?\d{1,3}(?:,\d{3})*(?:\.\d{2})?)'
    )
    transactions = []

    for match in pattern.findall(text):
        date, description, amount, balance = match
        transactions.append({
            "Date": date,
            "Description": description.strip(),
            "Amount": float(amount.replace(",", "")),  # Remove commas before conversion
            "Balance": float(balance.replace(",", ""))  # Remove commas before conversion
        })

    return transactions

#### _Testing_

In [12]:
transactions = extract_ledger_entries(text)
print(transactions[:5])  # Print first 5 transactions

[{'Date': '1.4.2000', 'Description': 'To Sales', 'Amount': 6000.0, 'Balance': 5.0}, {'Date': '18.4.2000', 'Description': 'To Sales By Discount allowed', 'Amount': 200.0, 'Balance': 8000.0}, {'Date': '30.4.2000', 'Description': 'By Cash', 'Amount': 4500.0, 'Balance': 30.0}, {'Date': '22.5.2000', 'Description': 'By Cash', 'Amount': 4850.0, 'Balance': 12.0}, {'Date': '30.8.1987', 'Description': 'Returned inferior goods to Sankar', 'Amount': -800.0, 'Balance': 31.0}]
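
Some of the captured values look doubtful, for example a balance of 5.0 beside an amount of 6000.0, which is where the confidence scoring listed under Features could come in. The sketch below is one possible heuristic and not the project's scoring method: the function names, the day-first date formats, and the threshold of 100 are all assumptions chosen for illustration.

```python
from datetime import datetime

DATE_FORMATS = ("%d.%m.%Y", "%d/%m/%Y", "%d-%m-%Y")  # assumed day-first formats


def is_valid_date(value):
    """Return True if value parses as a calendar date in one of the assumed formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def score_transaction(entry):
    """Return a heuristic confidence in [0, 1] for one extracted entry."""
    checks = [
        is_valid_date(entry["Date"]),
        any(ch.isalpha() for ch in entry["Description"]),
        entry["Amount"] != 0,
        # A balance under 100 beside a larger amount is flagged (arbitrary threshold)
        not (abs(entry["Balance"]) < 100 <= abs(entry["Amount"])),
    ]
    return sum(checks) / len(checks)


sample = [
    {"Date": "1.4.2000", "Description": "To Sales", "Amount": 6000.0, "Balance": 5.0},
    {"Date": "18.4.2000", "Description": "To Sales By Discount allowed", "Amount": 200.0, "Balance": 8000.0},
]
for entry in sample:
    print(entry["Date"], score_transaction(entry))
```

Each check contributes equally, so the score is simply the fraction of checks that pass. Entries below a chosen cut-off could be set aside for manual review.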


#### Save Data to CSV/Excel

In [18]:
import pandas as pd

def save_to_csv(transactions, output_file):
    """Save extracted data to CSV"""
    df = pd.DataFrame(transactions)
    df.to_csv(output_file, index=False)
    print(f"Data saved to {output_file}")

def save_to_excel(transactions, output_file):
    """Save extracted data to Excel"""
    df = pd.DataFrame(transactions)
    df.to_excel(output_file, index=False, engine='openpyxl')
    print(f"Data saved to {output_file}")

#### Combine Everything in One Function

In [22]:
from pathlib import Path

def process_pdf(pdf_path, output_file):
    """Main function to process PDF and extract structured financial data"""
    scanned = is_scanned_pdf(pdf_path)

    if scanned:
        print("Processing scanned PDF with OCR...")
        extracted_text = extract_text_from_scanned_pdf(pdf_path)
    else:
        print("Processing structured PDF...")
        extracted_text = extract_text_from_pdf(pdf_path)

    transactions = extract_ledger_entries(extracted_text)

    if transactions:
        # Derive the CSV path from the Excel path, whatever its extension,
        # so the two files never overwrite each other
        csv_file = str(Path(output_file).with_suffix(".csv"))
        save_to_csv(transactions, csv_file)
        save_to_excel(transactions, output_file)
    else:
        print("No financial data found in the document.")

# Run the complete pipeline
pdf_file = "Ledger_Entries.pdf"  # Change this to your PDF file
output_file = "extracted_data.xlsx"
process_pdf(pdf_file, output_file)

Processing structured PDF...
Data saved to extracted_data.csv
Data saved to extracted_data.xlsx
