HarshXAI/CreditCardParser

Credit Card Statement Parser

A Python-based web application for automatically extracting key information and all transaction details from credit card statements using Streamlit and pdfplumber.

Overview

This project provides an automated solution for parsing PDF credit card statements from multiple issuers. It extracts:

Statement Summary (5 key data points):

  1. Card Variant/Type - The name of the credit card (e.g., Chase Sapphire Preferred)
  2. Card Number (Last 4) - Last four digits of the card number
  3. Billing Cycle - Statement period date range
  4. Payment Due Date - When payment is due
  5. Total Balance - Amount due on the statement

Transaction Details (NEW!):

  • All transactions with date, description, and amount
  • ✅ Transaction analytics (count, total spent, average)
  • ✅ Export to CSV for further analysis
  • ✅ Complete PDF text viewer

Supported Issuers

  • American Express
  • Chase
  • Citibank
  • Bank of America
  • Discover

Features

  • Multi-Issuer Support: Modular parser design handles different statement formats
  • Auto-Detection: Automatically identifies credit card issuer from statement text
  • Transaction Extraction: Extracts ALL transactions with date, description, amount
  • Transaction Analytics: Shows count, total spent, and average per transaction
  • Web Interface: User-friendly Streamlit interface for easy file upload and viewing results
  • Export Functionality: Download extracted data and transactions as CSV
  • Raw Text Viewer: See complete PDF text for verification
  • Robust Parsing: Uses regex patterns and table extraction for accuracy
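
The transaction analytics above (count, total spent, average) can be illustrated with a minimal sketch. The `Transaction` record and `summarize` helper are hypothetical names, not the project's actual code, and the sketch assumes positive amounts are purchases while negative amounts are payments or credits.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Transaction:
    """Hypothetical record; the project's real transaction type may differ."""
    date: str
    description: str
    amount: Decimal  # assumption: positive = purchase, negative = payment/credit


def summarize(transactions: list[Transaction]) -> dict:
    """Return the transaction count, total spent, and average per purchase."""
    # Only positive amounts are treated as spending (an assumption of this sketch).
    purchases = [t.amount for t in transactions if t.amount > 0]
    total = sum(purchases, Decimal("0"))
    if purchases:
        average = (total / len(purchases)).quantize(Decimal("0.01"))
    else:
        average = Decimal("0.00")
    return {"count": len(transactions), "total_spent": total, "average": average}


sample = [
    Transaction("01/05", "COFFEE SHOP", Decimal("4.50")),
    Transaction("01/07", "GROCERY STORE", Decimal("45.50")),
    Transaction("01/10", "PAYMENT THANK YOU", Decimal("-100.00")),
]
stats = summarize(sample)
# count is 3, total_spent is 50.00, average is 25.00
```

Using `Decimal` rather than `float` avoids rounding drift when summing currency values.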

Architecture

The project follows a modular design pattern:

credit_card_parser/
├── parsers/                    # PDF parsing logic package
│   ├── __init__.py            # Package initialization and factory functions
│   ├── base_parser.py         # Base parser class with common utilities
│   ├── amex_parser.py         # American Express specific parser
│   ├── chase_parser.py        # Chase specific parser
│   ├── citi_parser.py         # Citibank specific parser
│   ├── boa_parser.py          # Bank of America specific parser
│   ├── discover_parser.py     # Discover specific parser
│   └── utils.py               # Helper functions for text extraction
├── app.py                      # Streamlit web application
├── requirements.txt            # Python dependencies
└── README.md                   # This file

Design Principles

  • Strategy Pattern: Each issuer has a dedicated parser class implementing a common interface
  • Separation of Concerns: UI logic (Streamlit) separated from parsing logic
  • Extensibility: Easy to add support for new issuers by creating new parser classes
  • Reusability: Common extraction utilities shared across all parsers

Technology Stack

  • Streamlit: Web application framework for the user interface
  • pdfplumber: PDF text and table extraction with layout-aware handling
  • pandas: Data manipulation and CSV export
  • pypdf: Additional PDF handling capabilities
  • python-dateutil: Date parsing utilities

Installation

Prerequisites

  • Python 3.10 or higher
  • uv - Fast Python package installer

Setup with uv (Recommended)

  1. Install uv (if not already installed):
# On macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with Homebrew
brew install uv

# Or with pip
pip install uv
  2. Clone or download the project

  3. Navigate to the project directory (adjust the path to wherever you placed it):

cd /path/to/CreditCardParser
  4. Create a virtual environment and install dependencies:
# uv will automatically create a venv and install all dependencies
uv sync

This will:

  • Create a virtual environment in .venv/
  • Install all required packages from pyproject.toml
  • Set up the project for development

Alternative: Traditional pip setup

If you prefer using pip:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Usage

Running the Application

  1. Activate the virtual environment (optional if you launch the app with uv run):
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  2. Start the Streamlit app:
# With uv (automatically uses the project's virtual environment)
uv run streamlit run app.py

# Or if venv is activated
streamlit run app.py
  3. Open your browser to the URL displayed (typically http://localhost:8501)

  4. Upload a PDF statement using the file uploader in the sidebar

  5. Choose a detection method:

    • Auto-detect (recommended): Automatically identifies the issuer
    • Manual selection: Choose your issuer from the dropdown
  6. Click "Parse Statement" to extract the information

  7. View the results displayed in the main panel

  8. Download CSV (optional) to export the extracted data

Expected PDF Format

  • Digital PDFs only: Text-based statements (not scanned images)
  • No password protection: PDFs should not be encrypted
  • Standard formats: Works best with official bank-issued statements
  • English language: Designed for English-language statements

How It Works

1. PDF Text Extraction

  • Uses pdfplumber to extract text from all pages
  • Handles multi-column layouts and tables effectively
  • Text is cleaned and normalized for parsing
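
The extraction step might look roughly like the sketch below. `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are real pdfplumber calls; the function names and the exact normalization rules are assumptions, since the project's own helpers in `parsers/utils.py` are not shown here.

```python
import re


def normalize_text(raw: str) -> str:
    """Collapse runs of spaces and tabs, trim each line, and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_pdf_text(path: str) -> str:
    """Extract text from every page of a PDF and normalize it."""
    import pdfplumber  # imported lazily so normalize_text works without it

    with pdfplumber.open(path) as pdf:
        # extract_text() can yield None or "" for pages without a text layer
        pages = [page.extract_text() or "" for page in pdf.pages]
    return normalize_text("\n".join(pages))
```

An empty result from `extract_pdf_text` is a reasonable signal that the file is a scanned image rather than a digital PDF.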

2. Issuer Detection

  • Searches for issuer-specific keywords and patterns
  • Matches against known issuer identifiers
  • Falls back to manual selection if auto-detection fails

3. Data Extraction

Each parser implements issuer-specific logic:

  • Regex patterns for structured data (dates, amounts, card numbers)
  • Keyword matching for field labels (varies by issuer)
  • Flexible search to handle format variations
  • Fallback mechanisms when primary patterns don't match

4. Result Presentation

  • Structured data object (StatementData)
  • Clean display with metrics and summaries
  • Export functionality for further analysis

Extending the Parser

Adding a New Issuer

  1. Create a new parser file in parsers/ (e.g., wells_fargo_parser.py)

  2. Inherit from StatementParser:

from .base_parser import StatementParser

class WellsFargoParser(StatementParser):
    def __init__(self):
        super().__init__()
        self.issuer_name = "Wells Fargo"

    # Implement required methods
    def extract_card_variant(self):
        # Your extraction logic
        pass

    # ... implement other methods
  3. Update parsers/__init__.py:
from .wells_fargo_parser import WellsFargoParser

PARSER_MAP = {
    # ... existing parsers
    'Wells Fargo': WellsFargoParser,
}
  4. Add detection patterns in the detect_issuer function

  5. Test with sample statements from the new issuer

Limitations

  • Digital PDFs only: Does not support scanned/image-based PDFs (no OCR)
  • Encrypted PDFs: Password-protected files must be unlocked first
  • Format variations: Accuracy depends on statement format consistency
  • Language: Currently supports English statements only
  • Local execution: Designed for local/single-user use (not cloud-deployed)

Future Enhancements

Potential improvements for future versions:

  • OCR Support: Add Tesseract integration for scanned statements
  • Transaction Categorization: Categorize the extracted transactions (e.g., by merchant type)
  • Password Handling: Support for encrypted PDFs with password input
  • Batch Processing: Process multiple statements at once
  • Data Visualization: Charts and graphs for spending analysis
  • Database Storage: Save parsed data for historical tracking
  • API Endpoint: RESTful API for programmatic access
  • Additional Issuers: Expand support to more credit card companies

Testing

To test the parser:

  1. Obtain sample PDF statements from each supported issuer
  2. Upload through the Streamlit interface
  3. Verify that all five summary data points and the transaction list are extracted correctly
  4. Check edge cases (different date formats, special characters, etc.)

Troubleshooting

"Could not extract text from PDF"

  • Ensure PDF is not password-protected
  • Verify PDF is digitally generated (not a scan)
  • Try opening the PDF in a reader to confirm it contains selectable text

"Could not auto-detect issuer"

  • Use manual selection from the dropdown
  • Check if the statement is from a supported issuer
  • Verify the PDF contains the issuer's name/logo text

Incorrect data extraction

  • Some statements may have non-standard formats
  • Try updating the regex patterns in the relevant parser
  • Report the issue with statement details for improvements

Development

Using uv for Development

Install with dev dependencies:

uv sync --all-extras

Run tests (when available):

uv run pytest

Format code:

uv run black .
uv run ruff check --fix .

Adding New Dependencies

# Add a new dependency
uv add package-name

# Add a dev dependency
uv add --dev package-name

Contributing

To contribute to this project:

  1. Fork the repository
  2. Create a feature branch
  3. Install with dev dependencies: uv sync --all-extras
  4. Add your enhancements or fixes
  5. Test thoroughly with sample statements
  6. Format code: uv run black . && uv run ruff check --fix .
  7. Submit a pull request with a detailed description

License

This project is provided as-is for educational and personal use.

Acknowledgments

  • Built with Streamlit
  • PDF parsing powered by pdfplumber
  • Inspired by real-world needs for automating financial document processing

Note: This tool is for personal use only. Always verify extracted data against original statements before making financial decisions.
