Credit Card Statement Parser

A Python-based web application for automatically extracting key information and all transaction details from credit card statements using Streamlit and pdfplumber.

Overview

This project provides an automated solution for parsing PDF credit card statements from multiple issuers. It extracts:

Statement Summary (5 key data points):

Card Variant/Type - The name of the credit card (e.g., Chase Sapphire Preferred)
Card Number (Last 4) - Last four digits of the card number
Billing Cycle - Statement period date range
Payment Due Date - When payment is due
Total Balance - Amount due on the statement

Transaction Details (NEW!):

✅ All transactions with date, description, and amount
✅ Transaction analytics (count, total spent, average)
✅ Export to CSV for further analysis
✅ Complete PDF text viewer

Supported Issuers

American Express
Chase
Citibank
Bank of America
Discover

Features

Multi-Issuer Support: Modular parser design handles different statement formats
Auto-Detection: Automatically identifies credit card issuer from statement text
Transaction Extraction: Extracts ALL transactions with date, description, amount
Transaction Analytics: Shows count, total spent, and average per transaction
Web Interface: User-friendly Streamlit interface for easy file upload and viewing results
Export Functionality: Download extracted data and transactions as CSV
Raw Text Viewer: See complete PDF text for verification
Robust Parsing: Uses regex patterns and table extraction for accuracy
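
The analytics figures (count, total spent, average) come down to simple arithmetic over the extracted transactions. The following is a minimal sketch only: the Transaction record, the summarize function, and the rule that positive amounts are charges while negative amounts are payments or credits are all assumptions made for illustration, not the project's actual code.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Transaction:
    """Hypothetical record type; the project's own structure may differ."""
    date: str
    description: str
    amount: Decimal  # assumed convention: positive = charge, negative = payment/credit


def summarize(transactions):
    """Return count, total spent, and average over charges only."""
    charges = [t.amount for t in transactions if t.amount > 0]
    total = sum(charges, Decimal("0"))
    # Guard against division by zero when a statement has no charges.
    average = (total / len(charges)).quantize(Decimal("0.01")) if charges else Decimal("0.00")
    return {"count": len(charges), "total_spent": total, "average": average}
```

Decimal is used instead of float so that currency totals do not pick up binary rounding errors.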

Architecture

The project follows a modular design pattern:

credit_card_parser/
├── parsers/                    # PDF parsing logic package
│   ├── __init__.py            # Package initialization and factory functions
│   ├── base_parser.py         # Base parser class with common utilities
│   ├── amex_parser.py         # American Express specific parser
│   ├── chase_parser.py        # Chase specific parser
│   ├── citi_parser.py         # Citibank specific parser
│   ├── boa_parser.py          # Bank of America specific parser
│   ├── discover_parser.py     # Discover specific parser
│   └── utils.py               # Helper functions for text extraction
├── app.py                      # Streamlit web application
├── pyproject.toml              # Project metadata and dependencies (used by uv)
├── requirements.txt            # Python dependencies (used by pip)
└── README.md                   # This file

Design Principles

Strategy Pattern: Each issuer has a dedicated parser class implementing a common interface
Separation of Concerns: UI logic (Streamlit) separated from parsing logic
Extensibility: Easy to add support for new issuers by creating new parser classes
Reusability: Common extraction utilities shared across all parsers
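
To illustrate the Strategy Pattern described above, here is a small self-contained sketch. ParserSketch, ChaseSketch, DiscoverSketch, PARSERS, and get_parser are hypothetical stand-ins, and the method signatures are assumed; the real StatementParser in parsers/base_parser.py may look different.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ParserSketch(ABC):
    """Stand-in for a common parser interface (names are assumptions)."""

    issuer_name = "Unknown"

    @abstractmethod
    def extract_card_variant(self, text: str) -> Optional[str]:
        """Issuer-specific logic supplied by each subclass."""

    def parse(self, text: str) -> dict:
        # Shared driver: callers use parse() without knowing the issuer.
        return {
            "issuer": self.issuer_name,
            "card_variant": self.extract_card_variant(text),
        }


class ChaseSketch(ParserSketch):
    issuer_name = "Chase"

    def extract_card_variant(self, text):
        return "Chase Sapphire Preferred" if "Sapphire Preferred" in text else None


class DiscoverSketch(ParserSketch):
    issuer_name = "Discover"

    def extract_card_variant(self, text):
        return "Discover it" if "Discover it" in text else None


# Registry mapping issuer name to parser class.
PARSERS = {cls.issuer_name: cls for cls in (ChaseSketch, DiscoverSketch)}


def get_parser(issuer: str) -> ParserSketch:
    """Factory: look up and instantiate the parser for an issuer."""
    try:
        return PARSERS[issuer]()
    except KeyError:
        raise ValueError(f"Unsupported issuer: {issuer}") from None
```

With this shape, supporting a new issuer means adding one subclass and one registry entry; the calling code stays unchanged.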

Technology Stack

Streamlit: Web application framework for the user interface
pdfplumber: PDF text and table extraction with layout-aware handling
pandas: Data manipulation and CSV export
pypdf: Additional PDF handling capabilities
python-dateutil: Date parsing utilities

Installation

Prerequisites

Python 3.10 or higher
uv - Fast Python package installer (recommended; the pip setup described later works without it)

Setup with uv (Recommended)

Install uv (if not already installed):

# On macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or with Homebrew
brew install uv

# Or with pip
pip install uv

Clone or download the project
Navigate to project directory:

cd CreditCardParser  # or wherever you cloned/downloaded the project

Create virtual environment and install dependencies:

# uv will automatically create a venv and install all dependencies
uv sync

This will:

Create a virtual environment in .venv/
Install all required packages from pyproject.toml
Set up the project for development

Alternative: Traditional pip setup

If you prefer using pip:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Usage

Running the Application

Activate the virtual environment (skip this if you launch the app with uv run; use venv/ instead of .venv/ if you followed the pip setup):

source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Start the Streamlit app:

# With uv (automatically uses the project's virtual environment)
uv run streamlit run app.py

# Or if venv is activated
streamlit run app.py

Open your browser to the URL displayed (typically http://localhost:8501)
Upload a PDF statement using the file uploader in the sidebar
Choose detection method:
- Auto-detect (recommended): Automatically identifies the issuer
- Manual selection: Choose your issuer from the dropdown
Click "Parse Statement" to extract the information
View results displayed in the main panel
Download CSV (optional) to export the extracted data

Expected PDF Format

Digital PDFs only: Text-based statements (not scanned images)
No password protection: PDFs should not be encrypted
Standard formats: Works best with official bank-issued statements
English language: Designed for English-language statements

How It Works

1. PDF Text Extraction

Uses pdfplumber to extract text from all pages
Handles multi-column layouts and tables effectively
Text is cleaned and normalized for parsing
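
As a rough sketch of this step, the snippet below reads every page with pdfplumber and then normalizes whitespace. The function names extract_text and normalize_text and the specific cleanup rules are assumptions for illustration; the project's own helpers in parsers/utils.py may behave differently.

```python
import re


def extract_text(pdf_path):
    """Concatenate the text of every page using pdfplumber."""
    # Third-party dependency, imported lazily so normalize_text works without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # Pages without a text layer may yield None or an empty string.
        pages = [page.extract_text() or "" for page in pdf.pages]
    return "\n".join(pages)


def normalize_text(raw):
    """Collapse runs of spaces/tabs and drop blank lines (assumed cleanup rules)."""
    lines = (re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)
```

If normalize_text(extract_text(path)) comes back empty, the PDF is most likely a scan without selectable text.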

2. Issuer Detection

Searches for issuer-specific keywords and patterns
Matches against known issuer identifiers
Falls back to manual selection if auto-detection fails
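
One possible way to implement keyword-based detection is shown below. The keyword lists are illustrative guesses rather than the identifiers the project actually uses, and this detect_issuer is a sketch, not the function in parsers/__init__.py.

```python
# Illustrative keywords only. Bare "chase" is avoided on purpose because it
# also appears inside the word "purchase".
ISSUER_KEYWORDS = {
    "American Express": ("american express", "americanexpress.com"),
    "Chase": ("chase.com", "jpmorgan chase"),
    "Citibank": ("citibank", "citi.com", "citicards"),
    "Bank of America": ("bank of america", "bankofamerica.com"),
    "Discover": ("discover.com", "discover card", "discover it"),
}


def detect_issuer(text):
    """Return the issuer with the most keyword hits, or None if nothing matches."""
    lowered = text.lower()
    scores = {
        issuer: sum(lowered.count(keyword) for keyword in keywords)
        for issuer, keywords in ISSUER_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A None result is the signal for the UI to fall back to manual selection.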

3. Data Extraction

Each parser implements issuer-specific logic:

Regex patterns for structured data (dates, amounts, card numbers)
Keyword matching for field labels (varies by issuer)
Flexible search to handle format variations
Fallback mechanisms when primary patterns don't match
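
The primary-pattern-then-fallback approach can be sketched as follows. The regular expressions, the first_match helper, and the extract_summary function are hypothetical examples; real statements vary, and each issuer's parser would need its own patterns.

```python
import re
from decimal import Decimal

# Each list is ordered: primary pattern first, fallbacks after (all assumed).
LAST4_PATTERNS = [
    r"ending\s+in[:\s]+(\d{4})\b",
    r"(?:[xX*]{4}[\s-]?){2,3}(\d{4})\b",
]
DUE_DATE_PATTERNS = [
    r"payment\s+due\s+date[:\s]+(\d{1,2}/\d{1,2}/\d{2,4})",
    r"due\s+(?:on|by)[:\s]+(\d{1,2}/\d{1,2}/\d{2,4})",
]
BALANCE_PATTERNS = [
    r"new\s+balance[:\s]+\$?([\d,]+\.\d{2})",
    r"total\s+balance[:\s]+\$?([\d,]+\.\d{2})",
]


def first_match(patterns, text):
    """Try each pattern in order and return the first captured group."""
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1)
    return None


def extract_summary(text):
    """Pull last-4 digits, due date, and balance out of statement text."""
    balance = first_match(BALANCE_PATTERNS, text)
    return {
        "card_last4": first_match(LAST4_PATTERNS, text),
        "payment_due_date": first_match(DUE_DATE_PATTERNS, text),
        # Strip thousands separators before converting to Decimal.
        "total_balance": Decimal(balance.replace(",", "")) if balance else None,
    }
```

Fields that no pattern matches come back as None, so the UI can flag them instead of showing a wrong value.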

4. Result Presentation

Structured data object (StatementData)
Clean display with metrics and summaries
Export functionality for further analysis
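
For illustration, a structured result and its CSV export might look like the sketch below. The field names are guesses based on the five summary data points, not the actual StatementData definition, and the standard-library csv module is used here to keep the example dependency-free even though the project lists pandas for CSV export.

```python
import csv
import io
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class StatementData:
    """Assumed shape: one field per summary data point, plus the issuer."""
    issuer: Optional[str] = None
    card_variant: Optional[str] = None
    card_last4: Optional[str] = None
    billing_cycle: Optional[str] = None
    payment_due_date: Optional[str] = None
    total_balance: Optional[str] = None


def to_csv(data):
    """Serialize one statement summary as CSV text (header row + data row)."""
    row = asdict(data)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(row), lineterminator="\n")
    writer.writeheader()
    writer.writerow(row)  # None values are written as empty cells
    return buffer.getvalue()
```

The returned string could then be handed to a download control such as Streamlit's st.download_button.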

Extending the Parser

Adding a New Issuer

Create a new parser file in parsers/ (e.g., wells_fargo_parser.py)
Inherit from StatementParser:

from .base_parser import StatementParser

class WellsFargoParser(StatementParser):
    def __init__(self):
        super().__init__()
        self.issuer_name = "Wells Fargo"

    # Implement required methods
    def extract_card_variant(self):
        # Your extraction logic
        pass

    # ... implement other methods

Update parsers/__init__.py:

from .wells_fargo_parser import WellsFargoParser

PARSER_MAP = {
    # ... existing parsers
    'Wells Fargo': WellsFargoParser,
}

Add detection patterns in the detect_issuer function
Test with sample statements from the new issuer

Limitations

Digital PDFs only: Does not support scanned/image-based PDFs (no OCR)
Encrypted PDFs: Password-protected files must be unlocked first
Format variations: Accuracy depends on statement format consistency
Language: Currently supports English statements only
Local execution: Designed for local/single-user use (not cloud-deployed)

Future Enhancements

Potential improvements for future versions:

OCR Support: Add Tesseract integration for scanned statements
Transaction Categorization: Categorize extracted transactions (e.g., groceries, travel, dining)
Password Handling: Support for encrypted PDFs with password input
Batch Processing: Process multiple statements at once
Data Visualization: Charts and graphs for spending analysis
Database Storage: Save parsed data for historical tracking
API Endpoint: RESTful API for programmatic access
Additional Issuers: Expand support to more credit card companies

Testing

To test the parser:

Obtain sample PDF statements from each supported issuer
Upload through the Streamlit interface
Verify all five summary data points and the transaction list are extracted correctly
Check edge cases (different date formats, special characters, etc.)
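
Date formats are a common edge case. The sketch below shows one way to accept several formats using only the standard library; the format list and the parse_statement_date name are assumptions. The project lists python-dateutil as a dependency and may well rely on that instead.

```python
from datetime import datetime

# Formats tried in order; extend this tuple as new statement layouts appear.
DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y", "%b %d, %Y", "%B %d, %Y")


def parse_statement_date(raw):
    """Return a date for the first format that fits, or None if none do."""
    cleaned = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None
```

Month-name formats such as "Jan 15, 2025" depend on the process locale, so this assumes an English locale.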

Troubleshooting

"Could not extract text from PDF"

Ensure PDF is not password-protected
Verify PDF is digitally generated (not a scan)
Try opening the PDF in a reader to confirm it contains selectable text

"Could not auto-detect issuer"

Use manual selection from the dropdown
Check if the statement is from a supported issuer
Verify the PDF contains the issuer's name/logo text

Incorrect data extraction

Some statements may have non-standard formats
Try updating the regex patterns in the relevant parser
Report the issue with statement details for improvements

Development

Using uv for Development

Install with dev dependencies:

uv sync --all-extras

Run tests (when available):

uv run pytest

Format code:

uv run black .
uv run ruff check --fix .

Adding New Dependencies

# Add a new dependency
uv add package-name

# Add a dev dependency
uv add --dev package-name

Contributing

To contribute to this project:

Fork the repository
Create a feature branch
Install with dev dependencies: uv sync --all-extras
Add your enhancements or fixes
Test thoroughly with sample statements
Format code: uv run black . && uv run ruff check --fix .
Submit a pull request with detailed description

License

This project is provided as-is for educational and personal use.

Acknowledgments

Built with Streamlit
PDF parsing powered by pdfplumber
Inspired by real-world needs for automating financial document processing

Note: This tool is for personal use only. Always verify extracted data against original statements before making financial decisions.
