## 🧾 Bank Record Categorization System
### Bank Records Analyst (bank_record_analyst.ipynb) 
###### Note - it superceded categorize_monthly_expenses.ipynb

An application to categorize, summarize, and analyze checking and credit card bank records. It provides a modular framework for classifying bank transactions into semantic categories and subcategories using pattern-based matching. It supports both human-readable summaries and programmatic filtering for financial analysis.

---
### Reads input files from folder: ./data 
#### DATA NAMING CONVENTION:  2025-08-A&T-CheckingAcct.xlsx / 2025-08-A&T-CCard.xlsx

---

### 📂 Data Structures

#### `categories_to_subcategories_tree: dict[str, dict[str, dict]]`
Defines the hierarchical taxonomy of financial categories and their subcategories.

- **Top-level keys** represent broad financial domains (e.g. `'INCOME'`, `'FOOD'`, `'HOUSING'`).
- **Nested keys** represent specific subcategories (e.g. `'Anita Income'`, `'Dining Out'`).
- Leaf nodes are empty dicts, to keep the datastructure uncluttered and human readible
- Leaf nodes are programatically populated with patterns from the second datastructure, allowing flexibility and extensibility.

**Example:**
```python
'FOOD': {
    'Groceries': {},
    'Dining Out': {},
    'Fast Food': {}
}
```

#### `subcategories_to_patterns: dict[str, list[str | re.Pattern]]`
Maps subcategories to lists of string or regex patterns used to identify matching bank record descriptions.
- **Patterns** may be **simple substrings** or **compiled regular expressions**.
- Enables flexible matching across diverse transaction formats.

**Example:**
```python
'Bills n Utilities': [
    'VERIZON',
    'DOMINION',
    re.compile(r'ATT\*BILL')
]
```



In [None]:
categories_to_subcategories_tree = {
    'INCOME': {
        'Anita Income': {},
        'Fidelity Transfer': {},
        'KeyBank Cash-Back': {}
    },
    'TAXES': {
        'Taxes': {}
    },
    'FEES': {
        'Transaction Fees': {}
    },
    'EXCLUDE': {
        'Visa Payment': {},
        'Visa Payment Received': {}
    },
    'HOUSING': {
        'Mortgage': {},
        'Bills n Utilities': {}
    },
    'INSURANCE': {
        'Medical Insurance': {},
        'Car Insurance': {}
    },
    'HEALTHCARE': {
        'Medical and Dental': {},
        'Pharmacy': {}
    },
    'EDUCATION': {
        'College Tuition': {},
        'Art Supplies': {}
    },
    'PROFESSIONAL': {
        'Professional Fees': {},
        'AI API charges': {},
        'Professional Services': {},
        'Liability Insurance': {}
    },
    'TRANSPORTATION': {
        'Transportation miscellaneous': {},
        'Car Registration': {},
        'Gas': {},
        'Parking and Tolls': {},
        'Car Maintenance': {}
    },
    'FOOD': {
        'Groceries': {},
        'Dining Out': {},
        'Fast Food': {},
        'World Food': {},
        'Why DOORDASH?': {}
    },
    'SELFCARE & WELLBEING': {
        'Aikido n Yoga': {},
        'Beauty n Supplies': {},
        'Sound Bath': {}
    },
    'HOME & GARDEN': {
        'House Maintenance': {},
        'Furnishing': {},
        'Garden': {}
    },
    'SUBSCRIPTIONS': {
        'Subscription': {}
    },
    'SHOPPING': {
        'Amazon': {},
        'Department Store': {},
        'Clothes': {},
        'Kindle n Books': {},
        'Software n Accessories': {},
        'Computers': {},
        'Personal Electronics': {},
        'Personal Equipment': {},
        'Home Supplies n Decor': {},
        'Home Decor': {},
        'Gifts': {},
        'Kids Toys': {},
        'Cycling n Paddling': {}
    },
    'PETS': {
        'Cat Food n Supplies': {},
        'Cat Health': {}
    },
    'ENTERTAINMENT': {
        'Fun Out': {},
        'Entertaining and Parties': {},
        'Movies n Theater': {},
        'Music n Games': {},
        'Memberships': {}
    },
    'MISCELLANEOUS': {
        'Political Donations': {},
        'ATM Wthdrw n Dpsit': {},
        'Vending Machines': {},
        'Trafic Tickets': {},
        'Parcels': {}
    },
    'VACATIONS & TRAVEL': {
        'Trips Out of Town': {},
        'Vacation': {},
        'Air Travel': {},
        'Visiting Grandma Ela': {},
        'Vacation SC': {},
        'Visiting Wanda': {},
        'Visiting Eva': {}
    },
    'UNCATEGORIZED':{
        'Uncategorized'
    }
}

In [None]:
import re   # regular expressions for complex patterns

subcategories_to_patterns = {
    'Anita Income': [
        'ZELLE DEP ANITA'
    ],
    'Fidelity Transfer': [
        'FID '
    ],
    'KeyBank Cash-Back': [
        'KEY REWARDS',
        'GIFT FROM KEY BANK'
    ],
    'Taxes': [
        'TAXREFUND',
        ' IRS ',
        'TAX REF',
        'RITA',
        'CHECK # 746',
        'CHECK # 747'
    ],
    'Transaction Fees': [
        'TRANSACTION FEE'
    ],
    'Visa Payment': [
        'INTERNET TRF TO CCA'
    ],
    'Visa Payment Received': [
        'PAYMENT RECEIVED'
    ],
    'Mortgage': [
        'WFHM'
    ],
    'Bills n Utilities': [
        'VERIZON',
        'VZWRLSS',
        'DOMINION',
        'FIRST ENERGY',
        'NORTHEAST OHIO',
        'CLEVELAND HEIGHTS',
        'ENBRIDGE GAS',
        'ATT ',
        'ATT*BILL',
        'NEORSD',
        'CWD'
    ],
    'Medical Insurance': [
        'MEDICARE',
        'VSP',
        'UNITEDHEALTHCARE',
        'ROCKWELL',
        'AARP HEALTH',
        'DELTA DENTAL'
    ],
    'Car Insurance': [
        'LIBERTY MUTUAL'
    ],
    'Medical and Dental': [
        'PEDIATRICS',
        'CLEVELAND CLINIC',
        'METROHEALTH',
        'WESTERN RESERVE PERIO',
        'HILLCREST ',
        'CLEVELAND KIDNEY ',
        'SPRY SENIOR',
        'ETNA CLEVELAND',
        'HEIGHTS DENTAL '
    ],
    'Pharmacy': [
        'CVS',
        'WALGREENS'
    ],
    'College Tuition': [
        'SMARTPAYCIA',
        'CASHNET',
        'CAMPUS CIA',
        'BURREN',
        'COLLEGE'
    ],
    'Art Supplies': [
        'UTRECHT',
        ' ART '
    ],
    'Professional Fees': [
        'LICENSURE',
        'LICENSE',
        'CENTER FOR INTUITIVE'
    ],
    'AI API charges': [
        'OPENAI '
    ],
    'Professional Services': [
        'PAUKENLEGAL'
    ],
    'Liability Insurance': [
        'CPH LIABILITY'
    ],
    'Transportation miscellaneous': [
        'Annotate manually'
    ],
    'Car Registration': [
        'BUREAU MOTOR VE'
    ],
    'Gas': [
        'SUNOCO',
        'BP',
        'SHELL',
        'MARATHON',
        'CIRCLE K',
        'SHEETZ',
        'GAS',
        'SPEEDWAY'
    ],
    'Parking and Tolls': [
        'EZ PASS REAL TIME',
        'GARAG ',
        'PARKING SERVICE',
        'PARKING CLEVELAND',
        'CITY OF CLEVELAND HEIG'
    ],
    'Car Maintenance': [
        'REPAIR',
        'AUTO',
        'AUTO BODY',
        'QUALITY AUTO'
    ],
    'Groceries': [
        'GROCERY',
        'HEINEN',
        'DAVE',
        'WHOLE',
        'SODA',
        'TRADE'
    ],
    'Dining Out': [
        'COZUMEL CLEVELAND HEIOH',
        'TAVERN',
        'TOMMYS',
        'CAFE',
        'WASABI',
        'PACIFIC',
        'ANATOLIA',
        'BATUQUI',
        'PHO',
        'LAKE HOUSE',
        'INDIAN FLAME',
        'DEWEY',
        'MAROTTA ',
        'BANANA',
        'BANGKOK',
        'HIBACHI',
        'BRASSICA',
        'RESTAUR',
        'BUFFALO',
        'COZUMEL',
        'FIRST WATCH',
        'PARADISE BIRYANI',
        'CARIBOU COFFEE',
        'STONE OVEN',
        'SEOUL GARDEN ',
        'YOURS TRULY',
        'LOCKKEEPERS',
        'MO MO`S KEBAB',
        'BAKE ME A WISH',
        'ONE POT CLEVELAND'
    ],
    'Fast Food': [
        'LEFTY',
        'SHAKE SHACK',
        'SHAKESHACK',
        'WENDY',
        'BUDDA',
        'CILANTRO',
        'PANERA',
        'CHIPOTLE',
        'BIBIBOP',
        'ROGERS',
        'PIADA',
        'FIVE GUYS',
        'ZINA',
        'SUBSHOPPE',
        'NATURES OASIS',
        'LOTUS EXPRESS',
        'JERSEY MIKES',
        'SQ *LEVY', # is it Levy fast food and where? 
        'BRUEGGERS '
    ],
    'World Food': [
        'KRAKOW',
        'NIPA HUT',
        'YELESEYEVSKY'
    ],
    'Why DOORDASH?': [
        'DOORDASH'
    ],
    'Aikido n Yoga': [
        'CHECK',
        ' YOGA'
    ],
    'Beauty n Supplies': [
        'LADIES',
        'BATH',
        'SALLY BEAUTY',
        'AVEDA',
        'LUSH BEACHWOOD',
        'AIKIKAI',
        'AIKIDO',
        'ATMA',
        'PADDLE',
        'OFFICEMAX'
    ],
    'Sound Bath': [
        'PAYPAL INST'
    ],
    'House Maintenance': [
        'HEIGHTS HDWE', # Cle Hts hardware store
        re.compile(r'HOME DEPOT.*CLEVELAND'),
        re.compile(r'HOME DEPOT.*OH$')
    ],
    'Furnishing': [
        'WORLD',
        'REFURNISHING',
        'KOALA',
        'WAYFAIR'
    ],
    'Garden': [
        'BREMEC',
        'LANDSCAPE',
        'STUMP',
        'NATURE CENTER'
    ],
    'Subscription': [
        'SPOTIFY',
        'APPLE',
        'NETFLIX',
        'AUDIBLE',
        'PEACOCK',
        'WALL',
        'BITDEFENDER',
        'MICROSOFT',
        'HULU',
        'NYTIMES',
        'IDEASTREAM',
        'WSJ'
    ],
    'Amazon': [
        'AMAZON',
        'AMZN'
    ],
    'Department Store': [
        'TARGET',
        'MACY'
    ],
    'Clothes': [
        'REI',
        'NORDSTROM',
        'DICK',
        'DSW',
        'AVALON',
        'MARSHALLS',
        'ANN TAYLOR',
        'AMERICAN EAGLE',
        'FOOTWEAR',
        'H&M ',
        'OLD NAVY ',
        "VICTORIA'S SECRET"
    ],
    'Kindle n Books': [
        'KINDLE',
        'AUDIOTEKA',
        'LOGANBERRY',
        "MAC'S",
        'EMPIK',
        'thepolishbookstore' # online bookstore Tytus Romek etc.
    ],
    'Software n Accessories': [
        'SIMON HAYNES',
        'ALISTORE',
        'GOOGLE',
        'FLIXEASY',
        'CLIP STUDIO',
        'ALIEXPRESS',
        'SERIF.COMBILL MINNETONKA MN'
    ],
    'Computers': [
        'MICRO CENTER'
    ],
    'Personal Electronics': [
        'BEST BUY'
    ],
    'Personal Equipment': [
        'Annotate manually'
    ],
    'Home Supplies n Decor': [
        'Annotate manually'
    ],
    'Gifts': [
        'FIDDLEHEAD',
        'PASSPORT',
        'DIAMONDS FLOWERS',
        'BUNDT',
        'ALL CITY CANDY',
        'LOTUS FLOWER LLC',
        'PINKBLUSHMATERNIT',
        'LITTLE ROOM ',
        'CRAFT COLLECTIVE',
        'PANDORA',
        'WORDPRESS ', # is it wordpress.com? why a gift?
        'STUFFEDANIMALSHOP',
        'GETSHIRTZ',
        'FLOWERS.COM',
        'PROGIFT',
        'BERKSHIRE BLANKET'
    ],
    'Kids Toys': [
        'PLAYMATTERS',
        'DISNEYSTORE',
        'CHILDRENS PLACE',
        'LITTLEMARVIN', # little switchboard for Bean
        'THECHILDRENSPLACE.COM'
    ],
    'Cycling n Paddling': [
        'BIKES',
        'IROCKER',
        'OHIO STATE PARKS'
    ],
    'Cat Food n Supplies': [
        'PET',
        'CHEWY',
        'HOLLYWOOD FEED',
        'JACKSON GALAXY'
    ],
    'Cat Health': [
        'VETERINARY'
    ],
    'Fun Out': [
        'CLEVELAND MUSEUM OF AR',
        'ClevelandMuseumofNat', # Cleveland Museum of Natural History parking
        'CLEV MUS NAT HIST', # Cleveland Museum of Natural History membership
        'CHILDRENS MUSEUM ',
        'GREATER CLEVELAND AQUA',
        'METROPARKS FARMPA KIRTLAND',
        'MITCHELL',
        'SWEET FIX',
        'MANGO MANGO DESSE',
        'ON THE RISE',
        'RISING STAR COFFEE',
        'NERVOUS DOG COFFEE',
        'STARBUCKS',
        'MICHAELS',
        'UPTOWN MART',
        'ELLIE-MAYS',
        '6 FLAVORS INDIAN',
        'LUXE KITCHEN',
        'KOKO BAKERY'
    ],
    'Entertaining and Parties': [
        'Annotate manually'
    ],
    'Movies n Theater': [
        'VUDU',
        'FANDANGO',
        'DOBAMA',
        'THEAT',
        'CLEVELAND PUBLIC',
        'CLEVELAND INSTITUTE OF CLEVELAND',
        'MOVIE',
        'PRIME VIDEO',
        'BORDERLIGHT',
        'CINEMA'
    ],
    'Music n Games': [
        'STEAMGAMES',
        'BANDCAMP'
    ],
    'Memberships': [
        'Cleveland Museum Cleveland',
        'LITCLEVELAND'
    ],
    'Political Donations': [
        'ACTBLUE',
        'CUIMC '
    ],
    'ATM Wthdrw n Dpsit': [
        'ATM '
    ],
    'Vending Machines': [
        'VENDING ',
        'PEPSIVEN'
    ],
    'Parcels': [
        'USPS '
    ],
    'Traffic Tickets': [
        'Annotate manually'
    ],
    'Trips Out of Town': [
        'Annotate manually'
    ],
    'Vacation': [
        'Annotate manually'
    ],
    'Air Travel': [
        'LOT ',
        'AMERICAN',
        'EXPEDIA',
        'SAS ',
        'EAT AND GO JAMAICA',
        ' Kastrup',
        'HUDSONNEWS ',
        'HUDSON',
        'HNDISCOVER',
        'JFK ',
        'MIDTOWN BISTRO',
        'TST* ',
        'CURRITO '
    ],
    'Visiting Grandma Ela': [
        'PLUSKI',
        'OLSZTYN',
        'WARSZAWA',
        re.compile(r' POL$'),
        re.compile(r' CHICAGO IL$')
    ],
    'Vacation SC': [
        'FOLLY',
        'VIATORTRIPADVISOR',
        'VACASA',
        'VRBO',
        re.compile(r' SC$'),
        re.compile(r' NC$'),
        re.compile(r' WV$')
    ],
    'Visiting Wanda': [
        re.compile(r' VA$'),
        re.compile(r' GA$'),
        re.compile(r' MD$')
    ],
    'Visiting Eva': [
        'PITTSBURGH',
        'TURNPIKE',
        re.compile(r' PA$')
    ],
    'Uncategorized': [
        'Autogenerated - uncategorized transactions'
    ]   
}

In [319]:
import pandas as pd

def build_categories_df(categories_to_subcategories_tree, subcategories_to_patterns):
    """
    Build a DataFrame from the categories and subcategories mapping.
    
    :param categories_to_subcategories_tree: Dictionary mapping categories to their subcategories.
    :param subcategories_to_patterns: Dictionary mapping subcategories to their patterns.
    :return: DataFrame with columns 'Category', 'Subcategory', 'Pattern', 'IsRegex'.
    """
    
    # Step 1: Create a mapping of subcategory → category
    subcategory_to_category = {
        subcat: category
        for category, subcats in categories_to_subcategories_tree.items()
        for subcat in subcats
    }

    # Step 2: Assemble the new categories_list
    categories_list = []

    for subcat, patterns in subcategories_to_patterns.items():
        category = subcategory_to_category.get(subcat, 'UNKNOWN')
        for pattern in patterns:
            pattern_text = pattern.pattern if hasattr(pattern, 'pattern') else pattern
            categories_list.append({
                'Category': category,
                'Subcategory': subcat,
                'Pattern': pattern_text,
                'IsRegex': hasattr(pattern, 'search')
            })

    return pd.DataFrame(categories_list)

In [320]:
categories_df = build_categories_df(categories_to_subcategories_tree, subcategories_to_patterns)
print("Categories DataFrame sample:")
print(categories_df.head())

Categories DataFrame sample:
  Category        Subcategory             Pattern  IsRegex
0   INCOME       Anita Income     ZELLE DEP ANITA    False
1   INCOME  Fidelity Transfer                FID     False
2   INCOME  KeyBank Cash-Back         KEY REWARDS    False
3   INCOME  KeyBank Cash-Back  GIFT FROM KEY BANK    False
4    TAXES              Taxes           TAXREFUND    False


To help you validate the integrity of your categorization system, I’ve designed a method that complements your  function. It checks for:

### ✅ Validation Goals
- 	Typos or mismatches between subcategories and their patterns.
- 	Empty subcategories (no patterns assigned).
- 	Subcategories with unknown categories.
- 	Duplicate subcategory entries.
- 	Suspicious naming (e.g. trailing spaces, unusual characters).

In [321]:
def validate_category_mappings(categories_to_subcategories_tree, subcategories_to_patterns):
    """
    Validate the integrity of category and subcategory mappings.

    Returns a dictionary of issues found, grouped by type.
    """

    issues = {
        'missing_subcategory_in_tree': [],
        'missing_patterns_for_subcategory': [],
        'unknown_category': [],
        'duplicate_subcategories': [],
        'suspicious_subcategory_names': []
    }

    # Build subcategory → category mapping
    subcategory_to_category = {}
    for category, subcats in categories_to_subcategories_tree.items():
        for subcat in subcats:
            if subcat in subcategory_to_category:
                issues['duplicate_subcategories'].append(subcat)
            subcategory_to_category[subcat] = category

            # Check for suspicious naming
            if subcat.strip() != subcat or any(c in subcat for c in ['?', '*', '  ']):
                issues['suspicious_subcategory_names'].append(subcat)

    # Check for subcategories in patterns that aren't in the tree
    for subcat in subcategories_to_patterns:
        if subcat not in subcategory_to_category:
            issues['missing_subcategory_in_tree'].append(subcat)

    # Check for subcategories in tree that have no patterns
    for subcat in subcategory_to_category:
        if subcat not in subcategories_to_patterns or not subcategories_to_patterns[subcat]:
            issues['missing_patterns_for_subcategory'].append(subcat)

    # Check for subcategories mapped to 'UNKNOWN' category
    for subcat in subcategories_to_patterns:
        category = subcategory_to_category.get(subcat, 'UNKNOWN')
        if category == 'UNKNOWN':
            issues['unknown_category'].append(subcat)

    return issues

In [322]:
issues = validate_category_mappings(categories_to_subcategories_tree, subcategories_to_patterns)

for issue_type, entries in issues.items():
    if entries:
        print(f"{issue_type}:")
        for entry in entries:
            print(f"  - {entry}")

missing_subcategory_in_tree:
  - Traffic Tickets
missing_patterns_for_subcategory:
  - Trafic Tickets
unknown_category:
  - Traffic Tickets
suspicious_subcategory_names:
  - Why DOORDASH?


### 🔍 `categorize_transaction(description, categories_df) → tuple[str, str]`

Attempts to classify a single transaction description into a `(Category, Subcategory)` pair using pattern matching.

**Parameters:**
- `description` (`str`): Raw transaction description text.
- `categories_df` (`pd.DataFrame`): Classification table with columns:
  - `'Category'`: Top-level category name.
  - `'Subcategory'`: Subcategory name.
  - `'Pattern'`: Matching string or regex pattern.
  - `'IsRegex'`: Boolean flag indicating regex usage.

**Returns:**
- `tuple[str, str]`: A `(Category, Subcategory)` pair.
  - If a match is found, returns the corresponding category and subcategory.
  - If no match is found, returns `('UNCATEGORIZED', 'Uncategorized')`.

**Logic:**
1. Converts the input `description` to uppercase for case-insensitive matching.
2. Iterates over each row in `categories_df`.
3. Applies regex search if `IsRegex` is `True`; otherwise checks for substring presence.
4. Returns the first matching `(Category, Subcategory)` pair.
5. Defaults to `'UNCATEGORIZED'` if no match is found.

**Example:**
```python
categorize_transaction("Payment to ATT*BILL", categories_df)
# → ('HOUSING', 'Bills n Utilities')

In [323]:
# def categorize_transaction(description, categories_df):
#     """
#     ### 🔍 `categorize_transaction(description, categories_df) → tuple[str, str]`

#     Attempts to classify a single transaction description into a `(Category, Subcategory)` pair using pattern matching.

#     **Parameters:**
#     - `description` (`str`): Raw transaction description text.
#     - `categories_df` (`pd.DataFrame`): Classification table with columns:
#     - `'Category'`: Top-level category name.
#     - `'Subcategory'`: Subcategory name.
#     - `'Pattern'`: Matching string or regex pattern.
#     - `'IsRegex'`: Boolean flag indicating regex usage.

#     **Returns:**
#     - `tuple[str, str]`: A `(Category, Subcategory)` pair.
#     - If a match is found, returns the corresponding category and subcategory.
#     - If no match is found, returns `('UNCATEGORIZED', 'Uncategorized')`.

#     **Logic:**
#     1. Converts the input `description` to uppercase for case-insensitive matching.
#     2. Iterates over each row in `categories_df`.
#     3. Applies regex search if `IsRegex` is `True`; otherwise checks for substring presence.
#     4. Returns the first matching `(Category, Subcategory)` pair.
#     5. Defaults to `'UNCATEGORIZED'` if no match is found.

#     **Example:**
#     ```python
#     categorize_transaction("Payment to ATT*BILL", categories_df)
#     # → ('HOUSING', 'Bills n Utilities')"""
#     description_upper = description.upper()

#     for _, row in categories_df.iterrows():
#         pattern = row['Pattern']
#         if row['IsRegex']:
#             if re.search(pattern, description_upper, flags=re.IGNORECASE):
#                 return row['Category'], row['Subcategory']
#         else:
#             if pattern.upper() in description_upper:
#                 return row['Category'], row['Subcategory']

#     return 'UNCATEGORIZED', 'Uncategorized'

In [324]:
# def categorize_transactions(raw_transactions, categories_df):
#     """
#     Categorizes transactions in the DataFrame based on the provided categories DataFrame.
    
#     Parameters:
#     raw_transactions (DataFrame): DataFrame containing transactions with a 'Description' column.
#     categories_df (DataFrame): DataFrame containing categories and subcategories with patterns.
    
#     Returns:
#     DataFrame: COPY OF THE Original DataFrame with two new columns: 'Category' and 'Subcategory'.
#     """
#     categorized_expenses_df = raw_transactions.copy()
#     categorized_expenses_df[['Category', 'Subcategory']] = \
#         raw_transactions['Description'].apply(lambda x: pd.Series(categorize_transaction(x, categories_df)))
#     return categorized_expenses_df


In [325]:
# def categorize_transactions_with_annotate(raw_transactions, categories_df):
#     """
#     Categorizes transactions in the DataFrame based on the provided categories DataFrame,
#     with optional override via 'Annotation' and passthrough of 'Comments'.

#     Parameters:
#     raw_transactions (DataFrame): DataFrame containing transactions with a 'Description' column.
#                                   Optionally includes 'Annotation' and 'Comments' columns.
#     categories_df (DataFrame): DataFrame containing categories and subcategories with patterns.

#     Returns:
#     DataFrame: Copy of the original DataFrame with new columns: 'Category', 'Subcategory', 'Comments'.
#     """
#     categorized_expenses_df = raw_transactions.copy()

#     # Build subcategory → category lookup
#     subcategory_to_category = {
#         row['Subcategory']: row['Category']
#         for _, row in categories_df.iterrows()
#     }

#     def resolve_transaction(row):
#         # Check for manual override via Annotation
#         annotation = row.get('Annotation')
#         if pd.notna(annotation) and annotation in subcategory_to_category:
#             return pd.Series({
#                 'Category': subcategory_to_category[annotation],
#                 'Subcategory': annotation
#             })
#         # Fallback to pattern matching
#         return pd.Series(categorize_transaction(row['Description'], categories_df))

#     # Apply categorization logic
#     categorized_expenses_df[['Category', 'Subcategory']] = raw_transactions.apply(resolve_transaction, axis=1)

#     # Copy Comments if present
#     if 'Comments' in raw_transactions.columns:
#         categorized_expenses_df['Comments'] = raw_transactions['Comments']

#     return categorized_expenses_df

In [326]:
# Load a pair of bank record files for a particular year and month 
# NOTE - uses standard file naming convention for checking and credit card record  
def load_expenses(year_month):
    # Load the CSV files with credit card and checking card records
    creditcard_filename = 'data/' + year_month + '-A&T-CCard.xlsx'
    checking_filename = 'data/' + year_month + '-A&T-CheckingAcct.xlsx'

    df_credit = pd.read_excel(creditcard_filename)
    df_checking = pd.read_excel(checking_filename)
    print(df_credit.head())
    print(df_checking.head())

    # Check the first few rows to understand the structure of your data.
    df = pd.concat([df_checking, df_credit], ignore_index=True)
    df.head()
    return df

In [327]:
# set pandas display options to show all columns
pd.set_option("display.width", 1000)
pd.set_option("display.max_columns", None)  # Show all columns

In [328]:
# year_month = '2025-01'  # Example month
# df_monthly_expenses = load_expenses(year_month)
# df_monthly_expenses.head()

In [329]:
# categories_df = build_categories_df(categories_to_subcategories_tree, subcategories_to_patterns)
# print(categories_df)
# categorized_expenses_df = categorize_transactions(df_monthly_expenses, categories_df)
# print(categories_df)
# print(categorized_expenses_df.head())
# print(categorized_expenses_df[['Description', 'Subcategory', 'Category']].head(10))  # first 10 rows

In [330]:
def show_uncategorized_transactions(categorized_expenses_df):
    """
    Show uncategorized transactions from the categorized DataFrame.
    
    Parameters:
    categorized_expenses_df (DataFrame): DataFrame with categorized transactions.
    
    Returns:
    DataFrame: DataFrame containing uncategorized transactions.
    """
    uncategorized = categorized_expenses_df[categorized_expenses_df['Category'] == 'UNCATEGORIZED']
    print(f"{len(uncategorized)} uncategorized transactions")
    return uncategorized[['Date', 'Amount', 'Description']]

In [331]:
def correct_date_format(categorized_expenses_df):
    """
    Correct the date format in the categorized expenses DataFrame.
    
    Parameters:
    categorized_expenses_df (DataFrame): DataFrame with a 'Date' column to be corrected.
    
    Returns:
    None: The DataFrame is modified in place.
    """
    # Convert 'Date' column to datetime format and then to string in 'YYYY-MM-DD' format
    categorized_expenses_df['Date'] = pd.to_datetime(categorized_expenses_df['Date']).dt.strftime('%Y-%m-%d')
    categorized_expenses_df['Date'] = pd.to_datetime(categorized_expenses_df['Date'])

In [332]:
# Use of ExcelWriter to manage Excel formatting and saving workbook and sheet
def UseExcelWriter(categorized_expenses_df, year_month):
    """
    Use of ExcelWriter to manage Excel formatting and saving workbook and sheet.
    """
    output_path = 'categorized_transactions' + year_month + '_formatted.xlsx'
    with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
        # Write your DataFrame
        categorized_expenses_df.to_excel(writer, sheet_name='Expenses', index=False)

        # Access Excel components
        workbook = writer.book
        worksheet = writer.sheets['Expenses']

        # Apply formatting to a column (e.g. Date)
        date_format = workbook.add_format({'num_format': 'yyyy-mm-dd'})
        worksheet.set_column('A:A', 15, date_format)  # Assuming column A is 'Date'

    # ✅ File is written and closed when the 'with' block ends

In [333]:
def save_to_excel(categorized_expenses_df, year_month, long_format=False):
    """
    Save the categorized expenses DataFrame to an Excel file.
    
    Parameters:
    categorized_expenses_df (DataFrame): DataFrame with categorized expenses.
    year_month (str): Year and month string to include in the filename.
    
    Returns:
    None: The DataFrame is saved to an Excel file.
    """

    if long_format:
        # Save in long format
        output_path = 'categorized_transactions' + year_month + '_filtered.xlsx'
        categorized_expenses_df[['Date', 'Description', 'Amount', 'Subcategory', 'Category']].to_excel(output_path, index=False)
    else:
        output_path = 'categorized_transactions' + year_month + '.xlsx'
        # format the 'Date' column to 'YYYY-MM-DD' and save to Excel
        categorized_expenses_df['Date'] = pd.to_datetime(categorized_expenses_df['Date']).dt.strftime('%Y-%m-%d')
        categorized_expenses_df.to_excel(output_path, index=False)\

    print(f"✅ Saved categorized results to: {output_path}")

In [334]:
import pandas as pd
import re
from datetime import datetime

class MonthlyExpenseReport:
    def __init__(self, year_month: str):
        """
        Initialize with a raw transaction DataFrame and metadata
        """
        self.year_month = year_month  # format: 'yy-mm'
        self.categories_df =  build_categories_df(categories_to_subcategories_tree, subcategories_to_patterns)
        self.transactions_raw = load_expenses(self.year_month)
        self.transactions_categorized = None
        self.summary_table = None

    def read_bank_records(self):
        """ read bank records for a given month"""
        self.transactions_raw = load_expenses(self.year_month)  

    def preprocess(self):
        """Ensure proper datetime format and extract YearMonth label"""
        self.transactions_raw['Date'] = pd.to_datetime(self.transactions_raw['Date'])
        # self.transactions_raw['YearMonth'] = self.transactions_raw['Date'].dt.strftime('%y-%m')

    def categorize_transactions(self):
        """Apply pattern matching and build categorized DataFrame"""
        def _categorize_row(row):
            desc = row['Description']
            for _, rule in self.categories_df.iterrows():
                pat = rule['Pattern']
                if rule['IsRegex']:
                    if re.search(pat, desc, flags=re.IGNORECASE):
                        return pd.Series([rule['Subcategory'], rule['Category']])
                else:
                    if pat.upper() in desc.upper():
                        return pd.Series([rule['Subcategory'], rule['Category']])
            return pd.Series(['Uncategorized', 'UNCATEGORIZED'])

        df = self.transactions_raw.copy()
        df[['Subcategory', 'Category']] = df.apply(_categorize_row, axis=1)
        self.transactions_categorized = df

    def categorize_transactions_with_annotate(self):
        """Categorize transactions using Annotation override or pattern matching on Description."""

        # Build subcategory → category lookup
        subcategory_to_category = {
            row['Subcategory']: row['Category']
            for _, row in self.categories_df.iterrows()
        }

        def _categorize_row(row):
            # Manual override via Annotation
            annotation = row.get('Annotation')
            if pd.notna(annotation) and annotation in subcategory_to_category:
                return pd.Series({
                    'Subcategory': annotation,
                    'Category': subcategory_to_category[annotation]
                })

            # If no Annotation, use pattern matching on Description
            description = row.get('Description', '')
            # Iterate over all category definitions in dataframe
            for _, category_definition in self.categories_df.iterrows():
                # Identify whetehr to use regex or simple substring match
                pattern = category_definition['Pattern']
                if category_definition['IsRegex']:
                    if re.search(pattern, description, flags=re.IGNORECASE):
                        return pd.Series({
                            'Subcategory': category_definition['Subcategory'],
                            'Category': category_definition['Category']
                        })
                else:
                    if pattern.upper() in description.upper():
                        return pd.Series({
                            'Subcategory': category_definition['Subcategory'],
                            'Category': category_definition['Category']
                        })

            return pd.Series({'Subcategory': 'Uncategorized', 'Category': 'UNCATEGORIZED'})

        # Apply categorization
        df = self.transactions_raw.copy()
        df[['Subcategory', 'Category']] = df.apply(_categorize_row, axis=1)

        # Remove Annotation column if present
        if 'Annotation' in df.columns:
            df.drop(columns='Annotation', inplace=True)

        # Move Comments to the end if present
        if 'Comments' in df.columns:
            df = df[[col for col in df.columns if col != 'Comments'] + ['Comments']]

        self.transactions_categorized = df

    def summarize(self):
        """Creates a cleanly indexed summary table with real category + subcategory labels and correct totals."""

        # Ordered structure from category tree
        cat_sub_list = [
            (category, subcat)
            for category, subcats in categories_to_subcategories_tree.items()
            for subcat in subcats
        ]
        cat_sub_index = pd.MultiIndex.from_tuples(
            cat_sub_list, names=['Category', 'Subcategory']
        )
        print("cat_sub_index sample:", cat_sub_index.tolist()[:10])

        # Group actual expenses by category/subcategory (returns a Series!)
        grouped_series = (
            self.transactions_categorized
            .groupby(['Category', 'Subcategory'])['Amount']
            .sum()
        )

        # Convert to DataFrame explicitly and reindex
        grouped_df = grouped_series.to_frame(name='Amount') \
            .reindex(cat_sub_index, fill_value=0) \
            .reset_index()

        # Final column cleanup
        grouped_df.rename(columns={
            'Category': 'CATEGORIES',
            'Subcategory': 'SUBCATEGORIES',
            'Amount': 'AMOUNT'
        }, inplace=True)

        # ... after grouped_df is created and columns are renamed ...
        ordered_index = pd.MultiIndex.from_tuples(
            [
                (cat, subcat)
                for cat, subcats in categories_to_subcategories_tree.items()
                for subcat in subcats
            ],
            names=['CATEGORIES', 'SUBCATEGORIES']
        )
        # Enforce semantic order
        self.summary_table = (
            grouped_df
            .set_index(['CATEGORIES', 'SUBCATEGORIES'])
            .reindex(ordered_index, fill_value=0)
            .reset_index()
        )

        self.summary_table = grouped_df

    def export_summary_excel(self, output_path: str):
        """Save summary table with clean formatting and custom sheet name"""
        sheet_name = f"Summary_{self.year_month}"

        print("Summary table:\n", self.summary_table.head(10))
        print("Columns:", self.summary_table.columns.tolist())

        with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
            self.summary_table.to_excel(writer, sheet_name=sheet_name, index=False)

            workbook = writer.book
            worksheet = writer.sheets[sheet_name]

            # Format columns
            header_format = workbook.add_format({'bold': True})
            money_format = workbook.add_format({'num_format': '$#,##0.00'})

            # Set column widths and formats by header
            worksheet.set_column('A:A', 22, None)         # CATEGORIES
            worksheet.set_column('B:B', 26, None)         # SUBCATEGORIES
            worksheet.set_column('C:C', 15, money_format) # AMOUNT

            # Bold headers (optional, decorative)
            for col_num, value in enumerate(self.summary_table.columns):
                worksheet.write(0, col_num, value, header_format)

    def export_transactions_excel(self, output_path: str):
        """Export full transaction list with formatted dates"""
        df = self.transactions_categorized.copy()
        df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')
        df.to_excel(output_path, index=False, sheet_name=f'Tnxs_{self.year_month}')

    def run_full_pipeline(self, use_annotation: bool=True):
        """Convenience method"""
        self.preprocess()
        if use_annotation:
            self.categorize_transactions_with_annotate()
        else: 
            self.categorize_transactions()
        self.summarize()

In [335]:
class MultiMonthExpenseWorkbook:
    def __init__(self, reports: list[MonthlyExpenseReport]):
        self.reports = reports
        self.summary_pivot = None

    def enforce_semantic_order(self):
        ordered_pairs = [
            (cat, subcat)
            for cat, subcats in categories_to_subcategories_tree.items()
            for subcat in subcats
        ]

        # Reindex the merged summary_pivot
        self.summary_pivot = (
            self.summary_pivot
            .set_index(['CATEGORIES', 'SUBCATEGORIES'])
            .reindex(ordered_pairs, fill_value=0)
            .reset_index()
        )

        # Ensure the columns are in the correct order
        # Capture fixed columns
        fixed_cols = ['CATEGORIES', 'SUBCATEGORIES']
        # Extract and sort month columns
        month_cols = sorted(
            [col for col in self.summary_pivot.columns if col not in fixed_cols]
        )
        # Reorder DataFrame
        self.summary_pivot = self.summary_pivot[fixed_cols + month_cols]        

        
    def build_combined_summary(self):
        """Pivot all monthly summaries into a single table."""
        summary_frames = []

        for report in self.reports:
            df = report.summary_table.copy()
            df = df.rename(columns={'AMOUNT': report.year_month})
            summary_frames.append(df)

        merged = summary_frames[0][['CATEGORIES', 'SUBCATEGORIES', self.reports[0].year_month]]

        for df in summary_frames[1:]:
            merged = merged.merge(df, on=['CATEGORIES', 'SUBCATEGORIES'], how='outer')

        merged.fillna(0, inplace=True)
        self.summary_pivot = merged

    def export_all_to_excel(self, output_path: str):
        """Save summary and all transaction sheets to a single workbook."""
        self.build_combined_summary()
        self.enforce_semantic_order()

        with pd.ExcelWriter(output_path, engine='xlsxwriter') as writer:
            # Write the combined summary first
            self.summary_pivot.to_excel(writer, sheet_name='Monthly Summary', index=False)

            # Format summary
            workbook = writer.book
            summary_ws = writer.sheets['Monthly Summary']
            currency_fmt = workbook.add_format({'num_format': '$#,##0.00'})
            summary_ws.set_column('A:B', 22)  # Categories/Subcategories
            summary_ws.set_column(2, 1 + len(self.reports), 14, currency_fmt)

            # Write each month's transactions sheet
            for report in self.reports:
                sheet_name = f"Txns_{report.year_month}"
                df = report.transactions_categorized.copy()
                df = df.sort_values('Date')  # ← this ensures chronological order
                df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')
                df.to_excel(writer, sheet_name=sheet_name, index=False)

                # Optional formatting for txn sheets
                tx_ws = writer.sheets[sheet_name]
                tx_ws.set_column('A:A', 14)  # Date
                tx_ws.set_column('B:B', 40)  # Description
                tx_ws.set_column('C:D', 15)  # Amount + Type

                tx_ws.freeze_panes(1, 0)  # Keeps header row visible while scrolling
                tx_ws.autofilter(0, 0, df.shape[0], df.shape[1] - 1)  # Enables filtering on all columns

                # Dynamically adjust each column width based on content
                for i, col in enumerate(df.columns):
                    # Get max width between header and the longest string in the column
                    max_len = max(
                        df[col].astype(str).map(len).max(),
                        len(str(col))
                    ) + 2  # optional padding for readability

                    tx_ws.set_column(i, i, max_len)

In [336]:
month_label = '2025-08'

# jan_df = load_expenses(month_label)
# df_monthly_expenses.head()

report = MonthlyExpenseReport(year_month=month_label)
report.run_full_pipeline(use_annotation=False)

# Save results
report.export_transactions_excel(f'monthly_categorized_transactions_{month_label}.xlsx')
report.export_summary_excel(f'monthly_summary_{month_label}.xlsx')

annotated_report = MonthlyExpenseReport(year_month=month_label)
annotated_report.run_full_pipeline(use_annotation=True)

# Save results
annotated_report.export_transactions_excel(f'monthly_categorized_transactions_{month_label}_annotated.xlsx')
annotated_report.export_summary_excel(f'monthly_summary_{month_label}_annotated.xlsx')

        Date  Amount                             Description  Ref.# Annotation Comments
0 2025-08-29 -324.99      WHOLEFDS CTR 10199 UNIVERSITY HTOH   6089        NaN      NaN
1 2025-08-31  -44.19  DAVE'S SUPERMARKET #22 CLEVELAND HTSOH   4123        NaN      NaN
2 2025-08-30  -58.19    THE HOME DEPOT #3818 CLEVELAND HGTOH   4123        NaN      NaN
3 2025-08-30  -45.33              SP SODA SENSE GREEN BAY WI   4123        NaN      NaN
4 2025-08-30   -5.00     ideastream Public Medi CLEVELAND OH   4123        NaN      NaN
        Date   Amount                                Description  Ref.# Annotation Comments
0 2025-08-29 -7331.30  INTERNET TRF TO CCA XXXXXXXXXXXX4123 0101    NaN        NaN      NaN
1 2025-08-28   -14.00                 VSP INSURANCE COCORP PYMNT    NaN        NaN      NaN
2 2025-08-27  -177.70                   VERIZON WIRELESSPAYMENTS    NaN        NaN      NaN
3 2025-08-27 -1490.75       BILL PAY:WFHM 708 XXXXX7939 6BA12IXC    NaN        NaN      NaN
4 2025-08-26

In [337]:
# Jan_25_label = '2025-01'
# Feb_25_label = '2025-02'
# Mar_25_label = '2025-03'
# Apr_25_label = '2025-04'

# jan_report = MonthlyExpenseReport(year_month=Jan_25_label)
# feb_report = MonthlyExpenseReport(year_month=Feb_25_label)
# mar_report = MonthlyExpenseReport(year_month=Mar_25_label)
# apr_report = MonthlyExpenseReport(year_month=Apr_25_label)

# all_reports = [jan_report, feb_report, mar_report]
# book = MultiMonthExpenseWorkbook(all_reports)
# book.export_all_to_excel("q1_expenses_summary.xlsx")

In [338]:
month_labels = {
    '2025-01',
    '2025-02',
    '2025-03',
    '2025-04',
    '2025-05',
    '2025-06',
    '2025-07',
    '2025-08'
}
summary_label = '2025-01_to_08'

reports = []

for month_label in month_labels:
    report = MonthlyExpenseReport(year_month=month_label)
    report.run_full_pipeline()  # ← This processes everything
    reports.append(report)

        Date  Amount                            Description  Ref.# Annotation Comments
0 2025-05-31 -273.08     WHOLEFDS CTR 10199 UNIVERSITY HTOH   6089        NaN      NaN
1 2025-05-30  -34.54  BREMEC ON THE HEIGHTS CLEVELAND HEIOH   6089        NaN      NaN
2 2025-05-29  -31.19  SUNOCO 0756490900 QPS CLEVELAND HEIOH   6089        NaN      NaN
3 2025-05-30  -87.75     THE TAVERN COMPANY CLEVELAND HEIOH   4123        NaN      NaN
4 2025-05-30  -39.31   THE HOME DEPOT #3818 CLEVELAND HGTOH   4123        NaN      NaN
        Date   Amount                                Description  Ref.# Annotation Comments
0 2025-05-30 -6489.40  INTERNET TRF TO CCA XXXXXXXXXXXX4123 0101    NaN        NaN      NaN
1 2025-05-30 -2450.00                                CHECK # 752  752.0        NaN      NaN
2 2025-05-28   -61.98                                CWD WEB PAY    NaN        NaN      NaN
3 2025-05-28   -14.00                 VSP INSURANCE COCORP PYMNT    NaN        NaN      NaN
4 2025-05-28  -206

In [339]:
book = MultiMonthExpenseWorkbook(reports)
book.export_all_to_excel(f'multimonth_expense_summary_{summary_label}.xlsx')