## Step 1: Get all markdown files recursively
This cell finds every .md file in your Obsidian Vault, including subfolders.

In [None]:
import os
from pathlib import Path

# Define the root folder
vault_path = r"C:\Users\BalasubramanianPG\Videos\Obsidian Vault"

# Get all markdown files recursively
markdown_files = list(Path(vault_path).rglob("*.md"))

print(f"Found {len(markdown_files)} markdown files")
print("\nFirst 5 files:")
for file in markdown_files[:5]:
    print(file)

Found 95 markdown files

First 5 files:
C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Clippings\An Introduction to Market Access in the Pharmaceutical Industry.md
C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\Terminologies to Master First.md
C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\The Pharma Overview.md
C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Training Videos\June 10th\Trust Based Discount.md
C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Training Videos\SQL Training\Overall.md
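
One thing to watch: `rglob` also descends into dot-prefixed folders, and a vault may contain some (for example `.obsidian` or `.trash`). Here is a minimal sketch that skips them. The helper name `find_notes` is an assumption for illustration, not part of the cells above.

```python
from pathlib import Path

def find_notes(root):
    """Return .md files under root, skipping any file inside a dot-prefixed folder."""
    root = Path(root)
    notes = []
    for path in root.rglob("*.md"):
        # Folder parts relative to the root, excluding the filename itself
        folders = path.relative_to(root).parts[:-1]
        if any(part.startswith(".") for part in folders):
            continue  # e.g. a note sitting in .trash
        notes.append(path)
    return sorted(notes)
```

If your vault has no hidden folders containing markdown, this returns the same files as the plain `rglob` call.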


## Step 2: Map the folder structure
This cell gives you a breakdown of folders, subfolders, and file counts.

In [None]:
from collections import defaultdict

# Analyze folder structure
folder_stats = defaultdict(int)

for file in markdown_files:
    folder = file.parent
    folder_stats[folder] += 1

print(f"Total folders with markdown files: {len(folder_stats)}\n")
print("Files per folder:")
for folder, count in sorted(folder_stats.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"{count:3d} files | {folder}")

Total folders with markdown files: 32

Files per folder:
  8 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\Cheat Sheets\2. Clinical Trials
  7 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\The Training Document\2.7 Input Files
  7 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\Cheat Sheets\3. Supply Chain & Manufacturing
  7 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\Cheat Sheets\4. Market Access
  5 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\The Training Document\2.3 Drug Distribution Area
  5 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\The Training Document\3.0 Appendix
  4 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\KPI Family
  4 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\Phases
  4 files | C:\Users\BalasubramanianPG\Videos\Obsidian Vault\Pharma Domain\The Training Document

## Step 3: Extract existing YAML frontmatter (if any)
Before we start writing properties, let's see what's already there.

In [None]:
import re

def extract_yaml_frontmatter(file_path):
    """Extract YAML frontmatter from a markdown file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Match YAML frontmatter (--- at start and end)
    match = re.match(r'^---\s*\n(.*?)\n---\s*\n', content, re.DOTALL)
    
    if match:
        return match.group(1)  # Return YAML content
    return None

# Test on first file
sample_file = markdown_files[0]
yaml_content = extract_yaml_frontmatter(sample_file)

print(f"Testing file: {sample_file.name}\n")
if yaml_content:
    print("Existing YAML frontmatter:")
    print(yaml_content)
else:
    print("No YAML frontmatter found")

Testing file: An Introduction to Market Access in the Pharmaceutical Industry.md

Existing YAML frontmatter:
title: An Introduction to Market Access in the Pharmaceutical Industry
source: https://www.rxcomms.com/learning/guide-to-market-access-in-the-pharmaceutical-industry
author: RXComms
published: 01-01-2024
created: 2025-11-24
description: Understand the role of Market Access in the Pharma sector. Explore its challenges, global perspective, and relevance to drug development and Biotech.
tags:
  - clippings


Why this matters:

- Some files might already have YAML properties
- We don't want to overwrite them blindly
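
To see which properties a file already has, you could list the top-level keys of its frontmatter. This is a rough line-based sketch, not a real YAML parser; `existing_keys` and `FRONTMATTER_RE` are names made up for this example, and the pattern additionally tolerates a byte-order mark and Windows line endings.

```python
import re

# Frontmatter must start on the very first line and end with a line of '---'
FRONTMATTER_RE = re.compile(r"\A---[ \t]*\r?\n(.*?)\r?\n---[ \t]*(?:\r?\n|\Z)", re.DOTALL)

def existing_keys(text):
    """Return top-level frontmatter keys, or [] if there is no frontmatter."""
    text = text.lstrip("\ufeff")  # tolerate a UTF-8 byte-order mark
    match = FRONTMATTER_RE.match(text)
    if not match:
        return []
    keys = []
    for line in match.group(1).splitlines():
        # Top-level keys start in column 0; indented lines are list items or nested values
        m = re.match(r"([A-Za-z_][\w-]*):", line)
        if m:
            keys.append(m.group(1))
    return keys

sample = "---\ntitle: Demo\ntags:\n  - clippings\n---\nBody text\n"
print(existing_keys(sample))  # ['title', 'tags']
```

Knowing the existing keys lets you decide per file whether `created`, `area`, or `tags` is actually missing before touching anything.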

## Step 4: Get file metadata (Created Date)
This extracts the file creation date from the OS.

In [None]:
from datetime import datetime

def get_file_metadata(file_path):
    """Get file creation and modification dates."""
    stat = file_path.stat()
    
    created = datetime.fromtimestamp(stat.st_ctime)
    modified = datetime.fromtimestamp(stat.st_mtime)
    
    return {
        'created': created.strftime('%Y-%m-%d'),
        'modified': modified.strftime('%Y-%m-%d')
    }

# Test on first file
metadata = get_file_metadata(markdown_files[0])
print(f"File: {markdown_files[0].name}")
print(f"Created: {metadata['created']}")
print(f"Modified: {metadata['modified']}")

File: An Introduction to Market Access in the Pharmaceutical Industry.md
Created: 2025-11-24
Modified: 2025-11-24


Why this approach:

- st_ctime = creation time (Windows) or last metadata change (Unix)
- st_mtime = last modification time
- Format as YYYY-MM-DD (Obsidian-friendly)
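
Because `st_ctime` means different things per platform, a more portable variant can prefer `st_birthtime` where the OS exposes it (macOS/BSD, and Windows from Python 3.12) and fall back otherwise. This is a sketch under those assumptions; `creation_date` is a hypothetical helper, and on Linux the fallback is only an approximation.

```python
import os
from datetime import datetime
from pathlib import Path

def creation_date(file_path):
    """Best-effort creation date as YYYY-MM-DD, plus the stat field it came from."""
    stat = Path(file_path).stat()
    # st_birthtime is missing on platforms that don't report a birth time
    birth = getattr(stat, "st_birthtime", None)
    if birth is not None:
        return datetime.fromtimestamp(birth).strftime("%Y-%m-%d"), "st_birthtime"
    if os.name == "nt":
        # Older Python on Windows: st_ctime holds the creation time
        return datetime.fromtimestamp(stat.st_ctime).strftime("%Y-%m-%d"), "st_ctime"
    # Elsewhere, the last modification time is the closest available value
    return datetime.fromtimestamp(stat.st_mtime).strftime("%Y-%m-%d"), "st_mtime"
```

Returning the source field alongside the date makes it easy to spot files where the date is only an approximation.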

## Step 5: Draft YAML property insertion function
This is the core function that will add YAML frontmatter to files.

In [None]:
def add_yaml_properties(file_path, properties, overwrite=False):
    """
    Add YAML frontmatter to a markdown file.
    
    Args:
        file_path: Path to the markdown file
        properties: Dict of properties to add (e.g., {'created': '2025-01-01', 'area': 'Projects'})
        overwrite: If True, replace existing YAML. If False, skip files with YAML.
    """
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Check if YAML already exists
    has_yaml = content.startswith('---\n')
    
    if has_yaml and not overwrite:
        print(f"Skipped (has YAML): {file_path.name}")
        return False
    
    # Build YAML frontmatter
    yaml_lines = ['---']
    for key, value in properties.items():
        yaml_lines.append(f'{key}: {value}')
    yaml_lines.append('---\n')
    
    yaml_block = '\n'.join(yaml_lines)
    
    # If overwriting, remove old YAML
    if has_yaml:
        content = re.sub(r'^---\s*\n.*?\n---\s*\n', '', content, flags=re.DOTALL)
    
    # Add new YAML at the top
    new_content = yaml_block + content
    
    # Write back to file
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(new_content)
    
    print(f"Updated: {file_path.name}")
    return True

# Test on a COPY of a file (don't modify the original yet)
# We'll do a dry-run test in Step 8

Why this design:

- overwrite=False by default = safe mode
- Strips existing YAML before adding new one (if overwriting)
- Returns True/False so you can track success
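
One caveat with the `f'{key}: {value}'` line: a list value is written as its Python repr, and values containing characters like `&` or `:` can confuse a YAML parser. Below is a minimal sketch of a safer serializer. `yaml_scalar` and `build_yaml_block` are hypothetical names, and the "simple value" pattern is an assumption: bare values such as `true` or `123` would still be read as non-strings.

```python
import json
import re

# Assumption: values made only of these characters are safe to leave unquoted
SIMPLE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9 _./-]*[A-Za-z0-9])?")

def yaml_scalar(value):
    """Leave simple values bare; otherwise emit a double-quoted string."""
    text = str(value)
    if SIMPLE.fullmatch(text):
        return text
    # A JSON string literal is also a valid YAML double-quoted scalar
    return json.dumps(text, ensure_ascii=False)

def build_yaml_block(properties):
    """Render a dict of scalar or list values as a frontmatter block."""
    lines = ["---"]
    for key, value in properties.items():
        if isinstance(value, (list, tuple)):
            if not value:
                lines.append(f"{key}: []")
                continue
            lines.append(f"{key}:")
            lines.extend(f"  - {yaml_scalar(item)}" for item in value)
        else:
            lines.append(f"{key}: {yaml_scalar(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"

print(build_yaml_block({"created": "2025-11-24", "area": "Pharma Domain", "tags": ["r&d", "kpis"]}))
```

Here `r&d` comes out quoted while the date and the plain words stay bare, and tags are written one per line as a block list.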


Next steps (before we run this):
Before we execute the YAML insertion, let's:

1. Brainstorm the property values (like you mentioned):
   - created: auto-populated from file metadata
   - area: should this be based on the folder name, or on manual tagging?
   - tags: do you want default tags, or should we parse existing #tags in the content?
2. Test on a single file (or a copy) to make sure it works
3. Decide on a batch processing strategy:
   - Process all files at once?
   - Only files without YAML?
   - Only files in specific folders?
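
If you go with parsing existing #tags from the note body, a rough sketch could look like this. It assumes a tag is a `#` at the start of a line or after whitespace, followed by letters, digits, `_`, `-` or `/`, and that purely numeric matches are not tags. `extract_inline_tags` is a hypothetical helper, and it does not skip code blocks, so something like `#include` in a snippet would be picked up too.

```python
import re

# '#' must not be preceded by a non-space character (so URL fragments are ignored)
INLINE_TAG_RE = re.compile(r"(?<!\S)#([\w/-]+)")

def extract_inline_tags(text):
    """Return unique inline tags in order of first appearance, lowercased."""
    tags = []
    for tag in INLINE_TAG_RE.findall(text):
        tag = tag.lower()
        # Skip numbers like '#42' and duplicates
        if not tag.isdigit() and tag not in tags:
            tags.append(tag)
    return tags

note = "# Heading\nKPIs for #Pharma and #market-access, see issue #42 and #pharma again."
print(extract_inline_tags(note))  # ['pharma', 'market-access']
```

Markdown headings are not matched because the `#` there is followed by a space.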

## Step 6: Define Area from folder path
We'll use the top-level folder under the vault root as the area, not the immediate parent folder. This gives meaningful context without being too granular.

In [None]:
def extract_area_from_path(file_path, vault_root):
    """
    Extract area from folder structure.
    Uses the first subfolder under vault root as the area.
    """
    # Get relative path from vault root
    relative_path = file_path.relative_to(vault_root)
    
    # Get all parent folders
    parts = relative_path.parts[:-1]  # Exclude the filename itself
    
    if len(parts) == 0:
        return "Root"  # File is directly in vault root
    # Top-level or nested: use the top-level folder as the area
    return parts[0]

# Test on sample files
vault_root = Path(r"C:\Users\BalasubramanianPG\Videos\Obsidian Vault")

print("Area extraction examples:\n")
for file in markdown_files[:10]:
    area = extract_area_from_path(file, vault_root)
    print(f"{area:25s} | {file.name}")

Area extraction examples:

Clippings                 | An Introduction to Market Access in the Pharmaceutical Industry.md
Pharma Domain             | Terminologies to Master First.md
Pharma Domain             | The Pharma Overview.md
Training Videos           | Trust Based Discount.md
Training Videos           | Overall.md
Pharma Domain             | Commercial (Sales & Marketing KPIs).md
Pharma Domain             | Manufacturing & Supply Chain KPIs.md
Pharma Domain             | Quality & Regulatory Affairs KPIs.md
Pharma Domain             | R&D KPIs.md
Pharma Domain             | Phase 1 - Foundations.md


Why this works:

- Files under Pharma Domain\Cheat Sheets\... → Area = Pharma Domain
- Files under Pharma Domain\The Training Document\... → Area = Pharma Domain
- Keeps it simple and consistent
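
If one level ever turns out to be too coarse, the same idea extends to a configurable depth. This is only a sketch: `area_from_path`, the `depth` parameter, and the `C:\Vault` paths are made up for illustration, and `PureWindowsPath` is used so the example runs on any OS without touching the disk.

```python
from pathlib import PureWindowsPath

def area_from_path(file_path, vault_root, depth=1):
    """Join the first `depth` folders under vault_root; 'Root' for files in the root itself."""
    try:
        relative = PureWindowsPath(file_path).relative_to(vault_root)
    except ValueError:
        return None  # file is not inside vault_root
    folders = relative.parts[:-1]  # drop the filename
    if not folders:
        return "Root"
    return "/".join(folders[:depth])

root = r"C:\Vault"  # hypothetical root, for illustration only
note = r"C:\Vault\Pharma Domain\Cheat Sheets\4. Market Access\Pricing.md"
print(area_from_path(note, root))                  # Pharma Domain
print(area_from_path(note, root, depth=2))         # Pharma Domain/Cheat Sheets
print(area_from_path(r"D:\Elsewhere\x.md", root))  # None
```

Returning `None` for files outside the root avoids the `ValueError` that `relative_to` raises when the vault root does not match the files being processed.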

## Step 7: Define Tags from filename
We'll extract tags from the filename using a few strategies:

In [None]:
def extract_tags_from_filename(file_path):
    """
    Extract tags from filename.
    
    Strategies:
    1. Split on common separators (-, _, spaces)
    2. Remove common words (the, a, an, etc.)
    3. Convert to lowercase
    4. Return as list
    """
    # Get filename without extension
    filename = file_path.stem
    
    # Common words to exclude from tags
    stopwords = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with'}
    
    # Split on separators and clean
    tokens = re.split(r'[-_\s]+', filename.lower())
    
    # Remove stopwords and numbers-only tokens
    tags = [
        token for token in tokens 
        if token and token not in stopwords and not token.isdigit()
    ]
    
    # Keep at most 5 tags (avoid tag explosion)
    return tags[:5]

# Test on sample files
print("Tag extraction examples:\n")
for file in markdown_files[:10]:
    tags = extract_tags_from_filename(file)
    print(f"{file.name:50s} → {tags}")

Tag extraction examples:

An Introduction to Market Access in the Pharmaceutical Industry.md → ['introduction', 'market', 'access', 'pharmaceutical', 'industry']
Terminologies to Master First.md                   → ['terminologies', 'master', 'first']
The Pharma Overview.md                             → ['pharma', 'overview']
Trust Based Discount.md                            → ['trust', 'based', 'discount']
Overall.md                                         → ['overall']
Commercial (Sales & Marketing KPIs).md             → ['commercial', '(sales', '&', 'marketing', 'kpis)']
Manufacturing & Supply Chain KPIs.md               → ['manufacturing', '&', 'supply', 'chain', 'kpis']
Quality & Regulatory Affairs KPIs.md               → ['quality', '&', 'regulatory', 'affairs', 'kpis']
R&D KPIs.md                                        → ['r&d', 'kpis']
Phase 1 - Foundations.md                           → ['phase', 'foundations']


Why this approach:

- Converts Clinical-Trial-Phases.md → ['clinical', 'trial', 'phases']
- Removes noise words like "the", "and"
- Limits tags to avoid clutter
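
The output above also shows tokens like '(sales', '&' and 'kpis)' slipping through. A small variant that trims surrounding punctuation could look like this. `clean_tags` is a hypothetical name; inner punctuation such as the `&` in `r&d` is left alone, and Obsidian may not accept that as a tag.

```python
import re

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with'}

def clean_tags(stem, limit=5):
    """Split a filename stem into tags, trimming punctuation around each token."""
    tags = []
    for token in re.split(r"[-_\s]+", stem.lower()):
        token = re.sub(r"^\W+|\W+$", "", token)  # '(sales' -> 'sales', '&' -> ''
        if token and token not in STOPWORDS and not token.isdigit() and token not in tags:
            tags.append(token)
    return tags[:limit]

print(clean_tags("Commercial (Sales & Marketing KPIs)"))  # ['commercial', 'sales', 'marketing', 'kpis']
print(clean_tags("R&D KPIs"))                             # ['r&d', 'kpis']
```

Dropping the lone `&` matters beyond tidiness: written as a bare YAML list item, `&` is read as an anchor marker rather than text.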

## Step 8: Combine everything into a processing function
Now let's put it all together:

In [None]:
def process_markdown_file(file_path, vault_root, overwrite=False, dry_run=True):
    """
    Add YAML properties to a markdown file.
    
    Args:
        file_path: Path to markdown file
        vault_root: Root of Obsidian vault
        overwrite: Replace existing YAML if True
        dry_run: If True, only print what would be done (don't modify files)
    """
    # Extract metadata
    metadata = get_file_metadata(file_path)
    area = extract_area_from_path(file_path, vault_root)
    tags = extract_tags_from_filename(file_path)
    
    # Build properties dict
    properties = {
        'created': metadata['created'],
        'area': area,
        'tags': tags  # Obsidian accepts list format
    }
    
    # Show what we'd do
    print(f"\n{'='*60}")
    print(f"File: {file_path.name}")
    print(f"Path: {file_path.relative_to(vault_root)}")
    print(f"\nProposed YAML:")
    print("---")
    print(f"created: {properties['created']}")
    print(f"area: {properties['area']}")
    print(f"tags: {properties['tags']}")
    print("---")
    
    if dry_run:
        print("\n[DRY RUN] No changes made")
        return False
    else:
        return add_yaml_properties(file_path, properties, overwrite)

# Test on first 3 files (dry run)
vault_root = Path(r"C:\Users\BalasubramanianPG\Videos\Obsidian Vault")

print("DRY RUN: Processing first 3 files...\n")
for file in markdown_files[:3]:
    # dry_run=True: print the proposed YAML without writing to any file
    process_markdown_file(file, vault_root, dry_run=True)

DRY RUN: Processing first 3 files...


File: An Introduction to Market Access in the Pharmaceutical Industry.md
Path: Clippings\An Introduction to Market Access in the Pharmaceutical Industry.md

Proposed YAML:
---
created: 2025-11-24
area: Clippings
tags: ['introduction', 'market', 'access', 'pharmaceutical', 'industry']
---

[DRY RUN] No changes made

File: Terminologies to Master First.md
Path: Pharma Domain\Terminologies to Master First.md

Proposed YAML:
---
created: 2025-11-24
area: Pharma Domain
tags: ['terminologies', 'master', 'first']
---

[DRY RUN] No changes made

File: The Pharma Overview.md
Path: Pharma Domain\The Pharma Overview.md

Proposed YAML:
---
created: 2025-11-24
area: Pharma Domain
tags: ['pharma', 'overview']
---

[DRY RUN] No changes made


## Step 9: Batch process all files
Once you're happy with the dry-run output, run this:

In [None]:
def batch_process_vault(markdown_files, vault_root, overwrite=False, dry_run=True):
    """
    Process all markdown files in the vault.
    """
    results = {
        'updated': 0,
        'skipped': 0,
        'errors': 0
    }
    
    for file in markdown_files:
        try:
            success = process_markdown_file(file, vault_root, overwrite, dry_run)
            if success:
                results['updated'] += 1
            else:
                results['skipped'] += 1
        except Exception as e:
            print(f"\nERROR processing {file.name}: {e}")
            results['errors'] += 1
    
    print("\n" + "="*60)
    print("BATCH PROCESSING COMPLETE")
    print(f"Updated: {results['updated']}")
    print(f"Skipped: {results['skipped']}")
    print(f"Errors: {results['errors']}")
    
    return results

# When you're ready to run for real:
# batch_process_vault(markdown_files, vault_root, overwrite=False, dry_run=False)

Before you run this for real:
Review the dry-run output above and check:

- Does the area look correct? (It should match each file's top-level folder, e.g. Pharma Domain, Clippings, or Training Videos)
- Do the tags make sense from the filenames?
- Are the created dates accurate?

If everything looks good, run the batch with dry_run=False.
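
Since the real run rewrites files in place, it may be worth copying the vault first. A minimal sketch, where `backup_vault` and the timestamped folder naming are assumptions made for this example:

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_vault(vault_root, backup_parent):
    """Copy the whole vault to a timestamped folder and return the new path."""
    vault_root = Path(vault_root)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(backup_parent) / f"{vault_root.name}-backup-{stamp}"
    shutil.copytree(vault_root, target)  # raises if target already exists
    return target
```

Run the YAML insertion on the original vault, not on the copy: file creation times on the copy will generally reflect when the copy was made, so the backup is for restoring content rather than for re-deriving created dates.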

In [None]:
import re
from pathlib import Path
from datetime import datetime

# ========================================
# CONFIGURATION
# ========================================
vault_path = r"C:\Users\BalasubramanianPG\Videos\Obsidian Vault"
vault_root = Path(vault_path)

# ========================================
# STEP 1: GET ALL MARKDOWN FILES (RECURSIVE)
# ========================================
def get_all_markdown_files(root_path):
    """Recursively get all .md files in all subfolders."""
    return list(Path(root_path).rglob("*.md"))

# ========================================
# STEP 2: EXTRACT METADATA
# ========================================
def get_file_metadata(file_path):
    """Get file creation date."""
    stat = file_path.stat()
    created = datetime.fromtimestamp(stat.st_ctime)
    return created.strftime('%Y-%m-%d')

def extract_area_from_path(file_path, vault_root):
    """Extract area from the top-level folder."""
    relative_path = file_path.relative_to(vault_root)
    parts = relative_path.parts[:-1]  # Exclude filename
    
    if len(parts) == 0:
        return "Root"
    else:
        return parts[0]  # Top-level folder = area

def extract_tags_from_filename(file_path):
    """Extract tags from filename."""
    filename = file_path.stem
    stopwords = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with'}
    
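    # Split on separators
    tokens = re.split(r'[-_\s]+', filename.lower())
    
    # Trim surrounding punctuation so tokens like '(sales' or a lone '&'
    # don't end up as tags (a bare '&' list item is not valid YAML)
    tokens = [re.sub(r'^\W+|\W+$', '', token) for token in tokens]
    
    # Filter stopwords and numbers
    tags = [
        token for token in tokens 
        if token and token not in stopwords and not token.isdigit()
    ]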
    
    return tags[:5]  # Limit to 5 tags

# ========================================
# STEP 3: ADD YAML FRONTMATTER
# ========================================
def add_yaml_properties(file_path, properties, overwrite=False):
    """
    Add YAML frontmatter to markdown file.
    
    Args:
        file_path: Path to the markdown file
        properties: Dict of properties {'created': '2025-01-01', 'area': 'X', 'tags': ['a', 'b']}
        overwrite: If True, replace existing YAML
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
    except Exception as e:
        print(f"ERROR reading {file_path.name}: {e}")
        return False
    
    # Check if YAML already exists
    has_yaml = content.startswith('---\n')
    
    if has_yaml and not overwrite:
        return False  # Skip files with existing YAML
    
    # Build YAML frontmatter
    yaml_lines = ['---']
    yaml_lines.append(f"created: {properties['created']}")
    yaml_lines.append(f"area: {properties['area']}")
    
    # Format tags as YAML list
    if properties['tags']:
        yaml_lines.append('tags:')
        for tag in properties['tags']:
            yaml_lines.append(f"  - {tag}")
    else:
        yaml_lines.append('tags: []')
    
    yaml_lines.append('---\n')
    yaml_block = '\n'.join(yaml_lines)
    
    # If overwriting, remove old YAML
    if has_yaml:
        content = re.sub(r'^---\s*\n.*?\n---\s*\n', '', content, flags=re.DOTALL)
    
    # Add new YAML at top
    new_content = yaml_block + content
    
    # Write back
    try:
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(new_content)
        return True
    except Exception as e:
        print(f"ERROR writing {file_path.name}: {e}")
        return False

# ========================================
# STEP 4: PROCESS SINGLE FILE
# ========================================
def process_markdown_file(file_path, vault_root, overwrite=False, dry_run=True):
    """Process a single markdown file."""
    # Extract metadata
    created_date = get_file_metadata(file_path)
    area = extract_area_from_path(file_path, vault_root)
    tags = extract_tags_from_filename(file_path)
    
    properties = {
        'created': created_date,
        'area': area,
        'tags': tags
    }
    
    if dry_run:
        # Just show what would be done
        relative_path = file_path.relative_to(vault_root)
        print(f"\n{'='*70}")
        print(f"File: {file_path.name}")
        print(f"Path: {relative_path}")
        print(f"\nWould add YAML:")
        print("---")
        print(f"created: {properties['created']}")
        print(f"area: {properties['area']}")
        print(f"tags:")
        for tag in properties['tags']:
            print(f"  - {tag}")
        print("---")
        return False
    else:
        return add_yaml_properties(file_path, properties, overwrite)

# ========================================
# STEP 5: BATCH PROCESS ALL FILES
# ========================================
def batch_process_vault(vault_root, overwrite=False, dry_run=True):
    """Process all markdown files in vault."""
    
    # Get all markdown files
    markdown_files = get_all_markdown_files(vault_root)
    
    print(f"Found {len(markdown_files)} markdown files")
    print(f"Overwrite existing YAML: {overwrite}")
    print(f"Dry run mode: {dry_run}")
    print("\n" + "="*70)
    
    results = {
        'total': len(markdown_files),
        'updated': 0,
        'skipped': 0,
        'errors': 0
    }
    
    for i, file in enumerate(markdown_files, 1):
        try:
            success = process_markdown_file(file, vault_root, overwrite, dry_run)
            
            if not dry_run:
                if success:
                    results['updated'] += 1
                    print(f"[{i}/{results['total']}] ✓ Updated: {file.name}")
                else:
                    results['skipped'] += 1
                    print(f"[{i}/{results['total']}] ⊘ Skipped: {file.name}")
        
        except Exception as e:
            results['errors'] += 1
            print(f"[{i}/{results['total']}] ✗ ERROR: {file.name} - {e}")
    
    # Summary
    print("\n" + "="*70)
    print("BATCH PROCESSING COMPLETE")
    print(f"Total files: {results['total']}")
    print(f"Updated: {results['updated']}")
    print(f"Skipped: {results['skipped']}")
    print(f"Errors: {results['errors']}")
    
    return results

# ========================================
# EXECUTION
# ========================================

# DRY RUN FIRST (shows first 5 files as examples)
print("="*70)
print("DRY RUN: Showing first 5 files")
print("="*70)

markdown_files = get_all_markdown_files(vault_root)
for file in markdown_files[:5]:
    process_markdown_file(file, vault_root, overwrite=False, dry_run=True)

print("\n" + "="*70)
print("NEXT STEPS:")
print("="*70)
print("1. Review the YAML output above")
print("2. If it looks good, run the FULL dry run:")
print("   batch_process_vault(vault_root, overwrite=False, dry_run=True)")
print("\n3. If everything looks correct, run FOR REAL:")
print("   batch_process_vault(vault_root, overwrite=False, dry_run=False)")
print("="*70)