# ENEX Performance Analysis & Element Discovery

Comprehensive analysis of our dynamic ENEX parsing implementation:

1. **Performance Testing**: Time how long it takes to load all ENEX files
2. **Element Discovery**: Find all unique element names across the entire corpus
3. **Data Structure Analysis**: Understand the complete ENEX schema

The results show real-world loading performance and reveal which data fields are present in an Evernote export.

In [1]:
# Import Required Libraries
import sys
import time
from pathlib import Path
from collections import Counter, defaultdict

# Add src to path for importing enote
src_path = Path.cwd().parent / "src"
sys.path.insert(0, str(src_path))

import enote

print("✅ Libraries imported successfully")
print(f"📂 ENEX path: {enote.DEFAULT_ENEX_PATH}")

✅ Libraries imported successfully
📂 ENEX path: ~/tmp/evernote_backup


In [2]:
# Initialize Corpus and Setup Timing
print("🔧 Initializing Corpus...")

# Create corpus instance
corpus = enote.Corpus()
print(f"📍 Corpus path: {corpus.enex_path}")

# Check what ENEX files are available
enex_files = list(corpus.enex_path.glob("*.enex"))
print(f"📁 Found {len(enex_files)} ENEX files:")

for enex_file in enex_files:
    size_mb = enex_file.stat().st_size / (1024 * 1024)
    print(f"  - {enex_file.name}: {size_mb:.1f} MB")

print(f"\n⏱️  Ready to measure performance...")

🔧 Initializing Corpus...
📍 Corpus path: /Users/johnsteill/tmp/evernote_backup
📁 Found 8 ENEX files:
  - Music.enex: 0.1 MB
  - Dad.enex: 64.8 MB
  - JohnsRecipes.enex: 217.2 MB
  - Tech.enex: 159.9 MB
  - LabView Project.enex: 56.8 MB
  - SBT.enex: 22.6 MB
  - Current.enex: 429.6 MB
  - Home.enex: 123.8 MB

⏱️  Ready to measure performance...
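
`Path.glob` yields files in an arbitrary, filesystem-dependent order, which is why the listing above is not sorted. A minimal sketch of a deterministic, largest-first listing follows. The helper name `list_enex_sizes` is hypothetical and not part of `enote`.

```python
from pathlib import Path


def list_enex_sizes(folder: Path) -> list[tuple[str, float]]:
    """Return (file name, size in MB) pairs for *.enex files, largest first.

    Hypothetical helper for illustration; not part of the enote package.
    """
    sizes = [
        (path.name, path.stat().st_size / (1024 * 1024))
        for path in folder.glob("*.enex")
    ]
    # Sort by size descending, then by name so ties are deterministic.
    return sorted(sizes, key=lambda item: (-item[1], item[0]))
```

In this notebook it could be called as `list_enex_sizes(corpus.enex_path)`, which puts the largest exports at the top of the listing.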


In [3]:
# Time ENEX File Loading
print("🚀 Starting full ENEX file loading...")
print("=" * 50)

# Record start time (perf_counter is a monotonic clock meant for timing)
start_time = time.perf_counter()

# Load ALL notes (no max_notes limit)
corpus.load()

# Record end time
end_time = time.perf_counter()
elapsed_time = end_time - start_time

# Display results
notes_loaded = len(corpus.notes)
notes_per_second = notes_loaded / elapsed_time if elapsed_time > 0 else 0

print(f"\n📊 PERFORMANCE RESULTS:")
print(f"⏱️  Total time: {elapsed_time:.2f} seconds")
print(f"📝 Notes loaded: {notes_loaded:,}")
print(f"🚀 Speed: {notes_per_second:.1f} notes/second")

if elapsed_time > 60:
    minutes = elapsed_time / 60
    print(f"⌚ Time: {minutes:.1f} minutes")

print("=" * 50)

🚀 Starting full ENEX file loading...

📊 PERFORMANCE RESULTS:
⏱️  Total time: 3.72 seconds
📝 Notes loaded: 2,057
🚀 Speed: 553.1 notes/second
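
The start/stop timing pattern used in this cell can be wrapped in a small context manager so that every load is measured the same way. This is a sketch that assumes only the standard library. `Timing` and `timed` are hypothetical names, not part of `enote`, and the list comprehension is a stand-in for `corpus.load()`.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Timing:
    """Holds the elapsed wall-clock time of a timed block, in seconds."""
    seconds: float = 0.0

    def rate(self, count: int) -> float:
        """Items per second, or 0.0 if no measurable time elapsed."""
        return count / self.seconds if self.seconds > 0 else 0.0


@contextmanager
def timed():
    """Time the enclosed block with a monotonic, high-resolution clock."""
    timing = Timing()
    start = time.perf_counter()
    try:
        yield timing
    finally:
        # Recorded even if the block raises, so partial runs are still timed.
        timing.seconds = time.perf_counter() - start


# Stand-in workload; in the notebook this would be corpus.load().
with timed() as t:
    items = [n * n for n in range(10_000)]

print(f"{t.seconds:.4f} s, {t.rate(len(items)):.0f} items/second")
```

Keeping the zero-time guard inside `rate` means callers do not have to repeat the `if elapsed_time > 0` check.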


In [4]:
# Extract All Element Names from Notes
print("🔍 Discovering all element names across corpus...")

# Collect all unique element names
all_element_names = set()
element_counts = Counter()
element_examples = defaultdict(list)

# Analyze each note
for note_id, note_data in corpus.notes.items():
    for element_name, value in note_data.items():
        # Track unique element names
        all_element_names.add(element_name)
        
        # Count occurrences
        element_counts[element_name] += 1
        
        # Store examples (first 3 for each element)
        if len(element_examples[element_name]) < 3:
            if isinstance(value, list):
                example = f"list({len(value)}) - {value[:2] if value else '[]'}"
            else:
                example = str(value)[:50] + "..." if len(str(value)) > 50 else str(value)
            element_examples[element_name].append(example)

print(f"✅ Analysis complete!")
print(f"🏷️  Found {len(all_element_names)} unique element names")
print(f"📝 Across {len(corpus.notes):,} notes")

🔍 Discovering all element names across corpus...
✅ Analysis complete!
🏷️  Found 7 unique element names
📝 Across 2,057 notes
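
For readers without the `enote` package, the same per-note element count can be sketched with the standard library alone. This is not how `enote` is implemented, since its source is not shown here. The sketch assumes the usual ENEX layout of `<note>` elements under an `<en-export>` root, and the two-note sample document is invented for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Invented two-note sample in the usual ENEX layout (illustration only).
SAMPLE_ENEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<en-export>
  <note>
    <title>First note</title>
    <created>20200101T000000Z</created>
    <tag>alpha</tag>
    <tag>beta</tag>
  </note>
  <note>
    <title>Second note</title>
    <created>20200102T000000Z</created>
  </note>
</en-export>
"""


def count_note_elements(stream) -> Counter:
    """Count, for each child element name, how many notes contain it."""
    counts = Counter()
    # iterparse yields each element as soon as its closing tag is read.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "note":
            # A set counts each name once per note, even if it repeats.
            counts.update({child.tag for child in elem})
            # Drop the finished note's children to keep memory use low.
            elem.clear()
    return counts


sample_counts = count_note_elements(io.BytesIO(SAMPLE_ENEX))
print(sample_counts.most_common())
```

As in the cell above, an element that repeats within a note (here `tag`) is counted once for that note, so each count is the number of notes containing the element.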


In [5]:
# Analyze Performance Results
print("📈 DETAILED PERFORMANCE ANALYSIS")
print("=" * 60)

# File-level statistics
total_files = len(enex_files)
total_size_mb = sum(f.stat().st_size for f in enex_files) / (1024 * 1024)

print(f"📁 File Statistics:")
print(f"   Files processed: {total_files}")
print(f"   Total size: {total_size_mb:.1f} MB")
print(f"   Average file size: {total_size_mb/total_files:.1f} MB")

# Performance metrics
print(f"\n⚡ Performance Metrics:")
print(f"   Total parsing time: {elapsed_time:.2f} seconds")
print(f"   Notes per second: {notes_per_second:.1f}")
print(f"   MB per second: {total_size_mb/elapsed_time:.1f}")
print(f"   Average time per note: {(elapsed_time/notes_loaded)*1000:.1f} ms")

# Data-structure size estimate (element counts, not memory usage)
avg_elements_per_note = sum(len(note) for note in corpus.notes.values()) / len(corpus.notes)
print(f"\n💾 Data Structure:")
print(f"   Average elements per note: {avg_elements_per_note:.1f}")
print(f"   Total note objects: {notes_loaded:,}")
print(f"   Est. total elements: {int(notes_loaded * avg_elements_per_note):,}")

print("=" * 60)

📈 DETAILED PERFORMANCE ANALYSIS
📁 File Statistics:
   Files processed: 8
   Total size: 1074.8 MB
   Average file size: 134.3 MB

⚡ Performance Metrics:
   Total parsing time: 3.72 seconds
   Notes per second: 553.1
   MB per second: 289.0
   Average time per note: 1.8 ms

💾 Data Structure:
   Average elements per note: 6.4
   Total note objects: 2,057
   Est. total elements: 13,136
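
The metrics in this cell are simple ratios of three measurements: note count, corpus size, and elapsed time. A minimal sketch of a reusable helper follows. `throughput` is a hypothetical name, and its guard against zero denominators mirrors the check used when notes per second was computed earlier in the notebook.

```python
def throughput(notes: int, size_mb: float, seconds: float) -> dict[str, float]:
    """Derive rate metrics from a note count, corpus size, and elapsed time.

    Hypothetical helper for illustration; zero denominators yield 0.0
    instead of raising ZeroDivisionError.
    """
    return {
        "notes_per_second": notes / seconds if seconds > 0 else 0.0,
        "mb_per_second": size_mb / seconds if seconds > 0 else 0.0,
        "ms_per_note": seconds / notes * 1000 if notes > 0 else 0.0,
    }


# Figures taken from the run reported in this notebook.
metrics = throughput(notes=2057, size_mb=1074.8, seconds=3.72)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```

Because the inputs here are the rounded values printed in the output, the last digit can differ slightly from the notebook's figures, which were computed from the unrounded elapsed time.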


In [6]:
# Display Element Name Statistics
print("🏷️  COMPLETE ELEMENT NAME ANALYSIS")
print("=" * 80)

print(f"📊 Found {len(all_element_names)} unique element types:")
print()

# Show ALL elements sorted by frequency (most common first)
for element_name, count in element_counts.most_common():
    percentage = (count / len(corpus.notes)) * 100
    
    print(f"🔹 {element_name}")
    print(f"   Frequency: {count:,} notes ({percentage:.1f}%)")
    
    # Show examples
    examples = element_examples[element_name]
    if examples:
        print(f"   Examples:")
        for i, example in enumerate(examples[:2], 1):
            print(f"     {i}. {example}")
    print()

print("=" * 80)

# Summary insights
print("🎯 KEY INSIGHTS:")
print(f"✅ All {len(corpus.notes):,} notes loaded successfully")
print(f"✅ Dynamic extraction discovered {len(all_element_names)} element types")
print(f"✅ No hardcoded field limitations")

# Most/least common elements
most_common = element_counts.most_common(1)[0]
least_common = element_counts.most_common()[-1]
print(f"📈 Most common: '{most_common[0]}' in {most_common[1]:,} notes")
print(f"📉 Least common: '{least_common[0]}' in {least_common[1]:,} notes")

# Show complete set of all element names
print(f"\n📋 COMPLETE ELEMENT LIST:")
sorted_elements = sorted(all_element_names)
print(f"   {', '.join(sorted_elements)}")

🏷️  COMPLETE ELEMENT NAME ANALYSIS
📊 Found 7 unique element types:

🔹 title
   Frequency: 2,057 notes (100.0%)
   Examples:
     1. Band Practice checklist
     2. Blue Orchid

🔹 created
   Frequency: 2,057 notes (100.0%)
   Examples:
     1. 20131016T175947Z
     2. 20210212T144859Z

🔹 updated
   Frequency: 2,057 notes (100.0%)
   Examples:
     1. 20201118T193529Z
     2. 20210218T144307Z

🔹 note-attributes
   Frequency: 2,057 notes (100.0%)
   Examples:
     1. 
    
     2. 
      

🔹 content
   Frequency: 2,057 notes (100.0%)
   Examples:
     1. 
      <?xml version="1.0" encoding="UTF-8" standa...
     2. 
      <?xml version="1.0" encoding="UTF-8" standa...

🔹 tag
   Frequency: 2,053 notes (99.8%)
   Examples:
     1. Checklists
     2. .Reference

🔹 resource
   Frequency: 798 notes (38.8%)
   Examples:
     1. 
      
     2. 
      

🎯 KEY INSIGHTS:
✅ All 2,057 notes loaded successfully
✅ Dynamic extraction discovered 7 element types
✅ No