# Vigone Municipality Data Harvester

## Purpose
This notebook orchestrates an end-to-end pipeline that extracts municipal indicators from PDFs published on the Vigone (TO) municipal website.

## Prerequisites
- **Google Drive Mount**: Required for accessing CSV templates and storing outputs
- **Templates**: Six CSV indicator templates must exist in Drive at the configured path (`vigone_template_source`)
- **System**: Runs on Google Colab with sufficient disk space for PDF processing

## Configuration
- `vigone_year`: Target year for data extraction (default: 2024)
- `vigone_page_limit`: Maximum number of pages visited during the crawl (default: 50)
- `vigone_pdf_cap`: Maximum PDFs to process (default: 30)

## Output Structure
```
/content/drive/MyDrive/vigone_extraction/
├── docs/          # Downloaded PDFs
├── marker/        # Converted JSON/Markdown
├── output/        # Populated CSV templates
└── manifest.json  # Resume state tracking
```

**Workflow**: Web Discovery → PDF Download → Marker Conversion → Indicator Extraction → CSV Population → Reporting

In [None]:
# Install required dependencies
!pip install -q requests beautifulsoup4 pandas tqdm tenacity marker-pdf pytesseract pillow
!apt-get install -y -qq poppler-utils tesseract-ocr tesseract-ocr-ita

In [None]:
# Mount Google Drive for persistent storage
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Configuration parameters and directory setup
import os
from pathlib import Path

# Core configuration
vigone_year = 2024
vigone_page_limit = 50
vigone_pdf_cap = 30
vigone_origin = "https://www.comune.vigone.to.it/"

# Path configuration
vigone_workspace = Path("/content/drive/MyDrive/vigone_extraction")
vigone_pdf_storage = vigone_workspace / "docs"
vigone_converted_storage = vigone_workspace / "marker"
vigone_results_storage = vigone_workspace / "output"
vigone_template_source = Path("/content/drive/MyDrive/templates")
vigone_state_file = vigone_workspace / "manifest.json"

# Create directory structure
for dir_path in [vigone_workspace, vigone_pdf_storage, vigone_converted_storage, vigone_results_storage]:
    dir_path.mkdir(parents=True, exist_ok=True)

print(f"✓ Configured for year {vigone_year}")
print(f"✓ Directories ready at {vigone_workspace}")

In [None]:
# Import libraries
import requests
import json
import hashlib
import re
import csv
import time
from collections import deque, defaultdict
from urllib.parse import urljoin, urlparse
from datetime import datetime
from typing import Dict, List, Tuple, Optional, Callable, Any, Set
from functools import reduce, partial, wraps
from itertools import islice, chain, groupby

from bs4 import BeautifulSoup
import pandas as pd
from tqdm.auto import tqdm
from tenacity import retry, stop_after_attempt, wait_exponential

print("✓ Libraries loaded successfully")

In [None]:
# Utility functions: hashing, normalization, extraction

def stream_hash_digest(byte_stream: bytes) -> str:
    """Generate compact SHA256 digest for deduplication."""
    digest_engine = hashlib.sha256()
    digest_engine.update(byte_stream)
    return digest_engine.hexdigest()[:16]

def text_normalization_chain(raw_text: str) -> str:
    """Apply multi-stage text normalization pipeline."""
    # Stage 1: Case folding (lowercase)
    stage1 = raw_text.lower()
    # Stage 2: Whitespace collapsing
    stage2 = re.sub(r'\s+', ' ', stage1)
    # Stage 3: Special character filtering
    stage3 = re.sub(r'[^\w\s\.,;:€$%()\[\]\-/]', '', stage2)
    # Stage 4: Trim edges
    return stage3.strip()

def url_canonicalization(href: str, base_domain: str) -> Optional[str]:
    """Canonicalize URL and validate against base domain."""
    try:
        absolute_url = urljoin(base_domain, href)
        url_components = urlparse(absolute_url)
        domain_components = urlparse(base_domain)
        
        # Verify domain matching
        if url_components.netloc == domain_components.netloc:
            return absolute_url
        return None
    except ValueError:  # raised by urlparse on malformed URLs
        return None

def numeric_pattern_extraction(text_fragment: str) -> Optional[float]:
    """Extract the first numeric value, handling European number formats."""
    # Match the first number token in place, so separate numbers in the
    # fragment are never fused together
    token_match = re.search(r'-?\d[\d.,]*', text_fragment)
    if not token_match:
        return None
    token = token_match.group().rstrip('.,')

    if ',' in token and '.' in token:
        # European format: 1.234,56 -> 1234.56
        token = token.replace('.', '').replace(',', '.')
    elif ',' in token:
        # Only comma: treat as decimal separator
        token = token.replace(',', '.')
    elif re.fullmatch(r'-?\d{1,3}(?:\.\d{3})+', token):
        # Dots used only as thousands separators: 1.234 -> 1234
        token = token.replace('.', '')

    try:
        return float(token)
    except ValueError:
        return None

def unit_inference_from_context(surrounding_text: str) -> str:
    """Infer measurement units from textual context."""
    text_lower = surrounding_text.lower()
    
    # Unit detection patterns (order matters - most specific first)
    unit_rules = [
        # Symbols such as € and % are not word characters, so they sit outside \b
        (r'\b(?:euro|eur)\b|€', '€'),
        (r'\bpercentual[ei]\b|\bpercent[oi]\b|%', '%'),
        (r'\b(?:chilogramm[io]|kg)\b', 'kg'),
        (r'\b(?:metri|metro|m)\b', 'm'),
        (r'\b(?:abitant[ei]|resident[ei])\b', 'persone'),
        (r'\bgiorn[io]\b', 'giorni'),
    ]
    
    for pattern, unit in unit_rules:
        if re.search(pattern, text_lower):
            return unit
    
    return ''

print("✓ Utility functions defined")

In [None]:
# State persistence: JSON manifest for resume capability

def read_processing_manifest(manifest_location: Path) -> Dict:
    """Load existing processing state or initialize new."""
    if manifest_location.exists():
        try:
            with open(manifest_location, 'r', encoding='utf-8') as stream:
                return json.load(stream)
        except (json.JSONDecodeError, OSError):
            # Unreadable or corrupt manifest: fall through to a fresh state
            pass
    
    # Initialize fresh state
    return {
        'url_catalog': [],
        'pdf_registry': {},
        'conversion_log': [],
        'extraction_history': [],
        'last_update': datetime.now().isoformat()
    }

def write_processing_manifest(manifest_location: Path, state_data: Dict) -> None:
    """Atomically write manifest state to disk."""
    state_data['last_update'] = datetime.now().isoformat()
    
    temp_location = manifest_location.with_suffix('.tmp')
    with open(temp_location, 'w', encoding='utf-8') as stream:
        json.dump(state_data, stream, indent=2, ensure_ascii=False)
    
    # Atomic rename
    temp_location.replace(manifest_location)

def modify_manifest_attribute(manifest_location: Path, attribute_key: str, attribute_value: Any) -> None:
    """Update single manifest attribute."""
    current_state = read_processing_manifest(manifest_location)
    current_state[attribute_key] = attribute_value
    write_processing_manifest(manifest_location, current_state)

print("✓ State persistence functions ready")

In [None]:
# Web discovery pipeline: BFS crawler for PDF links

def construct_web_crawler(origin_url: str, depth_boundary: int) -> Callable:
    """Build BFS-based web crawler closure."""
    
    def crawl_and_discover() -> List[str]:
        visited_urls = set()
        discovered_pdfs = []
        traversal_queue = deque([(origin_url, 0)])
        
        progress_tracker = tqdm(total=depth_boundary, desc="🔍 Web Discovery")
        
        while traversal_queue and len(visited_urls) < depth_boundary:
            current_node, current_depth = traversal_queue.popleft()
            
            # Skip visited or too deep
            if current_node in visited_urls or current_depth > 3:
                continue
            
            visited_urls.add(current_node)
            progress_tracker.update(1)
            
            try:
                # Fetch page with timeout
                http_response = requests.get(current_node, timeout=10, headers={
                    'User-Agent': 'Mozilla/5.0 (compatible; VigoneDataBot/1.0)'
                })
                
                if http_response.status_code != 200:
                    continue
                
                # Parse HTML structure
                page_soup = BeautifulSoup(http_response.content, 'html.parser')
                
                # Process all hyperlinks
                for anchor_element in page_soup.find_all('a', href=True):
                    link_href = anchor_element['href']
                    
                    # Resolve relative links against the page they appear on
                    canonical_url = url_canonicalization(link_href, current_node)
                    if not canonical_url:
                        continue

                    # Identify PDF links
                    if link_href.lower().endswith('.pdf'):
                        if canonical_url not in discovered_pdfs:
                            discovered_pdfs.append(canonical_url)
                    elif canonical_url not in visited_urls:
                        # Queue HTML pages for crawling
                        traversal_queue.append((canonical_url, current_depth + 1))
                
                # Polite crawling delay
                time.sleep(0.5)
                
            except Exception as crawl_error:
                continue
        
        progress_tracker.close()
        return discovered_pdfs
    
    return crawl_and_discover

print("✓ Web discovery pipeline configured")

In [None]:
# PDF acquisition pipeline: download with retry and deduplication

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=15))
def retrieve_document_bytes(document_url: str) -> bytes:
    """Fetch remote document with exponential backoff."""
    http_response = requests.get(document_url, timeout=30, headers={
        'User-Agent': 'Mozilla/5.0 (compatible; VigoneDataBot/1.0)'
    })
    http_response.raise_for_status()
    return http_response.content

def construct_pdf_downloader(storage_path: Path, state_path: Path) -> Callable:
    """Build PDF acquisition pipeline with hash-based deduplication."""
    
    def download_pdf_collection(url_collection: List[str], quantity_limit: int) -> Dict[str, str]:
        current_state = read_processing_manifest(state_path)
        hash_registry = current_state.get('pdf_registry', {})
        
        download_progress = tqdm(
            islice(url_collection, quantity_limit),
            desc="üì• Downloading PDFs",
            total=min(len(url_collection), quantity_limit)
        )
        
        for pdf_url in download_progress:
            try:
                # Fetch document content
                pdf_bytes = retrieve_document_bytes(pdf_url)
                content_hash = stream_hash_digest(pdf_bytes)
                
                # Check for duplicate content
                if content_hash in hash_registry.values():
                    download_progress.set_postfix({'status': 'duplicate'})
                    continue
                
                # Generate storage filename
                storage_filename = f"vigone_{content_hash}.pdf"
                storage_location = storage_path / storage_filename
                
                # Write to disk
                with open(storage_location, 'wb') as output_stream:
                    output_stream.write(pdf_bytes)
                
                # Register in hash map
                hash_registry[pdf_url] = content_hash
                download_progress.set_postfix({'status': 'saved'})
                
            except Exception as download_error:
                download_progress.set_postfix({'status': 'failed'})
                continue
        
        # Persist updated registry
        modify_manifest_attribute(state_path, 'pdf_registry', hash_registry)
        return hash_registry
    
    return download_pdf_collection

print("✓ PDF acquisition pipeline ready")

In [None]:
# Marker conversion pipeline: PDF to JSON/Markdown

def construct_marker_transformer(source_path: Path, target_path: Path, state_path: Path) -> Callable:
    """Build Marker-based PDF transformation pipeline."""
    
    def transform_pdf_to_structured() -> List[Dict]:
        current_state = read_processing_manifest(state_path)
        conversion_registry = current_state.get('conversion_log', [])
        transformation_results = []
        
        pdf_inventory = list(source_path.glob('*.pdf'))
        conversion_progress = tqdm(pdf_inventory, desc="🔄 Marker Conversion")
        
        for pdf_file in conversion_progress:
            json_target = target_path / f"{pdf_file.stem}.json"
            markdown_target = target_path / f"{pdf_file.stem}.md"
            
            # Skip already processed
            if str(pdf_file) in conversion_registry:
                if json_target.exists():
                    try:
                        with open(json_target, 'r', encoding='utf-8') as json_stream:
                            transformation_results.append(json.load(json_stream))
                    except (json.JSONDecodeError, OSError):
                        # Cached JSON is unreadable: skip it rather than abort the run
                        pass
                continue
            
            try:
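                # Execute marker conversion (CLI flags and output layout differ between
                # marker-pdf releases; check them against the installed version)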
                import subprocess
                conversion_command = [
                    'marker_single',
                    str(pdf_file),
                    str(target_path),
                    '--batch_multiplier', '2',
                    '--langs', 'Italian'
                ]
                
                subprocess.run(
                    conversion_command,
                    check=True,
                    capture_output=True,
                    text=True
                )
                
                # Load conversion output
                if json_target.exists():
                    with open(json_target, 'r', encoding='utf-8') as json_stream:
                        structured_data = json.load(json_stream)
                        transformation_results.append(structured_data)
                        conversion_registry.append(str(pdf_file))
                        conversion_progress.set_postfix({'status': 'converted'})
                
            except Exception as conversion_error:
                conversion_progress.set_postfix({'status': 'failed'})
                continue
        
        # Update manifest
        modify_manifest_attribute(state_path, 'conversion_log', conversion_registry)
        return transformation_results
    
    return transform_pdf_to_structured

print("✓ Marker conversion pipeline configured")

In [None]:
# Indicator extraction pipeline: fuzzy search and regex parsing

def calculate_token_similarity(sequence_a: str, sequence_b: str) -> float:
    """Custom token-based Jaccard similarity."""
    tokens_a = set(sequence_a.lower().split())
    tokens_b = set(sequence_b.lower().split())
    
    if not tokens_a or not tokens_b:
        return 0.0
    
    intersection_size = len(tokens_a & tokens_b)
    union_size = len(tokens_a | tokens_b)
    
    return intersection_size / union_size if union_size > 0 else 0.0

def construct_indicator_matcher(converted_path: Path, template_path: Path, year_target: int) -> Callable:
    """Build indicator extraction engine with fuzzy matching."""
    
    def match_indicators_in_corpus() -> Dict[str, List[Dict]]:
        # Load CSV templates
        template_catalog = {}
        for template_file in template_path.glob('*.csv'):
            try:
                template_df = pd.read_csv(template_file)
                template_catalog[template_file.stem] = template_df
            except Exception as load_error:
                continue
        
        if not template_catalog:
            print("⚠️ No templates found in", template_path)
            return {}
        
        # Build text corpus from markdown files
        document_corpus = []
        for markdown_file in converted_path.glob('*.md'):
            try:
                with open(markdown_file, 'r', encoding='utf-8') as md_stream:
                    raw_content = md_stream.read()
                    normalized_content = text_normalization_chain(raw_content)
                    document_corpus.append(normalized_content)
            except Exception as read_error:
                continue
        
        extraction_results = defaultdict(list)
        
        # Process each template
        for template_name, template_data in template_catalog.items():
            print(f"\nüìä Processing template: {template_name}")
            
            # Find indicator column
            indicator_column = None
            for column_name in template_data.columns:
                column_lower = column_name.lower()
                if 'indicat' in column_lower or 'descri' in column_lower or 'nome' in column_lower:
                    indicator_column = column_name
                    break
            
            if not indicator_column:
                print("  ⚠️ No indicator column found")
                continue
            
            # Match each indicator
            for row_index, row_data in template_data.iterrows():
                raw_label = row_data[indicator_column]

                # Check for NaN before str(): str(NaN) is the non-empty string 'nan'
                if pd.isna(raw_label) or not str(raw_label).strip():
                    continue
                indicator_label = str(raw_label)
                
                normalized_label = text_normalization_chain(indicator_label)
                optimal_match = None
                optimal_score = 0.0
                
                # Search in corpus
                for document_text in document_corpus:
                    # Split into sentence-like chunks; require whitespace after the
                    # period so numbers such as 1.234,56 are not cut in half
                    text_chunks = re.split(r'\.\s+', document_text)
                    
                    for chunk in text_chunks:
                        if len(chunk) < 10:  # Skip too short
                            continue
                        
                        similarity = calculate_token_similarity(normalized_label, chunk)
                        
                        if similarity > optimal_score and similarity > 0.3:
                            optimal_score = similarity
                            optimal_match = chunk
                
                # Extract numeric data from match
                if optimal_match:
                    extracted_value = numeric_pattern_extraction(optimal_match)
                    inferred_unit = unit_inference_from_context(optimal_match)
                    
                    extraction_results[template_name].append({
                        'row_idx': row_index,
                        'indicator_name': indicator_label,
                        'extracted_value': extracted_value,
                        'unit': inferred_unit,
                        'confidence_score': round(optimal_score, 3),
                        'source_snippet': optimal_match[:250]
                    })
        
        return dict(extraction_results)
    
    return match_indicators_in_corpus

print("✓ Indicator extraction pipeline ready")

In [None]:
# CSV population pipeline: maintain template structure

def construct_csv_writer(template_source: Path, output_target: Path, year_target: int) -> Callable:
    """Build CSV template population engine."""
    
    def populate_template_csvs(extraction_map: Dict[str, List[Dict]]) -> List[Path]:
        populated_files = []
        
        for template_identifier, extracted_indicators in extraction_map.items():
            template_location = template_source / f"{template_identifier}.csv"
            
            if not template_location.exists():
                print(f"⚠️ Template not found: {template_identifier}")
                continue
            
            # Load template structure
            template_df = pd.read_csv(template_location)
            
            # Locate or create year column
            year_column_name = None
            for column_name in template_df.columns:
                if str(year_target) in str(column_name) or 'anno' in str(column_name).lower():
                    year_column_name = column_name
                    break
            
            if not year_column_name:
                # Create new year column
                year_column_name = f"Anno_{year_target}"
                template_df[year_column_name] = None
            
            # Inject extracted values
            for extraction_item in extracted_indicators:
                target_row = extraction_item['row_idx']
                extracted_val = extraction_item['extracted_value']
                
                if extracted_val is not None and target_row < len(template_df):
                    template_df.at[target_row, year_column_name] = extracted_val
            
            # Write populated CSV
            output_location = output_target / f"{template_identifier}_vigone_{year_target}.csv"
            template_df.to_csv(output_location, index=False)
            populated_files.append(output_location)
            print(f"✅ Populated: {template_identifier}")
        
        return populated_files
    
    return populate_template_csvs

print("✓ CSV population pipeline configured")

In [None]:
# Reporting pipeline: JSON report with stats and confidence scores

def construct_report_builder(output_location: Path) -> Callable:
    """Build comprehensive reporting engine."""
    
    def build_extraction_report(
        extraction_map: Dict[str, List[Dict]],
        pdf_quantity: int,
        url_quantity: int
    ) -> Dict:
        
        report_structure = {
            'generation_timestamp': datetime.now().isoformat(),
            'pipeline_summary': {
                'urls_discovered': url_quantity,
                'pdfs_acquired': pdf_quantity,
                'templates_processed': len(extraction_map),
                'total_extractions': sum(len(items) for items in extraction_map.values())
            },
            'template_details': {}
        }
        
        # Build per-template statistics
        for template_id, extraction_list in extraction_map.items():
            values_extracted = sum(
                1 for item in extraction_list 
                if item['extracted_value'] is not None
            )
            
            confidence_scores = [
                item['confidence_score'] 
                for item in extraction_list
            ]
            
            mean_confidence = (
                sum(confidence_scores) / len(confidence_scores) 
                if confidence_scores else 0.0
            )
            
            coverage_percentage = (
                (values_extracted / len(extraction_list) * 100) 
                if extraction_list else 0.0
            )
            
            report_structure['template_details'][template_id] = {
                'indicator_count': len(extraction_list),
                'successful_extractions': values_extracted,
                'coverage_percent': f"{coverage_percentage:.1f}%",
                'average_confidence': f"{mean_confidence:.3f}",
                'example_extractions': extraction_list[:3]
            }
        
        # Persist report
        report_file = output_location / 'vigone_extraction_report.json'
        with open(report_file, 'w', encoding='utf-8') as report_stream:
            json.dump(report_structure, report_stream, indent=2, ensure_ascii=False)
        
        print(f"\nüìä Report saved to: {report_file}")
        return report_structure
    
    return build_extraction_report

print("✓ Reporting pipeline ready")

In [None]:
# Main orchestrator: chain all pipelines

def execute_vigone_pipeline(
    origin_url: str,
    page_limit: int,
    pdf_limit: int,
    target_year: int,
    pdf_storage: Path,
    converted_storage: Path,
    results_storage: Path,
    template_source: Path,
    state_file: Path
) -> Dict:
    """Orchestrate complete extraction pipeline."""
    
    print("\n" + "="*70)
    print(" " * 15 + "VIGONE DATA EXTRACTION PIPELINE")
    print("="*70 + "\n")
    
    # Stage 1: Web Discovery
    print("[STAGE 1/6] Web Discovery")
    print("-" * 70)
    crawler = construct_web_crawler(origin_url, page_limit)
    pdf_urls = crawler()
    print(f"\n✅ Discovered {len(pdf_urls)} PDF URLs\n")
    
    # Stage 2: PDF Acquisition
    print("[STAGE 2/6] PDF Acquisition")
    print("-" * 70)
    downloader = construct_pdf_downloader(pdf_storage, state_file)
    pdf_hashes = downloader(pdf_urls, pdf_limit)
    print(f"\n✅ Acquired {len(pdf_hashes)} unique PDFs\n")
    
    # Stage 3: Marker Conversion
    print("[STAGE 3/6] Marker Conversion")
    print("-" * 70)
    transformer = construct_marker_transformer(pdf_storage, converted_storage, state_file)
    converted_documents = transformer()
    print(f"\n✅ Converted {len(converted_documents)} documents\n")
    
    # Stage 4: Indicator Extraction
    print("[STAGE 4/6] Indicator Extraction")
    print("-" * 70)
    matcher = construct_indicator_matcher(converted_storage, template_source, target_year)
    extracted_data = matcher()
    total_matches = sum(len(items) for items in extracted_data.values())
    print(f"\n✅ Extracted {total_matches} indicator matches\n")
    
    # Stage 5: CSV Population
    print("[STAGE 5/6] CSV Population")
    print("-" * 70)
    csv_writer = construct_csv_writer(template_source, results_storage, target_year)
    populated_csvs = csv_writer(extracted_data)
    print(f"\n✅ Populated {len(populated_csvs)} CSV files\n")
    
    # Stage 6: Report Generation
    print("[STAGE 6/6] Report Generation")
    print("-" * 70)
    report_builder = construct_report_builder(results_storage)
    final_report = report_builder(extracted_data, len(pdf_hashes), len(pdf_urls))
    
    print("\n" + "="*70)
    print(" " * 25 + "PIPELINE COMPLETE")
    print("="*70 + "\n")
    
    return final_report

print("✓ Orchestrator ready")

In [None]:
# Quick test run with limited scope

print("\nüöÄ Starting quick test run...\n")

test_results = execute_vigone_pipeline(
    origin_url=vigone_origin,
    page_limit=30,  # Reduced for testing
    pdf_limit=20,   # Reduced for testing
    target_year=vigone_year,
    pdf_storage=vigone_pdf_storage,
    converted_storage=vigone_converted_storage,
    results_storage=vigone_results_storage,
    template_source=vigone_template_source,
    state_file=vigone_state_file
)

print("\n" + "="*70)
print(" " * 28 + "TEST SUMMARY")
print("="*70)
print(json.dumps(test_results['pipeline_summary'], indent=2))
print("\nüìÅ Full report:", vigone_results_storage / 'vigone_extraction_report.json')
print("="*70)

## Next Steps

### Full Production Run
```python
production_results = execute_vigone_pipeline(
    origin_url=vigone_origin,
    page_limit=vigone_page_limit,
    pdf_limit=vigone_pdf_cap,
    target_year=vigone_year,
    pdf_storage=vigone_pdf_storage,
    converted_storage=vigone_converted_storage,
    results_storage=vigone_results_storage,
    template_source=vigone_template_source,
    state_file=vigone_state_file
)
```

### Troubleshooting

**No PDFs discovered:**
- Verify base URL is accessible
- Increase `vigone_page_limit` parameter
- Check if website structure changed
- Inspect `pdf_registry` in `manifest.json` for the PDF URLs downloaded so far (the crawler itself does not write to `url_catalog`)
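
When no PDFs turn up, it can help to check how a link found on a sub-page resolves and whether it stays on the municipal domain. The sketch below uses only `urllib.parse`; the helper name `resolve_same_site` and the sub-page and file paths are hypothetical, chosen for illustration.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse

def resolve_same_site(href: str, page_url: str, site_root: str) -> Optional[str]:
    """Resolve `href` against the page it was found on; keep it only if it stays on the site."""
    absolute = urljoin(page_url, href)
    if urlparse(absolute).netloc == urlparse(site_root).netloc:
        return absolute
    return None

root = "https://www.comune.vigone.to.it/"
page = root + "uffici/tributi/"  # hypothetical sub-page

# A relative link resolves below the sub-page, not below the site root
print(resolve_same_site("docs/tari.pdf", page, root))
# An external link is rejected
print(resolve_same_site("https://example.org/file.pdf", page, root))
```

If the printed URL does not match what the browser opens for the same link, the base used for resolution is the first thing to check.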

**Low extraction coverage:**
- Review template indicator phrasing
- Adjust the similarity threshold (0.3) in `match_indicators_in_corpus`, where the score from `calculate_token_similarity` is compared
- Check PDF quality (scanned vs. digital)
- Verify Tesseract OCR is working
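
To see why indicator phrasing matters, the sketch below re-implements the token-based Jaccard score that `calculate_token_similarity` computes and applies it to two made-up template labels. The strings are illustrative assumptions, not data from the Vigone documents.

```python
def jaccard_tokens(a: str, b: str) -> float:
    """Token-level Jaccard similarity: shared tokens divided by distinct tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical document snippet and two hypothetical template labels
snippet = "popolazione residente al 31 dicembre"
short_label = "popolazione residente"
wordy_label = "numero totale di abitanti iscritti in anagrafe"

print(jaccard_tokens(short_label, snippet))  # 2 shared tokens out of 5 distinct -> 0.4
print(jaccard_tokens(wordy_label, snippet))  # no shared tokens -> 0.0
```

A label that reuses the wording of the source documents clears the 0.3 threshold, while a paraphrase with no shared tokens can never match, however close in meaning.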

**Marker conversion errors:**
- Ensure sufficient disk space (>2GB free)
- Check PDF file integrity
- Install Italian language pack: `!apt-get install tesseract-ocr-ita`
- Review conversion logs in terminal output
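
The disk-space check can be scripted before starting a conversion. This is a minimal sketch using `shutil.disk_usage` from the standard library; the helper names are hypothetical, and the 2 GB default simply mirrors the figure quoted above.

```python
import shutil
from pathlib import Path

def free_space_gb(path: Path) -> float:
    """Return free space, in gigabytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9

def has_enough_space(path: Path, required_gb: float = 2.0) -> bool:
    """True when at least `required_gb` gigabytes are free."""
    return free_space_gb(path) >= required_gb

workspace = Path(".")  # assumption: use vigone_workspace inside the notebook
print(f"Free space: {free_space_gb(workspace):.1f} GB")
print("OK to convert" if has_enough_space(workspace) else "Free up disk space first")
```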

**Resume interrupted pipeline:**
- State is automatically saved in `manifest.json`
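- Re-run the pipeline: PDFs whose content hash is already registered are not saved again and PDFs listed in `conversion_log` are not reconverted; the web crawl itself runs again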
- Check `conversion_log` and `pdf_registry` in manifest
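
A quick way to check the saved state is to count the entries under the manifest keys. The sketch below assumes the key names created by `read_processing_manifest` (`pdf_registry`, `conversion_log`, `last_update`); the function name `summarize_manifest` is hypothetical.

```python
import json
from pathlib import Path

def summarize_manifest(manifest_path: Path) -> dict:
    """Count the entries recorded under the manifest keys used for resuming."""
    if not manifest_path.exists():
        return {"downloaded_pdfs": 0, "converted_pdfs": 0, "last_update": None}
    with open(manifest_path, "r", encoding="utf-8") as stream:
        state = json.load(stream)
    return {
        "downloaded_pdfs": len(state.get("pdf_registry", {})),
        "converted_pdfs": len(state.get("conversion_log", [])),
        "last_update": state.get("last_update"),
    }

# Assumption: the default manifest location from the configuration cell
print(summarize_manifest(Path("/content/drive/MyDrive/vigone_extraction/manifest.json")))
```

If `converted_pdfs` is lower than `downloaded_pdfs`, the next run still has conversions to do.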

### Output Files Structure

```
vigone_extraction/
├── docs/
│   └── vigone_[hash].pdf        # Downloaded PDFs (hash-deduplicated)
├── marker/
│   ├── vigone_[hash].json       # Structured JSON from Marker
│   └── vigone_[hash].md         # Markdown text from Marker
├── output/
│   ├── [template]_vigone_2024.csv  # Populated CSV templates
│   └── vigone_extraction_report.json # Detailed statistics
└── manifest.json                # Pipeline state tracking
```

### Advanced Configuration

**Adjust similarity threshold:**
```python
# In match_indicators_in_corpus (inside construct_indicator_matcher)
if similarity > optimal_score and similarity > 0.3:  # Change 0.3 to a threshold in 0.0-1.0
```

**Modify crawl depth:**
```python
# In crawl_and_discover (inside construct_web_crawler)
if current_node in visited_urls or current_depth > 3:  # Raise 3 for deeper crawling
```
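
The effect of the depth limit can be seen on a toy link graph. The sketch below is a standalone breadth-first traversal with the same skip rule as the crawler (visited or deeper than the limit); the function name and the site map are hypothetical.

```python
from collections import deque
from typing import Dict, List

def bfs_within_depth(links: Dict[str, List[str]], start: str, max_depth: int) -> List[str]:
    """Return pages reachable from `start` in at most `max_depth` link hops, in BFS order."""
    visited, order = set(), []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        # Same rule as the crawler: skip visited pages and pages past the limit
        if page in visited or depth > max_depth:
            continue
        visited.add(page)
        order.append(page)
        for target in links.get(page, []):
            queue.append((target, depth + 1))
    return order

# Hypothetical site map: each page links one level deeper
site = {"/": ["/section"], "/section": ["/archive"], "/archive": ["/2024"], "/2024": []}
print(bfs_within_depth(site, "/", 2))  # ['/', '/section', '/archive']
print(bfs_within_depth(site, "/", 3))  # ['/', '/section', '/archive', '/2024']
```

A PDF linked only from a page beyond the limit is never seen, so archives nested several levels down may need a higher value together with a larger page limit.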

**Add custom unit patterns:**
```python
# In unit_inference_from_context function
unit_rules = [
    (r'your_pattern', 'your_unit'),
    ...
]
```
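
Because the first matching rule wins, a custom pattern has to be placed before any more general pattern it overlaps with. The sketch below illustrates this with a hypothetical square-metre rule; the rule list is a shortened stand-in for the one in `unit_inference_from_context`, not a copy of it.

```python
import re
from typing import List, Tuple

# First matching rule wins, so specific patterns must precede general ones
UNIT_RULES: List[Tuple[str, str]] = [
    (r'\b(?:mq|metri quadrati)\b', 'm²'),  # hypothetical custom rule
    (r'\b(?:metri|metro)\b', 'm'),
    (r'\b(?:chilogramm[io]|kg)\b', 'kg'),
]

def infer_unit(text: str) -> str:
    """Return the unit of the first rule that matches, or '' if none does."""
    lowered = text.lower()
    for pattern, unit in UNIT_RULES:
        if re.search(pattern, lowered):
            return unit
    return ''

print(infer_unit("Superficie di 120 mq"))            # m²
print(infer_unit("area verde: 300 metri quadrati"))  # m² (specific rule comes first)
print(infer_unit("lunghezza 40 metri"))              # m
```

With the two metre rules swapped, "metri quadrati" would be reported as plain metres.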