<span style="font-size: 12px;">


Given:

Key Parameters to Describe a PDF Dataset Parametrically

1. Number of PDF Files

Total count of PDF documents included in the dataset.

2. Storage Size

Total size on disk (e.g., in megabytes or gigabytes).

Average file size and range (min, max) to indicate variability.

3. Textual Content Size

Total number of words or tokens extracted from all PDFs combined.

Average number of words or tokens per document.

If applicable, number of unique tokens (vocabulary size).

4. Information Content (in Bits)

Approximate information content can be estimated by the file size in bytes multiplied by 8 (bits per byte).

For extracted text, entropy-based measures or compression size can be used as proxies for information content.

If the dataset includes scanned PDFs requiring OCR, note the quality and potential noise affecting information content.

5. Document Length Distribution

Statistical summary of document lengths (e.g., mean, median, standard deviation of word counts or token counts).

Distribution shape (e.g., skewness) if relevant.

6. Language and Encoding

Language(s) of the documents.

Text encoding used (e.g., UTF-8).

7. Structural Elements (Optional)

Number or proportion of documents containing tables, figures, or annotations.



Directory: D:\Dataset\Lagerugpijn\LR_EPDs

contains a batch of EHR PDF files



Provide a Pyhton based Jupyter notebook code that provides the above stated parameters about all PDF files stored in D:\Dataset\Lagerugpijn\LR_EPDs


============================================================================================================

now make sure it can detect the language that is used in the PDF files

add sophisticated measures (entropy, compression) would require deeper analysis.



make sure that the output reports on all the required parameters for each file separately and that it concludes with a general overview report that best describes the entire dataset




===========================================================================================================



now make sure the code:

can Identifying structural elements like tables, figures, or annotations requires more advanced PDF parsing libraries and logic

+ reports on these elements in the Per-File Analysis Results ouput

</span>

In [10]:
import os
import sys
import math
import statistics
import numpy as np
from collections import Counter
import re # For tokenization
from langdetect import detect, DetectorFactory # For language detection
from langdetect.lang_detect_exception import LangDetectException
import pdfplumber # Added: For advanced PDF parsing
from io import StringIO # To capture print output

# Ensure consistent language detection results
DetectorFactory.seed = 0

# --- Configuration ---
# Directory containing the PDF files
pdf_directory = r'D:\Dataset\Lagerugpijn\LR_EPDs' # Use a raw string for the path
output_markdown_file = 'pdf_analysis_report.md' # Name of the output Markdown file

# --- Data Storage ---
file_data = [] # List to store dictionaries, each containing data for one file
all_extracted_text_for_vocab = "" # String to accumulate all text for vocabulary analysis
report_output = StringIO() # Use StringIO to capture print statements

# Redirect stdout to capture print output
original_stdout = sys.stdout
sys.stdout = report_output

# --- Helper Function for Text Extraction and Structural Element Detection ---
def analyze_pdf_content(pdf_path):
    """
    Extracts text and detects structural elements (tables, figures, annotations)
    from a PDF file using pdfplumber.
    Returns extracted text, table count, figure count, annotation count.
    """
    extracted_text = ""
    table_count = 0
    figure_count = 0
    annotation_count = 0

    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # Extract text
                extracted_text += page.extract_text() or "" # Add text, handle None

                # Detect tables
                tables = page.extract_tables()
                table_count += len(tables)

                # Detect figures (images)
                figure_count += len(page.images)

                # Detect annotations
                annotation_count += len(page.annots)

    except pdfplumber.PDFSyntaxError as e:
         # Print warnings directly to original stdout so they appear during processing
         original_stdout.write(f"Warning: PDF Syntax Error in {os.path.basename(pdf_path)}: {e}\n")
         # Return empty data for this file if there's a syntax error
         return "", 0, 0, 0
    except Exception as e:
        # Print warnings directly to original stdout
        original_stdout.write(f"Warning: Error processing {os.path.basename(pdf_path)} with pdfplumber: {e}\n")
        # Return empty data for this file if there's any other error
        return "", 0, 0, 0

    return extracted_text, table_count, figure_count, annotation_count


# --- Helper Function for Language Detection ---
def detect_language(text):
    """Detects the language of the input text."""
    if not text.strip():
        return "N/A (No text)"
    try:
        # langdetect works best on larger text samples.
        # Use a sample if the text is very long to speed things up,
        # but for typical documents, using the full text is fine.
        # Limit text length for detection to avoid potential issues with very large texts
        sample_text = text[:5000] if len(text) > 5000 else text
        if not sample_text.strip():
             return "N/A (No text in sample)"
        return detect(sample_text)
    except LangDetectException:
        return "Undetected"
    except Exception as e:
        # Print warnings directly to original stdout
        original_stdout.write(f"Warning: Error during language detection: {e}\n")
        return "Error"

# --- Analysis ---
# Print initial message to original stdout
original_stdout.write(f"Analyzing PDF files in directory: {pdf_directory}\n")


if not os.path.isdir(pdf_directory):
    # Print error to original stdout
    original_stdout.write(f"Error: Directory not found at {pdf_directory}\n")
else:
    # Iterate through all entries in the directory
    for entry_name in os.listdir(pdf_directory):
        entry_path = os.path.join(pdf_directory, entry_name)

        # Check if the entry is a file and ends with .pdf (case-insensitive)
        if os.path.isfile(entry_path) and entry_name.lower().endswith('.pdf'):

            file_info = {} # Dictionary to store data for the current file
            file_info['filename'] = entry_name
            file_info['filepath'] = entry_path

            # 1. & 2. Storage Size
            try:
                file_size_bytes = os.path.getsize(entry_path) # size in bytes
                file_info['storage_size_bytes'] = file_size_bytes
                file_info['storage_size_mb'] = file_size_bytes / (1024 * 1024)
            except Exception as e:
                # Print warnings directly to original stdout
                original_stdout.write(f"Warning: Could not get size for {entry_name}: {e}\n")
                file_info['storage_size_bytes'] = 0
                file_info['storage_size_mb'] = 0


            # 3. Textual Content Size & Extraction + 7. Structural Elements
            text, table_count, figure_count, annotation_count = analyze_pdf_content(entry_path)

            file_info['extracted_text'] = text # Store text for language
            # Simple tokenization: split by whitespace and punctuation
            tokens = re.findall(r'\b\w+\b', text.lower()) # Convert to lower case for vocabulary size
            word_count = len(tokens)
            file_info['word_count'] = word_count

            # Calculate unique words for the current file
            unique_words_in_file = set(tokens)
            file_info['unique_word_count'] = len(unique_words_in_file)


            # Accumulate text for overall vocabulary later
            all_extracted_text_for_vocab += text + " " # Add a space to ensure separation

            # Store structural element counts
            file_info['table_count'] = table_count
            file_info['figure_count'] = figure_count
            file_info['annotation_count'] = annotation_count
            # Add boolean flags for easy checking
            file_info['has_tables'] = table_count > 0
            file_info['has_figures'] = figure_count > 0
            file_info['has_annotations'] = annotation_count > 0


            # 6. Language Detection
            file_info['language'] = detect_language(text)

            # 4. Information Content (Estimated)
            # Simple estimation based on file size in bits
            file_info['estimated_info_content_bits_filesize'] = file_size_bytes * 8


            file_data.append(file_info) # Add the file's data to the list


    # --- Per-File Reporting (Captured for Markdown) ---
    print("\n## Per-File Analysis Results") # Markdown Heading
    if file_data:
        # Create a Markdown table for per-file data
        # Added 'Unique Words' column header
        print("\n| Filename | Size (MB) | Word Count | Unique Words | Language | Tables | Figures | Annotations |")
        print("|---|---|---|---|---|---|---|---|") # Updated separator line
        for file_info in file_data:
            # Added file_info['unique_word_count'] to the row
            print(f"| {file_info['filename']} | {file_info['storage_size_mb']:.2f} | {file_info['word_count']} | {file_info['unique_word_count']} | {file_info['language']} | {file_info['table_count']} | {file_info['figure_count']} | {file_info['annotation_count']} |")

    else:
        print("No PDF files found in the specified directory.")


    # --- Overall Dataset Summary (Captured for Markdown) ---
    print("\n## Overall Dataset Summary") # Markdown Heading

    total_files = len(file_data)
    print(f"\n### 1. Number of PDF Files: {total_files}") # Markdown Subheading

    if total_files > 0:
        # 2. Storage Size
        all_storage_sizes_bytes = [f['storage_size_bytes'] for f in file_data]
        total_storage_size_bytes = sum(all_storage_sizes_bytes)
        total_storage_size_mb = total_storage_size_bytes / (1024 * 1024)
        print(f"\n### 2. Storage Size") # Markdown Subheading
        print(f"- **Total:** {total_storage_size_mb:.2f} MB ({total_storage_size_bytes} bytes)")
        if all_storage_sizes_bytes:
            avg_file_size_mb = statistics.mean(all_storage_sizes_bytes) / (1024 * 1024)
            min_file_size_mb = min(all_storage_sizes_bytes) / (1024 * 1024)
            max_file_size_mb = max(all_storage_sizes_bytes) / (1024 * 1024)
            print(f"- **Average:** {avg_file_size_mb:.2f} MB")
            print(f"- **Range:** ({min_file_size_mb:.2f} MB, {max_file_size_mb:.2f} MB)")

        # 3. Textual Content Size
        all_word_counts = [f['word_count'] for f in file_data]
        total_word_count = sum(all_word_counts)
        print(f"\n### 3. Textual Content Size") # Markdown Subheading
        print(f"- **Total words/tokens:** {total_word_count}")
        if all_word_counts:
            avg_word_count = statistics.mean(all_word_counts)
            print(f"- **Average words/tokens per document:** {avg_word_count:.2f}")

            # Calculate unique tokens (vocabulary size) from accumulated text
            all_tokens = re.findall(r'\b\w+\b', all_extracted_text_for_vocab.lower())
            unique_tokens = set(all_tokens)
            vocabulary_size = len(unique_tokens)
            print(f"- **Unique tokens across dataset (vocabulary size):** {vocabulary_size}") # Clarified label
        else:
             print(f"- **Average words/tokens per document:** 0")
             print(f"- **Unique tokens across dataset (vocabulary size):** 0")

        # 4. Information Content (Estimated)
        total_estimated_info_content_bits_filesize = total_storage_size_bytes * 8
        print(f"\n### 4. Information Content (Estimated)") # Markdown Subheading
        print(f"- **Total estimated info content (based on total file size):** {total_estimated_info_content_bits_filesize} bits")
        print(f"- *Note: This estimate is based on raw file size in bits.*")


        # 5. Document Length Distribution
        print(f"\n### 5. Document Length Distribution (in words/tokens)") # Markdown Subheading
        if all_word_counts:
            mean_len = statistics.mean(all_word_counts)
            median_len = statistics.median(all_word_counts)
            # Ensure std dev calculation is valid for >1 data points
            std_dev_len = statistics.stdev(all_word_counts) if len(all_word_counts) > 1 else 0
            print(f"- **Mean length:** {mean_len:.2f}")
            print(f"- **Median length:** {median_len}")
            print(f"- **Standard deviation:** {std_dev_len:.2f}")


        # 6. Language Distribution
        print(f"\n### 6. Language Distribution") # Markdown Subheading
        all_languages = [f['language'] for f in file_data]
        language_counts = Counter(all_languages)
        print("- **Counts:**")
        for lang, count in language_counts.most_common():
            print(f"   - {lang}: {count} files")


        # 7. Structural Elements Summary
        print(f"\n### 7. Structural Elements Summary") # Markdown Subheading
        total_tables_found = sum([f['table_count'] for f in file_data])
        total_figures_found = sum([f['figure_count'] for f in file_data])
        total_annotations_found = sum([f['annotation_count'] for f in file_data])

        files_with_tables = sum([f['has_tables'] for f in file_data])
        files_with_figures = sum([f['has_figures'] for f in file_data])
        files_with_annotations = sum([f['has_annotations'] for f in file_data])

        print(f"- **Total Tables found:** {total_tables_found}")
        print(f"- **Total Figures/Images found:** {total_figures_found}")
        print(f"- **Total Annotations found:** {total_annotations_found}")
        print(f"- **Files with Tables:** {files_with_tables} ({files_with_tables/total_files*100:.2f}%)")
        print(f"- **Files with Figures/Images:** {files_with_figures} ({files_with_figures/total_files*100:.2f}%)")
        print(f"- **Files with Annotations:** {files_with_annotations} ({files_with_annotations/total_files*100:.2f}%)")


    else:
        print("No data collected from PDF files.")


# --- End of Captured Output ---
sys.stdout = original_stdout # Restore stdout

# --- Display Report in Console ---
markdown_content = report_output.getvalue() # Get the captured output
print("\n" + "="*30 + " PDF Analysis Report " + "="*30) # Separator
print("\n*Note: Markdown rendering may vary depending on the viewer. Multi-column layout and font size control are not standard Markdown features.*\n")
print(markdown_content) # Print the captured report content to console
print("="*79) # Separator


# --- Save Output to Markdown File ---
try:
    with open(output_markdown_file, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    print(f"\nAnalysis report also saved to {output_markdown_file}")
except Exception as e:
    print(f"\nError saving report to {output_markdown_file}: {e}")

report_output.close() # Close the StringIO object


Analyzing PDF files in directory: D:\Dataset\Lagerugpijn\LR_EPDs


*Note: Markdown rendering may vary depending on the viewer. Multi-column layout and font size control are not standard Markdown features.*


## Per-File Analysis Results

| Filename | Size (MB) | Word Count | Unique Words | Language | Tables | Figures | Annotations |
|---|---|---|---|---|---|---|---|
| EPDAfdruk_897_59037.pdf | 0.20 | 1766 | 453 | nl | 3 | 0 | 0 |
| EPDAfdruk_897_59684.pdf | 0.19 | 1589 | 543 | nl | 3 | 0 | 0 |
| EPDAfdruk_897_60038.pdf | 0.17 | 1452 | 493 | nl | 3 | 0 | 0 |
| EPDAfdruk_897_60384.pdf | 0.17 | 1667 | 472 | nl | 2 | 0 | 0 |
| EPDAfdruk_897_60818.pdf | 0.17 | 1344 | 474 | nl | 3 | 0 | 0 |
| EPDAfdruk_897_61014.pdf | 0.20 | 1605 | 507 | nl | 4 | 0 | 0 |
| EPDAfdruk_897_61368.pdf | 0.19 | 1888 | 532 | nl | 3 | 0 | 0 |
| EPDAfdruk_897_61665.pdf | 0.16 | 1554 | 446 | nl | 2 | 0 | 0 |
| EPDAfdruk_897_61810.pdf | 0.19 | 2176 | 651 | nl | 2 | 0 | 0 |
| EPDAfdruk_897_62117.pdf | 0.16 | 1360 | 427 

In [6]:
import os
import sys
import math
import statistics
import numpy as np
from collections import Counter
import re # For tokenization
from langdetect import detect, DetectorFactory # For language detection
from langdetect.lang_detect_exception import LangDetectException
import pdfplumber # Added: For advanced PDF parsing
from io import StringIO # To capture print output
import string # For character entropy

# Ensure consistent language detection results
DetectorFactory.seed = 0

# --- Configuration ---
# Directory containing the PDF files
pdf_directory = r'D:\Dataset\Lagerugpijn\LR_EPDs' # Use a raw string for the path
output_markdown_file = 'pdf_analysis_report.md' # Name of the output Markdown file
PMI_BIGRAM_FREQ_THRESHOLD = 3 # Minimum frequency for a bigram to be included in Average PMI calculation
# Removed: JSD_EPSILON = 1e-9 # Small value to prevent log(0) in JSD calculation

# --- Suppress Warnings ---
# Suppress warnings originating from any pdfminer module (including PDFSyntaxWarning)
# This is a more general approach than filtering by a specific warning class name
# warnings.filterwarnings("ignore", module='pdfminer\\..*')
# You could add other filters here if other specific warnings are bothersome, e.g.,
# warnings.filterwarnings("ignore", message="some specific message")
warnings.filterwarnings("ignore", message="CropBox missing from /Page, defaulting to MediaBox", category=UserWarning)
# If the above doesn't catch it, the warning might be a different category or you can remove category
# warnings.filterwarnings("ignore", message="CropBox missing from /Page, defaulting to MediaBox")


# --- Data Storage ---
file_data = [] # List to store dictionaries, each containing data for one file
all_extracted_text_for_vocab = "" # String to accumulate all text for vocabulary analysis (for overall vocab and PMI)
all_extracted_chars = "" # String to accumulate all characters (for overall char entropy)

report_output = StringIO() # Use StringIO to capture print statements

# Redirect stdout to capture print output
original_stdout = sys.stdout
sys.stdout = report_output

# --- Helper Function for Text Extraction and Structural Element Detection ---
def analyze_pdf_content(pdf_path):
    """
    Extracts text and detects structural elements (tables, figures, annotations)
    from a PDF file using pdfplumber.
    Returns extracted text, table count, figure count, annotation count.
    """
    extracted_text = ""
    table_count = 0
    figure_count = 0
    annotation_count = 0

    try:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # Extract text
                extracted_text += page.extract_text() or "" # Add text, handle None

                # Detect tables
                tables = page.extract_tables()
                table_count += len(tables)

                # Detect figures (images)
                figure_count += len(page.images)

                # Detect annotations
                annotation_count += len(page.annots)

    except pdfplumber.PDFSyntaxError as e:
         # Print warnings directly to original stdout so they appear during processing
         original_stdout.write(f"Warning: PDF Syntax Error in {os.path.basename(pdf_path)}: {e}\n")
         # Return empty data for this file if there's a syntax error
         return "", 0, 0, 0
    except Exception as e:
        # Print warnings directly to original stdout
        original_stdout.write(f"Warning: Error processing {os.path.basename(pdf_path)} with pdfplumber: {e}\n")
        # Return empty data for this file if there's any other error
        return "", 0, 0, 0

    return extracted_text, table_count, figure_count, annotation_count


# --- Helper Function for Language Detection ---
def detect_language(text):
    """Detects the language of the input text."""
    if not text.strip():
        return "N/A (No text)"
    try:
        # langdetect works best on larger text samples.
        # Use a sample if the text is very long to speed things up,
        # but for typical documents, using the full text is fine.
        # Limit text length for detection to avoid potential issues with very large texts
        sample_text = text[:5000] if len(text) > 5000 else text
        if not sample_text.strip():
             return "N/A (No text in sample)"
        return detect(sample_text)
    except LangDetectException:
        return "Undetected"
    except Exception as e:
        # Print warnings directly to original stdout
        original_stdout.write(f"Warning: Error during language detection: {e}\n")
        return "Error"

# --- Helper Functions for Information Theory Metrics ---

def calculate_shannon_entropy(items):
    """Calculates Shannon entropy for a list of items (chars or words)."""
    if not items:
        return 0.0
    counts = Counter(items)
    total_items = len(items)
    entropy = 0.0
    for count in counts.values():
        probability = count / total_items
        # Add a small epsilon to probability to avoid log(0) if needed, though standard formula handles p > 0
        if probability > 0:
             entropy -= probability * math.log2(probability)
    return entropy

# Removed: calculate_kullback_leibler_divergence function
# Removed: calculate_jensen_shannon_divergence function

def calculate_avg_bigram_pmi(text, min_freq=3):
    """
    Calculates the average Pointwise Mutual Information (PMI) for word bigrams
    that occur at least min_freq times.
    A proxy metric related to Mutual Information, measuring word association strength.
    """
    if not text:
        return 0.0

    # Simple word tokenization and lowercase
    # Use the same tokenization as the main script for consistency
    words = re.findall(r'\b\w+\b', text.lower())
    if len(words) < 2:
        return 0.0

    word_counts = Counter(words)
    bigram_counts = Counter(zip(words[:-1], words[1:])) # Count occurrences of bigrams

    total_words = len(words)
    # total_bigrams = len(list(zip(words[:-1], words[1:]))) # Count actual bigram instances

    pmi_values = []
    for bigram, bigram_count in bigram_counts.items():
        # Only consider bigrams that meet the minimum frequency threshold
        if bigram_count >= min_freq:
            word1, word2 = bigram

            # Calculate probabilities (using total_words for marginals is common)
            p_w1 = word_counts[word1] / total_words if total_words > 0 else 0
            p_w2 = word_counts[word2] / total_words if total_words > 0 else 0
            # Use total words as normalization for bigram probability as well for PMI formula
            p_w1_w2 = bigram_count / total_words if total_words > 0 else 0

            # Avoid log(0) - check if probabilities are positive
            if p_w1 > 0 and p_w2 > 0 and p_w1_w2 > 0:
                 # PMI formula: log2( P(w1,w2) / (P(w1) * P(w2)) )
                 pmi = math.log2(p_w1_w2 / (p_w1 * p_w2))
                 pmi_values.append(pmi)
            # Note: Bigrams that never appear together with positive marginals would have PMI -infinity.
            # We only average over bigrams that *do* appear (with >= min_freq).

    if not pmi_values:
        return 0.0 # Return 0 if no bigrams meet min_freq or text was empty/too short

    return np.mean(pmi_values)


# --- Analysis ---
# Print initial message to original stdout
original_stdout.write(f"Analyzing PDF files in directory: {pdf_directory}\n")


if not os.path.isdir(pdf_directory):
    # Print error to original stdout
    original_stdout.write(f"Error: Directory not found at {pdf_directory}\n")
else:
    # Iterate through all entries in the directory
    for entry_name in os.listdir(pdf_directory):
        entry_path = os.path.join(pdf_directory, entry_name)

        # Check if the entry is a file and ends with .pdf (case-insensitive)
        if os.path.isfile(entry_path) and entry_name.lower().endswith('.pdf'):

            file_info = {} # Dictionary to store data for the current file
            file_info['filename'] = entry_name
            file_info['filepath'] = entry_path

            # 1. & 2. Storage Size
            try:
                file_size_bytes = os.path.getsize(entry_path) # size in bytes
                file_info['storage_size_bytes'] = file_size_bytes
                file_info['storage_size_mb'] = file_size_bytes / (1024 * 1024)
            except Exception as e:
                # Print warnings directly to original stdout
                original_stdout.write(f"Warning: Could not get size for {entry_name}: {e}\n")
                file_info['storage_size_bytes'] = 0
                file_info['storage_size_mb'] = 0


            # 3. Textual Content Size & Extraction + 7. Structural Elements
            text, table_count, figure_count, annotation_count = analyze_pdf_content(entry_path)

            file_info['extracted_text'] = text # Store text for language
            # Simple tokenization: split by whitespace and punctuation
            tokens = re.findall(r'\b\w+\b', text.lower()) # Convert to lower case for vocabulary size
            word_count = len(tokens)
            file_info['word_count'] = word_count

            # Calculate unique words for the current file
            unique_words_in_file = set(tokens)
            file_info['unique_word_count'] = len(unique_words_in_file)

            # Calculate per-file Shannon Entropy
            file_info['char_entropy'] = calculate_shannon_entropy(list(text)) # Character entropy
            file_info['word_entropy'] = calculate_shannon_entropy(tokens) # Word entropy

            # Calculate per-file Average Bigram PMI
            file_info['average_pmi'] = calculate_avg_bigram_pmi(text, min_freq=PMI_BIGRAM_FREQ_THRESHOLD)


            # Accumulate text and characters for overall vocabulary, PMI, and char entropy later
            all_extracted_text_for_vocab += text + " " # Add a space to ensure separation
            all_extracted_chars += text # Accumulate all characters


            # Store structural element counts
            file_info['table_count'] = table_count
            file_info['figure_count'] = figure_count
            file_info['annotation_count'] = annotation_count
            # Add boolean flags for easy checking
            file_info['has_tables'] = table_count > 0
            file_info['has_figures'] = figure_count > 0
            file_info['has_annotations'] = annotation_count > 0


            # 6. Language Detection
            file_info['language'] = detect_language(text)

            # 4. Information Content (Estimated)
            # Simple estimation based on file size in bits
            file_info['estimated_info_content_bits_filesize'] = file_size_bytes * 8


            file_data.append(file_info) # Add the file's data to the list


    # Removed: --- Calculate Pairwise JSD and Average JSD per file --- section


    # --- Calculate Overall Dataset Metrics ---
    overall_char_entropy = calculate_shannon_entropy(list(all_extracted_chars))

    overall_tokens = re.findall(r'\b\w+\b', all_extracted_text_for_vocab.lower())
    overall_word_entropy = calculate_shannon_entropy(overall_tokens)

    # Calculate Average Bigram PMI for the overall dataset
    # Re-using the calculate_avg_bigram_pmi function for consistency
    overall_average_pmi = calculate_avg_bigram_pmi(all_extracted_text_for_vocab, min_freq=PMI_BIGRAM_FREQ_THRESHOLD)


    # --- Per-File Reporting (Captured for Markdown) ---
    print("\n## Per-File Analysis Results") # Markdown Heading
    if file_data:
        # Create a Markdown table for per-file data
        # Replaced 'Avg JSD' with 'Avg PMI' column header
        print("\n| Filename | Size (MB) | Word Count | Unique Words | Language | Tables | Figures | Annotations | Char Entropy | Word Entropy | Avg PMI |")
        print("|---|---|---|---|---|---|---|---|---|---|---|") # Updated separator line
        for file_info in file_data:
            # Extract only the digits from the filename
            digits_in_filename = re.findall(r'\d+', file_info['filename'])
            last_five_digits = "".join(digits_in_filename)[-5:] if digits_in_filename else ""

            # Determine display filename
            display_filename = last_five_digits
            # Only add "..." if there were other characters besides the last 5 digits
            if len(file_info['filename'].replace('.', '').replace('_', '').replace('-', '').replace(' ', '')) > len(last_five_digits):
                 display_filename = "..." + display_filename


            # Replaced file_info['average_jsd'] with file_info['average_pmi'] in the row
            print(f"| {display_filename} | {file_info['storage_size_mb']:.2f} | {file_info['word_count']} | {file_info['unique_word_count']} | {file_info['language']} | {file_info['table_count']} | {file_info['figure_count']} | {file_info['annotation_count']} | {file_info['char_entropy']:.2f} | {file_info['word_entropy']:.2f} | {file_info['average_pmi']:.4f} |")

    else:
        print("No PDF files found in the specified directory.")


    # --- Overall Dataset Summary (Captured for Markdown) ---
    print("\n## Overall Dataset Summary") # Markdown Heading

    total_files = len(file_data)
    print(f"\n### 1. Number of PDF Files: {total_files}") # Markdown Subheading

    if total_files > 0:
        # 2. Storage Size
        all_storage_sizes_bytes = [f['storage_size_bytes'] for f in file_data]
        total_storage_size_bytes = sum(all_storage_sizes_bytes)
        total_storage_size_mb = total_storage_size_bytes / (1024 * 1024)
        print(f"\n### 2. Storage Size") # Markdown Subheading
        print(f"- **Total:** {total_storage_size_mb:.2f} MB ({total_storage_size_bytes} bytes)")
        if all_storage_sizes_bytes:
            avg_file_size_mb = statistics.mean(all_storage_sizes_bytes) / (1024 * 1024)
            min_file_size_mb = min(all_storage_sizes_bytes) / (1024 * 1024)
            max_file_size_mb = max(all_storage_sizes_bytes) / (1024 * 1024)
            print(f"- **Average:** {avg_file_size_mb:.2f} MB")
            print(f"- **Range:** ({min_file_size_mb:.2f} MB, {max_file_size_mb:.2f} MB)")

        # 3. Textual Content Size
        all_word_counts = [f['word_count'] for f in file_data]
        total_word_count = sum(all_word_counts)
        print(f"\n### 3. Textual Content Size") # Markdown Subheading
        print(f"- **Total words/tokens:** {total_word_count}")
        if all_word_counts:
            avg_word_count = statistics.mean(all_word_counts)
            print(f"- **Average words/tokens per document:** {avg_word_count:.2f}")

            # Calculate unique tokens (vocabulary size) from accumulated text
            all_tokens = re.findall(r'\b\w+\b', all_extracted_text_for_vocab.lower())
            unique_tokens = set(all_tokens)
            vocabulary_size = len(unique_tokens)
            print(f"- **Unique tokens across dataset (vocabulary size):** {vocabulary_size}") # Clarified label
        else:
             print(f"- **Average words/tokens per document:** 0")
             print(f"- **Unique tokens across dataset (vocabulary size):** 0")

        # 4. Information Content (Estimated)
        total_estimated_info_content_bits_filesize = total_storage_size_bytes * 8
        print(f"\n### 4. Information Content (Estimated)") # Markdown Subheading
        print(f"- **Total estimated info content (based on total file size):** {total_estimated_info_content_bits_filesize} bits")
        print(f"- *Note: This estimate is based on raw file size in bits.*")

        # 4b. Information Theory Metrics (Overall Dataset)
        print(f"\n### 4b. Information Theory Metrics (Overall Dataset)") # Markdown Subheading
        print(f"- **Overall Character Entropy:** {overall_char_entropy:.2f} bits/character")
        print(f"- **Overall Word Entropy:** {overall_word_entropy:.2f} bits/word")
        # Replaced overall average PMI calculation logic with a call to the new function
        print(f"- **Average Bigram PMI (Threshold={PMI_BIGRAM_FREQ_THRESHOLD}):** {overall_average_pmi:.4f}")
        print(f"  *Note: Average PMI is calculated for bigrams appearing at least {PMI_BIGRAM_FREQ_THRESHOLD} times across the dataset.*")


        # 5. Document Length Distribution
        print(f"\n### 5. Document Length Distribution (in words/tokens)") # Markdown Subheading
        if all_word_counts:
            mean_len = statistics.mean(all_word_counts)
            median_len = statistics.median(all_word_counts)
            # Ensure std dev calculation is valid for >1 data points
            std_dev_len = statistics.stdev(all_word_counts) if len(all_word_counts) > 1 else 0
            print(f"- **Mean length:** {mean_len:.2f}")
            print(f"- **Median length:** {median_len}")
            print(f"- **Standard deviation:** {std_dev_len:.2f}")


        # 6. Language Distribution
        print(f"\n### 6. Language Distribution") # Markdown Subheading
        all_languages = [f['language'] for f in file_data]
        language_counts = Counter(all_languages)
        print("- **Counts:**")
        for lang, count in language_counts.most_common():
            print(f"   - {lang}: {count} files")


        # 7. Structural Elements Summary
        print(f"\n### 7. Structural Elements Summary") # Markdown Subheading
        total_tables_found = sum([f['table_count'] for f in file_data])
        total_figures_found = sum([f['figure_count'] for f in file_data])
        total_annotations_found = sum([f['annotation_count'] for f in file_data])

        files_with_tables = sum([f['has_tables'] for f in file_data])
        files_with_figures = sum([f['has_figures'] for f in file_data])
        files_with_annotations = sum([f['has_annotations'] for f in file_data])

        print(f"- **Total Tables found:** {total_tables_found}")
        print(f"- **Total Figures/Images found:** {total_figures_found}")
        print(f"- **Total Annotations found:** {total_annotations_found}")
        print(f"- **Files with Tables:** {files_with_tables} ({files_with_tables/total_files*100:.2f}%)")
        print(f"- **Files with Figures/Images:** {files_with_figures} ({files_with_figures/total_files*100:.2f}%)")
        print(f"- **Files with Annotations:** {files_with_annotations} ({files_with_annotations/total_files*100:.2f}%)")


    else:
        print("No data collected from PDF files.")


# --- End of Captured Output ---
sys.stdout = original_stdout # Restore stdout
sys.stdout.flush() # Explicitly flush the buffer


# --- Display Report in Console ---
markdown_content = report_output.getvalue() # Get the captured output
print("\n" + "="*30 + " PDF Analysis Report " + "="*30) # Separator
print("\n*Note: Markdown rendering may vary depending on the viewer. Multi-column layout and font size control are not standard Markdown features.*\n")
print(markdown_content) # Print the captured report content to console
print("="*79) # Separator


# --- Save Output to Markdown File ---
try:
    with open(output_markdown_file, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    print(f"\nAnalysis report also saved to {output_markdown_file}")
except Exception as e:
    print(f"\nError saving report to {output_markdown_file}: {e}")

report_output.close() # Close the StringIO object


CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox


CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox


Analyzing PDF files in directory: D:\Dataset\Lagerugpijn\LR_EPDs


CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, def



*Note: Markdown rendering may vary depending on the viewer. Multi-column layout and font size control are not standard Markdown features.*


## Per-File Analysis Results

| Filename | Size (MB) | Word Count | Unique Words | Language | Tables | Figures | Annotations | Char Entropy | Word Entropy | Avg PMI |
|---|---|---|---|---|---|---|---|---|---|---|
| ...59037 | 0.20 | 1766 | 453 | nl | 3 | 0 | 0 | 4.89 | 8.15 | 6.9683 |
| ...59684 | 0.19 | 1589 | 543 | nl | 3 | 0 | 0 | 4.90 | 8.54 | 6.9370 |
| ...60038 | 0.17 | 1452 | 493 | nl | 3 | 0 | 0 | 4.85 | 8.39 | 6.7113 |
| ...60384 | 0.17 | 1667 | 472 | nl | 2 | 0 | 0 | 4.83 | 8.14 | 6.5174 |
| ...60818 | 0.17 | 1344 | 474 | nl | 3 | 0 | 0 | 4.93 | 8.35 | 6.2739 |
| ...61014 | 0.20 | 1605 | 507 | nl | 4 | 0 | 0 | 4.88 | 8.43 | 7.1602 |
| ...61368 | 0.19 | 1888 | 532 | nl | 3 | 0 | 0 | 4.89 | 8.32 | 6.7119 |
| ...61665 | 0.16 | 1554 | 446 | nl | 2 | 0 | 0 | 4.81 | 8.03 | 6.2304 |
| ...61810 | 0.19 | 2176 | 651 | nl | 2 | 0 | 0 | 4.83 | 8.4

In [13]:
import os
import sys
import math
import statistics
import numpy as np
from collections import Counter
import re # For tokenization
from langdetect import detect, DetectorFactory # For language detection
from langdetect.lang_detect_exception import LangDetectException
import pdfplumber # Added: For advanced PDF parsing
from io import StringIO # To capture print output
import string # For character entropy
import warnings # Added: To manage warnings
# Removed: from pdfminer.pdfparser import PDFSyntaxWarning # Removed: Specific import causing error

# Ensure consistent language detection results
DetectorFactory.seed = 0

# --- Configuration ---
# Directory containing the PDF files
pdf_directory = r'D:\Dataset\Lagerugpijn\LR_EPDs' # Use a raw string for the path
output_markdown_file = 'pdf_analysis_report.md' # Name of the output Markdown file
PMI_BIGRAM_FREQ_THRESHOLD = 3 # Minimum frequency for a bigram to be included in Average PMI calculation

# --- Suppress Warnings ---
# Suppress warnings originating from any pdfminer module (including PDFSyntaxWarning)
# This is a more general approach than filtering by a specific warning class name
warnings.filterwarnings("ignore", module='pdfminer\..*')
# You could add other filters here if other specific warnings are bothersome, e.g.,
# warnings.filterwarnings("ignore", message="some specific message")

# --- Data Storage ---
file_data = [] # List to store dictionaries, each containing data for one file
all_extracted_text_for_vocab = "" # String to accumulate all text for vocabulary analysis (for overall vocab and PMI)
all_extracted_chars = "" # String to accumulate all characters (for overall char entropy)
all_char_counts = [] # List to store character counts for each file (for overall average)

report_output = StringIO() # Use StringIO to capture print statements

# Redirect stdout to capture print output
original_stdout = sys.stdout
sys.stdout = report_output

# --- Helper Function for Text Extraction and Structural Element Detection ---
def analyze_pdf_content(pdf_path):
    """
    Extracts text and detects structural elements (tables, figures, annotations)
    from a PDF file using pdfplumber.
    Returns extracted text, table count, figure count, annotation count.
    """
    extracted_text = ""
    table_count = 0
    figure_count = 0
    annotation_count = 0

    try:
        # Use warnings.catch_warnings to temporarily manage warnings within this function if needed,
        # but filtering globally at the start is simpler for this case.
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # Extract text
                extracted_text += page.extract_text() or "" # Add text, handle None

                # Detect tables
                tables = page.extract_tables()
                table_count += len(tables)

                # Detect figures (images)
                figure_count += len(page.images)

                # Detect annotations
                annotation_count += len(page.annots)

    except pdfplumber.PDFSyntaxError as e:
         # Print warnings directly to original stdout so they appear during processing
         original_stdout.write(f"Warning: PDF Syntax Error in {os.path.basename(pdf_path)}: {e}\n")
         # Return empty data for this file if there's a syntax error
         return "", 0, 0, 0
    except Exception as e:
        # Print warnings directly to original stdout
        original_stdout.write(f"Warning: Error processing {os.path.basename(pdf_path)} with pdfplumber: {e}\n")
        # Return empty data for this file if there's any other error
        return "", 0, 0, 0

    return extracted_text, table_count, figure_count, annotation_count


# --- Helper Function for Language Detection ---
def detect_language(text):
    """Detects the language of the input text."""
    if not text.strip():
        return "N/A (No text)"
    try:
        # langdetect works best on larger text samples.
        # Use a sample if the text is very long to speed things up,
        # but for typical documents, using the full text is fine.
        # Limit text length for detection to avoid potential issues with very large texts
        sample_text = text[:5000] if len(text) > 5000 else text
        if not sample_text.strip():
             return "N/A (No text in sample)"
        # Use warnings.catch_warnings to temporarily manage warnings within this function if needed
        with warnings.catch_warnings():
             # langdetect can sometimes issue warnings, e.g., about language probabilities
             warnings.filterwarnings("ignore", category=UserWarning) # Example: suppress UserWarnings from langdetect
             return detect(sample_text)
    except LangDetectException:
        return "Undetected"
    except Exception as e:
        # Print warnings directly to original stdout
        original_stdout.write(f"Warning: Error during language detection: {e}\n")
        return "Error"

# --- Helper Functions for Information Theory Metrics ---

def calculate_shannon_entropy(items):
    """Calculates Shannon entropy for a list of items (chars or words)."""
    if not items:
        return 0.0
    counts = Counter(items)
    total_items = len(items)
    entropy = 0.0
    for count in counts.values():
        probability = count / total_items
        # Add a small epsilon to probability to avoid log(0) if needed, though standard formula handles p > 0
        if probability > 0:
             entropy -= probability * math.log2(probability)
    return entropy

# Removed: calculate_kullback_leibler_divergence function
# Removed: calculate_jensen_shannon_divergence function

def calculate_avg_bigram_pmi(text, min_freq=3):
    """
    Calculates the average Pointwise Mutual Information (PMI) for word bigrams
    that occur at least min_freq times.
    A proxy metric related to Mutual Information, measuring word association strength.
    """
    if not text:
        return 0.0

    # Simple word tokenization and lowercase
    # Use the same tokenization as the main script for consistency
    words = re.findall(r'\b\w+\b', text.lower())
    if len(words) < 2:
        return 0.0

    word_counts = Counter(words)
    bigram_counts = Counter(zip(words[:-1], words[1:])) # Count occurrences of bigrams

    total_words = len(words)
    # total_bigrams = len(list(zip(words[:-1], words[1:]))) # Count actual bigram instances

    pmi_values = []
    for bigram, bigram_count in bigram_counts.items():
        # Only consider bigrams that meet the minimum frequency threshold
        if bigram_count >= min_freq:
            word1, word2 = bigram

            # Calculate probabilities (using total_words for marginals is common)
            p_w1 = word_counts[word1] / total_words if total_words > 0 else 0
            p_w2 = word_counts[word2] / total_words if total_words > 0 else 0
            # Use total words as normalization for bigram probability as well for PMI formula
            p_w1_w2 = bigram_count / total_words if total_words > 0 else 0

            # Avoid log(0) - check if probabilities are positive
            if p_w1 > 0 and p_w2 > 0 and p_w1_w2 > 0:
                 # PMI formula: log2( P(w1,w2) / (P(w1) * P(w2)) )
                 pmi = math.log2(p_w1_w2 / (p_w1 * p_w2))
                 pmi_values.append(pmi)
            # Note: Bigrams that never appear together with positive marginals would have PMI -infinity.
            # We only average over bigrams that *do* appear (with >= min_freq).

    if not pmi_values:
        return 0.0 # Return 0 if no bigrams meet min_freq or text was empty/too short

    return np.mean(pmi_values)


# --- Analysis ---
# Print initial message to original stdout
original_stdout.write(f"Analyzing PDF files in directory: {pdf_directory}\n")


if not os.path.isdir(pdf_directory):
    # Print error to original stdout
    original_stdout.write(f"Error: Directory not found at {pdf_directory}\n")
else:
    # Iterate through all entries in the directory
    for entry_name in os.listdir(pdf_directory):
        entry_path = os.path.join(pdf_directory, entry_name)

        # Check if the entry is a file and ends with .pdf (case-insensitive)
        if os.path.isfile(entry_path) and entry_name.lower().endswith('.pdf'):

            file_info = {} # Dictionary to store data for the current file
            file_info['filename'] = entry_name
            file_info['filepath'] = entry_path

            # 1. & 2. Storage Size
            try:
                file_size_bytes = os.path.getsize(entry_path) # size in bytes
                file_info['storage_size_bytes'] = file_size_bytes
                file_info['storage_size_mb'] = file_size_bytes / (1024 * 1024)
            except Exception as e:
                # Print warnings directly to original stdout
                original_stdout.write(f"Warning: Could not get size for {entry_name}: {e}\n")
                file_info['storage_size_bytes'] = 0
                file_info['storage_size_mb'] = 0


            # 3. Textual Content Size & Extraction + 7. Structural Elements
            text, table_count, figure_count, annotation_count = analyze_pdf_content(entry_path)

            file_info['extracted_text'] = text # Store text for language
            # Calculate character count for the current file
            file_info['char_count'] = len(text)
            all_char_counts.append(file_info['char_count']) # Add to list for overall average

            # Simple word tokenization and lowercase
            tokens = re.findall(r'\b\w+\b', text.lower()) # Convert to lower case for vocabulary size
            word_count = len(tokens)
            file_info['word_count'] = word_count

            # Calculate unique words for the current file
            unique_words_in_file = set(tokens)
            file_info['unique_word_count'] = len(unique_words_in_file)

            # Calculate per-file Shannon Entropy
            file_info['char_entropy'] = calculate_shannon_entropy(list(text)) # Character entropy
            file_info['word_entropy'] = calculate_shannon_entropy(tokens) # Word entropy

            # Calculate per-file Average Bigram PMI
            file_info['average_pmi'] = calculate_avg_bigram_pmi(text, min_freq=PMI_BIGRAM_FREQ_THRESHOLD)


            # Accumulate text and characters for overall vocabulary, PMI, and char entropy later
            all_extracted_text_for_vocab += text + " " # Add a space to ensure separation
            all_extracted_chars += text # Accumulate all characters


            # Store structural element counts
            file_info['table_count'] = table_count
            file_info['figure_count'] = figure_count
            file_info['annotation_count'] = annotation_count
            # Add boolean flags for easy checking
            file_info['has_tables'] = table_count > 0
            file_info['has_figures'] = figure_count > 0
            file_info['has_annotations'] = annotation_count > 0


            # 6. Language Detection
            file_info['language'] = detect_language(text)

            # 4. Information Content (Estimated)
            # Simple estimation based on file size in bits
            file_info['estimated_info_content_bits_filesize'] = file_size_bytes * 8


            file_data.append(file_info) # Add the file's data to the list


    # Removed: --- Calculate Pairwise JSD and Average JSD per file --- section


    # --- Calculate Overall Dataset Metrics ---
    overall_char_entropy = calculate_shannon_entropy(list(all_extracted_chars))

    overall_tokens = re.findall(r'\b\w+\b', all_extracted_text_for_vocab.lower())
    overall_word_entropy = calculate_shannon_entropy(overall_tokens)

    # Calculate Average Bigram PMI for the overall dataset
    # Re-using the calculate_avg_bigram_pmi function for consistency
    overall_average_pmi = calculate_avg_bigram_pmi(all_extracted_text_for_vocab, min_freq=PMI_BIGRAM_FREQ_THRESHOLD)

    # Calculate Overall Average Document Length (Characters)
    overall_avg_doc_length_chars = np.mean(all_char_counts) if all_char_counts else 0


    # --- Per-File Reporting (Captured for Markdown) ---
    print("\n## Per-File Analysis Results") # Markdown Heading
    if file_data:
        # Create a Markdown table for per-file data
        # Added 'doc length' column header
        print("\n| Filename | Size (MB) | Word Count | Unique Words | doc length | Language | Tables | Figures | Annotations | Char Entropy | Word Entropy | Avg PMI |")
        # Updated separator line to match the new column count
        print("|---|---|---|---|---|---|---|---|---|---|---|---|")
        for file_info in file_data:
            # Extract only the digits from the filename
            digits_in_filename = re.findall(r'\d+', file_info['filename'])
            last_five_digits = "".join(digits_in_filename)[-5:] if digits_in_filename else ""

            # Determine display filename
            display_filename = last_five_digits
            # Only add "..." if there were other characters besides the last 5 digits
            if len(file_info['filename'].replace('.', '').replace('_', '').replace('-', '').replace(' ', '')) > len(last_five_digits):
                 display_filename = "..." + display_filename


            # Added file_info['char_count'] to the row
            print(f"| {display_filename} | {file_info['storage_size_mb']:.2f} | {file_info['word_count']} | {file_info['unique_word_count']} | {file_info['char_count']} | {file_info['language']} | {file_info['table_count']} | {file_info['figure_count']} | {file_info['annotation_count']} | {file_info['char_entropy']:.2f} | {file_info['word_entropy']:.2f} | {file_info['average_pmi']:.4f} |")

    else:
        print("No PDF files found in the specified directory.")


    # --- Overall Dataset Summary (Captured for Markdown) ---
    print("\n## Overall Dataset Summary") # Markdown Heading

    total_files = len(file_data)
    print(f"\n### 1. Number of PDF Files: {total_files}") # Markdown Subheading

    if total_files > 0:
        # 2. Storage Size
        all_storage_sizes_bytes = [f['storage_size_bytes'] for f in file_data]
        total_storage_size_bytes = sum(all_storage_sizes_bytes)
        total_storage_size_mb = total_storage_size_bytes / (1024 * 1024)
        print(f"\n### 2. Storage Size") # Markdown Subheading
        print(f"- **Total:** {total_storage_size_mb:.2f} MB ({total_storage_size_bytes} bytes)")
        if all_storage_sizes_bytes:
            avg_file_size_mb = statistics.mean(all_storage_sizes_bytes) / (1024 * 1024)
            min_file_size_mb = min(all_storage_sizes_bytes) / (1024 * 1024)
            max_file_size_mb = max(all_storage_sizes_bytes) / (1024 * 1024)
            print(f"- **Average:** {avg_file_size_mb:.2f} MB")
            print(f"- **Range:** ({min_file_size_mb:.2f} MB, {max_file_size_mb:.2f} MB)")

        # 3. Textual Content Size
        all_word_counts = [f['word_count'] for f in file_data]
        total_word_count = sum(all_word_counts)
        print(f"\n### 3. Textual Content Size") # Markdown Subheading
        print(f"- **Total words/tokens:** {total_word_count}")
        if all_word_counts:
            avg_word_count = statistics.mean(all_word_counts)
            print(f"- **Average words/tokens per document:** {avg_word_count:.2f}")

            # Calculate unique tokens (vocabulary size) from accumulated text
            all_tokens = re.findall(r'\b\w+\b', all_extracted_text_for_vocab.lower())
            unique_tokens = set(all_tokens)
            vocabulary_size = len(unique_tokens)
            print(f"- **Unique tokens across dataset (vocabulary size):** {vocabulary_size}") # Clarified label
        else:
             print(f"- **Average words/tokens per document:** 0")
             print(f"- **Unique tokens across dataset (vocabulary size):** 0")

        # 4. Information Content (Estimated)
        total_estimated_info_content_bits_filesize = total_storage_size_bytes * 8
        print(f"\n### 4. Information Content (Estimated)") # Markdown Subheading
        print(f"- **Total estimated info content (based on total file size):** {total_estimated_info_content_bits_filesize} bits")
        print(f"- *Note: This estimate is based on raw file size in bits.*")

        # 4b. Information Theory Metrics (Overall Dataset)
        print(f"\n### 4b. Information Theory Metrics (Overall Dataset)") # Markdown Subheading
        print(f"- **Overall Character Entropy:** {overall_char_entropy:.2f} bits/character")
        print(f"- **Overall Word Entropy:** {overall_word_entropy:.2f} bits/word")
        # Replaced overall average PMI calculation logic with a call to the new function
        print(f"- **Average Bigram PMI (Threshold={PMI_BIGRAM_FREQ_THRESHOLD}):** {overall_average_pmi:.4f}")
        print(f"  *Note: Average PMI is calculated for bigrams appearing at least {PMI_BIGRAM_FREQ_THRESHOLD} times across the dataset.*")

        # Added Overall Average Document Length (Characters)
        print(f"- **Overall Average Document Length (Characters):** {overall_avg_doc_length_chars:.2f}")


        # 5. Document Length Distribution
        print(f"\n### 5. Document Length Distribution (in words/tokens)") # Markdown Subheading
        if all_word_counts:
            mean_len = statistics.mean(all_word_counts)
            median_len = statistics.median(all_word_counts)
            # Ensure std dev calculation is valid for >1 data points
            std_dev_len = statistics.stdev(all_word_counts) if len(all_word_counts) > 1 else 0
            print(f"- **Mean length:** {mean_len:.2f}")
            print(f"- **Median length:** {median_len}")
            print(f"- **Standard deviation:** {std_dev_len:.2f}")


        # 6. Language Distribution
        print(f"\n### 6. Language Distribution") # Markdown Subheading
        all_languages = [f['language'] for f in file_data]
        language_counts = Counter(all_languages)
        print("- **Counts:**")
        for lang, count in language_counts.most_common():
            print(f"   - {lang}: {count} files")


        # 7. Structural Elements Summary
        print(f"\n### 7. Structural Elements Summary") # Markdown Subheading
        total_tables_found = sum([f['table_count'] for f in file_data])
        total_figures_found = sum([f['figure_count'] for f in file_data])
        total_annotations_found = sum([f['annotation_count'] for f in file_data])

        files_with_tables = sum([f['has_tables'] for f in file_data])
        files_with_figures = sum([f['has_figures'] for f in file_data])
        files_with_annotations = sum([f['has_annotations'] for f in file_data])

        print(f"- **Total Tables found:** {total_tables_found}")
        print(f"- **Total Figures/Images found:** {total_figures_found}")
        print(f"- **Total Annotations found:** {total_annotations_found}")
        print(f"- **Files with Tables:** {files_with_tables} ({files_with_tables/total_files*100:.2f}%)")
        print(f"- **Files with Figures/Images:** {files_with_figures} ({files_with_figures/total_files*100:.2f}%)")
        print(f"- **Files with Annotations:** {files_with_annotations} ({files_with_annotations/total_files*100:.2f}%)")


    else:
        print("No data collected from PDF files.")


# --- End of Captured Output ---
sys.stdout = original_stdout # Restore stdout
sys.stdout.flush() # Explicitly flush the buffer


# --- Display Report in Console ---
markdown_content = report_output.getvalue() # Get the captured output
print("\n" + "="*30 + " PDF Analysis Report " + "="*30) # Separator
print("\n*Note: Markdown rendering may vary depending on the viewer. Multi-column layout and font size control are not standard Markdown features.*\n")
print(markdown_content) # Print the captured report content to console
print("="*79) # Separator


# --- Save Output to Markdown File ---
try:
    with open(output_markdown_file, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    print(f"\nAnalysis report also saved to {output_markdown_file}")
except Exception as e:
    print(f"\nError saving report to {output_markdown_file}: {e}")

report_output.close() # Close the StringIO object


CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox


Analyzing PDF files in directory: D:\Dataset\Lagerugpijn\LR_EPDs


CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, def



*Note: Markdown rendering may vary depending on the viewer. Multi-column layout and font size control are not standard Markdown features.*


## Per-File Analysis Results

| Filename | Size (MB) | Word Count | Unique Words | doc length | Language | Tables | Figures | Annotations | Char Entropy | Word Entropy | Avg PMI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ...59037 | 0.20 | 1766 | 453 | 12081 | nl | 3 | 0 | 0 | 4.89 | 8.15 | 6.9683 |
| ...59684 | 0.19 | 1589 | 543 | 11178 | nl | 3 | 0 | 0 | 4.90 | 8.54 | 6.9370 |
| ...60038 | 0.17 | 1452 | 493 | 10095 | nl | 3 | 0 | 0 | 4.85 | 8.39 | 6.7113 |
| ...60384 | 0.17 | 1667 | 472 | 10761 | nl | 2 | 0 | 0 | 4.83 | 8.14 | 6.5174 |
| ...60818 | 0.17 | 1344 | 474 | 9412 | nl | 3 | 0 | 0 | 4.93 | 8.35 | 6.2739 |
| ...61014 | 0.20 | 1605 | 507 | 11118 | nl | 4 | 0 | 0 | 4.88 | 8.43 | 7.1602 |
| ...61368 | 0.19 | 1888 | 532 | 12109 | nl | 3 | 0 | 0 | 4.89 | 8.32 | 6.7119 |
| ...61665 | 0.16 | 1554 | 446 | 10287 | nl | 2 | 0 | 0 | 4.8

In [None]:
import os
import sys
import math
import statistics
import numpy as np
from collections import Counter
import re # For tokenization
from langdetect import detect, DetectorFactory # For language detection
from langdetect.lang_detect_exception import LangDetectException
import pdfplumber # Added: For advanced PDF parsing
from io import StringIO # To capture print output
import string # For character entropy
import warnings # Added: To manage warnings
# Removed: from pdfminer.pdfparser import PDFSyntaxWarning # Removed: Specific import causing error
from scipy.stats import entropy # Added: For KL divergence calculation

# Ensure consistent language detection results
DetectorFactory.seed = 0

# --- Configuration ---
# Directory containing the PDF files
pdf_directory = r'D:\Dataset\Lagerugpijn\LR_EPDs' # Use a raw string for the path
output_markdown_file = 'pdf_analysis_report.md' # Name of the output Markdown file
PMI_BIGRAM_FREQ_THRESHOLD = 3 # Minimum frequency for a bigram to be included in Average PMI calculation
# Removed: JSD_EPSILON = 1e-9 # Small value used in old JSD calculation

# --- Suppress Warnings ---
# Suppress warnings originating from any pdfminer module (including PDFSyntaxWarning)
# This is a more general approach than filtering by a specific warning class name
warnings.filterwarnings("ignore", module='pdfminer\..*')
# You could add other filters here if other specific warnings are bothersable, e.g.,
# warnings.filterwarnings("ignore", message="some specific message")

# --- Data Storage ---
file_data = [] # List to store dictionaries, each containing data for one file
all_extracted_text_for_vocab = "" # String to accumulate all text for vocabulary analysis (for overall vocab and PMI)
all_extracted_chars = "" # String to accumulate all characters (for overall char entropy)
all_char_counts = [] # List to store character counts for each file (for overall average)

report_output = StringIO() # Use StringIO to capture print statements

# Redirect stdout to capture print output
original_stdout = sys.stdout
sys.stdout = report_output

# --- Helper Function for Text Extraction and Structural Element Detection ---
def analyze_pdf_content(pdf_path):
    """
    Extracts text and detects structural elements (tables, figures, annotations)
    from a PDF file using pdfplumber.
    Returns extracted text, table count, figure count, annotation count.
    """
    extracted_text = ""
    table_count = 0
    figure_count = 0
    annotation_count = 0

    try:
        # Use warnings.catch_warnings to temporarily manage warnings within this function if needed,
        # but filtering globally at the start is simpler for this case.
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                # Extract text
                extracted_text += page.extract_text() or "" # Add text, handle None

                # Detect tables
                tables = page.extract_tables()
                table_count += len(tables)

                # Detect figures (images)
                figure_count += len(page.images)

                # Detect annotations
                annotation_count += len(page.annots)

    except pdfplumber.PDFSyntaxError as e:
         # Print warnings directly to original stdout so they appear during processing
         original_stdout.write(f"Warning: PDF Syntax Error in {os.path.basename(pdf_path)}: {e}\n")
         # Return empty data for this file if there's a syntax error
         return "", 0, 0, 0
    except Exception as e:
        # Print warnings directly to original stdout
        original_stdout.write(f"Warning: Error processing {os.path.basename(pdf_path)} with pdfplumber: {e}\n")
        # Return empty data for this file if there's any other error
        return "", 0, 0, 0

    return extracted_text, table_count, figure_count, annotation_count


# --- Helper Function for Language Detection ---
def detect_language(text):
    """Detects the language of the input text."""
    if not text.strip():
        return "N/A (No text)"
    try:
        # langdetect works best on larger text samples.
        # Use a sample if the text is very long to speed things up,
        # but for typical documents, using the full text is fine.
        # Limit text length for detection to avoid potential issues with very large texts
        sample_text = text[:5000] if len(text) > 5000 else text
        if not sample_text.strip():
             return "N/A (No text in sample)"
        # Use warnings.catch_warnings to temporarily manage warnings within this function if needed
        with warnings.catch_warnings():
             # langdetect can sometimes issue warnings, e.g., about language probabilities
             warnings.filterwarnings("ignore", category=UserWarning) # Example: suppress UserWarnings from langdetect
             return detect(sample_text)
    except LangDetectException:
        return "Undetected"
    except Exception as e:
        # Print warnings directly to original stdout
        original_stdout.write(f"Warning: Error during language detection: {e}\n")
        return "Error"

# --- Helper Functions for Information Theory Metrics ---

def calculate_shannon_entropy(items):
    """Calculates Shannon entropy for a list of items (chars or words)."""
    if not items:
        return 0.0
    counts = Counter(items)
    total_items = len(items)
    entropy = 0.0
    for count in counts.values():
        probability = count / total_items
        # Add a small epsilon to probability to avoid log(0) if needed, though standard formula handles p > 0
        if probability > 0:
             entropy -= probability * math.log2(probability)
    return entropy

# Removed: calculate_kullback_leibler_divergence function
# Removed: calculate_jensen_shannon_divergence function

def calculate_kl_divergence(text1, text2, unit='word'):
    """
    Calculates the Jensen-Shannon Divergence (JSD) between the distributions
    of tokens (chars or words) in two texts.
    Uses scipy.stats.entropy for KL calculation.
    """
    if not text1 or not text2:
        return np.nan # Cannot compute divergence with empty text

    if unit == 'char':
        tokens1 = list(text1)
        tokens2 = list(text2)
    elif unit == 'word':
        # Simple word tokenization and lowercase
        tokens1 = re.findall(r'\b\w+\b', text1.lower()) # Use consistent tokenization
        tokens2 = re.findall(r'\b\w+\b', text2.lower()) # Use consistent tokenization
    else:
        raise ValueError("Unit must be 'char' or 'word'")

    if not tokens1 or not tokens2:
          return np.nan

    # Build a combined vocabulary
    vocab = list(set(tokens1 + tokens2))
    vocab_size = len(vocab)

    # Create frequency distributions
    counts1 = Counter(tokens1)
    counts2 = Counter(tokens2)

    # Create probability distributions over the combined vocabulary
    # Add a small smoothing value to avoid zero probabilities, which cause log(0) issues
    smoothing = 1e-9
    p1 = np.array([counts1.get(token, 0) + smoothing for token in vocab])
    p2 = np.array([counts2.get(token, 0) + smoothing for token in vocab])

    # Normalize to get probability distributions
    p1 = p1 / p1.sum()
    p2 = p2 / p2.sum()

    # Calculate KL Divergence using scipy.stats.entropy
    # entropy(pk, qk) calculates KL(pk || qk)
    kl_pq = entropy(p1, qk=p2, base=2) # Use base=2 for bits
    kl_qp = entropy(p2, qk=p1, base=2) # Use base=2 for bits


    # Calculate Jensen-Shannon Divergence (JSD) - symmetric and bounded
    # JSD = 0.5 * (KL(P || Q) + KL(Q || P))
    jsd = 0.5 * (kl_pq + kl_qp)

    return jsd


def calculate_avg_bigram_pmi(text, min_freq=3):
    """
    Calculates the average Pointwise Mutual Information (PMI) for word bigrams
    that occur at least min_freq times.
    A proxy metric related to Mutual Information, measuring word association strength.
    """
    if not text:
        return 0.0

    # Simple word tokenization and lowercase
    # Use the same tokenization as the main script for consistency
    words = re.findall(r'\b\w+\b', text.lower())
    if len(words) < 2:
        return 0.0

    word_counts = Counter(words)
    bigram_counts = Counter(zip(words[:-1], words[1:])) # Count occurrences of bigrams

    total_words = len(words)
    # total_bigrams = len(list(zip(words[:-1], words[1:]))) # Count actual bigram instances

    pmi_values = []
    for bigram, bigram_count in bigram_counts.items():
        # Only consider bigrams that meet the minimum frequency threshold
        if bigram_count >= min_freq:
            word1, word2 = bigram

            # Calculate probabilities (using total_words for marginals is common)
            p_w1 = word_counts[word1] / total_words if total_words > 0 else 0
            p_w2 = word_counts[word2] / total_words if total_words > 0 else 0
            # Use total words as normalization for bigram probability as well for PMI formula
            p_w1_w2 = bigram_count / total_words if total_words > 0 else 0

            # Avoid log(0) - check if probabilities are positive
            if p_w1 > 0 and p_w2 > 0 and p_w1_w2 > 0:
                 # PMI formula: log2( P(w1,w2) / (P(w1) * P(w2)) )
                 pmi = math.log2(p_w1_w2 / (p_w1 * p_w2))
                 pmi_values.append(pmi)
            # Note: Bigrams that never appear together with positive marginals would have PMI -infinity.
            # We only average over bigrams that *do* appear (with >= min_freq).

    if not pmi_values:
        return 0.0 # Return 0 if no bigrams meet min_freq or text was empty/too short

    return np.mean(pmi_values)


# --- Analysis ---
# Print initial message to original stdout
original_stdout.write(f"Analyzing PDF files in directory: {pdf_directory}\n")


if not os.path.isdir(pdf_directory):
    # Print error to original stdout
    original_stdout.write(f"Error: Directory not found at {pdf_directory}\n")
else:
    # Iterate through all entries in the directory
    for entry_name in os.listdir(pdf_directory):
        entry_path = os.path.join(pdf_directory, entry_name)

        # Check if the entry is a file and ends with .pdf (case-insensitive)
        if os.path.isfile(entry_path) and entry_name.lower().endswith('.pdf'):

            file_info = {} # Dictionary to store data for the current file
            file_info['filename'] = entry_name
            file_info['filepath'] = entry_path

            # 1. & 2. Storage Size
            try:
                file_size_bytes = os.path.getsize(entry_path) # size in bytes
                file_info['storage_size_bytes'] = file_size_bytes
                file_info['storage_size_mb'] = file_size_bytes / (1024 * 1024)
            except Exception as e:
                # Print warnings directly to original stdout
                original_stdout.write(f"Warning: Could not get size for {entry_name}: {e}\n")
                file_info['storage_size_bytes'] = 0
                file_info['storage_size_mb'] = 0


            # 3. Textual Content Size & Extraction + 7. Structural Elements
            text, table_count, figure_count, annotation_count = analyze_pdf_content(entry_path)

            file_info['extracted_text'] = text # Store text for language
            # Calculate character count for the current file
            file_info['char_count'] = len(text)
            all_char_counts.append(file_info['char_count']) # Add to list for overall average

            # Simple word tokenization and lowercase
            tokens = re.findall(r'\b\w+\b', text.lower()) # Convert to lower case for vocabulary size
            word_count = len(tokens)
            file_info['word_count'] = word_count
            file_info['tokens'] = tokens # Store tokens for JSD calculation


            # Calculate unique words for the current file
            unique_words_in_file = set(tokens)
            file_info['unique_word_count'] = len(unique_words_in_file)

            # Calculate per-file Shannon Entropy
            file_info['char_entropy'] = calculate_shannon_entropy(list(text)) # Character entropy
            file_info['word_entropy'] = calculate_shannon_entropy(tokens) # Word entropy

            # Calculate per-file Average Bigram PMI
            file_info['average_pmi'] = calculate_avg_bigram_pmi(text, min_freq=PMI_BIGRAM_FREQ_THRESHOLD)


            # Accumulate text and characters for overall vocabulary, PMI, and char entropy later
            all_extracted_text_for_vocab += text + " " # Add a space to ensure separation
            all_extracted_chars += text # Accumulate all characters


            # Store structural element counts
            file_info['table_count'] = table_count
            file_info['figure_count'] = figure_count
            file_info['annotation_count'] = annotation_count
            # Add boolean flags for easy checking
            file_info['has_tables'] = table_count > 0
            file_info['has_figures'] = figure_count > 0
            file_info['has_annotations'] = annotation_count > 0


            # 6. Language Detection
            file_info['language'] = detect_language(text)

            # 4. Information Content (Estimated)
            # Simple estimation based on file size in bits
            file_info['estimated_info_content_bits_filesize'] = file_size_bytes * 8


            file_data.append(file_info) # Add the file's data to the list


    # --- Calculate Overall Dataset Metrics (needed for per-file JSD) ---
    overall_char_entropy = calculate_shannon_entropy(list(all_extracted_chars))

    overall_tokens = re.findall(r'\b\w+\b', all_extracted_text_for_vocab.lower())
    overall_word_counts = Counter(overall_tokens)
    total_words_overall = len(overall_tokens)
    overall_word_distribution = {word: count / total_words_overall for word, count in overall_word_counts.items()} if total_words_overall > 0 else {}
    overall_vocabulary = set(overall_tokens)


    overall_word_entropy = calculate_shannon_entropy(overall_tokens)

    # Calculate Average Bigram PMI for the overall dataset
    overall_average_pmi = calculate_avg_bigram_pmi(all_extracted_text_for_vocab, min_freq=PMI_BIGRAM_FREQ_THRESHOLD)

    # Calculate Overall Average Document Length (Characters)
    overall_avg_doc_length_chars = np.mean(all_char_counts) if all_char_counts else 0

    # --- Calculate Per-File JSD (compared to overall dataset distribution) ---
    if file_data and overall_word_distribution:
        for file_info in file_data:
            file_tokens = file_info['tokens']
            file_word_counts = Counter(file_tokens)
            total_file_words = len(file_tokens)
            file_word_distribution = {word: count / total_file_words for word, count in file_word_counts.items()} if total_file_words > 0 else {}

            # Calculate JSD between file distribution and overall distribution
            # Use the provided calculate_kl_divergence function for JSD
            jsd_value = calculate_kl_divergence(file_info['extracted_text'], all_extracted_text_for_vocab, unit='word')
            file_info['js_dist'] = jsd_value
    else:
        # Set JSD to NaN if no data or no overall distribution (consistent with calculate_kl_divergence)
        for file_info in file_data:
             file_info['js_dist'] = np.nan


    # --- Per-File Reporting (Captured for Markdown) ---
    print("\n## Per-File Analysis Results") # Markdown Heading
    if file_data:
        # Create a Markdown table for per-file data
        # Added 'JS Dist' column header
        print("\n| Filename | Size (MB) | Word Count | Unique Words | doc length | Language | Tables | Figures | Annotations | Char Entropy | Word Entropy | Avg PMI | JS Dist |")
        # Updated separator line to match the new column count
        print("|---|---|---|---|---|---|---|---|---|---|---|---|---|")
        for file_info in file_data:
            # Extract only the digits from the filename
            digits_in_filename = re.findall(r'\d+', file_info['filename'])
            last_five_digits = "".join(digits_in_filename)[-5:] if digits_in_filename else ""

            # Determine display filename
            display_filename = last_five_digits
            # Only add "..." if there were other characters besides the last 5 digits
            if len(file_info['filename'].replace('.', '').replace('_', '').replace('-', '').replace(' ', '')) > len(last_five_digits):
                 display_filename = "..." + display_filename

            # Format JSD value, handle NaN
            js_dist_formatted = f"{file_info['js_dist']:.4f}" if not np.isnan(file_info['js_dist']) else "NaN"

            # Added file_info['js_dist'] to the row
            print(f"| {display_filename} | {file_info['storage_size_mb']:.2f} | {file_info['word_count']} | {file_info['unique_word_count']} | {file_info['char_count']} | {file_info['language']} | {file_info['table_count']} | {file_info['figure_count']} | {file_info['annotation_count']} | {file_info['char_entropy']:.2f} | {file_info['word_entropy']:.2f} | {file_info['average_pmi']:.4f} | {js_dist_formatted} |")

    else:
        print("No PDF files found in the specified directory.")


    # --- Overall Dataset Summary (Captured for Markdown) ---
    print("\n## Overall Dataset Summary") # Markdown Heading

    total_files = len(file_data)
    print(f"\n### 1. Number of PDF Files: {total_files}") # Markdown Subheading

    if total_files > 0:
        # 2. Storage Size
        all_storage_sizes_bytes = [f['storage_size_bytes'] for f in file_data]
        total_storage_size_bytes = sum(all_storage_sizes_bytes)
        total_storage_size_mb = total_storage_size_bytes / (1024 * 1024)
        print(f"\n### 2. Storage Size") # Markdown Subheading
        print(f"- **Total:** {total_storage_size_mb:.2f} MB ({total_storage_size_bytes} bytes)")
        if all_storage_sizes_bytes:
            avg_file_size_mb = statistics.mean(all_storage_sizes_bytes) / (1024 * 1024)
            min_file_size_mb = min(all_storage_sizes_bytes) / (1024 * 1024)
            max_file_size_mb = max(all_storage_sizes_bytes) / (1024 * 1024)
            print(f"- **Average:** {avg_file_size_mb:.2f} MB")
            print(f"- **Range:** ({min_file_size_mb:.2f} MB, {max_file_size_mb:.2f} MB)")

        # 3. Textual Content Size
        all_word_counts = [f['word_count'] for f in file_data]
        total_word_count = sum(all_word_counts)
        print(f"\n### 3. Textual Content Size") # Markdown Subheading
        print(f"- **Total words/tokens:** {total_word_count}")
        if all_word_counts:
            avg_word_count = statistics.mean(all_word_counts)
            print(f"- **Average words/tokens per document:** {avg_word_count:.2f}")

            # Calculate unique tokens (vocabulary size) from accumulated text
            all_tokens = re.findall(r'\b\w+\b', all_extracted_text_for_vocab.lower())
            unique_tokens = set(all_tokens)
            vocabulary_size = len(unique_tokens)
            print(f"- **Unique tokens across dataset (vocabulary size):** {vocabulary_size}") # Clarified label
        else:
             print(f"- **Average words/tokens per document:** 0")
             print(f"- **Unique tokens across dataset (vocabulary size):** 0")

        # 4. Information Content (Estimated)
        total_estimated_info_content_bits_filesize = total_storage_size_bytes * 8
        print(f"\n### 4. Information Content (Estimated)") # Markdown Subheading
        print(f"- **Total estimated info content (based on total file size):** {total_estimated_info_content_bits_filesize} bits")
        print(f"- *Note: This estimate is based on raw file size in bits.*")

        # 4b. Information Theory Metrics (Overall Dataset)
        print(f"\n### 4b. Information Theory Metrics (Overall Dataset)") # Markdown Subheading
        print(f"- **Overall Character Entropy:** {overall_char_entropy:.2f} bits/character")
        print(f"- **Overall Word Entropy:** {overall_word_entropy:.2f} bits/word")
        print(f"- **Average Bigram PMI (Threshold={PMI_BIGRAM_FREQ_THRESHOLD}):** {overall_average_pmi:.4f}")
        print(f"  *Note: Average PMI is calculated for bigrams appearing at least {PMI_BIGRAM_FREQ_THRESHOLD} times across the dataset.*")
        print(f"  *Note: Per-file JSD is calculated against the overall dataset word distribution.*")


        # Added Overall Average Document Length (Characters)
        print(f"- **Overall Average Document Length (Characters):** {overall_avg_doc_length_chars:.2f}")


        # 5. Document Length Distribution
        print(f"\n### 5. Document Length Distribution (in words/tokens)") # Markdown Subheading
        if all_word_counts:
            mean_len = statistics.mean(all_word_counts)
            median_len = statistics.median(all_word_counts)
            # Ensure std dev calculation is valid for >1 data points
            std_dev_len = statistics.stdev(all_word_counts) if len(all_word_counts) > 1 else 0
            print(f"- **Mean length:** {mean_len:.2f}")
            print(f"- **Median length:** {median_len}")
            print(f"- **Standard deviation:** {std_dev_len:.2f}")


        # 6. Language Distribution
        print(f"\n### 6. Language Distribution") # Markdown Subheading
        all_languages = [f['language'] for f in file_data]
        language_counts = Counter(all_languages)
        print("- **Counts:**")
        for lang, count in language_counts.most_common():
            print(f"   - {lang}: {count} files")


        # 7. Structural Elements Summary
        print(f"\n### 7. Structural Elements Summary") # Markdown Subheading
        total_tables_found = sum([f['table_count'] for f in file_data])
        total_figures_found = sum([f['figure_count'] for f in file_data])
        total_annotations_found = sum([f['annotation_count'] for f in file_data])

        files_with_tables = sum([f['has_tables'] for f in file_data])
        files_with_figures = sum([f['has_figures'] for f in file_data])
        files_with_annotations = sum([f['has_annotations'] for f in file_data])

        print(f"- **Total Tables found:** {total_tables_found}")
        print(f"- **Total Figures/Images found:** {total_figures_found}")
        print(f"- **Total Annotations found:** {total_annotations_found}")
        print(f"- **Files with Tables:** {files_with_tables} ({files_with_tables/total_files*100:.2f}%)")
        print(f"- **Files with Figures/Images:** {files_with_figures} ({files_with_figures/total_files*100:.2f}%)")
        print(f"- **Files with Annotations:** {files_with_annotations} ({files_with_annotations/total_files*100:.2f}%)")


    else:
        print("No data collected from PDF files.")


# --- End of Captured Output ---
sys.stdout = original_stdout # Restore stdout
sys.stdout.flush() # Explicitly flush the buffer


# --- Display Report in Console ---
markdown_content = report_output.getvalue() # Get the captured output
print("\n" + "="*30 + " PDF Analysis Report " + "="*30) # Separator
print("\n*Note: Markdown rendering may vary depending on the viewer. Multi-column layout and font size control are not standard Markdown features.*\n")
print(markdown_content) # Print the captured report content to console
print("="*79) # Separator


# --- Save Output to Markdown File ---
try:
    with open(output_markdown_file, 'w', encoding='utf-8') as f:
        f.write(markdown_content)
    print(f"\nAnalysis report also saved to {output_markdown_file}")
except Exception as e:
    print(f"\nError saving report to {output_markdown_file}: {e}")

report_output.close() # Close the StringIO object
# --- End of Script ---