# Automated Metadata Generation System - Proof of Concept

This notebook demonstrates the core functionality of the automated metadata generation system for documents.

## Features Demonstrated:
- Document text extraction (PDF, DOCX, TXT)
- Semantic content analysis using NLP
- Automated metadata generation
- Structured JSON output
- OCR support (when available)

In [1]:
# Import required modules
import sys
from pathlib import Path
import json
from pprint import pprint

# Import our custom modules
from document_processor import DocumentProcessor
from metadata_generator import MetadataGenerator
from text_extractor import TextExtractor
from semantic_analyzer import SemanticAnalyzer
from utils import setup_logging
from config import SUPPORTED_FORMATS, OUTPUT_DIR

# Set up logging
logger = setup_logging('INFO')
print("Modules imported successfully!")

Modules imported successfully!


## Check System Dependencies
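
A minimal sketch of a dependency check, assuming each optional feature maps to an importable package. The names `spacy` and `en_core_web_sm` come from the log messages that appear in the semantic analysis cells of this notebook; the helper `check_dependencies` is hypothetical and not part of the project's modules.

```python
from importlib.util import find_spec


def check_dependencies(packages):
    """Return {package_name: bool} telling whether each package can be found."""
    # find_spec returns None for a missing top-level package without importing it
    return {name: find_spec(name) is not None for name in packages}


# Only spaCy and its English model are named in this notebook's logs
status = check_dependencies(["spacy", "en_core_web_sm"])
for name, available in status.items():
    print(f"{name}: {'available' if available else 'missing'}")
```

If the model is reported missing, the log output suggests installing it with `python -m spacy download en_core_web_sm`.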

## Create Sample Document for Testing

In [2]:
# Create a sample text document for testing
sample_text = """
Artificial Intelligence in Healthcare: A Comprehensive Analysis

Introduction:
Artificial Intelligence (AI) has emerged as a transformative technology in healthcare, revolutionizing patient care, diagnosis, and treatment methodologies. This comprehensive analysis examines the current applications, benefits, challenges, and future prospects of AI in the healthcare sector.

Current Applications:
AI technologies are being deployed across various healthcare domains including medical imaging, drug discovery, personalized medicine, and clinical decision support systems. Machine learning algorithms have shown remarkable success in analyzing medical images, detecting diseases like cancer and diabetic retinopathy with accuracy comparable to human specialists.

Benefits and Advantages:
The implementation of AI in healthcare offers numerous advantages including improved diagnostic accuracy, reduced medical errors, enhanced patient outcomes, and increased operational efficiency. AI-powered systems can process vast amounts of medical data quickly, enabling faster diagnosis and treatment decisions.

Challenges and Concerns:
Despite its potential, AI adoption in healthcare faces several challenges including data privacy concerns, regulatory compliance, integration with existing systems, and the need for specialized training. Healthcare organizations must address these challenges to successfully implement AI solutions.

Conclusion:
AI represents a significant opportunity to transform healthcare delivery, improve patient outcomes, and reduce costs. However, successful implementation requires careful consideration of technical, ethical, and regulatory factors. As the technology continues to evolve, collaboration between healthcare professionals, technology experts, and policymakers will be crucial for realizing the full potential of AI in healthcare.

Keywords: artificial intelligence, healthcare, machine learning, medical imaging, diagnosis, patient care, digital transformation
"""

# Save sample document
sample_dir = Path('sample_documents')
sample_dir.mkdir(exist_ok=True)

sample_file = sample_dir / 'ai_healthcare_analysis.txt'
with open(sample_file, 'w', encoding='utf-8') as f:
    f.write(sample_text.strip())

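print(f"✅ Sample document created: {sample_file}")
# Note: len(sample_text) includes the leading and trailing newlines that strip() removes before writing
print(f"Document length: {len(sample_text)} characters")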

✅ Sample document created: sample_documents/ai_healthcare_analysis.txt
Document length: 1997 characters


## Demonstrate Text Extraction

In [3]:
# Initialize text extractor
text_extractor = TextExtractor()

# Extract text from sample document
extraction_result = text_extractor.extract_text(sample_file)

print("Text Extraction Results:")
print("=" * 50)
print(f"Success: {extraction_result['success']}")
print(f"Extraction Method: {extraction_result['extraction_method']}")
print(f"Page Count: {extraction_result['page_count']}")
print(f"Word Count: {extraction_result['word_count']}")
print(f"Character Count: {extraction_result['character_count']}")

# Report the error, if the extractor returned one (.get avoids a KeyError when the key is absent)
if extraction_result.get('error'):
    print(f"Error: {extraction_result['error']}")

print("\nExtracted Text (first 300 characters):")
print("-" * 50)
preview = extraction_result['text']
if len(preview) > 300:
    preview = preview[:300] + "..."
print(preview)

Text Extraction Results:
Success: True
Extraction Method: direct_text_read
Page Count: 1
Word Count: 240
Character Count: 1989

Extracted Text (first 300 characters):
--------------------------------------------------
Artificial Intelligence in Healthcare: A Comprehensive Analysis Introduction: Artificial Intelligence (AI) has emerged as a transformative technology in healthcare, revolutionizing patient care, diagnosis, and treatment methodologies. This comprehensive analysis examines the current applications, be...
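
The extracted text shows paragraph breaks replaced by single spaces, and the character count (1989) is lower than the 1997 characters reported when the sample was created. Both observations are consistent with whitespace normalization during extraction. The sketch below illustrates that idea under this assumption; `read_text_file` is a hypothetical helper, not the actual `TextExtractor` implementation, and it only mirrors the result keys visible in the output above.

```python
from pathlib import Path


def read_text_file(path):
    """Read a plain-text file and return a result dict shaped like the output above."""
    result = {"success": False, "extraction_method": "direct_text_read",
              "page_count": 0, "word_count": 0, "character_count": 0,
              "text": "", "error": None}
    try:
        raw = Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        result["error"] = str(exc)
        return result
    # Assumption: collapse every run of whitespace (including newlines) to one space
    text = " ".join(raw.split())
    result.update(success=True, page_count=1, text=text,
                  word_count=len(text.split()), character_count=len(text))
    return result
```

Returning a dict with an `error` key, rather than raising, lets a caller process many files and inspect failures afterwards.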


## Demonstrate Semantic Analysis

In [4]:
# Initialize semantic analyzer
semantic_analyzer = SemanticAnalyzer()

# Analyze the extracted text
if extraction_result['success'] and extraction_result['text']:
    semantic_result = semantic_analyzer.analyze_text(extraction_result['text'])
    
    print("Semantic Analysis Results:")
    print("=" * 50)
    
    print(f"\n Language: {semantic_result['language']}")
    print(f" Sentiment: {semantic_result['sentiment']}")
    print(f" Readability Score: {semantic_result['readability_score']:.1f}/100")
    
    print("\n Text Statistics:")
    stats = semantic_result['text_statistics']
    print(f"  - Sentences: {stats['sentence_count']}")
    print(f"  - Words: {stats['word_count']}")
    print(f"  - Characters: {stats['character_count']}")
    print(f"  - Paragraphs: {stats['paragraph_count']}")
    print(f"  - Avg words/sentence: {stats['average_words_per_sentence']:.1f}")
    
    print("\n Top Keywords:")
    for i, keyword in enumerate(semantic_result['keywords'][:10], 1):
        print(f"  {i:2d}. {keyword['word']} (freq: {keyword['frequency']}, score: {keyword['relevance_score']:.3f})")
    
    print("\n  Named Entities:")
    if semantic_result['entities']:
        for entity in semantic_result['entities'][:10]:
            print(f"  - {entity['text']} ({entity['label']}: {entity['description']})")
    else:
        print("  No named entities found (spaCy model may not be available)")
    
    print("\n Topics:")
    if semantic_result['topics']:
        print(f"  {', '.join(semantic_result['topics'][:10])}")
    else:
        print("  No topics identified")
    
    print("\n Summary:")
    print(f"  {semantic_result['summary']}")
    
else:
    print(" Cannot perform semantic analysis - no text extracted")

ERROR:semantic_analyzer:Could not load spaCy model: en_core_web_sm
ERROR:semantic_analyzer:Install it with: python -m spacy download en_core_web_sm


Semantic Analysis Results:

 Language: en
 Sentiment: neutral
 Readability Score: 0.0/100

 Text Statistics:
  - Sentences: 12
  - Words: 240
  - Characters: 1989
  - Paragraphs: 1
  - Avg words/sentence: 20.0

 Top Keywords:
   1. healthcare (freq: 7, score: 0.029)
   2. medical (freq: 5, score: 0.021)
   3. patient (freq: 4, score: 0.017)
   4. artificial (freq: 3, score: 0.013)
   5. technology (freq: 3, score: 0.013)
   6. including (freq: 3, score: 0.013)
   7. challenges (freq: 3, score: 0.013)
   8. intelligence (freq: 2, score: 0.008)
   9. comprehensive (freq: 2, score: 0.008)
  10. analysis (freq: 2, score: 0.008)

  Named Entities:
  No named entities found (spaCy model may not be available)

 Topics:
  artificial, intelligence, healthcare:, comprehensive, analysis, introduction:, (ai), emerged, transformative, technology

 Summary:
  Artificial Intelligence in Healthcare: A Comprehensive Analysis Introduction: Artificial Intelligence (AI) has emerged as a transformative tec
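
The keyword scores above are consistent with frequency divided by total word count (for example, 7 / 240 ≈ 0.029 for "healthcare"). The sketch below assumes that scoring rule; the stop-word list, the minimum word length, and the function name `extract_keywords` are illustrative choices, not the `SemanticAnalyzer` implementation.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; a real list would be longer)
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are",
              "has", "have", "as", "with", "for", "on", "by", "this", "that"}


def extract_keywords(text, top_n=10):
    """Rank words by frequency; relevance_score = frequency / total word count."""
    total_words = len(text.split())
    # Lowercase alphabetic tokens only, so "healthcare:" and "(ai)" lose punctuation
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if len(t) > 2 and t not in STOP_WORDS)
    return [
        {"word": word, "frequency": freq,
         "relevance_score": freq / total_words if total_words else 0.0}
        for word, freq in counts.most_common(top_n)
    ]
```

Tokenizing with a regular expression rather than splitting on whitespace avoids entries such as `healthcare:` and `(ai)`, which appear in the topics list above.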

## Demonstrate Complete Metadata Generation

In [5]:
# Initialize metadata generator
metadata_generator = MetadataGenerator()

# Generate complete metadata
metadata = metadata_generator.generate_metadata(sample_file)

print("Complete Metadata Generation Results:")
print("=" * 60)

# Document Information
doc_info = metadata.get('document_info', {})
print("\n Document Information:")
print(f"  Filename: {doc_info.get('filename', 'N/A')}")
print(f"  File Size: {doc_info.get('file_size_mb', 0)} MB")
print(f"  Document Type: {doc_info.get('document_type', 'N/A')}")


# Processing Information
proc_info = metadata.get('processing_info', {})
print("\n Processing Information:")
print(f"  Success: {proc_info.get('success', False)}")
if proc_info.get('errors'):
    print(f"  Errors: {'; '.join(proc_info['errors'])}")

# Extraction Information
ext_info = metadata.get('extraction_info', {})
print("\n Text Extraction:")
print(f"  Method: {ext_info.get('extraction_method', 'N/A')}")
print(f"  Word Count: {ext_info.get('word_count', 0)}")
print(f"  Character Count: {ext_info.get('character_count', 0)}")
print(f"  Page Count: {ext_info.get('page_count', 0)}")

# Derived Metadata
derived = metadata.get('derived_metadata', {})
if derived:
    print("\n Derived Metadata:")
    print(f"  Title: {derived.get('title', 'N/A')}")
    print(f"  Category: {derived.get('category', 'N/A')}")
    print(f"  Content Type: {derived.get('content_type', 'N/A')}")
    print(f"  Quality Score: {derived.get('quality_score', 0)}/100")
    print(f"  Complexity: {derived.get('complexity_level', 'N/A')}")
    print(f"  Reading Time: {derived.get('estimated_reading_time', 'N/A')}")
    print(f"  Language: {derived.get('primary_language', 'N/A')}")
    
    if derived.get('top_keywords'):
        print(f"  Keywords: {', '.join(derived['top_keywords'][:5])}")
    
    if derived.get('main_topics'):
        print(f"  Topics: {', '.join(derived['main_topics'][:5])}")
    
    print(f"  Description: {derived.get('description', 'N/A')[:100]}...")

ERROR:semantic_analyzer:Could not load spaCy model: en_core_web_sm
ERROR:semantic_analyzer:Install it with: python -m spacy download en_core_web_sm


Complete Metadata Generation Results:

 Document Information:
  Filename: ai_healthcare_analysis.txt
  File Size: 0.0 MB
  Document Type: Text Document

 Processing Information:
  Success: True

 Text Extraction:
  Method: direct_text_read
  Word Count: 240
  Character Count: 1989
  Page Count: 1

 Derived Metadata:
  Title: Ai Healthcare Analysis
  Category: Academic/Research
  Content Type: Informational
  Quality Score: 70.0/100
  Complexity: Very Complex
  Reading Time: 1 minute
  Language: en
  Keywords: healthcare, medical, patient, artificial, technology
  Topics: artificial, intelligence, healthcare:, comprehensive, analysis
  Description: Artificial Intelligence in Healthcare: A Comprehensive Analysis Introduction: Artificial Intelligenc...
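
The title "Ai Healthcare Analysis" matches the filename with underscores replaced by spaces and each word capitalized. A minimal sketch of that derivation, together with a reading-time estimate, follows. Both helpers are hypothetical, and the 200 words-per-minute reading speed is an assumption that is not stated in the notebook.

```python
from pathlib import Path


def title_from_filename(filename):
    """Turn 'ai_healthcare_analysis.txt' into 'Ai Healthcare Analysis'."""
    stem = Path(filename).stem
    return stem.replace("_", " ").replace("-", " ").title()


def estimate_reading_time(word_count, words_per_minute=200):
    """Return a human-readable estimate; 200 wpm is an assumed reading speed."""
    minutes = max(1, round(word_count / words_per_minute))
    return f"{minutes} minute{'s' if minutes != 1 else ''}"


print(title_from_filename("ai_healthcare_analysis.txt"))  # Ai Healthcare Analysis
print(estimate_reading_time(240))                         # 1 minute
```

Note that `str.title()` produces "Ai" rather than "AI"; preserving acronyms would need a lookup of known abbreviations.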


## Save Metadata to JSON File

In [6]:
# Save metadata to JSON file
from utils import save_metadata

output_file = OUTPUT_DIR / f"{sample_file.stem}_metadata.json"
success = save_metadata(metadata, output_file)

if success:
    print(f" Metadata saved to: {output_file}")
    print(f"File size: {output_file.stat().st_size} bytes")
    
    # Display JSON structure
    print("\n JSON Structure:")
    json_keys = list(metadata.keys())
    for key in json_keys:
        if isinstance(metadata[key], dict):
            sub_keys = list(metadata[key].keys())
            print(f"  {key}: {{{', '.join(sub_keys[:5])}{'...' if len(sub_keys) > 5 else ''}}}")
        else:
            print(f"  {key}: {type(metadata[key]).__name__}")
else:
    print(" Failed to save metadata")

 Metadata saved to: /Users/tvishadhuper/Downloads/MetadataGenerator/output/ai_healthcare_analysis_metadata.json
File size: 6500 bytes

 JSON Structure:
  document_info: {filename, file_extension, file_size_bytes, file_size_mb, created_date...}
  extraction_info: {extraction_method, page_count, text, character_count, word_count...}
  content_analysis: {keywords, entities, summary, language, sentiment...}
  processing_info: {timestamp, version, success, errors}
  derived_metadata: {title, description, category, primary_language, quality_score...}
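
For readers without access to `utils.save_metadata`, a helper with the same call shape (metadata and path in, boolean out) might look like the sketch below. The name `save_metadata_json` and its serialization options are assumptions, not the project's code.

```python
import json
from pathlib import Path


def save_metadata_json(metadata, output_path):
    """Write metadata as pretty-printed UTF-8 JSON; return True on success."""
    output_path = Path(output_path)
    try:
        output_path.parent.mkdir(parents=True, exist_ok=True)
        with open(output_path, "w", encoding="utf-8") as f:
            # default=str keeps non-JSON types (e.g. datetime, Path) from raising
            json.dump(metadata, f, indent=2, ensure_ascii=False, default=str)
        return True
    except (OSError, TypeError, ValueError):
        return False
```

Passing `ensure_ascii=False` keeps non-ASCII characters from the source document readable in the saved file.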


## Demonstrate Document Processor (High-level Interface)

In [7]:
# Initialize document processor
processor = DocumentProcessor(log_level='INFO')

# Process the sample document using high-level interface
print("Using DocumentProcessor for end-to-end processing:")
print("=" * 55)

result_metadata = processor.process_single_document(sample_file)

# Get processing statistics
stats = processor.get_processing_stats()
print(f"\n Processing Statistics:")
print(f"  Total files processed: {stats['total_files']}")

if stats['errors']:
    print(f"  Errors: {'; '.join(stats['errors'])}")


ERROR:semantic_analyzer:Could not load spaCy model: en_core_web_sm
ERROR:semantic_analyzer:Install it with: python -m spacy download en_core_web_sm


Using DocumentProcessor for end-to-end processing:

 Processing Statistics:
  Total files processed: 1


## Create Additional Sample Documents for Testing

In [8]:
# Create additional sample documents
samples = {
    'technical_report.txt': """
System Performance Analysis Report

Executive Summary:
This report analyzes the performance metrics of our distributed computing system over the past quarter. The analysis reveals significant improvements in throughput and response times.

Methodology:
We collected performance data using automated monitoring tools and conducted load testing under various scenarios. The metrics include CPU utilization, memory usage, network latency, and transaction throughput.

Key Findings:
1. Average response time decreased by 23% compared to the previous quarter
2. System throughput increased by 31% during peak hours
3. Memory utilization remained stable at 67% average
4. Network latency improved by 15% after infrastructure upgrades

Recommendations:
Based on our analysis, we recommend continued monitoring and potential scaling of the database tier to handle increased load.
""",
    
    'meeting_minutes.txt': """
Project Alpha Team Meeting Minutes
Date: March 15, 2024
Attendees: John Smith, Sarah Johnson, Mike Chen, Lisa Rodriguez

Agenda Items Discussed:

1. Project Timeline Review
   - Current phase completion: 75%
   - Next milestone: April 1, 2024
   - Risk assessment: Low to medium

2. Budget Status
   - Expenses to date: $45,000
   - Remaining budget: $23,000
   - Projected completion cost: $62,000

3. Action Items
   - John: Complete user interface mockups by March 22
   - Sarah: Finalize database schema by March 20
   - Mike: Set up testing environment by March 25

Next meeting scheduled for March 22, 2024 at 2:00 PM.
""",
    
    'policy_document.txt': """
Remote Work Policy - Version 2.1

Purpose:
This policy establishes guidelines and procedures for employees working remotely to ensure productivity, security, and work-life balance.

Scope:
This policy applies to all full-time and part-time employees who are approved for remote work arrangements.

Eligibility Criteria:
- Minimum 6 months of employment
- Satisfactory performance reviews
- Role suitable for remote work
- Reliable internet connection and appropriate workspace

Security Requirements:
- Use of company-approved VPN for all work-related activities
- Regular software updates and security patches
- Secure storage of confidential information
- Compliance with data protection regulations

Performance Expectations:
Remote employees are expected to maintain the same level of productivity and communication as office-based employees.
"""
}


for filename, content in samples.items():
    file_path = sample_dir / filename
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content.strip())
    print(f"Created: {filename}")

print(f"\n Sample documents directory: {sample_dir.absolute()}")
print(f"Files created: {len(samples) + 1} total")

Created: technical_report.txt
Created: meeting_minutes.txt
Created: policy_document.txt

 Sample documents directory: /Users/tvishadhuper/Downloads/MetadataGenerator/sample_documents
Files created: 4 total


## Batch Processing Demonstration

In [9]:

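# Reset statistics left over from the single-document run above
processor.reset_stats()

print("Batch Processing All Sample Documents:")
print("=" * 45)

# Process every supported file directly inside sample_dir (no subdirectories)
batch_results = processor.process_directory(sample_dir, recursive=False)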

print(f"\n Batch Processing Results:")
print(f"Total files processed: {len(batch_results)}")


print("\n Individual File Results:")
for i, result in enumerate(batch_results, 1):
    doc_info = result.get('document_info', {})
    derived = result.get('derived_metadata', {})
    processing = result.get('processing_info', {})
    
    print(f"\n{i}. {doc_info.get('filename', 'Unknown')}")

    if processing.get('success', False):
        print(f"   Title: {derived.get('title', 'N/A')}")
        print(f"   Category: {derived.get('category', 'N/A')}")
        print(f"   Quality: {derived.get('quality_score', 0)}/100")
        print(f"   Word Count: {result.get('extraction_info', {}).get('word_count', 0)}")
    else:
        errors = processing.get('errors', [])
        if errors:
            print(f"   Error: {errors[0]}")


final_stats = processor.get_processing_stats()
print(f"\n🎯 Final Statistics:")
print(f"  Successfully processed: {final_stats['successful']} files")
print(f"  Failed: {final_stats['failed']} files")
print(f"  Success rate: {(final_stats['successful'] / max(1, final_stats['total_files']) * 100):.1f}%")

Batch Processing All Sample Documents:

 Batch Processing Results:
Total files processed: 8

 Individual File Results:

1. business_report.txt
   Title: Business Report
   Category: General Document
   Quality: 70.0/100
   Word Count: 268

2. technical_report.txt
   Title: Technical Report
   Category: Academic/Research
   Quality: 70.0/100
   Word Count: 119

3. policy_document.txt
   Title: Policy Document
   Category: General Document
   Quality: 70.0/100
   Word Count: 114

4. technical_manual.txt
   Title: Technical Manual
   Category: Technical/IT
   Quality: 70.0/100
   Word Count: 295

5. ai_research_paper.txt
   Title: Ai Research Paper
   Category: General Document
   Quality: 70.0/100
   Word Count: 374

6. meeting_minutes.txt
   Title: Meeting Minutes
   Category: General Document
   Quality: 75.0/100
   Word Count: 98

7. ai_healthcare_analysis.txt
   Title: Ai Healthcare Analysis
   Category: Academic/Research
   Quality: 70.0/100
   Word Count: 240

8. sample.txt
   Titl
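
The batch run reports 8 files even though this notebook created only 4, presumably because `sample_documents` already held other files. A directory walk of this kind can be sketched as follows; the extension set, the function signature, and the statistics keys are assumptions modeled on the output above, not the `DocumentProcessor` implementation.

```python
from pathlib import Path

# Assumed set of extensions; the project's SUPPORTED_FORMATS is not shown here
EXTENSIONS = {".txt", ".pdf", ".docx"}


def process_directory(directory, process_file, recursive=False):
    """Apply process_file to each supported file and tally successes/failures."""
    directory = Path(directory)
    candidates = directory.rglob("*") if recursive else directory.glob("*")
    results = []
    stats = {"total_files": 0, "successful": 0, "failed": 0, "errors": []}
    for path in sorted(candidates):
        if not path.is_file() or path.suffix.lower() not in EXTENSIONS:
            continue
        stats["total_files"] += 1
        try:
            results.append(process_file(path))
            stats["successful"] += 1
        except Exception as exc:  # keep going so one bad file does not stop the batch
            stats["failed"] += 1
            stats["errors"].append(f"{path.name}: {exc}")
    return results, stats
```

Catching errors per file keeps the batch running and leaves a record of which documents need attention.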

## Metadata Comparison Analysis

In [10]:

print("Metadata Comparison Analysis:")
print("=" * 35)

# Collect one summary row per successfully processed document
comparison_data = []

for result in batch_results:
    if result.get('processing_info', {}).get('success', False):
        doc_info = result.get('document_info', {})
        derived = result.get('derived_metadata', {})
        extraction = result.get('extraction_info', {})
        analysis = result.get('content_analysis', {})
        
        comparison_data.append({
            'filename': doc_info.get('filename', 'Unknown'),
            'category': derived.get('category', 'N/A'),
            'word_count': extraction.get('word_count', 0),
            'quality_score': derived.get('quality_score', 0),
            'readability': analysis.get('readability_score', 0),
            'complexity': derived.get('complexity_level', 'N/A'),
            'reading_time': derived.get('estimated_reading_time', 'N/A'),
            'keywords_count': len(analysis.get('keywords', [])),
            'entities_count': len(analysis.get('entities', []))
        })

if comparison_data:
    print(f"\n{'Filename':<25} {'Category':<20} {'Words':<8} {'Quality':<8} {'Readability':<12}")
    print("-" * 80)
    
    for data in comparison_data:
        print(f"{data['filename']:<25} {data['category']:<20} {data['word_count']:<8} {data['quality_score']:<8.1f} {data['readability']:<12.1f}")
    
    print("\n📈 Summary Statistics:")
    avg_words = sum(d['word_count'] for d in comparison_data) / len(comparison_data)
    avg_quality = sum(d['quality_score'] for d in comparison_data) / len(comparison_data)
    avg_readability = sum(d['readability'] for d in comparison_data) / len(comparison_data)
    
    print(f"  Average word count: {avg_words:.1f}")
    print(f"  Average quality score: {avg_quality:.1f}/100")
    print(f"  Average readability: {avg_readability:.1f}/100")
    
  
    # Count how many documents fall into each category
    categories = {}
    for data in comparison_data:
        cat = data['category']
        categories[cat] = categories.get(cat, 0) + 1
    
    print(f"\n📊 Document Categories:")
    for category, count in categories.items():
        print(f"  {category}: {count} document{'s' if count > 1 else ''}")
else:
    print("No successful processing results available for comparison.")

Metadata Comparison Analysis:

Filename                  Category             Words    Quality  Readability 
--------------------------------------------------------------------------------
business_report.txt       General Document     268      70.0     20.9        
technical_report.txt      Academic/Research    119      70.0     12.8        
policy_document.txt       General Document     114      70.0     0.0         
technical_manual.txt      Technical/IT         295      70.0     11.6        
ai_research_paper.txt     General Document     374      70.0     0.0         
meeting_minutes.txt       General Document     98       75.0     46.2        
ai_healthcare_analysis.txt Academic/Research    240      70.0     0.0         
sample.txt                Technical/IT         120      70.0     14.2        

📈 Summary Statistics:
  Average word count: 203.5
  Average quality score: 70.6/100
  Average readability: 13.2/100

📊 Document Categories:
  General Document: 4 documents
  Academic/R
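
Several documents in the table above score 0.0 for readability. The notebook does not say which formula the analyzer uses; one common choice that produces such floor values on dense prose is Flesch Reading Ease clamped to the 0-100 range. The sketch below shows that formula purely as an assumption, with a rough vowel-group heuristic standing in for true syllable counting.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease, clamped to the 0-100 range."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    score = (206.835
             - 1.015 * (len(words) / len(sentences))
             - 84.6 * (syllables / len(words)))
    return max(0.0, min(100.0, score))
```

Under this formula, long sentences and polysyllabic vocabulary both lower the score, so technical or policy text can reach the floor while short, list-like text such as meeting minutes scores higher.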

## System Summary and Next Steps

In [11]:
print(" Automated Metadata Generation System - POC Summary")
print("=" * 60)



print("\n Generated Metadata Includes:")
print("  • Document information (size, type, hash, dates)")
print("  • Text extraction details (method, word count, quality)")
print("  • Content analysis (keywords, entities, topics, sentiment)")
print("  • Derived metadata (title, category, description, complexity)")
print("  • Quality metrics (readability, completeness score)")



print("\n Output Files Generated:")
output_files = list(OUTPUT_DIR.glob('*.json'))
for file in output_files:
    print(f"  • {file.name} ({file.stat().st_size} bytes)")



 Automated Metadata Generation System - POC Summary

 Generated Metadata Includes:
  • Document information (size, type, hash, dates)
  • Text extraction details (method, word count, quality)
  • Content analysis (keywords, entities, topics, sentiment)
  • Derived metadata (title, category, description, complexity)
  • Quality metrics (readability, completeness score)

 Output Files Generated:
  • technical_report_metadata.json (4955 bytes)
  • ai_research_paper_metadata.json (7611 bytes)
  • ai_healthcare_analysis_metadata.json (6500 bytes)
  • business_report_metadata.json (5806 bytes)
  • sample_metadata.json (5071 bytes)
  • technical_manual_metadata.json (5827 bytes)
  • meeting_minutes_metadata.json (4607 bytes)
  • policy_document_metadata.json (4743 bytes)
  • batch_summary_sample_documents.json (2110 bytes)
