# 📄 Automated Metadata Generation System

## 🎯 Objective
This notebook demonstrates an automated system to extract meaningful metadata from documents such as PDFs, DOCX files, and TXT files.

It includes:
- Text extraction from different file formats
- Basic text statistics (most frequent words, word count)
- Metadata generation and preview

In [33]:
import fitz  # PyMuPDF for PDFs
import docx  # python-docx for DOCX
import pandas as pd
import os

In [34]:
import sys
print(sys.executable)



c:\Users\HP\Muskan\.venv\Scripts\python.exe


## 🛠️ Helper Functions

We'll define the following functions:
- `extract_pdf_text(filepath)` - Extracts text from PDF using PyMuPDF
- `extract_docx_text(filepath)` - Extracts text from Word document
- `extract_txt_text(filepath)` - Extracts plain text
- `generate_metadata(text)` - Builds a metadata dictionary (title, top keywords, word count, preview) from text


In [35]:
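def extract_pdf_text(filepath):
    """Return the text of every page in a PDF, concatenated in page order."""
    with fitz.open(filepath) as doc:
        return "".join(page.get_text() for page in doc)

def extract_docx_text(filepath):
    """Return the paragraphs of a DOCX file joined by newlines."""
    doc = docx.Document(filepath)
    return "\n".join(para.text for para in doc.paragraphs)

def extract_txt_text(filepath):
    """Return the contents of a UTF-8 encoded text file."""
    with open(filepath, "r", encoding="utf-8") as f:
        return f.read()

def generate_metadata(text):
    """Build a small metadata dict from whitespace-separated words."""
    words = text.split()
    # Raw frequency count: case-sensitive, with no stopword or punctuation filtering
    word_freq = pd.Series(words).value_counts().head(5)
    metadata = {
        "Title": words[0] if words else "N/A",  # first word only, not the full title line
        "Top Keywords": ', '.join(word_freq.index),
        "Word Count": len(words),
        "Preview": ' '.join(words[:40])  # first 40 words
    }
    return metadata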
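
`extract_txt_text` assumes the file is UTF-8 and raises `UnicodeDecodeError` otherwise. The following is a minimal sketch of a more tolerant reader, not part of the notebook's pipeline: the name `read_text_with_fallback` and the choice of `cp1252` as the second candidate encoding are assumptions made for illustration.

```python
def read_text_with_fallback(filepath, encodings=("utf-8", "cp1252")):
    """Return the file's text, trying each candidate encoding in order.

    Hypothetical helper; the encoding list is an assumption, not a
    requirement of the notebook above.
    """
    for encoding in encodings:
        try:
            with open(filepath, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD
    with open(filepath, "r", encoding="utf-8", errors="replace") as f:
        return f.read()
```

Trying strict decoders first keeps correctly encoded files intact; the replacement fallback only applies when every candidate fails, so the function always returns some text.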


## 📂 Select a File for Testing

Set `filepath` to a document in the `sample_docs` folder.


In [36]:
# Choose a test file
filepath = "sample_docs/sample1.pdf"  # change to .docx or .txt for other formats
ext = os.path.splitext(filepath)[1].lower()  # e.g. ".pdf"

# Extract text based on file type
if ext == ".pdf":
    text = extract_pdf_text(filepath)
elif ext == ".docx":
    text = extract_docx_text(filepath)
elif ext == ".txt":
    text = extract_txt_text(filepath)
else:
    text = ""  # unsupported format; reported in the metadata step

# Report success only when some text was actually extracted
if text:
    print("✅ Text Extracted!")
    print(text[:500])  # preview


✅ Text Extracted!
The Role of Artificial Intelligence in Modern Healthcare
Artificial Intelligence (AI) is revolutionizing healthcare by enabling faster diagnosis, improved patient
monitoring, and efficient drug development. The integration of AI in clinical systems helps reduce
human error and optimize treatment plans.
Machine learning models are now capable of analyzing medical images, predicting patient
deterioration, and offering personalized treatment recommendations. This technology represents a
leap forwar
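
The `if`/`elif` chain that picks an extractor could also be written as a lookup table, which makes adding a format a one-line change. This is a sketch under assumptions: `extract_text`, `read_txt`, and `EXTRACTORS` are hypothetical names, and only the TXT reader is registered so the example runs without PyMuPDF or python-docx installed.

```python
import os

def read_txt(filepath):
    """Read a UTF-8 text file (stand-in for the notebook's TXT extractor)."""
    with open(filepath, "r", encoding="utf-8") as f:
        return f.read()

def extract_text(filepath, extractors):
    """Look up the extractor for the file's extension and call it.

    `extractors` maps a lowercase extension such as ".txt" to a function
    that takes a path and returns text. Unsupported extensions raise
    ValueError instead of silently yielding an empty string.
    """
    ext = os.path.splitext(filepath)[1].lower()
    try:
        extractor = extractors[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}") from None
    return extractor(filepath)

# Only ".txt" is registered here; in the notebook, ".pdf" and ".docx"
# would map to extract_pdf_text and extract_docx_text.
EXTRACTORS = {".txt": read_txt}
```

Raising on an unknown extension is a design choice that differs from the cell above, which falls back to an empty string; either works as long as the caller handles the unsupported case.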


## 📋 Generate Metadata from Extracted Text

In [37]:
if text:
    metadata = generate_metadata(text)
    print("📄 Metadata:")
    for k, v in metadata.items():
        print(f"{k}: {v}")
else:
    print("❌ No text found or unsupported format.")


📄 Metadata:
Title: The
Top Keywords: of, in, and, Intelligence, Artificial
Word Count: 67
Preview: The Role of Artificial Intelligence in Modern Healthcare Artificial Intelligence (AI) is revolutionizing healthcare by enabling faster diagnosis, improved patient monitoring, and efficient drug development. The integration of AI in clinical systems helps reduce human error and optimize treatment plans.
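
The output above shows two limits of the word-based approach: the title is only the first word ("The"), and the top keywords are dominated by common words such as "of", "in", and "and". One possible refinement is sketched below. The function name `refine_metadata` and the tiny `STOPWORDS` set are assumptions for illustration; a practical stopword list would be much longer.

```python
import re
from collections import Counter

# Deliberately small stopword list, for illustration only
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to"}

def refine_metadata(text, top_n=5):
    """Hypothetical variant of generate_metadata with two changes."""
    # Title: the first non-empty line rather than the first word
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0] if lines else "N/A"
    # Keywords: lowercase alphabetic tokens, minus stopwords, ranked by count
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    keywords = [word for word, _ in counts.most_common(top_n)]
    return {"Title": title, "Top Keywords": ", ".join(keywords)}
```

Lowercasing before counting merges variants such as "Healthcare" and "healthcare", and the regular expression drops punctuation so that "(AI)" and "AI" count as the same token. The first-line title only works when the extracted text puts the title on its own line, as the preview of the sample PDF suggests it does.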


## 💾 Export Metadata as CSV 

In [38]:
df = pd.DataFrame([metadata])
df.to_csv("generated_metadata.csv", index=False)
print("📁 Metadata saved to 'generated_metadata.csv'")


📁 Metadata saved to 'generated_metadata.csv'
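
The export above writes a single row. To collect metadata for several documents in one file, the dictionaries can be gathered in a list and written together. The sketch below uses the standard library's `csv` module so it runs without pandas; `write_metadata_rows` is a hypothetical helper, and it assumes every dictionary has the same keys.

```python
import csv

def write_metadata_rows(rows, out_path):
    """Write a list of metadata dicts (same keys in each) to one CSV file.

    Returns the number of rows written.
    """
    if not rows:
        return 0  # nothing to write; avoid creating an empty file
    fieldnames = list(rows[0].keys())
    # newline="" lets the csv module control line endings itself
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With pandas already imported, `pd.DataFrame(rows).to_csv(out_path, index=False)` gives an equivalent result. Adding a key such as `"File"` to each dictionary before writing keeps track of which document each row came from.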


## ✅ Summary

We successfully:
- Extracted text from a sample PDF, with helpers that also cover DOCX and TXT files
- Generated basic metadata (title, top keywords, word count, preview)
- Exported the metadata for further use
