# 📓 Automated Metadata Generation - Final Notebook (Standalone)

This notebook extracts and generates metadata from documents using OCR and LLMs.

- Supports: PDF, DOCX, TXT, RTF, ODT, HTML, JPG, PNG
- Uses: Tesseract, PyMuPDF, Hugging Face Transformers, KeyBERT
- Exports: JSON + CSV

---

In [34]:
!pip install transformers keybert sentence-transformers pytesseract pdf2image pymupdf python-docx pillow odfpy striprtf beautifulsoup4 pandas ipywidgets


Defaulting to user installation because normal site-packages is not writeable


In [35]:
import os
import fitz
import docx
from pdf2image import convert_from_path
import pytesseract
from PIL import Image
from odf.opendocument import load
from odf.text import P
from bs4 import BeautifulSoup
from striprtf.striprtf import rtf_to_text

from keybert import KeyBERT
from transformers import pipeline
import json
import re
import pandas as pd
import tempfile
from IPython.display import display
import ipywidgets as widgets


## 📁 Upload Your Document


In [36]:
upload = widgets.FileUpload(accept='', multiple=False)
display(upload)

ocr_toggle = widgets.ToggleButtons(
    options=["Auto (Recommended)", "Force OCR"],
    description='OCR Mode:',
    style={'description_width': 'initial'}
)
display(ocr_toggle)


FileUpload(value=(), description='Upload')

ToggleButtons(description='OCR Mode:', options=('Auto (Recommended)', 'Force OCR'), style=ToggleButtonsStyle(d…

## 💾 Save Uploaded File Temporarily


In [37]:
def save_uploaded_file(upload):
    for file_info in upload.value:
        name = file_info['name']
        suffix = os.path.splitext(name)[-1]
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(file_info['content'])
            return tmp.name
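
The temp-file pattern used by `save_uploaded_file` can be tried without the upload widget. The sketch below is a minimal stand-alone version; the helper name `save_bytes_to_temp` and the sample bytes are assumptions made for illustration, not part of the notebook.

```python
import os
import tempfile


def save_bytes_to_temp(name: str, content: bytes) -> str:
    """Write content to a temp file that keeps the uploaded file's extension."""
    suffix = os.path.splitext(name)[-1]
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(content)
    # Leaving the with-block closes the file, so the bytes are on disk here.
    return tmp.name


path = save_bytes_to_temp("report.pdf", b"%PDF-1.4 demo bytes")
print(os.path.splitext(path)[-1])  # the suffix later used to pick an extractor
os.remove(path)  # delete=False leaves cleanup to the caller
```

Because `delete=False` is set, the file outlives the `with` block, which is what lets later cells reopen it by path. It also means nothing removes the file automatically.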


In [38]:
file_path = save_uploaded_file(upload)
force_ocr = ocr_toggle.value == "Force OCR"

# Extraction itself runs in the next section, once extract_text has been defined.
if not file_path:
    print("❌ No file uploaded. Please select a file before proceeding.")

## 📄 Extract Text (with OCR if needed)


In [39]:
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
poppler_path = r"C:\Program Files\poppler-24.08.0\Library\bin"

def extract_text(file_path, force_ocr=False):
    ext = os.path.splitext(file_path)[-1].lower()

    def ocr_pdf(path):  # helper
        return "\n".join([pytesseract.image_to_string(img) for img in convert_from_path(path, poppler_path=poppler_path)])

    if ext == ".txt":
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()

    elif ext == ".docx":
        return "\n".join([p.text for p in docx.Document(file_path).paragraphs])

    elif ext == ".odt":
        # teletype.extractText also collects text from nested nodes (spans, links),
        # where p.firstChild.data fails if the first child is an element.
        from odf import teletype
        doc = load(file_path)
        return "\n".join([teletype.extractText(p) for p in doc.getElementsByType(P) if p.firstChild])

    elif ext in [".html", ".htm"]:
        with open(file_path, encoding="utf-8") as f:
            return BeautifulSoup(f.read(), "html.parser").get_text()

    elif ext == ".rtf":
        with open(file_path, encoding="utf-8") as f:
            return rtf_to_text(f.read())

    elif ext in [".jpg", ".jpeg", ".png"]:
        return pytesseract.image_to_string(Image.open(file_path))

    elif ext == ".pdf":
        if force_ocr:
            print("🔁 Force OCR enabled.")
            return ocr_pdf(file_path)

        # Heuristic 1: file size
        size_mb = os.path.getsize(file_path) / (1024 * 1024)
        print(f"📦 File size: {size_mb:.2f} MB")

        if size_mb > 3.0:
            print("📦 Large file size → Using OCR")
            return ocr_pdf(file_path)

        # Heuristic 2: avg text density
        with fitz.open(file_path) as doc:
            total_chars = 0
            text = ''
            for page in doc:
                page_text = page.get_text()
                total_chars += len(page_text)
                text += page_text
            avg_chars = total_chars / max(len(doc), 1)  # guard against a zero-page PDF
            print(f"📊 Avg characters per page: {avg_chars:.2f}")

        if avg_chars < 50:
            print("🧠 Low content density → OCR fallback")
            return ocr_pdf(file_path)

        if len(text.strip()) < 100:
            print("🔁 Very little text → OCR fallback")
            return ocr_pdf(file_path)

        return text

    else:
        raise ValueError(f"Unsupported file type: {ext}")

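# Fall back to an empty string when nothing was uploaded so later cells still receive text.
text = extract_text(file_path, force_ocr=force_ocr) if file_path else ""
print("\n--- Extracted Text (first 1000 chars) ---\n")
print(text[:1000])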

📦 File size: 1.26 MB
📊 Avg characters per page: 3240.69

--- Extracted Text (first 1000 chars) ---

arXiv:2506.18854v1  [astro-ph.HE]  23 Jun 2025
Comparative analysis of machine learning techniques for feature selection and classification
of Fast Radio Bursts
Ailton J. B. Júnior,1, ∗Jéferson A. S. Fortunato
,2, † Leonardo J. Silvestre
,3, ‡ Thonimar V. Alencar
,4, § and Wiliam S. Hipólito-Ricaldi
4, 5, ¶
1Departamento de Computação e Eletrônica, CEUNES,
Universidade Federal do Espírito Santo (UFES), Rodovia BR 101 Norte, km. 60,
CEP 29.940-540, São Mateus, ES, Brazil
2High Energy Physics, Cosmology & Astrophysics Theory (HEPCAT) Group,
Department of Mathematics and Applied Mathematics,
University of Cape Town, Cape Town 7700, South Africa
3Departamento de Computação e Eletrônica, CEUNES,
Universidade Federal do Espírito Santo, Rodovia BR 101 Norte, km. 60,
CEP 29.940-540, São Mateus, ES, Brazil
4Departamento de Ciências Naturais, CEUNES, Universidade Federal do Espírito
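
The PDF branch of `extract_text` decides between native text extraction and OCR from three signals: file size, average characters per page, and total extracted text. The sketch below isolates that decision as a pure function so the thresholds can be tested without a PDF. The function name `choose_pdf_strategy` and its parameter names are assumptions for illustration. It also approximates the final check with the raw character total, whereas the notebook uses the length of the stripped text.

```python
def choose_pdf_strategy(size_mb, page_char_counts, force_ocr=False,
                        max_size_mb=3.0, min_avg_chars=50, min_total_chars=100):
    """Return "ocr" or "text" using the same thresholds as the notebook's heuristics."""
    if force_ocr:
        return "ocr"
    if size_mb > max_size_mb:  # heuristic 1: large files are assumed to be scans
        return "ocr"
    total = sum(page_char_counts)
    avg = total / max(len(page_char_counts), 1)  # heuristic 2: text density
    if avg < min_avg_chars or total < min_total_chars:
        return "ocr"
    return "text"


# Illustrative inputs: a text-based PDF, a small scan, and a large file.
print(choose_pdf_strategy(1.26, [3240] * 16))
print(choose_pdf_strategy(0.80, [0, 12, 5]))
print(choose_pdf_strategy(4.20, [3000, 2900]))
```

Keeping the thresholds as keyword arguments makes it easy to experiment. The 3 MB cutoff in particular also sends large text-based PDFs through OCR, which is slower than native extraction.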

## 🧠 Generate Metadata (with LLMs)


In [40]:
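kw_model = KeyBERT(model="all-MiniLM-L6-v2")  # sentence-embedding model used to rank keywords
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# No model is named here, so transformers falls back to its default zero-shot model.
classifier = pipeline("zero-shot-classification")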

def extract_title(text):
    for line in text.split("\n"):
        if 5 <= len(line.strip().split()) <= 15:
            return line.strip()
    return "Untitled Document"

def extract_summary(text, max_chars=1024):
    return summarizer(text[:max_chars], max_length=130, min_length=30, do_sample=False)[0]['summary_text']

def extract_keywords(text, top_n=5):
    return kw_model.extract_keywords(text, top_n=top_n)  # returns (kw, score)

def extract_date(text):
    match = re.search(r"\d{4}[-/]\d{2}[-/]\d{2}|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec).*?\d{2,4}", text, re.I)
    return match.group(0) if match else "Not found"

def detect_type(text):
    labels = ["resume", "declaration form", "research paper", "invoice", "application", "academic transcript"]
    result = classifier(text[:1024], labels)
    return result["labels"][0], round(result["scores"][0] * 100, 2)

def generate_metadata(text):
    if not text.strip():
        return {
            "title": "No content",
            "summary": "",
            "keywords": [],
            "keyword_scores": {},
            "document_type": "Unknown",
            "detected_date": "Not found"
        }
    title = extract_title(text)
    summary = extract_summary(text)
    
    kw_raw = extract_keywords(text)
    keywords = [kw[0] for kw in kw_raw]
    keyword_scores = {kw[0]: round(kw[1], 3) for kw in kw_raw}

    doc_type, confidence = detect_type(text)
    detected_date = extract_date(text)

    return {
        "title": title,
        "summary": summary,
        "keywords": keywords,
        "keyword_scores": keyword_scores,
        "document_type": f"{doc_type} ({confidence}%)",
        "detected_date": detected_date
    }



Device set to use cpu
No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
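
`extract_title` returns the first line with 5 to 15 words. For a preprint that line can be the arXiv stamp rather than the title. The sketch below shows one possible refinement that skips such stamp lines. The function name `pick_title` and the `STAMP` pattern are assumptions for illustration, and the pattern only covers stamps of the form `arXiv:NNNN.NNNNN`.

```python
import re

# Hypothetical pattern for preprint stamps such as "arXiv:2506.18854v1 [astro-ph.HE] 23 Jun 2025".
STAMP = re.compile(r"^\s*arXiv:\d{4}\.\d{4,5}", re.I)


def pick_title(text, min_words=5, max_words=15):
    """Return the first non-stamp line whose word count falls in the given range."""
    for line in text.split("\n"):
        line = line.strip()
        if STAMP.match(line):
            continue  # skip the preprint stamp
        if min_words <= len(line.split()) <= max_words:
            return line
    return "Untitled Document"


sample = (
    "arXiv:2506.18854v1  [astro-ph.HE]  23 Jun 2025\n"
    "Comparative analysis of machine learning techniques for feature selection and classification\n"
    "of Fast Radio Bursts"
)
print(pick_title(sample))
```

A title that wraps across lines, as in this sample, is still returned only up to the first line break. Joining continuation lines would need a further heuristic.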


In [41]:
metadata = generate_metadata(text)

print("\n--- Metadata Summary ---")
for k, v in metadata.items():
    if isinstance(v, dict):  # For keyword_scores
        print(f"\n{k.title()}:")
        for term, score in v.items():
            print(f"  - {term}: {score}")
    elif isinstance(v, list):
        print(f"{k.title()}: {', '.join(v)}")
    else:
        print(f"{k.title()}: {v}")



--- Metadata Summary ---
Title: arXiv:2506.18854v1  [astro-ph.HE]  23 Jun 2025
Summary: Ailton J. B. Júnior, † Leonardo J. Silvestre, ‡ Thonimar V. Alencar and Wiliam S. Hipólito-Ricaldi. arXiv:2506.18854v1  [astro-ph.HE] 23 Jun 2025.
Keywords: astrophysics, astrophysical, bursts, astronomical, astrophysically

Keyword_Scores:
  - astrophysics: 0.394
  - astrophysical: 0.373
  - bursts: 0.359
  - astronomical: 0.324
  - astrophysically: 0.322
Document_Type: declaration form (44.34%)
Detected_Date: Jun 2025
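
The detected date above is `Jun 2025` although the source line reads `23 Jun 2025`, because the pattern in `extract_date` starts matching at the month name. The sketch below is one possible stricter pattern that also captures a leading or trailing day and avoids matching words that merely begin with a month abbreviation. The names `MONTH`, `DATE_RE`, and `find_date` are assumptions for illustration.

```python
import re

# Month names spelled out so that words like "Marketing" are not taken for "Mar".
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
         r"Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\.?")

DATE_RE = re.compile(
    r"\b\d{4}[-/]\d{2}[-/]\d{2}\b"               # 2025-06-23
    rf"|\b\d{{1,2}}\s+{MONTH}\s+\d{{4}}\b"       # 23 Jun 2025
    rf"|\b{MONTH}\s+\d{{1,2}},?\s+\d{{4}}\b"     # June 23, 2025
    rf"|\b{MONTH}\s+\d{{4}}\b",                  # Jun 2025
    re.I,
)


def find_date(text):
    """Return the first date-like substring, or "Not found"."""
    match = DATE_RE.search(text)
    return match.group(0) if match else "Not found"


print(find_date("arXiv:2506.18854v1  [astro-ph.HE]  23 Jun 2025"))
print(find_date("Marketing 2025 budget"))
```

The four alternatives cover only the formats listed in the comments. Other formats, such as `23/06/2025`, would need their own alternative.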


### Text Stats

In [42]:
print(f"\n📏 Text Stats:\n- Total Characters: {len(text)}\n- Total Words: {len(text.split())}")



📏 Text Stats:
- Total Characters: 51851
- Total Words: 7550
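
The document holds 51,851 characters, while `extract_summary` passes only the first 1,024 to the summarizer, so the summary reflects little more than the author block. One common workaround is to split the text into chunks and summarize each. The sketch below covers only the splitting step, using a hypothetical helper `chunk_text` and a simple sentence-boundary regex. Feeding the chunks to `summarizer` is left out.

```python
import re


def chunk_text(text, max_chars=1024):
    """Greedily pack sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = "One. Two. Three."
print(chunk_text(demo, max_chars=9))
```

The limit here is in characters to mirror `max_chars` in the notebook. The model's real limit is in tokens, so a character budget is only an approximation.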


### Keyword Relevance Table

In [43]:
import pandas as pd
print("\n📊 Keyword Relevance Table")
display(pd.DataFrame.from_dict(metadata["keyword_scores"], orient="index", columns=["Relevance"]))



📊 Keyword Relevance Table


| | Relevance |
|---|---|
| astrophysics | 0.394 |
| astrophysical | 0.373 |
| bursts | 0.359 |
| astronomical | 0.324 |
| astrophysically | 0.322 |
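
Three of the five keywords in the table are variants of the same word. A light post-processing step can collapse such near-duplicates. The sketch below uses a shared six-character prefix as a crude stand-in for stemming. The function name `dedupe_keywords` and the `prefix_len` parameter are assumptions for illustration. KeyBERT's own `use_mmr` and `diversity` arguments to `extract_keywords` are another route to more varied keywords.

```python
def dedupe_keywords(scored_keywords, prefix_len=6):
    """Keep the highest-scoring keyword for each shared prefix (a crude stem)."""
    seen, kept = set(), []
    for word, score in sorted(scored_keywords, key=lambda kv: kv[1], reverse=True):
        stem = word.lower()[:prefix_len]
        if stem not in seen:
            seen.add(stem)
            kept.append((word, score))
    return kept


# Scores copied from the relevance table above.
scores = [("astrophysics", 0.394), ("astrophysical", 0.373), ("bursts", 0.359),
          ("astronomical", 0.324), ("astrophysically", 0.322)]
print(dedupe_keywords(scores))
```

With these inputs the three `astrophys*` variants collapse to the top-scoring one, while `astronomical` survives because its first six characters differ.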


## 📊 View Metadata as Table


In [44]:
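from IPython.display import display, Markdown

# Build the two-column table that is displayed here and exported to CSV below.
# Keyword scores are left out because they already have their own table above.
df = pd.DataFrame(
    [(field, value) for field, value in metadata.items() if field != "keyword_scores"],
    columns=["Field", "Value"],
)
display(Markdown("### 📋 Metadata Summary"))
display(df)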

### 📋 Metadata Summary

| | Field | Value |
|---|---|---|
| 0 | title | arXiv:2506.18854v1 [astro-ph.HE] 23 Jun 2025 |
| 1 | summary | Ailton J. B. Júnior, † Leonardo J. Silvestre,... |
| 2 | keywords | [astrophysics, astrophysical, bursts, astronom... |
| 3 | document_type | declaration form |
| 4 | detected_date | Jun 2025 |
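
The classifier labelled a research paper as a declaration form with a top score of only 44.34%. One hedge against such weak predictions is to report "Unknown" below a confidence cutoff. The sketch below assumes a hypothetical helper `label_with_threshold` and an arbitrary cutoff of 0.6. Only the 0.4434 score comes from the output above; the other scores are made up for illustration.

```python
def label_with_threshold(labels, scores, min_confidence=0.6):
    """Return (label, confidence %) and use "Unknown" when the top score is weak."""
    best = max(range(len(scores)), key=scores.__getitem__)
    confidence = round(scores[best] * 100, 2)
    label = labels[best] if scores[best] >= min_confidence else "Unknown"
    return label, confidence


# Only 0.4434 is taken from the notebook output; 0.31 and 0.12 are illustrative.
labels = ["declaration form", "research paper", "invoice"]
print(label_with_threshold(labels, [0.4434, 0.31, 0.12]))
```

The helper takes parallel `labels` and `scores` lists, which matches the `result["labels"]` and `result["scores"]` fields that `detect_type` already reads.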


## 💾 Export Metadata (JSON + CSV)


In [45]:
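import json

# ensure_ascii=False keeps accented characters readable in the JSON file.
with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)

df.to_csv("metadata.csv", index=False)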

print("✅ Exported as metadata.json and metadata.csv")


✅ Exported as metadata.json and metadata.csv
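
Nested values such as the keyword list and the score dictionary do not map cleanly onto CSV cells. The sketch below shows one way to export both files with the standard library only, serializing nested values as JSON strings so they can be parsed back. The function name `export_metadata` and the sample record are assumptions for illustration. The notebook itself writes the CSV through pandas.

```python
import csv
import json
import os
import tempfile


def export_metadata(metadata, directory):
    """Write metadata.json and metadata.csv into directory and return both paths."""
    json_path = os.path.join(directory, "metadata.json")
    csv_path = os.path.join(directory, "metadata.csv")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2, ensure_ascii=False)
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Field", "Value"])
        for field, value in metadata.items():
            if isinstance(value, (list, dict)):
                value = json.dumps(value, ensure_ascii=False)  # keep nested values parseable
            writer.writerow([field, value])
    return json_path, csv_path


# Illustrative record that mirrors the notebook's metadata keys.
sample = {"title": "Demo", "keywords": ["bursts", "astrophysics"],
          "keyword_scores": {"bursts": 0.359}}
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path, csv_path = export_metadata(sample, tmp_dir)
    with open(csv_path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            print(row)
```

Reading a nested cell back is then a `json.loads` call on the `Value` column.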
