
# Automated Metadata Generation

This notebook demonstrates an end-to-end pipeline for automatically generating metadata from `.txt`, `.docx`, and `.pdf` documents. The process includes text extraction, OCR for scanned PDFs, and metadata generation (a summary plus file statistics) using a pre-trained NLP model.



## Step 1: Install Dependencies

First, make sure all the required libraries are installed. The cell below runs `pip` from inside the notebook (the leading `!` passes the command to the shell); you can also run the same command without the `!` in a terminal. The `requirements.txt` file must be in the notebook's working directory.


In [None]:
!pip install -r requirements.txt
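
After installation, it can help to confirm that the packages actually import before running the rest of the notebook. The following is a minimal sketch, not part of the original pipeline: `find_missing_modules` is a hypothetical helper, and the list of names is taken from the imports in Step 2 (import names can differ from pip package names, for example `python-docx` is imported as `docx` and `Pillow` as `PIL`).

```python
import importlib.util

def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing

# Top-level import names used in Step 2 of this notebook
required = ["docx", "PyPDF2", "pdfplumber", "pytesseract", "pdf2image", "PIL", "transformers"]
print("Missing modules:", find_missing_modules(required) or "none")
```

If the list is not empty, re-run the install cell and check that `requirements.txt` includes the corresponding packages.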


## Step 2: Import Necessary Libraries

Here, we import all the libraries needed for file handling, text extraction, OCR, and NLP.


In [None]:

import os
import docx
import PyPDF2
import pdfplumber
import pytesseract
from pdf2image import convert_from_path
from PIL import Image
from transformers import pipeline
import json



## Step 3: Configure OCR Engine (Tesseract)

For the OCR functionality to work, you must have Tesseract-OCR installed on your system and added to your PATH. If it's installed in a custom location, you can uncomment the line below and set the path to your Tesseract executable.


In [None]:

# On Windows, you might need to specify the path if it's not in your system's PATH.
# The r'' prefix makes this a raw string, so single backslashes are enough:
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
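
One way to check the PATH requirement up front is to look for the executables directly. The sketch below assumes the standard binary names: `tesseract` for Tesseract-OCR and `pdftoppm`, which ships with Poppler and is used by `pdf2image`. The `find_executables` helper is hypothetical and only illustrates the check.

```python
import shutil

def find_executables(names):
    """Map each executable name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}

# "tesseract" is the OCR engine; "pdftoppm" comes with Poppler
for name, path in find_executables(["tesseract", "pdftoppm"]).items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If `tesseract` is reported as not found but is installed in a custom location, set `pytesseract.pytesseract.tesseract_cmd` to that location as described above.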



## Step 4: Text Extraction Functions

These functions extract text from the supported file formats (`.txt`, `.docx`, `.pdf`). For PDFs, `extract_text` first attempts direct text extraction and falls back to OCR when that yields fewer than 100 characters. **Note:** The OCR function (`extract_text_from_scanned_pdf`) relies on `pdf2image`, which requires Poppler to be installed and in your PATH.


In [None]:

def extract_text_from_txt(file_path):
    """Extracts text from a .txt file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

def extract_text_from_docx(file_path):
    """Extracts text from a .docx file."""
    doc = docx.Document(file_path)
    return '\n'.join([para.text for para in doc.paragraphs])

def extract_text_from_pdf(file_path):
    """Extracts text from a text-based .pdf file."""
    text = ''
    try:
        with pdfplumber.open(file_path) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + '\n'
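    except Exception as e:
        print(f"Error with pdfplumber: {e}, trying PyPDF2")
        text = ''  # discard partial pdfplumber output so pages are not duplicated
        try:
            with open(file_path, 'rb') as f:
                reader = PyPDF2.PdfReader(f)
                for page in reader.pages:
                    # Guard against pages that yield no text (e.g. image-only pages)
                    text += (page.extract_text() or '') + '\n'
        except Exception as e2:
            print(f"Error with PyPDF2: {e2}")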
    return text

def extract_text_from_scanned_pdf(file_path):
    """Extracts text from a scanned (image-based) .pdf file using OCR."""
    try:
        images = convert_from_path(file_path)
        text = ''
        for img in images:
            text += pytesseract.image_to_string(img) + '\n'
        return text
    except Exception as e:
        print(f"Error during OCR: {e}")
        return ''

def extract_text(file_path):
    """A master function to extract text from any supported file type."""
    if not os.path.exists(file_path):
        return 'File not found'
        
    file_extension = os.path.splitext(file_path)[1].lower()
    text = ''

    if file_extension == '.txt':
        text = extract_text_from_txt(file_path)
    elif file_extension == '.docx':
        text = extract_text_from_docx(file_path)
    elif file_extension == '.pdf':
        text = extract_text_from_pdf(file_path)
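        if not text or len(text.strip()) < 100:
            print(f'Standard text extraction yielded little or no text for {os.path.basename(file_path)}, trying OCR.')
            ocr_text = extract_text_from_scanned_pdf(file_path)
            # Keep whichever result has more content instead of concatenating both,
            # which would duplicate any text found by both methods
            if len(ocr_text.strip()) > len(text.strip()):
                text = ocr_text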
    else:
        print(f'Unsupported file type: {file_extension}')
        
    return text
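
The OCR fallback in `extract_text` hinges on a simple length check. The sketch below isolates that check in a hypothetical helper, `needs_ocr`, so the threshold can be tested and tuned on its own; the default of 100 characters mirrors the value used above, and the `min_chars` parameter name is an assumption made for illustration.

```python
def needs_ocr(text, min_chars=100):
    """Return True when extracted text is too short to be trusted.

    Hypothetical helper mirroring the check in extract_text:
    empty, whitespace-only, or very short text triggers OCR.
    """
    return len((text or "").strip()) < min_chars

print(needs_ocr(""))          # True: nothing was extracted
print(needs_ocr("   \n  "))   # True: whitespace only
print(needs_ocr("x" * 250))   # False: enough text to skip OCR
```

A fixed threshold is a coarse heuristic. A long scanned PDF with a short text-based cover page could pass the check, so a per-page threshold may be worth considering for such documents.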



## Step 5: Metadata Generation Functions

These functions use the extracted text to generate structured metadata. This includes generating a summary with a pre-trained Hugging Face model (`google/pegasus-xsum`) and collecting basic file statistics (file name, size in KB, and word count).


In [None]:

SUMMARIZATION_MODEL = "google/pegasus-xsum"
summarizer = None

def initialize_summarizer():
    """Initializes the summarization pipeline."""
    global summarizer
    if summarizer is None:
        print("Initializing summarization model...")
        summarizer = pipeline("summarization", model=SUMMARIZATION_MODEL)
        print("Model initialized.")

def generate_summary(text, max_length=150, min_length=30):
    """Generates a summary for the given text."""
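    # Return early on empty input so the model is not loaded unnecessarily
    if not text.strip():
        return "(No text to summarize)"

    if summarizer is None:
        initialize_summarizer()

    # Character-level cap to bound processing time; this is not a token limit
    max_input_length = 4096
    if len(text) > max_input_length:
        text = text[:max_input_length]

    # truncation=True lets the tokenizer cut inputs that still exceed the model's token limit
    summary_list = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False, truncation=True)
    return summary_list[0]['summary_text']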

def get_file_stats(file_path, text_content):
    """Generates basic statistics for a file."""
    return {
        "file_name": os.path.basename(file_path),
        "file_size_kb": round(os.path.getsize(file_path) / 1024, 2),
        "word_count": len(text_content.split())
    }

def generate_metadata(file_path):
    """The main function to generate a full set of metadata for a file."""
    print(f"Processing file: {file_path}")
    text_content = extract_text(file_path)
    if not text_content or text_content == 'File not found':
        print(f"Could not extract text from {file_path}")
        return None

    summary = generate_summary(text_content)
    stats = get_file_stats(file_path, text_content)

    metadata = {
        "summary": summary,
        **stats
    }
    
    return metadata
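
Note that `generate_summary` only looks at the first 4096 characters, so the rest of a long document does not influence the summary. One possible alternative, sketched below under the assumption that summarizing pieces separately is acceptable, is to split the text into chunks on whitespace. `chunk_text` is a hypothetical helper and is not part of the pipeline above.

```python
def chunk_text(text, max_chars=4096):
    """Split text into whitespace-delimited chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # +1 accounts for the space that joins this word to the previous one
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_text("aaa bbb ccc ddd", max_chars=7))  # ['aaa bbb', 'ccc ddd']
```

Each chunk could then be passed to `generate_summary` and the partial summaries joined into one string. This increases run time roughly in proportion to the number of chunks, so it is a trade-off rather than a strict improvement.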



## Step 6: Run the Pipeline

Now, let's test the entire pipeline. We will specify the path to a document and call the `generate_metadata` function. The first time you run this, it will download the summarization model, which may take a few minutes.


In [None]:

# Create the 'documents' folder and a small sample file if they do not already exist
os.makedirs('documents', exist_ok=True)
if not os.path.exists('documents/test.txt'):
    with open('documents/test.txt', 'w', encoding='utf-8') as f:
        f.write('This is a test text file for the notebook. It is short and simple.')

test_file_path = 'documents/test.txt' 
# You can change this to a .docx or .pdf file in the 'documents' folder
# test_file_path = 'documents/my_document.pdf'

# Generate the metadata
metadata_result = generate_metadata(test_file_path)

# Print the result in a clean JSON format
if metadata_result:
    print("\n--- Generated Metadata ---")
    print(json.dumps(metadata_result, indent=2))
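
To process every document in a folder rather than a single file, the same call can be wrapped in a loop. The following is a minimal sketch, not part of the original notebook: `generate_metadata_for_folder` and the `metadata.json` output name are hypothetical, and the function takes the metadata generator as an argument so it can be tried with a stub before the summarization model is loaded.

```python
import json
import os

SUPPORTED_EXTENSIONS = {".txt", ".docx", ".pdf"}

def generate_metadata_for_folder(folder, generate, output_path=None):
    """Apply generate(path) to each supported file in folder and collect the results."""
    results = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        ext = os.path.splitext(name)[1].lower()
        # Skip subfolders and file types the pipeline does not handle
        if not os.path.isfile(path) or ext not in SUPPORTED_EXTENSIONS:
            continue
        metadata = generate(path)
        if metadata:  # generate_metadata returns None when no text was extracted
            results.append(metadata)
    if output_path:
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2)
    return results

# In this notebook, the call would look like:
# all_metadata = generate_metadata_for_folder("documents", generate_metadata, "metadata.json")
```

Passing `output_path` writes the collected list to disk as JSON; leaving it as `None` only returns the list.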
