### Step 1: Install Libraries

Install the libraries needed to run the model locally with transformers and to read PDFs. File uploads are handled by the `google.colab` module, which is already available in Colab.


In [None]:
!pip install PyPDF2 transformers torch

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
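
The log above only mentions PyPDF2, presumably because the other packages were already present in the environment. A minimal sketch for confirming which versions are installed, using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["PyPDF2", "transformers", "torch"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Recording these versions alongside the notebook can make later runs easier to reproduce.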


### Step 2: Import Libraries and Set Up the Model and Tokenizer

Import the libraries, then load the T5 model and its tokenizer.


In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
# Imports the necessary classes from the 'transformers' library:
# 1. T5ForConditionalGeneration: This is the main class for the T5 model,
#    which is a Google-developed model excellent for sequence-to-sequence tasks
#    (like translation, question answering, and text summarization).
# 2. T5Tokenizer: This class is used to preprocess text (tokenization)
#    into a numerical format (tokens or IDs) that the T5 model can understand.

import PyPDF2
# Imports the PyPDF2 library, a common tool in Python for reading and manipulating PDF files.

from google.colab import files
# Imports the 'files' utility from the Google Colab environment,
# which allows users to upload files directly from their local machine.

model_name = "t5-large"
# Selects which T5 checkpoint to load.
# "t5-large" is a higher-capacity checkpoint than "t5-small" or "t5-base".
# It generally produces better output, but it needs more memory (RAM/GPU)
# and runs more slowly than the smaller checkpoints.

tokenizer = T5Tokenizer.from_pretrained(model_name)
# Initializes the tokenizer. The 'from_pretrained' method downloads the
# specific vocabulary, rules, and configurations associated with "t5-large".
# This step prepares the tool for converting human-readable text into model inputs.

model = T5ForConditionalGeneration.from_pretrained(model_name)
# Initializes the T5 model itself. The 'from_pretrained' method downloads the
# pre-trained weights (the "knowledge") of the "t5-large" model.
# This prepares the neural network structure to perform the actual summarization task.

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

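
The memory trade-off between checkpoints can be estimated with simple arithmetic. The sketch below assumes float32 weights (4 bytes per parameter) and uses approximate parameter counts as reported for the T5 family; both are assumptions for illustration, and the helper `estimate_weight_gb` is hypothetical. It covers the weights only, not activations or beam-search state.

```python
# Approximate parameter counts for the T5 checkpoints; treat these as rough figures.
APPROX_PARAMS = {
    "t5-small": 60_000_000,
    "t5-base": 220_000_000,
    "t5-large": 770_000_000,
}


def estimate_weight_gb(num_params, bytes_per_param=4):
    """Estimate memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, params in APPROX_PARAMS.items():
        print(f"{name}: about {estimate_weight_gb(params):.2f} GB of float32 weights")
```

If the runtime runs out of memory, switching `model_name` to a smaller checkpoint is the simplest adjustment.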

### Step 3: Define Extraction Function

Define functions to read the PDF content, build the prompt, and extract metadata with the T5 model.


In [None]:
# ==============================================================================
# 1. Function to read PDF content (using PyPDF2)
# ==============================================================================
def read_pdf(file_path):
    """
    Reads a PDF file from the given path and extracts all text content
    using the PyPDF2 library. Note: PyPDF2 may struggle with complex
    layouts like tables or multi-column text.

    Args:
        file_path (str): Local path to the uploaded PDF file. The function opens it in binary mode ('rb').

    Returns:
        str: The concatenated raw text content from all pages.
    """
    # 'with open' ensures the file is closed automatically, even if an error occurs.
    with open(file_path, 'rb') as file:
        # Initialize the PdfReader object.
        reader = PyPDF2.PdfReader(file)
        # Initialize an empty string to hold the entire document's text.
        text = ""

        # Iterate over all pages in the PDF.
        for page in reader.pages:
            # Some pages (for example, scanned images) may yield no text,
            # so fall back to an empty string. The trailing newline keeps
            # words at page boundaries from merging.
            text += (page.extract_text() or "") + "\n"

    # Return the entire extracted text block.
    return text

# ==============================================================================
# 2. Function to create the instruction prompt
# ==============================================================================
def create_prompt(text):
    """
    Constructs a detailed, instruction-based prompt for the T5 model.
    This method is known as 'prompt engineering' and guides the LLM to
    perform a specific structured task (metadata extraction and JSON output).

    Args:
        text (str): The document text extracted from the PDF.

    Returns:
        str: The complete, formatted prompt string ready for the model.
    """
    # The triple quotes allow a multi-line f-string; the {text} placeholder is filled with the document content.
    return f"""
    You are a highly advanced metadata extraction tool designed to handle various document types, including legal contracts, judgments, research papers, reports, articles, and general documents. Your task is to analyze the provided document content and extract all relevant metadata. The metadata can include, but is not limited to:

    - The type of document (e.g., legal contract, research paper, judgment, article, report, etc.)
    - Important entities involved (e.g., people, companies, organizations, dates, locations, etc.)
    - Any dates mentioned (e.g., creation date, judgment date, publication date, expiration date)
    - Key sections or parts of the document (e.g., titles, chapters, clauses, sections, etc.)
    - Any references made (e.g., laws, regulations, academic references, research citations, case law)
    - Summary or abstract of the document, if applicable

    Ensure you cover the document's most relevant aspects, and feel free to add anything you think is important for understanding the document's context or content.
    Extract the metadata in JSON format.

    Document content:
    {text}
    """


# ==============================================================================
# 3. Function for metadata extraction using the T5 model
# ==============================================================================
# NOTE: This function assumes 'tokenizer' and 'model' (T5ForConditionalGeneration)
# have been globally initialized before this function is called.
def extract_metadata(text):
    """
    Encodes the prompt, feeds it to the T5 model, and generates the structured metadata.

    Args:
        text (str): The document text extracted from the PDF.

    Returns:
        str: The generated metadata string, expected to be in JSON format.
    """
    # 1. Prompt Creation:
    prompt = create_prompt(text)

    # 2. Tokenization/Encoding:
    # Convert the human-readable prompt string into numerical tensors (token IDs)
    # that the T5 model requires.
    inputs = tokenizer.encode(
        prompt,
        return_tensors="pt",  # Specifies the output format should be PyTorch tensors.
        max_length=512,       # Limits the input length to 512 tokens. (Crucial for T5 limits)
        truncation=True       # If the prompt is longer than 512, it is cut short.
    )

    # 3. Generation (Inference):
    # Pass the encoded input to the model to generate the output sequence (the metadata).
    metadata_ids = model.generate(
        inputs,
        max_length=500,       # Sets the maximum length of the output metadata.
        min_length=50,        # Sets the minimum length of the output metadata.
        length_penalty=2.0,   # A parameter that encourages longer, more comprehensive outputs.
        num_beams=4,          # Activates Beam Search with a beam size of 4, improving output quality.
        early_stopping=True   # Ends beam search once num_beams complete candidate sequences have been found.
    )

    # 4. Decoding:
    # Convert the numerical output tokens (metadata_ids) back into a human-readable string.
    # [0] selects the first sequence (as only one was generated).
    # skip_special_tokens=True removes special tokens such as <pad> and </s> (end of sequence).
    metadata = tokenizer.decode(metadata_ids[0], skip_special_tokens=True)

    # Return the final extracted metadata string (which should be JSON).
    return metadata
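
Because `extract_metadata` truncates the encoded prompt to 512 tokens, and the instruction text counts against that limit, only the beginning of a long document reaches the model. One possible workaround, sketched below, is to split the extracted text into smaller pieces and process each piece separately. The helper `split_into_chunks` is hypothetical, the default of 300 words is an arbitrary illustration, and word count is only a rough proxy for token count.

```python
def split_into_chunks(text, max_words=300):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Step through the word list in strides of max_words and rejoin each slice.
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    sample = "one two three four five"
    for index, chunk in enumerate(split_into_chunks(sample, max_words=2)):
        print(index, chunk)
```

In the notebook, each chunk could then be passed to `extract_metadata` in turn, with the partial results combined afterwards; how best to merge them depends on the document type.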

### Step 4: Upload and Extract Metadata

Upload a PDF file. Its metadata is extracted and displayed, then saved to a text file that is downloaded to your machine.
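
The model's output is expected, but not guaranteed, to be valid JSON, so it may be worth validating the string before saving it. A minimal sketch using the standard `json` module; the helper `parse_metadata` is hypothetical and not part of the original notebook:

```python
import json


def parse_metadata(raw):
    """Return (parsed, None) if raw is valid JSON, otherwise (None, error message)."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # Keep the parser's message so the caller can report what went wrong.
        return None, str(exc)


if __name__ == "__main__":
    parsed, error = parse_metadata('{"document_type": "report"}')
    print(parsed, error)
    parsed, error = parse_metadata("not valid JSON")
    print(parsed, error)
```

When parsing fails, the raw string can still be saved as plain text, as the cell below does.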


In [None]:
# ==============================================================================
# 4. MAIN EXECUTION BLOCK: Upload, Process, and Download
# ==============================================================================

# 1. File Upload Widget:
# This command initiates the Google Colab file uploader, prompting the user
# to select a file from their local machine. The result is stored in the
# 'uploaded' dictionary.
uploaded = files.upload()

# 2. Conditional Execution:
# Checks if the 'uploaded' dictionary is not empty, meaning a file was successfully provided.
if uploaded:
    # 3. Get Filename:
    # files.upload() returns a dictionary where keys are filenames.
    # We retrieve the name of the first uploaded file (the key).
    filename = next(iter(uploaded))

    # 4. Text Extraction:
    # Calls the previously defined 'read_pdf' function to open the uploaded file
    # and extract all its textual content.
    content = read_pdf(filename)

    # 5. Metadata Extraction (AI Processing):
    # Calls the 'extract_metadata' function. This is the core AI step where:
    # a) The prompt is created using the document 'content'.
    # b) The prompt is encoded and fed to the T5 model.
    # c) The model generates the structured metadata (expected to be JSON).
    metadata = extract_metadata(content)

    # 6. Display Output:
    # Prints a header and the final generated metadata to the Colab console
    # for immediate user viewing.
    print("Extracted Metadata:")
    print(metadata)

    # 7. Save to File:
    # Opens a new file named "extracted_metadata.txt" in write mode ("w"),
    # using UTF-8 so that non-ASCII characters are written consistently.
    with open("extracted_metadata.txt", "w", encoding="utf-8") as file:
        # Writes the extracted metadata string into the newly created file.
        file.write(metadata)

    # 8. Download Result:
    # Initiates the Google Colab download utility, which pushes the created file
    # ("extracted_metadata.txt") back to the user's local machine for persistence.
    files.download("extracted_metadata.txt")

# 9. Handle No Upload:
# If the 'uploaded' dictionary was empty (user closed the dialog or canceled),
# this message is printed.
else:
    print("No file uploaded. Please upload a document to extract metadata.")

Saving extracted_metadata.txt to extracted_metadata (1).txt


PdfReadError: EOF marker not found
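
The output above suggests that the uploaded file was `extracted_metadata.txt`, a text file rather than a PDF, which would explain the `PdfReadError`. A minimal sketch of a check that could run before `read_pdf` is called; the helper `looks_like_pdf` is hypothetical. It is a strict test of the first five bytes, so it may reject unusual PDFs that carry extra bytes before the header.

```python
def looks_like_pdf(path):
    """Return True if the file begins with the PDF signature b'%PDF-'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


if __name__ == "__main__":
    import os
    import tempfile

    # Write a small text file and confirm that it is not mistaken for a PDF.
    with tempfile.TemporaryDirectory() as folder:
        sample_path = os.path.join(folder, "notes.txt")
        with open(sample_path, "wb") as handle:
            handle.write(b"plain text, not a PDF")
        print(looks_like_pdf(sample_path))
```

In the main cell, a failed check could print a message asking for a PDF instead of letting the reader raise an exception.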