### Step 1: Install Libraries

Install the libraries needed to read PDFs (`PyPDF2`) and call the OpenAI API (`openai`). File uploads are handled by `google.colab`, which is already available in Colab.


In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
!pip install openai
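
To confirm that the installs succeeded before moving on, a quick check like the following can help. It is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the status of the packages this notebook relies on.
for name in ("PyPDF2", "openai"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line reports `not installed`, re-run the matching install cell before continuing.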



### Step 2: Import Libraries and Set Up the OpenAI Client

Import the libraries, prompt for the API key, and set up the OpenAI client and model name.


In [None]:
# --- 1. Import Necessary Libraries ---

# Imports the main OpenAI client class. This is required to connect to and use the OpenAI API.
from openai import OpenAI

# Imports the PyPDF2 library, which is used for reading, splitting, and extracting text from PDF files.
import PyPDF2

# Imports the 'files' module, which is specific to Google Colab.
# This module allows you to trigger a file upload dialog in your Colab notebook.
from google.colab import files

# Imports the built-in JSON library. This is used to "parse" (read) the
# JSON-formatted text that the model returns and turn it into a Python dictionary.
import json

# Imports the 'getpass' function. This is a secure way to ask for a password or
# API key because it hides the text as the user types it.
from getpass import getpass

# Imports the regular expression library, which can be used for cleaning up extracted text.
import re


# Securely prompts the user to enter their API key.
# The key won't be visible on the screen as they type.
api_key = getpass("Enter your OpenAI API Key: ")

# Initializes the OpenAI client object. We pass our API key to authenticate.
# All API calls will be made using this 'client' object.
client = OpenAI(api_key=api_key)

# Defines the specific OpenAI model we want to use.
# Using a variable here is good practice, so if you want to change to
# "gpt-4o" or another model, you only have to change it in one place.
MODEL_NAME = "gpt-4o-mini"

Enter your OpenAI API Key: ··········
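
Typing the key on every run can get tedious. One possible variation, sketched here under the assumption that the key may already be stored in the `OPENAI_API_KEY` environment variable, reads it from there and prompts only when it is absent. The helper name `resolve_api_key` and its parameters are hypothetical.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the key from OPENAI_API_KEY if set, otherwise prompt for it."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    # No usable key in the environment: ask for it without echoing the input.
    return prompt("Enter your OpenAI API Key: ").strip()


# Usage (commented out so the sketch does not prompt when pasted as-is):
# api_key = resolve_api_key()
```

Passing `env` and `prompt` as parameters keeps the helper easy to test without touching the real environment.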


### Step 3: Define Extraction Function

Define a function that reads a PDF and returns its text. The text is passed to the OpenAI model in a later step.


In [None]:
def read_pdf(file_path):
    """
    Extracts text content from a PDF file using PyPDF2.

    This function opens the PDF in binary mode, initializes a PdfReader object,
    and iterates through all pages to extract and concatenate the text content.

    Args:
        file_path (str): Path to the PDF file to be read

    Returns:
        str: Complete text content extracted from all pages of the PDF

    Note:
        PyPDF2 may not handle complex layouts (tables, multi-column) perfectly.
        Consider using pdfplumber or PyMuPDF for advanced PDF parsing needs.
    """
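    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)

        # extract_text() can return an empty value for pages without a text
        # layer (e.g. scanned images), so fall back to "" and join pages with
        # newlines to keep words at page boundaries from running together.
        pages = [page.extract_text() or "" for page in reader.pages]

    return "\n".join(pages)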
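
Text extracted from PDFs often contains irregular spacing and words split across line breaks. The `re` module imported in Step 2 can tidy this up before the text is sent to the model. The function below is a minimal sketch, not part of the original notebook; `clean_text` is a hypothetical helper, and the hyphen rule is a heuristic that can also merge genuine compounds (for example "contour-" followed by "based" on the next line).

```python
import re


def clean_text(text):
    """Normalize whitespace in text extracted from a PDF."""
    # Rejoin words hyphenated across a line break, e.g. "meta-\ndata" -> "metadata".
    # Heuristic: this also merges real hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

It could be applied as `content = clean_text(read_pdf(filename))` before extraction.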

### Step 4: Create a Well-Structured Prompt

In [None]:

def create_prompt(text):
    """
    Constructs a comprehensive prompt for metadata extraction using prompt engineering techniques.

    The prompt guides the AI model to:
    1. Understand what metadata means in document context
    2. Identify key metadata categories (who, what, when, where, structure, references)
    3. Extract relevant information based on document content
    4. Output results in structured JSON format

    Args:
        text (str): The extracted document content from PDF

    Returns:
        str: Formatted prompt string ready for the AI model
    """
    return f"""
You are an expert metadata extraction system. Your task is to analyze the provided document content and extract all relevant metadata.

### What is Metadata?
Metadata is "data about data." It's the contextual information that describes the document. This includes:

- **Who:** Authors, parties involved, companies, organizations
- **What:** Title, document type (e.g., 'Legal Contract', 'Research Paper', 'News Article', 'Judgment'), key topics, brief summary
- **When:** Publication date, creation date, effective date, judgment date, or any other relevant dates
- **Where:** Locations, jurisdictions, publication venues (e.g., journal name, court name)
- **Structure:** Key sections, headings, clauses, or organizational elements
- **References:** Citations, laws referenced, case numbers, DOI, ISBN, or other external documents

### Comprehensive Example: Research Paper

**Example Document Text:**

"The Role of AI in Climate Change Mitigation
Dr. Evelyn Reed, Prof. Kenji Tanaka
Journal of Environmental Informatics, Vol. 18, Issue 2
Published: July 15, 2024
DOI: 10.1234/jei.2024.0045

Abstract:
This paper explores the application of artificial intelligence (AI) models in predicting extreme weather events and optimizing renewable energy grids. We analyze datasets from 2010-2023...
Keywords: Artificial Intelligence, Climate Change, Renewable Energy, Predictive Modeling"

**Expected Output:**
{{
  "document_type": "Research Paper",
  "title": "The Role of AI in Climate Change Mitigation",
  "authors": ["Dr. Evelyn Reed", "Prof. Kenji Tanaka"],
  "journal_name": "Journal of Environmental Informatics",
  "volume": "18",
  "issue": "2",
  "publication_date": "July 15, 2024",
  "doi": "10.1234/jei.2024.0045",
  "keywords": ["Artificial Intelligence", "Climate Change", "Renewable Energy", "Predictive Modeling"],
  "summary": "Explores AI applications in predicting extreme weather and optimizing renewable energy grids using 2010-2023 datasets"
}}

NOTE: The example above is for demonstration purposes only. Analyze the actual document content below.

---

### Your Task

Analyze the document content provided below. Extract all relevant metadata based on the actual content. If a field from the example (like 'doi' or 'journal_name') is not present in the document, omit it from the output.

**CRITICAL INSTRUCTION:**
Output ONLY a single, valid JSON object containing the extracted metadata.
Do NOT include any explanatory text, apologies, or markdown formatting before or after the JSON.
The response must start with {{ and end with }}.

**Document Content:**
{text}
"""
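
The doubled braces in the prompt above are there because it is an f-string: `{{` and `}}` produce literal `{` and `}`, while `{text}` is replaced by the document content. A small illustration, using a hypothetical `build_demo_prompt` function:

```python
def build_demo_prompt(text):
    # {{ and }} are emitted as literal braces; {text} is replaced by the argument.
    return f"""Return JSON like {{"title": "..."}}.

Document:
{text}"""


demo = build_demo_prompt("Sample body")
print(demo)
```

Braces inside the substituted `text` are inserted as-is and are not interpreted again, so documents containing `{` or `}` do not break the prompt.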

### Step 5: Upload and Extract Metadata

Upload a PDF file. Its metadata is extracted, printed, and saved to `extracted_metadata.json`, which is then downloaded to your computer.
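
The extraction step parses the model's reply with `json.loads` and returns the raw text if parsing fails. If a reply ever arrives wrapped in a Markdown code fence despite the instructions, a more tolerant parser along these lines could recover it. This is a sketch under that assumption; `parse_model_json` is a hypothetical helper and is not used by the cell below.

```python
import json
import re

# Three backticks, built programmatically to keep this example easy to embed.
FENCE = "`" * 3


def parse_model_json(raw):
    """Parse a model reply as JSON, tolerating a surrounding Markdown fence."""
    cleaned = raw.strip()
    # Drop an optional opening fence (with or without a "json" tag).
    cleaned = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", cleaned)
    # Drop an optional closing fence.
    cleaned = re.sub(r"\s*" + FENCE + r"$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Signal failure to the caller instead of raising.
        return None
```

Returning `None` on failure lets the caller decide whether to retry, log the raw reply, or report an error.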



In [None]:
def extract_metadata(text, max_chars=15000):
    """
    Extracts structured metadata from document text using OpenAI's GPT-4o mini model.

    This function performs the following steps:
    1. Truncates text if it exceeds the maximum character limit (to manage token costs)
    2. Creates a structured prompt using prompt engineering
    3. Sends the prompt to OpenAI's chat completion API
    4. Processes and returns the AI-generated metadata in JSON format

    Args:
        text (str): The document text extracted from PDF
        max_chars (int): Maximum characters to send to the API (default: 15000)
                        Helps control API costs and stay within token limits

    Returns:
        str: Extracted metadata in JSON format

    Note:
        - GPT-4o mini has a context window that can handle large documents
        - Adjust max_chars based on your document size and API budget
        - The function uses temperature=0 for more consistent outputs
          (this reduces, but does not fully eliminate, run-to-run variation)
        - response_format with "json_object" asks the API for syntactically
          valid JSON; output cut off by max_tokens can still fail to parse
    """

    # --- 1. Truncate Input Text ---
    # This is a safety and cost-control measure.
    # If the text is longer than `max_chars`, we'll only send the first
    # 15,000 characters (by default) to the model.
    if len(text) > max_chars:
        text = text[:max_chars] + "..."

    # --- 2. Create the Prompt ---
    # This calls the `create_prompt` function (which you defined earlier)
    # to build the full set of instructions for the model.
    prompt = create_prompt(text)

    # --- 3. Call the OpenAI API ---
    # This 'try...except' block will "catch" any errors if the API call fails
    # (e.g., network error, invalid API key).
    try:
        # This is the actual API call to OpenAI.
        response = client.chat.completions.create(
            # Use the model name we defined earlier (e.g., "gpt-4o-mini")
            model=MODEL_NAME,

            # The 'messages' list defines the conversation.
            messages=[
                {
                    "role": "system",
                    "content": "You are a precise metadata extraction expert. You analyze documents and return structured JSON metadata. Always respond with valid JSON only, no additional text."
                },
                {
                    "role": "user",
                    "content": prompt # This is our big prompt containing the instructions and document text
                }
            ],

            # temperature=0 makes the model's output as consistent and "non-creative" as possible.
            # This is ideal for extraction tasks.
            temperature=0,

            # Sets a limit on how long the model's response can be.
            max_tokens=1500,

            # This CRITICAL parameter forces the model to output a valid JSON object.
            # This greatly improves reliability.
            response_format={"type": "json_object"}
        )

        # --- 4. Safely Extract the Response ---
        # Take the text of the first choice. If there are no choices, or the
        # message content is missing (None), fall back to an empty JSON object "{}".
        first_choice = response.choices[0] if response and response.choices else None
        content = first_choice.message.content if first_choice and first_choice.message else None
        metadata = content or "{}"

        # --- 5. Parse and Format the JSON ---
        # We use a *nested* 'try...except' block here.
        # The outer block caught API errors; this inner block catches JSON parsing errors.
        try:
            # `json.loads` converts the model's JSON *string* into a Python *dictionary*.
            parsed_metadata = json.loads(metadata)
            # `json.dumps` converts the Python *dictionary* back into a JSON *string*,
            # but 'indent=2' makes it "pretty-printed" and easy to read.
            return json.dumps(parsed_metadata, indent=2)
        except json.JSONDecodeError:
            # If the model *still* sent bad JSON (rare with 'json_object' mode),
            # just return the raw text it sent so we can debug it.
            return metadata

    except Exception as e:
        # If the *outer* 'try' block (the API call) failed,
        # return a formatted JSON error message.
        return json.dumps({"error": "Failed to extract metadata", "details": str(e)}, indent=2)


# --- Main Execution Block (Runs the script) ---

# This command is specific to Google Colab. It opens a file upload dialog in your notebook.
uploaded = files.upload()

# This 'if' block checks if the 'uploaded' dictionary is not empty (i.e., a file was uploaded).
if uploaded:
    # Get the filename of the *first* file that was uploaded.
    # `next(iter(...))` is just a way to get the first key from the dictionary.
    filename = next(iter(uploaded))

    # Call `read_pdf` (defined in Step 3) to get the text content
    # of the uploaded file.
    content = read_pdf(filename)

    # This is where we call our main function, passing in the text from the PDF.
    metadata = extract_metadata(content)

    # --- Print the results to the console ---
    print("\n" + "="*80) # Prints a '====' separator line
    print("EXTRACTED METADATA")
    print("="*80)
    print(metadata) # Print the pretty-printed JSON string
    print("="*80 + "\n")

    # --- Save the results to a file ---

    # Define the output filename
    output_filename = "extracted_metadata.json"

    # Open the file in "write" mode ('w')
    # The 'with' statement ensures the file is automatically closed.
    with open(output_filename, "w") as file:
        # Write the metadata string to the file.
        file.write(metadata)

    # This is another Colab-specific command.
    # It triggers a download of the specified file from Colab to your local computer.
    files.download(output_filename)

else:
    # This 'else' block runs if no file was uploaded.
    print("No file uploaded. Please upload a PDF document to extract metadata.")

Saving 67.pdf to 67 (2).pdf

EXTRACTED METADATA
{
  "document_type": "Conference Paper",
  "title": "Contour\u2010based Repositioning of lower limbs of the GHBMC Human Body FE Model",
  "authors": [
    "Aditya Chhabra",
    "Sachiv Paruchuri",
    "Dhruv Kaushik",
    "Kshitij Mishra",
    "Anoop Chawla",
    "Sudipto Mukherjee",
    "Rajesh Malhotra"
  ],
  "conference_name": "IRCOBI Conference 2017",
  "reference_number": "IRC-17-67",
  "summary": "This paper presents a contour-based repositioning technique for lower limbs of the GHBMC Human Body Model, improving posture-specific human body models for injury prediction.",
  "funding": "This research has received funding from the European Union Seventh Framework Programme under grant agreement n\u00b0605544 (PIPER project).",
  "keywords": [
    "Contour-based Repositioning",
    "Human Body Model",
    "Injury Prediction",
    "Posture-specific Models"
  ]
}
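
The `\u2010` and `\u00b0` sequences in the output are JSON escapes for a Unicode hyphen and a degree sign. `json.dumps` produces them because its `ensure_ascii` option defaults to `True`, and they decode back to the original characters when the file is loaded. If readable characters in the saved file are preferred, the following sketch shows the difference (the sample dictionary is illustrative only):

```python
import json

sample = {"title": "Contour\u2010based Repositioning"}

escaped = json.dumps(sample)                       # default: ensure_ascii=True
readable = json.dumps(sample, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)
print(readable)

# Both forms decode to the same Python object.
assert json.loads(escaped) == json.loads(readable) == sample
```

When writing the readable form to disk, opening the file with `encoding="utf-8"` avoids platform-dependent encoding errors.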


