#### Model deployment and use
---

We have already pre-trained a BioBERT model for classification, summarization, and relevance scoring of scientific papers related to healthspan and lifespan. It is saved in a local directory. Here we deploy it and use it to scan a directory, then display a summary and metadata for each of the 15 most recent documents relevant to the topic, followed by an overall summary of those documents.


1. Load the pre-trained BioBERT model and tokenizer

In [15]:
import json
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline, AutoModelForSeq2SeqLM


In [None]:
model_dir = "../models/biobert_healthspan"
#model_name = "michiyasunaga/BioBART-large"

In [18]:

model = AutoModelForSequenceClassification.from_pretrained(model_dir)

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_dir)

2. Set up PDF text extraction

In [None]:
#import fitz  # PyMuPDF

#def extract_text_from_pdf(pdf_path):
#    doc = fitz.open(pdf_path)
#    text = ""
#    for page_num in range(doc.page_count):
#        page = doc.load_page(page_num)
#        text += page.get_text()
#    return text


3. Or work from the JSON file

- Load the JSON file

We start by loading the JSON file and reading the relevant fields (title, author, date, processed_text, and filename).

- Preprocess the documents

We then run relevance classification and summarization on processed_text, ignoring the label.

- Display summaries of the top 15 relevant documents

In [9]:
# PDF extraction is no longer necessary because we use the stored processed_text.
# Load the document records from the JSON file.
json_file_path = "../data/preprocessed_pdf_info_list.json"  # Replace with the path to your JSON file

with open(json_file_path, "r", encoding="utf-8") as file:
    pdf_data = json.load(file)
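
The JSON records are expected to carry `title`, `author`, `date`, `processed_text`, and `filename`. A minimal sketch of pulling out just those fields with safe defaults might look like the following. The helper name `extract_fields` and the small sample record are illustrative assumptions, not part of the original notebook.

```python
# Hypothetical helper: keep only the fields used later in the notebook.
FIELDS = ("title", "author", "date", "processed_text", "filename")

def extract_fields(record):
    """Return a dict with the expected keys, or None if the record is unusable."""
    if not isinstance(record, dict):
        return None
    # Missing keys become None so later checks can treat them uniformly.
    return {field: record.get(field) for field in FIELDS}

# Illustrative record (made up for this sketch, not taken from the dataset).
sample = {"title": "Example", "filename": "example.pdf", "label": 1,
          "processed_text": "some text"}
cleaned = extract_fields(sample)
print(cleaned)
```

Applied to the loaded list, something like `[extract_fields(r) for r in pdf_data]` would yield uniform records, with `None` marking entries that are not dictionaries.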



In [10]:
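# Classify relevance of documents
def classify_document(text):
    """Return the predicted class index for a single document."""
    # Inputs longer than 512 tokens are truncated, so only the start of the text is scored.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions.item()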



In [19]:
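# Summarization function
# Note: facebook/bart-large-cnn accepts at most 1024 input tokens, and this call
# does not request truncation, so longer texts can fail inside the model.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_document(text):
    summary = summarizer(text, max_length=200, min_length=50, do_sample=False)
    return summary[0]['summary_text']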



Device set to use cpu


In [21]:
# Process and classify the documents
relevant_documents = []

In [23]:
# Ensure pdf_data is a list
if not isinstance(pdf_data, list):
    raise ValueError("Expected 'pdf_data' to be a list of dictionaries.")

In [25]:
for idx, document in enumerate(pdf_data):
    try:
        # Ensure document is a dictionary
        if not isinstance(document, dict):
            print(f"Warning: Document at index {idx} is not a dictionary. Skipping...")
            continue

        # Retrieve processed text
        processed_text = document.get("processed_text", "")

        # Ensure processed_text is a valid string
        if not isinstance(processed_text, str) or not processed_text.strip():
            print(f"Warning: Document '{document.get('filename', 'Unknown')}' has no valid processed text. Skipping...")
            continue

        # Classify document
        relevance = classify_document(processed_text)

        # Ensure classification output is valid
        if not isinstance(relevance, int):
            print(f"Error: Classification output for '{document.get('filename', 'Unknown')}' is not an integer. Skipping...")
            continue

        # Check if document is relevant
        if relevance == 1:  # Assuming '1' means relevant
            relevant_documents.append(document)

    except Exception as e:
        print(f"Error processing document at index {idx}: {e}")


Error processing document at index 473: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]


In [26]:
print(f"Total relevant documents found: {len(relevant_documents)}")

Total relevant documents found: 653
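
The loop above treats class index 1 as "relevant". One way to check that assumption is to look for an `id2label` entry in the `config.json` that `save_pretrained` writes next to the model weights. The sketch below assumes that file exists in the model directory. The helper name `read_label_map` is hypothetical, and the mapping may be absent or hold only generic names such as `LABEL_1`, in which case the training code remains the only authority.

```python
import json
from pathlib import Path

def read_label_map(model_dir):
    """Return {class_index: label_name} from config.json, or {} if unavailable."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return {}
    with open(config_path, "r", encoding="utf-8") as fh:
        config = json.load(fh)
    # JSON object keys are strings, so convert them back to integers.
    return {int(idx): name for idx, name in config.get("id2label", {}).items()}

# Path taken from the notebook; adjust if the model lives elsewhere.
print(read_label_map("../models/biobert_healthspan"))
```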


In [28]:
from datetime import datetime

In [31]:
def get_sortable_date(doc):
    """
    Extracts and converts the date to a sortable format.
    If the date is missing, invalid, or doc is None, returns a default old date.
    """
    try:
        if not isinstance(doc, dict):  # Ensure doc is a valid dictionary
            print(f"Warning: Encountered an invalid document entry: {doc}. Skipping.")
            return datetime(1900, 1, 1)

        date_str = doc.get("date", "")

        if not isinstance(date_str, str) or not date_str.strip():
            print(f"Warning: Missing or invalid date in document '{doc.get('filename', 'Unknown')}'. Using default old date.")
            return datetime(1900, 1, 1)

        return datetime.strptime(date_str.strip(), "%Y-%m-%d")  # Adjust format if needed

    except Exception as e:
        print(f"Error processing date for document '{doc.get('filename', 'Unknown')}': {e}")
        return datetime(1900, 1, 1)  # Return default old date to avoid breaking the sort


In [32]:
# Sort documents safely
try:
    relevant_documents = [doc for doc in relevant_documents if isinstance(doc, dict)]  # Remove None or invalid entries
    relevant_documents.sort(key=get_sortable_date, reverse=True)
except Exception as e:
    print(f"Error while sorting documents: {e}")
    raise  # Re-raise for debugging if needed

Error processing date for document '0706.1996.pdf': unconverted data remains: T23:21:48
Error processing date for document '1102.0933.pdf': unconverted data remains: T13:41:06
Error processing date for document '1102.3369.pdf': unconverted data remains: T11:49:30
Error processing date for document '1109.1296.pdf': unconverted data remains: T02:25:30
Error processing date for document '1207.1891.pdf': unconverted data remains: T22:37:20
Error processing date for document '1209.5046.pdf': unconverted data remains: T07:18:13
Error processing date for document '1209.5754.pdf': unconverted data remains: T11:42:42
Error processing date for document '1210.0037.pdf': unconverted data remains: T16:35:17
Error processing date for document '1210.7480.pdf': unconverted data remains: T01:01:44
Error processing date for document '1211.4911.pdf': unconverted data remains: T22:18:10
Error processing date for document '1301.1077.pdf': unconverted data remains: T01:36:25
Error processing date for docume
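
The errors above show that the stored dates are ISO 8601 timestamps such as `2007-06-13T23:21:48`, which the `%Y-%m-%d` format rejects. Every affected document therefore falls back to the 1900 default, so the sort no longer reflects recency. A possible fix, sketched here with the standard library's `datetime.fromisoformat`, accepts both date-only and timestamp strings. The function name `parse_date` is a hypothetical replacement for the parsing line, and any other date formats in the data would still need separate handling.

```python
from datetime import datetime

DEFAULT_DATE = datetime(1900, 1, 1)  # same fallback as get_sortable_date

def parse_date(date_str):
    """Parse 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SS'; fall back to DEFAULT_DATE."""
    if not isinstance(date_str, str) or not date_str.strip():
        return DEFAULT_DATE
    try:
        # Drop any UTC offset so naive and offset-aware values can be compared.
        return datetime.fromisoformat(date_str.strip()).replace(tzinfo=None)
    except ValueError:
        return DEFAULT_DATE

# Dates copied from the notebook output, plus a missing value.
dates = ["2007-06-13T23:21:48", "2018-10-22T13:41:06", None, "2011-01-28T11:49:30"]
ordered = sorted(dates, key=parse_date, reverse=True)
print(ordered)
```

With parsing of this kind inside `get_sortable_date`, documents with missing dates would still sort last, while dated documents would be ordered by their actual timestamps.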

In [None]:
# Sort by the most recent based on the date or filename
#relevant_documents.sort(key=lambda x: x["date"], reverse=True)  # Sort by 'date' or 'filename'



In [33]:
# Select the 15 most recent relevant documents
top_15_documents = relevant_documents[:15]



In [35]:
summaries = []
for document in top_15_documents:
    try:
        # Ensure document is a dictionary
        if not isinstance(document, dict):
            print(f"Warning: Skipping invalid document entry: {document}")
            continue

        # Check if "processed_text" exists and is valid
        processed_text = document.get("processed_text", "").strip()
        if not processed_text:
            print(f"Warning: Missing or empty processed text for '{document.get('filename', 'Unknown')}'. Using default summary.")
            summary = "No summary available due to missing or empty content."
        else:
            summary = summarize_document(processed_text)

        summaries.append({
            'title': document.get("title", "Unknown Title"),
            'author': document.get("author", "Unknown Author"),
            'date': document.get("date", "Unknown Date"),
            'filename': document.get("filename", "Unknown Filename"),
            'summary': summary
        })
    
    except Exception as e:
        print(f"Error summarizing document '{document.get('filename', 'Unknown')}': {e}")
        summaries.append({
            'title': document.get("title", "Unknown Title"),
            'author': document.get("author", "Unknown Author"),
            'date': document.get("date", "Unknown Date"),
            'filename': document.get("filename", "Unknown Filename"),
            'summary': "Error generating summary."
        })



Error summarizing document '0706.1996.pdf': index out of range in self
Error summarizing document '1102.0933.pdf': index out of range in self
Error summarizing document '1102.3369.pdf': index out of range in self
Error summarizing document '1109.1296.pdf': index out of range in self
Error summarizing document '1201.2900.pdf': index out of range in self
Error summarizing document '1207.1891.pdf': index out of range in self
Error summarizing document '1209.5046.pdf': index out of range in self
Error summarizing document '1209.5754.pdf': index out of range in self
Error summarizing document '1210.0037.pdf': index out of range in self
Error summarizing document '1210.7480.pdf': index out of range in self
Error summarizing document '1211.4911.pdf': index out of range in self
Error summarizing document '1301.1077.pdf': index out of range in self
Error summarizing document '1302.3861.pdf': index out of range in self
Error summarizing document '1304.0479.pdf': index out of range in self
Error 
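
`index out of range in self` is most likely raised because the full paper text exceeds the 1024-token input limit of `facebook/bart-large-cnn`. Passing `truncation=True` in the `summarizer(...)` call is one option, at the cost of summarizing only the start of each paper. Another is to split the text into smaller pieces first. The sketch below shows the splitting step only. It counts whitespace-separated words as a rough stand-in for tokens, and the name `split_into_chunks` and the 400-word default are assumptions for illustration.

```python
def split_into_chunks(text, max_words=400):
    """Split text into consecutive pieces of at most max_words words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[start:start + max_words])
            for start in range(0, len(words), max_words)]

# Illustrative input: 1000 short words instead of a real paper.
demo_text = " ".join(f"w{i}" for i in range(1000))
chunks = split_into_chunks(demo_text)
print(len(chunks), [len(chunk.split()) for chunk in chunks])
```

Each chunk could then be passed to `summarize_document` and the partial summaries joined. Word counts only approximate token counts, so a lower `max_words` may be needed for text with many rare terms.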

In [36]:
# Display the summaries of the top 15 relevant documents
for summary in summaries:
    print(f"Document: {summary['title']} ({summary['filename']})")
    print(f"Author: {summary['author']}")
    print(f"Date: {summary['date']}")
    print(f"Summary: {summary['summary']}")
    print("="*50)



Document: RTR Planet RB4arXiv June 2007 (0706.1996.pdf)
Author: Razvan T. Radulescu
Date: 2007-06-13T23:21:48
Summary: Error generating summary.
Document: Unknown Title (1102.0933.pdf)
Author: Unknown Author
Date: 2018-10-22T13:41:06
Summary: Error generating summary.
Document: Microsoft Word - Gen_med_insilmac_acc_chng.doc (1102.3369.pdf)
Author: Steven Watterson
Date: 2011-01-28T11:49:30
Summary: Error generating summary.
Document: Unknown Title (1109.1296.pdf)
Author: Unknown Author
Date: 2022-03-28T02:25:30
Summary: Error generating summary.
Document: Model of pathogenesis of psoriasis. Part 2. Local processes. (1201.2900.pdf)
Author: Mikhail Peslyak
Date: None
Summary: Error generating summary.
Document: Unknown Title (1207.1891.pdf)
Author: sachin
Date: 2012-07-08T22:37:20
Summary: Error generating summary.
Document: Unknown Title (1209.5046.pdf)
Author: Unknown Author
Date: 2022-03-12T07:18:13
Summary: Error generating summary.
Document: View 10 Articles from Experimental Cell R

In [37]:
# Overall summary of the top 15 documents
combined_text = " ".join([doc["processed_text"] for doc in top_15_documents])
overall_summary = summarize_document(combined_text)
print("Overall Summary of the Top 15 Documents:")
print(overall_summary)

IndexError: index out of range in self
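
Concatenating 15 full papers produces an input far beyond what the summarizer accepts, which is consistent with the `IndexError` above. One alternative is a two-stage approach: shorten each document first, then summarize the joined short versions. The sketch below takes the summarizer as a parameter so the control flow can be shown without loading a model. The names `summarize_fn`, `two_stage_summary`, and the stand-in `first_words` are all hypothetical, and in the notebook `summarize_fn` would be a length-safe variant of `summarize_document`.

```python
def two_stage_summary(texts, summarize_fn, max_words=400):
    """Summarize each text, then summarize the concatenation of those summaries."""
    partial = []
    for text in texts:
        if not isinstance(text, str) or not text.strip():
            continue  # skip missing or empty documents
        # Cap the input length before calling the summarizer.
        capped = " ".join(text.split()[:max_words])
        partial.append(summarize_fn(capped))
    if not partial:
        return ""
    # Cap the joined summaries as well before the final pass.
    combined = " ".join(" ".join(partial).split()[:max_words])
    return summarize_fn(combined)

def first_words(text, n=5):
    """Stand-in for a real summarizer: keep the first n words."""
    return " ".join(text.split()[:n])

docs = ["alpha " * 50, None, "beta " * 50]
overall = two_stage_summary(docs, first_words)
print(overall)
```

The word-based cap is only a rough proxy for the model's token limit, so the value of `max_words` would need tuning against the real tokenizer.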

Deployment options: depending on your application, you can deploy the model as a REST API using a framework such as Flask or FastAPI, integrate it into an existing data pipeline, or use it within an interactive application.
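
As one illustration of the REST option, the request-handling logic can be kept separate from the web framework so that it is easy to test. The sketch below is framework-agnostic: `handle_classify_request` and its payload shape (`{"text": ...}`) are assumptions, and `classify_fn` stands in for `classify_document`. A Flask or FastAPI route would parse the JSON body, call this function, and return the resulting dictionary with the given status code.

```python
def handle_classify_request(payload, classify_fn):
    """Validate a JSON-like payload and return (status_code, response_dict)."""
    if not isinstance(payload, dict):
        return 400, {"error": "request body must be a JSON object"}
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        return 400, {"error": "'text' must be a non-empty string"}
    try:
        label = classify_fn(text)
    except Exception as exc:  # report model failures instead of crashing
        return 500, {"error": str(exc)}
    # Follows the notebook's assumption that class index 1 means relevant.
    return 200, {"label": label, "relevant": label == 1}

# Stand-in classifier for demonstration only; the real one is classify_document.
def fake_classifier(text):
    return 1 if "lifespan" in text.lower() else 0

print(handle_classify_request({"text": "Lifespan extension in mice"}, fake_classifier))
print(handle_classify_request({"text": ""}, fake_classifier))
```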