# Cache & Conquer: Accelerating Research Summarization with Gemini 1.5  

### Introduction  

In the health sciences, researchers often face the daunting task of reviewing hundreds or even thousands of academic papers to determine which are relevant to their study. This process is not only time-intensive but also critical, as faster and more efficient research leads to better-informed medical decisions and, ultimately, better patient care. Traditional methods of literature review rely heavily on manual effort and keyword-based searches, which can overlook key insights buried in long or complex documents.  

This notebook addresses this challenge with the Google Gemini 1.5 API. Gemini 1.5 models suit the task because of their long context window: up to 2 million tokens in a single request for Gemini 1.5 Pro, and up to 1 million for Gemini 1.5 Flash, the model used in this notebook. This capacity allows direct analysis of large bodies of text, such as full academic papers, without pre-segmenting or reducing the input data. In addition, *context caching* improves efficiency by storing uploaded documents for repeated use within a set timeframe, eliminating redundant uploads and reducing latency.

### The Approach  

This notebook automates the process of evaluating research papers against predefined inclusion and exclusion criteria using the following steps:  

1. **Setup and Configuration:**  
   The environment is prepared with essential libraries, and the Gemini API is configured for interaction. A unique identifier (`QUERY_UUID`) organizes the workflow outputs, ensuring clear file management.  

2. **Context Caching:**  
   Research papers are uploaded and stored in a context cache, along with a main prompt defining the evaluation criteria. This cache is valid for two hours, allowing the model to access and utilize the uploaded documents without re-uploading for each request.  

3. **Generative Model Configuration:**  
   The Gemini 1.5 model is initialized with custom settings, including a response schema that ensures structured, consistent JSON outputs. These outputs include the paper title, the first author's name, and, for each of the four criteria (population, intervention/exposure, comparator, outcome), a rationale and a pass/fail flag.  

4. **Iterative Paper Processing:**  
   Each paper is processed individually. The uploaded paper and a tailored prompt, which includes the inclusion and exclusion criteria, are sent to the model backed by the cached context. The resulting structured responses are parsed, saved, and compiled into a comprehensive dataset for review.  

5. **Data Export:**  
   The results are organized into a Pandas DataFrame and exported as a CSV file. This provides researchers with an easily interpretable summary of the analysis, enabling quick identification of relevant papers.  
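
Step 2 relies on a cache that expires after two hours, so a long-running session may need to know whether the cache is still usable before sending more requests. The sketch below is local bookkeeping only and does not query the Gemini API; `cache_is_fresh` and the example timestamps are hypothetical names and values introduced for illustration.

```python
import datetime

def cache_is_fresh(created_at, ttl, now=None):
    """Return True while a cache created at `created_at` is still within its TTL.

    Hypothetical helper: it compares timestamps locally and does not ask the API.
    """
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return now < created_at + ttl

# Same two-hour lifetime as the cache created in this notebook
TTL = datetime.timedelta(hours=2)

# Example creation time (an assumption, chosen only to make the sketch reproducible)
created = datetime.datetime(2024, 1, 1, 9, 0, tzinfo=datetime.timezone.utc)

fresh_after_1h = cache_is_fresh(created, TTL, now=created + datetime.timedelta(hours=1))
fresh_after_3h = cache_is_fresh(created, TTL, now=created + datetime.timedelta(hours=3))
print(fresh_after_1h, fresh_after_3h)
```

If the check fails, the files would need to be cached again before further papers are processed.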

### Why This Matters  

By integrating Gemini 1.5’s long-context capabilities and context caching into this workflow, this notebook transforms the literature review process. Researchers can now process entire libraries of documents in a fraction of the time, focusing their expertise on analysis rather than manual review. This approach not only improves efficiency but also empowers researchers to uncover insights that might otherwise be overlooked.  

Whether you’re working in medicine, public health, or any field that requires the synthesis of vast amounts of research, this solution offers a faster, smarter way to stay informed and make impactful decisions.  

In [None]:
import json
import os
import google.generativeai as genai
from google.generativeai import caching
from google.ai.generativelanguage_v1beta.types import content
import datetime
import time
from dotenv import load_dotenv

load_dotenv()
QUERY_UUID = "debd9b3c-4531-462c-b2c2-983b2710fe81"  # UUID of the query that stored the PDFs retrieved from the search

# Read the system prompt; the context manager closes the file handle afterwards
with open("instructions.txt", "r") as f:
    PROMPT = f.read()


genai.configure(api_key=os.environ['GEMINI_API_KEY'])

base = "./cache/"
paths = os.listdir(base)
files = []

# Upload every file in the cache directory so it can be stored in the context cache
for path in paths:
    files.append(genai.upload_file(path=os.path.join(base, path)))

num_papers = len(paths)


# Create the model
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 40,
  "max_output_tokens": 8192,
  "response_schema": content.Schema(
    type = content.Type.OBJECT,
    required = ["title", "first author name", "populationRationale", "interventionExposureRationale", "comparatorRationale", "outcomeRationale", "populationPass", "interventionExposurePass", "comparatorPass", "outcomePass"],
    properties = {
      "title": content.Schema(
        type = content.Type.STRING,
      ),
      "first author name": content.Schema(
        type = content.Type.STRING,
      ),
      "populationRationale": content.Schema(
        type = content.Type.STRING,
      ),
      "interventionExposureRationale": content.Schema(
        type = content.Type.STRING,
      ),
      "comparatorRationale": content.Schema(
        type = content.Type.STRING,
      ),
      "outcomeRationale": content.Schema(
        type = content.Type.STRING,
      ),
      "populationPass": content.Schema(
        type = content.Type.BOOLEAN,
      ),
      "interventionExposurePass": content.Schema(
        type = content.Type.BOOLEAN,
      ),
      "comparatorPass": content.Schema(
        type = content.Type.BOOLEAN,
      ),
      "outcomePass": content.Schema(
        type = content.Type.BOOLEAN,
      ),
    },
  ),
  "response_mime_type": "application/json",
}

# Create a cache with a 2 hour TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='cache', # used to identify the cache
    system_instruction=(PROMPT),
    contents=files,
    ttl=datetime.timedelta(hours=2),
)

model = genai.GenerativeModel.from_cached_content(cached_content=cache,  generation_config=generation_config)
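
The response schema above requires six text fields and four boolean flags. As a defensive measure, each parsed response could also be checked locally before it is added to the results. The following is a minimal sketch of such a check; `find_schema_problems` and both sample records are hypothetical and not part of the original notebook.

```python
# Field names copied from the response schema defined above
REQUIRED_TEXT = [
    "title", "first author name", "populationRationale",
    "interventionExposureRationale", "comparatorRationale", "outcomeRationale",
]
REQUIRED_BOOL = [
    "populationPass", "interventionExposurePass", "comparatorPass", "outcomePass",
]

def find_schema_problems(record):
    """Return a list of human-readable problems; an empty list means the record looks valid."""
    problems = []
    for key in REQUIRED_TEXT:
        if not isinstance(record.get(key), str):
            problems.append(f"{key}: expected a string")
    for key in REQUIRED_BOOL:
        if not isinstance(record.get(key), bool):
            problems.append(f"{key}: expected a boolean")
    return problems

# Hypothetical records, for illustration only
good_record = {key: "example text" for key in REQUIRED_TEXT}
good_record.update({key: True for key in REQUIRED_BOOL})

bad_record = dict(good_record)
del bad_record["title"]              # missing required text field
bad_record["outcomePass"] = "yes"    # wrong type for a pass flag

good_problems = find_schema_problems(good_record)
bad_problems = find_schema_problems(bad_record)
print(good_problems)
print(bad_problems)
```

A record with a non-empty problem list could be logged and retried rather than written to the final CSV.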

In [27]:
import pandas as pd
count = 0

base = f"pdfs/{QUERY_UUID}/"
paths = os.listdir(base)
files = []

# Keep only the PDF files. A new list is built because removing items from a list while iterating over it skips entries.
paths = [path for path in paths if path.endswith(".pdf")]

responses = []
# Upload each paper; the responses are generated in the loop further down. The prompt is in the paper_prompt.txt file.
for path in paths:
    files.append(genai.upload_file(path=base+path))

# Build the per-paper prompt: the instructions followed by the inclusion/exclusion criteria
with open("paper_prompt.txt", "r") as f:
    prompt = f.read()
with open("inclusionExclusionCriteria.txt", "r") as f:
    prompt += "\n\n" + f.read()
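for paper in files:
    # Send the paper together with the prompt; the cached context is attached through the model
    response = model.generate_content([paper, prompt])
    # The schema-constrained reply arrives as JSON text in the first part of the first candidate
    response_json = json.loads(response.candidates[0].content.parts[0].text)
    # Print the title as a progress indicator
    print(response_json["title"])
    responses.append(response_json)
    # Print the token usage once, for the first paper only
    if count == 0:
        print(response.usage_metadata)
        count += 1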


Effects of dual-task resistance exercise on cognition, mood, depression, functional fitness, and activities of daily living in older adults with cognitive impairment: a single-blinded, randomized controlled trial
prompt_token_count: 113094
candidates_token_count: 431
total_token_count: 113525
cached_content_token_count: 92844

A randomized study examining the effects of mild-to-moderate group exercises on cardiovascular, physical, and psychological well-being in patients with heart failure
A Comparison of Accelerometry Analysis Methods for Physical Activity in Older Adult Women and Associations with Health Outcomes Over Time
Prospective Study on the Association Between Adherence to Healthy Lifestyles and Depressive Symptoms Among Japanese Employees: The Furukawa Nutrition and Health Study
Moderators of response to cognitive behavior therapy for major depression in patients with heart failure
Depressive Symptoms are Associated with Heart Rate Variability Independently of Fitness: A Cros
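
The usage metadata printed for the first paper gives a rough view of how much of each request is served from the cache. The sketch below works through that arithmetic using the four numbers from the output above; it assumes that `prompt_token_count` already includes the cached tokens, which is consistent with the reported total.

```python
# Values copied from the usage metadata printed for the first paper
prompt_token_count = 113094
candidates_token_count = 431
total_token_count = 113525
cached_content_token_count = 92844

# Tokens that were sent fresh with this request (the paper plus its prompt)
uncached_prompt_tokens = prompt_token_count - cached_content_token_count

# Share of the prompt that came from the context cache
cached_fraction = cached_content_token_count / prompt_token_count

print(f"uncached prompt tokens: {uncached_prompt_tokens}")
print(f"cached share of prompt: {cached_fraction:.1%}")
```

For this request, 20,250 prompt tokens were new and roughly 82% of the prompt came from the cache.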

In [28]:
#from the responses, we need to create a dataframe that will be used to create the final report

df = pd.DataFrame(responses)
#name the Index column PaperID
df.index.name = "PaperID"

# Reorder the columns to match the field order of the response schema
column_order = [
    "title", "first author name",
    "populationRationale", "interventionExposureRationale", "comparatorRationale", "outcomeRationale",
    "populationPass", "interventionExposurePass", "comparatorPass", "outcomePass",
]
df = df[column_order]
#save the dataframe to a csv file
df.to_csv(f"pdfs/{QUERY_UUID}/responses.csv", index=True)
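
The exported file holds one pass flag per criterion rather than a single decision. One way to derive an overall verdict is sketched below, under the assumption (not stated in the original notebook) that a paper is included only when all four criteria pass. The sample responses and the `overallPass` name are hypothetical.

```python
# Pass flags defined in the response schema
PASS_COLUMNS = ["populationPass", "interventionExposurePass", "comparatorPass", "outcomePass"]

# Hypothetical parsed responses, shaped like the entries of `responses`
sample_responses = [
    {"title": "Paper A", "populationPass": True, "interventionExposurePass": True,
     "comparatorPass": True, "outcomePass": True},
    {"title": "Paper B", "populationPass": True, "interventionExposurePass": False,
     "comparatorPass": True, "outcomePass": True},
]

# Assumption: a paper is included only if every criterion passes
overall = [all(record[column] for column in PASS_COLUMNS) for record in sample_responses]

for record, decision in zip(sample_responses, overall):
    print(record["title"], "include" if decision else "exclude")
```

The resulting list could be stored as an extra column, for example `overallPass`, before the DataFrame is written to CSV.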
