## Cache & Conquer: Accelerating Research Analysis with Gemini and Caching
This Python notebook uses the Google Gemini API to process a set of PDF research papers and evaluate them against pre-defined inclusion/exclusion criteria. It leverages the Gemini `models/gemini-1.5-flash-001` model with a context cache to efficiently handle multiple documents.

Here's a breakdown of the notebook's functionality:

1. **Setup and Configuration:**
    - Loads environment variables (API key).
    - Configures the Gemini API.
    - Defines a UUID (`QUERY_UUID`) for organizing files.
    - Loads a main prompt from `instructions.txt`.

2. **Context Cache Creation:**
    - Uploads the files in the `./cache/` directory to Gemini with `genai.upload_file`.
    - Creates a cached content object from the main prompt (as the system instruction) and the uploaded files, with a 2-hour TTL. Later requests can then reference this content without resending it each time.

3. **Model Initialization:**
    - Initializes a Gemini generative model instance from the cached content with the specified generation configuration (temperature, top_p, top_k, max output tokens, response schema, and MIME type). The response schema enforces structured JSON output with the fields `title` and `first author name`, a rationale string for each criterion (population, intervention/exposure, comparator, outcome), and a boolean pass/fail flag for each criterion.

4. **Processing Individual Papers:**
    - Lists the PDF files in the directory named after the `QUERY_UUID` (`pdfs/{QUERY_UUID}/`) and uploads each one to Gemini.
    - Builds one combined prompt from `paper_prompt.txt` and `inclusionExclusionCriteria.txt`.
    - Iterates through each uploaded PDF:
        - Sends the PDF and the combined prompt to the Gemini model.
        - Parses the JSON response from the model.
        - Prints the extracted title.
        - Stores the parsed JSON response.
        - Prints usage metadata for the first request only.

5. **Dataframe Creation and Export:**
    - Creates a Pandas DataFrame from the collected JSON responses.
    - Sets "PaperID" as the index column.
    - Reorders columns to a specific arrangement.
    - Saves the DataFrame to a CSV file (`pdfs/{QUERY_UUID}/responses.csv`).
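
The structured output described in steps 3 and 4 can be checked before it reaches the DataFrame. The following is a minimal sketch, not part of the notebook: `validate_response` is a hypothetical helper, the field lists are copied from the notebook's response schema, and it assumes the model's reply text is a single JSON object.

```python
import json

# Field names copied from the notebook's response schema.
STRING_FIELDS = [
    "title",
    "first author name",
    "populationRationale",
    "interventionExposureRationale",
    "comparatorRationale",
    "outcomeRationale",
]
BOOLEAN_FIELDS = [
    "populationPass",
    "interventionExposurePass",
    "comparatorPass",
    "outcomePass",
]


def validate_response(raw_text):
    """Parse one reply and check its fields (hypothetical helper, not in the notebook)."""
    record = json.loads(raw_text)  # raises a ValueError subclass on malformed JSON
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")

    # Every field in the schema's `required` list must be present.
    missing = [name for name in STRING_FIELDS + BOOLEAN_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    # Check the declared types: STRING fields and BOOLEAN fields.
    for name in STRING_FIELDS:
        if not isinstance(record[name], str):
            raise ValueError(f"{name} should be a string")
    for name in BOOLEAN_FIELDS:
        if not isinstance(record[name], bool):
            raise ValueError(f"{name} should be a boolean")
    return record
```

In the processing loop, such a check could replace the bare `json.loads` call, so that a malformed reply fails with a clear message instead of surfacing later as a missing DataFrame column.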


**Key Features and Improvements:**

- **Context Caching:** Stores the main prompt and the reference PDFs in a context cache once, so they do not have to be resent with every per-paper request.
- **Structured Output:** Uses a response schema to ensure consistent, structured JSON output from the model.
- **Organized File Management:** Uses a `QUERY_UUID` to organize files and results.
- **Clearer Variable Names:** Uses descriptive variable names for improved readability.
- **Error Handling:** Not included. Wrapping file uploads and API calls in `try`/`except` blocks would make the notebook more robust to failures during file processing and API calls.
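
The error handling suggested in the last bullet could look like the following minimal sketch. `call_with_retries` is a hypothetical helper, not part of the notebook or the Gemini SDK. The attempt count and delays are illustrative assumptions, and catching `Exception` is deliberately broad because the notebook does not say which errors to expect.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0, **kwargs):
    """Call func, retrying on any exception (hypothetical helper, not in the notebook)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception as error:  # assumption: any failure is worth retrying
            last_error = error
            if attempt < attempts - 1:
                # Wait longer after each failure: base_delay, 2x, 4x, ...
                time.sleep(base_delay * 2 ** attempt)

    # All attempts failed; keep the last error attached as the cause.
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

In the processing loop, the model call could then be wrapped as `call_with_retries(model.generate_content, [paper, prompt])`, and the same wrapper would fit around `genai.upload_file`.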

Together, these steps evaluate a batch of PDF papers against specific criteria using the Google Gemini API. The output CSV file holds one row per paper with its title, first author, rationales, and pass/fail flags.


In [None]:
import json
import os
import google.generativeai as genai
from google.generativeai import caching
from google.ai.generativelanguage_v1beta.types import content
import datetime
import time
from dotenv import load_dotenv

load_dotenv()
QUERY_UUID="debd9b3c-4531-462c-b2c2-983b2710fe81"

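# Load the main prompt; the context manager closes the file after reading.
with open("instructions.txt", "r") as prompt_file:
    PROMPT = prompt_file.read()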


genai.configure(api_key=os.environ['GEMINI_API_KEY'])

base = "./cache/"
paths = os.listdir(base)
files = []

for path in paths:
    files.append(genai.upload_file(path=base+path))

# The model cannot handle more than 500k tokens at a time, which is a limitation
# of the context cache, so the intent is to keep ./cache/ to about 10 papers.
# Note: nothing below enforces that limit; every file in ./cache/ is uploaded and cached.
num_papers = len(paths)


# Create the model
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 40,
  "max_output_tokens": 8192,
  "response_schema": content.Schema(
    type = content.Type.OBJECT,
    required = ["title", "first author name", "populationRationale", "interventionExposureRationale", "comparatorRationale", "outcomeRationale", "populationPass", "interventionExposurePass", "comparatorPass", "outcomePass"],
    properties = {
      "title": content.Schema(
        type = content.Type.STRING,
      ),
      "first author name": content.Schema(
        type = content.Type.STRING,
      ),
      "populationRationale": content.Schema(
        type = content.Type.STRING,
      ),
      "interventionExposureRationale": content.Schema(
        type = content.Type.STRING,
      ),
      "comparatorRationale": content.Schema(
        type = content.Type.STRING,
      ),
      "outcomeRationale": content.Schema(
        type = content.Type.STRING,
      ),
      "populationPass": content.Schema(
        type = content.Type.BOOLEAN,
      ),
      "interventionExposurePass": content.Schema(
        type = content.Type.BOOLEAN,
      ),
      "comparatorPass": content.Schema(
        type = content.Type.BOOLEAN,
      ),
      "outcomePass": content.Schema(
        type = content.Type.BOOLEAN,
      ),
    },
  ),
  "response_mime_type": "application/json",
}

# Create a cache with a 2 hour TTL
cache = caching.CachedContent.create(
    model='models/gemini-1.5-flash-001',
    display_name='cache', # used to identify the cache
    system_instruction=(PROMPT),
    contents=files,
    ttl=datetime.timedelta(hours=2),
)



model = genai.GenerativeModel.from_cached_content(cached_content=cache,  generation_config=generation_config)




In [None]:
import pandas as pd
count = 0

base = f"pdfs/{QUERY_UUID}/"
# Keep only the PDF files. A list comprehension avoids removing items from
# the list while iterating over it, which can skip entries.
paths = [path for path in os.listdir(base) if path.endswith(".pdf")]
files = []

responses = []
#upload each paper so the model can generate a response for it. The prompt is in the paper_prompt.txt file
for path in paths:
    files.append(genai.upload_file(path=base+path))

#now we will generate the responses for each paper
with open("paper_prompt.txt", "r") as paper_prompt_file:
    prompt = paper_prompt_file.read()
with open("inclusionExclusionCriteria.txt", "r") as criteria_file:
    prompt += "\n\n" + criteria_file.read()
for paper in files:
    response = model.generate_content([paper, prompt])
    response_json = json.loads(response.candidates[0].content.parts[0].text)
    #print the response title
    print(response_json["title"])
    responses.append(response_json)
    if count == 0:
        print(response.usage_metadata)
        count += 1


In [None]:
#from the responses, we need to create a dataframe that will be used to create the final report

df = pd.DataFrame(responses)
#name the Index column PaperID
df.index.name = "PaperID"

#reorder the columns to match the order of the fields in the response schema
df = df[["title", "first author name", "populationRationale", "interventionExposureRationale", "comparatorRationale", "outcomeRationale", "populationPass", "interventionExposurePass", "comparatorPass", "outcomePass"]]
#save the dataframe to a csv file
df.to_csv(f"pdfs/{QUERY_UUID}/responses.csv", index=True)
