<a href="https://colab.research.google.com/github/Amirosimani/100_days_of_spice/blob/master/tempus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tempus Long context I/E criteria

|||
|----------|-------------|
| Author(s)   | amirimani@ astakhov@ |
| Last updated | 24/06/2024 - Initial Draft |
<br><br>


### To Do:

1. functions based on `time` criteria


-----
* patient `8e848225-2c52-4149-bb0d-60199380b20a` has more than 1M tokens
* patient `c621f7ca-0b27-4164-963a-757ef56f7db6` is filtered.


# Install dependencies

In [5]:
%pip install --upgrade --quiet google-cloud-aiplatform
%pip install --upgrade --quiet asynciolimiter nest_asyncio

# Restart the kernel runtime to load the private preview SDK
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

# Project Config

In [1]:
from google.colab import auth
auth.authenticate_user()

In [2]:
# from google.cloud import aiplatform
import vertexai
from vertexai.generative_models import GenerativeModel
import vertexai.preview.generative_models as generative_models


import re
import ast
import asyncio
import nest_asyncio
from asynciolimiter import Limiter
from sklearn.metrics import accuracy_score, f1_score

nest_asyncio.apply()


PROJECT_ID = "amir-genai-bb"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}
MODEL = "gemini-1.5-pro-001" # @param {type:"string"}

vertexai.init(project=PROJECT_ID, location=REGION)

# Data prep

In [3]:
import json
import warnings
import pandas as pd
from pprint import pprint

from google.cloud import storage

In [4]:
# helper functions

def load_json_from_gcs(bucket_name, file_name):
    """Loads a JSON file from Google Cloud Storage."""

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_name)

    json_string = blob.download_as_string()
    json_data = json.loads(json_string)

    return json_data

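def count_tokens(input_prompt, model_name=MODEL):
  """Returns the number of tokens in `input_prompt` for the given model."""
  # use the model_name argument rather than always falling back to MODEL
  model = GenerativeModel(model_name=model_name)
  return model.count_tokens(input_prompt).total_tokens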


def create_patient_df(data, count=False):

  # concat content based on patient_id
  print("Converting json to dataframe")
  df_attachments = pd.DataFrame.from_dict(data['attachments'])
  # join each patient's documents with a newline
  df_attachments = df_attachments.groupby('patient_id')['content'].agg('\n'.join).reset_index()
  # count tokens to ensure it is within the context window limit
  if count:
    print("Counting tokens...")
    df_attachments['token_count'] = df_attachments['content'].apply(count_tokens)

    # warn if any patient's history exceeds the 1M-token context window
    if (df_attachments["token_count"] > 1000000).any():
      warnings.warn("Some rows have larger than 1M tokens")

  # get labels and notes
  df_ptid = pd.DataFrame.from_dict(data['ptid_to_review'], orient="index").reset_index()
  df_ptid.columns = ["patient_id", "label", "note"]

  assert df_attachments.shape[0] == df_ptid.shape[0]



  df = pd.merge(df_attachments, df_ptid, on="patient_id", how="left")

  return df


def format_output(text):
  # greedy match from the first "{" to the last "}" (braces included)
  match = re.search(r'\{(.*)\}', text, re.DOTALL)

  if not match:
    # fail with a clear message instead of an undefined-variable error
    raise ValueError("No JSON object found in model output")

  json_text = match.group(0).strip()
  return ast.literal_eval(json_text)

In [5]:
# Get the data
bucket_name = "tempus-experiment"
file_name = "tempus.json"
data = load_json_from_gcs(bucket_name, file_name)

In [6]:
df = create_patient_df(data, count=False)

Converting json to dataframe


In [7]:
df = df[~df['patient_id'].isin(["8e848225-2c52-4149-bb0d-60199380b20a", "c621f7ca-0b27-4164-963a-757ef56f7db6"])]

# df = df[df['token_count'] < 1000000]
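The patient IDs above are hard-coded, and the notes at the top attribute one of them to exceeding 1M tokens. A minimal sketch of deriving the over-limit IDs from token counts instead, assuming a count per patient is available (for example from `create_patient_df(data, count=True)`). The `split_by_token_limit` helper, the `MAX_CONTEXT_TOKENS` name, and the sample counts are illustrative and not part of the notebook:

```python
MAX_CONTEXT_TOKENS = 1_000_000  # assumed limit, matching the commented-out filter


def split_by_token_limit(token_counts, limit=MAX_CONTEXT_TOKENS):
    """Return (kept_ids, dropped_ids) for a {patient_id: token_count} mapping."""
    kept = [pid for pid, n in token_counts.items() if n < limit]
    dropped = [pid for pid, n in token_counts.items() if n >= limit]
    return kept, dropped


# Hypothetical counts, for illustration only.
example_counts = {"patient-a": 250_000, "patient-b": 1_200_000, "patient-c": 999_999}
kept, dropped = split_by_token_limit(example_counts)
print("kept:", kept)
print("dropped:", dropped)
```

Logging the dropped IDs alongside the kept ones makes it easier to see which patients never reach the model.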

In [None]:
## store the dataframe as jsonl back in GCS

def save_to_gcs(df, bucket_name=bucket_name):
  # Convert DataFrame to JSON Lines format
  jsonl_data = df.to_json(orient='records', lines=True)

  # Initialize Google Cloud Storage client
  blob_name = 'df.jsonl'
  storage_client = storage.Client()

  bucket = storage_client.bucket(bucket_name)
  blob = bucket.blob(blob_name)
  blob.upload_from_string(jsonl_data, content_type='application/jsonl')

  print(f"DataFrame saved as JSONL to gs://{bucket_name}/{blob_name}")


In [None]:
save_to_gcs(df)
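For reference, JSON Lines stores one JSON object per line, which is the layout `to_json(orient='records', lines=True)` is asked to produce. The sketch below shows a round trip using only the standard library. The records are hypothetical placeholders, not rows from the dataset:

```python
import json

# Hypothetical records standing in for dataframe rows.
records = [
    {"patient_id": "patient-a", "label": "is_candidate"},
    {"patient_id": "patient-b", "label": "not_candidate"},
]

# Serialize: one JSON object per line.
jsonl_text = "\n".join(json.dumps(record) for record in records)

# Parse: each non-empty line is decoded independently.
parsed = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(parsed)
```

Because each line parses on its own, a large file can be streamed without loading every record into memory.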

# Gemini for I/E validation


* Exclusion is deterministic, i.e. `if an exclusion criterion is met, the patient is not a candidate`. [confirm with tempus]
* Otherwise, if the inclusion criteria are applicable, the patient could be either `is_candidate` or `watch`
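
The two rules above can be sketched as a small decision function. This is an illustration under stated assumptions, not the confirmed Tempus logic. The `triage` name and its boolean inputs are hypothetical, the split between `is_candidate` and `watch` (complete versus incomplete evidence) is an assumption, and so is the final fallback to `not_candidate` when inclusion criteria are not met:

```python
def triage(exclusion_met: bool, inclusion_met: bool, evidence_complete: bool) -> str:
    """Map criteria findings to one of the three labels used in this notebook."""
    # Rule 1: any exclusion criterion that is met rules the patient out.
    if exclusion_met:
        return "not_candidate"
    # Rule 2: inclusion applies, so the patient is a candidate or on watch.
    # Assumption: "watch" means some evidence is still missing.
    if inclusion_met:
        return "is_candidate" if evidence_complete else "watch"
    # Assumption: the source does not say what happens here.
    return "not_candidate"


print(triage(exclusion_met=True, inclusion_met=True, evidence_complete=True))
print(triage(exclusion_met=False, inclusion_met=True, evidence_complete=False))
```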

### ctgov_ie_criteria

This is the most generic approach: the criteria are taken directly from ClinicalTrials.gov.



In [16]:
print(data['ctgov_ie_criteria'])

Inclusion Criteria:

* Adult participants with loco-regional recurrent or metastatic breast disease not amenable to surgical resection or radiation therapy
* Confirmed diagnosis of ER+/HER2- breast cancer
* Prior therapies for locoregional recurrent or metastatic disease must fulfill all the following criteria:
* One line of CDK4/6 inhibitor therapy in combination with endocrine therapy. Only one line of CDK4/6 inhibitor is allowed in any setting.
* <= 1 endocrine therapy in addition to CDK4/6 inhibitor with ET
* Most recent endocrine treatment duration must have been given for >= 6 months prior to disease progression. This may be the endocrine treatment component of the CDK4/6 inhibitor line of therapy.
* Radiological progression during or after the last line of therapy.
* Measurable disease evaluable per Response Evaluation Criterion in Solid Tumors (RECIST) v.1.1 or non-measurable bone-only disease
* Eastern Cooperative Oncology Group (ECOG) performance status 0-1
* Participants sho

In [22]:
async def generate(content, rate_limiter, model_name=MODEL):

    generation_config = {
        "max_output_tokens": 8192,
        "temperature": 0.3,
        # "top_p": 1,
        # "top_k": 32
    }

    safety_settings = {
        generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_NONE,
        generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_NONE,
        generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_NONE,
        generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_NONE,
    }

    # wait on the limiter shared by all tasks before issuing the request
    await rate_limiter.wait()

    model = GenerativeModel(model_name=model_name,
                            system_instruction=[
    "You are an experienced clinical trial coordinator.",
    "Your primary goal is to rigorously assess patient medical history and find any relevant sections to the inclusion and exclusion criteria.",
    "Always adhere strictly to the given instructions. Never fabricate information; base your assessment solely on the provided data.",
    "Your output MUST be a list of all relevant sections as a JSON object with two keys:",
    "  * 'citation': The specific patient detail from their medical history",
    "  * 'reason': A clear explanation from the criteria supporting your decision"
    ],

    )

    # async call, so requests scheduled by asyncio.gather can overlap
    response = await model.generate_content_async(
        [content],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )
    return response.text



async def main(df, criteria=data['ctgov_ie_criteria']):
  # one limiter for the whole batch (20 requests/second); a limiter created
  # inside generate() would be private to each call and throttle nothing
  rate_limiter = Limiter(20)
  tasks = []
  for idx, row in df.iterrows():
    prompt = template.format(history=row["content"], criteria=criteria)
    tasks.append(asyncio.create_task(generate(prompt, rate_limiter)))

  # exceptions are returned in place of results so one failure does not abort the batch
  results = await asyncio.gather(*tasks, return_exceptions=True)
  return results
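
A throttle only has an effect when every task waits on the same object. The sketch below illustrates that pattern with the standard library alone. It swaps in `asyncio.Semaphore` (a cap on concurrent calls) for `asynciolimiter.Limiter` (a cap on calls per second), and `fake_model_call`, `run_all`, and `MAX_CONCURRENCY` are hypothetical stand-ins rather than the notebook's code:

```python
import asyncio

MAX_CONCURRENCY = 2  # illustrative value


async def fake_model_call(prompt, semaphore, stats):
    """Stand-in for a model request; records how many calls run at once."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulates network latency
        stats["active"] -= 1
        return f"response to {prompt}"


async def run_all(prompts):
    # Created once and passed to every task, so the cap applies to the batch.
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    stats = {"active": 0, "peak": 0}
    tasks = [asyncio.create_task(fake_model_call(p, semaphore, stats)) for p in prompts]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results, stats["peak"]


results, peak = asyncio.run(run_all(["p1", "p2", "p3", "p4", "p5"]))
print(results)
print("peak concurrency:", peak)
```

`asyncio.gather` returns results in the order the tasks were passed in, which is what lets the responses be joined back to the dataframe rows by position.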


In [23]:
template = """

You will be given the medical history of a patient. Read it carefully and make sure you understand all of it. Then you
will be given a set of inclusion and exclusion criteria. Your goal is to rigorously assess the patient's eligibility for a clinical trial based on the provided inclusion and exclusion criteria.
You may need to use one or more parts of the medical history when assessing the patient's eligibility.

===== Patient Medical History =====
{history}

===== Inclusion/Exclusion Criteria =====
{criteria}


===== Now let's start! =====
What are all the sections of the patient's history that are relevant to the provided inclusion/exclusion criteria?

"""

In [24]:
r = asyncio.run(main(df))

In [20]:
def format_output(text):
  fallback = {"result": "NA", "reason": "NA", "citation": "NA"}
  try:
    match = re.search(r'\{(.*)\}', text, re.DOTALL)
  except TypeError:
    # `text` is not a string, e.g. an exception returned by asyncio.gather
    return fallback

  if not match:
    # no braces in the response: return the placeholder instead of failing
    print("No match found!!!!")
    return fallback

  return ast.literal_eval(match.group(0).strip())
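
`ast.literal_eval` parses Python literals, so it rejects JSON-only tokens such as `true`, `false`, and `null`. A possible alternative, sketched here with `json.loads` swapped in and a placeholder returned on any failure. The `extract_json` name and the `FALLBACK` constant are illustrative, and the sample strings are made up:

```python
import json
import re

FALLBACK = {"result": "NA", "reason": "NA", "citation": "NA"}


def extract_json(text):
    """Return the first {...} span in `text` as a dict, or a placeholder."""
    if not isinstance(text, str):
        return dict(FALLBACK)  # e.g. an exception object from asyncio.gather
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return dict(FALLBACK)
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(FALLBACK)
    return parsed if isinstance(parsed, dict) else dict(FALLBACK)


# Made-up model output wrapped in a markdown fence.
sample = 'Here you go:\n```json\n{"result": "watch", "reason": "r", "citation": null}\n```'
print(extract_json(sample))
print(extract_json("no braces here"))
```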

In [21]:
df_gemini = pd.DataFrame([format_output(x) for x in r])
df_gemini = pd.concat([df.reset_index(drop=True), df_gemini],axis=1)
df_gemini.head()

| | patient_id | content | label | note | result | reason | citation |
|---|---|---|---|---|---|---|---|
| 0 | 06a7bc2c-bc5d-41f2-8150-2092c89ba8c7 | `patient_id\n: X6940405\npatient_mrn\n: 5997693...` | is_candidate | pt w/ met breast ca, unable to find ER percent... | watch | The patient's medical history confirms a diagn... | Diagnosis:C50.912 - Malignant neoplasm of unsp... |
| 1 | 0e9dedab-33a5-47f0-990f-2a4b55a14e8e | `patient_id\npatient_mrn\n: 075864942\n: 320872...` | not_candidate | pt w/ Met Breast CA ER+ HER2 neg, on 1L AI+Ve... | not_candidate | The patient has active brain metastases, as ev... | 8/8/20: MRI of the brain and pituitary: 1. The... |
| 2 | 13b40448-21a8-4167-bfca-40d3176792e0 | `patient_id\n: Q606124\npatient_mrn\n: 7496847\...` | not_candidate | Stage IV ER+/HER2- L Breast ca s/p part. maste... | not_candidate | The patient has received prior treatment with ... | Jan 2023 therapy changed to Ibrance 100 mg 21/... |
| 3 | 144568cf-3b86-4ad1-a868-daaf090eba10 | `patient_id\n: V7901691\npatient_mrn\n: 2952772...` | not_candidate | Stage IV R ER+/HER2- IDC s/p tamoxifen/ribocic... | watch | While the patient's medical history confirms a... | The provided medical history mentions treatmen... |
| 4 | 1717aedf-0d30-463b-8168-bd7ea37f0961 | `Protocol: A A Randomised, Multicentre, Double ...` | not_candidate | Recurrent/metastatic Stage IV ER+/HER2- Breast... | not_candidate | The patient was noted to have grade 3 neutrope... | Safety labs were reviewed and a Grade 3 Neutro... |


# Evaluate

In [None]:
print(f'Accuracy: {accuracy_score(df_gemini["label"], df_gemini["result"])*100:.2f}%')
print(f'F1 score:  {f1_score(df_gemini["label"], df_gemini["result"], average="weighted")*100:.2f}%')
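
As a sanity check on what `accuracy_score` measures, the sketch below recomputes accuracy by hand for the five rows shown in the `df_gemini.head()` output and tallies the (label, result) pairs. The variable names are illustrative, and five rows say nothing about the score on the full dataset:

```python
from collections import Counter

# `label` and `result` columns of the five rows shown in df_gemini.head().
labels = ["is_candidate", "not_candidate", "not_candidate", "not_candidate", "not_candidate"]
results = ["watch", "not_candidate", "not_candidate", "watch", "not_candidate"]

# Accuracy is the fraction of rows where the prediction equals the label.
matches = sum(1 for y, y_hat in zip(labels, results) if y == y_hat)
accuracy = matches / len(labels)

# Tally of (label, result) pairs: a flat view of the confusion matrix.
pair_counts = Counter(zip(labels, results))

print(f"Accuracy on these rows: {accuracy * 100:.2f}%")
for (label, result), count in sorted(pair_counts.items()):
    print(f"label={label:<14} result={result:<14} n={count}")
```

The pair tally shows which kinds of disagreement occur, here `watch` predicted for both an `is_candidate` and a `not_candidate` label, which a single accuracy figure hides.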

In [None]:
# df_gemini[["label", "result", "token_count"]]

In [None]:
df_gemini[df_gemini["label"] != df_gemini["result"]]

In [None]:
df_sample = df_gemini[df_gemini["patient_id"] == "6a7473fd-6229-4d75-8c31-2db7248ad594"]

In [None]:

df_sample["content"][8]

In [None]:

df_sample["note"][8]

# Debug

In [None]:
df_gemini[df_gemini["label"]!="not_candidate"]

In [None]:
df_sub = df[["content", "label"]]
df_sub.columns = ["question", "answer"]
df_sub.to_json('data.json')

In [None]:

# Save each row to a separate text file
for index, row in df.iterrows():
    file_name = f"file_{index}.txt"  # Create file name based on index
    with open(file_name, 'w') as file:
        file.write(row['content'])