In [3]:
EXAMPLE_DOC = """Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland
Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients.
"""

In [7]:
SYS_PROMPT = """You are a query-generation assistant. For each document you receive, generate exactly 10 distinct questions based on the provided document, topics, and keywords.
Your output MUST adhere to the following format STRICTLY:

Rules:
- Exactly 10 questions.
- Each question on a new line.
- NO introductory or concluding text (e.g., "Here are the questions:", "Okay, here are...", "These are the queries:").
- NO numbering (e.g., "1.", "2.").
- NO bullet points or other list markers (e.g., "-", "*").
- Your entire response must consist ONLY of the 10 questions, each separated by a newline character.

When generating the queries, consider the document content, topics, and keywords together:

- **Implicitly cover** the document’s topics in proportion to their weights:  
   - High-weight topics should steer more of your queries’ framing.  
   - You may express topics via synonyms, broader phrases, or context (e.g., topic “survival analysis” → “risk estimation over time”).
   - Ensure your 10 queries, as a set, reflect the topic balance.

- **Explicitly include** the document’s keywords:
   - Embed them naturally (e.g., “breast cancer death” can appear as is).
"""

USER_PROMPT = """Here is an example document with keywords and weighted topics. Generate 10 queries without any additional text:

Document:
Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland
Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995–2003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08–9.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38–0.55 and HR 0.54, 95% CI 0.44–0.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins’ effect on survival in breast cancer patients.

Keywords:
breast cancer death
cohort study finland
used cox proportional
estimate mortality
short term use

Topics (with weights):
[
  ["Statin Use and Breast Cancer Survival/Mortality", 0.50],
  ["Epidemiological Cohort Study Methodology (Finland)", 0.25],
  ["Analysis of Confounding Factors (Healthy Adherer Bias) and Causal Inference", 0.25]
]
"""

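The user prompt above hardcodes one document together with its keywords and topics. A minimal sketch of a helper that assembles the same layout for any document is given here; the function name `build_user_prompt` and its argument types are assumptions for illustration, not part of the original notebook.

```python
import json


def build_user_prompt(
    document: str,
    keywords: list[str],
    topics: list[tuple[str, float]],
) -> str:
    """Assemble a user prompt in the same layout as USER_PROMPT above."""
    keyword_block = "\n".join(keywords)
    # One '["topic name", weight]' entry per line, weights with two decimals.
    topic_lines = ",\n".join(
        f"  [{json.dumps(name, ensure_ascii=False)}, {weight:.2f}]"
        for name, weight in topics
    )
    return (
        "Here is an example document with keywords and weighted topics. "
        "Generate 10 queries without any additional text:\n\n"
        f"Document:\n{document.strip()}\n\n"
        f"Keywords:\n{keyword_block}\n\n"
        f"Topics (with weights):\n[\n{topic_lines}\n]\n"
    )
```

Calling `build_user_prompt(EXAMPLE_DOC, keywords, topics)` with the five keywords and three weighted topics listed above should give a prompt with the same sections as the hardcoded `USER_PROMPT`.
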
In [11]:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)


chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    # messages=[
    #     {"role": "system", "content": "You are an expert Question Generation AI. Your task is to carefully read the provided document and generate exactly 10 distinct questions that can be directly and definitively answered using ONLY the information present in the document. You must strictly adhere to output formatting instructions and generate only 10 distinct questions without any additional text."},
    #     {"role": "user", "content": f"Please read the following document and generate 10 questions: {EXAMPLE_DOC}"},
    # ],
    messages=[
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
    temperature=0.7,
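    # NOTE: 256 tokens is too few for 10 questions; the output below is cut off partway through question 9.
    max_completion_tokens=256,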
    # extra_body={"guided_json": Query.model_json_schema()},
)

print(chat_response.choices[0].message.content)

1. What is the estimated mortality rate among breast cancer patients who use statins for the first time?
2. Can you explain the difference between post-diagnostic and pre-diagnostic statin use in relation to breast cancer survival?
3. What is the statistical significance of the association between post-diagnostic statin use and breast cancer death?
4. How does the estimated mortality rate among breast cancer patients who use statins compare to the general population?
5. Can you identify the potential causal effect of statin use on breast cancer survival in a clinical trial?
6. What is the estimated proportion of breast cancer patients who use statins for the first time?
7. How does the analysis of confounding factors (healthy adherer bias) affect the observed association between post-diagnostic statin use and breast cancer death?
8. What is the estimated time dependence of the survival benefit among pre-diagnostic statin users?
9. Can you identify the potential limitations of the study
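
The output above is numbered despite the "NO numbering" rule, and its last line is cut off by the token limit. A small post-processing step can make downstream use more robust. The sketch below is one possible cleanup, not part of the original notebook: `parse_queries` is a hypothetical helper, and it assumes every valid query ends with a question mark, which would not hold for title-style queries.

```python
import re

# Leading list markers such as "1.", "2)", "-", "*", or "•".
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_queries(raw: str, expected: int = 10) -> list[str]:
    """Strip list markers and keep only lines that look like complete questions."""
    queries = []
    for line in raw.splitlines():
        line = _LIST_MARKER.sub("", line).strip()
        # Assumption: a complete query ends with "?". This drops intro text
        # and a final line truncated by the completion token limit.
        if line.endswith("?"):
            queries.append(line)
    if len(queries) != expected:
        print(f"warning: expected {expected} queries, got {len(queries)}")
    return queries
```

Applied to `chat_response.choices[0].message.content`, this would return the complete questions without their numbers and warn that fewer than 10 were found.
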

In [None]:
"""Llama-3.2-1B-Instruct output when prompted with keywords and weighted topics:
Here are 10 queries without any additional text:

1. How do statin use and breast cancer survival mortality rates vary among different age groups?
2. What is the effect of pre-diagnostic statin use on breast cancer death in a Finnish cohort study?
3. Which statistical method was used to analyze the survival benefit of statin use in breast cancer patients?
4. What is the association between statin use and breast cancer death in a nationwide cohort study?
5. How does post-diagnostic statin use affect the risk of breast cancer death in a Finnish population?
6. Can pre-diagnostic statin use lower the risk of breast cancer death in patients with different tumor characteristics?
7. What is the dose-response relationship between statin use and breast cancer death in a Finnish cohort study?
8. How does the impact of healthy adherer bias affect the observed survival benefit of statin use in breast cancer patients?
9. What is the time-dependent effect of statin use on breast cancer death in a Finnish cohort study?
10. What are the key factors that influence the effect of statin use on breast cancer survival mortality rates in a Finnish population?
"""

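The system prompt asks the model to include the keywords explicitly, yet by a literal substring match only "breast cancer death" appears verbatim in the output recorded above. A minimal sketch of how such a check could be automated follows; `keyword_coverage` is a hypothetical helper, and exact case-insensitive substring matching is an assumption that ignores reordered or inflected forms such as "Finnish cohort study".

```python
def keyword_coverage(queries: list[str], keywords: list[str]) -> dict[str, int]:
    """Count how many queries contain each keyword as a case-insensitive substring."""
    lowered = [q.lower() for q in queries]
    return {kw: sum(kw.lower() in q for q in lowered) for kw in keywords}


sample_queries = [
    "What is the effect of pre-diagnostic statin use on breast cancer death in a Finnish cohort study?",
    "Which statistical method was used to analyze the survival benefit of statin use in breast cancer patients?",
]
sample_keywords = ["breast cancer death", "cohort study finland", "used cox proportional"]
print(keyword_coverage(sample_queries, sample_keywords))
# → {'breast cancer death': 1, 'cohort study finland': 0, 'used cox proportional': 0}
```

A keyword with a count of zero flags a case where the model paraphrased the keyword or ignored it.
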
## Promptagator Prompt

In [1]:
import json

FEW_SHOT_EXAMPLES_PATH = "/home/guest/r12922050/GitHub/d2qplus/prompts/promptagator/few_shot_examples.jsonl"
with open(FEW_SHOT_EXAMPLES_PATH, "r") as f:
    few_shot_examples = [json.loads(line) for line in f]
print(f"Loaded {len(few_shot_examples)} few-shot examples from {FEW_SHOT_EXAMPLES_PATH}")
print(few_shot_examples[0])

Loaded 8 few-shot examples from /home/guest/r12922050/GitHub/d2qplus/prompts/promptagator/few_shot_examples.jsonl
{'query_id': 'PLAIN-1', 'query_text': 'Why Deep Fried Foods May Cause Cancer', 'doc_id': 'MED-2421', 'doc_text': 'Background: Acrylamide is a common dietary exposure that crosses the human placenta. It is classified as a probable human carcinogen, and developmental toxicity has been observed in rodents. Objectives: We examined the associations between prenatal exposure to acrylamide and birth outcomes in a prospective European mother–child study. Methods: Hemoglobin (Hb) adducts of acrylamide and its metabolite glycidamide were measured in cord blood (reflecting cumulated exposure in the last months of pregnancy) from 1,101 singleton pregnant women recruited in Denmark, England, Greece, Norway, and Spain during 2006–2010. Maternal diet was estimated through food-frequency questionnaires. Results: Both acrylamide and glycidamide Hb adducts were associated with a statisticall
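
The prompt template built later reads only the `doc_text` and `query_text` fields of each example, so a malformed line in the JSONL file would surface as a `KeyError` in the middle of prompt construction. A sketch of a loader that checks those fields up front is shown here; `load_few_shot_examples` and `REQUIRED_KEYS` are names introduced for illustration.

```python
import json
from pathlib import Path

# The two fields the few-shot prompt template reads from each record.
REQUIRED_KEYS = {"query_text", "doc_text"}


def load_few_shot_examples(path: str) -> list[dict]:
    """Load a JSONL file and fail early if a record lacks a required field."""
    examples = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            examples.append(record)
    return examples
```

Under these assumptions, `load_few_shot_examples(FEW_SHOT_EXAMPLES_PATH)` would be a drop-in replacement for the list comprehension above.
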

In [14]:
PROMPTAGATOR_SYS_PROMPT = """You are a query generation assistant. Your task is to generate a query based on the provided article. Your output MUST adhere to the following format strictly:

Rules:
- NO introductory or concluding text (e.g., "Here are the questions:", "Okay, here are...", "These are the queries:").
- Exactly 1 query

The following are some examples:\n"""
# Append each few-shot example as an "Article: ... / Query: ..." pair.
for example in few_shot_examples:
    PROMPTAGATOR_SYS_PROMPT += f"Article: {example['doc_text']}\n"
    PROMPTAGATOR_SYS_PROMPT += f"Query: {example['query_text']}\n\n"
PROMPTAGATOR_USER_PROMPT = f"Here is an example document. Generate a query without any additional text:\n\nArticle: {EXAMPLE_DOC}"


In [19]:
print(PROMPTAGATOR_SYS_PROMPT)

You are a query generation assistant. Your task is to generate a query based on the provided article. Your output MUST adhere to the following format strictly:

Rules:
- NO introductory or concluding text (e.g., "Here are the questions:", "Okay, here are...", "These are the queries:").
- Exactly 1 query

The following are some examples:
Article: Background: Acrylamide is a common dietary exposure that crosses the human placenta. It is classified as a probable human carcinogen, and developmental toxicity has been observed in rodents. Objectives: We examined the associations between prenatal exposure to acrylamide and birth outcomes in a prospective European mother–child study. Methods: Hemoglobin (Hb) adducts of acrylamide and its metabolite glycidamide were measured in cord blood (reflecting cumulated exposure in the last months of pregnancy) from 1,101 singleton pregnant women recruited in Denmark, England, Greece, Norway, and Spain during 2006–2010. Maternal diet was estimated through f

In [28]:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)


chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "system", "content": PROMPTAGATOR_SYS_PROMPT},
        {"role": "user", "content": PROMPTAGATOR_USER_PROMPT},
    ],
    temperature=0.7,
    max_completion_tokens=64,
    # extra_body={"guided_json": Query.model_json_schema()},
)

print(chat_response.choices[0].message.content)


How does statin use affect breast cancer survival?
