<a href="https://colab.research.google.com/github/HishamYahya/literature-reviewer-dspy-example/blob/main/PyData_Prompt_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Optimization Notebook

## Section 0: Setting Up

### Install dependencies (3 mins, Restarting session required after running)

In [2]:
!pip install --force-reinstall dspy-ai[milvus] && pip install milvus datasets sentence-transformers

Collecting dspy-ai[milvus]
  Downloading dspy_ai-2.4.13-py3-none-any.whl.metadata (39 kB)
Collecting backoff (from dspy-ai[milvus])
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting datasets (from dspy-ai[milvus])
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting joblib~=1.3 (from dspy-ai[milvus])
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting openai<2.0.0,>=0.28.1 (from dspy-ai[milvus])
  Downloading openai-1.42.0-py3-none-any.whl.metadata (22 kB)
Collecting optuna (from dspy-ai[milvus])
  Downloading optuna-3.6.1-py3-none-any.whl.metadata (17 kB)
Collecting pandas (from dspy-ai[milvus])
  Downloading pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting pydantic~=2.0 (from dspy-ai[milvus])
  Using cached pydantic-2.8.2-py3-none-any.whl.metadata (125 kB)
Collecting regex (from dspy-ai[milvus])
  Downloading regex-2024.7.24-cp310-cp310-manylinux_2_17_x86_64.manylinux

### Set up vector database of ArXiv abstracts (3 mins)

The following code starts the Milvus Lite server, then vectorizes and stores the first 100 abstracts of the ArXiv Abstracts dataset.

#### Vectorization

Code for restarting the Milvus server (uncomment and run the next three cells if a MilvusException appears during inference)

In [52]:
# %%bash
# echo "Stopping milvus..."
# PROCESS=$(ps -e | grep milvus | grep -v grep | awk '{print $1}')
# if [ -z "$PROCESS" ]; then
#   echo "No milvus process"
#   exit 0
# fi
# kill -9 $PROCESS
# echo "Milvus stopped"

Stopping milvus...
Milvus stopped


In [53]:
# from milvus import default_server

# default_server.start()

In [54]:
# connections.disconnect("default")
# connections.connect(host='127.0.0.1', port=default_server.listen_port)
# collection = Collection("arxiv_abstracts")
# collection.load()

Start Milvus Vector Database server

In [2]:
from milvus import default_server

# Importing alone does not launch the server; start it before connecting
default_server.start()

In [1]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')


Vectorize and save in the database

In [3]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

from pymilvus import MilvusClient, DataType, FieldSchema, CollectionSchema, connections, Collection
from milvus import default_server
from datasets import load_dataset
import numpy as np

connections.connect(host='127.0.0.1', port=default_server.listen_port)

# Load the dataset in streaming mode
dataset = load_dataset("gfissore/arxiv-abstracts-2021", split="train", streaming=True)

# Define the collection schema using FieldSchema
collection_name = "arxiv_abstracts"
fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, max_length=5000),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="authors", dtype=DataType.VARCHAR, max_length=5000),
    FieldSchema(name="categories", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="abstract", dtype=DataType.VARCHAR, max_length=5000),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=5000), # DSPy reads this field
    FieldSchema(name="abstract_vector", dtype=DataType.FLOAT_VECTOR, dim=384)
]

schema = CollectionSchema(fields=fields, description="ArXiv Abstracts Collection")

# Create the collection
collection = Collection(collection_name, schema)

# Create an IVF_FLAT index for the abstract_vector field
index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024}
}
collection.create_index("abstract_vector", index_params)

# Function to process and insert data in batches
def process_and_insert_batch(batch):
    data = []
    for i, item in enumerate(batch):
        data.append({
            "id": item['id'],
            "title": item['title'],
            "authors": ', '.join(item['authors']),
            "categories": ', '.join(item['categories']),
            "abstract": item['abstract'],
            "text": item['abstract'],
            "abstract_vector": embedding_model.encode(item['abstract']).tolist()
        })
    collection.insert(data)
    return len(data)

# Process the streaming dataset
batch_size = 100
total_inserted = 0
batch = []

for item in dataset.take(100):  # Limit to the first 100 items; remove .take(100) to process the entire dataset
    batch.append(item)
    if len(batch) == batch_size:
        total_inserted += process_and_insert_batch(batch)
        batch = []
        print(f"Inserted {total_inserted} documents so far...")

# Insert any remaining documents
if batch:
    total_inserted += process_and_insert_batch(batch)

collection.load()

print(f"Successfully inserted {total_inserted} documents into Milvus Lite database.")


Inserted 100 documents so far...
Successfully inserted 100 documents into Milvus Lite database.
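
The batching pattern in the cell above (collect items, flush on a full batch, then flush the remainder) can be isolated into a small generator. This is a minimal sketch, not part of the original notebook; `batched` is a hypothetical helper name, and it works on any iterable, including a streaming dataset.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items, including the final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Flush whatever is left over after the loop ends
    if batch:
        yield batch


sizes = [len(b) for b in batched(range(250), 100)]
print(sizes)  # [100, 100, 50]
```

With a helper like this, the insertion loop would reduce to one call to `process_and_insert_batch` per yielded batch, with no separate "remaining documents" branch.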


#### Searching

In [4]:
def search_abstracts(query, top_k=5):
    # Encode the query
    query_vector = embedding_model.encode(query).tolist()

    # Perform the search
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = collection.search(
        data=[query_vector],
        anns_field="abstract_vector",
        param=search_params,
        limit=top_k,
        output_fields=["title", "authors", "categories", "abstract"]
    )

    # Process and print the results
    for i, hits in enumerate(results):
        print(f"Top {len(hits)} results for query: '{query}'\n")
        for hit in hits:
            print(f"Title: {hit.entity.get('title')}, Distance: {hit.distance}")



In [5]:
query = "Machine Learning"
search_abstracts(query)

Top 5 results for query: 'Machine Learning'

Title: A general approach to statistical modeling of physical laws:
  nonparametric regression, Distance: 1.5056703090667725
Title: Real Options for Project Schedules (ROPS), Distance: 1.5358259677886963
Title: Inference on white dwarf binary systems using the first round Mock LISA
  Data Challenges data sets, Distance: 1.6015803813934326
Title: An algorithm for the classification of smooth Fano polytopes, Distance: 1.6232553720474243
Title: Intelligent location of simultaneously active acoustic emission sources:
  Part I, Distance: 1.633628010749817


## Section 1: Single Prompting

In [6]:
import os
from openai import OpenAI
from pymilvus import Collection, connections

# Set up OpenAI API key. Go to https://platform.openai.com/api-keys and create a key if you don't have one
OPENAI_API_KEY = "..."

client = OpenAI(
    api_key=OPENAI_API_KEY,
)

def retrieve_abstracts(topic, top_k=5):
    # Encode the query
    query_vector = embedding_model.encode(topic).tolist()

    # Perform the search
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    results = collection.search(
        data=[query_vector],
        anns_field="abstract_vector",
        param=search_params,
        limit=top_k,
        output_fields=["title", "authors", "categories", "abstract"]
    )
    return [hit for hits in results for hit in hits]


def simple_literature_review(topic):
    # Retrieve the 3 most relevant abstracts
    context = retrieve_abstracts(topic, top_k=3)
    context_string = ""
    for i, hit in enumerate(context):
      context_string += f"Paper {i+1}:\nTitle: {hit.entity.get('title')}\n\n Abstract: {hit.entity.get('abstract')}\n------\n"

    prompt = """Conduct a brief literature review on the topic: "{topic}".
    The following are the top 3 most relevant papers to the topic:
    {context_string}
    Please include:
    1. An overview of the main findings
    2. Key researchers or studies in this area
    3. Any major debates or controversies
    4. Gaps in current research
    5. Potential directions for future study
    Limit your response to approximately 500 words."""

    print("PROMPT USED:")
    print(prompt)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant skilled in conducting literature reviews."},
            {"role": "user", "content": prompt.format(topic=topic, context_string=context_string)}
        ],
        max_tokens=800,
        n=1,
        stop=None,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Example usage
topic = "The impact of microplastics on marine ecosystems"
review = simple_literature_review(topic)
print("\nREVIEW GENERATED:")
print(review)

PROMPT USED:
Conduct a brief literature review on the topic: "{topic}".
    The following are the top 3 most relevant papers to the topic:
    {context_string}
    Please include:
    1. An overview of the main findings
    2. Key researchers or studies in this area
    3. Any major debates or controversies
    4. Gaps in current research
    5. Potential directions for future study
    Limit your response to approximately 500 words.

REVIEW GENERATED:
### Literature Review: The Impact of Microplastics on Marine Ecosystems

#### Overview of Main Findings
Microplastics, defined as plastic particles less than 5 mm in size, have become pervasive in marine environments, affecting various aspects of marine ecosystems. Research indicates that microplastics can be ingested by a wide range of marine organisms, from plankton to larger fish and marine mammals. Studies show that ingestion can lead to physical harm, such as gastrointestinal blockage, and chemical harm due to the leaching of toxic 
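
The "PROMPT USED" output above still shows `{topic}` and `{context_string}` because the cell prints the template before `.format(...)` fills it in. Here is a minimal illustration with a shortened template; the title text is a placeholder, not data from the notebook.

```python
# Shortened stand-in for the notebook's prompt template
template = 'Conduct a brief literature review on the topic: "{topic}".\n{context_string}'

# str.format returns a new string; the template itself is left unchanged
filled = template.format(
    topic="The impact of microplastics on marine ecosystems",
    context_string="Paper 1:\nTitle: (placeholder title)\n",
)

print("TEMPLATE:")
print(template)
print("\nFILLED PROMPT:")
print(filled)
```

Printing the filled string instead of the template shows exactly what the model receives, including the retrieved abstracts.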

## Section 2: Optimized Single Prompt

Set up DSPy to use gpt-4o-mini

In [7]:
import dspy

# Set up the LM.
mini = dspy.OpenAI(model='gpt-4o-mini', max_tokens=2000, api_key=OPENAI_API_KEY, model_type="chat")
dspy.settings.configure(lm=mini)

Set up Predict module

In [8]:
import dspy
from dsp.utils import dotdict
# from dspy.retrieve.milvus_rm import MilvusRM

class ReviewSignature(dspy.Signature):
    """Your task is to conduct a literature review on the input topic."""

    retrieved_papers = dspy.InputField(desc="The titles of relevant papers to the topic retrieved from the database.")
    topic = dspy.InputField(desc="The topic to conduct a literature review on")
    literature_review = dspy.OutputField(desc="The generated literature review")

class LiteratureReviewer(dspy.Module):
    def __init__(self):
        super().__init__()
        # # Ideally, should be done this way. But right now it's buggy :P
        # self.retrieve_abstracts = MilvusRM(
        #     collection_name="arxiv_abstracts",
        #     embedding_function=lambda text: embedding_model.encode(text).tolist(),
        # )

        self.get_review = dspy.Predict(ReviewSignature)

    def forward(self, topic, k=3):
        # context = self.retrieve_abstracts(topic, k=k).passages
        context = [
            f'Title: {hit.entity.get("title")}\nAbstract:\n{hit.entity.get("abstract")}'
            for hit in retrieve_abstracts(topic, top_k=k)
          ]
        context = "\n\n".join(context)
        review = self.get_review(retrieved_papers=context, topic=topic)
        return review

lit_reviewer = LiteratureReviewer()

topic = "The impact of microplastics on marine ecosystems"
review = lit_reviewer(topic=topic)
print(review.literature_review)

**Retrieved Papers:**
1. Title: Shaping the Globular Cluster Mass Function by Stellar-Dynamical Evaporation
2. Title: The Spitzer c2d Survey of Large, Nearby, Interstellar Clouds. IX. The Serpens YSO Population As Observed With IRAC and MIPS
3. Title: Clustering in a stochastic model of one-dimensional gas

**Topic:** The impact of microplastics on marine ecosystems

**Literature Review:**
Microplastics, defined as plastic particles less than 5 mm in size, have emerged as a significant environmental pollutant, particularly in marine ecosystems. Their prevalence in oceans has raised concerns regarding their impact on marine life and ecosystem health. Recent studies have highlighted several pathways through which microplastics affect marine organisms, including ingestion, entanglement, and the potential for toxicological effects.

One of the primary concerns is the ingestion of microplastics by marine organisms, ranging from plankton to larger fish and marine mammals. Research indicates 

In [41]:
lit_review_topics = [
    "The impact of artificial intelligence on job markets and employment trends",
    "Sustainable urban planning strategies for climate change adaptation",
    "The role of gut microbiota in mental health disorders",
    "Effectiveness of mindfulness-based interventions in reducing workplace stress",
    "Advances in quantum computing and its potential applications",
    "The influence of social media on political polarization",
    "Emerging technologies in renewable energy storage",
    "The effects of screen time on cognitive development in children",
    "Genetic factors contributing to autoimmune diseases",
    "The impact of remote work on organizational culture and productivity",
    "Advancements in personalized medicine and targeted cancer therapies",
    "The role of blockchain technology in supply chain management",
    "Psychological effects of long-duration space missions on astronauts",
    "The impact of microplastics on marine ecosystems and human health",
    "Neuroplasticity and its implications for learning and rehabilitation",
    "The effectiveness of various interventions in reducing childhood obesity",
    "The role of epigenetics in the development of complex diseases",
    "Sustainable agriculture practices for food security in developing countries",
    "The impact of virtual and augmented reality on education and training",
    "Ethical considerations in the development and use of autonomous vehicles"
  ]

trainset = [dspy.Example(topic=topic).with_inputs("topic") for topic in lit_review_topics]

import re


class LiteratureReviewJudge(dspy.Signature):
    """Your task is to evaluate the generated literature review of the given topic based on the given question and give it a score from 0 to 1."""
    topic = dspy.InputField(desc="The topic of the literature review")
    literature_review = dspy.InputField(desc="The generated literature review")
    question = dspy.InputField(desc="The question to evaluate the literature review on")
    score = dspy.OutputField(desc="A number from 0 to 1.")

max_length = 1000
def score_literature_review(gold, review, trace=None) -> float:
    judge = dspy.Predict(LiteratureReviewJudge)

    questions_and_weights = [
        ("How concise is the literature review?", 5),
        ("How comprehensive is the literature review?", 1),
        ("How accurate is the information presented?", 3),
        ("How well does the review synthesize information from multiple sources?", 2),
        ("How effectively does the review identify key trends and patterns?", 1),
        ("How well does the review identify research gaps and future directions?", 2),
        ("How clear and well-structured is the review?", 4),
        ("How well does the review address the original topic?", 3),
        ("What is the overall quality of this literature review?", 3),
        ("How diverse is the review?", 1),
        ("How well cited is this review?", 3),
        ("How well does the review cover what future works might be done?", 6)
    ]

    weighted_scores = []
    total_weight = sum(weight for _, weight in questions_and_weights)

    for question, weight in questions_and_weights:
        raw_score = judge(topic=gold.topic, literature_review=review.literature_review, question=question).score
        # Keep the regex matches separate from the raw text so the "1" fallback still checks the string
        if matches := re.findall(r'0\.\d+', raw_score):
            score = float(matches[0])
        elif "1" in raw_score:
            score = 1
        else:
            score = 0

        weighted_scores.append(score * weight)
    return sum(weighted_scores) / total_weight
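
The metric above has two parts: turning the judge's free-text answer into a number, and taking a weighted average. The sketch below shows both with no LLM calls so the arithmetic can be checked in isolation. `parse_score` and `weighted_average` are hypothetical helpers, and the parsing rule (take the first number in the range 0 to 1) is an assumption rather than the notebook's exact rule.

```python
import re
from typing import List, Tuple


def parse_score(text: str) -> float:
    """Return the first number in [0, 1] found in text, or 0.0 if there is none."""
    for token in re.findall(r"\d+(?:\.\d+)?", text):
        value = float(token)
        if 0.0 <= value <= 1.0:
            return value
    return 0.0


def weighted_average(scored: List[Tuple[float, float]]) -> float:
    """Average (score, weight) pairs, normalising by the total weight."""
    total_weight = sum(weight for _, weight in scored)
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(score * weight for score, weight in scored) / total_weight


# Two questions: weight 5 answered "Score: 0.8", weight 1 answered "1"
answers = [("Score: 0.8", 5), ("1", 1)]
result = weighted_average([(parse_score(text), weight) for text, weight in answers])
print(round(result, 4))
```

Because the weights are normalised, the result stays in the range 0 to 1 and can be compared directly against a threshold such as 0.8.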


In [42]:
from dspy.teleprompt import BootstrapFewShot

teleprompter = BootstrapFewShot(metric = score_literature_review, metric_threshold=0.8)
compiled_reviewer = teleprompter.compile(student = lit_reviewer, trainset=trainset[:6])

100%|██████████| 6/6 [00:03<00:00,  1.59it/s]

Bootstrapped 4 full traces after 6 examples in round 0.
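
The message above means that four of the six training examples produced an output that passed the metric threshold and were kept as demonstrations. The sketch below is a simplified illustration of that idea only, not DSPy's implementation; `bootstrap_demos`, `toy_program`, and `toy_metric` are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def bootstrap_demos(
    trainset: List[str],
    program: Callable[[str], str],
    metric: Callable[[str, str], float],
    threshold: float,
    max_demos: int,
) -> List[Dict[str, str]]:
    """Run the program on each example and keep outputs whose metric passes the threshold."""
    demos: List[Dict[str, str]] = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example)
        if metric(example, prediction) >= threshold:
            demos.append({"input": example, "output": prediction})
    return demos


def toy_program(topic: str) -> str:
    # Stand-in for an LLM call
    return topic.upper()


def toy_metric(topic: str, output: str) -> float:
    # Stand-in for the LLM judge: "passes" only for topics longer than three characters
    return 1.0 if len(topic) > 3 else 0.0


demos = bootstrap_demos(
    ["ai", "quantum", "energy", "gut"], toy_program, toy_metric, threshold=0.8, max_demos=4
)
print(len(demos), "demos kept")
```

The kept demonstrations are then placed in the prompt as few-shot examples, which is what `compiled_reviewer.get_review.demos` exposes in the next cell.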





In [43]:
compiled_reviewer.get_review.demos

[Example({'augmented': True, 'retrieved_papers': "Title: Intelligent Life in Cosmology\nAbstract:\n  I shall present three arguments for the proposition that intelligent life is\nvery rare in the universe. First, I shall summarize the consensus opinion of\nthe founders of the Modern Synthesis (Simpson, Dobzhanski, and Mayr) that the\nevolution of intelligent life is exceedingly improbable. Second, I shall\ndevelop the Fermi Paradox: if they existed they'd be here. Third, I shall show\nthat if intelligent life were too common, it would use up all available\nresources and die out. But I shall show that the quantum mechanical principle\nof unitarity (actually a form of teleology!) requires intelligent life to\nsurvive to the end of time. Finally, I shall argue that, if the universe is\nindeed accelerating, then survival to the end of time requires that intelligent\nlife, though rare, to have evolved several times in the visible universe. I\nshall argue that the acceleration is a consequen

In [44]:
print(compiled_reviewer(topic="The impact of microplastics on marine ecosystems and human health").literature_review)

**Retrieved Papers:**  
1. Title: Shaping the Globular Cluster Mass Function by Stellar-Dynamical Evaporation  
2. Title: Origin of adaptive mutants: a quantum measurement?  
3. Title: The Spitzer c2d Survey of Large, Nearby, Interstellar Clouds. IX. The Serpens YSO Population As Observed With IRAC and MIPS  

**Topic:** The impact of microplastics on marine ecosystems and human health  

**Literature Review:**  
The impact of microplastics on marine ecosystems and human health has emerged as a critical area of research, reflecting growing concerns about environmental pollution and its far-reaching consequences. Microplastics, defined as plastic particles smaller than 5 mm, originate from various sources, including the breakdown of larger plastic debris, industrial processes, and the shedding of synthetic fibers from textiles. Their pervasive presence in marine environments poses significant threats to marine life and, subsequently, human health through the food chain.

Research indica

In [46]:
import os
from typing import List, Dict

def call_openai(prompt: str, system_message: str = "You are a helpful AI assistant.") -> str:
    """Generic function to call OpenAI API"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt}
        ],
        max_tokens=800,
        n=1,
        stop=None,
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

def screen_papers(papers: List[Dict[str, str]], topic: str) -> List[Dict[str, str]]:
    """Screens papers for relevance"""
    prompt = f"Given the topic '{topic}', determine if each paper is highly relevant. Respond with only 'yes' or 'no' for each paper.\n\n"
    for i, paper in enumerate(papers):
        prompt += f"{i+1}. Title: {paper.get('title')}\nAbstract: {paper.get('abstract')}\n"

    responses = call_openai(prompt).split('\n')
    return [paper for paper, response in zip(papers, responses) if response.strip().lower() == 'yes']

def extract_key_info(papers: List[Dict[str, str]]) -> str:
    """Extracts key information from papers"""
    prompt = "Extract key findings, methodologies, and conclusions from these papers:\n\n"
    for paper in papers:
        prompt += f"Title: {paper.get('title')}\nAbstract: {paper.get('abstract')}\n\n"
    return call_openai(prompt)

def identify_trends(key_info: str) -> str:
    """Identifies trends and patterns in the research"""
    prompt = f"Identify the main trends and patterns in this research:\n\n{key_info}"
    return call_openai(prompt)

def identify_gaps(key_info: str, trends: str) -> str:
    """Identifies research gaps"""
    prompt = f"Based on this information:\n\nKey Info:\n{key_info}\n\nTrends:\n{trends}\n\nIdentify gaps in the current research and suggest future research directions."
    return call_openai(prompt)

def synthesize_review(key_info: str, trends: str, gaps: str, topic: str) -> str:
    """Synthesizes the final literature review"""
    prompt = f"""Synthesize a coherent literature review based on the following information:
    Topic:
    {topic}

    Key Information:
    {key_info}

    Trends and Patterns:
    {trends}

    Research Gaps and Future Directions:
    {gaps}

    Your review should include:
    1. An overview of the main findings
    2. Key researchers or studies in this area
    3. Any major debates or controversies
    4. Gaps in current research
    5. Potential directions for future study

    Limit your response to approximately 500 words."""

    return call_openai(prompt)

def multi_step_literature_review(topic: str) -> str:
    """Conducts a multi-step literature review"""
    abstracts = retrieve_abstracts(topic, top_k=3)
    abstracts = [abstract.entity for abstract in abstracts]
    relevant_papers = screen_papers(abstracts, topic)
    key_info = extract_key_info(relevant_papers)
    trends = identify_trends(key_info)
    gaps = identify_gaps(key_info, trends)
    review = synthesize_review(key_info, trends, gaps, topic)
    return review

# Example usage
topic = "The impact of microplastics on marine ecosystems"
review = multi_step_literature_review(topic)
print(review)

### Literature Review: The Impact of Microplastics on Marine Ecosystems

#### Overview of Main Findings

Microplastics, defined as plastic particles less than 5 mm in size, have emerged as a pervasive environmental pollutant, particularly in marine ecosystems. Numerous studies have demonstrated their negative impacts on marine life, including ingestion by marine organisms, which can lead to physical harm, chemical contamination, and trophic transfer of pollutants. For instance, a study by Wright et al. (2013) found that microplastics are ingested by a wide range of marine species, including fish and shellfish, leading to bioaccumulation and potential risks to human health through the seafood consumption pathway. Moreover, research by Desforges et al. (2014) highlighted the potential for microplastics to act as vectors for harmful chemicals and pathogens, exacerbating their ecological impact.

#### Key Researchers and Studies

Prominent researchers in this field include Dr. Richard Thom
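
`screen_papers` in the cell above keeps a paper only when the matching response line is exactly `yes`, so a reply such as `1. Yes` is treated as a rejection. The sketch below shows one more tolerant way to read such replies. `parse_yes_no_lines` is a hypothetical helper, and treating a missing or unreadable verdict as "no" is an assumption.

```python
import re
from typing import List


def parse_yes_no_lines(response: str, expected: int) -> List[bool]:
    """Extract one yes/no verdict per non-empty line, tolerating numbering and punctuation."""
    verdicts: List[bool] = []
    for line in response.splitlines():
        match = re.search(r"\b(yes|no)\b", line, flags=re.IGNORECASE)
        if match:
            verdicts.append(match.group(1).lower() == "yes")
    # Truncate extras and pad missing verdicts with False (assumed "not relevant")
    verdicts = verdicts[:expected]
    verdicts += [False] * (expected - len(verdicts))
    return verdicts


print(parse_yes_no_lines("1. Yes\n2. no\n\n3. Yes.", expected=3))  # [True, False, True]
```

The returned list can be zipped with `papers` in the same way the original function zips the raw response lines.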

In [81]:
import dspy
from typing import List, Dict


class IsRelevantSignature(dspy.Signature):
    """Classify whether the paper abstract is relevant to the topic"""
    topic = dspy.InputField()
    abstract = dspy.InputField()
    is_relevant = dspy.OutputField(desc="Yes or No")

class ExtractKeyInfo(dspy.Signature):
    """Extract key information from relevant papers for the given topic."""
    relevant_papers = dspy.InputField(desc="The relevant papers")
    topic = dspy.InputField()
    key_info = dspy.OutputField(desc="Extracted key information")

class IdentifyTrends(dspy.Signature):
    """Identify trends and patterns in the research."""
    key_info = dspy.InputField()
    trends = dspy.OutputField(desc="Identified trends and patterns")

class IdentifyGaps(dspy.Signature):
    """Identify research gaps and future directions."""
    key_info = dspy.InputField()
    trends = dspy.InputField()
    gaps = dspy.OutputField(desc="Identified research gaps and future directions")

class SynthesizeReview(dspy.Signature):
    """Synthesize the final literature review."""
    topic = dspy.InputField()
    key_info = dspy.InputField()
    trends = dspy.InputField()
    gaps = dspy.InputField()
    literature_review = dspy.OutputField(desc="Synthesized literature review")

# Define the main literature review pipeline
class LiteratureReviewPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.is_relevant = dspy.Predict(IsRelevantSignature)
        self.extract = dspy.Predict(ExtractKeyInfo)
        self.identify_trends = dspy.Predict(IdentifyTrends)
        self.identify_gaps = dspy.Predict(IdentifyGaps)
        self.synthesize = dspy.Predict(SynthesizeReview)

    def forward(self, topic, k=3):
        abstracts = [
          f'Title: {hit.entity.get("title")}\nAbstract:\n{hit.entity.get("abstract")}'
          for hit in retrieve_abstracts(topic, top_k=k)
        ]
        relevant_papers = []
        for abstract in abstracts:
            if self.is_relevant(topic=topic, abstract=abstract).is_relevant.startswith("Yes"):
                relevant_papers.append(abstract)
        relevant_papers = "\n\n".join(relevant_papers)
        key_info = self.extract(relevant_papers=relevant_papers, topic=topic).key_info
        trends = self.identify_trends(key_info=key_info).trends
        gaps = self.identify_gaps(key_info=key_info, trends=trends).gaps
        review = self.synthesize(key_info=key_info, trends=trends, gaps=gaps, topic=topic)
        return review


pipeline = LiteratureReviewPipeline()

# Example usage
topic = "The impact of microplastics on marine ecosystems"
result = pipeline(topic=topic)
print(result.literature_review)

**Literature Review: The Impact of Microplastics on Marine Ecosystems**

The increasing prevalence of microplastics in marine environments has garnered significant attention in recent years, highlighting their multifaceted impacts on marine ecosystems. Microplastics, defined as plastic particles less than 5mm in size, originate from various sources, including the degradation of larger plastic debris and the shedding of synthetic fibers from textiles. Their widespread distribution across marine habitats—from surface waters to deep-sea sediments—has been documented in multiple studies, indicating a pervasive environmental challenge (Smith et al., 2020; Johnson and Lee, 2021; Garcia et al., 2022).

Research consistently demonstrates that microplastics can be ingested by a wide range of marine organisms, leading to both physical and chemical harm. The ingestion of these particles can result in blockages, reduced feeding efficiency, and exposure to toxic substances, which collectively disru

In [55]:
teleprompter = BootstrapFewShot(metric = score_literature_review, metric_threshold=0.8, max_bootstrapped_demos=3, max_labeled_demos=3)
compiled_reviewer = teleprompter.compile(student = pipeline, trainset=trainset[:6])

 50%|█████     | 3/6 [01:07<01:07, 22.53s/it]

Bootstrapped 3 full traces after 4 examples in round 0.





In [None]:
compiled_reviewer.save("compiled_pipeline_reviewer.json")

In [57]:
print(compiled_reviewer(topic="The impact of microplastics on marine ecosystems and human health").literature_review)

**Literature Review: The Impact of Microplastics on Marine Ecosystems and Human Health**  
The growing concern over microplastics as pervasive pollutants in marine environments has prompted extensive research into their sources, ecological impacts, and potential risks to human health. Thompson et al. (2020) provide a comprehensive overview of the origins and distribution of microplastics, highlighting their prevalence in marine ecosystems due to various sources, including plastic waste, industrial processes, and personal care products. This widespread presence underscores the urgent need for effective monitoring and mitigation strategies.

Research consistently demonstrates that microplastics pose significant ecological risks to marine organisms. Smith & Jones (2021) detail the harmful effects of microplastics on marine life, including physical harm from ingestion and toxicological impacts that disrupt marine food webs. The accumulation of microplastics in the tissues of marine species

In [82]:
import dspy
from typing import List, Dict

def search(topic: str, top_k: int = 3):
    papers = retrieve_abstracts(topic, top_k=top_k)
    papers = [paper.entity for paper in papers]
    return papers

class SearchSignature(dspy.Signature):
    """Search for papers related to a given topic"""
    topic = dspy.InputField()
    papers = dspy.OutputField(desc="List of papers related to the topic")

class ScreenSignature(dspy.Signature):
    """Screen a paper's abstract for relevance to a topic"""
    topic = dspy.InputField()
    abstract = dspy.InputField()
    score = dspy.OutputField(desc="Relevance score between 0 and 1")

class SynthesizeSignature(dspy.Signature):
    """Synthesize information from relevant papers"""
    relevant_papers = dspy.InputField(desc="List of relevant papers (or none)")
    synthesis = dspy.OutputField(desc="Synthesized information from relevant papers")

class GenerateRQSignature(dspy.Signature):
    """Generate a research question based on the review and previous RQs"""
    review = dspy.InputField()
    rqs = dspy.InputField(desc="Previously generated research questions")
    rq = dspy.OutputField(desc="New research question")

class GeneratePlanSignature(dspy.Signature):
    """Generate a plan based on the topic, RQs, and relevant papers"""
    topic = dspy.InputField()
    rqs = dspy.InputField()
    relevant_papers = dspy.InputField()
    plan = dspy.OutputField(desc="Generated plan for the literature review")

def extract_score(score: str) -> float:
    # Keep the regex matches in their own variable so the "1" fallback still checks the raw text
    if matches := re.findall(r'0\.\d+', score):
        return float(matches[0])
    elif "1" in score:
        return 1
    else:
        return 0

class LiteratureReviewAndPlanPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.screen = dspy.Predict(ScreenSignature)
        self.synthesize = dspy.Predict(SynthesizeSignature)
        self.generate_rq = dspy.Predict(GenerateRQSignature)
        self.generate_plan = dspy.Predict(GeneratePlanSignature)

    def forward(self, topic, n_rq = 3):
        # Search for initial papers on the given topic
        papers = search(topic, top_k=3)
        relevant_papers = []

        # Screen papers for relevance
        for paper in papers:
            relevance = self.screen(topic=topic, abstract=paper.get("abstract"))
            relevance_score = extract_score(relevance.score)

            # Add papers with high relevance score
            if relevance_score > 0.7:
                relevant_papers.append("Title:" + paper.get("title") + "\nAbstract:" + paper.get("abstract"))

        # Join relevant papers into a single string
        relevant_papers = "\n\n".join(relevant_papers)

        # Synthesize information from relevant papers
        synthesis = self.synthesize(relevant_papers=relevant_papers).synthesis

        rqs = ""
        relevant_papers = []

        # Generate research questions and find relevant papers for each
        for rq_i in range(n_rq):
            # Generate a research question
            rq = self.generate_rq(review=synthesis, rqs=rqs).rq
            search_results = search(rq)

            # Screen papers for relevance to both original topic and research question
            for paper in search_results:
                relevance_orig = self.screen(topic=topic, abstract=paper.get("abstract")).score
                relevance_rq = self.screen(topic=rq, abstract=paper.get("abstract")).score
                score_orig = extract_score(relevance_orig)
                score_rq = extract_score(relevance_rq)

                # Add papers with high relevance to both topic and research question
                if score_orig > 0.7 and score_rq > 0.7:
                    relevant_papers.append("Title:" + paper.get("title") + "\nAbstract:" + paper.get("abstract"))

            # Accumulate research questions
            rqs += rq + "\n"

        # Join relevant papers into a single string
        relevant_papers = "\n\n".join(relevant_papers)

        # Generate a research plan based on the topic, research questions, and relevant papers
        plan = self.generate_plan(topic=topic, rqs=rqs, relevant_papers=relevant_papers).plan

        # Return the literature review synthesis and research plan
        return dspy.Prediction(review=synthesis, plan=plan)

pipeline = LiteratureReviewAndPlanPipeline()

# Example usage
topic = "The impact of microplastics on marine ecosystems"
result = pipeline(topic)

In [83]:
print("----- REVIEW -----")
print(result.review)
print("----- PLAN -----")
print(result.plan)

----- REVIEW -----
Relevant Papers: 
1. Smith, J. A., & Doe, R. L. (2020). "The Impact of Climate Change on Coastal Ecosystems." Journal of Environmental Science, 45(3), 123-145.
2. Johnson, M. K., & Lee, T. H. (2021). "Adaptive Strategies for Coastal Communities Facing Sea Level Rise." Coastal Management, 49(2), 89-102.
3. Brown, P. Q., & Green, S. R. (2022). "Ecosystem Services in Coastal Regions: A Review." Marine Ecology Progress Series, 678, 1-15.

Synthesis: The reviewed literature highlights the significant impact of climate change on coastal ecosystems, emphasizing the vulnerability of these areas to rising sea levels and increased storm intensity. Smith and Doe (2020) discuss how changes in temperature and precipitation patterns are altering species distributions and habitat structures in coastal regions. Johnson and Lee (2021) propose adaptive strategies for coastal communities, including the implementation of managed retreat and the restoration of natural barriers, to mitiga
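
In the pipeline above, the same paper can be retrieved for more than one research question, so `relevant_papers` may contain repeated entries before it is joined into the plan prompt. One way to drop repeats while preserving retrieval order is sketched below; `unique_in_order` is a hypothetical helper, not part of the notebook.

```python
from typing import Iterable, List


def unique_in_order(items: Iterable[str]) -> List[str]:
    """Remove duplicates while keeping the first occurrence of each item in order."""
    # dict preserves insertion order, so its keys form an ordered set
    return list(dict.fromkeys(items))


papers = ["Title:A\nAbstract:a", "Title:B\nAbstract:b", "Title:A\nAbstract:a"]
print(len(unique_in_order(papers)))  # 2
```

Applying this just before the `"\n\n".join(relevant_papers)` step would keep the plan prompt shorter without changing which papers are considered relevant.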