# Abstraction Review: Literature Analysis Example

In this notebook, we will use LatteReview to conduct a literature review and demonstrate how it can help you grasp the big picture of the general concepts discussed in the literature.

## Setting up the notebook

High-level configs

In [1]:
%reload_ext autoreload
%autoreload 2

from dotenv import load_dotenv

# Load environment variables from .env file. Adjust the path to the .env file as needed.
load_dotenv(dotenv_path='../.env')

# Enable asyncio in Jupyter
import asyncio
import nest_asyncio

nest_asyncio.apply()

# Add the package to the path (required if you are running this notebook from the examples folder)
import sys
sys.path.append('../../')


Import required packages

In [3]:
import json
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import pickle 
from pydantic import BaseModel
from pyvis.network import Network

from lattereview.providers import OpenAIProvider
from lattereview.providers import LiteLLMProvider
from lattereview.agents import ScoringReviewer, AbstractionReviewer
from lattereview.workflows import ReviewWorkflow

## Data

In this notebook, we will open an Excel file (`data.xlsx`) containing records from a PubMed search for case report articles published in the last five years that discuss complications or adverse effects associated with COVID-19 vaccines. This dataset was created using a quick and simple PubMed search, and we do not claim that it is comprehensive. The intention of this notebook is not to conduct a thorough investigation of this topic or to advocate against COVID-19 vaccines. On the contrary, we believe that COVID-19 vaccines are highly effective and have saved millions of lives. Our purpose here is to demonstrate the use of the LatteReview package with a topic that may interest many people.

In [6]:
data = pd.read_excel('data.xlsx')
data

Unnamed: 0,PMID,PMC ID,Title,Author(s),Abstract,Date of Publication,Date of Electronic Publication
0,37789968,PMC10544010,Remimazolam vs Etomidate: Haemodynamic Effects...,Chen J||Zou X||Hu B||Yang Y||Wang F||Zhou Q||S...,BACKGROUND: Remimazolam tosilate (RT) is a nov...,2023,20230927.0
1,34558599,PMC8500128,Early Post-Renal Transplant Hyperglycemia.,Iqbal A||Zhou K||Kashyap SR||Lansang MC,CONTEXT: Though posttransplant diabetes mellit...,2022 Jan 18,
2,38716533,,Atorvastatin-induced Myositis and Drug-induced...,Kashyap K||Bisht K||Dhar M||Mittal K,Statins are drugs for preventing cardiac event...,2023 Oct,
3,36842342,PMC9905103,Cardiovascular complications of COVID-19 vacci...,Paknahad MH||Yancheshmeh FB||Soleimani A,BACKGROUND: There are multiple reviews on card...,2023 May-Jun,20230208.0
4,37148041,,Spectrum of Serious Neurological and Psychiatr...,Garg RK||Paliwal V||Malhotra HS||Singh BP||Riz...,Indian data regarding serious neurological and...,2023 Mar-Apr,
...,...,...,...,...,...,...,...
1488,31359833,,Tamoxifen-induced vasculitis.,Kulkarni U||Nayak V||Prabhu MM||Rao R,INTRODUCTION: Tamoxifen is a selective estroge...,2020 Apr,20190730.0
1489,31268769,,Efficacy of Adalimumab for Chronic Vogt-Koyana...,Takayama K||Obata H||Takeuchi M,Purpose: To report the efficacy of adalimumab ...,2020 Apr 2,20190703.0
1490,31238794,PMC7016355,Selective brain hypothermia: feasibility and s...,Seyedsaadat SM||Marasco SF||Daly DJ||McEgan R|...,BACKGROUND/OBJECTIVE: Reduction of brain tempe...,2020 Mar,20190626.0
1491,30991886,,Psychiatric Manifestations With Sacubitril/Val...,Wooster J||Cook EA||Shipman D,Sacubitril/valsartan is an angiotensin recepto...,2020 Aug,20190416.0


## Review

We begin by defining three agents to conduct the literature review. The first two agents act as scoring reviewers, tasked with examining each abstract to determine whether it reports any complications related to COVID-19 vaccines. For simplicity, an article is shortlisted for the next stage only if both reviewers agree that it meets the criteria. In the second step, a third agent, the abstraction reviewer, extracts the names of the complications mentioned in the abstracts concerning COVID-19 vaccines. The workflow therefore runs in two rounds: the first two agents perform the initial review, and the third agent then extracts the complications from the abstracts shortlisted in the first round.

In [10]:
Agent1 = ScoringReviewer(
    provider=LiteLLMProvider(model="gemini/gemini-1.5-flash"),
    name="Agent1",
    max_concurrent_requests=20, 
    backstory="an infectious disease specialist MD with years of experience in diagnosing and treating patients with COVID19 infection",
    input_description = "title and abstracts of case review articles",
    model_args={"max_tokens": 200, "temperature": 0.9},
    reasoning = "brief",
    scoring_task="Look for articles that discuss complications or adverse effects of vaccinations against COVID19",
    scoring_set=[1, 2],
    scoring_rules='Score 1 if the paper meets the criteria, and 2 if the paper does not meet the criteria.',
)

Agent2 = ScoringReviewer(
    provider=OpenAIProvider(model="gpt-4o-mini"),
    name="Agent2",
    max_concurrent_requests=20, 
    backstory="an expert PhD scientist in immunology and infectious disease",
    input_description = "title and abstracts of case review articles",
    model_args={"max_tokens": 200, "temperature": 0.1},
    reasoning = "brief",
    scoring_task="Look for articles that discuss complications or adverse effects of vaccinations against COVID19",
    scoring_set=[1, 2],
    scoring_rules='Score 1 if the paper meets the criteria, and 2 if the paper does not meet the criteria.',
)

Agent3 = AbstractionReviewer(
    provider=LiteLLMProvider(model="gpt-4o-mini"),
    name="Agent3",
    max_concurrent_requests=20, 
    backstory="an expert MD-PhD practitioner and scientist in immunology and infectious disease",
    input_description = "title and abstracts of case review articles",
    model_args={"max_tokens": 200, "temperature": 0.1},
    abstraction_keys = {
        "complications": list[str]
    },
    key_descriptions = {
        "complications": "A list of the complications or adverse effects of COVID-19 vaccines mentioned in the abstract."
    }
)

title_abs_review = ReviewWorkflow(
    workflow_schema=[
        {
            "round": 'A',
            "reviewers": [Agent1, Agent2],
            "text_inputs": ["Title", "Abstract"]
        },
        {
            "round": 'B',
            "reviewers": [Agent3],
            "text_inputs": ["Title", "Abstract", "round-A_Agent1_score", "round-A_Agent2_score"],
            "filter": lambda row: int(row["round-A_Agent1_score"]) == int(row["round-A_Agent2_score"]) == 1
        }
    ]
)


reviewed_data = asyncio.run(title_abs_review(data))
reviewed_data.to_csv("reviwed_data.csv", index=False)
reviewed_data.head()



Processing 1493 eligible rows


['round: A', 'reviewer_name: Agent1'] -                     2024-12-30 21:49:11: 100%|██████████| 1493/1493 [00:45<00:00, 32.47it/s]


The following columns are present in the dataframe at the end of Agent1's reivew in round A: ['PMID', 'PMC ID', 'Title', 'Author(s)', 'Abstract', 'Date of Publication', 'Date of Electronic Publication', 'round-A_Agent1_output', 'round-A_Agent1_reasoning', 'round-A_Agent1_score', 'round-A_Agent1_certainty']


['round: A', 'reviewer_name: Agent2'] -                     2024-12-30 21:49:57: 100%|██████████| 1493/1493 [01:20<00:00, 18.64it/s]


The following columns are present in the dataframe at the end of Agent2's reivew in round A: ['PMID', 'PMC ID', 'Title', 'Author(s)', 'Abstract', 'Date of Publication', 'Date of Electronic Publication', 'round-A_Agent1_output', 'round-A_Agent1_reasoning', 'round-A_Agent1_score', 'round-A_Agent1_certainty', 'round-A_Agent2_output', 'round-A_Agent2_reasoning', 'round-A_Agent2_score', 'round-A_Agent2_certainty']


Processing 149 eligible rows


['round: B', 'reviewer_name: Agent3'] -                     2024-12-30 21:51:17: 100%|██████████| 149/149 [00:08<00:00, 18.12it/s]

The following columns are present in the dataframe at the end of Agent3's reivew in round B: ['PMID', 'PMC ID', 'Title', 'Author(s)', 'Abstract', 'Date of Publication', 'Date of Electronic Publication', 'round-A_Agent1_output', 'round-A_Agent1_reasoning', 'round-A_Agent1_score', 'round-A_Agent1_certainty', 'round-A_Agent2_output', 'round-A_Agent2_reasoning', 'round-A_Agent2_score', 'round-A_Agent2_certainty', 'round-B_Agent3_output', 'round-B_Agent3_complications']





Unnamed: 0,PMID,PMC ID,Title,Author(s),Abstract,Date of Publication,Date of Electronic Publication,round-A_Agent1_output,round-A_Agent1_reasoning,round-A_Agent1_score,round-A_Agent1_certainty,round-A_Agent2_output,round-A_Agent2_reasoning,round-A_Agent2_score,round-A_Agent2_certainty,round-B_Agent3_output,round-B_Agent3_complications
0,37789968,PMC10544010,Remimazolam vs Etomidate: Haemodynamic Effects...,Chen J||Zou X||Hu B||Yang Y||Wang F||Zhou Q||S...,BACKGROUND: Remimazolam tosilate (RT) is a nov...,2023,20230927.0,"{'certainty': 100, 'reasoning': 'The article d...",The article does not discuss complications or ...,2,100,{'reasoning': 'The article discusses the haemo...,The article discusses the haemodynamic effects...,2,95,,
1,34558599,PMC8500128,Early Post-Renal Transplant Hyperglycemia.,Iqbal A||Zhou K||Kashyap SR||Lansang MC,CONTEXT: Though posttransplant diabetes mellit...,2022 Jan 18,,"{'certainty': 100, 'reasoning': 'The article d...",The article does not discuss complications or ...,2,100,{'reasoning': 'The article discusses early pos...,The article discusses early post-renal transpl...,2,90,,
2,38716533,,Atorvastatin-induced Myositis and Drug-induced...,Kashyap K||Bisht K||Dhar M||Mittal K,Statins are drugs for preventing cardiac event...,2023 Oct,,"{'certainty': 100, 'reasoning': 'The article f...",The article focuses on statin-induced myositis...,2,100,{'reasoning': 'The article discusses adverse e...,The article discusses adverse effects related ...,2,90,,
3,36842342,PMC9905103,Cardiovascular complications of COVID-19 vacci...,Paknahad MH||Yancheshmeh FB||Soleimani A,BACKGROUND: There are multiple reviews on card...,2023 May-Jun,20230208.0,"{'certainty': 100, 'reasoning': 'The abstract ...",The abstract explicitly discusses cardiovascul...,1,100,{'reasoning': 'The article discusses various c...,The article discusses various cardiovascular c...,1,90,"{'complications': ['Myocarditis', 'Takotsubo c...","[Myocarditis, Takotsubo cardiomyopathy (TTC), ..."
4,37148041,,Spectrum of Serious Neurological and Psychiatr...,Garg RK||Paliwal V||Malhotra HS||Singh BP||Riz...,Indian data regarding serious neurological and...,2023 Mar-Apr,,"{'certainty': 100, 'reasoning': 'The abstract ...",The abstract explicitly discusses serious neur...,1,100,{'reasoning': 'The article systematically revi...,The article systematically reviews serious neu...,1,95,{'complications': ['serious neurological adver...,"[serious neurological adverse events, serious ..."
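
Before moving on, it can help to check how often the two scoring reviewers agreed, since only rows where both assigned a score of 1 reach round B. The following is a minimal sketch and not part of LatteReview: `tally_agreement` is a hypothetical helper that works on plain row dictionaries and assumes the score column names shown in the output above. The toy rows are made up for illustration.

```python
from collections import Counter

def tally_agreement(rows, col_a="round-A_Agent1_score", col_b="round-A_Agent2_score"):
    """Count each (score_a, score_b) pair across rows (hypothetical helper)."""
    return Counter((int(row[col_a]), int(row[col_b])) for row in rows)

# Toy rows standing in for reviewed_data; the values are illustrative only
toy_rows = [
    {"round-A_Agent1_score": 1, "round-A_Agent2_score": 1},
    {"round-A_Agent1_score": 2, "round-A_Agent2_score": 2},
    {"round-A_Agent1_score": 1, "round-A_Agent2_score": 2},
    {"round-A_Agent1_score": 2, "round-A_Agent2_score": 2},
]

agreement = tally_agreement(toy_rows)
shortlisted = agreement[(1, 1)]  # rows that would pass the round-B filter
print(agreement)
print(f"{shortlisted} row(s) would be shortlisted")
```

With the real results, the rows could be obtained with `reviewed_data.to_dict("records")`.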


## Literature Analysis

Now that we have a list of complications extracted from each article, let’s compile them to see which complications have been identified across the entire pool of articles and which ones are mentioned most frequently.

In [20]:
reviewed_data = pd.read_csv("reviwed_data.csv")
reviewed_data = reviewed_data[reviewed_data["round-B_Agent3_complications"].notna()]

print(f"{len(reviewed_data)} articles met the criteria for complications or adverse effects of COVID-19 vaccines.")

149 articles met the criteria for complications or adverse effects of COVID-19 vaccines.
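
Because the results were written to CSV and read back, each entry in `round-B_Agent3_complications` is now the string representation of a list rather than an actual list. The sketch below shows one way to recover the list with the standard library's `ast.literal_eval`, which accepts only Python literals. The sample string is made up for illustration.

```python
import ast

# A made-up cell value, shaped like the strings stored in the CSV column
raw_cell = "['Myocarditis', 'Pericarditis', 'chest pain']"

parsed = ast.literal_eval(raw_cell)  # parses literals only, never executes code
print(type(parsed).__name__, len(parsed))

# Anything that is not a plain literal is rejected instead of being run
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError) as err:
    rejected = True
    print("rejected:", type(err).__name__)
```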


In [156]:
import ast

complication_dict = dict()
for i, complication_list in enumerate(reviewed_data["round-B_Agent3_complications"].tolist()):
    # Each cell holds a list stored as a string; parse it safely instead of using eval
    complication_list = ast.literal_eval(complication_list)
    for complication in complication_list:
        article_list = complication_dict.get(complication, [])
        article_list.append(i)
        complication_dict[complication] = article_list

complication_dict = {k.replace('"', '').replace("'", "").strip(): v for k, v in complication_dict.items()}
unique_complications = list(complication_dict.keys())
unique_complications.sort(key=lambda x: len(complication_dict[x]), reverse=True)

print(f"Unique complications: {len(unique_complications)}\n")

for unique_complication in unique_complications:
    print(f"{unique_complication}: {len(complication_dict[unique_complication])} articles")


Unique complications: 587

Guillain-Barre syndrome: 6 articles
myocarditis: 6 articles
thrombocytopenia: 5 articles
fever: 5 articles
neurological complications: 3 articles
myopericarditis: 3 articles
Thrombocytopenia: 3 articles
headache: 3 articles
dizziness: 3 articles
lower limb weakness: 3 articles
pulmonary embolism: 3 articles
Guillain-Barre syndrome (GBS): 3 articles
nausea: 3 articles
acute disseminated encephalomyelitis (ADEM): 2 articles
pericarditis: 2 articles
stroke: 2 articles
acute kidney injury: 2 articles
chest pain: 2 articles
disseminated intravascular coagulation: 2 articles
superior mesenteric vein thrombosis: 2 articles
pain: 2 articles
diplopia: 2 articles
thrombotic complications: 2 articles
encephalitis: 2 articles
cardiogenic shock: 2 articles
Bilateral optic neuritis: 2 articles
portal vein thrombosis: 2 articles
facial nerve palsy: 2 articles
parakeratosis: 2 articles
acute myopericarditis: 2 articles
herpes zoster: 2 articles
hypophysitis: 2 articles
eyeli

The list above shows that some complications carry different labels but refer to the same or similar pathologies (for example, `thrombocytopenia` and `Thrombocytopenia`, or `Guillain-Barre syndrome` and `Guillain-Barre syndrome (GBS)`). Although we could proceed with the current list, it is better to standardize it so that complications referring to the same pathology are labeled consistently. One strategy is to make a single call to a large language model that replaces each pathology name with its corresponding PubMed MeSH term. In our experience, GPT-4o handles MeSH terms well and can standardize the extracted complications efficiently. Because this is a single call to GPT-4o (not a call per article or per list), we will not involve any reviewer for this task. Instead, we will call the OpenAI provider once to standardize all the complications identified so far.
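
Some of the duplicates in the printed list differ only in capitalization or a trailing abbreviation. As a rough complement to the MeSH mapping, not a replacement for it, such variants can be merged locally before calling the model. The sketch below rests on stated assumptions: `normalize_label` and `merge_labels` are hypothetical helpers, the toy dictionary is made up, and the rule of lowercasing and dropping a trailing parenthetical is a simplification that will not catch true synonyms.

```python
import re

def normalize_label(label):
    """Lowercase a label and drop a trailing parenthetical such as '(GBS)'."""
    label = re.sub(r"\s*\([^)]*\)\s*$", "", label.strip())
    return label.lower()

def merge_labels(label_to_articles):
    """Merge the article lists of labels that normalize to the same string."""
    merged = {}
    for label, articles in label_to_articles.items():
        # A set avoids counting the same article twice after merging
        merged.setdefault(normalize_label(label), set()).update(articles)
    return {label: sorted(articles) for label, articles in merged.items()}

# Toy input shaped like complication_dict; the article indices are made up
toy_complications = {
    "Guillain-Barre syndrome": [0, 4],
    "Guillain-Barre syndrome (GBS)": [7],
    "thrombocytopenia": [1, 2],
    "Thrombocytopenia": [2, 9],
}

merged_toy = merge_labels(toy_complications)
print(merged_toy)
```

Collecting article indices in a set also keeps an article from being counted twice when two of its labels collapse into one.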

In [157]:
import ast 

class StandardizedTerms(BaseModel):
    # Each item is the string form of a (term, MeSH term) tuple, parsed later with ast.literal_eval
    formatted_list: list[str]

async def make_standard():
    prompt = f"""
    Convert each of the complications in the following list to their corresponding standard PubMed MeSH term.
    Return a Python list of tuples in which the first element is an item from the input list and the second element is its corresponding MeSH term.
    If you encounter a term in the input list that does not appear to be a pathology name or pathology class, use "NA" as the corresponding MeSH term.
    If you encounter a term in the input list for which you do not know a corresponding MeSH term, use the input term itself as the corresponding MeSH term.
    Here is the input list: {unique_complications}
    """
    provider = OpenAIProvider(model="gpt-4o", response_format_class=StandardizedTerms)
    return await provider.get_json_response(prompt, temperature=0.1)

mesh_terms = json.loads(asyncio.run(make_standard())[0])["formatted_list"]
mesh_terms = dict(ast.literal_eval(item) for item in mesh_terms)


In [158]:
# Merge the keys in complication_dict that have the same standard formats

formatted_complication_dict = {}
for unique_complication in complication_dict.keys():
    new_list = complication_dict[unique_complication]
    standard_complication = mesh_terms.get(unique_complication, unique_complication)
    if standard_complication=='NA':
        continue
    old_list = formatted_complication_dict.get(standard_complication, [])
    old_list.extend(new_list)
    formatted_complication_dict[standard_complication] = old_list

with open('formatted_complication_dict.pkl', 'wb') as f:
    pickle.dump(formatted_complication_dict, f)

Last but not least, let’s create an interactive graph of all complications identified from the reviewed case report articles. We will construct the graph so that complications mentioned in more articles have larger nodes and darker color shades. This graph is best viewed by opening the generated HTML file in a browser. If you are running this notebook in JupyterLab or Jupyter Notebook, you can also view the graph inline, provided you have the pyvis library and its dependencies installed. Unfortunately, the inline visualization will not work if you are viewing this notebook in VS Code. Because the graph is interactive, you can zoom, drag, and filter nodes to explore the data.

In [4]:
with open('formatted_complication_dict.pkl', 'rb') as f:
    formatted_complication_dict = pickle.load(f)
    
# Initialize the graph
G = nx.Graph()

# Add nodes and their sizes
for complication, articles in formatted_complication_dict.items():
    G.add_node(complication, size=len(articles))
    
# Add edges based on co-occurrence
for comp1, articles1 in formatted_complication_dict.items():
    for comp2, articles2 in formatted_complication_dict.items():
        if comp1 != comp2:
            common_articles = set(articles1) & set(articles2)
            if common_articles:
                G.add_edge(comp1, comp2, weight=len(common_articles))
                
# Create a fresh Network instance with custom settings
net = Network(
    height="100%", 
    width="100%", 
    bgcolor="#222222", 
    font_color="white",
    directed=False,
    neighborhood_highlight=True,
    select_menu=True,
    filter_menu=True,
    notebook=True,
    cdn_resources='in_line'  # Changed to in_line to include resources directly
)

# Physics settings optimized for more spacing
net.set_options("""
{
  "physics": {
    "forceAtlas2Based": {
      "gravitationalConstant": -200,
      "centralGravity": 0.005,
      "springLength": 400,
      "springConstant": 0.02,
      "avoidOverlap": 1.0
    },
    "maxVelocity": 20,
    "solver": "forceAtlas2Based",
    "timestep": 0.3,
    "stabilization": {
      "enabled": true,
      "iterations": 2000,
      "updateInterval": 25
    }
  },
  "nodes": {
    "shape": "dot",
    "scaling": {
      "min": 8,
      "max": 45
    },
    "font": {
      "size": 20,
      "color": "white",
      "strokeWidth": 3,
      "strokeColor": "#222222",
      "face": "arial"
    },
    "shadow": {
      "enabled": true,
      "color": "rgba(0,0,0,0.5)",
      "size": 5
    }
  },
  "edges": {
    "smooth": {
      "type": "continuous",
      "forceDirection": "none"
    },
    "color": {
      "inherit": false,
      "opacity": 0.15
    },
    "width": 0.5
  },
  "interaction": {
    "navigationButtons": true,
    "keyboard": true,
    "hover": true,
    "multiselect": true,
    "dragNodes": true,
    "hideEdgesOnDrag": true,
    "hideNodesOnDrag": false
  }
}
""")

# Modify node addition with continuous color gradient
sizes = nx.get_node_attributes(G, 'size')
min_size = min(sizes.values())
max_size = max(sizes.values())

def get_color_for_size(size):
    # Normalize size to 0-1 range (fall back to 0 when all nodes have the same size)
    normalized = (size - min_size) / (max_size - min_size) if max_size > min_size else 0.0
    # Create a color gradient from light blue to dark blue
    return mcolors.to_hex(plt.cm.Blues(0.2 + (normalized * 0.8)))

for node, size in sizes.items():
    net.add_node(
        node, 
        size=size * 8,
        title=f"{node}: {size} articles",
        color=get_color_for_size(size),  # Use continuous color mapping
        label=node,
        shape="dot",
        mass=size/15
    )

# Add edges with increased transparency
MIN_EDGE_WEIGHT = 2
MIN_NODE_SIZE_FOR_LABEL = 4

# Use a distinct name for edge attributes so the `data` DataFrame is not overwritten
for u, v, edge_attrs in G.edges(data=True):
    if edge_attrs['weight'] >= MIN_EDGE_WEIGHT:
        net.add_edge(
            u, v, 
            value=edge_attrs['weight'],
            title=f"{edge_attrs['weight']} common articles",
            color="rgba(255, 255, 255, 0.15)",  # More transparent edges
            smooth={'type': 'continuous'}
        )

# Save the graph to an HTML file and also show it in the notebook (works with Jupyter notebooks)
html_file = "complications_graph.html"
net.save_graph(html_file)
net.show("complications_graph.html")

complications_graph.html
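
The edge-building step above links two complications whenever they share at least one article. The same co-occurrence counts can be sketched with the standard library alone, which makes the logic easy to inspect on a small example. `cooccurrence_weights` is a hypothetical helper and the toy dictionary is made up. `itertools.combinations` visits each unordered pair once, whereas the nested loop in the notebook visits each pair twice, which is harmless for an undirected graph.

```python
from itertools import combinations

def cooccurrence_weights(label_to_articles):
    """Map each pair of labels to the number of articles they share."""
    weights = {}
    for (a, arts_a), (b, arts_b) in combinations(label_to_articles.items(), 2):
        shared = set(arts_a) & set(arts_b)
        if shared:
            weights[(a, b)] = len(shared)
    return weights

# Toy input shaped like formatted_complication_dict; indices are made up
toy_graph_input = {
    "Myocarditis": [0, 1, 2],
    "Pericarditis": [1, 2],
    "Headache": [3],
}

toy_weights = cooccurrence_weights(toy_graph_input)
print(toy_weights)
```

Pairs with no shared article are omitted, mirroring the `if common_articles:` check in the notebook.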
