In [9]:
import sys
import os
from pathlib import Path
sys.path.append(str(Path.cwd().parent))
from config import OPENAI_API_KEY, NCBI_API_KEY, EMAIL
from src.clinfoai.pubmed_engine import PubMedNeuralRetriever


# Using Clinfo.AI 

In this tutorial, we will go through each step of the Clinfo.AI workflow. Before we start, we need to set up a few things. 


### 1.- Setting up the environment:
1.a.- Install the conda environment using the YAML file provided.

``` conda env create -f environment.yaml ```

1.b.- Select your environment to run the notebook. I recommend using VS Code.



### 2.- Creating Accounts

You will need at least one account and at most two (depending on how many calls per hour you plan to make):
* OpenAI account: If you start a free account for the first time, you will get $5 in API credits.
* NCBI account (NCBI_API_KEY): This is only necessary if you plan to make more than 10 calls per hour.


Once you have created your account(s), go to the **src\config.py** file and:

* Set OPENAI_API_KEY to your OpenAI API key

If you created an NCBI API account, add your key and email to the following values:
* NCBI_API_KEY
* EMAIL

Otherwise, leave them as None.
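
For orientation, a filled-in config file might look like the sketch below. The three variable names come from this tutorial; the placeholder value and the `check_config` helper are illustrative assumptions and are not part of Clinfo.AI.

```python
# Hypothetical sketch of a config file. Only the three variable names
# are taken from the tutorial; everything else is an assumption.
OPENAI_API_KEY = "sk-your-key-here"  # placeholder, replace with your own key
NCBI_API_KEY = None                  # optional, leave as None without an NCBI account
EMAIL = None                         # optional, leave as None without an NCBI account


def check_config(openai_key, ncbi_key, email):
    """Return a list of problems found; an empty list means the values look usable."""
    problems = []
    if not openai_key:
        problems.append("OPENAI_API_KEY is required")
    # The tutorial asks for the NCBI key and the email together, or neither.
    if (ncbi_key is None) != (email is None):
        problems.append("NCBI_API_KEY and EMAIL should both be set or both be None")
    return problems


print(check_config(OPENAI_API_KEY, NCBI_API_KEY, EMAIL))
```

Running a check like this before the import cell can surface a missing key earlier than the first API call would.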

In [10]:
# Make sure you have followed steps 1-2 before running this cell.
from  config import OPENAI_API_KEY, NCBI_API_KEY, EMAIL
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY


### 3.- Defining your own prompts:
We have designed prompts for each step of the Clinfo.AI workflow, leveraging the power of in-context learning. If you want to use your own prompts, you can edit them in **src\prompts**; otherwise we will use the default prompts:

In [11]:
PROMPS_PATH = os.path.join("src","clinfoai","prompts","PubMed","Architecture_1","master.json")
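
Before editing a prompt file it can help to look at what it contains. The helper below is a minimal sketch that assumes the file is a JSON object at the top level; the function name `list_prompt_tasks` is hypothetical, and this tutorial does not confirm how the keys of `master.json` are organized.

```python
import json


def list_prompt_tasks(path):
    """Return the top-level keys of a JSON prompt file.

    Assumption: the file holds a JSON object (a dict) at the top level.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return list(data.keys())
```

Calling `list_prompt_tasks(PROMPS_PATH)` would then show which entries exist in the default prompt architecture.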

### 4.- Define Clinfo.AI LLM Backbone
Clinfo.AI uses a chain of LLMs to summarize information, so we need to define an LLM backbone.

We will start with OpenAI models; however, if you have access to GPUs, it is possible to use Clinfo.AI with vLLM to run open-source LLMs as backbones (see tutorial 3).

In [12]:
MODEL: str = "gpt-3.5-turbo"
# MODEL: str = "Qwen/Qwen2-beta-7B-Chat"

### 5.- Init Clinfo+PubMed Engine
We now have everything we need to start our Clinfo+PubMed instance:

In [13]:
## 5.- Init Neural Retriever from path. 
# Do not change the path if you want to use the base prompts; otherwise, specify your own prompt architecture

nrpm = PubMedNeuralRetriever(
    architecture_path = PROMPS_PATH,
    model             = MODEL,
    verbose           = False,
    debug             = False,
    open_ai_key       = OPENAI_API_KEY,
    email             = EMAIL)


Task Name: pubmed_query_prompt
------------------------------------------------------------------------

Task Name: relevance_prompt
------------------------------------------------------------------------

Task Name: summarization_prompt
------------------------------------------------------------------------

Task Name: synthesize_prompt
------------------------------------------------------------------------


# Let's start!

In [14]:
### Step 0 : Ask a question ###
# Alternative example question (uncomment to use it instead):
# QUESTION = "What is the prevalence of COVID-19 in the United States?"
QUESTION = "What tests are needed to diagnose Chronic Neutropenia?"


## STEP 1 (Search PubMed): Convert the question into a query using an LLM
# This returns a list of queries (containing MeSH terms)
# These queries are used to retrieve articles from NCBI
# Once retrieved, we collect a list of article IDs (PMIDs).
pubmed_queries, article_ids = nrpm.search_pubmed(
    question=QUESTION,
    num_results=10,
    num_query_attempts=1)

print(f"Articles retrieved: {len(article_ids)}")
print(pubmed_queries)
print(article_ids)

  warn_deprecated(
  warn_deprecated(
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Articles retrieved: 10
['("Chronic Neutropenia" AND "diagnosis" AND "tests")']
['11964321', '6602565', '34303547', '27841775', '19305028', '30870474', '3534197', '10388004', '24827415', '20301576']
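
The returned IDs are PMIDs, so each one maps to a PubMed page. The sketch below builds those URLs using the same pattern that appears in the summaries later in this notebook; the helper name `pmid_to_url` is an assumption introduced here for illustration.

```python
PUBMED_BASE_URL = "https://pubmed.ncbi.nlm.nih.gov"


def pmid_to_url(pmid):
    """Build the PubMed page URL for a numeric PMID given as str or int."""
    pmid = str(pmid).strip()
    if not pmid.isdigit():
        raise ValueError(f"not a numeric PMID: {pmid!r}")
    return f"{PUBMED_BASE_URL}/{pmid}/"


# Two of the PMIDs returned by the search above
urls = [pmid_to_url(p) for p in ["11964321", "6602565"]]
for url in urls:
    print(url)
```

This is handy for spot-checking in a browser that the retrieved articles match the question.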


In [15]:
## Step 2: Fetch article data
# Previously, we only extracted the PMIDs. Now we will use those PMIDs to retrieve the metadata:
articles = nrpm.fetch_article_data(article_ids)

# Print an example article (index 1, i.e. the second article retrieved):
article_num = 1
print(f"Article {article_num}:\n")

#print(articles[article_num].keys())
#print(articles[article_num]['PubmedData'])
print(articles[article_num]["MedlineCitation"]["Article"]["Abstract"]["AbstractText"])
#print(articles[article_num]["MedlineCitation"]["Article"])


Article 1:

["Chronic neutropenia is a term used to describe a group of disorders characterized by a persistent neutrophil count of less than 1500 cells/microliters. We studied seven children and three sets of parents. We separated patients into a group with good prognosis and a group at higher risk of infection by using a combination of tests, including bone marrow aspiration and biopsy, steroid stimulation of bone marrow reserve, and in vitro CFU-GM and CSA assays. Children with a normal number of myeloid elements in their bone marrow and a normal bone marrow response to steroid stimulation had a benign course. CFU-GM and CSA assays helped to classify these children's neutropenia when their bone marrow had decreased numbers of myeloid elements. Family studies in three children were consistent with an inherited neutropenia, even when their parents were hematologically normal."]
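
Not every PubMed record has an abstract, so indexing straight into `["Abstract"]["AbstractText"]` can raise a `KeyError`. The helper below is a minimal sketch of a safer lookup. It assumes each article behaves like a nested dict with the keys used in the cell above; `get_abstract` is a hypothetical name, not a Clinfo.AI function.

```python
def get_abstract(article):
    """Return the abstract as one string, or None if the record has none.

    Assumes the nested layout used above:
    article["MedlineCitation"]["Article"]["Abstract"]["AbstractText"].
    """
    abstract = (
        article.get("MedlineCitation", {})
        .get("Article", {})
        .get("Abstract")
    )
    if not abstract:
        return None
    parts = abstract.get("AbstractText", [])
    if isinstance(parts, str):
        parts = [parts]
    # Structured abstracts come as several sections; join them with newlines.
    text = "\n".join(str(part) for part in parts)
    return text or None
```

With this in place, `get_abstract(articles[article_num])` returns `None` for records without an abstract instead of stopping the notebook.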


In [16]:
# STEP 3: Summarize each article
# This step is parallelized: although it looks like a single call, it makes one LLM call per article to summarize it.
# A second LLM call then judges each article's relevance to the original question.

article_summaries, irrelevant_articles = nrpm.summarize_each_article(articles, QUESTION)

In [17]:
# Summaries for relevant articles
article_summaries

[{'title': 'An update on the diagnosis and treatment of chronic idiopathic neutropenia.',
  'url': 'https://pubmed.ncbi.nlm.nih.gov/27841775/',
  'abstract': 'PURPOSE OF REVIEW:\nNeutropenia lasting for at least for 3 months and not attributable to drugs or a specific genetic, infectious, inflammatory, autoimmune or malignant cause is called chronic idiopathic neutropenia (CIN). CIN and autoimmune neutropenia (AIN) are very similar and overlapping conditions. The clinical consequences depend upon the severity of neutropenia, but it is not considered a premalignant condition.\n\nRECENT FINDINGS:\nLong-term observational studies in children indicate that the disease often lasts for 3-5 years in children, then spontaneously remits, but it rarely remits in adult cases. The value of antineutrophil antibody testing in both children and adults is uncertain. Most recent data suggest that CIN and AIN are immune-mediated diseases, but there are no new clinical or genetic tests to aid in diagnosi

In [18]:
# Articles deemed irrelevant
irrelevant_articles

[{'title': 'Invasive aspergillosis and endocarditis.',
  'url': 'https://pubmed.ncbi.nlm.nih.gov/34303547/',
  'abstract': 'INTRODUCTION:\nAspergillusfumigatus can cause a systemic infection called invasive aspergillosis causing pulmonary and extra-pulmonary damage. Aspergillus endocarditis (AE) is a relatively rare disease but can be life-threatening.\n\nCASE REPORTS:\nWe report here on five cases of endocarditis due to invasive aspergillosis: a 58-year-old man receiving immunosuppressive medication following a kidney graft, a 58-year-old man undergoing chemotherapy for chronic lymphocytic leukaemia, a 55-year-old man receiving corticosteroids for IgA vasculitis, a 52-year-old HIV-infected woman under no specific treatment and a 17-year-old boy under immunosuppressive therapy for auto-immune chronic neutropenia.\n\nDISCUSSION:\nAspergillus accounts for 25-30% of fungal endocarditis and 0.25% to 8.5% of all cases of infectious endocarditis. Aspergillus endocarditis results from invasio
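
Both lists hold dicts with at least `title`, `url` and `abstract` keys, as the outputs above show. Since the raw dicts are long, a compact listing can be easier to scan. The sketch below is one way to print it; `format_listing` is a hypothetical helper, and its numbering is purely positional, so it may not match the citation numbers used in the synthesis.

```python
def format_listing(summaries):
    """Format a list of article dicts as one '[n] title url' line per article."""
    lines = []
    for i, item in enumerate(summaries, start=1):
        title = item.get("title", "(no title)")
        url = item.get("url", "")
        lines.append(f"[{i}] {title} {url}".rstrip())
    return "\n".join(lines)


# Example using the first relevant article shown above
example = [{
    "title": "An update on the diagnosis and treatment of chronic idiopathic neutropenia.",
    "url": "https://pubmed.ncbi.nlm.nih.gov/27841775/",
}]
print(format_listing(example))
```

The same function can be applied to `article_summaries` and `irrelevant_articles` to compare what was kept and what was filtered out.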

In [19]:
# STEP 4: Synthesize all summaries to answer the question
synthesis = nrpm.synthesize_all_articles(article_summaries, QUESTION)
print("synthesis")
print(synthesis)

synthesis
Literature Summary: Chronic neutropenia can be diagnosed through various tests and evaluations. Studies suggest that chronic idiopathic neutropenia (CIN) and autoimmune neutropenia (AIN) are immune-mediated diseases characterized by neutropenia lasting for at least 3 months, with CIN often remitting in children but rarely in adults [1]. Antineutrophil antibody testing's value remains uncertain, and no new tests for diagnosis have been identified [1]. Quantification of antineutrophil antibodies at the time of diagnosis may be useful in predicting the clinical course of chronic neutropenia in childhood [3]. Additionally, the presence of neutrophil antibodies in infants and young children with chronic neutropenia establishes a diagnosis of autoimmune neutropenia of infancy, with recovery typically occurring by the age of 5 [7]. Treatment with granulocyte colony-stimulating factor (G-CSF) has shown efficacy in increasing neutrophils, especially in cases with recurrent fevers and 
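
The synthesis cites its sources with bracketed numbers such as `[1]` and `[7]`. To see which references were actually used, the numbers can be pulled out with a regular expression. This is a small sketch that assumes citations always take the form of an integer in square brackets; `cited_reference_numbers` is a name introduced here for illustration.

```python
import re


def cited_reference_numbers(text):
    """Return the distinct bracketed citation numbers found in text, sorted ascending."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", text)})


sample = (
    "CIN often remits in children but rarely in adults [1]. "
    "No new tests have been identified [1]. "
    "Antibody quantification may be useful [3]. "
    "Recovery typically occurs by the age of 5 [7]."
)
print(cited_reference_numbers(sample))
```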

# Great! We answered our first question using Clinfo.AI!
## Here are all the steps condensed:
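
The four calls used above can be wrapped in a single function. This is a minimal sketch: the method names and arguments are the ones used earlier in this notebook, while the wrapper `answer_question` and the shape of its return value are assumptions made for illustration.

```python
def answer_question(retriever, question, num_results=10, num_query_attempts=1):
    """Run search, fetch, summarize and synthesize for one question.

    `retriever` is expected to expose the four methods used in this
    notebook (for example the `nrpm` instance created above).
    """
    # Step 1: turn the question into PubMed queries and collect PMIDs
    pubmed_queries, article_ids = retriever.search_pubmed(
        question=question,
        num_results=num_results,
        num_query_attempts=num_query_attempts,
    )
    # Step 2: fetch metadata for those PMIDs
    articles = retriever.fetch_article_data(article_ids)
    # Step 3: summarize each article and split relevant from irrelevant
    article_summaries, irrelevant_articles = retriever.summarize_each_article(
        articles, question
    )
    # Step 4: synthesize the relevant summaries into one answer
    synthesis = retriever.synthesize_all_articles(article_summaries, question)
    return {
        "queries": pubmed_queries,
        "article_ids": article_ids,
        "article_summaries": article_summaries,
        "irrelevant_articles": irrelevant_articles,
        "synthesis": synthesis,
    }
```

A call such as `answer_question(nrpm, QUESTION)["synthesis"]` would then reproduce the answer obtained step by step above.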

In [20]:
import pandas as pd
mpg = pd.read_csv(os.path.join("PubMedRS-200","PubMedRS-200.csv"))
mpg.head()
# mpg.columns

Unnamed: 0,specialty,SubTopic,Title,Abstract,Introduction,Methods,Results,Conclusion,PMID,Ref_PMIDs,...,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34
0,allergy and immunology,asthma,Are prenatal anxiety or depression symptoms as...,BACKGROUND:\nAsthma is the most common respira...,Asthma is the most common respiratory disease ...,"PubMed, Embase, and the Cochrane Library were...",A total of 598 studies were initially identifi...,This meta-analysis demonstrated that prenatal ...,34158009,"['18158379', '17932381', '21094921', '21474570...",...,,,,,,,,,,
1,allergy and immunology,HIV,Should we care about Plasmodium vivax and HIV ...,BACKGROUND:\nMalaria and HIV are two important...,Malaria and HIV are two important public healt...,Medical records from a tertiary care centre in...,"A total of 1,048 vivax malaria patients were h...",Reports of HIV/PvCo are scarce in the literatu...,33407474,"['23327493', '23327493', '22970336', '27402513...",...,,,,,,,,,,
2,allergy and immunology,HIV,Bladder Cancer in HIV-infected Adults: An Emer...,OBJECTIVES:\nNon-AIDS-related malignancies now...,Non-AIDS-related malignancies now represent a ...,We conducted a single center retrospective stu...,During the study period we identified 15 HIV-i...,Bladder cancers in HIV-infected patients remai...,26642314,"['24901259', '19741479', '19818686', '19219610...",...,,,,,,,,,,
3,allergy and immunology,HIV,Is having sex with other men a risk factor for...,BACKGROUND:\nAlthough increased prevalence of ...,Although increased prevalence of transfusion t...,"We searched MEDLINE, Embase, The Cochrane Cent...","Out of 18 987 articles, 14 observational studi...",High-quality studies investigating the risk of...,25875812,"['24498030', '24498030', '19638153', '20527321...",...,,,,,,,,,,
4,allergy and immunology,pediatric allergy,"Atopic dermatitis, atopic eczema, or eczema? A...",BACKGROUND:\nThe lack of standardized nomencla...,The lack of standardized nomenclature for atop...,"A systematic review of the MEDLINE, EMBASE, an...","In MEDLINE, 33 060 were identified, of which 2...",Atopic dermatitis is the most commonly used te...,27392131,"['14657842', '14657842', '16867052', '26538253...",...,,,,,,,,,,


In [21]:
clean_data =  mpg.iloc[:, :15]
clean_data.head()

Unnamed: 0,specialty,SubTopic,Title,Abstract,Introduction,Methods,Results,Conclusion,PMID,Ref_PMIDs,Ref_DOIs,questions,PublishedDate,HumanQuestions,HumanAnswer
0,allergy and immunology,asthma,Are prenatal anxiety or depression symptoms as...,BACKGROUND:\nAsthma is the most common respira...,Asthma is the most common respiratory disease ...,"PubMed, Embase, and the Cochrane Library were...",A total of 598 studies were initially identifi...,This meta-analysis demonstrated that prenatal ...,34158009,"['18158379', '17932381', '21094921', '21474570...","['10.1097/PSY.0b013e31815c1b71', '10.1164/rccm...",Are prenatal anxiety or depression symptoms as...,11/1/2021,Are prenatal anxiety or depression symptoms as...,prenatal mental disorders increase the risk of...
1,allergy and immunology,HIV,Should we care about Plasmodium vivax and HIV ...,BACKGROUND:\nMalaria and HIV are two important...,Malaria and HIV are two important public healt...,Medical records from a tertiary care centre in...,"A total of 1,048 vivax malaria patients were h...",Reports of HIV/PvCo are scarce in the literatu...,33407474,"['23327493', '23327493', '22970336', '27402513...","['10.1186/1756-3305-6-18', '10.1186/1756-3305-...",Should we care about Plasmodium vivax and HIV ...,7/9/2021,Should we care about Plasmodium vivax and HIV ...,Reports of HIV/PvCo are scarce in the literatu...
2,allergy and immunology,HIV,Bladder Cancer in HIV-infected Adults: An Emer...,OBJECTIVES:\nNon-AIDS-related malignancies now...,Non-AIDS-related malignancies now represent a ...,We conducted a single center retrospective stu...,During the study period we identified 15 HIV-i...,Bladder cancers in HIV-infected patients remai...,26642314,"['24901259', '19741479', '19818686', '19219610...","['10.1097/QAD.0000000000000222', '10.1097/QAD....",Bladder Cancer in HIV-infected Adults: An Emer...,6/17/2016,Bladder Cancer in HIV-infected Adults common?,Bladder cancer was diagnosed a median of 14 ye...
3,allergy and immunology,HIV,Is having sex with other men a risk factor for...,BACKGROUND:\nAlthough increased prevalence of ...,Although increased prevalence of transfusion t...,"We searched MEDLINE, Embase, The Cochrane Cent...","Out of 18 987 articles, 14 observational studi...",High-quality studies investigating the risk of...,25875812,"['24498030', '24498030', '19638153', '20527321...","['10.1371/journal.pone.0087139', '10.1371/jour...",Is having sex with other men a risk factor for...,1/15/2016,Is having sex with other men a risk factor for...,High-quality studies investigating the risk of...
4,allergy and immunology,pediatric allergy,"Atopic dermatitis, atopic eczema, or eczema? A...",BACKGROUND:\nThe lack of standardized nomencla...,The lack of standardized nomenclature for atop...,"A systematic review of the MEDLINE, EMBASE, an...","In MEDLINE, 33 060 were identified, of which 2...",Atopic dermatitis is the most commonly used te...,27392131,"['14657842', '14657842', '16867052', '26538253...",[],"Atopic dermatitis, atopic eczema, or eczema?",11/16/2017,What is the most common term Atopic dermatitis...,Atopic dermatitis is the most commonly used te...


In [38]:
# Test from chunk 37 
response = {
    "HumanAnswer" : [],
    "Introduction" : [],
    "TLDR" : []
}
for index, row in clean_data.iterrows():
    human_answer = row["HumanAnswer"]
    abstract = row["Introduction"]
    question = row["HumanQuestions"]
    pubmed_queries, article_ids = nrpm.search_pubmed(question,num_results=10,num_query_attempts=1)
    articles = nrpm.fetch_article_data(article_ids)
    article_summaries,irrelevant_articles =  nrpm.summarize_each_article(articles, question)
    synthesis =   nrpm.synthesize_all_articles(article_summaries, question)
    # Extract the TL;DR
    sentences = synthesis.split("\n")

    # Search the lines for the one that contains the TL;DR
    TLDR = ""
    for sentence in sentences:
        if "TL;DR" in sentence:
            TLDR = sentence.replace("TL;DR: ", "")
            break
    

    print (TLDR)
    response["HumanAnswer"].append(human_answer)
    response["Introduction"].append(abstract)
    response["TLDR"].append(TLDR)
    
    # create new csv file with the above fields 
    # new_row = {"HumanAnswer": human_answer, "Abstract": abstract, "HumanQuestions": question, "TLDR": TLDR}

    # Stop after the first 11 rows (indices 0 through 10)
    if index == 10:
        break

Strong evidence suggests that prenatal anxiety or depression symptoms are associated with an increased risk of childhood asthma in offspring, emphasizing the importance of monitoring maternal mental health during pregnancy.
Evidence on Plasmodium vivax and HIV co-infection is limited and mixed, highlighting the need for further research to determine the true prevalence and clinical implications.
Bladder cancer can occur in HIV-infected adults, potentially linked to factors like urinary schistosomiasis, emphasizing the need for vigilance in monitoring and early detection in this population.
There is some evidence to suggest that having sex with other men may increase the risk of transfusion-transmissible infections, particularly HIV-1, but the data on other infections like HBV or HCV is inconclusive, highlighting the need for further research in Western countries.
Could not find 'Abstract' for article with PMID = 36648466
Could not find 'Abstract' for article with PMID = 36161367
Atopic
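
The TL;DR extraction inside the loop can be factored into a small function, which makes it easier to test on its own. The sketch below assumes, as the loop does, that the synthesis contains a line with the marker `TL;DR`; the name `extract_tldr` is hypothetical. Unlike the loop, it also drops any text that precedes the marker on that line.

```python
def extract_tldr(synthesis, marker="TL;DR"):
    """Return the text after the first TL;DR marker, or '' if there is none."""
    for line in synthesis.split("\n"):
        if marker in line:
            # Keep only what follows the marker, minus the colon and spaces.
            _, _, rest = line.partition(marker)
            return rest.lstrip(": ").strip()
    return ""


text = "Literature Summary: Several tests are used.\nTL;DR: Short answer."
print(extract_tldr(text))
```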

In [40]:
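new_csv = pd.DataFrame(data=response)
# Note: mode='a' appends to the file, and header=True writes a new header row each time this cell is re-run
new_csv.to_csv("new_csv.csv", mode='a', header=True)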
new_csv.head()

Unnamed: 0,HumanAnswer,Introduction,TLDR
0,prenatal mental disorders increase the risk of...,Asthma is the most common respiratory disease ...,Strong evidence suggests that prenatal anxiety...
1,Reports of HIV/PvCo are scarce in the literatu...,Malaria and HIV are two important public healt...,Evidence on Plasmodium vivax and HIV co-infect...
2,Bladder cancer was diagnosed a median of 14 ye...,Non-AIDS-related malignancies now represent a ...,Bladder cancer can occur in HIV-infected adult...
3,High-quality studies investigating the risk of...,Although increased prevalence of transfusion t...,There is some evidence to suggest that having ...
4,Atopic dermatitis is the most commonly used te...,The lack of standardized nomenclature for atop...,"Atopic dermatitis, atopic eczema, and eczema a..."
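
When results are appended to the same CSV over several runs, writing the header every time leaves extra header rows in the middle of the file. The sketch below shows the usual pattern of writing the header only when the file is new or empty. It uses the standard-library `csv` module so that it runs without extra dependencies, and `append_rows` is a hypothetical helper. With pandas, passing `header=not os.path.exists(path)` to `to_csv` would be a comparable approach.

```python
import csv
import os


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

The field names would be the keys of the `response` dict built above: `HumanAnswer`, `Introduction` and `TLDR`.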
