## QnA with PubMed using ThirdAI's Playground

In this notebook, you will be able to:

1. Download ThirdAI's BOLT LLM trained on Pubmed-800K, along with the processed data.

2. Ask any question and get relevant references from PubMed.

3. (Optional) Use your OpenAI key to generate answers grounded in the retrieved references.

In [None]:
!pip3 install thirdai==0.7.6
!pip3 install openai
!pip3 install paper-qa
!pip3 install langchain
!pip3 install transformers

In [21]:
import json
import os

import numpy as np
from thirdai import bolt, licensing
from transformers import GPT2Tokenizer

if "THIRDAI_KEY" in os.environ:
    licensing.activate(os.environ["THIRDAI_KEY"])
else:
    # Request a trial license at https://www.thirdai.com/try-bolt/
    # licensing.activate("")  # Enter your ThirdAI key here
    pass

### Load Model

In [22]:
#### Model Checkpoint
checkpoint = "pubmed_800k.bolt"
if not os.path.exists(checkpoint):
    os.system("wget -nv -O pubmed_800k.bolt 'https://www.dropbox.com/s/kwoqt5c7bqbisbl/pubmed_800k.bolt?dl=0'")

model = bolt.UniversalDeepTransformer.load(checkpoint)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

### Load Dataset to display references

In [23]:
### Processed Data to show references
display_data = 'pubmed_800k_combined.json'
if not os.path.exists(display_data):
    os.system("wget -nv -O pubmed_800k_combined.json 'https://www.dropbox.com/s/8phkx4fht9j2npy/pubmed_800k_combined.json?dl=0'")

data_store = {}
with open(display_data, "r") as f:
    data = json.load(f)

for json_data in data:
    data_store[json_data["label"]] = json_data
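
As a minimal sketch of the lookup built above, the toy records below stand in for `pubmed_800k_combined.json`. Only the `label` and `abstract` fields appear in this notebook; the sample values are invented for illustration, integer labels are an assumption, and the real file may carry additional fields.

```python
import json

# Toy stand-in for the processed data file; the values are invented.
raw = """
[
  {"label": 0, "abstract": "First toy abstract."},
  {"label": 1, "abstract": "Second toy abstract."}
]
"""

records = json.loads(raw)

# Key each record by its label (assumed to be an integer) so that a
# predicted label maps straight back to the text to display.
toy_store = {record["label"]: record for record in records}

print(toy_store[1]["abstract"])
```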

### Get Answers from OpenAI using Langchain

In this section, we show how to use LangChain to query an OpenAI chat model and generate an answer from the references retrieved from the data store above. You must supply your own OpenAI key for this section to work. You can replace this segment with any other generative model of your choice, for example an open-source model such as MPT or Dolly, using the same prompt that you use with OpenAI.
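
To illustrate what swapping in another generative model might look like, the sketch below separates prompt construction from generation. The template text, `build_prompt`, `generate_answer`, and `echo_model` are all hypothetical names introduced here; this is not the prompt that paper-qa ships, and `echo_model` is only a placeholder for a real model call.

```python
from typing import Callable, List


def build_prompt(question: str, references: List[str],
                 answer_length: str = "about 50 words") -> str:
    """Assemble a grounded-answer prompt (hypothetical template)."""
    context = "\n\n".join(references)
    return (
        f"Answer the question in {answer_length}, using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def generate_answer(question: str, references: List[str],
                    model: Callable[[str], str]) -> str:
    """Send the same prompt to any callable that maps a prompt to text."""
    return model(build_prompt(question, references))


def echo_model(prompt: str) -> str:
    # Placeholder generator: reports the prompt size instead of calling an LLM.
    return f"(received a prompt of {len(prompt)} characters)"


print(generate_answer("What is X?", ["Reference A.", "Reference B."], echo_model))
```

Because `generate_answer` only needs a callable that takes a prompt string and returns text, an OpenAI-backed function and an open-source model wrapper could be passed in interchangeably.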

In [24]:
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = ""  # Enter your OpenAI key here
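
A key left empty typically surfaces as an error in a later cell. As a sketch, a small check can report the problem up front; the helper name `require_env` and the `DEMO_KEY` variable are hypothetical and used only so the example runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; provide it before running this cell.")
    return value


# Demonstrated with a dummy variable so the sketch runs without a real key.
os.environ["DEMO_KEY"] = "placeholder"
print(require_env("DEMO_KEY"))
```

In the notebook, calling `require_env("OPENAI_API_KEY")` before constructing the chat model would be one way to apply the same check.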

In [25]:
from langchain.chat_models import ChatOpenAI
from paperqa.prompts import qa_prompt
from paperqa.chains import make_chain

llm = ChatOpenAI(
    model_name='gpt-3.5-turbo', 
    temperature=0.1,
)

qa_chain = make_chain(prompt=qa_prompt, llm=llm)

In [26]:
def get_references(query):
    # The model takes the query as a string of space-separated GPT-2 token IDs.
    tokens = tokenizer.encode(query)
    predictions = model.predict({"QUERY": " ".join(map(str, tokens))})
    # Keep the three highest-scoring labels, best first.
    top_results = np.argsort(-predictions)[:3]
    # Map each label back to its abstract.
    return [data_store[result]["abstract"] for result in top_results]

def get_answer(query, references):
    # Join the references into a single context block for the QA prompt.
    return qa_chain.run(question=query, context='\n\n'.join(references), answer_length="about 50 words")
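
The retrieval step ranks every label by its score and keeps the top three. The sketch below reproduces that ranking on invented scores in plain Python; `top_k_labels` is a hypothetical helper, and the sort used here stands in for `np.argsort(-predictions)[:3]`. It also shows the space-separated token-ID string passed as `QUERY`, using made-up token IDs rather than real GPT-2 output.

```python
from typing import List, Sequence


def top_k_labels(scores: Sequence[float], k: int = 3) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Made-up token IDs; a real tokenizer would produce different values.
token_ids = [101, 7, 42]
query_string = " ".join(map(str, token_ids))
print(query_string)  # 101 7 42

# Invented scores for five labels.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
print(top_k_labels(scores))  # [1, 3, 4]
```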

### Example Question 1

In [27]:
query = "what percentage of cancer patients have depression?"

references = get_references(query)
print(references)

['Background \n Depression is the most common psychiatric comorbidity among people living with HIV/AIDS (PLWHA). Little is known about the comparative effectiveness between different types of antidepressants used to treat depression in this population. We compared the effectiveness of dual-action and single-action antidepressants in PLWHA for achieving remission from depression. \n \n \n Methods \n We used data from the Centers for AIDS Research Network of Integrated Clinic Systems to identify 1,175 new user dual-action or single-action antidepressant treatment episodes occurring from 2005–2014 for PLWHA diagnosed with depression. The primary outcome was remission from depression defined as a Patient Health Questionnaire-9 (PHQ-9) score <5. Mean difference in PHQ-9 depressive symptom severity was a secondary outcome. The main approach was an intent-to-treat (ITT) evaluation complemented with a per protocol (PP) sensitivity analysis. Generalized linear models were fitted to estimate tre

In [28]:
answer = get_answer(query, references)

print(answer)

Approximately 47% of cancer patients have depressive symptoms (Example2012).


### Example Question 2

In [29]:
query = "How to detect depression in geriatric cancer patients ?"

references = get_references(query)
answer = get_answer(query, references)
print(answer)

To detect depression in geriatric cancer patients, the Geriatric Depression Scale-Short Form, the Hospital Anxiety and Depression Scale, and the Center for Epidemiological Studies Depression Scale—Revised can be used as self-report measures. However, the published cutoff scores for detecting major depression may miss a significant number of depressed geriatric cancer patients. Revised cutoff scores may be more effective (Example2012).
