<a href="https://colab.research.google.com/github/Condemor-bit/Large-Language-Models-/blob/main/Q_A_PubMed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a simple retrieval-augmented generation (RAG) system that uses LlamaIndex and Zephyr-7b to answer questions over PubMed search results.

28/12/2023

In [13]:
#@title 1º) Change the runtime environment to 'T4 GPU' and install the dependencies
#%%capture
!pip install -q --upgrade git+https://github.com/huggingface/transformers
!pip install -q bitsandbytes
!pip install -q accelerate
!pip install -q "llama-index<0.10"  # the imports below use the pre-0.10 package layout
!pip install -q pypdf
!pip install -q docx2txt
!pip install -q llama-index[local_models]
!pip install -q llama-index[query_tools]
print("=========================")
print("Proceed to the next cell.")

In [None]:
#@title 2º) Load the model

import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM
from llama_index import ServiceContext

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)


def messages_to_prompt(messages):
  prompt = ""
  for message in messages:
    if message.role == 'system':
      prompt += f"<|system|>\n{message.content}</s>\n"
    elif message.role == 'user':
      prompt += f"<|user|>\n{message.content}</s>\n"
    elif message.role == 'assistant':
      prompt += f"<|assistant|>\n{message.content}</s>\n"

  # ensure we start with a system prompt, insert blank if needed
  if not prompt.startswith("<|system|>\n"):
    prompt = "<|system|>\n</s>\n" + prompt

  # add final assistant prompt
  prompt = prompt + "<|assistant|>\n"

  return prompt


llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-beta",
    tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=1024,  # previously 256
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    # do_sample=True is required for temperature, top_k and top_p to take effect
    generate_kwargs={"do_sample": True, "temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)


# Embedding model and chunking settings used when building and querying the index
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en-v1.5", chunk_size=512, chunk_overlap=50)


print("=========================")
print("Proceed to the next cell.")

# At this point the dependencies are installed and the model is loaded. Avoid re-running cells 1º and 2º.
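
To make the chat template in step 2 easier to inspect, the following minimal sketch reproduces the same `<|role|>` layout without loading the model. The `Message` dataclass and the `zephyr_prompt` name are hypothetical stand-ins introduced here for illustration; they are not part of LlamaIndex.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical stand-in for a chat message with `role` and `content` fields
    role: str
    content: str


def zephyr_prompt(messages):
    """Render messages in the <|role|> ... </s> layout used in step 2."""
    prompt = ""
    for m in messages:
        if m.role in ("system", "user", "assistant"):
            prompt += f"<|{m.role}|>\n{m.content}</s>\n"
    # Insert a blank system turn if the conversation does not start with one
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    # Leave an open assistant turn for the model to complete
    return prompt + "<|assistant|>\n"


rendered = zephyr_prompt([Message("user", "What is T2DM?")])
print(rendered)
```

A single user message is wrapped with an empty system turn in front and an open assistant turn at the end, which is the same shape as the `query_wrapper_prompt` passed to `HuggingFaceLLM`.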




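The `chunk_size=512, chunk_overlap=50` settings in step 2 control how each article is cut into pieces before embedding. The sketch below illustrates the idea with a plain sliding window over whitespace-separated words. This is a simplification and an assumption on my part: LlamaIndex's own splitter counts tokenizer tokens and tries to respect sentence boundaries, so its chunks will differ. The `chunk_words` helper is hypothetical.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Synthetic 1000-word text, used only to show the window arithmetic
sample = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(sample)
print(len(chunks), [len(c.split()) for c in chunks])
```

Consecutive chunks share their boundary words, so a sentence that straddles a cut still appears whole in at least one chunk.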
In [31]:
#@title 3º) Set the maximum number of articles and the search keywords (same syntax as a PubMed search), then run the search and build the vector index
from llama_index import download_loader


Max_results = 20 # @param {type:"integer"}
Search_Query = "(diabetes) AND (cardiovascular risk factors)"     #@param {type: "string"}

PubmedReader = download_loader("PubmedReader")
loader = PubmedReader()
documents = loader.load_data(search_query=Search_Query, max_results=Max_results)

from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

print("=========================")
print("Proceed to the next cell.")

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10751216&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10751209&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10751160&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10751141&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10751107&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10751045&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10751026&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10750989&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10750977&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10750974&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10750805&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10750528&db=pmc
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?id=10750526&db=pmc
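
Once the index holds one embedding per chunk, answering a question starts by retrieving the chunks most similar to the query. The sketch below shows that retrieval step on toy 2-D vectors, assuming cosine similarity as the scoring function. The `cosine` and `top_k` helpers are hypothetical and only illustrate the idea; the real vectors come from the `bge-small-en-v1.5` embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=4):
    """Return the indices of the k chunks most similar to the query."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors standing in for real chunk embeddings
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
best = top_k([1.0, 0.0], chunk_vecs, k=2)
print(best)
```

A larger `k` gives the model more context to draw on, at the cost of a longer prompt that must still fit inside the context window.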

In [34]:
#@title 4º) Interact with the data
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="transformers.generation.configuration_utils")
query_engine = index.as_query_engine(similarity_top_k=4)  # the default is similarity_top_k=2; the argument can be omitted
query = input("How can I help you?: ")
response = query_engine.query(query)
print(response)

How can I help you?: Explain me the relationship with diabetes and cardiovascular risk factors
Diabetes, particularly type 2 diabetes (T2DM), is associated with an increased risk of developing cardiovascular diseases (CVD). This relationship is multifaceted and involves several cardiometabolic risk factors that are commonly present in individuals with T2DM.

One of the major risk factors is hypertension, or high blood pressure. People with T2DM are more likely to develop hypertension due to the presence of IR, which can lead to increased salt and water retention in the body, as well as increased activity of the renin-angiotensin-aldosterone system (RAAS). This can result in elevated blood pressure levels, which in turn increase the risk of CVD.

Dyslipidemia, or abnormal lipid levels, is another significant risk factor. People with T2DM often have high levels of triglycerides, low levels of high-density lipoprotein (HDL) cholesterol, and high levels of low-density lipoprotein (LDL) cho