<a href="https://colab.research.google.com/github/PranavkrishnaVadhyar/Gen-o-Sys-SlashKey3.0/blob/main/Llama2_7b_chat_feature_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Note: Responses from local models can be quite slow, especially with 8-bit quantization.

With 4-bit quantization, llama2-7b-chat uses about 8 GB of VRAM.
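
As a rough sanity check on that figure, the memory taken by the weights alone can be estimated from the parameter count and the bits stored per weight. The sketch below assumes a nominal 7 billion parameters (an approximation) and deliberately ignores activations, the KV cache, and CUDA runtime overhead, none of which were measured in this notebook.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal parameter count for a "7B" model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights come to roughly 3.5 GB, so the remainder of the observed usage would have to come from runtime overhead rather than from the weights themselves.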

In [None]:
%pip install llama-index
%pip install transformers accelerate bitsandbytes
%pip install llama-index-readers-web
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface
%pip install llama-index-program-openai
%pip install llama-index-agent-openai

Collecting llama-index
  Downloading llama_index-0.10.56-py3-none-any.whl (6.8 kB)
Collecting llama-index-agent-openai<0.3.0,>=0.1.4 (from llama-index)
  Downloading llama_index_agent_openai-0.2.9-py3-none-any.whl (13 kB)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index)
  Downloading llama_index_cli-0.1.12-py3-none-any.whl (26 kB)
Collecting llama-index-core==0.10.56 (from llama-index)
  Downloading llama_index_core-0.10.56-py3-none-any.whl (15.5 MB)
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index)
  Downloading llama_index_embeddings_openai-0.1.11-py3-none-any.whl (6.3 kB)
Collecting llama-index-indices-managed-llama-cloud>=0.2.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.2.5-py3-none-any.whl (9.3 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_le

## Setup

### Data

In [3]:
!pip install pypdf



In [10]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Requires Google Drive to be mounted at /content/drive
documents = SimpleDirectoryReader('/content/drive/MyDrive/doctor_data').load_data()

### LLM

This should run on a T4 GPU in the Colab free tier.

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
# huggingface api token for downloading llama2
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')

In [26]:
import torch
from transformers import BitsAndBytesConfig
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM

# 4-bit NF4 quantization with double quantization; compute in float16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

system_prompt = """
You are a medical Q&A assistant. Your goal is to answer questions as accurately as possible based on the context and instructions provided. Responses should always be in JSON format with the specified keys.
"""
# PromptTemplate supersedes the older SimpleInputPrompt alias
query_wrapper_prompt = PromptTemplate("<|USER|>\n{query_str}\n<|ASSISTANT|>")

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    query_wrapper_prompt=query_wrapper_prompt,
    system_prompt=system_prompt,
    context_window=3900,
    model_kwargs={"token": hf_token, "quantization_config": quantization_config},
    tokenizer_kwargs={"token": hf_token},
    device_map="auto",
)
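
For reference, Llama 2 chat checkpoints were trained with an `[INST]` / `<<SYS>>` prompt layout rather than `<|USER|>` / `<|ASSISTANT|>` markers. The helper below is a sketch of that single-turn layout as plain string formatting. `build_llama2_prompt` is a hypothetical name, and whether switching templates would improve the answers in this notebook was not tested.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Format a single-turn prompt in the Llama 2 chat layout.

    The beginning-of-sequence token is left out on purpose, since the
    tokenizer normally adds it.
    """
    return (
        f"[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n"
        f"{user.strip()} [/INST]"
    )


print(
    build_llama2_prompt(
        "You are a medical Q&A assistant.",
        "Recommend a doctor for my cold.",
    )
)
```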


In [13]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"

### Index Setup

In [14]:
from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)

In [15]:
from llama_index.core.indices import SummaryIndex

summary_index = SummaryIndex.from_documents(documents)

### Helpful Imports / Logging

In [16]:
from llama_index.core.response.notebook_utils import display_response

In [17]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Basic Query Engine

### Compact (default)

In [29]:
query_engine = vector_index.as_query_engine(response_mode="compact", llm=llm)

response = query_engine.query("Recommend a doctor for my cold. Only give the response as JSON with keys as name, specialization")

display_response(response)

**`Final Response:`** Sure, I'd be happy to help! Based on the context information provided, I recommend Dr. Sophia Martinez for your cold.

* **Name:** Dr. Sophia Martinez
* **Specialization:** Neonatology, Pediatric Critical Care

Dr. Martinez has extensive experience in managing critically ill newborns and children with infectious diseases. She has played a pivotal role in developing protocols for pediatric emergency care and has contributed to numerous publications in reputable medical journals. Her leadership in the pediatric department ensures the highest standards of care for young patients.

I hope this helps! Let me know if you have any other questions.

### Refine

In [30]:
query_engine = vector_index.as_query_engine(response_mode="refine", llm=llm)

response = query_engine.query("Recommend a doctor for my cold")

display_response(response)

**`Final Response:`** Based on the updated context, the best doctor to recommend for your cold is Dr. Robert Green. Dr. Green is a pulmonologist with 17 years of experience, and his area of expertise includes respiratory infections, asthma, and COPD management. He is highly skilled in interventional pulmonology procedures such as bronchoscopy and has been instrumental in establishing protocols for the management of chronic respiratory diseases. His leadership and clinical expertise ensure the highest quality of care for patients with pulmonary conditions.

Please let me know if you have any other queries or if there's anything else I can help you with.

### Tree Summarize

In [24]:
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")

response = query_engine.query("Recommend a doctor for my cold. Only give the response as JSON with keys as name, specialization")

display_response(response)

**`Final Response:`** Sure, based on the information provided in the text, I recommend Dr. Sophia Martinez for your cold. Here's the response in JSON format:

{
"name": "Dr. Sophia Martinez",
"specialization": "Pediatrics",
"qualification": "MBBS, MD (Pediatrics)"
}

Dr. Martinez has extensive experience in managing critically ill newborns and children with infectious diseases, including cold. She has developed protocols for pediatric emergency care and has contributed to numerous publications in reputable medical journals. Her leadership in the pediatric department ensures the highest standards of care for young patients.
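
The model wraps the JSON in conversational text (and adds an unrequested `qualification` key), so downstream code cannot pass the response straight to `json.loads`. One possible workaround, sketched below with the standard library only, scans for the first decodable JSON object in the text. `extract_first_json` is a hypothetical helper, not part of LlamaIndex.

```python
import json
from typing import Optional


def extract_first_json(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in `text`, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and tolerates trailing text after it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    return None


reply = (
    "Sure, here's the response in JSON format:\n"
    '{\n"name": "Dr. Sophia Martinez",\n"specialization": "Pediatrics"\n}\n'
    "I hope this helps!"
)
record = extract_first_json(reply)
print(record)  # {'name': 'Dr. Sophia Martinez', 'specialization': 'Pediatrics'}
```

In the notebook, the same helper could be applied to `str(response)`, keeping only the requested keys afterwards.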

## Router Query Engine

In [None]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts."
    )
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document."
    )
)

### Single Selector

In [None]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    select_multi=False
)

response = query_engine.query("What was mentioned about Meta?")

display_response(response)

ValueError: ignored

### Multi Selector

In [None]:
from llama_index.core.query_engine import RouterQueryEngine

query_engine = RouterQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    select_multi=True,
)

response = query_engine.query("What was mentioned about Meta? Summarize with any other companies mentioned in the entire document.")

display_response(response)

ValueError: ignored

## SubQuestion Query Engine

In [None]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata

vector_tool = QueryEngineTool(
    vector_index.as_query_engine(),
    metadata=ToolMetadata(
        name="vector_search",
        description="Useful for searching for specific facts."
    )
)

summary_tool = QueryEngineTool(
    summary_index.as_query_engine(response_mode="tree_summarize"),
    metadata=ToolMetadata(
        name="summary",
        description="Useful for summarizing an entire document."
    )
)

In [None]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    [vector_tool, summary_tool],
    verbose=True,
)

response = query_engine.query("What was mentioned about Meta? How does it differ from how OpenAI is talked about?")

display_response(response)

OutputParserException: ignored

## SQL Query Engine

Here, we download a sample SQLite database (Chinook) with 11 tables covering music, playlists, and customers. We limit the query engine to a few of those tables for this test.

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
# -o (lowercase) writes to the given path; -O takes no argument
!curl https://www.sqlitetutorial.net/wp-content/uploads/2018/03/chinook.zip -o /content/chinook.zip
!unzip /content/chinook.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  298k  100  298k    0     0  1334k      0 --:--:-- --:--:-- --:--:-- 1338k
Archive:  /content/chinook.zip
  inflating: chinook.db              


In [None]:
from sqlalchemy import create_engine

engine = create_engine("sqlite:////content/chinook.db")

In [None]:
from llama_index.core import SQLDatabase

sql_database = SQLDatabase(engine)

In [None]:
from llama_index.core.indices.struct_store import NLSQLTableQueryEngine

query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["albums", "tracks", "artists"],
)

In [None]:
response = query_engine.query("What are some albums?")

display_response(response)

NotImplementedError: ignored

In [None]:
response = query_engine.query("What are some artists? Limit it to 5.")

display_response(response)

NotImplementedError: ignored

This last query should require a more complex join.

In [None]:
response = query_engine.query("What are some tracks from the artist AC/DC? Limit it to 3")

display_response(response)

NotImplementedError: ignored

In [None]:
print(response.metadata['sql_query'])
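
Because the queries above stopped with `NotImplementedError`, no generated SQL is available to inspect. For comparison, this is roughly the join such a question calls for, written by hand against a tiny in-memory stand-in. The table and column names are assumed to match the Chinook sample (`artists`, `albums`, `tracks`), and the rows are illustrative placeholders rather than data copied from the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artists (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE albums  (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
    CREATE TABLE tracks  (TrackId INTEGER PRIMARY KEY, Name TEXT, AlbumId INTEGER);
    INSERT INTO artists VALUES (1, 'AC/DC'), (2, 'Accept');
    INSERT INTO albums  VALUES (10, 'Album A', 1), (20, 'Album B', 2);
    INSERT INTO tracks  VALUES (100, 'Track 1', 10), (101, 'Track 2', 10),
                               (102, 'Track 3', 10), (103, 'Track 4', 10),
                               (200, 'Other track', 20);
    """
)

# tracks -> albums -> artists, filtered by artist name and capped at 3 rows
JOIN_SQL = """
SELECT t.Name
FROM tracks AS t
JOIN albums  AS al ON al.AlbumId = t.AlbumId
JOIN artists AS ar ON ar.ArtistId = al.ArtistId
WHERE ar.Name = ?
ORDER BY t.TrackId
LIMIT 3
"""

rows = [name for (name,) in conn.execute(JOIN_SQL, ("AC/DC",))]
print(rows)  # ['Track 1', 'Track 2', 'Track 3']
```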

## Programs

Depending on the LLM, you will have to test with either `OpenAIPydanticProgram` or `LLMTextCompletionProgram`.

In [None]:
from typing import List
from pydantic import BaseModel

from llama_index.core.program import LLMTextCompletionProgram
from llama_index.program.openai import OpenAIPydanticProgram

class Song(BaseModel):
    """Data model for a song."""

    title: str
    length_seconds: int


class Album(BaseModel):
    """Data model for an album."""

    name: str
    artist: str
    songs: List[Song]

In [None]:
from llama_index.core.output_parsers import PydanticOutputParser

prompt_template_str = """\
Generate an example album, with an artist and a list of songs. \
Using the movie {movie_name} as inspiration.\
"""
program = LLMTextCompletionProgram.from_defaults(
    output_parser=PydanticOutputParser(Album),
    prompt_template_str=prompt_template_str,
    llm=llm,
    verbose=True,
)

The call below seems to error out only because the model ran out of output tokens. We could fix this by passing a `max_new_tokens` value higher than the default of 256 to the `HuggingFaceLLM` constructor.
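
To illustrate why a capped generation length ends in a parse or validation failure, the sketch below cuts a valid JSON album short and tries to parse both versions. Character truncation stands in for the token limit here, which is a simplification, and the album contents are made up for the example.

```python
import json

# A complete, valid output in the shape of the Album model (made-up values)
full_output = json.dumps(
    {
        "name": "Overlook Nights",
        "artist": "The Caretakers",
        "songs": [
            {"title": "Room 237", "length_seconds": 210},
            {"title": "Endless Hallway", "length_seconds": 185},
        ],
    }
)


def try_parse(text: str) -> bool:
    """Return True if `text` is complete, valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Simulate generation stopping halfway through the object
truncated = full_output[: len(full_output) // 2]

print(try_parse(full_output))  # True
print(try_parse(truncated))    # False
```

Any output that stops before the closing brace fails in the same way, which is why giving the model a larger token budget is the natural first thing to try.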

In [None]:
output = program(movie_name="The Shining")

ValidationError: ignored

In [None]:
print(output)

## Data Agent

Similar to programs, OpenAI LLMs will use `OpenAIAgent`, while other LLMs will use `ReActAgent`.

In [None]:
from llama_index.core.agent import ReActAgent
from llama_index.agent.openai import OpenAIAgent

agent = ReActAgent.from_tools(
    [vector_tool, summary_tool],
    llm=llm,
    verbose=True
)

In [None]:
response = agent.chat("Hello!")
print(response)

Response: Great, I'm happy to help you with your question! Can you please provide more context or clarify what you need help with? For example, what kind of information are you trying to find or what question are you trying to answer?
Great, I'm happy to help you with your question! Can you please provide more context or clarify what you need help with? For example, what kind of information are you trying to find or what question are you trying to answer?


Interesting tool inputs and responses, but I guess it works lol

In [None]:
response = agent.chat("What was mentioned about Meta? How does it differ from how OpenAI is talked about?")
print(response)

Response: Great, let's get started! Based on your question, I understand that you want to know the difference between Meta and OpenAI. To answer this question, I will use the `summary` tool to summarize some relevant information about both Meta and OpenAI.
Action: summary
Action Input: {"text": "Meta and OpenAI are both AI research organizations, but they have some key differences. Meta was founded in 2015 by Mark Zuckerberg and is focused on developing AI technologies for Facebook and other Facebook-owned platforms. OpenAI, on the other hand, was founded in 2015 by a group of prominent AI researchers and is focused on developing AI technologies for a wide range of applications, including but not limited to natural language processing, computer vision, and robotics. Additionally, OpenAI is a non-profit organization, while Meta is a for-profit company. Both organizations have made significant contributions to the field of AI and have published numerous research papers and