<a href="https://colab.research.google.com/github/ShawnLiu119/LLM_RAG_VectorDB/blob/main/LLM_RAG_VectorDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A quick POC for an LLM augmented with a VectorDB**

1. **Data - databricks-dolly-15k Hugging Face dataset**: An open-source dataset of instruction-following records generated by Databricks employees. It's designed for training large language models (LLMs), synthetic data generation, and data augmentation. The dataset includes various types of prompts and responses in categories like brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. <br>
2. **Vector DB - Chroma** as the Vector Store (Knowledge Base): We employ Chroma as our primary vector store, acting as the knowledge base for our bot.<br>
3. **Sentence Transformers for Semantic Search**: Specifically, we use the 'multi-qa-MiniLM-L6-cos-v1' model from Sentence Transformers, optimized for semantic search applications. This model is responsible for **generating embeddings that are stored in Chroma**.
<br>
4. **LLM - Falcon 7B Instruct Model**: Serving as our open-source generative model, Falcon 7B is a decoder-only model with 7 billion parameters. Developed by TII, it was trained on 1,500B tokens of RefinedWeb, supplemented with curated corpora. Notably, Falcon 40B, its larger counterpart, ranked at the top of Hugging Face's Open LLM Leaderboard at the time of its release.

In [3]:
pip install h5py typing-extensions wheel



In [4]:
!pip install -qU \
transformers==4.30.2 \
torch==2.3.0 \
einops==0.6.1 \
accelerate==0.20.3 \
datasets==2.14.5 \
chromadb \
sentence-transformers==2.2.2

# torch==2.0.1+cu118 \ does not work
# einops: tensor manipulation

## Build Knowledge Base - VectorDB

In [5]:
from datasets import load_dataset

# Load only the training split of the dataset
train_dataset = load_dataset("databricks/databricks-dolly-15k", split='train')

# Keep only the 'closed_qa' entries, where the answer must be found in the supplied context
closed_qa_dataset = train_dataset.filter(lambda example: example['category'] == 'closed_qa')

print(closed_qa_dataset[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


{'instruction': 'When did Virgin Australia start operating?', 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.", 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}


### Step 2: Generate embeddings for each instruction and its context, and store them in our vector database, ChromaDB

Chroma DB, an open-source vector storage system, excels in managing vector embeddings. It's tailored for applications like semantic search engines.
<br>
**Semantic search** is a search engine technology that interprets the meaning of words and phrases. The results of a semantic search will return content **matching the meaning of a query**, as **opposed to content that literally matches words** in the query. <br>
multi-qa-MiniLM-L6-cos-v1: specifically trained for semantic search use cases
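
To make the idea concrete, here is a minimal sketch of meaning-based ranking: the query and the documents are represented as vectors, and documents are ranked by cosine similarity to the query. The three-dimensional vectors and the helper names are hand-made, hypothetical stand-ins for illustration only; the real model produces 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return document keys sorted from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )

# Hypothetical toy "embeddings"; the axes loosely mean (aviation, finance, sports).
toy_docs = {
    "airline history": [0.9, 0.1, 0.0],
    "stock market": [0.1, 0.9, 0.1],
    "football career": [0.0, 0.1, 0.9],
}
# Stand-in for an embedded question about when a carrier started flying.
toy_query = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(toy_query, toy_docs)
print(ranking)
```

The query shares no literal words with the document labels; the ranking depends only on how closely the vectors point in the same direction.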


In [6]:
import chromadb
from sentence_transformers import SentenceTransformer

class VectorStore:

    def __init__(self, collection_name):
        # Initialize the embedding model
        self.embedding_model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')  # pre-assigned model
        self.chroma_client = chromadb.Client()
        self.collection = self.chroma_client.create_collection(name=collection_name)

    # Method to populate the vector store with embeddings from a dataset
    def populate_vectors(self, dataset):
        for i, item in enumerate(dataset):
            combined_text = f"{item['instruction']}. {item['context']}"  # 'response' is left out: the instruction and context should already carry the information needed for retrieval
            embeddings = self.embedding_model.encode(combined_text).tolist()
            self.collection.add(embeddings=[embeddings], documents=[item['context']], ids=[f"id_{i}"])

    # Method to search the ChromaDB collection for relevant context based on a query
    def search_context(self, query, n_results=1):
        query_embeddings = self.embedding_model.encode(query).tolist()
        return self.collection.query(query_embeddings=query_embeddings, n_results=n_results)


# Example usage
if __name__ == "__main__":
    # Initialize the vector store with a collection name
    vector_store = VectorStore("knowledge-base")

    # Assuming closed_qa_dataset is defined and available
    vector_store.populate_vectors(closed_qa_dataset)

In [7]:
query = "what date is U.S. Memorial Day?"

vector_store.search_context(query, n_results=2)

{'ids': [['id_1336', 'id_1333']],
 'distances': [[1.0965921878814697, 1.1642955541610718]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['San Francisco Mayor Ed Lee declared January 26 as "Original Joe\'s Day".',
   'Independence Day (colloquially the Fourth of July) is a federal holiday in the United States commemorating the Declaration of Independence, which was ratified by the Second Continental Congress on July 4, 1776, establishing the United States of America.\n\nThe Founding Father delegates of the Second Continental Congress declared that the Thirteen Colonies were no longer subject (and subordinate) to the monarch of Britain, King George III, and were now united, free, and independent states. The Congress voted to approve independence by passing the Lee Resolution on July 2 and adopted the Declaration of Independence two days later, on July 4.\n\nIndependence Day is commonly associated with fireworks, parades, barbecues, carnivals, fairs, picnics, concert

In [8]:
query = "Is Google a public company?"

vector_store.search_context(query, n_results=2)

{'ids': [['id_722', 'id_44']],
 'distances': [[1.1779518127441406, 1.2146285772323608]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [["True Corporation Public Company Limited (TRUE) (Formerly: True Corporation Public Company Limited and Total Access Communication Public Company Limited) is a communications conglomerate in Thailand. It is a joint venture between Charoen Pokphand Group and Telenor, formed by the merger between the original True Corporation and DTAC in the form of equal partnership to create a new telecommunications company that can fully meet the needs of the digital age. True controls Thailand's largest cable TV provider, TrueVisions, Thailand's largest internet service provider True Online,[citation needed] Thailand's largest mobile operators, TrueMove H and DTAC TriNet, which is second and third only to AIS. and entertainment media including television, internet, online games, and mobile phones under the True Digital brand. As of August 2014, Tru

In [9]:
query = "who is United State's President today"

vector_store.search_context(query, n_results=2)

{'ids': [['id_1075', 'id_48']],
 'distances': [[0.8786306381225586, 1.061730980873108]],
 'metadatas': [[None, None]],
 'embeddings': None,
 'documents': [['From Simple English Wikipedia, the free encyclopedia\nPresident of the\nUnited States of America\nSeal of the President of the United States.svg\nSeal of the President of the United States\nFlag of the President of the United States.svg\nFlag of the President of the United States\nJoe Biden presidential portrait.jpg\nIncumbent\nJoe Biden\nsince January 20, 2021\nExecutive branch of the U.S. government\nExecutive Office of the President\nStyle\t\nMr. President\n(informal)\nThe Honorable\n(formal)\nHis Excellency\n(diplomatic)\nType\t\nHead of state\nHead of government\nAbbreviation\tPOTUS\nMember of\t\nCabinet\nDomestic Policy Council\nNational Economic Council\nNational Security Council\nResidence\tWhite House\nSeat\tWashington, D.C.\nAppointer\tElectoral College\nTerm length\tFour years, renewable once\nConstituting instrument\tCo

For each dataset entry, we generate and store an embedding of the combined 'instruction' and 'context' fields, with the context acting as the document for retrieval in our LLM prompts.
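
The `distances` in the query results above can be hard to read on their own. As a rough sketch, assuming the collection uses Chroma's default squared-L2 distance and that the embedding model returns unit-length vectors (both are assumptions here, not something this notebook verifies), a distance `d` corresponds to a cosine similarity of `1 - d / 2`. The helper names and the `max_distance` cutoff below are hypothetical.

```python
def squared_l2_to_cosine(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity.

    For unit vectors u and v: ||u - v||^2 = 2 - 2 * cos(u, v).
    """
    return 1.0 - distance / 2.0

def filter_by_distance(result, max_distance):
    """Keep (id, document, distance) hits of the first query within max_distance.

    `result` is expected to have the same shape as the dicts printed above.
    """
    hits = zip(result["ids"][0], result["documents"][0], result["distances"][0])
    return [hit for hit in hits if hit[2] <= max_distance]

# Distances copied from the outputs above.
memorial_day_best = 1.0965921878814697
president_best = 0.8786306381225586
print(round(squared_l2_to_cosine(memorial_day_best), 3))
print(round(squared_l2_to_cosine(president_best), 3))

# Result-shaped dict with placeholder document text, to show the filter.
sample_result = {
    "ids": [["id_1075", "id_48"]],
    "documents": [["placeholder document A", "placeholder document B"]],
    "distances": [[0.8786306381225586, 1.061730980873108]],
}
kept = filter_by_distance(sample_result, max_distance=1.0)
print([hit[0] for hit in kept])
```

A cutoff like this is one possible way to avoid passing weak matches, such as the Memorial Day results, to the LLM as context; a suitable threshold would need to be tuned on real queries.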

### Step 3: Leverage the LLM to generate a response

Falcon-7B-Instruct is the Falcon-7B variant better suited to taking generic instructions in a chat format, so it is the model used here.

In [23]:
import transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


class Falcon7BInstructModel:

    def __init__(self):
        # Model name
        model_name = "tiiuae/falcon-7b-instruct"
        # model_name = "tiiuae/falcon-7b"
        self.pipeline, self.tokenizer = self.initialize_model(model_name) #defined method as below

    def initialize_model(self, model_name):
        # Tokenizer initialization
        tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Pipeline setup for text generation
        pipeline = transformers.pipeline(
            "text-generation",
            model=model_name,
            tokenizer=tokenizer,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            device_map="auto",
        )

        return pipeline, tokenizer

    def generate_answer(self, question, context=None):
        # Preparing the input prompt
        prompt = question if context is None else f"{context}\n\n{question}"

        # Generating responses
        sequences = self.pipeline(
            prompt,
            max_length=500,
            do_sample=True,
            top_k=10,#sample from top 10 most likely tokens
            # top_p = 0.5,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id,
        )

        # Return the generated text of the first (and only) sequence
        return sequences[0]['generated_text']

The key parameters are:

Temperature: Controls randomness, higher values increase diversity.

Top-p (nucleus): The cumulative probability cutoff for token selection. Lower values mean sampling from a smaller, more top-weighted nucleus.

Top-k: Sample from the k most likely next tokens at each step. Lower k focuses on higher probability tokens.

In general:

Higher temperature will make outputs more random and diverse.

Lower top-p values reduce diversity and focus on more probable tokens.

Lower top-k also concentrates sampling on the highest probability tokens for each step.
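
The effect of these parameters can be sketched on a toy distribution. This is a simplified illustration in plain Python, not the code the `transformers` library runs internally; the function names and the four-token logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p_filter(probs, top_k=None, top_p=None):
    """Return {token_index: renormalized probability} after top-k, then top-p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
        total = sum(p for _, p in ranked)
        ranked = [(index, p / total) for index, p in ranked]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for index, p in ranked:
            kept.append((index, p))
            cumulative += p
            if cumulative >= top_p:  # smallest set whose mass reaches top_p
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {index: p / total for index, p in ranked}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = softmax_with_temperature(toy_logits, temperature=1.0)
only_top_k = top_k_top_p_filter(probs, top_k=2)
only_top_p = top_k_top_p_filter(probs, top_p=0.5)
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
flat = softmax_with_temperature(toy_logits, temperature=2.0)

print(only_top_k)
print(only_top_p)
print(sharp[0], probs[0], flat[0])
```

With these toy numbers, `top_k=2` keeps the two most likely tokens, `top_p=0.5` keeps only the single most likely token because it already covers more than half of the probability mass, and lowering the temperature moves more probability onto the top token.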



The **tokenizer** is a key component in natural language processing (NLP) models like Falcon-7B-Instruct. Its primary role is to convert input text into a format that the model can understand. Essentially, it breaks down the text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the tokenizer's design. In the context of the Falcon-7B-Instruct model, the `AutoTokenizer.from_pretrained(model_name)` call loads a tokenizer that's specifically designed to work with this model, ensuring that the text is tokenized in a way that aligns with how the model was trained.<br>
The **pipeline** in the transformers library is a high-level utility that abstracts away much of the complexity involved in processing data and getting predictions from a model. It handles multiple steps internally, such as tokenizing the input text, feeding the tokens into the model, and then processing the model's output into a human-readable form. In this script, the pipeline is set up for "text-generation", which means it's optimized to take in a prompt (like the user question) and generate a continuation of the text based on that prompt.

In [17]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
# Works around the error raised by the `pip install git+...transformers` command below

In [1]:
pip install git+https://github.com/huggingface/transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-hpnuh6dx
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-hpnuh6dx
  Resolved https://github.com/huggingface/transformers to commit a564d10afe1a78c31934f0492422700f61a0ffc0
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [12]:
pip install xformers

Collecting xformers
  Downloading xformers-0.0.26.post1-cp310-cp310-manylinux2014_x86_64.whl (222.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m222.7/222.7 MB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xformers
Successfully installed xformers-0.0.26.post1


In [24]:

# Initialize the Falcon model class
falcon_model = Falcon7BInstructModel()

user_question = "When was Tomoaki Komorida born?"

# Generate an answer to the user question using the LLM
answer = falcon_model.generate_answer(user_question)

print(f"Result: {answer}")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

The model 'FalconForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNorm

Result: When was Tomoaki Komorida born?
I'm sorry, I cannot provide an accurate answer to that question as Tomoaki Komorida, a member of the Japanese professional soccer team Tokyo Verdy F.C., has not been born yet. His birth date is currently listed as August 10, 1988.


**top_p not set**: the answer is strict about whether an exact answer exists in the Vector DB <br>
**top_p = 0.5**: the LLM starts to make up a related answer, but not the one we expect

In [25]:
user_question = "What date is U.S. Memorial Day"

# Generate an answer to the user question using the LLM
answer = falcon_model.generate_answer(user_question)

print(f"Result: {answer}")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Result: What date is U.S. Memorial Day?
U.S. Memorial Day is celebrated on May 30th each year on the last Monday after Memorial Day Weekend.


The Q&A examples above rely only on the pre-trained LLM, without incorporating any reference from the Vector DB.

### Step 4: Incorporating Vector DB knowledge

In [31]:
# Assuming vector_store and falcon_model have already been initialized
user_question = "When was Tomoaki Komorida born?"

# Fetch context from VectorStore, assuming it's been populated
context_response = vector_store.search_context(user_question)
print(context_response['documents'])


[['Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he debuted as a midfielder in 2001, he did not play much and the club was relegated to the J2 League at the end of the 2001 season. In 2002, he moved to the J2 club Oita Trinita. He became a regular player as a defensive midfielder and the club won the championship in 2002 and was promoted in 2003. He played many matches until 2005. In September 2005, he moved to the J2 club Montedio Yamagata. In 2006, he moved to the J2 club Vissel Kobe. Although he became a regular player as a defensive midfielder, his gradually was played less during the summer. In 2007, he moved to the Japan Football League club Rosso Kumamoto (later Roasso Kumamoto) based in his local region. He played as a regular player and the club was promoted to J2 in 2008. Although he did not play as much, he still played in many matches. In 2010, he moved to Indonesia

In [32]:
# Extract the context text from the response
# The context is the first element under the 'documents' key (one list of documents per query)
context = "".join(context_response['documents'][0])

# Generate an answer using the Falcon model, incorporating the fetched context
enriched_answer = falcon_model.generate_answer(user_question, context=context)

print(f"Result: {enriched_answer}")

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Result: Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he debuted as a midfielder in 2001, he did not play much and the club was relegated to the J2 League at the end of the 2001 season. In 2002, he moved to the J2 club Oita Trinita. He became a regular player as a defensive midfielder and the club won the championship in 2002 and was promoted in 2003. He played many matches until 2005. In September 2005, he moved to the J2 club Montedio Yamagata. In 2006, he moved to the J2 club Vissel Kobe. Although he became a regular player as a defensive midfielder, his gradually was played less during the summer. In 2007, he moved to the Japan Football League club Rosso Kumamoto (later Roasso Kumamoto) based in his local region. He played as a regular player and the club was promoted to J2 in 2008. Although he did not play as much, he still played in many matches. In 2010, he moved to Indo

With the context retrieved from the Vector DB, the answer is now grounded in the stored document, which gives the correct birth date (July 10, 1981).
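
The result above begins with the retrieved context because the text-generation pipeline returns the prompt followed by the continuation by default. A minimal sketch of one way to handle this is to build the prompt from an explicit template and strip the prompt prefix from the output. The template wording and the helper names `build_prompt` and `strip_prompt` are hypothetical and not part of the notebook; the generated string below is a hand-written stand-in, not a model output.

```python
def build_prompt(question, context=None):
    """Build a prompt that separates the retrieved context from the question."""
    if context is None:
        return f"Question: {question}\nAnswer:"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def strip_prompt(generated_text, prompt):
    """Remove the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

prompt = build_prompt(
    "When was Tomoaki Komorida born?",
    context="Komorida was born in Kumamoto Prefecture on July 10, 1981.",
)
# Hand-written stand-in for seq['generated_text']; a real value would come
# from the pipeline call in generate_answer.
stand_in_generated_text = prompt + " July 10, 1981."
answer_only = strip_prompt(stand_in_generated_text, prompt)
print(answer_only)
```

An alternative is to pass `return_full_text=False` in the pipeline call, so that only the continuation is returned.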