# Overview

In this notebook, we dive into LLMs and their synergy with vector databases. Vector databases store data as **vector embeddings**, a numerical format that lets LLMs retrieve and use information more contextually and accurately. Let's create an application that is **context-aware and reliable**.


# Embedding Process

In deep learning, raw data is transformed into a numerical format, known as vectors, that an AI system can process. High-dimensional data refers to data that has many attributes or features, each representing a different dimension. These dimensions help capture the nuanced characteristics of the data. The vector embedding process looks like this:

```
Raw input -> Tokenization -> Embedding -> Vector
```
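The pipeline above can be sketched with a toy example. This is only an illustration under stated assumptions: the vocabulary, the random embedding table, and the `embed` helper are hypothetical, and a real model learns its embeddings and uses a trained tokenizer rather than whitespace splitting.

```python
import numpy as np

# Hypothetical toy vocabulary and random embedding table, for illustration only.
vocab = {"the": 0, "puppy": 1, "dog": 2, "house": 3}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))  # 4 dimensions per token


def embed(text):
    """Raw input -> tokenization -> embedding lookup -> one pooled vector."""
    tokens = text.lower().split()                    # tokenization (whitespace only)
    ids = [vocab[t] for t in tokens if t in vocab]   # map known tokens to ids
    if not ids:
        raise ValueError("no known tokens in input")
    token_vectors = embedding_table[ids]             # one row per token
    return token_vectors.mean(axis=0)                # mean-pool into a single vector


vector = embed("The puppy")
print(vector.shape)
```

Mean pooling is just one way to collapse token vectors into a single vector; it is used here because it is easy to follow.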

Each number in a vector represents a specific feature of the data, and together these numbers encapsulate the essence of the original input in a format that the machine can process. For example, see the illustration below:

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/030/524/565/225/614/original/62e6da79e44eda50.webp)

The vector representation of puppy would be positioned closer in vector space to dog than to house, reflecting their semantic proximity. This approach extends to analogical relationships as well. The vector distance and direction between man and woman can be analogous to that between king and queen. This illustrates **how word vectors not only represent words but also allow for a meaningful comparison of their semantic relationships in a multidimensional vector space.**
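The idea of semantic proximity can be made concrete with cosine similarity. The sketch below uses small hand-made vectors, not output from a real embedding model; the dimension meanings in the comments are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hand-made toy vectors with assumed dimensions (animal, youth, building).
toy = {
    "puppy": [1.0, 0.9, 0.0],
    "dog": [1.0, 0.2, 0.0],
    "house": [0.0, 0.1, 1.0],
}
print(cosine_similarity(toy["puppy"], toy["dog"]))
print(cosine_similarity(toy["puppy"], toy["house"]))

# Hand-made toy vectors with assumed dimensions (royalty, gender).
royal = {"man": [0, 1], "woman": [0, -1], "king": [1, 1], "queen": [1, -1]}
analogy = np.array(royal["king"]) - np.array(royal["man"]) + np.array(royal["woman"])
print(cosine_similarity(analogy, royal["queen"]))
```

With these toy values, puppy scores higher against dog than against house, and king - man + woman points in the same direction as queen. Real embeddings show the same pattern only approximately.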


# Vector Databases

**Similarity search** is the core function where vector databases excel. They can quickly **find data points that are similar to a given query in a high-dimensional space.** Examples include content-based retrieval and e-commerce image search. Vector databases can also enhance LLMs with **contextual understanding**: they store and process text embeddings that enable LLMs to perform more nuanced, context-aware information retrieval. They help in understanding the semantic content of large volumes of text, which is pivotal in tasks like answering complex queries, **maintaining conversation context**, or generating relevant content.
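As a rough sketch of what similarity search computes, the snippet below does a brute-force top-k lookup by cosine similarity. The document titles, vectors, and the `top_k_similar` helper are hypothetical; production vector databases typically use approximate nearest-neighbour indexes instead of scanning every vector.

```python
import numpy as np


def top_k_similar(query, vectors, k=2):
    """Return (index, cosine score) pairs for the k vectors closest to the query."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalise rows and query so a dot product equals cosine similarity.
    v_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = v_norm @ q_norm
    order = np.argsort(-scores)[:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical documents with hand-made 3-dimensional vectors.
docs = ["refund policy", "shipping times", "return an item"]
doc_vectors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.8, 0.2, 0.1]]
query_vector = [1.0, 0.0, 0.0]

for idx, score in top_k_similar(query_vector, doc_vectors):
    print(docs[idx], round(score, 3))
```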


# Vector VS Traditional Databases

Traditional SQL databases excel at structured data management, thriving on exact matches and well-defined conditional logic. However, their rigid schema design makes them less adaptable to the semantic and contextual nuances of unstructured data. NoSQL databases offer more flexibility than traditional SQL systems and can handle semi-structured and unstructured data. Even so, NoSQL databases can fall short when handling the complex, high-dimensional vector data essential for LLMs and generative AI, which often involves interpreting context, patterns, and semantic content beyond simple data retrieval.
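To illustrate the exact-match limitation, here is a small sketch using Python's built-in `sqlite3`. The `faq` table and the `exact_lookup` helper are hypothetical; the stored question and date come from the dataset example shown later in this notebook.

```python
import sqlite3

# In-memory database with a single hypothetical FAQ table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE faq (question TEXT, answer TEXT)")
conn.execute(
    "INSERT INTO faq VALUES (?, ?)",
    ("When did Virgin Australia start operating?", "31 August 2000"),
)


def exact_lookup(question):
    """Return answers whose stored question matches the input exactly."""
    cursor = conn.execute("SELECT answer FROM faq WHERE question = ?", (question,))
    return cursor.fetchall()


# The stored wording is found; a paraphrase with the same meaning is not.
print(exact_lookup("When did Virgin Australia start operating?"))
print(exact_lookup("What year did Virgin Australia begin flying?"))
```

The second query returns no rows even though it asks the same thing, which is the gap that embedding-based similarity search is meant to close.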


# Enriching Context for LLMs with Vector Databases

Pre-trained models such as Llama 2 are powerful. However, they face challenges in handling specialized contexts because they are trained on broad, general datasets. These contextual limitations can be addressed in two main ways:


# Targeted Training

This involves retraining or fine-tuning the LLM on a dataset focused on the specific area of interest.


# Incorporating Context via Vector Databases

Alternatively, the LLM can be augmented by adding context directly into its prompts, using data from a vector database. In this setup, the vector database stores specialized information as vector embeddings, which can be retrieved and used by the LLM to enhance its responses. This approach allows for the inclusion of relevant, specialized knowledge without the need for extensive retraining. It's particularly useful for organizations or individuals lacking the resources for targeted training, as it leverages existing model capabilities while providing focused contextual insights. This is called **Retrieval-Augmented Generation (RAG)**.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/031/445/938/150/573/original/aee0db54da449455.webp)
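The prompt-augmentation step can be sketched in a few lines. The `build_rag_prompt` helper is a hypothetical name; it places the retrieved context before the question, separated by a blank line, which is the same layout the model wrapper later in this notebook uses.

```python
def build_rag_prompt(question, retrieved_passages):
    """Prepend retrieved passages to the question; fall back to the bare question."""
    if not retrieved_passages:
        return question
    context = "\n".join(retrieved_passages)
    return f"{context}\n\n{question}"


# Hypothetical retrieval result: one passage taken from the dataset example below.
passages = ["Virgin Australia commenced services on 31 August 2000 as Virgin Blue."]
print(build_rag_prompt("When did Virgin Australia start operating?", passages))
```

The model itself is unchanged; only the prompt carries the extra knowledge.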

In [1]:
%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install einops==0.7.0
!pip install sentence-transformers==2.5.1
!pip install chromadb==0.4.24

In [2]:
import os

os.environ['DATASET']='databricks/databricks-dolly-15k'
os.environ['MODEL_NAME']='tiiuae/falcon-7b-instruct'
os.environ['TASK_TYPE']='text-generation'
os.environ['SENTENCE_TRANSFORMER']='multi-qa-MiniLM-L6-cos-v1'

In [3]:
!accelerate estimate-memory ${MODEL_NAME} --library_name transformers

Loading pretrained config for `tiiuae/falcon-7b-instruct` from `transformers`...
config.json: 100%|█████████████████████████| 1.05k/1.05k [00:00<00:00, 5.58MB/s]
┌────────────────────────────────────────────────────────┐
│  Memory Usage for loading `tiiuae/falcon-7b-instruct`  │
├───────┬─────────────┬──────────┬───────────────────────┤
│ dtype │Largest Layer│Total Size│  Training using Adam  │
├───────┼─────────────┼──────────┼───────────────────────┤
│float32│    1.1 GB   │ 25.82 GB │       103.27 GB       │
│float16│  563.56 MB  │ 12.91 GB │        51.63 GB       │
│  int8 │  281.78 MB  │ 6.45 GB  │        25.82 GB       │
│  int4 │  140.89 MB  │ 3.23 GB  │        12.91 GB       │
└───────┴─────────────┴──────────┴───────────────────────┘
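The figures in the table follow a simple pattern: model size scales with the number of bytes per parameter, and the Adam training column is roughly four times the model size. The sketch below reproduces the table from its float32 total (25.82 GB) up to rounding; the `estimate_memory_gb` helper is a hypothetical name, not part of `accelerate`.

```python
def estimate_memory_gb(float32_total_gb):
    """Scale a float32 model size to other dtypes and to a rough Adam training cost."""
    bytes_per_param = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}
    rows = {}
    for dtype, nbytes in bytes_per_param.items():
        total = float32_total_gb * nbytes / 4  # relative to 4-byte float32
        rows[dtype] = (total, 4 * total)       # (model size, ~4x for Adam training)
    return rows


for dtype, (total, adam) in estimate_memory_gb(25.82).items():
    print(f"{dtype:>8}: model {total:6.2f} GB, Adam training {adam:7.2f} GB")
```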


In [4]:
from datasets import load_dataset

dataset=load_dataset(os.getenv('DATASET'), split='train[:5]')
print(dataset)

closed_qa_dataset=dataset.filter(lambda example: example['category']=='closed_qa')
print(closed_qa_dataset[0])

Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 5
})


{'instruction': 'When did Virgin Australia start operating?', 'context': "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.", 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}


In [5]:
import chromadb
from sentence_transformers import SentenceTransformer

class VectorStore:
    def __init__(self, collection_name, embedding_model):
        self.embedding_model=embedding_model
        self.chroma_client=chromadb.Client()
        self.collection=self.chroma_client.create_collection(name=collection_name)
        
    # Populate the vector store with embeddings from a dataset
    def populate_vectors(self, dataset):
        for i, item in enumerate(dataset):
            # For each dataset entry, generate and store an embedding of the combined 'instruction' and 'context' fields,
            # with the context acting as the document retrieved for our LLM prompts
            combined_text=f"{item['instruction']}, {item['context']}"
            embeddings=self.embedding_model.encode(combined_text).tolist()
            self.collection.add(embeddings=[embeddings], documents=[item['context']],ids=[f"id_{i}"])
            
    # Search the ChromaDB collection for relevant context based on a query
    def search_context(self, query, n_results=1):
        query_embeddings=self.embedding_model.encode(query).tolist()
        return self.collection.query(query_embeddings=query_embeddings, n_results=n_results)

    
encoder=SentenceTransformer(os.getenv('SENTENCE_TRANSFORMER'), device='cuda')
# Note: tune this value to fit your real-world use case
encoder.max_seq_length=200
vector_store=VectorStore("knowledge-base", encoder)

vector_store.populate_vectors(closed_qa_dataset)

In [6]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

class Falcon7BInstructModelPipeline:
    def __init__(self):
        self.pipeline, self.tokenizer=self.initialize_model(os.getenv('MODEL_NAME'))
    
    def initialize_model(self, model_name):
        # The tokenizer's primary role is to convert input text into a format that the model can understand.
        # It breaks the text down into smaller units called tokens,
        # ensuring the text is tokenized in a way that aligns with how the model was trained.
        tokenizer=AutoTokenizer.from_pretrained(model_name)
        
        # The pipeline handles multiple steps internally: tokenizing, feeding tokens into the model,
        # and processing the output into a human-readable form
        pipeline=transformers.pipeline(
            'text-generation', 
            model=model_name, 
            tokenizer=tokenizer, 
            torch_dtype=torch.bfloat16, 
            trust_remote_code=True, 
            device_map='auto'
        )
        
        return pipeline, tokenizer
    
    def generate_answer(self, question, context=None):
        prompt=question if context is None else f'{context}\n\n{question}'
        
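        # Generate a response; note that max_length counts the prompt tokens as well as the generated ones
        sequences=self.pipeline(
            prompt,
            max_length=500,
            do_sample=True,
            top_k=10,
            num_return_sequences=1,
            eos_token_id=self.tokenizer.eos_token_id
        )
        
        # Return the raw pipeline output: a list of dicts, each with a 'generated_text' key
        return sequences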
    
falcon_model_pipeline=Falcon7BInstructModelPipeline()

In [7]:
user_question="When was Tomoaki Komorida born?"

answer=falcon_model_pipeline.generate_answer(user_question)
answer

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


[{'generated_text': 'When was Tomoaki Komorida born?\nTomoaki Komorida was born on January 19, 1960.'}]
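The output above echoes the prompt before the answer. A small helper can strip it; this is a sketch that assumes the list-of-dicts shape shown in the output, and `extract_answer` is a hypothetical name rather than part of `transformers`.

```python
def extract_answer(sequences, prompt):
    """Take the first generated text and drop the echoed prompt, if present."""
    text = sequences[0]["generated_text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Same structure and content as the pipeline output shown above.
sample_output = [
    {
        "generated_text": "When was Tomoaki Komorida born?\n"
        "Tomoaki Komorida was born on January 19, 1960."
    }
]
print(extract_answer(sample_output, "When was Tomoaki Komorida born?"))
```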

# Generating Context-Aware Answers

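Without context, the model answered confidently above, but its birth date does not match the dataset passage retrieved below (July 10, 1981). Let's elevate our generative model's capability by providing it with relevant context, retrieved from our vector store.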

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/034/233/969/352/099/small/075dfa8bcd9f503f.webp)

In [8]:
context_response=vector_store.search_context(user_question)

context=''
# Note: Chroma returns distances, so a smaller value means a closer match.
# distances[0][0] belongs to the best match; the 1.0 cut-off is an assumed value to tune for your data.
if context_response['distances'][0][0]<1.0:
    context = ''.join(context_response['documents'][0])

enriched_answer=falcon_model_pipeline.generate_answer(user_question, context=context)
enriched_answer

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


[{'generated_text': "Komorida was born in Kumamoto Prefecture on July 10, 1981. After graduating from high school, he joined the J1 League club Avispa Fukuoka in 2000. Although he debuted as a midfielder in 2001, he did not play much and the club was relegated to the J2 League at the end of the 2001 season. In 2002, he moved to the J2 club Oita Trinita. He became a regular player as a defensive midfielder and the club won the championship in 2002 and was promoted in 2003. He played many matches until 2005. In September 2005, he moved to the J2 club Montedio Yamagata. In 2006, he moved to the J2 club Vissel Kobe. Although he became a regular player as a defensive midfielder, his gradually was played less during the summer. In 2007, he moved to the Japan Football League club Rosso Kumamoto (later Roasso Kumamoto) based in his local region. He played as a regular player and the club was promoted to J2 in 2008. Although he did not play as much, he still played in many matches. In 2010, he 
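The decision of whether to use a retrieved passage can be isolated in a small helper. This is a sketch under assumptions: it expects a result dict with nested `documents` and `distances` lists, as accessed in the cell above, and treats a smaller distance as a closer match. The `select_context` name, the sample distance, and the cut-off values are hypothetical.

```python
def select_context(result, max_distance):
    """Return the best-matching document if it is close enough, else an empty string."""
    documents = result["documents"][0]
    distances = result["distances"][0]
    if not documents:
        return ""
    # Results are assumed to be ordered best match first; smaller distance means closer.
    if distances[0] <= max_distance:
        return documents[0]
    return ""


# Hand-written result in the assumed shape; the distance value is made up.
sample_result = {
    "documents": [["Komorida was born in Kumamoto Prefecture on July 10, 1981."]],
    "distances": [[0.62]],
}
print(select_context(sample_result, max_distance=1.0))
print(repr(select_context(sample_result, max_distance=0.5)))
```

A suitable cut-off depends on the embedding model and the distance metric configured for the collection, so it is worth checking a few known queries before fixing a value.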

# Credit

* https://mlengineering.medium.com/integrating-vector-databases-with-llms-a-hands-on-guide-82d2463114fb
* https://www.elastic.co/what-is/vector-embedding
* https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/
* https://docs.trychroma.com