# Overview

In this notebook, we explore large language models (LLMs) and how they work together with vector databases. Vector databases store data as **vector embeddings**, which let LLMs retrieve and use information more contextually and accurately. Let's build an application that is **context-aware and reliable**.


# Embedding Process

In deep learning, raw data is transformed into a numerical format, known as vectors, that an AI system can process. High-dimensional data refers to data that has many attributes or features, each representing a different dimension. These dimensions help capture the nuanced characteristics of the data. The vector embedding process looks like this:

```
Raw input -> Tokenization -> Embedding -> Vector
```
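
The pipeline above can be sketched in a few lines of Python. This is a toy illustration only: the lookup table `EMBEDDING_TABLE`, its values, the whitespace tokenizer, and the mean-pooling step are all assumptions made for the example, not how a real embedding model is implemented.

```python
# Hypothetical 2-dimensional embedding table; real models learn these values.
EMBEDDING_TABLE = {
    "small": [0.2, 0.4],
    "dog": [0.8, 0.6],
}
UNKNOWN = [0.0, 0.0]  # fallback for tokens missing from the table


def tokenize(text):
    """Raw input -> tokens (naive lowercase whitespace split)."""
    return text.lower().split()


def embed(tokens):
    """Tokens -> one vector per token, looked up in the toy table."""
    return [EMBEDDING_TABLE.get(token, UNKNOWN) for token in tokens]


def pool(vectors):
    """Per-token vectors -> a single vector (element-wise mean)."""
    count = len(vectors)
    return [sum(dim) / count for dim in zip(*vectors)]


sentence_vector = pool(embed(tokenize("Small dog")))
print(sentence_vector)
```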

Each number in the vector represents a specific feature of the data, and together these numbers encapsulate the essence of the original input in a format that the machine can process. For example, see the illustration below:

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/030/524/565/225/614/original/62e6da79e44eda50.webp)

The vector representation of puppy would be positioned closer in vector space to dog than to house, reflecting their semantic proximity. This approach extends to analogical relationships as well. The vector distance and direction between man and woman can be analogous to that between king and queen. This illustrates **how word vectors not only represent words but also allow for a meaningful comparison of their semantic relationships in a multidimensional vector space.**
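
To make "closer in vector space" concrete, here is a minimal sketch that compares hand-picked vectors with cosine similarity. The three-dimensional values in `toy_vectors` are hypothetical and chosen only for illustration; real embeddings have hundreds of dimensions and come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-picked for illustration only.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "house": [0.1, 0.2, 0.95],
}

sim_puppy_dog = cosine_similarity(toy_vectors["puppy"], toy_vectors["dog"])
sim_puppy_house = cosine_similarity(toy_vectors["puppy"], toy_vectors["house"])

print(f"puppy vs dog:   {sim_puppy_dog:.3f}")
print(f"puppy vs house: {sim_puppy_house:.3f}")
```

With these toy values, the puppy/dog score comes out much higher than the puppy/house score, which is the behavior the paragraph above describes.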


# Vector Databases

**Similarity search** is the core function at which vector databases excel. They can quickly **find data points that are similar to a given query in a high-dimensional space.** Examples include content-based retrieval and e-commerce image search. Vector databases can also enhance LLMs with **contextual understanding**: they store and process text embeddings that enable LLMs to perform more nuanced, context-aware information retrieval. They help in understanding the semantic content of large volumes of text, which is pivotal in tasks like answering complex queries, **maintaining conversation context**, or generating relevant content.
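
As a rough sketch of what similarity search does, the snippet below scores every stored vector against a query and keeps the best matches. The store contents, document ids, and the helper `top_k` are hypothetical. This brute-force scan is an assumption made for clarity; production vector databases typically rely on approximate nearest-neighbor indexes instead of comparing against every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, store, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical stored embeddings keyed by document id.
store = {
    "doc_pets": [0.9, 0.1, 0.0],
    "doc_housing": [0.1, 0.9, 0.1],
    "doc_cooking": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.1]
results = top_k(query, store, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```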


# Vector VS Traditional Databases

Traditional SQL databases excel at structured data management, thriving on exact matches and well-defined conditional logic. However, their rigid schema design makes them less adaptable to the semantic and contextual nuances of unstructured data. NoSQL databases offer more flexibility than traditional SQL systems and can handle semi-structured and unstructured data. Even so, NoSQL databases can fall short when handling the complex, high-dimensional vector data essential for LLMs and generative AI, which often involves interpreting context, patterns, and semantic content beyond simple data retrieval.
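
The limitation of exact matching can be seen with a small in-memory SQLite table. The table name `notes` and its rows are made up for this sketch. A substring query for "puppy" finds nothing, even though one row is clearly about a young dog, because the match is on characters rather than meaning.

```python
import sqlite3

# In-memory database with a hypothetical table of short notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [("How to train a young dog",), ("Mortgage rates for a first house",)],
)

# Substring match on a word that never appears literally in the table.
puppy_hits = conn.execute(
    "SELECT body FROM notes WHERE body LIKE ?", ("%puppy%",)
).fetchall()

# Substring match on a word that does appear.
dog_hits = conn.execute(
    "SELECT body FROM notes WHERE body LIKE ?", ("%dog%",)
).fetchall()
conn.close()

print("puppy:", puppy_hits)
print("dog:", dog_hits)
```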


# Enriching Context for LLMs with Vector Databases

Pre-trained models such as Llama 2 are powerful. However, they face challenges in handling specialized contexts because they are trained on broad, general datasets. These contextual limitations can be addressed in two main ways:


# Targeted Training

This involves retraining or fine-tuning the LLM on a dataset focused on the specific area of interest.


# Incorporating Context via Vector Databases

Alternatively, the LLM can be augmented by adding context directly into its prompts, using data from a vector database. In this setup, the vector database stores specialized information as vector embeddings, which can be retrieved and used by the LLM to enhance its responses. This approach allows for the inclusion of relevant, specialized knowledge without the need for extensive retraining. It's particularly useful for organizations or individuals lacking the resources for targeted training, as it leverages existing model capabilities while providing focused contextual insights. This is called **Retrieval-Augmented Generation (RAG)**.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/031/445/938/150/573/original/aee0db54da449455.webp)
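
The "adding context directly into its prompts" step might look like the sketch below. The function `build_rag_prompt`, the prompt wording, and the sample passage are all hypothetical; in a real pipeline the passages would come from a vector database query, and the resulting string would be sent to the LLM.

```python
def build_rag_prompt(question, retrieved_contexts):
    """Combine retrieved passages and a user question into one prompt string."""
    context_block = "\n".join(f"- {passage}" for passage in retrieved_contexts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-in for passages returned by a vector database lookup.
retrieved = ["The support desk is open on weekdays from 9:00 to 17:00."]

prompt = build_rag_prompt("When is the support desk open?", retrieved)
print(prompt)
```

Keeping prompt construction in its own function makes it easy to change the template without touching the retrieval code.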

In [None]:
%%capture
!pip install transformers==4.36.2
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install peft==0.7.1
!pip install bitsandbytes==0.41.3
!pip install einops==0.7.0
!pip install sentence-transformers==2.5.1
!pip install chromadb==0.4.24

In [None]:
import os

os.environ['DATASET']='databricks/databricks-dolly-15k'
os.environ['MODEL_NAME']='tiiuae/falcon-7b-instruct'
os.environ['TASK_TYPE']='text-generation'
os.environ['SENTENCE_TRANSFORMER']='multi-qa-MiniLM-L6-cos-v1'

In [None]:
from datasets import load_dataset

dataset=load_dataset(os.getenv('DATASET'), split='train[:1000]')
print(dataset)

closed_qa_dataset=dataset.filter(lambda example: example['category']=='closed_qa')
print(closed_qa_dataset[0])

In [None]:
import chromadb
from sentence_transformers import SentenceTransformer

class VectorStore:
    def __init__(self, collection_name, embedding_model):
        self.embedding_model=embedding_model
        self.chroma_client=chromadb.Client()
        self.collection=self.chroma_client.create_collection(name=collection_name)
        
    # Populate the vector store with embeddings computed from a dataset.
    def populate_vectors(self, dataset):
        for i, item in enumerate(dataset):
            # Embed the instruction together with its context, but store only the context text.
            combined_text=f"{item['instruction']}, {item['context']}"
            embeddings=self.embedding_model.encode(combined_text).tolist()
            # Use the loop index to give every document a unique id.
            self.collection.add(embeddings=[embeddings], documents=[item['context']],ids=[f"id_{i}"])

    # Search the ChromaDB collection for the context most relevant to a query.
    def search_context(self, query, n_results=1):
        query_embeddings=self.embedding_model.encode(query).tolist()
        return self.collection.query(query_embeddings=query_embeddings, n_results=n_results)

    
encoder=SentenceTransformer(os.getenv('SENTENCE_TRANSFORMER'), device='cuda')
encoder.max_seq_length=200
vector_store=VectorStore("knowledge-base", encoder)

vector_store.populate_vectors(closed_qa_dataset)                             
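
`search_context` returns whatever `collection.query` returns. As a minimal sketch, assuming that result is a dictionary whose `documents` entry is a list of lists (one inner list per query), a hypothetical helper such as `first_context` could pull out the top match. The `sample_result` dictionary below is hand-written for illustration and is not real output from the store.

```python
def first_context(results):
    """Return the top-ranked document text from a query result, or '' if none."""
    documents = results.get("documents") or [[]]
    first_query_docs = documents[0] if documents else []
    return first_query_docs[0] if first_query_docs else ""


# Hand-written stand-in that mimics the assumed result shape.
sample_result = {
    "ids": [["id_0"]],
    "documents": [["Example context passage stored in the collection."]],
    "distances": [[0.42]],
}

print(first_context(sample_result))
print(repr(first_context({"documents": [[]]})))
```

In the notebook, the same helper could be applied to the value returned by `vector_store.search_context(...)`.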