source: https://medium.com/intel-tech/optimize-vector-databases-enhance-rag-driven-generative-ai-90c10416cb9c
## VectorDB: Similarity Search
Vector databases retrieve vectors by similarity search: stored vectors are ranked against a query vector using a metric such as Euclidean distance (smaller means more similar), or dot product or cosine similarity (larger means more similar).
https://www.pinecone.io/learn/vector-similarity/
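
To make the three metrics concrete, here is a minimal pure-Python sketch. The query and the three named vectors are made-up toy data, not taken from any dataset; real vector databases compute the same quantities with optimized native code.

```python
import math

def dot(a, b):
    # Dot product: larger means more similar (sensitive to vector length).
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Euclidean distance: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Cosine similarity: larger means more similar (ignores vector length).
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy data (hypothetical), chosen so that the metrics disagree on the ranking.
query = [1.0, 0.0]
vectors = {
    "same_direction": [2.0, 0.0],
    "nearby": [1.0, 0.5],
    "orthogonal": [0.0, 3.0],
}

by_euclidean = sorted(vectors, key=lambda k: euclidean_distance(query, vectors[k]))
by_cosine = sorted(vectors, key=lambda k: cosine_similarity(query, vectors[k]), reverse=True)
by_dot = sorted(vectors, key=lambda k: dot(query, vectors[k]), reverse=True)

print("euclidean:", by_euclidean)
print("cosine:   ", by_cosine)
print("dot:      ", by_dot)
```

Euclidean distance ranks `nearby` first, while cosine similarity and dot product rank `same_direction` first, so the choice of metric should match how the embedding model was trained.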

## Indexing mechanism
To accelerate retrieval, the vector data is organized using an indexing mechanism such as:
- Inverted File
- Hierarchical Navigable Small Worlds (HNSW)
- Locality-Sensitive Hashing (LSH)

source: https://zilliz.com/learn/how-to-pick-a-vector-index-in-milvus-visual-guide
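
As an illustration of the first mechanism in the list, the following is a toy sketch of an inverted-file index. The centroids, vectors, and the `nprobe` parameter name are assumptions made for this example; a real index would learn its centroids (for instance with k-means) and is implemented very differently for performance.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical, hand-picked centroids; a real index would learn them from the data.
centroids = [[0.0, 0.0], [10.0, 10.0]]

# Hypothetical stored vectors, keyed by id.
vectors = {
    "a": [0.5, 0.2],
    "b": [1.0, 1.5],
    "c": [9.0, 9.5],
    "d": [10.5, 10.0],
}

def nearest_centroid(v):
    return min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))

# Build step: each vector id goes into the list of its nearest centroid.
inverted_lists = {i: [] for i in range(len(centroids))}
for vid, vec in vectors.items():
    inverted_lists[nearest_centroid(vec)].append(vid)

def search(query, top_k=1, nprobe=1):
    # Query step: scan only the nprobe closest lists instead of every vector.
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probe for vid in inverted_lists[i]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:top_k]

result = search([9.2, 9.4], top_k=2)
print(inverted_lists)
print(result)
```

The speed-up comes from skipping whole lists; the trade-off is that a true nearest neighbor sitting in an unprobed list is missed, which is why such indexes are called approximate.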

In [None]:
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

documents = SimpleDirectoryReader("assets/data").load_data()

# bge-base embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# ollama
Settings.llm = Ollama(model="llama3", request_timeout=360.0)

index = VectorStoreIndex.from_documents(
    documents,
)
query_engine = index.as_query_engine()
response = query_engine.query("Which job has the highest salary in 2024?")

print(response)

Based on the available data, it is difficult to determine which job has the highest salary in 2024. The provided information only shows compensation by roles for one specific company (NodeFlair) and does not provide a comprehensive view of salaries across various jobs or industries.


More LlamaIndex integrations for local models are described here: https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local/


# Basic Data Ingestion
source: https://docs.llamaindex.ai/en/stable/understanding/loading/loading/

In [None]:
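# high level transformation API
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# default: from_documents() chunks, embeds, and indexes in one call
# (`documents` was loaded in the first cell)
vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine()

# customize chunking with an explicit text splitter
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

# option 1: global, applies to every index built afterwards
Settings.text_splitter = text_splitter

# option 2: per-index, applies only to this index
index = VectorStoreIndex.from_documents(
    documents, transformations=[text_splitter]
)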
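
To show what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window sketch. It is not how `SentenceSplitter` is implemented: the helper `chunk_tokens` is hypothetical, and it splits on whitespace, whereas the real splitter counts tokens with a tokenizer and tries to respect sentence boundaries.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace "tokens" stand in for real tokenizer output.
tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)

for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears with some context in both chunks.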

In [None]:
# low level transformation API
from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter

documents = SimpleDirectoryReader("./data").load_data()

# add further transformations (e.g. an embedding model) to this list as needed
pipeline = IngestionPipeline(transformations=[TokenTextSplitter()])

nodes = pipeline.run(documents=documents)

# Adding metadata to your document to aid in searching
document = Document(
    text="text",
    metadata={"filename": "<doc_file_name>", "category": "<category>"},
)
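
One way metadata can aid searching is by narrowing the candidate set before any similarity ranking takes place. The sketch below uses plain dictionaries rather than LlamaIndex objects; the records and the helper `filter_by_metadata` are hypothetical and only illustrate the idea.

```python
# Hypothetical records shaped like the Document above: text plus a metadata dict.
records = [
    {"text": "Salary bands for 2024", "metadata": {"filename": "salaries.pdf", "category": "finance"}},
    {"text": "Onboarding checklist", "metadata": {"filename": "onboarding.md", "category": "hr"}},
    {"text": "Quarterly revenue summary", "metadata": {"filename": "revenue.pdf", "category": "finance"}},
]

def filter_by_metadata(records, **required):
    """Keep only records whose metadata matches every required key/value pair."""
    return [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in required.items())
    ]

finance_records = filter_by_metadata(records, category="finance")
finance_files = [r["metadata"]["filename"] for r in finance_records]
print(finance_files)
```

Only the records that survive the filter would then need to be scored against the query embedding.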

## Storing in a Vector DB (Chroma)

In [None]:
import chromadb
from llama_index.core import (SimpleDirectoryReader, StorageContext,
                              VectorStoreIndex)
from llama_index.vector_stores.chroma import ChromaVectorStore

# load some documents
documents = SimpleDirectoryReader("./data").load_data()

# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db")

# create collection
chroma_collection = db.get_or_create_collection("quickstart")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# create a query engine and query
query_engine = index.as_query_engine()
response = query_engine.query("What is the meaning of life?")
print(response)

## Storing in a Vector DB (Milvus)

In [None]:
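# This cell uses LangChain (not LlamaIndex) to write to Milvus.
# Assumptions: the langchain-milvus and langchain-openai packages are installed,
# and URI, TOKEN, COLLECTION_NAME, and all_splits (the chunked documents)
# are defined earlier.
from langchain_milvus import Milvus
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
connection_args = {"uri": URI, "token": TOKEN}

# from_documents is a classmethod, so no separate Milvus(...) instance is needed:
# it creates the collection, embeds the splits, and inserts them
vector_store = Milvus.from_documents(
    all_splits,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args=connection_args,
    drop_old=True,  # drop any existing collection with the same name
)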

# Think and Try it Out
1. Can you mix different embeddings in a single VectorStoreIndex instance? If you want to support multiple embeddings in a single vector store, how would you do that?
2. When configuring your vector store, think about the embeddings and how you would store them, i.e. vector size and embedding types.
3. Are you able to improve your results by customizing your embedding strategy?
4. Given a dataset: https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/data
    - Can you set up a vector index that ingests it and makes it accessible through an LLM?