
# Introduction to LlamaIndex: High-Level Concepts

Welcome to this workshop on LlamaIndex! Before we dive into the code, let's explore some key concepts that will help you understand how LlamaIndex works and what it can do for your LLM-powered applications.

## What is LlamaIndex?

LlamaIndex is a data framework designed to help developers build applications with large language models (LLMs) that can interact with external data sources. It provides tools for data ingestion, structuring, indexing, and querying, bridging the gap between LLMs and custom datasets.

## Key Use Cases

LlamaIndex supports four main categories of use cases:

1. **Structured Data Extraction**: Extract precise data structures from unstructured sources like PDFs and websites.
2. **Query Engines**: Ask questions over your data and get responses with referenced context.
3. **Chat Engines**: Have multi-turn conversations with your data.
4. **Agents**: Create LLM-powered decision-makers that can complete complex tasks using a set of tools.

## Retrieval Augmented Generation (RAG)

RAG is a fundamental concept in LLM workflows. It allows you to augment an LLM's knowledge with your own data. The RAG process typically involves:

1. Loading your data
2. Indexing the data (often using vector embeddings)
3. Storing the indexed data
4. Querying the data
5. Evaluating the results
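
The five stages above can be sketched end to end without any libraries. This is only a toy illustration, not how LlamaIndex implements them: the "embedding" is a plain word count, the documents are hypothetical in-memory strings, and the helper names (`embed`, `cosine`, `query`) are made up for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts, standing in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. Load: hypothetical in-memory documents instead of files on disk.
documents = [
    "The duck is white and swims on the pond",
    "Quality control checklists follow every fabrication step",
    "Embeddings map text to numeric vectors",
]

# 2. Index and 3. Store: embed each document once and keep the vectors around.
store = [(doc, embed(doc)) for doc in documents]


# 4. Query: embed the question and rank the stored documents by similarity.
def query(question: str, top_k: int = 1) -> list[str]:
    q_vec = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


# 5. Evaluate: check that a question with a known answer retrieves the right document.
retrieved = query("what color is the duck")
print(retrieved[0])
```

In a real pipeline the retrieved text would then be handed to an LLM together with the question; here the sketch stops at retrieval.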

## Important Concepts

As you work with LlamaIndex, you'll encounter these key concepts:

- **Nodes and Documents**: Containers for your data at different levels of granularity.
- **Connectors**: Tools to ingest data from various sources.
    - You can even use LLMs themselves as a way to read and determine whether data gets ingested, summarized and ingested, or ignored.
    - For this notebook we're using SimpleDirectoryReader, but there are others like DatabaseReader, which runs a query against a SQL database and returns every row of the results as a Document. There are hundreds of connectors to use on LlamaHub. ![Llamahub](llamahub.png)
- **Indexes**: Structures to organize your data for efficient retrieval.
    - **Vector Store Index** (the most common, though there are others):
        - Splits documents into Nodes
        - Creates vector embeddings of each Node's text
        - Enables semantic search
        - Uses "top-k semantic retrieval" for querying

      We're using Faiss as the vector store, but there are many others, such as ChromaDB and Pinecone. The choice depends on factors like cost, read/write speed, accuracy, and scalability.
      
    - **Knowledge Graph Index**

      A Knowledge Graph Index in LlamaIndex represents data as a network of interconnected entities and **relationships**. This index type is particularly useful for:
        - Modeling complex, interconnected data structures
        - Capturing relationships between different pieces of information
        - Enabling more contextual and relationship-based querying

 

- **Embeddings**: Numerical representations of your data's meaning.
    - **Vector embeddings**:
        - Numerical representations of text semantics
        - Allow finding related text based on meaning, not just keywords
        - The default embedding model in LlamaIndex is OpenAI's `text-embedding-ada-002`. **We're using `BAAI/bge-small-en-v1.5` for its high benchmark scores, but the default works just fine.**
        - Each embedding model has a dimension: the number of floating-point values in the vector that stores the meaning. It is important to make sure that your embedding model's dimension matches the dimension your vector store is configured for.
    - **Embedding process**:
        - Can be time-consuming and potentially expensive for large amounts of text
        - It's recommended to store embeddings for future use (in a vector store). **It's possible to set up your RAG pipeline the wrong way**, so that it re-embeds your entire dataset every time you ask a question. That can cost anywhere from seconds to hours of lost time.
- **Retrievers**: Methods to fetch relevant context from your index.
    - Usually, your prompt to an LLM gets embedded and a similarity search is performed between your query and the nodes in the datastore. There are many other retrieval strategies.
    - Chunk size: how much text is in each Node.
    - Chunk overlap: how much text from the previous and next chunks is repeated in the current chunk.
    - `top_k`: the number of chunks returned, usually ranked by relevance (semantic similarity).
- **Routers**: Tools to select the appropriate retriever for a given query.
- **Node Postprocessors**: Functions to transform, filter, or re-rank retrieved nodes.
    - Postprocessing is when the Nodes retrieved are optionally reranked, transformed, or filtered, for instance by requiring that they have specific metadata such as keywords attached.
 
- **Response Synthesizers**: Components that generate LLM responses using retrieved context.
    - Response synthesis is when your query, your most-relevant data and your prompt are combined and sent to your LLM to return a response.
    - Users may expect that the final response contains some degree of structure (e.g. a JSON output, a formatted SQL query, etc.)
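
To make chunk size and chunk overlap concrete, here is a minimal sketch of an overlapping splitter. It counts whitespace-separated words for simplicity, which is an assumption of this example; real splitters typically count tokens and try to respect sentence boundaries. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share chunk_overlap words, so context that straddles
    a chunk boundary is still visible from both sides.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text(
    "one two three four five six seven eight nine ten",
    chunk_size=4,
    chunk_overlap=1,
)
for chunk in chunks:
    print(chunk)
```

Smaller chunks give more precise matches but less surrounding context per Node; larger overlap reduces the chance of cutting an idea in half at the cost of storing more duplicated text.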
      



In [1]:
%pip install llama-index-llms-ollama
%pip install llama-index-embeddings-fastembed
%pip install fastembed
%pip install llama-index-core llama-index-readers-file
%pip install llama-index-vector-stores-faiss
# Install only one Faiss build: faiss-cpu here, or faiss-gpu on a CUDA machine
%pip install faiss-cpu
%pip install llama-index

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.

### Run this in a terminal tab

1. `curl -fsSL https://ollama.com/install.sh | sh`


In [1]:
from llama_index.llms.ollama import Ollama
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings, load_index_from_storage, StorageContext, PromptTemplate
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore
import faiss
from IPython.display import Markdown, display

[nltk_data] Downloading package punkt_tab to /home/jupyter-
[nltk_data]     exouser/.local/lib/python3.10/site-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Package punkt_tab is already up-to-date!


In [6]:
# Embedding dimension of BAAI/bge-small-en-v1.5
d = 384  # taken from the model card on Hugging Face
faiss_index = faiss.IndexFlatL2(d)
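
To show why the dimension `d` has to match the embedding model, here is a pure-Python sketch of what an exact, brute-force L2 index does conceptually. The class `ToyFlatL2Index` is hypothetical and is not the Faiss API; it only mirrors the idea of a fixed dimension plus a nearest-neighbour search by squared Euclidean distance.

```python
class ToyFlatL2Index:
    """Toy stand-in for an exact L2 index: stores vectors, searches by brute force."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors: list[list[float]] = []

    def _check(self, vector: list[float]) -> None:
        # Every vector must have exactly the dimension the index was created with.
        if len(vector) != self.dim:
            raise ValueError(f"expected dimension {self.dim}, got {len(vector)}")

    def add(self, vector: list[float]) -> None:
        self._check(vector)
        self.vectors.append(list(vector))

    def search(self, query: list[float], k: int) -> list[tuple[float, int]]:
        """Return the k closest vectors as (squared L2 distance, position) pairs."""
        self._check(query)
        distances = [
            (sum((q - v) ** 2 for q, v in zip(query, vec)), position)
            for position, vec in enumerate(self.vectors)
        ]
        distances.sort()
        return distances[:k]


toy_index = ToyFlatL2Index(dim=3)
toy_index.add([0.0, 0.0, 0.0])
toy_index.add([1.0, 0.0, 0.0])
toy_index.add([0.0, 2.0, 0.0])
nearest = toy_index.search([1.0, 0.0, 0.0], k=2)
print(nearest)
```

If the embedding model produced vectors of a different length than the index expects, `add` would fail immediately, which is the mismatch the note on compatible dimensions warns about.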

In [7]:
#Choose which LLM you'd like to use for response synthesis
llm = Ollama(model='llama3.1', request_timeout=30.0)

In [8]:
Settings.llm = llm

In [9]:
%%time
documents = SimpleDirectoryReader("test_data").load_data()
for i in range(1000000):
    pass

CPU times: user 1.25 s, sys: 11.3 ms, total: 1.26 s
Wall time: 1.26 s


In [10]:
# Set embedding model
Settings.embed_model = FastEmbedEmbedding(
    model_name = "BAAI/bge-small-en-v1.5"
)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

In [11]:
%%time
vector_store = FaissVectorStore(faiss_index=faiss_index)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
for i in range(1000000):
    pass

CPU times: user 38.5 s, sys: 103 ms, total: 38.6 s
Wall time: 5.29 s


In [12]:
%%time
# save index to disk ("./storage" is also the default persist_dir)
index.storage_context.persist(persist_dir="./storage")
for i in range(1000000):
    pass

CPU times: user 30.1 ms, sys: 34 μs, total: 30.1 ms
Wall time: 29.3 ms


In [16]:
%%time
# load index from disk
vector_store = FaissVectorStore.from_persist_dir("./storage")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir="./storage"
)
index = load_index_from_storage(storage_context=storage_context)
for i in range(1000000):
    pass

CPU times: user 31.3 ms, sys: 0 ns, total: 31.3 ms
Wall time: 30.4 ms
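
The save and load cells above are what keep the dataset from being re-embedded on every run. The pattern behind them, "load from disk if present, otherwise build and save", can be sketched with the standard library alone. Everything here is illustrative: `embed` is a fake embedding function, and the JSON cache file is a stand-in for a real vector store.

```python
import json
import tempfile
from pathlib import Path

embed_calls = 0  # counts how often the (expensive) embedding step runs


def embed(text: str) -> list[float]:
    """Fake embedding function; a real one would call an embedding model."""
    global embed_calls
    embed_calls += 1
    return [float(len(text)), float(text.count(" "))]


def load_or_build(texts: list[str], cache_path: Path) -> dict[str, list[float]]:
    """Reuse stored embeddings when they exist; otherwise embed and persist them."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    vectors = {text: embed(text) for text in texts}
    cache_path.write_text(json.dumps(vectors))
    return vectors


texts = ["white duck", "quality control checklist"]
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "embeddings.json"
    first = load_or_build(texts, cache_file)   # embeds both texts and saves them
    second = load_or_build(texts, cache_file)  # loads from disk, no embedding

print(embed_calls)
```

The second call never touches `embed`, which mirrors the difference between the build cell (seconds) and the load cell (milliseconds) timed above.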


In [17]:
%%time
query_engine = index.as_query_engine(streaming=False, similarity_top_k=3)
response = query_engine.query("what color is the waterfowl, and what does it not symbolize??")
for i in range(1000000):
    pass

CPU times: user 325 ms, sys: 4.26 ms, total: 329 ms
Wall time: 688 ms
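
The `similarity_top_k=3` argument above controls how many chunks the retriever hands to the LLM. As a rough sketch of top-k semantic retrieval, assuming cosine similarity as the scoring function and using tiny hand-written 3-dimensional vectors in place of real embeddings (the names `cosine_similarity` and `top_k` are hypothetical):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; both vectors must have the same dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], nodes: dict[str, list[float]], k: int) -> list[tuple[float, str]]:
    """Score every node against the query and keep the k most similar ones."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in nodes.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical node texts with made-up embedding vectors.
nodes = {
    "ducks and geese": [0.9, 0.1, 0.0],
    "quality control": [0.0, 0.2, 0.9],
    "pond wildlife": [0.7, 0.3, 0.1],
}
results = top_k([1.0, 0.0, 0.0], nodes, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A larger `k` gives the LLM more context to work with but also more irrelevant text and a longer prompt.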


In [20]:
response = query_engine.query("what is the banana?")
for i in range(1000000):
    pass

In [21]:
display(Markdown(f"{response}"))

There is no mention of a banana in the provided context.
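
An answer like the one above is typical of response synthesis: the LLM sees only the retrieved chunks plus an instruction to stay within them. Here is a minimal sketch of how a query and its retrieved chunks might be combined into one prompt. The template wording and the `build_prompt` helper are assumptions for illustration, not the prompt LlamaIndex actually sends.

```python
# Hypothetical question-answering template; the exact wording is an assumption.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Using only the context above, answer the question. "
    "If the context does not contain the answer, say so.\n"
    "Question: {question}\n"
    "Answer: "
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Join the retrieved chunks and place them in the template with the question."""
    context = "\n\n".join(chunks)
    return QA_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "what is the banana?",
    ["The duck is white.", "QC checklists are completed at each stage."],
)
print(prompt)
```

Because none of the chunks mention a banana, a model that follows the instruction has nothing to answer from and should say so.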

In [13]:
%%time
query_engine = index.as_query_engine(streaming=False, similarity_top_k=4)
response = query_engine.query("When do I do the QA steps? after which steps? is it every single one?")
for i in range(1000000):
    pass
display(Markdown(f"{response}"))

Quality Control (QC) checklists must be completed at each stage of the process. Any deviations from the ADD or MSS must be documented and approved by the Galactic Curator.

In other words, you should perform a QC checklist step after Step 3.5 in the Quantum Fabrication Process section, which is to conduct a Quality Control (QC) check upon completion, assessing the artifact against the ADD and MSS for defects, accuracy, and overall quality.

However, this does not mean that you should only do a QC checklist at this point. Based on the context provided, it seems that QC checks are an integral part of each step in the process, where "any deviations from the ADD or MSS must be documented and approved by the Galactic Curator".

Therefore, it would be best to perform a QC check after every step, or at least after every significant milestone in the process, to ensure that everything is going as planned and to catch any potential issues early on.

CPU times: user 331 ms, sys: 3.27 ms, total: 334 ms
Wall time: 4.31 s


If the model is not answering the question correctly, we need to start optimizing our data or the processes used to index and retrieve it.
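
One simple way to tell whether retrieval is the weak link is to measure a hit rate: for a handful of questions whose source document you already know, check how often that document appears among the retrieved results. The sketch below assumes hypothetical document IDs and hand-written retrieval results; `hit_rate` is a made-up helper, not a LlamaIndex function.

```python
def hit_rate(retrieved: dict[str, list[str]], expected: dict[str, str]) -> float:
    """Fraction of questions whose expected document appears in the retrieved list."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for question, doc_id in expected.items()
        if doc_id in retrieved.get(question, [])
    )
    return hits / len(expected)


# Hypothetical ground truth: which document should answer each question.
expected = {
    "what color is the waterfowl?": "doc_duck",
    "when do I run the QC steps?": "doc_qc",
}
# Hypothetical retriever output (document IDs, best match first).
retrieved = {
    "what color is the waterfowl?": ["doc_duck", "doc_pond"],
    "when do I run the QC steps?": ["doc_pond", "doc_intro"],
}
score = hit_rate(retrieved, expected)
print(score)
```

A low hit rate points at chunking, embeddings, or `top_k`; a high hit rate with poor answers points at the prompt or the LLM instead.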

We're using a VM for this workshop, but this entire workflow can also run on a MacBook. It will be much slower there (think minutes to hours, not seconds), but with a GPU this process can be an extremely capable offline alternative to similar services from OpenAI, Google, and Amazon.

For more tutorials and user-friendly documentation please visit https://docs.llamaindex.ai/en/stable/
