## Components in LlamaIndex

### Create a Query engine for RAG

#### Setting up the persona database 
We will use personas from the [dvilasuero/finepersonas-v0.1-tiny dataset](https://huggingface.co/datasets/dvilasuero/finepersonas-v0.1-tiny). This dataset contains 5K personas that will be attending the party!

Let's load the dataset and store each persona as a text file in the `data` directory.


In [7]:
!pip install llama-index datasets llama-index-callbacks-arize-phoenix llama-index-vector-stores-chroma llama-index-llms-huggingface-api llama-index-embeddings-huggingface -U -q


In [11]:
from datasets import load_dataset
from pathlib import Path

dataset = load_dataset(path="dvilasuero/finepersonas-v0.1-tiny", split="train")

Path("data").mkdir(parents=True, exist_ok=True)
for i, persona in enumerate(dataset):
    with open(Path("data") / f"persona_{i}.txt", "w", encoding="utf-8") as f:
        f.write(persona["persona"])

Using the latest cached version of the dataset since dvilasuero/finepersonas-v0.1-tiny couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at C:\Users\loicsteve.fohoue\.cache\huggingface\datasets\dvilasuero___finepersonas-v0.1-tiny\default\0.0.0\877c402c4434d631b5055853bc50ba93fbdf9c12 (last modified on Tue Apr 15 11:51:38 2025).
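
The write step above can be checked without network access. The following is a minimal sketch, not part of the original notebook: `write_personas` is a hypothetical helper, and the two sample rows are made-up stand-ins rather than real dataset entries. It assumes only that each row is a dict with a `"persona"` key, as in the cell above.

```python
import tempfile
from pathlib import Path


def write_personas(rows, out_dir):
    """Write each row's "persona" text to out_dir/persona_<i>.txt; return the file count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for i, row in enumerate(rows):
        (out_dir / f"persona_{i}.txt").write_text(row["persona"], encoding="utf-8")
        count += 1
    return count


# Made-up stand-in rows (not taken from the dataset), so the sketch runs offline.
sample_rows = [
    {"persona": "A placeholder persona who studies marine biology."},
    {"persona": "A placeholder persona who teaches high-school history."},
]

with tempfile.TemporaryDirectory() as tmp:
    written = write_personas(sample_rows, Path(tmp) / "data")
    files = sorted((Path(tmp) / "data").glob("persona_*.txt"))
    first_text = files[0].read_text(encoding="utf-8")

print(written, len(files))
```

With the real dataset, the same comparison (number of rows against number of `persona_*.txt` files) is a quick way to confirm that every persona reached disk.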


#### Loading and embedding persona documents

We will use the `SimpleDirectoryReader` to load the persona descriptions from the `data` directory. This returns a list of `Document` objects.

In [14]:
%pip install llama-index


In [15]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="data")
documents = reader.load_data()
len(documents)

5000
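
To picture what the loaded list contains, here is a simplified stdlib-only sketch of a directory loader. It is an illustration under stated assumptions, not how `SimpleDirectoryReader` is implemented: `LoadedDoc` and `load_text_dir` are hypothetical names, and the real `Document` class carries more metadata than the single `file_name` field shown here. The point is only that each text file becomes one document, which is consistent with 5,000 persona files yielding 5,000 documents.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class LoadedDoc:
    """Hypothetical stand-in for a LlamaIndex Document: text plus metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_text_dir(input_dir):
    """Read every .txt file in input_dir into a LoadedDoc, in sorted filename order."""
    docs = []
    for path in sorted(Path(input_dir).glob("*.txt")):
        docs.append(
            LoadedDoc(
                text=path.read_text(encoding="utf-8"),
                metadata={"file_name": path.name},
            )
        )
    return docs


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for the persona_<i>.txt files written earlier.
    for i, text in enumerate(["first placeholder persona", "second placeholder persona"]):
        (Path(tmp) / f"persona_{i}.txt").write_text(text, encoding="utf-8")
    docs = load_text_dir(tmp)

print(len(docs), docs[0].metadata["file_name"])
```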

Now that we have a list of `Document` objects, we can use the `IngestionPipeline` to create nodes from the documents and prepare them for the `QueryEngine`. We will use the `SentenceSplitter` to split the documents into smaller chunks and `HuggingFaceEmbedding` to embed the chunks locally with the `BAAI/bge-small-en-v1.5` model.

In [29]:
from llama_index.core import Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

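# Create the pipeline: each transformation runs in order on the documents
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_overlap=0),  # split into chunks that share no text
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),  # embed each chunk
    ]
)

# Smoke test on the built-in example document; pass `documents` to process the personas
nodes = await pipeline.arun(documents=[Document.example()])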

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
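
To make the `chunk_overlap=0` setting above concrete, here is a small stdlib-only sketch of fixed-size chunking with and without overlap. It is a deliberately simplified illustration: `chunk_words` is a hypothetical helper that counts whitespace-separated words, whereas `SentenceSplitter` works on tokens and tries to respect sentence boundaries. Only the effect of the overlap parameter carries over.

```python
def chunk_words(text, chunk_size, chunk_overlap=0):
    """Split text into windows of chunk_size words; neighbours share chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = "one two three four five six seven eight"
no_overlap = chunk_words(text, chunk_size=4, chunk_overlap=0)
with_overlap = chunk_words(text, chunk_size=4, chunk_overlap=2)

print(no_overlap)
print(with_overlap)
```

With no overlap, every word lands in exactly one chunk, so nothing is embedded twice. With overlap, neighbouring chunks repeat some text, which costs extra embeddings but keeps context that would otherwise be cut at a chunk boundary.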
