## Components in LlamaIndex

### Create a Query engine for RAG

#### Setting up the persona database
We will use personas from the dvilasuero/finepersonas-v0.1-tiny dataset (https://huggingface.co/datasets/dvilasuero/finepersonas-v0.1-tiny). This dataset contains 5K personas that will be attending the party!

Let's load the dataset and store each persona as a separate text file in the data directory.


In [1]:
!pip install llama-index datasets llama-index-callbacks-arize-phoenix llama-index-vector-stores-chroma llama-index-llms-huggingface-api llama-index-embeddings-huggingface -U -q

  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      + c:\Users\loicsteve.fohoue\OneDrive - Virgo Facilities\Bureau\LlamaIndexAgents\.venv\Scripts\python.exe C:\Users\loicsteve.fohoue\AppData\Local\Temp\pip-install-pfi26jst\numpy_299d94914f944170b83a7d3460eef506\vendored-meson\meson\meson.py setup C:\Users\loicsteve.fohoue\AppData\Local\Temp\pip-install-pfi26jst\numpy_299d94914f944170b83a7d3460eef506 C:\Users\loicsteve.fohoue\AppData\Local\Temp\pip-install-pfi26jst\numpy_299d94914f944170b83a7d3460eef506\.mesonpy-b6a8s2py -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md --native-file=C:\Users\loicsteve.fohoue\AppData\Local\Temp\pip-install-pfi26jst\numpy_299d94914f944170b83a7d3460eef506\.mesonpy-b6a8s2py\meson-python-native-file.ini
      The Meson build system
      Version: 1.2.99
      Source dir: C:\Users\loicsteve.fohoue\AppData\Local\Temp\pip-install-pfi26jst\numpy_299d949

In [2]:
from datasets import load_dataset
from pathlib import Path

dataset = load_dataset(path="dvilasuero/finepersonas-v0.1-tiny", split="train")

Path("data").mkdir(parents=True, exist_ok=True)
for i, persona in enumerate(dataset):
    with open(Path("data") / f"persona_{i}.txt", "w", encoding="utf-8") as f:
        f.write(persona["persona"])

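To check that the export behaves as expected without downloading anything, the same write loop can be tried on a couple of made-up rows. This is only a sketch: the two personas below are hypothetical stand-ins for dataset rows, and a temporary directory is used instead of the real data directory.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in rows; the real dataset exposes a "persona" column.
rows = [
    {"persona": "A local historian interested in 19th-century architecture."},
    {"persona": "A pastry chef who experiments with regional recipes."},
]

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Same naming scheme as the cell above: one file per persona.
    for i, row in enumerate(rows):
        (data_dir / f"persona_{i}.txt").write_text(row["persona"], encoding="utf-8")

    # Read the files back to confirm the count and the content.
    written = sorted(data_dir.glob("persona_*.txt"))
    first_text = written[0].read_text(encoding="utf-8")

print(len(written), "files written")
print(first_text)
```

Running the same glob against the real data directory should report one file per persona in the dataset.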


#### Loading and embedding persona documents

We will use the SimpleDirectoryReader to load the persona descriptions from the data directory. This will return a list of Document objects.

In [3]:
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="data")
documents = reader.load_data()
len(documents)

5000
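As a rough mental model, each loaded document pairs the text of one file with some metadata about where it came from. The sketch below imitates that with the standard library only. `SimpleDoc` and `load_directory` are hypothetical names made up for illustration, not LlamaIndex classes or functions, and the real reader handles many more file types and metadata fields.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document: text plus file metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def load_directory(input_dir: Path) -> list[SimpleDoc]:
    """Create one SimpleDoc per .txt file, in sorted filename order."""
    docs = []
    for path in sorted(input_dir.glob("*.txt")):
        docs.append(
            SimpleDoc(
                text=path.read_text(encoding="utf-8"),
                metadata={"file_name": path.name},
            )
        )
    return docs


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "persona_0.txt").write_text("A local historian.", encoding="utf-8")
    (data_dir / "persona_1.txt").write_text("A pastry chef.", encoding="utf-8")
    docs = load_directory(data_dir)

print(len(docs), docs[0].metadata)
```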

Now that we have a list of Document objects, we can use the IngestionPipeline to create nodes from the documents and prepare them for the QueryEngine. We will use the SentenceSplitter to split the documents into smaller chunks and the HuggingFaceEmbedding model to embed the chunks.
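To make the chunking step more concrete, here is a toy splitter that greedily packs whole sentences into chunks with no overlap. It is an illustration of the idea only: the function name and the word-based `chunk_size` are assumptions made for this sketch, and the real SentenceSplitter counts tokens and handles many more edge cases.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 12) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size words.

    A single sentence longer than chunk_size is kept whole in its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Alfred is planning a party. He needs to know the guests. "
    "Each persona is a short description. Short texts fit in one chunk."
)
chunks = split_into_chunks(text, chunk_size=12)
for chunk in chunks:
    print(repr(chunk))
```

Because the persona descriptions are short, most of them will likely end up as a single chunk each.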

In [4]:
from llama_index.core import Document
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

# create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_overlap=0),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ]
)

nodes = await pipeline.arun(documents=[Document.example()])
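Once chunks are embedded, retrieval comes down to comparing vectors. The following sketch ranks two chunks against a query with cosine similarity. The 3-dimensional vectors are hand-made for illustration (BAAI/bge-small-en-v1.5 produces 384-dimensional embeddings), and the chunk names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy embeddings, not real model output.
query = [1.0, 0.0, 0.0]
chunk_vectors = {
    "historian": [0.9, 0.1, 0.0],
    "chef": [0.0, 1.0, 0.0],
}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```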

#### Storing and indexing documents

Since we are using an ingestion pipeline, we can attach a vector store directly to the pipeline to populate it. In this case, we will use Chroma to store our documents. Let's run the pipeline again with the vector store attached. The IngestionPipeline caches the output of each transformation, so repeated runs on the same input should be fast!
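The caching idea can be sketched in a few lines: hash the input together with the name of the transformation, and reuse the stored result when the same pair shows up again. This is a simplified illustration under that assumption, not LlamaIndex's actual cache implementation; `expensive_transform` and `run_cached` are hypothetical helpers.

```python
import hashlib

cache: dict[str, str] = {}
calls = 0


def expensive_transform(text: str) -> str:
    """Stand-in for a costly step such as embedding; counts how often it runs."""
    global calls
    calls += 1
    return text.upper()


def run_cached(text: str, transform) -> str:
    """Run transform on text unless this (transform, text) pair is already cached."""
    key = hashlib.sha256(f"{transform.__name__}:{text}".encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = transform(text)
    return cache[key]


first = run_cached("a curious historian", expensive_transform)
second = run_cached("a curious historian", expensive_transform)
print(first, second, calls)
```

The second call returns the stored result without running the transformation again, which is why re-running an unchanged input is cheap.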

In [9]:
!pip install chromadb

Collecting chromadb
  Using cached chromadb-1.0.4-cp39-abi3-win_amd64.whl.metadata (7.0 kB)
Collecting build>=1.0.3 (from chromadb)
  Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Using cached chroma_hnswlib-0.7.6.tar.gz (32 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting fastapi==0.115.9 (from chromadb)
  Using cached fastapi-0.115.9-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Using cached uvicorn-0.34.1-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Using cached posthog-3.24.1-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting onnx

  error: subprocess-exited-with-error
  
  × Building wheel for chroma-hnswlib (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [5 lines of output]
      running bdist_wheel
      running build
      running build_ext
      building 'hnswlib' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for chroma-hnswlib
ERROR: Failed to build installable wheels for some pyproject.toml based projects (chroma-hnswlib)


In [10]:
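import chromadb

from llama_index.vector_stores.chroma import ChromaVectorStore

# persist the Chroma collection on disk so it survives kernel restarts
db = chromadb.PersistentClient(path="./alfred_chroma_db")
chroma_collection = db.get_or_create_collection(name="alfred")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# same embedding model as before, now writing the embedded nodes to Chroma
# (IngestionPipeline, SentenceSplitter and HuggingFaceEmbedding were imported in an earlier cell)
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ],
    vector_store=vector_store,
)

# ingest only the first 10 documents to keep the run short
nodes = await pipeline.arun(documents=documents[:10])
len(nodes)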

ModuleNotFoundError: No module named 'chromadb'
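The import fails here because the earlier chromadb installation did not complete. One way to surface that earlier is a small guard that checks whether a module can be found before the cell tries to import it. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("chromadb"):
    print("chromadb is not installed; fix the installation error above before continuing.")
```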