# Pneuma: Quick Start (Colab)

In this notebook, we show how to use each of Pneuma's features, from registering a dataset to querying the index.

## Offline Stage

In the offline stage, we set up Pneuma: initializing the database, registering the dataset and its metadata, generating summaries, and building both the vector and keyword indexes.

To use Pneuma, we import the Pneuma class from the pneuma module.
- CUBLAS_WORKSPACE_CONFIG is set (in this demo to `:4096:8`) to enforce more deterministic behavior in cuBLAS operations.
- CUDA_VISIBLE_DEVICES selects which GPU to use.
- The out_path determines where the dataset and indexes are stored. If not set, it defaults to ~/.local/share/Pneuma/out on Linux or /Documents/Pneuma/out on Windows.

**NOTE**: You may need to restart the runtime after installing the requirements.

In [None]:
# Download sample data
!gdown "1NN_TxpgBlCjC_ZEBgOnBPMY0CxEX-_EL" -O "data_src.zip"
!unzip "data_src.zip"

In [None]:
# Install Pneuma
!pip install -i https://test.pypi.org/pypi/ --extra-index-url https://pypi.org/simple pneuma

In [1]:
import os

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import json
from pneuma import Pneuma  # the pip-installed package exposes the pneuma module



We load the OpenAI API key from the environment, initialize the pneuma object with out_path, and call the setup() function to initialize the database.

In [2]:
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [3]:
out_path = "out_demo/storage"
USE_OPEN_AI = True

if USE_OPEN_AI:
    pneuma = Pneuma(
        out_path=out_path,
        openai_api_key=OPENAI_API_KEY,
        use_local_model=False,
    )
else:
    pneuma = Pneuma(
        out_path=out_path,
        llm_path="Qwen/Qwen2.5-0.5B-Instruct",  # We use a smaller model to fit in Colab
        embed_path="BAAI/bge-base-en-v1.5",
        max_llm_batch_size=1,  # Limit exploration for limited Colab memory
    )
pneuma.setup()

2025-02-21 16:43:50,321 [Registrar] [INFO] HTTPFS installed and loaded.
2025-02-21 16:43:50,323 [Registrar] [INFO] Table status table created.
2025-02-21 16:43:50,325 [Registrar] [INFO] ID sequence created.
2025-02-21 16:43:50,484 [Registrar] [INFO] Table contexts table created.
2025-02-21 16:43:50,538 [Registrar] [INFO] Table summaries table created.
2025-02-21 16:43:50,539 [Registrar] [INFO] Indexes table created.
2025-02-21 16:43:50,561 [Registrar] [INFO] Index table mappings table created.
2025-02-21 16:43:50,571 [Registrar] [INFO] Query history table created.


'{"status": "SUCCESS", "message": "Database Initialized.", "data": null}'

* Note: For local LLMs, we limit the exploration of the dynamic batch size selector because it fills GPU memory quickly and that memory is not freed fast enough. This is a problem on systems with limited GPU memory, such as Colab with a T4 GPU.
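The two configurations above differ only in their keyword arguments, so one way to keep them apart is a small helper that builds the arguments and fails early when the API key is missing. This is a minimal sketch: `build_pneuma_kwargs` is a hypothetical helper, not part of Pneuma, and it only reuses the parameter names and values shown in the cell above.

```python
def build_pneuma_kwargs(out_path, use_openai, openai_api_key=None):
    """Return keyword arguments for Pneuma(...) for the chosen backend.

    Hypothetical helper: the parameter names mirror the demo cell above.
    """
    if use_openai:
        # os.getenv returns None when the variable is absent, so check early.
        if not openai_api_key:
            raise ValueError("OPENAI_API_KEY is not set; add it to .env or the environment.")
        return {
            "out_path": out_path,
            "openai_api_key": openai_api_key,
            "use_local_model": False,
        }
    return {
        "out_path": out_path,
        "llm_path": "Qwen/Qwen2.5-0.5B-Instruct",  # small model that fits in Colab
        "embed_path": "BAAI/bge-base-en-v1.5",
        "max_llm_batch_size": 1,  # limit batch size exploration on small GPUs
    }


kwargs = build_pneuma_kwargs("out_demo/storage", use_openai=False)
print(kwargs)
# pneuma = Pneuma(**kwargs)  # uncomment once Pneuma has been imported
```

With this shape, a missing key surfaces as a clear `ValueError` before any model or database is initialized.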

### Registration

For this demo, we use a dataset of three tables taken from Chicago Open Data with the following descriptions:

- **5cq6-qygt.csv**: Bus stops in shelters and at Chicago Transit Authority (CTA) rail stations that have digital signs showing upcoming arrivals.
- **5n77-2d6a.csv**: Survey results from 12th Ward residents on issues ranging from climate & sustainability to public safety.
- **28km-gtjn.csv**: Locations of fire stations in Chicago.

To register a dataset, we call the add_tables function while pointing to a directory and specifying the data creator.

In [4]:
data_path = "data_src/sample_data/csv"
response = pneuma.add_tables(path=data_path, creator="demo_user")
response = json.loads(response)
print(response)

2025-02-21 16:43:55,561 [Registrar] [INFO] Reading folder data_src/sample_data/csv...
2025-02-21 16:43:55,562 [Registrar] [INFO] Processing data_src/sample_data/csv/5cq6-qygt.csv...
2025-02-21 16:43:55,615 [Registrar] [INFO] Processing table data_src/sample_data/csv/5cq6-qygt.csv ERROR: This table already exists in the database with id ('data_src/sample_data/csv/5cq6-qygt.csv',).
2025-02-21 16:43:55,615 [Registrar] [INFO] Processing data_src/sample_data/csv/5n77-2d6a.csv...
2025-02-21 16:43:55,628 [Registrar] [INFO] Processing table data_src/sample_data/csv/5n77-2d6a.csv ERROR: This table already exists in the database with id ('data_src/sample_data/csv/5n77-2d6a.csv',).
2025-02-21 16:43:55,629 [Registrar] [INFO] Processing data_src/sample_data/csv/inner_folder...
2025-02-21 16:43:55,630 [Registrar] [INFO] Reading folder data_src/sample_data/csv/inner_folder...
2025-02-21 16:43:55,631 [Registrar] [INFO] Processing data_src/sample_data/csv/inner_folder/28km-gtjn.csv...
2025-02-21 16:43:

{'status': 'SUCCESS', 'message': '3 files in folder data_src/sample_data/csv has been processed.', 'data': {'file_count': 3, 'tables': [None, None, None]}}
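The log above shows that the registrar also descends into `inner_folder`. To preview which files a folder would contribute before registering it, a recursive listing is enough. The sketch below uses only the standard library and does not reproduce Pneuma's own traversal; `find_csv_files` is a hypothetical helper, and the temporary directory merely imitates the demo layout with empty files.

```python
import tempfile
from pathlib import Path


def find_csv_files(root):
    """Return every .csv file under root, nested folders included, sorted by path."""
    return sorted(p for p in Path(root).rglob("*.csv") if p.is_file())


# Recreate the demo folder layout with empty files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "csv"
    (root / "inner_folder").mkdir(parents=True)
    for name in ("5cq6-qygt.csv", "5n77-2d6a.csv", "inner_folder/28km-gtjn.csv"):
        (root / name).touch()
    found = [p.relative_to(root).as_posix() for p in find_csv_files(root)]

print(found)
```

In the notebook, the same helper could be pointed at `data_path` to confirm that three files are found before calling add_tables.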


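Next, we summarize the tables. On a first run, none of them has been summarized yet; the output shown below comes from a rerun in which the tables were already registered and summarized, so it reports no unsummarized tables. These summaries represent the tables during the discovery process.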

In [5]:
response = pneuma.summarize()
response = json.loads(response)
print(response)

2025-02-21 16:43:58,588 [Summarizer] [INFO] Generating summaries for all unsummarized tables...
2025-02-21 16:43:58,590 [Summarizer] [INFO] Found 0 unsummarized tables.


{'status': 'SUCCESS', 'message': 'No unsummarized tables found.\n', 'data': {'table_ids': []}}
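Every call so far returns a JSON string with `status`, `message`, and `data` fields, which the cells decode with json.loads. A thin wrapper can decode the string and stop on failure in one step. This is a sketch under one assumption: any status other than `SUCCESS` is treated as an error, which the outputs in this notebook do not confirm. `parse_response` and `PneumaResponseError` are hypothetical names.

```python
import json


class PneumaResponseError(RuntimeError):
    """Raised when a response reports a status other than SUCCESS (hypothetical)."""


def parse_response(raw):
    """Decode a JSON response string and return its 'data' field."""
    payload = json.loads(raw)
    # Assumption: only "SUCCESS" signals a successful call.
    if payload.get("status") != "SUCCESS":
        raise PneumaResponseError(payload.get("message", "unknown error"))
    return payload.get("data")


# Response string copied from the setup() output earlier in this notebook.
raw = '{"status": "SUCCESS", "message": "Database Initialized.", "data": null}'
print(parse_response(raw))
```

The later cells could then use `parse_response(pneuma.summarize())` in place of the separate json.loads call.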


Optionally, if context (metadata) is available, we can register it as well using the add_metadata function.

In [6]:
metadata_path = "data_src/sample_data/metadata.csv"
response = pneuma.add_metadata(metadata_path=metadata_path)
response = json.loads(response)
print(response)

2025-02-21 16:44:02,798 [Registrar] [INFO] Context ID: 9
2025-02-21 16:44:02,835 [Registrar] [INFO] Summary ID: 10


{'status': 'SUCCESS', 'message': '2 metadata entries has been added.', 'data': {'file_count': 2, 'metadata_ids': [9, 10]}}


### Index Generation
The summaries (and optionally metadata) must be indexed into a hybrid retriever, which combines a vector index with a full-text (keyword) index. To do so, we call the generate_index function and specify a name for the index. By default, this function indexes all registered tables.

In [7]:
response = pneuma.generate_index(index_name="demo_index")
response = json.loads(response)
print(response)

2025-02-21 16:44:09,599 [IndexGenerator] [INFO] No table ids provided. Generating index for all tables...
2025-02-21 16:44:09,601 [IndexGenerator] [INFO] Generating index for 3 tables...
2025-02-21 16:44:09,725 [IndexGenerator] [INFO] Vector index named demo_index with id 11 has been created.
2025-02-21 16:44:09,726 [IndexGenerator] [INFO] Processing table data_src/sample_data/csv/5cq6-qygt.csv...
2025-02-21 16:44:10,653 [IndexGenerator] [INFO] Processing table data_src/sample_data/csv/5n77-2d6a.csv...
2025-02-21 16:44:10,658 [IndexGenerator] [INFO] Processing table data_src/sample_data/csv/inner_folder/28km-gtjn.csv...
  0%|          | 0/1 [00:00<?, ?it/s]2025-02-21 16:44:11,128 [httpx] [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
100%|██████████| 1/1 [00:00<00:00,  1.42it/s]
2025-02-21 16:44:11,378 [IndexGenerator] [INFO] 3 Tables have been inserted to index with id 11.


Split strings: 0it [00:00, ?it/s]

2025-02-21 16:44:11,391 [bm25s] [DEBUG] Building index from IDs objects
  avg_doc_len = np.array([len(doc_ids) for doc_ids in corpus_token_ids]).mean()
  ret = ret.dtype.type(ret / rcount)


BM25S Count Tokens: 0it [00:00, ?it/s]

BM25S Compute Scores: 0it [00:00, ?it/s]

Finding newlines for mmindex: 0.00B [00:00, ?B/s]

2025-02-21 16:44:11,450 [IndexGenerator] [INFO] Keyword index named demo_index with id 12 has been created.
2025-02-21 16:44:11,451 [IndexGenerator] [INFO] Processing table data_src/sample_data/csv/5cq6-qygt.csv...
2025-02-21 16:44:11,455 [IndexGenerator] [INFO] Processing table data_src/sample_data/csv/5n77-2d6a.csv...
2025-02-21 16:44:11,459 [IndexGenerator] [INFO] Processing table data_src/sample_data/csv/inner_folder/28km-gtjn.csv...
2025-02-21 16:44:11,468 [bm25s] [DEBUG] Building index from IDs objects


Finding newlines for mmindex:   0%|          | 0.00/11.1k [00:00<?, ?B/s]

2025-02-21 16:44:11,495 [IndexGenerator] [INFO] 3 Tables have been inserted to index with id 12.


{'status': 'SUCCESS', 'message': 'Vector and keyword index named demo_index with id 11 and 12 has been created with 3 tables.', 'data': {'table_ids': ['data_src/sample_data/csv/5cq6-qygt.csv', 'data_src/sample_data/csv/5n77-2d6a.csv', 'data_src/sample_data/csv/inner_folder/28km-gtjn.csv'], 'vector_index_id': 11, 'keyword_index_id': 12, 'vector_index_generation_time': 0.12340712547302246, 'keyword_index_generation_time': 0.07125711441040039}}


## Online Stage (Querying)
To retrieve a ranked list of tables, we use the query_index function. In this case, the retrieved table (`5n77-2d6a.csv`) is the correct answer because climate & sustainability is one of the issues covered by this survey dataset.

In [8]:
response = pneuma.query_index(
    index_name="demo_index",
    query="Which dataset contains climate issues?",
    k=1,
    n=5,
    alpha=0.5,
)
response = json.loads(response)
query = response["data"]["query"]
retrieved_tables = response["data"]["response"]

print(f"Query: {query}")
print("Retrieved tables:")
for table in retrieved_tables:
    print(table)

2025-02-21 16:45:10,583 [httpx] [INFO] HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-02-21 16:45:11,000 [httpx] [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-02-21 16:45:11,409 [httpx] [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-02-21 16:45:11,922 [httpx] [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-02-21 16:45:12,303 [httpx] [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-02-21 16:45:12,923 [httpx] [INFO] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Query: Which dataset contains climate issues?
Retrieved tables:
data_src/sample_data/csv/5n77-2d6a.csv
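The notebook does not define `k`, `n`, or `alpha`. As an illustration of how a hybrid retriever can blend its two indexes, the sketch below assumes that `alpha` weights the vector score against the keyword score and that `k` is the number of tables returned; `n` is not modeled. It is a generic weighted-sum fusion, not Pneuma's actual scoring, and the scores are invented for the example.

```python
def min_max(scores):
    """Scale a dict of scores to [0, 1]; constant scores all map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha=0.5, k=1):
    """Blend normalized scores as alpha * vector + (1 - alpha) * keyword.

    Returns the names of the k highest-scoring tables (ties broken by name).
    """
    vec, kw = min_max(vector_scores), min_max(keyword_scores)
    tables = set(vec) | set(kw)
    fused = {t: alpha * vec.get(t, 0.0) + (1 - alpha) * kw.get(t, 0.0) for t in tables}
    return sorted(fused, key=lambda t: (-fused[t], t))[:k]


# Invented scores for the three demo tables (not real Pneuma output).
vector_scores = {"5cq6-qygt.csv": 0.21, "5n77-2d6a.csv": 0.63, "28km-gtjn.csv": 0.18}
keyword_scores = {"5cq6-qygt.csv": 0.0, "5n77-2d6a.csv": 2.4, "28km-gtjn.csv": 0.3}
print(hybrid_rank(vector_scores, keyword_scores, alpha=0.5, k=1))
```

Normalizing before blending matters because similarity scores and BM25 scores live on different scales; without it, the larger-valued signal would dominate regardless of `alpha`.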
