# RAG for university courses

## Setup

We install the dependencies and enable nested event loops so that async code can run inside the notebook.

In [138]:
%pip install -Uq llama-index llama-index-llms-groq llama-index-embeddings-huggingface

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [139]:
import nest_asyncio

nest_asyncio.apply()

In [140]:
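import os
from getpass import getpass

# Never hard-code API keys in a notebook: read the key from the environment or prompt for it.
if "GROQ_API_KEY" not in os.environ:
    os.environ["GROQ_API_KEY"] = getpass("Groq API key: ")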

In [141]:
from llama_index.llms.groq import Groq

llm = Groq(model="llama3-8b-8192")
llm_70b = Groq(model="llama3-70b-8192")

Next we set up the embedding model: a model that takes a list of strings and returns one vector per string.

In [142]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
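
To make "strings in, vectors out" concrete, here is a minimal sketch of how such vectors are typically compared at retrieval time. The three-dimensional vectors are made-up placeholders, not real `BAAI/bge-m3` outputs, and `cosine_similarity` and `top_k` are illustrative helper names rather than LlamaIndex functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (assumption): a real embedding model returns far more dimensions.
toy_doc_vecs = {
    "computer-science": [0.9, 0.1, 0.0],
    "industrial-engineering": [0.2, 0.9, 0.1],
    "mathematics": [0.1, 0.2, 0.9],
}
toy_query_vec = [0.8, 0.2, 0.1]

ranking = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(ranking)
```

The query vector points mostly along the first axis, so the `computer-science` entry ranks first. A vector index applies the same idea to the embedded query and the embedded document chunks.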

We set the LLM and the embedding model globally so that they replace the OpenAI defaults.

In [143]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

## Loading the data

In [144]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("documents").load_data()

print(len(documents))
print(documents[15])

48
Doc ID: 0c3a6598-5573-4e73-828f-57ce4231ed90
Text: Application Process - Details, deadlines, and requirements are
published on the 2025 Admissions Portal. - Test preparation resources
and simulations are available on the CISIA Website.


Next, we load each Markdown file separately, so that every course becomes a single document with its own metadata.

In [145]:
import os

# os.listdir returns entries in arbitrary order, so sort for reproducible results
file_paths = sorted(x for x in os.listdir("documents") if x.endswith(".md"))

print(file_paths[:3])

['unitn-computer-science.md', 'unitn-industrial-engineering.md', 'unitn-mathematics.md']


In [146]:
%pip install -Uq unstructured


In [147]:
from llama_index.readers.file import MarkdownReader
from llama_index.core import Document

reader = MarkdownReader()

doc_limit = 3  # maximum number of files to load
docs = []
for idx, f in enumerate(file_paths[:doc_limit]):
    print(f"Idx {idx}/{len(file_paths)}")
    loaded_docs = reader.load_data(file=f"documents/{f}")
    # The reader splits a file into several Documents; merge them back into one per file
    loaded_doc = Document(
        text="\n\n".join(d.get_content() for d in loaded_docs),
        metadata={
            "path": f,
            # file names follow the pattern <university>-<course-words>.md
            "university": f.split("-")[0],
            "course": " ".join(f.split("-")[1:]).split(".")[0],
        },
    )
    print(loaded_doc.metadata["path"])
    docs.append(loaded_doc)

print(docs[0].metadata)
print(docs[0].text)

Idx 0/3
unitn-computer-science.md
Idx 1/3
unitn-industrial-engineering.md
Idx 2/3
unitn-mathematics.md
{'path': 'unitn-computer-science.md', 'university': 'unitn', 'course': 'computer science'}
Bachelor's Degree in Computer Science - University of Trento

Program Overview

- **Level**: Bachelor's Degree (First Cycle)
- **Duration**: 3 years
- **Degree Class**: L-31 - Computer Science and Technologies
- **Language**: Offered in **Italian** and **English**
- **Admission**: **Limited enrollment**, requires passing an admission test
- **Location**: Department of Information Engineering and Computer Science, Via Sommarive 5, 38123 Povo (TN), Italy

About the Program

Computer Science at the University of Trento integrates elements from **Science** and **Engineering**:
- From **Science**, it inherits **curiosity**, such as exploring philosophical aspects of problem-solving.
- From **Engineering**, it inherits **methodological rigor** in solving problems.

Key Features
- Recognized as one of 
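
The metadata in the loading cell is derived purely from the file name. As a minimal sketch, a small helper makes the `<university>-<course-words>.md` naming convention explicit. The name `parse_course_filename` is hypothetical, and the convention itself is inferred from the three file names listed earlier.

```python
from pathlib import Path


def parse_course_filename(filename):
    """Split '<university>-<course-words>.md' into metadata fields."""
    stem = Path(filename).stem  # drop the '.md' extension
    # Everything before the first hyphen is the university, the rest is the course
    university, _, course_slug = stem.partition("-")
    return {
        "path": filename,
        "university": university,
        "course": course_slug.replace("-", " "),
    }


metadata = parse_course_filename("unitn-industrial-engineering.md")
print(metadata)
```

For the file names used here, this yields the same `university` and `course` values as the inline `split` expressions.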

## Indexing

In [148]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=3)

## Agent building

In [149]:
from llama_index.core.agent import ReActAgent
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core import SummaryIndex
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.node_parser import MarkdownNodeParser
import os
from tqdm.notebook import tqdm
import pickle
from pathlib import Path


async def build_agent_per_doc(nodes, file_base):
    print(file_base)

    vi_out_path = f"./documents/llamaindex_docs/{file_base}"
    summary_out_path = f"./documents/llamaindex_docs/{file_base}_summary.pkl"
    if not os.path.exists(vi_out_path):
        Path("./documents/llamaindex_docs/").mkdir(parents=True, exist_ok=True)
        # build vector index
        vector_index = VectorStoreIndex(nodes)
        vector_index.storage_context.persist(persist_dir=vi_out_path)
    else:
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=vi_out_path),
        )

    # build summary index
    summary_index = SummaryIndex(nodes)

    # define query engines
    vector_query_engine = vector_index.as_query_engine(llm=llm)
    summary_query_engine = summary_index.as_query_engine(
        response_mode="tree_summarize", llm=llm
    )

    # extract a summary (cached on disk so the LLM is called only once per document)
    if not os.path.exists(summary_out_path):
        Path(summary_out_path).parent.mkdir(parents=True, exist_ok=True)
        summary = str(
            await summary_query_engine.aquery(
                "Extract a concise 1-2 line summary of this document"
            )
        )
        # use context managers so the file handles are always closed
        with open(summary_out_path, "wb") as fh:
            pickle.dump(summary, fh)
    else:
        with open(summary_out_path, "rb") as fh:
            summary = pickle.load(fh)

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name=f"vector_tool_{file_base}",
                description="Useful for questions related to specific facts",
            ),
        ),
        QueryEngineTool(
            query_engine=summary_query_engine,
            metadata=ToolMetadata(
                name=f"summary_tool_{file_base}",
                description="Useful for summarization questions",
            ),
        ),
    ]

    # build agent
    agent = ReActAgent.from_tools(
        query_engine_tools,
        llm=llm_70b,
        verbose=True,
        system_prompt=f"""\
You are a specialized agent designed to answer queries about the `{file_base}.md` document, as part of a chatbot that helps users choose university courses.
You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\
""",
    )

    return agent, summary


async def build_agents(docs):
    node_parser = MarkdownNodeParser(
        include_metadata=True,
        include_prev_next_rel=True,
    )

    # Build agents dictionary
    agents_dict = {}
    extra_info_dict = {}

    # # this is for the baseline
    # all_nodes = []

    for idx, doc in enumerate(tqdm(docs)):
        nodes = node_parser.get_nodes_from_documents([doc])
        # all_nodes.extend(nodes)

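        # ID is <parent dir stem>_<file stem>; the "path" metadata has no parent
        # directory, so the parent stem is empty and every ID starts with "_"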
        file_path = Path(doc.metadata["path"])
        file_base = str(file_path.parent.stem) + "_" + str(file_path.stem)
        agent, summary = await build_agent_per_doc(nodes, file_base)

        agents_dict[file_base] = agent
        extra_info_dict[file_base] = {"summary": summary, "nodes": nodes}

    return agents_dict, extra_info_dict
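
`build_agent_per_doc` uses the same "load from disk if present, otherwise compute and persist" pattern twice, once for the vector index and once for the summary. The pattern can be sketched in isolation as follows. This is a minimal standard-library illustration, not part of LlamaIndex: `load_or_compute` and `fake_summary` are hypothetical names, and the temporary directory stands in for `./documents/llamaindex_docs/`.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Return the cached value at cache_path, computing and storing it if missing."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    value = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(value, fh)
    return value


calls = []


def fake_summary():
    # Stand-in for the expensive LLM summarization call
    calls.append(1)
    return "A toy one-line summary."


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "demo_summary.pkl"
    first = load_or_compute(path, fake_summary)
    second = load_or_compute(path, fake_summary)  # served from the pickle file

print(first == second, len(calls))
```

One consequence of this design is that a stale cache is never refreshed: if a source document changes, the corresponding files must be deleted by hand before the agents are rebuilt.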

In [150]:
agents_dict, extra_info_dict = await build_agents(docs)

print(agents_dict.keys())
print(extra_info_dict.keys())

  0%|          | 0/3 [00:00<?, ?it/s]

_unitn-computer-science
_unitn-industrial-engineering
_unitn-mathematics
dict_keys(['_unitn-computer-science', '_unitn-industrial-engineering', '_unitn-mathematics'])
dict_keys(['_unitn-computer-science', '_unitn-industrial-engineering', '_unitn-mathematics'])


In [151]:
# define tool for each document agent
all_tools = []
for file_base, agent in agents_dict.items():
    summary = extra_info_dict[file_base]["summary"]
    doc_tool = QueryEngineTool(
        query_engine=agent,
        metadata=ToolMetadata(
            name=f"tool_{file_base}",
            description=summary,
        ),
    )
    all_tools.append(doc_tool)

In [152]:
print(all_tools[2].metadata)

ToolMetadata(description="The Bachelor's Degree in Mathematics at the University of Trento aims to train graduates with a solid foundation in core mathematics and the ability to apply mathematical models in various fields, while also developing critical thinking, teamwork, and digital skills.", name='tool__unitn-mathematics', fn_schema=<class 'llama_index.core.tools.types.DefaultToolFnSchema'>, return_direct=False)


In [153]:
%pip install -Uq llama-index-postprocessor-colbert-rerank


In [154]:
# define an "object" index and retriever over these tools
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import (
    ObjectIndex,
    ObjectRetriever,
)
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.schema import QueryBundle

obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)
vector_node_retriever = obj_index.as_node_retriever(
    similarity_top_k=10,
)


# define a custom object retriever that adds in a query planning tool
class CustomObjectRetriever(ObjectRetriever):
    def __init__(
        self,
        retriever,
        object_node_mapping,
        node_postprocessors=None,
        llm=None,
    ):
        self._retriever = retriever
        self._object_node_mapping = object_node_mapping
        self._llm = llm
        self._node_postprocessors = node_postprocessors or []

    def retrieve(self, query_bundle):
        if isinstance(query_bundle, str):
            query_bundle = QueryBundle(query_str=query_bundle)

        nodes = self._retriever.retrieve(query_bundle)
        for processor in self._node_postprocessors:
            nodes = processor.postprocess_nodes(
                nodes, query_bundle=query_bundle
            )
        tools = [self._object_node_mapping.from_node(n.node) for n in nodes]

        sub_question_engine = SubQuestionQueryEngine.from_defaults(
            query_engine_tools=tools, llm=self._llm
        )
        sub_question_description = """\
Useful for any queries that involve comparing multiple documents. ALWAYS use this tool for comparison queries - make sure to call this \
tool with the original query. Do NOT use the other tools for any queries involving multiple documents.
"""
        sub_question_tool = QueryEngineTool(
            query_engine=sub_question_engine,
            metadata=ToolMetadata(
                name="compare_tool", description=sub_question_description
            ),
        )

        return tools + [sub_question_tool]

In [155]:
# wrap it with ObjectRetriever to return objects
from llama_index.postprocessor.colbert_rerank import ColbertRerank

colbert_reranker = ColbertRerank(
    top_n=5,
    model="colbert-ir/colbertv2.0",
    tokenizer="colbert-ir/colbertv2.0",
    keep_retrieval_score=True,
)


custom_obj_retriever = CustomObjectRetriever(
    vector_node_retriever,
    obj_index.object_node_mapping,
    node_postprocessors=[colbert_reranker],
    llm=llm,
)

In [156]:
tmps = custom_obj_retriever.retrieve("hello")

# expected: min(top_n, number of document tools) + 1 compare tool.
# Only 3 document tools exist here, so the result is 3 + 1 = 4.
print(len(tmps))

4
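
The count of 4 follows from how the retriever is assembled: the vector retriever returns at most `similarity_top_k` tools, the reranker keeps at most `top_n` of them, and the retriever then appends the single comparison tool. A minimal sketch of that arithmetic, assuming the reranker drops nodes only through its `top_n` cutoff (`expected_tool_count` is a hypothetical helper, not a LlamaIndex function):

```python
def expected_tool_count(num_tools, similarity_top_k, rerank_top_n):
    """Number of tools returned by the custom retriever, under the stated assumption."""
    retrieved = min(num_tools, similarity_top_k)  # vector retrieval stage
    reranked = min(retrieved, rerank_top_n)       # reranking stage
    return reranked + 1                           # plus the comparison tool


count = expected_tool_count(num_tools=3, similarity_top_k=10, rerank_top_n=5)
print(count)  # 4
```

With more than five document agents loaded, the same formula gives 5 + 1 = 6.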


In [157]:
from llama_index.core.agent import ReActAgent

top_agent = ReActAgent.from_tools(
    tool_retriever=custom_obj_retriever,
    system_prompt="""\
You are an agent designed to answer queries about university courses.
Always use the tools provided to answer a question. Do not rely on prior knowledge.\
""",
    llm=llm_70b,
    verbose=True,
)

In [158]:
response = top_agent.query("List all industrial engineering subjects")

> Running step 62ae713f-50c0-4fda-a4a2-d7512fe74130. Step input: List all industrial engineering subjects
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: tool__unitn-industrial-engineering
Action Input: {'input': 'List all industrial engineering subjects'}
> Running step 0b03666a-e6aa-4356-a3cc-0699e849af6a. Step input: List all industrial engineering subjects
Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: vector_tool__unitn-industrial-engineering
Action Input: {'input': 'list all industrial engineering subjects'}
Observation: Here is the list of industrial engineering subjects:

* Mathematics
* Chemistry
* Physics
* Computer Science
* Materials for Sustainable Industry
* Introduction to information engineering
* Robotics and Mechatronics
* Management Engineering
> Running step f930267d-04e3-4052-abef



Thought: I have enough information to answer the question without using any more tools.
Answer: The key differences between the Bachelor's Degree in Industrial Engineering at UniTrento and the Bachelor's Degree in Computer Science at the University of Trento are the focus areas. The Industrial Engineering program at UniTrento provides a broad foundation in subjects like Mathematics, Chemistry, Physics, and Computer Science, and offers specialized tracks in Materials, Robotics, and Management Engineering, with a focus on sustainable production processes and the role of IT in production systems. In contrast, the Computer Science program at the University of Trento likely focuses on the design, development, and application of computer systems and algorithms.
[tool__unitn-industrial-engineering] A: The key differences between the Bachelor's Degree in Industrial Engineering at UniTrento and the Bachelor's Degree in Computer Science at the University 

In [161]:
print(response)

Here is the list of industrial engineering subjects:

* Management Engineering
* Industrial Plants

For comparison, we run the same query against the baseline vector index built in the Indexing section.


In [162]:
response = query_engine.query("List all industrial engineering subjects")

print(response)

Here is the list of industrial engineering subjects:

* Management Engineering
* Industrial Plants
