# MyVison: How Retrieval-Augmented Generation can enhance academic educational guidance

In this Jupyter notebook we build a RAG system that answers questions about 70 university courses across 2 universities, to help future university students make better-informed choices.

## Setup

Before we begin, we check that we are using a Colab instance with a GPU, so that the embedding model can run on the graphics card.

In [1]:
!nvidia-smi

Mon Mar 24 10:19:05 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   55C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

We now mount the Google Drive folder so that we have access to cached data (more on that later) and the source documents. This avoids re-uploading the data to the Colab instance on every run.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


We also store the path to the project folder in a variable, so we don't have to repeat it later.

In [3]:
drive_path = "/content/drive/MyDrive/Computational linguistics and language-based interaction"

We install the basic LlamaIndex dependencies and the Groq package as our LLM client, and set up async support for API calls.

In [4]:
%pip install -Uq llama-index llama-index-llms-groq llama-index-embeddings-huggingface

In [5]:
import nest_asyncio

nest_asyncio.apply()

### Groq Client setup

We chose Groq as our LLM API vendor because its free tier lets us try out our RAG system.

We load our Groq API key, which is saved in the secrets section of Colab.

In [6]:
import os
from google.colab import userdata

os.environ["GROQ_API_KEY"] = userdata.get('GROQ_KEY')

# Alternatively, you can set the key manually here
# os.environ["GROQ_API_KEY"] = "<key>"

Once we have the API key we can set up 2 LLM clients:

*   `llm` is based on `llama3-8b-8192` and serves as the basic LLM for most tasks
*   `llm_70b` uses `llama3-70b-8192`, with 70 billion parameters, as our more accurate (golden) LLM, which is used in the evaluation phase

In [7]:
from llama_index.llms.groq import Groq

llm = Groq(model="llama3-8b-8192")
llm_70b = Groq(model="llama3-70b-8192")

### Initializing the Embedding Model with Hugging Face Transformers

An **embedding model** is one of the most important parts of a RAG system, as it is responsible for converting text into numerical vector representations (embeddings) that capture semantic meaning. These embeddings are then used for efficient similarity search within a vector database.
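To make the idea of similarity search concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the texts are invented for illustration; a real embedding model produces vectors with hundreds or thousands of dimensions, and a vector database uses optimized search rather than a full sort.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example only.
toy_embeddings = {
    "computer science course": [0.9, 0.1, 0.0],
    "informatics degree": [0.8, 0.2, 0.1],
    "library opening hours": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of a query about programming

# Rank the stored texts from most to least similar to the query.
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
print(ranked)
```

Texts whose vectors point in a direction close to the query vector end up first, which is the behaviour dense retrieval relies on.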

Using the `HuggingFaceEmbedding` component we can load a custom embedding model instead of the default LlamaIndex one. Specifically, the code uses the `BAAI/bge-m3` model, which is performant and efficient on English text. The `device="cuda"` argument runs the model on a compatible NVIDIA GPU, and `parallel_process=True` is set for faster embedding generation. Finally, the `cache_folder` argument specifies a directory in which to cache the downloaded model weights, avoiding redundant downloads and speeding up subsequent runs.

The commented-out line shows an alternative configuration using the `BAAI/bge-small-en-v1.5` model, which is smaller and less computationally intensive. We used it to quickly test modifications to the pipeline.

In [8]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3", device="cuda", parallel_process=True, cache_folder=f"{drive_path}/embeddings_cache")
# embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Then we set the LLM and the embedding model globally, so that they replace the OpenAI defaults.

In [9]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

## Loading and Ingestion

Our first approach was to use the `SimpleDirectoryReader` provided by LlamaIndex to quickly load and parse the documents. However, this method proved ineffective, as a lot of the meaning of the documents was lost.




In [10]:
# from llama_index.core import SimpleDirectoryReader

# documents = SimpleDirectoryReader(drive_path).load_data()

# print(len(documents))
# print(documents[1])

In fact, `SimpleDirectoryReader` makes it impossible to customize some essential metadata: since our chunks of information can be quite similar to one another, this may cause problems in the retrieval phase. We therefore load each document separately and parse its content manually.

In [11]:
import os

file_paths_courses = []
for x in os.listdir(drive_path + "/courses_documents"):
    if x.endswith(".md"):
        file_paths_courses.append(x)

print(file_paths_courses[:3])

file_paths_libraries = []
for x in os.listdir(drive_path + "/libraries_documents"):
    if x.endswith(".md"):
        file_paths_libraries.append(x)

print(file_paths_libraries[:3])

['univr-informatics.md', 'univr-languages-and-digital-media.md', 'univr-literature.md']
['unitn-bur-rovereto-university-library.md', 'unitn-cavazzani-study-room.md', 'unitn-buc-university-central-library.md']


While loading the data sequentially we add some initial metadata to the documents. Our dataset is composed of Markdown documents named `university-course-name.md`: the file name gives us two important pieces of information to store as metadata, the university and the course, which help us filter our documents.
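As a small illustration of the naming convention, the sketch below extracts the two fields from a file name. The helper `parse_filename` is hypothetical and written only for this example; it uses `Path.stem` and `str.partition`, a slightly different (but equivalent for these names) approach from the `split`-based expressions in the loading cell.

```python
from pathlib import Path

def parse_filename(file_name):
    """Split '<university>-<name-with-dashes>.md' into (university, name)."""
    stem = Path(file_name).stem                 # drop the '.md' extension
    university, _, rest = stem.partition("-")   # split at the first dash only
    return university, rest.replace("-", " ")

print(parse_filename("univr-languages-and-digital-media.md"))
print(parse_filename("unitn-cavazzani-study-room.md"))
```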

In [12]:
from llama_index.core import Document

documents = []
for idx, f in enumerate(file_paths_courses):
    print(f"Idx {idx}/{len(file_paths_courses)}")
    # Read the whole Markdown file; the context manager closes it afterwards.
    with open(f"{drive_path}/courses_documents/{f}", "r", encoding="utf-8") as fh:
        content = fh.read()
    # File names follow `<university>-<course-name>.md`.
    loaded_doc = Document(
        text=content,
        metadata={
            "university": f.split("-")[0],
            "course": " ".join(f.split("-")[1:]).split(".")[0],
            "type": "course information",
        },
    )
    documents.append(loaded_doc)

for idx, f in enumerate(file_paths_libraries):
    print(f"Idx {idx}/{len(file_paths_libraries)}")
    with open(f"{drive_path}/libraries_documents/{f}", "r", encoding="utf-8") as fh:
        content = fh.read()
    # File names follow `<university>-<library-name>.md`.
    loaded_doc = Document(
        text=content,
        metadata={
            "university": f.split("-")[0],
            "library": " ".join(f.split("-")[1:]).split(".")[0],
            "type": "library information",
        },
    )
    documents.append(loaded_doc)

print(documents[0].metadata)
print(documents[71].metadata)

Idx 0/70
Idx 1/70
Idx 2/70
Idx 3/70
Idx 4/70
Idx 5/70
Idx 6/70
Idx 7/70
Idx 8/70
Idx 9/70
Idx 10/70
Idx 11/70
Idx 12/70
Idx 13/70
Idx 14/70
Idx 15/70
Idx 16/70
Idx 17/70
Idx 18/70
Idx 19/70
Idx 20/70
Idx 21/70
Idx 22/70
Idx 23/70
Idx 24/70
Idx 25/70
Idx 26/70
Idx 27/70
Idx 28/70
Idx 29/70
Idx 30/70
Idx 31/70
Idx 32/70
Idx 33/70
Idx 34/70
Idx 35/70
Idx 36/70
Idx 37/70
Idx 38/70
Idx 39/70
Idx 40/70
Idx 41/70
Idx 42/70
Idx 43/70
Idx 44/70
Idx 45/70
Idx 46/70
Idx 47/70
Idx 48/70
Idx 49/70
Idx 50/70
Idx 51/70
Idx 52/70
Idx 53/70
Idx 54/70
Idx 55/70
Idx 56/70
Idx 57/70
Idx 58/70
Idx 59/70
Idx 60/70
Idx 61/70
Idx 62/70
Idx 63/70
Idx 64/70
Idx 65/70
Idx 66/70
Idx 67/70
Idx 68/70
Idx 69/70
Idx 0/39
Idx 1/39
Idx 2/39
Idx 3/39
Idx 4/39
Idx 5/39
Idx 6/39
Idx 7/39
Idx 8/39
Idx 9/39
Idx 10/39
Idx 11/39
Idx 12/39
Idx 13/39
Idx 14/39
Idx 15/39
Idx 16/39
Idx 17/39
Idx 18/39
Idx 19/39
Idx 20/39
Idx 21/39
Idx 22/39
Idx 23/39
Idx 24/39
Idx 25/39
Idx 26/39
Idx 27/39
Idx 28/39
Idx 29/39
Idx 30/39
Idx 31/39


Although we now have the documents, each file contains too much information to be used directly as a source for our embeddings. We split the documents into smaller chunks (Nodes) using the `MarkdownNodeParser`, which is specifically designed for parsing Markdown content. In short, the parser divides the text into sections based on the headings while keeping the relations between the different Nodes.

The parameters used are:

*   `include_prev_next_rel=True`: This setting ensures that relationships between nodes (previous and next nodes) are preserved. This can be useful for maintaining context and coherence when retrieving and presenting information.
*   `include_metadata=True`: This ensures that the metadata extracted during document loading (e.g., university and course names) is also attached to each node. This allows for filtering and querying based on metadata during retrieval.

This granular representation allows for more precise and context-aware retrieval of information.
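The idea of heading-based chunking can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation of `MarkdownNodeParser`: the function `split_by_headings`, the sample text and the dictionary layout are assumptions made for the example, and the sketch ignores edge cases such as `#` characters inside code fences.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")

def split_by_headings(markdown_text):
    """Return a list of sections, each starting at a Markdown heading line."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        # A new heading closes the section collected so far.
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]

# Invented sample document.
sample = """# Informatics
Overview of the degree.

## Admission
Entry test required.

## Career
Software and data roles.
"""

chunks = split_by_headings(sample)
# Copy the parent document's metadata to every chunk and link neighbours by position.
toy_nodes = [
    {
        "text": text,
        "metadata": {"university": "univr"},
        "prev": i - 1 if i > 0 else None,
        "next": i + 1 if i < len(chunks) - 1 else None,
    }
    for i, text in enumerate(chunks)
]
print(len(toy_nodes))
```

Each chunk keeps the document-level metadata and a pointer to its neighbours, which mirrors the role of `include_metadata` and `include_prev_next_rel` described above.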

In [13]:
from llama_index.core.node_parser import MarkdownNodeParser

parser = MarkdownNodeParser(
        include_prev_next_rel=True,
        include_metadata=True,
    )

nodes = parser.get_nodes_from_documents(documents)

## Indexing, Embedding and Storing



Indexing is the process of creating a structured data format that allows for fast and efficient retrieval of information from a collection of documents or data points. It's a fundamental concept in information retrieval systems and search engines.


We opted for a combination of two indexes: BM25 for sparse retrieval and Chroma for dense retrieval.

First we install chromadb to store our vector indexes.

In [14]:
%pip install -Uq chromadb llama-index-vector-stores-chroma

We also persist them to a cache on Drive, so that later runs are faster.



In [15]:

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

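# Reuse the persisted document store if it exists; otherwise build it from the parsed nodes.
try:
    docstore = SimpleDocumentStore.from_persist_path(f"{drive_path}/docstore.json")
except Exception:
    docstore = SimpleDocumentStore()
    docstore.add_documents(nodes)

db = chromadb.PersistentClient(path=f"{drive_path}/chroma_db")
chroma_collection = db.get_or_create_collection("dense_vectors")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(
    docstore=docstore, vector_store=vector_store
)

# Embed the nodes only when the Chroma collection is still empty;
# otherwise attach to the vectors already stored on disk.
# (Building an index from an empty node list does not raise, so a
# try/except around it would never fall back to embedding the nodes.)
if chroma_collection.count() == 0:
    index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)
else:
    index = VectorStoreIndex(nodes=[], storage_context=storage_context)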

storage_context.docstore.persist(f"{drive_path}/docstore.json")

## Retrieval and Querying

To quickly test the system and have a baseline for comparison, we use the default query engine provided by LlamaIndex. However, this requires a lot of API calls and is not easily customizable.

In [16]:
# query_engine = index.as_query_engine(similarity_top_k=3)
# res = query_engine.query("What are the loan periods and number of items that can be borrowed from the bur of Rovereto?")
# print(res)

For our main retriever we use BM25.
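For readers unfamiliar with BM25, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It uses whitespace tokenization, the common default parameters `k1=1.5` and `b=0.75`, and one widely used non-negative IDF variant; these are assumptions for illustration, and the library behind `BM25Retriever` may tokenize and weight terms differently. The example documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "loan periods for books at the university library",
    "computer science bachelor course admission",
    "library opening hours and study rooms",
]
scores = bm25_scores("library loan periods", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Because BM25 matches exact terms, it tends to work well when questions reuse the wording of the documents, such as course or library names.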

In [17]:
%pip install -Uq llama-index-retrievers-bm25

The basic use of BM25 is to build a retriever from the document store, specifying how many top-ranked nodes we want back.

In [18]:
from llama_index.retrievers.bm25 import BM25Retriever

bm25_retriever = BM25Retriever.from_defaults(
    docstore=docstore,
    similarity_top_k=15,
)

DEBUG:bm25s:Building index from IDs objects


We originally wanted to use the two indexes mentioned above together, merging their results with the `QueryFusionRetriever`. However, it does not seem to work and runs indefinitely, so we use only the BM25 retriever.
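To illustrate what fusing two result lists means, here is a minimal sketch of reciprocal rank fusion, one widely used rule for merging rankings. It is a standalone illustration with invented node ids and the conventional constant `k=60`; it is not necessarily the rule that `QueryFusionRetriever` applies with the settings shown in this notebook.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute more to the fused score.
            fused[node_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

dense_ranking = ["a", "b", "c"]   # hypothetical ids from the vector retriever
sparse_ranking = ["b", "c", "a"]  # hypothetical ids from BM25
fused_order = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused_order)
```

The fusion only needs the rank positions, so it can combine dense and sparse scores that live on different scales.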

In [19]:
# from llama_index.retrievers.bm25 import BM25Retriever
# from llama_index.core.retrievers import QueryFusionRetriever
# from llama_index.core.retrievers import VectorIndexRetriever

# base_retriever = VectorIndexRetriever(
#     index=index,
#     similarity_top_k=2,
# )

# retriever = QueryFusionRetriever(
#     [
#         base_retriever,
#         bm25_retriever
#     ],
#     similarity_top_k=2,
#     num_queries=1,
#     use_async=True,
#     verbose=True
# )

In [20]:
# from llama_index.core.llms import ChatMessage

# query = "List computer science courses in unitn"

# retrieved_nodes = retriever.retrieve(
#     query
# )

# context = "\n".join([node.text for node in retrieved_nodes])

# for node in retrieved_nodes:
#     print(node.metadata)

# messages= [
#     ChatMessage(
#         role="system", content="Use only the documents provided below the question and just give me the answer."
#     ),
#     ChatMessage(role="user", content=f"{query}\n\n{context}"),
# ]

# res = llm.chat(messages)
# print(res)

## Evaluation

We generated 71 questions across all our documents to test the system. We load the JSON file containing the question-answer pairs.

In [21]:
# get the json file with the questions
import json

questions = None

with open(f"{drive_path}/questions.json", "r") as f:
    questions = json.load(f)

Well-known evaluation frameworks require a lot of API calls, and our quota is limited. Our solution is to use the same prompts but issue the requests manually, so that we can slow the process down.
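One way to slow requests down is sketched below: a fixed pause after each successful call, plus exponential backoff when a call fails. The helper `call_with_retry` and the simulated `flaky_call` are hypothetical and only stand in for a real API call; the delays are shortened so the example runs quickly.

```python
import time

def call_with_retry(fn, max_attempts=5, base_delay=2.0, pause=2.0):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            result = fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)
        else:
            time.sleep(pause)  # fixed pause between successful calls
            return result

# Stand-in for an API call that fails twice before succeeding.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limit (simulated)")
    return "ok"

answer = call_with_retry(flaky_call, base_delay=0.01, pause=0.0)
print(answer, attempts["count"])
```

In the evaluation loop, each `llm.chat(...)` call could be wrapped this way, for example `call_with_retry(lambda: llm.chat(messages))`, under the assumption that rate-limit errors surface as exceptions.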

In [22]:
DEFAULT_SYSTEM_TEMPLATE = """
You are an expert evaluation system for a question answering chatbot.

You are given the following information:
- a user query, and
- a generated answer

You may also be given a reference answer to use for reference in your evaluation.

Your job is to judge the relevance and correctness of the generated answer.
Output a single score that represents a holistic evaluation.
You must return your response in a line with only the score.
Do not return answers in any other format.
On a separate line provide your reasoning for the score as well.

Follow these guidelines for scoring:
- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.
- If the generated answer is not relevant to the user query, \
you should give a score of 1.
- If the generated answer is relevant but contains mistakes, \
you should give a score between 2 and 3.
- If the generated answer is relevant and fully correct, \
you should give a score between 4 and 5.

Example Response:
4.0
The generated answer has the exact same metrics as the reference answer, \
    but it is not as concise.

"""

DEFAULT_CONTEXT_TEMPLATE = """
    Your task is to evaluate if the retrieved context from the document sources are relevant to the query.
    The evaluation should be performed in a step-by-step manner by answering the following questions:
    1. Does the retrieved context match the subject matter of the user's query?
    2. Can the retrieved context be used exclusively to provide a full answer to the user's query?
    Each question above is worth 2 points, where partial marks are allowed and encouraged. Provide detailed feedback on the response
    according to the criteria questions previously mentioned.
    After your feedback provide a final result by strictly following this format:
    '[RESULT] followed by the float number representing the total score assigned to the response'
    Query: \n {query_str}
    Context: \n {context_str}
    Feedback:
"""


In [23]:
from llama_index.core.llms import ChatMessage
import csv
import time

# if present delete toreview.csv, context2.txt and evaluation2.txt
if os.path.exists(f"{drive_path}/toreview.csv"):
    os.remove(f"{drive_path}/toreview.csv")

if os.path.exists(f"{drive_path}/context2.txt"):
    os.remove(f"{drive_path}/context2.txt")

if os.path.exists(f"{drive_path}/evaluation2.txt"):
    os.remove(f"{drive_path}/evaluation2.txt")

# Open the CSV file for writing the review information.
with open(f"{drive_path}/toreview.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    # Write header row
    writer.writerow(["Question", "Gold Answer", "RAG Answer", "Document", "Evaluation Score", "Context Score"])

    for i, q in enumerate(questions):
        print(f"Question {i+1}/{len(questions)} - {q['question']}")
        retrievedNodes = bm25_retriever.retrieve(q["question"])
        # Combine the text from all retrieved nodes into one context string.
        context = "\n".join([node.text for node in retrievedNodes])

        #print(context)

        messages = [
            ChatMessage(
                role="system",
                content="Use only the documents provided below the question and just give me the answer."
            ),
            ChatMessage(
                role="user",
                content=f"{q['question']}\n\n{context}"
            ),
        ]
        # Get the RAG answer.
        ragResponse = llm.chat(messages)

        # Prepare the evaluation query.
        evaluationQuery = f"""
        Given the following information:
        - a user query: "{q["question"]}"
        - a generated answer: "{ragResponse}"
        - a reference answer: "{q["answer"]}"

        Evaluate the relevance and correctness of the generated answer.
        """
        evaluationSetup = [
            ChatMessage(role="system", content=DEFAULT_SYSTEM_TEMPLATE),
            ChatMessage(role="user", content=evaluationQuery)
        ]
        evaluationResults = llm_70b.chat(evaluationSetup)

        # Optionally write evaluation details to a separate file.
        with open(f"{drive_path}/evaluation2.txt", "a", encoding="utf-8") as f:
            f.write(q["question"] + "\n" + str(evaluationResults) + "\n")

        # Prepare the context query.
        contextQuery = f"""
        Given the following information:
        - a user query: "{q["question"]}"
        - a retrieved context: "{context}"
        """
        contextSetup = [
            ChatMessage(role="system", content=DEFAULT_CONTEXT_TEMPLATE),
            ChatMessage(role="user", content=contextQuery)
        ]
        contextResults = llm_70b.chat(contextSetup)

        # Optionally write context details to a separate file.
        with open(f"{drive_path}/context2.txt", "a", encoding="utf-8") as f:
            f.write(q["question"] + "\n" + str(contextResults.message) + "\n")

        # Write a row to the CSV file. Here, we assume that the evaluation and context results
        # have a 'message' attribute. If not, their string representation is used.
        writer.writerow([
            q["question"],
            q["answer"],
            ragResponse,
            context,
            evaluationResults.message if hasattr(evaluationResults, 'message') else str(evaluationResults),
            contextResults.message if hasattr(contextResults, 'message') else str(contextResults)
        ])
        time.sleep(2)

Streaming output truncated to the last 5000 lines.
---
Question 39/71 - What are some of the key skills that the English track of the Economics and Management program aims to impart, and what are some of the innovative teaching methods used?
### Tracks Available  

1. **Economia e Management** (Italian)  
2. **Economics and Management** (English)  

The **English track** aims to provide an innovative curriculum, integrating quantitative subjects (mathematics, statistics), economics, and computer science to equip students with advanced skills for data processing, modeling, and decision-making in economics and management fields.

---
#### Specific Objectives for the English Track  
- Provide structured programming skills (e.g., Python, R, Matlab) for applications in economics, management, and finance.  
- Prepare students to utilize big data, artificial intelligence, and machine learning in economic and managerial contexts.  
- Employ innovative teaching methodologies, such

## Evaluation score

We read the scores back from the text file and compute their average.

In [24]:
import re

rule = re.compile(r"assistant:\s*(-?\d+\.\d+)")

scores = []
with open(f"{drive_path}/evaluation2.txt", "r") as f:
    for line in f:
        match = rule.search(line)
        if match:
            scores.append(float(match.group(1)))

print(f"Average score: {sum(scores)/len(scores)}")
print(f"Number of questions: {len(scores)}")

Average score: 4.345070422535211
Number of questions: 71


## Context score

In [25]:
rule = re.compile(r"\[RESULT\]\s*(-?\d+\.\d+)")

scores = []
with open(f"{drive_path}/context2.txt", "r") as f:
    for line in f:
        match = rule.search(line)
        if match:
            scores.append(float(match.group(1)))


print(f"Average score: {sum(scores)/len(scores)}")
print(f"Number of questions: {len(scores)}")

Average score: 3.2106382978723405
Number of questions: 47
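Only 47 context scores were parsed, fewer than the 71 answer scores. The pattern above requires a decimal point, so a reply such as `[RESULT] 4` would be skipped; whether that explains the gap here is an assumption we have not verified. A more tolerant pattern could look like the sketch below, where `RESULT_RULE`, `extract_scores` and the sample lines are introduced only for illustration.

```python
import re

# Accept "4", "4.", "4.0" or "3.5" after the "[RESULT]" tag.
RESULT_RULE = re.compile(r"\[RESULT\]\s*(-?\d+(?:\.\d*)?)")

def extract_scores(lines):
    """Return every score found after a [RESULT] tag, as floats."""
    found = []
    for line in lines:
        match = RESULT_RULE.search(line)
        if match:
            found.append(float(match.group(1)))
    return found

# Invented sample lines, not taken from context2.txt.
sample_lines = [
    "[RESULT] 3.5",
    "Feedback ... [RESULT] 4",
    "no score on this line",
]
parsed = extract_scores(sample_lines)
print(parsed, sum(parsed) / len(parsed))
```

Comparing the number of parsed scores with the number of questions is a quick check that no evaluation was silently dropped from the average.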
