# RAG for university courses

## Setup

We install the dependencies and set up async support.

In [1]:
%pip install -Uq llama-index llama-index-llms-groq llama-index-embeddings-huggingface

Note: you may need to restart the kernel to use updated packages.


In [2]:
import nest_asyncio

nest_asyncio.apply()

We set the API key for Groq (the LLM provider) and create two clients: a smaller, more basic model with 8B parameters and a larger, more advanced one with 70B parameters.

In [3]:
import os

# Placeholder: paste your own Groq API key here and never commit a real key
os.environ["GROQ_API_KEY"] = "<your-groq-api-key>"

In [4]:
from llama_index.llms.groq import Groq

llm = Groq(model="llama3-8b-8192")
llm_70b = Groq(model="llama3-70b-8192")

We set up the embedding model. An embedding model is a model that takes a list of strings and returns a list of vectors. 
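To make the "list of strings in, list of vectors out" contract concrete, here is a minimal sketch using a toy stand-in. It is not the BGE model used in this notebook: `toy_embed` and `cosine_similarity` are hypothetical helpers that hash words into a few buckets, whereas a real embedding model captures meaning rather than word overlap.

```python
import hashlib
import math


def toy_embed(texts, dim=8):
    """Map each string to a fixed-size vector (toy stand-in, not a real model)."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            # Hash each token to a stable bucket index and count it there
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
            vec[bucket] += 1.0
        vectors.append(vec)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# One vector per input string, all with the same dimension
vectors = toy_embed(["first year courses", "courses of the first year"])
print(len(vectors), len(vectors[0]))
print(cosine_similarity(vectors[0], vectors[1]))
```

Vectors that point in a similar direction have a cosine similarity close to 1, and this is the kind of score a vector index relies on when comparing a query with stored text.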

In [5]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

We set the LLM and the embedding model globally so that they replace the default OpenAI ones.

In [6]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model

## Loading the data

We first tried to load the data with SimpleDirectoryReader, but most of the documents were split poorly.

In [7]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("documents").load_data()

print(len(documents))
print(documents[1])

25
Doc ID: 996d6684-7235-4de8-87ed-88d58172ce7d
Text: Orientamento ------------  * [I
nostri studenti](http://orienta.unitn.it/cosa-scegliere/56/beni-
culturali) * [Eventi di orientamento](http://orienta.unitn.it/come-
scegliere/4/eventi-di-orientamento) * [Orienta: tutti i
servizi](http://orienta.unitn.it) * [Scegliere
UniTrento](http://www.unitn.it/ateneo/4/perche-scegliere-unitrento)
* Livello...


In fact, SimpleDirectoryReader makes it impossible to customize some metadata that is essential here, since chunks can be very similar to one another. We therefore load each document separately and parse its content manually.

In [18]:
import os

# Keep only the Markdown files found in the "eng" folder
file_paths = [x for x in os.listdir("eng") if x.endswith(".md")]

print(file_paths[:3])

['unitn-computer-science.md', 'unitn-industrial-engineering.md', 'unitn-mathematics.md']


While loading the data, we add some initial metadata to each document: the university and the course the document is about.
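Both values come from the file name, which follows the pattern seen above (for example `unitn-computer-science.md`). As a minimal sketch of that convention, assuming every file name starts with the university followed by a dash, a hypothetical helper `metadata_from_filename` could look like this:

```python
def metadata_from_filename(filename):
    """Split '<university>-<course-words>.md' into a metadata dict."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    # Everything before the first dash is the university, the rest is the course
    university, _, course = stem.partition("-")
    return {"university": university, "course": course.replace("-", " ")}


print(metadata_from_filename("unitn-computer-science.md"))
# prints {'university': 'unitn', 'course': 'computer science'}
```

Keeping the parsing in one small function makes it easier to adjust if the naming scheme changes.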

In [20]:
from llama_index.core import Document

documents = []
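for idx, f in enumerate(file_paths):
    print(f"Idx {idx}/{len(file_paths)}")
    # Read the whole Markdown file; the context manager closes the handle
    with open(f"eng/{f}", "r") as fh:
        content = fh.read()
    # File names look like "<university>-<course-words>.md"
    university, *course_words = f.split("-")
    loaded_doc = Document(
        text=content,
        metadata={"university": university, "course": " ".join(course_words).split(".")[0]},
    )
    documents.append(loaded_doc)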

print(documents[0].metadata)
print(documents[0].text)

Idx 0/3
Idx 1/3
Idx 2/3
{'university': 'unitn', 'course': 'computer science'}
# Bachelor's Degree in Computer Science - University of Trento

## Program Overview

- **Level**: Bachelor's Degree (First Cycle)
- **Duration**: 3 years
- **Degree Class**: L-31 - Computer Science and Technologies
- **Language**: Offered in **Italian** and **English**
- **Admission**: **Limited enrollment**, requires passing an admission test
- **Location**: Department of Information Engineering and Computer Science, Via Sommarive 5, 38123 Povo (TN), Italy

## About the Program

Computer Science at the University of Trento integrates elements from **Science** and **Engineering**:
- From **Science**, it inherits **curiosity**, such as exploring philosophical aspects of problem-solving.
- From **Engineering**, it inherits **methodological rigor** in solving problems.

### Key Features
- Recognized as one of the foundational pillars of modern sciences, alongside **theory** and **experimentation**.
- Highly **pe

## Indexing

Indexing is the process of mapping documents to vectors. We use the embedding model to compute a vector for each chunk (node) of a document. We also use MarkdownNodeParser, which:

1. Splits each Markdown document into sections following its headings
2. Extracts additional metadata from the document
3. Links adjacent sections to each other (previous/next relationships)

We then create a vector store and index the documents.
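To illustrate the idea behind heading-based splitting, here is a simplified sketch in plain Python. It is not the MarkdownNodeParser implementation: `split_by_headings` is a hypothetical helper, it ignores edge cases such as `#` lines inside fenced code blocks, and it stores neighbours as list positions rather than node relationships.

```python
import re


def split_by_headings(markdown_text):
    """Split Markdown into sections, starting a new one at every heading line."""
    sections = []
    current = {"heading": None, "lines": []}
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # Close the section in progress (if it has content) and open a new one
            if current["heading"] is not None or current["lines"]:
                sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["heading"] is not None or current["lines"]:
        sections.append(current)

    # Record previous/next neighbours by position
    nodes = []
    for i, section in enumerate(sections):
        nodes.append({
            "heading": section["heading"],
            "text": "\n".join(section["lines"]).strip(),
            "prev": i - 1 if i > 0 else None,
            "next": i + 1 if i < len(sections) - 1 else None,
        })
    return nodes


sample = "# Degree\n\n## Overview\n- Duration: 3 years\n\n## Features\n- Limited enrollment\n"
for node in split_by_headings(sample):
    print(node["heading"], "|", node["text"])
```

Each resulting section is small and topically focused, which tends to make retrieval more precise than embedding a whole document at once.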

In [21]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    KeywordExtractor,
)

parser = MarkdownNodeParser(
    include_prev_next_rel=True,
    include_metadata=True,
)

# Other metadata extractors could be used, but they require LLM calls and hit the API rate limit

# question_extractor = QuestionsAnsweredExtractor(
#     questions=3
# )

# keyword_extractor = KeywordExtractor(
#     keywords=10
# )

# We may also save the index to disk

index = VectorStoreIndex.from_documents(
    documents,
    transformations=[parser],
    show_progress=True,
)

query_engine = index.as_query_engine(similarity_top_k=10)

Parsing nodes:   0%|          | 0/3 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/48 [00:00<?, ?it/s]

With the query_engine in place, we can now query the documents.

In [24]:
res = query_engine.query("What are the first year courses of mathematics?")
print(res)

Mathematical Analysis A1, Mathematical Analysis A2, Geometry A, Algebra A, Computer Science, English B1.
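Conceptually, `similarity_top_k=10` asks the retriever for the ten nodes whose embeddings are most similar to the query embedding, and the LLM then answers from those nodes. The ranking step can be sketched as follows, assuming cosine similarity as the score; `cosine` and `top_k` are hypothetical helpers, and the two-dimensional vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, node_vecs, k):
    """Return (index, score) pairs of the k vectors most similar to query_vec."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(node_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up node embeddings and a query that points mostly along the first axis
node_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, score in top_k([1.0, 0.1], node_vecs, k=2):
    print(index, round(score, 3))
```

A larger `k` gives the LLM more context to answer from, at the cost of a longer prompt.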


---

# Other

In [22]:
import json

# Dump the docstore nodes to nodes.json for debugging.
# The node dict is stringified first, so the file holds a single JSON string.
nodes = index.docstore.docs
j = json.dumps(str(nodes))

with open("nodes.json", "w") as f:
    f.write(j)

Can you give me two short (max 20 words) questions whose answer is contained in the following text? Please include in the questions as much context as you can. Give me also the answers. Use JSON format such as: [{question: '[QUESTION]', answer: '[ANSWER]' }, ...].\n\n

In [13]:
import os

files = []
for x in os.listdir("documents"):
    if x.endswith(".md"):
        files.append(x)

In [14]:
questions = []

# Prompt asking the LLM for question/answer pairs about a given text
prompt = (
    "Can you give me two short (max 20 words) questions whose answer is contained in the following text? "
    "Please include in the questions as much context as you can. Give me also the answers. "
    "Return just this JSON format such as: [{question: '[QUESTION]', answer: '[ANSWER]' }, ...].\n\n"
)

for file in files:
    with open(f"documents/{file}") as fh:
        content = fh.read()
    # Note: the full document is appended to the prompt, which can exceed the model's context window
    res = query_engine.query(prompt + content)
    print(res.response)
    questions.append(res.response)

print(questions[:3])

ValueError: Calculated available context size -1317 was not non-negative.
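This error is most likely raised because the whole document is appended to the prompt, which leaves no room for retrieved context within the 8192-token window of `llama3-8b-8192`. One possible workaround, sketched here under the assumption that a character budget is an acceptable rough proxy for a token budget, is to split each document into smaller pieces before prompting. `split_into_chunks` and `max_chars` are hypothetical names, and the default value is only a guess.

```python
def split_into_chunks(text, max_chars=4000):
    """Split text into pieces of at most max_chars, preferring paragraph breaks."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single oversized paragraph is cut hard at the character limit
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


text = "aaaa\n\nbbbb\n\ncccccccccccc"
print(split_into_chunks(text, max_chars=10))
# prints ['aaaa\n\nbbbb', 'cccccccccc', 'cc']
```

Each chunk could then be sent with the prompt on its own, for example directly to the LLM client with `llm.complete(...)` instead of the query engine, since the text to ask about is already given.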

In [None]:
# Next, for each question in the q.json file, we try to answer it using the LLM
