<a href="https://colab.research.google.com/github/Lcs002/TFG-Ontology-Generation/blob/main/tfg_ontology_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook Setup

In [None]:
!pip install -q -U google-genai
!pip install PyPDF2==3.0.1

In [21]:
from google import genai
from google.genai import types
from google.colab import drive
from google.colab import auth
from google.auth import default
from openai import OpenAI
from transformers import AutoTokenizer
import os
import toml
import gspread
import base64
import requests
import PyPDF2
import pathlib
import pandas


def split_pdf(pdf_path, num_chunks):
  """Splits a PDF into chunks and saves them as new files."""

  base_filename = os.path.splitext(os.path.basename(pdf_path))[0]
  folder_path = "/content/" + base_filename
  os.makedirs(folder_path, exist_ok=True)

  with open(pdf_path, 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    num_pages = len(pdf_reader.pages)
    chunk_size = num_pages // num_chunks  # Pages per chunk

    for i in range(num_chunks):
      start_page = i * chunk_size
      end_page = min((i + 1) * chunk_size, num_pages)  # Handle last chunk

      pdf_writer = PyPDF2.PdfWriter()
      for page_num in range(start_page, end_page):
        pdf_writer.add_page(pdf_reader.pages[page_num])

      chunk_filename = os.path.join(folder_path, f"{base_filename}_chunk_{i + 1}.pdf")
      with open(chunk_filename, 'wb') as chunk_file:
        pdf_writer.write(chunk_file)


def read_pdf_bytes(pdf_path):
  """Converts a PDF file to bytes."""
  filepath = pathlib.Path(pdf_path)
  return filepath.read_bytes()

def pdf_to_base64(pdf_path):
  """Converts a PDF file to base64."""
  return base64.b64encode(read_pdf_bytes(pdf_path)).decode('utf-8')

def read_pdf_text(pdf_path):
  """Converts a PDF file to text."""

  text = ""
  try:
    with open(pdf_path, 'rb') as file:
      reader = PyPDF2.PdfReader(file)
      for page_num in range(len(reader.pages)):
        page = reader.pages[page_num]
        text += page.extract_text()
  except FileNotFoundError:
    print(f"Error: PDF file not found at {pdf_path}")
    return None
  except Exception as e:
    print(f"Error extracting text from PDF: {e}")
    return None
  return text

with open("/content/drive/MyDrive/TFG/globals.toml", 'r') as f:
  GLOBALS = toml.load(f)

pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_colwidth', None)

auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

p = """Given the annexed pdf and the following questions based on the document information that must be possible to answer by the resulting ontology, generate a precise, complete and coherent Ontology in Turtle RDF (.ttl) representing it's information.
Questions:
  - What is the name of the university?
  - What is the name of the career?
  - What is the code of the career?
  - What is the name of the course?
  - What is the code of the course?
  - What are the previous course recomendations for the course?
  - What are the previous knowledge recomendations for the course?
  - What topics does the course have?
  - What activities does the course have?
  - What competencies the course have?
  - What are the code and description of each competency?
  - What professors teach the course?
  - What are the email, name and office code of some professor?
The output must be only the content of the .ttl and not be inside a code block."""

# ***Ontology Generation via LLMs for Technical Documents***

## Must

- Investigate the viability of LLM usage for Ontology generation.
	- Test with many **LLM's**
	- Test with many **Prompts**
	- Test with many **Documents**

## Should

- Investigate via Blind-Generation. Only feed the LLM - Ontology Generation with the pdf, and request an Ontology.
	- Explain why Blind-Generation of Ontology is not appropriate.

- Investigate via Guided-Generation. Feed the LLM - Ontology Generation with the needed Information (Entities and Relationships) in Natural Language.
	- "Tokenize" the PDF content to extract Entities, Relationships, etc in a Deterministic manner. Feed LLM - Ontology Generation with the output.

- Investigate via Purpose-Generation. Feed the LLM - Ontology Generation with questions that must be possible to answer by the ontology.

- Investigate generation with diferent Temperatures.

- Investigate generation with Descriptions (System Prompt)

- Investigate and explain why parsing the pdf information to markdown could be more beneficial than passing all pdf data as base64

- Investigate if the usage of OWL would be better than the usage of TTL

## Could

- Investigate Models' Thinking Process
- Process every document in English (even Spanish documents)

## Wont

- Investigate Ontology Automatic Validation
- Handle Image Interpretation.
- Handle Table Interpretation.

## Ask

- Usage of a golden-standard ontology for ontology evaluation. Then calcualte the intersection of both ontologies.

## Challenges

- LLM hallucinations
- Language Ambiguity
- Number of Tests (Must not be too time consuming but also not be very small)
- Comparisons and Evaluation
- Indeterminism (Different outcomes for same entries)

## Tools

- Testing and Automatization: Python
	- PDF parsing: PyPDF
- Notebook: Google Colab
- Results: Google Sheets
- Ontology Format Validation: http://ttl.summerofcode.be/
- Ontology Visualizer: https://webprotege.stanford.edu/
- LLM - Google AI Studio: https://aistudio.google.com/
- LLM - OpenRouter: https://openrouter.ai/

## Used Models

Data taken from:
- https://lmarena.ai/ at date 2025/19/02
- https://llm-stats.com/ at date 2025/19/02

| Name | Rank | License | Context |
|-|-|-|-|
| Gemini-2.0-Flash-Thinking-Exp-01-21 | 2 | Propietary | 1.000.000 |
| Gemini-2.0-Flash-001 | 5 | Propietary | 1.048.576 |
| Grok 2 | 18 | Propietary | 128.000 |
| CPT 4o | 2 | Propietary | 128.000 |

# 1 **Blind-Generation**



## 1.1 Hypothesis

*Given a pdf with technical information the usage of a LLM could be sufficient to generate a ontology in a consistent and deterministic manner that represents and contains all contained information.*

## 1.2 Entry

### 1.2.1 Documents

In [3]:
doc_1 = "/content/drive/MyDrive/TFG/docs/doc-1_upm-gpr-assignment.pdf"

### 1.2.2 Prompts

In [4]:
ppt_1 = "Given the annexed pdf, generate a precise, complete and coherent Ontology in Turtle RDFS (.ttl) representing it's information. The output must be only the content of the .ttl and not be inside a code block."

ppt_2 = "Given the following base64 encoded pdf, first decode it, and then generate a precise, complete and coherent Ontology in Turtle RDFS (.ttl) representing the information that it contains. Do not make an ontology of the PDF structure itself, but of its contained information. The output must be only the content of the .ttl and not be inside a code block."

## 1.3 Definition of "Valid"
The ontology is seemed as comprehensive **if and only if**:
- Its content correctly represents an TTL file.


## 1.4 Definition of "Comprehensive"
The ontology is seemed as comprehensive **if and only if**:
- It contains all information considered important.
- The information considered important is well defined.

## 1.5 Definition of "Important Information"


### 1.5.1 For doc_1

- I1 Course Name
- I2 Course Code
- I3 Course University Name
- I4 Course Career Name
- I5 Course Career Code
- I6 Course Career Center
- I7 Course Academic Year
- I8 Course Course
- I9 Course Semester
- I10 Course Credits
- I11 Course Mandatory
- I12 Course Professors
  - I12.1 Course Professor Name
  - I12.2 Course Professor Email
  - I12.3 Course Professor Tutor Hours
  - I12.4 Course Professor's Office
- I13 Recommended Previous Coursed Courses
- I14 Recommended Other Previous Knowledge
- I15 Course Competencies
  - I15.1 Course Competency Code
  - I15.2 Course Competency Description
- I16 Course Learning Results
  - I16.1 Course Learning Results Code
  - I16.2 Course Learning Results Description
- I17 Course Description
- I18 Course Topics
- I19 Course Activities
- I20 Course Evaluation Criteria
- I21 Course Didactic Resources

## 1.6 Tests


### 1.6.1 Base Tests [1.x.x]

##### **Generation**

1. File text extraction.
1. LLM feeding with extracted text and prompt.
2. Resulting Ontology

##### **Validation**

Manually look for missing or wrongly represented data in the ontology, using as reference the defined **important information**.

Each test has a **Comprehensiveness** field. This field indicates the percentage of important information that is present and well defined.

##### **Notes**

As passing the entire pdf as bytes would consume a high amount of tokens, and some Models would not handle such values. Thus, extraction of the pdf information as text is needed.

#### 1.6.1.1 With Gemini 2.0 Flash [1.1.x]

##### **Implementation**

In [None]:
prompt = ppt_1
document = doc_1
output_path = "/content/gemini-2.0-flash-result.ttl"

# Using Google AI Studio
client = genai.Client(api_key=GLOBALS["keys"]["google-ai-api"])
response = client.models.generate_content(
  model="gemini-2.0-flash",
  contents=[
    prompt + " Pdf: " + read_pdf_text(document)
  ]
)

with open(output_path, 'w') as f:
  f.write(response.text)

##### **Results**


In [28]:
pandas.DataFrame.from_records(gc.open_by_url(GLOBALS["blind-gen"]["validation"]["url"]).get_worksheet_by_id(0).get_all_values())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34
0,Test Number,Prompt,Document,Result,I1 - Course Name,I2 - Course Code,I3 - Course University Name,I4 - Course Career Name,I5 - Course Career Code,I6 - Course Career Center,I7 - Course Academic Year,I8 - Course Course,I9 - Course Semester,I10 - Course Credits,I11 - Course Mandatory,I12 - Course Professors,I12.1 - Course Professor Name,I12.2 - Course Professor Email,I12.3 - Course Professor Tutor Hours,I12.4 - Course Professor's Office,I13 - Recommended Previous Coursed Courses,I14 - Recommended Other Previous Knowledge,I15 - Course Competencies,I15.1 - Course Competency Code,I15.2 - Course Competency Description,I16 - Course Learning Results,I16.1 - Course Learning Results Code,I16.2 - Course Learning Results Description,I17 - Course Description,I18 - Course Topics,I19 - Course Activities,I20 - Course Evaluation Criteria,I21 - Course Didactic Resources,Valid,Comprehensiveness
1,1,ppt-1,doc-1,blind-gen_test-1.1.1,Concatenated with I2,Concatenated with I1,,Concatenated with I5,Concatenated with I4,,,,,,,,,,,,,,,"Present in Individual name, but not as its property",,,"Present in Individual name, but not as its property",,,,,,,YES,05172413793
2,2,ppt-1,doc-1,blind-gen_test-1.1.2,Concatenated with I2,Concatenated with I1,,Concatenated with I5,Concatenated with I4,,,,,,,,,,,,Does not differ between I13 and I14,,,,,,,,,,,,,YES,06896551724
3,3,ppt-1,doc-1,blind-gen_test-1.1.3,Concatenated with I2,Concatenated with I1,,Concatenated with I5,Concatenated with I4,,,,,,,Not related to the course,,,,,,,,,,,,,,,,,,YES,0275862069


##### **Observations**

- All generated ontology is valid.
- None of the tests succeeded meeting all the required information.
- The ontology's Classes and Properties often represent the same data but in different ways.
- Some information is often not present such as:
  - Professor Tutor Hours
  - Course Description
  - Course Topics
  - Evaluation Criteria
  - Recomended Other Previous Knowledge

#### 1.6.1.2 Gemini 2.0 Flash Thinking Experimental Free [1.2.x]

##### **Implementation**

In [None]:
prompt = ppt_2
document = doc_1
output_path = "/content/gemini-2.0-flash-thinking-experimental-free-result.ttl"

# Using Open Router
client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=GLOBALS["keys"]["open-router-api"],
)

response = client.chat.completions.create(
  model="google/gemini-2.0-flash-thinking-exp:free",
  messages=[
    {
      "role": 'user',
      "content": prompt + " Pdf: " +  read_pdf_text(document)
    }
  ]
)

try:
  with open(output_path, 'w') as f:
    f.write(response.choices[0].message.content)
except Exception as e:
  print(response)

##### **Results**

In [25]:
pandas.DataFrame.from_records(gc.open_by_url(GLOBALS["blind-gen"]["validation"]["url"]).get_worksheet_by_id(303418187).get_all_values())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34
0,Test Number,Prompt,Document,Result,I1 - Course Name,I2 - Course Code,I3 - Course University Name,I4 - Course Career Name,I5 - Course Career Code,I6 - Course Career Center,I7 - Course Academic Year,I8 - Course Course,I9 - Course Semester,I10 - Course Credits,I11 - Course Mandatory,I12 - Course Professors,I12.1 - Course Professor Name,I12.2 - Course Professor Email,I12.3 - Course Professor Tutor Hours,I12.4 - Course Professor's Office,I13 - Recommended Previous Coursed Courses,I14 - Recommended Other Previous Knowledge,I15 - Course Competencies,I15.1 - Course Competency Code,I15.2 - Course Competency Description,I16 - Course Learning Results,I16.1 - Course Learning Results Code,I16.2 - Course Learning Results Description,I17 - Course Description,I18 - Course Topics,I19 - Course Activities,I20 - Course Evaluation Criteria,I21 - Course Didactic Resources,Valid,Comprehensiveness
1,1,ppt-2,doc-1,blind-gen_test-1.2.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,YES,08275862069
2,2,ppt-2,doc-1,blind-gen_test-1.2.2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NO,
3,3,ppt-2,doc-1,blind-gen_test-1.2.3,,,,,,,,,,,,,,,,,,,,Concatenated with I15.2,Concatenated with I15.1,,Concatenated with I16.2,Concatenated with I16.1,,,Not related to the Course,,,YES,06551724138
4,4,ppt-2,doc-1,blind-gen_test-1.2.4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,YES,08965517241
5,5,ppt-2,doc-1,blind-gen_test-1.2.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,YES,08965517241


##### **Observations**

- For this model, it was required a change in the prompt, to a more concise specification of the ontology content. Otherwise, the model would many times generate an ontology of pdf elements.
- None of the tests succeeded in meeting all required information.
- The ontology's Classes and Properties often represent the same data but in different ways.
- Some information is often not present such as:
  - Course University Name
  - Recomended Previous Knowledge
  - Course Evaluation Criteria

#### 1.6.1.3 Grok 2 [1.3.x]

##### **Implementation**

Using https://grok.com/ feed ppt-1 and doc-1 to the LLM.

##### **Results**

In [26]:
pandas.DataFrame.from_records(gc.open_by_url(GLOBALS["blind-gen"]["validation"]["url"]).get_worksheet_by_id(221064902).get_all_values())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34
0,Test Number,Prompt,Document,Result,I1 - Course Name,I2 - Course Code,I3 - Course University Name,I4 - Course Career Name,I5 - Course Career Code,I6 - Course Career Center,I7 - Course Academic Year,I8 - Course Course,I9 - Course Semester,I10 - Course Credits,I11 - Course Mandatory,I12 - Course Professors,I12.1 - Course Professor Name,I12.2 - Course Professor Email,I12.3 - Course Professor Tutor Hours,I12.4 - Course Professor's Office,I13 - Recommended Previous Coursed Courses,I14 - Recommended Other Previous Knowledge,I15 - Course Competencies,I15.1 - Course Competency Code,I15.2 - Course Competency Description,I16 - Course Learning Results,I16.1 - Course Learning Results Code,I16.2 - Course Learning Results Description,I17 - Course Description,I18 - Course Topics,I19 - Course Activities,I20 - Course Evaluation Criteria,I21 - Course Didactic Resources,Valid,Comprehensiveness
1,1,ppt-1,doc-1,blind-gen_test-1.3.1,,,,,,,,Error in the Value,,,,,,,,,,,,,,,,,,N7A,,,,YES,03793103448
2,2,ppt-1,doc-1,blind-gen_test-1.3.2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NO,
3,3,ppt-1,doc-1,blind-gen_test-1.3.3,,,,,,,,,,,,Not related to Course,,,,,,,Not related to Course,,,Not related to Course,,,,,,,Not related to Course,YES,03793103448


##### **Observations**

#### 1.6.1.4 GPT 4o Reasoning [1.4.x]

##### **Implementation**

Using https://chatgpt.com/ feed ppt-1 and doc-1 to the LLM.

##### **Results**

In [27]:
pandas.DataFrame.from_records(gc.open_by_url(GLOBALS["blind-gen"]["validation"]["url"]).get_worksheet_by_id(1658648548).get_all_values())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34
0,Test Number,Prompt,Document,Result,I1 - Course Name,I2 - Course Code,I3 - Course University Name,I4 - Course Career Name,I5 - Course Career Code,I6 - Course Career Center,I7 - Course Academic Year,I8 - Course Course,I9 - Course Semester,I10 - Course Credits,I11 - Course Mandatory,I12 - Course Professors,I12.1 - Course Professor Name,I12.2 - Course Professor Email,I12.3 - Course Professor Tutor Hours,I12.4 - Course Professor's Office,I13 - Recommended Previous Coursed Courses,I14 - Recommended Other Previous Knowledge,I15 - Course Competencies,I15.1 - Course Competency Code,I15.2 - Course Competency Description,I16 - Course Learning Results,I16.1 - Course Learning Results Code,I16.2 - Course Learning Results Description,I17 - Course Description,I18 - Course Topics,I19 - Course Activities,I20 - Course Evaluation Criteria,I21 - Course Didactic Resources,Valid,Comprehensiveness
1,1,ppt-1,doc-1,blind-gen_test-1.4.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,YES,03103448276
2,2,ppt-1,doc-1,blind-gen_test-1.4.2,,Concatenated with I3,Concatenated with I2,Concatenated with I5,Concatenated with I4,Concatenated with I7,Concatenated with I6,,,,,,,,,,,,,,,,,,,Does not contain Sub-Topics,,,,YES,06206896552


##### **Observations**

## 1.7 Conclusions