<center><img src="../images/logo.png" width="500" /></center>

# LLMs and Ontologies

## 1 Context

As discussed in our previous notebook on ontologies ([Ontologies Part I](Ontologies/Ontologies_Part_I.ipynb)), ontologies provide a semantic way to organise your data. More explicitly, an ontology provides a structured and formal representation of knowledge, defining concepts, relationships, and rules within a specific domain to enable interoperability, reasoning, and data integration. As we showed before, one can use SPARQL to query the RDF triples of an ontology and of an instance thereof. This, of course, requires some knowledge of the structure of your ontology.

Given that Large Language Models (LLMs) have become quite proficient at extracting information from a document ([Integrate and use GPT with Python](GPT_and_Python/GPT_and_Python.ipynb)), how can we leverage an ontology to further improve our queries? We provide a small-scale example with a simple ontology, showing how it can supply semantic and structured information to the LLM and help it give clearer responses within context.


**Please note that you will need access to an LLM to run this notebook. Here we have used OpenAI, for which you will need a license key. Another LLM may be used, possibly with syntax changes; the core methodology is the same.**



## 2 Python Implementation

The steps needed to implement our approach are as follows:

1. Extract ontology data as txt
2. Feed to LLM as context
3. Query LLM and get response

where the biggest obstacle is transforming the ontology into a format that is LLM friendly, i.e., text. 
To do so, we need to ensure that the following concepts are captured in the text:

1. Classes, properties, relationships
2. Inheritance of subclasses
3. Instances of classes
4. Data values of instances

And so we keep these in mind while defining the function to convert an ontology into text.


### 2.1 LLM friendly format

We must keep in mind that an LLM, such as those provided by OpenAI, takes regular text as input.
Therefore, to make the most of the ontology, which in RDF or OWL format might look something like this:

&nbsp;&nbsp;&nbsp; <owl:ObjectProperty rdf:about="#DividedInto">  
&nbsp;&nbsp;&nbsp;   <rdfs:domain rdf:resource="#ReportingFramework"/>  
&nbsp;&nbsp;&nbsp;   <rdfs:range rdf:resource="#Category"/>  
&nbsp;&nbsp;&nbsp; </owl:ObjectProperty>  

it needs to be converted into something like:

&nbsp;&nbsp;&nbsp; **A _ReportingFramework_ is a class that has the property that it is _DividedInto_ some _Categories_.**
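
As a rough illustration of this conversion, the sketch below parses the fragment above with Python's standard `xml.etree.ElementTree` and renders one sentence per object property. It is not the approach used later in this notebook (which relies on Owlready2): the enclosing `rdf:RDF` element with its namespace declarations is added here only so the fragment parses on its own, and the helper names `local_name` and `object_properties_to_sentences`, as well as the sentence template, are hypothetical.

```python
import xml.etree.ElementTree as ET

# The fragment from above, wrapped in an rdf:RDF root (an assumption made
# here so that the namespace prefixes resolve when parsed standalone).
OWL_SNIPPET = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:ObjectProperty rdf:about="#DividedInto">
    <rdfs:domain rdf:resource="#ReportingFramework"/>
    <rdfs:range rdf:resource="#Category"/>
  </owl:ObjectProperty>
</rdf:RDF>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF = "{" + NS["rdf"] + "}"  # ElementTree spells qualified attributes as {uri}name


def local_name(iri):
    """Return the part of an IRI after the last '#', e.g. '#Category' -> 'Category'."""
    return iri.rsplit("#", 1)[-1]


def object_properties_to_sentences(rdf_xml):
    """Render every object property with a domain and a range as one sentence."""
    root = ET.fromstring(rdf_xml)
    sentences = []
    for prop in root.findall("owl:ObjectProperty", NS):
        domain = prop.find("rdfs:domain", NS)
        range_ = prop.find("rdfs:range", NS)
        if domain is None or range_ is None:
            continue  # skip properties without an explicit domain or range
        name = local_name(prop.get(RDF + "about"))
        d = local_name(domain.get(RDF + "resource"))
        r = local_name(range_.get(RDF + "resource"))
        sentences.append(
            f"A {d} is a class that has the property {name}, which links it to a {r}."
        )
    return sentences


for sentence in object_properties_to_sentences(OWL_SNIPPET):
    print(sentence)
```
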

Therefore, we need to define a function like **extract_ontology_info** (below) that will extract all the classes into proper sentences.
It also needs to make proper sentences defining the relationships and properties.
We aim for something like:

&nbsp;&nbsp;&nbsp; Class Hierarchy  
&nbsp;&nbsp;&nbsp; - **ReportingFramework** is a subclass of Thing.  
&nbsp;&nbsp;&nbsp; - **Category** is a subclass of Thing.  

&nbsp;&nbsp;&nbsp; Properties and Inheritance

&nbsp;&nbsp;&nbsp; Class: **ReportingFramework**  
&nbsp;&nbsp;&nbsp; Directly Defined Properties:  
&nbsp;&nbsp;&nbsp; - Object Properties: DividedInto  

&nbsp;&nbsp;&nbsp; Class: **Category**  
&nbsp;&nbsp;&nbsp; Directly Defined Properties:   
&nbsp;&nbsp;&nbsp; - Object Properties: DividedIntoSub  

### 2.2 Setup

In [1]:
# IMPORT MODULES

import openai  # Replace with your LLM provider
from openai import OpenAI
from owlready2 import *

In [None]:
## Configure the OpenAI API Licence key using an explicit LOCAL file
#tmp_file = open("openaikey.txt", "r")
#tmp = tmp_file.readline()
## The licence key is stored in the second line
#the_key = tmp_file.readline().strip()
#tmp_file.close()

# Please replace the string with your license key.
# Do not print the key or leave it in saved cell output: anyone who sees it can use your account.
the_key = "123456789_REPLACE"
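
Rather than placing the key in the notebook itself, it could be read from an environment variable. The sketch below assumes the variable is called `OPENAI_API_KEY`, which is the name the OpenAI Python client looks up by default; the helpers `load_api_key` and `mask_key` are hypothetical and not part of any library.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in the given environment variable, or raise if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return key


def mask_key(key, visible=4):
    """Hide all but the last few characters, so debug output is safe to share."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

With this in place, `the_key = load_api_key()` would replace the hard-coded string, and `print(mask_key(the_key))` confirms that a key was loaded without revealing it.
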


In [4]:
openai.api_key = the_key 
client = OpenAI(api_key=the_key)

### 2.3 Function Definitions

In [None]:
# FUNCTION DEFINITIONS

# Function to extract knowledge from ontology
def extract_ontology_info(onto, save_knowledge_as_txt = False):
    """
    Converts an OWL ontology into structured text for an LLM.
    - Captures class hierarchy and inheritance.
    - Lists properties, including inherited ones.
    - Records individuals (instances) and their values.

    params:
    - onto: Owlready2 ontology format
    - save_knowledge_as_txt: boolean to save converted ontology as txt
    
    """
    text_output = ""


    # Extract all object and data properties
    all_object_properties = list(onto.object_properties())
    all_data_properties = list(onto.data_properties())

    # Extract class hierarchy
    class_hierarchy = {cls.name: [p.name for p in cls.is_a if isinstance(p, ThingClass)]
                       for cls in onto.classes()}

    # Describe the class hierarchy
    text_output += "## Class Hierarchy\n"
    for cls, parents in class_hierarchy.items():
        if parents:
            text_output += f"- **{cls}** is a subclass of {', '.join(parents)}.\n"
        else:
            text_output += f"- **{cls}** is a top-level class.\n"

    text_output += "\n## Properties and Inheritance\n"

    # Extract properties per class
    class_properties = {}
    for cls in onto.classes():
        own_props = {"Object Properties": [], "Data Properties": []}
        inherited_props = {}

        # Find properties that apply to this class
        for prop in all_object_properties + all_data_properties:
            if cls in prop.domain:
                if isinstance(prop, ObjectPropertyClass):
                    own_props["Object Properties"].append(prop.name)
                elif isinstance(prop, DataPropertyClass):
                    own_props["Data Properties"].append(prop.name)

        # Store properties for later instance processing
        class_properties[cls] = own_props

        # Check inherited properties from parent classes
        for parent in cls.is_a:
            if isinstance(parent, ThingClass):  
                for prop in all_object_properties + all_data_properties:
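                    # own_props is a dict of lists, so compare against the listed names
                    own_names = own_props["Object Properties"] + own_props["Data Properties"]
                    if parent in prop.domain and prop.name not in own_names: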
                        inherited_props[prop] = parent.name

        text_output += f"\n### Class: {cls.name}\n"
        if own_props["Object Properties"] or own_props["Data Properties"]:
            text_output += "**Directly Defined Properties:**\n"
            if own_props["Object Properties"]:
                text_output += f"- Object Properties: {', '.join(own_props['Object Properties'])}\n"
            if own_props["Data Properties"]:
                text_output += f"- Data Properties: {', '.join(own_props['Data Properties'])}\n"
        
        if inherited_props:
            text_output += "**Inherited Properties:**\n"
            for prop, parent in inherited_props.items():
                text_output += f"- {prop.name} (from {parent})\n"

    text_output += "\n## Instances and Their Values\n"

    # Extract all individuals (instances)
    for cls in onto.classes():
        instances = list(cls.instances())
        if instances:
            text_output += f"\n### Instances of {cls.name}:\n"
            for inst in instances:
                text_output += f"- **Instance:** {inst.name}\n"

                # Extract direct data properties
                for prop in all_data_properties:
                    # prop.domain lists classes, so test whether the instance belongs to one of them
                    if any(isinstance(d, ThingClass) and isinstance(inst, d) for d in prop.domain):
                        values = getattr(inst, prop.name, None)
                        if values:
                            values = [str(v) for v in values] if isinstance(values, list) else [str(values)]
                            text_output += f"  - {prop.name}: {', '.join(values)}\n"

                # Extract object properties
                for prop in all_object_properties:
                    if any(isinstance(d, ThingClass) and isinstance(inst, d) for d in prop.domain):
                        values = getattr(inst, prop.name, None)
                        if values:
                            values = [str(v) for v in values] if isinstance(values, list) else [str(values)]
                            text_output += f"  - {prop.name}: {', '.join(values)}\n"

                # Check for FunctionalProperty relationships (like IsInstrument)
                for prop in all_object_properties:
                    if FunctionalProperty in prop.is_a and cls in prop.domain:
                        linked_entity = getattr(inst, prop.name, None)
                        if linked_entity:
                            text_output += f"  - **Linked via {prop.name}:** {linked_entity.name} (inherits its properties)\n"
                            # Inherit values from linked instance
                            for linked_prop in class_properties.get(type(linked_entity), {}).get("Data Properties", []):
                                linked_value = getattr(linked_entity, linked_prop, None)
                                if linked_value:
                                    text_output += f"    - Inherited {linked_prop}: {linked_value}\n"

    


    text_output += '-------\n'
    for inst in onto.individuals():
        text_output += f"- Instance: {inst.name} (Class: {inst.is_a[0].name})\n"
        # Extract data property values
        for prop in onto.data_properties():
            if prop in inst.get_properties():
                values = [str(val) for val in prop[inst]]
                text_output += f"  - {prop.name}: {', '.join(values)}\n"


    knowledge = text_output
    if save_knowledge_as_txt:
        print(knowledge)
        # Write the converted ontology to disk so it can be inspected
        with open('part_ii/knowledge_feed.txt', 'w') as f:
            f.write(knowledge)

    return knowledge




# Function to query the LLM
def query_llm(question, knowledge):
    """Send the ontology text followed by the question to the LLM and return its answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Change based on your LLM
        messages=[
            {"role": "system", "content": "You are an expert in ontologies."},
            {"role": "user", "content": f"Ontology knowledge:\n{knowledge}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content

# Main function
def main(onto, save_knowledge_as_txt = False, LOGFILE=True, single_question=''):
    knowledge = extract_ontology_info(onto, save_knowledge_as_txt)

    while True:

        if single_question != '':
            question = single_question
            answer = query_llm(question, knowledge)
            print("Answer:", answer)
            break
        
        question = input("Ask a question about the ontology (or type 'exit' to quit): ")
        if question.lower() == 'exit':
            break
        answer = query_llm(question, knowledge)
        print("Answer:", answer)

        if LOGFILE:
            # Append the question and answer to the log file
            with open('part_ii/OUTPUT_LOG.txt', 'a') as f:
                f.write('------\n')
                f.write('Question: ' + question)
                f.write('\n')
                f.write('\n')
                f.write('Answer: ' + answer)
                f.write('\n')
                f.write('------\n')



## 3 Query the LLM

In the wrapper function we defined to query the LLM, you will notice that we preface all queries by defining the role of the LLM, *"You are an expert in ontologies."*, and then feed it the ontology context first, followed by the question:

    f"Ontology knowledge:\n{knowledge}\n\nQuestion: {question}"

This prompting style gives the LLM the context it needs before it sees the question.
We note, however, that in this simple example we feed the LLM the same context with every query, which is inefficient for substantial products but works here as an example. Ideally, one would a) fine-tune the LLM and/or b) index the knowledge base for Retrieval Augmented Generation (RAG) in order to query the LLM more efficiently. These, of course, require computational resources and are not explored here.
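
To give a feel for the retrieval idea without any extra infrastructure, the sketch below keeps only those sections of the knowledge text that share words with the question. This is a naive keyword-overlap stand-in, not real RAG (which would typically use embeddings and a vector index). It assumes the knowledge text uses `### ` headings as produced by `extract_ontology_info`; the function names and the `top_k` parameter are hypothetical.

```python
import re


def split_sections(knowledge):
    """Split the knowledge text into chunks, one per '### ' heading."""
    chunks = re.split(r"\n(?=### )", knowledge)
    return [c.strip() for c in chunks if c.strip()]


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant_sections(knowledge, question, top_k=2):
    """Return up to top_k sections sharing at least one word with the question."""
    q_tokens = tokenize(question)
    chunks = split_sections(knowledge)
    # sorted() is stable, so sections with equal overlap keep their original order
    ranked = sorted(chunks, key=lambda c: len(q_tokens & tokenize(c)), reverse=True)
    return [c for c in ranked if q_tokens & tokenize(c)][:top_k]


sample_knowledge = (
    "### Class: ReportingFramework\n"
    "- Object Properties: DividedInto\n\n"
    "### Class: Category\n"
    "- Object Properties: DividedIntoSub\n\n"
    "### Instances of Institution:\n"
    "- **Instance:** best_insurance_firm\n"
)
for section in select_relevant_sections(sample_knowledge, "Which institution is available?", top_k=1):
    print(section)
```

The selected sections could then be joined and passed to `query_llm` in place of the full knowledge text. Exact word matching is crude (for example, "institutions" does not match "Institution"), which is one reason embedding-based retrieval is usually preferred.
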

In [None]:
# Load ontology
onto_fname = "onto_repo/instances_SCR_onto.owl"
onto = get_ontology(onto_fname).load()

In [7]:
# Run this cell if you want to keep asking questions continuously
main(onto, save_knowledge_as_txt= True, LOGFILE= True)

## Class Hierarchy
- **ReportingFramework** is a subclass of Thing.
- **Category** is a subclass of Thing.
- **Subcategory** is a subclass of Category.
- **ComputingModel** is a subclass of Thing.
- **Metric** is a subclass of Thing.
- **Dataset** is a subclass of Thing.
- **Indicator** is a subclass of Thing.
- **Datasource** is a subclass of Thing.
- **Institution** is a subclass of Thing.
- **Portfolio** is a subclass of Thing.
- **Position** is a subclass of Thing.
- **Instrument** is a subclass of Thing.
- **Property** is a subclass of Instrument.
- **Bonds** is a subclass of Instrument.

## Properties and Inheritance

### Class: ReportingFramework
**Directly Defined Properties:**
- Object Properties: DividedInto

### Class: Category
**Directly Defined Properties:**
- Object Properties: DividedIntoSub

### Class: Subcategory
**Inherited Properties:**
- DividedIntoSub (from Category)

### Class: ComputingModel
**Directly Defined Properties:**
- Object Properties: HasCategory, Depen

Ask a question about the ontology (or type 'exit' to quit):  exit


### 3.1 More queries

In [9]:
main(onto, single_question=" Use only the knowledge provided in the given ontology. What are all the instances of a metric obtained from a computing model and which computing model?")

Answer: The instance of a metric obtained from a computing model is:

- **Metric:**
  - **Instance:** Property_Risk_sub_module
    - **ObtainedFromModel:** Market_Risk_Module

The computing model associated is:

- **Computing Model:**
  - **Instance:** Market_Risk_Module


In [10]:
main(onto, single_question="Use only the knowledge provided in the given ontology. Which metrics are used to calculate market risk?")

Answer: The metrics used to calculate market risk are:

- **Equity_Risk_sub_module**
- **Interest_Rate_Risk_sub_module**
- **Spread_Risk_sub_module**


In [11]:
main(onto, single_question="Use only the knowledge provided in the given ontology. Which variables are used to calculate various metrics?")

Answer: The variables used to calculate various metrics are:

1. **DependentVariableMetric** from ComputingModel.
2. **ObtainedFromModel** from Metric.
3. **ObtainedFromDataset** from Metric.
4. **DependentVariableIndicatorM** from Metric.
5. **UsesDataset** from Indicator. 

These properties connect metrics to their relevant models and datasets.


In [12]:
main(onto, single_question="Use only the knowledge provided in the given ontology. Which reporting body supervises the corresponding reporting framework")

Answer: The reporting body is the  
European Insurance and  
Occupational Pensions Authority.


In [13]:
main(onto, single_question="Use only the knowledge provided in the given ontology. Where can one obtain the data to calculate property risk?")

Answer: You can obtain the data to calculate property risk  
from the **Dataset** instance: book_assets.csv.  
This dataset is linked to the necessary metrics  
to evaluate property risk.


In [14]:
main(onto, single_question="Use only the knowledge provided in the given ontology. Who controls this dataset?")

Answer: The dataset "book_assets.csv" is controlled by  
the institution "best_insurance_firm" through the  
property "OwnsDataset".


In [15]:
main(onto, single_question="Use only the knowledge provided in the given ontology. Which institutions are available?")

Answer: The available institution is:

- best_insurance_firm


In [16]:
main(onto, single_question="Use only the knowledge provided in the given ontology. How many positions does it have?")

Answer: There are four positions: position_1, position_2,  
position_3, and position_4.
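
Note that follow-up questions such as "Who controls this dataset?" refer back to an earlier answer, while `query_llm` sends every question on its own, without the preceding exchange. A minimal sketch of keeping the conversation history is given below. The helper `make_chat` is hypothetical, and `ask_llm` stands for any callable that takes a list of chat messages and returns the answer text; moving the "use only the knowledge provided" instruction into the system message is a design choice made here, not something taken from the notebook.

```python
def make_chat(knowledge, ask_llm):
    """Return an ask(question) function that remembers earlier questions and answers.

    ask_llm is a stand-in for the real client call: it receives the full
    message list and must return the model's answer as a string.
    """
    messages = [
        {"role": "system",
         "content": "You are an expert in ontologies. "
                    "Use only the knowledge provided in the given ontology."},
        {"role": "user", "content": f"Ontology knowledge:\n{knowledge}"},
    ]

    def ask(question):
        messages.append({"role": "user", "content": f"Question: {question}"})
        answer = ask_llm(list(messages))  # pass a copy so the callee cannot mutate history
        messages.append({"role": "assistant", "content": answer})
        return answer

    return ask, messages
```

With the client defined earlier, `ask_llm` could be `lambda msgs: client.chat.completions.create(model="gpt-4o-mini", messages=msgs).choices[0].message.content`, so that each new question is answered with the previous exchange in view.
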


### 3.2 Use the LLM to extract ontological info from new documents:

In [None]:
with open('part_ii/equity_risk_submodule.txt', 'r') as file:
    text = file.read()

main(onto, single_question="Using the given ontology as a framework, convert from the following text, into instances of the ontology. Provide your answer in points. Make sure to grab institutions or regulatory bodies. Text: "+text)

Answer: - **Instance:** Equity_Risk_Sub_Module (Class: Metric)
  - Article: 168

- **Linked to:** Market_Risk_Module (Class: ComputingModel)

- **Instance:** European_Parliament (Class: Institution)
  - Regulatory Body: European Insurance Authority

- **Instance:** European_Council (Class: Institution)
  - Regulatory Body: Directive 2009/138/EC

- **Instance:** collective_investment_undertakings (Class: Instrument)

- **Instance:** qualifying_social_entrepreneurship_funds (Class: Instrument)
  - Article: 3(b) of Regulation (EU) No 346/2013

- **Instance:** qualifying_venture_capital_funds (Class: Instrument)
  - Article: 3(b) of Regulation (EU) No 345/2013

- **Instance:** closed_ended_unleveraged_AIF (Class: Instrument)
  - Article: 35 or 40 of Directive 2011/61/EU


### Remark:

We note that the quality of the output can naturally differ depending on the LLM used. Furthermore, in our example, we have provided only a simple ontology that has returned acceptable results in classifying new text under the ontological framework. With even more examples, one would be able to fine-tune an LLM to better classify new documents to help populate and develop their ontology.
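
As a possible first step towards populating the ontology from such an answer, the bullet points could be parsed into (instance, class) pairs. The sketch below assumes the answer follows the `- **Instance:** name (Class: ClassName)` pattern seen above; since LLM output is not guaranteed to keep this format, lines that do not match are simply skipped. The name `parse_instances` is hypothetical.

```python
import re

# Matches lines such as "- **Instance:** European_Parliament (Class: Institution)"
INSTANCE_LINE = re.compile(
    r"^\s*-\s*\*\*Instance:\*\*\s*(?P<name>\S+)\s*\(Class:\s*(?P<cls>\w+)\)\s*$"
)


def parse_instances(answer):
    """Return a list of (instance_name, class_name) pairs found in the answer text."""
    pairs = []
    for line in answer.splitlines():
        match = INSTANCE_LINE.match(line)
        if match:
            pairs.append((match.group("name"), match.group("cls")))
    return pairs


sample_answer = (
    "- **Instance:** Equity_Risk_Sub_Module (Class: Metric)\n"
    "  - Article: 168\n\n"
    "- **Linked to:** Market_Risk_Module (Class: ComputingModel)\n\n"
    "- **Instance:** European_Parliament (Class: Institution)\n"
)
print(parse_instances(sample_answer))
```

Each pair would still need to be reviewed before being added to the ontology, since the LLM may assign a class that a domain expert would not.
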

## 4 Outlook

Here we have provided a very simple implementation of feeding an LLM with an ontology.
However, it is clear that the structure of an ontology helps LLMs to retrieve contextually relevant information.

This code demonstrates how ontologies and large language models can be effectively integrated to enhance knowledge retrieval and contextual understanding. By reading an ontology from an OWL file, the script extracts its structure, including classes, properties, and instances, and transforms this information into structured text, which makes it easier for the LLM to understand the nuances of our real-world requirements.

With even more data and computational resources, ontologies can be used with LLMs in several more powerful ways to improve knowledge retrieval, reasoning, and consistency, such as:

1. **Knowledge Enhancement (Retrieval Augmented Generation):** Ontologies structure domain knowledge, which can be retrieved and provided as context to an LLM. We have only scratched the surface of this idea conceptually.
2. **Better Understanding of Queries:** LLMs can use ontologies to disambiguate terms (e.g., “bank” as a financial institution vs. a riverbank). This helps interpret user intent more precisely by mapping terms to concepts.
3. **Reasoning and Consistency Checking:** Ontologies define rules (e.g., “All mammals have lungs”). LLMs can use these rules to verify facts and ensure consistent answers.
4. **Data Integration and Structuring:** LLMs can generate structured outputs (like JSON or RDF) based on ontology definitions. Example: If an ontology defines "Patient → hasCondition → Disease," an LLM can generate structured knowledge graphs.
5. **Fine-tuning LLMs:** Instead of using general knowledge, an LLM can be fine-tuned on an ontology’s data to specialize in a specific field (e.g., finance, healthcare, energy).
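
To illustrate points 3 and 4 together, the sketch below checks JSON triples, as an LLM might return them, against the "Patient → hasCondition → Disease" definition. Everything here is a simplifying assumption: the JSON layout with `subject`, `predicate` and `object` keys, the `schema` and `instance_classes` dictionaries, and the function name `validate_triples`. Subclass relationships are ignored; a real check would rather rely on an OWL reasoner.

```python
import json


def validate_triples(raw_json, schema, instance_classes):
    """Return a list of error messages for triples that violate the schema.

    schema maps a property name to its (domain, range) classes;
    instance_classes maps an instance name to its class name.
    """
    errors = []
    for triple in json.loads(raw_json):
        s, p, o = triple["subject"], triple["predicate"], triple["object"]
        if p not in schema:
            errors.append(f"unknown property: {p}")
            continue
        domain, range_ = schema[p]
        if instance_classes.get(s) != domain:
            errors.append(f"{s} is not a {domain}")
        if instance_classes.get(o) != range_:
            errors.append(f"{o} is not a {range_}")
    return errors


schema = {"hasCondition": ("Patient", "Disease")}
instance_classes = {"alice": "Patient", "bob": "Patient", "flu": "Disease"}

llm_output = json.dumps([
    {"subject": "alice", "predicate": "hasCondition", "object": "flu"},
    {"subject": "flu", "predicate": "hasCondition", "object": "alice"},
    {"subject": "bob", "predicate": "treats", "object": "flu"},
])
for error in validate_triples(llm_output, schema, instance_classes):
    print(error)
```
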

## Sources

[[1] Ontology Wiki](https://en.wikipedia.org/wiki/Ontology_(information_science))\
[[2] Ontology vs. KG](https://enterprise-knowledge.com/whats-the-difference-between-an-ontology-and-a-knowledge-graph/)\
[[3] ESG Ontology](https://www.mdpi.com/2079-9292/13/9/1719)\
[[4] SCR](https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32015R0035)\
[[5] Owlready2](https://owlready2.readthedocs.io/en/latest/)
