# KG creation with LLM

* GitHub: https://github.com/fusion-jena/automatic-KG-creation-with-LLM
* Code that reproduces the experiments using the data and code published on GitHub
* The model is GPT-4.1

## (1) Data collection, (2) CQ generation

* This step is done manually

1. Collect papers related to the target domain
2. (Based on the papers) write an Ontology Requirement Specification Document (ORSD) for building the domain ontology
3. Based on the ORSD, generate the questions (Competency Questions, CQs) needed to build the ontology
    - e.g. What methods are utilized for collecting raw data in the deep learning pipeline (e.g., surveys, sensors, public datasets)?

## (3) Ontology creation

* LLM_loader.py, helper_functions.py

#### 🚫 Mixtral-8x22B-Instruct-v0.1/sentence-transformers/all-MiniLM-L12-v2 -> hard to run locally (on hold)

In [1]:
pip install ipywidgets

In [2]:
pip install sentence-transformers

In [3]:
pip install -U bitsandbytes

In [4]:
pip install accelerate

In [None]:
import os
import re
import transformers
from langchain.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain import HuggingFacePipeline
import torch
from huggingface_hub import login
from dotenv import load_dotenv

load_dotenv(override=True)
access_token_read = os.getenv('HUGGINGFACE_API_KEY')

login(token=access_token_read)

In [None]:
def load_llm(model_id,embedding_model_id):
    bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
    )
    # embeddings = HuggingFaceEmbeddings(model_name=embedding_model_id,model_kwargs={'device': 'cuda'})
    embeddings = HuggingFaceEmbeddings(model_name=embedding_model_id)
    model_config = transformers.AutoConfig.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map='auto')
    model.eval()
    pipe = pipeline(model=model, tokenizer=tokenizer,
        #return_full_text=True,  # langchain expects the full text
        task='text-generation',
        temperature=0.00001,
        max_new_tokens=25000,  # max number of tokens to generate in the output
        #repetition_penalty=1.1,  # without this output begins repeating
        device_map = "auto"
    )
    llm=HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature':0.00001})
    return llm

def get_embeddings(embedding_model_id):
    return HuggingFaceEmbeddings(model_name=embedding_model_id,model_kwargs={'device': 'cuda'})

In [None]:
# from helper_functions import load_llm
from configparser import ConfigParser

# config = ConfigParser()
# config.read('config.ini')
# model_id = config.get('Models', 'model_id')
# embedding_model_id = config.get('Models', 'embedding_model_id')

model_id = 'mistralai/Mixtral-8x22B-Instruct-v0.1'
embedding_model_id = 'sentence-transformers/all-MiniLM-L12-v2'

llm = load_llm(model_id, embedding_model_id)

#### OpenAI API

In [1]:
import os
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

openai_api_key = os.getenv('OPENAI_API_KEY')

llm = ChatOpenAI(model="gpt-4.1", temperature=0)
embedding = OpenAIEmbeddings(model='text-embedding-3-large')  

##### Generation (Concepts_relations_generate.py)

* Extracts Concepts, Relations, DataProperties, and InverseProperties from the CQs

In [3]:
# helper_functions.py
def read_txt(txt_path):
    with open(txt_path,'r') as f:
        content = f.read()
    return content

def load_cqs(CQs_path):
    with open(CQs_path) as f:
        lines = f.readlines()
    # Strip only the trailing newline so a final line without one is not truncated
    CQs = [l.rstrip('\n') for l in lines]
    return CQs

In [None]:
# from helper_functions import load_llm, read_txt, load_cqs
from langchain.prompts import PromptTemplate
# from LLM_loader import llm

def Concepts_relations_generate():
    # template = read_txt(config.get('Paths', 'Concepts_and_relationships_prompt_path'))
    template = read_txt('../Ontology/Mixtral_8_22b/Prompt/Concepts_relationships_dataproperties_extraction.txt')
    prompt_template = PromptTemplate(input_variables=["CQs"], template=template)
    # prompt = prompt_template.format(CQs=load_cqs(config.get('Paths', 'CQs_path')))
    prompt = prompt_template.format(CQs=load_cqs('../CQs/CQs.txt'))  # insert the CQs into the prompt

    with open('concept_relation_data_inverse.txt',"w") as f:
        f.write(llm.invoke(prompt).content)
        
Concepts_relations_generate()

##### Conversion (Ontology_creation.py)

* Converts the extracted Concepts, Relations, DataProperties, and InverseProperties to OWL
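
Before the conversion, the saved LLM output has to be split back into its four sections. The following is a minimal sketch of that split, assuming the file uses the `Concepts:`, `Relationships:`, `DataProperties:` and `InverseProperties:` headers at the start of a line. The `split_sections` helper and the sample text are illustrative and not part of the repository.

```python
import re

SECTION_HEADERS = ["Concepts", "Relationships", "DataProperties", "InverseProperties"]

def split_sections(content):
    """Return {header: text} for every known header found in content."""
    pattern = r"^(%s):" % "|".join(SECTION_HEADERS)
    matches = list(re.finditer(pattern, content, flags=re.MULTILINE))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next header, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        sections[m.group(1)] = content[m.end():end].strip()
    return sections

# Hypothetical sample in the assumed format
sample = """Concepts: Hyperparameter, Model
Relationships: hasHyperparameter, hasModel
DataProperties: hasValue
InverseProperties: isHyperparameterOf
"""

sections = split_sections(sample)
print(sections["Concepts"])
print(sections.get("InverseProperties"))
```

One reason to prefer this over slicing with `str.find` is that `find` returns -1 when a header is missing, which silently yields a wrong slice; here a missing header is simply absent from the returned dict.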

In [4]:
pip install rdflib

In [18]:
# helper_functions.py
from rdflib import Graph, Namespace

def get_base_onto_class(owl_file):
    g = Graph()
    g.parse(owl_file)

    # Define commonly used namespaces
    RDF = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")
    OWL = Namespace("http://www.w3.org/2002/07/owl#")

    classes = []
    relations = []
    # Find classes
    for class_uri in g.subjects(RDF.type, OWL.Class):
        class_label = class_uri.split('#')[-1]
        classes.append(class_label)

    # Find properties
    for prop_uri in g.subjects(RDF.type, OWL.ObjectProperty):
        prop_label = prop_uri.split('#')[-1]
        relations.append(prop_label)
    return classes, relations

In [47]:
# from helper_functions import load_llm, read_txt, get_base_onto_class
from langchain.prompts import PromptTemplate
from rdflib import Graph, Namespace
# from LLM_loader import llm

def Ontology_creation():
    # template = read_txt(config.get('Paths', 'Ontology_creation_prompt'))
    template = read_txt('../Ontology/Mixtral_8_22b/Prompt/Ontology_creation_with_data_props.txt')
    prompt_template = PromptTemplate(input_variables=["concepts", "relations","data_properties","InverseProperties","base_onto_class","base_onto_property"], template=template)
    # content = read_txt(config.get('Paths', 'Concepts_and_relationships_save_path'))
    content = read_txt('../Ontology/Mixtral_8_22b/Concepts_relations/concept_relation_data_inverse.txt')
    concepts_s_ind = content.find('Concepts:') + len('Concepts:')
    concepts_e_ind = content.find('Relationships:')
    concepts = content[concepts_s_ind:concepts_e_ind].strip()
    
    relationships_s_ind = content.find('Relationships:') + len('Relationships:')
    relationships_e_ind = content.find('DataProperties:')
    relationships = content[relationships_s_ind:relationships_e_ind].strip()
    
    data_properties_s_ind = content.find('DataProperties:') + len('DataProperties:')
    data_properties_e_ind = content.find('InverseProperties:')
    data_properties = content[data_properties_s_ind:data_properties_e_ind].strip()
    
    inverse_properties_s_ind = content.find('InverseProperties:') + len('InverseProperties:')
    inverse_properties = content[inverse_properties_s_ind:].strip()
    # base_onto_class, base_onto_property = get_base_onto_class(config.get('Paths', 'SOTAOntology_path'))
    base_onto_class, base_onto_property = get_base_onto_class('../SOTAOntologies/prov-o.owl')  # file with the reference PROV-O ontology

    prompt = prompt_template.format(concepts=concepts,relations=relationships,data_properties=data_properties,InverseProperties=inverse_properties,base_onto_class=base_onto_class,base_onto_property=base_onto_property)
    print(prompt)

    with open('orig_datapropsandinver_v1.owl',"w") as f:
        f.write(llm.invoke(prompt).content)
        
Ontology_creation()

%INSTRUCTIONS:
Use the concepts and relations (properties) and build an ontology in RDF format for describing the provenance of Deep Learning Pipeline.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Don't provide anything other than the ontology in RDF format.

Use the IRI for the base ontology: https://w3id.org/dlprovenance/
Use the ontology classes and relations as the base ontology.
Incorporate data properties, Inverse Properties, and ensure the hierarchy is well-structured. 
Below are the examples and follow the same format for all the questions:

Concepts: Hyperparameter, Model
Relations: hasHyperparameter, hasModel

<?xml version="1.0"?>
<rdf:RDF xmlns="https://w3id.org/dlprovenance#"
     xml:base="https://w3id.org/dlprovenance"
     xmlns:owl="http://www.w3.org/2002/07/owl#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:xml="http://www.w3.org/XML/1998/namespace"
     xmlns:xsd="http://www.w3.org/2001/XMLSche

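##### Opening the converted file (create_ontology.py)

* This code is said to build the ontology from the extracted concept_relation_data_inverse.txt file (which presumably differs for each model), but honestly I am not sure that is what it does
* It appears to only declare the terms, without setting up any actual relations between concepts
    1. Concepts are defined as OWL classes
    2. Relationships are defined as OWL object properties
    3. DataProperties are defined as OWL datatype properties
    4. InverseProperties are defined as OWL object properties
* A comment in the code says this part is only a demonstration and should be replaced with the actual relationships from the data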

In [22]:
import os
import csv
from rdflib import Graph, Namespace, RDF, RDFS, OWL, Literal, URIRef, XSD

models = ['GPT4', 'GPT3.5', 'Gemini']
ontology_folder = "../Ontology/"
concept_relations_folder = "Concepts_relations"

# Define namespaces
DLPROV = Namespace("https://w3id.org/dlprov#")
PROV = Namespace("http://www.w3.org/ns/prov#")

# Function to load concept relations from concept_relations.txt
def load_concept_relations(file_path):
    concepts = []
    relationships = []
    data_properties = []
    inverse_properties = []

    with open(file_path, 'r') as file:
        lines = file.readlines()

        current_section = None
        for line in lines:
            line = line.strip()
            if line.startswith('Concepts:'):
                current_section = 'Concepts'
                concepts.extend(line.split(':')[1].strip().split(', '))
            elif line.startswith('Relationships:'):
                current_section = 'Relationships'
                relationships.extend(line.split(':')[1].strip().split(', '))
            elif line.startswith('DataProperties:'):
                current_section = 'DataProperties'
                data_properties.extend(line.split(':')[1].strip().split(', '))
            elif line.startswith('InverseProperties:'):
                current_section = 'InverseProperties'
                inverse_properties.extend(line.split(':')[1].strip().split(', '))
            elif current_section:
                # If the line doesn't start with one of the section headers, it belongs to the current section
                if line:
                    if current_section == 'Concepts':
                        concepts.extend(line.split(', '))
                    elif current_section == 'Relationships':
                        relationships.extend(line.split(', '))
                    elif current_section == 'DataProperties':
                        data_properties.extend(line.split(', '))
                    elif current_section == 'InverseProperties':
                        inverse_properties.extend(line.split(', '))

    return concepts, relationships, data_properties, inverse_properties


# Function to create RDF triples for concepts, relationships, properties
def create_ontology(model_name, concepts, relationships, data_properties, inverse_properties):
    g = Graph()    
    g.bind("dlprov", DLPROV)
    print(f'model_name:{model_name}')
    print(f'concepts:{concepts}')
    print(f'relationships:{relationships}')

    # Define classes (concepts) as subclasses of PROV:Entity
    for concept in concepts:
        concept_uri = DLPROV[concept]
        print(f'concept_uri:{concept_uri}')
        g.add((concept_uri, RDF.type, OWL.Class))
        g.add((concept_uri, RDFS.subClassOf, PROV.Entity))
        g.add((concept_uri, RDFS.label, Literal(concept, datatype=XSD.string)))

    # Define relationships (object properties)
    for rel in relationships:
        rel_uri = DLPROV[rel]
        print(f'rel_uri:{rel_uri}')
        g.add((rel_uri, RDF.type, OWL.ObjectProperty))
        g.add((rel_uri, RDFS.label, Literal(rel, datatype=XSD.string)))

    # Define data properties
    for prop in data_properties:
        prop_uri = DLPROV[prop]
        print(f'prop_uri:{prop_uri}')
        g.add((prop_uri, RDF.type, OWL.DatatypeProperty))
        g.add((prop_uri, RDFS.label, Literal(prop, datatype=XSD.string)))

    # Define inverse properties
    for inv_prop in inverse_properties:
        inv_prop_uri = DLPROV[inv_prop]
        print(f'inv_prop_uri:{inv_prop_uri}')
        g.add((inv_prop_uri, RDF.type, OWL.ObjectProperty))
        # NOTE: as written, this declares each property as the inverse of itself
        g.add((inv_prop_uri, OWL.inverseOf, inv_prop_uri))
        g.add((inv_prop_uri, RDFS.label, Literal(inv_prop, datatype=XSD.string)))

    # # Define relationships between concepts and properties (for demonstration)
    # # These should be replaced with actual relationships from your data
    # for concept in concepts:
    #     for rel in relationships:
    #         g.add((DLPROV[concept], DLPROV[rel], DLPROV[concept + '_' + rel]))

    # Save the ontology to a Turtle file
    ontology_folder_path = os.path.join(ontology_folder, model_name, 'Ontology')
    os.makedirs(ontology_folder_path, exist_ok=True)
    ontology_file = os.path.join(ontology_folder_path, f'dlprov.ttl')
    print(ontology_file)
    g.serialize(destination=ontology_file, format='turtle')
    print(g.serialize(format='turtle'))

    print(f'Ontology for {model_name} created and saved at {ontology_file}')

# Main script
if __name__ == "__main__":
    # List of models with their concept_relations.txt paths
    for model in models:
        model_folder = os.path.join(ontology_folder, model, concept_relations_folder)
        if not os.path.isdir(model_folder):
            print(f"Warning: {concept_relations_folder} folder not found for {model}. Skipping.")
            continue     
    
    
        concept_relations_file = os.path.join(model_folder, 'concept_relation_data_inverse.txt')
        
        concepts, relationships, data_properties, inverse_properties = load_concept_relations(concept_relations_file)
        create_ontology(model, concepts, relationships, data_properties, inverse_properties)

model_name:GPT4
concepts:['Method', 'RawData', 'DataFormat', 'DataAnnotationTechnique', 'DataAugmentationTechnique', 'Dataset', 'PreprocessingStep', 'DataSplitCriteria', 'CodeRepository', 'DataRepository', 'CodeRepositoryLink', 'DataRepositoryLink', 'DeepLearningModel', 'Hyperparameter', 'HyperparameterOptimization', 'OptimizationTechnique', 'TrainingCompletionCriteria', 'RegularizationMethod', 'ModelPerformanceMonitoringStrategy', 'Framework', 'HardwareResource', 'PostprocessingStep', 'PerformanceMetric', 'GeneralizabilityMeasure', 'RandomnessStrategy', 'ModelPurpose', 'DataBiasTechnique', 'ModelDeploymentProcess', 'DeploymentPlatform']
relationships:['hasMethod', 'hasRawData', 'hasDataFormat', 'hasDataAnnotationTechnique', 'hasDataAugmentationTechnique', 'hasDataset', 'hasPreprocessingStep', 'hasDataSplitCriteria', 'hasCodeRepository', 'hasDataRepository', 'hasCodeRepositoryLink', 'hasDataRepositoryLink', 'hasDeepLearningModel', 'hasHyperparameter', 'hasHyperparameterOptimization', '

## (4) CQ Answering

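* Each PDF file is embedded and stored in a vector DB, and the answer to each question is checked per document; rather than putting all documents into a single vector store, the code seems to store and search each document separately
* According to the paper, answers were generated only with the Mixtral model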
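
The per-document indexing described above can be illustrated without external libraries. The sketch below is a toy stand-in, not the repository's method: it uses bag-of-words cosine similarity in place of real embeddings and FAISS, chunks by sentence rather than by character count, and all document texts and helper names are hypothetical.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Toy chunker: one chunk per sentence."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

def vectorize(text):
    """Bag-of-words term counts (a stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(text):
    """Index a single document as a list of (chunk, vector) pairs."""
    return [(c, vectorize(c)) for c in split_sentences(text)]

def retrieve(index, query, k=1):
    """Return the k chunks of one document most similar to the query."""
    q = vectorize(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Hypothetical documents keyed by publication number
docs = {
    1: "The model was trained with the Adam optimizer. Data came from field sensors.",
    3: "Audio recordings were converted to spectrogram images before training.",
}

# One separate index per document, so a question never retrieves
# chunks that belong to a different publication
indexes = {doc_id: build_index(text) for doc_id, text in docs.items()}

for doc_id, index in indexes.items():
    print(doc_id, retrieve(index, "Which optimizer was used?"))
```

Keeping one index per publication means every answer can be attributed to exactly one paper, which fits the per-publication answer files written in the next step.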

### Answer generation (RAG_CQ_answering_with_LLMs.py)

* Asks 28 questions of each document, producing 30 × 28 = 840 answer files

In [30]:
pip install "unstructured[pdf]"

In [37]:
# from helper_functions import load_llm, load_cqs, read_txt, get_embeddings
from langchain.prompts import PromptTemplate
from langchain.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
import textwrap
import numpy as np
from tqdm import tqdm
# from LLM_loader import llm

def RAG_CQ_answering():
    # CQs = load_cqs(config.get('Paths', 'CQs_path'))
    CQs = load_cqs('../CQs/CQs.txt')
    # embedding_model_id = config.get('Models', 'embedding_model_id')
    # template = read_txt(config.get('Paths', 'RAG_question_answering_prompt'))
    template = read_txt('../RAG_CQ_ans/RAG_question_answering.txt')
    prompt_template = PromptTemplate(input_variables=["query"], template=template)

    # for d in tqdm([1,3,5,7,8,9,10,12,13,14,16,18,19,20,24,25,27,28,33,34,37,38,39,41,42,43,44,45,46,47]):
    for d in tqdm([1,3,5]):
        # loader = UnstructuredFileLoader(f"{config.get('Paths', 'pdfs_path')}{d}.pdf")
        loader = UnstructuredFileLoader(f"../Data/Pdfs/{d}.pdf")
        documents = loader.load()
        text_splitter=RecursiveCharacterTextSplitter(chunk_size=2500, chunk_overlap=100)
        text_chunks=text_splitter.split_documents(documents)
        vectorstore=FAISS.from_documents(text_chunks, embedding)
        chain = RetrievalQA.from_chain_type(llm=llm, chain_type = "stuff",return_source_documents=False, retriever=vectorstore.as_retriever())                                     
        for cq,p in zip(CQs,np.arange(1,len(CQs)+1)):
            prompt = prompt_template.format(query=cq)
            result = chain({"query": prompt}, return_only_outputs=True)
            wrapped_text = textwrap.fill(result['result'], width=100)
            with open(f"RAG_CQ_ans/Original_not_processed/Publication{d}_CQ{p}.txt", 'w') as f:
                f.write(wrapped_text)
                
RAG_CQ_answering()

  0%|          | 0/30 [00:00<?, ?it/s]CropBox missing from /Page, defaulting to MediaBox
  7%|▋         | 2/30 [03:03<42:57, 92.07s/it]CropBox missing from /Page, defaulting to MediaBox

KeyboardInterrupt: 

### Post-processing (Process_CQ_ans.py)

In [38]:
def remove_question_and_query(text):
    if "Question:" in text:
        return text[:text.index("Question:")]
    elif "Query:" in text:
        return text[:text.index("Query:")]
    elif "References" in text:
        return text[:text.index("References")]
    elif "Reference(s):" in text:
        return text[:text.index("Reference(s):")]
    elif "Answer:" in text:
        return text[:text.index("Answer:")]
    elif "%Query Query:" in text:
        return text[:text.index("%Query Query:")]
    elif "%Explanation Explanation:" in text:
        return text[:text.index("%Explanation Explanation:")]
    elif "%Context Context:" in text:
        return text[:text.index("%Context Context:")]
    elif "%Context" in text:
        return text[:text.index("%Context")]
    elif "%Explanation" in text:
        return text[:text.index("%Explanation")]
    else:
        return text

In [39]:
def remove_repeated_sentences(text):
    sentences = text.split('.')
    seen = set()
    result = []
    for sentence in sentences:
        sentence = sentence.strip().replace('\n', ' ')
        sentence = re.sub(r'\s+', ' ', sentence)
        if sentence.strip() not in seen:
            seen.add(sentence.strip())
            result.append(sentence.strip())
    return '. '.join(result)

In [40]:
import os
import re
# from helper_functions import remove_question_and_query, remove_repeated_sentences

def process_cq_ans():
    # input_folder = config.get('Paths', 'Ans_to_cq_input_folder')   
    # output_folder = config.get('Paths', 'Ans_to_cq_output_folder') 
    input_folder = 'RAG_CQ_ans/Original_not_processed/'
    output_folder = 'RAG_CQ_ans/V1_processed/'

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for filename in os.listdir(input_folder):
        if filename.startswith("Pub") and filename.endswith(".txt"):
            with open(os.path.join(input_folder, filename), 'r') as file:
                text = file.read()
                text = text.replace('\n', ' ')
            text = remove_repeated_sentences(text)
            text = remove_question_and_query(text)
            with open(os.path.join(output_folder, filename), 'w') as file:
                file.write(text)
                
process_cq_ans()

## (5) KG construction

### NER(NER.py)

In [74]:
# from helper_functions import load_cqs, read_txt
import numpy as np
# from LLM_loader import llm

def NER():
    # for pub in [1,3,5,7,8,9,10,12,13,14,16,18,19,20,24,25,27,28,33,34,37,38,39,41,42,43,44,45,46,47]:
    for pub in tqdm([1,3]):
        # CQs = load_cqs(config.get('Paths', 'CQs_path'))
        CQs = load_cqs('../CQs/CQs.txt')
        # combined_prompt = read_txt(config.get('Paths', 'NER_prompt'))
        combined_prompt = read_txt('../NER_prompt/NER_Mixtral_prompt.txt')
        Answer_format  = '''
        Answer:::
        Provide your answer as follows:
        Named Entities: For each provided Concept(Corresponding Named Entity,..), ...
        Answer:::
        '''
        # Only 13 of the CQs are used
        for i in tqdm([2,5,6,4,12,15,13,22,17,19,20,8,25]):
            # Answer = read_txt(f"{config.get('Paths', 'Ans_to_cq_output_folder')}Publication{pub}_CQ{i}.txt")
            Answer = read_txt(f'RAG_CQ_ans/V1_processed/Publication{pub}_CQ{i}.txt')
            prompt = f"Q: {CQs[i-1]}\nA: {Answer}\n"
            combined_prompt += prompt
        # Append the answer format once and query the LLM once per publication,
        # after all selected Q/A pairs have been added to the prompt
        combined_prompt += Answer_format
        # print(combined_prompt, '\n==============\n')
        output = llm.invoke(combined_prompt).content
        with open(f"NER/Publication{pub}_concepts.txt", 'w') as f:
            f.write(output)
                
NER()

  0%|          | 0/2 [00:00<?, ?it/s]

%INSTRUCTIONS: 
Your task is to do Named Entity Recognition. You extract named entities from the provided competency question answers. 
Use provided concepts to understand which named entities to extract from competency answers. 
If there are no entities found for a specific concept, please write "Not mentioned" instead of leaving empty parentheses.
Concepts: Method, RawData, DataFormat, DataAnnotationTechnique, DataAugmentationTechnique, Dataset, PreprocessingStep, DataSplitCriteria, CodeRepository, DataRepository, CodeRepositoryLink, DataRepositoryLink, DeepLearningModel, Hyperparameter, HyperparameterOptimization, OptimizationTechnique, TrainingCompletionCriteria, RegularizationMethod, ModelPerformanceMonitoringStrategy, Framework, HardwareResource, PostprocessingStep, PerformanceMetric, GeneralizabilityMeasure, RandomnessStrategy, ModelPurpose, DataBiasTechnique, ModelDeploymentProcess, DeploymentPlatform.         
Example: DeepLearningModel(CNN, RNN, Transformer), Framework(Tensor

 50%|█████     | 1/2 [00:51<00:51, 51.18s/it]

100%|██████████| 2/2 [02:01<00:00, 60.92s/it]


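### KG creation (create_kg.py)

* When a concept has no entities, the LLM emits only empty parentheses `()`, but the code seems to skip a concept only when "Not mentioned" is present, so the NER_Mixtral_prompt.txt prompt was modified slightly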
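
As an illustration of the issue noted above, the sketch below parses `Concept(entity, entity)` lines so that empty parentheses and "Not mentioned" both yield an empty entity list. The line format is assumed from the example in the NER prompt, and `parse_concept_line` is a hypothetical helper, not a function from create_kg.py.

```python
import re

# Placeholder phrases that should not become entities (compared case-insensitively)
EXCLUDE = {"not mentioned", "not provided", "not explicitly mentioned"}

def parse_concept_line(line):
    """Parse 'Concept(entity, entity)' and return (concept, entities), or None."""
    m = re.match(r"\s*([A-Za-z]\w*)\s*\((.*)\)\s*,?\s*$", line)
    if not m:
        return None
    concept, body = m.group(1), m.group(2)
    entities = [e.strip() for e in body.split(",")]
    # Drop empty strings (from '()') and placeholder phrases
    entities = [e for e in entities if e and e.lower() not in EXCLUDE]
    return concept, entities

for line in ["Method(Not mentioned),", "DataFormat(WAV files, PNG files),", "Framework()"]:
    print(parse_concept_line(line))
```

With a parser of this shape, the prompt change would not be strictly necessary, since a concept with no usable entities can be skipped by checking for an empty list.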

In [6]:
import os
import re
import rdflib
from rdflib import Graph, Namespace, Literal, RDF
from urllib.parse import quote

# Define the folders and files paths
# ner_folder = 'NER/'
# kg_folder = 'KG/'


# Define the namespace
DLPROV = Namespace("https://w3id.org/dlprov#")

# Function to read named entities from a publication file
def read_named_entities(file_path):
    named_entities = {}
    exclude_phrases = ["not mentioned", "Not mentioned", "Not Provided",
                       "not explicitly mentioned", "not provided", "Not explicitly mentioned"]
    # Use a context manager so the file is closed after reading
    with open(file_path, "r") as f:
        lines = f.readlines()

    #lines = str(lines).strip().split('\n')


    for line in lines[1:]:  # Skip the first line
        # Using regular expression to correctly split at the first parenthesis
        #line = line.strip().split('\n')
        if 'named entities' in line.lower() or 'for each provided concept' in line.lower():
            continue
        if line.strip():
            if ':' in line:
                concept, entity_str = line.split(':', 1)
                concept = concept.strip()
                entity_str = entity_str.strip()
                entities = [entity.strip() for entity in entity_str.split(',')]
                filtered_entities = [entity for entity in entities if entity not in exclude_phrases]
                if filtered_entities:
                    named_entities[concept] = filtered_entities

            else:
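                concept, entities_str = line.split('(', 1)
                # NOTE: line still ends with '\n' here, so rstrip('),') only removes a
                # trailing ')' or ',' when nothing (not even a newline) follows it
                entities_str = entities_str.rstrip('),')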
                entities = [entity.strip() for entity in entities_str.split(',')]
                filtered_entities = [entity for entity in entities if entity not in exclude_phrases]
                if filtered_entities:
                    named_entities[concept.strip()] = filtered_entities

    return named_entities

# Clean and encode entity names
def clean_and_encode(entity_name):
    # Remove unwanted characters and encode
    entity_name = entity_name.replace('\\', '').replace(')', '').strip()
    return quote(entity_name.replace(' ', '_'))

# Convert named entities to KG
def create_kg(named_entities):

    g = Graph()
    g.bind("dlprov", DLPROV)

    for concept, entities in named_entities.items():
        # print(f'concept: {concept}')
        # print(f'entities: {entities[0]}')
        if ("not mentioned" in entities[0].lower()) or ("not provided" in entities[0].lower()):            
            continue

        concept_uri = DLPROV[clean_and_encode(concept.strip())]


        for entity in entities:
            entity_name, *abbr = entity.split('(')

            if entity_name:
                entity_name = entity_name.strip()
                abbr = abbr[0].strip(')') if abbr else None
                entity_uri = DLPROV[clean_and_encode(entity_name)]
                g.add((entity_uri, RDF.type, concept_uri))
                if abbr:
                    abbr_uri = DLPROV[clean_and_encode(abbr)]
                    g.add((abbr_uri, RDF.type, concept_uri))

    return g


# Function to write the KG to a file
def write_kg_to_file(kg, output_file):
    kg.serialize(destination=output_file, format='turtle')

## Main script
# if __name__ == "__main__":
#     models = ['GPT4', 'GPT3.5', 'Gemini', 'Mixtral_8_22b']
#     for model_name in models:
#         llm_model_path = os.path.join(ner_folder, model_name)
#         kg_model_path = os.path.join(kg_folder, model_name)
#         # Iterate through the publications
#         for publication_file in os.listdir(llm_model_path):
#             if publication_file.endswith('.txt'):
#                 publication_path = os.path.join(llm_model_path, publication_file)
#                 print('Publication Path: {}'.format(publication_path))
#                 named_entities = read_named_entities(publication_path)
#                 print('named_entities: {}'.format(named_entities))


#                 # Create KG
#                 kg = create_kg(named_entities)

#                 # Output KG file
#                 output_file = os.path.join(kg_model_path, publication_file.replace('.txt', '_KG.ttl'))
#                 write_kg_to_file(kg, output_file)



In [7]:
# llm_model_path = os.path.join(ner_folder, model_name)
llm_model_path = 'NER/'
# kg_model_path = os.path.join(kg_folder, model_name)
kg_model_path = 'KG/'
# Iterate through the publications
for publication_file in os.listdir(llm_model_path):
    if publication_file.endswith('.txt'):
        publication_path = os.path.join(llm_model_path, publication_file)
        print('Publication Path: {}'.format(publication_path))
        named_entities = read_named_entities(publication_path)
        print('named_entities: {}'.format(named_entities))

        # Create KG
        kg = create_kg(named_entities)
        
        # Output KG file
        output_file = os.path.join(kg_model_path, publication_file.replace('.txt', '_KG.ttl'))
        write_kg_to_file(kg, output_file)


Publication Path: NER/Publication3_concepts.txt
named_entities: {'Method': ['Not mentioned)', ''], 'RawData': ['Audio recordings of frog calls', 'specifically from the genus Platymantis)', ''], 'DataFormat': ['WAV files', 'PNG files)', ''], 'DataAnnotationTechnique': ['Not mentioned)', ''], 'DataAugmentationTechnique': ['Not mentioned)', ''], 'Dataset': ['Custom datasets of spectrograms derived from bioacoustic recordings', 'spectrogram images generated from Platymantis audio recordings', 'recordings sourced from Cornell Lab of Ornithology Macaulay Library and recent field surveys)', ''], 'PreprocessingStep': ['Audio clipping and standardization', 'Silence addition', 'File formatting', 'Spectrogram generation using warbleR and Seewave', 'Oscillogram parameters with FFT of 512 points and 90% overlap', 'Image saving as PNG)', ''], 'DataSplitCriteria': ['Validation percentage set to 20%)', ''], 'CodeRepository': ['Not mentioned)', ''], 'DataRepository': ['Cornell Lab of Ornithology Macaul

## (6) Evaluation