# From text to relations

In this notebook we’ll explore how to go from an arbitrary text snippet to a set of subject–predicate–object triples using an LLM.

## 1. Load a single TXT file
To inspect the text file, it is loaded directly into the notebook. The file was created from the Wikipedia page of [Napoleon Bonaparte](https://en.wikipedia.org/wiki/Napoleon). To test the cleaning step, extra spacing was added in one place and a newline character in another.

In [11]:
from pathlib import Path
import textwrap

path = Path("test_file.txt")
with open(path, encoding="utf-8") as f:
    raw_text = f.read()

wrapped = textwrap.fill(repr(raw_text), width=80)
print(wrapped)

'Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 – 5 May 1821),
later known by his regnal name Napoleon I, was a French   general and statesman
who rose to prominence during the French Revolution and led a series of military
campaigns across Europe during the French Revolutionary and Napoleonic Wars from
1796 to 1815. He led the French Republic as First Consul from 1799 to 1804, then
ruled the French Empire as Emperor of the French from 1804 to 1814, and briefly
again in 1815. He was King of Italy from 1805 to 1814 and Protector of the
Confederation of the Rhine from 1806 to 1813. Born on the island of Corsica to a
family of Italian origin, Napoleon moved to mainland France in 1779 and was
commissioned as an officer in the French Royal Army in 1785.\nHe supported the
French Revolution in 1789 and promoted its cause in Corsica.'


The LLM will be a smaller model, so how the text is split matters: each chunk must give the model enough context to work with. LangChain's document loader will be used to handle the file.

First, a function is defined to clean the text.

In [12]:
import re
import unicodedata
from ftfy import fix_text

def clean_text(text: str) -> str:
    text = fix_text(text)
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\\n", " ")
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)

    text = (
        text.replace("“", '"')
            .replace("”", '"')
            .replace("‘", "'")
            .replace("’", "'")
            .replace("—", " - ")
            .replace("–", " - ")
            .replace("…", "...")
    )

    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(lines).strip()

In [13]:
cleaned = clean_text(raw_text)
print(repr(cleaned))

'Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 - 5 May 1821), later known by his regnal name Napoleon I, was a French general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815. He led the French Republic as First Consul from 1799 to 1804, then ruled the French Empire as Emperor of the French from 1804 to 1814, and briefly again in 1815. He was King of Italy from 1805 to 1814 and Protector of the Confederation of the Rhine from 1806 to 1813. Born on the island of Corsica to a family of Italian origin, Napoleon moved to mainland France in 1779 and was commissioned as an officer in the French Royal Army in 1785. He supported the French Revolution in 1789 and promoted its cause in Corsica.'


In [14]:
import difflib
import spacy

nlp = spacy.load("en_core_web_sm")
if not nlp.has_pipe("sentencizer"):
    nlp.add_pipe("sentencizer", first=True)
    
def split_into_sentences(text):
    return [sent.text.strip() for sent in nlp(text).sents]

raw_sentences = split_into_sentences(repr(raw_text))
cleaned_sentences = split_into_sentences(repr(cleaned))

diffs = list(difflib.ndiff(raw_sentences, cleaned_sentences))
print("\n".join(diffs))

- 'Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 – 5 May 1821), later known by his regnal name Napoleon I, was a French   general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815.
?                                                                  ^                                                                      ^^^^^^^^^^^^^^^^^^^^^^^

+ 'Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 - 5 May 1821), later known by his regnal name Napoleon I, was a French general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815.
?                                                                  ^                                                                      ^^^^^^^^^^^^^^^^^^^^^

  

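As the difflib output shows, the inserted errors are found and corrected: the extra spaces are collapsed and the en dash is replaced with a hyphen. The newline was already replaced by a space in the cleaned output above.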
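The effect of the whitespace and dash rules can also be checked on a small string. The snippet below is a minimal, stdlib-only sketch: `normalize_whitespace` is a hypothetical helper that repeats only the regex and dash steps of `clean_text` (no `ftfy`, no Unicode normalization), and the sample string is made up for illustration.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: only the regex and dash steps of clean_text."""
    # A single newline inside a paragraph becomes a space; blank lines are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # En and em dashes become a spaced hyphen.
    text = text.replace("—", " - ").replace("–", " - ")
    # Runs of spaces or tabs collapse to a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Two or more line breaks collapse to exactly one blank line.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return "\n".join(line.strip() for line in text.splitlines()).strip()


# Made-up sample with extra spaces, an en dash, a stray newline and extra blank lines.
sample = "a French   general (1769 – 1821).\nHe supported it.\n\n\nNew paragraph."
print(repr(normalize_whitespace(sample)))
```

Paragraph breaks survive as a single blank line, while a lone newline inside a paragraph is treated as a wrapped line and joined with a space.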
The next step is splitting the given text so the model can use it.

## Split text
The text will be split per sentence to start testing. spaCy with the `en_core_web_sm` model will be used.

In [15]:
def split_sentences(text: str):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

In [16]:
sentences = split_sentences(cleaned)
print("Sentences:")
for i, s in enumerate(sentences):
    print(f"{i+1}. {s}")

Sentences:
1. Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 - 5 May 1821), later known by his regnal name Napoleon I, was a French general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815.
2. He led the French Republic as First Consul from 1799 to 1804, then ruled the French Empire as Emperor of the French from 1804 to 1814, and briefly again in 1815.
3. He was King of Italy from 1805 to 1814 and Protector of the Confederation of the Rhine from 1806 to 1813.
4. Born on the island of Corsica to a family of Italian origin, Napoleon moved to mainland France in 1779 and was commissioned as an officer in the French Royal Army in 1785.
5. He supported the French Revolution in 1789 and promoted its cause in Corsica.


Now the model `flan-t5-large` will be loaded into the notebook to test how triples can be formed for each sentence.

In [None]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

def load_llm():
    text_gen = pipeline(
        "text2text-generation",
        model="google/flan-t5-large",
        device="cpu",
        do_sample=False,
        temperature=0.0,
        max_new_tokens=128,
    )

    return HuggingFacePipeline(pipeline=text_gen)



In [77]:
llm = load_llm()

print("LLM:", llm)

Device set to use cpu
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


LLM: HuggingFacePipeline
Params: {'model_id': 'gpt2', 'model_kwargs': None, 'pipeline_kwargs': None}




The model is loaded successfully. The next code block contains the code for cleaning and splitting the input text and for generating the triples.
The code is written to accept multiple documents.

In [17]:
import re
import unicodedata
from typing import List, Dict

from ftfy import fix_text
from langchain.document_loaders import TextLoader
from langchain.schema import Document


def clean_text(text: str) -> str:
    text = fix_text(text)
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\\n", " ")
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)

    text = (
        text.replace("“", '"')
            .replace("”", '"')
            .replace("‘", "'")
            .replace("’", "'")
            .replace("—", " - ")
            .replace("–", " - ")
            .replace("…", "...")
    )

    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


def split_sentences(text: str):
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]



def load_documents_and_split_sentences(file_path: str) -> list[Document]:
    loader = TextLoader(file_path, encoding="utf-8")
    raw_docs = loader.load()

    sentence_docs = []
    for doc in raw_docs:
        cleaned = clean_text(doc.page_content)
        sentences = split_sentences(cleaned)
        for sent in sentences:
            sentence_docs.append(Document(page_content=sent, metadata=doc.metadata))

    return sentence_docs


def extract_triples(docs: List[Document], llm) -> List[str]:
    prompt = """
    Your task is to extract triples from a given sentence.
    Use the format {{ "subject": "Value", "predicate": "Value", "object": "Value" }}
    Sentence:
    {chunk_text}
    Only output one JSON per line
    """
    raw_triples: List[str] = []

    for doc in docs:
        full_prompt = prompt.format(chunk_text=doc.page_content)
        print("\n\nFull prompt >>>", full_prompt)
        raw = llm(full_prompt)
        print("\nMODEL RAW >>>", raw)

        lines = [line.strip() for line in raw.splitlines() if line.strip()]
        raw_triples.extend(lines)
    
    return raw_triples

In [None]:
sentences = load_documents_and_split_sentences("test_file.txt")

triples = extract_triples(sentences, llm)
print(f"Extracted {len(triples)} triples")
triples[:]

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




Full prompt >>> 
    Your task is to extract triples from a given sentence.
    Use the format { "subject": "Value", "predicate": "Value", "object": "Value" }
    Sentence:
    Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 - 5 May 1821), later known by his regnal name Napoleon I, was a French general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815.
    Only output one JSON per line
    





MODEL RAW >>> Napoleon Bonaparte NATIONALITY France Napoleon Bonaparte NATIONALITY France Napoleon Bonaparte NATIONALITY France Napoleon Bonaparte RANK 1 Napoleon Bonaparte NATIONALITY France Napoleon Bonaparte DATE_OF_BIRTH "1915-05-05"


Full prompt >>> 
    Your task is to extract triples from a given sentence.
    Use the format { "subject": "Value", "predicate": "Value", "object": "Value" }
    Sentence:
    He led the French Republic as First Consul from 1799 to 1804, then ruled the French Empire as Emperor of the French from 1804 to 1814, and briefly again in 1815.
    Only output one JSON per line
    





MODEL RAW >>>                                                                


Full prompt >>> 
    Your task is to extract triples from a given sentence.
    Use the format { "subject": "Value", "predicate": "Value", "object": "Value" }
    Sentence:
    He was King of Italy from 1805 to 1814 and Protector of the Confederation of the Rhine from 1806 to 1813.
    Only output one JSON per line
    





MODEL RAW >>>                                                                


Full prompt >>> 
    Your task is to extract triples from a given sentence.
    Use the format { "subject": "Value", "predicate": "Value", "object": "Value" }
    Sentence:
    Born on the island of Corsica to a family of Italian origin, Napoleon moved to mainland France in 1779 and was commissioned as an officer in the French Royal Army in 1785.
    Only output one JSON per line
    





MODEL RAW >>> Napoleon ORIGIN Italy Napoleon COMMANDER_IN_THE_FRENCH_RAYAL_ARMY 1785 Napoleon BIRTH_PLACE Corsica Napoleon COMMANDER_IN_THE_FRENCH_RAYAL_ARMY 1779


Full prompt >>> 
    Your task is to extract triples from a given sentence.
    Use the format { "subject": "Value", "predicate": "Value", "object": "Value" }
    Sentence:
    He supported the French Revolution in 1789 and promoted its cause in Corsica.
    Only output one JSON per line
    

MODEL RAW >>>                                                                
Extracted 2 triples


['Napoleon Bonaparte NATIONALITY France Napoleon Bonaparte NATIONALITY France Napoleon Bonaparte NATIONALITY France Napoleon Bonaparte RANK 1 Napoleon Bonaparte NATIONALITY France Napoleon Bonaparte DATE_OF_BIRTH "1915-05-05"',
 'Napoleon ORIGIN Italy Napoleon COMMANDER_IN_THE_FRENCH_RAYAL_ARMY 1785 Napoleon BIRTH_PLACE Corsica Napoleon COMMANDER_IN_THE_FRENCH_RAYAL_ARMY 1779']

### Result
The model has trouble producing triples and does not follow the JSON format. This can be a limitation of smaller models. Few-shot prompting, where worked examples are included in the prompt, might help. To keep the run fast, only the first two sentences are given to the model.

In [None]:
def extract_triples(docs: List[Document], llm) -> List[str]:
    prompt = """
    Your task is to extract triples from the given sentence.
    Use the format {{"subject": "Value", "predicate": "Value", "object": "Value"}}.
    Only output one JSON object per line, with no surrounding text.

    Examples:
    Sentence: "Marie Curie won the Nobel Prize in Physics in 1903."
    {{"subject": "Marie Curie", "predicate": "won", "object": "the Nobel Prize in Physics in 1903"}}

    Sentence: "The Amazon River flows through Brazil."
    {{"subject": "The Amazon River", "predicate": "flows through", "object": "Brazil"}}

    Now extract the triple from this sentence:
    {chunk_text}
    """
    raw_triples: List[str] = []

    for doc in docs:
        full_prompt = prompt.format(chunk_text=doc.page_content)
        print("\n\nFull prompt >>>", full_prompt)
        raw = llm(full_prompt)
        print("\nMODEL RAW >>>", raw)
        
        lines = [line.strip() for line in raw.splitlines() if line.strip()]
        raw_triples.extend(lines)
    
    return raw_triples


sentences = load_documents_and_split_sentences("test_file.txt")

triples = extract_triples(sentences[:2], llm)
print(f"Extracted {len(triples)} triples")
triples[:]

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




Full prompt >>> 
    Your task is to extract triples from the given sentence.
    Use the format {"subject": "Value", "predicate": "Value", "object": "Value"}.
    Only output one JSON object per line, with no surrounding text.

    Examples:
    Sentence: "Marie Curie won the Nobel Prize in Physics in 1903."
    {"subject": "Marie Curie", "predicate": "won", "object": "the Nobel Prize in Physics in 1903"}

    Sentence: "The Amazon River flows through Brazil."
    {"subject": "The Amazon River", "predicate": "flows through", "object": "Brazil"}

    Now extract the triple from this sentence:
    Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 - 5 May 1821), later known by his regnal name Napoleon I, was a French general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815.
    





MODEL RAW >>> ['Napoleon Bonaparte', 'NATIONALITY', 'France'], 'Napoleon Bonaparte', 'NATIONALITY', 'France'], 'Napoleon Bonaparte', 'NATIONALITY', 'France'], 'Napoleon Bonaparte', 'BIRTH_PLACE', 'Napoleone di Buonaparte'], 'NATIONALITY', 'France'], 'N


Full prompt >>> 
    Your task is to extract triples from the given sentence.
    Use the format {"subject": "Value", "predicate": "Value", "object": "Value"}.
    Only output one JSON object per line, with no surrounding text.

    Examples:
    Sentence: "Marie Curie won the Nobel Prize in Physics in 1903."
    {"subject": "Marie Curie", "predicate": "won", "object": "the Nobel Prize in Physics in 1903"}

    Sentence: "The Amazon River flows through Brazil."
    {"subject": "The Amazon River", "predicate": "flows through", "object": "Brazil"}

    Now extract the triple from this sentence:
    He led the French Republic as First Consul from 1799 to 1804, then ruled the French Empire as Emperor of the French from 1804 to 1814, and b

["['Napoleon Bonaparte', 'NATIONALITY', 'France'], 'Napoleon Bonaparte', 'NATIONALITY', 'France'], 'Napoleon Bonaparte', 'NATIONALITY', 'France'], 'Napoleon Bonaparte', 'BIRTH_PLACE', 'Napoleone di Buonaparte'], 'NATIONALITY', 'France'], 'N",
 '[TABLECONTEXT] [TITLE] Louis XIV of France [TABLECONTEXT] LEADER_NAME Louis XIV of France [TABLECONTEXT] NAME Louis XIV of France [TABLECONTEXT] [TITLE] French Empire Louis XIV of France EMPEROR_OF_THE_FRANCE Louis XIV of France FIRST_CONSULT 1799-1804 Louis XIV of France PERIOD 1804-1814 Louis XIV of France BRIEF_']

The output shows somewhat more structure, but it is still far from valid JSON or correct triples, and the second response even refers to Louis XIV, who is not mentioned in the input. The model itself may be the limiting factor; a bigger model or a chat-style model might perform better.
Chat-style models are not as easy to set up, so for now a bigger local model is chosen.

In [None]:
# from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# from langchain.llms import HuggingFacePipeline

# def load_llm():
#     tokenizer = AutoTokenizer.from_pretrained(
#         "lmsys/vicuna-7b-v1.5",
#         trust_remote_code=True,
#         use_fast=False
#     )
#     model = AutoModelForCausalLM.from_pretrained(
#         "lmsys/vicuna-7b-v1.5",
#         trust_remote_code=True,
#         device_map="auto"
#     )

#     chat_pipe = pipeline(
#         "chat",
#         model=model,
#         tokenizer=tokenizer,
#         temperature=0.0,
#         max_new_tokens=128,
#     )

#     return HuggingFacePipeline(pipeline=chat_pipe)

# llm = load_llm()

In [4]:
from langchain.llms import LlamaCpp

model_path = "models/mythomax-l2-13b.Q5_K_M.gguf"
n_ctx = 4096
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=n_ctx,
    n_threads=16,
    n_gpu_layers=0,
    use_mmap=True,
    use_mlock=False,
    verbose=False
)

llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64


In [18]:
def extract_triples(docs: List[Document], llm) -> List[str]:   
    system = f"""Your task is to extract triples from the given sentence.
    Use the format {{"subject": "Value", "predicate": "Value", "object": "Value"}}.
    Only output one JSON object per line, with no surrounding text.
    """

    few_shot = f"""
    Examples:
    Sentence: "Marie Curie won the Nobel Prize in Physics in 1903."
    {{"subject": "Marie Curie", "predicate": "won", "object": "the Nobel Prize in Physics in 1903"}}

    Sentence: "The Amazon River flows through Brazil."
    {{"subject": "The Amazon River", "predicate": "flows through", "object": "Brazil"}}
    """

    raw_triples: List[str] = []
    for doc in docs:
        chunk_text = doc.page_content
        user = f"User: {chunk_text}\nAssistant:\n"
        prompt = "\n\n".join([system, few_shot, user])
        raw = llm(
            prompt,
            max_tokens=64,
            temperature=0.0,
            top_p=0.9,
            top_k=50,
            repeat_penalty=1.1,
            stop=["\n\n","User:"],
        )
        print("\nMODEL RAW >>>", raw)

        # split into nonempty lines
        lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
        raw_triples.extend(lines)

    return raw_triples


sentences = load_documents_and_split_sentences("test_file.txt")

triples = extract_triples(sentences[:2], llm)
print(f"Extracted {len(triples)} triples")
triples[:]




MODEL RAW >>> {"subject": "Napoleon Bonaparte", "predicate": "(born Napoleone di Buonaparte 15 August 1769 - 5 May 1821)", "object": "a French general and statesman"}


MODEL RAW >>> {"subject": "He", "predicate": "led the French Republic as First Consul from 1799 to 1804", "object": ""}
{"subject": "He", "predicate": "ruled the French Empire as Emperor of the French from 180
Extracted 3 triples


['{"subject": "Napoleon Bonaparte", "predicate": "(born Napoleone di Buonaparte 15 August 1769 - 5 May 1821)", "object": "a French general and statesman"}',
 '{"subject": "He", "predicate": "led the French Republic as First Consul from 1799 to 1804", "object": ""}',
 '{"subject": "He", "predicate": "ruled the French Empire as Emperor of the French from 180']

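The bigger model performs better, but it does not always extract a complete set of triples: one triple has an empty object and the last one is cut off, apparently at the `max_tokens` limit. The generation parameters can be tightened to leave the model less freedom.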
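Incomplete triples like these can be separated from usable ones automatically. The following is a minimal sketch, not part of the original pipeline: `validate_json_triples` and `REQUIRED_KEYS` are hypothetical names, and the rule that every field must be a non-empty string is an assumption about what counts as a complete triple. The input lines are copied from the output above.

```python
import json
from typing import Dict, List, Tuple

# Assumption: a complete triple has exactly these three non-empty string fields.
REQUIRED_KEYS = ("subject", "predicate", "object")


def validate_json_triples(lines: List[str]) -> Tuple[List[Dict[str, str]], List[str]]:
    """Hypothetical helper: split model output lines into valid and rejected."""
    valid, rejected = [], []
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)  # truncated or otherwise malformed JSON
            continue
        # Require a dict in which all three keys hold non-empty strings.
        if isinstance(obj, dict) and all(
            isinstance(obj.get(key), str) and obj[key].strip() for key in REQUIRED_KEYS
        ):
            valid.append(obj)
        else:
            rejected.append(line)
    return valid, rejected


lines = [
    '{"subject": "Napoleon Bonaparte", "predicate": "(born Napoleone di Buonaparte 15 August 1769 - 5 May 1821)", "object": "a French general and statesman"}',
    '{"subject": "He", "predicate": "led the French Republic as First Consul from 1799 to 1804", "object": ""}',
    '{"subject": "He", "predicate": "ruled the French Empire as Emperor of the French from 180',
]
valid, rejected = validate_json_triples(lines)
print(f"{len(valid)} valid, {len(rejected)} rejected")
```

A check like this only tests the shape of each line. It does not judge whether the triple is a sensible reading of the sentence, so the first line passes even though its predicate is really a parenthetical.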

In [19]:
def extract_triples(docs: List[Document], llm) -> List[str]:   
    system = f"""Your task is to extract triples from the given sentence.
    Use the format {{"subject": "Value", "predicate": "Value", "object": "Value"}}.
    Only output one JSON object per line, with no surrounding text.
    """

    few_shot = f"""
    Examples:
    Sentence: "Marie Curie won the Nobel Prize in Physics in 1903."
    {{"subject": "Marie Curie", "predicate": "won", "object": "the Nobel Prize in Physics in 1903"}}

    Sentence: "The Amazon River flows through Brazil."
    {{"subject": "The Amazon River", "predicate": "flows through", "object": "Brazil"}}
    """

    raw_triples: List[str] = []
    for doc in docs:
        chunk_text = doc.page_content
        user = f"User: {chunk_text}\nAssistant:\n"
        prompt = "\n\n".join([system, few_shot, user])
        raw = llm(
            prompt,
            max_tokens=100,
            temperature=0.0,
            top_p=1.0,
            top_k=1,
            repeat_penalty=1.1,
            stop=["\n\n","User:"],
        )
        print("\nMODEL RAW >>>", raw)

        # split into nonempty lines
        lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
        raw_triples.extend(lines)

    return raw_triples


sentences = load_documents_and_split_sentences("test_file.txt")

triples = extract_triples(sentences[:2], llm)
print(f"Extracted {len(triples)} triples")
triples[:]


MODEL RAW >>> {"subject": "Napoleon Bonaparte", "predicate": "(born Napoleone di Buonaparte 15 August 1769 - 5 May 1821)", "object": "a French general and statesman"}


MODEL RAW >>> {"subject": "He", "predicate": "led the French Republic as First Consul from 1799 to 1804", "object": ""}
{"subject": "He", "predicate": "ruled the French Empire as Emperor of the French from 1804 to 1814", "object": ""}
{"subject": "He", "predicate": "briefly ruled the French Empire again in 1
Extracted 4 triples


['{"subject": "Napoleon Bonaparte", "predicate": "(born Napoleone di Buonaparte 15 August 1769 - 5 May 1821)", "object": "a French general and statesman"}',
 '{"subject": "He", "predicate": "led the French Republic as First Consul from 1799 to 1804", "object": ""}',
 '{"subject": "He", "predicate": "ruled the French Empire as Emperor of the French from 1804 to 1814", "object": ""}',
 '{"subject": "He", "predicate": "briefly ruled the French Empire again in 1']

The model extracts one extra triple, but the triples are still not always complete: some have an empty object and the last one is truncated. Another option is to change the output format. JSON can be difficult for these "smaller" models. Separating the fields with a pipe `|` and the triples with a newline could be easier.

In [20]:
def extract_triples(docs: List[Document], llm) -> List[str]:
    system = f"""Your task is to extract triples from the given sentence.
    Your only job is to pull out valid subject | predicate | object triples.
    - Output one triple per line, exactly in the format:
    subject | predicate | object
    If the sentence contains more than one relation, list every triple on its own line.
    """

    few_shot = f"""
    Examples:
    Sentence: "Marie Curie won the Nobel Prize in Physics in 1903."
    Marie Curie | won | the Nobel Prize in Physics in 1903

    Sentence: "The Amazon River flows through Brazil."
    The Amazon River | flows through | Brazil
    """

    raw_triples: List[str] = []
    for doc in docs:
        chunk_text = doc.page_content
        user = f"User: {chunk_text}\nAssistant:\n"
        prompt = "\n\n".join([system, few_shot, user])
        raw = llm(
            prompt,
            max_tokens=100,
            temperature=0.0,
            top_p=1.0,
            top_k=1,
            repeat_penalty=1.1,
            stop=["\n\n","User:"],
        )
        print("\nMODEL RAW >>>", raw)

        # split into nonempty lines
        lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
        raw_triples.extend(lines)

    return raw_triples


sentences = load_documents_and_split_sentences("test_file.txt")

triples = extract_triples(sentences[:2], llm)
print(f"Extracted {len(triples)} triples")
triples[:]


MODEL RAW >>> Napoleon Bonaparte | was born as | Napoleone di Buonaparte
Napoleon Bonaparte | later known by his regnal name | Napoleon I
Napoleon Bonaparte | rose to prominence during | the French Revolution
Napoleon Bonaparte | led a series of military campaigns across Europe during | the French Revolutionary and Napoleonic Wars from 1796 to 1815

MODEL RAW >>> - Napoleon Bonaparte | ruled as | Emperor of the French from 1804 to 1814 and briefly in 1815
- Napoleon Bonaparte | led the French Republic as | First Consul from 1799 to 1804
Extracted 6 triples


['Napoleon Bonaparte | was born as | Napoleone di Buonaparte',
 'Napoleon Bonaparte | later known by his regnal name | Napoleon I',
 'Napoleon Bonaparte | rose to prominence during | the French Revolution',
 'Napoleon Bonaparte | led a series of military campaigns across Europe during | the French Revolutionary and Napoleonic Wars from 1796 to 1815',
 '- Napoleon Bonaparte | ruled as | Emperor of the French from 1804 to 1814 and briefly in 1815',
 '- Napoleon Bonaparte | led the French Republic as | First Consul from 1799 to 1804']

With the new system prompt and few-shot examples, the model extracts more triples and performs slightly better. It still does not always follow the format exactly; for example, each line of the second response starts with `- `. The system prompt and the few-shot examples can be extended to give the model a better idea of what it is supposed to do.
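Because the pipe format is line-based, the raw lines can be turned into tuples with a few string operations. The snippet below is a sketch under stated assumptions: `parse_pipe_triples` is a hypothetical helper, stripping a leading `- ` or `* ` marker is an assumption about how to treat such lines, and the two-field line in the sample is made up to show a rejected case.

```python
import re
from typing import List, Tuple


def parse_pipe_triples(lines: List[str]) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Hypothetical helper: parse 'subject | predicate | object' lines."""
    triples, rejected = [], []
    for line in lines:
        # Drop a leading list marker such as "- " or "* " that the model may add.
        stripped = re.sub(r"^\s*[-*]\s+", "", line)
        parts = [part.strip() for part in stripped.split("|")]
        # Keep only lines with exactly three non-empty fields.
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
        else:
            rejected.append(line)
    return triples, rejected


lines = [
    "Napoleon Bonaparte | was born as | Napoleone di Buonaparte",
    "- Napoleon Bonaparte | led the French Republic as | First Consul from 1799 to 1804",
    "Napoleon | has Italian origin",  # made-up line with only two fields
]
triples, rejected = parse_pipe_triples(lines)
print(triples)
print("rejected:", rejected)
```

Keeping the rejected lines instead of dropping them silently makes it easier to see which sentences the prompt still handles badly.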

In [21]:
def extract_triples(docs: List[Document], llm) -> List[str]:
    system = """You are an information-extraction assistant.
    Your only job is to pull out valid subject | predicate | object triples.
    Output one triple per line, exactly in the format:
    subject | predicate | object
    - If the sentence contains more than one relation, list every triple on its own line.
    Do NOT output:
        - duplicate or inverted triples
        - commentary or extra text
        - passive-voice fluff
    """

    few_shot = f"""
    User: Alice wrote Bob a letter.
    Assistant:
    Alice | wrote | Bob a letter

    User: John works at Acme Corp and lives in Paris.
    Assistant:
    John | works at | Acme Corp
    John | lives in | Paris

    User: In 2020, Company X acquired Company Y.
    Assistant:
    Company X | acquired | Company Y
    """

    raw_triples: List[str] = []
    for doc in docs:
        chunk_text = doc.page_content
        user = f"User: {chunk_text}\nAssistant:\n"
        prompt = "\n\n".join([system, few_shot, user])
        raw = llm(
            prompt,
            max_tokens=100,
            temperature=0.0,
            top_p=1.0,
            top_k=1,
            repeat_penalty=1.1,
            stop=["\n\n","User:"],
        )
        print("\nMODEL RAW >>>", raw)

        # split into nonempty lines
        lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
        raw_triples.extend(lines)

    return raw_triples


sentences = load_documents_and_split_sentences("test_file.txt")

triples = extract_triples(sentences[:2], llm)
print(f"Extracted {len(triples)} triples")
triples[:]


MODEL RAW >>> Napoleon Bonaparte | was born as | Napoleone di Buonaparte
Napoleon Bonaparte | later known by his regnal name | Napoleon I
Napoleon Bonaparte | rose to prominence during | French Revolution

MODEL RAW >>> Napoleon Bonaparte | led the French Republic as First Consul from | 1799 to 1804
Napoleon Bonaparte | ruled the French Empire as Emperor of the French from | 1804 to 1814
Napoleon Bonaparte | briefly ruled the French Empire as Emperor of the French from | 1815
Extracted 6 triples


['Napoleon Bonaparte | was born as | Napoleone di Buonaparte',
 'Napoleon Bonaparte | later known by his regnal name | Napoleon I',
 'Napoleon Bonaparte | rose to prominence during | French Revolution',
 'Napoleon Bonaparte | led the French Republic as First Consul from | 1799 to 1804',
 'Napoleon Bonaparte | ruled the French Empire as Emperor of the French from | 1804 to 1814',
 'Napoleon Bonaparte | briefly ruled the French Empire as Emperor of the French from | 1815']

The output is good enough for now, although in some cases the predicate is longer than needed. This is a good time to see how the model performs on more text. The next paragraph from the Wikipedia page is added to a new text file named `test_file_v2.txt`.

In [24]:
def load_documents_and_split_sentences(file_path: str) -> list[Document]:
    loader = TextLoader(file_path, encoding="utf-8")
    raw_docs = loader.load()

    sentence_docs = []
    for doc in raw_docs:
        cleaned = clean_text(doc.page_content)
        sentences = split_sentences(cleaned)
        for sentence in sentences:
            print("sentence added to list: ", sentence)
            sentence_docs.append(Document(page_content=sentence, metadata=doc.metadata))

    return sentence_docs

def extract_triples(docs: List[Document], llm) -> List[str]:
    system = """You are an information-extraction assistant.
    Your only job is to pull out valid subject | predicate | object triples.
    Output one triple per line, exactly in the format:
    subject | predicate | object
    - If the sentence contains more than one relation, list every triple on its own line.
    Do NOT output:
        - duplicate or inverted triples
        - commentary or extra text
        - passive-voice fluff
    """

    few_shot = f"""
    User: Alice wrote Bob a letter.
    Assistant:
    Alice | wrote | Bob a letter

    User: John works at Acme Corp and lives in Paris.
    Assistant:
    John | works at | Acme Corp
    John | lives in | Paris

    User: In 2020, Company X acquired Company Y.
    Assistant:
    Company X | acquired | Company Y
    """

    raw_triples: List[str] = []
    for doc in docs:
        chunk_text = doc.page_content
        user = f"User: {chunk_text}\nAssistant:\n"
        prompt = "\n\n".join([system, few_shot, user])
        raw = llm(
            prompt,
            max_tokens=100,
            temperature=0.0,
            top_p=1.0,
            top_k=1,
            repeat_penalty=1.1,
            stop=["\n\n","User:"],
        )
        print("\nMODEL RAW >>>", raw)

        # split into nonempty lines
        lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
        raw_triples.extend(lines)

    return raw_triples


sentences = load_documents_and_split_sentences("test_file_v2.txt")

triples = extract_triples(sentences, llm)
print(f"\nExtracted {len(triples)} triples")
triples[:]

sentence added to list:  Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 - 5 May 1821), later known by his regnal name Napoleon I, was a French general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815.
sentence added to list:  He led the French Republic as First Consul from 1799 to 1804, then ruled the French Empire as Emperor of the French from 1804 to 1814, and briefly again in 1815.
sentence added to list:  He was King of Italy from 1805 to 1814 and Protector of the Confederation of the Rhine from 1806 to 1813.
sentence added to list:  Born on the island of Corsica to a family of Italian origin, Napoleon moved to mainland France in 1779 and was commissioned as an officer in the French Royal Army in 1785.
sentence added to list:  He supported the French Revolution in 1789 and promoted its cause in Corsica.
sentence added to list: 

['Napoleon Bonaparte | was born as | Napoleone di Buonaparte',
 'Napoleon Bonaparte | later known by his regnal name | Napoleon I',
 'Napoleon Bonaparte | rose to prominence during | French Revolution',
 'Napoleon Bonaparte | led the French Republic as First Consul from | 1799 to 1804',
 'Napoleon Bonaparte | ruled the French Empire as Emperor of the French from | 1804 to 1814',
 'Napoleon Bonaparte | briefly ruled the French Empire as Emperor of the French from | 1815',
 'Napoleon Bonaparte | was King of Italy from | 1805 to 1814',
 'Napoleon Bonaparte | was Protector of the Confederation of the Rhine from | 1806 to 1813',
 'Napoleon | was born on | the island of Corsica',
 'Napoleon | has Italian origin',
 'Napoleon | moved to | mainland France in 1779',
 'Napoleon | was commissioned as an officer in the French Royal Army in 1785',
 'He | supported the French Revolution in | 1789',
 'He | promoted its cause in | Corsica',
 'Napoleon Bonaparte | rose rapidly through the ranks after wi

The model still appears to perform well on more complicated text. However, some triples now use `He` or `Napoleon` as the subject instead of the full name, because each sentence is processed in isolation. The easiest way to handle this is to pass several sentences at once instead of one, which this model allows within its context limit. The proposal is to use windows of 3 sentences with an overlap of 1 sentence, so the model gets more context. In a few cases the model still has trouble extracting triples: for example, `Napoleon | has Italian origin` has only two fields.

In [26]:
def load_documents_and_chunk_sentences(
    file_path: str,
    window_size: int = 3,
    overlap: int = 1
) -> List[Document]:
    """
    Splits each raw document into sentences, then emits Documents
    whose page_content is a sliding window of `window_size`
    sentences, stepping by (window_size - overlap).
    """
    loader = TextLoader(file_path, encoding="utf-8")
    raw_docs = loader.load()

    sentence_chunks: List[Document] = []
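    step = window_size - overlap
    if step <= 0:
        # A zero step makes range() raise; a negative step yields no chunks
        raise ValueError("overlap must be smaller than window_size")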
    for doc in raw_docs:
        cleaned = clean_text(doc.page_content)
        sentences = split_sentences(cleaned)

        for i in range(0, len(sentences), step):
            chunk_sents = sentences[i : i + window_size]
            if not chunk_sents:
                continue

            chunk_text = " ".join(chunk_sents)
            print("sentence_chunk added to list: ", chunk_text)
            sentence_chunks.append(
                Document(page_content=chunk_text, metadata=doc.metadata)
            )

    return sentence_chunks

sentences = load_documents_and_chunk_sentences("test_file_v2.txt")

sentence_chunk added to list:  Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 - 5 May 1821), later known by his regnal name Napoleon I, was a French general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815. He led the French Republic as First Consul from 1799 to 1804, then ruled the French Empire as Emperor of the French from 1804 to 1814, and briefly again in 1815. He was King of Italy from 1805 to 1814 and Protector of the Confederation of the Rhine from 1806 to 1813.
sentence_chunk added to list:  He was King of Italy from 1805 to 1814 and Protector of the Confederation of the Rhine from 1806 to 1813. Born on the island of Corsica to a family of Italian origin, Napoleon moved to mainland France in 1779 and was commissioned as an officer in the French Royal Army in 1785. He supported the French Revolution in 1789 and promoted its

The last item in the list can consist only of sentences that already appear in the second-to-last item, in which case it could be removed. For now it is left in.
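The removal of a redundant trailing item could be sketched as follows. This is a minimal illustration on plain lists of strings, not the notebook's loader: the function name `sliding_windows` is hypothetical, and it assumes sentences are already split. A window is skipped when the previous window already reached the end of the sentence list.

```python
from typing import List


def sliding_windows(
    sentences: List[str], window_size: int = 3, overlap: int = 1
) -> List[List[str]]:
    """Return overlapping windows, dropping a trailing window that adds nothing new."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    step = window_size - overlap
    windows: List[List[str]] = []
    for start in range(0, len(sentences), step):
        # The previous window started at (start - step) and spans window_size
        # sentences. If it already reached the end of the list, this window
        # would only repeat sentences, so stop here.
        if start > 0 and start - step + window_size >= len(sentences):
            break
        windows.append(sentences[start : start + window_size])
    return windows


# Seven sentences: a naive loop would emit a fourth window containing only "s6"
example = [f"s{i}" for i in range(7)]
for window in sliding_windows(example):
    print(window)
```

With seven sentences, this yields three windows instead of four, because the final single-sentence window is already covered by the third one.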

In [27]:
triples = extract_triples(sentences, llm)
print(f"\nExtracted {len(triples)} triples")
triples[:]


MODEL RAW >>> Napoleon Bonaparte | was born as | Napoleone di Buonaparte
Napoleon Bonaparte | later known by his regnal name | Napoleon I
Napoleon Bonaparte | was a French general and statesman 
Napoleon Bonaparte | rose to prominence during the French Revolution
Napoleon Bonaparte | led military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1

MODEL RAW >>> Napoleon | was born on | the island of Corsica
Napoleon | was King of Italy from | 1805 to 1814
Napoleon | was Protector of the Confederation of the Rhine from | 1806 to 1813

MODEL RAW >>> Napoleon | supported the French Revolution in | 1789
Napoleon | promoted its cause in | Corsica
Napoleon | won the siege of Toulon in | 1793
Napoleon | defeated royalist insurgents in Paris on | 13 Vendémiaire in | 1795
Napoleon | commanded a military campaign against the Austrians and their Italian al

MODEL RAW >>> Napoleon | commanded | a military campaign against the Austrians and their Italian allies in t

['Napoleon Bonaparte | was born as | Napoleone di Buonaparte',
 'Napoleon Bonaparte | later known by his regnal name | Napoleon I',
 'Napoleon Bonaparte | was a French general and statesman',
 'Napoleon Bonaparte | rose to prominence during the French Revolution',
 'Napoleon Bonaparte | led military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1',
 'Napoleon | was born on | the island of Corsica',
 'Napoleon | was King of Italy from | 1805 to 1814',
 'Napoleon | was Protector of the Confederation of the Rhine from | 1806 to 1813',
 'Napoleon | supported the French Revolution in | 1789',
 'Napoleon | promoted its cause in | Corsica',
 'Napoleon | won the siege of Toulon in | 1793',
 'Napoleon | defeated royalist insurgents in Paris on | 13 Vendémiaire in | 1795',
 'Napoleon | commanded a military campaign against the Austrians and their Italian al',
 'Napoleon | commanded | a military campaign against the Austrians and their Italian allies in the War 

The model now seems to have more trouble extracting well-formed triples: some lines contain more than three fields and others only two. Larger chunks did help to replace `He` with his name as the subject. The model also extracts 22 triples, which is 8 fewer than before. Several raw outputs stop mid-line, which suggests that generation is hitting the `max_tokens=100` limit now that each prompt covers three sentences.

The model might also struggle with the longer input. The first step is to adjust the system prompt and the few-shot examples and see the results.
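To quantify how often the output deviates from the three-field format, the raw lines can be grouped by their number of fields. The snippet below is a small sketch under assumptions: the helper name `group_by_field_count` is hypothetical, and the sample lines are copied from the model output shown above.

```python
from typing import Dict, List


def group_by_field_count(lines: List[str], sep: str = "|") -> Dict[int, List[str]]:
    """Group raw output lines by how many `sep`-separated fields they contain."""
    groups: Dict[int, List[str]] = {}
    for line in lines:
        n_fields = len(line.split(sep))
        groups.setdefault(n_fields, []).append(line)
    return groups


# Three lines copied from the model output above
sample = [
    "Napoleon | was born on | the island of Corsica",
    "Napoleon Bonaparte | was a French general and statesman",
    "Napoleon | defeated royalist insurgents in Paris on | 13 Vendémiaire in | 1795",
]

report = group_by_field_count(sample)
for n_fields, group in sorted(report.items()):
    print(f"{n_fields} fields: {len(group)} line(s)")
```

Running the same helper on the full `triples` list before and after a prompt change gives a rough way to compare iterations.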

In [None]:
def extract_triples(docs: List[Document], llm) -> List[str]:
    system = """You are an information-extraction assistant.
    Your only job is to pull out valid subject | predicate | object triples from the text that follows.
    **Only** use facts explicitly stated in that text. Do **not** use any external or world knowledge.
    Output one triple per line in the format:
    subject | predicate | object
    No other text or commentary.
    When you see a verb phrase followed by a prepositional phrase starting with “during”, “in”, “on”, or “at”, 
    treat that prepositional phrase as the object and do not include it in the predicate.
    """

    few_shot = """
    User: Alice and Bob co-authored a research paper on natural language processing, which was published in 2021.
    Assistant:
    Alice | co-authored | a research paper on natural language processing
    Bob | co-authored | a research paper on natural language processing
    a research paper on natural language processing | was published in | 2021

    User: John works at Acme Corp and lives in Paris.
    Assistant:
    John | works at | Acme Corp
    John | lives in | Paris

    User: Sarah, the founder of TechStart, delivered the keynote in Berlin. She also launched a new product.
    Assistant:
    Sarah | is the founder of | TechStart
    Sarah | delivered | the keynote in Berlin
    Sarah | launched | a new product

    User: The conference rose to prominence during the Industrial Revolution.
    Assistant:
    The conference | rose to prominence | the Industrial Revolution
    """

    raw_triples: List[str] = []
    for doc in docs:
        chunk_text = doc.page_content
        user = f"User: {chunk_text}\nAssistant:\n"
        prompt = "\n\n".join([system, few_shot, user])
        raw = llm(
            prompt,
            max_tokens=1250,
            temperature=0.0,
            top_p=1.0,
            top_k=1,
            repeat_penalty=1.1,
            stop=["\n\n","User:"],
        )
        print("\nMODEL RAW >>>", raw)

        lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
        raw_triples.extend(lines)

    return raw_triples


sentences = load_documents_and_chunk_sentences("test_file_v2.txt")

triples = extract_triples(sentences, llm)
print(f"\nExtracted {len(triples)} triples")
triples[:]

sentence_chunk added to list:  Napoleon Bonaparte (born Napoleone di Buonaparte 15 August 1769 - 5 May 1821), later known by his regnal name Napoleon I, was a French general and statesman who rose to prominence during the French Revolution and led a series of military campaigns across Europe during the French Revolutionary and Napoleonic Wars from 1796 to 1815. He led the French Republic as First Consul from 1799 to 1804, then ruled the French Empire as Emperor of the French from 1804 to 1814, and briefly again in 1815. He was King of Italy from 1805 to 1814 and Protector of the Confederation of the Rhine from 1806 to 1813.
sentence_chunk added to list:  He was King of Italy from 1805 to 1814 and Protector of the Confederation of the Rhine from 1806 to 1813. Born on the island of Corsica to a family of Italian origin, Napoleon moved to mainland France in 1779 and was commissioned as an officer in the French Royal Army in 1785. He supported the French Revolution in 1789 and promoted its

['Napoleon Bonaparte | was born on | 15 August 1769',
 'Napoleon Bonaparte | rose to prominence during | the French Revolution',
 'Napoleon Bonaparte | led military campaigns across Europe during | the French Revolutionary and Napoleonic Wars from 1796 to 1815',
 'Napoleon Bonaparte | was the First Consul of France from | 1799 to 1804',
 'Napoleon Bonaparte | was the Emperor of the French from | 1804 to 1814 and briefly again in 1815',
 'Napoleon Bonaparte | was King of Italy from | 1805 to 1814',
 'Napoleon Bonaparte | was the Protector of the Confederation of the Rhine from | 1806 to 1813',
 'Napoleon | was King of Italy from | 1805 to 1814',
 'Napoleon | was Protector of the Confederation of the Rhine from | 1806 to 1813',
 'Napoleon | was born on | the island of Corsica',
 'Napoleon | had Italian origin',
 'Napoleon | moved to mainland France in | 1779',
 'Napoleon | was commissioned as an officer in the French Royal Army in | 1785',
 'Napoleon | supported the French Revolution in 

Triple generation is far better with the new system prompt and few-shot examples. The model still sometimes extracts only a subject and a predicate (for example `Napoleon | had Italian origin`), so there is room for improvement in a next iteration. In some cases it also keeps the predicate or object too long. For now the focus will be on extracting the triples and then implementing the solution in Streamlit.
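One way to spot overly long predicates or objects is a simple word-count check. This is only a sketch: the function name `flag_long_fields` and the threshold of 6 words are assumptions chosen for illustration, not values taken from this notebook.

```python
from typing import List, Tuple


def flag_long_fields(lines: List[str], max_words: int = 6) -> List[Tuple[str, str, int]]:
    """Return (line, field_name, word_count) for predicates/objects above max_words."""
    flagged: List[Tuple[str, str, int]] = []
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # lines without exactly three fields are not checked here
        _, pred, obj = parts
        for name, value in (("predicate", pred), ("object", obj)):
            n_words = len(value.split())
            if n_words > max_words:
                flagged.append((line, name, n_words))
    return flagged


# Lines copied from the extraction output above
sample_triples = [
    "Napoleon Bonaparte | was the Protector of the Confederation of the Rhine from | 1806 to 1813",
    "Napoleon Bonaparte | led military campaigns across Europe during | the French Revolutionary and Napoleonic Wars from 1796 to 1815",
    "Napoleon | was born on | the island of Corsica",
]

flagged = flag_long_fields(sample_triples)
for line, field, n_words in flagged:
    print(f"{field} has {n_words} words: {line}")
```

A word count is a crude proxy, but it is enough to list candidates for a follow-up prompt tweak.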

In [31]:
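triples_list = []

for triple in triples:
    # Each entry is normally one line; splitlines() guards against embedded newlines
    for line in triple.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Keep only well-formed lines; anything without exactly three fields is skipped
        if len(parts) == 3:
            subj, pred, obj = parts
            triples_list.append({"subject": subj, "predicate": pred, "object": obj})

# display() is available by default in Jupyter/IPython sessions
display(triples_list)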

[{'subject': 'Napoleon Bonaparte',
  'predicate': 'was born on',
  'object': '15 August 1769'},
 {'subject': 'Napoleon Bonaparte',
  'predicate': 'rose to prominence during',
  'object': 'the French Revolution'},
 {'subject': 'Napoleon Bonaparte',
  'predicate': 'led military campaigns across Europe during',
  'object': 'the French Revolutionary and Napoleonic Wars from 1796 to 1815'},
 {'subject': 'Napoleon Bonaparte',
  'predicate': 'was the First Consul of France from',
  'object': '1799 to 1804'},
 {'subject': 'Napoleon Bonaparte',
  'predicate': 'was the Emperor of the French from',
  'object': '1804 to 1814 and briefly again in 1815'},
 {'subject': 'Napoleon Bonaparte',
  'predicate': 'was King of Italy from',
  'object': '1805 to 1814'},
 {'subject': 'Napoleon Bonaparte',
  'predicate': 'was the Protector of the Confederation of the Rhine from',
  'object': '1806 to 1813'},
 {'subject': 'Napoleon',
  'predicate': 'was King of Italy from',
  'object': '1805 to 1814'},
 {'subject'