# Testing REBEL-large on the AI Act

In [14]:
from transformers import pipeline
from rebel_re_model import extract_triplets

In [15]:
triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [16]:
# Get sample AI Act text
with open('sample_knowledge_graph_section.txt', 'r') as file:
    # Read the file content into a string
    ai_act_string = file.read()

ai_act_string

'A variety of AI systems can generate large quantities of synthetic content that becomes increasingly hard for humans to distinguish from human-generated and authentic content. The wide availability and increasing capabilities of those systems have a significant impact on the integrity and trust in the information ecosystem, raising new risks of misinformation and manipulation at scale, fraud, impersonation and consumer deception. In the light of those impacts, the fast technological pace and the need for new methods and techniques to trace origin of information, it is appropriate to require providers of those systems to embed technical solutions that enable marking in a machine readable format and detection that the output has been generated or manipulated by an AI system and not a human. Such techniques and methods should be sufficiently reliable, interoperable, effective and robust as far as this is technically feasible, taking into account available techniques or a combination of s

## Testing on the sample AI Act paragraph

In [17]:
# We need to call the tokenizer manually because we need the special tokens in the decoded output.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(ai_act_string, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
extracted_text[0]

'<s><triplet> impersonation <subj> fraud <obj> subclass of</s>'

## Testing on a single sentence from the AI Act

In [20]:
# Testing on a shorter string
ai_act_one_sentence = "A variety of AI systems can generate large quantities of synthetic content that becomes increasingly hard for humans to distinguish from human-generated and authentic content."

extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(ai_act_one_sentence,  return_tensors=True, return_text=False)[0]["generated_token_ids"]])
extracted_text[0]

'<s><triplet> synthetic content <subj> authentic content <obj> opposite of <triplet> authentic content <subj> synthetic content <obj> opposite of</s>'
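The decoded strings above are easier to inspect once they are split into tuples. The notebook imports `extract_triplets` from a local module but never shows or calls it, so the following is only a minimal sketch of a stand-in. The name `parse_rebel_output` is hypothetical, and the sketch assumes the linearized format seen in the outputs here: `<triplet> head <subj> tail <obj> relation`, optionally followed by further `<subj> tail <obj> relation` pairs that share the same head.

```python
def parse_rebel_output(decoded: str) -> list[tuple[str, str, str]]:
    """Parse a decoded REBEL string into (head, relation, tail) tuples.

    Hypothetical helper, not part of transformers or the notebook's own module.
    """
    # Remove sequence-boundary and padding tokens so they do not leak into entities.
    text = decoded
    for token in ("</s>", "<s>", "<pad>"):
        text = text.replace(token, " ")

    triplets = []
    # Everything before the first <triplet> marker is ignored.
    for chunk in text.split("<triplet>")[1:]:
        # The head comes first; each later piece is "tail <obj> relation".
        head, *pairs = chunk.split("<subj>")
        head = head.strip()
        for pair in pairs:
            tail, separator, relation = pair.partition("<obj>")
            tail, relation = tail.strip(), relation.strip()
            # Skip truncated or malformed pieces rather than guessing.
            if separator and head and tail and relation:
                triplets.append((head, relation, tail))
    return triplets


sample = "<s><triplet> impersonation <subj> fraud <obj> subclass of</s>"
print(parse_rebel_output(sample))
# [('impersonation', 'subclass of', 'fraud')]
```

Returning plain tuples keeps the sketch dependency-free; a list of dicts would work equally well if the triplets are headed for a graph library.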

## Testing on the first two sentences from the AI Act

In [23]:
ai_act_three_sentences = "A variety of AI systems can generate large quantities of synthetic content that becomes increasingly hard for humans to distinguish from human-generated and authentic content. The wide availability and increasing capabilities of those systems have a significant impact on the integrity and trust in the information ecosystem, raising new risks of misinformation and manipulation at scale, fraud, impersonation and consumer deception."

extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(ai_act_three_sentences,  return_tensors=True, return_text=False)[0]["generated_token_ids"]])
extracted_text[0]

'<s><triplet> impersonation <subj> fraud <obj> subclass of</s>'

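# First Impressions

The model seems to output only one or two relations regardless of input length: the full paragraph and the shorter multi-sentence excerpt each produced a single triplet, and the single sentence produced one relation stated in both directions.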

## Passing in the sample paragraph one sentence at a time

In [26]:
sentences = ai_act_string.split(".")

# The last element is an empty string; let's see what the model outputs for it
sentences[7]

''

In [24]:
for sentence in sentences:
    extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor(sentence,  return_tensors=True, return_text=False)[0]["generated_token_ids"]])
    print(extracted_text[0])

<s><triplet> synthetic content <subj> authentic content <obj> opposite of <triplet> authentic content <subj> synthetic content <obj> opposite of</s>
<s><triplet> impersonation <subj> fraud <obj> subclass of</s>
<s><triplet> AI <subj> AI system <obj> studies <triplet> AI system <subj> AI <obj> studied by</s>
<s><triplet> watermark <subj> metadata <obj> subclass of</s>
<s><triplet> state-of-the-art <subj> technological <obj> instance of</s>
<s><triplet> AI model <subj> model <obj> subclass of</s>
<s><triplet> assistive function <subj> AI systems <obj> subclass of</s>
<s><triplet> World War I <subj> World War II <obj> followed by <triplet> World War II <subj> World War I <obj> follows</s>


# Hallucinations
The model hallucinates relations when passed an empty string. For example, for the empty final element of `sentences` it output the triplet (World War I, followed by, World War II), along with its inverse.
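
One way to keep the empty string away from the model is to drop empty fragments before the loop. The following is a minimal sketch, assuming the same naive split on `"."` used above; the helper name `split_into_sentences` is hypothetical. Note that it also strips surrounding whitespace, so the inputs differ slightly from the ones passed to the model in this notebook.

```python
def split_into_sentences(text: str) -> list[str]:
    """Split on '.' and discard fragments that are empty or whitespace-only.

    Hypothetical helper. A naive split like this will also break on
    abbreviations and decimal numbers, so treat it as a rough first pass.
    """
    fragments = (fragment.strip() for fragment in text.split("."))
    return [fragment for fragment in fragments if fragment]


print(split_into_sentences("First sentence. Second sentence."))
# ['First sentence', 'Second sentence']
```

The result could then replace `sentences` in the loop above, so that the extractor is only called on non-empty text.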