# GLiREL Notebook: Relationship Extraction on Label Studio Annotations

This notebook demonstrates how to use the **GLiREL model** for relationship extraction (RE) on texts that have been annotated with entities in **Label Studio**.

## Workflow Overview

1. **Load Text & Annotations**: Read the original text and entity annotations from Label Studio JSON export
2. **Prepare GLiREL Input**: Convert Label Studio annotations to GLiREL-compatible format
3. **Relationship Extraction**: Use GLiREL to identify and classify relationships between entities
4. **Analyze Results**: Display and export extracted relationships

## Table of Contents

**Setup & Data Loading**
- [Installation](#installation) - Install dependencies
- [Load Example Data](#load-data) - Read text and Label Studio annotations
- [Data Exploration](#explore-data) - Understand the structure

**Data Preparation**
- [Convert LS to GLiREL Format](#convert-format) - Prepare input for GLiREL model

**Relationship Extraction**
- [Extract Relations](#extract-relations) - Run GLiREL on prepared data

---

## Installation {#installation}

Install required packages for relationship extraction with GLiREL:


In [41]:
import subprocess
import sys

# Third-party packages only; json ships with the Python standard library
packages = ["gliner", "pandas"]

print("Installing required packages...")
for package in packages:
    try:
        __import__(package.replace("-", "_"))
        print(f"✓ {package} already installed")
    except ImportError:
        print(f"Installing {package}...")
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])
            print(f"✓ {package} installed successfully")
        except subprocess.CalledProcessError:
            print(f"⚠️  Could not install {package}")

print("\n✓ Installation complete")


Installing required packages...
✓ gliner already installed
✓ pandas already installed

✓ Installation complete


## Load Example Data {#load-data}

Load the example text and its Label Studio entity annotations:


In [42]:
import json
from pathlib import Path

# Define file paths
TEXT_FILE = "example_text.txt"
LS_ANNOTATIONS_FILE = "example_text_LS_entities.json"
SCHEMA_FILE = "../gliner_schema_template.json"

# Load the raw text
print("Loading raw text file...")
with open(TEXT_FILE, 'r', encoding='utf-8') as f:
    raw_text = f.read()

print(f"✓ Text loaded: {len(raw_text)} characters")
print("\n📄 Text preview (first 300 chars):")
print("-" * 60)
print(raw_text[:300] + "...")
print("-" * 60)

# Load Label Studio annotations
print("\n\nLoading Label Studio annotations...")
with open(LS_ANNOTATIONS_FILE, 'r', encoding='utf-8') as f:
    ls_data = json.load(f)

print(f"✓ Annotations loaded: {len(ls_data)} document(s)")

# Load schema for reference
print("\nLoading schema configuration...")
with open(SCHEMA_FILE, 'r', encoding='utf-8') as f:
    schema_config = json.load(f)

print(f"✓ Schema loaded: {schema_config['schema_name']}")


Loading raw text file...
✓ Text loaded: 3688 characters

📄 Text preview (first 300 chars):
------------------------------------------------------------
Se sont réunis en assemblée générale extraordinaire tous les actionnaires de la société anonyme établie à Anvers sous la dénomination de « Caucasian Manganèse C° Ltd », constituée par acte devant le notaire Leclef, soussigné, en date du vingt et un octobre mil neuf cent et sept, publié aux annexes d...
------------------------------------------------------------


Loading Label Studio annotations...
✓ Annotations loaded: 1 document(s)

Loading schema configuration...
✓ Schema loaded: Historical French NER


## Data Exploration {#explore-data}

Understand the structure of Label Studio annotations and extract entities:
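
The fields read in the next cell follow the nesting task → `annotations` → `result` → `value`. As a rough illustration of that layout, the sketch below builds a hand-written task and counts its entity types. The IDs and the `type` field are made up for illustration; the spans reuse two `DATE` entities from the example export.

```python
from collections import Counter

# Hand-written stand-in for one exported task; the IDs are illustrative only.
sample_export = [
    {
        "id": 1,
        "annotations": [
            {
                "id": 10,
                "result": [
                    {
                        "id": "a1",
                        "type": "labels",
                        "value": {"start": 1767, "end": 1779,
                                  "text": "4 avril 1909", "labels": ["DATE"]},
                    },
                    {
                        "id": "a2",
                        "type": "labels",
                        "value": {"start": 2399, "end": 2411,
                                  "text": "22 juin 1909", "labels": ["DATE"]},
                    },
                ],
            }
        ],
    }
]

def count_entity_types(export):
    """Count the first label of every labeled span in the first annotation of the first task."""
    results = export[0]["annotations"][0]["result"]
    return Counter(
        r["value"]["labels"][0]
        for r in results
        if r.get("value", {}).get("labels")  # skip results without labels
    )

type_counts = count_entity_types(sample_export)
print(type_counts)
```

Skipping results without a `labels` list keeps the count robust if the export also contains other result kinds.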


In [43]:
# Extract annotations from Label Studio export
# Label Studio format: list of task objects -> annotations -> results

task = ls_data[0]  # Get first (and likely only) task
annotations = task['annotations'][0]  # Get first annotation set
results = annotations['result']  # Get entity annotations

print(f"Total entities in Label Studio export: {len(results)}")
print(f"\nEntity types found:")

# Count entities by type
entity_types = {}
for result in results:
    labels = result['value']['labels']
    if labels:
        entity_type = labels[0]
        entity_types[entity_type] = entity_types.get(entity_type, 0) + 1

for entity_type, count in sorted(entity_types.items()):
    print(f"  • {entity_type:20s}: {count:3d}")

print("\n📋 Sample entities (first 5):")
print("-" * 80)
for i, result in enumerate(results[:5]):
    entity_text = result['value']['text']
    entity_label = result['value']['labels'][0]
    char_start = result['value']['start']
    char_end = result['value']['end']
    confidence = result['value'].get('score', 0)

    print(f"{i+1}. [{char_start:4d}-{char_end:4d}] {entity_text:30s} | {entity_label:20s} (conf: {confidence:.3f})")


Total entities in Label Studio export: 79

Entity types found:
  • ADDRESS             :   2
  • ARCHIVAL_REFERENCE  :   4
  • CAPITAL_TYPE        :   1
  • CITY                :  13
  • CORPORATE_TITLE     :   8
  • DATE                :   6
  • HONORIFICS          :  10
  • LEGAL_PROCEDURE     :   6
  • LEGAL_STRUCTURE     :   2
  • MISSION_STATEMENT   :   1
  • ORGANIZATION        :   1
  • PERSON              :  18
  • PROFESSION          :   4
  • REGISTERED_OFFICE   :   1
  • SHARE_QUANTITY      :   1
  • SHARE_TYPE          :   1

📋 Sample entities (first 5):
--------------------------------------------------------------------------------
1. [1767-1779] 4 avril 1909                   | DATE                 (conf: 0.000)
2. [2399-2411] 22 juin 1909                   | DATE                 (conf: 0.000)
3. [3576-3588] 15 juin 1909                   | DATE                 (conf: 0.000)
4. [2802-2814] 24 juin 1909                   | DATE          

## Convert Label Studio to GLiREL Format {#convert-format}

Transform Label Studio entity annotations into GLiREL-compatible input format.

### GLiREL Input Format

For relationship extraction, the input is organized into three parts:

1. **Text**: The original document text
2. **Entities**: A structured list of entities with their positions and types
3. **Relations**: (Optional for pre-annotation) Known relations between entities

This notebook stores them in the following JSON layout, with entity positions given as character offsets:
```json
{
  "text": "...",
  "entities": [
    {"id": "...", "type": "...", "start": 0, "end": 10}
  ],
  "relations": [
    {"head": "...", "tail": "...", "type": "..."}
  ]
}
```
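
The layout above stores character offsets, whereas the `predict_relations` call later in this notebook receives a token list and entity spans given as inclusive token indices (for example `[26, 27, 'PERSON', 'Marco Polo']`). A minimal sketch of that mapping follows. It assumes a simple regex tokenizer rather than spaCy, so the resulting indices only line up with a model's tokens if the same tokenizer produces the token list; `tokenize_with_offsets` and `char_span_to_token_span` are hypothetical helper names.

```python
import re

def tokenize_with_offsets(text):
    """Split text into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_token_span(token_offsets, char_start, char_end):
    """Return (first_token, last_token), both inclusive, for the tokens that
    overlap the character span, or None if no token overlaps it."""
    covered = [
        i for i, (_, start, end) in enumerate(token_offsets)
        if start < char_end and end > char_start
    ]
    if not covered:
        return None
    return covered[0], covered[-1]

sample = 'the 1964 First Doctor serial "Marco Polo".'
offsets = tokenize_with_offsets(sample)
tokens = [tok for tok, _, _ in offsets]

# Character offsets of the entity, as a Label Studio export would provide them
char_start = sample.index("Marco Polo")
char_end = char_start + len("Marco Polo")

span = char_span_to_token_span(offsets, char_start, char_end)
ner_entry = [span[0], span[1], "PERSON", sample[char_start:char_end]]

print(tokens)
print(ner_entry)
```

Using overlap rather than exact boundary matches means an entity whose offsets cut through a token still maps to whole tokens.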


In [44]:
import json
from pathlib import Path

def extract_entities_from_labelstudio(ls_data, text):
    """
    Extract entity information from Label Studio JSON export and format for GLiREL.

    Parameters:
    -----------
    ls_data : list
        Label Studio exported JSON (list of tasks)
    text : str
        Original text

    Returns:
    --------
    dict : GLiREL-formatted data with entities
    """

    task = ls_data[0]
    annotations = task['annotations'][0]
    results = annotations['result']

    # Convert Label Studio results to GLiREL entity format
    entities = []
    entity_id_counter = 0

    for result in results:
        # Extract entity information from Label Studio format
        entity_text = result['value']['text']
        entity_type = result['value']['labels'][0]  # Get first label
        char_start = result['value']['start']
        char_end = result['value']['end']
        confidence = result['value'].get('score', 0)

        # Create GLiREL entity entry
        gliren_entity = {
            "id": f"ent_{entity_id_counter}",
            "type": entity_type.upper(),
            "start": char_start,
            "end": char_end,
            "text": entity_text,
            "confidence": confidence,
            "ls_id": result['id']  # Keep reference to original Label Studio ID
        }

        entities.append(gliren_entity)
        entity_id_counter += 1

    # Create GLiREL input format
    gliren_input = {
        "text": text,
        "entities": entities,
        "metadata": {
            "source": "Label Studio",
            "task_id": task['id'],
            "annotation_id": annotations['id'],
            "num_entities": len(entities),
            "entity_types": list(set(e['type'] for e in entities))
        }
    }

    return gliren_input

# Convert Label Studio annotations to GLiREL format
print("Converting Label Studio annotations to GLiREL format...")
gliren_input = extract_entities_from_labelstudio(ls_data, raw_text)

print("✓ Conversion complete!")
print(f"  Total entities: {gliren_input['metadata']['num_entities']}")
print(f"  Entity types: {', '.join(gliren_input['metadata']['entity_types'])}")


Converting Label Studio annotations to GLiREL format...
✓ Conversion complete!
  Total entities: 79
  Entity types: LEGAL_PROCEDURE, PERSON, LEGAL_STRUCTURE, DATE, SHARE_QUANTITY, CAPITAL_TYPE, MISSION_STATEMENT, REGISTERED_OFFICE, CITY, ADDRESS, ARCHIVAL_REFERENCE, PROFESSION, SHARE_TYPE, ORGANIZATION, CORPORATE_TITLE, HONORIFICS


### Validate GLiREL Input Format

Ensure the converted data is properly formatted and can be used by GLiREL:


In [45]:
# Validate entity positions match the original text
print("Validating entity positions...")
validation_passed = True
errors = []

for i, entity in enumerate(gliren_input['entities']):
    entity_start = entity['start']
    entity_end = entity['end']
    entity_text_in_doc = raw_text[entity_start:entity_end]
    entity_text_stored = entity['text']

    # Check if entity position matches stored text
    if entity_text_in_doc != entity_text_stored:
        validation_passed = False
        errors.append(
            f"Entity {i} ({entity['id']}): "
            f"Text mismatch! "
            f"In document: '{entity_text_in_doc}' "
            f"vs Stored: '{entity_text_stored}'"
        )

if validation_passed:
    print("✓ All entity positions are valid!")
    print(f"  {len(gliren_input['entities'])} entities verified")
else:
    print(f"⚠️  Found {len(errors)} validation errors:")
    for error in errors:
        print(f"  - {error}")

# Display sample of converted entities
print("\n📋 Sample of converted GLiREL entities (first 5):")
print("-" * 100)

for entity in gliren_input['entities'][:5]:
    print(f"  ID: {entity['id']:8s} | Type: {entity['type']:20s} | "
          f"Pos: [{entity['start']:4d}-{entity['end']:4d}] | "
          f"Text: '{entity['text']:30s}' | Conf: {entity['confidence']:.3f}")


Validating entity positions...
✓ All entity positions are valid!
  79 entities verified

📋 Sample of converted GLiREL entities (first 5):
----------------------------------------------------------------------------------------------------
  ID: ent_0    | Type: DATE                 | Pos: [1767-1779] | Text: '4 avril 1909                  ' | Conf: 0.000
  ID: ent_1    | Type: DATE                 | Pos: [2399-2411] | Text: '22 juin 1909                  ' | Conf: 0.000
  ID: ent_2    | Type: DATE                 | Pos: [3576-3588] | Text: '15 juin 1909                  ' | Conf: 0.000
  ID: ent_3    | Type: DATE                 | Pos: [2802-2814] | Text: '24 juin 1909                  ' | Conf: 0.000
  ID: ent_4    | Type: DATE                 | Pos: [ 320- 354] | Text: 'six novembre mil neuf cent et sept' | Conf: 0.000


### Save GLiREL Input Files

Export the prepared data in formats suitable for GLiREL:
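
JSON Lines stores one JSON document per line, so a batch of documents can be read back one record at a time. The round-trip sketch below uses a temporary file and two made-up records rather than the notebook's actual output; `write_jsonl` and `read_jsonl` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path

def write_jsonl(path, records):
    """Write one JSON document per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up records that follow the text/entities layout used in this notebook
records = [
    {"text": "société anonyme", "entities": []},
    {"text": "4 avril 1909",
     "entities": [{"id": "ent_0", "type": "DATE", "start": 0, "end": 12}]},
]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "demo.jsonl"
    write_jsonl(jsonl_path, records)
    restored = read_jsonl(jsonl_path)

print(restored == records)
```

Passing `ensure_ascii=False` keeps accented characters readable in the file, which matters for French source text.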


In [46]:
import json
from pathlib import Path
from datetime import datetime

# Create output directory
output_dir = Path("gliren_input")
output_dir.mkdir(exist_ok=True)

# File paths
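gliren_json_file = output_dir / "example_text_gliren_input.json"
gliren_jsonl_file = output_dir / "example_text_gliren_input.jsonl"  # used by the JSONL export below
metadata_file = output_dir / "conversion_metadata.json"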

print("Saving GLiREL input files...")

# 1. Save as JSON (single document)
print(f"\n1. Saving as JSON: {gliren_json_file}")
with open(gliren_json_file, 'w', encoding='utf-8') as f:
    json.dump(gliren_input, f, indent=2, ensure_ascii=False)
print(f"   ✓ {gliren_json_file.stat().st_size / 1024:.1f} KB")

# 2. Save as JSONL (one entry per line, for batch processing)
print(f"\n2. Saving as JSONL: {gliren_jsonl_file}")
with open(gliren_jsonl_file, 'w', encoding='utf-8') as f:
    f.write(json.dumps(gliren_input, ensure_ascii=False) + '\n')
print(f"   ✓ {gliren_jsonl_file.stat().st_size / 1024:.1f} KB")

# 3. Save conversion metadata
print(f"\n3. Saving metadata: {metadata_file}")
conversion_metadata = {
    "source_file": TEXT_FILE,
    "ls_annotations_file": LS_ANNOTATIONS_FILE,
    "schema_file": SCHEMA_FILE,
    "conversion_timestamp": datetime.now().isoformat(),
    "text_length": len(raw_text),
    "num_entities": len(gliren_input['entities']),
    "entity_types_found": gliren_input['metadata']['entity_types'],
    "output_files": {
        "json": str(gliren_json_file),
        "jsonl": str(gliren_jsonl_file)
    }
}

with open(metadata_file, 'w', encoding='utf-8') as f:
    json.dump(conversion_metadata, f, indent=2, ensure_ascii=False)
print(f"   ✓ {metadata_file.stat().st_size / 1024:.1f} KB")

print(f"\n✓ All files saved to: {output_dir}/")
print("\nFiles created:")
print(f"  • {gliren_json_file.name} - Full GLiREL input (JSON)")
print(f"  • {gliren_jsonl_file.name} - GLiREL input for batch processing (JSONL)")
print(f"  • {metadata_file.name} - Conversion metadata")


Saving GLiREL input files...

1. Saving as JSON: gliren_input\example_text_gliren_input.json
   ✓ 19.0 KB

2. Saving as JSONL: gliren_input\example_text_gliren_input.jsonl
   ✓ 14.1 KB

3. Saving metadata: gliren_input\conversion_metadata.json
   ✓ 0.8 KB

✓ All files saved to: gliren_input/

Files created:
  • example_text_gliren_input.json - Full GLiREL input (JSON)
  • example_text_gliren_input.jsonl - GLiREL input for batch processing (JSONL)
  • conversion_metadata.json - Conversion metadata


---


## Relation Extraction with GLiREL {#extract-relations}

With the entities extracted and formatted, the next step is to run the GLiREL model to identify relationships between them.

### GLiREL labels
To extract relationships, GLiREL first needs to know which relation types to look for. Define each relation type in the schema file, optionally restricting the entity types allowed as its head and tail. The schema file should be formatted as follows:
```json
{
  "glirel_labels": {
    "RELATION_NAME_1": {
      "allowed_head": ["ENTITY_TYPE_A"],
      "allowed_tail": ["ENTITY_TYPE_B"]
    },
    "RELATION_NAME_2": {
      "allowed_head": ["ENTITY_TYPE_X", "ENTITY_TYPE_Y"],
      "allowed_tail": ["ENTITY_TYPE_Z"]
    },
    "RELATION_NAME_3": {
      "allowed_head": ["ENTITY_TYPE"],
      "allowed_tail": ["ENTITY_TYPE"]
    },
    "no relation": {}
  }
}
```
The 'gliren_input' folder contains a relation schema file, 'gliren_schema_relations.json'. Its relations are used in the following steps.
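
To make the `allowed_head` / `allowed_tail` constraints concrete, the sketch below checks whether a pair of entity types is permitted for a given relation. It is an illustration of the schema layout, not a description of how GLiREL applies the constraints internally. The relation names are hypothetical, and the function assumes that a missing key (as in `"no relation": {}`) means "any type".

```python
def is_pair_allowed(schema, relation, head_type, tail_type):
    """Check a head/tail entity-type pair against one relation's constraints.

    Assumption: a missing "allowed_head" or "allowed_tail" key means "any type".
    """
    rule = schema["glirel_labels"].get(relation)
    if rule is None:
        return False  # relation is not defined in the schema
    head_ok = "allowed_head" not in rule or head_type in rule["allowed_head"]
    tail_ok = "allowed_tail" not in rule or tail_type in rule["allowed_tail"]
    return head_ok and tail_ok

# Hypothetical relation names; the entity types come from the example export.
demo_schema = {
    "glirel_labels": {
        "HAS_PROFESSION": {"allowed_head": ["PERSON"], "allowed_tail": ["PROFESSION"]},
        "LOCATED_IN": {"allowed_head": ["ORGANIZATION", "PERSON"], "allowed_tail": ["CITY"]},
        "no relation": {},
    }
}

print(is_pair_allowed(demo_schema, "HAS_PROFESSION", "PERSON", "PROFESSION"))
print(is_pair_allowed(demo_schema, "HAS_PROFESSION", "CITY", "PROFESSION"))
```

A check like this can also be run over predicted relations afterwards, to drop any whose head or tail type falls outside the schema.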

In [61]:
!pip install spacy
!pip install glirel
!python -m spacy download en_core_web_sm





Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)


In [64]:
from glirel import GLiREL
import spacy

# Load the pretrained relation-extraction model and an English spaCy pipeline for tokenization.
# Note: in the recorded run below, this from_pretrained call fails with a TypeError.
model = GLiREL.from_pretrained("jackboyla/glirel-large-v0")
nlp = spacy.load('en_core_web_sm')

text = 'Derren Nesbitt had a history of being cast in "Doctor Who", having played villainous warlord Tegana in the 1964 First Doctor serial "Marco Polo".'
doc = nlp(text)
tokens = [token.text for token in doc]

# Candidate relation labels to score
labels = ['country of origin', 'licensed to broadcast to', 'father', 'followed by', 'characters']

# Entity spans as [start_token, end_token, type, text]; 'type' is not used -- it can be any string!
ner = [[26, 27, 'PERSON', 'Marco Polo'], [22, 23, 'Q2989412', 'First Doctor']]

relations = model.predict_relations(tokens, labels, threshold=0.0, ner=ner, top_k=1)
print('Number of relations:', len(relations))

# Sort predictions from highest to lowest score
sorted_data_desc = sorted(relations, key=lambda x: x['score'], reverse=True)

print("\nDescending Order by Score:")
for item in sorted_data_desc:
    print(f"{item['head_text']} --> {item['label']} --> {item['tail_text']} | score: {item['score']}")



TypeError: GLiREL._from_pretrained() missing 2 required keyword-only arguments: 'proxies' and 'resume_download'
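
The `TypeError` above is raised inside `from_pretrained`, before any text is processed. One possible cause (an assumption, not verified in this notebook) is a mismatch between the installed `glirel` and `huggingface_hub` versions. The sketch below lists the installed versions as a starting point for deciding what to pin; `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

versions = installed_versions(["glirel", "huggingface_hub", "spacy"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

Recording these versions next to the error message makes the failure easier to reproduce and report.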