# Extracting specific information from text

This notebook shows how the package LangExtract (from Google) can be used to extract information from text data, using LLMs. 
For this, we use the example of extracting characters, emotions, and relationships from Romeo and Juliet.

First, we have to install the package. Note: we use the extras notation `[openai]` to install the OpenAI dependencies, which allow us to use Nebius models through their OpenAI-compatible API.

In [None]:
# pip install "langextract[openai]"

Now we import the packages we need:

In [None]:
import langextract as lx
import textwrap
from collections import Counter
import os
from langextract.providers.openai import OpenAILanguageModel
import requests


This part took me three hours to figure out.
Here, we declare the model we want to use: a Nebius-hosted model, accessed through LangExtract's OpenAI provider.

In [None]:
lm = OpenAILanguageModel(
    model_id="google/gemma-2-9b-it-fast",
    api_key=os.environ.get("HCAI_NEBIUS_API_KEY"), # Replace this with your own API Key if needed.
    base_url="https://api.tokenfactory.nebius.com/v1/",
)
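
`os.environ.get` returns `None` without complaint when the variable is missing, so a typo in the variable name is easy to miss. A minimal sketch of an early check, assuming a hypothetical helper `require_env` (it is not part of LangExtract):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for this notebook; not part of LangExtract.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return value


# Usage: api_key = require_env("HCAI_NEBIUS_API_KEY")
```

The returned value could then be passed as `api_key` when constructing the model.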

Let's quickly check that the right model is loaded. (For me, this line helped A LOT in debugging.)

In [None]:
print(f"Using model: {lm.model_id}")
print(f"Base URL: {lm.base_url}")

Okay, now that we have the right model, we can worry about the data. We could use the entire play, but for the sake of brevity, let's keep it short and take only the first part of the text.

In [None]:

# Download the text
url = "https://www.gutenberg.org/files/1513/1513-0.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
full_text = response.text

# Grab the first 5000 characters
partial_text = full_text[:5000]

print("sample:\n", partial_text[200:300])
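
Project Gutenberg plain-text files usually open with a license header, so part of the first 5000 characters may not be play text. A minimal sketch of skipping that header, assuming the file contains the usual `*** START OF` marker line; `strip_gutenberg_header` is a hypothetical helper and the sample string is made up:

```python
def strip_gutenberg_header(text: str, marker: str = "*** START OF") -> str:
    """Return the text that follows the Gutenberg start-marker line.

    Assumes the usual "*** START OF ..." marker line; if the marker is
    missing, the text is returned unchanged.
    """
    start = text.find(marker)
    if start == -1:
        return text
    line_end = text.find("\n", start)
    if line_end == -1:
        return ""
    return text[line_end + 1:].lstrip()


# Made-up miniature file for illustration.
sample = "License text\n*** START OF THE PROJECT GUTENBERG EBOOK ***\n\nTHE PROLOGUE\n"
body = strip_gutenberg_header(sample)
print(repr(body))
```

Applied to the notebook, this would mean slicing `strip_gutenberg_header(full_text)[:5000]` instead of `full_text[:5000]`.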


Okay, now that we have the data, let's define the system prompt.
This prompt was copied from the [tutorial example](https://github.com/google/langextract/blob/main/docs/examples/longer_text_example.md) and then adjusted to make the output more consistent, since we use a much weaker LLM than the tutorial does.

In [None]:
# Define comprehensive prompt and examples for complex literary text
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships from the given text.

    Provide meaningful attributes for every entity to add context and depth.

    Important: Use exact text from the input for extraction_text. Do not paraphrase.
    Extract entities in order of appearance with no overlapping text spans.

    Note: In play scripts, speaker names appear in ALL-CAPS followed by a period.
                         
    Rules:
    - extraction_text MUST be a short string from the original text (max 50 characters)
    - Use exact text from input, no paraphrasing
    - Extract entities in order of appearance
    - Each extraction_text must be a simple string value
    - In play scripts, speaker names are in ALL-CAPS followed by a period
    
    Output valid JSON only.""")

Now we can add some examples to improve the performance.

In [None]:
examples = [
    lx.data.ExampleData(
        text=textwrap.dedent("""\
            ROMEO. But soft! What light through yonder window breaks?
            It is the east, and Juliet is the sun.
            JULIET. O Romeo, Romeo! Wherefore art thou Romeo?"""),
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe", "character": "Romeo"}
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet is the sun",
                attributes={"type": "metaphor", "character_1": "Romeo", "character_2": "Juliet"}
            ),
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="JULIET",
                attributes={"emotional_state": "yearning"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="Wherefore art thou Romeo?",
                attributes={"feeling": "longing question", "character": "Juliet"}
            ),
        ]
    )
]
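
The prompt asks for exact text in order of appearance, so the examples themselves should follow those rules. A minimal sketch of a consistency check on plain strings, assuming a hypothetical helper `check_example` (not part of LangExtract):

```python
def check_example(text: str, spans: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the example is consistent."""
    problems = []
    cursor = 0  # position just after the previously matched span
    for span in spans:
        if span not in text:
            problems.append(f"not verbatim: {span!r}")
            continue
        position = text.find(span, cursor)
        if position == -1:
            problems.append(f"out of order or overlapping: {span!r}")
            continue
        cursor = position + len(span)
    return problems


example_text = (
    "ROMEO. But soft! What light through yonder window breaks?\n"
    "It is the east, and Juliet is the sun.\n"
    "JULIET. O Romeo, Romeo! Wherefore art thou Romeo?"
)
example_spans = ["ROMEO", "But soft!", "Juliet is the sun", "JULIET", "Wherefore art thou Romeo?"]
print(check_example(example_text, example_spans))  # []
```

With the real objects, the inputs would be `examples[0].text` and `[e.extraction_text for e in examples[0].extractions]`.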

Okay, now we can actually run the extraction by passing the text, the prompt, and the examples to the model.

In [None]:
result = lx.extract(
    # text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt", # can also feed it the entire text at once using a URL.
    
    # Feed the model the document, prompt and examples
    text_or_documents=partial_text,
    prompt_description=prompt,
    examples=examples,
    
    # Declare that it should use the Nebius model we defined above
    model=lm,

    # Hyperparameters:
    extraction_passes=1,          # Single pass; raise this for better recall at the cost of more API calls
    max_workers=1,                # No parallelism; raise this to process chunks in parallel
    max_char_buffer=400,          # Smaller contexts for better accuracy
    temperature=0.0,              # Lower temperature for more consistent output
    fence_output=True,            # Expect the model's answer wrapped in a Markdown code fence
    use_schema_constraints=True,  # Enable schema constraints
    format_type="json",           # Explicitly specify JSON format
    batch_length=1,               # Process one chunk at a time
    resolver_params={
        "suppress_parse_errors": True,  # Continue on parse errors
    },
)

print(f"\nExtracted {len(result.extractions)} entities from {len(result.text):,} characters")
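
With `max_char_buffer=400`, the text is processed in chunks of at most about 400 characters. The sketch below gives a rough lower bound on the number of chunks. It assumes the text is cut purely by length, whereas LangExtract also respects text boundaries when chunking, so the real number may be higher. `min_chunks` is a hypothetical helper:

```python
import math


def min_chunks(text_length: int, max_char_buffer: int) -> int:
    """Lower bound on the number of chunks for a text of the given length."""
    return math.ceil(text_length / max_char_buffer)


print(min_chunks(5000, 400))  # 13
```

Each extra extraction pass repeats the work for every chunk, which is worth keeping in mind before raising `extraction_passes`.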

Some of the outputs might be faulty, since we use a very weak model, so let's see if we can remove some of the errors with a little post-processing code.

In [None]:
valid_extractions = []
for ext in result.extractions:
    if isinstance(ext.extraction_text, (str, int, float)) and ext.extraction_text:
        valid_extractions.append(ext)
    else:
        print(f"Skipping invalid extraction: {ext}")

print(f"Valid extractions: {len(valid_extractions)}")
result.extractions = valid_extractions

If you run this multiple times, you will get a different output each time. In my case, the first run produced 101 extractions and the second one 71.
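
To see how much two runs actually differ, the extractions can be compared as multisets. This is a minimal sketch with a hypothetical helper `compare_runs`, and the pairs below are made up for illustration; they are not real model output:

```python
from collections import Counter


def compare_runs(run_a, run_b):
    """Compare two runs given as lists of (extraction_class, extraction_text) pairs."""
    a, b = Counter(run_a), Counter(run_b)
    return {
        "shared": sum((a & b).values()),
        "only_in_a": sum((a - b).values()),
        "only_in_b": sum((b - a).values()),
    }


# Made-up pairs for illustration, not real model output.
run_1 = [("character", "ROMEO"), ("character", "JULIET"), ("emotion", "But soft!")]
run_2 = [("character", "ROMEO"), ("emotion", "But soft!"), ("emotion", "O Romeo")]
print(compare_runs(run_1, run_2))
```

For real results, each list could be built as `[(e.extraction_class, e.extraction_text) for e in result.extractions]`, keeping one such list per run.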

The LangExtract package can then visualize the results for you; we do this with the code below.

In [None]:
if valid_extractions:
    lx.io.save_annotated_documents([result], output_name="romeo_juliet_extractions.jsonl", output_dir=".")
    
    # Generate the interactive visualization
    html_content = lx.visualize("romeo_juliet_extractions.jsonl")
    with open("romeo_juliet_visualization.html", "w", encoding="utf-8") as f:
        if hasattr(html_content, 'data'):
            f.write(html_content.data)
        else:
            f.write(html_content)
    
    print("Interactive visualization saved to romeo_juliet_visualization.html")
else:
    print("No valid extractions found.")

Open the HTML page and see how well this worked!

Below, we can run some further analysis on the output data; this example is also taken straight from the tutorial.

In [None]:
# Analyze character mentions
characters = {}
for e in result.extractions:
    if e.extraction_class == "character":
        char_name = e.extraction_text
        if char_name not in characters:
            characters[char_name] = {"count": 0, "attributes": set()}
        characters[char_name]["count"] += 1
        if e.attributes:
            for attr_key, attr_val in e.attributes.items():
                characters[char_name]["attributes"].add(f"{attr_key}: {attr_val}")

# Print character summary
print(f"\nCHARACTER SUMMARY ({len(characters)} unique characters)")
print("=" * 60)

sorted_chars = sorted(characters.items(), key=lambda x: x[1]["count"], reverse=True)
for char_name, char_data in sorted_chars[:10]:  # Top 10 characters
    attrs_preview = list(char_data["attributes"])[:3]
    attrs_str = f" ({', '.join(attrs_preview)})" if attrs_preview else ""
    print(f"{char_name}: {char_data['count']} mentions{attrs_str}")

# Entity type breakdown
entity_counts = Counter(e.extraction_class for e in result.extractions)
print("\nENTITY TYPE BREAKDOWN")
print("=" * 60)
for entity_type, count in entity_counts.most_common():
    percentage = (count / len(result.extractions)) * 100
    print(f"{entity_type}: {count} ({percentage:.1f}%)")
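
The same pattern extends to the other entity types. As a minimal sketch, emotions can be grouped by the `character` attribute used in the examples above. It assumes `attributes` is a dict and uses stand-in objects with made-up values instead of real model output; `emotions_by_character` is a hypothetical helper:

```python
from collections import defaultdict
from types import SimpleNamespace


def emotions_by_character(extractions):
    """Map each character name to the list of feelings attributed to it."""
    grouped = defaultdict(list)
    for e in extractions:
        if e.extraction_class != "emotion" or not e.attributes:
            continue
        character = e.attributes.get("character")
        feeling = e.attributes.get("feeling")
        if character and feeling:
            # Normalize casing so "JULIET" and "Juliet" end up in the same group.
            grouped[str(character).title()].append(str(feeling))
    return dict(grouped)


# Stand-in objects with made-up values; real code would pass result.extractions.
demo = [
    SimpleNamespace(extraction_class="emotion", extraction_text="But soft!",
                    attributes={"feeling": "gentle awe", "character": "Romeo"}),
    SimpleNamespace(extraction_class="character", extraction_text="JULIET",
                    attributes={"emotional_state": "yearning"}),
    SimpleNamespace(extraction_class="emotion", extraction_text="Wherefore art thou Romeo?",
                    attributes={"feeling": "longing question", "character": "JULIET"}),
]
print(emotions_by_character(demo))
```

Since a weak model may leave out or rename attributes, entries without both keys are skipped rather than treated as errors.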