[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/use_cases/finance/03_Earnings_Call_Analysis.ipynb)

# Earnings Call Analysis with Docling, Semantica & AWS Neptune  
MDA Space Ltd. (Q3 2025)

## Data Sources

This notebook analyzes two financial documents from **MDA Space Ltd.** for Q3 2025:

1. **Press Release**: Summary of financial results and management commentary
2. **Earnings Call Transcript**: Management presentation and analyst Q&A

Together, these documents provide both quantitative results and qualitative context.

---

## Overview

This notebook demonstrates an end-to-end semantic pipeline for transforming
unstructured financial documents into structured, queryable knowledge.

**Docling** is used for high-fidelity document parsing. **Semantica** performs
semantic extraction, cleaning, validation, and knowledge graph construction.
The final knowledge graph is stored in **AWS Neptune** and used for hybrid
retrieval, agent memory, and grounded question answering.

---

## End-to-End Workflow

**Workflow:**  
Dual PDF Input → Docling Parsing → Normalization & Chunking → Entity & Relation Extraction → Conflict Resolution & Deduplication → Knowledge Graph Construction → Amazon Neptune → GraphRAG → Agent Memory & Context → Strategic Q&A

---

## Pipeline Capabilities

- High-fidelity PDF parsing (text, tables, structure)  
- Semantic extraction of entities and relationships  
- Conflict detection and resolution with confidence awareness  
- Entity deduplication and canonicalization  
- Knowledge graph construction and validation  
- Persistent graph storage in **AWS Neptune** (IAM, OpenCypher)  
- Hybrid retrieval using **GraphRAG** (vector + graph)  
- Long-term agent memory and unified context management  
- Grounded LLM-based question answering  
- Structured export to JSON and RDF formats  

---

## Outcome

The output is a cleaned, deduplicated knowledge graph stored in **AWS Neptune**,
along with supporting context for hybrid retrieval and question answering.
This enables reliable financial analysis and downstream applications built
on structured, traceable knowledge.

In [9]:
!pip install git+https://github.com/Hawksight-AI/semantica.git@main
!pip install -qU  docling pdfplumber groq


Collecting git+https://github.com/Hawksight-AI/semantica.git@main
  Cloning https://github.com/Hawksight-AI/semantica.git (to revision main) to /tmp/pip-req-build-q2cf5now
  Running command git clone --filter=blob:none --quiet https://github.com/Hawksight-AI/semantica.git /tmp/pip-req-build-q2cf5now
  Resolved https://github.com/Hawksight-AI/semantica.git to commit a6b102fa3d1d31e05593bcddc86a1124feb39e66
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [2]:
from getpass import getpass
from semantica.llms import Groq

GROQ_API_KEY = getpass("Enter your GROQ API key: ")

if not GROQ_API_KEY:
    raise ValueError("GROQ API key is required")

groq_llm = Groq(
    model="llama-3.1-8b-instant",
    api_key=GROQ_API_KEY,
)

print(f"✓ Groq LLM initialized: {groq_llm.model}")



Enter your GROQ API key: ··········
✓ Groq LLM initialized: llama-3.1-8b-instant


## Step 1: Parse PDF with Docling

Download the press release and the earnings call transcript, then parse both PDFs (text and financial tables) with DoclingParser.


In [3]:
import requests
from pathlib import Path
from semantica.parse import DoclingParser

parser = DoclingParser()

PRESS_RELEASE_URL = "https://filecache.investorroom.com/mr5ircnw_mda/677/MDA_Space_Ltd_Q3_2025_Press_Release_Nov_14_2025_FINAL.pdf"
TRANSCRIPT_URL = "https://filecache.investorroom.com/mr5ircnw_mda/681/MDA%20Space%20Ltd.%20Q3%202025%20Earnings%20Conference%20Call%20Transcript%20%28November%2014%202025%29.pdf"

download_dir = Path("downloads")
download_dir.mkdir(exist_ok=True)

press_release_pdf = download_dir / "mda_space_q3_2025_press_release.pdf"
transcript_pdf = download_dir / "mda_space_q3_2025_transcript.pdf"

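# Download each PDF once; fail early on HTTP errors instead of caching an error page.
for url, path in [
    (PRESS_RELEASE_URL, press_release_pdf),
    (TRANSCRIPT_URL, transcript_pdf),
]:
    if not path.exists():
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        path.write_bytes(response.content)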

try:
    press_release = parser.parse(press_release_pdf)
    transcript = parser.parse(transcript_pdf)
except Exception as e:
    print("Parsing failed")
    print(e)
    print("Using fallback empty documents.")
    press_release = {"full_text": "", "tables": []}
    transcript = {"full_text": "", "tables": []}

parsed_doc = {
    "full_text": (
        "# Press Release\n\n"
        f"{press_release['full_text']}\n\n"
        "# Transcript\n\n"
        f"{transcript['full_text']}"
    ),
    "tables": press_release["tables"] + transcript["tables"],
    "metadata": {
        "title": "MDA Space Ltd. Q3 2025 Earnings Analysis",
        "company": "MDA Space Ltd.",
        "quarter": "Q3 2025",
        "date": "November 14, 2025",
    },
}

print("Parsing completed")
print("Documents processed: 2")
print("Tables extracted:", len(parsed_doc["tables"]))

Status,Action,Module,Submodule,Progress,ETA,Rate,Time,Extracted (Docling)
✅,Semantica is parsing,🔍 parse,DoclingParser,100.0% (9/10),-,0.0/s,229.58s,"8 tables, 0 images, 10 pages"
✅,Semantica is parsing,🔍 parse,DoclingParser,100.0% (9/10),-,0.1/s,149.97s,"0 tables, 0 images, 37 pages"

Parsing completed
Documents processed: 2
Tables extracted: 8
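
A successful HTTP response is not necessarily a PDF: a document portal can return an HTML notice instead, which would then be cached under a `.pdf` name and only fail inside the parser. Below is a minimal sketch of a pre-parse check that assumes nothing beyond the standard `%PDF-` file signature. `looks_like_pdf` is a hypothetical helper, not part of Semantica or Docling.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload carries the PDF signature near its start."""
    # PDF files begin with "%PDF-"; some producers prepend a few bytes,
    # so search the first 1024 bytes rather than only offset 0.
    return b"%PDF-" in data[:1024]


pdf_payload = b"%PDF-1.7\n% rest of the file"
html_payload = b"<html><body>Document not available</body></html>"

print(looks_like_pdf(pdf_payload))
print(looks_like_pdf(html_payload))
```

A check like this could run on the downloaded bytes before they are written to `downloads/`.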


## Step 2: Normalize Text

Normalize extracted text using TextNormalizer for consistent processing.


In [4]:
from semantica.normalize import TextNormalizer

normalizer = TextNormalizer()

normalized_text = normalizer.normalize(
    parsed_doc["full_text"],
    clean_html=False,
    remove_extra_whitespace=False,
    lowercase=False,
)

print("Normalization completed")
print("Normalized text length:", len(normalized_text))

Normalization completed
Normalized text length: 85180


## Step 3: Split Text into Chunks

Split the normalized text into overlapping chunks to enable scalable and accurate entity and relation extraction.
This step prepares the text for LLM-based semantic processing.

In [5]:
from semantica.split import TextSplitter

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 250

splitter = TextSplitter(
    method="recursive",
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

chunks = splitter.split(normalized_text)

def get_chunk_text(chunk):
    return getattr(chunk, "content", getattr(chunk, "text", ""))

print("Chunking completed")
print("Total chunks:", len(chunks))
print("Sample chunk:")
print(get_chunk_text(chunks[0]))

Chunking completed
Total chunks: 148
Sample chunk:
# Press Release

## NEWS RELEASE

## MDA SPACE REPORTS THIRD QUARTER 2025 RESULTS

- Q3 2025 Highlights
- o Backlog of $4.4 billion at quarter-end, provides revenue visibility for 2025 and beyond
- o Revenues of $409.8 million, up 45% YoY
- o Adjusted EBITDA 1 of $82.8 million, up 49% YoY, and adjusted EBITDA margin 1 of 20.2%
- o Adjusted net income 1 of $46.1 million, up 33% YoY, and adjusted diluted earnings per share 1 of $0.35, up 25% YoY
- o Operating cash flow of $32.8 million
- o Net debt to adjusted EBITDA 1 ratio of 0.3x at quarter-end
- Reaffirmed 2025 full-year financial outlook

Brampton, Ontario (November 14, 2025) -- MDA Space Ltd. (TSX: MDA), a trusted space mission partner to the rapidly expanding global space industry, today announced its financial results for the third quarter ended September 30, 2025.
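
To make the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings concrete, here is a minimal sketch of fixed-window chunking with overlap. It is not the `recursive` method used above, which presumably breaks on separators such as headings and paragraphs and therefore produces shorter, more numerous chunks (148 in this run). `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 250) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # 750 characters with the notebook's settings
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 85180  # same length as the normalized text in this run
windows = chunk_text(sample)
print(len(windows))  # 114 fixed windows
# Consecutive windows share exactly `overlap` characters.
print(windows[0][-250:] == windows[1][:250])
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of extracting some entities twice.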


## Step 4: Extract Entities

Extract entities (organizations, people, financial terms) using NERExtractor with Groq LLM.


In [6]:
from semantica.semantic_extract import NERExtractor

ner = NERExtractor(
    method="llm",
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    temperature=0.0,
    api_key=GROQ_API_KEY,
)

ENTITY_TYPES = ["ORGANIZATION", "PERSON", "MONEY", "PERCENT", "DATE", "EVENT"]

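# Filter out empty chunks before calling the extractor, so that no LLM
# request is spent on blank text.
all_entities = [
    e
    for c in chunks
    if get_chunk_text(c).strip()
    for e in ner.extract_entities(
        get_chunk_text(c),
        entity_types=ENTITY_TYPES,
    )
]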

print("Entities:", len(all_entities))

Entities: 3426


## Step 5: Extract Financial Metrics

Extract financial metrics (monetary amounts, percentages, and quantities) from the text chunks.


In [7]:
FINANCIAL_ENTITY_TYPES = [
    "MONEY", "CURRENCY", "PERCENT", "PERCENTAGE",
    "QUANTITY", "CARDINAL",
]

financial_entities = []

for chunk in chunks:
    text = get_chunk_text(chunk)
    if text.strip():
        financial_entities += ner.extract_entities(
            text,
            entity_types=FINANCIAL_ENTITY_TYPES,
        )

money, percentages, quantities = [], [], []

for e in financial_entities:
    label = e.label.upper()
    if label in ("MONEY", "CURRENCY"):
        money.append(e.text)
    elif label in ("PERCENT", "PERCENTAGE"):
        percentages.append(e.text)
    elif label in ("CARDINAL", "QUANTITY"):
        quantities.append(e.text)

print("\nFinancial entity extraction completed")
print("Total financial entities:", len(financial_entities))
print("Money:", len(money))
print("Percentages:", len(percentages))
print("Quantities:", len(quantities))

if financial_entities:
    print("Sample:", f"{financial_entities[0].text} ({financial_entities[0].label})")

Financial entity extraction completed
Total financial entities: 4736
Money: 843
Percentages: 486
Quantities: 2330
Sample: $4.4 billion (MONEY)
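
The extracted values are still strings such as `$4.4 billion`, so they cannot be compared or aggregated directly. Below is a minimal sketch of converting dollar amounts to numbers. `parse_money` and its regular expression are hypothetical helpers, not part of Semantica, and only cover the `$<number> <scale>` pattern seen in this press release.

```python
import re
from decimal import Decimal

_SCALE = {
    "thousand": Decimal(10) ** 3,
    "million": Decimal(10) ** 6,
    "billion": Decimal(10) ** 9,
}
_MONEY_RE = re.compile(
    r"\$\s*(?P<number>\d[\d,]*(?:\.\d+)?)\s*(?P<scale>thousand|million|billion)?",
    re.IGNORECASE,
)


def parse_money(text: str):
    """Convert a string such as '$409.8 million' to a Decimal, or None if no amount is found."""
    match = _MONEY_RE.search(text)
    if match is None:
        return None
    value = Decimal(match.group("number").replace(",", ""))
    scale = match.group("scale")
    return value * _SCALE[scale.lower()] if scale else value


for raw in ["$4.4 billion", "$409.8 million", "$0.35", "20.2%"]:
    print(raw, "->", parse_money(raw))
```

`Decimal` is used instead of `float` so that reported figures keep their exact decimal value.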


## Step 5b: Extract Relationships

Extract relationships between entities using RelationExtractor with Groq LLM.


In [11]:
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from semantica.semantic_extract import RelationExtractor

MAX_ENTITIES = 30
CHUNK_TIMEOUT = 60

relation_extractor = RelationExtractor(
    method="llm",
    confidence_threshold=0.6,
    relation_types=[
        "HAS_REVENUE",
        "HAS_GROWTH",
        "REPORTS",
        "PROVIDES_GUIDANCE",
        "IN_QUARTER",
        "FOR_PERIOD",
        "RELATED_TO",
    ],
    provider="groq",
    llm_model="llama-3.1-8b-instant",
    api_key=GROQ_API_KEY,
    temperature=0.0,
    verbose=False,
)

def filter_entities(text, entities):
    t = text.lower()
    return [e for e in entities if e.text.lower() in t]

def process_chunk(idx, chunk, total):
    text = get_chunk_text(chunk).strip()
    if not text:
        remaining = total - (idx + 1)
        print(f"Chunk {idx+1}/{total} | remaining {remaining} | skipped (empty)")
        return []

    chunk_entities = filter_entities(text, all_entities)[:MAX_ENTITIES]

    remaining = total - (idx + 1)

    if len(chunk_entities) < 2:
        print(f"Chunk {idx+1}/{total} | remaining {remaining} | skipped (entities={len(chunk_entities)})")
        return []

    print(f"Chunk {idx+1}/{total} | remaining {remaining} | entities={len(chunk_entities)}")

    return relation_extractor.extract_relations(
        text=text,
        entities=chunk_entities,
        verbose=False,
    )

relationships = []
total_chunks = len(chunks)

with ThreadPoolExecutor(max_workers=1) as executor:
    for i, c in enumerate(chunks):
        future = executor.submit(process_chunk, i, c, total_chunks)
        try:
            rels = future.result(timeout=CHUNK_TIMEOUT)
            relationships.extend(rels)
            print(f"  relations={len(rels)}")
        except TimeoutError:
            remaining = total_chunks - (i + 1)
            print(f"Chunk {i+1}/{total_chunks} | remaining {remaining} | timed out")
        except Exception as e:
            remaining = total_chunks - (i + 1)
            print(f"Chunk {i+1}/{total_chunks} | remaining {remaining} | failed: {e}")

print(f"Done {total_chunks}/{total_chunks}")
print(f"Total relationships: {len(relationships)}")

if relationships:
    for r in relationships[:10]:
        print(f"{r.subject.text} → {r.predicate} → {r.object.text}")

  relations=1
Done 148/148
Total relationships: 6933
MDA Space Ltd. → HAS_REVENUE → $409.8 million
MDA Space Ltd. → HAS_GROWTH → 20.2%
MDA Space Ltd. → HAS_REVENUE → $4.4 billion
MDA Space Ltd. → HAS_GROWTH → 20.2%
MDA Space Ltd. → HAS_GROWTH → 0.3x
MDA Space Ltd. → REPORTS → Q3 2025
MDA Space Ltd. → PROVIDES_GUIDANCE → 2025
MDA Space Ltd. → IN_QUARTER → Q3 2025
MDA Space Ltd. → FOR_PERIOD → 2025
MDA Space Ltd. → REPORTS → September 30, 2025
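
Because chunks overlap, the same relation is often extracted more than once, as the repeated `HAS_GROWTH → 20.2%` line in the sample shows. Below is a minimal sketch of counting and collapsing repeated triples before any further processing. The `normalize` helper is hypothetical and the triples are copied from the sample output above.

```python
from collections import Counter

# (subject, predicate, object) triples taken from the sample output.
triples = [
    ("MDA Space Ltd.", "HAS_REVENUE", "$409.8 million"),
    ("MDA Space Ltd.", "HAS_GROWTH", "20.2%"),
    ("MDA Space Ltd.", "HAS_REVENUE", "$4.4 billion"),
    ("MDA Space Ltd.", "HAS_GROWTH", "20.2%"),
    ("MDA Space Ltd.", "HAS_GROWTH", "0.3x"),
]


def normalize(triple):
    """Case-fold and trim subject and object so trivial variants compare equal."""
    subject, predicate, obj = triple
    return (subject.strip().lower(), predicate, obj.strip().lower())


counts = Counter(normalize(t) for t in triples)
unique_triples = list(counts)                      # first-seen order is preserved
repeated = {t: n for t, n in counts.items() if n > 1}

print("unique:", len(unique_triples))
print("repeated:", repeated)
```

The sample also suggests that the model forces values into the nearest allowed relation type: according to the press release highlights, $4.4 billion is the backlog, 20.2% is the adjusted EBITDA margin, and 0.3x is the net debt to adjusted EBITDA ratio. Spot-checking a few triples against the source text is worthwhile before relying on them.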


## Step 6: Detect Conflicts

Detect conflicts in extracted entities and relationships using ConflictDetector.


In [19]:
from semantica.conflicts import SourceTracker, SourceReference, ConflictDetector

source_tracker = SourceTracker()

conflict_detector = ConflictDetector(
    source_tracker=source_tracker,
    similarity_threshold=0.8,
    confidence_threshold=0.7,
)

entities = all_entities
extracted_relationships = relationships

for e in entities:
    entity_id = getattr(e, "id", None) or e.text
    source_tracker.track_property_source(
        entity_id=entity_id,
        property_name="name",
        value=e.text,
        source=SourceReference(
            document="earnings_call",
            timestamp="2025-Q3",
            metadata={"entity_type": getattr(e, "label", "UNKNOWN")},
        ),
    )

entity_records = [
    {
        "id": getattr(e, "id", None) or e.text,
        "name": e.text,
    }
    for e in entities
]

entity_value_conflicts = conflict_detector.detect_value_conflicts(
    entity_records,
    property_name="name",
)

normalized_relationships = [
    {
        "id": getattr(r, "id", None),
        "source_id": getattr(r.subject, "id", None) or r.subject.text,
        "target_id": getattr(r.object, "id", None) or r.object.text,
        "type": r.predicate,
        "confidence": getattr(r, "confidence", 1.0),
        "metadata": {},
    }
    for r in extracted_relationships
]

relationship_conflicts = conflict_detector.detect_relationship_conflicts(
    normalized_relationships
)

print("Conflict detection completed")
print("Entity value conflicts:", len(entity_value_conflicts))
print("Relationship conflicts:", len(relationship_conflicts))

if entity_value_conflicts:
    print("\nSample entity conflict:")
    print(entity_value_conflicts[0])

if relationship_conflicts:
    print("\nSample relationship conflict:")
    print(relationship_conflicts[0])

Conflict detection completed
Entity value conflicts: 0
Relationship conflicts: 62

Sample relationship conflict:
Conflict(conflict_id='MDA Space Ltd._2025_PROVIDES_GUIDANCE_confidence_conflict', conflict_type=<ConflictType.RELATIONSHIP_CONFLICT: 'relationship_conflict'>, entity_id=None, property_name='confidence', relationship_id='MDA Space Ltd._2025_PROVIDES_GUIDANCE', conflicting_values=[0.9875, 0.95, 0.975, 0.925], sources=[], confidence=0.8, severity='medium', recommended_action='Review relationship definition', metadata={})
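
The sample conflict groups four extractions of the same relationship that differ only in confidence. Below is a minimal sketch of how such a conflict can be found, assuming relationships are keyed by source, target, and type (which matches the `conflict_id` shown above). This illustrates the idea only and is not `ConflictDetector`'s actual implementation; `find_confidence_conflicts` is a hypothetical helper.

```python
from collections import defaultdict


def find_confidence_conflicts(relationships):
    """Return {(source, target, type): [confidences]} for keys whose confidences disagree."""
    groups = defaultdict(list)
    for rel in relationships:
        key = (rel["source_id"], rel["target_id"], rel["type"])
        groups[key].append(rel["confidence"])
    return {key: values for key, values in groups.items() if len(set(values)) > 1}


rels = [
    {"source_id": "MDA Space Ltd.", "target_id": "2025",
     "type": "PROVIDES_GUIDANCE", "confidence": c}
    for c in (0.9875, 0.95, 0.975, 0.925)
]
# Two identical extractions agree with each other, so they are not a conflict.
rels += [
    {"source_id": "MDA Space Ltd.", "target_id": "Q3 2025",
     "type": "REPORTS", "confidence": 0.9},
    {"source_id": "MDA Space Ltd.", "target_id": "Q3 2025",
     "type": "REPORTS", "confidence": 0.9},
]

conflicts = find_confidence_conflicts(rels)
for key, values in conflicts.items():
    print(key, values)
```

Conflicts of this kind are duplicates with slightly different scores rather than contradictory facts, which is worth keeping in mind when reading the count of 62.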


## Step 7: Resolve Conflicts

Resolve detected conflicts using ConflictResolver with voting strategy.


In [20]:
from semantica.conflicts import ConflictResolver

conflict_resolver = ConflictResolver(
    default_strategy="voting",
    source_tracker=source_tracker,
)

resolved_entity_value_conflicts = []
resolved_relationship_conflicts = []

for conflict in entity_value_conflicts:
    resolved_entity_value_conflicts.append(
        conflict_resolver.resolve_conflict(
            conflict,
            strategy="voting",
        )
    )

for conflict in relationship_conflicts:
    resolved_relationship_conflicts.append(
        conflict_resolver.resolve_conflict(
            conflict,
            strategy="voting",
        )
    )

print("Conflict resolution completed")
print("Entity value conflicts resolved:", len(resolved_entity_value_conflicts))
print("Relationship conflicts resolved:", len(resolved_relationship_conflicts))

if resolved_entity_value_conflicts:
    print("\nSample resolved entity conflict:")
    print(resolved_entity_value_conflicts[0])

if resolved_relationship_conflicts:
    print("\nSample resolved relationship conflict:")
    print(resolved_relationship_conflicts[0])

Conflict resolution completed
Entity value conflicts resolved: 0
Relationship conflicts resolved: 62

Sample resolved relationship conflict:
ResolutionResult(conflict_id='MDA Space Ltd._2025_PROVIDES_GUIDANCE_confidence_conflict', resolved=True, resolved_value=0.9875, resolution_strategy='voting', confidence=0.25, sources_used=[], resolution_notes='Resolved by voting: 1/4 votes for this value', metadata={'conflict_type': 'relationship_conflict', 'entity_id': None, 'property_name': 'confidence', 'relationship_id': 'MDA Space Ltd._2025_PROVIDES_GUIDANCE'})
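
The sample result reports "1/4 votes" and a confidence of 0.25: the four conflicting values are all distinct, so the vote is a four-way tie and the winner is simply one of the candidates. Below is a minimal sketch of majority voting that reproduces this behavior, alongside the mean as a possible alternative for continuous scores. Both functions are illustrative; whether `ConflictResolver` offers an averaging strategy is not shown in this notebook.

```python
from collections import Counter
from statistics import mean


def resolve_by_voting(values):
    """Return (winner, vote_share). Ties go to the value encountered first."""
    winner, votes = Counter(values).most_common(1)[0]
    return winner, votes / len(values)


confidences = [0.9875, 0.95, 0.975, 0.925]

winner, share = resolve_by_voting(confidences)
print("voting:", winner, share)          # 0.9875 with a 0.25 vote share
print("mean:", mean(confidences))

# Voting is more informative when values actually repeat.
print(resolve_by_voting(["Q3 2025", "Q3 2025", "Q3 2024"]))
```

For categorical values such as a period label, voting is a natural fit; for floating-point confidences, an aggregate usually says more than a tie-break.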


## Step 8: Deduplicate Entities

Detect and merge duplicate entities using DuplicateDetector and EntityMerger.


In [25]:
from semantica.deduplication import DuplicateDetector, EntityMerger

duplicate_detector = DuplicateDetector(
    similarity_threshold=0.8,
    confidence_threshold=0.7,
)

entity_dicts = [
    {
        "id": getattr(e, "id", None) or e.text,
        "name": e.text,
        "type": getattr(e, "label", "UNKNOWN"),
        "confidence": getattr(e, "confidence", 1.0),
        "metadata": getattr(e, "metadata", {}),
    }
    for e in entities  # Step 6 found no entity value conflicts, so the extracted entities are used directly
]

duplicates = duplicate_detector.detect_duplicates(entity_dicts)

entity_merger = EntityMerger(preserve_provenance=True)

merge_operations = entity_merger.merge_duplicates(
    entity_dicts,
    strategy="keep_most_complete",
)

merged_entities = [op.merged_entity for op in merge_operations]

print("Entity deduplication completed")
print("Resolved entities:", len(entity_dicts))
print("Merged entities:", len(merged_entities))
print("Duplicates removed:", len(entity_dicts) - len(merged_entities))

if merge_operations:
    print("\nSample merge operation:")
    print(merge_operations[0])
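
To give a feel for what a `similarity_threshold` of 0.8 means in practice, here is a minimal sketch of name-based duplicate detection using the standard library's `difflib`. It is not how `DuplicateDetector` computes similarity (that is not shown in this notebook); `similarity` and `find_duplicate_pairs` are hypothetical helpers and the name variants are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_pairs(names, threshold=0.8):
    """Return (name_a, name_b, score) for distinct names at or above the threshold."""
    unique = list(dict.fromkeys(names))  # drop exact repeats, keep first-seen order
    pairs = []
    for a, b in combinations(unique, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs


# Hypothetical variants of one entity plus an unrelated mention.
names = ["MDA Space Ltd.", "MDA Space", "MDA Space Ltd", "Q3 2025", "MDA Space Ltd."]

for pair in find_duplicate_pairs(names):
    print(pair)
```

Note that "MDA Space Ltd." and "MDA Space" fall just below 0.8 with each other, yet both match "MDA Space Ltd". Pairwise similarity is not transitive, so a merger has to decide whether such chains collapse into one entity. Dropping exact repeats first also matters for cost, since comparing every pair grows quadratically with the number of names.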

## Step 9: Build Knowledge Graph

Build the knowledge graph from the deduplicated entities and the extracted relationships using GraphBuilder.


In [None]:
from semantica.kg import GraphBuilder

graph_builder = GraphBuilder(
    merge_entities=True,
    entity_resolution_strategy="fuzzy",
)

# This notebook has no separate triplet-validation step, so the graph is
# built directly from the relations extracted in the relationship step.
final_relationships = [
    {
        "source": r.subject.text,
        "predicate": r.predicate,
        "target": r.object.text,
        "confidence": getattr(r, "confidence", 1.0),
        "metadata": getattr(r, "metadata", None) or {},
    }
    for r in relationships
]

kg_data = {
    "entities": merged_entities,
    "relationships": final_relationships,
    "metadata": {
        "source": "earnings_call_transcript",
        "extraction_method": "Groq LLM",
    },
}

knowledge_graph = graph_builder.build(
    sources=[kg_data],
    merge_entities=True,
)

print("Knowledge graph build completed")
print("Final entities:", len(knowledge_graph.get("entities", [])))
print("Final relationships:", len(knowledge_graph.get("relationships", [])))


## Step 10: Analyze Knowledge Graph

This step evaluates the structure and quality of the knowledge graph.

- **Centrality**  
  Identifies the most influential entities based on connectivity.

- **Communities**  
  Groups related entities to reveal themes such as business units, markets, or topics.

- **Connectivity**  
  Shows how well the graph is linked and whether information is fragmented.

These metrics help validate extraction quality and guide downstream analysis.


In [None]:
from semantica.kg import GraphAnalyzer

graph_analyzer = GraphAnalyzer()

analysis = graph_analyzer.analyze_graph(knowledge_graph)
centrality = graph_analyzer.calculate_centrality(
    knowledge_graph,
    method="degree",
)
communities = graph_analyzer.detect_communities(
    knowledge_graph,
    algorithm="louvain",
)
connectivity = graph_analyzer.analyze_connectivity(knowledge_graph)
metrics = graph_analyzer.compute_metrics(knowledge_graph)

top_entities = centrality.get("rankings", [])[:5]
num_communities = len(communities.get("communities", []))

print("Graph analysis completed")
print("Communities:", num_communities)
print("Top entities:", len(top_entities))
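
The return structure of `GraphAnalyzer` is not shown in this notebook, so as a rough illustration of what degree centrality and connectivity measure, here is a dependency-free sketch over a list of undirected edges. The helper names are hypothetical, and the last edge is invented to show a disconnected fragment.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    n = len(neighbors)
    if n <= 1:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def connected_components(edges):
    """Return a list of node sets, one per connected component."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    seen, components = set(), []
    for start in neighbors:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


edges = [
    ("MDA Space Ltd.", "$409.8 million"),
    ("MDA Space Ltd.", "Q3 2025"),
    ("MDA Space Ltd.", "2025"),
    ("Analyst A", "Question 1"),  # hypothetical fragment, not from the source
]

scores = degree_centrality(edges)
components = connected_components(edges)
print(max(scores, key=scores.get), scores["MDA Space Ltd."])
print("components:", len(components))
```

In a graph extracted from a single company's filings, one dominant hub is expected. Many small components, on the other hand, usually point to entity names that were not merged.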


## Step 11: Persist Knowledge Graph in Amazon Neptune

After cleaning, conflict resolution, and deduplication, the final step is to
persist the **canonical knowledge graph** into a production graph database.

Semantica integrates with **Amazon Neptune** to provide a secure, scalable,
and query-efficient backend for long-lived knowledge graphs.

- **Canonical Storage**  
  Only deduplicated entities and resolved relationships are written to Neptune.

- **Secure Access**  
  Uses AWS IAM authentication (SigV4) for production-grade security.

- **Flexible Graph Model**  
  Supports property graphs (OpenCypher / Gremlin) and RDF (SPARQL).

- **Efficient Querying**  
  Leverages the Bolt protocol for low-latency graph queries and traversal.

- **Production Ready**  
  Designed for compliance, provenance, and downstream analytics.

This step enables durable storage, rich querying, and integration with
analytics and applications on top of the extracted knowledge graph.


In [None]:
import os

os.environ["NEPTUNE_ENDPOINT"] = "your-cluster.us-east-1.neptune.amazonaws.com"
os.environ["NEPTUNE_PORT"] = "8182"
os.environ["AWS_REGION"] = "us-east-1"

In [None]:
from semantica.graph_store import GraphStore
import os

neptune_store = GraphStore(
    backend="neptune",
    endpoint=os.environ["NEPTUNE_ENDPOINT"],
    port=int(os.environ.get("NEPTUNE_PORT", 8182)),
    region=os.environ["AWS_REGION"],
    iam_auth=True,
)

neptune_store.connect()
print("Connected to AWS Neptune")

In [None]:
for entity in knowledge_graph.get("entities", []):
    neptune_store.create_node(
        labels=[entity.get("type", "Entity")],
        properties=entity,
    )

for rel in knowledge_graph.get("relationships", []):
    neptune_store.create_relationship(
        start_node_id=rel["source"],
        end_node_id=rel["target"],
        rel_type=rel["predicate"],
        properties=rel.get("metadata", {}),
    )

print("Knowledge graph populated to AWS Neptune")


In [None]:
# Verify data in Neptune
results = neptune_store.execute_query(
    "MATCH (n) RETURN count(n) AS node_count"
)
print("Total nodes:", results.get("records", [{}])[0].get("node_count"))

results = neptune_store.execute_query(
    "MATCH ()-[r]->() RETURN count(r) AS rel_count"
)
print("Total relationships:", results.get("records", [{}])[0].get("rel_count"))

# Sample query: list a few entities
results = neptune_store.execute_query(
    "MATCH (n) RETURN labels(n), n.name LIMIT 5"
)

print("Sample nodes:")
for r in results.get("records", []):
    print(r)
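
If the load cells are run more than once, it helps for writes to be idempotent. One common openCypher pattern is `MERGE` with query parameters. Below is a minimal sketch that only builds the query text and parameter dictionaries. Whether `execute_query` accepts a parameters argument is not shown in this notebook, so the sketch stops short of calling it. All helper names are hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name: str) -> str:
    """Validate a label or relationship type.

    Labels and relationship types cannot be passed as query parameters,
    so they must be checked before being interpolated into the query text.
    """
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe label or relationship type: {name!r}")
    return name


def merge_node_query(label: str, name: str):
    """Build a MERGE query that creates the node only if it does not exist yet."""
    query = f"MERGE (n:{safe_identifier(label)} {{name: $name}}) RETURN n.name"
    return query, {"name": name}


def merge_relationship_query(rel_type: str, source: str, target: str):
    """Build a MERGE query linking two existing nodes matched by name."""
    query = (
        "MATCH (a {name: $source}), (b {name: $target}) "
        f"MERGE (a)-[r:{safe_identifier(rel_type)}]->(b) "
        "RETURN type(r)"
    )
    return query, {"source": source, "target": target}


print(merge_node_query("ORGANIZATION", "MDA Space Ltd."))
print(merge_relationship_query("HAS_REVENUE", "MDA Space Ltd.", "$409.8 million"))
```

Values such as entity names travel as parameters, so characters like quotes or dollar signs in extracted text never become part of the query string.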


## Step 12: Context Retrieval

Set up hybrid retrieval (vector + graph) using ContextRetriever for GraphRAG queries.


In [None]:
from semantica.vector_store import VectorStore
from semantica.context import ContextRetriever

vector_store = VectorStore(backend="faiss")

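# Index each chunk separately so retrieval can return the most relevant
# passages instead of the whole document as a single entry.
chunk_texts = [get_chunk_text(c) for c in chunks if get_chunk_text(c).strip()]

vector_store.add(
    texts=chunk_texts,
    metadata=[{"source": "earnings_call", "type": "transcript"} for _ in chunk_texts],
)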

context_retriever = ContextRetriever(
    knowledge_graph=knowledge_graph,
    vector_store=vector_store,
    hybrid_alpha=0.6,
    use_graph_expansion=True,
    max_expansion_hops=2,
)

queries = [
    "What was the company's revenue guidance?",
    "What were the key financial metrics discussed?",
]

retrieved_contexts = []

for query in queries:
    results = context_retriever.retrieve(
        query=query,
        max_results=3,
        min_relevance_score=0.2,
    )
    retrieved_contexts.append(results)

print("Hybrid GraphRAG configured")
print("Queries processed:", len(queries))
print("Sample results:", len(retrieved_contexts[0]) if retrieved_contexts else 0)
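
The `max_expansion_hops=2` setting bounds how far retrieval may walk from the entities matched by a query. Below is a minimal sketch of hop-limited expansion as a breadth-first search over an adjacency map. It illustrates the concept only; `ContextRetriever`'s internal traversal is not shown in this notebook, and the tiny graph is hand-built for the example.

```python
from collections import deque


def expand(adjacency, seeds, max_hops=2):
    """Return {node: hop_distance} for every node within max_hops of a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hand-built illustrative graph.
adjacency = {
    "revenue": ["MDA Space Ltd."],
    "MDA Space Ltd.": ["revenue", "Q3 2025", "backlog"],
    "Q3 2025": ["MDA Space Ltd.", "September 30, 2025"],
    "September 30, 2025": ["Q3 2025"],
    "backlog": ["MDA Space Ltd."],
}

print(expand(adjacency, ["revenue"], max_hops=1))
print(expand(adjacency, ["revenue"], max_hops=2))
```

With two hops, a query that matches "revenue" also pulls in the quarter and the backlog via the company node, while the quarter-end date stays out of reach at three hops.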


## Step 13: Agent Memory (Long-Term Context)

This step enables long-term memory for agents by storing important facts,
metrics, and entities extracted from the knowledge graph.

- **Semantic Memory Storage**  
  Stores structured memories enriched with entities and relationships,
  not just raw text.

- **Hybrid Recall**  
  Combines vector similarity with graph structure for accurate retrieval.

- **Time-Bound Retention**  
  Supports memory expiration policies for freshness and governance.

- **Agent-Ready Context**  
  Allows agents to recall prior earnings, metrics, and entities across sessions.

Agent Memory turns one-off analysis into **persistent, reusable intelligence**
for downstream agents and workflows.


In [None]:
from semantica.context import AgentMemory

agent_memory = AgentMemory(
    vector_store=vector_store,
    knowledge_graph=knowledge_graph,
    retention_days=30,
)

memory_contents = [
    f"Earnings call transcript: {parsed_doc['metadata'].get('title', 'Earnings Call')}",
    f"Financial metrics extracted: {len(financial_entities)}",
    f"Key entities identified: {len(merged_entities)}",
]

memory_ids = []

for content in memory_contents:
    memory_ids.append(
        agent_memory.store(
            content=content,
            metadata={"source": "earnings_call", "type": "analysis"},
            extract_entities=True,
            extract_relationships=True,
        )
    )

financial_memories = agent_memory.retrieve(
    query="financial metrics and earnings",
    max_results=5,
)

memory_stats = agent_memory.get_statistics()

print("Agent memory configured")
print("Memories stored:", len(memory_ids))
print("Total memories:", memory_stats.get("total_memories", 0))
print("Retrieved memories:", len(financial_memories))
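
How `AgentMemory` enforces `retention_days` internally is not shown in this notebook. As a minimal sketch of what time-bound retention means, the function below drops records older than a cutoff. The `stored_at` field and `prune_expired` are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def prune_expired(memories, retention_days=30, now=None):
    """Keep only memories stored within the last `retention_days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [m for m in memories if m["stored_at"] >= cutoff]


reference_time = datetime(2025, 12, 1, tzinfo=timezone.utc)
memories = [
    {"content": "Q3 2025 results summary",
     "stored_at": datetime(2025, 11, 14, tzinfo=timezone.utc)},
    {"content": "Older note",
     "stored_at": datetime(2025, 9, 30, tzinfo=timezone.utc)},
]

kept = prune_expired(memories, retention_days=30, now=reference_time)
print([m["content"] for m in kept])
```

For quarterly analysis, a 30-day window expires a quarter's memories before the next earnings call, so a longer retention period may be appropriate if cross-quarter recall is the goal.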


## Step 14: Agent Context

**AgentContext** provides a unified context layer that combines **vector-based RAG**
with **graph-based GraphRAG** for grounded and explainable retrieval.

### Key Controls
- **Graph Expansion**  
  Uses connected entities and relationships from the knowledge graph to
  expand context beyond direct text matches.

- **Hybrid Alpha**  
  Balances text similarity with graph structure. Lower values favor text;
  higher values favor graph reasoning.

- **Expansion Hops**  
  Limits how far context can expand through the graph. Fewer hops keep
  results focused; more hops increase coverage.

AgentContext enables agents to reason over both **documents** and
**knowledge graphs** through a single interface.
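
As a rough illustration of the Hybrid Alpha control, the sketch below blends a vector score and a graph score linearly, with `alpha` weighting the graph side as described above. The linear form is an assumption; the exact formula Semantica uses is not given in this notebook, and `hybrid_score` and the candidate scores are made up for the example.

```python
def hybrid_score(vector_score: float, graph_score: float, alpha: float = 0.6) -> float:
    """Linear blend: alpha weights the graph score, (1 - alpha) the vector score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * vector_score + alpha * graph_score


# Hypothetical candidates as (vector_score, graph_score) pairs.
candidates = {
    "passage_a": (0.9, 0.2),  # strong text match, weakly connected
    "passage_b": (0.5, 0.8),  # weaker text match, well connected
}


def rank(alpha):
    return sorted(
        candidates,
        key=lambda name: hybrid_score(*candidates[name], alpha=alpha),
        reverse=True,
    )


print(rank(0.2))  # text-heavy weighting
print(rank(0.6))  # graph-heavy weighting
```

The ranking flips between the two settings, which is the practical effect of tuning this parameter.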


### AgentContext Parameters

- **vector_store** – Vector search over unstructured text
- **knowledge_graph** – Structured entities and relationships
- **use_graph_expansion** – Enable GraphRAG (graph-based context expansion)
- **max_expansion_hops** – How far to traverse the graph
- **hybrid_alpha** – Balance between vector and graph relevance
- **retention_days** – How long context is kept

**Store options**
- **link_entities** – Link to existing graph nodes

**Retrieve options**
- **max_results** – Number of results returned
- **expand_graph** – Expand context via the graph
- **include_entities** – Return related entities

AgentContext unifies **memory, GraphRAG, and retrieval** in one interface.


In [None]:
from semantica.context import AgentContext

agent_context = AgentContext(
    vector_store=vector_store,
    knowledge_graph=knowledge_graph,
    use_graph_expansion=True,
    max_expansion_hops=2,
    hybrid_alpha=0.6,
    retention_days=30,
)

memory_id = agent_context.store(
    content=parsed_doc["full_text"][:1000],
    metadata={"source": "earnings_call", "date": "2025-Q3"},
    extract_entities=True,
    extract_relationships=True,
    link_entities=True,
)

results = agent_context.retrieve(
    query="What was discussed about revenue growth?",
    max_results=5,
    expand_graph=True,
    include_entities=True,
)

stats = agent_context.stats()

print("AgentContext configured")
print("Memory stored:", memory_id)
print("GraphRAG results:", len(results))
print("Total memories:", stats.get("total_memories", 0))



## Step 15: Answer Generation

Generate answers to financial questions using Groq LLM with retrieved context and knowledge graph.


In [None]:
financial_questions = [
    "What were the key financial metrics discussed?",
    "What guidance was provided for future quarters?",
]

generated_answers = []

for question in financial_questions:
    retrieved_contexts = context_retriever.retrieve(
        query=question,
        max_results=3,
        min_relevance_score=0.2,
    )

    context_text = "\n\n".join(
        ctx.get("content", ctx.get("text", ""))
        for ctx in retrieved_contexts
    )[:1000]

    entity_names = [
        entity.get("name", "")
        for entity in knowledge_graph.get("entities", [])[:5]
    ]
    entities_text = ", ".join(entity_names) or "N/A"

    prompt = f"""
Answer the question using only the context below.
If the answer is not present, say so.

Context:
{context_text}

Key entities: {entities_text}

Question:
{question}

Answer:
""".strip()

    try:
        answer = groq_llm.generate(
            prompt,
            temperature=0.7,
            max_tokens=400,
        )
    except Exception as error:
        answer = f"Answer generation failed: {error}"

    generated_answers.append(answer)

print("Answer generation completed")
print("Questions answered:", len(generated_answers))
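
The answer cell joins the retrieved passages and then slices the result to 1,000 characters, which can cut the last passage mid-sentence. Below is a minimal sketch of an alternative that keeps whole passages within a character budget. `build_context` is a hypothetical helper, not part of Semantica.

```python
def build_context(passages, max_chars=1000, separator="\n\n"):
    """Join passages in order, stopping before the budget would be exceeded."""
    selected, used = [], 0
    for passage in passages:
        extra = len(passage) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(passage)
        used += extra
    if not selected and passages:
        # A single oversized passage is truncated rather than dropped entirely.
        return passages[0][:max_chars]
    return separator.join(selected)


passages = ["a" * 400, "b" * 400, "c" * 400]
context = build_context(passages, max_chars=1000)
print(len(context))  # two whole passages plus one separator
```

Since passages arrive ranked by relevance, stopping at the first one that does not fit keeps the highest-ranked evidence intact.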


## Step 16: Export Results

Export knowledge graph and analysis results to JSON and RDF formats.


In [None]:
from semantica.export import JSONExporter, RDFExporter

json_exporter = JSONExporter()
rdf_exporter = RDFExporter()

kg_json = json_exporter.export(knowledge_graph, format="json")
kg_rdf = rdf_exporter.export_to_rdf(knowledge_graph, format="turtle")

analysis_summary = {
    "entities": len(knowledge_graph.get("entities", [])),
    "relationships": len(knowledge_graph.get("relationships", [])),
    "conflicts_resolved": (
        len(resolved_entity_value_conflicts) + len(resolved_relationship_conflicts)
    ),
    "merged_entities": len(merged_entities),
    "communities": num_communities,
    "questions_answered": len(generated_answers),
    "llm_model": groq_llm.model,
}

print("Export completed")
print("KG JSON entities:", analysis_summary["entities"])
print("KG RDF size (chars):", len(kg_rdf))
print("Questions answered:", analysis_summary["questions_answered"])
print("LLM model:", analysis_summary["llm_model"])
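
To show roughly what an RDF export of these triples involves, here is a minimal sketch that serializes (subject, predicate, object) tuples as N-Triples using only the standard library. It is not `RDFExporter`'s output format, which is not shown in this notebook. The `http://example.org/` namespace and all helper names are hypothetical, and treating every object as a plain string literal is a simplifying assumption.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace for illustration


def iri(kind: str, name: str) -> str:
    """Build an IRI, percent-encoding characters that are not allowed in one."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"


def literal(value: str) -> str:
    """Escape a string as an N-Triples plain literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize triples one per line, each terminated by ' .'"""
    lines = [
        f"{iri('entity', s)} {iri('relation', p)} {literal(o)} ."
        for s, p, o in triples
    ]
    return "\n".join(lines) + "\n"


print(to_ntriples([("MDA Space Ltd.", "HAS_REVENUE", "$409.8 million")]), end="")
```

A production export would typically use typed literals (for example decimal amounts with a currency) and link objects such as quarters to their own nodes instead of strings.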
