# demo: textgraph

Implementation of an LLM-augmented `textgraph` algorithm for constructing a _knowledge graph_ from raw, unstructured text sources.

This code is based on work developed by [Derwen](https://derwen.ai/graph) in early 2023 for enterprise customer sample apps and our _Cysoni_ product. It integrates code from:

  * <https://github.com/tomaarsen/SpanMarkerNER/>
  * <https://github.com/thunlp/OpenNRE/>
  * <https://github.com/DerwenAI/pytextrank/>
  * [_Create a spaCy Visualizer with Streamlit_](https://medium.com/@groxli/create-a-spacy-visualizer-with-streamlit-8b9b41b36745)

Our approach was presented in the talks:

  * ["Language, Graphs, and AI in Industry"](https://derwen.ai/s/mqqm), **Paco Nathan**, K1st World (2023-10-11)
  * ["Language Tools for Creators"](https://derwen.ai/s/rhvg), **Paco Nathan**, FOSSY (2023-07-13)

Two other tutorials from 2023 cover related material:

  * ["Natural Intelligence is All You Need \[tm\]"](https://youtu.be/C9p7suS-NGk?si=7Ohq3BV654ia2Im4), **Vincent Warmerdam**, PyData Amsterdam (2023-09-15)

  * ["How to Convert Any Text Into a Graph of Concepts"](https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a), **Rahul Nayak**, _Towards Data Science_ (2023-11-09)

## parse a document

In [1]:
from icecream import ic
from textgraph import Node, Edge, RelEnum, TextGraph
import spacy


In [122]:
SRC_TEXT: str = """
Werner Herzog is a remarkable filmmaker and an intellectual originally from Germany, the son of Dietrich Herzog.
"""

# this second assignment replaces the sample text above; the cells below parse this sentence
SRC_TEXT = """
Herzog was born Werner Stipetić in Munich, Nazi Germany, to Elisabeth Stipetić, an Austrian of Croatian descent, and Dietrich Herzog, a German.
"""

In [123]:
tg: TextGraph = TextGraph()

sample_doc: spacy.tokens.doc.Doc = tg.build_doc(
    SRC_TEXT.strip(),
    use_llm = False,
)

2023-11-25 22:19:45,907 - root - INFO - Initializing word embedding with word2vec.


In [124]:
spacy.displacy.render(
    sample_doc,
    style = "ent",
    jupyter = True,
)

In [125]:
spacy.displacy.render(
    sample_doc,
    style = "dep",
    jupyter = True,
)

## build a lemma graph from the document

In [126]:
tg.build_graph_embeddings(
    sample_doc,
    debug = True,
)

ic| sent: Herzog was born Werner Stipetić in Munich, Nazi Germany, to Elisabeth Stipetić, an Austrian of Croatian descent, and Dietrich Herzog, a German.


In [127]:
tg.infer_relations(
    SRC_TEXT.strip(),
    debug = True,
)

ic| src.node_id: 3, dst.node_id: 5, path: [3, 4, 5]
ic| rel: 'residence', prob: 0.393816202878952
ic| src.node_id: 3, dst.node_id: 7, path: [3, 4, 5, 7]
ic| rel: 'country of citizenship', prob: 0.8388875722885132
ic| src.node_id: 3, dst.node_id: 10, path: [3, 2, 9, 10]
ic| rel: 'mother', prob: 0.5407394170761108
ic| src.node_id: 3, dst.node_id: 13, path: [3, 2, 9, 10, 13]
ic| rel: 'country of citizenship', prob: 0.767692506313324
ic| src.node_id: 3
    dst.node_id: 15
    path: [3, 2, 9, 10, 13, 14, 16, 15]
ic| rel: 'country of citizenship', prob: 0.5234690308570862
ic| src.node_id: 3, dst.node_id: 19, path: [3, 2, 19]
ic| rel: 'sibling', prob: 0.37065911293029785
ic| src.node_id: 3, dst.node_id: 22, path: [3, 2, 19, 22]
ic| rel: 'father', prob: 0.16549424827098846
ic| src.node_id: 5, dst.node_id: 3, path: [5, 4, 3]
ic| rel: 'sibling', prob: 0.31588953733444214
ic| src.node_id: 5, dst.node_id: 7, path: [5, 7]
ic| rel: 'country', prob: 0.30468812584877014
ic| src.node_id: 5, dst.node_id
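
Each `path` in the debug output above lists the node ids that connect a pair of entities in the lemma graph. As a rough illustration only, the sketch below rebuilds an undirected graph from the hops visible in those paths and recovers the same sequences with a plain breadth-first search. The `HOPS` list and the `build_adjacency` and `shortest_path` helpers are hypothetical stand-ins written for this note. They are not part of `textgraph`, which may well find its paths differently.

```python
from collections import deque
import typing

# hops read off the `path` lists in the debug output above,
# treated as undirected edges (an assumption made for this sketch)
HOPS: typing.List[typing.Tuple[int, int]] = [
    (3, 4), (4, 5), (5, 7),
    (3, 2), (2, 9), (9, 10), (10, 13),
    (13, 14), (14, 16), (16, 15),
    (2, 19), (19, 22),
]


def build_adjacency(
    hops: typing.List[typing.Tuple[int, int]],
) -> typing.Dict[int, typing.Set[int]]:
    """Build an undirected adjacency map from a list of hops."""
    adj: typing.Dict[int, typing.Set[int]] = {}

    for a, b in hops:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    return adj


def shortest_path(
    adj: typing.Dict[int, typing.Set[int]],
    src: int,
    dst: int,
) -> typing.List[int]:
    """Breadth-first search; returns an empty list when `dst` is unreachable."""
    prev: typing.Dict[int, typing.Optional[int]] = {src: None}
    queue: typing.Deque[int] = deque([src])

    while queue:
        node: typing.Optional[int] = queue.popleft()

        if node == dst:
            # walk the predecessor links back to the source
            path: typing.List[int] = []

            while node is not None:
                path.append(node)
                node = prev[node]

            return path[::-1]

        for nbr in sorted(adj.get(node, ())):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)

    return []


ADJ = build_adjacency(HOPS)
print(shortest_path(ADJ, 3, 5))
print(shortest_path(ADJ, 3, 15))
```

Because the hops collected here happen to form a tree, each pair of nodes has exactly one connecting path, so the search returns the same node sequences that the debug output reports.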

In [128]:
tg.calc_phrase_ranks()

ic(tg.edges);

ic| tg.edges: {'0.2.nsubjpass.0': Edge(src_node=0, dst_node=2, kind=<RelEnum.DEP: 0>, rel='nsubjpass', prob=1.0, count=1),
               '1.2.auxpass.0': Edge(src_node=1, dst_node=2, kind=<RelEnum.DEP: 0>, rel='auxpass', prob=1.0, count=1),
               '10.13.country_of_citizenship.1': Edge(src_node=10, dst_node=13, kind=<RelEnum.INF: 1>, rel='country of citizenship', prob=0.6766995191574097, count=1),
               '10.15.country_of_citizenship.1': Edge(src_node=10, dst_node=15, kind=<RelEnum.INF: 1>, rel='country of citizenship', prob=0.47829607129096985, count=1),
               '10.19.sibling.1': Edge(src_node=10, dst_node=19, kind=<RelEnum.INF: 1>, rel='sibling', prob=0.4432893991470337, count=1),
               '10.22.father.1': Edge(src_node=10, dst_node=22, kind=<RelEnum.INF: 1>, rel='father', prob=0.21240809559822083, count=1),
               '10.3.sibling.1': Edge(src_node=10, dst_node=3, kind=<RelEnum.INF: 1>, rel='sibling', prob=0.45087841153144836, count=1),
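
The edge listing above mixes dependency edges (`RelEnum.DEP`, always `prob=1.0`) with inferred relations (`RelEnum.INF`) whose probabilities vary widely. One way to work with such output is to keep only the inferred edges above a cutoff. The sketch below does this with stand-in `RelEnum` and `Edge` definitions that mirror only the fields visible in the output, so they are assumptions rather than the library's own classes. The `edge_key` helper reproduces the key pattern seen above, and the `0.45` cutoff is an arbitrary value chosen for illustration.

```python
import enum
import typing
from dataclasses import dataclass


class RelEnum(enum.Enum):
    """Stand-in mirroring the two edge kinds visible in the output."""
    DEP = 0
    INF = 1


@dataclass(frozen=True)
class Edge:
    """Stand-in with the same fields as the edges printed above."""
    src_node: int
    dst_node: int
    kind: RelEnum
    rel: str
    prob: float
    count: int = 1


def edge_key(edge: Edge) -> str:
    """Rebuild the key pattern seen in the output, e.g. '10.3.sibling.1'."""
    rel: str = edge.rel.replace(" ", "_")
    return f"{edge.src_node}.{edge.dst_node}.{rel}.{edge.kind.value}"


# values copied from the (truncated) output above
EDGES: typing.List[Edge] = [
    Edge(0, 2, RelEnum.DEP, "nsubjpass", 1.0),
    Edge(1, 2, RelEnum.DEP, "auxpass", 1.0),
    Edge(10, 13, RelEnum.INF, "country of citizenship", 0.6766995191574097),
    Edge(10, 15, RelEnum.INF, "country of citizenship", 0.47829607129096985),
    Edge(10, 19, RelEnum.INF, "sibling", 0.4432893991470337),
    Edge(10, 22, RelEnum.INF, "father", 0.21240809559822083),
    Edge(10, 3, RelEnum.INF, "sibling", 0.45087841153144836),
]


def confident_inferences(
    edges: typing.Iterable[Edge],
    min_prob: float = 0.45,  # hypothetical cutoff, not a library default
) -> typing.Dict[str, Edge]:
    """Keep inferred edges whose probability is at least `min_prob`."""
    return {
        edge_key(edge): edge
        for edge in edges
        if edge.kind == RelEnum.INF and edge.prob >= min_prob
    }


kept = confident_inferences(EDGES)

for key, edge in kept.items():
    print(key, round(edge.prob, 3))
```

A probability cutoff alone does not guarantee a correct relation: the `sibling` edge from node 10 to node 3 passes this cutoff even though the source sentence describes a mother and her son.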
         

Show the resulting entities extracted from the document:

In [129]:
for node in tg.get_phrases():
    ic(node)

ic| node: Node(node_id=10, span=Elisabeth Stipetić, text='Elisabeth Stipetić', pos='PROPN', kind='PERSON', count=1, weight=0.12497108235660234)
ic| node: Node(node_id=13, span=Austrian, text='Austrian', pos='ADJ', kind='NORP', count=1, weight=0.12497108235660234)
ic| node: Node(node_id=3, span=Werner Stipetić, text='Werner Stipetić', pos='PROPN', kind='PERSON', count=1, weight=0.11991720772538196)
ic| node: Node(node_id=7, span=Nazi Germany, text='Nazi Germany', pos='PROPN', kind='GPE', count=1, weight=0.11991720772538196)
ic| node: Node(node_id=5, span=Munich, text='Munich', pos='PROPN', kind='GPE', count=1, weight=0.11515368845567156)
ic| node: Node(node_id=19, span=Dietrich Herzog, text='Dietrich Herzog', pos='PROPN', kind='PERSON', count=1, weight=0.11515368845567156)
ic| node: Node(node_id=15, span=Croatian, text='Croatian', pos='ADJ', kind='NORP', count=1, weight=0.11071800754286118)
ic| node: Node(node_id=22, span=German, text='German', pos='PROPN', kind='NORP', count=1, weight=
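
The ranked phrases above carry both an entity `kind` and a `weight`, so a simple next step is to bucket them by kind while keeping the ranked order. The sketch below uses plain tuples copied from the output instead of the library's `Node` class. `PHRASES` and `group_by_kind` are hypothetical names introduced for this note, and node 22 is left out because its weight is cut off in the output.

```python
import typing
from collections import defaultdict

# (node_id, text, kind, weight) copied from the output above;
# node 22 is omitted since its weight is truncated there
PHRASES: typing.List[typing.Tuple[int, str, str, float]] = [
    (10, "Elisabeth Stipetić", "PERSON", 0.12497108235660234),
    (13, "Austrian", "NORP", 0.12497108235660234),
    (3, "Werner Stipetić", "PERSON", 0.11991720772538196),
    (7, "Nazi Germany", "GPE", 0.11991720772538196),
    (5, "Munich", "GPE", 0.11515368845567156),
    (19, "Dietrich Herzog", "PERSON", 0.11515368845567156),
    (15, "Croatian", "NORP", 0.11071800754286118),
]


def group_by_kind(
    phrases: typing.Iterable[typing.Tuple[int, str, str, float]],
) -> typing.Dict[str, typing.List[str]]:
    """Bucket phrase texts by entity kind, highest weight first."""
    groups: typing.Dict[str, typing.List[str]] = defaultdict(list)

    # `sorted` is stable, so phrases with equal weights keep their input order
    ranked = sorted(phrases, key=lambda phrase: phrase[3], reverse=True)

    for _node_id, text, kind, _weight in ranked:
        groups[kind].append(text)

    return dict(groups)


for kind, texts in group_by_kind(PHRASES).items():
    print(kind, texts)
```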

## visualize the knowledge graph

In [130]:
import typing
from dataclasses import dataclass

@dataclass(order=False, frozen=True)
class NodeStyle:  # pylint: disable=R0902
    """
    Dataclass used for styling PyVis nodes.
    """
    color: str
    shape: str

SHAPES: typing.List[ str ] = [
    "dot",
    "square",
    "diamond",
]

DIM_NODE: NodeStyle = NodeStyle(
    color = "hsla(72, 19%, 90%, 0.4)",
    shape = "star",
)

In [131]:
import networkx as nx
import pyvis

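debug: bool = False  # set to True to trace each node below

for node in tg.nodes.values():
    neighbors: int = 0

    try:
        neighbors = len(list(nx.neighbors(tg.lemma_graph, node.node_id)))
    except nx.NetworkXError:
        # the node is not in the lemma graph; keep the neighbor count at zero
        pass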

    g_node = tg.lemma_graph.nodes[node.node_id]
    g_node["value"] = node.weight
    g_node["size"] = node.count
    g_node["neighbors"] = neighbors

    if node.count < 1:
        g_node["kind"] = 0
        g_node["shape"] = "star"
        g_node["label"] = ""
        g_node["title"] = node.text
        g_node["color"] = "hsla(72, 19%, 90%, 0.4)"
    elif node.kind is not None:
        g_node["kind"] = 1
        g_node["shape"] = "circle"
        g_node["label"] = node.text
        g_node["color"] = "#d2d493"
    else:
        g_node["kind"] = 2
        g_node["shape"] = "square"
        g_node["label"] = node.text
        g_node["color"] = "#c083bb"

    if debug:
        ic(node.count, node, g_node)

edge_labels: dict = {}

for edge in tg.edges.values():
    edge_labels[(edge.src_node, edge.dst_node,)] = ( edge.kind.value, edge.rel, )
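
The styling loop above repeats color and shape literals in each branch, while the `NodeStyle` dataclass defined earlier goes unused. As one possible refactoring, the sketch below routes the same three cases through `NodeStyle` instances. `ENTITY_NODE`, `OTHER_NODE`, `select_style`, and `apply_style` are hypothetical names introduced here, and the plain `dict` stands in for the node attribute mapping of the lemma graph.

```python
import typing
from dataclasses import dataclass


@dataclass(order=False, frozen=True)
class NodeStyle:
    """Dataclass used for styling PyVis nodes (redefined so the sketch runs alone)."""
    color: str
    shape: str


DIM_NODE: NodeStyle = NodeStyle(color="hsla(72, 19%, 90%, 0.4)", shape="star")
ENTITY_NODE: NodeStyle = NodeStyle(color="#d2d493", shape="circle")  # hypothetical name
OTHER_NODE: NodeStyle = NodeStyle(color="#c083bb", shape="square")  # hypothetical name


def select_style(
    count: int,
    kind: typing.Optional[str],
) -> typing.Tuple[int, NodeStyle]:
    """Return the numeric kind and style, using the same tests as the loop above."""
    if count < 1:
        return 0, DIM_NODE

    if kind is not None:
        return 1, ENTITY_NODE

    return 2, OTHER_NODE


def apply_style(
    g_node: dict,
    text: str,
    count: int,
    kind: typing.Optional[str],
) -> dict:
    """Write the style attributes into a node attribute mapping."""
    kind_id, style = select_style(count, kind)
    g_node["kind"] = kind_id
    g_node["shape"] = style.shape
    g_node["color"] = style.color

    if kind_id == 0:
        # dimmed nodes show their text only as a hover title
        g_node["label"] = ""
        g_node["title"] = text
    else:
        g_node["label"] = text

    return g_node


print(apply_style({}, "Munich", 1, "GPE"))
print(apply_style({}, "be", 0, None))
```

Keeping the literals in one place means a palette change touches three constants rather than every branch of the loop.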

In [132]:
vis_graph: pyvis.network.Network = pyvis.network.Network()
vis_graph.from_nx(tg.lemma_graph)

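for edge in vis_graph.get_edges():
    edge_key = ( edge["from"], edge["to"], )
    edge_info = edge_labels.get(edge_key)

    if edge_info is None:
        # no matching entry in `edge_labels`; keep the default styling
        continue

    if edge_info[0] == 0:
        # dependency edges (RelEnum.DEP) are dimmed and left unlabeled
        edge["color"] = "lightgray"
        edge["width"] = 0
        edge["label"] = ""
    else:
        edge["label"] = edge_info[1]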

vis_graph.force_atlas_2based(
    gravity = -38,
    central_gravity = 0.01,
    spring_length = 231,
    spring_strength = 0.7,
    damping = 0.8,
    overlap = 0,
)

vis_graph.show_buttons(filter_ = [ "physics" ])
vis_graph.toggle_physics(True)

In [133]:
vis_graph.prep_notebook()
vis_graph.show("vis.html")

vis.html


## outro

_\[ more parts are getting added to this demo \]_