# demo: textgraph

Implementation of an LLM-augmented `textgraph` algorithm for constructing a _knowledge graph_ from raw, unstructured text sources.

This code is based on work developed by [Derwen](https://derwen.ai/graph) in early 2023 for enterprise customer sample apps and our _Cysoni_ product. It integrates code from:

  * <https://github.com/tomaarsen/SpanMarkerNER/>
  * <https://github.com/thunlp/OpenNRE/>
  * <https://github.com/DerwenAI/pytextrank/>
  * [_Create a spaCy Visualizer with Streamlit_](https://medium.com/@groxli/create-a-spacy-visualizer-with-streamlit-8b9b41b36745)

Our approach was presented in the talks:

  * ["Language, Graphs, and AI in Industry"](https://derwen.ai/s/mqqm), **Paco Nathan**, K1st World (2023-10-11)
  * ["Language Tools for Creators"](https://derwen.ai/s/rhvg), **Paco Nathan**, FOSSY (2023-07-13)

Two tutorials, also from 2023, include related material:

  * ["Natural Intelligence is All You Need \[tm\]"](https://youtu.be/C9p7suS-NGk?si=7Ohq3BV654ia2Im4), **Vincent Warmerdam**, PyData Amsterdam (2023-09-15)
  * ["How to Convert Any Text Into a Graph of Concepts"](https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a), **Rahul Nayak**, _Towards Data Science_ (2023-11-09)

## parse a document

In [1]:
from icecream import ic
from textgraph import Node, Edge, RelEnum, TextGraph
import spacy



In [2]:
SRC_TEXT: str = """
Werner Herzog is a remarkable filmmaker and an intellectual originally from Germany, the son of Dietrich Herzog.
"""

In [3]:
tg: TextGraph = TextGraph()

sample_doc: spacy.tokens.doc.Doc = tg.build_doc(
    SRC_TEXT.strip(),
    use_llm = False,
)

2023-11-25 13:57:04,840 - root - INFO - Initializing word embedding with word2vec.


In [4]:
spacy.displacy.render(
    sample_doc,
    style = "ent",
    jupyter = True,
)

In [5]:
spacy.displacy.render(
    sample_doc,
    style = "dep",
    jupyter = True,
)

## build a lemma graph from the document

In [6]:
tg.build_graph_embeddings(
    sample_doc,
    debug = True,
)

ic| sent: Werner Herzog is a remarkable filmmaker and an intellectual originally from Germany, the son of Dietrich Herzog.


In [7]:
tg.infer_relations(
    SRC_TEXT.strip(),
    debug = True,
)

ic| src.node_id: 0, dst.node_id: 10, path: [0, 1, 4, 7, 9, 10]
ic| rel: 'country of citizenship', prob: 0.9073653221130371
ic| src.node_id: 0
    dst.node_id: 15
    path: [0, 1, 4, 7, 9, 10, 13, 14, 15]
ic| rel: 'father', prob: 0.5981622934341431
ic| src.node_id: 10, dst.node_id: 0, path: [10, 9, 7, 4, 1, 0]
ic| rel: 'father', prob: 0.46716827154159546
ic| src.node_id: 10, dst.node_id: 15, path: [10, 13, 14, 15]
ic| rel: 'father', prob: 0.6251675486564636
ic| src.node_id: 15
    dst.node_id: 0
    path: [15, 14, 13, 10, 9, 7, 4, 1, 0]
ic| rel: 'father', prob: 0.41431477665901184
ic| src.node_id: 15, dst.node_id: 10, path: [15, 14, 13, 10]
ic| rel: 'country of citizenship', prob: 0.8607672452926636
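
The inferred relations above vary widely in probability, and some are implausible (for example, node 10, _Germany_, as the source of a `father` relation). One way to handle this downstream is to keep only relations above a confidence threshold. The following is a minimal sketch, not part of the `textgraph` API: the `InferredRel` class, the `filter_relations` helper, and the `0.7` cutoff are assumptions made for illustration, and the values are copied from the debug output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferredRel:
    """Hypothetical stand-in for one inferred relation from the debug output."""
    src: int
    dst: int
    rel: str
    prob: float


# values copied from the `infer_relations` debug output above
INFERRED = [
    InferredRel(0, 10, "country of citizenship", 0.9073653221130371),
    InferredRel(0, 15, "father", 0.5981622934341431),
    InferredRel(10, 0, "father", 0.46716827154159546),
    InferredRel(10, 15, "father", 0.6251675486564636),
    InferredRel(15, 0, "father", 0.41431477665901184),
    InferredRel(15, 10, "country of citizenship", 0.8607672452926636),
]


def filter_relations(rels, min_prob=0.7):
    """Keep relations whose probability meets the (assumed) threshold."""
    return [r for r in rels if r.prob >= min_prob]


kept = filter_relations(INFERRED)

for r in kept:
    print(f"{r.src} -[{r.rel}]-> {r.dst} ({r.prob:.3f})")
```

A fixed cutoff is a trade-off: at `0.7` it also discards the `father` relation between nodes 0 and 15, which the source sentence supports, so a real pipeline would likely need something more selective than a single threshold.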


In [8]:
tg.calc_phrase_ranks()

ic(tg.edges);

ic| tg.edges: {'0.1.nsubj.0': Edge(src_node=0, dst_node=1, kind=<RelEnum.DEP: 0>, rel='nsubj', prob=1.0, count=1),
               '0.10.country_of_citizenship.1': Edge(src_node=0, dst_node=10, kind=<RelEnum.INF: 1>, rel='country of citizenship', prob=0.9073653221130371, count=1),
               '0.15.father.1': Edge(src_node=0, dst_node=15, kind=<RelEnum.INF: 1>, rel='father', prob=0.5981622934341431, count=1),
               '10.0.father.1': Edge(src_node=10, dst_node=0, kind=<RelEnum.INF: 1>, rel='father', prob=0.46716827154159546, count=1),
               '10.15.father.1': Edge(src_node=10, dst_node=15, kind=<RelEnum.INF: 1>, rel='father', prob=0.6251675486564636, count=1),
               '10.9.pobj.0': Edge(src_node=10, dst_node=9, kind=<RelEnum.DEP: 0>, rel='pobj', prob=1.0, count=1),
               '11.10.punct.0': Edge(src_node=11, dst_node=10, kind=<RelEnum.DEP: 0>, rel='punct', prob=1.0, count=1),
               '12.13.det.0': Edge(src_node=12, dst_node=13, kind=<RelEnum.DEP: 
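
Judging only from the output above, the edge keys appear to follow the pattern `src.dst.rel.kind`, with spaces in the relation name replaced by underscores and the trailing number matching the `RelEnum` value. This is an inference from the printed keys, not a documented format. Under that assumption, a hypothetical helper (`parse_edge_key` is not part of `textgraph`) could split a key back into its parts:

```python
def parse_edge_key(key):
    """
    Split an edge key of the assumed form `src.dst.rel.kind`.

    The format is inferred from the printed keys, so treat it as an assumption.
    """
    # the first two fields are node ids; the last field is the edge kind
    src, dst, rest = key.split(".", 2)
    rel, kind = rest.rsplit(".", 1)
    return int(src), int(dst), rel, int(kind)


# keys copied from the `tg.edges` output above
for key in ["0.1.nsubj.0", "0.10.country_of_citizenship.1", "10.9.pobj.0"]:
    print(parse_edge_key(key))
```

In practice the `Edge` objects already carry `src_node`, `dst_node`, `rel`, and `kind` as fields, so reading those directly is preferable to parsing keys.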

Show the resulting entities extracted from the document, in descending order of weight:

In [9]:
for node in tg.get_phrases():
    ic(node)

ic| node: Node(node_id=10, span=Germany, text='Germany', pos='PROPN', kind='GPE', count=1, weight=0.21223930373896718)
ic| node: Node(node_id=0, span=Werner Herzog, text='Werner Herzog', pos='PROPN', kind='PERSON', count=1, weight=0.19256602205245993)
ic| node: Node(node_id=15, span=Dietrich Herzog, text='Dietrich Herzog', pos='PROPN', kind='PERSON', count=1, weight=0.19256602205245993)
ic| node: Node(node_id=4, span=filmmaker, text='filmmaker', pos='NOUN', kind=None, count=1, weight=0.14093430663797185)
ic| node: Node(node_id=7, span=intellectual, text='intellectual', pos='NOUN', kind=None, count=1, weight=0.13407290293982194)
ic| node: Node(node_id=13, span=son, text='son', pos='NOUN', kind=None, count=1, weight=0.12762144257831917)
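
The `weight` values above come from `calc_phrase_ranks()`. Assuming that this step is a PageRank-style centrality over the lemma graph, as in `pytextrank`, the general idea can be sketched in plain Python. The `pagerank` function below is a simplified illustration, not the library's implementation. It runs on a chain of node ids taken from one of the `path` lists printed earlier and will not reproduce the weights shown above.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over an undirected graph (illustrative only)."""
    neighbors = defaultdict(set)

    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    nodes = sorted(neighbors)
    rank = {n: 1.0 / len(nodes) for n in nodes}

    for _ in range(iterations):
        # each node keeps a base share, plus an equal split of each neighbor's rank
        rank = {
            n: (1.0 - damping) / len(nodes)
            + damping * sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            for n in nodes
        }

    return rank


# node ids along one path from the `infer_relations` debug output
path = [0, 1, 4, 7, 9, 10, 13, 14, 15]
edges = list(zip(path, path[1:]))
ranks = pagerank(edges)

for node_id, weight in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(node_id, round(weight, 4))
```

On this simple chain the two end nodes rank lowest because each has only one neighbor. The real lemma graph also includes dependency and inferred edges, which is why entities such as _Germany_ can rank highest there.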


## outro

_\[ more parts are being added to this demo \]_