# textgraph demo

Implementation of an LLM-augmented `textgraph` algorithm for constructing a _knowledge graph_ from raw, unstructured text sources.

This code is based on work developed by [Derwen](https://derwen.ai/graph) in early 2023 for the _Cysoni_ product and presented in the talk ["Language, Graphs, and AI in Industry"](https://derwen.ai/s/mqqm). It integrates code from:

  * <https://github.com/tomaarsen/SpanMarkerNER/>
  * <https://github.com/thunlp/OpenNRE/>
  * <https://github.com/DerwenAI/pytextrank/>
  * [_Create a spaCy Visualizer with Streamlit_](https://medium.com/@groxli/create-a-spacy-visualizer-with-streamlit-8b9b41b36745)

Two tutorials, also from 2023, cover related material:

  * ["Natural Intelligence is All You Need \[tm\]"](https://youtu.be/C9p7suS-NGk?si=7Ohq3BV654ia2Im4), **Vincent Warmerdam**, PyData Amsterdam (2023-09-15)

  * ["How to Convert Any Text Into a Graph of Concepts"](https://towardsdatascience.com/how-to-convert-any-text-into-a-graph-of-concepts-110844f22a1a), **Rahul Nayak**, _Towards Data Science_ (2023-11-09)

## parse a document

In [1]:
from icecream import ic
from textgraph import Node, Edge, RelEnum, TextGraph
import spacy


In [2]:
SRC_TEXT: str = """
Werner Herzog is a remarkable filmmaker and intellectual originally from Germany, the son of Dietrich Herzog.
"""

In [3]:
tg: TextGraph = TextGraph()

sample_doc: spacy.tokens.doc.Doc = tg.build_doc(
    SRC_TEXT.strip(),
    use_llm = False,
)

2023-11-24 15:40:59,952 - root - INFO - Initializing word embedding with word2vec.


In [4]:
spacy.displacy.render(
    sample_doc,
    style = "ent",
    jupyter = True,
)

In [5]:
spacy.displacy.render(
    sample_doc,
    style = "dep",
    jupyter = True,
)

## build a lemma graph from the document

In [6]:
tg.build_graph_embeddings(
    sample_doc,
    debug = True,
)

ic| sent: Werner Herzog is a remarkable filmmaker and intellectual originally from Germany, the son of Dietrich Herzog.
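
One way to picture a lemma graph: each token or phrase becomes a node, each dependency arc becomes an edge, and the connection between two entities is a path found by graph search. The sketch below is not the `textgraph` implementation. It uses a hand-written, undirected edge list whose node ids are read off the edges and paths printed later in this notebook, and the names `EDGES`, `build_adjacency`, and `shortest_path` are hypothetical.

```python
from collections import deque

# Undirected edges between node ids. These are an assumption reconstructed
# from the paths and dependency edges printed in this notebook, not a dump
# of the graph that textgraph builds.
EDGES = [
    (0, 1), (1, 4), (4, 8), (8, 9),
    (9, 12), (12, 13), (13, 14),
    (10, 9), (11, 12),
]


def build_adjacency(edges):
    """Map each node id to the set of its neighbors, ignoring direction."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency


def shortest_path(adjacency, src, dst):
    """Breadth-first search; returns a list of node ids, or None if unreachable."""
    if src not in adjacency or dst not in adjacency:
        return None
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the predecessor links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


adjacency = build_adjacency(EDGES)
print(shortest_path(adjacency, 0, 9))
print(shortest_path(adjacency, 14, 9))
```

Breadth-first search is used here only because it is the simplest way to get a shortest path in an unweighted graph; the library may use a different search.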


In [7]:
tg.infer_relations(
    SRC_TEXT.strip(),
    debug = True,
)

ic| src.node_id: 0, dst.node_id: 9, path: [0, 1, 4, 8, 9]
ic| rel: 'country of citizenship', prob: 0.8758444786071777
ic| src.node_id: 0, dst.node_id: 14, path: [0, 1, 4, 8, 9, 12, 13, 14]
ic| rel: 'father', prob: 0.5328834652900696
ic| src.node_id: 9, dst.node_id: 0, path: [9, 8, 4, 1, 0]
ic| rel: 'father', prob: 0.4474824368953705
ic| src.node_id: 9, dst.node_id: 14, path: [9, 12, 13, 14]
ic| rel: 'father', prob: 0.6295275092124939
ic| src.node_id: 14, dst.node_id: 0, path: [14, 13, 12, 9, 8, 4, 1, 0]
ic| rel: 'child', prob: 0.4072343409061432
ic| src.node_id: 14, dst.node_id: 9, path: [14, 13, 12, 9]
ic| rel: 'country of citizenship', prob: 0.8630587458610535
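
The inferred relations above arrive with probabilities ranging from about 0.41 to 0.88, so a downstream step may want to keep only the more confident ones. The following is a minimal sketch under stated assumptions: the records are copied by hand from the debug output, and `InferredRelation`, `keep_confident`, and the `0.5` cutoff are hypothetical choices, not part of the `textgraph` API.

```python
from typing import List, NamedTuple


class InferredRelation(NamedTuple):
    """One inferred relation, mirroring the fields in the debug output."""
    src: int
    dst: int
    rel: str
    prob: float


# Values copied from the debug output of `infer_relations` above.
INFERRED = [
    InferredRelation(0, 9, "country of citizenship", 0.8758444786071777),
    InferredRelation(0, 14, "father", 0.5328834652900696),
    InferredRelation(9, 0, "father", 0.4474824368953705),
    InferredRelation(9, 14, "father", 0.6295275092124939),
    InferredRelation(14, 0, "child", 0.4072343409061432),
    InferredRelation(14, 9, "country of citizenship", 0.8630587458610535),
]


def keep_confident(
    relations: List[InferredRelation],
    min_prob: float = 0.5,  # arbitrary cutoff, chosen for illustration only
) -> List[InferredRelation]:
    """Drop relations below `min_prob` and sort the rest, most probable first."""
    kept = [r for r in relations if r.prob >= min_prob]
    return sorted(kept, key=lambda r: r.prob, reverse=True)


for relation in keep_confident(INFERRED):
    print(relation)
```

A probability cutoff only removes low-confidence predictions; it does not check whether a surviving relation makes sense for the entity types involved, so further validation would likely still be needed.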


In [8]:
tg.calc_phrase_ranks()

ic(tg.edges)
ic(tg.nodes)

ic| tg.edges: {'0.1.nsubj.0': Edge(src_node=0, dst_node=1, kind=<RelEnum.DEP: 0>, rel='nsubj', prob=1.0, count=1),
               '0.14.father.1': Edge(src_node=0, dst_node=14, kind=<RelEnum.INFER: 1>, rel='father', prob=0.5328834652900696, count=1),
               '0.9.country_of_citizenship.1': Edge(src_node=0, dst_node=9, kind=<RelEnum.INFER: 1>, rel='country of citizenship', prob=0.8758444786071777, count=1),
               '10.9.punct.0': Edge(src_node=10, dst_node=9, kind=<RelEnum.DEP: 0>, rel='punct', prob=1.0, count=1),
               '11.12.det.0': Edge(src_node=11, dst_node=12, kind=<RelEnum.DEP: 0>, rel='det', prob=1.0, count=1),
               '12.9.appos.0': Edge(src_node=12, dst_node=9, kind=<RelEnum.DEP: 0>, rel='appos', prob=1.0, count=1),
               '13.12.prep.0': Edge(src_node=13, dst_node=12, kind=<RelEnum.DEP: 0>, rel='prep', prob=1.0, count=1),
               '14.0.child.1': Edge(src_node=14, dst_node=0, kind=<RelEnum.INFER: 1>, rel='child', prob=0.40723434090
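
The edge keys in the output above appear to follow the pattern `src.dst.rel.kind`, with spaces in the relation name replaced by underscores and the kind given as the enum's integer value. The sketch below reproduces that pattern as read from the printed keys. `RelKind` and `EdgeSketch` are hypothetical stand-ins for the library's `RelEnum` and `Edge`, and the `key()` method is an assumption about how such keys could be derived.

```python
from dataclasses import dataclass
from enum import IntEnum


class RelKind(IntEnum):
    """Hypothetical stand-in for RelEnum, using the values shown in the output."""
    DEP = 0    # edge taken from the dependency parse
    INFER = 1  # edge produced by relation inference


@dataclass
class EdgeSketch:
    """Hypothetical stand-in for Edge, with the same fields as the printed repr."""
    src_node: int
    dst_node: int
    kind: RelKind
    rel: str
    prob: float = 1.0
    count: int = 1

    def key(self) -> str:
        """Build a key shaped like the ones printed for `tg.edges`."""
        rel_slug = self.rel.replace(" ", "_")
        return f"{self.src_node}.{self.dst_node}.{rel_slug}.{self.kind.value}"


dep_edge = EdgeSketch(0, 1, RelKind.DEP, "nsubj")
inferred_edge = EdgeSketch(
    0, 9, RelKind.INFER, "country of citizenship", prob=0.8758444786071777
)
print(dep_edge.key())
print(inferred_edge.key())
```

Keying edges this way would keep a dependency edge and an inferred edge between the same pair of nodes distinct, since the relation name and kind are both part of the key.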

OrderedDict([('werner herzog.PROPN',
              Node(node_id=0, span=Werner Herzog, text='Werner Herzog', pos='PROPN', kind='PERSON', count=1, weight=0.05292064566661492)),
             ('1.is.AUX',
              Node(node_id=1, span=is, text='is', pos='AUX', kind=None, count=0, weight=0.0)),
             ('2.a.DET',
              Node(node_id=2, span=a, text='a', pos='DET', kind=None, count=0, weight=0.0)),
             ('3.remarkable.ADJ',
              Node(node_id=3, span=remarkable, text='remarkable', pos='ADJ', kind=None, count=0, weight=0.0)),
             ('filmmaker.NOUN',
              Node(node_id=4, span=filmmaker, text='filmmaker', pos='NOUN', kind=None, count=1, weight=0.15543634971530518)),
             ('5.and.CCONJ',
              Node(node_id=5, span=and, text='and', pos='CCONJ', kind=None, count=0, weight=0.0)),
             ('6.intellectual.ADJ',
              Node(node_id=6, span=intellectual, text='intellectual', pos='ADJ', kind=None, count=0, weight=0.0)),
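
Each node in the output above carries a `weight`, and only some nodes have a non-zero value after `calc_phrase_ranks()`. As a minimal sketch of how those weights might be used, the snippet below lists the highest-weighted phrases. It works on a plain dictionary copied by hand from the nodes shown above (later nodes are not included), and `NODE_WEIGHTS` and `top_phrases` are hypothetical names, not part of the `textgraph` API.

```python
from typing import Dict, List, Tuple

# Weights copied from the node listing above; only nodes 0 through 6 appear.
NODE_WEIGHTS: Dict[str, float] = {
    "Werner Herzog": 0.05292064566661492,
    "is": 0.0,
    "a": 0.0,
    "remarkable": 0.0,
    "filmmaker": 0.15543634971530518,
    "and": 0.0,
    "intellectual": 0.0,
}


def top_phrases(weights: Dict[str, float], limit: int = 5) -> List[Tuple[str, float]]:
    """Return up to `limit` phrases with non-zero weight, highest weight first."""
    ranked = sorted(
        ((text, weight) for text, weight in weights.items() if weight > 0.0),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:limit]


for text, weight in top_phrases(NODE_WEIGHTS):
    print(f"{weight:.4f}  {text}")
```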
   