# Demo: Creating and inspecting a narrative graph

This notebook is a demo and short tour of the core functionality of a `NarrativeGraph` object.

## Data setup

For this demo notebook, we will be using the _News Category Dataset_ [1, 2], available on Kagglehub, because its texts are short, timestamped and categorized.

In [1]:
from kagglehub import KaggleDatasetAdapter
import kagglehub

data = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "rmisra/news-category-dataset",
    "News_Category_Dataset_v3.json",
    pandas_kwargs=dict(lines=True),
)
data.head()

Unnamed: 0,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22


We will use the following columns as input for our narrative graph:
- Documents: _headline_ + _short_description_
- IDs: _link_, with the URL prefix shared by all links removed
- Timestamps: _date_
- Categories: _category_
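
As a small illustration of the ID derivation listed above, the sketch below strips a shared URL prefix from a couple of links using only the standard library. The example URLs and the helper name `link_to_id` are made up for illustration; `str.removeprefix` requires Python 3.9 or later.

```python
PREFIX = "https://www.huffpost.com/entry/"


def link_to_id(link: str) -> str:
    """Drop the shared URL prefix, leaving only the article slug."""
    return link.removeprefix(PREFIX)


# Hypothetical links, shaped like the ones in the dataset
links = [
    "https://www.huffpost.com/entry/example-article-one",
    "https://www.huffpost.com/entry/example-article-two",
]
ids = [link_to_id(link) for link in links]
print(ids)
```

Strings that do not start with the prefix are returned unchanged, so the helper is safe to apply to every row.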

There are many categories. We will create a subset with just two of them: _U.S. News_ and _Politics_.

In [2]:
# create a sample
sample = data[data["category"].isin(["U.S. NEWS", "POLITICS"])].sample(
    5000, random_state=42
)
docs = sample["headline"] + "\n\n" + sample["short_description"]
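# drop the URL prefix shared by all links; .str.replace works on substrings,
# whereas Series.replace only matches whole values
ids = sample["link"].str.replace("https://www.huffpost.com/entry/", "", regex=False)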
categories = sample["category"]
timestamps = sample["date"]
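
Because the documents and their metadata are handled as separate, position-aligned sequences, a quick sanity check can catch mismatches early. The sketch below is not part of _narrativegraphs_; the helpers `check_aligned` and `find_duplicates` are hypothetical, and it assumes that document IDs are meant to be unique.

```python
from collections import Counter


def check_aligned(docs, **metadata):
    """Raise ValueError if any metadata sequence differs in length from docs."""
    expected = len(docs)
    for name, values in metadata.items():
        if len(values) != expected:
            raise ValueError(f"{name} has {len(values)} items, expected {expected}")


def find_duplicates(ids):
    """Return the IDs that occur more than once."""
    return [value for value, count in Counter(ids).items() if count > 1]


# Toy inputs standing in for docs, ids, categories and timestamps
docs = ["Headline A\n\nDescription A", "Headline B\n\nDescription B"]
doc_ids = ["article-a", "article-b"]
categories = ["POLITICS", "U.S. NEWS"]
timestamps = ["2022-09-23", "2022-09-22"]

check_aligned(docs, doc_ids=doc_ids, categories=categories, timestamps=timestamps)
print(find_duplicates(doc_ids))
```

Both helpers rely only on `len()` and iteration, so they should work equally well on plain lists and on pandas Series.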

## Creating the model

Once we have our list of documents, which is the only required input, and extra metadata in aligned lists, we can create a narrative graph.

In [3]:
from narrativegraphs import NarrativeGraph

model = NarrativeGraph()
model.fit(docs, doc_ids=ids, categories=categories, timestamps=timestamps)

INFO:narrativegraphs.pipeline:Adding 5000 documents to database
INFO:narrativegraphs.pipeline:Extracting triplets
Extracting triplets: 100%|██████████| 5000/5000 [00:17<00:00, 288.87it/s]
INFO:narrativegraphs.pipeline:Resolving entities and predicates
INFO:narrativegraphs.pipeline:Mapping triplets and tuplets
INFO:narrativegraphs.pipeline:Calculating stats


<narrativegraphs.graphs.NarrativeGraph at 0x175c6a5d0>

## Inspecting the model visually

One of the key features of the _narrativegraphs_ package is that it lets a user inspect the output interactively in a browser-based visualizer. It is hosted directly on your machine by the Python package – no extra dependencies required. This is achieved with the one line below.

Click the link in the log messages to open in your browser.

In [4]:
# start a local server for the browser-based visualizer;
# this call blocks execution of other cells until it is stopped
model.serve_visualizer()

INFO:     Started server process [22036]
INFO:     Waiting for application startup.
INFO:root:Database engine provided to state before startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [22036]
INFO:root:Server stopped by user


Stop the server by pressing the stop button on the cell in Jupyter Notebook, or by pressing CTRL+C elsewhere.

## Inspecting and accessing the model programmatically

The graph consists of entities as nodes and their relations or cooccurrences as edges. These, along with the data that back them, like documents and extracted semantic triplets, can be retrieved from the model through properties or service attributes.

### Attributes

We can get the graph as a whole, as a NetworkX graph, through the properties `.relation_graph_` and `.cooccurrence_graph_`.

In [5]:
relation_graph = model.relation_graph_


In [6]:
print(type(relation_graph))

<class 'networkx.classes.digraph.DiGraph'>


In [7]:
print(*list(relation_graph.nodes(data=True))[:3], sep="\n")

(1, {'id': 1, 'label': 'Deportation Agents', 'frequency': 1, 'focus': False})
(2, {'id': 2, 'label': 'An App', 'frequency': 1, 'focus': False})
(3, {'id': 3, 'label': 'the non-partisan Congressional Budget Office (CBO', 'frequency': 1, 'focus': False})
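
Since the result is an ordinary `networkx.DiGraph`, the usual NetworkX tools apply. The sketch below builds a toy graph with the same `label` and `frequency` node attributes shown above (the nodes, edges and values are made up), keeps only the more frequent nodes, and ranks them by degree.

```python
import networkx as nx

# Toy stand-in for model.relation_graph_; labels and frequencies are made up
graph = nx.DiGraph()
graph.add_node(1, label="Entity A", frequency=5)
graph.add_node(2, label="Entity B", frequency=1)
graph.add_node(3, label="Entity C", frequency=3)
graph.add_edge(1, 2)
graph.add_edge(1, 3)
graph.add_edge(3, 1)

# Keep nodes at or above a frequency threshold and take the induced subgraph
frequent = [node for node, attrs in graph.nodes(data=True) if attrs["frequency"] >= 3]
subgraph = graph.subgraph(frequent)

# Rank the remaining nodes by total degree (in-degree plus out-degree)
ranked = sorted(subgraph.degree(), key=lambda pair: pair[1], reverse=True)
for node_id, degree in ranked:
    print(subgraph.nodes[node_id]["label"], degree)
```

The same pattern should carry over to the real graph by replacing the toy `graph` with `model.relation_graph_`.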


Similarly, entities, relations and the other underlying data can be accessed as `pandas.DataFrame`s through properties.

In [8]:
model.entities_

Unnamed: 0,id,label,frequency,doc_frequency,spread,adjusted_tf_idf,first_occurrence,last_occurrence,alt_labels,category
0,1,Deportation Agents,1,1,0.0002,0.0,2022-03-11,2022-03-11,[],[POLITICS]
1,2,An App,1,1,0.0002,0.0,2022-03-11,2022-03-11,[],[POLITICS]
2,3,the non-partisan Congressional Budget Office (CBO,1,1,0.0002,0.0,2017-12-11,2017-12-11,[],[POLITICS]
3,4,that its estimate,1,1,0.0002,0.0,2017-12-11,2017-12-11,[],[POLITICS]
4,5,a measure,3,3,0.0006,2500.0,2015-09-04,2018-02-09,"[""the measure""]","[POLITICS, POLITICS, POLITICS]"
...,...,...,...,...,...,...,...,...,...,...
3384,3385,His Mind,1,1,0.0002,0.0,2016-04-13,2016-04-13,[],[POLITICS]
3385,3386,Interested,1,1,0.0002,0.0,2017-01-19,2017-01-19,[],[POLITICS]
3386,3387,The DNC Contenders,1,1,0.0002,0.0,2017-01-19,2017-01-19,[],[POLITICS]
3387,3388,the stage,1,1,0.0002,0.0,2017-12-16,2017-12-16,[],[POLITICS]


The properties (with trailing `_`) are nice in that they give back the data in well-known formats that one can continue working with, e.g. NetworkX graphs for graph algorithms and DataFrames for statistical analyses.
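
As a rough sketch of the kind of follow-up analysis a DataFrame allows, the example below filters and sorts a toy frame that reuses a few column names from the `entities_` output above. The rows are made up; only the column names come from the output.

```python
import pandas as pd

# Toy stand-in for model.entities_; values are made up
entities = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "label": ["Entity A", "Entity B", "Entity C", "Entity D"],
        "frequency": [1, 12, 3, 7],
        "doc_frequency": [1, 9, 3, 5],
    }
)

# Entities mentioned in at least 3 documents, most frequent first
frequent = (
    entities[entities["doc_frequency"] >= 3]
    .sort_values("frequency", ascending=False)
    .reset_index(drop=True)
)
print(frequent[["label", "frequency"]])

# Share of entities that occur only once
singleton_share = (entities["frequency"] == 1).mean()
print(singleton_share)
```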

### Service attributes

The service attributes, however, offer more control, which is especially handy when the model is large and you do not want everything returned at once.

For instance, you can search for entities with the `entities` service.

In [9]:
white_house_matches = model.entities.search("White House")
white_house_matches[:10]

[EntityLabel(id=3160, label="White House's Idea"),
 EntityLabel(id=2317, label='White House Official'),
 EntityLabel(id=35, label='White House influence'),
 EntityLabel(id=1525, label='White House press secretary Sarah Huckabee Sanders'),
 EntityLabel(id=3046, label='The White House chief'),
 EntityLabel(id=2994, label='the White House Council'),
 EntityLabel(id=166, label='the White House adviser'),
 EntityLabel(id=1134, label='the White House narrative'),
 EntityLabel(id=965, label='Former White House counselor'),
 EntityLabel(id=1009, label='The former White House communications director')]

In [10]:
white_house_id = white_house_matches[0].id
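
Taking the first hit is convenient, but as the search output above shows, the first hit need not be the label closest to the query. One possible alternative, sketched below, prefers a match whose label equals the query after light normalization and falls back to the first hit otherwise. The `EntityLabel` dataclass here is a stand-in mirroring the `id` and `label` fields printed above, `normalize` and `pick_best_match` are hypothetical helpers, and the entry with id 7 is invented for the example.

```python
from dataclasses import dataclass


@dataclass
class EntityLabel:
    """Stand-in with the same fields as the search results printed above."""

    id: int
    label: str


def normalize(label: str) -> str:
    """Lowercase, trim whitespace, and drop a leading 'the '."""
    return label.strip().lower().removeprefix("the ")


def pick_best_match(matches, query):
    """Return the match whose normalized label equals the query, else the first hit."""
    target = normalize(query)
    for match in matches:
        if normalize(match.label) == target:
            return match
    return matches[0] if matches else None


matches = [
    EntityLabel(id=3160, label="White House's Idea"),
    EntityLabel(id=2317, label="White House Official"),
    EntityLabel(id=7, label="the White House"),  # invented entry for illustration
]
best = pick_best_match(matches, "White House")
print(best)
```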

And you can create a subgraph that expands from a set of focus nodes and only includes those that pass a filter.

In [11]:
from datetime import date
from narrativegraphs import GraphFilter

white_house_graph = model.graph.expand_from_focus_entities(
    [white_house_id],
    "relation",
    graph_filter=GraphFilter(
        minimum_node_frequency=20,
        categories={'category': ["POLITICS"]},
        earliest_date=date(2014, 1, 1)
    )
)

# strip labels to remove stray whitespace
print("NODES")
for node in white_house_graph.nodes:
    print(node.id, node.label.strip())

print("\nEDGES")
for edge in white_house_graph.edges:
    print(edge.subject_label.strip(), '-', edge.label, '->', edge.object_label.strip())

NODES
3160 White House's Idea

EDGES
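
In the output above the focus node comes back on its own with no edges, which is what one would expect if none of its neighbors pass the filter. To make the idea of filtered expansion concrete, here is a standard-library sketch of one way such an expansion could work. It is not the _narrativegraphs_ implementation: the function name, the `max_depth` parameter and the choice to always keep focus nodes are assumptions made for illustration.

```python
from collections import deque


def expand_from_focus(edges, frequency, focus, min_frequency, max_depth=1):
    """Breadth-first expansion from focus nodes, keeping neighbors that pass the filter."""
    # Build an undirected adjacency map from the directed edge list
    neighbors = {}
    for source, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)

    kept = set(focus)  # assumption: focus nodes are kept regardless of the filter
    queue = deque((node, 0) for node in focus)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue
        for other in neighbors.get(node, ()):
            if other not in kept and frequency.get(other, 0) >= min_frequency:
                kept.add(other)
                queue.append((other, depth + 1))

    kept_edges = [(s, t) for s, t in edges if s in kept and t in kept]
    return kept, kept_edges


# Toy graph: node names and frequencies are made up
edges = [("a", "b"), ("a", "c"), ("c", "d")]
frequency = {"a": 1, "b": 25, "c": 5, "d": 40}

nodes, kept_edges = expand_from_focus(edges, frequency, ["a"], min_frequency=20, max_depth=2)
print(sorted(nodes), kept_edges)
```

In this toy run, node `d` is frequent enough but is only reachable through `c`, which fails the filter, so it is left out. A strict frequency threshold can therefore prune more of the neighborhood than the node counts alone suggest.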


### Saving and loading the model

We can save the model for later use, which is especially useful if we have a lot of documents that take a while to process.

In [17]:
model.save_to_file("demo")

And we can load it from that saved file.

In [16]:
model = NarrativeGraph.load("demo")

## References

[1] Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).

[2] Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

