# Demo: Creating and inspecting a narrative graph

This notebook is a short demo and tour of some core features of a `NarrativeGraph` object.

## Data setup

For this demo notebook, we will use the _News Category Dataset_ [1, 2], available through Kagglehub, because its texts are short, timestamped, and categorized.

In [1]:
import time

from kagglehub import KaggleDatasetAdapter
import kagglehub

data = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "rmisra/news-category-dataset",
    "News_Category_Dataset_v3.json",
    pandas_kwargs=dict(lines=True),
)
data.head()

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


We will use the following columns as input for our narrative graph:
- Documents: _headline_ + _short_description_
- IDs: _link_, with the URL prefix shared by all entries removed
- Timestamps: _date_
- Categories: _category_

There are many categories. We will create a subset with just two of them: _U.S. News_ and _Politics_.

In [2]:
# create a sample
sample = data[data["category"].isin(["U.S. NEWS", "POLITICS"])].sample(
    5000, random_state=42
)
docs = sample["headline"] + "\n\n" + sample["short_description"]
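# strip the URL prefix shared by all links; .str.replace works on substrings,
# whereas Series.replace only matches whole values
ids = sample["link"].str.replace("https://www.huffpost.com/entry/", "", regex=False)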
categories = sample["category"]
timestamps = sample["date"]
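
The inputs are aligned by position: the n-th ID, category, and timestamp describe the n-th document. As a minimal sketch of the same preparation in plain Python, the records below are made-up stand-ins for dataset rows (the slugs, headlines, and dates are hypothetical), and `ENTRY_PREFIX` is just a name chosen here for the shared URL prefix.

```python
from datetime import date

ENTRY_PREFIX = "https://www.huffpost.com/entry/"

# Made-up records with the same fields as the dataset rows (hypothetical values).
records = [
    {
        "link": ENTRY_PREFIX + "example-story-one",
        "headline": "First example headline",
        "short_description": "First example description.",
        "category": "POLITICS",
        "date": "2020-01-15",
    },
    {
        "link": ENTRY_PREFIX + "example-story-two",
        "headline": "Second example headline",
        "short_description": "Second example description.",
        "category": "U.S. NEWS",
        "date": "2021-06-30",
    },
]

# One entry per record in each list, always in the same order.
example_docs = [f"{r['headline']}\n\n{r['short_description']}" for r in records]
example_ids = [r["link"].removeprefix(ENTRY_PREFIX) for r in records]  # Python 3.9+
example_categories = [r["category"] for r in records]
example_timestamps = [date.fromisoformat(r["date"]) for r in records]

# Sanity check: every metadata list has exactly one value per document.
assert (
    len(example_docs)
    == len(example_ids)
    == len(example_categories)
    == len(example_timestamps)
)

print(example_ids)
```

A length check like the one at the end is a cheap way to catch misaligned metadata before fitting a model.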

## Creating the model

The list of documents is the only required input. With the documents and any extra metadata in lists aligned with them, we can create a narrative graph.

In [3]:
from narrativegraphs import NarrativeGraph

model = NarrativeGraph()
model.fit(docs, doc_ids=ids, categories=categories, timestamps=timestamps)

INFO:narrativegraphs.pipeline:Adding 5000 documents to database
INFO:narrativegraphs.pipeline:Extracting triplets
Extracting triplets: 100%|██████████| 5000/5000 [00:16<00:00, 307.33it/s]
INFO:narrativegraphs.pipeline:Resolving entities and predicates
INFO:narrativegraphs.pipeline:Mapping triplets and tuplets
INFO:narrativegraphs.pipeline:Calculating stats


<narrativegraphs.graphs.NarrativeGraph at 0x16b15aae0>

## Inspecting the model visually

One of the key features of the _narrativegraphs_ package is that it lets a user inspect the output interactively in a browser-based visualizer. It is hosted directly on your machine by the Python package – no extra dependencies required. This is achieved with the one line below.

Click the link in the log messages to open the visualizer in your browser.

In [6]:
# Serve the visualizer in blocking mode, which blocks execution of other cells:
# model.serve_visualizer()

# Or run it in the background:
server = model.serve_visualizer(block=False)

INFO:root:Server started in background on port 8001
INFO:     Started server process [89034]
INFO:     Waiting for application startup.
INFO:root:Database engine provided to state before startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)


INFO:     127.0.0.1:64204 - "GET /graph/types HTTP/1.1" 200 OK
INFO:     127.0.0.1:64204 - "GET /graph/bounds/relation HTTP/1.1" 200 OK
INFO:     127.0.0.1:64204 - "GET /graph/bounds/relation HTTP/1.1" 200 OK
INFO:     127.0.0.1:64206 - "POST /graph HTTP/1.1" 200 OK


INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [89034]


In [7]:
server.stop()

INFO:root:Background server stopped


A background server is stopped with `server.stop()`, as above. A server started in blocking mode is stopped with the stop button on the cell in Jupyter Notebook, or with CTRL+C elsewhere.
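
For readers curious about the difference between the two modes, the following is a generic sketch of the blocking versus background pattern using only the Python standard library. It is not how _narrativegraphs_ implements `serve_visualizer`; the `HelloHandler` class, the `start_background_server` helper, and the response body are all hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny handler that answers every GET with a fixed body."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def start_background_server():
    # Port 0 asks the operating system for any free port.
    httpd = ThreadingHTTPServer(("127.0.0.1", 0), HelloHandler)
    # serve_forever() blocks, so it runs in a separate thread, not in the caller.
    thread = threading.Thread(target=httpd.serve_forever, daemon=True)
    thread.start()
    return httpd, thread


httpd, thread = start_background_server()
try:
    # The caller stays free to do other work, such as querying the server.
    port = httpd.server_address[1]
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/")
    response = conn.getresponse()
    status = response.status
    payload = response.read()
    conn.close()
finally:
    httpd.shutdown()      # makes serve_forever() return
    httpd.server_close()  # releases the socket
    thread.join()
```

Calling `httpd.serve_forever()` directly instead of in a thread would give the blocking behavior: nothing after that call runs until the server is interrupted.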

## Inspecting and accessing the model programmatically

The graph consists of entities as nodes and their relations or cooccurrences as edges. These, along with the data that back them, like documents and extracted semantic triplets, can be retrieved from the model through properties or service attributes.

### Attributes

We can get the graph as a whole, as a NetworkX graph, through the properties `.relation_graph_` and `.cooccurrence_graph_`.

In [7]:
relation_graph = model.relation_graph_

In [8]:
print(type(relation_graph))

<class 'networkx.classes.digraph.DiGraph'>


In [9]:
print(*list(relation_graph.nodes(data=True))[:3], sep="\n")

(1, {'id': 1, 'label': 'Deportation Agents', 'frequency': 1, 'focus': False})
(2, {'id': 2, 'label': 'An App', 'frequency': 1, 'focus': False})
(3, {'id': 3, 'label': 'the non-partisan Congressional Budget Office (CBO', 'frequency': 1, 'focus': False})


Similarly, entities, relations, and the other underlying data can be accessed as `pandas.DataFrame`s through properties.

In [10]:
model.entities_

| | id | label | frequency | doc_frequency | spread | adjusted_tf_idf | first_occurrence | last_occurrence | alt_labels | category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Deportation Agents | 1 | 1 | 0.0002 | 0.000000 | 2022-03-11 | 2022-03-11 | [] | [POLITICS] |
| 1 | 2 | An App | 1 | 1 | 0.0002 | 0.000000 | 2022-03-11 | 2022-03-11 | [] | [POLITICS] |
| 2 | 3 | the non-partisan Congressional Budget Office (CBO | 1 | 1 | 0.0002 | 0.000000 | 2017-12-11 | 2017-12-11 | [] | [POLITICS] |
| 3 | 4 | that its estimate | 1 | 1 | 0.0002 | 0.000000 | 2017-12-11 | 2017-12-11 | [] | [POLITICS] |
| 4 | 5 | The city council | 2 | 2 | 0.0004 | 1666.666667 | 2016-03-29 | 2016-08-20 | [] | [POLITICS, POLITICS] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3437 | 3438 | His Mind | 1 | 1 | 0.0002 | 0.000000 | 2016-04-13 | 2016-04-13 | [] | [POLITICS] |
| 3438 | 3439 | Interested | 1 | 1 | 0.0002 | 0.000000 | 2017-01-19 | 2017-01-19 | [] | [POLITICS] |
| 3439 | 3440 | The DNC Contenders | 1 | 1 | 0.0002 | 0.000000 | 2017-01-19 | 2017-01-19 | [] | [POLITICS] |
| 3440 | 3441 | the stage | 1 | 1 | 0.0002 | 0.000000 | 2017-12-16 | 2017-12-16 | [] | [POLITICS] |


The properties (with a trailing `_`) return the data in well-known formats that are easy to keep working with, e.g. NetworkX graphs for graph algorithms and DataFrames for statistical analyses.
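
As a small illustration of what the NetworkX format allows, the sketch below builds a toy `DiGraph` whose nodes carry the same attributes as the node output above (`id`, `label`, `frequency`, `focus`), filters out rare nodes, and ranks the rest by degree. The labels, frequencies, and edges are made up, and the edge attribute `label` is an assumption about what `relation_graph_` stores, so treat this as a pattern rather than a description of the real graph.

```python
import networkx as nx

# Toy stand-in for model.relation_graph_ (all values are hypothetical).
toy_graph = nx.DiGraph()
toy_graph.add_node(1, id=1, label="Democrats", frequency=12, focus=False)
toy_graph.add_node(2, id=2, label="GOP", frequency=9, focus=False)
toy_graph.add_node(3, id=3, label="Health Care", frequency=4, focus=False)
toy_graph.add_node(4, id=4, label="An App", frequency=1, focus=False)

# The edge attribute name "label" is assumed for illustration.
toy_graph.add_edge(1, 2, label="criticize")
toy_graph.add_edge(1, 3, label="push")
toy_graph.add_edge(4, 1, label="mention")

# Keep only nodes that occur more than once.
frequent = [n for n, attrs in toy_graph.nodes(data=True) if attrs["frequency"] > 1]
core = toy_graph.subgraph(frequent)

# Rank the remaining nodes by total degree (incoming plus outgoing edges).
ranking = sorted(core.degree, key=lambda pair: pair[1], reverse=True)
for node_id, degree in ranking:
    print(core.nodes[node_id]["label"], degree)
```

The same pattern should carry over to the real graph by swapping `toy_graph` for `model.relation_graph_`, provided the attribute names match.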

### Service attributes

The service attributes offer more control, which is especially handy when the model is large and you do not want everything returned at once.

For instance, you can search for entities with the `entities` service.

In [11]:
democrats_matches = model.entities.search("democrats")
democrats_matches[:10]

[EntityLabel(id=114, label='Democrats'),
 EntityLabel(id=1480, label="Democrats' big reform bill")]

In [12]:
democrats_id = democrats_matches[0].id

And you can create a subgraph that expands from a set of focus nodes and only includes those that pass a filter.

In [13]:
from datetime import date
from narrativegraphs import GraphFilter

democrats_graph = model.graph.expand_from_focus_entities(
    [democrats_id],
    "relation",
    graph_filter=GraphFilter(
        categories={'category': ["POLITICS"]},
        earliest_date=date(2014, 1, 1)
    )
)

# strip labels to remove surrounding whitespace
print("NODES")
for node in democrats_graph.nodes:
    print(node.id, node.label.strip())

print("\nEDGES")
for edge in democrats_graph.edges:
    print(edge.subject_label.strip(), '--', edge.label, '->', edge.object_label.strip())

NODES
13 Trump
23 GOP
24 Bill
32 Betsy DeVos
65 Obama
77 States
88 Biden
114 Democrats
115 A majority
144 this week's "Candidate Confessional
185 Jefferson Jackson Dinner
261 Chuck Schumer
329 A Landslide
349 A Run
413 Liberals
431 Health Care
479 different findings
539 Record Donations
606 an even more ambitious vision
609 Planned Parenthood Shooting
621 The candidate
641 Republican Mike DeWine
681 Tehran
685 Anthony Weiner
704 judicial nominee Steven Menashi
742 This Billionaire Environmental Activist
911 Pennsylvania Republican
956 Republican Lt. Gov. Kim Guadagno
1070 Special Election
1071 the ballot box
1072 Resources
1222 More Clarity
1324 World leaders
1326 Eyeing 2018 Senate Takeover
1370 key states
1373 Little Time
1443 Hugh Hewitt
1578 The Longest-Serving Woman
1755 Hemp
1785 ‘Hostage Czar
1866 Nationwide Day
1940 Black candidates
1947 A Call
2033 For Independent Commission
2080 Republican Cory Gardner
2181 Rep. Ruben Kihuen
2227 Renewed Push
2280 Oversight Committee
2398 Cal
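
Conceptually, expanding from focus entities means starting at the chosen nodes and following only the edges that pass the filter. The sketch below shows that idea as a breadth-first search over a toy edge list. It is not the package's implementation: the `Edge` tuple, the `expand_from_focus` function, the `max_hops` parameter, and all data are hypothetical.

```python
from collections import deque
from datetime import date
from typing import NamedTuple


class Edge(NamedTuple):
    subject: str
    predicate: str
    obj: str
    category: str
    when: date


# Toy relations (made up for illustration).
EDGES = [
    Edge("Democrats", "criticize", "GOP", "POLITICS", date(2016, 5, 1)),
    Edge("GOP", "block", "Bill", "POLITICS", date(2017, 3, 2)),
    Edge("Democrats", "visit", "States", "U.S. NEWS", date(2018, 1, 9)),
    Edge("Obama", "endorse", "Democrats", "POLITICS", date(2012, 9, 4)),
    Edge("Bill", "affect", "Health Care", "POLITICS", date(2017, 6, 20)),
]


def expand_from_focus(edges, focus, categories, earliest_date, max_hops=1):
    """Collect nodes and edges reachable from `focus` via edges passing the filter."""
    allowed = [e for e in edges if e.category in categories and e.when >= earliest_date]
    seen = set(focus)
    kept = []
    queue = deque((node, 0) for node in focus)
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for edge in allowed:
            if node not in (edge.subject, edge.obj):
                continue
            if edge not in kept:
                kept.append(edge)
            neighbour = edge.obj if edge.subject == node else edge.subject
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen, kept


kept_nodes, kept_edges = expand_from_focus(
    EDGES, ["Democrats"], {"POLITICS"}, date(2014, 1, 1)
)
for edge in kept_edges:
    print(edge.subject, "--", edge.predicate, "->", edge.obj)
```

In this toy data, the category filter drops the "U.S. NEWS" edge and the date filter drops the 2012 edge, so only the relation between Democrats and GOP survives a one-hop expansion.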

### Saving and loading the model

We can save the model for later use, which is especially useful if we have a lot of documents that take a while to process.

In [14]:
model.save_to_file("demo.db", overwrite=True)

And we can load it from that saved file.

In [15]:
model = NarrativeGraph.load("demo")

## References

[1] Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).

[2] Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

