# Who-Did-What Package: Advanced Functions & Analytics Guide

In addition to extracting Subject-Verb-Object (SVO) triples from text and visualizing them as graphs, the **Who-Did-What** package provides several utility functions for merging DataFrames, filtering triples, exporting specific nodes (Subjects, Verbs, Objects), and computing degree centrality metrics.

This guide takes a closer look at these advanced functions and at the library's analytical tools.

By the end of this tutorial, you'll understand how these functions work and how to integrate them into your own Who-Did-What workflow.

**First, import the `whodidwhat` and `whodidwhat.analytics` modules so that they can be used in the examples below.**

In [1]:
# Import the main package and its analytics submodule
import whodidwhat as wdw
import whodidwhat.analytics as wdw_analytics



In [2]:

text = """Gale and Karlach go to a nice pub to drink a tasty beer and a bad wine. 
They love to share stories with each other. 
Wyll really loves eggplants and fantastic aubergines."""

# Extract the SVO triples
df = wdw.extract_svos_from_text(text)
display(df.head())




Unnamed: 0,Node 1,WDW,Node 2,WDW2,Hypergraph,Semantic-Syntactic,svo_id
0,gale,Who,go,Did,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
1,karlach,Who,go,Did,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
2,go,Did,to nice pub,What,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
3,gale,Who,karlach,Who,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
4,gale,Who,drink,Did,"[[('gale', []), ('karlach', [])], ['drink'], [...",0,1


In [3]:
text2 = """Marco and Michelle go to the hawaii to drink a tasty beer and a bad wine."""

# Extract the SVO (subject-verb-object) triples from the second text
df2 = wdw.extract_svos_from_text(text2)
display(df2.head())


-----

## WDW Manipulation tools

### 1. Merging Multiple WDW DataFrames

**Function**: `merge_svo_dataframes(df_list)`

**Purpose**: Merges a list of DataFrames containing WDW data into a single DataFrame. The function ensures unique `svo_id` values across all merged DataFrames.


In [4]:
df_list = [df, df2]

# Merge them
merged_df = wdw_analytics.merge_svo_dataframes(df_list)

# Display the merged DataFrame
display(merged_df.head())


Unnamed: 0,Node 1,WDW,Node 2,WDW2,Hypergraph,Semantic-Syntactic,svo_id
0,gale,Who,go,Did,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
1,karlach,Who,go,Did,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
2,go,Did,to nice pub,What,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
3,gale,Who,karlach,Who,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
4,gale,Who,drink,Did,"[[('gale', []), ('karlach', [])], ['drink'], [...",0,1
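
The idea of keeping `svo_id` values unique while merging can be sketched with plain pandas. This is a minimal illustration on toy data, not the library's actual implementation: the helper name `merge_with_unique_ids` and the offset strategy are assumptions made for this example.

```python
import pandas as pd

def merge_with_unique_ids(frames, id_col="svo_id"):
    """Concatenate frames, shifting id_col so ids from different frames never collide.

    Hypothetical helper for illustration; merge_svo_dataframes may work differently.
    """
    shifted, offset = [], 0
    for frame in frames:
        frame = frame.copy()
        frame[id_col] = frame[id_col] + offset
        if not frame.empty:
            # The next frame starts numbering after the largest id seen so far
            offset = int(frame[id_col].max()) + 1
        shifted.append(frame)
    return pd.concat(shifted, ignore_index=True)

# Toy stand-ins for two extraction results, both starting at svo_id 0
a = pd.DataFrame({"Node 1": ["gale", "go"], "Node 2": ["go", "to nice pub"], "svo_id": [0, 0]})
b = pd.DataFrame({"Node 1": ["wyll"], "Node 2": ["meet"], "svo_id": [0]})

merged = merge_with_unique_ids([a, b])
print(merged["svo_id"].tolist())  # [0, 0, 1]
```

Without the offset, the triple from the second frame would share `svo_id` 0 with the first triple of the first frame, and rows belonging to different sentences could no longer be told apart.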


### 2. Saving the WDW Data to a CSV File

The WDW data is stored as a pandas DataFrame, so the pandas method `to_csv` can be used to save it to a CSV file.

We refer the reader to the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
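
As a small sketch of the round trip, the example below writes a toy DataFrame to CSV and reads it back. The toy data and the in-memory buffer (standing in for a file path such as the hypothetical `"wdw_data.csv"`) are assumptions for illustration. It also assumes the `Hypergraph` column holds nested Python lists: if so, they are written as text and need to be parsed again after loading.

```python
import ast
import io

import pandas as pd

# Toy frame with the same column names as a WDW DataFrame
df_toy = pd.DataFrame({
    "Node 1": ["gale"],
    "WDW": ["Who"],
    "Node 2": ["go"],
    "WDW2": ["Did"],
    "Hypergraph": [[[("gale", [])], ["go"], [("to nice pub", [])]]],
    "Semantic-Syntactic": [0],
    "svo_id": [0],
})

buffer = io.StringIO()  # stands in for a file path
df_toy.to_csv(buffer, index=False)  # index=False avoids an extra "Unnamed: 0" column

buffer.seek(0)
restored = pd.read_csv(buffer)

# Nested lists come back as strings; literal_eval rebuilds the Python objects
restored["Hypergraph"] = restored["Hypergraph"].apply(ast.literal_eval)
print(restored.loc[0, "Hypergraph"])
```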

### 3. Exporting Elements from the WDW Dataframe


#### Exporting Hypergraphs

**Function**: `export_hypergraphs(df)`

**Purpose**: Extracts syntactic hypergraphs from the DataFrame.



In [5]:
hypergraphs_list = wdw_analytics.export_hypergraphs(df)

print("Number of hypergraphs extracted:", len(hypergraphs_list))
print("Sample hypergraph:", hypergraphs_list[0] if hypergraphs_list else None)

Number of hypergraphs extracted: 4
Sample hypergraph: [[('gale', []), ('karlach', [])], ['go'], [('to nice pub', [])]]
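
The sample hypergraph printed above is a three-element list: subjects, verbs, and objects. The sketch below unpacks it with plain Python. The meaning of the second element in each `(name, [...])` tuple is not described in this guide, so the example simply ignores it.

```python
# Sample hypergraph copied from the output above
hypergraph = [[('gale', []), ('karlach', [])], ['go'], [('to nice pub', [])]]

# The three positions hold subjects, verbs, and objects respectively
subjects, verbs, objects = hypergraph

# Keep only the names; the second tuple element is ignored here
subject_names = [name for name, _ in subjects]
object_names = [name for name, _ in objects]

print(subject_names, verbs, object_names)
# ['gale', 'karlach'] ['go'] ['to nice pub']
```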


#### Exporting Subjects


**Function**: `export_subj(df)`

**Purpose**: Extracts a set of **unique subjects** from the DataFrame. Each element in the set is a tuple of the form `(subject, valence)`, where `valence` can be `'positive'`, `'negative'`, or `'neutral'`, determined by VADER sentiment.


In [6]:
subjects = wdw_analytics.export_subj(df)
print(subjects)


{('wyll', 'neutral'), ('karlach', 'neutral'), ('gale', 'neutral')}


#### Exporting Objects

**Function**: `export_obj(df)`

**Purpose**: Extracts a set of **unique objects** from the DataFrame. Each element in the set is a tuple of the form `(object, valence)`.


In [7]:
objects = wdw_analytics.export_obj(df)
print(objects)

{('bad wine', 'negative'), ('to nice pub', 'positive'), ('eggplant', 'neutral'), ('fantastic aubergine', 'positive'), ('story', 'neutral'), ('tasty beer', 'neutral')}


#### Exporting Verbs


**Function**: `export_verb(df)`

**Purpose**: Extracts a set of **unique verbs** from the DataFrame. Each element in the set is a tuple of the form `(verb, valence)`.


In [8]:
verbs = wdw_analytics.export_verb(df)
print(verbs)

{('drink', 'neutral'), ('really love', 'positive'), ('go', 'neutral'), ('love share', 'positive')}


### 4. Filtering the SVO DataFrame by WDW

**Function**: `filter_svo_dataframe_by_wdw(df, WDW, WDW2=None)`

**Purpose**: Filters the SVO DataFrame to keep rows where `WDW` and `WDW2` match the provided arguments (`"Who"`, `"Did"`, `"What"`).  
- If `WDW2` is `None`, it filters by `WDW` alone.  
- Ensures we only look at **syntactic** relationships (`Semantic-Syntactic == 0`).

In [9]:
# Filter to keep only Subject ("Who") - Verb ("Did") edges
filtered_df_who_did = wdw_analytics.filter_svo_dataframe_by_wdw(df, "Who", "Did")

# Examine the result
display(filtered_df_who_did.head(10))

Unnamed: 0,Node 1,WDW,Node 2,WDW2,Hypergraph,Semantic-Syntactic,svo_id
0,gale,Who,go,Did,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
1,karlach,Who,go,Did,"[[('gale', []), ('karlach', [])], ['go'], [('t...",0,0
4,gale,Who,drink,Did,"[[('gale', []), ('karlach', [])], ['drink'], [...",0,1
5,karlach,Who,drink,Did,"[[('gale', []), ('karlach', [])], ['drink'], [...",0,1
10,gale,Who,love share,Did,"[[('gale', []), ('karlach', [])], ['love share...",0,2
11,karlach,Who,love share,Did,"[[('gale', []), ('karlach', [])], ['love share...",0,2
14,wyll,Who,really love,Did,"[[('wyll', [])], ['really love'], [('eggplant'...",0,3
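
The same kind of filter can be approximated with a pandas boolean mask, which may help when you need a custom condition. This is a sketch on toy data, not the library's code: the helper `filter_by_roles` is hypothetical, and it assumes that when `WDW2` is omitted only the `WDW` column is checked.

```python
import pandas as pd

rows = pd.DataFrame({
    "Node 1": ["gale", "go", "gale", "eggplant"],
    "WDW": ["Who", "Did", "Who", "What"],
    "Node 2": ["go", "to nice pub", "karlach", "aubergine"],
    "WDW2": ["Did", "What", "Who", "What"],
    "Semantic-Syntactic": [0, 0, 0, 1],
})

def filter_by_roles(frame, wdw, wdw2=None):
    """Keep syntactic rows whose roles match (hypothetical stand-in for the library function)."""
    mask = (frame["Semantic-Syntactic"] == 0) & (frame["WDW"] == wdw)
    if wdw2 is not None:
        mask &= frame["WDW2"] == wdw2
    return frame[mask]

who_did = filter_by_roles(rows, "Who", "Did")
print(who_did[["Node 1", "Node 2"]])
```

Note that the filtered frame keeps the original index labels, which is why the index in the output above jumps from 1 to 4.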


-----

# WDW Analysis tools

Who-Did-What networks can be analyzed with standard network science procedures. This library implements some of them, including the degree centrality measures described in this section.

## Degree Centrality

### Weighted Degree Centrality


**Function**: `wdw_weighted_degree_centrality(df, WDW, WDW2=None)`

**Purpose**: 
1. Filters the DataFrame according to `WDW` (and optionally `WDW2`).
2. Builds a graph where edges have a weight for each syntactic relationship.
3. Calculates the normalized node strength (sum of edge weights) as the degree centrality.

In [10]:

# Degree centrality for Subjects ("Who")
subject_centrality_df = wdw_analytics.wdw_weighted_degree_centrality(df, "Who")
display(subject_centrality_df)



Unnamed: 0,node,degree_centrality
0,gale,0.3
1,karlach,0.3
2,wyll,0.05


In [11]:
# Degree centrality for Verbs ("Did")
verb_centrality_df = wdw_analytics.wdw_weighted_degree_centrality(df, "Did", None)
display(verb_centrality_df)

Unnamed: 0,node,degree_centrality
0,drink,0.153846
1,go,0.115385
2,love share,0.115385
3,really love,0.115385
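
To make "node strength" concrete, the sketch below computes it by hand on a toy edge list. The normalization used here (dividing by the total edge weight) is an assumption chosen for simplicity. The library's own normalization is not documented in this guide, so the numbers will not match the tables above.

```python
from collections import Counter

import pandas as pd

edges = pd.DataFrame({
    "Node 1": ["gale", "karlach", "gale", "wyll"],
    "Node 2": ["go", "go", "drink", "really love"],
})

# Repeated (Node 1, Node 2) pairs accumulate into the edge weight
weights = Counter(zip(edges["Node 1"], edges["Node 2"]))

# Strength of a node = sum of the weights of the edges touching it
strength = Counter()
for (source, target), weight in weights.items():
    strength[source] += weight
    strength[target] += weight

# Assumed normalization: share of the total edge weight that touches each node
total_weight = sum(weights.values())
normalized = {node: value / total_weight for node, value in strength.items()}

print(normalized["gale"], normalized["karlach"])  # 0.5 0.25
```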


### Weighted Degree Centrality Overview

**Function**: `wdw_degree_centrality_overview(df)`

**Purpose**: A convenient function that prints an **overview** of the top 20 nodes in each role (Subject, Verb, Object) by their **weighted degree centrality**.


In [12]:
wdw_analytics.wdw_degree_centrality_overview(df)

Degree centrality for Subject


Unnamed: 0,node,degree_centrality
0,gale,0.214286
1,karlach,0.214286
2,wyll,0.071429


############################################ 

Degree centrality for Verb


Unnamed: 0,node,degree_centrality
0,drink,0.153846
1,go,0.115385
2,love share,0.115385
3,really love,0.115385


############################################ 

Degree centrality for Object


Unnamed: 0,node,degree_centrality
0,to nice pub,0.083333
1,tasty beer,0.083333
2,bad wine,0.083333
3,story,0.083333
4,eggplant,0.083333
5,fantastic aubergine,0.083333


############################################ 



-----

# Putting It All Together

Below is a short example workflow illustrating how you might use these functions in tandem:


In [15]:
import whodidwhat as wdw
import whodidwhat.analytics as wdw_analytics

# 1. Extract SVOs from multiple texts
text1 = "Gale loves Karlach. They drink coffee together."
text2 = "Wyll meets Gale. They go to a tavern."
df1 = wdw.extract_svos_from_text(text1)
df2 = wdw.extract_svos_from_text(text2)

# 2. Merge DataFrames
merged_df = wdw_analytics.merge_svo_dataframes([df1, df2])

# 3. Export unique subjects, objects, and verbs
subjects = wdw_analytics.export_subj(merged_df)
objects = wdw_analytics.export_obj(merged_df)
verbs = wdw_analytics.export_verb(merged_df)

# 4. Filter the merged DataFrame to only subject-verb relationships
df_who_did = wdw_analytics.filter_svo_dataframe_by_wdw(merged_df, "Who", "Did")

# 5. Compute Weighted Degree Centrality for subjects
subject_centrality = wdw_analytics.wdw_weighted_degree_centrality(merged_df, "Who")
print("Subject Weighted Degree Centrality\n", subject_centrality)

# 6. Get an overview of Weighted Degree Centralities (Subject, Verb, Object)
wdw_analytics.wdw_degree_centrality_overview(merged_df)



Subject Weighted Degree Centrality
    node  degree_centrality
0  they              0.250
1  gale              0.125
2  wyll              0.125
Degree centrality for Subject


Unnamed: 0,node,degree_centrality
0,they,0.25
1,gale,0.125
2,wyll,0.125


############################################ 

Degree centrality for Verb


Unnamed: 0,node,degree_centrality
0,love,0.125
1,drink together,0.125
2,meet,0.125
3,go,0.125


############################################ 

Degree centrality for Object


Unnamed: 0,node,degree_centrality
0,karlach,0.125
1,coffee,0.125
2,gale,0.125
3,to tavern,0.125


############################################ 



-----

# Troubleshooting and Tips

1. **Large Texts**: Processing very large texts might be slow due to dependency parsing and coreference resolution. Consider splitting text into smaller chunks.
2. **Coreference Models**: Ensure you have the correct models downloaded if using `stanza` or `fastcoref`. Note that `stanza` requires considerably more time and computational power than the default `fastcoref`.
3. **Inspecting DataFrames**: Always check intermediate DataFrames (e.g., `df.head()`) to ensure expected behavior.
4. **Synonym vs. Syntactic Relationships**:  
   - `Semantic-Syntactic = 1` indicates a semantic (synonym) relationship (e.g., "eggplant" ~ "aubergine").  
   - `Semantic-Syntactic = 0` indicates a syntactic Who-Did-What relationship.
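
For the first tip, one simple way to split a long text is to group it into chunks of a few sentences each. The sketch below uses a basic regular expression for sentence boundaries. The helper `chunk_sentences` and its `max_sentences` parameter are hypothetical names for this example, and a proper sentence tokenizer would handle abbreviations better.

```python
import re

def chunk_sentences(text, max_sentences=2):
    """Split text into chunks of at most max_sentences sentences (naive splitter)."""
    # Split on whitespace that follows sentence-ending punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

long_text = "Gale loves Karlach. They drink coffee together. Wyll meets Gale."
chunks = chunk_sentences(long_text)
print(chunks)
# ['Gale loves Karlach. They drink coffee together.', 'Wyll meets Gale.']

# Each chunk could then be passed to wdw.extract_svos_from_text and the
# resulting DataFrames combined with wdw_analytics.merge_svo_dataframes.
```

Keep in mind that a pronoun and its antecedent may end up in different chunks, in which case coreference resolution cannot link them. Choosing chunk boundaries at paragraph breaks reduces that risk.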


**Enjoy exploring your text data with the Who-Did-What package!**