# EventKG - Extracting info for one event

This notebook automatically retrieves information for a single event, in particular its ground truth from EventKG.

Before running the notebook, make sure you have the following:
* EventKG downloaded and preprocessed, cf. `eventkg-filtering.ipynb`
* Subset of EventKG loaded in [GraphDB](https://graphdb.ontotext.com)
* GraphDB endpoint active (repository name `eventkg`)
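
The last prerequisite can be checked from Python before running any query cell. The sketch below uses only the standard library; the helper names `build_ask_url` and `endpoint_is_up` are hypothetical, and it assumes the repository answers a standard SPARQL `ASK` query over HTTP GET at the endpoint URL used later in this notebook.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_ask_url(endpoint):
    """Build a GET URL that sends the trivial query `ASK {}` to a SPARQL endpoint."""
    return f"{endpoint}?{urllib.parse.urlencode({'query': 'ASK {}'})}"


def endpoint_is_up(endpoint, timeout=5):
    """Return True if the endpoint answers the trivial query with HTTP 200."""
    request = urllib.request.Request(
        build_ask_url(endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, ...
        return False
```

With GraphDB running and the `eventkg` repository created, `endpoint_is_up("http://localhost:7200/repositories/eventkg")` is expected to return `True`; any connection problem yields `False` instead of an exception.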

In [1]:
import io
import os
import json
import yaml
import requests
import psutil
from tqdm import tqdm

import wandb

import pandas as pd
from settings import FOLDER_PATH, WANDB_USER

In [26]:
HEADERS = {
    "Accept": "text/csv"
}

ENDPOINT = "http://localhost:7200/repositories/eventkg"

# Folder where data necessary to run experiments will be saved
# This folder should contain the following sub folders: `config`, `gs_events` and `referents`
FOLDER_SAVE_DATA = os.path.join(FOLDER_PATH, "data-ind")

NB_CPUS = psutil.cpu_count(logical=False)

EVENT = "http://dbpedia.org/resource/2020_United_States_presidential_election" 

## 1. Retrieving info for the input event

* Ground truth events from EventKG 
* Referents (URI mapping)
* Start/End dates


### 1.1. Ground truth for each event

The ground truth of an event consists of the events that EventKG records as part of it, i.e. its sub-events (direct or transitive).
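
The query in the next cell also keeps only sub-events whose time span overlaps that of the input event (its last two `FILTER` clauses). As a rough illustration of that condition, here is the same test written in Python; the function name is hypothetical and the second date pair is made up for the example.

```python
from datetime import date


def intervals_overlap(start_a, end_a, start_b, end_b):
    """Return True when [start_a, end_a] and [start_b, end_b] share at least one day."""
    # Same two conditions as the FILTER clauses of the SPARQL query:
    # the sub-event ends on or after the event starts,
    # and the sub-event starts on or before the event ends.
    return end_b >= start_a and start_b <= end_a


event = (date(2020, 11, 3), date(2020, 11, 3))
same_day = (date(2020, 11, 3), date(2020, 11, 3))
earlier = (date(2016, 11, 8), date(2016, 11, 8))  # illustrative only

print(intervals_overlap(*event, *same_day))  # True
print(intervals_overlap(*event, *earlier))   # False
```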

In [27]:
QUERY_GROUND_TRUTH_TEMPLATE = """
PREFIX sem: <http://semanticweb.cs.vu.nl/2009/11/sem/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT(?subEventKG as ?linkDBpediaEn)
WHERE {
    
?event owl:sameAs <event-to-replace> .
?event sem:hasSubEvent* ?subEvent .
?subEvent owl:sameAs ?subEventKG .
    
?event sem:hasBeginTimeStamp ?startTimeEvent .
?event sem:hasEndTimeStamp ?endTimeEvent .

?subEvent sem:hasBeginTimeStamp ?startTimeSubEvent .
?subEvent sem:hasEndTimeStamp ?endTimeSubEvent .
    
FILTER( strStarts( str(?subEventKG), "http://dbpedia" ) ) .
FILTER (?endTimeSubEvent >= ?startTimeEvent) .
FILTER (?startTimeSubEvent <= ?endTimeEvent) .
}
"""

In [29]:
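# Fill in the event URI and run the query against the GraphDB endpoint
query = QUERY_GROUND_TRUTH_TEMPLATE.replace(
    "event-to-replace", EVENT
)
response = requests.get(ENDPOINT, headers=HEADERS,
                        params={"query": query})
response.raise_for_status()  # fail early if the endpoint rejects the query
df_sub_event = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
# Save the ground truth as <event name>.csv
df_sub_event.to_csv(os.path.join(FOLDER_SAVE_DATA, f"{EVENT.split('/')[-1]}.csv"))
df_sub_event.head(10)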
        

Unnamed: 0,linkDBpediaEn
0,http://dbpedia.org/resource/2020_United_States...
1,http://dbpedia.org/resource/2020_United_States...
2,http://dbpedia.org/resource/2020_United_States...
3,http://dbpedia.org/resource/2020_United_States...
4,http://dbpedia.org/resource/2020_United_States...
5,http://dbpedia.org/resource/2020_United_States...
6,http://dbpedia.org/resource/2020_United_States...
7,http://dbpedia.org/resource/2020_United_States...
8,http://dbpedia.org/resource/2020_United_States...
9,http://dbpedia.org/resource/2020_United_States...
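
The query above is built by plain string replacement, so a malformed event URI would silently produce a broken (or different) query. A possible guard, sketched under the assumption that the placeholder stays `event-to-replace`; `fill_template` is a hypothetical helper, not part of the project code.

```python
import re

# Characters that may not appear inside a SPARQL IRI reference (<...>)
_FORBIDDEN_IN_IRI = re.compile(r'[<>"{}|^`\\\s]')


def fill_template(template, event_uri, placeholder="event-to-replace"):
    """Substitute the event URI into a query template after basic validation."""
    if _FORBIDDEN_IN_IRI.search(event_uri):
        raise ValueError(f"Not a valid IRI for a SPARQL query: {event_uri!r}")
    if placeholder not in template:
        raise ValueError(f"Placeholder {placeholder!r} not found in template")
    return template.replace(placeholder, event_uri)
```

It could be used as a drop-in for the `.replace(...)` call, e.g. `fill_template(QUERY_GROUND_TRUTH_TEMPLATE, EVENT)`.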


### 1.2. URI referents for each sub event

The URI of a resource can change from one dataset version to another. This section retrieves a unique referent ID for each set of equivalent URIs.
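
`get_equivalent_url` is project code whose implementation is not shown in this notebook. Purely to illustrate the idea of mapping several URIs to one referent, the sketch below follows a redirect table until it reaches a URI with no further redirect. The function names, the `redirects` dictionary and the example URIs are all assumptions made for this example.

```python
def resolve_referent(uri, redirects):
    """Follow redirect links from `uri` until reaching a URI with no further redirect."""
    seen = {uri}
    while uri in redirects:
        uri = redirects[uri]
        if uri in seen:  # cycle guard: stop instead of looping forever
            break
        seen.add(uri)
    return uri


def build_referents(uris, redirects):
    """Map every URI to its referent."""
    return {uri: resolve_referent(uri, redirects) for uri in uris}


# Toy data, not taken from DBpedia
redirects = {
    "http://example.org/A_v1": "http://example.org/A_v2",
    "http://example.org/A_v2": "http://example.org/A",
}
print(build_referents(["http://example.org/A_v1", "http://example.org/B"], redirects))
```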


In [30]:
from src.get_equivalent_url import get_equivalent_url

In [31]:
get_equivalent_url(os.path.join(FOLDER_SAVE_DATA, f"{EVENT.split('/')[-1]}.csv"),
                   os.path.join(FOLDER_SAVE_DATA, f"{EVENT.split('/')[-1]}.json"))

100%|██████████| 54/54 [00:10<00:00,  4.99it/s]


### 1.3. Start and End dates of each event

The start date is the minimum over all start dates, and the end date is the maximum over all end dates.

The `HAVING` clause keeps a result only if its start date is not after its end date.
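
The same aggregation can be written in a few lines of Python, which may help when checking the query result by hand. This is only a sketch: `event_span` is a hypothetical helper and it assumes dates are ISO `YYYY-MM-DD` strings, as in the output shown further down.

```python
from datetime import date


def event_span(intervals):
    """Return (earliest start, latest end) as ISO strings, or None if invalid.

    `intervals` is a list of (start, end) pairs of ISO date strings.
    """
    if not intervals:
        return None
    starts = [date.fromisoformat(start) for start, _ in intervals]
    ends = [date.fromisoformat(end) for _, end in intervals]
    start, end = min(starts), max(ends)
    if start > end:  # mirrors the HAVING clause of the query
        return None
    return start.isoformat(), end.isoformat()


print(event_span([("2020-11-03", "2020-11-03")]))
# ('2020-11-03', '2020-11-03')
```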

In [32]:
QUERY_DATES_TEMPLATE = """
PREFIX sem: <http://semanticweb.cs.vu.nl/2009/11/sem/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT (min(?startTimeEvent) as ?min) (max(?endTimeEvent) as ?max)
WHERE {
    
 ?event owl:sameAs <event-to-replace> .
 ?event sem:hasSubEvent* ?subEvent .
 ?event sem:hasBeginTimeStamp ?startTimeEvent .
 ?event sem:hasEndTimeStamp ?endTimeEvent .
 ?event owl:sameAs ?eventKG .

 FILTER( strStarts( str(?eventKG), "http://dbpedia" ) ) .
}
GROUP BY ?eventKG
HAVING (max(?endTimeEvent) >= min(?startTimeEvent))
"""

In [33]:
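def get_dates(event):
    """Query the endpoint for the start and end dates of `event`."""
    query = QUERY_DATES_TEMPLATE.replace(
        "event-to-replace", event)
    response = requests.get(ENDPOINT, headers=HEADERS,
                            params={"query": query})
    response.raise_for_status()  # fail early if the endpoint rejects the query
    return pd.read_csv(io.StringIO(response.content.decode('utf-8')))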

In [34]:
result = get_dates(EVENT)
result

Unnamed: 0,min,max
0,2020-11-03,2020-11-03
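
The next cell reads the first row of this result directly, which would fail if the query returned no rows, for instance when the event URI is absent from the loaded subset. A small defensive reader, sketched with the standard `csv` module and assuming that an empty result is a CSV body with a header line and no data rows; `first_row_or_none` is a hypothetical name.

```python
import csv
import io


def first_row_or_none(csv_text):
    """Return the first data row of a CSV body as a dict, or None if there is none."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return rows[0] if rows else None


print(first_row_or_none("min,max\n2020-11-03,2020-11-03\n"))
# {'min': '2020-11-03', 'max': '2020-11-03'}
print(first_row_or_none("min,max\n"))
# None
```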


In [35]:
def store_changing_config(event, dates):
    """Return the event-specific settings that override BASE_CONFIG."""
    name = event.split("/")[-1]
    return {
        "start": event,
        "start_date": dates["min"].values[0],
        "end_date": dates["max"].values[0],
        "gold_standard": os.path.join(FOLDER_SAVE_DATA, f"{name}.csv"),
        "referents": os.path.join(FOLDER_SAVE_DATA, f"{name}.json"),
        "name_exp": name,
    }

dico_config = store_changing_config(EVENT, result)

In [36]:
dico_config

{'start': 'http://dbpedia.org/resource/2020_United_States_presidential_election',
 'start_date': '2020-11-03',
 'end_date': '2020-11-03',
 'gold_standard': '/Users/ines/Projects/graph_search_framework/data-ind/2020_United_States_presidential_election.csv',
 'referents': '/Users/ines/Projects/graph_search_framework/data-ind/2020_United_States_presidential_election.json',
 'name_exp': '2020_United_States_presidential_election'}

## 3. Prepare configuration files for sweep for each event

In [21]:
BASE_CONFIG = {
    "rdf_type": {
        "event": "http://dbpedia.org/ontology/Event"
    },
    "predicate_filter": ["http://dbpedia.org/ontology/wikiPageWikiLink",
                         "http://dbpedia.org/ontology/wikiPageRedirects",
                         "http://dbpedia.org/ontology/wikiPageDisambiguates",
                         "http://www.w3.org/2000/01/rdf-schema#seeAlso",
                         "http://xmlns.com/foaf/0.1/depiction",
                         "http://xmlns.com/foaf/0.1/isPrimaryTopicOf",
                         "http://dbpedia.org/ontology/thumbnail",
                         "http://dbpedia.org/ontology/wikiPageExternalLink",
                         "http://dbpedia.org/ontology/wikiPageID",
                         "http://dbpedia.org/ontology/wikiPageLength",
                         "http://dbpedia.org/ontology/wikiPageRevisionID",
                         "http://dbpedia.org/property/wikiPageUsesTemplate",
                         "http://www.w3.org/2002/07/owl#sameAs",
                         "http://www.w3.org/ns/prov#wasDerivedFrom",
                         "http://dbpedia.org/ontology/wikiPageWikiLinkText",
                         "http://dbpedia.org/ontology/wikiPageOutDegree",
                         "http://dbpedia.org/ontology/abstract",
                         "http://www.w3.org/2000/01/rdf-schema#comment",
                         "http://www.w3.org/2000/01/rdf-schema#label"],
    "start": "http://dbpedia.org/resource/WTA_Tier_II_tournaments",
    "start_date": "1990-01-01",
    "end_date": "2008-12-31",
    "iterations": 50,
    "type_ranking": "pred_object_freq",
    "type_interface": "hdt",
    "gold_standard": "./data/gs_events/events_WTA_Tier_II_tournaments.csv",
    "referents": "./data/referents/referents_WTA_Tier_II_tournaments.json",
    "type_metrics": ["precision", "recall", "f1"],
    "ordering": {
        "domain_range": 1
    },
    "filtering": {
        "what": 1,
        "where": 1,
        "when": 1
    },
    "name_exp": "wta_tier_ii_tournament"
}

In [22]:
name = EVENT.split("/")[-1]
BASE_CONFIG.update(dico_config)
with open(os.path.join(FOLDER_SAVE_DATA, f"config_{name}.json"), "w", encoding='utf-8') as openfile:
    json.dump(BASE_CONFIG, openfile, indent=4)
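
Note that `BASE_CONFIG.update(...)` modifies `BASE_CONFIG` in place, so rerunning the cell for another event starts from the previous event's values. If that matters, one option is to merge into a copy instead. A minimal sketch with a hypothetical `build_config` helper and toy dictionaries:

```python
import copy


def build_config(base, overrides):
    """Return a new config: a deep copy of `base` updated with `overrides`."""
    config = copy.deepcopy(base)
    config.update(overrides)
    return config


base = {"iterations": 50, "start": "a", "filtering": {"when": 1}}
config = build_config(base, {"start": "b"})
print(config["start"], base["start"])  # b a
```

The deep copy also keeps nested entries such as `filtering` independent between configs.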

## 4. Run Weights & Biases sweep for each event

In [15]:
with open(os.path.join(FOLDER_PATH, "graph_search_sweep.yaml"), encoding="utf-8") as file:
    # safe_load is enough for a plain sweep configuration
    WANDB_SWEEP_CONFIG = yaml.safe_load(file)

In [16]:
name = EVENT.split("/")[-1]
WANDB_SWEEP_CONFIG["parameters"]["json"]["value"] = \
    os.path.join(FOLDER_SAVE_DATA, f"config_{name}.json")
WANDB_SWEEP_CONFIG["name"] = EVENT.split("/")[-1]

sweep_id = wandb.sweep(WANDB_SWEEP_CONFIG, project=WANDB_SWEEP_CONFIG["project"])

Create sweep with ID: typjj4by
Sweep URL: https://wandb.ai/ines-blin/event-graph-search-framework/sweeps/typjj4by


In [17]:
agent = f"{WANDB_USER}/{WANDB_SWEEP_CONFIG['project']}/{sweep_id}"
wandb.agent(agent, count=8)

wandb: Agent Starting Run: 61ehq2rh with config:
	filtering_when: 0
	json: /Users/ines/Projects/graph_search_framework/data-ind/config_2014_United_States_House_of_Representatives_elections.json
	ordering_domain_range: 0
	type_ranking: pred_object_freq
wandb: Agent Starting Run: 2d928zci with config:
	filtering_when: 0
	json: /Users/ines/Projects/graph_search_framework/data-ind/config_2014_United_States_House_of_Representatives_elections.json
	ordering_domain_range: 0
	type_ranking: entropy_pred_object_freq
wandb: Agent Starting Run: grg3mw7r with config:
	filtering_when: 0
	json: /Users/ines/Projects/graph_search_framework/data-ind/config_2014_United_States_House_of_Representatives_elections.json
	ordering_domain_range: 1
	type_ranking: pred_object_freq


wandb: Currently logged in as: ines-blin (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.15
wandb: Run data is saved locally in /Users/ines/Projects/graph_search_framework/wandb/run-20220429_105048-grg3mw7r
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 2014_United_States_House_of_Representatives_elections_pred_object_freq_domain_range_what_where_
wandb: ⭐️ View project at https://wandb.ai/ines-blin/event-graph-search-framework
wandb: 🧹 View sweep at https://wandb.ai/ines-blin/event-graph-search-framework/sweeps/typjj4by
wandb: 🚀 View run at https://wandb.ai/ines-blin/event-graph-search-framework/runs/grg3mw7r
100% 1/1 [00:00<00:00, 29.76it/s]
  0% 0/7 [00:00<?, ?it/s]

Processing node 1/1	http://dbpedia.org/resource/2014_United_States_House_of_Representatives_elections


100% 7/7 [00:00<00:00, 35.42it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  date_df.object = date_df.object.astype(str)
100% 1/1 [00:00<00:00, 17.60it/s]
100% 6/6 [00:00<00:00, 23.16it/s]
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                       f1 ▁
wandb:                precision ▁
wandb:                   recall ▁
wandb:        subgraph_nb_event ▁
wandb: subgraph_nb_event_unique ▁
wandb: 
wandb: Run summary:
wandb:                       f1 0.0
wandb:                precision 0.0
wandb:                   recall 0.0
wandb:        subgraph_nb_event 1
wandb: subgraph_nb_event_unique 1
wandb: 
wandb: Synced 2014_United_States_Hous

wandb: Agent Starting Run: jtjlemoo with config:
	filtering_when: 0
	json: /Users/ines/Projects/graph_search_framework/data-ind/config_2014_United_States_House_of_Representatives_elections.json
	ordering_domain_range: 1
	type_ranking: entropy_pred_object_freq


wandb: Currently logged in as: ines-blin (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.15
wandb: Run data is saved locally in /Users/ines/Projects/graph_search_framework/wandb/run-20220429_105108-jtjlemoo
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 2014_United_States_House_of_Representatives_elections_entropy_pred_object_freq_domain_range_what_where_
wandb: ⭐️ View project at https://wandb.ai/ines-blin/event-graph-search-framework
wandb: 🧹 View sweep at https://wandb.ai/ines-blin/event-graph-search-framework/sweeps/typjj4by
wandb: 🚀 View run at https://wandb.ai/ines-blin/event-graph-search-framework/runs/jtjlemoo
100% 1/1 [00:00<00:00, 40.23it/s]
  0% 0/7 [00:00<?, ?it/s]

Processing node 1/1	http://dbpedia.org/resource/2014_United_States_House_of_Representatives_elections


100% 7/7 [00:00<00:00, 41.85it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  date_df.object = date_df.object.astype(str)
100% 1/1 [00:00<00:00, 44.81it/s]
100% 6/6 [00:00<00:00, 27.45it/s]
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                       f1 ▁
wandb:                precision ▁
wandb:                   recall ▁
wandb:        subgraph_nb_event ▁
wandb: subgraph_nb_event_unique ▁
wandb: 
wandb: Run summary:
wandb:                       f1 0.0
wandb:                precision 0.0
wandb:                   recall 0.0
wandb:        subgraph_nb_event 1
wandb: subgraph_nb_event_unique 1
wandb: 
wandb: Synced 2014_United_States_Hous

wandb: Agent Starting Run: 9jnw857u with config:
	filtering_when: 1
	json: /Users/ines/Projects/graph_search_framework/data-ind/config_2014_United_States_House_of_Representatives_elections.json
	ordering_domain_range: 0
	type_ranking: pred_object_freq


wandb: Currently logged in as: ines-blin (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.15
wandb: Run data is saved locally in /Users/ines/Projects/graph_search_framework/wandb/run-20220429_105124-9jnw857u
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 2014_United_States_House_of_Representatives_elections_pred_object_freq__what_where_when
wandb: ⭐️ View project at https://wandb.ai/ines-blin/event-graph-search-framework
wandb: 🧹 View sweep at https://wandb.ai/ines-blin/event-graph-search-framework/sweeps/typjj4by
wandb: 🚀 View run at https://wandb.ai/ines-blin/event-graph-search-framework/runs/9jnw857u
100% 1/1 [00:00<00:00, 39.64it/s]
  0% 0/7 [00:00<?, ?it/s]

Processing node 1/1	http://dbpedia.org/resource/2014_United_States_House_of_Representatives_elections


100% 7/7 [00:00<00:00, 41.76it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  date_df.object = date_df.object.astype(str)
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                       f1 ▁
wandb:                precision ▁
wandb:                   recall ▁
wandb:        subgraph_nb_event ▁
wandb: subgraph_nb_event_unique ▁
wandb: 
wandb: Run summary:
wandb:                       f1 0.0
wandb:                precision 0.0
wandb:                   recall 0.0
wandb:        subgraph_nb_event 1
wandb: subgraph_nb_event_unique 1
wandb: 
wandb: Synced 2014_United_States_House_of_Representatives_elections_pred_object_freq__what_where_when: ht

wandb: Agent Starting Run: a6yuaew2 with config:
	filtering_when: 1
	json: /Users/ines/Projects/graph_search_framework/data-ind/config_2014_United_States_House_of_Representatives_elections.json
	ordering_domain_range: 0
	type_ranking: entropy_pred_object_freq


wandb: Currently logged in as: ines-blin (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.15
wandb: Run data is saved locally in /Users/ines/Projects/graph_search_framework/wandb/run-20220429_105140-a6yuaew2
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 2014_United_States_House_of_Representatives_elections_entropy_pred_object_freq__what_where_when
wandb: ⭐️ View project at https://wandb.ai/ines-blin/event-graph-search-framework
wandb: 🧹 View sweep at https://wandb.ai/ines-blin/event-graph-search-framework/sweeps/typjj4by
wandb: 🚀 View run at https://wandb.ai/ines-blin/event-graph-search-framework/runs/a6yuaew2
100% 1/1 [00:00<00:00, 40.01it/s]
  0% 0/7 [00:00<?, ?it/s]

Processing node 1/1	http://dbpedia.org/resource/2014_United_States_House_of_Representatives_elections


100% 7/7 [00:00<00:00, 41.46it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  date_df.object = date_df.object.astype(str)
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                       f1 ▁
wandb:                precision ▁
wandb:                   recall ▁
wandb:        subgraph_nb_event ▁
wandb: subgraph_nb_event_unique ▁
wandb: 
wandb: Run summary:
wandb:                       f1 0.0
wandb:                precision 0.0
wandb:                   recall 0.0
wandb:        subgraph_nb_event 1
wandb: subgraph_nb_event_unique 1
wandb: 
wandb: Synced 2014_United_States_House_of_Representatives_elections_entropy_pred_object_freq__what_where_

wandb: Agent Starting Run: 4wgsekd6 with config:
	filtering_when: 1
	json: /Users/ines/Projects/graph_search_framework/data-ind/config_2014_United_States_House_of_Representatives_elections.json
	ordering_domain_range: 1
	type_ranking: pred_object_freq


wandb: Currently logged in as: ines-blin (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.15
wandb: Run data is saved locally in /Users/ines/Projects/graph_search_framework/wandb/run-20220429_105156-4wgsekd6
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 2014_United_States_House_of_Representatives_elections_pred_object_freq_domain_range_what_where_when
wandb: ⭐️ View project at https://wandb.ai/ines-blin/event-graph-search-framework
wandb: 🧹 View sweep at https://wandb.ai/ines-blin/event-graph-search-framework/sweeps/typjj4by
wandb: 🚀 View run at https://wandb.ai/ines-blin/event-graph-search-framework/runs/4wgsekd6
100% 1/1 [00:00<00:00, 39.02it/s]
  0% 0/7 [00:00<?, ?it/s]

Processing node 1/1	http://dbpedia.org/resource/2014_United_States_House_of_Representatives_elections


100% 7/7 [00:00<00:00, 41.28it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  date_df.object = date_df.object.astype(str)
100% 1/1 [00:00<00:00, 45.07it/s]
0it [00:00, ?it/s]
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                       f1 ▁
wandb:                precision ▁
wandb:                   recall ▁
wandb:        subgraph_nb_event ▁
wandb: subgraph_nb_event_unique ▁
wandb: 
wandb: Run summary:
wandb:                       f1 0.0
wandb:                precision 0.0
wandb:                   recall 0.0
wandb:        subgraph_nb_event 1
wandb: subgraph_nb_event_unique 1
wandb: 
wandb: Synced 2014_United_States_House_of_Representa

wandb: Agent Starting Run: l1bv3nyv with config:
	filtering_when: 1
	json: /Users/ines/Projects/graph_search_framework/data-ind/config_2014_United_States_House_of_Representatives_elections.json
	ordering_domain_range: 1
	type_ranking: entropy_pred_object_freq


wandb: Currently logged in as: ines-blin (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.12.15
wandb: Run data is saved locally in /Users/ines/Projects/graph_search_framework/wandb/run-20220429_105212-l1bv3nyv
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run 2014_United_States_House_of_Representatives_elections_entropy_pred_object_freq_domain_range_what_where_when
wandb: ⭐️ View project at https://wandb.ai/ines-blin/event-graph-search-framework
wandb: 🧹 View sweep at https://wandb.ai/ines-blin/event-graph-search-framework/sweeps/typjj4by
wandb: 🚀 View run at https://wandb.ai/ines-blin/event-graph-search-framework/runs/l1bv3nyv
100% 1/1 [00:00<00:00, 40.15it/s]
  0% 0/7 [00:00<?, ?it/s]

Processing node 1/1	http://dbpedia.org/resource/2014_United_States_House_of_Representatives_elections


100% 7/7 [00:00<00:00, 41.85it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  date_df.object = date_df.object.astype(str)
100% 1/1 [00:00<00:00, 43.88it/s]
0it [00:00, ?it/s]
wandb: Waiting for W&B process to finish... (success).
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:                       f1 ▁
wandb:                precision ▁
wandb:                   recall ▁
wandb:        subgraph_nb_event ▁
wandb: subgraph_nb_event_unique ▁
wandb: 
wandb: Run summary:
wandb:                       f1 0.0
wandb:                precision 0.0
wandb:                   recall 0.0
wandb:        subgraph_nb_event 1
wandb: subgraph_nb_event_unique 1
wandb: 
wandb: Synced 2014_United_States_House_of_Representa
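
The eight runs in the log above cover every combination of the three swept parameters, which is consistent with `count=8`. The sketch below enumerates those combinations; the parameter values are read off the run log, and it assumes the sweep is a grid search, since the content of `graph_search_sweep.yaml` is not shown here.

```python
from itertools import product

# Values observed in the run log above (assumed to be the full search space)
PARAMETERS = {
    "filtering_when": [0, 1],
    "ordering_domain_range": [0, 1],
    "type_ranking": ["pred_object_freq", "entropy_pred_object_freq"],
}


def grid(parameters):
    """Return one dict per combination of parameter values."""
    keys = list(parameters)
    return [dict(zip(keys, values)) for values in product(*parameters.values())]


runs = grid(PARAMETERS)
print(len(runs))  # 8
```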