In [1]:
import followthemoney as ftm
from followthemoney.cli.util import read_entities
import followthemoney_enrich as ftm_enrich
from followthemoney import model
from followthemoney.dedupe import Match, Linker
import json
import pandas as pd
import gdown
import alephclient

# Overview

Here I will go through the steps of merging two collections according to the following architecture.

<img src="./img/architecture.JPG" width="400" />

For this evaluation, this is done with the [everypolitician](http://everypolitician.org) dataset, which has been mapped to the FtM ontology by an existing [scraper](https://github.com/pudo/opensanctions/blob/main/opensanctions/crawlers/everypolitician.py).


## Prerequisite: Data to reconcile
This step uploads the data to a private Aleph instance so that the user can reconcile it interactively. In a production setting, this is automated via OpenSanctions.

The following only works if an Aleph instance is running and the API key belongs to an admin, as otherwise collections are not mutable when uploaded through the CLI. This constraint applies only to scraped data. Reconciling entities that were mapped from documents or created manually works for any user.

In [2]:
unlabled = "./data/meineabgeordneten_agg.json"

In [3]:
host="http://localhost:3000"
api_key="lUuPrgpmvqM24Mktqbtefw8cmPGY9U4ky0eotN44Kbc" 
collection_id = "ce6318ac176844bf90410c83d7e1cd87"
target_collection_id = "4b713bcbe372492894529764d7f8096e"

In [7]:
%%bash -s "$host" "$api_key" "$collection_id" "$unlabled"
alephclient --host $1 --api-key $2 write-entities -i $4 -f $3

INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 1000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 2000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 3000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 4000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 5000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 6000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 7000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 8000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 9000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 10000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities: 11000...
INFO:alephclient.cli:[ce6318ac176844bf90410c83d7e1cd87] Bulk load entities

# Get Data
In a normal setting, one would load the two collections by using the CLI.

```
alephclient --host https://aleph.occrp.org --api-key <api-key> stream-entities -f <collection-id> -o <outfile>
```


 


In [8]:
%%bash -s "$host" "$api_key" "$collection_id" 
alephclient --host $1 --api-key $2 stream-entities -f $3 | ftm aggregate -o data/output/mein_abg_aleph.json 

For reproducibility, this has been done in advance.

In [20]:
path = "./data/output/"
ep_path = path + "everypolitician.json"
# Person entities reconciled against Wikidata.
ma_path = path + "meineabgeordneten_wikidata.json"

In [86]:
# Everypolitician file snapshot, which is too big for git.
gdown.download("https://drive.google.com/u/0/uc?id=1YNQKfm6qLKb5M6cfNrk9m8UYwi4iOxdF", ep_path)

Downloading...
From: https://drive.google.com/u/0/uc?id=1YNQKfm6qLKb5M6cfNrk9m8UYwi4iOxdF
To: /home/peter/dev/evaluation_bachelor_thesis/data/output/everypolitician.json
127MB [00:09, 13.1MB/s]


'./data/output/everypolitician.json'

In [13]:
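def read_ftm_json(path, filter="Person"):
    """Read line-delimited FtM JSON and index entities of one schema by Wikidata ID."""
    entity_dict = {}
    with open(path) as f:
        for line in f:
            entity = model.get_proxy(json.loads(line))
            wd = entity.first("wikidataId", True)
            # Skip entities without a Wikidata ID so they are not all keyed under None.
            if entity.schema.name == filter and wd is not None:
                entity_dict[wd] = entity
    return entity_dict

path = "./data/output/"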

In [41]:
mein_abg = read_ftm_json(ma_path)
every_polit = read_ftm_json(ep_path)

# Matching
We will check for equal Wikidata IDs and create a Match object for each pair. A match object holds two entity IDs and a decision about their sameness. This match object could be uploaded to Aleph via the API. Here, however, we will apply it directly.

In [12]:
enricher = ftm_enrich.enricher.Enricher()

In [13]:
matches = []
for idx, polit in every_polit.items():
    abg = mein_abg.get(idx)
    if abg:
        # Link the meineabgeordneten.at entity to its everypolitician counterpart.
        match = Match(model, {})
        match.entity = abg
        match.canonical = polit
        match.decision = match.SAME
        matches.append(match)


In total, there are 77 person entities that match on the Wikidata ID.

In [14]:
len(matches)

77

# Merging
The merging logic already exists in the FtM [repository](https://github.com/alephdata/followthemoney/blob/6cb55e319f69443dff17bf1ee5dd1a37a31b5c4a/followthemoney/cli/dedupe.py) and works as follows:

1. Create a linker object, which takes match objects and checks whether each one carries a sameness decision.
2. If so, add the pair to a hashmap (Python dict) in the linker object.
3. Iterate through both collections of to-be-merged entities and pass each entity to the linker object (which knows the links). If the entity's ID is stored in the hashmap, the entity adopts the linked ID. If not, it keeps its ID. This also applies to "edges", such as memberships.
4. Write the result to a file.
5. As the file now contains several entities with the same ID, aggregate it, which merges items with the same ID. Merging takes the union of the properties and of their values: values of a shared property are combined into one multi-valued list, and properties present on only one side are added as they are.
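The steps above can be sketched without the FtM library. The following is a minimal illustration on plain dictionaries, not the actual `Linker` or `ftm aggregate` implementation. The helper names `apply_links` and `aggregate` and the sample records are hypothetical, and rewriting references inside edge entities is left out.

```python
from collections import defaultdict


def apply_links(entity, links):
    """Return a copy of the entity that carries its canonical ID, if one is known."""
    rewritten = dict(entity)
    rewritten["id"] = links.get(entity["id"], entity["id"])
    return rewritten


def aggregate(entities):
    """Merge entities sharing an ID by taking the union of their property values."""
    merged = {}
    for entity in entities:
        target = merged.setdefault(
            entity["id"],
            {"id": entity["id"], "schema": entity["schema"], "properties": defaultdict(set)},
        )
        for prop, values in entity["properties"].items():
            target["properties"][prop].update(values)
    # Convert the value sets to sorted lists for stable output.
    return [
        {**e, "properties": {p: sorted(v) for p, v in e["properties"].items()}}
        for e in merged.values()
    ]


# Steps 1-2: a "same" decision maps the entity ID "b" to the canonical ID "a".
links = {"b": "a"}
collection_1 = [{"id": "a", "schema": "Person",
                 "properties": {"name": ["hans kelsen"], "title": ["Dr"]}}]
collection_2 = [{"id": "b", "schema": "Person",
                 "properties": {"name": ["hans kelsen"], "title": ["Prof."]}}]

# Step 3: rewrite the IDs. Step 5: aggregate entities that now share an ID.
linked = [apply_links(e, links) for e in collection_1 + collection_2]
result = aggregate(linked)
print(result)
```

Both records end up under the ID `a`: the identical `name` values collapse into one, while the two different `title` values are kept side by side.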

## Example on how it works

In [15]:
linker_exmpl = Linker(model)

a = ftm.model.make_entity("Person")
a.add("name", "hans kelsen")
a.add("title", "Dr")
a.add("birthDate", "1908-07-06")
a.make_id("hans kelsen")

b = ftm.model.make_entity("Person")
b.add("name", "Prof. Dr. hans kelsen")
b.add("birthDate", "1908")
b.add("title", "Prof.")
b.make_id("Prof. Dr. Hans Kelsen")

match = enricher.make_match(a, b)
match.decision = match.SAME
linker_exmpl.add(match)

merged_ent  = a.merge(b)
{
    "a": a.id,
    "b": b.id,
    "merged": merged_ent.id,
    "result": merged_ent.to_dict()["properties"]}

{'a': '891bd4dbcf5506d489f8d6e757ace9411eccee55',
 'b': 'a21072d75aebf5f72865a70ca9e10beffb9ddb27',
 'merged': '891bd4dbcf5506d489f8d6e757ace9411eccee55',
 'result': {'name': ['Prof. Dr. hans kelsen', 'hans kelsen'],
  'title': ['Dr', 'Prof.'],
  'birthDate': ['1908-07-06', '1908']}}

## On data

In [16]:
# Logic adapted from https://github.com/alephdata/followthemoney/blob/6cb55e319f69443dff17bf1ee5dd1a37a31b5c4a/followthemoney/cli/dedupe.py

linker = Linker(model)
for match in matches: 
    linker.add(match)


In [20]:
def mergeEntities(inpath, outfile, linker):
    """Apply the linker to every entity in `inpath` and append the result to `outfile`."""
    with open(inpath) as infile:
        for line in infile:
            entity = model.get_proxy(json.loads(line))
            # Matched entities adopt the linked ID; all others are unchanged.
            applied = linker.apply(entity)
            json_ent = json.dumps(applied.to_dict(), sort_keys=True)
            outfile.write(json_ent + "\n")

nb_merge_output = path + "nb-merge-output"
merged_path = nb_merge_output + "/merged.json"
merged_aggr_path = nb_merge_output + "/merged_aggr.json"
# Close the output file before aggregating so that all lines are flushed to disk.
with open(merged_path, "w") as outfile:
    mergeEntities(ep_path, outfile, linker)
    mergeEntities(ma_path, outfile, linker)

### Aggregate CLI command

In [21]:
%%bash -s "$merged_path" "$merged_aggr_path"
cat $1 | ftm aggregate -o $2


# CLI
I also implemented the merge-file generator as a command-line tool in a more performant way. The `merger` package generates a matching file that links FtM IDs sharing a common identifier. It then calls `ftm link`, which applies the links, and `ftm aggregate`, which actually merges the entities (see [fragmentation](https://followthemoney.readthedocs.io/en/latest/fragments.html)).

The CLI can be used either by setting the property argument to `wikidataId` or by leaving it empty. In the latter case it goes through all properties of an entity, checks whether the property type is an identifier, and emits a match object for any kind of identifier match. Therefore, this works for any identifier in the FtM ontology.

```
python wd_merger.py id-match -i <dir> | python wd_merger.py merger -i <dir> | ftm aggregate -o <outfile> 
```
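To illustrate the idea of identifier-based matching, here is a minimal sketch on plain dictionaries. It is not the code of the `merger` package: the function name `match_by_identifier`, the hard-coded set `IDENTIFIER_PROPS` (standing in for a lookup of identifier-typed properties in the FtM model) and the sample data are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Assumption: stand-in for the properties whose FtM type is an identifier.
IDENTIFIER_PROPS = {"wikidataId", "registrationNumber"}


def match_by_identifier(entities, prop=None):
    """Yield pairs of entity IDs that share a value for an identifier property.

    If `prop` is given, only that property is compared. Otherwise every
    property listed in IDENTIFIER_PROPS is considered.
    """
    props = {prop} if prop else IDENTIFIER_PROPS
    buckets = defaultdict(list)
    for entity in entities:
        for name in props:
            for value in entity["properties"].get(name, []):
                ids = buckets[(name, value)]
                if entity["id"] not in ids:
                    ids.append(entity["id"])
    seen = set()
    for ids in buckets.values():
        for pair in combinations(ids, 2):
            if pair not in seen:
                seen.add(pair)
                yield pair


entities = [
    {"id": "ep-1", "properties": {"wikidataId": ["Q15787318"]}},
    {"id": "ep-2", "properties": {"wikidataId": ["Q15787318"]}},
    {"id": "ma-1", "properties": {"wikidataId": ["Q15787318"]}},
    {"id": "ma-2", "properties": {"name": ["No identifier"]}},
]
pairs = list(match_by_identifier(entities, prop="wikidataId"))
print(pairs)
```

In this sketch, an identifier that occurs twice within one collection (here `ep-1` and `ep-2`) produces additional pairs, so duplicated identifiers increase the number of matches.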


In [None]:
%%bash
pip install ./merger

Because everypolitician contains duplicate entities with the same Wikidata ID, the CLI finds more matches (685):

In [21]:
cli_merge_output = path + "cli-merge-output"
cli_merged = cli_merge_output + "/merged.json"

In [27]:
%%bash -s "$path" "$cli_merged"
merger pmatch -i $1 | merger pmerge  -i $1 | ftm aggregate -o $2 



This generates a `matches.json`, which can be streamed to an Aleph instance or ingested into a Neo4j database.

Aleph:
```
cat matches.json | alephclient --host https://aleph.occrp.org --api-key <api-key> write-entities -f <collection_id> 
```

Neo4j:
```
cat matches.json | ftm export-cypher | cypher-shell -u user -p password
```

# Enrich
Basic enrichment can also be performed.
Two commands are called here:

1. The `extract` command pulls out entities that have Wikidata IDs from a file or path of FtM entity JSONs and writes them to stdout (separated by newlines).
2. The `enrich` command actually performs the enrichment by calling the Wikidata API. Currently, no linking entities are captured, and the requests are performed synchronously.

Here, we only do this for the dataset from meineabgeordneten.at as it is smaller. Note that this process takes some time. For reproducibility, this has also been done in advance.
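The `extract` step can be pictured as a simple filter over line-delimited JSON. The sketch below is not the `merger` implementation: the function name `extract_with_property` and the in-memory sample input are hypothetical, and the call to the Wikidata API is deliberately left out.

```python
import io
import json


def extract_with_property(lines, prop="wikidataId"):
    """Yield the JSON lines of entities that carry at least one value for `prop`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entity = json.loads(line)
        if entity.get("properties", {}).get(prop):
            yield json.dumps(entity, sort_keys=True)


# Hypothetical input: one entity with a Wikidata ID and one without.
sample = io.StringIO(
    '{"id": "1", "schema": "Person", "properties": {"wikidataId": ["Q15787318"]}}\n'
    '{"id": "2", "schema": "Person", "properties": {"name": ["No Wikidata ID"]}}\n'
)
extracted = list(extract_with_property(sample))
print("\n".join(extracted))
```

Only the first entity is passed on, so an enrichment step downstream would receive only records it can look up by Wikidata ID.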

In [32]:
enrich_output = path + "cli_enrich_output" 
enrich_output_new = enrich_output + "/enriched_MA.json"

In [None]:
%%bash -s "$ma_path" "$enrich_output_new"
merger extract -i $1 -p wikidataId | merger enrich -o $2

All 745 person entities with a Wikidata ID could be enriched.

In [35]:
enriched = read_ftm_json(enrich_output + "/enriched_MA_queried.json")
len(enriched)

745

In [43]:
# Item from Everypolitician.
every_polit["Q15787318"].to_dict()

{'origin': 'memorious',
 'updated_at': '2019-05-08T01:55:52',
 'id': 'cfdfd70fdf76870ec3f0d998ce7e2ff83f516674',
 'schema': 'Person',
 'properties': {'alias': ['Beate Meinl-Reisinger'],
  'birthDate': ['1978-04-25'],
  'firstName': ['Beate'],
  'gender': ['female'],
  'lastName': ['Meinl-Reisinger'],
  'name': ['Mag. Beate Meinl-Reisinger, MES'],
  'nationality': ['at'],
  'title': ['Mag.', 'MES'],
  'topics': ['role.pep'],
  'website': ['https://facebook.com/BeateMeinl', 'https://twitter.com/bmeinl'],
  'wikidataId': ['Q15787318'],
  'wikipediaUrl': ['https://fr.wikipedia.org/wiki/Beate_Meinl-Reisinger',
   'https://de.wikipedia.org/wiki/Beate_Meinl-Reisinger',
   'https://hu.wikipedia.org/wiki/Beate_Meinl-Reisinger',
   'https://pl.wikipedia.org/wiki/Beate_Meinl-Reisinger',
   'https://en.wikipedia.org/wiki/Beate_Meinl-Reisinger']}}

In [45]:
# Item from Wikidata (enrichment).
enriched["Q15787318"].to_dict()

{'id': '356d3d7712c6ed51f31db7768a562c94d8f23c5d',
 'schema': 'Person',
 'properties': {'alias': ['Майнль-Райзингер, Беата'],
  'birthDate': ['1978-04-25T00:00:00'],
  'birthPlace': ['Vienna'],
  'description': ['Austrian jurist and politician'],
  'email': ['mailto:beate.meinl@neos.eu'],
  'firstName': ['Beate'],
  'gender': ['female'],
  'lastName': ['Meinl-Reisinger'],
  'name': ['Beate Meinl-Reisinger'],
  'nationality': ['at'],
  'title': ['Magister Juris', 'Master of European Studies'],
  'website': ['https://beatemeinl.com/'],
  'wikidataId': ['Q15787318']}}

In [46]:
# Item from meineabgeordneten.at.
mein_abg["Q15787318"].to_dict()

{'origin': 'memorious',
 'id': 'acee34362b60a3d4d8978c0bb0350dd1df447e84',
 'schema': 'Person',
 'properties': {'birthDate': ['1978-04-25'],
  'birthPlace': ['Wien'],
  'country': ['at'],
  'email': ['beate.meinl@neos.eu'],
  'firstName': ['Beate'],
  'lastName': ['Meinl-Reisinger'],
  'name': ['Beate Meinl-Reisinger'],
  'sourceUrl': ['https://www.meineabgeordneten.at/Abgeordnete/Beate.Meinl-Reisinger'],
  'summary': ['Abgeordnete zum Nationalrat'],
  'title': ['Mag.a'],
  'website': ['https://www.linkedin.com/in/beate-meinl-reisinger-0a827a84',
   'https://www.facebook.com/beate.meinlreisinger',
   'https://www.facebook.com/BeateMeinl',
   'https://www.instagram.com/beate_meinl_reisinger/',
   'https://twitter.com/BMeinl'],
  'wikidataId': ['Q15787318']}}

# Push back to Aleph
If Aleph is running:

In [None]:
%%bash -s "$host" "$api_key" "$target_collection_id"  "$cli_merged"
cat $4 | alephclient --host $1 --api-key $2 write-entities -f $3