In [6]:
import followthemoney as ftm
import followthemoney_enrich as ftm_enrich
import followthemoney.model as model
from followthemoney.dedupe import Match, Linker
import json
import pandas as pd
import gdown

# Overview

Once the PoC has assigned an identifier to each entity, we want to integrate two different collections into one large, interlinked graph. This corresponds to step 2 in the architectural overview.

<img src="./img/architecture.JPG" width="400" />

For this evaluation, this is done with the [everypolitician](http://everypolitician.org) dataset, which has been mapped to the FtM ontology by an existing [scraper](https://github.com/pudo/opensanctions/blob/main/opensanctions/crawlers/everypolitician.py).


# Get Data
In a normal setting, one would load a collection using the `alephclient` CLI.

```
alephclient --host https://aleph.occrp.org --api-key <api-key> stream-entities -f <collection-id> -o <outfile>
```

For reproducibility, this has been done in advance.
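
The cells below read the export line by line, so each line is expected to hold one JSON-encoded entity. As a minimal sketch of that layout, the following snippet counts entities per schema using only the standard library. The two sample lines and the helper name `count_schemata` are made up for illustration; they are not part of the real export.

```python
import io
import json
from collections import Counter

# Two hand-written sample lines standing in for a real export (hypothetical data).
sample_export = io.StringIO(
    '{"id": "p1", "schema": "Person", "properties": {"name": ["Jane Doe"]}}\n'
    '{"id": "o1", "schema": "Organization", "properties": {"name": ["Example Party"]}}\n'
)

def count_schemata(lines):
    """Count entities per schema in a line-delimited FtM JSON stream."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        counts[json.loads(line)["schema"]] += 1
    return counts

schema_counts = count_schemata(sample_export)
print(schema_counts)
```

For a real file, the same function could be called with an open file handle instead of the in-memory sample.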
 


In [10]:
path = "./data/output/"
ep_path = path + "everypolitician.json"
ma_path = path + "meineabgeordneten_wikidata.json"

In [11]:

gdown.download("https://drive.google.com/u/0/uc?id=1YNQKfm6qLKb5M6cfNrk9m8UYwi4iOxdF", ep_path)

Downloading...
From: https://drive.google.com/u/0/uc?id=1YNQKfm6qLKb5M6cfNrk9m8UYwi4iOxdF
To: /home/peter/dev/evaluation_bachelor_thesis/data/output/everypolitician.json
127MB [00:09, 13.1MB/s]


'./data/output/everypolitician.json'

In [32]:
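def read_ftm_json(path, schema_name="Person"):
    """Read line-delimited FtM JSON and index entities of one schema by Wikidata ID."""
    entity_dict = {}
    with open(path) as f:
        for line in f:
            entity = model.get_proxy(json.loads(line))
            if entity.schema.name != schema_name:
                continue
            # quiet=True: do not raise if the schema lacks the property
            wd = entity.first("wikidataId", True)
            if wd:
                # A later entity with the same Wikidata ID overwrites an earlier one.
                entity_dict[wd] = entity
    return entity_dict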

In [33]:
mein_abg = read_ftm_json(ma_path)
every_polit = read_ftm_json(ep_path)

# Matching
We check for equal Wikidata IDs and create a `Match` object for each pair. A match object holds two entity IDs and a decision about whether they refer to the same thing. These match objects could be uploaded to Aleph via the API; here, however, we apply them locally right away.

In [34]:
enricher = ftm_enrich.enricher.Enricher()

In [52]:
matches = []
for wikidata_id, polit in every_polit.items():
    abg = mein_abg.get(wikidata_id)
    if abg:
        # Both collections contain this Wikidata ID: record the pair as the same entity.
        match = Match(model, {})
        match.entity = abg
        match.canonical = polit
        match.decision = match.SAME
        matches.append(match)


In total, 77 person entities match on their Wikidata ID.

In [53]:
len(matches)

77

# Merging
The merging logic already exists in the ftm [repository](https://github.com/alephdata/followthemoney/blob/6cb55e319f69443dff17bf1ee5dd1a37a31b5c4a/followthemoney/cli/dedupe.py) and works as follows:

1. Create a linker object, which takes match objects and checks whether each one carries a sameness decision.
2. If so, add the pair to a hash map (a Python dict) in the linker object.
3. Iterate through both collections of to-be-merged entities and pass each entity to the linker object (which knows the links). If the entity's ID is stored in the hash map, the entity adopts the linked ID; otherwise it keeps its own. This also applies to "edges", such as memberships.
4. Write the entities to a file.
5. Because the file now contains several entities with the same ID, aggregate it, which merges items with the same ID. Merging simply takes the union of the properties and of their values: identical values are collapsed, and differing values are added to the property's multi-valued list.
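
The steps above can be sketched in plain Python, without followthemoney. This is a simplified illustration, not the library's implementation: the helper names `build_linkages`, `apply_links` and `aggregate` and the toy entities are made up, and each match is assumed to be an `(entity_id, canonical_id)` pair that has already been judged to be the same.

```python
from collections import defaultdict

def build_linkages(matches):
    """Steps 1-2: map each matched entity ID to its canonical ID."""
    return {entity_id: canonical_id for entity_id, canonical_id in matches}

def apply_links(entity, linkages):
    """Step 3: rewrite the entity's own ID and any IDs it references.

    Simplification: every property value is looked up in the hash map,
    whereas a real implementation would only rewrite entity references.
    """
    return {
        "id": linkages.get(entity["id"], entity["id"]),
        "properties": {
            prop: [linkages.get(value, value) for value in values]
            for prop, values in entity["properties"].items()
        },
    }

def aggregate(entities):
    """Step 5: union the property values of entities that share an ID."""
    merged = defaultdict(lambda: defaultdict(set))
    for entity in entities:
        for prop, values in entity["properties"].items():
            merged[entity["id"]][prop].update(values)
    return {
        entity_id: {prop: sorted(values) for prop, values in props.items()}
        for entity_id, props in merged.items()
    }

# Hypothetical toy data: "ma-1" and "ep-1" describe the same person.
toy_matches = [("ma-1", "ep-1")]
toy_entities = [
    {"id": "ep-1", "properties": {"name": ["Jane Doe"], "birthDate": ["1980"]}},
    {"id": "ma-1", "properties": {"name": ["Jane Doe"], "title": ["Dr"]}},
    # An "edge": a membership that points at the person by ID.
    {"id": "m-1", "properties": {"member": ["ma-1"], "organization": ["org-1"]}},
]

toy_linkages = build_linkages(toy_matches)
# Step 4 would write these rewritten entities to a file; here they stay in memory.
linked = [apply_links(entity, toy_linkages) for entity in toy_entities]
result = aggregate(linked)
print(result)
```

In this sketch the two person records collapse into a single entry under `ep-1`, and the membership now references `ep-1` instead of `ma-1`.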

## Example of how it works

In [54]:
linker_exmpl = Linker(model)

a = ftm.model.make_entity("Person")
a.add("name", "hans kelsen")
a.add("title", "Dr")
a.add("birthDate", "1908-07-06")
a.make_id("hans kelsen")

b = ftm.model.make_entity("Person")
b.add("name", "Prof. Dr. hans kelsen")
b.add("birthDate", "1908")
b.add("title", "Prof.")
b.make_id("Prof. Dr. Hans Kelsen")

match = enricher.make_match(a, b)
match.decision = match.SAME
linker_exmpl.add(match)

# merge() folds b's properties into a; the merged entity keeps a's ID.
merged_ent = a.merge(b)
{
    "a": a.id,
    "b": b.id,
    "merged": merged_ent.id,
    "result": merged_ent.to_dict()["properties"],
}

{'a': '891bd4dbcf5506d489f8d6e757ace9411eccee55',
 'b': 'a21072d75aebf5f72865a70ca9e10beffb9ddb27',
 'merged': '891bd4dbcf5506d489f8d6e757ace9411eccee55',
 'result': {'name': ['Prof. Dr. hans kelsen', 'hans kelsen'],
  'title': ['Dr', 'Prof.'],
  'birthDate': ['1908', '1908-07-06']}}

## On data

In [55]:
# logic adapted from https://github.com/alephdata/followthemoney/blob/6cb55e319f69443dff17bf1ee5dd1a37a31b5c4a/followthemoney/cli/dedupe.py

linker = Linker(model)
for match in matches: 
    linker.add(match)


In [59]:
def merge_entities(inpath, outfile, linker):
    """Apply the linker to every entity in `inpath` and append the result to `outfile`."""
    with open(inpath) as infile:
        for line in infile:
            entity = model.get_proxy(json.loads(line))
            # Rewrites the entity's ID (and references) if the linker knows a match.
            applied = linker.apply(entity)
            outfile.write(json.dumps(applied.to_dict(), sort_keys=True) + "\n")

merged_path = path + "nb-merge-output/merged.json"
merged_aggr_path = path + "nb-merge-output/merged_aggr.json"

# Close the output file before `ftm aggregate` reads it, so that all lines are flushed.
with open(merged_path, "w") as outfile:
    merge_entities(ep_path, outfile, linker)
    merge_entities(ma_path, outfile, linker)

### Aggregate CLI command

In [61]:
%%bash -s "$merged_path" "$merged_aggr_path"
cat "$1" | ftm aggregate -o "$2"


# CLI
I also implemented the merge-file generator as a command-line tool, which performs better than the notebook version. The `wd_merger.py` script generates a match file that links FtM IDs sharing a Wikidata ID. It then calls `ftm link`, which applies the matches, and `ftm aggregate`, which actually merges the entities (see [fragmentation](https://followthemoney.readthedocs.io/en/latest/fragments.html)).

```
python wd_merger.py match-wd -f <infile2> -s <infile1> -o <outfile>
```

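Because everypolitician contains several entities with the same Wikidata ID, the CLI finds more matches (685) than the notebook approach above (77), which keeps only one entity per Wikidata ID.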
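
The difference between the two counts can be illustrated with a small sketch. The IDs below are made up, and the pairwise counting is only an assumption about how `wd_merger.py` arrives at the larger number; the dict-based variant mirrors the notebook approach.

```python
from collections import defaultdict

# Hypothetical (entity_id, wikidata_id) pairs; "Q1" occurs three times in the first collection.
first = [("ep-1", "Q1"), ("ep-2", "Q1"), ("ep-3", "Q1"), ("ep-4", "Q2")]
second = [("ma-1", "Q1"), ("ma-2", "Q2")]

# Notebook approach: a dict keyed by Wikidata ID keeps one entity per ID.
first_by_qid = {qid: eid for eid, qid in first}
second_by_qid = {qid: eid for eid, qid in second}
dict_matches = [
    (second_by_qid[qid], first_by_qid[qid])
    for qid in first_by_qid
    if qid in second_by_qid
]

# Pairwise approach: every entity sharing a Wikidata ID yields its own match.
second_index = defaultdict(list)
for eid, qid in second:
    second_index[qid].append(eid)
pair_matches = [
    (other, eid)
    for eid, qid in first
    for other in second_index.get(qid, [])
]

print(len(dict_matches), len(pair_matches))  # prints: 2 4
```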

In [58]:
%run -i wd_merger.py match-wd -f data/output/everypolitician.json -s data/output/meineabgeordneten_wikidata.json -o data/output/cli-merge-output/merged.json

Found 685 matches


This generates a `matches.json` file, which can be streamed to an Aleph instance or ingested into a Neo4j database.

Aleph:
```
cat matches.json | alephclient --host https://aleph.occrp.org --api-key <api-key> write-entities -f <collection_id> 
```

Neo4j:
```
cat matches.json | ftm export-cypher | cypher-shell -u user -p password
```