# Creating d3fc data 

Here we create a TSV of UMAP-projected embeddings, alongside different groupings of these embeddings, for a 2D projection visualisation. 

See [./create_colour_mappings_for_vis.ipynb](./create_colour_mappings_for_vis.ipynb) first.

In [64]:
import glob
from pathlib import Path
from typing import List
import json

import pandas as pd
import numpy as np
import requests
from tqdm.auto import tqdm

pd.set_option('display.max_colwidth', None)

In [63]:
def paginate_list(items, page_size):
    """Split `items` into consecutive sublists of at most `page_size` elements."""
    return [items[i : i + page_size] for i in range(0, len(items), page_size)]

def get_labels(uris: List[str]) -> dict:
    """Get labels for URIs using the Heritage Connector API."""
    
    hc_api_labels_endpoint = "http://localhost:8010/labels"
    headers = {'Content-Type': 'application/json'}
    body = json.dumps({"uris": uris})
    res = requests.post(hc_api_labels_endpoint, headers=headers, data=body)
    res.raise_for_status()  # fail loudly rather than parsing an error response
    
    return res.json()

get_labels(["http://www.wikidata.org/entity/Q3568968"])

{'http://www.wikidata.org/entity/Q3568968': 'William Stanley'}
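
As a quick illustration of how the pagination helper splits its input before each API call, here is a minimal sketch on a toy list. The list contents and page size are made up for the example; the last page is simply shorter when the length is not a multiple of `page_size`.

```python
def paginate_list(items, page_size):
    # Slice the list into consecutive chunks of at most `page_size` elements.
    return [items[i : i + page_size] for i in range(0, len(items), page_size)]

toy_items = list(range(7))  # illustrative data, not real URIs
pages = paginate_list(toy_items, 3)

print(pages)  # [[0, 1, 2], [3, 4, 5], [6]]
```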

## 1. Import data 

We load the entity-to-index mapping created by DGL-KE, and the projected embeddings created by running the DGL-KE embeddings through UMAP.

In [7]:
ENT_MAPPING_PATH = "../data/processed/final_model_dglke/entities.tsv"
PROJECTED_EMBEDDINGS_PATH = "../data/processed/final_model_dglke/umap/best_projection.npy"

ent_idx_mapping = pd.read_csv(
    ENT_MAPPING_PATH,
    sep="\t",
    index_col=0,
    header=None,
    names=["value"],
).fillna("")

projs = np.load(PROJECTED_EMBEDDINGS_PATH).astype('float32')

ent_idx_mapping.shape, projs.shape

((645565, 1), (645565, 2))

We also load the mappings from entities to groups (which will be displayed in different colours in the visualisation), created by the notebook [./create_colour_mappings_for_vis.ipynb](./create_colour_mappings_for_vis.ipynb).

In [23]:
MAPPINGS_FOLDER = "./embedding_colour_mappings/"

mappings = {}

for filename in glob.glob(MAPPINGS_FOLDER + "*.tsv"):
    cat_name = Path(filename).stem
    mappings[cat_name] = pd.read_csv(filename, sep="\t", index_col=0, names=["value", "group"])
    
    print(f"Loaded {filename} to mappings['{cat_name}']")

Loaded ./embedding_colour_mappings/mapping_type.tsv to mappings['mapping_type']
Loaded ./embedding_colour_mappings/mapping_collection_category.tsv to mappings['mapping_collection_category']
Loaded ./embedding_colour_mappings/mapping_database.tsv to mappings['mapping_database']


## 2. Transform data

We want to make a DataFrame we can export as a TSV, with columns:

``` markdown
- id
- label
- collection_category
- type
- x
- y
- index
```

In [33]:
# create `id, index`
transformed_data = ent_idx_mapping.copy().rename(columns={"value": "id"}).reset_index()

transformed_data.head()

Unnamed: 0,index,id
0,0,https://collection.sciencemuseumgroup.org.uk/p...
1,1,http://www.wikidata.org/entity/Q3568968
2,2,https://collection.sciencemuseumgroup.org.uk/o...
3,3,plastic
4,4,https://collection.sciencemuseumgroup.org.uk/p...


In [43]:
# create x, y
projs_df = pd.DataFrame(projs, columns=["x", "y"])
transformed_data = pd.concat([transformed_data, projs_df], axis=1)

transformed_data.head()

Unnamed: 0,index,id,x,y
0,0,https://collection.sciencemuseumgroup.org.uk/p...,1.152161,-8.529404
1,1,http://www.wikidata.org/entity/Q3568968,1.233595,-8.767006
2,2,https://collection.sciencemuseumgroup.org.uk/o...,11.537273,9.740898
3,3,plastic,-18.416214,-13.079803
4,4,https://collection.sciencemuseumgroup.org.uk/p...,0.777103,-6.762215


In [45]:
# create mappings cols
MAPPINGS_TO_ADD = ["mapping_collection_category", "mapping_type"]

for mapping_name, mapping_df in mappings.items():
    if mapping_name in MAPPINGS_TO_ADD:
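        new_col_name = mapping_name[len("mapping_"):] # remove prefix `mapping_`
        # assignment aligns on the DataFrame index, so rows absent from the mapping become NaN
        transformed_data[new_col_name] = mapping_df['group']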
        
transformed_data.head()


Unnamed: 0,index,id,x,y,type,collection_category
0,0,https://collection.sciencemuseumgroup.org.uk/p...,1.152161,-8.529404,Person,Person
1,1,http://www.wikidata.org/entity/Q3568968,1.233595,-8.767006,Wikidata,Wikidata
2,2,https://collection.sciencemuseumgroup.org.uk/o...,11.537273,9.740898,Object,Category - Therapeutics
3,3,plastic,-18.416214,-13.079803,,
4,4,https://collection.sciencemuseumgroup.org.uk/p...,0.777103,-6.762215,Organisation,Organisation
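
The NaN values in the row for `plastic` follow from how pandas assigns a Series to a DataFrame column: values are matched on index labels, not on position. A minimal sketch with made-up entities and a made-up mapping that covers only some of the index values:

```python
import pandas as pd

# Toy stand-ins: four entities with the default index 0..3 (illustrative only).
entities = pd.DataFrame({"id": ["e0", "e1", "e2", "e3"]})

# A mapping that only covers entity indices 0 and 2.
mapping = pd.DataFrame(
    {"value": ["e0", "e2"], "group": ["Person", "Object"]},
    index=[0, 2],
)

# Column assignment aligns on index labels; uncovered rows are left as NaN.
entities["type"] = mapping["group"]

print(entities)
```

Rows 1 and 3 end up with a missing `type` because the toy mapping has no entry for those index values.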


In [81]:
# create labels col

def has_probably_got_label(value):
    """Return True if `value` starts with a URL prefix that is likely to have a label."""
    prefixes = (
        "https://collection.sciencemuseumgroup", 
        "http://www.wikidata.org/entity", 
        "https://blog.sciencemuseum.org.uk/",
        "http://journal.sciencemuseum.ac.uk/",
    )
    
    # str.startswith accepts a tuple and returns True if any prefix matches
    return value.startswith(prefixes)

# rows that belong to at least one group
has_group = transformed_data['type'].notna() | transformed_data['collection_category'].notna()

ids_for_label_lookup = transformed_data.loc[
    transformed_data['id'].apply(has_probably_got_label) & has_group,
    "id"
].tolist()

id_label_mapping = {}

for page in tqdm(paginate_list(ids_for_label_lookup, 5000)):
    id_label_mapping.update(get_labels(page))
    
transformed_data['label'] = transformed_data['id'].map(id_label_mapping)

transformed_data.head()

Unnamed: 0,index,id,x,y,type,collection_category,label
0,0,https://collection.sciencemuseumgroup.org.uk/people/cp28058,1.152161,-8.529404,Person,Person,William Ford Stanley
1,1,http://www.wikidata.org/entity/Q3568968,1.233595,-8.767006,Wikidata,Wikidata,William Stanley
2,2,https://collection.sciencemuseumgroup.org.uk/objects/co138741,11.537273,9.740898,Object,Category - Therapeutics,"Hypodermic needle, Luer No.26 G, in sealed packet,"
3,3,plastic,-18.416214,-13.079803,,,
4,4,https://collection.sciencemuseumgroup.org.uk/people/cp28358,0.777103,-6.762215,Organisation,Organisation,The Cunard Line
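
The batched lookup above can be sketched without a running API. In this minimal example, `fake_get_labels` is a hypothetical stand-in for `get_labels`, and it assumes the endpoint simply omits URIs it has no label for (the real API's behaviour for unknown URIs is not shown in this notebook). The URIs are made up.

```python
from typing import Dict, List

def paginate_list(items, page_size):
    # Slice the list into consecutive chunks of at most `page_size` elements.
    return [items[i : i + page_size] for i in range(0, len(items), page_size)]

def fake_get_labels(uris: List[str]) -> Dict[str, str]:
    """Hypothetical stand-in for the API call: only 'known' URIs get a label."""
    known = {"uri:1": "one", "uri:3": "three"}
    return {u: known[u] for u in uris if u in known}

uris = ["uri:1", "uri:2", "uri:3", "uri:4", "uri:5"]

id_label_mapping = {}
calls = 0
for page in paginate_list(uris, 2):
    # Merge each page's results into one dictionary.
    id_label_mapping.update(fake_get_labels(page))
    calls += 1

# Looking up every URI afterwards leaves the unlabelled ones empty (None here).
labels = [id_label_mapping.get(u) for u in uris]
print(calls, labels)
```

In the notebook, `Series.map` with the merged dictionary plays the role of the final lookup, leaving NaN where no label was returned.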


## 3. Export data

We keep only rows that have a value in both the `type` and `collection_category` columns; rows with no grouping will never show on the plot. The JavaScript powering the visualisation will still need to check for NaN values in the remaining columns.

In [82]:
EXPORT_PATH = "../data/processed/final_model_dglke/umap/visualisation_data.tsv"

export_data = transformed_data[~transformed_data['collection_category'].isna() & ~transformed_data['type'].isna()]
len(transformed_data), len(export_data)

(645565, 433571)
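
The filter in this cell combines the two `isna()` masks with `&`, so a row survives only if both columns are populated. A minimal sketch on a made-up frame, contrasting this with an `|` filter that would keep rows with at least one grouping:

```python
import pandas as pd

# Illustrative data: one row per combination of present/missing groupings.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "type": ["Person", None, "Object", None],
    "collection_category": ["Person", "Category - Therapeutics", None, None],
})

has_both = toy["type"].notna() & toy["collection_category"].notna()
has_either = toy["type"].notna() | toy["collection_category"].notna()

print(toy.loc[has_both, "id"].tolist())    # ['a']
print(toy.loc[has_either, "id"].tolist())  # ['a', 'b', 'c']
```

Which variant is appropriate depends on whether every entity with one grouping also has the other; the two give the same result only in that case.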

In [83]:
export_data.to_csv(
    "../data/processed/final_model_dglke/umap/visualisation_data.tsv", 
    sep="\t", 
    index=False,
)
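
To check how such an export reads back, here is a minimal round-trip sketch using an in-memory buffer instead of a file. The rows are made up; the point is that tab-separated output keeps commas inside labels intact and that missing labels come back as NaN.

```python
import io

import pandas as pd

# Illustrative rows only; the ids and labels are not from the real dataset.
toy_export = pd.DataFrame({
    "index": [0, 1],
    "id": ["http://example.org/a", "http://example.org/b"],
    "x": [1.5, -2.25],
    "y": [0.5, 3.0],
    "label": ["Example, with a comma", None],
})

# Write to an in-memory TSV with the same options as the export cell.
buffer = io.StringIO()
toy_export.to_csv(buffer, sep="\t", index=False)

# Read it back the way a consumer of the TSV might.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), sep="\t")

print(round_trip.columns.tolist())
print(round_trip["label"].isna().tolist())
```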