# Creating d3fc data 

Here we create a TSV of UMAP-projected embeddings, alongside different groupings of these embeddings, for a 2D projection visualisation. 

See [./create_colour_mappings_for_vis.ipynb](./create_colour_mappings_for_vis.ipynb) first.

In [5]:
import glob
from pathlib import Path
from typing import List
import json
import csv

import pandas as pd
import numpy as np
import requests
from tqdm.auto import tqdm

pd.set_option('display.max_colwidth', None)

In [7]:
def paginate_list(l, page_size):
    return [l[i : i + page_size] for i in range(0, len(l), page_size)]

def get_labels(uris: List[str]) -> dict:
    """Get labels for URIs using Heritage Connector API"""
    
    hc_api_labels_endpoint = "http://localhost:8010/labels"
    headers = {'Content-Type': 'application/json'}
    body = json.dumps({"uris": uris})
    res = requests.post(hc_api_labels_endpoint, headers=headers, data=body)
    
    return res.json()

get_labels(["http://www.wikidata.org/entity/Q3568968"])

{'http://www.wikidata.org/entity/Q3568968': 'William Stanley'}

## 1. Import data 

The ent-to-idx mapping created by DGL-KE, and the projected embeddings created by running the DGL-KE embeddings through UMAP.

In [6]:
ENT_MAPPING_PATH = "../data/processed/final_model_dglke_vanda/entities.tsv"
PROJECTED_EMBEDDINGS_PATH = "../data/processed/final_model_dglke_vanda/umap/best_projection.npy"

ent_idx_mapping = pd.read_csv(
    ENT_MAPPING_PATH,
    sep="\t",
    index_col=0,
    header=None,
    names=["value"],
    quoting=csv.QUOTE_NONE, 
    error_bad_lines=False,
).fillna("")

projs = np.load(PROJECTED_EMBEDDINGS_PATH).astype('float32')

ent_idx_mapping.shape, projs.shape



  exec(code_obj, self.user_global_ns, self.user_ns)


((1208256, 1), (1208256, 2))

The various mappings from entities to groups (which will be displayed in different colours in the visualisation) created by the notebook [./create_colour_mappings_for_vis.ipynb](./create_colour_mappings_for_vis.ipynb).

In [8]:
MAPPINGS_FOLDER = "../data/processed/embedding_colour_mappings_vanda/"

mappings = {}

for filename in glob.glob(MAPPINGS_FOLDER + "*.tsv"):
    cat_name = Path(filename).stem
    mappings[cat_name] = pd.read_csv(filename, sep="\t", index_col=0, names=["value", "group"])
    
    print(f"Loaded {filename} to mappings['{cat_name}']. Shape {mappings[cat_name].shape}")

Loaded ../data/processed/embedding_colour_mappings_vanda/mapping_type.tsv to mappings['mapping_type']. Shape (1208256, 2)
Loaded ../data/processed/embedding_colour_mappings_vanda/mapping_collection_category.tsv to mappings['mapping_collection_category']. Shape (1208256, 2)
Loaded ../data/processed/embedding_colour_mappings_vanda/mapping_database.tsv to mappings['mapping_database']. Shape (1208256, 2)


## 2. Transform data

We want to make a DataFrame we can export as a TSV, with columns:

``` markdown
- id
- label
- collection_category
- type
- x
- y
- index
```

In [9]:
# create `id, index`
transformed_data = ent_idx_mapping.copy().rename(columns={"value": "id"}).reset_index()

transformed_data.head()

Unnamed: 0,index,id
0,0,http://collections.vam.ac.uk/item/O1149857
1,1,http://www.wikidata.org/entity/Q7338619
2,2,http://collections.vam.ac.uk/item/O1175446
3,3,https://api.vam.ac.uk/v2/objects/search?id_material=AAT14233
4,4,http://collections.vam.ac.uk/item/O1163824


In [10]:
# create x, y
projs_df = pd.DataFrame(projs, columns=["x", "y"])
transformed_data = pd.concat([transformed_data, projs_df], axis=1)

transformed_data.head()

Unnamed: 0,index,id,x,y
0,0,http://collections.vam.ac.uk/item/O1149857,10.939911,20.82769
1,1,http://www.wikidata.org/entity/Q7338619,-18.717411,1.43478
2,2,http://collections.vam.ac.uk/item/O1175446,-19.311466,-17.896444
3,3,https://api.vam.ac.uk/v2/objects/search?id_material=AAT14233,15.814499,18.287727
4,4,http://collections.vam.ac.uk/item/O1163824,-18.210323,-16.473248


In [11]:
# create mappings cols
MAPPINGS_TO_ADD = ["mapping_collection_category", "mapping_type"]

for mapping_name, mapping_df in mappings.items():
    if mapping_name in MAPPINGS_TO_ADD:
        new_col_name = mapping_name[8:] # remove prefix `mapping_`
        transformed_data[new_col_name] = mapping_df['group']
        
transformed_data.head()


Unnamed: 0,index,id,x,y,type,collection_category
0,0,http://collections.vam.ac.uk/item/O1149857,10.939911,20.82769,Object,Category - THES48602 - Theatre and Performance Collection
1,1,http://www.wikidata.org/entity/Q7338619,-18.717411,1.43478,Wikidata,Wikidata
2,2,http://collections.vam.ac.uk/item/O1175446,-19.311466,-17.896444,Object,Category - THES48602 - Theatre and Performance Collection
3,3,https://api.vam.ac.uk/v2/objects/search?id_material=AAT14233,15.814499,18.287727,,
4,4,http://collections.vam.ac.uk/item/O1163824,-18.210323,-16.473248,Object,Category - THES48602 - Theatre and Performance Collection


In [12]:
# create labels col

def has_probably_got_label(value):
    prefixes = [
        "https://collection.sciencemuseumgroup", 
        "http://www.wikidata.org/entity", 
        "https://blog.sciencemuseum.org.uk/",
        "http://journal.sciencemuseum.ac.uk/"
        "http://collections.vam.ac.uk/",
        "https://api.vam.ac.uk/v2/objects/search?id_person",
        "https://api.vam.ac.uk/v2/objects/search?id_organisation"
    ]
    
    for p in prefixes:
        if value.startswith(p): return True
        
    return False

ids_for_label_lookup = transformed_data.loc[
    transformed_data['id'].apply(has_probably_got_label) & (~transformed_data['type'].isna() | ~transformed_data['collection_category'].isna()),
    "id"
].tolist()

id_label_mapping = {}

for page in tqdm(paginate_list(ids_for_label_lookup, 5000)):
    id_label_mapping.update(get_labels(page))
    
transformed_data['label'] = transformed_data['id'].map(id_label_mapping)

transformed_data.head()

  0%|          | 0/111 [00:00<?, ?it/s]

Unnamed: 0,index,id,x,y,type,collection_category,label
0,0,http://collections.vam.ac.uk/item/O1149857,10.939911,20.82769,Object,Category - THES48602 - Theatre and Performance Collection,
1,1,http://www.wikidata.org/entity/Q7338619,-18.717411,1.43478,Wikidata,Wikidata,Riverside Studios
2,2,http://collections.vam.ac.uk/item/O1175446,-19.311466,-17.896444,Object,Category - THES48602 - Theatre and Performance Collection,
3,3,https://api.vam.ac.uk/v2/objects/search?id_material=AAT14233,15.814499,18.287727,,,
4,4,http://collections.vam.ac.uk/item/O1163824,-18.210323,-16.473248,Object,Category - THES48602 - Theatre and Performance Collection,


## 3. Export data

We remove all rows which have a NaN value for both the `type` and `collection_category` columns as these will never show on the plot. The javascript powering the visualisation will still need to check for NaN values.

In [13]:
export_data = transformed_data[~transformed_data['collection_category'].isna() & ~transformed_data['type'].isna()]
len(transformed_data), len(export_data)

(1208256, 876866)

In [14]:
export_data.to_csv(
    "../data/processed/final_model_dglke_vanda/umap/visualisation_data.tsv", 
    sep="\t", 
    index=False,
)