# Creating d3fc data 

Here we create a TSV of UMAP-projected embeddings, alongside different groupings of these embeddings, for a 2D projection visualisation. 

Run [./create_colour_mappings_for_vis.ipynb](./create_colour_mappings_for_vis.ipynb) first, as it creates the colour mappings loaded below.

In [1]:
import glob
from pathlib import Path
from typing import List
import json
import csv

import pandas as pd
import numpy as np
import requests
from tqdm.auto import tqdm

pd.set_option('display.max_colwidth', None)

In [3]:
def paginate_list(l, page_size):
    """Split list `l` into consecutive pages of at most `page_size` items."""
    return [l[i : i + page_size] for i in range(0, len(l), page_size)]

def get_labels(uris: List[str]) -> dict:
    """Get labels for URIs using the Heritage Connector API. Returns a dict of {uri: label}."""
    
    hc_api_labels_endpoint = "http://localhost:8010/labels"
    headers = {'Content-Type': 'application/json'}
    body = json.dumps({"uris": uris})
    res = requests.post(hc_api_labels_endpoint, headers=headers, data=body)
    res.raise_for_status()  # fail loudly rather than parsing an error response as labels
    
    return res.json()

get_labels(["http://www.wikidata.org/entity/Q3568968"])

{'http://www.wikidata.org/entity/Q3568968': 'William Stanley'}

## 1. Import data 

We load the entity-to-index mapping created by DGL-KE, and the 2D projections created by running the DGL-KE embeddings through UMAP.

In [4]:
ENT_MAPPING_PATH = "../data/processed/final_model_dglke/entities.tsv"
PROJECTED_EMBEDDINGS_PATH = "../data/processed/final_model_dglke/umap/best_projection_n_neighbours_10.npy"

# ENT_MAPPING_PATH = "../data/processed/final_model_dglke_vanda/entities.tsv"
# PROJECTED_EMBEDDINGS_PATH = "../data/processed/final_model_dglke_vanda/umap/best_projection.npy"

ent_idx_mapping = pd.read_csv(
    ENT_MAPPING_PATH,
    sep="\t",
    index_col=0,
    header=None,
    names=["value"],
    quoting=csv.QUOTE_NONE, 
    on_bad_lines="skip",  # replaces error_bad_lines=False, which was removed in pandas 2.0
).fillna("")

projs = np.load(PROJECTED_EMBEDDINGS_PATH).astype('float32')

ent_idx_mapping.shape, projs.shape



((645565, 1), (645565, 2))
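
The two shapes above need to agree, because the projections are later joined to the entity mapping purely by row position. A minimal sketch of a check for this, using toy data; the function name `is_aligned` and the example URIs are hypothetical and not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the entity mapping and the UMAP projections (made-up data).
toy_mapping = pd.DataFrame(
    {"value": ["http://example.org/a", "plastic", "http://example.org/b"]}
)
toy_projs = np.array([[0.1, 0.2], [1.5, -0.3], [-2.0, 4.0]], dtype="float32")


def is_aligned(mapping: pd.DataFrame, projections: np.ndarray) -> bool:
    """Return True if the mapping and projections can be joined row by row."""
    # The projections must be an (n, 2) array of x, y coordinates.
    if projections.ndim != 2 or projections.shape[1] != 2:
        return False
    # One projected point per entity.
    if len(mapping) != projections.shape[0]:
        return False
    # The mapping index must run 0..n-1 so that positions match indices.
    return list(mapping.index) == list(range(len(mapping)))


print(is_aligned(toy_mapping, toy_projs))
```

Running a check like this before concatenating guards against a mapping file and a projection file that come from different model runs.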

Next we load the various mappings from entities to groups (which will be displayed in different colours in the visualisation), created by the notebook [./create_colour_mappings_for_vis.ipynb](./create_colour_mappings_for_vis.ipynb).

In [13]:
MAPPINGS_FOLDER = "../data/processed/embedding_colour_mappings/"
# MAPPINGS_FOLDER = "../data/processed/embedding_colour_mappings_vanda/"

mappings = {}

for filename in glob.glob(MAPPINGS_FOLDER + "*.tsv"):
    cat_name = Path(filename).stem
    mappings[cat_name] = pd.read_csv(filename, sep="\t", index_col=0, names=["value", "group"])
    
    print(f"Loaded {filename} to mappings['{cat_name}']. Shape {mappings[cat_name].shape}")

Loaded ../data/processed/embedding_colour_mappings/mapping_type.tsv to mappings['mapping_type']. Shape (645565, 2)
Loaded ../data/processed/embedding_colour_mappings/mapping_hdbscan_clusters_min_size_500.tsv to mappings['mapping_hdbscan_clusters_min_size_500']. Shape (645565, 2)
Loaded ../data/processed/embedding_colour_mappings/mapping_collection_category.tsv to mappings['mapping_collection_category']. Shape (645565, 2)
Loaded ../data/processed/embedding_colour_mappings/mapping_hdbscan_clusters_min_size_500_min_samples_125.tsv to mappings['mapping_hdbscan_clusters_min_size_500_min_samples_125']. Shape (645565, 2)
Loaded ../data/processed/embedding_colour_mappings/mapping_database.tsv to mappings['mapping_database']. Shape (645565, 2)
Loaded ../data/processed/embedding_colour_mappings/mapping_hdbscan_clusters_min_size_750_min_samples_187.tsv to mappings['mapping_hdbscan_clusters_min_size_750_min_samples_187']. Shape (645565, 2)
Loaded ../data/processed/embedding_colour_mappings/mapping

## 2. Transform data

We want to build a DataFrame that we can export as a TSV, with the following columns, plus one column per HDBSCAN cluster mapping:

``` markdown
- id
- label
- collection_category
- type
- x
- y
- index
```
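
The column list above can double as a lightweight schema check while the DataFrame is being assembled. The sketch below is an illustration only: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the helper works on any iterable of column names (for example `transformed_data.columns`).

```python
# Target columns, copied from the list above.
EXPECTED_COLUMNS = ["id", "label", "collection_category", "type", "x", "y", "index"]


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are not yet present, in order."""
    present = set(columns)
    return [name for name in expected if name not in present]


# After the `id, index` and `x, y` steps, three columns are still to be added.
print(missing_columns(["index", "id", "x", "y"]))
```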

In [6]:
# create `id, index`
transformed_data = ent_idx_mapping.copy().rename(columns={"value": "id"}).reset_index()

transformed_data.head()

Unnamed: 0,index,id
0,0,https://collection.sciencemuseumgroup.org.uk/people/cp28058
1,1,http://www.wikidata.org/entity/Q3568968
2,2,https://collection.sciencemuseumgroup.org.uk/objects/co138741
3,3,plastic
4,4,https://collection.sciencemuseumgroup.org.uk/people/cp28358


In [7]:
# create x, y
projs_df = pd.DataFrame(projs, columns=["x", "y"])
transformed_data = pd.concat([transformed_data, projs_df], axis=1)

transformed_data.head()

Unnamed: 0,index,id,x,y
0,0,https://collection.sciencemuseumgroup.org.uk/people/cp28058,4.982193,4.177696
1,1,http://www.wikidata.org/entity/Q3568968,5.086479,4.084801
2,2,https://collection.sciencemuseumgroup.org.uk/objects/co138741,-9.554541,-5.450275
3,3,plastic,-3.474861,-15.172668
4,4,https://collection.sciencemuseumgroup.org.uk/people/cp28358,2.950962,4.017635


In [16]:
# create mappings cols
MAPPINGS_TO_ADD = [
    "mapping_collection_category", 
    "mapping_type", 
    "mapping_hdbscan_clusters_min_size_500", 
#     "mapping_hdbscan_clusters_min_size_200", 
    "mapping_hdbscan_clusters_min_size_750_min_samples_187", 
    "mapping_hdbscan_clusters_min_size_750_min_samples_375",
    "mapping_hdbscan_clusters_min_size_500_min_samples_125",
]

for mapping_name, mapping_df in mappings.items():
    if mapping_name in MAPPINGS_TO_ADD:
        new_col_name = mapping_name[8:] # remove prefix `mapping_`
        if "clusters" in mapping_name:
            mapping_df["group"] = mapping_df["group"].astype(str)
        transformed_data[new_col_name] = mapping_df['group']
        
transformed_data.head()


Unnamed: 0,index,id,x,y,type,hdbscan_clusters_min_size_500,collection_category,hdbscan_clusters_min_size_750_min_samples_187,hdbscan_clusters_min_size_200,hdbscan_clusters_min_size_750_min_samples_375,label,hdbscan_clusters_min_size_500_min_samples_125
0,0,https://collection.sciencemuseumgroup.org.uk/people/cp28058,4.982193,4.177696,Person,182,Person,151,467,137,William Ford Stanley,264
1,1,http://www.wikidata.org/entity/Q3568968,5.086479,4.084801,Wikidata,182,Wikidata,151,467,137,William Stanley,264
2,2,https://collection.sciencemuseumgroup.org.uk/objects/co138741,-9.554541,-5.450275,Object,148,Category - Therapeutics,117,379,107,"Hypodermic needle, Luer No.26 G, in sealed packet,",219
3,3,plastic,-3.474861,-15.172668,,53,,74,-1,38,,-1
4,4,https://collection.sciencemuseumgroup.org.uk/people/cp28358,2.950962,4.017635,Organisation,-1,Organisation,145,-1,-1,The Cunard Line,259
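
The assignment `transformed_data[new_col_name] = mapping_df['group']` relies on pandas aligning the two objects on their index rather than on row order. A small sketch of that behaviour with made-up data:

```python
import pandas as pd

# A frame with the default index 0, 1, 2.
frame = pd.DataFrame({"id": ["a", "b", "c"]})

# A group Series that is out of order and has no entry for index 1.
groups = pd.Series(["Person", "Object"], index=[2, 0])

# Assignment matches on index labels, not on position.
frame["type"] = groups

print(frame)
```

Index 0 receives `"Object"`, index 2 receives `"Person"`, and index 1 is left as NaN because the Series has no value for it. This is why each mapping file must be indexed by the same entity index as the entity mapping.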


In [9]:
# create labels col

def has_probably_got_label(value):
    prefixes = [
        "https://collection.sciencemuseumgroup", 
        "http://www.wikidata.org/entity", 
        "https://blog.sciencemuseum.org.uk/",
        "http://journal.sciencemuseum.ac.uk/",
        "http://collections.vam.ac.uk/",
        "https://api.vam.ac.uk/v2/objects/search?id_person",
        "https://api.vam.ac.uk/v2/objects/search?id_organisation",
    ]
    
    # str.startswith accepts a tuple of prefixes and returns True if any of them matches
    return value.startswith(tuple(prefixes))

ids_for_label_lookup = transformed_data.loc[
    transformed_data['id'].apply(has_probably_got_label) & (~transformed_data['type'].isna() | ~transformed_data['collection_category'].isna()),
    "id"
].tolist()

id_label_mapping = {}

for page in tqdm(paginate_list(ids_for_label_lookup, 5000)):
    id_label_mapping.update(get_labels(page))
    
transformed_data['label'] = transformed_data['id'].map(id_label_mapping)

transformed_data.head()

Unnamed: 0,index,id,x,y,type,hdbscan_clusters_min_size_500,collection_category,hdbscan_clusters_min_size_750_min_samples_187,hdbscan_clusters_min_size_200,hdbscan_clusters_min_size_750_min_samples_375,label
0,0,https://collection.sciencemuseumgroup.org.uk/people/cp28058,4.982193,4.177696,Person,182,Person,151,467,137,William Ford Stanley
1,1,http://www.wikidata.org/entity/Q3568968,5.086479,4.084801,Wikidata,182,Wikidata,151,467,137,William Stanley
2,2,https://collection.sciencemuseumgroup.org.uk/objects/co138741,-9.554541,-5.450275,Object,148,Category - Therapeutics,117,379,107,"Hypodermic needle, Luer No.26 G, in sealed packet,"
3,3,plastic,-3.474861,-15.172668,,53,,74,-1,38,
4,4,https://collection.sciencemuseumgroup.org.uk/people/cp28358,2.950962,4.017635,Organisation,-1,Organisation,145,-1,-1,The Cunard Line
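
The label lookup above has two parts: filter the IDs by URL prefix, then query the API one page at a time. The sketch below shows the same pattern without a running API. `fake_fetch` is a hypothetical stand-in for `get_labels`, and the shortened prefix tuple and the names `has_known_prefix` and `lookup_in_batches` are assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Shortened, illustrative prefix list (the notebook uses a longer one).
LABEL_PREFIXES = (
    "https://collection.sciencemuseumgroup",
    "http://www.wikidata.org/entity",
)


def has_known_prefix(value: str) -> bool:
    """True if the value starts with any of the known URL prefixes."""
    return value.startswith(LABEL_PREFIXES)


def lookup_in_batches(
    ids: List[str],
    fetch: Callable[[List[str]], Dict[str, str]],
    batch_size: int,
) -> Dict[str, str]:
    """Call `fetch` on consecutive batches of IDs and merge the results."""
    labels: Dict[str, str] = {}
    for start in range(0, len(ids), batch_size):
        labels.update(fetch(ids[start : start + batch_size]))
    return labels


def fake_fetch(batch: List[str]) -> Dict[str, str]:
    """Hypothetical stand-in for the labels API: invents a label per URI."""
    return {uri: f"label for {uri}" for uri in batch}


ids = [
    "https://collection.sciencemuseumgroup.org.uk/people/cp28058",
    "plastic",
    "http://www.wikidata.org/entity/Q3568968",
]
ids_to_look_up = [i for i in ids if has_known_prefix(i)]
labels = lookup_in_batches(ids_to_look_up, fake_fetch, batch_size=1)
print(labels)
```

Literal values such as `plastic` are skipped before any request is made, which keeps the number of pages sent to the API down.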


In [12]:
curr_mapping = transformed_data.copy()

## 3. Export data

Rows with a NaN value in the `type` or `collection_category` column are removed: the filter below keeps only rows where both values are present. Rows missing both would never show on the plot. The JavaScript powering the visualisation will still need to check for NaN values.

In [17]:
export_data = transformed_data[~transformed_data['collection_category'].isna() & ~transformed_data['type'].isna()]
len(transformed_data), len(export_data)

(645565, 433571)
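
The export filter combines two not-NaN masks with `&`, so a row is kept only when both columns have a value. The toy example below (made-up rows, hypothetical variable names) contrasts this with `|`, which would drop only the rows where both columns are NaN:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "type": ["Person", np.nan, "Object", np.nan],
        "collection_category": ["Person", "Category - X", np.nan, np.nan],
    }
)

has_type = toy["type"].notna()
has_category = toy["collection_category"].notna()

# Same logic as the export cell: both values must be present.
keep_if_both = toy[has_type & has_category]

# Alternative: drop a row only when both values are missing.
keep_if_either = toy[has_type | has_category]

print(keep_if_both["id"].tolist(), keep_if_either["id"].tolist())
```

Which variant is appropriate depends on whether the visualisation should show entities that have only one of the two groupings.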

In [18]:
export_data.to_csv(
    "../data/processed/final_model_dglke/umap/visualisation_data_with_clusters.tsv", 
    sep="\t", 
    index=False,
)
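
With these `to_csv` arguments, pandas writes a NaN as an empty field by default (`na_rep=''`), which is the form in which missing values reach the JavaScript. A small sketch with made-up rows, writing to an in-memory buffer instead of a file:

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "id": ["a", "b"],
        "label": ["William Stanley", np.nan],
        "x": [1.0, 2.0],
    }
)

buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False)
tsv_lines = buffer.getvalue().splitlines()

# The missing label shows up as two consecutive tabs.
for line in tsv_lines:
    print(repr(line))
```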

**For V&A data:** rotate the projection by 180 degrees so that the V&A visualisation aligns with the SMG one.

In [30]:

# export_data_rotated = export_data.copy()
# export_data_rotated[["x", "y"]] = export_data_rotated[["x", "y"]].applymap(lambda i: i*-1)

# export_data_rotated.to_csv(
#     "../data/processed/final_model_dglke_vanda/umap/visualisation_data_rotated.tsv", 
#     sep="\t", 
#     index=False,
# )
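
Multiplying both coordinates by -1, as in the commented-out cell above, is the same as rotating every point by 180 degrees about the origin. The sketch below checks this numerically with a standard 2D rotation matrix; the sample points are made up for illustration.

```python
import numpy as np

# Two illustrative (x, y) points.
points = np.array([[4.98, 4.18], [-9.55, -5.45]])

theta = np.pi  # 180 degrees in radians
rotation = np.array(
    [
        [np.cos(theta), -np.sin(theta)],
        [np.sin(theta), np.cos(theta)],
    ]
)

# Rotate each row vector: (R @ p) for every point p.
rotated = points @ rotation.T

# Up to floating-point error, the rotation equals negating x and y.
print(np.allclose(rotated, -points))
```

For a DataFrame, the equivalent operation is to negate the `x` and `y` columns directly, without building the matrix.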