### Cleaning UKBB Showcase Data and Merging onto Dimensions
##### Last Updated: 31.03.2025

Note: a large number of helper functions are stored in `./src/helpers_scrape.py`. Launch this notebook from `./src/`. See other notebooks for visualisations of this and related data (from other DSL calls and a bespoke dataset).

First, let's load what we'll need: `datetime` from the Python standard library, plus our own custom helper functions.

In [1]:
from datetime import datetime
from helpers_scrape import (get_ukb_showcase_data, # grab the upstream UKB file
                            login_dimcli, # simple login to DIMCLI helper function
                            get_raw_data, # grab raw article level data from the DSL
                            merger, # merge individual API returns together
                            evaluate_raw_scrape, # evaluate the original scrape of raw data
                            make_long_refs, # explode our original scrape to be long for refs
                            save_combined # save all wrangled files out
                            )

We want our analysis to be easily updateable, but not so frequently updateable that it puts pressure on the API or makes our results change every time we run the script. So, let's timestamp our incoming data by month:

In [2]:
timestamp = datetime.now().strftime("%Y%m")

Next, let's grab the raw UKBB data from upstream at the UKBB showcase. We're going to clean a couple of fields, and also check whether there's anything wonky going on within it at the same time.

In [3]:
df = get_ukb_showcase_data(timestamp)

Number of records which we start with:  8553
Number of duplicated DOIS (excl. NaN):  41
Duplicate DOIs saved: ../data/raw/ukb_data_duplicated_DOIs.txt
Note: looks like one DOI occurs 3x...
Dropping all but the first DOIs (21) records.
Number of duplicated pmid (excl. NaN and pmid==0):  0
Number of duplicated titles (inc. NaN):  59
Duplicate titles saved: ../data/raw/ukb_data_duplicated_titles.txt
Note: Not dropping (unique DOIs which should be retreivable): curious, though...
Number of duplicated pub_id (inc. NaN):  0
Number of NaNs in DOI column:  19
These NaNs are saved out to:, ../data/raw/ukb_data_doi_nan.txt
Note: At least some of these NaNs _should_ have DOIs, though...
Number of records with no doi or pmid:  0
Number of records which we are left with:  8532


Some things to note here. Duplicate DOIs (ignoring NaNs for now) are problematic, because the DOI is going to be our primary key for querying the DSL (below). So, we drop 21 of the 41 rows with non-unique DOIs, keeping the first occurrence of each (20 remain). One DOI appears three times in the UKBB showcase data. When we have no DOI, we can -- here -- rely on the PubMed ID instead (because, at least for now, every record with a NaN DOI has a corresponding PMID). Some of the records with NaN DOIs do actually have DOIs -- they can be found online, and some are even returned from the DSL when we query the relevant entries by PMID. This warrants consideration/investigation upstream.

There are 59 rows of duplicated titles. This isn't urgently problematic, because we are going to look up against DOIs and PMIDs below. But it's indicative of something potentially going wrong upstream, and could be investigated further. The duplicate titles and DOIs -- as well as the NaN DOIs -- get saved out to `../data/raw/`.

This leaves us with the number of rows printed at the bottom of the `get_ukb_showcase_data()` output, which is simply the number at the top minus the number of duplicated DOI rows which get dropped (8553 - 21 = 8532).
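
The de-duplication logic described above can be sketched on toy data. This is a minimal illustration in pandas, not the code inside `get_ukb_showcase_data()`; the toy frame and its column names are assumptions made for the example. The point it shows is that NaN DOIs have to be masked out before counting or dropping duplicates, because pandas otherwise treats repeated NaNs as duplicates of each other.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the showcase data; values and column names are illustrative only.
toy = pd.DataFrame({
    "doi": ["10.1/a", "10.1/a", "10.1/b", "10.1/b", "10.1/b",
            "10.1/c", np.nan, np.nan],
    "pmid": [1, 2, 3, 4, 5, 6, 7, 8],
})

has_doi = toy["doi"].notna()

# Every row whose (non-NaN) DOI appears more than once.
dup_rows = toy[has_doi & toy["doi"].duplicated(keep=False)]

# Second and later occurrences of a non-NaN DOI: these are the rows to drop.
to_drop = has_doi & toy["doi"].duplicated(keep="first")
deduped = toy[~to_drop]

print("Duplicated DOI rows (excl. NaN):", len(dup_rows))
print("Rows dropped:", int(to_drop.sum()))
print("Rows left:", len(deduped))
```

Here one DOI occurs twice and another three times, so five rows are flagged, three are dropped, and both NaN-DOI rows survive to be looked up by PMID.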

Next, let's log in with DIMCLI to query the DSL. The API key is stored privately and loaded in via `./keys/`.

In [4]:
login_dimcli()

Dimcli - Dimensions API Client (v1.4)
Connected to: <https://app.dimensions.ai/api/dsl/v2> - DSL v2.10
Method: manual login


<dimcli.Dsl #138623054175680. API endpoint: https://app.dimensions.ai/api/dsl/v2>

OK, looks like this worked just fine. Now, we're going to get all Dimensions data related to these UKBB showcase rows. Our strategy -- as above -- is to first attempt to get all the DOIs (as the 'primary' search key), but where they are NaN, use PMID instead. We attempt to access ~200 rows at a time, to avoid HTTPErrors; if we hit an HTTPError, we keep retrying the same chunk (the errors seem to occur when the query takes slightly too long to return). A log of the scrape gets written to `./logging/` as a file timestamped to the second.
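
The chunk-and-retry strategy can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of `get_raw_data()`: `chunked`, `fetch_with_retry` and the injected `fetch` callable are hypothetical names, the exception caught by default is the standard library's `urllib.error.HTTPError` (the real helper may well catch a different `HTTPError` class), and the retry count is capped here rather than retrying indefinitely.

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple, Type
from urllib.error import HTTPError


def chunked(items: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_with_retry(fetch: Callable[[List], list],
                     chunk: List,
                     retry_on: Tuple[Type[BaseException], ...] = (HTTPError,),
                     max_attempts: int = 5,
                     wait_seconds: float = 1.0) -> list:
    """Call `fetch(chunk)`, retrying the same chunk when `retry_on` is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(chunk)
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the error after the final attempt
            time.sleep(wait_seconds)
    return []  # only reached if max_attempts < 1
```

With a chunk size of 200, `len(list(chunked(dois, 200)))` for 8513 DOIs gives 43 chunks, which matches the progress bar printed by the next cell. Capping the attempts is a design choice for the sketch: it stops a persistent outage from looping forever, at the cost of having to rerun the scrape.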

A quick note: as we're timestamping our _data_ by the month, this checks whether we already have data for this month (we assume that the upstream file doesn't change more than once per month, and the first run of each month is the one which caches that month's data). If we've already scraped the DSL for this month's data, it just quickly iterates over the chunks to check that we have everything we expect.

In [5]:
get_raw_data(200, ['doi', 'pmid'], df, timestamp)

Processing doi chunks: 100%|████████████████| 43/43 [00:00<00:00, 176992.22it/s]
Processing pmid chunks: 100%|███████████████████| 1/1 [00:00<00:00, 7182.03it/s]
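
The month-stamped cache check described before this cell might look something like the sketch below. It assumes one file per chunk under a directory named after the search key and the `YYYYMM` timestamp; the `chunk_NNNN.json` naming and both function names are hypothetical, and are not taken from `helpers_scrape.py`.

```python
import tempfile
from pathlib import Path


def chunk_path(root: Path, key: str, timestamp: str, index: int) -> Path:
    """Hypothetical layout: <root>/<key>/<YYYYMM>/chunk_<index>.json"""
    return root / key / timestamp / f"chunk_{index:04d}.json"


def missing_chunks(root: Path, key: str, timestamp: str, n_chunks: int) -> list:
    """Return the indices of chunks with no cached file for this month."""
    return [i for i in range(n_chunks)
            if not chunk_path(root, key, timestamp, i).exists()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Pretend chunks 0 and 2 were already scraped in March 2025.
    for i in (0, 2):
        path = chunk_path(root, "doi", "202503", i)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("[]")
    todo_this_month = missing_chunks(root, "doi", "202503", 3)
    todo_next_month = missing_chunks(root, "doi", "202504", 3)

print(todo_this_month)  # only chunk 1 still needs fetching
print(todo_next_month)  # a new month starts from an empty cache
```

Only the missing chunks would then be sent to the API, which is presumably why a rerun within the same month finishes almost instantly.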


Next, we'll wrangle our DOI and PMID returns, merge them, and then evaluate our scrape of the DSL:

In [6]:
doi = merger(f'../data/dimensions/api/raw/doi/{timestamp}')
pmid = merger(f'../data/dimensions/api/raw/pmid/{timestamp}')
doi, pmid, df_dim = evaluate_raw_scrape(df, timestamp)

We start off expecting to get back this number of rows:  8532
We expect to get this many doi:  8513
We expect to get this many pmid:  19
We actually get back this many dois:  8508
We actually get back this unique dois:  8504
Save out the 8/2 duplicates to  ../data/dimensions/api/raw/eval/202503/doi_scrape_duplicates.csv
If a DOI is duplicated, keep the one which has reference_ids (or the first one seen).
We now have this many dois:  8504
The 9 DOIs not returned are saved at ../data/dimensions/api/raw/eval/202503/doi_not_in_dim.tsv
Note: some of these are clearly non-indexed preprints.
Note: Some of are on dimensions.ai app?'
Note: Some of arent on dimensions.ai app, but look like they should be? e.g. 10.1038/s41588-018-0147-3
We get this many from the pmid search:  19
Cool, looks like we got all the pmids without issue
We got this many rows of data from Dimensions:  8523
Note: different to len(df) from i.) drop duplicates in wrangle_raw(), ii.) non-returns (sum to diff)


Some things to say about this. A very small number of input DOIs produced more than one return from the DSL -- this probably shouldn't happen, and is likely an issue on the Dimensions side. Where a DOI is duplicated, we keep the record which has reference IDs or, if neither does, simply keep the first one seen. This probably doesn't matter all that much, as we're still getting things like categories.

What *is* a little more consequential is that some DOIs are not returned at all (9 as of 202503). There's nothing we can really do about that, but a few observations. A couple of these are big papers which should be getting returned, but aren't even on the Dimensions web app (e.g. an EA GWAS is missing). One or two look like they *are* on the web app, but aren't getting returned from the API. Most are preprints, and it's likely fine that these aren't found at all (perhaps Dimensions hasn't scraped SSRN recently, for example?). So, we lose a *very* small number of rows from the showcase data because of duplicate DOIs, and a *very* small number here because of DOIs which don't get returned at all.
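
The tie-breaking rule for duplicated DOI returns, and the check for DOIs that never come back, can be sketched on toy data. This is an illustrative pandas version rather than the code inside `evaluate_raw_scrape()`; the toy records and the helper column `no_refs` are made up for the example.

```python
import pandas as pd

# Toy API returns: DOI "10.1/a" and "10.1/b" each come back twice.
returns = pd.DataFrame({
    "id": ["pub.1", "pub.2", "pub.3", "pub.4", "pub.5"],
    "doi": ["10.1/a", "10.1/a", "10.1/b", "10.1/b", "10.1/c"],
    "reference_ids": [None, ["pub.9"], None, None, ["pub.8"]],
})

# True where no usable reference list was returned.
no_refs = returns["reference_ids"].apply(
    lambda refs: not (isinstance(refs, list) and len(refs) > 0)
)

resolved = (
    returns.assign(no_refs=no_refs)
    # Stable sort: records with references first, original order otherwise.
    .sort_values("no_refs", kind="mergesort")
    .drop_duplicates(subset="doi", keep="first")
    .sort_index()
    .drop(columns="no_refs")
)

# DOIs we asked for which never came back.
requested = ["10.1/a", "10.1/b", "10.1/c", "10.1/d"]
not_returned = sorted(set(requested) - set(resolved["doi"]))

print(resolved["id"].tolist())
print(not_returned)
```

For `10.1/a` the record with references wins; for `10.1/b` neither has references, so the first one seen is kept. The set difference is the kind of list that gets saved to `doi_not_in_dim.tsv`.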

Next, let's explode our `df_dim` dataframe lengthways to create source:target key-value pairs, which we then use to scrape the references cited by each of these UKB Showcase papers. Note: we now move from using the DOI to the Dimensions `id` field, as that's a far better UID within the DSL universe (and is what is returned in the `reference_ids` field).

In [7]:
df_dim, df_exploded = make_long_refs(df_dim)

Number of rows where reference_ids is an empty list: 74
0 duplicates in the id-reference_id pair.
Drop the NaN exploded reference ids (empty lists)
Number of rows with NaN in either source_id or target_id: 0
We have 408821 source:target id pairs,but only 192930 refs to get
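
The reshaping reported in this output can be sketched with `DataFrame.explode`. This is a minimal illustration on toy data, not the body of `make_long_refs()`; the three toy papers are made up, and only the `id`, `reference_ids`, `source_id` and `target_id` names come from the notebook.

```python
import pandas as pd

# Toy stand-in for df_dim: one row per paper, references held as a list.
papers = pd.DataFrame({
    "id": ["pub.1", "pub.2", "pub.3"],
    "reference_ids": [["pub.7", "pub.8"], [], ["pub.8", "pub.9"]],
})

# Papers whose reference list is empty.
n_empty = int(papers["reference_ids"].apply(len).eq(0).sum())

pairs = (
    papers.rename(columns={"id": "source_id", "reference_ids": "target_id"})
    .explode("target_id")          # one row per (source, reference); [] becomes NaN
    .dropna(subset=["target_id"])  # drop the papers with no parsed references
    .reset_index(drop=True)
)

n_pairs = len(pairs)
n_unique_targets = pairs["target_id"].nunique()
print(n_empty, n_pairs, n_unique_targets)
```

The gap between the number of pairs and the number of unique targets is why there are far fewer references to fetch than source:target pairs: many papers cite the same reference, and each one only needs to be requested once.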


A surprisingly small number of papers have no references parsed (I expected a much higher % than this). Next, we take this long list of UIDs from the references and push it back through the API. Again, if we already have this data, we don't bother to get it again. Then, we merge it using the same function as above.

In [8]:
get_raw_data(200, ['id'], df_exploded, timestamp)
refs = merger(f'../data/dimensions/api/raw/id/{timestamp}')

Processing id chunks: 100%|███████████████| 965/965 [00:00<00:00, 444780.59it/s]


Next, let's evaluate our scrape again -- did we get all the data we expected to on the _references_?

In [9]:
import pandas as pd
print('Length of the unique reference file is:' , len(refs))
df_exploded = pd.merge(df_exploded, refs, how='left',
                       left_on='target_id', right_on='id')
df_exploded = df_exploded[df_exploded['id'].notnull()].drop('id', axis=1)
print('Length of the linked reference file is:' , len(df_exploded))
print('Length of rows of UKB data that have at least one reference: ',
      len(df_exploded['source_id'].unique()))

Length of the unique reference file is: 192909
Length of the linked reference file is: 408798
Length of rows of UKB data that have at least one reference:  8449


This is curious: it looks like a *tiny* number of references are not returned (99.989% are, as of 202503!). I wonder why a truly tiny number of references aren't returned when they've been internally linked, and an explicit ID has been generated and has gone into the references of another paper? Likely an issue on their side, and not of huge consequence for us, but good to document all the same.
It looks like this doesn't leave any additional UKB papers with no references at all, because the number of unique source_ids which come out of this (8449) is the same as the number of rows from the original scrape minus those with no references (8523 - 74). Great accounting!
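
The accounting can be checked directly. The snippet below only re-does the arithmetic on figures copied from the cell outputs of the 202503 run; the variable names are ours.

```python
# All figures are copied from the cell outputs above (202503 run).
rows_from_dimensions = 8523
rows_without_refs = 74
sources_with_refs = 8449

unique_refs_requested = 192930
unique_refs_returned = 192909
pairs_expected = 408821
pairs_linked = 408798

# No paper lost *all* of its references to the non-returned ids.
assert rows_from_dimensions - rows_without_refs == sources_with_refs

ref_return_rate = unique_refs_returned / unique_refs_requested
pair_link_rate = pairs_linked / pairs_expected

print(f"Unique references returned: {ref_return_rate:.3%}")  # 99.989%
print(f"Source:target pairs linked: {pair_link_rate:.3%}")   # 99.994%
print("Unique references missing:", unique_refs_requested - unique_refs_returned)
print("Pairs lost:", pairs_expected - pairs_linked)
```

So 21 unique reference ids are not returned, which costs 23 of the 408821 source:target pairs.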

Finally, save all five wrangled files (doi, pmid, df_dim, refs, df_exploded) out to a dedicated subdirectory for downstream analysis.

In [10]:
save_combined(f'../data/dimensions/api/raw/combined/{timestamp}',
              doi, pmid, df_dim, refs, df_exploded)