[LungMAP] Add LungMAP projects to Google Datasets catalog #4808

@NoopDog

Add LungMAP projects to the Google Dataset Search catalog so they're discoverable from Google. This is done by embedding schema.org Dataset JSON-LD on project detail pages — Google's crawler picks it up.

Companion to galaxyproject/brc-analytics#1264, #4806 (HCA), and #4807 (AnVIL). LungMAP shares the HCA Azul backend, so the implementation should follow the HCA companion ticket closely with LungMAP-specific catalog naming.

Reference implementation

NCPI Dataset Catalog has already shipped this for studies. Mirror their pattern:

Google Dataset required + recommended fields

Per Google's Dataset structured data guidelines:

Required

  • name — descriptive title
  • description — 50–5000 characters

Recommended

  • identifier, url, sameAs
  • creator, funder, license
  • distribution (with contentUrl, encodingFormat)
  • keywords, variableMeasured, measurementTechnique
  • spatialCoverage, temporalCoverage
  • includedInDataCatalog, isAccessibleForFree, version, citation
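The required and recommended fields above can be captured as a type, which keeps the builder honest at compile time. The following is a minimal sketch, not the NCPI or HCA implementation: the interface name `DatasetJsonLd`, the helper `checkRequiredFields`, and the subset of optional fields shown are all assumptions made for illustration.

```typescript
// Hypothetical shape of a schema.org Dataset for Google Dataset Search.
// Only `name` and `description` are required; the optional fields below are
// a subset of the recommended list, typed in their simplest form.
interface DatasetJsonLd {
  "@context": "https://schema.org";
  "@type": "Dataset";
  name: string;
  description: string;
  identifier?: string[];
  url?: string;
  sameAs?: string[];
  keywords?: string[];
  isAccessibleForFree?: boolean;
  license?: string;
  version?: string;
}

// Description length bounds taken from the required-fields list.
const DESCRIPTION_MIN = 50;
const DESCRIPTION_MAX = 5000;

// Hypothetical helper: returns a list of problems, empty when the
// required fields satisfy the guidelines.
function checkRequiredFields(dataset: DatasetJsonLd): string[] {
  const problems: string[] = [];
  if (dataset.name.trim().length === 0) {
    problems.push("name is empty");
  }
  const length = dataset.description.length;
  if (length < DESCRIPTION_MIN || length > DESCRIPTION_MAX) {
    problems.push(
      `description length ${length} is outside ${DESCRIPTION_MIN}-${DESCRIPTION_MAX}`
    );
  }
  return problems;
}
```

A check like this could back the "required fields present" unit test without depending on an external validator.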

Initial mapping — LungMAP project (ProjectResponse) → Dataset

Source entity at app/apis/azul/hca-dcp/common/entities.ts (ProjectResponse, shared with HCA via the lm2 catalog).

| schema.org field | Source / value |
| --- | --- |
| @context | "https://schema.org" |
| @type | "Dataset" |
| name | projectTitle (fall back to projectShortname) |
| description | projectDescription, with HTML stripped and truncated to 5000 chars (min 50) |
| identifier | [projectId, ...accessions] |
| url | ${browserURL}/projects/${projectId} |
| sameAs | Accession URLs (GEO, ArrayExpress, INSDC, etc.) derived from accessions |
| includedInDataCatalog | { "@type": "DataCatalog", name: "LungMAP Data Explorer", url: browserURL } |
| isAccessibleForFree | true |
| keywords | Union of genusSpecies, organ (focused on lung anatomy), organPart, disease, sampleEntityType, libraryConstructionApproach, developmentStage |
| creator | Map contributors → Person/Organization (name, affiliation, role) |
| funder | Likely { "@type": "Organization", name: "NHLBI LungMAP" } plus per-project funders if available |
| citation | Map publications → ScholarlyArticle (title, DOI/URL) |
| distribution | DataDownload[] from matrix files / contributed analyses with contentUrl + encodingFormat |
| variableMeasured | Optional PropertyValue[] derived from projectSummary (cell counts, donor counts, file counts, library construction approach, development stage) |
| license | TBD; confirm with team (LungMAP data use terms) |

The open questions on funder and license should be resolved before merge.
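To make the mapping concrete, here is a minimal sketch of the builder covering only the settled rows (name, description, identifier, url, includedInDataCatalog, isAccessibleForFree, keywords). Several things are assumptions: `ProjectLike` is a flattened stand-in for the real ProjectResponse, `CatalogInfo` is a hypothetical parameter shape, `stripHtml` is a crude regex-based helper, and returning `null` when the description falls below 50 characters is one possible policy, not something this issue decides.

```typescript
// Assumed, simplified view of the fields the mapping reads. The real
// ProjectResponse in entities.ts is structured differently.
interface ProjectLike {
  projectId: string;
  projectTitle: string | null;
  projectShortname: string;
  projectDescription: string | null;
  accessions: string[];
  keywords: (string | null)[]; // pre-collected genusSpecies, organ, etc.
}

// Hypothetical catalog parameter so HCA and LungMAP can share the builder.
interface CatalogInfo {
  name: string;
  url: string;
}

const MIN_DESCRIPTION = 50;
const MAX_DESCRIPTION = 5000;

// Crude tag stripper for illustration only; it removes anything that looks
// like a tag and collapses whitespace.
function stripHtml(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

function buildProjectJsonLd(
  project: ProjectLike,
  browserURL: string,
  catalog: CatalogInfo
): Record<string, unknown> | null {
  const name = project.projectTitle || project.projectShortname;
  const description = stripHtml(project.projectDescription ?? "").slice(
    0,
    MAX_DESCRIPTION
  );
  // Assumed policy: emit nothing when a required field cannot be satisfied.
  if (!name || description.length < MIN_DESCRIPTION) return null;

  // Drop null/empty terms and de-duplicate while preserving order.
  const keywords = Array.from(
    new Set(
      project.keywords.filter(
        (k): k is string => typeof k === "string" && k.length > 0
      )
    )
  );

  const jsonLd: Record<string, unknown> = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name,
    description,
    identifier: [project.projectId, ...project.accessions],
    url: `${browserURL}/projects/${project.projectId}`,
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: catalog.name,
      url: catalog.url,
    },
    isAccessibleForFree: true,
  };
  // Conditional field: omitted entirely when the source has no values.
  if (keywords.length > 0) jsonLd["keywords"] = keywords;
  return jsonLd;
}
```

The remaining rows (creator, funder, citation, distribution, variableMeasured, license) would follow the same pattern of adding a key only when the source data is present.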

Implementation steps

  1. Add a LungMAP-aware buildProjectJsonLd(project, browserURL, catalog) (likely shared with the HCA implementation, parameterized on catalog name/URL).
  2. Add a ProjectJsonLd component that renders the JSON-LD via next/head with the same HTML-escape helper as NCPI.
  3. Mount the component on the LungMAP project detail page (pages/projects/[entityId] under the LungMAP site config).
  4. Unit-test the builder: required fields present, description truncation, conditional fields omitted when source is null, LungMAP catalog name asserted.
  5. Validate output against Google's Rich Results Test and Schema Markup Validator for representative LungMAP projects (mouse vs. human, single-cell vs. spatial).
  6. Once shipped, request indexing via Google Search Console and confirm project pages start appearing in Google Dataset Search.
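For step 2, the escape helper matters because project descriptions are free text and could contain a literal `</script>`. The NCPI helper is not reproduced in this issue, so the following is a sketch of one common approach rather than their code; the name `serializeJsonLd` is hypothetical.

```typescript
// Serialize JSON-LD for embedding in a <script type="application/ld+json"> tag.
// Replacing "<", ">" and "&" with \uXXXX escapes keeps the output valid JSON
// (parsers decode the escapes) while ensuring the payload cannot close the
// surrounding script tag or be read as markup.
function serializeJsonLd(data: unknown): string {
  return JSON.stringify(data)
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026");
}
```

In a React component, the resulting string would typically be passed to `dangerouslySetInnerHTML` on the script element rendered inside next/head; the unit test in step 4 can then assert that the serialized output contains no raw `<` and still parses back to the original object.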

Out of scope (follow-ups)

  • JSON-LD on samples/files detail pages.
  • Sitemap entries for project detail pages if not already complete.
