Skip to content
Aaron Mussig edited this page Apr 15, 2024 · 7 revisions

This outlines the steps required to update the website for each release.

1 - Preparing the GTDB database

1.1 - Backup & restore

On Watson, run the command below. Then copy it across to the database server.

pg_dump -U gtdb --data-only --format=tar --no-owner --no-acl --no-privileges --dbname=gtdb_pierre_r220 --file "/tmp/gtdb_r220.tar"

On the database server, restore the release into a new database:

docker exec -it gtdb-postgres psql -U postgres gtdb_r220 -f /backup/gtdb_r220.sql

1.2 - Extensions

Check what extensions were enabled on the previous release database, and enable them on the new database:

SELECT *
FROM pg_extension;

For example, this enables the pg_trgm extension:

create extension pg_trgm;

1.3 - Views

There are a number of materialized views that need to be created, and potentially views that may need to be created. Check the previous database and bring across any that do not exist.

Views:

select 'Materialized Views' AS type, matviewname AS name
from pg_matviews
    UNION
select 'Views', table_name
from INFORMATION_SCHEMA.views
where table_schema = 'public'
order by 1, 2;

1.4 - Permissions

Update permissions on the database so that the API is allowed to read.

GRANT SELECT
    ON ALL TABLES IN SCHEMA public
    TO gtdb_api;

ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT 
    ON TABLES TO gtdb_api;

1.5 - Maintenance

Make sure that all tables are clustered on their primary keys, this can be done by querying for all primary keys, then copying the SQL from the output and running it:

SELECT conrelid::regclass                                              AS tbl,
       conname                                                         AS pkey,
       pg_get_constraintdef(oid)                                       AS description,
       CONCAT('CLUSTER ', conrelid::regclass, ' USING ', conname, ';') AS query
FROM pg_constraint
WHERE contype = 'p'
  AND connamespace = 'public'::regnamespace
ORDER BY conrelid::regclass::text, contype DESC;

You will also need to manually re-cluster the materialized views, they should have a "clst" index that dictates what index to use.

Reclaim any unused space by running the following:

VACUUM FULL ANALYZE;

Now, reindex the database and update the statistics:

REINDEX DATABASE gtdb_r220;
ANALYZE;
CHECKPOINT;

2 - Preparing the GTDB website database

2.1 - Backup & restore

To do this, dump the previous release website database, then restore it as follows:

docker exec -it gtdb-postgres pg_dump -U postgres --dbname=gtdb_r214_web --no-acl --no-privileges --no-owner --file="/backup/gtdb_r214_web.sql"

docker exec -it gtdb-postgres psql -U postgres gtdb_r220_web -f /backup/gtdb_r214_web.sql

2.2 - Permissions

Update permissions on the database so that the API is allowed to read.

GRANT SELECT
    ON ALL TABLES IN SCHEMA public
    TO gtdb_api;

ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT 
    ON TABLES TO gtdb_api;

2.2 - Updating the genome_taxid table

First, truncate the genome_taxid table:

TRUNCATE TABLE genome_taxid;

Run the script from the GTDB API repository: scripts/release/update_genome_taxid.py

Ensure that the connection string is setup to use the new web release database (see api/db/__init__.py).

2.3 - Updating the gtdb_taxa_not_in_lit table

This is done manually, discuss with the ledger holder to ensure that it is up-to-date.

2.4 - Updating the gtdb_tree and gtdb_tree_children tables

Run the script from the GTDB API repository: scripts/release/update_tree.py

Run the following command to purge the data from the database:

TRUNCATE TABLE gtdb_tree, gtdb_tree_children;

Import the new tsv files that were generated.

2.5 - Updating the taxon_hist table

  1. Connect to the gtdb_r220_web.taxon_hist table and run the following command:
DELETE FROM taxon_hist WHERE release_ver = 'NCBI';

Run the following command in the new GTDB release database (not the web database):

select 'NCBI'                            AS release_ver,
       g.formatted_source_id             AS genome_id,
       split_part(ncbi_taxonomy, ';', 1) AS rank_domain,
       split_part(ncbi_taxonomy, ';', 2) AS rank_phylum,
       split_part(ncbi_taxonomy, ';', 3) AS rank_class,
       split_part(ncbi_taxonomy, ';', 4) AS rank_order,
       split_part(ncbi_taxonomy, ';', 5) AS rank_family,
       split_part(ncbi_taxonomy, ';', 6) AS rank_genus,
       split_part(ncbi_taxonomy, ';', 7) AS rank_species
from genomes g
         inner join metadata_taxonomy mt ON mt.id = g.id
where mt.gtdb_domain != 'd__'
  AND mt.gtdb_phylum != 'p__'
  AND mt.gtdb_class != 'c__'
  AND mt.gtdb_order != 'o__'
  AND mt.gtdb_family != 'f__'
  AND mt.gtdb_genus != 'g__'
  AND mt.gtdb_species != 's__'

UNION

select 'R220'                AS release_ver,
       g.formatted_source_id AS genome_id,
       mt.gtdb_domain        AS rank_domain,
       mt.gtdb_phylum        AS rank_phylum,
       mt.gtdb_class         AS rank_class,
       mt.gtdb_order         AS rank_order,
       mt.gtdb_family        AS rank_family,
       mt.gtdb_genus         AS rank_genus,
       mt.gtdb_species       AS rank_species
from genomes g
         inner join metadata_taxonomy mt ON mt.id = g.id
where mt.gtdb_domain != 'd__'
  AND mt.gtdb_phylum != 'p__'
  AND mt.gtdb_class != 'c__'
  AND mt.gtdb_order != 'o__'
  AND mt.gtdb_family != 'f__'
  AND mt.gtdb_genus != 'g__'
  AND mt.gtdb_species != 's__'

Export the results and import them to the gtdb_r220_web.taxon_hist table.

2.6 - Updating the lpsn_url, and uba_alias tables

This should be done periodically and is not required for the release.

2.7 - Indexing

Make sure that all tables are clustered on their primary keys, this can be done by querying for all primary keys, then copying the SQL from the output and running it:

SELECT conrelid::regclass                                              AS tbl,
       conname                                                         AS pkey,
       pg_get_constraintdef(oid)                                       AS description,
       CONCAT('CLUSTER ', conrelid::regclass, ' USING ', conname, ';') AS query
FROM pg_constraint
WHERE contype = 'p'
  AND connamespace = 'public'::regnamespace
ORDER BY conrelid::regclass::text, contype DESC;

Reclaim any unused space by running the following (~3 minutes):

VACUUM FULL ANALYZE;

Now, reindex the database and update the statistics:

REINDEX DATABASE gtdb_r220_web;
ANALYZE;
CHECKPOINT;

3 - Updating the GTDB website

3.1 - Updating the main page

Run the following query to get the numbers for the index page:

SELECT count(*), gtdb_domain
FROM metadata_taxonomy mt
WHERE gtdb_phylum != 'p__'
GROUP BY gtdb_domain;
SELECT COUNT(*), 'Phylum' AS taxon
FROM (SELECT DISTINCT regexp_replace(gtdb_phylum, '_[A-Z]$', '') AS rank
      FROM metadata_taxonomy mt
      WHERE gtdb_phylum != 'p__'
        AND gtdb_domain = 'd__Bacteria') AS sub

UNION

SELECT COUNT(*), 'Class' AS taxon
FROM (SELECT DISTINCT regexp_replace(gtdb_class, '_[A-Z]$', '') AS rank
      FROM metadata_taxonomy mt
      WHERE gtdb_class != 'c__'
        AND gtdb_domain = 'd__Bacteria') AS sub

UNION

SELECT COUNT(*), 'Order' AS taxon
FROM (SELECT DISTINCT regexp_replace(gtdb_order, '_[A-Z]$', '') AS rank
      FROM metadata_taxonomy mt
      WHERE gtdb_order != 'o__'
        AND gtdb_domain = 'd__Bacteria') AS sub

UNION

SELECT COUNT(*), 'Family' AS taxon
FROM (SELECT DISTINCT regexp_replace(gtdb_family, '_[A-Z]$', '') AS rank
      FROM metadata_taxonomy mt
      WHERE gtdb_family != 'f__'
        AND gtdb_domain = 'd__Bacteria') AS sub

UNION

SELECT COUNT(*), 'Genus' AS taxon
FROM (SELECT DISTINCT gtdb_genus AS rank
      FROM metadata_taxonomy mt
      WHERE gtdb_genus != 'g__'
        AND gtdb_domain = 'd__Bacteria') AS sub

UNION

SELECT COUNT(*), 'Species' AS taxon
FROM (SELECT DISTINCT gtdb_species AS rank
      FROM metadata_taxonomy mt
      WHERE gtdb_species != 's__'
        AND gtdb_domain = 'd__Bacteria') AS sub

ORDER BY 1;

TODO: Put the commit with the release changes in.