Skip to content
This repository has been archived by the owner on Dec 16, 2019. It is now read-only.

Update the LocalEGA database schema #342

Closed
silverdaz opened this issue Sep 3, 2018 · 7 comments
Closed

Update the LocalEGA database schema #342

silverdaz opened this issue Sep 3, 2018 · 7 comments
Assignees
Milestone

Comments

@silverdaz
Copy link
Contributor

....so that ingestion and data-out play nicely with each other.

One database is enough for both.

@silverdaz silverdaz added this to the Sprint 35 milestone Sep 3, 2018
@silverdaz silverdaz self-assigned this Sep 3, 2018
@silverdaz
Copy link
Contributor Author

About harmonizing the data-in and data-out database schema, there is a better approach: Database Views.

It allows us to have whatever database backend we want, with whatever schema we fancy, and then we use 2 views, one for data-in, one for data-out.

That way, there is no need to harmonize the code to have given table names, column names, etc...
In other word: Those views are an interface to the database.

Here is an example:

CREATE TABLE local_ega (
       id                        SERIAL, PRIMARY KEY(id), UNIQUE (id),
       status                    status,
       -- Original file
       elixir_id                 TEXT NOT NULL,
       inbox_path                TEXT NOT NULL,
       -- EGA file ids
       stable_id                 TEXT,
       -- Vault information
       vault_path                TEXT,
       vault_filesize            INTEGER,
       vault_display_name        TEXT,    -- no idea what that is
       -- Crypt4GH header
       header                    TEXT,
       -- Useless checksums when using Crypt4GH
       checksum                  VARCHAR(32),
       checksum_type             checksum_algorithm,
       unencrypted_checksum      VARCHAR(32),
       unencrypted_checksum_type checksum_algorithm,
       -- Status
       status                    status NOT NULL,
       -- Audit / Logs
       created_by                TEXT,
       last_modified_by          TEXT,
       created_at                TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp(),
       last_modified             TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
);
-- WITH (OIDS=FALSE);
CREATE UNIQUE INDEX file_id_idx ON local_ega (id);

-- ################## VIEWS #########################
-- Used by data-in
CREATE VIEW files AS
SELECT id, elixir_id, inbox_path, status, vault_path, vault_filesize, stable_id, header, created_at, last_modified
FROM local_ega;

-- Used by data-out from EBI
CREATE VIEW dev_ega_file AS
SELECT id AS file_id,
       stable_id AS file_name,
       vault_path AS file_path,
       display_file_name,
       vault_filesize AS file_size,
       checksum, checksum_type, unencrypted_checksum, unencrypted_checksum_type,
       status AS file_status,
       created_by, last_updated_by, header,
       created_at AS created
       last_modified AS last_updated
FROM local_ega;

@silverdaz
Copy link
Contributor Author

Note that we can add a WHERE-clause if we want to disallow accessing some files.

Or here is another idea:
Create a view for (local) EGA files and another view for BioBank files (using yet another where-clause)

@blankdots
Copy link

Related issue ? EGA-archive/ega-data-api#10

@silverdaz
Copy link
Contributor Author

silverdaz commented Sep 17, 2018

ok, this seems to do it:

\connect lega

SET TIME ZONE 'UTC';

CREATE TYPE checksum_algorithm AS ENUM ('MD5', 'SHA256', 'SHA384', 'SHA512'); -- md5 is bad. Use sha*!
CREATE TYPE storage AS ENUM ('S3', 'POSIX');
-- Note: This is an enum, because that's what the "provided" database supports
--       If a site has its own database already, let them define their keyword in the ENUM 
--       and use it (Notice that their code must be update to push this value into the table)
--       There is no need to agree on how each site should operate their own database
--       What we need is to document where they need to update and what.


-- ##################################################
--                  FILE STATUS
-- ##################################################
CREATE TABLE status (
        id            INTEGER,
	code          VARCHAR(16) NOT NULL,
	description   TEXT,
	-- contraints
	PRIMARY KEY(id), UNIQUE (id), UNIQUE (code)
);

INSERT INTO status(id,code,description)
VALUES (10, 'INIT'        , 'Initializing a file ingestion'),
       (20, 'IN_INGESTION', 'Currently under ingestion'),
       (30, 'ARCHIVED'    , 'File moved to Vault'),
       (40, 'COMPLETED'   , 'File verified in Vault'),
       (50, 'READY'       , 'File ingested, ready for download'),
       -- (60, 'IN_INDEXING', 'Currently under index creation'),
       (0, 'ERROR'       , 'An Error occured, check the error table')
;

-- ##################################################
--                ENCRYPTION FORMAT
-- ##################################################
CREATE TABLE vault_encryption (
       mode          VARCHAR(16) NOT NULL, PRIMARY KEY(mode), UNIQUE (mode),
       description   TEXT
);

INSERT INTO vault_encryption(mode,description)
VALUES ('CRYPT4GH'  , 'Crypt4GH encryption (using version)'),
       ('PGP'       , 'OpenPGP encryption (RFC 4880)'),
       ('AES'       , 'AES encryption with passphrase'),
       ('CUSTOM1'   , 'Custom method 1 for local site'),
       ('CUSTOM2'   , 'Custom method 2 for local site')
    -- ...
;



-- ##################################################
--                        FILES
-- ##################################################
-- Main table with looooots of information
CREATE TABLE main (
       id                        SERIAL, PRIMARY KEY(id), UNIQUE (id),

       -- EGA file ids
       stable_id                 TEXT,

       -- Status
       status                    VARCHAR NOT NULL REFERENCES status (code), -- No "ON DELETE CASCADE": update to the new status
                                                                            --                         in case the old one is deleted 

       -- Original/Submission file
       submission_file_path                     TEXT NOT NULL,
       submission_file_extension                VARCHAR(260) NOT NULL,
       submission_file_calculated_checksum      VARCHAR(128),
       submission_file_calculated_checksum_type checksum_algorithm,

       submission_file_size                     INTEGER NULL,
       submission_user                          TEXT NOT NULL, -- Elixir ID, or internal user
 
       -- Vault information
       vault_file_reference      TEXT,    -- file path if POSIX, object id if S3
       vault_file_type           storage, -- S3 or POSIX file system
       vault_file_size           INTEGER,
       vault_file_checksum       VARCHAR(128) NULL, -- NOT NULL,
       vault_file_checksum_type  checksum_algorithm,
       
       -- Encryption/Decryption
       encryption_method         VARCHAR REFERENCES vault_encryption (mode), -- ON DELETE CASCADE,
       version                   INTEGER , -- DEFAULT 1, -- Crypt4GH version
       header                    TEXT,              -- Crypt4GH header
       session_key_checksum      VARCHAR(128) NULL, -- NOT NULL, -- To check if session key already used
       session_key_checksum_type checksum_algorithm,
       -- Note: We can support multiple encryption. See at the end of that file.

       -- Audit / Logs
       created_by                NAME DEFAULT CURRENT_USER, -- Postgres users
       last_modified_by          NAME DEFAULT CURRENT_USER, --
       created_at                TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp(),
       last_modified             TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
)
WITH (
     OIDS=FALSE
);
CREATE UNIQUE INDEX file_id_idx ON main(id);

-- ##################################################
--                      ERRORS
-- ##################################################
CREATE TABLE main_errors (
        id            SERIAL, PRIMARY KEY(id), UNIQUE (id),
	active        BOOLEAN NOT NULL DEFAULT TRUE,
	file_id       INTEGER NOT NULL REFERENCES main(id) ON DELETE CASCADE,
	hostname      TEXT,
	error_type    TEXT NOT NULL,
	msg           TEXT NOT NULL,
	from_user     BOOLEAN DEFAULT FALSE,
	occured_at    TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT clock_timestamp()
);

-- ##################################################
--                 Quality Control
-- ##################################################
-- Coming soon to a cinema near you

-- ##################################################
--                      Views
-- ##################################################
-- Used by data-in
CREATE VIEW files AS
SELECT id,
       submission_user                          AS elixir_id,
       submission_file_path                     AS inbox_path,
       submission_file_size                     AS inbox_filesize,
       submission_file_calculated_checksum      AS inbox_file_checksum,
       submission_file_calculated_checksum_type AS inbox_file_checksum_type,
       status,
       vault_file_reference                     AS vault_path,
       vault_file_type                          AS vault_type,
       vault_file_size                          AS vault_filesize,
       stable_id,
       header,  -- Crypt4gh specific
       version,
       session_key_checksum,
       session_key_checksum_type,
       created_at,
       last_modified
FROM main;

-- Insert into main
CREATE FUNCTION insert_file(inpath main.submission_file_path%TYPE,
			    eid    main.submission_user%TYPE)
    RETURNS main.id%TYPE AS $insert_file$
    #variable_conflict use_column
    DECLARE
        file_id  main.id%TYPE;
        file_ext main.submission_file_extension%TYPE;
    BEGIN
        file_ext := substring(inpath from '\.([^\.]*)$'); -- extract extension from filename
	INSERT INTO main (submission_file_path,
	                  submission_user,
			  submission_file_extension,
			  status,
			  encryption_method) -- hard-code the vault_encryption
	VALUES(inpath,eid,file_ext,'INIT','CRYPT4GH') RETURNING main.id
	INTO file_id;
	RETURN file_id;
    END;
$insert_file$ LANGUAGE plpgsql;


-- Just showing the current/active errors
CREATE VIEW errors AS
SELECT id, file_id, hostname, error_type, msg, from_user, occured_at
FROM main_errors
WHERE active = TRUE;

CREATE FUNCTION insert_error(fid   errors.file_id%TYPE,
                             h     errors.hostname%TYPE,
                             etype errors.error_type%TYPE,
                             msg   errors.msg%TYPE,
                             from_user  errors.from_user%TYPE)
    RETURNS void AS $insert_error$
    BEGIN
       INSERT INTO errors (file_id,hostname,error_type,msg,from_user) VALUES(fid,h,etype,msg,from_user);
       UPDATE files SET status = 'ERROR' WHERE id = fid;
    END;
$insert_error$ LANGUAGE plpgsql;


-- When File becomes 'READY', remove all its errors from current errors.
CREATE FUNCTION mark_ready()
RETURNS TRIGGER AS $mark_ready$
BEGIN
     UPDATE main_errors SET active = FALSE WHERE file_id = NEW.id;  -- or OLD.id
     RETURN NEW;
END;
$mark_ready$ LANGUAGE plpgsql;

CREATE TRIGGER mark_ready 
    AFTER UPDATE OF status ON main -- using the main and not files
                                   -- because "Views cannot have row-level BEFORE or AFTER triggers."
    FOR EACH ROW WHEN (NEW.status = 'READY')
    EXECUTE PROCEDURE mark_ready();

-- #######################################################################
-- Used by data-out from EBI
CREATE VIEW dev_ega_file AS
SELECT id AS file_id,
       stable_id AS file_name,
       vault_file_reference AS file_path,
       vault_file_type AS vault_type,
       submission_file_path AS display_file_name,
       vault_file_size AS file_size,
       submission_file_calculated_checksum AS checksum,
       submission_file_calculated_checksum_type AS checksum_type,
       vault_file_checksum AS unencrypted_checksum,
       vault_file_checksum_type AS unencrypted_checksum_type,
       status AS file_status,
       header,
       created_by,
       last_modified_by AS last_updated_by,
       created_at AS created,
       last_modified AS last_updated
FROM main;


-- ##########################################################################
--                   About the encryption
-- ##########################################################################
-- 
-- 
-- We can support multiple encryption types in the vault
-- (Say, for example, Crypt4GH, PGP and plain AES),
-- in the following manner:
-- 
-- We create a table for each method of encryption.
-- Each table will have its own set of fields, refering to data it needs for decryption
-- 
-- Then we update the main file table with a vault_encryption "keyword".
-- That will tell the main file table to look at another table for that
-- particular file. (Note that this file reference is found in only one
-- encryption table).
--
-- At the moment, we added the whole Crypt4gh-related table inside the main table.
-- That's easily changeable, using views.
-- 
-- The site that wants to support multiple encryption methods must update the data-in and
-- data-out code in order to push/pull the data accordingly.

@blankdots
Copy link

Compared with the fields descried in https://github.com/EGA-archive/ega-data-api/tree/master/ega-data-api-filedatabase/src/main/resources and EGA-archive/ega-data-api#10 (comment)
At first glance the names seem to be ok ( the focus was on File.sql) with an additional one above vault_type.
The data out seems to operate within a certain schema: dev_ega_file and it uses other tables such as downdload_log and event that are not present in the schema above.
Maybe a CREATE TABLE IF NOT EXISTS would be nice either on data out or on the schema above.
@silverdaz you mentioned yesterday some changes related to schema, would be nice to take a look at them before proceeding with the next step.

Next step would be to see if data out can utilize this schema, and if any changes required on that part.

@silverdaz
Copy link
Contributor Author

@silverdaz you mentioned yesterday some changes related to schema, would be nice to take a look at them before proceeding with the next step.

I updated the comment with the recent changes.

The data out seems to operate within a certain schema: dev_ega_file and it uses other tables such as downdload_log and event that are not present in the schema above.

I'm not aware of the download_log tables, can you point me to the file loading that schema?

@blankdots
Copy link

@silverdaz

I'm not aware of the download_log tables, can you point me to the file loading that schema?

There are multiple files used: https://github.com/EGA-archive/ega-data-api/tree/master/ega-data-api-filedatabase/src/main/resources . For donwload_log see: https://github.com/EGA-archive/ega-data-api/blob/master/ega-data-api-filedatabase/src/main/resources/DownloadLog.sql

Also refer to this comment for more information: EGA-archive/ega-data-api#10 (comment)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants