Add clientpath to Filesets #12

will-moore · 2023-10-11T21:16:21Z

Since existing FilesetEntry.clientpath values are set to unknown for mkngff Filesets, and we also don't have any reference to the original source of the data, we can set this value to something more useful.

This PR adds a --clientpath option, which takes a path or URL to the Fileset (e.g. https://s3-server/bucket/data.zarr) corresponding to the mounted s3 Fileset /dir/path/to/data.zarr.
This lets us create a clientpath for every file found under the mounted Fileset.

E.g.

$ omero mkngff sql 4053141 --clientpath=https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr --secret=$SECRET /bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr > 4053141.sql

This creates sql output with clientpath as a 4th item in each sql row. If the --clientpath option is omitted, the placeholder unknown is added to each row instead, which gives the same outcome as before.
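As a rough illustration of the behaviour described above (not the actual omero-mkngff code), the per-file rows could be built along these lines. The function name fileset_rows, its parameters and the fixed mimetype are assumptions made for this sketch; only the 4-item row shape and the unknown placeholder come from the description.

```python
import os
from typing import Iterator, List, Optional

UNKNOWN = "unknown"  # placeholder used when no --clientpath is given


def fileset_rows(
    root: str, prefix: str, clientpath: Optional[str] = None
) -> Iterator[List[str]]:
    """Yield [path, name, mimetype, clientpath] for every file under root.

    Hypothetical helper: `root` is the mounted .zarr directory, `prefix` is
    the server-side directory the rows are placed under, and `clientpath`
    is the URL or path that corresponds to `root`.
    """
    zarr_name = os.path.basename(root.rstrip("/"))
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        rel_dir = os.path.relpath(dirpath, root)
        rel_dir = "" if rel_dir == "." else rel_dir.replace(os.sep, "/") + "/"
        # Only files produce rows; directories themselves are skipped.
        for name in sorted(filenames):
            if clientpath is None:
                cp = UNKNOWN
            else:
                cp = f"{clientpath.rstrip('/')}/{rel_dir}{name}"
            yield [
                f"{prefix}{zarr_name}/{rel_dir}",
                name,
                "application/octet-stream",
                cp,
            ]
```

Calling this with clientpath="https://s3-server/bucket/data.zarr" would give each file a URL under that base, while leaving clientpath unset gives unknown for every row.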

Tested at IDR/idr-utils#56 (comment)

will-moore · 2023-10-11T22:00:17Z

idr0004,Screen:202,S-BIAD867
idr0010,Screen:1351,S-BIAD885
idr0011,Screen:1501,S-BIAD866
idr0011,Screen:1551,S-BIAD866
idr0011,Screen:1601,S-BIAD866
idr0011,Screen:1602,S-BIAD866
idr0011,Screen:1603,S-BIAD866
idr0012,Screen:1202,S-BIAD845
idr0013,Screen:1101,S-BIAD865
idr0013,Screen:1302,S-BIAD865
idr0015,Screen:1201,S-BIAD861
idr0016,Screen:1251,S-BIAD851
idr0025,Screen:1851,S-BIAD846
idr0026,Project:301,S-BIAD860
idr0033,Screen:1751,S-BIAD848
idr0035,Screen:2001,S-BIAD847
idr0036,Screen:1952,S-BIAD855
idr0051,Project:552,S-BIAD815
idr0054,Project:701,S-BIAD800
idr0090,Screen:2851,S-BIAD882
idr0091,Dataset:1351,S-BIAD852

pip install 'omero-mkngff @ git+https://github.com/will-moore/omero-mkngff@clientpath'

# 1 plate from idr0004
omero mkngff clientpath Plate:1751 https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/

# all of idr0004
omero mkngff clientpath Screen:202 https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/

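# csv above, saved as ngff_filesets.csv...
for r in $(cat ngff_filesets.csv); do
  # column 2 is the target (e.g. Screen:202), column 3 is the BioStudies accession
  target=$(echo "$r" | cut -d',' -f2)
  biad=$(echo "$r" | cut -d',' -f3)
  echo "$target"
  omero mkngff clientpath "$target" "https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/$biad/"
done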

will-moore · 2023-10-12T06:03:06Z

After running for nearly 8 hours, we are 14 plates into idr0012 (approx. 400 plates done), so it will be at least another day before this is complete!
This seems the wrong way to go when we've only just generated the filesets.

@joshmoore I wonder if we could teach the sql function mkngff_fileset() to populate the clientpath as in the description above? The trouble is that we don't want to regenerate all the sql files from scratch, although we could add in the base URL for a Fileset e.g. https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/103d9428-b86b-4f4e-84d8-966b5d89aae1/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr into the parameter list.

Then, for each row in the array, e.g.

['demo_2/2015-10/01/07-25-30.185_mkngff/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr/A/10/0/3/', '.zarray', 'application/octet-stream'],

we'd need to be able to generate the clientpath within the mkngff_fileset() function, possibly by splitting the path on .zarr to get the relative path A/10/0/3, then appending that and the name to the base URL to get e.g.
https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/103d9428-b86b-4f4e-84d8-966b5d89aae1/103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr/A/10/0/3/.zarray
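To make the intended transformation concrete, here is a minimal Python sketch of the split-and-join. It is only an illustration of the logic that would have to be expressed in SQL; the function name clientpath_from_row is hypothetical, and splitting on the first ".zarr/" is an assumption that would break if an earlier path component also ended in .zarr.

```python
def clientpath_from_row(base_url: str, path: str, name: str) -> str:
    """Build a clientpath from a Fileset base URL and one (path, name) row.

    Hypothetical helper: splits `path` on the first ".zarr/" and appends
    the remainder plus `name` to `base_url`.
    """
    _head, sep, relative = path.partition(".zarr/")
    if not sep:
        raise ValueError(f"no '.zarr/' component in path: {path!r}")
    # `relative` is "" for files at the top of the zarr, else e.g. "A/10/0/3/"
    return f"{base_url.rstrip('/')}/{relative}{name}"


base = (
    "https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD867/"
    "103d9428-b86b-4f4e-84d8-966b5d89aae1/"
    "103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr"
)
row_path = (
    "demo_2/2015-10/01/07-25-30.185_mkngff/"
    "103d9428-b86b-4f4e-84d8-966b5d89aae1.zarr/A/10/0/3/"
)
print(clientpath_from_row(base, row_path, ".zarray"))
```
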

Is that possible within sql language?

will-moore · 2023-10-13T05:39:40Z

Still running...

Fileset 6312826
tosave 3061
Fileset 6312697
tosave 3061

This is taking 3 minutes per Fileset just now....

will-moore · 2023-10-13T16:22:53Z

get_filesets Screen:1251
Fileset 6313488
tosave 14610

joshmoore · 2023-10-14T21:11:49Z

> Is that possible within sql language?

I'm not sure I fully understand, but in general you can do anything with SQL, if slightly more verbosely.

I like your idea of templating the output, but there would still need to be checks for the existence of the files, no?

will-moore · 2023-10-16T15:09:27Z

Having experimented with doing this in the mkngff_fileset() function within the setup.sql script, I have given up. Instead, I'm going to simply pass the clientpath as a 4th item in each row that creates an OriginalFile.

This also means that we don't need the complex logic to resolve clientpath from path and name.

e.g.

$ omero mkngff sql 1591301 --clientpath="https://s3/path/to/image.zarr" /path/to/data/6001247.zarr

Found prefix: demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023 for fileset: 1591301

UPDATE pixels SET name = '.zattrs', path = 'demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr' where image in (select id from Image where fileset = 1591301);

begin;
    select mkngff_fileset(
      1591301,
      'SECRETUUID',
      'cdf35825-def1-4580-8d0b-9c349b8f78d6',
      'demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/',
      array[
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/', '.zattrs', 'application/octet-stream', 'https://s3/path/to/image.zarr/.zattrs'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/', '.zgroup', 'application/octet-stream', 'https://s3/path/to/image.zarr/.zgroup'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/0/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/0/.zarray'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/1/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/1/.zarray'],
          ['demo_2/Blitz-0-Ice.ThreadPool.Server-5/2019-03/15/15-27-40.023_mkngff/6001247.zarr/2/', '.zarray', 'application/octet-stream', 'https://s3/path/to/image.zarr/2/.zarray']
      ]::text[][]
    );
commit;
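For illustration, a block like the one above could be rendered from 4-item rows roughly as follows. This is a sketch, not the plugin's implementation: render_mkngff_sql and sql_quote are hypothetical names, the parameter names are guesses at what each positional argument represents, and quoting is done by doubling single quotes, which is the standard SQL string-literal escape.

```python
from typing import Sequence


def sql_quote(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def render_mkngff_sql(
    fileset_id: int,
    secret: str,
    uuid: str,
    prefix: str,
    rows: Sequence[Sequence[str]],
) -> str:
    """Render a mkngff_fileset() call where each row carries 4 items:
    path, name, mimetype and clientpath (hypothetical helper)."""
    if not rows:
        raise ValueError("at least one row is required")
    for row in rows:
        if len(row) != 4:
            raise ValueError(f"expected 4 items per row, got {len(row)}")
    row_lines = ",\n".join(
        "          [" + ", ".join(sql_quote(v) for v in row) + "]"
        for row in rows
    )
    return "\n".join(
        [
            "begin;",
            "    select mkngff_fileset(",
            f"      {int(fileset_id)},",
            f"      {sql_quote(secret)},",
            f"      {sql_quote(uuid)},",
            f"      {sql_quote(prefix)},",
            "      array[",
            row_lines,
            "      ]::text[][]",
            "    );",
            "commit;",
        ]
    )
```

Because every row has the same length, the result stays a rectangular text[][] array, which a mix of 3-item and 4-item rows would not.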

will-moore · 2023-10-17T14:24:13Z

Tested at IDR/idr-utils#56 (comment)
with:

omero mkngff sql 4053141 --clientpath=https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr --secret=$SECRET /bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr > 4053141.sql

Re: @joshmoore "checks for the existence of the files" - I'm not sure what you mean. In that example the clientpath values are set to files under https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD852/f12bdada-57eb-4fab-90ef-9655e4106497/f12bdada-57eb-4fab-90ef-9655e4106497.zarr, but we don't check for their existence.

Add clientpath command

edf7be3

will-moore added 2 commits October 16, 2023 14:31

Revert "Add clientpath command"

10abb27

This reverts commit edf7be3.

Add clientpath support to sql and setup.sql

0d2615a

Ignore Directories in walk() of fileset

2469f6d

will-moore mentioned this pull request Oct 16, 2023

Filesets to swap IDR/idr-utils#56

Closed

will-moore changed the title from "Add clientpath command" to "Add clientpath to Filesets" Oct 24, 2023

joshmoore merged commit e874ca3 into IDR:main Oct 31, 2023
1 check failed
