Bagit export updates #677

rstorey · 2018-11-29T18:19:43Z

Closes #664

rstorey · 2018-11-29T18:23:31Z

In the interest of making this project more reusable, should we have this as something like a dictionary in the settings file?

It seems that our BagIt export (at least for now) is going to be strictly for the purpose of getting transcription data into www.loc.gov. The most recent changes make it even more LC-specific.
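A minimal sketch of what the settings-dictionary idea could look like, assuming "this" refers to the bag-info.txt values. The setting name `BAGIT_BAG_INFO`, the helper `build_bag_info`, and every value below are hypothetical placeholders, not taken from the project; only the field names come from the BagIt specification's reserved metadata elements.

```python
# Hypothetical settings fragment: site-wide defaults for bag-info.txt.
# The name BAGIT_BAG_INFO and all values are illustrative assumptions.
BAGIT_BAG_INFO = {
    "Source-Organization": "Example Organization",
    "Contact-Email": "exports@example.org",
}


def build_bag_info(defaults, overrides=None):
    """Merge site-wide bag-info.txt defaults with per-export values."""
    bag_info = dict(defaults)  # copy so the settings dict is never mutated
    if overrides:
        bag_info.update(overrides)  # per-export values win over defaults
    return bag_info
```

The merged dictionary could then be handed to whatever writes bag-info.txt; bagit-python's `make_bag` accepts such a dictionary as its second argument. A deployment outside LC would only need to override the settings value.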

When we are ready to talk about/implement public exports, I'm not sure that BagIt 1.0 bags are what we want to use. I think the CSV export is better, not the least of which is because it's a streaming response that won't OOM on large exports.
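To illustrate why a streaming CSV response keeps memory flat, here is a rough sketch using only the standard library. It is not the project's exporter; `Echo` and `iter_csv_rows` are names assumed for this example. Each row is encoded and yielded on its own, so the full export never has to sit in memory at once.

```python
import csv


class Echo:
    """Pseudo-buffer whose write() returns the value instead of storing it."""

    def write(self, value):
        return value


def iter_csv_rows(rows, header):
    """Yield CSV-encoded lines one at a time from any iterable of rows."""
    writer = csv.writer(Echo())
    # writerow() returns whatever the underlying write() returned,
    # which here is the encoded line itself.
    yield writer.writerow(header)
    for row in rows:
        yield writer.writerow(row)
```

In a Django view, a generator like this can be passed to `StreamingHttpResponse` with `content_type="text/csv"`, so rows go out to the client as they are produced.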

coveralls · 2018-11-29T18:28:32Z

Coverage decreased (-0.4%) to 71.414% when pulling 292a451 on bagit-export-updates into 7121c49 on master.

acdha

Do we need to apply published checks here to exclude the campaign/project/item/asset(s) which aren't published yet?

acdha · 2018-11-29T23:00:42Z

exporter/views.py

@@ -35,13 +37,81 @@ def get_original_asset_id(download_url):
    """
    if download_url.startswith("http://tile.loc.gov/"):
        pattern = r"/service:([A-Za-z0-9:\-]*)/"
-        asset_id = re.search(pattern, download_url)
+        asset_id = re.search(pattern, download_url).group(1)


This will fail if it gets a value which doesn't match the pattern. We should probably do the if not match: … check even if that's unlikely.

I changed it - let me know what you think.
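For readers following along, a small sketch of the failure mode under discussion: `re.search` returns `None` when nothing matches, so chaining `.group(1)` raises `AttributeError`. The function names and the sample URLs below are made up for illustration; only the pattern comes from the diff.

```python
import re

# Pattern copied from the diff above.
ASSET_ID_PATTERN = r"/service:([A-Za-z0-9:\-]*)/"


def unguarded(download_url):
    """Mirrors the changed line: fails if the pattern does not match."""
    return re.search(ASSET_ID_PATTERN, download_url).group(1)


def guarded(download_url):
    """Checks the match object before asking for the captured group."""
    match = re.search(ASSET_ID_PATTERN, download_url)
    if not match:
        return None  # the caller decides how to handle a missing ID
    return match.group(1)
```

Returning `None` is only one option; raising an explicit error is another.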


acdha · 2018-11-29T23:11:30Z

> When we are ready to talk about/implement public exports, I'm not sure that BagIt 1.0 bags are what we want to use. I think the CSV export is better, not the least of which is because it's a streaming response that won't OOM on large exports.

I suspect that the best answer here is going to be Celery and the work on bagit-python to make it upload directly to S3 (and that could be safely made public with some sort of throttle). That's definitely a separate project but we might want to see whether there's something we can do here to avoid OOMs — do you know where in the code it's hitting that limit?


rstorey · 2018-11-30T21:13:05Z

> When we are ready to talk about/implement public exports, I'm not sure that BagIt 1.0 bags are what we want to use. I think the CSV export is better, not the least of which is because it's a streaming response that won't OOM on large exports.

> I suspect that the best answer here is going to be Celery and the work on bagit-python to make it upload directly to S3 (and that could be safely made public with some sort of throttle). That's definitely a separate project but we might want to see whether there's something we can do here to avoid OOMs. Do you know where in the code it's hitting that limit?

I haven't actually personally witnessed an OOM, so not sure.

rstorey · 2018-11-30T21:29:56Z

Since we're only getting completed transcriptions in the BagIt export, the assets either are published or were published at some point. I think the scenario where something is published, completed, unpublished, and then exported might be a valid use case, so I'm not going to add a filter for publish status at this time. If @elainekamlley wants to weigh in, I defer to them.

acdha · 2018-11-30T22:17:38Z

exporter/views.py

+            logger.error(
+                "Couldn't find a matching asset ID in download URL %s", download_url
+            )
+            raise AssertionError


This could be some other exception class, but I care less about that than that it’s checked. This condition will usually be telling us that we got some bogus data from the API, and that edge case probably means we need to get some records fixed.
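As one possible reading of "some other exception class", the sketch below swaps `AssertionError` for a dedicated exception so callers can catch bad download URLs specifically. `AssetIdNotFound` and `get_asset_id_or_raise` are hypothetical names, not part of the PR; the pattern and log message are taken from the diffs above.

```python
import logging
import re

logger = logging.getLogger(__name__)


class AssetIdNotFound(ValueError):
    """Hypothetical exception for download URLs without an asset ID."""


def get_asset_id_or_raise(download_url):
    """Return the asset ID, or log and raise if the URL looks bogus."""
    match = re.search(r"/service:([A-Za-z0-9:\-]*)/", download_url)
    if not match:
        logger.error(
            "Couldn't find a matching asset ID in download URL %s", download_url
        )
        # Carry the offending URL so the bad record can be tracked down.
        raise AssetIdNotFound(download_url)
    return match.group(1)
```

Subclassing `ValueError` is a design choice for this sketch: existing handlers that catch `ValueError` would still see it, while new code can catch the narrower class.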

acdha

I’m cutting my review short for family reasons but it looks good enough to me to merge for testing

rstorey added 7 commits November 28, 2018 14:27

Refs #664, remove project folders from exported BagIt structure

7bf164e

Add bag-info.txt fields to exported bags, Refs #664

041c91b

Add project level bagit export

56e7175

Add item level bagit export

4d48385

update bag-info.txt values

f44d2a5

Use asset ID to create folder structure in bagIt export, Refs #664

68ebed3

Remove timestamp from exported bag filename

ebd00dd

Update export bagit test for new folder structure

7fe3d65

rstorey requested a review from acdha November 29, 2018 18:29

Update isort configuration to more closely match Black

1678337

rstorey self-assigned this Nov 29, 2018

rstorey requested a review from elainekamlley November 29, 2018 20:54

acdha reviewed Nov 29, 2018


acdha and others added 8 commits November 30, 2018 16:09

Update exporter/views.py

af4885d

Co-Authored-By: rstorey <rstorey@users.noreply.github.com>

Update exporter/views.py

ea35edd

Co-Authored-By: rstorey <rstorey@users.noreply.github.com>

Update exporter/views.py

0e43a68

Co-Authored-By: rstorey <rstorey@users.noreply.github.com>

Update exporter/views.py

f32d6b1

Co-Authored-By: rstorey <rstorey@users.noreply.github.com>

Update exporter/views.py

04447df

Co-Authored-By: rstorey <rstorey@users.noreply.github.com>

Update exporter/views.py

280143e

Co-Authored-By: rstorey <rstorey@users.noreply.github.com>

Update exporter/views.py

5df2f97

Co-Authored-By: rstorey <rstorey@users.noreply.github.com>

Update exporter/views.py

d4f4302

Co-Authored-By: rstorey <rstorey@users.noreply.github.com>

if download URL doesn't contain matching asset id, fail explicitly

292a451

rstorey requested a review from acdha November 30, 2018 21:31

acdha reviewed Nov 30, 2018


acdha approved these changes Nov 30, 2018


rstorey merged commit b5f611a into master Dec 3, 2018

rstorey deleted the bagit-export-updates branch December 3, 2018 14:46
