[RFC0022] serving images for layout analysis #18

ta4tsering · 2023-02-03T07:21:52Z

Work Planning

Details

Housekeeping

[RFC0022] serving images for layout analysis

ALL BELOW FIELDS ARE REQUIRED

Named Concepts

Explain any new concepts introduced in this request.

Summary

Selected unique images need to be processed with the same methodology used for page annotation images, then served to the Prodigy recipe through a CSV file so they can be streamed to the layout_analysis instances.

Reference-Level Explanation

I will receive a .csv file containing the repo name (OCR###) and the work_id (W######).
Download the repo and go through its unique_images folder to get the list of unique images.
Use the work_id to get the s3_keys of all the images in that work on S3.
Find the s3_keys of the images in the unique_images list and write them to a .txt file.
Parse the .txt file of s3_keys for the selected unique images and process those images with the same methodology used for the sample images for page annotation: resize, compress, and encode each image using Pillow.
Upload the processed images to an S3 bucket such as openpecha.bdrc.io and append the s3_key of each uploaded image to a csv_file.
Give the csv_file_path to the Prodigy recipe so it can parse the CSV file into a list of s3_keys to stream on prodigy.bdrc.io/layout_analysis/.
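
The key-matching steps above (reading the input CSV, selecting the s3_keys of the unique images, and writing them to a .txt file) could be sketched roughly as follows. This is only an illustration, not the existing scripts: the function names are hypothetical, the input CSV is assumed to have two columns (repo name, work_id) with no header row, and a unique image is assumed to match an S3 key when their file names are identical.

```python
import csv
from pathlib import PurePosixPath


def read_work_list(csv_path):
    """Return (repo_name, work_id) pairs from the input CSV.

    Assumption: two columns per row, no header row.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [
            (row[0].strip(), row[1].strip())
            for row in csv.reader(f)
            if len(row) >= 2
        ]


def select_unique_keys(all_s3_keys, unique_image_names):
    """Keep only the S3 keys whose file name is in the unique_images list.

    Assumption: matching is done on the file name (last path component).
    """
    wanted = set(unique_image_names)
    return [key for key in all_s3_keys if PurePosixPath(key).name in wanted]


def write_keys(keys, txt_path):
    """Write one S3 key per line to a text file."""
    with open(txt_path, "w", encoding="utf-8") as f:
        f.writelines(f"{key}\n" for key in keys)
```

Listing the keys of a work on S3 and downloading the repo are left out here, since they depend on the bucket layout and credentials, which this request does not specify.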

The proposed changes interact with other systems (or other parts of the system that is changed)

The actual implementation will take place

Known challenges can be readily overcome

This section includes practical examples and explains how this proposal makes those examples work.

This section becomes the engineering specification and work plan, so it must be sufficiently detailed to facilitate that.

Alternatives

Confirm that alternative approaches have been evaluated and explain those alternatives briefly.

Rationale

Why was the currently proposed design selected over the alternatives?

What would be the impact of going with one of the alternative approaches?

Is the evaluation tentative, or is it recommended to use more time to evaluate different approaches?

Drawbacks

Describe any particular caveats and drawbacks that may arise from fulfilling this particular request.

Useful References

We already have all the scripts needed.

What similar work have we already successfully completed?

Is this something that has already been built by others?

What other related learnings do we have?

Is there useful academic literature or are there other articles related to this topic? (provide links)

Have we built a relevant prototype previously?

Do we have a rough mock for the UI/UX?

Do we have a schematic for the system?

Unresolved Questions

What is there that is unresolved (and will be resolved as part of fulfilling this request)?

Are there other requests with same or similar problems to solve?

Parts of the System Affected

Which parts of the current system are affected by this request?

What other open requests are closely related with this request?

Does this request depend on fulfillment of any other request?

Does any other request depend on the fulfillment of this request?

Future possibilities

How do you see the particular system, or the part of the system affected by this request, being altered or extended in the future?

Infrastructure

Requires an S3 bucket, such as openpecha.bdrc.io, to upload the processed selected unique images to. @ngawangtrinley

Testing

The image processing was already tested when it was used to process images for page annotation.
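
For reference, the processing step (resize, compress, encode with Pillow) and a quick check of it might look like the sketch below. It is not the existing page annotation script: the function name, the 2000-pixel size limit, the JPEG output format, and the quality value are all assumptions chosen for illustration.

```python
import io

from PIL import Image


def process_image(image_bytes, max_side=2000, quality=75):
    """Resize, compress, and re-encode one image; returns JPEG bytes.

    Assumptions: max_side, quality, and the JPEG format are placeholders,
    not the values used by the page annotation pipeline.
    """
    with Image.open(io.BytesIO(image_bytes)) as img:
        rgb = img.convert("RGB")  # JPEG cannot store an alpha channel
    # thumbnail() resizes in place, keeps the aspect ratio, and never upscales
    rgb.thumbnail((max_side, max_side))
    buffer = io.BytesIO()
    rgb.save(buffer, format="JPEG", quality=quality, optimize=True)
    return buffer.getvalue()


if __name__ == "__main__":
    # Quick check on a synthetic image instead of a real scan
    source = io.BytesIO()
    Image.new("RGB", (4000, 1000), "white").save(source, format="PNG")
    result = process_image(source.getvalue())
    with Image.open(io.BytesIO(result)) as out:
        print(out.format, out.size)
```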

Documentation

Describe the level of documentation fulfilling this request involves. Consider both end-user documentation and developer documentation.

Version History

v0.1

Recordings

Links to audio recordings of related discussion.

Work Phases

Parse the .csv file containing the repo name (OCR###) and the work_id (W######).
time estimation: 10 min
time taken: 10 min
Download the repo and go through its unique_images folder to get the list of unique images.
time estimation: 10 min
time taken: 10 min
Use the work_id to get the s3_keys of all the images in that work on S3.
time estimation: 1 hour
time taken: 1 hour
Find the s3_keys of the images in the unique_images list and write them to a .txt file on prodigy-tools.
time estimation: 30 min
time taken: 30 min
Process the unique images listed in the text file with the same methodology used for page annotation image processing, and upload the images to the S3 bucket, for example openpecha.bdrc.io. #19
time estimation: 1 hour
time taken:
Give the csv_file_path to the Prodigy recipe so it can go through the list of s3_keys to stream on prodigy.bdrc.io/layout_analysis/. #20
time estimation: 10 min
time taken:

Planning

Keep original naming and structure, and keep as first section in Work phases section

RFC completed on:
Estimated time:
Actual time:
RFC reviewed and approved by:
Estimated time:
Actual time:

Implementation

A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.

PR 1
Estimated time:
Actual time:
PR 2
Estimated time:
Actual time:

Completion

Tested and approved by: @username @username
Estimated time:
Actual time:
Documentation approved @evanyerburgh
Estimated time:
Actual time:

ta4tsering self-assigned this Feb 3, 2023

ta4tsering transferred this issue from OpenPecha/Requests Feb 8, 2023