[RFC0014] Automated Image-cropping Pipeline #30

Open · 21 of 41 tasks

Zakongjampa opened this issue Nov 11, 2022 · 4 comments

RFC0014 Automated Image-cropping Pipeline

Named Concepts

  • Prodigy: the annotation tool at prodi.gy
  • Image: Images will be accessed from the AWS S3 server, bucket name image-processing.bdrc.io. The system or model should work on any type of image, regardless of whether it is in pecha format or modern publication format.
  • Pecha page: Refers to the traditional Tibetan book format in landscape orientation. In the context of this project, several Pecha page sides are captured in a single image.

Summary

BDRC has many images that contain several Pecha pages. We need to automate the image-cropping process with a custom computer vision model. This project will use Prodigy as a human-in-the-loop pipeline to create an initial training dataset, train a model, and iteratively improve it.

Reference-Level Explanation

**System Diagram:**

(image: system diagram)

prodigy image.manual images_dataset ./images --label PECHA

In this command line:
We use the built-in image.manual recipe: Prodigy loads the images from ./images into the images_dataset dataset, and we manually draw the boundary of each pecha page with the label PECHA.

Preparing the training dataset

Here we manually draw boundaries on each image to build the training dataset. We can make this go faster by deploying the Prodigy server to the web using AWS, so that more people can take part in drawing the boundary around each PECHA at the same time.

Check whether we have enough training data to train the model

prodigy train-curve --ner ds_GOLD

It will print the accuracy figures and the accuracy improvements gained with more data. This recipe takes pretty much the same arguments as train.

Train the model

You can use the train recipe to train within Prodigy, or train outside of it using spaCy or other NLP packages.

prodigy train --ner ds_GOLD ./tmp_model --eval-split 0.25

--ner -> tells Prodigy that you are training a NER component
ds_GOLD -> name of the local dataset that holds your manual annotations
./tmp_model -> path where Prodigy will create your model
--eval-split -> the train/test split ratio you want Prodigy to use when splitting the annotations in your dataset

Human in the Loop

Once we have a basic model, we can speed up the cropping process dramatically by letting the model attempt the rest of the image cropping.

prodigy ner.teach corrected_db ./tmp_model ./local.jsonl --label PECHA

The model takes over, proposes crops for the rest of the dataset, and binarizes the decision process into an ACCEPT or REJECT for you.

If we notice that the model is not doing the cropping job well, the training dataset needs further correction, which we can do with the ner.correct recipe.

prodigy ner.correct corrected_db ./tmp_model ./local.jsonl --label PECHA

Prodigy output

Through Prodigy, we will get the coordinates of the image borders.

(image: example of Prodigy's output coordinates)
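The exact fields vary by Prodigy version, but an annotation produced by image.manual looks roughly like the line below (the coordinates here are made up for illustration):

{"image": "I8LS766730003.jpg", "spans": [{"label": "PECHA", "type": "rect", "points": [[120, 50], [120, 640], [980, 640], [980, 50]]}], "answer": "accept"}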

Image Cropping

We use the Python Imaging Library (PIL/Pillow), which gives the Python interpreter image-editing capabilities. This library can crop an image based on the coordinates we get from Prodigy.
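A minimal sketch of this cropping step, assuming Pillow and the span format shown above (the function and directory names are illustrative):

```python
# Rough sketch: crop each image with Pillow using the Prodigy annotations.
# Assumes the JSONL span format shown above; names here are illustrative.
import json
from pathlib import Path
from PIL import Image

def crop_annotations(jsonl_path: str, image_dir: str, out_dir: str) -> None:
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for line in Path(jsonl_path).read_text().splitlines():
        task = json.loads(line)
        img = Image.open(Path(image_dir) / task["image"])
        stem = Path(task["image"]).stem
        ext = Path(task["image"]).suffix
        # One output file per annotated PECHA span:
        # name.jpg -> name_1.jpg, name_2.jpg, ...
        for i, span in enumerate(task.get("spans", []), start=1):
            xs = [p[0] for p in span["points"]]
            ys = [p[1] for p in span["points"]]
            # Pillow crops with a (left, upper, right, lower) box.
            box = (min(xs), min(ys), max(xs), max(ys))
            img.crop(box).save(Path(out_dir) / f"{stem}_{i}{ext}")
```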

Alternatives

Manually cropping each image to its borders; however, BDRC doesn't have the manpower to do this work by hand.

Rationale

- Why was the currently proposed design selected over alternatives?
  - Manually cropping each image is a tedious job and we don't have the manpower to do it.
  - Doing it without deploying to AWS would delay the completion date.
- What would be the impact of going with one of the alternative approaches?
  - Based on my understanding, the proposed design is the better solution.
- Is the evaluation tentative, or is it recommended to use more time to evaluate different approaches?
  - Yes, the evaluation is tentative.

Drawbacks

We need AWS to host the images and a domain to make the annotation interface available to other people on the internet so they can draw the boundaries.

Useful References

- Prodigy: [Prodi.gy computer vision docs](https://prodi.gy/docs/computer-vision)
- [Using Prodigy for NLP text annotation](https://medium.com/mlearning-ai/using-prodigy-for-nlp-text-annotation-revolution-ai-for-spacy-e5561d93a361)
- [spaCy v3.4 documentation](https://spacy.io/usage/v3-4)

Unresolved Questions

- What is there that is unresolved (and will be resolved as part of fulfilling this request)?

Prodigy is mostly used for Named Entity Recognition (NER), so most of the documentation and online articles cover that use case. When I went through the documentation, it did not give the same level of instruction for images. Hence, there is no way of confirming in advance that the workflow will work the same way for images.

  • Are there other requests with the same or similar problems to solve?
    No to my knowledge.

Parts of the System Affected

  • Which parts of the current system are affected by this request?: None
  • What other open requests are closely related to this request?: None
  • Does this request depend on the fulfillment of any other request?: No
  • Does any other request depend on the fulfillment of this request?: No

Future possibilities

- We can run Prodigy and the system built around it, so we don't have to crop the images ourselves.
- The model will crop the images and save them in the same format to a specified location.

Infrastructure

**Front end** - Nothing to build: Prodigy ships with a web interface for drawing rectangle or polygon shapes onto the image.

Backend

  • Build a training dataset big enough for the model to learn from.
  • Train the model and get the JSONL file out of the database using the db-out recipe.
  • Based on the coordinates, crop the image, and rename it based on the name of the original image file.
  • Save the cropped images in a directory of the S3 bucket.
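For example, assuming the dataset name corrected_db used earlier, the annotations can be exported with:

prodigy db-out corrected_db > ./local.jsonl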

Testing

We will measure the performance of the model by training it and testing it on the remaining PECHA images. We will check the accuracy by using the teach and correct recipes.

Documentation

  • User documentation

    • Usage
  • Developer documentation

    • comment on all modules
    • comment on all classes
    • comment on all methods
    • comment on all functions

Version History

  • v0.0.2

Recordings

- None

Work Phases

  • Pre-processing of images from BDRC's S3 server

    • according to the requirements from Elie, as a dict of image_options (see the sketch after this list)

      • the maximum width should be 2000
      • the maximum height should be 700
      • the quality of the image should be 75%
      • the image should be encoded using progressive encoding, default True
      • the image should be converted to greyscale if greyscale is True, default is False
    • the image should be checked for whether it is binary or non-binary, and the processed file named accordingly:
      if binary: new_filename = origfilename + "_" + str(degree) + ".png"
      else: new_filename = origfilename + "_" + str(degree) + ".jpg"

    • processed images will be uploaded back to S3, to use in Prodigy

  • Creating a custom recipe as per our requirements

    • creating the recipe to stream images directly from S3 to the Prodigy server
  • encode each image's S3 URL in the JSONL file for the Prodigy server to use in the recipe
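A minimal sketch of the pre-processing step described above, using Pillow; the function name, the handling of the degree parameter, and the binary check via Pillow's 1-bit mode are assumptions, not a fixed design:

```python
# Sketch of the pre-processing step under the image_options requirements.
from PIL import Image

image_options = {
    "max_width": 2000,
    "max_height": 700,
    "quality": 75,
    "progressive": True,
    "greyscale": False,
}

def preprocess(path: str, degree: int, options: dict = image_options) -> str:
    img = Image.open(path)
    # A 1-bit image is treated as binary and kept as lossless PNG.
    is_binary = img.mode == "1"
    if options["greyscale"] and not is_binary:
        img = img.convert("L")
    # thumbnail() shrinks in place so the image fits within
    # max_width x max_height, preserving the aspect ratio.
    img.thumbnail((options["max_width"], options["max_height"]))
    stem = path.rsplit(".", 1)[0]
    if is_binary:
        new_filename = f"{stem}_{degree}.png"
        img.save(new_filename)
    else:
        new_filename = f"{stem}_{degree}.jpg"
        if img.mode not in ("RGB", "L"):
            img = img.convert("RGB")  # JPEG cannot store palette/alpha modes
        img.save(new_filename,
                 quality=options["quality"],
                 progressive=options["progressive"])
    return new_filename
```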

Non-Coding


  • Planning
  • Documentation
  • Testing

Implementation

A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.

Pipeline for processing and getting the images ready for the Prodigy server

@ta4tsering

  • Pre-processing of images from BDRC's S3 server
    • according to the requirements from Elie, as a dict of image_options
      the maximum width should be 2000
      the maximum height should be 700
      the quality of the image should be 75%
      the image should be encoded using progressive encoding, default True
      the image should be converted to greyscale if greyscale is True, default is False
      the image should be checked for whether it is binary or non-binary, and the processed file named accordingly

@ta4tsering

  • Types of image processing
    • binary and non-binary images
    • raw and zipped raw images; raw images opened using raw-pillow-opener
  • processed images will be uploaded back to S3, to use in Prodigy
  • Creating a custom recipe as per our requirements
    • creating the recipe to stream images directly from S3 to the Prodigy server

@Zakongjampa
Alternative method to stream S3 images into Prodigy using JSONL

  • encode each image's S3 URL in the JSONL file for the Prodigy server to use in the recipe (see the sketch below)
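A rough sketch of generating that JSONL file, assuming boto3 and path-style S3 URLs; the URL pattern and the output file name local.jsonl are assumptions:

```python
# Write a JSONL file of S3 image URLs for Prodigy to stream.
# The bucket name comes from this RFC; the boto3 usage is an assumption.
import json
import boto3

bucket = "image-processing.bdrc.io"
s3 = boto3.client("s3")

with open("local.jsonl", "w") as f:
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            url = f"https://s3.amazonaws.com/{bucket}/{obj['Key']}"
            # Each Prodigy image task is a JSON object with an "image" key.
            f.write(json.dumps({"image": url}) + "\n")
```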

Training prodigy image-cropping model

@Zakongjampa

  • server readiness
    • Prodigy server running on the EC2 server
    • Deploy the local host to the web server and assign it to the subdomain prodigy.bdrc.io
    • Connect it to the domain name using nginx as a reverse proxy

@ta4tsering

  • Training data preparation
    • Pre-process the images and get them ready for the Prodigy server for humans to annotate
    • Processed images are uploaded to the S3 bucket

@Zakongjampa

  • Human-in-the-loop to annotate or crop the images

    • processed images are streamed to Prodigy for humans to annotate
    • Prodigy then returns the cropped-images file in JSONL format
  • Naming convention for the annotated images output from Prodigy
    Example input: <name>.jpg, e.g. I8LS766730003.jpg
    Example output: <name>_1.jpg, <name>_2.jpg, e.g. I8LS766730003_1.jpg, I8LS766730003_2.jpg

  • train the model

    • when there are enough human-annotated images, train the model
  • Quality control of the Prodigy model

    • pre-annotate the second batch of images using the Prodigy model
    • get humans to correct the pre-annotated images
    • loop these steps until a good-enough model is trained
  • Train using the TensorFlow Object Detection API

    • If the performance of Prodigy's default model is not satisfactory, use a custom model from TensorFlow

@ta4tsering

  • Crop the image using output coordinates from the Prodigy model and reflect it on the original image (see the Pillow sketch in the Image Cropping section above)
    • get the JSONL file and read each image
    • crop it based on the coordinates in the JSONL file
    • get the original image from S3 using its name
    • apply the coordinates to the original image and crop the image
    • upload the cropped images to the S3 server to be used on BDRC.io

Tests

@ta4tsering

  • Image pre-processing test
    • for binary images
    • for non-binary images
    • raw images
    • zipped raw images
@Zakongjampa Zakongjampa self-assigned this Nov 11, 2022
@Zakongjampa Zakongjampa changed the title [RFC0014] [RFC0014] Automated Image-cropping Pipeline Nov 11, 2022
@OpenPecha OpenPecha deleted a comment from ngawangtrinley Nov 23, 2022

ngawangtrinley commented Dec 5, 2022

Optimize images for Full HD on low bandwidth:

  1. Check the orientation
  2. If portrait: resize so height <= 1,080 px
  3. If landscape: resize so width <= 1,920 px
  4. Replace the decimal point in floats with an underscore in the resized file names, i.e. 1.5 --> 1_5, so I1CZ17610227.tif --> I1CZ17610227f1_5.jpg
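A minimal sketch of steps 1-3 using Pillow; the helper name resize_full_hd is illustrative:

```python
# Full-HD resize rule: cap portrait height at 1080 px and landscape
# width at 1920 px, never upscaling.
from PIL import Image

def resize_full_hd(img: Image.Image) -> Image.Image:
    w, h = img.size
    if h > w:   # portrait
        scale = min(1.0, 1080 / h)
    else:       # landscape (or square)
        scale = min(1.0, 1920 / w)
    if scale >= 1.0:
        return img  # already small enough
    return img.resize((round(w * scale), round(h * scale)))
```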


eroux commented Jan 4, 2023

I think the initial concepts are a bit off:

  • the images we're dealing with are not yet served by BDRC through IIIF, and the platform should work on any image; there's nothing specific about BDRC here
  • the system should also work on modern publications that are not pechas; it's just a matter of training data, and there's no reason to limit the scope to the pecha format


ta4tsering commented Jan 5, 2023

Okay, got it: the images we are dealing with are from the S3 server with the bucket name image-processing.bdrc.io, not served by BDRC through IIIF. And the system or model should work on any type of image, regardless of whether it is in pecha format or modern publication format.


eroux commented Jan 5, 2023

yes, that's my point, that's why I thought the sentence

Image: Refers to photographed or scanned images, which are served by BDRC using the IIIF protocol

(towards the beginning) should be replaced
