Add ExternalReader that delegates all reading to an external function #57

mbklein · 2021-03-09T18:32:19Z

Description

This PR adds an option, externalReader, which takes over the task of reading files/chunks from the built-in readers.

Example

In the following example, the input to exifr is an object pointing to a file stored in an Amazon S3 bucket. The custom reader function uses the AWS SDK to grab the chunks exifr needs. With this functionality, the input given to exifr can be literally anything, as long as the supplied externalReader function knows what to do with it.

const AWS = require("aws-sdk");
const exifr = require("exifr");
const s3 = new AWS.S3();

function chunkReader(input, offset, length) {
  return new Promise((resolve, _reject) => {
    let params = {...input};

    if (typeof offset === 'number') {
      let end = length ? offset + length - 1 : undefined;
      params.Range = `bytes=${[offset, end].join('-')}`;
    }

    s3.getObject(params, (err, data) => {
      if (err) {
        console.error(err);
        resolve(undefined);
      } else {
        resolve(data.Body);
      }
    });
  });
}

exifr.parse({ Bucket: "my-aws-bucket", Key: "path/to/some.tif" }, { externalReader: chunkReader })
  .then((exif) => console.log(exif))
  .catch(err => reject(err));

mbklein · 2021-03-21T14:36:35Z

@MikeKovarik Just popping in to say that we’ve been using this code in production for a couple weeks now, and have used it to process more than 200,000 TIFF images of various sizes from 100MB up to almost 4GB. It’s been rock solid and remarkably fast.

MikeKovarik · 2021-03-24T06:49:08Z

Hello @mbklein. I'm so sorry for responding this late. Glad to hear it's working for you. It's impressive!

I actually saw the PR on the day you submitted it and was about to merge it right away. You've hit a great idea with the externalReader and I really like the direct use of AWS instead of just making UrlFetcher work in nodejs. But because of this shift I started thinking about further improving the idea further. Because the user facing API could be better aligned with how exifr does modularity.
So I got to work, changing a few things here and there, thinking "I'll wait with responding until there's something to bring to the table" and "I'll only do this one more change". And then the weekend came. And the next week. And here we are.

Anyway, back to the externalReader. I really like the idea but don't think that passing it through options object is the ideal way. Exifr already has a ground work for modularity in plugins.mjs in the form of Maps to which all the loaded parsers/readers are added (like this fileReaders.set('url', UrlFetcher))

My idea is to implemented custom reader class and add it to exifrs plugin mechanism. This way, I could also just ship the aws parser. Maybe I could finally turn exifr into monorepo and break it up into sepate packages like exifr-parser-jpeg and exifr-reader-aws.
All reader classes, such as UrlFetcher, inherit from ChunkedReader. canHandle method is used verify if the input is parseable by the reader. When exifr.parse(input) is called exifr goes through all loaded readers to pick the one best suited for the input format. If that's an object with Key and Bucket, then the AwsReader will be used without the need to pass options object.

So this is the API I came up with but am open to suggestions.

import exifr from '../src/bundles/full.mjs'

export class ExternalReader extends exifr.ChunkedReader {

  async readWhole() {
    this.chunked = false
    let arrayBuffer = await this._read()
    this._swapArrayBuffer(arrayBuffer)
  }

  async _readChunk(offset, length) {
    let abChunk = await this._read(offset, length)
    let bytesRead = abChunk.byteLength
    if (bytesRead !== length) this.size = offset + bytesRead
    return this.set(abChunk, offset, true)
  }

}

class AwsReader extends ExternalReader {

  static canHandle(input) {
    return input !== undefined
      && typeof input.Bucket === 'string'
      && typeof input.Key === 'string'
  }

  _read(offset, length) {
    // the aws code here
  }

}

exifr.use(AwsReader)
// or
fileReaders.set('aws', AwsReader)

exifr.parse({ Bucket: "my-aws-bucket", Key: "path/to/some.tif" })
  .then(console.log)
  .catch(console.error)

For now it's still open to changes since I'm still changing the logic of ChunkedReader. Specifically the readWhole, readChunked and _readChunk methods. These were internal methods until now, but would need to make more sense to be implemented by user.

PS: This is out of the topic now, but another thing that delayed me was exploring other ways of making UrlFetcher work in nodejs. I ended up with simple shim of fetch implemented with node's built in http and https modules. Here's a snippet:

let httpPromise  = dynamicImport('http',  http => http)
let httpsPromise = dynamicImport('https', https => https)

let fetch = platform.g.fetch || function(url, {headers}) {
  let {port, hostname, pathname, protocol} = new URL(url)
  const options = {
    method: 'GET',
    hostname,
    path: encodeURI(pathname),
    headers
  }
  if (port !== '') options.port = Number(port)
  return new Promise(async (resolve, reject) => {
    let lib = protocol === 'https:' ? await httpsPromise : await httpPromise
    const req = lib.request(options, res => {
      resolve({
        status: res.statusCode,
        arrayBuffer: () => new Promise(resolveAb => {
          let buffers = []
          res.on('data', buffer => buffers.push(buffer))
          res.on('end', () => resolveAb(concatBuffers(buffers)))
        })
      })
    })
    req.on('error', reject)
    req.end()
  })
}

mbklein · 2021-03-24T14:52:59Z

I really like where you've gone with this, especially the idea of breaking up the repo and allowing people to require only the bits they need. It would certainly make it way easier to extend – just write and release a module! – over the current all-in-one.

One small suggestion I have would be to allow the caller to pass the name of a registered reader (or something) when they call parse() (e.g., exifr.parse(input, { reader: 'aws' }), allowing them to bypass all the decision logic if they know they want to use a specific implementation. I like “just do the right thing” code, but I also appreciate when there's a way to impose my will on the process. 😄

mbklein · 2021-05-06T14:45:48Z

Hi @MikeKovarik – just thought I'd check in and see if there's anything I can do to help with where you'd like this PR/issue/refactor to go. Our application is chugging along nicely here using our fork, but it'd be great to get back to the main release at some point.

schonert · 2022-09-21T11:24:52Z

Any news on this? 🤓

Add ExternalReader that delegates all reading to an external function

8a624e3

This was referenced Mar 9, 2021

Chunked reading over HTTP from node? #54

Open

Use the exifr ExternalReader nulib/meadow#1968

Merged

This was referenced Mar 24, 2021

Command-line interface #55

Open

(node:68141) Warning: Closing file descriptor NNN on garbage collection #46

Closed

ombr mentioned this pull request Jun 7, 2021

Urls with query string are not working correctly. This fixes it. #65

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ExternalReader that delegates all reading to an external function #57

Add ExternalReader that delegates all reading to an external function #57

mbklein commented Mar 9, 2021 •

edited

Loading

mbklein commented Mar 21, 2021

MikeKovarik commented Mar 24, 2021 •

edited

Loading

mbklein commented Mar 24, 2021

mbklein commented May 6, 2021

schonert commented Sep 21, 2022

Add ExternalReader that delegates all reading to an external function #57

Are you sure you want to change the base?

Add ExternalReader that delegates all reading to an external function #57

Conversation

mbklein commented Mar 9, 2021 • edited Loading

Description

Example

mbklein commented Mar 21, 2021

MikeKovarik commented Mar 24, 2021 • edited Loading

mbklein commented Mar 24, 2021

mbklein commented May 6, 2021

schonert commented Sep 21, 2022

mbklein commented Mar 9, 2021 •

edited

Loading

MikeKovarik commented Mar 24, 2021 •

edited

Loading