IMPORTANT! Refactor and simplify #15

@johann-petrak

Description

This is a bigger task and should maybe have issues for each of the smaller steps. It also involves updating the LF to generate a different kind of corpus.

The main changes:

  • Redesign the API to be more similar to the native PyTorch Dataset and DataLoader.
  • What is currently "Dataset" should be one of several possible kinds, and all should be subclasses of something that conforms to the PyTorch Dataset API (so we need __len__ and __getitem__(index)).
  • Always adapt the representation on the fly, using an adaptor object or function (something that takes whatever the dataset returns and converts it to whatever the Module needs).
  • Adaptors can be used when creating the dataset or directly by the module. The advantage of using the adaptor in the dataset and thus through the dataloader is that the dataloader can be multi-processing.
  • A dataset MAY actually pre-cache the result of the adaptor in some way (still accessible by index) but we can keep that for later
  • Need to think about whether we should change the representation returned by Dataset to a dictionary to make it clearer what is what. In that case, each feature should have a key "f_" as well.
  • !! also need to think if we should have the same already in the json file created by the LF and represent all instances through json maps, with standard keys for target etc.
  • Here and in the LF: think about how to better name features, especially if there is just one! Probably worth giving up the possibility of user-naming features and instead using f_1, f_2, f_3?
  • LF: simplify the metadata as much as possible!
  • make it possible to use the corpora from torchnlp directly or convert from conll or line corpus format to our own format to easily use the library for other stuff as well!
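
To make the discussion more concrete, here is a minimal sketch of how the pieces above could fit together: a base class exposing __len__ and __getitem__(index), one concrete kind of dataset reading instances as JSON maps, and an adaptor applied on the fly. All names here (BaseDataset, JsonLinesDataset, features_then_target, the "target" key) are hypothetical and only for illustration, not the current API of this library or of the LF. The sketch deliberately does not import PyTorch; a real version would presumably subclass torch.utils.data.Dataset.

```python
import json
from typing import Any, Callable, Dict, Optional, Sequence

# Hypothetical types: an instance is a JSON-style map, an adaptor converts it
# into whatever the Module needs.
Instance = Dict[str, Any]
Adaptor = Callable[[Instance], Any]


class BaseDataset:
    """Map-style dataset: only __len__ and __getitem__(index) are required."""

    def __init__(self, adaptor: Optional[Adaptor] = None):
        self.adaptor = adaptor

    def __len__(self) -> int:
        raise NotImplementedError

    def _raw(self, index: int) -> Instance:
        """Return the unadapted instance; implemented by each kind of dataset."""
        raise NotImplementedError

    def __getitem__(self, index: int) -> Any:
        instance = self._raw(index)
        # Adapt on the fly, so the work happens wherever the item is fetched.
        return self.adaptor(instance) if self.adaptor is not None else instance


class JsonLinesDataset(BaseDataset):
    """One possible kind of dataset: one JSON map per line (assumed format)."""

    def __init__(self, lines: Sequence[str], adaptor: Optional[Adaptor] = None):
        super().__init__(adaptor)
        self._instances = [json.loads(line) for line in lines if line.strip()]

    def __len__(self) -> int:
        return len(self._instances)

    def _raw(self, index: int) -> Instance:
        return self._instances[index]


def features_then_target(instance: Instance):
    """Example adaptor: features in f_1, f_2, ... order, then the target."""
    keys = sorted(
        (k for k in instance if k.startswith("f_")),
        key=lambda k: int(k[2:]),
    )
    return [instance[k] for k in keys], instance["target"]


# Two made-up instances, using generated feature names and a "target" key.
lines = [
    '{"f_1": [3, 7, 2], "f_2": "NN", "target": "pos"}',
    '{"f_1": [5, 1], "f_2": "VB", "target": "neg"}',
]
raw = JsonLinesDataset(lines)
adapted = JsonLinesDataset(lines, adaptor=features_then_target)
```

Without an adaptor, raw[0] gives back the dictionary as stored; with one, adapted[0] gives the (features, target) pair. Putting the adaptor inside the dataset is what would let a multi-processing dataloader do the conversion in its workers, while a module could equally call the same function itself on the raw dictionaries.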
