Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/sg 729 abstract cocodataset #777

Merged
merged 31 commits into from
Mar 16, 2023

Conversation

Louis-Dupont
Copy link
Contributor

@Louis-Dupont Louis-Dupont commented Mar 13, 2023

Abstracting Coco dataset.
Changes bewteen the coco and its abstraction:

  • init parameters have different names (they also represent different things)

@dagshub
Copy link

dagshub bot commented Mar 13, 2023

@Louis-Dupont Louis-Dupont marked this pull request as ready for review March 13, 2023 12:47
BloodAxe
BloodAxe previously approved these changes Mar 16, 2023
Copy link
Collaborator

@BloodAxe BloodAxe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

But I also wanted to leave my 5 cents here:

I think at some point we would need to make something DatasetReader class with logic to parse a dataset in whatever format into List of samples and use that to instantiate dataset classes. What I mean is this:

class COCODetectionDatasetReader:
  def __init__(images_dir, annotation_file, with_crowd, ...):
       ...
  
  def load_samples() -> List[DetectionSample]:
       ...

reader = COCODetectionDatasetReader(...)
samples = reader.load_samples()
dataset = DetectionDataset(samples, transforms)

So the idea is DetectionDataset has zero knowledge about where the data is coming from. E.g instead of sub classing we use composition, which is more flexible:

reader = COCODetectionDatasetReader(...)
reader = PascalVOCDetectionDatasetReader(...)
reader = DetectionDatasetReader(...)

And DetectionSample to provide all necessary information for a single image sample:

@dataclass
class DetectionSample
   image_path: str
   bboxes: np.ndarray # [N, 4] in XYXY format
   labels: np.ndarray # [N]
   is_crowd: np.ndarray # [N]      

Note, let's say for some customer we have to have additional label attached to bbox. No problem:

@dataclass
class CustomerSpecificDetectionSample(DetectionSample):
   additional_property: np.ndarray # [N]


class CustomerSpecificDetectionDatasetReader:
  def load_samples() -> List[CustomerSpecificDetectionSample]:
       ...

I realize this is much of work and def. goes out of the scope of this PR.
So this is more to raise an awareness and suggest this for future sprints.

@Louis-Dupont
Copy link
Contributor Author

It sounds like a great idea, I guess that we could also make it so that load_samples would be a generator to provide an option not to cache labels.

@Louis-Dupont Louis-Dupont merged commit c986887 into master Mar 16, 2023
@Louis-Dupont Louis-Dupont deleted the feature/SG-729-abstract_cocodataset branch March 16, 2023 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants