Implement script for extracting features #85

Closed
AlekseySh opened this issue Aug 9, 2022 · 15 comments · Fixed by #234 or #384

@AlekseySh
Contributor

AlekseySh commented Aug 9, 2022

We need something like python extract.py plus a YAML config that parametrizes the model and the DataLoader.

We should probably store features in HDF5 format, which may be useful for users who don't know Python.

@AlekseySh AlekseySh created this issue from a note in OML-planning (backlog) Aug 9, 2022
@AlekseySh AlekseySh moved this from backlog to To do in OML-planning Aug 10, 2022
@AlekseySh AlekseySh added the good first issue Good for newcomers label Aug 10, 2022
@armored-guitar armored-guitar self-assigned this Aug 10, 2022
@AlekseySh AlekseySh moved this from To do to backlog in OML-planning Aug 19, 2022
@churnikov
Contributor

Hi!

I'd like to work on this issue. I hope it's not done yet :)

@AlekseySh
Contributor Author

@churnikov you are welcome! @DaloroAT will provide you with a road map today or tomorrow

@churnikov
Contributor

Sounds good :)

I've looked through the contribution guide and I hope you'll give me details on what you expect :)

@DaloroAT
Collaborator

DaloroAT commented Oct 27, 2022

Hey @churnikov !

We can split this script into several parts:

  1. Read the config and extract the parameters for the model, transforms, dataloader and the folder with images
  2. Prepare a ListDataset to process the list of images
  3. Pass this dataset into a dataloader
  4. Save the features to a file

We have no ListDataset in our repo now so we can start from point 2. After implementing and merging the code with ListDataset we can return to the rest of the points and discuss the design.


Design

ListDataset should be very simple. It can iterate over files, apply transforms, and return tensors.

from torch.utils.data import Dataset
import torch

class ListDataset(Dataset):
    def __init__(self, filenames_list, transforms, f_imread):
        ...

    def __getitem__(self, idx) -> torch.Tensor:
        ...

Please don't forget to support different types of transforms. We currently use transforms from albumentations and the default torchvision transforms; see oml.datasets.base.BaseDataset for how they are handled.

Transforms are responsible for augmentations of the original images, resizing and normalizing.
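A minimal sketch of the design above, not the implementation that was eventually merged. It relies on the fact that a map-style dataset only needs __len__ and __getitem__, so the class is left duck-typed here; in the real code it would subclass torch.utils.data.Dataset as in the skeleton. The helper apply_transforms and the way it detects albumentations (by the module name of the transforms object) are assumptions made for illustration.

```python
from pathlib import Path
from typing import Any, Callable, Sequence


def apply_transforms(transforms: Callable, image: Any) -> Any:
    """Call `transforms` using the convention of the library it comes from."""
    # Assumption: the library is detected from the module name of the object.
    library = type(transforms).__module__.split(".")[0]
    if library == "albumentations":
        # albumentations pipelines take keyword arguments and return a dict.
        return transforms(image=image)["image"]
    # torchvision-style transforms are plain callables: image in, result out.
    return transforms(image)


class ListDataset:
    """Iterates over files, reads each one and applies the transforms."""

    def __init__(
        self,
        filenames_list: Sequence[Path],
        transforms: Callable,
        f_imread: Callable[[Path], Any],
    ):
        self.filenames_list = list(filenames_list)
        self.transforms = transforms
        self.f_imread = f_imread

    def __len__(self) -> int:
        return len(self.filenames_list)

    def __getitem__(self, idx: int) -> Any:
        image = self.f_imread(self.filenames_list[idx])
        return apply_transforms(self.transforms, image)
```

Note that f_imread has to return the type the chosen transforms expect: albumentations works on numpy arrays, while torchvision transforms usually take PIL images or tensors.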

Test

Iterate over the files in the mock dataset with a dataloader and check the shapes of the batches.

The mock dataset can be found at:

from oml.const import MOCK_DATASET_PATH
print(MOCK_DATASET_PATH / 'images')

If you don't have the mock dataset locally, run make download_mock_dataset from the root of the project.


What do you think @AlekseySh ?

@AlekseySh
Contributor Author

@DaloroAT Agree, let's start by implementing ListDataset in a dedicated PR. Then we can finish the rest of the functionality.

Is the task clear to you, @churnikov ?

@churnikov
Contributor

Sounds good :)

@DaloroAT
Collaborator

Great!

I created a separate issue #205 for this particular task. Let's continue there @churnikov @AlekseySh .

We'll come back here when we're done with the dataset.

@churnikov
Contributor

Hi!

I think, as #205 is done, we can continue with this task 😄

@DaloroAT
Collaborator

Great! Thank you @churnikov

Now we have all components to implement that script.

Let's assume that the goal of the script is to extract features from images and save them to a file.
The images can come from the following sources:

  1. Specific folder.
  2. CSV with paths and possible bounding boxes.

Place your solution into the examples folder at the root of the project.

I'm not sure about the format of the features file, but JSON is supported across different languages and has a good standard. It's not optimal for storing arrays, but for now we have no real use cases for this script, so we can add support for other formats in the future.

Config

We need to create and support a config with the parameters:

images_folder: ...
dataframe_name: ...
features_file: ...

batch_size: ...
num_workers: ...

transforms:
    ...

model: 
    ...

Features file

The structure of json is the following:

{
"images_folder": <from config>,
"dataframe_name": <from config>,
"model": <nested dict from config section>,
"transforms": <nested dict from config section>,
"filenames": [file1, file2, ...],
"features": [vector1, vector2, ...]
}

  • fileK - absolute path to the Kth image.
  • vectorK - list with float values [0.123, 1.356, ...] for the Kth image.
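The structure above could be assembled and written roughly as follows. This is a sketch using only the standard json module; the function names build_features_payload and save_features are hypothetical, and the config is assumed to be already loaded into a plain dict.

```python
import json
from pathlib import Path
from typing import Any, Dict, Sequence


def build_features_payload(
    config: Dict[str, Any],
    filenames: Sequence[Path],
    features: Sequence[Sequence[float]],
) -> Dict[str, Any]:
    """Collect config sections, absolute file paths and vectors in one dict."""
    if len(filenames) != len(features):
        raise ValueError("Each file needs exactly one feature vector.")
    return {
        "images_folder": config.get("images_folder"),
        "dataframe_name": config.get("dataframe_name"),
        "model": config.get("model"),
        "transforms": config.get("transforms"),
        # fileK: absolute path to the Kth image.
        "filenames": [str(Path(f).resolve()) for f in filenames],
        # vectorK: plain list of floats, so that json can serialize it.
        "features": [[float(x) for x in vec] for vec in features],
    }


def save_features(payload: Dict[str, Any], features_file: Path) -> None:
    """Write the payload to the file given by `features_file` in the config."""
    with open(features_file, "w") as f:
        json.dump(payload, f)
```

Converting every value with float() matters when the vectors come from tensors or numpy arrays, whose scalar types json cannot serialize directly.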

Test

Scenario:

  1. Write 2 configs.
  2. Run inference on the folder with the mock dataset.
  3. Run inference using a CSV that points to the mock dataset.
  4. Compare pairs of (file, vector) between approaches and make sure that features are equal.

See example1 and example2 for how to run tests with configs.
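The comparison in step 4 might look like the sketch below. It assumes both features files have already been loaded into filenames and features lists; the helper names and the tolerance value are illustrative, not taken from the project.

```python
import math
from typing import Dict, Sequence


def to_mapping(
    filenames: Sequence[str], features: Sequence[Sequence[float]]
) -> Dict[str, Sequence[float]]:
    """Pair each file with its vector so the comparison ignores file order."""
    return dict(zip(filenames, features))


def features_match(
    first: Dict[str, Sequence[float]],
    second: Dict[str, Sequence[float]],
    abs_tol: float = 1e-6,
) -> bool:
    """True if both runs cover the same files with (almost) equal vectors."""
    if first.keys() != second.keys():
        return False
    for name, vec in first.items():
        other = second[name]
        if len(vec) != len(other):
            return False
        if not all(math.isclose(a, b, abs_tol=abs_tol) for a, b in zip(vec, other)):
            return False
    return True
```

Matching by file name rather than by position is deliberate: a folder scan and a CSV may list the same images in a different order.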

Check list

  1. Parameters images_folder and dataframe_name must not be specified simultaneously. If both keys are present, one of them must be null; otherwise only one of them should be in the config.
  2. If CSV, check that all files exist.
  3. If a folder, use rglob to obtain nested files.
  4. Check that all files can be read. With PIL you can check this without fully decoding the images. It doesn't 100% guarantee that a file can be read, but it's better than nothing.
  5. If CSV, validate the names of the available columns (you can check them in const.py) and suggest the correct names to the user.
  6. You can ignore PyTorch Lightning and implement it in vanilla PyTorch.
  7. Add docs; we can discuss the details after implementing the core functionality.
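Items 1, 2, 3 and 5 of the check list can be sketched with the standard library alone. Everything named here is an assumption for illustration: the function names are hypothetical, and the values in EXPECTED_COLUMNS and IMAGE_EXTENSIONS are placeholders, since the real column names live in the project's const.py.

```python
import difflib
from pathlib import Path
from typing import Any, Dict, List, Sequence

# Assumption: placeholder values; the real ones are defined in const.py.
EXPECTED_COLUMNS = ("path", "x_1", "x_2", "y_1", "y_2")
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def pick_source(config: Dict[str, Any]) -> str:
    """Item 1: exactly one of the two sources must be set (non-null)."""
    given = [k for k in ("images_folder", "dataframe_name") if config.get(k) is not None]
    if len(given) != 1:
        raise ValueError(
            f"Specify exactly one of images_folder / dataframe_name, got: {given}"
        )
    return given[0]


def missing_files(paths: Sequence[Path]) -> List[Path]:
    """Item 2: return the paths from the CSV that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).is_file()]


def list_images(folder: Path) -> List[Path]:
    """Item 3: collect image files from the folder, including nested ones."""
    return sorted(
        p
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def suggest_columns(
    columns: Sequence[str], expected: Sequence[str] = EXPECTED_COLUMNS
) -> Dict[str, List[str]]:
    """Item 5: map each unknown column to the closest expected name, if any."""
    return {
        c: difflib.get_close_matches(c, expected, n=1)
        for c in columns
        if c not in expected
    }
```

The result of suggest_columns can be turned into an error message such as "unknown column 'paths', did you mean 'path'?"; an empty suggestion list means no expected name is close enough.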

Would you like to add something @AlekseySh ?

@DaloroAT DaloroAT assigned churnikov and unassigned armored-guitar Nov 14, 2022
@AlekseySh
Contributor Author

AlekseySh commented Nov 14, 2022

@DaloroAT I am good.
But it's not true:

now we have no real cases for that script.

Basically, I implemented a very dirty draft of it myself and used it in my latest experiments :)

@DaloroAT
Collaborator

Oh, new PR)

Anyway, that part of my comment was about the file format for users who need to extract and save features for their own purposes.

@DaloroAT
Collaborator

But if you've found a use case, we can add it to the task.

@AlekseySh
Contributor Author

I use pickle for myself, but it's not language-agnostic.
We can implement it with JSON, and then check if it can handle features for the InShop dataset (for example).
If it turns out to be inconvenient, we can consider HDF5 files.

@AlekseySh AlekseySh moved this from backlog to In progress in OML-planning Nov 24, 2022
@AlekseySh AlekseySh linked a pull request Dec 1, 2022 that will close this issue
@churnikov
Contributor

Hi @DaloroAT
I'm almost done and approaching documentation
#234 (comment)

You suggested discussing the docs at this point.

@DaloroAT
Collaborator

DaloroAT commented Dec 7, 2022

Hey @churnikov

Let's discuss docs.

We will add a new section to the big examples section; you can check the structure via the link. You can create a separate markdown snippet and place it there.

In general, you should highlight the following points:

  • Which data sources the user can use for inference, and how to prepare the data in each case.
  • Parameters (transforms) for inference. Probably give the link to some default transform with normalizing and resizing with keeping the aspect ratio.
  • Describe the structure of the output file.

Start with how you would like to see the manual as a lazy external user, then we can add some details during review.

@AlekseySh AlekseySh linked a pull request Feb 2, 2023 that will close this issue
@AlekseySh AlekseySh removed the good first issue Good for newcomers label Feb 7, 2023
@AlekseySh AlekseySh moved this from In progress to backlog in OML-planning Feb 7, 2023
@AlekseySh AlekseySh moved this from backlog to To do in OML-planning May 31, 2023
@AlekseySh AlekseySh moved this from To do to In progress in OML-planning Jun 3, 2023
@AlekseySh AlekseySh self-assigned this Jun 3, 2023
@AlekseySh AlekseySh linked a pull request Jun 5, 2023 that will close this issue
OML-planning automation moved this from In progress to Done Jun 7, 2023