Implement script for extracting features #85

AlekseySh · 2022-08-09T14:01:55Z

We need something like python extract.py and a YAML config that parametrizes the model and DataLoader.

We should probably store features in hdf5 format, which may be useful for users who don't know Python.


churnikov · 2022-10-26T08:58:50Z

Hi!

I'd like to work on this issue. I hope it's not done yet :)

AlekseySh · 2022-10-27T12:01:53Z

@churnikov you are welcome! @DaloroAT will provide you with a road map today or tomorrow

churnikov · 2022-10-27T13:29:57Z

Sounds good :)

I've looked through the contribution guide and I hope you'll give me details on what you expect :)

DaloroAT · 2022-10-27T15:00:10Z

Hey @churnikov !

We can split this script into several parts:

1. Read the config and extract the parameters for the model, transforms, dataloader and folder with images.
2. Prepare a ListDataset to process a list of images.
3. Pass this dataset into a dataloader.
4. Save the features into a file.

We have no ListDataset in our repo now so we can start from point 2. After implementing and merging the code with ListDataset we can return to the rest of the points and discuss the design.

Design

ListDataset should be very simple. It can iterate over files, apply transforms, and return tensors.

from torch.utils.data import Dataset
import torch

class ListDataset(Dataset):
    def __init__(self, filenames_list, transforms, f_imread):
        ...

    def __getitem__(self, idx) -> torch.Tensor:
        ...

Please don't forget to support different types of transforms. We currently use transforms from albumentations and default torchvision transforms; you can check this in oml.datasets.base.BaseDataset.

Transforms are responsible for augmentations of the original images, resizing and normalizing.

Test

Iterate over the files in the mock dataset with a dataloader and check the shapes of the batches.

The mock dataset can be found at:

from oml.const import MOCK_DATASET_PATH
print(MOCK_DATASET_PATH / 'images')

If you have no mock dataset locally, use make download_mock_dataset from the root of the project.

What do you think @AlekseySh ?

AlekseySh · 2022-10-27T15:05:27Z

@DaloroAT Agree, let's start by implementing ListDataset in a dedicated PR. Then we can finish the rest of the functionality.

Is the task clear to you, @churnikov ?

churnikov · 2022-10-27T15:13:00Z

Sounds good :)

DaloroAT · 2022-10-28T09:28:37Z

Great!

I created a separate issue #205 for this particular task. Let's continue there @churnikov @AlekseySh .

We'll come back here when we're done with the dataset.

churnikov · 2022-11-14T10:39:31Z

Hi!

I think, as #205 is done, we can continue with this task 😄

DaloroAT · 2022-11-14T16:55:33Z

Great! Thank you @churnikov

Now we have all components to implement that script.

Let's assume that our script's goal is to extract images' features and save them to a file.
The sources of images are the following:

Specific folder.
CSV with paths and possible bounding boxes.

Place your solution into the examples folder at the root of the project.

I'm not sure about the format of the features file, but JSON is supported across different languages and has a good standard. It's not optimal for storing arrays, but now we have no real cases for that script. So we can add support for other formats in the future.

Config

We need to create and support a config with the parameters:

images_folder: ...
dataframe_name: ...
features_file: ...

batch_size: ...
num_workers: ...

transforms:
    ...

model: 
    ...

Features file

The structure of the JSON is the following:

{
"images_folder": <from config>,
"dataframe_name": <from config>,
"model": <nested dict from config section>,
"transforms": <nested dict from config section>,
"filenames": [file1, file2, ...],
"features": [vector1, vector2, ...]
}

fileK - absolute path to the Kth image.
vectorK - list with float values [0.123, 1.356, ...] for Kth image.
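
A minimal sketch of how such a file could be assembled and written, using only the standard library. The function names build_features_payload and save_features are hypothetical, and the sketch assumes the config has already been loaded into a plain dict and the features are plain sequences of numbers (for example, tensors already converted to lists).

```python
import json
from pathlib import Path
from typing import Any, Dict, Sequence


def build_features_payload(
    config: Dict[str, Any],
    filenames: Sequence[str],
    features: Sequence[Sequence[float]],
) -> Dict[str, Any]:
    """Assemble the dict described above (hypothetical helper)."""
    if len(filenames) != len(features):
        raise ValueError("Each file must have exactly one feature vector.")
    return {
        "images_folder": config.get("images_folder"),
        "dataframe_name": config.get("dataframe_name"),
        "model": config.get("model"),
        "transforms": config.get("transforms"),
        # fileK: absolute path to the Kth image
        "filenames": [str(Path(name).resolve()) for name in filenames],
        # vectorK: list of plain floats, so that json can serialize it
        "features": [[float(x) for x in vector] for vector in features],
    }


def save_features(payload: Dict[str, Any], features_file: str) -> None:
    """Write the payload to features_file as JSON."""
    with open(features_file, "w") as f:
        json.dump(payload, f)
```

Converting every value to a plain float keeps the file readable from other languages, at the cost of a larger file than a binary format would give.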

Test

Scenario:

Write 2 configs.
Run inference on the folder with the mock dataset.
Run inference using a CSV that points to the mock dataset.
Compare pairs of (file, vector) between the two approaches and make sure that the features are equal.

Check example1 and example2 to see how to run tests with configs.
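
The comparison step of this scenario could look roughly like the sketch below. It assumes both runs produced JSON files with the "filenames" and "features" keys described above. The helper names load_pairs and features_match and the tolerance value are assumptions for illustration, not part of the repository.

```python
import json
import math
from typing import Dict, List


def load_pairs(features_file: str) -> Dict[str, List[float]]:
    """Read a features file and map each filename to its vector."""
    with open(features_file) as f:
        payload = json.load(f)
    return dict(zip(payload["filenames"], payload["features"]))


def features_match(
    pairs_a: Dict[str, List[float]],
    pairs_b: Dict[str, List[float]],
    abs_tol: float = 1e-6,  # assumed tolerance, tune for your model
) -> bool:
    """Compare (file, vector) pairs regardless of the order of files."""
    if pairs_a.keys() != pairs_b.keys():
        return False
    for name, vec_a in pairs_a.items():
        vec_b = pairs_b[name]
        if len(vec_a) != len(vec_b):
            return False
        if not all(math.isclose(a, b, abs_tol=abs_tol) for a, b in zip(vec_a, vec_b)):
            return False
    return True
```

Matching by filename rather than by position matters here because the folder run and the CSV run are not guaranteed to visit the images in the same order.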

Check list

Parameters images_folder and dataframe_name must not be specified simultaneously. If both are present, one should be null; otherwise only one should be in the config.
If CSV, check that all files exist.
If a folder, use rglob to obtain nested files.
Check that all files can be read. You can check this with PIL without decoding the image. It's not 100% guaranteed that a file can be read, but better than nothing.
If CSV, validate the names of the available columns (you can check them in const.py) and suggest the correct names to the user.
You can ignore PyTorch Lightning and implement it on vanilla PyTorch.
Add docs; we can discuss details after implementing the core functionality.
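
Several items of this check list can be sketched with the standard library alone. The helpers below (resolve_source, list_images, find_missing, suggest_columns) and the IMAGE_EXTENSIONS set are hypothetical names and values chosen for illustration. The real column names live in const.py and are passed in here as a parameter rather than guessed. The PIL readability check is left out of this sketch.

```python
import difflib
from pathlib import Path
from typing import Dict, Iterable, List

# Assumed set of extensions, adjust to the formats you actually support.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def resolve_source(config: dict) -> str:
    """Return "folder" or "csv"; exactly one source must be non-null."""
    folder = config.get("images_folder")
    dataframe = config.get("dataframe_name")
    if (folder is None) == (dataframe is None):
        raise ValueError("Specify exactly one of images_folder or dataframe_name.")
    return "folder" if folder is not None else "csv"


def list_images(images_folder) -> List[Path]:
    """Collect image files from the folder and all nested folders."""
    root = Path(images_folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def find_missing(paths: Iterable) -> List[Path]:
    """Return the paths (e.g. taken from the CSV) that do not exist."""
    return [Path(p) for p in paths if not Path(p).is_file()]


def suggest_columns(columns: Iterable[str], expected: Iterable[str]) -> Dict[str, List[str]]:
    """For each expected column that is absent, list similar actual columns."""
    columns = list(columns)
    return {
        name: difflib.get_close_matches(name, columns)
        for name in expected
        if name not in columns
    }
```

An empty dict from suggest_columns means all expected columns are present. A non-empty one can be turned into an error message that tells the user which column was probably misspelled.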

Would you like to add something @AlekseySh ?

AlekseySh · 2022-11-14T20:40:04Z

@DaloroAT I am good.
But it's not true:

now we have no real cases for that script.

basically, I implemented a very dirty draft of it by myself and used it in my last experiments :)

DaloroAT · 2022-11-14T20:53:31Z

Oh, new PR)

Anyway, that part of my comment was about the file format for users who need to extract and save features for some purpose.

DaloroAT · 2022-11-14T20:55:34Z

But if you found a use case, we can add it to the task.

AlekseySh · 2022-11-14T21:09:04Z

I use pickle for myself, but it's not language-agnostic.
We can implement it with JSON, and then check if it can handle features for the InShop dataset (for example).
If it is inconvenient, we can think about hdf5 files.

churnikov · 2022-12-07T11:02:49Z

Hi @DaloroAT
I'm almost done and approaching the documentation
#234 (comment)

You suggested discussing the docs at this point.

DaloroAT · 2022-12-07T14:24:01Z

Hey @churnikov

Let's discuss docs.

We will add a new section in the big examples section; you can check the structure via the link. You can create a separate markdown snippet and place it there.

In general, you should highlight the following points:

Which sources of data can the user use for inference, and how should the data be prepared in each case?
Parameters (transforms) for inference. Probably give a link to a default transform that normalizes and resizes while keeping the aspect ratio.
Describe the structure of the output file.

Start with how you would like to see the manual as a lazy external user, then we can add some details during review.

AlekseySh created this issue from a note in OML-planning (backlog) Aug 9, 2022

AlekseySh added the feature label Aug 9, 2022

AlekseySh moved this from backlog to To do in OML-planning Aug 10, 2022

AlekseySh added the good first issue Good for newcomers label Aug 10, 2022

armored-guitar self-assigned this Aug 10, 2022

AlekseySh moved this from To do to backlog in OML-planning Aug 19, 2022

AlekseySh mentioned this issue Oct 25, 2022

Check how can we make our repository work in google colab. #146

Closed

DaloroAT mentioned this issue Oct 28, 2022

Implement ListDataset to iterate over files #205

Closed

DaloroAT assigned churnikov and unassigned armored-guitar Nov 14, 2022

AlekseySh mentioned this issue Nov 20, 2022

Inference for the images grouped by their aspect ratios #93

Closed

churnikov mentioned this issue Nov 22, 2022

Feature extraction script #234

Merged

AlekseySh moved this from backlog to In progress in OML-planning Nov 24, 2022

AlekseySh linked a pull request Dec 1, 2022 that will close this issue

Feature extraction script #234

Merged

churnikov mentioned this issue Dec 13, 2022

Visualise 2-d embedding space for TripletLoss and for ArcFace for some simple dataset like MNIST #128

Closed

AlekseySh linked a pull request Feb 2, 2023 that will close this issue

[ON HOLD] Feature extraction script 2 #292

Closed

AlekseySh unassigned churnikov Feb 7, 2023

AlekseySh removed the good first issue Good for newcomers label Feb 7, 2023

AlekseySh moved this from In progress to backlog in OML-planning Feb 7, 2023

AlekseySh moved this from backlog to To do in OML-planning May 31, 2023

AlekseySh moved this from To do to In progress in OML-planning Jun 3, 2023

AlekseySh self-assigned this Jun 3, 2023

AlekseySh linked a pull request Jun 5, 2023 that will close this issue

Prediction script #384

Merged

AlekseySh closed this as completed in #384 Jun 7, 2023

OML-planning automation moved this from In progress to Done Jun 7, 2023
