DPM is a data preparation tool initially developed for personal use. Its features are also useful to data scientists, machine learning enthusiasts, and anyone interested in efficient data handling.
- Python (3.10.12)
- PyYAML (6.0.1)
- boto3 (1.34.74)
- botocore (1.34.74)
- requests (2.31.0)
- tqdm (4.66.2)
- autopep8 (2.0.4)
Download
- Works when the file is publicly available.
- Currently configured to download from:
  - Amazon S3 bucket
  - Google Drive
Filter
- Filter the dataset according to specific labels (or field values), and extract the label-related (field-related) data for those values.
- Filtering currently supports CSV files as the source, and labels are created in YOLO format.
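The filtering idea described above can be illustrated with a minimal sketch. It is not DPM's actual implementation: the column names (`ImageID`, `LabelName`, `XMin`, `XMax`, `YMin`, `YMax`), the assumption that box coordinates are already normalized to the 0..1 range, and the helper names `filter_rows` and `to_yolo_line` are all hypothetical. The YOLO label line itself follows the usual `class x_center y_center width height` layout.

```python
import csv
import io


def filter_rows(csv_text, wanted_labels):
    """Keep only the rows whose label column is one of the wanted labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["LabelName"] in wanted_labels]


def to_yolo_line(row, class_ids):
    """Convert one normalized corner-format box into a YOLO label line."""
    x_min, x_max = float(row["XMin"]), float(row["XMax"])
    y_min, y_max = float(row["YMin"]), float(row["YMax"])
    x_center = (x_min + x_max) / 2
    y_center = (y_min + y_max) / 2
    width = x_max - x_min
    height = y_max - y_min
    class_id = class_ids[row["LabelName"]]
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical sample data; the real CSV layout may differ.
SAMPLE_CSV = """ImageID,LabelName,XMin,XMax,YMin,YMax
img001,example1,0.10,0.50,0.20,0.60
img002,other,0.00,1.00,0.00,1.00
img003,example2,0.25,0.75,0.25,0.75
"""

labels = ["example1", "example2"]
# Class ids are assigned by position in the label list (an assumption).
class_ids = {name: index for index, name in enumerate(labels)}

kept = filter_rows(SAMPLE_CSV, set(labels))
image_ids = [row["ImageID"] for row in kept]
yolo_lines = [to_yolo_line(row, class_ids) for row in kept]
```

Here `image_ids` corresponds to what an image-only filter would collect, while `yolo_lines` shows the extra per-box output a label-producing filter would need.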
- Navigate into the Preparator directory
- Create and activate a virtual environment
- Install the requirements
pip install -r requirements.txt
- To filter with the arguments set in the configuration:
python filter.py
- To filter with the arguments passed on the command line:
python filter.py --filter-type img --csv-list example.csv example2.csv --label-list example1 example2
| arg | description | default value |
|---|---|---|
| --filter-type | What the filter produces. `img`: only the image files. `label`: both the image and the label-related files. | img |
| --output | The path and file name where the filtered data will be saved. | ./download/image_list |
| --csv-list | The CSV files to use for filtering. Give one or more files separated by spaces. | None |
| --label-list | The labels (or field values) to use for filtering. Give one or more labels separated by spaces. | None |
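As a rough illustration of how the options in the table could be declared, the sketch below uses `argparse` from the standard library. It mirrors only the flag names and defaults listed above. The function name `build_filter_parser` and the help strings are hypothetical, and the real `filter.py` may be structured differently (for example, it may merge these values with the configuration).

```python
import argparse


def build_filter_parser():
    """Declare the filter options with the defaults listed in the table."""
    parser = argparse.ArgumentParser(description="Filter a dataset by label.")
    parser.add_argument(
        "--filter-type",
        choices=["img", "label"],
        default="img",
        help="img: images only; label: images and label-related files",
    )
    parser.add_argument(
        "--output",
        default="./download/image_list",
        help="path and file name for the filtered data",
    )
    # nargs="+" accepts one or more space-separated values.
    parser.add_argument("--csv-list", nargs="+", default=None)
    parser.add_argument("--label-list", nargs="+", default=None)
    return parser


# Parse the same arguments as the command-line example above.
args = build_filter_parser().parse_args(
    [
        "--filter-type", "img",
        "--csv-list", "example.csv", "example2.csv",
        "--label-list", "example1", "example2",
    ]
)
```

Note that `argparse` converts dashes to underscores, so the values are read as `args.filter_type`, `args.csv_list`, and `args.label_list`.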
- To download with the arguments set in the configuration:
python download.py
- To download with the arguments passed on the command line:
python download.py --image_list ./download/image_list --processes 6 --output_folder ./download
| arg | description | default value |
|---|---|---|
| --image_list | The path and file name of the list containing the split and image IDs of the images to download. | ./download/image_list |
| --processes | The number of parallel processes to use. | 6 |
| --output_folder | The folder to download the images into. | ./download |
| --request | The download source type. True to download with requests, False to download with boto3 (AWS SDK). | True |
| --bucket_name | The name of the S3 bucket where the images are located. | None |
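The way `--image_list`, `--processes`, and `--output_folder` fit together can be sketched as follows. This is an illustration under several assumptions, not the code in `download.py`: each line of the image list is assumed to look like `split/image_id`, the remote key is assumed to be `split/image_id.jpg`, and the names `parse_image_list`, `download_all`, and `fake_fetch` are hypothetical. For simplicity the sketch uses a thread pool from the standard library rather than separate processes, and the actual transfer is passed in as a `fetch` callable so that either a requests-based or a boto3-based function could be plugged in.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parse_image_list(lines):
    """Turn 'split/image_id' lines into (split, image_id) pairs, skipping blanks."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        split, image_id = line.split("/", 1)
        pairs.append((split, image_id))
    return pairs


def download_all(pairs, output_folder, fetch, processes=6):
    """Call fetch(key, destination) for every image using a pool of workers."""
    output = Path(output_folder)
    # Assumed layout: remote key 'split/id.jpg', local file 'output_folder/id.jpg'.
    jobs = [
        (f"{split}/{image_id}.jpg", output / f"{image_id}.jpg")
        for split, image_id in pairs
    ]
    with ThreadPoolExecutor(max_workers=processes) as pool:
        # list() waits for all jobs and re-raises any worker exception.
        list(pool.map(lambda job: fetch(*job), jobs))
    return [destination for _, destination in jobs]


# Stand-in for a real transfer function; it only records what was requested.
fetched = []


def fake_fetch(key, destination):
    fetched.append((key, destination.name))


pairs = parse_image_list(["train/img001", "", "validation/img002\n"])
paths = download_all(pairs, "./download", fake_fetch, processes=2)
```

Keeping the transfer behind a single callable is one way to let a flag such as `--request` switch between the two download back ends without changing the pooling logic.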