<a href="https://colab.research.google.com/github/OrsolaMBorrini/rcm-thesis/blob/main/TEST1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install necessary libraries
!pip install -U datasets
!pip install -U Pillow



In [None]:
# --- The dataset on the HF Hub is now public, which simplifies access during this testing phase

#from huggingface_hub import login
#login(token="<YOUR_HF_TOKEN>")  # never commit a real token; not needed while the dataset is public

## rcm-1 dataset
### New structure
Here's the link to the public dataset: [rcm-1](https://huggingface.co/datasets/ombrr/rcm-1).

```
rcm-1
├── pictures
│   ├── Q_017042.jpg
│   ├── Q_017043.jpg
│   └── ...
├── test.csv
├── train.csv
└── validation.csv
```

The CSV files have the following structure:

| annotation_id | choice | created_at | id | updated_at | provenance | set | image |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 77 | not-sensitive | 2023-09-26T15:19:16.351587Z | 74 | 2023-09-26T15:19:16.351587Z | IWM | train | https://huggingface.co/datasets/ombrr/rcm-1/blob/main/pictures/Q_017117.jpg |
| 2 | sensitive | 2023-09-19T14:53:52.546617Z | 2 | 2023-09-19T14:54:01.018338Z | IWM | train | https://huggingface.co/datasets/ombrr/rcm-1/blob/main/pictures/Q_017043.jpg |
| 139 | dubious | 2023-09-26T16:09:22.824390Z | 190 | 2023-09-26T16:09:22.824390Z | IWM | train | https://huggingface.co/datasets/ombrr/rcm-1/blob/main/pictures/Q_052457.jpg |

### Old structure
Previously, the dataset was **folder-based**:

```
rcm-1
│
├── train
│   ├── sensitive
│   │   ├── Q_017042.jpg
│   │   └── ...
│   ├── not-sensitive
│   │   ├── Q_017043.jpg
│   │   └── ...
│   └── dubious
│       ├── Q_017044.jpg
│       └── ...
│
├── validation
│   ├── sensitive
│   │   ├── Q_017045.jpg
│   │   └── ...
│   ├── not-sensitive
│   │   ├── Q_017046.jpg
│   │   └── ...
│   └── dubious
│       ├── Q_017047.jpg
│       └── ...
│
├── test
│   ├── sensitive
│   │   ├── Q_017042.jpg
│   │   └── ...
│   ├── not-sensitive
│   │   ├── Q_017043.jpg
│   │   └── ...
│   └── dubious
│       ├── Q_017044.jpg
│       └── ...
│
└── dataset.csv
```

The `dataset.csv` file had the same structure as the new CSV files, minus the `'set'` column: the split (not stratified!) was encoded by the folders themselves.

This structure was abandoned because the definition of "sensitive content" was being **reframed in parallel**, and every revision meant:
- Correcting multiple annotations
- Moving images between folders by re-running the Python script (again, and again...)

With the new structure, only the CSV files change (the values in the `'choice'` column), and the same Python script then redoes the stratified sampling. This makes it much simpler to update the dataset as new aspects of "sensitive content" are discovered or discussed.
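
The idea of re-deriving the `'set'` column from the `'choice'` column can be sketched as follows. This is not the actual script used for rcm-1: `stratified_split`, its default fractions, and the seed are hypothetical choices made for illustration, and only the standard library is used.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="choice", fractions=(0.7, 0.15, 0.15), seed=0):
    """Return copies of `rows` with a 'set' value assigned per label group.

    Each label (e.g. 'sensitive', 'not-sensitive', 'dubious') is shuffled and
    split on its own, so all three sets keep roughly the same label proportions.
    The fractions are illustrative assumptions, not the ones used for rcm-1.
    """
    rng = random.Random(seed)

    # Group the rows by their annotation label
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    result = []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        for i, row in enumerate(group):
            if i < n_train:
                split = "train"
            elif i < n_train + n_val:
                split = "validation"
            else:
                split = "test"  # whatever is left over goes to the test set
            result.append({**row, "set": split})
    return result
```

After an annotation is corrected in the CSV, calling the function again on the updated rows reassigns the splits without touching any image file.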


### However...
With the folder-based 'old' structure, the images could be read without problems, and it was easy to follow [this tutorial](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/image_classification.ipynb) provided by 🤗 HuggingFace.

With the new structure, an `UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7a6d74d2f420>` is raised whenever we try to access the images.

In [None]:
from datasets import load_dataset, Image
import PIL

train_dataset = load_dataset("ombrr/rcm-1", split="train")
train_dataset

Dataset({
    features: ['annotation_id', 'choice', 'created_at', 'id', 'updated_at', 'provenance', 'set', 'image'],
    num_rows: 139
})

In [None]:
train_dataset.features

{'annotation_id': Value(dtype='int64', id=None),
 'choice': Value(dtype='string', id=None),
 'created_at': Value(dtype='string', id=None),
 'id': Value(dtype='int64', id=None),
 'updated_at': Value(dtype='string', id=None),
 'provenance': Value(dtype='string', id=None),
 'set': Value(dtype='string', id=None),
 'image': Value(dtype='string', id=None)}

In [None]:
train_dataset[0]['image']

'https://huggingface.co/datasets/ombrr/rcm-1/blob/main/pictures/Q_017117.jpg'
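
Note that this value is a `/blob/` URL. On the Hub, `/blob/` addresses point to the file's web page, while the raw file is served under `/resolve/`. If the image decoder ends up fetching the `/blob/` address, it would receive HTML rather than JPEG bytes, which could explain the `UnidentifiedImageError`. A minimal sketch of rewriting the URLs, where `blob_to_resolve` is a hypothetical helper (not part of `datasets`) and the `map` call in the comment is an untested assumption about how it would be applied here:

```python
def blob_to_resolve(url: str) -> str:
    """Turn a Hub web-page URL (/blob/) into a raw-file URL (/resolve/).

    Only the first occurrence is replaced, so a file path that happens to
    contain '/blob/' further down is left alone.
    """
    return url.replace("/blob/", "/resolve/", 1)


# Possible usage before casting the column (assumption, not verified on rcm-1):
# train_dataset = train_dataset.map(
#     lambda row: {"image": blob_to_resolve(row["image"])}
# )
```

Fixing the URLs once in the CSV files would have the same effect and avoids the extra `map` step at load time.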

In [None]:
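# Reinterpret the string column as an image column; each string is treated as a path/URL and decoded on access
train_dataset = train_dataset.cast_column("image", Image())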

In [None]:
train_dataset.features

{'annotation_id': Value(dtype='int64', id=None),
 'choice': Value(dtype='string', id=None),
 'created_at': Value(dtype='string', id=None),
 'id': Value(dtype='int64', id=None),
 'updated_at': Value(dtype='string', id=None),
 'provenance': Value(dtype='string', id=None),
 'set': Value(dtype='string', id=None),
 'image': Image(decode=True, id=None)}

In [None]:
train_dataset[10]['image']  # raises UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7a6d74d2f420>

# Already checked: no file in the folder is corrupt, and the folder-based dataset never raised this error

UnidentifiedImageError: ignored
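
Since the files themselves are not corrupt, it may help to check what bytes PIL is actually being handed. One option is to download the address stored in the `image` column and look at the start of the payload: a JPEG begins with the bytes `FF D8 FF`, whereas a web page begins with markup. The sketch below is a diagnostic aid only; `sniff_payload` is a hypothetical helper, and the download lines in the comment assume the URL is publicly reachable.

```python
def sniff_payload(data: bytes) -> str:
    """Guess what kind of content `data` holds from its first bytes."""
    if data.startswith(b"\xff\xd8\xff"):       # JPEG magic number
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # PNG signature
        return "png"
    # Ignore leading whitespace and letter case when looking for HTML markup
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"


# Possible usage (requires network access; kept as a comment on purpose):
# from urllib.request import urlopen
# url = 'https://huggingface.co/datasets/ombrr/rcm-1/blob/main/pictures/Q_017117.jpg'
# print(sniff_payload(urlopen(url).read()))
```

A result of `"html"` would mean the decoder is receiving a web page instead of an image, which points at the stored URLs rather than at the files.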