## **Creating a binary dataset**


In this notebook, we preprocess the data and create a binary dataset with the Hugging Face `datasets` library. The library is well suited to building image datasets because it loads data through memory mapping, which lets us handle and modify large datasets, such as our 6 GB collection of images, without holding them entirely in RAM. All of the processing was done on Kaggle, which made it easier to load the data online.
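To make the idea of memory mapping concrete, here is a minimal sketch that uses only Python's standard `mmap` module. It illustrates the general technique, not the internals of the `datasets` library (which memory-maps Apache Arrow files); the file name, record size, and helper `read_record` are hypothetical and exist only for this example.

```python
import mmap
import os
import tempfile

# Write a small binary file that stands in for a large on-disk dataset
# (hypothetical data: 1000 fixed-size records holding the numbers 0..999).
path = os.path.join(tempfile.mkdtemp(), "records.bin")
record_size = 16
num_records = 1000
with open(path, "wb") as f:
    for i in range(num_records):
        f.write(i.to_bytes(record_size, "little"))


def read_record(mm, index):
    """Decode record `index` by slicing the mapped file at its byte offset."""
    start = index * record_size
    return int.from_bytes(mm[start:start + record_size], "little")


# Map the file read-only; the operating system pages data in on demand,
# so we never call f.read() on the whole file.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = read_record(mm, 0)
        last = read_record(mm, num_records - 1)

print(first, last)
```

Random access by offset is what makes this approach practical for data that is larger than available memory.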




## **Logging in to Hugging Face**

In [None]:
!huggingface-cli login --token ""

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
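As an alternative to typing the token into the cell, the login can be done from Python with the token read from an environment variable. This is only a sketch: the variable name `HF_TOKEN` and the helpers `read_token` and `login_from_env` are assumptions made for this example, and it relies on `huggingface_hub.login(token=...)`.

```python
import os


def read_token(env=os.environ, name="HF_TOKEN"):
    """Return the token stored under `name`, stripped of whitespace."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable first")
    return token


def login_from_env():
    """Log in to the Hugging Face Hub with the token from the environment."""
    # Imported lazily so read_token() can be used without the package installed.
    from huggingface_hub import login

    login(token=read_token())
```

Calling `login_from_env()` in a notebook cell would then replace the CLI command, and the token never appears in the saved notebook.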


## **Imports**

In [1]:
from datasets import load_dataset
from datasets import concatenate_datasets

## **Loading the dataset**

We combine the training and test splits into a single dataset so that the data can be resampled. The `concatenate_datasets` call on the second line of the cell below does this.

In [None]:
# Load the dataset
dataset = load_dataset("imagefolder", data_dir="/kaggle/input/womanium-dataset")
dataset = concatenate_datasets([dataset['train'], dataset['test']])

## **Getting the labels**

As the output below shows, the label ids differ from the original ones. This is because `load_dataset()` assigns ids to the class names in alphabetical order, which shifts the label assignments.

In [None]:
labels = dataset.features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
    
label2id, id2label

({'burn through': '0',
  'contamination': '1',
  'good weld': '2',
  'lack of fusion': '3',
  'lack of penetration': '4',
  'misalignment': '5'},
 {'0': 'burn through',
  '1': 'contamination',
  '2': 'good weld',
  '3': 'lack of fusion',
  '4': 'lack of penetration',
  '5': 'misalignment'})
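The alphabetical assignment can be reproduced in plain Python by sorting the class names and enumerating them. In this sketch the input order of `class_dirs` is arbitrary (an assumption, since the original folder order is not shown), and the ids are kept as integers, which is how the `label` column stores them.

```python
# Class names from the output above, listed in an arbitrary order.
class_dirs = [
    "good weld",
    "misalignment",
    "burn through",
    "lack of penetration",
    "contamination",
    "lack of fusion",
]

# Sorting alphabetically and enumerating yields the same ids as the output above.
names = sorted(class_dirs)
label2id = {name: i for i, name in enumerate(names)}
id2label = {i: name for name, i in label2id.items()}

print(label2id)
```

Whatever order the folders are listed in, "good weld" ends up with id 2, which is the value the later cells rely on.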

## **Resampling data and creating binary dataset**

To balance the dataset, we used random resampling to obtain a fixed number of images per class: 10,000 images for the "good weld" class and 2,000 images for each of the five defect classes, so the binary dataset contains 10,000 good and 10,000 defective images. We chose 2,000 because the "burn through" class had over 2,000 images. We opted not to use data augmentation, as we expect future images to have consistent camera settings, orientation, and lighting conditions.

In [None]:
from datasets import Dataset
# Function to sample the desired number of images per class
def sample_dataset(dataset, class_counts):
    dataset_final = Dataset.from_list([])
    for label, count in class_counts.items():
        print(f"Class {label} Progess")
        # Keep only the examples that carry the current label
        dataset_only_one_label = dataset.filter(lambda example: example["label"] == label)
        # Shuffle, then keep the first `count` examples as a random sample
        dataset_only_one_label = dataset_only_one_label.shuffle()
        dataset_only_one_label = dataset_only_one_label.select(range(count))
        # Append the sample for this class to the result
        dataset_final = concatenate_datasets([dataset_final, dataset_only_one_label])
    return dataset_final

# Specify the desired number of images per class
class_counts = {
    0: 2000,  
    1: 2000,  
    2: 10000,  
    3: 2000,  
    4: 2000,  
    5: 2000   
}

# Sample the dataset to create the balanced dataset
balanced_dataset = sample_dataset(dataset, class_counts)


Class 0 Progess
Class 1 Progess


Filter:   0%|          | 0/33254 [00:00<?, ? examples/s]

Class 2 Progess


Filter:   0%|          | 0/33254 [00:00<?, ? examples/s]

Class 3 Progess


Filter:   0%|          | 0/33254 [00:00<?, ? examples/s]

Class 4 Progess


Filter:   0%|          | 0/33254 [00:00<?, ? examples/s]

Class 5 Progess


Filter:   0%|          | 0/33254 [00:00<?, ? examples/s]
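The same per-class sampling can be expressed over plain label lists, which makes the logic easy to check and to reproduce with a fixed seed. This is a sketch under stated assumptions rather than the code used in the notebook: `sample_indices`, the `seed` parameter, and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_indices(labels, class_counts, seed=0):
    """Pick `class_counts[label]` random row indices for each label."""
    rng = random.Random(seed)

    # Group row indices by label in a single pass over the data.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for label, count in class_counts.items():
        pool = by_label[label]
        if len(pool) < count:
            raise ValueError(f"class {label} has only {len(pool)} examples")
        # Sample without replacement from the indices of this class.
        chosen.extend(rng.sample(pool, count))
    return chosen


# Toy data standing in for the real label column.
toy_labels = [0] * 5 + [1] * 8 + [2] * 20
picked = sample_indices(toy_labels, {0: 3, 1: 3, 2: 10}, seed=42)
picked_counts = Counter(toy_labels[i] for i in picked)
print(picked_counts)
```

With the real data, the resulting index list could be passed to `dataset.select(...)`, replacing one `filter` pass per class with a single pass over the labels.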

The final result is a dataset of 20,000 images.

In [None]:
balanced_dataset

Dataset({
    features: ['image', 'label'],
    num_rows: 20000
})

## **Fixing the labels**

After preparing the dataset, we need to finalize the labels. We wrote a simple function that checks whether the original label is "good weld" (id 2 in the mapping above). If it is, the function assigns 0; otherwise, it assigns 1.

In [None]:
def to_binary_label(example):
    # "good weld" has id 2; every other class is a defect
    example["label_binary"] = 0 if example["label"] == 2 else 1
    return example

updated_dataset = balanced_dataset.map(to_binary_label)
updated_dataset = updated_dataset.remove_columns(["label"])

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]
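As a quick sanity check, the binary mapping can be applied to the per-class counts chosen earlier to confirm that the two binary classes come out the same size. The names `GOOD_WELD_ID` and `to_binary` are introduced here for illustration only.

```python
GOOD_WELD_ID = 2  # id of "good weld" in the alphabetical label mapping


def to_binary(label):
    """Map a six-class label id to 0 (good weld) or 1 (any defect)."""
    return 0 if label == GOOD_WELD_ID else 1


# Per-class sample sizes used in the resampling step.
class_counts = {0: 2000, 1: 2000, 2: 10000, 3: 2000, 4: 2000, 5: 2000}

# Accumulate how many images land in each binary class.
binary_counts = {0: 0, 1: 0}
for label, count in class_counts.items():
    binary_counts[to_binary(label)] += count

print(binary_counts)  # {0: 10000, 1: 10000}
```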

## **Creating train and test splits**

This cell splits the data into training and test sets, holding out 30% of the data for testing.

In [None]:
training_dataset = updated_dataset.train_test_split(test_size=0.3)
training_dataset = training_dataset.rename_column("label_binary", "label")

In [None]:
training_dataset

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 14000
    })
    test: Dataset({
        features: ['image', 'label'],
        num_rows: 6000
    })
})
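The 14,000 / 6,000 row counts follow directly from holding out 30% of 20,000 rows. The sketch below reproduces that arithmetic with a seeded shuffle over row indices; `split_indices` and its `seed` parameter are hypothetical names for this example and do not claim to mirror how `train_test_split` is implemented.

```python
import random


def split_indices(n, test_size=0.3, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test) lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(20000, test_size=0.3, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable across runs; `train_test_split` also accepts a `seed` argument for the same purpose.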

## **Uploading to Hugging Face**

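We uploaded the data to Hugging Face Datasets so that my colleague and I could easily access it. Because the call below passes `private=True`, the repository is visible only to accounts that have been granted access; it would need to be made public before anyone else could use it.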

In [None]:
training_dataset.push_to_hub("LaLegumbreArtificial/womanium-balance", private=True)

Uploading the dataset shards:   0%|          | 0/6 [00:00<?, ?it/s]

Map:   0%|          | 0/2334 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/24 [00:00<?, ?ba/s]

Map:   0%|          | 0/2334 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/24 [00:00<?, ?ba/s]

Map:   0%|          | 0/2333 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/24 [00:00<?, ?ba/s]

Map:   0%|          | 0/2333 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/24 [00:00<?, ?ba/s]

Map:   0%|          | 0/2333 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/24 [00:00<?, ?ba/s]

Map:   0%|          | 0/2333 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/24 [00:00<?, ?ba/s]

Uploading the dataset shards:   0%|          | 0/3 [00:00<?, ?it/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Creating parquet from Arrow format:   0%|          | 0/20 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/328 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/LaLegumbreArtificial/womanium-balance/commit/4128dbfa5200b67e211392e0ea7c96379c77f001', commit_message='Upload dataset', commit_description='', oid='4128dbfa5200b67e211392e0ea7c96379c77f001', pr_url=None, pr_revision=None, pr_num=None)