# **Put the dataset in the correct format**

```javascript
This notebook converts the provided dataset into a format that machine learning models can use. The original folder structure was not suitable for training, so we reorganize the images into one folder per class inside each split.
```

## **Imports**

In [1]:
import json
import os
import shutil
import uuid

import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image


## **Create a pandas dataframe of relative paths**

```java
This cell loads the image paths and labels from train.json and test.json. Each file maps a relative image path to an integer class label, for example:
{
  "170906-113317-Al 2mm-part3/frame_00647.png": 1,
  "170906-113317-Al 2mm-part3/frame_00672.png": 1,
  "170906-113317-Al 2mm-part3/frame_00677.png": 1,
  "170906-113317-Al 2mm-part3/frame_00646.png": 1,
  "170906-113317-Al 2mm-part3/frame_00691.png": 1,
  "170906-113317-Al 2mm-part3/frame_00684.png": 1,
  ...
}
```
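As a quick illustration of that structure, the sketch below parses a small in-memory sample shaped like train.json and turns it into a dataframe. The sample string, the helper name `mapping_to_frame`, and the extra `source_folder` column are assumptions made for this example and are not part of the notebook.

```python
import json

import pandas as pd

# Hypothetical in-memory sample that mirrors the structure of train.json
sample_json = """
{
  "170906-113317-Al 2mm-part3/frame_00647.png": 1,
  "170906-113317-Al 2mm-part3/frame_00672.png": 1,
  "170906-150801-Al 2mm/frame_00715.png": 3
}
"""


def mapping_to_frame(mapping, split):
    """Turn a {relative_path: label} dict into a two-column dataframe."""
    return pd.DataFrame(
        {
            f"{split}_img_paths": list(mapping.keys()),
            "labels": list(mapping.values()),
        }
    )


sample_df = mapping_to_frame(json.loads(sample_json), "train")

# The part of each path before the slash is the folder the frame sits in
sample_df["source_folder"] = sample_df["train_img_paths"].str.split("/").str[0]
print(sample_df)
```

Keeping the source folder around can be handy later, for example to check whether frames from the same folder end up in both splits.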

In [3]:
main_path = "al5083/"
train_path = os.path.join(main_path, "train")
test_path = os.path.join(main_path, "test")

def load_dataset_pandas(train_test):
    """Load <train_test>.json and return a dataframe of image paths and labels."""
    dataset_img_paths = []
    labels = []
    with open(os.path.join(main_path, train_test, f"{train_test}.json"), 'r') as file:
        json_file = json.load(file)

    # Each key is a relative image path and each value is its integer label
    for key, value in json_file.items():
        dataset_img_paths.append(key)
        labels.append(value)

    data = {
        f'{train_test}_img_paths': dataset_img_paths,
        'labels': labels
    }
    return pd.DataFrame(data)


df_train = load_dataset_pandas("train")
df_test = load_dataset_pandas("test")

```java
Resulting dataframe:
```

In [4]:
df_train

|       | train_img_paths                            | labels |
|-------|--------------------------------------------|--------|
| 0     | 170906-113317-Al 2mm-part3/frame_00647.png | 1      |
| 1     | 170906-113317-Al 2mm-part3/frame_00672.png | 1      |
| 2     | 170906-113317-Al 2mm-part3/frame_00677.png | 1      |
| 3     | 170906-113317-Al 2mm-part3/frame_00646.png | 1      |
| 4     | 170906-113317-Al 2mm-part3/frame_00691.png | 1      |
| ...   | ...                                        | ...    |
| 26661 | 170906-150801-Al 2mm/frame_00715.png       | 3      |
| 26662 | 170906-150801-Al 2mm/frame_00852.png       | 3      |
| 26663 | 170906-150801-Al 2mm/frame_00346.png       | 3      |
| 26664 | 170906-150801-Al 2mm/frame_00116.png       | 3      |


In [6]:
df_train["labels"].value_counts() + df_test["labels"].value_counts()

labels
0    10947
1     2134
2     8403
3     5035
4     3682
5     3053
Name: count, dtype: int64
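The counts above add up to 33,254 images, of which label 0 accounts for 10,947 (about a third) and label 1 for only 2,134, so the classes are noticeably imbalanced. Adding two `value_counts()` results with `+` works here because every label occurs in both splits; if a label were missing from one split, the sum for that label would be `NaN`. The sketch below shows a more defensive variant on toy data. The toy series are hypothetical and only serve to trigger the missing-label case.

```python
import pandas as pd

# Hypothetical toy label columns; label 2 is absent from the test split
toy_train = pd.Series([0, 0, 1, 2, 2, 2], name="labels")
toy_test = pd.Series([0, 1, 1], name="labels")

train_counts = toy_train.value_counts()
test_counts = toy_test.value_counts()

# Plain addition aligns on the index and yields NaN for the missing label
naive_total = train_counts + test_counts

# fill_value=0 treats a label missing from one split as having zero samples
total_counts = train_counts.add(test_counts, fill_value=0).astype(int).sort_index()

# Share of each class in the combined dataset
class_share = (total_counts / total_counts.sum()).round(3)

print(naive_total)
print(total_counts)
print(class_share)
```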

## **Creating the right format of the dataset**

```java
This cell reorganizes the images into one folder per class inside each split. This dataset only has training and test sets (no validation set), but the general target structure looks like this:
dataset/
├── train/
│   ├── cats/
│   │   ├── cat1.jpg
│   │   ├── cat2.jpg
│   │   ├── ...
│   ├── dogs/
│   │   ├── dog1.jpg
│   │   ├── dog2.jpg
│   │   ├── ...
├── validation/
│   ├── cats/
│   │   ├── cat1.jpg
│   │   ├── cat2.jpg
│   │   ├── ...
│   ├── dogs/
│   │   ├── dog1.jpg
│   │   ├── dog2.jpg
│   │   ├── ...
└── test/
    ├── cats/
    │   ├── cat1.jpg
    │   ├── cat2.jpg
    │   ├── ...
    └── dogs/
        ├── dog1.jpg
        ├── dog2.jpg
        ├── ...
```
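For this dataset the class folders would be the six weld labels used in the notebook rather than cats and dogs. As a rough sketch of the target layout, the snippet below creates the empty folder skeleton inside a temporary directory so that nothing on disk is touched. The helper name `make_skeleton` and the use of `pathlib` are choices made for this example, not the notebook's own approach.

```python
import tempfile
from pathlib import Path

# Class names taken from the notebook's label list
class_names = [
    "good weld",
    "burn through",
    "contamination",
    "lack of fusion",
    "misalignment",
    "lack of penetration",
]
splits = ["train", "test"]  # this dataset has no validation split


def make_skeleton(root, splits, class_names):
    """Create root/<split>/<class name>/ for every combination and return the paths."""
    created = []
    for split in splits:
        for name in class_names:
            class_dir = Path(root) / split / name
            class_dir.mkdir(parents=True, exist_ok=True)
            created.append(class_dir)
    return created


# Build the skeleton in a throwaway directory (assumption: for demonstration only)
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_skeleton(Path(tmp) / "dataset", splits, class_names)
    for d in dirs:
        print(d.relative_to(tmp))
```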


In [38]:
labels = [
    "good weld",
    "burn through",
    "contamination",
    "lack of fusion",
    "misalignment",
    "lack of penetration"
]

# Define the main dataset directory
main_path = 'al5083'

# Move every image listed in df into dataset/<subset_name>/<label name>/
def organize_images(df, subset_name):
    count = 0

    # Define the subset (train/test) directory path
    subset_dir = os.path.join("dataset", subset_name)

    for _, row in df.iterrows():
        img_path = row[f'{subset_name}_img_paths']
        label = labels[row['labels']]

        # Create the label directory (and the subset directory) if needed
        label_dir = os.path.join(subset_dir, label)
        os.makedirs(label_dir, exist_ok=True)

        # Create a unique filename to avoid conflicts
        file_extension = os.path.splitext(img_path)[1]
        unique_filename = f"{uuid.uuid4()}{file_extension}"

        # Define the destination path with the unique filename
        dest_path = os.path.join(label_dir, unique_filename)
        src_path = os.path.join(main_path, subset_name, img_path)

        # Move the image to the correct directory.
        # Note: the source file is removed, so this cell can only be run once.
        shutil.move(src_path, dest_path)
        count += 1

    # Number of images moved for this subset
    print(count)

# Organize training images
organize_images(df_train, 'train')

# Organize testing images
organize_images(df_test, 'test')

print("Dataset organized successfully!")


dataset/train/burn through
...
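After reorganizing, it can be worth checking that the number of files in each class folder matches the label counts in the dataframe. The sketch below shows one possible check on a miniature dataset built in a temporary directory. The helper `count_images_per_class`, the file names, and the toy counts are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def count_images_per_class(split_dir):
    """Return {class_folder_name: number_of_files} for one split directory."""
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(Path(split_dir).iterdir())
        if class_dir.is_dir()
    }


# Hypothetical miniature dataset: two class folders with a few empty files
with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp) / "dataset" / "train"
    for name, n_files in {"good weld": 3, "burn through": 2}.items():
        (split_dir / name).mkdir(parents=True)
        for i in range(n_files):
            (split_dir / name / f"img_{i}.png").touch()

    found = count_images_per_class(split_dir)

# Expected counts, as they would be derived from the dataframe's label column
toy_labels = pd.Series(["good weld"] * 3 + ["burn through"] * 2)
expected = toy_labels.value_counts().to_dict()

print(found)
print("counts match:", found == expected)
```

On the real data, the same idea would compare `count_images_per_class("dataset/train")` against the label counts of `df_train` mapped to class names. Because the notebook uses `shutil.move`, the source files no longer exist after the first run; replacing it with `shutil.copy2` would keep the original folders intact at the cost of extra disk space.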

## **Final output**


```javascript
Here is the final result:
```

<img src="Order.png">