# Data Processing for Dentex Dataset

In this notebook, we'll process the `xrays.zip` dataset and organize the data for training a deep learning model. The main tasks involved are:

1. **Extracting images from a ZIP file**: We'll extract the image data from the ZIP archive stored on Google Drive into the Colab runtime's local `/content` directory.
2. **Preparing the data structure**: We'll convert the annotations to YOLO labels and organize the images and labels into training and validation folders.


In [None]:
from google.colab import drive

# Unmount if already mounted
drive.flush_and_unmount()

# Mount again
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# !rm -r xrays/

In [None]:
import os

base_path = '/content/drive/MyDrive/DeepLearning/DentexDataSet/Original'
files = os.listdir(base_path)

print("Files found in the folder:")
for f in files:
    print(f)

Files found in the folder:
train_quadrant_enumeration_disease.json
xrays.zip


### Extracting the Images from the ZIP File

We begin by extracting the X-ray images from the ZIP file stored on Google Drive. The images must be unpacked before they can be matched with their annotations and copied into the training folders.

The following code opens the ZIP file and extracts all of its contents to the local disk of the Colab runtime (the `/content` directory), which is faster to read from than the mounted Drive folder.
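
Before extracting a large archive, it can help to list its members and check their integrity first. The following is a minimal sketch that builds a tiny throwaway archive in a temporary directory (the member names are invented for illustration) and then inspects and extracts it with the same `zipfile` module used in this notebook.

```python
import tempfile
import zipfile
from pathlib import Path

# Build a tiny throwaway archive; the member names are hypothetical.
tmp_dir = Path(tempfile.mkdtemp())
demo_zip = tmp_dir / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("xrays/sample_0.png", b"not a real image")
    zf.writestr("xrays/sample_1.png", b"not a real image")

with zipfile.ZipFile(demo_zip, "r") as zf:
    members = zf.namelist()        # member paths stored in the archive
    bad_member = zf.testzip()      # None when every member passes its CRC check
    zf.extractall(tmp_dir / "out")

extracted = sorted(p.name for p in (tmp_dir / "out" / "xrays").iterdir())
print(members)
print(extracted)
```

To apply the same check to the real archive, point `zipfile.ZipFile` at `zip_path` instead of the demo file.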


In [None]:
import zipfile
from pathlib import Path

zip_path = '/content/drive/MyDrive/DeepLearning/DentexDataSet/Original/xrays.zip'
extract_path = Path('/content')

# Extract the zip file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print(f"Images extracted to: {extract_path}")


Images extracted to: /content


### JSON File

In this step, we copy the JSON file `train_quadrant_enumeration_disease.json` from Google Drive to the `/content` directory in the Colab environment.

This allows us to access and process the JSON file directly within the notebook. The file contains important annotations for the X-ray images that will be used in the dataset.
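
To get a feel for the annotation file before converting it, you can count the records under each top-level key and look at one sample annotation. The sketch below uses a small hand-written JSON string in place of the real file. The key names (`images`, `annotations`, `categories_3`, `category_id_3`) are the ones read by the conversion code in this notebook; every value is invented for illustration, and the helper name `summarize` is hypothetical.

```python
import json

# Hand-written stand-in for the real annotation file; all values are invented.
sample_json = """
{
  "images": [{"id": 1, "file_name": "sample_0.png", "width": 2000, "height": 1000}],
  "annotations": [{"image_id": 1, "bbox": [100, 200, 300, 400], "category_id_3": 0}],
  "categories_3": [{"id": 0, "name": "example_disease"}]
}
"""
data = json.loads(sample_json)


def summarize(data):
    """Return the number of records under each top-level list."""
    return {key: len(value) for key, value in data.items() if isinstance(value, list)}


summary = summarize(data)
print(summary)
print(data["annotations"][0])
```

For the real file, replace `json.loads(sample_json)` with `json.load` on the opened file.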


In [None]:
!cp '/content/drive/MyDrive/DeepLearning/DentexDataSet/Original/train_quadrant_enumeration_disease.json' /content/

### Parsing the Annotations and Splitting the Data

In this step, we process the annotations for the X-ray images, which are stored in a JSON file. The JSON file contains details about each image, such as its ID, file name, bounding box coordinates, and the associated disease label for each bounding box.

We begin by loading the JSON file and iterating through each annotation to extract the bounding box coordinates and disease ID. This data is then used to generate YOLO-compatible labels: one text file per image, with one line per bounding box in the form `class_id x_center y_center width height`, where the four coordinates are normalized by the image width and height.
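
The coordinate conversion at the heart of this step can be isolated in a small helper. This sketch assumes, as the conversion code in this notebook does, that each `bbox` is stored as `[x_min, y_min, width, height]` in pixels. The function name `coco_to_yolo` and the example numbers are introduced here for illustration only.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert [x_min, y_min, width, height] in pixels to normalized YOLO values."""
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    return x_center, y_center, w / img_w, h / img_h


# Invented example: a 300x400 px box in a 2000x1000 px image, class 0.
box = coco_to_yolo([100, 200, 300, 400], img_w=2000, img_h=1000)
line = "0 " + " ".join(f"{v:.6f}" for v in box)
print(line)  # 0 0.125000 0.400000 0.150000 0.400000
```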

Additionally, we split the dataset into two parts:
- 90% of the data is used for **training**.
- 10% of the data is reserved for **validation**.
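
A split like this is easiest to reason about when it operates on unique image names, so that every image lands in exactly one subset. The sketch below shows one way to do that. The helper name `split_names`, the fixed seed, and the example file names are assumptions made for illustration.

```python
import random


def split_names(names, train_fraction=0.9, seed=42):
    """Deduplicate names, shuffle them reproducibly, and cut into two lists."""
    unique = sorted(set(names))      # sort first so the shuffle is deterministic
    rng = random.Random(seed)        # local generator; leaves global state alone
    rng.shuffle(unique)
    cut = int(train_fraction * len(unique))
    return unique[:cut], unique[cut:]


# Invented names; two of them repeat, as happens when an image has several boxes.
names = [f"img_{i}" for i in range(20)] + ["img_0", "img_1"]
train_names, val_names = split_names(names)
print(len(train_names), len(val_names))
```

Deduplicating before the split matters because an image with several annotations would otherwise be counted several times and could be drawn into the validation set more than once.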

The images and their corresponding labels are then organized into the following directories:
- `train/images` and `train/labels` for the training set.
- `val/images` and `val/labels` for the validation set.

This structure ensures that the model is trained on one subset of the data and validated on another, helping evaluate its performance on unseen images.
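
Once the folders are populated, a quick consistency check can confirm that every image has a label file and that no image appears in both subsets. The following is a minimal sketch under the folder layout described above. The helper name `check_split` is hypothetical, and the demo tree it runs on is created in a temporary directory with invented file names.

```python
import tempfile
from pathlib import Path


def check_split(root):
    """Count images per split and report stems that appear in both splits."""
    root = Path(root)
    stems = {}
    for split in ("train", "val"):
        images = {p.stem for p in (root / split / "images").iterdir()}
        labels = {p.stem for p in (root / split / "labels").iterdir()}
        unlabeled = images - labels
        if unlabeled:
            raise ValueError(f"{split}: images without labels: {sorted(unlabeled)}")
        stems[split] = images
    return {
        "train": len(stems["train"]),
        "val": len(stems["val"]),
        "overlap": sorted(stems["train"] & stems["val"]),
    }


# Build a tiny throwaway tree to exercise the check; file names are invented.
demo_root = Path(tempfile.mkdtemp())
layout = {"train": ["a", "b", "c"], "val": ["d"]}
for split, names in layout.items():
    for sub, ext in (("images", ".jpg"), ("labels", ".txt")):
        folder = demo_root / split / sub
        folder.mkdir(parents=True)
        for name in names:
            (folder / f"{name}{ext}").touch()

report = check_split(demo_root)
print(report)
```

An empty `overlap` list means the validation images are genuinely unseen during training.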


In [None]:
import json
import os
import random
import shutil

# Paths for Colab
json_path = "/content/train_quadrant_enumeration_disease.json"
images_base_path = "/content/xrays"
output_dir = "/content/drive/MyDrive/DeepLearning/DentexDataSet/Processed/data"

# Create output directories
splits = ["train", "val"]
for split in splits:
    os.makedirs(os.path.join(output_dir, f"{split}/images"), exist_ok=True)
    os.makedirs(os.path.join(output_dir, f"{split}/labels"), exist_ok=True)

train_images_dir = os.path.join(output_dir, "train/images")
train_labels_dir = os.path.join(output_dir, "train/labels")
val_images_dir = os.path.join(output_dir, "val/images")
val_labels_dir = os.path.join(output_dir, "val/labels")

# Load JSON
with open(json_path, "r") as f:
    data = json.load(f)

# Map image ID to image metadata
image_info = {img["id"]: img for img in data["images"]}

# Map disease ID to disease name
categories_3 = data.get("categories_3", [])
disease_id_to_name = {cat["id"]: cat["name"] for cat in categories_3}
disease_name_to_class_id = {}
current_class_id = 0

# Image names, one entry per written annotation (names may repeat)
image_files = []

# Process annotations
for ann in data["annotations"]:
    image_id = ann["image_id"]
    bbox = ann["bbox"]
    cat3 = ann.get("category_id_3")

    disease_name = disease_id_to_name.get(cat3)
    if disease_name is None:
        continue

    if disease_name not in disease_name_to_class_id:
        disease_name_to_class_id[disease_name] = current_class_id
        current_class_id += 1

    class_id = disease_name_to_class_id[disease_name]

    # Image information
    image = image_info[image_id]
    file_name = os.path.splitext(image["file_name"])[0]
    img_w, img_h = image["width"], image["height"]

    # YOLO normalized coordinates
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    w /= img_w
    h /= img_h

    # Paths for source and destination
    image_path = os.path.join(train_images_dir, f"{file_name}.jpg")
    label_path = os.path.join(train_labels_dir, f"{file_name}.txt")
    image_src_path = os.path.join(images_base_path, image["file_name"])

    if not os.path.exists(image_src_path):
        print(f"Image not found: {image_src_path}")
        continue

    # Copy each image only once, even when it has several annotations
    if not os.path.exists(image_path):
        shutil.copy(image_src_path, image_path)

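    # Write YOLO label (append mode: one line per box; clear the labels folder
    # before re-running this cell, otherwise lines are duplicated)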
    with open(label_path, "a") as f:
        f.write(f"{class_id} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}\n")

    # Save file name for later split
    image_files.append(file_name)

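# Split into train/val on unique image names so no image ends up in both sets
unique_files = sorted(set(image_files))
random.seed(42)  # arbitrary fixed seed so the split is reproducible
random.shuffle(unique_files)
train_split = int(0.9 * len(unique_files))
val_files = unique_files[train_split:]

# Move (rather than copy) validation files out of the training folders
for file_name in val_files:
    shutil.move(os.path.join(train_images_dir, f"{file_name}.jpg"),
                os.path.join(val_images_dir, f"{file_name}.jpg"))
    shutil.move(os.path.join(train_labels_dir, f"{file_name}.txt"),
                os.path.join(val_labels_dir, f"{file_name}.txt"))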

print("✅ YOLO conversion completed. Data organized into train and validation sets.")


✅ YOLO conversion completed. Data organized into train and validation sets.


### Saving Processed Data

Once the images and annotations are converted to YOLO format, they are stored in the folder structure described above: `train/images` and `train/labels` for the training set, and `val/images` and `val/labels` for the validation set.

The processed data is saved to Google Drive for further use in model training. This structure makes it easier to manage the dataset and use it for training deep learning models such as YOLO.
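
Many YOLO training tools read a small YAML file that says where the splits live and what each class ID means. The sketch below generates such a file in the layout used by Ultralytics-style trainers (`path`, `train`, `val`, `names`). Whether your trainer expects exactly this layout is an assumption to verify. The helper name `build_dataset_yaml` is hypothetical, and the class names shown are placeholders, not the dataset's actual labels; in this notebook the real mapping would come from `disease_name_to_class_id`.

```python
def build_dataset_yaml(root, name_to_id):
    """Render a dataset description with class names ordered by class ID."""
    lines = [
        f"path: {root}",
        "train: train/images",
        "val: val/images",
        "names:",
    ]
    for name, class_id in sorted(name_to_id.items(), key=lambda item: item[1]):
        lines.append(f"  {class_id}: {name}")
    return "\n".join(lines) + "\n"


# Placeholder mapping for illustration only.
example_mapping = {"placeholder_b": 1, "placeholder_a": 0}
yaml_text = build_dataset_yaml(
    "/content/drive/MyDrive/DeepLearning/DentexDataSet/Processed/data",
    example_mapping,
)
print(yaml_text)
```

Writing `yaml_text` to a file next to the `train` and `val` folders keeps the class mapping together with the data it describes.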

You can now proceed to train your model using this dataset.
