# Data Processing for Dentex Dataset

In this notebook, we'll process the `xrays.zip` dataset and organize the data for training a deep learning model. The main tasks involved are:

1. **Extracting images from a ZIP file**: We'll extract the image data from the ZIP archive stored on Google Drive into the `/content` directory of the Colab runtime.
2. **Preparing the data structure**: We'll organize the images and labels into the `train/images` and `train/labels` folders used for training.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import os

base_path = '/content/drive/MyDrive/DeepLearning/DentexDataSet/Original'
files = os.listdir(base_path)

print("Files found in the folder:")
for f in files:
    print(f)

### Extracting the Images from the ZIP File

We begin by extracting the X-ray images from the ZIP archive stored on Google Drive, since the images must be decompressed before they can be processed.
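
Before extracting, it can be worth checking that the archive is intact and contains the kind of files we expect. The following is a minimal sketch using only the standard `zipfile` module; the helper name `summarize_archive` is hypothetical and not part of the original notebook. Note that `testzip()` reads every member, so it may be slow for a large archive on Google Drive.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Return (corrupt_member, extension_counts) for a ZIP archive.

    corrupt_member is None when every member passes its CRC check.
    """
    with zipfile.ZipFile(zip_path, "r") as zf:
        # testzip() returns the name of the first bad member, or None
        corrupt_member = zf.testzip()
        # Count file extensions (lowercased), skipping directory entries
        extension_counts = Counter(
            PurePosixPath(info.filename).suffix.lower()
            for info in zf.infolist()
            if not info.is_dir()
        )
    return corrupt_member, extension_counts
```

In the notebook this could be called as `summarize_archive(zip_path)` once `zip_path` is defined, for example to confirm which image extension the archive actually uses.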

The following code opens the ZIP file and extracts all of its contents to the `/content` directory on the Colab runtime's local disk.


In [None]:
import zipfile
from pathlib import Path

zip_path = '/content/drive/MyDrive/DeepLearning/DentexDataSet/Original/xrays.zip'
extract_path = Path('/content')

# Extract the zip file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print(f"Images extracted to: {extract_path}")


### JSON File

In this step, we copy the JSON file `train_quadrant_enumeration_disease.json` from Google Drive to the `/content` directory in the Colab environment.

This allows us to access and process the JSON file directly within the notebook. The file contains important annotations for the X-ray images that will be used in the dataset.


In [None]:
!cp '/content/drive/MyDrive/DeepLearning/DentexDataSet/Original/train_quadrant_enumeration_disease.json' /content/

### Parsing the Annotations

The annotations for the X-ray images are stored in a JSON file. This file contains information about the images, such as their IDs, file names, and the bounding box coordinates for each image, as well as the disease label associated with each bounding box.
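
To make this structure concrete, here is a hypothetical miniature of the file. The key names (`images`, `annotations`, `categories_3`, `category_id_3`, `bbox`) are the ones the parsing code in this notebook reads; every value, including the file name and disease name, is made up for illustration. The `describe` helper is also an assumption, not part of the original notebook.

```python
import json

# Hypothetical miniature annotation file: key names follow the parsing code,
# all values are invented for illustration.
sample = {
    "images": [
        {"id": 1, "file_name": "example_0.jpg", "width": 2000, "height": 1000}
    ],
    "categories_3": [{"id": 0, "name": "Example disease"}],
    "annotations": [
        # bbox is [x_min, y_min, width, height] in pixels
        {"image_id": 1, "bbox": [500.0, 250.0, 200.0, 100.0], "category_id_3": 0}
    ],
}


def describe(data):
    """Summarize the sections of the annotation dict that this notebook uses."""
    return {
        "num_images": len(data["images"]),
        "num_annotations": len(data["annotations"]),
        "disease_names": [cat["name"] for cat in data.get("categories_3", [])],
    }


print(json.dumps(describe(sample), indent=2))
```

Running `describe(data)` on the real file after loading it is a quick way to confirm that the expected sections are present before converting anything.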

We load the JSON file, iterate over each annotation, and extract the relevant information (e.g., bounding boxes and disease IDs). The annotations are then used to generate YOLO-compatible labels for training our model.
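
The core of the conversion is the bounding box arithmetic: the JSON stores each box as `[x_min, y_min, width, height]` in pixels, while a YOLO label line stores the box center and size normalized by the image dimensions. The sketch below isolates that step in a hypothetical helper, `coco_to_yolo`, and applies it to a made-up box on a made-up 1000x500 image.

```python
def coco_to_yolo(bbox, img_w, img_h):
    """Convert [x_min, y_min, width, height] in pixels to normalized
    YOLO (x_center, y_center, width, height)."""
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    return x_center, y_center, w / img_w, h / img_h


# Hypothetical box on a hypothetical 1000x500 image, class ID 0
x_c, y_c, w_n, h_n = coco_to_yolo([100, 200, 50, 80], img_w=1000, img_h=500)
print(f"0 {x_c:.6f} {y_c:.6f} {w_n:.6f} {h_n:.6f}")
# prints: 0 0.125000 0.480000 0.050000 0.160000
```

Here the center is at x = 100 + 50/2 = 125 pixels and y = 200 + 80/2 = 240 pixels, which become 0.125 and 0.48 after dividing by the image width and height.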


In [None]:
import json
import os
import shutil

# Paths
json_path = "/content/train_quadrant_enumeration_disease.json"
images_base_path = "/content/xrays"
output_dir = "/content/drive/MyDrive/DeepLearning/DentexDataSet/Processed/data"

train_images_dir = os.path.join(output_dir, "train/images")
train_labels_dir = os.path.join(output_dir, "train/labels")

# Create output directories
os.makedirs(train_images_dir, exist_ok=True)
os.makedirs(train_labels_dir, exist_ok=True)

# Load JSON file
with open(json_path, "r") as f:
    data = json.load(f)

# Auxiliary mappings
image_info = {img["id"]: img for img in data["images"]}
categories_3 = data.get("categories_3", [])
disease_id_to_name = {cat["id"]: cat["name"] for cat in categories_3}
disease_name_to_class_id = {}
current_class_id = 0

# Process annotations
for ann in data["annotations"]:
    image_id = ann["image_id"]
    bbox = ann["bbox"]
    disease_id = ann.get("category_id_3")

    disease_name = disease_id_to_name.get(disease_id)
    if disease_name is None:
        continue

    # Assign class_id if not already assigned
    if disease_name not in disease_name_to_class_id:
        disease_name_to_class_id[disease_name] = current_class_id
        current_class_id += 1

    class_id = disease_name_to_class_id[disease_name]
    image = image_info[image_id]
    file_name = os.path.splitext(image["file_name"])[0]
    img_w, img_h = image["width"], image["height"]

    # YOLO bounding box format
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    w /= img_w
    h /= img_h

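    # File paths
    # Note: the destination always uses a .jpg suffix; shutil.copy does not
    # convert image formats, so this assumes the source files are JPEGs.
    image_src_path = os.path.join(images_base_path, image["file_name"])
    image_dst_path = os.path.join(train_images_dir, f"{file_name}.jpg")
    label_path = os.path.join(train_labels_dir, f"{file_name}.txt")

    # Check if image exists; copy it only once, even with several annotations
    if os.path.exists(image_src_path):
        if not os.path.exists(image_dst_path):
            shutil.copy(image_src_path, image_dst_path)

        # Append this bounding box to the image's .txt file in YOLO format.
        # Append mode means re-running this cell duplicates label lines,
        # so clear train/labels before a re-run.
        with open(label_path, "a") as f:
            f.write(f"{class_id} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}\n")
    else:
        print(f"Image not found: {image_src_path}")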

print("YOLO structure created successfully!")


### Saving Processed Data

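The processing cell writes its output into a new folder structure: `train/images` for the images and `train/labels` for the corresponding annotation files, with one `.txt` label file per image that shares the image's base name. Only images with at least one disease-labeled annotation are copied.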
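
A quick consistency check on the resulting folders can catch problems before training. The sketch below is one possible check, not part of the original notebook: the helper `check_yolo_split` is hypothetical, and it assumes each label line has five whitespace-separated fields (class ID followed by four normalized coordinates), as written by the processing cell.

```python
from pathlib import Path


def check_yolo_split(images_dir, labels_dir):
    """Return a list of human-readable problems found in a YOLO split."""
    problems = []
    images_dir, labels_dir = Path(images_dir), Path(labels_dir)
    # Base names (without extension) of all image files
    image_stems = {p.stem for p in images_dir.iterdir() if p.is_file()}

    for label_file in sorted(labels_dir.glob("*.txt")):
        if label_file.stem not in image_stems:
            problems.append(f"{label_file.name}: no matching image")
        lines = label_file.read_text().splitlines()
        for line_no, line in enumerate(lines, 1):
            parts = line.split()
            if len(parts) != 5:
                problems.append(f"{label_file.name}:{line_no}: expected 5 fields")
                continue
            try:
                coords = [float(v) for v in parts[1:]]
            except ValueError:
                problems.append(f"{label_file.name}:{line_no}: non-numeric value")
                continue
            # Normalized coordinates should lie in [0, 1]
            if not all(0.0 <= v <= 1.0 for v in coords):
                problems.append(
                    f"{label_file.name}:{line_no}: coordinate outside [0, 1]"
                )
    return problems
```

In this notebook it could be called as `check_yolo_split(train_images_dir, train_labels_dir)`; an empty list means no problems were found by these particular checks.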

The processed data is saved to Google Drive for further use in model training. This structure makes it easier to manage the dataset and use it for training deep learning models such as YOLO.
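
Because the processing cell assigns class IDs in order of first appearance, the mapping from class ID to disease name exists only in the `disease_name_to_class_id` dictionary and is lost when the runtime ends. One way to keep it is to save the names next to the dataset. This is a minimal sketch: the helper `save_class_names` and the file name `classes.json` are hypothetical choices, not something the original notebook defines.

```python
import json


def save_class_names(name_to_id, out_path):
    """Write class names ordered by class ID to a JSON file; return the list."""
    # Sort by class ID so that list index == class ID
    names = [name for name, _ in sorted(name_to_id.items(), key=lambda kv: kv[1])]
    with open(out_path, "w") as f:
        json.dump(names, f, indent=2)
    return names
```

For example, `save_class_names(disease_name_to_class_id, os.path.join(output_dir, "classes.json"))` would store the list so that the index of each name matches the class ID used in the label files.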

You can now proceed to train your model using this dataset.
