## Rose-Boosted Training Configuration

### Why this version exists
During our initial YOLOv8 training, the dataset was **imbalanced**: the *rose* class had significantly fewer samples than *weeds*.
As a result, the model tended to **underperform on roses**, missing detections or confusing them with background vegetation.

To address this, we created a **rose-boosted dataset**:
- All rose images were **oversampled ×3** to balance class frequency.
- A new YAML file, `data_rose_boosted.yaml`, was generated automatically by the oversampling script.
- The YAML still points to the same validation/test sets, ensuring **evaluation consistency**.

### How to use it
To train using the boosted dataset:
1. Change the `DATA_YAML` path to point to the new file **(Cell number 2, "Paths & Config")**:
   ```python
   DATA_YAML = PROJECT_ROOT / "data/weeds_yolo/data_rose_boosted.yaml"
   ```


### Step 1 – Define base paths and configuration
This cell sets up the main folders and constants:
- `BASE` → points to your `data/weeds_yolo` directory
- `DATA_YAML` → the original dataset config
- `ROSE_ID` → the numeric class ID of "rose"
- `FACTOR` → how many times to oversample rose images


```python
from pathlib import Path
from glob import glob
import re

import yaml

# Root of the YOLO dataset and the original dataset config
BASE      = Path(r"D:/Ai Systems Group/data/weeds_yolo")
DATA_YAML = BASE / "data.yaml"

ROSE_ID = 0      # class id for "rose"
FACTOR  = 3      # number of times each rose image appears in the boosted list
```
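
`ROSE_ID` is hard-coded above. As a minimal sketch of an alternative, the id could be looked up from the `names` entry of the loaded YAML instead. The helper `class_id_for` and the example class lists are hypothetical and only illustrate the idea; it assumes `names` is either a list (index = id) or a dict mapping id to name, the two layouts commonly seen in YOLO dataset configs.

```python
def class_id_for(names, class_name):
    """Return the numeric id of class_name, or raise KeyError if it is absent.

    `names` may be a list (index = id) or a dict mapping id -> name.
    """
    items = names.items() if isinstance(names, dict) else enumerate(names)
    for class_id, name in items:
        if str(name).lower() == class_name.lower():
            return int(class_id)
    raise KeyError(f"class {class_name!r} not found in names")


# Hypothetical examples; the real names come from data.yaml
print(class_id_for(["rose", "weed"], "rose"))        # list layout
print(class_id_for({0: "rose", 1: "weed"}, "weed"))  # dict layout
```

With the YAML loaded as `y` (as in Step 3), `class_id_for(y["names"], "rose")` would replace the constant, provided the config actually contains a `names` key.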


### Step 2 – Define helper functions
- `collect_images()` → reads all training image paths (from a folder or a `.txt` list)
- `image_to_label()` → converts an image path to its corresponding label file path


```python
def collect_images(train_entry):
    """Return a list of train image Paths from a directory or a .txt list."""
    p = Path(train_entry)
    if p.is_file() and p.suffix.lower() == ".txt":
        # One image path per line; relative paths are resolved against the current working directory
        entries = [ln.strip() for ln in p.read_text().splitlines() if ln.strip()]
        return [Path(e) if Path(e).is_absolute() else Path(e).resolve() for e in entries]
    if p.is_dir():
        exts = ("*.jpg", "*.jpeg", "*.png", "*.bmp", "*.tif", "*.tiff")
        files = []
        for e in exts:
            files += glob(str(p / e))
        return [Path(f) for f in sorted(files)]  # sorted for a reproducible order
    raise FileNotFoundError(f"Unsupported train entry: {train_entry}")


def image_to_label(img_path: Path):
    """Map an image path to its label path (images/ -> labels/, extension -> .txt)."""
    s = img_path.as_posix()
    s = re.sub(r"/images?/", "/labels/", s)
    s = re.sub(r"\.(jpg|jpeg|png|bmp|tif|tiff)$", ".txt", s, flags=re.I)
    p = Path(s)
    return p if p.is_absolute() else (BASE / p).resolve()
```
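
To make the path mapping concrete, here is a small standalone sketch that applies the same two substitutions as `image_to_label()` to plain path strings. The function name `label_path_for` and the example file names are hypothetical and exist only for illustration.

```python
import re

IMAGE_EXT = re.compile(r"\.(jpg|jpeg|png|bmp|tif|tiff)$", flags=re.I)


def label_path_for(image_path: str) -> str:
    """Apply the images/ -> labels/ and extension -> .txt substitutions to a POSIX-style path."""
    s = re.sub(r"/images?/", "/labels/", image_path)
    return IMAGE_EXT.sub(".txt", s)


# Hypothetical paths, for illustration only
print(label_path_for("train/images/rose_001.JPG"))  # train/labels/rose_001.txt
print(label_path_for("train/image/rose_002.png"))   # train/labels/rose_002.txt
```

One detail worth noticing: the first pattern requires a leading slash, so a relative path that starts with `images/` keeps its folder name and only the extension changes.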


### Step 3 – Load the original dataset YAML and separate images
This cell:
1. Loads `data.yaml`
2. Collects all training images
3. Reads label files to see which ones contain the class ID for "rose"
4. Splits them into `rose_imgs` and `other_imgs` lists


```python
with open(DATA_YAML, "r") as f:
    y = yaml.safe_load(f)

# Resolve the train entry against the dataset root if it is not absolute
train_entry = y["train"]
if not Path(train_entry).is_absolute():
    train_entry = (BASE / train_entry).resolve()

imgs = collect_images(train_entry)

rose_imgs, other_imgs = [], []
for img in imgs:
    if not img.exists():
        continue  # skip list entries whose image file is missing
    lab = image_to_label(img)
    if not lab.exists():
        other_imgs.append(img)  # no label file: treat as a non-rose image
        continue
    lines = [ln.strip() for ln in lab.read_text().splitlines() if ln.strip()]
    # The first field of each label line is the class id
    has_rose = any(int(ln.split()[0]) == ROSE_ID for ln in lines)
    (rose_imgs if has_rose else other_imgs).append(img)

print(f"Train images: {len(imgs)} | rose: {len(rose_imgs)} | other: {len(other_imgs)}")
```


Train images: 153 | rose: 152 | other: 1
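
The split above counts images, not annotated objects. As a rough complementary check, the following sketch counts label lines per class id, which may give a clearer picture of the imbalance at the object level. The helper `count_instances` is hypothetical and assumes the usual YOLO label layout, where the first field of each line is the class id.

```python
from collections import Counter
from pathlib import Path


def count_instances(label_files):
    """Count YOLO label lines per class id across the given label files."""
    counts = Counter()
    for lab in label_files:
        lab = Path(lab)
        if not lab.exists():
            continue  # images without a label file contribute nothing
        for ln in lab.read_text().splitlines():
            if ln.strip():
                counts[int(ln.split()[0])] += 1
    return counts
```

In this notebook it could be called as `count_instances(image_to_label(p) for p in imgs)`, and `counts[ROSE_ID]` would then be the number of annotated roses in the training set.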


### Step 4 – Oversample rose images and create boosted dataset files
1. Lists each rose image `FACTOR` times and every other image once
2. Writes the list to a new `train_boosted_rose_{FACTOR}x.txt` in the same folder as `data.yaml`
3. Writes a new `data_rose_boosted.yaml` pointing to the boosted train list
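
As a quick sanity check on the size of the boosted list, the arithmetic can be written out. The function name `boosted_list_size` is hypothetical; the counts are the ones printed in Step 3.

```python
def boosted_list_size(n_rose: int, n_other: int, factor: int) -> int:
    """Number of lines in the boosted list: other images once, rose images `factor` times."""
    return n_other + n_rose * factor


# Counts printed in Step 3: 152 rose images, 1 other image, FACTOR = 3
print(boosted_list_size(152, 1, 3))  # 457
```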


```python
# Every non-rose image once, every rose image FACTOR times
boost_list = [p.as_posix() for p in other_imgs] + [p.as_posix() for p in rose_imgs] * FACTOR
boost_txt  = BASE / f"train_boosted_rose_{FACTOR}x.txt"
boost_txt.write_text("\n".join(boost_list))
print(f"Boosted list saved -> {boost_txt} (total {len(boost_list)} lines)")

# Copy the original config and replace only the train entry; val/test stay unchanged
boosted_yaml = BASE / "data_rose_boosted.yaml"
y_boost = dict(y)
y_boost["train"] = str(boost_txt)

with open(boosted_yaml, "w") as f:
    yaml.safe_dump(y_boost, f, sort_keys=False)

print(f"Boosted YAML -> {boosted_yaml}")
```


Boosted list saved -> D:\Ai Systems Group\data\weeds_yolo\train_boosted_rose_3x.txt (total 457 lines)
Boosted YAML -> D:\Ai Systems Group\data\weeds_yolo\data_rose_boosted.yaml
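
To confirm that the written list has the intended repetition pattern, one option is to count how often each path appears. This is a minimal sketch; `repetition_histogram` and the demo file names are hypothetical.

```python
from collections import Counter


def repetition_histogram(lines):
    """Map 'times a path appears' -> 'number of distinct paths with that count'."""
    per_path = Counter(ln.strip() for ln in lines if ln.strip())
    return Counter(per_path.values())


# Tiny hypothetical list: one non-rose image once, two rose images three times each
demo = ["other.jpg"] + ["rose_a.jpg", "rose_b.jpg"] * 3
print(repetition_histogram(demo))
```

Applied to the real file with `repetition_histogram(boost_txt.read_text().splitlines())`, the result would be expected to show 152 paths appearing 3 times and 1 path appearing once, assuming the original train list contained no duplicates.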
