# Introduction
In this notebook, we demonstrate how to collect insect images, prepare them for training, and train a model using the Bplusplus library.

The steps include:
1. Installing the required packages.
2. Importing the necessary modules.
3. Setting up the directories for data storage.
4. Collecting insect images from the Global Biodiversity Information Facility (GBIF).
5. Preparing the collected images for training.
6. Training a YOLO, ResNet, or multitask ResNet model on the prepared dataset.
7. Validating the trained model.

Please do not run every cell in the notebook. Choose one pipeline (one-stage YOLO, two-stage ResNet, or two-stage multitask ResNet) and run only the cells that belong to it.




## Make virtual environment (recommended)
It is recommended to create a virtual environment to manage dependencies and avoid conflicts.
 
To create a virtual environment, open your terminal and run the following commands:
 
```bash
python3 -m venv bplusplus_env
source bplusplus_env/bin/activate
```

This will create and activate a virtual environment named `bplusplus_env`.
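
To confirm from inside the notebook that the kernel is really running in that environment, one option is to compare `sys.prefix` with `sys.base_prefix`, which differ inside a `venv`. This is a minimal sketch; the helper name `in_virtualenv` is ours and is not part of Bplusplus. Conda environments are not detected by this check.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix

print("Running in a virtual environment:", in_virtualenv())
```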

## Install required packages

In [None]:
# Uncomment the next line to install Bplusplus from within the notebook.
#!pip install bplusplus

## Import required packages

In [1]:
import bplusplus
from typing import Any
from pathlib import Path
import requests
from tqdm import tqdm

## Set directories

In [2]:
MAIN_DIR = Path("/mnt/nvme0n1p1/datasets/bplusplus-update")

GBIF_DATA_DIR = MAIN_DIR / "GBIF_data"
PREPARED_DATA_DIR = MAIN_DIR / "prepared_data"
TRAINED_MODEL_DIR = MAIN_DIR / "trained_model"
TEST_DATA_DIR = MAIN_DIR / "test_data"  # optional: a separate dataset for testing the two stage models
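
The `MAIN_DIR` path above is specific to the machine this notebook was written on. One way to avoid editing the cell on every machine is to read the location from an environment variable with a fallback. The variable name `BPLUSPLUS_MAIN_DIR`, the default folder name, and the helper `resolve_main_dir` are hypothetical and chosen only for this sketch; Bplusplus itself does not read them.

```python
import os
from pathlib import Path

def resolve_main_dir(default: str = "bplusplus-data") -> Path:
    """Pick the working directory from a (hypothetical) environment variable."""
    # BPLUSPLUS_MAIN_DIR is an assumed name used for illustration only.
    return Path(os.environ.get("BPLUSPLUS_MAIN_DIR", default)).expanduser()

main_dir = resolve_main_dir()
print(main_dir)
```

To use it, assign the result to `MAIN_DIR` before the other directories are derived from it.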

## Collect insect images from GBIF

In [3]:
names = [
    "Coccinella septempunctata", "Apis mellifera", "Bombus lapidarius", "Bombus terrestris",
    "Eupeodes corollae", "Episyrphus balteatus", "Aglais urticae", "Vespula vulgaris",
    "Eristalis tenax",
]

search: dict[str, Any] = {
    "scientificName": names
}

In [None]:
bplusplus.collect(
    group_by_key=bplusplus.Group.scientificName,
    search_parameters=search, 
    images_per_group=3000,
    output_directory=GBIF_DATA_DIR,
    num_threads=3
)


Exception in thread Thread-9 (__collect_subset):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/mnt/nvme0n1p1/mit/bplusplus-env/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 766, in run_closure
    _threading_Thread_run(self)
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/nvme0n1p1/mit/Bplusplus/src/bplusplus/collect.py", line 160, in __collect_subset
    __single_collect(
  File "/mnt/nvme0n1p1/mit/Bplusplus/src/bplusplus/collect.py", line 42, in __single_collect
    __create_folders(
  File "/mnt/nvme0n1p1/mit/Bplusplus/src/bplusplus/collect.py", line 146, in __create_folders
    os.makedirs(directory)
  File "/usr/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/mnt/nvme0n1p1/datasets/bplusplus-update/GBIF_data'
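
The traceback above ends in `os.makedirs(directory)` raising `FileExistsError`. By default `os.makedirs` fails when the target already exists, which can happen when several threads try to create the same folder. The sketch below reproduces that behaviour in a temporary directory and shows the `exist_ok=True` variant, which tolerates an existing folder. It only illustrates the standard-library call; whether Bplusplus should be changed this way is an assumption on our part.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "GBIF_data")
    os.makedirs(target)                  # first creation succeeds

    try:
        os.makedirs(target)              # second creation fails, as in the traceback
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True

    os.makedirs(target, exist_ok=True)   # idempotent: no error if the folder exists
    print("Second plain call failed:", second_call_failed)
```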


Thread 0 starting collection for 3 species.
Creating folders for images...
Thread 1 starting collection for 3 species.
Creating folders for images...
Thread 2 starting collection for 3 species.
Creating folders for images...
Beginning to collect images from GBIF...
Beginning to collect images from GBIF...
Downloading 3000 images into the Aglais urticae folder...


Downloading images for Aglais urticae:   0%|          | 2/3000 [00:01<45:10,  1.11image/s]  

Downloading 3000 images into the Bombus terrestris folder...


Downloading images for Aglais urticae:   5%|▍         | 149/3000 [02:58<54:51,  1.15s/image]  
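
Because one thread crashed in the output above, it can be worth checking how many images each species actually received before moving on. The log suggests one folder per species inside `GBIF_DATA_DIR`; the sketch below assumes that layout and counts files with common image extensions. The helper `count_images_per_folder` and the extension list are assumptions, not part of Bplusplus.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image extensions

def count_images_per_folder(root) -> dict[str, int]:
    """Count image files in each immediate subfolder of `root`."""
    counts = {}
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[folder.name] = sum(
            1
            for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

For example, `count_images_per_folder(GBIF_DATA_DIR)` returns a dictionary that maps each folder name to its image count.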

## Prepare the dataset for training (yolov8)

For one-stage YOLO, you may want to filter by insect size, since small insects are desirable for training.

In [None]:
bplusplus.prepare(
    input_directory=GBIF_DATA_DIR,
    output_directory=PREPARED_DATA_DIR,
    one_stage=True,
    with_background=True,  # set to False if you don't want to download/include background images
    size_filter=True,  # set to True to filter images by insect size
    sizes=["small", "medium"]  # list of insect sizes to keep
)

## Prepare the dataset for training (two stage)

When preparing the GBIF data for the two-stage models, you may want to keep only large insects: this data is used only to classify cropped bounding boxes, so large insects are desirable. Even if you do not filter by size, please run the preparation step anyway, because it splits the data into train and valid sets for training.

In [None]:
bplusplus.prepare(
    input_directory=GBIF_DATA_DIR,
    output_directory=PREPARED_DATA_DIR,
    one_stage=False
)
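
The split into train and valid sets mentioned above can be pictured with the following sketch. It is a generic seeded random split written for illustration, not the code that `bplusplus.prepare` runs; the function name `split_train_valid`, the 80/20 ratio, and the seed are all assumptions.

```python
import random

def split_train_valid(items, valid_fraction=0.2, seed=0):
    """Shuffle `items` reproducibly and split them into (train, valid) lists."""
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # local RNG, global state is untouched
    n_valid = int(round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

train, valid = split_train_valid([f"img_{i}.jpg" for i in range(10)])
print(len(train), len(valid))
```

Using a fixed seed keeps the split identical between runs, which makes validation results comparable.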

## Train the model (yolov8)

In [None]:
model = bplusplus.train(
    input_yaml=str(PREPARED_DATA_DIR / "dataset.yaml"),
    output_directory=TRAINED_MODEL_DIR,
    epochs=2,
    imgsz=256,
    batch=4
)
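
To get a feel for what `epochs=2` and `batch=4` mean in terms of work, the number of optimisation steps is roughly the number of training images divided by the batch size, rounded up, times the number of epochs. The image count below is hypothetical (the real number depends on what the preparation step kept), and the helper `total_steps` is ours.

```python
import math

def total_steps(num_images: int, batch: int, epochs: int) -> int:
    """Approximate number of optimisation steps for a training run."""
    steps_per_epoch = math.ceil(num_images / batch)  # last batch may be partial
    return steps_per_epoch * epochs

# Hypothetical dataset of 1000 training images.
print(total_steps(num_images=1000, batch=4, epochs=2))  # 250 steps per epoch, 500 in total
```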

## Train the model (resnet standard)

In [None]:
species_list = names

In [None]:
bplusplus.train_resnet(
    species_list,
    model_type='resnet50',
    batch_size=4,
    num_epochs=2,
    patience=5,
    output_dir=TRAINED_MODEL_DIR,
    data_dir=PREPARED_DATA_DIR,
    img_size=256
)
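
The `patience` argument conventionally controls early stopping: training halts once the validation metric has not improved for that many consecutive epochs. The sketch below shows that general technique on a list of made-up validation losses. It is an assumption that `train_resnet` follows this convention, and `epochs_run` is a hypothetical helper. Under this convention, `patience=5` cannot trigger when `num_epochs=2`.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before early stopping halts training."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch          # stop early
    return len(val_losses)            # every epoch was run

# Made-up losses: no new best value after the third epoch.
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.63, 0.64]
print(epochs_run(losses, patience=3))       # stops at epoch 6
print(epochs_run(losses[:2], patience=5))   # runs both epochs
```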

## Train the model (resnet multitask)

In [None]:
bplusplus.train_multitask(
    batch_size=4,
    epochs=2,
    patience=3,
    img_size=256,
    data_dir=PREPARED_DATA_DIR,
    output_dir=TRAINED_MODEL_DIR,
    species_list=species_list
)


## Validate the model (yolov8)

In [None]:
metrics = bplusplus.validate(model, str(PREPARED_DATA_DIR / "dataset.yaml"))
print(metrics)

## For two stage model validation, download localisation weights from: 

https://github.com/orlandocloss/TwoStageInsectDetection/releases/download/models/small-generic.pt

Alternatively, run the following cells to download the weights programmatically.


In [None]:
def __download_file_from_github_release(url, dest_path):
    """
    Downloads a file from a given GitHub release URL and saves it to the specified destination path,
    with a progress bar displayed while the download runs.

    Args:
        url (str): The URL of the file to download.
        dest_path (Path): The destination path where the file will be saved.

    Raises:
        Exception: If the server does not respond with HTTP 200.
    """
    response = requests.get(url, stream=True)

    # Fail early, before creating the destination file or the progress bar.
    if response.status_code != 200:
        raise Exception(f"Failed to download file from {url}")

    total_size = int(response.headers.get('content-length', 0))
    block_size = 1024  # 1 KiB per chunk

    # Make sure the destination folder exists (for example if training was skipped).
    Path(dest_path).parent.mkdir(parents=True, exist_ok=True)

    # Both context managers close cleanly even if the download is interrupted.
    with open(dest_path, 'wb') as f, tqdm(total=total_size, unit='iB', unit_scale=True) as progress_bar:
        for chunk in response.iter_content(chunk_size=block_size):
            f.write(chunk)
            progress_bar.update(len(chunk))

In [None]:
YOLO_WEIGHTS = TRAINED_MODEL_DIR / "small-generic.pt"

In [None]:
github_release_url = 'https://github.com/orlandocloss/TwoStageInsectDetection/releases/download/models/small-generic.pt'

if not YOLO_WEIGHTS.exists():
    __download_file_from_github_release(github_release_url, YOLO_WEIGHTS)
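
After downloading, one way to record exactly which weights file was used is to compute its SHA-256 digest. The sketch reads the file in chunks, so a large weights file does not need to fit in memory. No reference digest is published in this notebook, so the value is only useful for comparing copies of the file; `sha256_of_file` is a hypothetical helper.

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of_file(YOLO_WEIGHTS)` returns a 64-character hexadecimal string.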

## Validate the model (resnet standard)

In [None]:
RESNET_WEIGHTS = TRAINED_MODEL_DIR / "best_resnet50.pt"
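
The test call below depends on several files and folders being in place: the test data, the localisation weights, and the trained classifier weights. A small pre-flight check can give a clearer message than a failure deep inside the library. This is a sketch; `missing_paths` is a hypothetical helper and not part of Bplusplus.

```python
from pathlib import Path

def missing_paths(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [Path(p) for p in paths if not Path(p).exists()]
```

For example, `missing_paths([TEST_DATA_DIR, YOLO_WEIGHTS, RESNET_WEIGHTS])` returns an empty list when everything is present.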

In [None]:
bplusplus.test_resnet(
    data_path=TEST_DATA_DIR,
    yolo_weights=YOLO_WEIGHTS,
    resnet_weights=RESNET_WEIGHTS,
    model="resnet50",
    species_names=species_list,
    output_dir=TRAINED_MODEL_DIR
)

## Validate the model (resnet multitask)


In [None]:
RESNET_MULTITASK_WEIGHTS = TRAINED_MODEL_DIR / "best_multitask.pt"

In [None]:
bplusplus.test_multitask(
    species_list,
    test_set=TEST_DATA_DIR,
    yolo_weights=YOLO_WEIGHTS,
    hierarchical_weights=RESNET_MULTITASK_WEIGHTS,
    output_dir=TRAINED_MODEL_DIR
)