# Assessment 1: BugFinder
### Intro
This notebook is the record of completion for Assessment 1: Machine Learning.
### Scenario
Develop a model to be used with a hand-held hyperspectral camera system to identify harmful pests on containers and vessels entering the country, with the aim of preventing those pests from establishing themselves and destroying native wildlife. This project will use a standard camera to develop a proof of concept for this system.

In [None]:
# Pre-setup
%pip install -Uqq fastbook
%pip install -Uqq fastai
%pip install -Uqq ipywidgets

In [None]:
# Setup - imports
from fastai.vision.widgets import *
from fastai.vision.all import *
import pandas as pd
import os
import shutil

In [None]:
data = Image.open("Insects/Achatina fulica Bowdich/b589ecb6-505d-444f-8753-f4988c11615b.jpg")
data.to_thumb(128,128)

## Dataset Preprocessing & Organisation
#### Locating Dataset
I searched Kaggle for datasets containing images of a variety of insects and found two potentially suitable, pre-labelled datasets:
- https://www.kaggle.com/datasets/shameinew/insect-images-with-scientific-names
- https://www.kaggle.com/datasets/rtlmhjbn/ip02-dataset

After examining these large datasets, I chose to take a subset of the first dataset as a proof of concept, specialising in the pests of highest concern. The following cells locate this subset (the first on Kaggle, the second from an `Insects` folder in the working directory) and count the images for each species.

In [None]:
# Kaggle version: the dataset is mounted under /kaggle/input
insect_types = ['Trogoderma granarium Everts', 'Solenopsis invicta Buren', 'Lymantria dispar (L.)', 'Achatina fulica Bowdich', 'Apis mellifera Linnaeus']
path = Path('/kaggle/input/insects/Insects')
if not path.exists():
    path.mkdir()
khapra_folder = path/insect_types[0]
ant_folder = path/insect_types[1]
moth_folder = path/insect_types[2]
snail_folder = path/insect_types[3]
bee_folder = path/insect_types[4]
# count images in each folder, verify files have been located
for name in insect_types:
    dest = (path/name)
    dest.mkdir(exist_ok=True)
    count_images = sum(len(files) for _, _, files in os.walk(dest))
    print(count_images)
# show a bee to check images
test_bee = Image.open(bee_folder/os.listdir(bee_folder)[0])
test_bee.to_thumb(128,128)

In [None]:
insect_types = ['Trogoderma granarium Everts', 'Solenopsis invicta Buren', 'Lymantria dispar (L.)', 'Achatina fulica Bowdich', 'Apis mellifera Linnaeus']
path = Path('Insects')
if not path.exists():
    path.mkdir()
khapra_folder = path/insect_types[0]
ant_folder = path/insect_types[1]
moth_folder = path/insect_types[2]
snail_folder = path/insect_types[3]
bee_folder = path/insect_types[4]
# count images in each folder, verify files have been located
bug_counts = [0,0,0,0,0]
for i, name in enumerate(insect_types):
    dest = (path/name)
    dest.mkdir(exist_ok=True)
    bug_counts[i] = sum(len(files) for _, _, files in os.walk(dest))
    print(name + ": " + str(bug_counts[i]))
# show a bee to check images
test_bee = Image.open(bee_folder/os.listdir(bee_folder)[0])
test_bee.to_thumb(128,128)

#### Preprocessing Dataset
The following code builds a fastai `DataBlock` from the pre-labelled dataset. Each image is randomly cropped and resized to a consistent 224×224 pixels, and 20% of the images are held out for validation.

In [None]:
# Build the DataBlock once, with item and batch transforms declared together
datablock = DataBlock(blocks=(ImageBlock, CategoryBlock), get_items=get_image_files,
                      splitter=RandomSplitter(valid_pct=0.2, seed=42), get_y=parent_label,
                      item_tfms=RandomResizedCrop(224, min_scale=0.5),
                      batch_tfms=aug_transforms())
dls = datablock.dataloaders(path)


#### Dataset Organisation
The following code copies each species' images into training, validation and test subfolders, using a 60/20/20 split.

In [None]:
bug_folders = [khapra_folder, ant_folder, moth_folder, snail_folder, bee_folder]
for i, bug in enumerate(insect_types):
    # A Path is not iterable, so list the image files that sit directly in the species folder
    # (this also skips the split subfolders if the cell is re-run)
    images = sorted(p for p in bug_folders[i].iterdir() if p.is_file())
    training_count = int(len(images) * 0.6)
    validation_count = int(len(images) * 0.2)
    training_dest = bug_folders[i]/'training'
    training_dest.mkdir(exist_ok=True)
    validation_dest = bug_folders[i]/'validation'
    validation_dest.mkdir(exist_ok=True)
    test_dest = bug_folders[i]/'test'
    test_dest.mkdir(exist_ok=True)
    # First 60% go to training, the next 20% to validation, the remainder to test
    for j, image in enumerate(images):
        if j < training_count:
            shutil.copy(image, training_dest)
        elif j < training_count + validation_count:
            shutil.copy(image, validation_dest)
        else:
            shutil.copy(image, test_dest)
    # Use f-strings: concatenating a str with an int raises a TypeError
    print(f'{bug} - # of training files: {len(os.listdir(training_dest))}')
    print(f'{bug} - # of validation files: {len(os.listdir(validation_dest))}')
    print(f'{bug} - # of test files: {len(os.listdir(test_dest))}')
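The split above assigns files by their position in the folder listing. If the files are ordered in some meaningful way (for example by source or capture date), it may be worth shuffling them first. The following is a minimal sketch of a reproducible shuffled 60/20/20 split. The helper name `split_files`, its parameters and the example file names are all hypothetical and used only for illustration.

```python
import random

def split_files(files, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle `files` reproducibly and split them into train/valid/test lists."""
    shuffled = sorted(files)               # sort first so the result ignores input order
    random.Random(seed).shuffle(shuffled)  # a seeded generator keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever is left over
    return train, valid, test

# Hypothetical file names, used only to show the call
example_files = [f"img_{k:02d}.jpg" for k in range(10)]
train, valid, test = split_files(example_files)
print(len(train), len(valid), len(test))
```

Each returned list could then be passed to `shutil.copy` in place of the index checks in the loop.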
    

## Creating an ML Model:

1. Utilize the fastai library to create an image classification model.
2. Choose an appropriate deep learning architecture (e.g., CNN) for the model.
3. Train the model using the training dataset, considering hyperparameter tuning.
4. Monitor training progress and adjust if necessary.
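
One possible way to carry out these steps with fastai is sketched below. It is not the final model for this assessment: the function name `train_bug_classifier`, the choice of `resnet18` and the default of 4 epochs are assumptions made for illustration, and the folder passed in is assumed to contain one subfolder per species.

```python
def train_bug_classifier(image_root, epochs=4):
    """Fine-tune a pretrained CNN on a folder with one subfolder per species."""
    # Imported here so the sketch is self-contained; the notebook already imports fastai above
    from fastai.vision.all import (ImageDataLoaders, Resize, aug_transforms,
                                   vision_learner, resnet18, error_rate)

    # Hold out 20% of the images for validation, with a fixed seed for repeatability
    dls = ImageDataLoaders.from_folder(
        image_root, valid_pct=0.2, seed=42,
        item_tfms=Resize(224), batch_tfms=aug_transforms())

    # resnet18 is an assumed starting architecture, not a tuned choice
    learn = vision_learner(dls, resnet18, metrics=error_rate)

    # fine_tune prints the training loss, validation loss and error rate per epoch
    learn.fine_tune(epochs)
    return learn
```

Calling `learn = train_bug_classifier('Insects')` would train on the local dataset folder. The per-epoch table that `fine_tune` prints is a simple way to monitor progress, and `learn.lr_find()` can be used beforehand to help choose a learning rate.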

## Model Scoring:

1. Use the trained model to predict pest species in a given set of images from the validation dataset.
2. Evaluate the model's performance using appropriate metrics (e.g., accuracy, precision, recall).
3. Visualize the model's predictions and actual labels.
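
To make the metrics in step 2 concrete, here is a small sketch that computes accuracy plus per-class precision and recall from two lists of labels. The function name `per_class_metrics` and the example labels are hypothetical; in the notebook the two lists would come from the model's predictions on the validation images.

```python
def per_class_metrics(actual, predicted):
    """Return overall accuracy and a {label: (precision, recall)} dict."""
    pairs = list(zip(actual, predicted))
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    report = {}
    for label in sorted(set(actual) | set(predicted)):
        tp = sum(a == label and p == label for a, p in pairs)  # correctly predicted as label
        fp = sum(a != label and p == label for a, p in pairs)  # wrongly predicted as label
        fn = sum(a == label and p != label for a, p in pairs)  # label missed by the model
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[label] = (precision, recall)
    return accuracy, report

# Hypothetical labels, used only to show the call
actual_labels = ['ant', 'ant', 'bee', 'bee', 'moth']
predicted_labels = ['ant', 'bee', 'bee', 'bee', 'moth']
accuracy, report = per_class_metrics(actual_labels, predicted_labels)
print(accuracy)
for label, (precision, recall) in report.items():
    print(label, round(precision, 2), round(recall, 2))
```

For a biosecurity use case, recall on the harmful species is likely the number to watch, since a missed pest is more costly than a false alarm.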

## Validation and Test Datasets:

1. Create a validation dataset that was not used during training to assess the model's generalization ability.
2. Ensure that the validation dataset contains images with varying conditions and perspectives.
3. Additionally, prepare a separate test dataset for final model evaluation.
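
A quick way to confirm that the validation and test images really are unseen is to check that no file name appears in more than one split. The sketch below assumes the splits are available as lists of file names; `find_split_overlap` and the example names are hypothetical.

```python
from itertools import combinations

def find_split_overlap(splits):
    """Return {(split_a, split_b): shared file names} for every pair that overlaps."""
    overlaps = {}
    for (name_a, files_a), (name_b, files_b) in combinations(splits.items(), 2):
        shared = set(files_a) & set(files_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Hypothetical splits in which 'a.jpg' has leaked into the test set
example_splits = {
    'training': ['a.jpg', 'b.jpg', 'c.jpg'],
    'validation': ['d.jpg'],
    'test': ['e.jpg', 'a.jpg'],
}
overlap = find_split_overlap(example_splits)
print(overlap)
```

An empty result means the splits are disjoint by file name. It does not catch duplicate images saved under different names.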

## Model Performance Analysis:

1. Apply the trained model to the test dataset to evaluate its performance on previously unseen data.
2. Analyze the model's predictions, misclassifications, and potential areas of improvement.
3. Summarize the assessment of the model's capabilities and limitations.
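
For step 2, one simple way to look at misclassifications is to count which species are mistaken for which. The sketch below does this in plain Python; the helper name `count_confusions` and the example labels are hypothetical.

```python
from collections import Counter

def count_confusions(actual, predicted, top=3):
    """Count (actual, predicted) pairs where the model was wrong, most frequent first."""
    errors = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return errors.most_common(top)

# Hypothetical test-set labels, used only to show the call
test_actual = ['ant', 'ant', 'ant', 'bee', 'moth', 'moth']
test_predicted = ['bee', 'bee', 'ant', 'bee', 'ant', 'moth']
confusions = count_confusions(test_actual, test_predicted)
for (actual_label, predicted_label), count in confusions:
    print(f'{actual_label} predicted as {predicted_label}: {count}')
```

fastai also provides `ClassificationInterpretation.from_learner(learn)`, whose `plot_confusion_matrix()` and `most_confused()` methods give a similar view directly from a trained learner.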