
Research Project for automatic identification of forensically relevant insect species in South Africa, Western Cape

Author: Luca Nederhorst
University: University of Amsterdam
Year: 2023-2024
Student number: 12328154

Short description:

This project aims to develop an automatic identification model for forensically relevant insect species in Cape Town, South Africa. Both the training and test data for the models were generated as part of this research project. Note that the models are not trained on an exhaustive list of forensically relevant species; new data can be incorporated to extend the models' applicability to other insect species. The model is currently trained on photos of the following species, captured in lateral view under a microscope at 12.5x and 16x magnification to simulate laboratory conditions. The images were taken on grey photographic paper with the 12-megapixel camera of an iPhone 11:

  • Chrysomya albiceps - adult & larvae (96 photos)
  • Chrysomya megacephala - adult (96 photos)
  • Synthesiomya nudiseta - adult & larvae (96 photos)

Installation

To get started with this project, follow these steps to set it up locally on your computer:

  1. Clone repository:
    git clone https://github.com/Lepel1998/nederhorstLuca_research_project

  2. Navigate to project directory:
    cd nederhorstLuca_research_project

  3. Set up a virtual environment:
    python -m venv env
    source env/bin/activate   (Linux/macOS)
    env\Scripts\activate      (Windows)

  4. Install Python dependencies (packages):
    pip install pillow-heif numpy seaborn tensorflow matplotlib scikit-learn opencv-python joblib torch transformers torchvision

  5. Check that the versions of the installed dependencies are compatible with each other.

  6. Some dependencies have additional requirements or dependencies of their own; some installation errors may be explained by this.
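Step 5 can be scripted. A minimal sketch, assuming only the standard library (the package list mirrors step 4), that reports the installed version of each dependency without crashing on missing ones:

```python
from importlib import metadata

# Distribution names as used in step 4's pip install command.
PACKAGES = ["pillow-heif", "numpy", "seaborn", "tensorflow", "matplotlib",
            "scikit-learn", "opencv-python", "joblib", "torch",
            "transformers", "torchvision"]

def installed_versions(packages):
    """Return {package: version string or None} without raising on missing packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Comparing the printed versions against each package's release notes is still a manual step; this only surfaces what is actually installed in the environment.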

Usage

Instructions for using the project.

Test run

First, verify that the program works on your computer by running main.py on the 'trial_dataset' folder (inside the 'datasets' folder). You can also run the model scripts to check that they work. After the trial run, remove the PROCESSED_DATASET folder, PREPROCESSED_DATASET.CSV, and METADATA_MODEL.CSV to avoid errors and duplicate data.
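The clean-up step above can be scripted. A small standard-library-only sketch (the file names are taken from the instructions above; the project root is assumed to be the current directory):

```python
import shutil
from pathlib import Path

# Artifacts that main.py regenerates on every run; removing them between
# runs avoids errors and duplicate data.
ARTIFACTS = ["PROCESSED_DATASET", "PREPROCESSED_DATASET.CSV", "METADATA_MODEL.CSV"]

def clean_artifacts(project_root="."):
    """Delete leftover run artifacts; return the names that were removed."""
    removed = []
    for name in ARTIFACTS:
        path = Path(project_root) / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name)
        elif path.is_file():
            path.unlink()
            removed.append(name)
    return removed
```

Running `clean_artifacts()` after each trial is equivalent to deleting the folder and CSV files by hand.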

Uploading your dataset

To use this project with your own data, upload a dataset organised in a directory structure where each folder represents a species and contains its photos (see the example below, or open one of the folders in the 'datasets' directory).

├── dataset/
│ ├── Chrysomya_albiceps_adult/
│ │ ├── photo1.jpg
│ │ ├── photo2.jpg
│ │ └── ...
│ │
│ ├── Synthesiomyia_nudiseta_adult/
│ │ ├── photo1.jpg
│ │ ├── photo2.jpg
└── ...
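A quick way to sanity-check that your dataset follows this layout is to count the photos per species folder. A standard-library-only sketch (the extension list is an assumption covering common image formats, including HEIC since the project installs pillow-heif):

```python
from pathlib import Path

def summarize_dataset(root):
    """Map each species folder name to its number of image files.

    Assumes the layout shown above: one sub-folder per species,
    with the photos directly inside each folder.
    """
    exts = {".jpg", ".jpeg", ".png", ".heic"}
    summary = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            summary[folder.name] = sum(
                1 for f in folder.iterdir() if f.suffix.lower() in exts
            )
    return summary
```

A folder reporting zero images usually means the photos are nested one level too deep or use an unexpected extension.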

Running the code on your own dataset

First, run main.py on your own dataset to augment it and extract features. After this, you can retrain an existing model on your dataset by selecting a trained model and adding the new data. The existing models can be found in the 'trained_models' folder. The different trained models are explained below:

  • InClasCNN_Model.h5: trained model generated by running cnn.py on 'final_dataset'. It is a non-pretrained Convolutional Neural Network.

  • InClasPreTrainedCNN_Model.h5: trained model generated by running pretrained_cnn.py on 'final_dataset'. It is a pretrained Convolutional Neural Network, based on the EfficientNetB0 weights retrieved from "https://download.pytorch.org/models/efficientnet_b0_rwightman-7f5810bc.pth"

  • InclasSVM_Model.joblib: trained model generated by running svm_knn_nb.py on 'final_dataset'. It is a non-pretrained Support Vector Machine model trained on manually extracted features in main.py.

  • InClasViTSVM_Model.joblib: trained model generated by running vit.py on 'trial_dataset'. It is a Support Vector Machine model trained on features from a pretrained Vision Transformer (ViT) retrieved from 'google/vit-base-patch16-224-in21k'.

  • InClasKNN_Model.joblib: trained model generated by running svm_knn_nb.py on 'final_dataset'. It is a non-pretrained K-Nearest Neighbors model trained on manually extracted features in main.py.

  • InClasViTKNN_Model.joblib: trained model generated by running vit.py on 'trial_dataset'. It is a K-Nearest Neighbors model trained on features from a pretrained Vision Transformer (ViT) retrieved from 'google/vit-base-patch16-224-in21k'.

  • InClasNB_Model: trained model generated by running svm_knn_nb.py on 'final_dataset'. It is a non-pretrained Naive Bayes model trained on manually extracted features in main.py.

  • InClasViTNB_Model.joblib: trained model generated by running vit.py on 'trial_dataset'. It is a Naive Bayes model trained on features from a pretrained Vision Transformer (ViT) retrieved from 'google/vit-base-patch16-224-in21k'.
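Loading a file from 'trained_models' depends on its format. A hedged sketch of a hypothetical `load_trained_model` helper (not part of the repository): the .h5 files are Keras networks, while the .joblib files hold scikit-learn models. Imports happen lazily, so only the framework actually needed must be installed:

```python
from pathlib import Path

def load_trained_model(model_path):
    """Hypothetical helper: dispatch to the right loader by file extension.

    .h5     -> Keras CNNs (InClasCNN, InClasPreTrainedCNN)
    .joblib -> scikit-learn models (SVM, KNN, NB, and their ViT variants)
    """
    suffix = Path(model_path).suffix.lower()
    if suffix == ".h5":
        from tensorflow.keras.models import load_model  # Keras network
        return load_model(model_path)
    if suffix == ".joblib":
        import joblib  # serialized scikit-learn estimator
        return joblib.load(model_path)
    raise ValueError(f"Unrecognised model format: {suffix!r}")
```

For example, `load_trained_model("trained_models/InClasSVM_Model.joblib")` would return a fitted estimator whose `predict` method takes the manually extracted features produced by main.py.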

If you want to retrain the full models, add your data to the final_dataset folder as directed in Uploading your dataset and train the models from scratch. This will overwrite (delete) the old models. Again, after every run of main.py, remove the PROCESSED_DATASET folder and METADATA_MODEL.CSV to avoid errors before running the main.py script again.

Folder Structure

  • datasets/: Contains all raw datasets.
    • final_dataset/: Dataset used for training.
    • trial_dataset/: Dataset used for developing the programs.
  • annotation_files/: Contains all metadata CSV files that describe the dataset and features.
  • models/: Contains the scripts used to train the models. Note: vit.py did not run on my computer due to its heavy computational load.
  • trained_models/: Contains all the trained models, which can be used or extracted.
  • log_process_CNN/: Contains the test and training history of the non-pretrained CNN model.
  • main.py: Main script in which the dataset is preprocessed and augmented.
  • functions.py: All the functions used by main.py.

Credits

For some scripts I followed existing GitHub repositories and tutorials; they are listed per script below, along with how I used them, to credit the original developers:

  • cnn.py:
    • main set-up of the script:
      Nochnack, N. (2022). ImageClassification: Getting Started.ipynb. GitHub. Retrieved June 2024, from https://github.com/nicknochnack/ImageClassification/blob/main/Getting%20Started.ipynb
    • Layers of the convolutional neural network:
      Kasinathan, T., Singaraju, D., & Uyyala, S. R. (2021). Insect classification and detection in field crops using modern machine learning techniques. Information Processing in Agriculture, 8(3), 446-457.
  • pretrained_cnn.py:
  • svm_knn_nb.py:
    • For implementing SVM, and as a guide for implementing KNN and NB:
      Cloud and ML Online. (June 22, 2019). Support Vector Machine - SVM - Classification Implementation for Beginners (using python) - Detailed [Video]. YouTube. https://www.youtube.com/watch?v=7sz4WpkUIIs
  • vit.py:
  • functions.py:
    • Augmentation of photos inspired by:
      Kasinathan, T., Singaraju, D., & Uyyala, S. R. (2021). Insect classification and detection in field crops using modern machine learning techniques. Information Processing in Agriculture, 8(3), 446-457.
      Mikołajczyk, A., & Grochowski, M. (2018). Data augmentation for improving deep learning in image classification problem. In Proc IIPhDW’ 18 Proceedings of the 2018 International Interdisciplinary PhD Workshop (pp. 117–122). Swinoujscie, Poland.
      Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1), 60.

    • Lowpass filter inspired by:
      Makandar, A., & Halalli, B. (2015). Image enhancement techniques using highpass and lowpass filters. International Journal of Computer Applications, 109, 12–15.

    • Geometric feature extraction inspired by:
      Wen, C., & Guyer, D. (2012). Image-based orchard insect automated identification and classification method. Computers and Electronics in Agriculture, 89, 110-115.

    • Fourier transform inspired by:
      Wen, C., & Guyer, D. (2012). Image-based orchard insect automated identification and classification method. Computers and Electronics in Agriculture, 89, 110-115.
      Bleed AI Academy. (May 25, 2021). Contour Detection in OpenCV 101 (1/3): The Basics. YouTube. https://www.youtube.com/watch?v=JfaZNiEbreE&t=1s
      OpenCV. (n.d.). Fourier Transform — OpenCV documentation. Retrieved June 2024, from https://docs.opencv.org/4.x/de/dbc/tutorial_py_fourier_transform.html

    • Minimum bounding rectangle inspired by:
      OpenCV. (n.d.B). Contour features — OpenCV documentation. Retrieved June 2024, from https://docs.opencv.org/4.x/dd/d49/tutorial_py_contour_features.html

    • Invariant moments inspired by:
      Wen, C., & Guyer, D. (2012). Image-based orchard insect automated identification and classification method. Computers and Electronics in Agriculture, 89, 110-115.
      Tutorialspoint. (n.d.). How to compute Hu moments of an image in OpenCV Python. Retrieved 2024, from https://www.tutorialspoint.com/how-to-compute-hu-moments-of-an-image-in-opencv-python

    • Texture features inspired by:
      Wen, C., & Guyer, D. (2012). Image-based orchard insect automated identification and classification method. Computers and Electronics in Agriculture, 89, 110-115.

    • Color features inspired by:
      Wen, C., & Guyer, D. (2012). Image-based orchard insect automated identification and classification method. Computers and Electronics in Agriculture, 89, 110-115.

Contact

luca-nederhorst@hotmail.nl

Additional Information

The future plans for this project are to significantly advance forensic entomology in South Africa. Our primary objective is to develop a highly accurate and robust model capable of identifying all forensically relevant insect species in the Western Cape, South Africa. This will involve expanding the current dataset to a comprehensive list of forensically relevant species, continually refining our models, and integrating advanced machine learning techniques. The final goal of this project is to create a user-friendly mobile application, designed for use at crime scenes by first responders, enabling them to quickly and accurately determine the minimum post-mortem interval (PMI) based on insect evidence. By providing real-time analysis and identification, this tool will enhance the efficiency and accuracy of forensic investigations.