# Insect Classification Project
### Project Overview
The goal of this project is to develop an improved algorithm for the detection and classification of insect species using the DIOPSIS image processing pipeline. The pipeline has two objectives: detection, which means accurately outlining every insect visible on the screen, and classification, which means identifying the species of each detected insect. This project is part of the ARISE Diopsis Challenge, which aims to enhance biodiversity monitoring through automated insect identification.

### Key Challenges:
Detection:
- Large number of insects per image.
- Wide range of insect sizes (from a few millimeters to several centimeters).
- Overlapping insects.
- Presence of non-insect structures like vegetation, dirt, and shadows.

Classification:
- Imbalance in the number of training examples per species.
- Fine-grained nature of the task.
- Choosing the appropriate taxonomic level at which to output results.
- Relatively poor image quality.

In [1]:
import os
import pandas as pd
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator
from keras.applications import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model




# Data Preparation
Loading the Dataset:
- We load the classification_labels.csv and name_to_ancestors.csv files which provide the image labels and taxonomic hierarchy.

Preprocessing Images:
- Images are read from the images_resized directory and scaled to a uniform 128×128 pixels so that the model receives consistent input.

In [2]:
# Load classification labels
classification_labels = pd.read_csv('Data/input/classification_labels.csv')
name_to_ancestors = pd.read_csv('Data/input/name_to_ancestors.csv')

# Add the .jpg extension where it is missing (assumes all images are JPEGs)
classification_labels['basename'] = classification_labels['basename'].apply(
    lambda x: x if x.endswith('.jpg') else x + '.jpg'
)

# Directory containing the resized images
directory = 'Data/input/images_resized'

# Constants
IMG_SIZE = (128, 128)
BATCH_SIZE = 32
NUM_CLASSES = len(classification_labels['deepest_name'].unique())

# Model Training
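Data Augmentation:
- We use ImageDataGenerator to augment the training data with random horizontal flips and zooms of up to 20% in or out. These variations help make the model more robust. The same generator also holds out 20% of the images for validation.

Building the Model:
- We use the ResNet50 model pre-trained on ImageNet as a frozen feature extractor and add a global average pooling layer, a 1024-unit dense layer, and a softmax output layer for our classification task.
- Only the added layers are trained, using categorical cross-entropy loss and the Adam optimizer.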
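
As a hedged illustration of the loss named above, the following sketch computes categorical cross-entropy for a single example in plain Python. The function name and the toy probabilities are made up for illustration; Keras computes an equivalent quantity internally for each batch.

```python
import math

def categorical_cross_entropy(one_hot_target, predicted_probs, eps=1e-12):
    """Return -sum(y_c * log(p_c)) for one example."""
    if len(one_hot_target) != len(predicted_probs):
        raise ValueError("target and prediction must have the same length")
    # eps guards against log(0) when a predicted probability is exactly zero
    return -sum(
        y * math.log(max(p, eps))
        for y, p in zip(one_hot_target, predicted_probs)
    )

# Toy example with three classes; the true class is the second one
loss = categorical_cross_entropy([0, 1, 0], [0.25, 0.5, 0.25])
print(round(loss, 4))  # 0.6931
```

Only the probability assigned to the true class contributes to the loss, so a more confident correct prediction gives a lower value.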

In [3]:
# Data augmentation and preprocessing
train_datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2,
    horizontal_flip=True,
    zoom_range=0.2
)

train_generator = train_datagen.flow_from_dataframe(
    dataframe=classification_labels,
    directory=directory,
    x_col='basename',
    y_col='deepest_name',
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    subset='training'
)

validation_generator = train_datagen.flow_from_dataframe(
    dataframe=classification_labels,
    directory=directory,
    x_col='basename',
    y_col='deepest_name',
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    subset='validation'
)

Found 31556 validated image filenames belonging to 84 classes.
Found 7889 validated image filenames belonging to 84 classes.
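
The generators encode each `deepest_name` label as an integer index, and the model's softmax output follows that order. A minimal sketch of mapping predicted indices back to names, using a toy dictionary as a stand-in for `train_generator.class_indices` (the three labels and the predicted indices are illustrative, not the full set of 84 classes):

```python
# Toy stand-in for train_generator.class_indices (label -> index)
class_indices = {'Diptera': 0, 'Hymenoptera': 1, 'Lepidoptera': 2}

# Invert the mapping so predicted indices can be turned back into labels
index_to_name = {index: name for name, index in class_indices.items()}

# Hypothetical argmax results for three images
predicted_indices = [2, 0, 1]
predicted_names = [index_to_name[i] for i in predicted_indices]
print(predicted_names)  # ['Lepidoptera', 'Diptera', 'Hymenoptera']
```

Keeping this mapping alongside the saved model makes it possible to interpret predictions in a later session.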


In [4]:
# Model selection
base_model = ResNet50(include_top=False, weights='imagenet', input_shape=(128, 128, 3))
x = GlobalAveragePooling2D()(base_model.output)
x = Dense(1024, activation='relu')(x)
predictions = Dense(NUM_CLASSES, activation='softmax')(x)

model = Model(inputs=base_model.input, outputs=predictions)

for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Training the model
model.fit(
    train_generator,
    epochs=25,
    validation_data=validation_generator
)

# Save model
model.save('insect_model.h5')




Epoch 1/25
...
Epoch 25/25


# Prediction and Formatting
Generating Predictions:
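- The trained model predicts a class for every image listed in classification_labels.csv. No separate test set is used at this stage, so these predictions include images the model was trained on.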

Formatting Predictions:
- The predictions are formatted to match the hierarchical structure required by the challenge.

In [42]:
# Debug: Print columns and first few rows
print("Columns in name_to_ancestors:")
print(name_to_ancestors.columns)
print("First few rows in name_to_ancestors:")
print(name_to_ancestors.head())

Columns in name_to_ancestors:
Index(['name', 'ancestors'], dtype='object')
First few rows in name_to_ancestors:
          name                               ancestors
0     Animalia                            ['Animalia']
1      Insecta                 ['Insecta', 'Animalia']
2  Hymenoptera  ['Hymenoptera', 'Insecta', 'Animalia']
3  Lepidoptera  ['Lepidoptera', 'Insecta', 'Animalia']
4      Diptera      ['Diptera', 'Insecta', 'Animalia']


In [48]:
import os
import ast
import pandas as pd
from keras.preprocessing.image import ImageDataGenerator
from keras.models import load_model

# Load the CSV files
classification_labels = pd.read_csv('Data/input/classification_labels.csv')
name_to_ancestors = pd.read_csv('Data/input/name_to_ancestors.csv')

# Load model
model = load_model('insect_model.h5')

# Add .jpg extension to basenames if not already present
classification_labels['basename'] = classification_labels['basename'].apply(lambda x: x if x.endswith('.jpg') else x + '.jpg')

# Function to parse ancestors column
def parse_ancestors(ancestors):
    ancestors = ast.literal_eval(ancestors)
    levels = {}
    for i, level in enumerate(ancestors):
        levels[f'level_{i}'] = level
        levels[f'level_{i}_probability'] = None  # Placeholder for probabilities
    return levels

# Apply parsing to name_to_ancestors dataframe
ancestors_df = name_to_ancestors['ancestors'].apply(parse_ancestors)
ancestors_df = pd.DataFrame(list(ancestors_df))

# Combine with name_to_ancestors dataframe
name_to_ancestors = pd.concat([name_to_ancestors['name'], ancestors_df], axis=1)

# Define function to format predictions
def format_predictions(model, dataframe, name_to_ancestors):
    # Generate predictions
    image_dir = 'Data/input/images_resized'
    datagen = ImageDataGenerator(rescale=1./255)
    generator = datagen.flow_from_dataframe(
        dataframe=dataframe,
        directory=image_dir,
        x_col='basename',
        y_col=None,
        target_size=(128, 128),  # must match the input size the model was trained with
        batch_size=1,
        class_mode=None,
        shuffle=False
    )
    
    # Get predictions
    predictions = model.predict(generator, steps=len(generator), verbose=1)
    predicted_indices = predictions.argmax(axis=-1)
    class_probabilities = predictions.max(axis=-1)
    
    # Map class indices back to taxon names so the merge below can match on 'name'.
    # Assumes the training generator's default ordering (sorted unique labels).
    class_names = sorted(dataframe['deepest_name'].unique())
    
    # Add predictions to a copy so the caller's dataframe is not modified
    dataframe = dataframe.copy()
    dataframe['predicted_class'] = [class_names[i] for i in predicted_indices]
    dataframe['confidence'] = class_probabilities
    
    # Ensure name column in name_to_ancestors is string
    name_to_ancestors['name'] = name_to_ancestors['name'].astype(str)
    
    # Merge with ancestors to get the full hierarchy
    formatted_predictions = dataframe.merge(name_to_ancestors, left_on='predicted_class', right_on='name')
    
    # Format according to the challenge requirements
    for i in range(6):
        formatted_predictions[f'level_{i}_probability'] = formatted_predictions['confidence']  # Use the prediction confidence for all levels
    
    formatted_predictions = formatted_predictions[['basename', 'level_0', 'level_0_probability', 
                                                   'level_1', 'level_1_probability', 
                                                   'level_2', 'level_2_probability', 
                                                   'level_3', 'level_3_probability', 
                                                   'level_4', 'level_4_probability', 
                                                   'level_5', 'level_5_probability']]
    
    return formatted_predictions

# Apply the formatting function
formatted_predictions = format_predictions(model, classification_labels, name_to_ancestors)

# Save to CSV
output_file_path = 'mnt/data/formatted_predictions.csv'
absolute_path = os.path.abspath(output_file_path)
formatted_predictions.to_csv(absolute_path, index=False)
print(f"Formatted predictions saved to {absolute_path}")


Found 39445 validated image filenames.
Formatted predictions saved to c:\Users\Gebruiker\Documents\Fontys\S6 - AI\Personal - Data Driven Challenge\arise-challenge-algorithm\mnt\data\formatted_predictions.csv
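
The cell above assigns the leaf-level confidence to every taxonomic level. One possible alternative, sketched here under the assumption that the challenge accepts aggregated probabilities, is to sum the softmax probabilities of all classes that share a given ancestor, so that higher levels receive at least as much probability as the classes beneath them. The function name and the toy probabilities are hypothetical; the ancestor lists follow the `name_to_ancestors` rows printed earlier.

```python
from collections import defaultdict

def ancestor_probabilities(class_probs, ancestors_by_name):
    """Sum class probabilities over every taxon in each class's ancestor list."""
    totals = defaultdict(float)
    for name, prob in class_probs.items():
        for taxon in ancestors_by_name[name]:
            totals[taxon] += prob
    return dict(totals)

# Ancestor lists as they appear in name_to_ancestors (class first, root last)
ancestors_by_name = {
    'Hymenoptera': ['Hymenoptera', 'Insecta', 'Animalia'],
    'Lepidoptera': ['Lepidoptera', 'Insecta', 'Animalia'],
    'Diptera': ['Diptera', 'Insecta', 'Animalia'],
}

# Hypothetical softmax output for one image
class_probs = {'Hymenoptera': 0.5, 'Lepidoptera': 0.25, 'Diptera': 0.25}

taxon_probs = ancestor_probabilities(class_probs, ancestors_by_name)
print(taxon_probs['Hymenoptera'], taxon_probs['Insecta'])  # 0.5 1.0
```

With this scheme the model can be uncertain between orders while still being confident that the image shows an insect.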


# Evaluation and Results
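Evaluation:
- The model is run on the images listed in predictions.csv and the predicted class index for each image is stored. Comparing these predictions with the ground truth to measure accuracy and reliability is the next step.

Results:
- The predictions are saved to formatted_predictions.csv for submission.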

In [50]:
# Predict a class index for every image listed in csv_file and save the results
def evaluate_and_save_results(model, csv_file, image_dir, output_file):
    test_labels = pd.read_csv(csv_file)
    test_labels.columns = test_labels.columns.str.strip()
    
    if 'basename' not in test_labels.columns:
        raise KeyError("Column 'basename' does not exist in the predictions file.")
    
    test_labels['basename'] = test_labels['basename'].apply(lambda x: x if x.endswith('.jpg') else x + '.jpg')
    
    missing_files = [
        fname for fname in test_labels['basename']
        if not os.path.exists(os.path.join(image_dir, fname))
    ]
    
    if missing_files:
        raise FileNotFoundError(f"The following files are missing: {missing_files}")
    
    test_datagen = ImageDataGenerator(rescale=1./255)
    test_generator = test_datagen.flow_from_dataframe(
        dataframe=test_labels,
        directory=image_dir,
        x_col='basename',
        y_col=None,
        target_size=IMG_SIZE,
        batch_size=1,
        class_mode=None,
        shuffle=False
    )
    
    if len(test_generator) == 0:
        raise ValueError("Test generator is empty. No images found.")
    
    predictions = model.predict(test_generator, steps=len(test_generator), verbose=1)
    predicted_classes = predictions.argmax(axis=-1)
    
    test_labels['predictions'] = predicted_classes
    test_labels.to_csv(output_file, index=False)
    
# Assuming test images are in 'Data/input/images_resized'.
# Note: the output path is the file written by the previous cell, so this call overwrites it.
evaluate_and_save_results(model, 'mnt/data/predictions.csv', 'Data/input/images_resized', 'mnt/data/formatted_predictions.csv')

Found 39445 validated image filenames.
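
The cell above stores predictions but does not yet score them. A minimal sketch of the comparison step, assuming true and predicted labels are available as two equal-length lists of names (the function name and the toy labels are hypothetical):

```python
from collections import Counter

def accuracy_report(true_labels, predicted_labels):
    """Return overall accuracy and a dict of per-class accuracy."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    total = Counter()
    correct = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    overall = sum(correct.values()) / len(true_labels)
    per_class = {name: correct[name] / total[name] for name in total}
    return overall, per_class

# Toy labels for four images
true_labels = ['Diptera', 'Diptera', 'Lepidoptera', 'Hymenoptera']
predicted_labels = ['Diptera', 'Lepidoptera', 'Lepidoptera', 'Hymenoptera']

overall, per_class = accuracy_report(true_labels, predicted_labels)
print(overall)               # 0.75
print(per_class['Diptera'])  # 0.5
```

Per-class accuracy matters here because, with imbalanced classes, a high overall accuracy can hide poor performance on rare species.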


# Interim Conclusion and Future Work
### Summary of Work Completed
In this project, we have developed an initial version of an insect classification system as part of the ARISE Diopsis Challenge. Here are the key steps we have completed:

1. Data Preparation:

    - Loaded and preprocessed the insect image dataset.
    - Ensured all images are resized to a uniform size.
    - Verified and merged the classification labels with the taxonomic hierarchy.

2. Model Training:

    - Used the pre-trained ResNet50 model as a frozen feature extractor and trained additional layers on top of it for insect species classification.
    - Implemented data augmentation techniques to enhance model robustness.
    - Trained the model for 25 epochs on 80% of the labelled images and validated it on the remaining 20%.

3. Prediction and Formatting:

    - Generated predictions using the trained model on the full set of labelled images.
    - Formatted the predictions according to the hierarchical structure required by the challenge.
    - Saved the formatted predictions in the specified format for submission.

### Current Results
The initial model runs end to end: it trains on the dataset, generates a prediction for each of the 39,445 images, and saves the predictions in the required format. Accuracy has not yet been measured against the ground truth, so this implementation is a starting point for the improvements described below.

### Future Work
Despite the progress made, there are several areas where we can further enhance the performance and accuracy of our model:

1. Addressing Class Imbalance:

    - Implement techniques to handle the imbalance in the number of training examples per species. This could include data augmentation, synthetic data generation, or using specialized loss functions that mitigate class imbalance.

2. Improving Model Accuracy:

    - Experiment with different neural network architectures and hyperparameters to improve model accuracy.
    - Fine-tune the pre-trained model itself by unfreezing some of its layers, and experiment with more epochs and different learning rates.

3. Enhanced Data Augmentation:

    - Apply more advanced data augmentation techniques to create a more diverse training dataset, potentially improving the model's ability to generalize.

4. Incorporating Additional Data:

    - Explore the possibility of incorporating additional datasets or external data sources to provide more training examples and enhance model performance.

5. Model Evaluation and Fine-Tuning:

    - Conduct a thorough evaluation of the model's predictions against the ground truth.
    - Fine-tune the model based on the evaluation results to address any weaknesses or inaccuracies.

6. Pipeline Optimization:

    - Optimize the overall image processing and classification pipeline for efficiency and scalability.

By focusing on these areas, we aim to develop a more robust and accurate insect classification system that meets the stringent requirements of the ARISE Diopsis Challenge. Our ultimate goal is to contribute to biodiversity monitoring efforts through automated and precise insect identification.
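
One common way to approach the class-imbalance item above is to weight each class inversely to its frequency. The sketch below computes such weights in plain Python using the "balanced" heuristic n_samples / (n_classes * class_count); the function name and the toy labels are illustrative. With Keras, weights like these can be passed to `model.fit` through its `class_weight` argument, keyed by class index rather than by name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        name: n_samples / (n_classes * count)
        for name, count in counts.items()
    }

# Toy label list: one class is three times as common as the other
labels = ['Diptera'] * 3 + ['Lepidoptera']
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive weights above 1 and common classes weights below 1, so mistakes on rare species contribute more to the loss.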

### Next Steps
- Conduct a detailed analysis of the current model's performance, identifying specific areas for improvement.
- Implement and test various techniques to address class imbalance.
- Experiment with different neural network architectures and fine-tune hyperparameters.
- Apply more advanced data augmentation techniques.
- Evaluate the model thoroughly and refine it based on the findings.
- Document the entire process, including methodologies, results, and improvements, for the final project submission.