<a href="https://www.kaggle.com/code/prathamkumar0011/pest-classification-detection?scriptVersionId=193851386" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Welcome to the Ali Analytics recruitment test.**


---



**Problem Statement:**

*  The agricultural industry faces significant challenges due to insect pests, which can cause substantial crop damage and yield loss. Accurate and timely identification of these pests is crucial for effective pest management and control.

**Dataset:**

-  The IP102 dataset provides a large-scale benchmark for insect pest recognition, containing over 75,000 images across 102 categories (classes).
-  The dataset is split roughly 6:1:3 (train:validation:test) and is highly imbalanced: the smallest class has 71 images, while the largest has 5,740.

The dataset comprises:

    - Over 75,000 images
    - 102 classes
    - 45,095 images in the training set
    - 22,169 images in the test set

- **GitHub link:** https://github.com/xpwu95/IP102
  - trainset: https://drive.google.com/drive/folders/1UPW_wyfn-oP6YyO_AnUgue7Gur2dYcxP?usp=drive_link
  - valset: https://drive.google.com/drive/folders/1JddZOkBMRnfJHzRCSJS26JVbbG989W_L?usp=drive_link
  - testset: https://drive.google.com/drive/folders/1lWQwYxrYJ1ZfZrmQg51rnNRgl1InLWH0?usp=drive_link


**Further instructions:**
- Framework: TensorFlow 2.x / PyTorch
- Programming language: Python 3.x

**Submission Guidelines:**
*   Submit your solution as a notebook (.ipynb) or Python script (.py)
*   Include clear, well-documented code and concise explanations for each step.
*   Trained Model: Capable of accurately classifying insect pests.
*   Performance Metrics: Demonstrating the effectiveness of the model.
*   Detailed Report: Summarizing the process, challenges, solutions, and insights.
*   Presentation: Effectively communicating the findings and results.
*   Email your submission to ali.analytics.io@gmail.com by the **19th of August, 2024, 21:00 IST**


**Evaluation Criteria:**
*   Code quality and documentation
*   Trained model: performance and accuracy
*   Approach to problem-solving and data analysis
*   Project Presentation skills


**To assess your competence, we have divided the project into distinct steps, each with specific tasks. Please follow the steps below to complete the task:**

# **Step 1: Environment Setup**

**Tasks:**

1. Set up a Google Colab environment for the project.

  *   Install the necessary libraries (TensorFlow 2.x, PyTorch, NumPy, OpenCV, Matplotlib, scikit-learn, etc.).
  *   Verify the installation by printing the versions of the installed libraries.
2. Add anything else that is required for this step (optional)



# **Step 2: Data Preparation**

Tasks:

1. Use GitHub/Google Drive to download the IP102 dataset
2. Load the images and/or annotations into the environment.
3. Data Cleaning: Handle missing or inconsistent data.
4. Data Augmentation: Apply techniques such as rotation, flipping, and scaling to address class imbalance.
5. Add anything else that is required for this step (optional)

# **Step 3: Exploratory Data Analysis (EDA)**

**Tasks:**

1. Analyze and visualize the distribution of object classes.
2. Visualize sample images
3. Identify any missing values
4. Add anything else that is required for this step (optional)

# **Step 4: Feature Engineering**

**Objective:** Create meaningful features that enhance model performance.

**Tasks:**

1. Feature Extraction: Experiment with different feature extraction techniques such as Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), or deep learning-based features.
2. Feature Selection: Select the most relevant features that contribute to model accuracy.
3. Add anything else that is required for this step (optional)

# **Step 5: Model Training**

**Objective:** Develop and train machine learning models for pest classification

**Tasks:**

1. **Data Splitting:** Split the dataset into training, validation, and test sets.
2. **Model Training:** Train multiple models (e.g., Convolutional Neural Networks, Random Forest, Support Vector Machines, ..) for classification.
3. **Object Detection:** Implement object detection models (e.g., YOLO, Faster R-CNN, VGG16, ..) for images with bounding box annotations.
4. Add anything else that is required for this step (optional)

**Motivation:** The use of multiple models and object detection techniques can enhance the robustness of the solution

# **Step 6: Model Evaluation and Tuning**

**Objective:** Evaluate and optimize model performance.

**Tasks:**

1. Evaluate the model on the validation/test set using appropriate metrics (e.g., precision, recall, mAP).
2. Visualize detection results on sample test images.
3. Discuss the model's performance and potential improvements.
4. Hyperparameter Tuning: Perform hyperparameter tuning to optimize model performance.
5. Model Comparison: Compare the performance of different models and select the best-performing one.
6. Add anything else that is required for this step (optional)

**Motivation:** Proper evaluation and tuning are essential for achieving high model accuracy.

# **Step 7: Deployment**

**Tasks:**

1. Explain how you would deploy your trained object detection model in a production environment.
2. Discuss the different deployment options available and their pros and cons.
3. Based on the deployment options, which deployment option would you go for and why?
4. Outline the steps you would take to monitor the deployed model and ensure its performance over time.
5. Add anything else that is required for this step (optional)

# **Good luck, and we look forward to reviewing your submission!**

# **Step 1: Environment Setup**

In [None]:
!pip install -q tensorflow matplotlib numpy scikit-learn torch opencv-python wurlitzer

In [None]:
!pip install -q torch torchvision torchaudio

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn.utils.class_weight import compute_class_weight
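
Step 1 also asks for the installed library versions to be printed. A minimal sketch of that check, assuming the packages are installed under the distribution names used in the `pip install` cell; `installed_versions` is a hypothetical helper, not part of any library:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names taken from the pip install cell (assumed, adjust as needed)
packages = ["tensorflow", "torch", "numpy", "opencv-python", "matplotlib", "scikit-learn"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```

Reading the metadata rather than importing each package keeps the check fast and avoids a hard failure when one library is missing.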

# **Step 2: Data Preparation**

## Data loading

In [None]:
# import tarfile

# tar_file_path = '/kaggle/input/ip102data/ip102_v1.1.tar'

# with tarfile.open(tar_file_path, 'r') as tar:
#     tar.extractall(path='/content')

In [None]:
import shutil
import os

# Define paths
base_dir = '/kaggle/input/ip102data/ip102_v1.1/images'
output_dir = '/kaggle/working/ip102_v1.1/data'
train_file = '/kaggle/input/ip102data/ip102_v1.1/train.txt'
test_file = '/kaggle/input/ip102data/ip102_v1.1/test.txt'
val_file = '/kaggle/input/ip102data/ip102_v1.1/val.txt'

In [None]:
# Create one output directory per label
def create_label_dirs(base_path, labels):
    for label in set(labels):
        os.makedirs(os.path.join(base_path, str(label)), exist_ok=True)

# Parse a split file ("<image> <label>" per line) and copy images into per-label folders
def process_file(file_list, base_dir, output_dir):
    # Read the file once; skip blank lines so every entry has two fields
    with open(file_list, 'r') as file:
        entries = [line.split() for line in file.read().splitlines() if line.strip()]

    # Create directories for all labels before copying
    create_label_dirs(output_dir, [label for _, label in entries])

    # Copy images to corresponding directories
    for img, label in entries:
        src = os.path.join(base_dir, img)
        dest = os.path.join(output_dir, label, img)
        if os.path.isfile(src):
            shutil.copy(src, dest)
        else:
            print(f"File not found: {src}")

# Execute the function for training, testing, and validation data
process_file(train_file, base_dir, os.path.join(output_dir, 'train'))
process_file(test_file, base_dir, os.path.join(output_dir, 'test'))
process_file(val_file, base_dir, os.path.join(output_dir, 'val'))


## Data Cleaning

In [None]:
def check_duplicates(file_path):
    with open(file_path, 'r') as file:
        lines = file.read().splitlines()
        seen = set()
        duplicates = set()
        for line in lines:
            if line in seen:
                duplicates.add(line)
            seen.add(line)

    if duplicates:
        print(f"Duplicate entries found in {file_path}: {duplicates}")
    else:
        print(f"No duplicate entries found in {file_path}.")

# Check for duplicates in each file
check_duplicates(train_file)
check_duplicates(test_file)
check_duplicates(val_file)


In [None]:
def check_missing_values(file_path):
    missing_lines = []
    with open(file_path, 'r') as file:
        lines = file.read().splitlines()
        for i, line in enumerate(lines):
            if len(line.strip()) == 0:
                missing_lines.append(i)
                continue
            parts = line.split()
            if len(parts) != 2 or not parts[1].isdigit():
                missing_lines.append(i)

    if missing_lines:
        print(f"Missing or invalid entries found in {file_path} at lines: {missing_lines}")
    else:
        print(f"No missing or invalid entries found in {file_path}")

# Check for missing values in train.txt, test.txt, and val.txt
check_missing_values(train_file)
check_missing_values(test_file)
check_missing_values(val_file)


In [None]:
def check_image_existence(file_path, base_dir):
    missing_images = []
    with open(file_path, 'r') as file:
        lines = file.read().splitlines()
        for line in lines:
            img, _ = line.split()
            img_path = os.path.join(base_dir, img)
            if not os.path.isfile(img_path):
                missing_images.append(img)

    if missing_images:
        print(f"Missing images listed in {file_path}: {missing_images}")
    else:
        print(f"All images listed in {file_path} exist.")

# Check image existence for each file
check_image_existence(train_file, base_dir)
check_image_existence(test_file, base_dir)
check_image_existence(val_file, base_dir)
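
A further consistency check that fits this step is to confirm that no image is listed in more than one split, since overlap between training and test data would inflate the reported accuracy. The sketch below assumes each line has the `<image> <label>` form used by `train.txt`, `val.txt`, and `test.txt`; `find_split_overlap` is a hypothetical helper:

```python
from itertools import combinations

def find_split_overlap(splits):
    """Return {(split_a, split_b): shared image names} for every overlapping pair.

    `splits` maps a split name to an iterable of "<image> <label>" lines.
    """
    images = {
        name: {line.split()[0] for line in lines if line.strip()}
        for name, lines in splits.items()
    }
    overlap = {}
    for a, b in combinations(images, 2):
        shared = images[a] & images[b]
        if shared:
            overlap[(a, b)] = shared
    return overlap

# Toy example; in the notebook the lines would come from the three split files
example = {
    "train": ["00001.jpg 0", "00002.jpg 0"],
    "val": ["00003.jpg 1"],
    "test": ["00002.jpg 0", "00004.jpg 2"],
}
print(find_split_overlap(example))
```

An empty result means the splits are disjoint.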

## Data Augmentation

In [None]:
#  Setting up directories
train_dir = '/kaggle/working/ip102_v1.1/data/train'
val_dir = '/kaggle/working/ip102_v1.1/data/val'
test_dir = '/kaggle/working/ip102_v1.1/data/test'

# Image Preprocessing
img_size = (128, 128)
batch_size = 64

In [None]:
# Data Augmentation for training set
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

# Only rescaling for validation and test sets
val_test_datagen = ImageDataGenerator(rescale=1./255)

In [None]:
# Loading data
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='categorical'
)

val_generator = val_test_datagen.flow_from_directory(
    val_dir,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='categorical'
)

test_generator = val_test_datagen.flow_from_directory(
    test_dir,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=False
)

In [None]:
# Compute class weights to handle imbalance
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes
)

class_weights = {i: weight for i, weight in enumerate(class_weights)}
print("Class weights:", class_weights)
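
For reference, the `'balanced'` mode weights class c by n_samples / (n_classes × n_c), where n_c is the number of samples in class c, so rare classes contribute more to the loss. A minimal pure-Python sketch of the same formula on toy labels (`balanced_class_weights` is a hypothetical helper, not the scikit-learn implementation):

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in sorted(counts.items())}

# Toy labels: class 0 is three times as frequent as class 1
print(balanced_class_weights([0, 0, 0, 1]))
```

Here class 0 gets weight 4 / (2 × 3) ≈ 0.67 and class 1 gets 4 / (2 × 1) = 2.0, so a mistake on the rare class costs three times as much.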

# **Step 3: Exploratory Data Analysis (EDA)**

In [None]:
# Visualizing some images
def plot_images(images_arr):
    fig, axes = plt.subplots(2, 5, figsize=(10, 8))
    axes = axes.flatten()
    for img, ax in zip(images_arr, axes):
        ax.imshow(img)
    plt.tight_layout()
    plt.show()

sample_training_images, _ = next(train_generator)
plot_images(sample_training_images[:10])
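
The first EDA task, the class distribution, can be read directly from the split files without loading any images. A minimal sketch, assuming the `<image> <label>` line format of `train.txt`; `class_distribution` and `summarize` are hypothetical helpers:

```python
from collections import Counter

def class_distribution(lines):
    """Count images per label from "<image> <label>" lines, ignoring blanks."""
    return Counter(line.split()[1] for line in lines if line.strip())

def summarize(counts):
    """Return (smallest class size, largest class size, imbalance ratio)."""
    smallest, largest = min(counts.values()), max(counts.values())
    return smallest, largest, largest / smallest

# Toy example; in the notebook the lines would be read from train_file
lines = ["a.jpg 0", "b.jpg 0", "c.jpg 0", "d.jpg 1", "", "e.jpg 2", "f.jpg 2"]
counts = class_distribution(lines)
print(counts.most_common())
print(summarize(counts))
```

Passing `list(counts.keys())` and `list(counts.values())` to `plt.bar` would turn the same counts into a per-class bar chart.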

# **Step 4: Feature Engineering**

In [None]:
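import cv2

def extract_sift_features(image_path):
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # cv2.imread returns None instead of raising when the file cannot be read
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    return keypoints, descriptors

# Example usage (images were copied to /kaggle/working in Step 2)
keypoints, descriptors = extract_sift_features('/kaggle/working/ip102_v1.1/data/train/0/00002.jpg')
print("SIFT descriptors shape:", descriptors.shape)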

In [None]:
import cv2
import matplotlib.pyplot as plt

def extract_and_display_sift_features(image_path):
    # Load the image in grayscale
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Initialize SIFT detector
    sift = cv2.SIFT_create()

    # Detect keypoints and compute descriptors
    keypoints, descriptors = sift.detectAndCompute(image, None)

    # Draw the keypoints (with size and orientation) onto a copy of the image
    keypoints_img = cv2.drawKeypoints(image, keypoints, None, flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)

    # Display the image with keypoints
    plt.figure(figsize=(10, 10))
    plt.imshow(cv2.cvtColor(keypoints_img, cv2.COLOR_BGR2RGB))
    plt.axis('off')
    plt.title('SIFT Keypoints')
    plt.show()

image_path = '/kaggle/working/ip102_v1.1/data/train/0/00002.jpg'  
extract_and_display_sift_features(image_path)
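
SIFT returns a variable number of 128-dimensional descriptors per image, whereas classifiers such as SVMs or random forests expect one fixed-length vector per image. One simple way to bridge the gap (an assumption here, not something this notebook does) is to pool the descriptors with their mean and standard deviation; a bag-of-visual-words model is a common, stronger alternative. `pool_descriptors` is a hypothetical helper:

```python
import numpy as np

def pool_descriptors(descriptors, dim=128):
    """Collapse an (N, dim) descriptor array into one vector of length 2 * dim.

    Returns zeros when no descriptors are available (None or empty input).
    """
    if descriptors is None or len(descriptors) == 0:
        return np.zeros(2 * dim, dtype=np.float32)
    descriptors = np.asarray(descriptors, dtype=np.float32)
    return np.concatenate([descriptors.mean(axis=0), descriptors.std(axis=0)])

# Stand-in for the output of sift.detectAndCompute: 5 random descriptors
rng = np.random.default_rng(0)
fake_descriptors = rng.random((5, 128), dtype=np.float32)
print(pool_descriptors(fake_descriptors).shape)
```

Pooling discards spatial layout, so it is best treated as a quick baseline for the classical models mentioned in Step 5.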


# **Step 5: Model Training**




In [None]:
# CNN Model Architecture
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

input_shape=(128, 128, 3)
model = Sequential([
    Conv2D(128, (3, 3), input_shape=input_shape, activation='relu', padding='same'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.25),
    Dense(128, activation='relu'),
    Dropout(0.25),
    Dense(102, activation='softmax')
])

In [None]:
# Compiling the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Summary of the model
model.summary()
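
The parameter counts that `model.summary()` reports can be reproduced by hand, which shows where the model's capacity sits. The sketch below walks through the layer shapes for the architecture above (3×3 kernels, 2×2 pooling, `'same'` padding on the first convolution only); the helper names are hypothetical:

```python
def conv_params(kernel, in_ch, out_ch):
    """Kernel weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    """Weight matrix plus one bias per output unit."""
    return in_units * out_units + out_units

size = 128                          # input is 128 x 128 x 3
p_conv1 = conv_params(3, 3, 128)    # 'same' padding keeps 128 x 128
size //= 2                          # max pooling -> 64 x 64
p_conv2 = conv_params(3, 128, 64)
size -= 2                           # unpadded 3x3 conv -> 62 x 62
size //= 2                          # max pooling -> 31 x 31
flat = size * size * 64             # Flatten -> 31 * 31 * 64 features
p_dense1 = dense_params(flat, 256)
p_dense2 = dense_params(256, 128)
p_out = dense_params(128, 102)

total = p_conv1 + p_conv2 + p_dense1 + p_dense2 + p_out
print(f"flatten size: {flat}")
print(f"first dense layer: {p_dense1:,} of {total:,} parameters")
```

Nearly all parameters sit in the first dense layer, so adding more convolution/pooling stages before `Flatten` is one possible way to shrink the model.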

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Early Stopping and Model Checkpoint
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_model.keras', monitor='val_accuracy', save_best_only=True)

# Training the model
history = model.fit(
    train_generator,
    epochs=10,
    validation_data=val_generator,
    class_weight=class_weights,
    callbacks=[early_stopping, model_checkpoint]
)
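
With `patience=5`, training stops once `val_loss` has failed to improve for five consecutive epochs, and `restore_best_weights=True` rolls the model back to the best epoch. The simplified pure-Python sketch below illustrates that rule on a made-up loss curve; it illustrates the idea rather than reproducing the Keras implementation, and `early_stopping_epoch` is a hypothetical name:

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    stop_epoch is None when the patience budget is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses: best at epoch 1, then no further improvement
print(early_stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=2))
```

Note that with only 10 training epochs and a patience of 5, the stopping rule has little room to trigger.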

In [None]:
import pandas as pd
history_df = pd.DataFrame(history.history)
history_df

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

# **Step 6: Model Evaluation and Tuning**

In [None]:
import shutil

# Copy the best checkpoint (written by ModelCheckpoint) into the project folder
source_path = 'best_model.keras'
destination_path = '/kaggle/working/ip102_v1.1/model.keras'
shutil.copy(source_path, destination_path)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Loading the best model
model.load_weights('/kaggle/working/ip102_v1.1/model.keras')

# Evaluate on test data
test_loss, test_accuracy = model.evaluate(test_generator)
print(f'Test accuracy: {test_accuracy*100:.2f}%')

# Predicting the test data
predictions = model.predict(test_generator)
predicted_classes = np.argmax(predictions, axis=1)


In [None]:
# Classification report
report = classification_report(test_generator.classes, predicted_classes, target_names=list(test_generator.class_indices.keys()))
print(report)
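
Because the classes are highly imbalanced, overall accuracy can look acceptable while rare classes are mostly misclassified; the macro-averaged recall in the report weights every class equally. A minimal sketch of the difference on toy labels (`per_class_recall` is a hypothetical pure-Python helper):

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions / true samples of that class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in sorted(total)}

# Toy data: class 0 dominates, and the model always predicts class 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10
recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_recall = sum(recall.values()) / len(recall)
print(f"accuracy={accuracy:.2f}, macro recall={macro_recall:.2f}")
```

In this toy case accuracy is 0.90 while macro recall is only 0.50, which is why the macro averages in the classification report are the more informative numbers for IP102.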

# **YOLO v5**

In [None]:
import os
import numpy as np
import pandas as pd

In [None]:
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install

import torch
import utils
display = utils.notebook_init()

In [None]:
%cd '/kaggle/working/yolov5'

In [None]:
import yaml
import os

cwd = '/kaggle/working/'


data = dict(
    path  = cwd,
    train =  '../input/ip102-yolov5/IP102_YOLOv5/images/train' ,
    val   =  '../input/ip102-yolov5/IP102_YOLOv5/images/val',
    nc    = 102,  # number of classes; must match the 102 entries in names
    names =  ['rice leaf roller', 'rice leaf caterpillar', 'paddy stem maggot', 'asiatic rice borer', 'yellow rice borer',
        'rice gall midge', 'Rice Stemfly', 'brown plant hopper', 'white backed plant hopper', 'small brown plant hopper',
        'rice water weevil', 'rice leafhopper', 'grain spreader thrips', 'rice shell pest', 'grub', 'mole cricket', 'wireworm',
        'white margined moth', 'black cutworm', 'large cutworm', 'yellow cutworm', 'red spider', 'corn borer', 'army worm', 'aphids',
        'Potosiabre vitarsis', 'peach borer', 'english grain aphid', 'green bug', 'bird cherry-oataphid', 'wheat blossom midge',
        'penthaleus major', 'longlegged spider mite', 'wheat phloeothrips', 'wheat sawfly', 'cerodonta denticornis', 'beet fly',
        'flea beetle', 'cabbage army worm', 'beet army worm', 'Beet spot flies', 'meadow moth', 'beet weevil', 'sericaorient alismots chulsky',
        'alfalfa weevil', 'flax budworm', 'alfalfa plant bug', 'tarnished plant bug', 'Locustoidea', 'lytta polita', 'legume blister beetle',
        'blister beetle', 'therioaphis maculata Buckton', 'odontothrips loti', 'Thrips', 'alfalfa seed chalcid', 'Pieris canidia',
        'Apolygus lucorum', 'Limacodidae', 'Viteus vitifoliae', 'Colomerus vitis', 'Brevipoalpus lewisi McGregor', 'oides decempunctata',
        'Polyphagotars onemus latus', 'Pseudococcus comstocki Kuwana', 'parathrene regalis', 'Ampelophaga', 'Lycorma delicatula', 'Xylotrechus',
        'Cicadella viridis', 'Miridae', 'Trialeurodes vaporariorum', 'Erythroneura apicalis', 'Papilio xuthus', 'Panonchus citri McGregor',
        'Phyllocoptes oleiverus ashmead', 'Icerya purchasi Maskell', 'Unaspis yanonensis', 'Ceroplastes rubens', 'Chrysomphalus aonidum',
        'Parlatoria zizyphus Lucus', 'Nipaecoccus vastalor', 'Aleurocanthus spiniferus', 'Tetradacus c Bactrocera minax ', 'Dacus dorsalis(Hendel)',
        'Bactrocera tsuneonis', 'Prodenia litura', 'Adristyrannus', 'Phyllocnistis citrella Stainton', 'Toxoptera citricidus', 'Toxoptera aurantii',
        'Aphis citricola Vander Goot', 'Scirtothrips dorsalis Hood', 'Dasineura sp', 'Lawana imitata Melichar', 'Salurnis marginella Guerr',
        'Deporaus marginatus Pascoe', 'Chlumetia transversa', 'Mango flat beak leafhopper', 'Rhytidodera bowrinii white', 'Sternochetus frigidus',
        'Cicadellidae']
    )

yaml_path = os.path.join(cwd, 'IP102.yaml')
with open(yaml_path, 'w') as outfile:
    yaml.dump(data, outfile, default_flow_style=False)

# Read the file back to confirm what was written
with open(yaml_path, 'r') as f:
    print('\nyaml:')
    print(f.read())
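
In a YOLO-style dataset config, `nc` is the number of classes and should match the length of `names`, so a quick check before launching training can save a failed or mislabeled run. A minimal sketch, assuming a config dictionary shaped like `data` above; `validate_dataset_config` is a hypothetical helper, not part of YOLOv5:

```python
def validate_dataset_config(config):
    """Return a list of problems found in a YOLO-style dataset config dict."""
    problems = []
    names = config.get("names", [])
    if config.get("nc") != len(names):
        problems.append(f"nc={config.get('nc')} but {len(names)} class names given")
    if len(set(names)) != len(names):
        problems.append("duplicate class names")
    for key in ("train", "val"):
        if not config.get(key):
            problems.append(f"missing '{key}' path")
    return problems

# Toy config with a deliberate mismatch between nc and names
toy = {"train": "images/train", "val": "images/val", "nc": 2, "names": ["a", "b", "c"]}
print(validate_dataset_config(toy))
```

An empty list means the config passed all three checks.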

In [None]:
!python train.py --img 128 --batch 32 --epochs 2 --data "/kaggle/working/IP102.yaml" --weights yolov5s.pt --project "insect_Image Classification"  --name 'Image Classification'  --save-period 1 --bbox_interval 1 --cache

In [None]:
import glob
from IPython.display import Image, display

image_path_pattern = '/kaggle/working/yolov5/insect_Image Classification/Image Classification/*.jpg'

for imageName in glob.glob(image_path_pattern):
    display(Image(filename=imageName))
    print(f"Filename: {imageName.split('/')[-1]}")
    print("\n")


# **Step 7: Deployment**

### Deploying a Trained Object Detection Model

**Deployment Options**

**a. Cloud-based Deployment:** It refers to hosting and running applications, services, or models on cloud infrastructure rather than on local servers or hardware.
   - **Pros:**
     - **Scalability:** Easily handle high traffic by scaling resources.
     - **Managed Services:** Use services like AWS SageMaker, Google AI Platform, or Azure Machine Learning which provide infrastructure management.
     - **Integration:** Easy integration with other cloud services like storage and databases.
   - **Cons:**
     - **Cost:** Can become expensive with high usage.
     - **Latency:** May introduce latency due to network communication.
     - **Data Privacy:** Sensitive data may be exposed if not managed properly.
     
**b. On-Premises Deployment:** It refers to the practice of hosting and managing applications, services, or models on hardware and infrastructure located within an organization’s own premises.
   - **Pros:**
     - **Control:** Full control over the hardware and software environment.
     - **Latency:** Lower latency as the model runs locally.
     - **Data Privacy:** Better for handling sensitive data within a secure environment.
   - **Cons:**
     - **Scalability:** Limited by available hardware; scaling requires additional investment.
     - **Maintenance:** Requires in-house expertise for maintenance and updates.
     - **Cost:** Upfront costs for hardware and setup.

**c. Edge Deployment:** It refers to running applications, services, or models on local devices or computing resources that are situated closer to where data is generated or used, rather than in a centralized data center or cloud. 
   - **Pros:**
     - **Real-Time Processing:** Low latency as processing happens on the device.
     - **Reduced Bandwidth:** Less need to send data to the cloud, saving bandwidth.
     - **Offline Capabilities:** Can operate without an internet connection.
   - **Cons:**
     - **Hardware Constraints:** Limited by the computational power of the edge device.
     - **Updates:** More challenging to update models across multiple devices.
     - **Data Privacy:** While data remains local, security concerns still exist.

**d. Hybrid Deployment:** It combines both on-premises infrastructure and cloud-based services. This approach allows organizations to use local servers and systems for certain applications or data, while leveraging cloud resources for other tasks or to handle variable workloads. 
   - **Pros:**
     - **Flexibility:** Combine cloud, on-premises, and edge for optimized performance and cost.
     - **Redundancy:** Failover solutions can be implemented.
     - **Scalability and Privacy:** Balance between cloud scalability and on-premises privacy.
   - **Cons:**
     - **Complexity:** More complex to manage and configure.
     - **Integration:** Ensuring seamless integration between different environments can be challenging.

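**Recommended Deployment:** For this use case, edge deployment is the preferred option: identifying pests in the field benefits from real-time, low-latency inference, and an on-device model keeps working without an internet connection. The trade-off is that the model must fit the device's limited compute, which favors a compact detector such as YOLOv5s.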

**Steps to Monitor and Ensure Model Performance**

**a. Establish Monitoring Metrics:** Track precision, recall, F1-score, accuracy, and inference time.

**b. Regular Model Evaluation:** Periodically evaluate the model on a validation set of recent data to check for concept drift, and retrain with updated data when performance degrades.

**c. User Feedback:** Collect feedback from end users to identify potential issues or areas for improvement.
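
The monitoring steps above can be sketched as a rolling check on labelled production samples: keep the most recent outcomes in a fixed-size window and flag the model for re-evaluation when accuracy falls below a threshold. The class name, window size, and threshold are all illustrative assumptions:

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the last `window` labelled predictions."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def update(self, predicted, actual):
        """Record whether one prediction matched its ground-truth label."""
        self.outcomes.append(predicted == actual)

    @property
    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_review(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.threshold

monitor = RollingAccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (2, 2), (3, 0), (4, 0)]:
    monitor.update(predicted, actual)
print(monitor.accuracy, monitor.needs_review())
```

Waiting for a full window before raising a flag avoids alerts based on only a handful of samples.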
