# Soil Classification 
# Part 2: Binary Classification (Soil vs. Non-Soil)

**This project develops a binary classification model that distinguishes soil from non-soil images using deep-learning features and feature similarity. It was submitted for the Soil Classification Part 2 challenge.**

# Team Members

- **Team Name**: Expendables
- **Team Leader**: Sushmetha S R

| Name                 | Role/Title                                                               | Affiliation                          | Email                     | GitHub Handle         |
|----------------------|---------------------------------------------------------------------------|--------------------------------------|---------------------------|-----------------------|
| Abhinav Chaitanya R  | BTech in Electronics and Communication Engineering, 2025                | Vellore Institute of Technology, VIT Chennai | abhinavchaitanya6@gmail.com | Abhinav302004         |
| Arjun M              | BTech in Computer Science and Engineering, 2025                          | Vellore Institute of Technology, VIT Chennai | arjunm.0510@gmail.com     | ArjunM05              |
| Harshavardhan S      | BTech in Computer Science and Engineering, 2025                          | Vellore Institute of Technology, VIT Chennai | harsak7@gmail.com         | harsha152003          |
| Kiranchandran H      | BTech in Computer Science and Engineering (Cyber Physical Systems), 2025 | Vellore Institute of Technology, VIT Chennai | kiranchandranh@gmail.com  | kiranchh08            |
| Sushmetha S R        | BTech in Computer Science and Engineering (AI & ML Specialization), 2025 | Vellore Institute of Technology, VIT Chennai | sush7niaa@gmail.com       | sushniaa              |

# Objective

The objective is to build a binary classifier to distinguish soil images from non-soil images (e.g., rocks, water) for the Soil Classification Part 2 challenge. This task is important for agricultural applications, such as automated soil analysis, and for improving land management by accurately identifying soil presence in diverse environments.

# Dataset

- **Source**: The dataset is provided by the Soil Classification Part 2 challenge on Kaggle (`/kaggle/input/soil-classification-part-2/soil_competition-2025/`).
  - Training set: Images labeled as soil (`train/` and `train_labels.csv`).
  - Test set: Images with a mix of soil and non-soil (`test/` and `test_ids.csv`).
  - Unseen images: Additional non-soil images (`rock.png`, `sea.jpeg`) for testing.
- **Dataset Statistics**:
  - Total training images: 1222 (all labeled as soil).
  - Total test images: 967 (mix of soil and non-soil).
  - Sample image dimensions:
    - Training: Varies, e.g., 728x728, 1160x522 (based on a sample of 5 images).
    - Test: Varies significantly, e.g., 319x158, 1500x1125, 100x100 (based on a sample of 5 images).
  - Invalid images: No invalid images found in a sample of 5 images from both training and test sets.
- **Preprocessing Steps**:
  - Resized images to 224x224 to match EfficientNet-B0 input requirements.
  - Converted images to RGB and normalized using ImageNet statistics (`mean=[0.485, 0.456, 0.406]`, `std=[0.229, 0.224, 0.225]`).
  - Skipped invalid images during dataset loading to ensure robustness.
- **Label Description**:
  - Training labels: All images are soil (label=1).
  - Test labels: Binary classification (1=soil, 0=non-soil), to be predicted.
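
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration built on Pillow and NumPy, not the notebooks' actual code. The function names `preprocess` and `load_or_skip`, the bilinear resampling, and the channels-first output layout are assumptions made for the example.

```python
import numpy as np
from PIL import Image

# ImageNet statistics quoted in the preprocessing steps.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(img, size=224):
    """Convert to RGB, resize to size x size, and normalize per channel."""
    # Bilinear resampling is an assumption; the source does not name a filter.
    img = img.convert("RGB").resize((size, size), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32) / 255.0  # HWC, values in [0, 1]
    arr = (arr - IMAGENET_MEAN) / IMAGENET_STD       # broadcast over channels
    # Channels-first (CHW) layout, as most pretrained CNN backbones expect.
    return np.ascontiguousarray(arr.transpose(2, 0, 1))


def load_or_skip(path):
    """Return the preprocessed image, or None if the file cannot be read."""
    try:
        with Image.open(path) as img:
            return preprocess(img)
    except (OSError, ValueError):
        # Missing, truncated, or unidentifiable files are skipped.
        return None
```

A caller would drop the `None` results before batching, which mirrors the "skip invalid images" step.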

# Model

- **Architecture Used**:
  - Used EfficientNet-B0 pretrained on ImageNet.
  - Removed the final classifier layer to output 1280-D feature embeddings.
  - Applied PCA to reduce features to 100 dimensions for similarity computation.
- **Loss Function**:
  - Not applicable, as we used a similarity-based approach (cosine similarity) instead of training with a loss function.
- **Optimization Details**:
  - No gradient-based training or fine-tuning was performed; the pretrained EfficientNet-B0 serves as a fixed feature extractor.
  - Computed cosine similarity to the top-5 training prototypes (k=5).
  - Set a threshold at the 10th percentile of validation similarities to classify images as soil (label=1) or non-soil (label=0).
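
The scoring steps above (PCA reduction, top-k cosine similarity, percentile threshold) can be sketched in NumPy as shown below. Several details are assumptions, since the text does not spell them out: embeddings are taken to be already extracted as arrays, every reduced training embedding is treated as a prototype, an image's score is the mean of its `k` largest cosine similarities, and images scoring at or above the threshold are labeled soil. The SVD-based PCA is a stand-in for whichever PCA implementation the notebooks use, and all function names are hypothetical.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA via SVD; returns the feature mean and the top components."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def apply_pca(X, mean, components):
    """Project features onto the fitted principal components."""
    return (X - mean) @ components.T


def l2_normalize(X, eps=1e-12):
    """Scale each row to unit length, guarding against zero vectors."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)


def topk_mean_cosine(queries, prototypes, k=5):
    """Mean of the k largest cosine similarities between each query and the prototypes."""
    sims = l2_normalize(queries) @ l2_normalize(prototypes).T  # (n_queries, n_prototypes)
    topk = np.partition(sims, -k, axis=1)[:, -k:]              # k largest per row, unordered
    return topk.mean(axis=1)


def fit_threshold(val_scores, percentile=10.0):
    """Threshold at the given percentile of the validation scores."""
    return float(np.percentile(val_scores, percentile))


def classify(scores, threshold):
    """1 = soil (score at or above threshold), 0 = non-soil."""
    return (scores >= threshold).astype(int)
```

With hypothetical arrays `train_feats` and `val_feats` of shape `(n, 1280)`, the flow would be: fit PCA on `train_feats` with `n_components=100`, project both sets, score the validation set against the projected training set with `k=5`, and pass the resulting scores to `fit_threshold`.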

# Evaluation

- **Metrics**:
  - F1-score (primary metric for the competition), precision, recall, and accuracy.
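  - Evaluated on the validation set, which contains only soil images (label=1). The scores therefore show how many soil images clear the threshold, not how well non-soil images are rejected.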
- **Best Scores**:
  - Exact values depend on the run and are written to `ml-metrics.json` by `training.ipynb` or `inference.ipynb`. The figures below are illustrative placeholders, not measured results.
  - Validation F1-Score: e.g., 0.9500
  - Validation Precision: e.g., 0.9300
  - Validation Recall: e.g., 0.9700
  - Validation Accuracy: e.g., 0.9000
- **Visuals**:
  - **Similarity Distribution**: Histogram of validation similarities with the threshold (saved as `similarity_distribution_validation.png`).
  - **Sample Images**: Visualized sample training images to show soil characteristics (saved as `sample_training_images.png`).
  - **PCA Feature Distribution**: Scatter plot of the first two PCA components (saved as `pca_feature_distribution.png`).
  - **RGB Distribution**: Histogram of average RGB values to analyze color distribution (saved as `rgb_distribution_training.png`).
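
For reference, the four metrics can be computed from predicted and true labels as in the sketch below. The function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this example. On a validation set where every true label is 1, there are no false positives, so precision is 1.0 whenever at least one image is predicted as soil, and accuracy equals recall.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for labels in {0, 1} (1 = soil)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))

    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Because the threshold is set at the 10th percentile of the validation similarities, roughly 90% of validation images are expected to score at or above it when the same set is used for evaluation.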


# Inference

To use the model for predictions:
1. Run `inference.ipynb`, which includes all training steps and inference.
2. The notebook:
   - Extracts features from the test set using EfficientNet-B0.
   - Applies PCA (fitted on training data) to reduce features to 100 dimensions.
   - Computes cosine similarity to the top-5 training prototypes.
   - Classifies images using the threshold (10th percentile of validation similarities).
   - Outputs predictions in `submission.csv` (`image_id`, `label` format).
3. For unseen images (e.g., `rocks.png`, `sea.jpeg`):
   - The notebook processes each image, computes its similarity, and predicts the label.
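
The final output step can be illustrated with a short sketch that writes predictions in the `image_id`, `label` format. The function name `write_submission` and its arguments are hypothetical, and the standard-library `csv` module is used here only for illustration; the notebook may write the file differently.

```python
import csv


def write_submission(image_ids, labels, path="submission.csv"):
    """Write one row per image with columns image_id and label (1 = soil, 0 = non-soil)."""
    if len(image_ids) != len(labels):
        raise ValueError("image_ids and labels must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_id", "label"])  # header row
        for image_id, label in zip(image_ids, labels):
            writer.writerow([image_id, int(label)])
```

The same function would also work for the unseen images by passing their file names and predicted labels.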