<a href="https://colab.research.google.com/github/NaamaSchweitzer/CV-waste-classification/blob/main/CV_Waste_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Kaggle Setup and Dataset Download

This section sets up the Kaggle API client in your Colab environment and downloads the `alyyan/trash-detection` dataset.

In [None]:
import os

# Install the Kaggle API client
!pip install kaggle




#### Authenticate with Kaggle

To download datasets from Kaggle, you need to authenticate using your Kaggle API token. Follow these steps:

1.  Go to your Kaggle account page (https://www.kaggle.com/your-username/account).
2.  Scroll down to the 'API' section and click 'Create New API Token'. This will download a `kaggle.json` file.
3.  In Colab, click on the 🔑 icon (Secrets) in the left sidebar. Add a new secret named `KAGGLE_USERNAME` for your Kaggle username and `KAGGLE_KEY` for your Kaggle API key (from the `kaggle.json` file).

Alternatively, you can upload the `kaggle.json` file directly:


In [None]:
# Option 1: Using Colab Secrets (recommended)
from google.colab import userdata

# Ensure these secrets are set in Colab's Secrets manager
os.environ['KAGGLE_USERNAME'] = userdata.get('KAGGLE_USERNAME')
os.environ['KAGGLE_KEY'] = userdata.get('KAGGLE_KEY')

# Option 2: Uploading kaggle.json directly (uncomment and run if not using secrets)
# from google.colab import files
# files.upload()  # This will prompt you to upload the kaggle.json file

# Only needed for Option 2: copy the uploaded file into ~/.kaggle
# and restrict its permissions. Skipped when no kaggle.json is present.
if os.path.exists('kaggle.json'):
    !mkdir -p ~/.kaggle
    !cp kaggle.json ~/.kaggle/
    !chmod 600 ~/.kaggle/kaggle.json

print("Kaggle authentication setup complete.")

Kaggle authentication setup complete.
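
Before downloading, it can help to confirm which of the two options actually supplied credentials. The helper below is a minimal sketch, not part of the Kaggle client: `kaggle_credentials_source` is a hypothetical name, and it only checks the two places this notebook uses (the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables and `~/.kaggle/kaggle.json`).

```python
import os
from pathlib import Path


def kaggle_credentials_source(env=None, home=None):
    """Return 'env', 'file', or None depending on where credentials are found.

    Hypothetical helper: `env` and `home` are injectable so the check can be
    exercised without touching the real environment or home directory.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: both environment variables are set and non-empty
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"

    # Option 2: a kaggle.json file exists in the user's .kaggle directory
    if (home / ".kaggle" / "kaggle.json").is_file():
        return "file"

    return None


source = kaggle_credentials_source()
print(f"Kaggle credentials found via: {source}")
```

If this prints `None`, neither option has been completed and the download step is likely to fail with an authentication error.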


#### Download the Dataset

Now we will download the `trash-detection` dataset from Kaggle.

In [None]:
# Download the dataset
!kaggle datasets download -d alyyan/trash-detection

# List the downloaded file(s) to verify
!ls


Dataset URL: https://www.kaggle.com/datasets/alyyan/trash-detection
License(s): MIT
Downloading trash-detection.zip to /content
 92% 1.13G/1.23G [00:07<00:02, 42.0MB/s]
100% 1.23G/1.23G [00:08<00:00, 163MB/s] 
kaggle.json  sample_data  trash-detection.zip


#### Unzip the Dataset

The download is a single zip archive (`trash-detection.zip`). Extract its contents into a `trash-detection` directory.

In [None]:
# Unzip the dataset (assuming it's a .zip file)
!unzip -q trash-detection.zip -d trash-detection

# List the contents of the extracted directory
!ls trash-detection

print("Dataset downloaded and extracted successfully!")

Dataset
Dataset downloaded and extracted successfully!
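
The same extraction can be done in Python with the standard `zipfile` module, which also allows an integrity check before unpacking. This is a sketch under the assumption that the archive is a regular zip file; `extract_archive` is a hypothetical helper name, not something the notebook defines.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Verify a zip archive, extract it, and return the number of members."""
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)

    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")

        archive.extractall(dest_dir)
        return len(archive.namelist())


# Example usage with the paths from this notebook (only if the file exists)
if Path("trash-detection.zip").is_file():
    n_members = extract_archive("trash-detection.zip", "trash-detection")
    print(f"Extracted {n_members} archive members")
```

Checking the archive first makes a truncated download fail loudly instead of leaving a partially extracted dataset.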


# Task
Analyze the structure of the `trash-detection/Dataset` directory, read the `trash-detection/Dataset/data.yaml` file to extract class names, count the total instances and analyze the class balance by processing the label files within the dataset, and finally provide a comprehensive summary of the dataset including its overall size, detected classes, and the distribution of instances across these classes.

## Explore Dataset Structure

### Subtask:
List the contents of the `trash-detection/Dataset` directory to understand how the images and labels are organized.


**Reasoning**:
To understand how the images and labels are organized, I will list the contents of the `trash-detection/Dataset` directory using a shell command.



In [None]:
import os

# List the contents of the 'trash-detection/Dataset' directory
!ls -F trash-detection/Dataset

data.yaml  images/  labels/
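
The top-level listing does not show how `images/` and `labels/` are split further. A recursive count of files per folder and extension gives a fuller picture. The sketch below uses only the standard library; `count_files_by_folder` is a hypothetical helper, and nothing is assumed about which subfolders exist.

```python
from collections import Counter
from pathlib import Path


def count_files_by_folder(root):
    """Count files under `root`, keyed by (relative folder, file extension)."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # Folder is reported relative to root; '.' means root itself
            folder = path.parent.relative_to(root).as_posix()
            counts[(folder, path.suffix.lower())] += 1
    return counts


dataset_root = Path("trash-detection/Dataset")
if dataset_root.is_dir():
    for (folder, suffix), n in sorted(count_files_by_folder(dataset_root).items()):
        print(f"{folder:<20} {suffix or '(none)':<8} {n}")
```

Comparing the image and label counts per split is a quick way to spot images without annotations.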


## Read Data Configuration

### Subtask:
Load and parse the `trash-detection/Dataset/data.yaml` file to extract class names and any other relevant dataset metadata.


**Reasoning**:
Before parsing the YAML file, the `pyyaml` library needs to be installed, as it's not part of the standard Python library and was not installed in previous steps.



In [None]:
import os

# Install the PyYAML library to parse YAML files
!pip install pyyaml

print("PyYAML installed successfully.")

PyYAML installed successfully.


**Reasoning**:
Now that PyYAML is installed, I will import the `yaml` library, construct the path to `data.yaml`, open and parse it using `yaml.safe_load`, and then extract the class names as instructed.



In [None]:
import yaml

# Construct the full path to the data.yaml file
data_yaml_path = 'trash-detection/Dataset/data.yaml'

# Initialize variables
class_names = []

# Open and read the data.yaml file
with open(data_yaml_path, 'r') as file:
    data = yaml.safe_load(file)

# Extract class names (falls back to an empty list if the key is missing)
class_names = data.get('names', [])

print(f"Dataset metadata loaded from: {data_yaml_path}")
print(f"Detected classes (nc): {len(class_names)}")
print(f"Class names: {class_names}")

Dataset metadata loaded from: trash-detection/Dataset/data.yaml
Detected classes (nc): 4
Class names: ['dirt', 'liquid', 'marks', 'trash']
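
In this dataset `names` is a plain list, but YOLO-style `data.yaml` files sometimes store it as a mapping from class ID to name instead. The sketch below normalizes both shapes and cross-checks against `nc` when that key is present. `normalize_class_names` is a hypothetical helper that operates on the already-parsed dictionary, so it does not depend on PyYAML.

```python
def normalize_class_names(data):
    """Return class names as a list ordered by class ID.

    Accepts `names` as either a list or an {id: name} mapping, and raises
    ValueError if an `nc` entry disagrees with the number of names.
    """
    names = data.get("names", [])

    if isinstance(names, dict):
        # Sort by numeric class ID so list index matches the ID
        names = [names[key] for key in sorted(names, key=int)]
    else:
        names = list(names)

    declared = data.get("nc")
    if declared is not None and int(declared) != len(names):
        raise ValueError(f"nc={declared} but {len(names)} class names were found")

    return names


example = {"nc": 4, "names": {0: "dirt", 1: "liquid", 2: "marks", 3: "trash"}}
print(normalize_class_names(example))
```

Keeping the list index equal to the class ID matters because the label files refer to classes by ID only.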


# Task
Count the total number of labels in `trash-detection/Dataset/labels`, and then parse the class ID from each line in these label files (which are in YOLO annotation format) to count the occurrences of each class and analyze the class balance.

## Count Total Instances and Analyze Class Balance

### Subtask:
Identify the label files (e.g., .txt files) within the dataset, which are in YOLO annotation format. Iterate through these label files, parse the class ID from each line (the first element), and count the total number of instances and the occurrences of each class to provide a breakdown of the class balance.


**Reasoning**:
To count total instances and analyze class balance, I will define the labels directory, initialize counters, iterate through all .txt label files, extract class IDs from each line, and aggregate the counts.
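
For reference, a YOLO detection label file holds one object per line in the form `class_id x_center y_center width height`, with coordinates normalized to the image size. The sketch below parses a single line. `parse_yolo_line` is a hypothetical helper, and it assumes detection labels with exactly five fields (segmentation labels, which carry polygon points, would be rejected).

```python
def parse_yolo_line(line):
    """Parse one YOLO detection label line.

    Returns (class_id, (x_center, y_center, width, height)), or None for a
    blank line. Raises ValueError if the line does not have five fields or a
    field is not numeric.
    """
    fields = line.split()
    if not fields:
        return None
    if len(fields) != 5:
        raise ValueError(f"Expected 5 fields, got {len(fields)}: {line.strip()!r}")

    class_id = int(fields[0])
    x_center, y_center, width, height = (float(value) for value in fields[1:])
    return class_id, (x_center, y_center, width, height)


print(parse_yolo_line("2 0.5 0.5 0.25 0.1"))
```

Only the first field is needed for class counting, but validating the whole line catches malformed annotations early.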



In [None]:
import os

# Define the path to the labels directory (training split only)
labels_dir = './trash-detection/Dataset/labels/train'

# Initialize an empty dictionary to store class counts
# and a variable to store the total number of instances
class_counts = {i: 0 for i in range(len(class_names))}
total_instances = 0

# List all files in the labels directory
all_label_files = os.listdir(labels_dir)

# Filter for .txt files (YOLO label files)
label_files = [f for f in all_label_files if f.endswith('.txt')]

print(f"Processing {len(label_files)} label files...")

# Iterate through each identified label file
for filename in label_files:
    filepath = os.path.join(labels_dir, filename)
    with open(filepath, 'r') as file:
        for line in file:
            parts = line.split()
            # Skip blank lines so they are not counted as instances
            if not parts:
                continue

            # Extract class ID (first element in the line)
            try:
                class_id = int(parts[0])
            except ValueError as e:
                print(f"Warning: Could not parse class ID in file {filename}, line: {line.strip()}. Error: {e}")
                continue

            # Count the instance only after its class ID parsed successfully
            total_instances += 1
            class_counts[class_id] = class_counts.get(class_id, 0) + 1

print("\n--- Dataset Class Analysis ---")
print(f"Total instances (bounding boxes detected): {total_instances}")
print("Class distribution:")

# Print class names with their counts
for class_id, count in class_counts.items():
    if class_id < len(class_names):
        class_name = class_names[class_id]
        print(f"  - {class_name} (ID {class_id}): {count} instances")
    else:
        print(f"  - Unknown Class (ID {class_id}): {count} instances (not in data.yaml)")

Processing 1229 label files...

--- Dataset Class Analysis ---
Total instances (bounding boxes detected): 7900
Class distribution:
  - dirt (ID 0): 1418 instances
  - liquid (ID 1): 494 instances
  - marks (ID 2): 3373 instances
  - trash (ID 3): 2615 instances
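
Raw counts are easier to compare as shares of the total. The sketch below derives per-class percentages and a max/min ratio from the counts printed above. `class_shares` and `imbalance_ratio` are hypothetical helper names, and the counts are copied by hand from the output rather than read from the notebook's variables.

```python
def class_shares(counts):
    """Return each class's fraction of the total instance count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


def imbalance_ratio(counts):
    """Return the ratio of the largest to the smallest non-zero class count."""
    nonzero = [count for count in counts.values() if count > 0]
    return max(nonzero) / min(nonzero)


# Counts taken from the analysis output above
counts = {"dirt": 1418, "liquid": 494, "marks": 3373, "trash": 2615}

for name, share in class_shares(counts).items():
    print(f"{name}: {share:.1%}")
print(f"max/min ratio: {imbalance_ratio(counts):.2f}")
```

A single ratio is a rough indicator only, but it makes it easy to compare splits or datasets at a glance.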


## Summarize Dataset Information

### Subtask:
Present a summary of the dataset, including its overall size, the detected classes, and the distribution of instances across these classes.


### Dataset Summary

This section provides a summary of the `trash-detection` dataset, based on the analysis of its structure and label files.

-   **Overall Dataset Size:**
    -   Total number of label files processed (from `labels/train` only): **1229**
    -   Total number of instances (annotated bounding boxes): **7900**

-   **Detected Classes:**
    -   The dataset contains **4** distinct classes.
    -   Class Names: `['dirt', 'liquid', 'marks', 'trash']`

-   **Class Distribution (Instance Count per Class):**
    -   `dirt`: 1418 instances
    -   `liquid`: 494 instances
    -   `marks`: 3373 instances
    -   `trash`: 2615 instances

The `marks` class has the most instances (3373, about 42.7% of the total), followed by `trash` (2615, 33.1%), `dirt` (1418, 17.9%), and `liquid` (494, 6.3%).

## Final Task

### Subtask:
Provide a comprehensive description of the dataset based on the analysis, including its size, classes, and class distribution.


## Summary:

### Q&A
The analysis processed 1229 label files from `labels/train`. Parsing the class ID from each line in these files gave 7900 total instances (bounding boxes) across 4 distinct classes: `dirt` with 1418 instances, `liquid` with 494, `marks` with 3373, and `trash` with 2615. The distribution is imbalanced: `marks` is about 6.8 times as frequent as `liquid`.

### Data Analysis Key Findings
*   The `labels/train` directory contains 1229 label files.
*   These files hold a total of 7900 annotated instances (bounding boxes).
*   There are 4 distinct classes: `dirt`, `liquid`, `marks`, and `trash`.
*   The class distribution is as follows:
    *   `dirt`: 1418 instances
    *   `liquid`: 494 instances
    *   `marks`: 3373 instances
    *   `trash`: 2615 instances
*   The `marks` class has the highest number of instances (3373), while the `liquid` class has the fewest (494), a ratio of about 6.8 to 1.

### Insights or Next Steps
*   The class imbalance, particularly the low number of instances for `liquid` (494) compared to `marks` (3373), could bias a model toward the majority classes.
*   Consider data augmentation or oversampling for minority classes (`liquid`, `dirt`), and potentially undersampling for majority classes (`marks`, `trash`). Because a single image can contain boxes from several classes, resampling works at the image level and will not balance the instance counts exactly.
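
Another common way to compensate for imbalance, alongside the resampling ideas above, is to weight classes inversely to their frequency. The sketch below uses the "balanced" heuristic `total / (n_classes * count)`. `inverse_frequency_weights` is a hypothetical helper, and whether a given detection trainer accepts per-class weights is an assumption to verify against its documentation.

```python
def inverse_frequency_weights(counts):
    """Return per-class weights of total / (n_classes * count).

    Classes with zero instances are skipped, since no finite weight applies.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {
        name: total / (n_classes * count)
        for name, count in counts.items()
        if count > 0
    }


# Counts taken from the training-split analysis in this notebook
counts = {"dirt": 1418, "liquid": 494, "marks": 3373, "trash": 2615}

for name, weight in inverse_frequency_weights(counts).items():
    print(f"{name}: {weight:.3f}")
```

With this heuristic a class that holds exactly its fair share of instances gets a weight of 1, so rarer classes such as `liquid` end up above 1 and frequent ones such as `marks` below it.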
