# Task
Unzip the file "archive.zip", delete unnecessary directories, create a YAML file for the dataset, and provide information about the dataset.

## Unzip the archive

### Subtask:
Unzip the provided `archive.zip` file.


**Reasoning**:
Unzip the provided `archive.zip` file using the `zipfile` module.



In [None]:
import zipfile

with zipfile.ZipFile('archive.zip', 'r') as zip_ref:
    zip_ref.extractall('.')
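
Before extracting, it can help to see which top-level folders the archive will create, since that determines which directories may need cleaning up afterwards. The following is a minimal sketch using only the standard `zipfile` module; the helper name `top_level_entries` is hypothetical and not part of the original notebook.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names (files or folders) inside a zip archive."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # Member names inside a zip always use '/' as the path separator,
        # so the first path component is the top-level entry.
        return sorted({name.split('/', 1)[0] for name in zip_ref.namelist() if name})
```

Calling `top_level_entries('archive.zip')` before `extractall` would show whether the archive contains extra folders such as `__MACOSX` alongside the dataset folder.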

## Organize directories

### Subtask:
Delete any unnecessary directories created during the unzipping process.


**Reasoning**:
Delete the identified unnecessary directories with `shutil.rmtree`, using `os.path.exists` to check that each one is present first.



In [None]:
import os
import shutil

# Identify directories to remove. This is based on the expected contents of the archive.
directories_to_remove = ['__MACOSX'] # Example: if __MACOSX is created by the unzip process

for directory in directories_to_remove:
    if os.path.exists(directory):
        shutil.rmtree(directory)
        print(f"Removed directory: {directory}")
    else:
        print(f"Directory not found: {directory}")

# Verify that the directories have been removed
print("\nContents of the current directory after removal:")
print(os.listdir('.'))

Directory not found: __MACOSX

Contents of the current directory after removal:
['.config', 'archive.zip', 'Ripe & Unripe Fruits', 'sample_data']


## Create YAML file

### Subtask:
Generate a YAML file containing the dataset configuration.

**Reasoning**:
Create a dictionary with the dataset information and save it as a YAML file.

In [None]:
import yaml
import os

# Define the base path where the dataset is located
dataset_base_path = '/content/Ripe & Unripe Fruits'

# Define the class names based on the 11 ripe and 11 unripe categories
# You might need to adjust these names to match the actual directory names if they are different
class_names = [
    'ripe apple', 'ripe banana', 'ripe dragon', 'ripe grapes', 'ripe lemon', 'ripe mango',
    'ripe orange', 'ripe papaya', 'ripe pineapple', 'ripe pomegranate', 'ripe strawberry',
    'unripe apple', 'unripe banana', 'unripe dragon', 'unripe grapes', 'unripe lemon',
    'unripe mango', 'unripe orange', 'unripe papaya', 'unripe pineapple', 'unripe pomegranate',
    'unripe strawberry'
]

dataset_config = {
    'path': dataset_base_path,   # Base path to the dataset
    'train': dataset_base_path,  # Class folders are assumed to sit directly under the base path
    'val': dataset_base_path,    # Same directory as 'train': no separate validation split exists yet
    'nc': len(class_names),      # Number of classes
    'names': class_names         # Class names
}

with open('dataset_config.yaml', 'w') as file:
    yaml.dump(dataset_config, file, default_flow_style=False)

print("dataset_config.yaml created successfully.")

dataset_config.yaml created successfully.
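
The class list above is typed by hand, so it may drift from the folder names actually on disk. As a hedged sketch, the names could instead be read from the dataset directory and compared against the expected list. The helper names `discover_class_names` and `compare_class_names` are hypothetical, and the sketch assumes each class is an immediate subdirectory of the dataset root.

```python
import os


def discover_class_names(root):
    """Return the sorted names of the immediate subdirectories of root."""
    return sorted(entry.name for entry in os.scandir(root) if entry.is_dir())


def compare_class_names(expected, root):
    """Return (missing, unexpected) class names relative to the folders on disk."""
    found = set(discover_class_names(root))
    missing = sorted(set(expected) - found)      # listed in the config, absent on disk
    unexpected = sorted(found - set(expected))   # present on disk, absent from the config
    return missing, unexpected
```

If `compare_class_names(class_names, dataset_base_path)` returns two empty lists, the hand-written names match the directory layout. Sorting also matters because `torchvision.datasets.ImageFolder` assigns class indices in sorted directory-name order.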


## Dataset Information

### Subtask:
Provide information about the dataset, including the number of classes and examples per class.

**Reasoning**:
Read the dataset configuration from the generated YAML file, list the class names, count the total number of classes, and iterate through the dataset directory to count the number of examples in each class.

In [None]:
import yaml
import os

# Load the dataset configuration from the YAML file
with open('dataset_config.yaml', 'r') as file:
    dataset_config = yaml.safe_load(file)

# 1. List the names of the classes
class_names = dataset_config['names']
print("Class Names:", class_names)

# 2. Count the number of classes
num_classes = dataset_config['nc']
print("\nTotal Number of Classes:", num_classes)

# 3. Iterate through the training directory and count examples per class
train_dir = dataset_config['train']
examples_per_class = {}

print("\nExamples per Class in Training Dataset:")
if os.path.isdir(train_dir):
    for class_name in class_names:
        class_dir = os.path.join(train_dir, class_name)
        if os.path.isdir(class_dir):
            # Count image files (assuming they are jpg, jpeg, or png)
            num_examples = len([f for f in os.listdir(class_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png'))])
            examples_per_class[class_name] = num_examples
            # 4. Print the name of each class along with the number of examples
            print(f"{class_name}: {num_examples} examples")
        else:
            print(f"Warning: Directory not found for class '{class_name}': {class_dir}")
else:
    print(f"Error: Training directory not found at {train_dir}")

Class Names: ['ripe apple', 'ripe banana', 'ripe dragon', 'ripe grapes', 'ripe lemon', 'ripe mango', 'ripe orange', 'ripe papaya', 'ripe pineapple', 'ripe pomegranate', 'ripe strawberry', 'unripe apple', 'unripe banana', 'unripe dragon', 'unripe grapes', 'unripe lemon', 'unripe mango', 'unripe orange', 'unripe papaya', 'unripe pineapple', 'unripe pomegranate', 'unripe strawberry']

Total Number of Classes: 22

Examples per Class in Training Dataset:
ripe apple: 388 examples
ripe banana: 400 examples
ripe dragon: 399 examples
ripe grapes: 398 examples
ripe lemon: 386 examples
ripe mango: 399 examples
ripe orange: 400 examples
ripe papaya: 400 examples
ripe pineapple: 400 examples
ripe pomegranate: 390 examples
ripe strawberry: 400 examples
unripe apple: 391 examples
unripe banana: 400 examples
unripe dragon: 400 examples
unripe grapes: 397 examples
unripe lemon: 396 examples
unripe mango: 400 examples
unripe orange: 400 examples
unripe papaya: 400 examples
unripe pineapple: 380 examples
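
The per-class counts can be condensed into a quick balance check. The sketch below assumes a dictionary shaped like `examples_per_class` from the cell above (class name mapped to image count); the function name `summarize_counts` is hypothetical.

```python
def summarize_counts(counts):
    """Return the total number of examples plus the smallest and largest classes."""
    if not counts:
        raise ValueError("counts must contain at least one class")
    smallest = min(counts.items(), key=lambda item: item[1])
    largest = max(counts.items(), key=lambda item: item[1])
    return {
        'total': sum(counts.values()),
        'smallest': smallest,  # (class name, count)
        'largest': largest,    # (class name, count)
    }
```

Passing `examples_per_class` to `summarize_counts` gives a one-glance view of how evenly the classes are populated.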

## Train a Classification Model

### Subtask:
Train a classification model (e.g., ResNet) on the "Ripe & Unripe Fruits" dataset.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, models, transforms
import os

# Define data transformations (adjust as needed)
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

# Define the data directory (using the path from the dataset_config)
data_dir = dataset_config['path']

# Create datasets and data loaders
# Class directories sit directly in data_dir, so ImageFolder is pointed at data_dir itself.
# Caveat: 'train' and 'val' both read the same images here (only the transforms differ),
# so the validation metrics do not measure generalization. Split the data into
# separate train/val sets for a proper evaluation.
image_datasets = {x: datasets.ImageFolder(data_dir, data_transforms[x])
                  for x in ['train', 'val']}

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                             shuffle=True, num_workers=2)
              for x in ['train', 'val']} # Adjust batch size and num_workers as needed

dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

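# Load a ResNet-18 pre-trained on ImageNet
# (the `weights` argument replaces the deprecated `pretrained=True`; requires torchvision >= 0.13)
model_ft = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)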
num_ftrs = model_ft.fc.in_features
# Modify the final layer to match the number of classes
model_ft.fc = nn.Linear(num_ftrs, len(class_names))

model_ft = model_ft.to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

# --- Training Loop (Basic Example) ---
# A fuller version would typically also add a learning rate scheduler
# and save checkpoints of the best-performing model.

num_epochs = 10 # Define the number of epochs to train for

for epoch in range(num_epochs):
    print(f'Epoch {epoch}/{num_epochs - 1}')
    print('-' * 10)

    # Each epoch has a training and validation phase
    for phase in ['train', 'val']:
        if phase == 'train':
            model_ft.train()  # Set model to training mode
        else:
            model_ft.eval()   # Set model to evaluate mode

        running_loss = 0.0
        running_corrects = 0

        # Iterate over data.
        for inputs, labels in dataloaders[phase]:
            inputs = inputs.to(device)
            labels = labels.to(device)

            # zero the parameter gradients
            optimizer_ft.zero_grad()

            # forward
            # track history if only in train
            with torch.set_grad_enabled(phase == 'train'):
                outputs = model_ft(inputs)
                _, preds = torch.max(outputs, 1)
                loss = criterion(outputs, labels)

                # backward + optimize only if in training phase
                if phase == 'train':
                    loss.backward()
                    optimizer_ft.step()

            # statistics
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels)

        epoch_loss = running_loss / dataset_sizes[phase]
        epoch_acc = running_corrects.double() / dataset_sizes[phase]

        print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

print('\nTraining complete!')

# You would typically save the trained model here
# torch.save(model_ft.state_dict(), 'fruit_classifier_model.pth')

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 129MB/s]


Epoch 0/9
----------
train Loss: 2.5279 Acc: 0.2614
val Loss: 2.0220 Acc: 0.4772
Epoch 1/9
----------
train Loss: 2.1905 Acc: 0.3502
val Loss: 1.9543 Acc: 0.5198
Epoch 2/9
----------
train Loss: 2.0245 Acc: 0.3949
val Loss: 1.5830 Acc: 0.5757
Epoch 3/9
----------
train Loss: 1.9037 Acc: 0.4290
val Loss: 1.3730 Acc: 0.6063
Epoch 4/9
----------
train Loss: 1.8537 Acc: 0.4438
val Loss: 1.3844 Acc: 0.6403
Epoch 5/9
----------
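
Because the cell above reads the same images for both the `train` and `val` phases, the reported validation accuracy is not a held-out estimate. One way to get a proper split without moving files is to partition sample indices per class. The following is a minimal pure-Python sketch; the function name `stratified_split` and the default `val_fraction=0.2` are assumptions for illustration, not part of the original notebook.

```python
import random
from collections import defaultdict


def stratified_split(targets, val_fraction=0.2, seed=0):
    """Split sample indices into train/val lists, preserving class proportions.

    targets: sequence of class labels, one per sample (e.g. ImageFolder.targets).
    """
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(targets):
        by_class[label].append(index)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = int(round(len(indices) * val_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)
```

The resulting index lists could be passed to `torch.utils.data.Subset`, wrapping `image_datasets['train']` with the training indices and `image_datasets['val']` with the validation indices, so that each phase keeps its own transforms while seeing disjoint images.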
