# DS 542 Fall 2025 Project 2

Your task for this project is to design and train a convolutional neural network that classifies pictures of butterflies from [iNaturalist](https://www.inaturalist.org) by species.


## Background

The iNaturalist project crowdsources observations of wildlife, including images, locations, timestamps, and species identifications.

Here is an example image from [an observation of a Monarch butterfly](https://www.inaturalist.org/observations/231456432).

![Danaus plexippus plexippus](https://inaturalist-open-data.s3.amazonaws.com/photos/411089128/medium.jpeg)


## Data

For this project, observations and images for many butterfly species were collected in a GitHub repository.

  https://github.com/DL4DS/butterflies

**Checking out this repository requires about 21GB of disk space.**
This repository is checked out on the Shared Compute Cluster (SCC) at `/projectnb/ds542/materials/butterflies`.

## Outline

1. Visualize species distribution and prepare to classify top 9 species.
2. Implement Torch Dataset and DataLoader objects to simplify data access.
3. Define model structure and select an appropriate loss function.
4. Implement a training loop and explain your choices.
5. Predict species for the test set.

## Modules

In [None]:
import pandas as pd
import torch

## Part 1: Visualize species distribution and prepare to classify top 9 species.

### Plot a histogram of species in the training data set.

Don't worry about cleaning up possible duplicates.
Just use the species names as is.

In [None]:
# YOUR CHANGES HERE

...

### Plot a histogram of the top 9 species plus an "Other" category for the other species.

Use the training data set again.
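
As a rough illustration of the grouping step, the sketch below counts species in a toy table and collapses everything outside the top 9 into "Other". The column name `species` and the toy species names are assumptions made for this example and may not match the real "train.csv".

```python
import pandas as pd

# Toy stand-in for train.csv: 12 made-up species with counts 12, 11, ..., 1.
# The column name "species" is an assumption, not taken from the real file.
species_counts = {f"species_{i}": 12 - i for i in range(12)}
train = pd.DataFrame(
    {"species": [name for name, n in species_counts.items() for _ in range(n)]}
)

TOP_N = 9
counts = train["species"].value_counts()       # sorted from most to least frequent
top_species = counts.index[:TOP_N].tolist()    # names of the TOP_N most frequent species

# Keep the top species as is and relabel every other species as "Other".
collapsed = train["species"].where(train["species"].isin(top_species), "Other")
collapsed_counts = collapsed.value_counts()

# With matplotlib installed, collapsed_counts.plot.bar() would draw the chart.
print(collapsed_counts)
```

The same `top_species` list can be reused later so that the training and validation files share one column order.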

In [None]:
# YOUR CHANGES HERE

...

### Write files "train-onehot.csv" and "validation-onehot.csv".

These will facilitate your data management later.
Both files should start with the columns `uuid` and `image_path`, copied from "train.csv" or "validation.csv" respectively, followed by one column for each of the top 9 species in the training data set, in order, followed by `other`.
The values for the species and `other` columns should all be zero or one.
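
One possible way to build such a table is sketched below on toy data. The helper `to_onehot`, the column name `species`, and the two-species "top" list are hypothetical and only keep the example small; the real files need the top 9 species from the training set.

```python
import pandas as pd


def to_onehot(df, top_species, species_col="species"):
    """Return uuid, image_path, one 0/1 column per top species, then `other`."""
    out = df[["uuid", "image_path"]].copy()
    for name in top_species:
        out[name] = (df[species_col] == name).astype(int)
    # Rows whose species is not in the top list are marked in `other`.
    out["other"] = (~df[species_col].isin(top_species)).astype(int)
    return out


# Toy rows; the "species" column name is an assumption about train.csv.
train = pd.DataFrame({
    "uuid": ["u1", "u2", "u3", "u4"],
    "image_path": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "species": ["Danaus plexippus", "Vanessa cardui", "Danaus plexippus", "Pieris rapae"],
})
top_species = ["Danaus plexippus", "Vanessa cardui"]  # only 2 here to keep the toy small

train_onehot = to_onehot(train, top_species)
# train_onehot.to_csv("train-onehot.csv", index=False)  # uncomment to write the file
```

Passing the same `top_species` list when encoding the validation data keeps the columns of both files aligned.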

In [None]:
# YOUR CHANGES HERE

...

Submit "train-onehot.csv" and "validation-onehot.csv" in Gradescope.

Comment: Writing the onehot encoding to files could have been skipped, since the Dataset subclass in Part 2 could handle the encoding instead.
However, the auto-grader checks these files and awards points after verifying their contents.
Submit early to confirm that you have prepared the intended files.

## Part 2: Implement Torch Dataset and DataLoader objects to simplify data access.

[Homework 5](https://colab.research.google.com/github/DL4DS/fa2025/blob/main/static_files/assignments/homework5.ipynb) asked you to implement Torch [Dataset](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and [DataLoader](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) objects to separate data loading from the training logic.

Implement an appropriate Dataset subclass for the training and validation data.
The class constructor should take in a filename ("train.csv" or "validation.csv") and load the corresponding data.
The `__getitem__` method should return a tuple containing a tensor with the data for a single image and a tensor with the onehot encoding of the target value.
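
The sketch below shows the general shape of such a class under some assumptions: it reads a CSV laid out like the onehot files from Part 1, and it delegates image decoding to a caller-supplied function. The names `OneHotImageDataset`, `load_image`, and `fake_loader` are hypothetical, and the fake loader returns a blank tensor instead of decoding a real image.

```python
import os
import tempfile

import pandas as pd
import torch
from torch.utils.data import Dataset


class OneHotImageDataset(Dataset):
    """Hypothetical Dataset over a CSV with uuid, image_path, then 0/1 label columns."""

    def __init__(self, csv_path, load_image):
        self.table = pd.read_csv(csv_path)
        label_columns = list(self.table.columns[2:])  # everything after uuid, image_path
        self.targets = torch.tensor(self.table[label_columns].to_numpy(dtype="float32"))
        self.load_image = load_image  # callable: path -> float tensor shaped (C, H, W)

    def __len__(self):
        return len(self.table)

    def __getitem__(self, index):
        image = self.load_image(self.table["image_path"].iloc[index])
        return image, self.targets[index]


def fake_loader(path):
    # Stand-in for real image decoding; always returns a blank 3x64x64 image.
    return torch.zeros(3, 64, 64)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy-onehot.csv")
    pd.DataFrame({
        "uuid": ["u1", "u2"],
        "image_path": ["a.jpg", "b.jpg"],
        "species_a": [1, 0],
        "other": [0, 1],
    }).to_csv(csv_path, index=False)
    dataset = OneHotImageDataset(csv_path, fake_loader)

image, target = dataset[1]
```

Keeping the decoding step in a separate callable makes it easy to swap in resizing or augmentation later without touching the indexing logic.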

In [None]:
# YOUR CHANGES HERE

...

Create a DataLoader object for the training data.
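
For reference, a minimal sketch of wrapping a dataset in a DataLoader follows. It uses a `TensorDataset` of blank tensors as a stand-in for the real data, and the values `batch_size=4` and `shuffle=True` are illustrative assumptions rather than recommended settings.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 10 blank 3x32x32 "images" with 10-way onehot-shaped targets.
toy_dataset = TensorDataset(torch.zeros(10, 3, 32, 32), torch.zeros(10, 10))

# shuffle=True reorders the samples every epoch; batch_size sets samples per step.
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

images, targets = next(iter(loader))
print(images.shape, targets.shape, len(loader))
```

With 10 samples and a batch size of 4, the loader yields three batches, the last one smaller than the others.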

In [None]:
# YOUR CHANGES HERE

...

Explain any non-default parameter choices you made for the DataLoader.

YOUR ANSWER HERE

...

## Part 3: Define model structure and select an appropriate loss function.

Implement the model class for your convolutional neural network.
The final output of your model should be a tensor of probabilities over the ten classes (the top 9 species plus `other`).
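
To make the expected output concrete, here is a deliberately small sketch of a network that ends in a softmax. The class name `SmallButterflyCNN`, the layer sizes, and the channel counts are arbitrary assumptions chosen for brevity, not a suggested architecture.

```python
import torch
from torch import nn


class SmallButterflyCNN(nn.Module):
    """Hypothetical two-layer CNN that returns class probabilities."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # collapses any spatial size to 1x1
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, images):
        x = self.features(images).flatten(1)  # (batch, 32)
        return torch.softmax(self.classifier(x), dim=1)  # each row sums to 1


model = SmallButterflyCNN()
probabilities = model(torch.zeros(2, 3, 64, 64))
```

Because the model returns probabilities rather than raw scores, the loss function has to be chosen to match.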

Hint: refer to [homework 5](https://colab.research.google.com/github/DL4DS/fa2025/blob/main/static_files/assignments/homework5.ipynb) and [project 1](https://colab.research.google.com/github/DL4DS/fa2025/blob/main/static_files/assignments/project1.ipynb) for examples.

In [None]:
# YOUR CHANGES HERE

...

Explain your choices for the shapes in each convolutional layer of your model.

YOUR ANSWER HERE

...

Explain your choices for the channels in each convolutional layer of your model.

YOUR ANSWER HERE

...

Pick and define a loss function for your model.
Feel free to use an existing Torch function.
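
As one illustration of how a loss can be matched to the model output, the sketch below computes cross-entropy directly from probabilities and onehot targets. The function name `onehot_cross_entropy` and the `eps` clamp are assumptions of this example. Note that `torch.nn.CrossEntropyLoss` expects raw scores (logits) rather than probabilities, so it would not be a drop-in replacement here.

```python
import torch


def onehot_cross_entropy(probabilities, onehot_targets, eps=1e-8):
    """Mean cross-entropy between predicted probabilities and onehot targets."""
    # Clamp before the log so a predicted probability of exactly 0 does not give -inf.
    log_probs = torch.log(probabilities.clamp_min(eps))
    return -(onehot_targets * log_probs).sum(dim=1).mean()


probabilities = torch.tensor([[0.5, 0.5], [0.25, 0.75]])
onehot_targets = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
loss = onehot_cross_entropy(probabilities, onehot_targets)
```

For the two toy rows, the loss is the average of `-log(0.5)` and `-log(0.75)`.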

In [None]:
# YOUR CHANGES HERE

...

Explain your choice of loss function.
One sentence should suffice.

YOUR ANSWER HERE

...

## Part 4: Implement a training loop and explain your choices.

Implement and run an appropriate training loop.
Plot accuracy and loss for the training and validation sets for each epoch.
Plot a summary statistic of the gradients (of your choice) for each epoch.
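
The bookkeeping for these plots can be sketched as follows. This toy loop uses a linear model on random data as a stand-in for the CNN, pairs raw logits with `nn.CrossEntropyLoss`, and records the mean global L2 gradient norm as the gradient summary; all of these are assumptions made for illustration. A validation pass (typically under `torch.no_grad()`) is omitted for brevity.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
features = torch.randn(32, 8)            # toy inputs standing in for images
labels = torch.randint(0, 3, (32,))      # toy class indices
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

model = nn.Linear(8, 3)                  # stand-in for a CNN
loss_fn = nn.CrossEntropyLoss()          # expects raw logits and class indices
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

history = {"loss": [], "accuracy": [], "grad_norm": []}
for epoch in range(3):
    total_loss, correct, grad_norms = 0.0, 0, []
    model.train()
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        logits = model(batch_x)
        loss = loss_fn(logits, batch_y)
        loss.backward()
        # Global L2 norm over all parameter gradients for this batch.
        squared = sum(p.grad.pow(2).sum() for p in model.parameters())
        grad_norms.append(torch.sqrt(squared).item())
        optimizer.step()
        total_loss += loss.item() * len(batch_x)
        correct += (logits.argmax(dim=1) == batch_y).sum().item()
    # Per-epoch averages, ready to plot against the epoch number.
    history["loss"].append(total_loss / len(features))
    history["accuracy"].append(correct / len(features))
    history["grad_norm"].append(sum(grad_norms) / len(grad_norms))
```

Each list in `history` ends up with one value per epoch, which maps directly onto the x-axis of the requested plots.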

In [None]:
# YOUR CHANGES HERE

...

## Part 5: Predict species for the test set.

Use your model to predict probabilities for each of the species choices.

In [None]:
# YOUR CHANGES HERE

...

Save your predictions to "test-predictions.csv" with columns `uuid`, the top 9 species and `other`.
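
A possible shape for the prediction and export steps is sketched below with a tiny stand-in model and made-up identifiers. The class column names, the `uuids` list, and the model itself are placeholders; the real file needs the actual top 9 species names.

```python
import pandas as pd
import torch
from torch import nn

# Placeholder inputs: two blank 3x4x4 "images" and a tiny untrained model.
uuids = ["u1", "u2"]
class_columns = ["species_a", "species_b", "other"]  # hypothetical column names
test_images = torch.zeros(2, 3, 4, 4)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 4 * 4, len(class_columns)))

model.eval()                 # switch layers such as dropout to inference behavior
with torch.no_grad():        # no gradients are needed for prediction
    probabilities = torch.softmax(model(test_images), dim=1)

predictions = pd.DataFrame(probabilities.numpy(), columns=class_columns)
predictions.insert(0, "uuid", uuids)
# predictions.to_csv("test-predictions.csv", index=False)  # uncomment to write the file
```

Passing `index=False` keeps pandas from writing its row index as an extra unnamed column.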

In [None]:
# YOUR CHANGES HERE

...

Submit "test-predictions.csv" to Gradescope.

## Part 6: Submit Your Notebook

Submit your notebook to Gradescope.

## Follow Up

Make sure to check the immediate auto-grader feedback in Gradescope.
It will let you know if your file formats are correct and help you catch some avoidable mistakes.