### Re-State Our Goal
*Goal:* We want to use Shaggy's dataset to train a male/female butterfly classifier, and we need to describe the dataset work that Shaggy has done.

### State Our Assumptions About the Dataset
When we started, our assumptions were:
- Shaggy's dataset is derived from Kydoimos
- No changes have been made to the image data content
- No images have been added or removed
- Each component of Shaggy's dataset can be linked to its corresponding component in the source dataset (provenance is intact)
- Test/train splits are done appropriately

Apparently, some or all of these assumptions weren't accurate.

### Make an Intermediate Goal
*Goal:* We want to see if we can re-link Shaggy's work to the original dataset.<br>
To test which of our assumptions are off, let's do the following:
1. Download the original/upstream/source dataset (Kydoimos)
2. Load Shaggy's dataset
3. Run MD5 checksums on all images in Kydoimos and Shaggy's dataset
4. Merge on MD5

In [None]:
# All imports
import pandas as pd
from datasets import load_dataset
import os
import hashlib
from io import BytesIO
from PIL import Image
from PIL.TiffTags import TAGS

## 1. Download the original/upstream/source dataset (Kydoimos) 

In [None]:
# The upstream dataset is on Hugging Face: https://huggingface.co/datasets/johnbradley/Kydoimos

dataset_path = "johnbradley/Kydoimos"

# Note that if the dataset does not load using the dataset ID above, try the following two lines instead:
# !git clone https://huggingface.co/datasets/johnbradley/Kydoimos ../../Kydoimos
# dataset_path = "../../Kydoimos"

kydoimos = load_dataset(dataset_path)

### Explore the upstream dataset

In [None]:
# View the upstream dataset contents
# Note that the full dataset sits in the 'train' split only because 'train' is the default split assigned when the dataset is loaded


In [None]:
# Look at a sample image, say, index 1


In [None]:
# See that the image is a PIL object


### Load the upstream dataset into a Pandas dataframe for simpler exploration

In [None]:
kydoimos_df = pd.DataFrame(kydoimos['train'])
# This command loads the image data into the dataframe as a column. This is not recommended for large datasets. Since our dataset is small, it's OK.


# To make a dataframe without the image column, use the following command instead: 

# kydoimos_df = pd.DataFrame(kydoimos['train'].remove_columns(['image'])) 

In [None]:
# kydoimos_df.nunique() # Note that this gives an error due to the 'image' column


## 2. Load Shaggy's dataset

In [None]:
# Load the metadata table into a dataframe
shaggy_dir = '../../Shaggy/'
shaggy_df = pd.read_csv(os.path.join(shaggy_dir, 'metadata.csv'), encoding = 'utf-8', low_memory=False)

# Add a column showing how to get to each image from here.
shaggy_df['rel_file_path'] = shaggy_dir + shaggy_df['file_name']

## 3. Run MD5 checksums on all images in Kydoimos and Shaggy's dataset

In [None]:
# MD5 checksum practice ...


# Note that a small change in the data will result in a completely different checksum.

# Review the 'further-reading.ipynb' notebook for more info on computing MD5 checksums.
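
As a minimal practice sketch (not necessarily the exercise's intended answer), the standard-library `hashlib` module can hash two byte strings that differ by a single character. The byte strings here are made up for illustration.

```python
import hashlib

# MD5 of a short byte string, returned as a 32-character hex digest
original = b"hello world"
digest_original = hashlib.md5(original).hexdigest()

# Change one character and hash again
modified = b"hello worle"
digest_modified = hashlib.md5(modified).hexdigest()

print(digest_original)  # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(digest_modified)
print(digest_original == digest_modified)  # False
```

The two digests share no obvious resemblance even though the inputs differ by one byte, which is why a checksum works as a strict "is this exactly the same data?" test.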

In [None]:
# Create functions for MD5 checksums

# For use with reading files from disk


# For use with PIL image objects
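
One possible shape for these two helpers is sketched below. The name `md5_checksum` is hypothetical, while `pil_md5_checksum` is the name used later in this notebook. Hashing the output of `image.tobytes()` is an assumption about how the PIL version works: it covers only the decoded pixel data, so file-level metadata does not affect the result.

```python
import hashlib


def md5_checksum(file_path, chunk_size=8192):
    """Return the MD5 hex digest of a file's raw bytes on disk.

    Hypothetical helper name. Reads the file in chunks so large files
    do not need to fit in memory at once.
    """
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def pil_md5_checksum(image):
    """Return the MD5 hex digest of a PIL image's decoded pixel data.

    Assumption: only the bytes from image.tobytes() are hashed, so EXIF
    and other file metadata are ignored.
    """
    return hashlib.md5(image.tobytes()).hexdigest()
```

With helpers like these, the file-based checksum answers "are these two files byte-for-byte identical?", while the PIL-based checksum answers "do these two images decode to the same pixels?".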


In [None]:
# Run MD5 checksum on Shaggy's and Kydoimos datasets


In [None]:
# Now that we have the MD5 checksums, we can save them to a CSV file for future use (omitting the 'image' column).


## 4. Merge the datasets on MD5 to link them together
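
A minimal sketch of the merge step, assuming both dataframes carry an `md5` column. The toy dataframes and their values are made up for illustration and stand in for the real tables.

```python
import pandas as pd

# Toy stand-ins for the upstream and Shaggy dataframes (made-up values)
upstream_toy = pd.DataFrame({"id": [1, 2, 3], "md5": ["aaa", "bbb", "ccc"]})
shaggy_toy = pd.DataFrame({"file_name": ["x.tif", "y.tif"], "md5": ["ddd", "eee"]})

# An inner merge keeps only rows whose checksum appears in both tables
merged_toy = pd.merge(upstream_toy, shaggy_toy, on="md5", how="inner")

print(len(merged_toy))  # 0, because no checksum is shared between the toy tables
```

An inner merge is a reasonable choice here because an empty or short result immediately signals that the two datasets do not line up the way we assumed.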

### No matches???
Like, zoinks Scooby. Something is strange here.

There must be something about the image data that has been changed between the Kydoimos dataset and Shaggy's dataset.

Let's list out the things that could affect the MD5 checksum on the binary data for two images:
- Image data content (resizing, cropping, color changes, compression, corruption)
- Image intrinsic metadata

As a shortcut for this class, we'll skip the detective work and get to the solution:

*The intrinsic metadata is changed when the image is loaded as a PIL object, which was done automatically with the Kydoimos dataset by the `datasets` library.*

To address this, we'll load Shaggy's dataset with PIL before taking MD5s again.

In [None]:
# The plan for this cell is to load the images from disk into the dataframe as PIL objects and compute the MD5 checksums.
# Expect this cell to yield an UnidentifiedImageError: cannot identify image file '<path-to->/amalfreda_0.tif'
# We also got a warning stating: "UserWarning: Corrupt EXIF data.  Expecting to read 2 bytes but only got 0."



In [None]:
# To get a fresh preview of the dataframe:


In [None]:
# We can try to open the individual image causing a problem.


In [None]:
# Let's look through the whole set of Shaggy's images and verify the data integrity for each ...



# Apply the function to each image path in the DataFrame

# Optionally, you can drop the 'valid_image' column if it's no longer needed
# shaggy_df.drop(columns=['valid_image'], inplace=True)
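
One way to write the integrity check is sketched below, using Pillow's `Image.verify()`. The function name `is_valid_image` is hypothetical, and the set of exceptions caught is an assumption about what a damaged file may raise.

```python
from PIL import Image, UnidentifiedImageError


def is_valid_image(path):
    """Return True if PIL can open the file and verify() reports no problem.

    Hypothetical helper name. verify() checks file integrity without
    decoding the full pixel data.
    """
    try:
        with Image.open(path) as img:
            img.verify()
        return True
    except (UnidentifiedImageError, OSError, SyntaxError):
        return False


# Possible use with the dataframe from this notebook:
# shaggy_df['valid_image'] = shaggy_df['rel_file_path'].apply(is_valid_image)
```

Note that `verify()` can miss some kinds of damage, such as truncated pixel data, so fully loading each image with `img.load()` would be a stricter but slower check.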

In [None]:
# Filter the dataframe to remove rows with corrupt images


In [None]:
# Now we can apply the `pil_md5_checksum` function (same as the code that failed above)


# We can see below that the "md5" and "pil_md5" values for each entry are different. This is because the "md5" column was computed from the file on disk, while the "pil_md5" column was computed from the PIL image object. The PIL image object may have been modified in memory, which is why the checksums differ.

In [None]:
# After fix, 110 in kydoimos

In [None]:
# After fix, 107 in shaggy

In [None]:
# Retry the merge using the PIL checksums for each image


In [None]:

# But Shaggy had 107 images ... where's the extra one hiding? Did something change size?

In [None]:
# Let's find the images that don't match between the two datasets and display them
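
A minimal sketch of one way to find the unmatched rows, assuming both dataframes have a `pil_md5` column. The toy dataframes and file names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (made-up values)
upstream_toy = pd.DataFrame({"id": [1, 2], "pil_md5": ["aaa", "bbb"]})
shaggy_toy = pd.DataFrame(
    {
        "file_name": ["a.tif", "b.tif", "selfie.tif"],
        "pil_md5": ["aaa", "bbb", "zzz"],
    }
)

# Rows in Shaggy's table whose checksum never appears upstream
is_matched = shaggy_toy["pil_md5"].isin(upstream_toy["pil_md5"])
unmatched = shaggy_toy[~is_matched]

print(unmatched["file_name"].tolist())  # ['selfie.tif']

# Keeping only the matched rows removes the strays
shaggy_clean = shaggy_toy[is_matched]
```

Using `isin` rather than a merge keeps one row per image in Shaggy's table, even when the upstream table contains the same checksum more than once.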




### Looks like some of Shaggy's personal photos found their way into the dataset while he was organizing things ...

In [None]:
# We can remove the images that don't match from the Shaggy dataframe


In [None]:

# Now the number of images in Shaggy's dataframe matches the number that were merged.

In [None]:

# However, we have more images in the merged dataframe compared to either input dataframe. There must be duplicates.

In [None]:
# We can identify the duplicates using the MD5 checksums for the PIL object form of each image.
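
A minimal sketch of duplicate detection with `DataFrame.duplicated`, assuming a `pil_md5` column. The toy rows are made up for illustration.

```python
import pandas as pd

# Toy table: ids 1 and 2 share the same pixel checksum (made-up values)
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "NHM specimen number": ["A", "A", "B", "C"],
        "pil_md5": ["aaa", "aaa", "bbb", "ccc"],
    }
)

# keep=False flags every member of a duplicate group, not just the repeats
dups = toy[toy.duplicated(subset="pil_md5", keep=False)].sort_values("pil_md5")

print(dups["id"].tolist())  # [1, 2]
```

Sorting by the checksum places each duplicate group on adjacent rows, which makes it easier to compare the other columns side by side.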


In [None]:
# Look at a sample of the duplicates


Quite a few duplicates come from the upstream dataset.
It looks like they come from entries with a unique "id" but an identical "NHM specimen number".

Now that we know duplicates are an issue, we should check if we have duplicates between our test and train splits.

In [None]:
# shaggy_dups

# Filter groups where both 'test' and 'train' are present

# Count the number of 'pil_md5' values present in both 'test' and 'train'
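
A minimal sketch of the leakage check follows. The column name `split` is an assumption about how the train/test assignment is stored, and the toy rows are made up for illustration.

```python
import pandas as pd

# Toy table: checksum 'aaa' appears in both train and test (made-up values)
toy = pd.DataFrame(
    {
        "file_name": ["a.tif", "a_copy.tif", "b.tif", "c.tif"],
        "split": ["train", "test", "train", "test"],  # assumed column name
        "pil_md5": ["aaa", "aaa", "bbb", "ccc"],
    }
)

# For each checksum, check whether both 'train' and 'test' are present
in_both = toy.groupby("pil_md5")["split"].apply(
    lambda s: {"train", "test"}.issubset(set(s))
)

leaked_md5s = in_both[in_both].index.tolist()
n_leaked = int(in_both.sum())

print(leaked_md5s)  # ['aaa']
print(n_leaked)     # 1
```

Any count above zero means that at least one image the model is tested on is pixel-identical to an image it was trained on.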


This shows that there is data leakage between the test and train splits.
That's the final nail in the coffin for Shaggy's dataset. 
At this point, it will be simpler to start over from Kydoimos and organize a new dataset.