# Birds Notebook

https://www.cranberrygrape.com/projects/birds-ai

This notebook is part of the birds AI series of notebooks provided by Cosmic Bee (Tim Lovett).

Notebook's purpose: This notebook creates a cached on-disk structure for the train, validation, and test sets. I intended to refine the outputs later to a shorter "birdfeeder friendly" list, and caching the splits up front makes sure test data can't leak into training at any point in that process.

This notebook contains the logic for both the initial caching of the segmented data and the further filtering.

In [None]:
# Imports
from pathlib import Path
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import tensorflow as tf

In [None]:
BATCH_SIZE = 128
# You may need to lower this depending upon your GPU

TARGET_SIZE = (224, 224)
# Target size should be left alone to avoid scaling images prematurely
# This is the image size


# Note: I've extracted the dataset to C:/colab/data/birds and am using the train
# folder to segment the data. The other existing segments contain only five
# images per bird lowering confidence in the model output if used.
dataset = "C:/colab/data/birds/train"

train_storage_ds = "C:/colab/data/birds/cache/train"
valid_storage_ds = "C:/colab/data/birds/cache/valid"
test_storage_ds = "C:/colab/data/birds/cache/test"
filtered_train_storage_ds = "C:/colab/data/birds/cache/filtered/train"
filtered_valid_storage_ds = "C:/colab/data/birds/cache/filtered/valid"
filtered_test_storage_ds = "C:/colab/data/birds/cache/filtered/test"

In [None]:
# The dataset creator included a LOONEY BIRDS category containing several
# humans. It took me a very long time to go through each folder to ensure no
# other troll categories or data were present. Data quality directly affects
# model accuracy and quality.
def should_ignore(path, ignored_paths):
    return any(ignored_path in path for ignored_path in ignored_paths)

image_dir = Path(dataset)
ignored_paths = ['LOONEY BIRDS']

filepaths = [
    str(path)
    for path in image_dir.glob('**/*')
    if path.is_file() and not should_ignore(str(path), ignored_paths)
]
# The label is the name of the folder that directly contains each image
labels = [Path(filepath).parent.name for filepath in filepaths]

# Create DataFrame
filepaths = pd.Series(filepaths, name='Filepath').astype(str)
labels = pd.Series(labels, name='Label')
image_df = pd.concat([filepaths, labels], axis=1)

# Split train and test
train_df, test_df = train_test_split(image_df, test_size=0.2, shuffle=True, stratify=labels, random_state=42)
print("Train and test split")

Train and test split
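
Each image's label is simply the name of its parent folder. As a minimal sketch of that step (the sample paths are hypothetical, modelled on the folder layout used in this notebook, and `label_from_path` is a name introduced only for illustration), the derivation can be written with `pathlib` and sanity-checked by counting images per class:

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical sample paths; the real list comes from globbing the dataset folder
sample_paths = [
    "C:/colab/data/birds/train/KIWI/103.jpg",
    "C:/colab/data/birds/train/KIWI/108.jpg",
    "C:/colab/data/birds/train/PURPLE FINCH/109.jpg",
]

def label_from_path(path):
    """Return the name of the folder that directly contains the file."""
    return PurePosixPath(path).parent.name

sample_labels = [label_from_path(p) for p in sample_paths]
class_counts = Counter(sample_labels)
print(class_counts)
```

A per-class count like this is also a quick way to spot folders that are too small, since a stratified split needs at least two images in every class.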


## Storage

Next we'll store these images to the cache directory we previously specified.
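
The cell below holds out a validation subset from the training frame with `validation_split=0.2`. Since 20% of the data was already set aside for testing, the overall proportions work out as in this small sketch (plain arithmetic for illustration, not part of the original notebook):

```python
from fractions import Fraction

test_fraction = Fraction(1, 5)     # test_size=0.2 in train_test_split
validation_split = Fraction(1, 5)  # validation_split=0.2 in ImageDataGenerator

remaining = 1 - test_fraction                  # share left after removing the test set
valid_fraction = remaining * validation_split  # validation share of the whole dataset
train_fraction = remaining - valid_fraction    # training share of the whole dataset

for name, frac in [("train", train_fraction),
                   ("valid", valid_fraction),
                   ("test", test_fraction)]:
    print(f"{name}: {float(frac):.0%}")
```

So the cached folders should end up holding roughly 64% / 16% / 20% of the images, give or take rounding on the actual file counts.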

In [None]:
# Storage logic to break up dataset
splitter_generator = ImageDataGenerator(
    preprocessing_function=tf.keras.applications.efficientnet.preprocess_input,
    validation_split=0.2,
)

test_generator = ImageDataGenerator(
    preprocessing_function=tf.keras.applications.efficientnet.preprocess_input,
)
train_images = splitter_generator.flow_from_dataframe(
    dataframe=train_df,
    interpolation='bilinear',
    x_col='Filepath',
    y_col='Label',
    target_size=TARGET_SIZE,
    color_mode='rgb',
    class_mode='categorical',
    batch_size=BATCH_SIZE,
    subset="training",
    seed=42,
    shuffle=True,
)

validation_images = splitter_generator.flow_from_dataframe(
    dataframe=train_df,
    interpolation='bilinear',
    subset="validation",
    x_col='Filepath',
    y_col='Label',
    target_size=TARGET_SIZE,
    color_mode='rgb',
    class_mode='categorical',
    batch_size=BATCH_SIZE,
    seed=42,
    shuffle=True,
)

test_images = test_generator.flow_from_dataframe(
    dataframe=test_df,
    interpolation='bilinear',
    x_col='Filepath',
    y_col='Label',
    target_size=TARGET_SIZE,
    color_mode='rgb',
    class_mode='categorical',
    batch_size=BATCH_SIZE,
    shuffle=False,
)

import shutil
import os

def copy_files(generator, destination_root):
    for filepath in generator.filepaths:
        filename = os.path.basename(filepath)
        class_label = os.path.basename(os.path.dirname(filepath))

        # Create destination directory
        destination_dir = os.path.join(destination_root, class_label)
        os.makedirs(destination_dir, exist_ok=True)

        # Destination path
        destination_path = os.path.join(destination_dir, filename)

        # Copy file
        shutil.copy(filepath, destination_path)
        print(f"Copying {filename} to {destination_dir}")

copy_files(train_images, train_storage_ds)
copy_files(validation_images, valid_storage_ds)
copy_files(test_images, test_storage_ds)

Streaming output truncated to the last 5000 lines.
Copying 114.jpg to C:/colab/data/birds/cache/test\BARN SWALLOW
Copying 019.jpg to C:/colab/data/birds/cache/test\GAMBELS QUAIL
Copying 117.jpg to C:/colab/data/birds/cache/test\PHILIPPINE EAGLE
Copying 056.jpg to C:/colab/data/birds/cache/test\TEAL DUCK
Copying 109.jpg to C:/colab/data/birds/cache/test\PURPLE FINCH
Copying 068.jpg to C:/colab/data/birds/cache/test\SPOTTED WHISTLING DUCK
Copying 114.jpg to C:/colab/data/birds/cache/test\IBERIAN MAGPIE
Copying 019.jpg to C:/colab/data/birds/cache/test\GROVED BILLED ANI
Copying 149.jpg to C:/colab/data/birds/cache/test\RUFUOS MOTMOT
Copying 160.jpg to C:/colab/data/birds/cache/test\PALILA
Copying 113.jpg to C:/colab/data/birds/cache/test\EASTERN WIP POOR WILL
Copying 049.jpg to C:/colab/data/birds/cache/test\CHESTNET BELLIED EUPHONIA
Copying 093.jpg to C:/colab/data/birds/cache/test\BLUE GRAY GNATCATCHER
Copying 084.jpg to C:/colab/data/birds/cache/test\RUDY KINGFISHER
Copyi
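
Since the point of caching the splits is to avoid leaking test data, it can be worth confirming that no image appears in more than one cache folder. The sketch below is not part of the original notebook: `collect_images` and `find_overlap` are hypothetical helpers, and the check assumes that a class folder name plus a file name uniquely identifies an image (it compares names, not pixel content). It is demonstrated on a throwaway directory tree.

```python
import os
import tempfile

def collect_images(root):
    """Return the set of (class_label, filename) pairs found under root."""
    found = set()
    for current_dir, _dirs, files in os.walk(root):
        label = os.path.basename(current_dir)
        for name in files:
            found.add((label, name))
    return found

def find_overlap(*roots):
    """Return the (class_label, filename) pairs that appear in more than one root."""
    seen = set()
    overlap = set()
    for root in roots:
        images = collect_images(root)
        overlap |= seen & images
        seen |= images
    return overlap

# Demonstration on a temporary tree with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    layout = {
        "train": [("KIWI", "103.jpg"), ("KIWI", "108.jpg")],
        "valid": [("KIWI", "110.jpg")],
        "test": [("KIWI", "115.jpg")],
    }
    for split, items in layout.items():
        for label, name in items:
            folder = os.path.join(tmp, split, label)
            os.makedirs(folder, exist_ok=True)
            open(os.path.join(folder, name), "w").close()

    roots = [os.path.join(tmp, split) for split in layout]
    clean_overlap = find_overlap(*roots)

    # Simulate a leak: the same test image also lands in the train folder
    open(os.path.join(tmp, "train", "KIWI", "115.jpg"), "w").close()
    leaked_overlap = find_overlap(*roots)

print(clean_overlap)
print(leaked_overlap)
```

Against the real cache this would be called as `find_overlap(train_storage_ds, valid_storage_ds, test_storage_ds)`, with an empty set as the expected result. One situation where overlap could plausibly appear is re-running the copy step with a different seed without clearing the cache folders first, since the copy step only ever adds files.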

## Filtering for Birdfeeder

Finally, we'll take the folders we just created and make a filtered copy containing a subset of the classes. We'll use this later when dropping down from the 524 outputs: many of those birds will never come to a feeder, so including them would be wasteful.

---

> **Subnote:** Why would it be wasteful? How does decreasing the output count affect the parameter count? In what situations might it have less effect?

It would be wasteful because the number of units in a layer directly affects its parameter count.

Take, for example, a model (ignoring all other layers) ending in Dense layers with the following unit counts:

*   Dense 1000
*   Dense (output) 500

For the final layer you'd determine its parameter count (weights plus biases) with the following: `1000 * 500 + 500`

This would give that output layer a parameter count of: `500500`

If you had another Dense layer in between like the following:
*   Dense 1000
*   Dense 10
*   Dense (output) 500

The two layers ending this model (to compare relative counts) can be calculated as:

*   `1000 * 10 + 10 = 10010`
*   `10 * 500 + 500 = 5500`

Or together: `10010 + 5500 = 15510`

This of course will affect model accuracy but I mention these values to demonstrate the effect changing the outputs has on both situations.

In the former, changing the output to 400 would cause a significant drop in the final layer's parameter count, as shown below:
`1000 * 400 + 400 = 400400`, a savings of `100100`.

That is still a large number of parameters for the final layer, but it demonstrates how much can be saved.

In the other scenario the savings are much more muted, as the output layer's unit count is multiplied by only 10 inputs regardless of how small it becomes.

So in that case it becomes: `10 * 400 + 400 = 4400`, a savings of only `1100` despite losing `100` outputs.

In this way pruning outputs may have a more significant effect on models where the dense layer structure mimics the former scenario.
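
The arithmetic for both scenarios can be reproduced with a short helper. This is a minimal sketch rather than code from the original notebook: `dense_params` and `tail_params` are hypothetical names, and the formula assumes standard fully connected layers with a bias term.

```python
def dense_params(inputs, units):
    """Parameters in a Dense layer: one weight per input/unit pair plus one bias per unit."""
    return inputs * units + units

def tail_params(layer_sizes):
    """Sum Dense parameters over each consecutive (inputs, units) pair in layer_sizes."""
    return sum(dense_params(i, u) for i, u in zip(layer_sizes, layer_sizes[1:]))

# Scenario 1: Dense 1000 feeding the output layer directly
direct_500 = tail_params([1000, 500])
direct_400 = tail_params([1000, 400])

# Scenario 2: Dense 1000 -> Dense 10 -> output layer
bottleneck_500 = tail_params([1000, 10, 500])
bottleneck_400 = tail_params([1000, 10, 400])

print(direct_500, direct_400, direct_500 - direct_400)
# 500500 400400 100100
print(bottleneck_500, bottleneck_400, bottleneck_500 - bottleneck_400)
# 15510 14410 1100
```

Dropping 100 outputs saves about a hundred thousand parameters when the output layer sits directly on a wide layer, but only 1100 when a 10-unit layer sits in between.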

In [None]:
ignored_paths = ['LOONEY BIRDS', 'TURKEY', 'PENGUIN', 'OWL', 'EAGLE', 'EMU',
                 'VULTURE', 'CHICKEN', 'GOOSE', 'OSTRICH', 'CONDOR', 'PELICAN',
                 'CASSOWARY', 'FLAMINGO', "CRANE", "SWAN", "PEACOCK", "IBIS",
                 "STORK", "PHEASANT", "FALCON", "OSPREY", "HAWK", "CURASSOW",
                 "VULTURINE", "HEN", "HUMMINGBIRD", "SWALLOW", "WARBLER", "FLYCATCHER",
                 "KINGFISHER", "HERON", "EGRET", "KINGLET", "SHRIKE", "WREN"]


def filter_and_copy_files(source_root, destination_root, ignored_paths):
    for root, dirs, files in os.walk(source_root):
        for file in files:
            class_label = os.path.basename(root)
            if not should_ignore(class_label, ignored_paths):
                # Create destination directory including class label
                destination_dir = os.path.join(destination_root, class_label)
                os.makedirs(destination_dir, exist_ok=True)

                # Source and destination paths
                source_path = os.path.join(root, file)
                destination_path = os.path.join(destination_dir, file)

                # Copy file to destination directory
                shutil.copy(source_path, destination_path)
                print(f"Copying {file} to {destination_dir}")

filter_and_copy_files(train_storage_ds, filtered_train_storage_ds, ignored_paths)
filter_and_copy_files(valid_storage_ds, filtered_valid_storage_ds, ignored_paths)
filter_and_copy_files(test_storage_ds, filtered_test_storage_ds, ignored_paths)


Streaming output truncated to the last 5000 lines.
Copying 103.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 108.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 110.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 115.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 117.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 120.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 127.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 131.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 136.jpg to C:/colab/data/birds/cache/filtered/test\KIWI
Copying 105.jpg to C:/colab/data/birds/cache/filtered/test\KNOB BILLED DUCK
Copying 110.jpg to C:/colab/data/birds/cache/filtered/test\KNOB BILLED DUCK
Copying 111.jpg to C:/colab/data/birds/cache/filtered/test\KNOB BILLED DUCK
Copying 114.jpg to C:/colab/data/birds/cache/filtered/test\KNOB BILLED DUCK
Copying 123.jpg to C:/colab/data/birds/cache/filtered/t
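
Because `should_ignore` does a plain substring test, every entry in `ignored_paths` removes all classes whose folder name contains it anywhere. The sketch below illustrates this on a few labels taken from the copy output in this notebook plus one made-up label (`ASHEN FINCH`), and contrasts it with a hypothetical whole-word variant, `should_ignore_words`, which is not part of the original notebook.

```python
def should_ignore(path, ignored_paths):
    # Same substring test the notebook uses
    return any(ignored_path in path for ignored_path in ignored_paths)

def should_ignore_words(label, ignored_words):
    # Hypothetical stricter variant: match whole space-separated words only
    words = label.split()
    return any(ignored in words for ignored in ignored_words)

# A small subset of the ignore list, for illustration
ignored = ["EAGLE", "SWALLOW", "KINGFISHER", "HEN"]

# The last label is made up to show an accidental substring match
labels = ["BARN SWALLOW", "PHILIPPINE EAGLE", "RUDY KINGFISHER",
          "PURPLE FINCH", "KIWI", "ASHEN FINCH"]

kept_substring = [label for label in labels if not should_ignore(label, ignored)]
kept_words = [label for label in labels if not should_ignore_words(label, ignored)]

print(kept_substring)
print(kept_words)
```

The substring test drops `ASHEN FINCH` because it contains `HEN`, while the whole-word variant keeps it. The trade-off is that whole-word matching would no longer catch multi-word entries such as `LOONEY BIRDS` or names where the term is fused into a longer word, so the substring approach may still be the better fit if the list is reviewed against the actual folder names.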

With this, our data has been prepared and we can begin training the model.