---
# Data Analysis and Machine Learning
---
## Assessment 02

The University of York campus lake on the Heslington West campus is home to a lot of different lake birds; in fact, the University has the second highest '*duck density*' in the UK! All University of York students are able to recognise the ducks, geese, and swans that they see around the lake - but is a computer able to, or is it what separates us from machines?

Your dataset is packaged as a .zip archive (which you will need to download and unpack) and contains colour (RGB) images (***X***) of ducks, geese, and swans (*y*). Inside the .zip archive (`lake_bird_images.zip`), there are two subdirectories: `train` and `test`, containing the training and testing datasets, respectively. Inside each of these subdirectories are three further subdirectories: `duck`, `goose`, and `swan`. There are 498 images of ducks, 981 images of geese, and 335 images of swans inside `train` (1814 images in total), and there are 218 images of ducks, 405 images of geese, and 160 images of swans inside `test` (783 images in total).
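
A minimal sketch of unpacking the archive and checking the per-class image counts quoted above. The helper names `unpack` and `count_images` and the list of image file extensions are illustrative assumptions; `count_images` expects the folder that directly contains `train` and `test`, which may or may not be nested inside a top-level folder depending on how the archive unpacks.

```python
import zipfile
from pathlib import Path

CLASSES = ("duck", "goose", "swan")
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image formats


def unpack(zip_path, out_dir):
    """Extract the archive into out_dir (created if missing)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return out_dir


def count_images(root):
    """Return {split: {class: n_images}} for the folder holding train/ and test/."""
    root = Path(root)
    counts = {}
    for split in ("train", "test"):
        counts[split] = {}
        for name in CLASSES:
            folder = root / split / name
            # Count image files only, so hidden files such as .DS_Store are ignored
            counts[split][name] = sum(
                1 for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

Calling `count_images(unpack("lake_bird_images.zip", "data"))` (with the path adjusted to wherever `train` and `test` end up) should reproduce the totals listed above if the archive is intact.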

Your task is to build one deep machine-learning model:
- a **multiclass classification model for predicting whether the image is a duck, a goose, or a swan**. You are only allowed to evaluate your model performance on the test dataset (`test`) once; all model (hyperparameter) tuning should be carried out using only the training dataset (`train`) and a validation set derived from it.
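
One way to derive a validation set from `train` is a shuffled, per-class (stratified) split, so that each class keeps roughly the same proportion in both subsets. The sketch below works on lists of file names; the function name `stratified_split`, the 80/20 ratio, the seed, and the placeholder file names are illustrative assumptions rather than requirements of the assessment.

```python
import random


def stratified_split(files_by_class, val_fraction=0.2, seed=0):
    """Split {class: [files]} into (train, val) dicts, class by class."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = {}, {}
    for name, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so the result does not depend on input order
        rng.shuffle(shuffled)     # then shuffle to avoid any ordering bias
        n_val = int(round(len(shuffled) * val_fraction))
        val[name] = shuffled[:n_val]
        train[name] = shuffled[n_val:]
    return train, val


# Placeholder file names matching the class sizes quoted for `train`
files = {
    "duck": [f"duck_{i}.jpg" for i in range(498)],
    "goose": [f"goose_{i}.jpg" for i in range(981)],
    "swan": [f"swan_{i}.jpg" for i in range(335)],
}
train, val = stratified_split(files)
print({name: (len(train[name]), len(val[name])) for name in files})
# → {'duck': (398, 100), 'goose': (785, 196), 'swan': (268, 67)}
```

Because the test images are never passed to this function, all tuning decisions stay confined to `train` and the validation subset derived from it.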

This assessment also has a written/report-style component which you can complete inside this notebook by adding additional text blocks if necessary. Once you have built your deep machine-learning model, you should:
- evaluate its performance, **producing at least three figures that illustrate the performance of the model**, and **write an analysis of each figure that outlines what the figure is showing and what it tells you about the performance of your model**. You are not limited to only three figures - you can produce more figures if they are useful in illustrating a point - although only three figures and accompanying analyses will count towards your grade on the assessment (these will be the highest-graded three that you present). The figures and accompanying analyses can focus on the training/validation performance, the testing performance, or - ideally - a mixture of the two;
- answer the question: **what limits the performance of the model?** Up to three proposed explanations will count towards your grade on the assessment.
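
A confusion matrix is one candidate for the figures described above. The sketch below builds one with NumPy and plots it with Matplotlib. The label arrays are made-up placeholders used only to make the example runnable; in practice `y_true` and `y_pred` would come from the validation predictions (or, once, the test predictions). The helpers `confusion_matrix` and `plot_confusion_matrix` are defined here for illustration and are not imported from any library.

```python
import numpy as np
import matplotlib.pyplot as plt

CLASS_NAMES = ["duck", "goose", "swan"]


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def plot_confusion_matrix(cm, class_names):
    """Draw the matrix as a heat map with the count written in each cell."""
    fig, ax = plt.subplots(figsize=(4, 4))
    ax.imshow(cm, cmap="Blues")
    ax.set_xticks(range(len(class_names)))
    ax.set_xticklabels(class_names)
    ax.set_yticks(range(len(class_names)))
    ax.set_yticklabels(class_names)
    ax.set_xlabel("Predicted class")
    ax.set_ylabel("True class")
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    return fig


# Placeholder integer labels (0 = duck, 1 = goose, 2 = swan), NOT real results
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 0, 2, 1, 1])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
fig = plot_confusion_matrix(cm, CLASS_NAMES)
```

Dividing the diagonal by the row sums gives the per-class recall, which makes it easy to see whether the most common class dominates the predictions.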

You are limited to 2500 words for your written/report-style contribution, but this is a limit, **not** a target - it is likely much, much more than you will need, and contributions of this length are not expected.

All code should be commented where appropriate, as good practice dictates. When you have finished, use the option on the File menu to download this notebook in .ipynb format and upload it to the submission point on the VLE.

## Tips:

- If you cannot get code to work, comment it out and write comments about what you are trying to do and how it fails.
- Consider running your code locally on your computer or on a University-managed computer rather than Google Colab to avoid uploading the duck/goose/swan dataset to your Google Drive; the dataset contains around 2500 images and will not only be slow to upload but also slow to access for your deep machine-learning model. Your code will run much, much quicker if you run it offline!
- Don't expect the kind of accuracy that you were able to achieve in the last assessment (DAML Assessment 01); this is a much, much more challenging problem! Think, instead, about the baseline accuracy that you might expect for a multiclass classification task like this.
- Familiarise yourself with the new TensorFlow notebooks on deep neural networks (DNNs) and deep convolutional neural networks (CNNs) before you attempt the task.
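
As a rough worked example of the baseline mentioned in the tips above: a classifier that always predicts the most common class is correct for exactly that class's share of the images. The sketch uses the image counts quoted in the dataset description; the function name `majority_baseline` is an illustrative assumption.

```python
def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


train_counts = {"duck": 498, "goose": 981, "swan": 335}
test_counts = {"duck": 218, "goose": 405, "swan": 160}

print(f"train baseline: {majority_baseline(train_counts):.3f}")  # 981 / 1814
print(f"test baseline:  {majority_baseline(test_counts):.3f}")   # 405 / 783
```

A model that always answers `goose` already scores a little over 50%, so an accuracy near that level suggests the network has learned little. Uniform random guessing over three classes would give about 33%.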

Before you start, click the &#x25B8; icon below to allow Colab to access the data files in your Drive (not necessary if you plan to work offline).

In [None]:
from google.colab import drive; drive.mount('/content/drive')

... and click the &#x25B8; icon below to import the `numpy` and `matplotlib.pyplot` libraries:

In [32]:
import numpy as np
import matplotlib.pyplot as plt

You can click the &#x25B8; icon below to install TensorFlow if the environment/computer you're working on doesn't have TensorFlow installed already:

In [None]:
! pip install tensorflow

...and click the &#x25B8; icon below to import the `tensorflow` and `tensorflow.keras` libraries:

In [33]:
import tensorflow as tf
import tensorflow.keras as keras

### Task 01

Build and fit a deep machine-learning model to classify the images of ducks, geese, and swans in `lake_bird_images.zip`. Evaluate your multiclass classification model using the accuracy, and optimise the hyperparameters of your multiclass classification model to obtain the best performance possible on unseen data using the images in `train`. When you are satisfied - **and only once in the notebook** - evaluate and/or produce predictions for the images in `test`.

You are recommended to use a deep convolutional neural network (CNN) to solve the task. Show evidence that you have:

- experimented with the structure and number of the layers (*e.g.* `layers.Conv2D`, `layers.MaxPooling2D`) in your CNN;
- experimented with the addition of other kinds of layers (*e.g.* for data augmentation, and/or regularisation \[`layers.Dropout`, `layers.BatchNormalization`\]);
- evaluated your chosen multiclass classification model on held-out data.
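
The layer types named above can be combined as in the following minimal sketch. It is one possible starting point, not a recommended or tuned architecture: the input size (128×128), the filter counts, the dropout rate, and the function name `build_cnn` are all assumptions chosen for illustration. It expects integer class labels (0, 1, 2), hence the sparse categorical cross-entropy loss; with one-hot labels, `categorical_crossentropy` would be used instead.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_cnn(input_shape=(128, 128, 3), n_classes=3, dropout_rate=0.3):
    """Small CNN with augmentation, batch normalisation and dropout."""
    model = tf.keras.Sequential([
        layers.Input(shape=input_shape),
        # Data augmentation: only active during training
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        # Scale pixel values from [0, 255] to [0, 1]
        layers.Rescaling(1.0 / 255),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),
        # Dropout regularises the classifier head
        layers.Dropout(dropout_rate),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_cnn()
model.summary()
```

Varying the number of `Conv2D`/`MaxPooling2D` blocks, or removing the augmentation, batch-normalisation, or dropout layers one at a time, and comparing the validation accuracy is one simple way to document the experiments asked for above.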

In [55]:
## TODO:
# build and fit a deep machine-learning model to classify the images of ducks,
# geese, and swans in `lake_bird_images.zip`

import os
import shutil

# Path to the original data folder
source_dir = 'lake_bird_images/train'

# Path to the training and validation datasets
train_dir = 'lake_bird_images/train_dataset'
val_dir = 'lake_bird_images/val_dataset'

# Determine the proportion of training and validation sets (80% for training)
train_ratio = 0.8

# Traverse the subfolders of each category (duck, goose and swan)
for category in ['duck', 'goose', 'swan']:
    # Get a sorted list of all image files under this category so that the
    # split is reproducible (os.listdir returns entries in arbitrary order)
    category_dir = os.path.join(source_dir, category)
    image_files = sorted(os.listdir(category_dir))
    num_images = len(image_files)

    # Determine the cut point that splits the data into training and validation sets
    split_point = int(num_images * train_ratio)

    # Create the target folders if they do not exist yet; shutil.copyfile
    # fails when the destination directory is missing
    os.makedirs(os.path.join(train_dir, category), exist_ok=True)
    os.makedirs(os.path.join(val_dir, category), exist_ok=True)

    # Copy images to the training set folder
    for image_file in image_files[:split_point]:
        source_path = os.path.join(category_dir, image_file)
        target_path = os.path.join(train_dir, category, image_file)
        shutil.copyfile(source_path, target_path)

    # Copy images to the validation set folder
    for image_file in image_files[split_point:]:
        source_path = os.path.join(category_dir, image_file)
        target_path = os.path.join(val_dir, category, image_file)
        shutil.copyfile(source_path, target_path)

In [63]:
categories = ['duck', 'goose', 'swan']
data_root = 'lake_bird_images'  # data root directory

# Dictionary storing the number of images in the training set
num_images_train = {}  
# Dictionary storing the number of images in the validation set
num_images_val = {} 

for category in categories:
    train_folder = os.path.join(data_root, 'train_dataset', category)
    val_folder = os.path.join(data_root, 'val_dataset', category)
    
    num_train = len(os.listdir(train_folder))
    num_val = len(os.listdir(val_folder))
    
    num_images_train[category] = num_train
    num_images_val[category] = num_val

# print number of each category of training and validation set
for category in categories:
    print(f" '{category}' training: {num_images_train[category]}")
    print(f" '{category}' validation: {num_images_val[category]}")


 'duck' training: 399
 'duck' validation: 100
 'goose' training: 784
 'goose' validation: 197
 'swan' training: 268
 'swan' validation: 67


In [17]:
# Data preprocessing and augmentation
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# NOTE: `dataset` and `image_size` are assumed to be defined in an earlier cell.
# `dataset` must point at a folder whose immediate subfolders are the class
# folders (duck, goose, swan); otherwise `train` and `test` are read as classes.
DataGenerator = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2
)

# Load the data
train_generator = DataGenerator.flow_from_directory(
    dataset,
    target_size=image_size,
    batch_size=32,
    class_mode='categorical',
    subset='training'
)
validation_generator = DataGenerator.flow_from_directory(
    dataset,
    target_size=image_size,
    batch_size=32,
    class_mode='categorical',
    subset='validation'
)

print("--")
# `.samples` holds the number of images found by each generator
print("Total training images:", train_generator.samples)
print("Total validation images:", validation_generator.samples)

Found 2079 images belonging to 2 classes.
Found 518 images belonging to 2 classes.
--
Total training images: 2079
Total validation images: 518

In [5]:
## TODO:
# plot a figure that illustrates the performance of your deep machine-learning
# model, e.g. a confusion matrix, a hyperparameter optimisation curve, a training/
# validation loss curve, a learning curve, etc.

**TODO:** Write an analysis of the figure above here; what does the figure show, and what does the figure indicate about the performance of your model?

In [None]:
## TODO:
# plot a figure that illustrates the performance of your deep machine-learning
# model, e.g. a confusion matrix, a hyperparameter optimisation curve, a training/
# validation loss curve, a learning curve, etc.

**TODO:** Write an analysis of the figure above here; what does the figure show, and what does the figure indicate about the performance of your model?

In [None]:
## TODO:
# plot a figure that illustrates the performance of your deep machine-learning
# model, e.g. a confusion matrix, a hyperparameter optimisation curve, a training/
# validation loss curve, a learning curve, etc.

**TODO:** Write an analysis of the figure above here; what does the figure show, and what does the figure indicate about the performance of your model?

**TODO:** What limits the performance of your model? Discuss here, giving up to three possible explanations and (if useful) referring back to your figures.