# 1. Connect to GitHub Classroom

Some of the assignments in this unit will be managed via GitHub Classroom. Please follow these steps to connect:

1. Follow this invitation link and accept the invitation: https://classroom.github.com/a/0BXQIFYZ
2. The link may ask you to sign in to GitHub (if you haven't signed in earlier). If you don't have a GitHub account, you will need to register.
3. Once you have logged in with GitHub, you may need to select your email address so that it is linked to your GitHub account (unless you already did this in a previous COMP3420 activity). If you can't find your email address, please skip this step and contact diego.molla-aliod@mq.edu.au so that he can do the association manually.
4. Wait a minute or two, and refresh the browser until it indicates that your assignment repository has been created. Your repository is private to you, and you have administration privileges. Only you and the lecturer will have access to it. The repository will be listed under the list of repositories belonging to this offering of COMP3420: https://github.com/orgs/COMP3420-2024S4/repositories
5. Your assignment repository will include starter code that you can use for the exercises below. Clone your repository into a folder on your computer.

This practical has two kinds of exercises:

1. **Implement functions and upload the implementation to GitHub Classroom**. These exercises have associated automated tests. To run the tests, commit your changes and push them to your repository. This will trigger the automated tests, and you will receive the test results. There are no marks associated with these tests, but they will help you get used to the environment that you will use for the assignments.
2. **Analyse the data, train and evaluate image classifiers.** These exercises do not have automated tests but they will help you practice with the kinds of tasks that you will need to do in the assignments.

# 2. Data Preparation

As training for assignment 2, let's prepare a collection of images for processing by a convolutional network.

1. Download and unzip the CIFAR10 dataset **hosted by Kaggle**. CIFAR10 is a very popular dataset with 60,000 32x32 colour images distributed evenly across 10 classes. The dataset is also included in the TensorFlow library, but we will download and prepare it ourselves as practice for assignment 2.
    - [CIFAR10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html).
2. Once you have downloaded and unzipped the dataset, move the folder `cifar-10-batches-py` to the same folder as this notebook.
3. Run the following code. This code converts the data files in the folder `cifar-10-batches-py` into image files, which are stored in two folders:
    - `cifar_images_train`
    - `cifar_images_test`

In [3]:
import numpy as np
import os
import pickle
from matplotlib import pyplot as plt

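def unpickle(file):
    """Load one CIFAR-10 batch file and return its dictionary (keys are bytes)."""
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')
    return batch

def save_images_from_dict(batch, folder):
    """Save every image in a batch as `<folder>/<label>_<index>.png`."""
    for i in range(len(batch[b'data'])):
        flat_image = batch[b'data'][i]
        # Each row holds 3072 values: 1024 red, then 1024 green, then 1024 blue.
        # Reshape to (channel, row, column) and reorder to (row, column, channel).
        image = np.transpose(np.reshape(flat_image, (3, 32, 32)), (1, 2, 0))
        label = batch[b'labels'][i]
        filename = f"{folder}/{label}_{i}.png"
        plt.imsave(filename, image)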


In [4]:
# Based on output of Bing chat query: "write python code that converts CIFAR-10 data to image files. The code should use matplotlib to convert the data into images."


data_folder = "cifar-10-batches-py"
output_folder = "cifar_images"

# Create the output folders if they do not already exist.
os.makedirs(output_folder+"_train", exist_ok=True)
os.makedirs(output_folder+"_test", exist_ok=True)



In [5]:
for batch in range(1,6):
    batch_file = f"{data_folder}/data_batch_{batch}"
    batch_dict = unpickle(batch_file)
    save_images_from_dict(batch_dict,output_folder+"_train")



In [6]:
test_file = f"{data_folder}/test_batch"
test_dict = unpickle(test_file)
save_images_from_dict(test_dict,output_folder+"_test")

3. Inspect the resulting image files. You will see that the filenames are of the form `label_imagenumber.png`. For your reference, the following code is a list with the label names, so that label 0 is "airplane", label 1 is "automobile", etc.

In [7]:
label_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
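
Because the label is encoded in the filename, it can be recovered with a little string handling. The following is a minimal sketch and is not part of the starter code; the helper name `label_name_from_filename` is hypothetical.

```python
import os

label_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

def label_name_from_filename(filename):
    """Return the label name encoded in a `label_imagenumber.png` filename."""
    # Drop the folder and the extension, e.g. "sample_cifar10/7_4197.png" -> "7_4197".
    stem = os.path.splitext(os.path.basename(filename))[0]
    # The numeric label is the part before the underscore.
    label = int(stem.split('_')[0])
    return label_names[label]

print(label_name_from_filename('sample_cifar10/7_4197.png'))  # horse
```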



4. Check whether the data are balanced. To do this, first write a function `call_counts` that takes as input the path of the folder that contains the training (or the test) data, and returns a dictionary where the keys are the label names and the values are the image counts for each label. The starter code includes a folder `sample_cifar10` with a small number of images that you can use for your tests. We will use these images for the automatic tests in GitHub Classroom. An example of running the function on this sample folder follows.

In [19]:
import week4
week4.call_counts('sample_cifar10')

{'airplane': 10,
 'automobile': 6,
 'bird': 9,
 'cat': 6,
 'deer': 6,
 'dog': 8,
 'frog': 6,
 'horse': 3,
 'ship': 8,
 'truck': 4}

Then use this function to check whether the training data and the test data are balanced.
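
Once the counts are available, one possible way to judge balance is to compare each class count with the mean count. The sketch below is only an illustration: the helper name `is_balanced` and the 10% tolerance are assumptions, not part of the starter code.

```python
def is_balanced(counts, tolerance=0.1):
    """Return True if every class count is within `tolerance` (as a fraction) of the mean."""
    mean = sum(counts.values()) / len(counts)
    return all(abs(c - mean) <= tolerance * mean for c in counts.values())

# Counts returned for the sample folder in the example above.
sample_counts = {'airplane': 10, 'automobile': 6, 'bird': 9, 'cat': 6, 'deer': 6,
                 'dog': 8, 'frog': 6, 'horse': 3, 'ship': 8, 'truck': 4}
print(is_balanced(sample_counts))  # False
```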

5. Write a Python function `sample_images` that takes as input the following information:
     - `path`: the path of the folder that contains images
     - `label_name`: a label name (e.g. "deer", "horse", etc)
   The function must return a list of tuples `(filename, label_name)` containing every file name in `path` that belongs to the class `label_name`, together with the label.
   An example of usage follows:


In [1]:
import week4
week4.sample_images("sample_cifar10", "horse")

[('sample_cifar10/7_4197.png', 'horse'),
 ('sample_cifar10/7_4202.png', 'horse'),
 ('sample_cifar10/7_4207.png', 'horse')]

6. With the help of the function `sample_image_folder` (which is included in the starter code), generate three CSV files as follows:
   - One CSV file "train_set.csv" that contains the first 2000 samples of categories "bird", "cat", "deer", "horse" from the **training** set.
   - One CSV file "devtest_set.csv" that contains the subsequent 500 samples of the same categories from the **training** set. 
   - One CSV file "test_set.csv" that contains the first 500 samples of the same categories from the **test** set.
   
   You will use these CSV files in your subsequent work.

   The function `sample_image_folder` takes these parameters:
    - `path`: the path of the folder containing the images.
    - `selected_label_names`: a list of label names.
    - `sample_numbers`: a list of integers.
    - `output_filenames`: a list of filenames where the CSV files will be stored. 
    
   An example of use of this function follows. In this example:
    - The first 2 samples of classes with names "deer", "airplane", and "truck" are saved in the CSV file with name "file1.csv".
    - The next sample of the same classes is saved in the CSV file with name "file2.csv".

In [3]:
import week4
selected_classes = ('bird', 'cat', 'deer', 'horse')
week4.sample_image_folder('cifar_images_train', selected_classes, (2000, 500), ('train.csv', 'validation.csv'))

['train.csv', 'validation.csv']

In [5]:
import pandas as pd
pd_file1 = pd.read_csv('train.csv', names=['path','label_name'])
pd_file1.head()

|   | path | label_name |
|---|------|------------|
| 0 | cifar_images_train/2_1002.png | bird |
| 1 | cifar_images_train/2_1003.png | bird |
| 2 | cifar_images_train/2_1005.png | bird |
| 3 | cifar_images_train/2_1006.png | bird |
| 4 | cifar_images_train/2_1007.png | bird |


In [7]:
import week4
selected_classes = ('bird', 'cat', 'deer', 'horse')
week4.sample_image_folder('cifar_images_test', selected_classes, (500,), ('test.csv',))

['test.csv']

In [9]:
pd_file2 = pd.read_csv('file2.csv', names=['path','label_name'])
pd_file2.head()

|   | path | label_name |
|---|------|------------|
| 0 | sample_cifar10/4_4703.png | deer |
| 1 | sample_cifar10/0_0.png | airplane |
| 2 | sample_cifar10/9_1172.png | truck |
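
The "first N samples, then the subsequent M samples" behaviour described for `sample_image_folder` can be pictured as cutting a list into consecutive chunks. The sketch below illustrates that idea only; it is an assumption about the behaviour, not the starter code's implementation, and the name `split_consecutive` and the file names are hypothetical.

```python
def split_consecutive(items, sample_numbers):
    """Split `items` into consecutive chunks whose sizes are given by `sample_numbers`."""
    chunks = []
    start = 0
    for n in sample_numbers:
        chunks.append(items[start:start + n])
        start += n  # the next chunk begins where this one ended
    return chunks

files = ['4_10.png', '4_11.png', '4_12.png', '4_13.png']
first, second = split_consecutive(files, (2, 1))
print(first)   # ['4_10.png', '4_11.png']
print(second)  # ['4_12.png']
```

Because the chunks never overlap, a training set and a devtest set built this way from the same folder share no images.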


# 3. Implement Transfer Learning

Re-use and adapt the code from this week's lecture notebook so that it uses MobileNetV2 pre-trained on ImageNet. Train the model with your training data, and evaluate it with your test data. Comment on your results and answer the following questions.

1. What is the accuracy on the training data and on the test data?
2. What is the optimal choice of number of epochs, based on your experiments?

In [None]:
import tensorflow as tf
import keras


In [None]:
from matplotlib import pyplot as plt
img_path = train_file['path'][0]


In [None]:
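import glob
import os
IMG_HEIGHT = 224
IMG_WIDTH = 224
IMG_CHANNELS = 3  # RGB: MobileNetV2 pre-trained on ImageNet expects 3-channel input
CLASS_NAMES = ['bird', 'cat', 'deer', 'horse']  # the four categories selected above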
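
One possible way to set up the transfer-learning model is sketched below. It is not the lecture notebook's code: the function name `build_transfer_model`, the single softmax output layer, and the Adam optimiser are assumptions made for illustration. The sketch also assumes 3-channel 224x224 inputs with pixel values in [0, 255] and integer class labels.

```python
import tensorflow as tf

IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS = 224, 224, 3

def build_transfer_model(num_classes, weights="imagenet"):
    """MobileNetV2 feature extractor (frozen) followed by a new softmax classifier."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS),
        include_top=False,   # drop the original 1000-class ImageNet classifier
        weights=weights,
        pooling="avg",       # global average pooling gives one feature vector per image
    )
    base.trainable = False   # keep the pre-trained weights fixed

    inputs = tf.keras.Input(shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
    # MobileNetV2 expects pixel values in [-1, 1]; this maps [0, 255] to that range.
    x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A call such as `build_transfer_model(num_classes=4)` would download the ImageNet weights on first use. Training would then use `model.fit` with datasets built from your CSV files; comparing training and devtest accuracy after each epoch is one way to choose the number of epochs.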

# 4. (Optional) Improve your system

Experiment with different numbers of hidden layers and sizes, and whether to include dropout or not. Comment on your results. Did you manage to obtain better results?
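
To make such experiments easier to repeat, the classifier placed on top of the pre-trained features can be built from a few parameters. This is a minimal sketch under stated assumptions: the function name `build_head` and its parameters are hypothetical, and the input is assumed to be a flat feature vector (MobileNetV2 with global average pooling produces 1280 features).

```python
import tensorflow as tf

def build_head(feature_dim, num_classes, hidden_sizes=(), dropout_rate=0.0):
    """Classifier head with optional hidden layers, each optionally followed by dropout."""
    layers = [tf.keras.Input(shape=(feature_dim,))]
    for size in hidden_sizes:
        layers.append(tf.keras.layers.Dense(size, activation="relu"))
        if dropout_rate > 0:
            layers.append(tf.keras.layers.Dropout(dropout_rate))
    layers.append(tf.keras.layers.Dense(num_classes, activation="softmax"))
    return tf.keras.Sequential(layers)

# Example configuration: one hidden layer of 128 units with 50% dropout.
head = build_head(1280, 4, hidden_sizes=(128,), dropout_rate=0.5)
head.summary()
```

Varying `hidden_sizes` and `dropout_rate` while keeping everything else fixed makes it easier to attribute any change in devtest accuracy to the architecture.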