# **JHub Coding Module 4C Guidance and Submission Template**

#### **Link to Github Hosted Repo:** https://github.com/BenjaminFraser/JHubModule4C

This notebook is a template for Module 4C of the JHub Coding Scheme. It has been produced to improve the consistency of submissions and to act as a basic starting point for the challenge. The challenge still gives students considerable flexibility in the techniques and processes they choose to apply. It is designed to be a considerable step up from Challenge 4B, and so should test more thoroughly students' ability to apply the principles of data science and machine learning.

**Important:** Please do not change or modify the code within Sections 1-3 of this notebook, with the exception of Section 2.2 for your chosen dependencies. The prepopulated code has been provided to avoid difficulties with downloading and extracting the dataset for this challenge. When you make a submission, your notebook will be tested in Google Colab, so it is essential that your code has been tested and works without any issues on the Google Colab platform.

You need to populate the required functions to solve this problem. All dependencies should be documented in the next cell.

**You can:** add further cells or text blocks to extend or further explain your solution, and add further functions.

**Don't:** rename the helper functions and classes within the notebook; keeping their names unchanged ensures consistency during testing of submissions.

---

## **1. Challenge Overview and Description**

### **1.1 Summary of Requirements:**

In summary, this challenge consists of the following tasks:

1. Loading and processing of a large image dataset, which is stored in a realistic directory structure.

2. Preprocessing and preparation of image data and labels into a suitable form for a DNN.

3. Model production, tuning and evaluation of performance on the image classification task using the training and test sets provided.

4. Your submission must provide a function which can return a prediction when provided with a single image in the same format and dimensions as the training data. The image will be provided as a PNG, and the function should return a class prediction as text.

5. Your submission must run in a Google Colab notebook and not take an undue amount of time to train (max 30 mins). Training should be easily achievable in under 10 mins.

6. Your submission will be tested against an unseen holdout test dataset.

7. Your model sh

---

### **1.2 Introducing the dataset and model requirements:**

For this challenge you must produce a functioning Deep Neural Network model that classifies 32x32 RGB images into one of three possible classes: Aircraft, Ships or Automobiles.

The dataset provided consists of two splits of data: the training and test splits. Each split has been produced using stratified sampling, with an equal proportion of data available for each class, so imbalanced data is not an issue in this challenge.

Within each split, there are a large number of .png images for each class. These are organised into a directory structure, like so:

    - Train
        - aircraft
            [1000 PNG formatted 32x32 RGB images]
        - automobile
            [1000 PNG formatted 32x32 RGB images]
        - ship
            [1000 PNG formatted 32x32 RGB images]

    - Test
        - aircraft
            [333 PNG formatted 32x32 RGB images]
        - automobile
            [333 PNG formatted 32x32 RGB images]
        - ship
            [333 PNG formatted 32x32 RGB images]
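
As a quick sanity check on the directory structure above, the number of images per class can be counted before any loading is attempted. The following is a minimal sketch only: the function name `count_images_per_class` is hypothetical and not part of the template, and it assumes each class is a sub-directory containing `.png` files.

```python
from pathlib import Path


def count_images_per_class(split_dir):
    """Return {class_name: number_of_png_files} for one split directory.

    Hypothetical helper: assumes each class is a sub-directory of split_dir.
    """
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            # Only files ending in .png are counted; anything else is ignored.
            counts[class_dir.name] = len(list(class_dir.glob("*.png")))
    return counts


# Example usage once the dataset has been extracted:
# print(count_images_per_class("Data/train"))
```

If the counts differ from those listed in the structure above, the download or extraction step is the likely place to look.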
            

To complete the task, you must first load and preprocess the images into suitable training and test datasets, followed by the training, tuning and evaluation of a Deep Neural Network model. The aim is to produce a tuned model that generalises as well as possible to the test set.

As a rule of thumb, if you are achieving an accuracy of 75% or higher on the test set (having only trained on training data), your model is performing well.

You are not limited to a particular type of Deep Learning Model, and may apply any architecture type you like.

---
### **1.3 Final predictions on held-out dataset:**

Once your model is finalised and you have evaluated its performance on the test dataset, the assessor will test your submission against a held-out dataset. Ensure that your code can be run in a Google Colab notebook within the time constraints. Your submission must train a model, which can then make predictions on unseen data of the same form as the training data.

---

## **2. Import Dependencies**

### **2.1 Fixed Dependencies - Leave these as they are**

In [None]:
# Fixed dependencies - do not remove or change.
import pytest
import pandas as pd
import os
import numpy as np

### **2.2 Custom dependencies - add whatever you need into here**

In [None]:
# import your dependencies here, e.g. tensorflow, keras, matplotlib...

---

## **3. Downloading and extracting the data**

**Important note** - Use only one of the following methods for obtaining the dataset. You don't need to run all three.

### **3.1 Downloading and extracting using curl and unzip:**

The following commands will download the dataset, as a .zip file, and then extract it into the local directory:

In [None]:
# give the following command sufficient time to download the 10 MB dataset before running the next cell
!curl -L "https://github.com/BenjaminFraser/JHubModule4C/blob/main/Module_4C_Dataset.zip?raw=true" > dataset.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   143  100   143    0     0    725      0 --:--:-- --:--:-- --:--:--   725
100   154  100   154    0     0    158      0 --:--:-- --:--:-- --:--:--  150k
100 9432k  100 9432k    0     0  5954k      0  0:00:01  0:00:01 --:--:-- 5954k


In [None]:
!unzip dataset.zip

In [None]:
# remove zip file - no longer needed
!rm dataset.zip

In [None]:
!ls

### **3.2 Alternative method - downloading and extracting using wget (only if previous method did not work):**

Alternatively, if the curl command above does not work on your OS, try wget instead, like so:

In [None]:
!wget "https://github.com/BenjaminFraser/JHubModule4C/blob/main/Module_4C_Dataset.zip?raw=true" -O dataset.zip

In [None]:
!unzip dataset.zip
!rm dataset.zip

### **3.3 Manual method (if two methods above both fail) - download .zip file from Github and extract manually**

Finally, if neither of these methods works on your OS, simply navigate to the following URL:

- https://github.com/BenjaminFraser/JHubModule4C/blob/main/Module_4C_Dataset.zip?raw=true

Download the zip file manually, and unzip into the local directory.

You should now have the entire dataset within your local directory, named 'Data'.
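
If the shell tools are unavailable, the extraction step can also be done from Python with the standard library. The sketch below is illustrative only: the helper names `extract_dataset` and `missing_split_dirs` are hypothetical, and the check assumes the archive unpacks to a `Data` directory containing `train` and `test` sub-directories, as used by the paths later in this notebook.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, target_dir="."):
    """Extract the archive into target_dir and return its top-level entries."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        top_level = {name.split("/")[0] for name in archive.namelist()}
    return sorted(top_level)


def missing_split_dirs(data_root="Data", splits=("train", "test")):
    """Return the split directories that are NOT present under data_root."""
    root = Path(data_root)
    return [split for split in splits if not (root / split).is_dir()]


# Example usage after downloading the zip file manually:
# print(extract_dataset("dataset.zip"))
# print(missing_split_dirs())  # an empty list means both splits were found
```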

## **4. Loading and processing the image data and labels into a dataset**

In [None]:
train_dir = "Data/train"
test_dir = "Data/test"

You need to load the downloaded dataset into a form suitable for model development. The helper function below is intended to guide you with this.

This function needs to import and process the image data appropriately. 

The techniques for preprocessing and handling the data are not set, and you can apply your own methodology as you choose.

The images are currently stored in the directory structure defined above. Your methodology should involve importing all images and creating a suitable image dataset and corresponding set of class labels.
    
It's recommended that you create a function that loads and processes the image data based on a chosen directory given, e.g. train data filepath, which then returns the associated image data as np arrays, and labels as a pandas df or series as appropriate. A template function for this is provided below.

The suggested output format for the image dataset is a numpy array, with the shape: (dataset_size, 32, 32, 3). 

The labels can be either a numpy array with the corresponding class ids, or alternatively a pandas series with the string output labels.
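
One possible way to approach this is sketched below. It is not the required solution, only an illustration under stated assumptions: the names `read_png` and `load_image_folder` are hypothetical, Pillow is assumed to be installed for reading PNG files, and the labels are returned as a numpy array of strings taken from the class directory names (this could equally be wrapped in a pandas Series).

```python
from pathlib import Path

import numpy as np


def read_png(path):
    """Read one image file as an RGB uint8 array of shape (height, width, 3)."""
    from PIL import Image  # assumption: Pillow is available in the environment

    with Image.open(path) as img:
        return np.asarray(img.convert("RGB"), dtype=np.uint8)


def load_image_folder(data_dir, reader=read_png):
    """Load every PNG under data_dir/<class_name>/ into arrays.

    Returns (images, labels): images has shape (dataset_size, H, W, 3) and
    labels holds the class directory name for each image.
    """
    images, labels = [], []
    for class_dir in sorted(Path(data_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for image_path in sorted(class_dir.glob("*.png")):
            images.append(reader(image_path))
            labels.append(class_dir.name)
    # np.stack requires all images to share the same shape (here 32x32x3).
    return np.stack(images), np.array(labels)
```

The `reader` argument is there so that the directory-walking logic can be tried out without real image files; in normal use the default is sufficient.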

In [None]:
def load_and_process_data(data_dir, img_height=32, img_width=32, channels=3):
    """ Example function for loading and processing images stored
        in a directory structure into an input dataset and corresponding labels
    """
        
    return image_dataset, image_labels

In [None]:
# load training and test images into memory
train_images, train_labels = load_and_process_data(train_dir)
test_images, test_labels = load_and_process_data(test_dir)

## **5. Preprocessing and preparation of the Image Data and Labels**
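
The preprocessing approach is left open, but two steps are common for image classifiers and are sketched here as a starting point only. The function names `scale_pixels` and `encode_labels` are hypothetical; the sketch assumes the images are uint8 arrays with values in [0, 255] and the labels are strings.

```python
import numpy as np


def scale_pixels(images):
    """Convert uint8 pixel values in [0, 255] to float32 values in [0, 1]."""
    return np.asarray(images, dtype=np.float32) / 255.0


def encode_labels(labels):
    """Map string labels to integer ids.

    Returns (ids, class_names), where class_names[i] is the string label
    for id i. Class names are ordered alphabetically by np.unique.
    """
    class_names, ids = np.unique(np.asarray(labels), return_inverse=True)
    return ids, class_names.tolist()
```

Keeping `class_names` alongside the encoded ids makes it straightforward to convert predictions back into string labels later.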

## **6. Model Production, Tuning and Evaluation**

In [None]:
## Perform chosen methodologies for model production, tuning and evaluation of a DNN model.

## This is the main section of the assignment and as such will likely involve the most code. 

## No helper functions or classes have been provided in this section, which is to encourage you
## to tackle the problem as you see fit. 

## Tips / Considerations: 
##     - Visualising model performance is very important for DNNs, and makes optimisation much easier
##     - Consider the best evaluation metrics to use for this classification problem, is accuracy best?
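
On the question of metrics: because the classes are balanced, accuracy is a reasonable headline figure, but a confusion matrix shows which classes are being mistaken for one another. The following is a minimal numpy sketch, assuming labels have already been encoded as integer ids in the range 0 to 2; the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(true_ids, predicted_ids, num_classes=3):
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for true_id, predicted_id in zip(true_ids, predicted_ids):
        matrix[true_id, predicted_id] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    totals = matrix.sum(axis=1)
    # np.maximum avoids division by zero for a class with no examples.
    return np.diag(matrix) / np.maximum(totals, 1)


def accuracy(matrix):
    """Overall fraction of correct predictions."""
    return np.trace(matrix) / matrix.sum()
```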

In [None]:
def predict_class(model, single_image):
    """
    This function must take your trained model and a single image so that we can test your model.
    :param model : trained model object
    :param single_image: single image file, which will be a .png file with the same dimensions as the test image
    
    :return: image_class
    :rtype: str
    """
    

## **7. Model predictions on hold-out test set**

**Criteria:** You must ensure the resultant predictions on the hold-out test set are produced with the labels in string format, e.g. 'aircraft', 'ship' or 'automobile'. Do not submit the predictions in their encoded format, i.e. 0, 1 or 2.
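
Converting a model's numeric output back into a string label might look like the sketch below. This is an illustration only: `decode_prediction` and `CLASS_NAMES` are hypothetical names, the model is assumed to output one score per class, and the order of `CLASS_NAMES` is an assumption that must match whatever label encoding was used during training.

```python
import numpy as np

# Assumed id-to-name order; this must match the encoding used in training.
CLASS_NAMES = ["aircraft", "automobile", "ship"]


def decode_prediction(class_scores, class_names=CLASS_NAMES):
    """Return the string label of the highest-scoring class.

    class_scores may be a flat sequence of scores or a batch of one,
    e.g. an array of shape (1, 3).
    """
    best_id = int(np.argmax(np.ravel(class_scores)))
    return class_names[best_id]
```

For example, scores of `[0.1, 0.2, 0.7]` would be decoded as `'ship'` under the assumed class order.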