# JHub Coding Module 4C Guidance and Submission Template

This notebook is a template for module 4C from the JHub Coding Scheme. 

**Important:** Please do not change or modify the code within Sections 1-3 of this notebook, with the exception of Section 2.2, where you add your chosen dependencies. The prepopulated code is provided to avoid difficulties with downloading and extracting the dataset for this challenge.

When you make a submission, your notebook will be tested in Google Colab, so it is essential that your code has been tested and works without any issues on the Google Colab platform.

This notebook has been produced to improve the consistency of submissions, and to act as a basic starting point for the challenge. Despite this, the challenge still aims to give students a good amount of flexibility in applying a range of techniques and chosen processes to solve the challenge. This challenge is designed to be a considerable step-up compared to Challenge 4B, and as such should test the ability of students more thoroughly for applying various principles of data science and machine learning.

You need to populate the required functions to solve this problem. All dependencies should be documented in Section 2.2.

**You can:**

- add further cells or text blocks to extend or further explain your solution
- add further functions

**Don't:** rename the helper functions and classes within the notebook. Keeping their names unchanged ensures consistency during testing of submissions.

---

## 1. Challenge Overview and Description

**Summary of Requirements:**

In summary, this challenge consists of the following tasks:

1. Loading and processing of a large image dataset, which is stored in a realistic directory structure.

2. Preprocessing and preparation of image data and labels into a suitable form for a DNN.

3. Model production, tuning and evaluation of performance on the image classification task using the training and test sets provided.

4. Final model predictions on a separately held-out test set (supplied with no labels), and saving the predicted labels as a .csv file. A helper function is provided at the bottom of this template for loading this final data.

---

**Introducing the dataset and model requirements:**

For this challenge you must produce a functioning Deep Neural Network model that classifies 32x32 RGB images into one of three possible classes: Aircraft, Ships or Automobiles.

The dataset provided consists of two splits of data: the training and test splits. Each has been produced using stratified sampling, with an equal proportion of data for each class, so class imbalance is not an issue in this challenge.

Within each split, there are a large number of .png images for each class. These are organised into a directory structure, like so:

    - Train
        - aircraft
            [1000 PNG-formatted 32x32 RGB images]
        - automobile
            [1000 PNG-formatted 32x32 RGB images]
        - ship
            [1000 PNG-formatted 32x32 RGB images]

    - Test
        - aircraft
            [333 PNG-formatted 32x32 RGB images]
        - automobile
            [333 PNG-formatted 32x32 RGB images]
        - ship
            [333 PNG-formatted 32x32 RGB images]
            

To complete the task, you must first load and preprocess the images into suitable training and test datasets, then train, tune and evaluate a Deep Neural Network model. The aim is to produce a tuned model that generalises as well as possible to the test set.

As a rule of thumb, if you are achieving an accuracy of 75% or higher on the test set (having only trained on training data), your model is performing well.

You are not limited to a particular type of Deep Learning model, and may apply any architecture type you like. However, for the purpose of this challenge you should use the TensorFlow and Keras frameworks for model production.

---
**Final predictions on held-out dataset:**

Once your model is finalised and you have evaluated its performance on the test dataset, you must make final predictions on a held-out private test set. This private test set can be loaded using the function provided at the bottom of this notebook. Please note that no labels are provided for this data; you must produce them by making predictions with your model. The final prediction labels must be provided as a .csv file with one label per row for each image. **Hint:** It's highly recommended to format your predictions as a pandas Series and then save this as a .csv file using .to_csv().
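
The hint above can be sketched as follows. This is a minimal example under stated assumptions: the label values and the file name `predictions.csv` are placeholders, and dropping the index and header (to leave exactly one label per row) is a guess at the expected layout, so check the module guidance for the exact format.

```python
import pandas as pd

# Hypothetical predictions, already decoded to string labels
predicted_labels = ["aircraft", "ship", "automobile", "ship"]

# One label per row; omitting the index and header is an assumption
predictions = pd.Series(predicted_labels, name="label")
predictions.to_csv("predictions.csv", index=False, header=False)
```

In a real submission, `predicted_labels` would come from your model's decoded predictions on the hold-out images.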

---

## 2. Import Dependencies

### 2.1 Fixed Dependencies - Leave these as they are

In [1]:
# Fixed dependencies - do not remove or change.
import pytest
import pandas as pd
import os
import numpy as np

#from google.colab import drive
# drive.mount('/content/gdrive/')

### 2.2 Custom dependencies - add whatever you need into here

In [9]:
# import your dependencies here, e.g. tensorflow, keras, matplotlib...

---

## 3. Downloading and extracting the data

**Important note** - Use only one of the following methods for obtaining the dataset. You don't need to run all three.

### 3.1 Downloading and extracting using curl and unzip:

The following commands will download the dataset, as a .zip file, and then extract it into the local directory:

In [None]:
# give the following command sufficient time to download the 10 MB dataset before running the next cell
!curl -L "https://github.com/BenjaminFraser/JHubModule4C/blob/main/Module_4C_Dataset.zip?raw=true" > dataset.zip

In [None]:
!unzip dataset.zip

In [None]:
# remove zip file - no longer needed
!rm dataset.zip

### 3.2 Alternative method - downloading and extracting using wget (only if previous method did not work):

Alternatively, if the curl command above does not work on your OS, try wget instead, like so:

In [None]:
!wget "https://github.com/BenjaminFraser/JHubModule4C/blob/main/Module_4C_Dataset.zip?raw=true" -O dataset.zip

In [None]:
!unzip dataset.zip
!rm dataset.zip

### 3.3 Manual method (if both methods above fail) - download .zip file from GitHub and extract manually

Finally, if neither of these methods works on your OS, simply navigate to the following URL:

- https://github.com/BenjaminFraser/JHubModule4C/blob/main/Module_4C_Dataset.zip?raw=true

Download the zip file manually, and unzip it into the local directory.

You should now have the entire dataset within your local directory, named 'Data'.
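
As a quick sanity check after extraction, the file counts per split and class can be compared against the directory listing in Section 1. The following is a minimal sketch using only the standard library. The helper name `count_images` is hypothetical and not part of the template, and the `Data/train` and `Data/test` paths are the ones used in Section 4.

```python
from collections import Counter
from pathlib import Path


def count_images(split_dir):
    """Count .png files in each class sub-folder of a split directory.

    `count_images` is an illustrative helper, not part of the template.
    """
    counts = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            # one sub-folder per class; count only the PNG files inside it
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.png"))
    return counts


if __name__ == "__main__":
    for split in ("Data/train", "Data/test"):  # paths as used in Section 4
        if Path(split).is_dir():
            print(split, dict(count_images(split)))
        else:
            print(f"{split} not found - check that the archive was extracted")
```

If the counts differ from those listed in Section 1, the download was probably interrupted and is worth repeating.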

## 4. Loading and processing the image data and labels into a dataset

In [None]:
train_dir = "Data/train"
test_dir = "Data/test"

In [None]:
def load_and_process_data(data_dir, img_height=32, img_width=32, channels=3):
    """This function needs to import and preprocess the image data appropriately. 
       The techniques for preprocessing and handling the data are not set, and 
       you can apply your own methodology as you choose.
       
       It's recommended that you create a function that loads and processes
       the image data based on a chosen directory given, e.g. train data filepath, 
       which then returns the associated image data as np arrays, and labels as a 
       pandas df or series as appropriate. """
        
    return image_dataset, image_labels

In [None]:
# load training and test images into memory
train_images, train_labels = load_and_process_data(train_dir)
test_images, test_labels = load_and_process_data(test_dir)
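
One possible way to structure `load_and_process_data` is to separate file discovery from image decoding. The sketch below covers only the first half, using the standard library. The helper name `list_image_files` is hypothetical, and it assumes each image's label is the name of the sub-folder containing it. Decoding each path into a 32x32x3 array is left to whichever image library you choose.

```python
from pathlib import Path


def list_image_files(data_dir, extension=".png"):
    """Return parallel lists of image paths and class labels.

    Hypothetical helper: the label of each image is assumed to be the
    name of the sub-folder that contains it (e.g. 'aircraft').
    """
    paths, labels = [], []
    # sort so the ordering is reproducible between runs and platforms
    for class_dir in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.glob(f"*{extension}")):
            paths.append(image_path)
            labels.append(class_dir.name)
    return paths, labels
```

Each path can then be decoded and stacked into a single array, with `labels` converted to a pandas Series as the docstring above suggests.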

## 5. Preprocessing and preparation of the Image Data and Labels

In [3]:
class Module4C:
    def __init__(self):
        self.model = None
       
    def preprocess_image_data(self, image_data):
        """ This function should process your image data prior to training as 
            appropriate, along with storing any required features in the class """
        
        return preprocessed_images

    
    def preprocess_labels(self, img_labels):
        """ Preprocess image labels (if required), storing any features 
            required in the class """
            
        return final_labels

    
    def decode_predictions(self, predictions):
        """ Helper function for decoding prediction probabilities from DNN model, 
            returning as hard output labels """
            
        return decoded_preds

In [None]:
tester_mod4c = Module4C()

# preprocess training images and labels
X_train_full = tester_mod4c.preprocess_image_data(train_images)
y_train_full = tester_mod4c.preprocess_labels(train_labels)

# preprocess test images and labels
X_test = tester_mod4c.preprocess_image_data(test_images)
y_test = tester_mod4c.preprocess_labels(test_labels)
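
The preprocessing methods are deliberately left open, but two common steps are scaling pixel values and encoding string labels as integers. The following is a minimal NumPy sketch of both, not the required approach. The function names `scale_pixels` and `encode_labels` are hypothetical, and it assumes the images arrive as uint8 arrays with values in [0, 255].

```python
import numpy as np


def scale_pixels(images):
    """Convert uint8 pixel values in [0, 255] to float32 values in [0, 1]."""
    return np.asarray(images, dtype=np.float32) / 255.0


def encode_labels(labels):
    """Map string labels to integer class indices.

    Returns the sorted class names (needed later to decode predictions)
    and an integer array with one index per label.
    """
    class_names, encoded = np.unique(np.asarray(labels), return_inverse=True)
    return class_names, encoded


# Tiny illustrative inputs: two random 32x32 RGB images and their labels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(2, 32, 32, 3), dtype=np.uint8)
class_names, y = encode_labels(["ship", "aircraft"])
X = scale_pixels(images)
```

Keeping `class_names` on the `Module4C` instance is one way to make the same mapping available when decoding predictions later.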

## 6. Model Production, Tuning and Evaluation

In [10]:
## Perform chosen methodologies for model production, tuning and evaluation of a DNN model.

## This is the main section of the assignment and as such will likely involve the most code. 

## No helper functions or classes have been provided in this section, which is to encourage you
## to tackle the problem as you see fit. 

## Tips / Considerations: 
##     - Visualising model performance is very important for DNNs, and makes optimisation much easier
##     - Consider the best evaluation metrics to use for this classification problem: is accuracy best?
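
On the question of evaluation metrics, a confusion matrix shows which classes are being mistaken for one another, which overall accuracy alone hides. The sketch below computes one with NumPy, along with per-class recall. The labels are made-up integer encodings for illustration, and the function names are hypothetical. It assumes every class appears at least once in `y_true`.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for true, pred in zip(y_true, y_pred):
        matrix[true, pred] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return np.diag(matrix) / matrix.sum(axis=1)


# Hypothetical integer-encoded labels for three classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()  # correct predictions lie on the diagonal
recall = per_class_recall(cm)
```

Because the classes here are balanced, accuracy is a reasonable headline figure, but per-class recall can still reveal a model that does well on two classes and poorly on the third.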

## 7. Model predictions on hold-out test set

**Criteria:** You must ensure the resultant predictions on the hold-out test set are produced with the labels in string format, e.g. 'aircraft', 'ship' or 'automobile'. Do not submit the predictions in their encoded format, i.e. 0, 1 or 2. 
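
Converting from encoded outputs back to string labels can be sketched as below. This assumes the model outputs one row of class probabilities per image. The class order shown is an assumption, and must match whatever order was used when the training labels were encoded.

```python
import numpy as np

# Assumed class order; it must match the order used when encoding the labels
class_names = np.array(["aircraft", "automobile", "ship"])

# Hypothetical softmax outputs for three images (one row per image)
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# argmax picks the most probable class index for each row,
# and indexing into class_names converts it back to a string label
predicted_indices = probabilities.argmax(axis=1)
predicted_labels = class_names[predicted_indices]
```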

The helper function below will download and load the hold-out test images into a numpy array from a remote repository. You must use your existing trained model to make predictions on these images, and produce a set of .csv labels for the 550 images accordingly.

In [None]:
import requests  # needed by the helper below; not among the fixed dependencies in Section 2.1


def download_holdout_test_set():
    """ Download the hold-out test set (labels not provided), which you must
        use with your trained model to make predictions """
    
    # url to github repo - if this doesn't work, change to kaggle alt url in module guidance
    url = 'https://github.com/BenjaminFraser/JHubModule4C/blob/main/holdout_test_images.npy?raw=true'
    
    try:
        response = requests.get(url, allow_redirects=True)
        response.raise_for_status()  # treat HTTP error codes as failures
    except requests.exceptions.RequestException as e:
        # raise rather than return, so an error message is never mistaken for image data
        raise RuntimeError(f"An error occurred downloading the file: {e}") from e
    
    # write the downloaded bytes to disk, closing the file before loading it
    with open('holdout_test.npy', 'wb') as f:
        f.write(response.content)
    
    return np.load('holdout_test.npy')

In [None]:
# download the holdout test set and load as numpy array
hold_out_data = download_holdout_test_set()

In [11]:
# preprocess hold-out test data and make predictions etc..
# format labels as .csv, with labels in string form, e.g.: 'aircraft', 'ship', 'aircraft' ...