# Project Report

## Project Output

The output of this project (the Data-Preprocessing Module) consists of the following files:  
   - **Data-Preprocessing Module.ipynb**  
Documentation and main workflow code of this module, from data import through data processing to data augmentation.  
   - **functions.py**  
Four newly written functions that implement the functionality of this module.
   - **dependencies.py**  
Import statements for the module's dependencies.
   - **test_functions.py**  
Tests for the functions written in *functions.py*.

## Contents

To carry out the preprocessing task for a deep-learning model, this project implements three main functions. More details are given in the documentation in `Data-Preprocessing Module.ipynb`:

### Data import

In this section, imported images are resized and cropped to keep the most important part of each image. However, no single method suits all datasets and models, so users may need to change the resize and cropping ranges before using it.
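
The resize-then-crop idea can be sketched with NumPy alone. This is only an illustration: the helper names `resize_nearest` and `center_crop` and all sizes are hypothetical, nearest-neighbour sampling stands in for whatever interpolation the module actually uses, and a centre crop is assumed.

```python
import numpy as np


def resize_nearest(image, new_height, new_width):
    """Resize an H x W x C image with nearest-neighbour sampling."""
    height, width = image.shape[:2]
    # Map every output row/column back to a source row/column.
    rows = np.arange(new_height) * height // new_height
    cols = np.arange(new_width) * width // new_width
    return image[rows[:, None], cols[None, :]]


def center_crop(image, crop_height, crop_width):
    """Cut a crop_height x crop_width window out of the image centre."""
    height, width = image.shape[:2]
    if crop_height > height or crop_width > width:
        raise ValueError("crop size must not exceed the image size")
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return image[top:top + crop_height, left:left + crop_width]


# Hypothetical sizes: shrink a 480x640 image to 150x200, then crop to 128x128.
raw = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
resized = resize_nearest(raw, 150, 200)
cropped = center_crop(resized, 128, 128)
print(resized.shape, cropped.shape)  # (150, 200, 3) (128, 128, 3)
```

Resizing first keeps most of the scene in view, and the crop then trims the result to the exact input shape a model expects.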

### Data processing

This section uses several existing functions, such as **to_categorical()** and **train_test_split()**, for various processing steps. The subtraction of mean RGB values is implemented by myself, since the methods provided in existing packages differ considerably and it is hard to determine which one is consistent with my aim.
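
As a rough illustration of two of these steps, the sketch below one-hot encodes integer labels and subtracts per-channel means using NumPy only. The helper names are hypothetical, `one_hot` is a NumPy stand-in for `to_categorical()`, and the mean is assumed to be taken over the whole dataset per channel, which may differ from what `functions.py` does.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows (stand-in for to_categorical)."""
    return np.eye(num_classes, dtype=np.float32)[labels]


def subtract_channel_mean(images):
    """Subtract the dataset-wide mean of each colour channel from every pixel."""
    images = images.astype(np.float64)
    # Average over samples, rows and columns; keep one mean per channel.
    channel_mean = images.mean(axis=(0, 1, 2), keepdims=True)
    return images - channel_mean


rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 32, 32, 3), dtype=np.uint8)
labels = np.array([0, 2, 1, 2, 0, 1, 1, 0, 2, 2])

centered = subtract_channel_mean(images)
encoded = one_hot(labels, num_classes=3)
print(encoded.shape)  # (10, 3)
print(np.allclose(centered.mean(axis=(0, 1, 2)), 0.0))  # True
```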

### Data augmentation

This section uses the very useful **ImageDataGenerator()** class from `tensorflow.keras` to build my own function. Testing showed that only some of its arguments are suitable for cloud image data.
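
To show what flip and shift augmentation do to a single image, here is a small NumPy stand-in. It is not the module's implementation, which relies on `ImageDataGenerator()`; the function name and `max_shift` value are hypothetical, rotation and ZCA whitening are left out, and `np.roll` wraps pixels around the border, which is a simplification.

```python
import numpy as np


def random_flip_and_shift(image, rng, max_shift=4):
    """Apply a random horizontal/vertical flip and a small shift to one image."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip (mirror left-right)
    if rng.random() < 0.5:
        out = out[::-1, :]  # vertical flip (mirror top-bottom)
    # Shift by whole pixels along height and width; pixels pushed past one
    # edge re-enter on the opposite edge.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))


rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented = [random_flip_and_shift(image, rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Flips and shifts suit cloud images because a cloud looks equally plausible mirrored or slightly displaced, so the class label is unchanged.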

## Instructions 

This project aims to build a module that achieves the functionality of **cloud image** data import, processing and augmentation for the training and testing of **deep-learning models**.  

The data-preprocessing module for cloud images consists of three parts: data import, data processing and data augmentation, built on a mix of existing and new functions.  

 - For data import, the directory of the cloud data should be provided. An index for random shuffling can also be set. This part should be slightly modified in advance if a new folder has a different structure from the default.  

 - For data processing, two values between 0 and 1 should be provided; they are used to split the data into training, test and validation sets.  

 - For data augmentation, the augmentation methods to apply to the cloud images should be specified. The available methods are ZCA whitening, rotation, width/height shift and horizontal/vertical flip.
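
The two split fractions mentioned in the data processing item could be used along the following lines. This is a NumPy-only sketch rather than the module's code, which uses `train_test_split()`. The function name is hypothetical, and both fractions are assumed to be shares of the whole dataset (one for the test set, one for the validation set).

```python
import numpy as np


def split_three_ways(x, y, test_fraction, val_fraction, seed=0):
    """Shuffle once, then cut the data into train, validation and test parts."""
    if not (0 < test_fraction < 1 and 0 < val_fraction < 1):
        raise ValueError("both fractions must lie strictly between 0 and 1")
    if test_fraction + val_fraction >= 1:
        raise ValueError("fractions must leave room for a training set")
    n = len(x)
    order = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_fraction))
    n_val = int(round(n * val_fraction))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx]), (x[test_idx], y[test_idx])


x = np.arange(100).reshape(100, 1)
y = np.arange(100) % 4
(x_train, y_train), (x_val, y_val), (x_test, y_test) = split_three_ways(x, y, 0.2, 0.1)
print(len(x_train), len(x_val), len(x_test))  # 70 10 20
```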

## Dependencies

This module uses functions from the packages `sklearn` and `tensorflow` to build the functionality of data processing and augmentation, respectively. The `numpy` package is imported for array calculations and `matplotlib` for plotting some image examples.

```python
with open("dependencies.py") as f:
    print(f.read())
```

```python
"""Import dependencies for the notebook"""

import os

import numpy as np

import matplotlib
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

import cv2

from tqdm import tqdm

from tensorflow import keras
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator
```


## Testing

Four test functions are defined in *test_functions.py*, one for each function written in `functions.py`.  

- **get_images** is tested by checking the type of the returned cloud-class array and that the numbers of imported labels and images are equal.  
- **display_random_image** is tested by checking the type of the returned figure variable.  
- **subtract_meanRGB** is tested by checking the mean R value of the returned array, which should be 0.  
- **Image_Generator** is tested by checking the type of the returned data generator variable.
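
A check like the one described for **subtract_meanRGB** might look as follows. The stub below is a hypothetical stand-in, not the function from `functions.py`, and channel 0 is assumed to be R. The comparison uses a tolerance because floating-point rounding rarely produces a mean of exactly 0.

```python
import numpy as np


def subtract_mean_rgb_stub(images):
    """Hypothetical stand-in for subtract_meanRGB: centre each colour channel."""
    images = images.astype(np.float64)
    return images - images.mean(axis=(0, 1, 2), keepdims=True)


def test_mean_red_is_zero():
    rng = np.random.default_rng(1)
    images = rng.integers(0, 256, size=(6, 16, 16, 3), dtype=np.uint8)
    centered = subtract_mean_rgb_stub(images)
    # Channel 0 is taken to be R (assumption: RGB channel order).
    assert np.isclose(centered[..., 0].mean(), 0.0, atol=1e-9)
    # Mean subtraction must not change the array shape.
    assert centered.shape == images.shape


test_mean_red_is_zero()
print("ok")
```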

The test results are shown in the final section of [`Data-Preprocessing Module.ipynb`](https://github.com/JiarunZhou/EMSC-4033-2022_Project_Jiarun/blob/f72bfc51e4ae72507af11fa183835b6e69a53bce/Data-Preprocessing%20Module.ipynb).

## Future work

This project focuses on preprocessing cloud images for deep-learning models. Its output, including the data-preprocessing module and functions, can easily be applied to other kinds of objects in the future, since in most cases similar import, processing and augmentation steps must be completed before the data are fed into deep-learning models. To complete this work, we need to determine the best cropping range and image shape, and add new data augmentation methods. However, these choices are not easy to make because datasets and models can differ greatly. This problem limits the range of applications of this module.

In this project, test functions are defined to check the return values of the four newly written functions. They confirm that the functions run without errors and return the expected variables. However, the data generators that are finally returned should be applied to a real training task for a deep-learning model in order to examine the actual performance of this data-preprocessing module, which the test functions cannot do. This is important future work for improving the module and finding better data-processing methods. For example:

```python
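# Fit a previously saved model on the generators produced by this module.
# TrainGen, ValGen, x_train and x_val come from the preprocessing workflow.
from tensorflow.keras.models import load_model

model_1 = load_model("model_CloudNet_KAxis")
# The divisor 8 is assumed to match the batch size of the generators.
hist_1 = model_1.fit(TrainGen,
                     steps_per_epoch=x_train.shape[0] // 8,
                     validation_data=ValGen,
                     validation_steps=x_val.shape[0] // 8,
                     epochs=200)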
```

## Records

### Issues

 - When writing the background part of my project planner, I attempted to cite literature from a BibTeX file in the markdown file. However, I did not manage to work out how to do this, so I had to enter the reference information manually.  
 - In the data-import function, I wanted the function to read the folder names as cloud classes rather than rely on manual input. This requires more loops in the importing process and caused some issues at the beginning. I finally solved this problem using three nested loops.  
 - The input images should be resized to a defined resolution. If I only resize the images, needless black borders remain; if I only crop them, the information in the resulting images is too limited. Therefore, I chose to first resize the raw images to an appropriate resolution and then crop them to the required size.  
 - I originally wanted to plot 25 image samples. However, I noticed that a dataset might contain fewer than 25 images, so I added some exception handling to solve this issue.  
 - Many data augmentation methods are available, so it was difficult to determine which operations are suitable for cloud images. Finally, I chose four operations: ZCA whitening, rotation, flip and shift.
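
The folder-name approach from the data-import item can be sketched with the standard library only. The layout `root/<class name>/<image file>`, the function name and the class names used here are assumptions for illustration. Two loops suffice for this flat layout, whereas a deeper folder tree, such as the one that needed three loops in this project, would add one more level. The last line shows one simple way to cap the number of samples to plot, as an alternative to exception handling.

```python
import os
import tempfile


def list_images_by_class(root):
    """Return (paths, labels, class_names), using sub-folder names as classes."""
    class_names = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    paths, labels = [], []
    for class_index, class_name in enumerate(class_names):
        class_dir = os.path.join(root, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            if file_name.lower().endswith((".jpg", ".jpeg", ".png")):
                paths.append(os.path.join(class_dir, file_name))
                labels.append(class_index)
    return paths, labels, class_names


# Build a tiny throw-away folder tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    for class_name, count in [("cirrus", 2), ("cumulus", 3)]:
        os.makedirs(os.path.join(root, class_name))
        for i in range(count):
            open(os.path.join(root, class_name, f"img_{i}.jpg"), "w").close()
    paths, labels, class_names = list_images_by_class(root)

print(class_names, labels)  # ['cirrus', 'cumulus'] [0, 0, 1, 1, 1]

# Never try to plot more samples than the dataset holds.
n_show = min(25, len(paths))
```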

### Progress record

09/05
 - Completed `ProjectPlanner.md`.
 
11/05
 - Completed `Data-Preprocessing Module.ipynb` and determined the functions that should be written by myself.
 
13/05
 - Completed `functions.py` and `test_functions.py`.
 
18/05
 - Added examples into docstrings in `functions.py` and restructured the repo.
 
19/05
 - Added exception handling to `functions.py`.
 
20/05
 - Completed `README.md` and modified `ProjectReport.ipynb`.
 
22/05
 - Deleted some cache files on GitHub and released version 1.0.
 
25/05
 - Modified `ProjectReport.ipynb` and released version 1.1.