# ASTR3110 Computer Laboratory 3: Classifying Images using Artificial and Convolutional Neural Networks.

In this Lab, you will be using imaging from the [CORNISH](https://cornish.leeds.ac.uk/public/index.php) survey to classify images using Neural Networks. The CORNISH survey aimed to understand massive star formation in our Galaxy by searching for ultra-compact HII regions (regions that have stars with mass > 8 times the mass of our Sun that are ionising the surrounding gas, which emit at radio wavelengths) in a portion of the plane of the disk of the Galaxy using the [Very Large Array Radio telescope](http://www.vla.nrao.edu). 

While this survey successfully discovered many new HII regions, other sources such as planetary nebulae (PNE) and background radio galaxies (RGs) also emit at radio wavelengths, and so were also detected in the survey data. By design, the CORNISH survey targeted a region that also contained observations in the mid-infrared taken during another survey with the Spitzer Space Telescope ([the GLIMPSE survey](https://irsa.ipac.caltech.edu/data/SPITZER/GLIMPSE/)). In particular, the 4.5, 5.8, and 8.0 $\mu$m Spitzer images allow us to distinguish HII regions, PNE, and RGs due to their different appearance (see top panel in the below image: leftmost is a PNE, middle a HII region, and right a RG).

![SegmentLocal](CORNISH_image.png)

In the first part of the lab, you will train and test an Artificial Neural Network classifier using 300 8.0 $\mu$m images (100 each for HII, PNE, and RG sources).

In the second part of the lab, you will train and test a Convolutional Neural Network classifier, using the same sample as in Part One, but adding the 4.5 and 5.8 $\mu$m images. 

In both parts, you will need to run tests to determine the performance of your classifier, and tweak hyperparameters in order to improve the performance. (N.B.: The term "hyperparameter" is reserved for those parameters that are set manually, e.g., the number of components in a GMM, or the learning rate of your neural network. Normal parameters are determined from the data, e.g., the intercept and slope of a straight line fit.)

## **If you are using Google Colab, you may wish to switch to using a GPU hardware accelerator, as this can improve the speed of the Neural Networks. To do so, go to the "Edit" dropdown, click "Notebook Settings", and select "GPU" for the hardware accelerator. This needs to be done before you start coding!**

At the completion of this lab, you will have acquired (or improved) the following skills:
- Reading in fits images using astropy.
- Manipulating 2D image arrays to prepare them for input into the ANN and CNN architectures (using the ndimage and numpy packages).
- Using the keras packages for setting up and running ANN and CNN classifiers.
- Using Scikit Learn and other packages to assess the performance of ANN and CNN classifiers.

You will need to turn in your completed notebook for marking (an upload link will appear on iLearn later). You will be marked on the following with equal weight:
- Comments: Your code must be commented. You can either do this by adding explanation text placed in blocks just before code blocks OR as comments within the code blocks themselves. Your comments should demonstrate that you understand what your code is doing, and why! You will be marked based on the clarity of your comments, and whether your comments indicate that you understand your code. If using markdown cells, please consider using colour to distinguish your comments from the existing instructions in the notebook.
- Plots should be well presented and explained: E.g., reasonable axes (e.g., ranges should be set so that trends are clearly visible), clear axis labels. You should also explain the "why" and "what" of your plots. Why are you plotting this? What does your plot show?
- Formatting of your code (easy to understand, sensible variable names etc.)
- Clear explanations and/or justifications of experiments to design a better classifier.
- Comments and explanation on a final best classifier and performance results.

# Part 1: Artificial Neural Networks
In this part, we will use the 8.0 $\mu$m images to design a classifier using the keras backend to build ANNs in a similar manner to that described in the [Week 9 lectorial](https://github.com/MQ-ASTR3110-2021/ASTR3110_Tutorial_Notebooks_2021/blob/master/Solution_Notebooks/ASTR3110_Tutorial_9_ANNs.ipynb). First, the data must be read in and manipulated into a format that is accepted by the keras models.

## Accessing the data.

You should clone the Github repository to your Google Drive as per the usual method [described here](https://github.com/MQ-ASTR3110-2021/ASTR3110_Tutorial_Notebooks_2021/blob/master/SETUP_COLAB.md). **Please clone into a new directory so that you do not overwrite existing Labs!!**

As outlined above, the data that will be used in this lab consist of Spitzer imaging in the 4.5, 5.8, and 8.0 $\mu$m bands. The images have been uploaded to the [Github repository](https://github.com/MQ-ASTR3110-2021/ASTR3110_Practical_Notebooks_2021/), and are stored in the main ```Cornish_data``` directory, which contains three subdirectories: one each for ```HII```, ```PNE```, and ```RG``` sources. Within each of these subdirectories, there are 300 "fits" files: 3 files for each source, where the filename gives the Galactic coordinates, and the 3 files are for the different Spitzer bands (I2 = 4.5 $\mu$m, I3 = 5.8 $\mu$m, and I4 = 8.0 $\mu$m). 

##  Getting to know the data

The fits format is commonly used for storing astronomical data, and can store binary tables, images, cubes, and other formats, as well as coordinate and other information for the image in a header (it should not be confused with the unrelated FIT format used by Garmin, Strava and other activity trackers). Fits files can be read into numpy arrays using the [astropy.io.fits](https://docs.astropy.org/en/stable/io/fits/#) package. 

Using the astropy.io.fits.getdata() function, read in a source from each of the HII, PNE, and RG folders. At this point, we only require access to the 8.0 $\mu$m band images (labelled \*_I4.fits). Using numpy functions, determine basic statistics for the images (min, max pixel values, and shape of the array). Plot the images for the three sources to convince yourself that they appear different.
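
As a starting point, reading one image and collecting its basic statistics might look like the minimal sketch below. The helper `image_stats` and the temporary file `demo_I4.fits` are hypothetical names introduced for illustration; the synthetic file only stands in for a real `*_I4.fits` image so that the cell runs without the data directory.

```python
import os
import tempfile

import numpy as np
from astropy.io import fits


def image_stats(image):
    """Return basic statistics for a 2D image array, ignoring NaN pixels."""
    return {
        "shape": image.shape,
        "min": float(np.nanmin(image)),
        "max": float(np.nanmax(image)),
    }


# Stand-in for a real file: write a small synthetic image to a temporary
# FITS file so that the example runs without the Cornish_data directory.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_I4.fits")
fits.writeto(demo_path, np.arange(12, dtype=np.float32).reshape(3, 4))

# getdata returns the image stored in the file as a numpy array.
demo_image = fits.getdata(demo_path)
demo_stats = image_stats(demo_image)
print(demo_stats)
```

For a real source you would replace `demo_path` with the path to one of the `*_I4.fits` files, and the returned array can be displayed with, e.g., `matplotlib.pyplot.imshow(demo_image, origin="lower")`.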

Because the images are relatively large, we want to resize them in order to decrease the runtime when we run our neural networks. Use the [scipy.ndimage.zoom](https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.zoom.html) function to bin the images to a coarser pixel scale. Plot the resulting coarsely binned image and compare with the original for each of the three sources. Try a few different 'zoom' factors and choose a suitable factor by which to reduce the size of your images. Note, you do not want to bin too much, otherwise you lose too much information. 
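
The sketch below shows how the zoom factor maps onto the output size, using a synthetic 200 x 200 array as a stand-in for a real cutout (the true image size and the factors tried here are assumptions, not values taken from the data). Note that `scipy.ndimage.zoom` resamples by spline interpolation (cubic by default) rather than by averaging blocks of pixels.

```python
import numpy as np
from scipy import ndimage

# Synthetic 200 x 200 image standing in for a Spitzer cutout.
rng = np.random.default_rng(seed=0)
full_image = rng.random((200, 200))

# A zoom factor below 1 shrinks the image; 0.25 turns 200 pixels into 50.
zoom_factors = [0.5, 0.25, 0.1]
resized = {factor: ndimage.zoom(full_image, factor) for factor in zoom_factors}

for factor, small_image in resized.items():
    print(factor, small_image.shape)
```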



## Preparing the data for the ANN

As you know from lectorial 9, as input to the ANN we require the images to be flattened to 1D arrays, and also normalised so that the pixel values are between 0 and 1. We also require labels for the sources (both single digit labels as well as the one-hot vectors of binarized labels), and a training and testing/validation sample. 

To achieve some of the above, I suggest that you write a function that:
- accepts a directory path + fits filename (e.g., Cornish_data/HII/G010.8519-00.4407_I4.fits),
- reads the fits file using astropy.io.fits.getdata,
- resizes the image,
- normalises the pixel values to be between 0 and 1,
- flattens the 2D images into 1D,
- returns the resized, normalised, flattened 1D array
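
One possible shape for such a function is sketched below. The names `prepare_image` and `load_flat_image`, the default zoom factor of 0.25, the min-max scaling, and the NaN handling are all assumptions made for illustration; your own choices may differ.

```python
import numpy as np
from scipy import ndimage


def prepare_image(image, zoom_factor=0.25):
    """Resize, normalise to [0, 1], and flatten a 2D image array."""
    # Replace any NaN pixels so they do not propagate (an assumption:
    # the real images may not contain NaNs at all).
    image = np.nan_to_num(np.asarray(image, dtype=float))
    small = ndimage.zoom(image, zoom_factor)
    # Min-max scaling; guard against a constant image to avoid dividing by 0.
    span = small.max() - small.min()
    if span == 0:
        normalised = np.zeros_like(small)
    else:
        normalised = (small - small.min()) / span
    return normalised.ravel()


def load_flat_image(path, zoom_factor=0.25):
    """Read a FITS image from `path` and return the prepared 1D array."""
    # Imported here so that prepare_image can be tried without astropy.
    from astropy.io import fits

    return prepare_image(fits.getdata(path), zoom_factor)


# Quick check on a synthetic 100 x 100 image.
demo_flat = prepare_image(np.random.default_rng(1).random((100, 100)))
print(demo_flat.shape, demo_flat.min(), demo_flat.max())
```

Splitting the array manipulation from the file reading makes it easy to test the resizing and normalisation on synthetic arrays before touching the real files.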

You can use the [glob](https://docs.python.org/3/library/glob.html) function to return a list of filenames that can be looped over and read in by your function. You will need to stack each flattened image into a larger array that contains all 300 sources. In addition, you will need the corresponding 1D vector of labels, as well as the binarized version.
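
A minimal sketch of the file listing and label bookkeeping is given below. The helper `list_files_with_labels`, the label ordering (HII = 0, PNE = 1, RG = 2), and the fake directory of empty files are assumptions made so that the cell runs anywhere; the one-hot step uses plain numpy, although keras and Scikit Learn also provide utilities for this.

```python
import glob
import os
import tempfile

import numpy as np

CLASS_NAMES = ["HII", "PNE", "RG"]  # integer labels 0, 1, 2 (an arbitrary choice)


def list_files_with_labels(data_dir, band="I4"):
    """Return sorted filenames for one band and the matching integer labels."""
    filenames, labels = [], []
    for label, class_name in enumerate(CLASS_NAMES):
        pattern = os.path.join(data_dir, class_name, f"*_{band}.fits")
        class_files = sorted(glob.glob(pattern))
        filenames.extend(class_files)
        labels.extend([label] * len(class_files))
    return filenames, np.array(labels)


# Build a tiny fake directory tree (empty files) so the example runs anywhere.
demo_dir = tempfile.mkdtemp()
for class_name in CLASS_NAMES:
    os.makedirs(os.path.join(demo_dir, class_name))
    for i in range(2):
        open(os.path.join(demo_dir, class_name, f"src{i}_I4.fits"), "w").close()

demo_files, demo_labels = list_files_with_labels(demo_dir)

# One-hot (binarized) labels: row i of the identity matrix encodes class i.
demo_one_hot = np.eye(len(CLASS_NAMES))[demo_labels]
print(demo_labels)
print(demo_one_hot)
```

With the real data, `data_dir` would point at `Cornish_data`, and a list of equal-length flattened images can be combined into a single 2D array with `np.vstack`.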

Finally, you will need to split your data into a training and testing/validation dataset (recall scikit learn's handy [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function).
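
The split might be done along the lines of the sketch below, which uses random arrays in place of the real images. The 75/25 split, the 2500-pixel image length, and the variable names are assumptions; `stratify` is optional but keeps the three classes balanced in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 300 flattened images of 2500 pixels, 100 per class.
rng = np.random.default_rng(seed=42)
images = rng.random((300, 2500))
labels = np.repeat([0, 1, 2], 100)
one_hot = np.eye(3)[labels]

# Passing several arrays splits them consistently; stratify keeps the
# three classes equally represented in both subsets.
(X_train, X_test,
 y_train, y_test,
 y_train_1hot, y_test_1hot) = train_test_split(
    images, labels, one_hot,
    test_size=0.25, stratify=labels, random_state=42,
)
print(X_train.shape, X_test.shape)
```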

## Running the ANN

With your data prepared, you can now start building your ANN. Start off by building a Sequential ANN with the same architecture, optimizer, and hyperparameters as that used in [Lectorial 9](https://github.com/MQ-ASTR3110-2021/ASTR3110_Tutorial_Notebooks_2021/blob/master/Solution_Notebooks/ASTR3110_Tutorial_9_ANNs.ipynb), but modify the inputs so that they suit the data used here. 
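
For orientation only, a generic Sequential ANN for flattened images is sketched below. The layer sizes, activation functions, SGD optimizer, learning rate, and the function name `build_ann` are assumptions and are not taken from Lectorial 9; you should substitute the architecture used there.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_ann(n_features, n_classes=3, learning_rate=0.01):
    """Build and compile a small fully connected classifier."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),        # one value per flattened pixel
        layers.Dense(64, activation="relu"),     # hidden layer (size is a guess)
        layers.Dense(32, activation="relu"),     # hidden layer (size is a guess)
        layers.Dense(n_classes, activation="softmax"),  # class probabilities
    ])
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
        loss="categorical_crossentropy",  # expects one-hot labels
        metrics=["accuracy"],
    )
    return model


ann = build_ann(n_features=2500)
ann.summary()
```

The key change relative to any other dataset is `n_features`, which must equal the length of your flattened images. Training then follows the usual pattern, `history = ann.fit(X_train, y_train_1hot, validation_data=(X_test, y_test_1hot), epochs=..., batch_size=...)`, where the array names are placeholders for your own.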

## Quantifying the performance

Once your model is trained, run a few predictions using the test data and compare with the known label. Then, produce a classification report using the test data. Using the saved history from your model fit, plot on separate graphs the evolution of the Training and Testing/Validation loss and the evolution of the Training and Testing/Validation accuracy. You can also use the plotting code from [lectorial in week 8](https://github.com/MQ-ASTR3110-2021/ASTR3110_Tutorial_Notebooks_2021/blob/master/Solution_Notebooks/ASTR3110_Tutorial_8_Random_Forest.ipynb) to produce a confusion matrix to help assess the classifier. Based on the outputs of the classification report and the plots, assess the performance of the classifier.
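
The classification report and confusion matrix can be produced with Scikit Learn, as in the small sketch below. The six labels are made up purely to show the call pattern, and the class ordering (HII = 0, PNE = 1, RG = 2) is an assumption.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

class_names = ["HII", "PNE", "RG"]

# Made-up labels for six test sources; with a trained Keras model the
# predictions would instead come from model.predict(X_test).argmax(axis=1).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

print(classification_report(y_true, y_pred, target_names=class_names))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Both functions expect single-digit labels, so one-hot vectors and softmax outputs need to be converted with `argmax` first.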

## Tweak the hyperparameters to improve performance

Try to improve your classifier by changing the following:

- The learning rate and number of epochs (smaller learning rate generally requires more training epochs and vice versa).
- The number of hidden layers (only try 1-2 fewer/more).
- The number of neurons in the layers (again, only try 1-2 different values).

For each tweak, run the classification reports, generate a confusion matrix, and produce plots of the history of the Loss and Accuracy and give a brief assessment of the performance. 
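
To keep track of the experiments, it can help to write the settings down explicitly before running anything. The sketch below enumerates a small grid; the specific learning rates and layer layouts are illustrative values only, and a full grid is just one option (changing one setting at a time is equally valid).

```python
from itertools import product

# Illustrative values only; they are not taken from the lectorial.
learning_rates = [0.1, 0.01, 0.001]
hidden_layouts = [(64,), (64, 32), (128, 64, 32)]  # neurons per hidden layer

experiments = [
    {"learning_rate": lr, "hidden_units": units}
    for lr, units in product(learning_rates, hidden_layouts)
]

for number, config in enumerate(experiments, start=1):
    print(number, config)
```

Each dictionary can then be used to build and train one model, with its report, confusion matrix, and history plots stored alongside the settings that produced them.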

Finally, give a summary report for the best classifier achieved, and outline which of the changes was the most effective.

# <font color='red'> Aim to get up to here by the end of Lab session 1 </font>

# Part 2: Convolutional Neural Networks

In this part, you will build a classifier for the same dataset, but now using all three Spitzer bands (I2 = 4.5 $\mu$m, I3 = 5.8 $\mu$m, I4 = 8 $\mu$m) as input to a CNN. 


## Preparing the data

As input, CNNs take a 4D array of images with shape (N_source, N_pix, N_pix, N_channel), where  N_source is the number of sources in your batch of data, N_pix is the number of pixels (can be different for width/height), and N_channel is the number of different colour images available per source (this could be RGB channels for standard images, but here it is the 3 Spitzer bands). As for the ANNs, we also require labels that have been binarized. You need to write a function similar to the one used to manipulate the data for the ANN, but modified to produce the desired input for the CNN:
- For each source, read and then resize the I2, I3, and I4 images. Stack to form an array with shape (N_pix, N_pix, N_channel).
- Normalise each image in the stack, but here we'd like to maintain the colour differences. So, determine the maximum across all 3 images and use that as your normalisation factor for all three images.

In a similar fashion to Part 1, you will need to loop over each source, read in the images, and save to a 4D array with shape (N_source, N_pix, N_pix, N_channel). Again, you will require a vector containing the labels for each source, as well as the binarized version. 
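
The per-source stacking with a shared normalisation factor, followed by the assembly of the 4D array, could be sketched as follows. The function name `make_colour_cube`, the zoom factor, the NaN handling, and the random input arrays are assumptions for illustration; dividing by the maximum alone follows the description above and does not shift any negative pixel values.

```python
import numpy as np
from scipy import ndimage


def make_colour_cube(i2, i3, i4, zoom_factor=0.25):
    """Resize three band images and stack them as (N_pix, N_pix, 3)."""
    bands = [ndimage.zoom(np.nan_to_num(np.asarray(band, dtype=float)), zoom_factor)
             for band in (i2, i3, i4)]
    cube = np.stack(bands, axis=-1)
    # One shared factor for all three bands preserves the relative colours.
    peak = cube.max()
    return cube / peak if peak > 0 else cube


# Synthetic example: 5 sources, each with three 100 x 100 band images.
rng = np.random.default_rng(seed=3)
cubes = [make_colour_cube(*rng.random((3, 100, 100))) for _ in range(5)]

# Stacking along a new first axis gives (N_source, N_pix, N_pix, N_channel).
cnn_images = np.stack(cubes, axis=0)
print(cnn_images.shape, cnn_images.max())
```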

Once you have the array containing the images and the labels, split your data into a training and test/validation set.

## Running the CNN
With your data prepared, you can now start building your CNN. Start off by building a Sequential CNN with the same architecture, optimizer, and hyperparameters as that used in [Lectorial 10](https://github.com/MQ-ASTR3110-2021/ASTR3110_Tutorial_Notebooks_2021/blob/master/Solution_Notebooks/ASTR3110_Tutorial_10_CNNs.ipynb), but modify the inputs so that they suit the data used here.
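
As a generic illustration, a small Sequential CNN for three-channel images is sketched below. The number of filters, kernel sizes, pooling, dense layer size, Adam optimizer, learning rate, and the name `build_cnn` are all assumptions and are not taken from Lectorial 10; replace them with the architecture used there.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(n_pix, n_channels=3, n_classes=3, learning_rate=0.001):
    """Build and compile a small convolutional classifier."""
    model = keras.Sequential([
        keras.Input(shape=(n_pix, n_pix, n_channels)),
        layers.Conv2D(16, kernel_size=3, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(32, kernel_size=3, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),                      # feature maps -> 1D vector
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",  # expects one-hot labels
        metrics=["accuracy"],
    )
    return model


cnn = build_cnn(n_pix=50)
cnn.summary()
```

The input shape must match the last three dimensions of your 4D image array, so `n_pix` changes whenever you change the zoom factor.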

## Evaluate the performance of the CNN

Once your model is trained, run a few predictions using the test data and compare with the known label. Then, produce a classification report using the test data. Using the saved history from your model fit, plot on separate graphs the evolution of the Training and Testing/Validation loss and the evolution of the Training and Testing/Validation accuracy. You can also use the plotting code from [lectorial in week 8](https://github.com/MQ-ASTR3110-2021/ASTR3110_Tutorial_Notebooks_2021/blob/master/Solution_Notebooks/ASTR3110_Tutorial_8_Random_Forest.ipynb) to produce a confusion matrix to help assess the classifier. Based on the outputs of the classification report and the plots, assess the performance of the classifier.
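
The loss and accuracy curves can be drawn from the dictionary stored in the `history` attribute of the object returned by `model.fit`. The sketch below assumes the keys `loss`, `val_loss`, `accuracy`, and `val_accuracy`, which is what recent Keras versions produce when the model is compiled with `metrics=["accuracy"]` and given validation data; the numbers are made up purely to exercise the hypothetical helper `plot_history`.

```python
import matplotlib.pyplot as plt


def plot_history(history_dict):
    """Plot loss and accuracy curves on two separate panels."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    epochs = range(1, len(history_dict["loss"]) + 1)

    ax_loss.plot(epochs, history_dict["loss"], label="Training")
    ax_loss.plot(epochs, history_dict["val_loss"], label="Validation")
    ax_loss.set_xlabel("Epoch")
    ax_loss.set_ylabel("Loss")
    ax_loss.legend()

    ax_acc.plot(epochs, history_dict["accuracy"], label="Training")
    ax_acc.plot(epochs, history_dict["val_accuracy"], label="Validation")
    ax_acc.set_xlabel("Epoch")
    ax_acc.set_ylabel("Accuracy")
    ax_acc.legend()

    fig.tight_layout()
    return fig


# Made-up numbers purely to exercise the function.
fake_history = {
    "loss": [1.10, 0.90, 0.70, 0.60],
    "val_loss": [1.12, 0.95, 0.85, 0.88],
    "accuracy": [0.35, 0.50, 0.65, 0.72],
    "val_accuracy": [0.33, 0.48, 0.55, 0.54],
}
fig = plot_history(fake_history)
```

A validation loss that flattens or rises while the training loss keeps falling, as in the fake numbers above, is the usual sign of overfitting.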


## Tweak the CNN to improve performance

As before, try to improve your classifier by changing the following:

- The learning rate and number of epochs (smaller learning rate generally requires more training epochs and vice versa).
- The number of hidden layers (only try 1-2 extra).
- The number of neurons in the layers (again, only try 1-2 different values).
- You may also try changing the input resolution of the images.

For each tweak, run the classification reports, generate a confusion matrix, and produce plots of the history of the Loss and Accuracy and give a brief assessment of the performance. 

Finally, give a summary report for the best classifier achieved, and outline which of the changes was most effective. How did your best CNN classifier compare with your best ANN classifier?

## Stretch Goal: Data Augmentation

It is possible to increase the size of our training set by manipulating the current data. For example, we can increase the size of our training set by rotating each image by 90, 180, and 270 degrees. Because our images of HII regions, PNE and RGs can have very similar shapes, but many different orientations on the sky, adding rotated images can help the CNN better learn our data. Aside from rotation, there are many other ways to augment the data, e.g., mirror-imaging, scaling the size, changing the perspective, and more.

Here, you can modify your training dataset using the ndimage.rotate function. In principle, you can add as many random orientations as you like, but I suggest that you start by rotating the three images for each source by 90, 180, and 270 degrees. You will need to also produce a new set of labels for your expanded dataset. Rerun your best CNN classifier on this new training set. Can you see an improvement in performance?
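
A minimal sketch of the rotation step is given below, assuming the training images are held in a 4D array of shape (N_source, N_pix, N_pix, N_channel) with square images. The function name `augment_with_rotations` and the random input data are hypothetical. For multiples of 90 degrees, `np.rot90` is an interpolation-free alternative to `ndimage.rotate`.

```python
import numpy as np
from scipy import ndimage


def augment_with_rotations(images, labels, angles=(90, 180, 270)):
    """Append rotated copies of every image and repeat the labels to match."""
    augmented = [images]
    for angle in angles:
        # axes=(1, 2) rotates in the image plane of a (N, H, W, C) array;
        # reshape=False keeps the original array shape.
        augmented.append(ndimage.rotate(images, angle, axes=(1, 2), reshape=False))
    all_images = np.concatenate(augmented, axis=0)
    all_labels = np.concatenate([labels] * (len(angles) + 1), axis=0)
    return all_images, all_labels


# Synthetic training set: 6 sources, 20 x 20 pixels, 3 bands.
rng = np.random.default_rng(seed=7)
train_images = rng.random((6, 20, 20, 3))
train_labels = np.eye(3)[np.array([0, 0, 1, 1, 2, 2])]

aug_images, aug_labels = augment_with_rotations(train_images, train_labels)
print(aug_images.shape, aug_labels.shape)
```

Only the training set should be augmented; the test/validation set is left untouched so that the performance comparison with the unaugmented classifier remains fair.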
