## Creating a Simple Autoencoder

By: V. Ashley Villar (PSU)

In this problem set, we will use PyTorch to learn a latent space for the same galaxy image dataset we have previously worked with.

In [None]:
!pip install astronn
import torch
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from astroNN.datasets import load_galaxy10
from astroNN.datasets.galaxy10 import galaxy10cls_lookup
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Problem 1a: Understanding our dataset...again

Our data is a little too big for us to train an autoencoder in ~1 minute. Let's lower the resolution of our images and only keep one filter. Plot an example of the lower resolution galaxies.

Next, flatten each image into a 1D array. Then rescale the flux of the images such that the mean is 0 and the standard deviation is 1. 

In [None]:
# Read in the data
images, labels = load_galaxy10()
labels = labels.astype(np.float32)
images = images.astype(np.float32)
images = torch.tensor(images)
labels = torch.tensor(labels)
# Cut down the resolution of the images!!! What is this line doing in words?
images = images[:,::6,::6,1]

#Plot an example image here

#Flatten images here

#Normalize the flux of the images here
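
As a rough illustration of the three steps above (downsample, flatten, rescale), here is a minimal NumPy sketch on synthetic data. The array `fake_images` and its shape are made up for the example, and standardizing with one global mean and standard deviation is an assumption: the problem does not say whether to rescale globally, per image, or per pixel. The same slicing syntax applies to the torch tensors used in the cell above.

```python
import numpy as np

# Synthetic stand-in for the galaxy images: (n_images, height, width, n_filters).
rng = np.random.default_rng(0)
fake_images = rng.uniform(0, 255, size=(20, 60, 60, 3)).astype(np.float32)

# Keep every 6th pixel along both image axes and only the filter at index 1.
small = fake_images[:, ::6, ::6, 1]           # shape (20, 10, 10)

# Flatten each 2D image into a 1D vector of pixels.
flat = small.reshape(small.shape[0], -1)      # shape (20, 100)

# Assumption: standardize with the global mean and standard deviation.
scaled = (flat - flat.mean()) / flat.std()
```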


# Problem 1b.
Split the data into training and test sets with a 66/33 split.
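
One possible way to do this is with `train_test_split`, which is already imported at the top of the notebook. The sketch below runs on placeholder arrays `X` and `y`, and the `random_state` value is arbitrary. In the notebook itself, later cells expect the results to be named `images_train`, `images_test`, and `labels_test`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 "images" with 3 pixels each, and 10 classes.
X = np.arange(300, dtype=np.float32).reshape(100, 3)
y = np.arange(100) % 10

# test_size=0.33 holds out roughly one third of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
```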

# Problem 2: Understanding the Autoencoder

Below is a sample autoencoder, built in PyTorch. Describe the code line-by-line with a partner. Add another hidden layer before and after the encoded (latent) layer (two new layers in total). Choose the appropriate activation function for this regression problem. Make all of the activation functions the same.

In [None]:
class Autoencoder(torch.nn.Module):
      # this defines the model
        def __init__(self, input_size, hidden_size, hidden_inner, encoded_size):
            super(Autoencoder, self).__init__()
            print(input_size,hidden_size,encoded_size)
            self.input_size = input_size
            self.hidden_size  = hidden_size
            self.encoded_size = encoded_size
            self.hidden_inner = hidden_inner
            self.hiddenlayer1 = torch.nn.Linear(self.input_size, self.hidden_size)
            # ADD A LAYER HERE
            self.encodedlayer = torch.nn.Linear(self.hidden_inner, self.encoded_size)
            self.hiddenlayer3 = torch.nn.Linear(self.encoded_size, self.hidden_inner)
            # ADD A LAYER HERE
            self.outputlayer = torch.nn.Linear(self.hidden_size, self.input_size)
            # some nonlinear options
            self.sigmoid = torch.nn.Sigmoid()
            self.softmax = torch.nn.Softmax(dim=-1)
            self.relu = torch.nn.ReLU()
        def forward(self, x):
            layer1 = self.hiddenlayer1(x)
            activation1 = self.ACTIVATION?(layer1)
            layer2 = self.hiddenlayer2(activation1)
            activation2 = self.ACTIVATION?(layer2)
            layer3 = self.encodedlayer(activation2)
            activation3 = self.ACTIVATION?(layer3)
            layer4 = self.hiddenlayer3(activation3)
            activation4 = self.ACTIVATION?(layer4)
            layer5 = self.hiddenlayer4(activation4)
            activation5 = self.ACTIVATION?(layer5)
            layer6 = self.outputlayer(activation5)
            output = self.ACTIVATION?(layer6)

            # Why do I have two outputs?
            return output, layer3

# Problem 3. Training

This is going to be a lot of guess-and-check. You've been warned. In this block, we will train the autoencoder. Add a plotting function into the training.

Note that instead of cross-entropy, we use the mean-squared-error (MSE) loss. Switch between the SGD and Adam optimizers. Which seems to work better? Optimize the learning rate (`lr`) and do *not* change other parameters, like momentum.

Write a piece of code to run `train_model` for 10 epochs (the loop below is hard-coded to 500, so you will need to change it). Play with the size of each hidden layer and of the encoded layer. When you feel you've found a reasonable learning rate, increase the number of epochs to 100 (or even 500 if you're patient). Hint: You want to reach an MSE of about 0.25.

In [None]:
# train the model
def train_model(training_data,test_data, model):
  # define the optimization
  criterion = torch.nn.MSELoss()

  # Choose between these two optimizers
  #optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
  #optimizer = torch.optim.Adam(model.parameters(), lr=0.1,weight_decay=1e-6)

  for epoch in range(500):
    # clear the gradient
    optimizer.zero_grad()
    # compute the model output
    myoutput, encodings_train = model(training_data)
    # calculate loss
    loss = criterion(myoutput, training_data)
    # backpropagate to compute the gradients (credit assignment)
    loss.backward()
    # update model weights
    optimizer.step()
    # Add a plot of the loss vs epoch for the test and training sets here

#Do your training here!!
hidden_size_1 = 100
hidden_size_2 = 50
encoded_size = 10 
model = Autoencoder(np.shape(images_train[0])[0],hidden_size_1,hidden_size_2,encoded_size)
train_model(images_train, images_test, model)
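
To make the "loss vs epoch" bookkeeping concrete, here is a minimal sketch on random data. `TinyAutoencoder` and `train_with_history` are hypothetical names invented for this example. The toy model has no hidden layers or activations, so it is not a solution to Problem 2; it only mimics the two-output `forward` used above. The learning rate and epoch count are arbitrary choices.

```python
import torch

torch.manual_seed(0)

class TinyAutoencoder(torch.nn.Module):
    """Hypothetical linear autoencoder, only for illustration."""
    def __init__(self, input_size, encoded_size):
        super().__init__()
        self.encoder = torch.nn.Linear(input_size, encoded_size)
        self.decoder = torch.nn.Linear(encoded_size, input_size)

    def forward(self, x):
        encoded = self.encoder(x)
        # Return the reconstruction and the latent encoding.
        return self.decoder(encoded), encoded

def train_with_history(train, test, model, lr=1e-2, n_epochs=50):
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    train_losses, test_losses = [], []
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        output, _ = model(train)
        loss = criterion(output, train)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
        # Evaluate on the held-out set without tracking gradients.
        with torch.no_grad():
            test_output, _ = model(test)
            test_losses.append(criterion(test_output, test).item())
    return train_losses, test_losses

toy_train = torch.randn(64, 16)
toy_test = torch.randn(32, 16)
toy_model = TinyAutoencoder(input_size=16, encoded_size=4)
train_losses, test_losses = train_with_history(toy_train, toy_test, toy_model)
```

The two returned lists can then be passed to `plt.plot` against the epoch index to compare training and test loss.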

# Problem 4a. Understand our Results

Plot an image (remember you will need to reshape it to a 14x14 grid) with `imshow`, and plot the autoencoder output for the same galaxy. Try plotting the difference between the two. What does your algorithm reconstruct well? Are there certain features which it fails to reproduce?

In [None]:
#Make an image of the original image

#Make an image of its reconstruction

#Make an image of (original - reconstruction)
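
A sketch of one way to lay out the three panels, using made-up data in place of a real galaxy and its reconstruction. `plot_reconstruction` is a hypothetical helper, and `side` is the edge length of the square image. With real model output you would likely need to call `.detach()` on the tensor before converting it to NumPy, since it carries gradient information.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_reconstruction(original, reconstruction, side):
    """Show a flattened image, its reconstruction, and their difference."""
    panels = [original, reconstruction, original - reconstruction]
    titles = ["Original", "Reconstruction", "Original - reconstruction"]
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, panel, title in zip(axes, panels, titles):
        # Reshape the 1D pixel vector back into a square image.
        ax.imshow(np.asarray(panel).reshape(side, side), cmap="gray")
        ax.set_title(title)
        ax.axis("off")
    return fig

# Fake "galaxy" and a noisy copy standing in for the autoencoder output.
rng = np.random.default_rng(1)
fake_original = rng.normal(size=14 * 14)
fake_reconstruction = fake_original + 0.1 * rng.normal(size=14 * 14)
fig = plot_reconstruction(fake_original, fake_reconstruction, side=14)
```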

# Problem 4b. 

Make a scatter plot of two of the 10 latent space dimensions. Do you notice any interesting correlations between different subsets of the latent space? Any interesting clustering?

Try color-coding each point by its galaxy label using the `c` argument of `plt.scatter`.

In [None]:
#Scatter plot between two dimensions of the latent space
#Try coloring the points
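
For reference, a minimal sketch of a color-coded scatter plot. `fake_encodings` and `fake_labels` are random placeholders for the latent vectors and galaxy labels, and the choice of dimensions 0 and 1 and of the `tab10` colormap is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random placeholders: 200 points in a 10D latent space, 10 classes.
rng = np.random.default_rng(2)
fake_encodings = rng.normal(size=(200, 10))
fake_labels = rng.integers(0, 10, size=200)

fig, ax = plt.subplots()
# `c` sets the value used to color each point; `cmap` maps values to colors.
points = ax.scatter(fake_encodings[:, 0], fake_encodings[:, 1],
                    c=fake_labels, cmap="tab10", s=10)
fig.colorbar(points, ax=ax, label="Galaxy class")
ax.set_xlabel("Latent dimension 0")
ax.set_ylabel("Latent dimension 1")
```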

# Bonus Problem 5a: Playing with the Latent Space

Create a random forest classifier to classify each galaxy using only your latent space.

In [None]:

clf = RandomForestClassifier(...)

clf.fit(...)
new_labels = clf.predict(...)

cm = confusion_matrix(labels_test,new_labels,normalize='true')
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
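
The sketch below shows the fit/predict pattern on a synthetic "latent space" made of three well-separated clusters. All names and hyperparameters here (`fake_encodings`, `n_estimators=100`, the random seeds) are illustrative assumptions, not values recommended by the problem set. With real encodings, remember to detach them from the autoencoder before passing them to scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
fake_labels = np.repeat(np.arange(3), 100)
# Three well-separated clusters in a 10D stand-in for the latent space.
fake_encodings = rng.normal(size=(300, 10)) + 5.0 * fake_labels[:, None]

enc_train, enc_test, lab_train, lab_test = train_test_split(
    fake_encodings, fake_labels, test_size=0.33, random_state=0
)

# Fit on the training encodings, then predict labels for the test encodings.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(enc_train, lab_train)
accuracy = accuracy_score(lab_test, forest.predict(enc_test))
```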

# Bonus Problem 5b: Playing with the Latent Space

Create an isolation forest to find the most anomalous galaxies. Make a cumulative distribution plot showing the anomaly scores of each class of galaxies. Which ones are the most anomalous? Why do you think that is?

In [None]:
clf = IsolationForest(...).fit(encodings)
scores = -clf.score_samples(encodings) #I am taking the negative because the lowest score is actually the weirdest, which I don't like...

#Plot an image of the weirdest galaxy!

#This plots the cumulative distribution
def cdf(x, label='',plot=True, *args, **kwargs):
    x, y = sorted(x), np.arange(len(x)) / len(x)
    return plt.plot(x, y, *args, **kwargs, label=label) if plot else (x, y)

ulabels = np.unique(labels)
for ulabel in ulabels:
  gind = np.where(labels==ulabel)
  cdf(...)
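
To see how the sign-flipped score behaves, here is a small sketch with one planted outlier in otherwise Gaussian fake encodings. The data, `n_estimators=200`, and the seeds are assumptions made for the example. After negating `score_samples`, the largest score marks the most anomalous point, so `np.argmax` picks out the "weirdest" row.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
fake_encodings = rng.normal(size=(200, 10))
fake_encodings[-1] = 8.0   # plant one obvious outlier as the last row

iso = IsolationForest(n_estimators=200, random_state=0).fit(fake_encodings)
# Negate so that larger scores mean more anomalous.
scores = -iso.score_samples(fake_encodings)
weirdest = int(np.argmax(scores))
```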
