# Homework 4


In [None]:
%matplotlib inline
import os
import os.path
from os.path import exists
import time
import urllib.request

import h5py
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

device="cpu"
if torch.cuda.is_available():
    torch.backends.cudnn.deterministic = True
    device="cuda:0"
print(device)    

## Problem 1 (40 points): Revisiting our CNN from lecture

Let's revisit the CNN notebook that we worked through in class, and modify the code there to do something slightly different.

First, let's import our data:

In [None]:
output = 'ising_data.npz'

if not exists(output):
    import gdown
    url = "https://drive.google.com/u/0/uc?export=download&confirm=HgGH&id=1Ihxt1hb3Kyv0IrjHlsYb9x9QY7l7n2Sl"
    gdown.download(url, output, quiet=True)

f = np.load(output, allow_pickle=True)
n_train = 20000
n_test=len(f["C"])-n_train

x_train, x_test = f["C"][:n_train], f["C"][n_train:]
y_train, y_test = f["T"][:n_train], f["T"][n_train:]

x_train_tensor=torch.tensor(x_train,dtype=torch.float).unsqueeze(1)
y_train_tensor=torch.tensor(y_train,dtype=torch.float).unsqueeze(1)

x_test_tensor=torch.tensor(x_test,dtype=torch.float).unsqueeze(1)
y_test_tensor=torch.tensor(y_test,dtype=torch.float).unsqueeze(1)

xy_train = torch.utils.data.TensorDataset(x_train_tensor.float(),y_train_tensor)

torch.manual_seed(123)

In the CNN lecture notebook, we used a CNN to predict whether the image had crossed the phase transition boundary or not. What I'd like to do is a follow-up that I may have mentioned in lecture: let's use the CNN to predict the temperature that corresponds to the image, instead of making a binary "yes/no" decision. Some things to consider:

* What should the loss function be?

**Now your job**: set up and train a CNN that predicts the temperature associated with each image. Here are some requirements, which may be different from what was done in the CNN notebook in class:

* Have the first convolutional layer use 16 filters
* Have the second convolutional layer use 32 filters
* Have the fully-connected part at the end include two layers with 32 nodes
* Use the `Adam` optimizer with a learning rate of 0.01
* Run for 10 epochs
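
The requirements above could be assembled along the following lines. This is a minimal sketch, not the reference solution: the class name `TemperatureCNN`, the 3×3 kernels, the pooling layers, the `MSELoss` choice, and the 32×32 dummy input are all assumptions. Only the filter counts, the two 32-node layers, and the optimizer settings come from the list above. The last layer has a single output and no activation, since the target is a continuous temperature.

```python
import torch
from torch import nn

class TemperatureCNN(nn.Module):
    """Hypothetical regression CNN: 16 and 32 conv filters, two 32-node dense layers."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # kernel size is an assumption
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(32),   # infers the flattened size on the first forward pass
            nn.ReLU(),
            nn.Linear(32, 32),
            nn.ReLU(),
            nn.Linear(32, 1),    # single continuous output, no activation
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = TemperatureCNN()
dummy = torch.randn(4, 1, 32, 32)   # stand-in batch; the real image size may differ
out = model(dummy)                  # this first call also initializes the lazy layer
loss_fn = nn.MSELoss()              # a common choice for regression targets (assumption)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
print(out.shape)
```

The optimizer is created after the first forward pass so that the lazily-sized layer already has real parameters when `model.parameters()` is collected.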


As part of this problem, you should produce:
* The normal loss plot showing the training and test losses vs the number of epochs
* Once the network is trained, pass the test data through the network again, and get the predicted values.  Compare the predicted values with the true values; plot the difference and find the bias and variance of the network.

Some hints:
* You don't necessarily need to run the code for the full 16+32 convolutional filter network every time, since the training takes a little while! Feel free to debug with smaller numbers of filters.
* However, the setup is a bit delicate, so even small changes in the hyperparameters specified above may cause the training to not converge.

In [None]:
# your code defining the CNN class goes here

In [None]:
# create the net, define the loss function and optimizer

In [None]:
# train the net

In [None]:
# plot the losses

In [None]:
# pass the test data through the trained net and plot the predicted minus true temperatures

In [None]:
# find the bias and variance of the network from those differences

## Problem 2 (20 points): Principal Components Analysis

Let's perform a PCA test on some input data that we'll use for Problem #3.

First let's get some training and test datasets that we can use. We'll start with one of the files that we used in HW3:

In [None]:
def clean_sample(sample):
    # keep only events whose first variable is greater than 2
    events=[]
    for i in sample:
        if i[0]>2:
            events.append(i)
    return events

def get_sample_from_URL(samplename,branches):
    # download the file once, then reuse the local copy
    if not os.path.isfile(samplename):
        urllib.request.urlretrieve("http://mhance.scipp.ucsc.edu/%s" % samplename, samplename)

    # read the requested branches from the first group in the file
    with h5py.File(samplename,'r') as hdf5file:
        data=hdf5file[list(hdf5file.keys())[0]]["lowleveltree"]
        alldata = data[branches]

    # convert each event to a list of floats, then apply the cleaning cut
    Alldata=[[float(i) for i in j] for j in alldata]
    return clean_sample(Alldata)

In [None]:
# now scale based on the training data:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

branches=("numjet","numlepton","numbtagjet",
          "met","metphi",
          "jet1pT", "jet1eta", "jet1phi","jet1b",
          "jet2pT", "jet2eta", "jet2phi","jet2b",
          "jet3pT", "jet3eta", "jet3phi","jet3b")

traindata=get_sample_from_URL('lowlevelAna_ttbar.hf5',branches)
x_train = sc.fit_transform(traindata)
train=torch.tensor(x_train,dtype=torch.float)

testdata=get_sample_from_URL('lowlevelAna_test.hf5',branches)
x_test = sc.transform(testdata)
test=torch.tensor(x_test,dtype=torch.float)

Using the `scikit-learn` PCA tool, analyze the `ttbar` dataset (`x_train`) retrieved above.

* Find the number of components needed to explain 50% of the variance.
* List the magnitudes of those leading eigenvalues, each expressed as the fraction of the total variance it captures.
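
As a rough illustration of the bookkeeping involved, the sketch below runs `PCA` on a synthetic array (a stand-in for `x_train`, with made-up feature spreads) and finds the smallest number of components whose cumulative explained-variance fraction reaches 50%. The names `X` and `n_50` are placeholders chosen for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for x_train: 500 events, 6 features with unequal spreads
X = rng.normal(size=(500, 6)) * np.array([3.0, 2.5, 2.0, 1.0, 0.5, 0.1])

pca = PCA().fit(X)
fractions = pca.explained_variance_ratio_    # eigenvalues as fractions of the total variance
cumulative = np.cumsum(fractions)

# Index of the first entry with cumulative fraction >= 0.5, converted to a count
n_50 = int(np.searchsorted(cumulative, 0.5) + 1)
print(n_50, fractions[:n_50])
```

`explained_variance_ratio_` is sorted from largest to smallest, so its first `n_50` entries are exactly the leading fractions the second bullet asks for.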

In [None]:
# your code here

## Problem 3 (40 points): AutoEncoders for Anomaly Detection

We discussed the use of AutoEncoders for de-noising; let's also look at AutoEncoders in the context of anomaly detection. The goal will be to use an AutoEncoder to find "anomalies" in a data sample without knowing what those anomalies are. We'll do this by training the network on an unlabeled data sample (i.e. unsupervised learning) and then using the trained network to evaluate a test dataset that has anomalies included. The anomalies should show up as events with large loss values.

For this problem, we'll use the `ttbar` dataset from problem 2 for training, and the `testdata` dataset from problem 2 for testing.

**Your job**: Define our AutoEncoder, following the scheme we used in the AutoEncoder notebook in class, with the following changes:

* The number of input variables should be the number of variables needed to explain 50% of the variance in the `ttbar` sample (see solutions to Problem 2, let's call this number N_50).
* Use the `LeakyReLU` activation function, with args = `(0.2, inplace=True)`
* The encoder should connect four sets of nodes with three linear layers: (1) the input layer, with N nodes; (2) a hidden layer with 2\*N nodes; (3) a hidden layer with 10 nodes; (4) the latent-space layer, with N_50 nodes.
* Make sure to not define an activation on the last layer!
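
The layer scheme above might translate into something like the following sketch. Several choices here are assumptions rather than requirements: the decoder simply mirrors the encoder, an activation follows the latent layer, and the sizes `n_in=17` and `n_latent=5` are placeholders (the real values come from the data and from your Problem 2 result).

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, n_in, n_latent):
        super().__init__()
        # encoder: n_in -> 2*n_in -> 10 -> n_latent
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 2 * n_in),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(2 * n_in, 10),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(10, n_latent),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # decoder: mirror image of the encoder (an assumption)
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 10),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(10, 2 * n_in),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(2 * n_in, n_in),   # no activation on the last layer
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = AutoEncoder(n_in=17, n_latent=5)   # placeholder sizes
x = torch.randn(8, 17)                   # stand-in batch of 8 events
recon = net(x)
print(recon.shape)
```

Leaving the final layer without an activation lets the reconstruction take any real value, which matters because the standardized inputs are both positive and negative.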



In [None]:
# your code here

Now train the autoencoder on the training data. Run for 1500 epochs, using the `Adam` optimizer with a learning rate of 0.01. Afterwards:
* Make a plot of the loss function vs epochs for the training data
* Make a plot of the loss *values* for all events in the training dataset.  This means passing single events into the loss function, instead of entire tensors!
* Now put the net into evaluation mode, and pass the test dataset through the network.  Make the same plot of loss values for all events in the test dataset.  
* Make a single plot that includes both histograms: the training set, and the testing set.

The difference between the training data and the test data can be subtle unless you overlay them. Once you do, you should (hopefully) see that there are more events in the large-loss tail of the test distribution than in the training distribution!
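
One way to build such an overlay is sketched below, using synthetic loss values in place of real ones (the exponential distributions, the bin range, and the tail threshold of 3 are arbitrary choices for illustration). Sharing the bin edges and passing `density=True` gives both histograms unit area, so their shapes can be compared even though the sample sizes differ.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
train_losses = rng.exponential(0.5, size=5000)                    # synthetic stand-in
test_losses = np.concatenate([rng.exponential(0.5, size=1800),
                              rng.exponential(2.0, size=200)])    # adds a heavier tail

bins = np.linspace(0.0, 10.0, 51)   # shared binning for both samples
widths = np.diff(bins)
train_density, _ = np.histogram(train_losses, bins=bins, density=True)
test_density, _ = np.histogram(test_losses, bins=bins, density=True)

# Fraction of each normalized histogram that lies above a loss of 3
tail = bins[:-1] >= 3.0
train_tail = (train_density * widths)[tail].sum()
test_tail = (test_density * widths)[tail].sum()
print(f"tail fraction: train = {train_tail:.4f}, test = {test_tail:.4f}")

plt.hist(train_losses, bins=bins, density=True, histtype="step", label="training")
plt.hist(test_losses, bins=bins, density=True, histtype="step", label="test")
plt.yscale("log")                   # a log axis makes the tail visible
plt.xlabel("per-event loss")
plt.ylabel("normalized events")
plt.legend()
```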

In [None]:
# training loop

In [None]:
# plot the losses vs epoch

The helper function below (`getLosses`) retrieves the individual event-wise losses from a sample:

In [None]:
net.eval()
def getLosses(sample):
    # evaluate the loss one event at a time; no gradients are needed here
    losses=[]
    with torch.no_grad():
        for i in range(len(sample)):
            pred = net(sample[i])
            loss = loss_fn(pred, sample[i])
            losses.append(loss.item())
    return losses
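
If `loss_fn` is a mean-squared-error loss (an assumption, since the loss is not fixed above), the same per-event values can also be obtained in a single batched pass by asking for `reduction="none"` and averaging over the feature dimension. The tiny network and random sample below are stand-ins, used only to check that the batched result agrees with an event-by-event loop like the one in `getLosses`.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 2), nn.Linear(2, 4))   # stand-in for the trained autoencoder
sample = torch.randn(100, 4)                            # stand-in for `train` or `test`

net.eval()
with torch.no_grad():
    pred = net(sample)
    # reduction="none" keeps one squared error per element; averaging over
    # the features (dim=1) leaves one loss value per event
    batched = F.mse_loss(pred, sample, reduction="none").mean(dim=1)

    # event-by-event version, for comparison
    looped = torch.stack([F.mse_loss(net(s), s) for s in sample])

print(torch.allclose(batched, looped, atol=1e-6))
```

The batched form avoids a Python-level loop over events, which can matter for large samples.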

In [None]:
# get the training dataset losses and plot them

In [None]:
# get the test dataset losses and plot them

In [1]:
# make a single plot overlaying the training and test dataset losses, 
# with each normalized so they both have unit area.