# Active Learning

In [9]:
import numpy as np
from scipy.sparse import linalg as spla
from scipy.io import loadmat
import graphlearning as gl
from matplotlib import pyplot as plt
import random

## Create the Indian Pines Dataset

In [10]:
def get_dataset():
    img = loadmat('Indian_pines_corrected.mat')['indian_pines_corrected']
    gt = loadmat('Indian_pines_gt.mat')['indian_pines_gt']
    
    # Save dataset
    X = img.reshape(-1, 200)
    gl.save_dataset(X, 'indian_pines', overwrite=True)

    # Save labels
    L = gt.reshape(-1)
    gl.save_labels(L, 'indian_pines', overwrite=True)

    # To add a dataset to the simulation environment, we also need
    # to save a label permutation, which is a number of random train/test splits
    # and store some precomputed knn-data

    # Create label permutation with 100 trials at 1,2,3,4,5 labels per class
    # You can add any identifying string as name='...' if you need to create additional
    # label permutations for a dataset.
#     gl.create_label_permutations(L, 100, [1,2,3,4,5], dataset='indian_pines', name=None, overwrite=True)

    # Run knn search and save info on the 30 nearest neighbors.
    # Choose as many as you are likely to use in practice; the code will automatically subset if needed.
    # With 200 features per pixel we use the Annoy approximate nearest neighbor package.
    I, J, D = gl.knnsearch_annoy(X, 30, dataset='indian_pines')
    # For lower dimensional data, the kd-tree search below is an alternative.
    # I, J, D = gl.knnsearch(X, 30, dataset='indian_pines')
    return I, J, D, img
    
I, J, D, img = get_dataset()

kNN search with Annoy approximate nearest neighbor package...
Progress: |██████████████████████████████████████████████████| 100.0% Complete


## High Level Overview of Active Learning from Kevin's presentation

<img src="high_level_overview.png"/>

## Training Loop

In [18]:
def loop(iters=100, added_label=25):
    # Initialize with 10 randomly chosen labeled points per class
    labels = gl.load_labels('indian_pines')
    train_ind = np.array(gl.randomize_labels(labels, 10))
    tau = .1
    # Get weight matrix and adjusted graph Laplacian
    W = gl.weight_matrix(I, J, D, 10)
    # L is not used by the random sampling below; it is kept for a future acquisition function
    L = gl.graph_laplacian(W, norm='none') + tau**2*gl.sparse.identity(W.shape[0])
    accuracy = []
    percentage = []
    pred_labels = []
    for _ in range(iters):
        # Run SSL on the current labeled set
        train_labels = labels[train_ind]
        pred_labels = gl.graph_ssl(W, train_ind, train_labels, algorithm='laplace')
        # Calculate accuracy over all pixels (labeled ones included)
        acc = np.sum(pred_labels == labels) / len(labels)
        accuracy.append(acc)
        # Record the labeled fraction that produced this accuracy
        percentage.append(len(train_ind) / len(labels))
        # Pick new points at random from the unlabeled pool.
        # random.sample draws without replacement, so no index is added twice.
        unlabeled = list(set(range(len(labels))) - set(train_ind.tolist()))
        new_ind = random.sample(unlabeled, k=added_label)
        train_ind = np.append(train_ind, new_ind)
    return accuracy, percentage, pred_labels

In [None]:
acc, perc, pred = loop()

plt.plot(acc, label='Accuracy')
plt.plot(perc, label='Percentage of Data Labeled')
plt.xlabel('Iteration')
plt.legend()
plt.show()

In [None]:
gt = loadmat('Indian_pines_gt.mat')['indian_pines_gt']
# One figure with two side-by-side panels
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.title('Ground Truth')
plt.imshow(gt)
plt.subplot(122)
plt.title('Predicted')
# Predictions are one label per pixel; reshape them back to the image grid
plt.imshow(pred.reshape(gt.shape))
plt.show()

## Explanation

The idea behind active learning is that you initially train a model with very few labeled examples, then identify the data points whose labels would have the largest effect on the model if they were known. An expert, or 'oracle', labels those points and you retrain the model. Repeating this process can achieve a high level of accuracy while actively labeling only a small percentage of the data.
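The query-and-retrain cycle described above can be sketched on toy data. This is only an illustration under stated assumptions: it uses a nearest-centroid classifier and smallest-margin uncertainty sampling as a stand-in acquisition rule, not the method from Kevin's presentation and not the `graphlearning` model used in this notebook. The function names (`fit_centroids`, `predict_scores`, `active_learning`) are hypothetical.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Nearest-centroid 'model': the mean of the labeled points in each class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_scores(X, centroids):
    """Negative distance from every point to every centroid (higher = closer)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return -d

def active_learning(X, oracle_labels, init_ind, n_rounds=10, n_classes=2):
    """Each round, query the oracle for the unlabeled point with the smallest margin."""
    labeled = list(init_ind)  # must contain at least one point per class
    for _ in range(n_rounds):
        centroids = fit_centroids(X[labeled], oracle_labels[labeled], n_classes)
        scores = predict_scores(X, centroids)
        top2 = np.sort(scores, axis=1)[:, -2:]
        margin = top2[:, 1] - top2[:, 0]        # small margin = uncertain point
        margin[labeled] = np.inf                # never re-query a labeled point
        labeled.append(int(np.argmin(margin)))  # "ask the oracle" for this label
    # Final retrain and prediction on all points
    centroids = fit_centroids(X[labeled], oracle_labels[labeled], n_classes)
    pred = np.argmax(predict_scores(X, centroids), axis=1)
    return labeled, pred

# Toy data: two Gaussian blobs, one labeled point from each to start
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.repeat([0, 1], 100)
labeled, pred = active_learning(X, y, init_ind=[0, 100])
accuracy = np.mean(pred == y)
print(f"labeled {len(labeled)} of {len(y)} points, accuracy {accuracy:.3f}")
```

Here the true labels `y` play the role of the oracle; in practice each query would go to a human expert.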

To try this method, we used an image of fields in Indian Pines, Indiana. The image has 200 channels, so each pixel has 200 features. Using those features we group the pixels, and from that grouping experts can tell what sort of crop is in each field.
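To make the "each pixel has 200 features" step concrete, here is a small sketch of how an image cube maps to a per-pixel feature matrix and how per-pixel predictions map back to an image. It uses random numbers with the same 145 x 145 x 200 shape instead of the real `.mat` file, and `fake_pred` is a hypothetical stand-in for model output.

```python
import numpy as np

# Stand-in for the real image: random values with the Indian Pines shape
height, width, channels = 145, 145, 200
cube = np.random.default_rng(0).random((height, width, channels))

# One row per pixel, one column per channel
X = cube.reshape(-1, channels)
print(X.shape)

# Row k of X is the pixel at (k // width, k % width)
r, c = 12, 34
k = r * width + c
assert np.array_equal(X[k], cube[r, c])

# Per-pixel predictions reshape back onto the image grid the same way
fake_pred = np.arange(height * width) % 17  # hypothetical labels, not real output
pred_img = fake_pred.reshape(height, width)
assert pred_img[r, c] == fake_pred[k]
```

This row-major correspondence is why the notebook can flatten the image before the nearest neighbor search and reshape the predictions afterwards for plotting.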

We used Jeff Calder's `graphlearning` package to quickly group the data with a graph Laplacian based semi-supervised model. For the update step we did something similar to what Kevin Miller did during his presentation. He demonstrated that the most 'helpful' pieces of data are usually grouped together, so if you only added the 25 most helpful data points at each iteration you would just be adding one clump of data. That would not be very helpful, because once the most helpful point has been labeled, the points grouped around it are no longer as helpful. Recomputing the most helpful data point after each newly labeled point is also not necessarily computationally sensible.

We had significant trouble calculating the 'most helpful points' using the equations from Kevin's presentation, so at each stage we instead add a random assortment of points from the data and use those points to improve our model.
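The clumping problem can be seen in a toy example. The sketch below assumes some per-point 'helpfulness' score is already available (here it is made up) and compares taking the top 5 scores with a simple greedy rule that skips points close to one already chosen. The function `select_spread_batch` and its `radius` parameter are hypothetical; this is one common workaround, not necessarily the approach from the presentation.

```python
import numpy as np

def select_spread_batch(scores, X, batch_size, radius):
    """Greedily take the highest-scoring point, then exclude every point
    within `radius` of it before choosing the next one."""
    scores = scores.astype(float).copy()
    chosen = []
    for _ in range(batch_size):
        best = int(np.argmax(scores))
        if not np.isfinite(scores[best]):
            break  # every remaining candidate has been excluded
        chosen.append(best)
        dist = np.linalg.norm(X - X[best], axis=1)
        scores[dist <= radius] = -np.inf  # also excludes `best` itself
    return chosen

# Toy data: a tight clump of high-scoring points plus scattered low-scoring ones
rng = np.random.default_rng(0)
clump = rng.normal(0.0, 0.05, (20, 2))
others = rng.uniform(-5, 5, (200, 2))
X = np.vstack([clump, others])
scores = np.concatenate([10.0 + rng.random(20), rng.random(200)])  # made-up scores

naive = np.argsort(scores)[-5:]  # plain top 5 by score
spread = select_spread_batch(scores, X, batch_size=5, radius=1.0)

print("top-5 picks inside the clump: ", int(np.sum(naive < 20)))
print("spread picks inside the clump:", sum(i < 20 for i in spread))
```

The greedy rule costs one distance computation per selected point, which is much cheaper than recomputing every score after each new label.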

There is initially a very impressive return on investment, but as with most investments the law of diminishing returns applies, and eventually the improvement from each additional iteration is quite small. We expect that choosing points with true active learning, rather than at random, would improve accuracy faster for the same number of labels.

The bulk of our troubles revolved around the parameter `A` used in Kevin's presentation. We weren't sure how to determine that parameter for the multiclass version of the problem. Calculating the Hessian also presented a challenge. We did, however, thoroughly digest his method, and we feel that with answers to the questions we sent him we would have quickly had much more success.