# ***Subset selection:***
This notebook demonstrates the subset-selection functions in the spear library. Subset selection picks a small subset of the unlabeled data (or of the data labeled by LFs, in the case of supervised subset selection) to be labeled by hand; the resulting small labeled set (the L dataset) is then used to train the <b>JL algorithm</b> effectively (the Cage algorithm doesn't need labeled data). Choosing the subset well makes the best use of the labeling effort.

In [1]:
import sys
sys.path.append('../../')
import numpy as np

### **Random subset selection**
Here we select a random subset of instances to label. Given the number of instances available and the number of instances we intend to label, rand_subset returns a sorted numpy array of indices.

In [2]:
from spear.JL import rand_subset

indices = rand_subset(n_all = 20, n_instances = 5) #select 5 instances from a total of 20 instances
print("indices selected by rand_subset: ", indices)
print("return type of rand_subset: ", type(indices))

indices selected by rand_subset:  [ 2  3  8 12 18]
return type of rand_subset:  <class 'numpy.ndarray'>
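
For intuition, a random subset with the same kind of output (a sorted array of distinct indices) can be sketched in plain NumPy. This is a minimal sketch and not spear's implementation; the helper name `random_subset_sketch` and its `seed` parameter are hypothetical.

```python
import numpy as np

def random_subset_sketch(n_all, n_instances, seed=None):
    """Hypothetical helper: pick n_instances distinct indices from range(n_all)."""
    if not 0 < n_instances <= n_all:
        raise ValueError("n_instances must be in the range [1, n_all]")
    rng = np.random.default_rng(seed)
    # Sample without replacement so no instance is picked twice, then sort.
    return np.sort(rng.choice(n_all, size=n_instances, replace=False))

demo_indices = random_subset_sketch(n_all=20, n_instances=5, seed=0)
print(demo_indices)
```

Fixing the seed makes the selection reproducible, which helps when the same subset has to be handed to annotators more than once.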


### **Unsupervised subset selection**
Here we select an unsupervised subset (for more on this, please refer [here](https://arxiv.org/abs/2008.09887)) of instances to label. Given the feature matrix (of shape (num_instances, num_features)) and the number of instances we intend to label, unsup_subset returns a sorted numpy array of indices. For any other arguments to unsup_subset (or to sup_subset_indices or sup_subset_save_files), please refer to the documentation.
<p>For this, let's first get some data (a feature matrix), say from sms_pickle_U.pkl (in the data_pipeline folder). For more on this pickle file, please refer to the other notebook, sms_cage_jl.ipynb.</p>

In [3]:
from spear.utils import get_data, get_classes

U_path_pkl = 'data_pipeline/JL/sms_pickle_U.pkl' #unlabelled data - don't have true labels
data_U = get_data(U_path_pkl, check_shapes=True)
x_U = data_U[0] #the feature matrix
print("x_U shape: ", x_U.shape)
print("x_U type: ", type(x_U))

x_U shape:  (4500, 1024)
x_U type:  <class 'numpy.ndarray'>


Now that we have the feature matrix, let's select the indices to label from it. The following function returns indices into the feature matrix. After labeling those instances through a trustworthy means, one can pass the labels as gold_labels to the PreLabels class when labeling the subset-selected data and forming a pickle file.

In [4]:
from spear.JL import unsup_subset

indices = unsup_subset(x_train = x_U, n_unsup = 20)
print("first 10 indices given by unsup_subset: ", indices[:10])
print("return type of unsup_subset: ", type(indices))

first 10 indices given by unsup_subset:  [  94  150  302  783 1255 1524 1560 2098 2270 2485]
return type of unsup_subset:  <class 'numpy.ndarray'>
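
One common way to choose a representative subset from features alone is to greedily maximize a facility-location objective, so that every instance ends up close to at least one selected instance. The sketch below illustrates that idea on toy data. It is an assumption that this resembles what unsup_subset does internally; the function name `greedy_facility_location` and the shifted cosine similarity are choices made for this illustration only.

```python
import numpy as np

def greedy_facility_location(x, k):
    """Hypothetical sketch: greedily pick k rows of x that best 'cover' all rows."""
    # Cosine similarity between all pairs of rows, shifted into [0, 1].
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x_unit = x / np.clip(norms, 1e-12, None)
    sim = (x_unit @ x_unit.T + 1.0) / 2.0
    n = sim.shape[0]
    best_cover = np.zeros(n)            # similarity of each row to its closest selected row
    taken = np.zeros(n, dtype=bool)     # rows that are already selected
    for _ in range(k):
        # Objective value if candidate j were added, minus the current value.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[taken] = -np.inf          # never pick the same row twice
        j = int(np.argmax(gains))
        taken[j] = True
        best_cover = np.maximum(best_cover, sim[:, j])
    return np.flatnonzero(taken)        # sorted indices of the selected rows

# Toy data: two tight clusters of three points each.
toy_x = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
                  [0.0, 1.0], [0.1, 0.9], [0.05, 1.0]])
toy_selected = greedy_facility_location(toy_x, k=2)
print(toy_selected)  # one index from each of the two clusters
```

The sketch builds a full pairwise similarity matrix, so it is only meant for small toy inputs.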


### **Supervised subset selection**
Here we select a supervised subset (for more on this, please refer [here](https://arxiv.org/abs/2008.09887)) of instances to label. We need:
* the path to the json file having information about classes
* the path to the pickle file generated from the feature matrix after labeling it using LFs
* the number of instances we intend to label

<p>We get a sorted numpy array of indices.</p>
<p>For this, let's use sms_json.json and sms_pickle_U.pkl (in the data_pipeline folder). For more on these json/pickle files, please refer to the other notebook, sms_cage_jl.ipynb.</p>

In [5]:
from spear.JL import sup_subset_indices

U_path_pkl = 'data_pipeline/JL/sms_pickle_U.pkl' #unlabelled data - don't have true labels
path_json = 'data_pipeline/JL/sms_json.json'
indices = sup_subset_indices(path_json = path_json, path_pkl = U_path_pkl, n_sup = 100, qc = 0.85)

print("first 10 indices given by sup_subset: ", indices[:10])
print("return type of sup_subset: ", type(indices))

first 10 indices given by sup_subset:  [ 26  38  59 151 185 247 255 315 336 407]
return type of sup_subset:  <class 'numpy.ndarray'>


Instead of just getting indices into the data already labeled using LFs (stored in pickle format), we also provide the following utility, which splits the input pickle file on the basis of subset selection and saves two pickle files. Make sure that path_save_L and path_save_U are <b>EMPTY</b> pickle files. The function still returns the subset-selected indices.

In [6]:
from spear.JL import sup_subset_save_files

U_path_pkl = 'data_pipeline/JL/sms_pickle_U.pkl' #unlabelled data - don't have true labels
path_json = 'data_pipeline/JL/sms_json.json'
path_save_L = 'data_pipeline/JL/sup_subset_L.pkl'
path_save_U = 'data_pipeline/JL/sup_subset_U.pkl'

indices = sup_subset_save_files(path_json = path_json, path_pkl = U_path_pkl, path_save_L = path_save_L, \
                             path_save_U = path_save_U, n_sup = 100, qc = 0.85)

print("first 10 indices given by sup_subset: ", indices[:10])
print("return type of sup_subset: ", type(indices))

first 10 indices given by sup_subset:  [ 26  38  59 151 185 247 255 315 336 407]
return type of sup_subset:  <class 'numpy.ndarray'>
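
Conceptually, splitting by subset selection partitions the rows of a per-instance array into the selected rows (to become L) and the remaining rows (to stay in U). A minimal NumPy sketch of that partition is shown below; the helper `split_by_indices` is hypothetical and works on a single in-memory array, not on spear's pickle files.

```python
import numpy as np

def split_by_indices(array, indices):
    """Hypothetical helper: split rows of `array` into (selected, remaining)."""
    mask = np.zeros(array.shape[0], dtype=bool)
    mask[indices] = True
    return array[mask], array[~mask]

toy_features = np.arange(12).reshape(6, 2)   # 6 instances, 2 features
toy_indices = np.array([1, 4])               # rows chosen for labeling
toy_L, toy_U = split_by_indices(toy_features, toy_indices)
print(toy_L.shape, toy_U.shape)  # (2, 2) (4, 2)
```

Using a boolean mask keeps both parts in their original row order, so the selected indices still line up with the rows of the L part.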


### **Inserting true labels into pickle files**
After supervised subset selection, we have two pickle files, path_save_L and path_save_U. Suppose you have labeled the instances in path_save_L and want to insert those labels into the pickle file. Instead of generating the pickle through PreLabels again, you can use the following function to create a new pickle file, containing the true labels, from the path_save_L pickle file. This function has no return value. Make sure that path_save, the path of the pickle file to be created from the data in path_save_L and the true labels, is <b>EMPTY</b>.

In [7]:
from spear.JL import insert_true_labels

path_save_L = 'data_pipeline/JL/sup_subset_L.pkl'
path_save_labeled = 'data_pipeline/JL/sup_subset_labeled_L.pkl'
labels = np.random.randint(0,2,[100, 1])
'''
Above is just a random assignment of labels used for demo. In practice, the user has to label the instances in
path_save_L through a trustworthy means and use those labels here.

Note that the shape of labels is (num_instances, 1) and just for reference, feature_matrix(the first element
in pickle file) in path_save_L has shape (num_instances, num_features).
'''
insert_true_labels(path = path_save_L, path_save = path_save_labeled, labels = labels)
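
Since the labels are expected in shape (num_instances, 1), hand-made annotations that arrive as a flat list need reshaping first. A minimal sketch of that step follows; the helper `to_label_column` is hypothetical and not part of spear.

```python
import numpy as np

def to_label_column(raw_labels, n_instances):
    """Hypothetical helper: turn a flat sequence of labels into shape (n_instances, 1)."""
    column = np.asarray(raw_labels).reshape(-1, 1)
    if column.shape[0] != n_instances:
        raise ValueError(f"expected {n_instances} labels, got {column.shape[0]}")
    return column

toy_labels = to_label_column([0, 1, 1, 0], n_instances=4)
print(toy_labels.shape)  # (4, 1)
```

Checking the count up front catches a mismatch between the annotations and the number of subset-selected instances before anything is written to disk.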

A function similar to insert_true_labels, called replace_in_pkl, is also available for making changes to a pickle file; its usage is demonstrated below. Note that replace_in_pkl doesn't edit the pickle file in place; instead, it creates a new pickle file. Make sure that path_save, the path of the pickle file to be created from the data in the path file and the new numpy array, is <b>EMPTY</b>. This function has no return value either.
<p>It is highly advised to use the insert_true_labels function for inserting labels, since it makes some other necessary changes.</p>

In [8]:
from spear.JL import replace_in_pkl

path_labeled = 'data_pipeline/JL/sup_subset_labeled_L.pkl' # aka path_save_labeled
path_save_altered = 'data_pipeline/JL/sup_subset_altered_L.pkl'
np_array = np.random.randint(0,2,[100, 1]) #we are just replacing the labels we inserted before
index = 3 
'''
index refers to the element we intend to replace. Refer to the documentation (specifically
spear.utils.data_editor.get_data) to understand which numpy array an index value
maps to (the contents of the pickle file are ordered from 0 to 8). Index should be in range [0,8].
'''

replace_in_pkl(path = path_labeled, path_save = path_save_altered, np_array = np_array, index = index)
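
The "create a new file instead of editing in place" behaviour can be illustrated with Python's standard pickle module. The sketch below assumes a hypothetical layout in which the file holds a single pickled list of arrays; spear's actual pickle layout may differ, and `replace_element_sketch` is not a spear function.

```python
import os
import pickle
import tempfile
import numpy as np

def replace_element_sketch(path, path_save, np_array, index):
    """Hypothetical sketch: copy a pickled list of arrays, swapping one element."""
    with open(path, "rb") as f:
        contents = pickle.load(f)          # assumed layout: one list of arrays
    if not 0 <= index < len(contents):
        raise IndexError("index out of range for the stored elements")
    new_contents = list(contents)
    new_contents[index] = np_array
    with open(path_save, "wb") as f:       # the source file is never opened for writing
        pickle.dump(new_contents, f)

# Build a small source file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "source.pkl")
dst = os.path.join(tmp_dir, "altered.pkl")
with open(src, "wb") as f:
    pickle.dump([np.zeros((3, 2)), np.zeros((3, 1))], f)

replace_element_sketch(src, dst, np.ones((3, 1)), index=1)
with open(dst, "rb") as f:
    altered = pickle.load(f)
with open(src, "rb") as f:
    original = pickle.load(f)
print(altered[1].ravel(), original[1].ravel())  # [1. 1. 1.] [0. 0. 0.]
```

Writing to a separate path leaves the source file untouched, so a wrong index or array can be corrected by simply running the replacement again.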

### **Demonstrating the use of labeled subset-selected data**
Now that we have our labeled subset in path_save_labeled, let's see a use case: calling a member function of the JL class with path_save_labeled as the path to our L data.

In [9]:
from spear.JL import JL

n_lfs = 16
n_features = 1024
n_hidden = 512
feature_model = 'nn'
path_json = 'data_pipeline/JL/sms_json.json'

jl = JL(path_json = path_json, n_lfs = n_lfs, n_features = n_features, n_hidden = n_hidden, \
        feature_model = feature_model)

L_path_pkl = path_save_labeled #Labeled data - have true labels
U_path_pkl = path_save_U #unlabelled data - don't have true labels
V_path_pkl = 'data_pipeline/JL/sms_pickle_V.pkl' #validation data - have true labels
T_path_pkl = 'data_pipeline/JL/sms_pickle_T.pkl' #test data - have true labels
log_path_jl_1 = 'log/JL/jl_log_1.txt'
loss_func_mask = [1,1,1,1,1,1,1] 
batch_size = 150
lr_fm = 0.0005
lr_gm = 0.01
use_accuracy_score = False

probs_fm, probs_gm = jl.fit_and_predict_proba(
    path_L = L_path_pkl, path_U = U_path_pkl, path_V = V_path_pkl, path_T = T_path_pkl,
    loss_func_mask = loss_func_mask, batch_size = batch_size, lr_fm = lr_fm, lr_gm = lr_gm,
    use_accuracy_score = use_accuracy_score, path_log = log_path_jl_1, return_gm = True,
    n_epochs = 100, start_len = 7, stop_len = 10, is_qt = True, is_qc = True,
    qt = 0.9, qc = 0.85, metric_avg = 'binary')

labels = np.argmax(probs_fm, 1)
print("probs_fm shape: ", probs_fm.shape)
print("probs_gm shape: ", probs_gm.shape)

early stopping at epoch: 21	best_epoch: 10
score used: f1_score
best_gm_val_score:0.6153846153846153	best_fm_val_score:0.6666666666666667
best_gm_test_score:0.5714285714285714	best_fm_test_score:0.6776859504132232
best_gm_test_precision:0.42718446601941745	best_fm_test_precision:0.5857142857142857
best_gm_test_recall:0.8627450980392157	best_fm_test_recall:0.803921568627451
probs_fm shape:  (4400, 2)
probs_gm shape:  (4400, 2)
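
As a quick sanity check on the log, the reported f1 scores can be recomputed from the reported precision and recall, since f1 is their harmonic mean. The sketch below also shows, on a toy array, how np.argmax turns class probabilities into hard labels. The helper `f1_from_precision_recall` and the toy values are illustrative and not part of spear.

```python
import numpy as np

def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Test precision and recall of both models, copied from the log above.
fm_f1 = f1_from_precision_recall(0.5857142857142857, 0.803921568627451)
gm_f1 = f1_from_precision_recall(0.42718446601941745, 0.8627450980392157)
print(round(fm_f1, 4), round(gm_f1, 4))  # 0.6777 0.5714

# Turning class probabilities into hard labels, on a toy array.
toy_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])
toy_pred = np.argmax(toy_probs, axis=1)
print(toy_pred)  # [0 1 1]
```

Both recomputed values agree with best_fm_test_score and best_gm_test_score in the log.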
