# Asterisk

We propose Asterisk, a framework for generating high-quality training datasets at scale. Instead of relying on end users to write heuristics by hand, the approach uses a small set of labeled data to automatically produce a set of heuristics that assign initial labels. In this phase, the system iteratively creates, tests, and ranks heuristics, and admits only high-quality heuristics in each iteration. Asterisk then examines the disagreements between these heuristics to model their accuracies. To improve the quality of the generated labels, the framework improves the accuracies of the heuristics by applying a novel data-driven active learning (AL) process. During this process, the system examines the generated weak labels together with the modeled accuracies of the heuristics to help the learner decide on the points for which the user should provide true labels. The process aims to improve both the accuracy and the coverage of the training data while keeping the user in the loop. By incorporating the underlying data representation, the system queries the user only about the points that are expected to improve the overall labeling quality. Finally, the true labels provided by the user are used to refine the initial labels generated by the heuristics.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import warnings
warnings.filterwarnings("ignore")

try:
    import heuristics_generator
except ImportError as err:
    # Print the actual error message rather than the exception class.
    print(err)

Read and prepare the data from the CSV file.

In [2]:
from heuristics_generator.hg_utils import *
from heuristics_generator.extra import *

DU = read_data('magic.csv', 'prediction')

Columns: 

Index(['index', 'fLength', 'fWidth', 'fSize', 'fConc', 'fConc1', 'fAsym',
       'fM3Long', 'fM3Trans', 'fAlpha', 'fDist', 'prediction'],
      dtype='object')


In [3]:
df_features, df_ground = split_features(DU)
train_set, dev_set, test_set, train_ground, val_ground, test_ground = split_sets(df_features, df_ground)

Train sent : # 15216
Dev sent : # 1902
Test sent : # 1902
Number of candidates: 15216
Number of candidates: 1902
Number of candidates: 1902


## Heuristics Generator
The system starts with the heuristics generator component, which takes the labeled set DL and the unlabeled set DU as inputs and outputs a set H of K heuristics, denoted (h1, h2, …, hK), together with a vector of initial probabilistic labels for the points in DU.
The process starts by defining the input features for the candidate models. It then creates the models (heuristics) and evaluates their performance and coverage. Finally, it ranks the heuristics generated in each iteration to decide which heuristic to add to the set H.

When evaluating the performance of the heuristics produced during each iteration, the component also considers the overall coverage of the heuristics when applied to DU. The component aims to widen the range of data points in DU that receive labels from H. In other words, the goal of the component is to output a set of heuristics that are individually accurate while achieving high labeling coverage when combined. The performance metrics are computed by applying each heuristic to DL, where the true labels are known.
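The two quantities discussed above can be illustrated with a small, self-contained sketch. It is not the code used by `HeuristicGenerator`; the vote encoding (+1, -1, and 0 for abstain), the function names, and the ranking rule (accuracy first, coverage as a tie-breaker) are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

ABSTAIN = 0  # assumed encoding: +1 / -1 are votes, 0 means "no label"


def accuracy_and_coverage(votes: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy on the points the heuristic labels, fraction of points labeled)."""
    labeled = [(v, t) for v, t in zip(votes, truth) if v != ABSTAIN]
    coverage = len(labeled) / len(votes) if votes else 0.0
    if not labeled:
        return 0.0, coverage
    correct = sum(1 for v, t in labeled if v == t)
    return correct / len(labeled), coverage


def combined_coverage(vote_matrix: Sequence[Sequence[int]]) -> float:
    """Fraction of points that receive a vote from at least one heuristic."""
    n_points = len(vote_matrix[0])
    covered = sum(1 for i in range(n_points) if any(h[i] != ABSTAIN for h in vote_matrix))
    return covered / n_points


def rank_heuristics(vote_matrix: Sequence[Sequence[int]], truth: Sequence[int]) -> List[int]:
    """Order heuristic indices by (accuracy, coverage), best first (illustrative rule)."""
    stats = [accuracy_and_coverage(votes, truth) for votes in vote_matrix]
    return sorted(range(len(stats)), key=lambda j: stats[j], reverse=True)


# Toy labeled set with three candidate heuristics (one row of votes each).
truth = [1, -1, 1, 1, -1, -1]
vote_matrix = [
    [1, -1, 0, 0, 0, -1],   # h1: always right when it votes, but abstains often
    [1, 1, 1, 1, -1, -1],   # h2: labels every point, with one mistake
    [-1, 0, 1, 0, 0, 1],    # h3: weak heuristic
]
stats = [accuracy_and_coverage(votes, truth) for votes in vote_matrix]
order = rank_heuristics(vote_matrix, truth)
```

In this toy example h1 is the most accurate heuristic but covers only half of the points, while h1 and h3 together cover four of the six points, which is why coverage of the combined set is tracked alongside individual accuracy.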

In [4]:
from heuristics_generator.loader import DataLoader
from heuristics_generator.verifier import Verifier
from heuristics_generator.heuristic_generator import HeuristicGenerator
from heuristics_generator.synthesizer import Synthesizer

df_gold_label = prepare_hg(DU)
dl = DataLoader()
train_primitive_matrix, val_primitive_matrix, test_primitive_matrix, \
train_ground, val_ground, test_ground = dl.load_data_tabular(train_set, test_set, dev_set, train_ground, test_ground, val_ground)

In [5]:

validation_accuracy = []
training_accuracy = []
validation_coverage = []
training_coverage = []
training_marginals = []
idx = None

hg = HeuristicGenerator(train_primitive_matrix, val_primitive_matrix, val_ground, train_ground, b=0.5)
for i in range(3, 26):
    # Keep three heuristics in the first iteration and one in each later iteration.
    keep = 3 if i == 3 else 1
    hg.run_synthesizer(max_cardinality=2, idx=idx, keep=keep, model='dt')
    hg.run_verifier()

    # Track accuracy and coverage on the validation and training sets.
    va, ta, vc, tc = hg.evaluate()
    validation_accuracy.append(va)
    training_accuracy.append(ta)
    training_marginals.append(hg.vf.train_marginals)
    validation_coverage.append(vc)
    training_coverage.append(tc)

    # Pass the feedback indices to the next iteration; stop when there are none.
    hg.find_feedback()
    idx = hg.feedback_idx

    if len(idx) == 0:
        break

In [6]:
from scipy import sparse

# Convert the label matrix produced by the heuristics into a sparse integer matrix.
L_train = sparse.csr_matrix(hg.L_train.astype(int))

Finally, to combine the outputs of the heuristics and generate an initial vector of probabilistic labels, we employ a generative model that learns the accuracies of the heuristics in H and produces a single probabilistic label for each data point in the unlabeled dataset.
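To make the idea of an accuracy-weighted combination concrete, the sketch below shows a simplified version under strong assumptions: the heuristics are treated as conditionally independent, both classes are equally likely a priori, and the accuracies are given rather than learned. It is not the model fitted by `Fitting_Gen_Model`, which estimates the accuracies from the label matrix itself; the function names here are hypothetical.

```python
import math
from typing import Sequence


def accuracy_to_weight(accuracy: float) -> float:
    """Half the log-odds of being correct: 0 for a coin flip, larger for better heuristics."""
    return 0.5 * math.log(accuracy / (1.0 - accuracy))


def probabilistic_label(votes: Sequence[int], weights: Sequence[float]) -> float:
    """P(y = +1) for one point; votes are +1 / -1, and abstains (0) contribute nothing."""
    score = sum(w * v for w, v in zip(weights, votes))
    return 1.0 / (1.0 + math.exp(-2.0 * score))


# Assumed accuracies for three heuristics (illustrative values only).
weights = [accuracy_to_weight(a) for a in (0.9, 0.7, 0.6)]

# The accurate heuristic votes +1, a weaker one votes -1, the third abstains.
p_conflict = probabilistic_label([1, -1, 0], weights)

# With no votes at all, the label stays at the uninformative prior of 0.5.
p_abstain = probabilistic_label([0, 0, 0], weights)
```

Under these assumptions the more accurate heuristic dominates a disagreement, so `p_conflict` lies above 0.5 but well below 1, which reflects the remaining uncertainty.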

In [7]:
gen_model, gen_train_marginals = Fitting_Gen_Model(L_train)

Inferred cardinality: 2
[1.31495497 1.10530113 1.08545801 0.45992796 0.20731223 0.25482566
 0.26669704 0.26540976 0.50971165 0.48338961 0.18063157 0.22463982
 0.14409732 0.77814357 0.10974106 0.18896361 0.19093101 0.20958962
 0.15981939 0.16024946 0.44982601 0.1198484  0.1543989  0.20813864
 0.13101344]
0.0


In [8]:
predictions = gen_model.predictions(L_train, batch_size=None)
gen_model.learned_lf_stats()

| Heuristic | Accuracy | Coverage | Precision | Recall |
| --- | --- | --- | --- | --- |
| 0 | 0.934598 | 0.8012 | 0.940274 | 0.745322 |
| 1 | 0.902582 | 0.7668 | 0.909113 | 0.692552 |
| 2 | 0.893739 | 0.7698 | 0.906188 | 0.679641 |
| 3 | 0.721302 | 0.6821 | 0.753386 | 0.499626 |
| 4 | 0.605912 | 0.6732 | 0.63607 | 0.405876 |
| 5 | 0.627436 | 0.6721 | 0.653925 | 0.419349 |
| 6 | 0.641217 | 0.6706 | 0.678445 | 0.431138 |
| 7 | 0.624517 | 0.673 | 0.653936 | 0.419723 |
| 8 | 0.744547 | 0.7015 | 0.769423 | 0.5189 |
| 9 | 0.733343 | 0.6859 | 0.757799 | 0.5 |


## Data-driven Active Learner
This component brings the user into the loop by employing active learning. However, our problem setting differs from the traditional active learning scenario, in which there is a small set of labeled points and a larger set of unlabeled data. Instead, we deal with a set of probabilistic labels that are categorized according to the confidence of the generative model. We therefore adopt meta-active learning in this component and apply a data-driven approach to learn the query strategy. The approach formulates the design of the query strategy as a regression problem: we train a regression model to predict the reduction in generalization error associated with adding a labeled point (xi, yi) to the training data of a classifier. Our main hypothesis is that this regressor can serve as the query strategy in our setting and outperform the baseline strategies, since it is tailored to the underlying distribution and takes the output of the generative model into account.
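The role of the regressor as a query strategy can be sketched as follows. This is a minimal illustration, not the logic of `run_dll`: the function `select_queries`, the single-feature representation, and the `toy_regressor` stand-in are all hypothetical, and the one-shot ranking is a simplification of a strategy that would normally re-score the pool after each newly labeled point.

```python
from typing import Callable, List, Sequence


def select_queries(
    pool: Sequence[Sequence[float]],
    predict_error_reduction: Callable[[Sequence[float]], float],
    budget: int,
) -> List[int]:
    """Return the indices of the `budget` pool points with the largest predicted gain."""
    scored = [(predict_error_reduction(x), i) for i, x in enumerate(pool)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:budget]]


def toy_regressor(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained regressor.

    It assumes that points whose probabilistic label is close to 0.5
    yield the largest reduction in generalization error.
    """
    prob_positive = features[0]
    return 1.0 - 2.0 * abs(prob_positive - 0.5)


# Each pool point is described here by a single feature: its probabilistic label.
pool = [[0.95], [0.52], [0.10], [0.45], [0.70]]
chosen = select_queries(pool, toy_regressor, budget=2)
```

With a budget of two, the sketch picks the two points whose labels are least certain. A trained regressor would replace `toy_regressor` and could use richer features of the classifier state and of the candidate point.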

In [10]:
df_full_data, AL_Pool = evaluate_disagreement_factor(L_train, DU)

Data with additional info: disagreements and abstain
(15216, 15)
Data for Active learning: Data with additional info after applying conditions
(4889, 15)
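The conditions applied by `evaluate_disagreement_factor` are not shown in this notebook. As a rough illustration of how disagreement and abstain information could be derived from a label matrix, the sketch below counts the votes per point and keeps the points where the heuristics conflict or all abstain. Both the selection rule and the function names are assumptions.

```python
from typing import Dict, List, Sequence


def vote_summary(votes: Sequence[int]) -> Dict[str, int]:
    """Count positive (+1), negative (-1), and abstain (0) votes for one point."""
    return {
        "positive": sum(1 for v in votes if v == 1),
        "negative": sum(1 for v in votes if v == -1),
        "abstain": sum(1 for v in votes if v == 0),
    }


def build_pool(label_rows: Sequence[Sequence[int]]) -> List[int]:
    """Indices of points where heuristics conflict or none of them votes (assumed rule)."""
    pool = []
    for i, votes in enumerate(label_rows):
        summary = vote_summary(votes)
        conflict = summary["positive"] > 0 and summary["negative"] > 0
        all_abstain = summary["abstain"] == len(votes)
        if conflict or all_abstain:
            pool.append(i)
    return pool


# One row per data point, one column per heuristic.
label_rows = [
    [1, 1, 0],     # heuristics agree
    [1, -1, 0],    # heuristics conflict
    [0, 0, 0],     # no heuristic votes
    [-1, -1, -1],  # heuristics agree
]
pool_indices = build_pool(label_rows)
```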


In [11]:
budget_variable = 0.07
# The budget is a fraction of the full dataset, capped at the size of the AL pool.
labeling_budget = int(budget_variable * len(df_full_data.index))
if labeling_budget >= AL_Pool.shape[0]:
    labeling_budget = AL_Pool.shape[0]

In [12]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from data_learner.active_learner import *
from data_learner.models import DDL_Dataset
from data_learner.lal_model import LALmodel

# Pre-computed regression data used to train the query-strategy regressor.
fn = 'LAL-iterativetree-simulatedunbalanced-big.npz'
parameters = {'est': 1000, 'depth': 40, 'feat': 6}
filename = '../data_learner/datasets/' + fn
regression_data = np.load(filename)
regression_features = regression_data['arr_0']
regression_labels = regression_data['arr_1']

# Random forest that serves as the learned query strategy.
lalModel = RandomForestRegressor(n_estimators=parameters['est'], max_depth=parameters['depth'],
                                 max_features=parameters['feat'], oob_score=True, n_jobs=8)
lalModel.fit(regression_features, np.ravel(regression_labels))

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=40,
           max_features=6, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=8, oob_score=True, random_state=None,
           verbose=0, warm_start=False)

In [13]:
from data_learner.dl_utils import *
indices, labels = run_dll(lalModel, labeling_budget, AL_Pool)

## Probabilistic Labels Generator
The final component of Asterisk is the label generator, which learns the accuracies of the generated heuristics from the refined heuristics matrix Hupdated and then combines the outputs of these heuristics to produce a single probabilistic label for each point in DU. This is accomplished by learning the structure of a generative model Gen, which uses the refined matrix to model the process of labeling the training set.
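One way to picture the refined matrix is sketched below: for every point the user has labeled, all heuristic votes in that row are replaced by the true label, and the remaining rows are left untouched. The helper `refine_label_matrix` is hypothetical and works on plain lists; it assumes user labels are given as 0/1 and that the matrix uses +1, -1, and 0 for abstain.

```python
from typing import Dict, List, Sequence


def refine_label_matrix(
    label_matrix: Sequence[Sequence[int]],
    user_labels: Dict[int, int],
) -> List[List[int]]:
    """Return a copy of the matrix with user-labeled rows overwritten by the true label."""
    refined = [list(row) for row in label_matrix]
    for i, label in user_labels.items():
        vote = 1 if label == 1 else -1  # map the assumed 0/1 user label to -1/+1
        refined[i] = [vote] * len(refined[i])
    return refined


# One row per data point, one column per heuristic.
label_matrix = [
    [1, 0, -1],
    [0, 0, 0],
    [-1, -1, 1],
]
# The user labels point 1 as positive and point 2 as negative.
user_labels = {1: 1, 2: 0}
refined = refine_label_matrix(label_matrix, user_labels)
```

Rows without a user label keep their original votes, so the generative model still has to resolve the disagreement in the first row on its own.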

In [14]:
import pandas as pd

# Collect the queried indices and their user-provided labels, ordered by index.
AL_Results = pd.DataFrame({'index': indices.astype('int64'), 'AL_Label': labels})
AL_Results = AL_Results.sort_values(by=['index'])

In [15]:
# Encode negative labels as -1, since 0 is used below for points without a user label.
AL_Results.loc[AL_Results['AL_Label'] == 0, 'AL_Label'] = -1
data_with_AL_Results = df_full_data.merge(AL_Results, on=['index'], how='left')
true_label = data_with_AL_Results['AL_Label'].fillna(0)

# Overwrite every heuristic's vote with the true label for the queried points.
for i in range(len(true_label)):
    if true_label[i] != 0:
        L_train[i, :] = true_label[i]

# Refit the generative model on the refined label matrix.
gen_model, AL_train_marginals = Fitting_Gen_Model(L_train)
predictions = gen_model.predictions(L_train, batch_size=None)


Inferred cardinality: 2
[1.41152551 1.20340315 1.18982461 0.57966849 0.32033213 0.36200018
 0.37475127 0.37032741 0.61899355 0.5785432  0.29195317 0.32854532
 0.25414597 0.89155998 0.22035195 0.28852705 0.29128206 0.31601955
 0.26197951 0.26022231 0.54707463 0.22016906 0.2578832  0.31190517
 0.23169262]
0.0
