# Smart Mendr

We propose Smart Mendr, a new classification Model that applies Ensemble Learning and Data-driven Rectification to handle both scenarios of inaccurate and incomplete supervision. An overview of the proposed method is illustrated in Figure 5.1. As the figure shows, the method has two phases. In the first phase, Smart Mendr applies a preliminary stage of ensemble learning to estimate the probability of each instance being mislabeled and produce initially weak labels for unlabeled data. However, to overcome the challenges of noise detection using ensemble learning, we apply a semi-supervised learning approach to combine the output of the ensemble and report the noisy points. After that, the proposed method, in the second phase, applies a smart correcting procedure using meta-active learning to provide true labels for both noisy and unlabeled points

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import warnings
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

Reading and preparing the data from csv file

In [2]:
def load_data(input_folder, dataset_name, y_column):
    p_path =  input_folder + "//" + dataset_name + "_noisy.csv"
    u_path =  input_folder + "//" + dataset_name + "_unlabeled.csv"
    original_path = input_folder + "//" + dataset_name + ".csv"
    original_data = pd.read_csv(original_path, sep=',')

    df_p = pd.read_csv(p_path, sep=',')
    ground_truth = df_p[y_column]
    df_p = df_p.drop(y_column, axis = 1)
    df_u = pd.read_csv(u_path, sep=',') 
    df_u = df_u.drop(y_column, axis = 1)

    return df_p, df_u, ground_truth, original_data

In [3]:
df_p, df_u, ground_truth, original_data = load_data("datasets", "shoppers", "prediction")


DU = original_data
DU.index = [x for x in range(1, len(DU)+1)]
DU.index.name = 'index'
obj_columns = DU.select_dtypes(['object']).columns    
for col in obj_columns:
    DU[col] = DU[col].astype('category')
cat_columns = DU.select_dtypes(['category']).columns    
DU[cat_columns] = DU[cat_columns].apply(lambda x: x.cat.codes)

obj_columns = df_p.select_dtypes(['object']).columns    
for col in obj_columns:
    df_p[col] = df_p[col].astype('category')
cat_columns = df_p.select_dtypes(['category']).columns    
df_p[cat_columns] = df_p[cat_columns].apply(lambda x: x.cat.codes)


obj_columns = df_u.select_dtypes(['object']).columns   
for col in obj_columns:
    df_u[col] = df_u[col].astype('category')
cat_columns = df_u.select_dtypes(['category']).columns    
df_u[cat_columns] = df_u[cat_columns].apply(lambda x: x.cat.codes)

## Ensemble Learning
In this phase, the proposed method aims at detecting data points with noisy labels in Dp and producing initial labels for the unlabeled points in Du. Therefore, the phase employs a set of ensembles in two stages. In the first stage, a set of base learners are built to produce predictions for the data points in D. Then, the ensemble predictor is utilized in the second stage to detect noisy points Dnoise in Dp.

In [4]:
import numpy as np
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(df_p, ground_truth)
predictions_p = clf.predict(df_u)
predictions_u = clf.predict(df_p)
predictions = np.concatenate((predictions_u, predictions_p))

original_p = ground_truth
original_u = np.zeros(df_u.shape[0])
original = np.concatenate((original_u, original_p))

In [5]:
from scipy import sparse

S = np.column_stack((predictions,original))
S = S.astype(int)
L_train= sparse.csr_matrix(S) 


Finally, to combine the output of the ensemble_learner and generate an initial vector of probabilistic labels, we employ a generative model to learn the accuracies of the predictions in H_Best and produce a single probabilistic label for each data point in the unlabeled dataset. 

In [6]:
import smartmendr.ensemble_learner
from smartmendr.ensemble_learner.util_ensemble import *

global gen_model 
gen_model, gen_train_marginals = Fitting_Gen_Model(L_train)

Inferred cardinality: 2
[1.49074684 0.03306396]
0.0


In [7]:
predictions = gen_model.predictions(L_train, batch_size=None)
gen_model.learned_lf_stats()

Unnamed: 0,Accuracy,Coverage,Precision,Recall
0,0.955861,0.8224,0.957026,0.792959
1,0.52083,0.6649,0.517097,0.344869


In [8]:
columns_list = DU.columns
columns_list

Index(['0', '1', '2', '3', '4', '5', 'ground_truth'], dtype='object')

In [9]:
np.shape(gen_train_marginals.tolist())

(432,)

In [24]:
def Calculate_labeling_accuracy(threeshold, columns_list, train_marginals, df_original):
    id_list = df_original.index
    df_verify = df_original
    

    #df_verify ['index'] = id_list
    df_verify ['Label'] = train_marginals.tolist()

    df_verify.loc[df_verify ['Label'] >= threeshold, 'Label'] = 1
    df_verify.loc[df_verify ['Label'] < threeshold, 'Label'] = -1

    df_copy = df_original
    df_copy = df_copy.drop(columns_list, axis = 1)

    #print ("1- df_copy columns")
    #print (df_copy.columns)
    #calculating the labeling accuracy
    counter = 0
    for index, row in df_verify.iterrows():
        if row['Label'] == float(row['ground_truth']):
            counter = counter+1

    labeling_accuracy = float(counter)/len(id_list)
    return labeling_accuracy


In [11]:
noise_detection = Calculate_labeling_accuracy(0.5, columns_list, gen_train_marginals, DU)
print(noise_detection)

0.41898148148148145


In [12]:
abstain = 0
agreements_list=[]
for rows in L_train:
    if len(rows.data) == 0:
        abstain=abstain+1      
    agreements_list.append(abs(sum(rows.data)))

## Data-driven Active Learner
The component tries to integrate the user in the loop at this point by employing active learning. However, our problem settings do not impose traditional active learning scenarios where we usually have a small set of labeled points and a larger set of unlabeled data. Instead, we deal with a set of probabilistic labels that are classified based on the confidence of the generative model. Therefore, we adopt meta-active learning in this component and apply a data-driven approach to learn the query strategy. The approach formulates the process of designing the query strategy as a regression problem. We train a regression model to predict the reduction of the generalization error associated with adding a labeled point {xi, yi} to the training data of a classifier. Our main hypothesis is that this regressor can serve as the query strategy in our problem settings to outperform the baseline strategies since it is customized to the underlying distribution and considers the output of the generative model.

In [13]:
index_list = []
agreements_list=[]
unlabeled_list=[]
i = 1
for rows in L_train:
    index_list.append(i)
    if len(rows.data) == 0:
        unlabeled_list.append(True)
    else:
        unlabeled_list.append(False)
    agreements_list.append(abs(sum(rows.data)))
    i=i+1
    
df_with_additional_info = pd.DataFrame(index_list)
df_with_additional_info.columns=['index']
df_with_additional_info['disagreement_factor'] = agreements_list
df_with_additional_info['unlabeled_flag'] = unlabeled_list

In [14]:
df_with_additional_info=df_with_additional_info.rename(columns={'index': 'index'})
original_data = DU

In [15]:
df_with_additional_info = df_with_additional_info.merge(original_data, on=['index'], how='left', indicator= True)

In [16]:
cond1 = df_with_additional_info['unlabeled_flag'] == True
cond2 = df_with_additional_info['disagreement_factor'] <= 2
df_active_learning= df_with_additional_info[cond1 | cond2]
print("Data with additional info: disagreements and abstain")
print(df_with_additional_info.shape)
print("Data for Active learning: Data with additional info after applying conditions")
print(df_active_learning.shape)
df_active_learning = df_active_learning.drop(['unlabeled_flag', 'disagreement_factor', '_merge'], axis=1)
df_active_learning['prediction'] = predictions

Data with additional info: disagreements and abstain
(432, 12)
Data for Active learning: Data with additional info after applying conditions
(432, 12)


In [17]:
df_active_learning = df_active_learning.fillna(0)
df_active_learning.replace([np.inf, -np.inf], np.nan)
df_active_learning.astype('int32').dtypes

index           int32
0               int32
1               int32
2               int32
3               int32
4               int32
5               int32
ground_truth    int32
Label           int32
prediction      int32
dtype: object

In [18]:
budget_variable = 0.7
labeling_budget = int(budget_variable*len(df_with_additional_info.index))
if(labeling_budget >= df_active_learning.shape[0]):
    labeling_budget= int(1*len(df_active_learning.index))

In [19]:
from sklearn.ensemble import RandomForestRegressor
from data_learner.active_learner import *
from data_learner.models import DDL_Dataset 
from data_learner.lal_model import LALmodel

fn = 'LAL-iterativetree-simulatedunbalanced-big.npz'
parameters = {'est': 1000, 'depth': 40, 'feat': 6 }
filename = '../smartmendr/data_learner/datasets/'+fn
regression_data = np.load(filename)
regression_features = regression_data['arr_0']
regression_labels = regression_data['arr_1']
lalModel = RandomForestRegressor(n_estimators = parameters['est'], max_depth = parameters['depth'], 
                                 max_features=parameters['feat'], oob_score=True, n_jobs=8)
lalModel.fit(regression_features, np.ravel(regression_labels))    

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=40,
           max_features=6, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=1000, n_jobs=8, oob_score=True, random_state=None,
           verbose=0, warm_start=False)

In [20]:
from smartmendr.data_learner.dl_utils import *
indices, labels= run_dll(lalModel, labeling_budget,df_active_learning);

[[  1.   3.   3. ...   1.  -1.  -1.]
 [  2.   1.   1. ...   2.  -1.  -1.]
 [  3.   2.   3. ...   1.  -1.  -1.]
 ...
 [343.   3.   2. ...   1.   1.  -1.]
 [344.   3.   2. ...   1.   1.  -1.]
 [345.   1.   1. ...   1.   1.  -1.]]
final results:
-----------------
[ 38.   7. 106. 264.   5. 202.  68.  11.  88. 194.  22.  98.  45. 293.
 241. 101. 223.  26.  57.  37.  67.  20. 324. 190.  18.  10.  19.  47.
 274.   9.  21.  14. 271.  87. 282.  16.  39. 136. 215.  12. 320.  28.
   3.  55.   6.  29.  41.  35.   4.   2.  43. 178.  13.  62. 206.  25.
 159.  72. 153.  48.   8.  78.  70.  15.  63.  65.  84. 212.  76.  49.
  31.  83. 148.  75. 151.  77.  71.  85.  40. 167.  23. 152. 157.  92.
 158. 168.  30.  79. 145. 160.  64.  69.  32.  24.  97. 341.  27.  36.
  34.  86.  73.  80. 149.  61. 150.  66.  74.  81.  82.  42.  44. 154.
  52.  33. 179. 155. 232.  50. 156. 165. 169.  51.  46.  53. 161.  59.
 100. 162.  96. 170. 171.  95. 135. 163.  90. 164.  58.  94. 187.  54.
 123.  93. 166.  99. 172. 173

In [21]:
AL_Results=pd.DataFrame(columns=['index','AL_Label'])
AL_Results['index']=indices.astype('int64')
AL_Results['AL_Label']=labels
AL_Results = AL_Results.sort_values(by =['index'])

In [22]:
AL_Results.loc[AL_Results['AL_Label']==0, 'AL_Label']=-1
data_with_AL_Results = df_with_additional_info.merge(AL_Results, on=['index'], how='left')
true_label = data_with_AL_Results['AL_Label']
true_label = true_label.fillna(0)
for i in range(len(true_label)):
    if true_label[i] !=0:
        L_train[i,:]=true_label[i]
gen_model, AL_train_marginals = Fitting_Gen_Model(L_train)
predictions = gen_model.predictions(L_train, batch_size=None)


Inferred cardinality: 2


  self._set_arrayXarray(i, j, x)


[1.5446955  1.52143692]
0.0


In [36]:
columns_list
noise_detection = Calculate_labeling_accuracy(0.5, columns_list, AL_train_marginals, DU)
print(noise_detection)

0.5185185185185185
