# ***End to End tutorial for SMS_SPAM labeling using SPEAR(Cage and JL):***

In [4]:
#pip install

In [5]:
import sys
sys.path.append('../../')
import numpy as np

# ***Defining an Enum to hold labels:***
### **Representation of class Labels**

<p>All class labels for which we define labeling functions are encoded in an enum that is used throughout the following steps. Do not define an Abstain class (a labeling function (LF) declining to decide) inside this enum; use the ABSTAIN object instead, as shown later in the LF section.</p>

<p>The SPAM dataset contains two classes: <b>HAM</b> and <b>SPAM</b>. The numbers associated with the classes can be anything, but consecutive integers from 0 to number_of_classes-1 are recommended.</p>

<p><b>Note that even though this example is a binary classification task, the SPEAR library supports multi-label classification.</b></p>

In [6]:
import enum

# enum to hold the class labels
class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0

THRESHOLD = 0.8
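
The recommendation to number classes from 0 to number_of_classes-1 can be checked directly on the enum. The following is a small standalone sketch that redefines the same `ClassLabels` enum and uses only the Python standard library; it does not call SPEAR.

```python
import enum


class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0


# Values should cover 0 .. number_of_classes - 1 with no gaps.
values = sorted(member.value for member in ClassLabels)
assert values == list(range(len(ClassLabels)))

# Round trip between integer predictions and enum members.
print(ClassLabels(1))         # member whose value is 1
print(ClassLabels.HAM.value)  # integer stored for HAM
print(ClassLabels(0).name)    # string name of the member with value 0
```

The same lookup (`ClassLabels(value)`) is what makes it easy to map integer predictions back to class names later on.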

# ***Defining preprocessors, continuous_scorers, labeling functions:***
When labeling the unlabelled data, we look for a few keywords to assign a class to each SMS.

<b>Example</b> : *If a message contains apply or buy in it then most probably the message is spam*

In [7]:
trigWord1 = {"free","credit","cheap","apply","buy","attention","shop","sex","soon","now","spam"}
trigWord2 = {"gift","click","new","online","discount","earn","miss","hesitate","exclusive","urgent"}
trigWord3 = {"cash","refund","insurance","money","guaranteed","save","win","teen","weight","hair"}
notFreeWords = {"toll","Toll","freely","call","meet","talk","feedback"}
notFreeSubstring = {"not free","you are","when","wen"}
firstAndSecondPersonWords = {"I","i","u","you","ur","your","our","we","us","youre"}
thirdPersonWords = {"He","he","She","she","they","They","Them","them","their","Their"}
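
The labeling functions defined later in this tutorial lower-case each message and then intersect its whitespace-separated tokens with one of these sets. A minimal plain-Python sketch of that matching (the helper `lower_tokens` and the sample message are illustrative, not part of SPEAR) shows two consequences: tokens keep their punctuation, and capitalised entries such as "Toll" can never match a lower-cased message.

```python
trig_words = {"free", "credit", "cheap"}        # subset of trigWord1
mixed_case_words = {"toll", "Toll", "call"}     # subset of notFreeWords


def lower_tokens(message):
    """Lower-case, strip, and split on whitespace, mirroring the LF bodies."""
    return set(message.lower().strip().split())


tokens = lower_tokens("FREE credit, call the Toll free number!")

# "credit," keeps its trailing comma, so only "free" is a hit.
hits = trig_words & tokens
print(hits)

# After lower-casing, only the lower-case entries can match.
print(mixed_case_words & tokens)
```

If punctuation-insensitive matching is wanted, the preprocessor is the natural place to strip it.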

### **Declaration of a simple preprocessor function**


For most tasks in NLP and computer vision, we preprocess each datapoint before labeling it rather than using the raw datapoint. Preprocessor functions are used to preprocess an instance before labeling it. We use the **`@preprocessor(name,resources)`** decorator to declare a function as a preprocessor.

In [8]:
from spear.labeling import preprocessor


@preprocessor(name = "LOWER_CASE")
def convert_to_lower(x):
    return x.lower().strip()

lower = convert_to_lower("RED")

### **Some Labeling function(LF) definitions**
Below are some examples of how to define LFs and continuous LFs (CLFs). To get the continuous score for a CLF, define a function with the continuous_scorer decorator (just like the labeling_function decorator) and pass it to the CLF as shown below. Also note how the continuous score is used inside a CLF. Here, word_similarity is a function with the continuous_scorer decorator, written in the con_scorer file (not part of the package) in the same folder.

In [9]:
from spear.labeling import labeling_function, ABSTAIN

from helper.con_scorer import word_similarity
import re


@preprocessor()
def convert_to_lower(x):
    return x.lower().strip()


@labeling_function(resources=dict(keywords=trigWord1),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF1(c,**kwargs):    
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=trigWord2),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF2(c,**kwargs):
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=trigWord3),pre=[convert_to_lower],label=ClassLabels.SPAM)
def LF3(c,**kwargs):
    if len(kwargs["keywords"].intersection(c.split())) > 0:
        return ClassLabels.SPAM 
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=notFreeWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF4(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(resources=dict(keywords=notFreeSubstring),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF5(c,**kwargs):
    for pattern in kwargs["keywords"]:    
        if "free" in c.split() and re.search(pattern,c, flags= re.I):
            return ClassLabels.HAM
    return ABSTAIN

@labeling_function(resources=dict(keywords=firstAndSecondPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF6(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN


@labeling_function(resources=dict(keywords=thirdPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def LF7(c,**kwargs):
    if "free" in c.split() and len(kwargs["keywords"].intersection(c.split()))>0:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(label=ClassLabels.SPAM)
def LF8(c,**kwargs):
    if (sum(1 for ch in c if ch.isupper()) > 6):
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord1),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF1(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord2),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF2(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=trigWord3),pre=[convert_to_lower],label=ClassLabels.SPAM)
def CLF3(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=notFreeWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF4(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=notFreeSubstring),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF5(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=firstAndSecondPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF6(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=word_similarity,resources=dict(keywords=thirdPersonWords),pre=[convert_to_lower],label=ClassLabels.HAM)
def CLF7(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.HAM
    else:
        return ABSTAIN

@labeling_function(cont_scorer=lambda x: 1-np.exp(float(-(sum(1 for ch in x if ch.isupper()))/2)),label=ClassLabels.SPAM)
def CLF8(c,**kwargs):
    if kwargs["continuous_score"] >= THRESHOLD:
        return ClassLabels.SPAM
    else:
        return ABSTAIN

model loading
model loaded
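
CLF8 scores a message as 1 - exp(-n/2), where n is the number of uppercase characters, and fires when the score reaches THRESHOLD. The sketch below reproduces that formula with the standard `math` module (instead of numpy) to work out how many uppercase characters are needed; the helper names are illustrative.

```python
import math

THRESHOLD = 0.8


def uppercase_score(text):
    """Same formula as the CLF8 scorer: 1 - exp(-n_upper / 2)."""
    n_upper = sum(1 for ch in text if ch.isupper())
    return 1 - math.exp(-n_upper / 2)


def min_uppercase_to_fire(threshold):
    """Smallest uppercase count whose score reaches the threshold."""
    n = 0
    while 1 - math.exp(-n / 2) < threshold:
        n += 1
    return n


for sample in ["abc", "ABC", "ABCD"]:
    print(sample, round(uppercase_score(sample), 4))

print("uppercase characters needed:", min_uppercase_to_fire(THRESHOLD))
```

With THRESHOLD = 0.8 the continuous version fires from four uppercase characters onward, whereas the discrete LF8 requires more than six, so the two rules do not abstain on the same messages.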


# ***Accumulating all LFs into rules, an LFSet (a class) object:***
### **Importing LFSet and passing LFs we defined, to that class**

In [10]:
from spear.labeling import LFSet

LFS = [LF1,
    LF2,
    LF3,
    LF4,
    LF5,
    LF6,
    LF7,
    LF8,
    CLF1,
    CLF2,
    CLF3,
    CLF4,
    CLF5,
    CLF6,
    CLF7,
    CLF8
      ]

rules = LFSet("SPAM_LF")
rules.add_lf_list(LFS)

# ***Loading data:***
### **Load the data: X, X_feats, Y**
<p>Note that the utils module below is not part of the package; it is only used to load the necessary data. Users may load their data by any means. X is the raw data that is passed to the LFs, X_feats is a numpy array of shape (num_instances, num_features), and Y holds the true labels (if available).</p>

In [11]:
from helper.utils import load_data_to_numpy, get_various_data

X, X_feats, Y = load_data_to_numpy()

validation_size = 100
test_size = 400
L_size = 100
U_size = 4500
n_lfs = len(rules.get_lfs())

X_V, Y_V, X_feats_V,_, X_T, Y_T, X_feats_T,_, X_L, Y_L, X_feats_L,_, X_U, X_feats_U,_ = get_various_data(X, Y,\
    X_feats, n_lfs, validation_size, test_size, L_size, U_size)
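
Since `get_various_data` is a helper that is not shown here, the following is only a sketch of one way to carve a dataset into disjoint validation, test, labeled, and unlabeled parts of the sizes used above. The function `split_indices`, the seed, and the dataset size of 5500 are assumptions made for illustration.

```python
import random


def split_indices(n_total, sizes, seed=0):
    """Shuffle 0..n_total-1 and cut it into disjoint named index lists."""
    if sum(sizes.values()) > n_total:
        raise ValueError("requested split sizes exceed the dataset size")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = indices[start:start + size]
        start += size
    return splits


sizes = {"V": 100, "T": 400, "L": 100, "U": 4500}
splits = split_indices(5500, sizes)  # 5500 is a hypothetical dataset size

for name, idx in splits.items():
    print(name, len(idx))
```

The index lists can then be used to slice X, X_feats, and Y consistently, so that each instance keeps its raw text, features, and label together.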

# ***Labeling data:***
### **Paths**
* path_json: path to json file generated by PreLabels
* V_path_pkl: path to pkl file generated by PreLabels containing the validation data with true labels
* L_path_pkl: path to pkl file generated by PreLabels containing the labeled data with true labels
* T_path_pkl: path to pkl file generated by PreLabels containing the test data with true labels
* U_path_pkl: path to pkl file generated by PreLabels containing the unlabelled data without true labels
* log_path: path to save the log which is generated during the algorithm

<p>The difference between test and labeled data is that labeled data may be used by the algorithm (JL uses it while Cage doesn't), whereas test data is not. Make sure the pickle files are <b>EMPTY</b>, i.e., they should not contain any data before being passed to the .generate_pickle() member function of PreLabels.</p>

In [12]:
path_json = 'data_pipeline/sms_json.json'
V_path_pkl = 'data_pipeline/sms_pickle_V.pkl' #validation data - have true labels
T_path_pkl = 'data_pipeline/sms_pickle_T.pkl' #test data - have true labels
L_path_pkl = 'data_pipeline/sms_pickle_L.pkl' #Labeled data - have true labels
U_path_pkl = 'data_pipeline/sms_pickle_U.pkl' #unlabelled data - don't have true labels

log_path_cage_1 = 'log/cage_log_1.txt' #cage is an algorithm, can be found below
log_path_jl_1 = 'log/jl_log_1.txt' #jl is an algorithm, can be found below
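
One way to satisfy the "EMPTY pickle file" requirement before re-running the notebook is to truncate each target file up front. The sketch below uses only the standard library and a temporary directory for the demonstration; `reset_output_file` is a hypothetical helper, and whether SPEAR needs the file or its parent directory to exist beforehand is an assumption, not something stated above.

```python
import os
import tempfile


def reset_output_file(path):
    """Create parent directories if needed and truncate the file to 0 bytes."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb"):
        pass  # opening in "wb" mode truncates any existing content
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as root:
    demo_path = os.path.join(root, "data_pipeline", "sms_pickle_V.pkl")
    size_first = reset_output_file(demo_path)

    # Simulate stale content left over from a previous run.
    with open(demo_path, "wb") as fh:
        fh.write(b"stale")
    size_after = reset_output_file(demo_path)

print(size_first, size_after)
```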

### **Importing PreLabels class and using it to label data**
Json file should be generated only once as shown below.

In [13]:
from spear.labeling import PreLabels

sms_noisy_labels = PreLabels(name="sms",
                               data=X_V,
                               gold_labels=Y_V,
                               data_feats=X_feats_V,
                               rules=rules,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(V_path_pkl)
sms_noisy_labels.generate_json(path_json) #generating json files once is enough

sms_noisy_labels = PreLabels(name="sms",
                               data=X_T,
                               gold_labels=Y_T,
                               data_feats=X_feats_T,
                               rules=rules,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(T_path_pkl)

sms_noisy_labels = PreLabels(name="sms",
                               data=X_L,
                               gold_labels=Y_L,
                               data_feats=X_feats_L,
                               rules=rules,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(L_path_pkl)

sms_noisy_labels = PreLabels(name="sms",
                               data=X_U,
                               rules=rules,
                               data_feats=X_feats_U,
                               labels_enum=ClassLabels,
                               num_classes=2)
sms_noisy_labels.generate_pickle(U_path_pkl)

100%|██████████| 100/100 [00:08<00:00, 12.19it/s]
100%|██████████| 400/400 [00:27<00:00, 14.81it/s]
100%|██████████| 100/100 [00:06<00:00, 15.05it/s]
100%|██████████| 4500/4500 [05:23<00:00, 13.93it/s]


# ***Accessing labeled data:***
### **Importing and the use of get_data and get_classes**
<p>These functions extract data from the pickle files and the json file, respectively. Note that these are the files generated using PreLabels.</p>
<p>For the detailed contents of the output, please refer to the documentation.</p>

In [14]:
from spear.utils import get_data, get_classes

data_U = get_data(path = U_path_pkl, check_shapes=True)
#check_shapes being True(above), asserts for relative shapes of arrays in pickle file
print("Number of elements in data list: ", len(data_U))
print("Shape of feature matrix: ", data_U[0].shape)
print("Shape of labels matrix: ", data_U[1].shape)
print("Shape of continuous scores matrix : ", data_U[6].shape)
print("Total number of classes: ", data_U[9])

classes = get_classes(path = path_json)
print("Classes dictionary in json file(modified to have integer keys): ", classes)

Number of elements in data list:  10
Shape of feature matrix:  (4500, 1024)
Shape of labels matrix:  (4500, 16)
Shape of continuous scores matrix :  (4500, 16)
Total number of classes:  2
Classes dictionary in json file(modified to have integer keys):  {1: 'SPAM', 0: 'HAM'}
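
The labels matrix of shape (4500, 16) has one row per instance and one column per LF. The toy sketch below builds such a matrix in plain Python with three made-up rules; `None` stands in for SPEAR's ABSTAIN object, and how PreLabels actually encodes abstentions inside the pickle file is not shown here.

```python
ABSTAIN = None  # stand-in for SPEAR's ABSTAIN object (assumption)
SPAM, HAM = 1, 0


def lf_keyword(msg):
    return SPAM if {"free", "win"} & set(msg.lower().split()) else ABSTAIN


def lf_caps(msg):
    return SPAM if sum(ch.isupper() for ch in msg) > 6 else ABSTAIN


def lf_personal(msg):
    return HAM if {"i", "you"} & set(msg.lower().split()) else ABSTAIN


lfs = [lf_keyword, lf_caps, lf_personal]
messages = ["WIN a FREE prize NOW", "are you home yet", "meeting moved to noon"]

# One row per message, one column per labeling function.
label_matrix = [[lf(msg) for lf in lfs] for msg in messages]
for row in label_matrix:
    print(row)

print("shape:", (len(label_matrix), len(label_matrix[0])))
```

Rows where every entry is an abstention, like the last message here, carry no signal from the LFs.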


# ***Cage Algorithm:***
### **Importing Cage class (the algorithm) and declaring an object of it**
The Cage algorithm needs only the pickle file (with labels given by LFs using the PreLabels class) containing unlabelled data (data without true/gold labels), and it predicts the labels of this data. Optional test data (which has true/gold labels) can also be passed to log accuracy information.
<p><b>Note:</b> Multiple calls to fit_* functions continue training the same parameters, i.e., parameters are not reinitialised in fit_* functions. So, to train on large data, one can call fit_* functions repeatedly on smaller chunks. Also, to perform multiple independent runs of the algorithm, one needs to reinitialise the parameters (by creating a new Cage object) at the start of each run.</p>

In [15]:
from spear.Cage import Cage

cage = Cage(path_json = path_json, n_lfs = n_lfs)

### **fit_and_predict_proba function of Cage class**
The output (probs) is a numpy matrix of shape (num_instances, num_classes) holding the probability of each instance belonging to each class. For more details about the arguments, please refer to the documentation; the same applies to all member functions used from here on.

In [16]:
cage = Cage(path_json = path_json, n_lfs = n_lfs)

probs = cage.fit_and_predict_proba(path_pkl = U_path_pkl, path_test = T_path_pkl, path_log = log_path_cage_1, \
                                   qt = 0.9, qc = 0.85, metric_avg = ['binary'], n_epochs = 200, lr = 0.01)
labels = np.argmax(probs, 1)
print("probs shape: ", probs.shape)
print("labels shape: ",labels.shape)

final_test_accuracy_score: 0.815
test_average_metric: binary	final_test_f1_score: 0.5747126436781609
probs shape:  (4500, 2)
labels shape:  (4500,)
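
The step from `probs` to `labels` is a row-wise argmax, and the classes dictionary printed earlier maps each integer back to a name. A toy sketch with made-up probabilities, assuming column k of the matrix corresponds to class value k:

```python
import numpy as np

# Made-up probabilities for three instances; columns are class 0 and class 1.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])

labels = np.argmax(probs, 1)  # index of the largest entry in each row
classes = {1: "SPAM", 0: "HAM"}
names = [classes[int(i)] for i in labels]

print(labels)
print(names)
```

Note that `np.argmax` returns the first maximal index, so an exact tie between the two columns resolves to class 0.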


### **fit_and_predict function of Cage class**
The output (labels) is a numpy array of shape (num_instances,) containing integers (because need_strings is False), giving the class of each instance.

In [17]:
cage = Cage(path_json = path_json, n_lfs = n_lfs)

labels = cage.fit_and_predict(path_pkl = U_path_pkl, path_test = T_path_pkl, path_log = log_path_cage_1, \
                              qt = 0.9, qc = 0.85, metric_avg = ['binary'], n_epochs = 200, lr = 0.01, \
                              need_strings = False)

print("labels shape: ", labels.shape)
print(type(labels[0]))

final_test_accuracy_score: 0.815
test_average_metric: binary	final_test_f1_score: 0.5747126436781609
labels shape:  (4500,)
<class 'numpy.int64'>


### **fit_and_predict function of Cage class**
The output (labels_strings) is a numpy array of shape (num_instances,) containing strings (because need_strings is True), giving the class of each instance.

In [18]:
cage = Cage(path_json = path_json, n_lfs = n_lfs)

labels_strings = cage.fit_and_predict(path_pkl = U_path_pkl, path_test = T_path_pkl, path_log = log_path_cage_1, \
                              qt = 0.9, qc = 0.85, metric_avg = ['binary'], n_epochs = 200, lr = 0.01, \
                              need_strings = True)

print("labels_strings shape: ", labels_strings.shape)
print(type(labels_strings[0]))

final_test_accuracy_score: 0.815
test_average_metric: binary	final_test_f1_score: 0.5747126436781609
labels_strings shape:  (4500,)
<class 'numpy.str_'>


### **Save parameters**
<p>Make sure the pickle you are passing here is <b>EMPTY</b></p>

In [19]:
cage.save_params(save_path = 'params/sms_cage_params.pkl')

### **Load parameters**

In [20]:
cage_2 = Cage(path_json = path_json, n_lfs = n_lfs)
cage_2.load_params(load_path = 'params/sms_cage_params.pkl')

### **predict_proba function of Cage class**
The output(probs_test) is a numpy matrix of shape (num_instances, num_classes) having the probability of a particular instance being that class.

In [21]:
probs_test = cage_2.predict_proba(path_test = T_path_pkl, qc = 0.85) 
#NEED NOT use the same test data(above) used in Cage class before.
print("probs_test shape: ",probs_test.shape)

probs_test shape:  (400, 2)


### **predict function of Cage class**
The output (labels_test) is a numpy array of shape (num_instances,) containing integers if need_strings is False, or strings if it is True, giving the class of each instance. Only the use case with need_strings set to False is shown here.

In [22]:
labels_test = cage_2.predict(path_test = T_path_pkl, qc = 0.85, need_strings = False)
print("labels_test shape: ", labels_test.shape)

from sklearn.metrics import accuracy_score, f1_score

#Y_T is true labels of test data, type is numpy array of shape (num_instances,)
print("accuracy_score: ", accuracy_score(Y_T, labels_test))
print("f1_score: ", f1_score(Y_T, labels_test, average = 'binary'))

labels_test shape:  (400,)
accuracy_score:  0.815
f1_score:  0.5747126436781609


### **Converting numpy array of integers to enums**
The utility below from spear converts the return values of predict (obtained when need_strings is False) to a numpy array of enums.

In [23]:
from spear.utils import get_enum

labels_test_enum = get_enum(np_array = labels_test, enm = ClassLabels) 
#the second argument is the Enum class defined at beginning
print(type(labels_test_enum[0]))

<enum 'ClassLabels'>
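
Conceptually, this conversion is a value lookup on the enum for each integer. The function below is a plain numpy sketch of the idea, not SPEAR's implementation of `get_enum`; the name `ints_to_enums` is illustrative.

```python
import enum

import numpy as np


class ClassLabels(enum.Enum):
    SPAM = 1
    HAM = 0


def ints_to_enums(np_array, enm):
    """Map each integer label to the enum member with that value."""
    return np.array([enm(int(v)) for v in np_array], dtype=object)


converted = ints_to_enums(np.array([1, 0, 0]), ClassLabels)
print(converted.shape)
print(type(converted[0]))
```

A value that has no matching member raises `ValueError`, which is a useful early check that predictions and the enum agree.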


# ***Joint Learning(JL) Algorithm:***
## **Importing JL class (the algorithm) and declaring an object of it**
The JL algorithm needs four types of data (all of which should be labeled using LFs via the PreLabels class):
* Unlabeled data(doesn't have true/gold labels)
* labeled data(have true/gold labels)
* validation data(have true/gold labels)
* test data(have true/gold labels)

<p>All of this data is compulsory for training (passed to the fit_and_predict functions). Note that the amount of labeled or validation data can be small; for example, each can be of the order of 100 instances. Also refer to subset selection to find the subset of your available data to label (by some trustworthy means) and use as 'labeled data', so that this data complements the LFs.</p>
<p>The member functions of JL can be chosen to return fm (feature model) or gm (graphical model) predictions. Using the predictions of fm is highly advised.</p>
<p><b>Note:</b> Multiple calls to fit_* functions continue training the same parameters, i.e., parameters are not reinitialised in fit_* functions. So, to train on large data, one can call fit_* functions repeatedly on smaller chunks. Also, to perform multiple independent runs of the algorithm, one needs to reinitialise the parameters (by creating a new JL object) at the start of each run.</p>

In [24]:
from spear.JL import JL

n_features = 1024
n_hidden = 512
feature_model = 'nn'

jl = JL(path_json = path_json, n_lfs = n_lfs, n_features = n_features, n_hidden = n_hidden, \
        feature_model = feature_model)

### **fit_and_predict_proba function of JL class with two return values**
Here the return_gm argument is True, which returns predictions from the graphical model (Cage) along with those of the feature model. Also note that the test data (path_T) is compulsory here and metric_avg is a single value (instead of a list as in Cage). Each output (probs_fm, probs_gm) is a numpy matrix of shape (num_instances, num_classes) holding the probability of each instance belonging to each class.

In [25]:
loss_func_mask = [1,1,1,1,1,1,1] 
'''
Put 0s in the positions of the loss functions that should not be part of
the final loss function used in training. Refer to the documentation (spear.JL.core.JL)
to see which index of loss_func_mask refers to which loss function.
Note: this loss_func_mask may not be the optimal mask for the sms dataset.
'''
batch_size = 150
lr_fm = 0.0005
lr_gm = 0.01
use_accuracy_score = False

jl = JL(path_json = path_json, n_lfs = n_lfs, n_features = n_features, n_hidden = n_hidden, \
        feature_model = feature_model)

probs_fm, probs_gm = jl.fit_and_predict_proba(path_L = L_path_pkl, path_U = U_path_pkl, path_V = V_path_pkl, \
        path_T = T_path_pkl, loss_func_mask = loss_func_mask, batch_size = batch_size, lr_fm = lr_fm, lr_gm = \
    lr_gm, use_accuracy_score = use_accuracy_score, path_log = log_path_jl_1, return_gm = True, n_epochs = \
    100, start_len = 7,stop_len = 10, is_qt = True, is_qc = True, qt = 0.9, qc = 0.85, metric_avg = 'binary')

labels = np.argmax(probs_fm, 1)
print("probs_fm shape: ", probs_fm.shape)
print("probs_gm shape: ", probs_gm.shape)

early stopping at epoch: 19	best_epoch: 8
score used: f1_score
best_gm_val_score:0.5581395348837209	best_fm_val_score:0.7499999999999999
best_gm_test_score:0.6025641025641025	best_fm_test_score:0.7999999999999999
best_gm_test_precision:0.46078431372549017	best_fm_test_precision:0.7272727272727273
best_gm_test_recall:0.8703703703703703	best_fm_test_recall:0.8888888888888888
probs_fm shape:  (4500, 2)
probs_gm shape:  (4500, 2)
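
The effect of `loss_func_mask` can be pictured as a 0/1 switch on each of the seven loss terms. The sketch below assumes the final loss is a plain mask-weighted sum, which is an illustration rather than a statement of how SPEAR combines the terms; the loss values are made up, and which index maps to which loss is documented in spear.JL.core.JL.

```python
def masked_total(losses, mask):
    """Sum only the loss terms whose mask entry is 1."""
    if len(losses) != len(mask):
        raise ValueError("mask and losses must have the same length")
    return sum(m * loss for m, loss in zip(mask, losses))


# Made-up values for seven loss terms.
toy_losses = [0.5, 0.25, 1.0, 0.125, 2.0, 0.75, 0.375]

full = masked_total(toy_losses, [1, 1, 1, 1, 1, 1, 1])
partial = masked_total(toy_losses, [1, 1, 0, 1, 0, 1, 1])  # drop terms 2 and 4

print(full, partial)
```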


### **fit_and_predict_proba function of JL class with one return value**
Here the return_gm argument is False, which returns predictions only from the feature model. The output (probs_fm) is a numpy matrix of shape (num_instances, num_classes) holding the probability of each instance belonging to each class.

In [26]:
jl = JL(path_json = path_json, n_lfs = n_lfs, n_features = n_features, n_hidden = n_hidden, \
        feature_model = feature_model)

probs_fm = jl.fit_and_predict_proba(path_L = L_path_pkl, path_U = U_path_pkl, path_V = V_path_pkl, \
        path_T = T_path_pkl, loss_func_mask = loss_func_mask, batch_size = batch_size, lr_fm = lr_fm, lr_gm = \
    lr_gm, use_accuracy_score = use_accuracy_score, path_log = log_path_jl_1, return_gm = False, n_epochs = \
    100, start_len = 7,stop_len = 10, is_qt = True, is_qc = True, qt = 0.9, qc = 0.85, metric_avg = 'binary')

labels = np.argmax(probs_fm, 1)
print("probs_fm shape: ", probs_fm.shape)

early stopping at epoch: 27	best_epoch: 16
score used: f1_score
best_gm_val_score:0.5454545454545454	best_fm_val_score:0.631578947368421
best_gm_test_score:0.5949367088607594	best_fm_test_score:0.676056338028169
best_gm_test_precision:0.4519230769230769	best_fm_test_precision:0.5454545454545454
best_gm_test_recall:0.8703703703703703	best_fm_test_recall:0.8888888888888888
probs_fm shape:  (4500, 2)
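
Because the score used is the F1 score, the reported test scores can be cross-checked against the reported precision and recall with F1 = 2PR / (P + R). A short sketch using the numbers from the log above:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the log of this run.
fm_f1 = f1_from(0.5454545454545454, 0.8888888888888888)
gm_f1 = f1_from(0.4519230769230769, 0.8703703703703703)

print("fm f1:", fm_f1)  # compare with best_fm_test_score
print("gm f1:", gm_f1)  # compare with best_gm_test_score
```

In this run the feature model's recall is high while its precision is much lower, which is why its F1 score sits well below its recall.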


### **fit_and_predict function of JL class**
Here the return_gm argument is True. Each output (labels_fm, labels_gm) is a numpy array of shape (num_instances,) containing integers (because need_strings is False), giving the class of each instance.

In [27]:
jl = JL(path_json = path_json, n_lfs = n_lfs, n_features = n_features, n_hidden = n_hidden, \
        feature_model = feature_model)

labels_fm, labels_gm = jl.fit_and_predict(path_L = L_path_pkl, path_U = U_path_pkl, path_V = V_path_pkl, \
        path_T = T_path_pkl, loss_func_mask = loss_func_mask, batch_size = batch_size, lr_fm = lr_fm, lr_gm = \
    lr_gm, use_accuracy_score = use_accuracy_score, path_log = log_path_jl_1, return_gm = True, n_epochs = \
    100, start_len = 7,stop_len = 10, is_qt = True, is_qc = True, qt = 0.9, qc = 0.85, metric_avg = 'binary', \
    need_strings = False)

print("labels_fm shape: ", labels_fm.shape)
print("labels_gm shape: ", labels_gm.shape)
print(type(labels_fm[0]))
print(type(labels_gm[0]))

early stopping at epoch: 27	best_epoch: 16
score used: f1_score
best_gm_val_score:0.5581395348837209	best_fm_val_score:0.7058823529411764
best_gm_test_score:0.5911949685534591	best_fm_test_score:0.7868852459016393
best_gm_test_precision:0.44761904761904764	best_fm_test_precision:0.7058823529411765
best_gm_test_recall:0.8703703703703703	best_fm_test_recall:0.8888888888888888
labels_fm shape:  (4500,)
labels_gm shape:  (4500,)
<class 'numpy.int64'>
<class 'numpy.int64'>


### **fit_and_predict function of JL class**
Here the return_gm argument is True. Each output (labels_fm, labels_gm) is a numpy array of shape (num_instances,) containing strings (because need_strings is True), giving the class of each instance.

In [28]:
jl = JL(path_json = path_json, n_lfs = n_lfs, n_features = n_features, n_hidden = n_hidden, \
        feature_model = feature_model)

labels_fm, labels_gm = jl.fit_and_predict(path_L = L_path_pkl, path_U = U_path_pkl, path_V = V_path_pkl, \
        path_T = T_path_pkl, loss_func_mask = loss_func_mask, batch_size = batch_size, lr_fm = lr_fm, lr_gm = \
    lr_gm, use_accuracy_score = use_accuracy_score, path_log = log_path_jl_1, return_gm = True, n_epochs = \
    100, start_len = 7,stop_len = 10, is_qt = True, is_qc = True, qt = 0.9, qc = 0.85, metric_avg = 'binary', \
    need_strings = True)

print("labels_fm shape: ", labels_fm.shape)
print("labels_gm shape: ", labels_gm.shape)
print(type(labels_fm[0]))
print(type(labels_gm[0]))

early stopping at epoch: 25	best_epoch: 14
score used: f1_score
best_gm_val_score:0.5714285714285715	best_fm_val_score:0.7272727272727273
best_gm_test_score:0.6025641025641025	best_fm_test_score:0.8235294117647058
best_gm_test_precision:0.46078431372549017	best_fm_test_precision:0.7538461538461538
best_gm_test_recall:0.8703703703703703	best_fm_test_recall:0.9074074074074074
labels_fm shape:  (4500,)
labels_gm shape:  (4500,)
<class 'numpy.str_'>
<class 'numpy.str_'>


### **Save parameters**
<p>Make sure the pickle you are passing here is <b>EMPTY</b></p>

In [29]:
jl.save_params(save_path = 'params/sms_jl_params.pkl')

### **Load parameters**

In [30]:
jl_2 = JL(path_json = path_json, n_lfs = n_lfs, n_features = n_features, n_hidden =  n_hidden, \
          feature_model = feature_model)
jl_2.load_params(load_path = 'params/sms_jl_params.pkl')

### **predict_fm/gm_proba functions of JL class**
Each output (probs_fm_test, probs_gm_test) is a numpy matrix of shape (num_instances, num_classes) holding the probability of each instance belonging to each class.
<p>Note that predict_fm_proba takes a feature matrix (which can also be obtained from the pickle file using get_data()) as its argument, while predict_gm_proba takes a pickle file (containing the labels given by LFs) as its argument.</p>

In [31]:
probs_fm_test = jl_2.predict_fm_proba(x_test = X_feats_T)
probs_gm_test = jl_2.predict_gm_proba(path_test = T_path_pkl, qc = 0.85)

print("probs_fm_test shape: ", probs_fm_test.shape)
print("probs_gm_test shape: ", probs_gm_test.shape)

probs_fm_test shape:  (400, 2)
probs_gm_test shape:  (400, 2)


### **predict_fm/gm functions of JL class**
Each output (labels_fm_test, labels_gm_test) is a numpy array of shape (num_instances,) containing integers if need_strings is False, or strings if it is True, giving the class of each instance. Only the use case with need_strings set to False is shown here.
<p>Note that predict_fm takes a feature matrix (which can also be obtained from the pickle file using get_data()) as its argument, while predict_gm takes a pickle file (containing the labels given by LFs) as its argument.</p>

In [32]:
labels_fm_test = jl_2.predict_fm(x_test = X_feats_T, need_strings=False)
labels_gm_test = jl_2.predict_gm(path_test = T_path_pkl, qc = 0.85, need_strings=False)

print("labels_fm_test shape: ", labels_fm_test.shape)
print("labels_gm_test shape: ", labels_gm_test.shape)

from sklearn.metrics import accuracy_score, f1_score

#Y_T is true labels of test data, type is numpy array of shape (num_instances,)
print("accuracy_score of gm: ", accuracy_score(Y_T, labels_gm_test), "| fm: ", accuracy_score(Y_T, labels_fm_test))
print("f1_score of gm: ", f1_score(Y_T, labels_gm_test, average = 'binary'), "| fm: ", f1_score(Y_T, labels_fm_test, average = 'binary'))

labels_fm_test shape:  (400,)
labels_gm_test shape:  (400,)
accuracy_score of gm:  0.815 | fm:  0.9475
f1_score of gm:  0.5595238095238095 | fm:  0.8235294117647058


### **Converting numpy array of integers to enums**
The utility below from spear converts the return values of predict_fm and predict_gm (obtained when need_strings is False) to a numpy array of enums.

In [33]:
from spear.utils import get_enum

labels_fm_test_enum = get_enum(np_array = labels_fm_test, enm = ClassLabels) 
#the second argument is the Enum class defined at beginning
print(type(labels_fm_test_enum[0]))

<enum 'ClassLabels'>
