
Manipulating the training and test sets #11

Open · sreevarsha opened this issue Mar 8, 2019 · 21 comments

@sreevarsha

Hi, I have a slight data-manipulation problem. Instead of splitting the entire data set automatically with split_AL, I want to use a pre-defined training set and draw query instances from a pre-defined test set. How might I go about doing this using alipy?

Please advise, thanks!!!

@tangypnuaa (Collaborator)

Hi, sreevarsha. You can pass train_idx, test_idx, label_idx, and unlabel_idx when initializing the ToolBox object if you have your own data-split setting. Each of them should be a 2D list with shape [n_split_count, n_indexes].

For example:

train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y)
alibox = alipy.ToolBox(X=X, y=y, query_type='AllLabels',
                       train_idx=train_idx, test_idx=test_idx,
                       label_idx=label_idx, unlabel_idx=unlabel_idx)

If you have independent training data X_train, just make sure that label_index and unlabel_index refer to the X parameter passed at initialization. (If you are using QUIRE, which needs train_idx, you can pass [i for i in range(len(X_train))].)

import numpy as np
from alipy.query_strategy import QueryInstanceRandom

X_train = np.random.randn(3, 3)
y_train = [0, 0, 1]
# label_index and unlabel_index refer to rows of X_train
random = QueryInstanceRandom(X=X_train, y=y_train)
random.select(label_index=[0, 1], unlabel_index=[2], batch_size=1)

@sreevarsha (Author)

Hi, thanks so much for your response.

A couple of things: (1) I'm not splitting the data set in the code; I already have separate train and test sets. I merely want to read them in and use them as-is. So how do I specify that when I call ToolBox?
(2) I want to select the labels from what I have defined as the test set. I want to keep the train set as-is and draw instances from the test set as queries, in effect creating a pseudo train set with the new instances added after each query (I'm using QueryInstanceUncertainty).

Could you shed some light on these, please?

Thanks so much!!!

@tangypnuaa (Collaborator)

tangypnuaa commented Mar 8, 2019

So you have a labeled train set and an unlabeled test set for querying, and all you want to do is label some instances from the test set.

Please try this:

import numpy as np
from alipy.query_strategy import QueryInstanceUncertainty
from alipy.index import IndexCollection
from alipy.oracle import MatrixRepository

# read your data
X_train = np.random.randn(100, 10)
y_train = np.random.randint(low=0, high=2, size=100)
X_test = np.random.rand(100, 10)
y_test = np.random.randint(low=0, high=2, size=100)  # can be anything; the algorithm will not use the labels of the unlabeled set

unc = QueryInstanceUncertainty(X=np.vstack((X_test, X_train)), y=np.hstack((y_test, y_train)))
unlab_ind = np.arange(100)   # indexes of your test set for querying
label_ind = np.arange(start=100, stop=200)  # indexes of your train set
labeled_repo = MatrixRepository(examples=X_train, labels=y_train, indexes=label_ind)   # a repository to store the labeled instances
unlab_ind = IndexCollection(unlab_ind)
label_ind = IndexCollection(label_ind)

# query 50 times (replace with your own stopping criterion)
for i in range(50):
    # To use your own model, please see # issue 2
    select_ind = unc.select(label_index=label_ind, unlabel_index=unlab_ind, model=None, batch_size=1)
    label_ind.update(select_ind)
    unlab_ind.difference_update(select_ind)

    # label the selected instance here (replace this constant with the label you actually obtain)
    selected_instance = X_test[select_ind]
    lab_of_ins = 1

    # add the labeled example to the repo
    labeled_repo.update_query(labels=lab_of_ins, indexes=select_ind, examples=selected_instance)

    # if you are using your own model, update your model here, and pass it to unc.select()
    # X_lab, y_lab, ind = labeled_repo.get_training_data()
    # model.fit(X_lab, y_lab)

    # if you are using default model (model=None), update the label matrix of the query strategy here
    # unc.y[select_ind] = lab_of_ins

# See the information of your labeling history
print(labeled_repo.full_history())
# Get the labeled set
print(labeled_repo.get_training_data())

import pickle
with open('my_labeled_set.pkl', 'wb') as f:
    pickle.dump(labeled_repo, f)

@sreevarsha (Author)

Thanks so much, I'll try this :)

@sreevarsha (Author)

Hi, I am having a strange problem. If I update the data after choosing the query from the test set as you have shown,

X_lab, y_lab, ind = labeled_repo.get_training_data()
model.fit(X_lab, y_lab)

the accuracies I get when I test on my test set are of the order of 65%.

But when I do it like this,

smallx = unc.X[label_ind.index, :]
smally = unc.y[label_ind.index]
model.fit(smallx, smally)

I get accuracies of the order of ~85%. Am I doing something wrong here? I checked the sizes of both X parameters and they are the same. Since label_ind is already updated, can't I also use it to index my training sets? Please let me know what you think. Thanks so much!!

@tangypnuaa (Collaborator)

Hi, if you have the ground-truth labels of your unlabeled set, pass them to the query strategy (y=np.hstack((y_unlab_gt, y_lab))), AND add queries with your ground-truth labels,

    # Label the selected instance here
    selected_instance = X_unlab[select_ind]
    lab_of_ins = y_unlab_gt[select_ind]
    labeled_repo.update_query(labels=lab_of_ins, indexes=select_ind, examples=selected_instance)

Then X_lab, y_lab, ind = labeled_repo.get_training_data() should be the same as X_lab = unc.X[label_ind.index, :]; y_lab = unc.y[label_ind.index].

Maybe you did not replace the example labeling code lab_of_ins = 1 with your own code. Would you please paste your full program so that I can help you?

Besides, I think we have different definitions of the test set. In an AL experiment, your data should be divided into a test set (fully labeled and only for testing the model, not for querying or training) and a train set, and the train set should be further split into a labeled set and an unlabeled set (for querying):

test ∪ train = the whole dataset
labeled ∪ unlabeled = train

If you are labeling real data that has no test set for evaluating the model, there should be an initially labeled set and an unlabeled pool for querying. You can further draw a validation set out of the initially labeled set for testing (and only for testing).
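
For example, a minimal sketch of this split (the sizes here are arbitrary assumptions):

import numpy as np

n = 200
all_ind = np.arange(n)
test_ind = all_ind[:50]        # fully labeled, only for evaluating the model
train_ind = all_ind[50:]       # test ∪ train = the whole dataset
label_ind = train_ind[:20]     # initially labeled part of the train set
unlab_ind = train_ind[20:]     # unlabeled pool for querying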

@sreevarsha (Author)

Hi, I think I managed to solve this issue. I created a separate validation set from which queries are drawn, and now it works.

How do I write the classifier to a file that can be loaded separately and used on another data set?

Thanks!!!

@tangypnuaa (Collaborator)

Hi, have you tried the pickle module?

You can save and load objects very easily with pickle. (Note that if you switch to another data set, the classifier should be re-trained with the new data.)

If your object has attributes that cannot be pickled, you can override the __setstate__ and __getstate__ methods for it.
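
For example, a minimal sketch (model and X_new are placeholders for your fitted classifier and the new feature matrix):

import pickle

# save the trained classifier to a file
with open('my_classifier.pkl', 'wb') as f:
    pickle.dump(model, f)

# load it later; re-train it first if you switch to another data set
with open('my_classifier.pkl', 'rb') as f:
    model = pickle.load(f)
pred = model.predict(X_new)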

@sreevarsha (Author)

Thanks, I did, and it works.

Now I have a new problem. If I choose 5 instances in one query, like this:

select_ind = unc.select(label_ind, unlab_ind, model=None, batch_size=5)
label_ind.update(select_ind)
unlab_ind.difference_update(select_ind)
selected_instance = X_test[select_ind]
lab_of_ins = 5
labeled_repo.update_query(labels=lab_of_ins, indexes=select_ind, examples=selected_instance)

I get an error message:

File "/home/sreejith/anaconda3/lib/python3.6/site-packages/alipy/oracle/knowledge_repository.py", line 422, in update_query
raise ValueError("Different length of parameters found. "
ValueError: Different length of parameters found. They should have the same length and is one-to-one correspondence.

I checked the lengths and they seem to correspond, so I am at a loss as to why this happens. Please advise, thanks!

@tangypnuaa (Collaborator)

Hi, lab_of_ins is a list of the labels corresponding to the selected instances. So make sure they have the same length (assert len(select_ind) == len(lab_of_ins)) and correspond one-to-one.
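
For example, a minimal sketch of labeling a batch of 5, reusing the names from the loop above (and assuming the ground-truth labels y_test are available):

select_ind = unc.select(label_ind, unlab_ind, model=None, batch_size=5)
lab_of_ins = [y_test[i] for i in select_ind]   # one label per selected index
assert len(select_ind) == len(lab_of_ins)
labeled_repo.update_query(labels=lab_of_ins, indexes=select_ind, examples=X_test[select_ind])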

@sreevarsha (Author)

Hi, thanks so much! I understand now.

@sreevarsha (Author)

Hi, I am using my own model and routing it through alipy for active learning. I am choosing QueryInstanceUncertainty and specifying the entropy measure. Why, then, should I again choose a model (the default logistic regression) to pick the learning examples? When I specify uncertainty sampling and a batch size, shouldn't the algorithm already know how to choose the examples?

Also, when I update the training set with the queried elements from the test set, do they automatically get deleted from the test set? If not, how can I remove them? Please advise.

Thank you so much!!!

@tangypnuaa (Collaborator)

1. The best option is to use your target model to select instances. But if you are not using a sklearn model AND you don't know how to re-encapsulate it, you can use the default model for convenience, at the expense of some performance. See issue #2 and the documentation for more information. Note that uncertainty has the function select_by_prediction_mat(): instead of passing your model, you can provide its probabilistic prediction matrix, whose shape is usually [n_samples, n_classes].

2. No, they will not be deleted automatically, and we do not recommend deleting them. ALiPy keeps the original data untouched and only records the indexes of the selected data:

label_ind.update(select_ind)
unlab_ind.difference_update(select_ind)

The labeled data and unlabeled data can be obtained by indexing the original feature matrix.
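
For example, a minimal sketch of select_by_prediction_mat() (my_model is a placeholder for your own fitted classifier with a predict_proba method; the rows of the matrix are assumed to correspond, one-to-one and in order, to the unlabeled indexes):

prob_mat = my_model.predict_proba(unc.X[unlab_ind.index])   # shape [n_unlabeled, n_classes]
select_ind = unc.select_by_prediction_mat(unlabel_index=unlab_ind, predict=prob_mat, batch_size=1)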

@sreevarsha (Author)

Hi, thanks for your reply.

1. So, if I do it like this, where gc is my model (I do an initial fit before running it through alipy; it is a combination of 4 scikit-learn models and it gives a predict_proba output):

select_ind = unc.select(label_ind, unlab_ind, model=gc, batch_size=10)

2. Yes, I am already doing that, but if those instances are not removed from the test set, it might lead to some (however small) amount of overfitting. So I think I'll remove them using numpy:

X_test = np.delete(X_test, select_ind, axis=0)
y_test = np.delete(y_test, select_ind, axis=0)

Thanks so much for your reply.

@sreevarsha (Author)

Hi, I'm so sorry to bother you again, but is it possible to add metrics that I define into the saver? Instead of just accuracy, I'd like to pass things like efficiency, purity, etc. Thanks!

@tangypnuaa (Collaborator)

Hi, I'm glad to answer your question; please feel free to ask.
The State object can be used like a dict in Python.
You can add any key-value pairs with the add_element and get_value methods, or with the indexing operation directly:

state.add_element(key='my_entry', value=my_value)
# or use the index operation directly
state['my_entry'] = my_value

value = state.get_value(key='my_entry')
# or use the index operation directly
value = state['my_entry']

You can see the document for more details.
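
For example, a minimal sketch inside the query loop (purity_score is a placeholder for your own metric, and saver is an alipy StateIO object):

from alipy.experiment import State

st = State(select_index=select_ind, performance=accuracy)
st['purity'] = purity_score(y_true, y_pred)   # custom entry
saver.add_state(st)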

@sreevarsha (Author)

Thanks so much :) I will try that!

@sreevarsha (Author)

Hi, my alipy was working perfectly until yesterday. But today, every time I run my code, when I collect the indexes of the unlabelled instances, like this:

unlab_ind = IndexCollection(unlab_ind)

it gives me an error:

"self._innercontainer = list(np.unique([i for i in data], axis=0))
TypeError: unique() got an unexpected keyword argument 'axis'"

Could you tell me why this happens and what I can do to rectify it, please? Thanks so much.

@tangypnuaa (Collaborator)

Hi, maybe your numpy version is outdated. We will update the dependency information of ALiPy in the future; sorry for the inconvenience. For now, try upgrading your numpy:

pip install -U numpy==1.16.2
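
To check which version you currently have:

import numpy
print(numpy.__version__)   # np.unique gained the axis argument in numpy 1.13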

@sreevarsha (Author)

Thanks, that worked :)

I am testing a querying strategy where I have 3 distinct sets: a train set, a query set, and a test set. The train set and the query set have the same number of objects, so when I draw a query from the query set, its indexes coincide with those of the train set, leading to an error message such as this:

"RepeatElementWarning: Adding element 129 has already in the collection, skip.
label_ind_RF.update(select_ind_RF)
/home/sreejith/anaconda3/lib/python3.6/site-packages/alipy/oracle/knowledge_repository.py:381: RepeatElementWarning: Repeated index is found when adding element to knowledge repository. Skip this item
category=RepeatElementWarning)"

Is there a way to get around this issue? Please advise; thanks so much in advance.

@tangypnuaa (Collaborator)

Hi, I guess you separated your dataset into 3 distinct sets and want to manipulate the instances directly, marking your data with 3 index arrays, e.g.:

train_X = np.random.randn(50, 10)   # 50 instances
query_X = np.random.randn(50, 10)   # 50 instances
test_X = np.random.randn(100, 10)   # 100 instances

train_set = range(50)
query_set = range(50)
test_set = range(100)

However, ALiPy keeps the original data untouched and only manipulates the indexes of the data. We assume that each instance has a unique index number, so the RepeatElementWarning should never appear, e.g.:

X = np.random.randn(200, 10)

train_set = np.arange(50)
query_set = np.arange(start=50, stop=100)
test_set = np.arange(start=100, stop=200)

To resolve this conflict, you can try to understand the following code:

X = np.vstack((train_X, query_X, test_X))   # concatenate your subsets
y = np.hstack((train_y, query_y, test_y))   # the corresponding labels (placeholders are fine for the unlabeled parts)

train_set = np.arange(50)
query_set = np.arange(start=50, stop=100)
test_set = np.arange(start=100, stop=200)

# query
unc = QueryInstanceUncertainty(X, y)   # the whole dataset, not a subset
unc.select(label_index=train_set, unlabel_index=query_set, model=model, batch_size=1)  # each index refers to the whole dataset

# update index
train_set = IndexCollection(train_set)
train_set.update([60])   # e.g., index 60 was queried

# the updated train_X can be obtained by indexing X with the updated indexes
new_train_X = X[train_set.index]
