
Active learning query strategies #119

Open
rth opened this issue Mar 21, 2017 · 0 comments

rth commented Mar 21, 2017

While implementing query strategies for iterative text categorization (active learning) is probably beyond the scope of FreeDiscovery, this issue aims to ensure that the output of the FreeDiscovery API contains sufficient information to apply active learning query strategies to it.

Here is a brief overview of possible query approaches (adapted from Wikipedia):

  1. Uncertainty sampling: label those points for which the current model is least certain what the correct output should be. [For an SVM, for instance, this could be the distance to the separating hyperplane.]

Since we return the decision_function (typically in [-eps, +eps]), this would mean selecting the points whose decision_function is closest to zero (lowest absolute value).
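The selection described above could be sketched as follows. This is a minimal illustration assuming a scikit-learn-style classifier that exposes `decision_function`; the `uncertainty_query` helper and the synthetic data are hypothetical, not part of the FreeDiscovery API:

```python
import numpy as np
from sklearn.svm import LinearSVC

def uncertainty_query(clf, X_unlabeled, n_queries=5):
    """Return indices of the n_queries points closest to the decision boundary,
    i.e. with the lowest absolute decision_function value."""
    margins = np.abs(clf.decision_function(X_unlabeled))
    return np.argsort(margins)[:n_queries]

# Hypothetical usage on synthetic data
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] > 0).astype(int)
clf = LinearSVC().fit(X, y)
query_idx = uncertainty_query(clf, X, n_queries=5)
```

The returned indices are the candidates a user would send back to the oracle for labeling before the next categorization run.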

  2. Query by committee: a variety of models are trained on the current labeled data and vote on the output for unlabeled data; label those points for which the "committee" disagrees the most.

Currently this would mean running multiple categorizations with different algorithms and combining the results.
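Given the predicted labels from several such categorization runs, disagreement can be scored with vote entropy, for example. This is a sketch under the assumption that the committee's predictions have already been collected into an array; `vote_entropy_query` is a hypothetical helper, not an existing FreeDiscovery function:

```python
import numpy as np

def vote_entropy_query(committee_predictions, n_queries=5):
    """committee_predictions: array of shape (n_models, n_samples) holding each
    model's predicted label per sample. Return indices of the n_queries samples
    with the highest vote entropy (i.e. the most committee disagreement)."""
    n_models, n_samples = committee_predictions.shape
    scores = np.empty(n_samples)
    for i in range(n_samples):
        # Vote distribution over labels for sample i
        _, counts = np.unique(committee_predictions[:, i], return_counts=True)
        p = counts / n_models
        scores[i] = -np.sum(p * np.log(p))
    # Highest-entropy (most disagreed-upon) samples first
    return np.argsort(scores)[::-1][:n_queries]

# Hypothetical usage: 3 models, 3 samples
preds = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [0, 1, 0]])
query_idx = vote_entropy_query(preds, n_queries=3)
```

In this toy example the committee is unanimous on the first sample and split 2-to-1 on the other two, so the unanimous sample ranks last.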

  3. Expected model change: label those points that would most change the current model
  4. Expected error reduction: label those points that would most reduce the model's generalization error
  5. Variance reduction: label those points that would minimize output variance, which is one of the components of error

None of these are implemented. They also sound quite computationally expensive, as each would require a large number of training / scoring iterations.

  6. Balance exploration and exploitation: the choice of examples to label is seen as a dilemma between exploration and exploitation over the data space representation. This strategy manages the compromise by modelling the active learning problem as a contextual bandit problem. For example, Bouneffouf et al. [6] propose a sequential algorithm named Active Thompson Sampling (ATS) which, in each round, assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for that point's label.
  7. Exponentiated gradient exploration for active learning: the authors of the referenced paper propose a sequential algorithm named exponentiated gradient (EG)-active that can improve any active learning algorithm by an optimal random exploration.

Probably not applicable.

In addition, there are a few existing active learning libraries in Python, namely libact and iitml/AL, and it could be worth considering whether integrating them into FreeDiscovery would be beneficial...

@rth rth added this to the v2.0 milestone Mar 21, 2017