Added removeDocument and retrain #36

Merged
merged 1 commit into from

3 participants

Matthew Eernisse Chris Umbel Michael Latman
Matthew Eernisse
mde commented

Wanted the ability to remove something from a category, so I added the removeDocument method. However, it looks like train is both incremental and additive-only, so the most straightforward way to do it without a lot of rewriting seemed to be adding a retrain that wipes the slate and starts over.
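To illustrate why an incremental, additive-only train cannot see a removal, here is a minimal sketch using a toy stand-in rather than the library's real classifier. The `ToyClassifier` name and its `counts` model are hypothetical; only `docs`, `lastAdded`, `train`, and `retrain` mirror names that appear in this PR's diff.

```javascript
// Toy stand-in for the classifier. `counts` (documents seen per label)
// is a hypothetical model, not natural's real internals.
function ToyClassifier() {
  this.docs = [];
  this.lastAdded = 0;
  this.counts = {};
}

ToyClassifier.prototype.addDocument = function (tokens, label) {
  this.docs.push({ text: tokens, label: label });
};

// Incremental and additive-only: only documents added since the last
// call are folded into the model, so removals are never noticed.
ToyClassifier.prototype.train = function () {
  for (var i = this.lastAdded; i < this.docs.length; i++) {
    var label = this.docs[i].label;
    this.counts[label] = (this.counts[label] || 0) + 1;
  }
  this.lastAdded = this.docs.length;
};

// Wipe the model and replay every remaining document.
ToyClassifier.prototype.retrain = function () {
  this.counts = {};
  this.lastAdded = 0;
  this.train();
};

var toy = new ToyClassifier();
toy.addDocument(['foo', 'bar'], 'good');
toy.addDocument(['qux'], 'bad');
toy.train();

toy.docs.splice(1, 1); // drop the 'bad' document
toy.train();
var staleCounts = JSON.stringify(toy.counts); // still counts 'bad'

toy.retrain();
var freshCounts = JSON.stringify(toy.counts); // only 'good' remains
```

The trade-off is cost: a full replay is linear in the number of stored documents, which is acceptable for small training sets but may not be for large ones.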

The only thing I'm really dubious about is the ramifications of removing a text-item from features, in the case where the same thing exists in documents with multiple classifiers.
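One possible way around that concern, sketched here under assumptions rather than taken from the PR, is to rebuild the feature map from the documents that remain instead of deleting the removed document's tokens outright. `rebuildFeatures` is a hypothetical helper, and storing a per-token count as the value is an assumption; the `{ label, text }` document shape follows the fields used in the diff.

```javascript
// Hypothetical helper, not part of this PR: recompute the feature map
// from whatever documents remain after a removal.
// Assumes each doc is { label: string, text: string[] } and that the
// feature map is a plain object keyed by token.
function rebuildFeatures(docs) {
  var features = {};
  docs.forEach(function (doc) {
    doc.text.forEach(function (token) {
      // Count occurrences so a token shared by several documents
      // survives when only one of them is removed.
      features[token] = (features[token] || 0) + 1;
    });
  });
  return features;
}

var sampleDocs = [
  { label: 'good', text: ['foo', 'bar'] },
  { label: 'bad', text: ['foo', 'qux'] }
];

sampleDocs.splice(1, 1); // remove the 'bad' document
var rebuilt = rebuildFeatures(sampleDocs);
// 'foo' is still present because the 'good' document uses it;
// 'qux' is gone because no remaining document contains it.
```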

If this is completely crazy, I'd appreciate feedback on a better way to approach adding this feature.

I've also added a test for this -- the test for good/bad equality seems a little brittle, but as long as the classifications format doesn't change, it should work correctly. I am a little curious how the classify method returns a value in the case where categorizations are all equal. Does it just pick the first one it finds? Is this desirable behavior? If something can't be reasonably categorized, would returning null be too weird?
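As a rough sketch of the "return null on a tie" idea raised above: the helper below takes an array of `{ label, value }` items, the shape the new test reads from getClassifications, and returns null when the top two values are equal. `classifyOrNull` and its `epsilon` tolerance are hypothetical and not part of the library.

```javascript
// Hypothetical helper: pick the best label, or null when the top two
// candidates are tied (within an optional tolerance).
function classifyOrNull(classifications, epsilon) {
  epsilon = epsilon || 0;
  if (!classifications.length) {
    return null;
  }
  // Sort a copy so the caller's array is left untouched.
  var sorted = classifications.slice().sort(function (a, b) {
    return b.value - a.value;
  });
  if (sorted.length > 1 &&
      Math.abs(sorted[0].value - sorted[1].value) <= epsilon) {
    return null;
  }
  return sorted[0].label;
}

var tied = classifyOrNull([
  { label: 'good', value: 0.5 },
  { label: 'bad', value: 0.5 }
]);

var clear = classifyOrNull([
  { label: 'bad', value: 0.3 },
  { label: 'good', value: 0.7 }
]);
```

Whether null is the right signal is a design choice: callers that assume a string label would need a guard.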

Thanks for the work on this. I plan to use this in a Hack Day project at Yammer. :)

Chris Umbel
Owner

sounds reasonable so far. i'll look this over within the next 48 hours or so as i'm a bit backed up now. thanks for the contribution regardless!

Michael Latman

One year later

Chris Umbel
Owner

Yeah, I'm looking for someone to take over the operations here as I don't have time to attend to issues much these days. In the meantime I'll try to have a look within the coming days.

I'll have to review the logic and resolve conflicts so it won't be super quick.

Sorry again.

Chris Umbel chrisumbel merged commit 91b9387 into from
Commits on May 2, 2012
  1. Added removeDocument and retrain.

    mde authored
35 lib/natural/classifiers/classifier.js
@@ -45,6 +45,33 @@ function addDocument(text, classification) {
     }
 }
+function removeDocument(text, classification) {
+    var docs = this.docs
+      , doc
+      , pos;
+
+    if (typeof text === 'string') {
+        text = this.stemmer.tokenizeAndStem(text);
+    }
+
+    for (var i = 0, ii = docs.length; i < ii; i++) {
+        doc = docs[i];
+        if (doc.text.join(' ') == text.join(' ') &&
+                doc.label == classification) {
+            pos = i;
+        }
+    }
+
+    // Remove if there's a match
+    if (!isNaN(pos)) {
+        this.docs.splice(pos, 1);
+
+        for (var i = 0, ii = text.length; i < ii; i++) {
+            delete this.features[text[i]];
+        }
+    }
+}
+
 function textToFeatures(observation) {
     var features = [];
@@ -71,6 +98,12 @@ function train() {
     this.classifier.train();
 }
+function retrain() {
+    this.classifier = new (this.classifier.constructor)();
+    this.lastAdded = 0;
+    this.train();
+}
+
 function getClassifications(observation) {
     return this.classifier.getClassifications(this.textToFeatures(observation));
 }
@@ -106,7 +139,9 @@ function load(filename, callback) {
 }
 Classifier.prototype.addDocument = addDocument;
+Classifier.prototype.removeDocument = removeDocument;
 Classifier.prototype.train = train;
+Classifier.prototype.retrain = retrain;
 Classifier.prototype.classify = classify;
 Classifier.prototype.textToFeatures = textToFeatures;
 Classifier.prototype.save = save;
40 spec/bayes_classifier_spec.js
@@ -54,7 +54,7 @@ describe('bayes classifier', function() {
         expect(classifier.getClassifications('i write code')[1].label).toBe('literature');
     });
-    it('should classify with arrays', function() {
+    it('should classify with strings', function() {
         var classifier = new natural.BayesClassifier();
         classifier.addDocument('i fixed the box', 'computing');
         classifier.addDocument('i write code', 'computing');
@@ -69,6 +69,44 @@ describe('bayes classifier', function() {
         expect(classifier.classify('read all the books')).toBe('literature');
     });
+    it('should classify and re-classify after document-removal', function() {
+        var classifier = new natural.BayesClassifier()
+          , arr
+          , item
+          , classifications = {};
+
+        // Add some good/bad docs and train
+        classifier.addDocument('foo bar baz', 'good');
+        classifier.addDocument('qux zooby', 'bad');
+        classifier.addDocument('asdf qwer', 'bad');
+        classifier.train();
+
+        expect(classifier.classify('foo')).toBe('good');
+        expect(classifier.classify('qux')).toBe('bad');
+
+        // Remove one of the bad docs, retrain
+        classifier.removeDocument('qux zooby', 'bad');
+        classifier.retrain();
+
+        // Simple `classify` will still return a single result, even if
+        // ratio for each side is equal -- have to compare actual values in
+        // the classifications, should be equal since qux is unclassified
+        arr = classifier.getClassifications('qux');
+        for (var i = 0, ii = arr.length; i < ii; i++) {
+            item = arr[i];
+            classifications[item.label] = item.value;
+        }
+        expect(classifications.good).toEqual(classifications.bad);
+
+        // Re-classify as good, retrain
+        classifier.addDocument('qux zooby', 'good');
+        classifier.retrain();
+
+        // Should now be good, original docs should be unaffected
+        expect(classifier.classify('foo')).toBe('good');
+        expect(classifier.classify('qux')).toBe('good');
+    });
+
     it('should serialize and deserialize a working classifier', function() {
         var classifier = new natural.BayesClassifier();
         classifier.addDocument('i fixed the box', 'computing');