
improved LIME for text data #110

Merged: 60 commits merged into master from the lime-text branch on Dec 30, 2016

Conversation

kmike (Contributor) commented Dec 13, 2016

This is a work-in-progress.

TODO:

  • tutorial for debugging complex text processing pipelines using LIME;
  • cleanup LIME documentation, remove redundancy;
  • TextExplainer should handle target_names better when there are no examples of some class in the generated dataset;
  • fix test coverage

This PR also fixes #102 and #39.
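For context, the LIME idea this PR implements for text can be sketched in miniature: perturb a document by dropping tokens, query the black-box classifier on the perturbed texts, then fit a distance-weighted white-box linear model whose coefficients explain the prediction. A self-contained illustration (pure numpy; `black_box` and all other names here are hypothetical stand-ins, not the eli5 API):

```python
import numpy as np

def black_box(texts):
    # hypothetical opaque classifier: P(positive) is higher when "good" appears
    return np.array([[0.4, 0.6] if "good" in t.split() else [0.8, 0.2]
                     for t in texts])

def lime_weights(doc, predict_proba, n_samples=500, seed=0):
    rng = np.random.RandomState(seed)
    tokens = doc.split()
    masks = rng.randint(0, 2, size=(n_samples, len(tokens)))  # 1 = token kept
    texts = [" ".join(t for t, keep in zip(tokens, row) if keep)
             for row in masks]
    y = predict_proba(texts)[:, 1]        # black-box P(class=1) per sample
    sim = masks.mean(axis=1)              # crude similarity to the original doc
    sw = np.sqrt(sim)                     # sqrt for weighted least squares
    X = np.hstack([masks, np.ones((n_samples, 1))])  # last column: intercept
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return dict(zip(tokens, coef[:-1]))   # per-token contribution weights

weights = lime_weights("this movie is good", black_box)
```

Here `weights["good"]` comes out largest, mirroring how the white-box model in TextExplainer surfaces which tokens drive the black-box prediction.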

codecov-io commented Dec 14, 2016

Current coverage is 97.27% (diff: 98.78%)

Merging #110 into master will decrease coverage by 0.03%

@@             master       #110   diff @@
==========================================
  Files            35         37     +2   
  Lines          1894       2090   +196   
  Methods           0          0          
  Messages          0          0          
  Branches        362        390    +28   
==========================================
+ Hits           1843       2033   +190   
- Misses           24         28     +4   
- Partials         27         29     +2   
Diff Coverage   File Path
 96%   eli5/lime/lime.py
 97%   eli5/lime/_vectorizer.py (new)
 97%   eli5/lime/utils.py
100%   eli5/lime/textutils.py
100%   eli5/sklearn_crfsuite/explain_weights.py
100%   eli5/sklearn/_span_analyzers.py (new)
100%   eli5/utils.py
100%   eli5/sklearn/utils.py
100%   eli5/_graphviz.py
100%   eli5/ipython.py

27 files changed.

Powered by Codecov. Last update 5f4e975...aebd959

@kmike kmike force-pushed the lime-text branch 5 times, most recently from 24e79b4 to b73b393 Compare December 27, 2016 19:06
@kmike kmike added this to the 0.3 milestone Dec 28, 2016
@kmike kmike changed the title [wip] improved LIME for text data improved LIME for text data Dec 29, 2016
* add decision tree example;
* add some docs for sampling;
* other documentation improvements.
kmike (Contributor, Author) commented Dec 29, 2016

//cc @lopuhin I think this is ready. Do you have any comments?

Tutorial doesn't explain all options - most notably, it doesn't use the position_dependent=True flag and doesn't explain the rbf_sigma parameter. I know how they work and what they are intended to do, but I don't know when one should use them in practice, so they are not in the tutorial :)

Tutorial: https://github.com/TeamHG-Memex/eli5/blob/d6786c56f51d8fa485829adcf641967ff8839416/notebooks/TextExplainer.ipynb
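For readers wondering about rbf_sigma: in LIME, perturbed samples are weighted by their similarity to the original document, typically through an RBF kernel whose width the sigma parameter controls. A minimal sketch of that kernel (pure numpy; an illustration of the concept, not necessarily eli5's exact implementation):

```python
import numpy as np

def rbf(distance, sigma=1.0):
    # RBF kernel: sample weight decays smoothly with distance from the
    # original document; a smaller sigma concentrates weight on
    # near-identical perturbations, a larger sigma lets distant samples
    # influence the white-box fit
    return np.exp(-distance ** 2 / (2 * sigma ** 2))

# the original document itself always gets full weight (distance 0)
w_near = rbf(0.2, sigma=0.5)
w_far = rbf(0.9, sigma=0.5)
```

This is why choosing sigma is hard in practice: it trades off locality of the explanation against having enough effectively-weighted samples to fit on.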


If a library is not supported by eli5 directly, or the text processing
pipeline is too complex for eli5, eli5 can still help - it provides an
implementation of LIME (Ribeiro et al., 2016) algorithm which allows to
Review comment (Contributor):

I think it would be nice to link to the paper (https://arxiv.org/abs/1602.04938)

Reply (Contributor, Author):

Yeah, makes sense.

lopuhin (Contributor) commented Dec 30, 2016

Hey @kmike , I just finished reading the notebook, it looks great - I like that most important stuff is at the start, and you show how it can break.

The KL divergence and the score of how well the white-box classifier matches the black-box classifier's predictions seem important for judging whether an explanation should be trusted - does it make sense to include them in the explanation output by default?
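The faithfulness metric discussed here can be sketched concretely: the mean KL divergence compares the black-box and white-box predicted probability distributions over the sampled texts, with lower values meaning the explanation is more trustworthy. A pure-numpy illustration (hypothetical helper, not eli5's exact code):

```python
import numpy as np

def mean_kl_divergence(p_black, p_white, eps=1e-9):
    # mean KL(black-box || white-box) over sampled documents; values near 0
    # mean the white-box approximation reproduces the black-box predictions
    # faithfully, so the explanation derived from it can be trusted more
    p = np.clip(np.asarray(p_black, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_white, dtype=float), eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))
```

A perfect approximation gives a divergence of zero; a white-box model that falls back to near-uniform predictions gets penalized.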

lopuhin (Contributor) commented Dec 30, 2016

... does it make sense to include them in explanation output by default?

Hm, but I see it's not really convenient to implement, and it's not obvious that they are needed in the explanation.

The PR looks great! I didn't know about the `# type: (...) ->` syntax, that is much more convenient.
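The syntax being referred to is PEP 484's comment-based annotation form, which keeps code Python 2 compatible while remaining checkable by mypy. A small illustrative example (hypothetical function, not code from the PR):

```python
from typing import Dict, List

def top_tokens(weights, n=2):
    # type: (Dict[str, float], int) -> List[str]
    """Return the n highest-weighted tokens.

    The ``# type:`` comment above annotates the full signature without
    using Python 3 only annotation syntax.
    """
    return sorted(weights, key=weights.get, reverse=True)[:n]
```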

kmike (Contributor, Author) commented Dec 30, 2016

I was also thinking about adding scores to the output by default, but indeed it was not straightforward to implement. Adding a custom field just for scores looks like a bit too much; putting them in the description is not enough because the description is hidden by default.

@kmike kmike merged commit ad4e6bf into master Dec 30, 2016
@kmike kmike deleted the lime-text branch December 30, 2016 10:39
Successfully merging this pull request may close these issues.

LIME: add support for char-based text classifiers
3 participants