Token Frequency Distribution #106

bbengfort · 2017-01-20T18:24:17Z

Create a TextFreqDist Visualizer (or something like that) that takes as input a CountVectorizer, the number n of most frequent words to show, and a parameter controlling whether to plot cumulative frequency. Then on draw, display a frequency distribution plot of the n most frequent tokens.

The goal of this plot is to show the effect of stopwords removal and other text feature normalization techniques (e.g. stemming vs. lemmatization).
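To make the proposal concrete, here is a rough sketch of the computation behind such a plot, assuming the vectorized corpus is available as a document-term matrix of counts plus a list of token names (roughly what a CountVectorizer yields). The function name `top_n_frequencies` and its parameters are hypothetical, plain lists stand in for the sparse matrix, and drawing is left out.

```python
from itertools import accumulate


def top_n_frequencies(counts, features, n=50, cumulative=False):
    """Hypothetical helper: return the n most frequent tokens and their counts.

    counts:   document-term matrix as rows of per-token counts
    features: token names, one per matrix column
    """
    # Sum each column to get the corpus-wide count of every token.
    totals = [sum(column) for column in zip(*counts)]
    # Sort by descending count, breaking ties alphabetically, and keep n.
    ranked = sorted(zip(features, totals), key=lambda pair: (-pair[1], pair[0]))[:n]
    tokens = [token for token, _ in ranked]
    freqs = [count for _, count in ranked]
    if cumulative:
        # Running total over the ranked counts.
        freqs = list(accumulate(freqs))
    return tokens, freqs


# Toy matrix: three documents (rows) over a three-token vocabulary (columns).
features = ["cat", "dog", "the"]
counts = [
    [1, 0, 2],
    [0, 1, 3],
    [2, 0, 1],
]

tokens, freqs = top_n_frequencies(counts, features, n=2)
_, cumulative_freqs = top_n_frequencies(counts, features, n=2, cumulative=True)
```

The returned token list would label one axis and the frequency list would supply the bar heights or line values.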

bbengfort · 2017-01-20T18:28:02Z

Advanced option: plot per-class distributions ...

rebeccabilbro · 2017-01-25T17:18:31Z

This could be either a line graph (like this) or a bar chart (like this).

rebeccabilbro · 2017-01-25T17:20:06Z

Plot the first 50 words if it's a single corpus. If we're comparing two corpora (e.g. with an overlay), maybe the words on the axis go away, since it's really just the shape that matters?
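One possible way to compare shapes without word labels is to reduce each corpus to its relative frequencies by rank. This is only a sketch under that assumption: `rank_profile` is a hypothetical name, and each corpus is assumed to be a list of tokenized documents.

```python
from collections import Counter


def rank_profile(documents, n=50):
    """Hypothetical helper: relative frequency of the n top-ranked tokens.

    The tokens themselves are dropped; only the shape of the curve is kept.
    """
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())
    # most_common orders by descending count, so index 0 is rank 1.
    return [count / total for _, count in counts.most_common(n)]


# Two toy corpora with different vocabularies and sizes.
corpus_a = [["a", "a", "a", "b"]]
corpus_b = [["x", "y", "x", "y"], ["z", "z", "x", "x"]]

profile_a = rank_profile(corpus_a, n=2)
profile_b = rank_profile(corpus_b, n=2)
```

Because both profiles are proportions indexed by rank, they could be overlaid on a shared axis even when the corpora differ in size and vocabulary.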

rebeccabilbro · 2017-01-26T21:55:41Z

Re: comparing across corpora/texts: http://stackoverflow.com/questions/32040444/frequency-distribution-comparison-python

rebeccabilbro · 2017-02-08T20:13:08Z

Philosophical questions... Should this class expect text that has been count vectorized? Or use fit() to wrap Scikit-Learn's count vectorizer? Don't want to have to add NLTK as a dependency. See TfidfVectorizer and TfidfTransformer source from sklearn.

For now, to keep things simple, let's assume the input is a list of lists of words (i.e. already preprocessed). Each row is a document, which is a list of words; pass the documents in as X. Then we don't have to worry about tokenization. Once this is closed, open up a new card to elaborate (import regular expression tokenizer utils - english stopwords, strip accents, strip tags, etc. from sklearn)?
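A minimal sketch of counting under that assumption, where X is a list of documents and each document is a list of words. The name `count_tokens` and the optional `stopwords` argument are hypothetical, added only to illustrate how stopword removal changes the top of the distribution.

```python
from collections import Counter


def count_tokens(X, stopwords=frozenset()):
    """Hypothetical helper: count tokens across pre-tokenized documents."""
    counts = Counter()
    for document in X:
        # Skip any token listed as a stopword; count everything else.
        counts.update(word for word in document if word not in stopwords)
    return counts


# Each row of X is one document, already split into words.
X = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]

raw = count_tokens(X)
filtered = count_tokens(X, stopwords={"the", "on"})

top_raw = raw.most_common(2)
top_filtered = filtered.most_common(1)
```

With the toy data, "the" dominates the raw counts and disappears once it is treated as a stopword, which is the kind of shift the plot is meant to expose.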

rebeccabilbro · 2017-02-20T16:47:22Z

@rebeccabilbro before pushing, remember to rename text.ipynb from examples to avoid PITA merge conflict with @bbengfort

rebeccabilbro · 2017-02-20T21:34:38Z

Ok, have pushed a preliminary implementation of the FreqDistVisualizer here, and integrated an illustration into the text examples notebook here, and updated the api documentation accordingly here.

Things that we should do or consider doing in a future implementation:

per-class distribution plots, which also means being able to change the title to name the class
add a line graph option (like this)
figure out how to implement corpus overlays (suggested by Ben, but I think separate subplots might make more sense and be more readable)
add tests
add a wrapper in case the text hasn’t been count vectorized yet (also, check out the utils available in sklearn: regular expression tokenizer utils, english stopwords, strip accents, strip tags, etc.)

bbengfort · 2017-02-21T13:52:56Z

Hawt.

I just have this voice in the back of my head going "muwahahahahaha" -- every time we implement a new visualizer. They're so simple to use (maybe a bit less simple to implement, but not too bad), but so powerful. The examples you gave show a very clear and intuitive use for a simple frequency distribution, but one that would take a lot of steps to do on your own so you'd probably skip it. YB puts it at your fingertips and looks good doing it!

bbengfort added this to the Version 0.3.3 milestone Jan 20, 2017

bbengfort self-assigned this Jan 20, 2017

bbengfort added the level: intermediate, priority: medium, type: feature, and ready labels Jan 20, 2017

bbengfort assigned rebeccabilbro and unassigned bbengfort Jan 20, 2017

rebeccabilbro mentioned this issue Jan 21, 2017

Text Visualizations #40

Closed

rebeccabilbro added the in progress label and removed the ready label Jan 27, 2017

rebeccabilbro pushed a commit that referenced this issue Feb 20, 2017

preliminary implementation of frequency distribution plot towards #106

93df7f7

bbengfort added the review label and removed the in progress label Feb 22, 2017

rebeccabilbro mentioned this issue Feb 22, 2017

Refine FreqDistVisualizer #117

Open

bbengfort closed this as completed Feb 22, 2017

bbengfort removed the review label Feb 22, 2017
