Token Frequency Distribution #106

Closed
bbengfort opened this issue Jan 20, 2017 · 8 comments

@bbengfort
Member

bbengfort commented Jan 20, 2017

Create a TextFreqDist visualizer (or something like that) that takes as input a CountVectorizer, the number n of most frequent words to show, and a flag for whether to plot cumulative frequency. Then, on draw, display a frequency distribution plot of the most frequent tokens.

The goal of this plot is to show the effect of stopword removal and other text feature normalization techniques (e.g. stemming vs. lemmatization).
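
A minimal sketch of the computation such a visualizer would need, assuming the text has already been vectorized into a document-term count matrix. The names `top_n_frequencies`, `rows`, and `features` are hypothetical, and plain Python lists stand in for the sparse matrix and vocabulary a `CountVectorizer` would produce:

```python
from itertools import accumulate


def top_n_frequencies(rows, features, n=50, cumulative=False):
    """Return (tokens, counts) for the n most frequent tokens.

    rows: document-term counts, one list of ints per document (assumed layout).
    features: the token for each column, in column order.
    """
    # Sum each column to get corpus-wide counts per token.
    totals = [sum(col) for col in zip(*rows)]
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(zip(features, totals), key=lambda p: (-p[1], p[0]))[:n]
    tokens = [t for t, _ in ranked]
    counts = [c for _, c in ranked]
    if cumulative:
        # Running total over the ranked counts.
        counts = list(accumulate(counts))
    return tokens, counts


features = ["cat", "sat", "the"]
rows = [[1, 1, 2], [1, 0, 1]]
tokens, counts = top_n_frequencies(rows, features, n=2)
cum_tokens, cum_counts = top_n_frequencies(rows, features, n=3, cumulative=True)
```

Running the same function before and after a normalization step (for example, dropping stopword columns) would give the two distributions to compare.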

@bbengfort bbengfort added this to the Version 0.3.3 milestone Jan 20, 2017
@bbengfort bbengfort self-assigned this Jan 20, 2017
@bbengfort bbengfort added level: intermediate python coding expertise required priority: medium can wait until after next release type: feature a new visualizer or utility for yb ready labels Jan 20, 2017
@bbengfort
Member Author

Advanced option: plot per-class distributions ...

@rebeccabilbro
Member

This could be either a line graph (like this) or a bar chart (like this).

@rebeccabilbro
Member

Plot the first 50 words if it's a single corpus. If we're comparing two corpora (e.g. with an overlay), should the words on the axis go away, since it's really just the shape that matters?
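
One way to compare shapes only, sketched under assumptions: index each corpus by rank rather than by word, and divide by corpus size so corpora of different lengths are comparable. The function name `rank_profile` and the normalization step are illustrative choices, not something decided in this thread:

```python
from collections import Counter


def rank_profile(tokens, n=50):
    """Relative frequencies of the n most common tokens, ordered by rank."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return []
    # Drop the words themselves; keep only the share held by each rank.
    return [c / total for _, c in counts.most_common(n)]


corpus_a = "the cat sat on the mat the end".split()
corpus_b = "a dog a day a bone here".split()
profile_a = rank_profile(corpus_a, n=3)
profile_b = rank_profile(corpus_b, n=3)
```

Both profiles share the same x-axis (rank 1, 2, 3, ...), so they could be drawn on one set of axes without word labels.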

@rebeccabilbro
Member

@rebeccabilbro rebeccabilbro added in progress label for Waffle board and removed ready labels Jan 27, 2017
@rebeccabilbro
Member

Philosophical questions... Should this class expect text that has already been count vectorized? Or should it use fit() to wrap Scikit-Learn's CountVectorizer? I don't want to have to add NLTK as a dependency. See the TfidfVectorizer and TfidfTransformer source in sklearn.

For now, to keep things simple, let's assume the input is a list of lists of words (i.e. already preprocessed). Each row is a document, which is a list of words; pass the documents in as X. Then we don't have to worry about tokenization. Once this is closed, open a new card to elaborate (import the regular expression tokenizer utils from sklearn: English stopwords, strip accents, strip tags, etc.)?
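
Under that assumption the counting step needs only the standard library. A rough sketch (the helper name `count_tokens` is hypothetical):

```python
from collections import Counter
from itertools import chain


def count_tokens(X):
    """Count tokens across documents; X is a list of lists of words."""
    # Flatten the documents into one token stream and tally it.
    return Counter(chain.from_iterable(X))


X = [["the", "cat", "sat"], ["the", "dog"]]
freqs = count_tokens(X)
top = freqs.most_common(1)
```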

@rebeccabilbro
Member

@rebeccabilbro before pushing, remember to rename text.ipynb from examples to avoid PITA merge conflict with @bbengfort

@rebeccabilbro
Member

OK, I have pushed a preliminary implementation of the FreqDistVisualizer here, integrated an illustration into the text examples notebook here, and updated the API documentation accordingly here.

Things that we should do or consider doing in a future implementation:

  • per-class distribution plots, which also means being able to change the title to name the class
  • add a line graph option (like this)
  • figure out how to implement corpus overlays (suggested by Ben, but I think separate subplots might make more sense and be more readable)
  • add tests
  • add a wrapper in case the text hasn’t been count vectorized yet (also check out the utils available in sklearn: regular expression tokenizer utils, English stopwords, strip accents, strip tags, etc.)
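
For the per-class item, a minimal sketch of the grouping step, assuming documents arrive as lists of words in X with one label per document in y. The function name `per_class_frequencies` and the title format are illustrative only, not the visualizer's actual API:

```python
from collections import Counter, defaultdict


def per_class_frequencies(X, y, n=50):
    """Map each class label to its n most common (token, count) pairs."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    by_class = defaultdict(Counter)
    for doc, label in zip(X, y):
        # Accumulate token counts separately for each class.
        by_class[label].update(doc)
    return {label: c.most_common(n) for label, c in by_class.items()}


X = [["good", "good", "film"], ["bad", "film"], ["good", "plot"]]
y = ["pos", "neg", "pos"]
result = per_class_frequencies(X, y, n=1)
# One plot title per class, so each distribution names its class.
titles = [f"Frequency Distribution: {label}" for label in result]
```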

@bbengfort
Member Author

Hawt.

I just have this voice in the back of my head going "muwahahahahaha" -- every time we implement a new visualizer. They're so simple to use (maybe a bit less simple to implement, but not too bad), but so powerful. The examples you gave show a very clear and intuitive use for a simple frequency distribution, but one that would take a lot of steps to do on your own so you'd probably skip it. YB puts it at your fingertips and looks good doing it!

@bbengfort bbengfort added review PR is open and removed in progress label for Waffle board labels Feb 22, 2017
@bbengfort bbengfort removed the review PR is open label Feb 22, 2017
2 participants