Migrate to sklearn #52
Code looks good. Just some nitpicks about typing and import sorting. No worries though, I'll take care of it!
I'll take care of the PEP 8 issues as well while I'm at it.
src/scripts/train_bayes.py
Outdated
    tokens: list[str], trigrams: bool = False, use_stopwords: bool = False
) -> dict[str, bool | None]:
    normalized = (token.lower() for token in tokens if token not in punctuation)
def read_file(path):
Needs type annotations.
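A minimal sketch of what an annotated version could look like. The body of `read_file` is not shown in the diff, so this assumes it takes a filesystem path as a string and returns the file's text; both the body and the UTF-8 encoding are assumptions for illustration only.

```python
def read_file(path: str) -> str:
    """Return the text content of the file at ``path``.

    Assumed behaviour: the real function body is not visible in this diff.
    """
    # Explicit encoding keeps the result independent of the platform default.
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

If the function actually returns a list of lines or parsed tokens, only the return annotation would need to change (for example `-> list[str]`).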
src/scripts/train_bayes.py
Outdated
import nltk
import pickle
from string import punctuation
from numpy import vectorize
Sort imports alphabetically.
Replace the nltk-based Bayes model with sklearn.
This also changes the model parameters so that its vocabulary is limited to 5000 tokens.
The lack of such a limit is most likely why the nltk Bayes model blew up.
With the new model and vectorizer, it's about 25 MB. Is that ok?
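For reference, a rough sketch of how a vocabulary cap and a size check could look in sklearn. The PR does not say which vectorizer or Naive Bayes variant is used, so `CountVectorizer`, `MultinomialNB`, and the toy data below are assumptions chosen for illustration, not the code from this branch.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, purely illustrative; the project's real corpus is not shown in this PR.
texts = ["great movie", "terrible plot", "great acting", "terrible movie"]
labels = ["pos", "neg", "pos", "neg"]

# max_features keeps only the most frequent terms, which caps the vocabulary size.
model = make_pipeline(CountVectorizer(max_features=5000), MultinomialNB())
model.fit(texts, labels)

# Inspect the learned vocabulary and the serialized size of vectorizer plus model.
vocabulary_size = len(model.named_steps["countvectorizer"].vocabulary_)
pickled_size_bytes = len(pickle.dumps(model))
print(f"vocabulary: {vocabulary_size} terms, pickle: {pickled_size_bytes} bytes")
```

Pickling the fitted pipeline as a whole keeps the vectorizer and the classifier together, so the reported size covers both artifacts.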
I also think the scripts should be rewritten in a separate PR. I can clearly see that I have learned a lot about readable and idiomatic Python code since I wrote this lol
This closes issue #35