Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to sklearn #52

Merged
merged 5 commits into from
Apr 3, 2022
Merged

Migrate to sklearn #52

merged 5 commits into from
Apr 3, 2022

Conversation

marksverdhei
Copy link
Collaborator

@marksverdhei marksverdhei commented Apr 2, 2022

Replace nltk based bayes model with sklearn.
This also changes parameters in the model so we limit its vocabulary by 5000 tokens.
This is most likely the reason why the nltk bayes model blew up
With new model and vectorizer, it's about 25mb. Is that ok?

I also think the scripts should be rewritten for a different pr. I see clearly that I have learned a lot about readable and idiomatic python code
since i wrote this lol

This closes issue #35

@LBlend LBlend self-requested a review April 3, 2022 16:16
@LBlend LBlend added this to In progress in Website via automation Apr 3, 2022
@LBlend LBlend linked an issue Apr 3, 2022 that may be closed by this pull request
@LBlend LBlend moved this from In progress to Ready for review in Website Apr 3, 2022
Copy link
Owner

@LBlend LBlend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good. Just some nitpicking about typing and import sorting. No worries though. I''ll take care of it!

I'll take care of the pep8 issues as well while I'm at it

tokens: list[str], trigrams: bool = False, use_stopwords: bool = False
) -> dict[str, bool | None]:
normalized = (token.lower() for token in tokens if token not in punctuation)
def read_file(path):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs typing

import nltk
import pickle
from string import punctuation
from numpy import vectorize
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sort imports alphabetically

@LBlend LBlend self-requested a review April 3, 2022 20:39
Website automation moved this from Ready for review to Dev - Testing stage Apr 3, 2022
@LBlend LBlend merged commit a380e9d into dev Apr 3, 2022
Website automation moved this from Dev - Testing stage to Done Apr 3, 2022
@LBlend LBlend deleted the markus/migrate-sklearn branch April 3, 2022 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Website
  
Done
Development

Successfully merging this pull request may close these issues.

Replace all NLTK with scikit-learn
2 participants