One word comments don't pass the filter #1

Closed
yardenlai opened this issue Jul 12, 2023 · 6 comments

Comments

@yardenlai

I noticed that one-word comments (perfectly valid Hebrew words) weren't showing up in the "with comments" feed. This might be happening with one-word posts as well; I can check if it matters.

I'm opening an issue as a way to communicate about this and to remember it when I have some free time to open a PR to fix it.

If anyone happens to stumble across this issue: would the best fix be to just hardcode a rule that lets through one-word posts made of Hebrew letters? Is there a smarter solution?
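For reference, a minimal sketch of what such a hardcoded rule could look like. The function name and the punctuation set are assumptions made for illustration, not code from this repository. It only checks the script, so it cannot tell Hebrew from Yiddish, which uses the same letters.

```python
# Hypothetical helper, not taken from the project.
# Characters stripped from the ends of the word before checking it (an assumption).
TRIM_CHARS = ".,!?:;\"'()"


def is_single_hebrew_word(text: str) -> bool:
    """Return True if text is exactly one word made only of Hebrew letters.

    Hebrew letters are alef (U+05D0) through tav (U+05EA), final forms included.
    Niqqud and other marks are not accepted by this simple check.
    """
    words = text.split()
    if len(words) != 1:
        return False
    word = words[0].strip(TRIM_CHARS)
    return bool(word) and all("\u05d0" <= ch <= "\u05ea" for ch in word)
```

A rule like this would run only when the language model returns no answer, so it acts as a fallback rather than replacing the model.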

@AvivRubys
Owner

AvivRubys commented Jul 12, 2023

Off the top of my head, we could:

  1. Specify that when the model fails to categorize a post, it defaults to Hebrew.
  2. Train a better model. Since we're only concerned with Hebrew/Yiddish, it can be more specialized.
  3. Add another existing model: use a separate library such as franc, languagedetect, or fasttext, and combine its result with cld's.
  4. Do some hardcoding, though I'm not sure what that would look like.

I think option 1 would be the simplest and might work well enough. Right now, though, we don't save posts whose language went undetected, so it would be hard to test. I'll change it to save them with an "unknown" language so that in a few days we have something to backtest on - 1f50067
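One possible way to combine options 1 and 3 is to try detectors in priority order and fall back to a default label when none of them answers. This is a sketch only: the detector functions below are hypothetical stand-ins, not the real cld or fasttext APIs, and "first answer wins" is just one of several combination strategies.

```python
from typing import Callable, Optional, Sequence

# A detector takes text and returns a language code, or None when it cannot decide.
Detector = Callable[[str], Optional[str]]


def detect_language(
    text: str, detectors: Sequence[Detector], fallback: str = "unknown"
) -> str:
    """Return the first non-None answer in priority order, else the fallback."""
    for detect in detectors:
        language = detect(text)
        if language is not None:
            return language
    return fallback


# Hypothetical stand-ins for real models, for illustration only.
def shy_detector(text: str) -> Optional[str]:
    """Pretend model that gives up on one-word input."""
    return "he" if len(text.split()) > 1 else None


def eager_detector(text: str) -> Optional[str]:
    """Pretend model that always answers."""
    return "he"
```

Passing `fallback="he"` corresponds to option 1; keeping `fallback="unknown"` leaves the decision to whatever reads the stored label later.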

@yardenlaif
Contributor

Writing here since I've reached my limit on Twitter DMs today 💁‍♀️
I can check option 1 soon.

@AvivRubys
Owner

I'm actually looking into option 3 right now. I've managed to get Facebook's fastText working, so I'm writing a comparison script between it and the current model. I think I'll have results in an hour or so.

@AvivRubys
Owner

Actually, we don't need to default to Hebrew. When the model can't identify the language, it already labels the post as unknown, so you can experiment with displaying unknowns as Hebrew without changing the categorization itself.
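A minimal sketch of that idea, assuming posts are stored with a language label and filtered when the feed is built. The `Post` shape, the `"he"` code, and the `include_unknown` flag are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical record shape; the real schema may differ.
    uri: str
    text: str
    language: str  # e.g. "he", "en", or "unknown"


# Language labels the feed shows by default (an assumption).
FEED_LANGUAGES = {"he"}


def in_feed(post: Post, include_unknown: bool = False) -> bool:
    """Decide at display time whether a post belongs in the feed.

    The stored label is never rewritten; unknowns are only treated as
    feed-worthy when include_unknown is switched on.
    """
    if post.language in FEED_LANGUAGES:
        return True
    return include_unknown and post.language == "unknown"
```

Because the stored label stays "unknown", the experiment can be turned off again without re-categorizing anything.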

@AvivRubys
Owner

Hebrew detection results, out of all the posts that contain any Hebrew characters:
BSKY built-in detection - 73%
CLD (current) - 81%
FastText - 94.5%
FastText Compressed - 94.9%

I tried some model combinations, but they didn't improve the percentages significantly. This looks like an easy decision: go ahead with FastText Compressed. If anyone's interested in reviewing, I've thrown all the model results into sqlite for easy querying.
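For anyone who wants to run a similar query, here is a rough sketch of how per-model Hebrew detection rates could be computed from a sqlite table. The table name, columns, and rows are invented toy data for illustration; they are not the actual schema or results from this comparison.

```python
import sqlite3

# Assumed schema: one row per (post, model) pair, restricted to posts
# that contain at least one Hebrew character.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detections (post_id INTEGER, model TEXT, language TEXT)")

# Toy rows, made up for this example.
toy_rows = [
    (1, "cld", "he"), (1, "fasttext", "he"),
    (2, "cld", "unknown"), (2, "fasttext", "he"),
    (3, "cld", "he"), (3, "fasttext", "yi"),
    (4, "cld", "unknown"), (4, "fasttext", "he"),
]
conn.executemany("INSERT INTO detections VALUES (?, ?, ?)", toy_rows)


def hebrew_rate_by_model(connection: sqlite3.Connection) -> dict:
    """Return {model: percentage of its rows labelled 'he'}."""
    query = """
        SELECT model, 100.0 * SUM(language = 'he') / COUNT(*) AS pct_hebrew
        FROM detections
        GROUP BY model
        ORDER BY model
    """
    return dict(connection.execute(query).fetchall())


rates = hebrew_rate_by_model(conn)
```

In SQLite the comparison `language = 'he'` evaluates to 0 or 1, so summing it counts the matching rows.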

@AvivRubys
Owner

I think this should mostly be resolved with FastText. We'll revisit if needed.
