One word comments don't pass the filter #1

yardenlai · 2023-07-12T17:55:26Z

I saw this happening with one-word comments (perfectly valid Hebrew words) that weren't showing up on the "with comments" feed. It might be happening with one-word posts as well; I can check if it matters.

I'm opening an issue as a way to communicate about this and to remember it when I have some free time to open a PR to fix it.

If anyone happens to stumble across this issue - would the best fix be to just hardcode a rule that lets through one-word posts containing Hebrew letters? Is there a smarter solution?

AvivRubys · 2023-07-12T21:14:41Z

Off the top of my head, we could:

1. Specify that when the model fails to categorize a post, it defaults to Hebrew
2. Train a better model - since we're only concerned with Hebrew/Yiddish, it can be more specialized
3. Add another existing model - use a separate library like franc, languagedetect, fasttext and combine the result with cld's
4. Do some hardcoding, not sure what that would look like though

I think option 1 would be the simplest and might work well enough. Right now, though, we don't save posts with an undetected language, so it'll be a bit hard to test. I'll change it so they're saved with an "unknown" language, which gives us something to backtest on in a few days - 1f50067
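To make option 1 and the "hardcoding" idea concrete, here is a minimal Python sketch, not the project's actual code. It assumes the detector's result arrives as a language code or `None` on failure; `has_hebrew_letters` and `resolve_language` are hypothetical names. Note that Yiddish is written with the same letters, so a script check alone can't separate the two languages.

```python
import re

# Hebrew letters alef..tav (U+05D0-U+05EA). Yiddish uses the same letters,
# so matching this range says "Hebrew script", not "Hebrew language".
HEBREW_LETTER = re.compile(r"[\u05D0-\u05EA]")


def has_hebrew_letters(text: str) -> bool:
    """Return True if the text contains at least one Hebrew letter."""
    return HEBREW_LETTER.search(text) is not None


def resolve_language(text: str, detected):
    """Pick the label to store for a post.

    `detected` is the detector's result (assumed here to be a language
    code such as "he", or None when the detector could not decide).
    """
    if detected is not None:
        return detected      # trust the detector whenever it answered
    if has_hebrew_letters(text):
        return "he"          # fallback: undetected text in Hebrew script
    return "unknown"         # keep the post, labeled, for later backtesting
```

With this shape a one-word comment like `שלום` that the detector gives up on is still labeled `"he"`, while undetected text in other scripts is stored as `"unknown"` instead of being dropped.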

yardenlaif · 2023-08-05T17:36:56Z

Writing here since I've reached my limit on Twitter DMs today 💁‍♀️
I can check option 1 soon

AvivRubys · 2023-08-05T17:52:43Z

I'm actually looking into option 3 right now. I've managed to get Facebook's fasttext working, so I'm writing a script that compares it against today's model. I think I'll have results in an hour or so.

AvivRubys · 2023-08-05T18:05:32Z

Actually, we don't need to default to Hebrew. When the model doesn't know, the post is already labeled unknown, so you can experiment with displaying unknowns as Hebrew without changing the categorization itself.
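A rough sketch of that display-side experiment, in Python for illustration only. The `Post` shape, the `"he"`/`"unknown"` labels, and the `show_unknown_as_hebrew` switch are all assumptions, not the feed's real data model; the point is that the stored label is never rewritten.

```python
from dataclasses import dataclass


@dataclass
class Post:
    text: str
    language: str  # stored label, e.g. "he", "yi", "en" or "unknown"


def feed_languages(show_unknown_as_hebrew: bool) -> set:
    """Labels that qualify a post for the Hebrew feed."""
    languages = {"he"}
    if show_unknown_as_hebrew:
        languages.add("unknown")  # experiment: surface undetected posts too
    return languages


def filter_feed(posts, show_unknown_as_hebrew: bool = False):
    """Select posts for the feed without touching their stored labels."""
    allowed = feed_languages(show_unknown_as_hebrew)
    return [post for post in posts if post.language in allowed]
```

Because the switch only affects which labels are selected, turning the experiment off again restores the old feed with no data migration.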

AvivRubys · 2023-08-05T19:12:55Z

Hebrew detection results, out of all the posts that contain any Hebrew characters:
BSKY built-in detection - 73%
CLD (current) - 81%
FastText - 94.5%
FastText Compressed - 94.9%

I tried some model combinations, but they didn't improve the percentages significantly. This looks like an easy decision: go ahead with FastText Compressed. If anyone's interested in reviewing, I've thrown all the model results into sqlite for easy querying.
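For anyone reviewing, a query along these lines could reproduce per-model percentages from such a database. This is only a sketch: the `results` table and its `post_id`, `model` and `detected` columns are a hypothetical schema, not the actual one, and it assumes the table holds only posts that contain Hebrew characters.

```python
import sqlite3


def create_results_table(conn: sqlite3.Connection) -> None:
    """Create the hypothetical table: one row per (post, model) pair."""
    conn.execute(
        "CREATE TABLE results (post_id TEXT, model TEXT, detected TEXT)"
    )


def hebrew_detection_rates(conn: sqlite3.Connection) -> dict:
    """Percentage of rows each model labeled 'he', keyed by model name.

    A NULL in `detected` (model gave no answer) counts as a miss.
    """
    rows = conn.execute(
        """
        SELECT model,
               100.0 * SUM(CASE WHEN detected = 'he' THEN 1 ELSE 0 END)
                     / COUNT(*) AS pct
        FROM results
        GROUP BY model
        ORDER BY pct DESC
        """
    ).fetchall()
    return {model: pct for model, pct in rows}
```

Keeping one row per post and model also makes it easy to query for posts where the models disagree, which is where a manual review is most useful.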

AvivRubys · 2023-08-06T14:18:02Z

I think this should mostly be resolved with FastText. We'll revisit if needed.

AvivRubys mentioned this issue Aug 5, 2023

Changed language classifier #5

Merged

AvivRubys closed this as completed Aug 6, 2023
