This repository has been archived by the owner on Apr 14, 2021. It is now read-only.

Introducing project Ghamhilator! #57

Closed
ArcticEcho opened this issue Jan 22, 2015 · 7 comments

@ArcticEcho
Owner

Today I started work on an NLP-based version of Pham, called Gham (as you can probably guess, this bot will run under the account Gham). Gham's ultimate goal is to use NLP (primarily a POS tagger) to build "models" (linguistic patterns) of spam, offensive and low-quality posts, which can later be used to identify such posts. I aim for this entire process to be automated, but he will still accept FP/TP feedback.

The exact inner workings of Gham have not yet been "set in stone", so feel free to put forward any ideas/suggestions.
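
To make the "models" idea above concrete, here is a minimal sketch in Python of one way it could work: tag each word with its part of speech, collect the tag trigrams seen in known-bad posts, and flag a new post if it contains a trigram seen often enough. This is an illustration under stated assumptions, not Gham's actual design. The toy lexicon stands in for a real POS tagger, and every name here (`TOY_LEXICON`, `build_model`, `matches`, `min_count`) is hypothetical.

```python
from collections import Counter

# Toy word -> tag lexicon standing in for a real POS tagger (assumption).
TOY_LEXICON = {
    "buy": "VB", "get": "VB",
    "cheap": "JJ", "free": "JJ",
    "watches": "NNS", "pills": "NNS",
    "now": "RB",
}


def tag(words):
    """Map each word to a POS tag; unknown words get the tag 'UNK'."""
    return [TOY_LEXICON.get(w.lower(), "UNK") for w in words]


def tag_trigrams(text):
    """Return every run of three consecutive POS tags in the text."""
    tags = tag(text.split())
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]


def build_model(bad_posts):
    """Count how often each tag trigram occurs across known-bad posts."""
    model = Counter()
    for post in bad_posts:
        model.update(tag_trigrams(post))
    return model


def matches(model, text, min_count=2):
    """Flag the text if any of its tag trigrams was seen at least min_count times."""
    return any(model[t] >= min_count for t in tag_trigrams(text))
```

For example, a model built from "buy cheap watches now" and "get free pills now" would flag "Get cheap pills", since it shares the verb-adjective-noun pattern even though the exact words differ.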

@Unihedro
Collaborator

Wow, this is a spectacular announcement! As spam is frequently "decorated" to make it harder to catch with specialized terms, this will be a massive breakthrough, as it captures low-quality posts more effectively. One thing, though: how are you going to strip out MathJax code and other code blocks, like the chess widgets on Chess.SE? And (probably) more importantly, how will you handle foreign-language sites?

@ArcticEcho
Owner Author

Thanks! Well, this is where I'll be adding another project which focuses only on fetching/parsing posts (in real time) and locally broadcasting the data to Gham and Pham (effectively splitting what Pham already does into a separate project, called Yham!).

As for MathJax/chess widgets, by default Yham fetches the post's HTML, which is useful for Pham, but not so much for Gham (as he can only analyse English words, foreign-language sites will also fall outside Gham's scope). Having said that, I may be able to get my hands on a few foreign-language POS tagger models, although I doubt the extra effort of adding even more models for sites that don't actually attract many "bad" posts will pay off.
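
As a rough illustration of the HTML-to-words step discussed above, here is a minimal Python sketch that reduces a post's HTML to plain words and drops anything inside `<pre>` or `<code>` elements. It is not Yham's actual implementation. The class and function names are hypothetical, and treating `<pre>`/`<code>` as the only blocks to skip is an assumption; MathJax and site-specific widgets would need their own handling.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside <pre> or <code>."""

    SKIP_TAGS = {"pre", "code"}  # assumed set of non-prose elements

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def words(self):
        """Return the collected prose as a list of whitespace-separated words."""
        return " ".join(self._chunks).split()


def html_to_words(html):
    """Parse a post's HTML and return only the words outside code blocks."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.words()
```

The word list returned by `html_to_words` is the kind of input a tagger or classifier could consume, while the raw HTML remains available to anything that needs it.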

@honnza
Collaborator

honnza commented Jan 23, 2015

Yham sounds terrible. I'd suggest "Yam".


@ArcticEcho
Owner Author

[status-accepted]

@thomas-daniels
Collaborator

That sounds cool!

@ArcticEcho
Owner Author

In light of further discussion and testing, PoS tagging doesn't currently appear to be the most effective way to classify LQ posts. As such, all PoS tagging functionality will now be replaced with a weighted cue-based classification algorithm.
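
For readers unfamiliar with the term, a weighted cue-based classifier can be sketched roughly as follows in Python: each cue is a pattern with a weight, a post's score is the sum of the weights of the cues it matches, and the post is flagged when the score reaches a threshold. This is only a sketch under stated assumptions. The cue patterns, weights and `THRESHOLD` value are invented for illustration and are not taken from Pham or Gham.

```python
import re

# Hypothetical cues: (pattern, weight). Positive weights push a post
# towards "low quality"; negative weights push it away.
CUES = [
    (re.compile(r"\bbuy now\b", re.IGNORECASE), 2.0),
    (re.compile(r"https?://", re.IGNORECASE), 1.0),
    (re.compile(r"\bfree\b", re.IGNORECASE), 0.5),
    (re.compile(r"\bbecause\b", re.IGNORECASE), -0.5),
]

THRESHOLD = 2.0  # assumed cut-off for flagging a post


def score(text):
    """Sum the weights of all cues that match the text at least once."""
    return sum(weight for pattern, weight in CUES if pattern.search(text))


def is_low_quality(text):
    """Flag the post when its total cue score reaches the threshold."""
    return score(text) >= THRESHOLD
```

In a scheme like this, FP/TP feedback could be used to nudge individual cue weights up or down over time.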

@ArcticEcho
Owner Author

All NLP-based classification is now being moved to Pham. For now, we'll leave Gham to rest.
