This repository has been archived by the owner on Apr 14, 2021. It is now read-only.

Please scrap Mathjax from all posts #61

Open
ghost opened this issue Jan 28, 2015 · 10 comments

Comments

@ghost

ghost commented Jan 28, 2015

Pham does not contain any filters specific to MathJax blocks. The only issue I can see this causing is false positives for length-based regexes such as .{0,80}.

@honnza
Collaborator

honnza commented Jan 28, 2015

Remove completely, or convert it to text? The former only requires pairing $ on sites that support MathJax, but...

How do you tell if a site supports MathJax? Hardcoding it is an option, but the same information is available through the API as well.
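A minimal sketch of the "remove completely" option, assuming the hardcoded-site approach. The names `MATHJAX_SITES` and `strip_mathjax` are hypothetical and not part of Pham, and the pairing here is a simple regex rather than a real MathJax parser:

```python
import re

# Hypothetical hardcoded list of MathJax-enabled sites (an assumption for
# this sketch; the thread notes the API exposes the same information).
MATHJAX_SITES = {"math.stackexchange.com"}

# Pair up delimiters: $$...$$ display blocks are tried first, then $...$
# inline spans. A backslash-escaped dollar (\$) is not treated as a delimiter.
_MATHJAX = re.compile(
    r"(?<!\\)\$\$.+?(?<!\\)\$\$"  # display math
    r"|(?<!\\)\$.+?(?<!\\)\$",    # inline math
    re.DOTALL,
)


def strip_mathjax(text: str, site: str) -> str:
    """Remove MathJax spans from text, but only on sites known to render them."""
    if site not in MATHJAX_SITES:
        return text
    return _MATHJAX.sub("", text)
```

On sites outside the list a bare `$` is left alone, so prices and shell prompts are not eaten by the pairing.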

@ArcticEcho
Owner

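Just some data on the post that prompted this. Rendered output:

https://camo.githubusercontent.com/bf4437ab6ef39b3a226ca95d34d777fb8d3bd342/687474703a2f2f692e737461636b2e696d6775722e636f6d2f5939724b552e706e67
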
Log entry:

{
    "ReportLink" : "http://chat.meta.stackexchange.com/transcript/message/2970223",
    "PostUrl" : "http://math.stackexchange.com/a/1123742",
    "Site" : "math.stackexchange.com",
    "Title" : "If $\\alpha_1||y_1||\\alpha_2||y_2||$, then $x=-y_1$.",
    "Body" : "<p>If $\\alpha_1||y_1||>\\alpha_2||y_2||$, then $x=-y_1$.</p>",
    "TimeStamp" : "2015-01-28T17:34:00.918Z",
    "ReportType" : "LowQuality",
    "BlackTerms" : [
        {
            "Type" : "AnswerLQ",
            "Regex" : "^(?i).{0,80}$",
            "IsAuto" : false,
            "Site" : "",
            "Score" : 89,
            "TPCount" : 486,
            "FPCount" : 119,
            "CaughtCount" : 3010
        }
    ],
    "WhiteTerms" : []
}

... which raises the question: do we want to classify these sorts of posts as LQ? If yes, then case closed. Otherwise, just let Pham do his thing and lower that term's weight for MathJax-supporting sites (and optionally add another term for posts with a lower char count). Or...?
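To make the log entry concrete, here is a rough sketch that replays the logged term against the logged body, with and without the MathJax spans. Two assumptions: the pattern is rewritten for Python, which does not accept the inline `(?i)` after `^` the way the logged regex has it, and `INLINE_MATH` is a crude stand-in for whatever removal ends up being used:

```python
import re

# Python equivalent of the logged term "^(?i).{0,80}$": the case-insensitive
# flag is passed separately instead of inline.
SHORT_ANSWER = re.compile(r"^.{0,80}$", re.IGNORECASE)

# Crude stand-in for MathJax removal (assumption): drops $...$ spans only.
INLINE_MATH = re.compile(r"\$.+?\$")

# Body text from the log entry, without the <p> wrapper.
body = r"If $\alpha_1||y_1||>\alpha_2||y_2||$, then $x=-y_1$."
stripped = INLINE_MATH.sub("", body)

matches_raw = bool(SHORT_ANSWER.match(body))
matches_stripped = bool(SHORT_ANSWER.match(stripped))

# Share of this term's reviewed hits that were true positives,
# using the TPCount and FPCount values from the log entry.
tp, fp = 486, 119
precision = tp / (tp + fp)

print(len(body), matches_raw)
print(len(stripped), matches_stripped)
print(round(precision, 3))
```

The body is under 80 characters either way, so removing the MathJax would not stop this particular term from firing on this post; it only changes what the other terms get to see.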

@ghost
Author

ghost commented Jan 28, 2015

If the question or answer doesn't have enough content besides the mathjax, I think it will generally be LQ.

@honnza
Collaborator

honnza commented Jan 28, 2015

Seems LQ to me, but I'm not sure it needs our handling. The auto-whitelist should be able to handle that if we don't. I've never been a fan of that regex, actually.

@ArcticEcho
Owner

So... it looks like just a simple matter of adjusting the current terms. Should I continue to add mathjax scrapping then?

@honnza
Collaborator

honnza commented Jan 28, 2015

Please do. Not sure if it's strictly necessary, but it should be helpful.

@ArcticEcho
Owner

Sure, ok. Shall I just remove all mathjax or (somehow) convert it to plain text?

@ghost
Author

ghost commented Jan 28, 2015

I'd remove it so we don't match phone numbers or other filters.
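As an illustration of the phone-number concern, a small sketch with a made-up phone term. `PHONE` is hypothetical and not taken from Pham's term list, and `INLINE_MATH` is again a simplified removal that only pairs `$...$`:

```python
import re

# Hypothetical phone-number term (assumption, not one of Pham's terms).
PHONE = re.compile(r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b")

# Simplified MathJax removal (assumption): drops $...$ spans only.
INLINE_MATH = re.compile(r"\$.+?\$")

# A math post whose formula happens to contain ten-digit numbers.
post = "Note that $8005551234 = 2 \\cdot 4002775617$ holds."

hit_before = PHONE.search(post) is not None
hit_after = PHONE.search(INLINE_MATH.sub("", post)) is not None

print(hit_before, hit_after)
```

Converting the MathJax to plain text instead would keep the digits in the post, so a term like this could still fire.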

@honnza
Collaborator

honnza commented Jan 28, 2015

If you feel like parsing MathJax... sure, go ahead. Be sure to keep the main code clean, though.

@ArcticEcho
Owner

Alright, well I'm sure there's a library for that (I hope). Will do (I'm gonna put this as low priority until Pham's stable after the switch over to a CLI).
