Detect publishing platforms #90
Similar to #77, detect the presence of publishing platforms like WordPress and Drupal. A secondary goal would be to detect themes and plugins.
Unlike #77, the key metric here would just be a single string/enum value representing the detected platform - as opposed to a list.
For accurate detection, we need to come up with a list of signals for each platform. This may be hard to achieve through custom metrics alone. For example, if it requires introspection of a script file's comments, that wouldn't be possible with client-side JS alone. We may need to do post-processing on response bodies.
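For illustration, here's the kind of body-level check a post-processing step could run; the comment pattern below is hypothetical, just to show a signal that in-page JS can't observe:

```js
// Hypothetical post-processing check over a stored response body.
// A banner comment inside a script file never reaches the DOM, so it's
// invisible to client-side JS, but trivial to match on the raw body text.
var SCRIPT_COMMENT_SIGNAL = /\/\*![^*]*WordPress/i; // hypothetical pattern

function detectFromBody(body) {
  return SCRIPT_COMMENT_SIGNAL.test(body);
}
```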
Ideally, we want a breakdown similar to https://trends.builtwith.com/cms/. However, as a starting point, I think we can focus our conversation on WordPress and figure out the requirements and pipeline for that. With that in mind, a few thoughts...
There are two (complementary) ways we can attempt to detect these platforms:

1. At runtime, via custom metrics (client-side JavaScript executed in the page during the crawl).
2. After the fact, via post-processing of the recorded response headers and bodies.
My hunch is that we'll get the most mileage from focusing on (2). In the context of WordPress, the markup itself carries strong signals: a `<meta name="generator">` tag advertising WordPress, `<link>` tags pointing at a `wlwmanifest`, and asset URLs referencing `wp-includes`.
There may also be runtime-specific signals we can extract, but I propose we focus on (2) as a starting point and see how far that gets us. Another benefit of (2) is that we can update the logic and rerun the analysis on past crawls, giving us access to trending data and all the rest. Last but not least, we shouldn't restrict ourselves to a single label. In some cases we may be able to extract the version number and other metadata, so the output should be another bag of values:
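For example (the field names and values here are illustrative, not a fixed schema):

```js
// Illustrative output shape: one detected platform plus whatever
// metadata we can recover (version, theme, plugins, ...).
var detection = {
  platform: 'WordPress',
  version: '4.7',
  theme: 'twentyseventeen',
  plugins: ['jetpack']
};
```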
Concretely, we can extend the current DataFlow pipeline with an extra step and start encoding these rules there. For prototyping, we can also run queries directly in BigQuery.
Working on a proof of concept: https://github.com/rviscomi/httparchive/blob/pub-cm/custom_metrics/publishing-platform.js
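An in-page check along these lines might look like the sketch below. This is my illustration of the approach, not the contents of the linked POC; the signal names are hypothetical:

```js
// Rough sketch of a client-side custom metric for WordPress signals.
function detectPublishingPlatform() {
  var generator = document.querySelector('meta[name="generator"]');
  var signals = {
    generator: !!(generator && /WordPress/i.test(generator.content || '')),
    wlwmanifest: !!document.querySelector('link[href*="wlwmanifest"]'),
    wpIncludes: !!document.querySelector('script[src*="wp-includes"], link[href*="wp-includes"]')
  };
  var detected = signals.generator || signals.wlwmanifest || signals.wpIncludes;
  return JSON.stringify({ platform: detected ? 'WordPress' : null, signals: signals });
}
```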
@igrigorik do you see (1) as the low-hanging first pass for well-formed pages, with (2) taking a closer look at everything else not already detected?
I think we can get real data more quickly with (1), albeit with more false negatives. In any case, I'll also look into extending the DataFlow pipeline as you mentioned.
To test the header signal, I extracted response headers for a 100,000-request sample into httparchive:scratchspace.response_headers:

```sql
SELECT
  page,
  JSON_EXTRACT(payload, '$.response.headers') AS response_headers
FROM [httparchive:har.2017_03_15_chrome_requests]
LIMIT 100000
```
Then ran this query on it:
```sql
SELECT page, response_headers
FROM (
  SELECT
    page,
    response_headers,
    REGEXP_MATCH(response_headers, 'X-Hacker') AS wordpress
  FROM [httparchive:scratchspace.response_headers]
)
WHERE wordpress = true
```
I also tried changing the regexp pattern, but it seems like this isn't a strong signal. WDYT?
Hmm. We could sanity check against: https://vip.wordpress.com/clients/
On the other hand, lots of sites on that client list don't deliver the above header either.
^ Perhaps we should also look for files.wordpress.com, although I'm not sure if that's VIP-only or true for any wordpress.com-hosted site.
OK, I updated the query to be case-insensitive and match both of the signals above.
I stuffed those 27 results into httparchive:scratchspace.wordpress_headers, then ran this against the response bodies table:
```sql
SELECT page, url, body
FROM [httparchive:har.2017_03_15_chrome_requests_bodies]
WHERE page IN (
  SELECT page FROM [httparchive:scratchspace.wordpress_headers]
)
```
This joins the pages that have WP headers with their corresponding response bodies, saved to httparchive:scratchspace.wordpress_response_bodies. Finally, I queried that table with the same signals used in the custom metric POC:
```sql
SELECT COUNT(0), page
FROM [httparchive:scratchspace.wordpress_response_bodies]
WHERE REGEXP_MATCH(body, r'(?i)(<meta[^>]*WordPress|<link[^>]*wlwmanifest|src=[\'"]?[^\'"]*wp-includes)')
GROUP BY page
```
Of the 100,000 pages sampled, 27 pages were detected with WP headers. 25 of those also had corresponding WP signals in the response body. The discrepancy seems to be due to a conflicting use of the X-Hacker header.
That said, it seems like markup analysis is no worse a signal than header analysis. So I ran a related query to see how much better markup analysis is:
```sql
SELECT COUNT(0), page
FROM [httparchive:har.2017_03_15_chrome_requests_bodies]
WHERE REGEXP_MATCH(body, r'(?i)(<meta[^>]*WordPress|<link[^>]*wlwmanifest|src=[\'"]?[^\'"]*wp-includes)')
  AND page IN (SELECT page FROM [httparchive:scratchspace.response_headers])
GROUP BY page
```
This runs the body analysis over only the 100,000 pages sampled for the header analysis. There are 9,677 results, or about 10%. That's still about half of the share BuiltWith reports for WordPress, so it seems like there are other strong signals we're missing.
As a meta thing, it'd be nice to start building a list of test cases and explanations for each pattern. Otherwise, based on past experience, you quickly end up with unwieldy regexes that break easily and are impossible to maintain long-term.
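Something along these lines, perhaps (the structure and entries below are just a suggestion):

```js
// Hypothetical test-case registry: each signal carries an explanation
// plus examples it should (and shouldn't) match.
var signals = [
  {
    name: 'meta-generator',
    pattern: /<meta[^>]*WordPress/i,
    explanation: 'WordPress emits a <meta name="generator"> tag by default.',
    shouldMatch: ['<meta name="generator" content="WordPress 4.7.3">'],
    shouldNotMatch: ['<meta name="description" content="My blog">']
  }
];

// A trivial test runner over the registry.
signals.forEach(function (s) {
  s.shouldMatch.forEach(function (t) { console.assert(s.pattern.test(t), s.name); });
  s.shouldNotMatch.forEach(function (t) { console.assert(!s.pattern.test(t), s.name); });
});
```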
Definitely. I'd first like to figure out which signals are weak/strong/redundant and narrow it down to a minimal list of strong signals.
"Weak" can mean that the signal has a high number of false positives or a low number of true positives. E.g., the X-Hacker header matched only 27 of the 100,000 pages sampled, so its true positive rate is low.
Good news! Someone has already thought about this: Wappalyzer.
We could do something similar to the library detector and generate a custom metric script based on the Wappalyzer detection rules.
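A rough sketch of how that generated metric could work at runtime, assuming Wappalyzer-style rules (regexes keyed by where they apply; the rule data below is illustrative, not Wappalyzer's actual ruleset):

```js
// Hedged sketch: evaluate Wappalyzer-style rules against the live page.
// Rule shapes and values are illustrative only.
var rules = {
  WordPress: {
    meta: { generator: /WordPress(?:\s([\d.]+))?/i },
    script: /wp-includes/i
  },
  Drupal: {
    script: /drupal\.js/i
  }
};

function detect() {
  var results = {};
  var generatorEl = document.querySelector('meta[name="generator"]');
  var generator = (generatorEl && generatorEl.content) || '';
  var scripts = Array.prototype.map
    .call(document.scripts, function (s) { return s.src; })
    .join('\n');

  Object.keys(rules).forEach(function (app) {
    var rule = rules[app];
    var metaMatch = rule.meta && rule.meta.generator.exec(generator);
    var scriptMatch = rule.script && rule.script.test(scripts);
    if (metaMatch || scriptMatch) {
      // Capture the version when the generator tag exposes it.
      results[app] = { version: (metaMatch && metaMatch[1]) || null };
    }
  });
  return JSON.stringify(results);
}
```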