RFC: Reimplementing Harvester as a CLI program #15
Comments
Awesome!
I'm not sure it's extremely relevant for the Harvester use case, but for something I did for another harvester, I was using the structure …. But maybe you don't need that if you plan to track this in a database (that …).
That would be great. It would also be good if there were some way to filter repositories where a license is detected but is unknown to GitHub. It's somewhat common for GitHub to not detect slightly modified Apache 2, BSD or MIT license files. For ecosystems with a small amount of open-source code, this may be useful.
Well, it prevents a mess when harvesting files with the same filename but different content. This is especially relevant when searching for filenames (as opposed to extensions), because the alternative is a folder with hundreds of ….

Then there are those repositories that contain hundreds of results, but for an unrelated extension with zero usage elsewhere on GitHub (often demo files written by language authors for a hobbyist scripting language, or whatever). Once you identify a few of the files as unrelated, it's easier to delete them first to check whether they comprise most of the search results. This can be the difference between an ostensibly eligible extension with 2,820 in-the-wild results, and an extension with only 86 actual uses and a single user whose project has 2,734 fixtures with that extension (limited to that repository only). Which has happened to me countless times in the past… and I usually realise it after spending hours of my time sorting files manually.
It's safer to prompt users to review ambiguous license files when running interactively (indicated by running Harvester with a … flag).
This sounds great. It reminds me, I've still got my hacky, incomplete and pretty slow (mainly because of the REST API rate limiting) Ruby script in my fork that I've been meaning to finish. You can see it at https://github.com/lildude/linguist/blob/lildude/download-corpus/script/github-ext-pop if you need any inspiration.
Rate-limiting? Erh... we're not talking "sooner than 5 seconds and we'll block you from loading this URL for good", right? 😰 One of the incentives to rewrite this was speed: even a modest-sized harvest takes forever to complete. I sometimes forget about it while it's running and then jump when I see an unexpected push notification. 😂 Probably compounded by Australia's crappy slow Internet speed. In a nutshell, what sort of rate limits am I up against…? 🤔
https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting and https://docs.github.com/en/rest/reference/search#rate-limit for Search-specific API rate limits. In short, if you're going to use the Search API as I have, you're limited to 30 requests per minute. There's also an abuse rate limit, but I can't tell you any more than what's in the docs. You'll know when you've hit that from the response headers though.
30 requests per minute? I'm guessing there's a hard limit on pagination size too? (Unless it's generously high, one could theoretically load 500 results per page in one request…) I'll dig into those docs later once I've finished doing sixteen things at once. 😁 Thanks!
Yup. Max 100 results per page. |
Compared to only 10 results per page, that's extraordinarily generous. 😁 If the user hits 30 requests in less than a minute, recovery time can always be spent downloading files from ….
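(For later reference, a rough Node.js sketch, not from the thread, of how those limits might be worked with: paging through code-search results 100 at a time and sleeping until the Search rate-limit window resets. The token handling and retry logic here are assumptions, not Harvester's actual design.)

```js
// Hypothetical sketch: page through GitHub's code-search endpoint 100 results
// at a time, pausing whenever the Search rate limit (30 requests/minute) runs out.
const TOKEN = process.env.GITHUB_TOKEN;

async function* searchCode(query) {
	for (let page = 1; page <= 10; ++page) { // Search results are capped at 1,000 (10 × 100)
		const url = `https://api.github.com/search/code?q=${encodeURIComponent(query)}&per_page=100&page=${page}`;
		const response = await fetch(url, {
			headers: {accept: "application/vnd.github+json", authorization: `token ${TOKEN}`},
		});
		if (!response.ok && response.headers.get("x-ratelimit-remaining") === "0") {
			// Rate-limited: wait until the window resets, then retry the same page
			const reset = Number(response.headers.get("x-ratelimit-reset")) * 1000;
			await new Promise(done => setTimeout(done, Math.max(0, reset - Date.now()) + 1000));
			--page;
			continue;
		}
		if (!response.ok)
			throw new Error(`HTTP ${response.status} for ${url}`);
		const {items, total_count} = await response.json();
		yield* items;
		if (page * 100 >= total_count) break; // Nothing left to fetch
	}
}
```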
👋 I did a drive-by rewrite for a day of learning, before seeing this issue. That version is at https://github.com/pjquirk/harvest-action. It's written to be an action first, and is focused on bulk monitoring, so it's probably not exactly what you're looking for. I'm happy to help adapt it, or otherwise feel free to use any of it, fork it, ignore it completely, or whatever 😄

Looking in from the outside, I'm advocating for making the tool Actions-friendly, since it's frequently being run to check if an extension is "widely used" so a PR can be merged. Can these usage metrics be nailed down a bit more, so you could make it a required check on PRs? If so, you don't need to create a tracking issue for not-widely-used extensions, since it's just the open PRs with the ….

One thing that came up during testing of ☝️ is that project Geyser recently changed which results are returned, so only active repos show up now. This drastically reduced a lot of the usage numbers, which makes popularity a little harder to determine.
Do Actions/Workflows have access to artefacts generated from previous runs? Half of the work on the Harvester rewrite will go towards locally-cached searches (a.k.a. “silos”). If so, it'll speed up searches considerably (and minimise how many API calls are dispatched).
Absolutely. How does a `--test-usage` switch sound?

```
# NB: The `--min-*` options exist chiefly to simplify local testing
$ harvest -t|--test-usage
          [-u|--min-users=200]
          [-r|--min-repos=200]
          [-p|--pull-request=$PULL_REQUEST_ID]
          candidates
```

With ….

Or, ….
Output would be a coloured diff (depicting the total number of files/repos/users before and after the scan):

```diff
2 change(s) to eligibility status:

@@ extension:kql @@
- 152 files
+ 4200 files
- 199 repos
+ 200 repos
- 199 users
+ 200 users

@@ extension:kusto @@
- 4200 files
+ 24245 files
- 199 repos
+ 200 repos
- 99 users
+ 199 users
```

…and terminate with a non-zero exit status, à la the following `run:` step:
```shell
harvest -p 4827 -t kql kusto csl:-xml || {
	# Run any code that needs to handle newly-eligible extensions
	exec something
	exit 1
}
```

If no changes are detected, ….

I'm unfamiliar with whatever environment variables are present when actions are executed. If there's a ….
Yeah, you can use the cache action to essentially zip/upload the working directory after a run and then restore it before the next run. Alternatively, you could set up a self-hosted runner that runs on some machine you own (which would leave the working directory intact after each run), but I believe there's a policy now against doing that for ….
That all sounds perfect!
I suppose this is needed to somehow figure out what extensions the PR is trying to validate? There's definitely context for the PR data in the event that triggered the action; see this comment for pulling out the number explicitly. You may already have an idea of how to get the new extensions, or maybe require PRs to include a file somewhere in the tree that lists the extensions/queries to harvest.
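(For illustration: when a workflow is triggered by a pull request, the event payload is written to the file named by `GITHUB_EVENT_PATH`, so the PR number could be read along these lines. A sketch only, not from the thread.)

```js
// Sketch: read the pull-request number from the event payload that GitHub
// Actions writes to disk for the triggering event.
import {readFileSync} from "node:fs";

const event = JSON.parse(readFileSync(process.env.GITHUB_EVENT_PATH, "utf8"));
const prNumber = event.pull_request?.number; // Defined for pull_request-triggered workflows
console.log(`Harvesting extensions added by PR #${prNumber}`);
```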
Ah, there's no need to go that far. We only need to cache a single JSON file containing search queries, harvested URLs, and anything else that can speed up incremental reruns.
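(Purely to illustrate the idea; every field name below is mine, not a committed format.)

```js
// Hypothetical shape of the cached state: enough to skip queries that were
// already run, and URLs that were already downloaded, on a previous run.
const cache = {
	"extension:kql": {
		lastRun:    "2020-09-01T10:20:30Z",
		totalCount: 4200,
		harvested: [
			"https://raw.githubusercontent.com/user/repo/branch/path/to/file.kql",
		],
	},
};
```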
Not quite. I imagine this feature would be useful for automatically kicking off a harvest whenever a new PR is submitted. A bot would then post a comment with the results, so Linguist maintainers don't have to report on in-the-wild usage manually. If we know the ID of the pull-request being validated, we can load its raw diff (by adding …).
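(A sketch of that idea, with two assumptions of mine: that the raw diff is obtained by appending `.diff` to the pull-request's URL, and that newly-added extensions can be spotted with a rough regular expression over the diff's added lines.)

```js
// Sketch: download a Linguist pull-request's raw diff and guess which file
// extensions it adds, by looking for added `- ".ext"` entries (e.g., in languages.yml).
async function extensionsAddedByPR(prNumber) {
	const response = await fetch(`https://github.com/github/linguist/pull/${prNumber}.diff`);
	const diff = await response.text();
	const extensions = new Set();
	for (const line of diff.split("\n")) {
		const match = line.match(/^\+\s*-\s*["']?(\.[A-Za-z0-9_+-]+)["']?\s*$/);
		if (match) extensions.add(match[1]);
	}
	return [...extensions];
}
```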
Alright… sooo, this isn't going to happen. GitHub's code search supports regular expressions and proper filename matching, obviating what Harvester set out to do in the first place: facilitate research into new file-formats, identify conflicting (and potentially undiscovered) formats, and tally user/repo counts to determine eligibility for support by Linguist. Moreover, most of these responsibilities have been delegated to @lildude, as discussed in github-linguist/linguist#5756 (who I assume has popularity monitoring automated on his end somewhere 😉).

Having said all this, there really isn't a need for a dedicated file-harvesting tool when downloading search results is (probably) a trivial application of GitHub's REST API. I'm archiving this project so there's no ambiguity over its maintenance status; its spaghetti code will stand as a time capsule for the sort of nonsense Linguist contributors had to deal with in the days before code search. 😉
I intend to rewrite Harvester as a dedicated CLI tool running atop Node.js. This implementation will query GitHub's APIs to obtain search results, eliminating the need to copy+paste huge chunks of code into one's dev-console (which is inelegant, runs slowly, and is prone to breakage).
I've opened this issue to collect feedback and ideas, and to discuss any issues users foresee with the requirement to use an authentication token (see “Caveats” below).
Features
- Multiple pages of search-results can be fetched simultaneously, reusing a single connection and pooling results either in memory or in a file (probably a CBOR file with a `.silo` extension), unlike `wget` or `curl`, which operate linearly (that is, they wait for each URL to finish transferring before requesting the next). † (See the sketch after this list.)
- Files are saved as `./user/repo/file.ext`, rather than `file.ext.452` or `./raw.githubusercontent.com/user/repo/branch/full/path/to/file.ext`.
- `.silo` files will contain timestamps of each run, which should make github/linguist#4219 easier to manage.
- No more poking around to determine if a sample is copylefted or unlicensed!
Reports can be produced by running something like ….
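(To make the first two bullets above concrete, here's a rough Node.js sketch. The function names, the flat `./user/repo/file.ext` layout, and the lack of error handling are all illustrative assumptions, not the planned implementation.)

```js
// Sketch: fetch raw-file URLs in parallel and save each one as
// ./user/repo/file.ext rather than file.ext.452 or a mirror of the full
// raw.githubusercontent.com path.
import {mkdir, writeFile} from "node:fs/promises";
import {dirname, join} from "node:path";

async function save(url) {
	// e.g. https://raw.githubusercontent.com/user/repo/branch/full/path/to/file.ext
	const [user, repo, , ...path] = new URL(url).pathname.split("/").filter(Boolean);
	const dest = join(".", user, repo, path[path.length - 1]);
	const data = Buffer.from(await (await fetch(url)).arrayBuffer());
	await mkdir(dirname(dest), {recursive: true});
	await writeFile(dest, data);
	return dest;
}

// Unlike wget/curl, every download is dispatched at once rather than one-by-one.
export const harvest = urls => Promise.all(urls.map(save));
```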
Caveats
Things to consider:
- … (which is what the `NOT nothack` trick is designed to circumvent).

Motivation
Currently, Harvester is using ~4-year-old code I slapped together as a quick workaround for an unexpected change in GitHub's handling of search results (specifically, preventing anonymous users from viewing them). Since then, it's required manual maintenance and hotfixes each time GitHub updated their front-end code. The slop I wrote in 2016 was never intended to last as long as it did… I knew it'd need rewriting eventually, but as long as it worked and the fixes were trivial, I considered it low priority.
Recently, real-world issues reminded me I won't always be able to respond to issues or e-mails in a timely fashion… or even at all. Ergo, I'm reevaluating the priority of outstanding tasks like this one.
Footnotes
/cc @pchaigno, @lildude, @smola