This repository has been archived by the owner on Apr 10, 2024. It is now read-only.

RFC: Reimplementing Harvester as a CLI program #15

Closed
Alhadis opened this issue Jul 22, 2020 · 13 comments

@Alhadis
Owner

Alhadis commented Jul 22, 2020

I intend to rewrite Harvester as a dedicated CLI tool running atop Node.js. This implementation will query GitHub's APIs to obtain search results, eliminating the need to copy and paste huge chunks of code into one's dev-console (which is inelegant, slow, and prone to breakage).

I've opened this issue to collect feedback and ideas, and to discuss any issues users foresee with the requirement to use an authentication token (see “Caveats” below).

Features

  1. Concurrent operation:
    Multiple pages of search-results can be fetched simultaneously, reusing a single connection and pooling results either in memory or in a file (probably a CBOR file with a .silo extension); see the sketch after this list.
  2. Downloading actual files:
    1. Faster harvests: since we can download files concurrently, Harvester will potentially be faster than tools like wget or curl, which operate linearly (that is, they wait for each URL to finish transferring before requesting the next).
    2. Duplicate-file detection is easier, especially if GitHub's APIs provide a checksum when querying file metadata.
    3. Smarter subdirectory structure: Files will be downloaded to
      ./user/repo/file.ext rather than file.ext.452 or ./raw.githubusercontent.com/user/repo/branch/full/path/to/file.ext.
  3. Storing harvested URLs in .silo files:
    1. Harvests can be resumed: Aside from resuming an interrupted harvest, users can update a previous harvest without re-downloading files that were already reaped in an earlier run (feature #2 can track the locations of downloaded files, providing a means of tracking which files have been evaluated).
    2. Easier tracking of usage over time: .silo files will contain timestamps of each run, which should make github/linguist#4219 easier to manage.
  4. Ability to filter files with permissive and unambiguous licenses:
    No more poking around to determine if a sample is copylefted or unlicensed!
  5. Easier wrangling of user/repo counts:
    Reports can be produced by running something like
    λ harvest -s foo
    Summary of `.foo` usage:
    * Users: 424
    * Repos: 653
    * Files: 1453 (2428 total)
    * Last updated: 2020-07-22T14:56:28.196Z (3 minutes ago)
    Or even in Markdown format (for pasting into issues/PRs):
    λ harvest -s -m foo
    ## `.foo` usage
    * Users: 424
    * Repos: 653
    * Files: 1453 ([2428 total](https://github.com/advanced/search/link))
    * Last updated: 2020-07-22T14:56:28.196Z (3 minutes ago)
    
    <details><summary><b>1,523 total files:</b></summary>
    
    ```
    file:///1
    file:///2
    file:///3/and/so/forth
    ```
    
    </details>
  6. I finally get to write a man page: So I don't lose my shit at Markdown again for being the worst markup format for anything more elaborate than a readme.
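
To make features 1–3 a little more concrete, here's a minimal sketch of fetching several pages of code-search results concurrently and pooling them in memory. It assumes Node 18+ (for the built-in fetch()) and a token in the environment; the searchPage/searchCode names are illustrative, not actual Harvester code.

```js
// Illustrative only: concurrent page fetches against GitHub's code-search API.
// The token comes from the environment, as described under "Caveats" below.
const TOKEN = process.env.GITHUB_TOKEN;

async function searchPage(query, page){
	const url = new URL("https://api.github.com/search/code");
	url.searchParams.set("q", query);
	url.searchParams.set("per_page", "100"); // API maximum (see rate-limit discussion further down)
	url.searchParams.set("page", String(page));
	const response = await fetch(url, {headers: {
		accept: "application/vnd.github+json",
		authorization: `token ${TOKEN}`,
	}});
	if(!response.ok) throw new Error(`HTTP ${response.status} for page ${page}`);
	return response.json();
}

// Fire off several pages at once and pool their items in memory
async function searchCode(query, pages = 5){
	const results = await Promise.all(
		Array.from({length: pages}, (_, i) => searchPage(query, i + 1)),
	);
	return results.flatMap(page => page.items);
}
```

The pooled items could then be serialised into the .silo file for resumable or incremental runs.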

Caveats

Things to consider:

  • Users will need to generate an authentication token before they can use GitHub's APIs. The token can be passed to the Harvester executable by setting an environment variable.
  • I actually haven't delved too far into GitHub's API docs. Ergo, there may be limitations to its site-wide file search that I'm unaware of (ones that our NOT nothack trick is designed to circumvent).

Motivation

Currently, Harvester is using ~4-year-old code I slapped together as a quick workaround for an unexpected change in GitHub's handling of search results (specifically, preventing anonymous users from viewing them). Since then, it's required manual maintenance and hotfixes each time GitHub updated their front-end code. The slop I wrote in 2016 was never intended to last as long as it did… I knew it'd need rewriting eventually, but as long as it worked and the fixes were trivial, I considered it low priority.

Recently, real-world issues reminded me I won't always be able to respond to issues or e-mails in a timely fashion… or even at all. Ergo, I'm reevaluating the priority of outstanding tasks like this one.

Footnotes

  •  This is actually the sane thing to do, when you consider race conditions with conflicting filenames, or URLs which influence subsequent server responses.
  •  Markdown's only feature is readability. Prove me wrong.

/cc @pchaigno, @lildude, @smola

@Alhadis Alhadis self-assigned this Jul 22, 2020
@smola

smola commented Jul 23, 2020

Awesome!

Smarter subdirectory structure: Files will be downloaded to
./user/repo/file.ext rather than file.ext.452 or ./raw.githubusercontent.com/user/repo/branch/full/path/to/file.ext

I'm not sure it's extremely relevant for the Harvester use case, but for something I did for another harvester, I used the structure <user>/<repo>/<commit sha>/<full-path-to-file>. While it's certainly less convenient for manual navigation, it makes it much easier to attribute the origin of the file precisely, and to generate a GitHub URL pointing to the exact version of the file that was harvested. This is useful when a file was deleted in a branch or when it changes to a different language (yes! this happens!).

But maybe you don't need that if you plan to track this in a database (that .silo file?).

Ability to filter files with permissive and unambiguous licenses

That would be great. It would also be good if there were some way to filter repositories where a license is detected but is unknown to GitHub. It is somewhat common for GitHub to not detect slightly modified Apache 2, BSD or MIT license files. For ecosystems with a small amount of open source code, this may be useful.
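
A rough sketch of the kind of filter being discussed, assuming GitHub's repository-license endpoint: the classifyLicense name and the permissive-ID list are made up for illustration, and an spdx_id of "NOASSERTION" is (as far as I can tell) how the API marks a license file it couldn't identify.

```js
// Hedged sketch only: classify a repository's detected license via the REST API.
const PERMISSIVE = new Set(["MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "0BSD", "Unlicense"]);

async function classifyLicense(owner, repo, token = process.env.GITHUB_TOKEN){
	const response = await fetch(`https://api.github.com/repos/${owner}/${repo}/license`, {
		headers: {accept: "application/vnd.github+json", authorization: `token ${token}`},
	});
	if(response.status === 404) return "unlicensed"; // No license file at all
	const {license} = await response.json();
	if(!license || license.spdx_id === "NOASSERTION")
		return "unrecognised"; // A license file exists, but GitHub couldn't identify it
	return PERMISSIVE.has(license.spdx_id) ? "permissive" : "other";
}
```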

@Alhadis
Owner Author

Alhadis commented Jul 23, 2020

I'm not sure it's extremely relevant for the Harvester use case

Well, it prevents a mess when harvesting files with the same filename, but different content. This is especially relevant when searching for filenames (as opposed to extensions), because the alternative is a folder with hundreds of .filename.1, .filename.2, .filename.3 … .filename.104, .filename.105, et cetera…

Then there're those repositories that contain hundreds of results, but for an unrelated extension with zero usage elsewhere on GitHub (often demo files written by language authors for a hobbyist scripting language, or whatever). Once you identify a few of the files as unrelated, it's easier to delete them first to check if they comprise most of the search results.

This can be the difference between an ostensibly eligible extension with 2,820 in-the-wild results, and an extension with only 86 actual uses and a single user whose project has 2,734 fixtures with that extension (limited to that repository only). That has happened to me countless times in the past… and I usually realise it after spending hours of my time sorting files manually...

But maybe you don't need that if you plan to track this in a database (that .silo file?)

The .silo file will basically be a record of harvested files, with metadata and URL lists that currently have to be copied manually from the dev-console (and then grepped through with other tools to assemble a report of some kind).

It would be good if there is some way to also filter repositories where there is a license detected but it is unknown to GitHub.

It's safer to prompt users to review ambiguous license files when running interactively (indicated by running Harvester with a -I switch). This needs to be opt-in behaviour, though, since enabling it by default violates the PoLA.

@lildude

lildude commented Jul 23, 2020

This sounds great. It reminds me, I've still got my hacky, incomplete and pretty slow (mainly because of the REST API rate limiting) Ruby script in my fork that I've been meaning to finish. You can see it at https://github.com/lildude/linguist/blob/lildude/download-corpus/script/github-ext-pop if you need any inspiration.

@Alhadis
Owner Author

Alhadis commented Jul 23, 2020

Rate-limiting? Erh... we're not talking "sooner than 5 seconds and we'll block you from loading this URL for good", right? 😰 One of the incentives to rewrite this was speed: even a modest-sized harvest takes forever to complete. I sometimes forget about it while it's running and then jump when I see an unexpected push notification. 😂 Probably compounded by Australia's crappy slow Internet speed.

In a nutshell, what sort of rate limits am I up against…? 🤔

@lildude

lildude commented Jul 23, 2020

In a nutshell, what sort of rate limits am I up against…? 🤔

https://docs.github.com/en/rest/overview/resources-in-the-rest-api#rate-limiting and https://docs.github.com/en/rest/reference/search#rate-limit for Search-specific API rate limits.

In short, if you're going to use the Search API as I have, you're limited to 30 requests per minute. There's also an abuse rate limit, but I can't tell you any more than what's in the docs. You'll know when you've hit that from the response headers though.
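
Assuming those limits, a request wrapper could watch the rate-limit headers and sleep until the window resets; here's a minimal sketch (the searchRequest name is hypothetical, not from any existing tool).

```js
// Sketch of rate-limit handling: back off when the Search API window
// (30 requests/minute when authenticated) is exhausted, or when a secondary
// ("abuse") limit returns a Retry-After header.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function searchRequest(url, headers){
	for(;;){
		const response = await fetch(url, {headers});
		if(response.status !== 403) return response;
		const retryAfter = response.headers.get("retry-after");
		const remaining  = Number(response.headers.get("x-ratelimit-remaining"));
		if(retryAfter)
			await sleep(Number(retryAfter) * 1000); // Secondary limit: honour Retry-After
		else if(remaining === 0){
			const reset = Number(response.headers.get("x-ratelimit-reset")) * 1000;
			await sleep(Math.max(reset - Date.now(), 0) + 1000); // Primary limit: wait for the reset
		}
		else return response; // 403 for some other reason (permissions, etc.)
	}
}
```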

@Alhadis
Owner Author

Alhadis commented Jul 23, 2020

30 requests per minute? I'm guessing there's a hard limit on pagination size too? (Unless it's generously high, one could theoretically load 500 results per page in one request…)

I'll dig into those docs later once I've finished doing sixteen things at once. 😁 Thanks!

@lildude

lildude commented Jul 23, 2020

I'm guessing there's a hard limit on pagination size too?

Yup. Max 100 results per page.

@Alhadis
Owner Author

Alhadis commented Jul 23, 2020

Compared to only 10 results per page, that's extraordinarily generous. 😁

If the user hits 30 requests in less than a minute, recovery time can always be spent downloading files from https://raw.githubusercontent.com/ (assuming the user's opted to download files as well as URLs/stats).
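
Sketched out, that cooldown idea might look something like the following. The queue shape and the drainDownloads name are assumptions, and raw-file fetches are presumed not to count against the search quota.

```js
// Illustration only: while the search API cools down, drain a queue of
// raw-file downloads instead of idling.
import {mkdir, writeFile} from "node:fs/promises";
import {dirname} from "node:path";

async function drainDownloads(queue, untilTimestamp){
	while(queue.length && Date.now() < untilTimestamp){
		// e.g., {url: "https://raw.githubusercontent.com/user/repo/branch/path/file.ext",
		//        savePath: "./user/repo/file.ext"}
		const {url, savePath} = queue.shift();
		const response = await fetch(url);
		if(!response.ok) continue;
		await mkdir(dirname(savePath), {recursive: true});
		await writeFile(savePath, Buffer.from(await response.arrayBuffer()));
	}
}
```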

@pjquirk

pjquirk commented Jan 5, 2021

👋 I did a drive-by re-write for a day of learning, before seeing this issue. That version is at https://github.com/pjquirk/harvest-action. It's written to be an action first, and is focused on bulk monitoring, so it's probably not exactly what you're looking for. I'm happy to help adapt it, or otherwise feel free to use any of it, fork it, ignore it completely, or whatever 😄

Looking in from the outside, I'm advocating for making the tool Actions-friendly, since it's frequently being run to check whether an extension is 'widely used' so a PR can be merged. Can these usage metrics be nailed down a bit more, so you could make it a required check on PRs? If so, you don't need to create a tracking issue for not-widely-used extensions, since it's just the open PRs with the pending popularity tag. It should be up to the PR authors to check their extension's popularity from time to time, rather than something for the Linguist folks to do.

One thing that came up during testing of ☝️ is that project Geyser recently changed which results are returned, so only active repos show up now. This drastically reduced a lot of the usage numbers, which makes popularity a little harder to determine.

@Alhadis
Owner Author

Alhadis commented Jan 5, 2021

Do Actions/Workflows have access to artefacts generated from previous runs? Half of the work on the Harvester rewrite will go towards locally-cached searches (a.k.a. “silos”). If so, it'll speed up searches considerably (and minimise how many API calls are dispatched).

Can these usage metrics be nailed down a bit more, so you could make it a required check on PRs?

Absolutely. How does a -t / --test-usage switch sound? Option-wise, it might look like

# NB: The `--min-*` options exist chiefly to simplify local testing
$ harvest -t|--test-usage
         [-u|--min-users=200]
         [-r|--min-repos=200]
         [-p|--pull-request=$PULL_REQUEST_ID]
         candidates

With candidates being either

  1. Explicitly defined using options:
    # At least one of these must be present!
    $ harvest -t [-e|--extensions]
                 [-f|--filenames]
                 [-q|--queries]

Or,

  2. Implicitly, treating $ARGV as a list of file extensions:
    # Equivalent to `extension:kql extension:kusto extension:csl NOT nothack`
    $ harvest -t [--] kql kusto csl

Output would be a coloured diff (depicting the total number of files/repos/users before and after the scan):

2 change(s) to eligibility status:

@@ extension:kql @@
- 152 files
+ 4200 files

- 199 repos
+ 200 repos

- 199 users
+ 200 users

@@ extension:kusto @@
- 4200 files
+ 24245 files

- 199 repos
+ 200 repos

- 99 users
+ 199 users

and terminate with a non-zero exit status, à la the diff(1) command. So your Action config (or Workflow definition) might have something like:

run: |
    harvest -p 4827 -t kql kusto csl:-xml || {
        # Run any code that needs to handle newly-eligible extensions
        exec something
        exit 1
    }

If no changes are detected, harvest -t emits no output and exits cleanly (with a 0 exit status; again, like the diff(1) command).
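
As a rough sketch of the comparison behind -t, assuming a silo that maps each query to its last-known counts (loadSilo and countUsage are hypothetical helpers, not defined anywhere yet):

```js
// Sketch only: diff cached counts against fresh ones and exit non-zero when
// any query's eligibility status changes, à la diff(1).
async function testUsage(queries, {minUsers = 200, minRepos = 200} = {}){
	const silo = await loadSilo();              // Previous counts, keyed by query
	const changes = [];
	for(const query of queries){
		const before = silo[query] || {files: 0, repos: 0, users: 0};
		const after  = await countUsage(query); // Fresh counts from the search API
		const wasEligible = before.users >= minUsers && before.repos >= minRepos;
		const nowEligible = after.users  >= minUsers && after.repos  >= minRepos;
		if(wasEligible !== nowEligible) changes.push([query, before, after]);
	}
	if(changes.length){
		console.log(`${changes.length} change(s) to eligibility status:\n`);
		for(const [query, before, after] of changes){
			console.log(`@@ ${query} @@`);
			for(const key of ["files", "repos", "users"])
				console.log(`- ${before[key]} ${key}\n+ ${after[key]} ${key}\n`);
		}
	}
	process.exitCode = changes.length ? 1 : 0;
}
```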

I'm unfamiliar with whatever environment variables are present when actions are executed. If there's a PULL_REQUEST_ID or something similar set in the environment, its value could be used as the default if --pull-request is omitted.

@pjquirk

pjquirk commented Jan 6, 2021

Do Actions/Workflows have access to artefacts generated from previous runs? Half of the work on the Harvester rewrite will go towards locally-cached searches (a.k.a. “silos”). If so, it'll speed up searches considerably (and minimise how many API calls are dispatched).

Yeah you can use the cache action to essentially zip/upload the working directory after a run and then restore it before the next run. Alternatively you could set up a self-hosted runner that runs on some machine you own (which would leave the working directory intact after each run), but I believe there's a policy now against doing that for github org-level repos.

terminate with a non-zero exit status, à la the diff(1) command

That all sounds perfect!

I'm unfamiliar with whatever environment variables are present when actions are executed. If there's a PULL_REQUEST_ID or something similar set in the environment, its value could be used as the default if --pull-request is omitted.

I suppose this is needed to somehow figure out what extensions the PR is trying to validate? There's definitely context for the PR data in the event that triggered the action, see this comment for pulling out the number explicitly. You may have an idea of how to get the new extensions already, or maybe require PRs to include a file in the tree somewhere that has the extensions/queries to harvest.

@Alhadis
Owner Author

Alhadis commented Jan 6, 2021

Yeah you can use the cache action to essentially zip/upload the working directory after a run and then restore it before the next run.

Ah, there's no need to go that far. We only need to cache a single JSON file containing search queries, harvested URLs, and anything else that can speed up incremental reruns.
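
Something like this is all the caching I have in mind; the file name and record shape below are invented purely for illustration.

```js
// Sketch of the cached silo: one JSON file recording queries, harvested URLs,
// and timestamps, so incremental reruns can skip what's already been reaped.
import {readFile, writeFile} from "node:fs/promises";

const SILO_PATH = "harvest.silo.json";

async function loadSilo(){
	try{ return JSON.parse(await readFile(SILO_PATH, "utf8")); }
	catch{ return {queries: {}, urls: {}, updated: null}; }
}

async function saveSilo(silo, newURLs = []){
	for(const url of newURLs)
		silo.urls[url] = silo.urls[url] || {downloaded: false, firstSeen: new Date().toISOString()};
	silo.updated = new Date().toISOString();
	await writeFile(SILO_PATH, JSON.stringify(silo, null, "\t"));
}
```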

I suppose this is needed to somehow figure out what extensions the PR is trying to validate?

Not quite. I imagine this feature would be useful for automatically kicking off a harvest whenever a new PR is submitted. A bot would then post a comment with the results, so Linguist maintainers don't have to report on in-the-wild usage manually.

If we know the ID of the pull-request being validated, we can load its raw diff (by adding .diff to the PR's URL) and scan for changes to languages.yml that insert entries to a filenames or extensions field.
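
A naive sketch of that diff-scan; the regex is a rough assumption about how extension additions to languages.yml would look in a unified diff, and the extensionsFromPR name is made up.

```js
// Illustration only: pull the PR's unified diff and collect list entries that
// were added under lib/linguist/languages.yml and look like file extensions.
async function extensionsFromPR(prURL){
	const diff = await (await fetch(`${prURL}.diff`)).text();
	const extensions = new Set();
	let inLanguagesYAML = false;
	for(const line of diff.split("\n")){
		if(line.startsWith("+++ "))
			inLanguagesYAML = line.includes("lib/linguist/languages.yml");
		if(!inLanguagesYAML || !line.startsWith("+")) continue;
		const match = line.match(/^\+\s*- ["']?(\.[\w.+-]+)["']?,?\s*$/);
		if(match) extensions.add(match[1]);
	}
	return [...extensions];
}
```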

@Alhadis
Owner Author

Alhadis commented Apr 10, 2024

Alright… sooo, this isn't going to happen. GitHub's code search supports regular expressions and proper filename matching, obviating what Harvester set out to do in the first place: facilitate research into new file formats, identify conflicting (potentially undiscovered) formats, and tally user/repo counts to determine eligibility for support by Linguist. Moreover, most of these responsibilities have been delegated to @lildude (who, I assume, has popularity monitoring automated on his end somewhere 😉), as discussed in github-linguist/linguist#5756.

Having said all this, there really isn't a need for a dedicated file-harvesting tool when downloading search results is (probably) a trivial application of GitHub's REST API. I'm archiving this project so there's no ambiguity over its maintenance status; its spaghetti code will stand as a time capsule for the sort of nonsense Linguist contributors had to deal with in the days before Code Search. 😉

@Alhadis Alhadis closed this as not planned Apr 10, 2024