
Code Deposit Spike - GitHub Webhooks #5209

Closed
dlmurphy opened this issue Oct 17, 2018 · 21 comments

Comments

@dlmurphy
Contributor

dlmurphy commented Oct 17, 2018

Related: #2739

We're currently working on designing a workflow that will allow Dataverse users to connect their GitHub accounts using GitHub's webhooks, and then import a GitHub repo for deposit into Dataverse in the form of a dataset containing a .zip with all files from the repo. Users will be able to configure a "sync" whereby whenever a GitHub repo is updated in a specific way (probably when a release is minted), the dataset in Dataverse will be automatically updated.

This spike will help us learn more about what's possible when using GitHub webhooks with Dataverse.

Some questions we have that will help inform the design of this feature:

  1. Using GitHub’s webhooks, what events could we set to trigger syncing with Dataverse?
    E.g. releases, commits, pull requests
  2. When connecting your Dataverse account to your GitHub account, how does that initial connection get made?
    a. Can a Dataverse user authenticate w/ GitHub in a popup, and have that trigger a change to the Create Dataset or Add Files page?
  3. How can we determine which GitHub repositories a user can select from in Dataverse?
    a. E.g. Can the Dataverse user select from a list of GitHub repos he owns, or repos he has certain other permissions on?
  4. Can GitHub trigger a file replace in Dataverse when a new release is minted in GitHub?
  5. What metadata can we get from the GitHub repo?
  6. Is there an upper limit on the size of a GitHub repo we can reasonably import into Dataverse?
@qqmyers
Member

qqmyers commented Oct 17, 2018

Cool! FWIW - if your workflow involves the existing workflow mechanism in Dataverse, you might want to start from #5048, which is winding its way through the process. It lets you send any Dataverse setting you want to the workflow and fixes some transaction issues (at least in local workflows; I'm less sure the problem existed for workflows that do callbacks). As an example, #5049 is a workflow that submits a zipped bag to the Digital Preservation Network, and it gets the hostname/port, etc. from Dataverse settings.

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Oct 25, 2018

After doing some research, here are some answers to our initial questions:

  1. Using GitHub’s webhooks, what events could we set to trigger syncing with Dataverse? (E.g. releases, commits, pull requests)

    • Answer: We can receive data payloads when any event we'd be interested in happens (link)
  2. When connecting your Dataverse account to your GitHub account, how does that initial connection get made?

    • Answer: There are a few options. It may be worth doing the first before the second as incremental progress:
        1. Have a user create a webhook themselves via their GitHub settings page. We'd provide the user with the Dataverse API endpoint URL to pass in, as well as a shared secret we can authenticate on. This could begin via a popup on our page. This does not require GitHub auth.
        2. Automate this process by launching a popup, similar to Zenodo (link). This is discussed here (link).
            • It looks like you have to register your application with GitHub. This would require each installation to register.
    2a. Can a Dataverse user authenticate w/ GitHub in a popup, and have that trigger a change to the Create Dataset or Add Files page?
      • Answer: Yes
  3. How can we determine which GitHub repositories a user can select from in Dataverse? (E.g. Can the Dataverse user select from a list of GitHub repos he owns, or repos he has certain other permissions on?)

    • Answer: A user grants access to certain scopes when authorizing with OAuth. That access can then be used to make API calls that grab the list of repos ( link , link , link )
  4. Can GitHub trigger a file replace in Dataverse when a new release is minted in GitHub?

    • Answer: GitHub can send an automated notice (webhook) of a new release, which Dataverse can then act upon to grab the data and do a replace.
  5. What metadata can we get from the GitHub repo?

    • Answer: There is a lot that can be pulled about organizations, repos and projects (link)
  6. Is there an upper limit on the size of a GitHub repo we can reasonably import into Dataverse?

    • Answer: GitHub does not limit downloads of releases, though individual files in a release are capped at 2GB, so that should not conflict with Dataverse (link). There seem to be more limits if we want to do updates based upon a commit; there is some info at (link). There may be a limit of 1GB per month for downloading the source, though I think I am misunderstanding it.

Other interesting info I found:

  • Git itself supports hooks, without GitHub (link)
    • It seems possible to create a script that Git users can drop into their repo to trigger a push to Dataverse on certain actions
    • As far as I can tell, releases are not baked into Git itself; if we supported this, maybe we'd tap into git archive? (link)
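The shared-secret authentication mentioned in answer 2 works because GitHub signs each webhook delivery with an HMAC of the raw payload, sent in the X-Hub-Signature header. A minimal verification sketch (the secret and payload values below are made up for illustration; a real secret would come from Dataverse configuration):

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Verify a GitHub webhook delivery against its X-Hub-Signature header.

    GitHub sends 'sha1=<hexdigest>', the HMAC-SHA1 of the raw request
    body keyed with the webhook's shared secret.
    """
    expected = "sha1=" + hmac.new(secret, payload, hashlib.sha1).hexdigest()
    # Constant-time comparison avoids leaking the digest via timing.
    return hmac.compare_digest(expected, signature_header)

# Illustrative values only:
secret = b"dataverse-shared-secret"
body = b'{"action": "published"}'
good_sig = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
```

The Dataverse endpoint would recompute the digest over the raw body before parsing any JSON, and reject the delivery on a mismatch.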

@pdurbin
Member

pdurbin commented Oct 25, 2018

  • As far as I can tell releases are not baked into git itself, if we supported this maybe we'd tap into archive?

Releases in GitHub are based on Git tags. See https://help.github.com/articles/about-releases/

I'll probably have a number of questions but one that's top of mind is this:

  • What if Dataverse is down when GitHub sends a webhook? Is the webhook queued? Does it retry for several days like SMTP? Or is a webhook "fire and forget" and gone forever if your server isn't up to receive it?

@matthew-a-dunlap
Contributor

matthew-a-dunlap commented Oct 26, 2018

It looks like GitHub webhooks do not retry. Some remediation options I can see: provide some way through the UI for the user to publish a release from GitHub manually; poll at regular intervals on top of the webhook; or just poll. A good thing to discuss in our next group meeting.

If we decide to do polling, there are rate limits to take into account. These are per individual user, with a limit of 5,000 requests per hour for that user across all applications.
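One way to poll without eating into that limit is GitHub's conditional requests: a 304 Not Modified reply to an If-None-Match request does not count against the rate limit. A rough sketch, assuming we remember the last seen ETag per repo:

```python
import json
import urllib.error
import urllib.request

def build_poll_request(owner: str, repo: str, etag: str = None) -> urllib.request.Request:
    """Build a conditional GET for a repo's latest release."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/releases/latest")
    if etag:
        # GitHub replies 304 (rate-limit free) if nothing changed since this ETag.
        req.add_header("If-None-Match", etag)
    return req

def poll_latest_release(owner: str, repo: str, etag: str = None):
    """Return (release_json_or_None, new_etag); None means no new release."""
    try:
        with urllib.request.urlopen(build_poll_request(owner, repo, etag)) as resp:
            return json.load(resp), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag
        raise
```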

It looks like, deep in the GitHub UI, there is a way to see if your webhooks succeeded. It's not obvious to the user on GitHub; we can get the info via the API, but it looks to involve sifting through a lot of junk (link).

@dlmurphy
Contributor Author

Thanks for the research, @matthew-a-dunlap! Later today I plan to really dive into the answers you gave and start developing new mockups. If I have any followup questions, I'll post them here!

@matthew-a-dunlap
Contributor

@dlmurphy No problem! We may want to have some discussion before we dig too deep, especially in light of @pdurbin 's question about retrying webhooks.

@matthew-a-dunlap matthew-a-dunlap removed their assignment Oct 29, 2018
@dlmurphy
Contributor Author

dlmurphy commented Oct 29, 2018

RE: Question 3, How can we determine which GitHub repositories a user can select from in Dataverse?

In addition to the info Matthew posted above, I want to include this info I found from one of the pages he linked:

The authenticated user has explicit permission to access repositories they own, repositories where they are a collaborator, and repositories that they can access through an organization membership.

Any dropdown or type-ahead selector we include can allow the Dataverse user to select any repos that fit those criteria.
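Those three criteria map directly onto the `affiliation` parameter of GitHub's /user/repos endpoint, so populating the selector could be one (paginated) API call. A sketch, assuming we already hold an OAuth token for the user:

```python
import json
import urllib.request

def build_repos_url(affiliation: str = "owner,collaborator,organization_member",
                    per_page: int = 100) -> str:
    """The three affiliation values mirror the criteria quoted above."""
    return f"https://api.github.com/user/repos?affiliation={affiliation}&per_page={per_page}"

def list_selectable_repos(token: str) -> list:
    """Return 'owner/name' strings for the first page of selectable repos."""
    req = urllib.request.Request(build_repos_url(),
                                 headers={"Authorization": f"token {token}"})
    with urllib.request.urlopen(req) as resp:
        return [r["full_name"] for r in json.load(resp)]
```

A real implementation would follow the Link response header to fetch further pages for users with more than 100 repos.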

@dlmurphy
Contributor Author

A follow-up question for @matthew-a-dunlap or anyone else in the know, RE: Question 5, "What metadata can we get from the GitHub repo?"

Could you please list more specifically what metadata we can pull from a repo, or link to a page with that info? I'm having a hard time finding that info.

@matthew-a-dunlap
Contributor

@dlmurphy Regarding Q5: This page has info on all the objects that can be queried via the API and their attributes, including release, user, repository, and organization.

@pdurbin
Member

pdurbin commented Oct 29, 2018

@dlmurphy if you look at the dataset in https://dataverse.harvard.edu/dataverse/open-source-at-harvard you can get a sense of the metadata you can get from GitHub for one of their repos. Here's a handy link to Data Explorer: https://scholarsportal.github.io/Dataverse-Data-Explorer/?fileId=3040230&siteUrl=https://dataverse.harvard.edu and to the JSON GitHub exposed for our "dataverse" repo back in July last year: https://github.com/pdurbin/open-source-at-harvard-primary-data/blob/master/2017-07-31/IQSS-dataverse.json

@dlmurphy
Contributor Author

Thanks, guys. That answers that question pretty well! I'm happy with the answers we've gathered, but I want to let this issue simmer until we can go over these answers in a design team meeting, perhaps this Wednesday.

@dlmurphy
Contributor Author

dlmurphy commented Oct 31, 2018

Following today's design meeting, we decided that we'd like the next step for this spike to include:

  • Creation of a working prototype that demonstrates a basic Dataverse/GitHub webhook connection that can pull a repo from GitHub and create a .zip of it in a Dataverse dataset.

  • Decisions on which metadata fields would be appropriate for Dataverse to use in a software metadata block, and then a mapping of which of those can be autopopulated from GitHub.

@djbrooke
Contributor

djbrooke commented Nov 1, 2018

@dlmurphy - FYI, in standup today, there was some discussion regarding the prototype, mostly related to whether or not it includes a front end. Some folks may check in with you.

@dlmurphy
Contributor Author

dlmurphy commented Nov 1, 2018

Just talked about this with the design team -- we don't need a UI for this prototype.

@djbrooke djbrooke assigned dlmurphy and unassigned dlmurphy Nov 2, 2018
@dlmurphy
Contributor Author

dlmurphy commented Nov 2, 2018

To be more specific, we're looking for a prototype that:

  • Can pull a repo from GitHub into Dataverse as a .zip file when a user manually requests it (pull a specific release, or if there's no release, pull the latest commit).

  • Can pull a repo from GitHub into Dataverse as a .zip file via GitHub webhooks when a release is published. (How do you set up the webhook and how do you maintain it? What can we do if the webhook fails?)

  • Can pull metadata from a GitHub repo into Dataverse (which specific fields don't really matter; we just want to demonstrate that this is doable).

The prototype doesn't need a frontend.

@mheppler, please feel free to weigh in on this, you might have a better idea of what's helpful here.

@pdurbin
Member

pdurbin commented Nov 8, 2018

Yesterday I demo'ed some code I hacked together as of b9305c0 to @djbrooke @scolapasta @TaniaSchlatter @mheppler @dlmurphy @jggautier and @kcondon

All functionality is API only for now. There are two steps:

  • Set the GitHub repo URL for a dataset:
    curl -H "X-Dataverse-key: $API_TOKEN" http://localhost:8080/api/datasets/31/github
    {"status":"OK","data":{"datasetId":31,"githubUrl":"https://github.com/IQSS/Zelig"}}
  • Import the GitHub repo into Dataverse as a zip file:
    curl -H "X-Dataverse-key: $API_TOKEN" -X POST http://localhost:8080/api/datasets/31/github/import

The result is that a file is created that looks like the screenshot below from https://dev1.dataverse.org/file.xhtml?persistentId=doi:10.5072/FK2/FS7M3O/EBNKNB

[Screenshot: the zelig.zip file page, captured 2018-11-08]

I had to leave to pick up my kids before any decisions were made about next steps.
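For reference, the zip itself can come straight from GitHub's zipball endpoint, which accepts a branch, tag, or release tag as the ref. A sketch of how the import step might fetch it (the destination filename is an arbitrary choice):

```python
import urllib.request

def zipball_url(owner: str, repo: str, ref: str) -> str:
    """GitHub's archive endpoint; `ref` may be a branch, tag, or release tag."""
    return f"https://api.github.com/repos/{owner}/{repo}/zipball/{ref}"

def download_repo_zip(owner: str, repo: str, ref: str = "master",
                      dest: str = "repo.zip") -> str:
    # Follows GitHub's redirect to the actual archive and writes it to dest.
    urllib.request.urlretrieve(zipball_url(owner, repo, ref), dest)
    return dest
```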

@pdurbin
Member

pdurbin commented Nov 8, 2018

Below is a more readable version of the output from https://api.github.com/repos/IQSS/Zelig that I shoved into the file description above. Please note that I believe this is only the tip of the iceberg in terms of metadata that we could pull out of GitHub for a repo. The content is mostly URLs for pulling out additional information. The items that I find interesting are:

  • description: A statistical framework...
  • license: null (Zelig seems to be GPL based on its DESCRIPTION file but GitHub doesn't know to look there)
  • language: R (other languages used less can presumably be retrieved from "languages_url")
  • size: 115034 (GitHub reports repo size in kilobytes; could be used to decide if a repo is too large to attempt to import)
  • homepage: http://zeligproject.org
{
  "stargazers_count": 65,
  "pushed_at": "2018-02-27T13:49:49Z",
  "subscription_url": "https://api.github.com/repos/IQSS/Zelig/subscription",
  "language": "R",
  "branches_url": "https://api.github.com/repos/IQSS/Zelig/branches{/branch}",
  "issue_comment_url": "https://api.github.com/repos/IQSS/Zelig/issues/comments{/number}",
  "labels_url": "https://api.github.com/repos/IQSS/Zelig/labels{/name}",
  "subscribers_url": "https://api.github.com/repos/IQSS/Zelig/subscribers",
  "releases_url": "https://api.github.com/repos/IQSS/Zelig/releases{/id}",
  "svn_url": "https://github.com/IQSS/Zelig",
  "subscribers_count": 25,
  "id": 14958190,
  "forks": 32,
  "archive_url": "https://api.github.com/repos/IQSS/Zelig/{archive_format}{/ref}",
  "git_refs_url": "https://api.github.com/repos/IQSS/Zelig/git/refs{/sha}",
  "forks_url": "https://api.github.com/repos/IQSS/Zelig/forks",
  "statuses_url": "https://api.github.com/repos/IQSS/Zelig/statuses/{sha}",
  "network_count": 32,
  "ssh_url": "git@github.com:IQSS/Zelig.git",
  "license": null,
  "full_name": "IQSS/Zelig",
  "size": 115034,
  "languages_url": "https://api.github.com/repos/IQSS/Zelig/languages",
  "html_url": "https://github.com/IQSS/Zelig",
  "collaborators_url": "https://api.github.com/repos/IQSS/Zelig/collaborators{/collaborator}",
  "clone_url": "https://github.com/IQSS/Zelig.git",
  "name": "Zelig",
  "pulls_url": "https://api.github.com/repos/IQSS/Zelig/pulls{/number}",
  "default_branch": "master",
  "hooks_url": "https://api.github.com/repos/IQSS/Zelig/hooks",
  "trees_url": "https://api.github.com/repos/IQSS/Zelig/git/trees{/sha}",
  "tags_url": "https://api.github.com/repos/IQSS/Zelig/tags",
  "private": false,
  "contributors_url": "https://api.github.com/repos/IQSS/Zelig/contributors",
  "has_downloads": true,
  "notifications_url": "https://api.github.com/repos/IQSS/Zelig/notifications{?since,all,participating}",
  "open_issues_count": 26,
  "description": "A statistical framework that serves as a common interface to a large range of models",
  "created_at": "2013-12-05T15:57:10Z",
  "watchers": 65,
  "keys_url": "https://api.github.com/repos/IQSS/Zelig/keys{/key_id}",
  "deployments_url": "https://api.github.com/repos/IQSS/Zelig/deployments",
  "has_projects": true,
  "archived": false,
  "has_wiki": false,
  "updated_at": "2018-10-30T16:47:25Z",
  "comments_url": "https://api.github.com/repos/IQSS/Zelig/comments{/number}",
  "stargazers_url": "https://api.github.com/repos/IQSS/Zelig/stargazers",
  "git_url": "git://github.com/IQSS/Zelig.git",
  "has_pages": true,
  "owner": {
    "gists_url": "https://api.github.com/users/IQSS/gists{/gist_id}",
    "repos_url": "https://api.github.com/users/IQSS/repos",
    "following_url": "https://api.github.com/users/IQSS/following{/other_user}",
    "starred_url": "https://api.github.com/users/IQSS/starred{/owner}{/repo}",
    "login": "IQSS",
    "followers_url": "https://api.github.com/users/IQSS/followers",
    "type": "Organization",
    "url": "https://api.github.com/users/IQSS",
    "subscriptions_url": "https://api.github.com/users/IQSS/subscriptions",
    "received_events_url": "https://api.github.com/users/IQSS/received_events",
    "avatar_url": "https://avatars2.githubusercontent.com/u/675237?v=4",
    "events_url": "https://api.github.com/users/IQSS/events{/privacy}",
    "html_url": "https://github.com/IQSS",
    "site_admin": false,
    "id": 675237,
    "gravatar_id": "",
    "node_id": "MDEyOk9yZ2FuaXphdGlvbjY3NTIzNw==",
    "organizations_url": "https://api.github.com/users/IQSS/orgs"
  },
  "commits_url": "https://api.github.com/repos/IQSS/Zelig/commits{/sha}",
  "compare_url": "https://api.github.com/repos/IQSS/Zelig/compare/{base}...{head}",
  "git_commits_url": "https://api.github.com/repos/IQSS/Zelig/git/commits{/sha}",
  "blobs_url": "https://api.github.com/repos/IQSS/Zelig/git/blobs{/sha}",
  "git_tags_url": "https://api.github.com/repos/IQSS/Zelig/git/tags{/sha}",
  "merges_url": "https://api.github.com/repos/IQSS/Zelig/merges",
  "downloads_url": "https://api.github.com/repos/IQSS/Zelig/downloads",
  "has_issues": true,
  "url": "https://api.github.com/repos/IQSS/Zelig",
  "contents_url": "https://api.github.com/repos/IQSS/Zelig/contents/{+path}",
  "mirror_url": null,
  "milestones_url": "https://api.github.com/repos/IQSS/Zelig/milestones{/number}",
  "teams_url": "https://api.github.com/repos/IQSS/Zelig/teams",
  "fork": false,
  "issues_url": "https://api.github.com/repos/IQSS/Zelig/issues{/number}",
  "events_url": "https://api.github.com/repos/IQSS/Zelig/events",
  "issue_events_url": "https://api.github.com/repos/IQSS/Zelig/issues/events{/number}",
  "organization": {
    "gists_url": "https://api.github.com/users/IQSS/gists{/gist_id}",
    "repos_url": "https://api.github.com/users/IQSS/repos",
    "following_url": "https://api.github.com/users/IQSS/following{/other_user}",
    "starred_url": "https://api.github.com/users/IQSS/starred{/owner}{/repo}",
    "login": "IQSS",
    "followers_url": "https://api.github.com/users/IQSS/followers",
    "type": "Organization",
    "url": "https://api.github.com/users/IQSS",
    "subscriptions_url": "https://api.github.com/users/IQSS/subscriptions",
    "received_events_url": "https://api.github.com/users/IQSS/received_events",
    "avatar_url": "https://avatars2.githubusercontent.com/u/675237?v=4",
    "events_url": "https://api.github.com/users/IQSS/events{/privacy}",
    "html_url": "https://github.com/IQSS",
    "site_admin": false,
    "id": 675237,
    "gravatar_id": "",
    "node_id": "MDEyOk9yZ2FuaXphdGlvbjY3NTIzNw==",
    "organizations_url": "https://api.github.com/users/IQSS/orgs"
  },
  "assignees_url": "https://api.github.com/repos/IQSS/Zelig/assignees{/user}",
  "open_issues": 26,
  "watchers_count": 65,
  "node_id": "MDEwOlJlcG9zaXRvcnkxNDk1ODE5MA==",
  "homepage": "http://zeligproject.org",
  "forks_count": 32
}
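Pulling the interesting fields out of that payload is straightforward. A sketch, where the mapping onto Dataverse metadata-block fields is only illustrative (note that `license`, when present, is itself an object, and `size` is in kilobytes):

```python
def extract_dataset_metadata(repo: dict) -> dict:
    """Map a GitHub repository JSON payload to candidate dataset metadata."""
    return {
        "title": repo.get("name"),
        "description": repo.get("description"),
        # license is null for Zelig; when set it is an object with a "name" key
        "license": (repo.get("license") or {}).get("name"),
        "language": repo.get("language"),
        "size_kb": repo.get("size"),  # GitHub reports repo size in kilobytes
        "homepage": repo.get("homepage"),
    }

# Trimmed-down version of the IQSS/Zelig payload above:
zelig = {
    "name": "Zelig",
    "description": "A statistical framework that serves as a common interface to a large range of models",
    "license": None,
    "language": "R",
    "size": 115034,
    "homepage": "http://zeligproject.org",
}
```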

@mheppler
Contributor

mheppler commented Nov 8, 2018

Upon discussing this in more detail this morning with @djbrooke @TaniaSchlatter @pdurbin , an update on expected results of this spike:

  • QUESTION TO BE ANSWERED: Can the authentication required to set up a webhook be inserted into the deposit/upload workflow or does the system require the user to set up the webhook in an edit account info workflow PRIOR TO starting the deposit/upload workflow?

Also, scrap the "cherry pick and format the metadata" suggestion. That can be done when this full feature moves to development. We know that can be done and don't need a spike to prove it.

@pdurbin
Member

pdurbin commented Nov 8, 2018

The short answer to the question above is "I don't know" because I struggle mightily with JSF. We can try to get both working so we have options. It will take time.

Meanwhile, below is my todo list from the dev perspective. This is the logical order in which to work on the code.

  • Fix the dreaded problem that's slowing down development and causing "java.lang.IllegalStateException: This web container has not yet been started". It's something in the code because I see it on my laptop and on the dev1 server.
  • I demo'ed a 1.0 dataset based on import from GitHub. Support the concept of a 2.0 (and later) dataset version based on a subsequent import. Create a new Dataverse API endpoint for this ("importZipToNewDatasetVersion") or add more logic to the "import" API endpoint mentioned above. Store the previous tag/release and/or SHA-1 to know if there is anything new on the GitHub side? Or just blindly create a new version with the same zip file?
  • Create API endpoint to receive GitHub webhooks. And GitLab webhooks? This will call the "import zip as another dataset version" code.
  • Create webhook manually in the GitHub GUI. Get it working with the new Dataverse "import zip as another dataset version" code.
  • Create webhook via API using a GitHub password, if possible (insecure).
  • Create a webhook via OAuth. Requires GUI work as this can only be done in a browser.
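For the "create webhook via API" steps, GitHub's hooks endpoint takes a small JSON payload. A sketch using a token rather than a password (the callback URL passed in would be a Dataverse endpoint; any URL shown in usage is a made-up placeholder):

```python
import json
import urllib.request

def build_hook_payload(callback_url: str, secret: str) -> dict:
    """Payload for POST /repos/{owner}/{repo}/hooks, firing only on releases."""
    return {
        "name": "web",            # "web" is the type for ordinary webhooks
        "active": True,
        "events": ["release"],
        "config": {"url": callback_url, "content_type": "json", "secret": secret},
    }

def create_release_webhook(token: str, owner: str, repo: str,
                           callback_url: str, secret: str) -> dict:
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/hooks",
        data=json.dumps(build_hook_payload(callback_url, secret)).encode(),
        headers={"Authorization": f"token {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # includes the new hook's id for later maintenance
```

The token would need the `admin:repo_hook` (or broader `repo`) OAuth scope, which is relevant to the scopes question in the thread above.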

@djbrooke
Contributor

djbrooke commented Nov 8, 2018

@pdurbin - Let's put the brakes on this for now (except #1 IMHO). We are verifying a proposed approach with @mercecrosas and I'd like to revisit the technical architecture after that. Let's pick this up when you're back next week. Apologies for the confusion.

@pdurbin pdurbin removed their assignment Nov 8, 2018
@pdurbin
Member

pdurbin commented Oct 3, 2022

@atrisovic created an awesome GitHub integration using GitHub Actions.

I think we should close this issue. We can create a fresh one if we still want to investigate webhooks.
