Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Formal governance document #5676

Closed
tdhock opened this issue Aug 22, 2023 · 32 comments · Fixed by #5772
Closed

Formal governance document #5676

tdhock opened this issue Aug 22, 2023 · 32 comments · Fixed by #5772
Assignees
Labels
governance Project governance

Comments

@tdhock
Copy link
Member

tdhock commented Aug 22, 2023

data.table has no formal governance document; Matt Dowle, the original author and only current maintainer, has commit permissions on github, and he submits the package to CRAN. The only other author, Arun, has been inactive for several years.
Matt has done a fantastic job at creating a highly efficient and widely used R package, and he continues to submit "patch" releases to CRAN (containing minimal fixes so that the package continues to pass CRAN checks). data.table has many contributors, who have submitted PRs, which have been reviewed and merged by Matt. This form of project leadership/governance is similar to the former python model, Benevolent Dictator For Life (BDFL). This form of governance can handle as many contributions/PRs as the BDFL (Matt) has time to review and merge. The purpose of this issue is to discuss alternative forms of governance which may be able to handle more contributions, from a larger and more diverse group of contributors.

I therefore propose that we use this issue to write constructive, thoughtful, respectful, and inclusive comments containing concrete propositions for the future governance of the data.table project. At the end of three months (end of November 2023), I will synthesize the comments into a draft governance document, which I will publish as a PR that creates a new GOVERNANCE.md file. After that, there will be a three one month period of open comments, where people can discuss the strengths/weaknesses of the draft, and we can discuss possible edits to make. At the end of that second period (end of Dec 2023), I will consider the consensus of the comments, make corresponding edits, and publish a second draft (by editing the PR). If that draft is sufficient, I will ask contributors to sign that document by adding their names to the GOVERNANCE.md file in that PR. If there are still significant concerns, we can use another three one month period of comments to resolve them, after which I will publish another (hopefully final) draft in end of Jan 2024.

Some significant questions that I suggest we try to answer in the governance document: (others are welcome too)

  • What is the purpose/scope of the data.table package? What kinds of functions should it contain? And what is out of scope? What is "within scope" for the data.table package? #5722
  • What are the guiding principles of the data.table project? efficiency? simplicity? inclusion of diverse contributors? backwards compatibility? See data.table principles #5693
  • How should conflicts be resolved? By consensus? voting? who should be allowed to vote? Suggestion: consensus, meaning that no formal voting is required, but discussion must continue until everyone who is participating in the discussion agrees. (we should not have to wait for people who are absent to express their approval)
  • What roles and/or permissions should we define? What process must a contributor follow in order to obtain special roles and/or permissions? see below, Formal governance document #5676 (comment)
  • How to decide when to merge a PR? Suggestion: need approval from at least one reviewer other than the PR submitter. (and do not merge if anyone has serious/blocking concerns)
  • How to interpret version numbers? version number conventions #5715
  • What standards of conduct should we expect from contributors, in their writing, in order to encourage diversity/inclusion? Code of conduct #5708
  • What is the frequency of releases to CRAN that should be expected? What process should be followed by the CRAN maintainer? (for updates in response to requests from CRAN, and for regular releases) See CRAN maintainer communication standards and release checklist #5714
  • If some one asks a question, or wants a code review, how much time is reasonable to expect a response? 1 week? 1 month? If no response after that time period, what are the consequences? How do we give due credit to past/inactive contributors while at the same time giving enough permission to current/active contributors?
  • What process should we use to make modifications to the GOVERNANCE.md document, after it has been initially accepted/ratified? 2/3 of people who signed the GOVERNANCE.md must approve? 50%+1?

We are not the first open-source project to have a governance document, here is a reading list about open-source governance, which can inform our discussion:

About the roles to define, I suggest replacing the current flat leadership model (one maintainer role at the top, many contributors on the bottom), with a hierarchical model containing intermediate "reviewers," similar to the successful model of subsystem maintainers from the linux kernel project. See figure below, but note that the names are totally arbitrary (for example, I do not expect Kelly to be release manager, but it would be nice to have someone take the role of release manager).

figure-PR-hierarchy

I would suggest that each intermediate “reviewer” volunteer to be in charge of reviewing and merging PRs for specific features/files, as defined in the CODEOWNERS file, #5629 So far only @ben-schwen @MichaelChirico and @jangorecki have volunteered to be reviewers.

In addition to the reviewer role in the figure above, there could be at least five other roles (with responsibilities):

  • release manager (to communicate with CRAN and submit new releases),
  • translation manager (to communicate with translators),
  • performance testing manager (to prevent performance regressions),
  • reverse dependency manager (to ensure compatibility with other CRAN packages),
  • binary manager (to build binaries of development branches for user testing before release).

I would volunteer to be reverse dependency manager, as I have set up the revdep-check system

I would nominate @MichaelChirico for translation manager and @jangorecki for binary manager.

Who would volunteer for release manager and performance testing manager?

@waynelapierre
Copy link

waynelapierre commented Aug 23, 2023

I have one suggestion that might sound too radical. Why not simply put data.table under the umbrella of rOpenSci?
https://github.com/ropensci
rOpenSci already has a well defined governance structure, so data.table contributors only need to focus on the coding part. Moreover, as I mentioned in another issue, rOpenSci has already been funded by NumFOCUS.
#5675

@minemR

This comment was marked as outdated.

@tdhock

This comment was marked as resolved.

@msummersgill

This comment was marked as off-topic.

@tdhock
Copy link
Member Author

tdhock commented Aug 23, 2023

I have one suggestion that might sound too radical. Why not simply put data.table under the umbrella of rOpenSci? https://github.com/ropensci rOpenSci already has a well defined governance structure, so data.table contributors only need to focus on the coding part. Moreover, as I mentioned in another issue, rOpenSci has already been funded by NumFOCUS. #5675

hi @waynelapierre thanks for your comment. Can you clarify what is the "well defined governance structure" of rOpenSci? I found https://softdev4research.github.io/4OSS-lesson/04-contributions/ which lists some general recommendations about what an open-source project governance document should contain, but I was not able to find a document describing governance of the rOpenSci project.

@TimTaylor
Copy link

Hi Toby - thank you for this!

I won't rush to comment save for one initial thought regarding triaging of both issues and PRs. Triage is a somewhat hidden layer between contributors and reviewers. I'll often see Jan diligently handling this and think it would be good if this process was captured in the document.

@assignUser
Copy link

Would the reviewers have commit rights to the main branch? The chart implies that they don't and combined with:

Matt has written me several emails, and has told me that going forward he will have very little time to devote to data.table development.

That keeps the current problems around...

I would recommend a read of the substrait governance which is based on the ASF system of a group of committers and a PMC with some changes to make room for automatic releases and such (which currently are not possible for ASF projects) .

@sluga
Copy link
Contributor

sluga commented Aug 30, 2023

Perhaps a survey would help in accumulating feedback faster and from more people?
In addition to the questions listed above, it could cover some extra ground such as:

  • respondent's background & if/how they'd be willing to contribute to data.table
  • what to prioritize in future development
  • feedback on more significant potential/upcoming changes (e.g. DT(), env)

@phisanti

This comment was marked as resolved.

@MichaelChirico
Copy link
Member

It would indeed be nice to hear from @mattdowle if he has any strict requirements for the new governance -- i.e., proposals/guidelines to which he would not give final approval.

Regarding timelines, it would be good to build in an acceleration mechanism, whereby we can move to the next phase in approval once sufficient consensus is achieved (e.g. certainly among @tdhock @jangorecki and myself, with perhaps a few more). It does seem to me surprising that we expect it will take a year to establish the new governance & then move to releasing new data.table code.

@tdhock
Copy link
Member Author

tdhock commented Sep 7, 2023

Perhaps a survey would help in accumulating feedback faster and from more people? In addition to the questions listed above, it could cover some extra ground such as:

* respondent's background & if/how they'd be willing to contribute to data.table
* what to prioritize in future development
* feedback on more significant potential/upcoming changes (e.g. DT(), env)

Hi @sluga thanks for sharing. The survey sounds like a good idea, that would help data.table developers and contributors get an idea about what kinds of features the users would like to prioritize. Would you be willing to set that up? (maybe this could be an additional role mentioned in the governance document, survey manager)

@tdhock
Copy link
Member Author

tdhock commented Sep 7, 2023

I would like to invite the following people to comment on this proposal, because they are listed among the top contributors https://github.com/Rdatatable/data.table/graphs/contributors
@st-pasha @lianos @tshort @eantonya @shrektan @HughParsonage @MarkusBonsch @ColeMiller1 @sritchie73 @tlapak @dracodoc @rsaporta @philippechataignon @eddelbuettel @OfekShilon @oseiskar @2005m @KyleHaynes @JoshOBrien @DavidArenburg @heavywatal @mcol @JenspederM @ajdamico @franknarf1 @eliocamp @UweBlock
If you have time and interest, I would appreciate your input/feedback on this issue; otherwise, if you don't have time to participate, or don't have any opinion to share, that is fine too.

@lianos

This comment was marked as resolved.

@tdhock

This comment was marked as outdated.

@HughParsonage

This comment was marked as resolved.

@ColeMiller1

This comment was marked as outdated.

@ColeMiller1 ColeMiller1 pinned this issue Sep 9, 2023
@jangorecki
Copy link
Member

I would like to have release ASAP (considering how many of my PRs are in queue to get merged it is quite obvious), but I don't think we should push for it in current situation. It is more important to put back the project to be thriving again. Having release now or year later is secondary problem. Matt is doing hot fixes for CRAN whenever needed, thanks to that package is still on CRAN. Let's appreciate that we got that time and let's look for a solutions that will work in the long run, not just release ASAP. We have this amazing opportunity thanks to Toby initiative. Personally I would like this process to be faster, like 1 month for each step, but I trust Toby, because I feel he knows that stuff better. IMO let's try to do it faster but be ready to wait when Toby will not be satisfied with outcome on each step. FYI: I (and other mods) will be cleaning this thread from time to time from messages that does not bring much to discussion.

@sluga
Copy link
Contributor

sluga commented Sep 10, 2023

Perhaps a survey would help in accumulating feedback faster and from more people? In addition to the questions listed above, it could cover some extra ground such as:

* respondent's background & if/how they'd be willing to contribute to data.table
* what to prioritize in future development
* feedback on more significant potential/upcoming changes (e.g. DT(), env)

Hi @sluga thanks for sharing. The survey sounds like a good idea, that would help data.table developers and contributors get an idea about what kinds of features the users would like to prioritize. Would you be willing to set that up? (maybe this could be an additional role mentioned in the governance document, survey manager)

Hi @tdhock,

will do, I'll make a draft & post it here as a separate issue, so that you and others can contribute. I'm also happy to then implement the survey & write up the results, though I'd ask for some help with survey distribution.

@tdhock
Copy link
Member Author

tdhock commented Sep 12, 2023

Several people have asked if Matt can comment on this issue, to express approval of the proposed process, and/or if he has any strict requirements for the new governance. I emailed Matt last week, to tell him about this issue, and to ask if I could publish his letters of collaboration/support, and he agreed, so here they are. I believe these letters provide convincing evidence of Matt's support; for example he wrote "I wholeheartedly support this POSE application and will collaborate with Toby's team."
2022-10-14-Dowle-collaboration-Hocking-NSF-POSE.pdf
2023-02-06-Dowle-clarification-Hocking-NSF-POSE.pdf

@ben-schwen
Copy link
Member

Thank you Toby for initiating this. My kudos go to @mattdowle for all the work he put into data.table over those years. Having Matt as a final reviewer is something I've always enjoyed since he will not merge until a PR meets his standards. I also believe that's one of the points which have made data.table so successful.

Since it seems that Matt does not have the time needed to fully commit I'm in big favor of establishing such a hierarchical review model. I'm convinced that there are long-time contributors to the project (who might also should have earned author status) like @jangorecki and @MichaelChirico who know good parts of our codebase pretty well to be able to review and approve PRs to their best knowledge similar to what Matt did.

Regarding releases, I think that we should release soon when most of the known hard issues in the dev version are fixed. I know that we are all used to work with the dev version but getting things out on CRAN has an even much bigger impact and I think that there are many features data.table users would enjoy.

@DavidArenburg
Copy link
Member

This looks good to me, but what this whole exaggerated emphasis on diversity/inclusion? Were there some issues regarding that topic in the package development process? Why is this present literally in every sentence of the proposal?

@tdhock
Copy link
Member Author

tdhock commented Sep 13, 2023

This looks good to me, but what this whole exaggerated emphasis on diversity/inclusion? Were there some issues regarding that topic in the package development process? Why is this present literally in every sentence of the proposal?

I am not aware of any issue regarding diversity/inclusion in the package development process. In fact, I believe Matt was doing a great job with inclusion, as he had a policy of adding people to the team https://github.com/orgs/Rdatatable/teams/project-members (permission to add branches/PRs from this repo rather than in forks) immediately after they submit their first PR. The focus on NSF POSE program is expanding the ecosystem of users and contributors, so I believe it is important that we are intentional about making diversity/inclusion a priority in the governance document, to encourage attracting and retaining the maximum number of contributors in the future.

@tlapak
Copy link
Contributor

tlapak commented Sep 26, 2023

Thank you @tdhock for driving this. From the letters you posted, this seems to have been a long time in the making. Like others, I had grown a bit worried over the months that data.table was being abandoned. It is, however, a very mature package. Huge props to @mattdowle and Arun for developing and maintaining this package for so long. I've learned a lot from both the project setup and his direct feedback on my PRs. I think Matt's expertise and experience with R internals will be hard to replace.

I feel like a governance document is necessary given the funding secured and I guess with a view towards actual growth. But at the core, we should strive to keep most of the development model and design fundamentals. There is de-facto a core team around Matt, so let's write this down in a way that doesn't get in the way later. The crucial aspects we need to tackle are commit rights to master and release to CRAN. Pre-submission testing and especially rev-dep testing seem to be very complex, so having one or more people who can tackle the release process seems essential.

Overall, the proposal seems fine to me. I'm not sure how strictly we should pin down responsible people there so as to not restrict ourselves needlessly. Given Toby's comments, I don't think we should rely on Matt for commits or releases going forward. But if he were available to consult on tricky C questions that would be beneficial to all of us. But if he can't we'll do our best.

A few key points regarding the design we should pin down, imo, are the long and careful life-cycle management of features (and consequent wariness of adding new features) and the backwards compatibility. *Although we might revisit the extent of the latter.) Adding people as contributors as soon as their first PR gets merged, etc. Basically, the front page needs to stay.

And since I'm already at it, I'll chuck out a couple more thoughts: data.table needs some marketing and we need to do some housekeeping on the issue tracker and open PRs. Again I'm not sure if this needs to be written down in the governance document but I think something needs to be done here.

@ben519
Copy link

ben519 commented Sep 27, 2023

Thanks @tdhock for spearheading this! Do you have an estimated timeline? It feels like this process is dragging. Perhaps it needs less formal discussion and more unilateral action? ..or perhaps some concrete dates so that we don't suffer paralysis by analysis.

Anyways I don't mean to complain. Really appreciate your work. I'll happily make a small donation to the team if it helps and I'd love to promote data.table if/when new releases start happening.

@MichaelChirico
Copy link
Member

Just came across this document written by cURL creator/maintainer Daniel Sternberg. Sharing as relevant:

https://un.curl.dev

@tdhock
Copy link
Member Author

tdhock commented Sep 29, 2023

What a great read! Thanks for sharing Michael. Here are some relevant passages

Contributors will not stick around -> My experience says that you will have better success in getting more maintainers if you (as an existing maintainer) ask those you consider being contenders, rather than waiting and hoping for them to ask.

https://un.curl.dev/code/quality#how-do-you-achieve-good-code-quality

Roles: https://un.curl.dev/maintain BDFL? security? release manager (use a checklist), web, reviewing, support, blog, debug, merging, feature dev, doc writers, event planning, stickers, presenters, world monitoring (surveys).

@Kamgang-B Kamgang-B unpinned this issue Oct 11, 2023
@phisanti
Copy link

Hi, just a friendly reminder to check on the progression of the new governance. We are about to be one month for the deadline proposed by @tdhock. Do we have already a core-team for the future data.table? Are the funding issues closed? Will we claim back the throne of speed from other packages such as collapse?

@jangorecki
Copy link
Member

End of November was the first milestone date mentioned by Toby, so we still have time.

As for benchmark with collapse, I invite you to submit new issue, for each task you are troubled by DT being slower than collapse.
It happened multiple times that scaling data up, or cardinality up, resulted in DT being faster. And not only vs collapse but in general.
There are also other places where we could provide faster functions but we rather hold to make them built-in, to reduce number of functions that user has to learn/discover, like fsum, fmean, etc.

As benchmarking is off-topic to this issue I kindly ask to not continue that topic here but create new issue if needed.

@tdhock
Copy link
Member Author

tdhock commented Oct 25, 2023

In my original post, I asked the questions, What roles and/or permissions should we define? What process must a contributor follow in order to obtain special roles and/or permissions? Here are some more detailed answers to these questions. Please comment constructively, discuss strengths/weaknesses of this structure, and propose concrete alternatives.

  • Contributor: a user who has written/commented at least one issue, worked to label/triage issues, written a blog post, given a talk, etc. This is not a formal role (no need to keep a list of contributors), but contributors should be encouraged to submit their first PR to become a project member.
  • Project member: some one who has submitted at least one PR that has been merged into master. Any user can become a member by submitting a PR, then having it reviewed and merged into master. Members are credited via role="ctb" in DESCRIPTION (so they appear in Author list on CRAN), and they are added to https://github.com/orgs/Rdatatable/teams/project-members so they can create new branches in the Rdatatable/data.table GitHub repo. Note: I like the term "member" here instead of "contributor" because I believe there are many ways to contribute to the project without having to submit a PR. Currently there are 50 members.
  • Reviewer: a member who has volunteered to do code reviews for some features/files. After one or more significant PRs to a given file, a member should be invited to add their name as a reviewer of that file in CODEOWNERS, and after that is merged into master, then they are considered a reviewer. Same credit in DESCRIPTION as a regular member, role="ctb" (so they appear in Author list on CRAN). Note: having your name in CODEOWNERS does not give any special permission, but it does mean that you will be notified whenever there is a new PR with changes to that file. Current reviewers listed in add CODEOWNERS file #5629 are @tdhock, @ben-schwen, @jangorecki, @MichaelChirico but it would be great to expand this to include more people who have submitted PRs in the last few years.
  • GitHub maintainer: permission to merge PRs into master branch. After a reviewer has a consistent history of careful reviews of others' PRs, then a current GitHub maintainer should ask all other current GitHub maintainers if they approve promoting the reviewer to GitHub maintainer, and it should be done if there is general consensus/agreement. Credited via role="aut" in DESCRIPTION (so they appear in Author list on CRAN), and added to https://github.com/orgs/Rdatatable/teams/maintainers which gives permission to merge PRs into master branch. Note: to avoid confusion with CRAN maintainer, should we use a different name for this role? "GitHub masters" or "committers" (thanks @assignUser) because they can commit to master branch? Currently there are two GitHub maintainers: Matt and Arun, but I would suggest expanding this to include at least @tdhock, @ben-schwen, @jangorecki, @MichaelChirico
  • CRAN maintainer: in charge of communication with CRAN. Responsible for submitting releases to CRAN on a regular basis, and in response to requests from CRAN. Credited via role="cre" in DESCRIPTION (so they appear as Maintainer on CRAN). Note: in the past this role has been filled by one of the GitHub maintainers, but CRAN maintainer does not actually need permission to merge PRs into master branch. Currently this is Matt, but for the future I would suggest @jangorecki or @MichaelChirico

@tdhock tdhock self-assigned this Oct 25, 2023
@jangorecki
Copy link
Member

jangorecki commented Oct 25, 2023

  • Project member - a rule for listing into DESCRIPTION file was once nicely defined by Matt, possibly in a video, but is also mentioned here (that non-code contributions may not qualify): fixes #504 fread now handles all kind of NAs without coercion to char #1236 (comment)

  • GitHub maintainer - we need to define how many GH maintainers have to give green light for PR to be merged. Proposed list (Matt, Arun, Michael, Ben, Toby, Jan) has 6 maintainers, normally I would then say 3 "lgtm" should be fine to merge a PR but considering only 4 of those 6 are active it feels like 3 out of 4 is quite a lot. Could as well be 1 GH maintainer plus 1 reviewer/GH maintainer.

@assignUser
Copy link

should we use a different name for this role?

'Committer' as used by the ASF is quite descriptive imo

how many GH maintainers have to give green light for PR to be merged

To keep development velocity I would suggest >= 1 👍 and no 👎 for smaller day to day changes (bugfixes, enhancements etc) and some higher requirement for more important/influential changes (e.g. changes to core api/implementation). This could also be a discussion + vote outside of the PR (e.g. for arrow we have to vote on the ML for things like format changes/additions).

@jangorecki
Copy link
Member

jangorecki commented Oct 26, 2023

For CRAN maintainer I think it would be most suitable to choose a person whose time will be already funded by NSF (or any other company/foundation willing to sponsor that). Preparing release and CRAN communication have often short deadlines and it can be quite time consuming, therefore relying on a volunteer to handle that does not seem to be fair.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
governance Project governance
Projects
None yet
Development

Successfully merging a pull request may close this issue.