This repository has been archived by the owner on Aug 5, 2021. It is now read-only.

Updating the Code.gov Metadata Schema [Due by 7/14] #250

Closed
okamanda opened this issue Jun 15, 2017 · 22 comments

Comments

@okamanda
Contributor

okamanda commented Jun 15, 2017

Before Code.gov launched in November 2016, we relied on this community of supporters to help shape the first version of our metadata schema. That schema is the foundation of Code.gov. The vibrant discussion that followed reflected the range of expertise we're lucky to have at our disposal. And now we'd like to tap into that again as we revisit the next version of the Code.gov Metadata Schema.

Over the next few weeks (an updated timeline will be provided), we want to gather comments about what we should include (or remove) in the next iteration of the schema.

  • What kind of information should we be collecting from federal agencies?
  • How would you use that information?
  • What are examples of best practices that we should adopt?
  • What should we remove/deprecate from the current version of the schema because it isn't useful, doesn't make sense, or is burdensome to collect?

Please feel free to tag or refer other related issues!

Looking forward to seeing where this discussion takes us.

@PhilipBale
Contributor

#241

@IanLee1521
Contributor

IanLee1521 commented Jun 16, 2017

A few issues:

Other thoughts, that may not be in issues yet:

  • project.updated.lastCommit vs project.updated.sourceCodeLastModified -- I wonder if these two could be combined into a single field, rather than two separate fields that only differ in whether it's an open or closed repo?
  • project.vcs -- Should this have a list of valid values (git, svn, hg, p4, ...?), similar to the enumerated values available for project.status?
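To make the project.vcs question concrete, here is a minimal sketch of what checking the field against an enumerated list could look like. The allowed values are just the examples named above, and the function name and the choice to treat a missing value as acceptable are assumptions, not anything the schema defines.

```python
from typing import List

# Hypothetical allowed values, taken from the examples in the comment above.
# The schema itself does not define this list.
ALLOWED_VCS = {"git", "svn", "hg", "p4"}


def validate_vcs(project: dict) -> List[str]:
    """Return a list of problems with the project's `vcs` field (empty if none)."""
    vcs = project.get("vcs")
    if vcs is None:
        return []  # a missing value is treated as acceptable in this sketch
    if not isinstance(vcs, str) or vcs.lower() not in ALLOWED_VCS:
        return ["vcs {!r} is not one of {}".format(vcs, sorted(ALLOWED_VCS))]
    return []
```

An enumerated list like this would catch spelling variants ("Git", "GIT", "github") at validation time rather than leaving them for consumers to normalize.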

@ibarra-michelle

One suggestion I have for the next schema would be to include a field with the date the code repository was created. Adding this date would help Agencies with the compliance pilot program since the policy is effective for custom code developed August 8, 2017 and forward.
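As a rough illustration of how such a field could be used, the sketch below filters an inventory by creation date. The `dateCreated` field name, its ISO `YYYY-MM-DD` format, and the function name are assumptions for illustration only; none of them are part of the current schema.

```python
from datetime import date
from typing import List


def created_on_or_after(projects: List[dict], cutoff: date) -> List[dict]:
    """Return the projects whose (hypothetical) dateCreated is on or after cutoff."""
    selected = []
    for project in projects:
        raw = project.get("dateCreated")
        if raw is None:
            continue  # undated projects cannot be classified, so they are skipped
        if date.fromisoformat(raw) >= cutoff:
            selected.append(project)
    return selected
```

Calling `created_on_or_after(projects, date(2017, 8, 8))` would then return only the projects created on or after the date mentioned above.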

@apyle

apyle commented Jun 26, 2017

Add project.size - a numeric value that reflects the size of the project. This might be lines of code, cost, function points, or whatever method the agency uses to measure their code. This is needed to gauge where agencies are in reaching the 20% open source goal. By including the project size in the code.json file, agencies do not have to track and report this separately for OMB. This will eliminate the time and cost of linking & reconciling two reporting mechanisms and increase transparency.

@apyle

apyle commented Jun 26, 2017

Add project.sizingMethod - a textual value for how agencies are measuring their code. This gives OMB and the public insight into how the government measures code. Agencies will also know where to reach out if they have questions about a particular measuring method.
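A minimal sketch of how the two proposed fields could feed the 20% calculation, assuming each project entry carries a numeric `size`, a textual `sizingMethod`, and an `openSourceProject` flag set to 1 for open source projects. All three names are treated as assumptions here, and sizes are only summed within the same sizing method because lines of code and cost are not comparable.

```python
from collections import defaultdict
from typing import Dict, List


def open_source_share(projects: List[dict]) -> Dict[str, float]:
    """Return {sizingMethod: fraction of total size that is open source}."""
    total = defaultdict(float)
    opened = defaultdict(float)
    for project in projects:
        method = project.get("sizingMethod")
        size = project.get("size")
        if method is None or size is None:
            continue  # cannot be counted without both proposed fields
        total[method] += size
        if project.get("openSourceProject") == 1:
            opened[method] += size
    # Skip methods whose total is zero to avoid dividing by zero.
    return {m: opened[m] / total[m] for m in total if total[m] > 0}
```

Reporting one share per sizing method keeps the measure honest when an agency mixes methods across its inventory.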

@IanLee1521
Contributor

@apyle - I wonder if "number of repositories open sourced" would be the better metric for calculating that 20% number.

@apyle

apyle commented Jun 29, 2017

@IanLee1521 - if an agency chooses to measure code by the number of projects, then your suggestion is appropriate. But if an agency chooses one of the other suggested measuring options, such as lines of code, number of self-contained modules, cost, etc., then we need a way to report that measure.

@okamanda changed the title from "Updating the Code.gov Metadata Schema" to "Updating the Code.gov Metadata Schema [Due by 7/14]" on Jul 11, 2017
@okamanda
Contributor Author

Hi everyone,

To move this process along so we can start implementing the changes, we'd like to get everyone's comments in by this Friday, July 14th. If you haven't already provided comments and recommendations, please do so this week. Thanks!

@apyle

apyle commented Jul 12, 2017

Reserve name spaces for agencies.

Within the agency object, reserve any name that starts with the agency code in upper case followed by an underscore for that agency's local use. For instance, VA_key would be a key reserved for Veterans Affairs, USDA_key for the Dept. of Agriculture, and TRE_key for Treasury.

This provides a safe space for agencies to implement local customizations that won't interfere with other agencies' customizations.
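A minimal sketch of how a validator might enforce this convention, assuming a known set of agency codes. The three codes below are only the ones used as examples above, and the function name is hypothetical.

```python
import re
from typing import List

# Hypothetical subset of agency codes, limited to the examples given above.
AGENCY_CODES = {"VA", "USDA", "TRE"}

# An upper-case prefix followed by an underscore, e.g. "VA_" in "VA_key".
_PREFIX = re.compile(r"^([A-Z]+)_")


def foreign_reserved_keys(agency_obj: dict, agency_code: str) -> List[str]:
    """Return keys in agency_obj that sit in another agency's reserved namespace."""
    offending = []
    for key in agency_obj:
        match = _PREFIX.match(key)
        if match and match.group(1) in AGENCY_CODES and match.group(1) != agency_code:
            offending.append(key)
    return offending
```

Keys without a recognized agency prefix are left alone, so ordinary schema fields are unaffected by the check.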

@ckaran

ckaran commented Jul 12, 2017

@apyle That suggests that there will be a definitive list of agency codes that users can consult. Does that exist anywhere?

@apyle

apyle commented Jul 12, 2017

@ckaran, Yes, it is currently embedded in the code. If we move forward with this we will want the list formalized and published in a user friendly format.

@ckaran

ckaran commented Jul 12, 2017

Nice! And since it's in JSON, user-friendly forms can be automatically derived, which makes adding new agencies simple.

@ctubbsii

ctubbsii commented Jul 13, 2017

I have three suggestions for the schema:


I'm still very confused by the governmentWideReuse field. As I understand it, this field is primarily intended to reflect the presence of a certain kind of clause in the contract under which the software was created. However, it is not clear what value this field is supposed to have for software which is entirely government produced (e.g. no contractors), or if it was produced under older contracts which might have similar reuse features.

From speaking with several people, I've heard many interpret this field as "can government reuse this software?". I don't think that's the correct interpretation, because it seems to me that it's possible the field is 0 because the contract clause it's supposed to be a proxy for is not present, but the project is still open source and able to be reused for some other reason (employee-produced not contracted, contractor releases under open source license, older contract with similar reuse features, etc.)

I would like to see that field documented better, and more guidance given as to how to set it. It might be worth considering deprecating it or replacing it with something more descriptive, like governmentWideReuseMechanism whose values can be the name of the contract clause or clause type, which permits reuse, or OSS to indicate it has been made available under an open source license, or Public Domain to indicate that it is reusable because it was employee-produced, or None to indicate it is not available for government-wide reuse.


Another issue I'd like clarified is which projects are tracked. Not every line of code a developer writes is worth tracking. When does a project become large enough that it goes from "some small utils in my home directory" to "project" status and should be tracked? Additionally, it's not clear how to track projects that started within an Agency but were then transferred outside via something like a CRADA or a contribution to an open source community, like Apache.

Clarity on tracking non-Agency published open source software which receives (or received) significant contribution from a Federal Agency would be very helpful, as well as clarity for those curated by another entity under a collaboration agreement.

If all of this should be tracked, the schema should be updated to indicate curation by another, non-government entity. Perhaps maintainedBy?


Another suggestion I have for the schema... which I don't think will go over well, because it represents a larger conceptual change... is that the schema should stop tracking projects, and start tracking "releases". This is what open source projects do, and it's far more useful metadata to track. Releases are static, and the facts about them are true at the time of their release. Details about a project can change between releases, and metadata about releases can easily reflect this. See all the "POM" files in search.maven.org for an example, or any of the DOAP files at apache.org.

This requires introducing and defining the concept of "release" to the schema. Typically a "release" is a checkpoint in the code which represents a state which can be identified by a versioned name and a statement of quality or supportability that the developer wishes to communicate. In certain open source communities, such as the Apache Software Foundation, a release represents having a certain legal status as well. Of course, some projects are "continuous release" types, and that would have to be accounted for in the schema or the definition of "release" used by the schema if that were added. For these, a version could be imposed and incremented whenever any change in the metadata occurs.
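For the "continuous release" case, the imposed version could be derived mechanically. The sketch below is one possible approach, not a proposal for the schema itself: it fingerprints the metadata with a hash that ignores key order and bumps an integer version only when the fingerprint changes. The function names and the integer version scheme are assumptions.

```python
import hashlib
import json
from typing import Tuple


def metadata_fingerprint(metadata: dict) -> str:
    """Return a stable SHA-256 hash of the metadata, independent of key order."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def next_version(prev_version: int, prev_fingerprint: str, metadata: dict) -> Tuple[int, str]:
    """Return (version, fingerprint), incrementing the version only on a change."""
    fingerprint = metadata_fingerprint(metadata)
    if fingerprint == prev_fingerprint:
        return prev_version, fingerprint
    return prev_version + 1, fingerprint
```

Storing the fingerprint alongside the version means a harvester can tell whether anything changed without comparing every field.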

@sheepeeh

I probably have more thoughts than this, but wanted to be sure to get something in before the deadline.

  • contact appears under both required and optional fields. If the point is that email is required, but the other fields are not, that should be clarified.
    • Add githubAccount optional field for contact
  • I think exemption should be required if the project has one. It'll probably save your FOIA officer some time/headaches.
  • The list of agencies is pretty short--does it only include current pilot agencies? Or is the intention to not go lower than the department level? This might be a semantic argument, but I'd consider the Agricultural Research Service to be an agency under the Department of Agriculture. The National Agricultural Library is an organization under ARS. The US Digital Registry has 129 agencies, so it would just be good to have clarity on what is and is not included in the agency list vs what should be in the organization field.
  • Will there be any attempt to control what tags are used? (or at least have a suggested list?) I'd also like to see suggested formatting rules. E.g. are spaces allowed? If not, should words be separated by dashes, underscores, or camelCase?
  • It might be a good idea to include some guidance for description--is a sentence enough? Is more than a sentence a bad idea (e.g. should we avoid abstract-length descriptions)?
  • I'm tempted to suggest a lot more optional fields (especially checksum), but I understand the desire to make the schema as simple as possible to maximize adoption. I wonder if it might be worth having a separately documented extension that's available for those looking for it, but won't scare people away with a daunting initial list of fields.

Existing suggestions I agree with:

  • Agree that project.dateCreated or project.dateStarted would be helpful. I'd also like to see an optional field for project.dateArchived.
  • Agree that project.sizingMethod would be good to add, even if it's optional
  • On the confusion around governmentWideReuse--my understanding is that the intention of this field is to indicate whether the code itself was _designed to be reusable_ rather than if reuse is allowed. I agree with this comment that the field should be renamed designedForReuse and the definition updated to reflect this.
  • I also agree that provenance or source is a must-have for forked projects. Probably also with forked as a required binary field. (same comment as above)

@ctubbsii

@sheepeeh The "designed for reuse" wording still leaves the question open as to whether the code was developed for multiple consumers (modular design, stable API, built as library), or whether the contract wording was designed to make the project freely reusable. Those aren't the same thing.

@sheepeeh

@ctubbsii That's why the field definition should also be updated to reflect which case it's referring to.

@ctubbsii

@sheepeeh Sorry, I should have been clearer. I agree with you on that. But if the intent is to indicate something about the contract it was developed/acquired under, then my suggestion would be to change the field so that it more explicitly declares which contract clause/type it is reusable under. That is, instead of a binary field, it should be an enum. Example: "governmentWideReuseMechanism" = "DFARS Part 252.227-7014" or "governmentWideReuseMechanism" = "Public Domain"
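A rough sketch of how such a field might be validated, assuming the fixed values suggested earlier in this thread ("OSS", "Public Domain", "None") plus free-form contract clause names. The accepted clause prefixes and the function names are purely illustrative assumptions.

```python
# Fixed values suggested earlier in this thread for governmentWideReuseMechanism.
FIXED_MECHANISMS = {"OSS", "Public Domain", "None"}

# Hypothetical prefixes for contract clause names; a real list would need to be
# agreed on and published alongside the schema.
CLAUSE_PREFIXES = ("DFARS ", "FAR ")


def is_valid_reuse_mechanism(value) -> bool:
    """Return True if value is a fixed mechanism or looks like a contract clause."""
    if not isinstance(value, str) or not value.strip():
        return False
    return value in FIXED_MECHANISMS or value.startswith(CLAUSE_PREFIXES)


def is_reusable(value: str) -> bool:
    """Return True if a valid mechanism other than "None" is declared."""
    return is_valid_reuse_mechanism(value) and value != "None"
```

Unlike a 0/1 flag, a value such as "DFARS Part 252.227-7014" tells the reader why reuse is permitted, while the old yes/no answer can still be derived from it.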

@sheepeeh

@ctubbsii Ah, ok. As for permission to re-use, I believe a project is reusable unless it has an exemption--and a contract clause re: IP would be exempted under condition 1.

"The sharing of the source code is restricted by law or regulation, including—but not limited to—patent or intellectual property law, the Export Asset Regulations, the International Traffic in Arms Regulation, and the Federal laws and regulations governing classified information."

Whether it is Public Domain or something else would be recorded under license. If I'm understanding the purpose of the element correctly, that is.

@apyle

apyle commented Jul 14, 2017

@sheepeeh A software project may be listed in the inventory and not be exempted but still not available for governmentwide reuse. For instance, the program may not know if they have the rights to share the code and are working through the contracting issues. A code project can have "exemption" = 0 (well, technically no exemption value since it's optional) and "governmentWideReuseProject" = 0.

@ctubbsii

"exemption" = 0 is also confusing. It should declare an exemption reason. Again, binary values are less informative/useful than enums.

Instead of declaring the mechanism under which a project is reusable, it could declare the mechanism under which it is not reusable, due to an exemption. As in "reuseExemption" = "exemptionReason1". Either way would be better than the currently confusing binary field for either "governmentWideReuseProject" or "exemption".

(As for "Public Domain", my previous example might have been better if it instead said "government work", in conjunction with a license field that said "Public Domain".)

@iadgovuser1

I really like the idea of having something that captures DFARS clauses for code that was developed under contract. Maybe this could also capture whether the code is exempt from copyright in the US because it was developed by a US government civilian. It would need to be flexible enough to capture the case where a code base has both of those present.

@iadgovuser1

Unless something like #238 is implemented, I don't see the point of these fields, as they are unlikely to be accurate unless the projects are inactive:

  • metadataLastUpdated
  • lastCommit

10 participants