[Request for Discussion] Software inventory metadata schema and inventory collection #41

theresaanna · 2016-09-07T16:41:08Z

Part of the Federal Source Code Policy requires that federal agencies make available an inventory of metadata describing their custom software. We’re exploring ways for agencies to provide their inventories. We want to implement a solution that works well for agencies and we need your help to do that.

The Federal Source Code Policy describes code.gov as “the primary discoverability portal for custom-developed code intended both for Government-wide reuse and for release as OSS.” The inventory data that agencies provide will be made available through code.gov. The data we collect should make it possible for agencies to find projects relevant to their needs.

There are two primary areas we see where decisions need to be made: the data format and what data is collected.

Data Format

The two options we are considering are CSV and JSON. The assumed benefit to a CSV-based approach is that it is easier for agencies to create and maintain a CSV than JSON. With this approach, we might create a system for agencies to submit their inventory CSV.
With a JSON-based approach, we might ask agencies to make the “inventory.json” available on their website and we would have a system to retrieve inventories as they change. One drawback to JSON is that it is more effort to maintain, takes specialized knowledge, and we may need to provide a tool to build the JSON. On the other hand, JSON is easy to work with programmatically and matches what Data.gov does, meaning many agencies have some familiarity with the process that inventory updating would entail.

The unanswered questions on data format are:

Which data format is the best fit: CSV or JSON?
Is it best to retrieve or ask agencies to submit their inventories?

Collected Data

In either data format, we need to determine what data we will collect. Below is a list of fields we are considering accepting.

Proposed required fields:

Project Name: The name of the project
Project Description: A description of what the software does
Point of Contact Email Address: Email address for the project point of contact

Proposed optional fields:

Version Control System: the VCS that the project uses
Repository URL: The URL of the upstream project repository, if applicable
Project URL: The URL of the project homepage
Project Tags: Tags that will help Code.gov users find projects based on their needs
Languages: Which programming languages are used in the software
Last Updated: Timestamp of when the project codebase was last updated
License: The type of license that the source code of the project is released under
Open Project Status: Whether or not this project is open source
Government-wide Reuse Project Status: Whether this project is designed for reuse across government
Exemption: Which exemption, if any, is being relied upon for keeping the source code private

For an idea of what the data might look like, we have an early draft of a schema with example content: (https://gist.github.com/theresaanna/a82bfb39b64362bca04e4644706b0ce4)

The questions that we are looking to answer here are:

Is this the right information to collect? Have we missed something?
What should we consider on the part of agencies when implementing this data schema?

Thanks for your feedback! It’s crucial for us in meeting our goal of providing a system and schema that are easy to use and meet agencies' needs.

theresaanna · 2016-09-07T17:22:40Z

I've asked a handful of developers here at 18F for some feedback on approach and schema. Here are some highlights:

Projects that have instances per-agency, like the eRegulations platform, will be duplicated in the directory.
We should provide a way for agencies to submit tarball or other packaged software URLs.
In the draft schema, we should make the names clearer and avoid abbreviations like "openPjct", "govwideReusePjct", and "closedPjct", opting to instead spell out "project".
"pjctTags" should be an array.
"license" may have multiple values
It could be helpful to have some searchable metadata as indicators of overall project health (# contributors, alpha/beta/prod status, etc.)
We may be able to remove "closedPjct" because the presence of "exemption" would indicate that it's a closed project.
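To make a few of these suggestions concrete, here is a minimal sketch of what one record might look like with spelled-out names, an array of tags, a list of licenses, and no separate "closed" flag. The field names and sample values are assumptions for illustration only; they are not the agreed schema.

```python
import json

# Hypothetical record: names and values are illustrative, not the official schema.
project = {
    "projectName": "Example Project",
    "projectDescription": "A made-up project used to illustrate the record shape",
    "projectTags": ["example", "platform"],  # an array rather than a delimited string
    "license": ["MIT"],                      # a list, since a project may carry several
    "openProject": True,
    "governmentWideReuseProject": False,
    "exemption": None,                       # no exemption recorded
}


def is_closed(record):
    """Treat a project as closed when an exemption is recorded."""
    return bool(record.get("exemption"))


inventory_json = json.dumps({"projects": [project]}, indent=2)
```

With this shape, a consumer can derive "closed" from the presence of an exemption instead of relying on a second field that could drift out of sync.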

mgifford · 2016-09-07T17:33:53Z

Would be interesting to be able to record if projects:

Meet with Section 508 (WCAG 2.0 AA) requirements
Support multi-lingual content (English/Spanish) and interface
Have undergone external security reviews
Are actively maintained or have critical mass
Include screenshots or other documentation

That being said, good to make it as easy as possible for folks to get started. We don't want to overwhelm folks.

rossdakin · 2016-09-07T19:18:17Z

I firmly believe that the proposed schema should include a standards-compliant 3.5mm headphone jack.

rossdakin · 2016-09-07T20:16:37Z

But seriously, some thoughts:

Data Format

Some thoughts on proposed data format standards for agency publication (code.gov consumption).

NOTE: a related but distinct feature of code.gov should be the publication of its aggregated inventory. There may be value in providing this inventory in many formats (expecting many varied consumers), whereas below I advocate for a single data format (expecting code.gov to be the sole consumer).

JSON

This should be the standard, IMO, for all the reasons that JSON has become popular: easily readable, ubiquitous, expressive (i.e. allows for collections (arrays) unlike CSV), and libraries exist in all major languages/platforms for JSON generation.

If only one format is supported, I suggest it be JSON.

XML

This wasn't mentioned, but is ubiquitous enough to warrant discussion. As I see it, JSON can do everything XML can do while being more readable and easier to construct and less complex to define (no WSDL, etc.).

If the schema were intended for broad consumption, I might suggest discussing XML support, but seeing as the schema is primarily intended for consumption exclusively by code.gov, I don't think the added complexity yields much additional benefit.

CSV

I don't see any benefits to supporting CSV, which lacks support for expressions like multi-dimensional collections (i.e. arrays) beyond the single dimension of the rows in a CSV table. One could hack around this constraint by supporting dynamic column headers (e.g. tag_1, tag_2, ... tag_n) or implicitly enumerated columns (e.g. tag, tag, tag -- similar to how some web frameworks handle array POST value). The same could be done for nested attributes (e.g. contractor_1_contacts_contact_2_phone) but that's incredibly inelegant.
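A rough sketch of the enumerated-column workaround, next to the same record in JSON. The record, the `flatten_for_csv` helper, and the `tag_` prefix are all made up for illustration.

```python
import csv
import io
import json

# Hypothetical record with a list-valued field.
record = {"name": "example-project", "tags": ["gis", "maps", "open-data"]}


def flatten_for_csv(rec, list_field="tags", prefix="tag"):
    """Spread a list into enumerated columns: tags -> tag_1, tag_2, ..."""
    flat = {k: v for k, v in rec.items() if k != list_field}
    for i, value in enumerate(rec.get(list_field, []), start=1):
        flat[f"{prefix}_{i}"] = value
    return flat


flat = flatten_for_csv(record)

# The CSV header depends on how many tags this particular record has.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
csv_text = buf.getvalue()

# JSON keeps the list as a list, whatever its length.
json_text = json.dumps(record)
```

The column set changes with the longest list in the file, so every consumer has to rediscover the header before parsing; the JSON form has no such dependency.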

One could argue that CSV is simpler to publish when maintaining an inventory by hand (e.g. by exporting an Excel spreadsheet). While this is true, I don't think that benefit outweighs the inherent limits of the format. It also seems that in the long-term, we would want agencies to programmatically generate their inventory file rather than hand-crafting it manually; not supporting CSV may nudge them in the desired direction.

YML

For the sake of completeness -- not mentioned above, but worth discussing. Same attributes as JSON but somewhat more human-readable and somewhat more fragile (white space dependency). I don't see a benefit to supporting YML in addition to JSON.

Collection Methodology

Is it best to retrieve or ask agencies to submit their inventories?

Pros and cons either way. A "pull" methodology seems to be the simplest (avoids "push" credential checking, account maintenance, etc.; also puts burden of initiation on code.gov centrally rather than on each agency individually).

One benefit of a "push" methodology would be more real-time reporting, though I'm not convinced that real-time reporting is very important in this project or outweighs the additional complexity.

CRUD

It's also worth talking about how specific actions should be taken and how certain situations should be interpreted.

For example, if a record suddenly stops being included in an agency's reported inventory, what does code.gov do? Delete it? Ignore the omission?

Should code.gov assign unique identifiers or require them as a part of inventory submission (to avoid duplication and enable "upserts")?

Which inventory actions should be idempotent?

Etc.
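One way these questions could play out, as a sketch only: key every record on a unique identifier, upsert whatever the agency reports, and surface omissions separately so the delete-or-keep decision is explicit. The `identifier` key and the `reconcile` function are hypothetical, not anything code.gov has committed to.

```python
def reconcile(stored, reported, key="identifier"):
    """Upsert reported records into stored ones, keyed by a unique identifier.

    Returns (merged, missing). `missing` lists identifiers that were stored
    before but are absent from the latest report; the caller decides whether
    to delete, archive, or ignore them.
    """
    merged = dict(stored)
    seen = set()
    for record in reported:
        record_id = record[key]
        merged[record_id] = record  # insert or overwrite
        seen.add(record_id)
    missing = sorted(set(stored) - seen)
    return merged, missing


stored = {
    "a": {"identifier": "a", "name": "Project A"},
    "b": {"identifier": "b", "name": "Project B"},
}
reported = [
    {"identifier": "a", "name": "Project A (renamed)"},
    {"identifier": "c", "name": "Project C"},
]
merged, missing = reconcile(stored, reported)
```

Because each record is written under its identifier, submitting the same inventory twice leaves the stored state unchanged, which is the idempotence property asked about above.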

Collected Data

1,000% agree with @theresaanna on fully spelling out field names rather than using abbreviated/Hungarian-like naming.

Relationships / Reuse

If one agency does start using code from another agency, how is that represented in the code.gov data model?

ctubbsii · 2016-09-07T20:25:20Z

If data entry is provided, then the format CSV or JSON doesn't matter, because the view can be exposed either way. The format does matter for bulk-import of metadata, and for that, I'd prefer JSON.

I think it's best to ask agencies to submit their inventory to code.gov (this is where that bulk-import feature helps), rather than rely on them to publish on their own site and pull from there (not all government agencies have up-to-date and convenient sites, and if you provide the platform for receiving the information, it'll probably be easier and faster to get the data than requiring them to sustain their own platform for publishing). Some incentive should be provided to ensure project managers submit this data. Using the data to have a "featured projects" page might be one way to incentivize timely submissions.

As for fields,

Project URL should be required. It's the single most important piece of information. Everything else can typically be discerned by visiting that URL. If it doesn't have a URL, I'm not sure why it would ever be listed here (unless the intent is to publish metadata about closed-source projects).
POC Email should not be required, because sometimes, the best way to contact an open source project is through the forums, not directly. Additionally, email as a mandatory method of communication is not really future-proof.
Last Updated is a confusing field. Does it refer to the last commit? The last time a user reported a bug? The last mailing list discussion? The last time the metadata was updated? If it's going to be there at all, it should refer to the last update of the metadata. It's not reasonable for projects to update this metadata field every time the project itself has activity, so it's pretty useless for that purpose. It would, however, be useful to see if the metadata is old or not. If it gets used this way, it should be required.
License should be required. This is a pretty important field.
Open Project Status is confusing. Is this metadata intended to index non-open source projects as well? Even if it is, this raises the question of "whose definition of open is being used?". This is also redundant with the License field, because the status is determined solely by the license.
Government-wide Reuse Project Status is also confusing. Why would this ever matter? An agency's intention that their release be reused across government has no bearing on whether or not it will be.
Exemption field may not be useful. I imagine that most things that would exempt it would also exempt the metadata being requested. As an optional field, I guess it's fine, but it's probably better to simplify things and eliminate fields until a demonstrated value exists. Best to start small and grow bigger, than to start big and just grow more complex.

niden · 2016-09-08T17:17:14Z

Adding to what @ctubbsii wrote:

Last Updated could be renamed to something like Updated and become an array that has more information in there such as LastCommitDate, LastMetadataUpdate, LastPullRequest etc.

The Languages should be an array not a comma separated field. It will be easier to index that way IMO

I see little value in supporting CSV or XML. As @rossdakin points out, not offering CSV will point people to the right direction :)

bondsbw · 2016-09-09T05:32:58Z

Government approval processes often become roadblocks and cause systems and data to become stale and unreliable for their purposes. I fear the same for this effort. As red tape is added, this data could become so dated that nobody finds it useful.

I suggest that Code.gov needs to get in front of this problem before the culture settles. Encourage agencies to push metadata updates as quickly and as often as possible while reducing red tape in these processes. Make the update process responsive by eliminating any approval processes aside from standard security and authorization measures.

I would hate to see all this effort reduced to the usual "I technically did my part" checkbox I find in too many government tasks.

jasonduley · 2016-09-09T16:27:44Z

Which data format is the best fit: CSV or JSON?

We prefer a JSON based serialization as we have tools inside of NASA to support both open data and internal code sharing that operate seamlessly with JSON. Also, it should be allowable for agencies to extend the base schema to append additional attributes and this is not possible or easily done with CSV.

Is it best to retrieve or ask agencies to submit their inventories?

We would like to mimic the scheme that data.gov uses and post the JSON file on a web server and have it harvested at some interval. Ideally, we'd like to have access to the harvest job admin screen in order to run it manually, as well as a dev harvester for performing end-of-quarter batch loads.

As I mentioned on our call today, since the majority of our code is behind the NASA firewall, it would reduce perceived risk and increase NASA's adoption of this policy if URLs are optional. Of course, for open source repositories such as the ones NASA maintains here: www.github.com/nasa, we would include the URL fields as they are important in this context. I think title, description, and POC are all important for code discovery and setting up potential collaborations between government parties.

Schema comments I have, within the Projects array ...
VCS should be typed ENUM (avoid confusion between SVN and Subversion for example)
pjctTags should be typed an array of strings
codeLanguage should be typed an array of strings and potentially ENUM to avoid terminology mismatches (node vs node.js)
POCemail should be replaced with POC of type object similar to ...
POC: {
    email: "jason.duley@nasa.gov",
    name: "Jason Duley"
}
boolean fields should be true/false

Also, from a schema standard we should decide if attributes should be included with NULL values OR if those NULL-valued attributes should be omitted.
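A small sketch of two of these points: mapping free-text VCS names onto a fixed set of values, and omitting NULL-valued attributes before serializing. The alias table and both helper names are assumptions for illustration; the real ENUM values would need to be agreed on.

```python
# Hypothetical alias table: maps free-text input onto a fixed set of VCS values.
VCS_ALIASES = {
    "git": "git",
    "svn": "svn",
    "subversion": "svn",
    "hg": "hg",
    "mercurial": "hg",
}


def normalize_vcs(value):
    """Return the canonical VCS name, or raise if the value is not recognized."""
    key = value.strip().lower()
    if key not in VCS_ALIASES:
        raise ValueError(f"unknown VCS: {value!r}")
    return VCS_ALIASES[key]


def drop_nulls(record):
    """Remove keys whose value is None, recursing into nested dicts."""
    return {
        k: drop_nulls(v) if isinstance(v, dict) else v
        for k, v in record.items()
        if v is not None
    }


cleaned = drop_nulls({
    "VCS": normalize_vcs("Subversion"),
    "exemption": None,
    "POC": {"email": "jason.duley@nasa.gov", "name": None},
})
```

Whichever NULL convention is chosen, applying it in one place like this keeps every agency's file consistent.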

ddelmoli · 2016-09-09T17:08:52Z

If considering a JSON format, it may be useful to follow / look at the npm package file format https://docs.npmjs.com/files/package.json

RobertRM · 2016-09-09T17:33:30Z

And Git Hooks would be a good way to submit this information while pushing to GitHub for projects hosted on that platform.

http://githooks.com/

IanLee1521 · 2016-09-11T01:05:18Z

Personally, I would prefer not XML for the reason that it isn't as well supported by tools like Jekyll which may be used for the display / web visualization of the data.

Another thought, should the fields / the spelling of the fields be aligned with the type of information that can be grabbed from sources like the GitHub REST API?

This would allow, at least for open / GitHub repos, the ability to absorb all projects by only knowing the organization names. This is something that I am doing for the @LLNL organization to create a software portal, much like what Code.gov will become, at software.llnl.gov.

IanLee1521 · 2016-09-11T01:06:42Z

Oh, and I also agree with @jasonduley that the ability for agencies to push into the repository would greatly ease the integration of "inside the firewall" code hosting.

jbjonesjr · 2016-09-12T18:54:10Z

I want to take a second before responding myself to thank @rossdakin for his detailed post above. He did a great job laying out reasoning behind multiple formats and each delivery mechanism. Thank you for taking the time to share that and add to the conversation.

Now, some thoughts in no particular order....

It would be really nice if tags could be a fixed set instead of freeform. I'd be curious if StackOverflow published a list of its top tags that could seed this project. While rejecting data for incomplete tags is not optimal, dealing with multiple tags that mean the same thing ("Subversion, SVN, svn") can make discovery very difficult.
As @IanLee1521 mentioned, this metadata should be derived wherever possible (the GitHub API is great for this). Creating an API process to iterate each repository in the organization, collect the proper information, then "push" it to either code.gov or the agency website would be a pretty low lift (for systems already collecting that information at least).
For @jasonduley,

it would reduce perceived risk and increase NASA's adoption of this policy if URLs are optional.

Can you tell me more of how NASA treats internally-resolvable urls as a risk? I'd think as the govt works towards more inner-sourcing and reuse, that being able to go "to" the code will be a big help.

As we talk about push vs pull, keep in mind what happens when projects are abandoned? Who updates this information then (or if someone wants to register a long-abandoned project)? Might it be helpful to include "expected project period of performance" to provide a hint that a project might be OBE at a later point in time?
Regardless of ingest format, couldn't code.gov do the translation and offer data in many formats? Very Write Once, Read Many...
In terms of push vs pull, would you require the parent agency's website to publish the inventory? So all the DOE labs would be required to roll up to DOE? How about multi-agency organizations? I think it would simplify the data flow if push was decided as the standard. The only issue there is you lose some agency capability to keep on top of the inventory.

jasonduley · 2016-09-12T19:18:24Z

@jbjonesjr
Today, mission-based CM systems that contain flight code, vehicle commands, ground software, etc. and other sensitive projects are not going to allow a firewall exception to government partners and will most likely share "released" source code by re-hosting to neutrally located CM systems outside the NASA internal firewall for government-wide sharing. For the inventory, the URL should be optional for internal source as they live behind the firewall.

IanLee1521 · 2016-09-12T19:39:55Z

@jasonduley -- Would providing the links, even if they are inaccessible be an issue? It seems like if it were possible to provide the where now, that would assist with identifying where new connections need to be established.

@jbjonesjr -- One other thought is that the number of sources for the metadata we (all) would be scraping is fairly limited... There are only so many tools for hosting code. GitHub.com obviously, but also: GitLab, Bitbucket.org, Bitbucket Server, SourceForge, etc. By deciding on a common format and building tools for scraping that data out of these sources, all of the agencies would be able to contribute collaboratively.

jasonduley · 2016-09-12T19:57:33Z

@IanLee1521
I think supplying URLs to NASA's internal and tightly secured code projects will cause issues for us. Please note this would be a subset of projects in the inventory and all already released open source would contain URLs.

IanLee1521 · 2016-09-12T20:16:36Z

Makes sense... For what it's worth, I suspect we would have similar issues @LLNL.

bbrotsos · 2016-09-13T16:31:07Z

Collected Data

Code.gov should reuse data element names and definitions from the Project Open Data metadata schema https://project-open-data.cio.gov/v1.1/schema/ where possible. These are based on W3C DCAT http://www.w3.org/TR/vocab-dcat/ and Dublin Core, which have been around for many years. Alternatively, if GitHub, GitLab, or another code repository has existing data elements and types, this project could use those fields. Code.gov could reuse the following fields from Project Open Data:

title
description
keyword
contactPoint
publisher
license
landingPage
identifier

An example:

{
     "projects":{
          "title": "Important USDA Code Repository",
          "description": "Creates new automated farms",
          "landingPage": "usda.gov",
          "repositoryURL": "github.usda.gov/automated-farms",
          "softwareLanguage": ["ada", "perl", "cobol"],
          ...

There may be more fields to reuse. I also recommend adding fields which will be good for analytics of what agencies and investments are releasing their code:

bureauCode
programCode

By aligning to these field names, there is also hope of developing a common system for storing data sets, data assets, and code repositories. For example, we could potentially create an extension for CKAN or DKAN to also store code repositories. You could also reuse existing documentation.

rough68fish · 2016-09-13T16:46:54Z

I think it would be a good idea to follow the process established by the data.gov effort as much as possible. Since most agencies have been working on setting that up, they should be familiar with JSON and have processes for creating and maintaining the JSON data.

Also try not to invent a whole new schema and if possible try to reuse data.gov data descriptions where you are talking about the same thing.

theresaanna · 2016-09-13T16:59:35Z

That being said, good to make it as easy as possible for folks to get started. We don't want to overwhelm folks.

@mgifford I agree that we should make it easy for folks to get started, but you bring up some valuable data points we might collect. Thanks so much for your feedback. A question that remains for me is whether it's better to have an initial version of the schema that we add onto as agencies feel more comfortable or if it's better to be thorough up front.

thecapacity · 2016-09-13T17:43:52Z

@theresaanna I think you've got a lot of good material in the above discussion, and may have already seen this from some of my colleagues: https://18f.gsa.gov/2016/08/29/data-act-prototype-simplicty-is-key/

"... One of the earliest decisions our team grappled with centered on the data format we would receive from agencies. ... "

I wanted to augment some of the earlier comments: it definitely seems like an "and", and making one machine-readable format is a good way to validate another (e.g. CSV to validate a "more formal" JSON/XML/... spec).

theresaanna · 2016-09-13T17:45:10Z

@rossdakin Thank you so much for your thoughtful feedback! You've brought up some great food for thought. I am in agreement with you that a JSON, pull-based system makes the most sense. Some thoughts:

For example, if a record suddenly stops being included in an agency's reported inventory, what does code.gov do? Delete it? Ignore the omission?

My assumption is that code.gov always reflects the most recent version of agency inventories, meaning we'd delete the record. I don't know if this is a good assumption. Are there cases in which we'd want to hold onto old data? I imagine it'll be normal for software to drop out of inventories as it becomes replaced.

Should code.gov assign unique identifiers or require them as a part of inventory submission (to avoid duplication and enable "upserts")?

You bring up a great point. I think that for a first version, given the aggressive timeline the policy lays out, we won't be able to tackle this, however, I will add it to our backlog for addressing in the future. I cringe a little to say that, as this is something we'll want to think about sooner rather than later admittedly.

If one agency does start using code from another agency, how is that represented in the code.gov data model?

That is a fantastic question! I think we will need to have some discussion around how we might represent that - whether it's in the data model or a layer on top of it. Do you see any benefits to having it in the data model?

CynthiaParr-USDA · 2016-09-13T18:05:20Z

Because new code is often generated in association with research data, we are encouraging data submitters to the Ag Data Commons (https://data.nal.usda.gov) to also submit a pointer and metadata description for their software (which we hope is primarily managed in an open source code repository). Two points to make about this:

We have the same POD 1.1 metadata for the software (which we have augmented with a few fields -- see https://data.nal.usda.gov/description-fields-%E2%80%9Cedit-dataset%E2%80%9D-page)
We obtain DataCite DOIs for software tools, whether they are registered separately from their data or included as a resource in a data package.

I would encourage processes to align as closely as possible with the existing open data.gov processes. I have no problem with additional value-added metadata.

jecb · 2016-09-13T19:47:19Z

Apologies if this question has been asked, but has there been discussion around creating a JSON conversion tool similar to the DCOI Strategic Plan: https://datacenters.cio.gov/json-conversion-tool/?

okamanda · 2016-09-14T15:52:13Z

@jecb and others have brought up making a tool or process to make generating the code inventories as easy as possible. I think the first step in doing so is mapping schema fields to some of the web-based repo hosting tools (e.g., GitHub, Bitbucket), especially those that have APIs.

To that end, I've put together this table which shows what this might look like.

schema field	github field	bitbucket field
agencyAcronym	given	[given]
projects.vcs	[git]
projects.repoPath	[html_url] or [url]
projects.repoID	[id]
projects.projectURL	[homepage]
projects.projectName	[name] or [full_name]
projects.projectDescription	[description]
projectTags.tag	(?) process/analyze from [description], [name], and [language]
codeLanguage.language	[language]
Updated.LastCommitDate	[updated_at]
Updated.LastMetadataUpdate	[pushed_at] or [updated_at]
Updated.LastPullRequest	grab [updated_at] from [pulls_url]
POCemail	(?)
license	(?) grab/process/analyze from LICENSE.MD/README.MD, etc.
openproject	1
govwideReuseproject	0
closedproject	0
exemption	null
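As a rough sketch of how the GitHub column of this table might be applied, the function below maps one repository object (as a plain dict, e.g. decoded from the GitHub REST API) onto the draft field names. The function name, the nesting of the output, and the sample values are assumptions; no network call is made.

```python
def github_repo_to_inventory(repo, agency_acronym):
    """Map one GitHub repository object (a dict) to draft inventory fields."""
    language = repo.get("language")
    return {
        "agencyAcronym": agency_acronym,  # "given" by the agency, not by GitHub
        "vcs": "git",
        "repoPath": repo.get("html_url"),
        "repoID": repo.get("id"),
        "projectURL": repo.get("homepage"),
        "projectName": repo.get("name"),
        "projectDescription": repo.get("description"),
        "codeLanguage": [language] if language else [],
        "Updated": {
            # Follows the table's mapping as written.
            "LastCommitDate": repo.get("updated_at"),
            "LastMetadataUpdate": repo.get("pushed_at") or repo.get("updated_at"),
        },
        # Rows marked "(?)" in the table have no direct field to copy from.
        "POCemail": None,
        "license": None,
    }


# Made-up repository object for illustration only.
sample_repo = {
    "id": 12345,
    "name": "example-repo",
    "full_name": "example-org/example-repo",
    "html_url": "https://github.com/example-org/example-repo",
    "homepage": None,
    "description": "An example repository",
    "language": "Python",
    "updated_at": "2016-09-14T15:00:00Z",
    "pushed_at": "2016-09-13T12:00:00Z",
}
entry = github_repo_to_inventory(sample_repo, "XYZ")
```

The fields left as `None` are the ones an agency would still need to supply by hand or derive some other way.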

VisionPaul · 2016-09-14T16:07:44Z

Collected Data
Adding the name of the system or platform may help for purchased environments that allow for custom solutions to be developed within. We use both Salesforce and ServiceNow - and many other agencies are using these platforms as well, and it would be great to search and post developed solution sets - especially since they probably already come with some level of A&A.

Maybe "softwareLanguage" as @bbrotsos has listed above would be appropriate usage for this example....

theresaanna · 2016-09-14T22:03:16Z

@ctubbsii Thank you so much for all of your feedback. You've brought up some great points that are so valuable in helping us think this through. I've replied to much of your comment inline:

Project URL should be required. It's the single most important piece of information. Everything else can typically be discerned by visiting that URL. If it doesn't have a URL, I'm not sure why it would ever be listed here (unless the intent is to publish metadata about closed-source projects).

So, we will be collecting data about presumably many closed-source projects, and so a public URL may not be available.

POC Email should not be required, because sometimes, the best way to contact an open source project is through the forums, not directly. Additionally, email as a mandatory method of communication is not really future-proof.

I agree that it's not very future-proof. My assumption was that agencies would need a way to get in contact with the project maintainers if this inventory were to be useful. However, I'm not sure that's a good assumption. I'm planning to remove it as a required field unless a good argument is made to the contrary.

Last Updated is a confusing field. Does it refer to the last commit? The last time a user reported a bug? The last mailing list discussion? The last time the metadata was updated? If it's going to be there at all, it should refer to the last update of the metadata. It's not reasonable for projects to update this metadata field every time the project itself has activity, so it's pretty useless for that purpose. It would, however, be useful to see if the metadata is old or not. If it gets used this way, it should be required.

Interesting. I had thought of this field as a signifier of the activity of a project, but this would be hard to maintain unless we were pulling project info right from GitHub or similar. I agree that it would be useful to see when the metadata was last changed. The more I think about this field, though, the less convinced I am that we need it. Until we have a tool to generate the inventory JSON, I imagine this field will fail to be updated with changes, making it unreliable.

License should be required. This is a pretty important field.

Agreed that it is important, but unfortunately not all software will have a license. Along the lines of the suggestion you made about incentives, perhaps there's a way to encourage folks to release code and help them decide which license is right.

Government-wide Reuse Project Status is also confusing. Why would this ever matter? An agency's intention that their release be reused across government has no bearing on whether or not it will be.

There are projects that are built as platforms or to be reused specifically. For example, the eRegulations project. https://eregs.github.io/. This field will allow users to look specifically for these types of projects.

Exemption field may not be useful. I imagine that most things that would exempt it would also exempt the metadata being requested. As an optional field, I guess it's fine, but it's probably better to simplify things and eliminate fields until a demonstrated value exists. Best to start small and grow bigger, than to start big and just grow more complex.

Definitely agreed on preferring to start small.
@okamanda or @mattbailey0 - I realize I don't fully understand what it means for a project to be exempt. Is this exemption from the open source part of the policy?
A related question: If you look at the original schema, it has exemption and closedPjct. Will a closed project always have an exemption? Put another way, can we remove closedPjct and rely on the existence of exemption to indicate that it's closed?

theresaanna · 2016-09-14T22:11:42Z

Last Updated could be renamed to something like Updated and become an array that has more information in there such as LastCommitDate, LastMetadataUpdate, LastPullRequest etc.

The Languages should be an array not a comma separated field. It will be easier to index that way IMO

@niden these are great suggestions, thank you. I will implement your languages field suggestion - I agree.
In the interest of ease of use, I'm thinking we may want to drop the Last Updated field. Though, if we do implement it in the future, this object-based approach would make things clearer. I could see this field being more useful when we provide a JSON generator or can pull data from somewhere like GitHub.

@okamanda, I'm interested in your thoughts here. Do you see a need for Last Updated that I don't? I worry that it will fail to be updated and then become unreliable data if folks are updating it manually.

ckaran · 2016-09-30T13:27:42Z

@jbjonesjr I agree with you that sometimes there are version bumps solely for marketing and other purposes; however, nothing prevents someone from performing a pointless update to a code base solely to cause the Last Updated field to get updated [1]. That said, if we assume that people are generally honest and will not deliberately game the system, then @bondsbw is right that both have their uses. Semantic versioning will tell you how important the change is, while the Last Updated field gives you a clue about the vibrancy of the project. So, I guess I'm now voting for both fields.

[1] I'm assuming here that once a project's URL has been submitted to code.gov, then the servers can automatically look for any updated projects and update their databases accordingly. Computers are lousy at determining which changes are important ones, so this would be a trivial trick for an unscrupulous person to make it appear that their project is getting lots of updates.

rossdakin · 2016-09-30T14:08:46Z

One thought on the topology.

"Project" here seems synonymous with "repository" — I could see this being confusing when listing projects that have multiple repositories (e.g. a UI, an API, etc.).

Possible mitigations:

use "Repo description" etc. rather than "Project description"
allow multiple repositories per project
leave topology as is but add a field for "related projects"
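
As a rough illustration of the second and third mitigations, the sketch below models a project that owns several repositories and can point at related projects. The class and field names are hypothetical and only meant to show the shape of the data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Repository:
    name: str
    url: str


@dataclass
class Project:
    name: str
    description: str
    # One project may span several repositories (e.g. a UI and an API).
    repositories: List[Repository] = field(default_factory=list)
    # Names or identifiers of other inventory entries (hypothetical field).
    related_projects: List[str] = field(default_factory=list)


# Made-up example data.
portal = Project(
    name="example-portal",
    description="Example project with separate UI and API repositories",
    repositories=[
        Repository("example-portal-ui", "https://example.gov/ui.git"),
        Repository("example-portal-api", "https://example.gov/api.git"),
    ],
    related_projects=["example-harvester"],
)

print(len(portal.repositories))
```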

bandrzej · 2016-10-01T15:27:34Z

Some feedback, from my personal opinion:

codeLanguage.language
Realize this should be a multiple value field, unless you are going to specify in the description the primary code language in use as a single value.
license
This should be required to point to a README.md or license document within the code repository. This clearly defines the secure rights obtained for the source code that this OMB memo is trying to solve.
openproject vs. closed project
What about combining this into one field, and its values are "Open" or "Closed"
govwideReuseproject
Why would we be listing projects that are not gov wide re-use? Use case?
exemption
Is the plan here to list source code that has exemptions to gov-wide release? How do you plan to deal with FOIAs?

bandrzej · 2016-10-01T15:32:01Z

+1 for YML per @NoahKunin

It is assumed a developer would do this task, but it is left up to the agency how it is accomplished. It would not surprise me if some agencies task their Public Affairs or Security Offices to maintain it since it is public facing.

bandrzej · 2016-10-01T15:39:27Z

Question:

How do you plan to track government contributions to existing public OSS projects that were not started by the government?

philipashlock · 2016-10-12T23:17:59Z

Maybe it was intentional to get a fresh perspective, but it seems like the original discussions on this topic from the policy should be required reading here. See:

Seems like we may be re-hashing many of the same points. In fact, it seems like there are at least four different threads on the metadata schema topic and it's a bit confusing to follow along. Here are the threads I've identified in chronological order:

Where possible, I'd suggest trying to de-duplicate or consolidate these threads or at least update the first post on the thread to distinguish the different threads if each is meant to serve a distinct purpose.

mattbailey0 · 2016-10-18T15:39:16Z

Thanks @philipashlock. Especially with your helpful write up in place, let's consolidate the discussion here. I'm going to close out the other issues and point folks to this thread, which is the most active overall.

IanLee1521 · 2016-11-01T17:56:16Z

@mattbailey0 -- Would it make sense to start working all of these discussions into the draft guidance, rather than continuing to solely use the issue threads?

philipashlock · 2016-11-07T19:26:15Z

Current Status

What's the status of this schema? There's documentation on code.gov that seems somewhat final, but there seem to be a number of important points that haven't been addressed, this issue is still open, and the code.gov site somewhat confusingly says that both the publication of the metadata schema and implementation of the schema by agencies are due December 9th (referring to Section 7.2).

Allow for revisions

Whether or not this is final or it's possible to make some minor updates, I would suggest creating some expectations or provisions for a revision within a year or so after there's sufficient experience and feedback from those who have implemented and consumed it. We did this with the Project Open Data Metadata schema and the 1.1 update not only allowed us to address issues that had come up, but to also fully align it with the international standard established by voluntary consensus bodies (DCAT). It's understandable that there was a short timeline to establish this schema, but we don't want to create the impression that this draft will be locked in in perpetuity.

One of the ways we addressed this with the Project Open Data schema is that in the v1.1 update we required implementors to explicitly state the version of the schema at the top of the file.
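
A minimal sketch of how a harvester might use such a version declaration, assuming a top-level `version` key and a `projects` array. Both key names and the version strings are hypothetical and not taken from the published schema.

```python
import json

# Hypothetical set of schema versions this harvester understands.
SUPPORTED_VERSIONS = {"1.0", "1.1"}


def read_inventory(text):
    """Parse an inventory and refuse files with an unknown schema version."""
    doc = json.loads(text)
    version = doc.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported or missing schema version: {version!r}")
    return version, doc.get("projects", [])


version, projects = read_inventory('{"version": "1.1", "projects": []}')
print(version, len(projects))
```

Declaring the version up front lets a later revision change field names without consumers having to guess which rules apply to a given file.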

Use existing standards

While it may not seem like the development of this schema is part of a standards making process, it really is if agencies are required to follow it. OMB A-119 sets out basic requirements for the use of standards in government, specifically "this Circular directs agencies to use voluntary consensus standards in lieu of government-unique standards except where inconsistent with law or otherwise impractical." In other words, government should avoid creating government-specific standards unless it has a good reason to do so. Avoiding reinventing the wheel also meets the spirit of reuse set out in this policy. With that in mind, it would be good to review existing standards and document why or why not they are practical to use here.

A number of existing schemas and specifications have been raised in this discussion including the Asset Description Metadata Schema for Software ADMS.SW used by federated national software catalogs across Europe - which integrates much of the DCAT vocabulary used for the Project Open Data data.json schema, the civic.json schema (with various flavors that have been used or proposed by the civic tech community in the U.S. including BetaNYC, Code for America, and DC Government), the Schema.org SoftwareSourceCode and SoftwareApplication schemas which appear to be implemented by a relatively small number of websites (10 and less than 50,000 respectively), and the NIST specification for Asset Identification which I think is mostly used to describe software in an operational environment rather than as an autonomous asset ready for reuse.

The current schema appears to be largely based on the civic.json specification. The pros are that it's something that's already been developed by the community and it's relatively simple. The con is that it's not clear that it's been widely used, well documented, or even proposed consistently enough to enable interoperability.

The ADMS.SW specification seems like the most robust standard aligned with the needs of Code.gov. The pros are that it's been developed through formal voluntary consensus bodies, is thoroughly documented, aligns with the DCAT schema used for the open data policy, and is implemented in a federated way by European government bodies just as needed by U.S. federal agencies. The con is that it appears overly complex with very dense documentation. You can see a full PDF copy of the ADMS.SW spec here (copied from here) and a presentation about it here.

The Schema.org schemas are fairly simple, well documented, and developed through a voluntary consensus process. One of the biggest pros is that these are supported by the major search engines which means that they should be indexed by search engines and that's the most likely way people will find software (not on code.gov). The con is that these are not yet well adopted, at least not SoftwareSourceCode, and the search engines do not yet appear to be doing anything special to index these. However, it's totally possible to implement one of the schemas mentioned above while also implementing a schema.org schema, but you'll want to be sure there's a good mapping between the two. We did this with the Project Open Data metadata schema, but it was fairly easy because the POD schema is merely an extension of DCAT and the schema.org Dataset schema was explicitly based on DCAT. None of the major search engines were doing anything special by indexing the schema.org Dataset schema when it was first implemented on Data.gov, but Google is now working on this more and expanding the Dataset schema for the way Google wants to index things like Science Datasets and I think we can expect something similar to happen with software.

So while it seems like a fairly final decision to develop something new based on the civic.json schema, I think it's worth considering whether more could be done to leverage the work that's gone into ADMS.SW, to reuse the elements in DCAT already used by the open data policy, to align with a formal voluntary consensus standard, and to allow for interoperability with the federated European software catalog. That said, more should be done to provide a simplified profile of ADMS.SW and to better understand the pros and cons of ADMS.SW in practice. We did this with POD v1.1 and DCAT by working with W3C to make data.json a formal representation of DCAT with JSON-LD and I think we found a good compromise. When POD v1.0 was developed, it was mostly aligned with DCAT, but DCAT had not been finalized. POD v1.1 is now compatible with DCAT and a large portion of national data catalogs around the world use DCAT. The European Union uses DCAT as the basis for their federated Europe-wide data catalog.

And even where an existing specification isn't fully packaged to meet all the needs here, you can still assemble fields from existing vocabularies. This allows for field level interoperability and can ensure you reuse properties that are already well defined rather than coin new ones that are vague or inconsistent.

Feedback on fields

In the meantime, here's some feedback on specific fields (some of this reiterates or emphasizes John's comments)

agency - there are no official or consistent acronyms for government agencies in the federal government. To ensure consistency, you'll have to use a unique identifier like we did with Project Open Data. We primarily used bureauCode but GSA is also working on a more universal unique identifier system for agencies. Additionally, ideally this field would not be government specific. I would also suggest that this field be associated with each project entry rather than with the whole catalog as this will allow the metadata to be more easily mixed and aggregated across multiple sources without losing this important data.

organization - for Project Open Data we allowed folks to use the publisher field to optionally provide the context of where the office sits in the agency by indicating some level of hierarchy. I would also suggest that this field be associated with each project entry rather than with the whole catalog as this will allow the metadata to be more easily mixed and aggregated across multiple sources without losing this important data.

openSourceProject - this seems somewhat redundant with license. The policy defines open source as anything meeting the Open Source Definition and OSI has a list of licenses that meet that definition, so this field could just be derived from the license. Even if you feel the need to keep it here, I'd make it explicit that this means the code is licensed (or unlicensed) in a way that meets the OSD. It's also worth noting that OSI has not accepted CC0 as meeting the definition, but does recognize the public domain status of U.S. Government Works. This is a topic that should be discussed and debated further, but it might be worth considering whether it's better to use the usa.gov URL for U.S. public domain as defined by Project Open Data rather than assert international public domain with CC0 like we suggest for datasets. The difference in these use cases, as explained by OSI, has to do with patent rights which are relevant for software, but not data. Additionally, this field should use a boolean (true or false) not an integer since the boolean datatype is intended specifically for this purpose and is more human readable.

governmentWideReuseProject - this should be renamed so it's less government-specific, e.g. designedForReuse and it should use a boolean (true or false) not an integer since the boolean datatype is intended specifically for this purpose and is more human readable.

languages - this should make it clear that it's referring to the code language rather than human language. In ADMS.SW, DCAT, and Schema.org, language is used to refer to the human language used by the asset, whereas schema.org uses a term like programmingLanguage on their SoftwareSourceCode schema to be clear they're referring to code not content. This should also be singular, not plural regardless of whether the data type is singular or not.

exemption - I'd suggest making this more explicit like reuseExemption and using a more human readable controlled vocabulary for the exemption reasons rather than integers.
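
The field suggestions above can be sketched roughly as follows. This assumes the license is recorded as an SPDX identifier, uses only a small illustrative subset of OSI-approved licenses, and invents the exemption vocabulary purely for illustration; none of these names come from the published schema.

```python
# Illustrative subset of SPDX identifiers for licenses on the OSI-approved
# list. A real check would load the complete list published by OSI.
OSI_APPROVED = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only"}

# Hypothetical human-readable exemption vocabulary (made-up values).
EXEMPTION_REASONS = {"national-security", "privacy", "restricted-by-law"}


def is_open_source(entry):
    """Derive the open-source flag from the license instead of storing it."""
    return entry.get("license") in OSI_APPROVED


def check_flags(entry):
    """Require a real boolean and a known exemption reason, not integers."""
    errors = []
    if not isinstance(entry.get("designedForReuse"), bool):
        errors.append("designedForReuse must be true or false")
    reason = entry.get("reuseExemption")
    if reason is not None and reason not in EXEMPTION_REASONS:
        errors.append(f"unknown reuseExemption: {reason}")
    return errors


print(is_open_source({"license": "Apache-2.0"}))  # True
print(is_open_source({"license": "CC0-1.0"}))     # False with this subset
print(check_flags({"designedForReuse": 1}))       # integer flag is rejected
```

How CC0 and U.S. public domain works should be treated is a policy question this sketch does not settle; it only shows that the flag can be computed once that question is answered.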

Missing Fields

identifier - It's important to try to establish a globally unique identifier for each project since many other fields will change and it will be hard to track the entry without a unique identifier. Data.gov uses the identifier field to know when an entry has been added or removed rather than updated. This field should be globally unique using a URI to avoid collisions from different catalogs when aggregated from multiple sources. This should be a required field.

provenance or source - In the spirit of reuse, it'd be helpful to know this codebase was forked or otherwise derived from a separate upstream codebase. This could be the URI of the unique identifier or the URL of the project.
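
A minimal sketch of why a stable identifier matters for harvesting: with it, two snapshots of an inventory can be compared to tell added, removed, and updated entries apart. The `identifier` key and the example URIs are assumptions for illustration, and this is not a description of how Data.gov's harvester is implemented.

```python
def diff_harvest(previous, current):
    """Compare two lists of entries keyed by their unique identifier."""
    prev = {entry["identifier"]: entry for entry in previous}
    curr = {entry["identifier"]: entry for entry in current}
    added = sorted(curr.keys() - prev.keys())
    removed = sorted(prev.keys() - curr.keys())
    updated = sorted(i for i in curr.keys() & prev.keys() if curr[i] != prev[i])
    return added, removed, updated


# Made-up snapshots: project "a" is renamed, "b" is dropped, "c" is new.
old = [
    {"identifier": "https://example.gov/id/a", "name": "Alpha"},
    {"identifier": "https://example.gov/id/b", "name": "Beta"},
]
new = [
    {"identifier": "https://example.gov/id/a", "name": "Alpha v2"},
    {"identifier": "https://example.gov/id/c", "name": "Gamma"},
]

print(diff_harvest(old, new))
```

Without the identifier, the renamed entry would look like one removal plus one addition.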

Serialization Format

I recommend JSON for many of the reasons others have stated. It has worked for Project Open Data data.json and we have built out the infrastructure to validate and harvest in this format. JSON-LD is also now the format recommended by Google for schema.org schemas and other structured data on webpages. Some have suggested YAML as an alternate because it's more human readable and easy for folks to edit, but this also means it's more likely to result in poor or inconsistent data quality for any data structure with even moderate complexity. With the initial implementation of the Project Open Data data.json schema, many folks attempted to maintain their JSON metadata by hand, but this resulted in the majority of the problems we encountered with regard to harvesting and interoperability. I would strongly suggest that we do not rely on a structured data format that is edited by hand, but agencies are free to allow for this upstream as long as they validate it when compiling their aggregate copy. It's worth noting that JSON is actually a subset of YAML, so agencies could allow either YAML or JSON from individual offices if they're using a YAML parser, but they'll still have to validate it against the final JSON schema requirements and provide a comprehensive JSON version.
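
As a rough sketch of the "validate before publishing" step, the function below parses an inventory and reports problems instead of publishing a broken file. It uses only the Python standard library; the required field names and the top-level `projects` key are assumptions, and a real pipeline would more likely validate against the published JSON schema.

```python
import json

# Hypothetical required fields and their expected JSON types.
REQUIRED_FIELDS = {"name": str, "description": str, "license": str}


def validate_inventory(text):
    """Return a list of human-readable problems; empty means it passed."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}: {exc.msg}"]
    if not isinstance(doc, dict) or not isinstance(doc.get("projects"), list):
        return ["top level must be an object with a 'projects' array"]
    errors = []
    for index, entry in enumerate(doc["projects"]):
        if not isinstance(entry, dict):
            errors.append(f"projects[{index}]: must be an object")
            continue
        for name, expected in REQUIRED_FIELDS.items():
            if name not in entry:
                errors.append(f"projects[{index}]: missing {name}")
            elif not isinstance(entry[name], expected):
                errors.append(f"projects[{index}]: {name} must be {expected.__name__}")
    return errors


# A trailing comma is a typical hand-editing mistake that strict JSON rejects.
print(validate_inventory('{"projects": [{"name": "x",}]}'))
print(validate_inventory('{"projects": [{"name": "x", "description": "d"}]}'))
```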

philipashlock · 2016-11-08T00:24:07Z

I've attempted an initial mapping between code.json and ADMS.SW. Note that ADMS.SW follows the same conceptual model as DCAT used for Project Open Data data.json:

To ensure that the Data Catalog Vocabulary (DCAT), the Asset Description Metadata Schema (ADMS), and the Asset Description Metadata Schema for Software (ADMS.SW) are seeded on the same structure, the RADion vocabulary was created [RADion]. RADion is shorthand for Repository, Asset, and Distribution – the three structural elements that RADion abstracts from.

In ADMS.SW, the concepts Software Repository, Software Release and Software Package are defined as specialisations of the more general concepts Repository, Asset and Distribution specified by RADion

To clarify these relationships, I created a visual diagram similar to the Schema Object Model Diagram provided for the Project Open Data version of DCAT, but this diagram includes all the fields provided by ADMS.SW rather than pared down to just the required, optional, and extended fields as is the case with the POD diagram.

The property mapping and descriptions here are based on the full ADMS.SW documentation PDF and the HTML version of the RDF schema. I would refer to those documents for full property definitions. Also note that some of the properties here are synonymous with those in DCAT even if they use a different property name or namespace.

Software Repository

A Software Repository is a system or service that provides facilities for storage and maintenance of descriptions of Software Projects, Software Releases and Software Packages, and functionality that allows users to search and access these descriptions. A Software Repository will typically contain descriptions of several Software Projects, Software Releases and related Software Packages.

An example of a Software Repository is the Apache Software Foundation Project Catalogue

ADMS.SW Property	ADMS.SW Label	Namespace:Property	code.json Property
accessURL	Access URL	adms:accessURL
created	Date of Creation	dcterms:created
modified	Date of Last Modification	dcterms:modified
description	Description	dcterms:description
label	Name	rdfs:label
supportedSchema	Supported Schema	adms:supportedSchema
hasPart	Includes	dcterms:hasPart
publisher	Publisher	dcterms:publisher	agency or organization
spatial	Spatial Coverage	dcterms:spatial
themeTaxonomy	Theme Taxonomy	rad:themeTaxonomy

Software Project

A Software Project is a time-delimited undertaking with the objective to produce one or more software releases, materialised as software packages. Some projects are long-running undertakings, and do not have a clear time-delimited nature or project organisation. In this case, the term ‘software project’ can be interpreted as the result of the work: a collection of related software releases that serve a common purpose.

An example of a Software Project is the Apache HTTP Server Project

ADMS.SW Property	ADMS.SW Label	Namespace:Property	code.json Property
description	Description	doap:description	project.description
homepage	Homepage	doap:homepage	project.homepage
keyword	Keyword	rad:keyword	project.tags
name	Name	doap:name	project.name
release	Release	doap:release
contributor	Contributor	schema:contributor	project.partners
fundedBy	Funded By	admssw:fundedBy	project.partners
forkOf	Fork Of	admssw:forkOf
developer	Developer	doap:developer	project.partners
documenter	Documenter	doap:documenter	project.partners
maintainer	Maintainer	doap:maintainer	project.contact
helper	Helper	doap:helper	project.partners
tester	Tester	doap:tester	project.partners
translator	Translator	doap:translator	project.partners
metrics	Metrics	admssw:metrics
theme	Theme	rad:theme
intendedAudience	Intended Audience	admssw:intendedAudience	project.governmentWideReuseProject
locale	Locale	admssw:locale
userInterfaceType	User Interface Type	admssw:userInterfaceType
programmingLanguage	Programming Language	admssw:programmingLanguage	project.languages
isPartOf	Repository Origin	dcterms:isPartOf	project.repository
operatingSystem	Operating System	schema:operatingSystem
supportsFormat	Supports Format	admssw:supportsFormat
status	Status	admssw:status	project.status

Software Release

A Software Release is an abstract entity that reflects the intellectual content of the software at a particular point in time and represents those characteristics of the software that are independent of its physical embodiment. This abstract entity corresponds to the FRBR entity expression (the intellectual or artistic realization of a work). A release is typically associated with a version number.

An example of a Software Release is the Apache HTTP Server 2.2.22 (httpd) release.

ADMS.SW Property	ADMS.SW Label	Namespace:Property	code.json Property
alternative	Alternative Name	dcterms:alternative
created	Date of Creation	dcterms:created
modified	Date of Last Modification	dcterms:modified	project.updated.sourceCodeLastModified
description	Description	dcterms:description
identifier	Identifier	admssw:identifier
keyword	Keyword	rad:keyword
metadataDate	Metadata Date	adms:metadataDate	project.updated.metadataLastUpdated
name	Label	rdfs:label
revision	Version	doap:revision
releaseNotes	Version Notes	schema:releaseNotes
assessment	Assessment	admssw:assessment
contactPoint	Contact Point	adms:contactPoint	project.contact
includedAsset	Included Asset	admssw:includedAsset
metrics	Metrics	admssw:metrics
language	Language	dcterms:language
logo	Logo	foaf:logo
describedBy	Main Documentation	wdrs:describedby
metadataLanguage	Metadata Language	adms:metadataLanguage
last	Current Version	xhv:last
next	Next Version	xhv:next
prev	Previous Version	xhv:prev
project	Project	admssw:project
publisher	Publisher	dcterms:publisher
relation	Related Asset	dcterms:relation
relatedWebPage	Related Web Page	adms:relatedWebPage
package	Package	admssw:package
isPartOf	Repository Origin	dcterms:isPartOf	project.repository
spatial	spatial coverage	dcterms:spatial
status	Status	admssw:status
theme	Theme	rad:theme
usedBy	Used By	admssw:usedBy

Software Package

A Software Package represents a particular physical embodiment of a Software Release, which is an example of the FRBR entity manifestation (the physical embodiment of an expression of a work). A Software Package is typically a downloadable computer file (but in principle it could also be a paper document) that implements the intellectual content of a Software Release. A particular Software Package is associated with one and only one Software Release, while all Packages of an Asset share the same intellectual content in different physical formats.

An example of a Software Package is httpd-2.2.22.tar.gz, which represents the Unix Source of the Apache HTTP Server 2.2.22 (httpd) software release.

Software often has at least two kinds of physical embodiments: a source code package and a binary package. Binary packages are sometimes compiled for different operating systems or are released under different licences, e.g. in case of dual licensing. Also scripting languages need some sort of packaging for installation systems used by end users.

ADMS.SW Property	ADMS.SW Label	Namespace:Property	code.json Property
created	Date of creation	dcterms:created
modified	Date of last modification	dcterms:modified
description	Description	dcterms:description
label	Name	rdfs:label
software_id	Software_id	swid:software_id
tagURL	Tag URL	admssw:tagURL
fileSize	File size	schema:fileSize
checksum	Checksum	spdx:checksum
format	Format	dcterms:format
license	License	dcterms:license	project.license
downloadUrl	Download URL	schema:downloadUrl	project.downloadURL
release	Release	admssw:release
publisher	Publisher	dcterms:publisher
status	Status	admssw:status
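
To show how a mapping like the one in the Software Project table could be used, here is a minimal sketch that renames code.json project keys to the ADMS.SW properties listed there. Only the one-to-one rows from that table are included; the function name and the sample entry are made up, and rows with ambiguous mappings (such as project.partners) are deliberately left out.

```python
# One-to-one rows from the Software Project table above.
CODE_JSON_TO_ADMS_SW = {
    "description": "doap:description",
    "homepage": "doap:homepage",
    "tags": "rad:keyword",
    "name": "doap:name",
    "languages": "admssw:programmingLanguage",
    "repository": "dcterms:isPartOf",
    "status": "admssw:status",
}


def to_adms_sw(project):
    """Return (mapped properties, sorted list of keys with no mapping)."""
    mapped = {}
    unmapped = []
    for key, value in project.items():
        target = CODE_JSON_TO_ADMS_SW.get(key)
        if target is None:
            unmapped.append(key)
        else:
            mapped[target] = value
    return mapped, sorted(unmapped)


# Hypothetical code.json project entry.
entry = {"name": "example", "tags": ["open-data"], "partners": []}
print(to_adms_sw(entry))
```

Listing the unmapped keys makes the gaps between the two vocabularies visible instead of silently dropping data.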

ckaran · 2016-12-21T17:58:27Z

As @philipashlock noted on November 7, CC0 is not considered Open Source by the OSI. Appendix A defines Open Source Software as:

Open Source Software (OSS): Software that can be accessed, used, modified, and shared by anyone. OSS is often distributed under licenses that comply with the definition of “Open Source” provided by the Open Source Initiative (https://opensource.org/osd) and/or that meet the definition of “Free Software” provided by the Free Software Foundation (https://www.gnu.org/philosophy/free-sw.html).

The first part suggests to me that anything under the CC0 license is considered to be Open Source, but the second part suggests that it isn't. What is the official consensus? How should we mark a project's openSourceProject key in their code.json file if they are using the CC0 license?

Part of my concern is how automated tools will handle the openSourceProject key for metrics purposes; if CC0 is not considered to be Open Source, then quite a few agencies will not be able to meet the 20% requirement, even though they are putting their code out there for others to use.

ctubbsii · 2016-12-21T18:24:06Z

@ckaran CC0 is addressed in OSI's FAQs. No decision was made by OSI whether it meets their definition of "Open Source". However, it would be useful to know what definition of "Open Source" to use when completing the openSourceProject field. Does code.gov offer a definition for the purpose of this field? Personally, I'd prefer deferring to OSI's definition, but if OSI can't reach a decision on CC0, then their definition is insufficient.

My personal recommendation is to avoid "traps" like CC0, where it's "open" with respect to copyright, but patent use rights are explicitly not conferred. MIT and BSD avoid the question entirely (not explicitly conferred), and GPL tends to impose restrictions on consumers that I don't think the government should be in the business of imposing, so I prefer ASL 2.0, myself, for government-released open source projects. (ASL 2.0 also provides a convention to use a NOTICE file for copyright notices, separate from the license, where it would be appropriate to add a brief text noting the license is not applicable domestically for the portions of code produced exclusively by government employees on behalf of the U.S. government.)

jbjonesjr · 2016-12-21T18:31:35Z

cc/ @benbalter if you have specific thoughts to share:

http://ben.balter.com/2014/10/08/open-source-licensing-for-government-attorneys/

ckaran · 2016-12-21T18:53:38Z

@ctubbsii You're right about the problems of patents, etc. with regards to CC0. The lab I work for has been working to avoid the problem by requiring all external contributors sign a contributor license agreement (CLA) before their contributions will be included in any of the lab's projects (you can read the policy here). The lab's lawyers believe that will solve the issues directly related to patents and other IP rights.

Note that the policy was adopted by the lab on 19 Dec 2016, but it already has one issue; we can't currently post our CLA, nor can we accept CLAs at the current time as (by design) an executed CLA will contain what can be argued to be personally identifiable information (PII). The lawyers I've talked to tell me that means the lab must obey the Privacy Act, which requires some more work. So, if you read the policy and expect that we'll be able to start accepting contributions immediately, I'm sorry to say that we can't.

ctubbsii · 2016-12-21T21:41:07Z

@ckaran Another great thing about ASL 2.0, it contains an embedded CLA, and defines "Contributor" and "Contributions". No need for a separate CLA 😄

ckaran · 2017-01-03T13:14:27Z

@ctubbsii Honestly, if we could, I would recommend the standard OSI-approved licenses, including the ASL 2.0 for exactly that reason. Unfortunately, most of the work produced by my lab doesn't have copyright attached, which means that copyright-based licenses may fail in court.

ctubbsii · 2017-01-04T18:34:22Z

@ckaran Not sure what you mean by "doesn't have copyright attached". My guess is that you mean "public domain" or you simply mean that nobody is interested in asserting copyright. If it's the former, it probably only applies domestically. The creators still may own copyright internationally, so a license is still worth recommending. If you mean the latter, well, omission of a copyright notice does not disclaim copyrights.

If a work isn't covered by copyright (because it's public domain, for instance), an infringement claim would certainly fail in court... but I'm not sure why that matters. That only matters if the creators intend to enforce/assert their copyright claims in the face of a particular infringement, via a lawsuit. If you know the work is public domain in the jurisdiction where the violation occurred... simply don't pursue it with a lawsuit in that case... it's really as simple as that.

The license still communicates the limitations of the rights granted in jurisdictions where copyright is applicable (who cares if it's void in jurisdictions where it's not applicable?) and communicates a minimum set of rights guaranteed to everybody else. This instills confidence in the project's users, allowing them to use it according to the license conditions without fear of reprisal. Often (as in the case of ASL 2.0), it also explicitly conveys the rights the project expects contributors to grant, in order for the contributions to be accepted into the project (in other cases, this might be implicit). This is valuable to a project, even if some portions of the project are not subject to copyright protections (public domain).

ckaran · 2017-01-04T19:54:59Z

@ctubbsii Sorry, I've been talking with our legal counsel for too long. Yes, I mean works that are in the public domain. I've talked with the appropriate people in the Justice Department to see if US Government works have copyright outside of the US. They told me that the US Government's position is that it does, but the lawyer I spoke with wasn't able to find any case law to back that up. What's more, it would have to be litigated in the courts of each nation individually, so there isn't a single 'right' answer.

As for why all this is important, it comes down to severability and warranty/liability. Assume that some Government work is licensed under the Apache License 2.0, which is a license that depends on copyright. Someone can sue the Government claiming that the clauses that depend on copyright are void, and (because there is no severability clause), so are all the other clauses. If a court agrees that the license as a whole is void simply because the US Government doesn't have copyright within the US, then that includes the clauses regarding warranty and liability, which means that the Government might be on the hook for damages in some manner[1]. Moreover, downstream users/projects may also have problems[1].

For works that have copyright and are contributed to the Government, I think that the Government would be OK with any of the standard OSI-approved licenses. However, work that is created by Government employees might be in the public domain, so then you have a weird mix of stuff that is protected by the license, and stuff that might not be[1]. Will this cause an issue? I don't know, but I'm not interested in finding out.

[1] I'm not a lawyer, this is not legal advice, and as far as I know, this has not yet been litigated in a court.

ctubbsii · 2017-01-05T00:17:47Z

@ckaran Oh, I see. Perhaps code.gov should fork ASL 2.0 (which is permitted) and add a severability clause. (Note: I'm currently promoting a discussion on the Apache Mailing Lists about adding this in some future version of the license, perhaps 2.1).

ckaran · 2017-01-06T14:09:30Z

@ctubbsii I've thought about forking it, but that could also start to fork Open Source (there will be questions about which licenses are compatible with other licenses, which could be problematic; @massonpj, is this a good assessment?)

@ctubbsii I've seen your discussions on the ASL lists; I think that is the best way to go. Not only could everyone (Government and private) use the same license, it would also mean that the license is OSI-approved, which the forked license may not be. The reason this is important is because some journals will only accept code that is under an OSI-approved license; JOSS is one of them. See the discussion here for some of the issues.

Basically, what I want are modifications to the standard Open Source licenses that ensure that works that don't have copyright attached have all the following:

As many of the protections that the various licenses give as possible, for both code that has copyright, and code that is in the public domain. At a minimum, this has to include warranty, liability, and IP protections[1].
Protects anyone that uses the code or includes it in their own works.
Is fully inter-operable with the standard licenses (forked licenses might not be; updated licenses will by definition be).

[1] Public domain code by definition doesn't have copyright protections, but in a mixed work that has some copyrighted material and some public domain material, the copyrighted material should not be effectively reduced to being public domain; if that was what the authors had intended, then they would have put it in the public domain. That means that license has to be inherently flexible enough to handle this case. IP protections means that public domain work doesn't get hammered by patent headaches from contributions.

massonpj · 2017-01-06T17:56:06Z

@ckaran

is this a good assessment

Yes.

@ctubbsii, while anyone can create their own license, the OSI's License Review Process, "ensures that licenses and software labeled as 'open source' conforms to existing community norms and expectations." Simply creating a new license and labeling it an "open source software license" is not good.

ctubbsii · 2017-01-06T22:33:50Z

@massonpj Obviously, any new license should be approved by both FSF and OSI. The biggest issue I think OSI would have is seeing it as "duplicative" if it's too similar.

mattbailey0 added the [Status] Help Wanted label Sep 7, 2016

mattbailey0 added this to the 90 Days milestone Sep 13, 2016

This was referenced Oct 18, 2016

Required content: Metadata schema to help agencies fill out enterprise code inventory (7.2) #30

Closed

[Request for discussion] How agencies should inventory their software WhiteHouse/source-code-policy#116

Closed

mattbailey0 added [Category] Schema/code.json and removed [Status] Help Wanted labels Oct 18, 2016

IanLee1521 mentioned this issue Dec 1, 2016

Final Schema Definition? #196

Closed

okamanda mentioned this issue Jun 15, 2017

Updating the Code.gov Metadata Schema [Due by 7/14] #250

Closed

froi self-assigned this Feb 8, 2018

ToniBonittoGSA mentioned this issue Aug 18, 2019

Blog post - needs featured, social image GSA/digitalgov.gov#1235

Closed
