
Consider New York City (NYC) 2016 brainstorming ideas #42

Open
david-a-wheeler opened this issue Aug 18, 2016 · 0 comments

The following are notes from a brainstorming session in NYC in 2016. This needs some cleanup & further analysis; the goal here is to capture the ideas.

morning session -- how to measure and prioritize project investment

Looking at what can be measured.

How to measure what exists. How would you understand what needs to exist and doesn’t yet?

Census -- CII
Grabbing data from OpenHub, Black Duck, Debian et al
Applying an algorithm to weigh that based on “risk”?

Q: What type of data in the Census?
Debian measure of “popularity”? I.e., users who allow tracking of use report back which packages they install…
OpenHub data -- commits, # of developers, etc. Scrapes GitHub, SourceForge, etc.
Whether it has a website
CVEs…
# of contributors (a problem if 0 in the last year)
Popularity (as above)
Is it in C or C++? More vulnerable…
Network exposure

Questions about how to interpret the data... one developer may mean an atrophied project; 100 may mean a disorganized project. Non-linear…
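The weighting question above, including the non-linear contributor signal, can be sketched as a toy scoring function. All weights, field names, and example numbers below are hypothetical illustrations -- this is not the actual Census algorithm.

```python
# Illustrative risk-weighting sketch (hypothetical weights and fields,
# not the real CII Census algorithm).

def contributor_risk(n_contributors):
    """Non-linear, per the discussion: 0 recent contributors is a red flag,
    a tiny team is risky, and very large teams may signal disorganization."""
    if n_contributors == 0:
        return 5
    if n_contributors <= 3:
        return 3
    if n_contributors >= 100:
        return 2
    return 1

def risk_score(project):
    score = contributor_risk(project["contributors_last_year"])
    score += min(project["cve_count"], 5)                      # cap so CVEs don't dominate
    score += 2 if project["language"] in ("C", "C++") else 0   # memory-unsafe language
    score += 3 if project["network_exposed"] else 0            # attack surface
    score += 1 if not project["has_website"] else 0            # weak project presence
    return score

# Made-up example project (not real data for any package):
example = {
    "contributors_last_year": 2, "cve_count": 5,
    "language": "C", "network_exposed": True, "has_website": True,
}
print(risk_score(example))  # 13
```

The caps and thresholds are the interesting design choice: without them, one noisy signal (say, a flood of low-severity CVEs) would drown out everything else.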

Problem -- much won’t be in there. Reproducible builds would never be in there -- it’s not an installed package. Nor would a decision like which browser needs to be rewritten.

Questions on how to expose link between application software and system packages…

Another dimension of measurement deserves to be called out: the class of UX bugs. You can’t just scrape a bug database to understand where these exist. Making assumptions and automating without “thick data” can lead to really bad decisions.

How do we measure importance in different ways, how to measure dependencies, use, etc.

Libraries.io -- measures software interdependencies, chains of dependencies…
Problems being solved: discoverability; maintainability (a problem: there’s a suite of software that someone can only maintain in spare time, and it needs to stay up to date); sustainability.
Might be useful to have as a map of what to work on next
Could be useful to help developers choose what to use (or stay away from!)
Discovering what to use, and what to fix, and what to replace, etc..
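The Libraries.io-style dependency-chain idea can be sketched with a toy dependency map: invert the edges, then count how many packages reach each one transitively. All package names below are made up; this is a sketch of the counting idea, not of any real service.

```python
# Sketch: reverse-dependency counting over a toy dependency map
# (package -> direct dependencies). All package names are hypothetical.
from collections import defaultdict

deps = {
    "web-framework": ["http-lib", "template-lib"],
    "http-lib": ["tls-lib"],
    "template-lib": [],
    "tls-lib": ["asn1-lib"],
    "asn1-lib": [],
}

def transitive_dependents(deps):
    """For each package, count every package that reaches it through any
    chain of dependencies -- a proxy for 'how much depends on this'."""
    # invert the edges: dependency -> packages that use it directly
    rdeps = defaultdict(set)
    for pkg, ds in deps.items():
        for d in ds:
            rdeps[d].add(pkg)
    counts = {}
    for pkg in deps:
        seen, stack = set(), [pkg]
        while stack:
            for user in rdeps[stack.pop()]:
                if user not in seen:
                    seen.add(user)
                    stack.append(user)
        counts[pkg] = len(seen)
    return counts

counts = transitive_dependents(deps)
print(counts["asn1-lib"])  # 3
```

Note how the deepest, most obscure library ends up with the highest count -- exactly the "poorly supported dependency under a cluster of applications" situation described above.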

Important to do both “big public works” projects, and funding incremental maintenance and small projects.

What can data illuminate?
As we make decisions on what to recommend -- you might look at a map showing that a cluster of higher-level applications centers on some dependency that’s poorly supported. So, looking at it as a forecast as well.

Also want to be able to track improvement and security. The Census wants to run regularly and plot a timeline -- how is this doing?

Need a time-indexed record where analysis of changes over time can be done.
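A minimal sketch of such a time-indexed record: append one row per census run, then diff metrics between runs. The field names and numbers are hypothetical, chosen only to show the shape of the data.

```python
# Minimal sketch of a time-indexed census record: one row per run,
# then compute deltas between runs. All data below is made up.
import csv
import io

SNAPSHOTS = """date,package,contributors,cve_count
2015-07-01,libfoo,1,2
2016-07-01,libfoo,4,1
"""

rows = list(csv.DictReader(io.StringIO(SNAPSHOTS)))
by_date = {r["date"]: r for r in rows if r["package"] == "libfoo"}
old, new = by_date["2015-07-01"], by_date["2016-07-01"]
delta = int(new["contributors"]) - int(old["contributors"])
print(f"libfoo contributors: {delta:+d} year over year")  # prints "libfoo contributors: +3 year over year"
```

Even this trivial structure gives the stable baseline the notes ask for: as long as each run appends rather than overwrites, "how is this doing?" becomes a query rather than guesswork.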

One thing we don’t currently track: the domains where software is used -- whether a game, or critical infrastructure. Not just “is a dependency.”
How to automate? Could exclude packages installed automatically via “Recommends.”

Metric: “is this fundamentally a security component of systems?” Hard to do, because libraries that aren’t technically “security” libraries, can also be issues.
Can look at how often they’re a dependency for security-critical functions. And, how exposed are they.

Not always obvious whether something is connected to the network. Idea that we need a lot of automation to look at this -- source analysis, simple table analysis…

How to make this type of data generally augmentable? How to allow many eyes, etc.?

Can you have the developers help augment?

Can you have small grants to incent developers to analyze and create their own weights, views, analyses of the data?

There’s a huge amount of knowledge and intuition that those who’ve worked with these packages will have, and could add. Engaging them in helping to augment, crowdsource this info.

There are also many projects on GitHub that aren’t packaged anywhere. They don’t show up in dependency trees, but have millions of users.

Starting to see a shift in the way people consume open source -- from packaged to direct pull… People used to consume open source by buying Red Hat. Now they do it because some guy on the dev team was like, “need a library to do X, so pull it down.” Many ways to bypass package managers. GitHub and Docker are very invested in this data.

GitHub could start specifying dependencies of GitHub projects on other GitHub projects… But, this has been solved in part. MBM can solve this, configure scripts can pull dependencies in…
What’s the incentive to put dependency info outside Docker image (etc.)? No incentive to make it available outside. Skepticism re. Docker among developers -- “surprise! You’re running all these fucking dependencies that you never knew!”

What are the classes of funding priorities and directions that people might use these for?

How do we use this to track ecosystem change over time? Very difficult to sustain funding without showing movement. Need a stable baseline to show movement.

Need historical view -- that’s what we’re creating here. Can show rise and fall of different popularity. Can’t measure security, but can integrate metrics that give us a sense of key metrics…

Another metric that could be interesting -- a variety of AI based tools that can assess “code quality.” Could run this over key projects…

Could do density measures -- # of warnings per line of code…
Could be done as an external data source -- something that builds stuff and assesses across different metrics.
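The density idea is simple enough to sketch directly; normalizing per 1,000 lines (an assumption here, just for readable numbers) makes projects of different sizes comparable. The example inputs are illustrative.

```python
# Sketch of a warning-density metric: warnings per 1,000 lines of code,
# so projects of different sizes are comparable. Inputs are illustrative.

def warning_density(warning_count, lines_of_code):
    """Warnings per KLOC; guards against empty projects."""
    if lines_of_code == 0:
        return 0.0
    return 1000.0 * warning_count / lines_of_code

# e.g., a compiler run that emitted 45 warnings over a 30,000-line project
print(warning_density(45, 30_000))  # 1.5
```

As the notes suggest, the warning counts themselves would come from an external build-and-analyze pipeline; this function is only the normalization step.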

Meta question: what infrastructure is this built on?

Could use NORAD-like system (test infrastructure) to analyze sources, etc.?

Inability to build and test is a data point on its own…

How to get this data? Incent to give us data? Insidious incentives? Infantilize people?
Get to the people who run the distros, have them demand it of people submitting…
Can’t just demand, social response will be shitty. Maybe before the next Debian conference...pay $50/$100 per data-entry, and donate this to conference for travel, etc.. “For every one of these that gets done, we’ll give $50K to the Debian travel fund.”

Could certify (“good software”) via best practice badges…

Lesson learned by EFF re. the Secure Messaging Tools Scorecard -- a good effort, but a lot of criticism. People were reading it as a nutrition label -- consumer reports...

Ways to prioritize projects (census work/metrics) - raw brainstormed results, NYC 2016

Blue
(A1) Code quality: (Measure with) compiler warnings, etc.
How else can we measure importance/popularity? E.G., if a popular program (like Skype) uses library X, it’s popular.
Follow dependencies to determine popularity (use system & language package managers)
CVEs - but what does it mean?
Examine code: Look for vulnerabilities (Coverity scan, etc.), look at “quality” metrics (lint, rubocop, etc.)
Classification/typing of programs. E.G., CII for networks, gaming, etc. Some package managers include this information
Does it build?
Demonstrating results / ecosystem change
Interdependency without package management (indirect)
Properties of CII/crypto primitives
Separate dependencies - if it’s depended on by 3 packages, it’s more important than if it’s pulled in once (expat)
Ubuntu popcon
Dependency compilation
Crowd sourcing by email? ($ donated to Debian travel fund?)

Green
Debian popcon
Open Hub (GitHub, SourceForge, …)
# of committers (& patterns), # of commits (& patterns), license, readme, contributors

Bug reports: Types, reactivity, context
Cron dis

Red:
Application level - how does it relate?
Initiatives
UX quality - what is the experience? Usability is human.
Exposure (to attack?)
Indie projects - direct pull
What are we running this all on?

Light green:
Community? (Is there one?)
Dev’s what
Tracking
What to recommend
