Adjust the Popularity measure in the risk index #21

Open
skhakimov opened this issue Jul 21, 2015 · 4 comments

Comments

@skhakimov
Contributor

Currently a package receives a point if it is in the top 90% of the packages analyzed, which makes this a relative measure. Consider making it absolute by adjusting the measure to the top 5% of ALL Debian packages, based on [1]. With more than 140K packages tracked by the popularity contest, it is more sensible to reduce this measure to a much smaller percentage. Even 1% (~1400 packages) would be a reasonable threshold. Thanks.

[1] http://popcon.debian.org/

@david-a-wheeler
Contributor

I like this idea of emphasizing the top 5% of ALL Debian packages. Is 5% the right value, or should it be 1% or 2%? Perhaps we could give a package 2 points if it's in the top 1%, and 1 point if it falls in the 2-5% popularity band of all packages; that would provide a little gradation.
Issue #5 notes that we could add other popularity information sources; if that's done, we might need to revisit.

@david-a-wheeler
Contributor

Sam and I looked at the Debian popularity values in more detail. We think that giving additional scores at the 5% and 1% level would be justifiable (2 points if within the top 1% of popularity, 1 point if within the top 5% of popularity but not the top 1%). Here's why.

Looking at the popularity graph, the "knee" in the curve - which we'll define as the place where the absolute value of the slope of the curve is one - is at about package 5000 (out of 146754 packages). That means that the curve switches to a slope of less than one at about 3.4% into the set. Since this is only a sample set, it makes sense to use a slightly broader definition, so I suggest that we cut off popularity at about 5% (since that would clearly include the 3.4% transition location), which would cut it off at package number 7338.
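
For illustration, here is a minimal sketch of the slope-based knee definition above. It assumes the install counts are already sorted in descending order and treats the slope as the change in count per rank; the thread does not say how the slope was actually measured, so the function name `knee_rank` and the `window` smoothing parameter are hypothetical.

```python
def knee_rank(install_counts, window=1):
    """Return the 1-based rank where |slope| first drops below one.

    install_counts: install counts sorted from most to least popular.
    window: number of ranks spanned when estimating the slope
            (an assumption of this sketch, useful for noisy data).
    Returns None if the slope never drops below one.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    for i in range(len(install_counts) - window):
        # Average change in install count per rank across the window.
        slope = (install_counts[i + window] - install_counts[i]) / window
        if abs(slope) < 1:
            return i + 1
    return None


# Synthetic example (not popcon data): y = 10000 / rank.
synthetic = [10000 / rank for rank in range(1, 1001)]
knee = knee_rank(synthetic)
```

On real data a larger `window` may be needed, because adjacent packages can have nearly equal install counts well before the curve flattens overall.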

We then re-examined these top 5% values, and there's another transition within that set at about 1% of the total number of packages. I.e., the top 1% of all packages are ESPECIALLY popular.

Obviously the number of packages and their popularity change over time; we want to use reasonable cutoffs that are a little less sensitive to exact values. Cutoffs of 5% and 1% are fairly common, and seem justified by the data set.
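
A minimal sketch of the two-tier scoring described above, assuming each package has a 1-based popularity rank (rank 1 is the most installed). The function name `popularity_points` and its parameters are hypothetical and are not taken from the project's code; rounding the cutoffs up is also an assumption, chosen because it reproduces the 7338 figure quoted above.

```python
def popularity_points(rank, total_packages, top_pct=1, second_pct=5):
    """Return 2 for the top 1% of packages, 1 for the rest of the top 5%, else 0."""
    if not 1 <= rank <= total_packages:
        raise ValueError("rank must be between 1 and total_packages")
    # Integer ceiling division avoids floating-point rounding surprises.
    top_cutoff = -(-total_packages * top_pct // 100)
    second_cutoff = -(-total_packages * second_pct // 100)
    if rank <= top_cutoff:
        return 2
    if rank <= second_cutoff:
        return 1
    return 0


# With the 146754 packages mentioned above, the cutoffs fall at
# ranks 1468 (1%) and 7338 (5%).
TOTAL = 146754
points_at_knee = popularity_points(5000, TOTAL)
```

Because the cutoffs are computed from the package total, they move with the data set instead of being fixed rank numbers.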

@david-a-wheeler
Contributor

BTW, there seems to be no universal definition of a "knee" in a graph. More complex systems for defining and finding knees in curves (compared to what we used) can be found here:

  • "Finding a 'Kneedle' in a Haystack: Detecting Knee Points in System Behavior" by Satopaa et al. http://www1.icsi.berkeley.edu/~barath/papers/kneedle-simplex11.pdf
  • "Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms" by Stan Salvador and Philip Chan, IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004)

These involve finding the maximum of the curvature, which for a continuous function is:
curve(x) = y'' / ((1 + (y')^2)^1.5) where y' is the first derivative and y'' is the second derivative of y=f(x). Mathematical detail at https://en.wikipedia.org/wiki/Curvature and http://mathworld.wolfram.com/Curvature.html (among other places).

I don't think we need to dig into these more complex systems for our purposes.
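
For reference only, the curvature formula above can be approximated on sampled data with central finite differences. This is a rough sketch under the assumption of evenly spaced samples; it is not the Kneedle algorithm or the Salvador and Chan method, and the function names are hypothetical.

```python
def curvature(ys, h=1.0):
    """Approximate y'' / (1 + (y')^2)^1.5 at each interior sample.

    ys: samples of y = f(x) taken at evenly spaced x values.
    h: spacing between consecutive x values.
    """
    result = []
    for i in range(1, len(ys) - 1):
        d1 = (ys[i + 1] - ys[i - 1]) / (2 * h)            # first derivative
        d2 = (ys[i + 1] - 2 * ys[i] + ys[i - 1]) / (h * h)  # second derivative
        result.append(d2 / (1 + d1 * d1) ** 1.5)
    return result


def max_curvature_index(ys, h=1.0):
    """Return the index into ys where |curvature| is largest."""
    ks = curvature(ys, h)
    best = max(range(len(ks)), key=lambda i: abs(ks[i]))
    return best + 1  # curvature() skips the first sample


# y = x^2 sampled at x = -2..2; the sharpest bend is at the vertex.
parabola = [4, 1, 0, 1, 4]
vertex = max_curvature_index(parabola)
```

Note that curvature depends on how the two axes are scaled, so the location of the maximum shifts if ranks or install counts are rescaled.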

@skhakimov
Contributor Author

Popularity chart by installations for all Debian packages:
popularity_chart

Popularity chart of the top 5% of Debian packages:
popularity5_chart

Data was obtained from: http://popcon.debian.org/by_inst

skhakimov added a commit that referenced this issue Aug 7, 2015
skhakimov added a commit that referenced this issue Aug 7, 2015
skhakimov added a commit that referenced this issue Aug 12, 2015