# The DAG of Julia packages

## Problem statement

In this tutorial, we show how LightGraphs (LG), in conjunction with other utility packages, can be used to extract the most recent directed acyclic graph (DAG) of the Julia package ecosystem. This information can be used for interactive data visualization with [D3](https://d3js.org/), as in the following link:

- **The DAG of Julia packages:** https://juliohm.github.io/science/DAG-of-Julia-packages

All the packages used in this notebook can be installed with:

In [1]:
for dep in ["JSON","GitHub","LightGraphs","ProgressMeter"]
    Pkg.add(dep)
end

[1m[34mINFO: Nothing to be done
[0m[1m[34mINFO: Nothing to be done
[0m[1m[34mINFO: Nothing to be done
[0m[1m[34mINFO: Nothing to be done
[0m

To query information from GitHub without being mistaken for a malicious robot, you need to [create a personal token](https://github.com/settings/tokens) in your GitHub settings. Since this token is private, we ask you to save it as an environment variable in your operating system (e.g. set `GITHUB_AUTH` in your `.bashrc` file). This variable will be read in Julia and used for authentication as follows:

In [2]:
using JSON
using GitHub
using LightGraphs
using ProgressMeter

# authenticate with GitHub to increase query limits
myauth = GitHub.authenticate(ENV["GITHUB_AUTH"])

GitHub.OAuth2(8cda0d**********************************)
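
Indexing `ENV` with a missing key raises a `KeyError`, which can be confusing when the variable was simply never exported. The following is a minimal sketch of a friendlier lookup, assuming a recent Julia (1.x) and only the standard library. The helper `read_token` is hypothetical and not part of GitHub.jl, and a plain `Dict` stands in for `ENV` so that no real token is needed:

```julia
# Hypothetical helper (not part of GitHub.jl): fetch the token or fail with a clear message.
function read_token(env = ENV; var = "GITHUB_AUTH")
    token = get(env, var, "")
    isempty(token) && error("environment variable $var is not set")
    return token
end

# A plain Dict stands in for ENV here, so the sketch runs without a real token.
fake_env = Dict("GITHUB_AUTH" => "not-a-real-token")
token = read_token(fake_env)
```

In the notebook itself, the result of `read_token()` could be passed to `GitHub.authenticate` in place of `ENV["GITHUB_AUTH"]`.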

After successful authentication, we are now ready to start coding. First, we extract the names of all registered packages in METADATA and assign to each of them a unique integer id:

In [3]:
# find all packages in METADATA
pkgs = readdir(Pkg.dir("METADATA"))
filterfunc = p -> isdir(joinpath(Pkg.dir("METADATA"), p)) && p ∉ [".git",".test"]
pkgs = filter(filterfunc, pkgs)

# assign each package an id
pkgdict = Dict{String,Int}()
for (i,pkg) in enumerate(pkgs)
  push!(pkgdict, pkg => i)
end
pkgdict

Dict{String,Int64} with 1386 entries:
  "Levenshtein"        => 664
  "ReadStat"           => 1052
  "Discretizers"       => 297
  "SchumakerSpline"    => 1116
  "FredData"           => 415
  "GaussQuadrature"    => 434
  "RecurrenceAnalysis" => 1056
  "AnsiColor"          => 18
  "ProximalOperators"  => 990
  "Luxor"              => 715
  "RobustLeastSquares" => 1094
  "Temporal"           => 1248
  "Robotlib"           => 1092
  "PiecewiseLinearOpt" => 947
  "JLDArchives"        => 608
  "MatrixDepot"        => 741
  "CodeTools"          => 154
  "NumericSuffixes"    => 866
  "COBRA"              => 152
  "Crypto"             => 215
  "Mongo"              => 792
  "ROOT"               => 1102
  "MNIST"              => 784
  "RandomMatrices"     => 1034
  "GMT"                => 471
  ⋮                    => ⋮
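
To make the filtering and id assignment concrete, here is a small sketch on a temporary directory that imitates the layout of METADATA. It assumes a recent Julia (1.x), and the names `root`, `toy_pkgs` and `toy_ids` are made up for illustration:

```julia
# Build a throwaway directory that imitates METADATA: package folders plus clutter.
root = mktempdir()
for name in ["Compat", "JSON", ".git", ".test"]
    mkdir(joinpath(root, name))
end
touch(joinpath(root, "README.md"))  # a plain file, which the filter should drop

# Keep only real package directories, as in the cell above.
toy_pkgs = filter(p -> isdir(joinpath(root, p)) && p ∉ (".git", ".test"), readdir(root))

# Map each name to its position; a generator expression replaces the explicit loop.
toy_ids = Dict(pkg => i for (i, pkg) in enumerate(toy_pkgs))

# The vector itself serves as the inverse map from id back to name.
roundtrip = toy_pkgs[toy_ids["JSON"]]
```

Because ids are just positions in the package vector, going from id back to name needs no second dictionary.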

Using the ids, we can easily build the DAG of packages with LG:

In [4]:
# build DAG
G = DiGraph(length(pkgs))
@showprogress 1 "Building graph..." for pkg in pkgs
  children = Pkg.dependents(pkg)
  for c in children
    add_edge!(G, pkgdict[pkg], pkgdict[c])
  end
end

Building graph...100% Time: 0:03:17
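
The direction of the edges matters for everything that follows: an edge goes from a package to each package that directly depends on it. The sketch below reproduces this construction with a plain adjacency list instead of a LightGraphs `DiGraph`, assuming a recent Julia (1.x). The table `toy_dependents` is hypothetical data, not the output of `Pkg.dependents`:

```julia
# Hypothetical dependents table: package => packages that directly depend on it.
toy_dependents = Dict(
    "Compat" => ["JSON", "Plots"],
    "JSON"   => ["Plots"],
    "Plots"  => String[],
)
toy_pkgs = sort(collect(keys(toy_dependents)))
toy_ids  = Dict(pkg => i for (i, pkg) in enumerate(toy_pkgs))

# Adjacency list: out_neighbors[u] holds the ids of the direct dependents of u.
out_neighbors = [Int[] for _ in toy_pkgs]
for pkg in toy_pkgs, child in toy_dependents[pkg]
    push!(out_neighbors[toy_ids[pkg]], toy_ids[child])
end

# Out-degree counts direct dependents; in-degree counts direct dependencies.
out_degree = length.(out_neighbors)
in_degree  = zeros(Int, length(toy_pkgs))
for targets in out_neighbors, v in targets
    in_degree[v] += 1
end
```

With this orientation, a widely used package has a large out-degree, while a package with many direct dependencies has a large in-degree.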


We are interested in finding all the descendants of a package, that is, all packages that depend on it directly or indirectly. We also want to save the level of dependency (the geodesic distance) from the queried package to each descendant. This is a straightforward operation in LG:

In [5]:
# find (indirect) descendents
descendents = []
for pkg in pkgs
  gdists = gdistances(G, pkgdict[pkg])
  desc = [Dict("id"=>pkgs[v], "level"=>gdists[v]) for v in find(gdists .> 0)]
  push!(descendents, desc)
end
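
Geodesic distances in an unweighted graph can be computed with a breadth-first search. The sketch below shows the idea on a plain adjacency list, assuming a recent Julia (1.x). The function `bfs_levels` is hypothetical and is not the LightGraphs implementation. It uses `-1` as the sentinel for unreachable vertices, which is an assumption of this sketch: the value that `gdistances` returns for unreachable vertices may differ between LightGraphs versions, so it is worth checking that a `.> 0` filter really excludes them.

```julia
# Breadth-first search over an adjacency list; unreachable vertices keep the sentinel -1.
function bfs_levels(out_neighbors::Vector{Vector{Int}}, source::Int)
    levels = fill(-1, length(out_neighbors))
    levels[source] = 0
    queue = [source]
    while !isempty(queue)
        u = popfirst!(queue)
        for v in out_neighbors[u]
            if levels[v] == -1          # first visit gives the shortest distance
                levels[v] = levels[u] + 1
                push!(queue, v)
            end
        end
    end
    return levels
end

# Toy graph: 1 -> 2 -> 3, and vertex 4 is isolated.
toy_graph = [[2], [3], Int[], Int[]]
levels = bfs_levels(toy_graph, 1)

# Keep only true descendants: this drops the source (level 0) and unreachable vertices (-1).
toy_descendants = [(id = v, level = levels[v]) for v in findall(levels .> 0)]
```

Here vertex 2 is a direct dependent (level 1) and vertex 3 an indirect one (level 2). Vertex 4 is excluded because it is never reached.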

For each package, we also want to save information about who has contributed to the project. This task is easy to implement with the awesome [GitHub.jl](https://github.com/JuliaWeb/GitHub.jl) API. However, some of the packages registered in METADATA are hosted on other sites such as GitLab, which GitHub.jl cannot query. We simply skip them and ask authors to migrate their code to GitHub if possible:

In [6]:
# find contributors
pkgcontributors = []
hostnames = []
@showprogress 1 "Finding contributors..." for pkg in pkgs
  urlfile = joinpath(Pkg.dir("METADATA"), pkg, "url")
  url = readline(urlfile)
  m = match(r".*://([a-z.]*)/(.*)\.git.*", url)
  hostname = m[1]; reponame = m[2]
  if hostname == "github.com"
    users, _ = contributors(reponame, auth=myauth)
    usersdata = map(u -> (u["contributor"].login, u["contributions"]), users)
    pkgcontrib = [Dict("id"=>u, "contributions"=>c) for (u,c) in usersdata]
    push!(pkgcontributors, pkgcontrib)
    push!(hostnames, hostname)
  else
    push!(pkgcontributors, [])
    push!(hostnames, hostname)
  end
end

Finding contributors...100% Time: 0:16:51
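
The regular expression used above splits a repository URL into a hostname and an `owner/name` part. The sketch below isolates that step, assuming a recent Julia (1.x). The helper `parse_repo_url` and the example URLs are hypothetical. Unlike the loop above, it returns `nothing` when the URL does not match, instead of failing on `m[1]`:

```julia
# Hypothetical helper: split a repository URL into host and "owner/name", or return nothing.
function parse_repo_url(url::AbstractString)
    m = match(r".*://([a-z.]*)/(.*)\.git.*", url)
    m === nothing && return nothing
    return (host = String(m[1]), repo = String(m[2]))
end

github = parse_repo_url("git://github.com/JuliaLang/Compat.jl.git")
other  = parse_repo_url("https://gitlab.com/someuser/SomePkg.jl.git")
broken = parse_repo_url("not a url")
```

Only results whose `host` equals `"github.com"` would be passed on to the contributors query; the others are skipped.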


We also extract the Julia version required by the latest tagged version of each package. Both the lower and upper bounds are saved, along with a "cleaned" `major.minor` string for the lower bound, which is useful for data visualization:

In [7]:
# find required Julia version
juliaversion = []
for pkg in pkgs
  versiondir = joinpath(Pkg.dir("METADATA"), pkg, "versions")
  if isdir(versiondir)
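    # sort by parsed version: plain string order would rank "0.9.0" after "0.10.0"
    latestversion = sort(readdir(versiondir), by=VersionNumber)[end]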
    reqfile = joinpath(versiondir, latestversion, "requires")
    reqs = Pkg.Reqs.parse(reqfile)
    if "julia" ∈ keys(reqs)
      vinterval = reqs["julia"].intervals[1]
      vmin = vinterval.lower
      vmax = vinterval.upper
      majorminor = "v$(vmin.major).$(vmin.minor)"
      push!(juliaversion, Dict("min"=>string(vmin),
                               "max"=>string(vmax),
                               "majorminor"=>majorminor))
    else
      push!(juliaversion, Dict("min"=>"NA", "max"=>"NA", "majorminor"=>"NA"))
    end
  else
    push!(juliaversion, Dict("min"=>"BOGUS", "max"=>"BOGUS", "majorminor"=>"BOGUS"))
  end
end
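
When picking the latest version from a list of directory names, string order and version order can disagree. The sketch below illustrates the difference and the construction of the `major.minor` label, assuming a recent Julia (1.x). The version strings and the lower bound `vmin` are made-up examples:

```julia
# Directory names as they might appear under a package's `versions` folder.
toy_versions = ["0.1.0", "0.10.0", "0.9.2"]

# Lexicographic order puts "0.9.2" last, although 0.10.0 is the newer release.
lexicographic_last = sort(toy_versions)[end]

# Sorting by the parsed VersionNumber gives the semantic order.
semantic_last = sort(toy_versions, by = VersionNumber)[end]

# Build the "major.minor" label from a lower bound, as done for the visualization.
vmin = v"0.5.0"
majorminor = "v$(vmin.major).$(vmin.minor)"
```

Sorting with `by = VersionNumber` keeps the original strings, so the result can still be used directly as a directory name.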

Finally, we save the data in a JSON file:

In [8]:
# construct JSON
nodes = [Dict("id"=>pkgs[v],
              "indegree"=>indegree(G,v),
              "outdegree"=>outdegree(G,v),
              "juliaversion"=>juliaversion[v],
              "descendents"=>descendents[v],
              "contributors"=>pkgcontributors[v]) for v in vertices(G)]

links = [Dict("source"=>pkgs[u], "target"=>pkgs[v]) for (u,v) in edges(G)]

data = Dict("nodes"=>nodes, "links"=>links)

# write to file
open("DAG-Julia-Pkgs.json", "w") do f
  JSON.print(f, data, 2)
end