<h1 style="text-align: center">DaSEA – A Dataset for Software Ecosystem Analysis</h1>

<h3 style="text-align: center">
    <a href="https://conf.researchr.org/track/msr-2022/msr-2022-technical-papers?track=MSR%20Data%20and%20Tool%20Showcase%20Track#">Data and Tool Showcase Track @ MSR 2022</a>
</h3>

<p style="text-align: center">
    Petya Buchkova, Joakim Hey Hinnerskov, Kasper Olsen, <u>Rolf-Helge Pfeiffer</u><br>
    <tt>[pebu|jhhi|kols|ropf]@itu.dk</tt><br>
    IT University of Copenhagen, Copenhagen, Denmark
</p>

<h1 style="text-align: center">Motivation</h1>

## The libraries.io dataset

  * The [libraries.io dataset](https://libraries.io/data) has [not been updated since Jan. 12th 2020](https://zenodo.org/record/3626071)
  * Researchers have been [requesting an updated dataset](https://github.com/librariesio/libraries.io/issues/2744) since early 2021
  * We [need an updated dataset](https://arxiv.org/pdf/2203.01634.pdf) [for our work too](https://github.com/ossf/criticality_score/issues/53)
  * The libraries.io dataset cannot be reproduced or updated locally using only the [provided tool](https://github.com/librariesio/libraries.io) and [documentation](https://github.com/librariesio/libraries.io/blob/main/docs/development-setup.md)

<table>
  <tr>
    <th><a href="https://libraries.io/data"><img src="images/librariesio.png" width="75%"></a></th>
    <th><a href="https://github.com/librariesio/libraries.io/issues/2744"><img src="images/request.png" width="75%"></a></th>
  </tr>
</table> 



<h1 style="text-align: center">The DaSEA dataset</h1>

## Getting the DaSEA dataset

<center>
  <img src="images/getit.png" width="40%">
</center>

In [None]:
%%bash
wget https://zenodo.org/record/6369420/files/dasea_03-18-2022.tar.bz2

## What does the DaSEA dataset look like?

To get an overview of the ecosystems included in this release before decompressing it, list the files in the archive as follows.

In [None]:
%%bash
tar -tvjf dasea_03-18-2022.tar.bz2

## What does the DaSEA dataset look like?

The respective files can be extracted separately if only the dependency networks of certain package managers, or only certain information, are required. For example, the dependency networks of the `pkgsrc` packages on NetBSD or the dependency graph of `Conan` can be extracted as follows.

In [None]:
%%bash
tar -jxf dasea_03-18-2022.tar.bz2 ports/netbsd9/netbsd9_dependencies_03-18-2022.csv
tar -jxf dasea_03-18-2022.tar.bz2 ports/netbsd9/netbsd9_packages_03-18-2022.csv
tar -jxf dasea_03-18-2022.tar.bz2 ports/netbsd9/netbsd9_versions_03-18-2022.csv

In [None]:
%%bash
tar -jxf dasea_03-18-2022.tar.bz2 conan/conan_versions_03-18-2022.csv
tar -jxf dasea_03-18-2022.tar.bz2 conan/conan_packages_03-18-2022.csv
tar -jxf dasea_03-18-2022.tar.bz2 conan/conan_dependencies_03-18-2022.csv

In [None]:
%%bash
tar -jxf dasea_03-18-2022.tar.bz2 vcpkg/vcpkg_versions_03-18-2022.csv
tar -jxf dasea_03-18-2022.tar.bz2 vcpkg/vcpkg_packages_03-18-2022.csv
tar -jxf dasea_03-18-2022.tar.bz2 vcpkg/vcpkg_dependencies_03-18-2022.csv
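
The same selective extraction can also be scripted in Python with the standard library's `tarfile` module. The following is a minimal sketch; the helper name `extract_ecosystem` and the prefix-based selection are assumptions made for illustration and are not part of the DaSEA tooling.

```python
import tarfile


def extract_ecosystem(archive_path, prefix, dest="."):
    """Extract only the archive members whose path starts with `prefix`.

    Hypothetical helper for illustration; not part of the DaSEA tooling.
    """
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        # Select members by path prefix, e.g. "conan/" or "ports/netbsd9/".
        members = [m for m in archive.getmembers() if m.name.startswith(prefix)]
        archive.extractall(path=dest, members=members)
    return [m.name for m in members]
```

For example, `extract_ecosystem("dasea_03-18-2022.tar.bz2", "conan/")` would extract the Conan files into the current directory, assuming the archive has been downloaded there.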

## What does the DaSEA dataset look like?

Per package manager, the dataset consists of three CSV files named: `<package_manager>_[packages|versions|dependencies]_<miningdate>.csv`

In [None]:
%%bash
head -4 conan/conan_packages_03-18-2022.csv

In [None]:
%%bash
head -4 conan/conan_versions_03-18-2022.csv

In [None]:
%%bash
head -4 conan/conan_dependencies_03-18-2022.csv
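
The naming scheme above can be captured in a small helper that builds the three file paths for one package manager. This is only a sketch: the function name `dasea_csv_paths` and its `directory` parameter are assumptions made for illustration. The parameter exists because, for NetBSD, the files live under `ports/netbsd9` rather than in a directory named after the package manager.

```python
from pathlib import Path

KINDS = ("packages", "versions", "dependencies")


def dasea_csv_paths(package_manager, mining_date, directory=None):
    """Return the three CSV paths for one package manager, keyed by kind.

    Hypothetical helper; `directory` defaults to the package manager name.
    """
    base = Path(directory if directory is not None else package_manager)
    return {
        kind: base / f"{package_manager}_{kind}_{mining_date}.csv"
        for kind in KINDS
    }
```

With this, `dasea_csv_paths("conan", "03-18-2022")["versions"]` points at `conan/conan_versions_03-18-2022.csv`, and `dasea_csv_paths("netbsd9", "03-18-2022", directory="ports/netbsd9")` covers the NetBSD files.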

## The DaSEA data model

For each package manager, the three CSV files correspond to the three identically named classes in the data model.

<center>
    <img src="images/schema.png" width="35%">
</center>
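
To illustrate how the three files relate, the sketch below joins toy stand-ins for them with pandas. The key columns `idx`, `pkg_idx`, and `target_idx` are the ones used in the examples of this presentation; the `name` and `version` columns and all rows are invented for illustration and may not match the real schema.

```python
import pandas as pd

# Toy rows, invented for illustration only.
packages = pd.DataFrame({"idx": [0, 1, 2], "name": ["pkg_a", "pkg_b", "pkg_c"]})
versions = pd.DataFrame({"pkg_idx": [0, 0, 1, 2],
                         "version": ["1.0", "1.1", "2.0", "0.3"]})
dependencies = pd.DataFrame({"pkg_idx": [1, 2, 2], "target_idx": [0, 0, 1]})

# Versions reference their package through pkg_idx.
version_counts = versions.groupby("pkg_idx").size().rename("n_versions")

# Dependencies reference a source package (pkg_idx) and a target (target_idx);
# merging with the packages table twice resolves both ends to names.
edges = (
    dependencies
    .merge(packages, left_on="pkg_idx", right_on="idx")
    .merge(packages, left_on="target_idx", right_on="idx",
           suffixes=("_source", "_target"))
)[["name_source", "name_target"]]
```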

## DaSEA, behind the scenes

The DaSEA dataset is created by a set of _miners_ (Python scripts) that collect metadata of packages, their versions, and dependencies from suitable sources, convert it into the uniform data model, and serialize it to CSV files. 

<center>
    <img src="images/mining.png" width="37%">
</center>
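
The collect, convert, and serialize steps can be sketched as follows. This is not the actual DaSEA miner code: the classes, the `mine` and `to_csv` functions, and the shape of the raw input are assumptions made for illustration, and only a few fields of the data model are shown.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class Package:
    idx: int
    name: str


@dataclass
class Dependency:
    pkg_idx: int
    target_idx: Optional[int]  # None when the target is not a known package
    target_name: str


def mine(raw_metadata):
    """Convert raw {package name: [dependency names]} metadata into records."""
    index = {name: i for i, name in enumerate(sorted(raw_metadata))}
    packages = [Package(i, name) for name, i in index.items()]
    dependencies = [
        Dependency(index[name], index.get(target), target)
        for name, targets in raw_metadata.items()
        for target in targets
    ]
    return packages, dependencies


def to_csv(records, record_type):
    """Serialize dataclass records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(record_type)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

Unresolved dependency targets are written with an empty `target_idx`, which is why the NetworkX example in this presentation filters out rows with null indices before building a graph.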

<h1 style="text-align: center">How to use the DaSEA dataset</h1>

### Example use case – SQL: Identify packages with highest in-degree

In [None]:
import pandas as pd
from sqlalchemy import create_engine


db_engine = create_engine('sqlite://')  # in memory DB

deps_df = pd.read_csv("ports/netbsd9/netbsd9_dependencies_03-18-2022.csv")
deps_df.to_sql("Dependencies", db_engine)

query = """SELECT target_name, COUNT(target_name) AS indegree FROM Dependencies
GROUP BY target_name ORDER BY indegree DESC
LIMIT 10;"""

pd.read_sql(query, db_engine)
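
For comparison, the same ranking can be computed without a database. The sketch below uses pandas `value_counts`; the helper name `top_indegree` is hypothetical, and the function only relies on the `target_name` column used in the query above.

```python
import pandas as pd


def top_indegree(deps_df, n=10):
    """Return the n most depended-upon packages with their in-degree."""
    return (
        deps_df["target_name"]
        .value_counts()            # counts per target, sorted descending
        .head(n)
        .rename_axis("target_name")
        .reset_index(name="indegree")
    )
```

Applied to the DataFrame loaded above, this would be `top_indegree(deps_df)`. Rows with a missing `target_name` are ignored, and ties may be ordered differently than in the SQL result.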

### Example use case – Pandas: Identify license changes

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

df = pd.read_csv("conan/conan_versions_03-18-2022.csv")
rdf = df.groupby("pkg_idx").filter(lambda x: len(set(x.license)) > 1)
ldf = rdf.groupby("name").apply(lambda x: set(x.license)).to_frame().reset_index()
ldf.rename(columns={0: "licenses"}, inplace=True)
ldf

### Example use case – NetworkX: Compute a centrality metric

In [None]:
import pandas as pd
import networkx as nx
import numpy as np

ddf = pd.read_csv("conan/conan_dependencies_03-18-2022.csv")
ddf = ddf[(~ddf.pkg_idx.isnull()) & (~ddf.target_idx.isnull())]
adjl = ddf[["pkg_idx", "target_idx"]].to_numpy()
np.savetxt("/tmp/conan.adjl", adjl, fmt="%u", delimiter=" ")

g = nx.read_adjlist("/tmp/conan.adjl", nodetype=int, create_using=nx.DiGraph)
# Rank packages by betweenness centrality, highest first
betweenness_ranks = nx.betweenness_centrality(g)
data = sorted(betweenness_ranks.items(), key=lambda item: item[1], reverse=True)
rdf = pd.DataFrame(data, columns=["pkg_idx", "betweenness"])

# Join with the package metadata and show the top three packages
pdf = pd.read_csv("conan/conan_packages_03-18-2022.csv")
pd.merge(pdf, rdf, left_on="idx", right_on="pkg_idx").sort_values(by="betweenness", ascending=False)[:3]
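
As an alternative to writing a temporary adjacency-list file, the graph can be built directly from the DataFrame with `nx.from_pandas_edgelist`. This is a sketch of that variant; the helper name `dependency_graph` is hypothetical, and it assumes the `pkg_idx` and `target_idx` columns used above.

```python
import networkx as nx
import pandas as pd


def dependency_graph(ddf):
    """Build a directed graph from rows that have both endpoints resolved."""
    edges = (
        ddf.dropna(subset=["pkg_idx", "target_idx"])
        .astype({"pkg_idx": int, "target_idx": int})
    )
    return nx.from_pandas_edgelist(
        edges, source="pkg_idx", target="target_idx", create_using=nx.DiGraph
    )
```

The resulting graph can be passed to `nx.betweenness_centrality` in the same way as `g` above.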

### Example use case – SigmaJS: Visualize a dependency graph

In [None]:
import pandas as pd, numpy as np, networkx as nx
from ipysigma import Sigma

ddf = pd.read_csv("vcpkg/vcpkg_dependencies_03-18-2022.csv")
ddf = ddf[(~ddf.pkg_idx.isnull()) & (~ddf.target_idx.isnull())]
adjl = ddf[["source_name", "target_name"]]
adjl.to_csv("/tmp/vcpkg.adjl", index=False, header=False)
g = nx.read_adjlist("/tmp/vcpkg.adjl", nodetype=str, delimiter=",", create_using=nx.DiGraph)
Sigma(g)

<table>
  <tr>
    <th>
    </th>
    <th>
      <a href="http://dasea.org">
        <center>
          <h2>Get the dataset</h2>
          <img src="images/getit_end.png" width="96%">
        </center>
      </a>      
    </th>
    <th>
    </th>      
  </tr>
  <tr>
    <th>
      <a href="https://mybinder.org/v2/gh/DaSEA-project/MSR22-DaSEA-Dataset-Presentation/main?filepath=index.ipynb">
        <center>
          <h2>Check the presentation</h2>
          <img src="images/presentation.png" width="87%">
        </center>
      </a>
    </th>
    <th>
      <a href="https://github.com/DaSEA-project/DASEA">
        <center>
          <h2>Contribute</h2>
          <img src="images/contribute.png" width="90%">
        </center>
      </a>
    </th>
    <th>
      <a href="https://itu.dk/~ropf/blog/assets/msr2022.pdf">
        <center>
          <h2>Read the paper</h2>
          <img src="images/paper.png" width="100%">
        </center>
      </a>
    </th>      
  </tr>    
</table>