Build a repository dataset by describing an OCL query on the Software Heritage Graph Property Dataset
This project is a companion repository containing a prototype of the fingerprint approach. We call it a prototype because we do not guarantee any usage other than those described in the paper. For instance, the compiler does not handle the full expressiveness of OCL: only the OCL concepts used in the running query are implemented. Moreover, broader test coverage is required before using it at large scale.
Romain Lefeuvre, Jessie Galasso, Benoit Combemale, Houari Sahraoui, and Stefano Zacchiroli. 2023. Fingerprinting and Building Large Reproducible Datasets. In Proceedings of the 2023 ACM REP '23 https://doi.org/10.1145/3589806.3600043
- fr.inria.diverse.swhModel: The project containing the object-oriented model of the SWH graph dataset, expressed as an Ecore project.
- fr.inria.diverse.swhModel.generator: The project containing the generator. It generates Java code targeting the swh-graph API from a query expressed in OCL on the OO model.
- fr.inria.diverse.swhModel.fingerprint: The project that exposes an object-oriented API over swh-graph, allowing queries to be executed on an export of the property graph dataset.
- fr.inria.diverse.swhModel.generator.tests: The project containing the tests of the generator.
- fr.inria.diverse.swhModel.queryExemple: An example project containing an OCL query template. It can be used to design OCL queries while leveraging OCL tooling.
- thirdPartyLibrary: The third-party libraries that are not available on Maven Central. Most of the libraries used by the generator come from the Eclipse ecosystem and are not available in an m2 repository.
- result: The results of the experiments described in the paper. Note that the logs cannot be stored in this repository because they are very large (tens of GB); if you need them, do not hesitate to e-mail me. For each fingerprint run you will find a JSON file listing the IDs of the origins matching the query. You can use it to extract metadata about those repositories from the corresponding export. You can also extract a sub-dataset of the swh-graph property dataset in order to distribute it and let your users exploit your dataset on a laptop!
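As an illustration of how such a result file could be used to pull metadata for the matching repositories, here is a minimal Python sketch. It assumes the JSON document is a flat array of origin IDs and that metadata rows are dictionaries with an "id" key; both layouts are assumptions made for the example, so check the actual files in the result folder before relying on it. The function names are hypothetical.

```python
import json


def load_origin_ids(json_text):
    """Parse a result document, ASSUMED to be a JSON array of origin IDs."""
    ids = json.loads(json_text)
    if not isinstance(ids, list):
        raise ValueError("expected a JSON array of origin IDs")
    return set(ids)


def select_metadata(origin_ids, metadata_rows):
    """Keep only the metadata rows whose 'id' belongs to the matched origins."""
    return [row for row in metadata_rows if row["id"] in origin_ids]


# Toy data standing in for a result file and an export's metadata.
sample_result = "[3, 7, 42]"
sample_rows = [
    {"id": 3, "url": "https://example.org/a"},
    {"id": 5, "url": "https://example.org/b"},
]
selected = select_metadata(load_origin_ids(sample_result), sample_rows)
```

With a real run you would read the JSON text from the file produced in the export path instead of the inline string.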
- Java 11
- (Maven: we use the Maven wrapper, so Maven does not have to be installed on your machine, which facilitates reproducibility)
- Run the ./install.sh script to install the custom swh-graph fork we use
- Ant >= 1.10.12
- Linux
The fingerprint approach is decomposed into two parts: a query described in OCL and a timestamp, generally the version of the property graph dataset on which the query will be executed: $FP_{t_x} = \langle t_x, q \rangle$. A run of a fingerprint is described as $FP_{t_x} \times G_x$.
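To make the definition concrete, the following toy Python sketch models a fingerprint as a (timestamp, query) pair and a run as the application of the query to the graph version designated by the timestamp. It is not the prototype's implementation: Origin, Fingerprint, run_fingerprint, and the in-memory graph are hypothetical names, and the query is a plain Python predicate rather than a compiled OCL operation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Origin:
    """Hypothetical stand-in for an origin (a repository) in the graph."""
    origin_id: int
    url: str


@dataclass(frozen=True)
class Fingerprint:
    """A fingerprint pairs a dataset timestamp with a query."""
    timestamp: str
    query: Callable[[Origin], bool]


def run_fingerprint(fp: Fingerprint, graphs: Dict[str, List[Origin]]) -> List[int]:
    """Apply fp.query to the graph version designated by fp.timestamp."""
    graph = graphs[fp.timestamp]
    return sorted(o.origin_id for o in graph if fp.query(o))


# Toy graph versions, keyed by export timestamp.
graphs = {
    "2022-04-25": [
        Origin(1, "https://example.org/a.git"),
        Origin(2, "https://example.org/b"),
    ],
}
fp = Fingerprint("2022-04-25", lambda o: o.url.endswith(".git"))
matching_ids = run_fingerprint(fp, graphs)
```

Keeping the timestamp inside the fingerprint is what makes a run repeatable: the same pair applied to the same graph version selects the same origins.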
The prototype is also decomposed into 3 steps:
- 1- Design your OCL query over the OO model of the SWH property dataset (cf. Figure 2 in the paper), or use our running query
- 2- Compile your query into Java code that leverages the swh-graph API
- 3- Execute your code and get the repositories that match your query (you will need to download the dataset)
The generator can be used through the oclQueryCompilerLaucher.sh script, which automates the call to the generator and the copy of the resulting Java file into the fingerprint project. As output you obtain a fat jar of the fingerprint project containing your query:
- Make sure you have run
sh ./install.sh
to install the forked version of swh-graph
- Run
sh oclQueryCompilerLaucher.sh <oclModelPath> <QueryName> <exportPath>
where:
  - <oclModelPath> is the path of the OCL model (either .ocl or the abstract syntax saved in .oclas)
  - <QueryName> is the query ID, used for saving the query results during execution
  - <exportPath> is the path where the resulting Java code will be saved
Example: sh oclQueryCompilerLauncher.sh ./result/swhModelQuery.ocl RUNNING_QUERY ./result
- At least 500 GB of RAM and a 4 TB SSD
- Download a version of the property graph dataset (https://docs.softwareheritage.org/devel/swh-dataset/graph/dataset.html). You can use the
result/dl_script.sh
script for this (run it from the result folder)
- For better performance, mount a tmpfs, copy the graph.graph file into it, and symlink it into the dataset/your_graph/compressed folder
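The staging step in the last item can be sketched in Python as follows. This is only an illustration under assumptions: stage_on_tmpfs is a hypothetical helper, the tmpfs is assumed to be already mounted (mounting usually requires administrator privileges and is left out), and the original file is kept under a ".disk" suffix rather than deleted. The demo uses temporary directories in place of the real tmpfs and compressed folders.

```python
import os
import shutil
import tempfile


def stage_on_tmpfs(graph_file, tmpfs_dir):
    """Copy graph_file into tmpfs_dir and leave a symlink at its original path."""
    staged = os.path.join(tmpfs_dir, os.path.basename(graph_file))
    shutil.copy2(graph_file, staged)              # copy the data into the tmpfs
    os.replace(graph_file, graph_file + ".disk")  # keep the on-disk copy as a backup
    os.symlink(staged, graph_file)                # the expected path now points to the tmpfs copy
    return staged


# Demo with temporary directories standing in for the real folders.
with tempfile.TemporaryDirectory() as compressed, tempfile.TemporaryDirectory() as ram:
    path = os.path.join(compressed, "graph.graph")
    with open(path, "wb") as f:
        f.write(b"demo")
    stage_on_tmpfs(path, ram)
    is_link = os.path.islink(path)
    with open(path, "rb") as f:
        content = f.read()
```

Because the symlink keeps the original file name, the graph folder passed to the runner does not need to change.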
- Display help :
java -jar ./result/fr.inria.diverse.fingerprint-0.1.0.jar --help
[picocli WARN] defaults configuration file /home/rlefeuvr/Workspaces/SAND_BOX/SW_GRAPH/OCL_PROJECT/./config/config.properties does not exist or is not readable
Usage: GraphQueryRunner [-h] [-c=<checkPointIntervalInMinutes>]
[-e=<exportPath>] -g=<graphFolderPath>
[-l=<loadingMode>] [-qt=<queryTimestamp>]
[-t=<threadNumber>]
Execute a query over the graph property dataset
-c, --checkPointIntervalInMinutes=<checkPointIntervalInMinutes>
The time in minutes after which a checkpoint will be produced
-e, --exportPath=<exportPath>
The export path, where all the queries results will be saved
including checkpoints
-g, --graphPath=<graphFolderPath>
The graph Folder path
-h, --help display this help and exit
-l, --loadingMode=<loadingMode>
The graph loading mode either MAPPED for memory mapped or RAM
for ram loading
-qt, --queryTimestamp=<queryTimestamp>
The query Timestamp
-t, --threadNumber=<threadNumber>
The number of thread the query will use
Romain Lefeuvre - DIVERSE team - Inria
- Example run command:
java -ea -server -XX:PretenureSizeThreshold=512M -XX:MaxNewSize=4G -XX:+UseLargePages -XX:+UseTransparentHugePages -XX:+UseNUMA -XX:+UseTLAB -XX:+ResizeTLAB -Djava.io.tmpdir=../java-tmp-dir -Xmx600G --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar ./result/fr.inria.diverse.fingerprint-0.1.0.jar --exportPath=./result/export/2022-04-24 --threadNumber=60 --queryTimestamp=2022-04-24 --graphPath=./result/graph/2022-04-25 --checkPointIntervalInMinutes=300
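If you launch runs from a script or a notebook, the command line can be assembled programmatically. The sketch below is one possible way to do it in Python: build_run_command is a hypothetical helper, it only uses options that appear in the help output and in the example command, and it omits the JVM tuning flags for brevity.

```python
import shlex


def build_run_command(jar, graph_path, export_path, timestamp,
                      threads, checkpoint_minutes, max_heap="600G"):
    """Assemble the java invocation as an argument list (no shell involved)."""
    return [
        "java", "-server", f"-Xmx{max_heap}",
        "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
        "-jar", jar,
        f"--graphPath={graph_path}",
        f"--exportPath={export_path}",
        f"--queryTimestamp={timestamp}",
        f"--threadNumber={threads}",
        f"--checkPointIntervalInMinutes={checkpoint_minutes}",
    ]


cmd = build_run_command(
    jar="./result/fr.inria.diverse.fingerprint-0.1.0.jar",
    graph_path="./result/graph/2022-04-25",
    export_path="./result/export/2022-04-24",
    timestamp="2022-04-24",
    threads=60,
    checkpoint_minutes=300,
)
# Human-readable form, e.g. for logging next to the results.
printable = shlex.join(cmd)
```

The list can then be passed to subprocess.run(cmd, check=True); logging the printable form alongside the export helps record exactly how a fingerprint was run.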
Queries are described as an OCL operation named "query" in the Graph context, returning a Set of Origin, i.e. the set of repositories matching the query. The following template can be used:
import swhModel : 'platform:/resource/fr.inria.diverse.swhModel/model/swhModel.ecore'
package swhModel
context Graph
def : query():Set(Origin) = origins->select(
/*QUERY BODY */
)
endpackage
To define the OCL query, and therefore the input model of the generator, we recommend using the Eclipse tooling, which provides development tools for OCL.
- Install the GEMOC Studio
- Import the fr.inria.diverse.swhModel and fr.inria.diverse.swhModel.queryExemple projects
- Modify the templace.ocl file
- Use the previous section (Standalone Usage) to generate the corresponding Java code
As the project is based on Eclipse technologies (Eclipse OCL, Xtext, ...), the projects are Eclipse plugins. A standalone build procedure is available; it produces an uber-jar with all the necessary dependencies.
To build the project:
- Run the build.sh script, which compiles fr.inria.diverse.swhModel, then fr.inria.diverse.swhModel.generator
- The jar is also installed in your local m2 repository to run the tests
Then you can run the test project:
- Run
mvn test
in fr.inria.diverse.swhModel.generator.tests
Note: the fingerprint project is built automatically by the oclQueryCompilerLaucher.sh script
A Jupyter notebook is available in the tutorial folder. It describes an end-to-end example of designing a dataset from a fingerprint over a reduced dataset.
A conda environment export is provided in environment.yml; you can create the conda environment (conda env create -f environment.yml) and open the tutorial notebook with the Jupyter notebook support integrated in VS Code.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE.txt for more information.
Project Link: https://github.com/RomainLefeuvre/DatasetBuilder