DataSet Builder Tutorial
==

This tutorial provides an end-to-end example for a simple query. We use a reduced dataset provided by Software Heritage (SWH) and apply our approach as if it were the entire graph. The dataset is described as:
>the first 3k popular repositories tagged as being written in the Python language, from GitHub, Gitlab, PyPI and Debian 

The corresponding dataset date is 2021-03-23, and we take that date as the fingerprint timestamp:
𝑡x=2021-03-23

We consider the following query 𝑞:
𝑞 = "Get all the origins having at least one file named README.md in their file hierarchy."

The corresponding fingerprint is:
𝐹𝑃tx = ⟨𝑡x, 𝑞⟩
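
As a minimal sketch, the fingerprint can be pictured as an immutable pair made of a timestamp and a query. The `Fingerprint` class below is hypothetical and only illustrates the definition above; it is not part of the DatasetBuilder code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical illustration of FP_tx = <t_x, q>; not the tool's own type."""
    timestamp: date  # t_x
    query: str       # q, here a plain-text description of the query


fp = Fingerprint(
    timestamp=date(2021, 3, 23),
    query='origins having at least one file named "README.md"',
)
print(fp.timestamp.isoformat())
```

Freezing the dataclass reflects the idea that a fingerprint, once taken, should not change.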


#### 0. Install the required dependencies 

The current working directory should be the root of the repository. If it is not, run the next cell.

In [5]:
%cd ..

/home/rlefeuvr/Workspaces/SAND_BOX/test_swh/DatasetBuilder


In [6]:
!sh install.sh

Installing dsi-utils
[INFO] Scanning for projects...
[INFO] 
[INFO] ------------------< org.apache.maven:standalone-pom >-------------------
[INFO] Building Maven Stub Project (No POM) 1
[INFO] --------------------------------[ pom ]---------------------------------
[INFO] 
[INFO] --- maven-install-plugin:2.4:install-file (default-cli) @ standalone-pom ---
[INFO] Installing /home/rlefeuvr/Workspaces/SAND_BOX/test_swh/DatasetBuilder/thirdPartyLibrary/dsiutils/dsiutils-2.7.3-threadSafe.jar to /home/rlefeuvr/.m2/repository/it/unimi/dsi/dsiutils/2.7.3-threadSafe/dsiutils-2.7.3-threadSafe.jar
[INFO] Installing /home/rlefeuvr/Workspaces/SAND_BOX/test_swh/DatasetBuilder/thirdPartyLibrary/dsiutils/pom.xml to /home/rlefeuvr/.m2/repository/it/unimi/dsi/dsiutils/2.7.3-threadSafe/dsiutils-2.7.3-threadSafe.pom
[INFO] -----------

#### 1. Download the graphs

Install the AWS CLI (https://aws.amazon.com/cli/) to download the graph from S3.

In [8]:
!conda env export >environment.yml

In [7]:
!pip3 install awscli --upgrade --user

Collecting awscli
  Downloading awscli-1.29.63-py3-none-any.whl.metadata (11 kB)
Collecting botocore==1.31.63 (from awscli)
  Downloading botocore-1.31.63-py3-none-any.whl.metadata (6.1 kB)
Collecting docutils<0.17,>=0.10 (from awscli)
  Downloading docutils-0.16-py2.py3-none-any.whl (548 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 548.2/548.2 kB 8.4 MB/s eta 0:00:00
Collecting s3transfer<0.8.0,>=0.7.0 (from awscli)
  Downloading s3transfer-0.7.0-py3-none-any.whl.metadata (1.8 kB)
Collecting PyYAML<6.1,>=3.10 (from awscli)
  Downloading PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting colorama<0.4.5,>=0.2.5 (from awscli)
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting rsa<4.8,>=3.1.2 (from awscli)
  Downloading rsa-4.7.2-py3-none-any.whl (34 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from botocore==1.31.63->awscli)
  Download

In [None]:
!mkdir ./tutorial/graph
!aws s3 cp --no-sign-request s3://softwareheritage/graph/2021-03-23-popular-3k-python/orc/origin_visit_status/ ./tutorial/graph/2021-03-23-popular-3k-python/orc/python_origin_visit_status --recursive
!aws s3 cp --no-sign-request s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ ./tutorial/graph/2021-03-23-popular-3k-python/compressed --recursive
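
Once the download has finished, a quick sanity check is to confirm that both target folders exist and are not empty. The sketch below only counts files; the helper name `count_files` is an assumption introduced for illustration, and the paths are the destinations used in the commands above.

```python
from pathlib import Path


def count_files(root):
    """Return the number of regular files under `root` (0 if it does not exist)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(1 for p in root.rglob("*") if p.is_file())


base = Path("./tutorial/graph/2021-03-23-popular-3k-python")
for sub in ("orc/python_origin_visit_status", "compressed"):
    # A count of 0 suggests that the corresponding download did not complete.
    print(sub, count_files(base / sub))
```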

#### 2. Design your OCL query

Design your query based on the template. Here we select all the repositories containing a "README.md" file on their master or main branch. In this tutorial we use the query saved in ```myquery.ocl```. In principle, you can write any OCL query that applies to the SWH-Graph UML model.

<img src="image.png" alt="drawing" style="width:700px;"/>

In [23]:
%%file ./tutorial/myquery.ocl
import swhModel : 'platform:/resource/fr.inria.diverse.swhModel/model/swhModel.ecore'
package swhModel
context Graph
def : query():Set(Snapshot) = origins->select(
	 getLastSnapshot().branches->exists(
	 				(name= 'refs/heads/master' or name= 'refs/heads/main')
	 				and
					/*The main branch contains a file 'README.md'*/
					getRevision().tree.entries->closure(entry:DirectoryEntry |
						if entry.child.oclIsKindOf(Directory) then
							entry.child.oclAsType(Directory).entries.oclAsSet()
						else 
							entry.oclAsSet()
						endif	
					)->exists(e:DirectoryEntry | e.name='README.md')	
	)
)


endpackage


Writing ./tutorial/myquery.ocl
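
To make the `closure` step easier to follow, here is a plain-Python sketch of the same idea: starting from the root entries, keep expanding directory entries until nothing new appears, then test whether any collected entry is named `README.md`. The nested-dictionary tree layout and the helper names are assumptions made for illustration; the actual query runs on the SWH graph model, not on this structure.

```python
# Hypothetical in-memory tree: a directory is a dict mapping entry names to
# children; a file is represented by None.
tree = {
    "setup.py": None,
    "docs": {"index.rst": None, "guide": {"README.md": None}},
}


def all_entries(directory):
    """Yield (name, child) for every entry reachable from `directory`."""
    stack = list(directory.items())
    while stack:
        name, child = stack.pop()
        yield name, child
        if isinstance(child, dict):  # a directory: expand its entries too
            stack.extend(child.items())


def contains_entry(directory, entry_name):
    """Return True if any entry, at any depth, is named `entry_name`."""
    return any(name == entry_name for name, _ in all_entries(directory))


print(contains_entry(tree, "README.md"))
```

Like the OCL expression, the check is on the entry name only, so a directory named `README.md` would match as well.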


To get editor support (LSP), you can use GEMOC Studio: import the projects ```fr.inria.diverse.swhModel.queryExemple``` and ```fr.inria.diverse.swhModel```, then modify the template.ocl file.

NB: Completion is not fully functional.

#### 3. Compile your query

In [None]:
!sh ./oclQueryCompilerLauncher.sh ./tutorial/myquery.ocl RUNNING_QUERY ./tutorial

The compiled jar ```fr.inria.diverse.fingerprint-0.1.0.jar``` should now be present in the ```./tutorial``` folder.

#### 4. Execute your query

Let's have a look at the help:

In [39]:
!java -jar tutorial/fr.inria.diverse.fingerprint-0.1.0.jar --help

[picocli WARN] defaults configuration file /home/rlefeuvr/Workspaces/SAND_BOX/test_swh/DatasetBuilder/./config/config.properties does not exist or is not readable
Usage: GraphQueryRunner [-h] [-c=<checkPointIntervalInMinutes>]
                        [-e=<exportPath>] -g=<graphFolderPath>
                        [-l=<loadingMode>] [-qt=<queryTimestamp>]
                        [-t=<threadNumber>]
Execute a query over the graph property dataset
  -c, --checkPointIntervalInMinutes=<checkPointIntervalInMinutes>
               The time in minutes after which a checkpoint will be produced
  -e, --exportPath=<exportPath>
               The export path, where all the queries results will be saved
  

In [None]:
!java -XX:PretenureSizeThreshold=512M -Djava.io.tmpdir=../java-tmp-dir -Xmx10G --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar ./tutorial/fr.inria.diverse.fingerprint-0.1.0.jar  -g ./tutorial/graph/2021-03-23-popular-3k-python -l MAPPED -t 14 --queryTimestamp=2021-03-23 --exportPath=./tutorial/result --checkPointIntervalInMinutes=1
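
The command above is long, so it can help to assemble it from named parameters. The sketch below builds the same argument list in Python, using only the options that appear in the cell above; the function `build_command` and its parameter names are hypothetical and are not part of the tool.

```python
import shlex


def build_command(jar, graph_dir, query_timestamp, export_dir,
                  threads=14, max_heap="10G", checkpoint_minutes=1):
    """Assemble the java invocation shown above as an argument list."""
    return [
        "java",
        "-XX:PretenureSizeThreshold=512M",
        "-Djava.io.tmpdir=../java-tmp-dir",
        f"-Xmx{max_heap}",
        "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
        "-jar", jar,
        "-g", graph_dir,
        "-l", "MAPPED",
        "-t", str(threads),
        f"--queryTimestamp={query_timestamp}",
        f"--exportPath={export_dir}",
        f"--checkPointIntervalInMinutes={checkpoint_minutes}",
    ]


cmd = build_command(
    jar="./tutorial/fr.inria.diverse.fingerprint-0.1.0.jar",
    graph_dir="./tutorial/graph/2021-03-23-popular-3k-python",
    query_timestamp="2021-03-23",
    export_dir="./tutorial/result",
)
# Printable form only; pass `cmd` to subprocess.run(cmd, check=True) to execute.
print(shlex.join(cmd))
```

Keeping the arguments in a list avoids shell-quoting issues if a path contains spaces.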

In [44]:
## TODO: investigate the exception raised for one origin; it has no impact on the others

You can now process the results: origin_url_snap_swhid.json contains the list of origin URLs that match the query, together with the IDs of the snapshots present in the current version of the graph.
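
As a starting point for processing, the sketch below loads the result file and reports its top-level JSON type and size, without assuming anything about its internal layout. The file location is an assumption derived from the `--exportPath` value used above, and `summarize_json` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and return (top-level type name, item count or None)."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size


# Assumed location, derived from --exportPath; adjust it if the file is
# written elsewhere.
result_file = Path("./tutorial/result/origin_url_snap_swhid.json")
if result_file.exists():
    print(summarize_json(result_file))
```

Inspecting the structure first makes it easier to decide how to iterate over the matching origins afterwards.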