DataSet Builder Tutorial
==

This tutorial walks through an end-to-end example for a simple query. We use a reduced dataset provided by Software Heritage (SWH) and apply our approach as if it were the entire graph. The dataset covers:
>the first 3k popular repositories tagged as being written in the Python language, from GitHub, Gitlab, PyPI and Debian 

The corresponding date is 2021-03-23, which we take as the fingerprint timestamp:
𝑡x=2021-03-23

We consider the following query 𝑞:
𝑞 = "Get all the origins having at least one file named README.md in their file hierarchy."

The corresponding fingerprint is:
𝐹𝑃tx = ⟨𝑡x, 𝑞⟩
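
As a rough illustration, the fingerprint can be read as an immutable pair of a timestamp and a query. The sketch below is only an assumption made for this tutorial: the `Fingerprint` class is hypothetical and is not part of DatasetBuilder.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical container for FP_tx = <t_x, q>; not part of the tool."""
    timestamp: date  # t_x: date of the graph export the query runs against
    query: str       # q: the query, kept here as informal text


fp = Fingerprint(
    timestamp=date(2021, 3, 23),
    query='all origins with at least one file named "README.md"',
)
print(fp.timestamp.isoformat())
```

Freezing the dataclass makes two fingerprints with the same timestamp and query compare equal, which matches the idea that the pair identifies a dataset.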


#### 0. Install the required dependencies 

In [6]:
%cd ..

/home/rlefeuvr/Workspaces/SAND_BOX/test_swh/DatasetBuilder


In [None]:
!sh install.sh

#### 1. Download the graphs


In [None]:
!mkdir -p ./tutorial/graph
!aws s3 cp --no-sign-request s3://softwareheritage/graph/2021-03-23-popular-3k-python/orc/origin_visit_status/ ./tutorial/graph/2021-03-23-popular-3k-python/orc/python_origin_visit_status --recursive
!aws s3 cp --no-sign-request s3://softwareheritage/graph/2021-03-23-popular-3k-python/compressed/ ./tutorial/graph/2021-03-23-popular-3k-python/compressed --recursive
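
Before going further, it can help to check that both downloads landed where the later steps expect them. The helper below is a minimal sketch: `missing_dirs` is a hypothetical function written for this tutorial, and the two relative paths are simply the destinations used in the `aws s3 cp` commands above.

```python
from pathlib import Path
from typing import List


def missing_dirs(root: Path, expected: List[str]) -> List[str]:
    """Return the expected sub-directories that do not exist under root."""
    return [rel for rel in expected if not (root / rel).is_dir()]


# Destinations taken from the two download commands of this step.
EXPECTED = ["orc/python_origin_visit_status", "compressed"]
root = Path("./tutorial/graph/2021-03-23-popular-3k-python")

missing = missing_dirs(root, EXPECTED)
print("missing:", missing)
```

An empty list means both folders are present; it says nothing about whether their content is complete.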

#### 2. Design your OCL query

Design your query based on the template. Here we select all the repositories containing a "README.md" file on their master or main branch. In this tutorial we use the query saved in ```myquery.ocl```. You can write any OCL query that applies to the SWH-Graph UML model.

<img src="image.png" alt="drawing" style="width:700px;"/>

In [17]:
%%file ./tutorial/myquery.ocl
import swhModel : 'platform:/resource/fr.inria.diverse.swhModel/model/swhModel.ecore'
package swhModel
context Graph
def : query():Set(Origin) = origins->select(
	 getLastSnapshot().branches->exists(
	 				(name= 'refs/heads/master' or name= 'refs/heads/main')
	 				and
					/*The master or main branch contains a file 'README.md'*/
					getRevision().tree.entries->closure(entry:DirectoryEntry |
						if entry.child.oclIsKindOf(Directory) then
							entry.child.oclAsType(Directory).entries.oclAsSet()
						else 
							entry.oclAsSet()
						endif	
					)->exists(e:DirectoryEntry | e.name='README.md')	
	)
)


endpackage


Overwriting ./tutorial/myquery.ocl
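
To make the intent of the query easier to follow, here is a small Python sketch of the same logic on a toy in-memory tree. It is an illustration under simplifying assumptions, not the code generated by the compiler: `Entry`, `walk` and `matches` are hypothetical names, and an origin is reduced to a mapping from branch names to the root entries of the branch.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass
class Entry:
    """Toy stand-in for a DirectoryEntry: a name plus optional child entries."""
    name: str
    children: Optional[List["Entry"]] = None  # None means the entry is a file


def walk(entries: List[Entry]) -> Iterator[Entry]:
    """Yield every entry reachable from `entries`, like the OCL closure."""
    stack = list(entries)
    while stack:
        entry = stack.pop()
        yield entry
        if entry.children is not None:
            stack.extend(entry.children)


def matches(branches: Dict[str, List[Entry]]) -> bool:
    """True if master or main holds an entry named README.md at any depth."""
    return any(
        name in ("refs/heads/master", "refs/heads/main")
        and any(e.name == "README.md" for e in walk(root))
        for name, root in branches.items()
    )


origin_a = {"refs/heads/main": [Entry("src", [Entry("README.md")])]}
origin_b = {"refs/heads/dev": [Entry("README.md")]}
print(matches(origin_a), matches(origin_b))  # prints: True False
```

`origin_b` is rejected because its only branch is neither master nor main, even though it contains a README.md.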


To get editor (LSP) support, you can use GEMOC Studio by importing the projects
```fr.inria.diverse.swhModel.queryExemple```
and
```fr.inria.diverse.swhModel```

and then modifying the template.ocl file.

#### 3. Compile your query

In [None]:
!sh ./oclQueryCompilerLauncher.sh ./tutorial/myquery.ocl RUNNING_QUERY ./tutorial

You should now see the jar ```fr.inria.diverse.fingerprint-0.1.0.jar``` in your workspace, under the ```tutorial``` folder.

#### 4. Execute your query

Let's first have a look at the help:

In [22]:
!java -jar tutorial/fr.inria.diverse.fingerprint-0.1.0.jar --help

[picocli WARN] defaults configuration file /home/rlefeuvr/Workspaces/SAND_BOX/test_swh/DatasetBuilder/./config/config.properties does not exist or is not readable
Usage: GraphQueryRunner [-h] -c=<checkPointIntervalInMinutes> -e=<exportPath>
                        -g=<graphFolderPath> -l=<loadingMode>
                        -qt=<queryTimestamp> -t=<threadNumber>
Execute a query over the graph property dataset
  -c, --checkPointIntervalInMinutes=<checkPointIntervalInMinutes>
               The time in minutes after which a checkpoint will be produced
  -e, --exportPath=<exportPath>
               The export path, where all the queries results will be saved
        

In [None]:
!java -XX:PretenureSizeThreshold=512M -Djava.io.tmpdir=../java-tmp-dir -Xmx10G --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -jar ./tutorial/fr.inria.diverse.fingerprint-0.1.0.jar  -g ./tutorial/graph/2021-03-23-popular-3k-python -l MAPPED -t 14 --queryTimestamp=2021-03-23 --exportPath=./tutorial/result --checkPointIntervalInMinutes=1
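
The command above mixes JVM options with the options of the query runner. As a reading aid, the sketch below rebuilds the same command line from named parameters. `build_command` is a hypothetical helper written for this tutorial, not something shipped with the jar; it only reuses the options that appear in the cell above, and the comments on `-g`, `-l` and `-t` follow the placeholder names shown in the usage line.

```python
import shlex
from typing import List


def build_command(graph_dir: str, export_dir: str, timestamp: str,
                  threads: int = 14, heap: str = "10G",
                  checkpoint_minutes: int = 1) -> List[str]:
    """Assemble the java invocation used in this tutorial (hypothetical helper)."""
    return [
        "java",
        # JVM options, copied from the tutorial command
        "-XX:PretenureSizeThreshold=512M",
        "-Djava.io.tmpdir=../java-tmp-dir",
        f"-Xmx{heap}",
        "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED",
        "-jar", "./tutorial/fr.inria.diverse.fingerprint-0.1.0.jar",
        # Query runner options
        "-g", graph_dir,       # <graphFolderPath>
        "-l", "MAPPED",        # <loadingMode>
        "-t", str(threads),    # <threadNumber>
        f"--queryTimestamp={timestamp}",
        f"--exportPath={export_dir}",
        f"--checkPointIntervalInMinutes={checkpoint_minutes}",
    ]


cmd = build_command(
    graph_dir="./tutorial/graph/2021-03-23-popular-3k-python",
    export_dir="./tutorial/result",
    timestamp="2021-03-23",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` to launch the query from Python.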