Releases: SANSA-Stack/SANSA-Stack
sansa-cli-0.8.6-rml
This is the Debian package used to evaluate SANSA's RML execution performance by converting RML to SPARQL with rmltk's `rml to sparql` command.
Summary of the approach:
```bash
# Convert RML to SPARQL CONSTRUCT queries
rmltk rml to sparql mapping.rml > mapping.raw.rq

# Ensure that generated triples will be unique
rmltk optimize workload mapping.raw.rq --no-order > mapping.unique.rq

# Execute with the Sansa command
sansa query mapping.unique.rq --out-file mapping.nt
```
sansa-cli-0.8.0-rc3
This is the first release candidate featuring a runnable jar file with a consolidated command line interface based on the sansa-cli module.
- Most prominently, it features a Spark-based re-implementation of Tarql for mapping CSV to RDF via SPARQL CONSTRUCT queries. See the Sansa Tarql documentation for details.
To run the jar, the following `--add-opens` declarations are needed:
```bash
#!/bin/bash
# For Linux: save this code into sansa.sh and make it executable with chmod +x sansa.sh
SANSA_EXTRA_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED \
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED \
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED \
--add-opens=java.base/java.io=ALL-UNNAMED \
--add-opens=java.base/java.net=ALL-UNNAMED \
--add-opens=java.base/java.nio=ALL-UNNAMED \
--add-opens=java.base/java.util=ALL-UNNAMED \
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED \
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED \
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED \
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED \
--add-opens=java.base/sun.security.action=ALL-UNNAMED \
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED \
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
java $JAVA_OPTS $SANSA_EXTRA_OPTS -jar sansa-0.8.0-rc3.jar "$@"
```
ExPAD: An Explainable Distributed Automatic Anomaly Detection Framework over Large KGs
This release includes the recent developments for the ExPAD framework. ExPAD is an explainable anomaly detection framework for large KGs.
Overview
This module is a generic, distributed, and scalable software framework that automatically detects numeric anomalies in KGs and produces human-readable explanations for why a given value of a variable in an observation can be considered an outlier. ExPAD is inspired by OutlierTree and works by evaluating and following distributed, supervised decision-tree splits on variables. This helps to detect and explain anomalous cases that cannot be seen without considering other features.
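To illustrate the idea (a conceptual sketch only, not the ExPAD API; the data, the `group_outliers` helper, and the z-score threshold are all made up for this example), a value that looks normal globally can become an outlier once we condition on a decision-tree-style split:

```python
# Conceptual sketch of split-based, explainable outlier detection.
# NOT the ExPAD API: records, group_outliers, and the toy data are illustrative.
from statistics import mean, stdev

records = (
    [{"job": "student", "age": a} for a in [7, 9, 11, 13, 14, 8, 10, 12]]
    + [{"job": "president", "age": a} for a in [30, 45, 55, 60, 70, 28, 50]]
    + [{"job": "president", "age": 5}]  # only anomalous given the job split
)

def group_outliers(rows, split_var, target, z=1.5):
    """Flag rows whose target value deviates strongly from its split group."""
    groups = {}
    for r in rows:
        groups.setdefault(r[split_var], []).append(r[target])
    flagged = []
    for r in rows:
        vals = groups[r[split_var]]
        m, s = mean(vals), stdev(vals)
        if s > 0 and abs(r[target] - m) / s > z:
            flagged.append((r, f"{target}={r[target]} is unusual when {split_var}={r[split_var]}"))
    return flagged

outliers = group_outliers(records, "job", "age")
# age=5 lies inside the global age range [5, 70], yet is flagged for job=president
```

The explanation string attached to each flagged row mirrors the human-readable explanations ExPAD produces: the anomaly is stated relative to the split condition, not relative to the global distribution.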
Example Databricks Notebooks
Full usage of the pipeline components can also be found within the sample Databricks notebooks here. To run ExPAD on your cluster, you should add the following lines to the "Spark config" section:
```
spark.kryo.registrator net.sansa_stack.rdf.spark.io.JenaKryoRegistrator,net.sansa_stack.query.spark.sparqlify.KryoRegistratorSparqlify
spark.serializer org.apache.spark.serializer.KryoSerializer
```
Also, ExPAD requires JDK 11, so you should add the following line to the "Environment variables" section:

```
JNAME=zulu11-ca-amd64
```
The above steps should happen during cluster creation.
Dataset
To obtain datasets of different sizes, we implemented an RDF data simulator that generates a synthetic RDF graph. The generator models a Person class with 5 synthetic properties and their respective distributions. The dataset contains the predicates age (numerical), job (URI), pregnant (boolean), and gender (boolean) for 20,000 persons (100K triples). The dataset can be downloaded below. The following table shows the respective distributions.
| Predicate | Value Type | Example | Distribution |
|---|---|---|---|
| id | non-negative integer | {0,1,2,...} | incremental, starting from 0 |
| gender | boolean | {male, female} | 50% male, 50% female |
| job | URI | {Student, President} | 50% student, 50% president |
| pregnant | boolean | {true, false} | false if male; false if job=student; false if job=president and age>55; 40% true if job=president and age<=55 |
| age | positive integer | {1,2,3,...} | in [7,14] if job=student; in [25,70] if job=president |
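The distributions above can be sketched as a small generator. This is a hedged illustration only: the actual SANSA simulator emits RDF triples, and the function and field names below are made up for this sketch.

```python
# Illustrative re-implementation of the distribution table above.
# NOT the SANSA simulator: generate_person and its fields are hypothetical.
import random

def generate_person(pid, rng):
    gender = rng.choice(["male", "female"])      # 50% male, 50% female
    job = rng.choice(["Student", "President"])   # 50% student, 50% president
    age = rng.randint(7, 14) if job == "Student" else rng.randint(25, 70)
    if gender == "male" or job == "Student" or age > 55:
        pregnant = False
    else:  # female president with age <= 55: pregnant with 40% probability
        pregnant = rng.random() < 0.4
    return {"id": pid, "gender": gender, "job": job,
            "age": age, "pregnant": pregnant}

rng = random.Random(42)
persons = [generate_person(i, rng) for i in range(20000)]
```

Note how the pregnant predicate is fully determined by gender, job, and age for most combinations; this is exactly the kind of cross-feature dependency the anomaly detection is meant to exploit.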
Resource
We provide the full jar of this version below. More information about ExPAD can be found here.
DistAD: A Distributed Generic Anomaly Detection Framework over Large KGs
This release includes the recent developments for the DistAD framework. DistAD is an anomaly detection framework for large KGs.
Overview
This module is a generic, scalable, and distributed framework for anomaly detection on large RDF knowledge graphs. DistAD offers fine-grained control, letting end-users select from a vast number of different algorithms, methods, and (hyper-)parameters to detect outliers. The framework performs anomaly detection by extracting semantic features from entities to calculate similarity, applying clustering on the entities, and running multiple anomaly detection algorithms to detect outliers at different levels of granularity. The output of DistAD is the list of anomalous RDF triples.
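A minimal sketch of these stages on toy in-memory data (this is not the distributed DistAD implementation; the entities, predicates, and the simple IQR rule are illustrative assumptions):

```python
# Conceptual sketch of the DistAD stages: feature extraction, grouping of
# similar entities, and per-group outlier detection over RDF-like triples.
# NOT the DistAD API: all names and the toy data below are illustrative.
triples = [
    ("film/1", "runtime", 90), ("film/2", "runtime", 95),
    ("film/3", "runtime", 100), ("film/5", "runtime", 105),
    ("film/4", "runtime", 9000),  # anomalous relative to its group
    ("actor/1", "age", 35), ("actor/2", "age", 40), ("actor/3", "age", 38),
]

# Stage 1: group triples by predicate (a stand-in for similarity clustering).
clusters = {}
for s, p, o in triples:
    clusters.setdefault(p, []).append((s, p, o))

# Stage 2: within each cluster, flag values outside 1.5 * IQR of the quartiles.
def iqr_outliers(entries):
    values = sorted(o for _, _, o in entries)
    q1 = values[len(values) // 4]
    q3 = values[(3 * len(values)) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [t for t in entries if not (lo <= t[2] <= hi)]

# Output: the list of anomalous triples, as in DistAD.
anomalous = [t for entries in clusters.values() for t in iqr_outliers(entries)]
```

The key point the sketch preserves is that the output is a list of anomalous triples, judged relative to comparable entities rather than the whole graph.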
Documents
An explanatory tutorial and the full documentation can be found here and here.
Resource
We provide the full jar of this version below.
SimE4KG - Release
SimE4KG: Explainable Distributed multi-modal Semantic Similarity Estimation for Knowledge Graphs
This release includes all of the most recent developments for the SimE4KG framework.
SimE4KG is the Explainable Distributed In-Memory multi-modal Semantic Similarity Estimation framework for Knowledge Graphs.
Overview
In this release, we introduce multiple changes to the SANSA Stack to offer the SimE4KG functionalities.
The content is structured as follows:
- Databricks Notebooks
- ReadMe of novel Modules
- Novel Classes
- Unit Tests
- Data Sets
- Further Reading
SimE4KG Databricks Notebook
To showcase the usage of the SimE4KG modules in a hands-on session, we introduce multiple Databricks notebooks. They show the full pipeline as well as dedicated parts such as the SmartFeatureExtractor. Within the notebooks, you can see a mixture of explanations, sample code, and the output of the code snippets. With the notebooks, you can reproduce the functionality in your browser without needing to install the framework locally.
The Notebooks can be found here:
- SimE4KG Databricks Notebook for sample pipeline building including outputs
- SmartFeatureExtractor Databricks Notebook for multi-modal feature extraction with the novel Smart Feature Extractor
- SimE4KG Semantic Pipeline for Similarity Based Recommendations Sample Pipeline using semantified results to create recommendations
- Further use cases are under ongoing development and can be found here
ReadMe
The novel modules of SimE4KG are documented within the SANSA ML ReadMe. For quick links especially to the high-level SimE4KG Transformer and the SmartFeatureExtractor, you can use these two links:
- SimE4KG/Dasim Transformer ReadMe, the high-level similarity estimation transformer that runs the entire pipeline
- SmartFeatureExtractor ReadMe, the newly developed generic multi-modal feature extraction transformer
Novel Classes
The novel classes developed within this release are chiefly the DasimTransformer and the SmartFeatureExtractor, along with the corresponding unit tests and the evaluation scripts used to test module performance:
- DasimTransformer Class, Unit Test
- Smart Feature Extractor Class, Unit Test
- Evaluation Classes like data size scalability, feature availability evaluation, Smartfeature extractor evaluation, and many more ...
Datasets
As a starting point for experimenting with this framework, we recommend the Linked Movie Database RDF knowledge graph. This KG represents data about movies in millions of triples and contains multi-modal features: lists of URIs such as the lists of actors, numeric features such as the runtime, and timestamp data such as the release date. For unit-testing purposes, we also provide an extract of this data that follows the same schema.
Further Reading
If you are interested in further reading and background information on other related modules, we recommend the following papers:
- Distributed semantic analytics using the SANSA stack
- Sparklify: A Scalable Software Component for Efficient Evaluation of SPARQL Queries over Distributed RDF Datasets
- DistSim - Scalable Distributed in-Memory Semantic Similarity Estimation for RDF Knowledge Graphs
- DistRDF2ML - Scalable Distributed In-Memory Machine Learning Pipelines for RDF Knowledge Graphs
Other
- In addition, we provide the full jar of this version below.
DistRDF2ML & Literal2Feature Release
Within this release, we offer the recent developments of the DistRDF2ML framework as an extension of the SANSA framework. The corresponding ReadMe can be found here.
Examples
These modules are presented within pipelines that use SANSA and Spark MLlib modules to create regression, clustering, and classification pipelines:
Example Databricks Notebooks
Full usage of the pipeline components can also be found within Databricks sample notebooks:
Example Classes
Modules
This release primarily provides the following modules:
Evaluation
These modules were evaluated using the following scripts:
Unit tests
The modules are covered by the following unit tests:
Datasets
Within this release, we evaluated the modules using the Linked Movie Database and artificially created RDF movie datasets with an exponentially growing number of movies. All of these datasets are attached to this release as ZIP files.
Sample triples:
```
<http://data.linkedmdb.org/film/70> <http://purl.org/dc/terms/date> "2002-05-16" .
<http://data.linkedmdb.org/film/70> <http://www.w3.org/2000/01/rdf-schema#label> "Star Wars Episode II: Attack of the Clones" .
<http://data.linkedmdb.org/film/70> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.linkedmdb.org/movie/film> .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/runtime> "142" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/country> <http://data.linkedmdb.org/country/US> .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/actor> <http://data.linkedmdb.org/actor/17124> .
<http://data.linkedmdb.org/actor/17124> <http://data.linkedmdb.org/movie/actor_name> "Ewan McGregor" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/genre> <http://data.linkedmdb.org/film_genre/99> .
<http://data.linkedmdb.org/film_genre/99> <http://data.linkedmdb.org/movie/film_genre_name> "Science fiction" .
<http://data.linkedmdb.org/country/US> <http://data.linkedmdb.org/movie/country_population> "303824000" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/film_series> <http://data.linkedmdb.org/film_series/104> .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/director> <http://data.linkedmdb.org/director/5233> .
<http://data.linkedmdb.org/actor/17124> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.linkedmdb.org/movie/actor> .
<http://data.linkedmdb.org/actor/17124> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Ewan_McGregor> .
<http://data.linkedmdb.org/country/US> <http://www.w3.org/2000/01/rdf-schema#label> "United States (Country)" .
<http://data.linkedmdb.org/country/US> <http://data.linkedmdb.org/movie/country_areaInSqKm> "9629091"^^<http://www.w3.org/2001/XMLSchema#double> .
<http://data.linkedmdb.org/director/5233> <http://data.linkedmdb.org/movie/director_name> "George Lucas" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/actor> <http://data.linkedmdb.org/actor/15021> .
```
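For illustration (plain Python, not SANSA code), such N-Triples lines can be parsed and grouped per subject, which makes the multi-modal nature of the features visible: the same predicate may carry a list of URIs (e.g. actors) while another carries a single literal (e.g. runtime). The regex below is a simplified sketch that handles only the simple cases shown above.

```python
# Sketch: parse a few N-Triples lines and group objects per subject.
# NOT SANSA code; the simplified regex only covers URI and plain/typed
# string objects as they appear in the sample above.
import re
from collections import defaultdict

NT_LINE = re.compile(
    r'<([^>]+)>\s+<([^>]+)>\s+(<[^>]+>|"[^"]*"(?:\^\^<[^>]+>)?)\s*\.')

sample = '''
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/runtime> "142" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/actor> <http://data.linkedmdb.org/actor/17124> .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/actor> <http://data.linkedmdb.org/actor/15021> .
'''

features = defaultdict(lambda: defaultdict(list))
for line in sample.strip().splitlines():
    m = NT_LINE.match(line)
    if m:
        s, p, o = m.groups()
        # use the last path segment of the predicate as the feature name
        features[s][p.rsplit("/", 1)[-1]].append(o)

film = features["http://data.linkedmdb.org/film/70"]
# film["runtime"] holds one literal, film["actor"] a list of two URIs
```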
SANSA Backport Scala 2.11 Spark 2.x v0.2.0
Adjusted dependencies after the last release so that the whole repository compiles.
v0.8.0-RC1
Noteworthy changes and updates since the previous release are:
- Support for Ontop Based Query Engine over RDF
- Distributed Trig/Turtle record reader
- Support to write out RDDs of OWL axioms in a variety of formats.
- Distributed Data Summaries with ABstraction and STATistics (ABSTAT)
- Configurable mapping of RDDs of triples to DataFrames
- Initial support for RDD of Graphs and Datasets, executing queries on each entry and aggregating over the results
- Sparql Transformer for ML-Pipelines
- Autosparql Generation for Feature Extraction
- Distributed Feature based Semantic Similarity Estimations
- Added a common R2RML abstraction layer for Ontop, Sparqlify and possible future query engines
- Consolidated SANSA layers into a single GIT repository
- Retired the support for Apache Flink
Dependency Changes
- Apache Spark 2.4.4 → 3.0.1
- Apache Flink 1.9.1 → 1.11.2
- Apache Jena 3.13.1 → 3.17.0
Scala Compiler Level
- Scala 2.11 → Scala 2.12
SANSA-ML Backport Scala 2.11 Spark 2.x v0.1.0
Removed Scala 2.11 dependencies; added Spark-Bench and BigDL Scala 2.12 dependencies.
sansa-example-data bundle
This is a collection of data files/folders that were pushed to git uncompressed:

- sansa-notebooks-examples-data.tar.gz from sansa-stack-parent/sansa-notebooks/examples/
- LUBM_5.owl.tar.gz from sansa-owl/sansa-owl-spark/src/main/resources