Releases: SANSA-Stack/SANSA-Stack

sansa-cli-0.8.0-rc3

15 Nov 17:05

This is the first release candidate featuring a runnable jar file with a consolidated command line interface based on the sansa-cli module.

  • Most prominently, it features a Spark-based re-implementation of tarql for mapping CSV to RDF via SPARQL CONSTRUCT queries. See the Sansa Tarql documentation for details.

To run the jar, the following --add-opens declarations are needed:

#!/bin/bash
# For Linux: Save this code into sansa.sh and make it executable with chmod +x sansa.sh

SANSA_EXTRA_OPTS="--add-opens=java.base/java.lang=ALL-UNNAMED \
    --add-opens=java.base/java.lang.invoke=ALL-UNNAMED \
    --add-opens=java.base/java.lang.reflect=ALL-UNNAMED \
    --add-opens=java.base/java.io=ALL-UNNAMED \
    --add-opens=java.base/java.net=ALL-UNNAMED \
    --add-opens=java.base/java.nio=ALL-UNNAMED \
    --add-opens=java.base/java.util=ALL-UNNAMED \
    --add-opens=java.base/java.util.concurrent=ALL-UNNAMED \
    --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED \
    --add-opens=java.base/sun.nio.ch=ALL-UNNAMED \
    --add-opens=java.base/sun.nio.cs=ALL-UNNAMED \
    --add-opens=java.base/sun.security.action=ALL-UNNAMED \
    --add-opens=java.base/sun.util.calendar=ALL-UNNAMED \
    --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"

java $JAVA_OPTS $SANSA_EXTRA_OPTS -jar sansa-0.8.0-rc3.jar "$@"

ExPAD: An Explainable Distributed Automatic Anomaly Detection Framework over Large KGs

22 Sep 14:42

This release includes the recent developments for the ExPAD framework. ExPAD is the Explainable Anomaly Detection framework for large KGs.

Overview

This module is a generic, distributed, and scalable software framework that automatically detects numeric anomalies in KGs and produces human-readable explanations of why a given value of a variable in an observation can be considered an outlier. ExPAD is inspired by OutlierTree and works by evaluating and following distributed supervised decision tree splits on variables. This makes it possible to detect and explain anomalous cases that cannot be seen without considering other features.
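
The idea of an outlier that only becomes visible once another feature is taken into account can be illustrated with a minimal Python sketch. This is not ExPAD's implementation: ExPAD learns its splits with distributed decision trees, whereas here the conditioning variable is chosen by hand and a plain interquartile-range (IQR) rule is swapped in as the outlier test. The function names and the toy rows are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fence of the classic IQR outlier rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def conditional_outliers(rows, target, condition):
    """Flag values of `target` that are unusual within their `condition` group.

    The grouping stands in for a single, hand-picked tree split (an assumption).
    Returns (row, explanation) pairs.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[condition], []).append(row)

    findings = []
    for key, members in groups.items():
        low, high = iqr_bounds([m[target] for m in members])
        for m in members:
            if not low <= m[target] <= high:
                why = (f"{target}={m[target]} is outside [{low:.1f}, {high:.1f}] "
                       f"for rows where {condition}={key}")
                findings.append((m, why))
    return findings


# Toy data: a 40-year-old is unremarkable overall but unusual among students.
rows = (
    [{"job": "Student", "age": a} for a in [7, 8, 9, 10, 11, 12, 13, 14, 40]]
    + [{"job": "President", "age": a} for a in [25, 30, 38, 45, 52, 60, 70]]
)
findings = conditional_outliers(rows, target="age", condition="job")
for _, why in findings:
    print(why)
```

Applied to all ages at once, the same IQR rule does not flag the value 40; it is flagged only within the Student group, and the condition doubles as the human-readable explanation.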

Example Databricks Notebooks

Complete usage examples of the pipeline components can also be found in the Databricks sample notebooks here. To run ExPAD on your cluster, add the following lines to the "Spark config" section:

spark.kryo.registrator net.sansa_stack.rdf.spark.io.JenaKryoRegistrator,net.sansa_stack.query.spark.sparqlify.KryoRegistratorSparqlify
spark.serializer org.apache.spark.serializer.KryoSerializer

ExPAD also requires JDK 11, so add the following line to the "Environment variables" section:

JNAME=zulu11-ca-amd64

Both settings must be applied when the cluster is created.

Dataset

To produce datasets of different sizes, we implemented an RDF data simulator that generates a synthetic RDF graph. The generator models a Person class with 5 synthetic properties, each with its own distribution. The dataset contains the predicates age (numerical), job (URI), pregnant (boolean), and gender (boolean) for 20,000 persons (100K triples). The dataset can be downloaded below. The following table shows the respective distribution.

| Predicate | Value type | Example | Distribution |
|---|---|---|---|
| id | non-negative integer | {0,1,2,...} | incremental, starting from 0 |
| gender | boolean | {male,female} | 50% male, 50% female |
| job | URI | {Student,President} | 50% student, 50% president |
| pregnant | boolean | {true,false} | if male then false; if job=student then false; if job=president and age>55 then false; if job=president and age<=55 then 40% |
| age | positive integer | {1,2,3,...} | if job=student in [7,14]; if job=president in [25,70] |
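
The distribution rules in the table can be sketched as a small Python generator. This is an illustration only, not the simulator shipped with the release: the namespace `http://example.org/`, the function names, and the literal datatypes are assumptions, and the "40%" entry is read here as the probability that pregnant is true.

```python
import random

# Hypothetical namespace; the real generator's IRIs are not given in these notes.
NS = "http://example.org/"
XSD = "http://www.w3.org/2001/XMLSchema#"


def generate_person(person_id, rng):
    """Draw one synthetic person following the distribution table."""
    gender = rng.choice(["male", "female"])      # 50% / 50%
    job = rng.choice(["Student", "President"])   # 50% / 50%
    # The age range depends on the job (bounds inclusive).
    age = rng.randint(7, 14) if job == "Student" else rng.randint(25, 70)
    # Only female presidents aged <= 55 can be pregnant, with probability 0.4
    # (assumed reading of the "40%" entry).
    eligible = gender == "female" and job == "President" and age <= 55
    pregnant = eligible and rng.random() < 0.4
    return {"id": person_id, "gender": gender, "job": job,
            "pregnant": pregnant, "age": age}


def to_ntriples(person):
    """Serialize one person as five N-Triples lines, one per predicate."""
    s = f"<{NS}person/{person['id']}>"
    return [
        f'{s} <{NS}id> "{person["id"]}"^^<{XSD}nonNegativeInteger> .',
        f'{s} <{NS}gender> "{person["gender"]}" .',
        f'{s} <{NS}job> <{NS}{person["job"]}> .',
        f'{s} <{NS}pregnant> "{str(person["pregnant"]).lower()}"^^<{XSD}boolean> .',
        f'{s} <{NS}age> "{person["age"]}"^^<{XSD}positiveInteger> .',
    ]


def generate_dataset(n, seed=0):
    """Generate n persons with ids 0..n-1, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [generate_person(i, rng) for i in range(n)]


persons = generate_dataset(20_000, seed=42)
triples = [line for p in persons for line in to_ntriples(p)]
print(len(triples))  # 100000: five triples for each of the 20,000 persons
```

Changing the first argument of `generate_dataset` scales the graph, which is the purpose of having a simulator rather than a fixed file.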

Resource

We provide the full jar of this version below. More information about ExPAD can be found here.

DistAD: A Distributed Generic Anomaly Detection Framework over Large KGs

09 Jan 22:07

This release includes the recent developments for the DistAD framework.
DistAD is the Anomaly Detection framework for large KGs.

Overview

This module is a generic, scalable, and distributed framework for anomaly detection on large RDF knowledge graphs. DistAD gives end users fine-grained control: they can choose among many different algorithms, methods, and (hyper-)parameters to detect outliers. The framework performs anomaly detection by extracting semantic features from entities to calculate similarity, clustering the entities, and running multiple anomaly detection algorithms to detect outliers at different levels of granularity. The output of DistAD is the list of anomalous RDF triples.

Documents

A tutorial and the full documentation can be found here and here.

Resource

We provide the full jar of this version below.

SimE4KG - Release

26 Nov 10:06
Pre-release

SimE4KG: Explainable Distributed multi-modal Semantic Similarity Estimation for Knowledge Graphs

This release includes the most recent developments for the SimE4KG framework.
SimE4KG is the Explainable Distributed In-Memory multi-modal Semantic Similarity Estimation for Knowledge Graphs.

Overview

In this release, we introduce multiple changes to the SANSA Stack to offer the SimE4KG functionality.
The content is structured as follows:

  • Databricks Notebooks
  • ReadMe of novel Modules
  • Novel Classes
  • Unit Tests
  • Data Sets
  • Further Reading

SimE4KG Databricks Notebook

To demonstrate the SimE4KG modules hands-on, we provide several Databricks notebooks. They cover the full pipeline as well as dedicated parts such as the SmartFeatureExtractor. Each notebook combines explanations, sample code, and the output of the code snippets, so you can reproduce the functionality in your browser without installing the framework locally.
The notebooks can be found here:

ReadMe

The novel modules of SimE4KG are documented in the SANSA ML ReadMe. The following two links lead directly to the high-level SimE4KG Transformer and the SmartFeatureExtractor:

Novel Classes

The main novel classes in this release are the DasimTransformer and the SmartFeatureExtractor, together with their unit tests and the evaluation scripts used to test module performance:

  • DasimTransformer Class, Unit Test
  • SmartFeatureExtractor Class, Unit Test
  • Evaluation classes such as data size scalability, feature availability evaluation, SmartFeatureExtractor evaluation, and many more

Datasets

As a starting point for experimenting with this framework, we recommend the Linked Movie Database RDF knowledge graph. This KG describes movies in millions of triples and contains multi-modal features: lists of URIs such as the list of actors, numeric features such as the runtime, and timestamp data such as the release date. For unit tests, we also provide an extract of this data that follows the same schema.

Further Reading

If you are interested in further reading and background information on other related modules, we recommend the following papers:

Other

  • In addition, we provide the full jar of this version below.

DistRDF2ML & Literal2Feature Release

19 Jun 14:16
31ce79d
Pre-release

This release offers the recent developments of the DistRDF2ML framework as an extension of the SANSA framework. The corresponding ReadMe can be found here.

Examples

These modules are presented in pipelines that use SANSA and Spark MLlib modules to create regression, clustering, and classification pipelines:

Example Databricks Notebooks

Full usage of the pipeline components can also be found within Databricks sample notebooks:

Example Classes

Modules

This release mainly provides the following modules:

Evaluation

These modules were evaluated with the following scripts:

Unit tests

This release mainly provides the following modules:

Datasets

In this release, we evaluated the modules using the Linked Movie Database and artificially created RDF movie datasets in which the number of movies grows exponentially. All of these datasets are attached to this release as ZIP files.

Sample triples:

<http://data.linkedmdb.org/film/70> <http://purl.org/dc/terms/date> "2002-05-16" .
<http://data.linkedmdb.org/film/70> <http://www.w3.org/2000/01/rdf-schema#label> "Star Wars Episode II: Attack of the Clones" .
<http://data.linkedmdb.org/film/70> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.linkedmdb.org/movie/film> .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/runtime> "142" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/country> <http://data.linkedmdb.org/country/US> .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/actor> <http://data.linkedmdb.org/actor/17124> .
<http://data.linkedmdb.org/actor/17124> <http://data.linkedmdb.org/movie/actor_name> "Ewan McGregor" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/genre> <http://data.linkedmdb.org/film_genre/99> .
<http://data.linkedmdb.org/film_genre/99> <http://data.linkedmdb.org/movie/film_genre_name> "Science fiction" .
<http://data.linkedmdb.org/country/US> <http://data.linkedmdb.org/movie/country_population> "303824000" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/film_series> <http://data.linkedmdb.org/film_series/104> .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/director> <http://data.linkedmdb.org/director/5233> .
<http://data.linkedmdb.org/actor/17124> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data.linkedmdb.org/movie/actor> .
<http://data.linkedmdb.org/actor/17124> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Ewan_McGregor> .
<http://data.linkedmdb.org/country/US> <http://www.w3.org/2000/01/rdf-schema#label> "United States (Country)" .
<http://data.linkedmdb.org/country/US> <http://data.linkedmdb.org/movie/country_areaInSqKm> "9629091"^^<http://www.w3.org/2001/XMLSchema#double> .
<http://data.linkedmdb.org/director/5233> <http://data.linkedmdb.org/movie/director_name> "George Lucas" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/actor> <http://data.linkedmdb.org/actor/15021> .
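
To show how such triples carry the multi-modal features of an entity (a date, a numeric runtime, a list of actor URIs), here is a minimal Python sketch that groups a few of the sample lines by subject. It is not part of DistRDF2ML or the SANSA API: the regular expression covers only the simple N-Triples shapes shown above (no blank nodes or language tags), and the helper names are hypothetical. For real data, a full RDF parser such as Apache Jena is the safer choice.

```python
import re
from collections import defaultdict

# Matches: <subject> <predicate> (<iri> | "literal"[^^<datatype>]) .
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+'
    r'(?:<([^>]*)>|"((?:[^"\\]|\\.)*)"(?:\^\^<([^>]*)>)?)\s*\.\s*$'
)


def parse_line(line):
    """Return (subject, predicate, object_value) or None if the line does not match."""
    match = TRIPLE.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj_iri, literal, _datatype = match.groups()
    return subject, predicate, obj_iri if obj_iri is not None else literal


def features_by_subject(lines):
    """Collect object values per subject, keyed by the predicate's local name."""
    features = defaultdict(lambda: defaultdict(list))
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        subject, predicate, value = parsed
        local_name = re.split(r"[/#]", predicate)[-1]
        features[subject][local_name].append(value)
    return features


sample = """\
<http://data.linkedmdb.org/film/70> <http://purl.org/dc/terms/date> "2002-05-16" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/runtime> "142" .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/actor> <http://data.linkedmdb.org/actor/17124> .
<http://data.linkedmdb.org/film/70> <http://data.linkedmdb.org/movie/actor> <http://data.linkedmdb.org/actor/15021> .
<http://data.linkedmdb.org/country/US> <http://data.linkedmdb.org/movie/country_areaInSqKm> "9629091"^^<http://www.w3.org/2001/XMLSchema#double> .
""".splitlines()

features = features_by_subject(sample)
film = features["http://data.linkedmdb.org/film/70"]
print(film["date"], film["runtime"], len(film["actor"]))
```

Each subject ends up with one list per predicate, so single-valued features (runtime) and multi-valued ones (actor) share the same structure.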

SANSA Backport Scala 2.11 Spark 2.x v0.2.0

10 Jun 09:21
Adjust dependencies after last release to make the whole repo compile

v0.8.0-RC1

19 Mar 00:09

Noteworthy changes and updates since the previous release are:

  • Support for an Ontop-based query engine over RDF
  • Distributed TriG/Turtle record reader
  • Support for writing out RDDs of OWL axioms in a variety of formats
  • Distributed Data Summaries with ABstraction and STATistics (ABSTAT)
  • Configurable mapping of RDD of triples dataframes
  • Initial support for RDDs of Graphs and Datasets, executing queries on each entry and aggregating over the results
  • SPARQL Transformer for ML pipelines
  • AutoSPARQL generation for feature extraction
  • Distributed feature-based semantic similarity estimations
  • Added a common R2RML abstraction layer for Ontop, Sparqlify, and possible future query engines
  • Consolidated SANSA layers into a single Git repository
  • Retired the support for Apache Flink

Dependency Changes

  • Apache Spark 2.4.4 → 3.0.1
  • Apache Flink 1.9.1 → 1.11.2
  • Apache Jena 3.13.1 → 3.17.0

Scala Compiler Level

  • Scala 2.11 → Scala 2.12

SANSA-ML Backport Scala 2.11 Spark 2.x v0.1.0

20 Nov 12:41
get rid of Scala 2.11 deps

Added Spark-Bench and BigDL Scala 2.12 dependencies

sansa-example-data bundle

30 Oct 09:38
Pre-release
  • This is a collection of data files/folders that were pushed to Git uncompressed

  • sansa-notebooks-examples-data.tar.gz from sansa-stack-parent/sansa-notebooks/examples/

  • LUBM_5.owl.tar.gz from sansa-owl/sansa-owl-spark/src/main/resources

DistSim Release

11 Oct 16:42
66fd602
Pre-release

This release presents the DistSim modules.
DistSim (Scalable Distributed In-Memory Semantic Similarity Estimation for RDF Knowledge Graph Frameworks) has been integrated into the SANSA stack in the SANSA Machine Learning package.

  • Documentation is available on GitHub Pages as Scaladoc: https://sansa-stack.github.io/SANSA-Stack/
  • The evaluation was run on a cluster via spark-submit. The resulting data are attached to this release as well. The attachments are:
    • Experiment Jar
    • Experiment Sample Datasets
    • Evaluation and Approach figures
    • Python notebook creating the figures
    • Processing time datasets