librairy/demo


A step-by-step guide to get the most out of librAIry

Requirements

Docker Engine and Docker Compose are required.
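You can quickly check that both tools are available on your PATH before starting (a minimal check; the exact output of each tool varies by version):

```shell
# Verify that Docker Engine and Docker Compose are installed and on the PATH.
for tool in docker docker-compose; do
  if command -v "$tool" > /dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT found -- install it before continuing"
  fi
done
```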

Installation

  1. Clone this project:
    git clone https://github.com/librairy/demo.git

  2. Move into the project folder:
    cd demo

  3. Download the Docker images:
    docker-compose pull

  4. And run it!
    docker-compose up
    

Services

NLP Analysis [ tool ]

An efficient and easy way to analyze large amounts of multilingual text through standard HTTP and TCP APIs.

Built on top of several open-source NLP tools, it offers:

  • Part-of-speech tagging (and filtering)
  • Stemming (lemmas)
  • Named-entity recognition
  • Wikipedia relations
  • WordNet synsets
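As a sketch of what a call to the analysis service might look like, the snippet below builds a small JSON payload and validates it locally. Note that the field names ("text", "lang") and the endpoint in the commented curl line are hypothetical, introduced here only for illustration; check the NLP service's own documentation for the real API.

```shell
# Build a sample request body for the NLP analysis service.
# NOTE: the field names below and the endpoint in the commented curl line
# are hypothetical -- consult the service docs for the actual API.
cat > /tmp/nlp-request.json <<'EOF'
{
  "text": "The European Parliament approved the directive in Strasbourg.",
  "lang": "en"
}
EOF
# Validate the JSON locally before sending it anywhere.
python3 -m json.tool /tmp/nlp-request.json > /dev/null && echo "payload OK"
# With the stack running, the request might look like:
# curl -H "Content-Type: application/json" \
#      -d @/tmp/nlp-request.json http://localhost:<nlp-port>/<analyze-endpoint>
```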

Corpus Exploration [ dashboard ]

Analyze document collections to discover the main hidden themes in their texts and create learning models that can be explored through HTTP Restful interfaces. These models can be used for large-scale document classification and information retrieval tasks.

librAIry natively supports indexing structured documents in CSV or JSONL, even when GZIP-compressed. PDF documents, or any other format supported by the Solr load interface, are also accepted.

Let's index a subset of the JRC-Acquis corpus. This is just an HTTP POST request to the /documents service (the default user:password is demo:2019) with the following JSON (set your email address to be notified when all documents have been indexed):

{
  "contactEmail": "<your@email.com>",
  "dataSink": {
    "format": "SOLR_CORE",
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": {
      "id": "0",
      "name": "1",
      "labels": ["2"],
      "text": ["3"]
    },
    "filter": ",",
    "format": "CSV",
    "offset": 1,
    "size": -1,
    "url": "/librairy/batch/jrc-en.csv"
  }
}
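Saving that JSON to a file, the request can be sent with curl. The host and port of the /documents service are an assumption here (they depend on your docker-compose configuration); the credentials are the demo defaults mentioned above.

```shell
# Save the indexing request shown above (abridged placeholders kept) to a file.
cat > /tmp/index-request.json <<'EOF'
{
  "contactEmail": "<your@email.com>",
  "dataSink": {
    "format": "SOLR_CORE",
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": { "id": "0", "name": "1", "labels": ["2"], "text": ["3"] },
    "filter": ",",
    "format": "CSV",
    "offset": 1,
    "size": -1,
    "url": "/librairy/batch/jrc-en.csv"
  }
}
EOF
# Validate the JSON locally before sending it.
python3 -m json.tool /tmp/index-request.json > /dev/null && echo "payload OK"
# With the stack up, POST it (host/port are an assumption -- adjust to your setup):
# curl -u demo:2019 -H "Content-Type: application/json" \
#      -d @/tmp/index-request.json http://localhost:<port>/documents
```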

Statistics and some graphs about the corpus are available in the dashboard.

Now a probabilistic topic model, wrapped in an HTTP RESTful API, will be created from these documents and published on Docker Hub.

Send the following HTTP POST request to the /topics service (a Docker Hub account is required):

{
  "name": "my-first-model",
  "description": "Collection of legislative texts (EN) from the European Union generated between years 1958 and 2006",
  "contactEmail": "<your@email.com>",
  "version": "<model-version>",
  "docker": {
    "email": "<dockerHub-account>",
    "password": "<dockerHub-password>",
    "repository": "<dockerHub-account>/<model-name>",
    "user": "<dockerHub-username>"
  },
  "parameters": {
    "topics": "20"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": {
      "id": "id",
      "text": ["txt_t"]
    },
    "filter": "source_s:jrc && lang_s:en",
    "format": "SOLR_CORE",
    "offset": 0,
    "size": -1,
    "url": "http://librairy-repo:8983/solr/documents"
  }
}
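As with indexing, this is a plain HTTP POST. The sketch below uses an abridged payload for illustration only (in practice, send the full JSON above, including the docker credentials); the host and port of the /topics service are placeholders, since they depend on your docker-compose setup.

```shell
# Save an abridged model-training request; the real request must include the
# full JSON shown above (docker credentials, dataSource, etc.).
cat > /tmp/topics-request.json <<'EOF'
{
  "name": "my-first-model",
  "contactEmail": "<your@email.com>",
  "version": "<model-version>",
  "parameters": { "topics": "20" }
}
EOF
# Validate the JSON locally before sending it.
python3 -m json.tool /tmp/topics-request.json > /dev/null && echo "payload OK"
# With the stack up, POST it (host/port are an assumption -- adjust to your setup):
# curl -u demo:2019 -H "Content-Type: application/json" \
#      -d @/tmp/topics-request.json http://localhost:<port>/topics
```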

Once the notification email has been received, a Docker image serving the topics discovered in the corpus will be available on Docker Hub. It can be started with:

docker run -it --rm -p 8585:7777 <docker-account>/<model-name>:<model-version>

Thematic Annotations [ service ]

Categorize texts with labels learned from them or from a different corpus. Our annotators are designed to generate annotations for each item inside large collections of textual documents, in a way that is computationally affordable and enables a semantics-aware exploration of the knowledge inside.

Let's annotate the corpus with the JRC model created from EuroVoc categories. Send the following HTTP POST request to http://localhost:8081/annotations:

{
  "contactEmail": "<your@email.com>",
  "dataSink": {
    "format": "SOLR_CORE",
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": {
      "id": "id",
      "name": "name_s",
      "text": ["txt_t"]
    },
    "filter": "source_s:jrc && lang_s:en",
    "format": "SOLR_CORE",
    "offset": 0,
    "size": -1,
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "modelEndpoint": "http://librairy.linkeddata.es/jrc-en-model"
}
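Since the endpoint for this step is given explicitly, the request can be sent directly with curl. The sketch below assumes the same demo:2019 credentials as the /documents service; with the docker-compose stack stopped, curl simply reports that the service is unreachable.

```shell
# Save the annotation request shown above to a file.
cat > /tmp/annotations-request.json <<'EOF'
{
  "contactEmail": "<your@email.com>",
  "dataSink": {
    "format": "SOLR_CORE",
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": { "id": "id", "name": "name_s", "text": ["txt_t"] },
    "filter": "source_s:jrc && lang_s:en",
    "format": "SOLR_CORE",
    "offset": 0,
    "size": -1,
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "modelEndpoint": "http://librairy.linkeddata.es/jrc-en-model"
}
EOF
# POST it; the demo:2019 credentials are an assumption carried over from /documents.
curl -s -u demo:2019 -H "Content-Type: application/json" \
     -d @/tmp/annotations-request.json http://localhost:8081/annotations \
  || echo "service not reachable -- is the docker-compose stack running?"
```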

Documents are annotated by models in their respective languages, and librAIry links documents across multilingual models.

Cross-lingual Similarity [ browser ]

Texts are linked by their semantic similarity through cross-lingual labels and hierarchies of multilingual concepts. Documents from multi-language corpora can be browsed and related efficiently, without the need for translation. They are described by hash codes that preserve the notion of topics and group similar documents together.

Documents must be annotated before they can be semantically related. Using the Corpus Browser, you can navigate the corpora through semantically similar documents, regardless of language, filtering out those that do not meet specific criteria.
