librairy/demo


A step-by-step guide to get the most out of librAIry

Requirements

Docker Engine and Docker Compose are required.
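You can quickly check that both tools are available on your PATH before starting (a minimal check; the exact output of each tool varies by version):

```shell
# Verify that Docker Engine and Docker Compose are installed and on the PATH.
for tool in docker docker-compose; do
  if command -v "$tool" > /dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT found -- install it before continuing"
  fi
done
```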

Installation

  1. Clone this project:
    git clone https://github.com/librairy/demo.git

  2. Move into the project folder:
    cd demo

  3. Download the Docker images:
    docker-compose pull

  4. And run it!
    docker-compose up
    

Services

NLP Analysis [ tool ]

An efficient and easy way to analyze large amounts of multilingual text through standard HTTP and TCP APIs.

Built on top of several open-source NLP tools, it offers:

  • Part-of-speech tagging (and filtering)
  • Stemming (lemmas)
  • Named-entity recognition
  • Wikipedia relations
  • WordNet synsets
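As a sketch of what a call to the analysis service might look like, the snippet below builds a small JSON payload and validates it locally. Note that the field names ("text", "lang") and the endpoint in the commented curl line are hypothetical, introduced here only for illustration; check the NLP service's own documentation for the real API.

```shell
# Build a sample request body for the NLP analysis service.
# NOTE: the field names below and the endpoint in the commented curl line
# are hypothetical -- consult the service docs for the actual API.
cat > /tmp/nlp-request.json <<'EOF'
{
  "text": "The European Parliament approved the directive in Strasbourg.",
  "lang": "en"
}
EOF
# Validate the JSON locally before sending it anywhere.
python3 -m json.tool /tmp/nlp-request.json > /dev/null && echo "payload OK"
# With the stack running, the request might look like:
# curl -H "Content-Type: application/json" \
#      -d @/tmp/nlp-request.json http://localhost:<nlp-port>/<analyze-endpoint>
```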

Corpus Exploration [ dashboard ]

Analyze document collections to discover the main hidden themes in their texts and create learning models that can be explored through HTTP Restful interfaces. These models can be used for large-scale document classification and information retrieval tasks.

librAIry natively supports indexing structured documents in CSV or JSONL, even when GZIP-compressed. PDF documents, or any other format supported by the Solr load interface, are also accepted.

Let's index a subset of the JRC-Acquis corpus. This is just an HTTP POST request to the /documents service (the default user:password is demo:2019) with the following JSON (set your email address to be notified when all documents have been indexed):

{
  "contactEmail": "<your@email.com>",
  "dataSink": {
    "format": "SOLR_CORE",
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": {
      "id": "0",
      "name": "1",
      "labels": ["2"],
      "text": ["3"]
    },
    "filter": ",",
    "format": "CSV",
    "offset": 1,
    "size": -1,
    "url": "/librairy/batch/jrc-en.csv"
  }
}
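Saving that JSON to a file, the request can be sent with curl. The host and port of the /documents service are an assumption here (they depend on your docker-compose configuration); the credentials are the demo defaults mentioned above.

```shell
# Save the indexing request shown above (abridged placeholders kept) to a file.
cat > /tmp/index-request.json <<'EOF'
{
  "contactEmail": "<your@email.com>",
  "dataSink": {
    "format": "SOLR_CORE",
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": { "id": "0", "name": "1", "labels": ["2"], "text": ["3"] },
    "filter": ",",
    "format": "CSV",
    "offset": 1,
    "size": -1,
    "url": "/librairy/batch/jrc-en.csv"
  }
}
EOF
# Validate the JSON locally before sending it.
python3 -m json.tool /tmp/index-request.json > /dev/null && echo "payload OK"
# With the stack up, POST it (host/port are an assumption -- adjust to your setup):
# curl -u demo:2019 -H "Content-Type: application/json" \
#      -d @/tmp/index-request.json http://localhost:<port>/documents
```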

Statistics and some graphs about the corpus are available in the dashboard.

Now a probabilistic topic model, wrapped in an HTTP RESTful API, will be created from these documents and published on Docker Hub.

Send the following HTTP POST request to the /topics service (a Docker Hub account is required):

{
  "name": "my-first-model",
  "description": "Collection of legislative texts (EN) from the European Union generated between years 1958 and 2006",
  "contactEmail": "<your@email.com>",
  "version": "<model-version>",
  "docker": {
    "email": "<dockerHub-account>",
    "password": "<dockerHub-password>",
    "repository": "<dockerHub-account>/<model-name>",
    "user": "<dockerHub-username>"
  },
  "parameters": {
    "topics": "20"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": {
      "id": "id",
      "text": ["txt_t"]
    },
    "filter": "source_s:jrc && lang_s:en",
    "format": "SOLR_CORE",
    "offset": 0,
    "size": -1,
    "url": "http://librairy-repo:8983/solr/documents"
  }
}
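As with indexing, this is a plain HTTP POST. The sketch below uses an abridged payload for illustration only (in practice, send the full JSON above, including the docker credentials); the host and port of the /topics service are placeholders, since they depend on your docker-compose setup.

```shell
# Save an abridged model-training request; the real request must include the
# full JSON shown above (docker credentials, dataSource, etc.).
cat > /tmp/topics-request.json <<'EOF'
{
  "name": "my-first-model",
  "contactEmail": "<your@email.com>",
  "version": "<model-version>",
  "parameters": { "topics": "20" }
}
EOF
# Validate the JSON locally before sending it.
python3 -m json.tool /tmp/topics-request.json > /dev/null && echo "payload OK"
# With the stack up, POST it (host/port are an assumption -- adjust to your setup):
# curl -u demo:2019 -H "Content-Type: application/json" \
#      -d @/tmp/topics-request.json http://localhost:<port>/topics
```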

Once the notification email has been received, a Docker image serving the topics discovered in the corpus will be available on Docker Hub. It can be started with:

docker run -it --rm -p 8585:7777 <docker-account>/<model-name>:<model-version>

Thematic Annotations [ service ]

Categorize texts with labels learned from them or from a different corpus. Our annotators are designed to generate annotations for each item inside large collections of textual documents, in a way that is computationally affordable and enables a semantics-aware exploration of the knowledge inside.

Let's annotate the corpus with the JRC model created from EuroVoc categories. Send the following HTTP POST request to http://localhost:8081/annotations:

{
  "contactEmail": "<your@email.com>",
  "dataSink": {
    "format": "SOLR_CORE",
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": {
      "id": "id",
      "name": "name_s",
      "text": ["txt_t"]
    },
    "filter": "source_s:jrc && lang_s:en",
    "format": "SOLR_CORE",
    "offset": 0,
    "size": -1,
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "modelEndpoint": "http://librairy.linkeddata.es/jrc-en-model"
}
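Since the endpoint for this step is given explicitly, the request can be sent directly with curl. The sketch below assumes the same demo:2019 credentials as the /documents service; with the docker-compose stack stopped, curl simply reports that the service is unreachable.

```shell
# Save the annotation request shown above to a file.
cat > /tmp/annotations-request.json <<'EOF'
{
  "contactEmail": "<your@email.com>",
  "dataSink": {
    "format": "SOLR_CORE",
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "dataSource": {
    "name": "jrc",
    "dataFields": { "id": "id", "name": "name_s", "text": ["txt_t"] },
    "filter": "source_s:jrc && lang_s:en",
    "format": "SOLR_CORE",
    "offset": 0,
    "size": -1,
    "url": "http://librairy-repo:8983/solr/documents"
  },
  "modelEndpoint": "http://librairy.linkeddata.es/jrc-en-model"
}
EOF
# POST it; the demo:2019 credentials are an assumption carried over from /documents.
curl -s -u demo:2019 -H "Content-Type: application/json" \
     -d @/tmp/annotations-request.json http://localhost:8081/annotations \
  || echo "service not reachable -- is the docker-compose stack running?"
```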

Documents are annotated by models in their respective languages, and librAIry links documents across multilingual models.

Cross-lingual Similarity [ browser ]

Texts are linked by their semantic similarity through cross-lingual labels and hierarchies of multilingual concepts. Documents from multi-language corpora can be browsed and related efficiently, without the need for translation. They are described by hash codes that preserve the notion of topics and group similar documents together.

Documents must be annotated before they can be semantically related. Using the Corpus Browser, you can navigate the corpora through semantically similar documents, regardless of language, filtering out those that do not meet specific criteria.
