Skip to content
This repository has been archived by the owner on Nov 3, 2022. It is now read-only.

Latest commit

 

History

History
234 lines (194 loc) · 9.36 KB

README.md

File metadata and controls

234 lines (194 loc) · 9.36 KB

eLENS Miner System

License Build Status Python 3.6 Platform

The eLENS miner system retrieves, processes and analyzes legal documents and maps them to specific geographical areas.

The system follows the microservice architecture and is written in Python 3. It consists of the following microservices:

  • Document Retrieval. The service responsible for providing documents based on the user's query. It leverages query expansion to improve the query results.

  • Document Similarity. This service calculates the semantic similarity of the documents and can provide a list of most similar documents to a user selected one. Here, we integrate state-of-the-art methods using word and document embeddings to capture the semantic meaning of the documents and use it to compare the documents.

  • Text Embeddings. The service is a collection of text embedding methods. For a given text it generates the text embedding which is then used in the previous microservices.

  • Entrypoint. This service is the interface and connects the previous microservices together. It is the entrypoint for the users to access the services.

Prerequisites

You may want to create separate virtual environments for each of the microservices or you can create one for all of them. We advise to use virtual environments if you are developing multiple projects with Python, due to clashing of dependencies between projects. (Suppose one project only supports numpy < 1.0 and the other needs numpy=1.5).

To create a virtual environment navigate to the desired directory (usually the main folder of the project) and write

python -m venv venv

To activate this virtual environment navigate into venv/Scripts and then execute activate. To deactivate a virtual environment execute deactivate.

You can see that your virtual environment is being used if you see (venv) before the command line.

Each microservice must be run separately. Each service can be used for themself or one can employ the entrypoint microservice that connects all of the microservices together.

What follows is a short description of how to run each microservice. A more detailed description of the microservice can be found in their designated folders.

Text Embeddings Microservice

Currently you are able to run only one version of the text embedding so that it will be connected to the main component. But later you will be able to connect more.

  • Activate virtual environment if you wish to do so
  • Navigate into text_embeddings folder
  • Execute
    pip install -r requirements.txt
  • Run
    python -m nltk.downloader all
  • Place a copy of your word2vec or fasttext word embeddings in the data/embeddings folder
  • Navigate back to the base of the text_embeddings folder and run the service with
    # linux or mac
    python -m text_embedding.main start \
           -e production \
           -H localhost \
           -p 4001 \
           -mp (path to the model) \
           -ml (language of the model)
    
    # windows
    python -m text_embedding.main start -e production -H localhost -p 4001 -mp (path to the model) -ml (language of the model)

Document Retrieval Microservice

  • Activate virtual environment if you wish to do so
  • Navigate into document_retrieval folder
  • Execute
    pip install -r requirements.txt
  • Navigate into microservice/config folder
  • Create .env file and inside define the following variables:
    PROD_PG_DATABASE=
    PROD_PG_USERNAME=
    PROD_PG_PASSWORD=
    PROD_TEXT_EMBEDDING_HOST=
    PROD_TEXT_EMBEDDING_PORT=
    
    DEV_PG_DATABASE=
    DEV_PG_USERNAME=
    DEV_PG_PASSWORD=
    DEV_TEXT_EMBEDDING_HOST=
    DEV_TEXT_EMBEDDING_PORT=
  • Navigate to the base of document_retrieval folder and run the service with:
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4100
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4100

If you want you can also run the service on custom host and port.

Document Similarity Microservice

  • Activate virtual environment if you wish to do so
  • Navigate into document_similarity folder
  • Execute
    pip install -r requirements.txt
    
  • Navigate into microservice/config folder
  • Create a .env file with the following variables
    PROD_DATABASE_NAME =
    PROD_DATABASE_USER =
    PROD_DATABASE_PASSWORD =
    PROD_TEXT_EMBEDDING_URL =
    
    DEV_DATABASE_NAME =
    DEV_DATABASE_USER =
    DEV_DATABASE_PASSWORD =
    DEV_TEXT_EMBEDDING_URL =
    
  • Set the text embedding url to http://{HOST}:{PORT}/api/v1/embeddings/create where HOST and PORT are the values used to run text embedding microservice
  • Navigate back into the base of the document_similarity folder and run the service with
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4200
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4200

You can also use custom host and port.

Entrypoint

  • Activate virtual environment if you wish to do so
  • Navigate into entrypoint folder
  • Run
    pip install -r requirements.txt
    
  • Navigate into microservice/config folder
  • Create .env file with contents
    DEV_DATABASE_USER =
    DEV_DATABASE_HOST =
    DEV_DATABASE_PORT =
    DEV_DATABASE_PASSWORD =
    DEV_DATABASE_NAME =
    
    PROD_DATABASE_USER =
    PROD_DATABASE_HOST =
    PROD_DATABASE_PORT =
    PROD_DATABASE_PASSWORD =
    PROD_DATABASE_NAME =
    
  • Navigame back into entrypoint folder
  • Run the main service with
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4500
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4500
    However if you routed other microservices to different hosts/ports, you can provide this values in the following way:
    # linux or mac
    python -m microservice.main start -H localhost -p 4500 \
      -teh {host of the text embedding microservice} \
      -tep {port of the text embedding microservice} \
      -drh {host of the document retrieval microservice} \
      -drp {port of the document retrieval microservice} \
      -dsh {host of the document similarity microservice} \
      -dsp {port of the document similarity microservice}
    
    # windows
    python -m microservice.main start -H localhost -p 4500 -teh {host of the text embedding microservice} -tep {port of the text embedding microservice} -drh {host of the document retrieval microservice} -drp {port of the document retrieval microservice} -dsh {host of the document similarity microservice} -dsp {port of the document similarity microservice}

Usage:

Available endpoints:

  • GET {HOST}/{PORT}/api/v1/documents/search query_params query, m

    • query -> your text query
    • m -> number of results

    Example request:

    {BASE_URL}/api/v1/documents/search?query=deforestation&m=10 You will receive top 10 documents similar to query "deforestation".

  • GET {HOST}/{PORT}/api/v1/documents/<document_id>/similar query_params get_k

    • document_id -> id of the document
    • get_k -> number of results

    Example request:

    {BASE_URL}/api/v1/documents/123/similar?get_k=5 You will receive 5 of the most similar documents to document with id 123.

  • POST {HOST}/{PORT}/api/v1/documents/<document_id>/similarity_update

    • document_id -> id of the document

    Example request:

    {BASE_URL}/api/v1/documents/similarity_update Recalculates similarities of the document with the given id to the other documents.

  • GET {HOST}/{PORT}/api/v1/embeddings/create query_params text, language

    • text -> your text
    • language -> language of the text

    Example request:

    {BASE_URL}/api/v1/embedding/create?text=ice cream&language=en You will receive the embedding of the text "ice cream" from the english word embedding model.

  • GET {HOST}/{PORT}/api/v1/documents query_params document_ids

    • document_ids : (comma separated document ids)

    Example request:

    {BASE_URL}/api/v1/documents?document_ids=1,3,17 With the GET request at this endpoint you will receive documents data for documents ids 1, 3 and 17.

  • GET {HOST}/{PORT}/api/v1/documents/<document_id>

    • document_id : (id of the document)

    Example request:

    {BASE_URL}/api/v1/documents/3 With the GET request at this endpoint you will receive documents data for document with id 3.

Acknowledgments

This work is developed by AILab at Jozef Stefan Institute.

The work is supported by the EnviroLENS project, a project that demonstrates and promotes the use of Earth observation as direct evidence for environmental law enforcement, including in a court of law and in related contractual negotiations.