The eLENS miner system retrieves, processes and analyzes legal documents and maps them to specific geographical areas.
The system follows the microservice architecture and is written in Python 3. It consists of the following microservices:
-
Document Retrieval. The service responsible for providing documents based on the user's query. It leverages query expansion to improve the query results.
-
Document Similarity. This service calculates the semantic similarity of the documents and can provide a list of most similar documents to a user selected one. Here, we integrate state-of-the-art methods using word and document embeddings to capture the semantic meaning of the documents and use it to compare the documents.
-
Text Embeddings. The service is a collection of text embedding methods. For a given text it generates the text embedding which is then used in the previous microservices.
-
Entrypoint. This service is the interface and connects the previous microservices together. It is the entrypoint for the users to access the services.
You may want to create separate virtual environments for each of the microservices or you can create one for all of them. We advise to use virtual environments if you are developing multiple projects with Python, due to clashing of dependencies between projects. (Suppose one project only supports numpy < 1.0 and the other needs numpy=1.5).
To create a virtual environment navigate to the desired directory (usually the main folder of the project) and write
python -m venv venv
To activate this virtual environment navigate into venv/Scripts and then execute activate
. To deactivate a virtual environment execute deactivate
.
You can see that your virtual environment is being used if you see (venv) before the command line.
Each microservice must be run separately. Each service can be used for themself or one can employ the entrypoint
microservice that connects all of the microservices together.
What follows is a short description of how to run each microservice. A more detailed description of the microservice can be found in their designated folders.
Currently you are able to run only one version of the text embedding so that it will be connected to the main component. But later you will be able to connect more.
- Activate virtual environment if you wish to do so
- Navigate into
text_embeddings
folder - Execute
pip install -r requirements.txt
- Run
python -m nltk.downloader all
- Place a copy of your
word2vec
orfasttext
word embeddings in the data/embeddings folder - Navigate back to the base of the
text_embeddings
folder and run the service with# linux or mac python -m text_embedding.main start \ -e production \ -H localhost \ -p 4001 \ -mp (path to the model) \ -ml (language of the model) # windows python -m text_embedding.main start -e production -H localhost -p 4001 -mp (path to the model) -ml (language of the model)
- Activate virtual environment if you wish to do so
- Navigate into
document_retrieval
folder - Execute
pip install -r requirements.txt
- Navigate into
microservice/config
folder - Create
.env
file and inside define the following variables:PROD_PG_DATABASE= PROD_PG_USERNAME= PROD_PG_PASSWORD= PROD_TEXT_EMBEDDING_HOST= PROD_TEXT_EMBEDDING_PORT= DEV_PG_DATABASE= DEV_PG_USERNAME= DEV_PG_PASSWORD= DEV_TEXT_EMBEDDING_HOST= DEV_TEXT_EMBEDDING_PORT=
- Navigate to the base of
document_retrieval
folder and run the service with:# linux or mac python -m microservice.main start \ -e production \ -H localhost \ -p 4100 # windows python -m microservice.main start -e production -H localhost -p 4100
If you want you can also run the service on custom host and port.
- Activate virtual environment if you wish to do so
- Navigate into
document_similarity
folder - Execute
pip install -r requirements.txt
- Navigate into microservice/config folder
- Create a
.env
file with the following variablesPROD_DATABASE_NAME = PROD_DATABASE_USER = PROD_DATABASE_PASSWORD = PROD_TEXT_EMBEDDING_URL = DEV_DATABASE_NAME = DEV_DATABASE_USER = DEV_DATABASE_PASSWORD = DEV_TEXT_EMBEDDING_URL =
- Set the text embedding url to
http://{HOST}:{PORT}/api/v1/embeddings/create
where HOST and PORT are the values used to run text embedding microservice - Navigate back into the base of the
document_similarity
folder and run the service with# linux or mac python -m microservice.main start \ -e production \ -H localhost \ -p 4200 # windows python -m microservice.main start -e production -H localhost -p 4200
You can also use custom host and port.
- Activate virtual environment if you wish to do so
- Navigate into
entrypoint
folder - Run
pip install -r requirements.txt
- Navigate into
microservice/config
folder - Create
.env
file with contentsDEV_DATABASE_USER = DEV_DATABASE_HOST = DEV_DATABASE_PORT = DEV_DATABASE_PASSWORD = DEV_DATABASE_NAME = PROD_DATABASE_USER = PROD_DATABASE_HOST = PROD_DATABASE_PORT = PROD_DATABASE_PASSWORD = PROD_DATABASE_NAME =
- Navigame back into
entrypoint
folder - Run the main service with
However if you routed other microservices to different hosts/ports, you can provide this values in the following way:
# linux or mac python -m microservice.main start \ -e production \ -H localhost \ -p 4500 # windows python -m microservice.main start -e production -H localhost -p 4500
# linux or mac python -m microservice.main start -H localhost -p 4500 \ -teh {host of the text embedding microservice} \ -tep {port of the text embedding microservice} \ -drh {host of the document retrieval microservice} \ -drp {port of the document retrieval microservice} \ -dsh {host of the document similarity microservice} \ -dsp {port of the document similarity microservice} # windows python -m microservice.main start -H localhost -p 4500 -teh {host of the text embedding microservice} -tep {port of the text embedding microservice} -drh {host of the document retrieval microservice} -drp {port of the document retrieval microservice} -dsh {host of the document similarity microservice} -dsp {port of the document similarity microservice}
Available endpoints:
-
GET
{HOST}/{PORT}/api/v1/documents/search
query_params query, m- query -> your text query
- m -> number of results
{BASE_URL}/api/v1/documents/search?query=deforestation&m=10
You will receive top 10 documents similar to query "deforestation". -
GET
{HOST}/{PORT}/api/v1/documents/<document_id>/similar
query_params get_k- document_id -> id of the document
- get_k -> number of results
{BASE_URL}/api/v1/documents/123/similar?get_k=5
You will receive 5 of the most similar documents to document with id 123. -
POST
{HOST}/{PORT}/api/v1/documents/<document_id>/similarity_update
- document_id -> id of the document
{BASE_URL}/api/v1/documents/similarity_update
Recalculates similarities of the document with the given id to the other documents. -
GET
{HOST}/{PORT}/api/v1/embeddings/create
query_params text, language- text -> your text
- language -> language of the text
{BASE_URL}/api/v1/embedding/create?text=ice cream&language=en
You will receive the embedding of the text "ice cream" from the english word embedding model. -
GET
{HOST}/{PORT}/api/v1/documents
query_params document_ids- document_ids : (comma separated document ids)
{BASE_URL}/api/v1/documents?document_ids=1,3,17
With the GET request at this endpoint you will receive documents data for documents ids 1, 3 and 17. -
GET
{HOST}/{PORT}/api/v1/documents/<document_id>
- document_id : (id of the document)
{BASE_URL}/api/v1/documents/3
With the GET request at this endpoint you will receive documents data for document with id 3.
This work is developed by AILab at Jozef Stefan Institute.
The work is supported by the EnviroLENS project, a project that demonstrates and promotes the use of Earth observation as direct evidence for environmental law enforcement, including in a court of law and in related contractual negotiations.