# Set Up an NLP Dev Environment

Date: 2021-01-01  
Author: Jason Beach  
Categories: Best_Practice, Introduction_Tutorial, Data_Science 
Tags: nlp, development, docker, doccano, aws, openstack, mlflow

<!--eofm-->

This post sets up an NLP development environment with Doccano, MLflow, and an AWS backend.

## Introduction

Development environments for NLP solutions can take many forms, including a typical IDE and notebooks.  But an important aspect that is often overlooked is data and model management.  Most often you will not make one simple classifier; rather, you will have multiple models per solution, and many solutions per project.  This necessitates maintaining and managing both the models and the data used to train them.  Managing data also means being able to sift through it to discover the most salient examples.

This post will walk through some of the general tasks in NLP development and use essential tools to ensure proper management of data and models.  We create a scenario where thousands of SEC 10-K filings are available for processing, such as [this example](https://sec.report/Document/0001193125-22-000226/d257921d10ka.htm), which is stored in the local file `resources/dataFinance_10-K_A.html`.

## Directory Structure and Docker

Notebooks are fine for prototyping, but are a poor choice for steady development into a solution.  Import the pieces of the codebase into your notebook as modules.  Our codebase will be kept in local directory: `/python_modules/nlp_pipeline/`.

Docker is great for ensuring a clean, reproducible deployment, whether to your local laptop or to cloud VMs.  Once this section is complete, we will have a notebook server (JupyterLab) and a tagging app (Doccano) running.

### Setup

#### JupyterLab

Use the `jupyter server list` command to get the URL of the JupyterLab server.  Be sure to leave port 8000 free for Doccano.  The full docs are [here](https://jupyter-docker-stacks.readthedocs.io/en/latest/).

```bash
docker run -d -p 8888:8888 -p 8889:8889 -v $HOME_NOTEBOOK_PUBLIC:/home/jovyan/NOTEBOOK_PUBLIC/ -v $HOME_NOTEBOOK_PRIVATE:/home/jovyan/PRIVATE/ -e GRANT_SUDO=yes --name cntr_all-spark jupyter/all-spark-notebook start.sh jupyter lab 
docker start cntr_all-spark

docker exec -it -e GRANT_SUDO=yes --user root cntr_all-spark bash
jupyter server list
```
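
If you want the URL and token inside a script rather than by eye, the text printed by `jupyter server list` can be parsed. This is a minimal sketch that assumes the usual `URL :: directory` line layout; the function name `parse_server_list` and the sample token are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

def parse_server_list(output: str) -> list:
    """Extract url, token, and root directory from `jupyter server list` text."""
    servers = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("http"):
            continue  # skip the "Currently running servers:" header
        url, _, root_dir = line.partition(" :: ")
        token = parse_qs(urlparse(url).query).get("token", [""])[0]
        servers.append({"url": url, "token": token, "root_dir": root_dir})
    return servers

# Hypothetical sample output; the token is a placeholder.
sample = """Currently running servers:
http://4fec40b7268f:8888/?token=abc123 :: /home/jovyan"""
servers = parse_server_list(sample)
print(servers[0]["url"])
```

Because the container publishes port 8888, replacing the hostname in the URL with `127.0.0.1` should reach the same server from the host browser.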

#### Doccano

Doccano provides a pre-built Docker image.  We will use this to start, but can improve on it later by working directly with the repo.

```bash
docker pull doccano/doccano
docker container create --name doccano \
  -e "ADMIN_USERNAME=admin" \
  -e "ADMIN_EMAIL=admin@example.com" \
  -e "ADMIN_PASSWORD=password" \
  -p 8000:8000 doccano/doccano
docker container start doccano
docker container stop doccano -t 5   #persist data
```

Move to the browser to get started.

```bash
browser> http://127.0.0.1:8000/
browser> login > admin, password 
```

#### LocalStack

Mock up an AWS environment using LocalStack.

```bash
docker run --rm -it -p 4566:4566 -p 4571:4571 localstack/localstack
browser> http://localhost:4566/health
```

The open-source code running in the Docker container connects to https://app.localstack.cloud, where you can sign up to view your resources.

>Email: lhwytaietuglqrebzn@nvhrw.com

>Password: NewPassword

#### Network


The `bridge` network is made for us and we will use it for communication.  You can read more about [docker networking](https://www.freecodecamp.org/news/how-to-get-a-docker-container-ip-address-explained-with-examples/).

```bash
$ docker network ls
```
|NETWORK ID |   NAME |   DRIVER |   SCOPE |  
|---|---|---|---|
|acebac7d6d89 | bridge | bridge | local | 
|2b1c24a6128e | host | host | local | 
|1e49888eb19e | none | null | local | 


Get the Doccano container ID:

```bash
$ docker ps -a
```
|CONTAINER ID |   IMAGE |   COMMAND |   CREATED |   STATUS |   PORTS |   NAMES|
|---|---|---|---|---|---|---|
|e3be02018f1a |   doccano/doccano |   "/doccano/tools/run.…" |   3 hours ago |   Up 3 hours |   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp |   doccano|
|4fec40b7268f |   all-spark-notebook-mysave |   "tini -g -- start.sh…" |   3 hours ago |   Up 3 hours |   8000/tcp, 0.0.0.0:8888-8889->8888-8889/tcp, :::8888-8889->8888-8889/tcp |   cntr_all-spark|


Get the internal IP address of the Doccano container:

```bash
$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' e3be02018f1a
```
>172.17.0.3
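
The same address can be pulled out of the full JSON that `docker inspect <container>` prints when no format template is given. The sketch below assumes the `NetworkSettings.Networks` layout that the template above walks through; `container_ips` and the trimmed sample are illustrative only.

```python
import json

def container_ips(inspect_output: str) -> dict:
    """Map network name -> IP address for the first container in `docker inspect` JSON."""
    containers = json.loads(inspect_output)
    networks = containers[0]["NetworkSettings"]["Networks"]
    return {name: cfg.get("IPAddress", "") for name, cfg in networks.items()}

# Trimmed, hypothetical sample of what `docker inspect e3be02018f1a` prints.
sample = json.dumps([{
    "Name": "/doccano",
    "NetworkSettings": {"Networks": {"bridge": {"IPAddress": "172.17.0.3"}}},
}])
print(container_ips(sample))  # {'bridge': '172.17.0.3'}
```

In a notebook, the input string could come from `subprocess.run(["docker", "inspect", container_id], capture_output=True, text=True).stdout`, provided the Docker CLI is available where the code runs.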


### Notebook configuration

Let's get everything set up and ensure we import our modules from the codebase.  

You can install modules in a variety of ways:

* A good explanation of the paths and IPython kernels is [here](https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/).  
* The guidance for managing multiple virtual environments in the same repo is [here](https://stackoverflow.com/questions/51061358/how-can-i-manage-more-than-2-package-groups-with-pipenv).

Install [pyenv](https://github.com/pyenv/pyenv#basic-github-checkout) to easily access different python versions with: `pyenv install 3.9.7`.

To use different requirements configurations, create a Pipfile within the `requirements/` directory using (for example) `pipenv --python 3.9.7`, then start the shell and work within that virtual environment.

```bash
#prepare venv files
mkdir requirements; cd requirements
mkdir spacy2; cd spacy2
pipenv --python 3.9.7        #create the virtual environment with this interpreter
pipenv shell
pipenv install "spacy~=2.0"

#add venv to jupyter
pipenv install ipykernel
python -m ipykernel install --name=spacy2

#exit the pipenv shell, then return to the project root to open the new kernel in a notebook
exit
cd ../..
jupyter kernelspec list
```

Registering each Pipfile environment as its own IPython kernel is explained [here](https://janakiev.com/blog/jupyter-virtual-envs/).

You must be aware of the machine and its resources.

In [198]:
! cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal


In [193]:
from platform import python_version
print( python_version() )

3.9.7


In [190]:
%load_ext watermark
%watermark -v -m -g

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 7.30.1

Compiler    : GCC 9.4.0
OS          : Linux
Release     : 5.10.25-linuxkit
Machine     : x86_64
Processor   : x86_64
CPU cores   : 4
Architecture: 64bit

Git hash: 2d4f0dffd9baa489ca3bc675bb7ec7c1dc69b5fc



In [191]:
import threading
threading.active_count()

9

In [196]:
! lscpu | head -n10

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       4
Vendor ID:                       GenuineIntel


Ensure that the Docker containers can reach each other over the internal network.  Get the Doccano container's internal IP from the `docker inspect` command above.
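
One quick way to confirm reachability from inside the notebook container is to attempt a TCP connection. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name, and the address in the usage comment is the one reported by `docker inspect` above.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example call from the notebook container:
# port_is_open("172.17.0.3", 8000)
```

A `False` result usually means the container is stopped, the IP has changed after a restart, or the two containers are not on the same Docker network.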

Proper installation should use the following:

```python
#install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install numpy
```

In [68]:
import pandas as pd

#! pip install doccano-client
from doccano_api_client import DoccanoClient

In [72]:
#Data-in
dc = DoccanoClient(
    'http://172.17.0.3:8000',
    'admin',
    'password'
)

# get basic information about the user
r_me = dc.get_me()
print(r_me)

{'id': 1, 'username': 'admin', 'is_superuser': True, 'is_staff': True}


Load installed modules

In [23]:
import re
from bs4 import BeautifulSoup

In [91]:
import pandas as pd
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

Install for python3.9 (typical notebook)

```bash
pip install gensim==3.8.3
pip install flair==0.8.0.post1
```

Install for python3.6 (flair repo development) within pipenv

```bash
pipenv run pip install torch==1.7.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pipenv run pip install .   #run setup.py
```

In [38]:
%load_ext autoreload
%autoreload 2

In [None]:
#! ls python_modules/nlp_pipeline/

In [62]:
#! pip install fasttext-wheel

In [61]:
#os.getcwd()

Load codebase modules

In [None]:
import os
os.chdir('./python_modules/nlp_pipeline/')

In [59]:
import model_fasttext
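
Changing the working directory makes the modules importable, but it also changes how relative data paths resolve. An alternative sketch is to add the codebase directory to `sys.path` and stay where you are; `add_module_dir` is a hypothetical helper, not part of the codebase.

```python
import sys
from pathlib import Path

def add_module_dir(path: str) -> str:
    """Put a directory at the front of sys.path (once) and return its absolute path."""
    resolved = str(Path(path).resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved

add_module_dir("./python_modules/nlp_pipeline")
# import model_fasttext  # importable without leaving the notebook's directory
```

Combined with `%autoreload 2`, edits to the module files should be picked up without restarting the kernel.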

Ensure data is available

In [5]:
! ls -alh resources/dataFinance_10-K_A.html

-rw-r--r-- 1 jovyan users 1.6M Jan  4 16:22 resources/dataFinance_10-K_A.html


### Initial preprocess 

Now, we will prepare our file with an initial preprocess so that it can be loaded into the tagging application.  The preprocess will be refined multiple times before being added to the pipeline.

We assume that sentences are the lowest level of detail that we would want to model.

In [18]:
file_path = 'resources/dataFinance_10-K_A.html'

#read the whole filing into a single string
with open(file_path, 'r', encoding='utf-8') as file:
    lines = file.read()

In [21]:
soup = BeautifulSoup(lines, 'html.parser')   #name the parser explicitly to avoid the bs4 warning
text = soup.get_text()

In [32]:
ed_text1 = re.sub('\n+', ' ', text)
ed_text2 = re.sub('\xa0+', ' ', ed_text1)

In [64]:
ed_text2[99000:100000]

't approve of the business combination we complete.  If we seek shareholder approval of our initial business combination, our initial shareholders and management team have agreed to vote in favor of such initial business combination, regardless of how our public shareholders vote.  Our initial shareholders own 20% of our issued and outstanding ordinary shares immediately following the completion of the offering. Our initial shareholders and management team also may from time to time purchase Class A ordinary shares prior to our initial business combination. Our amended and restated memorandum and articles of association provides that, if we seek shareholder approval of an initial business combination, such initial business combination will be approved if we receive an ordinary resolution under Cayman Islands law, which requires the affirmative vote of a majority of the shareholders who attend and vote at a general meeting of the company, including the founder shares. If we seek shareho

In [63]:
ed_text = ed_text2

In [112]:
doc = nlp(ed_text)

In [158]:
sents1 = list(doc.sents)
sents2 = [sent.text for sent in sents1 if 10 < len(sent) < 1000]   #len(sent) counts tokens

In [152]:
#os.chdir('./NOTEBOOK_PUBLIC')
#os.getcwd()

In [165]:
df = pd.DataFrame()
df['text'] = sents2
df['id'] = df.index
df['doc'] = file_path
df['doc_type'] = '10k'

df.to_json('./resources/dataFinance_dataframe.jsonl', orient='records', lines=True)

There is quite a bit of semi-structured text at the beginning of the document, but the majority is in a readable format.  The text is in decent shape and can be loaded into a pipeline.
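
The whitespace clean-up and the JSONL layout can be condensed into two small functions. This is a sketch using only the standard library, with sentence splitting left to spaCy as above; the function names are illustrative, and the output is not guaranteed to match `DataFrame.to_json` byte for byte.

```python
import json
import re

def normalize_whitespace(text: str) -> str:
    """Replace each run of newlines or non-breaking spaces with a single space."""
    text = re.sub(r"\n+", " ", text)
    return re.sub("\xa0+", " ", text)

def to_jsonl(sentences, doc_path: str, doc_type: str) -> str:
    """Serialize sentences as one JSON object per line, mirroring the DataFrame columns."""
    rows = (
        {"text": s, "id": i, "doc": doc_path, "doc_type": doc_type}
        for i, s in enumerate(sentences)
    )
    return "\n".join(json.dumps(row) for row in rows)

clean = normalize_whitespace("Item 1.\n\nBusiness\xa0\xa0overview")
print(clean)  # Item 1. Business overview
print(to_jsonl([clean], "resources/dataFinance_10-K_A.html", "10k"))
```

Keeping the preprocess in functions like these makes it easier to refine and later move into the pipeline module.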

## Doccano Interaction

This tagging application focuses on the dataset and associated records and metadata.  We can interact with it both programmatically and through the browser interface.  Move through the following steps to prepare the dataset.

### Browser interface

In the upper right corner, the __'Projects'__ button always takes you to the projects page, where you can select any of the projects you created.

* At the top left click __'Create'__ to create a new project
  - in the popup box insert text for 'Project name' (MyProject) and 'Description' (desc)
  - select a 'Project Type'

* On the left bar select __'Labels'__ to create labels
  - at the top select 'Actions' and then select 'Create Label'
  - complete key-value pairs with colors
  - stay on the 'Labels' section
  - at the top select 'Actions', then select 'Export Labels'
  - a json file is downloaded that decodes the output dataset

* On the left bar select __'Dataset'__ to upload data and annotate
  - at the top select 'Actions' and then select 'Import Data'
  - select the file format 'JSONL'
  - click 'Drop files here...' and locate './resources/dataFinance_dataframe.jsonl'
  - click 'Ingest' at bottom right
  - notice the 'Metadata' column is populated
  - at the top select 'Actions' and then select 'Export Data'
  - select the 'JSONL' format
  - select a file name
  - click 'Export'

* Select __'Statistics'__ to review
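
The 'Metadata' column appears to be filled from every field of a JSONL record other than the text (and label) column. That is an assumption based on the `meta` values in the export output later in this post, not on Doccano's documentation. A small sketch of that split, with a hypothetical helper name:

```python
import json

def split_text_and_meta(line: str, text_key: str = "text", label_key: str = "label"):
    """Return (text, labels, meta) for one JSONL line; remaining keys are treated as metadata."""
    record = json.loads(line)
    text = record.pop(text_key)
    labels = record.pop(label_key, [])
    return text, labels, record

line = '{"text": "For the fiscal year ended December 31, 2020", "id": 0, "doc_type": "10k"}'
text, labels, meta = split_text_and_meta(line)
print(meta)  # {'id': 0, 'doc_type': '10k'}
```

Running this over the file before import is a cheap way to check which fields will end up as metadata.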

### Programmatically

We can use the [doccano-api-client](https://github.com/doccano/doccano-client) to interact directly with the backend.  In the MyProject project you just created, first use 'Delete All' to remove the records.

#### Create project

In [79]:
#Create project
dc.create_project(name = "MyProject",
                  description = "test project",
                  project_type = "SequenceLabeling",
                  resourcetype = "SequenceLabelingProject",
                  randomize_document_order = False,
                  collaborative_annotation = False)

{'id': 1,
 'name': 'MyProject',
 'description': 'test project',
 'guideline': '',
 'users': [1],
 'current_users_role': {'is_project_admin': True,
  'is_annotator': False,
  'is_annotation_approver': False},
 'project_type': 'SequenceLabeling',
 'updated_at': '2022-01-04T20:37:56.984052Z',
 'random_order': False,
 'collaborative_annotation': False,
 'single_class_classification': False,
 'tags': [],
 'allow_overlapping': False,
 'grapheme_mode': False,
 'resourcetype': 'SequenceLabelingProject'}

#### Import data

In [166]:
#Import data
dc.post_doc_upload(project_id = 1,
                   file_name = './resources/dataFinance_dataframe.jsonl',
                   file_path = "./",
                   column_data = "text",
                   column_label = "label",
                   delimiter = "", encoding =  "utf_8", format = "JSONL")

{'task_id': 'c648d1f7-751a-4f85-9f01-e862e24debd3'}

#### Export data

In [174]:
project_id = "1"
file_style = "json"
limit = 1000
offset = 0

jsonl = dc.get(f"http://172.17.0.3:8000/v1/projects/{project_id}/docs?format={file_style}&limit={limit}&offset={offset}")['results']
json_csv_file = pd.DataFrame(jsonl)

In [179]:
json_csv_file.head(3)

| |id |filename |annotations |meta |annotation_approver |comment_count |text |is_confirmed |
|---|---|---|---|---|---|---|---|---|
|0 |1431 |http://172.17.0.3:8000/media/doccano/backend/m... |[] |{'id': 0, 'doc': 'resources/dataFinance_10-K_A... | |0 |For the fiscal year ended December 31, 2020 OR... |False |
|1 |1432 |http://172.17.0.3:8000/media/doccano/backend/m... |[] |{'id': 1, 'doc': 'resources/dataFinance_10-K_A... | |0 |(Exact name of registrant as specified in its ... |False |
|2 |1433 |http://172.17.0.3:8000/media/doccano/backend/m... |[] |{'id': 2, 'doc': 'resources/dataFinance_10-K_A... | |0 |8 Xinyuan South RoadChaoyang District, Beijing... |False |


In [181]:
#Export data one page at a time (limit/offset), suitable for use in a loop
dc.exp_get_doc_list(project_id=1, limit=1, offset=0)

{'count': 1429,
 'next': 'http://172.17.0.3:8000/v1/projects/1/docs?limit=1&offset=1',
 'previous': None,
 'results': [{'id': 1431,
   'filename': 'http://172.17.0.3:8000/media/doccano/backend/media/CsYxqSYraZ9qxfFYV8GLcf/RVTcRhbnmow9keu6nie4qU/dataFinance_dataframe.jsonl',
   'annotations': [],
   'meta': {'id': 0,
    'doc': 'resources/dataFinance_10-K_A.html',
    'doc_type': '10k'},
   'annotation_approver': None,
   'comment_count': 0,
   'text': 'For the fiscal year ended December 31, 2020 OR   ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 FOR THE TRANSITION PERIOD FROM   TO   COMMISSION FILE NUMBER 001-39222  CITIC CAPITAL ACQUISITION CORP.',
   'is_confirmed': False}]}
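
Since the response carries `count`, `next`, and `results`, a loop can keep requesting pages until `next` is empty. The sketch below keeps the paging logic separate from the client so it can be tried without a server; `iter_docs` and `fake_fetch` are hypothetical names.

```python
from typing import Callable, Iterator

def iter_docs(fetch_page: Callable[[int, int], dict], page_size: int = 100) -> Iterator[dict]:
    """Yield every record by requesting pages until 'next' is empty."""
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page["results"]
        if not page.get("next") or not page["results"]:
            break
        offset += page_size

# Stand-in for the server: 5 fake records served in pages.
def fake_fetch(limit: int, offset: int) -> dict:
    data = [{"id": i} for i in range(5)]
    chunk = data[offset:offset + limit]
    more = offset + limit < len(data)
    return {"count": len(data), "next": "more" if more else None, "results": chunk}

print([d["id"] for d in iter_docs(fake_fetch, page_size=2)])  # [0, 1, 2, 3, 4]
```

Against the real backend, the fetch function could be `lambda limit, offset: dc.exp_get_doc_list(project_id=1, limit=limit, offset=offset)`, and the result passed to `pd.DataFrame(list(...))`.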

#### Admin info

In [180]:
#Project data
dc.get_project_list()

[{'id': 1,
  'name': 'MyProject',
  'description': 'test project',
  'guideline': '',
  'users': [1],
  'current_users_role': {'is_project_admin': True,
   'is_annotator': False,
   'is_annotation_approver': False},
  'project_type': 'SequenceLabeling',
  'updated_at': '2022-01-04T21:45:30.010239Z',
  'random_order': False,
  'collaborative_annotation': False,
  'single_class_classification': False,
  'tags': [],
  'allow_overlapping': False,
  'grapheme_mode': False,
  'resourcetype': 'SequenceLabelingProject'}]

In [184]:
dc.get_annotation_list(project_id=1, doc_id=1)

[]

In [185]:
dc.get_document_detail(project_id=1, doc_id=1)

{'detail': 'Not found.'}

In [186]:
dc.get_label_list(project_id=1)

[{'id': 1,
  'text': 'False',
  'prefix_key': None,
  'suffix_key': '0',
  'background_color': '#FF6F00',
  'text_color': '#ffffff'},
 {'id': 2,
  'text': 'True',
  'prefix_key': None,
  'suffix_key': '1',
  'background_color': '#8BC34A',
  'text_color': '#ffffff'}]
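
When decoding exported annotations, a lookup from label id to label text is handy. This sketch assumes annotations reference labels by their `id`, which is not shown in the output above; `label_lookup` is a hypothetical helper.

```python
def label_lookup(labels: list) -> dict:
    """Map label id -> label text from a list of label dictionaries."""
    return {label["id"]: label["text"] for label in labels}

# Trimmed copy of the label list shown above.
labels = [
    {"id": 1, "text": "False", "suffix_key": "0"},
    {"id": 2, "text": "True", "suffix_key": "1"},
]
print(label_lookup(labels))  # {1: 'False', 2: 'True'}
```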

#### Tips

* only label the specific (target) text that represents the category
* output will be:
  - text [sentence where label is found] - post-processed text for train/test
  - original [sentence where label is found] - unaltered text for comparison with the post-processed text
  - target_text - represents the category
  - metadata (key, value) - specific data to be extracted
* first, look at the text record within the context of the report to confirm what is described.  Then, look at text records within the same category to ensure the pattern is consistent
* post-processing allows multiple patterns to be used based on different aspects: key terms used, part-of-speech, grammatical mood (English has indicative, imperative, subjunctive, and infinitive), etc.
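
The output fields listed in the tips could be held in a small record type. This is only a sketch of one possible shape: the class, the `from_span` helper, and the `category` key are hypothetical, and post-processing is reduced to a `strip()`.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledRecord:
    text: str            # post-processed sentence used for train/test
    original: str        # unaltered sentence, kept for comparison
    target_text: str     # the labeled span that represents the category
    metadata: dict = field(default_factory=dict)  # key/value data to extract

def from_span(original: str, start: int, end: int, **metadata) -> LabeledRecord:
    """Build a record from character offsets into the original sentence."""
    return LabeledRecord(
        text=original.strip(),
        original=original,
        target_text=original[start:end],
        metadata=metadata,
    )

sentence = " Our initial shareholders own 20% of our shares. "
start = sentence.index("20%")
rec = from_span(sentence, start, start + 3, category="ownership")
print(rec.target_text)  # 20%
```

Keeping both `text` and `original` makes it possible to compare the post-processed sentence against its source when a pattern behaves unexpectedly.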

#### Issues

* import of label_config incorrectly populates color fill
* need a view for displaying the same labelled text from all records.  This is used to quickly add comments, which are the final extraction of information
* how to add example data?
* in the Comments section, can't click on a record to annotate / label it
* how to add metadata?  There is a position for it, but no way to add to it!
* Id (index) and label index never reset despite being in a different project.  They just keep incrementing.

#### Improvements

* set up so that sqlite.db is persisted in a shared volume for safekeeping.
* need a way to create a new dataset from existing datasets, based on label(s).  This allows for decomposing from a general label to a more specific label.

## AWS S3 Mocking with LocalStack

### Components

* LocalStack
* aws cli
* aws configure
* python boto3

Install with

```
pip install localstack
```

### Running in Docker

The provided Docker image is fine, but you can also use a docker-compose file.

```yaml
version: '3.1'

services:
  localstack:
    image: localstack/localstack:latest
    environment:
      - AWS_DEFAULT_REGION=us-east-1
      - EDGE_PORT=4566
      - SERVICES=lambda,s3
      - DATA_DIR=/tmp/localstack/data
    ports:
      - '4566-4583:4566-4583'
    volumes:
      - "${TEMPDIR:-/tmp/localstack}:/tmp/localstack"
      - "/var/run/docker.sock:/var/run/docker.sock"
```

Save the content above as `docker-compose.yml` and run the following command in a terminal (macOS):

```
docker-compose up
```

Once the container is ready, navigate to http://localhost:4566/health  which should show the following json output

```
{"services": {"lambda": "running", "logs": "running", "s3": "running", "cloudwatch": "running"},
 "features": {"persistence": "initialized", "initScripts": "initialized"}}
```
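
The health response can also be checked in code rather than by eye. This sketch assumes a service counts as ready when its state is `running` or `available`; the `starting` state in the sample and the helper name `unready_services` are made up for illustration.

```python
import json

def unready_services(health_json: str, ok_states=("running", "available")) -> list:
    """Return the names of services whose state is not in ok_states."""
    services = json.loads(health_json).get("services", {})
    return sorted(name for name, state in services.items() if state not in ok_states)

# Hypothetical response in which one service is not ready yet.
sample = '{"services": {"lambda": "running", "s3": "running", "logs": "starting"}}'
print(unready_services(sample))  # ['logs']
```

The input could be fetched with `urllib.request.urlopen("http://localhost:4566/health").read()` once the container is up.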

### Interacting

Run the following commands on the local machine as well as inside the Docker container:

* install aws cli: `pip install awscli`
* aws configure:
```
aws_access_key_id=test
aws_secret_access_key=test
region=us-east-1
output=json
```

* make a bucket: `aws --endpoint-url=http://localhost:4566 s3 mb s3://<bucketname>`
* copy files: `aws --endpoint-url=http://localhost:4566 s3 cp filename.txt s3://<bucketname>`
* list files: `aws --endpoint-url=http://localhost:4566 s3 ls s3://<bucketname>`
* sync folders: `aws --endpoint-url=http://localhost:4566 s3 sync model s3://<bucketname>`
* back up buckets and files locally (see below)

If the container has stopped, recreate it with docker-compose, copy the recorded API calls into the data directory, then restart the container:

```
docker-compose up
cp /workspace/recorded_api_calls.json /tmp/localstack/data/
docker container ls   #get container id
docker container restart <container_id>
```

### Python connections

* CRUD operations can be done from Python by passing the LocalStack endpoint URL to boto3
* the LocalStack endpoint address can also be found with `docker network inspect <network>`
* note the gateway IP address of the LocalStack container's network
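
The gateway address can be read from the JSON that `docker network inspect` prints. This is a minimal sketch assuming the usual `IPAM.Config[].Gateway` layout; the network name in the sample and the helper `gateway_endpoint` are hypothetical.

```python
import json

def gateway_endpoint(network_inspect_json: str, port: int = 4566) -> str:
    """Build an endpoint URL from the first gateway in `docker network inspect` JSON."""
    networks = json.loads(network_inspect_json)
    for config in networks[0]["IPAM"]["Config"]:
        if config.get("Gateway"):
            return f"http://{config['Gateway']}:{port}"
    raise ValueError("no gateway found in network configuration")

# Trimmed, hypothetical sample of the inspect output.
sample = json.dumps([{
    "Name": "localstack_default",
    "IPAM": {"Config": [{"Subnet": "172.19.0.0/16", "Gateway": "172.19.0.1"}]},
}])
print(gateway_endpoint(sample))  # http://172.19.0.1:4566
```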

Example usage:

* read files

```
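import io
import boto3
import pandas as pd

#connect to LocalStack with the dummy credentials configured above
s3 = boto3.resource('s3', endpoint_url='http://172.19.0.1:4566',
                    aws_access_key_id='test',
                    aws_secret_access_key='test')

#bucket and file hold the bucket name and object key
obj = s3.Object(bucket, file)
data = obj.get()['Body'].read()
df = pd.read_csv(io.BytesIO(data), header=None)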
```
* make bucket
```
s3.create_bucket(Bucket=name)
```
* upload file
```
s3 = boto3.client('s3', endpoint_url='http://172.19.0.1:4566')
s3.upload_file(file_name, bucket_name, object_file_name)
```
* delete file
```
s3 = boto3.resource('s3', endpoint_url='http://172.19.0.1:4566')
s3.Object(bucket, filename).delete()
```
* download file
```
session = boto3.Session()
s3 = boto3.resource('s3', endpoint_url='http://172.19.0.1:4566')
s3.meta.client.download_file(bucket_name, object_name, destination_filepath)
```
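
The snippets above repeat the endpoint and credentials each time. One way to keep them in a single place is a small helper that builds the keyword arguments. This is a sketch: both function names are hypothetical, and the `test` credentials and region simply mirror the `aws configure` values used earlier.

```python
def localstack_kwargs(endpoint_url: str, key: str = "test", secret: str = "test",
                      region: str = "us-east-1") -> dict:
    """Collect the keyword arguments shared by boto3.client and boto3.resource."""
    return {
        "endpoint_url": endpoint_url,
        "aws_access_key_id": key,
        "aws_secret_access_key": secret,
        "region_name": region,
    }

def make_s3_resource(endpoint_url: str):
    """Create an S3 resource pointed at LocalStack (boto3 is imported lazily)."""
    import boto3
    return boto3.resource("s3", **localstack_kwargs(endpoint_url))

kwargs = localstack_kwargs("http://172.19.0.1:4566")
print(kwargs["endpoint_url"])  # http://172.19.0.1:4566
```

Switching from LocalStack to real AWS then comes down to dropping `endpoint_url` and supplying real credentials, with the rest of the S3 code unchanged.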