<a href="https://colab.research.google.com/github/bminixhofer/nnsplit/blob/master/train/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to train [NNSplit](https://github.com/bminixhofer/nnsplit/) on a custom dataset and load it for inference.

# Setup

First, clone the GitHub repo and install the requirements. If you are running this on Colab, you will likely have to restart the runtime after installing them because of version mismatches.

In [None]:
!git clone https://www.github.com/bminixhofer/nnsplit

Cloning into 'nnsplit'...
remote: Enumerating objects: 1388, done.
remote: Total 1388 (delta 0), reused 0 (delta 0), pack-reused 1388
Receiving objects: 100% (1388/1388), 9.31 MiB | 9.96 MiB/s, done.
Resolving deltas: 100% (723/723), done.


In [None]:
!pip install -r nnsplit/train/requirements.txt

Collecting appdirs==1.4.4
  Downloading https://files.pythonhosted.org/packages/3b/00/2344469e2084fb287c2e0b57b72910309874c3245463acd6cf5e3db69324/appdirs-1.4.4-py2.py3-none-any.whl
Collecting black==19.10b0
  Downloading https://files.pythonhosted.org/packages/fd/bb/ad34bbc93d1bea3de086d7c59e528d4a503ac8fe318bd1fa48605584c3d2/black-19.10b0-py36-none-any.whl (97kB)

# Data preparation

Training NNSplit is not limited to a specific dataset. However, I have found the [Linguatools Wikipedia Dumps](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/) to work well, so there is built-in functionality to load those. Feel free to use other data!

First, download the `.xml.bz2` file and unzip it.

In [None]:
!wget https://www.dropbox.com/s/cnrhd11zdtc1pic/enwiki-20181001-corpus.xml.bz2?dl=1

--2020-06-29 13:14:11--  https://www.dropbox.com/s/cnrhd11zdtc1pic/enwiki-20181001-corpus.xml.bz2?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.66.1, 2620:100:6022:1::a27d:4201
Connecting to www.dropbox.com (www.dropbox.com)|162.125.66.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/cnrhd11zdtc1pic/enwiki-20181001-corpus.xml.bz2 [following]
--2020-06-29 13:14:12--  https://www.dropbox.com/s/dl/cnrhd11zdtc1pic/enwiki-20181001-corpus.xml.bz2
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc421e7aa12fed6831c3d995998c.dl.dropboxusercontent.com/cd/0/get/A6mM-AE7ETYq9hpA-O3UObP1NjMT2odd9EVbXKFkFrEYwTiQmGVYcCX-vaQhcQH6T7OivUBbKDv4nDAq1NGUNrwhBeviCRO_roqQtdWuULOpVg/file?dl=1# [following]
--2020-06-29 13:14:12--  https://uc421e7aa12fed6831c3d995998c.dl.dropboxusercontent.com/cd/0/get/A6mM-AE7ETYq9hpA-O3UObP1NjMT2odd9EVbXKFkFrEYwTiQmGVYcCX-vaQhcQH6T7OivUBbKDv4

In [None]:
!mv enwiki-20181001-corpus.xml.bz2?dl=1 enwiki-20181001-corpus.xml.bz2

In [None]:
!bzip2 -d enwiki-20181001-corpus.xml.bz2

Now we can create the dataset. `xml_dump_iter` is one of the built-in functions. It returns an iterator over all texts in the Wikipedia dump and tries to remove tags and other markup.

In [None]:
import sys
sys.path.append("nnsplit/train")
from text_data import MemoryMapDataset, xml_dump_iter

In [None]:
xml_iter = xml_dump_iter("enwiki-20181001-corpus.xml", 
                         min_text_length=300, 
                         max_text_length=5000)
next(xml_iter)

'Anarchism is a political philosophy   that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies,    although several authors have defined them more specifically as institutions based on non-hierarchical or free associations.     Anarchism holds the state to be undesirable, unnecessary and harmful. According to Peter Kropotkin, Godwin was "the first to formulate the political and economical conceptions of anarchism, even though he did not give that name to the ideas developed in his work"  while Godwin attached his anarchist ideas to an early Edmund Burke.'

`MemoryMapDataset` is another convenient built-in class, but it is not specific to the Wikipedia dump. It is a `torch.utils.data.Dataset` which can be created from a `texts.txt` and a `slices.pkl` file. The `texts.txt` file is [memory-mapped](https://en.wikipedia.org/wiki/Memory-mapped_file), and `slices.pkl` contains a Python array with indices that determine which range of the text should be loaded at which position in the dataset. This allows accessing each text without ever loading all the data into memory.
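
To make the idea concrete, here is a minimal sketch of the same technique using only the standard library. It is not the actual `MemoryMapDataset` implementation: `write_texts_and_slices` and `TinyMemoryMapDataset` are hypothetical names, and storing `(start, end)` byte offsets in the pickle file is an assumption about the layout.

```python
import mmap
import os
import pickle
import tempfile


def write_texts_and_slices(texts, text_path, slices_path):
    """Concatenate texts into one file and record each text's (start, end) byte range."""
    slices = []
    offset = 0
    with open(text_path, "wb") as f:
        for text in texts:
            encoded = text.encode("utf-8")
            f.write(encoded)
            slices.append((offset, offset + len(encoded)))
            offset += len(encoded)
    with open(slices_path, "wb") as f:
        pickle.dump(slices, f)


class TinyMemoryMapDataset:
    """Hypothetical stand-in that reads one text at a time from a memory-mapped file."""

    def __init__(self, text_path, slices_path):
        self._file = open(text_path, "rb")
        self._mmap = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        with open(slices_path, "rb") as f:
            self._slices = pickle.load(f)

    def __len__(self):
        return len(self._slices)

    def __getitem__(self, idx):
        start, end = self._slices[idx]
        # Only the requested byte range is read and decoded.
        return self._mmap[start:end].decode("utf-8")

    def close(self):
        self._mmap.close()
        self._file.close()


tmp_dir = tempfile.mkdtemp()
demo_text_path = os.path.join(tmp_dir, "texts.txt")
demo_slices_path = os.path.join(tmp_dir, "slices.pkl")

write_texts_and_slices(
    ["First text.", "Zweiter Text: schön.", "Third."], demo_text_path, demo_slices_path
)
demo_dataset = TinyMemoryMapDataset(demo_text_path, demo_slices_path)
n_texts = len(demo_dataset)
second_text = demo_dataset[1]
demo_dataset.close()
print(n_texts, second_text)
```

The offsets are byte positions, not character positions, so texts with multi-byte characters are still sliced correctly.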

To create `texts.txt` and `slices.pkl` from an iterator over text, use `MemoryMapDataset.iterator_to_text_and_slices`.

Note that this will be quite slow since iterating over the XML dump takes a significant amount of time, so I would recommend caching `texts.txt` and `slices.pkl` somewhere.

`max_n_texts=10_000_000` is only needed in Colab to keep disk usage in check; feel free to remove it otherwise.

In [None]:
xml_iter = xml_dump_iter("enwiki-20181001-corpus.xml", 
                         min_text_length=300,
                         max_text_length=5000)
MemoryMapDataset.iterator_to_text_and_slices(xml_iter, 
                                             "texts.txt", 
                                             "slices.pkl",
                                             max_n_texts=10_000_000)


Here, I am saving the outputs to my Drive. You will have to adjust these paths.

In [None]:
!cp -a slices.pkl "drive/My Drive/Projects/nnsplit/slices.pkl"
!cp -a texts.txt "drive/My Drive/Projects/nnsplit/texts.txt"
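
If your corpus is not a Linguatools dump, any iterator over strings should be usable in place of `xml_iter`. The following is a minimal sketch for a plain-text file in which documents are separated by blank lines. `iter_paragraphs` is a hypothetical helper, not part of NNSplit. Its `min_text_length` and `max_text_length` arguments borrow the names used by `xml_dump_iter` but are implemented here as simple length filters, which may differ from what `xml_dump_iter` does.

```python
import os
import tempfile


def iter_paragraphs(path, min_text_length=300, max_text_length=5000):
    """Yield blank-line-separated paragraphs whose length lies within the bounds."""

    def flush(lines):
        text = " ".join(lines)
        if min_text_length <= len(text) <= max_text_length:
            yield text

    lines = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            stripped = line.strip()
            if stripped:
                lines.append(stripped)
            elif lines:
                # A blank line ends the current paragraph.
                yield from flush(lines)
                lines = []
    # Emit the last paragraph if the file does not end with a blank line.
    if lines:
        yield from flush(lines)


# Small demo file with one paragraph that is too short to be kept.
demo_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("Too short.\n\nThis paragraph is long enough\nto be kept.\n\n")
    f.write("Another kept paragraph here.")

kept = list(iter_paragraphs(demo_path, min_text_length=20, max_text_length=100))
print(kept)
```

The resulting generator could then be passed to `MemoryMapDataset.iterator_to_text_and_slices` in the same way as `xml_iter`.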

# Training

Now we can get started with training!

In [None]:
import sys
sys.path.append("nnsplit/train")

In [None]:
from pytorch_lightning.trainer import Trainer
from tqdm.auto import tqdm
from model import Network
from text_data import MemoryMapDataset

NNSplit has a `Network` class, a `pl.LightningModule` that specifies the network architecture, data loading logic, etc. To instantiate a new network, we first need to get the default hyperparameters.

In [None]:
parser = Network.get_parser()
hparams = parser.parse_args([])
hparams

Namespace(accumulate_grad_batches=1, amp_level='O1', auto_lr_find=False, auto_scale_batch_size=False, auto_select_gpus=False, benchmark=False, check_val_every_n_epoch=1, checkpoint_callback=True, default_root_dir=None, deterministic=False, distributed_backend=None, early_stop_callback=False, fast_dev_run=False, gpus=<function Trainer._arg_default at 0x7f2940aea488>, gradient_clip_val=0, limit_test_batches=1.0, limit_train_batches=1.0, limit_val_batches=1.0, log_gpu_memory=None, log_save_interval=100, logger=True, max_epochs=1, max_steps=None, min_epochs=1, min_steps=None, num_nodes=1, num_processes=1, num_sanity_val_steps=2, overfit_batches=0.0, overfit_pct=None, precision=32, prepare_data_per_node=True, print_nan_grads=False, process_position=0, profiler=None, progress_bar_refresh_rate=1, reload_dataloaders_every_epoch=True, replace_sampler_ddp=True, resume_from_checkpoint=None, row_log_interval=50, terminate_on_nan=False, test_percent_check=None, test_size=50000, tpu_cores=None, trac

## Load text data

Next, we can load the text data created previously.

In [None]:
text_dataset = MemoryMapDataset("texts.txt", "slices.pkl")

Keep in mind that this can be any `torch.utils.data.Dataset` with `str` entries, so you can completely customize it.

In [None]:
text_dataset[0]

'Anarchism is a political philosophy   that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies,    although several authors have defined them more specifically as institutions based on non-hierarchical or free associations.     Anarchism holds the state to be undesirable, unnecessary and harmful. According to Peter Kropotkin, Godwin was "the first to formulate the political and economical conceptions of anarchism, even though he did not give that name to the ideas developed in his work"  while Godwin attached his anarchist ideas to an early Edmund Burke.'
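
As noted above, any `torch.utils.data.Dataset` with `str` entries should work. A minimal sketch of such a dataset, assuming the texts fit into memory; `ListTextDataset` is a hypothetical name, and the `ImportError` fallback only exists so the sketch also runs where PyTorch is not installed:

```python
try:
    from torch.utils.data import Dataset
except ImportError:
    # Fallback so the sketch also runs without PyTorch; use the real base class for training.
    Dataset = object


class ListTextDataset(Dataset):
    """Hypothetical map-style dataset that serves strings from an in-memory list."""

    def __init__(self, texts):
        self.texts = list(texts)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]


custom_dataset = ListTextDataset(
    ["This is a test. This is another test.", "A second document."]
)
print(len(custom_dataset), repr(custom_dataset[0]))
```

An instance like `custom_dataset` would take the place of `text_dataset` in the rest of the notebook.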

## Load labeler

Next, create a `Labeler`, which is used to annotate the text from above. Any spaCy model which supports sentencization can be used. You will have to install the appropriate spaCy model with `python -m spacy ...` when running this in Colab.

In [None]:
from labeler import Labeler, SpacySentenceTokenizer, SpacyWordTokenizer

In [None]:
labeler = Labeler(
    [
        SpacySentenceTokenizer(
            "en_core_web_sm", lower_start_prob=0.7, remove_end_punct_prob=0.7
        ),
        SpacyWordTokenizer("en_core_web_sm"),
    ]
)

`Labeler.visualize` shows you what the network sees: 
- `byte` is the UTF-8 encoded text. This has changed in the newest version of NNSplit. Previously, characters were used, but using bytes allows NNSplit to work for any language regardless of the characters used to represent it.
- The other rows depend on the `Labeler` and determine what the neural network tries to predict.

In [None]:
labeler.visualize("This is a test. This is another test.")

                                                                           \
byte                    116  104  105  115  32  105  115  32  97  32  116   
SpacySentenceTokenizer    0    0    0    0   0    0    0   0   0   0    0   
SpacyWordTokenizer        0    0    0    0   1    0    0   1   0   1    0   

                                                                            \
byte                    101  115  116  32  84  104  105  115  32  105  115   
SpacySentenceTokenizer    0    0    0   1   0    0    0    0   0    0    0   
SpacyWordTokenizer        0    0    0   1   0    0    0    0   1    0    0   

                                                                            \
byte                    32  97  110  111  116  104  101  114  32  116  101   
SpacySentenceTokenizer   0   0    0    0    0    0    0    0   0    0    0   
SpacyWordTokenizer       1   0    0    0    0    0    0    0   1    0    0   

                                      
byte                    11
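
The rows of the visualization can be reproduced by hand. The following is a minimal sketch, not the `Labeler` implementation: `byte_labels` is a hypothetical helper, and it assumes that a label of 1 marks the last byte of each unit (a token or sentence including its trailing whitespace), which is what the output above suggests.

```python
def byte_labels(units):
    """Encode units as UTF-8 and mark the last byte of each unit with a 1."""
    data = bytearray()
    labels = []
    for unit in units:
        encoded = unit.encode("utf-8")
        if not encoded:
            continue  # skip empty units, they have no byte to label
        data.extend(encoded)
        labels.extend([0] * (len(encoded) - 1) + [1])
    return list(data), labels


# Word-level units (token plus trailing whitespace) for the start of the example above.
# The first byte in the visualization is 116, a lowercase "t", so "this" is lowercase here.
byte_values, word_labels = byte_labels(["this ", "is ", "a ", "test "])
print(byte_values)
print(word_labels)

# Non-ASCII text: "ö" takes two bytes, so there are more bytes than characters.
units_de = ["schön ", "hier"]
byte_values_de, labels_de = byte_labels(units_de)
print(len("".join(units_de)), len(byte_values_de))
```

Working on bytes means the input vocabulary never exceeds 256 values, whatever the script of the text.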

## Start training!

Now we can finally start training. 

`train_size` determines how many entries in the dataset to sample for each epoch. 

Using spaCy with multiprocessing leaks memory, so memory usage will continuously increase during each epoch and reset at the end. You will therefore have to set `train_size` to a value that fits the available memory. `500_000` works well in Colab.


In [None]:
hparams.gpus = 1
hparams.max_epochs = 4
hparams.train_size = 500_000
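
To illustrate what `train_size` means, here is a minimal sketch that draws a fresh random subset of dataset indices for every epoch. `sample_epoch_indices` is a hypothetical helper. How `Network` samples internally (for example with or without replacement) is not shown in this notebook, so treat this only as an illustration of the idea.

```python
import random


def sample_epoch_indices(dataset_size, train_size, rng):
    """Pick train_size distinct indices, or all of them if the dataset is smaller."""
    return rng.sample(range(dataset_size), min(train_size, dataset_size))


rng = random.Random(0)
epoch_indices = []
for epoch in range(2):
    indices = sample_epoch_indices(dataset_size=10_000_000, train_size=500_000, rng=rng)
    epoch_indices.append(indices)
    print(epoch, len(indices), len(set(indices)))
```

With four epochs of 500,000 samples each, at most 2,000,000 of the 10,000,000 stored texts would be seen during training under this scheme.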

Instantiate the network.

In [None]:
model = Network(
  text_dataset,
  labeler,
  hparams,
)
model

Network(
  (embedding): Embedding(256, 32)
  (lstm1): LSTM(32, 128, batch_first=True, bidirectional=True)
  (lstm2): LSTM(256, 64, batch_first=True, bidirectional=True)
  (out): Linear(in_features=128, out_features=2, bias=True)
)

Instantiate the `pl.trainer.Trainer`.

In [None]:
trainer = Trainer.from_argparse_args(hparams)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


And fit the model. Each row of the F1, precision and recall scores corresponds to one tokenizer of the `Labeler`.

In [None]:
trainer.fit(model)


  | Name      | Type      | Params
----------------------------------------
0 | embedding | Embedding | 8 K   
1 | lstm1     | LSTM      | 165 K 
2 | lstm2     | LSTM      | 164 K 
3 | out       | Linear    | 258   


f1=0.000	precision=0.000	recall=0.000
f1=0.000	precision=0.000	recall=0.000



f1=0.823	precision=0.733	recall=0.939
f1=0.996	precision=0.992	recall=0.999



f1=0.886	precision=0.836	recall=0.942
f1=0.998	precision=0.996	recall=1.000



f1=0.907	precision=0.869	recall=0.949
f1=0.999	precision=0.998	recall=1.000



f1=0.886	precision=0.817	recall=0.968
f1=0.999	precision=0.998	recall=1.000



1
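
The F1 score in each row is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the rounded precision and recall values in the first row of each validation output above (presumably the sentence tokenizer, given the order in the `Labeler`). `f1_score` is a hypothetical helper, not part of NNSplit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per epoch, copied from the first row of each validation output.
first_row_scores = [(0.733, 0.939), (0.836, 0.942), (0.869, 0.949), (0.817, 0.968)]
recomputed = [round(f1_score(p, r), 3) for p, r in first_row_scores]
print(recomputed)  # → [0.823, 0.886, 0.907, 0.886]
```

These match the reported F1 values, including the drop in the last epoch, where higher recall came with lower precision.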

Finally, store the trained model somewhere. This saves a `.onnx` export of the model in the specified directory.

In [None]:
model.store("drive/My Drive/Projects/nnsplit/en")

  "or define the initial states (h0/c0) as inputs of the model. ")


# Load the model in NNSplit

First, install NNSplit.

In [None]:
!pip install nnsplit

In [14]:
from nnsplit import NNSplit

Instantiate the splitter.

In [15]:
splitter = NNSplit("drive/My Drive/Projects/nnsplit/en/model.onnx", True)

And split a text!

In [18]:
splits = splitter.split(["This is a test This is another test."])[0]
splits

Split(Split(Split('This', ' '), Split('is', ' '), Split('a', ' '), Split('test', ' ')), Split(Split('This', ' '), Split('is', ' '), Split('another', ' '), Split('test', ''), Split('.', '')))

The public API of NNSplit has changed significantly, making it much easier to use now. Everything is a `nnsplit.Split` which can be iterated over or stringified with `str(...)`.

In [36]:
for sentence in splits:
  print(str(sentence).ljust(30), type(sentence))

This is a test                 <class 'Split'>
This is another test.          <class 'Split'>


Or if you want to go token-level:

In [37]:
for sentence in splits:
  for token in sentence:
    print(str(token).ljust(10), repr(token).ljust(30), type(token))

  print()

This       Split('This', ' ')             <class 'Split'>
is         Split('is', ' ')               <class 'Split'>
a          Split('a', ' ')                <class 'Split'>
test       Split('test', ' ')             <class 'Split'>

This       Split('This', ' ')             <class 'Split'>
is         Split('is', ' ')               <class 'Split'>
another    Split('another', ' ')          <class 'Split'>
test       Split('test', '')              <class 'Split'>
.          Split('.', '')                 <class 'Split'>



This continues down to the smallest unit, where iterating yields `str` values instead of `nnsplit.Split` objects.

In [32]:
for sentence in splits:
  for [text, whitespace] in sentence:
    print(text.ljust(10), type(text))
    print(f'"{whitespace}"'.ljust(10), type(whitespace))
    print()

This       <class 'str'>
" "        <class 'str'>

is         <class 'str'>
" "        <class 'str'>

a          <class 'str'>
" "        <class 'str'>

test       <class 'str'>
" "        <class 'str'>

This       <class 'str'>
" "        <class 'str'>

is         <class 'str'>
" "        <class 'str'>

another    <class 'str'>
" "        <class 'str'>

test       <class 'str'>
""         <class 'str'>

.          <class 'str'>
""         <class 'str'>



Finally, for some benchmarks: If you are running `NNSplit` on GPU, you can increase the speed on large datasets by using a big batch size.

In [53]:
splitter = NNSplit(model, "cuda", batch_size=2**14)

In [60]:
text = "This is a test This is another test."

%timeit splitter.split([text])[0]
%timeit splitter.split([text] * 100)[0]
%timeit splitter.split([text] * 1000)[0]
%timeit splitter.split([text] * 10_000)[0]

100 loops, best of 3: 6.38 ms per loop
100 loops, best of 3: 9.03 ms per loop
10 loops, best of 3: 34.6 ms per loop
1 loop, best of 3: 372 ms per loop


And voilà! Splitting 10,000 short texts takes less than 400 milliseconds.
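
To see how much batching helps, the timings can be converted into per-text cost and throughput. This is plain arithmetic on the `%timeit` numbers above; it does not run NNSplit itself.

```python
# (number of texts, total time in milliseconds) copied from the %timeit output above.
timings = [(1, 6.38), (100, 9.03), (1_000, 34.6), (10_000, 372.0)]

per_text_ms = []
throughputs = []
for n_texts, total_ms in timings:
    per_text_ms.append(total_ms / n_texts)
    throughputs.append(n_texts / (total_ms / 1000))
    print(
        f"{n_texts:>6} texts: {per_text_ms[-1]:.4f} ms per text, "
        f"{throughputs[-1]:,.0f} texts/s"
    )
```

The cost per text falls from 6.38 ms for a single text to roughly 0.037 ms at 10,000 texts, which is about 27,000 texts per second.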