# What will we do ⁉

### In this Notebook, we will perform the following steps with some changes

1. Compare Model Architectures: We will start by comparing two different model architectures suitable for the target task. This includes understanding the differences between the models in terms of their structure, complexity, and capabilities.

2. Data Loading and Preprocessing: We will implement data loading and preprocessing routines for JSON format file. This will involve reading data from the respective files, handling missing values, and preparing the input features and labels for each model.

3. Model Training and Evaluation: With the data loaded and preprocessed, we will train both model architectures separately using data generated. We will then evaluate the performance of each model on a test dataset to measure their accuracy and generalization capabilities.

4. Compare Results: Once the models are trained and evaluated, we will compare their performance and analyze the results to understand how the different model architectures and data formats impact the model's performance and predictive capabilities.

5. Considerations for Model and Data Comparison: After the comparison, we will discuss the insights gained from the experiment. We will consider the implications of using different model architectures and data formats, and how they affect the model's strengths and weaknesses in tackling the target task.

6. Best Model Selection: Based on the comparison results, we will identify the best-performing model for the specific task. The selected model will be chosen considering its performance, computational efficiency, and other relevant criteria.

----------


By conducting this experiment with different model architectures and data formats, we aim to gain valuable insights into the interplay between model choice and data representation. This will help us make informed decisions in future projects when selecting appropriate models and data formats for specific tasks and datasets.

### Data Descripsion


Format: Each line represents a single word in a sentence.
- Column 1 (Sentence ID): The sentence ID is listed in the first column.

- Column 2 (Word): This column contains the word itself.

- Column 3 (POS Tag): It contains the Part-of-Speech (POS) tag for the word.

- Column 4 (Chunking Tag): This column contains the chunking tag for the word. Chunking is the process of dividing text into syntactically related chunks or phrases.

- Column 5 (NE Label): If a word is part of a named entity, the Named Entity (NE) label is provided in this column. Otherwise, it is filled with "O" to indicate that the word does not have an NE label.

- Column 6 (Nested NE Label): This column is not used in this format and is also filled with "O".

NE labels are annotated using the IOB notation as in the CoNLL Shared Tasks. There are 7 labels: B-PER and I-PER are used for persons, B-ORG and I-ORG are used for organizations, B-LOC and I-LOC are used for locations, and O is used for other elements.


# Let's do this.

In [1]:
!pip install pandas
!pip install spacy
!pip install pyvi
!pip install spacy-transformers

Collecting pyvi
  Downloading pyvi-0.1.1-py2.py3-none-any.whl (8.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.5/8.5 MB[0m [31m40.7 MB/s[0m eta [36m0:00:00[0m
Collecting sklearn-crfsuite (from pyvi)
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3 (from sklearn-crfsuite->pyvi)
  Downloading python_crfsuite-0.9.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (993 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m993.5/993.5 kB[0m [31m42.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: python-crfsuite, sklearn-crfsuite, pyvi
Successfully installed python-crfsuite-0.9.9 pyvi-0.1.1 sklearn-crfsuite-0.3.6
Collecting spacy-transformers
  Downloading spacy_transformers-1.2.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.8/190.8 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
Collect

In [2]:
import spacy
import spacy.cli
import string
import spacy_transformers
from spacy.lang.vi import Vietnamese
import pandas as pd

## Data cleaning and something else...

#### Load dataset

In [18]:
df = pd.read_csv("spaCy.csv")
df

Unnamed: 0,sentence_id,word,pos_tag,chunk_tag,ne_label,nested_ne_label
0,vn-01,Bà,N,B-NP,O,O
1,vn-01,Mai,Np,I-NP,B-PER,O
2,vn-01,là,V,B-VP,O,O
3,vn-01,giáo_viên,N,B-NP,O,O
4,vn-01,tại,E,B-PP,O,O
...,...,...,...,...,...,...
919,vn-80,bà,N,B-PER,O,O
920,vn-80,Tô,Np,I-PER,B-PER,O
921,vn-80,Yến,N,I-PER,I-PER,O
922,vn-80,Hoa,Np,I-PER,I-PER,O


Check the NULL values in table

In [19]:
df.isnull().sum()

sentence_id         0
word               32
pos_tag             3
chunk_tag           0
ne_label            0
nested_ne_label     5
dtype: int64

#### Drop 'NULL' values before set new index of the data

In [20]:
df = df.dropna()
df

Unnamed: 0,sentence_id,word,pos_tag,chunk_tag,ne_label,nested_ne_label
0,vn-01,Bà,N,B-NP,O,O
1,vn-01,Mai,Np,I-NP,B-PER,O
2,vn-01,là,V,B-VP,O,O
3,vn-01,giáo_viên,N,B-NP,O,O
4,vn-01,tại,E,B-PP,O,O
...,...,...,...,...,...,...
919,vn-80,bà,N,B-PER,O,O
920,vn-80,Tô,Np,I-PER,B-PER,O
921,vn-80,Yến,N,I-PER,I-PER,O
922,vn-80,Hoa,Np,I-PER,I-PER,O


In [21]:
df.isnull().sum()

sentence_id        0
word               0
pos_tag            0
chunk_tag          0
ne_label           0
nested_ne_label    0
dtype: int64

In [22]:
df = df.reset_index()
df

Unnamed: 0,index,sentence_id,word,pos_tag,chunk_tag,ne_label,nested_ne_label
0,0,vn-01,Bà,N,B-NP,O,O
1,1,vn-01,Mai,Np,I-NP,B-PER,O
2,2,vn-01,là,V,B-VP,O,O
3,3,vn-01,giáo_viên,N,B-NP,O,O
4,4,vn-01,tại,E,B-PP,O,O
...,...,...,...,...,...,...,...
882,919,vn-80,bà,N,B-PER,O,O
883,920,vn-80,Tô,Np,I-PER,B-PER,O
884,921,vn-80,Yến,N,I-PER,I-PER,O
885,922,vn-80,Hoa,Np,I-PER,I-PER,O


In [23]:
df = df.set_index("index")

You can see, the presence of duplicate IDs within each dataset is undesirable, so we will remove them.

In [24]:
df = df.drop(["sentence_id"], axis=1)
df

Unnamed: 0_level_0,word,pos_tag,chunk_tag,ne_label,nested_ne_label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,Bà,N,B-NP,O,O
1,Mai,Np,I-NP,B-PER,O
2,là,V,B-VP,O,O
3,giáo_viên,N,B-NP,O,O
4,tại,E,B-PP,O,O
...,...,...,...,...,...
919,bà,N,B-PER,O,O
920,Tô,Np,I-PER,B-PER,O
921,Yến,N,I-PER,I-PER,O
922,Hoa,Np,I-PER,I-PER,O


Save the cleaned dataset

In [25]:
import csv
df.to_csv("/content/spaCy_vs2.csv")

## Now we need to create a new dataset for Fine tune process.

Why we need create a new dataset for Fine tune procces ⁉ 😕


---


In `version 1`, you can see in this [link]().
How the data significantly influences a model's results can be demonstrated by using a data format similar to the previous JSON format. This format allows us to observe the impact of data representation on the model's performance. JSON organizes data hierarchically with nested objects, which can affect how the model processes and learns from the information.

The quality, quantity, and relevance of the data play a crucial role in determining how well the model generalizes to new, unseen examples. In supervised learning, where the model learns from labeled data, the training data directly influences the model's ability to learn patterns and make accurate predictions.

By using JSON data, we can assess how the model performs compared to other data formats. We may encounter variations in data loading, preprocessing, and input representations. It will be essential to ensure that the dataset remains relevant to the task at hand, and any changes in data format do not introduce biases or inconsistencies that could affect the overall evaluation.

Throughout this experiment, we will maintain the dataset's integrity and relevance, focusing on how data preparation impacts the model's behavior and predictions. This analysis will help us understand the importance of data processing and its role in achieving optimal model performance. 😀

In [26]:
# Convert JSON file dataset from CSV file
import csv
import json

def csv_to_json(csv_path, json_path):
  jsonArr = []

  with open(csv_path, "r", encoding="utf-8") as csv_file:
    # Load csv file data using csv library's dictionary reader
    csvReader = csv.DictReader(csv_file)

    # Convert each csv row into Python dict
    for row in csvReader:
      jsonArr.append(row)

  with open(json_path, "w", encoding="utf-8") as json_file:
    # Use the json.dump() method with the ensure_ascii=False parameter
    # to ensure that Unicode characters are written as-is without being escaped
    json.dump(jsonArr, json_file, ensure_ascii=False, indent=4)
    # print(jsonString)

  print("Completed")

csv_path = "/content/spaCy_vs2.csv"
json_path = "/content/spaCy_vs2.json"
csv_to_json(csv_path, json_path)

Completed


And now we have the JSON file like this:
```JSON
 [
    {
        "index": "0",
        "word": "Bà",
        "pos_tag": "N",
        "chunk_tag": "B-NP",
        "ne_label": "O",
        "nested_ne_label": "O"
    },
    {
        "index": "1",
        "word": "Mai",
        "pos_tag": "Np",
        "chunk_tag": "I-NP",
        "ne_label": "B-PER",
        "nested_ne_label": "O"
    },
    {
        "index": "2",
        "word": "là",
        "pos_tag": "V",
        "chunk_tag": "B-VP",
        "ne_label": "O",
        "nested_ne_label": "O"
    },
    ...
 ]
```


Or you can see documentation [here](https://spacy.io/api/cli#convert) to convert from `.csv` to `.json` file.


> **In the next steps, we will use the method done in version 1 and then evaluate the effectiveness of the model.**


Make sure GPU used

In [12]:
import torch
torch.cuda.is_available()

False

# Fine-tuning model using generated datasets

Load the pre-trained model

In [13]:
nlp = Vietnamese()
nlp

<spacy.lang.vi.Vietnamese at 0x7b4fb83febf0>

Test model

In [14]:
doc_string = "Thẩm phán - Chủ tọa phiên tòa Bà Đặng Thị Tuyết Hải"
doc = nlp(doc_string)
for token in doc:
    print(token)

Thẩm phán
-
Chủ
tọa
phiên tòa
Bà
Đặng Thị Tuyết Hải


Look it's good, right?
But try another example.

In [15]:
docs_string = "Thư ký phiên tòa: Bà Trà Thị Thúy Diễm – Thư ký Tòa án nhân dân Quận 10, Thành phố Hồ Chí Minh."
tokens = nlp(docs_string)
for token in tokens:
  print(token)

Thư ký
phiên tòa
:
Bà
Trà Thị
Thúy Diễm
–
Thư ký
Tòa án
nhân dân
Quận
10
,
Thành phố
Hồ Chí Minh
.


Uhh, maybe something is wrong 😟  
Ok let's start next step.

#### Import json file

In [16]:
import json

with open("/content/spaCy_vs2.json", "r") as f:
  data = json.load(f)

#### Convert the data

In this project, we need to recognize human names, so I have added some conditions to filter out human names, reduce the size of the data file, and speed up the training process. If you want to recognize more components within a sentence, replace
```python
training_data = []
for example in data:
  ...
  entities = [(0, len(text), tag) for tag in (pos_tag, ne_label) if (pos_tag == "Np" and ne_label in ("B-PER", "I-PER"))]
    if entities:
      training_data.append({"text": text, "entities": entities})
```
with the following code:
```python
training_data = []
for example in data:
  ...
  entities = [(0, len(tag), tag) for tag in (pos_tag, chunk_tag, ne_label)]
  training_data.append({"text": text, "entities": entities})
```

In [None]:
training_data = []
for example in data:
  text = example["word"].replace(string.punctuation, "")
  pos_tag = example["pos_tag"]
  chunk_tag = example["chunk_tag"]
  ne_label = example["ne_label"]

  # filter to make sure we can collect all person names
  # entities = [(0, len(text), tag) for tag in (pos_tag, ne_label) if (pos_tag == "Np" and ne_label in ("B-PER", "I-PER"))]
  entitites = [(0, len(text), tag) for tag in ne_label if ne_label in ("B-PER", "I-PER")]
  if entities:
    training_data.append({"text": text, "entities": entities})

print([x for x in training_data])

#### Import training libraries

In [86]:
from spacy.tokens import DocBin
from tqdm import tqdm
from spacy.util import filter_spans

In [87]:
nlp = spacy.blank("vi")
nlp

<spacy.lang.vi.Vietnamese at 0x7b4fb23cc760>

#### Train the model
The below code will create a custom model with the data that we give. A binary file  named `train.spacy` will be generated at the end.

In [88]:
doc_bin = DocBin()
for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    # Process each training example and add to DocBin
    for training_example in training_data:
      text = training_example['text']
      labels = training_example['entities']
      doc = nlp.make_doc(text)
      ents = []
      for start, end, ent_label in labels:
        for tag in ent_label:
          span = doc.char_span(start, end, label=ent_label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

100%|██████████| 140/140 [00:01<00:00, 111.41it/s]


Or you can convert the training JSON files to .spacy binary file using this command (update the file path with your own):

`!python -m spacy convert content/spaCy_vs2.json ./ -t spacy`  

see more [here](https://spacy.io/api/cli#convert)

In [89]:
!python -m spacy init fill-config base_config.cfg config.cfg

[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Debuging

In [90]:
!python -m spacy debug data ./config.cfg

[1m
[38;5;2m✔ Pipeline can be initialized with data[0m
[38;5;2m✔ Corpus is loadable[0m
[1m
Language: vi
Training pipeline: tok2vec, ner
140 training docs
140 evaluation docs
[38;5;3m⚠ 1 training examples also in evaluation data[0m
[38;5;3m⚠ Low number of examples to train a new pipeline (140)[0m
[1m
[38;5;4mℹ 140 total word(s) in the data (1 unique)[0m
[38;5;4mℹ No word vectors present in the package[0m
[1m
[38;5;4mℹ 1 label(s)[0m
0 missing value(s) (tokens with '-' label)
[38;5;2m✔ Good amount of examples for all labels[0m
[38;5;2m✔ Examples without occurrences available for all labels[0m
[38;5;2m✔ No entities consisting of or starting/ending with whitespace[0m
[38;5;2m✔ No entities crossing sentence boundaries[0m
[1m
[38;5;2m✔ 6 checks passed[0m


In [91]:
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy

[38;5;2m✔ Created output directory: output[0m
[38;5;4mℹ Saving to output directory: output[0m
[38;5;4mℹ Using CPU[0m
[1m
[2023-07-22 08:28:28,974] [INFO] Set up nlp object from config
[2023-07-22 08:28:29,011] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-07-22 08:28:29,018] [INFO] Created vocabulary
[2023-07-22 08:28:29,018] [INFO] Finished initializing nlp object
[2023-07-22 08:28:29,323] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     83.33  100.00  100.00  100.00    1.00
192     200          0.05     96.68  100.00  100.00  100.00    1.00
392     400          0.00      0.00  100.00  100.00  100.00    1.00
592     600          0.00      0.00  100.00  100.00  100.00    1.00
792     80

In [92]:
nlp_ner = spacy.load("output/model-best")
nlp_ner

<spacy.lang.vi.Vietnamese at 0x7b4fb505b6a0>

### Test our model

In [93]:
doc = nlp_ner("Ông Tô Bình Yi, sinh năm 1970 (Có đơn xin vắng mặt)")

spacy.displacy.render(doc, style="ent", jupyter=True)

In [94]:
doc1 = nlp_ner("Thư ký phiên tòa: Bà Trà Thị Thúy Diễm – Thư ký Tòa án nhân dân Quận 10, Thành phố Hồ Chí Minh. ")
spacy.displacy.render(doc1, style="ent", jupyter=True)

In [101]:
doc2 = nlp_ner("Bị đơn: Ông Nguyễn Đăng T, sinh năm: 1989")
spacy.displacy.render(doc2, style="ent", jupyter=True)

In [103]:
ents = [(e.text, e.label_) for e in doc.ents]
ents

[('Ông', 'Np'), ('Tô Bình Yi, sinh năm 1970 (Có đơn xin vắng mặt)', 'Np')]

# Transformer BERT using the same dataset

In [105]:
!python -m spacy init fill-config base_config_transfer.cfg config_transfer.cfg

[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config_transfer.cfg
You can now add your data and train your pipeline:
python -m spacy train config_transfer.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [106]:
!python -m spacy train config_transfer.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy

[38;5;4mℹ Saving to output directory: output[0m
[38;5;4mℹ Using CPU[0m
[1m
[2023-07-22 09:46:08,017] [INFO] Set up nlp object from config
[2023-07-22 09:46:08,036] [INFO] Pipeline: ['transformer', 'ner']
[2023-07-22 09:46:08,040] [INFO] Created vocabulary
[2023-07-22 09:46:08,041] [INFO] Finished initializing nlp object
Downloading (…)okenizer_config.json: 100% 28.0/28.0 [00:00<00:00, 163kB/s]
Downloading (…)lve/main/config.json: 100% 625/625 [00:00<00:00, 3.79MB/s]
Downloading (…)solve/main/vocab.txt: 100% 872k/872k [00:00<00:00, 18.3MB/s]
Downloading (…)/main/tokenizer.json: 100% 1.72M/1.72M [00:00<00:00, 46.6MB/s]
Downloading model.safetensors: 100% 672M/672M [00:03<00:00, 212MB/s]
Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight',

In [108]:
nlp = spacy.load("output/model-best")
nlp

<spacy.lang.vi.Vietnamese at 0x7b4fb0f92410>

### Testing

In [109]:
doc = nlp("Ông Tô Bình Yi, sinh năm 1970 (Có đơn xin vắng mặt)")

spacy.displacy.render(doc, style="ent", jupyter=True)

In [110]:
doc1 = nlp("Thư ký phiên tòa: Bà Trà Thị Thúy Diễm – Thư ký Tòa án nhân dân Quận 10, Thành phố Hồ Chí Minh. ")
spacy.displacy.render(doc1, style="ent", jupyter=True)

In [111]:
doc2 = nlp("Bị đơn: Ông Nguyễn Đăng T, sinh năm: 1989")
spacy.displacy.render(doc2, style="ent", jupyter=True)

In [None]:
text = 
'''NƯỚC CỘNG HÒA XÃ HỘI CHỦ NGHĨA VIỆT NAM
TÒA ÁN NHÂN DÂN QUẬN 10, THÀNH PHỐ HỒ CHÍ MINH
- Thành phần Hội đồng xét xử sơ thẩm gồm có:
Thẩm phán - Chủ tọa phiên tòa: Bà Lê Thị Lan 
Các Hội thẩm nhân dân:
1. Bà Nguyễn Thị Thu Hằng
2. Ông Nguyễn Vi Tường Thụy 
- Thư ký phiên tòa: Bà Phạm Hà Thiên Tâm - Thư ký Tòa án, Tòa án nhân dân Quận 10, Thành phố Hồ Chí Minh.
- Đại diện Viện kiểm sát nhân dân Quận 10, Thành phố Hồ Chí Minh tham gia phiên tòa: Ông Nguyễn Tuấn Anh - Kiểm sát viên
Ngày 06 tháng 01 năm 2020 tại trụ sở Toà án nhân dân Quận 10, Thành phố Hồ Chí Minh, xét xử sơ thẩm công khai vụ án thụ lý số: 629/2019/TLST-HNGĐ ngày 07 tháng 10 năm 2019 về tranh chấp ly hôn, theo Quyết định đưa vụ án ra xét xử số: 331/2019/QĐXXST-HNGĐ ngày 12 tháng 12 năm 2019 và Quyết định hoãn phiên toà số: 231/2019/QĐST-HNGĐ ngày 25 tháng 12 năm 2019, giữa các đương sự:
- Nguyên đơn: Bà Lê Ngân H, sinh năm: 1989
Địa chỉ: Số 73 đường Phó Đức Chính, phường V, Thành phố Nha Trang, tỉnh Khánh Hoà. (Có đơn xin vắng mặt)
- Bị đơn: Ông Nguyễn Đăng T, sinh năm: 1989
Địa chỉ: Số 132 đường Hùng vương, Phường X, Quận D, Thành phố Hồ Chí Minh. (Vắng mặt)
NỘI DUNG VỤ ÁN:
- Tại đơn khởi kiện ngày 23/9/2019, cùng các tài liệu, chứng cứ có trong hồ sơ, nguyên đơn bà Lê Ngân H trình bày: Bà và ông Nguyễn Đăng T tự nguyện chung sống và đăng ký kết hôn tại Uỷ ban nhân dân Phường X, Quận D, Thành phố Hồ Chí Minh, theo giấy chứng nhận kết hôn số 98, quyển số 01/2014 ngày 06/11/2014.
Sau khi kết hôn, vì nhiều nguyên nhân trong đó có việc ông T có quan hệ tình cảm với người phụ nữ khác dẫn đến vợ chồng đã bắt đầu phát sinh nhiều mâu thuẫn. Vì muốn níu kéo hạnh phúc gia đình, bà H đã nhiều lần bỏ qua nhưng ông T vẫn không thay đổi. Từ tháng 9 năm 2018, bà H đã dọn ra khỏi nhà và vợ chồng sống ly thân cho đến nay. Nhận thấy tình cảm vợ chồng không còn khả năng hàn gắn nên bà yêu cầu Toà giải quyết cho ly hôn để ổn định cuộc sống.
Về con chung: Bà H khai, giữa bà và ông T chung sống không có con chung.
Về tài sản chung: Bà H không yêu cầu Toà án giải quyết.
Và nợ chung: Bà H khai không có
Ngày 10/12/2019, bà H có đơn đề nghị Toà án xét xử vắng mặt.
Toà án tống đạt thông báo thụ lý, các văn bản tố tụng khác cho ông T nhưng ông T vắng mặt không có lý do.
Đại diện Viện Kiểm sát nhân dân Quận D phát biểu quan điểm về việc tuân thủ pháp luật về tố tụng của Thẩm phán và Hội đồng xét xử từ giai đoạn thụ lý đến khi nghị án là tuân thủ đúng quy định pháp luật, đầy đủ.
Về nội dung: Kiểm sát viên đề nghị chấp nhận yêu cầu của nguyên đơn. 
'''

documents = nlp(text)
ents = [(ents.text, ents.start_char, ents.end_char, ents.label_) for ents in documents.ents]
ents

You can see the the result is not too much better, cause 1 part of data in missing many tag likes: pos_tag, chunk_tag so the model does not reach the best state, you can see version 3 we using all tag in dataset, or Name_Entity_Recognition model in here