# How to Fine-Tune spaCy models for NLP Use Cases?

## What is spaCy?

- spaCy is an open-source Python library for Natural Language Processing (NLP) tasks such as part-of-speech tagging, named entity recognition, and dependency parsing.

- It was developed to provide industrial-strength performance while remaining easy to use and to integrate into existing workflows.

- It ships trained pipelines and supports training custom components, which makes it suitable for both beginners and experienced NLP practitioners.

## Prerequisites

### Basic knowledge of spaCy

The [official documentation site](https://spacy.io/usage/spacy-101) of spaCy provides a lot of information about the tool. Please read the documentation.

## Pre-process the training data

### Why do we need to pre-process the data?

Collecting data covers just one part of the equation. We need to pre-process the data and transform it in a way that spaCy can easily understand. We should also define what kind of data (tags) should be identified from the given sentences.

Let's take the following sentence as an example:
> "Schedule event for visit to Trivandrum on July 18".  

Let's pick out some tags from the sentence above:

- Schedule – this belongs to the "action" tag
- event – this belongs to the "domain" tag
- visit to Trivandrum – this belongs to the "name" tag
- July 18 – this belongs to the "date" tag
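
To make this concrete, each tag can be stored as a `(start, end, label)` triple of character offsets, which is the shape used for the training data later on. The following is a minimal sketch, not part of the original notebook: the helper name `locate` is hypothetical, and it assumes each phrase occurs exactly once in the sentence.

```python
# Sketch: derive character offsets for the tags listed above.
sentence = "Schedule event for visit to Trivandrum on July 18"

# Phrase -> tag, exactly as in the list above.
tags = {
    "Schedule": "action",
    "event": "domain",
    "visit to Trivandrum": "name",
    "July 18": "date",
}


def locate(text, phrase):
    """Return (start, end) of the first occurrence of phrase; end is exclusive."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return start, start + len(phrase)


annotations = [(*locate(sentence, phrase), label) for phrase, label in tags.items()]

for start, end, label in annotations:
    # Slicing with the offsets gives back the original phrase.
    print(f"{label:<7}{start:>4}{end:>4}  {sentence[start:end]}")
```

Slicing the sentence with each pair of offsets is a cheap way to confirm that an annotation covers exactly the words you intended.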

Every tag defined above may take different values in other sentences. For example, we might receive the following inputs:

1. Cancel client meeting scheduled tomorrow
2. Change time of mall visit to 6 PM

I created 51 annotated examples and recorded the character offsets of each tag in a JSON file named `spacy.json`. Here is an excerpt:

```json 
{
  "Example": [
    {
      "id": "vn-01",
      "content": "Hôm nay, tôi gặp em Trần Thị Thanh Hương tại công ty Techcombank.",
      "annotations": [
        {"start": 12, "end": 30, "tag_name": "person_name"},
        {"start": 19, "end": 30, "tag_name": "last_person_name"},
        {"start": 34, "end": 42, "tag_name": "location_name"},
        {"start": 43, "end": 54, "tag_name": "organization_name"}
      ]
    },
    {
      "id": "vn-02",
      "content": "Tôi đã gặp anh Nguyễn Văn An tại quán cà phê Trung Nguyên.",
      "annotations": [
        {"start": 11, "end": 23, "tag_name": "person_name"},
        {"start": 19, "end": 23, "tag_name": "last_person_name"},
        {"start": 27, "end": 41, "tag_name": "location_name"},
        {"start": 42, "end": 54, "tag_name": "organization_name"}
      ]
    },
    {
      "id": "vn-03",
      "content": "Chị Lê Thị Hạnh làm việc tại công ty Vingroup.",
      "annotations": [
        {"start": 3, "end": 15, "tag_name": "person_name"},
        {"start": 9, "end": 15, "tag_name": "last_person_name"},
        {"start": 18, "end": 25, "tag_name": "location_name"},
        {"start": 26, "end": 34, "tag_name": "organization_name"}
      ]
    }
  ]
}
```

You can see more in this [link](https://github.com/TungCan273/Fine-tuning/blob/master/Spacy/spacy.json).
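
A quick way to sanity-check annotations like these is to slice the text with each pair of offsets and read the result. The sketch below is not part of the original notebook. It assumes by default that `end` is exclusive (Python's slice convention); whether the real file uses inclusive or exclusive ends is not stated here, so the `end_inclusive` flag is an assumption you would set after inspecting the output. `preview_annotations` and the sample record are hypothetical.

```python
def preview_annotations(example, end_inclusive=False):
    """Return (tag_name, covered_text) pairs for one annotated example."""
    text = example["content"]
    previews = []
    for ann in example["annotations"]:
        end = ann["end"] + 1 if end_inclusive else ann["end"]
        previews.append((ann["tag_name"], text[ann["start"]:end]))
    return previews


# Illustrative record in the same layout as spacy.json (not taken from the real file).
record = {
    "id": "demo-01",
    "content": "Anna works at Acme.",
    "annotations": [
        {"start": 0, "end": 4, "tag_name": "person_name"},
        {"start": 14, "end": 18, "tag_name": "organization_name"},
    ],
}

for tag, covered in preview_annotations(record):
    # A span that starts or ends mid-word may fail to align with tokens later.
    print(f"{tag}: {covered!r}")
```

Running the same function over every entry of the real file shows at a glance which annotations cover the intended words.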

# Let's try to fine-tune spaCy with the data that we have.

#### Install spaCy and the other libraries, then import them

In [1]:
!pip install spacy
!pip install pyvi
!pip install https://gitlab.com/trungtv/vi_spacy/-/raw/master/packages/vi_core_news_lg-3.6.0/dist/vi_core_news_lg-3.6.0.tar.gz


In [2]:
!pip install spacy-transformers


In [3]:
import spacy
import spacy.cli
import string
import spacy_transformers
from spacy.lang.vi import Vietnamese
from spacy.vocab import Vocab

#### Load the pre-trained model

Here we use a model trained for Vietnamese; you can use a model trained for English instead if you prefer.

In [4]:
# nlp = spacy.load("en_core_web_trf")
nlp = spacy.load("vi_core_news_lg")

In [5]:
nlp

<spacy.lang.vi.Vietnamese at 0x78ea33565d50>

Check how the model tokenizes a sample sentence:

In [6]:
doc_string = "Thẩm phán - Chủ tọa phiên tòa Bà Đặng Thị Tuyết Hải"
doc = nlp(doc_string)
for token in doc:
    print(token)

Thẩm phán
-
Chủ
tọa
phiên tòa
Bà
Đặng Thị Tuyết Hải


#### Import JSON file

In [7]:
import json

# Read the file as UTF-8 so the Vietnamese characters are decoded correctly.
with open('/content/spacy.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
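
Before converting anything, it can help to count how many annotations each tag has, since labels with only a handful of examples are hard to learn. This is a small sketch, assuming the JSON layout shown earlier (`Example` → `annotations` → `tag_name`); the function name `count_tags` and the inline sample are made up for illustration.

```python
from collections import Counter


def count_tags(data):
    """Count annotations per tag_name across all examples."""
    counts = Counter()
    for example in data["Example"]:
        for annotation in example["annotations"]:
            counts[annotation["tag_name"]] += 1
    return counts


# Tiny inline stand-in for the parsed spacy.json (illustrative values only).
sample = {
    "Example": [
        {"content": "Anna works at Acme.",
         "annotations": [{"start": 0, "end": 4, "tag_name": "person_name"},
                         {"start": 14, "end": 18, "tag_name": "organization_name"}]},
        {"content": "Bob met Anna.",
         "annotations": [{"start": 0, "end": 3, "tag_name": "person_name"},
                         {"start": 8, "end": 12, "tag_name": "person_name"}]},
    ]
}

tag_counts = count_tags(sample)
for tag, n in tag_counts.most_common():
    print(f"{tag}: {n}")
```

With the real file, the call would be `count_tags(data)` on the object returned by `json.load`.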

#### Convert the Data

Convert the data read from the JSON file into a list of dictionaries, each containing the original text and its entities.

In [8]:
training_data = []
for example in data['Example']:
    temp_dict = {}
    # Keep the text unchanged: the annotation offsets index into the original string.
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        # Shift the end offset by one (this treats 'end' in the JSON as inclusive).
        end = annotation['end'] + 1
        label = annotation['tag_name'].upper()
        temp_dict['entities'].append((start, end, label))
    training_data.append(temp_dict)
print(training_data[0])

{'text': 'Hôm nay, tôi gặp em Trần Thị Thanh Hương tại công ty Techcombank.', 'entities': [(12, 31, 'PERSON_NAME'), (19, 31, 'LAST_PERSON_NAME'), (34, 43, 'LOCATION_NAME'), (43, 55, 'ORGANIZATION_NAME')]}
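
In the printed example, `PERSON_NAME` (12, 31) and `LAST_PERSON_NAME` (19, 31) overlap. spaCy's `doc.ents` cannot hold overlapping spans, which is why `filter_spans` is applied further down; it prefers the longest span. The pure-Python sketch below mimics that preference on raw `(start, end, label)` triples so the effect is visible without spaCy. `drop_overlaps` is a hypothetical helper that approximates, rather than reproduces, spaCy's implementation.

```python
def drop_overlaps(entities):
    """Keep the longest spans first and discard any span overlapping a kept one."""
    # Longest first; ties broken by earlier start.
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ordered:
        # Keep the span only if it is disjoint from every span kept so far
        # (half-open ranges: touching at a boundary is not an overlap).
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


entities = [
    (12, 31, "PERSON_NAME"),
    (19, 31, "LAST_PERSON_NAME"),
    (34, 43, "LOCATION_NAME"),
    (43, 55, "ORGANIZATION_NAME"),
]

print(drop_overlaps(entities))
```

Here the shorter `LAST_PERSON_NAME` span is dropped in favour of the enclosing `PERSON_NAME` span, which is consistent with the `debug data` report later on listing no `LAST_PERSON_NAME` label.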


#### Import training libraries

In [9]:
from spacy.tokens import DocBin
from tqdm import tqdm
from spacy.util import filter_spans

In [10]:
nlp = spacy.blank("vi")
nlp

<spacy.lang.vi.Vietnamese at 0x78e902d78cd0>

#### Build the training file
The code below does not train the model yet. It converts the annotated examples into spaCy `Doc` objects and saves them to a binary file named `train.spacy`, which the training command reads later.

In [11]:
doc_bin = DocBin()
for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")

100%|██████████| 51/51 [00:00<00:00, 660.67it/s]

Skipping entity
Skipping entity
...

The cell prints `Skipping entity` 84 times in total (repeats omitted here). Each occurrence means `doc.char_span` returned `None` because the annotation's character offsets could not be aligned to token boundaries, so that entity is left out of the training file. With 84 skipped annotations across 51 examples, it is worth checking the offsets in `spacy.json` against the text before training.

In [12]:
import torch
torch.cuda.is_available()

False

In [13]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


### Create a config file

spaCy 3 uses a config file, `config.cfg`, that contains all the settings and components needed to train the model. On the [spaCy training page](https://spacy.io/usage/training), you can select the language of the model (Vietnamese in this tutorial), the component (NER) and the hardware (CPU or GPU), then download the config file template.

> For this model we use a file named `base_config.cfg`, like [this one](https://github.com/TungCan273/Fine-tuning/blob/master/Spacy/base_config.cfg).

After you’ve saved the starter config to a file `base_config.cfg`, you can use the [init fill-config](https://spacy.io/api/cli#init-fill-config) command to fill in the remaining defaults. Training configs should always be complete and without hidden defaults, to keep your experiments reproducible.

In [22]:
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
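
To double-check what ended up in the filled config, the file can be read with Python's standard `configparser`, since spaCy configs use INI-style sections with JSON-formatted values. This is only a quick inspection sketch: spaCy itself loads configs through its own config system, values that contain variable references such as `${paths.train}` are not valid JSON and would need separate handling, and the inline `CONFIG_TEXT` is a shortened, illustrative stand-in for `config.cfg`, not the file generated above.

```python
import configparser
import json

# Shortened stand-in for config.cfg (illustrative values, assumed layout).
CONFIG_TEXT = """
[paths]
train = null
dev = null

[nlp]
lang = "vi"
pipeline = ["tok2vec","ner"]
"""


def read_setting(parser, section, key):
    """Return one config value decoded from its JSON representation."""
    return json.loads(parser[section][key])


# interpolation=None leaves references such as ${paths.train} untouched.
parser = configparser.ConfigParser(interpolation=None)
# For a real file: parser.read("config.cfg", encoding="utf-8")
parser.read_string(CONFIG_TEXT)

lang = read_setting(parser, "nlp", "lang")
pipeline = read_setting(parser, "nlp", "pipeline")
print(lang, pipeline)
```

Checking `lang` and `pipeline` this way is a quick guard against training with a template that was exported for a different language or component.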


The next cell runs `spacy debug data`, which validates the training and evaluation data against the config.  
The [spaCy debug](https://spacy.io/api/cli#debug) CLI includes helpful commands for debugging and profiling your configs, data and implementations.

In [23]:
!python -m spacy debug data ./config.cfg

✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: vi
Training pipeline: tok2vec, ner
51 training docs
51 evaluation docs
⚠ 51 training examples also in evaluation data
✘ Low number of examples to train a new pipeline (51)

ℹ 473 total word(s) in the data (226 unique)
ℹ No word vectors present in the package

ℹ 4 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Low number of examples for label 'ORGANIZATION_NAME' (16)
⚠ Low number of examples for label 'PERSON_NAME' (36)
⚠ Low number of examples for label 'LOCATION_NAME' (22)
⚠ Low number of examples for label 'DATE' (2)
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities crossing sentence boundaries

✔ 5 c

Instead of exporting your starter config from the [quickstart](https://spacy.io/usage/training#quickstart) widget and auto-filling it, you can also use the [init config](https://spacy.io/api/cli#init-config) command and specify your requirements and settings as CLI arguments.

You can now add your data and run `train` with your config. See the [convert](https://spacy.io/api/cli#convert) command for details on how to convert your data to spaCy's binary `.spacy` format. You can either include the data paths in the `[paths]` section of your config, or pass them in on the command line.

In [24]:
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy

ℹ Saving to output directory: output
ℹ Using CPU

[2023-07-16 17:06:52,952] [INFO] Set up nlp object from config
[2023-07-16 17:06:52,974] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-07-16 17:06:52,979] [INFO] Created vocabulary
[2023-07-16 17:06:52,979] [INFO] Finished initializing nlp object
[2023-07-16 17:06:53,459] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     60.67    0.00    0.00    0.00    0.00
 42     200         25.87   1649.57  100.00  100.00  100.00    1.00
 92     400          0.38      0.98  100.00  100.00  100.00    1.00
158     600         20.99     79.67  100.00  100.00  100.00    1.00
228     800         73.65    150.50  100.00  100.00  100.0
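
The `ENTS_P`, `ENTS_R` and `ENTS_F` columns are entity-level precision, recall and F-score. As a rough illustration of what they measure, the sketch below scores predictions by exact match on `(start, end, label)`. The function name and the toy spans are hypothetical, and spaCy's own scorer handles more cases than this.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) triples."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [(0, 4, "PERSON_NAME"), (14, 18, "ORGANIZATION_NAME")]
# The second prediction has the right label but a wrong boundary, so it counts as a miss.
predicted = [(0, 4, "PERSON_NAME"), (14, 19, "ORGANIZATION_NAME")]

p, r, f = entity_prf(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

Under exact matching, an entity with a slightly wrong boundary earns no credit at all, which is why boundary errors weigh heavily on these scores.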

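When training finishes, the `output` directory contains two folders, `model-best` and `model-last`. Note that the 100.00 scores above are measured on `train.spacy` itself, because the same file is passed as `--paths.dev`; they show that the model fits the training sentences, not how well it generalizes.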
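
The training command above passes `train.spacy` as both the training and the evaluation set, and `debug data` warned that all 51 training examples also appear in the evaluation data. One common alternative is to hold out part of `training_data` before building the `.spacy` files. This is a minimal sketch, assuming an 80/20 split and a fixed seed; `split_examples` is a hypothetical helper and the dummy list stands in for the real `training_data`.

```python
import random


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of examples and return (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Stand-in for the 51 converted examples.
dummy_examples = [{"text": f"sentence {i}", "entities": []} for i in range(51)]

train_examples, dev_examples = split_examples(dummy_examples)
print(len(train_examples), len(dev_examples))
```

Each list would then go through the same `DocBin` loop and be saved as `train.spacy` and `dev.spacy`, matching the paths in the command printed by `init fill-config`.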

In [27]:
nlp_ner = spacy.load("output/model-best")

In [28]:
nlp_ner

<spacy.lang.vi.Vietnamese at 0x78e900f4abc0>

# Test our model

Install the `python-docx` library to read `.docx` files.

In [29]:
!pip install python-docx



In [31]:
import docx

path_docx = "/content/6 HN_MAU.docx"

def getText(filename):
    """Return the text of all paragraphs in a .docx file, one paragraph per line."""
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText(path_docx))

NHÂN DANH
NƯỚC CỘNG HÒA XÃ HỘI CHỦ NGHĨA VIỆT NAM
TÒA ÁN NHÂN DÂN QUẬN 10, THÀNH PHỐ HỒ CHÍ MINH
- Thành phần Hội đồng xét xử sơ thẩm gồm có:
Thẩm phán - Chủ tọa phiên tòa: Bà Lê Thị Lan 
Các Hội thẩm nhân dân:
1. Bà Nguyễn Thị Thu Hằng
2. Ông Nguyễn Vi Tường Thụy 
- Thư ký phiên tòa: Bà Phạm Hà Thiên Tâm - Thư ký Tòa án, Tòa án nhân dân Quận 10, Thành phố Hồ Chí Minh.
- Đại diện Viện kiểm sát nhân dân Quận 10, Thành phố Hồ Chí Minh tham gia phiên tòa: Ông Nguyễn Tuấn Anh - Kiểm sát viên
 Ngày 06 tháng 01 năm 2020 tại trụ sở Toà án nhân dân Quận 10, Thành phố Hồ Chí Minh, xét xử sơ thẩm công khai vụ án thụ lý số: 629/2019/TLST-HNGĐ ngày 07 tháng 10 năm 2019 về tranh chấp ly hôn, theo Quyết định đưa vụ án ra xét xử số: 331/2019/QĐXXST-HNGĐ ngày 12 tháng 12 năm 2019 và Quyết định hoãn phiên toà số: 231/2019/QĐST-HNGĐ ngày 25 tháng 12 năm 2019, giữa các đương sự:
 - Nguyên đơn: Bà Lê Ngân H, sinh năm: 1989
Địa chỉ: Số 73 đường Phó Đức Chính, phường V, Thành phố Nha Trang, tỉnh Khánh Hoà. 

In [33]:
doc = nlp_ner("Ông Tô Bình Yi, sinh năm 1970 (Có đơn xin vắng mặt)")

spacy.displacy.render(doc, style="ent", jupyter=True)

In [34]:
doc1 = nlp_ner("Thư ký phiên tòa: Bà Trà Thị Thúy Diễm – Thư ký Tòa án nhân dân Quận 10, Thành phố Hồ Chí Minh. ")
spacy.displacy.render(doc1, style="ent", jupyter=True)

In [35]:
doc2 = nlp_ner("Bà Phan Thị Cẩm Ngọc, sinh năm 1975 (Có mặt)")
spacy.displacy.render(doc2, style = "ent", jupyter=True)

In [36]:
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
ents1 = [(e.text, e.start_char, e.end_char, e.label_) for e in doc1.ents]
print(ents1)
ents2 = [(e.text, e.label_) for e in doc2.ents]
print(ents2)

[('Tô Bình Yi,', 4, 15, 'PERSON_NAME')]
[(': Bà Trà Thị', 16, 28, 'PERSON_NAME'), ('Tòa án', 48, 54, 'ORGANIZATION_NAME'), ('Quận 10,', 64, 72, 'LOCATION_NAME')]
[('Cẩm Ngọc,', 'PERSON_NAME')]
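
Several predicted spans above include stray punctuation, for example a trailing comma or a leading colon. One possible post-processing step, not part of the original notebook, is to trim such characters and adjust the offsets to match. The sketch assumes entities are `(text, start_char, end_char, label)` tuples as printed above; `trim_entity` is a hypothetical helper.

```python
import string

# Characters to strip from both ends of a predicted span.
STRIP_CHARS = string.punctuation + string.whitespace


def trim_entity(entity):
    """Strip punctuation/whitespace from both ends and shift the offsets to match."""
    text, start, end, label = entity
    left_trimmed = text.lstrip(STRIP_CHARS)
    start += len(text) - len(left_trimmed)
    trimmed = left_trimmed.rstrip(STRIP_CHARS)
    end -= len(left_trimmed) - len(trimmed)
    return trimmed, start, end, label


predicted = [
    ("Tô Bình Yi,", 4, 15, "PERSON_NAME"),
    (": Bà Trà Thị", 16, 28, "PERSON_NAME"),
]

for entity in predicted:
    print(trim_entity(entity))
```

This only tidies the output; it does not address why the predicted boundaries are off in the first place.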


In [45]:
text = getText("/content/6 HN_MAU.docx")
# Remove the " - " separators. Note that deleting them (rather than replacing
# them with a space) joins the neighbouring words, as the output below shows.
combined_text = text.replace(" - ", "")

# Split the combined_text into individual lines
lines = combined_text.splitlines()

# Join the lines into a single string separated by spaces
text_ = " ".join(lines)
text_

'NHÂN DANH NƯỚC CỘNG HÒA XÃ HỘI CHỦ NGHĨA VIỆT NAM TÒA ÁN NHÂN DÂN QUẬN 10, THÀNH PHỐ HỒ CHÍ MINH - Thành phần Hội đồng xét xử sơ thẩm gồm có: Thẩm phánChủ tọa phiên tòa: Bà Lê Thị Lan  Các Hội thẩm nhân dân: 1. Bà Nguyễn Thị Thu Hằng 2. Ông Nguyễn Vi Tường Thụy  - Thư ký phiên tòa: Bà Phạm Hà Thiên TâmThư ký Tòa án, Tòa án nhân dân Quận 10, Thành phố Hồ Chí Minh. - Đại diện Viện kiểm sát nhân dân Quận 10, Thành phố Hồ Chí Minh tham gia phiên tòa: Ông Nguyễn Tuấn AnhKiểm sát viên  Ngày 06 tháng 01 năm 2020 tại trụ sở Toà án nhân dân Quận 10, Thành phố Hồ Chí Minh, xét xử sơ thẩm công khai vụ án thụ lý số: 629/2019/TLST-HNGĐ ngày 07 tháng 10 năm 2019 về tranh chấp ly hôn, theo Quyết định đưa vụ án ra xét xử số: 331/2019/QĐXXST-HNGĐ ngày 12 tháng 12 năm 2019 và Quyết định hoãn phiên toà số: 231/2019/QĐST-HNGĐ ngày 25 tháng 12 năm 2019, giữa các đương sự: Nguyên đơn: Bà Lê Ngân H, sinh năm: 1989 Địa chỉ: Số 73 đường Phó Đức Chính, phường V, Thành phố Nha Trang, tỉnh Khánh Hoà. (Có đơn xin
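
In the flattened text above, removing `" - "` outright glues neighbouring words together (`Thẩm phánChủ tọa`). A gentler variant, shown here as a sketch rather than what the notebook ran, replaces the separator with a space and collapses runs of whitespace. `flatten_text` is a hypothetical helper.

```python
def flatten_text(text):
    """Join all lines into one string without gluing words together."""
    # Replace the spaced dash separator with a single space instead of deleting it.
    text = text.replace(" - ", " ")
    # str.split() with no arguments splits on any whitespace run, including newlines.
    return " ".join(text.split())


sample = "Thẩm phán - Chủ tọa phiên tòa: Bà Lê Thị Lan \nCác Hội thẩm nhân dân:"
print(flatten_text(sample))
```

Keeping word boundaries intact matters here because the tokenizer, and therefore the entity recognizer, sees a glued pair of words as one unfamiliar string.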

In [46]:
docs_ = nlp_ner(text_)

In [47]:
ents = [(e.text, e.label_) for e in docs_.ents]
ents

[('NAM TÒA ÁN NHÂN DÂN', 'LOCATION_NAME'),
 ('QUẬN 10,', 'LOCATION_NAME'),
 ('HỒ CHÍ MINH -', 'LOCATION_NAME'),
 (': Thẩm', 'PERSON_NAME'),
 (': Bà', 'PERSON_NAME'),
 ('Hội thẩm', 'PERSON_NAME'),
 (': 1', 'PERSON_NAME'),
 ('Nguyễn Vi Tường Thụy', 'PERSON_NAME'),
 (': Bà', 'PERSON_NAME'),
 ('ký Tòa án', 'PERSON_NAME'),
 ('Tòa án', 'ORGANIZATION_NAME'),
 ('Quận 10,', 'LOCATION_NAME'),
 ('Quận 10,', 'LOCATION_NAME'),
 (': Ông', 'PERSON_NAME'),
 ('tại trụ sở', 'ORGANIZATION_NAME'),
 ('Toà án', 'ORGANIZATION_NAME'),
 ('Quận 10,', 'LOCATION_NAME'),
 (': 629', 'PERSON_NAME'),
 ('/', 'PERSON_NAME'),
 (': 331', 'PERSON_NAME'),
 ('/QĐXXST', 'PERSON_NAME'),
 ('hoãn phiên toà', 'ORGANIZATION_NAME'),
 (': 231', 'PERSON_NAME'),
 ('/QĐST-', 'LOCATION_NAME'),
 (': Bà', 'PERSON_NAME'),
 ('Lê Ngân H,', 'PERSON_NAME'),
 (': Số', 'PERSON_NAME'),
 ('đường Phó Đức Chính', 'LOCATION_NAME'),
 (': Ông', 'PERSON_NAME'),
 (': Số', 'PERSON_NAME'),
 ('đường', 'PERSON_NAME'),
 ('Quận D,', 'LOCATION_NAME'),
 (': Tại

In [48]:
spacy.displacy.render(docs_, style = "ent", jupyter=True)

### Show person names

In [None]:
list_name = []
for e in docs_.ents:
    if e.label_ == "PERSON_NAME":
        list_name.append(e.text)
        print(e)
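
Because the same span text can be predicted many times (`: Bà` appears three times in the list above), it may be more useful to collect the distinct names in order of first appearance. This is a small sketch over `(text, label)` pairs such as the `ents` list built earlier; `unique_by_label` is a hypothetical helper.

```python
def unique_by_label(entities, label):
    """Return distinct entity texts with the given label, in first-seen order."""
    seen = {}
    for text, ent_label in entities:
        if ent_label == label:
            seen.setdefault(text, None)  # dicts preserve insertion order
    return list(seen)


# A few pairs copied from the prediction list shown above.
sample_ents = [
    (": Bà", "PERSON_NAME"),
    ("Nguyễn Vi Tường Thụy", "PERSON_NAME"),
    (": Bà", "PERSON_NAME"),
    ("Tòa án", "ORGANIZATION_NAME"),
    ("Lê Ngân H,", "PERSON_NAME"),
]

names = unique_by_label(sample_ents, "PERSON_NAME")
print(names)
```

With the real predictions, the call would be `unique_by_label(ents, "PERSON_NAME")`.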

### Further fine tuning (TODO)

Refer to:
- https://spacy.io/usage/training
- https://spacy.io/usage/spacy-101
- https://spacy.io/api/cli#init-fill-config