<a href="https://colab.research.google.com/github/LxYuan0420/nlp/blob/main/notebooks/Chinese_Legal_NLP_Text_Classification_with_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###### Check version and Install packages

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0


In [None]:
# Optional GPU extras; pick the extra that matches the CUDA release reported above (11.8).
# !pip install -U "spacy[cuda11x]"
!pip install datasets
!python -m spacy download zh_core_web_lg

In [3]:
import spacy
print('GPU:', spacy.prefer_gpu())

GPU: True


-----
###### Load dataset and dataset preview

In [4]:
from datasets import load_dataset

dataset = load_dataset("coastalcph/fairlex", "cail")

Downloading and preparing dataset fairlex/cail (download: 107.76 MiB, generated: 311.32 MiB, post-processed: Unknown size, total: 419.08 MiB) to /root/.cache/huggingface/datasets/coastalcph___fairlex/cail/1.0.0/b755f714459ab788a8e3f9167fe7463f79981775296915d36ac10fc58ea93737...

Dataset fairlex downloaded and prepared to /root/.cache/huggingface/datasets/coastalcph___fairlex/cail/1.0.0/b755f714459ab788a8e3f9167fe7463f79981775296915d36ac10fc58ea93737. Subsequent calls will reuse this data.

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'defendant_gender', 'court_region'],
        num_rows: 80000
    })
    test: Dataset({
        features: ['text', 'label', 'defendant_gender', 'court_region'],
        num_rows: 12000
    })
    validation: Dataset({
        features: ['text', 'label', 'defendant_gender', 'court_region'],
        num_rows: 12000
    })
})

In [6]:
dataset["train"][0]

{'text': '南宁市兴宁区人民检察院指控，2012年1月1日19时许，被告人蒋满德在南宁市某某路某号某市场内，因经营问题与被害人杨某某发生争吵并推打，后蒋满德持菜刀将杨某某砍伤。 </s> 经鉴定，被害人杨某某伤情为轻伤。 </s> 被告人蒋满德于2012年1月1日晚23时被公安机关抓获。 </s>  另查明，2012年1月10日，被告人蒋满德的亲属赔偿被害人杨某某人民币12万元，并取得被害人杨某某的谅解。 </s>  上述事实，被告人在开庭审理过程中亦无异议，并有刑事案件登记表、被害人陈述、证人证言、抓获经过、法医学人体损伤程度鉴定书、辨认笔录及指认作案现场照片、勘验笔录、调解协议、收据、转账记录、户籍证明、被告人的供述等证据所证实，足以认定。',
 'label': 0,
 'defendant_gender': 0,
 'court_region': 5}

In [30]:
dataset["train"].features["label"]

ClassLabel(names=['0', '<=12', '<=36', '<=60', '<=120', '>120'], id=None)

In [32]:
labels = dataset["train"].features["label"].names

In [43]:
# quick check on the value count too
from collections import Counter

c = Counter(dataset["train"]["label"])
c

Counter({0: 13122, 1: 41200, 3: 4160, 2: 17579, 4: 2268, 5: 1671})

Notice that the text contains many `</s>` end tags; a simple function can remove them. We frame this task as a multi-class problem: each sample has exactly one positive label out of k labels.

We will focus on the "label" column, which has 6 categories: ['0', '<=12', '<=36', '<=60', '<=120', '>120']. According to the FairLex dataset description, these are buckets of the imprisonment term in months.

The label mapping is:
```python
{
    0: "0",
    1: "<=12",
    2: "<=36",
    3: "<=60",
    4: "<=120",
    5: ">120",
}
```
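The `Counter` output above shows that the classes are far from balanced. As a rough sketch (the counts are copied by hand from that output, and `label_names` and `majority_baseline` are illustrative names that do not appear in the notebook), the class shares and the accuracy of always predicting the most frequent class can be computed like this:

```python
from collections import Counter

# Counts copied from the Counter output above (train split).
train_counts = Counter({0: 13122, 1: 41200, 2: 17579, 3: 4160, 4: 2268, 5: 1671})
label_names = ["0", "<=12", "<=36", "<=60", "<=120", ">120"]

total = sum(train_counts.values())
shares = {label_names[idx]: count / total for idx, count in sorted(train_counts.items())}

for name, share in shares.items():
    print(f"{name:>6}: {share:6.2%}")

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_idx, majority_count = train_counts.most_common(1)[0]
majority_baseline = majority_count / total
print(f"Majority class: {label_names[majority_idx]} ({majority_baseline:.2%})")
```

Because slightly more than half of the training samples fall into one class, plain accuracy can look acceptable for a model that has learned very little, so per-label precision, recall and F-score are likely to be more informative here.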

In [19]:
import re

# Compile the pattern once instead of on every batch.
END_TAG_PATTERN = re.compile(r"</s>")

def remove_end_tag(examples):
    # With batched=True, examples["text"] is a list of strings.
    # Return the cleaned column instead of mutating the batch in place.
    return {"text": [END_TAG_PATTERN.sub("", text) for text in examples["text"]]}


In [None]:
dataset = dataset.map(remove_end_tag, batched=True)

In [22]:
# Quick check: confirm that all end tags have been removed.
dataset["train"][0]

{'text': '南宁市兴宁区人民检察院指控，2012年1月1日19时许，被告人蒋满德在南宁市某某路某号某市场内，因经营问题与被害人杨某某发生争吵并推打，后蒋满德持菜刀将杨某某砍伤。  经鉴定，被害人杨某某伤情为轻伤。  被告人蒋满德于2012年1月1日晚23时被公安机关抓获。   另查明，2012年1月10日，被告人蒋满德的亲属赔偿被害人杨某某人民币12万元，并取得被害人杨某某的谅解。   上述事实，被告人在开庭审理过程中亦无异议，并有刑事案件登记表、被害人陈述、证人证言、抓获经过、法医学人体损伤程度鉴定书、辨认笔录及指认作案现场照片、勘验笔录、调解协议、收据、转账记录、户籍证明、被告人的供述等证据所证实，足以认定。',
 'label': 0,
 'defendant_gender': 0,
 'court_region': 5}
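The cleaned text above still contains runs of two or three spaces where the tags used to be. If that matters, one possible variant (a sketch only, not what this notebook ran; `strip_end_tags` and the sample string are made up for illustration) replaces each tag together with its surrounding whitespace by a single space:

```python
import re

# Match the tag plus any whitespace on either side of it.
END_TAG = re.compile(r"\s*</s>\s*")

def strip_end_tags(text: str) -> str:
    """Replace each `</s>` tag and its surrounding whitespace with one space."""
    return END_TAG.sub(" ", text).strip()

# Made-up sample that mimics the spacing seen in the dataset.
sample = "第一句。 </s> 第二句。 </s>  第三句。 </s>"
print(strip_end_tags(sample))
```

To apply it to the dataset, the same substitution could be used inside the batched `map` function instead of the plain `</s>` removal.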

----

######  Generate spaCy config

In [1]:
# Colab workaround: force UTF-8 so that `!` shell commands do not fail with a locale error.
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [28]:
!python -m spacy init config chinese_textcat.cfg --lang zh --pipeline textcat --optimize accuracy --gpu

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: zh
- Pipeline: textcat
- Optimize for: accuracy
- Hardware: GPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
chinese_textcat.cfg
You can now add your data and train your pipeline:
python -m spacy train chinese_textcat.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


----

###### Convert the Hugging Face dataset to spaCy's `DocBin` format

In [33]:
import spacy

nlp = spacy.load("zh_core_web_lg")

In [34]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f8a48eeafe0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f8a48eebee0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f8a45605c40>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f8a456c1140>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f8a456054d0>)]

In [60]:
from spacy.tokens import DocBin
from tqdm import tqdm

def convert(data, output):
    """Serialize the first 500 samples of a split to a spaCy DocBin file."""
    db = DocBin()
    docs = []
    cats = []

    for document in tqdm(data):
        docs.append(document["text"])
        cats.append(document["label"])
    print(f"Loaded {len(docs)} text and {len(cats)} labels")

    # Processing is too slow on Colab, so keep only the first 500 samples for the demo.
    print("Manually keep the first 500 samples in each split")
    docs = docs[:500]
    cats = cats[:500]

    print("Process docs thru spacy nlp pipeline")
    # nlp.pipe is lazy: documents are processed while the loop below iterates.
    # NER and the parser are not needed for text classification.
    docs = nlp.pipe(docs, disable=["ner", "parser"])

    print("Add cats into each sample doc")
    for doc, target_label_idx in tqdm(zip(docs, cats), total=len(cats)):
        # One-hot encode: 1 for the gold label, 0 for every other label.
        for idx, label in enumerate(labels):
            doc.cats[label] = 1 if idx == target_label_idx else 0

        db.add(doc)

    print(f"Save spacy data as {output}")
    db.to_disk(output)

convert(dataset["train"], "./train.spacy")
convert(dataset["validation"], "./dev.spacy")
convert(dataset["test"], "./test.spacy")

100%|██████████| 80000/80000 [00:03<00:00, 21310.74it/s]


Loaded 80000 text and 80000 labels
Manually keep the first 500 samples in each split
Process docs thru spacy nlp pipeline
Add cats into each sample doc


100%|██████████| 500/500 [00:41<00:00, 12.06it/s] 


Save spacy data as ./train.spacy


100%|██████████| 12000/12000 [00:00<00:00, 17715.14it/s]


Loaded 12000 text and 12000 labels
Manually keep the first 500 samples in each split
Process docs thru spacy nlp pipeline
Add cats into each sample doc


100%|██████████| 500/500 [00:32<00:00, 15.62it/s]


Save spacy data as ./dev.spacy


100%|██████████| 12000/12000 [00:00<00:00, 14757.92it/s]


Loaded 12000 text and 12000 labels
Manually keep the first 500 samples in each split
Process docs thru spacy nlp pipeline
Add cats into each sample doc


100%|██████████| 500/500 [00:32<00:00, 15.61it/s]


Save spacy data as ./test.spacy
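
The inner loop of `convert` turns each integer label into a one-hot `cats` dictionary, which is how a mutually exclusive multi-class target is expressed for the text categorizer. A minimal standalone sketch of that encoding (`make_cats` is a hypothetical helper, and the label list is copied from the dataset features shown earlier) could look like this:

```python
labels = ["0", "<=12", "<=36", "<=60", "<=120", ">120"]

def make_cats(target_idx: int, labels: list[str]) -> dict[str, float]:
    """Return a one-hot mapping: 1.0 for the gold label, 0.0 for all others."""
    if not 0 <= target_idx < len(labels):
        raise ValueError(f"label index {target_idx} is out of range")
    return {label: 1.0 if idx == target_idx else 0.0 for idx, label in enumerate(labels)}

cats = make_cats(1, labels)
print(cats)

# Sanity check: exactly one positive label per sample.
assert sum(cats.values()) == 1.0
```

Checking that every sample has exactly one positive label before writing the `DocBin` is a cheap way to catch label-index mistakes early.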


-----

###### Start training spaCy model

In [None]:
!python -m spacy train chinese_textcat.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy --output chinse_legal_textcat --verbose -g 0

Log:

```
2023-05-14 07:42:34.795316: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-05-14 07:42:38,616] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory: chinse_legal_textcat
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2023-05-14 07:42:38,748] [INFO] Set up nlp object from config
[2023-05-14 07:42:38,763] [DEBUG] Loading corpus from path: dev.spacy
[2023-05-14 07:42:38,765] [DEBUG] Loading corpus from path: train.spacy
[2023-05-14 07:42:38,765] [INFO] Pipeline: ['tok2vec', 'textcat']
[2023-05-14 07:42:38,768] [INFO] Created vocabulary
[2023-05-14 07:42:43,215] [INFO] Added vectors: zh_core_web_lg
[2023-05-14 07:42:45,878] [INFO] Finished initializing nlp object
[2023-05-14 07:43:13,574] [INFO] Initialized pipeline components: ['tok2vec', 'textcat']
✔ Initialized pipeline

============================= Training pipeline =============================
[2023-05-14 07:43:13,592] [DEBUG] Loading corpus from path: dev.spacy
[2023-05-14 07:43:13,594] [DEBUG] Loading corpus from path: train.spacy
ℹ Pipeline: ['tok2vec', 'textcat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TEXTCAT  CATS_SCORE  SCORE 
---  ------  ------------  ------------  ----------  ------
  0       0          0.00          0.14        4.99    0.05
  0     200          7.23         32.89       13.09    0.13
  0     400          7.76         29.86        8.53    0.09
  1     600          6.90         29.37        4.88    0.05
  1     800          2.33         30.01       11.02    0.11
⚠ Aborting and saving the final best model. Encountered exception:
OutOfMemoryError('Out of memory allocating 545,080,320 bytes (allocated so far:
14,700,011,008 bytes).')

```

Note: Training did not complete because the GPU ran out of memory. To avoid this, we can try reducing the model size by setting lower values for the `width` and `embed_size` parameters of `HashEmbedCNN.v2` in the config file.

----

###### Evaluate trained spaCy model

The scores are poor because training did not complete. Still, I like the output format of the evaluation command, which shows precision, recall, and F1-score for each label.

In [65]:
!python -m spacy evaluate ./chinse_legal_textcat/model-best/ ./test.spacy --gpu-id 0

ℹ Using GPU: 0

TOK                 100.00
TEXTCAT (macro F)   15.24
SPEED               74399

            P       R       F
0       24.07   14.44   18.06
<=12    54.27   70.86   61.47
<=36    50.00    5.21    9.43
<=60     0.00    0.00    0.00
<=120    1.41   10.00    2.47
>120     0.00    0.00    0.00

        ROC AUC
0          0.53
<=12       0.47
<=36       0.48
<=60       0.51
<=120      0.45
>120       0.52
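
The `TEXTCAT (macro F)` figure appears to be the unweighted mean of the per-label F-scores, which is why the two labels with an F-score of 0.00 pull it down so strongly. A small check under that assumption, with the values copied by hand from the table above (`f_scores` and `macro_f` are illustrative names):

```python
# Per-label F-scores copied from the evaluation output above.
f_scores = {
    "0": 18.06,
    "<=12": 61.47,
    "<=36": 9.43,
    "<=60": 0.00,
    "<=120": 2.47,
    ">120": 0.00,
}

# Macro-averaging weights every label equally, regardless of its frequency.
macro_f = sum(f_scores.values()) / len(f_scores)
print(f"macro F = {macro_f:.2f}")
```

The result rounds to 15.24, which agrees with the reported score.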



-----

###### Load the best model and test it

In [73]:
# Random sample from the train split (gold label: 1, i.e. "<=12").
# Note: unlike the cleaned training data, this text still contains a `</s>` tag.
text = (
    "公诉机关指控，2017年3月25日11时许，被告人陈向明在本市东城区地铁5号线磁器口站至崇文门站车厢内，从被害人徐某的挎包内，盗窃钱包1个，"
    "内有人民币351.4元、身份证、医保卡、医疗卡、工商银行储蓄卡1张、建设银行储蓄卡1张、农业银行卡1张等，其中钱包经鉴定价值人民币30元。 "
    "被告人陈向明后被抓获到案，赃物已起获并发还。 </s> 上述事实，被告人陈向明在审理过程中无异议，并有到案经过，被害人徐某的陈述，"
    "证人张某、刘某、赵某的证言，辨认笔录，鉴定意见，扣押、发还清单，照片，视听资料，常住人口基本信息，被告人陈向明的供述及其前科劣迹材料等证据予以证实，"
    "足以认定。"
)
nlp = spacy.load("./chinse_legal_textcat/model-best")
doc = nlp(text)
print(doc.cats,  "-",  text)

{'0': 0.017592856660485268, '<=12': 0.9379693865776062, '<=36': 0.018389828503131866, '<=60': 0.005402231123298407, '<=120': 0.016701236367225647, '>120': 0.0039446111768484116} - 公诉机关指控，2017年3月25日11时许，被告人陈向明在本市东城区地铁5号线磁器口站至崇文门站车厢内，从被害人徐某的挎包内，盗窃钱包1个，内有人民币351.4元、身份证、医保卡、医疗卡、工商银行储蓄卡1张、建设银行储蓄卡1张、农业银行卡1张等，其中钱包经鉴定价值人民币30元。 被告人陈向明后被抓获到案，赃物已起获并发还。 </s> 上述事实，被告人陈向明在审理过程中无异议，并有到案经过，被害人徐某的陈述，证人张某、刘某、赵某的证言，辨认笔录，鉴定意见，扣押、发还清单，照片，视听资料，常住人口基本信息，被告人陈向明的供述及其前科劣迹材料等证据予以证实，足以认定。
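
`doc.cats` maps every label to a score, so a single prediction still has to be picked from it. A minimal sketch, using the scores printed above copied into a plain dictionary (`predicted_label` and `confidence` are illustrative names):

```python
# Scores copied from the doc.cats output above.
cats = {
    "0": 0.017592856660485268,
    "<=12": 0.9379693865776062,
    "<=36": 0.018389828503131866,
    "<=60": 0.005402231123298407,
    "<=120": 0.016701236367225647,
    ">120": 0.0039446111768484116,
}

# Pick the label with the highest score.
predicted_label = max(cats, key=cats.get)
confidence = cats[predicted_label]
print(f"Predicted: {predicted_label} (score {confidence:.3f})")

# The scores sum to roughly 1, as expected for mutually exclusive classes.
print(f"Sum of scores: {sum(cats.values()):.6f}")
```

Here the top label matches the gold label `<=12`. Since the sample comes from the train split and training was cut short, this single result says little about how well the model generalizes.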
