## Examples of using the lmchunker package

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local path to the small language model used for chunking
model_name_or_path = '/mnt/data102_d2/huggingface/models/Qwen2-1.5B-Instruct'
device_id = 6
# Use the requested GPU if it exists; otherwise fall back to the CPU
device = torch.device(f'cuda:{device_id}' if torch.cuda.is_available() and torch.cuda.device_count() > device_id else 'cpu')
small_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
small_model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True).to(device)
small_model.eval()  # Inference mode: disables dropout


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear(in_features=1536, out_features=1536, bias=True)
          (k_proj): Linear(in_features=1536, out_features=256, bias=True)
          (v_proj): Linear(in_features=1536, out_features=256, bias=True)
          (o_proj): Linear(in_features=1536, out_features=1536, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=1536, out_features=8960, bias=False)
          (up_proj): Linear(in_features=1536, out_features=8960, bias=False)
          (down_proj): Linear(in_features=8960, out_features=1536, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
      )
    )
    (norm): Qw

> Segments the given text into chunks based on the specified method and parameters.

Parameters:

Required
- text: The text to be segmented
- small_model: The small language model used for segmentation
- small_tokenizer: The tokenizer used for text tokenization
- language: Language of the text, 'en' or 'zh'

Optional
- methodth: The LLM chunking method to use, one of ['ppl','ms','lumber_ms']
- threshold: Threshold that controls PPL chunking; the smaller the threshold, the shorter the chunks
- dynamic_merge: 'no' or 'yes'
- target_size: Chunk length; must be set when dynamic_merge='yes'
- batch_size: Length of text processed in a single pass, used to limit GPU memory usage when processing longer documents
- max_txt_size: The total context length to consider, or the maximum length that GPU memory can accommodate

Returns:
- List[str]: A list of segmented text chunks
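
The effect of `threshold` on chunk length can be illustrated with a small sketch. It assumes (this is an assumption, not something the parameter list states) that PPL chunking cuts after a sentence whose perplexity is a local minimum and whose drop relative to a neighbour exceeds `threshold`. The function name `split_at_ppl_minima` and the perplexity values are hypothetical; the real `chunker` derives perplexities from `small_model`.

```python
from typing import List


def split_at_ppl_minima(ppl: List[float], threshold: float) -> List[List[int]]:
    """Group sentence indices into chunks, cutting after each qualifying PPL minimum.

    Assumption: sentence i closes a chunk when its perplexity is lower than both
    neighbours and the larger of the two drops exceeds `threshold`.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    for i, value in enumerate(ppl):
        current.append(i)
        if 0 < i < len(ppl) - 1:
            left_drop = ppl[i - 1] - value
            right_drop = ppl[i + 1] - value
            is_minimum = left_drop > 0 and right_drop > 0
            if is_minimum and max(left_drop, right_drop) > threshold:
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups


# Made-up per-sentence perplexities, for illustration only
ppl = [5.0, 3.0, 4.0, 6.0, 2.0, 2.5, 1.0]
print(split_at_ppl_minima(ppl, threshold=0.0))  # [[0, 1], [2, 3, 4], [5, 6]]
print(split_at_ppl_minima(ppl, threshold=3.0))  # [[0, 1, 2, 3, 4], [5, 6]]
```

With the lower threshold more minima qualify as boundaries, so the same text is cut into more, shorter chunks.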


In [2]:
from lmchunker import chunker
import json

with open('data/example1.json', 'r', encoding='utf-8') as file:
    examples = json.load(file)
language = 'zh'  # 'en' or 'zh'
text = examples[0][language]  # Text to be segmented

# Only the required arguments are passed here
chunks = chunker(text, small_model, small_tokenizer, language)
for i, chunk in enumerate(chunks, start=1):
    print(f'Number {i}: ', chunk)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.879 seconds.
Prefix dict has been built successfully.


111 [[0, 1, 2], [3, 4], [5, 6, 7], [8, 9, 10], [11, 12], [13, 14, 15], [16]]
The program execution time is: 1.938673973083496 seconds.
Number 1:  2023-08-01 10:47，正文：通气会现场 来源：湖南高院7月31日，湖南高院联合省司法厅召开新闻通气会。湖南高院副院长杨翔，省委依法治省办成员、省司法厅党组成员、副厅长杨龙金通报2022年湖南省行政机关负责人出庭应诉有关情况，并发布5个典型案例。2022年，全省经人民法院通知出庭的行政机关负责人出庭应诉率提升至96.5%。
Number 2:  杨翔介绍，从出庭应诉数量看，负责人出庭应诉意识普遍提升。2022年，全省法院共发出行政机关负责人出庭应诉通知书4228份，行政机关负责人到庭应诉4018件。
Number 3:  行政机关负责人参加调查询问1117件，参与案件协调化解741件。与2021年相比，行政机关负责人到庭应诉和参加调查询问等案件增加2802件。从地区分布情况来看，全省各地经人民法院通知的行政机关负责人出庭应诉率均达到90%以上，较往年有明显提升。
Number 4:  2022年，从行政管理领域看，全省法院制发负责人出庭应诉通知书的案件所涉行政管理领域较为集中，自然资源、社会保障、公安、市场监管等部门负责人出庭应诉的案件数量较多。从涉案行政行为看，被诉行为类型相对集中。排名前五的行政行为类型依次为行政征收或征用类案件、行政确认类案件、不履行法定职责类案件、行政处罚类案件及行政登记类案件。
Number 5:  从出庭应诉负责人层级比例看，基层行政机关负责人出庭应诉占比较高。县市区及乡镇负责人出庭应诉数量占全部出庭应诉案件数的80.8%。
Number 6:  杨龙金介绍，为进一步加强和完善负责人出庭应诉制度建设，省委依法治省办、省法院、省司法厅联合印发《关于进一步推进行政机关负责人出庭应诉的工作方案》（以下简称《工作方案》），推动省政府出台《湖南省行政应诉工作规定》并召开全省行政应诉工作会议，依托府院联动，推动行政机关负责人出庭应诉工作有序开展。湖南高院、省司法厅根据最高人民法院相关司法解释，在《工作方案》中统一了行政机关负责人出庭应诉的认定标准和计算方

In [1]:
from lmchunker.modules import lumberchunker
import json

with open('data/example1.json', 'r', encoding='utf-8') as file:
    examples = json.load(file)
# Requires the zhipuai package: pip install zhipuai
api_name = 'zhipuai'  # Name of the API to call
api_configure = {"api_key": "your_api_key", "model_name": "glm-4-0520"}  # Fill in according to api_name
language = 'zh'  # 'en' or 'zh'
text = examples[0][language]  # Text to be segmented
dynamic_merge = 'no'  # 'no' or 'yes'
target_size = 200  # Chunk length; must be set when dynamic_merge='yes'
chunks = lumberchunker(api_name, api_configure, language, text, dynamic_merge, target_size)
for i, chunk in enumerate(chunks, start=1):
    print(f'Number {i}: ', chunk)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 1.714 seconds.
Prefix dict has been built successfully.


Answer: ID 13
The program execution time is: 4.001843690872192 seconds.
Number 1:  2023-08-01 10:47，正文：通气会现场 来源：湖南高院7月31日，湖南高院联合省司法厅召开新闻通气会。 湖南高院副院长杨翔，省委依法治省办成员、省司法厅党组成员、副厅长杨龙金通报2022年湖南省行政机关负责人出庭应诉有关情况，并发布5个典型案例。 2022年，全省经人民法院通知出庭的行政机关负责人出庭应诉率提升至96.5%。 杨翔介绍，从出庭应诉数量看，负责人出庭应诉意识普遍提升。 2022年，全省法院共发出行政机关负责人出庭应诉通知书4228份，行政机关负责人到庭应诉4018件。 行政机关负责人参加调查询问1117件，参与案件协调化解741件。 与2021年相比，行政机关负责人到庭应诉和参加调查询问等案件增加2802件。 从地区分布情况来看，全省各地经人民法院通知的行政机关负责人出庭应诉率均达到90%以上，较往年有明显提升。 2022年，从行政管理领域看，全省法院制发负责人出庭应诉通知书的案件所涉行政管理领域较为集中，自然资源、社会保障、公安、市场监管等部门负责人出庭应诉的案件数量较多。 从涉案行政行为看，被诉行为类型相对集中。 排名前五的行政行为类型依次为行政征收或征用类案件、行政确认类案件、不履行法定职责类案件、行政处罚类案件及行政登记类案件。 从出庭应诉负责人层级比例看，基层行政机关负责人出庭应诉占比较高。 县市区及乡镇负责人出庭应诉数量占全部出庭应诉案件数的80.8%。
Number 2:  杨龙金介绍，为进一步加强和完善负责人出庭应诉制度建设，省委依法治省办、省法院、省司法厅联合印发《关于进一步推进行政机关负责人出庭应诉的工作方案》（以下简称《工作方案》），推动省政府出台《湖南省行政应诉工作规定》并召开全省行政应诉工作会议，依托府院联动，推动行政机关负责人出庭应诉工作有序开展。 湖南高院、省司法厅根据最高人民法院相关司法解释，在《工作方案》中统一了行政机关负责人出庭应诉的认定标准和计算方式，实现了全省负责人出庭应诉工作的标准化和规范化。 同时，推动将行政机关负责人出庭应诉情况纳入省绩效考核、平安建设、市域社会治理等考核指标体系，进一步压实出庭应诉主体责任。 《工作方案》还明确将行
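
The `dynamic_merge` and `target_size` arguments can be pictured with a minimal sketch. It assumes that merging is greedy over adjacent chunks and that length is measured in characters; both are assumptions, and the helper `merge_adjacent` is hypothetical rather than part of lmchunker.

```python
from typing import List


def merge_adjacent(chunks: List[str], target_size: int) -> List[str]:
    """Greedily join neighbouring chunks while the joined length stays within target_size.

    Assumption: length is counted in characters. A single chunk longer than
    target_size is kept as-is rather than split.
    """
    merged: List[str] = []
    current = ""
    for chunk in chunks:
        if current and len(current) + len(chunk) > target_size:
            # Adding this chunk would overshoot, so close the current group
            merged.append(current)
            current = chunk
        else:
            current += chunk
    if current:
        merged.append(current)
    return merged


print(merge_adjacent(["aaaa", "bbb", "cc", "dddddd"], target_size=8))
# ['aaaabbb', 'ccdddddd']
```

Because only neighbours are joined, the original order of the text is preserved and each merged chunk stays at or below the target whenever its parts allow it.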

In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "/d2/zhaojihao/models/propositionizer-wiki-flan-t5-large"
device_id = 0  
device = torch.device(f'cuda:{device_id}' if torch.cuda.is_available() and torch.cuda.device_count() > device_id else 'cpu')  
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
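
The output of the `dense_x_retrieval` example in this notebook shows prompts of the form `Title: . Section: . Content: ...` and an `[ERROR] Failed to parse output text as JSON.` line. The sketch below shows one way such a prompt could be built and the decoded output parsed defensively. It assumes the model emits a JSON list of strings; the helper names `build_prompt` and `parse_propositions`, and the choice to return an empty list on failure, are hypothetical and not taken from lmchunker.

```python
import json
from typing import List


def build_prompt(title: str, section: str, content: str) -> str:
    """Format one passage in the Title/Section/Content layout seen in the output."""
    return f"Title: {title}. Section: {section}. Content: {content}"


def parse_propositions(output_text: str) -> List[str]:
    """Parse decoded model output as a JSON list of strings; return [] on failure."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        # Truncated or malformed generations end up here
        print("[ERROR] Failed to parse output text as JSON.")
        return []
    if not isinstance(parsed, list):
        return []
    return [str(item) for item in parsed]


print(build_prompt("", "", "Waldrada of Lotharingia"))
print(parse_propositions('["Waldrada was the mistress of Lothair II."]'))
print(parse_propositions('["An unterminated output'))
```

A generation cut off at the length limit is a typical source of unparseable output, which is one reason to keep `target_size` well below the model's maximum input length.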

In [6]:
from lmchunker.modules import dense_x_retrieval
import json

with open('data/example1.json', 'r', encoding='utf-8') as file:
    examples = json.load(file)
language = 'en'  # The model used by this method currently supports English text only
text = examples[0][language]  # Text to be segmented
title = ''
section = ''
# The model accepts a maximum length of 512, and longer inputs make it harder to extract useful information
target_size = 256
chunks = dense_x_retrieval(tokenizer, model, text, title, section, target_size)
for i, chunk in enumerate(chunks, start=1):
    print(f'Number {i}: ', chunk)

Title: . Section: . Content:  Waldrada of Lotharingia
Waldrada was the mistress, and later the wife, of Lothair II of Lotharingia. Biography
Waldrada's family origin is uncertain. The prolific 19th-century French writer Baron Ernouf suggested that Waldrada was of noble Gallo-Roman descent, sister of Thietgaud, the bishop of Trier, and niece of Gunther, archbishop of Cologne. However, these suggestions are not supported by any evidence, and more recent studies have instead suggested she was of relatively undistinguished social origins, though still from an aristocratic milieu.
[ERROR] Failed to parse output text as JSON.
Title: . Section: . Content: The Vita Sancti Deicoli states that Waldrada was related to Eberhard II, Count of Nordgau (included Strasbourg) and the family of Etichonids, though this is a late 10th-century source and so may not be entirely reliable on this question.In 855 the Carolingian king Lothar II married Teutberga, a Carolingian aristocrat and the daughter of Boso