# Statistical model for pre-tokenization

## Pointwise mutual information and branching entropy to identify coherent character spans

This notebook implements two statistical measures for identifying natural word boundaries in Chinese text:
1. Pointwise Mutual Information (PMI) measures the association strength between adjacent characters.
2. Branching entropy quantifies the uncertainty of the neighbouring character given the current span.

These metrics help identify where characters form coherent units (words):
- High PMI indicates strongly associated characters that are likely to belong to the same word.
- Low branching entropy inside a span indicates a predictable continuation, typical of word-internal positions; high branching entropy at the left and right edges of a span indicates a likely word boundary.

The implementation uses these measures to pre-tokenize the text before a subword tokenizer (BPE, in the later sections) is trained on it.
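
As a rough illustration of the two measures, the sketch below estimates PMI and right branching entropy from raw counts on a toy string. The string and the helper names (`pmi`, `right_branching_entropy`) are made up for this example and are not the notebook's implementation, which collects its statistics differently.

```python
import math
from collections import Counter


def pmi(text: str, x: str, y: str) -> float:
    """PMI (base 2) of the adjacent character pair x+y, estimated from counts in text.

    Assumes the pair occurs at least once; otherwise log2(0) would fail.
    """
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    p_x = unigrams[x] / len(text)
    p_y = unigrams[y] / len(text)
    p_xy = bigrams[x + y] / (len(text) - 1)
    return math.log2(p_xy / (p_x * p_y))


def right_branching_entropy(text: str, span: str) -> float:
    """Shannon entropy (base 2) of the character that follows span in text."""
    followers = Counter(
        text[i + len(span)]
        for i in range(len(text) - len(span))
        if text.startswith(span, i)
    )
    total = sum(followers.values())
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


# Toy corpus (illustrative only): "北京" recurs, followed by different characters.
toy = "北京大学北京天气北京大学生"

print(pmi(toy, "北", "京"))                 # word-internal pair
print(pmi(toy, "学", "北"))                 # pair that straddles a word boundary
print(right_branching_entropy(toy, "北"))   # always followed by "京"
print(right_branching_entropy(toy, "北京")) # followed by "大" or "天"
```

In this toy string the word-internal pair gets the higher PMI, the entropy after "北" is zero because its continuation is fully predictable, and the entropy after "北京" is positive because several different characters can follow it.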


### Training Dataset Preparation

In [1]:
def extract_first_1000_lines(input_file_path, output_file_path):
    """
    Extracts the first 1000 lines from a UTF-8 encoded file.

    Args:
        input_file_path (str): The path to the input UTF-8 file.
        output_file_path (str): The path where the first 1000 lines will be saved.

    Returns:
        None
    """
    with open(input_file_path, 'r', encoding='utf-8') as infile, \
         open(output_file_path, 'w', encoding='utf-8') as outfile:

        line_count = 0
        for line in infile:
            outfile.write(line)
            line_count += 1
            if line_count >= 1000:
                break

In [None]:
extract_first_1000_lines("pku_training.utf8", "pku_small.utf8")

In [None]:
def remove_spaces(input_file_path, output_file_path):
    """
    Removes all space characters (' ') from each line of the input file and
    writes the modified text to the output file.

    Args:
        input_file_path (str): Path to the input file (UTF-8 encoding).
        output_file_path (str): Path to save the resulting file (UTF-8 encoding).

    Returns:
        None
    """
    with open(input_file_path, 'r', encoding='utf-8') as infile, \
         open(output_file_path, 'w', encoding='utf-8') as outfile:
        for line in infile:
            no_spaces_line = line.replace(' ', '')
            outfile.write(no_spaces_line)


In [None]:
remove_spaces("pku_small.utf8", "no_spaces_pku.utf8")

### Statistical Model

Greedy word segmentation by total_score = minPMI + min(left_entropy, right_entropy), where minPMI is the smallest PMI over all two-way splits of a candidate word.

1. No hard thresholding. All multi-character candidates are kept; their
   scores are used directly during greedy segmentation.
2. Single-pass statistics collection. We walk through the corpus once
   (up to max_len characters per position) and gather:
   * n-gram frequencies (for PMI)
   * left / right context character counts (for entropy)

   This avoids repeated re.findall calls.
3. Clearer data flow. Each helper function does exactly one job, making dependencies explicit.
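
To make the score concrete, the following sketch computes min-PMI for one three-character candidate by taking the lowest PMI over its two-way splits, then adds the smaller of the two edge entropies. All counts and entropy values are hypothetical numbers chosen for easy arithmetic, and the helper name `min_pmi` is introduced only for this example.

```python
import math


def min_pmi(word: str, freq: dict[str, int], total: int) -> float:
    """Lowest PMI (base 2) over every two-way split of word."""
    values = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        ratio = freq[word] * total / (freq[left] * freq[right])
        values.append(math.log2(ratio))
    return min(values)


# Hypothetical n-gram counts and total count (not taken from the PKU corpus).
freq = {"图书馆": 8, "图": 10, "书馆": 8, "图书": 16, "馆": 8}
total = 1024

score_pmi = min_pmi("图书馆", freq, total)
# Split 图|书馆 gives log2(102.4); split 图书|馆 gives log2(64) = 6, the minimum.

left_entropy, right_entropy = 1.5, 2.0  # hypothetical edge entropies
total_score = score_pmi + min(left_entropy, right_entropy)
print(score_pmi, total_score)
```

Taking the minimum means a candidate is only as cohesive as its weakest split, so a long span that merely contains one strongly bound pair does not score well.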


In [2]:
from __future__ import annotations
import os
import math
import sys
import re
from collections import Counter, defaultdict  # `Set` is not available from `collections` on Python 3.10+ and was unused
from pathlib import Path
from typing import Dict, List, Sequence, Tuple
import sentencepiece as spm
import matplotlib.pyplot as plt

  


#### Script for the statistical model

In [None]:
# Global containers required for later analysis / plotting
pmi_score_dict: Dict[str, float] = {}
# word -> (left_entropy, right_entropy)
entropy_dict: Dict[str, Tuple[float, float]] = {}

# Corpus statistics helpers
def collect_corpus_stats(text: str, max_len: int) -> tuple[Counter, dict[str, Counter], dict[str, Counter]]:
    """Walk through text once and gather statistics needed later.

    Parameters
    ----------
    text : str
        Raw corpus with stop characters already removed.
    max_len : int
        Maximum n-gram length considered.

    Returns
    -------
    freq : Counter
        word -> frequency for all n-grams 1 .. max_len.
    left_ctx : dict[str, Counter]
        word -> Counter(left_char) (only positions with both a left and a
        right neighbour are counted, matching the original regex behaviour).
    right_ctx : dict[str, Counter]
        word -> Counter(right_char) (see remark above).
    """
    freq: Counter = Counter()
    left_ctx: dict[str, Counter] = defaultdict(Counter)
    right_ctx: dict[str, Counter] = defaultdict(Counter)

    n = len(text)
    for i in range(n):
        # Limit the slice to stay inside text
        slice_end = min(i + max_len, n)
        for j in range(i + 1, slice_end + 1):
            w = text[i:j]
            freq[w] += 1

            # Context only if there is both a left & right neighbour
            if i > 0 and j < n:
                left_ctx[w][text[i - 1]] += 1
                right_ctx[w][text[j]] += 1

    return freq, left_ctx, right_ctx

# Score computation (minPMI + min LR-entropy)

def compute_min_pmi(freq: Counter, total_count: int) -> None:
    """Populate pmi_score_dict with the min-PMI of each multi-char word."""
    pmi_score_dict.clear()

    for word, f_xy in freq.items():
        if len(word) == 1:
            continue  # single characters are skipped (as in the original code)

        # Gather PMI for every bi-partition of word
        pmi_values: list[float] = []
        for i in range(1, len(word)):
            left, right = word[:i], word[i:]
            f_x = freq.get(left, 0)
            f_y = freq.get(right, 0)
            if not f_x or not f_y:
                pmi_values.append(float("-inf"))
            else:
                ratio = (f_xy * total_count) / (f_x * f_y)
                pmi_values.append(math.log2(ratio))

        pmi_score_dict[word] = min(pmi_values) if pmi_values else float("-inf")


def entropy_from_counter(counter: Counter) -> float:
    """Shannon entropy H(X) with log-base 2."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def compute_total_scores(
    candidate_words: Sequence[str],
    left_ctx: dict[str, Counter],
    right_ctx: dict[str, Counter],
) -> dict[str, float]:
    """Return word -> total_score and populate entropy_dict for plotting."""
    entropy_dict.clear()
    score_dict: dict[str, float] = {}

    for w in candidate_words:
        le = entropy_from_counter(left_ctx.get(w, Counter()))
        ri = entropy_from_counter(right_ctx.get(w, Counter()))  # not named `re`, to avoid shadowing the regex module
        min_lr = min(le, ri)
        entropy_dict[w] = (le, ri)

        pmi = pmi_score_dict.get(w, float("-inf"))
        score_dict[w] = pmi + min_lr  # weighting factor 1.0 as in original script

    return score_dict

# Greedy segmentation with conflict resolution
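def build_conflict_free_seg(text: str, word_score: dict[str, float]) -> list[tuple[int, int, str]]:
    """Greedily select scored candidate occurrences, resolving overlaps by score.

    Occurrences are visited by start index, then descending score, then
    descending length. An occurrence that overlaps an already chosen one
    replaces the first such conflict only if its score is strictly higher.
    The tuple returned for each chosen occurrence is (start, end, word).
    """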
    matches: list[tuple[int, int, str, float]] = []

    for w, score in word_score.items():
        if score == float("-inf"):
            continue  # useless candidate
        start = 0
        while True:
            idx = text.find(w, start)
            if idx == -1:
                break
            matches.append((idx, idx + len(w), w, score))
            start = idx + 1  # allow overlaps in search as in original code

    # Sort: primary by start idx, then by descending score, then by descending length
    matches.sort(key=lambda t: (t[0], -t[3], -(t[1] - t[0])))

    chosen: list[tuple[int, int, str, float]] = []
    for st, ed, wd, sc in matches:
        for i, (cst, ced, _, csc) in enumerate(chosen):
            if not (ed <= cst or st >= ced):  # overlap
                if sc > csc:
                    chosen[i] = (st, ed, wd, sc)
                break
        else:
            chosen.append((st, ed, wd, sc))

    chosen.sort(key=lambda t: t[0])
    return [(st, ed, wd) for st, ed, wd, _ in chosen]

def _plot_hist(data: List[float], title: str, xlabel: str, bins: int = 50, xlim: Tuple[float, float] | None = None) -> None:
    plt.figure()
    counts, bin_edges, patches = plt.hist(data, bins=bins, range=xlim, edgecolor="black")
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel("Frequency")

    for cnt, rect in zip(counts, patches):
        if cnt > 0:
            x_pos = rect.get_x() + rect.get_width() / 2
            plt.text(x_pos, cnt, f"{int(cnt)}", ha="center", va="bottom", rotation=90, fontsize=8)

    if xlim:
        plt.xlim(xlim)


def plot_distributions() -> None:
    """Show histograms for PMI and entropy statistics."""
    pmi_vals = [v for v in pmi_score_dict.values() if v != float("-inf")]
    left_vals, right_vals, min_vals = [], [], []
    for le, ri in entropy_dict.values():
        left_vals.append(le)
        right_vals.append(ri)
        min_vals.append(min(le, ri))

    # PMI
    _plot_hist(pmi_vals, "PMI distribution (full range)", "PMI")
    _plot_hist(pmi_vals, "PMI distribution (zoomed)", "PMI", xlim=(-1, 5))

    # Entropy
    _plot_hist(left_vals, "Left entropy (full range)", "Left entropy")
    _plot_hist(left_vals, "Left entropy (zoomed)", "Left entropy", xlim=(0, 2))
    _plot_hist(right_vals, "Right entropy (full range)", "Right entropy")
    _plot_hist(right_vals, "Right entropy (zoomed)", "Right entropy", xlim=(0, 2))
    _plot_hist(min_vals, "Min(L,R) entropy (full range)", "Min entropy")
    _plot_hist(min_vals, "Min(L,R) entropy (zoomed)", "Min entropy", xlim=(0, 1))

def main() -> None:
    # Load corpus & basic cleanup
    stop_chars: Sequence[str] = [
        "【", "】", ")", "(", "、", "，", "“", "”", "。", "\n", "《", "》", " ", "-", "！", "？",
        ".", "'", "[", "]", "：", "/", '"', "\u3000", "’", "．", ",", "…", "?", "（", "）",
    ]
    raw_path = Path("no_spaces_pku.utf8")
    text = raw_path.read_text(encoding="utf-8")
    for ch in stop_chars:
        text = text.replace(ch, "")

    # Collect statistics
    max_ngram = 6
    freq, left_ctx, right_ctx = collect_corpus_stats(text, max_ngram)
    print(f"Total n‑gram types: {len(freq):,}")


    # Scores: minPMI + minLR‑entropy (no thresholding)
    compute_min_pmi(freq, sum(freq.values()))
    candidate_words = [w for w in freq if len(w) > 1]
    score_dict = compute_total_scores(candidate_words, left_ctx, right_ctx)

    # Greedy segmentation
    segments = build_conflict_free_seg(text, score_dict)

    # Stitch back to a tokenised string (single chars for uncovered spans)
    segments.sort(key=lambda t: t[0])
    tokens: list[str] = []
    cursor = 0
    for st, ed, wd in segments:
        if st > cursor:
            tokens.extend(text[cursor:st])
        tokens.append(wd)
        cursor = ed
    if cursor < len(text):
        tokens.extend(text[cursor:])

    segmented = " ".join(tokens)
    print("Segmented sample:", segmented[:200], "...")

    Path("pred_weight_1.utf8").write_text(segmented + "\n", encoding="utf-8")



if __name__ == "__main__":
    main()

### Process Functions
Normalize a gold segmentation file that mistakenly uses two (or more)
spaces between tokens so that it contains exactly one ASCII space
between tokens, no leading/trailing spaces per line, and preserves the
original number of lines. Both input and output are assumed to be UTF‑8.

Edit the two path variables (*INPUT_PATH* and *OUTPUT_PATH*) as needed
and run the script directly; no command‑line arguments are required.

In [4]:
INPUT_PATH = Path("data/pku_small_sentence.txt")   # source file (two or more spaces between tokens)
OUTPUT_PATH = Path("data/pku_small_sentence.txt")  # destination (one space); same path, so the file is normalized in place

_WS_PATTERN = re.compile(r"\s+")  # collapse every run of whitespace


def normalize_line(line: str) -> str:
    """Convert all runs of whitespace into a single ASCII space."""
    # Replace full‑width spaces with ASCII for safety
    line = line.replace("\u3000", " ")
    tokens = _WS_PATTERN.split(line.strip())
    return " ".join(tokens)


def main() -> None:
    raw_lines = INPUT_PATH.read_text(encoding="utf-8").splitlines()
    with OUTPUT_PATH.open("w", encoding="utf-8") as fw:
        for raw in raw_lines:
            cleaned = normalize_line(raw)
            fw.write(cleaned + "\n")
    print(f"Written cleaned file to: {OUTPUT_PATH} (lines: {len(raw_lines)})")


if __name__ == "__main__":
    main()

Written cleaned file to: data/pku_small_sentence.txt (lines: 2255)


# Evaluation

## F1 Score

We evaluate the segmentations produced by the PMI and entropy models with two F1 scores: a SIGHAN-style token-level F1, where a predicted token only counts if it matches a gold token exactly, and a boundary-level F1.
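
The difference between the two metrics can be seen on a single sentence. The sketch below is an illustration only: it scores tokens by exact character-span match, which is one common way to do token-level scoring and is not the same procedure as the `compute_sighan_f1` function defined in this notebook. The helper names and the example sentence are made up for this example.

```python
def word_spans(tokens: list[str]) -> set[tuple[int, int]]:
    """Character (start, end) span of every token."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def boundaries(tokens: list[str]) -> set[int]:
    """Internal boundary offsets: the end of every token except the last."""
    sentence_end = sum(len(t) for t in tokens)
    return {end for _, end in word_spans(tokens)} - {sentence_end}


def prf(gold: set, pred: set) -> tuple[float, float, float]:
    """Precision, recall and F1 of pred against gold (both are sets)."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = "我 喜欢 北京大学".split()
pred = "我 喜欢 北京 大学".split()  # over-segments the last word

token_p, token_r, token_f1 = prf(word_spans(gold), word_spans(pred))
bound_p, bound_r, bound_f1 = prf(boundaries(gold), boundaries(pred))
print(token_p, token_r, token_f1)
print(bound_p, bound_r, bound_f1)
```

Splitting one gold word in two costs the token-level score two wrong predictions and one missed gold token, while the boundary-level score only sees one extra boundary, so boundary F1 is the more lenient of the two here.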
### F1 Score Function

In [None]:
from difflib import SequenceMatcher
from pathlib import Path
from typing import List, Tuple

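def lcs_length(a: List[str], b: List[str]) -> int:
    """Number of tokens in the matching blocks found by difflib.SequenceMatcher.

    This approximates the longest common subsequence length; SequenceMatcher
    does not guarantee a true LCS.
    """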
    return sum(triple.size for triple in SequenceMatcher(None, a, b).get_matching_blocks())

def compute_sighan_f1(
    gold_path,
    pred_path
):
    """
    Compute token-level Precision, Recall, and F1 for Chinese word segmentation
    in SIGHAN style (one sentence per line, tokens separated by ASCII spaces).

    Args:
        gold_path: Path to the gold standard file.
        pred_path: Path to the predictions file.

    Returns:
        precision, recall, f1
    """
    gold_lines = Path(gold_path).read_text(encoding="utf-8").splitlines()
    pred_lines = Path(pred_path).read_text(encoding="utf-8").splitlines()

    if len(gold_lines) != len(pred_lines):
        raise ValueError(f"Line count mismatch: gold={len(gold_lines)} vs pred={len(pred_lines)}")

    total_gold = total_pred = total_corr = 0
    for g_line, p_line in zip(gold_lines, pred_lines):
        g_tokens = g_line.strip().split()
        p_tokens = p_line.strip().split()
        total_gold += len(g_tokens)
        total_pred += len(p_tokens)
        total_corr += lcs_length(g_tokens, p_tokens)

    if total_pred == 0 or total_gold == 0:
        raise ValueError("Gold or prediction is empty — cannot compute metrics.")

    precision = total_corr / total_pred
    recall = total_corr / total_gold
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

    return precision, recall, f1


In [4]:
def process_sentences(input_file, output_file):
    """
    Process a raw text file by splitting it into sentences at punctuation marks
    and write each sentence to a new line in the output file.
    
    Args:
        input_file (str): Path to the input raw text file
        output_file (str): Path to the output file where sentences will be written
    
    Returns:
        int: Number of sentences processed
    """
    with open(input_file, "r", encoding="utf-8") as f:
        text = f.read()

    # Define a regex pattern that splits on sentence-ending punctuation
    pattern = r'(?<=[。！？])'

    # Split the text on the defined punctuation
    sentences = re.split(pattern, text)

    # Clean the sentences by stripping whitespace and filter out any empty strings
    sentences = [s.strip() for s in sentences if s.strip()]

    # Write the sentences into the output file, each sentence on a new line
    with open(output_file, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n")

    print(f"Processed {len(sentences)} sentences. Saved to file '{output_file}'.")
    return len(sentences)


process_sentences("data/pku_small.utf8", "data/pku_small_sentence.txt")


Processed 1822 sentences. Saved to file 'data/pku_small_sentence.txt'.


1822

# BPE

We train a BPE model on each pre-tokenization output and then measure the F1 score on a held-out test split.
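
To show why pre-tokenization matters for BPE, here is a toy sketch of a single merge step in which adjacent-symbol pairs are counted only inside each space-delimited pre-token, so no merge can cross a pre-token boundary. It is a simplified stand-in written for this example (the function names are hypothetical) and does not reproduce SentencePiece's trainer.

```python
from collections import Counter


def most_frequent_pair(pretokens: list[list[str]]):
    """Most frequent adjacent symbol pair, counted within pre-tokens only."""
    pairs = Counter()
    for symbols in pretokens:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(pretokens: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Replace every occurrence of pair inside each pre-token by the merged symbol."""
    merged = []
    for symbols in pretokens:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged


# A pre-tokenized line: spaces mark the boundaries proposed by the statistical model.
line = "北京 大学 北京 天气"
pretokens = [list(tok) for tok in line.split()]

pair = most_frequent_pair(pretokens)
pretokens = merge_pair(pretokens, pair)
print(pair, pretokens)
```

Pairs such as ("京", "大") are never counted because the two characters sit in different pre-tokens, which is how the quality of the pre-tokenization carries over into the learned vocabulary.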

In [None]:
class ChineseBPEProcessor:
    """
    Modular class to train a character-level BPE tokenizer on a Chinese corpus,
    apply it to tokenize text, and evaluate segmentation via boundary F1 and SIGHAN F1.
    """
    def __init__(
        self,
        model_prefix: str,
        vocab_size: int,
        input_path: str,
        tokenized_path: str,
        gold_path: str
    ):
        """
        Args:
            model_prefix:    Prefix for the SentencePiece model files (will write
                             `{model_prefix}.model` & `{model_prefix}.vocab`).
            vocab_size:      Desired BPE vocabulary size.
            input_path:      Raw corpus to train on, one sentence per line.
            tokenized_path:  Output path for the tokenized text.
            gold_path:       Path to gold segmented file (spaces between tokens).
        """
        self.model_prefix   = model_prefix
        self.vocab_size     = vocab_size
        self.input_path     = input_path
        self.tokenized_path = tokenized_path
        self.gold_path      = gold_path
        self.model_file     = f"{model_prefix}.model"
        
    def split_train_test(self):
        """Split input data and gold data into training and test sets."""
        # Split input data
        with open(self.input_path, "r", encoding="utf-8") as f:
            lines = f.readlines()
        
        # Split point for a 70/30 train/test split: the first 30% of lines become the test set
        split_idx = int(len(lines) * 0.3)
        
        # Split into train and test sets
        train_data = lines[split_idx:]
        test_data = lines[:split_idx]
        
        train_path = f"{self.model_prefix}_train.txt"
        with open(train_path, "w", encoding="utf-8") as f:
            f.writelines(train_data)
            
        test_path = f"{self.model_prefix}_test.txt"
        with open(test_path, "w", encoding="utf-8") as f:
            f.writelines(test_data)
            
        # Split gold data
        with open(self.gold_path, "r", encoding="utf-8") as f:
            gold_lines = f.readlines()
            
        # Split gold data using same split point
        gold_train_data = gold_lines[split_idx:]
        gold_test_data = gold_lines[:split_idx]
        
        # Write gold train data
        gold_train_path = f"{self.model_prefix}_train_gold.txt"
        with open(gold_train_path, "w", encoding="utf-8") as f:
            f.writelines(gold_train_data)
            
        # Write gold test data
        gold_test_path = f"{self.model_prefix}_test_gold.txt"
        with open(gold_test_path, "w", encoding="utf-8") as f:
            f.writelines(gold_test_data)
            
        # Store paths as instance variables
        self.train_path = train_path
        self.test_path = test_path
        self.gold_train_path = gold_train_path
        self.gold_test_path = gold_test_path
        
        print("Split data into training and test lines")

    
    def train(self, unk_piece: str = '[UNK]') -> None:
        """Train a BPE tokenizer with SentencePiece."""
        spm.SentencePieceTrainer.train(
            input=self.train_path,
            model_prefix=self.model_prefix,
            vocab_size=self.vocab_size,
            model_type="bpe",
            character_coverage=1.0,
            unk_piece=unk_piece,
            pad_id=0, unk_id=1,
            bos_id=-1, eos_id=-1,
            user_defined_symbols=''
        )
        print(f"Trained BPE model: {self.model_file}")
    
    def tokenize(self) -> None:
        """
        Use the trained SentencePiece model to tokenize each line of input,
        join subword pieces with spaces, and write to `self.tokenized_path`.
        """
        sp = spm.SentencePieceProcessor()
        sp.load(self.model_file)
        
        out_dir = os.path.dirname(self.tokenized_path)
        if out_dir:  # dirname is "" for a bare filename, and os.makedirs("") raises
            os.makedirs(out_dir, exist_ok=True)
        with open(self.test_path,  "r", encoding="utf-8") as fin, \
             open(self.tokenized_path, "w", encoding="utf-8") as fout:
            for line in fin:
                line = line.strip()
                if not line:
                    #fout.write("\n")
                    continue
                piece_ids = sp.encode(line, out_type=int)
                pieces = [sp.id_to_piece(pid) for pid in piece_ids]
                # 1) Replace each leading '▁' (SentencePiece's word-boundary marker) with a space
                processed = []
                for p in pieces:
                    if p.startswith('▁'):
                        p = " " + p[1:]
                    processed.append(p)

                # 2) Join into one string, then collapse multiple spaces to a single space
                text = " ".join(processed)
                text = re.sub(r"\s+", " ", text).strip()

                fout.write(text + "\n")
        print(f"Tokenized output saved to: {self.tokenized_path}")

    @staticmethod
    def boundaries_from_words(words: List[str]) -> set[int]:
        """
        Compute boundary positions (cumulative char offsets) from a word list.
        """
        bnds, cum = set(), 0
        for w in words[:-1]:
            cum += len(w)
            bnds.add(cum)
        return bnds
    
    @staticmethod
    def boundary_prf(gold_seg: str, pred_seg: str) -> Tuple[float, float, float]:
        """
        Compute boundary-level Precision, Recall, F1 (percent 0 - 100).
        """
        gold_words = re.split(r"\s+", gold_seg.strip())
        pred_words = re.split(r"\s+", pred_seg.strip())
        gold_b = ChineseBPEProcessor.boundaries_from_words(gold_words)
        pred_b = ChineseBPEProcessor.boundaries_from_words(pred_words)
        if not gold_b and not pred_b:
            return 100.0, 100.0, 100.0
        if not gold_b or not pred_b:
            return 0.0, 0.0, 0.0
        correct = len(gold_b & pred_b)
        p = correct / len(pred_b) * 100
        r = correct / len(gold_b) * 100
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f1
    
    def evaluate(self) -> Tuple[float, float, float]:
        """
        Read gold and tokenized files, compute per-line boundary PRF,
        and return average Precision, Recall, F1.
        """
        with open(self.gold_test_path,      "r", encoding="utf-8") as f_gold, \
             open(self.tokenized_path, "r", encoding="utf-8") as f_pred:
            gold_lines = f_gold.readlines()
            pred_lines = f_pred.readlines()
        
        # align lengths
        assert len(gold_lines) == len(pred_lines), \
            "Gold & pred line counts differ"
        
        ps, rs, fs = [], [], []
        for g, p in zip(gold_lines, pred_lines):
            g, p = g.strip(), p.strip()
            if not g or not p:
                continue
            pr, rc, f1 = self.boundary_prf(g, p)
            ps.append(pr)
            rs.append(rc)
            fs.append(f1)
        
        avg_p  = sum(ps)/len(ps) if ps else 0.0
        avg_r  = sum(rs)/len(rs) if rs else 0.0
        avg_f1 = sum(fs)/len(fs) if fs else 0.0
        
        print(f"Avg Precision: {avg_p:.2f}")
        print(f"Avg Recall:    {avg_r:.2f}")
        print(f"Avg F1:        {avg_f1:.2f}")
        print(f"{avg_p:.2f},{avg_r:.2f},{avg_f1:.2f},{self.vocab_size}")
        return avg_p, avg_r, avg_f1
    
    
def eval_sighan_f1(self) -> Tuple[float, float, float]:
    """
    Evaluate token level Precision, Recall, and F1 using SIGHAN style metrics.
    """
    precision, recall, f1 = compute_sighan_f1(self.gold_test_path, self.tokenized_path)
    print(f"SIGHAN Precision: {precision*100:.2f}")
    print(f"SIGHAN Recall:    {recall*100:.2f}")
    print(f"SIGHAN F1:        {f1*100:.2f}")
    print(f"{precision*100:.2f},{recall*100:.2f},{f1*100:.2f},{self.vocab_size}")
    return precision, recall, f1


ChineseBPEProcessor.eval_sighan_f1 = eval_sighan_f1

### Vocab Size = 12000
#### Baseline BPE

In [7]:
processor = ChineseBPEProcessor(
    model_prefix="model/ori_bpe_12k",
    vocab_size=12000,
    input_path="data/no_spaces_sentences.txt",
    tokenized_path="data/ori_seg_12k.txt",
    gold_path="data/gold_pku_test.txt"
)
processor.split_train_test()
processor.train()
processor.tokenize()
processor.eval_sighan_f1()

Split data into training and test lines
Trained BPE model: model/ori_bpe_12k.model
Tokenized output saved to: data/ori_seg_12k.txt
SIGHAN Precision: 46.89
SIGHAN Recall:    51.96
SIGHAN F1:        49.30
46.89,51.96,49.30,12000


(0.468929198361615, 0.5196135641574272, 0.4929720419524498)


#### GPT2 Pre-tokenization + BPE 



In [6]:
processor = ChineseBPEProcessor(
    model_prefix="model/gpt_bpe_12k",
    vocab_size=12000,
    input_path="data/segment_out.txt",
    tokenized_path="data/gpt_seg_12k.txt",
    gold_path="data/gold_pku_test.txt"
)
processor.split_train_test()
processor.train()
processor.tokenize()
processor.eval_sighan_f1()

Split data into training and test lines
Trained BPE model: model/gpt_bpe_12k.model
Tokenized output saved to: data/gpt_seg_12k.txt
SIGHAN Precision: 52.07
SIGHAN Recall:    64.69
SIGHAN F1:        57.70
52.07,64.69,57.70,12000


(0.5207474294065453, 0.6468910069376904, 0.5770053785206177)

#### Only PMI pre-tokenization

In [8]:
processor = ChineseBPEProcessor(
    model_prefix="model/pred_0_12k",
    vocab_size=12000,
    input_path="data/pred_0_split.utf8",
    tokenized_path="data/pred_0_seg_12k.txt",
    gold_path="data/gold_pku_test.txt"
)
processor.split_train_test()
processor.train()
processor.tokenize()
processor.eval_sighan_f1()

Split data into training and test lines
Trained BPE model: model/pred_0_12k.model
Tokenized output saved to: data/pred_0_seg_12k.txt
SIGHAN Precision: 28.69
SIGHAN Recall:    42.92
SIGHAN F1:        34.39
28.69,42.92,34.39,12000


(0.28694672042311525, 0.42916423523309344, 0.3439334892179787)

### total_score = minPMI + $\lambda$ min(left_entropy, right_entropy)
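
The effect of $\lambda$ can be sketched with two hypothetical candidates: one with high min-PMI but low edge entropy, and one with lower min-PMI but higher edge entropy. The numbers and names below are invented for illustration; setting $\lambda = 0$ reduces the score to min-PMI alone.

```python
def total_score(min_pmi: float, left_entropy: float, right_entropy: float, lam: float = 1.0) -> float:
    """total_score = minPMI + lam * min(left_entropy, right_entropy)."""
    return min_pmi + lam * min(left_entropy, right_entropy)


# Hypothetical candidates: (min_pmi, left_entropy, right_entropy)
candidates = {
    "A": (6.0, 0.5, 0.25),  # cohesive, but its edges are predictable
    "B": (4.0, 1.5, 1.0),   # less cohesive, but its edges vary freely
}


def best(lam: float) -> str:
    """Candidate with the highest total_score for a given weight lam."""
    return max(candidates, key=lambda w: total_score(*candidates[w], lam=lam))


for lam in (0, 1, 4, 15):
    print(lam, best(lam))
```

With these numbers the PMI-heavy candidate wins for small weights and the entropy-heavy candidate wins once the weight is large enough, which is the trade-off the runs below explore.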

#### $\lambda = 1$

In [9]:
processor = ChineseBPEProcessor(
    model_prefix="model/pred_1_12k",
    vocab_size=12000,
    input_path="data/pred_1_split.utf8",
    tokenized_path="data/pred_1_seg_12k.txt",
    gold_path="data/gold_pku_test.txt"
)
processor.split_train_test()
processor.train()
processor.tokenize()
processor.eval_sighan_f1()

Split data into training and test lines
Trained BPE model: model/pred_1_12k.model
Tokenized output saved to: data/pred_1_seg_12k.txt
SIGHAN Precision: 41.24
SIGHAN Recall:    55.91
SIGHAN F1:        47.47
41.24,55.91,47.47,12000


(0.4123864179818269, 0.5591000453867601, 0.4746649051826164)

#### $\lambda = 4$

In [10]:
processor = ChineseBPEProcessor(
    model_prefix="model/pred_4_12k",
    vocab_size=12000,
    input_path="data/pred_4_split.utf8",
    tokenized_path="data/pred_4_seg_12k.txt",
    gold_path="data/gold_pku_test.txt"
    
)
processor.split_train_test()
processor.train()
processor.tokenize()
processor.eval_sighan_f1()

Split data into training and test lines
Trained BPE model: model/pred_4_12k.model
Tokenized output saved to: data/pred_4_seg_12k.txt
SIGHAN Precision: 54.21
SIGHAN Recall:    64.06
SIGHAN F1:        58.73
54.21,64.06,58.73,12000


(0.5421124828532236, 0.6406016987615898, 0.5872563005230623)

#### $\lambda = 15$

In [11]:
processor = ChineseBPEProcessor(
    model_prefix="model/pred_15_12k",
    vocab_size=12000,
    input_path="data/pred_15_split.utf8",
    tokenized_path="data/pred_15_seg_12k.txt",
    gold_path="data/gold_pku_test.txt"
)
processor.split_train_test()
processor.train()
processor.tokenize()
processor.eval_sighan_f1()

Split data into training and test lines
Trained BPE model: model/pred_15_12k.model
Tokenized output saved to: data/pred_15_seg_12k.txt
SIGHAN Precision: 52.83
SIGHAN Recall:    62.17
SIGHAN F1:        57.12
52.83,62.17,57.12,12000


(0.5283195592286501, 0.6217337742332879, 0.5712328359098086)

#### Entropy Only

In [12]:
processor = ChineseBPEProcessor(
    model_prefix="model/pred_only_entropy_12k",
    vocab_size=12000,
    input_path="data/pred_only_entropy_split.utf8",
    tokenized_path="data/pred_only_entropy_seg_12k.txt",
    gold_path="data/gold_pku_test.txt"
)
processor.split_train_test()
processor.train()
processor.tokenize()
processor.eval_sighan_f1()

Split data into training and test lines
Trained BPE model: model/pred_only_entropy_12k.model
Tokenized output saved to: data/pred_only_entropy_seg_12k.txt
SIGHAN Precision: 51.28
SIGHAN Recall:    60.98
SIGHAN F1:        55.71
51.28,60.98,55.71,12000


(0.5128414853590708, 0.6098035401672827, 0.55713524080327)