# Chinese Text Segmentation (Maximal Matching)

Written Chinese does not put spaces between words, so a program has to find the word boundaries itself. Here we do that with a dictionary of known words.

### 1. Environment Setup
We import `sys` and set `MAXWORDLEN`, the maximum length (in characters) of a candidate word.
In a script, filenames are passed via the command line (`sys.argv`).

In [6]:
# Imports
import sys
# Longest candidate word, in characters, that the segmenter will try
MAXWORDLEN = 5

### 2. Reading the Dictionary (Dictionary Loading)

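The first step of the algorithm is to load a list of known Chinese words.
We will read the file `chinesetrad_wordlist.utf8` and store the words in a **`set`** data structure.
This is critical for performance: a membership test on a `set` takes constant time on average (O(1)), whereas on a `list` it scans the elements one by one (O(n)), and the segmenter performs several such lookups for every position in the text.
The cell below first defines the `print_help` helper that prints the usage message; the word list itself is loaded in a later cell.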

In [3]:
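def print_help():
    """Print the module docstring as a usage message and exit."""
    progname = sys.argv[0]
    progname = progname.split('/')[-1]  # strip out extended path
    # The module docstring is expected to contain a <PROGNAME> placeholder
    help_text = __doc__.replace('<PROGNAME>', progname, 1)  # avoids shadowing the built-in help()
    print('-' * 60, help_text, '-' * 60, file=sys.stderr)
    sys.exit(0)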

### 3. Implementing the Segmentation Algorithm (Maximal Matching)

This is the main part of the work. The `segment` function takes a sentence (a string without spaces) and returns a list of found words.

**How the "Greedy Algorithm" (Maximal Matching) works:**
1. We start from the beginning of the sentence (position 0).
2. We take the longest possible chunk of text starting at the current position (at most `MAXWORDLEN = 5` characters).
3. We check: is this chunk in our dictionary?
    * **Yes:** Great, it's a word! We add it to the result and move forward by its length.
    * **No:** We decrease the length of the chunk by 1 and check again.
4. If the chunk length reaches 1 character, we consider this character a word (even if it's not in the dictionary), add it, and move 1 step forward.
5. We repeat steps 2 to 4 from the new position until we reach the end of the sentence.
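
The steps above can be sketched for a single position as follows. This is a minimal illustration rather than the notebook's `segment` function: the helper name `longest_match`, the three-word toy dictionary, and the sample sentence are assumptions made up for the example.

```python
MAXWORDLEN = 5

def longest_match(sent, position, wordset, max_len=MAXWORDLEN):
    """Return the chunk chosen at `position` plus every chunk that was tried.

    Hypothetical helper, for illustration only.
    """
    tried = []
    for length in range(max_len, 0, -1):
        candidate = sent[position:position + length]
        tried.append(candidate)
        # Accept a dictionary word, or fall back to a single character
        if candidate in wordset or length == 1:
            return candidate, tried

toy_words = {"中國", "人民", "中國人"}  # toy dictionary (assumption)
sentence = "中國人民很好"

chosen, tried = longest_match(sentence, 0, toy_words)
print(tried)   # ['中國人民很', '中國人民', '中國人']
print(chosen)  # 中國人
```

Note that the greedy choice takes `中國人` because it is the longest match, even though `中國` followed by `人民` would also be built from dictionary words.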

In [7]:
if '-h' in sys.argv or len(sys.argv) != 4:
    print_help()

word_list_file = sys.argv[1]
input_file = sys.argv[2]
output_file = sys.argv[3]

------------------------------------------------------------ Automatically created module for IPython interactive environment ------------------------------------------------------------


SystemExit: 0

Inside the notebook, `sys.argv` does not hold the three expected filenames, so this cell calls `print_help()`, which prints the module docstring between two dashed rules and exits with `SystemExit: 0`. Run as a script, the program expects the word list, the input file, and the output file, in that order.

### 4. Running File Processing (Main Loop)

Now let's put it all together.
We open the input file `chinesetext.utf8` with unsegmented text and the output file `MYRESULTS.utf8` to write the result.

The script will read the input file line by line, apply our `segment` function to each line, and write the result (words separated by spaces) to the output file. The cell below prepares for that step by loading the dictionary into the `word_list` set.

In [None]:
word_list = set()

with open(word_list_file, "r", encoding="utf-8") as f:
    for line in f:
        word = line.strip()
        if word:  # skip blank lines so the empty string is never stored
            word_list.add(word)
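
To see the loading step in isolation, here is a small sketch that reads a word list from an in-memory file instead of `chinesetrad_wordlist.utf8`. The helper name `load_words` and the sample words are assumptions for illustration, and the sketch skips blank lines.

```python
import io

def load_words(handle):
    """Read one word per line into a set, ignoring blank lines (hypothetical helper)."""
    return {line.strip() for line in handle if line.strip()}

# In-memory stand-in for the word list file: one duplicate and one blank line
fake_file = io.StringIO("中國\n人民\n\n中國人\n中國\n")
words = load_words(fake_file)

print(len(words))       # 3: the duplicate and the blank line are not counted
print("人民" in words)   # True, found by a hash lookup
print("你好" in words)   # False
```

Because a `set` stores each word once, duplicates in the file cost nothing beyond the time spent reading them.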

### 5. Function Testing

Before processing the entire file, it is worth checking our `segment` function on a single concrete example.
This helps ensure that the algorithm works correctly, splits the string, and returns a list of words.
Joining the resulting list with spaces makes the output readable.
The cell below contains the definition of `segment` itself.

In [None]:
def segment(sent, wordset):
    """Split `sent` into words by greedy maximal matching against `wordset`."""
    sent = sent.strip()
    words = []
    position = 0

    while position < len(sent):
        word_len = MAXWORDLEN

        while word_len > 0:
            # Extract the candidate word; slicing never raises an IndexError,
            # it simply returns a shorter string near the end of the sentence
            candidate = sent[position : position + word_len]

            # Accept a dictionary word, or fall back to a single character
            if candidate in wordset or word_len == 1:
                words.append(candidate)
                position += len(candidate)  # advance past the accepted word
                break  # found a word, leave the inner loop

            word_len -= 1  # try a shorter candidate

    return words
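
A quick check on one sentence might look like the sketch below. The function is repeated in compact form so the snippet runs on its own, and the toy dictionary and sentence are made up for illustration; they are not taken from `chinesetrad_wordlist.utf8`.

```python
MAXWORDLEN = 5

def segment(sent, wordset):
    # Compact copy of the function above, repeated so the snippet is self-contained
    sent = sent.strip()
    words = []
    position = 0
    while position < len(sent):
        for word_len in range(MAXWORDLEN, 0, -1):
            candidate = sent[position:position + word_len]
            if candidate in wordset or word_len == 1:
                words.append(candidate)
                position += len(candidate)
                break
    return words

toy_words = {"我們", "喜歡", "學習", "中文"}  # toy dictionary (assumption)
result = segment("我們喜歡學習中文!\n", toy_words)
print(" ".join(result))  # 我們 喜歡 學習 中文 !
```

The trailing `!` is not in the toy dictionary, so it comes out as a single-character word, which is the fallback described in step 4 of the algorithm.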

### 6. Main File Processing Loop (Main Loop)

This is the final stage. We open the input file `chinesetext.utf8` for reading and the file `MYRESULTS.utf8` for writing the results.

The script iterates through each line of the input file:
1. Applies the `segment` function to split the line into words.
2. Joins the resulting list of words into a single string separated by spaces.
3. Writes the result to the output file (don't forget to add a newline character `\n` at the end).

In [None]:
with open(input_file, "r", encoding="utf-8") as f_in:
    with open(output_file, "w", encoding="utf-8") as f_out:
        for line in f_in:
            # 1. Segmenting the line
            segmented_list = segment(line, word_list)
            
            # 2. Joining the segmented list into a single string
            result_line = " ".join(segmented_list)
            
            # 3. Writing the result line to the output file
            f_out.write(result_line + "\n")
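
The same loop can be tried without touching the file system. The sketch below uses `io.StringIO` objects in place of the real files and a trivial stand-in segmenter that treats every character as a word; the names `segment_stream` and `split_chars` are hypothetical and exist only for this illustration.

```python
import io

def segment_stream(f_in, f_out, segment_fn):
    """Segment every line of f_in and write one output line per input line."""
    for line in f_in:
        words = segment_fn(line)
        f_out.write(" ".join(words) + "\n")

def split_chars(line):
    # Hypothetical stand-in for segment(): every character becomes a word
    return list(line.strip())

source = io.StringIO("我愛你\n\n你好\n")  # includes one empty line
target = io.StringIO()
segment_stream(source, target, split_chars)

print(repr(target.getvalue()))  # '我 愛 你\n\n你 好\n'
```

An empty input line still produces an empty output line, so line numbers stay aligned between input and output. That alignment is presumably what a line-by-line comparison against a gold standard file relies on.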


### 1. Run Segmentation

In this cell, we execute the main script `chinese_segmentation_STARTER_CODE.py`.
It takes the following inputs:
1.  **Word List** (`chinesetrad_wordlist.utf8`) — a list of known Chinese words (dictionary).
2.  **Input Text** (`chinesetext.utf8`) — the raw text without spaces that needs to be segmented.

The result of the algorithm (segmented text) is saved to the file **`MYRESULTS.utf8`**.

In [1]:
!python chinese_segmentation_STARTER_CODE.py chinesetrad_wordlist.utf8 chinesetext.utf8 MYRESULTS.utf8

Segmentation complete. Results saved to MYRESULTS.utf8


### 2. Evaluation

Now we evaluate the accuracy of our algorithm.
The script `eval_chinese_segmentation.py` compares our result (`MYRESULTS.utf8`) against the **Gold Standard** (`chinesetext_goldstandard.utf8`).

The script outputs two metrics:
* **Word-level accuracy:** The number of correctly segmented words as a percentage of the words in the gold standard.
* **Sentence-level accuracy:** The percentage of sentences where every single word was identified correctly.

In [2]:
!python eval_chinese_segmentation.py chinesetext_goldstandard.utf8 MYRESULTS.utf8


Total correct words: 88573
Total gold-std words: 91627
Word-level accuracy: 96.67%
Sentence-level accuracy: 85.38%
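
The word-level figure can be reproduced from the two counts in the output. The sketch below redoes that arithmetic and, as an assumption about how the sentence-level metric might be computed (the source of `eval_chinese_segmentation.py` is not shown here), counts the lines that match the gold standard exactly. The helper `sentence_accuracy` and the two sample lines are hypothetical.

```python
correct_words = 88573  # "Total correct words" reported above
gold_words = 91627     # "Total gold-std words" reported above

word_accuracy = 100 * correct_words / gold_words
print(f"Word-level accuracy: {word_accuracy:.2f}%")  # Word-level accuracy: 96.67%

def sentence_accuracy(gold_lines, result_lines):
    """Percentage of lines whose segmentation matches the gold standard exactly.

    Hypothetical helper; the real evaluation script may differ in details.
    """
    pairs = list(zip(gold_lines, result_lines))
    exact = sum(1 for gold, result in pairs if gold.split() == result.split())
    return 100 * exact / len(pairs)

gold = ["我們 喜歡 中文", "中國 人民"]   # made-up gold standard lines
ours = ["我們 喜歡 中文", "中國人 民"]   # made-up segmenter output
print(sentence_accuracy(gold, ours))  # 50.0
```

A single wrong boundary makes the whole sentence count as wrong, which is why the sentence-level score is lower than the word-level one.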
