# Morphological Segmentation with Morfessor (Telugu)

This tutorial demonstrates unsupervised morphological segmentation using **Morfessor** on **Telugu**, a morphologically rich language.

Morfessor breaks words into morphemes (the smallest meaning-carrying units) without requiring labeled data, which makes it useful for low-resource NLP tasks.

We will:
- Preprocess Telugu data
- Train the Morfessor model
- Predict and visualize segmentations
- Evaluate segmentation quality


In [1]:
!pip install morfessor
import morfessor
import os




## 📝 Step 1: Create a Sample Telugu Word File

We'll write a few Telugu words to a `.txt` file, which Morfessor will read as its training corpus.


In [2]:
# Sample Telugu words
telugu_words = [
    "ప్రపంచానికి", "అధ్యాపకుడు", "ఉపాధ్యాయురాలు", "అనుభవించాయి", "చదువుతున్నాను"
]

# Save to file
with open("telugu_sample.txt", "w", encoding="utf-8") as f:
    for word in telugu_words:
        f.write(word + "\n")

print("✅ Wrote Telugu words to telugu_sample.txt")


✅ Wrote Telugu words to telugu_sample.txt
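
The overview lists preprocessing as a step, but the cell above writes the words out unchanged. Below is a minimal sketch of what a preprocessing pass could look like. It assumes (this is not taken from the notebook) that the goal is NFC-normalized, de-duplicated tokens made up only of characters from the Unicode Telugu block (U+0C00 to U+0C7F). `is_telugu` and `clean_telugu_words` are hypothetical helper names.

```python
import unicodedata

TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F  # Unicode Telugu block


def is_telugu(word):
    """Return True if `word` is non-empty and every character is in the Telugu block."""
    return bool(word) and all(TELUGU_START <= ord(ch) <= TELUGU_END for ch in word)


def clean_telugu_words(raw_words):
    """Hypothetical helper: strip, NFC-normalize, keep Telugu-only tokens, drop duplicates."""
    seen = set()
    cleaned = []
    for raw in raw_words:
        word = unicodedata.normalize("NFC", raw.strip())
        if is_telugu(word) and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


# Illustrative input: trailing space, a duplicate, a Latin token, an empty string, mixed digits
raw = ["చదువు ", "చదువు", "exam", "పరీక్ష", "", "పరీక్ష123"]
print(clean_telugu_words(raw))
```

The filter is deliberately strict: it would also reject words containing a zero-width non-joiner (U+200C), which some Telugu text uses, so the allowed range may need widening for real corpora.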


## 🤖 Step 2: Train Morfessor on Telugu Data
We'll now train a Morfessor model using the sample Telugu words.


In [3]:
# Load training data
io = morfessor.MorfessorIO()
data = io.read_corpus_file("telugu_sample.txt")

# Initialize and train the model
model = morfessor.BaselineModel()
model.load_data(data)
model.train_batch()

print("✅ Morfessor training complete!")


100% (5 of 5) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00
100% (5 of 5) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00


✅ Morfessor training complete!


## 🔍 Step 3: Predict Morpheme Segmentation

Now we'll test how Morfessor segments some Telugu words.


In [4]:
# Test words (same ones or new ones)
test_words = [
    "ప్రపంచానికి", "అధ్యాపకుడు", "ఉపాధ్యాయురాలు", "అనుభవించాయి", "చదువుతున్నాను"
]

# Segment and display results
for word in test_words:
    segments = model.viterbi_segment(word)[0]
    print(f"{word} ➝ {' + '.join(segments)}")


ప్రపంచానికి ➝ ప్రపంచానికి
అధ్యాపకుడు ➝ అధ్యాపకుడు
ఉపాధ్యాయురాలు ➝ ఉపాధ్యాయురాలు
అనుభవించాయి ➝ అనుభవించాయి
చదువుతున్నాను ➝ చదువుతున్నాను


In [5]:
telugu_words = [
    "చదువుతాడు", "చదువుతున్నారు", "చదువుతున్నాను", "చదువుతుంది",
    "ఆడుతోంది", "ఆడతాడు", "ఆడారు", "ఆడింది",
    "పరీక్ష", "పరీక్షలకి", "పరీక్షలు", "పరీక్షకు"
]


## 📚 Step 4: Train with a Larger Telugu Word List

We'll now use a longer list of Telugu words that share roots and suffixes, giving Morfessor recurring substrings to learn from.


In [6]:
# Larger word list with repeated roots and suffixes
telugu_words_large = [
    "చదువుతాడు", "చదువుతున్నారు", "చదువుతున్నాను", "చదువుతుంది", "చదువుతుంది",
    "ఆడుతోంది", "ఆడతాడు", "ఆడారు", "ఆడింది", "ఆడుతున్నాడు",
    "పరీక్ష", "పరీక్షలు", "పరీక్షలకి", "పరీక్షలకు", "పరీక్షకు",
    "తెలుగునాడు", "తెలుగులో", "తెలుగువారు", "తెలుగుదేశం", "తెలుగుబడి"
]

# Save to file again (overwrite)
with open("telugu_sample.txt", "w", encoding="utf-8") as f:
    for word in telugu_words_large:
        f.write(word + "\n")

print(f"✅ Wrote {len(telugu_words_large)} Telugu words to telugu_sample.txt")


✅ Wrote 20 Telugu words to telugu_sample.txt
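
To see how much structure a word list like this actually shares, one rough check is to count how many distinct words begin with each prefix. The sketch below is an illustration only, not part of Morfessor or of this notebook. `prefix_counts` is a hypothetical helper, and prefixes are measured in Unicode code points, so a prefix may end in the middle of a Telugu syllable.

```python
from collections import Counter


def prefix_counts(words, min_len=2):
    """For every prefix of at least `min_len` code points, count the distinct words starting with it."""
    counts = Counter()
    for word in set(words):
        for end in range(min_len, len(word) + 1):
            counts[word[:end]] += 1
    return counts


# A subset of the Step 4 list
words = ["పరీక్ష", "పరీక్షలు", "పరీక్షలకి", "పరీక్షలకు", "పరీక్షకు",
         "తెలుగునాడు", "తెలుగులో", "తెలుగువారు"]
counts = prefix_counts(words)

exam_stem = words[0]          # the bare noun, also a prefix of the next four words
telugu_stem = words[5][:6]    # first six code points of the sixth word
print(exam_stem, counts[exam_stem])
print(telugu_stem, counts[telugu_stem])
```

On this subset the first stem is shared by five words and the second by three. A stem that recurs with several different endings is the kind of regularity an unsupervised segmenter can exploit, and a handful of such words may still be too few for it to prefer splitting.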


## 🔁 Step 5: Retrain Morfessor with the New Data
Now we'll retrain the model using the larger word list.


# Reload corpus
data = io.read_corpus_file("telugu_sample.txt")

# Reinitialize model (start fresh)
model = morfessor.BaselineModel()
model.load_data(data)
model.train_batch()

print("✅ Retrained Morfessor with larger dataset!")


## 🔄 Step 6: Segment Again Using the Retrained Model

Let's check whether Morfessor now splits words into meaningful morphemes.


In [7]:
# Predict segmentations on the training words
for word in telugu_words_large:
    segments = model.viterbi_segment(word)[0]
    print(f"{word} ➝ {' + '.join(segments)}")


చదువుతాడు ➝ చదువుతాడు
చదువుతున్నారు ➝ చదువుతున్నారు
చదువుతున్నాను ➝ చదువుతున్నాను
చదువుతుంది ➝ చదువుతుంది
చదువుతుంది ➝ చదువుతుంది
ఆడుతోంది ➝ ఆడుతోంది
ఆడతాడు ➝ ఆడతాడు
ఆడారు ➝ ఆడారు
ఆడింది ➝ ఆడింది
ఆడుతున్నాడు ➝ ఆడుతున్నాడు
పరీక్ష ➝ పరీక్ష
పరీక్షలు ➝ పరీక్షలు
పరీక్షలకి ➝ పరీక్షలకి
పరీక్షలకు ➝ పరీక్షలకు
పరీక్షకు ➝ పరీక్షకు
తెలుగునాడు ➝ తెలుగునాడు
తెలుగులో ➝ తెలుగులో
త…

In [8]:
telugu_words_extended = [
    "చదువుతాడు", "చదువుతున్నారు", "చదువుతున్నాను", "చదువుతుంది", "చదువుతున్నవి",
    "పాటలకోసం", "పాటలు", "పాటలతో", "పాటకులకు", "పాటగా",
    "ఆడుతుంది", "ఆడతాడు", "ఆడారు", "ఆడతానని", "ఆడుతున్నాము",
    "పరీక్ష", "పరీక్షలు", "పరీక్షలకి", "పరీక్షలకు", "పరీక్షకు",
    "తెలుగునాడు", "తెలుగులో", "తెలుగువారు", "తెలుగుదేశం", "తెలుగుబడి"
]

# Overwrite file
with open("telugu_sample.txt", "w", encoding="utf-8") as f:
    for word in telugu_words_extended:
        f.write(word + "\n")

print(f"✅ Wrote {len(telugu_words_extended)} words to file")


✅ Wrote 25 words to file


In [9]:
# Reload and retrain
data = io.read_corpus_file("telugu_sample.txt")
model = morfessor.BaselineModel()
model.load_data(data)
model.train_batch()

print("🔁 Model retrained on extended data!")


100% (25 of 25) |########################| Elapsed Time: 0:00:00 Time:  0:00:00
100% (25 of 25) |########################| Elapsed Time: 0:00:00 Time:  0:00:00
100% (25 of 25) |########################| Elapsed Time: 0:00:00 Time:  0:00:00


🔁 Model retrained on extended data!


In [10]:
# Try segmenting again
for word in telugu_words_extended:
    segments = model.viterbi_segment(word)[0]
    print(f"{word} ➝ {' + '.join(segments)}")


చదువుతాడు ➝ చదువుతాడు
చదువుతున్నారు ➝ చదువుతున్నారు
చదువుతున్నాను ➝ చదువుతున్నాను
చదువుతుంది ➝ చదువుతుంది
చదువుతున్నవి ➝ చదువుతున్నవి
పాటలకోసం ➝ పాట + లకోసం
పాటలు ➝ పాట + లు
పాటలతో ➝ పాట + లతో
పాటకులకు ➝ పాట + కు + లకు
పాటగా ➝ పాట + గా
ఆడుతుంది ➝ ఆడుతుంది
ఆడతాడు ➝ ఆడతాడు
ఆడారు ➝ ఆడారు
ఆడతానని ➝ ఆడతానని
ఆడుతున్నాము ➝ ఆడుతున్నాము
పరీక్ష ➝ పరీక్ష
పరీక్షలు ➝ పరీక్ష + లు
పరీక్షలకి ➝ ప…

In [11]:
with open("mixed_telugu_2000.txt", "r", encoding="utf-8") as f:
    train_words = [line.strip() for line in f if line.strip()]


In [12]:
# Step 2: Initialize and train the Morfessor model
io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()

# Convert words into (count, word) pairs
train_data = [(1, word) for word in train_words]

model.load_data(train_data)
model.train_batch()

print("✅ Morfessor trained on 2000-word mixed dataset.")


100% (1466 of 1466) |####################| Elapsed Time: 0:00:03 Time:  0:00:03
100% (1466 of 1466) |####################| Elapsed Time: 0:00:02 Time:  0:00:02
100% (1466 of 1466) |####################| Elapsed Time: 0:00:02 Time:  0:00:02
100% (1466 of 1466) |####################| Elapsed Time: 0:00:02 Time:  0:00:02
100% (1466 of 1466) |####################| Elapsed Time: 0:00:02 Time:  0:00:02
100% (1466 of 1466) |####################| Elapsed Time: 0:00:02 Time:  0:00:02


✅ Morfessor trained on 2000-word mixed dataset.
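
The cell above gives every line of the file a count of 1. A possible variant, which is not what the notebook does, is to collapse repeated words into `(count, word)` pairs first, the same tuple shape the cell passes to `load_data`. The progress bar reports 1466 items for a file described as 2000 words, which would be consistent with the file containing repeats, though the file itself is not shown here. `to_count_pairs` is a hypothetical helper name.

```python
from collections import Counter


def to_count_pairs(words):
    """Collapse a word list into (count, word) pairs, most frequent first."""
    freq = Counter(words)
    return [(count, word) for word, count in freq.most_common()]


# Illustrative tokens with one word repeated three times
sample = ["పాటలు", "పాటగా", "పాటలు", "పరీక్ష", "పాటలు"]
pairs = to_count_pairs(sample)
print(len(sample), "tokens ->", len(pairs), "distinct words")
print(pairs[0])
```

Whether to train on raw token counts or on one count per distinct word is a modeling choice: frequent words weigh more heavily in the first case.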


In [18]:
# Load the 500 test words
with open("telugu_500.txt", "r", encoding="utf-8") as f:
    test_words = [line.strip() for line in f if line.strip()]

# Segment each test word using the trained model
segmented_results = []
for word in test_words:
    segments = model.viterbi_segment(word)[0]
    segmented_results.append(f"{word} ➝ {' + '.join(segments)}")

# Save to a file (the message below reports this same path)
output_path = "final_novel_test_500.txt"
with open(output_path, "w", encoding="utf-8") as f:
    for line in segmented_results:
        f.write(line + "\n")

print(f"✅ Segmentation complete. Results saved to {output_path}")


✅ Segmentation complete. Results saved to final_novel_test_500.txt


In [19]:
# Count how many words got split (i.e., more than one segment)
split_count = 0

for word in test_words:
    segments = model.viterbi_segment(word)[0]
    if len(segments) > 1:
        split_count += 1

print(f"🔍 {split_count} out of {len(test_words)} words were segmented into multiple morphemes.")


🔍 461 out of 500 words were segmented into multiple morphemes.
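
The share of split words is one summary of the output. The sketch below computes two more (split rate and mean morphs per word) directly from saved `word ➝ morph + morph` lines, without re-running the model. It is an illustration under assumptions: `parse_line` and `summarize` are hypothetical helpers, and the separators are taken from the f-string used when the results were written.

```python
def parse_line(line, arrow=" ➝ ", joiner=" + "):
    """Split one saved result line into (word, list_of_morphs)."""
    word, segmentation = line.strip().split(arrow, 1)
    return word, segmentation.split(joiner)


def summarize(lines):
    """Return word count, split rate and mean morphs per word for saved segmentation lines."""
    parsed = [parse_line(line) for line in lines if line.strip()]
    if not parsed:
        raise ValueError("no segmentation lines to summarize")
    # Sanity check: the morphs of each word must concatenate back to the word.
    assert all("".join(morphs) == word for word, morphs in parsed)
    n_split = sum(1 for _, morphs in parsed if len(morphs) > 1)
    n_morphs = sum(len(morphs) for _, morphs in parsed)
    return {
        "words": len(parsed),
        "split_rate": n_split / len(parsed),
        "morphs_per_word": n_morphs / len(parsed),
    }


# Four segmentations copied from the extended-list output earlier in the notebook
examples = [["పాట", "లు"], ["పాట", "కు", "లకు"], ["పరీక్ష"], ["పాట", "గా"]]
lines = ["".join(morphs) + " ➝ " + " + ".join(morphs) for morphs in examples]
print(summarize(lines))
```

A high split rate alone does not indicate quality, since over-segmentation also raises it. Mean morphs per word gives a second view of how aggressively the model is cutting.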


In [20]:
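# Load gold-standard segmentations (one "word ➝ morph + morph" entry per line)
with open("Tested500words.txt", "r", encoding="utf-8") as f:
    gold_pairs = [line.strip().split(" ➝ ", 1) for line in f if " ➝ " in line]

# Map each word to its gold segmentation as a list of morphs, so it can be
# compared directly with the list returned by viterbi_segment
gold_dict = {word.strip(): seg.strip().split(" + ") for word, seg in gold_pairs}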

In [21]:
correct = 0
total = len(gold_dict)

print("🔍 Word-by-word comparison:\n")

for word, expected in gold_dict.items():
    predicted = model.viterbi_segment(word)[0]
    match = predicted == expected
    if match:
        correct += 1
    print(f"{word:15} → {' + '.join(predicted):25} | Expected: {' + '.join(expected):25} | {'✅' if match else '❌'}")

accuracy = (correct / total) * 100
print(f"\n📊 Overall Evaluation Accuracy: {accuracy:.2f}% ({correct}/{total} correct)")

Recorded output, from a run in which each gold entry was still a raw string (so the Expected column is joined character by character and no row can match):

🔍 Word-by-word comparison:

చదువు           → చదువు                     | Expected: చ + ద + ు + వ + ు         | ❌
మాచదువు         → మాచదువు                   | Expected: మ + ా + చ + ద + ు + వ + ు | ❌
అచదువు          → అచదువు                    | Expected: అ + చ + ద + ు + వ + ు     | ❌
ఉపచదువు         → ఉపచదువు                   | Expected: ఉ + ప + చ + ద + ు + వ + ు | ❌
సహచదువు         → సహచదువు                   | Expected: స + హ + చ + ద + ు + వ + ు | ❌
చదువుతాడు       → చదువు + తాడు              | Expected: చ + ద + ు + వ + ు + త + ా + డ + ు | ❌
మాచదువుతాడు     → మాచదువు + తాడు            | Expected: మ + ా + చ + ద + ు + వ + ు +   + + +   + త + ా + డ + …
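
Exact-match accuracy gives no partial credit when a prediction gets some boundaries right and others wrong. A common complementary measure is boundary precision, recall and F1. The following is a minimal sketch of that idea, not the notebook's evaluation and not Morfessor's own evaluation code. `boundaries` and `boundary_prf` are hypothetical helpers, positions are counted in code points, and scores default to 0.0 when there is nothing to count (an arbitrary convention).

```python
def boundaries(morphs):
    """Return the set of internal split positions (in code points) for a list of morphs."""
    positions, offset = set(), 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_prf(predicted, gold):
    """Micro-averaged boundary precision, recall and F1 over parallel lists of segmentations."""
    tp = fp = fn = 0
    for pred_morphs, gold_morphs in zip(predicted, gold):
        p, g = boundaries(pred_morphs), boundaries(gold_morphs)
        tp += len(p & g)   # boundaries placed where the gold standard has one
        fp += len(p - g)   # predicted boundaries absent from the gold standard
        fn += len(g - p)   # gold boundaries the prediction missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data with Latin placeholders, not real model output
predicted = [["read", "ing"], ["un", "read", "able"], ["books"]]
gold = [["read", "ing"], ["un", "readable"], ["book", "s"]]
precision, recall, f1 = boundary_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the toy data two of the three predicted boundaries are correct and two of the three gold boundaries are found, so precision and recall both come out at two thirds. With the notebook's objects, `predicted` would hold the `viterbi_segment(word)[0]` lists and `gold` the corresponding gold morph lists in the same order.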