# Building a Word Segmentation model for Vietnamese

In this assignment, we will build a Word Segmentation model for Vietnamese.



## How to submit

- Attach your notebook file (.ipynb) and submit your work to Google Classroom. **Please do NOT submit a URL.**
- Name your file YourName_StudentID_Assignment3.ipynb, e.g., Nguyen_Van_A_ST099834_Assignment3.ipynb.
- Copying others' assignments is strictly prohibited.
- Write your name and student ID in this notebook.

**Programming assignment 3 is due at 23:59 on March 10, 2023 (hard deadline).**

- 5 points will be deducted for each day of late submission.
- Students who fail to attach the file will not be graded.

## Rules

- You may only use an HMM or a CRF model to complete the assignment.
- If you use an HMM, you may use `nltk.HiddenMarkovModelTagger` to train the model.
- Your code should run without errors.


## Vietnamese Word Segmentation

The smallest unit in the Vietnamese language is the syllable (tiếng). A word may consist of one or more consecutive syllables. In some problems (such as extracting keywords from text), it is necessary to identify the words in the text.

The input of the word tokenizer is a sentence consisting of syllables, and the output is a sentence with words segmented.

Example:

Input: Nam là sinh viên đại học ngành kỹ thuật

Output: Nam là sinh_viên đại_học ngành kỹ_thuật

The underscore symbol "_" is used to connect syllables that belong to the same word.

There are many ways to solve the word segmentation problem. In this exercise, we will use a sequence labeling model to do the task. We use the BI labeling method to label each syllable in the sentence. The tag B-W is used to mark the beginning of a word, and the tag I-W is used to mark a syllable that is inside the same word as the previous syllable.
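The BI labeling scheme can be illustrated with a minimal sketch that converts an underscore-segmented sentence into (syllable, tag) pairs. The function name `segmented_to_tags` is hypothetical and not part of the assignment code; the sketch also assumes that a token consisting only of underscores is a single syllable.

```python
def segmented_to_tags(segmented_sentence):
    """Convert an underscore-segmented sentence into (syllable, tag) pairs."""
    pairs = []
    for word in segmented_sentence.split():
        # Assumption: a token made only of underscores is kept as one syllable.
        syllables = [s for s in word.split("_") if s] or [word]
        for i, syllable in enumerate(syllables):
            # The first syllable opens a word; the rest continue it.
            pairs.append((syllable, "B-W" if i == 0 else "I-W"))
    return pairs


pairs = segmented_to_tags("Nam là sinh_viên đại_học ngành kỹ_thuật")
print(" ".join(f"{syllable}/{tag}" for syllable, tag in pairs))
# Nam/B-W là/B-W sinh/B-W viên/I-W đại/B-W học/I-W ngành/B-W kỹ/B-W thuật/I-W
```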

If we can label each syllable in the input sentence, we can accurately tokenize the sentence.

```
Nam/B-W là/B-W sinh/B-W viên/I-W đại/B-W học/I-W ngành/B-W kỹ/B-W thuật/I-W
```

We can determine the words in the sentence from the above output. A sequence of syllables labeled as B-W I-W ... will form a word.
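The decoding step described above can be sketched as follows. The helper name `tags_to_words` is hypothetical, and the sketch assumes that an I-W tag at the very start of a sentence simply opens a new word.

```python
def tags_to_words(syllables, tags):
    """Join syllables into words according to their B-W / I-W tags."""
    words = []
    for syllable, tag in zip(syllables, tags):
        if tag == "I-W" and words:
            # Continue the current word.
            words[-1] += "_" + syllable
        else:
            # B-W (or a sentence-initial I-W, by assumption) starts a new word.
            words.append(syllable)
    return words


syllables = ["Nam", "là", "sinh", "viên", "đại", "học", "ngành", "kỹ", "thuật"]
tags = ["B-W", "B-W", "B-W", "I-W", "B-W", "I-W", "B-W", "B-W", "I-W"]
print(" ".join(tags_to_words(syllables, tags)))
# Nam là sinh_viên đại_học ngành kỹ_thuật
```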

## Dataset

You will use the training data in the file [train.txt](https://www.dl.dropboxusercontent.com/s/reor8jnqedk7svt/train.txt) to train your Vietnamese word segmentation model and evaluate the model on the test data in the file [test.txt](https://www.dl.dropboxusercontent.com/s/zp635cd1zhofm62/test.txt), extracted from the Word Segmentation VLSP 2013 dataset.

You can download the files using the wget command.

In [24]:
%%capture

!rm -f train.txt
!wget https://www.dl.dropboxusercontent.com/s/reor8jnqedk7svt/train.txt

!rm -f test.txt
!wget https://www.dl.dropboxusercontent.com/s/zp635cd1zhofm62/test.txt

The training data contains 20000 sentences (sentences separated by a blank line) and the test data contains 2000 sentences.

In [25]:
!head -n 10 train.txt

Nam	B-W
hồn	B-W
nhiên	I-W
:	B-W
"	B-W
Tụi	B-W
tôi	B-W
xài	B-W
tiền	B-W
ngân	B-W


## Task Description

- Your task is to build a tagging model that predicts a tag sequence (a sequence of B-W and I-W tags) for the syllables in an input sentence.
- You may use an HMM or a CRF model to complete the assignment.
- If you use an HMM, you may use `nltk.HiddenMarkovModelTagger` to train the model.

The [seqeval](https://github.com/chakki-works/seqeval) package can be used to evaluate the model's word segmentation output.


In [26]:
!pip install -q seqeval[cpu]

## Loading data


In [27]:
def load_data(file_path):
    """Load data from a file (train.txt or test.txt)

    Return:
        tagged_sentences (list): List of sentences. Each sentence is a list of tuples (syllable, tag)
    """
    tagged_sentences = []
    cur_sen = []
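    # Read explicitly as UTF-8 so Vietnamese diacritics load correctly on any platform.
    with open(file_path, 'r', encoding='utf-8') as f: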
        for line in f:
            line = line.rstrip()
            if line == '':
                if len(cur_sen) != 0:
                    tagged_sentences.append(cur_sen)
                    cur_sen = []
            else:
                word, tag = line.split()
                cur_sen.append((word, tag))
    if len(cur_sen) != 0:
        tagged_sentences.append(cur_sen)
    return tagged_sentences

train_data = load_data('train.txt')
test_data = load_data('test.txt')

## Part 1: Building a Word Segmentation model for Vietnamese (70 points)

In this section, you will build a tagging model using HMM or CRF. 

*Hint*: Please refer to [HMM_POS_Tagger.ipynb](https://colab.research.google.com/drive/1lcTncvhlhx8KaJ_oBW6MR7cy470MMpD9#scrollTo=Ty-Qh9Jo23dS) or [CRF_POS_Tagger.ipynb](https://colab.research.google.com/drive/1SuxBmZudn4Tn3w-pBXDoRsuSkBA7RiVV) to understand how to build a tagging model with HMM and CRF.
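If you choose a CRF, each syllable is usually described by a dictionary of features. The following is a minimal sketch of such a feature extractor; the feature names and the helper names `syllable_features` and `sentence_features` are illustrative assumptions, not the contents of the referenced notebooks. Lists of feature dictionaries in this shape are the kind of input that CRF toolkits such as sklearn-crfsuite accept.

```python
def syllable_features(syllables, i):
    """Build a feature dictionary for the syllable at position i."""
    syllable = syllables[i]
    features = {
        "bias": 1.0,
        "lower": syllable.lower(),
        "is_title": syllable.istitle(),
        "is_upper": syllable.isupper(),
        "is_digit": syllable.isdigit(),
        # True for tokens such as ":" or '"' that contain no letters or digits.
        "is_punct": not any(ch.isalnum() for ch in syllable),
    }
    if i > 0:
        prev = syllables[i - 1]
        features["prev_lower"] = prev.lower()
        features["prev_is_title"] = prev.istitle()
        features["bigram_prev"] = prev.lower() + " " + syllable.lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(syllables) - 1:
        nxt = syllables[i + 1]
        features["next_lower"] = nxt.lower()
        features["next_is_title"] = nxt.istitle()
        features["bigram_next"] = syllable.lower() + " " + nxt.lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sentence_features(syllables):
    """Feature dictionaries for every syllable in one sentence."""
    return [syllable_features(syllables, i) for i in range(len(syllables))]


X = sentence_features(["Nam", "là", "sinh", "viên"])
print(X[2]["bigram_next"])
# sinh viên
```

Neighboring-syllable and bigram features matter here because whether a syllable continues a word depends mostly on the syllables around it.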


In [None]:
# TODO: Write code to build the model

### HMM model

In [78]:
import nltk

tagger = nltk.HiddenMarkovModelTagger.train(train_data)

In [79]:
tagger.accuracy(test_data)

0.8272016974814062

## Part 2: Model evaluation (30 points)

Evaluation measures:

- P(recision): (Number of words correctly segmented by the model)/(Number of words in the model's output)
- R(ecall): (Number of words correctly segmented by the model)/(Number of words in the ground-truth data)
- F1 = 2*P*R/(P+R)
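To make these measures concrete, here is a minimal sketch that computes them directly from tag sequences. A word counts as correct only if both of its boundaries match the ground truth. The helper names `word_spans` and `precision_recall_f1` are hypothetical, and the treatment of a sentence-initial I-W tag as the start of a word is an assumption, so the numbers are not guaranteed to match seqeval in every edge case.

```python
def word_spans(tags):
    """Return the set of (start, end) syllable spans, end exclusive."""
    spans = set()
    start = 0
    for i in range(1, len(tags) + 1):
        # A word ends at the sentence end or just before any non-I-W tag.
        if i == len(tags) or tags[i] != "I-W":
            spans.add((start, i))
            start = i
    return spans


def precision_recall_f1(gold_sequences, predicted_sequences):
    """Word-level precision, recall, and F1 over a list of sentences."""
    correct = n_predicted = n_gold = 0
    for gold, predicted in zip(gold_sequences, predicted_sequences):
        gold_spans = word_spans(gold)
        predicted_spans = word_spans(predicted)
        correct += len(gold_spans & predicted_spans)
        n_predicted += len(predicted_spans)
        n_gold += len(gold_spans)
    p = correct / n_predicted if n_predicted else 0.0
    r = correct / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = [["B-W", "B-W", "B-W", "I-W", "B-W", "I-W"]]
pred = [["B-W", "B-W", "B-W", "I-W", "B-W", "B-W"]]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
# P=0.60 R=0.75 F1=0.67
```

In this toy example the model splits the last two-syllable word in two, so it outputs 5 words, of which 3 match the 4 ground-truth words.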

Use the model you have trained to generate the tag sequences for the sentences in the test data.

**Complete the following function**


In [97]:
def tag(syllables):
    """Return a list of tags for a list of syllables

    Arguments:
        syllables (list): list of syllables in one sentence
    
    Returns:
        tags (list): list of tags, one per input syllable
    """
    # The tagger returns (syllable, tag) pairs; keep only the tags.
    tagged = tagger.tag(syllables)
    tags = [t for _, t in tagged]
    return tags

We apply the above function to get the predicted tag sequences of the model.

In [98]:
## First get the unlabeled sentences and corresponding tag sequences
test_sentences = []
test_tag_sequences = []

for sen in test_data:
    words, tags = zip(*sen)
    assert len(words) == len(tags)
    test_sentences.append(list(words))
    test_tag_sequences.append(list(tags))

predicted_tag_sequences = [tag(s) for s in test_sentences]

Next, use the seqeval package to calculate precision, recall, and F1 score.

In [99]:
from seqeval.metrics import f1_score, classification_report
print(classification_report(test_tag_sequences, predicted_tag_sequences))

              precision    recall  f1-score   support

           W       0.70      0.75      0.72     62131

   micro avg       0.70      0.75      0.72     62131
   macro avg       0.70      0.75      0.72     62131
weighted avg       0.70      0.75      0.72     62131



## References

- Huyen, N. T. M., Roussanaly, A., & Vinh, H. T. (2008, March). A hybrid approach to word segmentation of Vietnamese texts. In International Conference on Language and Automata Theory and Applications (pp. 240-249). Springer, Berlin, Heidelberg.
- Nguyen, T. P., & Le, A. C. (2016, November). [A hybrid approach to Vietnamese word segmentation](https://www.researchgate.net/profile/Tuan_Phong_Nguyen/publication/311980397_A_hybrid_approach_to_Vietnamese_word_segmentation/links/5a9507e3a6fdccecff0771ff/A-hybrid-approach-to-Vietnamese-word-segmentation.pdf). In 2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF) (pp. 114-119). IEEE.
- Nguyen, D. Q., Nguyen, D. Q., Vu, T., Dras, M., & Johnson, M. (2017). [A fast and accurate Vietnamese word segmenter](https://arxiv.org/abs/1709.06307). arXiv preprint arXiv:1709.06307.
- [seqeval](https://github.com/chakki-works/seqeval) for sequence labeling evaluation.

