In this assignment, we will build a Word Segmentation model for Vietnamese.

# How to submit

- Attach the notebook file (.ipynb) and submit your work to Google Classroom. **Please do NOT submit a URL.**
- Name your file YourName_StudentID_Assignment3.ipynb, e.g., Nguyen_Van_A_ST099834_Assignment3.ipynb
- Copying others' assignments is strictly prohibited.
- Write your name and student ID into this notebook

**Programming assignment 3 is due at 23:59 on March 10, 2023 (hard deadline).**

- 5 points will be deducted for each day your submission is late.
- Students who fail to attach the file will not be graded.

# Rules

- You may only use an HMM or a CRF model to complete the assignment.
- If you use an HMM, you are allowed to train it with nltk.HiddenMarkovModelTagger.
- Your code should run without errors.


# Vietnamese Word Segmentation

The smallest unit in the Vietnamese language is the syllable (tiếng). A word may consist of one or multiple consecutive syllables. In some problems (such as extracting keywords from text), it is necessary to identify the words in the text.

The input of the word tokenizer is a sentence consisting of syllables, and the output is a sentence with words segmented.

Example:

Input: Nam là sinh viên đại học ngành kỹ thuật

Output: Nam là sinh_viên đại_học ngành kỹ_thuật

The underscore symbol "_" is used to connect syllables that belong to the same word.

There are many ways to solve the word segmentation problem. In this exercise, we will use a sequence labeling model to do the task. We use the BI labeling method to label each syllable in the sentence. The tag B-W is used to mark the beginning of a word, and the tag I-W is used to mark a syllable that is inside the same word as the previous syllable.

If we can label each syllable in the input sentence correctly, we can recover the word segmentation of the sentence.

```
Nam/B-W là/B-W sinh/B-W viên/I-W đại/B-W học/I-W ngành/B-W kỹ/B-W thuật/I-W
```

We can determine the words in the sentence from the above output. A sequence of syllables labeled as B-W I-W ... will form a word.
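The mapping between a segmented sentence and its BI labels can be sketched in a few lines of Python. This is only an illustration: the helper names words_to_tagged and tagged_to_words are hypothetical, and the sketch assumes that words are separated by spaces and that "_" appears only as a syllable connector.

```python
def words_to_tagged(segmented):
    """Turn 'sinh_viên đại_học' into [(syllable, tag), ...] with B-W/I-W tags."""
    tagged = []
    for word in segmented.split():
        syllables = word.split("_")
        # The first syllable opens the word, the rest continue it
        tagged.append((syllables[0], "B-W"))
        tagged.extend((s, "I-W") for s in syllables[1:])
    return tagged


def tagged_to_words(tagged):
    """Join syllables back into words: I-W attaches to the previous syllable."""
    words = []
    for syllable, tag in tagged:
        if tag == "I-W" and words:
            words[-1] += "_" + syllable
        else:
            # B-W (or an I-W with nothing before it) starts a new word
            words.append(syllable)
    return " ".join(words)


sentence = "Nam là sinh_viên đại_học ngành kỹ_thuật"
tagged = words_to_tagged(sentence)
print(" ".join(f"{s}/{t}" for s, t in tagged))
print(tagged_to_words(tagged))
```

The first print reproduces the labeled example above, and the second recovers the original segmented sentence.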

# Dataset

You will use the training data in the file [train.txt](https://drive.google.com/file/d/1Y4AuWqbInOv1HNMiGPMhuntcA-n1jfWF/view?usp=share_link) to train your Vietnamese word segmentation model, and evaluate the model on the test data in the file [test.txt](https://drive.google.com/file/d/1Y57hlYLpxVCZUbVuwOa1IRCLgUiEAx0c/view?usp=share_link) extracted from the Word Segmentation VLSP 2013 dataset.

The training data contains 20000 sentences (sentences are separated by a blank line), and the test data contains 2000 sentences.

You can download the files using the wget command.

I have also uploaded the Vietnamese word segmentation datasets to Kaggle; the loading code in this notebook reads them from there.

# Install necessary packages

We will use the following packages:

- [python-crfsuite](https://github.com/scrapinghub/python-crfsuite), a Python binding to CRFsuite.
- seqeval, for sequence tagging evaluation.

In [1]:
%%capture
%pip install -q seqeval[cpu]
%pip install -q python-crfsuite

# Loading data

We will load the data into a list of sentences, where each sentence is a list of tuples (word, tag), by using the following function.

In [2]:
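def load_data(file_path):
    """Load data from a file (train.txt or test.txt)

    Each non-blank line holds a syllable and its tag separated by whitespace;
    a blank line ends the current sentence.

    Return:
        tagged_sentences (list): List of sentences. Each sentence is a list of tuples (word, tag)
    """
    tagged_sentences = []
    cur_sen = []
    # Read as UTF-8 explicitly so Vietnamese diacritics decode the same way on every platform
    with open(file_path, 'r', encoding='utf-8') as f: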
        for line in f:
            line = line.rstrip()
            if line == '':
                if len(cur_sen) != 0:
                    tagged_sentences.append(cur_sen)
                    cur_sen = []
            else:
                word, tag = line.split()
                cur_sen.append((word, tag))
    if len(cur_sen) != 0:
        tagged_sentences.append(cur_sen)
    return tagged_sentences

train_data = load_data('/kaggle/input/vietnamese-word-segmentation-datasets/train_vietnamese_word_segmentation_2013.txt')
test_data = load_data('/kaggle/input/vietnamese-word-segmentation-datasets/test_vietnamese_word_segmentation_2013.txt')

# Part 1: Building a Word Segmentation model for Vietnamese (70 points)

In this section, you will build a tagging model using HMM or CRF. 

*Hint*: Please refer to [HMM_POS_Tagger.ipynb](https://colab.research.google.com/drive/1lcTncvhlhx8KaJ_oBW6MR7cy470MMpD9#scrollTo=Ty-Qh9Jo23dS) or [CRF_POS_Tagger.ipynb](https://colab.research.google.com/drive/1SuxBmZudn4Tn3w-pBXDoRsuSkBA7RiVV) to understand how to build a tagging model with HMM and CRF.
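To make the HMM option concrete, here is a minimal, library-free sketch of a first-order HMM tagger with Viterbi decoding. It is not the method used in the notebooks linked above, and it is not tuned for this dataset: the names train_hmm and viterbi, the add-alpha smoothing (alpha must be positive), and the use of None as the key for unseen syllables are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

TAGS = ("B-W", "I-W")


def train_hmm(tagged_sentences, alpha=1.0):
    """Estimate smoothed log-probabilities from sentences of (syllable, tag) pairs."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for sentence in tagged_sentences:
        prev_tag = None
        for syllable, tag in sentence:
            if prev_tag is None:
                start[tag] += 1
            else:
                trans[prev_tag][tag] += 1
            emit[tag][syllable] += 1
            vocab.add(syllable)
            prev_tag = tag

    def log_probs(counter, outcomes):
        # Add-alpha smoothing over the given set of outcomes
        outcomes = list(outcomes)
        total = sum(counter.values()) + alpha * len(outcomes)
        return {o: math.log((counter[o] + alpha) / total) for o in outcomes}

    return {
        "start": log_probs(start, TAGS),
        "trans": {p: log_probs(trans[p], TAGS) for p in TAGS},
        # The extra None outcome reserves probability mass for unseen syllables
        "emit": {t: log_probs(emit[t], vocab | {None}) for t in TAGS},
    }


def viterbi(model, syllables):
    """Return the most probable tag sequence for a list of syllables."""
    if not syllables:
        return []

    def emit(tag, syllable):
        table = model["emit"][tag]
        return table.get(syllable, table[None])

    scores = {t: model["start"][t] + emit(t, syllables[0]) for t in TAGS}
    backpointers = []
    for syllable in syllables[1:]:
        new_scores, pointers = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: scores[p] + model["trans"][p][t])
            new_scores[t] = scores[best_prev] + model["trans"][best_prev][t] + emit(t, syllable)
            pointers[t] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag to the start
    tag = max(TAGS, key=scores.get)
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]


toy_data = [
    [("Nam", "B-W"), ("là", "B-W"), ("sinh", "B-W"), ("viên", "I-W")],
    [("sinh", "B-W"), ("viên", "I-W"), ("đại", "B-W"), ("học", "I-W")],
]
toy_model = train_hmm(toy_data)
print(viterbi(toy_model, ["sinh", "viên"]))  # ['B-W', 'I-W']
```

With only two toy sentences the estimates are very rough, so longer inputs can easily be tagged incorrectly; on the real training data, the same functions would be called with the list returned by load_data.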

## Feature

In this section, we are going to implement features in the model.

In [3]:
def word2features(sentence, i):
    """
    Arguments:
        sentence (list): list of words [w1, w2,...,w_n]
        i (int): index of the word
    Return:
        features (dict): dictionary of features
    """
    word = sentence[i]
    features = {
        'is_first': i == 0,
        'is_last': i == len(sentence) - 1,
        'is_first_capital': word[0].isupper(),
        'is_all_caps': int(word.upper() == word),
        'is_all_lower': word.lower() == word,
        'word': word,
        'word.lower()': word.lower(),
        'prefix_1': word[0],
        'prefix_2': word[:2],
        'prefix_3': word[:3],
        'prefix_4': word[:4],
        'suffix_1': word[-1],
        'suffix_2': word[-2:],
        'suffix_3': word[-3:],
        'suffix_4': word[-4:],
        'has_hyphen': '-' in word,
        'is_numeric': word.isdigit(),
        'capitals_inside': word[1:].lower() != word[1:],
        # word unigram, bigram, and trigram
        'word[i-2].lower()': '' if i-2<0 else sentence[i-2].lower(),
        'word[i-1].lower()': '' if i-1<0 else sentence[i-1].lower(),
        'word[i+1].lower()': '' if i+1>=len(sentence) else sentence[i+1].lower(),
        'word[i+2].lower()': '' if i+2>=len(sentence) else sentence[i+2].lower(),

        'word[i-2]': '' if i-2<0 else sentence[i-2],
        'word[i-1]': '' if i-1<0 else sentence[i-1],
        'word[i+1]': '' if i+1>=len(sentence) else sentence[i+1],
        'word[i+2]': '' if i+2>=len(sentence) else sentence[i+2],

        'words[-2,-1]': '' if i-2 < 0 else ' '.join(sentence[i-2:i]),
        'words[-1,0]': '' if i-1 < 0 else ' '.join(sentence[i-1:i+1]),
        'words[0,1]': '' if i+1>=len(sentence) else ' '.join(sentence[i:i+2]),
        'words[1,2]': '' if i+2>=len(sentence) else ' '.join(sentence[i+1:i+3]),
        'words[-2,0]': '' if i-2<0 else ' '.join(sentence[i-2:i+1]),
        # trigram centered on position i: previous, current, and next word
        'words[-1,1]': '' if i-1<0 or i+1>=len(sentence) else ' '.join(sentence[i-1:i+2]),
        'words[0,2]': '' if i+2>=len(sentence) else ' '.join(sentence[i:i+3]),
    }
    
    return features


def sent2features(sentence):
    """
    sentence is a list of words [w1, w2,...,w_n]
    """
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """
    sentence is a list of tuples (word, postag)
    """    
    return [postag for token, postag in sentence]

def untag(sentence):
    """
    sentence is a list of tuples (word, postag)
    """
    return [token for token, _ in sentence]

Let's see how the feature function works.

In [4]:
sent2features(untag(train_data[0]))[0]

{'is_first': True,
 'is_last': False,
 'is_first_capital': True,
 'is_all_caps': 0,
 'is_all_lower': False,
 'word': 'Nam',
 'word.lower()': 'nam',
 'prefix_1': 'N',
 'prefix_2': 'Na',
 'prefix_3': 'Nam',
 'prefix_4': 'Nam',
 'suffix_1': 'm',
 'suffix_2': 'am',
 'suffix_3': 'Nam',
 'suffix_4': 'Nam',
 'has_hyphen': False,
 'is_numeric': False,
 'capitals_inside': False,
 'word[i-2].lower()': '',
 'word[i-1].lower()': '',
 'word[i+1].lower()': 'hồn',
 'word[i+2].lower()': 'nhiên',
 'word[i-2]': '',
 'word[i-1]': '',
 'word[i+1]': 'hồn',
 'word[i+2]': 'nhiên',
 'words[-2,-1]': '',
 'words[-1,0]': '',
 'words[0,1]': 'Nam hồn',
 'words[1,2]': 'hồn nhiên',
 'words[-2,0]': '',
 'words[-1,1]': '',
 'words[0,2]': 'Nam hồn nhiên'}

Now we can extract features from the data

In [5]:
X_train = [sent2features(untag(s)) for s in train_data]
y_train = [sent2labels(s) for s in train_data]

X_test = [sent2features(untag(s)) for s in test_data]
y_test = [sent2labels(s) for s in test_data]

## Training
To see all possible CRF parameters, check the trainer's docstring. Here we use the L-BFGS training algorithm with both L1 and L2 regularization.

In [6]:
%%time
import pycrfsuite

trainer = pycrfsuite.Trainer(algorithm='lbfgs', verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

trainer.train('word_segmenter.crfsuite')

CPU times: user 2min 20s, sys: 925 ms, total: 2min 21s
Wall time: 2min 21s


# Part 2: Model evaluation (30 points)
Evaluation measures:
- P(recision): (Number of words the model segments correctly)/(Number of words in the model's output)
- R(ecall): (Number of words the model segments correctly)/(Number of words in the ground-truth data)
- F1 = 2PR/(P+R)
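As a rough illustration of these measures, the sketch below computes word-level precision, recall, and F1 directly from BI tag sequences. A predicted word counts as correct only if both of its boundaries match a ground-truth word. The helper names tags_to_spans and word_prf are hypothetical, and treating an I-W tag at the start of a sentence as a word start is an assumption of this sketch.

```python
def tags_to_spans(tags):
    """Return the set of (start, end) syllable spans (end exclusive) encoded by BI tags."""
    spans, start = set(), 0
    for i, tag in enumerate(tags):
        if tag == "B-W" and i > 0:
            # A new word starts here, so the previous word ends just before i
            spans.add((start, i))
            start = i
    if tags:
        spans.add((start, len(tags)))
    return spans


def word_prf(gold_sequences, pred_sequences):
    """Word-level precision, recall, and F1 over parallel lists of tag sequences."""
    correct = n_pred = n_gold = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = tags_to_spans(gold), tags_to_spans(pred)
        correct += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Gold: Nam là sinh_viên đại_học ngành kỹ_thuật (6 words)
gold = ["B-W", "B-W", "B-W", "I-W", "B-W", "I-W", "B-W", "B-W", "I-W"]
# Prediction wrongly splits "đại học" into two words (7 words, 5 correct)
pred = ["B-W", "B-W", "B-W", "I-W", "B-W", "B-W", "B-W", "B-W", "I-W"]
p, r, f1 = word_prf([gold], [pred])
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.714 R=0.833 F1=0.769
```

In this example one segmentation error costs two predicted words, since both halves of the wrongly split word are counted as incorrect.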

We first load the model into a tagger and then predict on the test data.

In [7]:
from seqeval.metrics import f1_score, classification_report

tagger = pycrfsuite.Tagger()
tagger.open('word_segmenter.crfsuite')
predicted_tag_sequences = [tagger.tag(xseq) for xseq in X_test]

print(classification_report(y_test, predicted_tag_sequences))

              precision    recall  f1-score   support

           W       0.96      0.97      0.96     62131

   micro avg       0.96      0.97      0.96     62131
   macro avg       0.96      0.97      0.96     62131
weighted avg       0.96      0.97      0.96     62131

