
# Using Conditional Random Fields for Vietnamese Word Segmentation

This notebook presents a solution to programming assignment 4, Vietnamese Word Segmentation, using a Conditional Random Field (CRF) model.




## Dataset

You will train your Vietnamese word segmentation model on the training data in [train.txt](https://www.dl.dropboxusercontent.com/s/reor8jnqedk7svt/train.txt) and evaluate it on the test data in [test.txt](https://www.dl.dropboxusercontent.com/s/zp635cd1zhofm62/test.txt). The data is derived from the VLSP 2013 Word Segmentation dataset.

The training data contains 20000 sentences (sentences are separated by a blank line), and the test data contains 2000 sentences.
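
The loader defined later in this notebook splits each non-empty line into a token and a tag, so the files are presumably laid out as two whitespace-separated columns with a blank line between sentences. The sketch below illustrates that layout on a made-up two-sentence sample. The `B-W`/`I-W` tag names are an assumption inferred from the `W` label in the evaluation report, and `count_sentences_and_tokens` is a hypothetical helper, not part of the assignment.

```python
import io

# Hypothetical sample in the assumed two-column format (syllable, tag).
# B-W marks the first syllable of a word, I-W a continuation syllable.
sample = (
    "Nam\tB-W\n"
    "hồn\tB-W\n"
    "nhiên\tI-W\n"
    "\n"
    "Học\tB-W\n"
    "sinh\tI-W\n"
    "đi\tB-W\n"
    "học\tB-W\n"
)


def count_sentences_and_tokens(lines):
    """Count blank-line-separated sentences and non-empty token lines."""
    n_sentences, n_tokens, in_sentence = 0, 0, False
    for line in lines:
        if line.strip():
            n_tokens += 1
            if not in_sentence:
                # First token after a blank line (or at the start) opens a sentence
                n_sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return n_sentences, n_tokens


print(count_sentences_and_tokens(io.StringIO(sample)))
```

Running the same function on `open('train.txt', encoding='utf-8')` should report 20000 sentences if the file matches the description above.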

You can download both files with the `wget` command.

In [None]:
%%capture

!rm -f train.txt
!wget https://www.dl.dropboxusercontent.com/s/reor8jnqedk7svt/train.txt

!rm -f test.txt
!wget https://www.dl.dropboxusercontent.com/s/zp635cd1zhofm62/test.txt

## Install necessary packages

We will use the following packages:

- [python-crfsuite](https://github.com/scrapinghub/python-crfsuite), a Python binding to CRFsuite.
- seqeval, for sequence tagging evaluation.

In [None]:
%%capture
!pip install -q seqeval[cpu]
!pip install -q python-crfsuite

## Loading data

The following function loads the data into a list of sentences, where each sentence is a list of (word, tag) tuples.

In [None]:
def load_data(file_path):
    """Load data from a file (train.txt or test.txt).

    Return:
        tagged_sentences (list): List of sentences. Each sentence is a list of (word, tag) tuples.
    """
    tagged_sentences = []
    cur_sen = []
    # Read as UTF-8 explicitly so Vietnamese text decodes the same way on every platform
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.rstrip()
            if line == '':
                # A blank line ends the current sentence
                if len(cur_sen) != 0:
                    tagged_sentences.append(cur_sen)
                    cur_sen = []
            else:
                word, tag = line.split()
                cur_sen.append((word, tag))
    # Keep the last sentence if the file does not end with a blank line
    if len(cur_sen) != 0:
        tagged_sentences.append(cur_sen)
    return tagged_sentences

train_data = load_data('train.txt')
test_data = load_data('test.txt')

## Features

In this section, we implement the feature functions used by the model.

In [None]:
def word2features(sentence, i):
    """
    Arguments:
        sentence (list): list of words [w1, w2,...,w_n]
        i (int): index of the word
    Return:
        features (dict): dictionary of features
    """
    word = sentence[i]
    features = {
        'is_first': i == 0,
        'is_last': i == len(sentence) - 1,
        'is_first_capital': word[0].isupper(),
        'is_all_caps': int(word.upper() == word),
        'is_all_lower': word.lower() == word,
        'word': word,
        'word.lower()': word.lower(),
        'prefix_1': word[0],
        'prefix_2': word[:2],
        'prefix_3': word[:3],
        'prefix_4': word[:4],
        'suffix_1': word[-1],
        'suffix_2': word[-2:],
        'suffix_3': word[-3:],
        'suffix_4': word[-4:],
        'has_hyphen': '-' in word,
        'is_numeric': word.isdigit(),
        'capitals_inside': word[1:].lower() != word[1:],
        # word unigram, bigram, and trigram
        'word[i-2].lower()': '' if i-2<0 else sentence[i-2].lower(),
        'word[i-1].lower()': '' if i-1<0 else sentence[i-1].lower(),
        'word[i+1].lower()': '' if i+1>=len(sentence) else sentence[i+1].lower(),
        'word[i+2].lower()': '' if i+2>=len(sentence) else sentence[i+2].lower(),

        'word[i-2]': '' if i-2<0 else sentence[i-2],
        'word[i-1]': '' if i-1<0 else sentence[i-1],
        'word[i+1]': '' if i+1>=len(sentence) else sentence[i+1],
        'word[i+2]': '' if i+2>=len(sentence) else sentence[i+2],

        'words[-2,-1]': '' if i-2 < 0 else ' '.join(sentence[i-2:i]),
        'words[-1,0]': '' if i-1 < 0 else ' '.join(sentence[i-1:i+1]),
        'words[0,1]': '' if i+1>=len(sentence) else ' '.join(sentence[i:i+2]),
        'words[1,2]': '' if i+2>=len(sentence) else ' '.join(sentence[i+1:i+3]),
        'words[-2,0]': '' if i-2<0 else ' '.join(sentence[i-2:i+1]),
        'words[-1,1]': '' if i-1<0 or i+1>=len(sentence) else ' '.join(sentence[i-1:i+2]),
        'words[0,2]': '' if i+2>=len(sentence) else ' '.join(sentence[i:i+3]),
    }
    
    return features


def sent2features(sentence):
    """
    sentence is a list of words [w1, w2,...,w_n]
    """
    return [word2features(sentence, i) for i in range(len(sentence))]


def sent2labels(sentence):
    """
    sentence is a list of tuples (word, tag)
    """
    return [tag for token, tag in sentence]

def untag(sentence):
    """
    sentence is a list of tuples (word, tag)
    """
    return [token for token, _ in sentence]

Let's see how the feature function works.

In [None]:
sent2features(untag(train_data[0]))[0]

{'is_first': True,
 'is_last': False,
 'is_first_capital': True,
 'is_all_caps': 0,
 'is_all_lower': False,
 'word': 'Nam',
 'word.lower()': 'nam',
 'prefix_1': 'N',
 'prefix_2': 'Na',
 'prefix_3': 'Nam',
 'prefix_4': 'Nam',
 'suffix_1': 'm',
 'suffix_2': 'am',
 'suffix_3': 'Nam',
 'suffix_4': 'Nam',
 'has_hyphen': False,
 'is_numeric': False,
 'capitals_inside': False,
 'word[i-2].lower()': '',
 'word[i-1].lower()': '',
 'word[i+1].lower()': 'hồn',
 'word[i+2].lower()': 'nhiên',
 'word[i-2]': '',
 'word[i-1]': '',
 'word[i+1]': 'hồn',
 'word[i+2]': 'nhiên',
 'words[-2,-1]': '',
 'words[-1,0]': '',
 'words[0,1]': 'Nam hồn',
 'words[1,2]': 'hồn nhiên',
 'words[-2,0]': '',
 'words[-1,1]': '',
 'words[0,2]': 'Nam hồn nhiên'}

Now we can extract features from the data.

In [None]:
X_train = [sent2features(untag(s)) for s in train_data]
y_train = [sent2labels(s) for s in train_data]

X_test = [sent2features(untag(s)) for s in test_data]
y_test = [sent2labels(s) for s in test_data]

## Training

To see all available CRF training parameters, check the `pycrfsuite.Trainer` docstring. Here we use the L-BFGS training algorithm with both L1 and L2 regularization.
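
The training cell below passes both `c1` and `c2` to an L-BFGS trainer, so the objective being minimized can be written schematically as follows. Here $w$ denotes the feature weights and $(x^{(k)}, y^{(k)})$ the $N$ training sequences; the exact scaling of the L2 term is an assumption about CRFsuite's convention.

$$
\min_{w} \; -\sum_{k=1}^{N} \log p\left(y^{(k)} \mid x^{(k)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$

With $c_1 > 0$, the L1 term drives many weights to exactly zero, which helps keep the model compact despite the large number of lexical features.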

In [None]:
%%time
import pycrfsuite

trainer = pycrfsuite.Trainer(algorithm='lbfgs', verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

trainer.train('word_segmenter.crfsuite')

CPU times: user 2min 13s, sys: 956 ms, total: 2min 14s
Wall time: 2min 15s


## Evaluation

Evaluation measures:

- P(recision): (number of words the model segments correctly) / (number of words in the model's output)
- R(ecall): (number of words the model segments correctly) / (number of words in the ground-truth data)
- F1 = 2*P*R/(P+R)
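
The measures above can be sketched in plain Python for a single sentence. This is a minimal illustration, assuming that a tag starting with `B` opens a new word and any other tag continues the current one (the `B-W`/`I-W` names are an assumption about the data). `tags_to_spans` and `word_prf` are hypothetical helpers; this is not seqeval's implementation.

```python
def tags_to_spans(tags):
    """Convert B/I tags into a set of (start, end) word spans, end exclusive."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        # A 'B' tag opens a new word; so does a leading non-'B' tag
        if tag.startswith("B") or start is None:
            if start is not None:
                spans.add((start, i))
            start = i
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def word_prf(gold_tags, pred_tags):
    """Word-level precision, recall and F1 for one sentence."""
    gold, pred = tags_to_spans(gold_tags), tags_to_spans(pred_tags)
    # A predicted word is correct only if both of its boundaries match
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = ["B-W", "B-W", "I-W", "B-W"]   # 3 words in the ground truth
pred = ["B-W", "B-W", "B-W", "B-W"]   # 4 words predicted, 2 of them correct
p, r, f1 = word_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```

For a corpus-level score, the correct, predicted, and gold word counts would be summed over all sentences before dividing.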

We first load the trained model into a tagger and then predict tags for the test data.

In [None]:
from seqeval.metrics import f1_score, classification_report

tagger = pycrfsuite.Tagger()
tagger.open('word_segmenter.crfsuite')
predicted_tag_sequences = [tagger.tag(xseq) for xseq in X_test]

print(classification_report(y_test, predicted_tag_sequences))

              precision    recall  f1-score   support

           W       0.96      0.97      0.96     62131

   micro avg       0.96      0.97      0.96     62131
   macro avg       0.96      0.97      0.96     62131
weighted avg       0.96      0.97      0.96     62131
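
To inspect the segmentation itself rather than the scores, the predicted tags can be turned back into words. The sketch below assumes, as above, that a tag starting with `B` opens a new word; `tags_to_words` is a hypothetical helper, and joining syllables with `_` is just one common display convention.

```python
def tags_to_words(syllables, tags):
    """Group syllables into words; a tag starting with 'B' opens a new word."""
    words = []
    for syllable, tag in zip(syllables, tags):
        if tag.startswith("B") or not words:
            words.append([syllable])
        else:
            words[-1].append(syllable)
    # Join the syllables of each word with an underscore
    return ["_".join(w) for w in words]


syllables = ["Nam", "hồn", "nhiên", "trả", "lời"]
tags = ["B-W", "B-W", "I-W", "B-W", "I-W"]  # hypothetical tagger output
print(" ".join(tags_to_words(syllables, tags)))  # Nam hồn_nhiên trả_lời
```

In this notebook it could be called as `tags_to_words(untag(test_data[i]), predicted_tag_sequences[i])` for a sentence index `i`.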

