<h1>Thai segmentation using TCC</h1>
<p>In this Python notebook, I'll demonstrate how to 
convert a string into Thai character clusters (TCCs) and apply a machine learning technique to build a model that predicts whether
two adjacent TCCs should be combined.</p>

<p>First, let's import the libraries and load the dataset.</p>

In [1]:
# -*- coding: utf-8 -*-
#Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from bs4 import BeautifulSoup
import timeit

In [2]:
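import codecs
#Read the whole ORCHID corpus; rows are separated by '#'
with codecs.open('dataset/orchid97-utf8.crp','r','utf-8') as file:
    fileString = file.read()
testArr = fileString.split("#")
#Keep only the rows that contain tagged words
testArr = [row for row in testArr if '/' in row]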

<h2>First, let's take a look at the data</h2>
<p>Each row holds a sentence followed by its words, and each word is tagged with its part of speech.</p>

In [3]:
testArr[0:5]

['1\nการประชุมทางวิชาการ ครั้งที่ 1//\nการ/FIXN\nประชุม/VACT\nทาง/NCMN\nวิชาการ/NCMN\n<space>/PUNC\nครั้ง/CFQC\nที่ 1/DONM\n//\n',
 '2\nโครงการวิจัยและพัฒนาอิเล็กทรอนิกส์และคอมพิวเตอร์//\nโครงการวิจัยและพัฒนา/NCMN\nอิเล็กทรอนิกส์/NCMN\nและ/JCRG\nคอมพิวเตอร์/NCMN\n//\n',
 '3\nปีงบประมาณ 2531//\nปีงบประมาณ/NCMN\n<space>/PUNC\n2531/NCNM\n//\n',
 '4\nเล่ม 1//\nเล่ม/CNIT\n<space>/PUNC\n1/DONM\n//\n',
 '1\nศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ//\nศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ/NPRP\n//\n']
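
<p>To make the row layout concrete, here is a minimal sketch that parses one row into its number, its sentence and its (word, tag) pairs. The helper name <code>parse_row</code> is hypothetical and not part of the notebook; the notebook's own parsing below drops the tags and keeps only the words.</p>

```python
def parse_row(row):
    """Split one ORCHID row into (number, sentence, [(word, tag), ...]).

    Hypothetical helper for illustration only.
    """
    # Everything before the first '//' is the row number and the raw sentence
    header, _, body = row.partition('//')
    number, _, sentence = header.partition('\n')
    tokens = []
    for line in body.split('\n'):
        # Skip blank lines and the closing '//' marker
        if line in ('', '//'):
            continue
        # The tag follows the last '/', so split from the right
        word, _, tag = line.rpartition('/')
        tokens.append((word, tag))
    return number.strip(), sentence, tokens


# Third row of the sample output above
sample_row = '3\nปีงบประมาณ 2531//\nปีงบประมาณ/NCMN\n<space>/PUNC\n2531/NCNM\n//\n'
print(parse_row(sample_row))
```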

<h3>Let's convert this into Python lists of sentences and words so that it is easier to work with.</h3>

In [4]:
#Convert string of each row to arrays.
sentences = []
wordsList = []
for row in testArr:
    parts = row.split('//')
    if("\n" in parts[0] and len(parts)>1):
        sentence = parts[0].split("\n")[1]
        sentence = sentence.replace(" ","")
        sentences.append(sentence)
        
        partsArr = parts[1].split("\n")
        partsArr = [p.split("/")[0] for p in partsArr]
        partsArr = [p for p in partsArr if p!='<space>']
        wordsList.append(partsArr[1:-1])

In [5]:
print("Sentences example : ",end='')
print(sentences[0:2])
print("Words List example : ",end='')
print(wordsList[0:2])

Sentences example : ['การประชุมทางวิชาการครั้งที่1', 'โครงการวิจัยและพัฒนาอิเล็กทรอนิกส์และคอมพิวเตอร์']
Words List example : [['การ', 'ประชุม', 'ทาง', 'วิชาการ', 'ครั้ง', 'ที่ 1'], ['โครงการวิจัยและพัฒนา', 'อิเล็กทรอนิกส์', 'และ', 'คอมพิวเตอร์']]


<h2>More on TCC</h2>
<p>In Thai, some contiguous characters form an
inseparable unit called a Thai character cluster (TCC). Unlike word
segmentation, which is a very difficult task, segmenting a text into
TCCs is easily done by applying a set of rules. This method
needs no dictionary, and because a TCC never spans a word boundary,
every word boundary is also a TCC boundary.</p>

<p>Read more at : http://wing.comp.nus.edu.sg/~antho/H/H01/H01-1057.pdf</p>

In [6]:
def string2TCC(sentence):
    consonants = 'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮ'
    front_vowel = 'เแโใไ'
    word = ''
    words = []
    #A new cluster starts at a front vowel, at a space, or at a consonant that
    #does not directly follow a front vowel or mai han-akat ('ั')
    for i in range(len(sentence)):
        s = sentence[i]
        #Treat the start of the string as a space so that sentence[-1] is never consulted
        prev = sentence[i-1] if i>0 else ' '
        if(s in front_vowel or (s in consonants and (prev not in front_vowel and prev != 'ั')) or s==' '):
            words.append(word)
            word = ''
        if(s!=' '):
            word += s
    words.append(word)
    #Drop the empty clusters left by a leading break or by consecutive spaces
    return [w for w in words if w]
"|".join(string2TCC('การเก็บภาษีประเทศไทยและประเทศ'))

'กา|ร|เก็|บ|ภา|ษี|ป|ระ|เท|ศ|ไท|ย|และ|ป|ระ|เท|ศ'
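
<p>The claim that every word boundary is also a TCC boundary can be checked on a small example. The sketch below is only an illustration: the word list comes from the first corpus row, the cluster list is written out by hand following the cluster rules, and <code>boundary_offsets</code> is a hypothetical helper.</p>

```python
from itertools import accumulate


def boundary_offsets(units):
    """Return the set of character offsets at which each unit ends."""
    return set(accumulate(len(u) for u in units))


# Words from the first corpus row (without the trailing "ครั้งที่ 1")
words = ['การ', 'ประชุม', 'ทาง', 'วิชาการ']
# The same text cut into character clusters by hand
tccs = ['กา', 'ร', 'ป', 'ระ', 'ชุ', 'ม', 'ทา', 'ง', 'วิ', 'ชา', 'กา', 'ร']

# Both lists must spell out the same string for the comparison to make sense
same_text = "".join(words) == "".join(tccs)
# Every offset where a word ends should also be an offset where a cluster ends
words_end_on_clusters = boundary_offsets(words) <= boundary_offsets(tccs)
print(same_text, words_end_on_clusters)
```

<p>If the subset check holds, word segmentation reduces to deciding which neighbouring clusters to merge, which is the problem the rest of this notebook tackles.</p>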

<h2>New dataset construction</h2>
<p>Now we will construct a dataset consisting of 4 features and 1 label.</p>
<p>The 4 features are preWord, word1, word2 and sufWord, each in TCC form.</p>
<p>The label says whether word1 and word2 should be combined into one word.</p>
<p>With this, we can use a machine learning technique to build a model that predicts, for any window of 4 consecutive TCCs, whether the middle two should be combined.</p>
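
<p>As a rough illustration of the windowing idea, the sketch below slides a window of 4 clusters over a sentence and labels the gap between the middle two. Note that it is not the notebook's implementation: it labels by comparing character offsets with the word boundaries, whereas the function in the next cell uses substring tests. The name <code>window_labels</code> is hypothetical.</p>

```python
from itertools import accumulate


def window_labels(tccs, words):
    """Return (preWord, word1, word2, sufWord, shouldCombine) tuples.

    Alternative offset-based labelling, shown for illustration only.
    """
    # Character offsets where each gold-standard word ends
    word_ends = set(accumulate(len(w) for w in words))
    # Character offset where each cluster ends
    tcc_ends = list(accumulate(len(t) for t in tccs))
    rows = []
    for j in range(len(tccs) - 3):
        # Offset of the gap between word1 (tccs[j+1]) and word2 (tccs[j+2])
        gap = tcc_ends[j + 1]
        # 0 = a word boundary falls in the gap, 1 = the two clusters belong together
        label = 0 if gap in word_ends else 1
        rows.append((tccs[j], tccs[j + 1], tccs[j + 2], tccs[j + 3], label))
    return rows


words = ['การ', 'ประชุม', 'ทาง', 'วิชาการ']
tccs = ['กา', 'ร', 'ป', 'ระ', 'ชุ', 'ม', 'ทา', 'ง', 'วิ', 'ชา', 'กา', 'ร']
for row in window_labels(tccs, words):
    print(row)
```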

In [7]:
def construct_data_sample(sentences,wordsList):
#    TCC = string2TCC(sentence)
    features_1 = []
    features_2 = []
    features_3 = []
    features_4 = []
    labels = []

    for i,sentence in enumerate(sentences):
        TCCS = string2TCC(sentence)
        for j,TCC in enumerate(TCCS[:-3]):
            next1 = TCCS[j+1]
            next2 = TCCS[j+2]
            next3 = TCCS[j+3]
            #Decide whether next1 and next2 should be combined.
            #They are split (label 0) only if TCC+next1 lies inside one word, next2+next3 lies inside
            #the following word, and next1+next2 appears in neither of those two words.
            firstHalf = TCC+next1
            secondHalf = next2+next3
            words = wordsList[i]
            features_1.append(TCC)
            features_2.append(next1)
            features_3.append(next2)
            features_4.append(next3)
            labelValue = 1
            for k,word in enumerate(words[:-1]):
                nextWord = words[k+1]
                if (firstHalf in word) and (secondHalf in nextWord) and (next1+next2 not in word) and (next1+next2 not in words[k+1]):
                    labelValue = 0
                    break
            labels.append(labelValue)        
    TCC_corpus = pd.DataFrame(
        {'preWord':features_1,'word1':features_2,'word2':features_3,'sufWord':features_4,'shouldCombine':labels},
        columns=['preWord','word1','word2','sufWord','shouldCombine'])
    return TCC_corpus
df = construct_data_sample(sentences,wordsList)
#Save file in case we want to use it later.
df.to_csv('dataset/TCCs_corpus.csv',index=False ,encoding="utf-8")

<h2>New dataset obtained</h2>
<p>Let's take a closer look at the newly constructed dataset.</p>

In [8]:
df.head()

Unnamed: 0,preWord,word1,word2,sufWord,shouldCombine
0,กา,ร,ป,ระ,0
1,ร,ป,ระ,ชุ,1
2,ป,ระ,ชุ,ม,1
3,ระ,ชุ,ม,ทา,1
4,ชุ,ม,ทา,ง,0


In [9]:
df.shape

(612226, 5)

<h2>Data preprocessing (again)</h2>
<p>Now let's do some more preprocessing on this new dataset.</p>
<p>For now, let's extract the features proposed by [1].</p>

In [10]:
consonants = 'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮ'
upper_vowel = '็   ิ ี ึ ํ'
lower_vowel = 'ุู'
front_vowel = 'เแโใไ'
rear_vowel = 'าําๅๆะฯๅๆ'
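#Final-consonant sound classes
kok = ['ก', 'ข' ,'ค', 'ฆ']  # "kok" class (แม่กก)
kod = ['จ', 'ด', 'ต', 'ถ', 'ท', 'ธ', 'ฎ', 'ฏ', 'ฑ', 'ฒ', 'ช', 'ซ', 'ศ', 'ษ' ,'ส']  # "kod" class (แม่กด)
kob = ['บ','ป' ,'พ', 'ภ', 'ฟ']  # "kob" class (แม่กบ)
kon = ['น', 'ณ', 'ญ', 'ร' ,'ล','ฬ']  # "kon" class (แม่กน)
kong = ['ง']  # "kong" class (แม่กง)
kom = ['ม ']  # "kom" class (แม่กม); note the trailing space: a bare 'ม' never matches and falls through to rc = 9
keiy = ['ย']  # "koei" class (แม่เกย)
keve = ['ว']  # "koew" class (แม่เกอว)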
feature_names=  ['fv','fc','mv','mc','rv','rc','length','spaceEnter']
wordTypeName= ['preWord_','word1_','word2_','sufWords_']
middleConsonantGroup = [kok,kod,kob,kon,kong,kom,keiy,keve]

In [11]:
#Extract features from a given TCC
def TCC2features(TCC):    
    fv = 0
    fc = 0
    mv = 0
    mc = 0
    rv = 0
    for i,s in enumerate(TCC):
        #Front Vowel
        if(s in front_vowel):
            fv = 1
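        #Check for front consonant. Note that `and` binds tighter than `or` here:
        #'ห' always gives fc = 2, while 'อ' does so only at the start of the cluster or after a non-consonant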
        if(s=='ห' or s=='อ' and (TCC[i-1] not in consonants or i-1<0)):
            fc = 2
        elif(s in consonants):
            fc = 1
        #Check for Middle Vowel
        if(s in upper_vowel):
            mv = 1
        elif(s in lower_vowel):
            mv = 2
        #Check for Middle Consonant
        if(s in consonants and TCC[i-1] in consonants and i-1>0):
            mc = 1
        #Check for rear_vowel
        if(s=='ะ'):
            rv = 1
        elif(s=='า' or s=='ำ'):
            rv = 2
        #Check rear consonant
        #end without consonant
    lastChar = TCC[-1]
    rc = 9
    if(lastChar not in consonants): rc= 0 
    for j,group in enumerate(middleConsonantGroup):
        if lastChar in group:
            rc = j+1
    length = len(TCC)
    spaceEnter = 0 #Placeholder; may change in the future
    
    return [fv,fc,mv,mc,rv,rc,length,spaceEnter]
TCC2features('กา')

[0, 1, 0, 0, 2, 0, 2, 0]
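
<p>The rear-consonant feature (rc) is essentially a lookup from the last character of a cluster to its final-consonant class. A minimal sketch of that lookup as a dictionary is shown below. The names <code>FINAL_GROUPS</code>, <code>RC_LOOKUP</code> and <code>rear_consonant_class</code> are hypothetical, and the sketch assumes the sixth class is a bare 'ม' with no trailing space, so it returns 6 for 'ม' where the outputs in this notebook show 9.</p>

```python
CONSONANTS = 'กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮ'

# One string per final-consonant class, in the same order as middleConsonantGroup
FINAL_GROUPS = ['กขคฆ', 'จดตถทธฎฏฑฒชซศษส', 'บปพภฟ', 'นณญรลฬ', 'ง', 'ม', 'ย', 'ว']

# Map every consonant to the 1-based index of its class
RC_LOOKUP = {ch: idx
             for idx, group in enumerate(FINAL_GROUPS, start=1)
             for ch in group}


def rear_consonant_class(tcc):
    """Return 0 if the cluster does not end in a consonant, 1-8 for a known class, else 9."""
    last = tcc[-1]
    if last not in CONSONANTS:
        return 0
    return RC_LOOKUP.get(last, 9)


for tcc in ['กา', 'ง', 'ร', 'ม', 'ห']:
    print(tcc, rear_consonant_class(tcc))
```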

In [12]:
def get_features(row):
    preWordFeatures = TCC2features(row['preWord'])
    word1Features = TCC2features(row['word1'])
    word2Features = TCC2features(row['word2'])
    sufWordsFeatures = TCC2features(row['sufWord'])
    for k,features in enumerate([preWordFeatures,word1Features,word2Features,sufWordsFeatures]):
        feature_prefix = wordTypeName[k]
        for i in range(8):
            row[feature_prefix+feature_names[i]] = features[i]
    return row
df.head().apply(get_features,axis=1)

Unnamed: 0,preWord,word1,word2,sufWord,shouldCombine,preWord_fv,preWord_fc,preWord_mv,preWord_mc,preWord_rv,...,word2_length,word2_spaceEnter,sufWords_fv,sufWords_fc,sufWords_mv,sufWords_mc,sufWords_rv,sufWords_rc,sufWords_length,sufWords_spaceEnter
0,กา,ร,ป,ระ,0,0,1,0,0,2,...,1,0,0,1,0,0,1,0,2,0
1,ร,ป,ระ,ชุ,1,0,1,0,0,0,...,2,0,0,1,2,0,0,0,2,0
2,ป,ระ,ชุ,ม,1,0,1,0,0,0,...,2,0,0,1,0,0,0,9,1,0
3,ระ,ชุ,ม,ทา,1,0,1,0,0,1,...,1,0,0,1,0,0,2,0,2,0
4,ชุ,ม,ทา,ง,0,0,1,2,0,0,...,2,0,0,1,0,0,0,5,1,0


<h2>Do this with the rest</h2>
<p>Now we'll apply this to the whole dataset (612226 rows, according to df.shape above).</p>
<p>This might take some time.</p>
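
<p>One possible way to cut the running time, sketched here as an assumption rather than something this notebook does: the same clusters repeat many times across rows, so the per-cluster feature function can be memoized with <code>functools.lru_cache</code>. The function <code>cached_features</code> below is a hypothetical stand-in for TCC2features and computes only two toy features.</p>

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_features(tcc):
    # Hypothetical stand-in for TCC2features: (length, starts with a front vowel)
    # A tuple is returned so that the cached value cannot be mutated by callers
    return (len(tcc), int(tcc[0] in 'เแโใไ'))


# Twelve clusters, but only eight distinct ones
tccs = ['กา', 'ร', 'ป', 'ระ', 'ชุ', 'ม', 'ทา', 'ง', 'กา', 'ร', 'ป', 'ระ']
features = [cached_features(t) for t in tccs]

info = cached_features.cache_info()
# misses = number of distinct clusters actually computed, hits = calls served from the cache
print(info.misses, info.hits)
```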

In [13]:
#Feature extraction is slow, so load the precomputed CSV; uncomment the apply line to recompute.
start = timeit.default_timer()
#features_dataset =  df.apply(get_features,axis=1)
features_dataset = pd.read_csv('dataset/TCC_Features.csv',encoding = "utf-8")
stop = timeit.default_timer()
print("Execution time : "+str(round(stop - start,2))+"s.")


Execution time : 2.85s.


<h2>Let's see how well an ANN can predict on this dataset.</h2>
<p>First, set up X and y for training the ANN.</p>

In [14]:
#Use only the first 10k rows because the full dataset might take too long.
features_dataset = features_dataset.head(10000)
features = ['preWord_fv','preWord_fc', 'preWord_mv', 'preWord_mc', 'preWord_rv', 'preWord_rc',
       'preWord_length',  'word1_fv', 'word1_fc',
       'word1_mv', 'word1_mc', 'word1_rv', 'word1_rc', 'word1_length',
        'word2_fv', 'word2_fc', 'word2_mv', 'word2_mc',
       'word2_rv', 'word2_rc', 'word2_length', 
       'sufWords_fv', 'sufWords_fc', 'sufWords_mv', 'sufWords_mc',
       'sufWords_rv', 'sufWords_rc', 'sufWords_length']
target = ['shouldCombine']
X = features_dataset[features].values
y = features_dataset[target].values

In [15]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

<h2>Now let's look at the accuracy the ANN can produce</h2>
<p>This might take even longer than extracting the features.</p>

The real training was done in Spyder;
it is too much computation for Jupyter.
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(output_dim=16,init='uniform',activation='relu',input_dim=X_train.shape[1]))  #28 feature columns in this notebook
    classifier.add(Dense(output_dim=16,init='uniform',activation='relu'))    
    classifier.add(Dense(output_dim=1,init='uniform',activation='sigmoid'))
    classifier.compile(optimizer= 'adam',loss='binary_crossentropy',metrics=['accuracy']) #categorical_crossentropy
    return classifier
classifier = KerasClassifier(build_fn = build_classifier, batch_size=10,nb_epoch=100)
accuracies = cross_val_score(estimator= classifier, X = X_train,y= y_train, cv =10)
mean = accuracies.mean()
std = accuracies.std()

In [16]:
from keras.models import load_model
classifier = load_model('tcc_model.h5')

Using TensorFlow backend.


In [17]:
X_test.shape

(2000, 28)

In [18]:
y_pred = classifier.predict(X_test)
y_pred = (y_pred>0.5)

In [19]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

from sklearn.metrics import f1_score
f1score = f1_score(y_test,y_pred)

In [28]:
print("Accuracy : "+str((cm[0][0]+cm[1][1])*100/(cm[0][0]+cm[1][1]+cm[1][0]+cm[0][1]))+"%")

Accuracy : 89.05%


In [29]:
print("F1 score : "+str(f1score))

F1 score : 0.938706968934
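
<p>To show how the two numbers relate, here is a small sketch that computes accuracy and F1 directly from the four cells of a binary confusion matrix (scikit-learn lays them out as cm[0][0] = TN, cm[0][1] = FP, cm[1][0] = FN, cm[1][1] = TP). The counts are illustrative: they were picked to reproduce the figures printed above, but the actual matrix is not shown in this notebook, and the split between false positives and false negatives in particular is an assumption.</p>

```python
def accuracy_and_f1(tn, fp, fn, tp):
    """Return (accuracy, F1) for the positive class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Illustrative counts for 2000 test rows (assumed, not taken from the notebook)
accuracy, f1 = accuracy_and_f1(tn=104, fp=150, fn=69, tp=1677)
print("Accuracy : " + str(accuracy * 100) + "%")
print("F1 score : " + str(f1))
```

<p>F1 ignores true negatives, so when most samples are positive ("should combine"), it can sit well above the accuracy even if the rare negative class is predicted poorly.</p>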


# Part 4 - Evaluating
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from keras.models import Sequential
from keras.layers import Dense
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(output_dim=6,init='uniform',activation='relu',input_dim=11))    
    classifier.add(Dense(output_dim=6,init='uniform',activation='relu'))    
    classifier.add(Dense(output_dim=1,init='uniform',activation='sigmoid'))
    classifier.compile(optimizer= 'adam',loss='binary_crossentropy',metrics=['accuracy']) #categorical_crossentropy
    return classifier
classifier = KerasClassifier(build_fn = build_classifier, batch_size=10,nb_epoch=100)
accuracies = cross_val_score(estimator= classifier, X = X_train,y= y_train, cv =10)
mean = accuracies.mean()
std = accuracies.std()

In [30]:
sample = features_dataset.sample(30)
sample_X = sample[features].values
sample_y = sample[target].values
sample_pred = classifier.predict(sample_X)

In [31]:
from sklearn.metrics import confusion_matrix, f1_score

def evaluate_pred(y,y_pred):
    #y holds the true labels, y_pred the predicted probabilities
    y_pred = (y_pred>0.5)
    # Making the Confusion Matrix
    cm = confusion_matrix(y, y_pred)
    f1score = f1_score(y,y_pred)
    print("Accuracy : "+str((cm[0][0]+cm[1][1])*100/(cm[0][0]+cm[1][1]+cm[1][0]+cm[0][1]))+"%")
    print("F1 score : "+str(f1score))

In [32]:
#Score the held-out test set; for the 30-row sample, call evaluate_pred(sample_y,sample_pred) instead
evaluate_pred(y_test,classifier.predict(X_test))

Accuracy : 89.05%
F1 score : 0.938706968934


In [33]:
sample['shouldCombinePred'] = sample_pred

In [34]:
sample[['preWord','word1','word2','sufWord', 'shouldCombine','shouldCombinePred']]

Unnamed: 0,preWord,word1,word2,sufWord,shouldCombine,shouldCombinePred
7154,ใน,กา,ร,ป,1,0.975347
9076,ร,ร,ว,ม,1,0.95851
5903,ร,แป,ล,เอ,1,0.996851
8780,ด,เป็,น,สา,1,0.998248
9957,บ,กา,ร,เจื,1,0.999829
9961,อ,สา,ร,นั้,1,0.974557
9982,ว,คื,อ,ผู้,1,0.907214
8714,พาะ,ใน,บ,ริ,1,0.942895
5363,ระ,ดิ,ษ,ฐ์(Artificialintel<LI>gence),1,1.0
4667,ผู้,ป,ระ,ก,1,0.704274


<h2>New ANN model obtained!</h2>
<p>Now we have an ANN that can predict whether 2 TCCs should be combined, with an accuracy of about 89%.</p>
<p>Let's try using the classifier to tokenize Thai text.</p>

In [65]:
sentences = ['อยากไปเรียนวิทยาศาสตร์ที่ไทย','ไปหาเพื่อนที่สยาม','โครงการศึกษาที่สาม']
def segment(sentence,classifier=classifier):
    #Step 1 : Change string to TCC
    TCCs = string2TCC(sentence)
    #Step 2 : construct unlabeled dataset 
    preWord = []
    word1 = []
    word2 = []
    sufWord = []
    for i in range(len(TCCs)-3):
        [w1,w2,w3,w4] = TCCs[i:i+4] 
        preWord.append(w1)
        word1.append(w2)
        word2.append(w3)
        sufWord.append(w4)
    df = pd.DataFrame({"preWord":preWord,"word1":word1,"word2":word2,"sufWord":sufWord} ,columns=['preWord','word1','word2','sufWord'])
    #Step 3 : Extract features from dataset
    feature_df = df.apply(get_features,axis=1)
    X = feature_df[features].values
    #Step 4 predict using ANN
    y_pred = classifier.predict(X)
    df['shouldCombine'] = y_pred
    #Step 5 : Segment word by predicted result
    wordsList = []
    word=TCCs[0]+TCCs[1]
    for i in range(len(y_pred)):
        preWord = TCCs[i]
        word1 = TCCs[i+1]
        word2 = TCCs[i+2]
        sufWord = TCCs[i+3]
        if(y_pred[i]<0.60):
            #Cut the word here
            wordsList.append(word)
            word = ''
        word+=word2 
    word +=TCCs[-1]
    wordsList.append(word)
    return wordsList
for sentence in sentences:
    print(segment(sentence))



['อยากไปเรียน', 'วิทยาศาสตร์ที่ไทย']
['ไปหาเพื่อน', 'ที่สยาม']
['โครงการ', 'ศึกษาที่สาม']
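
<p>Step 5 can be looked at in isolation. The sketch below reproduces the merging logic without the classifier: probs[i] is the predicted probability that tccs[i+1] and tccs[i+2] belong together, so the first and last gaps of a sentence have no prediction and are always merged. The name <code>merge_tccs</code> and the probabilities in the example are hypothetical.</p>

```python
def merge_tccs(tccs, probs, threshold=0.6):
    """Merge clusters into words; cut wherever the combine probability is below threshold."""
    # With fewer than 4 clusters there is no window to predict on, so keep them together
    if len(tccs) < 4:
        return ["".join(tccs)] if tccs else []
    words = []
    # The gap between the first two clusters is never predicted, so they start merged
    current = tccs[0] + tccs[1]
    for i, p in enumerate(probs):
        if p < threshold:
            # Low probability of combining: close the current word here
            words.append(current)
            current = ''
        current += tccs[i + 2]
    # The last cluster is never the "word2" of any window, so append it at the end
    current += tccs[-1]
    words.append(current)
    return words


tccs = ['กา', 'ร', 'ป', 'ระ', 'ชุ', 'ม', 'ทา', 'ง']
# Made-up probabilities, one per window of 4 clusters (len(tccs) - 3 of them)
probs = [0.1, 0.9, 0.9, 0.9, 0.2]
print(merge_tccs(tccs, probs))
```

<p>Because the first and last gaps are never cut, a word of a single cluster at either end of a sentence cannot be recovered with this scheme.</p>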


<h2>Conclusion</h2>
<p>With this approach, we can segment Thai words using a deep learning model.</p>
<p>With more text to train on, the model might become more accurate.</p>

<h3>References</h3>
<ul>
<li>[1] Thanaruk Theeramunkong, "Non-dictionary-based Thai word segmentation using decision trees": http://globe2.thaicyberu.go.th/node/1915763</li>
</ul>