# CRF-Cut: Sentence Segmentation
---
This notebook combines three datasets (TED, Orchid, and fake review) to train a single model, then validates that model on each dataset separately.

The table below reports the F1-score of the `E` (sentence-end) label for CRF-Cut, by training and validation dataset:

| dataset_train              | dataset_validate | E_f1-score |
|----------------------------|------------------|------------|
| Ted                        | Ted              | 0.72       |
| Orchid                     | Orchid           | 0.77       |
| Fake review                | Fake review      | 0.97       |
| Ted + Orchid + Fake review | Ted              | 0.72       |
| Ted + Orchid + Fake review | Orchid           | 0.69       |
| Ted + Orchid + Fake review | Fake review      | 0.97       |

We sample 25% of each dataset for training and validation because the full data does not fit in memory.
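
The `E_f1-score` column appears to be the word-level F1-score of the `E` label, the tag given to the last token of each sentence. As a rough illustration of that metric, the sketch below counts true positives, false positives, and false negatives for one label over flat tag lists. `label_f1` and the toy tags are hypothetical, introduced only for this example; the notebook itself relies on scikit-learn's `classification_report`.

```python
def label_f1(y_true, y_pred, label="E"):
    """Precision, recall and F1 for a single label over flat tag sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy tags, not taken from any of the three datasets.
toy_true = ["I", "I", "E", "I", "E", "I", "I", "E"]
toy_pred = ["I", "E", "E", "I", "I", "I", "I", "E"]
toy_precision, toy_recall, toy_f1 = label_f1(toy_true, toy_pred)
```

Because `I` tokens vastly outnumber `E` tokens, overall accuracy stays high even when many sentence ends are missed, which is presumably why the table tracks the `E` label alone.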

In [1]:
# !git clone https://github.com/PyThaiNLP/pythainlp.git

Cloning into 'pythainlp'...
remote: Enumerating objects: 18158, done.
remote: Counting objects: 100% (1356/1356), done.
remote: Compressing objects: 100% (751/751), done.
remote: Total 18158 (delta 840), reused 1062 (delta 604), pack-reused 16802
Receiving objects: 100% (18158/18158), 47.05 MiB | 23.83 MiB/s, done.
Resolving deltas: 100% (12302/12302), done.


In [2]:
# cd pythainlp

/content/pythainlp


In [None]:
# !pip install .[full]

In [1]:
# Colab setup: install python-crfsuite and download the datasets
!pip install python-crfsuite
!mkdir models data
!wget -P data https://raw.githubusercontent.com/vistec-AI/ted_crawler/master/data/orchid_corpus/orchid97.crp.utf
!wget -P data https://github.com/vistec-AI/ted_crawler/raw/master/data/checkpoint/ted_fake.zip
!cd data; unzip ted_fake.zip

You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
--2021-07-21 15:25:24--  https://raw.githubusercontent.com/vistec-AI/ted_crawler/master/data/orchid_corpus/orchid97.crp.utf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11173797 (11M) [text/plain]
Saving to: ‘data/orchid97.crp.utf’


2021-07-21 15:25:26 (9.56 MB/s) - ‘data/orchid97.crp.utf’ saved [11173797/11173797]

--2021-07-21 15:25:27--  https://github.com/vistec-AI/ted_crawler/raw/master/data/checkpoint/ted_fake.zip
Resolving github.com (github.com)... 13.229.188.59
Connecting to github.com (github.com)|13.229.188.59|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/vistec-AI/crfcut/raw/master/

In [2]:
#adapted from @bact at https://colab.research.google.com/drive/1hdtmwTXHLrqNmDhDqHnTQGpDVy1aJc4t
import json
import pandas as pd
import numpy as np
import re
import pycrfsuite
import warnings
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from pythainlp.tokenize import word_tokenize
from pythainlp.tag import pos_tag
from ast import literal_eval
from tqdm import tqdm
pd.set_option('display.max_rows', 10)
warnings.filterwarnings('ignore')

In [3]:
orchid = pd.read_csv('data/orchid97.crp.utf', sep='\t', header=None)
orchid.columns = ['text']
# drop metadata lines, which start with '%' or '#'
orchid['first_char'] = orchid.text.map(lambda x: x[0])
orchid = orchid[(orchid.first_char != '%') & (orchid.first_char != '#')][['text']]
# split each "word/POS" line into a word and a POS tag
orchid['word'] = orchid.text.map(lambda x: x.split('/')[0])
orchid['word'] = orchid.word.map(lambda x: ' ' if x in ('<space>', '') else x)
orchid['pos'] = orchid.text.map(lambda x: x.split('/')[1] if len(x.split('/')) == 2 else None)
# label sentence-end markers ('//') as 'E' and every other token as 'I'
orchid['lab'] = orchid.apply(lambda row: 'E' if row['text'] == '//' else 'I', axis=1)
# keep sentence ends plus the lines that parsed into a word/POS pair
orchid = orchid[(orchid.lab == 'E') | (~orchid.pos.isna())].reset_index(drop=True)
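
To make the parsing rules in the cell above concrete, here is a minimal pure-Python sketch of the same filtering and labeling logic without pandas. `parse_orchid_lines` and the sample lines are hypothetical: the lines only imitate the `word/POS` layout that the cell assumes, with `//` marking a sentence end.

```python
def parse_orchid_lines(lines):
    """Turn raw corpus lines into (word, label) pairs, mirroring the pandas cell."""
    pairs = []
    for text in lines:
        # Skip empty lines and metadata lines starting with '%' or '#'.
        if not text or text[0] in "%#":
            continue
        parts = text.split("/")
        word = " " if parts[0] in ("<space>", "") else parts[0]
        pos = parts[1] if len(parts) == 2 else None
        label = "E" if text == "//" else "I"
        # Keep sentence ends and lines that parsed into exactly word/POS.
        if label == "E" or pos is not None:
            pairs.append((word, label))
    return pairs


# Hypothetical lines written in the style the cell expects.
sample_lines = ["#1", "%Title: example", "การ/FIXN", "<space>/PUNC", "ทดสอบ/VACT", "//"]
sample_pairs = parse_orchid_lines(sample_lines)
```

Under these rules the `//` marker itself ends up as a space token labeled `E`, so in the Orchid data every sentence end is represented by a space.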

In [4]:
%%time
ted_all_sentences = np.load('data/ted-all-sentences.npy') 
fake_review_all_sentences = np.load('data/fake-review-all-sentences.npy') 

CPU times: user 0 ns, sys: 487 ms, total: 487 ms
Wall time: 486 ms


In [5]:
ted_all_sentences

array(['บนหลังอาชาแห้งกร่องดอนกิโฆเต้ พระเอกของเราบุกตะลุยสู้กับกองทัพยักษ์ |ในสายตาของเขา มันเป็นหน้าที่ของเขาที่จะปราบอสูรร้ายเหล่านี้ในนามแห่งหญิงอันเป็นที่รักของเขา ดุลสิเนอา |ทว่า การกระทำอันหาญกล้านี้ก็สูญเปล่า |เมื่อซานโซ่ ปันซ่า ผู้รับใช้ของเขาอธิบายครั้งแล้วครั้งเล่า ว่าสิ่งเหล่านี้จะเป็นยักษ์ก็หาไม่พวกมันเป็นเพียงกังหันลมเท่านั้น |ดอนกิโฆเต้ หาได้เสียความแน่วแน่แทงทวนของเขาเข้าไปยังใบพัดอย่างจัง |ด้วยพลังใจที่ไม่เคยถดถอยอัศวินผู้นั้นยืนขึ้นอย่างภาคภูมิและยิ่งเชื่อมั่นในปฏิบัติการของเขามากขึ้น |ลำดับเหตุการณ์นี้ครอบคลุมเรื่องราวส่วนใหญ่ของดอนกิโฆเต้ ที่เป็นที่รักมหากาพย์ ไร้ตรรกะ และมีชีวิตชีวาของ อลองโซ กีฆานาผู้กลายเป็น ดอนกิโฆเต้ แห่งลามันช่าผู้ซุ่มซ่ามแต่กล้าหาญหรือที่รู้จักกันในนาม ขุนนางต่ำศักดิ์นักฝัน |แต่เดิมวรรณกรรมนี้มีสองเล่มบรรยายเรื่องราวของดอนกิโฆเต้ในขณะที่เขาเดินทางผ่านตอนกลาง และตอนเหนือของสเปนเพื่อต่อสู้กับบรรดาปิศาจร้าย |แม้จินตนาการในเรื่อง ดอนกิโฆเต้ จะสูงล้ำเหนือเมฆผู้ประพันธ์ มิเกล เด เซร์บันเตสก็ไม่เคยนึกฝันว่าหนังสือของเขาจะกลายเป็นนิยายที่ขายดีที่สุดต

In [6]:
fake_review_all_sentences[:10]

array(['การเขียนไหลเวียนได้ดีตลอดทั้งเล่ม |อย่างไรก็ตามมีบางส่วนที่รู้สึกไม่สมจริง ',
       'รายการนี้ยอดเยี่ยมมาก |มันทำให้เสื้อสุนัขแห้ง |แต่มันก็รั่วไหล ',
       'ฉันไม่รู้ว่ามันเป็นชุดของเรื่องสั้น |สองคนแรกนั้นดีมาก แต่มันก็จบลงอย่างกะทันหัน ',
       'ฉันใช้ Windows 8 และจะไม่เชื่อมต่อกับโทรศัพท์ของฉัน |เสียเวลากับมัน ... คุณสามารถพูดอะไรได้อีก! ',
       'มันเยี่ยมมาก |รักมัน |ความคุ้มครองประเภทนี้ควรใช้งานได้ดีสำหรับคุณ |ขอบคุณมาก. ',
       'ไม่ทำงาน. |ไม่คุ้มที่จะลองใช้ |โฆษณาที่ทำให้เข้าใจผิดมาก |พวกเขาต้องการการติดฉลากสินค้าจากผู้ขายรายอื่นให้ดีขึ้น ',
       'ฉันสนุกกับการดูหนังเรื่องนี้ก่อนที่จะได้เห็นงานของ David Lynch ทั้งหมด |ซาวด์แทร็กเพลงที่สร้างขึ้นโดย John Williams ในช่วงเวลานี้เป็นเพลงโปรดของฉันเสมอ ',
       'ฉันรักรสชาติของกาแฟนี้ |อย่างไรก็ตามคุณไม่สามารถรับได้ในร้านค้าในพื้นที่ของฉัน |แย่มากเพราะฉันต้องเปลี่ยนยี่ห้อ ',
       'นี่คือการซื้อเป็นของขวัญดังนั้นหลังจากรับด้วยไม่มีปัญหา |จากนั้นรับอีกอันหนึ่ง |เสียงที่ยอดเยี่ยมอีกครั้ง! |ลูกสาวของฉันต้องการเธอ ',

In [7]:
# Sample 25% of each of the 3 datasets
np.random.seed(42)
ratio = .25 # sample ratio
ted_sample = np.random.choice(ted_all_sentences, int(len(ted_all_sentences) * ratio))
# Orchid is one continuous token sequence, so take the first 25% to keep word order
orchid_sample = orchid.iloc[:int(len(orchid) * ratio)]
# hold out the last 39,632 fake reviews as the test pool
fake_review_train, fake_review_test = fake_review_all_sentences[:-39632], fake_review_all_sentences[-39632:]
# the train sample size is 25% of the full dataset, drawn from the train pool only
fake_review_sample = np.random.choice(fake_review_train, int(len(fake_review_all_sentences) * ratio))
fake_review_test_sample = np.random.choice(fake_review_test, int(len(fake_review_test) * ratio))

In [8]:
print(f"Length of TED talk (talk): {len(ted_sample)}")
print(f"Length of orchid (word): {len(orchid_sample)}")
print(f"Length of fake review train (review): {len(fake_review_sample)}")
print(f"Length of fake review test (review): {len(fake_review_test_sample)}")

Length of TED talk (talk): 385
Length of orchid (word): 91453
Length of fake review train (review): 49540
Length of fake review test (review): 9908
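
One detail of the sampling cell is worth keeping in mind: `np.random.choice` draws with replacement unless `replace=False` is passed, so a sample may contain the same talk or review more than once. The sketch below illustrates the difference with the standard library only; `population` is a hypothetical stand-in for a dataset, and `random.choices` / `random.sample` play the roles of the two NumPy modes.

```python
import random

rng = random.Random(42)
population = [f"doc_{i}" for i in range(20)]  # hypothetical stand-in for a dataset
k = int(len(population) * 0.25)               # 25% sample, as in the notebook

# With replacement: the same item can be drawn several times.
with_replacement = rng.choices(population, k=k)

# Without replacement: every drawn item is distinct.
without_replacement = rng.sample(population, k)

n_unique_with = len(set(with_replacement))
n_unique_without = len(set(without_replacement))
```

Whether duplicates matter depends on the use: if a duplicated item lands in both the training and the test split, the test score can look better than it should.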


In [9]:
def assign_word_lab(all_sentences):
    """Tokenize each '|'-delimited document and label every token:
    'E' for the last token of a sentence, 'I' otherwise."""
    all_tuples = []
    for doc in tqdm(all_sentences, total=len(all_sentences)):
        tuples = []
        for s in doc.split('|'):
            s_lst = word_tokenize(s)
            for j, word in enumerate(s_lst):
                lab = 'E' if j == len(s_lst) - 1 else 'I'
                tuples.append((word, lab))
        all_tuples.append(tuples)
    return all_tuples
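
As a small illustration of the labeling scheme, the sketch below applies the same rule to one toy string. It assumes a plain whitespace split as a stand-in for `word_tokenize`, so it does not reproduce Thai tokenization; `label_words` and the toy text are hypothetical.

```python
def label_words(text, tokenize=str.split, sep="|"):
    """Label the last token of each `sep`-delimited sentence 'E', all others 'I'."""
    pairs = []
    for sentence in text.split(sep):
        tokens = tokenize(sentence)
        for j, token in enumerate(tokens):
            pairs.append((token, "E" if j == len(tokens) - 1 else "I"))
    return pairs


# Toy input using the same '|' sentence separator as the TED and fake review data.
toy_pairs = label_words("the cat sat |on the mat |quietly")
```

Each `|`-delimited segment contributes exactly one `E` tag, so the number of `E` labels in a document equals its number of non-empty sentences.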

In [10]:
%%time
ted_all_tuples = assign_word_lab(ted_sample)
orchid_all_tuples = [(row['word'],row['lab']) for i,row in orchid_sample.iterrows()]
fake_review_all_tuples = assign_word_lab(fake_review_sample)
fake_review_test_tuples = assign_word_lab(fake_review_test_sample)

100%|██████████| 385/385 [00:18<00:00, 20.59it/s]
100%|██████████| 49540/49540 [01:24<00:00, 584.57it/s]
100%|██████████| 9908/9908 [00:14<00:00, 706.87it/s]

CPU times: user 2min 11s, sys: 2.46 s, total: 2min 14s
Wall time: 2min 13s





In [11]:
enders = ["ครับ","ค่ะ","คะ","นะคะ","นะ","จ้ะ","จ้า","จ๋า","ฮะ", #ending honorifics
          #enders
          "ๆ","ได้","แล้ว","ด้วย","เลย","มาก","น้อย","กัน","เช่นกัน","เท่านั้น",
          "อยู่","ลง","ขึ้น","มา","ไป","ไว้","เอง","อีก","ใหม่","จริงๆ",
          "บ้าง","หมด","ทีเดียว","เดียว",
          #demonstratives
          "นั้น","นี้","เหล่านี้","เหล่านั้น",
          #questions
          "อย่างไร","ยังไง","หรือไม่","มั้ย","ไหน","อะไร","ทำไม","เมื่อไหร่"]
starters = ["ผม","ฉัน","ดิฉัน","ชั้น","คุณ","มัน","เขา","เค้า",
            "เธอ","เรา","พวกเรา","พวกเขา", #pronouns
            #connectors
            "และ","หรือ","แต่","เมื่อ","ถ้า","ใน",
            "ด้วย","เพราะ","เนื่องจาก","ซึ่ง","ไม่",
            "ตอนนี้","ทีนี้","ดังนั้น","เพราะฉะนั้น","ฉะนั้น",
            "ตั้งแต่","ในที่สุด",
            #demonstratives
            "นั้น","นี้","เหล่านี้","เหล่านั้น"]

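def extract_features(doc, window=2, max_n_gram=3):
    """Build per-token CRF features: a bias term plus word, ender and starter
    n-grams (up to max_n_gram) within `window` tokens of each position."""
    doc_features = []
    # pad both ends so that every token has a full window of neighbours
    doc = ['xxpad' for i in range(window)] + doc + ['xxpad' for i in range(window)]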
    doc_ender = []
    doc_starter = []
    #add enders
    for i in range(len(doc)):
        if doc[i] in enders:
            doc_ender.append('ender')
        else:
            doc_ender.append('normal')
    #add starters
    for i in range(len(doc)):
        if doc[i] in starters:
            doc_starter.append('starter')
        else:
            doc_starter.append('normal')
    #for each word
    for i in range(window, len(doc)-window):
        #bias term
        word_features = ['bias'] 
        
        #ngram features
        for n_gram in range(1, min(max_n_gram+1,2+window*2)):
            for j in range(i-window,i+window+2-n_gram):
                feature_position = f'{n_gram}_{j-i}_{j-i+n_gram}'
                word_ = f'{"|".join(doc[j:(j+n_gram)])}'
                word_features += [f'word_{feature_position}={word_}']
                ender_ =  f'{"|".join(doc_ender[j:(j+n_gram)])}'
                word_features += [f'ender_{feature_position}={ender_}']
                starter_ =  f'{"|".join(doc_starter[j:(j+n_gram)])}'
                word_features += [f'starter_{feature_position}={starter_}']
        
        #append to feature per word
        doc_features.append(word_features)
    return doc_features
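
The evaluation cells further down read the current word back from each feature list with index `7`. The sketch below replays only the naming loop of `extract_features` to show where that index comes from, assuming the defaults `window=2` and `max_n_gram=3`. `feature_names` is a hypothetical helper written for this illustration, not part of the notebook.

```python
def feature_names(window=2, max_n_gram=3):
    """List feature-name prefixes in the order extract_features emits them."""
    names = ["bias"]
    for n_gram in range(1, min(max_n_gram + 1, 2 + window * 2)):
        # offset is the start of the n-gram relative to the current token
        for offset in range(-window, window + 2 - n_gram):
            position = f"{n_gram}_{offset}_{offset + n_gram}"
            names += [
                f"word_{position}",
                f"ender_{position}",
                f"starter_{position}",
            ]
    return names


names = feature_names()
current_word_index = names.index("word_1_0_1")  # unigram covering the current token
n_features_per_token = len(names)
```

With these defaults each token gets 37 features: the bias term, 15 unigram features, 12 bigram features, and 9 trigram features. The unigram for the current token sits at index 7, after the bias and the two preceding unigram triples.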

In [12]:
%%time
# ted
# target
ted_y = []
for t in tqdm(ted_all_tuples, total=len(ted_all_tuples)):
    temp = []
    for (w, l) in t:
        temp.append(l)
    ted_y.append(temp)

# features
ted_x_pre = []
for t in tqdm(ted_all_tuples, total=len(ted_all_tuples)):
    temp = []
    for (w, l) in t:
        temp.append(w)
    ted_x_pre.append(temp)
ted_x = []
for x_ in tqdm(ted_x_pre, total=len(ted_x_pre)):
    ted_x.append(extract_features(x_, window=2, max_n_gram = 3))

100%|██████████| 385/385 [00:00<00:00, 1485.29it/s]
100%|██████████| 385/385 [00:00<00:00, 1581.82it/s]
100%|██████████| 385/385 [00:46<00:00,  8.36it/s]

CPU times: user 43.1 s, sys: 3.53 s, total: 46.6 s
Wall time: 46.5 s





In [13]:
%%time
# orchid
# target
orchid_y = []
for (w, l) in tqdm(orchid_all_tuples, total=len(orchid_all_tuples)):
    orchid_y.append(l)
# features
orchid_x_pre = []
for (w, l) in tqdm(orchid_all_tuples, total=len(orchid_all_tuples)):
    orchid_x_pre.append(w)
orchid_x = extract_features(orchid_x_pre, window=2, max_n_gram = 3) 

100%|██████████| 91453/91453 [00:00<00:00, 928733.23it/s]
100%|██████████| 91453/91453 [00:00<00:00, 979189.15it/s]


CPU times: user 4.19 s, sys: 248 ms, total: 4.43 s
Wall time: 4.43 s


In [14]:
# fake review
# Test
# target
fake_review_test_y = []
for t in tqdm(fake_review_test_tuples, total=len(fake_review_test_tuples)):
    temp = []
    for (w, l) in t:
        temp.append(l)
    fake_review_test_y.append(temp)

# features
fake_review_test_x_pre = []
for t in tqdm(fake_review_test_tuples, total=len(fake_review_test_tuples)):
    temp = []
    for (w, l) in t:
        temp.append(w)
    fake_review_test_x_pre.append(temp)
fake_review_test_x = []
for x_ in tqdm(fake_review_test_x_pre, total=len(fake_review_test_x_pre)):
    fake_review_test_x.append(extract_features(x_, window=2, max_n_gram = 3))
    
# Train
# target
fake_review_y = []
for t in tqdm(fake_review_all_tuples, total=len(fake_review_all_tuples)):
    temp = []
    for (w, l) in t:
        temp.append(l)
    fake_review_y.append(temp)

# features
fake_review_x_pre = []
for t in tqdm(fake_review_all_tuples, total=len(fake_review_all_tuples)):
    temp = []
    for (w, l) in t:
        temp.append(w)
    fake_review_x_pre.append(temp)
fake_review_x = []
for x_ in tqdm(fake_review_x_pre, total=len(fake_review_x_pre)):
    fake_review_x.append(extract_features(x_, window=2, max_n_gram = 3))

100%|██████████| 9908/9908 [00:00<00:00, 41833.14it/s]
100%|██████████| 9908/9908 [00:00<00:00, 45552.58it/s]
100%|██████████| 9908/9908 [00:40<00:00, 244.66it/s]
100%|██████████| 49540/49540 [00:01<00:00, 25354.64it/s]
100%|██████████| 49540/49540 [00:01<00:00, 25599.33it/s]
100%|██████████| 49540/49540 [03:53<00:00, 212.51it/s]


In [15]:
# Split TED and Orchid 80/20 into train and test sets; fake review keeps its own train/test pools
ted_x_train, ted_x_test, ted_y_train, ted_y_test = train_test_split(ted_x, ted_y, test_size=0.2, random_state=1412)
idx = int(len(orchid_x)*0.8)
orchid_x_train, orchid_x_test = orchid_x[:idx], orchid_x[idx:]
orchid_y_train, orchid_y_test = orchid_y[:idx], orchid_y[idx:]
# fake_review_x_train, fake_review_x_test, fake_review_y_train, fake_review_y_test \
#     = train_test_split(fake_review_x, fake_review_y, test_size=0.2, random_state=1412)
fake_review_x_train, fake_review_x_test = fake_review_x, fake_review_test_x
fake_review_y_train, fake_review_y_test = fake_review_y, fake_review_test_y

In [16]:
%%time
# Train model
trainer = pycrfsuite.Trainer(verbose=True)

for xseq, yseq in tqdm(zip(ted_x_train, ted_y_train), total=len(ted_y_train)):
    trainer.append(xseq, yseq)
    
trainer.append(orchid_x_train, orchid_y_train)

for xseq, yseq in tqdm(zip(fake_review_x_train, fake_review_y_train), total=len(fake_review_y_train)):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 1,
    'c2': 0,
    'max_iterations': 1000,
    'feature.possible_transitions': True,
})

trainer.train('models/datasets-crf.model')

100%|██████████| 308/308 [00:52<00:00,  5.91it/s]
100%|██████████| 49540/49540 [04:17<00:00, 192.41it/s]


Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 6870888
Seconds required: 61.107

L-BFGS optimization
c1: 1.000000
c2: 0.000000
num_memories: 6
max_iterations: 1000
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 1717751.157588
Feature norm: 1.000000
Error norm: 1747502.947940
Active features: 1341844
Line search trials: 1
Line search step: 0.000000
Seconds required for this iteration: 39.806

***** Iteration #2 *****
Loss: 1500762.871080
Feature norm: 0.875255
Error norm: 1676350.089757
Active features: 1020801
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 18.958

***** Iteration #3 *****
Loss: 1395806.804113
Feature norm: 0.393197
Error norm: 4720174.754368
Active features: 394478
Line search trials: 3
Line search step: 0.2500

In [17]:
# ted
# Predict (using test set)
tagger = pycrfsuite.Tagger()
tagger.open('models/datasets-crf.model')
# y_pred = [tagger.tag(xseq) for xseq in x_test]
y_pred = []
for xseq in tqdm(ted_x_test, total=len(ted_x_test)):
    y_pred.append(tagger.tag(xseq))

# Evaluate at word-level
labels = {'E': 0, "I": 1} # map tags to integers; their sorted order (0, 1) matches target_names below
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in ted_y_test for tag in row])

print("Validate TED dataset")
print(classification_report(
    truths, predictions,
    target_names=["E", "I"]))

100%|██████████| 77/77 [00:05<00:00, 15.09it/s]


Validate TED dataset
              precision    recall  f1-score   support

           E       0.66      0.77      0.71      6990
           I       0.99      0.98      0.99    155172

    accuracy                           0.97    162162
   macro avg       0.82      0.88      0.85    162162
weighted avg       0.98      0.97      0.97    162162



In [18]:
results = []
for i in range(len(ted_y_test)):
    s=0
    for j in range(len(ted_y_test[i])):
        results.append({'sentence_idx':f'{str(i).zfill(3)}_{str(s).zfill(3)}',
                        'word':ted_x_test[i][j][7].split('=')[1],
                        'y':ted_y_test[i][j],
                        'pred':y_pred[i][j]})
        if ted_y_test[i][j]=='E': s+=1
result_df = pd.DataFrame(results)[['sentence_idx','word','y','pred']]
result_df['wrong_flag'] = result_df.apply(lambda row: 0 if row.y==row.pred else 1,1)

#space correct
space_df = result_df.copy()
space_df = space_df[space_df.word==' ']
print(f"Error space correct: {space_df.wrong_flag.mean()} from shape: {space_df.shape}")
print(f"Accuracy space correct: {1 - space_df.wrong_flag.mean():.2f}")

Error space correct: 0.21750450630883236 from shape: (19972, 5)
Accuracy space correct: 0.78
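
The cell above restricts the evaluation to tokens that are a single space, presumably because spaces are the typical candidates for sentence boundaries in Thai text. A minimal sketch of that metric without pandas follows; `space_accuracy` and the toy sequences are hypothetical and serve only to show the computation.

```python
def space_accuracy(words, y_true, y_pred):
    """Share of space tokens whose predicted tag equals the true tag."""
    hits = [t == p for w, t, p in zip(words, y_true, y_pred) if w == " "]
    return sum(hits) / len(hits) if hits else float("nan")


# Toy data: three space tokens, one of which is a missed sentence end.
toy_words = ["a", " ", "b", " ", "c", " "]
toy_true = ["I", "I", "I", "E", "I", "E"]
toy_pred = ["I", "I", "I", "I", "I", "E"]
toy_acc = space_accuracy(toy_words, toy_true, toy_pred)
```

Unlike the overall word-level accuracy, this number ignores the many non-space tokens that are almost always `I`, so it is a stricter view of how well boundaries are placed.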


In [19]:
# orchid
# Predict (using test set)
tagger = pycrfsuite.Tagger()
tagger.open('models/datasets-crf.model')
y_pred = tagger.tag(orchid_x_test)

# Evaluate at word-level
labels = {'E': 0, "I": 1} # map tags to integers; their sorted order (0, 1) matches target_names below
# Orchid is a single flat sequence, so y_pred and orchid_y_test are plain lists of tags
predictions = np.array([labels[tag] for tag in y_pred])
truths = np.array([labels[tag] for tag in orchid_y_test])

print("Validate orchid dataset")
print(classification_report(
    truths, predictions,
    target_names=["E", "I"]))

Validate orchid dataset
              precision    recall  f1-score   support

           E       0.73      0.66      0.69      1179
           I       0.98      0.98      0.98     17112

    accuracy                           0.96     18291
   macro avg       0.85      0.82      0.84     18291
weighted avg       0.96      0.96      0.96     18291



In [20]:
results = []
for i in range(len(orchid_y_test)):
    results.append({'word':orchid_x_test[i][7].split('=')[1],
                    'y':orchid_y_test[i],
                    'pred':y_pred[i]})
result_df = pd.DataFrame(results)[['word','y','pred']]
result_df['wrong_flag'] = result_df.apply(lambda row: 0 if row.y==row.pred else 1,1)

#space correct
space_df = result_df.copy()
space_df = space_df[space_df.word==' ']
print(f"Error space correct: {space_df.wrong_flag.mean()} from shape: {space_df.shape}")
print(f"Accuracy space correct: {1 - space_df.wrong_flag.mean():.2f}")

Error space correct: 0.1805809997382884 from shape: (3821, 4)
Accuracy space correct: 0.82


In [21]:
# fake review
# Predict (using test set)
tagger = pycrfsuite.Tagger()
tagger.open('models/datasets-crf.model')
# y_pred = [tagger.tag(xseq) for xseq in x_test]
y_pred = []
for xseq in tqdm(fake_review_x_test, total=len(fake_review_x_test)):
    y_pred.append(tagger.tag(xseq))

# Evaluate at word-level
labels = {'E': 0, "I": 1} # map tags to integers; their sorted order (0, 1) matches target_names below
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in fake_review_y_test for tag in row])

print("Validate fake review dataset")
print(classification_report(
    truths, predictions,
    target_names=["E", "I"]))

100%|██████████| 9908/9908 [00:18<00:00, 528.03it/s]


Validate fake review dataset
              precision    recall  f1-score   support

           E       0.98      0.95      0.96     43254
           I       1.00      1.00      1.00    568931

    accuracy                           1.00    612185
   macro avg       0.99      0.98      0.98    612185
weighted avg       1.00      1.00      1.00    612185



In [22]:
results = []
for i in range(len(fake_review_y_test)):
    s=0
    for j in range(len(fake_review_y_test[i])):
        results.append({'sentence_idx':f'{str(i).zfill(3)}_{str(s).zfill(3)}',
                        'word':fake_review_x_test[i][j][7].split('=')[1],
                        'y':fake_review_y_test[i][j],
                        'pred':y_pred[i][j]})
        if fake_review_y_test[i][j]=='E': s+=1
result_df = pd.DataFrame(results)[['sentence_idx','word','y','pred']]
result_df['wrong_flag'] = result_df.apply(lambda row: 0 if row.y==row.pred else 1,1)

#space correct
space_df = result_df.copy()
space_df = space_df[space_df.word==' ']
print(f"Error space correct: {space_df.wrong_flag.mean()} from shape: {space_df.shape}")
print(f"Accuracy space correct: {1 - space_df.wrong_flag.mean():.2f}")

Error space correct: 0.03782579862110387 from shape: (79919, 5)
Accuracy space correct: 0.96
