# Explore Datasets

We start with the OpenSubtitles dataset and explore the following aspects of it:
0. Drop all duplicates.
1. How many characters are in each line, on average, and in total for the dataset (`th` and `zh`)?
2. How many words are in each line, on average, and in total for the dataset (`th` tokenized with `pythainlp.tokenize`; check your pythainlp version)?
3. How many words are in each line, on average, and in total for the dataset (try the `zh` tokenizers [jieba](https://github.com/fxsjy/jieba), [pkuseg](https://github.com/explosion/spacy-pkuseg/blob/master/readme/readme_english.md), or any others you find interesting)?
4. What is the zh-to-th word ratio for each line and on average for the dataset? For example, `(我吃飯, ฉันกินข้าว)` has 3 `zh` words and 3 `th` words, so the ratio is $3/3=1$.
5. What is the similarity score for each sentence pair, and the average for the dataset, using the [multilingual universal sentence encoder](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3)?
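
Items 0 and 1 need nothing beyond the standard library. The sketch below is one possible way to do them, assuming that a duplicate means a repeated (zh, th) pair and that a "character" is a Python code point. The helper names `drop_duplicate_pairs` and `length_stats` and the toy data are made up for this illustration.

```python
from statistics import mean


def drop_duplicate_pairs(zh_lines, th_lines):
    """Drop repeated (zh, th) pairs, keeping first occurrences in order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so the two
    # sides stay aligned after deduplication.
    unique_pairs = list(dict.fromkeys(zip(zh_lines, th_lines)))
    zh_unique = [zh for zh, _ in unique_pairs]
    th_unique = [th for _, th in unique_pairs]
    return zh_unique, th_unique


def length_stats(lines):
    """Per-line, average, and total character counts (code points)."""
    per_line = [len(line) for line in lines]
    return {"per_line": per_line, "average": mean(per_line), "total": sum(per_line)}


# Toy data: the third pair repeats the first.
zh_demo = ["为什么?", "真的?", "为什么?"]
th_demo = ["ทำไมล่ะ?", "จริงเหรอ?", "ทำไมล่ะ?"]

zh_unique, th_unique = drop_duplicate_pairs(zh_demo, th_demo)
zh_stats = length_stats(zh_unique)
th_stats = length_stats(th_unique)
print(zh_stats)
print(th_stats)
```

Note that Thai vowel and tone marks are separate code points, so `len` counts them individually; whether that is the right notion of "character" for this analysis is a judgment call.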

In [33]:
# # Read the full dataset.
# with open("../data/OpenSubtitles/OpenSubtitles.th-zh_cn.th", "r", encoding="utf-8") as f:
#     th_lines = [line.rstrip("\n") for line in f]
# with open("../data/OpenSubtitles/OpenSubtitles.th-zh_cn.zh_cn", "r", encoding="utf-8") as f:
#     zh_lines = [line.rstrip("\n") for line in f]

In [42]:
# Read the sample dataset.
# rstrip("\n") removes only the line break; slicing with [:-1] would cut a
# real character if the last line of the file has no trailing newline.
with open("../data/OpenSubtitles/OpenSubtitles_sample.th", "r", encoding="utf-8") as f:
    th_lines = [line.rstrip("\n") for line in f]
with open("../data/OpenSubtitles/OpenSubtitles_sample.zh_cn", "r", encoding="utf-8") as f:
    zh_lines = [line.rstrip("\n") for line in f]

In [43]:
len(th_lines)

100

In [44]:
th_lines[:10]

['คุณจำตอนพิธีล้างบาปของคุณได้ เป็นไปได้ยังไง?',
 'เป็นไปไม่ได้รึไง? แต่มันจริงนะ',
 'คุณได้ยินผู้ใหญ่เขาคุยกันรึเปล่า?',
 'ฉันรู้สึกได้ถึงแสงอาทิตย์ลอดผ่านกระจกเข้ามา',
 'ฉันยังจำเสียงหัวใจเต้นของพ่อได้',
 'ไม่ใช่ฉันที่จำได้ แต่เป็นความทรงจำของฉันต่างหาก',
 'แต่คุณไม่ใช่ คาธอลิกแล้วนี่?',
 'เขาปล่อยให้คนอื่นผ่านไป',
 'ทำไมอยู่ดีๆถึงพูดเรื่องพิธีล้างบาปขึ้นมาล่ะ?',
 'ฉันนึกถึงมันบ่อยๆ บางทีฉันก็จำได้']

In [45]:
len(zh_lines)

100

In [46]:
zh_lines[:10]

['记得自己的洗礼仪式 这可能吗?',
 '不可能?',
 '可那是事实啊 是听大人们说的吧?',
 '我能感受到透过玻璃的阳光',
 '我还记得爸爸的心跳声呢',
 '真的不是听来的 是记忆里的',
 '你也不是信天主教的吧',
 '改新教也有洗礼这种仪式',
 '为什么忽然提起洗礼仪式?',
 '最近想起来的 偶尔会想起']
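
With both sides loaded, the zh-to-th word ratio (item 4) reduces to counting tokens on each side. The sketch below takes the tokenizers as arguments so that it runs without either library installed. `word_ratio` and the lookup-table tokenizers are hypothetical stand-ins written for this example, and the second segmentation is hand-written rather than real tokenizer output. On real data you would presumably pass something like `jieba.lcut` and `pythainlp.tokenize.word_tokenize` instead.

```python
from statistics import mean


def word_ratio(zh_line, th_line, zh_tokenize, th_tokenize):
    """zh-to-th word ratio for one pair; None when the th side has no tokens."""
    zh_count = len(zh_tokenize(zh_line))
    th_count = len(th_tokenize(th_line))
    return zh_count / th_count if th_count else None


# Stand-in tokenizers: hard-coded segmentations for the demo sentences only.
ZH_SEGMENTS = {"我吃飯": ["我", "吃", "飯"], "為什麼?": ["為什麼", "?"]}
TH_SEGMENTS = {"ฉันกินข้าว": ["ฉัน", "กิน", "ข้าว"], "ทำไมล่ะ?": ["ทำไม", "ล่ะ", "?"]}

pairs = [("我吃飯", "ฉันกินข้าว"), ("為什麼?", "ทำไมล่ะ?")]
ratios = [
    word_ratio(zh, th, ZH_SEGMENTS.__getitem__, TH_SEGMENTS.__getitem__)
    for zh, th in pairs
]

# Average over the pairs where a ratio is defined.
defined = [r for r in ratios if r is not None]
average_ratio = mean(defined)
print(ratios, average_ratio)
```

Passing the tokenizer in makes it easy to compare jieba against pkuseg on the same lines: only the `zh_tokenize` argument changes.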

## [multilingual universal sentence encoder](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3)

The code below is from [thai2nmt](https://github.com/vistec-AI/thai2nmt_preprocess/blob/master/clean_subdataset.py).

In [47]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # imported only to register the ops the multilingual encoder needs

In [48]:
def get_similar_score(lang1: list, lang2: list, batch_size: int, embed) -> list:
    """Return one similarity score per aligned pair (lang1[i], lang2[i])."""
    scores = []

    # Walk both lists in parallel, batch_size pairs at a time; the final
    # batch may be shorter than batch_size.
    for start in range(0, len(lang1), batch_size):
        end = start + batch_size

        lang1_embedding = embed(lang1[start:end])
        lang2_embedding = embed(lang2[start:end])

        # Dot products between every sentence in one batch and every
        # sentence in the other (a similarity matrix, not a distance).
        similarity_matrix = tf.matmul(
            lang1_embedding, lang2_embedding, transpose_b=True).numpy()

        # Only the diagonal compares sentence j with its own translation.
        for j in range(len(similarity_matrix)):
            scores.append(similarity_matrix[j][j])

    return scores
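
Only the diagonal of each batch matrix is kept, so every score is just the dot product of one sentence embedding with the embedding of its aligned translation. The NumPy sketch below uses made-up 2-D vectors (not real encoder output) to show that equivalence. It assumes unit-length vectors, which is the condition under which a dot product equals cosine similarity.

```python
import numpy as np

# Made-up unit vectors standing in for sentence embeddings (2 pairs, 2 dims).
zh_emb = np.array([[1.0, 0.0], [0.6, 0.8]])
th_emb = np.array([[0.8, 0.6], [0.6, 0.8]])

# What the function does: full pairwise matrix, then keep the diagonal.
diagonal_scores = np.diag(zh_emb @ th_emb.T)

# Equivalent without the off-diagonal work: one dot product per aligned row.
rowwise_scores = np.sum(zh_emb * th_emb, axis=1)

print(diagonal_scores)
print(rowwise_scores)
```

The row-wise form does work proportional to the batch size rather than its square, which may matter for large batches on the full dataset.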

In [49]:
zhs = [
    '我吃食物',
    '她喜歡看電視',
    '她為什麼喜歡吃電視',
    '我吃食物',
    '她喜歡看電視',
    '她為什麼喜歡吃電視'
]
ths = [
    'ฉันกินอาหาร',
    'เธอชอบดูทีวี',
    'ทำไมเธอถึงชอบกินทีวี',
    'ฉันไม่ชอบกินอาหาร',
    'เธอเกลียดทีวี',
    'ทำไมทีวีกินเธอเข้าไป',
]

emb = hub.load('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')


In [16]:
get_similar_score(zhs, ths, 16, emb)


[0.88183826, 0.6799599, 0.732453, 0.6549038, 0.57960916, 0.7019044]

In [50]:
sims = get_similar_score(zh_lines, th_lines, 16, emb)


In [53]:
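import pandas as pd

# Sort ascending so the least similar (likely misaligned) pairs come first.
# TODO: find a good filtering threshold using the full dataset.
sim_df = pd.DataFrame(
    {'zh': zh_lines, 'th': th_lines, 'similarity_score': sims}
).sort_values('similarity_score')
sim_df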

| | zh | th | similarity_score |
|---|---|---|---|
| 7 | 改新教也有洗礼这种仪式 | เขาปล่อยให้คนอื่นผ่านไป | 0.073779 |
| 22 | 真是小气 | คุณคิดเอาไว้ก่อนอยู่แล้วจริงๆ | 0.086167 |
| 84 | 例如 | การจ่ายค่าชดเชย | 0.094630 |
| 73 | 就放在抽屉的深处 | ฉันก็เลยเก็บให้พ้นตา | 0.148213 |
| 56 | 你下来坐 | มานั่งตรงนี้สิ หลังผมจะเย็นพอดี | 0.154625 |
| ... | ... | ... | ... |
| 35 | 2万1000 不知道有没有500的 | 20000, 1000 ฉันไม่รู้ว่าจะมี 500 วอนรึเปล่า | 0.741247 |
| 90 | 这个为什么会在厨房里 | ทำไมนี่ถึงอยู่ในครัวน่ะ | 0.754683 |
| 68 | 真的? | จริงเหรอ? | 0.793694 |
| 12 | 为什么? | ทำไมล่ะ? | 0.845884 |
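
Once scores exist, choosing a threshold is largely a matter of seeing how many pairs each candidate cutoff keeps. The sketch below uses only the nine scores visible in the output above, so the fractions are not representative of the whole sample. The candidate thresholds and the `kept_fraction` helper are arbitrary choices for illustration, not values recommended by the source.

```python
def kept_fraction(scores, threshold):
    """Fraction of pairs whose similarity score is at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


# The nine scores shown in the sorted output (lowest five, highest four).
visible_scores = [
    0.073779, 0.086167, 0.094630, 0.148213, 0.154625,
    0.741247, 0.754683, 0.793694, 0.845884,
]

# Hypothetical candidate cutoffs to compare.
candidates = [0.1, 0.2, 0.5]
report = {t: kept_fraction(visible_scores, t) for t in candidates}

for threshold, fraction in report.items():
    print(f"threshold {threshold:.1f}: keeps {fraction:.0%} of pairs")
```

On the full dataset the same loop could be run over `sims` directly, alongside a manual look at the pairs just above and below each cutoff.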
