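### Data Cleaning  
Process the sentence data to remove parts that are not needed (that do not affect the content), so the model is exposed to less noise during training.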

#### Import packages

In [2]:
import sys
import pdb
import pprint
import logging
import os
import random

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import data
import numpy as np
import tqdm.auto as tqdm
from pathlib import Path
from argparse import Namespace
#from fairseq import utils

import matplotlib.pyplot as plt

#### Fix the random seed

In [3]:
seed = 33
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
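
As a quick sanity check of what seeding provides, the sketch below reseeds Python's `random` module and confirms that the same sequence comes back. It uses only the standard library; `draw` is a hypothetical helper, and the same idea is assumed to carry over to the NumPy and PyTorch generators seeded above.

```python
import random

def draw(seed, n=5):
    """Reseed the global generator and draw n integers (hypothetical helper)."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]

first = draw(33)
second = draw(33)

# Same seed, same sequence: this is what makes the later shuffle repeatable.
print(first == second)
```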

#### Set data paths

In [9]:
data_dir = './data'

src_lang = 'en'
tgt_lang = 'zh'

data_prefix = f'{data_dir}/raw'
test_prefix = f'{data_dir}/test'

In [10]:
{data_prefix+'.'+src_lang}

{'./data/raw.en'}

#### File contents

In [14]:
#!head {data_prefix+'.'+src_lang} -n 5
#!head {data_prefix+'.'+tgt_lang} -n 5

# Print the first five lines of each file; read both as UTF-8 explicitly
with open(f'{data_prefix}.{src_lang}', 'r', encoding="utf-8") as f:
    for i in range(5):
        print(f.readline())

with open(f'{data_prefix}.{tgt_lang}', 'r', encoding="utf-8") as f:
    for i in range(5):
        print(f.readline())

Thank you so much, Chris.

And it's truly a great honor to have the opportunity to come to this stage twice; I'm extremely grateful.

I have been blown away by this conference, and I want to thank all of you for the many nice comments about what I had to say the other night.

And I say that sincerely, partly because I need that.

Put yourselves in my position.

非常謝謝你，克里斯。能有這個機會第二度踏上這個演講台

真是一大榮幸。我非常感激。

這個研討會給我留下了極為深刻的印象，我想感謝大家 對我之前演講的好評。

我是由衷的想這麼說，有部份原因是因為 —— 我真的有需要!

請你們設身處地為我想一想！



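#### Cleaning functions  
* `strQ2B`: converts full-width letters and symbols to half-width.  

* `clean_s`: after the half-width conversion, removes parenthesized content, strips unneeded hyphens and dashes and normalizes certain symbols depending on the language, and finally pads punctuation with a space on each side for both Chinese and English.  

* `len_s`: computes sentence length, counting characters for Chinese and words for English.  

* `clean_corpus`: the cleaning function. It takes the text of both languages together and can drop sentences that are too long or too short, as well as pairs whose lengths differ too much between the two languages.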
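
The full-width to half-width conversion works because the full-width ASCII variants (U+FF01 to U+FF5E) sit at a fixed offset of 65248 (0xFEE0) above their ASCII counterparts, with the full-width space (U+3000) handled as a special case. A minimal sketch of that arithmetic, where `to_half_width` and `OFFSET` are hypothetical names used only for illustration:

```python
# Full-width ASCII variants occupy U+FF01..U+FF5E (65281..65374) and sit
# exactly 0xFEE0 (65248) code points above their ASCII counterparts.
OFFSET = 0xFEE0

def to_half_width(ch):
    """Map one full-width character to half-width; leave others unchanged."""
    code = ord(ch)
    if code == 0x3000:               # ideographic (full-width) space
        return " "
    if 0xFF01 <= code <= 0xFF5E:     # full-width '!' through '~'
        return chr(code - OFFSET)
    return ch

sample = "ＡＢＣ，１２３！　ok。"
converted = "".join(to_half_width(c) for c in sample)
print(converted)  # ABC,123! ok。
```

Note that the Chinese full stop `。` (U+3002) lies outside the converted range, so it is left as-is.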

In [23]:
import re

def strQ2B(ustring):
    """Full width -> half width"""
    # reference:https://ithelp.ithome.com.tw/articles/10233122
    ss = []
    for s in ustring:
        rstring = ""
        for uchar in s:
            inside_code = ord(uchar)
            if inside_code == 12288:  # Full width space: direct conversion
                inside_code = 32
            elif (inside_code >= 65281 and inside_code <= 65374):  # Full width chars (except space) conversion
                inside_code -= 65248
            rstring += chr(inside_code)
        ss.append(rstring)
    return ''.join(ss)

def clean_s(s, lang):
    if lang == 'en':
        s = strQ2B(s) # Q2B
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace('-', '') # remove '-'
        s = re.sub('([.,;!?()\"])', r' \1 ', s) # keep punctuation
    elif lang == 'zh':
        s = strQ2B(s) # Q2B
        s = re.sub(r"\([^()]*\)", "", s) # remove ([text])
        s = s.replace(' ', '')
        s = s.replace('—', '')
        s = s.replace('“', '"')
        s = s.replace('”', '"')
        s = s.replace('_', '')
        s = re.sub('([。,;!?()\"~「」])', r' \1 ', s) # keep punctuation
    s = ' '.join(s.strip().split())
    return s

def len_s(s, lang):
    if lang == 'zh':
        return len(s)
    return len(s.split())

def clean_corpus(prefix, l1, l2, ratio=9, max_len=1000, min_len=1):
    if Path(f'{prefix}.clean.{l1}').exists() and Path(f'{prefix}.clean.{l2}').exists():
        print(f'{prefix}.clean.{l1} & {l2} exists. skipping clean.')
        return
    with open(f'{prefix}.{l1}', 'r',encoding="utf-8") as l1_in_f:
        with open(f'{prefix}.{l2}', 'r',encoding="utf-8") as l2_in_f:
            with open(f'{prefix}.clean.{l1}', 'w',encoding="utf-8") as l1_out_f:
                with open(f'{prefix}.clean.{l2}', 'w',encoding="utf-8") as l2_out_f:
                    for s1 in l1_in_f:
                        s1 = s1.strip()
                        s2 = l2_in_f.readline().strip()
                        s1 = clean_s(s1, l1)
                        s2 = clean_s(s2, l2)
                        s1_len = len_s(s1, l1)
                        s2_len = len_s(s2, l2)
                        if min_len > 0: # remove short sentence
                            if s1_len < min_len or s2_len < min_len:
                                continue
                        if max_len > 0: # remove long sentence
                            if s1_len > max_len or s2_len > max_len:
                                continue
                        if ratio > 0: # remove by ratio of length
                            # skip empty sentences first to avoid division by zero
                            if s1_len == 0 or s2_len == 0:
                                continue
                            if s1_len/s2_len > ratio or s2_len/s1_len > ratio:
                                continue
                        print(s1, file=l1_out_f)
                        print(s2, file=l2_out_f)
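
To make the filtering rules easier to inspect, the sketch below isolates them in a small predicate that works on lengths only. `keep_pair` is a hypothetical helper, not part of the notebook; it mirrors the three checks and includes an explicit zero-length guard before dividing.

```python
def keep_pair(len1, len2, ratio=9, max_len=1000, min_len=1):
    """Return True if a sentence pair passes the length filters.

    A non-positive threshold disables the corresponding check.
    """
    if min_len > 0 and (len1 < min_len or len2 < min_len):
        return False
    if max_len > 0 and (len1 > max_len or len2 > max_len):
        return False
    if ratio > 0:
        # Zero-length guard so the ratio test never divides by zero.
        if len1 == 0 or len2 == 0:
            return False
        if len1 / len2 > ratio or len2 / len1 > ratio:
            return False
    return True

print(keep_pair(5, 12))   # True: 12 / 5 is within the ratio limit
print(keep_pair(1, 10))   # False: 10 / 1 exceeds ratio 9
print(keep_pair(0, 3))    # False: shorter than min_len
print(keep_pair(0, 3, ratio=-1, max_len=-1, min_len=-1))  # True: all checks disabled
```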

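#### Run the cleaning  
The test data is to be translated from English to Chinese, so it has no Chinese text (the file contains only full stops). Setting the thresholds to -1 makes `clean_corpus` skip the corresponding checks.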

In [24]:
clean_corpus(data_prefix, src_lang, tgt_lang)

In [25]:
clean_corpus(test_prefix, src_lang, tgt_lang, ratio=-1, min_len=-1, max_len=-1)
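
Because `clean_corpus` writes both output files in lockstep, the two cleaned files should end up with the same number of lines. A minimal sketch of that sanity check follows; `count_lines` and `is_aligned` are hypothetical helpers, and the demo runs on throwaway files rather than the real corpus.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)

def is_aligned(prefix, l1, l2):
    """True if both cleaned files have the same number of lines."""
    return count_lines(f"{prefix}.clean.{l1}") == count_lines(f"{prefix}.clean.{l2}")

# Demo on temporary files so the sketch runs without the real data.
demo = {"en": ["hello .", "thank you ."], "zh": ["你好 。", "謝謝 。"]}
with tempfile.TemporaryDirectory() as tmp:
    demo_prefix = os.path.join(tmp, "demo")
    for lang, lines in demo.items():
        with open(f"{demo_prefix}.clean.{lang}", "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")
    aligned = is_aligned(demo_prefix, "en", "zh")

print(aligned)
```

On the real data the call would be along the lines of `is_aligned(data_prefix, src_lang, tgt_lang)`.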

#### Train/valid split

In [26]:
valid_ratio = 0.01 # 3000~4000 would suffice
train_ratio = 1 - valid_ratio

In [38]:
if Path(data_dir+f'/train.clean.{src_lang}').exists() \
and Path(data_dir+f'/train.clean.{tgt_lang}').exists() \
and Path(data_dir+f'/valid.clean.{src_lang}').exists() \
and Path(data_dir+f'/valid.clean.{tgt_lang}').exists():
    print('train/valid splits exist. skipping split.')
else:
    # count lines with a context manager so the file handle is closed
    with open(f'{data_prefix}.clean.{src_lang}', encoding="utf-8") as f:
        line_num = sum(1 for line in f)
    labels = list(range(line_num))
    random.shuffle(labels)
    for lang in [src_lang, tgt_lang]:
        train_f = open(os.path.join(data_dir, f'train.clean.{lang}'), 'w',encoding="utf-8")
        valid_f = open(os.path.join(data_dir, f'valid.clean.{lang}'), 'w',encoding="utf-8")
        count = 0
        for line in open(f'{data_prefix}.clean.{lang}', 'r',encoding="utf-8"):
            if labels[count]/line_num < train_ratio:
                train_f.write(line)
            else:
                valid_f.write(line)
            count += 1
        train_f.close()
        valid_f.close()
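
The split keeps the two languages aligned because one shuffled list of labels is reused for both files: line `i` goes to the same side in each language. The sketch below shows the same idea on in-memory lists. `split_parallel` is a hypothetical helper, and it uses a local `random.Random` instance rather than the global generator, which is an assumption made for this illustration.

```python
import random

def split_parallel(src_lines, tgt_lines, valid_ratio=0.01, seed=33):
    """Split two aligned lists into train/valid using one shared shuffle."""
    assert len(src_lines) == len(tgt_lines)
    n = len(src_lines)
    labels = list(range(n))
    random.Random(seed).shuffle(labels)  # local generator, global state untouched
    train_ratio = 1 - valid_ratio
    train, valid = [], []
    for i, pair in enumerate(zip(src_lines, tgt_lines)):
        # labels[i] / n is spread evenly over [0, 1), so roughly
        # valid_ratio of the pairs fall on the valid side.
        if labels[i] / n < train_ratio:
            train.append(pair)
        else:
            valid.append(pair)
    return train, valid

src = [f"en {i}" for i in range(1000)]
tgt = [f"zh {i}" for i in range(1000)]
train, valid = split_parallel(src, tgt)
print(len(train), len(valid))
```

Every pair in `train` and `valid` still holds the source and target sentence with the same index.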