
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
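
The next cell joins Tatoeba's sentences.csv and links.csv into English/Mandarin pairs. As a rough illustration of that join, here is a minimal sketch on in-memory data. The ids, the French row, and the helper name `pair_up` are hypothetical and exist only for this example; the real functions in the next cell read the two files from Drive instead.

```python
from collections import defaultdict

# Hypothetical stand-ins for sentences.csv rows: (sentence id, language, text).
rows = [
    (1, "cmn", "我們試試看！"),
    (2, "eng", "Let's try something."),
    (3, "fra", "Essayons quelque chose."),
]

# Hypothetical stand-ins for links.csv rows. Links are reciprocal:
# if (a, b) is listed, (b, a) is listed too.
link_rows = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2)]


def pair_up(rows, link_rows, target_language, src_language):
    """Return (target text, source text) pairs; illustrative helper only."""
    # Keep only sentences in one of the two languages of interest.
    sentences = {
        sent_id: (lang, text)
        for sent_id, lang, text in rows
        if lang in (target_language, src_language)
    }

    # Map each sentence id to the set of ids of its translations.
    links = defaultdict(set)
    for sent_id, trans_id in link_rows:
        links[sent_id].add(trans_id)

    pairs = []
    for sent_id, trans_ids in links.items():
        # Because links are reciprocal, starting from the target-language
        # side alone is enough to find every pair exactly once.
        if sent_id in sentences and sentences[sent_id][0] == target_language:
            for trans_id in sorted(trans_ids):
                if trans_id in sentences and sentences[trans_id][0] == src_language:
                    pairs.append((sentences[sent_id][1], sentences[trans_id][1]))
    return pairs


pairs = pair_up(rows, link_rows, "eng", "cmn")
print(pairs)
```

The French row is dropped when the sentences are filtered, so only the English/Mandarin pair survives even though all three sentences are linked to each other.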


In [4]:
from collections import defaultdict


def read_sentences(filename, target_language, src_language, sentences_with_audio=None):
    """
    Read sentences.csv and return a dict containing sentence information.
    Parameters:
        filename (str): path to 'sentences.csv'
        target_language (str): target language code (e.g. 'eng')
        src_language (str): source language code (e.g. 'cmn')
        sentences_with_audio (set of int): set of sentence ids with audio.
            If not None, target-language sentences are limited to this set.
    Returns:
        dict from sentence id (int) to sentence information, where
        sentence information is a dict with 'sent_id', 'lang', and 'text' keys.
        The dict only contains sentences in target_language or src_language.
    """

    sentences = {}
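    # The data contains Chinese text, so read it as UTF-8 explicitly,
    # and use a context manager so the file is closed when done.
    with open(filename, encoding='utf-8') as f:
        for line in f:
            sent_id, lang, text = line.rstrip().split('\t')
            if lang == src_language or lang == target_language:
                sent_id = int(sent_id)
                # The audio filter applies to target-language sentences only.
                if (sentences_with_audio is not None
                        and lang == target_language
                        and sent_id not in sentences_with_audio):
                    continue
                sentences[sent_id] = {'sent_id': sent_id, 'lang': lang, 'text': text}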
    return sentences

def read_links(filename):
    """
    Read links.csv and return a dict containing link information.
    Parameters:
        filename (str): path to 'links.csv'
    Returns:
        dict from sentence id (int) to the set of ids (int) of all its translations.
    """

    links = defaultdict(set)
    # Use a context manager so the file is closed when done.
    with open(filename, encoding='utf-8') as f:
        for line in f:
            sent_id, trans_id = line.rstrip().split('\t')
            links[int(sent_id)].add(int(trans_id))
    return links


def generate_translation_pairs(sentences, links, target_language, src_language):
    """
    Given sentences and links, generate a list of sentence pairs in target and source languages.
    Parameters:
        sentences: dict of sentence information (returned by read_sentences())
        links: dict of links information (returned by read_links())
        target_language (str): target language
        src_language (str): src language
    Returns:
        list of sentence pairs (sentence info 1, sentence info 2)
        where sentence info 1 is in target_language and sentence info 2 in src_language.
    """
    translations = []
    for sent_id, trans_ids in links.items():
        # Links in links.csv are reciprocal, meaning that if (id1, id2) is in the file,
        # (id2, id1) is also in the file. So we don't have to check both directions.
        if sent_id in sentences and sentences[sent_id]['lang'] == target_language:
            for trans_id in trans_ids:
                if trans_id in sentences and sentences[trans_id]['lang'] == src_language:
                    translations.append((sentences[sent_id], sentences[trans_id]))
    return translations


def write_tsv(translations):
    """
    Print translations as TSV to stdout and write them to the output file.
    Parameters:
        translations (list): list of sentence pairs returned by generate_translation_pairs()
    """
    out_file = "/content/drive/My Drive/Eng_Mandarin/data/tatoeba.eng_cmn.tsv"
    # Write UTF-8 explicitly so the Chinese text is saved the same way on any platform.
    with open(out_file, "w", encoding="utf-8") as out:
        for sent1, sent2 in translations:
            sent1_text = sent1['text']
            sent2_text = sent2['text']
            # Echo each pair to the cell output and save it to the TSV file.
            print("%s\t%s" % (sent1_text, sent2_text))
            out.write("%s\t%s\n" % (sent1_text, sent2_text))


def main():
    target_language = "eng"
    src_language = "cmn"

    sentences = read_sentences(
        "/content/drive/My Drive/Eng_Mandarin/data/sentences.csv",
        target_language, src_language, None)

    links = read_links("/content/drive/My Drive/Eng_Mandarin/data/links.csv")

    translations = generate_translation_pairs(sentences, links, target_language, src_language)

    write_tsv(translations)

if __name__ == '__main__':
    main()

Let's try something.	我們試試看！
I have to go to sleep.	我该去睡觉了。
Today is June 18th and it is Muiriel's birthday!	今天是６月１８号，也是Muiriel的生日！
Muiriel is 20 now.	Muiriel现在20岁了。
The password is "Muiriel".	密码是"Muiriel"。
The password is "Muiriel".	密碼是「Muiriel」。
I will be back soon.	我很快就會回來。
I'm at a loss for words.	我不知道應該說什麼才好。
This is never going to end.	這個永遠完不了了。
This is never going to end.	这将永远继续下去。
I just don't know what to say.	我只是不知道應該說什麼而已……
I just don't know what to say.	我就是不知道說些什麼。
That was an evil bunny.	那是一隻有惡意的兔子。
I was in the mountains.	我以前在山里。
Is it a recent picture?	那是一张近照吗？
I don't know if I have the time.	我不知道我有沒有時間。
Education in this world disappoints me.	世界上的教育都讓我失望。
You're in better shape than I am.	你的體型比我的好。
You are in my way.	你擋住了我的路。
This will cost €30.	這個要三十歐元。
I make €100 a day.	我一天賺一百歐元。
I may give up soon and just nap instead.	也许我会马上放弃然后去睡一觉。
That won't happen.	那是不會發生的。
I can only wonder if this is the same for everyone else.	我只能问自己这对其他所有人是不是一回事呢。
I suppose it's different w