# Dataset

In [1]:
import pandas as pd
from datasets import Dataset
import numpy as np


For my fine-tuning dataset, I download the ten most popular Japanese novels listed on novelupdates.com. I then use the Japanese title shown on each novel's page to download the original Japanese web novel from syosetsu. I could try to automate this step, but I don't have the front-end experience to do so, and it is not the focus of this project.

I realized that encoding the JP and EN parts separately would mean the two sides could not be related to each other. I therefore created the sentence-pair file below, which serves as my now-reduced dataset.

In [2]:
dataframe = pd.read_csv("C:/Users/Dylan/Downloads/syosetu/Sentence Pairs.txt", sep="|", header=None, encoding="utf-8")
dataframe.columns = ["Translation", "EN_for_now"]
dataframe

Unnamed: 0,Translation,EN_for_now
0,そして兄ブダリオンを超えたブギータスは、遂に帝位を奪い取った。そしてそのまま瞬く間に他の国を...,And having surpassed his older brother Budari...
1,地球での伯父も流石に白米を食べる事を贅沢だとは言わなかったので、当時は普通に白米を食していた。,His uncle on Earth hadn’t gone as far as to s...
2,しかし、オリジンで二度目の人生を過ごした軍事国家の研究所は、地球の欧州圏に該当する文化圏にあ...,"However, in the military nation in Origin tha..."
3,荷物の運搬の為に、嫌悪感を抑えて空間属性魔術を使うグーバモンのアンデッドを作るべきかと思った...,Vandalieu had been wondering whether he shoul...
4,しかしテイマーの女は諦めない。鞭を振るい、【限界超越】、【魔鞭限界突破】を発動。ただでさえ残...,She brandished her whip and activated ‘Transc...
5,『スキルとは、生物が持つ魂の力を簡単に引き出すために、魂の一部を改変したもの』,『A skill is the transformation of part of the...
6,「さきほどあなたがおっしゃったのは、ＭＡエネルギーだけに焦点を合わせたことです。この世界の住...,「Referring to what you said a short while ago...
7,繰り返す、分体は本体の思考をサポートし、口が滑らかに動くように尽力せよ！,"I repeat, the clones are to support the main ..."
8,《確認しました。対熱耐性獲得・・・成功しました》,<<Confirmed. Establishing heat resistance. Su...
9,時間をくれ、落ち着くから。こういう時は素数を数えたらいいんだっけ？,"Just an hour please, let me catch my breath. ..."


I ended up having to do some DataFrame manipulation to get the tokenizer to recognize the dataset I created.

In [3]:
# Prefix each column with its language key, then merge both into a single string column.
dataframe["Translation"] = "\"jp\": " + dataframe["Translation"]
dataframe["EN_for_now"] = "\"en\": " + dataframe["EN_for_now"]
dataframe["Translation"] = dataframe["Translation"] + ", " + dataframe["EN_for_now"]


In [4]:
dataframe.drop(["EN_for_now"], axis=1, inplace=True)

In [5]:
dataframe

Unnamed: 0,Translation
0,"""jp"": そして兄ブダリオンを超えたブギータスは、遂に帝位を奪い取った。そしてそのまま瞬く..."
1,"""jp"": 地球での伯父も流石に白米を食べる事を贅沢だとは言わなかったので、当時は普通に白米..."
2,"""jp"": しかし、オリジンで二度目の人生を過ごした軍事国家の研究所は、地球の欧州圏に該当す..."
3,"""jp"": 荷物の運搬の為に、嫌悪感を抑えて空間属性魔術を使うグーバモンのアンデッドを作るべ..."
4,"""jp"": しかしテイマーの女は諦めない。鞭を振るい、【限界超越】、【魔鞭限界突破】を発動。..."
5,"""jp"": 『スキルとは、生物が持つ魂の力を簡単に引き出すために、魂の一部を改変したもの』 ..."
6,"""jp"": 「さきほどあなたがおっしゃったのは、ＭＡエネルギーだけに焦点を合わせたことです。..."
7,"""jp"": 繰り返す、分体は本体の思考をサポートし、口が滑らかに動くように尽力せよ！ , ""..."
8,"""jp"": 《確認しました。対熱耐性獲得・・・成功しました》 , ""en"": <<Conf..."
9,"""jp"": 時間をくれ、落ち着くから。こういう時は素数を数えたらいいんだっけ？ , ""en""..."
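The merged column above stores each pair as one string, so anything that reads it sees a single piece of text rather than a source sentence and its translation. Here is a minimal sketch of an alternative layout that keeps the two languages in separate fields. It assumes the Japanese and English sentences are available as two plain lists; the helper name `to_translation_records` is made up for illustration, and the sample pair is the first sentence of row 0.

```python
def to_translation_records(jp_sentences, en_sentences):
    """Pair each Japanese sentence with its English translation.

    Hypothetical helper (not part of the notebook): returns one dict per
    sentence pair, with the languages kept in separate fields.
    """
    if len(jp_sentences) != len(en_sentences):
        raise ValueError("jp and en lists must have the same length")
    return [
        {"translation": {"jp": jp.strip(), "en": en.strip()}}
        for jp, en in zip(jp_sentences, en_sentences)
    ]


records = to_translation_records(
    ["そして兄ブダリオンを超えたブギータスは、遂に帝位を奪い取った。"],
    ["And having surpassed his older brother Budarion, Bugitas finally stole the throne."],
)
print(records[0]["translation"]["jp"])
print(records[0]["translation"]["en"])
```

A list like this could then be handed to `Dataset.from_list`, which should give a dataset with a nested `translation` feature instead of one flat string.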


In [6]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

In [7]:


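# Cast every cell to str before building the Dataset
# (astype(str) replaces DataFrame.applymap(str), which is deprecated in recent pandas).
dataset = Dataset.from_pandas(dataframe.astype(str))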

dataset

Dataset({
    features: ['Translation'],
    num_rows: 17
})

In [8]:
dataset[0]["Translation"]

'"jp": そして兄ブダリオンを超えたブギータスは、遂に帝位を奪い取った。そしてそのまま瞬く間に他の国を支配し、自分を頂点とした弱肉強食の世界支配が始まるはずだった。 , "en":  And having surpassed his older brother Budarion, Bugitas finally stole the throne. And from there, he was supposed to have conquered the other nations in the blink of an eye and begin ruling a world of survival of the fittest, where he sat at the top.'

In [9]:
def tokenize_function(examples):
    return tokenizer(examples["Translation"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets


Map:   0%|          | 0/17 [00:00<?, ? examples/s]

Dataset({
    features: ['Translation', 'input_ids', 'attention_mask'],
    num_rows: 17
})
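The features listed above contain `input_ids` and `attention_mask` but no `labels`, and a sequence-to-sequence model generally needs target token ids to compute a training loss. Below is a hedged sketch of a tokenize function that encodes the Japanese text as the input and the English text as the target. It assumes the dataset keeps separate `jp` and `en` columns (it does not in the cells above) and a `transformers` version whose tokenizers accept `text_target`. The name `make_preprocess` and the value `max_length=128` are illustrative choices, not something taken from the notebook.

```python
def make_preprocess(tokenizer, max_length=128):
    """Build a function suitable for Dataset.map(..., batched=True).

    Assumes each batch has "jp" and "en" columns (hypothetical names).
    """
    def preprocess(examples):
        # text_target asks the tokenizer to also encode the English side
        # and return its token ids under the "labels" key.
        return tokenizer(
            examples["jp"],
            text_target=examples["en"],
            max_length=max_length,
            truncation=True,
        )
    return preprocess


# Usage sketch (not run here; it needs the NLLB tokenizer to be downloaded):
# tokenizer = AutoTokenizer.from_pretrained(
#     "facebook/nllb-200-distilled-600M", src_lang="jpn_Jpan", tgt_lang="eng_Latn"
# )
# tokenized_datasets = dataset.map(make_preprocess(tokenizer), batched=True)
```

Dropping `padding="max_length"` here is a design choice: padding can be left to a data collator at batch time, so short sentences are not stored at the tokenizer's full maximum length.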

In [10]:
from sklearn.model_selection import train_test_split

tokenized_train, tokenized_test = train_test_split(tokenized_datasets, test_size=0.3, random_state=1)


In [16]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

In [13]:
import evaluate

metric = evaluate.load("accuracy")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [14]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [17]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

In [18]:
trainer.train()

  0%|          | 0/3 [00:00<?, ?it/s]

KeyError: 1
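
The `KeyError: 1` and the `0/3` progress bar are consistent with `train_dataset` being a plain dict of columns rather than a `Dataset`. A dict with three keys has length 3, which at the default settings would give three epochs of one batch each, and looking up row `1` in such a dict fails. This is a likely reading, not a confirmed diagnosis. scikit-learn's `train_test_split` is written for arrays and DataFrames, and indexing a `Dataset` with an array of row numbers returns a dict of columns. The pure-Python sketch below reproduces only that failure mode, using made-up values.

```python
# Stand-in for what the split may have returned: column name -> list of values.
# All values here are made up for illustration.
columns = {
    "Translation": ["pair a", "pair b"],
    "input_ids": [[0, 5, 2], [0, 7, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# len() counts the keys, so a data loader would see three "rows".
print(len(columns))


def fetch_row(dataset, index):
    """Mimic a data loader asking for one row by integer position."""
    try:
        return dataset[index]
    except KeyError as err:
        return f"KeyError: {err}"


# Integer positions are not keys of the dict, so the lookup fails.
print(fetch_row(columns, 1))
```

If that is the cause, the `Dataset` class has its own splitter, `tokenized_datasets.train_test_split(test_size=0.3, seed=1)`, which returns a `DatasetDict` whose `"train"` and `"test"` entries can be passed to `Trainer`.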