# 🛡 🔨 🤖 **Dataset, assemble!**

I want to create a semi-large supervised (labelled) dataset of questions and answers, in their original language (English) and their translated language (French).

Once I have many pairs of short sentences in English (source language) and French (target language), I'll try to use a Google Colab cloud GPU to fine-tune the translation model I have been using (as of April 2025).

Hopefully I can do that quickly and efficiently, and then save the model and share it on my [Hugging Face profile](https://huggingface.co/Naereen) so that anyone can use it.

In [1]:
!pip install watermark sacremoses tensorflow transformers requests beautifulsoup4 pandas evaluate numpy sacrebleu
%load_ext watermark
%watermark -v -p numpy,pandas,tensorflow,transformers,requests,bs4,evaluate,sacrebleu

Collecting watermark
  Downloading watermark-2.5.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.1.1-py3-none-any.wh

In [2]:
running_on_GPU = False
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    print("Running on CPU (20 to 30x slower for LLM/NLP/ML)")
else:
    running_on_GPU = True
    print("Running on GPU")

Running on GPU


> We encourage you to login to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to login:

In [3]:
from huggingface_hub import notebook_login

notebook_login()


## 1. Use segmented versions of the text from English/French articles of Cranial Insertion

Segmenting the original/French (VO/VF) text of the articles directly in this notebook would be too complicated.

Instead, I used JAM.pl (https://github.com/kraifo/PCP/blob/master/lib/JAM.pl) to do this as well as possible, offline.
The `all-tmx.zip` file contains about 700 TMX files, each 50 to 150 paragraphs long.

Let's extract them and load them into a pandas DataFrame; only later on will we load the data into a Hugging Face Dataset.

In [None]:
!ls all-tmx.zip
!mkdir -p all-tmx
!unzip -q all-tmx.zip -d all-tmx

all-tmx.zip


In [None]:
!ls -larth all-tmx/ | tail
!ls -larth all-tmx/ | wc

-rw-rw-r-- 1 root root  57K Apr 10 22:40 id-en-1691_id-fr-1693.full.en-fr.tmx
-rw-rw-r-- 1 root root  42K Apr 10 22:40 id-en-1617_id-fr-1623.full.en-fr.tmx
-rw-rw-r-- 1 root root  38K Apr 10 22:40 id-en-1589_id-fr-1595.full.en-fr.tmx
-rw-rw-r-- 1 root root  39K Apr 10 22:40 id-en-1563_id-fr-1569.full.en-fr.tmx
-rw-rw-r-- 1 root root  51K Apr 10 22:40 id-en-1470_id-fr-1479.full.en-fr.tmx
-rw-rw-r-- 1 root root  39K Apr 10 22:40 id-en-1392_id-fr-1398.full.en-fr.tmx
-rw-rw-r-- 1 root root  36K Apr 10 22:40 id-en-1248_id-fr-1256.full.en-fr.tmx
-rw-rw-r-- 1 root root  45K Apr 10 22:40 id-en-1185_id-fr-1192.full.en-fr.tmx
drwxr-xr-x 1 root root 4.0K Apr 11 00:45 ..
drwxr-xr-x 2 root root  48K Apr 11 00:45 .
    699    6284   54208


As claimed, we have about 700 .tmx files. Let's extract as many pairs of English/French sentences as we can from these files.

In [None]:
# prompt: for one of this TMX file, let say its filename is "all-tmx/id-en-1185_id-fr-1192.full.en-fr.tmx", I want to open it, load its content using the PythonTmx library (https://python-tmx.readthedocs.io/en/stable/), and generate two lists of aligned sentence, the first in English and the second in French. TMX here is Translation Memory eXchange (TMX) files, not Tiles Map files of PyGame.

from pathlib import Path
from lxml import etree
from typing import List, Tuple

def extract_sentences_from_tmx(tmx_filepath: str) -> Tuple[List[str], List[str]]:
    """
    Extracts aligned English and French sentences from a TMX file.

    Args:
        tmx_filepath: Path to the TMX file.

    Returns:
        A tuple containing two lists: English sentences and French sentences.
    """
    english_sentences = []
    french_sentences = []
    idx = 0
    try:
        tree = etree.parse(tmx_filepath)
        for tu in tree.xpath("//tu"):
            english_seg = tu.xpath(".//tuv[@xml:lang='en']/seg")
            french_seg = tu.xpath(".//tuv[@xml:lang='fr']/seg")

            if english_seg and french_seg:  # Ensure both segments exist
                for i in range(min(len(english_seg), len(french_seg))):
                    en_text = english_seg[i].text
                    fr_text = french_seg[i].text
                    if en_text and fr_text:
                        idx += 1
                        # print(f"# {idx}-th sentences:\n  English: {en_text}\n  French: {fr_text}")
                        english_sentences.append(en_text)
                        french_sentences.append(fr_text)
        print(f"==> for file {tmx_filepath},\n\twe found {idx} English/French sentences likely to be aligned.")
    except etree.XMLSyntaxError as e:
        print(f"Error parsing {tmx_filepath}: {e}")
    return english_sentences, french_sentences

Let's try this on one file:

In [None]:
# Example usage:
tmx_file = "all-tmx/id-en-1185_id-fr-1192.full.en-fr.tmx"  # Replace with your file
english_sents, french_sents = extract_sentences_from_tmx(tmx_file)

==> for file all-tmx/id-en-1185_id-fr-1192.full.en-fr.tmx,
	we found 45 English/French sentences likely to be aligned.


Getting 45 pairs of sentences that are most likely correctly aligned, from a single pair of English/French articles, is already great!
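
For readers who don't have the TMX files at hand, here is a minimal sketch of the same kind of extraction on a tiny in-memory TMX snippet. It is not the function used above: the sample content and the helper name `pairs_from_tmx_string` are made up for illustration, it relies on the standard library's `xml.etree.ElementTree` rather than `lxml`, and it uses `itertext()` so that a segment containing inline tags is not cut at the first tag (an assumption about what is desirable here).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, hand-written sample: two translation units, one of them incomplete.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Can <hi>it</hi> attack?</seg></tuv>
      <tuv xml:lang="fr"><seg>Peut-elle attaquer ?</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Only in English.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def pairs_from_tmx_string(tmx_text):
    """Return a list of (english, french) pairs found in a TMX document."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects the text found inside inline tags.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep the unit only if both languages are present and non-empty.
        if segments.get("en") and segments.get("fr"):
            pairs.append((segments["en"], segments["fr"]))
    return pairs

demo_pairs = pairs_from_tmx_string(SAMPLE_TMX)
print(demo_pairs)
```

The second translation unit is skipped because it has no French segment, which mirrors the "both segments exist" check of the function above.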

In [None]:
# Print the first few sentences to verify
print("First 5 pair of English/French sentences:")
for i in range(5):
    print(f"- {i+1}-th sentence:")
    en_sent = english_sents[i]
    print(en_sent)
    fr_sent = french_sents[i]
    print(fr_sent)

First 5 pair of English/French sentences:
- 1-th sentence:
While I'm doing my investigation, don't forget we're also here to answer your rules questions. If you have a short question, you can tweet to use at [url=https://twitter.com/CranialTweet]@CranialTweet[/url], and if you have a longer question, you can e-mail us at moko@cranialinsertion.com. We might even use your question in a future article!
Pendant que j'enquête, n'oubliez pas que nous sommes toujours là pour répondre à vos questions de règle. Si vos questions sont courtes, vous pouvez nous les tweeter à [url=https://twitter.com/CranialTweet]@CranialTweet[/url] et, si elles sont plus longues, envoyez-les-nous par mail à moko@cranialinsertion.com. Nous pourrions même utiliser votre question dans un prochain article !
- 2-th sentence:
[hr][q]I control a [c]Lambholt Pacifist[/c], and I enchant it with [c]Equestrian Skill[/c]. Can it attack?[/q]
[hr][Q=Q : ]Je contrôle une [c=Lambholt Pacifist]Pacifiste de Lambholt[/c] et je l'enc

Now let's collect all the pairs of sentences into a single list, and save it as a pandas DataFrame (then export it to a CSV file, so that this step can be skipped later):

In [None]:
# prompt: Call the extract_sentences_from_tmx function on EACH .tmx file that are in the "all-tmx/" folder, and generate a list of all the english_sents, and french_sents called all_english_sents, all_fench_sents, concatenating all the results.

import os
from pathlib import Path

all_english_sents = []
all_french_sents = []

# Iterate through all .tmx files in the "all-tmx" directory
for filename in os.listdir("all-tmx"):
    if filename.endswith(".tmx"):
        tmx_file = os.path.join("all-tmx", filename)
        english_sents, french_sents = extract_sentences_from_tmx(tmx_file)
        all_english_sents.extend(english_sents)
        all_french_sents.extend(french_sents)

print(f"Length of all_english_sents: {len(all_english_sents)}")
print(f"Length of all_french_sents: {len(all_french_sents)}")
assert (len(all_english_sents) == len(all_french_sents) )


==> for file all-tmx/id-en-981_id-fr-992.full.en-fr.tmx,
	we found 226 English/French sentences likely to be aligned.
==> for file all-tmx/id-en-4194_id-fr-4196.full.en-fr.tmx,
	we found 75 English/French sentences likely to be aligned.
==> for file all-tmx/id-en-2953_id-fr-2957.full.en-fr.tmx,
	we found 42 English/French sentences likely to be aligned.
==> for file all-tmx/id-en-1119_id-fr-1126.full.en-fr.tmx,
	we found 52 English/French sentences likely to be aligned.
==> for file all-tmx/id-en-1159_id-fr-1173.full.en-fr.tmx,
	we found 41 English/French sentences likely to be aligned.
==> for file all-tmx/id-en-1551_id-fr-1556.full.en-fr.tmx,
	we found 51 English/French sentences likely to be aligned.
==> for file all-tmx/id-en-1127_id-fr-1134.full.en-fr.tmx,
	we found 44 English/French sentences likely to be aligned.
==> for file all-tmx/id-en-4006_id-fr-4009.full.en-fr.tmx,
	we found 40 English/French sentences likely to be aligned.
==> for file all-tmx/id-en-3932_id-fr-3936.full.e

In [None]:
# prompt: Now convert these two list of 34292 sentences/paragraphs in English or French, to a single Pandas DataFrame, that would have only two columns, named "english" and "french".
# Then save this DataFrame to a CSV file named "all-sentences-from-tmx.csv".
import pandas as pd

# Assuming all_english_sents and all_french_sents are already defined from the previous code
# Create a Pandas DataFrame
df_cranial_sentences = pd.DataFrame({'english': all_english_sents, 'french': all_french_sents})

# Save the DataFrame to a CSV file
df_cranial_sentences.to_csv("all-cranial-sentences-from-tmx.csv", index=False)

We can read the file, if we don't want to work again on generating it:

In [6]:
import pandas as pd
try:
    df_cranial_sentences = pd.read_csv("all-cranial-sentences-from-tmx.csv")
except Exception as e:  # missing or unreadable file; avoid a bare `except:`
    print(f"Couldn't load the CSV file 'all-cranial-sentences-from-tmx.csv', oopsie: {e}")
    df_cranial_sentences = pd.DataFrame()  # empty DataFrame, to avoid bugs in the next parts.
df_cranial_sentences

Unnamed: 0,english,french
0,"that mustache.[/cright]Howdy, folks! Welcome b...",Salut les gens ! Bienvenue pour une autre sema...
1,Lots of things would be better. Especially thi...,Beaucoup de choses seraient améliorées. Surtou...
2,So here's the plan. We're going to go back to ...,"Donc, voilà le plan. Nous retournons en 1993, ..."
3,[hr],[hr]
4,[Q]You control [c=Ali from Cairo|ARN]Ali from ...,[Q=Q :]Tu contrôles [c=Ali from Cairo|ARN]Ali ...
...,...,...
34287,[hr][q]I was watching coverage of the Pro Tour...,[hr][Q=Q : ]En regardant le Pro Tour il y a qu...
34288,"[a]Nope, that was just something special for t...","[A=R : ]Non, c'était quelque chose de spécial ..."
34289,[hr][q]I was looking through Gatherer the othe...,"[hr][Q=Q : ]En regardant Gatherer, j'ai remarq..."
34290,"[a]Yep, it's true! The Welcome Deck 2016 cards...","[A=R : ]Oui, c'est bien le cas. Les cartes Wel..."


I manually checked the content of this file, and it seems to be well aligned!
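
A manual check can be complemented by a cheap automatic one. The sketch below flags pairs whose French/English length ratio looks unusual, which is often a sign of misalignment. The function name `suspicious_pairs` and the thresholds (`low`, `high`, `min_chars`) are arbitrary assumptions for illustration, not values tuned on this dataset, and the function assumes both sides are plain strings.

```python
def suspicious_pairs(pairs, low=0.5, high=2.5, min_chars=20):
    """Return (index, english, french) for pairs whose length ratio looks off."""
    flagged = []
    for i, (en, fr) in enumerate(pairs):
        if max(len(en), len(fr)) < min_chars:
            continue  # very short segments have noisy ratios, skip them
        ratio = len(fr) / max(len(en), 1)
        if not (low <= ratio <= high):
            flagged.append((i, en, fr))
    return flagged

# Hypothetical examples: only the third pair is clearly misaligned.
example = [
    ("Can it attack right away?", "Peut-elle attaquer tout de suite ?"),
    ("Yes.", "Oui."),
    ("This is a fairly long English paragraph about combat damage.", "[hr]"),
]
flagged = suspicious_pairs(example)
print([index for index, _, _ in flagged])
```

On the real data, the input could be built with `list(zip(df_cranial_sentences["english"], df_cranial_sentences["french"]))`, after dropping any missing values.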

------

## 2. Use SQLite and the [MTG JSON](https://mtgjson.com/downloads/all-files/#allprintings) database, to get the English/French text of all the Magic the Gathering cards

The MTGJSON project (mtgjson.com) provides its database as a downloadable SQLite file, which can be queried with simple SQL commands to work on the entire database of ALL the cards.

In [7]:
# prompt: Download the SQLite file at "https://mtgjson.com/api/v5/AllPrintings.sqlite" and save it so I can use it in Python code
import requests
import os

# Download the SQLite file
url = "https://mtgjson.com/api/v5/AllPrintings.sqlite"
filename = "AllPrintings.sqlite"

if not os.path.exists(filename):
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Raise an exception for bad status codes

        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print(f"Downloaded {filename} successfully.")

    except requests.exceptions.RequestException as e:
        print(f"Error downloading the file: {e}")
else:
    print(f"{filename} already exists. Skipping download.")


Downloaded AllPrintings.sqlite successfully.


In [8]:
!ls -larth *.sqlite
!du -h *.sqlite

-rw-r--r-- 1 root root 472M Apr 11 05:09 AllPrintings.sqlite
472M	AllPrintings.sqlite


We are ready to execute some SQLite code!

In [9]:
# prompt: import the sqlite database in "AllPrintings.sqlite" into a Panas dataframe and display it nicely
import pandas as pd
import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('AllPrintings.sqlite')

def run_query(query):
  return pd.read_sql_query(query, conn)

Now I want to obtain a card's text in English and its translation in French, for each differently-named card.

In [10]:
# Query the database and load into a Pandas DataFrame
query = """
SELECT DISTINCT cards.name as name_en, cFD.multiverseId, cFD.name as name_fr,
       cards.text as text_en, cFD.text as text_fr
FROM cards
JOIN cardForeignData as cFD ON cards.uuid = cFD.uuid
WHERE cFD.multiverseId IS NOT NULL
  AND cFD.language = 'French'
ORDER BY cFD.multiverseId;
"""
df = run_query(query)

# Display the DataFrame nicely
# pd.set_option("display.max_rows", None, "display.max_columns", None) # Show all rows and columns
display(df)

Unnamed: 0,name_en,multiverseId,name_fr,text_en,text_fr
0,Aether Snap,75971,Coup d'Aether,Remove all counters from all permanents and ex...,Retirez tous les marqueurs de tous les permane...
1,Aether Vial,75972,Fiole d'Aether,"At the beginning of your upkeep, you may put a...","Au début de votre entretien, vous pouvez mettr..."
2,Ageless Entity,75973,Entité sans âge,"Whenever you gain life, put that many +1/+1 co...",À chaque fois que vous gagnez des points de vi...
3,Angel's Feather,75974,Plume d'ange,"Whenever a player casts a white spell, you may...","À chaque fois qu'un joueur joue un sort blanc,..."
4,Arcane Spyglass,75975,Longue-vue des arcanes,"{2}, {T}, Sacrifice a land: Draw a card and pu...","{2}, {T}, Sacrifiez un terrain : Piochez une c..."
...,...,...,...,...,...
46604,Treasure Vault,692899,Salle au trésor,"{T}: Add {C}.\n{X}{X}, {T}, Sacrifice this lan...","{T} : Ajoutez {C}.\n{X}{X}, {T}, sacrifiez ce ..."
46605,Underground River,692900,Rivière souterraine,{T}: Add {C}.\n{T}: Add {U} or {B}. This land ...,{T} : Ajoutez {C}.\n{T} : Ajoutez {U} ou {B}. ...
46606,Unholy Grotto,692901,Grotte impie,"{T}: Add {C}.\n{B}, {T}: Put target Zombie car...","{T} : Ajoutez {C}.\n{B}, {T} : Mettez une cart..."
46607,Vineglimmer Snarl,692902,Lacis pâlevigne,"As this land enters, you may reveal a Forest o...","Au moment où ce terrain arrive, vous pouvez ré..."


In [11]:
# prompt: I want to extract this as a list of pairs text_en, text_fr.

# Query to extract English and French text pairs
query = """
SELECT cards.text as text_en, cFD.text as text_fr
FROM cards
JOIN cardForeignData as cFD ON cards.uuid = cFD.uuid
WHERE cFD.language = 'French'
  AND cards.text IS NOT NULL
  AND cFD.text IS NOT NULL;
"""

df = run_query(query)

# Create a list of (English, French) pairs
text_pairs = list(zip(df['text_en'], df['text_fr']))

# Example: Accessing the first pair
if text_pairs:
    print(f"First pair: English - \n« {text_pairs[0][0]} »\nFrench - \n« {text_pairs[0][1]} »")


First pair: English - 
« First strike (This creature deals combat damage before creatures without first strike.)\nWhen this creature enters, you gain 1 life for each card in your graveyard. »
French - 
« Initiative (Cette créature inflige des blessures de combat avant les créatures sans l'initiative.)\nQuand l'Élu de l'Ancêtre arrive en jeu, vous gagnez 1 point de vie pour chaque carte dans votre cimetière. »


In [12]:
len(text_pairs)

44694
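
This count has one row per printing, so a card reprinted in several sets can contribute the same text pair several times (assuming its reprints share identical text). A minimal sketch of how `SELECT DISTINCT` collapses such repeats, on a toy in-memory database whose two tables only mimic the columns used above (the rows are invented):

```python
import sqlite3

# Toy database: the same card printed twice (two uuids), with identical texts.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE cards (uuid TEXT, name TEXT, text TEXT);
CREATE TABLE cardForeignData (uuid TEXT, language TEXT, text TEXT);
INSERT INTO cards VALUES ('a1', 'Demo Card', 'Flying'), ('a2', 'Demo Card', 'Flying');
INSERT INTO cardForeignData VALUES ('a1', 'French', 'Vol'), ('a2', 'French', 'Vol'),
                                   ('a1', 'German', 'Fliegend');
""")

base_query = """
SELECT {distinct} cards.text AS text_en, cFD.text AS text_fr
FROM cards
JOIN cardForeignData AS cFD ON cards.uuid = cFD.uuid
WHERE cFD.language = 'French'
  AND cards.text IS NOT NULL
  AND cFD.text IS NOT NULL;
"""

# Same query with and without DISTINCT.
all_rows = conn_demo.execute(base_query.format(distinct="")).fetchall()
unique_rows = conn_demo.execute(base_query.format(distinct="DISTINCT")).fetchall()
print(len(all_rows), len(unique_rows))
conn_demo.close()
```

Whether duplicates should be kept is a design choice: keeping them gives frequently reprinted wordings more weight during fine-tuning, while removing them yields a smaller dataset.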

Similarly, let's save this list of pairs of English/French rules text from the cards to a CSV file, so that we can reload it faster if we need it again.

In [13]:
# prompt: Now convert this list "text_pairs" of 44694 of pair of sentences in English and French, to a single Pandas DataFrame, that would have only two columns, named "english" and "french". Then save this DataFrame to a CSV file named "all-pairs-from-all-the-cards.csv".

import pandas as pd
# Create a Pandas DataFrame from the text_pairs list
df_mtg = pd.DataFrame(text_pairs, columns=['english', 'french'])

# Save the DataFrame to a CSV file
df_mtg.to_csv("all-pairs-from-all-the-cards.csv", index=False)

df_mtg

Unnamed: 0,english,french
0,First strike (This creature deals combat damag...,Initiative (Cette créature inflige des blessur...
1,"Flying\nWhen this creature enters, you gain 3 ...",Vol (Cette créature ne peut être bloquée que p...
2,Target creature gets +3/+3 and gains flying un...,La créature ciblée gagne +3/+3 et acquiert le ...
3,"Whenever a creature you control enters, you ga...",À chaque fois qu'une créature arrive en jeu so...
4,Defender (This creature can't attack.)\nFlying,"Défenseur, vol (Cette créature ne peut pas att..."
...,...,...
44689,"{2}, {T}: Copy target activated or triggered a...","{2}, {T} : Copiez une capacité activée ou décl..."
44690,"Kicker {3}\nIf this creature was kicked, it en...",Kick {3}\nSi la Construction myriadaire a été ...
44691,Kicker {3}\nIndestructible\nWhen this artifact...,Kick {3}\nIndestructible\nQuand la Relique du ...
44692,{T}: Add {C}.\n{4}: Put two +1/+1 counters on ...,{T} : Ajoutez {C}.\n{4} : Mettez deux marqueur...


-----------------------

## 3. Use the paragraphs from the Comprehensive Rules as a third source of English/French aligned sentences

In [16]:
# prompt: Open two SQLite cursors to the two databases named "rules_en_5.db" and "rules_fr_5.db".
# Define two functions that can execute SQL code on one of these two database, respectively for the English or the French one.

import sqlite3

# Connect to the two SQLite databases
conn_en = sqlite3.connect('rules_en_5.db')
conn_fr = sqlite3.connect('rules_fr_5.db')

# Create cursors for each database
cursor_en = conn_en.cursor()
cursor_fr = conn_fr.cursor()

def execute_sql_en(sql_query):
  """Executes an SQL query on the English database and returns a DataFrame."""
  return pd.read_sql_query(sql_query, conn_en)

def execute_sql_fr(sql_query):
  """Executes an SQL query on the French database and returns a DataFrame."""
  return pd.read_sql_query(sql_query, conn_fr)


Now we can explore the two databases I obtained and uploaded alongside this notebook.
They should contain (lengthy) paragraphs of the Comprehensive Rules of Magic: the Gathering, in English and in French.
The English version should be identical to the most recent version on <https://yawgatog.com/resources/magic-rules>, and the French version was manually updated by the authors of the Mythic.Tools (<https://www.mythic.tools/>) phone app. Many thanks to them!

In [18]:
df_rules_en = execute_sql_en("SELECT rules_text FROM rules;")
df_rules_fr = execute_sql_fr("SELECT rules_text FROM rules;")

In [19]:
# prompt: Merge these two Pandas DataFrame into a single one that have two columns, "english" from df_rules_en, and "french" from the othe

import pandas as pd
# Assuming df_rules_en and df_rules_fr are already defined from the previous code
df_rules = pd.DataFrame({'english': df_rules_en['rules_text'], 'french': df_rules_fr['rules_text']})

df_rules.to_csv("all-pairs-from-comprehensive-rules.csv", index=False)
df_rules

Unnamed: 0,english,french
0,General,Généralités
1,These Magic rules apply to any Magic game with...,Ces règles de Magic s'appliquent à toutes les ...
2,A two-player game is a game that begins with o...,Une partie à deux joueurs est une partie qui c...
3,A multiplayer game is a game that begins with ...,Une partie multi-joueurs est une partie qui co...
4,"To play, each player needs their own deck of t...","Afin de jouer, chaque joueur a besoin de son p..."
...,...,...
3187,A Conspiracy Draft game is a multiplayer game....,Une partie de Draft Conspiracy est une partie ...
3188,"At the start of the game, before decks are shu...","Au début de la partie, avant que les decks ne ..."
3189,Conspiracy cards with hidden agenda are put in...,Les cartes de conspiration avec la capacité d'...
3190,The owner of a conspiracy card is the player w...,Le propriétaire d'une carte de conspiration es...


This gives us about 3,200 more English/French text pairs!
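
Pairing the two tables by row position only works if both databases list the same rules in the same order. If the tables expose a shared identifier, aligning on that key is safer. The sketch below assumes a hypothetical `rule_id` column (the real schema of `rules_en_5.db` and `rules_fr_5.db` may differ) and uses two small in-memory databases with invented rows:

```python
import sqlite3

def make_demo_db(rows):
    """Create an in-memory database with a toy `rules` table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rules (rule_id TEXT, rules_text TEXT)")
    conn.executemany("INSERT INTO rules VALUES (?, ?)", rows)
    return conn

# The French side is in a different order and is missing one rule.
demo_en = make_demo_db([("1", "General"), ("100.1", "These Magic rules apply."),
                        ("100.2", "English only.")])
demo_fr = make_demo_db([("100.1", "Ces règles de Magic s'appliquent."), ("1", "Généralités")])

def rules_by_id(conn):
    """Map each rule identifier to its text."""
    return dict(conn.execute("SELECT rule_id, rules_text FROM rules"))

en_rules = rules_by_id(demo_en)
fr_rules = rules_by_id(demo_fr)

# Keep only the rules present in both languages, aligned on the identifier.
shared_ids = sorted(en_rules.keys() & fr_rules.keys())
aligned_rules = [(en_rules[k], fr_rules[k]) for k in shared_ids]
print(aligned_rules)

demo_en.close()
demo_fr.close()
```

At the very least, comparing `len(df_rules_en)` with `len(df_rules_fr)` before building `df_rules` would catch the case where one database has more rows than the other.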

-------------------

## We can combine these three DataFrames into a single one.

In [None]:
df_combined = pd.concat([df_cranial_sentences, df_mtg, df_rules], ignore_index=True)
df_combined

Unnamed: 0,english,french
0,"that mustache.[/cright]Howdy, folks! Welcome b...",Salut les gens ! Bienvenue pour une autre sema...
1,Lots of things would be better. Especially thi...,Beaucoup de choses seraient améliorées. Surtou...
2,So here's the plan. We're going to go back to ...,"Donc, voilà le plan. Nous retournons en 1993, ..."
3,[hr],[hr]
4,[Q]You control [c=Ali from Cairo|ARN]Ali from ...,[Q=Q :]Tu contrôles [c=Ali from Cairo|ARN]Ali ...
...,...,...
82173,A Conspiracy Draft game is a multiplayer game....,Une partie de Draft Conspiracy est une partie ...
82174,"At the start of the game, before decks are shu...","Au début de la partie, avant que les decks ne ..."
82175,Conspiracy cards with hidden agenda are put in...,Les cartes de conspiration avec la capacité d'...
82176,The owner of a conspiracy card is the player w...,Le propriétaire d'une carte de conspiration es...


We can export this as a SQLite database, a CSV file, an Excel file, just in case I want to speed this up.

In [None]:
# prompt: Save or export this latest largest DataFrame df_combined to a SQLite database "all-pairs-from-3-sources.sqlite", to a CSV file "all-pairs-from-3-sources.csv", and to an Excel file "all-pairs-from-3-sources.xlsx", please.

# Export the DataFrame to various formats
df_combined.to_sql("all_pairs", sqlite3.connect("all-pairs-from-3-sources.sqlite"), if_exists="replace", index=False)
df_combined.to_csv("all-pairs-from-3-sources.csv", index=False)
df_combined.to_excel("all-pairs-from-3-sources.xlsx", index=False)

In [None]:
!ls -larth *.sqlite *.csv *.xlsx
len(df_combined)

-rw-r--r-- 1 root root 472M Apr 11 01:13 AllPrintings.sqlite
-rw-r--r-- 1 root root  15M Apr 11 01:17 all-sentences-from-tmx.csv
-rw-r--r-- 1 root root  16M Apr 11 01:18 all-pairs-from-all-the-cards.csv
-rw-r--r-- 1 root root  36M Apr 11 01:34 all-pairs-from-3-sources.sqlite
-rw-r--r-- 1 root root  32M Apr 11 01:34 all-pairs-from-3-sources.csv
-rw-r--r-- 1 root root 9.6M Apr 11 01:35 all-pairs-from-3-sources.xlsx


82178

## Conclusion: we've successfully built a database of 82,000 lines of English/French pairs

We now have about 32 MB of data (82K lines) about Magic: the Gathering, consisting of pairs of original English text and its French translation.

These translations come from three sources:

1. the official translation of the cards' texts, obtained from MTGjson.com (same database as the one on <https://Scryfall.com>),
2. the unofficial translation of fan-written articles about Magic: the Gathering, from the <https://www.cranial-insertion.com/> judges blog,
3. the semi-unofficial translation of the Comprehensive Rules of Magic: the Gathering, from various online sources.

We can also combine the three `df_cranial_sentences, df_mtg, df_rules` manually into a larger DataFrame:

In [20]:
df_combined = pd.concat([df_cranial_sentences, df_mtg, df_rules], ignore_index=True)

# Display basic stats after appending
print("Number of lines in combined DataFrame:", len(df_combined))
print(df_combined.describe())
df_combined

Number of lines in combined DataFrame: 82178
       english french
count    82178  82178
unique   56073  63527
top       [hr]   [hr]
freq      2593   2528


Unnamed: 0,english,french
0,"that mustache.[/cright]Howdy, folks! Welcome b...",Salut les gens ! Bienvenue pour une autre sema...
1,Lots of things would be better. Especially thi...,Beaucoup de choses seraient améliorées. Surtou...
2,So here's the plan. We're going to go back to ...,"Donc, voilà le plan. Nous retournons en 1993, ..."
3,[hr],[hr]
4,[Q]You control [c=Ali from Cairo|ARN]Ali from ...,[Q=Q :]Tu contrôles [c=Ali from Cairo|ARN]Ali ...
...,...,...
82173,A Conspiracy Draft game is a multiplayer game....,Une partie de Draft Conspiracy est une partie ...
82174,"At the start of the game, before decks are shu...","Au début de la partie, avant que les decks ne ..."
82175,Conspiracy cards with hidden agenda are put in...,Les cartes de conspiration avec la capacité d'...
82176,The owner of a conspiracy card is the player w...,Le propriétaire d'une carte de conspiration es...
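
The statistics above show that many lines are repeated, and that the most frequent "sentence" is the bare `[hr]` separator. One possible cleaning step, sketched below on plain Python tuples, drops pairs that are empty or consist only of bracketed markup, and removes exact duplicates while keeping the original order. The helper name `clean_pairs` and the markup pattern are assumptions for illustration; whether such pairs should really be discarded before fine-tuning is a judgment call.

```python
import re

# Matches strings made only of bracketed tags such as "[hr]" or "[hr][center]".
MARKUP_ONLY = re.compile(r"(?:\[[^\]]*\]\s*)+")

def clean_pairs(pairs):
    """Drop empty or markup-only pairs and exact duplicates, preserving order."""
    seen = set()
    kept = []
    for en, fr in pairs:
        en, fr = en.strip(), fr.strip()
        if not en or not fr:
            continue
        if MARKUP_ONLY.fullmatch(en) or MARKUP_ONLY.fullmatch(fr):
            continue  # e.g. "[hr]" on either side carries no text to translate
        if (en, fr) in seen:
            continue
        seen.add((en, fr))
        kept.append((en, fr))
    return kept

# Hypothetical mini-dataset.
raw_pairs = [
    ("[hr]", "[hr]"),
    ("Flying", "Vol"),
    ("Flying", "Vol"),
    ("[Q]Can it attack?[/Q]", "[Q=Q : ]Peut-elle attaquer ?[/Q]"),
]
cleaned = clean_pairs(raw_pairs)
print(len(raw_pairs), len(cleaned))
```

Pairs that contain markup together with real text, like the last example, are kept as they are.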


-----------------------

## Fine-tune [Helsinki-NLP/opus-mt-tc-big-en-fr](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-fr) on these data

I'm going to fine-tune the model I have been using for a few days: <https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-fr>.
It specializes in fast translation of non-technical English into French.
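
Before fine-tuning, it is common to set aside a small part of the pairs to measure progress on text the model has not seen. Here is a minimal sketch of a reproducible shuffle-and-split on plain Python tuples. The function name `split_pairs`, the 10% validation fraction and the seed are arbitrary assumptions, not settings taken from this notebook.

```python
import random

def split_pairs(pairs, validation_fraction=0.1, seed=42):
    """Shuffle the pairs with a fixed seed and return (train, validation) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical stand-in for the real (english, french) pairs.
demo_split_input = [(f"english {i}", f"french {i}") for i in range(20)]
train_pairs, validation_pairs = split_pairs(demo_split_input)
print(len(train_pairs), len(validation_pairs))
```

Using a fixed seed keeps the split identical from one run to the next, so that scores obtained before and after fine-tuning remain comparable.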

In [21]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

original_model_name = "Helsinki-NLP/opus-mt-tc-big-en-fr"

In [22]:
%%time
tokenizer = AutoTokenizer.from_pretrained(original_model_name)
tokenizer

CPU times: user 705 ms, sys: 40.2 ms, total: 745 ms
Wall time: 2.28 s


MarianTokenizer(name_or_path='Helsinki-NLP/opus-mt-tc-big-en-fr', vocab_size=53017, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	43311: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	50387: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	53016: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [23]:
%%time
model = AutoModelForSeq2SeqLM.from_pretrained(original_model_name)
model

CPU times: user 4.4 s, sys: 1.3 s, total: 5.7 s
Wall time: 5.98 s


MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(53017, 1024, padding_idx=53016)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(53017, 1024, padding_idx=53016)
      (embed_positions): MarianSinusoidalPositionalEmbedding(1024, 1024)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (activation_fn): ReLU()
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1

We can use the pre-trained model to translate some Magic: the Gathering text:

In [24]:
from transformers import pipeline

translator = pipeline('translation', model=model, tokenizer=tokenizer)
def en2fr(text):
    return translator(text)[0]['translation_text']

Device set to use cuda:0


In [25]:
from IPython.display import display, Markdown

def print_md(text):
    display(Markdown(text))

In [26]:
example_english_text = "When this creature enters, create three 4/4 Elephant tokens with trample. Then populate."
print_md(f"Example of English rules text: « {example_english_text} », getting translated:")

example_french_text = en2fr(example_english_text)
print_md(f"==> {example_french_text}")

Example of English rules text: « When this creature enters, create three 4/4 Elephant tokens with trample. Then populate. », getting translated:

==> Lorsque cette créature entre, créez trois jetons Éléphant 4/4 avec piétinement.

That's already not bad (although the final "Then populate." was dropped), but I'm sure we can improve it by fine-tuning this model on MTG card data!

In [27]:
example_english_text = """[a]I'm afraid you'll have to remove all six counters. Bounty of the Hunt tracks how many counters each creature received "this way", which means how many counters it received due to following the instructions of Bounty of the Hunt as modified by any applicable replacement effects. Since each target creature received two counters this way, you'll have to remove two counters from each of them.[/a]"""
print_md(f"Example of English rules text: « {example_english_text} », getting translated:")

example_french_text = en2fr(example_english_text)
print_md(f"==> {example_french_text}")

Example of English rules text: « [a]I'm afraid you'll have to remove all six counters. Bounty of the Hunt tracks how many counters each creature received "this way", which means how many counters it received due to following the instructions of Bounty of the Hunt as modified by any applicable replacement effects. Since each target creature received two counters this way, you'll have to remove two counters from each of them.[/a] », getting translated:

==> [a]J'ai peur que vous deviez retirer les six marqueurs. Bounty of the Hunt suit le nombre de marqueurs que chaque créature a reçu "de cette façon", ce qui signifie le nombre de marqueurs qu'elle a reçu en suivant les instructions de Bounty of the Hunt telles que modifiées par les effets de remplacement applicables. Puisque chaque créature cible a reçu deux marqueurs de cette façon, vous devrez retirer deux marqueurs de chacun d'eux.[/a]

Let's quickly compare what the initial version of the translation model can do on some of our specialized text:

In [28]:
number_of_random_samples = 3

for idx, (name, df) in enumerate(zip(
        ["Cranial Insertion EN/FR", "MTG Cards Text EN/FR", "MTG Comprehensive Rules EN/FR"],
        [df_cranial_sentences, df_mtg, df_rules]
    )):
    print_md(f"\n\n## §{idx+1}. For some sentences from the « {name} » part of our database\n")

    # Get some random lines from this df
    some_random_lines = df.sample(n=number_of_random_samples)
    some_random_lines = some_random_lines.values.tolist()

    # Display the some random lines
    for line in some_random_lines:
        print_md(f"#### 1. In English:\n{line[0]}")
        print_md(f"#### 2. En français :\n{line[1]}")
        translation_from_model = en2fr(line[0])
        print_md(f"#### 3. Et en français traduit par le modèle initial :\n{translation_from_model}")



## §1. For some sentences from the « Cranial Insertion EN/FR » part of our database


#### 1. In English:
[Q] Is [c]Notion Thief[/c] really awesome with [c]Duskmantle Seer[/c] or not?[/Q]

#### 2. En français :
[Q=Q : ][c=Notion Thief]Voleur de notion[/c] est-il vraiment top avec [c=Duskmantle Seer]Voyant de Manteaubrune[/c] ?[/Q]

#### 3. Et en français traduit par le modèle initial :
[Q] Est-ce que [c]Notion Thief[/c] est vraiment génial avec [c]Duskmantle Seer[/c] ou pas ?[/Q]

#### 1. In English:
[A]Maybe -- when a creature changes controllers, it gets a fresh case of summoning sickness. So most creatures won't be able to attack right away, but ones with haste will. And if you wait a turn for summoning sickness to wear off, you'll be able to attack either way.[/A]

#### 2. En français :
[A=R : ]Mai ça dépend ! Quand une créature change de contrôleur, elle devient affectée par le mal d'invocation. Donc la plupart des créatures ne seront pas capables d'attaquer dans l'immédiat, mais certaines qui ont la célérité, le pourront elles. Si tu attends un tour, afin que le mal d'invocation cesse, tu seras alors en mesure d'attaquer.[/A]

#### 3. Et en français traduit par le modèle initial :
[A] Peut-être que quand une créature change de contrôleur, elle reçoit un nouveau cas de maladie d'invocation. Ainsi, la plupart des créatures ne pourront pas attaquer tout de suite, mais elles le feront avec hâte. Et si vous attendez un tour pour que la maladie d'invocation s'estompe, vous pourrez attaquer de toute façon.[/A]

#### 1. In English:
[center][size=1]Ach, Hans, run!

#### 2. En français :
[center][size=1]Argh, Hans, cours !

#### 3. Et en français traduit par le modèle initial :
[centre][taille1]Ach, Hans, courez !



## §2. For some sentences from the « MTG Cards Text EN/FR » part of our database


#### 1. In English:
As this creature enters, choose odd or even. (Zero is even.)\nThis creature has protection from each mana value of the chosen quality.

#### 2. En français :
Quand l'Aventurière de Bourg-sur-Lave arrive sur le champ de bataille, choisissez pair ou impair. (Zéro est pair.)\nL'Aventurière de Bourg-sur-Lave a la protection contre chaque coût converti de mana de la valeur choisie.

#### 3. Et en français traduit par le modèle initial :
Lorsque cette créature entre, choisissez impair ou pair. (Zéro est pair.)nCette créature a la protection de chaque valeur de mana de la qualité choisie.

#### 1. In English:
{2}{W}, {T}, Sacrifice this artifact: Destroy target artifact or enchantment.

#### 2. En français :
{2}{W}, {T}, sacrifiez la Capsule du dissipateur : Détruisez l'artefact ou l'enchantement ciblé.

#### 3. Et en français traduit par le modèle initial :
Sacrifiez cet artefact : Détruisez l'artefact ou l'enchantement ciblé.

#### 1. In English:
Equipped creature has haste and shroud. (It can't be the target of spells or abilities.)\nEquip {0}

#### 2. En français :
La créature équipée a la célérité et le linceul. (Elle ne peut pas être la cible de sorts ou de capacités.)Équipement {0}

#### 3. Et en français traduit par le modèle initial :
La créature équipée a la hâte et le linceul. (Elle ne peut pas être la cible de sorts ou de capacités.)



## §3. For some sentences from the « MTG Comprehensive Rules EN/FR » part of our database


#### 1. In English:
The infect rules function no matter what zone an object with infect deals damage from.

#### 2. En français :
Les règles de l'infection fonctionnent quelle que soit la zone depuis laquelle un objet avec l'infection inflige des blessures.

#### 3. Et en français traduit par le modèle initial :
Les règles d'infection fonctionnent quelle que soit la zone dans laquelle un objet infecté inflige des dégâts.

#### 1. In English:
If an object would move from one zone to another, determine what event is moving the object. If the object is moving to a public zone and its owner will be able to look at it in that zone, its owner looks at it to see if it has any abilities that would affect the move. If the object is moving to the battlefield, each other player who will be able to look at it in that zone does so. Then any appropriate replacement effects, whether they come from that object or from elsewhere, are applied to that event. If any effects or rules try to do two or more contradictory or mutually exclusive things to a particular object, that object’s controller—or its owner if it has no controller—chooses which effect to apply, and what that effect does. (Note that multiple instances of the same thing may be mutually exclusive; for example, two simultaneous “destroy” effects.) Then the event moves the object. Example: [[Exquisite Archangel]] has an ability which reads “If you would lose the game, instead exile this creature and your life total becomes equal to your starting life total.” A spell deals 5 damage to a player with 5 life and 5 damage to an [[Exquisite Archangel]] under that player’s control. As state-based actions are performed, that player’s life total becomes equal to their starting life total, and that player chooses whether [[Exquisite Archangel]] moves to its owner’s graveyard or to exile.

#### 2. En français :
Si un objet devait changer de zone, déterminez quel est l'événement qui le déplace. Si l'objet se déplace vers une zone publique et que son propriétaire peut le voir dans cette zone, il le fait et regarde si l'objet a des capacités qui affecteraient son mouvement. Si l'objet se déplace vers le champ de bataille, chaque autre joueur qui pourrait le voir dans cette zone le fait. Ensuite, tous les effets de remplacement, qu'ils proviennent de l'objet ou non, sont appliqués à cet événement. Si des effets ou des règles tentent d'effectuer des actions contradictoires ou mutuellement exclusives sur un objet, le contrôleur de l'objet (ou, à défaut, son propriétaire) choisit quel effet s'applique, et ce que fait cet effet. (Notez que deux occurrences d'une action identique peuvent être mutuellement exclusives, comme par exemple deux effets de destruction.) Ensuite, l'événement déplace l'objet. Exemple : l'[[Archange admirable]] a une capacité qui dit « Si vous deviez perdre la partie, à la place exilez l'[[Archange admirable]] et votre total de points de vie devient égal à votre total de points de vie de départ. » Un sort inflige 5 blessures à un joueur avec 5 points de vie et 5 blessures à l'[[Archange admirable]] qu'il contrôle. Quand on applique les actions basées sur un état, le total de points de vie de ce joueur revient à celui de départ, et ce joueur choisit s'il exile ou met dans le cimetière de son propriétaire l'[[Archange admirable]].

#### 3. Et en français traduit par le modèle initial :
Si un objet se déplace d’une zone à une autre, déterminez quel événement déplace l’objet. Si l’objet se déplace vers une zone publique et que son propriétaire sera capable de le regarder dans cette zone, son propriétaire le regarde pour voir s’il a des capacités qui affecteraient le mouvement. Si l’objet se déplace vers le champ de bataille, chaque autre joueur qui sera capable de le regarder dans cette zone le fera.

#### 1. In English:
Free-for-All Variant

#### 2. En français :
Option chacun pour soi


#### 3. Et en français traduit par le modèle initial :
Variante libre-pour-tous

I will try to fine-tune the model on the three databases first, and come back and change this choice if needed.

### Formatting the data for training

In [29]:
use_cranial = True
use_mtg     = True
use_rules   = True

In [30]:
# prompt: df_combined should be a pd.concat of df_cranial_sentences if use_cranial, of df_mtg if use_mtg, and of df_rules if use_rules
import pandas as pd

# Collect only the enabled DataFrames, then concatenate them once
frames = []
if use_cranial:
    frames.append(df_cranial_sentences)
if use_mtg:
    frames.append(df_mtg)
if use_rules:
    frames.append(df_rules)

# Fall back to an empty DataFrame if every source is disabled
df_combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

In [31]:
# Display basic stats after appending
print("Number of lines in combined DataFrame:", len(df_combined))
print(df_combined.describe())

Number of lines in combined DataFrame: 82178
       english french
count    82178  82178
unique   56073  63527
top       [hr]   [hr]
freq      2593   2528


In [32]:
# prompt: convert the large DataFrame df_combined to a simple list of pairs
# Assuming df_combined is already defined from the previous code
# Convert the DataFrame to a list of pairs
text_pairs_list = df_combined[['english', 'french']].values.tolist()

In [33]:
# Print the first pairs for verification
print_md(text_pairs_list[0][0])
print_md(text_pairs_list[0][1])

that mustache.[/cright]Howdy, folks! Welcome back to another action-packed week of Cranial Insertion. This one's a little more packed than usual. The writers are off for a week, so you're stuck with me again. I'm Moko, the zombie mail-sorting chimpanzee and Vintage aficionado, and also the only one on the team small enough to crawl through air ducts to steal top-secret prototypes. We're celebrating [b]Magic: The Gathering[/b]'s twentieth anniversary this week - the very first release was on August 5th, 1993. What better way to celebrate than by hopping into a stolen [c]Time Machine[/c] and heading back through [b]Magic[/b]'s past, looking at rules questions through the ages?

Salut les gens ! Bienvenue pour une autre semaine de Cranial Insertion, bourrée d’action. Celle là est un peu plus dense que d’habitude. Les auteurs sont au repos pour une semaine, donc vous êtes coincés avec moi une fois de plus ! Je suis Moko, le Chimpanzé-Zombie trieur de mails, et fan incontournable du Vintage, et également le seul de l’équipe assez petit pour ramper dans les conduits d’aération pour voler des prototypes ultra-secrets. Nous fêtons cette semaine le vingtième anniversaire de [b]Magic : l’Assemblée[/b] – la toute première sortie était en fait le 05 Août 1993. Quelle meilleure façon de fêter cela qu’en sautant dans une  [c=Time Machine]Machine à remonter le temps[/c] volée, et de rentrer dans le passé de [b]Magic[/b] en regardant des questions de règles à travers les âges ?

Now that we have all these text pairs, we can use them to generate three datasets: one for training, one for validation, and one for testing.

In [34]:
source = [ text_pair[0] for text_pair in text_pairs_list ]
print(f"Length of the sources = {len(source)}")
target = [ text_pair[1] for text_pair in text_pairs_list ]
print(f"Length of the targets = {len(target)}")

Length of the sources = 82178
Length of the targets = 82178


Let's use, as usual, 80% for training, and split the remaining 20% evenly between validation and testing (10% each).
The rest is generic code adapted from this StackOverflow question <https://stackoverflow.com/questions/75970711/how-to-properly-fine-tune-translational-transformer-models>:

In [35]:
assert(len(source) == len(target))
len(source), len(target)

(82178, 82178)

In [36]:
%%time
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split

# Combine your source and target lists into a list of dictionaries
raw_data = [{'en': s, 'fr': t} for s, t in zip(source, target)]

# Split the combined data into training and a temporary test+validation set
train_raw, temp_raw = train_test_split(raw_data, test_size=0.2, random_state=42) # Adjust test_size as needed

# Split the temporary set into validation and test sets
val_raw, test_raw = train_test_split(temp_raw, test_size=0.5, random_state=42) # 0.5 of the temp set is 10% of the original

# Create Dataset objects from the raw data dictionaries
train_dataset = Dataset.from_list([{'translation': item} for item in train_raw])
val_dataset = Dataset.from_list([{'translation': item} for item in val_raw])
test_dataset = Dataset.from_list([{'translation': item} for item in test_raw])

# Create the DatasetDict
raw_datasets = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 65742
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 8218
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 8218
    })
})
CPU times: user 402 ms, sys: 175 ms, total: 577 ms
Wall time: 576 ms
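
To double-check the split sizes printed above, here is a small sketch of the arithmetic. It assumes (this is my reading of scikit-learn's behaviour, not something checked in this notebook) that when only `test_size` is given as a float, the held-out part is rounded up and the other part gets the remainder. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, holdout_fraction=0.2, test_share_of_holdout=0.5):
    """Return (train, validation, test) sizes for the two-stage split.

    Assumption: the held-out part is rounded up (ceil), and the other
    part simply receives whatever remains.
    """
    # First split: train vs. temporary (validation + test) set
    holdout = math.ceil(holdout_fraction * n_samples)
    train = n_samples - holdout
    # Second split: validation vs. test, inside the temporary set
    test = math.ceil(test_share_of_holdout * holdout)
    validation = holdout - test
    return train, validation, test

print(split_sizes(82178))  # (65742, 8218, 8218)
```

These are the same three numbers as in the `DatasetDict` above, so the split is indeed 80% / 10% / 10%.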


The maximum length should be neither too high (it costs memory and time) nor too low (long examples would get truncated):

In [41]:
max_length = 256  # shouldn't be too high
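
One way to sanity-check a candidate `max_length` is to measure which fraction of the examples it would truncate. The sketch below is only an illustration: the helper `truncation_rate` and the token counts are made up, and in practice the lengths would come from tokenizing the real sentences.

```python
def truncation_rate(token_lengths, max_length):
    """Fraction of examples strictly longer than max_length (so, truncated)."""
    if not token_lengths:
        return 0.0
    too_long = sum(1 for n in token_lengths if n > max_length)
    return too_long / len(token_lengths)

# Hypothetical token counts, NOT measured on the real dataset
lengths = [12, 40, 75, 130, 260, 512, 18, 33]

for candidate in (64, 128, 256, 512):
    print(candidate, truncation_rate(lengths, candidate))
```

A value that truncates only a few percent of the examples is usually a reasonable trade-off between memory use and lost text.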

This was the previous code. I'll keep it for a bit, until the fine-tuning (additional training) of at least one model works.

In [37]:
%%time
"""
    from sklearn.model_selection import train_test_split

    max_length = 256  # shouldn't be too high
    test_size = 0.20

    X_train, X_val, y_train, y_val = train_test_split(source, target, test_size=test_size)
    X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    y_train_tokenized = tokenizer(y_train, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    X_val_tokenized   = tokenizer(X_val, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    y_val_tokenized   = tokenizer(y_val, padding=True, truncation=True, max_length=max_length, return_tensors="pt")

    import torch
    class ForDataset(torch.utils.data.Dataset):
        def __init__(self, inputs, targets):
            self.inputs = inputs
            self.targets = targets

        def __len__(self):
            return len(self.targets)

        def __getitem__(self, index):
            input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
            target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()

            return {"input_ids": input_ids, "labels": target_ids}

    train_dataset = ForDataset(X_train_tokenized, y_train_tokenized)
    test_dataset  = ForDataset(X_val_tokenized, y_val_tokenized)
"""

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 10.3 µs



We can quickly explore the three splits of `raw_datasets`:

In [42]:
print(raw_datasets)

from pprint import pprint
pprint(raw_datasets["train"][0])
pprint(raw_datasets["validation"][0])
pprint(raw_datasets["test"][0])

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 65742
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 8218
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 8218
    })
})
{'translation': {'en': '[hr]', 'fr': '[hr]'}}
{'translation': {'en': '[cright=Gideon Jura]"No! Not that! Not '
                       '[i]planeswalkers![/i]"[/cright][Q]I block with a morph '
                       'and flip it face-up. Does my opponent have a chance to '
                       'respond before damage (provided no flip trigger)?[/Q]',
                 'fr': '[cright=Gideon Jura]"Non ! Non pas ça !'}}
{'translation': {'en': "Haste\\n{1}{R}, {T}: Target creature can't block this "
                       'turn.',
                 'fr': 'Célérité\\n{1}{R}, {T} : Une créature ciblée ne peut '
                       'pas bloquer ce tour-ci.'}}


That should be satisfying!

> A `DataCollator` in the Hugging Face `transformers` library is a utility class that prepares batches of data for training or evaluation.  It takes a list of examples (dictionaries containing features like input IDs, attention masks, labels, etc.) and combines them into a single batch.  Crucially, it handles the padding and tensorization of these examples, ensuring that all examples in a batch have the same sequence length. This is essential because models require inputs of consistent shape.
>
> Different `DataCollator` classes exist depending on the specific task and model.  For example, there's a `DataCollatorWithPadding`, which is commonly used.  It pads shorter sequences to the maximum length within the batch (dynamic padding), which is more efficient than padding all sequences to the maximum length in the dataset.
>
> In essence, the `DataCollator` simplifies the process of creating training batches and ensures that the input format is correct for the model, thus saving you the effort of manually handling these details.

In [43]:
# DataCollator
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, max_length=max_length, return_tensors="pt")
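
To make the idea of dynamic padding concrete, here is a pure-Python sketch of what a seq2seq collator roughly does to one batch. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and it assumes right-side padding. The one detail taken from `DataCollatorForSeq2Seq` is that labels are padded with `-100`, so that padded positions are ignored by the loss.

```python
def pad_batch(examples, pad_token_id, label_pad_token_id=-100):
    """Pad every example to the longest sequence of this batch only."""
    max_inputs = max(len(ex["input_ids"]) for ex in examples)
    max_labels = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_in = len(ex["input_ids"])
        n_lab = len(ex["labels"])
        # Inputs are padded with the tokenizer's pad id...
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * (max_inputs - n_in))
        # ...and the attention mask marks real tokens (1) vs. padding (0)
        batch["attention_mask"].append([1] * n_in + [0] * (max_inputs - n_in))
        # Labels are padded with -100 so the loss skips these positions
        batch["labels"].append(ex["labels"] + [label_pad_token_id] * (max_labels - n_lab))
    return batch

# Made-up token ids, for illustration only
examples = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10, 11]},
]
batch = pad_batch(examples, pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [5, 0, 0]]
print(batch["labels"])     # [[8, 9, -100, -100], [8, 9, 10, 11]]
```

The real collator additionally converts these lists to tensors (`return_tensors="pt"` in the cell above).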

## Finish preprocessing the dataset

In [44]:
max_input_length = max_length
max_target_length = max_length
source_lang = "en"
target_lang = "fr"

def preprocess_function(examples):
    inputs  = [ ex[source_lang] for ex in examples["translation"] ]
    targets = [ ex[target_lang] for ex in examples["translation"] ]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets, FIXED: this was something I wasn't doing at first!
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [45]:
%%time
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/65742 [00:00<?, ? examples/s]



Map:   0%|          | 0/8218 [00:00<?, ? examples/s]

Map:   0%|          | 0/8218 [00:00<?, ? examples/s]

CPU times: user 50.2 s, sys: 190 ms, total: 50.4 s
Wall time: 50.2 s


In [46]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 65742
    })
    validation: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8218
    })
    test: Dataset({
        features: ['translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8218
    })
})

### Metric

In [47]:
!pip install evaluate numpy sacrebleu



We can use the "SacreBLEU" metric. I didn't try to understand that. Gemini helped for the code, as well as the tutorials from HuggingFace.

In [48]:
import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]
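
For intuition, here is a simplified, sentence-level sketch of the BLEU idea: a geometric mean of n-gram precisions (n = 1 to 4), multiplied by a brevity penalty for predictions shorter than the reference. This is not what `sacrebleu` computes exactly (as far as I understand, it works at the corpus level, with its own tokenization and optional smoothing); `simple_bleu` is a hypothetical helper that only splits on whitespace.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(prediction, reference, max_n=4):
    """Simplified BLEU in [0, 100] for one prediction and one reference."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts = ngram_counts(pred, n)
        ref_counts = ngram_counts(ref, n)
        # "Clipped" matches: an n-gram counts at most as often as in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: punish predictions shorter than the reference
    brevity = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)

print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))
print(simple_bleu("the cat sat on the mat", "the cat sat on a mat"))
```

A perfect match scores 100, and the score drops quickly as soon as a few words differ, because every changed word breaks several 2-grams, 3-grams and 4-grams at once.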

Some post-processing and the computation of the metrics we want to track while training our model:

In [49]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)

    result = {k: round(v, 4) for k, v in result.items()}
    return result
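
Two small steps in `compute_metrics` are easy to miss, so here is a pure-Python sketch of them (the notebook itself uses NumPy; the helper names and token ids below are made up). First, the `-100` values used to mask padded label positions are not valid token ids, so they are swapped back to the pad id before decoding. Second, `gen_len` is simply the number of non-padding tokens in each prediction.

```python
IGNORE_INDEX = -100  # value used to mask padded label positions

def restore_pad_ids(label_rows, pad_token_id):
    """Replace -100 by the pad id, so the rows can be decoded as text."""
    return [[pad_token_id if tok == IGNORE_INDEX else tok for tok in row]
            for row in label_rows]

def generated_lengths(pred_rows, pad_token_id):
    """Number of non-padding tokens in each predicted row."""
    return [sum(1 for tok in row if tok != pad_token_id) for row in pred_rows]

# Made-up token ids, with 0 playing the role of the pad id
labels = [[8, 9, -100, -100], [8, 9, 10, 11]]
preds = [[8, 9, 0, 0], [8, 7, 10, 0]]

print(restore_pad_ids(labels, pad_token_id=0))   # [[8, 9, 0, 0], [8, 9, 10, 11]]
print(generated_lengths(preds, pad_token_id=0))  # [2, 3]
```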

### Training (ie. fine-tuning)

I preferred to disable the integration with Weights and Biases (wandb), using this code:

In [50]:
import os
os.environ["WANDB_DISABLED"] = "true"

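Let's add callbacks to regularly save the model to [my HuggingFace profile (@Naereen)](https://huggingface.co/Naereen).

⚠ It wasn't working, so I've disabled them (the `callbacks=` argument of the trainer is commented out further down). Note that `transformers.keras_callbacks` targets Keras training loops, whereas the `Seq2SeqTrainer` used here is PyTorch-based, which probably explains the failure.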

In [None]:
# from transformers.keras_callbacks import KerasMetricCallback
# metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=test_dataset)

from transformers.keras_callbacks import PushToHubCallback
push_to_hub_callback = PushToHubCallback(
    output_dir="./english-to-french-translation-for-Magic-the-Gathering",
    tokenizer=tokenizer,
)

callbacks = [
    # metric_callback,
    push_to_hub_callback,
]

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/Naereen/english-to-french-translation-for-Magic-the-Gathering into local empty directory.



We are ready to define our `Seq2SeqTrainingArguments` and our `Seq2SeqTrainer`.
I've kept the hyper-parameters from the StackOverflow post I took inspiration from <https://stackoverflow.com/questions/75970711/how-to-properly-fine-tune-translational-transformer-models>, and I also followed the fine-tuning chapter of HuggingFace's LLM course <https://huggingface.co/learn/llm-course/en/chapter7/4#fine-tuning-the-model>.

In [51]:
# prompt: jupyter widget to select the value of num_train_epochs as an integer, default to 5

import ipywidgets as widgets
from IPython.display import display

# Create a slider widget
num_train_epochs_slider = widgets.IntSlider(
    value=3,
    min=1,
    max=20,
    step=1,
    description='num_train_epochs:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

# Display the widget
display(num_train_epochs_slider)

IntSlider(value=3, description='num_train_epochs:', max=20, min=1)

In [52]:
# Access the selected value using num_train_epochs_slider.value
num_train_epochs = num_train_epochs_slider.value
print(f"Selected num_train_epochs: {num_train_epochs}")

Selected num_train_epochs: 3


This was just a fancy thing. Let's get down to business!

In [53]:
%%time
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
# from transformers import TrainingArguments, Trainer

training_args = Seq2SeqTrainingArguments(
# training_args = TrainingArguments(

    output_dir="english-to-french-translation-for-Magic-the-Gathering",

    eval_strategy="epoch",

    save_strategy="epoch",

    learning_rate=3e-5,

    per_device_train_batch_size=32,

    per_device_eval_batch_size=64,

    weight_decay=0.01,

    save_total_limit=3,

    num_train_epochs=num_train_epochs,

    predict_with_generate=True,

    load_best_model_at_end=True,

    report_to="none",  # Disable all integrations,

    fp16=True,

    push_to_hub=True,

    warmup_steps=500,  # Adjust this value

    lr_scheduler_type="linear",

    seed=42,  # for reproducibility

    # Or your determined value
    #max_source_length=max_length,
    #max_target_length=max_length,

)

trainer = Seq2SeqTrainer(
# trainer = Trainer(
    model=model,

    args=training_args,

    train_dataset=tokenized_datasets["train"],

    eval_dataset=tokenized_datasets["validation"],

    tokenizer=tokenizer,

    data_collator=data_collator,

    compute_metrics=compute_metrics,

    # callbacks=callbacks,
)

CPU times: user 1.08 s, sys: 115 ms, total: 1.19 s
Wall time: 1.48 s
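
To see what `warmup_steps=500` and `lr_scheduler_type="linear"` mean in practice, here is a sketch of the learning rate as a function of the optimizer step: a linear ramp from 0 up to `learning_rate`, then a linear decay back to 0. The step counts assume a single GPU, no gradient accumulation, and a last partial batch that is kept; `linear_schedule_lr` is a hypothetical helper, not the library function.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# 65742 training pairs, batches of 32, 3 epochs (values from the cells above)
steps_per_epoch = math.ceil(65742 / 32)
total_steps = steps_per_epoch * 3
print(steps_per_epoch, total_steps)  # 2055 6165

for step in (0, 250, 500, 3000, total_steps):
    print(step, linear_schedule_lr(step, base_lr=3e-5, warmup_steps=500, total_steps=total_steps))
```

With these numbers, the warmup covers roughly the first quarter of the first epoch, and the learning rate reaches exactly 0 at the very last step.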




In [54]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

We should now be able to train (i.e. fine-tune) our model over a certain number of epochs.
It should take quite some time on a free T4 GPU in Google Colab.

Before we run the training, let's evaluate the model on both its validation dataset and its test dataset (the test part will *not* be used at all during training).

In [55]:
%%time
evaluation_before_finetuning = trainer.evaluate(max_length=max_length)

from pprint import pprint
pprint(evaluation_before_finetuning)

{'eval_bleu': 39.2849,
 'eval_gen_len': 51.8153,
 'eval_loss': 1.651272177696228,
 'eval_model_preparation_time': 0.0033,
 'eval_runtime': 301.7038,
 'eval_samples_per_second': 27.239,
 'eval_steps_per_second': 0.428}
CPU times: user 5min 3s, sys: 812 ms, total: 5min 4s
Wall time: 5min 1s


In [56]:
import pandas as pd
pd.DataFrame([evaluation_before_finetuning])

| | eval_loss | eval_model_preparation_time | eval_bleu | eval_gen_len | eval_runtime | eval_samples_per_second | eval_steps_per_second |
|---|---|---|---|---|---|---|---|
| 0 | 1.651272 | 0.0033 | 39.2849 | 51.8153 | 301.7038 | 27.239 | 0.428 |


> A BLEU score of 39.3 is not too bad, which reflects the fact that our model is already good at translating English sentences to French ones.
> (Sentence adapted from <https://huggingface.co/learn/llm-course/en/chapter7/4#fine-tuning-the-model-with-the-trainer-api>; I had no idea what BLEU was until an hour ago.)

I was getting about 46 on the dataset *without the articles from Cranial Insertion*; this time they are included, and the score is about 39.

### Testing the current performance (mainly, the BLEU metric) on the 10% "test" split of the tokenized dataset

In [57]:
%%time

evaluation_before_finetuning_on_testdataset = trainer.evaluate(tokenized_datasets["test"], max_length=max_length)

# pprint(evaluation_before_finetuning_on_testdataset)
pd.DataFrame([evaluation_before_finetuning_on_testdataset])

CPU times: user 5min 2s, sys: 403 ms, total: 5min 2s
Wall time: 4min 58s


| | eval_loss | eval_model_preparation_time | eval_bleu | eval_gen_len | eval_runtime | eval_samples_per_second | eval_steps_per_second |
|---|---|---|---|---|---|---|---|
| 0 | 1.64886 | 0.0033 | 39.5565 | 51.4305 | 298.8335 | 27.5 | 0.432 |


The BLEU score on the « test » split (39.6) is very close to the one on the « validation » split (39.3).

In [62]:
pd.DataFrame([
    evaluation_before_finetuning,
    evaluation_before_finetuning_on_testdataset,
])

| | eval_loss | eval_model_preparation_time | eval_bleu | eval_gen_len | eval_runtime | eval_samples_per_second | eval_steps_per_second |
|---|---|---|---|---|---|---|---|
| 0 | 1.651272 | 0.0033 | 39.2849 | 51.8153 | 301.7038 | 27.239 | 0.428 |
| 1 | 1.64886 | 0.0033 | 39.5565 | 51.4305 | 298.8335 | 27.5 | 0.432 |


We have very similar results: the BLEU metric is about 39 on both the validation and the test corpus, which each consist of 8218 pairs of English and French sentences.

In [63]:
len(tokenized_datasets["validation"]), len(tokenized_datasets["test"])

(8218, 8218)

## Let's go! We are ready for the fine-tuning

In [64]:
# prompt: string of current date in standard full format
import datetime

def current_date_string():
    now = datetime.datetime.now()
    return now.strftime("%Y-%m-%d__%H:%M:%S")

print(current_date_string())

2025-04-11__05:35:33


In [65]:
print_md(f"""## Starting to train for {num_train_epochs} epochs...
This should take {round(num_train_epochs * 3.5, 3)} minutes? (based on a rough estimate)""")

## Starting to train for 3 epochs...
This should take 10.5 minutes? (based on a rough estimate)

In [None]:
%%time
before = current_date_string()
trainer.train()
after = current_date_string()

url_of_commit = trainer.push_to_hub(tags="translation",
                                    commit_message=f"Training complete (started at {before}, finished at {after})")

trainer.save_model('final_model')
url_of_commit

Epoch,Training Loss,Validation Loss


Now that the fine-tuning has finished, let's evaluate the gain we obtained on our corpus, in terms of performance (mainly, the BLEU metric).

We can check the metrics from before and now after the fine-tuning:

In [None]:
%%time
from IPython.display import display

print_md("## Before training, performance on the « validation » dataset ...")
# display() is needed here: only the last expression of a cell is shown automatically
display(pd.DataFrame([evaluation_before_finetuning]))

print_md("## After training, performance on the « validation » dataset ...")
evaluation_after_finetuning = trainer.evaluate(tokenized_datasets["validation"], max_length=max_length)

pd.DataFrame([evaluation_after_finetuning])

{'eval_bleu': 15.936,
 'eval_gen_len': 255.0,
 'eval_loss': 7.785130023956299,
 'eval_model_preparation_time': 0.0029,
 'eval_runtime': 3.8141,
 'eval_samples_per_second': 0.524,
 'eval_steps_per_second': 0.262}


{'epoch': 3.0,
 'eval_bleu': 15.936,
 'eval_gen_len': 255.0,
 'eval_loss': 7.785130023956299,
 'eval_model_preparation_time': 0.0029,
 'eval_runtime': 7.0326,
 'eval_samples_per_second': 0.284,
 'eval_steps_per_second': 0.142}


In [None]:
%%time
from IPython.display import display

print_md("## Before training, performance on the « test » dataset ...")
# display() is needed here: only the last expression of a cell is shown automatically
display(pd.DataFrame([evaluation_before_finetuning_on_testdataset]))

print_md("## After training, performance on the « test » dataset ...")
evaluation_after_finetuning_on_testdataset = trainer.evaluate(tokenized_datasets["test"], max_length=max_length)

pd.DataFrame([evaluation_after_finetuning_on_testdataset])

| | eval_loss | eval_model_preparation_time | eval_bleu | eval_gen_len | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch |
|---|---|---|---|---|---|---|---|---|
| 0 | 7.78513 | 0.0029 | 15.936 | 255.0 | 3.8141 | 0.524 | 0.262 | |
| 1 | 7.78513 | 0.0029 | 15.936 | 255.0 | 7.0326 | 0.284 | 0.142 | 3.0 |


Hopefully, this fine-tuning improved the BLEU metric on both the « validation » and « test » datasets.

-----

## Testing manually the final fine-tuned model, on some sentences

In [None]:
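from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# Reload the fine-tuned model and its tokenizer from the local directory saved above
final_model     = AutoModelForSeq2SeqLM.from_pretrained("./final_model")
final_tokenizer = AutoTokenizer.from_pretrained("./final_model")

final_translator = pipeline("translation", model=final_model, tokenizer=final_tokenizer)

def final_en2fr(text):
    """Translate an English text to French with the fine-tuned model."""
    return final_translator(text)[0]['translation_text']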

Device set to use cuda:0


Let's try the first two examples again:

In [None]:
%%time
example_english_text = "When this creature enters, create three 4/4 Elephant tokens with trample, then populate."
print_md(f"Example of English rules text: « {example_english_text} », getting translated:")

example_french_text = en2fr(example_english_text)
print_md(f"Initial translation ==> {example_french_text}")

example_french_text_final = final_en2fr(example_english_text)
print_md(f"Second translation ==> {example_french_text_final}")

Example of English rules text: « When this creature enters, create three 4/4 Elephant tokens with trample, then populate. », getting translated:

==> Lorsque cette créature entre, créez trois jetons Éléphant 4/4 avec piétinement, puis peuplez.

CPU times: user 316 ms, sys: 1.2 ms, total: 317 ms
Wall time: 377 ms


In [None]:
%%time
example_english_text = """[a]I'm afraid you'll have to remove all six counters. Bounty of the Hunt tracks how many counters each creature received "this way", which means how many counters it received due to following the instructions of Bounty of the Hunt as modified by any applicable replacement effects. Since each target creature received two counters this way, you'll have to remove two counters from each of them.[/a]"""
print_md(f"Example of English rules text: « {example_english_text} », getting translated:")

example_french_text = en2fr(example_english_text)
print_md(f"Initial translation ==> {example_french_text}")

example_french_text_final = final_en2fr(example_english_text)
print_md(f"Second translation ==> {example_french_text_final}")

Example of English rules text: « [a]I'm afraid you'll have to remove all six counters. Bounty of the Hunt tracks how many counters each creature received "this way", which means how many counters it received due to following the instructions of Bounty of the Hunt as modified by any applicable replacement effects. Since each target creature received two counters this way, you'll have to remove two counters from each of them.[/a] », getting translated:

==> [a]J'ai peur que vous deviez retirer les six marqueurs. Bounty of the Hunt suit le nombre de marqueurs que chaque créature a reçu "de cette façon", ce qui signifie le nombre de marqueurs qu'elle a reçu en suivant les instructions de Bounty of the Hunt telles que modifiées par les effets de remplacement applicables. Puisque chaque créature cible a reçu deux marqueurs de cette façon, vous devrez retirer deux marqueurs de chacun d'eux.[/a]

CPU times: user 1.21 s, sys: 0 ns, total: 1.21 s
Wall time: 1.29 s


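OK, so the fine-tuning ran to completion, but we do not observe any change in the translations produced by this model, which matches the identical `eval_loss` and `eval_bleu` values in the evaluation table above...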
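
To make this observation less anecdotal than two hand-picked examples, here is a minimal sketch that runs two translators on the same sentences and counts how many outputs are strictly identical. The helpers `compare_translators` and `count_identical` are hypothetical names, and the two `fake_*` functions are stand-ins used only so that the cell runs on its own. It uses `difflib.SequenceMatcher` from the standard library to give a rough similarity score when the outputs differ.

```python
from difflib import SequenceMatcher

def compare_translators(sentences, translate_before, translate_after):
    """Translate each sentence with both callables and report how much the outputs differ."""
    report = []
    for sentence in sentences:
        before = translate_before(sentence)
        after = translate_after(sentence)
        # ratio() is 1.0 for identical strings and decreases as they diverge.
        similarity = SequenceMatcher(None, before, after).ratio()
        report.append({
            "source": sentence,
            "identical": before == after,
            "similarity": similarity,
        })
    return report

def count_identical(report):
    """Number of sentences for which both translators returned exactly the same string."""
    return sum(1 for row in report if row["identical"])

# Stand-in translators, for demonstration only (they do not translate anything).
def fake_before(text):
    return text.upper()

def fake_after(text):
    return text.upper() if "trample" in text else text.lower()

demo_report = compare_translators(
    ["creature with trample", "Draw a card"], fake_before, fake_after
)
print(f"{count_identical(demo_report)} / {len(demo_report)} translations are identical")
```

In this notebook, one would pass `en2fr` and `final_en2fr` instead of the stand-ins, on a sample of English sentences from the dataset. If every output is identical, that suggests the saved weights barely differ from the initial ones.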

## Compare the translations before and after fine-tuning

In [None]:
number_of_random_samples = 3

for idx, (name, df) in enumerate(zip(
        ["Cranial Insertion EN/FR", "MTG Cards Text EN/FR", "MTG Comprehensive Rules EN/FR"],
        [df_cranial_sentences, df_mtg, df_rules]
    )):
    print_md(f"\n\n## §{idx+1}. For some sentences from the « {name} » part of our database\n")

    # Get some random lines from this df
    some_random_lines = df.sample(n=number_of_random_samples)
    some_random_lines = some_random_lines.values.tolist()

    # Display these random lines, with the reference and both model translations
    for line in some_random_lines:
        print_md(f"#### 1. In English:\n{line[0]}")
        print_md(f"#### 2. En français :\n{line[1]}")
        translation_from_model = en2fr(line[0])
        print_md(f"#### 3. Et en français traduit par le modèle initial :\n{translation_from_model}")
        translation_from_final_model = final_en2fr(line[0])
        print_md(f"#### 4. Et en français traduit par le modèle final :\n{translation_from_final_model}")

Let's see if I'm now happy with the results...
Anyway, it's time to go to bed 😴!

In [None]:
current_date_string()

'2025-04-11__04:36:56'