# 🛡 🔨 🤖 **Dataset, assemble!**

I want to create a semi-large supervised (labelled) dataset of questions and answers, in their original language (English) and their translated language (French).

Once I'll have a lot of pairs of small sentences in English (source language) and French (target language), I'll try to use a Google colab cloud GPU to finetune the translation model I have been using in April 2025.

Hopefully, I'll be able to do that efficiently, quickly, and then the model could be saved and shared on my [HuggingFace's profile](https://huggingface.co/Naereen) so that anyone can use it.

In [1]:
!pip install watermark sacremoses tensorflow transformers requests beautifulsoup4
%load_ext watermark
%watermark -v -p numpy,pandas,tensorflow,transformers,requests,beautifulsoup4

Collecting watermark
  Downloading watermark-2.5.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting jedi>=0.16 (from ipython>=6.0->watermark)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading watermark-2.5.0-py2.py3-none-any.whl (7.7 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m897.5/897.5 kB[0m [31m23.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m23.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sacremoses, jedi, watermark
Successfully installed jedi-0.19.2 sacremoses-0.1.1 watermark-2.5.0
Python implementation: CPython
Python version       : 3.11.11
IPython version      : 7.34.0

numpy         : 2.0.2
pandas        : 2.2.2
tensorflow    : 2.18.0
t

In [3]:
running_on_GPU = False
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    print("Running on CPU (20 to 30x slower for LLM/NLP/ML)")
else:
    running_on_GPU = True
    print("Running on GPU")

Running on GPU


> We encourage you to login to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to login:

In [4]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Fetch all the content of English articles from Cranial Insertion

In [None]:
url_list_of_articles_english = "https://www.cranial-insertion.com/archive?lang=en"

In [None]:
# prompt: use requests to get the content of a HTML page on this URL
import requests
from bs4 import BeautifulSoup

response = requests.get(url_list_of_articles_english)
soup = BeautifulSoup(response.content, "html.parser")

In [None]:
from pprint import pprint
# Now you can work with the parsed HTML content in the 'soup' object.
all_recent_english_articles = []
all_recent_french_articles = []

for i, table in enumerate(soup.find("div", id="content").find_all("table")):
    print(f"\n# {i+1}-th article in the list on {url_list_of_articles_english}:")
    article_title = table.find("a", class_="plainlink").find("span").find("b").text
    article_id = table.find("a", class_="plainlink").get("href").split("/")[-1]
    article_link = f"https://www.cranial-insertion.com/article/{article_id}"
    article_edit_link = f"https://www.cranial-insertion.com/staff/articles/{article_id}/edit"
    article = {
        'article_title': article_title,
        'article_id': article_id,
        'article_link': article_link,
        'article_edit_link': article_edit_link,
    }
    pprint(article)

    # now get the translation in French
    article_fr_id = None
    for translation in table.find_all("td")[1].find("span").find_all("a"):
        if translation.text == "French" or translation.text == "Français":
            article_fr_id = translation.get("href").split("/")[-1]
            break

    article_fr_link = f"https://www.cranial-insertion.com/article/{article_fr_id}"
    article_fr_edit_link = f"https://www.cranial-insertion.com/staff/articles/{article_fr_id}/edit"
    article_fr = {
        #'article_title': article_fr_title,
        'article_id': article_fr_id,
        'article_link': article_fr_link,
        'article_edit_link': article_fr_edit_link,
    }
    pprint(article_fr)

    if article_fr_id:
        all_recent_english_articles.append(article)
        all_recent_french_articles.append(article_fr)

nb_english_articles = len(all_recent_english_articles)
nb_french_articles = len(all_recent_french_articles)
assert(nb_english_articles == nb_french_articles)
nb_articles = nb_french_articles
print(f"\n\n==> We have the metadata about {nb_articles} articles, that should be enough.")


# 1-th article in the list on https://www.cranial-insertion.com/archive?lang=en:
{'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/4380/edit',
 'article_id': '4380',
 'article_link': 'https://www.cranial-insertion.com/article/4380',
 'article_title': 'Cloudy With a Chance of Dragonstorms'}
{'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/None/edit',
 'article_id': None,
 'article_link': 'https://www.cranial-insertion.com/article/None'}

# 2-th article in the list on https://www.cranial-insertion.com/archive?lang=en:
{'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/4376/edit',
 'article_id': '4376',
 'article_link': 'https://www.cranial-insertion.com/article/4376',
 'article_title': 'All in Jeopardy'}
{'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/4379/edit',
 'article_id': '4379',
 'article_link': 'https://www.cranial-insertion.com/article/4379'}

# 3-th article in the list on https://www.cran

Now for each article, get its content:

In [None]:
try:
    with open("cranial-insertion.com.cookies", 'r') as file:
        cranial_insertion_cookie = file.read()
    print("Cranial Insertion cookie successfully loaded from local file 'cranial-insertion.com.cookies', it can now be used (it's a secret)")
except FileNotFoundError:
    from google.colab import userdata
    cranial_insertion_cookie = userdata.get('CI_COOKIE')
    print("Cranial Insertion cookie successfully loaded from Google Colab secrets, it can now be used (it's a secret)")

Cranial Insertion cookie successfully loaded from Google Colab secrets, it can now be used (it's a secret)


In [None]:
def get_article_text(article_id=article_id, latest_article_edit_url=None):
    if not latest_article_edit_url:
        latest_article_edit_url = f"https://www.cranial-insertion.com/staff/articles/{article_id}/edit"
    print(f"Reading the article at URL {latest_article_edit_url} ...")

    cookies = {
        'loggedin': cranial_insertion_cookie,
        'siteLang': 'fr',
    }
    #print(f"Using French language and my editor cookie...")

    response = requests.get(latest_article_edit_url, cookies=cookies)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(f"The web page has been read and it's title is « {soup.title} » !")

    latest_article_text = soup.find(id="thisArticleText").get_text()
    latest_article_title = soup.find(id="thisArticleTitle").get('value')
    latest_article_date = soup.find(id="thisArticlePubdate").get('value')

    return latest_article_text, latest_article_title, latest_article_date, soup

### We can now download the raw content of each of these 51 articles, in both languages

In [None]:
content_articles = []

for en_article, fr_article in zip(all_recent_english_articles, all_recent_french_articles):
    latest_article_text, latest_article_title, latest_article_date, soup = get_article_text(article_id=en_article['article_id'])
    content_en = {
        'article_id': en_article['article_id'],
        'article_title': latest_article_title,
        'article_link': en_article['article_link'],
        'article_edit_link': en_article['article_edit_link'],
        'article_text': latest_article_text,
        'article_date': latest_article_date,
    }

    latest_article_text, latest_article_title, latest_article_date, soup = get_article_text(article_id=fr_article['article_id'])
    content_fr = {
        'article_id': fr_article['article_id'],
        'article_title': latest_article_title,
        'article_link': fr_article['article_link'],
        'article_edit_link': fr_article['article_edit_link'],
        'article_text': latest_article_text,
        'article_date': latest_article_date,
    }

    content_articles.append({
        'en': content_en,
        'fr': content_fr,
    })

    # TODO: remove this break if everything works fine
    break

pprint(content_articles)

Reading the article at URL https://www.cranial-insertion.com/staff/articles/4376/edit ...
The web page has been read and it's title is « <title>Cranial Insertion | Manage Articles</title> » !
Reading the article at URL https://www.cranial-insertion.com/staff/articles/4379/edit ...
The web page has been read and it's title is « <title>Cranial Insertion | Manage Articles</title> » !
[{'en': {'article_date': '03/31/2025',
         'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/4376/edit',
         'article_id': '4376',
         'article_link': 'https://www.cranial-insertion.com/article/4376',
         'article_text': '[cright=Living Death]"Alex, the answer is "What '
                         'would be an\n'
                         'example of an oximoron?"[/cright]\n'
                         '\n'
                         'Today we have a collection of questions found on the '
                         'IRC Chatroom #magicjudges-rules and the Facebook '
           

Let's split the raw text by `[a]`or `[A]` or `[q]` or `[Q]` tags, to group them by questions/answers , one at a time:

In [None]:
from IPython.display import display, Markdown
def print_markdown(text):
    display(Markdown(text))

print_markdown("# Example of a raw content in English:")
print_markdown(content_articles[0]['en']['article_text'][:1000])

print_markdown("# Example of a raw content in French:")
print_markdown(content_articles[0]['fr']['article_text'][:1000])

# Example of a raw content in English:

[cright=Living Death]"Alex, the answer is "What would be an
example of an oximoron?"[/cright]

Today we have a collection of questions found on the IRC Chatroom #magicjudges-rules and the Facebook group Ask the Judge - [b]Magic[/b]: The Gathering Rules and Policy Questions. Feel free to join us in either or both groups.

If you have any [b]Magic[/b] questions burning at the back of your brain, you can send them to us. We may even use them in a future article. If you have a short question, you can send it to us via our Twitter account at [url=https://twitter.com/CranialTweet]@CranialTweet[/url], and you can send us longer questions at [email]moko@cranialinsertion.com[/email].

[hr]
[Q] I have two [c]Squire[/c]s on the battlefield and I have [c]Custodi Soulbinders[/c] and four copies of [c]Bear Cub[/c] in my graveyard. If I cast [c]Living Death[/c], how many +1/+1 counters will Soulbinders enter with?[/Q]

[A] Custodi Soulbinders will enter with zero +1/+1 counters. Souldbinders and the 

# Example of a raw content in French:

[cright=Living Death]"Alex, la réponse est
“Quel serait un bon exemple d’oxymore ?”[/cright]Aujourd’hui nous avons une collection de questions trouvées sur le salon de chat IRC #magicjudges-rules et le groupe Facebook “Ask the Judge - Magic: The Gathering Rules and Policy Questions”. N’hésitez pas à rejoindre ces espaces.

Si vous avez une question [b]Magic[/b] qui vous turlupine, vous pouvez nous l’envoyer. Nous pourrions même l’utiliser dans un futur article. Si vous avez une question courte, vous pouvez l’envoyer à [email]moko@cranialinsertion.com[/email].

(NDLT : Jeopardy est un jeu télévisé américain où on nous donne une réponse, et on doit trouver la question dont c'est la réponse.)

[hr]
[Q=Q :] Je contrôle deux [c=Squire]Écuyers[/c] sur le champ de bataille et j’ai [c=Custodi Soulbinders]Lieurs d'âme des Custodi[/c] et 4 copies d’[c=Bear Cub]Ourson[/c] dans mon cimetière. Si je lance [c=Living Death]Mort vivante[/c], avec combien de marqueurs +1/+1 arrivent les Lieurs sur le c

### Now a difficult task: split the article content by sentences / paragraphs, and group them by pairs of correponding translated text

I'm trying but it seems to be so hard...

In [None]:
# prompt: I have the content of a long text file in « content_article['en']['article_text'] » for the English version, and in « content_article['fr']['article_text'] » for the translated French version.
# I want to split both of these long text by sentences or paragraphs, but while being certain than each piece of the French text is associated with its corresponding piece in the English text.

import re

def split_text(text):
    # Split by sentence boundaries, handling various punctuation and abbreviations.
    # This regex is a basic example and might need refinement for specific cases.
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', text)
    return [s for s in sentences if s]

# Split the English and French texts into sentences
english_sentences = split_text(content_articles[0]['en']['article_text'])
french_sentences = split_text(content_articles[0]['fr']['article_text'])

# Pair the sentences.  A more robust approach would handle differences in length.
paired_sentences = []
min_len = min(len(english_sentences), len(french_sentences))
for i in range(min_len):
  paired_sentences.append({'en': english_sentences[i], 'fr': french_sentences[i]})

# Example usage: print all the pairs
for i in range(len(paired_sentences)):
    print(f"# Pair {i+1}:")
    print(f"- English: {paired_sentences[i]['en']}")
    print(f"- French: {paired_sentences[i]['fr']}")
    print("---")

# Pair 1:
- English: [cright=Living Death]"Alex, the answer is "What would be an
example of an oximoron?"[/cright]

Today we have a collection of questions found on the IRC Chatroom #magicjudges-rules and the Facebook group Ask the Judge - [b]Magic[/b]: The Gathering Rules and Policy Questions.
- French: [cright=Living Death]"Alex, la réponse est
“Quel serait un bon exemple d’oxymore ?”[/cright]Aujourd’hui nous avons une collection de questions trouvées sur le salon de chat IRC #magicjudges-rules et le groupe Facebook “Ask the Judge - Magic: The Gathering Rules and Policy Questions”.
---
# Pair 2:
- English: Feel free to join us in either or both groups.
- French: N’hésitez pas à rejoindre ces espaces.
---
# Pair 3:
- English: 
If you have any [b]Magic[/b] questions burning at the back of your brain, you can send them to us.
- French: 
Si vous avez une question [b]Magic[/b] qui vous turlupine, vous pouvez nous l’envoyer.
---
# Pair 4:
- English: We may even use them in a future article

In [None]:
all_pairs_of_paragraphs = []

for content_article in content_articles:
    pairs_of_paragraphs = []
    paragraphs_en = content_article['en']['article_text'].split('\n\n')
    paragraphs_fr = content_article['fr']['article_text'].split('\n\n')
    for i, paragraph_en in enumerate(paragraphs_en):
        paragraph_fr = paragraphs_fr[i]
        pair_of_paragraphs = {
            'en': paragraph_en,
            'fr': paragraph_fr,
        }
        pprint(pair_of_paragraphs)
        pairs_of_paragraphs.append(pair_of_paragraphs)

    all_pairs_of_paragraphs.append(pairs_of_paragraphs)

{'en': '[cright=Living Death]"Alex, the answer is "What would be an\n'
       'example of an oximoron?"[/cright]',
 'fr': '[cright=Living Death]"Alex, la réponse est\n'
       '“Quel serait un bon exemple d’oxymore ?”[/cright]Aujourd’hui nous '
       'avons une collection de questions trouvées sur le salon de chat IRC '
       '#magicjudges-rules et le groupe Facebook “Ask the Judge - Magic: The '
       'Gathering Rules and Policy Questions”. N’hésitez pas à rejoindre ces '
       'espaces.'}
{'en': 'Today we have a collection of questions found on the IRC Chatroom '
       '#magicjudges-rules and the Facebook group Ask the Judge - '
       '[b]Magic[/b]: The Gathering Rules and Policy Questions. Feel free to '
       'join us in either or both groups.',
 'fr': 'Si vous avez une question [b]Magic[/b] qui vous turlupine, vous pouvez '
       'nous l’envoyer. Nous pourrions même l’utiliser dans un futur article. '
       'Si vous avez une question courte, vous pouvez l’envoyer à '
    

IndexError: list index out of range

## Trying to use SQLite and the [MTG JSON](https://mtgjson.com/downloads/all-files/#allprintings) database

In [9]:
# prompt: Download the SQLite file at "https://mtgjson.com/api/v5/AllPrintings.sqlite" and save it so I can use it in Python code
import requests
import os

# Download the SQLite file
url = "https://mtgjson.com/api/v5/AllPrintings.sqlite"
filename = "AllPrintings.sqlite"

if not os.path.exists(filename):
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()  # Raise an exception for bad status codes

        with open(filename, 'wb') as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print(f"Downloaded {filename} successfully.")

    except requests.exceptions.RequestException as e:
        print(f"Error downloading the file: {e}")
else:
    print(f"{filename} already exists. Skipping download.")


Downloaded AllPrintings.sqlite successfully.


In [10]:
!ls -larth *.sqlite
!du -h *.sqlite

-rw-r--r-- 1 root root 471M Apr  7 20:10 AllPrintings.sqlite
471M	AllPrintings.sqlite


We are ready to execute some SQLite code!

In [11]:
# prompt: import the sqlite database in "AllPrintings.sqlite" into a Panas dataframe and display it nicely
import pandas as pd
import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('AllPrintings.sqlite')

def run_query(query):
  return pd.read_sql_query(query, conn)

In [8]:
# Query the database and load into a Pandas DataFrame
query = "SELECT * FROM cards"
df = run_query(query)

# Display the DataFrame nicelyn
# pd.set_option("display.max_rows", None, "display.max_columns", None) # Show all rows and columns
display(df)

# Close the connection
# conn.close()

Unnamed: 0,artist,artistIds,asciiName,attractionLights,availability,boosterTypes,borderColor,cardParts,colorIdentity,colorIndicator,...,subsets,subtypes,supertypes,text,toughness,type,types,uuid,variations,watermark
0,Pete Venters,d54c4a1a-c0c5-4834-84db-125d341f3ad8,,,"mtgo, paper",default,black,,W,,...,,"Human, Cleric",,First strike (This creature deals combat damag...,4,Creature — Human Cleric,Creature,5f8287b1-5bb6-5f4c-ad17-316a40d5bb0c,b7c19924-b4bf-56fc-aa73-f586e940bd42,
1,Pete Venters,d54c4a1a-c0c5-4834-84db-125d341f3ad8,,,"mtgo, paper",default,black,,W,,...,,"Human, Cleric",,First strike (This creature deals combat damag...,4,Creature — Human Cleric,Creature,b7c19924-b4bf-56fc-aa73-f586e940bd42,5f8287b1-5bb6-5f4c-ad17-316a40d5bb0c,
2,Volkan Baǵa,93bec3c0-0260-4d31-8064-5d01efb4153f,,,"mtgo, paper",default,black,,W,,...,,Angel,,"Flying\nWhen this creature enters, you gain 3 ...",3,Creature — Angel,Creature,57aaebc1-850c-503d-9f6e-bb8d00d8bf7c,8fd4e2eb-3eb4-50ea-856b-ef638fa47f8a,
3,Volkan Baǵa,93bec3c0-0260-4d31-8064-5d01efb4153f,,,"mtgo, paper",default,black,,W,,...,,Angel,,"Flying\nWhen this creature enters, you gain 3 ...",3,Creature — Angel,Creature,8fd4e2eb-3eb4-50ea-856b-ef638fa47f8a,57aaebc1-850c-503d-9f6e-bb8d00d8bf7c,
4,Mark Zug,48e2b98c-5467-4671-bd42-4c3746115117,,,"mtgo, paper",default,black,,W,,...,,,,Target creature gets +3/+3 and gains flying un...,,Sorcery,Sorcery,55bd38ca-dc73-5c06-8f80-a6ddd2f44382,c5655330-5131-5f40-9d3e-0549d88c6e9e,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102761,Campbell White,d281eab4-463a-4ba8-9039-8943737960a0,,,"mtgo, paper",,black,,U,,...,,,,Kicker {1}{U} (You may pay an additional {1}{U...,,Instant,Instant,3f492516-7767-5ed7-a1d4-e3f7c06aee2f,dd604910-0e81-5d56-b022-e08f21de0879,planeswalker
102762,Jason Rainville,6ed7e669-579b-443d-b223-e5cbcb2a7483,,,"mtgo, paper",,black,,B,,...,,,,Kicker {2}{B} (You may pay an additional {2}{B...,,Sorcery,Sorcery,3f9a0369-5fe7-5aee-85fe-3cfaacd275af,d947d9cd-f855-5496-b5de-88006b49865f,planeswalker
102763,Campbell White,d281eab4-463a-4ba8-9039-8943737960a0,,,"mtgo, paper",,black,,R,,...,,,,Kicker {5} (You may pay an additional {5} as y...,,Sorcery,Sorcery,97577e9e-69a9-5a8b-9c24-a72703790046,dbd65728-bba8-536e-a908-5dfa56068dcb,planeswalker
102764,Jonas De Ro,561ebf9e-8d93-4b57-8156-8826d0c19601,,,"mtgo, paper",,black,,G,,...,,,,Sacrifice a land. Search your library for up t...,,Instant,Instant,deb51cbd-b890-5b2d-9d6f-7b896e16c6fd,7ef6cffb-335a-5831-8319-20b6a020f1c4,planeswalker


Now I want to obtain a card's text in English and its translation in French, for each differently-named card.

In [12]:
# Query the database and load into a Pandas DataFrame
query = """
SELECT DISTINCT cards.name as name_en, cFD.multiverseId, cFD.name as name_fr,
       cards.text as text_en, cFD.text as text_fr
FROM cards
JOIN cardForeignData as cFD ON cards.uuid = cFD.uuid
WHERE cFD.multiverseId IS NOT NULL
  AND cFD.language = 'French'
ORDER BY cFD.multiverseId;
"""
df = run_query(query)

# Display the DataFrame nicelyn
# pd.set_option("display.max_rows", None, "display.max_columns", None) # Show all rows and columns
display(df)

Unnamed: 0,name_en,multiverseId,name_fr,text_en,text_fr
0,Aether Snap,75971,Coup d'Aether,Remove all counters from all permanents and ex...,Retirez tous les marqueurs de tous les permane...
1,Aether Vial,75972,Fiole d'Aether,"At the beginning of your upkeep, you may put a...","Au début de votre entretien, vous pouvez mettr..."
2,Ageless Entity,75973,Entité sans âge,"Whenever you gain life, put that many +1/+1 co...",À chaque fois que vous gagnez des points de vi...
3,Angel's Feather,75974,Plume d'ange,"Whenever a player casts a white spell, you may...","À chaque fois qu'un joueur joue un sort blanc,..."
4,Arcane Spyglass,75975,Longue-vue des arcanes,"{2}, {T}, Sacrifice a land: Draw a card and pu...","{2}, {T}, Sacrifiez un terrain : Piochez une c..."
...,...,...,...,...,...
46604,Treasure Vault,692899,Salle au trésor,"{T}: Add {C}.\n{X}{X}, {T}, Sacrifice this lan...","{T} : Ajoutez {C}.\n{X}{X}, {T}, sacrifiez ce ..."
46605,Underground River,692900,Rivière souterraine,{T}: Add {C}.\n{T}: Add {U} or {B}. This land ...,{T} : Ajoutez {C}.\n{T} : Ajoutez {U} ou {B}. ...
46606,Unholy Grotto,692901,Grotte impie,"{T}: Add {C}.\n{B}, {T}: Put target Zombie car...","{T} : Ajoutez {C}.\n{B}, {T} : Mettez une cart..."
46607,Vineglimmer Snarl,692902,Lacis pâlevigne,"As this land enters, you may reveal a Forest o...","Au moment où ce terrain arrive, vous pouvez ré..."


In [13]:
# prompt: I want to extract this as a list of pairs text_en, text_fr.

# Query to extract English and French text pairs
query = """
SELECT cards.text as text_en, cFD.text as text_fr
FROM cards
JOIN cardForeignData as cFD ON cards.uuid = cFD.uuid
WHERE cFD.language = 'French'
  AND cards.text IS NOT NULL
  AND cFD.text IS NOT NULL;
"""

df = run_query(query)

# Create a list of pairs
text_pairs = []
for index, row in df.iterrows():
    text_pairs.append((row['text_en'], row['text_fr']))

# Example: Accessing the first pair
if text_pairs:
    print(f"First pair: English - \n« {text_pairs[0][0]} »\nFrench - \n« {text_pairs[0][1]} »")


First pair: English - 
« First strike (This creature deals combat damage before creatures without first strike.)\nWhen this creature enters, you gain 1 life for each card in your graveyard. »
French - 
« Initiative (Cette créature inflige des blessures de combat avant les créatures sans l'initiative.)\nQuand l'Élu de l'Ancêtre arrive en jeu, vous gagnez 1 point de vie pour chaque carte dans votre cimetière. »


-----------------------

## Fine-tune [Helsinki-NLP/opus-mt-tc-big-en-fr](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-fr) on these data

I'm going to take inspiration from <https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-fr>.

In [14]:
model_name = "opus-mt-tc-big-en-fr"

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained(f"Helsinki-NLP/{model_name}")
model = AutoModelForSeq2SeqLM.from_pretrained(f"Helsinki-NLP/{model_name}")

tokenizer_config.json:   0%|          | 0.00/337 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/820k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.33M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/461M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

We can use the pre-trained model to translate some Magic: the Gathering text:

In [15]:
from transformers import pipeline

translator =  pipeline ('translation' , model=model, tokenizer=tokenizer)
def en2fr(text):
    return translator(text)[0]['translation_text']

Device set to use cuda:0


In [16]:
example_english_text = "When this creature enters, create three 4/4 Elephant tokens with trample. Then populate."
print(f"Example of English rules text: « {example_english_text} », getting translated:")

example_french_text = en2fr(example_english_text)
print(f"==> {example_french_text}")

Example of English rules text: « When this creature enters, create three 4/4 Elephant tokens with trample. Then populate. », getting translated:
==> Lorsque cette créature entre, créez trois jetons Éléphant 4/4 avec piétinement.


It's already not so bad, but I'm sure we can improve by trying to fine-tune this model on MTG cards data!

In [17]:
example_english_text = """[a]I'm afraid you'll have to remove all six counters. Bounty of the Hunt tracks how many counters each creature received "this way", which means how many counters it received due to following the instructions of Bounty of the Hunt as modified by any applicable replacement effects. Since each target creature received two counters this way, you'll have to remove two counters from each of them.[/a]"""
print(f"Example of English rules text: « {example_english_text} », getting translated:")

example_french_text = en2fr(example_english_text)
print(f"==> {example_french_text}")

Example of English rules text: « [a]I'm afraid you'll have to remove all six counters. Bounty of the Hunt tracks how many counters each creature received "this way", which means how many counters it received due to following the instructions of Bounty of the Hunt as modified by any applicable replacement effects. Since each target creature received two counters this way, you'll have to remove two counters from each of them.[/a] », getting translated:
==> [a]J'ai peur que vous deviez retirer les six marqueurs. Bounty of the Hunt suit le nombre de marqueurs que chaque créature a reçu "de cette façon", ce qui signifie le nombre de marqueurs qu'elle a reçu en suivant les instructions de Bounty of the Hunt telles que modifiées par les effets de remplacement applicables. Puisque chaque créature cible a reçu deux marqueurs de cette façon, vous devrez retirer deux marqueurs de chacun d'eux.[/a]


### Formatting Data

In [18]:
source = [ text_pair[0] for text_pair in text_pairs ]
target = [ text_pair[1] for text_pair in text_pairs ]

from sklearn.model_selection import train_test_split
max_length = 512
test_size = 0.20
X_train, X_val, y_train, y_val = train_test_split(source, target, test_size=test_size)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
y_train_tokenized = tokenizer(y_train, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
X_val_tokenized   = tokenizer(X_val, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
y_val_tokenized   = tokenizer(y_val, padding=True, truncation=True, max_length=max_length, return_tensors="pt")

import torch
class ForDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, index):
        input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
        target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()

        return {"input_ids": input_ids, "labels": target_ids}

train_dataset = ForDataset(X_train_tokenized, y_train_tokenized)
test_dataset  = ForDataset(X_val_tokenized, y_val_tokenized)

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, return_tensors="pt")

### Metric

In [19]:
!pip install evaluate numpy sacrebleu

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.1.1-py3-none-any.whl.metadata (8.6 kB)
Collecting colorama (from sacrebleu)
 

In [20]:
import evaluate
metric = evaluate.load("sacrebleu")
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

Downloading builder script:   0%|          | 0.00/8.15k [00:00<?, ?B/s]

### Training (ie. fine-tuning)

In [21]:
import os
os.environ["WANDB_DISABLED"] = "true"

Let's add callbacks, to save the model to [my HuggingFace profile (@Naereen)](https://huggingface.co/Naereen) regularly.

In [27]:
# from transformers.keras_callbacks import KerasMetricCallback
# metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=test_dataset)

from transformers.keras_callbacks import PushToHubCallback
push_to_hub_callback = PushToHubCallback(
    output_dir="./english-to-french-translation-for-Magic-the-Gathering",
    tokenizer=tokenizer,
)

callbacks = [
    # metric_callback,
    push_to_hub_callback,
]

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/content/english-to-french-translation-for-Magic-the-Gathering is already a clone of https://huggingface.co/Naereen/english-to-french-translation-for-Magic-the-Gathering. Make sure you pull the latest changes with `repo.git_pull()`.


In [34]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
# from transformers import TrainingArguments, Trainer

training_args = Seq2SeqTrainingArguments(
# training_args = TrainingArguments(

    output_dir="english-to-french-translation-for-Magic-the-Gathering",

    evaluation_strategy="epoch",

    save_strategy="epoch",

    learning_rate=2e-5,

    per_device_train_batch_size=16,

    per_device_eval_batch_size=16,

    weight_decay=0.01,

    save_total_limit=3,

    num_train_epochs=10,

    predict_with_generate=True,

    load_best_model_at_end=True,

    report_to="none",  # Disable all integrations

    # push_to_hub=True,
)

trainer = Seq2SeqTrainer(
# trainer = Trainer(
    model=model,

    args=training_args,

    train_dataset=train_dataset,

    eval_dataset=test_dataset,

    tokenizer=tokenizer,

    data_collator=data_collator,

    compute_metrics=compute_metrics,

    # callbacks=callbacks,
)

  trainer = Seq2SeqTrainer(


In [35]:
%%time
trainer.train()
trainer.push_to_hub()
trainer.save_model('final_model')

  input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
  target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()


Epoch,Training Loss,Validation Loss,Bleu,Gen Len
1,No log,5.687753,8.5232,101.5
2,No log,5.381392,4.1739,294.5
3,No log,5.01653,10.9549,89.5
4,No log,4.696126,9.8152,103.0
5,No log,4.526599,11.4949,98.0
6,No log,4.326241,6.087,279.5
7,No log,4.15381,6.9968,300.0
8,No log,4.046091,15.2952,95.5
9,No log,3.97973,15.2952,95.5
10,No log,3.943416,15.2952,95.5


  input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
  target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()
  input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
  target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()
  input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
  target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()
  input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
  target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()
  input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
  target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()
  input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
  target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()
  input_ids = torch.tensor(self.inputs["input_ids"][index]).squeeze()
  target_ids = torch.tensor(self.targets["input_ids"][index]).squeeze()
  inpu

model.safetensors:   0%|          | 0.00/923M [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/802k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/820k [00:00<?, ?B/s]

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/5.56k [00:00<?, ?B/s]

CPU times: user 1min 13s, sys: 38.3 s, total: 1min 51s
Wall time: 9min 52s


### Testing

In [36]:
from transformers import pipeline

final_model = AutoModelForSeq2SeqLM.from_pretrained("./final_model")

translator =  pipeline ('translation' , model=final_model, tokenizer=tokenizer)
def en2fr(text):
    return translator(text)[0]['translation_text']

Device set to use cuda:0


Let's try again the first two examples:

In [38]:
%%time
example_english_text = "When this creature enters, create three 4/4 Elephant tokens with trample, then populate."
print(f"Example of English rules text: « {example_english_text} », getting translated:")

example_french_text = en2fr(example_english_text)
print(f"==> {example_french_text}")

Example of English rules text: « When this creature enters, create three 4/4 Elephant tokens with trample, then populate. », getting translated:
==> Lorsque cette créature entre, créez trois jetons d'éléphant 4/4 avec piétinement, puis peuplez.
CPU times: user 306 ms, sys: 0 ns, total: 306 ms
Wall time: 337 ms


In [39]:
%%time
example_english_text = """[a]I'm afraid you'll have to remove all six counters. Bounty of the Hunt tracks how many counters each creature received "this way", which means how many counters it received due to following the instructions of Bounty of the Hunt as modified by any applicable replacement effects. Since each target creature received two counters this way, you'll have to remove two counters from each of them.[/a]"""
print(f"Example of English rules text: « {example_english_text} », getting translated:")

example_french_text = en2fr(example_english_text)
print(f"==> {example_french_text}")

Example of English rules text: « [a]I'm afraid you'll have to remove all six counters. Bounty of the Hunt tracks how many counters each creature received "this way", which means how many counters it received due to following the instructions of Bounty of the Hunt as modified by any applicable replacement effects. Since each target creature received two counters this way, you'll have to remove two counters from each of them.[/a] », getting translated:
==> [a]J'ai peur que vous deviez retirer les six marqueurs. Bounty of the Hunt suit le nombre de marqueurs que chaque créature a reçu "de cette façon", ce qui signifie le nombre de marqueurs qu'elle a reçu en suivant les instructions de Bounty of the Hunt telles que modifiées par les effets de remplacement applicables. Puisque chaque créature cible a reçu deux marqueurs de cette façon, vous devrez retirer deux marqueurs de chacun d'eux.[/a]
CPU times: user 1.21 s, sys: 0 ns, total: 1.21 s
Wall time: 1.26 s


OK so the finetuning worked, but we do not observe any changes in the translation obtained by this model...