# 🛡 🔨 🤖 **Dataset, assemble!**

I want to create a semi-large supervised (labelled) dataset of questions and answers, in their original language (English) and their translated language (French).

Once I have many pairs of short sentences in English (source language) and French (target language), I'll try to use a Google Colab cloud GPU to fine-tune the translation model I used in April 2025.

Hopefully, I'll be able to do that quickly and efficiently, and the model can then be saved and shared on my [Hugging Face profile](https://huggingface.co/Naereen) so that anyone can use it.

In [1]:
!pip install watermark sacremoses tensorflow transformers requests beautifulsoup4
%load_ext watermark
%watermark -v -p numpy,pandas,tensorflow,transformers,requests,beautifulsoup4

Collecting watermark
  Downloading watermark-2.5.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting jedi>=0.16 (from ipython>=6.0->watermark)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading watermark-2.5.0-py2.py3-none-any.whl (7.7 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: sacremoses, jedi, watermark
Successfully installed jedi-0.19.2 sacremoses-0.1.1 watermark-2.5.0
Python implementation: CPython
Python version       : 3.11.11
IPython version      : 7.34.0

numpy         : 2.0.2
pandas        : 2.2.2
tensorflow    : 2.18.0
t

## Fetch all the content of English articles from Cranial Insertion

In [2]:
url_list_of_articles_english = "https://www.cranial-insertion.com/archive?lang=en"

In [3]:
# prompt: use requests to get the content of a HTML page on this URL
import requests
from bs4 import BeautifulSoup

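# a timeout avoids hanging forever, and raise_for_status() fails loudly on HTTP errors
response = requests.get(url_list_of_articles_english, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")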

In [19]:
from pprint import pprint
# Now you can work with the parsed HTML content in the 'soup' object.
all_recent_english_articles = []
all_recent_french_articles = []

for i, table in enumerate(soup.find("div", id="content").find_all("table")):
    print(f"\n# {i+1}-th article in the list on {url_list_of_articles_english}:")
    article_title = table.find("a", class_="plainlink").find("span").find("b").text
    article_id = table.find("a", class_="plainlink").get("href").split("/")[-1]
    article_link = f"https://www.cranial-insertion.com/article/{article_id}"
    article_edit_link = f"https://www.cranial-insertion.com/staff/articles/{article_id}/edit"
    article = {
        'article_title': article_title,
        'article_id': article_id,
        'article_link': article_link,
        'article_edit_link': article_edit_link,
    }
    pprint(article)

    # now get the translation in French
    article_fr_id = None
    for translation in table.find_all("td")[1].find("span").find_all("a"):
        if translation.text in ("French", "Français"):
            article_fr_id = translation.get("href").split("/")[-1]
            break

    article_fr_link = f"https://www.cranial-insertion.com/article/{article_fr_id}"
    article_fr_edit_link = f"https://www.cranial-insertion.com/staff/articles/{article_fr_id}/edit"
    article_fr = {
        #'article_title': article_fr_title,
        'article_id': article_fr_id,
        'article_link': article_fr_link,
        'article_edit_link': article_fr_edit_link,
    }
    pprint(article_fr)

    if article_fr_id:
        all_recent_english_articles.append(article)
        all_recent_french_articles.append(article_fr)

nb_english_articles = len(all_recent_english_articles)
nb_french_articles = len(all_recent_french_articles)
assert nb_english_articles == nb_french_articles
nb_articles = nb_french_articles
print(f"\n\n==> We have the metadata about {nb_articles} articles, that should be enough.")


# 1-th article in the list on https://www.cranial-insertion.com/archive?lang=en:
{'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/4380/edit',
 'article_id': '4380',
 'article_link': 'https://www.cranial-insertion.com/article/4380',
 'article_title': 'Cloudy With a Chance of Dragonstorms'}
{'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/None/edit',
 'article_id': None,
 'article_link': 'https://www.cranial-insertion.com/article/None'}

# 2-th article in the list on https://www.cranial-insertion.com/archive?lang=en:
{'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/4376/edit',
 'article_id': '4376',
 'article_link': 'https://www.cranial-insertion.com/article/4376',
 'article_title': 'All in Jeopardy'}
{'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/4379/edit',
 'article_id': '4379',
 'article_link': 'https://www.cranial-insertion.com/article/4379'}

# 3-th article in the list on https://www.cran
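
In the output above, the first article has no French translation yet, so its French links are built from `None`. The loop already skips such articles when filling the two lists, so this is only cosmetic. A minimal sketch of a helper that avoids building these meaningless URLs is given below; the name `article_links` and its `base_url` parameter are hypothetical, not part of the notebook.

```python
def article_links(article_id, base_url="https://www.cranial-insertion.com"):
    """Build the public and edit links of an article, or return None without an ID."""
    if article_id is None:
        # No translation was found: there is nothing meaningful to link to.
        return None
    return {
        "article_id": article_id,
        "article_link": f"{base_url}/article/{article_id}",
        "article_edit_link": f"{base_url}/staff/articles/{article_id}/edit",
    }


links_fr = article_links("4379")
links_missing = article_links(None)
print(links_fr)
print(links_missing)
```

With such a helper, the test `if article_fr_id:` could become a test on the returned value.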

Now for each article, get its content:

In [20]:
try:
    with open("cranial-insertion.com.cookies", 'r') as file:
        # strip the trailing newline that many editors add at the end of a file
        cranial_insertion_cookie = file.read().strip()
    print("Cranial Insertion cookie successfully loaded from local file 'cranial-insertion.com.cookies', it can now be used (it's a secret)")
except FileNotFoundError:
    from google.colab import userdata
    cranial_insertion_cookie = userdata.get('CI_COOKIE')
    print("Cranial Insertion cookie successfully loaded from Google Colab secrets, it can now be used (it's a secret)")

Cranial Insertion cookie successfully loaded from Google Colab secrets, it can now be used (it's a secret)


In [22]:
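def get_article_text(article_id=None, latest_article_edit_url=None):
    # Either an article ID or a full edit URL must be provided.
    if not latest_article_edit_url:
        if article_id is None:
            raise ValueError("Provide either article_id or latest_article_edit_url")
        latest_article_edit_url = f"https://www.cranial-insertion.com/staff/articles/{article_id}/edit"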
    print(f"Reading the article at URL {latest_article_edit_url} ...")

    cookies = {
        'loggedin': cranial_insertion_cookie,
        'siteLang': 'fr',
    }
    #print(f"Using French language and my editor cookie...")

    response = requests.get(latest_article_edit_url, cookies=cookies)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(f"The web page has been read and it's title is « {soup.title} » !")

    latest_article_text = soup.find(id="thisArticleText").get_text()
    latest_article_title = soup.find(id="thisArticleTitle").get('value')
    latest_article_date = soup.find(id="thisArticlePubdate").get('value')

    return latest_article_text, latest_article_title, latest_article_date, soup

### We can now download the raw content of each of these 51 articles, in both languages

In [24]:
content_articles = []

for en_article, fr_article in zip(all_recent_english_articles, all_recent_french_articles):
    latest_article_text, latest_article_title, latest_article_date, soup = get_article_text(article_id=en_article['article_id'])
    content_en = {
        'article_id': en_article['article_id'],
        'article_title': latest_article_title,
        'article_link': en_article['article_link'],
        'article_edit_link': en_article['article_edit_link'],
        'article_text': latest_article_text,
        'article_date': latest_article_date,
    }

    latest_article_text, latest_article_title, latest_article_date, soup = get_article_text(article_id=fr_article['article_id'])
    content_fr = {
        'article_id': fr_article['article_id'],
        'article_title': latest_article_title,
        'article_link': fr_article['article_link'],
        'article_edit_link': fr_article['article_edit_link'],
        'article_text': latest_article_text,
        'article_date': latest_article_date,
    }

    content_articles.append({
        'en': content_en,
        'fr': content_fr,
    })

    # TODO: remove this break if everything works fine
    break

pprint(content_articles)

Reading the article at URL https://www.cranial-insertion.com/staff/articles/4376/edit ...
The web page has been read and it's title is « <title>Cranial Insertion | Manage Articles</title> » !
Reading the article at URL https://www.cranial-insertion.com/staff/articles/4379/edit ...
The web page has been read and it's title is « <title>Cranial Insertion | Manage Articles</title> » !
[{'en': {'article_date': '03/31/2025',
         'article_edit_link': 'https://www.cranial-insertion.com/staff/articles/4376/edit',
         'article_id': '4376',
         'article_link': 'https://www.cranial-insertion.com/article/4376',
         'article_text': '[cright=Living Death]"Alex, the answer is "What '
                         'would be an\n'
                         'example of an oximoron?"[/cright]\n'
                         '\n'
                         'Today we have a collection of questions found on the '
                         'IRC Chatroom #magicjudges-rules and the Facebook '
           

Let's split the raw text on the `[q]`/`[Q]` and `[a]`/`[A]` tags, to group it into questions and answers, one at a time. First, a look at the raw content:

In [32]:
from IPython.display import display, Markdown
def print_markdown(text):
    display(Markdown(text))

print_markdown("# Example of a raw content in English:")
print_markdown(content_articles[0]['en']['article_text'][:1000])

print_markdown("# Example of a raw content in French:")
print_markdown(content_articles[0]['fr']['article_text'][:1000])

# Example of a raw content in English:

[cright=Living Death]"Alex, the answer is "What would be an
example of an oximoron?"[/cright]

Today we have a collection of questions found on the IRC Chatroom #magicjudges-rules and the Facebook group Ask the Judge - [b]Magic[/b]: The Gathering Rules and Policy Questions. Feel free to join us in either or both groups.

If you have any [b]Magic[/b] questions burning at the back of your brain, you can send them to us. We may even use them in a future article. If you have a short question, you can send it to us via our Twitter account at [url=https://twitter.com/CranialTweet]@CranialTweet[/url], and you can send us longer questions at [email]moko@cranialinsertion.com[/email].

[hr]
[Q] I have two [c]Squire[/c]s on the battlefield and I have [c]Custodi Soulbinders[/c] and four copies of [c]Bear Cub[/c] in my graveyard. If I cast [c]Living Death[/c], how many +1/+1 counters will Soulbinders enter with?[/Q]

[A] Custodi Soulbinders will enter with zero +1/+1 counters. Souldbinders and the 

# Example of a raw content in French:

[cright=Living Death]"Alex, la réponse est
“Quel serait un bon exemple d’oxymore ?”[/cright]Aujourd’hui nous avons une collection de questions trouvées sur le salon de chat IRC #magicjudges-rules et le groupe Facebook “Ask the Judge - Magic: The Gathering Rules and Policy Questions”. N’hésitez pas à rejoindre ces espaces.

Si vous avez une question [b]Magic[/b] qui vous turlupine, vous pouvez nous l’envoyer. Nous pourrions même l’utiliser dans un futur article. Si vous avez une question courte, vous pouvez l’envoyer à [email]moko@cranialinsertion.com[/email].

(NDLT : Jeopardy est un jeu télévisé américain où on nous donne une réponse, et on doit trouver la question dont c'est la réponse.)

[hr]
[Q=Q :] Je contrôle deux [c=Squire]Écuyers[/c] sur le champ de bataille et j’ai [c=Custodi Soulbinders]Lieurs d'âme des Custodi[/c] et 4 copies d’[c=Bear Cub]Ourson[/c] dans mon cimetière. Si je lance [c=Living Death]Mort vivante[/c], avec combien de marqueurs +1/+1 arrivent les Lieurs sur le c
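
The raw content shows that questions are wrapped in `[Q]...[/Q]` in English and `[Q=Q :]...[/Q]` in French, so the markup itself can drive the alignment instead of sentence boundaries. Below is a minimal sketch, assuming that answers are closed by `[/A]` (the excerpts above only confirm `[/Q]`) and that both versions list their questions in the same order. The names `extract_blocks` and `pair_blocks` are hypothetical, and the two sample strings (including the `[A=R :]` label) are invented for illustration.

```python
import re

# Matches [Q]...[/Q] or [A]...[/A], with an optional "=label" as in "[Q=Q :]".
# Assumption: answers are closed by [/A], which the excerpts do not show.
BLOCK_RE = re.compile(r"\[(Q|A)(?:=[^\]]*)?\](.*?)\[/\1\]", re.DOTALL | re.IGNORECASE)


def extract_blocks(text):
    """Return a list of (kind, content) tuples, where kind is 'Q' or 'A'."""
    return [(kind.upper(), content.strip()) for kind, content in BLOCK_RE.findall(text)]


def pair_blocks(text_en, text_fr):
    """Pair the n-th English block with the n-th French block of the same kind."""
    pairs = []
    for (kind_en, en), (kind_fr, fr) in zip(extract_blocks(text_en), extract_blocks(text_fr)):
        if kind_en != kind_fr:
            # The two versions went out of sync: stop rather than mis-align.
            break
        pairs.append({"kind": kind_en, "en": en, "fr": fr})
    return pairs


# Invented samples that mimic the markup seen in the articles.
sample_en = "Intro.\n[hr]\n[Q] How many counters?[/Q]\n\n[A] Zero counters.[/A]"
sample_fr = "Intro.\n[hr]\n[Q=Q :] Combien de marqueurs ?[/Q]\n\n[A=R :] Zéro marqueur.[/A]"
qa_pairs = pair_blocks(sample_en, sample_fr)
for pair in qa_pairs:
    print(pair)
```

Anything outside the tags (introduction, translator's notes) is ignored, which is convenient here because those parts differ between the two versions.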

### Now a difficult task: split the article content by sentences / paragraphs, and group them by pairs of corresponding translated text

I'm trying but it seems to be so hard...

In [35]:
# prompt: I have the content of a long text file in « content_article['en']['article_text'] » for the English version, and in « content_article['fr']['article_text'] » for the translated French version.
# I want to split both of these long texts by sentences or paragraphs, while being certain that each piece of the French text is associated with its corresponding piece in the English text.

import re

def split_text(text):
    # Split by sentence boundaries, handling various punctuation and abbreviations.
    # This regex is a basic example and might need refinement for specific cases.
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', text)
    return [s for s in sentences if s]

# Split the English and French texts into sentences
english_sentences = split_text(content_articles[0]['en']['article_text'])
french_sentences = split_text(content_articles[0]['fr']['article_text'])

# Pair the sentences.  A more robust approach would handle differences in length.
paired_sentences = []
min_len = min(len(english_sentences), len(french_sentences))
for i in range(min_len):
    paired_sentences.append({'en': english_sentences[i], 'fr': french_sentences[i]})

# Example usage: print all the pairs
for i, pair in enumerate(paired_sentences, start=1):
    print(f"# Pair {i}:")
    print(f"- English: {pair['en']}")
    print(f"- French: {pair['fr']}")
    print("---")

# Pair 1:
- English: [cright=Living Death]"Alex, the answer is "What would be an
example of an oximoron?"[/cright]

Today we have a collection of questions found on the IRC Chatroom #magicjudges-rules and the Facebook group Ask the Judge - [b]Magic[/b]: The Gathering Rules and Policy Questions.
- French: [cright=Living Death]"Alex, la réponse est
“Quel serait un bon exemple d’oxymore ?”[/cright]Aujourd’hui nous avons une collection de questions trouvées sur le salon de chat IRC #magicjudges-rules et le groupe Facebook “Ask the Judge - Magic: The Gathering Rules and Policy Questions”.
---
# Pair 2:
- English: Feel free to join us in either or both groups.
- French: N’hésitez pas à rejoindre ces espaces.
---
# Pair 3:
- English: 
If you have any [b]Magic[/b] questions burning at the back of your brain, you can send them to us.
- French: 
Si vous avez une question [b]Magic[/b] qui vous turlupine, vous pouvez nous l’envoyer.
---
# Pair 4:
- English: We may even use them in a future article
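
Pairing sentences by position drifts silently as soon as one language splits a sentence differently. A cheap sanity check, sketched below, is to compare the character lengths of the two sides and set aside pairs whose lengths differ too much. This is only a heuristic: the threshold `max_ratio=2.0` is an arbitrary assumption, and the name `filter_pairs_by_length_ratio` is hypothetical.

```python
def filter_pairs_by_length_ratio(pairs, max_ratio=2.0):
    """Split pairs into (kept, rejected) based on the ratio of their text lengths."""
    kept, rejected = [], []
    for pair in pairs:
        len_en = len(pair["en"].strip())
        len_fr = len(pair["fr"].strip())
        if len_en == 0 or len_fr == 0:
            # An empty side can never be a valid translation pair.
            rejected.append(pair)
            continue
        ratio = max(len_en, len_fr) / min(len_en, len_fr)
        if ratio <= max_ratio:
            kept.append(pair)
        else:
            rejected.append(pair)
    return kept, rejected


example_pairs = [
    {"en": "Feel free to join us in either or both groups.",
     "fr": "N’hésitez pas à rejoindre ces espaces."},
    # Invented mismatch, as would happen after the pairing has drifted.
    {"en": "Yes.", "fr": "Une très longue phrase qui ne correspond pas du tout."},
]
kept, rejected = filter_pairs_by_length_ratio(example_pairs)
print(f"{len(kept)} pair(s) kept, {len(rejected)} pair(s) rejected")
```

A rejected pair is a hint that the alignment should be checked by hand from that point on, not proof that the translation is wrong.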

In [33]:
all_pairs_of_paragraphs = []

for content_article in content_articles:
    pairs_of_paragraphs = []
    paragraphs_en = content_article['en']['article_text'].split('\n\n')
    paragraphs_fr = content_article['fr']['article_text'].split('\n\n')
    for i, paragraph_en in enumerate(paragraphs_en):
        paragraph_fr = paragraphs_fr[i]
        pair_of_paragraphs = {
            'en': paragraph_en,
            'fr': paragraph_fr,
        }
        pprint(pair_of_paragraphs)
        pairs_of_paragraphs.append(pair_of_paragraphs)

    all_pairs_of_paragraphs.append(pairs_of_paragraphs)

{'en': '[cright=Living Death]"Alex, the answer is "What would be an\n'
       'example of an oximoron?"[/cright]',
 'fr': '[cright=Living Death]"Alex, la réponse est\n'
       '“Quel serait un bon exemple d’oxymore ?”[/cright]Aujourd’hui nous '
       'avons une collection de questions trouvées sur le salon de chat IRC '
       '#magicjudges-rules et le groupe Facebook “Ask the Judge - Magic: The '
       'Gathering Rules and Policy Questions”. N’hésitez pas à rejoindre ces '
       'espaces.'}
{'en': 'Today we have a collection of questions found on the IRC Chatroom '
       '#magicjudges-rules and the Facebook group Ask the Judge - '
       '[b]Magic[/b]: The Gathering Rules and Policy Questions. Feel free to '
       'join us in either or both groups.',
 'fr': 'Si vous avez une question [b]Magic[/b] qui vous turlupine, vous pouvez '
       'nous l’envoyer. Nous pourrions même l’utiliser dans un futur article. '
       'Si vous avez une question courte, vous pouvez l’envoyer à '
    

IndexError: list index out of range
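
The `IndexError` comes from reading `paragraphs_fr[i]` while iterating over the English paragraphs: as the first printed pair shows, the French text merges the epigraph with the first paragraph, so it ends up with fewer `\n\n`-separated chunks than the English text. A minimal sketch using `itertools.zip_longest` from the standard library pairs what exists and marks the rest with `None`; the name `pair_paragraphs` is hypothetical. It avoids the crash but does not repair the misalignment.

```python
from itertools import zip_longest


def pair_paragraphs(text_en, text_fr, separator="\n\n"):
    """Pair paragraphs by position, using None where one side runs out."""
    paragraphs_en = text_en.split(separator)
    paragraphs_fr = text_fr.split(separator)
    pairs = [
        {"en": en, "fr": fr}
        for en, fr in zip_longest(paragraphs_en, paragraphs_fr, fillvalue=None)
    ]
    # When the counts differ, positional pairs cannot be trusted.
    same_count = len(paragraphs_en) == len(paragraphs_fr)
    return pairs, same_count


# Invented sample: the French side merges the first two paragraphs.
paragraph_pairs, same_count = pair_paragraphs("One.\n\nTwo.\n\nThree.", "Un. Deux.\n\nTrois.")
print(same_count)
for pair in paragraph_pairs:
    print(pair)
```

When `same_count` is `False`, the pairs of that article are better set aside than added to the dataset as they are.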