# Create Luxembourgish instruction tuning dataset from Wikipedia

In this notebook, we create an instruction tuning dataset for fine-tuning Llama models from Luxembourgish Wikipedia data. We use Llama-3.1-405b-instruct to generate the synthetic data, accessed with a Novita AI API key (https://novita.ai/). To keep the dataset small, we use only 100 samples from the preprocessed Wikipedia dump. We request 5 instruction-answer pairs per entry, for a target of 500 synthetic instruction-answer pairs in Luxembourgish.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import json

# read in Wiki data from json file
with open('/content/drive/My Drive/Master Thesis/Data/filtered_wiki_articles.json', 'r') as f:
    wiki_data = json.load(f)

In [None]:
len(wiki_data)

100

In [None]:
wiki_data[0]

{'id': '116845',
 'revid': '1247',
 'url': 'https://lb.wikipedia.org/wiki?curid=116845',
 'title': 'Deborah De Robertis',
 'text': 'D\'Deborah De Robertis, gebuer den 12. Februar 1984 zu Lëtzebuerg, ass eng lëtzebuergesch-franséischPerformance-Kënschtlerin. \nSi ass um Lampertsbierg opgewuess, hier Elteren hunn déi franséisch (Mamm) an italieenesch (Papp) Nationalitéit.\nD\'Deborah de Robertis huet op der École de recherche graphique zu Bréissel studéiert. An hire Performancë stellt si Traditiounen a Konventiounen a Fro, wéi d\'Relatioun tëscht Mann a Fra, tëscht Kënschtler a Kurator resp. Galerist, asw.\nVu September bis November 2013 hat s\'eng Artiste-Residenz an der Paräisser "Cité internationale des arts", fir déi si vum Lëtzebuerger Kulturministère ausgewielt gi war.\nHir bekanntst Performance bis ewell ass déi, déi se "Miroir de l\'origine" nennt. Als Referenz op oder als Reverenz fir dem Gustave Courbet säin Tableau "L\'origine du monde" huet si sech viru Konschtwierker a Musée

In [None]:
# define instruction
instruction = "The objective is to create a synthetic dataset in Luxembourgish \
for instruction fine-tuning. You are an assistant and generate all your \
responses in Luxembourgish. You will receive a series of texts one after the \
other. Generate a dataset from the text given with 5 unique pairs of \
instruction and expected response. Give elaborate and high-quality responses \
and answer as accurately as possible. Make sure that the instruction contains \
all relevant information that is needed to give the response. If the question \
does not contain enough information to be answered accurately, reply that you \
need more information to answer the question. Whenever a date is provided for \
a news article, include the date in the instruction and its respective response \
or just in the response, depending on what you consider best in the context. \
Also give temporal context if possible. Name the columns instruction and \
response respectively. Put everything into a JSON file."
instruction

'The objective is to create a synthetic dataset in Luxembourgish for instruction fine-tuning. You are an assistant and generate all your responses in Luxembourgish. You will receive a series of texts one after the other. Generate a dataset from the text given with 5 unique pairs of instruction and expected response. Give elaborate and high-quality responses and answer as accurately as possible. Make sure that the instruction contains all relevant information that is needed to give the response. If the question does not contain enough information to be answered accurately, reply that you need more information to answer the question. Whenever a date is provided for a news article, include the date in the instruction and its respective response or just in the response, depending on what you consider best in the context. Also give temporal context if possible. Name the columns instruction and response respectively. Put everything into a JSON file.'

In [None]:
# Check the prompt formatting
for article in wiki_data[:5]:
  title = article.get("title")
  text = article.get("text")
  prompt = instruction + "\n\n" + "Text: " + "\n\n" + title + "\n\n" +  text
  print(prompt)
  print("---------------------------------")


The objective is to create a synthetic dataset in Luxembourgish for instruction fine-tuning. You are an assistant and generate all your responses in Luxembourgish. You will receive a series of texts one after the other. Generate a dataset from the text given with 5 unique pairs of instruction and expected response. Give elaborate and high-quality responses and answer as accurately as possible. Make sure that the instruction contains all relevant information that is needed to give the response. If the question does not contain enough information to be answered accurately, reply that you need more information to answer the question. Whenever a date is provided for a news article, include the date in the instruction and its respective response or just in the response, depending on what you consider best in the context. Also give temporal context if possible. Name the columns instruction and response respectively. Put everything into a JSON file.

Text: 

Deborah De Robertis

D'Deborah De 

# API PART - PROCEED WITH CAUTION (COSTS MONEY) !!!!!

In [None]:
# !pip install -q openai


In [None]:
# from openai import OpenAI
# from google.colab import userdata

In [None]:
# # From Novita AI
# client = OpenAI(
#   base_url = "https://api.novita.ai/v3/openai",
#   api_key = userdata.get("NOVITA_AI_UNI")
# )

In [None]:
# model_responses = []

In [None]:
# for article in wiki_data:
#   title = article.get("title")
#   text = article.get("text")
#   prompt = instruction + "\n\n" + "Text: " + "\n\n" + title + "\n\n" +  text

#   completion = client.chat.completions.create(
#     model="meta-llama/llama-3.1-405b-instruct",
#     messages=[{"role":"user","content":prompt}],
#     temperature=0.2,
#     top_p=0.7,
#     max_tokens=1024
#   )

#   answer = completion.choices[0].message.content
#   model_responses.append(answer)  # store model answer

#   print(answer)
#   print("---------------------------------")


Here is the JSON file with 5 unique pairs of instruction and expected response:

```
[
  {
    "instruction": "Ginn déi wichtegst Informatiounen iwwer d'Deborah De Robertis.",
    "response": "D'Deborah De Robertis ass eng lëtzebuergesch-franséisch Performance-Kënschtlerin, gebuer den 12. Februar 1984 zu Lëtzebuerg. Si ass um Lampertsbierg opgewuess an huet op der École de recherche graphique zu Bréissel studéiert."
  },
  {
    "instruction": "Wat ass déi bekanntst Performance vun der Deborah De Robertis?",
    "response": "Déi bekanntst Performance vun der Deborah De Robertis ass 'Miroir de l'origine', wou si sech viru Konschtwierker a Muséeën ouni Béxelche fotograféiere gelooss huet, als Referenz op dem Gustave Courbet säin Tableau 'L'origine du monde'."
  },
  {
    "instruction": "Wat ass den Hannergrond vun der Performance vun der Deborah De Robertis zu Lourdes den 31. August 2018?",
    "response": "Den 31. August 2018 huet d'Deborah De Robertis zu Lourdes virun der Grott vu Mas

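The request loop above has no handling for transient API failures, so a single failed request would abort the run partway through the articles. A minimal sketch of a retry wrapper with exponential backoff follows. The names `call_with_retry`, `request_fn`, `max_attempts` and `base_delay` are hypothetical and not part of the original notebook, and the broad `except` is a simplification.

```python
import time


def call_with_retry(request_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request_fn() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**attempt seconds between attempts.
    Returns the result, or None if every attempt raised an exception.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as err:  # broad on purpose: this is only a sketch
            print(f"Attempt {attempt + 1} failed: {err}")
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Hypothetical use inside the article loop (requires the client defined above):
# answer = call_with_retry(lambda: client.chat.completions.create(
#     model="meta-llama/llama-3.1-405b-instruct",
#     messages=[{"role": "user", "content": prompt}],
#     temperature=0.2, top_p=0.7, max_tokens=1024,
# ).choices[0].message.content)
```

A `None` result could be recorded alongside the article title so that only the failed articles need to be requested again.
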
# Clean responses

The model wraps the JSON array in explanatory text and code fences. Since we only need the JSON itself, we strip everything else from each response.

In [None]:
def extract_json(text):
  """Extracts the substring between the first '[' and the last ']' (inclusive), i.e. the JSON array."""
  start = text.find('[')
  end = text.rfind(']')
  if start != -1 and end != -1 and start < end:
    return text[start:end + 1]  # Include both brackets
  else:
    return ""  # Return empty string if brackets are missing or in the wrong order
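
To illustrate how this helper behaves, here is a small self-contained sketch that repeats the function and applies it to two made-up strings (the `sample` and `truncated` values are illustrative, not real model output). It shows that surrounding prose is dropped and that a response cut off before the closing bracket yields an empty string.

```python
import json


def extract_json(text):
    """Same logic as the notebook helper: keep the first '[' through the last ']'."""
    start = text.find('[')
    end = text.rfind(']')
    if start != -1 and end != -1 and start < end:
        return text[start:end + 1]
    return ""


# Made-up response with prose before and after the array
sample = 'Here is the JSON file:\n\n[\n  {"instruction": "Q", "response": "A"}\n]\n\nLet me know if you need more.'
extracted = extract_json(sample)
print(json.loads(extracted))

# Made-up response that was cut off before the closing bracket
truncated = 'Here is the JSON file:\n[\n  {"instruction": "Q", "response": "A'
print(repr(extract_json(truncated)))
```

One caveat of this approach: if the prose before the array itself contains a `[`, the extracted span starts too early and will most likely fail to parse later.
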

In [None]:
model_responses_clean = []

In [None]:
for response in model_responses:
  model_responses_clean.append(extract_json(response))

In [None]:
import json

# save model responses

with open('/content/drive/My Drive/Master Thesis/model_responses_clean_wiki.json', 'w') as f:
  json.dump(model_responses_clean, f)

In [None]:
# filter out empty strings
model_responses_clean = [s for s in model_responses_clean if s]

In [None]:
len(model_responses_clean)

100

In [None]:
# handle invalid JSON
combined_data = []

for json_string in model_responses_clean:
  try:
    data = json.loads(json_string.strip())
    combined_data.extend(data)
  except json.JSONDecodeError:
    print(f"Skipping invalid JSON: {json_string}")

Skipping invalid JSON: [
  {
    "instruction": "Ginn mir d'Gebuertsdatum vum Giovanni Schiaparelli.",
    "response": "De Giovanni Schiaparelli ass de 14. Mäerz 1835 gebuer."
  },
  {
    "instruction": "Wat huet de Schiaparelli am Joer 1877 entdeckt?",
    "response": "Am Joer 1877 huet de Schiaparelli d'sougenannte Marskanäl ("Canali") entdeckt, wéi de Mars der Äerd besonnesch no koum."
  },
  {
    "instruction": "Firwat huet de Percival Lowell de Flagstaff-Observatoire am Arizona gebaut?",
    "response": "De Percival Lowell huet de Flagstaff-Observatoire am Arizona gebaut, fir d'Marskanäl an d'“Liewen“ um Mars z'erfuerschen."
  },
  {
    "instruction": "Wat huet d'US-Sond Mariner 4 am Joer 1965 erausfonnt?",
    "response": "D'US-Sond Mariner 4 huet am Joer 1965 erausfonnt, datt d'Marskanäl gréisstendeels als optesch Täuschung duerch Kontrasteffekter gëllen."
  },
  {
    "instruction": "Wat fir Auszeechnungen huet de Schiaparelli krut?",
    "response": "De Schiaparelli huet am

Note: 2 of the 100 responses could not be parsed as JSON. They are caught by the exception handler and excluded from the instruction tuning set, leaving 490 pairs. Upon inspection with Gemini Advanced, the first faulty response was attributed to "curly quotes" (“). Curly quotes are, however, valid inside a JSON string value; the output above also shows unescaped straight double quotes around "Canali" inside a response value, which end the string early and are the more likely cause. Gemini did not detect a fault in the other response, so we suspect it contains invalid or hidden characters that JSON does not support.
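
Rather than guessing at the cause, the parser can report where it stopped. The sketch below uses the `msg`, `pos`, `lineno` and `colno` attributes of `json.JSONDecodeError`; the helper name `describe_json_error` and the two miniature test strings are hypothetical, modelled on the skipped response rather than copied from it.

```python
import json


def describe_json_error(json_string, context=30):
    """Return None if the string parses, else a short description of where parsing failed."""
    try:
        json.loads(json_string)
        return None
    except json.JSONDecodeError as err:
        start = max(err.pos - context, 0)
        snippet = json_string[start:err.pos + context]
        return f"{err.msg} at line {err.lineno}, column {err.colno}: {snippet!r}"


# Straight double quotes inside a value that are not escaped: parsing fails
bad = '[{"instruction": "Q", "response": "d\'Marskanäl ("Canali") entdeckt"}]'
# Curly quotes inside a value: parsing succeeds
curly = '[{"instruction": "Q", "response": "d\'“Liewen“ um Mars"}]'

print(describe_json_error(bad))
print(describe_json_error(curly))
```

Running this helper over the skipped strings would show the exact character at which each one breaks, including any hidden characters in the second response.
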

In [None]:
# Write the combined data to a new JSON file
with open('/content/drive/My Drive/Master Thesis/instruction_answer_wiki.json', 'w') as f:
  json.dump(combined_data, f, indent=4)  # Use indent for pretty printing

In [None]:
len(combined_data)

490

# Replace keys

Note: Some responses were generated with non-English or misspelled keys, such as "Instruktioun", "Répons", "Réponse", "Respon" and "Äntwert". We rename these to the English keys "instruction" and "response" so that every entry uses the same schema. For this we use dict.pop(), which removes the old key and returns its value, so the value can be assigned to the new key.

In [None]:
all_keys = set()
for item in combined_data:
  all_keys.update(item.keys())

all_keys = list(all_keys)
print(all_keys)


['text', 'response', 'instruction']


In [None]:
# replace invalid keys with valid keys
key_map = {
  "Instruktioun": "instruction",
  "Répons": "response",
  "Äntwert": "response",
  "Respon": "response",
  "Réponse": "response",
}

for item in combined_data:
  for old_key, new_key in key_map.items():
    if old_key in item:
      item[new_key] = item.pop(old_key)
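
After renaming, it can be worth confirming that no unexpected keys remain before the combined `text` field is built. The following is a minimal sketch of such a check; `EXPECTED_KEYS`, `find_malformed` and the `sample` list are hypothetical names and data used only for illustration.

```python
EXPECTED_KEYS = {"instruction", "response"}


def find_malformed(items):
    """Return (index, keys) for every item whose keys differ from the expected pair."""
    return [
        (i, sorted(item.keys()))
        for i, item in enumerate(items)
        if set(item.keys()) != EXPECTED_KEYS
    ]


# Made-up data: the second entry still carries a Luxembourgish key
sample = [
    {"instruction": "Q1", "response": "A1"},
    {"Instruktioun": "Q2", "response": "A2"},
]
print(find_malformed(sample))  # [(1, ['Instruktioun', 'response'])]
```

An empty result on `combined_data` would indicate that every entry has exactly the two expected keys at that point in the notebook.
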


In [None]:
# add new column to json file that combines instruction and response to desired format
# following: https://huggingface.co/datasets/vicgalle/alpaca-gpt4?row=1
for i in range(len(combined_data)):
  instruction = combined_data[i]["instruction"]
  response = combined_data[i]["response"]
  text = f"The following provides an instruction in form of a question. \
  Give a response that answers the question accurately. \
  \n\n###Instruction: {instruction}\n\n###Response: {response}"
  combined_data[i]["text"] = text
combined_data

[{'instruction': "Ginn déi wichtegst Informatiounen iwwer d'Deborah De Robertis.",
  'response': "D'Deborah De Robertis ass eng lëtzebuergesch-franséisch Performance-Kënschtlerin, gebuer den 12. Februar 1984 zu Lëtzebuerg. Si ass um Lampertsbierg opgewuess an huet op der École de recherche graphique zu Bréissel studéiert.",
  'text': "The following provides an instruction in form of a question.   Give a response that answers the question accurately.   \n\n###Instruction: Ginn déi wichtegst Informatiounen iwwer d'Deborah De Robertis.\n\n###Response: D'Deborah De Robertis ass eng lëtzebuergesch-franséisch Performance-Kënschtlerin, gebuer den 12. Februar 1984 zu Lëtzebuerg. Si ass um Lampertsbierg opgewuess an huet op der École de recherche graphique zu Bréissel studéiert."},
 {'instruction': 'Wat ass déi bekanntst Performance vun der Deborah De Robertis?',
  'response': "Déi bekanntst Performance vun der Deborah De Robertis ass 'Miroir de l'origine', wou si sech viru Konschtwierker a Mus
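
The `text` values in the output above contain runs of three spaces (for example after "question."), because the indentation that follows each backslash line continuation becomes part of the f-string. If that whitespace is not wanted, one option is to build the template from adjacent string literals. This is only a sketch: `PROMPT_TEMPLATE` and `format_example` are hypothetical names, and adopting it would change the generated `text` column, so the dataset would need to be regenerated.

```python
# Adjacent string literals are concatenated without any added whitespace
PROMPT_TEMPLATE = (
    "The following provides an instruction in form of a question. "
    "Give a response that answers the question accurately."
    "\n\n###Instruction: {instruction}\n\n###Response: {response}"
)


def format_example(instruction, response):
    """Fill the template with one instruction-response pair."""
    return PROMPT_TEMPLATE.format(instruction=instruction, response=response)


example = format_example("Wat ass d'Haaptstad vu Lëtzebuerg?", "D'Haaptstad ass Lëtzebuerg.")
print(example)
```
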

In [None]:
# Write the combined data to a new JSON file
with open('/content/drive/My Drive/Master Thesis/instruction_tuning_wiki.json', 'w') as f:
  json.dump(combined_data, f, indent=4)  # Use indent for pretty printing

# Import data as dataframe

Let's load the dataset into a DataFrame and take a look.

In [None]:
import pandas as pd

# Read the JSON file into a pandas DataFrame
df = pd.read_json('/content/drive/My Drive/Master Thesis/instruction_tuning_wiki.json')

# Display the DataFrame
df

Unnamed: 0,instruction,response,text
0,Ginn déi wichtegst Informatiounen iwwer d'Debo...,D'Deborah De Robertis ass eng lëtzebuergesch-f...,The following provides an instruction in form ...
1,Wat ass déi bekanntst Performance vun der Debo...,Déi bekanntst Performance vun der Deborah De R...,The following provides an instruction in form ...
2,Wat ass den Hannergrond vun der Performance vu...,Den 31. August 2018 huet d'Deborah De Robertis...,The following provides an instruction in form ...
3,Wat huet d'Deborah De Robertis den 6. Mee 2024...,Den 6. Mee 2024 huet d'Deborah De Robertis am ...,The following provides an instruction in form ...
4,Wéi gouf d'Deborah De Robertis den 29. Mee 202...,Den 29. Mee 2024 gouf d'Deborah De Robertis vi...,The following provides an instruction in form ...
...,...,...,...
485,Ginn mir d'Geschicht vun der Band 'The Flowers...,D'Band 'The Flowers Of Romance' gouf 1976 gegr...,The following provides an instruction in form ...
486,Wéini gouf déi éischt Formatioun vun 'The Flow...,Déi éischt Formatioun vun 'The Flowers Of Roma...,The following provides an instruction in form ...
487,Wéini gouf déi zweet Formatioun vun 'The Flowe...,Déi zweet Formatioun vun 'The Flowers Of Roman...,The following provides an instruction in form ...
488,Wat ass de Bezuch tëscht der Band 'Public Imag...,Public Image Ltd. hunn 1981 ee vun hiren Album...,The following provides an instruction in form ...


Finally, we export the dataset to CSV format.

In [None]:
# Export the DataFrame to a CSV file
df.to_csv('/content/drive/My Drive/Master Thesis/instruction_tuning_wiki.csv', index=False)
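
Because the `text` column contains embedded newlines, it is worth checking that such fields survive a CSV round trip. The sketch below does this in memory with the standard library `csv` module rather than pandas, purely to keep it dependency-free; the single row in `rows` is made up for illustration.

```python
import csv
import io

# Made-up row with the same column names and an embedded blank line in "text"
rows = [
    {
        "instruction": "Q?",
        "response": "A.",
        "text": "###Instruction: Q?\n\n###Response: A.",
    }
]

# Write to an in-memory buffer; newline="" leaves line endings to the csv module
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["instruction", "response", "text"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and compare with the original rows
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer)]
print(restored == rows)
```

Fields with newlines are written inside double quotes, so a CSV-aware reader restores them intact, whereas a tool that splits the file line by line would not.
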