# Create Luxembourgish instruction tuning dataset from RTL news articles

In this notebook, we create an instruction-tuning dataset for fine-tuning Llama models from Luxembourgish RTL news data. We use Llama-3.1-405b-instruct to generate the synthetic data through the Novita AI API (https://novita.ai/), which requires an API key. To keep the dataset small, we use only 400 articles from the preprocessed RTL news data. We request 5 instruction-answer pairs per article, for a target of 2000 synthetic instruction-answer pairs in Luxembourgish.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import json

# read in RTL data from json file
with open('/content/drive/My Drive/Master Thesis/Data/filtered_rtl_news.json', 'r') as f:
    rtl_data = json.load(f)

In [None]:
len(rtl_data)

400

In [None]:
rtl_data[0]

{'category_name': 'National',
 'category_id': 3,
 'article_id': 1277543,
 'type': 'news',
 'public_date': '2018-12-07 17:50:06',
 'title': 'Mat Summerpneuen en Accident gehat an ugedréckt',
 'header': 'Bei -4 Grad war e Chauffer am Februar zu Bieles accidentéiert. Hien hat Summerpneuen drop a war duerno fortgaangen ... an hofft elo op eng niddreg Strof.',
 'text': ' D\'Zäit vun de manner gudde Fuerkonditiounen op eise Stroosse réckelt am Prinzip méi no. An deem Fall, sief et z.B. bei Schnéi oder bei Glatäis, soll ee bekanntlech Wanterpneuen op sengem Auto hunn. Well hien dat net hat an dowéinst am Februar zu Bieles en Accident provozéiert hat, stoung e Freideg de Moien e Mann vu 37 Joer an der Stad virum Riichter. No der Kollisioun, nuets, mat enger Balustrad, war de Mann ugaangen ... Hie wär vun der Plaz vum Accident fortgaangen, well hien erféiert war, sot den Ugekloten e Freideg. Hie wär zwar nëmme liicht blesséiert, awer erschreckt gewiescht; Stonnen drop hätt hie sech du bei der P

In [None]:
# define instruction
instruction = "The objective is to create a synthetic dataset in Luxembourgish \
for instruction fine-tuning. You are an assistant and generate all your \
responses in Luxembourgish. You will receive a series of texts one after the \
other. Generate a dataset from the text given with 5 unique pairs of \
instruction and expected response. Give elaborate and high-quality responses \
and answer as accurately as possible. Make sure that the instruction contains \
all relevant information that is needed to give the response. If the question \
does not contain enough information to be answered accurately, reply that you \
need more information to answer the question. Whenever a date is provided for \
a news article, include the date in the instruction and its respective response \
or just in the response, depending on what you consider best in the context. \
Also give temporal context if possible. Name the columns instruction and \
response respectively. Put everything into a JSON file."
instruction

'The objective is to create a synthetic dataset in Luxembourgish for instruction fine-tuning. You are an assistant and generate all your responses in Luxembourgish. You will receive a series of texts one after the other. Generate a dataset from the text given with 5 unique pairs of instruction and expected response. Give elaborate and high-quality responses and answer as accurately as possible. Make sure that the instruction contains all relevant information that is needed to give the response. If the question does not contain enough information to be answered accurately, reply that you need more information to answer the question. Whenever a date is provided for a news article, include the date in the instruction and its respective response or just in the response, depending on what you consider best in the context. Also give temporal context if possible. Name the columns instruction and response respectively. Put everything into a JSON file.'

In [None]:
from datetime import datetime

def get_date_only(date_string):
  """Returns only the date from a date time string"""
  datetime_object = datetime.strptime(date_string, '%Y-%m-%d %H:%M:%S')
  date = datetime_object.strftime('%Y-%m-%d')
  return date

In [None]:
# Check the prompt formatting
for article in rtl_data[:5]:
  title = article.get("title")
  header = article.get("header")
  text = article.get("text")
  date = get_date_only(article.get("public_date"))
  prompt = instruction + "\n\n" + "Text: " + "\n\n" + "News article from: " + date + "\n\n" + title + "\n\n" + header + "\n\n" + text
  print(prompt)
  print("---------------------------------")


The objective is to create a synthetic dataset in Luxembourgish for instruction fine-tuning. You are an assistant and generate all your responses in Luxembourgish. You will receive a series of texts one after the other. Generate a dataset from the text given with 5 unique pairs of instruction and expected response. Give elaborate and high-quality responses and answer as accurately as possible. Make sure that the instruction contains all relevant information that is needed to give the response. If the question does not contain enough information to be answered accurately, reply that you need more information to answer the question. Whenever a date is provided for a news article, include the date in the instruction and its respective response or just in the response, depending on what you consider best in the context. Also give temporal context if possible. Name the columns instruction and response respectively. Put everything into a JSON file.

Text: 

News article from: 2018-12-07

Mat
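The prompt layout checked above can also be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: `build_prompt` is a hypothetical name, and replacing missing or null fields with empty strings is an assumption (the loop above would raise a `TypeError` when concatenating `None`). The example record reuses the title and date of the first article; its `header` and `text` values are placeholders.

```python
from datetime import datetime

def get_date_only(date_string):
    """Return only the date part of a 'YYYY-MM-DD HH:MM:SS' string."""
    return datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d")

def build_prompt(instruction, article):
    """Assemble the prompt in the same layout as the loop above.

    Assumption: missing or null fields are replaced by empty strings.
    """
    title = article.get("title") or ""
    header = article.get("header") or ""
    text = article.get("text") or ""
    date = get_date_only(article["public_date"])
    return (instruction + "\n\n" + "Text: " + "\n\n" + "News article from: " + date
            + "\n\n" + title + "\n\n" + header + "\n\n" + text)

example = {
    "public_date": "2018-12-07 17:50:06",
    "title": "Mat Summerpneuen en Accident gehat an ugedréckt",
    "header": None,  # hypothetical: a record without a header
    "text": "...",   # placeholder for the article body
}
prompt = build_prompt("Example instruction.", example)
print(prompt)
```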

# API PART - PROCEED WITH CAUTION (COSTS MONEY) !!!!!
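Because every request is billed, one way to avoid paying twice after an interrupted run is to save the responses after each call and resume from the saved file. The sketch below illustrates that idea under stated assumptions: `generate_with_checkpoint`, `call_model`, and `checkpoint_path` are hypothetical names, and `call_model` stands for any function that takes a prompt and returns the model's answer as a string (for example, a wrapper around the chat-completion request used further down).

```python
import json
import os

def generate_with_checkpoint(prompts, call_model, checkpoint_path):
    """Call `call_model` once per prompt, saving all responses after every call.

    `call_model` is a hypothetical callable (prompt -> str). If the checkpoint
    file already exists, prompts that were answered earlier are skipped.
    """
    responses = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            responses = json.load(f)  # resume from an earlier run

    # Only the prompts without a stored response are sent to the model
    for prompt in prompts[len(responses):]:
        responses.append(call_model(prompt))
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump(responses, f, ensure_ascii=False)
    return responses
```

This assumes the list of prompts keeps the same order between runs, since responses are matched to prompts by position.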

In [None]:
# !pip install -q openai


In [None]:
# from openai import OpenAI
# from google.colab import userdata

In [None]:
# # From Novita AI
# client = OpenAI(
#   base_url = "https://api.novita.ai/v3/openai",
#   api_key = userdata.get("NOVITA_AI_UNI")
# )

In [None]:
# model_responses = []

In [None]:
# for article in rtl_data:
#   title = article.get("title")
#   header = article.get("header")
#   text = article.get("text")
#   date = get_date_only(article.get("public_date"))
#   prompt = instruction + "\n\n" + "Text: " + "\n\n" + "News article from: " + date + "\n\n" + title + "\n\n" + header + "\n\n" + text

#   completion = client.chat.completions.create(
#     model="meta-llama/llama-3.1-405b-instruct",
#     messages=[{"role":"user","content":prompt}],
#     temperature=0.2,
#     top_p=0.7,
#     max_tokens=1024
#   )

#   answer = completion.choices[0].message.content
#   model_responses.append(answer)  # store model answer

#   print(answer)
#   print("---------------------------------")


Streaming output truncated to the last 5000 lines.
Note: I've kept the responses as close to the original text as possible, while making sure they are concise and clear. I've also tried to include all relevant information in the instruction, while avoiding unnecessary details. Let me know if you need any further adjustments!
---------------------------------
Here is the JSON file with 5 unique pairs of instruction and expected response:

```
[
  {
    "instruction": "Wéini wosst de Premier vu Spionage-Auer? Gëtt et eng Prescriptioun, d.h. eng Verjärung derfir gëtt, wann een deen aneren heemlech enregistréiert?",
    "response": "Et geet ëm d'Fro, wéini de Premier gewuer ginn ass, datt de Marco Mille hie mat der Spionage-Auer enregistréiert hat. De Verjärungs-Delai ass do wahrscheinlech 3 Joer."
  },
  {
    "instruction": "Wéini huet de Premier de Charel Goerens, de President vun der Geheimdéngschtkontrollkommissioun informéiert?",
    "response": "Am Mäerz 2009

# Clean responses

Since we only need the JSON array in each response, we discard any surrounding text, such as introductory sentences or closing notes.

In [None]:
def extract_json(text):
  """Extracts the substring between the first '[' and the last ']', including both brackets."""
  start = text.find('[')
  end = text.rfind(']')
  if start != -1 and end != -1 and start < end:
    return text[start:end + 1]  # Include the brackets
  else:
    return ""  # Return empty string if brackets not found or in wrong order
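As a quick illustration of what this step does, the sketch below applies the same function to a made-up response. The `sample` string is hypothetical; it only imitates the shape of the model output shown earlier, with an introductory sentence and a closing note around the JSON array.

```python
import json

def extract_json(text):
    """Return the substring from the first '[' to the last ']' (inclusive), or '' if absent."""
    start = text.find('[')
    end = text.rfind(']')
    if start != -1 and end != -1 and start < end:
        return text[start:end + 1]
    return ""

# Hypothetical response: prose before and after the JSON array
sample = (
    "Here is the JSON file with 5 unique pairs of instruction and expected response:\n\n"
    '[{"instruction": "Wann gëtt d\'Uerteel gesprach?", '
    '"response": "D\'Uerteel gëtt den 21. Dezember 2018 gesprach."}]\n\n'
    "Note: Let me know if you need any further adjustments!"
)

pairs = json.loads(extract_json(sample))
print(pairs[0]["instruction"])
```

Note that the function relies on the first `[` and the last `]` in the whole response, so square brackets in the surrounding prose would end up inside the extracted string and make it invalid JSON.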

In [None]:
model_responses_clean = []

In [None]:
for i in range(len(model_responses)):
  model_responses_clean.append(extract_json(model_responses[i]))

In [None]:
import json

# save model responses

with open('/content/drive/My Drive/Master Thesis/model_responses_clean_rtl.json', 'w') as f:
  json.dump(model_responses_clean, f)

In [None]:
# filter out empty strings
model_responses_clean = [s for s in model_responses_clean if s]

In [None]:
len(model_responses_clean)

400

In [None]:
# handle invalid JSON
combined_data = []

for json_string in model_responses_clean:
  try:
    data = json.loads(json_string.strip())
    combined_data.extend(data)  # append each dictionary to the list
  except json.JSONDecodeError:
    print(f"Skipping invalid JSON: {json_string}")

Skipping invalid JSON: [
  {
    "instruction": "Wat ass de Zweck vun den Ateliers Nesting, déi vum Pôle Femme, Mère, Enfant des Hôpitaux Robert Schuman zanter März 2018 ugebued goufen?",
    "response": "De Zweck vun den Ateliers Nesting ass et, fir Elteren, Gesondheetspersonal a Leit, déi an der klenger Kannerzäit schaffen, ze informéieren iwwer d'Risiken, déi mat der Loftvervuurechtung an der Wunneng ze dinn hunn, an hinnen ze hëllefen, e gesondheetlech Ëmfeld fir hir Kanner ze schafen."
  },
  {
    "instruction": "Wat sinn déi wichtegst Ziler vun den Ateliers Nesting, déi vun den Infirmièren vum Pôle Femme, Mère, Enfant des Hôpitaux Robert Schuman ugebued ginn?",
    "response": "Déi wichtegst Ziler vun den Ateliers Nesting sinn et, fir d'Teilnehmer ze léieren, wéi si d'Quellen vun der Loftvervuurechtung an hirer Wunneng erkennen a reduzéieren kënnen, an hinnen ze hëllefen, e perséinleche Plang ze entwéckelen, fir hir Gesondheet a säi Wuel ze verbesseren."
  },
  {
    "instructio

In [None]:
# Write the combined data to a new JSON file
with open('/content/drive/My Drive/Master Thesis/instruction_answer_rtl.json', 'w') as f:
  json.dump(combined_data, f, indent=4)  # Use indent for pretty printing

In [None]:
len(combined_data)

1930

Note: 14 of the 400 responses could not be parsed as JSON. They were caught by the exception handler and are excluded from the instruction-tuning set, which leaves 1930 instruction-answer pairs. Some of these responses contain invalid characters; others are malformed.
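If some of the rejected responses are arrays that break off partway through (an assumption; the notebook does not record why each one failed), the complete objects at the start of such an array could still be recovered. The sketch below uses `json.JSONDecoder.raw_decode` to parse one object at a time and stops at the first object that fails. `salvage_pairs` is a hypothetical helper, not part of the original pipeline.

```python
import json

def salvage_pairs(json_string):
    """Return the complete objects found at the start of a possibly broken JSON array."""
    decoder = json.JSONDecoder()
    pairs = []
    pos = json_string.find('[')
    if pos == -1:
        return pairs
    pos += 1  # move past the opening bracket
    while True:
        # Skip whitespace and commas between objects
        while pos < len(json_string) and json_string[pos] in ' \t\r\n,':
            pos += 1
        if pos >= len(json_string) or json_string[pos] != '{':
            break  # end of array or unexpected content
        try:
            obj, pos = decoder.raw_decode(json_string, pos)
        except json.JSONDecodeError:
            break  # first broken object: keep what was parsed so far
        pairs.append(obj)
    return pairs

# Hypothetical broken response: the second object is cut off
broken = '[\n  {"instruction": "a", "response": "b"},\n  {"instruction": "c", "respon'
print(salvage_pairs(broken))
```

Whether recovering partial responses is desirable depends on the dataset; the notebook itself simply drops them.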

# Replace keys

Note: Some responses were generated with non-English or invalid keys, such as "Instruktioun" or "Répons". We rename these to the English key names so that all entries are uniform, using the dictionary pop() method.

In [None]:
all_keys = set()
for item in combined_data:
  all_keys.update(item.keys())

all_keys = list(all_keys)
print(all_keys)


['Répons', 'instruction', 'response', 'Instruktioun']


In [None]:
# replace invalid keys with valid keys
for item in combined_data:
  if "Instruktioun" in item:
    item["instruction"] = item.pop("Instruktioun")
  if "Répons" in item:
    item["response"] = item.pop("Répons")
  if "Äntwert" in item:
    item["response"] = item.pop("Äntwert")
  if "Respon" in item:
    item["response"] = item.pop("Respon")
  if "Réponse" in item:
    item["response"] = item.pop("Réponse")
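The same renaming can be expressed with a lookup table, which also makes it easy to check whether any unexpected key is left afterwards. This is a sketch of an alternative, not the notebook's own code: `KEY_MAP` and `normalize_keys` are hypothetical names, the key variants are the ones handled in the cell above, and the sample entries are made up.

```python
KEY_MAP = {
    "Instruktioun": "instruction",
    "Répons": "response",
    "Äntwert": "response",
    "Respon": "response",
    "Réponse": "response",
}

def normalize_keys(items, key_map=KEY_MAP):
    """Rename keys in place and return the set of keys that are still unexpected."""
    expected = {"instruction", "response"}
    leftover = set()
    for item in items:
        for old, new in key_map.items():
            if old in item:
                item[new] = item.pop(old)  # pop() removes the old key and returns its value
        leftover.update(set(item) - expected)
    return leftover

# Hypothetical entries: one with Luxembourgish/French keys, one already correct
sample = [
    {"Instruktioun": "Wat ass geschitt?", "Répons": "En Accident."},
    {"instruction": "Wéini?", "response": "Am Februar."},
]
leftover = normalize_keys(sample)
print(sample[0], leftover)
```

An empty `leftover` set indicates that every entry now uses only the two English keys.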


In [None]:
# add a "text" field to each entry that combines instruction and response in the desired format
# following: https://huggingface.co/datasets/vicgalle/alpaca-gpt4?row=1
# note: the backslash continuations keep the indentation of the following line,
# so the resulting text contains runs of spaces between the sentences
for item in combined_data:
  pair_instruction = item["instruction"]  # separate name, so the prompt variable `instruction` is not overwritten
  response = item["response"]
  text = f"The following provides an instruction in form of a question. \
  Give a response that answers the question accurately. \
  \n\n###Instruction: {pair_instruction}\n\n###Response: {response}"
  item["text"] = text
combined_data

[{'instruction': 'Wat war de Grond fir den Accident, deen de Chauffer am Februar 2018 zu Bieles hat?',
  'response': "De Chauffer hat Summerpneuen op sengem Auto gehat, obwuel et bei -4 Grad an d'Stroosse glat waren.",
  'text': "The following provides an instruction in form of a question.   Give a response that answers the question accurately.   \n\n###Instruction: Wat war de Grond fir den Accident, deen de Chauffer am Februar 2018 zu Bieles hat?\n\n###Response: De Chauffer hat Summerpneuen op sengem Auto gehat, obwuel et bei -4 Grad an d'Stroosse glat waren."},
 {'instruction': 'Wat huet de Chauffer no der Kollisioun gemaach?',
  'response': 'Hien ass ugaangen an huet sech duerno bei der Police presentéiert.',
  'text': 'The following provides an instruction in form of a question.   Give a response that answers the question accurately.   \n\n###Instruction: Wat huet de Chauffer no der Kollisioun gemaach?\n\n###Response: Hien ass ugaangen an huet sech duerno bei der Police presentéier
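The recorded output above shows runs of spaces between the two sentences of the preamble; they come from the indentation that follows the backslash line continuations inside the f-string. If those spaces are not wanted, one option is to build the template from adjacent string literals. The sketch below is an alternative under that assumption, with `TEMPLATE` and `build_text` as hypothetical names; it does not reproduce the exact strings stored in the dataset above.

```python
# Adjacent string literals are joined without the indentation in between
TEMPLATE = (
    "The following provides an instruction in form of a question. "
    "Give a response that answers the question accurately."
    "\n\n###Instruction: {instruction}\n\n###Response: {response}"
)

def build_text(pair):
    """Fill the template with one instruction-response pair."""
    return TEMPLATE.format(instruction=pair["instruction"], response=pair["response"])

# Pair taken from the output above
pair = {
    "instruction": "Wat huet de Chauffer no der Kollisioun gemaach?",
    "response": "Hien ass ugaangen an huet sech duerno bei der Police presentéiert.",
}
print(build_text(pair))
```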

In [None]:
# Write the combined data to a new JSON file
with open('/content/drive/My Drive/Master Thesis/instruction_tuning_rtl.json', 'w') as f:
  json.dump(combined_data, f, indent=4)  # Use indent for pretty printing

# Import data as dataframe

In [None]:
import pandas as pd

# Read the JSON file into a pandas DataFrame
df = pd.read_json('/content/drive/My Drive/Master Thesis/instruction_tuning_rtl.json')

# Display the DataFrame
df

| | instruction | response | text |
|---|---|---|---|
| 0 | Wat war de Grond fir den Accident, deen de Cha... | De Chauffer hat Summerpneuen op sengem Auto ge... | The following provides an instruction in form ... |
| 1 | Wat huet de Chauffer no der Kollisioun gemaach? | Hien ass ugaangen an huet sech duerno bei der ... | The following provides an instruction in form ... |
| 2 | Wat huet de Riichter de Chauffer gefrot? | De Riichter huet de Chauffer gefrot, ob hien g... | The following provides an instruction in form ... |
| 3 | Wat huet d'Vertriederin vum Parquet als Strof ... | D'Vertriederin vum Parquet huet als Strof eng ... | The following provides an instruction in form ... |
| 4 | Wann gëtt d'Uerteel gesprach? | D'Uerteel gëtt den 21. Dezember 2018 gesprach. | The following provides an instruction in form ... |
| ... | ... | ... | ... |
| 1925 | Wat war den 1. August 2011 de Kontext vum Staa... | Den 1. August 2011 hat de President Barack Oba... | The following provides an instruction in form ... |
| 1926 | Wéivill soll d'Scholdegrenz an den USA erhéijt... | D'Scholdegrenz soll an zwou Etappe ëm op manns... | The following provides an instruction in form ... |
| 1927 | Wat sinn d'Reaktiounen op den Accord? | Net jiddereen ass mat dësem Kompromëss d'Accor... | The following provides an instruction in form ... |
| 1928 | Wat sinn d'Konsequenze, wann den Accord net ug... | Wann den Accord net ugeholl gëtt, kéinten d'US... | The following provides an instruction in form ... |


We also want the dataset in CSV format, so we export the DataFrame.

In [None]:
# Export the DataFrame to a CSV file
df.to_csv('/content/drive/My Drive/Master Thesis/instruction_tuning_rtl.csv', index=False)