# Sample Luxembourgish instruction tuning set

In this notebook, we randomly sample 200 rows from the original Luxembourgish instruction tuning set, akin to the sampling done during the reward process. We sample 80% from the RTL data and 20% from the Wikipedia data, approximately preserving the distribution in the original dataset. This yields 160 rows generated from RTL data and 40 rows generated from Wikipedia data.

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Read in the instruction tuning set CSV file
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Master Thesis/instruction_tuning_set.csv')
df

Unnamed: 0,instruction,response,text
0,Ginn déi wichtegst Informatiounen iwwer d'Debo...,D'Deborah De Robertis ass eng lëtzebuergesch-f...,The following provides an instruction in form ...
1,Wat ass déi bekanntst Performance vun der Debo...,Déi bekanntst Performance vun der Deborah De R...,The following provides an instruction in form ...
2,Wat ass den Hannergrond vun der Performance vu...,Den 31. August 2018 huet d'Deborah De Robertis...,The following provides an instruction in form ...
3,Wat huet d'Deborah De Robertis den 6. Mee 2024...,Den 6. Mee 2024 huet d'Deborah De Robertis am ...,The following provides an instruction in form ...
4,Wéi gouf d'Deborah De Robertis den 29. Mee 202...,Den 29. Mee 2024 gouf d'Deborah De Robertis vi...,The following provides an instruction in form ...
...,...,...,...
2415,Wat war den 1. August 2011 de Kontext vum Staa...,Den 1. August 2011 hat de President Barack Oba...,The following provides an instruction in form ...
2416,Wéivill soll d'Scholdegrenz an den USA erhéijt...,D'Scholdegrenz soll an zwou Etappe ëm op manns...,The following provides an instruction in form ...
2417,Wat sinn d'Reaktiounen op den Accord?,Net jiddereen ass mat dësem Kompromëss d'Accor...,The following provides an instruction in form ...
2418,"Wat sinn d'Konsequenze, wann den Accord net ug...","Wann den Accord net ugeholl gëtt, kéinten d'US...",The following provides an instruction in form ...


# Sample Data

We split the dataset into two parts. The first part contains all rows generated from the Wikipedia data (rows 0 to 489); the second part contains all rows generated from the RTL data (rows 490 to 2419). We then randomly sample 40 rows from the Wikipedia data and 160 rows from the RTL data, maintaining the approximate 0.2/0.8 split of the original dataset.

In [3]:
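# iloc slicing excludes the end position, so [:490] selects rows 0-489 (Wikipedia)
wiki_df = df.iloc[:490]
# rows 490 onward were generated from RTL data
rtl_df = df.iloc[490:]
print(len(wiki_df))
print(len(rtl_df))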

490
1930
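
As a quick check on the 0.2/0.8 claim, the shares can be computed from the two row counts printed above. This is only a sketch of the arithmetic: the variable names are illustrative and not part of the original notebook, and rounding the Wikipedia share and giving the remainder to RTL is one reasonable allocation rule, not necessarily the one used by the author.

```python
# Row counts printed by the cell above.
n_wiki, n_rtl = 490, 1930
n_total = n_wiki + n_rtl
target = 200  # total sample size used in this notebook

wiki_share = n_wiki / n_total
rtl_share = n_rtl / n_total

# Proportional allocation, rounded to whole rows (assumed rule);
# the RTL sample takes the remainder so the total stays at `target`.
n_wiki_sample = round(target * wiki_share)
n_rtl_sample = target - n_wiki_sample

print(f"wiki share: {wiki_share:.3f}, RTL share: {rtl_share:.3f}")
print(n_wiki_sample, n_rtl_sample)
```

With these counts the shares come out at roughly 0.202 and 0.798, and the allocation gives 40 and 160 rows, matching the sample sizes used below.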


In [4]:
# random sample 40 rows from wiki data
sampled_wiki_df = wiki_df.sample(n=40, random_state=42)
sampled_wiki_df = sampled_wiki_df.reset_index(drop=True) # reset index and drop original index
sampled_wiki_df

Unnamed: 0,instruction,response,text
0,Wat ass de hellste Stär am Dorado?,"Den α Doradus, den hellste Stär am Dorado, as...",The following provides an instruction in form ...
1,"Wat ass mam Edikt vun Nantes geschitt, nodeem ...","Den Edikt vun Nantes gouf leider net ""perpétue...",The following provides an instruction in form ...
2,Wann huet de Devolutiounskrich ugefaangen a wé...,De Devolutiounskrich huet am Fréijouer 1667 ug...,The following provides an instruction in form ...
3,Wann an wou ass de Max Ophüls gestuerwen?,De Max Ophüls ass de 26. Mäerz 1957 zu Hamburg...,The following provides an instruction in form ...
4,"Wat ass mam Gebai vum Schumacher geschitt, nod...",De Policemusée vun der Asbl 'Le musée de la po...,The following provides an instruction in form ...
5,Wat sinn déi essenziell Aminosaieren? Nimm déi...,"Déi essenziell Aminosaieren sinn: Histidin, Is...",The following provides an instruction in form ...
6,Ginn mir d'Nimm vun de fréiere Gemenge déi 197...,"D'fréier Gemenge Beho, Bovigny, Cherain, Limer...",The following provides an instruction in form ...
7,Ginn mir d'Fläch vun der Polvermillen a wat Pr...,"D'Polvermillen huet eng Fläch vu 24,82 ha, wat...",The following provides an instruction in form ...
8,Wéini huet de Victor Tedesco seng Prisongstrof...,No der gescheiterter Expeditioun vun Exilbelsc...,The following provides an instruction in form ...
9,Wéi vill Filmer huet d'Cinémathèque de la Vill...,D'Cinémathèque de la Ville de Luxembourg versu...,The following provides an instruction in form ...


In [5]:
# now same for RTL data, but with 160 rows
sampled_rtl_df = rtl_df.sample(n=160, random_state=42)
sampled_rtl_df = sampled_rtl_df.reset_index(drop=True)
sampled_rtl_df

Unnamed: 0,instruction,response,text
0,Wéini huet de Marco Mille bei Siemens geschafft?,Ereéischt am Mäerz 2010 geet de Marco Mille be...,The following provides an instruction in form ...
1,Wat ass de Resultat vun der grouss ugeluechte ...,D'Police huet 20 Avertissement taxé an ee Prot...,The following provides an instruction in form ...
2,"Gitt mir d'Zuel vun de Jonken zu Lëtzebuerg, d...","Am Joer 2019 waren 9,75 Prozent vun de Kanner ...",The following provides an instruction in form ...
3,Wat huet d'Proprietärin vun engem Déier iwwer ...,"D'Proprietärin huet gesot, d'Ugeklote hätt net...",The following provides an instruction in form ...
4,Wéi vill Agencen sinn zanter 2022 fir d'Format...,Zanter 2022 sinn sechs Agencen fir d'Formation...,The following provides an instruction in form ...
...,...,...,...
155,Gëtt et och aner Méiglechkeeten fir politesch ...,"Jo, et gëtt och aner Méiglechkeeten fir polite...",The following provides an instruction in form ...
156,Wann a wou fiert de City Night Bus CN1?,De City Night Bus CN1 geet vun der Uewerstad i...,The following provides an instruction in form ...
157,Wat war den Haaptthema vun de Berodunge vum Co...,"Den Haaptthema war d'Flüchtlingsproblematik, b...",The following provides an instruction in form ...
158,Wat ass d'Rôle vun der ierfgroussherzoglecher ...,Mir liewen wierklech a ganz komplizéierten Zäi...,The following provides an instruction in form ...


In [6]:
# Now we merge the dataframes back together
df = pd.concat([sampled_wiki_df, sampled_rtl_df])
df = df.reset_index(drop=True)
df

Unnamed: 0,instruction,response,text
0,Wat ass de hellste Stär am Dorado?,"Den α Doradus, den hellste Stär am Dorado, as...",The following provides an instruction in form ...
1,"Wat ass mam Edikt vun Nantes geschitt, nodeem ...","Den Edikt vun Nantes gouf leider net ""perpétue...",The following provides an instruction in form ...
2,Wann huet de Devolutiounskrich ugefaangen a wé...,De Devolutiounskrich huet am Fréijouer 1667 ug...,The following provides an instruction in form ...
3,Wann an wou ass de Max Ophüls gestuerwen?,De Max Ophüls ass de 26. Mäerz 1957 zu Hamburg...,The following provides an instruction in form ...
4,"Wat ass mam Gebai vum Schumacher geschitt, nod...",De Policemusée vun der Asbl 'Le musée de la po...,The following provides an instruction in form ...
...,...,...,...
195,Gëtt et och aner Méiglechkeeten fir politesch ...,"Jo, et gëtt och aner Méiglechkeeten fir polite...",The following provides an instruction in form ...
196,Wann a wou fiert de City Night Bus CN1?,De City Night Bus CN1 geet vun der Uewerstad i...,The following provides an instruction in form ...
197,Wat war den Haaptthema vun de Berodunge vum Co...,"Den Haaptthema war d'Flüchtlingsproblematik, b...",The following provides an instruction in form ...
198,Wat ass d'Rôle vun der ierfgroussherzoglecher ...,Mir liewen wierklech a ganz komplizéierten Zäi...,The following provides an instruction in form ...
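
Once the two samples are concatenated and the index is reset, the origin of each row can no longer be read off the DataFrame. One way to keep the composition verifiable is to tag each part before merging. The sketch below uses toy data and a hypothetical `source` column (neither is part of the original notebook) and swaps in `ignore_index=True` for the separate `reset_index` call.

```python
import pandas as pd

# Toy stand-in for the real data: 10 "wiki" rows followed by 40 "rtl" rows.
toy = pd.DataFrame({"instruction": [f"q{i}" for i in range(50)]})

# Tag each part with a hypothetical `source` column before sampling.
toy_wiki = toy.iloc[:10].assign(source="wiki")
toy_rtl = toy.iloc[10:].assign(source="rtl")

# Sample 20% / 80% and merge; ignore_index=True renumbers rows from 0.
sampled = pd.concat(
    [
        toy_wiki.sample(n=2, random_state=42),
        toy_rtl.sample(n=8, random_state=42),
    ],
    ignore_index=True,
)

# Count rows per source to confirm the intended split survived the merge.
counts = sampled["source"].value_counts()
print(counts)
```

The same pattern would apply to `sampled_wiki_df` and `sampled_rtl_df`, where the counts should come out at 40 and 160.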


In [7]:
# save as csv file
df.to_csv('/content/drive/My Drive/Master Thesis/sampled_instruction_tuning_set.csv', index=False)
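
The `Unnamed: 0` column in the outputs above is what pandas typically produces when a CSV was written with its index and later read back without `index_col`. Whether that is how the original file was created is an assumption. The sketch below reproduces the effect on toy data with an in-memory buffer and shows two ways to avoid it. Note that `index=False` in the cell above only stops a new index column from being written; an existing `Unnamed: 0` column would still be saved.

```python
import io

import pandas as pd

toy = pd.DataFrame({"instruction": ["a", "b"], "response": ["x", "y"]})

# Writing with the default index=True stores the index as an unnamed first column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(list(reloaded.columns))

# Option 1: tell read_csv that the first column is the index.
buf.seek(0)
as_index = pd.read_csv(buf, index_col=0)
print(list(as_index.columns))

# Option 2: write without the index in the first place.
buf_clean = io.StringIO()
toy.to_csv(buf_clean, index=False)
buf_clean.seek(0)
clean = pd.read_csv(buf_clean)
print(list(clean.columns))
```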