# Comments

1.
    a) Each easy negative is taken from the summary of a random Wikipedia page. This is fairly slow (about 1 second per question), which is why I only work with the first ten entries of the dataset. However, one can run this code on multiple cores in parallel, e.g. by slicing the input dataset and reading the slice to process from the command line.

    b) I'm not sure I understand this task. I iterate over your dataset and split the positive, hard negative, and easy negative contexts into three output datasets. Also, to ensure that no combination of contexts appears more than once, I use only a single hard negative context per question. Is that what you mean?

    As for efficiency, the algorithm's time and space complexity are both O(n), where n is the dataset length. The space usage can be reduced by modifying the input dataset in place. Is this the efficiency you expected?

3. I save the datasets in context_qa format. My output dataset is split into positive, hard negative, and easy negative contexts as separate files. If you need all the contexts within a single file, I can easily combine them.

4. I stripped the titles from the contexts, i.e. everything up to and including the last "=\n" in each string. I also replaced some special characters such as "\n" and "\u00a0" with spaces. Apart from that, one should ensure that the model recognises umlauts.

In [None]:
import pandas as pd
import wikipedia
wikipedia.set_lang("de")

In [None]:
### Function to generate easy negatives
def generate_easy_negative():
    """Return the summary of a random German Wikipedia page."""
    random_page_title = wikipedia.random(1)
    try:
        return wikipedia.summary(random_page_title)
    except wikipedia.exceptions.DisambiguationError:
        # The title is ambiguous, so draw another random page.
        return generate_easy_negative()
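
The recursive retry above has no upper bound on the number of attempts. A minimal sketch of a bounded alternative follows. The name `summary_with_retries` and its parameters are hypothetical and not part of the notebook or of the `wikipedia` package; the two lookup functions are passed in so that the helper can be tried without network access.

```python
from typing import Callable, Tuple, Type


def summary_with_retries(
    pick_title: Callable[[], str],
    fetch_summary: Callable[[str], str],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
) -> str:
    """Return the first summary that can be fetched, trying at most max_attempts titles."""
    last_error = None
    for _ in range(max_attempts):
        title = pick_title()  # draw a fresh candidate on every attempt
        try:
            return fetch_summary(title)
        except retry_on as err:
            last_error = err  # remember why this title failed, then try another
    raise RuntimeError(f"no usable page after {max_attempts} attempts") from last_error
```

In this notebook it would presumably be called as `summary_with_retries(wikipedia.random, wikipedia.summary, retry_on=(wikipedia.exceptions.DisambiguationError,))`.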

In [None]:
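### Functions to clean up context strings
def cleanup_ctx_title(text):
    # Drop everything up to and including the last "=\n" (the title part).
    # rfind returns -1 when the marker is absent; leave the string untouched then.
    idx = text.rfind("=\n")
    if idx == -1:
        return text
    return text[idx + 2:]

def cleanup_context(text):
    text = cleanup_ctx_title(text)
    text = text.replace('\n', ' ')
    text = text.replace('\u00a0', ' ')
    return text

def cleanup_context_list(lst):
    # Clean every context of the list in place.
    for i in range(len(lst)):
        lst[i] = cleanup_context(lst[i])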
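
To illustrate what the clean-up does, here is a small self-contained sketch applied to a made-up context string. The function name `strip_title_and_whitespace` and the sample text are assumptions for illustration only; the real contexts may look different. Since `str.rfind` returns -1 when the marker is missing, the sketch checks for that case explicitly.

```python
def strip_title_and_whitespace(text: str) -> str:
    """Drop everything up to the last "=\\n", then replace newlines and NBSPs by spaces."""
    marker = "=\n"
    pos = text.rfind(marker)
    if pos != -1:  # only cut when a title marker is actually present
        text = text[pos + len(marker):]
    return text.replace("\n", " ").replace("\u00a0", " ")


# Hypothetical context with a heading line, a non-breaking space and a line break.
sample = "Berlin\n\n=== Geschichte ===\nBerlin wurde im 13.\u00a0Jahrhundert\ngegründet."
print(strip_title_and_whitespace(sample))
# Berlin wurde im 13. Jahrhundert gegründet.
```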

In [None]:
### Reading dataset

# Named input_url to avoid shadowing the built-in input()
input_url = 'https://huggingface.co/datasets/DiscoResearch/germanrag/resolve/main/germanrag.jsonl'
df = pd.read_json(input_url, lines=True)

In [None]:
### Taking a subset to play with
df = df.iloc[0:10]
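
Instead of a fixed subset, the slice could be chosen from the command line so that several processes each handle their own part of the dataset. The following is only a sketch of that idea: the flag names `--num-shards` and `--shard-index` and both helper functions are hypothetical, not something the notebook already provides.

```python
import argparse
from typing import List, Tuple


def shard_bounds(n_rows: int, num_shards: int, shard_index: int) -> Tuple[int, int]:
    """Return [start, stop) of one contiguous shard; shard sizes differ by at most one row."""
    if num_shards < 1 or not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must satisfy 0 <= shard_index < num_shards")
    base, extra = divmod(n_rows, num_shards)
    # The first `extra` shards receive one additional row each.
    start = shard_index * base + min(shard_index, extra)
    stop = start + base + (1 if shard_index < extra else 0)
    return start, stop


def parse_shard_args(argv: List[str]) -> argparse.Namespace:
    """Parse the (hypothetical) sharding flags from an argument list."""
    parser = argparse.ArgumentParser(description="Process one slice of the dataset.")
    parser.add_argument("--num-shards", type=int, default=1)
    parser.add_argument("--shard-index", type=int, default=0)
    return parser.parse_args(argv)


# Example: the second of three workers over a 10-row frame.
args = parse_shard_args(["--num-shards", "3", "--shard-index", "1"])
start, stop = shard_bounds(10, args.num_shards, args.shard_index)
print(start, stop)  # 4 7
```

In a script one would pass `sys.argv[1:]` to `parse_shard_args` and then select `df.iloc[start:stop]`, writing each shard's output to its own file.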

In [None]:
### Iteration over the dataset and creation of the 
### dataset_positive, dataset_neg_easy and dataset_neg_hard.

dataset_positive = []
dataset_neg_easy = []
dataset_neg_hard = []

for index, row in df.iterrows():
    
    contexts = row['contexts'][:]
    question = row['question']
    answer = row['answer']
    idx = row['positive_ctx_idx']

    # Clean up junk characters from contexts
    cleanup_context_list(contexts)
    
    # Pop a positive context
    ctx_positive = ""
    if idx >= 0:
        ctx_positive = contexts.pop(idx)

    # Generate an easy negative context
    ctx_neg_easy = generate_easy_negative()
    answer_neg_easy = "Ihre Anfrage enthält keine erforderlichen Informationen"
    
    # Take a single hard negative context.
    ctx_neg_hard = ""
    if contexts:
        ctx_neg_hard = contexts[0]
    answer_neg_hard = "Ihre Anfrage enthält nicht genügend Informationen"

    dataset_positive.append({"context":ctx_positive,"question":question,"answer":answer})
    dataset_neg_easy.append({"context":ctx_neg_easy,"question":question,"answer":answer_neg_easy})
    dataset_neg_hard.append({"context":ctx_neg_hard,"question":question,"answer":answer_neg_hard})
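
The claim that no question/context combination appears more than once can be checked after the loop in a single pass. The sketch below assumes records shaped like the dictionaries appended above (`context`, `question`, `answer`); the helper name `repeated_pairs` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def repeated_pairs(*datasets: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return every (question, context) pair that occurs more than once across all datasets."""
    counts = Counter(
        (rec["question"], rec["context"]) for ds in datasets for rec in ds
    )
    return [pair for pair, n in counts.items() if n > 1]


# Toy records; the easy negative deliberately collides with the positive one.
positive = [{"question": "Q1", "context": "A", "answer": "..."}]
neg_hard = [{"question": "Q1", "context": "B", "answer": "..."}]
neg_easy = [{"question": "Q1", "context": "A", "answer": "..."}]
print(repeated_pairs(positive, neg_hard, neg_easy))  # [('Q1', 'A')]
```

An empty result means that every pair is unique. The check visits each record once, so it stays linear in the dataset length.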

In [None]:
# Convert each list of records to its own pandas DataFrame
df_positive = pd.DataFrame(dataset_positive)
df_neg_easy = pd.DataFrame(dataset_neg_easy)
df_neg_hard = pd.DataFrame(dataset_neg_hard)

# Save to jsonl
df_positive.to_json("positive.jsonl",orient='records',lines=True)
df_neg_easy.to_json("neg_easy.jsonl",orient='records',lines=True)
df_neg_hard.to_json("neg_hard.jsonl",orient='records',lines=True)
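
If all contexts are needed within a single file, the three JSONL files can be concatenated. Below is a minimal sketch using only the standard library. The function name `combine_jsonl` and the extra `context_type` field are assumptions, added here so that the origin of each record stays visible. Writing with `ensure_ascii=False` keeps umlauts as readable characters instead of `\uXXXX` escapes.

```python
import json
from pathlib import Path
from typing import Dict, Union


def combine_jsonl(sources: Dict[str, Union[str, Path]], target: Union[str, Path]) -> int:
    """Concatenate JSONL files into target, tagging each record with its source label."""
    written = 0
    with open(target, "w", encoding="utf-8") as out:
        for label, path in sources.items():
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if not line.strip():
                        continue  # skip blank lines
                    record = json.loads(line)
                    record["context_type"] = label  # hypothetical field name
                    # ensure_ascii=False writes umlauts as-is rather than escaped
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

A possible call would be `combine_jsonl({"positive": "positive.jsonl", "neg_hard": "neg_hard.jsonl", "neg_easy": "neg_easy.jsonl"}, "combined.jsonl")`, which returns the number of records written.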