In [1]:
import json
import pandas as pd

# 1. Processing Preference Pairs Gathered from M1

Collected data format:
```json
{
    "question_id": <int>,
    "question_complete": <str>,
    "course_id": <int>,
    "preference": [
        {
            "A": <str>,
            "B": <str>,
            "overall": <str>,
            "criteria": {
                "overall": <str>,
                "correctness": <str>,
                "relevance": <str>,
                "clarity": <str>,
                "completeness": <str>,
                "other": <str>
            }
        },
        ...
    ]
}
```

Processed data format:
```json
{
    "prompt": "...",
    "chosen": "...", 
    "rejected": "..."
}
```
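
As a rough illustration of how the two formats relate, the sketch below converts one hand-written record into processed rows. The sample record and the helper name `to_pairs` are hypothetical and not taken from the M1 data; entries whose `overall` label is neither `A` nor `B` are dropped.

```python
# Hypothetical record in the collected format (all values are made up).
sample = {
    "question_id": 0,
    "question_complete": "Question: What is 2 + 2?",
    "course_id": 0,
    "preference": [
        {"A": "4", "B": "5", "overall": "A", "criteria": {}},
        {"A": "22", "B": "4", "overall": "B", "criteria": {}},
        {"A": "4", "B": "four", "overall": "AB", "criteria": {}},
    ],
}


def to_pairs(record):
    """Turn one collected record into a list of prompt/chosen/rejected rows."""
    rows = []
    for pref in record["preference"]:
        if pref["overall"] not in ("A", "B"):
            continue  # neither A nor B was preferred, so no pair is produced
        chosen = pref["overall"]
        rejected = "B" if chosen == "A" else "A"
        rows.append({
            "prompt": record["question_complete"],
            "chosen": pref[chosen],
            "rejected": pref[rejected],
        })
    return rows


pairs = to_pairs(sample)
```

Each raw record can yield several processed rows that share the same `prompt`, one per usable entry in `preference`.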

In [2]:
def process_preference_pairs(data):
    """Flatten collected records into prompt/chosen/rejected dicts, one per usable preference."""
    processed_data = []
    for d in data:
        prompt = d["question_complete"]
        for p in d["preference"]:
            if p["overall"] == "A":
                chosen_option = "A"
                rejected_option = "B"
            elif p["overall"] == "B":
                chosen_option = "B"
                rejected_option = "A"
            else:  # skip entries whose overall preference is neither "A" nor "B" (e.g. "AB")
                continue

            # skip pairs where either answer is empty or the "..." placeholder
            if p[chosen_option] in ("", "...") or p[rejected_option] in ("", "..."):
                continue
            processed_data.append({
                "prompt": prompt,
                "chosen": p[chosen_option],
                "rejected": p[rejected_option]
            })
            
    return processed_data
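
To see how many entries the two skip rules above remove, a small tally can help. The following is a minimal sketch, assuming the same record layout; the helper `count_outcomes` and its counter keys are hypothetical names, and the demo input is made up.

```python
from collections import Counter


def count_outcomes(data):
    """Count how many preference entries would be kept or skipped, and why."""
    counts = Counter()
    for record in data:
        for pref in record["preference"]:
            if pref["overall"] not in ("A", "B"):
                counts["skipped_no_clear_preference"] += 1
            elif pref["A"] in ("", "...") or pref["B"] in ("", "..."):
                counts["skipped_placeholder_answer"] += 1
            else:
                counts["kept"] += 1
    return counts


# Tiny made-up input, only to show the shape of the result.
demo = [{
    "question_complete": "Question: demo",
    "preference": [
        {"A": "x", "B": "y", "overall": "A"},
        {"A": "...", "B": "...", "overall": "A"},
        {"A": "x", "B": "y", "overall": "AB"},
    ],
}]
outcome_counts = count_outcomes(demo)
```

The `kept` count should match the length of the list returned by `process_preference_pairs` on the same input.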

In [3]:
def load_data(file_path):
    """Load the collected preference data from a JSON file."""
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data

In [4]:
def convert_to_jsonl(data, file_path):
    """Write each item of data as one JSON object per line (JSON Lines)."""
    with open(file_path, "w") as f:
        for d in data:
            f.write(json.dumps(d) + "\n")
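
A quick way to check the JSON Lines output is to read it back and compare it with the input. This is a sketch under assumptions: `read_jsonl` is a hypothetical helper, not part of the notebook, and the demo row is made up. Because `json.dumps` escapes newlines inside strings, each record stays on a single line.

```python
import json
import os
import tempfile


def read_jsonl(file_path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(file_path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up row containing a newline, to show it survives the round trip.
rows = [{"prompt": "p", "chosen": "c\nmultiline", "rejected": "r"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.jsonl")
    # Same write pattern as convert_to_jsonl: one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    round_tripped = read_jsonl(path)
```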

In [5]:
# Load the raw M1 preference data; the JSONL output path is set when the data is processed
pp_m1_json_file = "raw_datasets/M1_preference_data_15052024.json"
pp_m1_data = load_data(pp_m1_json_file)

df_pp_m1 = pd.DataFrame(pp_m1_data)
df_pp_m1

Unnamed: 0,question_id,question_complete,course_id,preference
0,0,Question: Consider the following contains func...,15000,[{'A': 'The asymptotic depth of the contains f...
1,3,Question: What is the asymptotic work of <code...,15000,"[{'A': '...', 'B': '...', 'overall': 'A', 'cri..."
2,4,Question: We have a collection of rectangles i...,15000,[{'A': 'Facts: - Rectangles in the plane have ...
3,5,Question: Which of the following scheduler pol...,15005,[{'A': 'Preemptive scheduling policies allow t...
4,7,"Question: In this week's lecture, you have bee...",15000,"[{'A': 'For the computation g(g(1, x1), g(x2, ..."
...,...,...,...,...
1517,7365,Question: Byzantine consistent broadcast (BCB)...,15003,"[{'A': 'In non-synchronous environments, intro..."
1518,7366,"Question: If process i fails, then eventually ...",15003,"[{'A': 'Yes, the statement is true. If process..."
1519,7368,Question: What happens in the reliable broadca...,15003,[{'A': 'If the completeness property of the fa...
1520,7370,Question: Consider a network that is organized...,15003,"[{'A': 'First, we can use a flooding algorithm..."


In [6]:
# Process the preference pairs
pp_m1_jsonl_file = "datasets/M1_preference_data_15052024.jsonl"
pp_m1_processed = process_preference_pairs(pp_m1_data)
convert_to_jsonl(pp_m1_processed, pp_m1_jsonl_file)

df_pp_m1_processed = pd.DataFrame(pp_m1_processed)
df_pp_m1_processed

Unnamed: 0,prompt,chosen,rejected
0,Question: Consider the following contains func...,"When `contains` is called on a List, the `drop...",The asymptotic depth of the contains function ...
1,Question: Consider the following contains func...,To determine the asymptotic depth of the `cont...,The asymptotic depth of the contains function ...
2,Question: Consider the following contains func...,To determine the asymptotic depth of the `cont...,The asymptotic depth of the `contains` functio...
3,Question: Consider the following contains func...,To determine the asymptotic depth of the `cont...,The contains function is a recursive function ...
4,Question: Consider the following contains func...,The asymptotic depth of the contains function ...,When the contains function is called on a List...
...,...,...,...
26636,Question: Consider the transformation from bin...,#### **Answer**: \n\nThe transformation from b...,#### **Answer**:\nThe transformation from bina...
26637,Question: Consider the transformation from bin...,Consider the transformation from binary MRSW s...,Let's consider the transformation from binary ...
26638,Question: Consider the transformation from bin...,To prove that the transformation from binary M...,"First, let's define the terms:\n\n- Binary MRS..."
26639,Question: Consider the transformation from bin...,"To solve this problem, first, let's understand...",Background Information:\n- Triple Data Encrypt...
