# **Reshaping explicit and implicit data into preference pairs**

### **Weighted formula:**

$$
\begin{aligned}
\text{Score} =\ & 
0.2 \cdot \text{TimeOnTask} +
0.1 \cdot \text{ScrollDepth} +
0.1 \cdot \text{ScrollEvents} +
0.2 \cdot \text{CompletionRate} \\
& +
0.1 \cdot \text{ActiveMinutes} +
0.1 \cdot \text{MemoryUse} +
0.05 \cdot \text{TutorInteractions} +
0.05 \cdot \text{AvgModuleRating} \\
& +
0.05 \cdot \text{Satisfaction} +
0.05 \cdot (\text{PostSkill} - \text{PreSkill}) +
0.05 \cdot (\text{Relevance} + \text{Trust} - \text{Difficulty}) \\
& +
0.02 \cdot \text{Pace} -
0.02 \cdot \text{Retries} -
0.01 \cdot \text{ResponseTime}
\end{aligned}
$$

Drawbacks:
- Weights are hand-picked rather than learned from data
- Signals are summed on their raw scales, so a metric with a large numeric range can dominate one rated on a small scale
- The same weights apply to every user and context
- Outliers in some metrics can skew the score
- The formula does not improve automatically as more data or real user feedback arrives
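
To make the formula and its scale-difference drawback concrete, here is a minimal sketch of the weighted sum in Python. The weights are copied from the formula above; the signal key names, the `weighted_score` helper, and the example values are hypothetical and are not taken from the `preference_pairs` module, which may compute the score differently.

```python
from typing import Mapping

# Weights copied from the formula above; negative weights act as penalties.
# The key names are illustrative assumptions, not the module's real field names.
WEIGHTS = {
    "time_on_task": 0.2,
    "scroll_depth": 0.1,
    "scroll_events": 0.1,
    "completion_rate": 0.2,
    "active_minutes": 0.1,
    "memory_use": 0.1,
    "tutor_interactions": 0.05,
    "avg_module_rating": 0.05,
    "satisfaction": 0.05,
    "skill_gain": 0.05,   # PostSkill - PreSkill
    "perception": 0.05,   # Relevance + Trust - Difficulty
    "pace": 0.02,
    "retries": -0.02,
    "response_time": -0.01,
}


def weighted_score(signals: Mapping[str, float]) -> float:
    """Return the weighted sum of the signals; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


# Made-up example: a time value in the thousands next to small-scale ratings.
example_user = {
    "time_on_task": 1800.0,
    "completion_rate": 0.9,
    "satisfaction": 4.0,
    "retries": 3.0,
}
print(round(weighted_score(example_user), 2))
```

In this made-up example the time signal contributes 360 of the total, while completion rate and satisfaction together contribute less than 0.4, which is the scale problem named in the list above.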



In [11]:
%load_ext autoreload
%autoreload 2

from preference_pairs import generate_preference_pairs
import json
import os

In [22]:
# Load the user records from each JSON file into one list
DATA_PATH = "data"
data_names = [
    "synth_10-samples_gpt-4o-mini_2025-05-05_08-31",
    "synth_10-samples_gpt-4o-mini_2025-05-05_08-48",
]
user_data = []
for name in data_names:
    with open(os.path.join(DATA_PATH, f"{name}.json"), "r", encoding="utf-8") as f:
        user_data.extend(json.load(f))


In [23]:
# Generate preference pairs, print them, and save them to disk
OUTPUT_PATH = "output"
os.makedirs(OUTPUT_PATH, exist_ok=True)  # make sure the output directory exists
preference_dataset = generate_preference_pairs(user_data)
print(json.dumps(preference_dataset, indent=2, sort_keys=True))
with open(os.path.join(OUTPUT_PATH, "preference_pairs.json"), "w", encoding="utf-8") as f:
    json.dump(preference_dataset, f, indent=2)

[
  {
    "chosen": {
      "explicit_data": {
        "approval_of_content_modifications": [
          {
            "change": "Add more examples in Python module",
            "status": "approved"
          }
        ],
        "curriculum_editing_feedback": "I'd like a more structured approach to the modules.",
        "difficulty_feedback": 3,
        "drag_and_drop_curriculum_edits": [],
        "explicit_learning_goals": "Gain practical skills in Python and statistics.",
        "preferred_content_format": "video",
        "ratings_on_modules": {
          "Introduction to Data Science": 4,
          "Python for Data Analysis": 5,
          "Statistics Basics": 3
        },
        "reflection_inputs": "I feel more confident in my ability to use Python.",
        "relevance_feedback": 5,
        "satisfaction_surveys": {
          "content_relevance": 4,
          "interface_usability": 5,
          "overall_satisfaction": 4
        },
        "skill_self_assessments": {
        

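The `generate_preference_pairs` call above comes from the `preference_pairs` module, whose source is not shown in this notebook. As a rough sketch of how score-based pairing could work, the hypothetical helper below compares every two records, marks the higher-scoring one as `chosen` and the other as `rejected`, and stores the gap as `score_diff`. The function name, its `score_fn` and `min_diff` parameters, and the toy records are assumptions for illustration; the real function may pair records differently.

```python
from itertools import combinations
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def sketch_preference_pairs(
    records: List[Record],
    score_fn: Callable[[Record], float],
    min_diff: float = 0.0,
) -> List[Dict[str, Any]]:
    """Pair up records and label the higher-scoring one as 'chosen' (illustrative only)."""
    scored = [(score_fn(record), record) for record in records]
    pairs = []
    for (score_a, rec_a), (score_b, rec_b) in combinations(scored, 2):
        diff = abs(score_a - score_b)
        if diff <= min_diff:
            continue  # skip ties and near-ties: no clear preference
        chosen, rejected = (rec_a, rec_b) if score_a > score_b else (rec_b, rec_a)
        pairs.append({"chosen": chosen, "rejected": rejected, "score_diff": diff})
    return pairs


# Toy records with a precomputed score, purely for demonstration.
toy_records = [
    {"id": "a", "score": 3.0},
    {"id": "b", "score": 1.0},
    {"id": "c", "score": 3.0},
]
toy_pairs = sketch_preference_pairs(toy_records, score_fn=lambda r: r["score"])
for pair in toy_pairs:
    print(pair["chosen"]["id"], ">", pair["rejected"]["id"], pair["score_diff"])
```

Comparing all pairs grows quadratically with the number of records, so a larger dataset would likely need sampling or a minimum score gap to keep the output manageable.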
In [None]:
# Print only the pairs whose chosen-vs-rejected score gap exceeds the threshold
SCORE_DIFF_THRESH = 500
for pair in preference_dataset:
    if pair["score_diff"] > SCORE_DIFF_THRESH:
        print(json.dumps(pair, indent=2, sort_keys=True))

{
  "chosen": {
    "explicit_data": {
      "approval_of_content_modifications": [
        {
          "change": "Add more examples in Python module",
          "status": "approved"
        }
      ],
      "curriculum_editing_feedback": "I'd like a more structured approach to the modules.",
      "difficulty_feedback": 3,
      "drag_and_drop_curriculum_edits": [],
      "explicit_learning_goals": "Gain practical skills in Python and statistics.",
      "preferred_content_format": "video",
      "ratings_on_modules": {
        "Introduction to Data Science": 4,
        "Python for Data Analysis": 5,
        "Statistics Basics": 3
      },
      "reflection_inputs": "I feel more confident in my ability to use Python.",
      "relevance_feedback": 5,
      "satisfaction_surveys": {
        "content_relevance": 4,
        "interface_usability": 5,
        "overall_satisfaction": 4
      },
      "skill_self_assessments": {
        "after_training": 4,
        "before_training": 2
      