Switch to filtered prosocial dataset (#3162)
This PR modifies ProsocialDialogue to use a filtered version of the dataset
that I have created, with less irrelevant data and fewer rejections.

In this modified dataset I have filtered out the lines where the safety
label is "casual" or "possibly/probably needs caution", which I found to be
largely irrelevant, as well as some lines where the phrasing of the response
might hurt the model's performance by refusing to act on a request.

This is an alternative solution to removing the dataset completely, as
mentioned in #3144
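
For illustration only, here is a minimal sketch of the kind of filtering described above, written against the upstream allenai/prosocial-dialog columns. The exact label strings, refusal-phrase heuristics, and helper names are assumptions, not the actual script used to produce Englishman2022/prosocial-dialog-filtered:

# Sketch only: the label strings and refusal markers below are assumptions.
from datasets import load_dataset

DROPPED_LABELS = {"__casual__", "__possibly_needs_caution__", "__probably_needs_caution__"}
REFUSAL_MARKERS = ("i can't", "i won't", "i refuse")  # assumed phrasing heuristics

def keep_row(row):
    # Drop the low-signal safety labels called out in the PR description.
    if row["safety_label"] in DROPPED_LABELS:
        return False
    # Drop responses whose phrasing reads as a refusal to act on the request.
    response = row["response"].lower()
    return not any(marker in response for marker in REFUSAL_MARKERS)

dataset = load_dataset("allenai/prosocial-dialog", split="train")
filtered = dataset.filter(keep_row)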
TCLProject committed May 20, 2023
1 parent abf7694 commit d39976a
Showing 1 changed file with 2 additions and 2 deletions.
model/model_training/custom_datasets/toxic_conversation.py (2 additions, 2 deletions)
@@ -20,7 +20,7 @@ class ProsocialDialogueExplaination(Dataset):
 
     def __init__(self, split="train", cache_dir=".cache") -> None:
         super().__init__()
-        dataset = load_dataset("allenai/prosocial-dialog", cache_dir=cache_dir)[split]
+        dataset = load_dataset("Englishman2022/prosocial-dialog-filtered", cache_dir=cache_dir)[split]
         self.pairs = []
         for row in dataset:
             for safety_annotation, safe_answer in zip(row["safety_annotations"], row["safety_annotation_reasons"]):
@@ -54,7 +54,7 @@ class ProsocialDialogue(Dataset):
 
     def __init__(self, split="train", cache_dir=".cache") -> None:
         super().__init__()
-        dataset = load_dataset("allenai/prosocial-dialog", cache_dir=cache_dir)[split]
+        dataset = load_dataset("Englishman2022/prosocial-dialog-filtered", cache_dir=cache_dir)[split]
         self.pairs = []
         for row in dataset:
             prompt = row["context"]
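
For reference, a quick way to sanity-check the swap locally, assuming the filtered dataset keeps the same split names and column names as the original:

from datasets import load_dataset

# Load the filtered dataset the same way the training code now does.
ds = load_dataset("Englishman2022/prosocial-dialog-filtered", cache_dir=".cache")["train"]
print(ds.column_names)  # expect the columns used above, e.g. "context" and "safety_annotations"
print(len(ds))          # the filtered train split should be smaller than the original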
