Switch to filtered prosocial dataset (#3162)
This PR modifies ProsocialDialogue to use a filtered version of the dataset
that I have created, with less irrelevant data and fewer rejections.

In this modified dataset I have filtered out the lines where the safety
label is "casual" or "possibly/probably needs caution", which I found to be
largely irrelevant, as well as some lines where the phrasing of the response
might hurt the model's performance by refusing to act on a request.

This is an alternative solution to removing the dataset completely, as
mentioned in #3144
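
For illustration only, here is a minimal sketch of the kind of filtering described above, written against the upstream allenai/prosocial-dialog columns. The exact label strings, refusal-phrase heuristics, and helper names are assumptions, not the actual script used to produce Englishman2022/prosocial-dialog-filtered:

# Sketch only: the label strings and refusal markers below are assumptions.
from datasets import load_dataset

DROPPED_LABELS = {"__casual__", "__possibly_needs_caution__", "__probably_needs_caution__"}
REFUSAL_MARKERS = ("i can't", "i won't", "i refuse")  # assumed phrasing heuristics

def keep_row(row):
    # Drop the low-signal safety labels called out in the PR description.
    if row["safety_label"] in DROPPED_LABELS:
        return False
    # Drop responses whose phrasing reads as a refusal to act on the request.
    response = row["response"].lower()
    return not any(marker in response for marker in REFUSAL_MARKERS)

dataset = load_dataset("allenai/prosocial-dialog", split="train")
filtered = dataset.filter(keep_row)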
TCLProject committed May 20, 2023
1 parent abf7694 commit d39976a
Showing 1 changed file with 2 additions and 2 deletions.
model/model_training/custom_datasets/toxic_conversation.py (2 additions, 2 deletions)
@@ -20,7 +20,7 @@ class ProsocialDialogueExplaination(Dataset):
 
     def __init__(self, split="train", cache_dir=".cache") -> None:
         super().__init__()
-        dataset = load_dataset("allenai/prosocial-dialog", cache_dir=cache_dir)[split]
+        dataset = load_dataset("Englishman2022/prosocial-dialog-filtered", cache_dir=cache_dir)[split]
         self.pairs = []
         for row in dataset:
             for safety_annotation, safe_answer in zip(row["safety_annotations"], row["safety_annotation_reasons"]):
@@ -54,7 +54,7 @@ class ProsocialDialogue(Dataset):
 
     def __init__(self, split="train", cache_dir=".cache") -> None:
         super().__init__()
-        dataset = load_dataset("allenai/prosocial-dialog", cache_dir=cache_dir)[split]
+        dataset = load_dataset("Englishman2022/prosocial-dialog-filtered", cache_dir=cache_dir)[split]
         self.pairs = []
         for row in dataset:
             prompt = row["context"]
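
For reference, a quick way to sanity-check the swap locally, assuming the filtered dataset keeps the same split names and column names as the original:

from datasets import load_dataset

# Load the filtered dataset the same way the training code now does.
ds = load_dataset("Englishman2022/prosocial-dialog-filtered", cache_dir=".cache")["train"]
print(ds.column_names)  # expect the columns used above, e.g. "context" and "safety_annotations"
print(len(ds))          # the filtered train split should be smaller than the original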
