# Prompt Filter

This Python notebook loads a JSON file of inconsistently structured entries and keeps only the valid ones. An entry counts as valid when its keys are exactly prompt, response, and metadata, and its metadata keys are exactly timestamp, source, and tags. Entries with missing or extra keys are dropped. Duplicates are removed by tracking the prompts already seen, so only the first valid entry for each prompt is kept. The filtered data is then saved to a new JSON file, giving a clean and consistent dataset.
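As a minimal sketch of the rule described above, the following applies the same validity check and first-occurrence deduplication to a small in-memory list instead of promptlib.json. The sample entries and the names `has_exact_keys` and `filter_entries` are made up for illustration. The `isinstance` guards are an added assumption that the notebook code itself does not include.

```python
REQUIRED_KEYS = {'prompt', 'response', 'metadata'}
METADATA_KEYS = {'timestamp', 'source', 'tags'}

def has_exact_keys(entry):
    """Return True if the entry and its metadata have exactly the expected keys."""
    # Assumption: non-dict values are treated as invalid rather than raising
    if not isinstance(entry, dict) or set(entry) != REQUIRED_KEYS:
        return False
    metadata = entry['metadata']
    return isinstance(metadata, dict) and set(metadata) == METADATA_KEYS

def filter_entries(entries):
    """Keep the first valid entry for each distinct prompt."""
    seen = set()
    kept = []
    for entry in entries:
        if has_exact_keys(entry) and entry['prompt'] not in seen:
            kept.append(entry)
            seen.add(entry['prompt'])
    return kept

# Hypothetical sample data, not taken from promptlib.json
meta = {'timestamp': '2024-01-01T00:00:00', 'source': 'demo', 'tags': []}
sample = [
    {'prompt': 'a', 'response': 'x', 'metadata': meta},                # valid
    {'prompt': 'a', 'response': 'y', 'metadata': meta},                # duplicate prompt
    {'prompt': 'b', 'response': 'z'},                                  # missing metadata
    {'prompt': 'c', 'response': 'w', 'metadata': {'source': 'demo'}},  # incomplete metadata
    {'prompt': 'd', 'response': 'v', 'metadata': meta, 'extra': 1},    # extra top-level key
    {'prompt': 'e', 'response': 'u', 'metadata': meta},                # valid
]

kept = filter_entries(sample)
print([entry['prompt'] for entry in kept])  # prints ['a', 'e']
```

Because the comparison uses set equality, an entry with an extra key is rejected just like one with a missing key.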

In [1]:
import json

# 1. Load the inconsistent JSON data
with open('promptlib.json', 'r') as f:
    data = json.load(f)
print(f"Original JSON has {len(data['prompts'])} prompts")

# 2. Keep only entries whose keys are exactly prompt, response, and metadata
def is_valid_entry(entry):
    # The top-level keys must be exactly 'prompt', 'response', and 'metadata'
    if set(entry.keys()) == {'prompt', 'response', 'metadata'}:
        metadata = entry['metadata']
        # The metadata keys must be exactly 'timestamp', 'source', and 'tags'
        return set(metadata.keys()) == {'timestamp', 'source', 'tags'}
    return False

# 3. Remove duplicates and keep only valid entries
unique_prompts = set()
filtered_data = {'prompts': []}

# 4. Apply the filter
for entry in data.get('prompts', []):
    if is_valid_entry(entry) and entry['prompt'] not in unique_prompts:
        filtered_data['prompts'].append(entry)
        unique_prompts.add(entry['prompt'])

# 5. Save the filtered data to a new JSON file
with open('filtered_promptlib.json', 'w') as f:
    json.dump(filtered_data, f, indent=2)


Original JSON has 626 prompts


In [2]:
from pathlib import Path
import json

data = json.loads(Path('filtered_promptlib.json').read_text())
print(f"Filtered JSON has {len(data['prompts'])} prompts")

Filtered JSON has 601 prompts
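
The two counts differ by 25 entries, but the notebook does not record why each one was dropped. A rough sketch of how the reasons could be tallied is shown below. It checks the same conditions in the same order as the filter (key check first, then duplicate check). The helper names `rejection_reason` and `tally_rejections` and the sample entries are hypothetical.

```python
from collections import Counter

REQUIRED_KEYS = {'prompt', 'response', 'metadata'}
METADATA_KEYS = {'timestamp', 'source', 'tags'}

def rejection_reason(entry, seen_prompts):
    """Return why an entry would be dropped, or None if it would be kept."""
    if set(entry.keys()) != REQUIRED_KEYS:
        return 'unexpected top-level keys'
    if set(entry['metadata'].keys()) != METADATA_KEYS:
        return 'unexpected metadata keys'
    if entry['prompt'] in seen_prompts:
        return 'duplicate prompt'
    return None

def tally_rejections(entries):
    """Count dropped entries per reason, mirroring the filter's order of checks."""
    seen = set()
    reasons = Counter()
    for entry in entries:
        reason = rejection_reason(entry, seen)
        if reason is None:
            seen.add(entry['prompt'])
        else:
            reasons[reason] += 1
    return reasons

# Hypothetical sample data, not taken from promptlib.json
meta = {'timestamp': '2024-01-01T00:00:00', 'source': 'demo', 'tags': []}
sample = [
    {'prompt': 'a', 'response': 'x', 'metadata': meta},
    {'prompt': 'a', 'response': 'y', 'metadata': meta},
    {'prompt': 'b', 'response': 'z'},
    {'prompt': 'c', 'response': 'w', 'metadata': {'source': 'demo'}},
]

reasons = tally_rejections(sample)
for reason, count in sorted(reasons.items()):
    print(f"{reason}: {count}")
```

On the real data, the same function would be called on the `prompts` list loaded from promptlib.json, assuming every entry is a dict whose `metadata`, when present, is also a dict.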
