# CBT Therapy Conversation Dataset Generator

This notebook generates synthetic CBT therapy conversations following the 3-step reframing technique:
1. The therapist asks what evidence supports the patient's negative thought
2. The therapist asks what evidence would disprove the negative thought
3. The therapist asks the patient to rephrase their thought in a more balanced way

The generated data will be used for fine-tuning language models.

In [2]:
import os
import json
import random
import time
from typing import List, Dict, Any, Optional
import requests
from tqdm.notebook import tqdm
import anthropic
from dotenv import load_dotenv

# Load environment variables for API keys
load_dotenv()



True

## Define System and Generation Prompts

In [3]:
# System prompt for the LLM
SYSTEM_PROMPT = """You are an expert in cognitive behavioral therapy (CBT) who specializes in creating realistic therapy conversations.
Your task is to generate a conversation between a therapist and a patient that demonstrates the reframing technique in CBT.

The reframing technique follows these 3 steps:
1. The therapist asks what evidence supports the patient's negative thought
2. The therapist asks what evidence would disprove the negative thought
3. The therapist give a reasonable explanation to the problem and give an effective solution to the patient. make the therapist give realistic and possible solution to the patients problem.

Make the conversation realistic, with the patient providing thoughtful responses to each of the therapist's questions.
"""

# Prompt to generate negative thoughts
PATIENT_THOUGHT_PROMPT = """Generate {count} different realistic negative thoughts that a patient might express in a therapy session.
These should be everyday thoughts that demonstrate cognitive distortions like catastrophizing, black-and-white thinking, 
or overgeneralization. Make them diverse and realistic.

Format each as a single sentence from the patient's perspective, like:
"I failed my exam and now I'll never get into a good college."
"Nobody talked to me at the party, which proves I'm socially awkward."
"""

# Prompt to generate a full conversation
CONVERSATION_PROMPT = """Create a realistic therapy conversation between a CBT therapist and a patient following the exact format below:

The patient's initial negative thought is: "{patient_thought}"

Create a conversation that follows these 3 steps of reframing:
1. The therapist asks what evidence supports the patient's negative thought
2. The therapist asks what evidence would disprove the negative thought
3. The therapist give a reasonable explanation to the problem and give an effective solution to the patient. 
4. The therapist is like mr mackey from south park, he is always saying "mmkay" at the end of his sentences.
5. The therapist is in southpark, so always add in character names from southpark in the conversation and completely be Mr Mackey.

Return the conversation in this exact JSON format:
```json
{{
  "messages": [
    {{
      "role": "system", 
      "content": "You are a CBT therapist, You have extensive experience in helping patients cope with panic attacks, obsessive-compulsive disorder (OCD), and post-traumatic stress disorder (PTSD). You are well-versed in evidence-based treatments for these conditions, such as exposure therapy, cognitive restructuring, and relaxation techniques. You also have experience working with individuals who have experienced trauma, including those with complex PTSD and borderline personality disorder."
    }},
    {{
      "role": "Patient", 
      "content": "{patient_thought}"
    }},
    {{
      "role": "CBT Therapist",
      "content": "[First therapist response asking for evidence supporting the negative thought]"
    }},
    {{
      "role": "Patient",
      "content": "[Patient response providing evidence]"
    }},
    {{
      "role": "CBT Therapist",
      "content": "[Second therapist response asking for evidence that would disprove the negative thought]"
    }},
    {{
      "role": "Patient",
      "content": "[Patient response providing counter-evidence]"
    }},
    {{
      "role": "CBT Therapist",
      "content": "[Third therapist response asking patient to rephrase thought in a more balanced way]"
    }},
    {{
      "role": "Patient",
      "content": "[Patient response with reframed thought]"
    }}
  ]
}}
```

Make the conversation realistic and natural, with the patient providing thoughtful responses to each of the therapist's questions. Alongside, make the therapist give realistic and possible solution to the patients problem.
"""

## API Calling Functions

In [4]:
def call_claude(prompt: str, system: str = SYSTEM_PROMPT) -> str:
    """Call Claude API to generate text"""
    client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
    
    try:
        response = client.messages.create(
            model="claude-3-opus-20240229",
            max_tokens=4000,
            system=system,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        return response.content[0].text
    except Exception as e:
        print(f"Error calling Claude API: {e}")
        return ""

model1 = "mistral-saba-24b"
model2 = "llama3-70b-8192"
def call_groq(prompt: str, system: str = SYSTEM_PROMPT, model: str = model1) -> str:
    """Call Groq API to generate text"""
    api_key = os.environ.get("GROQ_API_KEY")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    data = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.7,
        "max_tokens": 4000
    }
    
    try:
        response = requests.post("https://api.groq.com/openai/v1/chat/completions", headers=headers, json=data)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except Exception as e:
        print(f"Error calling Groq API: {e}")
        return ""

def call_ollama(prompt: str, system: str = SYSTEM_PROMPT, model: str = "llama3") -> str:
    """Call Ollama API to generate text"""
    data = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        "stream": False
    }
    
    try:
        response = requests.post("http://localhost:11434/api/chat", json=data)
        response.raise_for_status()
        return response.json()["message"]["content"]
    except Exception as e:
        print(f"Error calling Ollama API: {e}")
        return ""

def generate_with_model(prompt: str, model_type: str, system: str = SYSTEM_PROMPT) -> str:
    """Generate text using the specified model type"""
    if model_type.lower() == "claude":
        return call_claude(prompt, system)
    elif model_type.lower() == "groq":
        return call_groq(prompt, system)
    elif model_type.lower() == "ollama":
        return call_ollama(prompt, system)
    else:
        raise ValueError(f"Unknown model type: {model_type}")

## Functions to Generate Patient Thoughts and Conversations

In [5]:
def generate_patient_thoughts(count: int, model_type: str) -> List[str]:
    """Generate a list of patient negative thoughts"""
    prompt = PATIENT_THOUGHT_PROMPT.format(count=count)
    response = generate_with_model(prompt, model_type)
    
    # Extract thoughts from the response
    thoughts = []
    for line in response.split('\n'):
        line = line.strip()
        if line and line[0] == '"' and line[-1] == '"':
            thoughts.append(line.strip('"'))
        elif line and not line.startswith(('- ', '* ', '#')):
            thoughts.append(line)
    
    # Filter out any non-thoughts and limit to requested count
    thoughts = [t for t in thoughts if len(t) > 10 and '"' not in t]
    return thoughts[:count]

def repair_json(json_str: str) -> str:
    """Attempt to repair malformed JSON"""
    # Remove markdown code blocks
    json_str = json_str.replace('```json', '').replace('```', '')
    
    # Find the start and end of the JSON object
    start_idx = json_str.find('{')
    end_idx = json_str.rfind('}')
    
    if start_idx != -1 and end_idx != -1:
        return json_str[start_idx:end_idx+1]
    return json_str

def extract_and_parse_json(response: str) -> Dict[str, Any]:
    """Extract and parse JSON from the model response"""
    try:
        # First try to parse the entire response as JSON
        return json.loads(response)
    except json.JSONDecodeError:
        # If that fails, try to repair the JSON
        repaired_json = repair_json(response)
        try:
            return json.loads(repaired_json)
        except json.JSONDecodeError as e:
            print(f"Failed to parse JSON: {e}")
            print(f"Response: {response}")
            return {"messages": []}

def generate_full_conversation(patient_thought: str, model_type: str, max_retries: int = 3) -> Dict[str, Any]:
    """Generate a full CBT conversation based on a patient thought"""
    prompt = CONVERSATION_PROMPT.format(patient_thought=patient_thought)
    
    for attempt in range(max_retries):
        response = generate_with_model(prompt, model_type)
        conversation = extract_and_parse_json(response)
        
        # Check if we got a valid conversation
        if "messages" in conversation and len(conversation["messages"]) >= 8:
            return conversation
        
        # If we didn't get a valid conversation, wait and retry
        time.sleep(2)
    
    # If we still don't have a valid conversation after max_retries, return an empty one
    return {"messages": []}

## Generate and Save Dataset

In [6]:
def generate_dataset(count: int, model_type: str, output_file: str):
    """Generate a dataset of CBT conversations"""
    # Ask user for confirmation
    user_count = input(f"You are about to generate {count} CBT conversations using {model_type}. This may incur API costs. Continue? (y/n): ")
    if user_count.lower() != 'y':
        print("Operation cancelled.")
        return
    
    # Generate patient thoughts
    print(f"Generating {count} patient thoughts...")
    thoughts_per_batch = min(count, 20)  # Generate thoughts in batches of 20
    all_thoughts = []
    
    for i in range(0, count, thoughts_per_batch):
        batch_size = min(thoughts_per_batch, count - i)
        thoughts = generate_patient_thoughts(batch_size, model_type)
        all_thoughts.extend(thoughts)
        print(f"Generated {len(all_thoughts)}/{count} thoughts")
    
    # Generate conversations
    print(f"Generating {count} conversations...")
    conversations = []
    
    for i, thought in enumerate(tqdm(all_thoughts[:count])):
        conversation = generate_full_conversation(thought, model_type)
        if conversation["messages"]:
            conversations.append(conversation)
        
        # Save progress every 10 conversations
        if (i + 1) % 10 == 0 or i == len(all_thoughts) - 1:
            with open(output_file, 'w', encoding='utf-8') as f:
                json.dump({"conversations": conversations}, f, indent=2)
            print(f"Saved {len(conversations)} conversations to {output_file}")
        
        # Add a small delay to avoid rate limits
        time.sleep(1)
    
    print(f"Dataset generation complete. {len(conversations)} conversations saved to {output_file}")
    return conversations

In [7]:
# Generate the dataset
model_type = "groq"  # Change to "groq" or "ollama" as needed
output_file = "cbt_conversations_dataset.json"

# You can adjust the count as needed
generate_dataset(2500, model_type, output_file)

Generating 2500 patient thoughts...
Generated 2/2500 thoughts
Generated 17/2500 thoughts
Generated 20/2500 thoughts
Generated 22/2500 thoughts
Generated 25/2500 thoughts
Generated 36/2500 thoughts
Generated 38/2500 thoughts
Generated 40/2500 thoughts
Generated 42/2500 thoughts
Generated 44/2500 thoughts
Generated 61/2500 thoughts
Generated 63/2500 thoughts
Generated 65/2500 thoughts
Generated 76/2500 thoughts
Generated 87/2500 thoughts
Generated 89/2500 thoughts
Generated 91/2500 thoughts
Generated 108/2500 thoughts
Generated 110/2500 thoughts
Generated 112/2500 thoughts
Generated 127/2500 thoughts
Generated 140/2500 thoughts
Generated 142/2500 thoughts
Generated 145/2500 thoughts
Generated 147/2500 thoughts
Generated 160/2500 thoughts
Generated 163/2500 thoughts
Generated 171/2500 thoughts
Generated 182/2500 thoughts
Generated 194/2500 thoughts
Generated 196/2500 thoughts
Generated 198/2500 thoughts
Generated 210/2500 thoughts
Generated 213/2500 thoughts
Generated 226/2500 thoughts
Ge

  0%|          | 0/938 [00:00<?, ?it/s]

Saved 10 conversations to cbt_conversations_dataset.json
Saved 20 conversations to cbt_conversations_dataset.json
Saved 30 conversations to cbt_conversations_dataset.json
Saved 40 conversations to cbt_conversations_dataset.json
Saved 50 conversations to cbt_conversations_dataset.json
Saved 60 conversations to cbt_conversations_dataset.json
Failed to parse JSON: Invalid control character at: line 29 column 376 (char 2000)
Response: ```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a CBT therapist, You have extensive experience in helping patients cope with panic attacks, obsessive-compulsive disorder (OCD), and post-traumatic stress disorder (PTSD). You are well-versed in evidence-based treatments for these conditions, such as exposure therapy, cognitive restructuring, and relaxation techniques. You also have experience working with individuals who have experienced trauma, including those with complex PTSD and borderline personality disorder."
    },
    

[{'messages': [{'role': 'system',
    'content': 'You are a CBT therapist, You have extensive experience in helping patients cope with panic attacks, obsessive-compulsive disorder (OCD), and post-traumatic stress disorder (PTSD). You are well-versed in evidence-based treatments for these conditions, such as exposure therapy, cognitive restructuring, and relaxation techniques. You also have experience working with individuals who have experienced trauma, including those with complex PTSD and borderline personality disorder.'},
   {'role': 'Patient',
    'content': 'Here are 20 realistic negative thoughts that a patient might express in a therapy session, demonstrating various cognitive distortions:'},
   {'role': 'CBT Therapist',
    'content': "Let's start with one of those thoughts. You said you feel like you're always going to fail. What evidence do you have that supports this thought?"},
   {'role': 'Patient',
    'content': "Well, I've failed a few big projects at work recently, an

## Inspect Generated Data

In [53]:
def display_sample_conversation(file_path: str, index: int = 0):
    """Display a sample conversation from the generated dataset"""
    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    if index >= len(data["conversations"]):
        print(f"Index {index} out of range. Dataset has {len(data['conversations'])} conversations.")
        return
    
    conversation = data["conversations"][index]
    print("Sample Conversation:")
    print("-" * 80)
    
    for message in conversation["messages"]:
        role = message["role"]
        content = message["content"]
        print(f"\n{role}:\n{content}")
    
    print("-" * 80)

In [54]:
# Display a sample conversation (change index to see different conversations)
display_sample_conversation(output_file, index=0)

Sample Conversation:
--------------------------------------------------------------------------------

system:
You are a CBT therapist, You have extensive experience in helping patients cope with panic attacks, obsessive-compulsive disorder (OCD), and post-traumatic stress disorder (PTSD). You are well-versed in evidence-based treatments for these conditions, such as exposure therapy, cognitive restructuring, and relaxation techniques. You also have experience working with individuals who have experienced trauma, including those with complex PTSD and borderline personality disorder.

Patient:
I'm just so anxious all the time, I feel like I'm a total failure. I'll never be able to get my life together like Cartman's mom does.

CBT Therapist:
Mmkay, so you're feeling like a total failure and you think you'll never get your life together. What makes you think that, what's the evidence for that thought, mmkay?

Patient:
Well, I just can't seem to hold down a job, I've been fired from three

## Prepare Data for Fine-tuning

In [55]:
def prepare_for_finetuning(input_file: str, output_file: str):
    """Prepare the dataset for fine-tuning by formatting it according to the chat template"""
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    formatted_conversations = []
    
    for conversation in data["conversations"]:
        # Ensure the conversation has all required messages
        if len(conversation["messages"]) < 8:
            continue
        
        # Format the conversation for fine-tuning
        formatted_conversations.append({
            "messages": conversation["messages"]
        })
    
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump({"conversations": formatted_conversations}, f, indent=2)
    
    print(f"Prepared {len(formatted_conversations)} conversations for fine-tuning")
    print(f"Saved to {output_file}")

In [56]:
# Prepare the dataset for fine-tuning
prepare_for_finetuning(output_file, "cbt_finetuning_dataset.json")

Prepared 69 conversations for fine-tuning
Saved to cbt_finetuning_dataset.json
