# Generate synthetic data

We generate synthetic data with the [dolphin-mistral](https://ollama.com/library/dolphin-mistral) model, run through the [Ollama](https://ollama.com/) framework.

Info from the Ollama page: "The Dolphin model by Eric Hartford, based on Mistral version 0.2 released in March 2024. This model is uncensored, available for both commercial and non-commercial use, and excels at coding."

Links: [Dolphin 2.2.1 model](https://huggingface.co/cognitivecomputations/dolphin-2.2.1-mistral-7b), [Eric Hartford](https://erichartford.com/).

For instructions and examples on using Ollama, see the [website](https://ollama.com/) or the [GitHub repository](https://github.com/ollama/ollama).

## Import libraries

In [81]:
import ollama
# from IPython.display import display, Markdown
import json
import csv

## Functions

In [82]:
def generate_text(prompt: str, model_name: str) -> tuple:
    """
    Take a prompt and a model name and use the Ollama framework to generate a response.
    Return a tuple with the model name and the response.
    """

    response = ollama.generate(model=model_name, prompt=prompt)

    m = response["model"]
    r = response["response"]

    # print(r)
    # display(Markdown(r))

    return m, r

## Select model

In [83]:
model_name = "dolphin-mistral"

## Generate

In [84]:
prompt = "Can you generate an example of dysfunctional communication between a couple that is fighting about the children?"
m, r = generate_text(prompt, model_name=model_name)
print(r)

Sure, here's an example:

Mom: "I can't believe you let them stay up past their bedtime again! They are always cranky and tired the next day."

Dad: "They were excited to watch that movie, and besides, it wasn't even that late!"

Mom: "Well, maybe they wouldn't be so grumpy if they actually got enough sleep. You never listen to me when I say something about their schedule."

Dad: "You always overreact! They love staying up later on special occasions. It's not a big deal!"

Mom: "It's a big deal! Their school work suffers, and it messes with their bedtime routine for weeks! You just don't care about them like I do."

Dad: "That's not true! I want what's best for them too. Sometimes you need to be more flexible!"

Mom: "Flexible? What about when we have to wake up super early for your work meetings and they are still half asleep?"

Dad: "You never give me a chance to explain! You just assume the worst."

Mom: "Assume the worst? I'm just looking out for our kids!"

Dad: "And I am too! We 

In [85]:
prompt = "do you know The Gottman Four Horsemen?"
m, r = generate_text(prompt, model_name=model_name)
print(r)

Yes, I am familiar with the Gottman Four Horsemen. The Gottman Institute is a relationship research center that was founded by Dr. John and Julie Gottman. The Gottman Four Horsemen refer to four negative communication patterns in relationships that can lead to conflict and, if not addressed properly, can cause significant damage to the bond between partners.

The Four Horsemen are:
1. Criticism: This involves making global, negative statements about a partner instead of focusing on specific actions or behaviors. For example, saying "You're always late" instead of "I feel frustrated when you don't show up on time."
2. Defensiveness: Partners may become defensive in response to criticism and blame their partners for the issue at hand. This can lead to a cycle of defensiveness and counter-defensiveness, making it difficult to address the root cause of conflict.
3. Contempt: Contempt refers to a deep sense of disgust or disdain towards a partner. It often manifests through sarcasm, mockery

In [86]:
prompt = """
I am a couples mediator and I need to practice mediating between couples who communicate with toxic language.
Show me examples of such conversations using language that fits in the Gottman Four Horsemen. Show me an example for each category.
"""
m, r = generate_text(prompt, model_name=model_name)
print(r)

1. Criticism: This involves attacking your partner's personality or character rather than addressing a specific behavior or issue. It can be harsh, judgmental, and often comes across as blaming or shaming. An example would be:

Partner A: "You never help out around the house! You're so lazy."

2. Defensiveness: This is when one partner tries to protect themselves by denying responsibility for a problem, avoiding taking any accountability, or counter-attacking with criticism. An example would be:

Partner B: "That's not true! I do my share of the housework."

3. Contempt: This is when one partner expresses disdain or superiority over their partner through sarcasm, mockery, or hostile behavior. It can include eye-rolling, sneering, or making fun of your partner. An example would be:

Partner A: "Oh, you're sooo smart, aren't you? You have no idea how to fix that."

4. Stonewalling: This involves withdrawing emotionally, physically, or mentally from the conversation as a way of avoiding c

In [87]:
prompt = """
I am a couples mediator and I need to practice mediating between couples who communicate with toxic language.
Show me examples of such conversations using language that fits in the Gottman Four Horsemen. Show me an example for each category.
Only generate the examples preceded by the category.
"""
m, r = generate_text(prompt, model_name=model_name)
print(r)

1. Criticism: The first example is when one partner criticizes the other for their actions, behaviors or character traits. For instance, "You always forget to take out the trash, and that's so lazy." This statement attacks the person's character, which can lead to resentment and defensiveness.

2. Defensiveness: The second example is when a partner defends themselves against their partner's criticism by blaming others or denying responsibility for their actions. For example, "I didn't forget to take out the trash; I was busy with other things." This response shows a lack of accountability and can escalate the conflict.

3. Contempt: The third example is when one partner expresses contempt towards the other through sarcasm or belittling language. For example, "Oh great, you're home! I was just about to clean up the mess you left." This statement implies that the partner's presence is a bother and contributes negatively to their environment.

4. Stonewalling: The final example is when on

In [88]:
prompt = """
I want to develop an AI system capable of handling dysfunctional and toxic language.
To achieve this, I need to generate synthetic data to fine-tune and evaluate various models.
Your task is to help me find the best prompt to generate this text.
I will likely need 100 examples of dysfunctional and toxic language.
"""
m, r = generate_text(prompt, model_name=model_name)
print(r)

Creating a list of 100 examples of dysfunctional and toxic language can be quite challenging, as it encompasses a wide range of expressions that are meant to insult, intimidate or belittle others. Here's an example prompt that you can use to generate such data:

Prompt: Generate 100 synthetic sentences/phrases expressing hostility, aggression, and disdain. These should include insults, threats, abusive language, and other forms of toxicity.

Examples could include phrases like:
1. "You're so stupid, I can't believe you even try."
2. "Shut up! No one wants to hear your nonsense."
3. "I hope you get what's coming to you, and it hurts like hell."
4. "You are such a failure. You don't deserve any better in life."
5. "Go back to where you came from. We don't need people like you here."
6. "You are so ugly; no one would ever want to be with you."
7. "Why don't you just kill yourself? The world would be a better place without you."
8. "I will make your life miserable until you leave us alone.

#### OK, this prompt doesn't work as intended! Let's try to modify it.

In [89]:
prompt = """
I want to develop an AI system capable of handling dysfunctional and toxic language.
To achieve this, I need to generate synthetic data to fine-tune and evaluate various models.
Your task is to help me find the prompt to provide to you to generate this text.
I will likely need 100 examples of dysfunctional and toxic language.
"""
m, r = generate_text(prompt, model_name=model_name)
print(r)

Sure! To generate synthetic data for your AI system, I can provide you with a series of prompts that you can use to create the examples you need. Here are some prompts:

1. You've just been rejected by someone you really care about. Write a message expressing your feelings in an aggressive and toxic manner.
2. Write a comment on social media attacking someone for their beliefs or lifestyle choices.
3. You're having an argument with a friend, and things are getting out of hand. Write a message escalating the situation.
4. Your boss has given you negative feedback about your performance. Craft a toxic response to express your anger and frustration.
5. A coworker is constantly taking credit for your work. Write a passive-aggressive message to show your discontent without directly confronting them.
6. You're playing an online game with someone who is not performing well. Write a message degrading their skills or abilities.
7. Your sibling has broken something of yours. Craft a message expr

#### Sometimes the model also provides Python code to generate the synthetic data!

#### OK, let's try giving the model instructions and some examples.

In [90]:
prompt = """
I want to develop an AI system capable of handling dysfunctional and toxic language.
To achieve this, I need to generate synthetic data to fine-tune and evaluate various models.

Here are some example conversations based on the Gottman Four Horsemen:

1. Criticism:
"You never remember important dates."
"I can't believe you forget everything!"

2. Defensiveness:
"It's not like I didn't try to help you with your project."
"At least I made an effort! It's not my fault if you don't appreciate it."

3. Contempt:
"You always manage to mess up everything, don't you?"
"I can't believe you're even in this relationship - what a disaster!"

4. Stonewalling:
"You just keep ignoring me when I try to talk about our problems."
(Stonewalls by walking away and not responding)


Your task is to generate 20 synthetic data samples of dysfunctional and toxic language using these examples as a basis.
"""

m, r = generate_text(prompt, model_name=model_name)

print(r)

1. Criticism:
"You never take my feelings into account."
"I can't believe you are so inconsiderate!"

2. Defensiveness:
"At least I tried to support you when you needed it the most."
"It's not my fault if you don't know how to appreciate what I do for you."

3. Contempt:
"You always prioritize your needs over mine, don't you?"
"I can't believe I'm even with someone so selfish!"

4. Stonewalling:
"You keep ignoring my feelings and thoughts, as if they don't matter at all."
(Stonewalls by turning away and not engaging in conversation)

5. Criticism:
"You never seem to care about our family's well-being."
"I can't believe you put your own needs before everyone else!"

6. Defensiveness:
"At least I try my best to be a good provider and take care of everything!"
"It's not my fault if you don't know how to appreciate what I do for our family."

7. Contempt:
"You always seem to blame others for your own mistakes, don't you?"
"I can't believe I'm even with someone so irresponsible!"

8. Stonew

#### The following 2 prompts were suggested by GPT-4o, and they also work with GPT!

In [91]:
prompt = """
Generate 10 examples of dysfunctional and toxic language that might be encountered between couples 
or ex-couples who have to continuously interact, such as for shared custody of children.
Each example should reflect various forms of toxicity such as insults, harassment, threats, manipulation, and derogatory remarks.
Ensure the examples are diverse in terms of content.
"""

m, r = generate_text(prompt, model_name=model_name)

print(r)

1. "You're just like your mother – always meddling and never knowing when to stop."
2. "I can't believe you're still acting like a child, throwing tantrums because you don't get your way."
3. "Why do I even bother talking to you? You never listen anyway."
4. "Just because we're co-parenting doesn't mean you're entitled to my time and energy. Grow up!"
5. "You've ruined everything – our kids, our marriage, and now this shared custody arrangement too. It's all your fault!"
6. "I don't know why I even bother trying to communicate with you. You always twist things around to make me look bad."
7. "You just want to control every aspect of the children's lives – you're obsessed!"
8. "I can't believe you think you have a right to be part of our kids' lives! They don't even like being around you."
9. "Why do you always try to undermine me in front of the kids? You're so manipulative!"
10. "You need to understand that your way isn't always the best – sometimes I have good ideas too, and it's not

In [92]:
prompt = """
Generate 10 examples of dysfunctional and toxic language that might be encountered between couples 
or ex-couples who have to continuously interact, such as for shared custody of children.
Base the examples on the Gottman Four Horsemen: criticism, contempt, defensiveness, and stonewalling.
Each example should clearly reflect one of these toxic communication styles.
Ensure the examples are diverse in terms of content and context.
Only generate the examples preceded by the category.
"""

m, r = generate_text(prompt, model_name=model_name)

print(r)

Criticism:
1. "You always forget to pick up the kids on time."
2. "Why can't you ever clean up after yourself?"
3. "I can't believe we still haven't agreed on a budget for our shared expenses."
4. "You never put any effort into making plans with the kids."
5. "Why do you always have to leave such a mess in the kitchen?"

Contempt:
1. "Oh great, another one of your lame jokes."
2. "Just like your father, always playing the victim."
3. "You're so pathetic when it comes to taking responsibility."
4. "That's such a stupid idea. You never think things through."
5. "I can't believe we ended up with a child like this."

Defensiveness:
1. "It's not my fault the kids were late for school today!"
2. "What are you implying? It's not all on me that our finances are messed up!"
3. "I never said that! You always twist everything I say."
4. "I didn't leave the dishes in the sink! They must have been there when I came home."
5. "You're just saying this to make yourself feel better about your own mista

#### The number of generated examples is not always consistent with the instructions!
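One way to check this automatically is to parse the plain-text response and count the examples per category. This is a minimal sketch, assuming the response follows the layout shown above (a `Category:` line followed by numbered examples); `count_examples` is a hypothetical helper, not part of the original notebook.

```python
import re
from collections import defaultdict


def count_examples(text: str) -> dict:
    """Count numbered examples under each 'Category:' heading line."""
    counts = defaultdict(int)
    category = None
    for line in text.splitlines():
        line = line.strip()
        # A heading is a single word followed by a colon, e.g. "Criticism:"
        heading = re.fullmatch(r"([A-Za-z]+):", line)
        if heading:
            category = heading.group(1)
        # An example is a line starting with a number and a dot, e.g. '1. "..."'
        elif category and re.match(r"\d+\.\s", line):
            counts[category] += 1
    return dict(counts)


sample = '''Criticism:
1. "You always forget to pick up the kids on time."
2. "Why can't you ever clean up after yourself?"

Contempt:
1. "Oh great, another one of your lame jokes."
'''
counts = count_examples(sample)
print(counts, "total:", sum(counts.values()))
```

Comparing the total with the number requested in the prompt shows when the model generated more or fewer examples than asked.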

In [93]:
prompt = """
Generate 10 examples of dysfunctional and toxic language that might be encountered between couples 
or ex-couples who have to continuously interact, such as for shared custody of children.
Base the examples on the Gottman Four Horsemen: criticism, contempt, defensiveness, and stonewalling.
Each example should clearly reflect one of these toxic communication styles.
Ensure the examples are diverse in terms of content and context.
Only generate the examples preceded by the category.
Write the output in a json file format.
"""

m, r = generate_text(prompt, model_name=model_name)

#### It creates a JSON string. I need to convert this string to a Python object!

In [94]:
print(r)

{
    "Criticism": [
        {
            "Example": "Why can't you ever put your clothes away like I ask?",
            "Explanation": "This statement is critical of the other person's behavior, without any constructive suggestions for improvement."
        },
        {
            "Example": "You never help with the kids. You just sit there on your phone.",
            "Explanation": "This statement generalizes the behavior of the other person and implies that they are not pulling their weight in parenting responsibilities."
        }
    ],
    "Contempt": [
        {
            "Example": "What a loser. Can't even put gas in the car.",
            "Explanation": "This statement is filled with disdain and belittles the other person, showing no respect or empathy."
        },
        {
            "Example": "You're so self-centered that you can't even think about our kids for a second.",
            "Explanation": "This statement implies that the other person is only focused on th

### JSON string to object

In [95]:
myobject = json.loads(r)

In [96]:
myobject

{'Criticism': [{'Example': "Why can't you ever put your clothes away like I ask?",
   'Explanation': "This statement is critical of the other person's behavior, without any constructive suggestions for improvement."},
  {'Example': 'You never help with the kids. You just sit there on your phone.',
   'Explanation': 'This statement generalizes the behavior of the other person and implies that they are not pulling their weight in parenting responsibilities.'}],
 'Contempt': [{'Example': "What a loser. Can't even put gas in the car.",
   'Explanation': 'This statement is filled with disdain and belittles the other person, showing no respect or empathy.'},
  {'Example': "You're so self-centered that you can't even think about our kids for a second.",
   'Explanation': "This statement implies that the other person is only focused on themselves and has no regard for the children's wellbeing."}],
 'Defensiveness': [{'Example': "I don't see why you always have to bring up past issues. I'm tryi
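`json.loads` fails when the model adds text around the JSON or wraps it in a Markdown code fence. The sketch below is one possible workaround: it scans the response for the first position where a JSON value can be decoded. `extract_json` is a hypothetical helper, and it assumes that the first decodable object or array in the response is the one we want.

```python
import json


def extract_json(text: str):
    """Parse the first JSON object or array found in a model response."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in "[{":
            try:
                # raw_decode parses one JSON value and ignores any trailing text
                value, _ = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON value found in the response")


# Build a response wrapped in a Markdown fence, with some text before it
fence = "`" * 3
wrapped = (
    "Sure, here it is:\n"
    + fence
    + 'json\n[{"dysfunctional": "a", "healthy": "b"}]\n'
    + fence
)
print(extract_json(wrapped))
```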

## Test more prompts

In [97]:
prompt = """
Generate synthetic sentences that could be part of a dialogue between couples or ex-couples who have
issues communicating but must continuously interact, such as for shared custody of children.

Each entry should include:

1 - A sentence reflecting dysfunctional communication, and should reflect various forms of toxicity such
    as insults, harassment, threats, manipulation, and derogatory remarks.

2 - A transformed version of the same sentence that represents functional, healthy communication.

Ensure the sentences are realistic and diverse in terms of content and context.
Provide 5 pairs of sentences.
Write the output in a json file format.
"""

m, r = generate_text(prompt, model_name=model_name)

print(r)

```json
[
  {
    "dysfunctional": "You're such an irresponsible parent! You never show up on time for our kids.",
    "healthy": "I've noticed that you haven't been punctual when we switch custody of the children. Can we work together to ensure we're both on time?"
  },
  {
    "dysfunctional": "Stop trying to control everything, I can't handle this much stress all by myself.",
    "healthy": "I feel overwhelmed with the responsibility of managing everything on my own. Can we share the duties and responsibilities in a more balanced way?"
  },
  {
    "dysfunctional": "Why do you always have to be right? You need to learn when to let things go.",
    "healthy": "It's important for us to find common ground. Can we work together to find solutions that work best for both of us?"
  },
  {
    "dysfunctional": "You never listen to me! I don't understand how you can be so stubborn.",
    "healthy": "I feel heard when you actively listen and consider my thoughts. Can we work on better communi

In [98]:
prompt = """
Generate examples of dysfunctional and toxic language that might be encountered between couples or 
ex-couples who have to continuously interact, such as for shared custody of children.

Each entry should include:

1 - A sentence reflecting dysfunctional communication, showcasing various forms of toxicity such as
    insults, harassment, threats, manipulation, and derogatory remarks.

2 - A transformed version of the same sentence that represents functional, healthy communication.

Ensure the sentences are realistic and diverse in terms of content and context.

Provide 5 pairs of sentences.

Output the results in JSON file format.
"""

m, r = generate_text(prompt, model_name=model_name)

print(r)

{
    "1": {
        "Dysfunctional Communication": "You're such an awful father, I can't believe you were ever responsible for taking care of a child.",
        "Functional Communication": "It seems we have different parenting styles. It's important that both parents are involved in their children's upbringing. Can we discuss how to best co-parent?"
    },
    "2": {
        "Dysfunctional Communication": "I can't believe you let our kids watch TV all day, they're going to be so unhealthy.",
        "Functional Communication": "It seems like our children are spending a lot of time watching TV. Can we discuss how we can balance screen time with other activities?"
    },
    "3": {
        "Dysfunctional Communication": "You're such a deadbeat, you never help out with anything.",
        "Functional Communication": "I appreciate that our schedules may be different, but I feel it is important to communicate and find ways we can both contribute to our children's care."
    },
    "4": {
 

#### The output format is not consistent.
#### I must explicitly describe the output format in the prompt.
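The two responses above use different shapes: a list of objects with `dysfunctional`/`healthy` keys, and an object keyed `"1"`, `"2"`, ... with `Dysfunctional Communication`/`Functional Communication` keys. Until the prompt pins the format down, a small normalizer can map both to one layout. This is only a sketch: `normalize_pairs` is a hypothetical helper, and matching keys by substring is an assumption based on the two outputs seen here.

```python
def normalize_pairs(parsed) -> list:
    """Map a parsed response to a list of {'dysfunctional', 'functional'} dicts."""
    # An object keyed "1", "2", ... is treated like a list of its values
    entries = list(parsed.values()) if isinstance(parsed, dict) else parsed
    pairs = []
    for entry in entries:
        pair = {}
        for key, value in entry.items():
            name = key.lower()
            # Check "dysfunctional" first because it contains "functional"
            if "dysfunctional" in name:
                pair["dysfunctional"] = value
            elif "functional" in name or "healthy" in name:
                pair["functional"] = value
        # Keep only entries where both sides were found
        if len(pair) == 2:
            pairs.append(pair)
    return pairs


as_list = [{"dysfunctional": "a", "healthy": "b"}]
as_dict = {"1": {"Dysfunctional Communication": "c", "Functional Communication": "d"}}
print(normalize_pairs(as_list))
print(normalize_pairs(as_dict))
```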

## Multiple Issues

The model output was unstable and frequently was not valid JSON.

To address this, I made the following changes:

- Introduced a `system message`: "You are an AI assistant that outputs only JSON data. Do not include any text before or after the JSON response."
- Reiterated the desired format in the `prompt`: "Always respond only with valid JSON format and nothing else. Do not include any text before or after the JSON."

Even with these adjustments, the model does not always generate the desired output format.

So I added a check inside the for loop: if the output is not valid JSON, the model is re-run, up to a maximum number of attempts.

#### New function with the system message

In [99]:
def generate_text_2(prompt: str, model_name: str) -> tuple:
    """
    Take a prompt and a model name and use the Ollama framework to generate a response,
    passing a system message that asks for JSON-only output.
    Return a tuple with the model name and the response.
    """

    system_message = """
    You are an AI assistant that outputs only JSON data.
    Do not include any text before or after the JSON response.
    """

    response = ollama.generate(
        model=model_name,
        prompt=prompt,
        system=system_message)

    m = response["model"]
    r = response["response"]

    return m, r
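Besides the system message, Ollama's generate API has a `format` parameter that can be set to `"json"` to constrain the output. The sketch below assumes the installed `ollama` Python package accepts `format="json"` in `generate`; `generate_json` and its `client` argument are hypothetical and were not used in this notebook. The prompt should still ask for JSON, and the result may still need reshaping if the model returns an object instead of a list.

```python
import json


def generate_json(prompt: str, model_name: str, client=None):
    """Request JSON-constrained output and parse it into a Python object."""
    if client is None:
        import ollama  # imported lazily so the sketch can be tried with a stub client
        client = ollama

    response = client.generate(
        model=model_name,
        prompt=prompt,
        format="json",  # ask Ollama to constrain the output to JSON
    )
    return json.loads(response["response"])
```

Passing a `client` makes it possible to try the function without a running Ollama server, for example with a small stub object that has a `generate` method.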

In [100]:
issues = [
    "Financial disagreements (e.g., spending habits, debt, child support payments).",
    "Division of household responsibilities (e.g., chores, maintenance).",
    "Communication breakdowns (e.g., lack of communication, misunderstandings).",
    "Trust issues (e.g., infidelity, suspicion, jealousy).",
    "Differences in parenting styles or decisions.",
    "Time management and scheduling conflicts (e.g., visitation schedules, personal time).",
    "Personal boundaries and respect for privacy.",
    "Extended family involvement (e.g., in-laws, family obligations).",
    "Social life and friendships (e.g., differing social circles, time spent with friends).",
    "Emotional support and empathy (e.g., lack of emotional support, neglect).",
    "Career or job-related stress and decisions.",
    "Relocation or living arrangements (e.g., moving cities, living apart).",
    "Health and wellness issues (e.g., disagreements about medical decisions, lifestyle choices).",
    "Substance abuse or addiction problems.",
    "Legal issues (e.g., divorce proceedings, custody battles).",
    "Shared custody of children.",
    "Differing future goals and aspirations.",
    "Handling of past conflicts and unresolved issues.",
    "Intimacy and sexual compatibility.",
    "Vacations and leisure activities (e.g., differing interests, planning conflicts).",
    "Child discipline and education decisions.",
]

responses_json = []
N_PAIRS = 5
N_out = 1

for issue in issues:

    print(f"Generating output {N_out} of {len(issues)}")
    N_out += 1


    prompt_1 = f"""
    Generate examples of dysfunctional and toxic language that might be encountered between couples or 
    ex-couples who have to continuously interact.

    Each entry should include:

    1 - A sentence reflecting dysfunctional communication, showcasing various forms of toxicity such as
        insults, harassment, threats, manipulation, and derogatory remarks.
    
    2 - A transformed version of the same sentence that represents functional, healthy communication.

    Ensure the sentences are realistic and diverse in terms of content and context.
    The sentences should refer to this issue category:
    '{issue}'

    Provide {N_PAIRS} pairs of sentences.
    """

    # The prompt is split into two parts because the output-format example
    # contains literal curly brackets {}, which would have to be escaped
    # as {{ and }} inside an f-string.
    prompt_2 = """
    Always respond only with valid JSON format and nothing else.
    Do not include any text before or after the JSON.
    
    You must provide the output exactly in the following format:

    [
        {
            "dysfunctional": "write here the dysfunctional text",
            "functional": "write here the functional text"
        },
        {
            "dysfunctional": "write here the dysfunctional text",
            "functional": "write here the functional text"
        }
    ]
    """

    prompt = prompt_1 + prompt_2

    max_iteration = 3
    num_tries = 0

    while True:
        m, r = generate_text_2(prompt, model_name=model_name)

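        # Check that the output is valid JSON and is a list of pairs
        try:
            parsed = json.loads(r)
            if not isinstance(parsed, list):
                # Extending a list with a dict would silently add only its keys
                raise ValueError("expected a JSON array")
            responses_json += parsed
            break
        except ValueError:  # also catches json.JSONDecodeError, a ValueError subclass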
            num_tries += 1
            if num_tries >= max_iteration:
                print(" "*4 + f"Invalid JSON format after {max_iteration} retries. Aborting...")
                break
            print(" "*4 + "Invalid JSON format. Re-running model...")

Generating output 1 of 21
Generating output 2 of 21
Generating output 3 of 21
Generating output 4 of 21
Generating output 5 of 21
Generating output 6 of 21
Generating output 7 of 21
Generating output 8 of 21
Generating output 9 of 21
Generating output 10 of 21
Generating output 11 of 21
Generating output 12 of 21
Generating output 13 of 21
Generating output 14 of 21
Generating output 15 of 21
Generating output 16 of 21
Generating output 17 of 21
Generating output 18 of 21
Generating output 19 of 21
Generating output 20 of 21
Generating output 21 of 21
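The loop above builds the prompt from two strings to avoid curly brackets in an f-string. An alternative is to keep a single f-string and double the literal braces. The sketch below shows this with a shortened prompt; `build_prompt` is a hypothetical helper.

```python
def build_prompt(issue: str, n_pairs: int) -> str:
    """Build the prompt in a single f-string; {{ and }} produce literal braces."""
    return f"""
    The sentences should refer to this issue category:
    '{issue}'

    Provide {n_pairs} pairs of sentences.

    You must provide the output exactly in the following format:

    [
        {{
            "dysfunctional": "write here the dysfunctional text",
            "functional": "write here the functional text"
        }}
    ]
    """


print(build_prompt("Shared custody of children.", 5))
```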


In [101]:
print(f"Number of pairs generated: {len(responses_json)}")
print(f"This number should be equal to: {len(issues) * N_PAIRS} = {len(issues)} issues x {N_PAIRS} pairs")

Number of pairs generated: 105
This number should be equal to: 105 = 21 issues x 5 pairs
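A matching count does not guarantee that every entry has the expected keys. As a further check, the sketch below separates well-formed pairs from anything else. `validate_pairs` is a hypothetical helper, and it assumes each entry should contain exactly the keys `dysfunctional` and `functional` with non-empty strings.

```python
def validate_pairs(pairs: list) -> tuple:
    """Split entries into valid pairs and rejected entries."""
    required = {"dysfunctional", "functional"}
    valid, rejected = [], []
    for entry in pairs:
        ok = (
            isinstance(entry, dict)
            and set(entry) == required
            and all(isinstance(entry[k], str) and entry[k].strip() for k in required)
        )
        (valid if ok else rejected).append(entry)
    return valid, rejected


sample = [
    {"dysfunctional": "You never listen!", "functional": "I would like to feel heard."},
    {"dysfunctional": "You are so selfish."},  # missing the functional side
    "not a dict",
]
valid, rejected = validate_pairs(sample)
print(f"{len(valid)} valid, {len(rejected)} rejected")
```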


In [102]:
responses_json

[{'dysfunctional': "You're so irresponsible with money, always overspending and ruining our finances!",
  'functional': 'I am concerned about your spending habits. It would be helpful to discuss this together.'},
 {'dysfunctional': "How could you let us get into this much debt? I can't believe how selfish and reckless you are.",
  'functional': 'I am upset by the amount of debt we have. We should find a solution to work together on getting out of it.'},
 {'dysfunctional': "Why don't you ever take responsibility for your actions? You never pay child support on time.",
  'functional': 'I am disappointed that the child support has not been paid as agreed. We should discuss how to improve this situation.'},
 {'dysfunctional': "You're so stingy and cheap, always making me feel guilty for wanting to enjoy life.",
  'functional': 'I would like us to have a conversation about our financial priorities and find ways to balance spending with saving.'},
 {'dysfunctional': "You're just taking advan

### Save JSON file

In [103]:
with open("synthetic_data_ollama.json", "w") as write_file:
    json.dump(responses_json, write_file, indent=4)

### Save CSV file

In [104]:
keys = responses_json[0].keys()
with open("synthetic_data_ollama.csv", "w", newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=keys)
    writer.writeheader()
    writer.writerows(responses_json)
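To confirm that both files can be read back without loss, a round-trip check can be useful. The sketch below writes a small made-up list of pairs to a temporary directory, so the file names and data are placeholders and not the notebook's real output.

```python
import csv
import json
import os
import tempfile

# Placeholder data with the same keys as responses_json
pairs = [{"dysfunctional": "You never listen!", "functional": "I would like to feel heard."}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "pairs.json")
    csv_path = os.path.join(tmp, "pairs.csv")

    # Write both formats the same way as in the cells above
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(pairs, f, indent=4)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["dysfunctional", "functional"])
        writer.writeheader()
        writer.writerows(pairs)

    # Read them back and compare with the original data
    with open(json_path, encoding="utf-8") as f:
        from_json = json.load(f)
    with open(csv_path, newline="", encoding="utf-8") as f:
        from_csv = list(csv.DictReader(f))

print(from_json == pairs, from_csv == pairs)
```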