# Synthetic dataset generation for Competitive Debate

This notebook generates synthetic text data with **gpt-4o-mini**. Each record is structured as:

- Debate motion
- Model thesis
- Question for the counter-thesis

Let's dive in!🚀

## 1. Install necessary dependencies

In [2]:
! pip install langchain langchain_core langchain_openai langchain_community datasets huggingface_hub

Collecting langchain
  Using cached langchain-0.2.11-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain_core
  Using cached langchain_core-0.2.26-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain_openai
  Using cached langchain_openai-0.1.20-py3-none-any.whl.metadata (2.6 kB)
Collecting langchain_community
  Downloading langchain_community-0.2.10-py3-none-any.whl.metadata (2.7 kB)
Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.95-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain_core)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting 

## 2. Download the dataset with motions

The dataset with 10K debate motions can be found [at this link](https://huggingface.co/datasets/kokhayas/english-debate-motions-utds).

In [3]:
from datasets import load_dataset

dts = load_dataset("kokhayas/english-debate-motions-utds")
dtsdct = dts["train"]


Let's randomly draw up to 1000 motions from the dataset, keeping only those that start with "TH" (for example "THW", "THBT" or "That ...").

In [10]:
import random

motions = dtsdct["motion"]
# Keep only motions that start with "th" (THW, THBT, TH supports, That ...), case-insensitively
candidates = [m for m in motions if m.lower().startswith("th")]
# Sample without replacement so that no dataset entry is picked twice
motions2use = random.sample(candidates, min(1000, len(candidates)))
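
To make the random selection reproducible across runs, one option is to seed a dedicated random generator. The sketch below filters motions by prefix and then samples without replacement. The helper name `sample_motions`, the seed value and the toy list (only partly taken from the dataset) are illustrative assumptions, not part of the original notebook.

```python
import random

def sample_motions(motions, k, seed=42, prefix="th"):
    """Return up to k distinct entries that start with `prefix` (case-insensitive)."""
    candidates = [m for m in motions if m.lower().startswith(prefix)]
    rng = random.Random(seed)  # local generator: leaves the global random state untouched
    return rng.sample(candidates, min(k, len(candidates)))

# Toy data for illustration only
toy_motions = [
    "THW ban zoos.",
    "THBT EU should become a federation.",
    "Resolved: homework should be abolished.",
    "TH supports a four-day work week.",
]
picked = sample_motions(toy_motions, k=2)
print(picked)
```

With the real data, the call would look like `sample_motions(dtsdct["motion"], k=1000)`; the same seed then yields the same selection every time.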

In [11]:
motions2use

['THW cap political donation from industries which offer products/services which have high health/life risks/dangers. (such as gun, medicine, tobacco)',
 'THW drive out companies when they violate quality of product.',
 'That schools should prioritise teachers’ knowledge of their subject area over their knowledge of pedagogy and educational theory.',
 'THBT EU should become a federation.',
 'THBT West should stop all forms of sanctions on Myanmar now.',
 'THBT the US should sell a missile defence shield to Taiwan.',
 'THW ban the publication, exhibition, and sale of art created through self-injury.',
 'THW allow children to sue parents for past negligence in raising the children.',
 'THW punish parents for the child’s crime.',
 'THBT assassination should never be a legitimate tool of foreign policy',
 'TH supports corporal punishment in schools.',
 'THW grant medical staff the right to object to providing specific treatments on religious and ethical grounds',
 'THW take away the drivin

## 3. Connect your OpenAI API key
Register your API key under the Google Colab "Secrets" (the key icon in the sidebar menu) as `OPENAI_API_KEY` and make it an environment variable.



In [12]:
from google.colab import userdata
import os

a = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = a
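
Outside Google Colab, `google.colab.userdata` is not available. A minimal fallback sketch, assuming you are happy to type the key interactively when it is not already set: the helper name `ensure_api_key` and its `ask` parameter are hypothetical and only serve this example.

```python
import os
from getpass import getpass

def ensure_api_key(name="OPENAI_API_KEY", ask=getpass):
    """Return the key from the environment, prompting for it only if it is missing."""
    key = os.environ.get(name)
    if not key:
        key = ask(f"Enter {name}: ")  # getpass hides the typed value
        os.environ[name] = key
    return key

# Usage (prompts only when the variable is not set):
# ensure_api_key()
```

Passing the prompt function as a parameter keeps the helper easy to test without user input.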

## 4. Build a suitable system prompt
Build a system prompt that tells your model of choice (in our case **gpt-4o-mini**) how to behave.



In [15]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field

GPT_MODEL = "gpt-4o-mini"
llm = ChatOpenAI(temperature=0, model=GPT_MODEL)
system_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a contestant in a debate competition, whose goal is to provide the best ever argumentation in favor of the thesis you are assigned.
            You should reply according to this output structure:
            - Debate motion
            - Your argumentation in favor of the motion
            - A question for your adversary, which would prompt their counter-argumentation
            Remember to always be concise and straight to the point in your replies."""),
        ("human", "{message}"),
    ]
)

## 5. Define the output structure

Define the output schema as a Pydantic model. The LLM is constrained to follow it through the `with_structured_output` method.



In [16]:
from langchain_core.pydantic_v1 import BaseModel, Field


class debate(BaseModel):
    """Schema for debate argumentations starting from a motion"""

    # Each field description is passed to the model as part of the schema
    motion: str = Field(description="Motion of the debate")
    argumentation: str = Field(description="Argumentation in favor of the motion")
    question: str = Field(description="Question that would prompt the opponent to their counter-argumentation")

chain = system_prompt | llm.with_structured_output(debate)

## 6. Build the dataset

You can build the dataset from the motions collected above by invoking the LLM chain on each one. Each answer is formatted as text, serialized to a JSON string, and written as one line of a JSONL file (*dataset.jsonl*).

You can then upload the dataset to HuggingFace or Kaggle or whatever data sharing platform you like!🤩

In [17]:
res = chain.invoke({"message": motions2use[0]})
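
A single call like the one above can fail transiently (rate limits, network errors), which matters once it is repeated for every motion. A minimal retry sketch with exponential backoff, assuming any exception is worth retrying: the helper `call_with_retries` and its parameters are hypothetical and not part of the original notebook or of LangChain.

```python
import time

def call_with_retries(fn, *args, attempts=3, base_delay=1.0, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... the base delay

# Usage with the chain defined earlier:
# res = call_with_retries(chain.invoke, {"message": motions2use[0]})
```

In practice you may want to catch only the specific error types raised by your client rather than every `Exception`.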

In [21]:
import json

# Write one JSON object per line; the context manager closes the file even if a call fails
with open("dataset.jsonl", "w", encoding="utf-8") as jf:
    for m2u in motions2use:
        res = chain.invoke({"message": m2u})
        txt = f"Motion:\n{res.motion}\n\nArgumentation in favor of the motion:\n{res.argumentation}\n\nQuestion to answer:\n{res.question}"
        jf.write(json.dumps({"text": txt}) + "\n")
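
Before sharing the file, it can be worth reading it back and checking that every record is well formed. A minimal sketch, assuming the three section headers used in the text template above: the helper names `read_jsonl` and `is_complete` are illustrative.

```python
import json

def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def is_complete(record):
    """True if the record's text contains all three section headers."""
    headers = (
        "Motion:",
        "Argumentation in favor of the motion:",
        "Question to answer:",
    )
    text = record.get("text", "")
    return all(h in text for h in headers)

# Usage:
# records = read_jsonl("dataset.jsonl")
# incomplete = [r for r in records if not is_complete(r)]
```

Any record that ends up in `incomplete` can be regenerated or dropped before upload.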

### Cite this notebook

**Author**: Astra Clelia Bertelli

**Creation date**: 2nd August 2024