<a href="https://colab.research.google.com/github/CFA-Institute-RPC/Synthetic-Data-For-Finance/blob/main/LLM/03_Synthetic_Data_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Producing synthetic data with a (larger) LLM

We now use a larger LLM, OpenAI's GPT-4o, to generate synthetic data. We give it a few training examples, then use the data it generates to fine-tune another Qwen3 model in the next notebook to see whether its performance improves. We'll generate 800 synthetic samples, roughly the size of the FiQA-SA training dataset.

**Note: Running this notebook produces a different synthetic dataset each time, which may lead to better or worse results in notebooks 4 and 5. For reproducibility, we have uploaded the original synthetic dataset, and it should be used for those notebooks. We suggest saving any new dataset created in this notebook under a different name so that the original dataset is not overwritten.**

# Creating an OpenAI API Key

You will need an OpenAI API key in order to proceed with this notebook. If you do not have one, you can create one [here](https://platform.openai.com/api-keys). Copy the key as you will need it shortly.

1. After creating a key, navigate back to this notebook and look for the key symbol on the left-hand side, labelled 'Secrets'.
2. Click on it, then 'Add new secret'.
3. In the Name column, enter 'OPENAI_API_KEY'.
4. In the Value column, paste the API key.
5. Click the Notebook access toggle so it turns blue, which means the notebook can now use your API key.
6. Proceed with the notebook!
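
The steps above apply to Colab only. As a minimal sketch for running elsewhere, assuming the key has been exported as an environment variable with the same name, a hypothetical helper `load_api_key` (not part of the original notebook) could try the Colab secrets first and fall back to the environment:

```python
import os


def load_api_key(name="OPENAI_API_KEY"):
    """Return the API key from Colab secrets, else from the environment.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        # This import only succeeds inside Google Colab
        from google.colab import userdata
    except ImportError:
        userdata = None

    # Assumption: outside Colab the key lives in an environment variable
    key = userdata.get(name) if userdata is not None else os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found under the name {name!r}")
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error raised part-way through the generation loop.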


# Import libraries
We first import the OpenAI library, which lets us call OpenAI models through the API for synthetic data generation.

In [None]:
from openai import OpenAI
from pydantic import BaseModel # Used to structure outputs
from google.colab import userdata
import pandas as pd
from tqdm import tqdm
# Load OpenAI API key
api_key = userdata.get('OPENAI_API_KEY')

# Load training data - needed to provide examples to GPT-4o

In [None]:
from google.colab import drive
drive.mount('/content/drive')

train = pd.read_csv('drive/MyDrive/data/train.csv')

train = train[['sentence','label']]

# Extract n random examples to be fed as examples to the LLM for synthetic data generation
examples = train.sample(n=5,random_state=42)
examples

Mounted at /content/drive


Unnamed: 0,sentence,label
610,$BBRY Actually lost .03c per share if U incl V...,1
174,Tesco closes in on new chairman with Dixons Ca...,3
67,Philippines' San Miguel says to partner with K...,3
168,Royal Mail gets mixed bag from Ofcom postal re...,1
275,Legal & General arm buys 50 pct stake in Media...,3


In [None]:
# Format examples into structured sentences for input to LLM
def format_examples(dataframe):

  n_examples = len(dataframe) # How many examples we feed to the LLM
  formatted_examples = str() # Empty string to add formatted examples

  for i in range(0,n_examples):
    example = dataframe.iloc[i]
    sentence = example['sentence']
    label = example['label']

    prompt = f"""
    Example: {sentence}
    Label: {label}
    """

    formatted_examples = formatted_examples + prompt

  return formatted_examples


In [None]:
formatted_examples = format_examples(examples)
print(formatted_examples)


    Example: $BBRY Actually lost .03c per share if U incl VZ as no debt and 3.1 in Cash.
    Label: 1
    
    Example: Tesco closes in on new chairman with Dixons Carphone's John Allan in the frame
    Label: 3
    
    Example: Philippines' San Miguel says to partner with Kirin if it bids for SABMiller's ...
    Label: 3
    
    Example: Royal Mail gets mixed bag from Ofcom postal regulation report
    Label: 1
    
    Example: Legal & General arm buys 50 pct stake in MediaCityUK in Manchester
    Label: 3
    


Now that we have formatted our examples into a single string (stored in `formatted_examples`), we can build the prompt. In the code below, we create the prompt that we'll send to GPT-4o to generate the synthetic data.

In [None]:
# Create prompt for synthetic data generation
total_samples_to_generate = 800
samples_per_batch = 10
number_of_batches = total_samples_to_generate // samples_per_batch  # Integer division, so the result can be used as a loop count

prompt = f"""
You are an expert in labelling sentiment of financial sentences.
Your task is to generate {samples_per_batch} realistic financial sentences that
could have been extracted from a news article or social media page.
Each generated sentence should be about a different company.
The companies should be diverse.
Label each generated sentence with one of three sentiment labels:
1 = negative sentiment.
2 = neutral sentiment.
3 = positive sentiment.
Base your label decision only on the generated sentence and do not use any prior
knowledge about the company in the sentence.
Think carefully about each sentence and label.
Examples: {formatted_examples}
"""


The last thing we need to do is decide on the structure of our output data. We can do that by providing a 'schema', which tells GPT-4o how we want its responses to be formatted. A JSON schema is the easiest to work with, so below we define a custom class `StructuredSchema` that asks for a list of strings representing our sentences and a list of integers representing our sentiment labels.

In [None]:
# Define the output structure: a JSON object with a list of sentences (strings) and a list of the corresponding labels (integers).
class StructuredSchema(BaseModel):
  sentence: list[str]
  label: list[int]
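
The schema fixes the types of the two lists, but it does not require them to have the same length or the labels to lie between 1 and 3, and `zip` stops at the shorter list without warning. As a minimal sketch, a hypothetical helper `batch_to_rows` (not part of the original notebook) could check a batch before it is stored. The sentence in the usage line is made up for illustration.

```python
VALID_LABELS = {1, 2, 3}  # negative, neutral, positive, as defined in the prompt


def batch_to_rows(sentences, labels):
    """Pair sentences with labels, rejecting malformed batches.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if len(sentences) != len(labels):
        raise ValueError(
            f"Got {len(sentences)} sentences but {len(labels)} labels"
        )
    rows = []
    for sentence, label in zip(sentences, labels):
        if label not in VALID_LABELS:
            raise ValueError(f"Unexpected label {label!r} for: {sentence!r}")
        if not sentence.strip():
            raise ValueError("Empty sentence in batch")
        rows.append({"sentence": sentence.strip(), "label": label})
    return rows


# Illustrative input only; real batches come from the model response
rows = batch_to_rows(["Shares in Acme fell after a profit warning."], [1])
print(rows)
```

Whether a bad batch should raise, be skipped, or be regenerated is a design choice; raising is simply the most visible option.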

We now have everything we need to generate our synthetic data.

In [None]:
client = OpenAI(api_key=api_key)

# Empty list to store synthetic data
synthetic_data = []

# For each batch
for _ in tqdm(range(int(number_of_batches)), desc='Generating synthetic data'):

    # Pass prompt to GPT-4o, specifying response schema
    response = client.beta.chat.completions.parse(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': prompt}],
        response_format=StructuredSchema,
        # A temperature of 1 keeps sampling random (0 would be near-deterministic),
        # which improves the diversity of the training samples
        temperature=1,
    )

    # Store the results
    data = response.choices[0].message.parsed

    # Append to synthetic_data list
    for sentence, label in zip(data.sentence, data.label):
        synthetic_data.append({"sentence": sentence, "label": label})

# Convert to dataframe
synthetic_df = pd.DataFrame(synthetic_data)

Generating synthetic data: 100%|██████████| 80/80 [06:13<00:00,  4.67s/it]
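
The loop makes 80 separate API calls over several minutes, so a single transient failure would discard everything generated so far. As a minimal sketch, a hypothetical wrapper `call_with_retries` (not part of the original notebook) retries a call with exponential backoff. Catching every exception is a simplification; in practice you would likely catch only the errors worth retrying.

```python
import time


def call_with_retries(make_request, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call make_request(), retrying with exponential backoff on any exception.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return make_request()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside the loop, the request could then be wrapped as `call_with_retries(lambda: client.beta.chat.completions.parse(...))`, with the same arguments as in the cell above.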


Note: These results will differ each time you run the notebook, because a temperature of 1 makes the generated outputs random. As a result, a newly generated dataset may be more or less realistic than the original one.

In [None]:
# View the first 5 entries
synthetic_df.head()

Unnamed: 0,sentence,label
0,Tesla's quarterly report shows an increase in ...,3
1,PepsiCo is holding a press conference to addre...,2
2,Home Depot's CEO steps down abruptly amidst in...,1
3,Amazon announces a partnership with Rivian to ...,3
4,Boeing faces additional scrutiny over safety i...,1


# Evaluating the quality of synthetic data

For our synthetic data to improve the performance of our Qwen3 model, it needs to resemble the original training data. That's why we gave GPT-4o some training examples.

To evaluate the quality of the generated outputs, there are a few steps we could take:
* Human-in-the-loop validation: Have domain experts review a random subsample of the generated data to check that the sentences and sentiment labels make sense.
* Token distribution: Compare the token distributions of the real and synthetic datasets, for example to check whether the generated sentences are of similar length to the real sentences.
* Diversity across classes: If the training data has a class imbalance, we often want to produce synthetic samples of the minority classes to improve the performance of a classification model.
* Relevance: Are the companies relevant to our use case? For example, if we are training a model to classify the sentiment of U.S. companies, do we want our synthetic data to include international companies?
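
As a minimal sketch of the token distribution check, the hypothetical helper `length_summary` below (not part of the original notebook) summarises sentence lengths. It assumes whitespace-separated words are an acceptable stand-in for tokens; a model tokenizer would give different counts.

```python
from statistics import mean, median


def length_summary(sentences):
    """Summarise sentence lengths, measured in whitespace-separated words.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    lengths = [len(s.split()) for s in sentences]
    return {
        "n": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# Two sentences taken from the training examples shown earlier in this notebook
real = [
    "Royal Mail gets mixed bag from Ofcom postal regulation report",
    "Legal & General arm buys 50 pct stake in MediaCityUK in Manchester",
]
print(length_summary(real))
```

Calling it on `train['sentence']` and on `synthetic_df['sentence']` gives two summaries to compare side by side; a large gap in the mean or maximum would suggest the generated sentences do not look like the real ones.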



To get an idea of how diverse our generated data is, we can look at the number of samples generated for each label. Our original training data had significantly more positive sentiment (3) labels than the other two types.

In [None]:
synthetic_df['label'].value_counts()

label,count
3,390
1,267
2,143


The generated labels are more evenly spread than in our training data, although positive (3) is still the largest class with 390 of the 800 samples. Of course, this assumes that each generated label is correct.
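
Raw counts are hard to compare across datasets of different sizes, so shares are often more useful. As a minimal sketch, the hypothetical helper `label_shares` below (not part of the original notebook) converts labels into proportions. The usage lines rebuild the synthetic label counts reported above; the training data is not reproduced here.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total as a fraction between 0 and 1.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Label counts reported above for the synthetic dataset
synthetic_labels = [3] * 390 + [1] * 267 + [2] * 143
for label, share in label_shares(synthetic_labels).items():
    print(f"label {label}: {share:.1%}")
```

Running the same function on `train['label']` gives the matching shares for the real data.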

In [None]:
# Save the data
synthetic_df.to_csv('drive/MyDrive/data/gpt_synthetic_data.csv')