# 1.2 Synthetic Data Generation
Main logic for my generation:
1. Fix a target sentiment for every prompt, as the baseline for generating both the input (the headline) and the output (the label)
    - Generation mixes zero-shot and few-shot prompting to balance the variety of the prompts
    - The prompt asks for headlines of around 5-30 words, which covers the 25th to 75th percentile, so that the generated data is diverse but not obscure
    - Directly requesting inputs of a given sentiment means the initial generation already comes with a more or less accurate output label
    - I do this because I realise the model is rather weak and needs extensive hand-holding during generation
2. Add a very small set (about 30 samples) of real-life inputs from my peers to make the data more diverse and give it a sense of human authenticity
3. Evaluate the generated data with a heuristic to make sure it meets a certain quality bar
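
One way to arrive at a word-count window like the 5-30 words in step 1 is to look at the quartiles of headline lengths in a reference set. The sketch below is only an illustration: `word_count_quartiles` and `reference_headlines` are hypothetical names, and the sample strings are placeholders rather than the real data.

```python
from statistics import quantiles


def word_count_quartiles(headlines):
    """Return (Q1, Q3) of the whitespace-delimited word counts."""
    counts = sorted(len(h.split()) for h in headlines)
    q1, _median, q3 = quantiles(counts, n=4, method="inclusive")
    return q1, q3


# Placeholder headlines with 2, 4, 6, 8 and 10 words (not real data).
reference_headlines = [
    "Stocks rise",
    "Oil prices fall again",
    "Bank reports record profit this quarter",
    "Tech firm cuts jobs after weak holiday sales",
    "Central bank holds rates steady as inflation cools further still",
]

print(word_count_quartiles(reference_headlines))
```

The returned pair can then be rounded into the lower and upper bounds quoted in the prompt.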

## Main Logic to Generate Data

In [None]:
# Imports
from transformers import AutoTokenizer, AutoModelForCausalLM
import re
import os
import pandas as pd

In [2]:
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B-Instruct")


### Function to generate user inputs

In [None]:
def generate_user_input(shot, sentiment, type_of_input):
    """
    Generate user input for the model based on the type of input and sentiment.
    
    Args:
        shot (int): either 0 or 1, indicating whether to provide an example or not.
        sentiment (str): The sentiment of the headline (neutral, positive, negative).
        type_of_input (str): The type of input to generate (e.g., 'headline').
        
    Returns:
        str: The generated input string.
    """
    output = f"Generate a financial {type_of_input} with {sentiment} sentiment. "
    output += f"The {type_of_input} should be 5-30 words, factual in tone. "
    
    if shot == 0:
        output += f"Example: this is the {type_of_input} you generated"
        return output
    
    examples = {
        "neutral": "Example: A Sexist Joke Cost Ken Fisher $4 Billion in Assets. He Still Runs $121 Billion.\n",
        "positive": "Example: Western Union will be working with MercadoLibre, the South American eCommerce giant, so digital remittances can be sent in Mexico.\n",
        "negative": "Example: The app will be delisted from Apple's App Store on Oct. 5.\n"
    }
    
    output += examples.get(sentiment, "")
    return output

### Function to generate an input for the dataset

In [None]:
def generate_synthetic_input(prompt):
    """
    Generate synthetic data using the provided prompt with the normal model and tokenizer.
    
    Args:
        prompt (list or str): The input prompt for the model.
        
    Returns:
        str: The generated text from the model.
    """
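    inputs = tokenizer.apply_chat_template(
        prompt,
        tokenize=True,
        add_generation_prompt=True,  # end the prompt with the assistant header so the model writes the reply
        return_tensors="pt",
    )

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=200,
        use_cache=True,
        do_sample=True,  # make sampling explicit so temperature and min_p take effect
        temperature=0.9,
        min_p=0.1,
    )
    return tokenizer.batch_decode(outputs)[0]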

### Function to extract the exact headline

In [5]:
# Utility function to clean the generated text
def extract_headline(generated_text):
    """
    Extract a clean headline from the generated text.
    
    Args:
        generated_text (str): Raw generated text from the model
        
    Returns:
        str: Clean headline without formatting
    """
    # Remove all EOT tags (in various forms)
    generated_text = re.sub(r'<\|eot(?:_id)?\|>', '', generated_text)
    
    # Look for content after assistant tag
    if "<|start_header_id|>assistant<|end_header_id|>" in generated_text:
        content = generated_text.split("<|start_header_id|>assistant<|end_header_id|>")[1].strip()
    else:
        content = generated_text
    
    # Find numbered headlines (like "1. Headline text")
    lines = content.split('\n')
    for line in lines:
        # Match lines that start with a list number, like "1. Headline" or "1. **Headline**"
        match = re.match(r'\s*\d+\.\s+(?:\*\*)?([^*\n]+)(?:\*\*)?', line)
        if match:
            headline = match.group(1).strip()
            # Remove any remaining special tokens
            headline = re.sub(r'<\|[^|]+\|>', '', headline)
            return headline
    
    # If no numbered headlines found, just return first non-empty line
    for line in lines:
        if line.strip() and not line.startswith("<|") and not line.startswith("Here are"):
            # Remove any remaining special tokens
            clean_line = re.sub(r'<\|[^|]+\|>', '', line.strip())
            return clean_line
    
    # Clean the content as a last resort
    clean_content = re.sub(r'<\|[^|]+\|>', '', content.strip())
    return clean_content
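
To see the two core steps of `extract_headline` in isolation (keep the text after the assistant header, then strip the remaining special tokens), here is a reduced sketch. `assistant_reply` is a hypothetical helper and the decoded string is made up; unlike the function above, it uses `str.partition` and skips the numbered-list handling.

```python
import re

ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
SPECIAL_TOKEN = re.compile(r"<\|[^|]+\|>")


def assistant_reply(decoded):
    """Return the text after the assistant header with special tokens removed."""
    _, found, tail = decoded.partition(ASSISTANT_HEADER)
    body = tail if found else decoded  # fall back to the whole string
    return SPECIAL_TOKEN.sub("", body).strip()


# Made-up decoded output in the Llama 3 chat layout.
raw = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Generate a headline<|eot_id|>"
    + ASSISTANT_HEADER
    + "\n\nFed Holds Rates Steady<|eot_id|>"
)
print(assistant_reply(raw))
```

Splitting on the header first matters: stripping the tokens alone would leave the role names (`user`, `assistant`) and the prompt text in the result.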

### Main Generation

In [None]:
sentiments = ["neutral", "positive", "negative"]
type_of_shot = [0, 1]

synthetic_data = []
n = 0  # Counter for the number of generated headlines

for sentiment in sentiments:
    for shot in type_of_shot:
        messages = [
            {"role": "system", "content": "You are a market news generator. Generate ONLY the headline text. Do NOT include any introductions, explanations, or formatting instructions. Do NOT write phrases like 'Here's a headline' or 'Market headline:'. Just output the clean headline text directly."},
            {"role": "user", "content": generate_user_input(shot, sentiment, "headline")},
        ]

        i = 0
        
        while i < 79:
            if n > 470:
                break
            print("generating ", n)
            generated_text = generate_synthetic_input(messages)
        
            # Extract just the headline
            headline = extract_headline(generated_text)
            
            synthetic_data.append({
                    "input": headline,
                    "output": sentiment,
                    "instruction": "Base on the sentiment of the headline, classify it as neutral, positive, or negative."
                })
            print(headline)
            i += 1
            n += 1

print("Done generating lol")

generating  0
U.S. GDP Growth Slows to 2.3% in Q2, Reaching 61.4 Trillion Dollar Mark
generating  1
US Job Market Grows, Consumer Prices Rise Amidst Slow Economic Growth
Done generating lol
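
The two counters in the loop above interact: `i` limits each sentiment/shot combination to 79 headlines, while `n` caps the overall run. The sketch below replays only the counting logic, without calling the model, to show how many headlines each combination ends up with. `planned_counts` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def planned_counts(sentiments, shots, per_combo=79, cap=470):
    """Mirror the loop's counters: stop generating once n exceeds `cap`."""
    counts = Counter()
    n = 0
    for sentiment in sentiments:
        for shot in shots:
            i = 0
            while i < per_combo:
                if n > cap:
                    break
                counts[(sentiment, shot)] += 1
                i += 1
                n += 1
    return counts


counts = planned_counts(["neutral", "positive", "negative"], [0, 1])
print(sum(counts.values()))
for combo, count in counts.items():
    print(combo, count)
```

With these settings the run produces 471 headlines in total, and the last combination (negative, one-shot) is cut short at 76 instead of 79, so the negative class ends up slightly smaller than the other two before the real samples are added.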


## Validation of Data

### Quality check
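1. Length (5-30 words)
2. Grammar and spelling (at most 2 issues flagged by LanguageTool)
3. No URLs or stray special tokens
4. Fluency of the headline (perplexity of at most 100)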

In [None]:
import torch
import language_tool_python
from transformers import AutoTokenizer, AutoModelForCausalLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Grammar and spelling checker
tool = language_tool_python.LanguageTool("en-US")

# Scoring model used to compute perplexity
tok_pp = AutoTokenizer.from_pretrained("unsloth/llama-3.2-1b-instruct")
# Loaded without device_map so that .to(device) below is the only placement step
model_pp = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3.2-1b-instruct",
    torch_dtype=torch.float16,
).to(device).eval()

def perplexity(sentence):
    enc = tok_pp(sentence, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model_pp(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def is_high_quality(headline):
    # length check
    w = headline.split()
    if len(w) < 5 or len(w) > 30:
        return False
    # no URLs or stray tokens
    if re.search(r"http[s]?://|\<\|[^\|]+\|\>", headline):
        return False
    # grammar/spelling check
    errs = tool.check(headline)
    if len(errs) > 2:
        return False
    # fluency via perplexity
    ppl = perplexity(headline)
    if ppl > 100:
        return False
    return True
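
The `perplexity` helper above exponentiates the model's mean token-level cross-entropy loss. To give a feel for what the threshold of 100 means, here is a small sketch that computes the same quantity from hand-picked token probabilities. `perplexity_from_probs` is a hypothetical helper and the probabilities are made up, not model outputs.

```python
import math


def perplexity_from_probs(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Every token predicted with probability 0.5.
print(perplexity_from_probs([0.5, 0.5, 0.5, 0.5]))
# Tokens predicted with roughly 1% probability on average.
print(perplexity_from_probs([0.01, 0.02, 0.005]))
```

The first call gives about 2 and the second about 100, so a headline is rejected once the scoring model assigns its tokens a geometric-mean probability below roughly 1%.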


In [23]:
from tqdm.auto import tqdm
import time
tqdm.pandas()

csv_path = "../data/synthetic_data_generate.csv"
df = pd.read_csv(csv_path)

results = []
for idx, text in tqdm(df["input"].items(), total=len(df), desc="quality"):
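    t0 = time.time()
    ok = is_high_quality(text)
    results.append(ok)
    # Per-row timing, matching the log shown in the cell output
    print(f"→ row {idx} done in {time.time() - t0:.2f}s")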

df["is_high_quality"] = results

total  = len(df)
passed = df["is_high_quality"].sum()
print(f"Total examples:     {total}")
print(f"High-quality pass:  {passed} ({passed/total:.1%})")


quality:   0%|          | 0/500 [00:00<?, ?it/s]

→ row 0 done in 4.40s
→ row 1 done in 0.33s
→ row 2 done in 0.38s
→ row 3 done in 0.43s
→ row 4 done in 0.30s
→ row 5 done in 0.32s
→ row 6 done in 0.26s
→ row 7 done in 0.23s
→ row 8 done in 0.27s
→ row 9 done in 0.19s
→ row 10 done in 0.38s
→ row 11 done in 0.24s
→ row 12 done in 0.47s
→ row 13 done in 0.29s
→ row 14 done in 0.29s
→ row 15 done in 0.20s
→ row 16 done in 0.28s
→ row 17 done in 0.31s
→ row 18 done in 0.32s
→ row 19 done in 0.29s
→ row 20 done in 0.34s
→ row 21 done in 0.33s
→ row 22 done in 0.30s
→ row 23 done in 0.28s
→ row 24 done in 0.27s
→ row 25 done in 0.30s
→ row 26 done in 0.23s
→ row 27 done in 0.35s
→ row 28 done in 0.33s
→ row 29 done in 0.31s
→ row 30 done in 0.28s
→ row 31 done in 0.15s
→ row 32 done in 0.23s
→ row 33 done in 0.37s
→ row 34 done in 0.20s
→ row 35 done in 0.30s
→ row 36 done in 0.27s
→ row 37 done in 0.25s
→ row 38 done in 0.27s
→ row 39 done in 0.24s
→ row 40 done in 0.27s
→ row 41 done in 0.23s
→ row 42 done in 0.20s
→ row 43 done in 0.27

### Double check the sentiment
Given the limitations of the model itself, I don't trust it much, so a reasonably good accuracy is enough for me to pass the sentiment accuracy check.

In [3]:
tokenizer = AutoTokenizer.from_pretrained("unsloth/llama-3.2-1B-Instruct")
model = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B-Instruct", device_map="auto", torch_dtype=torch.float16).eval()

In [None]:
def classify_headline(headline: str) -> str:
    """
    Use the LLM to classify headline sentiment with few-shot examples.
    Returns one of: 'negative', 'neutral', 'positive'
    """
    # Create prompt with examples to guide the model
    prompt = f"""Classify the sentiment of financial headlines as negative, neutral, or positive.

Examples:
- "Stock Market Crashes 5% on Inflation Fears" -> negative
- "Company XYZ Reports Q3 Earnings As Expected" -> neutral 
- "Tech Giant's New Product Exceeds Sales Projections" -> positive
- "Federal Reserve Announces Interest Rate Decision" -> neutral
- "Startup Raises $50M in Latest Funding Round" -> positive
- "Major Bank Cuts 5000 Jobs Amid Restructuring" -> negative

Headline: "{headline}"
Sentiment:"""

    # Get model's response
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=5,
            temperature=0.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip().lower()
    
    # Extract sentiment with fallback
    if "negative" in response:
        return "negative"
    elif "positive" in response:
        return "positive"
    elif "neutral" in response:
        return "neutral"
    # Fallback (an assumption): treat an unrecognised response as neutral
    return "neutral"
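
The label extraction at the end of `classify_headline` can be tried on its own, without loading the model. The sketch below is a variant rather than the exact logic above: `parse_sentiment` is a hypothetical helper that returns the first label mentioned in the response instead of checking the labels in a fixed order, and it makes the no-match case explicit through `default`.

```python
import re

LABEL_PATTERN = re.compile(r"\b(negative|neutral|positive)\b")


def parse_sentiment(response, default=None):
    """Return the first sentiment label found in `response`, else `default`."""
    match = LABEL_PATTERN.search(response.lower())
    return match.group(1) if match else default


# Made-up continuations of the kind a short generation might produce.
print(parse_sentiment(' Positive\n- "Oil'))
print(parse_sentiment("unclear"))
print(parse_sentiment("unclear", default="neutral"))
```

Keeping `default=None` makes unparsed responses easy to count before deciding how to treat them.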

In [6]:
import pandas as pd
csv_path = "../data/synthetic_data_generate.csv"
df = pd.read_csv(csv_path)

In [9]:
from sklearn.metrics import accuracy_score, classification_report
from tqdm.auto import tqdm
preds = []
for text in tqdm(df["input"], desc="classifying"):
    pred = classify_headline(text)
    print(f"{text} → {pred}")
    preds.append(pred)
df["predicted"] = preds

# 4) evaluate


classifying:   0%|          | 0/500 [00:00<?, ?it/s]

US Stock Market Gains Ground, but Slows Down as Investors Remain Cautious Ahead of Inflation Concerns and Economic Outlook → neutral
Global economic growth slows, inflation rate reaches 6.3%, leading to concerns about the sustainability of current monetary policy settings. → negative
"Global stock market closes slightly higher on Wednesday, with S&P 500 up 0.2% as investors focus on earnings reports from major tech companies." → neutral
US Stocks Rise, but Global Markets Stagnate; European Shares Slightly Higher as Economic Growth Concerns Remain a Hurdle → neutral
Stocks Rise Amid Economic Growth, but Trade Tensions and Inflation Concerns Weigh on Markets → neutral
US Stocks Rise, but Economic Growth Slows as Global Markets Continue to Struggle with Rising Inflation and Trade Tensions → neutral
U.S. Stock Market Gains Steady Momentum, but Slow Recovery Continues Amid Ongoing Inflation Concerns → neutral
Market Trends: Global Stocks Rise on Strong Economic Data, Boosting Consumer Confi

In [10]:
acc = accuracy_score(df["output"], df["predicted"])
print(f"Classification accuracy: {acc:.1%}\n")
print(classification_report(df["output"], df["predicted"],
                            labels=["negative","neutral","positive"]))

Classification accuracy: 82.0%

              precision    recall  f1-score   support

    negative       0.79      0.98      0.88       164
     neutral       0.80      0.68      0.73       172
    positive       0.88      0.80      0.84       164

    accuracy                           0.82       500
   macro avg       0.82      0.82      0.82       500
weighted avg       0.82      0.82      0.82       500
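
As a sanity check on the report above, the F1 column can be recomputed from the precision and recall columns as their harmonic mean. This is a small sketch using the rounded values printed in the table; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the report above.
reported = {
    "negative": (0.79, 0.98, 0.88),
    "neutral": (0.80, 0.68, 0.73),
    "positive": (0.88, 0.80, 0.84),
}

for label, (p, r, f1) in reported.items():
    print(f"{label}: recomputed {f1_from(p, r):.3f}, reported {f1:.2f}")
```

Because the inputs are already rounded to two decimals, the recomputed values agree with the reported ones to within about 0.01 rather than exactly.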



### Saving the final data

In [None]:
csv_path = "../data/synthetic_data_generate.csv"
# Create the directory the CSV is actually written to
os.makedirs(os.path.dirname(csv_path), exist_ok=True)

df = pd.DataFrame(synthetic_data)

# Check if file exists and append if it does
if os.path.exists(csv_path):
    # Load existing data
    existing_df = pd.read_csv(csv_path)
    
    # Combine existing and new data
    combined_df = pd.concat([existing_df, df], ignore_index=True)
    
    # Save the combined data
    combined_df.to_csv(csv_path, index=False)
    print(f"Appended {len(df)} new entries to existing data. Total entries: {len(combined_df)}")
else:
    # If file doesn't exist yet, just save the new data
    df.to_csv(csv_path, index=False)
    print(f"Created new file with {len(df)} entries")

## Discussion Questions
1. I chose the instruct model because I tested both models: the non-instruct one produced very unstructured output that was hard to clean up and format into what I wanted.

2. I stated exactly what I wanted in the prompt, added restrictions, and gave the model a template to follow, even for my zero-shot prompting.