# Generating AG News data

Results for *"Qwen/Qwen2.5-0.5B-Instruct"* and *"deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"*:
- The task seems easier than irony detection: the generated <text, label> pairs are more coherent.
- DeepSeek performed better than Qwen: it adhered more closely to the requested number of examples and to the specified labels.

In [1]:
import os

# Move up one directory
if os.path.basename(os.getcwd()) == "mycode":
    os.chdir("..")
    
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
import re
from mycode.utilities import log_synthetic_data, response2json, get_response, set_seed

set_seed(42)

# file where the logs will be saved
log_file_path = "synthetic_data/agnews_log.json"
if os.path.exists(log_file_path):
    os.remove(log_file_path)

folder_name = "synthetic_data"
os.makedirs(folder_name, exist_ok=True) # Create folder if it doesn't exist

df = pd.read_csv("LREC-COLING/train/agnewstrainAll.csv")
df = df.rename(columns={"2": "text", "3": "label"})
display(df.head())

labels_lst = df['label'].unique()
labels_str = ", ".join(labels_lst)
print(f"Labels: {labels_str}")

# Print the first example for each label
examples = []
for label in labels_lst:
    example = (df[df['label'] == label].iloc[0]).loc['text']
    examples.append(example)
    print(f"Label: {label}\nText: {example}\n")

| Unnamed: 0 | 0 | 1 | text | label |
|---|---|---|---|---|
| 0 | 1000_train | Hartford executive #39;s stock sale probed | New York AG is investigating the timing of a s... | Business |
| 1 | 1001_train | Cool Batman Photo! | In related news, it was announced yesterday th... | Sci/Tech |
| 2 | 1002_train | Stocky Monkey in Himalayas Becomes Newest Prim... | Scientists from India working in the Himalayas... | Sci/Tech |
| 3 | 1003_train | Bush, Kerry Don''t Worry About Tech | Election Day is less than two weeks away, but ... | Sci/Tech |
| 4 | 1004_train | The Teeming Crowd in Video Games | Jeffrey Griffiths, the president and chief exe... | Sci/Tech |


Labels: Business, Sci/Tech, Sports, World
Label: Business
Text: New York AG is investigating the timing of a sale by Thomas Marra; company reports higher 3Q. NEW YORK (Reuters) - The Hartford Financial Services Group Inc.

Label: Sci/Tech
Text: In related news, it was announced yesterday that Warner Bros. Interactive Entertainment, DC Comics and Electronic Arts will bring a Batman Begins videogame tie-in.

Label: Sports
Text: Lindsay Davenport #39;s world number one ranking is in doubt after she failed to make the finals of the WTA Tour Championships in Los Angeles.

Label: World
Text:  JERUSALEM (Reuters) - Israeli Prime Minister Ariel Sharon  accused far-rightists Sunday of trying to incite civil war over  his plan to withdraw from the occupied Gaza Strip and called  for measures to curb such groups.
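
The per-label examples printed above are later interpolated by hand into a JSON template inside the prompt. If an example text contains double quotes or backslashes, the hand-written template would no longer be valid JSON. A minimal sketch of one way to avoid this, assuming a hypothetical helper `build_few_shot_block` (not part of `mycode.utilities`) and made-up sample data:

```python
import json


def build_few_shot_block(texts, labels):
    """Serialize (text, label) pairs as a JSON list, escaping quotes and backslashes."""
    pairs = [{"text": text.strip(), "label": label} for text, label in zip(texts, labels)]
    return json.dumps(pairs, indent=4, ensure_ascii=False)


# Hypothetical example whose text contains double quotes
texts = ['The CEO called the deal "transformative".']
labels = ["Business"]
block = build_few_shot_block(texts, labels)
print(block)
```

The resulting string can be placed in an f-string through a single `{block}` placeholder; substituted values are not re-parsed, so the braces inside it do not need doubling.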



## Qwen/Qwen2.5-0.5B-Instruct
### Baseline

In [2]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cuda",
    attn_implementation='flash_attention_2',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# for qwen we use this system prompt:
system_prompt_qwen = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

In [3]:
labels_str_bullet = "\n".join([f"- **{name}**" for name in labels_lst])
prompt = f"""\
You are an expert in journalism and NLP specializing in **news classification**. \
Your task is to generate **10 high-quality short documents**, distributed across the following four News categories:  
{labels_str_bullet}

### **Guidelines:**  
- Ensure **each category has 2-3 representative examples**.  
- The text should be **realistic and concise**, resembling real-world news snippets.  
- Cover **diverse topics within each category**.

### **Output Format (JSON)**  
Return **only** a valid JSON list in the following structure:

```json
[
    {{"text": "{examples[0]}", "label": "{labels_lst[0]}"}},
    {{"text": "{examples[1]}", "label": "{labels_lst[1]}"}},
    {{"text": "{examples[2]}", "label": "{labels_lst[2]}"}},
    {{"text": "{examples[3]}", "label": "{labels_lst[3]}"}},
    ...
]
```\
"""

response, delta_t = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)

TIME TAKEN: 3.42
GENERATED RESPONSE:
```json
[
    {
        "text": "New York AG is investigating the timing of a sale by Thomas Marra; company reports higher 3Q.",
        "label": "Business"
    },
    {
        "text": "In related news, it was announced yesterday that Warner Bros. Interactive Entertainment, DC Comics and Electronic Arts will bring a Batman Begins videogame tie-in.",
        "label": "Sci/Tech"
    },
    {
        "text": "Lindsay Davenport's world number one ranking is in doubt after she failed to make the finals of the WTA Tour Championships in Los Angeles.",
        "label": "Sports"
    },
    {
        "text": "Israeli Prime Minister Ariel Sharon accused far-rightists Sunday of trying to incite civil war over his plan to withdraw from the occupied Gaza Strip and called for measures to curb such groups.",
        "label": "World"
    }
]
```

This JSON output adheres to the guidelines provided, with each category having 2-3 representative examples, covering div

The answers are too similar to the examples given in the prompt: the model returns only four items, each a near-verbatim copy of an example.
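
To put a number on how close the generated texts are to the prompt examples, one option is a character-level similarity ratio from the standard library. This is only a sketch: the function names, the 0.8 threshold, and the sample strings are assumptions chosen for illustration, not part of the notebook's utilities.

```python
from difflib import SequenceMatcher


def max_similarity(text, references):
    """Return the highest similarity ratio (0..1) between text and any reference."""
    return max(SequenceMatcher(None, text.lower(), ref.lower()).ratio() for ref in references)


def flag_near_copies(generated, references, threshold=0.8):
    """Return the generated texts whose similarity to some reference reaches the threshold."""
    return [text for text in generated if max_similarity(text, references) >= threshold]


# Hypothetical data for illustration only
prompt_examples = ["Lindsay Davenport's ranking is in doubt after she failed to make the finals."]
generated_texts = [
    "Lindsay Davenport's ranking is in doubt after she failed to make the finals.",
    "Central bank raises interest rates for the third time this year.",
]
near_copies = flag_near_copies(generated_texts, prompt_examples)
print(near_copies)
```

Items flagged this way could be dropped before logging, or used as a signal that the prompt leaks too much of its examples.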

In [4]:
prompt = f"""\
You are an expert in journalism and NLP specializing in **news classification**. \
Your task is to generate 10 high-quality short documents. \
The category of the news is the label of interest. The distribution of the labels should be balanced. \
The labels are the following:  
{labels_str_bullet}

### Output Format (JSON)  
Return only a valid JSON list of 10 items in the following structure:

```json
[
    {{"text": <text>, "label": <label>}},
    ...
]
```\
"""

response, delta_t = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)
synthetic_data = response2json(response)
log_synthetic_data(model_name, "baseline", prompt, synthetic_data, delta_t, output_file=log_file_path)


TIME TAKEN: 3.59
GENERATED RESPONSE:
```json
[
    {
        "text": "Apple announces new iPhone models, including the latest model with a 5G capability.",
        "label": "Tech"
    },
    {
        "text": "Google launches its Pixel 6 series, offering advanced camera technology.",
        "label": "Tech"
    },
    {
        "text": "Amazon unveils its new Echo Dot smart speaker.",
        "label": "Tech"
    },
    {
        "text": "Tesla announces plans to expand its Gigafactory in China.",
        "label": "Tech"
    },
    {
        "text": "Microsoft introduces new Azure services for businesses.",
        "label": "Tech"
    },
    {
        "text": "Netflix releases a new streaming service, Netflix Originals.",
        "label": "Tech"
    },
    {
        "text": "Unicef launches its new campaign 'Safe Water, Safe Home' to promote access to clean water and sanitation.",
        "label": "News"
    },
    {
        "text": "Chinese authorities crack down on illegal wildlife tr

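The labels are not coherent: the model uses labels such as "Tech" and "News" that are not among the specified categories, and the first six items all share the same label.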
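
A check like this can be automated on the parsed output. The sketch below assumes the synthetic data is a list of `{"text", "label"}` dictionaries; `check_synthetic_data` is a hypothetical helper, and the sample items are invented, loosely modeled on the response above.

```python
from collections import Counter


def check_synthetic_data(items, allowed_labels, expected_n):
    """Summarize how well a list of {"text", "label"} dicts follows the prompt constraints."""
    allowed = set(allowed_labels)
    labels = [item.get("label") for item in items]
    # Labels the model invented, sorted for a stable report
    invalid = sorted({label for label in labels if label not in allowed}, key=str)
    return {
        "n_items": len(items),
        "count_ok": len(items) == expected_n,
        "invalid_labels": invalid,
        "label_counts": dict(Counter(label for label in labels if label in allowed)),
    }


allowed_labels = ["Business", "Sci/Tech", "Sports", "World"]
# Hypothetical items for illustration only
sample = [
    {"text": "Apple announces new iPhone models.", "label": "Tech"},
    {"text": "Unicef launches a clean-water campaign.", "label": "News"},
    {"text": "Local team wins the national final.", "label": "Sports"},
]
report = check_synthetic_data(sample, allowed_labels, expected_n=10)
print(report)
```

The `label_counts` entry also gives a quick view of how balanced the valid labels are.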

### Targeted synthetic data

In [5]:
prompt = f"""\
You are an expert News reporter and NLP enthusiast. \
List all key linguistic, semantic, and contextual phenomena that must be covered in News documents.
Return only a list in a simple bullet-point format.\
"""

news_aspects, delta_t_0 = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)

TIME TAKEN: 2.97
GENERATED RESPONSE:
- Language: The choice of words, sentence structure, and the use of vocabulary to convey information.
- Semantics: Understanding the meaning behind the language used, including tone, idioms, metaphors, and other figurative expressions.
- Contextual phenomena: Analyzing how news articles relate to each other (chronology, causality, contrast) and the broader social or political context they operate within.
- Grammar and syntax: Verbs, nouns, adjectives, prepositions, conjunctions, and punctuation marks that guide the reader's understanding of the text.
- Categorization: Grouping related news items together for easier reading or searching.
- Word choice: Choosing appropriate words and phrases to convey specific ideas and emotions.
- Tone and style: The overall impression the writer wants to give about their subject matter.
- Logical reasoning: Deciding what information is important and which ones can be ignored based on the content.
- Relevance: Assess

In [6]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents that talk about the following four News categories (labels):  
{labels_str_bullet}

### **Guidelines:**
{news_aspects}

### **Output Format (JSON)**
The labels must be one of the specified categories. Return only a valid JSON list in the following structure:

```json
[
    {{"text": <text>, "label": <label>}},
    ...
]
```
"""

response, delta_t = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)
delta_t += delta_t_0 # Add the time taken for the previous response

# add to log
synthetic_data = response2json(response)
log_synthetic_data(model_name, "targeted", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 3.54
GENERATED RESPONSE:
```json
[
    {
        "text": "In today’s digital age, businesses have become more reliant on technology than ever before. From online marketplaces to cloud-based services, the rise of technology has transformed the way companies operate. However, it’s crucial for businesses to stay ahead of the curve and invest in sustainable practices to ensure long-term success.",
        "label": "Business"
    },
    {
        "text": "Scientists from various fields are constantly pushing the boundaries of science. From discovering new elements to developing revolutionary technologies, there’s never been a better time for innovation. But it’s essential to approach scientific research with caution and ethical considerations.",
        "label": "Science"
    },
    {
        "text": "Sports fans everywhere are eagerly anticipating the results of the upcoming Olympic Games. Whether it’s soccer, basketball, or track and field, the excitement is palpable as athlet

---

## deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

In [7]:
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation='flash_attention_2',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.generation_config.pad_token_id = tokenizer.pad_token_id

In [8]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, distributed across the following four News categories:  
{labels_str_bullet}

### **Guidelines:**  
- Ensure each category has 2-3 representative examples.  
- The text should be realistic and concise, resembling real-world news snippets.  
- Cover diverse topics within each category.

### **Output Format (JSON)**  
Return **only** a valid JSON list in the following structure:

```json
[
    {{"text": "{examples[0]}", "label": "{labels_lst[0]}"}},
    {{"text": "{examples[1]}", "label": "{labels_lst[1]}"}},
    {{"text": "{examples[2]}", "label": "{labels_lst[2]}"}},
    {{"text": "{examples[3]}", "label": "{labels_lst[3]}"}},
    ...
]
```
"""

generated, delta_t = get_response(prompt, model, tokenizer, max_new_tokens=2048) # increase max_new_tokens to 2048 for deepseek

TIME TAKEN: 17.83
GENERATED RESPONSE:
Okay, so I need to generate 10 short documents across four news categories: Business, Sci/Tech, Sports, and World. Each category should have 2-3 examples. The output needs to be a JSON list with each item having "text" and "label" fields.

First, I'll start with Business. I need something realistic and concise. Maybe a company investigation. The example given was about New York AG investigating Thomas Marra's sale and reporting a higher Q3. That's good, it's a business-related news snippet.

Next, Sci/Tech. The example was about a video game tie-in involving Batman Begins. That makes sense as a sci-fi tech topic. I can think of other tech companies or advancements. Maybe something about a new AI system? Like a company announcing a breakthrough in AI. That fits.

For Sports, the example was about Lindsay Davenport's ranking uncertainty. That's a sports-related topic, so it's appropriate. I can think of another sports event, maybe a player's injury u

In [9]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents that talk about the following four News categories:  
{labels_str_bullet}

### **Output Format (JSON)**  
Return only a valid JSON list of 10 items in the following structure:

```json
[
    {{"text": <text>, "label": <label>}},
    ...
]
```
"""

response, delta_t = get_response(prompt, model, tokenizer, max_new_tokens=2048)
synthetic_data = response2json(response)
log_synthetic_data(model_name, "baseline", prompt, synthetic_data, delta_t, output_file=log_file_path)


TIME TAKEN: 19.39
GENERATED RESPONSE:
Okay, so the user is asking me to generate 10 short documents about four news categories: Business, Sci/Tech, Sports, and World. I need to make sure each document is high-quality and relevant.

First, I should think about what each category includes. Business is about companies, economics, and management. Sci/Tech covers scientific discoveries, tech innovations, and research. Sports is about team performances, events, and players. World includes global events, culture, and international news.

I need to come up with a variety of topics for each category. For Business, maybe something like the rise of a new tech company, the effects of inflation on the economy, or the impact of a major economic event.

For Sci/Tech, I can think about upcoming breakthroughs, the ethical implications of AI, or the impact of climate change research.

Sports could include a team's recent win, a player's performance analysis, or a game-winning strategy.

World might cove
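
The DeepSeek responses open with free-form reasoning before the final answer, so the JSON list has to be separated from that preamble before parsing. How `response2json` handles this is not shown here; the following is a best-effort sketch with a hypothetical `extract_json_list`, assuming that any reasoning ends with a closing `</think>` tag and that the answer contains a single JSON list.

```python
import json


def extract_json_list(response):
    """Best-effort extraction of a JSON list from a model response.

    Assumption: if the model emits reasoning, it ends with a closing </think> tag.
    Returns the parsed list, or None when no valid JSON list is found.
    """
    # Keep only what follows the last </think> tag, if there is one
    answer = response.rsplit("</think>", 1)[-1]
    # Take the span from the first "[" to the last "]"
    start, end = answer.find("["), answer.rfind("]")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(answer[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, list) else None


# Hypothetical response for illustration only
response_text = (
    "Okay, I need a list [of items].\n</think>\n"
    'Here is the list:\n[{"text": "Oil prices rise.", "label": "Business"}]'
)
items = extract_json_list(response_text)
print(items)
```

A response that is cut off by `max_new_tokens` before the closing bracket yields `None` rather than a partial list, which makes truncation easy to detect.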

### Targeted synthetic data

In [10]:
prompt = f"""\
You are an expert News reporter and NLP enthusiast. \
List all key topics that must be covered in News documents. \
In particular consider the following categories: {labels_str}
Return only a short list in a simple bullet-point format.\
"""

news_aspects, delta_t_0 = get_response(prompt, model, tokenizer, max_new_tokens=2048)

TIME TAKEN: 12.04
GENERATED RESPONSE:
Okay, so I need to come up with a list of key topics to cover in a News Document, considering Business, Sci/Tech, and Sports categories. Hmm, let me think about each category and what they typically cover.

Starting with Business. Well, business news is pretty broad, but I can think of a few main areas. Maybe the economy? Like, what's happening in the global markets, maybe some GDP numbers or unemployment rates. Also, tech companies could be a big topic, like earnings reports or product launches. Trade balances are important too, as they affect global trade policies. Also, consumer trends like e-commerce growth or subscription services. Government policies are always relevant, so that's another point. International business deals or trade agreements could be key areas too.

Moving on to Sci/Tech. Scientific advancements are always important, so things like breakthroughs in healthcare, maybe climate change or AI. Technology innovations, like new sof

In [11]:
# Extract bullet points
news_aspects = re.findall(r"- (.+)", news_aspects)
news_aspects = {b.strip() for b in news_aspects}  # Strip surrounding whitespace and drop duplicates (set order is arbitrary)

# Convert to a comma-separated string
news_aspects = ", ".join(news_aspects)
print(f"News aspects: {news_aspects}")

News aspects: Global economy trends, World Cup highlights, Sustainability impact, Sports betting, Trade agreements, Trade balances, Athlete journeys, Sports statistics, Renewable energy, Scientific advancements, Consumer trends, Sponsorships, AI and machine learning, Social media engagement, International business deals, Technology innovations, Sports impact on society, Team performances, Electric vehicles, Player news, Government policies, Tech company earnings, World Series results
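
Because a `set` has no defined order, the topics above come out shuffled and may differ between runs. The pattern `- (.+)` also matches a dash followed by a space anywhere in a line, including inside the model's reasoning. A possible alternative, sketched with a hypothetical `extract_bullets` helper and invented sample text, anchors the dash at the start of a line and keeps first-seen order:

```python
import re


def extract_bullets(text):
    """Return items from lines starting with '- ', deduplicated in first-seen order."""
    # ^[ \t]*- anchors the dash at the start of a line, so dashes inside sentences are ignored
    items = re.findall(r"^[ \t]*-[ \t]+(.+)$", text, flags=re.MULTILINE)
    # dict.fromkeys preserves insertion order (Python 3.7+), unlike set()
    return list(dict.fromkeys(item.strip() for item in items))


# Hypothetical model output for illustration only
sample_text = "Reasoning - not a bullet\n- Trade agreements\n- Player news\n- Trade agreements \n"
aspects = extract_bullets(sample_text)
print(", ".join(aspects))
```

Keeping a stable order makes the resulting prompt reproducible under a fixed seed.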


In [12]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents that talk about the following four News categories (labels):  
{labels_str_bullet}

### **Guidelines:**
{news_aspects}

### **Output Format (JSON)**
Return only a valid JSON list of 10 items in the following structure:

```json
[
    {{"text": <text>, "label": <label>}},
    ...
]
```
"""

response, delta_t = get_response(prompt, model, tokenizer, max_new_tokens=2048)
delta_t += delta_t_0 # Add the time taken for the previous response

# add to log
synthetic_data = response2json(response)
log_synthetic_data(model_name, "targeted", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 14.9
GENERATED RESPONSE:
Alright, I need to generate 10 short news documents for four categories: Business, Sci/Tech, Sports, and World. Let me start by understanding each category and what kind of news would fit into them.

First, Business. I should think about topics related to business, like earnings reports, market trends, company news, or financial news. Maybe something about a company launch, a financial report, or a merger and acquisition. I need to make sure each document is concise and relevant.

Next, Sci/Tech. This category should cover scientific advancements, tech innovations, and possibly AI. Topics could include breakthroughs in AI, new software, quantum computing, or advancements in biotechnology. I should focus on cutting-edge research or news about tech developments.

Sports is next. This category should cover sports news, including team performances, player news, tournaments, and events. Examples could be a team winning a championship, a player making a b