# Generating AG News data

**Table of contents**:
- Qwen/Qwen2.5-0.5B-Instruct
    1. Baseline
    2. Targeted
    3. Targeted + Linguistic tags
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
    1. Baseline
    2. Targeted
    3. Targeted + Linguistic tags

**Results**:
- The task seems easier than irony detection: the generated <text, label> pairs are more coherent.
- DeepSeek performed better than Qwen. It followed the constraints on the number of examples to output and on the allowed labels more closely.
- Even when linguistic tags are requested, the model gives reasonable results and follows the output structure well.
- The generated texts cover less specific topics than those produced by ChatGPT.
- Some generated samples fall between two labels:
    - "The World is set to host the next FIFA World Cup..." World & Sport    
    - "A major tech company has announced a new venture that aims to revolutionize the business of artificial intelligence..." Business & Sci/Tech
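
The compliance points listed above (number of examples returned, labels restricted to the allowed set) can be checked mechanically. The following is a minimal sketch, not part of the original notebook: `check_batch` and `ALLOWED_LABELS` are hypothetical names, and the toy batch is made up for illustration.

```python
from collections import Counter

# The four labels printed from the training data in this notebook
ALLOWED_LABELS = {"Business", "Sci/Tech", "Sports", "World"}

def check_batch(items, allowed=ALLOWED_LABELS, expected_n=10):
    """Summarise how closely a parsed batch follows the prompt constraints."""
    labels = [item.get("label") for item in items]
    return {
        "n_items": len(items),
        "count_ok": len(items) == expected_n,
        # Labels the model invented, e.g. "Tech" instead of "Sci/Tech"
        "invalid_labels": sorted({str(l) for l in labels if l not in allowed}),
        "label_counts": dict(Counter(labels)),
    }

# Toy batch (hypothetical): the second label is outside the allowed set
toy = [
    {"text": "Stocks rallied on Monday.", "label": "Business"},
    {"text": "A new phone was announced.", "label": "Tech"},
]
report = check_batch(toy, expected_n=2)
print(report)
```

Running such a check on every logged batch would make the Qwen versus DeepSeek comparison quantitative rather than impressionistic.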

In [None]:
import os

# Change the working directory to the repository root
current_dir = os.path.basename(os.getcwd())
if current_dir == "src":
    os.chdir("..")  # move up one level
elif current_dir == "bai-thesis-nlp":
    pass  # already at the root, stay there
else:
    os.chdir("../..")  # otherwise assume we are two levels below the root
    
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
import re
from src._utils._helpers import log_synthetic_data, response2json, get_response, set_seed, clear_cuda_cache

In [20]:
# Create the folder to save the synthetic data
folder_name = "synthetic_data/logs"
os.makedirs(folder_name, exist_ok=True)

# file where the logs will be saved
log_file_path = folder_name + "/agnews_log.json"
RECREATE_LOG = False
if os.path.exists(log_file_path) and RECREATE_LOG:
    os.remove(log_file_path) # recreate from scratch

# DEVICE
device = 'cuda:0'

# DATA
df = pd.read_csv("real_data/train/agnewstrainAll.csv")
df = df.rename(columns={"2": "text", "3": "label"})
display(df.head())

labels_lst = df['label'].unique()
labels_str = ", ".join(labels_lst)
labels_str_bullet = "\n".join([f"- **{name}**" for name in labels_lst])
print(f"Labels: {labels_str}")

# Print the first example for each label
examples = []
for label in labels_lst:
    example = (df[df['label'] == label].iloc[0]).loc['text']
    examples.append(example)
    print(f"Label: {label}\nText: {example}\n")

Unnamed: 0,0,1,text,label
0,1000_train,Hartford executive #39;s stock sale probed,New York AG is investigating the timing of a s...,Business
1,1001_train,Cool Batman Photo!,"In related news, it was announced yesterday th...",Sci/Tech
2,1002_train,Stocky Monkey in Himalayas Becomes Newest Prim...,Scientists from India working in the Himalayas...,Sci/Tech
3,1003_train,"Bush, Kerry Don''t Worry About Tech","Election Day is less than two weeks away, but ...",Sci/Tech
4,1004_train,The Teeming Crowd in Video Games,"Jeffrey Griffiths, the president and chief exe...",Sci/Tech


Labels: Business, Sci/Tech, Sports, World
Label: Business
Text: New York AG is investigating the timing of a sale by Thomas Marra; company reports higher 3Q. NEW YORK (Reuters) - The Hartford Financial Services Group Inc.

Label: Sci/Tech
Text: In related news, it was announced yesterday that Warner Bros. Interactive Entertainment, DC Comics and Electronic Arts will bring a Batman Begins videogame tie-in.

Label: Sports
Text: Lindsay Davenport #39;s world number one ranking is in doubt after she failed to make the finals of the WTA Tour Championships in Los Angeles.

Label: World
Text:  JERUSALEM (Reuters) - Israeli Prime Minister Ariel Sharon  accused far-rightists Sunday of trying to incite civil war over  his plan to withdraw from the occupied Gaza Strip and called  for measures to curb such groups.



# Qwen/Qwen2.5-0.5B-Instruct
## 1. Baseline

In [3]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map=device,
    attn_implementation='flash_attention_2',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# for qwen we use this system prompt:
system_prompt_qwen = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

In [4]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, distributed across the following four News categories:  
{labels_str_bullet}

### **Guidelines:**  
- Ensure **each category has 2-3 representative examples**.  
- The text should be **realistic and concise**, resembling real-world news snippets.  
- Cover **diverse topics within each category**.

### **Output Format (JSON)**  
Return **only** a valid JSON list in the following structure:

```json
[
    {{"text": "{examples[0]}", "label": "{labels_lst[0]}"}},
    {{"text": "{examples[1]}", "label": "{labels_lst[1]}"}},
    {{"text": "{examples[2]}", "label": "{labels_lst[2]}"}},
    {{"text": "{examples[3]}", "label": "{labels_lst[3]}"}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)

TIME TAKEN: 3.33
GENERATED RESPONSE:
```json
[
    {
        "text": "New York AG is investigating the timing of a sale by Thomas Marra; company reports higher 3Q.",
        "label": "Business"
    },
    {
        "text": "In related news, it was announced yesterday that Warner Bros. Interactive Entertainment, DC Comics and Electronic Arts will bring a Batman Begins videogame tie-in.",
        "label": "Sci/Tech"
    },
    {
        "text": "Lindsay Davenport's world number one ranking is in doubt after she failed to make the finals of the WTA Tour Championships in Los Angeles.",
        "label": "Sports"
    },
    {
        "text": "Israeli Prime Minister Ariel Sharon accused far-rightists Sunday of trying to incite civil war over his plan to withdraw from the occupied Gaza Strip and called for measures to curb such groups.",
        "label": "World"
    }
]
```


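The answers are too similar to the examples given: the model reproduces the four examples from the prompt almost verbatim and returns only 4 items instead of 10.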
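
The similarity to the prompt examples noted above can be quantified. Below is a small sketch using the standard library's `difflib.SequenceMatcher`; the helper names and the `0.8` threshold are assumptions chosen for illustration, not values used in this notebook.

```python
from difflib import SequenceMatcher

def max_similarity(text, references):
    """Highest character-level similarity ratio (0..1) between text and any reference."""
    return max(
        SequenceMatcher(None, text.lower(), ref.lower()).ratio()
        for ref in references
    )

def flag_copies(generated, references, threshold=0.8):
    """Return the generated texts that look like near-copies of a reference."""
    return [g for g in generated if max_similarity(g, references) >= threshold]

# One real-data example shown in the prompt, and two generated texts from the outputs above
references = [
    "Lindsay Davenport #39;s world number one ranking is in doubt after she "
    "failed to make the finals of the WTA Tour Championships in Los Angeles."
]
generated = [
    "Lindsay Davenport's world number one ranking is in doubt after she "
    "failed to make the finals of the WTA Tour Championships in Los Angeles.",
    "Tesla unveils new electric vehicle model.",
]
copies = flag_copies(generated, references)
print(len(copies), "near-copy found")
```

A character-level ratio is cheap and catches verbatim copying; it would not detect paraphrases, for which an embedding-based measure might be more suitable.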

In [5]:
prompt = f"""\
You are an expert in journalism and NLP specializing in **news classification**. \
Your task is to generate 10 high-quality short documents. \
The category of the news is the label of interest. The distribution of the labels should be balanced. \
The labels are the following:  
{labels_str_bullet}

### Output Format (JSON)  
Return only a valid JSON list of 10 items in the following structure:

```json
[
    {{"text": <text>, "label": <label>}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)
synthetic_data = response2json(response)
log_synthetic_data(model_name, "baseline", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 3.39
GENERATED RESPONSE:
```json
[
    {
        "text": "Apple Inc. announces new iPhone models for next year.",
        "label": "Tech"
    },
    {
        "text": "Tesla unveils new electric vehicle model.",
        "label": "Tech"
    },
    {
        "text": "Amazon launches new product line.",
        "label": "Tech"
    },
    {
        "text": "Microsoft releases new version of its Windows operating system.",
        "label": "Tech"
    },
    {
        "text": "Google introduces new AI-powered assistant.",
        "label": "Tech"
    },
    {
        "text": "Facebook launches new mobile app for business users.",
        "label": "Tech"
    },
    {
        "text": "NVIDIA introduces new graphics processing unit.",
        "label": "Tech"
    },
    {
        "text": "Uber announces new driver recruitment program.",
        "label": "Tech"
    },
    {
        "text": "Airbnb expands global presence.",
        "label": "Tech"
    },
    {
        "text": "Lululemo

## 2. Targeted synthetic data

Here the topics that need to be covered are requested directly from the same LLM.

In [6]:
prompt = f"""\
You are an expert News reporter and NLP enthusiast. \
List key linguistic, semantic, and contextual phenomena that characterize News documents.

Return only a **short** list in a simple bullet-point format.\
"""
news_aspects, delta_t_0 = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)

TIME TAKEN: 4.21
GENERATED RESPONSE:
- **Language Use**: The use of specific vocabulary and grammar structures common in news writing.
- **Contextual Information**: The context provided about the topic, such as the time, date, location, and subject matter.
- **Topic Clarity**: The clarity with which the article presents its main points.
- **Fact Verification**: The accuracy and reliability of information presented.
- **Sarcasm and Irony**: The presence or absence of sarcasm or irony in the text.
- **Vocabulary**: Words and phrases used to describe events, people, places, and ideas in detail.
- **Emotional Tone**: The emotional tone of the piece, whether it is informative, persuasive, or entertaining.
- **Credibility**: The author's credibility and expertise on the topic being discussed.
- **Grammar and Structure**: The adherence to standard English grammar rules and sentence structure.
- **Hyperbole**: The exaggeration or overstatement of facts without justification.
- **Synesthesia**:
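
The list above ends with a bullet that has no description, and the whole string is later pasted into the next prompt as guidelines. One possible way to drop such incomplete bullets first is sketched below; `complete_bullets` is a hypothetical helper and it assumes the `- **Name**: description` layout seen in this output.

```python
import re

def complete_bullets(raw):
    """Keep only '- **Name**: description' lines whose description is non-empty."""
    kept = []
    for line in raw.splitlines():
        match = re.match(r"^\s*-\s+\*\*(.+?)\*\*:\s*(.*)$", line)
        if match and match.group(2).strip():
            kept.append(line.strip())
    return "\n".join(kept)

raw_aspects = (
    "- **Topic Clarity**: The clarity with which the article presents its main points.\n"
    "- **Synesthesia**:"
)
cleaned = complete_bullets(raw_aspects)
print(cleaned)
```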

In [7]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, that talks about the following four News categories (labels):  
{labels_str_bullet}

### **Guidelines:**
{news_aspects}

### **Output Format (JSON)**
The labels must be one of the specified categories. Return only a valid JSON list in the following structure:

```json
[
    {{"text": <text>, "label": <label>}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)
# synthetic_data = response2json(response)
# log_synthetic_data(model_name, "targeted", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 2.52
GENERATED RESPONSE:
```json
[
    {
        "text": "In the bustling city of Tokyo, the annual E3 trade fair attracts thousands of visitors each year, showcasing the latest technology and products from around the world.",
        "label": "Tech"
    },
    {
        "text": "The epicenter of the global financial crisis was the collapse of Lehman Brothers, a major financial institution that had been heavily invested in mortgage-backed securities.",
        "label": "Sci/Tech"
    },
    {
        "text": "The vibrant city of Paris hosted the 2024 Summer Olympics, where athletes from all over the world competed in various sports.",
        "label": "Sport"
    },
    {
        "text": "In the heart of New York City, the iconic Statue of Liberty stands as a symbol of freedom and democracy.",
        "label": "World"
    }
]
```


### Targeted synthetic data with ChatGPT

The topics that need to be covered were generated with ChatGPT online.

In [8]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, that talks about the following four News categories (labels):  
{labels_str_bullet}

### **Follow these topics:**
- **Business**  
  - Markets  
  - Economy  
  - Companies  
  - Startups  
  - Regulations  

- **Sci/Tech**  
  - AI  
  - Space  
  - Cybersecurity  
  - Biotech  
  - Climate  

- **Sports**  
  - Events  
  - Records  
  - Highlights  
  - Scandals  
  - Olympics  

- **World**  
  - Politics  
  - Conflicts  
  - Disasters  
  - Human Rights  
  - Trade

### **Output Format (JSON)**
The labels must be one of the specified categories, which are: {labels_str}. \
Return only a valid JSON list of 10 elements in the following structure:

```json
[
    {{"text": <news document>, "label": <corresponding label>}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)
synthetic_data = response2json(response)
log_synthetic_data(model_name, "targeted", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 3.84
GENERATED RESPONSE:
```json
[
    {
        "text": "Market Trends Analysis: Global Stock Market Performance",
        "label": "Business"
    },
    {
        "text": "AI Breakthroughs: The Latest Developments in Artificial Intelligence",
        "label": "Sci/Tech"
    },
    {
        "text": "Space Exploration: Future Prospects for Mars Missions",
        "label": "Sci/Tech"
    },
    {
        "text": "Cybersecurity Threats: How to Stay Safe Online",
        "label": "Sci/Tech"
    },
    {
        "text": "Biotechnology Advances: Future Medical Innovations",
        "label": "Sci/Tech"
    },
    {
        "text": "Climate Change: Solutions for Mitigation and Adaptation",
        "label": "Sci/Tech"
    },
    {
        "text": "Events: The Best Newscasts of the Week",
        "label": "Sports"
    },
    {
        "text": "Records: The Unforgettable Moments of the Day",
        "label": "Sports"
    },
    {
        "text": "Olympics Highlights: The Most Exciti

## 3. Targeted + Linguistic tags

Now we also ask the model to identify the linguistic phenomena present in each generated sentence, in the following format:

```
{  
    "text": "Oh, I absolutely adore being stuck in traffic for hours.",   
    "label": "ironic",  
    "phenomena": ["Polarity inversion", "Hyperbole", "Semantic incongruence", "Lexical exaggeration"]
},
```
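
Once items carry a `phenomena` list as in the example above, the tags can be aggregated to see which phenomena a batch actually covers. This is a minimal sketch with a hypothetical helper name; it assumes the items have already been parsed into Python dictionaries.

```python
from collections import Counter

def phenomena_counts(items):
    """Count tag frequencies and the number of items without a usable tag list."""
    counts = Counter()
    missing = 0
    for item in items:
        tags = item.get("phenomena")
        if not isinstance(tags, list) or not tags:
            missing += 1
            continue
        # Normalise case and whitespace so "Hyperbole" and "hyperbole " are merged
        counts.update(tag.strip().lower() for tag in tags)
    return counts, missing

items = [
    {"text": "Oh, I absolutely adore being stuck in traffic for hours.",
     "label": "ironic",
     "phenomena": ["Polarity inversion", "Hyperbole"]},
    {"text": "Untagged sample.", "label": "ironic"},
]
counts, missing = phenomena_counts(items)
print(counts.most_common(), missing)
```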

In [9]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality **short documents**, that talks about the following four News categories (labels):  
{labels_str_bullet}.
For each example, also list the key phenomena it covers.

### **Follow these topics:**
- **Business**  
  - Markets  
  - Economy  
  - Companies  
  - Startups  
  - Regulations  

- **Sci/Tech**  
  - AI  
  - Space  
  - Cybersecurity  
  - Biotech  
  - Climate  

- **Sports**  
  - Events  
  - Records  
  - Highlights  
  - Scandals  
  - Olympics  

- **World**  
  - Politics  
  - Conflicts  
  - Disasters  
  - Human Rights  
  - Trade

### **Output Format (JSON)**
The labels must be one of the specified categories, which are: {labels_str}. \
Return only a valid JSON list of 10 elements in the following structure:

```json
[
    {{"text": <news document>, "label": <corresponding label>, "phenomena": ["<phenomenon1>", "<phenomenon2>", ...]}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer, system_prompt=system_prompt_qwen)
synthetic_data = response2json(response)
log_synthetic_data(model_name, "targeted + linguistic tags", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 5.86
GENERATED RESPONSE:
```json
[
    {
        "text": "Market Trends Analysis: Global Stock Market Performance",
        "label": "Business",
        "phenomena": ["Stock market performance", "Economic indicators"]
    },
    {
        "text": "Space Exploration: SpaceX Launches New Mission",
        "label": "Sci/Tech",
        "phenomena": ["Space exploration", "Rocket launches", "Technology advancements"]
    },
    {
        "text": "Cybersecurity Threats: Latest Infections in IoT Devices",
        "label": "Sci/Tech",
        "phenomena": ["Intrusion detection systems", "IoT devices", "Security threats"]
    },
    {
        "text": "Climate Change: The Impact on Agriculture",
        "label": "Sci/Tech",
        "phenomena": ["Agriculture practices", "Climate change impacts", "Solutions for sustainable agriculture"]
    },
    {
        "text": "Olympic Games: Highlights of the 2022 Tokyo Games",
        "label": "Sports",
        "phenomena": ["Events", "Record-br

---

# deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

In [10]:
clear_cuda_cache(model)

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map=device,
    attn_implementation='flash_attention_2',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.generation_config.pad_token_id = tokenizer.pad_token_id

## 1. Baseline

In [11]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, distributed across the following four News categories:  
{labels_str_bullet}

### **Guidelines:**  
- Ensure each category has 2-3 representative examples.  
- The text should be realistic and concise, resembling real-world news snippets.  
- Cover diverse topics within each category.

### **Output Format (JSON)**  
Return **only** a valid JSON list in the following structure:

```json
[
    {{"text": "{examples[0]}", "label": "{labels_lst[0]}"}},
    {{"text": "{examples[1]}", "label": "{labels_lst[1]}"}},
    {{"text": "{examples[2]}", "label": "{labels_lst[2]}"}},
    {{"text": "{examples[3]}", "label": "{labels_lst[3]}"}},
    ...
]
```
"""
generated, delta_t = get_response(prompt, model, tokenizer)

TIME TAKEN: 16.73
GENERATED RESPONSE:
Okay, so I need to generate 10 short news documents across four categories: Business, Sci/Tech, Sports, and World. Let me think about how to approach each category and come up with relevant examples.

Starting with Business. I should find news that's related to business operations or financial news. Maybe something about a company's recent actions or financial figures. I can't just make up anything, so I'll look for well-known companies or events in the business sector.

Next, Sci/Tech. This category needs scientific or technological news. I should think about recent advancements, breakthroughs, or interesting experiments in fields like AI, space exploration, or quantum computing. It's important to include both successes and maybe some controversies or innovations.

For Sports, I'll need news related to sports events, player news, or team developments. Maybe something about a new record, a significant injury, or a major tournament result. It's cruc
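
The DeepSeek responses printed here begin with free-form reasoning rather than JSON. A parser therefore has to skip that part before decoding. The sketch below is one way to do it and is not the implementation of `response2json`, which is not shown in this notebook; it assumes the reasoning ends with a closing `</think>` tag when one is present, and `extract_json_list` is a hypothetical name.

```python
import json
import re

def extract_json_list(response):
    """Return the first JSON list found after the reasoning part, or None."""
    # Keep only what follows the last </think>; if the tag is absent, keep everything
    body = response.rpartition("</think>")[2]
    match = re.search(r"\[.*\]", body, flags=re.DOTALL)
    if match is None:
        return None  # e.g. the response stopped before any JSON was produced
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

demo = (
    "Okay, so I need [roughly] 10 documents.\n</think>\n"
    '```json\n[{"text": "Markets rose today.", "label": "Business"}]\n```'
)
parsed = extract_json_list(demo)
print(parsed)
```

Returning `None` instead of raising keeps a batch loop running when a response contains only reasoning; the caller can then decide whether to retry.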

In [12]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, that talks about the following four News categories:  
{labels_str_bullet}

### **Output Format (JSON)**  
Return only a valid JSON list of 10 items in the following structure:

```json
[
    {{"text": <text>, "label": <label>}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer)
synthetic_data = response2json(response)
log_synthetic_data(model_name, "baseline", prompt, synthetic_data, delta_t, output_file=log_file_path)


TIME TAKEN: 19.43
GENERATED RESPONSE:
Okay, so the user has asked me to generate 10 high-quality short news documents for four categories: Business, Sci/Tech, Sports, and World. They mentioned that I'm an expert in journalism and NLP, so I can use advanced techniques to create these documents. 

First, I need to make sure each document is concise and informative. The user specified JSON format, so I'll structure each item as an object with "text" and "label" keys. I should vary the topics within each category to cover a broad range of news.

For the Business category, I can think of articles about tech companies, maybe a merger, or a major financial event. A business article about a new tech start-up could be good. Then, a financial article on a major event like the US election would fit well.

Next, Sci/Tech is all about science and technology. I can create a news piece about groundbreaking research, maybe a recent breakthrough in AI. Also, an article on climate change could be releva

Some generated samples fall between two labels:

- "The World is set to host the next FIFA World Cup..." World & Sport
- "A new AI system has been developed..." Business & Sci/Tech
- "A major tech company has announced a new venture that aims to revolutionize the business of artificial intelligence..." Business & Sci/Tech

## 2. Targeted

In [13]:
prompt = f"""\
You are an expert News reporter and NLP enthusiast. \
List all key topics that must be covered in a News document. \
In particular consider the following categories: {labels_str}
Return only a short list in a simple bullet-point format.\
"""
news_aspects, delta_t_0 = get_response(prompt, model, tokenizer)

TIME TAKEN: 10.12
GENERATED RESPONSE:
Okay, so I need to list the key topics that must be covered in a News Document, specifically broken down into Business, Sci/Tech, Sports, and World categories. The user wants a simple bullet-point format with only the categories, no subcategories. 

Starting with Business, I know there are major topics like the economy, especially the global economy and maybe tech companies. Then, global politics, international relations, and corporate governance come to mind. Also, consumer trends and consumer behavior are important. Marketing strategies and consumer perception are key areas too. International business and trade policy are definitely relevant. And the global economy's growth and challenges.

Moving on to Sci/Tech, the focus would be on innovation and advancements. The rise of AI and machine learning is a big one. Quantum computing and its potential impact is another. Renewable energy and sustainable development are crucial. AI in healthcare and it

In [14]:
# Extract bullet points
news_aspects = re.findall(r"- (.+)", news_aspects)
news_aspects = set(b.strip() for b in news_aspects)  # Strip whitespace and drop duplicates (set order is arbitrary)

# Convert to CSV string
news_aspects = ", ".join(news_aspects)
print(f"News aspects: {news_aspects}")

News aspects: **Sci/Tech**: Innovation and advancements, AI and machine learning, quantum computing, renewable energy, healthcare AI, climate change, emerging tech trends, future of tech., **Sports**: World Cup, world records, sports statistics, international sports news, athletes performance, sports strategies, World Cup analysis, game rules., **World**: pandemic impact, international relations, global governance, digital IoT, global economy, COVID-19 impact, population growth, global economy challenges., **Business**: Global economy trends, international politics, corporate governance, consumer behavior, marketing strategies, international business, trade policy, economic growth and challenges.
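
The pattern `- (.+)` also matches a hyphen followed by a space in the middle of a sentence, and a `set` does not preserve order. A slightly stricter variant is sketched below as an alternative, not as the notebook's method: it only accepts bullets at the start of a line and removes duplicates while keeping first-seen order. `extract_bullets` is a hypothetical name.

```python
import re

def extract_bullets(text):
    """Extract line-initial bullets, de-duplicated in first-seen order."""
    bullets = re.findall(r"^\s*[-*]\s+(.+?)\s*$", text, flags=re.MULTILINE)
    return list(dict.fromkeys(bullets))

sample = (
    "Okay, let me think - carefully.\n"
    "- **Business**: markets\n"
    "- **Sports**: records\n"
    "- **Business**: markets\n"
)
bullets = extract_bullets(sample)
print(", ".join(bullets))
```

Keeping the order deterministic makes the resulting prompt reproducible across runs, which matters when prompts are logged alongside the generated data.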


In [15]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, that talks about the following four News categories (labels):  
{labels_str_bullet}

### **Guidelines:**
{news_aspects}

### **Output Format (JSON)**
Return only a valid JSON list of 10 items in the following structure:

```json
[
    {{"text": <text>, "label": <label>}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer)
# synthetic_data = response2json(response)
# log_synthetic_data(model_name, "targeted", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 16.44
GENERATED RESPONSE:
Okay, so the user has asked me to generate 10 short news documents for four specific categories: Business, Sci/Tech, Sports, and World. They provided detailed guidelines on what each category includes, which is really helpful. 

First, I need to make sure I understand each category correctly. For Business, it's about global economy trends, international politics, corporate governance, etc. Sci/Tech includes AI, quantum computing, renewable energy, etc. Sports focuses on World Cup, sports stats, strategies, etc. World covers pandemic impacts, international relations, economy, population, etc.

I should think of topics that fit each category well. For example, under Business, maybe something about global trade or corporate social responsibility. Sci/Tech could involve AI in healthcare or climate change. Sports might have articles about a tournament or athlete performance. World could focus on the COVID-19 pandemic or population growth.

I need to ens

### Targeted synthetic data with ChatGPT

In [16]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, that talks about the following four News categories (labels):  
{labels_str_bullet}

### **Follow these topics:**
- **Business**  
  - Markets  
  - Economy  
  - Companies  
  - Startups  
  - Regulations  

- **Sci/Tech**  
  - AI  
  - Space  
  - Cybersecurity  
  - Biotech  
  - Climate  

- **Sports**  
  - Events  
  - Records  
  - Highlights  
  - Scandals  
  - Olympics  

- **World**  
  - Politics  
  - Conflicts  
  - Disasters  
  - Human Rights  
  - Trade

### **Output Format (JSON)**
The labels must be one of the specified categories, which are: {labels_str}. \
Return only a valid JSON list of 10 elements in the following structure:

```json
[
    {{"text": <text of the document>, "label": <corresponding label>}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer)
synthetic_data = response2json(response)
log_synthetic_data(model_name, "targeted", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 14.14
GENERATED RESPONSE:
Okay, so the user has asked me to generate 10 short news documents for four different categories: Business, Sci/Tech, Sports, and World. They provided specific topics for each category, and I need to make sure each document is tailored to those topics. 

First, I need to understand each category's focus. Business should cover markets, economy, companies, startups, and regulations. Sci/Tech is AI, space, cybersecurity, biotech, and climate. Sports needs to talk about events, records, highlights, scandals, and Olympics. World includes politics, conflicts, disasters, human rights, and trade.

I should brainstorm some topics within each category that are current and relevant. For Business, maybe something about the tech sector's impact on markets. For Sci/Tech, perhaps the latest AI advancements. Sports could focus on an Olympic event with a highlight. World might involve a political event or conflict.

I need to ensure each document is concise and cap

## 3. Targeted + Linguistic tags

In [17]:
prompt = f"""\
You are an expert in journalism and NLP specializing in news classification. \
Your task is to generate 10 high-quality short documents, that talks about the following four News categories (labels):  
{labels_str_bullet}.
For each example, also list the key phenomena it covers.

### **Follow these topics:**
- **Business**  
  - Markets  
  - Economy  
  - Companies  
  - Startups  
  - Regulations  

- **Sci/Tech**  
  - AI  
  - Space  
  - Cybersecurity  
  - Biotech  
  - Climate  

- **Sports**  
  - Events  
  - Records  
  - Highlights  
  - Scandals  
  - Olympics  

- **World**  
  - Politics  
  - Conflicts  
  - Disasters  
  - Human Rights  
  - Trade

### **Output Format (JSON)**
The labels must be one of the specified categories, which are: {labels_str}. \
Return only a valid JSON list of 10 elements in the following structure:

```json
[
    {{"text": <text of the document>, "label": <corresponding label>, "phenomena": ["<phenomenon1>", "<phenomenon2>", ...]}},
    ...
]
```
"""
response, delta_t = get_response(prompt, model, tokenizer)
synthetic_data = response2json(response)
log_synthetic_data(model_name, "targeted + linguistic tags", prompt, synthetic_data, delta_t, output_file=log_file_path)

TIME TAKEN: 17.92
GENERATED RESPONSE:
Okay, so the user has asked me to generate 10 short news documents for four different categories: Business, Sci/Tech, Sports, and World. Each document should cover specific phenomena within its category. Let me break this down step by step.

First, I need to understand what each category entails. For Business, topics include markets, economy, companies, startups, and regulations. Sci/Tech covers AI, space, cybersecurity, biotech, and climate. Sports focuses on events, records, Highlights, and scandals, while World includes politics, conflicts, disasters, and human rights.

I should come up with 10 unique topics that touch on these areas. Let me brainstorm some ideas for each category.

In Business, maybe something about the global economy, tech companies, or regulations. For Sci/Tech, AI in healthcare or space exploration. Sports could involve a sports event or a scandal. World might include politics or conflicts.

I need to ensure each document is

### Example of data generated by ChatGPT

- The text is more realistic and focused on more specific topics.

"text": "Global stock markets experienced a sharp rise today as tech giants posted record profits. Analysts predict continued growth driven by AI advancements and cloud computing.",  
"label": "Business",  
"phenomena": ["Markets", "Companies"]  
 
"text": "A breakthrough in AI-powered drug discovery has led to the rapid identification of new antibiotics, potentially revolutionizing treatment for antibiotic-resistant bacteria.",  
"label": "Sci/Tech",  
"phenomena": ["AI", "Biotech"]  

"text": "The UEFA Champions League final ended in dramatic fashion as Real Madrid secured a last-minute victory over Manchester City, marking their 15th title win.",  
"label": "Sports",  
"phenomena": ["Events", "Records"]  

"text": "Following a major cyberattack on a multinational bank, security experts warn of growing threats targeting financial institutions worldwide.",  
"label": "Sci/Tech",  
"phenomena": ["Cybersecurity", "Companies"]  

"text": "New US economic policies aimed at reducing inflation include increased interest rates and stricter fiscal measures, sparking debates among economists.",  
"label": "Business",  
"phenomena": ["Economy", "Regulations"]  

"text": "NASA has successfully launched its latest Mars rover, designed to search for signs of ancient microbial life and test new space exploration technologies.",  
"label": "Sci/Tech",  
"phenomena": ["Space", "Climate"]  

"text": "A doping scandal has rocked the world of professional athletics, with several top runners suspended after testing positive for banned substances.",  
"label": "Sports",  
"phenomena": ["Scandals", "Records"]  

"text": "A massive earthquake struck Southeast Asia, causing widespread destruction and prompting international aid efforts to support affected communities.",  
"label": "World",  
"phenomena": ["Disasters", "Human Rights"]  

"text": "Trade negotiations between the EU and South America have reached a breakthrough, paving the way for a new economic partnership that could boost global commerce.",  
"label": "World",  
"phenomena": ["Trade", "Politics"]  

"text": "Amid rising geopolitical tensions, world leaders gather for an emergency summit to address conflicts and work towards diplomatic solutions.",  
"label": "World",  
"phenomena": ["Politics", "Conflicts"]  
