#Data Augmentation using LLMs

Data augmentation is a fundamental machine learning technique for expanding and diversifying datasets by generating synthetic data. With the rise of Large Language Models (LLMs), it has become one of their most impactful industry applications, enabling the creation of high-quality, diverse datasets for a variety of use cases.

#What is Data Augmentation?

Data Augmentation involves creating new, synthetic data points from an existing dataset to improve model robustness, generalization, and performance. In the context of text, this could mean paraphrasing sentences, generating new examples, or creating entirely new structured data based on patterns. LLMs excel at this task because they can understand context, mimic writing styles, and generate plausible outputs based on prompts.

For structured data like tables (e.g., CSV files), LLMs can be guided to produce rows that follow a specific format, such as employee records with fields like Employee ID, Name, Age, Department, Salary, and Experience.





#Data Augmentation using LLMs

Let’s consider a scenario where we have an Employee Salary Dataset (with columns like: Employee ID, Name, Age, Department, Salary, and Experience), but it contains only a few samples. Our goal is to generate additional realistic records to improve training data quality for a salary prediction model.

##Step 1: Load a Pretrained LLM

I am using GPT-2 for demonstration.

In [1]:
from transformers import pipeline

# load GPT-2 model for text generation
generator = pipeline("text-generation", model="gpt2")


When this code runs, the GPT-2 model and its tokenizer are downloaded (if not already cached) and loaded into memory. The generator object is now ready to produce text based on input prompts.

##Step 2: Define a Structured Prompt for Data Generation

LLMs like GPT-2 rely on patterns in the input prompt. By providing a detailed example, we guide the model to mimic the structure (comma-separated values) and content (realistic employee data). The example rows do not have to be typed by hand: they can also be read from an existing CSV file and inserted into the prompt.
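As a rough illustration of building the prompt from a CSV file, the sketch below reads a header and a few rows with Python's standard `csv` module and formats them like the prompt used in this step. The function name `build_prompt`, the `max_examples` parameter, and the in-memory `SEED_CSV` string are assumptions made for this example; in practice the rows would come from a real file opened with `open(...)`.

```python
import csv
from io import StringIO
from itertools import islice

# Hypothetical seed data standing in for a real CSV file on disk.
SEED_CSV = """Employee ID,Name,Age,Department,Salary,Experience
1,John Doe,28,Engineering,75000,3
2,Jane Smith,32,Marketing,85000,5
3,Alice Brown,45,HR,95000,10
"""


def build_prompt(csv_file, max_examples=20):
    """Build a few-shot prompt from the header and first rows of a CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader)  # first line holds the column names
    # Take at most max_examples data rows, ignoring empty lines.
    rows = [row for row in islice(reader, max_examples) if row]
    lines = [
        "Generate a structured table in CSV format with columns:",
        ", ".join(header) + ".",
        "",
        "Example:",
    ]
    lines.extend(", ".join(row) for row in rows)
    # End with a newline so the model continues on a fresh row.
    return "\n".join(lines) + "\n"


example_prompt = build_prompt(StringIO(SEED_CSV), max_examples=2)
print(example_prompt)
```

Limiting the number of example rows keeps the prompt short, which matters for a model with a small context window such as GPT-2.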

In [3]:
prompt = """
Generate a structured table in CSV format with columns:
Employee ID, Name, Age, Department, Salary ($), Experience (Years).

Example:
1, John Doe, 28, Engineering, 75000, 3
2, Jane Smith, 32, Marketing, 85000, 5
3, Alice Brown, 45, HR, 95000, 10
4, Robert White, 38, Engineering, 90000, 7
5, Emily Davis, 29, Finance, 72000, 4
6, Michael Johnson, 50, Sales, 110000, 20
7, Sarah Wilson, 31, HR, 78000, 6
8, David Lee, 42, Marketing, 88000, 12
9, Jennifer Moore, 27, Engineering, 71000, 2
10, Kevin Clark, 35, Finance, 93000, 8
11, Jessica Taylor, 30, Sales, 79000, 5
12, William Martin, 37, HR, 87000, 9
13, Olivia Adams, 40, Engineering, 99000, 14
14, Daniel Harris, 26, Finance, 70000, 2
15, Sophia Anderson, 33, Marketing, 85000, 7
16, Matthew Thomas, 29, Sales, 73000, 3
17, Laura Jackson, 36, HR, 89000, 10
18, Anthony Rodriguez, 41, Engineering, 105000, 15
19, Lisa Scott, 39, Marketing, 92000, 11
20, Andrew Hall, 34, Finance, 94000, 9
"""

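# generate synthetic data using GPT-2
# (GPT-2's context window is 1,024 tokens, so max_length=5000 cannot be honored;
# max_new_tokens limits only the newly generated continuation)
generated_data = generator(prompt, max_new_tokens=200, num_return_sequences=1)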

# print generated output
print(generated_data[0]['generated_text'])


Generate a structured table in CSV format with columns:
Employee ID, Name, Age, Department, Salary ($), Experience (Years).

Example:
1, John Doe, 28, Engineering, 75000, 3
2, Jane Smith, 32, Marketing, 85000, 5
3, Alice Brown, 45, HR, 95000, 10
4, Robert White, 38, Engineering, 90000, 7
5, Emily Davis, 29, Finance, 72000, 4
6, Michael Johnson, 50, Sales, 110000, 20
7, Sarah Wilson, 31, HR, 78000, 6
8, David Lee, 42, Marketing, 88000, 12
9, Jennifer Moore, 27, Engineering, 71000, 2
10, Kevin Clark, 35, Finance, 93000, 8
11, Jessica Taylor, 30, Sales, 79000, 5
12, William Martin, 37, HR, 87000, 9
13, Olivia Adams, 40, Engineering, 99000, 14
14, Daniel Harris, 26, Finance, 70000, 2
15, Sophia Anderson, 33, Marketing, 85000, 7
16, Matthew Thomas, 29, Sales, 73000, 3
17, Laura Jackson, 36, HR, 89000, 10
18, Anthony Rodriguez, 41, Engineering, 105000, 15
19, Lisa Scott, 39, Marketing, 92000, 11
20, Andrew Hall, 34, Finance, 94000, 9
21, Paul Hall, 36, Marketing, 78000, 13
12, Lauren Walker

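This prompt primes GPT-2 to generate additional rows that follow the same format and style. When the code runs, GPT-2 uses its learned patterns to predict what comes next after the prompt, ideally producing new CSV rows with realistic employee data. Note that the returned text begins with the prompt itself, and that the model may reuse an Employee ID that already appears in the examples (here, row 21 is followed by a row numbered 12).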

##Step 3: Parse the Generated Text into a DataFrame

The generated output is still a raw string. To use it as a table, we strip the prompt from the output and parse the remaining rows into a pandas DataFrame containing only the synthetic rows generated by GPT-2:

In [5]:
import pandas as pd
from io import StringIO

# extract generated text
generated_text = generated_data[0]['generated_text']

# remove the prompt portion from the output
generated_text = generated_text.replace(prompt, "").strip()

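# convert to dataframe
# skipinitialspace drops the space after each comma;
# on_bad_lines="skip" (pandas >= 1.3) ignores rows with too many fields
data = pd.read_csv(
    StringIO(generated_text),
    names=["Employee ID", "Name", "Age", "Department", "Salary", "Experience"],
    skipinitialspace=True,
    on_bad_lines="skip",
)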
print(data)

   Employee ID               Name  Age  Department  Salary  Experience
0           21          Paul Hall   36   Marketing   78000          13
1           12      Lauren Walker   41     Finance   97000           1
2           13    David W. Miller   45       Sales  106000           7
3           14       James Wilson   31   Marketing   92000           8
4           15   John A. White II   46       Sales   86400           3


This step transforms the raw text into a tabular format that can be used for analysis or machine learning.
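The generated rows above reuse Employee IDs that already exist in the prompt (12 to 15), and GPT-2 can also stop in the middle of a row. One possible way to guard against this before building the DataFrame is to filter the raw text line by line. The sketch below uses only the standard library; the function name `extract_valid_rows`, the `seen_ids` parameter, and the sample row "Copy Cat" are hypothetical and serve only as an illustration.

```python
EXPECTED_FIELDS = 6  # Employee ID, Name, Age, Department, Salary, Experience


def extract_valid_rows(full_text, prompt, seen_ids=()):
    """Return well-formed rows with unseen IDs from the model's continuation."""
    # Keep only the text the model added after the prompt.
    if full_text.startswith(prompt):
        continuation = full_text[len(prompt):]
    else:
        continuation = full_text
    seen = set(seen_ids)
    rows = []
    for line in continuation.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != EXPECTED_FIELDS:
            continue  # truncated or malformed row
        emp_id, name, age, dept, salary, experience = fields
        try:
            record = (int(emp_id), name, int(age), dept, int(salary), int(experience))
        except ValueError:
            continue  # a numeric column contains non-numeric text
        if not name or not dept or record[0] in seen:
            continue  # empty text field or duplicate Employee ID
        seen.add(record[0])
        rows.append(record)
    return rows


# Small demonstration with a shortened prompt and a hand-written continuation.
demo_prompt = "Example:\n1, John Doe, 28, Engineering, 75000, 3\n"
demo_output = (
    demo_prompt
    + "21, Paul Hall, 36, Marketing, 78000, 13\n"
    + "1, Copy Cat, 30, HR, 80000, 4\n"  # hypothetical row reusing ID 1
    + "12, Lauren Walker"                # incomplete row
)
valid_rows = extract_valid_rows(demo_output, demo_prompt, seen_ids=[1])
print(valid_rows)
```

The resulting tuples can be passed to `pd.DataFrame(valid_rows, columns=[...])` with the same column names used in this step. Checks on value ranges (for example, plausible ages or salaries) are left out here and would depend on the dataset.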

#Summary

Data augmentation expands and diversifies a dataset by generating synthetic data. In this walkthrough, we used GPT-2 to generate synthetic employee records from a CSV-style prompt and parsed the output into a DataFrame, showing how LLMs can augment a small tabular dataset.