What we want to know: if we train a model on synthetic data, does the quality of its output degrade, as predicted by "The Curse of Recursion: Training on Generated Data Makes Models Forget" by Shumailov et al.?

If training on generated data does make models forget, then training the same LLM architecture on two different datasets, one of natural language and one of synthetic data, should leave the synthetic-data model performing worse on problems that require factual knowledge. Does it? What about reasoning tasks?

https://arxiv.org/pdf/2305.17493.pdf
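
The comparison described above could be scored along these lines. This is a minimal sketch, not the method used later in this notebook: `accuracy`, `bootstrap_gap`, and the toy prediction lists are hypothetical names and made-up data, standing in for two models that share an architecture but were trained on natural and synthetic text. It resamples a shared evaluation set to check whether the accuracy gap is larger than sampling noise.

```python
import random

def accuracy(preds, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def bootstrap_gap(preds_a, preds_b, labels, n_resamples=1000, seed=0):
    """Paired bootstrap: resample evaluation examples with replacement and
    return the 2.5th and 97.5th percentiles of accuracy(a) - accuracy(b)."""
    rng = random.Random(seed)
    n = len(labels)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(preds_a[i] == labels[i] for i in idx) / n
        acc_b = sum(preds_b[i] == labels[i] for i in idx) / n
        gaps.append(acc_a - acc_b)
    gaps.sort()
    return gaps[int(0.025 * n_resamples)], gaps[int(0.975 * n_resamples) - 1]

# Toy stand-ins (NOT real results): predictions from the two hypothetical models.
labels          = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
preds_natural   = [0, 1, 2, 3, 4, 0, 1, 2, 0, 0]  # 8 of 10 correct
preds_synthetic = [0, 1, 2, 3, 0, 0, 1, 0, 0, 0]  # 6 of 10 correct

gap = accuracy(preds_natural, labels) - accuracy(preds_synthetic, labels)
low, high = bootstrap_gap(preds_natural, preds_synthetic, labels)
print(f"accuracy gap: {gap:.2f}, 95% bootstrap interval: [{low:.2f}, {high:.2f}]")
```

If the interval comfortably excludes zero on a real evaluation set, the gap is unlikely to be resampling noise alone; with only ten toy examples the interval is expected to be wide.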



In [2]:
#%pip install huggingface-hub huggingface-cli datasets accelerate evaluate
import evaluate
import torch
import numpy
from transformers import Pipeline, AutoTokenizer, AutoModelForCausalLM, AutoModelForSequenceClassification
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
# HUUUUGE dataset:
#ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", num_proc=12)
# Medium size dataset:
#ds = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [2]:
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
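
To make the effect of `padding="max_length", truncation=True` concrete, here is a toy illustration in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption. The real tokenizer also returns `token_type_ids` and keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]            # truncation: drop tokens past the limit
    mask = [1] * len(ids)                   # 1 marks a real token
    n_pad = max_length - len(ids)
    # padding: fill the remainder with pad_id and mask it out with 0
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

# A short sequence is padded up to max_length.
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
# A long sequence is cut down to max_length.
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up the same length, the batches stack cleanly into tensors, at the cost of spending compute on padding for short reviews.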

In [3]:
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = numpy.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
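
As a small worked example of what `compute_metrics` does, the sketch below applies the same `argmax` step to made-up logits and computes accuracy by hand. The logits and labels are hypothetical; no `evaluate` metric is loaded here.

```python
import numpy

# Hypothetical logits for 4 reviews over 5 star classes (made-up numbers).
logits = numpy.array([
    [2.0, 0.1, 0.0, -1.0, -2.0],   # highest score at index 0
    [0.0, 0.2, 1.5, 0.3, 0.0],     # highest score at index 2
    [-1.0, 0.0, 0.1, 0.2, 3.0],    # highest score at index 4
    [0.5, 0.4, 0.3, 0.2, 0.1],     # highest score at index 0
])
labels = numpy.array([0, 2, 3, 0])

# The predicted class is the index of the largest logit in each row.
predictions = numpy.argmax(logits, axis=-1)
# Accuracy is the fraction of rows where prediction and label agree.
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

Three of the four toy predictions match their labels; the third row predicts class 4 where the label is 3.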

In [10]:
from transformers import TrainingArguments, Trainer

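# Evaluate on the eval set at the end of every epoch.
# Note: newer transformers releases rename `evaluation_strategy` to `eval_strategy`.
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")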

In [16]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

"""
E:\Applications\miniconda3\envs\default\lib\site-packages\accelerate\accelerator.py:436: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
"""



In [17]:
trainer.train()

Epoch  Training Loss  Validation Loss  Accuracy
1      No log         1.040726         0.543
2      No log         1.113026         0.561
3      No log         1.08556          0.589


TrainOutput(global_step=375, training_loss=0.900199951171875, metrics={'train_runtime': 107.031, 'train_samples_per_second': 28.029, 'train_steps_per_second': 3.504, 'total_flos': 789354427392000.0, 'train_loss': 0.900199951171875, 'epoch': 3.0})
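
The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. The sketch below assumes the `Trainer` defaults, which this notebook does not override: a per-device batch size of 8 on a single device, 3 epochs, and a logging interval of 500 steps. Under those assumptions it reproduces `global_step=375` and the reported throughput, and suggests why the Training Loss column shows "No log": the run ends before the first logging step.

```python
import math

n_train = 1000            # small_train_dataset size, from select(range(1000))
batch_size = 8            # assumed: Trainer default per-device batch size, one device
epochs = 3                # assumed: Trainer default num_train_epochs
logging_steps = 500       # assumed: Trainer default logging interval
train_runtime = 107.031   # seconds, copied from TrainOutput

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = n_train * epochs / train_runtime
steps_per_second = total_steps / train_runtime
loss_logged = total_steps >= logging_steps

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print("training loss logged during run:", loss_logged)
```

For context on the accuracy column: with five star classes, uniform guessing would score about 0.2 if the classes are roughly balanced, which is an assumption about the 1000-example evaluation sample rather than something checked here.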