Installation and Setup

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


Loading the Model and Tokenizer

In [10]:
# Define model name
model_name = "gpt2"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:  # Add [PAD] token if missing
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Load model
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))  # Adjust for added tokens


Embedding(50258, 768)

Testing the Default Model's Text Generation with Stopping Criteria

In [32]:
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteriaList, StoppingCriteria

# Load model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:  # Ensure pad token exists
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# Define input text
input_text = "I am from India,"

# Tokenize input text
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)

# Define custom stopping criteria
class EndTokenStoppingCriteria(StoppingCriteria):
    """Custom stopping criteria to stop generation after a sentence-ending token."""
    def __init__(self, tokenizer):
        self.sentence_end_ids = tokenizer.convert_tokens_to_ids(['.', '!', '?'])

    def __call__(self, input_ids, scores, **kwargs):
        # Stop generation if the last token is a sentence-ending token
        if input_ids[0, -1].item() in self.sentence_end_ids:
            return True
        return False

# Define stopping criteria
stopping_criteria = StoppingCriteriaList([
    EndTokenStoppingCriteria(tokenizer)  # Stop at sentence-end tokens
])

# Generate output
output = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,  # Restrict new token generation
    num_return_sequences=1,
    pad_token_id=tokenizer.pad_token_id,
    do_sample=True,
    temperature=0.6,    # Lower randomness for focus
    top_k=30,           # Reduce the pool of token choices
    top_p=0.8,          # Increase nucleus sampling strictness
    repetition_penalty=1.5,  # Strongly penalize repetition
    stopping_criteria=stopping_criteria  # Stop after a complete sentence
)

# Decode and display the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)


I am from India, and I have never been to any country where you can't find a good beer.
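
The criterion above inspects only input_ids[0, -1], which is enough for a single prompt. If several prompts were generated in one batch, each row would need to be tracked separately. The following is a minimal sketch of that bookkeeping, with plain Python lists standing in for the tensor; BatchSentenceEndTracker is a hypothetical helper name, and the token IDs are illustrative values rather than something read from the tokenizer.

```python
from typing import Iterable, List, Set


class BatchSentenceEndTracker:
    """Track, per sequence, whether a sentence-ending token has appeared."""

    def __init__(self, sentence_end_ids: Iterable[int], batch_size: int):
        self.sentence_end_ids: Set[int] = set(sentence_end_ids)
        self.finished: List[bool] = [False] * batch_size

    def update(self, last_token_ids: List[int]) -> bool:
        """Record the newest token of each row; return True once all rows are done."""
        for row, token_id in enumerate(last_token_ids):
            if token_id in self.sentence_end_ids:
                self.finished[row] = True
        return all(self.finished)


# Illustrative IDs assumed to stand for '.', '!' and '?'
tracker = BatchSentenceEndTracker(sentence_end_ids=[13, 0, 30], batch_size=2)
steps = [[464, 318], [13, 257], [50256, 30]]  # newest token per row at each step
stop_flags = [tracker.update(step) for step in steps]
print(stop_flags)  # [False, False, True]
```

Inside a real StoppingCriteria subclass, the same idea would read the last column of input_ids instead of a list, and rows that finish early would keep generating until the whole batch is done.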


In [12]:
import pandas as pd

# Read the CSV file
file_path = "train.csv"  # Replace with the actual path to your file
data = pd.read_csv(file_path)

# Display the first few rows to understand the content
print("First few rows of the dataset:")
print(data.head())

# Display the columns to see what features are available
print("\nColumns in the dataset:")
print(data.columns)

# Get basic statistics and info about the dataset
print("\nDataset info:")
print(data.info())

print("\nBasic statistics of numerical columns:")
print(data.describe())


First few rows of the dataset:
           id                                      original_text  \
0  lZGdiueMer  `` Well, there are healthier ways to tell me y...   
1  DfTJVFKrUk  Rory ran his shaky fingers through his wife's ...   
2  LmJvKranXK  As I made my way on foot across town to the Po...   
3  PpnqXQAdGH  `` Hello. We come in peace.'' \n \n The first ...   
4  qOeXTfqgAM  `` Karen, what the helllllll izzz...'' says my...   

                                      rewrite_prompt  \
0  Rewrite the story where the writer asks the re...   
1               Rewrite the essay as a dramatic play   
2  Rewrite the story with all the themes and sett...   
3  Rewrite the essay if the advanced aliens didn'...   
4  Rewrite the story as a court room drama starri...   

                                      rewritten_text  
0  Well, there are healthier ways to tell me you ...  
1  ## The Final Curtain\n\n[FADE IN]\n\n**Setting...  
2  As I made my way through the Tatooine desert o...  
3  

Testing Default GPT-2's Ability to Guess the Prompt

In [13]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd

# Load the dataset
file_path = "train.csv"  # Same file as above; the name is case-sensitive on Linux/macOS
data = pd.read_csv(file_path)

# Use the first row of the dataset
original_text = data.loc[0, "original_text"]
rewritten_text = data.loc[0, "rewritten_text"]
actual_prompt = data.loc[0, "rewrite_prompt"]

# Load model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:  # Ensure pad token exists
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# Prepare a refined input prompt
input_text = (
    f"Task: Analyze the relationship between the original and rewritten text.\n"
    f"Original Text: {original_text}\n"
    f"Rewritten Text: {rewritten_text}\n"
    f"Question: What prompt might have caused this transformation?\n"
    f"Rewrite Prompt: "
)

# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)

# Generate the output (the guessed prompt)
output = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=50,  # Limit length of generation
    pad_token_id=tokenizer.pad_token_id,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.9
)

# Decode only the newly generated tokens (robust even if the input was truncated)
new_tokens = output[0][inputs.input_ids.shape[1]:]
generated_prompt = tokenizer.decode(new_tokens, skip_special_tokens=True)
print("Generated Prompt:")
print(generated_prompt.strip())

# Print the actual prompt for comparison
print("\nActual Prompt:")
print(actual_prompt.strip())


Generated Prompt:
Jimmy, I think you're getting the point. 
Jimmy:  It was a little more difficult than I thought, and I've made a few mistakes, but I think I've changed a lot. 
Jimmy:

Actual Prompt:
Rewrite the story where the writer asks the reader to help with their essay and is instead surprised when the reader secretly has made dramatic improvements the essay without the original author knowing .


Data Preparation


In [9]:
import pandas as pd
from datasets import Dataset

# Load the dataset and keep the first 2,100 rows
file_path = "train.csv"  
data = pd.read_csv(file_path).head(2100)

# Format the dataset for fine-tuning
def format_data(row):
    input_text = (
        f"Original Text: {row['original_text']}\n"
        f"Rewritten Text: {row['rewritten_text']}\n"
        f"Rewrite Prompt: "
    )
    return {"input_text": input_text, "output_text": row["rewrite_prompt"]}

formatted_data = data.apply(format_data, axis=1).tolist()
dataset = Dataset.from_list(formatted_data)

# Split into training and validation sets
train_test_split = dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]


Tokenization for Fine-tuning

In [10]:
from transformers import AutoTokenizer

# Load tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token

# Tokenize the dataset
def tokenize_function(examples):
    # Tokenize input text
    model_inputs = tokenizer(
        examples["input_text"],
        truncation=True,
        padding="max_length",
        max_length=512  # Ensure consistent max length
    )
    
    # Tokenize target text (labels)
    labels = tokenizer(
        examples["output_text"],
        truncation=True,
        padding="max_length",
        max_length=512  # Ensure consistent max length
    )["input_ids"]
    
    # Replace padding tokens in labels with -100 (ignored by loss function)
    labels = [
        [(label if label != tokenizer.pad_token_id else -100) for label in label_list]
        for label_list in labels
    ]
    
    model_inputs["labels"] = labels
    return model_inputs


tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)


Map:   0%|          | 0/1680 [00:00<?, ? examples/s]

Map:   0%|          | 0/420 [00:00<?, ? examples/s]
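
The tokenization cell above encodes the target separately from the input. Causal language models in transformers usually expect labels aligned token-for-token with input_ids (the model shifts them internally), with the prompt positions set to -100 so that only the target contributes to the loss. The following is a minimal sketch of that layout on plain lists of token IDs; build_causal_lm_example and its parameters are hypothetical names, and this is an alternative to compare against, not the method used in this notebook.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_causal_lm_example(
    prompt_ids: List[int],
    target_ids: List[int],
    eos_id: int,
    pad_id: int,
    max_length: int,
) -> Dict[str, List[int]]:
    """Concatenate prompt and target, masking prompt and padding in the labels."""
    target = target_ids + [eos_id]
    if len(target) >= max_length:
        raise ValueError("target does not fit in max_length")
    # Keep the tail of the prompt so the final "Rewrite Prompt:" cue survives truncation.
    room = max_length - len(target)
    prompt = prompt_ids[-room:]
    input_ids = prompt + target
    labels = [IGNORE_INDEX] * len(prompt) + target
    pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


example = build_causal_lm_example(
    prompt_ids=[11, 12, 13, 14], target_ids=[21, 22], eos_id=99, pad_id=99, max_length=8
)
print(example["input_ids"])       # [11, 12, 13, 14, 21, 22, 99, 99]
print(example["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0]
print(example["labels"])          # [-100, -100, -100, -100, 21, 22, 99, -100]
```

Masking by position rather than by token value keeps the end-of-sequence token in the labels even when the pad token is set to the EOS token, as it is in the cell above.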

Training Arguments Configuration

In [11]:
from transformers import TrainingArguments

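training_args = TrainingArguments(
    output_dir="./gpt2-finetuned-rewrite",
    eval_strategy="no",             # No evaluation during training (older transformers releases name this evaluation_strategy)
    logging_dir="./logs",
    logging_steps=10,               # Log the training loss every 10 steps
    save_steps=50,                  # Write a checkpoint every 50 steps
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    save_total_limit=1,             # Keep only the most recent checkpoint
    load_best_model_at_end=False,
    fp16=True,                      # Mixed-precision training; requires a CUDA GPU
    report_to="none"                # Disable external experiment trackers
)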


Data Collator for Sequence-to-Sequence Training

In [12]:
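from transformers import AutoModelForCausalLM, DataCollatorForSeq2Seq
from transformers import Trainer

# Load the base model to fine-tune (model_name is defined in the tokenization cell)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Define a data collator that pads inputs and labels to the longest sequence in each batch
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding="longest",
    return_tensors="pt"
)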

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=data_collator  
)

Model Training

In [13]:
trainer.train()


Step,Training Loss
10,5.3036
20,5.5329
30,5.8366
40,5.5494
50,5.6003
60,5.5942
70,5.5821
80,5.4725
90,5.0691
100,5.4895


TrainOutput(global_step=2520, training_loss=5.089531503404889, metrics={'train_runtime': 398.4998, 'train_samples_per_second': 12.647, 'train_steps_per_second': 6.324, 'total_flos': 1316911841280000.0, 'train_loss': 5.089531503404889, 'epoch': 3.0})
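
The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is set in the training arguments above) and converts the mean training loss into a perplexity, which comes out at about 162.

```python
import math

train_examples = 1680                 # size of the training split (from the Map output)
batch_size = 2                        # per_device_train_batch_size
epochs = 3                            # num_train_epochs
mean_train_loss = 5.089531503404889   # training_loss reported by trainer.train()

# Optimizer steps: one per batch, assuming one device and no gradient accumulation
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Perplexity is the exponential of the mean cross-entropy loss (in nats)
perplexity = math.exp(mean_train_loss)

print(total_steps)  # 2520
print(round(perplexity, 1))
```

The step count matches global_step=2520, and a loss that stays near 5 across all logged steps suggests the model is still far from fitting the target tokens.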

In [14]:
model.save_pretrained("./2000-gpt2-finetuned-rewrite")
tokenizer.save_pretrained("./2000-gpt2-finetuned-rewrite")


('./2000-gpt2-finetuned-rewrite\\tokenizer_config.json',
 './2000-gpt2-finetuned-rewrite\\special_tokens_map.json',
 './2000-gpt2-finetuned-rewrite\\vocab.json',
 './2000-gpt2-finetuned-rewrite\\merges.txt',
 './2000-gpt2-finetuned-rewrite\\added_tokens.json',
 './2000-gpt2-finetuned-rewrite\\tokenizer.json')

In [15]:

from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
import torch
import pandas as pd


class SentenceEndStoppingCriteria(StoppingCriteria):
    """Stops generation when a sentence-ending token is generated."""
    def __init__(self, tokenizer, sentence_end_ids):
        self.tokenizer = tokenizer
        self.sentence_end_ids = sentence_end_ids

    def __call__(self, input_ids, scores, **kwargs):
        # Get the last token generated
        last_token_id = input_ids[0, -1].item()
        # If the last token is a sentence-ending token, stop generation
        if last_token_id in self.sentence_end_ids:
            return True
        return False


In [16]:
# Load the fine-tuned model and tokenizer
# The path must point to a saved checkpoint; the save cell above wrote to "./2000-gpt2-finetuned-rewrite"
model_path = "./v2-gpt2-finetuned-rewrite"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run inference on the CPU
device = torch.device("cpu")
model.to(device)

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))


In [17]:
def generate_rewrite_prompt(original_text, rewritten_text, few_shot_examples=None):
    """
    Generate a rewrite prompt based on the provided original and rewritten text.
    
    :param original_text: The original input text.
    :param rewritten_text: The rewritten version of the text.
    :param few_shot_examples: Optional list of dicts containing 'original', 'rewritten', and 'prompt'.
    :return: Generated rewrite prompt.
    """
    # Prepare few-shot examples if provided
    few_shot_prompt = ""
    if few_shot_examples:
        for example in few_shot_examples:
            few_shot_prompt += (
                f"Original Text: {example['original']}\n"
                f"Rewritten Text: {example['rewritten']}\n"
                f"Rewrite Prompt: {example['prompt']}\n\n"
            )

    # Combine few-shot examples with the test case
    test_input = (
        few_shot_prompt +
        f"Original Text: {original_text}\n"
        f"Rewritten Text: {rewritten_text}\n"
        f"Rewrite Prompt: Rewrite"
    )

    # Tokenize input
    inputs = tokenizer(test_input, return_tensors="pt").to(device)

    # Define sentence-ending token IDs
    sentence_end_tokens = ['.', '!', '?']
    sentence_end_ids = tokenizer.convert_tokens_to_ids(sentence_end_tokens)

    # Initialize stopping criteria
    stopping_criteria = StoppingCriteriaList([
        SentenceEndStoppingCriteria(tokenizer, sentence_end_ids)
    ])

    # Generate output with stopping criteria
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,  # Pass the mask explicitly
        max_new_tokens=50,          # Generate up to 50 new tokens
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        do_sample=True,             # Enable sampling for diversity
        temperature=0.7,            # Control randomness
        top_k=30,                   # Limit the next token to the top 30
        top_p=0.8,                  # Use nucleus sampling
        repetition_penalty=1.5,     # Penalize repetition
        stopping_criteria=stopping_criteria
    )

    # Decode only the newly generated tokens, leaving out the input prompt
    new_tokens = outputs[0][inputs.input_ids.shape[1]:]
    generated_prompt = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    
    # Ensure the prompt starts with "Rewrite"
    if not generated_prompt.lower().startswith("rewrite"):
        generated_prompt = "Rewrite " + generated_prompt

    # Ensure the prompt ends with a sentence-ending token
    if not generated_prompt.endswith(('.', '!', '?')):
        generated_prompt += "."

    return generated_prompt


In [18]:
# Define few-shot examples
few_shot_examples = [
    {
        "original": "The cat sat on the mat.",
        "rewritten": "A lazy cat lounged on the soft mat.",
        "prompt": "Rewrite the sentence with a more descriptive tone."
    },
    {
        "original": "The child ran to the park.",
        "rewritten": "The excited child dashed toward the lively park.",
        "prompt": "Rewrite the sentence to make it more vivid."
    }
]

# Define the test cases with expected prompts
test_cases = [
    {
        "original": "The dog chased the ball.",
        "rewritten": "The playful dog ran after the ball with enthusiasm.",
        "expected_prompt": "Rewrite the sentence to make it more descriptive and lively."
    },
    {
        "original": "The sun set over the mountains, painting the sky orange.",
        "rewritten": "As the sun dipped behind the peaks, vibrant hues of orange and red lit up the sky.",
        "expected_prompt": "Rewrite the sentence to emphasize the beauty of the sunset."
    }
]

# Define a single simple test case
simple_test_case = {
    "original": "The dog chased the ball.",
    "rewritten": "The playful dog ran after the ball with enthusiasm.",
    "expected_prompt": "Rewrite the sentence to make it more descriptive and lively."
}

# Generate and print the prompt for the simple test case
generated_prompt = generate_rewrite_prompt(
    simple_test_case['original'],
    simple_test_case['rewritten'],
    few_shot_examples=few_shot_examples
)
print("Generated Rewrite Prompt:")
print(generated_prompt)
print("Expected Prompt:")
print(simple_test_case['expected_prompt'])
print(f"Match: {'✅' if generated_prompt.lower() == simple_test_case['expected_prompt'].lower() else '❌'}")
print("-" * 50)


Generated Rewrite Prompt:
Rewrite in of, story as essay and , is action novel or game you are that about setting your .
Expected Prompt:
Rewrite the sentence to make it more descriptive and lively.
Match: ❌
--------------------------------------------------


In [19]:
def generate_rewrite_prompt_adjusted(original_text, rewritten_text, few_shot_examples=None):
    """
    Generate a rewrite prompt with adjusted generation settings for better coherence.
    """
    # Prepare few-shot examples if provided
    few_shot_prompt = ""
    if few_shot_examples:
        for example in few_shot_examples:
            few_shot_prompt += (
                f"Original Text: {example['original']}\n"
                f"Rewritten Text: {example['rewritten']}\n"
                f"Rewrite Prompt: {example['prompt']}\n\n"
            )

    # Combine few-shot examples with the test case, ensuring it starts with "Rewrite"
    test_input = (
        few_shot_prompt +
        f"Original Text: {original_text}\n"
        f"Rewritten Text: {rewritten_text}\n"
        f"Rewrite Prompt: Rewrite"
    )

    # Tokenize input
    inputs = tokenizer(test_input, return_tensors="pt").to(device)

    # Define sentence-ending token IDs
    sentence_end_tokens = ['.', '!', '?']
    sentence_end_ids = tokenizer.convert_tokens_to_ids(sentence_end_tokens)

    # Initialize stopping criteria
    stopping_criteria = StoppingCriteriaList([
        SentenceEndStoppingCriteria(tokenizer, sentence_end_ids)
    ])

    # Generate output with adjusted settings
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,  # Pass the mask explicitly
        max_new_tokens=50,          # Generate up to 50 new tokens
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        do_sample=True,             # Enable sampling for diversity
        temperature=0.5,            # Lower randomness for more focused output
        top_k=30,                   # Restrict to top 30 tokens
        top_p=0.85,                 # Nucleus sampling (slightly wider than the 0.8 used before)
        repetition_penalty=2.0,     # Heavier penalty to prevent repetition
        stopping_criteria=stopping_criteria
    )

    # Decode only the newly generated tokens, leaving out the input prompt
    new_tokens = outputs[0][inputs.input_ids.shape[1]:]
    generated_prompt = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Ensure the prompt starts with "Rewrite"
    if not generated_prompt.lower().startswith("rewrite"):
        generated_prompt = "Rewrite " + generated_prompt

    # Ensure the prompt ends with a sentence-ending token
    if not generated_prompt.endswith(('.', '!', '?')):
        generated_prompt += "."

    return generated_prompt


In [20]:
# Generate and print prompts for each test case with adjusted settings
for i, case in enumerate(test_cases, 1):
    generated_prompt = generate_rewrite_prompt_adjusted(
        case['original'],
        case['rewritten'],
        few_shot_examples=few_shot_examples
    )
    print(f"Test Case {i} - Generated Prompt with Adjusted Settings:")
    print(generated_prompt)
    print("Expected Prompt:")
    print(case['expected_prompt'])
    print(f"Match: {'✅' if generated_prompt.lower() == case['expected_prompt'].lower() else '❌'}")
    print("-" * 50)


Test Case 1 - Generated Prompt with Adjusted Settings:
Rewrite and your characters is of in , that . are for story making, world as essay writing novel's yous humor about science character by but other time future action or setting where life at based an from instead be game adventure comedy writer can real events reality.
Expected Prompt:
Rewrite the sentence to make it more descriptive and lively.
Match: ❌
--------------------------------------------------
Test Case 2 - Generated Prompt with Adjusted Settings:
Rewrite , story novel that you are is for world- characters in life essay about adventure by game as all future reality's actions based from humor movie or where has not real but love if .
Expected Prompt:
Rewrite the sentence to emphasize the beauty of the sunset.
Match: ❌
--------------------------------------------------
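
The Match check above requires the generated prompt to equal the expected one exactly, so a near miss and pure noise both show ❌. A softer score can help tell them apart. The following is a minimal sketch using Jaccard overlap of word sets; the function names are hypothetical, and this is only one simple choice among many possible similarity measures.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens (letters, digits, apostrophes)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(generated: str, expected: str) -> float:
    """Jaccard overlap of word sets: |A & B| / |A | B|, or 0.0 if both are empty."""
    a, b = tokenize(generated), tokenize(expected)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


expected = "Rewrite the sentence to make it more descriptive and lively."
close = "Rewrite the sentence to make it more vivid and lively."
print(jaccard_similarity(expected, expected))         # 1.0
print(round(jaccard_similarity(close, expected), 2))  # 0.82
```

Because the score ignores word order, a shuffled bag of the right words would still score well, so it is better read alongside the generated text than on its own.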


Conclusion:
1. Insufficient Generalization
The model failed to generalize the task of inferring rewrite_prompt.

Limited training data: With only 1,680 training examples (2,100 rows minus the 20% validation split), the model does not see enough variety to learn meaningful relationships.
Complex task: Inferring a transformation prompt from original_text and rewritten_text may require a more nuanced understanding than GPT-2 can reach with minimal fine-tuning.

2. Overfitting
The model may have overfit to the small training dataset, memorizing surface patterns without learning the task. Symptoms of overfitting include:

Repeated or nonsensical outputs.
Poor performance on unseen data. (Poor performance on the training examples themselves would point to underfitting rather than overfitting.)

3. Model Limitations
GPT-2 is a general-purpose model, and the base checkpoint has relatively few parameters (about 124M, originally reported as 117M), which limits its ability to capture complex relationships or tasks.


What to Do Next

Alternatives to GPT-2
a. Larger Transformer Models

Meta’s LLaMA-2:

Why: More parameters and pretraining on larger, more diverse data.
How to use: The weights are openly available and fine-tunable, making it a strong candidate for this task.
Example: Fine-tune LLaMA-2 using the Hugging Face transformers library.

OpenAI’s GPT-3/4:

Why: These models are far more capable and often need no fine-tuning.
How to use: Prompt GPT-3 with the task in few-shot mode.
Example: Pass a few examples from the dataset to GPT-3 and ask it to generate the rewrite prompt.

Google’s T5 (Text-to-Text Transfer Transformer):

Why: Frames every task as text-to-text, which fits rewrite prompt inference.
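
To make the text-to-text framing concrete, each dataset row can be mapped to a source string and a target string. The following is a minimal sketch; to_text_to_text and the "infer rewrite prompt: " task prefix are hypothetical choices made for illustration, not an established T5 convention for this task.

```python
from typing import Dict


def to_text_to_text(row: Dict[str, str], task_prefix: str = "infer rewrite prompt: ") -> Dict[str, str]:
    """Map one dataset row to a (source, target) string pair for a text-to-text model."""
    source = (
        f"{task_prefix}"
        f"original: {row['original_text']} "
        f"rewritten: {row['rewritten_text']}"
    )
    return {"source": source, "target": row["rewrite_prompt"]}


row = {
    "original_text": "The cat sat on the mat.",
    "rewritten_text": "A lazy cat lounged on the soft mat.",
    "rewrite_prompt": "Rewrite the sentence with a more descriptive tone.",
}
pair = to_text_to_text(row)
print(pair["source"])
print(pair["target"])
```

With an encoder-decoder model, the source goes to the encoder and the target becomes the decoder labels, so the input and the expected output never share one token sequence.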


             