<a href="https://colab.research.google.com/github/Susrith45/Genie-Gan/blob/main/GAN_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"  # prevents Weights & Biases login prompts (deprecated in favor of TrainingArguments(report_to="none"))


In [None]:
import torch
import pandas as pd
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset


In [None]:
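# Upload your CSV file to Colab first (my_dataset.csv); it must contain a "text" column
df = pd.read_csv("my_dataset.csv")
# Drop missing rows and force strings so the tokenizer never receives NaN
texts = df["text"].dropna().astype(str).tolist()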

# Preview first 5 sentences
print(texts[:5])

# Convert to Hugging Face Dataset
my_dataset = Dataset.from_dict({"text": texts})


In [None]:
# Load GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

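# GPT-2 has no pad token by default, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# Load GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")
# No tokens were added above, so this leaves the embedding size unchanged;
# it only matters if you later add new tokens to the tokenizer
model.resize_token_embeddings(len(tokenizer))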


In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = my_dataset.map(tokenize_function, batched=True)
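
To make the effect of `truncation=True, padding="max_length"` concrete, here is a minimal pure-Python sketch of what happens to one sequence of token IDs. The helper name `pad_or_truncate` and the toy IDs are made up for illustration; it assumes right-side padding and uses 50256, GPT-2's end-of-sequence ID, as the pad ID because the notebook reuses the EOS token for padding. It is not the tokenizer's actual implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]              # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)             # how many pad tokens are needed
    input_ids = kept + [pad_id] * n_pad        # right-side padding (assumed)
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Toy example: arbitrary IDs, a short max_length for readability
ids, mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=50256)
print(ids)   # [11, 22, 33, 50256, 50256]
print(mask)  # [1, 1, 1, 0, 0]
```

With `max_length=128`, every example becomes a 128-token row regardless of its original length, so short reviews are mostly padding.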


In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # GPT-2 is not masked LM
)
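
With `mlm=False`, the collator builds the training labels by copying `input_ids` and replacing every pad-token position with -100, the value the loss ignores. The sketch below mimics that step in plain Python; `make_causal_lm_labels` and the toy IDs are hypothetical names and values for illustration only, not the library's code.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def make_causal_lm_labels(batch_input_ids, pad_id):
    """Copy input_ids into labels, masking every position equal to pad_id."""
    return [
        [IGNORE_INDEX if tok == pad_id else tok for tok in row]
        for row in batch_input_ids
    ]


# Toy batch: arbitrary IDs followed by padding (50256 is GPT-2's EOS ID)
batch = [[11, 22, 33, 50256, 50256]]
labels = make_causal_lm_labels(batch, pad_id=50256)
print(labels)  # [[11, 22, 33, -100, -100]]
```

Because the pad token here is the EOS token, any genuine end-of-sequence token in the text would be masked the same way, so the model gets no training signal for where a review should stop.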


In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=2,           # start with 1-2 epochs
    per_device_train_batch_size=2,
    save_steps=500,
    logging_steps=100,
    save_total_limit=1,
    report_to="none"              # supported way to turn off W&B and other logging integrations
)


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)


In [None]:
trainer.train()




TrainOutput(global_step=56, training_loss=2.843153817313058, metrics={'train_runtime': 296.3021, 'train_samples_per_second': 0.371, 'train_steps_per_second': 0.189, 'total_flos': 7185530880000.0, 'train_loss': 2.843153817313058, 'epoch': 2.0})
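
The `global_step=56` above can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (the defaults), and it assumes a dataset of 55 rows: the real size of `my_dataset.csv` is not shown, but 0.371 samples/s over about 296 s is roughly 110 samples, or about 55 per epoch. `total_optimizer_steps` is a hypothetical helper, not a Trainer API.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Steps taken with one device and no gradient accumulation (assumed)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs


# 55 rows is an assumption inferred from the throughput figures, not a known value
steps = total_optimizer_steps(num_examples=55, batch_size=2, epochs=2)
print(steps)        # 56
print(steps < 100)  # True: the run ends before the first logging interval
print(steps < 500)  # True: the run ends before the first save interval
```

Since 56 is below both `logging_steps=100` and `save_steps=500`, neither interval is reached during the run. For a dataset this small, a lower value such as `logging_steps=10` would show the loss as training progresses.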

In [None]:
input_text = "The acting in this movie"
# Tokenize the prompt and move it to the model's device
# (after training, the Trainer may have left the model on a GPU)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

model.eval()  # switch off dropout before generating
output = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=60,           # total length in tokens, prompt included
    num_return_sequences=3,  # generate 3 variations
    do_sample=True,
    temperature=0.9,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id
)

for i, sequence in enumerate(output):
    print(f"Generated Review {i+1}:")
    print(tokenizer.decode(sequence, skip_special_tokens=True))
    print()


Generated Review 1:
The acting in this movie was atrocious. I was sick. All characters were weak. Honestly, at times, the performances were disappointing. It was a total disappointment. I liked the movie. I thought the plot was over the top. It was disappointing. The plot didn't work. The acting

Generated Review 2:
The acting in this movie was horrible. I was hoping for a better movie.

4. Overall felt good. This film felt like the movie before. The performance was poor. I enjoyed the ending.

7. Storyline was bad. The movie was bad. Was it entertaining?

Generated Review 3:
The acting in this movie was terrible. There were a lot of jokes and pacing. It was hard to watch after watching the movie. The plot was confusing and over-schedule. The screenplay was poorly executed. The directing was poor. I liked this one.

Overall: ★ ★ ★
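
The three outputs differ because `do_sample=True` draws each next token at random instead of always taking the most likely one. As a rough illustration of what `temperature=0.9` and `top_k=50` do at a single step, the sketch below samples from a toy list of logits using only the standard library. `sample_top_k` and the logits are invented for this example; the real `generate` works on tensors over the full vocabulary.

```python
import math
import random


def sample_top_k(logits, k, temperature, rng):
    """Pick one index: keep the k largest logits, rescale, softmax, sample."""
    # Indices of the k highest-scoring candidates
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights (subtract the max for numerical stability)
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
picks = [sample_top_k(toy_logits, k=2, temperature=0.9, rng=rng) for _ in range(1000)]

# With k=2, only the two highest-scoring indices (0 and 1) can ever be drawn
print(set(picks) <= {0, 1})  # True
```

Lowering `temperature` or `top_k` makes the text more predictable; raising them gives more varied output at a higher risk of incoherent results like the numbered fragments in Review 2.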

