<a href="https://colab.research.google.com/github/Indukurivigneshvarma/Deep_Learning/blob/main/NLP/GPT_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers datasets --quiet

from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset
import torch

In [18]:
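# Load the first 1% of the IMDB training split to keep the demo fast.
dataset = load_dataset("imdb", split="train[:1%]")
# Hold out 10% for evaluation; a fixed seed makes the split reproducible.
dataset = dataset.train_test_split(test_size=0.1, seed=42)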

!jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True \
    --ClearMetadataPreprocessor.enabled=True \
    --output GPT_1_clean_final.ipynb GPT_1_clean.ipynb


[NbConvertApp] Converting notebook GPT_1_clean.ipynb to notebook
[NbConvertApp] Writing 4295 bytes to GPT_1_clean_final.ipynb


In [19]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
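
GPT-2 ships without a padding token, which is why the cell above reuses the end-of-text token (id 50256 in the `gpt2` vocabulary). As a rough illustration of what fixed-length padding with truncation then produces, here is a pure-Python sketch. The helper name `pad_or_truncate` and the sample token ids are hypothetical and only stand in for the tokenizer's real behavior.

```python
EOS_ID = 50256  # GPT-2's end-of-text id, reused here as the pad id


def pad_or_truncate(ids, max_length, pad_id=EOS_ID):
    """Hypothetical helper: truncate or right-pad a list of token ids."""
    ids = ids[:max_length]                 # truncation
    n_pad = max_length - len(ids)          # how many pad slots are needed
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


# Arbitrary ids for illustration only.
ids, mask = pad_or_truncate([1169, 3807, 373], max_length=5)
print(ids)
print(mask)
```

Since padding and a genuine end-of-text token share one id, the attention mask is the only reliable way to tell them apart.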


In [27]:
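def tokenize_function(examples):
    # Tokenize a batch of reviews, truncating or right-padding each to 128 tokens.
    outputs = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )
    # For causal LM training the labels are the input ids. Padded positions are
    # set to -100 so the loss ignores them (pad and EOS share one id, so the
    # attention mask is used to find the padding).
    outputs["labels"] = [
        [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
        for ids, mask in zip(outputs["input_ids"], outputs["attention_mask"])
    ]
    return outputs

# Drop the raw text and the sentiment label; neither is used for language modeling.
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text", "label"])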
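
Padded positions should not contribute to the language-modeling loss. PyTorch's cross-entropy loss skips any target equal to -100 by default, and the Hugging Face causal LM heads rely on that, so labels are usually masked wherever `attention_mask` is 0. A minimal sketch on toy ids follows; the helper name `mask_padding` is hypothetical.

```python
IGNORE_INDEX = -100  # target value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, attention_mask):
    """Hypothetical helper: copy input ids to labels, masking padded positions."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Three real tokens followed by two pad slots (arbitrary ids for illustration).
labels = mask_padding([1169, 3807, 373, 50256, 50256], [1, 1, 1, 0, 0])
print(labels)
```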


In [21]:
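model = GPT2LMHeadModel.from_pretrained("gpt2")
# No new tokens were added (the pad token reuses EOS), so this call leaves the
# embedding matrix at its original size; it only matters if the vocabulary grows.
model.resize_token_embeddings(len(tokenizer))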


In [22]:
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned-imdb",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./logs",
    logging_steps=10,
    report_to=[]
)

In [23]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"]
)
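
To get a feel for how short this run is, the number of optimizer steps can be estimated from the settings above. The arithmetic below is only a sketch: it assumes the IMDB train split has 25,000 rows, a single device, and no gradient accumulation, none of which the notebook states explicitly.

```python
import math

TRAIN_ROWS = 25_000                   # assumed size of the IMDB train split
subset = TRAIN_ROWS // 100            # split="train[:1%]"
n_test = math.ceil(subset / 10)       # test_size=0.1
n_train = subset - n_test

BATCH_SIZE = 2                        # per_device_train_batch_size, one device assumed
steps_per_epoch = math.ceil(n_train / BATCH_SIZE)
log_events = steps_per_epoch // 10    # logging_steps=10

print(n_train, n_test, steps_per_epoch, log_events)
```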

In [24]:
trainer.train()


Epoch   Training Loss   Validation Loss
1       3.5043          3.71573
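
A cross-entropy loss is easier to interpret as perplexity, its exponential. The snippet below converts the validation loss reported above, assuming it is a mean per-token loss in nats (the usual convention for this model class).

```python
import math

val_loss = 3.71573                # validation loss from the table above
perplexity = math.exp(val_loss)   # perplexity = exp(mean per-token cross-entropy)
print(f"validation perplexity: {perplexity:.1f}")
```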



In [25]:
prompt = "the movie was great with"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

model.eval()  # make sure dropout is disabled while generating
outputs = model.generate(
    **inputs,
    max_length=80,            # total length in tokens, prompt included
    num_return_sequences=3,
    do_sample=True,           # sample instead of decoding greedily
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # set explicitly to avoid the pad-token warning
)

for i, output in enumerate(outputs):
    print(f"\n=== Generated {i+1} ===\n")
    print(tokenizer.decode(output, skip_special_tokens=True))



=== Generated 1 ===

the movie was great with the acting, it did not seem like it was about as good as the rest of the films that it was. The only thing I was impressed by was the script. I knew everything about this movie, and I thought it was one of the best movies ever made. I also liked that there was something in this movie that I had not seen in quite a long time...

=== Generated 2 ===

the movie was great with very much love for the original. I was only able to find a couple of spots for the original film when we all had a great time. I would recommend this movie to everyone interested in seeing this film.

=== Generated 3 ===

the movie was great with no plot. The plot is quite convoluted and not even the script could have done a better job of giving the movie a story.
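
The three samples differ because `do_sample=True` draws each next token from a filtered distribution. The sketch below shows roughly how temperature, top-k and top-p (nucleus) filtering combine for a single step. It is a simplified pure-Python illustration, not the library's implementation: the function name `filter_top_k_top_p` and the toy logits are hypothetical, and tie-breaking and edge cases may differ from `transformers`.

```python
import math
import random


def filter_top_k_top_p(logits, temperature=0.9, top_k=50, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring token indices, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the surviving tokens (shifted by the max for numerical stability).
    peak = scaled[kept[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}


# Toy five-token vocabulary with small settings so the filtering is visible.
dist = filter_top_k_top_p([2.0, 1.0, 0.5, -1.0, -3.0], temperature=1.0, top_k=3, top_p=0.8)
next_token = random.choices(list(dist), weights=list(dist.values()))[0]
print(dist, next_token)
```

Lower `temperature`, `top_k`, or `top_p` values shrink the candidate set and make the output more repetitive; higher values make it more varied and less coherent.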


In [17]:
import nbformat

in_file = "GPT_1.ipynb"
out_file = "GPT_1_clean.ipynb"

with open(in_file, "r", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

if "widgets" in nb["metadata"]:
    del nb["metadata"]["widgets"]

for cell in nb["cells"]:
    if "metadata" in cell and "widgets" in cell["metadata"]:
        del cell["metadata"]["widgets"]

with open(out_file, "w", encoding="utf-8") as f:
    nbformat.write(nb, f)

print(f"✅ Cleaned and saved as: {out_file}")

✅ Cleaned and saved as: GPT_1_clean.ipynb
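
Because a notebook file is plain JSON, the same clean-up can be sketched with only the standard library. The example below works on a small in-memory stand-in for a notebook; the function name `strip_widget_metadata` and the toy structure are hypothetical, and unlike `nbformat` this does no schema validation.

```python
import json


def strip_widget_metadata(nb):
    """Remove the 'widgets' key from notebook-level and cell-level metadata, in place."""
    nb.get("metadata", {}).pop("widgets", None)
    for cell in nb.get("cells", []):
        cell.get("metadata", {}).pop("widgets", None)
    return nb


# Toy notebook dict for illustration; a real one would come from json.load(f).
toy = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {"widgets": {"state": {}}, "kernelspec": {"name": "python3"}},
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"widgets": {}},
            "source": "x = 1",
            "outputs": [],
            "execution_count": None,
        }
    ],
}

cleaned = strip_widget_metadata(toy)
print(json.dumps(cleaned["metadata"]))
```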
