<a href="https://colab.research.google.com/github/AyeshaHuda21/CODECRAFT_GA_01/blob/main/codecraft_GA_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 01 – Text Generation with GPT-2  
**Internship: Generative AI (Code Craft)**  

In this notebook, I explore **text generation** using GPT-2, a transformer model developed by OpenAI.  

The task has three main goals:  
1. Run the **pre-trained GPT-2 model** to generate text.  
2. **Fine-tune GPT-2** on a custom dataset that I created.  
3. Experiment with **different generation techniques** (temperature, top-k, top-p) to control creativity.  

This exercise helped me understand the basics of **Generative AI in NLP** and how models can be adapted to new writing styles.  


In [None]:
!pip install transformers datasets torch --quiet

from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset
import torch

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Generative AI will change the future because"
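# Calling the tokenizer returns both input_ids and attention_mask
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)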

print("Prompt:", prompt)
print("Generated Text:\n", tokenizer.decode(outputs[0], skip_special_tokens=True))


## **3. Custom dataset creation**
Here we create a small custom dataset.

In [None]:
custom_text = """
Artificial Intelligence is not just about automation, it is about augmentation.
Generative AI gives machines the power to imagine, just like humans.
Every breakthrough in AI begins with curiosity and persistence.
The future belongs to those who combine creativity with technology.
Code Craft interns are building the next wave of AI innovation.
Learning by doing is the best way to master Generative AI.
AI should not replace humans, but help humans achieve more.
"""

with open("custom_dataset.txt", "w") as f:
    f.write(custom_text)

dataset = load_dataset("text", data_files={"train": "custom_dataset.txt"})
dataset["train"][0]
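
Because the triple-quoted string begins with a newline, the first line of `custom_dataset.txt` is blank, so `dataset["train"][0]` may come back as an empty string. A minimal sketch, assuming one sentence per line is intended, that strips blank lines before the file is written. `clean_lines` is a hypothetical helper, not part of the original notebook, and the text is shortened to two of the seven sentences for brevity:

```python
# Two of the seven sentences from the notebook, kept short for illustration
custom_text = """
Artificial Intelligence is not just about automation, it is about augmentation.
Generative AI gives machines the power to imagine, just like humans.
"""

def clean_lines(text):
    """Return the non-blank lines of `text`, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]

lines = clean_lines(custom_text)
print(len(lines))  # 2
print(lines[0])

# Joining the cleaned lines gives file content with no leading blank line
cleaned_text = "\n".join(lines) + "\n"
```

Writing `cleaned_text` instead of `custom_text` to the file would keep blank rows out of the training set.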


## **4. Tokenize dataset**

In [None]:
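# GPT-2 has no pad token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Cap the length explicitly (64 is an assumed limit for these short sentences);
    # otherwise padding="max_length" pads every row to GPT-2's 1024-token limit
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=64)

tokenized_datasets = dataset.map(tokenize_function, batched=True)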
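
When `padding="max_length"` is used without an explicit `max_length`, the tokenizer pads to the model's maximum input length, which is 1024 tokens for GPT-2, so a short sentence becomes a sequence that is almost entirely padding. The toy function below sketches what right-padding with an attention mask looks like. `pad_to_length` and the token IDs are hypothetical illustrations, not the tokenizer's internals; 50256 is GPT-2's end-of-text ID, used as the pad ID because the notebook sets `pad_token = eos_token`.

```python
PAD_ID = 50256  # GPT-2 end-of-text ID, reused here as the pad ID

def pad_to_length(token_ids, max_length, pad_id=PAD_ID):
    """Truncate or right-pad `token_ids` to `max_length`; return (ids, attention_mask)."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

ids, mask = pad_to_length([101, 102, 103], max_length=8)  # made-up token IDs
print(ids)
print(mask)  # [1, 1, 1, 0, 0, 0, 0, 0]

# Share of padding if a 3-token row were padded to 1024 tokens
print(f"padding fraction at 1024: {1 - 3 / 1024:.3f}")
```

A smaller cap keeps the tensors compact and makes each training step cheaper.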

## **5. Fine-tune GPT-2**

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

from transformers import DataCollatorForLanguageModeling, GPT2Tokenizer, GPT2LMHeadModel, TrainingArguments, Trainer
from datasets import load_dataset

# Load and tokenize the dataset
custom_text = """
Artificial Intelligence is not just about automation, it is about augmentation.
Generative AI gives machines the power to imagine, just like humans.
Every breakthrough in AI begins with curiosity and persistence.
The future belongs to those who combine creativity with technology.
Code Craft interns are building the next wave of AI innovation.
Learning by doing is the best way to master Generative AI.
AI should not replace humans, but help humans achieve more.
"""

with open("custom_dataset.txt", "w") as f:
    f.write(custom_text)

dataset = load_dataset("text", data_files={"train": "custom_dataset.txt"})

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # Set pad token

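def tokenize_function(examples):
    # Cap the length explicitly (64 is an assumed limit for these short sentences);
    # otherwise padding="max_length" pads every row to GPT-2's 1024-token limit
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=64)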

tokenized_datasets = dataset.map(tokenize_function, batched=True)

model = GPT2LMHeadModel.from_pretrained("gpt2")
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=200,
    logging_dir='./logs',
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator,
)

trainer.train()
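
For a dataset this small, the step-based settings above matter less than they appear. A back-of-the-envelope sketch, assuming seven training examples (one per sentence), a single device, and no gradient accumulation. `total_optimizer_steps` is a hypothetical helper, not a `Trainer` API:

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Approximate number of optimizer steps for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

steps = total_optimizer_steps(num_examples=7, batch_size=2, epochs=3)
print(steps)         # 12
print(steps >= 50)   # False: logging_steps=50 is never reached mid-run
print(steps >= 200)  # False: save_steps=200 never triggers a checkpoint
```

With roughly a dozen steps in total, a much smaller `logging_steps` value would be needed to see the loss change during training.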

## **6. Generate text after fine-tuning**


In [6]:
prompt = "In the future, Generative AI"
inputs = tokenizer.encode(prompt, return_tensors="pt")
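# Without do_sample=True, generate() decodes greedily and ignores temperature (see the warning in the output)
outputs = model.generate(inputs, max_length=60, num_return_sequences=1, temperature=0.7)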

print("Prompt:", prompt)
print("Generated Text:\n", tokenizer.decode(outputs[0], skip_special_tokens=True))


The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Prompt: In the future, Generative AI
Generated Text:
 In the future, Generative AI is going to be a huge step forward.

The next generation of AI is going to be a huge step forward in human intelligence.

AI is going to be a huge step forward in human intelligence.

We are going to be able to create
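
The warnings in this output show that `temperature` was ignored: `generate` decodes greedily unless sampling is enabled, which likely contributes to the repeated phrases seen here. A hedged sketch of the keyword arguments that switch sampling on. `build_generation_kwargs` is a hypothetical helper, and the commented lines assume the `model`, `tokenizer`, and `prompt` objects from the cells above:

```python
def build_generation_kwargs(max_new_tokens=40, temperature=0.7,
                            top_k=None, top_p=None, pad_token_id=50256):
    """Collect keyword arguments for `model.generate` with sampling enabled."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    kwargs = {
        "do_sample": True,             # without this, temperature/top_k/top_p are ignored
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "pad_token_id": pad_token_id,  # GPT-2 has no pad token; reuse end-of-text (50256)
    }
    if top_k is not None:
        kwargs["top_k"] = top_k
    if top_p is not None:
        kwargs["top_p"] = top_p
    return kwargs

gen_kwargs = build_generation_kwargs(top_p=0.9)
print(gen_kwargs)

# Intended use with the notebook's objects (not executed here):
# inputs = tokenizer(prompt, return_tensors="pt")  # returns input_ids and attention_mask
# outputs = model.generate(**inputs, **gen_kwargs)
```

Passing the full tokenizer output also supplies the `attention_mask` that the warnings ask for.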


## **7. Advanced Generation Experiments**

In [25]:
from transformers import pipeline

prompt = "The role of Generative AI in the future is"

# Create a pipeline for text generation
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Default
output1 = generator(prompt, max_new_tokens=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
print("\nDefault:\n", output1[0]['generated_text'])

# Temperature of about 1.0 (logits left unscaled; values above 1.0 would increase randomness further)
output2 = generator(prompt, max_new_tokens=50, num_return_sequences=1, temperature=1.0 + 1e-8, pad_token_id=tokenizer.eos_token_id)
print("\nHigh Creativity (temp=1.0):\n", output2[0]['generated_text'])

# Top-K sampling
output3 = generator(prompt, max_new_tokens=50, num_return_sequences=1, top_k=50, pad_token_id=tokenizer.eos_token_id)
print("\nTop-K (k=50):\n", output3[0]['generated_text'])

# Top-P sampling
output4 = generator(prompt, max_new_tokens=50, num_return_sequences=1, top_p=0.9, pad_token_id=tokenizer.eos_token_id)
print("\nTop-P (p=0.9):\n", output4[0]['generated_text'])

Device set to use cpu



Default:
 The role of Generative AI in the future is not to automate tomorrow. It is to help us become smarter.

Generative AI is a paradigm shift in science. Today, the goal of AI is to eliminate human error, improve human interaction, and create new forms of personal, creative,

High Creativity (temp=1.0):
 The role of Generative AI in the future is well known. A few years ago, AI was touted as the next advance in artificial intelligence.

But it's not the brightest AI revolution ever thought to move the world. In fact, it only works in ways we already know how:


Top-K (k=50):
 The role of Generative AI in the future is not just to create a better world, but to create a new one. In this chapter, we will explore the possibility of Artificial Intelligence (AI) as a viable innovation that will change the way we think about technology, and how we can help it

Top-P (p=0.9):
 The role of Generative AI in the future is to enable robots to solve many tasks. It will help us to solve the pr

## 🔍 Observations
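Each setting was run once, so these observations are indicative rather than conclusive.
- Default settings produced fluent but generic text.
- Temperature 1.0 gave more varied but less consistent output. Note that 1.0 leaves the model's distribution unscaled; values above 1.0 increase randomness further.
- Top-K sampling (k=50) produced focused text, since only the 50 most likely tokens are considered at each step.
- Top-P sampling (p=0.9) gave the best balance of coherence and creativity in this run.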
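
To make the three knobs concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p reshape a next-token distribution. The four-token vocabulary and its logits are made up for illustration, and the function names are hypothetical; the library's own implementation works on tensors and differs in detail.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for t in (0.7, 1.0, 1.5):
    print(t, [round(q, 3) for q in softmax(logits, t)])
print("top-k:", [round(q, 3) for q in top_k_filter(softmax(logits), k=2)])
print("top-p:", [round(q, 3) for q in top_p_filter(softmax(logits), p=0.9)])
```

Lower temperatures concentrate probability on the top token, while top-k and top-p both zero out the tail before sampling: top-k by a fixed count, top-p by cumulative probability.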

## ✅ Conclusion
This task taught me how to:
1. Use GPT-2 for text generation.  
2. Fine-tune it on a custom dataset.  
3. Control AI creativity with generation parameters.  

It was my first step into **hands-on Generative AI** 🚀.


In [10]:
# Save model + tokenizer
output_dir = "./fine_tuned_gpt2"

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

print("Model saved in", output_dir)


Model saved in ./fine_tuned_gpt2


In [12]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("./fine_tuned_gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("./fine_tuned_gpt2")


In [14]:
import gradio as gr
from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer

# Load your fine-tuned model
model = GPT2LMHeadModel.from_pretrained("./fine_tuned_gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("./fine_tuned_gpt2")

# Create pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

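# Define the text-generation function called by the UI
def generate_text(prompt, max_length=100, temperature=0.7):
    result = generator(
        prompt,
        max_length=int(max_length),      # slider values may arrive as floats
        do_sample=True,                  # needed for temperature to take effect
        temperature=float(temperature),
        pad_token_id=tokenizer.eos_token_id
    )
    return result[0]['generated_text']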

# Build Gradio UI
with gr.Blocks() as demo:
    gr.Markdown("## 🤖 Fine-Tuned GPT-2 Text Generator")

    with gr.Row():
        prompt = gr.Textbox(label="Enter a prompt", placeholder="Type something...")
        max_len = gr.Slider(50, 200, value=100, step=10, label="Max Length")
        temp = gr.Slider(0.1, 1.5, value=0.7, step=0.1, label="Creativity (Temperature)")

    output = gr.Textbox(label="Generated Output")

    btn = gr.Button("Generate")
    btn.click(fn=generate_text, inputs=[prompt, max_len, temp], outputs=output)

# Launch app
demo.launch()


Device set to use cpu


It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://3459ff2ed963bef7cf.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)



