<a href="https://colab.research.google.com/github/Mahecoding/PRODIGY_GA_01/blob/main/PRODIGY_GA_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **This notebook fine-tunes a pre-trained GPT-2 model on a custom text dataset, then uses Gradio to build a web interface where users can generate text from their own prompts.**

In [2]:
!pip install gradio



**Importing Libraries**

In [3]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, DataCollatorForLanguageModeling, TextDataset
import gradio as gr


**Tokenizing and Loading the Dataset**

In [4]:

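def load_dataset(file_path, tokenizer, block_size=128):
    # TextDataset tokenizes the whole file and splits it into blocks of
    # block_size tokens. It is deprecated in transformers in favor of the
    # separate `datasets` library.
    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size,
    )
    return dataset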
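
As a rough mental model of what `block_size` does, the sketch below splits a list of token ids into consecutive fixed-size blocks. It is a simplified illustration, not the library's code: `chunk_token_ids` is a hypothetical helper, and it assumes that the tokenizer adds no special tokens and that a trailing partial block is dropped.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split token ids into consecutive, non-overlapping blocks.

    Assumption: a final block shorter than block_size is discarded,
    so every training example has exactly block_size tokens.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Ten fake token ids, blocks of four: the last two ids are dropped.
print(chunk_token_ids(list(range(10)), block_size=4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because every block has the same length, no padding is needed when the blocks are batched.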

**Defining the Training Process**

In [5]:

def train(dataset, model, tokenizer):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False  # Masked Language Modeling is not used for GPT-2
    )

    training_args = TrainingArguments(
        output_dir='./results',  # Directory where the model checkpoints will be saved
        overwrite_output_dir=True,  # Overwrite the content of the output directory if it exists
        num_train_epochs=3,  # Number of training epochs (iterations over the dataset)
        per_device_train_batch_size=4,  # Number of samples per batch per device (e.g., per GPU)
        save_steps=10_000,  # Save the model every 10,000 steps
        save_total_limit=2,  # Limit the total number of saved checkpoints to 2
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
    )
    trainer.train()

    # Save the final model to output_dir ('./results')
    trainer.save_model()
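
With `mlm=False`, the collator prepares causal language modeling batches: the labels are a copy of the input ids, padding positions are set to -100 so the loss ignores them, and the model itself shifts the labels by one position so that each token is predicted from the tokens before it. The sketch below illustrates that idea in plain Python. The helper names are hypothetical and the code is a simplification, not the collator's implementation.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def make_causal_lm_labels(input_ids, pad_token_id=None):
    """Copy input ids to labels, masking padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def shifted_pairs(input_ids, labels):
    """List (context, target) pairs: each target is the next token."""
    return [
        (input_ids[:i + 1], labels[i + 1])
        for i in range(len(input_ids) - 1)
        if labels[i + 1] != IGNORE_INDEX
    ]


ids = [10, 11, 12]
labels = make_causal_lm_labels(ids)
print(labels)                      # [10, 11, 12]
print(shifted_pairs(ids, labels))  # [([10], 11), ([10, 11], 12)]
```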

**Loading the GPT-2 Model and Tokenizer**

In [6]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")





**Preparing the Dataset and Training the Model**

In [7]:
dataset = load_dataset('/content/drive/MyDrive/PRODIGY/dataset.txt', tokenizer)
train(dataset, model, tokenizer)
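
To judge whether `save_steps=10_000` will ever trigger, it helps to estimate how many optimizer steps the run takes. The sketch below does that arithmetic under stated assumptions: one device, no gradient accumulation, and a hypothetical dataset of 500 blocks (the real size depends on `dataset.txt`). `total_optimizer_steps` is an illustrative helper, not part of `transformers`.

```python
import math


def total_optimizer_steps(num_examples, batch_size=4, epochs=3, num_devices=1):
    """Estimate training steps, assuming gradient_accumulation_steps=1."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    return epochs * steps_per_epoch


# Hypothetical dataset of 500 blocks: 125 steps per epoch, 3 epochs.
print(total_optimizer_steps(500))  # 375
```

In that hypothetical case the run ends well before step 10,000, so no intermediate checkpoint is written and only the final `trainer.save_model()` call saves the model. It also ends before the default `logging_steps` of 500, so no training loss would be logged.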





**Generating Text with the Fine-Tuned Model**

In [8]:
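def generate_text(prompt, max_length=100, num_return_sequences=1):
    # Encode the prompt and place it on the same device as the model
    # (Trainer may have moved the model to a GPU during fine-tuning).
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)
    output = model.generate(
        input_ids,
        max_length=max_length,  # Total length in tokens, prompt included
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=2,  # Never generate the same 2-gram twice
        repetition_penalty=1.5,  # Down-weight tokens that already appeared
        length_penalty=1.0,  # Only takes effect with beam search (num_beams > 1)
        early_stopping=True  # Only takes effect with beam search (num_beams > 1)
    )
    # Only the first sequence is decoded and returned
    return tokenizer.decode(output[0], skip_special_tokens=True)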
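
The two anti-repetition settings, `no_repeat_ngram_size=2` and `repetition_penalty=1.5`, can be pictured with the simplified pure-Python sketch below. It is not the library's implementation and both function names are hypothetical. It assumes the usual rules: a token is banned if it would complete an n-gram that already occurred, and the score of every token already seen is divided by the penalty when positive and multiplied by it when negative.

```python
def banned_next_tokens(generated, ngram_size=2):
    """Return tokens that would repeat an n-gram already in `generated`."""
    if len(generated) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned


def apply_repetition_penalty(logits, generated, penalty=1.5):
    """Make every token that was already generated less likely."""
    adjusted = list(logits)
    for tok in set(generated):
        score = adjusted[tok]
        adjusted[tok] = score * penalty if score < 0 else score / penalty
    return adjusted


# The sequence ends in 5, and "5 7" already occurred, so 7 is banned next.
print(banned_next_tokens([5, 7, 9, 5], ngram_size=2))    # {7}
# Tokens 0 and 1 were already generated; token 2 is untouched.
print(apply_repetition_penalty([3.0, -2.0, 1.0], [0, 1]))  # [2.0, -3.0, 1.0]
```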


**Creating the Gradio Interface**

In [9]:

gr_interface = gr.Interface(
    fn=generate_text,
    inputs="text",
    outputs="text",
    title="GPT-2 Text Generator",
    description="Generate text using a fine-tuned GPT-2 model."
)


**Test Outside of Gradio**

In [10]:
prompt = "In a faraway kingdom, a wise old king ruled with kindness and wisdom. Describe it."
print(generate_text(prompt))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In a faraway kingdom, a wise old king ruled with kindness and wisdom. Describe it.
The story begins in the village of Kainu Village where there are many different cultures that share their beliefs about life as well from what to do when you're young or dead. The villagers have been taught how to live together peacefully for centuries but now they see an opportunity is becoming more peaceful after years spent living apart on one island surrounded by people who love each other dearly...and all too often
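
The attention-mask warnings printed above appear because only `input_ids` is passed to `model.generate`. One way to address them, sketched here under assumptions, is to call the tokenizer directly so that it also returns an `attention_mask`, and to pass `pad_token_id` explicitly. `generate_with_mask` is a hypothetical variant that takes the model and tokenizer as arguments, and it assumes both live on the same device.

```python
def generate_with_mask(model, tokenizer, prompt, max_length=100):
    """Generate text while passing an explicit attention mask."""
    # Calling the tokenizer returns both input_ids and attention_mask.
    encoded = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
        max_length=max_length,
        no_repeat_ngram_size=2,
        repetition_penalty=1.5,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the objects defined earlier, it could be called as `generate_with_mask(model, tokenizer, prompt)`. For a single unpadded prompt the generated text should not change. The difference is that the mask and pad token are stated explicitly, not inferred.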


**Launching the Gradio App**

In [11]:
gr_interface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://6b4ff312f2845f80ce.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


