# Fine-Tuning OpenAI's GPT-2
## Introduction
* To leverage the power of large language models, we also fine-tuned a language model to act as our application's chatbot assistant.

    We used the following libraries for fine-tuning:
    1. `transformers[torch]`
    2. `torch` (PyTorch)
    3. `shutil` (Python standard library)

* Using these libraries, we fine-tuned OpenAI's `GPT2` model on a custom-made dataset containing 100 question-answer pairs. The dataset is described in more detail later in this notebook.
* The following models were also tested during our fine-tuning process: `Bert-large` and `bert`. Unfortunately, we did not adopt them because of unsatisfactory results or dependency issues.

In [1]:
import shutil
src_path = r"/kaggle/input/sign-language-text/Sign Language info dataset.txt"
dst_path = r"/kaggle/working/"
shutil.copy(src_path, dst_path)
print('Copied')

Copied


## About the Dataset

* The dataset was custom-made by Team Tekken. We collected around 100 question-answer pairs in a consistent format, and each answer is around 50-60 words long. A sample from the dataset:

    **[Q] What is ASL? \newline**

    **[A] ASL stands for American Sign Language, which is a natural language primarily used by Deaf communities in the United States and Anglophone Canada.**

* The samples cover a wide range of topics, such as American Sign Language, Indian Sign Language, British Sign Language, and the significance of sign language. Questions about `Saradhi.AI` (our application) are also included.
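To make the layout concrete, here is a minimal sketch of how a file in this format could be read back into question-answer pairs. It assumes each question sits on a line starting with `[Q]` and its answer on a following line starting with `[A]`; the helper name `parse_qa_pairs` is hypothetical and is not part of the training code in this notebook.

```python
from typing import List, Tuple


def parse_qa_pairs(text: str) -> List[Tuple[str, str]]:
    """Extract (question, answer) pairs from text in the [Q]/[A] layout."""
    pairs = []
    question = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("[Q]"):
            # Remember the question until its answer line shows up.
            question = line[len("[Q]"):].strip()
        elif line.startswith("[A]") and question is not None:
            answer = line[len("[A]"):].strip()
            pairs.append((question, answer))
            question = None  # wait for the next [Q] line
    return pairs


# The sample shown above, used as a tiny stand-in for the real file.
sample = """[Q] What is ASL?
[A] ASL stands for American Sign Language, which is a natural language primarily used by Deaf communities in the United States and Anglophone Canada.
"""

pairs = parse_qa_pairs(sample)
print(len(pairs), "pair(s) found")
print(pairs[0][0])
```

A quick count like this can help confirm that every question in the file has a matching answer before training starts.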

## Model Configuration

* The model is loaded with the `GPT2LMHeadModel.from_pretrained()` method, and its tokenizer with `GPT2Tokenizer.from_pretrained()`. The parameters used for fine-tuning are as follows:

    `num_train_epochs=100,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir="./logs",`
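As a rough sanity check on these settings, the sketch below estimates how many optimizer steps the run performs and compares that with `save_steps`. The token count is an assumption (about 80 tokens per pair, not measured on the real dataset), and the calculation assumes a single device, no gradient accumulation, and that the text is cut into full 128-token blocks (the `block_size` used in the training cell) with any trailing partial block dropped. The helper names are hypothetical.

```python
import math


def count_blocks(total_tokens: int, block_size: int) -> int:
    """Number of full blocks; a trailing partial block is assumed to be dropped."""
    return total_tokens // block_size


def total_optimizer_steps(num_blocks: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps on a single device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_blocks / batch_size)
    return steps_per_epoch * epochs


# Hypothetical corpus size: 100 pairs at roughly 80 tokens each (an assumption).
assumed_total_tokens = 100 * 80

num_blocks = count_blocks(assumed_total_tokens, block_size=128)
steps = total_optimizer_steps(num_blocks, batch_size=8, epochs=100)

print(f"blocks: {num_blocks}, total steps: {steps}")
print("intermediate checkpoint expected:", steps >= 10_000)
```

Under these assumptions the run finishes well before step 10,000, so no intermediate checkpoint would be written and only the explicit save at the end of training would persist the model.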

In [7]:
# Install the required libraries if they are not already available
# !pip install transformers torch
import torch
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Load your text file
file_path = "/kaggle/working/Sign Language info dataset.txt"

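# Create a dataset from the text file by tokenizing it and splitting it into 128-token blocks
# Note: TextDataset is deprecated in newer transformers releases in favor of the datasets library
dataset = TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=128)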

# Define data collator for language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./finetuned_model",
    overwrite_output_dir=True,
    num_train_epochs=100,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir="./logs",
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Fine-tune the model
trainer.train()

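# Save the fine-tuned model. save_pretrained() treats the path as a directory,
# so this creates a folder named "finetuned_gpt2_model.bin" holding the weights and config.
model_path = "/kaggle/working/finetuned_gpt2_model.bin"
model.save_pretrained(model_path)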








The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
who created Saradhi AI), a sign language interpreter for programmers, that was developed to make sign language accessible to everyone.


[Q] How did Saradhi AI facilitate sign language learning?
[A] Saradhi AI facilitated sign language learning through informative courses on learning sign language, facilitated by a chatbot.


[Q] What is the significance of Martha's Vineyard Sign Language (MVSL) in the history of ASL?
[A] MV


## Inference from the Fine-Tuned Model

In [9]:
# Tokenize the prompt
prompt_text = "[Q] Who created Saradhi AI?"
input_ids = tokenizer.encode(prompt_text, return_tensors="pt")

# Move the model to the same device as the input tensor (the CPU here)
model.to(input_ids.device)

# Generate a continuation. Note: decoding is greedy by default, so `temperature`
# only takes effect if `do_sample=True` is also passed.
output = model.generate(input_ids, max_length=100, num_return_sequences=1, temperature=0.1)

# Decode and print the output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:")
print(generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text:
[Q] Who created Saradhi AI?
[A] Saradhi AI was created by a team passionate about making sign language accessible to everyone, known as Team Tekken.


[Q] What is the purpose of the real-time sign language translation feature in Saradhi AI?
[A] The real-time sign language translation feature in Saradhi AI converts sign language gestures captured by the device's camera into understandable text, facilitating communication.


[Q
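The output above shows that the model keeps generating further question-answer pairs until `max_length` is reached. For a chatbot reply, only the first answer is usually wanted. The following is a minimal post-processing sketch, assuming the generated text keeps the `[Q]`/`[A]` markers; `extract_first_answer` is a hypothetical helper, not something used elsewhere in this notebook.

```python
def extract_first_answer(generated_text: str) -> str:
    """Return the text of the first [A] block, cut off at the next question marker."""
    start = generated_text.find("[A]")
    if start == -1:
        return ""  # no answer marker found
    answer = generated_text[start + len("[A]"):]
    # Stop at the next question marker, even if it is truncated (e.g. "[Q").
    answer = answer.split("[Q", 1)[0]
    return answer.strip()


# Shortened copy of the generated text shown above.
example_output = (
    "[Q] Who created Saradhi AI?\n"
    "[A] Saradhi AI was created by a team passionate about making sign language "
    "accessible to everyone, known as Team Tekken.\n\n\n"
    "[Q] What is the purpose of the real-time sign language translation feature in Saradhi AI?\n"
    "[A] The real-time sign language translation feature converts gestures into text.\n\n\n"
    "[Q"
)

print(extract_first_answer(example_output))
```

Splitting on `[Q` rather than `[Q]` is a deliberate choice here so that a marker truncated by `max_length` is also treated as the end of the answer.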


In [6]:
torch.save(model.state_dict(), '/kaggle/working/model.pth')