<a href="https://colab.research.google.com/github/Pooja-0708/GEN-AI/blob/main/gpt_2_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers datasets accelerate torchvision

In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

In [None]:
# dataset used for fine tuning...
# https://huggingface.co/datasets/hakurei/open-instruct-v1
dataset = load_dataset("hakurei/open-instruct-v1", split='train')
print(dataset.to_pandas().sample(1))

                                                   output  \
163256  One way to get rid of a headache quickly is by...   

                                     instruction input  
163256  How can I get rid of a headache quickly.        


In [None]:
inst = "u r a calculator"
inpu = "6 + 4"
outp = "10"
prompt = f"{inst} {inpu} {outp}"
print(prompt)

u r a calculator 6 + 4 10


In [None]:
def preprocess(row):
    row['prompt'] = f"{row['instruction']} {row['input']} {row['output']}"
    return row
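
Many rows in this dataset have an empty `input` field (the sampled row above is one), and the f-string in `preprocess` then leaves two consecutive spaces between instruction and output. The sketch below is one possible alternative, not the notebook's method: `build_prompt` is a hypothetical helper that joins only the non-empty fields, and the row values are made up for illustration.

```python
def build_prompt(row):
    """Join the non-empty fields of a row with single spaces (hypothetical helper)."""
    parts = (row["instruction"], row["input"], row["output"])
    return " ".join(p.strip() for p in parts if p and p.strip())


# Illustrative row with an empty `input`; the values are invented.
row = {
    "instruction": "How can I get rid of a headache quickly.",
    "input": "",
    "output": "Rest and drink water.",
}

# Same formatting as preprocess(): note the double space where `input` is empty.
naive = f"{row['instruction']} {row['input']} {row['output']}"
print(repr(naive))
print(repr(build_prompt(row)))
```

Whether the extra space matters for training is an open question; the point is only that the joined text differs between the two approaches.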

In [None]:
print(f"Before preprocessing: {dataset}")

Before preprocessing: Dataset({
    features: ['output', 'instruction', 'input'],
    num_rows: 498813
})


In [None]:
dataset = dataset.map(preprocess, remove_columns=['instruction', 'input', 'output'])
print(f"After preprocessing: {dataset}")

After preprocessing: Dataset({
    features: ['prompt'],
    num_rows: 498813
})


In [None]:
# Seed the split as well as the shuffle so the train/test partition is reproducible.
dataset = dataset.shuffle(seed=42).select(range(100000)).train_test_split(test_size=0.1, seed=42)
print(f"After train test split: {dataset}")

After train test split: DatasetDict({
    train: Dataset({
        features: ['prompt'],
        num_rows: 90000
    })
    test: Dataset({
        features: ['prompt'],
        num_rows: 10000
    })
})


In [None]:
train_dataset = dataset['train']
test_dataset = dataset['test']

In [None]:
# https://huggingface.co/microsoft/DialoGPT-medium
MODEL_NAME = 'microsoft/DialoGPT-medium'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
print(f"First row Before tokenizing: {train_dataset['prompt'][0]}")

First row Before tokenizing: Give two opposite opinions on the issue and support them with examples.  Issue: Should the government ban all guns?
Opposing Opinions:
- Yes, because it will reduce gun violence.


In [None]:
def tokenize_dataset(dataset):
    def tokenize(batch):
        # Truncate each prompt to at most 128 tokens.
        return tokenizer(batch['prompt'], truncation=True, max_length=128)
    return dataset.map(tokenize, batched=True, remove_columns=['prompt'])
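
Because each prompt is laid out as instruction, input, output, and the GPT-2 style tokenizer truncates on the right by default, anything past 128 tokens is cut from the end of the output. The sketch below illustrates this with stand-in integer ids rather than real tokenizer output; `truncate` and the lengths chosen are assumptions for illustration only.

```python
MAX_LENGTH = 128  # same limit as in tokenize_dataset


def truncate(ids, max_length=MAX_LENGTH):
    """Right-side truncation: keep the first `max_length` ids."""
    return ids[:max_length]


# Stand-in ids: 100 "instruction" tokens followed by 60 "output" tokens.
instruction_ids = list(range(100))
output_ids = list(range(1000, 1060))
ids = instruction_ids + output_ids

kept = truncate(ids)
kept_output = [t for t in kept if t >= 1000]

# Total length, kept length, and how many output tokens survive.
print(len(ids), len(kept), len(kept_output))
```

In this made-up case only 28 of the 60 output tokens remain, so long instructions can leave the model very little of the response to learn from.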

In [None]:
train_dataset = tokenize_dataset(train_dataset)
# test_dataset is passed to the Trainer as eval_dataset below, so tokenize it as well.
test_dataset = tokenize_dataset(test_dataset)
print(f"tokenized train dataset: {train_dataset}")
print(f"First row After tokenizing: {train_dataset[0]['input_ids'], train_dataset[0]['attention_mask']}")

tokenized train dataset: Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 90000
})
First row After tokenizing: ([23318, 734, 6697, 9317, 319, 262, 2071, 290, 1104, 606, 351, 6096, 13, 220, 18232, 25, 10358, 262, 1230, 3958, 477, 6541, 30, 198, 27524, 2752, 8670, 259, 507, 25, 198, 12, 3363, 11, 780, 340, 481, 4646, 2485, 3685, 13], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])


In [None]:
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir='./dialogpt2-instruct',
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
)

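
With `mlm=False`, the data collator prepares batches for causal language modeling: it pads each batch to its longest sequence and copies `input_ids` into `labels`, replacing pad positions with `-100` so the loss ignores them. The pure-Python sketch below approximates that behavior under stated assumptions; `collate_causal_lm` is a hypothetical stand-in, not the library's code, and 50256 is assumed to be the EOS id used as the pad token here.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal_lm(examples, pad_token_id):
    """Right-pad a batch and build labels that ignore pad positions (sketch)."""
    max_len = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_pad = max_len - len(e["input_ids"])
        ids = e["input_ids"] + [pad_token_id] * n_pad
        mask = e["attention_mask"] + [0] * n_pad
        # Every position holding the pad id is ignored, including any real EOS
        # token, because pad_token was set to eos_token above.
        labels = [IGNORE_INDEX if t == pad_token_id else t for t in ids]
        batch["input_ids"].append(ids)
        batch["attention_mask"].append(mask)
        batch["labels"].append(labels)
    return batch


examples = [
    {"input_ids": [23318, 734, 6697], "attention_mask": [1, 1, 1]},
    {"input_ids": [3363], "attention_mask": [1]},
]
batch = collate_causal_lm(examples, pad_token_id=50256)
print(batch["input_ids"])
print(batch["labels"])
```

The one-position shift between inputs and targets is not done here; the model applies it internally when computing the loss.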

In [None]:
print("training started...")
trainer.train()
print("training completed...")
trainer.save_model()
print("saved model...")
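
For a rough sense of how long this run is, the number of optimizer steps can be estimated from the dataset size and batch size. The sketch below assumes a single device and the default `gradient_accumulation_steps` of 1; `steps_per_epoch` is a hypothetical helper, and the Trainer's own count may differ slightly in edge cases.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1, grad_accum=1):
    """Estimate optimizer steps per epoch (sketch, not the Trainer's exact logic)."""
    batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    return max(batches // grad_accum, 1)


# 90,000 training rows with per_device_train_batch_size=2, as configured above.
total_steps = steps_per_epoch(90_000, 2) * 1  # num_train_epochs=1
print(total_steps)
```

Under these assumptions the single epoch is 45,000 steps, which is why a small batch size makes this a long run on a free Colab GPU.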

In [None]:
# fine_tuned_model = AutoModelForCausalLM.from_pretrained('./dialogpt2-instruct').to('cuda')
fine_tuned_model = AutoModelForCausalLM.from_pretrained('TheFuzzyScientist/diabloGPT_open-instruct').to('cuda')

In [None]:
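def generate_text(prompt, model_selected):
    # Put the inputs on the same device as the model being queried.
    inputs = tokenizer.encode(prompt, return_tensors='pt').to(model_selected.device)
    outputs = model_selected.generate(inputs, max_length=64, pad_token_id=tokenizer.eos_token_id)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Trim to the last complete sentence; keep everything if there is no period.
    end = generated_text.rfind('.')
    return generated_text[: end + 1] if end != -1 else generated_text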

In [None]:
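# Note: `model` is the object passed to the Trainer, so after trainer.train() it holds
# the fine-tuned weights; reload MODEL_NAME for a true base-model comparison.
print("Generating text from base model... ")
print(generate_text("I am Arish", model))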

Generating text from base model... 
I am Arish.


In [None]:
print("Generating text from fine tuned model... ")
print(generate_text("I like to drink...", fine_tuned_model))

Generating text from fine tuned model... 
I like to drink...  I like to drink coffee.  I like to drink tea.  I like to drink coffee with milk.  I like to drink tea with milk.  I like to drink coffee with milk.  I like to drink tea with milk.  I like to drink coffee with milk.
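
The output above loops over the same few sentences, which is common with greedy decoding. One option `generate` offers is `no_repeat_ngram_size`, which forbids any n-gram from appearing twice. The sketch below is a toy re-implementation of that idea on word lists, not the library's code; `banned_next_tokens` is a hypothetical helper.

```python
def banned_next_tokens(generated, n):
    """Return tokens that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


words = "I like to drink coffee . I like to".split()
# With n=3, "like to drink" already occurred, so "drink" may not follow "like to" again.
print(banned_next_tokens(words, 3))
```

In the notebook this would correspond to passing something like `no_repeat_ngram_size=3` to `model_selected.generate(...)`; the value 3 is an assumption to tune, not a recommendation from the source.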
