### Installing Libraries

In [None]:
!pip install huggingface_hub



In [None]:
!pip install --upgrade transformers




In [None]:
!pip show Transformers

Name: transformers
Version: 4.45.2
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: 


In [None]:
!pip install datasets



### Dataset Preparation

In [None]:
model="distilbert/distilgpt2"

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
Model = AutoModelForCausalLM.from_pretrained(model)

In [None]:
from datasets import load_dataset

In [None]:
ds = load_dataset("openai/webgpt_comparisons")

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['question', 'quotes_0', 'answer_0', 'tokens_0', 'score_0', 'quotes_1', 'answer_1', 'tokens_1', 'score_1'],
        num_rows: 19578
    })
})

In [None]:
ds = ds.remove_columns(["quotes_0", "tokens_0", "score_0", "quotes_1", "tokens_1", "score_1", "answer_1"])

In [None]:
ds['train'][0]

{'question': {'dataset': 'triviaqa',
  'id': '18c654a169eb80287f4353d33e701b1c',
  'full_text': 'Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?'},
 'answer_0': 'The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]'}

In [None]:
import pandas as pd
data = pd.DataFrame(ds)

In [None]:
y = data["train"][3]

In [None]:
y

{'question': {'dataset': 'triviaqa',
  'id': '18c678272eb3692655f62a7e9b3d6815',
  'full_text': "What was the name of Dan Dare's co-pilot, in the comic strip adventures in the Eagle comic?"},
 'answer_0': 'Frank Hampson [1].'}

In [None]:
y["question"]["full_text"]

"What was the name of Dan Dare's co-pilot, in the comic strip adventures in the Eagle comic?"

In [None]:
y["answer_0"]

'Frank Hampson [1].'

In [None]:
def input_new(x):
  o = []
  for i in x:
    w = i["question"]["full_text"]
    o.append(w)
  return o

In [None]:
data["input"] = input_new(data["train"])

In [None]:
data["input"]

Unnamed: 0,input
0,"Voiced by Harry Shearer, what Simpsons charact..."
1,Alliumphobia is the irrational fear of which p...
2,Heterophobia is the irrational fear of what
3,"What was the name of Dan Dare's co-pilot, in t..."
4,"In 1965, which Christmas song became the first..."
...,...
19573,"Why do cars get better gas mileage on ""highway..."
19574,How do the new quantum equations suggest to sc...
19575,Why are politicians expected to release their ...
19576,Why do package delivery people's handheld devi...


In [None]:
def output_new(x):
  o = []
  for i in x:
    w = i["answer_0"]
    o.append(w)
  return o

In [None]:
data["output"] = output_new(data["train"])

In [None]:
data["output"]

Unnamed: 0,output
0,The Simpsons character that was possibly based...
1,Alliumphobia is the irrational fear of garlic....
2,Heterophobia is the irrational fear of the op...
3,Frank Hampson [1].
4,"On December 16, 1965, ""Jingle Bells"" became th..."
...,...
19573,
19574,Scientists Ali and Das have created a series o...
19575,The reason that presidents are expected to rel...
19576,


In [None]:
len(data["input"])

19578

In [None]:
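# Element-wise string concatenation: each question is followed directly by its answer, with no separator
New_Data = list(data["input"] + data["output"])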

In [None]:
type(New_Data)

list

In [None]:
New_Data[0]

'Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]'
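
In the string above, the question runs straight into the answer ("Koppel?The Simpsons ..."). A minimal sketch of one alternative, assuming a newline separator and GPT-2's `<|endoftext|>` marker are wanted at the boundaries. The `build_example` helper and its defaults are hypothetical and not part of the notebook:

```python
def build_example(question, answer, sep="\n", eos="<|endoftext|>"):
    """Join one question and its answer into a single training string.

    Hypothetical helper: the separator and end-of-text marker are assumptions.
    """
    return f"{question.strip()}{sep}{answer.strip()}{eos}"


# One question/answer pair taken from the outputs shown earlier in the notebook
question = "What was the name of Dan Dare's co-pilot, in the comic strip adventures in the Eagle comic?"
answer = "Frank Hampson [1]."

text = build_example(question, answer)
print(repr(text))
```

An explicit separator gives the model a consistent cue for where the answer starts. Whether that helps on a 25-example subset is not something this notebook measures.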

In [None]:
New_Data = New_Data[0:25]

In [None]:
New_Data[0]

'Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]'

In [None]:
Tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
Tokenizer.pad_token = Tokenizer.eos_token
Tokenized_data = Tokenizer(New_Data, return_tensors="pt", padding = True, truncation = True)

In [None]:
w = ["harry is a bad person","My name is sarah"]
Tokenizer(w)

{'input_ids': [[71, 6532, 318, 257, 2089, 1048], [3666, 1438, 318, 264, 23066]], 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
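
The two encodings above have different lengths (6 and 5 tokens), which is why `padding = True` is needed before they can be stacked into one tensor. A plain-Python sketch of right-padding with an attention mask, assuming the pad id 50256 (the end-of-text id that appears as padding in the notebook outputs). `pad_batch` is an illustrative helper, not a tokenizer API:

```python
PAD_ID = 50256  # GPT-2 end-of-text id, reused as the pad id in this notebook


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[71, 6532, 318, 257, 2089, 1048], [3666, 1438, 318, 264, 23066]])
print(batch["input_ids"][1])       # the shorter sequence, padded to length 6
print(batch["attention_mask"][1])
```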

In [None]:
Tokenized_data

{'input_ids': tensor([[42144,  3711,   416,  ..., 50256, 50256, 50256],
        [ 3237,  1505,   746,  ..., 50256, 50256, 50256],
        [   39,  2357, 19851,  ..., 50256, 50256, 50256],
        ...,
        [13828,  7850,   373,  ..., 50256, 50256, 50256],
        [ 2061,  8200,  3814,  ..., 50256, 50256, 50256],
        [ 7762,   298,  2879,  ..., 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

In [None]:
from transformers import pipeline
# Text-generation pipeline on the base model, used as a pre-fine-tuning baseline
generator = pipeline("text-generation", model= "distilbert/distilgpt2")
generator("Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': 'Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?\n\nBrought to you by the Simpsons at the Comedy Central Network, this is our episode of The Simpsons Series on Comedy Central; we show an entire weekend'}]

In [None]:
Tokenized_data[0]

Encoding(num_tokens=181, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [None]:
from torch.utils.data import Dataset, random_split
class CustomDataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {
            'input_ids': self.encodings['input_ids'][idx],
            'attention_mask': self.encodings['attention_mask'][idx],
            'labels': self.encodings['input_ids'][idx]
        }

    def __len__(self):
        return len(self.encodings['input_ids'])

dataset = CustomDataset(Tokenized_data)
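
The `labels` field above is a plain copy of `input_ids`, padding included. A common convention is to replace padded label positions with -100, the value PyTorch's cross-entropy loss ignores by default, so that padding does not contribute to the loss. A minimal sketch of that masking in plain Python, where `mask_padding` is a hypothetical helper:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


input_ids = [[71, 6532, 318, 257, 2089, 1048], [3666, 1438, 318, 264, 23066, 50256]]
attention_mask = [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0]]
labels = mask_padding(input_ids, attention_mask)
print(labels)
```

Masking by attention mask rather than by token id keeps a genuine end-of-text token in the labels even when the pad token and the end-of-text token share an id, as they do here.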

In [None]:
dataset[0]

{'input_ids': tensor([42144,  3711,   416,  5850,  1375, 11258,    11,   644, 34376,  2095,
           373, 29563,   706, 11396,   509, 10365,   417,    30,   464, 34376,
          2095,   326,   373,  5457,  1912,   319, 11396,   509, 10365,   417,
           318,  8758, 20501,   805,    13,   220,   679,   318,   257,  1957,
          1705, 18021,   287, 27874,   290,   318, 29563,   706, 11396,   509,
         10365,   417,    13,   685,    16,    60, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50

In [None]:
Tokenized_data, dataset

({'input_ids': tensor([[42144,  3711,   416,  ..., 50256, 50256, 50256],
         [ 3237,  1505,   746,  ..., 50256, 50256, 50256],
         [   39,  2357, 19851,  ..., 50256, 50256, 50256],
         ...,
         [13828,  7850,   373,  ..., 50256, 50256, 50256],
         [ 2061,  8200,  3814,  ..., 50256, 50256, 50256],
         [ 7762,   298,  2879,  ..., 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]])},
 <__main__.CustomDataset at 0x7c8f97438400>)

### Model Training

In [None]:
import torch
import numpy as np
from torch.utils.data import random_split
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
Model = AutoModelForCausalLM.from_pretrained("distilgpt2")
# mlm = False selects causal (next-token) language modeling rather than masked LM
data = DataCollatorForLanguageModeling(Tokenizer, mlm = False)

# Absolute sample counts (they must sum to len(dataset) = 25), not ratios
Ratio_Training = 20
Ratio_Evaluation = 5
Training_set, Evaluation_set = random_split(dataset, [Ratio_Training, Ratio_Evaluation])
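
Without a seed, the 20/5 partition changes from one run to the next, which makes the evaluation numbers hard to compare across sessions. `random_split` accepts a `generator` argument for this purpose. The underlying idea is sketched here in plain Python, not as the PyTorch implementation, with `split_indices` and the seed value as illustrative assumptions:

```python
import random


def split_indices(n_items, n_train, seed=0):
    """Shuffle the indices 0..n_items-1 with a fixed seed and cut them in two."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, eval_idx = split_indices(25, 20, seed=42)
print(len(train_idx), len(eval_idx))
```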

In [None]:
Evaluation_set[0]

{'input_ids': tensor([   53,    53,    71,   488, 44127, 11596,  5442,  1772,   290, 10099,
           373,  1944,   379,  1111,   262, 43231,  1956,   654,   290,   262,
         22256,   286,  6342,    30, 33048, 49042,    11,   508,   318,  1266,
          1900,   329,   465, 44127, 15895,    12, 14463,  1492,  1052,  5407,
           379, 12258,    11,   318,   262,  1772,   286,   262, 29235, 37059,
            11,   543,  8698,   262,  1956,   654,   379, 43231,   290,   262,
         22256,   286,  6342,    13,   685,    16,    11,   362,    11,   513,
            60, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50

In [None]:
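def metrics(eval_pred):
  # Token-level agreement between the argmax prediction and the label at the SAME position.
  # No one-token shift is applied and labels set to -100 are not skipped,
  # so this value is not a next-token accuracy.
  logits, labels = eval_pred
  if isinstance(logits, np.ndarray):
    logits = torch.from_numpy(logits)
  if isinstance(labels, np.ndarray):
    labels = torch.from_numpy(labels)
  predictions = torch.argmax(logits, dim=-1)
  accuracy = (predictions==labels).float().mean().item()
  return {"Accuracy":accuracy}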
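
In a causal language model, the prediction at position t targets the token at position t + 1, and positions labeled -100 are conventionally excluded from scoring. A minimal sketch of an accuracy that respects both conventions, written over plain lists of token ids so the logic is easy to follow. The `next_token_accuracy` name and the toy ids are assumptions for illustration:

```python
def next_token_accuracy(predictions, labels, ignore_index=-100):
    """Share of scored positions where the prediction at t equals the label at t + 1."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(predictions, labels):
        # Drop the last prediction and the first label to align t with t + 1
        for pred, target in zip(pred_row[:-1], label_row[1:]):
            if target == ignore_index:
                continue  # padded position, not scored
            total += 1
            correct += int(pred == target)
    return correct / total if total else 0.0


def same_position_accuracy(predictions, labels):
    """Unshifted, unmasked comparison, for contrast."""
    pairs = [(p, t) for pr, lr in zip(predictions, labels) for p, t in zip(pr, lr)]
    return sum(p == t for p, t in pairs) / len(pairs)


# Toy batch: the model predicts each following token correctly
labels = [[10, 11, 12, -100]]
predictions = [[11, 12, 99, 99]]

print(next_token_accuracy(predictions, labels))
print(same_position_accuracy(predictions, labels))
```

On this toy batch the shifted, masked metric scores every real position as correct, while the same-position comparison scores none.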

In [None]:
Arguments = TrainingArguments(
    output_dir = "./Fine_tuned_model_final",
    overwrite_output_dir = True,
    num_train_epochs = 5,
    #per_device_train_batch_size = 8,
    save_total_limit = 2,
    eval_strategy ='epoch'
)

trainer = Trainer(
    args = Arguments,
    model = Model,
    train_dataset = Training_set,
    eval_dataset = Evaluation_set,
    compute_metrics = metrics,
    data_collator = data
)

trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,3.538106,0.00221
2,No log,3.500619,0.00221
3,No log,3.483348,0.00221
4,No log,3.479694,0.00221
5,No log,3.479739,0.00221


TrainOutput(global_step=15, training_loss=2.974400583902995, metrics={'train_runtime': 275.4194, 'train_samples_per_second': 0.363, 'train_steps_per_second': 0.054, 'total_flos': 4618624204800.0, 'train_loss': 2.974400583902995, 'epoch': 5.0})
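
The `global_step=15` and the "No log" entries can be reproduced with a little arithmetic. This sketch assumes the `TrainingArguments` defaults of a per-device batch size of 8 and `logging_steps` of 500, a single device, and no gradient accumulation:

```python
import math

train_samples = 20      # size of Training_set
batch_size = 8          # assumed default per_device_train_batch_size, single device
epochs = 5              # num_train_epochs
logging_steps = 500     # assumed default logging interval

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
# Training loss is only logged every `logging_steps` optimizer steps
print("loss logged during training:", total_steps >= logging_steps)
```

Under these assumptions the run ends long before the first logging step, which is consistent with the empty "Training Loss" column. Setting a smaller `logging_steps` would be one way to see it.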

In [None]:
trainer.evaluate()

{'eval_loss': 3.4797394275665283,
 'eval_Accuracy': 0.002209944650530815,
 'eval_runtime': 4.2245,
 'eval_samples_per_second': 1.184,
 'eval_steps_per_second': 0.237,
 'epoch': 5.0}
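
For language models, the evaluation loss is often reported as perplexity, the exponential of the mean cross-entropy. A short sketch using the `eval_loss` printed above, assuming the loss is a mean over tokens measured in nats:

```python
import math

eval_loss = 3.4797394275665283  # value reported by trainer.evaluate() above
perplexity = math.exp(eval_loss)
print(f"perplexity = {perplexity:.2f}")
```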

### Pushing to Hugging Face

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
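# The first positional argument of Trainer.push_to_hub is the commit message;
# the repository name is derived from output_dir unless hub_model_id is set
trainer.push_to_hub("Nirwa22/Fine_tuned_model_Final")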

CommitInfo(commit_url='https://huggingface.co/Nirwa22/Fine_tuned_model_final/commit/1866233df0b2ae64f8aa3e2bc8f82292d61ef846', commit_message='Nirwa22/Fine_tuned_model_Final', commit_description='', oid='1866233df0b2ae64f8aa3e2bc8f82292d61ef846', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
Tokenizer.push_to_hub("Nirwa22/Fine_tuned_model_Final")

CommitInfo(commit_url='https://huggingface.co/Nirwa22/Fine_tuned_model_final/commit/1f2909771fd22cf325faded334f797969284761b', commit_message='Upload tokenizer', commit_description='', oid='1f2909771fd22cf325faded334f797969284761b', pr_url=None, pr_revision=None, pr_num=None)

### Loading the Fine-Tuned Model

In [None]:
from transformers import pipeline
# Text-generation pipeline loaded from the fine-tuned checkpoint on the Hub
generator = pipeline("text-generation", model = "Nirwa22/Fine_tuned_model_final")
output = generator("she is a good person")

In [None]:
print(output)

[{'generated_text': 'she is a good person, great person, well known for putting together a fine, well-known comedy show and also doing a couple of TV shows. His first season, "Big Brother," premiered on May 28 at 9 PM on the CBS/'}]


In [None]:
model = AutoModelForCausalLM.from_pretrained("Nirwa22/Fine_tuned_model_final")
Tokenizer = AutoTokenizer.from_pretrained("Nirwa22/Fine_tuned_model_final")
input_prompt =["chatgpt helps"]
x = Tokenizer(input_prompt, return_tensors="pt")
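# length_penalty only affects beam-based generation, so it has no effect with sampling here
Output = model.generate(x["input_ids"], max_length = 50, do_sample = True, top_k = 50, temperature = 2.0, length_penalty = 0.5, repetition_penalty = 1.0)
# Not accepting num_return_sequences > 1
print(Output)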
print(Output)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


tensor([[17006,    70,   457,  5419,  8494,   257,  1271,   286,  6459,   287,
         19473,   290, 37145,   532,   284,   307,  1498,   284,  2050,   262,
          6608,  2950,   618, 24986,   351,  1103,  5563,   393,   379,  2176,
          2974,   286,  3518,   290, 10590, 20087,   532,   355,   880,   355,
          8494,   617,  7531,  3644, 24367,    13, 18987,   422,  2237,  5654]])


In [None]:
print(Tokenizer.decode(Output[0]))

chatgpt helps solve a number of challenges in mathematics and astronomy - to be able to study the properties involved when interacting with real objects or at specific levels of physical and psychological stimulation - as well as solve some fundamental computer puzzles. Users from six scientific
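
The `do_sample`, `top_k`, and `temperature` settings used above control how each next token is drawn. A simplified plain-Python sketch of top-k sampling with temperature, not the library's implementation: `sample_top_k` and the toy logits are assumptions for illustration.

```python
import math
import random


def sample_top_k(logits, k=50, temperature=1.0, rng=None):
    """Draw one token id from the k highest-scoring logits after temperature scaling."""
    rng = rng or random.Random()
    # Keep the ids of the k largest logits
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by the temperature flattens (T > 1) or sharpens (T < 1) the distribution
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    token_id = rng.choices(top_ids, weights=probs, k=1)[0]
    return token_id, dict(zip(top_ids, probs))


toy_logits = [2.0, 1.0, 0.5, -1.0]
token_hot, probs_hot = sample_top_k(toy_logits, k=2, temperature=2.0, rng=random.Random(0))
token_cold, probs_cold = sample_top_k(toy_logits, k=2, temperature=1.0, rng=random.Random(0))

print(probs_hot)
print(probs_cold)
```

With `temperature = 2.0` the probability mass is spread more evenly over the retained tokens than at 1.0, which is one reason high-temperature generations tend to drift off topic.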
