<a href="https://colab.research.google.com/github/AdityaSrivastava-AI/SummarizingLLMs/blob/main/AdityaSrivastava_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Hello,

In this project, I fine-tuned the GPT-2 causal language model on a meta-review dataset. I preprocessed and tokenized the text as required, defined the training parameters, and trained the model. After training, I generated summaries using prompt engineering and evaluated them with ROUGE.

Sincerely,

Aditya Srivastava

**aditya27srivastava10**@gmail.com

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Importing Data

In [None]:
#installing libraries from Hugging Face
!pip install transformers datasets rouge_score



In [None]:
#Loading dataset from Hugging Face
from datasets import load_dataset

#Loading the dataset
dataset = load_dataset("zqz979/meta-review")

#Printing first example for inspection
print(dataset['train'][0])

{'Input': "In this paper, the author investigates how to utilize large-scale human video to train dexterous robot manipulation skills. To leverage the information from the Internet videos, the author proposes a handful of techniques to pre-process the video data to extract the action information. Then the network is trained on the extracted hand data and deployed to the real robot with some human demonstration collected by teleoperation for fine-tuning. Experiments show that the proposed pipeline can solve multiple manipulation tasks.  **Strength**  - The direction explored in this paper is important. Utilizing the internet video data for robot learning is well motivated. Especially considering the similarity between human and multi-finger hands, this direction looks very promising.   - The authors perform experiments with multiple real-world tasks with pick and place, pushing, and rotating objects.  **Weakness**  - Although the objective of this paper is very impressive, the experimen

In [None]:
#Process the input text and tokenize it using GPT-2
from transformers import AutoTokenizer

#Load the pre-trained tokenizer and add a padding token (GPT-2 has none by default)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

#Define the preprocessing function

def preprocess_text(text):
  #Lowercase the text and remove all ASCII punctuation
  import string
  if text is None:
    return ''
  return text.lower().translate(str.maketrans('', '', string.punctuation))

#Tokenization function
def tokenize(batch):
  #To process input text
  inputs = [preprocess_text(text) for text in batch['Input']]


  #Tokenizing the Input
  tokenized_inputs = tokenizer(
      inputs,
      truncation=True,
      padding='max_length',
      max_length=512,
  )

  #using input_ids as labels
  tokenized_inputs['labels'] = tokenized_inputs['input_ids'].copy()



  return tokenized_inputs

#Applying tokenization on dataset
tokenized_dataset = dataset.map(tokenize, batched=True)

#Checking a sample
print(tokenized_dataset['train'][0])

#Formatting for PyTorch
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])





Map:   0%|          | 0/1648 [00:00<?, ? examples/s]

{'Input': "In this paper, the author investigates how to utilize large-scale human video to train dexterous robot manipulation skills. To leverage the information from the Internet videos, the author proposes a handful of techniques to pre-process the video data to extract the action information. Then the network is trained on the extracted hand data and deployed to the real robot with some human demonstration collected by teleoperation for fine-tuning. Experiments show that the proposed pipeline can solve multiple manipulation tasks.  **Strength**  - The direction explored in this paper is important. Utilizing the internet video data for robot learning is well motivated. Especially considering the similarity between human and multi-finger hands, this direction looks very promising.   - The authors perform experiments with multiple real-world tasks with pick and place, pushing, and rotating objects.  **Weakness**  - Although the objective of this paper is very impressive, the experimen
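
One refinement worth considering for the tokenization step above: every example is padded to 512 tokens and `labels` is a straight copy of `input_ids`, so the loss is also computed on `[PAD]` positions. A common approach is to set those label positions to -100, the value that PyTorch's cross-entropy loss ignores by default. The helper below is a minimal sketch and not part of the original notebook; `mask_padding_labels` is a hypothetical name, and it assumes the plain Python lists that the tokenizer returns inside a batched `map`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index.

    Both arguments are lists of equal-length lists, as returned by a
    tokenizer called on a batch without return_tensors.
    """
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Toy batch: 50257 stands in for the id of the added [PAD] token
batch_ids = [[464, 3348, 50257, 50257], [1212, 318, 257, 1332]]
batch_mask = [[1, 1, 0, 0], [1, 1, 1, 1]]
labels = mask_padding_labels(batch_ids, batch_mask)
print(labels)  # [[464, 3348, -100, -100], [1212, 318, 257, 1332]]
```

Inside `tokenize`, this would replace the `.copy()` line with a call such as `mask_padding_labels(tokenized_inputs['input_ids'], tokenized_inputs['attention_mask'])`. The losses reported in this notebook were obtained without such masking, so they would not be directly comparable.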

In [None]:
#importing model from transformers
from transformers import AutoModelForCausalLM

#load model
model = AutoModelForCausalLM.from_pretrained("gpt2")
#Adjusting for new padding token
model.resize_token_embeddings(len(tokenizer))

Embedding(50258, 768)

In [None]:

train_size = int(0.8 * len(tokenized_dataset['train']))
train_dataset = tokenized_dataset['train'].select(range(train_size))
eval_dataset = tokenized_dataset['train'].select(range(train_size, len(tokenized_dataset['train'])))
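
The split above takes the first 80% of the rows in their stored order. If that order is not random, a shuffled split may give a more representative evaluation set. The following is a minimal sketch under that assumption; `split_indices` and the seed value are illustrative and not from the original notebook. The resulting index lists can be passed to `Dataset.select`, the same method used above.

```python
import random

def split_indices(n_examples, train_fraction=0.8, seed=42):
    """Return (train_idx, eval_idx): a shuffled, non-overlapping partition of range(n_examples)."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # local RNG, so the global random state is untouched
    cut = int(train_fraction * n_examples)
    return indices[:cut], indices[cut:]

train_idx, eval_idx = split_indices(10)
print(len(train_idx), len(eval_idx))  # 8 2

# With the notebook's objects this would read (not executed here):
# train_idx, eval_idx = split_indices(len(tokenized_dataset['train']))
# train_dataset = tokenized_dataset['train'].select(train_idx)
# eval_dataset = tokenized_dataset['train'].select(eval_idx)
```

The `datasets` library also provides `Dataset.train_test_split`, which may be more convenient if a helper of this kind is not needed.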

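Chosen hyperparameters and prompt design
* Learning rate (5e-5): the `TrainingArguments` default, a conservative step size for fine-tuning a pre-trained model.
* Epochs (3): validation loss falls from 3.57 to 3.48 over the three epochs, with most of the gain in the first two.
* Batch size (4): fits within GPU memory.
* Prompt: a short, concise instruction that tells the model to summarize the review.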
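
As a rough guide to how batch size and epoch count translate into training length: on a single device without gradient accumulation, the number of optimizer steps is the number of batches per epoch times the number of epochs. The sketch below only illustrates that arithmetic; `total_training_steps` is a hypothetical helper, and the dataset size of 1,000 is made up for the example.

```python
import math

def total_training_steps(n_examples, batch_size, epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # a final partial batch still counts
    return steps_per_epoch * epochs

print(total_training_steps(1000, 4, 3))  # 750
```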






In [None]:
#Set up the training arguments and run the Trainer for fine-tuning
from transformers import Trainer, TrainingArguments

#Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  #renamed to eval_strategy in newer transformers releases
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
)

#Starting the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.7061        | 3.568757        |
| 2     | 3.5571        | 3.495705        |
| 3     | 3.463         | 3.476382        |


TrainOutput(global_step=4617, training_loss=3.6897179808274787, metrics={'train_runtime': 3162.7976, 'train_samples_per_second': 5.836, 'train_steps_per_second': 1.46, 'total_flos': 4823189618688000.0, 'train_loss': 3.6897179808274787, 'epoch': 3.0})
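
Cross-entropy loss for a language model is often easier to interpret as perplexity, which is the exponential of the mean per-token loss. The snippet below is a small worked example using the validation losses reported above; it is an added illustration and not part of the original run.

```python
import math

# Validation loss per epoch, copied from the training log above
validation_loss = {1: 3.568757, 2: 3.495705, 3: 3.476382}

# Perplexity = exp(mean cross-entropy loss in nats)
perplexity = {epoch: math.exp(loss) for epoch, loss in validation_loss.items()}

for epoch, ppl in perplexity.items():
    print(f"epoch {epoch}: validation perplexity = {ppl:.1f}")
```

Validation perplexity drops from roughly 35 to roughly 32 over the three epochs. Because the labels in this notebook include padding positions, these values should not be compared directly with published GPT-2 perplexities.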

In [None]:
#Check GPU availability and move the model to the GPU if one is present
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
model = model.to(device)

Using device: cuda


In [None]:
#Define a prompt and generate a summary
prompt = "Summarize the meta-review of the dataset given:" + dataset['train'][0]['Input']

#Tokenize and generate summary
inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=512).to(device)
summary_ids = model.generate(inputs['input_ids'], max_new_tokens=50, num_beams=4, early_stopping=True)
generated_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

#Print the generated summary
print("Generated Summary:", generated_summary)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Summary: Summarize the meta-review of the dataset given:In this paper, the author investigates how to utilize large-scale human video to train dexterous robot manipulation skills. To leverage the information from the Internet videos, the author proposes a handful of techniques to pre-process the video data to extract the action information. Then the network is trained on the extracted hand data and deployed to the real robot with some human demonstration collected by teleoperation for fine-tuning. Experiments show that the proposed pipeline can solve multiple manipulation tasks.  **Strength**  - The direction explored in this paper is important. Utilizing the internet video data for robot learning is well motivated. Especially considering the similarity between human and multi-finger hands, this direction looks very promising.   - The authors perform experiments with multiple real-world tasks with pick and place, pushing, and rotating objects.  **Weakness**  - Although the ob
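
The output above begins with the prompt itself, because `generate` returns the prompt tokens followed by the new ones, and the warning appears because only `input_ids` was passed. A possible adjustment, sketched below as an assumption and not the notebook's actual code, is to pass the full tokenizer output (which includes `attention_mask`), set `pad_token_id` explicitly, and decode only the tokens that follow the prompt. `generate_continuation` and `new_tokens_only` are hypothetical helper names.

```python
def new_tokens_only(sequence, prompt_length):
    """Drop the first prompt_length items, keeping only the generated tokens.

    Works on any sliceable sequence, e.g. a list or a 1-D tensor.
    """
    return sequence[prompt_length:]

def generate_continuation(model, tokenizer, prompt, device, max_new_tokens=50):
    """Generate text for prompt and return only the newly generated part.

    model and tokenizer are assumed to be the fine-tuned GPT-2 objects
    from the cells above; they are passed in, not loaded here.
    """
    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=512).to(device)
    output_ids = model.generate(
        **inputs,  # passes both input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        num_beams=4,
        early_stopping=True,
        pad_token_id=tokenizer.pad_token_id,
    )
    prompt_length = inputs['input_ids'].shape[1]
    return tokenizer.decode(new_tokens_only(output_ids[0], prompt_length), skip_special_tokens=True)

# Toy check of the slicing step with plain lists
print(new_tokens_only([10, 11, 12, 98, 99], prompt_length=3))  # [98, 99]
```

With the notebook's objects, a call would look like `generate_continuation(model, tokenizer, prompt, device)`, returning at most 50 new tokens without the echoed prompt.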

In [None]:
#installing evaluate library from Hugging Face
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
#Using ROUGE metrics to evaluate the quality of generated summaries
import evaluate

#Loading the ROUGE metric
rouge = evaluate.load("rouge")

#Generate and accumulate predictions for evaluation
predictions = []
references = []
#Evaluate 10 samples
for i in range(10):
  prompt = "summarize the review:" + dataset['train'][i]['Input']
  inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=512).to(device)
  summary_ids = model.generate(inputs['input_ids'], max_new_tokens=50, num_beams=4, early_stopping=True)

  predictions.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
  references.append(dataset['train'][i]['Output'])

#Computing ROUGE scores
results = rouge.compute(predictions=predictions, references=references)
print("ROUGE scores:", results)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

ROUGE scores: {'rouge1': 0.23817941269372833, 'rouge2': 0.06588060318083214, 'rougeL': 0.13175596039732357, 'rougeLsum': 0.13163860349144757}
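
For intuition about what these numbers measure: ROUGE-1 counts the unigrams shared by a prediction and its reference and combines precision and recall into an F-measure. The function below is a simplified sketch that uses whitespace tokenization; the `evaluate` ROUGE metric applies its own tokenization and options, so its scores will generally differ. `rouge1_f1` is an illustrative name, not a library function.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased, whitespace-split tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the paper proposes a new method", "the paper introduces a method")
print(round(score, 3))  # 0.727
```

Here 4 of the 6 predicted words and 4 of the 5 reference words overlap, giving precision 4/6, recall 4/5, and F1 = 8/11. Each prediction in the loop above still contains the prompt and the input review, so long predictions may lower precision, which could partly explain the modest scores. The 10 samples are also drawn from the training split.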


Thank You!