In [22]:
# Install required libraries
!pip install transformers datasets openpyxl




In [23]:
# Import necessary libraries
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import Dataset

First, we install and import the required libraries: Hugging Face's transformers and datasets for tokenization, model loading, and training; pandas for handling the tabular data; and openpyxl, which pandas uses to read .xlsx files.

In [25]:
# Load dataset from Excel file
data = pd.read_excel('/content/Fine-Tune-Model-Train/training_data_assignment.xlsx', header=None)
data.columns = ['Question', 'Answer']  # Rename columns for better clarity


In [26]:
# Check if the dataset is loaded correctly
print("First few rows of the dataset:")
print(data.head())  # Display the first 5 rows of the DataFrame

# Display the shape of the dataset
print("\nShape of the dataset:")
print(data.shape)  # This will show the number of rows and columns

First few rows of the dataset:
                                            Question  \
0         WHAT IS CALL OPTION & PUT OPTION IN BONDS?   
1                    What is Call / Put option date?   
2  -2023 Change / Correction in Name / Address co...   
3  -2023 View/Download Fillable-Transmission/Name...   
4                -2023 Exchange of Share Certificate   

                                              Answer  
0  Bond Option is a contract between seller and b...  
1  Call / Put option date is the date on which is...  
2  Please note, SEBI vide their Circulars have ma...  
3  View/Download -Fillable-Application form for N...  
4  Please note, SEBI vide their Circulars have ma...  

Shape of the dataset:
(790, 2)


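Here, we load the training data from an Excel file. Because the file is read with header=None, its first row is treated as data rather than as column names. We assume the two columns hold questions and answers, in that order, and name them accordingly for easier reference in later steps.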
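Before training, it can help to check that the two-column assumption actually holds. The sketch below is not part of the original notebook: it uses a small in-memory stand-in for the spreadsheet (illustrative values only) and a hypothetical helper, clean_qa_frame, that drops rows with a missing question or answer and trims stray whitespace.

```python
import pandas as pd


def clean_qa_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only rows with a non-empty question and answer."""
    out = df.copy()
    out.columns = ['Question', 'Answer']
    # Drop rows where either cell is missing
    out = out.dropna(subset=['Question', 'Answer'])
    # Normalise both columns to stripped strings
    for col in ['Question', 'Answer']:
        out[col] = out[col].astype(str).str.strip()
    # Drop rows that are empty after stripping
    out = out[(out['Question'] != '') & (out['Answer'] != '')]
    return out.reset_index(drop=True)


# Toy stand-in for the Excel sheet (made-up rows, not the real data)
raw = pd.DataFrame([
    ['  What is a bond? ', 'A debt instrument.'],
    ['Orphan question', None],
    ['   ', 'Answer without a question'],
])
cleaned = clean_qa_frame(raw)
print(cleaned.shape)
```

In the notebook, the same helper could be applied to data right after pd.read_excel, before the sampling step.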

In [27]:
# Reduce dataset size for initial testing
data = data.sample(frac=0.1, random_state=42)  # Use 10% of the data


We sample 10% of the data (79 of the 790 rows) to speed up initial testing; random_state=42 makes the sample reproducible. You can raise the fraction later, depending on the dataset size and available resources.

In [10]:
# Prepare the dataset
data['text'] = data.apply(lambda row: f"Q: {row['Question']} A: {row['Answer']}", axis=1)
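# preserve_index=False keeps the shuffled pandas index out of the dataset columns
dataset = Dataset.from_pandas(data[['text']], preserve_index=False)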


We concatenate each question and its answer into a single text field of the form Q: <question> A: <answer>. A consistent template gives the model a fixed cue (A:) after which the answer is expected.
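One possible refinement, not used in the original notebook, is to end each training string with the end-of-text token so the model also sees where an answer stops. The sketch below assumes GPT-2's token string <|endoftext|> (normally read from tokenizer.eos_token); format_example is a hypothetical helper and the answer text is a placeholder.

```python
# GPT-2's end-of-text token string; in the notebook this would be tokenizer.eos_token
EOS_TOKEN = "<|endoftext|>"


def format_example(question: str, answer: str, eos_token: str = EOS_TOKEN) -> str:
    """Build one training string in the Q:/A: format, terminated by the EOS token."""
    return f"Q: {question.strip()} A: {answer.strip()}{eos_token}"


# Placeholder answer text, for illustration only
example = format_example("What is Call / Put option date?", "  <answer text>  ")
print(example)
```

Note that with truncation enabled, examples longer than max_length would lose the trailing token, so this only helps for examples that fit.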

In [30]:
# Load pre-trained model and tokenizer
# model_name = 'distilgpt2'  # Use a smaller model
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [31]:
# Choose one of the following models
# model_name = 'distilgpt2'  # Small and fast GPT-2 model
# model_name = 'gpt2'  # Regular GPT-2 model
# model_name = 'EleutherAI/gpt-neo-125M'  # GPT-Neo (smaller)
# model_name = 'microsoft/DialoGPT-small'  # DialoGPT (smaller)

# Load the chosen model and tokenizer
# model, tokenizer = load_model_and_tokenizer(model_name)


We load the tokenizer for the pre-trained gpt2 model. distilgpt2, a smaller and faster version of GPT-2, is left commented out as an alternative for limited compute environments, along with GPT-Neo 125M and DialoGPT-small.

In [32]:
# Add padding token if necessary
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


Some models, like GPT-2, do not define a padding token. We reuse eos_token (the end-of-sequence token) as pad_token so that batches can be padded to a uniform length.

In [33]:
# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.pad_token_id)



We load the pre-trained gpt2 model for causal language modeling and set its padding token ID to the one we defined earlier.

In [35]:
# Tokenization function with labels
def tokenize_function(examples):
    # Tokenize the text and create labels
    encodings = tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)
    encodings['labels'] = encodings['input_ids']  # Set labels to input_ids
    return encodings


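We define a tokenization function that pads or truncates each example to 128 tokens and copies the input token IDs into labels. For causal language modeling the labels are the inputs themselves, since the model shifts them internally to predict each next token. The function is applied to the entire dataset in the next step.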
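Because the padding token is the same as the EOS token here, copying input_ids into labels also trains the model on padded positions. A common alternative is to set those label positions to -100, the value PyTorch's cross-entropy loss ignores by default. The sketch below is not from the original notebook: mask_padding_labels is a hypothetical helper, and the token IDs in the toy batch are made up.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(encodings: dict) -> dict:
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    encodings['labels'] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(encodings['input_ids'], encodings['attention_mask'])
    ]
    return encodings


# Toy batch: two sequences padded to length 5 with pad id 50256 (made-up token ids)
batch = {
    'input_ids': [[48, 25, 644, 50256, 50256], [48, 25, 318, 257, 6314]],
    'attention_mask': [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]],
}
masked = mask_padding_labels(batch)
print(masked['labels'])
```

Inside tokenize_function, the helper could be called on the tokenizer output in place of the direct labels assignment.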



In [36]:
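# Tokenize the dataset and drop the raw text column
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(['text'])
# 80/20 train/eval split; the seed value is an assumption, chosen for reproducibility
split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, eval_dataset = split['train'], split['test']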



After tokenization, the dataset is split into training (80%) and evaluation (20%) sets so that validation loss can be monitored during fine-tuning.

In [37]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',  # Evaluate after each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=2,  # Smaller batch size for limited compute
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,  # Limit saved checkpoints
    fp16=True,  # Enable mixed precision for faster training on compatible GPUs
)


We define the training arguments, including batch size, learning rate, number of epochs, and mixed precision training for faster execution (if supported by the GPU).
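Hard-coding fp16=True typically fails on a machine without a CUDA GPU. A minimal sketch of one way to make the setting conditional is shown below. The cuda_available helper and the logging_steps=10 value are assumptions added for illustration; the other values are copied from the cell above. The import guard is only there so the snippet also runs where PyTorch is not installed.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


training_kwargs = dict(
    output_dir='./results',
    eval_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    fp16=cuda_available(),  # mixed precision only when a CUDA GPU is present
    logging_steps=10,       # assumed value: log training loss every 10 steps
)
print(training_kwargs['fp16'])
# In the notebook: training_args = TrainingArguments(**training_kwargs)
```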

In [38]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)


We create the Trainer object from Hugging Face's transformers library, which handles the fine-tuning process.

In [39]:
# Start training
trainer.train()


Epoch  Training Loss  Validation Loss
1      No log         2.274684
2      No log         2.213943
3      No log         2.198199


TrainOutput(global_step=96, training_loss=2.857652028401693, metrics={'train_runtime': 615.9727, 'train_samples_per_second': 0.307, 'train_steps_per_second': 0.156, 'total_flos': 12346048512000.0, 'train_loss': 2.857652028401693, 'epoch': 3.0})

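This command starts the fine-tuning process, where the model learns to generate text in the Q: <question> A: <answer> format of the training dataset. Validation loss falls from 2.27 to 2.20 over the three epochs. The Training Loss column reads "No log" because the run finishes in 96 steps, fewer than the Trainer's default logging interval of 500 steps.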
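The global_step=96 reported above can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, a final partial batch that still counts as a step, and an evaluation share rounded up to a whole number of rows; none of these are stated in the notebook.

```python
import math

total_rows = 790
sampled_rows = round(total_rows * 0.1)        # rows left after sampling 10%
eval_rows = math.ceil(sampled_rows * 0.2)     # assumed rounding for the 20% eval share
train_rows = sampled_rows - eval_rows

batch_size = 2
epochs = 3
steps_per_epoch = math.ceil(train_rows / batch_size)  # last partial batch counts as a step
total_steps = steps_per_epoch * epochs

print(train_rows, eval_rows, steps_per_epoch, total_steps)  # 63 16 32 96
```

Under these assumptions the result matches the reported step count.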

In [40]:
# Function to generate a response from the fine-tuned model
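def generate_response(prompt):
    # Tokenize the prompt and move it to the model's device (the Trainer may have moved the model to a GPU)
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    # max_length counts the prompt tokens as well as the generated ones
    outputs = model.generate(**inputs, max_length=50)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)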


After training, we define a function to generate responses from the fine-tuned model. The model is given a prompt, and the response is decoded into text.

In [45]:
# Example prompt
prompt = "What is bond market?"

# Generate a response from the fine-tuned model
response = generate_response(prompt)

# Function to format the response for better readability
def format_response(response_text):
    # Split the response into multiple lines based on punctuation
    formatted_response = response_text.replace('. ', '.\n')  # Split on periods
    return formatted_response

# Print the prompt and the response in a structured format
print("=" * 50)
print("Prompt:")
print(prompt)
print("=" * 50)
print("Response after fine-tuning:")
formatted_response = format_response(response)
print(formatted_response)
print("=" * 50)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt:
What is bond market?
Response after fine-tuning:
What is bond market?

Bond market is a form of investment in securities that is held by investors in a securities exchange.
It is a form of investment in securities that is held by investors in a securities exchange.
It is a form of


Here, we test the fine-tuned model with an example prompt. The response echoes the prompt, then repeats itself and stops mid-sentence, because max_length=50 caps the total length, prompt included. You can replace the prompt with any other question from your dataset.
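The training strings use the Q: <question> A: <answer> template, whereas the example prompt is a bare question. A minimal sketch of two hypothetical helpers follows: build_prompt puts a question into the training template, and extract_answer strips the echoed prompt from the decoded text. The model call is replaced by a stand-in string so that the snippet runs on its own.

```python
def build_prompt(question: str) -> str:
    """Format a question the way the training examples were formatted."""
    return f"Q: {question.strip()} A:"


def extract_answer(generated_text: str, prompt: str) -> str:
    """Return only the text after the prompt, if the prompt is echoed back."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = build_prompt("What is bond market?")
# Stand-in for the decoded model output; a real run would call generate_response(prompt)
fake_output = prompt + " <generated answer>"
answer = extract_answer(fake_output, prompt)
print(answer)
```

When calling model.generate, passing max_new_tokens instead of max_length budgets only the generated tokens, and no_repeat_ngram_size may reduce the repetition seen above. Both are existing generate arguments, but suitable values would need to be tuned for this dataset.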

