<a href="https://colab.research.google.com/github/SurajMegharaj/Finance_bot/blob/main/Fianance_bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers datasets torch

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupt

In [2]:
import pandas as pd

# Load dataset (make sure you upload your CSV)
df = pd.read_csv('/content/finance_qna.csv')
df.head()


Unnamed: 0,Question,Answer
0,What is a stock?,A stock represents ownership in a company and ...
1,What is a bond?,A bond is a fixed income instrument that repre...
2,What is the stock market?,The stock market refers to the collection of m...
3,What is an ETF?,An exchange-traded fund (ETF) is a type of sec...
4,What is a mutual fund?,A mutual fund is a type of investment vehicle ...


In [4]:
train_data = []

for index, row in df.iterrows():
    train_data.append(f"Question: {row['Question']} Answer: {row['Answer']}")

# Check sample
train_data[:3]


['Question: What is a stock? Answer: A stock represents ownership in a company and constitutes a claim on part of the company’s assets and earnings.',
 'Question: What is a bond? Answer: A bond is a fixed income instrument that represents a loan made by an investor to a borrower, typically corporate or governmental.',
 'Question: What is the stock market? Answer: The stock market refers to the collection of markets and exchanges where regular activities of buying, selling, and issuance of shares take place.']

Fine-Tuning GPT-2

In [1]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import Dataset

# Load pre-trained GPT-2
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Add a padding token to GPT-2 tokenizer (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token  # Use EOS token as padding token

# Load your dataset from a CSV file
csv_file_path = '/content/finance_qna.csv'  # Replace this with your CSV file path
dataset = Dataset.from_csv(csv_file_path)

# Check the column names to ensure 'Question' and 'Answer' are correct
print(dataset.column_names)  # Debugging line

# Tokenize data by concatenating 'Question' and 'Answer'
def tokenize_function(examples):
    text = [f"Question: {q} Answer: {a}" for q, a in zip(examples["Question"], examples["Answer"])]
    encodings = tokenizer(text, truncation=True, padding='max_length', max_length=512)
    encodings["labels"] = encodings["input_ids"]  # Adding labels for causal LM training
    return encodings

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Start training
trainer.train()

# Save the fine-tuned model and tokenizer
trainer.save_model("./fine_tuned_gpt2")
tokenizer.save_pretrained("./fine_tuned_gpt2")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


['Question', 'Answer']


[34m[1mwandb[0m: Currently logged in as: [33msuraj1642001[0m ([33msuraj1642001-ssce[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Step,Training Loss


('./fine_tuned_gpt2/tokenizer_config.json',
 './fine_tuned_gpt2/special_tokens_map.json',
 './fine_tuned_gpt2/vocab.json',
 './fine_tuned_gpt2/merges.txt',
 './fine_tuned_gpt2/added_tokens.json')

In [8]:
import torch
import math
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained("./fine_tuned_gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Set the model in evaluation mode
model.eval()

# Define a sample text for evaluation
input_text = "Question: What is a stock? Answer:"

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt")

# Ensure pad_token_id is set to eos_token_id
model.config.pad_token_id = tokenizer.eos_token_id

# Forward pass to get the logits and compute loss
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

# Compute the perplexity
log_likelihood = outputs.loss
perplexity = torch.exp(log_likelihood)

print(f"Perplexity: {perplexity.item()}")


Perplexity: 55.86064147949219


In [10]:
# Set the pad_token_id for GPT-2 (it doesn't have a padding token by default)
model.config.pad_token_id = model.config.eos_token_id  # This will use EOS token as padding

# Generation using the trained model
input_text = "Question: What is a stock? Answer:"
inputs = tokenizer(input_text, return_tensors="pt")

# Ensure the model is in evaluation mode
model.eval()

# Generate text
generated_ids = model.generate(
    inputs['input_ids'],
    max_length=100,  # Length of the generated text
    num_beams=5,  # You can adjust the number of beams for beam search
    no_repeat_ngram_size=2,  # To avoid repetitive n-grams
    temperature=0.7,  # Set the temperature for randomness
    do_sample=True,  # Enable sampling for more variety in the output
    top_p=0.9,  # Use top-p (nucleus) sampling to focus on high probability tokens
)

# Decode and print the generated output
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question: What is a stock? Answer:
