<a href="https://colab.research.google.com/github/Pallavi5775/text-gen/blob/main/text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os
# Disable the Weights & Biases logging integration for this session
os.environ["WANDB_DISABLED"] = "true"

In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import Dataset

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# Load the financial news dataset
def create_dataset(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]  # Strip whitespace and drop empty lines
    # Use a subset of the data (first 1000 lines) for faster training
    lines = lines[:1000]

    return Dataset.from_dict({"text": lines})

# Specify the dataset file path (uploaded file in Colab)
file_path = "financial_news.txt"
dataset = create_dataset(file_path)

# Tokenize the dataset and include labels
def tokenize_function(examples):
    tokenized_inputs = tokenizer(examples["text"], truncation=True, max_length=128, padding="max_length")
    tokenized_inputs["labels"] = tokenized_inputs["input_ids"].copy()  # Labels are the same as input_ids
    return tokenized_inputs

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Define fine-tuning arguments
training_args = TrainingArguments(
    output_dir="./fine_tuned_gpt2",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,  # Increased batch size to reduce iterations
    save_steps=500,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="no",
    learning_rate=5e-5,
    warmup_steps=100,
    weight_decay=0.01,
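    fp16=True,  # Mixed-precision training; requires a CUDA GPU (e.g., a Colab GPU runtime)
    report_to="none",  # Disable W&B and other logging integrations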
)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Start fine-tuning
print("Starting fine-tuning...")
trainer.train()

# Save the fine-tuned model
print("Saving the fine-tuned model...")
trainer.save_model("./fine_tuned_gpt2")
tokenizer.save_pretrained("./fine_tuned_gpt2")

print("Fine-tuning complete. Model saved to ./fine_tuned_gpt2")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Starting fine-tuning...


Step,Training Loss
10,7.166
20,4.2034
30,1.6641
40,1.1052
50,1.0811
60,0.9872
70,0.918
80,0.8492
90,0.7805
100,0.8098


Saving the fine-tuned model...
Fine-tuning complete. Model saved to ./fine_tuned_gpt2
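
A note on the labels built in `tokenize_function`: because the labels are a plain copy of `input_ids` and the padding token is the same as the EOS token, the loss is also computed over the padded positions. One common alternative is to set the labels at padded positions to `-100`, the value that PyTorch's cross-entropy loss ignores by default. The sketch below is a minimal, pure-Python illustration of that idea, not what this notebook ran. The helper name `mask_padding_labels` and its `keep_first_pad` flag are hypothetical, and the token ids in the toy example are arbitrary apart from 50256, GPT-2's EOS id.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding_labels(input_ids, attention_mask, keep_first_pad=False):
    """Copy input_ids to labels, replacing padded positions with IGNORE_INDEX.

    With keep_first_pad=True the first padded position keeps its token id,
    so the model still sees one end-of-text target per example.
    """
    labels = []
    first_pad_seen = False
    for token, mask in zip(input_ids, attention_mask):
        if mask == 1:
            labels.append(token)  # real token: keep it as a training target
        elif keep_first_pad and not first_pad_seen:
            labels.append(token)  # first pad (= EOS here) stays as a target
            first_pad_seen = True
        else:
            labels.append(IGNORE_INDEX)  # remaining padding is ignored by the loss
    return labels


# Toy example: three real tokens followed by two pad tokens.
ids = [11, 22, 33, 50256, 50256]
mask = [1, 1, 1, 0, 0]
print(mask_padding_labels(ids, mask))
print(mask_padding_labels(ids, mask, keep_first_pad=True))
```

Masking every padded position removes the only end-of-text signal in this setup, since the tokenizer call above does not append an EOS token on its own. Keeping the first pad as a target is one possible compromise; which variant works better for this dataset would need to be checked by retraining.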


In [3]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load fine-tuned model and tokenizer
model_path = "./fine_tuned_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = GPT2LMHeadModel.from_pretrained(model_path)

# Generate text
prompt = "The stock market today"
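inputs = tokenizer(prompt, return_tensors="pt")  # Returns input_ids and attention_mask

output = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # Pass the mask explicitly so it is not inferred
    pad_token_id=tokenizer.eos_token_id,  # Pad with EOS, matching the training setup
    max_length=50,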
    num_return_sequences=1,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)

print("Generated Financial News:")
print(tokenizer.decode(output[0], skip_special_tokens=True))


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated Financial News:
The stock market today rose 0.6 % to EUR10 .30 for the day , and the price was up 3.3 % on the previous day .
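
The sampling settings used in the `generate` call (`temperature=0.7`, `top_k=50`, `top_p=0.95`) can be illustrated on a tiny vocabulary. The sketch below is a simplified pure-Python approximation, not the library's implementation: it scales logits by the temperature, keeps the `top_k` most probable tokens, then keeps the smallest prefix of those whose cumulative probability reaches `top_p`. The function names and the example logits are made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Return a renormalized {token_index: probability} dict after top-k, then top-p."""
    # Rank token indices from most to least probable and keep the k best.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)
    kept = {}
    cumulative = 0.0
    for i in ranked:
        p = probs[i] / top_k_mass  # renormalize within the top-k set
        kept[i] = p
        cumulative += p
        if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
            break
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}


# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits, temperature=0.7)
candidates = filter_top_k_top_p(probs, top_k=3, top_p=0.95)
print(candidates)  # the next token would be sampled from this distribution
```

A temperature below 1 sharpens the distribution toward the most likely tokens, while the two filters cut off the low-probability tail before sampling.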


In [4]:
!git init


hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/


In [6]:
!git add fine_tuned_gpt2/ logs/ financial_news.txt


^C

