<a href="https://colab.research.google.com/github/PuchToTalk/FinBERT/blob/fine-tuning/Fine_Tuning_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers


Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m17.6 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.2-py3-none-any.whl (294 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m294.9/294.9 kB[0m [31m33.9 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m33.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m35.0 MB/s[0m eta [36m0:00:0

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification, AdamW

# Define your own dataset for text classification
# In this example, let's assume you have training_data and validation_data
# training_data should be a list of (text, label) pairs
# validation_data should be a list of (text, label) pairs
training_data = [
    ("I love this person.", 1),
    ("I hate reading", 0),
    # Add more data...
]

validation_data = [
    ("I love this dog.", 1),
    ("I hate doing my homework", 0),
    # Add more data...
]

# Define the BERT model and tokenizer
model_name = "bert-base-uncased"  # You can choose other pre-trained models
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Change num_labels to match your classification task

# Tokenize and preprocess the data
def preprocess_data(data):
    inputs = [tokenizer.encode(text, add_special_tokens=True, max_length=128, pad_to_max_length=True) for text, _ in data]
    labels = [label for _, label in data]
    inputs = torch.tensor(inputs)
    labels = torch.tensor(labels)
    return inputs, labels

train_inputs, train_labels = preprocess_data(training_data)
val_inputs, val_labels = preprocess_data(validation_data)

# Create data loaders
batch_size = 32
train_dataset = TensorDataset(train_inputs, train_labels)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataset = TensorDataset(val_inputs, val_labels)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)

# Define training parameters
learning_rate = 2e-5
num_epochs = 3  # You can adjust the number of epochs

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Fine-tune the BERT model
model.to("cuda" if torch.cuda.is_available() else "cpu")  # Use GPU if available
model.train()
for epoch in range(num_epochs):
    total_loss = 0
    for inputs, labels in train_dataloader:
        inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")
        labels = labels.to("cuda" if torch.cuda.is_available() else "cpu")
        optimizer.zero_grad()
        outputs = model(inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    average_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {average_loss:.4f}")

# Evaluate the model on the validation set
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in val_dataloader:
        inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")
        labels = labels.to("cuda" if torch.cuda.is_available() else "cpu")
        outputs = model(inputs)
        _, predicted = torch.max(outputs.logits, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = correct / total
print(f"Validation Accuracy: {accuracy:.4f}")

# Save the fine-tuned model
model.save_pretrained("fine_tuned_bert")


Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Epoch 1/3, Loss: 0.7588
Epoch 2/3, Loss: 0.7083
Epoch 3/3, Loss: 0.6144
Validation Accuracy: 0.5000


The output shows the training and validation progress for a BERT-based text classification model. Here's how to interpret the results:

**Epoch 1/3, Loss: 0.7588:**


> This indicates that the model has completed the first training epoch.
The loss of approximately 0.7588 is the average loss calculated over all batches in the training data during this epoch.
Loss measures how well the model is performing; a lower loss is better. It represents the error between the model's predictions and the actual labels.



**Epoch 2/3, Loss: 0.7083:**

> This shows the results after the second training epoch.
The loss has decreased to approximately 0.7083, which is expected during training as the model learns to make better predictions.

**Epoch 3/3, Loss: 0.6144:**

> This is the result after the third and final training epoch.
The loss has decreased further to approximately 0.6144, indicating that the model continues to improve.
Validation Accuracy: 0.5000:

The validation accuracy of 0.5000 means that, when evaluating the model on the validation dataset, it correctly predicted the labels for 50% of the examples.
Validation accuracy is a common metric used to evaluate classification models. In this case, it indicates that the model is performing at a random or chance level, as it's correctly classifying roughly half of the examples.



**Interpretation:**

The decreasing training loss across epochs is a positive sign, suggesting that the model is learning and improving its predictions on the training data.
However, the low validation accuracy of 0.5000 indicates that the model's performance on unseen data (validation data) is no better than random guessing. This suggests that the model might be underfitting or that the data and model architecture may require further tuning.
