In this notebook, we fine-tune a pre-trained DistilBERT model on the Stanford Sentiment Treebank (SST-2) dataset using the Hugging Face Transformers library. SST-2 is a sentiment analysis dataset of sentences from movie reviews, each labeled as positive or negative. Fine-tuning on it teaches the model to predict a sentiment label from input text. The process involves loading the dataset, tokenizing the text, setting up training parameters, and using the Trainer class from the Transformers library to manage training and evaluation.


### Install Required Packages

In [1]:
# Install Hugging Face Transformers and Datasets
! pip install transformers datasets

# Install Pandas for data manipulation
! pip install pandas

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (1

### Import Required Libraries
Here, we import essential components from the Hugging Face **Transformers** library. The **Trainer** class simplifies the training process, **TrainingArguments** lets us configure the training parameters, **AutoModelForSequenceClassification** loads a pre-trained model with a classification head, and **AutoTokenizer** loads the tokenizer that matches the model. We also import the **load_dataset** function from the datasets library to access pre-built datasets.

In [2]:
# Import the training utilities, model loader, and tokenizer from Hugging Face Transformers,
# and the dataset loader from the Datasets library
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

Next, we load the SST-2 dataset from the GLUE benchmark using the load_dataset function. The dataset comes with training, validation, and test splits. The training data (train_data) is used to fine-tune the model, while the validation data (eval_data) is used to evaluate its performance after training.

In [3]:
# Load the SST-2 dataset from Hugging Face, which contains sentiment-labeled data
# The dataset is used for fine-tuning sentiment analysis models
dataset = load_dataset("glue", "sst2")
train_data = dataset["train"]
eval_data = dataset["validation"]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/3.11M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/72.8k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/148k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

Here, we load the pre-trained DistilBERT model with a sequence classification head. The from_pretrained method gives us a model that already encodes general language structure, which speeds up training and improves results. The classification head itself is newly initialized and is learned during fine-tuning.

In [4]:
# Load a pre-trained DistilBERT model with a sequence classification head
# The base checkpoint is not yet fine-tuned; we fine-tune it on SST-2 below
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
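
The warning above says that the classification head (`pre_classifier` and `classifier`) starts from newly initialized weights, so its outputs carry no sentiment signal until fine-tuning. As a rough illustration of how a two-class head's raw scores (logits) become a prediction, here is a minimal sketch in plain Python. The logit values are made up for illustration, and the mapping 0 = negative, 1 = positive follows the GLUE SST-2 labels.

```python
import math

# Label ids as used in the GLUE version of SST-2
ID2LABEL = {0: "negative", 1: "positive"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return ID2LABEL[best], probs[best]

# Hypothetical logits: a freshly initialized head typically yields scores that are
# close together, while a fine-tuned head separates the two classes more clearly.
untrained_logits = [0.03, -0.02]
finetuned_logits = [-2.1, 3.4]

print(predict(untrained_logits))
print(predict(finetuned_logits))
```

A probability near 0.5 for both classes is what to expect before training; fine-tuning is what pushes the two scores apart.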


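Next, we load the tokenizer that matches the model and use it to convert each sentence into input_ids and an attention_mask. The tokenized datasets are then formatted as PyTorch tensors so the Trainer can consume them.

In [10]: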
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Define a function to tokenize and preprocess the input data
def preprocess_function(examples):
    # Tokenize the sentences, padding to the maximum length and truncating longer inputs
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

# Apply the tokenization to the train and eval datasets
tokenized_train_data = train_data.map(preprocess_function, batched=True)
tokenized_eval_data = eval_data.map(preprocess_function, batched=True)

# Set the format to PyTorch tensors for compatibility with the Trainer
tokenized_train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
tokenized_eval_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]



Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]
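
The tokenization step above pads every sentence to a fixed length and truncates anything longer. The sketch below imitates that behaviour with a toy whitespace tokenizer so the effect of `padding="max_length"` and `truncation=True` is easy to see. The vocabulary, the `max_length` of 8, and the helper `toy_encode` are hypothetical; the real tokenizer uses WordPiece subwords and adds special tokens, which this sketch omits.

```python
PAD_ID = 0
UNK_ID = 1

# Hypothetical toy vocabulary, far smaller than the real tokenizer's
VOCAB = {"[PAD]": PAD_ID, "[UNK]": UNK_ID, "a": 2, "great": 3, "movie": 4, "boring": 5}

def toy_encode(sentence, max_length=8):
    """Map words to ids, truncate to max_length, then pad up to max_length."""
    ids = [VOCAB.get(word, UNK_ID) for word in sentence.lower().split()]
    ids = ids[:max_length]                      # truncation=True
    attention_mask = [1] * len(ids)             # 1 marks real tokens
    pad_count = max_length - len(ids)           # padding="max_length"
    return {
        "input_ids": ids + [PAD_ID] * pad_count,
        "attention_mask": attention_mask + [0] * pad_count,  # 0 marks padding
    }

short = toy_encode("A great movie")
long = toy_encode("a great movie " * 4)  # 12 words, longer than max_length

print(short)
print(long)
```

Because the notebook passes no explicit `max_length`, sentences are padded to the tokenizer's model maximum (512 tokens for `distilbert-base-uncased`). That is correct but costs memory and time on short SST-2 sentences, so a smaller `max_length` may be worth trying.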

In this section, we define our training parameters using the TrainingArguments class. Key parameters include:

- output_dir: Specifies where to save the model and checkpoints.
- evaluation_strategy: Determines how often to evaluate the model during training; here, we choose to evaluate at the end of each epoch.
- save_strategy: Similar to the evaluation strategy, it dictates when to save model checkpoints.
- logging_dir: The directory where training logs will be stored for monitoring.
- logging_steps: The frequency of logging metrics during training, helping us track progress.
- num_train_epochs: The total number of times the model will go through the training data; for initial testing, we use 1 but can increase this for better performance.
- learning_rate: Controls how much to change the model in response to the estimated error each time the model weights are updated.
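
To get a feel for what these settings imply, the sketch below estimates the number of optimizer steps and logging events for one epoch, and how the learning rate decays over training. It assumes the Trainer defaults that the notebook does not override: a per-device batch size of 8, a single device, no gradient accumulation, and a linear learning-rate schedule without warmup. The helper `training_schedule` is hypothetical and only mirrors the arithmetic, not the library's internals.

```python
import math

def training_schedule(num_examples, batch_size, num_epochs, logging_steps, learning_rate):
    """Estimate step counts and return a function giving the learning rate at a step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    total_steps = steps_per_epoch * num_epochs
    log_events = total_steps // logging_steps               # one log every logging_steps

    def lr_at(step):
        # Linear decay from learning_rate at step 0 down to 0 at the final step
        return learning_rate * max(0.0, (total_steps - step) / total_steps)

    return steps_per_epoch, total_steps, log_events, lr_at

steps_per_epoch, total_steps, log_events, lr_at = training_schedule(
    num_examples=67349,   # size of the SST-2 training split
    batch_size=8,         # assumed default per-device batch size
    num_epochs=1,
    logging_steps=10,
    learning_rate=2e-5,
)

print("steps per epoch:", steps_per_epoch)
print("logging events:", log_events)
print("learning rate halfway:", lr_at(total_steps // 2))
```

Under these assumptions one epoch is 8419 steps, so `logging_steps=10` produces several hundred log lines; a larger value keeps the output easier to read.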

In [11]:
# Set up training arguments for the Trainer
training_args = TrainingArguments(
   output_dir="./results",             # Directory to save model outputs and checkpoints
   evaluation_strategy="epoch",        # Evaluate the model after each epoch
   save_strategy="epoch",              # Save model checkpoints after each epoch
   logging_dir='./logs',               # Directory for saving training logs
   logging_steps=10,                   # Log metrics every 10 steps for monitoring
   num_train_epochs=1,                 # Number of epochs for training; can be increased for better results
   learning_rate=2e-5,                 # Learning rate for the optimizer; adjust for convergence
   report_to="none",                    # Disable reporting to W&B
)


Here, we instantiate the Trainer class, passing in the model, training arguments, and datasets for both training and evaluation. The Trainer class encapsulates the training process, handling the training loop, logging, and evaluation in a streamlined manner.

In [12]:
# Create a Trainer object, which handles the training loop, evaluation, and logging
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=train_data,
   eval_dataset=eval_data
)
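
Without a metric function, the Trainer reports only the loss during evaluation. It accepts an optional `compute_metrics` callable that receives the model's logits and the true labels. A minimal accuracy sketch follows, written in plain Python so it runs on its own; the sample logits and labels are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Made-up logits for four examples and their true labels
fake_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
fake_labels = [0, 1, 0, 0]

print(compute_metrics((fake_logits, fake_labels)))
```

Passing `compute_metrics=compute_metrics` when constructing the Trainer would add an accuracy entry to the evaluation results alongside the loss.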

This line starts the training process. The train() method runs the training loop, during which the model learns to classify the sentiment of the input text based on the training dataset.

In [13]:
# Train the model on the training dataset
trainer.train()

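ValueError: You have to specify either input_ids or inputs_embeds

This error occurs because the Trainer above was created with the raw train_data and eval_data instead of tokenized_train_data and tokenized_eval_data, so the model receives no input_ids. The consolidated cell at the end of this section passes the tokenized datasets.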
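
By default, the Trainer drops dataset columns that the model's `forward` method does not accept. A raw SST-2 record has only `sentence`, `label`, and `idx`, so nothing usable as model input survives that filtering. The sketch below mimics this in plain Python; the helper names and the simplified argument list are assumptions for illustration, not the library's actual implementation.

```python
# Simplified set of arguments accepted by the model's forward method (assumption)
FORWARD_ARGS = {"input_ids", "attention_mask", "inputs_embeds", "labels"}
# Label column names that are kept as well (simplified)
LABEL_COLUMNS = {"label", "label_ids"}

def keep_model_columns(columns):
    """Keep only the columns the model can consume, preserving their order."""
    allowed = FORWARD_ARGS | LABEL_COLUMNS
    return [name for name in columns if name in allowed]

def check_model_inputs(columns):
    """Raise the same kind of error the model raises when it gets no inputs."""
    kept = keep_model_columns(columns)
    if "input_ids" not in kept and "inputs_embeds" not in kept:
        raise ValueError("You have to specify either input_ids or inputs_embeds")
    return kept

raw_columns = ["sentence", "label", "idx"]
tokenized_columns = ["sentence", "label", "idx", "input_ids", "attention_mask"]

print(check_model_inputs(tokenized_columns))
try:
    check_model_inputs(raw_columns)
except ValueError as err:
    print("raw dataset fails:", err)
```

Passing the tokenized datasets to the Trainer avoids the error, since they contain `input_ids` and `attention_mask`.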

In [None]:
# Import libraries from Hugging Face Transformers and datasets
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

# Load the SST-2 dataset from the GLUE benchmark
dataset = load_dataset("glue", "sst2")
train_data = dataset["train"]
eval_data = dataset["validation"]

# Load the pre-trained model and tokenizer for sequence classification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Define a function to tokenize and preprocess the input data
def preprocess_function(examples):
    # Tokenize the sentences, padding to the maximum length and truncating longer inputs
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

# Apply the tokenization to the train and eval datasets
tokenized_train_data = train_data.map(preprocess_function, batched=True)
tokenized_eval_data = eval_data.map(preprocess_function, batched=True)

# Set the format to PyTorch tensors for compatibility with the Trainer
tokenized_train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
tokenized_eval_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Define training arguments (W&B disabled)
training_args = TrainingArguments(
    output_dir="./results",              # Directory to save model outputs
    evaluation_strategy="epoch",         # Evaluate the model after each epoch
    save_strategy="epoch",               # Save model checkpoints after each epoch
    logging_dir='./logs',                # Directory for saving training logs
    logging_steps=10,                    # Log metrics every 10 steps
    num_train_epochs=1,                  # Reduced to 1 epoch for quick example
    learning_rate=2e-5,                  # Learning rate for the optimizer
    report_to="none",                    # Disable reporting to W&B
)

# Create a Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_eval_data
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print("Evaluation Results:", eval_results)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Epoch,Training Loss,Validation Loss
