# Fine-Tuning test
This project fine-tunes a pre-trained BERT model for sentiment analysis on the IMDB dataset. BERT is pre-trained on large amounts of general text; fine-tuning updates its parameters on labeled movie reviews so that it learns to classify each review as positive or negative. This is transfer learning: the language representations learned during pre-training are reused for the binary classification task rather than learned from scratch.

# Install Necessary Libraries
- `transformers`: For pre-trained BERT and tokenization.
- `datasets`: For easy access to datasets like IMDB.
- `torch`: For PyTorch-based model training.





In [1]:
!pip install transformers datasets torch



# Load and Tokenize the Dataset
`Purpose`: Load the IMDB dataset and use a pre-trained BERT tokenizer to tokenize the text data.

`Why`: Tokenization is necessary because BERT models expect tokenized inputs.
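To give a feel for what tokenization does, here is a minimal sketch of WordPiece-style greedy longest-match splitting, the scheme BERT's tokenizer is based on. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and chosen only for illustration; the real `bert-base-uncased` vocabulary has roughly 30,000 entries and may split words differently.

```python
# Toy illustration of WordPiece-style tokenization (not the real BERT tokenizer).
# TOY_VOCAB is a hypothetical vocabulary made up for this example.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "this", "movie", "was", "amaz", "##ing", "!"}

def wordpiece(word, vocab=TOY_VOCAB):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def toy_tokenize(text):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().replace("!", " !").split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens

print(toy_tokenize("This movie was amazing!"))
# ['[CLS]', 'this', 'movie', 'was', 'amaz', '##ing', '!', '[SEP]']
```

The real tokenizer additionally maps each piece to an integer ID and returns an attention mask, which is the form the model consumes.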

In [2]:
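from datasets import load_dataset
from transformers import BertTokenizer

# Download the IMDB reviews and the tokenizer that matches the BERT checkpoint
dataset = load_dataset("imdb")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # Truncate to BERT's 512-token limit; padding is left to the data collator,
    # which pads each batch to its longest sequence
    return tokenizer(examples["text"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)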


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Prepare Data for PyTorch
`Purpose`: Convert the tokenized dataset to PyTorch tensors and set up a data collator that pads each batch dynamically.

`Why`: PyTorch models expect tensor inputs. The data collator pads every batch to the length of its longest sequence at training time, so short reviews are not padded further than their batch requires.
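The idea behind dynamic padding can be sketched in plain Python. This is only an illustration of the concept, not the `DataCollatorWithPadding` implementation; the helper `pad_batch` is hypothetical and the token IDs are illustrative.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two already-tokenized reviews of different lengths (illustrative IDs)
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2001, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because the target length is decided per batch, a batch of short reviews stays short instead of being padded to the model's 512-token maximum.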


In [3]:
from transformers import DataCollatorWithPadding

tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


# Set Up the Model
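`Purpose`: Load the pre-trained BERT model with a sequence classification head for the two sentiment labels. The encoder weights come from the checkpoint; the classification head is newly initialized and is learned during fine-tuning.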

`Why`: Using a pre-trained model like BERT gives us the advantage of transfer learning, where the model has already learned language representations and only needs to be fine-tuned for the specific task.

In [4]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Training Arguments
Define the training parameters, such as the number of `epochs`, the `learning rate`, the `batch sizes`, and the `evaluation strategy`.
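As a rough guide to what these settings imply, the sketch below works out the number of optimizer steps for a batch size of 8 over 3 epochs on the 25,000-review IMDB train split. It assumes a single device and no gradient accumulation. The `linear_lr` helper is hypothetical and assumes a linear decay to zero with no warmup, which is my understanding of the `Trainer` default.

```python
import math

train_examples = 25_000  # size of the IMDB train split
batch_size = 8           # per-device batch size, assuming one device
epochs = 3

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 3125 9375

def linear_lr(step, base_lr=2e-5, total=total_steps):
    """Learning rate under linear decay to zero with no warmup (assumed schedule)."""
    return base_lr * max(0.0, 1 - step / total)

print(linear_lr(0))            # starts at the configured learning rate
print(linear_lr(total_steps))  # reaches zero at the final step
```

Doubling the batch size halves the number of steps per epoch, which is worth keeping in mind when adjusting the learning rate.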

In [None]:
from transformers import TrainingArguments

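training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    evaluation_strategy="epoch",     # evaluate once per epoch; newer transformers releases name this `eval_strategy`
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)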


# Train the Model
Create a `Trainer` object to manage the training loop. The model is trained on the train split and evaluated on the test split at the end of each epoch. The `tokenizer` is passed so that it is saved alongside the model checkpoints; the text itself was already tokenized in the earlier step.
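Without a metrics function the `Trainer` reports only the loss at evaluation time. One possible addition, not part of the original notebook, is a `compute_metrics` function that reports accuracy. The sketch below uses plain Python so that it works on nested lists as well as on the arrays the `Trainer` passes in.

```python
def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Made-up logits for three reviews, with true labels 1, 0, 0
example = ([[0.1, 2.3], [1.5, -0.4], [0.2, 0.9]], [1, 0, 0])
print(compute_metrics(example))  # two of the three predictions are correct
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the accuracy would then appear next to the validation loss after each evaluation.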

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()




# Save, Reload the Fine-Tuned Model, and Predict Sentiment
In the final step, save the fine-tuned model and tokenizer for future use. Then reload both from disk and define a function that predicts the sentiment of a movie review (either `positive` or `negative`).


In [None]:
import torch

# Save the fine-tuned model and tokenizer together in one directory
model.save_pretrained("./sentiment_model")
tokenizer.save_pretrained("./sentiment_model")

# Reload both from disk
model = BertForSequenceClassification.from_pretrained("./sentiment_model")
tokenizer = BertTokenizer.from_pretrained("./sentiment_model")
model.eval()  # disable dropout for inference

def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # no gradients are needed for prediction
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    return "positive" if prediction == 1 else "negative"

print(predict_sentiment("This movie was amazing!"))
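The prediction function returns only the label with the larger logit. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. The sketch below shows the arithmetic in plain Python with made-up logits; the `describe` helper is hypothetical, and the label order assumes the IMDB encoding of 0 for negative and 1 for positive.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["negative", "positive"]  # index 0 / 1, assumed to match the IMDB labels

def describe(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, confidence = describe([-1.2, 2.1])  # made-up logits for one review
print(label, round(confidence, 3))
```

With a PyTorch model the same result could be obtained by applying `torch.softmax(outputs.logits, dim=-1)` before taking the argmax.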
