
#### **Introduction**

In this work, I address sentiment analysis on movie reviews using the IMDB movie review dataset. The motivation is the need to accurately assess sentiment in textual data, a critical component of applications such as marketing and customer feedback analysis. I use BERT, a state-of-the-art transformer-based language model, to classify reviews as positive or negative. The task involves data preprocessing, model fine-tuning, and evaluation. The expected outcome is a robust sentiment classification model that can efficiently predict the sentiment of movie reviews, improving text analytics capabilities.


### **Setting Up Environment**

In [None]:
!pip install transformers



##### Here, I have installed the transformers library

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.5.0,>=2023.1.0 (from fsspec[http]<=2024.5.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.5.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 547.8/547.8 kB 10.1 MB/s eta 0:00:00

##### In the code above, I have installed the datasets library

In [None]:
!pip install torch

Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch)
  Using cached nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.2.106 (from torch)
  Using cached nvidia_curand_cu12-10.3.2.106-py3-

##### Here, I have installed the torch library, which this work depends on

##### **Literature Review**

Sentiment analysis, a subfield of natural language processing (NLP), aims to determine the sentiment expressed in a piece of text. Early approaches relied on classical machine learning methods, including Naive Bayes and Support Vector Machines, typically combined with hand-crafted features (Alaparthi and Mishra, 2021). However, these methods struggled with the nuances and complexities of human language, which led to the development of more advanced techniques.
The development of transformer-based models, specifically BERT (Bidirectional Encoder Representations from Transformers), marked a significant step forward in NLP. BERT, which Prottasha et al. (2022) apply to sentiment analysis through supervised fine-tuning, uses a bidirectional training approach to understand the context of a word from its surrounding words, giving a more nuanced understanding of language. Pre-trained on a large corpus and then fine-tuned on specific tasks, BERT has set new benchmarks in several NLP tasks, including sentiment analysis (Geetha and Renuka, 2021). The IMDB dataset, a widely used benchmark for sentiment analysis, provides a suitable testbed for evaluating how well BERT classifies sentiment.

### **Importing Necessary Libraries**

In [None]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments
import re

##### **Understanding BERT**

BERT, or Bidirectional Encoder Representations from Transformers, represents a considerable advance in natural language processing (NLP). BERT uses a transformer architecture to capture bidirectional context within text (Zhang and Zhang, 2022). Unlike earlier models that read text in one direction, BERT processes text in both directions, allowing a deeper understanding of context and meaning.
BERT's training has two main stages: pre-training and fine-tuning. During pre-training, the model is exposed to large amounts of text and learns to predict missing words in sentences (Masked Language Modeling) and the relationship between sentence pairs (Next Sentence Prediction). This pre-training equips BERT with a rich understanding of language (Mutinda et al. 2023). Fine-tuning adapts BERT to specific tasks, such as sentiment analysis, by further training the model on task-specific datasets. This approach allows BERT to achieve state-of-the-art performance across several NLP tasks by leveraging its extensive pre-trained knowledge.
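
To make the Masked Language Modeling objective more concrete, here is a minimal toy sketch of how masked training examples can be built. It is an illustration only: the function name `mask_tokens`, the sample sentence, and the masking probability are my own assumptions, it works on whitespace-separated words rather than WordPiece tokens, and it always substitutes `[MASK]`, whereas BERT's actual pre-training also sometimes keeps the original token or swaps in a random one.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for a toy masked-language-modeling example.

    `targets` maps each masked position to the original token that the model
    would be trained to predict at that position.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


# Hypothetical sentence; a higher mask_prob than 0.15 is used so that a short
# sentence is likely to contain at least one masked word.
tokens = "the film was far better than i expected it to be".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The model sees `masked` as input and is scored only on the positions listed in `targets`, which is what lets it use context from both the left and the right of each gap.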

### **Loading Pre-trained BERT Model and Tokenizer**

In [None]:
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



#### Here, I have loaded the pre-trained BERT tokenizer

In [None]:
# Load pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##### The pre-trained BERT model is loaded for sequence classification

In [None]:
# Check if GPU is available and move model to GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

##### This code checks for GPU availability and moves the model to GPU if available

### **Simple Task: Text Classification**

In [None]:
# Sample text
texts = ["I love this movie!", "I hate this movie!"]

# Tokenize and encode the texts
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=512)

# Move input tensors to GPU if available
inputs = {key: value.to(device) for key, value in inputs.items()}

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# Get predictions
predictions = torch.argmax(outputs.logits, dim=-1)
print(predictions)

tensor([0, 0])


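##### The code tokenizes and encodes the text inputs, moves the tensors to the GPU if available, runs inference with the model, and prints the predicted class labels. Both sentences receive label 0 here because the classification head is newly initialized and has not yet been fine-tuned.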
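
To show how class labels relate to the underlying scores, the sketch below converts logits to probabilities with a softmax and then takes the argmax. The logit values are made up for illustration (they are not the notebook's actual outputs), and the helper `softmax` is written in plain Python so the example runs without a GPU or a model.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two sentences: [score for class 0, score for class 1].
example_logits = [[0.12, 0.05], [0.10, 0.02]]

for row in example_logits:
    probs = softmax(row)
    predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
    print(predicted, [round(p, 3) for p in probs])
```

With scores this close together, both rows get label 0 even though the probabilities are only slightly above 0.5, so an argmax on its own says little about how confident a model is.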

##### **Methodology**

The approach involves several steps to build a sentiment analysis model using BERT. First, the IMDB dataset is cleaned by removing HTML line-break tags and non-alphabetic characters, then tokenized using BERT's tokenizer. The dataset is split into training and validation sets. The BERT model is fine-tuned on the tokenized data with specified training parameters, including gradient accumulation and mixed precision. The Trainer API is used for training and evaluating the model. After training, the model and tokenizer are saved to Google Drive for future use. This approach supports efficient training and robust sentiment classification.

### **Fine-tune BERT**

In [None]:
reviews_data = pd.read_csv("/content/IMDB Dataset.csv")

##### Here, I have loaded the IMDB dataset using the Pandas library

In [None]:
reviews_data.shape

(999, 2)

##### The dataset has 999 rows and 2 columns.

In [None]:
# Data Cleaning Function
def clean_text(text):
    text = re.sub(r'<br />', ' ', text)  # Replace HTML line breaks (<br />) with a space
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace
    return text

# Apply the cleaning function to the reviews
reviews_data['review'] = reviews_data['review'].apply(clean_text)

##### The code defines a function that cleans text by replacing HTML line-break tags with spaces, removing non-alphabetic characters, converting to lowercase, and collapsing extra whitespace. It then applies this cleaning function to the 'review' column in the reviews_data DataFrame.
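
To see what the cleaning step does to a concrete review, the sketch below repeats the notebook's `clean_text` so that it runs on its own and applies it to two made-up strings. The sample inputs are my own; they are chosen to show that apostrophes and hyphens are dropped (joining word parts together) and that only the `<br />` tag is treated as markup.

```python
import re


def clean_text(text):
    text = re.sub(r'<br />', ' ', text)      # replace HTML line breaks with a space
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # drop everything except letters and whitespace
    text = text.lower()                      # convert to lowercase
    text = re.sub(r'\s+', ' ', text).strip() # collapse runs of whitespace
    return text


# Made-up inputs for illustration.
print(clean_text("I've seen it.<br /><br />Well-made, 10/10!"))  # ive seen it wellmade
print(clean_text("<b>great</b>"))                                # bgreatb
```

The second result shows that tags other than `<br />` lose their angle brackets but leave their letters behind, which matters if the raw reviews contain other markup.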

In [None]:
# Split the data into training and validation sets
train_data, val_data = train_test_split(reviews_data, test_size=0.2, random_state=42)

##### The dataset is split into a training set and a validation set in the ratio 8:2, so 80% of the rows go to the training set and 20% to the validation set.
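
As a quick sanity check on the split sizes, the arithmetic below works out how many of the 999 rows land in each set. It assumes that the held-out size is rounded up when `test_size` is a fraction, which I believe is how `train_test_split` behaves; the second part cross-checks that assumption against the throughput figures reported later by `trainer.evaluate()`.

```python
import math

n_rows = 999     # from reviews_data.shape
test_size = 0.2  # fraction passed to train_test_split

# Assumption: the validation size is rounded up and the rest is used for training.
n_val = math.ceil(test_size * n_rows)
n_train = n_rows - n_val
print(n_train, n_val)  # 799 200

# Cross-check using the evaluation log: samples per second * runtime in seconds.
samples_from_log = round(1.225 * 163.2144)
print(samples_from_log)  # 200
```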

In [None]:
train_data.head()

| | review | sentiment |
|---|---|---|
| 778 | i never watched the next action hero show and ... | positive |
| 286 | there have been many documentaries that i have... | positive |
| 165 | an american werewolf in london had some funny ... | negative |
| 960 | this was my first gaspar noe movie ive watched... | positive |
| 493 | an extremely downtoearth well made and acted r... | positive |


##### The first five rows of the training set are displayed

### **Tokenizing the Text**

In [None]:
def tokenize_data(data, tokenizer, max_length=256):
    return tokenizer(
        data['review'].tolist(),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

##### The tokenize_data function converts the 'review' column from a DataFrame to tokenized tensors, applying padding, truncation, and a maximum length of 256, using the provided tokenizer.
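
To illustrate what padding and truncation do to a batch, here is a toy sketch that works on plain lists of integers. It is not the tokenizer's implementation: the helper name `pad_and_truncate` and the token ids are hypothetical, and a real BERT tokenizer also keeps the special tokens in place when it truncates, which this simple slice does not.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all of them to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for a short and a long review.
batch = pad_and_truncate([[5, 6, 7], [5, 8, 9, 10, 11, 7]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 7, 0], [5, 8, 9, 10]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The result is rectangular, which is what allows the whole batch to be returned as a single tensor.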

In [None]:
# Tokenize the training and validation data
train_encodings = tokenize_data(train_data, tokenizer)
val_encodings = tokenize_data(val_data, tokenizer)

##### The code tokenizes both the training and validation data using the tokenize_data function, creating encoded tensors for each dataset to be used in model training and evaluation.

In [None]:
# Convert labels to tensors
train_labels = torch.tensor([1 if label == 'positive' else 0 for label in train_data['sentiment'].tolist()])
val_labels = torch.tensor([1 if label == 'positive' else 0 for label in val_data['sentiment'].tolist()])

##### The code converts sentiment labels from the training and validation datasets into tensors, assigning a value of 1 for 'positive' and 0 for other labels, to prepare for model training.
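
Because the conversion above maps every value other than 'positive' to 0, a misspelled or differently cased label would silently become negative. A slightly stricter variant is sketched below; the names `LABEL_TO_ID` and `encode_labels` are my own and are not part of the notebook.

```python
LABEL_TO_ID = {"negative": 0, "positive": 1}


def encode_labels(labels):
    """Map sentiment strings to integer ids, failing loudly on unknown values."""
    unknown = sorted({label for label in labels if label not in LABEL_TO_ID})
    if unknown:
        raise ValueError(f"Unexpected sentiment labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]


print(encode_labels(["positive", "negative", "positive"]))  # [1, 0, 1]
```

The returned list can be wrapped in `torch.tensor(...)` in the same way as in the cell above.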

### **Creating Dataset Class**

In [None]:
class IMDBDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: tensor[idx] for key, tensor in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

##### The IMDBDataset class defines a custom dataset for PyTorch, handling encoded texts and labels. It provides methods to retrieve items by index and determine the dataset's length.

In [None]:
# Create the training and validation datasets
train_dataset = IMDBDataset(train_encodings, train_labels)
val_dataset = IMDBDataset(val_encodings, val_labels)

##### The code creates IMDBDataset instances for training and validation data, using tokenized encodings and labels, to prepare datasets for model training and evaluation in PyTorch.

### **Fine-tune BERT Model**

In [None]:
# Define training arguments with gradient accumulation
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,  # Keep the batch size smaller
    gradient_accumulation_steps=4,  # Accumulate gradients over 4 steps
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='steps',
    fp16=True  # Enable mixed precision training
)



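##### The code sets up training arguments for the model using TrainingArguments. It specifies the number of epochs, batch sizes, gradient accumulation steps, and mixed precision training, along with the output and log directories and the evaluation strategy. Note that warmup_steps=500 exceeds the 75 optimizer steps the run actually performs, so the learning rate is still warming up when training ends.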
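
The interaction between batch size, gradient accumulation, and warmup is easier to see with the numbers written out. The sketch below assumes a single device, 799 training rows (80% of the 999-row file), and a linear warmup; the resulting step count matches the `global_step=75` reported by the training run.

```python
import math

n_train = 799                     # assumed: 80% of 999 rows
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 3
warmup_steps = 500

# Each optimizer update sees this many examples (single device assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

batches_per_epoch = math.ceil(n_train / per_device_train_batch_size)
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

# Assumption: linear warmup, so this is the fraction of the peak learning
# rate reached by the final update.
warmup_fraction_reached = min(1.0, total_updates / warmup_steps)

print(effective_batch_size, total_updates, warmup_fraction_reached)  # 32 75 0.15
```

Under these assumptions the run finishes at roughly 15% of the peak learning rate. A warmup of a few dozen steps, or a larger dataset, would let training reach the intended rate.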

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

##### The code initializes a Trainer object with the model, training arguments, and datasets, setting up everything needed to manage training and evaluation of the model.

In [None]:
# Train the model
trainer.train()

| Step | Training Loss | Validation Loss |
|---|---|---|
| 10 | 0.7029 | 0.683902 |
| 20 | 0.6923 | 0.681742 |
| 30 | 0.6807 | 0.675322 |
| 40 | 0.6821 | 0.663445 |
| 50 | 0.6788 | 0.664064 |
| 60 | 0.6599 | 0.646435 |
| 70 | 0.6338 | 0.627333 |


TrainOutput(global_step=75, training_loss=0.6726602745056153, metrics={'train_runtime': 6937.8807, 'train_samples_per_second': 0.345, 'train_steps_per_second': 0.011, 'total_flos': 315338599848960.0, 'train_loss': 0.6726602745056153, 'epoch': 3.0})

##### The code initiates the training process for the model using the Trainer object, which trains the model on the provided training dataset and evaluates it on the validation dataset.

### **Evaluating the Model**

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 0.6304202675819397, 'eval_runtime': 163.2144, 'eval_samples_per_second': 1.225, 'eval_steps_per_second': 0.153, 'epoch': 3.0}


##### The code evaluates the trained model using the Trainer object, obtaining performance metrics on the validation dataset, and prints the evaluation results to assess model effectiveness.
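
The evaluation output above contains only the loss. One way to also report accuracy is to give the Trainer a `compute_metrics` callable, which receives the model's predictions and the true labels. The function below is a minimal sketch written in plain Python so that it can be tried on toy data; the toy logits and labels are made up.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels), as supplied during evaluation."""
    logits, labels = eval_pred
    # Argmax over the class dimension for each example.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check with made-up logits and labels.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.5, 1.0]]
toy_labels = [0, 1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should add an accuracy entry (prefixed with `eval_`) to the evaluation results.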

##### **Results and Findings**

The developed language model is a fine-tuned BERT-based transformer designed for sentiment analysis on movie reviews. The key task is to classify reviews from the IMDB dataset as positive or negative. The model builds on the BERT architecture, which uses bidirectional context to capture the nuances of language.
Training involved preprocessing the IMDB dataset, including text cleaning and tokenization, followed by splitting it into training and validation sets. The BERT model was fine-tuned for three epochs with gradient accumulation and mixed precision to improve efficiency. The training logs show a steady decrease in training loss, from 0.7029 at step 10 to 0.6338 at step 70, with a mean of 0.6727 over the whole run, which indicates that the model is learning. Validation loss also improved, decreasing from 0.6839 to 0.6273 at step 70, and the final evaluation loss is 0.6304. These losses remain close to the chance-level cross-entropy of ln 2 ≈ 0.693 for a binary classifier, and no accuracy metric was computed, so the results show that the model has begun to separate positive from negative reviews rather than establishing how accurately it does so. Using BERT with suitable training techniques and dataset preparation aligns with best practices for sentiment analysis, supporting both the relevance of the task and the efficiency of the approach (Liu et al. 2020).
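
To put the reported evaluation loss in context, it helps to compare it with the cross-entropy of a classifier that assigns probability 0.5 to every review. The sketch below computes that baseline; the helper `binary_cross_entropy` is my own and only restates the standard definition.

```python
import math


def binary_cross_entropy(prob_positive, label):
    """Negative log-likelihood of the true label under the predicted probability."""
    p_true = prob_positive if label == 1 else 1.0 - prob_positive
    return -math.log(p_true)


chance_loss = binary_cross_entropy(0.5, 1)  # same value for either label
reported_eval_loss = 0.6304202675819397     # from trainer.evaluate() above

print(round(chance_loss, 4))                       # 0.6931
print(round(chance_loss - reported_eval_loss, 4))  # 0.0627
```

The gap of about 0.06 shows that the model does better than an uninformed guess, while leaving open how that translates into accuracy.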

### **Saving the Model**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##### The code mounts Google Drive to the Colab environment at /content/drive, allowing access to files stored in Drive for use in the Colab notebook.

In [None]:
model_save_path = '/content/drive/My Drive/sentiment-model'

# Save the trained model
trainer.save_model(model_save_path)

# Save the tokenizer
tokenizer.save_pretrained(model_save_path)

('/content/drive/My Drive/sentiment-model/tokenizer_config.json',
 '/content/drive/My Drive/sentiment-model/special_tokens_map.json',
 '/content/drive/My Drive/sentiment-model/vocab.txt',
 '/content/drive/My Drive/sentiment-model/added_tokens.json')

##### The code saves the trained model and tokenizer to Google Drive at the specified path, preserving them for future use or deployment.

##### **Conclusion**

The fine-tuned BERT model addresses the sentiment analysis task on the IMDB dataset, classifying movie reviews as positive or negative. The decrease in both training and validation loss indicates that the model is able to learn and generalize from the data. BERT's bidirectional capabilities and careful fine-tuning contributed to this performance. This approach demonstrates the usefulness of transformer-based models for sentiment analysis and provides a solid foundation for further improvements and applications in text classification tasks.

##### **References**

Alaparthi, S. and Mishra, M., 2021. BERT: A sentiment analysis odyssey. Journal of Marketing Analytics, 9(2), pp.118-126.

Geetha, M.P. and Renuka, D.K., 2021. Improving the performance of aspect based sentiment analysis using fine-tuned Bert Base Uncased model. International Journal of Intelligent Networks, 2, pp.64-69.

Liu, Y., Lu, J., Yang, J. and Mao, F., 2020. Sentiment analysis for e-commerce product reviews by deep learning model of Bert-BiGRU-Softmax. Mathematical Biosciences and Engineering, 17(6), pp.7819-7837.

Mutinda, J., Mwangi, W. and Okeyo, G., 2023. Sentiment analysis of text reviews using lexicon-enhanced bert embedding (LeBERT) model with convolutional neural network. Applied Sciences, 13(3), p.1445.

Prottasha, N.J., Sami, A.A., Kowsher, M., Murad, S.A., Bairagi, A.K., Masud, M. and Baz, M., 2022. Transfer learning for sentiment analysis using BERT based supervised fine-tuning. Sensors, 22(11), p.4157.

Zhang, Y. and Zhang, L., 2022. Movie recommendation algorithm based on sentiment analysis and LDA. Procedia Computer Science, 199, pp.871-878.
