# Sentiment Analysis using the BERT (Bidirectional Encoder Representations from Transformers) model

# Step 1: Install Libraries

In [None]:
# Install scikit-learn quietly, then the Hugging Face libraries, PyTorch, and pandas

!pip install -q scikit-learn
!pip install transformers huggingface_hub fsspec torch datasets pandas



# Step 2: Load Dataset (IMDb reviews)
         
We'll use Hugging Face's datasets library:

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")
print(dataset["train"][0])  # Check a sample review

ValueError: Invalid pattern: '**' can only be an entire path component

If you are facing this ValueError:
The traceback shows the ValueError (Invalid pattern: '**' can only be an entire path component) being raised inside the fsspec library, which the datasets library uses to handle file paths, particularly for remote files and glob patterns. The error is raised while load_dataset("imdb") is loading the "imdb" dataset.

The load_dataset function, when given a dataset name like "imdb", looks for the dataset definition script and data files on the Hugging Face Hub. It appears that internally, the file system abstraction (fsspec) is encountering an invalid glob pattern, likely when trying to find the data files for the "imdb" dataset. The specific error message suggests that the ** pattern is used incorrectly within a path component, rather than as a full component itself (e.g., dir/**/file.txt is valid, but dir/fi**le.txt is not).
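To make the rule in the error message concrete, here is a minimal sketch of a check that accepts `**` only when it is an entire path component. The helper name `double_star_is_whole_component` is hypothetical and this is not fsspec's actual implementation; it only illustrates the rule.

```python
def double_star_is_whole_component(pattern: str) -> bool:
    """Return True if every '**' in the pattern is an entire path component."""
    for part in pattern.split("/"):
        # A component may be exactly '**', but must not merely contain it.
        if "**" in part and part != "**":
            return False
    return True


print(double_star_is_whole_component("dir/**/file.txt"))  # True
print(double_star_is_whole_component("dir/fi**le.txt"))   # False
```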

Since we are providing a simple string ("imdb") and not a local path or a complex pattern, the issue is likely not with our input but with how the datasets library or its underlying file system abstraction constructs or interprets file patterns internally when accessing the "imdb" dataset resources on the Hugging Face Hub. This could be due to a temporary issue with the dataset's metadata or file structure on the Hub, or to a version incompatibility between the installed libraries.

# Suggested Changes
Since our code for loading the dataset is standard and the error is deep within the library's file handling, the most likely cause is an environmental issue or a temporary problem with accessing the dataset resources.

**Ensure Internet Connectivity:** The load_dataset function needs to download the dataset definition script and potentially the data files from the internet. Verify that the environment has stable internet access.
**Clear the Cache:** Sometimes, corrupted or outdated cached files can cause issues. Clearing the datasets cache might resolve the problem.
**Check Library Versions:** While the installed libraries (transformers, torch, datasets, pandas, sklearn) are listed, ensuring compatible versions might help. However, the traceback doesn't strongly suggest a version conflict as the primary cause of this specific error.
**Retry:** Temporary network issues or server-side problems on the Hugging Face Hub can cause transient errors. Retrying the load_dataset call after a short while might resolve the issue.

Given the nature of the error and the standard use of load_dataset, no code change is needed in our script itself. The troubleshooting steps involve checking the environment and potentially clearing the cache.
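Before changing anything, it can help to record which versions are actually installed. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the libraries named in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


libraries = ["datasets", "huggingface_hub", "fsspec", "transformers", "torch"]
for name, ver in installed_versions(libraries).items():
    print(f"{name}: {ver or 'not installed'}")
```

Keeping this output next to the traceback makes it easier to tell later whether an upgrade changed anything.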

Here's how to clear the cache in a Jupyter environment:

In [None]:
# Import necessary modules
from datasets.utils.logging import disable_progress_bar
import os
import shutil

# Disable progress bars for cleaner output if desired
disable_progress_bar()

# Define the default cache directory
# This is the typical default location for datasets cache
datasets_cache_dir = os.path.expanduser("~/.cache/huggingface/datasets")

# Check if the cache directory exists
if os.path.exists(datasets_cache_dir):
    print(f"Clearing datasets cache at: {datasets_cache_dir}")
    try:
        shutil.rmtree(datasets_cache_dir)
        print("Cache cleared successfully.")
    except Exception as e:
        print(f"Error clearing cache: {e}")
else:
    print(f"Datasets cache directory not found at: {datasets_cache_dir}")

# Now, try loading the dataset again
from datasets import load_dataset

try:
    dataset = load_dataset("imdb")
    print("Dataset loaded successfully after clearing cache.")
    print(dataset["train"][0]) # Check a sample review
except Exception as e:
    print(f"Failed to load dataset after clearing cache: {e}")

Clearing datasets cache at: /root/.cache/huggingface/datasets
Cache cleared successfully.
Failed to load dataset after clearing cache: Invalid pattern: '**' can only be an entire path component


# Still facing ValueError
Clearing the cache did not resolve the ValueError: Invalid pattern: '**' can only be an entire path component. This indicates that the issue is likely not due to a corrupted local cache of the dataset metadata or files.

Since the error persists and is still within the fsspec library when trying to load the "imdb" dataset, the problem is likely originating from:

**A temporary issue on the Hugging Face Hub:** The metadata or file structure for the "imdb" dataset on the Hub might be temporarily misconfigured, causing the fsspec library to encounter an invalid glob pattern during the internal process of discovering or accessing the data files.
**A version incompatibility:** Although the initial analysis didn't strongly point to this, there might be a subtle incompatibility between the specific versions of datasets, huggingface_hub, and fsspec that are currently installed in your environment. This incompatibility could lead to incorrect pattern construction or interpretation when interacting with the Hub.
**Network issues preventing proper metadata retrieval:** Even if the cache is clear, transient network problems could prevent the datasets library from correctly fetching the necessary information from the Hugging Face Hub to construct valid file paths.

Given that clearing the cache didn't work, the most probable causes are a temporary Hub issue or a version incompatibility.

# Suggested Changes
Since the error is still happening after clearing the cache, and the code for loading the dataset is standard, we should focus on addressing potential version issues or giving the Hub another chance.

**Retry Loading:** As suggested before, temporary issues on the Hugging Face Hub can occur. Try running the load_dataset("imdb") command again after a short while (e.g., a few minutes) to see if it was a transient problem.
**Specify a revision:** The datasets library allows us to specify a particular revision (a commit hash or tag) of the dataset on the Hugging Face Hub. This can sometimes help if the main branch of the dataset repository on the Hub has a temporary issue. However, finding a stable recent revision might require checking the dataset's repository on the Hub.
**Upgrade Libraries:** While we don't have definitive proof of a version conflict, upgrading the relevant libraries to their latest versions is a common troubleshooting step for issues deep within library code. This ensures you have the latest bug fixes and compatibility improvements.
Given the simplicity of the load_dataset("imdb") call, attempting a retry and then upgrading the libraries are the most straightforward next steps.
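The retry suggestion can be sketched as a small wrapper with exponential backoff. The helper `retry` and its parameters are hypothetical, not part of the datasets library, and the `load_dataset` call is shown only as a comment because it needs network access.

```python
import time


def retry(load_fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call load_fn until it succeeds, doubling the wait after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return load_fn()
        except Exception as error:  # broad on purpose: notebook-level retry
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# In the notebook, usage might look like:
# dataset = retry(lambda: load_dataset("imdb"), attempts=3, base_delay=60)
```

A retry only helps with transient failures. If the same ValueError comes back on every attempt, the cause is more likely a version mismatch than a temporary Hub problem.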

Let's try upgrading the relevant libraries: datasets, huggingface_hub, and fsspec.

Once the pip command has run, the runtime asks for a restart. After restarting, comment out the pip command with '#' so that it does not run again.

In [None]:
# Upgrade relevant libraries
#!pip install --upgrade datasets huggingface_hub fsspec

# Now, try loading the dataset again after upgrading
from datasets import load_dataset

try:
    dataset = load_dataset("imdb")
    print("Dataset loaded successfully after upgrading libraries.")
    print(dataset["train"][0]) # Check a sample review
except Exception as e:
    print(f"Failed to load dataset after upgrading libraries: {e}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset loaded successfully after upgrading libraries.
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographi

# Step 3: Tokenize the Data
BERT requires tokenization with padding/truncation:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]
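To show what padding="max_length" and truncation=True do to each review, here is a toy sketch in plain Python. It is an illustration, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the ids other than 101 ([CLS]), 102 ([SEP]) and 0 ([PAD]) are placeholders, and unlike the real tokenizer it does not keep the closing [SEP] token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to exactly max_length and build its attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short_review = [101, 7, 8, 102]          # shorter than max_length: gets padded
long_review = [101, 7, 8, 9, 10, 102]    # longer than max_length: gets cut

print(pad_or_truncate(short_review, max_length=6))
print(pad_or_truncate(long_review, max_length=4))
```

With max_length=512, every example in the mapped dataset ends up with exactly 512 input ids, which is why batches can be stacked into fixed-size tensors.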

# Step 4: Load Pre-Trained BERT Model

In [None]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2) # 2 classes

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Step 5: Fine-Tune with Trainer

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",  # Older keyword; newer transformers releases expect eval_strategy
    save_strategy="epoch",
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(1000)), # Use subset for quick testing
    eval_dataset=tokenized_datasets["test"].select(range(100)),
)

trainer.train()

TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'
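This TypeError means the installed transformers release no longer accepts the evaluation_strategy keyword. If the same notebook has to run against different releases, one option is to check the signature before building the arguments. The helper `first_supported_kwarg` below is hypothetical, and the TrainingArguments usage appears only in comments because it depends on the installed library.

```python
import inspect


def first_supported_kwarg(func, candidates):
    """Return the first name in candidates that func accepts as a parameter, else None."""
    params = inspect.signature(func).parameters
    for name in candidates:
        if name in params:
            return name
    return None


# In the notebook, usage might look like:
# from transformers import TrainingArguments
# key = first_supported_kwarg(TrainingArguments, ["eval_strategy", "evaluation_strategy"])
# training_args = TrainingArguments(output_dir="./results", **{key: "epoch"})
```

Listing eval_strategy first means the newer name wins whenever a release accepts both.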

In [None]:
# Upgrade relevant libraries, including torch and torchvision this time.
# cu118 is a common choice; use the CUDA wheel index that matches your environment.
!pip install --upgrade transformers datasets huggingface_hub fsspec torch torchvision --index-url https://download.pytorch.org/whl/cu118

# Re-run the necessary steps to define model and tokenized_datasets

from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
import torch # Import torch explicitly

# Load Dataset
# Reload the dataset. If loading still fails, passing a specific revision to load_dataset is one option.
# Note: the TypeError above came from the evaluation_strategy keyword, which is replaced by eval_strategy below.
try:
    dataset = load_dataset("imdb")
    print("Dataset loaded successfully.")
except Exception as e:
    print(f"Failed to load dataset: {e}")
    # As a fallback, if "imdb" still fails, you could try a different small dataset
    # dataset = load_dataset("emotion", 'split', split='train[:1000]') # Example of another dataset

# Tokenize the Data
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Handle potential dataset loading failure by checking if dataset is defined
if 'dataset' in locals():
    tokenized_datasets = dataset.map(tokenize_function, batched=True)
else:
    print("Dataset was not loaded successfully. Skipping tokenization and training.")
    # Exit or handle the error appropriately if dataset loading failed

# Load Pre-Trained BERT Model
# Check if tokenized_datasets was successfully created before proceeding
if 'tokenized_datasets' in locals():
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2) # 2 classes

    # Fine-Tune with Trainer
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        eval_strategy="epoch", # Corrected keyword argument
        save_strategy="epoch",
        logging_dir="./logs",
        report_to="none" # Add this to potentially avoid issues with reporting
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        # Ensure datasets exist before selecting slices
        train_dataset=tokenized_datasets["train"].select(range(min(1000, len(tokenized_datasets["train"])))), # Use subset for quick testing, ensure range is valid
        eval_dataset=tokenized_datasets["test"].select(range(min(100, len(tokenized_datasets["test"])))), # Ensure range is valid
    )

    trainer.train()
else:
    print("Skipping model loading and training due to dataset loading failure.")

Looking in indexes: https://download.pytorch.org/whl/cu118
Dataset loaded successfully.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,No log,8.7e-05
2,No log,5.6e-05
3,No log,4.9e-05
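The "No log" entries in the Training Loss column are most likely a logging-interval effect rather than a failure. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, and a logging interval of 500 steps, which we assume is the default when logging_steps is not set.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps for a run with one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(num_examples=1000, batch_size=8, epochs=3)
logging_steps = 500  # assumed default logging interval

print("total steps:", steps)
print("training loss logged at least once:", steps >= logging_steps)
```

Under these assumptions the run finishes before the first logging step, so passing a smaller logging_steps value to TrainingArguments (for example 50) should fill in the column.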




# Step 6: Evaluate the Model

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

predictions = trainer.predict(tokenized_datasets["test"].select(range(100)))
preds = np.argmax(predictions.predictions, axis=-1)
accuracy = accuracy_score(predictions.label_ids, preds)
print("Accuracy:", accuracy)



Accuracy: 1.0
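An accuracy of exactly 1.0 after training on 1,000 examples deserves a second look. The IMDb splits appear to be stored sorted by label, so select(range(1000)) and select(range(100)) may each return reviews from a single class, in which case the model only has to predict that one class. The sketch below uses a plain list as a stand-in for a label-sorted split (an assumption, not the real dataset) to show the effect, with the datasets equivalent in a comment.

```python
import random

# Stand-in for a 25,000-example split stored with all label-0 rows first.
labels = [0] * 12500 + [1] * 12500

first_slice = labels[:1000]                        # what select(range(1000)) would see
shuffled_slice = random.Random(42).sample(labels, 1000)

print("classes in first 1000 rows:", sorted(set(first_slice)))
print("classes in shuffled sample:", sorted(set(shuffled_slice)))

# In the notebook, the equivalent might look like:
# train_subset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
# eval_subset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))
```

Shuffling before selecting gives both the training and evaluation subsets a mix of positive and negative reviews, so the reported accuracy reflects real classification.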


# Step 7: Save & Use the Fine-Tuned Model

In [None]:
model.save_pretrained("./my_bert_sentiment")
tokenizer.save_pretrained("./my_bert_sentiment")

# Load later for inference
from transformers import pipeline
classifier = pipeline("text-classification", model="./my_bert_sentiment", tokenizer="./my_bert_sentiment")
print(classifier("It's a good movie"))

Device set to use cpu


[{'label': 'LABEL_0', 'score': 0.9998157620429993}]
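The pipeline reports generic names such as LABEL_0 because no label names were configured on the model. In the IMDb dataset, label 0 is negative and label 1 is positive, so the output above says the model rated "It's a good movie" as negative with high confidence. Here is a minimal sketch that maps the generic names to readable ones after the fact; the helper `readable` and the mapping `LABEL_NAMES` are hypothetical.

```python
# Assumed IMDb convention: 0 = negative, 1 = positive.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(results, names=LABEL_NAMES):
    """Replace generic pipeline labels with readable names, leaving unknown labels as-is."""
    return [
        {"label": names.get(item["label"], item["label"]), "score": item["score"]}
        for item in results
    ]


print(readable([{"label": "LABEL_0", "score": 0.9998157620429993}]))
```

A negative prediction for a clearly positive sentence suggests the fine-tuned model should be checked on a balanced evaluation set before it is used.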


This section of the code is responsible for saving the trained BERT model and its corresponding tokenizer to disk. Saving the model and tokenizer allows you to reuse them later without needing to retrain the model. After saving, the code demonstrates how to load the saved components and use them with a pipeline for performing sentiment analysis on new text.

First, the code saves the fine-tuned BERT model:

model.save_pretrained("./my_bert_sentiment")

The save_pretrained() method is provided by the transformers library. It takes a directory path as an argument. This method will save the model's configuration and weights into files within the specified directory, which is "./my_bert_sentiment" in this case.

Next, the code saves the tokenizer used with the model:

tokenizer.save_pretrained("./my_bert_sentiment")

Similarly, the tokenizer.save_pretrained() method saves the tokenizer's vocabulary and configuration files into the same directory specified for the model. It's crucial to save the tokenizer along with the model, as the model expects input text to be tokenized in a specific way, which is defined by the tokenizer it was trained with.

After saving, the code demonstrates how to load and use the saved model and tokenizer for inference (making predictions).

# Load later for inference
from transformers import pipeline
classifier = pipeline("text-classification", model="./my_bert_sentiment", tokenizer="./my_bert_sentiment")
print(classifier("It's a good movie"))

This part first imports the pipeline function from the transformers library. The pipeline function provides a high-level API for performing various tasks, including text classification, using pre-trained or fine-tuned models.

A pipeline object is created for the "text-classification" task. The model and tokenizer arguments are set to the directory where the model and tokenizer were saved ("./my_bert_sentiment"). The pipeline automatically loads the model and tokenizer from this directory.

Finally, the classifier pipeline predicts the sentiment of the text string "It's a good movie", and the result is printed to the console. Because no label names were configured, the result uses the generic labels LABEL_0 and LABEL_1 (in the IMDb dataset, 0 is negative and 1 is positive) together with a confidence score. This demonstrates how you can load your saved model and use it to make predictions on new data.