<a href="https://colab.research.google.com/github/TirendazAcademy/NLP-with-Transformers/blob/main/Sentiment-Analysis-using-Hugging-Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <p style="background-color: #841818;font-family:newtimeroman;font-size:200%;color:white;text-align:center;border-radius:10px 10px;"><b>Sentiment Analysis with Hugging Face</b></p>

![](https://d3caycb064h6u1.cloudfront.net/wp-content/uploads/2021/06/sentimentanalysishotelgeneric-2048x803-1.jpg)
[Image Source](https://d3caycb064h6u1.cloudfront.net/wp-content/uploads/2021/06/sentimentanalysishotelgeneric-2048x803-1.jpg)

<a id="toc"></a>
# **Table of Contents**

**1.**  [**What is Sentiment Analysis?**](#Step1)<br>
**2.**  [**Sentiment Analysis using Pipeline**](#Step2)<br>
**3.**  [**Building a Sentiment Analysis Model**](#Step3)<br>
**4.**  [**Preprocessing Data**](#Step4)<br>
**5.**  [**Training the Model**](#Step5)<br>
**6.**  [**Evaluating the Model**](#Step6)<br>
**7.**  [**Predicting New Data**](#Step7)<br>

<a id="Step1"></a>
# <p style="background-color: #841818;font-family:newtimeroman;font-size:125%;color:white;text-align:center;border-radius:10px 10px;"><b>1. What is Sentiment Analysis?</b></p>
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color: #E0BB20" data-toggle="popover">Content</a>

Sentiment analysis is one of the most common tasks in NLP. It aims to identify the polarity of a given text, that is, whether the text expresses a positive or a negative opinion.

<a id="Step2"></a>
# <p style="background-color: #841818;font-family:newtimeroman;font-size:125%;color:white;text-align:center;border-radius:10px 10px;"><b>2. Sentiment Analysis using Pipeline</b></p>
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color: #E0BB20" data-toggle="popover">Content</a>

First, let's install the necessary libraries.

In [1]:
!pip install datasets transformers huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Next, let's use the `pipeline` function to make predictions with a model available on the Hub.

In [2]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I like you", "I hate this movie"]
sentiment_pipeline(data)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998695850372314},
 {'label': 'NEGATIVE', 'score': 0.9996687173843384}]
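To make the `score` values above less opaque, here is a minimal sketch of how a two-class classifier's raw outputs (logits) are typically turned into a label and a score: a softmax converts the logits into probabilities, and the most probable class is reported. The logits below are hypothetical values chosen for illustration, not the ones the model produced, and `to_prediction` is a helper name introduced here, not part of the library.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, id2label):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one sentence, in (NEGATIVE, POSITIVE) order.
example_logits = [-4.2, 4.6]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

prediction = to_prediction(example_logits, id2label)
print(prediction)
```

A large gap between the two logits gives a score close to 1, which is why confident predictions show scores such as 0.9998.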

The pipeline labels the first sentence as positive and the second as negative, both with high confidence. Next, let's perform sentiment analysis with our own model.

<a id="Step3"></a>
# <p style="background-color: #841818;font-family:newtimeroman;font-size:125%;color:white;text-align:center;border-radius:10px 10px;"><b>3. Building a Sentiment Analysis Model </b></p>
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color: #E0BB20" data-toggle="popover">Content</a>

Let's install git-lfs, which git needs to handle the large files in our model repository, and check that a GPU is available:

In [3]:
!apt-get install git-lfs
import torch
torch.cuda.is_available()

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 5 not upgraded.


True

<a id="Step4"></a>
# <p style="background-color: #841818;font-family:newtimeroman;font-size:125%;color:white;text-align:center;border-radius:10px 10px;"><b>4. Preprocessing Data  </b></p>
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color: #E0BB20" data-toggle="popover">Content</a>

Let's use the `Datasets` library to download and preprocess the IMDB dataset.

In [4]:
from datasets import load_dataset
imdb = load_dataset("imdb")
print(imdb)



  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


Let's create smaller datasets to enable faster training and testing:

In [5]:
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(1000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))
print(small_train_dataset)
print(small_test_dataset)



Dataset({
    features: ['text', 'label'],
    num_rows: 1000
})
Dataset({
    features: ['text', 'label'],
    num_rows: 300
})


Let's use the DistilBERT tokenizer to preprocess our data:

In [6]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Let's prepare the text inputs for the model with the map method:

In [7]:
def preprocess_function(examples):
    # Tokenize the review text; truncation=True cuts inputs longer than the model's maximum length
    return tokenizer(examples["text"], truncation=True)
 
tokenized_train = small_train_dataset.map(preprocess_function, batched=True)
tokenized_test = small_test_dataset.map(preprocess_function, batched=True)



  0%|          | 0/1 [00:00<?, ?ba/s]
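Many IMDB reviews are longer than DistilBERT's limit of 512 positions, which is why `truncation=True` matters. The following is a rough sketch of the idea, not the tokenizer's actual implementation: keep at most 512 ids, reserving two slots for the special tokens at the start and end. The function name `truncate` is hypothetical, and the ids 101 and 102 are assumed here to be the `[CLS]` and `[SEP]` ids of the uncased BERT vocabulary.

```python
MAX_LENGTH = 512  # DistilBERT's maximum sequence length

def truncate(token_ids, max_length=MAX_LENGTH, cls_id=101, sep_id=102):
    """Keep at most max_length ids, including the two special tokens."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    return [cls_id] + body + [sep_id]

# A hypothetical long review of 700 word-piece ids.
long_review = list(range(1000, 1700))
# A hypothetical short review of 3 word-piece ids.
short_review = [2023, 3185, 2307]

print(len(truncate(long_review)))   # capped at the maximum length
print(len(truncate(short_review)))  # short inputs only gain the special tokens
```

Short inputs are left as they are; only the tail of over-long reviews is dropped.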

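Let's use a data collator to convert the training samples to PyTorch tensors and batch them with the correct amount of padding. Padding each batch only up to its longest sequence, rather than padding the whole dataset to one fixed length, speeds up training: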

In [8]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
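To illustrate what padding a batch means, here is a minimal pure-Python sketch of dynamic padding. It is not the collator's real code: `pad_batch` is a hypothetical helper, the token ids are made up, and it returns plain lists instead of PyTorch tensors. The pad id of 0 matches the `pad_token_id` of DistilBERT.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized samples of different lengths.
batch = [[101, 2023, 102], [101, 2023, 3185, 2003, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the target length is computed per batch, a batch of short reviews stays short instead of being padded to 512 tokens.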

<a id="Step5"></a>
# <p style="background-color: #841818;font-family:newtimeroman;font-size:125%;color:white;text-align:center;border-radius:10px 10px;"><b>5. Training the Model   </b></p>
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color: #E0BB20" data-toggle="popover">Content</a>

We can now fine-tune the DistilBERT model on our dataset. First, let's define DistilBERT as the base model:

In [9]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

Next, let's define the metrics we'll use to evaluate how good the fine-tuned model is:

In [10]:
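import numpy as np
from datasets import load_metric

# Load the metrics once instead of on every evaluation call.
# Note: load_metric is deprecated in recent versions of datasets in favor of the evaluate library.
accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}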
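For readers who want to see what these two metrics measure, here is a small hand-rolled sketch of accuracy and binary F1. It is an illustration only, not how the metric library computes them internally; the function name `accuracy_and_f1` and the toy predictions are made up for this example.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and the F1 score of the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Toy example: 6 predictions against 6 true labels.
preds = [1, 0, 1, 1, 0, 0]
refs = [1, 0, 0, 1, 1, 0]
scores = accuracy_and_f1(preds, refs)
print(scores)
```

Accuracy counts every correct prediction, while F1 is the harmonic mean of precision and recall on the positive class, so it penalizes a model that ignores one of the two kinds of error.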

After that, let's log in to your Hugging Face account so that you can manage your model repositories:

In [11]:
from huggingface_hub import notebook_login
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


Before training the model, we need to specify the training arguments:

In [12]:
from transformers import TrainingArguments 

repo_name = "sentiment-model-on-imdb-dataset"

training_args = TrainingArguments(
   output_dir=repo_name,
   learning_rate=2e-5,
   per_device_train_batch_size=8,
   per_device_eval_batch_size=8,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
   push_to_hub=True,
)

Let's define a Trainer with all the objects you constructed up to this point:

In [13]:
from transformers import Trainer
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)

/content/sentiment-model-on-imdb-dataset is already a clone of https://huggingface.co/Tirendaz/sentiment-model-on-imdb-dataset. Make sure you pull the latest changes with `repo.git_pull()`.


Let's fine-tune the model on the sentiment analysis dataset!

In [14]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 250
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss


Saving model checkpoint to sentiment-model-on-imdb-dataset/checkpoint-125
Configuration saved in sentiment-model-on-imdb-dataset/checkpoint-125/config.json
Model weights saved in sentiment-model-on-imdb-dataset/checkpoint-125/pytorch_model.bin
tokenizer config file saved in sentiment-model-on-imdb-dataset/checkpoint-125/tokenizer_config.json
Special tokens file saved in sentiment-model-on-imdb-dataset/checkpoint-125/special_tokens_map.json
tokenizer config file saved in sentiment-model-on-imdb-dataset/tokenizer_config.json
Special tokens file saved in sentiment-model-on-imdb-dataset/special_tokens_map.json
Saving model checkpoint to sentiment-model-on-imdb-dataset/checkpoint-250
Configuration saved in sentiment-model-on-imdb-dataset/checkpoint-250/config.json
Model weights saved in sentiment-model-on-imdb-dataset/checkpoint-250/pytorch_model.bin
tokenizer config file saved in sentiment-model-on-imdb-dataset/checkpoint-250/tokenizer_config.json
Special tokens file saved in sentiment-mod

TrainOutput(global_step=250, training_loss=0.39688150024414065, metrics={'train_runtime': 113.4686, 'train_samples_per_second': 17.626, 'train_steps_per_second': 2.203, 'total_flos': 249496135959264.0, 'train_loss': 0.39688150024414065, 'epoch': 2.0})
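The step counts in the training log follow directly from the dataset size and the training arguments. Here is a short worked sketch of that arithmetic, assuming a single device and no gradient accumulation, as the log reports:

```python
import math

num_examples = 1000  # size of the small training set
batch_size = 8       # per_device_train_batch_size
num_epochs = 2       # num_train_epochs

# Each optimization step consumes one batch.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# With save_strategy="epoch", a checkpoint is written at the end of every epoch.
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps, checkpoint_steps)
```

This accounts for the 250 total optimization steps and for the two checkpoint folders, `checkpoint-125` and `checkpoint-250`.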

<a id="Step6"></a>
# <p style="background-color: #841818;font-family:newtimeroman;font-size:125%;color:white;text-align:center;border-radius:10px 10px;"><b>6. Evaluating the Model   </b></p>
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color: #E0BB20" data-toggle="popover">Content</a>

We fine-tuned a DistilBERT model for sentiment analysis! Now, let's compute the evaluation metrics to see how good your model is: 

In [15]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 300
  Batch size = 8


{'eval_loss': 0.3693915009498596,
 'eval_accuracy': 0.85,
 'eval_f1': 0.8543689320388349,
 'eval_runtime': 5.594,
 'eval_samples_per_second': 53.629,
 'eval_steps_per_second': 6.793,
 'epoch': 2.0}
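As a quick sanity check on how to read this dictionary, the throughput figures can be recomputed from the number of examples, the batch size, and the runtime. The values below are copied from the output above; the variable names are chosen for this sketch.

```python
import math

num_eval_examples = 300
eval_batch_size = 8
eval_runtime = 5.594   # seconds, from 'eval_runtime'
eval_accuracy = 0.85   # from 'eval_accuracy'

# The last batch is smaller than 8, so the number of steps is rounded up.
eval_steps = math.ceil(num_eval_examples / eval_batch_size)

samples_per_second = round(num_eval_examples / eval_runtime, 3)
steps_per_second = round(eval_steps / eval_runtime, 3)

# Accuracy is the share of correct predictions, so this is the count of correct reviews.
num_correct = round(eval_accuracy * num_eval_examples)

print(eval_steps, samples_per_second, steps_per_second, num_correct)
```

In other words, an accuracy of 0.85 means that 255 of the 300 test reviews were classified correctly.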

We have now trained and evaluated a model for sentiment analysis. On the 300 test reviews, it reaches an accuracy of 85% and an F1 score of about 0.85.

<a id="Step7"></a>
# <p style="background-color: #841818;font-family:newtimeroman;font-size:125%;color:white;text-align:center;border-radius:10px 10px;"><b>7. Predicting New Data </b></p>
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color: #E0BB20" data-toggle="popover">Content</a>

It's time to use our model to analyze new data and get predictions! Let's upload the model to the Hub first:

In [16]:
trainer.push_to_hub()

Saving model checkpoint to sentiment-model-on-imdb-dataset
Configuration saved in sentiment-model-on-imdb-dataset/config.json
Model weights saved in sentiment-model-on-imdb-dataset/pytorch_model.bin
tokenizer config file saved in sentiment-model-on-imdb-dataset/tokenizer_config.json
Special tokens file saved in sentiment-model-on-imdb-dataset/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 3.34k/255M [00:00<?, ?B/s]

Upload file runs/Nov29_19-07-18_048e2adac5ef/events.out.tfevents.1669748978.048e2adac5ef.974.2: 100%|#########…

Upload file runs/Nov29_19-07-18_048e2adac5ef/events.out.tfevents.1669748854.048e2adac5ef.974.0:  86%|########6…

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/Tirendaz/sentiment-model-on-imdb-dataset
   67f8f8f..853f04c  main -> main

To https://huggingface.co/Tirendaz/sentiment-model-on-imdb-dataset
   853f04c..fd083aa  main -> main




'https://huggingface.co/Tirendaz/sentiment-model-on-imdb-dataset/commit/853f04c0b1aca3362a114945032463a16ac1e203'

Let's use our model on the Hub to predict new data:

In [17]:
from transformers import pipeline
sentiment_model = pipeline(model="Tirendaz/sentiment-model-on-imdb-dataset")
sentiment_model(["I like this move", "This movie is bad!"])

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--Tirendaz--sentiment-model-on-imdb-dataset/snapshots/fd083aa860121a4a0c2355ffe23ab1bf8e310846/config.json
Model config DistilBertConfig {
  "_name_or_path": "Tirendaz/sentiment-model-on-imdb-dataset",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "vocab_size": 30522
}



Downloading:   0%|          | 0.00/2.93k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--Tirendaz--sentiment-model-on-imdb-dataset/snapshots/fd083aa860121a4a0c2355ffe23ab1bf8e310846/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at Tirendaz/sentiment-model-on-imdb-dataset.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--Tirendaz--sentiment-model-on-imdb-dataset/snapshots/fd083aa860121a4a0c2355ffe23ab1bf8e310846/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Tirendaz--sentiment-model-on-imdb-dataset/snapshots/fd083aa860121a4a0c2355ffe23ab1bf8e310846/tokenizer.json
loading file added_

[{'label': 'LABEL_1', 'score': 0.7537323236465454},
 {'label': 'LABEL_0', 'score': 0.9140784740447998}]

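It's done. The label of the first example is LABEL_1 and the label of the second example is LABEL_0. In the IMDB dataset, label 1 means a positive review and label 0 means a negative review, so the model classified the first sentence as positive and the second as negative.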
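The generic names `LABEL_0` and `LABEL_1` appear because no label names were stored in the model configuration. Here is a minimal sketch of one way to make the predictions readable after the fact, assuming the IMDB convention that 0 is negative and 1 is positive. The helper `rename_labels` and the mapping are introduced here for illustration and are not part of the pipeline API.

```python
# Assumed mapping from the generic label names to the IMDB classes.
LABEL_NAMES = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}

def rename_labels(predictions, mapping=LABEL_NAMES):
    """Return a copy of the predictions with human-readable label names."""
    return [
        {**pred, "label": mapping.get(pred["label"], pred["label"])}
        for pred in predictions
    ]

# The pipeline output shown above.
raw_predictions = [
    {"label": "LABEL_1", "score": 0.7537323236465454},
    {"label": "LABEL_0", "score": 0.9140784740447998},
]

readable = rename_labels(raw_predictions)
print(readable)
```

An alternative is to store the names in the model itself by passing `id2label` and `label2id` dictionaries when loading the model before training, so that the pipeline returns readable labels directly.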

📌 If you enjoy this notebook, upvote and follow me on [YouTube](https://www.youtube.com/channel/UCFU9Go20p01kC64w-tmFORw) | [Twitter](https://twitter.com/TirendazAcademy) | [Instagram](https://www.instagram.com/tirendazacademy) | [Tiktok](https://www.tiktok.com/@tirendazacademy) | [Medium](https://tirendazacademy.medium.com) | [Reddit](https://www.reddit.com/user/TirendazAcademy) 