# Sentiment Analysis

Sentiment analysis identifies the opinions, emotions, and attitudes expressed in text, which makes it useful for applications such as product reviews, social media monitoring, and customer feedback analysis.

You will learn how to:

* **Use a pretrained sentiment analysis model**
* **Fine-tune a text classification model**

## Use a Pretrained Model

In [1]:
!pip install datasets


### Import the necessary libraries

In [2]:
from transformers import pipeline

### Create a sentiment analysis pipeline

We initialize a sentiment analysis pipeline using the pretrained model `distilbert-base-uncased-finetuned-sst-2-english`, a DistilBERT model fine-tuned on the SST-2 sentiment dataset.

In [None]:
sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


### Provide input text

You can provide any text you want to analyze for sentiment.

In [None]:
text = "I love the way transformers are making NLP tasks so much easier!"

### Analyze the sentiment

Pass the text into the sentiment analyzer pipeline to determine the sentiment.


In [None]:
result = sentiment_analyzer(text)
print(result)

[{'label': 'POSITIVE', 'score': 0.9994403719902039}]


### Explanation of Output

*   label: The predicted sentiment class. This model returns either POSITIVE or NEGATIVE.
*   score: The probability the model assigns to the predicted label, between 0 and 1. Values close to 1 indicate high confidence.
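To make the two fields concrete, here is a minimal sketch of how a label and score can be derived from a classifier's raw outputs. It assumes the pipeline applies a softmax over the two class logits and reports the most probable class. The logit values and the `to_prediction` helper are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits, id2label):
    """Pick the most probable class and report its label and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one sentence: index 0 = NEGATIVE, index 1 = POSITIVE
example_logits = [-3.5, 4.0]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

prediction = to_prediction(example_logits, id2label)
print(prediction)
```

Because the score is the probability of the winning class only, a binary classifier's score never drops below 0.5.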

## Fine-Tuning

There are significant benefits to using a pretrained model: it reduces computation costs and your carbon footprint, and it lets you use state-of-the-art models without having to train one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning, an incredibly powerful training technique. 🤗 Transformers supports fine-tuning with the deep learning framework of your choice:

* Fine-tune a pretrained model with 🤗 Transformers Trainer.
* Fine-tune a pretrained model in TensorFlow with Keras.

This section follows the second route, TensorFlow with Keras.

### Install and Import Libraries

**Train a TensorFlow model with Keras**

You can also train 🤗 Transformers models in TensorFlow with the Keras API!

**Loading data for Keras**

When you want to train a 🤗 Transformers model with the Keras API, you need to convert your dataset to a format that Keras understands. If your dataset is small, you can just convert the whole thing to NumPy arrays and pass it to Keras. Let’s try that first before we do anything more complicated.

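First, load a dataset. We’ll use the CoLA dataset from the GLUE benchmark, since it’s a simple binary text classification task, and just take the training split for now. Note that CoLA labels each sentence as grammatically acceptable (1) or unacceptable (0), so it is not a sentiment dataset; the same workflow applies to a binary sentiment dataset such as GLUE's SST-2.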

### Load the Data

In [None]:
from datasets import load_dataset

dataset = load_dataset("glue", "cola")
dataset = dataset["train"]  # Just take the training split for now

### Take a Look at the Data

In [None]:
dataset

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 8551
})

In [None]:
print(f'First sentence: {dataset["sentence"][0]}')
print(f'First label: {dataset["label"][0]}')
print(f'First idx: {dataset["idx"][0]}')

First sentence: Our friends won't buy this analysis, let alone the next one we propose.
First label: 1
First idx: 0


### Load a Tokenizer

Next, load a tokenizer and tokenize the data as NumPy arrays. Note that the labels are already a list of 0s and 1s, so we can convert them directly to a NumPy array without tokenization!

In [None]:
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
tokenized_data = tokenizer(dataset["sentence"], return_tensors="np", padding=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_data = dict(tokenized_data)

labels = np.array(dataset["label"])  # Label is already an array of 0 and 1



Finally, load, [compile](https://keras.io/api/models/model_training_apis/#compile-method), and fit the model. Note that Transformers models all have a default task-relevant loss function, so you don’t need to specify one unless you want to:

### Load the Model and Fine-Tune It

In [None]:
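from transformers import TFAutoModelForSequenceClassification, pipeline

# Load the pretrained encoder with a new, randomly initialized classification head
model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased")
# Lower learning rates are often better for fine-tuning transformers, but the
# string 'Adam' uses the Keras default learning rate; pass an optimizer instance
# with a smaller rate to change it
model.compile(optimizer='Adam')  # No loss argument!

# Keras defaults apply here: a single epoch with a batch size of 32
model.fit(tokenized_data, labels)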

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




<tf_keras.src.callbacks.History at 0x7a4889ab1390>

**Note:**
You don’t have to pass a loss argument to your models when you compile() them! Hugging Face models automatically choose a loss that is appropriate for their task and model architecture if this argument is left blank. You can always override this by specifying a loss yourself if you want to!

This approach works great for smaller datasets, but for larger datasets, you might find it starts to become a problem. Why? Because the tokenized array and labels would have to be fully loaded into memory, and because NumPy doesn’t handle “jagged” arrays, so every tokenized sample would have to be padded to the length of the longest sample in the whole dataset. That’s going to make your array even bigger, and all those padding tokens will slow down training too!
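The cost of padding everything to one global length can be illustrated with a small sketch. It compares padding a whole dataset at once against padding each batch only to its own longest sample, which is one common alternative. The token IDs, the `pad_batch` and `count_padding` helpers, and the batch size are hypothetical and stand in for real tokenizer output.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in `sequences`.

    Returns the padded token IDs and an attention mask (1 = real token, 0 = padding).
    """
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return input_ids, attention_mask


def count_padding(attention_mask):
    """Count how many positions in the mask are padding."""
    return sum(row.count(0) for row in attention_mask)


# Hypothetical token-ID sequences of different lengths (3, 7, 4, and 2 tokens)
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 10, 11, 102],
    [101, 5, 6, 102],
    [101, 102],
]

# Strategy 1: pad the whole dataset to its single longest sample
_, global_mask = pad_batch(sequences)
global_padding = count_padding(global_mask)

# Strategy 2: pad each batch of two samples only to that batch's longest sample
batch_size = 2
batched_padding = 0
for start in range(0, len(sequences), batch_size):
    _, batch_mask = pad_batch(sequences[start:start + batch_size])
    batched_padding += count_padding(batch_mask)

print(f"Global padding tokens: {global_padding}")
print(f"Per-batch padding tokens: {batched_padding}")
```

In this toy case, per-batch padding halves the number of wasted positions. On a real dataset with a few very long samples, the difference is usually much larger.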

### Save the Model

In [None]:
model.save_pretrained("./cola_finetuned_model")
tokenizer.save_pretrained("./cola_finetuned_model")

('./cola_finetuned_model/tokenizer_config.json',
 './cola_finetuned_model/special_tokens_map.json',
 './cola_finetuned_model/vocab.txt',
 './cola_finetuned_model/added_tokens.json',
 './cola_finetuned_model/tokenizer.json')

### Load the Model

In [None]:
# Load from the directory saved above (in Colab this resolves to /content/cola_finetuned_model)
classifier = pipeline("text-classification", model="./cola_finetuned_model", tokenizer="./cola_finetuned_model")

Some layers from the model checkpoint at /content/cola_finetuned_model were not used when initializing TFBertForSequenceClassification: ['dropout_113']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at /content/cola_finetuned_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
Hardware accelerator e.g. GPU is available in the 

### Inference

In [None]:
sentence = "This is an acceptable sentence."
result = classifier(sentence)
print(f"Prediction: {result}")

Prediction: [{'label': 'LABEL_1', 'score': 0.7155210375785828}]
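The generic name `LABEL_1` appears because the fine-tuned model's configuration does not carry human-readable class names. In CoLA, label 1 means the sentence is grammatically acceptable and label 0 means it is unacceptable. The following sketch renames the labels after prediction. The `COLA_LABELS` mapping and the `rename_labels` helper are hypothetical names introduced for illustration.

```python
# Mapping from the pipeline's generic label names to CoLA's class meanings
COLA_LABELS = {"LABEL_0": "unacceptable", "LABEL_1": "acceptable"}


def rename_labels(predictions, names=COLA_LABELS):
    """Return copies of the predictions with readable label names.

    Labels that are not in `names` are left unchanged.
    """
    return [{**pred, "label": names.get(pred["label"], pred["label"])} for pred in predictions]


# The pipeline output shown above
raw = [{"label": "LABEL_1", "score": 0.7155210375785828}]

readable = rename_labels(raw)
print(f"Prediction: {readable}")
```

An alternative, assuming you control the training step, is to set `id2label` and `label2id` on the model configuration before saving, so that the pipeline reports the readable names directly.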
