# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>Emotion Analysis with Hugging Face</div></b>
![](https://img.freepik.com/free-photo/medium-shot-collage-cute-kids_23-2150169774.jpg?w=1380&t=st=1696406338~exp=1696406938~hmac=9d58dea338b8f0b58b41ebd8d7f1c17cbbf408a5451974b91043a78bc6e9cb7f)


Hi guys 😀 In this notebook, we're going to cover how to perform emotion analysis with Hugging Face. Let's get started.

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>1. Dataset Loading </div></b>

Let's install the `datasets` library and load the dataset we're going to use in this analysis.

In [None]:
!pip install -q datasets

In [None]:
from datasets import load_dataset
emotions = load_dataset("dair-ai/emotion")

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>2. Understanding the Dataset </div></b>

Before training the model, let's understand our dataset.

In [None]:
emotions

In [None]:
train_ds = emotions["train"]
train_ds

In [None]:
len(train_ds)

In [None]:
train_ds[1]

In [None]:
train_ds.column_names

In [None]:
train_ds.features

In [None]:
train_ds[:5]

In [None]:
train_ds["text"][:5]

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>3. From Dataset to Pandas </div></b>

Let's explore the dataset further by converting the `Dataset` object to a pandas DataFrame.

In [None]:
import pandas as pd
emotions.set_format(type="pandas")
df=emotions["train"][:]
df.head()

In [None]:
def label_int2str(row):
    return emotions["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()
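The `int2str` call above is essentially a lookup from an integer class id into the list of class names. Here is a minimal sketch of that idea in plain Python. The `label_names` list and the two helper functions are illustrative stand-ins (not the `datasets` API), and the order is assumed to match what `train_ds.features` reports for this dataset.

```python
# Illustrative stand-in for ClassLabel.int2str: a plain list lookup.
# Assumption: this order matches train_ds.features["label"].names.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

def int2str(label_id: int) -> str:
    """Map an integer class id to its human-readable name."""
    return label_names[label_id]

def str2int(name: str) -> int:
    """Inverse mapping: class name back to its integer id."""
    return label_names.index(name)

print(int2str(0), int2str(3))
print(str2int("joy"))
```

Keeping the integer ids in the dataset and converting to names only for display is what lets the same `label` column be fed directly to the model later.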

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>4. Data Visualization </div></b>

Let's draw a few plots to understand our dataset.

In [None]:
import matplotlib.pyplot as plt

df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()

In [None]:
df["Words Per Tweet"] = df["text"].str.split().apply(len)
df.boxplot("Words Per Tweet", by="label_name", grid = False, showfliers = False,
          color = "black")
plt.suptitle("")
plt.xlabel("")
plt.show()
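The box plot above summarizes, per class, how many whitespace-separated words each tweet has. Here is a rough sketch of the same computation without pandas, so the grouping step is easy to follow. The four rows are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import median

# Made-up (text, label_name) pairs standing in for rows of df.
rows = [
    ("i feel so happy today", "joy"),
    ("i am feeling great", "joy"),
    ("i feel really alone", "sadness"),
    ("i feel sad and tired of everything", "sadness"),
]

# Group the word counts by label, like df.boxplot(..., by="label_name").
lengths = defaultdict(list)
for text, label in rows:
    lengths[label].append(len(text.split()))  # whitespace word count

# The median is the line drawn inside each box of the plot.
medians = {label: median(counts) for label, counts in lengths.items()}
print(medians)
```

Checking text length matters because transformer models have a maximum input size, so very long texts would be truncated.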

In [None]:
# We no longer need the pandas format, so switch back to the default output
emotions.reset_format()

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>5. Data Preprocessing </div></b>

In this section, we're going to perform data preprocessing. First, let's take a look at three tokenization techniques: character tokenization, word tokenization, and subword tokenization. Let's dive in!

## Character Tokenization

In [None]:
text = "It is fun to work with NLP using HuggingFace."

tokenized_text = list(text)
print(tokenized_text)

In [None]:
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)

In [None]:
input_ids=[token2idx[token] for token in tokenized_text]
print(input_ids)
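A useful property of the character-level mapping above is that it is fully reversible. Here is a small sketch that rebuilds the same mapping and decodes the ids back to the original string. The `idx2token` dictionary is a name introduced here for illustration.

```python
text = "It is fun to work with NLP using HuggingFace."

# Same steps as above: split into characters and number the unique ones.
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[token] for token in tokenized_text]

# Invert the mapping (id -> character) and decode the ids again.
idx2token = {idx: ch for ch, idx in token2idx.items()}
decoded = "".join(idx2token[i] for i in input_ids)

print(decoded)
print(f"Vocabulary size: {len(token2idx)}, sequence length: {len(input_ids)}")
```

The vocabulary stays tiny, but the sequence is as long as the text itself, which is one reason character tokenization is rarely used in practice.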

## How does one-hot encoding work?

In [None]:
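# A small toy DataFrame to illustrate one-hot encoding
# (named categorical_df so it doesn't overwrite the emotions DataFrame)
categorical_df = pd.DataFrame({"name":["can", "efe","ada"],
                  "label":[0,1,2]})
categorical_df

In [None]:
# Each distinct value in the "name" column becomes its own 0/1 column
pd.get_dummies(categorical_df, dtype=int)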

## One-hot encoding with Torch

In [None]:
import torch
import torch.nn.functional as F

# Convert the list of token ids into a tensor
input_ids = torch.tensor(input_ids)

one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
one_hot_encodings.shape

In [None]:
print(f"Token: {tokenized_text[0]}")
print(f"Tensor index: {input_ids[0]}")
print(f"One-hot: {one_hot_encodings[0]}")
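To see what `F.one_hot` computes, here is a minimal sketch of the same operation with plain Python lists. The `one_hot` function below is a hypothetical helper written for illustration, not part of PyTorch.

```python
def one_hot(ids, num_classes):
    """Return one row per id: all zeros except a 1 at position id."""
    rows = []
    for i in ids:
        row = [0] * num_classes
        row[i] = 1
        rows.append(row)
    return rows

# Three toy token ids with a vocabulary of four entries.
encodings = one_hot([2, 0, 1], num_classes=4)
for row in encodings:
    print(row)
```

Each row has exactly one 1, so the result has shape (number of tokens, vocabulary size), just like the `one_hot_encodings.shape` output above.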

## Word Tokenization 

In [None]:
tokenized_text = text.split()
print(tokenized_text)
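Word tokenization has two practical problems: punctuation stays glued to words when we only split on whitespace, and any word outside the vocabulary cannot be encoded. Here is a small sketch of both issues. The `[UNK]` token and the `encode` helper are assumptions made for this illustration.

```python
text = "It is fun to work with NLP using HuggingFace."
words = text.split()

# Build a word-level vocabulary, reserving id 0 for unknown words.
vocab = {"[UNK]": 0}
for word in sorted(set(words)):
    vocab[word] = len(vocab)

def encode(sentence):
    """Map each whitespace-separated word to its id, or to [UNK]."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

ids_seen = encode("It is fun")    # every word is in the vocabulary
ids_new = encode("It is great")   # "great" was never seen
print(ids_seen)
print(ids_new)
print("HuggingFace." in vocab, "HuggingFace" in vocab)
```

Subword tokenization, covered next, is a middle ground: it keeps frequent words whole and splits rare words into smaller known pieces.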

## Subword Tokenization

In [None]:
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

## Loading a Specific Tokenizer Class

In [None]:
# Instead of AutoTokenizer, we can also load the model-specific class directly
from transformers import DistilBertTokenizer

distbert_tokenize=DistilBertTokenizer.from_pretrained(model_ckpt)

## How to work with the Tokenizer

In [None]:
encoded_text = tokenizer(text)
print(encoded_text)

In [None]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

In [None]:
tokenizer.convert_tokens_to_string(tokens)
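The DistilBERT tokenizer uses WordPiece, where pieces that continue a word are marked with a `##` prefix. To build some intuition, here is a toy sketch of greedy longest-match splitting. The `wordpiece` function and the tiny `toy_vocab` are made up for illustration, and the real tokenizer also handles normalization, punctuation, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real DistilBERT vocabulary.
toy_vocab = {"hugging", "##face", "token", "##izer", "##s", "fun"}

print(wordpiece("huggingface", toy_vocab))
print(wordpiece("tokenizers", toy_vocab))
print(wordpiece("nlp", toy_vocab))
```

With a real vocabulary of about 30,000 pieces, almost any word can be split into known pieces, so unknown tokens become rare.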

## Attributes of Tokenizer

In [None]:
tokenizer.vocab_size

In [None]:
tokenizer.model_max_length

## Tokenizing the entire dataset

In [None]:
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

In [None]:
# Let's see how the tokenizer works on the first two training examples:
print(tokenize(emotions["train"][:2]))

In [None]:
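# batched=True passes many examples at once; batch_size=None
# processes each split as a single batch
emotions_encoded = emotions.map(tokenize, batched=True,
                               batch_size=None)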

## Padding

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer = tokenizer)
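Since we tokenized without padding, the collator pads each batch only up to its longest sequence. Here is a rough sketch of that idea in plain Python. The `pad_batch` helper and the token ids are made up for illustration, and a pad id of 0 is an assumption about the tokenizer.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token id sequences of different lengths.
batch = pad_batch([[101, 1045, 102], [101, 1045, 2514, 3407, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch instead of to a fixed global length keeps batches of short tweets small, which saves computation during training.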

In [None]:
# Let's take a look at the columns of the dataset
emotions_encoded["train"].column_names

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>6. Model Training </div></b>

In this section, we're going to load the pretrained DistilBERT model, define the evaluation metric and training arguments, and then train our model.

In [None]:
from transformers import AutoModelForSequenceClassification

num_labels = 6  # the dataset has six emotion classes
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model=AutoModelForSequenceClassification.from_pretrained(model_ckpt,
         num_labels = num_labels).to(device)

## Evaluate

In [None]:
!pip install -q evaluate

In [None]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

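def compute_metrics(eval_pred):
    # eval_pred holds the model's logits and the true labels
    predictions, labels = eval_pred
    # Pick the class with the highest logit for each example
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions,
                           references = labels)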
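To make the metric concrete, here is a small sketch of what `compute_metrics` does, written without NumPy or `evaluate`. The helper names and the toy logits are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def toy_accuracy(logits, true_labels):
    """Fraction of examples whose highest-scoring class is the true label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, true_labels))
    return correct / len(true_labels)

# Four made-up examples with three classes each.
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 1.5],
    [0.9, 1.1, 0.0],
    [3.0, -2.0, 0.5],
]
toy_labels = [0, 2, 0, 0]

print(toy_accuracy(toy_logits, toy_labels))
```

Note that the logits don't need to be turned into probabilities first, because the largest logit always corresponds to the largest probability.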

## Logging in to Hugging Face

In [None]:
from huggingface_hub import notebook_login
notebook_login()

## Setting Training Arguments

In [None]:
from transformers import TrainingArguments

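training_args = TrainingArguments(
    output_dir="distilbert-emotion",
    num_train_epochs=2,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    # Evaluate and save once per epoch; recent transformers releases
    # rename this argument to eval_strategy
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,    # requires the Hub login above
    report_to="none"     # disable external experiment loggers
)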
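With these arguments, we can estimate how many optimization steps the training run will take. Here is a quick back-of-the-envelope sketch, assuming 16,000 training examples (check this against `len(train_ds)`), a single device, and no gradient accumulation.

```python
import math

train_size = 16_000   # assumption: the value of len(train_ds)
batch_size = 64       # per_device_train_batch_size, on one device
epochs = 2            # num_train_epochs

# The last, possibly smaller, batch still counts as a step.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")
```

With more than one GPU, the effective batch size grows and the number of steps per epoch shrinks accordingly.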

In [None]:
from transformers import Trainer

trainer = Trainer(
    model = model,
    args = training_args,
    compute_metrics = compute_metrics,
    train_dataset = emotions_encoded["train"],
    eval_dataset = emotions_encoded["validation"],
    data_collator = data_collator,  # pad each batch dynamically
    tokenizer = tokenizer,
)

In [None]:
trainer.train()

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>7. Model Evaluation </div></b>

In this section, we're going to look at the performance of our model.

## Predicting the validation dataset

In [None]:
preds_output = trainer.predict(emotions_encoded["validation"])

In [None]:
preds_output.metrics

## Confusion Matrix

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_preds = np.argmax(preds_output.predictions, axis=1)

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize = "true")
    fig, ax = plt.subplots(figsize=(6,6))
    disp = ConfusionMatrixDisplay(confusion_matrix = cm, 
                                 display_labels=labels)
    disp.plot(cmap="Blues", values_format = ".2f", ax = ax,
             colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()

In [None]:
y_valid = np.array(emotions_encoded["validation"]["label"])
labels = emotions["train"].features["label"].names

In [None]:
plot_confusion_matrix(y_preds, y_valid, labels)
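The plot uses `normalize = "true"`, which divides each row by the number of true examples in that class, so every row sums to 1. Here is a minimal sketch of that normalization in plain Python. The helper and the toy labels are made up for illustration.

```python
def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: rows are true classes."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # Guard against classes that never appear in y_true.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Eight made-up examples over three classes.
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 2, 0]

for row in normalized_confusion(toy_true, toy_pred, n_classes=3):
    print([f"{value:.2f}" for value in row])
```

The diagonal then shows the per-class recall, which makes rare classes as easy to read as frequent ones.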

In [None]:
trainer.push_to_hub(commit_message="Training completed!")

# <b><div style='padding:15px;background-color:#850E35;color:white;border-radius:2px;font-size:110%;text-align: center'>8. Model Prediction </div></b>

In this section, we're going to take a new piece of text and predict its label with our model.

In [None]:
from transformers import pipeline

model_id = "Tirendaz/distilbert-emotion"
classifier = pipeline("text-classification", model= model_id)

In [None]:
custom_text="I watched a movie yesterday. It was really good."

In [None]:
# return_all_scores=True returns a score for every class, not just the top one
preds=classifier(custom_text, return_all_scores = True)
preds_df = pd.DataFrame(preds[0])
plt.bar(labels, 100*preds_df["score"])
plt.title(f'"{custom_text}"')
plt.ylabel("Class probability (%)")
plt.show()
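The scores in the bar chart are class probabilities. For a single-label classifier like ours, they are typically obtained by applying a softmax to the model's logits. Here is a small sketch of that step, with made-up logits for the six classes.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits, one per emotion class (not real model output).
example_logits = [-1.2, 4.1, 0.3, -0.8, -1.5, -0.9]
probs = softmax(example_logits)

for class_id, prob in enumerate(probs):
    print(f"class {class_id}: {100 * prob:.1f}%")
```

The class with the largest logit also gets the largest probability, so softmax changes the scale of the scores but not the predicted label.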

**Thanks for reading! If you enjoyed this notebook, don't forget to upvote** 👍

*Let's connect* [YouTube](http://youtube.com/tirendazacademy) | [Medium](http://tirendazacademy.medium.com) | [X](http://x.com/tirendazacademy) | [Linkedin](https://www.linkedin.com/in/tirendaz-academy) 😎

## Resource

- [NLP with Transformers](https://github.com/nlp-with-transformers/notebooks/blob/main/02_classification.ipynb)